Robust Node Classification on Graphs: Jointly from Bayesian Label Transition and Topology-based Label Propagation

Jun Zhuang Indiana University-Purdue University IndianapolisIndianapolisINUSA [email protected] and Mohammad Al Hasan Indiana University-Purdue University IndianapolisIndianapolisINUSA [email protected]

(2022)

Abstract.

Node classification using Graph Neural Networks (GNNs) has been widely applied in various real-world scenarios. However, in recent years, compelling evidence emerges that the performance of GNN-based node classification may deteriorate substantially by topological perturbation, such as random connections or adversarial attacks. Various solutions, such as topological denoising methods and mechanism design methods, have been proposed to develop robust GNN-based node classifiers but none of these works can fully address the problems related to topological perturbations. Recently, the Bayesian label transition model is proposed to tackle this issue but its slow convergence may lead to inferior performance. In this work, we propose a new label inference model, namely LInDT, which integrates both Bayesian label transition and topology-based label propagation for improving the robustness of GNNs against topological perturbations. LInDT is superior to existing label transition methods as it improves the label prediction of uncertain nodes by utilizing neighborhood-based label propagation leading to better convergence of label inference. Besides, LIndT adopts asymmetric Dirichlet distribution as a prior, which also helps it to improve label inference. Extensive experiments on five graph datasets demonstrate the superiority of LInDT for GNN-based node classification under three scenarios of topological perturbations.

Graph neural networks; Adversarial defense; Bayesian inference

^†^†journalyear: 2022^†^†copyright: acmcopyright^†^†conference: Proceedings of the 31st ACM International Conference on Information and Knowledge Management; October 17–21, 2022; Atlanta, GA, USA^†^†booktitle: Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM ’22), October 17–21, 2022, Atlanta, GA, USA^†^†price: 15.00^†^†doi: 10.1145/3511808.3557437^†^†isbn: 978-1-4503-9236-5/22/10^†^†ccs: Mathematics of computing Bayesian computation

1. Introduction

Among the various machine learning tasks on the graph data, node classification is probably the most popular with numerous real-world applications, such as, user profiling in online social networks (OSNs) (Chen et al., 2019; Zhuang and Al Hasan, 2022b; Zhuang and Hasan, 2022), community detection (Chen et al., 2017; Breuer et al., 2020), expert finding in community-based question answering (Zhao et al., 2016; Fang et al., 2016), recommendation systems (Ying et al., 2018; Fan et al., 2019; Gao et al., 2022), and epidemiology study (La Gatta et al., 2020; Hsieh et al., 2020). While many traditional approaches have been used for solving a node classification task (Mountain and Dresler, 1997; Bhagat et al., 2011; Tang et al., 2016; Li and Pi, 2019), in this deep learning era various graph neural network models (GNNs) (Bruna et al., 2013; Defferrard et al., 2016; Kipf and Welling, 2016; Hamilton et al., 2017; Du et al., 2017) have become popular for solving this task. The enormous appeal of GNN models for node classification is due to their ability to learn effective node-level representation by performing locality-based non-linear aggregation jointly over node attributes and adjacency information leading to superior prediction performance.

However, in recent years, compelling evidence emerges that the performance of GNN-based node classification may deteriorate substantially by changes in a graph structure (Sun et al., 2018; Jin et al., 2020a), which raises concern regarding the application of GNNs in various real-life scenarios. For instance, in an online social network, non-malicious selfish users may randomly connect with many other nodes for promoting their business activities (Zhuang and Al Hasan, 2022b; Hahn-Klimroth et al., 2020); such random connections exacerbate over-smoothing (Liu et al., 2020a; Hasanzadeh et al., 2020), a well-known cause of the poor performance of GNNs. On the other extreme, the message-passing mechanism of GNNs performs poorly on nodes with sparse connection (Tam and Dunson, 2020; Ye and Ji, 2021), such as the nodes representing new users of OSNs, which lack sufficient links (edges) or profile information (feature). Third, GNNs may be vulnerable to adversarial attacks (Zügner et al., 2018; Dai et al., 2018; Zügner and Günnemann, 2019; Wang et al., 2018; Dai et al., 2022). Such attacks may significantly deteriorate the performance of node classifications with slight and unnoticeable modification of graph structures and node features. In this paper, we refer to all such scenarios as topological perturbations.

Many recent works have been proposed to develop robust GNN-based node classification models against topological perturbations. Most of these works can be divided into two categories: the topological denoising methods, which design methodologies for denoising perturbed graphs in a pre-processing stage (Wu et al., 2019; Entezari et al., 2020) or in the training stage (Xu et al., 2020b; Luo et al., 2021; Rong et al., 2019; Tang et al., 2020; Zheng et al., 2020), and the mechanism design methods, which make GCN’s prediction robust by message-passing-based aggregation over the nodes (Seddik et al., 2021; Chen et al., 2021; Jin et al., 2021; Zhang and Zitnik, 2020; Feng et al., 2020) or through regularization (Chang et al., 2021; Liu et al., 2021; Regol et al., 2022; Xu et al., 2021; Jin et al., 2020b; Hasanzadeh et al., 2020; Xu et al., 2020a; Zhu et al., 2019). Nevertheless, none of these works can fully address the topological perturbations. For one thing, topological denoising methods may fail to prune suspicious edges when the graph structure is sparse. For another, mechanism designing methods sometimes highly depend on heuristics explorations of topological connections among nodes. The mechanism may perform worse when the graph structure is under severe perturbations.

In some recent works (Zhuang and Al Hasan, 2022b, a), the authors tackle the topological perturbation issues by developing a Bayesian label transition mechanism, which fixes the poor performance of GNNs by post-processing the prediction; specifically, their approaches learn a label transition matrix through Bayesian inferences, which fixes the erroneous prediction of GNNs by replacing the GNN inferred labels to a different label. A key benefit of post-processing is that it is effective for dynamic network scenarios, in which the network structure and node features of test data may be different than the train data. However, Bayesian label transition (Zhuang and Al Hasan, 2022a) suffers slow convergence as the posterior distribution from which the Gibbs sampling samples the node labels adapt incrementally over a sequence of label-propagation followed by GNN re-training iterations. Slow convergence also leads to inferior performance because the labels on which the GNN is re-trained may not be accurate as the Bayesian label transition may have not yet been converged.

Although label transition is a relatively new idea in the node classification task, label propagation (LP) has been widely used in the semi-supervised learning paradigm (Zhu and Ghahramani, 2002; Wang and Zhang, 2007; Speriosu et al., 2011; Wang et al., 2013; Karasuyama and Mamitsuka, 2013; Ugander and Backstrom, 2013; Gong et al., 2016; Iscen et al., 2019; Huang et al., 2020; Wang and Leskovec, 2020). Recently, Huang et al. (Huang et al., 2020) demonstrated that graph semi-supervised learning using LP can exceed or match the performance of some variants of GNNs. Zhu and Ghahramani (Zhu and Ghahramani, 2002) have shown that label propagation is similar to mean-field approximation, which has fast convergence when the number of instances is large. The main assumption of LP is that closer data points tend to have similar class labels—this assumption is analogous to the graph Homophily assumption, which denotes that adjacent vertices have similar labels. Specifically, on noisy and perturbed graphs, the label prediction on a given (or a targeted) node can be affected severely, however, the prediction of other neighboring nodes may still be sufficiently good, and by the virtue of homophily, those predictions can help recovering the true label of the affected node. Inspired by these observations, in this work, we improve the robustness of GNN-based node classification through an integrated framework combining Bayesian label transition and graph topology-based label propagation. We name our proposed model LInDT ¹¹1LInDT is built by taking the bold letters of “Label Inference using Dirichlet and Topological sampling”, and its pronunciation is similar to the popular chocolate brand name. Source code is available at https://github.com/junzhuang-code/LInDT.

The main advantage of LInDT over a label-transition-based method, such as GraphSS (Zhuang and Al Hasan, 2022a), is that the former effectively utilizes LP when the sampling distribution of Bayesian label transition is uncertain. This immediately improves the label convergence and subsequently improves the label prediction performance. Besides, LInDT adopts a better Bayesian prior than GraphSS; specifically, LInDT follows an informative asymmetric Dirichlet distribution, but GraphSS follows a symmetric Dirichlet distribution, which allows LInDT to take advantage of the most up-to-date label distribution, leading to better performance. We evaluate LInDT’s performance under random perturbations, information sparsity, and adversarial attacks scenarios, across five public graph datasets. The experimental results verify the following facts: First, the GNN-based node classifier benefits strongly from LInDT’s proposed label inference as post-processing; Second, LInDT converges faster than a pure Bayesian label transition which only uses Gibbs based posterior sampling; Third, LInDT outperforms eight popular competing methods. We also perform an ablation study to verify the effectiveness of LInDT’s design choices. For example, our experiment validates that asymmetric Dirichlet distributions of LInDT help better label inference using Bayesian label transition.

Overall, we summarize our contributions in this paper as follows:

•

We propose a new label inference mechanism that can infer node labels jointly from a Bayesian and label propagation framework. Our model can converge faster than the Bayesian label transition using Gibbs sampling.
•

We release the constraint of symmetric Dirichlet distributions on the Bayesian label transition and verify that dynamically updating the $\alpha$ vector (Dirichlet prior) can help better label inference.
•

Extensive experiments demonstrate that our model achieves robust node classification on three topological perturbation scenarios and outperforms eight popular defending models across five public graph datasets.

2. Methodology

In this section, we first briefly introduce the node classification using GNNs. We then describe the basic idea of Bayesian label transition. Besides, we discuss asymmetric Dirichlet distributions and introduce our proposed model. Lastly, we present the pseudo-code and time complexity of our model.

2.1. GNN-based Node Classification

An undirected attributed graph is denoted as $\mathcal{G}$ = $(\mathcal{V}$ , $\mathcal{E})$ , where $\mathcal{V}$ = $\{v_{1},v_{2},...,v_{N}\}$ is the set of vertices, $N$ is the number of vertices in $\mathcal{G}$ , and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of edges between vertices. We denote $\mathbf{A}\in\mathbb{R}^{N\times N}$ as symmetric adjacency matrix and $\mathbf{X}\in\mathbb{R}^{N\times d}$ as the feature matrix, where $d$ is the number of features for each vertex. We choose the most representative variant of GNNs, GCN (Kipf and Welling, 2016), as the node classifier $\textit{f}_{\theta}$ in this work. The layer-wise propagation of GCN is formulated as follows:

(1)

\mathbf{H}^{(l+1)}=\sigma\left(\mathbf{\tilde{D}}^{-\frac{1}{2}}\mathbf{\tilde{A}}\mathbf{\tilde{D}}^{-\frac{1}{2}}\mathbf{H}^{(l)}\mathbf{W}^{(l)}\right)

In Eq. 1, $\mathbf{\tilde{A}}=\mathbf{A}+I_{N}$ , $\mathbf{\tilde{D}}=\mathbf{D}+I_{N}$ , where $I_{N}$ is the identity matrix and $\mathbf{D}_{i,i}=\sum_{j}\mathbf{A}_{i,j}$ is the diagonal degree matrix. $\mathbf{H}^{(l)}\in\mathbb{R}^{N\times d}$ is the nodes hidden representation in the $l$ -th layer, where $\mathbf{H}^{(0)}=\mathbf{X}$ . $\mathbf{W}^{(l)}$ is the weight matrix in the $l$ -th layer. $\sigma(\cdot)$ denotes a non-linear activation function, such as ReLU.

In general node classification tasks, a GNN-based node classifier $\textit{f}_{\theta}$ takes both $\mathbf{A}$ and $\mathbf{X}$ as inputs and is trained with ground-truth labels in train data. To obtain robust classification, some studies attempt to train the model using noisy labels as a regularization scheme (Tanno et al., 2019; Li et al., 2020). In this study, we borrow this idea and train the GNN node classifier, $\textit{f}_{\theta}$ , after assigning a uniform random label to a percentage (in this work we use 10%) of the train nodes. We denote these noisy labels as $\mathcal{Y}\in\mathbb{R}^{N\times 1}$ . The higher the percentage of noise ratio, the stronger the regularization. On the other hand, we denote the ground truth labels (a.k.a. latent labels) as $\mathcal{Z}\in\mathbb{R}^{N\times 1}$ , which are unobserved (to the GNN), yet we want the GNN to be able to predict the unobserved label. So, the model’s performance is optimized to predict $\mathcal{Z}$ by using a validation set. It may seem counter-intuitive to train a model with noisy labels and still expect it to perform well on the ground-truth label; however, this is feasible because both $\mathcal{Z}$ and $\mathcal{Y}$ take values from the same closed category set. Also, there is a strong correlation between $\mathcal{Y}$ and $\mathcal{Z}$ as only a fraction of the node’s labels have been altered. But, as GNN is trained on the noisy labels, it becomes noise-resilient and robust, and performs well when the network is perturbed.

2.2. Bayesian Label Transition

Refer to caption — Figure 1. The diagram of Bayesian label transition. $\mathcal{V}$ , $\mathcal{Z}$ , and $\mathcal{Y}$ denote $N$ copies of nodes, latent labels, and noisy labels, respectively. $\phi$ denotes the $K$ -class Dirichlet-based conditional label transition matrix parameterized by $\alpha$ .

A GNN suffers performance loss when applied in the test data, which undergoes various topological perturbations. To overcome this, recent works have applied Bayesian label transition to fix erroneous prediction of GNN by learning a label transition matrix on the fly (from the test data). We explain Bayesian label transition by utilizing the plate diagram in Fig. 1. The unobserved latent labels ( $\mathcal{Z}$ ) depend on the nodes, whereas the observed noisy labels ( $\mathcal{Y}$ ) depend on both $\mathcal{Z}$ and the conditional label transition matrix, $\phi$ , which is parameterized by $\alpha$ . Both $\mathcal{Z}$ and $\mathcal{Y}$ are sampled from $Categorical$ distributions of $\mathcal{V}$ , the predictions from the trained node classifier. $\phi$ = $[\phi_{1},\phi_{2},…,\phi_{K}]^{T}$ $\in\mathbb{R}^{K\times K}$ consists of K transition vectors. The $k$ -th transition vector $\phi_{k}$ is sampled from corresponding Dirichlet distribution $Dirichlet(\alpha_{k})$ , where $\alpha_{k}$ is the $k$ -th element of the $\alpha$ vector. Based on this dependency, the posterior of $\mathcal{Z}$ , which is conditioned on the nodes $\mathcal{V}$ , the noisy labels $\mathcal{Y}$ , and the Dirichlet parameter $\alpha$ , can be formulated as follows:

(2)

\textit{P}\left(\mathcal{Z}\mid\mathcal{V},\mathcal{Y};\alpha\right)=\textit{P}\left(\mathcal{Z}\mid\mathcal{V},\mathcal{Y},\phi\right)\textit{P}\left(\phi;\alpha\right)\\

However, sampling from such a posterior distribution is difficult. GraphSS (Zhuang and Al Hasan, 2022a) applies Gibbs sampling to approximate this posterior, so that inferred labels $\mathcal{\bar{Z}}\in\mathbb{R}^{N\times 1}$ matches to the latent labels ( $\mathcal{Z})$ as identically as possible.

In this study, we assume the test graph is perturbed through various structural changes and further develop the Bayesian label transition model to improve the robustness of GNN-based node classifiers against topological perturbations. Before summarizing the problem we aim to solve in this work, we first clarify that noisy labels in this work include both manual-annotated labels $\mathcal{Y}_{m}\in\mathbb{R}^{N_{train}\times 1}$ and auto-generated labels $\mathcal{Y}_{a}\in\mathbb{R}^{N_{test}\times 1}$ , where $N_{train}$ and $N_{test}$ denote the number of nodes on the train and test graphs, respectively. More specifically, we first train GNN with $\mathcal{Y}_{m}$ on the unperturbed train graph to generate $\mathcal{Y}_{a}$ . The prediction of the test nodes may perform worse when the test graph is under topological perturbations. At this point, we employ our proposed label inference model to iteratively recover the perturbed predictions by $\mathcal{Y}_{a}$ with several epochs of transition. After the label transition is converged, we expect that the inferred labels will be as identical as the latent labels. Overall, the above-mentioned problem could be formally described as follows:

Problem 1.

Given a graph $\mathcal{G}=(\mathbf{A},\mathbf{X})$ and the manual-annotated labels $\mathcal{Y}_{m}$ , which are used for training a GNN-based node classifier $\textit{f}_{\theta}$ on unperturbed $\mathcal{G}_{train}$ , we aim to develop a label inference model to improve the robustness of the node classifier $\textit{f}_{\theta}$ on perturbed $\mathcal{G}_{test}$ by approximating the inferred labels $\mathcal{\bar{Z}}$ to the latent labels $\mathcal{Z}$ as identical as possible.

2.3. Asymmetric Dirichlet Distributions

In Eq. 2, GraphSS (Zhuang and Al Hasan, 2022a) assumes the Dirichlet distribution is symmetric, i.e., the parameter vector $\alpha$ has the same value, 1.0, to all classes. In this study, we release this constraint and investigate whether adopting asymmetric Dirichlet distributions as a prior can contribute to the label transition. Fig. 2 visualizes the Dirichlet distributions by equilateral triangles. The first triangle represents the symmetric Dirichlet distribution. The second triangle shows that data points tend to move to the corner with the same probability. The third triangle presents that data points have a higher probability to be assigned to the class whose $\alpha$ value is higher. The intuition is that the $i^{th}$ inferred label $\mathbf{\bar{z}}^{t}_{i}$ may have higher probability to be transited to the $k^{th}$ class in the $t^{th}$ transition when $\alpha^{t}_{k}$ is higher. Following this intuition, we dynamically update the $k^{th}$ -class $\alpha$ value, $\alpha^{t}_{k}$ , in the $t^{th}$ transition as follows:

(3)

\alpha^{t}_{k}=\alpha^{t-1}_{k}\frac{\sum_{i=1}^{N}{I(\mathbf{\bar{z}}^{t}_{i}=k)}}{\sum_{i=1}^{N}{I(\mathbf{\bar{z}}^{t-1}_{i}=k)}}

where $I(\cdot)$ denotes an indicator function.

2.4. Label Inference Jointly from Bayesian Label Transition and Topology-based Label Propagation

GraphSS (Zhuang and Al Hasan, 2022a) applies Gibbs sampling to recover the multinominal distribution of nodes on perturbed graphs. However, such sampling may converge slowly and thus leads to inferior inference. This drawback can be mitigated by considering the fact that the majority of real-life networks exhibit homophily property, i.e., labels of the adjacent nodes are similar. So, if the inferred label of a node during label transition is uncertain, we can sample its label from the labels of adjacent nodes. This idea is inspired by well-known Label propagation algorithms (Wang et al., 2013). Thus, LInDT integrates both Bayesian label transition and topology-based label propagation. More specifically, for each node on $\mathcal{G}_{test}$ , our model first infers the label from Bayesian label transition and then substitutes this label using topology-based label propagation when this inferred label is uncertain. We define the node label uncertainty during the inference as follows:

Definition 0.

In the $t^{th}$ transition, an inferred label $\mathbf{\bar{z}}^{t}$ is uncertain, such that $\mathbf{\bar{z}}^{t}\neq\mathbf{\bar{z}}^{t-1}$ or $\mathbf{\bar{z}}^{t}\neq\mathbf{y}$ .

Note that $\mathbf{\bar{z}}^{t}\neq\mathbf{\bar{z}}^{t-1}$ will be happened when the inference is not converged yet, whereas $\mathbf{\bar{z}}^{t}\neq\mathbf{y}$ indicates that the inferred labels $\mathcal{\bar{Z}}$ still deviates from the auto-generated labels $\mathcal{Y}_{a}$ . The intuition of our model is that unnoticeable perturbations may severely deteriorate the predictions of GNN-based node classifiers, but fortunately, such perturbations only alter a small number of topological structures. Based on the homophily assumption of the graph structure, i.e., intra-class nodes tend to connect with each other, sampling labels with topology-based label propagation can decrease the node label uncertainty by utilizing topological information of nodes. Note that we present the edge homophily ratio (EHR $\in[0,1]$ ) (Zhu et al., 2020), the percentage of intra-class edges, in Tab. 1 to verify the homophily property in the graph datasets. Higher EHR indicates stronger homophily property.

Input: Categorical distribution

\overline{\textit{P}}\left(\mathcal{\bar{Z}}^{t-1}\mid\mathcal{V}\right)

, Transition matrix

\phi^{t-1}

, Topology-based label sampler.

1 for $i\leftarrow 0$ to $N$ do

\mathbf{\bar{z}}^{t}_{i}\sim\operatorname*{arg\,max}\overline{\textit{P}}\left(\mathbf{\bar{z}}^{t-1}_{i}\mid v_{i}\right)\phi^{t-1}

;

3 if $\mathbf{\bar{z}}^{t}_{i}$ is uncertain then

4 Update

\mathbf{\bar{z}}^{t}_{i}

with the topology-based label sampler;

5 end if

7 end for

return Inferred labels $\mathcal{\bar{Z}}^{t}$ in the $t^{th}$ transition.

Algorithm 1 Label Sampling Jointly from Bayesian Label Transition and Topology-based Label Propagation

We present how our model samples labels jointly from Bayesian Label Transition and Topology-based Label Propagation in Algo. 1. In the $t^{th}$ transition, we first sample a label $\mathbf{\bar{z}}^{t}_{i}$ of the $i^{th}$ node from the categorical distribution $\overline{\textit{P}}(\mathbf{\bar{z}}^{t-1}_{i}\mid v_{i})\in\mathbb{R}^{N\times K}$ updated with a given label transition matrix $\phi^{t-1}$ in the $(t-1)^{th}$ transition (Line 2). We then update $\mathbf{\bar{z}}^{t}_{i}$ with a given topology-based label sampler when this inferred label is uncertain (Line 3-4). In the end, we obtain the inferred labels in the $t^{th}$ transition.

Besides Algo. 1, we further use a toy example to explain our proposed sampling method in Fig. 3. In the $t^{th}$ transition, our model first samples the inferred label of node A as the red class via Bayesian label transition. However, this inferred label is different from the auto-generated label of node A; let’s say it’s blue. In this case, we say that this node label is uncertain. Based on the assumption of homophily graphs and unnoticeable perturbations, i.e., the labels of majority neighbor nodes are predicted correctly in $\mathcal{Y}_{a}$ , we then apply the topology-based label sampler to substitute the label, i.e., replacing it from red to blue based on the topological information of node A.

Based on the above-mentioned intuition, we investigate three types of topology-based label samplers, random sampler (Random), majority sampler (Major), and degree-weighted sampler (Degree). All of these samplers sample a label from 1-hop neighborhood of the current node. We denote the set of classes from the 1-hop neighborhood as $K_{nei}$ and the number of neighbour nodes as $N_{nei}$ . We highlight the distinctions of these candidate samplers as follows and further explain them with toy examples in Fig. 4.

•

Random sampler randomly samples a label. In the $t^{th}$ transition, the probability that the $i^{th}$ inferred label $\mathbf{\bar{z}}^{t}_{i}$ is equal to $k^{th}$ class can be formulated as Eq. 4.

(4)

\textit{P}\left(\mathbf{\bar{z}}^{t}_{i}=k\mid v_{i}\right)=\frac{\sum_{i=1}^{N_{nei}}{I(\mathbf{\bar{z}}^{t}_{i}=k)}}{\sum_{i=1}^{N_{nei}}{I(\mathbf{\bar{z}}^{t}_{i}\in K_{nei})}}

•

Majority sampler selects the majority class $k_{mj}$ as the label, such that the number of nodes in majority class $|v_{mj}|=max\{\sum_{\mathbf{\bar{z}}_{j}=k}\mathbf{A}_{ij}:k\in K_{nei}\}$ .
•

Degree-weighted sampler chooses a label from the class $k_{dw}$ , such that sum of the neighbor nodes’ total degrees in $k_{dw}$ is maximum. This maximum sum can be expressed as $max\{\sum_{\mathbf{\bar{z}}_{i}=k}d_{i}:k\in K_{nei}\}$ , where $d_{i}=\sum_{j}\mathbf{A}_{ij}$ is the total degree of the $i^{th}$ node.

Analysis of Convergence. To analyze the convergence of our proposed sampling method, we discuss two conjectures as follows.

Conjecture 0.

Given a large enough iterations of transition $T$ , the transition of inferred labels $\mathcal{\bar{Z}}$ will eventually converge to $\mathcal{\bar{Z}}^{\pi}$ .

Analysis: We formally state the transition as follows:

(5)

\small\mathcal{\bar{Z}^{\pi}}=\lim_{T\to\infty}\operatorname*{arg\,max}\overline{\textit{P}}\left(\mathcal{\bar{Z}}^{T}\mid\mathcal{V}\right)=\lim_{T\to\infty}\operatorname*{arg\,max}\overline{\textit{P}}\left(\mathcal{\bar{Z}}^{T-1}\mid\mathcal{V}\right)\phi^{T-1}\\

Theoretical proof of this conjecture is difficult because both the categorical distribution $\overline{\textit{P}}\left(\mathcal{\bar{Z}}\mid\mathcal{V}\right)$ and the label transition matrix $\phi$ in each transition will be dynamically updated. Instead, we empirically examine this conjecture in the experiment. Note that the converged inferred labels $\mathcal{\bar{Z}^{\pi}}$ is not guaranteed to be the same as the latent labels $\mathcal{Z}$ , which translates to the possibility of erroneous prediction.

Conjecture 0.

Label sampling jointly from Bayesian Label Transition and Topology-based Label Propagation helps the inference converge with fewer iterations of transition than the Bayesian-based sampling.

Analysis: Our proposed sampling method will substitute the inferred labels of uncertain nodes based on the topological information. On the homophily graphs, we can get smaller Total Variation Distance between current node label distributions of inferred labels in $t^{th}$ transition $\mathcal{\bar{Z}}^{t}$ and the convergence distributions of inferred labels $\mathcal{\bar{Z}}^{\pi}$ as follows:

(6)

\small\lVert\mathcal{\bar{Z}}^{t}_{DT}-\mathcal{\bar{Z}}^{\pi}\rVert_{TV}\leq\lVert\mathcal{\bar{Z}}^{t}_{Bayes}-\mathcal{\bar{Z}}^{\pi}\rVert_{TV}\\

where $\mathcal{\bar{Z}}^{t}_{DT}$ denotes the inferred labels using our proposed sampling method, whereas $\mathcal{\bar{Z}}^{t}_{Bayes}$ denotes the inferred labels using the Bayesian-based sampling method. $\lVert\cdot\rVert_{TV}$ denotes the Total Variation Distance.

We argue that this conjecture is true as such substitutions can force the inferred labels to get closer to the convergence based on the homophily assumption, shortening the iterations of transition compared to using the Bayesian-based sampling method, e.g., Gibbs sampling. Our experiments empirically verify this conjecture.

2.5. Pseudo-code of Our Proposed Model

Input: Graph

\mathcal{G}_{train}

and

\mathcal{G}_{test}

, which contain corresponding symmetric adjacency matrix

\mathbf{A}

and feature matrix

\mathbf{X}

, Manual-annotated labels

\mathcal{Y}_{m}

\mathcal{G}_{train}

, Node classifier

\textit{f}_{\theta}

, Initial

\alpha

, The number of warm-up steps

WS

, The number of transition

T

1 Train

\textit{f}_{\theta}

with

\mathcal{Y}_{m}

\mathcal{G}_{train}

;

2 Generate initial categorical distribution

\overline{\textit{P}}\left(\mathcal{Z}\mid\mathcal{V}\right)

and auto-generated labels

\mathcal{Y}_{a}

\textit{f}_{\theta}

;

3 Compute warm-up label transition matrix

\phi^{\prime}

based on

\mathcal{G}_{train}

;

4 Define inferred labels

\mathcal{\bar{Z}}

, dynamic label transition matrix

\phi

based on

\mathcal{G}_{test}

, and initial

\alpha

vector;

5 for $t\leftarrow 1$ to $T$ do

6 if $t<WS$ then

7 Sample

\mathcal{\bar{Z}}^{t}

with warm-up

\phi^{\prime}

by Algo. 1;

9 end if

10 else

11 Sample

\mathcal{\bar{Z}}^{t}

with dynamic

\phi

by Algo. 1;

13 end if

14 Update

\alpha

by Eq. 3 and dynamic

\phi

;

15 Retrain

\textit{f}_{\theta}

and update

\overline{\textit{P}}\left(\mathcal{\bar{Z}}^{t}\mid\mathcal{V}\right)

;

17 end for

return Inferred labels $\mathcal{\bar{Z}}$ and Dynamic $\phi$ ;

Algorithm 2 LInDT’s Pseudo-code

To warp up our model in Algo. 2, we discuss the pseudo-code and analyze the time complexity in this section. Training: Our model trains the node classifier $\textit{f}_{\theta}$ on the train graph $\mathcal{G}_{train}$ at first with manual-annotated labels $\mathcal{Y}_{m}$ (line 1) and then generates initial categorical distribution $\overline{\textit{P}}\left(\mathcal{Z}\mid\mathcal{V}\right)$ and auto-generated labels $\mathcal{Y}_{a}$ by $\textit{f}_{\theta}$ (line 2). Inference: Before the inference, our model first computes a warm-up label transition matrix $\phi^{\prime}$ by using the prediction over $\mathcal{G}_{train}$ (line 3). Our model then defines (creates empty spaces) the inferred labels $\mathcal{\bar{Z}}$ , the dynamic label transition matrix $\phi$ based on the test graph $\mathcal{G}_{test}$ , and also initializes the $\alpha$ vector with a given initial $\alpha$ value (line 4). In the warm-up stage, our model samples the inferred labels in the $t^{th}$ transition with the warm-up $\phi^{\prime}$ , which is built with the categorical distribution of $\textit{f}_{\theta}$ and $\mathcal{Y}_{m}$ on $\mathcal{G}_{train}$ , via corresponding topology-based label sampler using Algo. 1 (line 7). After the warm-up stage, our model applies the same sampling method with the dynamic $\phi$ (line 10). This dynamic $\phi$ updates in each transition with current inferred labels $\mathcal{\bar{Z}}^{t}$ and corresponding $\mathcal{Y}_{a}$ . Simultaneously, our model dynamically updates the $\alpha$ vector by Eq. 3 and the dynamic $\phi$ matrix (line 12). Besides, our model iteratively re-trains the node classifier with the current inferred labels $\mathcal{\bar{Z}}^{t}$ in the $t^{th}$ transition, and then updates the categorical distribution $\overline{\textit{P}}\left(\mathcal{\bar{Z}}^{t}\mid\mathcal{V}\right)$ modeled by the re-trained $\textit{f}_{\theta}$ (line 13). The transition will eventually reach convergence, approximating the inferred labels to the latent labels as identical as possible. In other words, our model can help recover the original predictions by dynamic conditional label transition on perturbed graphs.

According to Algo. 2, our model applies $T$ transitions in the inference. For each transition, our proposed sampling method traverses all test nodes once. Overall, the time complexity of the inference approximates to $\mathcal{O}(T\cdot N_{test})$ , where $T$ denotes the number of transition, whereas $N_{test}$ denotes the number of nodes on the test graph.

3. Experiments

In this section, we want to answer the following questions for evaluating our model.

•

Can node classification benefit from our model against topological perturbations?
•

How’s the convergence of our model?
•

How’s our performance compared to competing methods?
•

Can asymmetric Dirichlet distributions contribute to the label inference?

Table 1. Statistics of datasets.

\left|\mathcal{V}\right|

\left|\mathcal{E}\right|

\left|F\right|

, and

\left|C\right|

denote the number of nodes, edges, features, and classes, respectively. Avg.D denotes the average degree of test nodes. EHR denotes the edge homophily ratio.

Dataset	$\left\|\mathcal{V}\right\|$	$\left\|\mathcal{E}\right\|$	$\left\|F\right\|$	$\left\|C\right\|$	Avg.D	EHR(%)
Cora	2,708	10,556	1,433	7	4.99	81.00
Citeseer	3,327	9,228	3,703	6	3.72	73.55
Pubmed	19,717	88,651	500	3	5.50	80.24
AMZcobuy	7,650	287,326	745	8	32.77	82.72
Coauthor	18,333	327,576	6,805	15	10.01	80.81

Table 2. Examination of our model on top of GCN under three scenarios of topological perturbations across five datasets. ”Original” denotes the original performance of GCN on perturbed graphs (before inference). GraphSS is our baseline model with a fixed

\alpha

value, 1.0. ”Vanilla” denotes that we dynamically update the

\alpha

vector using Gibbs samplers. ”Random”, ”Major”, and ”Degree” denote that we employ the corresponding topology-based label sampler based on our vanilla architecture instead of using Gibbs samplers. All methods are evaluated by classification accuracy (Acc.) and average normalized entropy (Ent.) on victim nodes.

	Methods	Cora		Citeseer		Pubmed		AMZcobuy		Coauthor
	Methods	Acc.	Ent.	Acc.	Ent.	Acc.	Ent.	Acc.	Ent.	Acc.	Ent.
rdmPert	Original	48.95	9.24	45.92	62.17	25.13	3.59	81.25	43.83	38.63	17.10
	GraphSS (Zhuang and Al Hasan, 2022a)	82.61	18.34	31.33	9.35	81.33	25.25	83.56	7.93	88.42	6.60
	Vanilla	83.68	18.34	56.22	40.35	82.45	23.86	84.68	7.41	89.66	5.78
	Random	84.74	21.49	71.24	53.71	83.54	32.41	91.91	12.61	90.58	6.86
	Major	84.21	22.22	66.95	66.99	83.64	32.97	91.93	12.79	90.59	7.19
	Degree	83.63	26.92	65.32	66.81	83.52	32.32	91.93	12.24	90.62	7.20
infoSparse	Original	71.73	47.87	62.26	90.13	76.26	57.65	90.01	13.48	86.62	38.34
	GraphSS (Zhuang and Al Hasan, 2022a)	79.32	27.96	69.77	46.81	84.03	29.64	90.81	7.10	90.15	8.77
	Vanilla	79.48	27.96	69.77	46.81	84.05	29.64	91.90	7.25	91.06	8.54
	Random	79.48	27.96	69.77	46.81	84.09	29.74	91.91	7.26	91.20	8.71
	Major	79.48	27.96	69.77	46.81	84.10	29.56	91.91	6.88	91.22	8.29
	Degree	79.48	27.96	69.77	46.81	84.07	28.38	91.93	6.89	91.24	8.63
advAttack	Original	33.86	1.43	4.31	37.29	23.55	11.83	77.17	49.25	58.69	36.05
	GraphSS (Zhuang and Al Hasan, 2022a)	38.10	1.24	54.31	13.04	83.04	9.89	81.50	13.55	74.52	8.35
	Vanilla	39.68	1.24	66.38	13.42	85.13	9.89	82.61	13.72	76.39	8.23
	Random	76.72	8.59	69.84	15.47	85.70	11.48	84.78	13.47	80.28	8.05
	Major	78.84	9.69	71.98	15.25	85.87	10.87	84.78	12.36	80.31	6.98
	Degree	80.95	7.15	70.26	15.23	85.51	11.21	84.79	13.44	79.53	6.90

Experimental Settings. Before answering these questions, we first introduce the experimental settings.
Datasets. Tab. 1 presents statistics of five public node-labeled graph datasets. Cora, Citeseer, and Pubmed are famous citation graph data (Sen et al., 2008). AMZcobuy comes from the photo segment of the Amazon co-purchase graph (Shchur et al., 2018). Coauthor is co-authorship graphs of computer science based on the Microsoft Academic Graph from the KDD Cup 2016 challenge ²²2https://www.kdd.org/kdd-cup/view/kdd-cup-2016. For all datasets, the percentage of train, validation, and test partition comprise 10%, 20%, and 70% of the nodes, respectively. Similar to GraphSS (Zhuang and Al Hasan, 2022a), only the train graphs contain manual-annotated labels, but the validation and test graphs don’t. We simulate the manual-annotated labels by randomly replacing the ground-truth labels of 10% (noise ratio $nr$ ) nodes with another label, chosen uniformly.

Implementation of Topological Perturbations. We examine our model under three scenarios of topological perturbations as follows. 1) Random perturbations (rdmPert): perturbators in OSNs intend to randomly connect with many other normal users for commercial purposes, e.g., new brand promotion. In the experiments, we limit the number of perturbators to 1% of the validation/test nodes (a.k.a. victim nodes) and restrict the number of connections from each perturbator to 100. Note that rdmPert doesn’t apply gradient-based attacks, such as FGSM (Goodfellow et al., 2014), PGD (Madry et al., 2017), etc. 2) Information sparsity (infoSparse): we sparse the 90% links and 100% features of the victim nodes ( $L\&F$ ) on validation/test graphs. 3) Adversarial attacks (advAttack): we execute node-level direct evasion non-targeted attacks (Zügner et al., 2018) on both links and features ( $L\&F$ ) of the victim nodes on sparse graphs, whereas the trained node classifier remains unchanged. To ensure the sparsity, we sparse the denser graphs (AMZcobuy and Coauthor) and select the victim nodes whose total degrees are within (0, 10) for attacks. The intensity of perturbation $n_{pert}$ is set as 2 for all datasets. The ratio of $n_{pert}$ between applying on links and applying on features is 1: 10.

Competing Methods. We first compare our model with seven popular competing methods which develop robust GNN-based node classifiers using topological information. Also, we consider MC Dropout as one of the competing methods for measuring uncertainty purposes. GNN-Jaccard (Wu et al., 2019) preprocesses the graph by eliminating suspicious connections, whose Jaccard similarity of node’s features is smaller than a given threshold. GNN-SVD (Entezari et al., 2020) proposes another preprocessing approach with low-rank approximation on the perturbed graph to mitigate the negative effects from high-rank attacks, such as Nettack (Zügner et al., 2018). DropEdge (Rong et al., 2019) randomly removes a number of edges from the input graph in each training epoch. GRAND (Feng et al., 2020) proposes random propagation and consistency regularization strategies to address the issues of over-smoothing and non-robustness of GCNs. RGCN (Zhu et al., 2019) adopts Gaussian distributions as the hidden representations of nodes to mitigate the negative effects of adversarial attacks. ProGNN (Jin et al., 2020b) jointly learns the structural graph properties and iteratively reconstructs the clean graph to reduce the effects of adversarial structure. GDC (Hasanzadeh et al., 2020) proposes an adaptive connection sampling method using stochastic regularization for training GNNs. This method can learn with uncertainty on graphs. MC Dropout (Gal and Ghahramani, 2016) develops a dropout framework that approximates Bayesian inference in deep Gaussian processes.

Evaluation Metrics. We use both accuracy (Acc.) and average normalized entropy (Ent.) to measure the robustness of a node classifier. More specifically, Acc. evaluates the node classification performance (the higher the better), whereas Ent. represents the uncertainty of nodes’ categorical distributions (the lower the better).

Node Classification Benefits from Our Model against Topological Perturbations. The performance of a node classifier may degrade on perturbed graphs. We present the comparison of node label distributions of test nodes among the ground-truth labels, the prediction on unperturbed graphs, and the prediction on perturbed graphs (under rdmPert) in Fig. 6 as an example. This comparison indicates that node label distributions may significantly change on perturbed graphs, causing performance degradation. In this study, our goal is to improve the robustness of the node classifier on perturbed graphs. In other words, we aim to increase the classification accuracy (Acc.) while maintaining the uncertainty (Ent.) at a low level.

To answer the first question, we examine whether the node classifier can benefit from our model under three scenarios of topological perturbations, random perturbations (rdmPert), information sparsity (infoSparse), and adversarial attacks (advAttack). We compare the performance between the baseline model, GraphSS (Zhuang and Al Hasan, 2022a), and the variants of our model architectures. The denotations of related methods are mentioned in the caption of Tab. 2. The results affirmatively verify that our model can outperform the baseline model and successfully achieve our goal in most cases. Besides, we have several observations as follows. At first, topological samplers could further boost accuracy under both rdmPert and advAttack but may fail under infoSparse because most links and features are sparsified under this scenario. Moreover, the majority sampler attains higher accuracy than the degree-weighted sampler on sparse graphs (Cora, Citeseer, and Pubmed) in most cases. Such a situation may be largely reversed on denser graphs (AMZcobuy and Coauther) because the degree-weighted sampler can further take the advantage of degree information from neighbors. Furthermore, the uncertainty may be lower under some scenarios, e.g., Ent. is 1.43% under advAttack on Cora. However, such kind of low uncertainty reveals that the node classifier may suffer severe perturbations, i.e., many nodes are certainly assigned to incorrect classes. Thus, lower uncertainty couldn’t fully indicate robust performance. Last but not least, our model couldn’t significantly decrease the uncertainty under rdmPert on Citeseer.

Table 3. Comparison between competing methods and our model under the random perturbations scenario. Acc. (%) denotes classification accuracy. Ent. (%) denotes average normalized entropy. Time (s) denotes total runtime.

Methods	Cora			Citeseer			Pubmed			AMZcobuy			Coauthor
Methods	Acc.	Ent.	Time	Acc.	Ent.	Time	Acc.	Ent.	Time	Acc.	Ent.	Time	Acc.	Ent.	Time
GNN-Jaccard (Wu et al., 2019)	66.32	93.24	2.83	56.65	95.47	2.34	60.68	81.77	3.71	53.12	96.17	5.83	88.47	96.88	7.44
GNN-SVD (Entezari et al., 2020)	50.53	93.01	3.05	31.76	95.20	3.48	74.22	88.39	12.96	70.02	93.52	5.01	74.22	97.21	11.51
DropEdge (Rong et al., 2019)	67.88	95.28	1.89	46.78	96.44	1.46	77.34	76.66	2.56	63.42	96.25	2.61	71.42	97.48	2.90
GRAND (Feng et al., 2020)	52.34	94.99	8.72	35.02	95.35	5.69	49.75	89.89	12.15	40.22	96.23	27.14	55.56	97.73	154.17
RGCN (Zhu et al., 2019)	62.63	93.72	5.04	62.23	96.39	5.17	83.20	89.91	27.71	77.87	97.36	30.85	89.33	98.39	179.65
ProGNN (Jin et al., 2020b)	52.63	92.09	174.75	36.18	96.32	294.93	50.11	88.35	2135.67	45.28	98.63	1914.42	57.39	96.55	2366.04
GDC (Hasanzadeh et al., 2020)	71.18	85.77	12.30	43.15	93.73	17.52	48.95	32.94	258.34	45.58	98.18	80.84	62.65	94.02	205.22
MC Dropout (Gal and Ghahramani, 2016)	80.53	40.25	2.41	64.81	89.80	2.74	79.51	72.73	1.99	90.68	39.21	1.81	88.40	43.30	5.79
Ours	84.19	23.54	29.35	67.84	62.50	70.73	83.57	32.57	158.25	91.92	12.55	75.28	90.60	7.08	184.15

We further investigate this unusual case by visualizing the node label distributions via confusion matrices in Fig. 7. The left matrix presents the predicted node label distributions on unperturbed graphs, whereas the middle and right matrices present the predicted and inferred distributions on perturbed graphs, respectively. The visualization first shows that our model can recover the distribution as close as that on unperturbed graphs when the node label distribution is perturbed. These confusion matrices also indicate that the node classification on the unperturbed graphs already misses one class, leading to the inaccurate prior distribution. The label inference may probably fail to decrease the uncertainty due to this reason. Besides, this case reveals two limitations of our model. First, the inference highly depends on an accurate prior distribution. Second, our model couldn’t recover the missing class, which implies that our model currently cannot handle the open-set classification.

Empirical Analysis of Convergence. To answer the second question, we conduct an empirical analysis of convergence to examine two proposed conjectures. Fig. 5 displays the empirical comparison between using the Gibbs sampler and using the topology-based label sampler (Major) during the label inference under rdmPert. The first row presents the curves of validation accuracy (%). The second row displays the curves of the ratio of uncertain nodes (%) on inferred labels. According to the results, the label inference using both sampling methods can eventually converge across five datasets. The convergence verifies that the first conjecture holds true. Also, the results show that the label inference using the topology-based label sampler reaches the convergence with fewer iterations of transition compared to using the Gibbs sampler. In other words, using the topology-based label sampler can help decrease the ratio of uncertain nodes faster, and further reduces the fluctuations of the curve. The comparison verifies the second conjecture.

Comparison with Competing Methods. Tab. 3 presents the comparison of performance between competing methods and our model under rdmPert. For our model, we present the average values of three topological samplers. The results verify that our model is superior to competing methods across five datasets. Most competing methods fail to decrease the uncertainty (Ent.) but MC Dropout obtains robust performance among them. Our model can outperform MC Dropout, achieving higher classification accuracy with lower uncertainty. We notice that DropEdge and RGCN perform higher accuracy on larger graphs. For DropEdge, randomly dropping a certain number of edges is equivalent to data augmentation, which will be performed stronger when the size of the input graph is getting larger. RGCN adopts Gaussian distributions in hidden layers to reduce the negative impacts that come from the shift of node label distribution. The Gaussian-based hidden matrix has a higher capability to mitigate the negative impacts on larger graphs.

From the perspective of total runtime (in second), we observe that topological denoising methods (GCN-Jaccard, GCN-SVD, and DropEdge) and MC dropout consume much less time than other methods. These are reasonable because topological denoising methods mainly process the input graph structure, and MC dropout can speed up the training by dropping some hidden units. On the contrary, ProGNN consumes much more time as the size of graphs increases. One of the reasons is that ProGNN iteratively reconstructs the graph by preserving low rank and sparsity properties. Such reconstructions will take a much longer time on larger graphs.

Analysis of Parameters. In this study, we maintain the same values of both warm-up steps $WS$ , 40, and retraining epochs $Retrain$ , 60, as that in GraphSS (Zhuang and Al Hasan, 2022a). To avoid over-fitting, we retrain the node classifier every 10 epochs during the inference. In the meanwhile, we analyze how the initial $\alpha$ value affects accuracy, and conduct this analysis on the validation set. We fix the number of transition states $T$ as [100, 200, 80, 100, 90] for five datasets, respectively. Fig. 8 presents two curves of accuracy along different initial $\alpha$ values. The blue curve denotes the case that we dynamically update the $\alpha$ vector, whereas the red curve denotes the case with fixed $\alpha$ values. The results affirmatively answer the fourth question that adopting asymmetric Dirichlet distributions as a prior can contribute to the label inference using Bayesian label transition. Besides, the comparison between two curves verifies that the dynamic $\alpha$ mechanism can further boost the accuracy in most cases. Based on this analysis, we select the initial $\alpha$ values as [0.1, 0.3, 1.0, 0.7, 0.1] for five datasets, respectively.

Limitation and Future Directions. 1) Using the topology-based sampler may fail to boost the classification accuracy on extreme sparse graphs where most links and features are sparsified or missing. 2) Our label inference model highly depends on an accurate prior distribution. In other words, our model couldn’t handle the poisoning attacks. In the future, integrating a denoising approach on the prior distribution would be a possible solution to mitigate this issue. 3) Our model couldn’t recover the missing class from the prior distribution. This limitation implies that our model currently cannot handle the open-set classification. In the future, we could jointly utilize a distance-based label transition matrix to detect potential unseen classes during the inference.

4. Related Work and Impact

In this section, we further discuss how our work may contribute to other research fields as follows: 1) Noisy label learning. Learning GNNs with noisy labels has been widely studied (NT et al., 2019; Zhong et al., 2019; Dai et al., 2021; Li et al., 2021; Fan et al., 2022). We train the GCN with noisy labels for downstream inference. Our work could further contribute to this field as the Bayesian label transition model may recover the node label distribution using noisy labels with different noisy ratios. 2) Uncertainty estimation. Many researchers conduct studies on uncertainty estimation of node classifications using Bayesian approaches (Malinin and Gales, 2018; Zhang et al., 2019; Zhao et al., 2020; Liu et al., 2020b; Stadler et al., 2021). Our work could dedicate to this field since the experiments demonstrate that the label inference can significantly decrease the uncertainty of node classifications.

5. Conclusion

In this work, we aim to improve the robustness of a GNN-based node classifier against topological perturbations given that the node classifier is trained with manual-annotated labels. To achieve this goal, we propose a new label inference model, namely LInDT, that integrates both Bayesian label transition and topology-based label propagation. Extensive experiments demonstrate the following conclusions. Firstly, GNN-based node classification can benefit from our model under three scenarios of topological perturbations. Furthermore, our model converges faster, achieves higher accuracy on node classification while maintaining normalized entropy at a low level, and surpasses eight popular competing methods across five public graph datasets. Lastly, adopting asymmetric Dirichlet distributions as a prior can contribute to label inference.

Acknowledgements.

Our work is supported by NSF with grant number IIS-1909916.

Appendix A Implementation

A.1. Hardware and Software

We conduct the experiments using Python 3.8 and PyTorch 1.7. on Ubuntu 18.04.5 LTS with Intel(R) Xeon(R) Gold 6258R CPU 2.70 GHz and NVIDIA Tesla V100 PCIe 16GB GPU.

A.2. Model Architecture and Hyper-parameters

Our proposed model can be applied on top of GNNs. In this paper, we choose a two-layer GCN with 200 hidden units and a ReLU activation function. The GCN is trained by Adam optimizer with $1\times 10^{-3}$ learning rate and converged within 200 epochs on all datasets.

A.3. Hyper-parameters of Competing Methods

We present the hyper-parameters of competing methods for reproducibility purposes. All models are trained by Adam optimizer.
GNN-Jaccard (Wu et al., 2019): The similarity threshold is 0.01. The hidden units are 16. The dropout rate is 0.5. The training epochs are 300.
GNN-SVD (Entezari et al., 2020): The number of singular values is 15. Hidden units are 16. The dropout rate is 0.5. The training epochs are 300.
DropEdge (Rong et al., 2019): We choose GCN as the base model with 1 base block layer and train this model with 300 epochs. Rest of the parameters are mentioned in Tab. 4.
GRAND (Feng et al., 2020): The model is trained with 200 epochs. The hidden units, drop node rate, and L2 weight decay is 32, 0.5, and $5\times 10^{-4}$ , respectively. The rest of parameters are reported in Tab. 5.
RGCN (Zhu et al., 2019): We set up $\gamma$ as 1, $\beta_{1}$ and $\beta_{2}$ as $5\times 10^{-4}$ on all datasets. The hidden units for each dataset are 64, 64, 128, 1024, and 1024, respectively. The dropout rate is 0.6. The learning rate is 0.01. The training epochs are 400.
ProGNN (Jin et al., 2020b): $\alpha$ , $\beta$ , $\gamma$ , and $\lambda$ is $5\times 10^{-4}$ , 1.5, 1.0, and 0.0, respectively. The hidden units are 16. The dropout rate is 0.5. The learning rate is 0.01. Weight decay is $5\times 10^{-4}$ . The training epochs are 100.
GDC (Hasanzadeh et al., 2020): The number of blocks and layers is 2 and 4, respectively. The hidden units are 32. The dropout rate is 0.5. Both learning rate and weight decay are $5\times 10^{-3}$ . The training epochs are 400.
MC Dropout (Gal and Ghahramani, 2016): We use the same GCN architecture and training setting as ours except for the optimal dropout rate for each dataset as 0.7, 0.6, 0.9, 0.6, and 0.8, respectively. Note that MC Dropout will apply dropout in test data.

Table 4. Hyper-parameters of DropEdge in this paper.

Hyper-parameters	Cora	Citeseer	Pubmed	AMZcobuy	Coauthor
Hidden units	128	128	128	256	128
Dropout rate	0.8	0.8	0.5	0.5	0.5
Learning rate	0.01	9e-3	0.01	0.01	0.01
Weight decay	5e-3	1e-3	1e-3	0.01	1e-3
Use BN	$\times$	$\times$	$\times$	$\checkmark$	$\times$

Table 5. Hyper-parameters of GRAND in this paper.

Hyper-parameters	Cora	Citeseer	Pubmed	AMZcobuy	Coauthor
Propagation step	8	2	5	5	5
Data augmentation times	4	2	4	3	3
CR loss coefficient	1.0	0.7	1.0	0.9	0.9
Sharpening temperature	0.5	0.3	0.2	0.4	0.4
Learning rate	0.01	0.01	0.2	0.2	0.2
Early stopping patience	200	200	100	100	100
Input dropout	0.5	0.0	0.6	0.6	0.6
Hidden dropout	0.5	0.2	0.8	0.5	0.5
Use BN	$\times$	$\times$	$\checkmark$	$\checkmark$	$\checkmark$

References

(1)
Bhagat et al. (2011) Smriti Bhagat, Graham Cormode, and S Muthukrishnan. 2011. Node classification in social networks. In Social network data analytics. Springer, 115–148.
Breuer et al. (2020) Adam Breuer, Roee Eilat, and Udi Weinsberg. 2020. Friend or Faux: Graph-Based Early Detection of Fake Accounts on Social Networks. In Proceedings of The Web Conference 2020. 1287–1297.
Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
Chang et al. (2021) Heng Chang, Yu Rong, Tingyang Xu, Yatao Bian, Shiji Zhou, Xin Wang, Junzhou Huang, and Wenwu Zhu. 2021. Not All Low-Pass Filters are Robust in Graph Convolutional Networks. Advances in Neural Information Processing Systems 34 (2021).
Chen et al. (2021) Liang Chen, Jintang Li, Qibiao Peng, Yang Liu, Zibin Zheng, and Carl Yang. 2021. Understanding structural vulnerability in graph convolutional networks. arXiv preprint arXiv:2108.06280 (2021).
Chen et al. (2019) Weijian Chen, Yulong Gu, Zhaochun Ren, Xiangnan He, Hongtao Xie, Tong Guo, Dawei Yin, and Yongdong Zhang. 2019. Semi-supervised User Profiling with Heterogeneous Graph Attention Networks.. In IJCAI, Vol. 19. 2116–2122.
Chen et al. (2017) Zhengdao Chen, Xiang Li, and Joan Bruna. 2017. Supervised community detection with line graph neural networks. arXiv preprint arXiv:1705.08415 (2017).
Dai et al. (2021) Enyan Dai, Charu Aggarwal, and Suhang Wang. 2021. NRGNN: Learning a Label Noise-Resistant Graph Neural Network on Sparsely and Noisily Labeled Graphs. arXiv preprint arXiv:2106.04714 (2021).
Dai et al. (2018) Hanjun Dai, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. 2018. Adversarial attack on graph structured data. In International conference on machine learning. PMLR, 1115–1124.
Dai et al. (2022) Jiazhu Dai, Weifeng Zhu, and Xiangfeng Luo. 2022. A Targeted Universal Attack on Graph Convolutional Network by Using Fake Nodes. Neural Processing Letters (2022), 1–17.
Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems. 3844–3852.
Du et al. (2017) Jian Du, Shanghang Zhang, Guanhang Wu, José MF Moura, and Soummya Kar. 2017. Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370 (2017).
Entezari et al. (2020) Negin Entezari, Saba A Al-Sayouri, Amirali Darvishzadeh, and Evangelos E Papalexakis. 2020. All you need is low (rank) defending against adversarial attacks on graphs. In Proceedings of the 13th International Conference on Web Search and Data Mining. 169–177.
Fan et al. (2022) Jinfu Fan, Yang Yu, and Zhongjie Wang. 2022. Partial Label Learning with competitive learning graph neural network. Engineering Applications of Artificial Intelligence 111 (2022), 104779.
Fan et al. (2019) Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph neural networks for social recommendation. In The world wide web conference. 417–426.
Fang et al. (2016) Hanyin Fang, Fei Wu, Zhou Zhao, Xinyu Duan, Yueting Zhuang, and Martin Ester. 2016. Community-based question answering via heterogeneous social network learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30.
Feng et al. (2020) Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. 2020. Graph Random Neural Networks for Semi-Supervised Learning on Graphs. Advances in Neural Information Processing Systems 33 (2020).
Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning. PMLR, 1050–1059.
Gao et al. (2022) Chen Gao, Xiang Wang, Xiangnan He, and Yong Li. 2022. Graph neural networks for recommender system. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1623–1625.
Gong et al. (2016) Chen Gong, Dacheng Tao, Wei Liu, Liu Liu, and Jie Yang. 2016. Label propagation via teaching-to-learn and learning-to-teach. IEEE transactions on neural networks and learning systems 28, 6 (2016), 1452–1465.
Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
Hahn-Klimroth et al. (2020) Max Hahn-Klimroth, Giulia S Maesaka, Yannick Mogge, Samuel Mohr, and Olaf Parczyk. 2020. Random perturbation of sparse graphs. arXiv preprint arXiv:2004.04672 (2020).
Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems. 1024–1034.
Hasanzadeh et al. (2020) Arman Hasanzadeh, Ehsan Hajiramezanali, Shahin Boluki, Mingyuan Zhou, Nick Duffield, Krishna Narayanan, and Xiaoning Qian. 2020. Bayesian graph neural networks with adaptive connection sampling. In International conference on machine learning. PMLR, 4094–4104.
Hsieh et al. (2020) Kanglin Hsieh, Yinyin Wang, Luyao Chen, Zhongming Zhao, Sean Savitz, Xiaoqian Jiang, Jing Tang, and Yejin Kim. 2020. Drug repurposing for COVID-19 using graph neural network with genetic, mechanistic, and epidemiological validation. Research Square (2020).
Huang et al. (2020) Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R Benson. 2020. Combining label propagation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993 (2020).
Iscen et al. (2019) Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. 2019. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5070–5079.
Jin et al. (2021) Wei Jin, Tyler Derr, Yiqi Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2021. Node similarity preserving graph convolutional networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining.
Jin et al. (2020a) Wei Jin, Yaxin Li, Han Xu, Yiqi Wang, Shuiwang Ji, Charu Aggarwal, and Jiliang Tang. 2020a. Adversarial Attacks and Defenses on Graphs: A Review, A Tool and Empirical Studies. arXiv preprint arXiv:2003.00653 (2020).
Jin et al. (2020b) Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. 2020b. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 66–74.
Karasuyama and Mamitsuka (2013) Masayuki Karasuyama and Hiroshi Mamitsuka. 2013. Multiple graph label propagation by sparse integration. IEEE transactions on neural networks and learning systems 24, 12 (2013), 1999–2012.
Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
La Gatta et al. (2020) Valerio La Gatta, Vincenzo Moscato, Marco Postiglione, and Giancarlo Sperli. 2020. An epidemiological neural network exploiting dynamic graph structured data applied to the covid-19 outbreak. IEEE Transactions on Big Data 7, 1 (2020), 45–55.
Li and Pi (2019) Bentian Li and Dechang Pi. 2019. Learning deep neural networks for node classification. Expert Systems with Applications 137 (2019), 324–334.
Li et al. (2020) Junnan Li, Richard Socher, and Steven CH Hoi. 2020. Dividemix: Learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394 (2020).
Li et al. (2021) Yayong Li, Jie Yin, and Ling Chen. 2021. Unified robust training for graph neural networks against label noise. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 528–540.
Liu et al. (2020a) Meng Liu, Hongyang Gao, and Shuiwang Ji. 2020a. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 338–348.
Liu et al. (2021) Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. 2021. Elastic graph neural networks. In International Conference on Machine Learning. PMLR.
Liu et al. (2020b) Zhao-Yang Liu, Shao-Yuan Li, Songcan Chen, Yao Hu, and Sheng-Jun Huang. 2020b. Uncertainty aware graph gaussian process for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 4957–4964.
Luo et al. (2021) Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. 2021. Learning to drop: Robust graph neural network via topological denoising. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 779–787.
Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017).
Malinin and Gales (2018) Andrey Malinin and Mark Gales. 2018. Predictive uncertainty estimation via prior networks. Advances in neural information processing systems 31 (2018).
Mountain and Dresler (1997) Clifton F Mountain and Carolyn M Dresler. 1997. Regional lymph node classification for lung cancer staging. Chest 111, 6 (1997), 1718–1723.
NT et al. (2019) Hoang NT, Choong Jun Jin, and Tsuyoshi Murata. 2019. Learning Graph Neural Networks with Noisy Labels. arXiv preprint arXiv:1905.01591 (2019).
Regol et al. (2022) Florence Regol, Soumyasundar Pal, Jianing Sun, Yingxue Zhang, Yanhui Geng, and Mark Coates. 2022. Node copying: A random graph model for effective graph sampling. Signal Processing 192 (2022), 108335.
Rong et al. (2019) Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2019. Dropedge: Towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903 (2019).
Seddik et al. (2021) Mohamed El Amine Seddik, Changmin Wu, Johannes F Lutzeyer, and Michalis Vazirgiannis. 2021. Node Feature Kernels Increase Graph Convolutional Network Robustness. arXiv preprint arXiv:2109.01785 (2021).
Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI magazine (2008).
Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of Graph Neural Network Evaluation. Relational Representation Learning Workshop, NeurIPS 2018 (2018).
Speriosu et al. (2011) Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the First workshop on Unsupervised Learning in NLP. 53–63.
Stadler et al. (2021) Maximilian Stadler, Bertrand Charpentier, Simon Geisler, Daniel Zügner, and Stephan Günnemann. 2021. Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classification. Advances in Neural Information Processing Systems 34 (2021).
Sun et al. (2018) Lichao Sun, Yingtong Dou, Carl Yang, Ji Wang, Philip S Yu, and Bo Li. 2018. Adversarial attack and defense on graph data: A survey. arXiv preprint arXiv:1812.10528 (2018).
Tam and Dunson (2020) Edric Tam and David Dunson. 2020. Fiedler regularization: Learning neural networks with graph sparsity. In International Conference on Machine Learning. PMLR, 9346–9355.
Tang et al. (2016) Jiliang Tang, Charu Aggarwal, and Huan Liu. 2016. Node classification in signed social networks. In Proceedings of the 2016 SIAM international conference on data mining. SIAM, 54–62.
Tang et al. (2020) Xianfeng Tang, Yandong Li, Yiwei Sun, Huaxiu Yao, Prasenjit Mitra, and Suhang Wang. 2020. Transferring robustness for graph neural network against poisoning attacks. In Proceedings of the 13th International Conference on Web Search and Data Mining. 600–608.
Tanno et al. (2019) Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C Alexander, and Nathan Silberman. 2019. Learning from noisy labels by regularized estimation of annotator confusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11244–11253.
Ugander and Backstrom (2013) Johan Ugander and Lars Backstrom. 2013. Balanced label propagation for partitioning massive graphs. In Proceedings of the sixth ACM international conference on Web search and data mining. 507–516.
Wang et al. (2013) Bo Wang, Zhuowen Tu, and John K Tsotsos. 2013. Dynamic label propagation for semi-supervised multi-class multi-label classification. In Proceedings of the IEEE international conference on computer vision. 425–432.
Wang and Zhang (2007) Fei Wang and Changshui Zhang. 2007. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering 20, 1 (2007), 55–67.
Wang and Leskovec (2020) Hongwei Wang and Jure Leskovec. 2020. Unifying graph convolutional neural networks and label propagation. arXiv preprint arXiv:2002.06755 (2020).
Wang et al. (2018) Xiaoyun Wang, Joe Eaton, Cho-Jui Hsieh, and Felix Wu. 2018. Attack graph convolutional networks by adding fake nodes. arXiv preprint arXiv:1810.10751 (2018).
Wu et al. (2019) Huijun Wu, Chen Wang, Yuriy Tyshetskiy, Andrew Docherty, Kai Lu, and Liming Zhu. 2019. Adversarial examples on graph data: Deep insights into attack and defense. arXiv preprint arXiv:1903.01610 (2019).
Xu et al. (2021) Hui Xu, Liyao Xiang, Jiahao Yu, Anqi Cao, and Xinbing Wang. 2021. Speedup Robust Graph Structure Learning with Low-Rank Information. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2241–2250.
Xu et al. (2020b) Jiarong Xu, Yang Yang, Junru Chen, Xin Jiang, Chunping Wang, Jiangang Lu, and Yizhou Sun. 2020b. Unsupervised adversarially-robust representation learning on graphs. arXiv preprint arXiv:2012.02486 (2020).
Xu et al. (2020a) Kaidi Xu, Sijia Liu, Pin-Yu Chen, Mengshu Sun, Caiwen Ding, Bhavya Kailkhura, and Xue Lin. 2020a. Towards an efficient and general framework of robust training for graph neural networks. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8479–8483.
Ye and Ji (2021) Yang Ye and Shihao Ji. 2021. Sparse graph attention networks. IEEE Transactions on Knowledge and Data Engineering (2021).
Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 974–983.
Zhang and Zitnik (2020) Xiang Zhang and Marinka Zitnik. 2020. Gnnguard: Defending graph neural networks against adversarial attacks. arXiv preprint arXiv:2006.08149 (2020).
Zhang et al. (2019) Yingxue Zhang, Soumyasundar Pal, Mark Coates, and Deniz Ustebay. 2019. Bayesian graph convolutional neural networks for semi-supervised classification. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5829–5836.
Zhao et al. (2020) Xujiang Zhao, Feng Chen, Shu Hu, and Jin-Hee Cho. 2020. Uncertainty aware semi-supervised learning on graph data. Advances in Neural Information Processing Systems 33 (2020), 12827–12836.
Zhao et al. (2016) Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, and Yueting Zhuang. 2016. Expert finding for community-based question answering via ranking metric network learning.. In Ijcai, Vol. 16. 3000–3006.
Zheng et al. (2020) Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, and Wei Wang. 2020. Robust graph representation learning via neural sparsification. In International Conference on Machine Learning. PMLR, 11458–11468.
Zhong et al. (2019) Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. 2019. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Zhu et al. (2019) Dingyuan Zhu, Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2019. Robust graph convolutional networks against adversarial attacks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Zhu et al. (2020) Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in Neural Information Processing Systems 33 (2020), 7793–7804.
Zhu and Ghahramani (2002) Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. (2002).
Zhuang and Al Hasan (2022a) Jun Zhuang and Mohammad Al Hasan. 2022a. Defending Graph Convolutional Networks against Dynamic Graph Perturbations via Bayesian Self-Supervision. Proceedings of the AAAI Conference on Artificial Intelligence 36, 4 (Jun. 2022), 4405–4413. https://doi.org/10.1609/aaai.v36i4.20362
Zhuang and Al Hasan (2022b) Jun Zhuang and Mohammad Al Hasan. 2022b. Deperturbation of Online Social Networks via Bayesian Label Transition. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). SIAM, 603–611.
Zhuang and Hasan (2022) Jun Zhuang and Mohammad Al Hasan. 2022. How Does Bayesian Noisy Self-Supervision Defend Graph Convolutional Networks? Neural Processing Letters (2022), 1–22.
Zügner et al. (2018) Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. 2018. Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2847–2856.
Zügner and Günnemann (2019) Daniel Zügner and Stephan Günnemann. 2019. Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412 (2019).