Overcoming Class Imbalance: Unified GNN Learning with Structural and Semantic Connectivity Representations

Abdullah Alchihabi    Hao Yan    Yuhong Guo
Abstract

Class imbalance is pervasive in real-world graph datasets, where the majority of annotated nodes belong to a small set of classes (majority classes), leaving many other classes (minority classes) with only a handful of labeled nodes. Graph Neural Networks (GNNs) suffer from significant performance degradation in the presence of class imbalance, exhibiting bias towards majority classes and struggling to generalize effectively on minority classes. This limitation stems, in part, from the message passing process, leading GNNs to overfit to the limited neighborhood of annotated nodes from minority classes and impeding the propagation of discriminative information throughout the entire graph. In this paper, we introduce a novel Unified Graph Neural Network Learning (Uni-GNN) framework to tackle class-imbalanced node classification. The proposed framework seamlessly integrates both structural and semantic connectivity representations through semantic and structural node encoders. By combining these connectivity types, Uni-GNN extends the propagation of node embeddings beyond immediate neighbors, encompassing non-adjacent structural nodes and semantically similar nodes, enabling efficient diffusion of discriminative information throughout the graph. Moreover, to harness the potential of unlabeled nodes within the graph, we employ a balanced pseudo-label generation mechanism that augments the pool of available labeled nodes from minority classes in the training set. Experimental results underscore the superior performance of our proposed Uni-GNN framework compared to state-of-the-art class-imbalanced graph learning baselines across multiple benchmark datasets.

GNN Learning, Class Imbalance

1 Introduction

Graph Neural Networks (GNNs) have exhibited significant success in addressing the node classification task (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018) across diverse application domains from molecular biology (Hao et al., 2020) to fraud detection (Zhang et al., 2021). The efficacy of GNNs has been particularly notable when applied to balanced annotated datasets, where all classes have a similar number of labeled training instances. The performance of GNNs experiences a notable degradation when confronted with an increasingly imbalanced class distribution in the available training instances (Yun et al., 2022). This decline in performance materializes as a bias towards the majority classes, which possess a considerable number of labeled instances, resulting in a challenge to generalize effectively over minority classes that have fewer labeled instances (Park et al., 2021; Yan et al., 2023). The root of this issue lies in GNNs’ reliance on message passing to disseminate information across the graph. Specifically, when the number of labeled nodes for a particular class is limited, GNNs struggle to propagate discriminative information related to that class throughout the entire graph. This tendency leads to GNNs’ overfitting to the confined neighborhood of labeled nodes belonging to minority classes (Tang et al., 2020; Yun et al., 2022; Li et al., 2023). This is commonly denoted as the ‘under-reaching problem’ (Sun et al., 2022) or ‘neighborhood memorization’ (Park et al., 2021).

Class-imbalanced real-world graph data are widespread, spanning various application domains such as the Internet of Things (Wang et al., 2022), Fraud Detection (Zhang et al., 2021), and Cognitive Diagnosis (Wang et al., 2023). Consequently, there is a critical need to develop GNN models that demonstrate robustness to class imbalance, avoiding biases towards majority classes while maintaining the ability to generalize effectively over minority classes. Traditional methods addressing class imbalance, such as oversampling (Chawla et al., 2002) or re-weighting (Yuan & Ma, 2012), face limitations in the context of graph-structured data as they do not account for the inherent graph structure. Consequently, several approaches have been proposed to specifically tackle class imbalance within the realm of semi-supervised node classification. Topology-aware re-weighting methods, which consider the graph connectivity when assigning weights to labeled nodes, such as TAM (Song et al., 2022), have demonstrated improved performance compared to traditional re-weighting methods. However, these methods still exhibit limitations as they do not effectively address issues related to neighborhood memorization and the under-reaching problem. Node oversampling methods, including ImGAGN (Qu et al., 2021), GraphSMOTE (Zhao et al., 2021), and GraphENS (Park et al., 2021), generate new nodes and establish connections with the existing graph through various mechanisms. Despite their potential, these methods face an open challenge in determining the optimal way to connect synthesized nodes to the rest of the graph. Additionally, they often fall short in harnessing the untapped potential of the substantial number of unlabeled nodes present in the graph.

In this study, we present a novel Unified Graph Neural Network Learning (Uni-GNN) framework designed to tackle the challenges posed by class-imbalanced node classification tasks. Our proposed framework leverages both structural and semantic connectivity representations, specifically addressing the under-reaching and neighborhood memorization issues. To achieve this, we construct a structural connectivity based on the input graph structure, complemented by a semantic connectivity derived from the similarity between node embeddings. Within each layer of the Uni-GNN framework, we establish a dedicated message passing layer for each type of connectivity. This allows for the propagation of node messages across both structural and semantic connectivity types, resulting in the acquisition of comprehensive structural and semantic representations. The Uni-GNN framework’s unique utilization of both structural and semantic connectivity empowers it to effectively extend the propagation of node embeddings beyond the standard neighborhood. This extension reaches non-direct structural neighbors and semantically similar nodes, facilitating the efficient dissemination of discriminative information throughout the entire graph. Moreover, to harness the potential of unlabeled nodes in the graph, we introduce a balanced pseudo-label generation method. This method strategically samples unlabeled nodes with confident predictions in a class-balanced manner, effectively increasing the number of labeled instances for minority classes. Our experimental evaluations on multiple benchmark datasets underscore the superior performance of the proposed Uni-GNN framework compared to state-of-the-art Graph Neural Network methods designed to address class imbalance.

2 Related Works

2.1 Class Imbalanced Learning

Class-imbalanced learning methods generally fall into three categories: re-sampling, re-weighting, and data augmentation. Re-sampling involves adjusting the original imbalanced data distribution to a class-balanced one by either up-sampling minority class data or down-sampling majority class data (Wallace et al., 2011; Mahajan et al., 2018). Re-weighting methods assign smaller weights to majority instances and larger weights to minority instances, either in the loss function or at the logits level (Ren et al., 2020; Menon et al., 2021). Data augmentation methods either transfer knowledge from head classes to augment tail classes or apply augmentation techniques to generate additional data for minority classes (Chawla et al., 2002; Kim et al., 2020; Zhong et al., 2021; Ahn et al., 2023). Traditional oversampling methods do not consider the graph structure when generating synthesized samples, making them unsuitable for graph data. Self-training methods for semi-supervised class imbalanced learning address class imbalance by supplementing the labeled set with unlabeled nodes predicted to belong to the minority class (Wei et al., 2021). However, GNNs’ bias towards majority classes can introduce noise in pseudo-labels, necessitating the development of new methods to mitigate the bias under class imbalance.

2.2 Class Imbalanced Graph Learning

Traditional class imbalance methods assume independently distributed labeled instances, a condition not met in graph data where nodes influence each other through the graph structure. Existing class imbalanced graph learning can be categorized into three groups: re-weighting, over-sampling, and ensemble methods. Re-weighting methods assign importance weights to labeled nodes based on their class labels while considering the graph structure. Topology-Aware Margin (TAM) adjusts node logits by incorporating local connectivity (Song et al., 2022). However, re-weighting methods have limitations in addressing the under-reaching or neighborhood memorization problems (Park et al., 2021; Sun et al., 2022). Graph over-sampling methods generate new nodes and connect them to the existing graph. GraphSMOTE uses SMOTE to synthesize minority nodes in the learned embedding space and connects them using an edge generator module (Zhao et al., 2021). ImGAGN employs a generative adversarial approach to balance the input graph by generating synthetic nodes with semantic and topological features (Qu et al., 2021). GraphENS generates new nodes by mixing minority class nodes with others in the graph, using saliency-based node mixing for features and KL divergence for structure information (Park et al., 2021). However, these methods increase the graph size, leading to higher computational costs. Long-Tail Experts for Graphs (LTE4G) is an ensemble method that divides training nodes based on class and degree into subsets, trains expert teacher models, and uses knowledge distillation to obtain student models (Yun et al., 2022). It however comes with a computational cost due to training multiple models.

3 Method

In the context of semi-supervised node classification with class imbalance, we consider a graph G=(V,E), where V represents the set of N=|V| nodes, and E denotes the set of edges within the graph. E is commonly represented by an adjacency matrix A of size N\times N. This matrix is assumed to be symmetric (i.e., for undirected graphs), and it may contain either weighted or binary values. Each node in the graph is associated with a feature vector of size D, while the feature vectors for all the N nodes are organized into an input feature matrix X\in\mathbb{R}^{N\times D}. The set of nodes V is partitioned into two distinct subsets: V_{l}, comprising labeled nodes, and V_{u}, encompassing unlabeled nodes. The labeled nodes in V_{l} are paired with class labels, and this information is encapsulated in a label indicator matrix Y^{l}\in\{0,1\}^{N_{l}\times C}. Here, C signifies the number of classes, and N_{l} is the number of labeled nodes. Further, V_{l} can be subdivided into C non-overlapping subsets, denoted as \{V^{1}_{l},\cdots,V^{C}_{l}\}, where each subset V^{i}_{l} corresponds to the labeled nodes belonging to class i. It is noteworthy that V_{l} exhibits class imbalance, characterized by an imbalance ratio \rho. This ratio, defined as \rho=\frac{\min_{i}(|V^{i}_{l}|)}{\max_{i}(|V^{i}_{l}|)}, is considerably less than 1. Such a class imbalance problem introduces challenges and necessitates specialized techniques in the development of effective node classification models.

This section introduces the proposed Unified Graph Neural Network Learning (Uni-GNN) framework designed specifically for addressing class-imbalanced semi-supervised node classification tasks. The Uni-GNN framework integrates both structural and semantic connectivity to facilitate the learning of discriminative, unbiased node embeddings. Comprising two node encoders—structural and semantic—Uni-GNN ensures joint utilization of these two facets. The collaborative efforts of the structural and semantic encoders converge in the balanced node classifier, which effectively utilizes the merged output from both encoders to categorize nodes within the graph. Notably, a weighted loss function is employed to address the challenge of class-imbalanced nodes in the graph, ensuring that the classifier is robust and capable of handling imbalanced class distributions. To harness the potential of unlabeled nodes, we introduce a balanced pseudo-label generation strategy. This strategy generates class-balanced confident pseudo-labels for the unlabeled nodes, contributing to the overall robustness and effectiveness of the Uni-GNN framework. In the rest of this section, we delve into the details of both the Uni-GNN framework and the balanced pseudo-label generation strategy, as well as the synergies between them.

3.1 Unified GNN Learning Framework

The Unified GNN Learning (Uni-GNN) framework comprises three crucial components: the structural node encoder, the semantic node encoder and the balanced node classifier. The structural node encoder is dedicated to constructing a graph adjacency matrix founded on structural connectivity, facilitating the propagation of node embeddings beyond the immediate structural neighbors of nodes. Concurrently, the semantic node encoder generates adjacency matrices based on semantic connectivity where it connects nodes to their semantically similar neighboring nodes, transcending structural distances. This facilitates the linking of nodes that are spatially distant in the graph but are from the same class, and enables message propagation between distantly located yet similar nodes, enriching the encoding process with learned node embeddings. At every layer in both the structural and semantic encoders, we propagate the integrated structural and semantic embeddings of the nodes along the corresponding connectivity, alleviating bias induced by each single type of connectivity. Ultimately, a balanced node classifier utilizes the acquired node embeddings from both the structural and semantic encoders for node classification. The overall framework of Uni-GNN is illustrated in Figure 1.

Figure 1: Overview of the proposed Unified GNN Learning framework. The structural (f_{\text{struct}}) and semantic (f_{\text{sem}}) node encoders leverage their respective connectivity matrices, the structural matrix A_{\text{struct}} and the semantic matrices \{A^{\ell}_{\text{sem}}\}_{\ell=1}^{L}. The encoders share concatenated structural (H^{\ell-1}_{\text{struct}}) and semantic (H^{\ell-1}_{\text{sem}}) node embeddings at each message passing layer \ell. The balanced node classifier \phi utilizes the final unified node embeddings H^{L}_{\text{struct}}||H^{L}_{\text{sem}} for both node classification and balanced pseudo-label generation.

3.1.1 Structural Node Encoder

The objective of the structural encoder is to learn an embedding of the nodes based on structural connectivity. Instead of directly using the input adjacency matrix A, we construct a new structural connectivity-based graph adjacency matrix A_{\text{struct}} that extends connections beyond the immediate neighbors in the input graph. This matrix is determined by the distances between pairs of nodes, measured in terms of the number of edges along the shortest path connecting the respective nodes, such that

A_{\text{struct}}[i,j]=\begin{cases}\frac{1}{\text{SPD}(i,j)}&\text{SPD}(i,j)\leq\alpha\\ 0&\text{otherwise}\end{cases} (1)

where \text{SPD}(\cdot,\cdot) is the shortest path distance function that measures the distance between pairs of input nodes in terms of the number of edges along the shortest path in the input graph G. The hyper-parameter \alpha\geq 1 governs the maximum length of the shortest path distance to be considered. In A_{\text{struct}}, edges connecting node pairs are assigned weights that are inversely proportional to the length of the shortest path between them. This design ensures that the propagated messages carry importance weights, scaling the messages based on the corresponding edge weights between connected nodes. The constructed structural connectivity enables us to directly propagate messages to nodes beyond a node's immediate structural neighbors in the original graph. This is beneficial for expanding the influence of labeled minority nodes towards more distant neighboring nodes within the graph, particularly when there are under-reaching problems induced by sparse connectivities in the original graph.
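
As a concrete illustration of Eq. (1), the following is a minimal sketch of constructing A_{\text{struct}} from a binary adjacency matrix using SciPy's shortest-path routine; the function name and the choice of SciPy are our own, made for illustration, and are not taken from the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def build_structural_adjacency(adj: np.ndarray, alpha: int = 2) -> np.ndarray:
    """Sketch of Eq. (1): weight node pairs by inverse shortest-path distance,
    keeping only pairs within `alpha` hops. `adj` is a binary N x N adjacency matrix."""
    # Hop counts along shortest paths (unweighted=True counts edges rather than edge weights).
    spd = shortest_path(csr_matrix(adj), method="D", directed=False, unweighted=True)
    a_struct = np.zeros_like(spd)
    mask = (spd >= 1) & (spd <= alpha)      # exclude self-pairs (SPD = 0) and pairs farther than alpha
    a_struct[mask] = 1.0 / spd[mask]        # weight inversely proportional to the distance
    return a_struct

# Tiny usage example on a 4-node path graph: nodes 0 and 2 receive weight 1/2 when alpha = 2.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(build_structural_adjacency(A, alpha=2))
```

Note that with \alpha=1 this construction reduces to the original adjacency matrix A, which corresponds to the A_{\text{struct}}=A ablation variant studied later.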

The structural node encoder, denoted as f_{\text{struct}}, consists of L message propagation layers. In each layer \ell of the structural node encoder, denoted as f^{\ell}_{\text{struct}}, the node input features comprise the learned node embeddings, H^{\ell-1}_{\text{struct}} and H^{\ell-1}_{\text{sem}}, from the previous layer of both the structural encoder and the semantic encoder, respectively. As a consequence, the propagated messages encode both semantic and structural information, facilitating the learning of more discriminative node embeddings. The constructed structural connectivity matrix A_{\text{struct}} is employed as the adjacency matrix for message propagation within each layer. We employ the conventional Graph Convolution Network (GCN) (Kipf & Welling, 2017) as our message-passing layer, given its simplicity, efficiency and ability to handle weighted graph structures, in the following manner:

{H}_{\text{struct}}^{\ell}=f^{\ell}_{\text{struct}}({H}_{\text{struct}}^{\ell-1}||{H}_{\text{sem}}^{\ell-1},A_{\text{struct}}) (2)
=\sigma\left(\hat{D}_{\text{struct}}^{-\frac{1}{2}}\hat{A}_{\text{struct}}\hat{D}_{\text{struct}}^{-\frac{1}{2}}({H}_{\text{struct}}^{\ell-1}||{H}_{\text{sem}}^{\ell-1})W_{\text{struct}}^{\ell}\right)

Here, \sigma represents the non-linear activation function, "||" denotes the feature concatenation operation, W_{\text{struct}}^{\ell} is the matrix of learnable parameters for f^{\ell}_{\text{struct}}, \hat{A}_{\text{struct}}={A}_{\text{struct}}+I is the adjusted adjacency matrix with self-connections, and \hat{D}_{\text{struct}} is the diagonal node degree matrix of \hat{A}_{\text{struct}} such that \hat{D}_{\text{struct}}[i,i]=\sum_{j}\hat{A}_{\text{struct}}[i,j]. In the case of the first layer of f_{\text{struct}}, the node input features are solely represented by the input feature matrix X.
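
The layer-wise update in Eq. (2) is a standard GCN propagation applied to concatenated embeddings. Below is a hedged PyTorch sketch of such a layer; the class name and the dense-matrix implementation are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class UnifiedGCNLayer(nn.Module):
    """One message-passing layer of the form in Eq. (2): symmetric normalization of a
    (possibly weighted) adjacency matrix, applied to concatenated structural/semantic inputs."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, h_struct: torch.Tensor, h_sem: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = torch.cat([h_struct, h_sem], dim=1)                        # H_struct^{l-1} || H_sem^{l-1}
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)        # add self-connections
        deg = a_hat.sum(dim=1)                                         # diagonal entries of D_hat
        d_inv_sqrt = torch.diag(deg.clamp(min=1e-12).pow(-0.5))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt                       # D^{-1/2} (A + I) D^{-1/2}
        return self.act(a_norm @ self.linear(h))                       # sigma(... H W)
```

The same layer form, fed A_{\text{struct}} here, is reused by the semantic encoder in the next subsection with its own adjacency matrix; for the first layer, the input is simply the feature matrix X instead of concatenated embeddings.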

3.1.2 Semantic Node Encoder

The objective of the semantic node encoder is to learn node embeddings based on the semantic connectivity. The semantic node encoder, denoted as f_{\text{sem}}, comprises L message passing layers. In each layer \ell of the semantic node encoder, represented by f^{\ell}_{\text{sem}}, a semantic-based graph adjacency matrix A^{\ell}_{\text{sem}} is constructed based on the similarity between the embeddings of nodes from the previous layer of the semantic and structural node encoders, measured in terms of clustering assignments. For each layer \ell, the following fine-grained node clustering is performed:

S^{\ell}=g({H}_{\text{struct}}^{\ell-1}||{H}_{\text{sem}}^{\ell-1},K) (3)

which clusters all the graph nodes into K (K\gg C) clusters. Here, S^{\ell}\in\mathbb{R}^{N\times K} is the fine-grained clustering assignment matrix obtained from the clustering function g. The row S^{\ell}[i] indicates the cluster to which node i is assigned. The fine-grained clustering function g takes as input the concatenation of the structural and semantic node embeddings from layer \ell-1, along with the number of clusters K, and outputs the cluster assignments S^{\ell}. The clustering function g is realized by performing K-means clustering to minimize the following least squares clustering loss:

\mathcal{L}_{\text{clust}}=\sum_{i\in V}\sum_{k=1}^{K}S^{\ell}[i,k]\left\lVert({H}_{\text{struct}}^{\ell-1}[i]||{H}_{\text{sem}}^{\ell-1}[i])-\boldsymbol{\mu}_{k}\right\rVert^{2} (4)

where \boldsymbol{\mu}_{k} represents the mean vector for cluster k, and S^{\ell}[i,k] has a binary value (0 or 1) that indicates whether node i is assigned to cluster k. Based on the fine-grained clustering assignment matrix S^{\ell}, the construction of the semantic connectivity-based graph adjacency matrix A^{\ell}_{\text{sem}} is detailed as follows:

A^{\ell}_{\text{sem}}[i,j]=\begin{cases}1&\text{if }S^{\ell}[i]=S^{\ell}[j]\\ 0&\text{otherwise}.\end{cases} (5)

In the construction of A^{\ell}_{\text{sem}}, nodes assigned to the same cluster are connected, establishing edges between them, while nodes assigned to different clusters are not connected, resulting in an adjacency matrix that encapsulates the semantic connectivity encoded within the fine-grained clusters. This process enables message propagation among nodes that share semantic similarities in the graph, irrespective of their structural separation. This is instrumental in addressing the issue of under-reaching of minority nodes. Moreover, by constructing an individual semantic adjacency matrix for each layer, we can prevent adherence to fixed local semantic clusters and enhance both robustness and adaptivity. The semantic connectivity matrix of the first layer of the semantic encoder, A^{1}_{\text{sem}}, is constructed based on the input feature matrix X. Furthermore, to balance efficiency with stability during the training process, an update mechanism is introduced. Specifically, A^{\ell}_{\text{sem}} is periodically updated by re-clustering the nodes based on the updated node embeddings along the training process. The update is performed at intervals of \beta training iterations. This adaptive strategy ensures that the semantic connectivity information remains relevant and adapts to the evolving node embeddings during the training process.
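
A minimal sketch of Eqs. (3)-(5), using scikit-learn's K-means as the clustering function g, is given below; the helper name and the library choice are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_semantic_adjacency(h_struct: np.ndarray, h_sem: np.ndarray, k: int) -> np.ndarray:
    """Cluster the concatenated embeddings into k fine-grained clusters (Eqs. 3-4)
    and connect all node pairs that share a cluster (Eq. 5)."""
    features = np.concatenate([h_struct, h_sem], axis=1)       # H_struct^{l-1} || H_sem^{l-1}
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)
    # A_sem[i, j] = 1 iff nodes i and j fall in the same cluster.
    a_sem = (labels[:, None] == labels[None, :]).astype(float)
    # Implementation choice in this sketch: zero the diagonal, since self-connections
    # are re-added as A + I inside the message-passing layer.
    np.fill_diagonal(a_sem, 0.0)
    return a_sem
```

Re-invoking such a routine every \beta training iterations with the current embeddings realizes the periodic update of \{A^{\ell}_{\text{sem}}\} described above.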

Each layer \ell of the semantic node encoder, f^{\ell}_{\text{sem}}, takes the concatenation of node embeddings, H^{\ell-1}_{\text{struct}}\|H^{\ell-1}_{\text{sem}}, from the previous layer \ell-1 of both the structural encoder and the semantic encoder as input, aiming to gather richer information from both aspects, but propagates messages with the constructed semantic adjacency matrix A^{\ell}_{\text{sem}}. For the first layer of f_{\text{sem}}, the input node features are simply the input feature matrix X. We opt for the conventional Graph Convolution Network (GCN) (Kipf & Welling, 2017) as our message-passing layer again, which is employed in the following manner:

{H}_{\text{sem}}^{\ell}=f^{\ell}_{\text{sem}}({H}_{\text{struct}}^{\ell-1}||{H}_{\text{sem}}^{\ell-1},A^{\ell}_{\text{sem}}) (6)
=\sigma\left(\hat{D}_{\text{sem}}^{-\frac{1}{2}}\hat{A}^{\ell}_{\text{sem}}\hat{D}_{\text{sem}}^{-\frac{1}{2}}({H}_{\text{struct}}^{\ell-1}||{H}_{\text{sem}}^{\ell-1})W_{\text{sem}}^{\ell}\right)

Here, \sigma again represents the non-linear activation function; W_{\text{sem}}^{\ell} is the matrix of learnable parameters for f^{\ell}_{\text{sem}}; \hat{A}_{\text{sem}}^{\ell}=A_{\text{sem}}^{\ell}+I is the adjusted adjacency matrix with self-connections; and the diagonal node degree matrix \hat{D}_{\text{sem}} is computed as \hat{D}_{\text{sem}}[i,i]=\sum_{j}\hat{A}_{\text{sem}}^{\ell}[i,j].

3.1.3 Balanced Node Classifier

We define a balanced node classification function \phi, which classifies the nodes in the graph based on their structural and semantic embeddings learned by the structural encoder and semantic encoder, respectively. In particular, the balanced node classification function takes as input the outputs of the L-th layers of the structural and semantic node encoders, denoted as {H}_{\text{struct}}^{L} and {H}_{\text{sem}}^{L}, respectively:

P=\phi({H}_{\text{struct}}^{L}||{H}_{\text{sem}}^{L}) (7)

where P\in\mathbb{R}^{N\times C} is the predicted class probability matrix of all the nodes in the graph. Given the class imbalance in the set of labeled nodes V_{l}, the node classification function is trained to minimize the following weighted node classification loss on the labeled nodes:

\mathcal{L}=\sum\nolimits_{i\in V_{l}}\omega_{c_{i}}\ell_{\text{ce}}(P_{i},Y^{l}_{i}). (8)

Here, \ell_{\text{ce}} denotes the standard cross-entropy loss function. For a given node i, P_{i} represents its predicted class probability vector, and Y^{l}_{i} is the true label indicator vector if i is a labeled node. The weight \omega_{c_{i}} associated with each node i is introduced to balance the contribution of data from different classes in the supervised training loss. It gives different weights to nodes from different classes. In particular, the class balance weight \omega_{c_{i}} is calculated as follows:

\omega_{c_{i}}=\frac{1}{|V_{l}^{c_{i}}|} (9)

where c_{i} denotes the class index of node i, such that Y^{l}[i,c_{i}]=1; and |V_{l}^{c_{i}}| is the size of class c_{i} in the labeled nodes, i.e., the number of labeled nodes V_{l}^{c_{i}} from class c_{i}. Since \omega_{c_{i}} is inversely proportional to the corresponding class size, it enforces that larger weights are assigned to nodes from minority classes with fewer labeled instances in the supervised node classification loss, while smaller weights are assigned to nodes from majority classes with abundant labeled nodes. Specifically, through the incorporation of this class weighting mechanism, each class contributes equally to the supervised loss function, irrespective of the quantity of labeled nodes associated with it within the training set, thereby promoting balanced learning across different classes.
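
In implementation terms, Eqs. (8)-(9) amount to a cross-entropy loss with per-class inverse-frequency weights. The following PyTorch sketch (with assumed tensor names, operating on logits rather than probabilities) illustrates the idea.

```python
import torch
import torch.nn.functional as F

def balanced_supervised_loss(logits: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Eqs. (8)-(9): cross-entropy over labeled nodes, each weighted by 1 / |V_l^{c_i}|.
    `logits` has shape [N_l, C]; `labels` holds the class index of each labeled node."""
    class_counts = torch.bincount(labels, minlength=num_classes).float()
    weights = 1.0 / class_counts.clamp(min=1.0)            # w_c = 1 / |V_l^c| (guard against empty classes)
    per_node = F.cross_entropy(logits, labels, reduction="none")
    return (weights[labels] * per_node).sum()
```

Equivalently, the class weights could be supplied through the weight argument of F.cross_entropy with reduction="sum".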

Table 1: The overall performance (standard deviation within brackets) on the Cora, CiteSeer and PubMed datasets with 2 different numbers of minority classes (3 and 5 on Cora and CiteSeer, 1 and 2 on PubMed) and 2 imbalance ratios (10% and 5%). OOM indicates out-of-memory.
# Min. Class 3 5
\rho 10% 5% 10% 5%
bAcc. Macro-F1 G-Means bAcc. Macro-F1 G-Means bAcc. Macro-F1 G-Means bAcc. Macro-F1 G-Means
Cora GCN 68.8_{(4.0)} 68.8_{(4.0)} 80.8_{(2.6)} 60.0_{(0.4)} 56.6_{(0.7)} 74.8_{(0.3)} 64.9_{(6.2)} 64.7_{(5.7)} 78.1_{(4.2)} 55.1_{(2.6)} 51.4_{(2.4)} 71.4_{(1.8)}
Over-sampling 65.6_{(4.3)} 63.4_{(5.6)} 78.6_{(2.9)} 59.0_{(2.2)} 53.9_{(2.6)} 74.2_{(1.5)} 58.9_{(5.6)} 56.9_{(7.6)} 74.0_{(3.9)} 49.1_{(4.1)} 45.6_{(5.4)} 67.0_{(3.1)}
Re-weight 70.6_{(4.3)} 69.9_{(5.1)} 81.9_{(2.8)} 60.8_{(1.8)} 56.7_{(2.4)} 75.4_{(1.3)} 65.2_{(7.6)} 65.0_{(8.2)} 78.3_{(5.2)} 57.9_{(4.3)} 54.8_{(5.5)} 73.3_{(3.0)}
SMOTE 65.1_{(4.0)} 62.3_{(5.1)} 78.3_{(2.7)} 59.0_{(2.2)} 53.9_{(2.6)} 74.2_{(1.5)} 60.3_{(7.6)} 58.7_{(8.9)} 74.9_{(5.3)} 49.1_{(4.1)} 45.6_{(5.4)} 67.0_{(3.1)}
Embed-SMOTE 61.0_{(3.6)} 58.0_{(5.5)} 75.5_{(2.5)} 55.5_{(2.1)} 50.0_{(3.1)} 71.7_{(1.5)} 53.2_{(5.2)} 50.9_{(6.4)} 70.0_{(3.8)} 40.7_{(2.8)} 36.5_{(3.0)} 60.5_{(2.2)}
GraphSMOTE 70.0_{(3.4)} 68.6_{(4.9)} 81.6_{(2.2)} 62.5_{(1.8)} 58.7_{(2.1)} 76.5_{(1.2)} 66.3_{(6.6)} 65.3_{(7.7)} 79.0_{(4.5)} 55.8_{(5.6)} 52.4_{(4.7)} 71.9_{(4.0)}
GraphENS 59.3_{(7.0)} 55.4_{(10.6)} 74.2_{(4.9)} 55.1_{(4.9)} 48.1_{(7.9)} 71.3_{(3.5)} 44.3_{(6.5)} 41.0_{(7.0)} 63.3_{(5.0)} 36.1_{(10.1)} 31.1_{(12.3)} 56.3_{(8.4)}
LTE4G 73.2_{(5.4)} 72.1_{(6.1)} 83.6_{(3.5)} 70.9_{(2.5)} 69.6_{(2.8)} 82.1_{(1.6)} 75.4_{(5.6)} 75.4_{(5.4)} 85.0_{(3.6)} 70.2_{(4.5)} \mathbf{68.8}_{(4.7)} 81.7_{(3.0)}
Uni-GNN \mathbf{76.5}_{(0.5)} \mathbf{76.4}_{(0.7)} \mathbf{85.8}_{(0.3)} \mathbf{71.5}_{(1.2)} \mathbf{70.7}_{(1.5)} \mathbf{82.5}_{(0.8)} \mathbf{75.4}_{(3.7)} \mathbf{75.4}_{(3.7)} \mathbf{85.0}_{(2.4)} \mathbf{70.5}_{(3.7)} 68.7_{(2.6)} \mathbf{81.8}_{(2.5)}
CiteSeer GCN 49.5_{(2.1)} 43.1_{(2.3)} 66.7_{(1.5)} 48.2_{(0.9)} 39.3_{(0.4)} 65.7_{(0.7)} 48.9_{(1.4)} 45.3_{(1.3)} 66.2_{(1.1)} 42.4_{(6.5)} 39.1_{(7.3)} 61.1_{(5.1)}
Over-sampling 51.5_{(3.0)} 43.7_{(2.1)} 68.2_{(2.2)} 47.8_{(0.8)} 38.9_{(1.9)} 65.4_{(0.6)} 43.0_{(3.4)} 40.3_{(1.7)} 61.7_{(2.7)} 40.1_{(2.0)} 34.2_{(1.5)} 59.4_{(1.6)}
Re-weight 52.1_{(2.7)} 46.2_{(3.2)} 68.6_{(2.0)} 48.0_{(0.4)} 39.2_{(1.1)} 65.6_{(0.3)} 48.4_{(3.9)} 44.5_{(3.9)} 65.8_{(2.9)} 41.3_{(4.5)} 35.6_{(5.3)} 60.3_{(3.6)}
SMOTE 48.7_{(2.5)} 40.1_{(1.8)} 66.1_{(1.9)} 47.8_{(0.8)} 38.9_{(1.9)} 65.4_{(0.6)} 44.9_{(4.4)} 41.9_{(4.1)} 63.2_{(3.4)} 40.1_{(2.0)} 34.2_{(1.5)} 59.4_{(1.6)}
Embed-SMOTE 47.5_{(2.1)} 37.9_{(1.7)} 65.2_{(1.6)} 46.7_{(3.0)} 35.7_{(2.8)} 64.5_{(2.3)} 43.2_{(6.5)} 38.3_{(5.8)} 61.8_{(5.2)} 33.2_{(6.6)} 28.3_{(7.9)} 53.4_{(5.9)}
GraphSMOTE 51.2_{(3.7)} 43.4_{(4.2)} 67.9_{(2.8)} 49.3_{(2.0)} 40.1_{(1.3)} 66.5_{(1.5)} 50.3_{(5.0)} 46.1_{(4.5)} 67.2_{(3.7)} 46.5_{(3.7)} 41.5_{(4.1)} 64.4_{(2.9)}
GraphENS 44.2_{(3.5)} 35.9_{(1.0)} 62.7_{(2.7)} 43.5_{(2.6)} 33.4_{(1.9)} 62.1_{(2.1)} 33.0_{(3.2)} 28.6_{(4.4)} 53.4_{(2.9)} 28.5_{(6.7)} 23.1_{(6.2)} 49.1_{(6.2)}
LTE4G 54.2_{(4.5)} 51.8_{(4.1)} 70.2_{(3.3)} 52.7_{(2.1)} 48.3_{(3.7)} 69.1_{(1.5)} 52.1_{(3.7)} 47.2_{(3.6)} 68.6_{(2.7)} 47.3_{(1.1)} 41.2_{(2.1)} 65.0_{(0.9)}
Uni-GNN \mathbf{59.1}_{(3.6)} \mathbf{54.6}_{(3.3)} \mathbf{73.6}_{(2.5)} \mathbf{54.1}_{(3.1)} \mathbf{48.5}_{(4.1)} \mathbf{70.1}_{(2.3)} \mathbf{58.3}_{(2.5)} \mathbf{55.0}_{(1.3)} \mathbf{73.1}_{(1.8)} \mathbf{54.0}_{(2.2)} \mathbf{51.4}_{(2.2)} \mathbf{70.0}_{(1.6)}
# Min. Class 1 2
\rho 10% 5% 10% 5%
PubMed GCN 60.4_{(6.5)} 55.9_{(9.5)} 69.6_{(5.2)} 58.6_{(3.0)} 51.9_{(6.2)} 68.1_{(2.4)} 49.1_{(10.9)} 44.0_{(14.5)} 60.3_{(8.9)} 41.0_{(5.6)} 32.2_{(8.4)} 53.7_{(4.7)}
Over-sampling 57.6_{(0.5)} 51.3_{(4.0)} 67.4_{(0.4)} 55.2_{(2.8)} 46.8_{(2.2)} 65.4_{(2.2)} 41.6_{(5.7)} 33.5_{(9.4)} 54.2_{(4.8)} 36.6_{(2.1)} 23.8_{(4.8)} 50.0_{(1.8)}
Re-weight 62.2_{(5.7)} 57.6_{(9.2)} 71.0_{(4.5)} 59.4_{(3.6)} 53.3_{(7.8)} 68.8_{(2.8)} 54.7_{(11.2)} 53.8_{(11.7)} 64.9_{(9.1)} 47.5_{(11.4)} 42.6_{(13.4)} 59.0_{(9.5)}
SMOTE 55.8_{(3.0)} 48.2_{(3.3)} 65.9_{(2.4)} 55.2_{(2.8)} 46.8_{(2.2)} 65.4_{(2.2)} 41.8_{(4.0)} 32.6_{(6.6)} 54.4_{(3.3)} 36.6_{(2.1)} 23.8_{(4.8)} 50.0_{(1.8)}
Embed-SMOTE 53.4_{(1.8)} 44.9_{(2.0)} 63.9_{(1.4)} 53.3_{(3.1)} 43.4_{(3.4)} 63.9_{(2.5)} 38.6_{(1.2)} 28.3_{(1.1)} 51.7_{(1.1)} 35.2_{(1.3)} 21.4_{(3.5)} 48.7_{(1.1)}
GraphSMOTE OOM OOM OOM OOM OOM OOM OOM OOM OOM OOM OOM OOM
LTE4G 62.6_{(3.0)} 59.2_{(6.7)} 71.4_{(2.4)} 60.0_{(5.4)} 55.3_{(8.2)} 69.3_{(4.3)} 52.1_{(7.0)} 50.2_{(8.3)} 62.9_{(5.7)} 48.5_{(9.9)} 44.3_{(12.2)} 48.5_{(9.9)}
Uni-GNN \mathbf{73.9}_{(2.3)} \mathbf{73.8}_{(2.5)} \mathbf{80.2}_{(1.8)} \mathbf{67.1}_{(4.2)} \mathbf{65.2}_{(5.3)} \mathbf{74.8}_{(3.3)} \mathbf{66.9}_{(4.3)} \mathbf{66.0}_{(4.1)} \mathbf{74.7}_{(3.3)} \mathbf{63.4}_{(6.4)} \mathbf{62.3}_{(8.6)} \mathbf{72.0}_{(5.0)}

3.2 Balanced Pseudo-Label Generation

To leverage the unlabeled nodes in the graph, a balanced pseudo-label generation mechanism is proposed. The objective is to increase the number of available labeled nodes in the graph while considering the class imbalance in the set of labeled nodes. The goal is to generate more pseudo-labels from minority classes and fewer pseudo-labels from majority classes, thus balancing the class label distribution of the training data. In particular, for each class c, the number of nodes to pseudo-label, denoted as \hat{N}_{c}, is set as the difference between the largest labeled class size and the size of class c, aiming to balance the class label distribution over the union of labeled nodes and pseudo-labeled nodes:

\hat{N}_{c}=\max_{i\in\{1,\ldots,C\}}(|V^{i}_{l}|)-|V^{c}_{l}| (10)

The set of unlabeled nodes that can be confidently pseudo-labeled to class c can be determined as:

\hat{V}_{u}^{c}=\{i\mid\max(P_{i})>\epsilon,\,\text{argmax}(P_{i})=c,\,i\in V_{u}\} (11)

where \epsilon is a hyper-parameter determining the confidence prediction threshold. Balanced sampling is then performed on each set \hat{V}_{u}^{c} by selecting the top \hat{N}_{c} nodes, denoted as \text{Top}_{\hat{N}_{c}}(\hat{V}_{u}^{c}), with the most confident pseudo-labels based on the predicted probability P_{i}[c]. This results in a total set of pseudo-labeled nodes, denoted as \hat{V}_{u}, from all classes:

\hat{V}_{u}=\{\text{Top}_{\hat{N}_{1}}(\hat{V}_{u}^{1}),\cdots,\text{Top}_{\hat{N}_{C}}(\hat{V}_{u}^{C})\}. (12)

The Unified GNN Learning framework is trained to minimize the following node classification loss over this set of pseudo-labeled nodes \hat{V}_{u}:

\mathcal{L}_{\text{ps}}=\frac{1}{|\hat{V}_{u}|}\sum\nolimits_{i\in\hat{V}_{u}}\ell_{\text{ce}}(P_{i},\boldsymbol{1}_{\text{argmax}(P_{i})}) (13)

where \ell_{\text{ce}} again is the standard cross-entropy loss function, P_{i} is the predicted class probability vector with classifier \phi, and \boldsymbol{1}_{\text{argmax}(P_{i})} is a one-hot pseudo-label vector with a single 1 at the predicted class entry \text{argmax}(P_{i}). This pseudo-labeling mechanism aims to augment the labeled node set, particularly focusing on addressing class imbalances by generating more pseudo-labels for minority classes.
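
The selection procedure of Eqs. (10)-(12) can be sketched as follows, where P is the N x C predicted probability matrix, unlabeled_idx indexes V_u, and labeled_counts holds |V_l^c| for each class; all of these names are illustrative placeholders.

```python
import torch

def balanced_pseudo_labels(P: torch.Tensor, unlabeled_idx: torch.Tensor,
                           labeled_counts: torch.Tensor, eps: float = 0.7):
    """Eqs. (10)-(12): for each class c, pick up to N_hat_c = max_i |V_l^i| - |V_l^c|
    of the most confident unlabeled nodes predicted as c with confidence above eps."""
    conf, pred = P[unlabeled_idx].max(dim=1)
    selected_nodes, selected_labels = [], []
    for c in range(P.size(1)):
        n_hat_c = int(labeled_counts.max().item() - labeled_counts[c].item())   # Eq. (10)
        candidates = unlabeled_idx[(pred == c) & (conf > eps)]                  # Eq. (11)
        if n_hat_c <= 0 or candidates.numel() == 0:
            continue
        cand_conf = P[candidates, c]
        top = candidates[cand_conf.argsort(descending=True)[:n_hat_c]]          # Eq. (12)
        selected_nodes.append(top)
        selected_labels.append(torch.full((top.numel(),), c, dtype=torch.long))
    if not selected_nodes:
        return torch.empty(0, dtype=torch.long), torch.empty(0, dtype=torch.long)
    return torch.cat(selected_nodes), torch.cat(selected_labels)
```

The loss in Eq. (13) is then an unweighted cross-entropy over the returned nodes, using their own predicted classes as targets.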

Training Loss

The Unified GNN Learning framework is trained on the labeled set V_{l} and the selected pseudo-labeled set \hat{V}_{u} in an end-to-end fashion to minimize the following integrated total loss:

\mathcal{L}_{\text{total}}=\mathcal{L}+\lambda\mathcal{L}_{\text{ps}} (14)

where \lambda is a hyper-parameter. The training procedure is provided in the Appendix.
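
Putting the pieces together, one optimization step for Eq. (14) can be sketched as below, reusing the loss and pseudo-labeling helpers sketched earlier; the remaining names (logits, index tensors, lam) are assumed, and detaching the predictions for pseudo-label selection is our own simplification rather than a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

# Assumes balanced_supervised_loss and balanced_pseudo_labels from the earlier sketches are in scope.
def training_step(logits, labeled_idx, y_labeled, unlabeled_idx, labeled_counts, num_classes, lam=1.0):
    loss_sup = balanced_supervised_loss(logits[labeled_idx], y_labeled, num_classes)       # Eq. (8)
    with torch.no_grad():                                  # pseudo-labels come from current predictions
        probs = logits.softmax(dim=1)
    ps_nodes, ps_labels = balanced_pseudo_labels(probs, unlabeled_idx, labeled_counts)      # Eqs. (10)-(12)
    if ps_nodes.numel() > 0:
        loss_ps = F.cross_entropy(logits[ps_nodes], ps_labels)                              # Eq. (13)
    else:
        loss_ps = logits.sum() * 0.0                       # no confident nodes selected this step
    return loss_sup + lam * loss_ps                        # Eq. (14): L_total = L + lambda * L_ps
```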

4 Experiments

4.1 Experimental Setup

Datasets & Baselines

We conduct experiments on three datasets (Cora, CiteSeer and PubMed) (Sen et al., 2008). To ensure a fair comparison, we adhere to the evaluation protocol used in previous studies (Zhao et al., 2021; Yun et al., 2022). The datasets undergo manual pre-processing to achieve the desired imbalance ratio (ρ\rho). Specifically, each majority class is allocated 20 labeled training nodes, while each minority class is assigned 20×ρ20\times\rho labeled training nodes. For validation and test sets, each class is assigned 25 and 55 nodes, respectively. Following the protocol of (Yun et al., 2022), we consider two imbalance ratios (ρ=10%,5%\rho=10\%,5\%), and two different numbers of minority classes: 3 and 5 on Cora and CiteSeer, and 1 and 2 on the PubMed dataset. Additionally, for Cora and CiteSeer, we also adopt a long-tail class label distribution setup. Here, low-degree nodes from minority classes are removed from the graph until the desired imbalance ratio (ρ\rho) of 1% and class label distribution are achieved, similar to (Park et al., 2021; Yun et al., 2022). We compare the proposed Uni-GNN with the underlying GCN baseline, as well as various traditional and graph-based class-imbalanced learning baselines: Over-sampling, Re-weight (Yuan & Ma, 2012), SMOTE (Chawla et al., 2002), Embed-SMOTE (Ando & Huang, 2017), GraphSMOTE (Zhao et al., 2021), GraphENS (Park et al., 2021), and LTE4G (Yun et al., 2022).
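
For reference, the labeled/validation/test split described above can be generated along the following lines; this is an illustrative reconstruction of the protocol rather than the authors' preprocessing script, and the function and argument names are assumptions.

```python
import numpy as np

def make_imbalanced_split(y: np.ndarray, minority_classes, rho: float = 0.1,
                          n_major: int = 20, n_val: int = 25, n_test: int = 55, seed: int = 0):
    """Illustrative split: n_major labeled nodes per majority class, n_major * rho per
    minority class, then n_val / n_test nodes per class for validation / test."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        n_train = max(1, int(round(n_major * rho))) if c in minority_classes else n_major
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:n_train + n_val + n_test])
    return np.array(train), np.array(val), np.array(test)
```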

Implementation Details

The Graph Convolution Network (GCN) (Kipf & Welling, 2017) implements the message passing layers in our proposed framework and all the comparison baselines. The semantic and structural encoders consist of 2 message passing layers each, followed by a ReLU activation function. The node classifier is composed of a single GCN layer, followed by ReLU activation, and then a single fully connected linear layer. Uni-GNN is trained using an Adam optimizer with a learning rate of 1e-2 and weight decay of 5e-4 over 10,000 epochs. We incorporate an early stopping criterion with a patience of 1,000 epochs and apply a dropout rate of 0.5 to all layers of our framework. The size of the learned hidden embeddings for all layers of the structural and semantic encoders is set to 64. The hyper-parameter \lambda is assigned the value 1. For the hyper-parameters K, \alpha, \epsilon, and \beta, we explore the following ranges: \{3C,4C,10C,20C,30C\}, \{1,2\}, \{0.5,0.7\}, and \{50,100\}, respectively. Each experiment is repeated three times, and the reported performance metrics represent the mean and standard deviation across the three runs. For the results of the comparison methods on Cora and CiteSeer, we adopt the outcomes reported by (Yun et al., 2022).

Table 2: The overall performance (standard deviation within brackets) on Cora-LT and CiteSeer-LT datasets with \rho=1\%.
Cora-LT CiteSeer-LT
bAcc. Macro-F1 G-Means bAcc. Macro-F1 G-Means
GCN 66.8_{(1.1)} 65.0_{(1.0)} 79.5_{(0.7)} 50.4_{(1.4)} 45.7_{(0.8)} 67.4_{(1.1)}
Over-sampling 66.6_{(0.8)} 64.5_{(0.6)} 79.3_{(0.5)} 51.7_{(1.2)} 46.7_{(0.8)} 68.4_{(0.9)}
Re-weight 68.0_{(0.7)} 66.7_{(1.4)} 80.2_{(0.5)} 53.0_{(2.4)} 48.8_{(2.3)} 69.3_{(1.7)}
SMOTE 66.8_{(0.4)} 66.4_{(0.6)} 79.4_{(0.2)} 51.2_{(1.9)} 46.6_{(1.8)} 68.0_{(1.4)}
Embed-SMOTE 65.2_{(0.6)} 63.1_{(0.3)} 78.4_{(0.4)} 52.6_{(1.6)} 48.0_{(1.5)} 69.0_{(1.2)}
GraphSMOTE 67.7_{(1.2)} 66.3_{(1.1)} 80.0_{(0.8)} 51.5_{(0.7)} 46.8_{(0.9)} 68.2_{(0.5)}
GraphENS 67.4_{(1.5)} 64.5_{(1.5)} 79.8_{(1.0)} 52.4_{(1.7)} 47.2_{(1.0)} 68.9_{(1.3)}
LTE4G 72.2_{(3.1)} 72.0_{(2.9)} 83.0_{(2.0)} 56.4_{(2.1)} 52.5_{(2.2)} 71.7_{(1.5)}
Uni-GNN \mathbf{75.2}_{(1.3)} \mathbf{74.8}_{(1.3)} \mathbf{84.9}_{(0.8)} \mathbf{63.3}_{(1.7)} \mathbf{61.6}_{(2.5)} \mathbf{76.6}_{(1.2)}

4.2 Comparison Results

Table 3: Ablation study results (standard deviation within brackets) on the Cora and CiteSeer datasets, under two setups: 3 minority classes with a 10% imbalance ratio, and a long-tail class label distribution with a 1% imbalance ratio.
(# Min. Class, \rho) Cora (3, 10%) CiteSeer (3, 10%) Cora (LT, 1%) CiteSeer (LT, 1%)
bAcc. Macro-F1 G-Means bAcc. Macro-F1 G-Means bAcc. Macro-F1 G-Means bAcc. Macro-F1 G-Means
GCN 68.8_{(4.0)} 67.6_{(5.0)} 80.8_{(2.6)} 49.5_{(2.1)} 43.1_{(2.3)} 66.7_{(1.5)} 66.8_{(1.1)} 65.0_{(1.0)} 79.5_{(0.7)} 50.4_{(1.4)} 45.7_{(0.8)} 67.4_{(1.1)}
Uni-GNN \mathbf{76.5}_{(0.5)} \mathbf{76.4}_{(0.7)} \mathbf{85.8}_{(0.3)} \mathbf{59.1}_{(3.6)} \mathbf{54.6}_{(3.3)} \mathbf{73.6}_{(2.5)} \mathbf{75.2}_{(1.3)} \mathbf{74.8}_{(1.3)} \mathbf{84.9}_{(0.8)} \mathbf{63.3}_{(1.7)} \mathbf{61.6}_{(2.5)} \mathbf{76.6}_{(1.2)}
- Ind. Enc. 74.3_{(3.9)} 74.0_{(4.2)} 84.3_{(2.5)} 54.2_{(5.0)} 51.7_{(5.4)} 70.1_{(3.6)} 74.9_{(1.0)} 74.3_{(0.8)} 84.7_{(0.6)} 60.4_{(0.9)} 58.2_{(1.5)} 74.6_{(0.7)}
- Drop \mathcal{L}_{\text{ps}} 72.6_{(1.5)} 72.3_{(1.6)} 83.3_{(1.0)} 57.3_{(4.6)} 53.8_{(4.4)} 72.3_{(3.2)} 74.1_{(1.2)} 73.6_{(1.2)} 84.2_{(0.8)} 57.7_{(3.9)} 53.8_{(4.0)} 72.6_{(2.8)}
- A_{\text{struct}}=A 74.9_{(1.5)} 75.2_{(1.2)} 84.7_{(1.0)} 56.0_{(2.5)} 52.3_{(2.8)} 71.4_{(1.8)} 75.2_{(1.3)} 74.8_{(1.3)} 84.9_{(0.8)} 63.3_{(1.7)} 61.6_{(2.5)} 76.6_{(1.2)}
- Semantic Enc. 61.9_{(0.7)} 61.9_{(0.7)} 76.1_{(0.5)} 46.6_{(1.4)} 42.6_{(1.4)} 64.5_{(1.0)} 57.6_{(3.2)} 56.3_{(3.6)} 73.1_{(2.2)} 51.3_{(2.9)} 49.2_{(3.2)} 68.0_{(2.1)}
- Structural Enc. 73.6_{(3.3)} 73.0_{(3.5)} 83.9_{(2.1)} 53.3_{(2.6)} 49.2_{(3.5)} 69.5_{(1.9)} 73.2_{(0.9)} 72.3_{(0.7)} 83.7_{(0.6)} 61.2_{(1.9)} 59.0_{(1.9)} 75.1_{(1.3)}
- Imbalanced PL 72.6_{(1.8)} 72.1_{(1.9)} 83.3_{(1.2)} 46.9_{(2.4)} 39.3_{(3.0)} 64.7_{(1.8)} 74.5_{(0.1)} 74.0_{(0.2)} 84.4_{(0.1)} 54.4_{(2.4)} 50.0_{(2.6)} 70.3_{(1.7)}
- Fixed \{A^{\ell}_{\text{sem}}\}_{\ell=2}^{L} 76.0_{(2.5)} 75.9_{(2.6)} 85.4_{(1.6)} 55.8_{(2.9)} 52.8_{(2.0)} 71.3_{(2.0)} 74.4_{(0.2)} 73.5_{(0.4)} 84.4_{(0.2)} 61.9_{(1.5)} 60.4_{(0.9)} 75.6_{(1.0)}
- \omega_{c_{i}}=1, \forall i\in V_{l} 71.1_{(3.1)} 70.7_{(3.4)} 82.2_{(2.0)} 51.7_{(4.2)} 48.6_{(4.7)} 68.3_{(3.0)} 57.0_{(1.9)} 48.7_{(1.3)} 72.7_{(1.4)} 57.9_{(1.9)} 52.8_{(2.1)} 72.8_{(1.3)}
Figure 2: Sensitivity analysis for the proposed framework on hyper-parameters: (a) \alpha, the max SPD distance in A_{\text{struct}}; (b) K, the number of clusters; (c) \epsilon, the pseudo-label confidence threshold; (d) \beta, the rate of updating \{A^{\ell}_{\text{sem}}\}_{\ell=2}^{L}.

We evaluate the performance of the Uni-GNN framework on the semi-supervised node classification task under class imbalance. Across the three datasets, we explore four distinct evaluation setups by manipulating the number of minority classes and the imbalance ratio (\rho). Additionally, we explore the long-tail class label distribution setting for Cora and CiteSeer with an imbalance ratio of \rho=1\%. We assess the performance of Uni-GNN using balanced Accuracy (bAcc), Macro-F1, and Geometric Means (G-Means), reporting the mean and standard deviation of each metric over 3 runs. Table 1 summarizes the results for different numbers of minority classes and imbalance ratios on all three datasets, while Table 2 showcases the results under long-tail class label distribution on Cora and CiteSeer.

Table 1 illustrates that the performance of all methods diminishes with decreasing imbalance ratio (\rho) and increasing numbers of minority classes. Our proposed framework consistently outperforms the underlying GCN baseline and all other methods across all three datasets and various evaluation setups. The performance gains over the GCN baseline are substantial, exceeding 10% in most cases for Cora and CiteSeer datasets and 13% for most instances of the PubMed dataset. Moreover, Uni-GNN consistently demonstrates superior performance compared to all other comparison methods, achieving notable improvements over the second-best method (LTE4G) by around 3%, 5%, and 11% on Cora, CiteSeer with 3 minority classes and \rho=10\%, and PubMed with 1 minority class and \rho=10\%, respectively. Similarly, Table 2 highlights that Uni-GNN consistently enhances the performance of the underlying GCN baseline, achieving performance gains exceeding 8% and 12% on Cora-LT and CiteSeer-LT datasets, respectively. Furthermore, Uni-GNN demonstrates remarkable performance gains over all other class imbalance methods, surpassing 3% and 6% on Cora-LT and CiteSeer-LT, respectively. These results underscore the superior performance of our framework over existing state-of-the-art class-imbalanced GNN methods across numerous challenging class imbalance scenarios.

4.3 Ablation Study

We conducted an ablation study to discern the individual contributions of each component in our proposed framework. In addition to the underlying GCN baseline, eight variants were considered: (1) Independent Node Encoders (Ind. Enc.): each node encoder exclusively propagates its own node embeddings, instead of propagating the concatenated semantic and structural embeddings. Specifically, f^{\ell}_{\text{sem}} solely propagates {H}_{\text{sem}}^{\ell-1} while f^{\ell}_{\text{struct}} solely propagates {H}_{\text{struct}}^{\ell-1}. (2) Drop \mathcal{L}_{\text{ps}}: excluding the balanced pseudo-label generation. (3) A_{\text{struct}}=A: structural connectivity considers only immediate neighbors of nodes (\alpha=1). (4) Semantic Encoder only (Semantic Enc.): it discards the structural encoder. (5) Structural Encoder only (Structural Enc.): it discards the semantic encoder. (6) Imbalanced Pseudo-Labeling: it generates pseudo-labels for all unlabeled nodes with confident predictions without considering class imbalance. (7) Fixed \{A^{\ell}_{\text{sem}}\}_{\ell=2}^{L}: it does not update the semantic connectivity during training. (8) \omega_{c_{i}}=1, \forall i\in V_{l}: it assigns equal weights to all labeled nodes in the training set. Evaluation was performed on Cora and CiteSeer datasets, each with 3 minority classes and an imbalance ratio (\rho) of 10%, and with a long-tail class label distribution and an imbalance ratio (\rho) of 1%. The results are reported in Table 3.

Table 3 illustrates performance degradation in all variants compared to the full proposed framework. The observed performance decline in the Independent Node Encoders (Ind. Enc.) variant underscores the importance of simultaneously propagating semantic and structural embeddings across both the semantic and structural connectivity. This emphasizes the need for incorporating both aspects to effectively learn more discriminative node embeddings. The performance drop observed in the Semantic Enc. and Structural Enc. variants underscores the significance and individual contribution of each connectivity type to the proposed Uni-GNN framework. This highlights the critical role that each type of connectivity plays in the overall performance of the proposed framework. The A_{\text{struct}}=A variant's performance drop emphasizes the importance of connecting nodes beyond their immediate structural neighbors, enabling the propagation of messages across a larger portion of the graph and learning more discriminative embeddings. The performance degradation in the Drop \mathcal{L}_{\text{ps}} and Imbalanced Pseudo-Labeling variants validates the substantial contribution of our balanced pseudo-label generation mechanism. The \omega_{c_{i}}=1, \forall i\in V_{l} variant underscores the importance of assigning weights to each node in the labeled training set based on their class frequency. The consistent performance drops across both datasets for all variants affirm the essential contribution of each corresponding component in the proposed framework.

4.4 Hyper-Parameter Sensitivity

To investigate the impact of the hyper-parameters in the Uni-GNN framework, we present the results of several sensitivity analyses on the Cora dataset with 3 minority classes and an imbalance ratio of 10% in Figure 2. Figures 2(a), 2(b), 2(c), and 2(d) depict the accuracy of Uni-GNN as we independently vary the max SPD distance in A_{\text{struct}} (\alpha), the number of clusters (K), the pseudo-label confidence threshold (\epsilon) and the rate of updating \{A^{\ell}_{\text{sem}}\}_{\ell=2}^{L} (\beta), respectively. Larger values for \alpha result in performance degradation due to over-smoothing as the graph becomes more densely connected. Optimal performance is achieved with \alpha=2. Uni-GNN exhibits robustness to variations in the hyper-parameters K, \epsilon, and \beta within a broad range. It consistently outperforms state-of-the-art methods across diverse settings of these hyper-parameters, as depicted by the corresponding results presented in Table 1. Small K values lead to noisy clusters with mixed-class nodes, while large K values result in over-segmented clusters with sparse semantic connectivity. Optimal performance is achieved when K falls within the range of 5C to 15C. Inadequately small values for \epsilon result in the utilization of noisy pseudo-label predictions in the training process, while excessively large values exclude reasonably confident pseudo-labeled nodes from selection. The optimal range for \epsilon lies between 0.4 and 0.7. Extremely small values for \beta lead to frequent updates of the semantic connectivity, preventing Uni-GNN from learning stable discriminative embeddings. Values around 100 training epochs yield the best results.

5 Conclusion

In this paper, we introduced a novel Uni-GNN framework for class-imbalanced node classification. The proposed framework harnesses the combined strength of structural and semantic connectivity through dedicated structural and semantic node encoders, enabling the learning of a unified node representation. Through these encoders, the combined structural and semantic connectivity ensures effective propagation of messages well beyond the immediate structural neighbors of nodes, thereby addressing the under-reaching and neighborhood memorization problems. Moreover, we proposed a balanced pseudo-label generation mechanism to incorporate confident pseudo-label predictions from minority unlabeled nodes into the training set. Our experimental evaluations on three benchmark datasets for node classification affirm the efficacy of our proposed framework. The results demonstrate that Uni-GNN adeptly mitigates class imbalance bias, surpassing existing state-of-the-art methods in class-imbalanced graph learning.

References

  • Ahn et al. (2023) Ahn, S., Ko, J., and Yun, S.-Y. Cuda: Curriculum of data augmentation for long-tailed recognition. In International Conference on Learning Representations (ICLR), 2023.
  • Ando & Huang (2017) Ando, S. and Huang, C. Y. Deep over-sampling framework for classifying imbalanced data. In European Conf. on Machine Learn. and Principles and Practice of Knowledge Dis. in Databases (ECML PKDD), 2017.
  • Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002.
  • Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Hao et al. (2020) Hao, Z., Lu, C., Huang, Z., Wang, H., Hu, Z., Liu, Q., Chen, E., and Lee, C. Asgn: An active semi-supervised graph neural network for molecular property prediction. In SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2020.
  • Kim et al. (2020) Kim, J., Jeong, J., and Shin, J. M2m: Imbalanced classification via major-to-minor translation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Inter. Conference on Learning Representations (ICLR), 2017.
  • Li et al. (2023) Li, W.-Z., Wang, C.-D., Xiong, H., and Lai, J.-H. Graphsha: Synthesizing harder samples for class-imbalanced node classification. In SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2023.
  • Mahajan et al. (2018) Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and Van Der Maaten, L. Exploring the limits of weakly supervised pretraining. In European conference on Computer Vision (ECCV), 2018.
  • Menon et al. (2021) Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit, A., and Kumar, S. Long-tail learning via logit adjustment. In International Conference on Learning Representations (ICLR), 2021.
  • Park et al. (2021) Park, J., Song, J., and Yang, E. Graphens: Neighbor-aware ego network synthesis for class-imbalanced node classification. In International Conference on Learning Representations (ICLR), 2021.
  • Qu et al. (2021) Qu, L., Zhu, H., Zheng, R., Shi, Y., and Yin, H. Imgagn: Imbalanced network embedding via generative adversarial graph networks. In SIGKDD Conf. on Knowledge Discovery & Data Mining (KDD), 2021.
  • Ren et al. (2020) Ren, J., Yu, C., Ma, X., Zhao, H., Yi, S., et al. Balanced meta-softmax for long-tailed visual recognition. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 2008.
  • Song et al. (2022) Song, J., Park, J., and Yang, E. Tam: topology-aware margin loss for class-imbalanced node classification. In International Conference on Machine Learning (ICML), 2022.
  • Sun et al. (2022) Sun, Q., Li, J., Yuan, H., Fu, X., Peng, H., Ji, C., Li, Q., and Yu, P. S. Position-aware structure learning for graph topology-imbalance by relieving under-reaching and over-squashing. In International Conference on Information & Knowledge Management (CIKM), 2022.
  • Tang et al. (2020) Tang, X., Yao, H., Sun, Y., Wang, Y., Tang, J., Aggarwal, C., Mitra, P., and Wang, S. Investigating and mitigating degree-related biases in graph convolutional networks. In International Conference on Information & Knowledge Management (CIKM), 2020.
  • Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
  • Wallace et al. (2011) Wallace, B. C., Small, K., Brodley, C. E., and Trikalinos, T. A. Class imbalance, redux. In International Conference on Data Mining (ICDM), 2011.
  • Wang et al. (2022) Wang, K., An, J., Zhou, M., Shi, Z., Shi, X., and Kang, Q. Minority-weighted graph neural network for imbalanced node classification in social networks of internet of people. IEEE Internet of Things Journal, 2022.
  • Wang et al. (2023) Wang, S., Zeng, Z., Yang, X., and Zhang, X. Self-supervised graph learning for long-tailed cognitive diagnosis. In AAAI Conf. on Artificial Intelligence, 2023.
  • Wei et al. (2021) Wei, C., Sohn, K., Mellina, C., Yuille, A., and Yang, F. Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Yan et al. (2023) Yan, L., Zhang, S., Li, B., Zhou, M., and Huang, Z. Unreal: Unlabeled nodes retrieval and labeling for heavily-imbalanced node classification. arXiv preprint arXiv:2303.10371, 2023.
  • Yuan & Ma (2012) Yuan, B. and Ma, X. Sampling + reweighting: Boosting the performance of adaboost on imbalanced datasets. In International Joint Conference on Neural Networks (IJCNN), 2012.
  • Yun et al. (2022) Yun, S., Kim, K., Yoon, K., and Park, C. Lte4g: long-tail experts for graph neural networks. In International Conference on Information & Knowledge Management (CIKM), 2022.
  • Zhang et al. (2021) Zhang, G., Wu, J., Yang, J., Beheshti, A., Xue, S., Zhou, C., and Sheng, Q. Z. Fraudre: Fraud detection dual-resistant to graph inconsistency and imbalance. In International Conference on Data Mining (ICDM), 2021.
  • Zhao et al. (2021) Zhao, T., Zhang, X., and Wang, S. Graphsmote: Imbalanced node classification on graphs with graph neural networks. In International conference on web search and data mining (WSDM), 2021.
  • Zhong et al. (2021) Zhong, Z., Cui, J., Liu, S., and Jia, J. Improving calibration for long-tailed recognition. In Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.

Appendix A Training Procedure

The details of the training procedure of the Unified GNN Learning framework are presented in Algorithm 1.

Algorithm 1 Training Procedure for Uni-GNN Framework
1:  Input: Graph G with input feature matrix X, adjacency matrix A, label indicator matrix Y^{l}; hyper-parameters \alpha, \beta, \lambda, \epsilon, K
2:  Output: Learned model parameters \mathcal{W}_{\text{sem}}, \mathcal{W}_{\text{struct}}, \mathcal{W}_{\phi}
3:  Construct A_{\text{struct}} using Eq. (1)
4:  Construct \{A^{\ell}_{\text{sem}}\}_{\ell=1}^{L} using Eq. (3), Eq. (4), Eq. (5)
5:  for iter = 1 to maxiters do
6:     {H}_{\text{sem}}^{1}=f^{1}_{\text{sem}}(X,A^{1}_{\text{sem}})
7:     {H}_{\text{struct}}^{1}=f^{1}_{\text{struct}}(X,A_{\text{struct}})
8:     for \ell = 2 to L do
9:        {H}_{\text{sem}}^{\ell}=f^{\ell}_{\text{sem}}({H}_{\text{struct}}^{\ell-1}||{H}_{\text{sem}}^{\ell-1},A^{\ell}_{\text{sem}})
10:        {H}_{\text{struct}}^{\ell}=f^{\ell}_{\text{struct}}({H}_{\text{struct}}^{\ell-1}||{H}_{\text{sem}}^{\ell-1},A_{\text{struct}})
11:     end for
12:     P=\phi({H}_{\text{struct}}^{L}||{H}_{\text{sem}}^{L})
13:     Update \hat{V}_{u} using Eq. (10), Eq. (11), Eq. (12)
14:     Calculate \mathcal{L}_{\text{total}} using Eq. (8), Eq. (13), Eq. (14)
15:     Update \mathcal{W}_{\text{sem}}, \mathcal{W}_{\text{struct}}, \mathcal{W}_{\phi} to minimize \mathcal{L}_{\text{total}} with gradient descent
16:     if iter \mod \beta = 0 then
17:        Update \{A^{\ell}_{\text{sem}}\}_{\ell=2}^{L} using Eq. (3), Eq. (4), Eq. (5)
18:     end if
19:  end for

Appendix B Datasets

The specifics of the benchmark datasets are detailed in Table 4. Detailed class label distribution information for the training sets in all evaluation setups on all datasets is provided in Table 5.

Table 4: Details of the benchmark datasets.
Dataset # Nodes # Edges # Features # Classes
Cora 2,708 5,278 1,433 7
CiteSeer 3,327 4,552 3,703 6
PubMed 19,717 44,324 500 3
Table 5: Details of the class label distribution of the training set of all evaluation setups on all datasets.
Dataset # Min. Class \rho |V^{1}_{l}| |V^{2}_{l}| |V^{3}_{l}| |V^{4}_{l}| |V^{5}_{l}| |V^{6}_{l}| |V^{7}_{l}|
Cora 3 10% 20 20 20 20 2 2 2
5% 20 20 20 20 1 1 1
5 10% 20 20 2 2 2 2 2
5% 20 20 1 1 1 1 1
LT 1% 341 158 73 34 15 7 3
CiteSeer 3 10% 20 20 20 2 2 2 -
5% 20 20 20 1 1 1 -
5 10% 20 2 2 2 2 2 -
5% 20 1 1 1 1 1 -
LT 1% 371 147 58 32 9 3 -
PubMed 1 10% 20 20 2 - - - -
5% 20 20 1 - - - -
2 10% 20 2 2 - - - -
5% 20 1 1 - - - -