UniGAP: A Universal and Adaptive Graph Upsampling Approach to Mitigate Over-Smoothing in Node Classification Tasks

Xiaotang Wang Huazhong University of Science and TechnologyWuhanChina [email protected] , Yun Zhu Zhejiang UniversityHangzhouChina zhuyun˙[email protected] , Haizhou Shi Rutgers UniversityNew BrunswickNew JerseyUS [email protected] , Yongchao Liu Ant GroupHangzhouChina [email protected] and Chuntao Hong Ant GroupHangzhouChina [email protected]

(2018)

Abstract.

In the graph domain, deep graph networks based on Message Passing Neural Networks (MPNNs) or Graph Transformers often cause over-smoothing of node features, limiting their expressive capacity. Many upsampling techniques involving node and edge manipulation have been proposed to mitigate this issue. However, these methods often require extensive manual labor, resulting in suboptimal performance and lacking a universal integration strategy. In this study, we introduce UniGAP, a universal and adaptive graph upsampling technique for graph data. It provides a universal framework for graph upsampling, encompassing most current methods as variants. Moreover, UniGAP serves as a plug-in component that can be seamlessly and adaptively integrated with existing GNNs to enhance performance and mitigate the over-smoothing problem. Through extensive experiments, UniGAP demonstrates significant improvements over heuristic data augmentation methods across various datasets and metrics. We analyze how graph structure evolves with UniGAP, identifying key bottlenecks where over-smoothing occurs, and providing insights into how UniGAP addresses this issue. Lastly, we show the potential of combining UniGAP with large language models (LLMs) to further improve downstream performance. Our code is available at: https://github.com/wangxiaotang0906/UniGAP

^†^†copyright: acmlicensed^†^†journalyear: 2018^†^†doi: XXXXXXX.XXXXXXX^†^†conference: Make sure to enter the correct conference title from your rights confirmation emai; June 03–05, 2018; Woodstock, NY^†^†isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Graph data is ubiquitous in the real world applications like recommender systems (van den Berg et al., 2017; Ying et al., 2018), traffic analysis (Zheng et al., 2020; Guo et al., 2019), and drug discovery (Gaudelet et al., 2021). Graph neural networks (GNNs) (Defferrard et al., 2016; Velickovic et al., 2018; Hamilton et al., 2017) are powerful tools for handling graph-structured data, demonstrating significant success in recent years. In contrast to computer vision and natural language processing, where deep models often lead to significant improvements (He et al., 2016), in the graph domain, increasing the number of model layers can cause node features to converge exponentially towards the same vector, a phenomenon known as “over-smoothing” (Rusch et al., 2023; Keriven, 2022; Chen et al., 2020a). This convergence diminishes the model’s expressive power and adversely affects its performance.

Previous researches have explored a variety of strategies to mitigate the over-smoothing problem, including manipulating graph topology (Chen et al., 2020a; Rong et al., 2020; You et al., 2020), refining model structure (Kipf and Welling, 2017; Xu et al., 2018), and employing dynamically learning (Zhang et al., 2023; Zou et al., 2023). Recently, graph upsampling, a lightweight graph topology manipulation strategy, has proven effective in complementing and enhancing other techniques. For instance, AdaEdge (Chen et al., 2020a) heavily relys on “homophily” assumption and alleviate over-smoothing by increasing intra-class and decreasing inter-class connections based on the model’s classification outcomes. HalfHop (Azabou et al., 2023) proposes a heuristic method of randomly adding nodes to slow down message propagation. However, these approaches rely on heuristic strategies that require prior knowledge and significant human intervention, often resulting in a sub-optimal performance and lack of interpretability, which makes it challenging to understand their effectiveness or rationale. Moreover, there is currently no universal framework that describes and encompasses these methods, despite their common manipulation of graph topology and interconnected effects.

To address the aforementioned challenges, we introduce UniGAP, a universal and adaptive graph upsampling methodology. Specifically, it offers a universal framework that subsumes various graph upsampling techniques as its variants. For instance, methods such as HalfHop, AdaEdge, and other upsampling techniques (Wu et al., 2021; Liu et al., 2022) can be considered specific variants of UniGAP (Section 3.4). Furthermore, different from existing methods that employ a heuristic upsampling strategy, UniGAP adaptively calculates the probability of inserting an intermediate node for each pair of adjacent nodes in the original graph, generating a new upsampled graph through a differentiable sampling technique. Subsequently, UniGAP iteratively refines these sampling probabilities based on evaluation metrics (e.g., accuracy or smoothness) of the upsampled graph in downstream tasks. This iterative process strives to yield an optimally upsampled graph suitable for downstream applications. Moreover, UniGAP enhances the interpretability of the over-smoothing issue within graph structure learning. By examining the augmented graph that achieves optimal performance, we can precisely identify where intermediate nodes are inserted. This analysis provides valuable insights into the root causes of the over-smoothing problem. Lastly, we demonstrate the potential of integrating LLMs with UniGAP to further improve downstream performance.

Our contributions can be summarized as follows:

•

We propose UniGAP, a universal and adaptive graph upsampling approach to mitigate over-smoothing in node classification and enhance downstream performance as a plug-in. Additionally, UniGAP can encompass various graph upsampling techniques as its variants.
•

UniGAP identifies the critical bottleneck where the over-smoothing problem occurs, offering interpretability to this issue from a graph structure learning perspective, supported by our provided visualizations.
•

Through extensive experiments, UniGAP shows significant improvements over heuristic data augmentation methods and integrates seamlessly with GNNs to enhance performance and address over-smoothing. Additionally, it can also be combined with LLMs for further performance boosts.

2. Related Work

2.1. Over-Smoothing Problem in GNNs

Over-smoothing represents a prevalent research issue within the realm of graphs, garnering significant attention recently. This phenomenon substantially impedes the expressiveness of GNNs as the number of propagation layers increases, resulting in node features becoming progressively more indistinguishable. To assess the extent of over-smoothing, a variety of quantitative metrics have been established, including Mean Average Distance (MAD) (Chen et al., 2020b), Euclidean Distance (Zhou et al., 2020a), Dirichlet Energy (Rusch et al., 2023), and others.

Previous research (Cai and Wang, 2020; Zhang et al., 2024; Rusch et al., 2023) has explored various strategies to mitigate the over-smoothing problem. These solutions can be broadly categorized into three main approaches: manipulating graph topology, refining model structure, and dynamic learning. Firstly, manipulating graph topology involves modifying the original graph structure using dropout techniques (e.g., DropEdge (Rong et al., 2020), DropNode (Feng et al., 2021), DropMessage (Fang et al., 2023)), graph upsampling (Azabou et al., 2023; Chen et al., 2020a), or others (Han et al., 2022). These methods aim to reduce message passing or enhance graph structure to alleviate over-smoothing. Secondly, refining model structure includes modifying message passing mechanisms (Chien et al., 2021) or utilizing some model tricks like residual connections (Chen et al., 2020c), multi-hop information (Xu et al., 2018) to mitigate over-smoothing. Lastly, dynamic learning strategies dynamically adjusting the learning process of the model. For instance, DAGNN (Zhou et al., 2020b) calibrates a retention score to preserve representations in each layer, DGSLN (Zou et al., 2023) propose a graph generation module to optimize the graph structure, and ADGCN (Zhang et al., 2023) employs attention for selective cross-layer information aggregation.

In this study, our focus is on graph upsampling due to its simplicity and effectiveness. To the best of our knowledge, previous studies on graph upsampling have relied on heuristic strategies that involve significant human effort and lack a universal framework for integration. Our work aims to address these challenges by introducing a universal and adaptive method.

2.2. Graph Augmentations

Graphs, as non-Euclidean data, primarily consist of entities and the relations between them. Therefore, augmentation techniques for graph data are inherently more complex compared to other domains such as computer vision and natural language processing. Graph augmentations aim to enhance model generalization, improve performance, and mitigate over-smoothing. These augmentations can be categorized into three main types: node-level, edge-level, and graph-level augmentations. Node-level augmentations focus on individual nodes. For example, DropNode (You et al., 2020) randomly sets some nodes as zero vectors, FeatureMask (Zhu et al., 2020a) masks partial features of each node randomly, and FLAG (Kong et al., 2022) adds a learnable perturbation to each node. Edge-level augmentations manipulate edges within the graph. For instance, DropEdge (Rong et al., 2020) and EdgeMask (You et al., 2020) randomly remove existing edges based on specified probabilities. AdaEdge (Chen et al., 2020a) relies on a heuristic approach by adjusting intra-class and inter-class connections based on model classification outcomes. GAug (Zhao et al., 2021) uses neural edge predictors to estimate edge probabilities. Graph-level augmentations modify both nodes and edges simultaneously. GraphCL (You et al., 2020) combines various augmentation techniques to modify the original graph. Additionally, GCA (Zhu et al., 2021) proposes adaptive augmentation approaches based on metrics such as the connectivity of target nodes. HalfHop (Azabou et al., 2023) suggests adding nodes to existing edges randomly to slow down message propagation and mitigate over-smoothing.

Unlike previous studies, our work aims to develop a universal framework capable of integrating existing graph augmentations, particularly focusing on graph upsampling. Additionally, this framework is designed to be adaptive and optimized without requiring extensive human intervention.

3. UniGAP: A Universal and Adaptive Graph Upsampling Method

In this section, we will start with the notations we use throughout the rest of the paper in Section 3.1. Subsequently, we will provide an overview of our method in Section 3.2. Then we will introduce each component in our method for details in Section 3.3. Finally, we show UniGAP as a universal framework in Section 3.4 and discuss UniGAP from a theoretical perspective in Section 3.5.

Refer to caption — Figure 1. Framework of UniGAP. There are four main components in UniGAP, including Trajectory Precomputation, Multi-View Condensation (MVC) Encoder, Adaptive Upsampler, and the downstream model. Trajectory Precomputation needs to be performed only once in each task, then MVC, Adaptive Upsampler, and downstream are co-trained to obtain the optimal graph structure.

3.1. Notations

In this work, our focus is on node classification, which involves learning a graph predictor $g_{\theta}:\mathcal{G}\mapsto\mathcal{Y}$ that maps instance $G=(\mathcal{V},\mathcal{E},{A},{X},{Y})\in\mathcal{G}$ to label ${Y}\in\mathcal{Y}$ , where $\mathcal{V}$ and $\mathcal{E}$ denote the sets of nodes and edges, respectively. ${X}\in\mathbb{R}^{N\times d}$ represents the node attribute feature matrix with $N$ nodes and $d$ attributes. ${A}\in\mathbb{R}^{N\times N}$ is the adjacency matrix. Additionally, we denote the diagonal degree matrix as $D\in\mathbb{R}^{N\times N}$ , and the matrices with added self-loops as $\tilde{A}\in\mathbb{R}^{N\times N}$ and $\tilde{D}\in\mathbb{R}^{N\times N}$ , respectively. The graph Laplacian with self-loops is defined as $\tilde{L}=\tilde{D}-\tilde{A}$ . For normalization, the symmetrically normalized adjacency matrix is $\hat{A}=\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ , and the symmetrically normalized graph Laplacian is $\hat{L}=\tilde{D}^{-\frac{1}{2}}\tilde{L}\tilde{D}^{-\frac{1}{2}}$ . For clarity in notations, we will use $A$ and $L$ to denote the original graph, regardless of whether it is normalized or unnormalized. The augmented graph, which will be introduced later, will be represented by $\hat{A}$ and $\hat{L}$ .

3.2. Overview of UniGAP

In this section, we will provide an overview of our pipeline, with a detailed introduction of our method available in subsequent sections. UniGAP consists of four main components: Trajectory Computation, Multi-View Condensation (MVC) Encoder, Adaptive Upsampler, and the downstream model.

Initially, we compute trajectories (multi-hop information) of the input graph using our Trajectory Precomputation module, denoted as $T=\mathcal{P}(G)$ , where $T\in\mathbb{R}^{L\times N\times d}$ represents the trajectories and $\mathcal{P}(\cdot)$ denotes the frozen trajectory precomputation module. These trajectories $T$ are then processed by a tunable Multi-View Condensation (MVC) Encoder $\mathcal{F}_{\omega}(\cdot):\mathbb{R}^{L\times N\times d}\mapsto\mathbb{R}^{N\times d}$ to obtain condensed information, denoted as $\hat{T}=\mathcal{F}_{\omega}(T)$ . Next, the condensed information $\hat{T}$ is used to compute the node insertion probabilities between each pair, denoted as $P=\mathcal{U}_{\phi}(\hat{T})$ , where $\mathcal{U}_{\phi}:\mathbb{R}^{N\times N}\mapsto\mathbb{R}^{E\times 1}$ represents adaptive upsampler module. Building upon this, a differentiable upsampling operation $\mathcal{S}$ is employed to obtain the augmented graph $\hat{G}=(\hat{A},\hat{X})=\mathcal{S}(P)$ . Finally, the augmented graph will be input into the downstream model $\mathcal{D}_{\psi}$ for specific tasks. Using node classification as an example, the cross-entropy loss will be computed to iteratively optimize the previous learnable modules. In subsequent epochs, the refined downstream model will compute trajectories, with other steps remaining consistent throughout the iteration. The pipeline of our method is illustrated in Figure 1 and Algorithm 1. Next, we will delve into each component in detail.

3.3. UniGAP as An Adaptive Upsampling Framework

3.3.1. Trajectory Computation

Trajectories encompass multi-view information of each node, including one-hop, two-hop, and higher-order information, which are essential for subsequent modules. As the output of node features at each layer of the model, they implicitly reflect the gradual process of over-smoothing as the number of layers increases. Our aim is to derive a high-dimensional representation of the over-smoothing process from the trajectories, thereby enabling us to devise strategies to mitigate over-smoothing. In this section, we will elucidate the process of trajectory computation.

Let the trajectories be denoted as $T\in\mathbb{R}^{L\times N\times d}$ , where $L$ represents the number of layers, $N$ denotes the number of nodes, and $d$ signifies the dimensionality of features. Trajectories include the concatenation of multi-view features (hidden states from each layer):

(1)

T=\operatorname{concat}({T}^{(1)},{T}^{(2)},\cdots,{T}^{(L)})\in\mathbb{R}^{L\times N\times d}

where $\operatorname{concat}$ denotes channel-wise concatenation. $L$ is the number of model layers. $T^{(1)},\cdots,T^{(L)}$ represent multi-view features (1-hop, $\cdots$ , L-hop information), we collect them through gathering the hidden representations in each layer:

(2)

\displaystyle{T}^{(l)}=\begin{cases}\operatorname{Norm}({H}^{(l)}),&\text{if}\ \operatorname{mod}(l,w)=0\\ {H}^{(l)},&\text{o.w.}\\ \end{cases}

where the representation of each hop is denoted as $H^{(l)}=f^{(l)}(H^{(l-1)})\in\mathbb{R}^{N\times d}$ , $f^{(l)}(\cdot)$ means $l$ -th layer in trajectory computation model, and $\operatorname{Norm}({x}_{i})=\frac{{x}_{i}}{\left|\left|{x}_{i}\right|\right|_{2}}$ signifies a channel-wise $L2$ normalization, which aids in regulating the magnitude of features that may escalate exponentially with the number of iterations. $w$ is a hyper-parameter that determines the frequency of normalizing features.

In the initial stage, a specific model is not available for calculating the trajectories of the original graph. Therefore, a trajectory precomputation module is needed to obtain initialized trajectories. Various strategies can be employed to achieve this goal. In this work, we explore three off-the-shelf strategies, considering their efficiency and simplicity: Zero Initialization, Non-parametric Message Passing, and Pretrained GNN. Next, we will discuss how to integrate these strategies into our methods:

1) Zero initialization. We initialize the trajectories with all-zero vectors, resulting in zero multi-view condensed features during the first epoch of training. Consequently, the upsampling phase in this initial epoch does not introduce new edges or nodes, thus utilizing the original graph for downstream tasks. After finishing the first epoch, we store the hidden representations of each layer in downstream model as the trajectories for subsequent epochs. This zero initialization ensures that the UniGAP module learns the graph structure tailored to each downstream task with minimal noise and optimal performance. The zero initialization can be formulated as

(3)

{T}^{(l)}=\boldsymbol{0}\in\mathbb{R}^{N\times d}

2) Non-parametric Message Passing. After removing the non-linear operator in graph neural networks, we can decouple the message passing ( $AX$ ) and feature transformation ( $XW$ ). However, in the initial stage, we have no access to the transformation weights $W$ unless pre-training methods are employed. Therefore, we adopt non-parameteric message passing to collect multi-view information, i.e., $AX,A^{2}X,\cdots A^{L}X$ . This approach avoids nonlinear transformations, preserving maximum original feature, graph topology, and trajectories. The formulation is as follows:

(4)

T^{(l)}=\mathcal{P}^{(l)}=AT^{(l-1)}\ \text{or}\ T^{(l)}=\mathcal{P}^{(l)}=LT^{(l-1)}

where $\mathcal{P}^{(l)}$ denotes the $l$ -th layer in $\mathcal{P}$ and $T^{(0)}=G$ .

3) Pretrained GNN. The first two methods do not introduce additional parameters, which may limit the expressivity of the information embedded in the raw features. Here, we introduce a more powerful method, though it may incur additional computational costs. We leverage the training data to train a GNN model by designing a pretext task such as unsupervised contrastive loss (You et al., 2020; Zhu et al., 2022, 2024b) or masked prediction (Hou et al., 2022; Zhu et al., 2023) to obtain a generalized feature extractor. An additional advantage of this method is the ability to use pre-trained parameters to initialize the encoding parameters in the downstream model, facilitating faster convergence and reducing training time. In this way, we can kill two birds with one stone. The pretraining procedure can be formulated as follows:

(5)

\mathcal{P}=\operatorname*{arg\,min}_{\theta}\mathcal{L}_{\text{pretext}}(\mathcal{P}_{\theta}(G))

After the pre-training process, the pre-trained GNN model will be employed to compute trajectories:

(6)

T^{(l)}=\mathcal{P}^{(l)}(T^{(l-1)})

3.3.2. Multi-View Condensation

After obtaining trajectories (multi-view information), we aim to adaptively condense this multi-view information into compact features that reflect the node’s propensity to over-smoothing, while considering efficiency and adaptivity. To achieve this goal, we design a multi-view condensation (MVC) encoder that can adaptively condense trajectories $T\in\mathbb{R}^{L\times N\times d}$ into condensed trajectory features $\hat{T}\in\mathbb{R}^{N\times d}$ . This reduces computational cost and adaptively adjusts the weights of information at each hop. We instantiate the MVC encoder using two strategies in this work: Trajectory-MLP-Mixer (TMM) and Trajectory-Transformer.

1) Trajectory-MLP-Mixer (TMM). Considering simplicity and efficiency, we introduce a lightweight model, i.e., Trajectory-MLP-Mixer, to instantiate the MVC encoder as depicted in Figure 2a. Initially, trajectory mixing is applied to the trajectory dimension to adaptively integrate information from each hop. Subsequently, channel mixing is applied to the mixed trajectories to obtain condensed multi-view features:

(7)

\hat{T}=\mathcal{F}_{\omega}(T)=\operatorname{TMM}({T})=\sigma(w_{\text{traj}}\odot T)\oplus T)W_{\text{channel}}

where $w_{\text{traj}}\in\mathbb{R}^{1\times d}$ is employed for trajectory mixing, $\odot$ denotes the broadcasted Hadamard product with a summation across the feature dimension, and $\sigma$ represents the normalization function. Thus, $\sigma(w_{\text{traj}}\odot T)\in\mathbb{R}^{L\times N}$ signifies the importance weights for each hop information of each node, which will be utilized for the weighted summation $\oplus$ of multi-view information $T$ . Additionally, $W_{\text{channel}}\in\mathbb{R}^{d\times d}$ is used for channel mixing.

2) Trajectory-Transformer (TT). We treat the trajectories as a sequence, with each hop information considered a token input, as illustrated in Figure 2b. These tokens are fed into a Transformer encoder to learn the mixed-hop representations. Subsequently, an attention-based readout function is developed to adaptively aggregate mixed-hop features to obtain condensed trajectory features. The steps can be formulated as follows:

(8)

\hat{T}=\mathcal{F}_{\omega}(T)=\operatorname{TT}(T)=\mathcal{R}(\operatorname{Tr}(T))

where $\operatorname{Tr}(\cdot)$ denotes a shallow Transformer encoder, and $\mathcal{R}$ represents an attention-based readout function.

3.3.3. Adaptive Graph Upsampler

After obtaining condensed trajectory features $\hat{T}$ , we use them to compute the upsampling probability for each existing edge to decelerate message passing, thereby mitigating the over-smoothing problem (theoretical justification can be found in Section 3.5.1). Subsequently, we employ a differentiable sampling method to generate an augmented graph. Specifically, we compute the upsampling logits as follows:

(9)

P_{ij}=\mathcal{U}_{\phi}(\hat{T})={W}_{\text{up}}\left[\hat{{T}}_{i}\|\hat{{T}}_{j}\right]\in\mathbb{R}^{|\mathcal{E}|\times 2},\forall(i,j)\in\mathcal{E}

where $P_{ij}$ represents the upsampling logits for the edge between nodes $i$ and $j$ , ${W}_{\text{up}}$ is a learnable weight matrix, and $\left[\hat{{T}}_{i}\|\hat{{T}}_{j}\right]$ denotes the concatenation of the condensed trajectory features of nodes $i$ and $j$ .

Then we employ Gumbel-Softmax (Jang et al., 2017) with additional tricks to derive the upsampling mask and generate the augmented graph $\hat{G}$ :

	$\displaystyle\hat{P}=\operatorname{softmax}({P+\epsilon_{\text{gumbel}}})\in\mathbb{R}^{\|\mathcal{E}\|\times 2}$
(10)		$\displaystyle M=\mathds{1}_{\hat{P}[:,0]\leq\hat{P}[:,1]}-\tilde{{P}}+\hat{P}\in\mathbb{R}^{\|\mathcal{E}\|}$

where $\epsilon\in\mathbb{R}^{|\mathcal{E}|\times 2}$ represents Gumbel noise used to introduce diversity into the sampling process (exploration), preventing the selection of edges solely based on highest probabilities and ensuring differentiability. $\hat{P}$ denotes the upsampling probability, and $\mathds{1}_{\hat{P}[:,0]\leq\hat{P}[:,1]}$ is an indicator function that returns 1 if $\hat{P}[:,0]\leq\hat{P}[:,1]$ . $\tilde{P}$ results from truncating the gradient of $\hat{P}$ . This approach achieves two objectives: (1) ensures the output is exactly one-hot (by adding and then subtracting a soft value), and (2) maintains the gradient equal to the soft gradient (by keeping the original gradients). Lastly, $M$ represents an active mask used to determine which edges will have a new node inserted. Specifically, if the mask for edge $i\to j$ is 1, a new node $k$ is inserted between nodes $i$ and $j$ . The original edge $i\to j$ is then masked and replaced by new edges $i\to k$ and $k\to j$ . The initial feature vector of node $k$ can be set to a zero vector, the average of nodes $i$ and $j$ , or an adaptive sum:

(11)

\hat{X}_{k}={P}_{ij}^{(l)}[0]{X}_{i}+{P}_{ij}^{(l)}[1]{X}_{j}

Then, we obtain the augmented graph $\hat{{G}}=(\hat{\mathcal{V}},\hat{\mathcal{E}},\hat{X})$ , which will be used for downstream tasks to mitigate over-smoothing and improve overall performance.

3.3.4. Model Training

To optimize the augmented graph, we will utilize it in downstream tasks such as node classification. Let the downstream model be denoted as $\mathcal{D}_{\psi}(\cdot)$ . The augmented graph will be fed into this model to compute the downstream loss $\mathcal{L}_{\text{down}}$ . Additionally, we incorporate over-smoothing metrics, such as Mean Average Distance (MAD) (Chen et al., 2020b) and Dirichlet Energy (Rusch et al., 2023), as regularization terms $\mathcal{L}_{\text{smooth}}$ to ensure that UniGAP explicitly focuses on mitigating the over-smoothing issue. The total loss is defined as:

(12)		$\displaystyle\mathcal{L}_{\text{total}}$	$\displaystyle=\mathcal{L}_{\text{down}}+\beta\mathcal{L}_{\text{smooth}}$
(12)			$\displaystyle=-\mathbb{E}_{v\in\hat{\mathcal{V}}}y\log p(\tilde{y}\mid v,\mathcal{N}_{v})-\beta\operatorname{MAD}(\hat{G})$

where $\tilde{y}$ represents the prediction of the downstream model, $\mathcal{N}_{v}$ denotes the set of neighboring nodes associated with node $v$ used in the downstream model, and $\beta$ is the regularization coefficient that controls the impact of the smoothing metric. We empirically set $\beta$ to 1 because a larger value would negatively impact downstream performance. The formula of MAD is represnted as:

(13)

\operatorname{MAD}(X)=\frac{1}{|\mathcal{V}|}\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}_{i}}1-\frac{X_{i}^{T}X_{j}}{||X_{i}||||X_{j}||}

Through optimizing the downstream loss, all parameters in our framework will be updated in an end-to-end manner through backpropagation. After warming up the downstream model, it will be used to compute trajectories instead of relying solely on the trajectory precomputation module mentioned in Section 3.3.1. The model training process continues until convergence, aiming to obtain the optimal upsampled graph and model parameters.

3.4. UniGAP as A Universal Framework

UniGAP is a universal plug-in that can be integrated with various GNN models, as demonstrated in Section 4.2.1. Additionally, by combining different components and strategies, UniGAP can encompass most existing graph upsampling techniques (Chen et al., 2020a; Azabou et al., 2023; Wu et al., 2021). In this subsection, we will discuss how these techniques can be considered as variants of our approach.

HalfHop (Azabou et al., 2023). Consider the Trajectory Computation initiated with a stochastic vector, with Multi-View Condensation and Graph Upsampler frozen with random parameters. The upsampling of nodes then becomes a stochastic process with probability $p$ based on these parameters, making UniGAP function as a randomized heuristic upsampler, equivalent to HalfHop.

AdaEdge (Chen et al., 2020a). Let the Trajectory Computation contain only node embeddings from the last layer of the downstream GNN, with the Multi-View Condensation omitted. The module computing the upsampling probability for each edge in the Adaptive Graph Upsampler can be considered a binary classifier, with the output indicating whether the edge is intra-class or inter-class. This allows UniGAP to determine edge types based on the last layer’s output, adaptively increasing intra-class connections and decreasing inter-class connections, similar to AdaEdge.

GraphMixup (Wu et al., 2021). Using Non-parametric Message Passing as the Trajectory Precomputation Module and a self-supervised MVC Encoder to learn disentangled semantic features, and constraining the Graph Upsampler to generate only synthetic minority nodes with a GNN classifier updating the upsampling scale via a reinforcement learning agent, makes the UniGAP framework equivalent to GraphMixup, improving class-imbalanced node classification tasks.

3.5. Discussion

3.5.1. Why Does UniGAP Work?

Equipped with UniGAP, the optimized upsampled graph can not only enhance downstream performance but also mitigate the over-smoothing problem. In this subsection, we will provide theoretical explanations by answering the following questions:

RQ1: How does UniGAP learn in concert with downstream tasks to get the optimal upsampling graph structure?

RQ2: How does UniGAP mitigate the over-smoothing problem?

Closed-Form Solution (RQ1). To simplify the analysis, we assume the GNN model as linear GNN with mean aggregation (Keriven, 2022). Therefore, the $k$ -th hop node features at $e$ -th epoch are:

(14)

\displaystyle H^{(k)}_{e}

\displaystyle=\hat{A}^{k}_{e}X=\operatorname{UniGAP}(H^{(k)}_{e-1},\cdots,H^{(0)}_{e-1},\theta_{u})^{k}X:=f^{k}(A,\theta_{u},X)

where $\hat{A}$ is the adaptive message passing matrix, and $\theta_{u}$ is the learnable parameters of UniGAP.

Proposition 3.1.

By jointly training the parameters of the downstream model $\theta$ and UniGAP $\theta_{u}$ at each epoch, UniGAP learns the optimal structural representation for downstream tasks by optimizing $\langle\hat{\theta},\hat{\theta}_{u}\rangle$ as follows:

(15)	$\displaystyle\langle\hat{\theta},\hat{\theta}_{u}\rangle_{e}$	$\displaystyle=\operatorname*{arg\,min}_{\langle\theta,\theta_{u}\rangle}\frac{1}{2n}\|\|Y-H^{(k)}_{e}\theta\|\|^{2}+\lambda\|\|\theta\|\|^{2}+\gamma\|\|\theta_{u}\|\|^{2}$
		$\displaystyle\quad\quad-\beta\operatorname{MAD}(H^{(k)}_{e})$
		$\displaystyle=\Bigg{\langle}\left(\frac{(f^{k})^{T}f^{k}}{n}+\lambda I\right)^{-1}\frac{(f^{k})^{T}Y}{n},$
		$\displaystyle\quad\quad\gamma^{-1}\left(\frac{1}{2n}\left(\frac{\partial f^{k}}{\partial\theta_{u}}\right)^{T}(Y-f^{k}\hat{\theta})\hat{\theta}-\frac{\beta}{2}\frac{\partial}{\partial\theta_{u}}\text{MAD}(f^{k})\right)\Bigg{\rangle}$

As shown in Equation 14,15, UniGAP can learn the message passing dynamics and adaptivly change the nodes’ receptive filed, finally reach a optimal structural representation for downstream task. The derivation and more illustrations can be found in Appendix B.

Slower Smoothing (RQ2). We find that UniGAP can decelerate the message passing process by slowing down the smooth rate of the covariance of node features during message passing. The following Lemma and Theorem shows the approximated covariance of the node features after $k$ rounds of message passing with or without UniGAP.

Lemma 3.2.

The risk after k rounds of Message Passing can be approximated as $\mathcal{R}^{(k)}\simeq\mathcal{R}_{\text{reg.}}(A^{2k}\Sigma)$ , where $A=(I+\Sigma^{-1})^{-1}$ .

As shown in Lemma 3.2, the rate of smoothing of original MP is $2k$ due to the $A^{2k}\Sigma$ operator (Keriven, 2022). Then we consider the condition with UniGAP.

Lemma 3.3.

With UniGAP, the risk of test set $\mathcal{R}_{\text{reg.}}(\Sigma^{(k)}_{\text{UniGAP}})$ after $k$ rounds of message passing with UniGAP can be approximated as:

(16)

\mathcal{R}^{(k)}_{\text{UniGAP}}\simeq\mathcal{R}_{\text{reg.}}(\frac{1}{2}A^{k-1}(I+((1-p)I+pA)^{2})\Sigma)

Here $\Sigma^{(k)}_{\text{UniGAP}}$ is the approximated covariance of the node features, $p=\mathbb{E}(\hat{P}[:,0])$ and $\hat{P}$ denotes the upsampling probability at the optimal graph structure. The rate of smoothing of MP with UniGAP is $k-1$ due to the $A^{k-1}\Sigma$ operator. The complete proof can be found in Appendix B.

With Lemma 3.2 and 3.3, we can naturally get the Theorem:

Theorem 3.4 (Slower Smoothing of UniGAP).

After $k$ rounds of message passing, the risk obtained with UniGAP is $\mathcal{R}^{(k)}_{\text{UniGAP}}\propto A^{k-1}$ , roughly half the rate of smoothing compared to original MP model’s $\mathcal{R}^{(k)}\propto A^{2k}$ .

By slowing down the smoothing rate of the covariance of node features during message passing, UniGAP reduces the smoothing rate from $2k$ to $k-1$ .

3.5.2. Complexity Analysis

In this subsection, we will present a complexity analysis of UniGAP from a theoretical perspective.

Time Complexity. Our framework comprises four primary components: trajectory computation, multi-view condensation, adaptive graph upsampler, and downstream model. The trajectory precomputation module involves complexities of $\mathcal{O}(1)$ , $\mathcal{O}(LEd)$ , or $\mathcal{O}(LEd+LNd^{2})$ depending on the initialization method: zero initialization, non-parametric message passing, or pre-trained GNN. The MVC Condensation module (Trajectory-MLP-Mixer) operates with a time complexity of $\mathcal{O}(LN+Nd^{2})$ , while the adaptive graph upsampler computes with a complexity of $O(Ed)$ to determine upsampling probabilities for adjacent node pairs. The time complexity of downstream model, e.g., GNN, is $\mathcal{O}(LEd+LNd^{2})$ . Consequently, the total time complexity of our method is $O(LN+LEd+LNd^{2})$ , comparable to traditional L-layer GNNs, i.e., $\mathcal{O}(LEd+LNd^{2})$ .

Space Complexity. The space complexity of our method also closely aligns with that of traditional GNNs ( $\mathcal{O}(E+Ld^{2}+LNd)$ ). Specifically, the trajectory precomputation module exhibits space complexities of $\mathcal{O}(1)$ , $\mathcal{O}(E)$ , or $\mathcal{O}(E+Ld^{2}+LNd)$ . The multi-view condensation module requires $\mathcal{O}(d+d^{2}+LN+Nd)$ , whereas the adaptive graph upsampler operates with a complexity of $\mathcal{O}(d+E)$ . The space complexity of the downstream model, such as GNN, is $\mathcal{O}(E+Ld^{2}+LNd)$ . Consequently, the total space complexity of our method is $\mathcal{O}(d+d^{2}+LN+Nd+E+Ld^{2}+LNd)$ which is similar to that of GNN.

In summary, the complexity of UniGAP is on par with traditional GNN methods, maintaining efficiency.

4. Experiments

In this section, we present a comprehensive empirical investigation. Specifically, we aim to address the following research questions: RQ1: How does the proposed UniGAP approach perform when applied to various GNN models across both homophilic and heterophilic datasets? RQ2: To what extent can UniGAP effectively mitigate the phenomenon of over-smoothing? RQ3: How does UniGAP mitigate over-smoothing? RQ4: How do the various strategic choices across different modules impact performance? RQ5: Can we further enhance UniGAP with advanced techniques like LLM?

4.1. Experimental Setup

4.1.1. Datasets

We evaluated our method on 8 node classification benchmark datasets, as shown in Table 1, comprising 4 homophilic and 4 heterophilic datasets. More detailed information is provided in Appendix A.

4.1.2. Evaluation Protocols

To compare UniGAP with baselines, we use public split setting for all datasets, and report the mean accuracy along with the standard deviation across twenty different random seeds. UniGAP is integrated as a plug-in method with various SOTA GNN models to evaluate its effectiveness. Besides, we report the MAD metric to observe the phenomenon of over-smoothing as the number of layers increases.

4.1.3. Baselines

The baselines primarily consist of four categories: (1) traditional GNN methods such as GCN (Kipf and Welling, 2017), GAT (Velickovic et al., 2018), and GraphSAGE (Hamilton et al., 2017), (2) advanced GNN methods including GCNII (Chen et al., 2020c), GGCN (Yan et al., 2021), H2GCN (Zhu et al., 2020b), GPRGNN (Chien et al., 2021), and AERO-GNN (Lee et al., 2023), (3) Graph Transformer methods like GT (Dwivedi and Bresson, 2020), GraphGPS (Rampášek et al., 2023), and NAGphormer (Chen et al., 2023), and (4) other graph upsampling methods, i.e., HalfHop. We integrate UniGAP and HalfHop with various GNN models for evaluation.

4.2. Experimental Results

4.2.1. Overall Performance Comparison (RQ1)

For heterophilic datasets, UniGAP significantly enhances traditional GNN models, which heavily rely on the “homophily” assumption. For example, UniGAP achieves an average absolute improvement on four heterophilic datasets of 10.0% on GCN, 13.3% on GAT, and 5.0% on GraphSAGE, respectively. It also enhances other advanced GNN models, with absolute improvements such as 2.3% on GCNII, 1.1% on H2GCN, and 2.9% on GPRGNN. Compared to HalfHop, UniGAP surpasses it in all heterophilic datasets with over 3.9% absolute improvement with GCN, demonstrating the effectiveness of adaptive graph upsampling in identifying the optimal graph structure to enhance the downstream model’s performance.

For homophilic datasets, traditional GNN models benefit from their inductive bias, specifically the homophily assumption, leading to moderate performance. However, UniGAP can still identify graph structures that enhance downstream tasks through adaptive learning, thereby improving the performance of all baselines across different datasets. Average absolute improvements range from 1.1% to 4.8%.

In summary, compared to all the baselines and datasets, UniGAP demonstrates a significant improvement, with an average absolute boost ranging from 1.4% to 6.7%, surpassing the SOTA baselines, showcasing the effectiveness of UniGAP.

Table 1. Node classification performance on homophilic and heterophilic graphs. The results equipped with UniGAP are highlighted by a gray background. The boldface and underscore show the best and the runner-up, respectively.

Type	Homophilic graphs				Heterophilic graphs
Dataset	Pubmed	Citeseer	Cora	Arxiv	Texas	Cornell	Wisconsin	Actor
Homophily	0.66	0.63	0.77	0.42	0.00	0.02	0.05	0.00	Avg
GT (Dwivedi and Bresson, 2020)	79.08 ± 0.4	70.10 ± 0.8	82.22 ± 0.6	70.63 ± 0.4	84.18 ± 5.4	80.16 ± 5.1	82.74 ± 6.0	34.28 ± 0.7	72.92
GraphGPS (Rampášek et al., 2023)	79.94 ± 0.3	72.40 ± 0.6	82.44 ± 0.6	70.97 ± 0.4	82.21 ± 6.9	82.06 ± 5.1	85.36 ± 4.2	36.01 ± 0.9	73.92
NAGphormer (Chen et al., 2023)	80.57 ± 0.3	72.80 ± 0.8	84.20 ± 0.5	70.13 ± 0.6	80.12 ± 5.5	79.89 ± 7.1	82.97 ± 3.2	34.24 ± 0.9	73.11
AERO-GNN (Lee et al., 2023)	80.59 ± 0.5	73.20 ± 0.6	83.90 ± 0.5	72.41 ± 0.4	84.35 ± 5.2	81.24 ± 6.8	84.80 ± 3.3	36.57 ± 1.1	74.63
GCN (Kipf and Welling, 2017)	79.54 ± 0.4	72.10 ± 0.5	82.15 ± 0.5	71.74 ± 0.3	65.65 ± 4.8	58.41 ± 3.3	62.02 ± 5.9	30.57 ± 0.7	65.27
+HalfHop	77.42 ± 0.6	72.40 ± 0.7	81.36 ± 0.6	72.59 ± 0.2	71.62 ± 3.9	72.39 ± 4.8	66.14 ± 6.6	30.69 ± 0.7	68.07
+UniGAP	80.06 ± 0.6	73.10 ± 0.6	84.20 ± 0.5	72.97 ± 0.2	77.37 ± 3.5	75.95 ± 4.7	69.59 ± 6.2	33.73 ± 0.9	70.87
GAT (Velickovic et al., 2018)	77.81 ± 0.4	71.00 ± 0.8	83.18 ± 0.5	71.95 ± 0.4	60.46 ± 6.2	58.22 ± 3.7	63.59 ± 6.1	30.36 ± 0.9	64.57
+HalfHop	78.03 ± 0.5	71.80 ± 0.8	82.29 ± 0.8	72.56 ± 0.4	74.54 ± 6.1	66.48 ± 5.5	70.17 ± 6.0	30.28 ± 1.0	68.27
+UniGAP	80.10 ± 0.3	73.20 ± 0.5	84.23 ± 0.6	74.06 ± 0.3	78.64 ± 4.9	76.04 ± 5.1	74.81 ± 5.3	32.79 ± 0.8	71.73
GraphSAGE (Hamilton et al., 2017)	78.67 ± 0.4	71.80 ± 0.6	83.76 ± 0.5	71.49 ± 0.3	82.43 ± 6.1	75.95 ± 5.3	81.18 ± 5.5	34.23 ± 1.0	72.44
+HalfHop	78.94 ± 0.3	72.00 ± 0.8	84.27 ± 0.6	72.93 ± 0.5	85.95 ± 6.4	74.60 ± 6.0	85.88 ± 4.0	36.82 ± 0.7	73.92
+UniGAP	80.21 ± 0.7	73.60 ± 0.3	86.46 ± 0.4	73.71 ± 0.2	86.52 ± 4.8	83.21 ± 5.7	86.98 ± 3.2	37.14 ± 0.8	75.98
GCNII (Chen et al., 2020c)	80.14 ± 0.7	72.80 ± 0.5	84.33 ± 0.5	72.74 ± 0.2	78.59 ± 6.6	78.84 ± 6.6	81.41 ± 4.7	35.76 ± 1.0	73.08
+HalfHop	78.96 ± 0.9	72.40 ± 0.7	84.91 ± 0.6	72.91 ± 0.3	79.84 ± 6.1	80.04 ± 5.1	82.96 ± 3.7	36.01 ± 0.9	73.50
+UniGAP	80.79 ± 0.7	73.70 ± 0.4	86.22 ± 0.4	73.32 ± 0.3	82.08 ± 4.2	81.16 ± 5.0	84.06 ± 3.8	36.47 ± 1.1	74.76
GGCN (Yan et al., 2021)	78.34 ± 0.6	71.60 ± 0.6	83.64 ± 0.6	71.29 ± 0.1	82.86 ± 4.5	85.68 ± 6.6	86.86 ± 3.2	37.54 ± 1.5	74.73
+HalfHop	77.92 ± 0.8	70.90 ± 0.8	84.07 ± 0.7	71.12 ± 0.3	83.46 ± 5.1	83.22 ± 7.1	86.49 ± 4.6	36.32 ± 1.3	74.19
+UniGAP	80.09 ± 0.7	73.00 ± 0.6	84.97 ± 0.6	73.08 ± 0.3	85.14 ± 4.9	84.27 ± 6.0	87.69 ± 3.3	37.69 ± 1.2	75.74
H2GCN (Zhu et al., 2020b)	78.12 ± 0.4	71.80 ± 0.8	84.26 ± 0.6	70.97 ± 0.2	84.86 ± 7.2	82.70 ± 5.2	87.65 ± 4.9	35.70 ± 1.0	74.51
+HalfHop	79.23 ± 0.8	71.90 ± 0.9	82.65 ± 0.8	71.26 ± 0.3	85.02 ± 7.0	83.54 ± 5.9	86.22 ± 5.2	34.83 ± 1.2	74.33
+UniGAP	80.26 ± 0.6	73.10 ± 0.5	85.13 ± 0.6	73.19 ± 0.2	86.14 ± 6.5	84.96 ± 5.0	87.73 ± 4.8	36.42 ± 0.8	75.87
GPRGNN (Chien et al., 2021)	75.68 ± 0.4	71.60 ± 0.8	84.20 ± 0.5	71.86 ± 0.3	81.51 ± 6.1	80.27 ± 8.1	84.06 ± 5.2	35.58 ± 0.9	73.10
+HalfHop	77.95 ± 1.0	72.30 ± 0.7	83.49 ± 0.8	71.44 ± 0.5	83.87 ± 6.0	84.79 ± 7.6	85.76 ± 6.1	35.46 ± 1.0	74.38
+UniGAP	80.21 ± 0.9	72.90 ± 0.6	85.18 ± 0.4	71.98 ± 0.4	85.65 ± 5.4	84.37 ± 6.3	86.57 ± 5.0	36.51 ± 0.9	75.42

4.2.2. The Degree of Over-smoothing (RQ2)

To analyze the extent to which UniGAP effectively mitigates the phenomenon of over-smoothing, we conducted experiments on GCN across several datasets, evaluating its accuracy and MAD as the number of layers increases. As depicted in Figure 3, the experimental results demonstrate that UniGAP significantly enhances model performance and alleviates over-smoothing at each layer, surpassing AdaEdge and HalfHop.

Specifically, while HalfHop and AdaEdge can mitigate over-smoothing to some extent, they do not achieve optimal performance as UniGAP does. HalfHop fails to identify crucial elements necessary to mitigate over-smoothing and enhance performance because it inserts nodes randomly rather than adaptively finding optimal positions for the inserted nodes. AdaEdge increases connectivity between intra-class nodes, but effective message propagation between inter-class nodes requires an optimal graph structure. Therefore, adaptive upsampling represents a more sophisticated approach to mitigating over-smoothing and enhancing model performance.

4.2.3. Interpretability in Mitigating Over-smoothing (RQ3)

To better analyze the reasons for mitigating over-smoothing, we examine the ratio of intermediate node insertions. This analysis provides valuable insights into the root causes of over-smoothing. As shown in Figure 4, we identify the locations where intermediate nodes are added. We observe that new nodes tend to be inserted between inter-class node pairs, slowing down the message passing between different classes and preventing the features of different classes from collapsing into identical vectors, thus mitigating the over-smoothing problem. For the portion of nodes inserted into intra-class edges, we speculate that some intra-class edges contain noise. Equipped with UniGAP, these harmful noises can be eliminated, thereby improving performance.

4.3. Ablation Study (RQ4)

In Section 3.3, we provide various strategies for instantiating our modules. In this subsection, we will explore the impact of different strategies within the various modules of UniGAP on model performance.

Comparison of Different Strategies in Trajectory Precomputation. In this study, we investigate three readily available strategies in Section 3.3.1: Zero Initialization, Non-parametric Message Passing, and Pretrained GNN. We conduct experiments on GCN+UniGAP across several datasets. From Figure 5, we observe that Pretrained GNN consistently outperforms both Zero Initialization and Non-parametric Message Passing. Due to its larger number of parameters, it can learn unique, additional information from the graph, allowing UniGAP to optimize the graph structure from a more advantageous starting point.

Comparsion of Different Strategies in MVC Encoder. We investigate the MVC encoder using three strategies in this work: No encoder, Trajectory-MLP-Mixer (TMM), and Trajectory-Transformer (TT). As shown in Figure 4, both TMM and TT demonstrate significant improvements compared to not using an MVC Encoder, with absolute improvements ranging from 1% to 1.92%. Additionally, the MVC Encoder effectively compresses the trajectory features, reducing the time consumption for subsequent processes. Between TMM and TT, different datasets prefer different structures. For small datasets like Cora, we recommend using TMM as it is more lightweight than TT, thus avoiding the overfitting problem.

Comparsion of Different Initialization for Inserted Nodes. In Section 3.3.3, we introduce three strategies for initializing features for inserted nodes: (1) zero initialization, (2) mean feature of source and target node, and (3) learnable weighted sum of source and target node features. As shown in Figure 4, the learnable weighted sum achieves the best performance because it can adaptively adjust the weights of source and target features, thus generating better inserted node features.

4.4. Enhanced with LLMs (RQ5)

Recently, the graph domain has become closely connected with LLMs on text-attributed graphs (He et al., [n. d.]; Wang et al., 2024; Zhu et al., 2024a), demonstrating significant improvements due to the advanced capabilities in language understanding. In this subsection, we explore the potential of combining UniGAP with LLMs, like LLaMA2-7B (Touvron et al., 2023), on two text-attributed graphs, i.e., Arxiv and WikiCS (Mernyei and Cangea, 2020).

From Table 2, we observe that UniGAP consistently enhances downstream performance when combined with LLM features. The improved quality of node features enables UniGAP to identify a better upsampled graph, thereby achieving superior performance. More complex combinations are left for future work.

Table 2. The performance of UniGAP with LLM.

h_{\text{original}}

indicates the use of features provided in PyG (Fey and Lenssen, 2019), and

h_{\text{LLM}}

denotes the use of hidden features generated by LLaMA2-7B.

	Feature type	Arxiv	WikiCS
GraphSAGE	$h_{\text{original}}$	71.49 ± 0.3	79.13 ± 0.5
+ UniGAP	$h_{\text{original}}$	73.71 ± 0.2	80.57 ± 0.4
GraphSAGE	$h_{\text{LLM}}$	75.03 ± 0.5	81.24 ± 0.7
+ UniGAP	$h_{\text{LLM}}$	76.21 ± 0.4	83.12 ± 0.5

5. Conclusion

In this work, we propose UniGAP, a universal and adaptive graph upsampling methodology that alleviates the over-smoothing issue in the graph domain. By offering a universal framework that encompasses various graph upsampling techniques, UniGAP enhances the adaptability, performance, and interpretability of GNNs. Furthermore, we integrate UniGAP with LLMs, demonstrating the significant potential of combining LLMs with our method. For more intricate combinations, we leave it as future work. This research lays a robust foundation for future advancements in graph upsampling, offering valuable insights and practical solutions to longstanding challenges in the field.

References

(1)
Azabou et al. (2023) Mehdi Azabou, Venkataramana Ganesh, Shantanu Thakoor, Chi-Heng Lin, Lakshmi Sathidevi, Ran Liu, Michal Valko, Petar Veličković, and Eva L. Dyer. 2023. Half-Hop: A graph upsampling approach for slowing down message passing. arXiv:2308.09198 [cs.LG]
Cai and Wang (2020) Chen Cai and Yusu Wang. 2020. A Note on Over-Smoothing for Graph Neural Networks. arXiv:2006.13318 [cs.LG]
Chen et al. (2020a) Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020a. Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 3438–3445. https://aaai.org/ojs/index.php/AAAI/article/view/5747
Chen et al. (2020b) Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020b. Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 3438–3445. https://aaai.org/ojs/index.php/AAAI/article/view/5747
Chen et al. (2023) Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. 2023. NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs. arXiv:2206.04910 [cs.LG]
Chen et al. (2020c) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. 2020c. Simple and Deep Graph Convolutional Networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 1725–1735. http://proceedings.mlr.press/v119/chen20v.html
Chien et al. (2021) Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. 2021. Adaptive Universal Generalized PageRank Graph Neural Network. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=n6jl7fLxrP
Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (Eds.). 3837–3845. https://proceedings.neurips.cc/paper/2016/hash/04df4d434d481c5bb723be1b6df1ee65-Abstract.html
Dwivedi and Bresson (2020) Vijay Prakash Dwivedi and Xavier Bresson. 2020. A Generalization of Transformer Networks to Graphs. https://arxiv.org/abs/2012.09699
Fang et al. (2023) Taoran Fang, Zhiqing Xiao, Chunping Wang, Jiarong Xu, Xuan Yang, and Yang Yang. 2023. DropMessage: Unifying Random Dropping for Graph Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence 37, 4 (June 2023), 4267–4275. https://doi.org/10.1609/aaai.v37i4.25545
Feng et al. (2021) Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. 2021. Graph Random Neural Network for Semi-Supervised Learning on Graphs. arXiv:2005.11079 [cs.LG]
Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. ArXiv preprint (2019).
Gaudelet et al. (2021) Thomas Gaudelet, Ben Day, Arian R. Jamasb, Jyothish Soman, Cristian Regep, Gertrude Liu, Jeremy B. R. Hayter, Richard Vickers, Charles Roberts, Jian Tang, David Roblin, Tom L. Blundell, Michael M. Bronstein, and Jake P. Taylor-King. 2021. Utilising Graph Machine Learning within Drug Discovery and Development. arXiv:2012.05716 [q-bio.QM]
Guo et al. (2019) Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 922–929. https://doi.org/10.1609/aaai.v33i01.3301922
Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 1024–1034. https://proceedings.neurips.cc/paper/2017/hash/5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html
Han et al. (2022) Xiaotian Han, Zhimeng Jiang, Ninghao Liu, and Xia Hu. 2022. G-Mixup: Graph Data Augmentation for Graph Classification. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). PMLR, 8230–8248. https://proceedings.mlr.press/v162/han22c.html
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 770–778. https://doi.org/10.1109/CVPR.2016.90
He et al. ([n. d.]) Xiaoxin He, Xavier Bresson, Thomas Laurent, Adam Perold, Yann LeCun, and Bryan Hooi. [n. d.]. Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning. In The Twelfth International Conference on Learning Representations.
Hou et al. (2022) Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. 2022. GraphMAE: Self-Supervised Masked Graph Autoencoders. https://arxiv.org/abs/2205.10803
Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: Datasets for Machine Learning on Graphs. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html
Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=rkE3y85ee
Keriven (2022) Nicolas Keriven. 2022. Not too little, not too much: a theoretical analysis of graph (over)smoothing. https://arxiv.org/abs/2205.12156
Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SJU4ayYgl
Kong et al. (2022) Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, Gavin Taylor, and Tom Goldstein. 2022. Robust Optimization as Data Augmentation for Large-scale Graphs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 60–69. https://doi.org/10.1109/CVPR52688.2022.00016
Lee et al. (2023) Soo Yong Lee, Fanchen Bu, Jaemin Yoo, and Kijung Shin. 2023. Towards Deep Attention in Graph Neural Networks: Problems and Remedies. https://arxiv.org/abs/2306.02376
Lim et al. (2021) Derek Lim, Xiuyu Li, Felix Hohne, and Ser-Nam Lim. 2021. New Benchmarks for Learning on Non-Homophilous Graphs. https://arxiv.org/abs/2104.01404
Liu et al. (2022) Xin Liu, Jiayang Cheng, Yangqiu Song, and Xin Jiang. 2022. Boosting Graph Structure Learning with Dummy Nodes. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). PMLR, 13704–13716. https://proceedings.mlr.press/v162/liu22d.html
Mernyei and Cangea (2020) Péter Mernyei and Cătălina Cangea. 2020. Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks. https://arxiv.org/abs/2007.02901
Pei et al. (2020) Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. 2020. Geom-GCN: Geometric Graph Convolutional Networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=S1e2agrFvS
Rampášek et al. (2023) Ladislav Rampášek, Mikhail Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. 2023. Recipe for a General, Powerful, Scalable Graph Transformer. arXiv:2205.12454 [cs.LG]
Rong et al. (2020) Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2020. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=Hkx1qkrKPr
Rusch et al. (2023) T. Konstantin Rusch, Michael M. Bronstein, and Siddhartha Mishra. 2023. A Survey on Oversmoothing in Graph Neural Networks. arXiv:2303.10993 [cs.LG]
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint abs/2307.09288 (2023). https://arxiv.org/abs/2307.09288
van den Berg et al. (2017) Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. arXiv:1706.02263 [stat.ML]
Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=rJXMpikCZ
Wang et al. (2024) Yaoke Wang, Yun Zhu, Wenqiao Zhang, Yueting Zhuang, Yunfei Li, and Siliang Tang. 2024. Bridging Local Details and Global Context in Text-Attributed Graphs. https://arxiv.org/abs/2406.12608
Wu et al. (2021) Lirong Wu, Haitao Lin, Zhangyang Gao, Cheng Tan, and Stan. Z. Li. 2021. GraphMixup: Improving Class-Imbalanced Node Classification on Graphs by Self-supervised Context Prediction. https://arxiv.org/abs/2106.11133
Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (Proceedings of Machine Learning Research, Vol. 80), Jennifer G. Dy and Andreas Krause (Eds.). PMLR, 5449–5458. http://proceedings.mlr.press/v80/xu18c.html
Yan et al. (2021) Yujun Yan, Milad Hashemi, Kevin Swersky, Yaoqing Yang, and Danai Koutra. 2021. Two Sides of the Same Coin: Heterophily and Oversmoothing in Graph Convolutional Neural Networks. https://arxiv.org/abs/2102.06462
Yang et al. (2016) Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (JMLR Workshop and Conference Proceedings, Vol. 48), Maria-Florina Balcan and Kilian Q. Weinberger (Eds.). JMLR.org, 40–48. http://proceedings.mlr.press/v48/yanga16.html
Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems (KDD ’18). ACM. https://doi.org/10.1145/3219819.3219890
You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph Contrastive Learning with Augmentations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/3fe230348e9a12c13120749e3f9fa4cd-Abstract.html
Zhang et al. (2023) Sisi Zhang, Lun Du, Fan Li, Ge Yu, and MengYuan Chen. 2023. Propagate Deeper and Adaptable Graph Convolutional Networks. https://openreview.net/forum?id=RR_w2fbYmV
Zhang et al. (2024) Xu Zhang, Yonghui Xu, Wei He, Wei Guo, and Lizhen Cui. 2024. A Comprehensive Review of the Oversmoothing in Graph Neural Networks. In Computer Supported Cooperative Work and Social Computing, Yuqing Sun, Tun Lu, Tong Wang, Hongfei Fan, Dongning Liu, and Bowen Du (Eds.). Springer Nature Singapore, Singapore, 451–465.
Zhao et al. (2021) Tong Zhao, Yozen Liu, Leonardo Neves, Oliver J. Woodford, Meng Jiang, and Neil Shah. 2021. Data Augmentation for Graph Neural Networks. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021. AAAI Press, 11015–11023. https://ojs.aaai.org/index.php/AAAI/article/view/17315
Zheng et al. (2020) Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. 2020. GMAN: A Graph Multi-Attention Network for Traffic Prediction. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 1234–1241. https://aaai.org/ojs/index.php/AAAI/article/view/5477
Zhou et al. (2020a) Kaixiong Zhou, Xiao Huang, Yuening Li, Daochen Zha, Rui Chen, and Xia Hu. 2020a. Towards Deeper Graph Neural Networks with Differentiable Group Normalization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/33dd6dba1d56e826aac1cbf23cdcca87-Abstract.html
Zhou et al. (2020b) Kaixiong Zhou, Xiao Huang, Yuening Li, Daochen Zha, Rui Chen, and Xia Hu. 2020b. Towards Deeper Graph Neural Networks with Differentiable Group Normalization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/33dd6dba1d56e826aac1cbf23cdcca87-Abstract.html
Zhu et al. (2020b) Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020b. Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/58ae23d878a47004366189884c2f8440-Abstract.html
Zhu et al. (2023) Yun Zhu, Jianhao Guo, and Siliang Tang. 2023. Sgl-pt: A strong graph learner with graph prompt tuning. ArXiv preprint abs/2302.12449 (2023). https://arxiv.org/abs/2302.12449
Zhu et al. (2022) Yun Zhu, Jianhao Guo, Fei Wu, and Siliang Tang. 2022. RoSA: A Robust Self-Aligned Framework for Node-Node Graph Contrastive Learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Lud De Raedt (Ed.). International Joint Conferences on Artificial Intelligence Organization, 3795–3801. https://doi.org/10.24963/ijcai.2022/527 Main Track.
Zhu et al. (2024a) Yun Zhu, Yaoke Wang, Haizhou Shi, and Siliang Tang. 2024a. Efficient Tuning and Inference for Large Language Models on Textual Graphs. https://arxiv.org/abs/2401.15569
Zhu et al. (2024b) Yun Zhu, Yaoke Wang, Haizhou Shi, Zhenshuo Zhang, Dian Jiao, and Siliang Tang. 2024b. GraphControl: Adding Conditional Control to Universal Graph Pre-trained Models for Graph Domain Transfer Learning. In Proceedings of the ACM on Web Conference 2024 (Singapore, Singapore) (WWW ’24). Association for Computing Machinery, New York, NY, USA, 539–550. https://doi.org/10.1145/3589334.3645439
Zhu et al. (2020a) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2020a. Deep Graph Contrastive Representation Learning. https://arxiv.org/abs/2006.04131
Zhu et al. (2021) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph Contrastive Learning with Adaptive Augmentation. In Proceedings of the Web Conference 2021 (WWW ’21). ACM. https://doi.org/10.1145/3442381.3449802
Zou et al. (2023) Xiaofeng Zou, Kenli Li, Cen Chen, Xulei Yang, Wei Wei, and Keqin Li. 2023. DGSLN: Differentiable graph structure learning neural network for robust graph representations. Information Sciences 626 (2023), 94–113. https://doi.org/10.1016/j.ins.2023.01.059

Appendix A Experimental Details

A.1. Datasets

We employs 8 node classification benchmark datasets, among which 4 are homophilic and 4 are heterophilic. The train-validation-test splits utilized are those which are publicly accessible. For measuring the degree of homophily of each dataset, we use the homophily metric suggested by (Lim et al., 2021). Table 5 delineates the comprehensive statistics of these benchmark collections.

The PubMed, CiteSeer, and Cora are citation networks(Yang et al., 2016). Within these networks, each vertex signifies a scholarly paper, and adjacency is established when there exists a bibliographic citation linking any two respective papers. The vertex attributes comprise the bag-of-words representation of the related manuscript, with the vertex categorization reflecting the scholarly domain of the associated paper.

The ogbn-arxiv (Hu et al., 2020) dataset constitutes a directed graph typifying the citation network amid all Computer Science (CS) arXiv manuscripts as cataloged by the Microsoft Academic Graph (MAG). Each vertex within this graph is an arXiv document, while each directed edge signifies a citation from one document to another. Accompanying each paper is a 128-dimensional feature vector, derived from the mean of the embeddings corresponding to words in the document’s title and abstract sections.

The Texas, Cornell, and Wisconsin datasets (Pei et al., 2020) originate from the WebKB corpus. In these, each vertex corresponds to a webpage, and adjacency is predicated upon a hyperlink’s presence interlinking any two respective webpages. Here, the vertex features encapsulate the bag-of-words attributes of the relevant webpage, with the vertex classification indicative of the webpage’s categorical type.

The Actor dataset (Pei et al., 2020) is a subgraph strictly encompassing actors from a larger network of film directors, actors, and screenwriters collated from Wikipedia entries. Each vertex in this dataset represents an actor, with an adjacency drawn between vertices when the pertinent actors are jointly mentioned on the same Wikipedia page. The nodal attributes are extrapolated from the salient keywords present on each actor’s Wikipedia entry, and the vertex classification is ascertained by the contextual terms found on these pages.

Table 3. Details of datasets.

Dataset	Homopily	#Nodes	#Edges	#Features	#Classes	Split(%)
Pubmed	0.66	19,717	44,324	500	3	0.3/2.5/5.0
Citeseer	0.63	3327	4552	3703	6	5.2/18/37
Cora	0.77	2708	5728	1433	7	3.6/15/30
Arxiv	0.42	169,343	1,166,243	128	40	53.7/17.6/28.7
WikiCS	0.57	11,701	216,123	300	10	5.0/15/50
Texas	0.00	183	279	1703	5	48/32/20
Cornell	0.02	183	277	1703	5	48/32/20
Wisconsin	0.05	251	450	1703	5	48/32/20
Actor	0.01	7600	26659	932	5	48/32/20

A.2. Hyperparameters

Experimental results are reported on the hyperparameter settings below, where we choose the settings that achieve the highest performance on the validation set. We choose hyperparameter grids that do not necessarily give optimal performance, but hopefully cover enough regimes so that each model is reasonably evaluated on each dataset.

•

lr $\in\{5e-2,1e-2,5e-3,1e-3,5e-4\}$
•

hidden_dim $\in\{16,32,64,128,256\}$
•

dropout $\in\{0,0.1,0.2,0.3,0.5,0.8\}$
•

weight_decay $\in\{1e-2,5e-3,1e-3,5e-4,1e-4\}$
•

activation $\in\{\text{elu},\text{relu},\text{prelu}\}$
•

layer_num $\in\{1,2,3,4,5,6,7,8\}$

For GCNII,

•

alpha $\in\{0.1,0.2,0.3,0.4,0.5\}$
•

theta $=1$

For UniGAP,

•

MVC_dim $\in\{32,64,128,256,512\}$
•

w $\in\{1,2,3,4,\text{None}\}$

For Half-Hop,

•

p $\in\{0,0.5,0.75,1\}$
•

$\alpha=0.5$

Appendix B Theory Proofs

Our proofs are based on the work of (Keriven, 2022) and rely heavily on the Lemmas and proofs in the original paper. For a better understanding of our proof, we appeal interested readers to read the work of (Keriven, 2022).

B.1. Proof of Proposition 3.1

Here we consider a simplified situation of linear GNN with mean aggregation, the smoothed node features after k rounds of mean aggregation are:

(17)

H^{(k)}=L^{(k)}X

We consider learning with a single Mean Square Error (MSE) loss and Ridge regularization. For $\lambda>0$ , the regression coefficients vector on the smoothed features is

(18)		$\displaystyle\hat{\theta}$	$\displaystyle\overset{def_{.}}{=}\operatorname*{arg\,min}_{\theta}\frac{1}{2n}\|\|Y-H^{(k)}\theta\|\|^{2}+\lambda\|\|\theta\|\|^{2}$
(18)			$\displaystyle=\left(\frac{(H^{(k)})^{T}H^{(k)}}{n}+\lambda I\right)^{-1}\frac{(H^{(k)})^{T}Y}{n}$

$n$ and $n_{test}$ is the count of training data and test data, and $\theta$ is the set of learnable parameters of the training model. Then, the test risk $R^{(k)}$ is

(19)

R^{(k)}\overset{def_{.}}{=}\frac{1}{n_{test}}||Y_{test}-\hat{Y}^{(k)}||^{2}\ where\ \hat{Y}^{(k)}=H^{(k)}\hat{\theta}

As proofed in (Keriven, 2022), there is an optimal $k^{*}>0$ such that $R^{(k^{*})}<min(R^{(0)},R^{(\infty)})$ . Below, we show that when equip with UniGAP, the model can learn the optimal $\langle\hat{\theta},\hat{\theta_{u}}\rangle$ for each k, adaptively changing the node’s receptive field and message propagation process so that the model operates on the optimal graph structure. Thus by learning graph structure, UniGAP makes $k^{*}$ approximate our choice of k, rather than heuristically finding $k^{*}$ for each downstream task.

When equpied with UniGAP, the optimization objective is:

(20)		$\displaystyle\langle\hat{\theta},\hat{\theta_{u}}\rangle_{e}=$	$\displaystyle\operatorname*{arg\,min}_{\langle\theta,\theta_{u}\rangle}\bigg{(}\frac{1}{2n}\\|Y-H^{(k)}_{e}\theta\\|^{2}+\lambda\\|\theta\\|^{2}+\gamma\\|\theta_{u}\\|^{2}$
(20)			$\displaystyle-\beta\text{MAD}(H^{(k)}_{e})\bigg{)}$

First, we ignore the $\beta\text{MAD}(H^{(k)}_{e})$ term as it may introduce complex non-linear dependencies. We only consider the preceding terms and the problem can be reformulated as:

(21)

\langle\hat{\theta},\hat{\theta_{u}}\rangle=\operatorname*{arg\,min}_{\langle\theta,\theta_{u}\rangle}\left(\frac{1}{2n}\|Y-H^{(k)}(\theta_{u})\theta\|^{2}+\lambda\|\theta\|^{2}+\gamma\|\theta_{u}\|^{2}\right)

Next, we derive each component and set the derivatives to zero to find the minima. When solving for $\theta$ , we assume $H^{(k)}_{e}(\theta_{u})$ is known. First, we write the Lagrangian form:

(22)

L(\theta,\theta_{u})=\frac{1}{2n}\|Y-H^{(k)}_{e}\theta\|^{2}+\lambda\|\theta\|^{2}+\gamma\|\theta_{u}\|^{2}

Then we take the derivative with respect to $\theta$ and set it to zero:

(23)

\nabla_{\theta}L=-\frac{1}{n}(H^{(k)}_{e})^{T}(Y-H^{(k)}_{e}\theta)+2\lambda\theta=0

We apply a rearranging and obtain:

(24)

\frac{1}{n}(H^{(k)}_{e})^{T}Y=\left(\frac{1}{n}(H^{(k)}_{e})^{T}H^{(k)}_{e}+\lambda I\right)\theta

So we can get:

(25)

\theta=\left(\frac{1}{n}(H^{(k)}_{e})^{T}H^{(k)}_{e}+\lambda I\right)^{-1}\frac{1}{n}(H^{(k)}_{e})^{T}Y

When solving for $\theta_{u}$ , for a fixed solution $\hat{\theta}$ , we take the derivative with respect to $\theta_{u}$ and set it to zero:

(26)		$\displaystyle\nabla_{\theta_{u}}L$	$\displaystyle=\frac{\partial}{\partial\theta_{u}}\left(\frac{1}{2n}\\|Y-H^{(k)}(\theta_{u})\hat{\theta}\\|^{2}\right)+\nabla_{\theta_{u}}(\gamma\\|\theta_{u}\\|^{2})$
(26)			$\displaystyle\quad\quad-\nabla_{\theta_{u}}(\beta\text{MAD}(H^{(k)}(\theta_{u})))$

Setting the derivative to zero, we can get:

(27)			$\displaystyle\frac{1}{n}\left(\left(\frac{\partial H^{(k)}(\theta_{u})}{\partial\theta_{u}}\right)^{T}(Y-H^{(k)}(\theta_{u})\hat{\theta})\hat{\theta}\right)-2\gamma\theta_{u}$
(27)			$\displaystyle+\beta\frac{\partial}{\partial\theta_{u}}\text{MAD}(H^{(k)}(\theta_{u}))=0$

We apply a rearranging and obtain:

(28)		$\displaystyle\theta_{u}=$	$\displaystyle\gamma^{-1}\bigg{(}\frac{1}{2n}\left(\frac{\partial H^{(k)}(\theta_{u})}{\partial\theta_{u}}\right)^{T}(Y-H^{(k)}(\theta_{u})\hat{\theta})\hat{\theta}$
(28)			$\displaystyle-\frac{\beta}{2}\frac{\partial}{\partial\theta_{u}}\text{MAD}(H^{(k)}(\theta_{u}))\bigg{)}$

End of proof.

B.2. Proof of Lemma 3.3

Following the Theorem 4 and Section 4.2 of the work of (Keriven, 2022), they adopt the latent space random graph model, and assume that the observed node features are projections of some underlying latent features with Gaussian distribution $\mathcal{N}(0,\Sigma)$ . They get the regression risk with $k$ step of smoothing as:

\mathcal{R}^{(k)}\simeq\mathcal{R}_{reg.}(\Sigma^{(k)})=\mathcal{R}_{reg.}((I+\Sigma^{-1})^{-2k}\Sigma)

Here $A=(I+\Sigma^{-1})^{-1}$ , then the risk become $\mathcal{R}^{(k)}\simeq\mathcal{R}_{reg.}(A^{2k}\Sigma)$ . Thus, the rate of smoothing of original MP is $2k$ due to the $A^{2k}\Sigma$ operator.

Now following the (Azabou et al., 2023), we compute the regression risk with $k$ step of smoothing with UniGAP. We define $p=\mathbb{E}(\hat{P}[:,0])$ and $\hat{P}$ denotes the upsampling probability at the optimal graph structure, $U_{i}$ is the inserted nodes between node $i$ and $V_{i}$ . After we insert nodes in original edges, begin with the first round of message passing, we can get:

(29)	$\displaystyle\forall i,x_{i}^{(1)}$	$\displaystyle=\sum_{u\in U_{i}}a_{ij}\hat{x}^{(0)}_{u}$
	$\displaystyle\mathbb{E}(x_{i}^{(1)})$	$\displaystyle=\sum_{j\in V_{i}}a_{ij}((1-p)x_{i}^{(0)}+px_{j}^{(0)})$
		$\displaystyle=(1-p)x_{i}^{(0)}(\sum_{j\in V_{i}}a_{ij})+p\sum_{j\in V_{i}}a_{ij}x_{j}^{(0)}$
		$\displaystyle=(1-p)x_{i}^{(0)}+p\sum_{j\in V_{i}}a_{ij}x_{j}^{(0)}$

Here $U_{i}$ is the upsampled inserted nodes between node $i$ and its neighbor $j$ . For inserted nodes, their features are:

(30)

\forall i,j\in V_{i},u\in U_{i},\hat{x}^{(1)}_{u}=x^{(0)}_{j}

At the second round of message passing, the original nodes are updated as follows:

(31)

\displaystyle\forall i,j\in V_{i},u\in U_{i},x_{i}^{(2)}

\displaystyle=\sum_{u\in U_{i}}a_{ij}\hat{x}^{(1)}_{u}=\sum_{u\in U_{i}}a_{ij}x^{(0)}_{j}\simeq Ax_{i}^{(0)}

The the features of inserted nodes are same as the the source node:

(32)

\forall i,j\in V_{i},u\in U_{i},\hat{x}^{(2)}_{u}=x^{(1)}_{j}\simeq(1-p)x_{j}+pAx_{j}

Repeat the process of message passing of the first round and the second round, we can get a recurrence relation:

(33)		$\displaystyle x_{i}^{(k+1)}$	$\displaystyle\simeq A\hat{x}^{(k)}_{u}$
(33)		$\displaystyle\hat{x}^{(k+1)}_{u}$	$\displaystyle=x_{j}^{(k)}$

By associating the above equations we can get the connection of $i$ and $j$ , as $x_{i}^{(k+2)}\simeq A\hat{x}^{(k+1)}_{u}=Ax_{j}^{(k)}$ . Then we use $2k+1$ to denote an odd number and get

(34)	$\displaystyle x_{i}^{(2k+1)}$	$\displaystyle\simeq A^{k}{x}^{(1)}_{j}$
		$\displaystyle=A^{k}((1-p)x_{i}^{(0)}+pAx_{i}^{(0)})$
		$\displaystyle=(1-p)A^{k}x_{i}+pA^{k+1}x_{i}$

Then we replace $A$ as $(I+\Sigma^{-1})^{-1}$ and get

(35)

\Sigma_{UniGAP}^{(k)}=\frac{1}{2}A^{k-1}(I+((1-p)I+pA)^{2})\Sigma

End of proof.

Appendix C Algorithm

This section we show the pseudo-code identity of the UniGAP in algorithm 1.

Input: Input graph

G

, MVC encoder parameters

\omega

, UniGAP module parameters

\phi

, downstream model parameters

\psi

, number of epochs

num\_epochs

Output: Optimized model

Function TrajectoryPrecomputation( $G$ ):

T\leftarrow\mathcal{P}(G)

;

// Compute initial trajectories

return

T

Function MVCEncoder( $T,\omega$ ):

\hat{T}\leftarrow\mathcal{F}_{\omega}(T)

;

// Compute Multi-View Condensation

return

\hat{T}

Function AdaptiveUpsampler( $\hat{T},\phi$ ):

P\leftarrow\mathcal{U}_{\phi}(\hat{T})

;

// Compute upsampling probabilities

\hat{G}\leftarrow\mathcal{S}(P)

;

// Upsampling with gumbel-softmax

return

\hat{G}

Function DownstreamModel( $\hat{G},\psi$ ):

for $i=1$ to $num\_layers$ do

t_{i}\leftarrow Layer_{i}(\hat{G},\psi_{i})

;

end for

T\leftarrow\operatorname{concat}(t_{1},t_{2},\cdots t_{l})

;

\mathcal{L}\leftarrow\operatorname{compute\_loss}(\hat{G},\omega,\phi,\psi)

;

optimize_modules(

\mathcal{L}

) ;

return

T,\mathcal{L}

Function Main( $G,\omega,\phi,\psi$ ):

T\leftarrow

TrajectoryPrecomputation( $G$ ) ;

for $epoch=0$ to $num\_epochs$ do

\hat{T}\leftarrow

MVCEncoder( $T,\omega$ ) ;

\hat{G}\leftarrow

AdaptiveUpsampler( $\hat{T},\phi$ ) ;

T

\mathcal{L}\leftarrow

DownstreamModel( $\hat{G},\psi$ ) ;

end for

return

optimized\_model

;

optimized\_model\leftarrow

Main( $G,\omega,\phi,\psi$ ) ;

Algorithm 1 The UniGAP Algorithm