Graph Structure Learning with Bi-level Optimization
Abstract
Currently, most Graph Structure Learning (GSL) methods improve the robustness of GNNs merely from a local view: they consider the local information related to each edge and indiscriminately apply the same mechanism across edges, which may suffer from the local structure heterogeneity of the graph (i.e., the uneven distribution of inter-class connections over nodes). To overcome these drawbacks, we extract the graph structure as a learnable parameter and jointly learn the structure and the common parameters of the GNN from the global view. Notably, the common parameters contain the global information for node feature mapping, which is also crucial for structure optimization (i.e., optimizing the structure relies on the global mapping information). Mathematically, we apply a generic structure extractor to abstract the graph structure and transform GNNs into the form of learning structure and common parameters. We then model the learning process as a novel bi-level optimization, i.e., Generic Structure Extraction with Bi-level Optimization for Graph Structure Learning (GSEBO), which optimizes the common GNN parameters in the inner level to obtain the global mapping information and optimizes the graph structure in the outer level with the global information learned from the inner level. We instantiate the proposed GSEBO on classical GNNs and compare it with state-of-the-art GSL methods. Extensive experiments on four real-world datasets validate the effectiveness of the proposed GSEBO.
1 Introduction
Based on the homophily assumption of “like to associate with like” McPherson et al. (2001); Yin et al. (2023c); Shou et al. (2023), Graph Neural Networks (GNNs) Scarselli et al. (2009); Yin et al. (2024a, b); Ju et al. (2024) have become a promising solution for node classification. However, a large portion of edges are inter-class connections Pandit et al. (2007), and representation propagation over such connections can largely hinder the GNN from obtaining class-separated node representations, hurting its performance.
Existing GSL methods are roughly categorized into attentive mechanisms, noise detection, and probabilistic mechanisms. Attentive mechanisms calculate weights for edges to adjust the contribution of different neighbors during representation propagation Veličković et al. (2018); Wang et al. (2021); Jiang et al. (2019); Chen et al. (2020b); Zhao et al. (2021a). These methods can hardly work well in practice for two reasons: (1) the mechanism may not generalize well to all nodes with different local structures (cf. Figure 1(a)); and (2) the attention cannot be easily trained well due to the limited labeled data Knyazev et al. (2019). Noise detection methods incorporate an edge classifier to estimate the probability of inter-class connection for each edge Zheng et al. (2020); Zhao et al. (2021b); Luo et al. (2021); Kazi et al. (2020); Franceschi et al. (2019); Li et al. (2018). Although the classifier can be better trained owing to the supervision of edge labels, it also suffers from local structure heterogeneity and lacks consideration of global information. Probabilistic mechanisms model connections from a global view: they assume a prior distribution over edges and estimate GNN parameters with Bayesian optimization to overcome the impact of inter-class edges Zhang et al. (2019); Elinas et al. (2020); Wang et al. (2020); Wu et al. (2020). Although the edge-specific parameterization largely enhances the model's representation ability, it is hard to accurately specify the prior distribution.
Despite the achievements of the existing methods, they share some common drawbacks: (1) Edge modeling: existing methods model edges with a parameter-sharing mechanism, which may suffer from the local structure heterogeneity problem; (2) Local optimization: the parameters are optimized with the information of neighboring nodes only, ignoring the impact from the global view. Therefore, we come up with two key considerations for GSL: (1) modeling graph connections in an edge-specific manner instead of with a shared mechanism; and (2) optimizing the corresponding parameters with a global objective of accurately classifying all nodes. The edge-specific modeling can overcome the local structure heterogeneity, i.e., handling nodes with different properties (e.g., node 1 and node 9 in Figure 1(a)) via different strategies. Besides, blindly removing inter-class edges increases the risk of misclassifying the target nodes (in the dashed circle) by cutting off their connections to labeled neighbors in the same class (e.g., the edge between node 5 and node 9 in Figure 1(b)). Thus, it is necessary to optimize the graph structure from the global view instead of the local one.
However, it is non-trivial to achieve the targets mentioned above due to the following challenges: (1) the graph structure is embedded into the GNN model and, once updated, affects the procedure of model parameter optimization, requiring a careful design of the training algorithm; (2) the calculation of the ideal global objective is intractable due to the limited labeled nodes, especially under the hard semi-supervised setting.

In this work, we propose Generic Structure Extraction with Bi-level Optimization for Graph Structure Learning (GSEBO), which optimizes the graph structure and learns the node embeddings from the global view. In particular, we first devise a new generic structure extractor, which accounts for the graph structure with both the connectedness between nodes and the strength of connections. In addition to the adjacency matrix, the extractor adopts a learnable matrix to represent the graph structure and adjust the representation propagation. Moreover, we design a bi-level optimization algorithm where the outer and inner optimizations update the structure and the parameters of the base graph convolution (e.g., feature mapping parameters), respectively. In this way, we decompose the hard optimization issue of GSEBO into two easier ones. In addition, we set the objective of the outer optimization as the validation loss to better approximate the ideal global objective. The proposed generic structure extractor can be extended to most existing graph convolution operators. We instantiate it on four representative GNN models (i.e., GCN Kipf and Welling (2017), GAT Veličković et al. (2018), GraphSAGE Hamilton et al. (2017), and JK-Net Xu et al. (2018)) and compare GSEBO with state-of-the-art GSL methods on four real-world node classification datasets. Extensive experiments justify the rationality, effectiveness and robustness of the proposed method. In summary, our main contributions are as follows:
• We propose a novel GSEBO with bi-level optimization for edge-specific graph structure learning, which learns the graph structure from a global view by optimizing a global objective of node classification.
• We devise a generic structure extractor, which parameterizes the strength of each edge during representation propagation. Besides, we summarize how classical GNN methods can be transformed into the form of learnable graph structure with the generic structure extractor.
• We evaluate the proposed GSEBO with classical GNNs as backbones and compare it with state-of-the-art GSL methods. Extensive experiments on four real-world datasets show the superior learning ability of the proposed method over existing methods.
2 Related Work

Attentive mechanism.
The attentive mechanism methods adaptively learn the weights of edges and adjust the contributions of neighbor nodes Pang et al. (2023); Yin et al. (2022a, 2023d). MAGNA Wang et al. (2021) incorporates multi-hop context information into every layer of attention computation. IDGL Chen et al. (2020b) uses the multi-head self-attention mechanism to reconstruct the graph, which has the ability to add new nodes without retraining. HGSL Zhao et al. (2021a) extends graph structure learning to heterogeneous graphs, constructing different feature propagation graphs and fusing them together in an attentive manner. However, these methods can hardly generalize across different local structures and are difficult to train.
Noise detection.
The noise detection methods leverage an off-the-shelf pre-trained model to induce node embeddings or labels and incorporate an edge classifier to estimate the probability of each edge Yin et al. (2023b, ). NeuralSparse Zheng et al. (2020) considers the graph sparsification task by removing irrelevant edges. GAUG Zhao et al. (2021b) utilizes a GNN to parameterize the categorical distribution instead of the MLP in NeuralSparse. PTDNet Luo et al. (2021) prunes task-irrelevant edges by penalizing the number of edges in the sparsified graph with parameterized networks. Even though noise detection methods can be well trained with the supervision of edge labels, the edge classifier also suffers from local structure heterogeneity and lacks consideration of global information.
Probabilistic mechanism.
This type of method assumes a prior distribution of the graph or noise, estimates the parameters from observed values, and then resamples the edges or noise to obtain a new graph Yin et al. (2023a); Zhang et al. (2019); Yin et al. (2022b). BGCN Zhang et al. (2019) estimates the parameter distribution of edges and communities by sampling edges from the graph, and resamples new graphs with the estimated parameters for prediction. VGCN Elinas et al. (2020) learns a graph distribution close to the original structure through the ELBO, and resamples graphs for prediction. However, since both BGCN and VGCN sample from the noisy graph, the estimated parameters also contain noise. DenNE Wang et al. (2020) assumes the observed graph is composed of real values and noise, with known prior distributions of features and noise. With a generative model, the likelihood is used to estimate the representation of nodes. However, this method heavily relies on the priors of feature and noise, which are difficult to obtain accurately.
Bi-level optimization on GNN.
LDS Franceschi et al. (2019) jointly learns the graph structure and GNN parameters by solving a bi-level optimization problem that learns a discrete probability distribution for each edge. According to the learned distributions, LDS generates a new graph structure by sampling. Towards this end, LDS sets the objective of the outer optimization as generating the observed edges, which clearly has a gap to the overall classification objective. Moreover, LDS needs to estimate at least $\mathcal{O}(N^2)$ distribution parameters (one per node pair), which is hard given the insufficient labels. On the contrary, our method only activates a small portion of entries in $\boldsymbol{S}$, namely those where $\boldsymbol{A}_{uv}=1$, i.e., the number of estimated parameters equals the number of edges in the graph (i.e., $M$).
3 Methodology
We first introduce the essential preliminaries of GNNs, and then elaborate on the graph convolution operator and the bi-level optimization algorithm of the proposed GSEBO.
3.1 Preliminary
Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ represent a graph with $N$ nodes and $M$ edges, where $\mathcal{V}$ and $\mathcal{E}$ denote the set of nodes and edges, respectively. $\boldsymbol{X}\in\mathbb{R}^{N\times d}$ are the node features, where $\boldsymbol{x}_i$, the $i$-th row of $\boldsymbol{X}$, corresponds to node $i$ in the form of a $d$-dimensional vector. The adjacency matrix $\boldsymbol{A}\in\{0,1\}^{N\times N}$ indicates the connectedness of node pairs.
Node classification.
This task aims to learn a classifier $f(\cdot;\theta)$ from a set of labeled nodes $\mathcal{V}_L\subset\mathcal{V}$ to forecast the labels of the remaining nodes, where $\theta$ denotes the model parameters. Assuming there are $C$ labels, we index them from $1$ to $C$ without loss of generality. Formally, $\{y_v \mid v\in\mathcal{V}_L\}$ are the labels of the labeled nodes, where $y_v\in\{1,\dots,C\}$ is the label of node $v$. The target is achieved by optimizing the model parameters over the labeled nodes, which is formulated as:
$$\min_{\theta}\ \sum_{v\in\mathcal{V}_L} \ell\big(f(\boldsymbol{X},\boldsymbol{A};\theta)_v,\ y_v\big) + \lambda\,\|\theta\|_2^2, \qquad (1)$$
where $\ell(\cdot,\cdot)$ is a classification loss (e.g., cross-entropy), and $\lambda$ is a hyperparameter to adjust the strength of parameter regularization.
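To make the objective concrete, the following is a minimal PyTorch-style sketch of Eq. (1), assuming `logits` come from a GNN forward pass over the whole graph; the function name, argument names, and the use of cross-entropy are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def node_classification_loss(logits, labels, labeled_idx, params, lam):
    """Eq. (1): cross-entropy over the labeled nodes plus an L2 penalty
    on the model parameters, weighted by the hyperparameter lam."""
    ce = F.cross_entropy(logits[labeled_idx], labels[labeled_idx])
    l2 = sum((p ** 2).sum() for p in params)
    return ce + lam * l2
```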
3.2 Generic Structure Extraction with Bi-level Optimization
To optimize the graph structure, the key considerations lie in (1) decoupling the graph structure from the GNN to account for edge-specific modeling and (2) learning the graph structure from the global information encoded in the common parameters $\theta$.
Generic structure extractor.
Towards the first purpose, the core idea is to decompose the graph structure information into connectedness (the edges in the adjacency matrix) and the strength of connection (a latent variable). In general, there are two ways to model the connection strength, depending on whether or not they rely on the inductive bias of translation invariance. On one hand, attentive mechanisms and noise detection models are translation invariant, decoding the connection strength of each edge from its local features. However, considering the local structure heterogeneity of most real-world graphs Xu et al. (2018), it is risky to rely on the translation-invariance bias. On the other hand, probabilistic mechanisms separately model the connection strength of each edge, where each edge corresponds to a specific distribution. However, it is non-trivial to set a proper prior in practice. According to these pros and cons, we summarize two considerations for extending the graph convolution: (1) edge-specific modeling; and (2) optimization without a prior.
Towards this end, we model the connection strength as a parameter matrix $\boldsymbol{S}$ with the same size as $\boldsymbol{A}$. Formally, the generic structure extractor (GSE) is abstracted as:
$$\boldsymbol{H}^{(l)} = \mathrm{COM}\Big(\boldsymbol{H}^{(l-1)},\ \mathrm{AGG}\big(\boldsymbol{A}\odot\sigma(\boldsymbol{S}),\ \boldsymbol{H}^{(l-1)}\big)\Big),$$
where $\boldsymbol{H}^{(l)}$ denotes the node representations at layer $l$ (with $\boldsymbol{H}^{(0)}=\boldsymbol{X}$), and $\sigma(\cdot)$ is a non-negative activation function since the value of strength is always positive (in this work, we use the $\min[\max[0,x],1]$ function to restrict the value within $[0,1]$). COM and AGG are the combination and aggregation functions, respectively.
Noteworthy, different from standard GNNs, GSE decouples the graph structure from the GNN and treats it as a learnable objective. Besides, GSE is a generic extractor that can be instantiated over most existing graph convolutions (cf. Appendix A). In this way, as long as the connection strength is learned properly, GSEBO can downweight the neighbors with inter-class connections during representation propagation, reducing the impact of inter-class edges through bi-level optimization.
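As a concrete illustration, the following is a minimal sketch of a GSE layer instantiated on a GCN-style convolution with a dense adjacency matrix; the class name `GSELayer`, the mean-style aggregation, and the ReLU combination are illustrative assumptions rather than the paper's exact operator.

```python
import torch
import torch.nn as nn

class GSELayer(nn.Module):
    """Sketch of a generic structure extractor (GSE) layer.

    The learnable strength matrix S has the same shape as the adjacency
    matrix A; clamp(0, 1) plays the role of the non-negative activation,
    and only entries of observed edges (A == 1) take effect."""

    def __init__(self, in_dim, out_dim, num_nodes):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)     # common parameters (theta)
        self.S = nn.Parameter(torch.ones(num_nodes, num_nodes))  # connection strength

    def forward(self, X, A):
        strength = A * self.S.clamp(0.0, 1.0)                    # sigma(S) restricted to observed edges
        deg = strength.sum(dim=1, keepdim=True).clamp(min=1e-6)
        agg = (strength / deg) @ self.weight(X)                   # AGG: strength-weighted neighbor mean
        return torch.relu(agg)                                    # COM: a simple nonlinearity here
```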

Bi-level optimization.
To achieve the second purpose of learning the graph structure from global information, it is essential to carefully design a proper training algorithm to optimize the connection strength matrix $\boldsymbol{S}$. Assume that we construct GSEBO with $L$ layers, denoted as $f(\boldsymbol{X},\boldsymbol{A},\boldsymbol{S};\theta)$ with common parameters $\theta$. We have three main considerations for designing the training algorithm: (1) Connection strength is a relative value, which changes across different views. As shown in Figure 1(b), the connection between node 3 and node 5 is weak from the local view, i.e., the edge is inter-class and should be assigned a low weight. However, this edge is essential for the classification of nodes 9, 10, and 11, and thus deserves a high weight. Therefore, the optimization objective of $\boldsymbol{S}$ should be the overall performance of node classification. (2) $\boldsymbol{S}$ and $\theta$ play different roles, but are closely related. The role of $\boldsymbol{S}$ is close to that of a hyperparameter, restricting the model space for the mapping from node features to labels, while $\theta$ carries the global mapping information for classification, which relieves the drawbacks of local optimization. That is, an update on $\boldsymbol{S}$ will adjust $\theta$ and its optimization procedure. Therefore, the optimizations of $\boldsymbol{S}$ and $\theta$ lie at two different but dependent levels. (3) Directly minimizing the objective function of Eq. (1) to obtain the parameters $\theta$ and $\boldsymbol{S}$ cannot achieve the desired purpose of learning the structure because of over-fitting, as shown in Figure 3.
Towards this end, we propose a bi-level optimization Domke (2012); Maclaurin et al. (2015); Franceschi et al. (2017) algorithm with inner and outer optimizations to learn $\theta$ and $\boldsymbol{S}$, respectively. Figure 2 shows the overall procedure of the algorithm, where the inner and outer optimization steps are iteratively executed until a stop condition is met.
$$\boldsymbol{S}^{*} = \arg\min_{\boldsymbol{S}}\ \mathcal{L}_{outer}\big(f(\boldsymbol{X},\boldsymbol{A},\boldsymbol{S};\theta^{*}(\boldsymbol{S}))\big), \qquad (2)$$
$$\text{s.t.}\quad \theta^{*}(\boldsymbol{S}) = \arg\min_{\theta}\ \mathcal{L}_{inner}\big(f(\boldsymbol{X},\boldsymbol{A},\boldsymbol{S};\theta)\big), \qquad (3)$$
where $\mathcal{L}_{inner}$ and $\mathcal{L}_{outer}$ denote the objectives of the inner and outer optimizations, respectively.
Inner optimization for common parameters. This step is similar to the normal training of a GNN. In particular, we update $\theta$ with a gradient-descent-based optimizer (e.g., Adam Kingma and Ba (2015)) over the training nodes by minimizing Eq. (3) (i.e., setting $\mathcal{L}_{inner}$ as the classification loss over the training nodes). When calculating the gradient of $\theta$, we treat the connection strength matrix $\boldsymbol{S}$ as constant.
After $T$ steps of gradient descent, an approximate solution of the inner optimization problem is obtained. For $t=1$ to $T$, the updated parameter is calculated as $\theta_t = \theta_{t-1} - \alpha\,\nabla_{\theta}\mathcal{L}_{inner}(\theta_{t-1})$, where $\alpha$ is the inner learning rate. The process of updating $\theta$ is shown on lines 3-5 in Algorithm 1.
Outer optimization for graph structure. Similarly, we treat the sequence of inner parameters $\theta_1,\dots,\theta_T$ as constant to optimize the graph structure $\boldsymbol{S}$. Ideally, the objective should be the overall classification loss of all nodes in the graph. Formally,
$$\min_{\boldsymbol{S}}\ \sum_{v\in\mathcal{V}} \ell\big(f(\boldsymbol{X},\boldsymbol{A},\boldsymbol{S};\theta^{*}(\boldsymbol{S}))_v,\ y_v\big).$$
Apparently, the calculation of this ideal objective is intractable since most node labels are unknown. Similar to the inner optimization of the common parameters, we could approximate the ideal objective with the empirical risk over the training nodes. However, this easily suffers from the over-fitting issue, as shown in Figure 3. We believe the empirical risk over the validation nodes is a better approximation of the ideal objective since it reflects to what extent the parameters generalize. In this light, we set $\mathcal{L}_{outer}$ as the classification loss over the validation set and optimize $\boldsymbol{S}$ by minimizing Eq. (2).
The details of updating $\boldsymbol{S}$ are shown on lines 8-11 in Algorithm 1, and the mathematical formulation is given in Appendix B. With $\boldsymbol{S}$ updated by the graph structure optimization, we re-optimize $\theta$ with the updated $\boldsymbol{S}$ via the inner optimization, and repeat this process until the early stopping condition is met. To summarize, Algorithm 1 shows the training procedure of GSEBO. Besides, we explain in Appendix C why the bi-level optimization can defeat the over-fitting problem when training on the training and validation sets.
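For intuition, the following is a first-order sketch of the alternating procedure described above (the hypergradient details of Appendix B are omitted); the function name, the index tensors `train_idx`/`val_idx`, and the convention that strength parameters are named `S` are illustrative assumptions.

```python
import torch

def train_gsebo(model, X, A, y, train_idx, val_idx,
                inner_steps=10, inner_lr=0.01, outer_lr=0.01, patience=20):
    """First-order sketch of the alternating bi-level optimization.
    Inner loop: update common parameters theta on the training nodes, S frozen.
    Outer step: update the strength matrix S on the validation nodes, theta frozen."""
    theta = [p for n, p in model.named_parameters() if not n.endswith("S")]
    S = [p for n, p in model.named_parameters() if n.endswith("S")]
    inner_opt = torch.optim.Adam(theta, lr=inner_lr)
    outer_opt = torch.optim.Adam(S, lr=outer_lr)
    ce = torch.nn.CrossEntropyLoss()

    best_val, wait = float("inf"), 0
    while wait < patience:
        # Inner optimization: T gradient steps on the training loss (Eq. (3)).
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            loss = ce(model(X, A)[train_idx], y[train_idx])
            loss.backward()
            outer_opt.zero_grad()   # discard gradients accumulated on S
            inner_opt.step()
        # Outer optimization: one step on the validation loss (Eq. (2)).
        outer_opt.zero_grad()
        val_loss = ce(model(X, A)[val_idx], y[val_idx])
        val_loss.backward()
        inner_opt.zero_grad()       # discard gradients accumulated on theta
        outer_opt.step()
        # Early stopping on the validation loss.
        if val_loss.item() < best_val:
            best_val, wait = val_loss.item(), 0
        else:
            wait += 1
```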
4 Experiments
Table 1: Node classification accuracy (%) on the test set (mean ± std over 10 runs) of GSEBO and connection modeling baselines instantiated on classical GNN backbones.

| Method | Cora | Citeseer | Terrorist | Air-USA |
|---|---|---|---|---|
| GCN | | | | |
| Vanilla | 81.6 ± 0.7 | 71.6 ± 0.4 | 70.0 ± 1.1 | 56.0 ± 0.8 |
| AdaEdge | 81.9 ± 0.7 | 72.8 ± 0.7 | 71.0 ± 1.9 | 57.2 ± 0.8 |
| DropEdge | 82.0 ± 0.8 | 71.8 ± 0.2 | 70.3 ± 0.9 | 56.9 ± 0.6 |
| GAUG | 83.2 ± 0.7 | 73.0 ± 0.8 | 71.4 ± 2.0 | 57.9 ± 0.4 |
| GSEBO | 84.0 ± 0.4 | 74.4 ± 0.5 | 72.1 ± 0.6 | 59.8 ± 0.6 |
| GAT | | | | |
| Vanilla | 81.3 ± 1.1 | 70.5 ± 0.7 | 67.3 ± 0.7 | 52.0 ± 1.3 |
| AdaEdge | 82.0 ± 0.6 | 71.1 ± 0.8 | 72.2 ± 1.4 | 54.5 ± 1.9 |
| DropEdge | 81.9 ± 0.6 | 71.0 ± 0.5 | 69.9 ± 1.1 | 52.8 ± 1.7 |
| GAUG | 81.6 ± 0.8 | 69.9 ± 1.4 | 68.8 ± 1.1 | 53.0 ± 2.0 |
| GSEBO | 82.9 ± 0.1 | 72.1 ± 0.8 | 69.7 ± 1.6 | 57.0 ± 0.3 |
| GraphSAGE | | | | |
| Vanilla | 81.3 ± 0.5 | 70.6 ± 0.5 | 69.3 ± 1.0 | 57.0 ± 0.7 |
| AdaEdge | 81.5 ± 0.6 | 71.3 ± 0.8 | 72.0 ± 1.8 | 57.1 ± 0.5 |
| DropEdge | 81.6 ± 0.5 | 70.8 ± 0.5 | 70.1 ± 0.8 | 57.1 ± 0.5 |
| GAUG | 81.7 ± 0.3 | 71.4 ± 1.0 | 70.4 ± 0.5 | 55.0 ± 1.1 |
| GSEBO | 80.4 ± 0.9 | 73.0 ± 0.5 | 76.5 ± 0.9 | 59.0 ± 1.3 |
| JK-Net | | | | |
| Vanilla | 78.8 ± 1.5 | 67.6 ± 1.8 | 70.7 ± 0.7 | 53.1 ± 0.8 |
| AdaEdge | 80.4 ± 1.4 | 68.9 ± 1.2 | 71.2 ± 0.7 | 59.4 ± 1.0 |
| DropEdge | 80.4 ± 0.7 | 69.4 ± 1.1 | 70.2 ± 1.3 | 58.9 ± 1.4 |
| GAUG | 79.4 ± 1.3 | 68.9 ± 1.3 | 70.2 ± 0.5 | 52.3 ± 1.8 |
| GSEBO | 81.8 ± 1.0 | 69.4 ± 1.0 | 73.7 ± 1.2 | 59.6 ± 1.2 |
In this section, we conduct experiments on four datasets to answer the following research questions: (1) how does the proposed GSEBO perform compared with state-of-the-art methods? (2) how robust is the proposed GSEBO? (3) to what extent does the proposed GSEBO decrease the impact of inter-class connections? (4) what factors influence the effectiveness of the proposed GSEBO? We present the results for questions (3) and (4) in Appendices D and E due to space limitations.
4.1 Experimental Setup
Dataset.
We select four widely used real-world node classification benchmark datasets: citation networks (Cora and Citeseer Kipf and Welling (2017)), a social network (Terrorist Zhao et al. (2006)), and an air traffic network (Air-USA Wu et al. (2019)). We summarize the statistics of the datasets and describe them in detail in Appendix F. We adopt the same data split of Cora and Citeseer as Kipf and Welling (2017), and a training/validation/testing split with a ratio of 10:20:70 on the other datasets Zhao et al. (2021b).
Compared methods.
We apply GSEBO on four representative GNN architectures: GCN Kipf and Welling (2017), GAT Veličković et al. (2018), GraphSAGE Hamilton et al. (2017) and JK-Net Xu et al. (2018). For each GNN, we compare GSEBO with the vanilla version and three variants with state-of-the-art connection modeling methods: AdaEdge Chen et al. (2020a), DropEdge Rong et al. (2020) and GAUG Zhao et al. (2021b). In addition, we compare GSEBO with GSL methods: BGCN Zhang et al. (2019), VGCN Elinas et al. (2020), PTDNet Luo et al. (2021), and an advanced attention mechanism: MAGNA Wang et al. (2021). Note that the base model of GSEBO, BGCN, and PTDNet is GCN.
Implementation details.
In the experiments, the latent dimension of all the methods is set to 16. The parameters of all baseline methods are initialized as in the corresponding papers and carefully tuned to achieve optimal performance. The learning rates of the inner and outer optimizations are set to 0.01. The hyperparameter $\lambda$ is set to , and we search for the inner learning depth $T$ over the range [5, 10, 15, 20, 25]. Since the bi-level optimization cannot guarantee convergence, we set the patience of early stopping to 20 to terminate the optimization of GSEBO. To prevent over-fitting, the dropout ratio is set to 0.5 for all the methods. The network architectures of all the methods are configured to be the same as described in the original papers. Our experiments are conducted with TensorFlow running on GPU machines (NVIDIA 2080Ti). For all the compared methods, we report the average accuracy on the test set over 10 runs.
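For reference, the settings above can be collected into a single configuration sketch; the dictionary name and keys are illustrative, and the value of $\lambda$ is omitted because it is not listed in the text.

```python
# Illustrative summary of the experimental settings described above.
GSEBO_CONFIG = {
    "hidden_dim": 16,                          # latent dimension for all methods
    "inner_lr": 0.01,                          # learning rate of the inner optimization
    "outer_lr": 0.01,                          # learning rate of the outer optimization
    "inner_depth_grid": [5, 10, 15, 20, 25],   # searched inner learning depth T
    "early_stopping_patience": 20,             # patience for early stopping
    "dropout": 0.5,                            # dropout ratio to prevent over-fitting
    "num_runs": 10,                            # test accuracy averaged over 10 runs
}
```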
Table 2: Node classification accuracy (%) on the test set (mean ± std over 10 runs) of GSEBO and state-of-the-art GSL methods.

| Method | Cora | Citeseer | Terrorist | Air-USA |
|---|---|---|---|---|
| BGCN | 81.2 ± 0.8 | 72.4 ± 0.5 | 70.3 ± 0.8 | 56.5 ± 0.9 |
| VGCN | 64.4 ± 0.2 | 67.8 ± 0.8 | 73.8 ± 0.9 | 53.3 ± 0.3 |
| PTDNet | 82.8 ± 2.6 | 72.7 ± 1.8 | 68.3 ± 1.6 | 53.4 ± 1.4 |
| MAGNA | 81.7 ± 0.4 | 66.4 ± 0.1 | 67.2 ± 0.1 | 55.1 ± 1.2 |
| GSEBO | 84.0 ± 0.4 | 74.4 ± 0.5 | 72.1 ± 0.6 | 59.8 ± 0.6 |
4.2 Performance Comparison
Table 1 shows the performance of GSEBO instantiated with classical GNNs, and Table 2 presents the comparison of GSEBO with state-of-the-art GSL methods. From Table 1 and Table 2, we have the following observations:


Improvement over baselines.
GSEBO outperforms the baselines in most cases. Considering the average performance over the four datasets, the improvement of GSEBO over the baselines is in the range of [3.0%, 12.5%], which validates the effectiveness of the proposed method. In particular, (1) Probabilistic mechanisms. The performance gains of BGCN and VGCN over the vanilla GCN are limited, which might be because the assumption on the prior distribution is not satisfied. This result shows the rationality of relaxing the prior assumption and modeling the connection strength with parameters directly. (2) Connection modeling. DropEdge, AdaEdge, GAUG, and PTDNet achieve better performance than vanilla GNNs, modifying the graph structure from different perspectives such as smoothness Chen et al. (2020a) and robustness Rong et al. (2020). These results reflect the benefit of connection modeling. However, there is still a clear gap between these methods and GSEBO, which we attribute to the global objective for learning the edge strength. (3) Attention mechanism. The performance of MAGNA is inferior, which is consistent with the analysis in Knyazev et al. (2019).
Effects across GNN architectures.
On the four GNN architectures, GSEBO achieves better performance than the vanilla version in most cases. In particular, GSEBO achieves an average improvement across datasets of 4.2%, 4.4%, 4.0%, and 5.74% over the vanilla GCN, GAT, GraphSAGE, and JK-Net, respectively. These results justify the effectiveness of the structure learning of the proposed GSEBO. Across the four architectures, GSEBO achieves the largest improvement over JK-Net. We postulate the reason is that the jump connection in JK-Net makes it aggregate more hops of neighbors than the other GNNs. As the number of hops increases, the homophily ratio of neighbors decreases, i.e., more neighbors are in classes different from the target node. Therefore, optimizing the connection strength (i.e., $\boldsymbol{S}$) is more beneficial on JK-Net.
Effects across datasets.
From the perspective of datasets, GSEBO consistently performs better than the vanilla version on the four datasets. Specifically, the average improvements across the classic GNN architectures achieved by GSEBO are 1.9%, 3.1%, 5.3% and 8.0% on the four datasets, respectively. Considering that the four datasets come from different scenarios, these results are evidence of the potential of GSEBO to be widely applied in practice. Note that the trend of performance improvement is similar to the density of the graph, where Air-USA is the densest graph with the largest performance improvement. As the number of neighbors increases, the percentage of neighbors essential for the classification of the target node decreases. This result reflects the rationality of optimizing the connection strength according to the overall classification objective.
Moreover, the training loss and test accuracy of the GNN methods and the proposed GSEBO are shown in Figure 4. We observe that the loss of GSEBO tends to be stable after 30 epochs and is smaller than that of GCN, GAT, and GraphSAGE, showing the empirical convergence of GSEBO. This is because simultaneously optimizing $\boldsymbol{S}$ and $\theta$ makes it easier for the model to fit the labels. However, the loss of JK-Net is smaller than that of GSEBO, which we attribute to JK-Net being more prone to over-fitting.
4.3 Robustness Analysis
We investigate the robustness of GSEBO under different inter-class noise levels. In particular, we follow Luo et al. (2021) and construct graphs based on Cora by randomly adding 1,000, 3,000, 5,000, 10,000, and 20,000 inter-class edges, respectively (a sketch of this corruption procedure is given after the observations below). On the synthetic datasets, we compare GSEBO with Vanilla, GAUG, DropEdge, and AdaEdge. Figure 5 shows the performance on the GNN architectures. From the figures, we have the following observations:
• The margins between GSEBO and the vanilla GNNs on the synthetic datasets are larger than on the original Cora. For example, when adding 20,000 inter-class edges, GSEBO improves the accuracy by 17.6%, 3.1%, 40.6%, and 4.4% over GCN, GAT, GraphSAGE and JK-Net, respectively. This result indicates the robustness of GSEBO's structure learning ability.
• In most cases, GSEBO outperforms the baselines at different noise levels, which further justifies its robustness.
• GAUG and AdaEdge utilize different strategies to update the graph structure and also consistently perform better than the vanilla GNNs. However, their gaps to GSEBO on the synthetic data are larger than on the original Cora. We postulate that the reason is that their objectives are affected by the intentionally added noise.
• DropEdge shows worse performance than the vanilla GNNs on the synthetic datasets. This comparison shows that randomly dropping edges fails to enhance GNN robustness when the noise level is high.
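For completeness, the following is a minimal sketch of the graph corruption protocol referenced above, assuming a dense NumPy adjacency matrix and integer class labels; the function name and sampling loop are illustrative rather than the exact procedure of Luo et al. (2021).

```python
import numpy as np

def add_inter_class_edges(A, labels, num_edges, seed=0):
    """Randomly insert `num_edges` inter-class edges (between nodes with
    different labels) into the symmetric adjacency matrix A."""
    rng = np.random.default_rng(seed)
    A = A.copy()
    n, added = A.shape[0], 0
    while added < num_edges:
        u, v = rng.integers(0, n, size=2)
        if u != v and labels[u] != labels[v] and A[u, v] == 0:
            A[u, v] = A[v, u] = 1
            added += 1
    return A
```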
5 Conclusion
In this work, we propose GSEBO, a novel method that utilizes global information to learn the graph structure. To better optimize the strength of each edge and the parameters of feature mapping, we decompose the optimization problem into two mutually constrained objectives, i.e., the inner and outer objectives, based on a universal graph convolution operator. Extensive experiments demonstrate the effectiveness and robustness of GSEBO on both benchmark and synthetic datasets.
Despite the impressive results GSEBO achieves, there are still some limitations: GSEBO cannot be effectively applied to large graphs, which requires a mini-batch implementation. Besides, we evaluate GSEBO in the transductive setting; when new nodes are added to the graph after training, GSEBO has to retrain the entire model. For future research, we would like to explore solutions to the above limitations.
References
- Chen et al. [2020a] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In AAAI, 2020.
- Chen et al. [2020b] Yu Chen, Lingfei Wu, and Mohammed J. Zaki. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. In NeurIPS, 2020.
- Domke [2012] Justin Domke. Generic methods for optimization-based modeling. In AISTATS, 2012.
- Elinas et al. [2020] Pantelis Elinas, Edwin V Bonilla, and Louis Tiao. Variational inference for graph convolutional networks in the absence of graph data and adversarial settings. In NeurIPS, 2020.
- Franceschi et al. [2017] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In ICML, 2017.
- Franceschi et al. [2019] Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In ICML, 2019.
- Hamilton et al. [2017] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
- Jiang et al. [2019] Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, and Bin Luo. Semi-supervised learning with graph learning-convolutional networks. In CVPR, 2019.
- Ju et al. [2024] Wei Ju, Siyu Yi, Yifan Wang, Zhiping Xiao, Zhengyang Mao, Hourun Li, Yiyang Gu, Yifang Qin, Nan Yin, Senzhang Wang, et al. A survey of graph neural networks in real world: Imbalance, noise, privacy and ood challenges. arXiv preprint arXiv:2403.04468, 2024.
- Kazi et al. [2020] Anees Kazi, Luca Cosmo, Nassir Navab, and Michael Bronstein. Differentiable graph module (dgm) for graph convolutional networks, 2020.
- Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
- Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
- Knyazev et al. [2019] Boris Knyazev, Graham W Taylor, and Mohamed Amer. Understanding attention and generalization in graph neural networks. In NeurIPS, 2019.
- Li et al. [2018] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In AAAI, 2018.
- Luo et al. [2021] Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. Learning to drop: Robust graph neural network via topological denoising. In WSDM, 2021.
- Maclaurin et al. [2015] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015.
- McPherson et al. [2001] M. McPherson, L. Smith-Lovin, and J. Cook. Birds of a feather: Homophily in social networks. Review of Sociology, 27:415–444, 2001.
- Pandit et al. [2007] Shashank Pandit, Duen Horng Chau, Samuel Wang, and C. Faloutsos. Netprobe: a fast and scalable system for fraud detection in online auction networks. In WWW, 2007.
- Pang et al. [2023] Jinhui Pang, Zixuan Wang, Jiliang Tang, Mingyan Xiao, and Nan Yin. Sa-gda: Spectral augmentation for graph domain adaptation. In Proceedings of the 31st ACM International Conference on Multimedia, pages 309–318, 2023.
- Rong et al. [2020] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In ICLR, 2020.
- Scarselli et al. [2009] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009.
- Shou et al. [2023] Yuntao Shou, Tao Meng, Wei Ai, Nan Yin, and Keqin Li. Adversarial representation with intra-modal and inter-modal graph contrastive learning for multimodal emotion recognition. arXiv preprint arXiv:2312.16778, 2023.
- Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In ICLR, 2018.
- Wang et al. [2020] Junshan Wang, Ziyao Li, Qingqing Long, Weiyu Zhang, Guojie Song, and Chuan Shi. Learning node representations from noisy graph structures. In ICDM, 2020.
- Wang et al. [2021] Guangtao Wang, Zhitao Ying, Jing Huang, and Jure Leskovec. Multi-hop attention graph neural network. In IJCAI, 2021.
- Wu et al. [2019] Jun Wu, Jingrui He, and Jiejun Xu. Demo-net: Degree-specific graph neural networks for node and graph classification. In KDD, 2019.
- Wu et al. [2020] Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. Graph information bottleneck. In NeurIPS, 2020.
- Xu et al. [2018] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
- [29] Nan Yin, Li Shen, Chong Chen, Xian-Sheng Hua, and Xiao Luo. Sport: A subgraph perspective on graph classification with label noise. ACM Transactions on Knowledge Discovery from Data.
- Yin et al. [2022a] Nan Yin, Fuli Feng, Zhigang Luo, Xiang Zhang, Wenjie Wang, Xiao Luo, Chong Chen, and Xian-Sheng Hua. Dynamic hypergraph convolutional network. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 1621–1634. IEEE, 2022.
- Yin et al. [2022b] Nan Yin, Li Shen, Baopu Li, Mengzhu Wang, Xiao Luo, Chong Chen, Zhigang Luo, and Xian-Sheng Hua. Deal: An unsupervised domain adaptive framework for graph-level classification. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3470–3479, 2022.
- Yin et al. [2023a] Nan Yin, Li Shen, Mengzhu Wang, Long Lan, Zeyu Ma, Chong Chen, Xian-Sheng Hua, and Xiao Luo. Coco: A coupled contrastive framework for unsupervised domain adaptive graph classification. In International Conference on Machine Learning, pages 40040–40053. PMLR, 2023.
- Yin et al. [2023b] Nan Yin, Li Shen, Mengzhu Wang, Xiao Luo, Zhigang Luo, and Dacheng Tao. Omg: towards effective graph classification against label noise. IEEE Transactions on Knowledge and Data Engineering, 2023.
- Yin et al. [2023c] Nan Yin, Li Shen, Huan Xiong, Bin Gu, Chong Chen, Xian-Sheng Hua, Siwei Liu, and Xiao Luo. Messages are never propagated alone: Collaborative hypergraph neural network for time-series forecasting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- Yin et al. [2023d] Nan Yin, Mengzhu Wang, Zhenghan Chen, Li Shen, Huan Xiong, Bin Gu, and Xiao Luo. Dream: Dual structured exploration with mixup for open-set graph domain adaption. In The Twelfth International Conference on Learning Representations, 2023.
- Yin et al. [2024a] Nan Yin, Mengzhu Wan, Li Shen, Hitesh Laxmichand Patel, Baopu Li, Bin Gu, and Huan Xiong. Continuous spiking graph neural networks. arXiv preprint arXiv:2404.01897, 2024.
- Yin et al. [2024b] Nan Yin, Mengzhu Wang, Zhenghan Chen, Giulia De Masi, Huan Xiong, and Bin Gu. Dynamic spiking graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16495–16503, 2024.
- Zhang et al. [2019] Yingxue Zhang, Soumyasundar Pal, Mark Coates, and Deniz Ustebay. Bayesian graph convolutional neural networks for semi-supervised classification. In AAAI, 2019.
- Zhao et al. [2006] Binglei Zhao, P. Sen, and L. Getoor. Entity and relationship labeling in affiliation networks. In ICML Workshop, 2006.
- Zhao et al. [2021a] Jianan Zhao, Xiao Wang, Chuan Shi, Binbin Hu, Guojie Song, and Yanfang Ye. Heterogeneous graph structure learning for graph neural networks. In AAAI, 2021.
- Zhao et al. [2021b] Tong Zhao, Yozen Liu, Leonardo Neves, Oliver Woodford, Meng Jiang, and Neil Shah. Data augmentation for graph neural networks. In AAAI, 2021.
- Zheng et al. [2020] Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, and Wei Wang. Robust graph representation learning via neural sparsification. In ICML, 2020.