This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

11institutetext: Shanghai Jiao Tong University
11email: {1473686097, xyl-alter, jiaxinding, yiluofu, xwang8}@sjtu.edu.cn, [email protected]
22institutetext: 2012 Laboratories, Huawei Technologies
22email: [email protected]

Characterizing the Influence of Topology on Graph Learning Tasks

Kailong Wu 11    Yule Xie 11    Jiaxin Ding 11    Yuxiang Ren 22    Luoyi Fu 11    Xinbing Wang 11    Chenghu Zhou 11
Abstract

Graph neural networks (GNN) have achieved remarkable success in a wide range of tasks by encoding features combined with topology to create effective representations. However, the fundamental problem of understanding and analyzing how graph topology influences the performance of learning models on downstream tasks has not yet been well understood. In this paper, we propose a metric, TopoInf, which characterizes the influence of graph topology by measuring the level of compatibility between the topological information of graph data and downstream task objectives. We provide analysis based on the decoupled GNNs on the contextual stochastic block model to demonstrate the effectiveness of the metric. Through extensive experiments, we demonstrate that TopoInf is an effective metric for measuring topological influence on corresponding tasks and can be further leveraged to enhance graph learning.

Keywords:
GNNs, graph topology, graph rewiring

1 Introduction

Graph neural networks (GNNs) have emerged as state-of-the-art models to learn graph representation by the message-passing mechanism over graph topology for downstream tasks, such as node classification, link prediction, etc. [12, 3, 18].

Despite their success, GNNs are vulnerable to the compatibility between graph topology and graph tasks [14, 9]. In essence, the effectiveness of the message-passing mechanism and, by extension, the performance of GNNs depends on the assumption that the way information is propagated and aggregated in the graph neural networks should be conducive to the downstream tasks. First, if the underlying motivation of graph topology, why two nodes get connected, has no relation to the downstream task, such task-irrelevant topology can hurt the information aggregation per se. For example, in a random graph that lacks meaningful clusters, graph learning becomes impossible in community detection tasks. Furthermore, GNNs may even be outperformed by multilayer perceptrons (MLP) on graphs exhibiting heterophily [31, 32]. Second, even if the motivation for connections is compatible with tasks, graphs in real-world applications are usually inherent with “noisy” edges or incompleteness, due to error-prone data collection. This topological noise can also affect the effectiveness of message passing. Third, the vulnerability of GNNs to compatibility varies across models due to differences in their message-passing mechanisms. For example, GCN [12] may not be as effective on heterophilic graphs. However, H2GCN [32] is able to mitigate the heterophilic problem, with specially designed message passing. Furthermore, even the same model with different hyperparameters of network layers can also exhibit different vulnerabilities, since the reception fields of the neighborhoods are different. All this, there is a natural question to ask: What is the metric for the compatibility between graph topology and graph learning tasks?

This question, which is essential for choosing a graph learning model to deploy over a graph for downstream tasks, has not yet been systematically characterized. Existing work focuses on measuring homophily [17, 32, 24] or edge signal-to-noise ratio [9], based on the discrepancy of nodes and their neighbors in features or labels without considering graph learning models. Therefore, the results of these metrics can be an indicator of the performance of GCN-like models but cannot be generalized to more recently developed GNN models and guide topology modification to improve the performance of these models. From the above analysis, we would like to investigate the compatibility between topology and tasks associated with graph learning models globally, and locally, the influence from each edge, either positive or negative, can be used to adjust the topology to increase or decrease the performance. Answering this question can help us better understand graph learning, increase the interpretability of learning models, and shed light on improving model performance.

We first propose a metric for model-dependent compatibility between topology and graph tasks, measured by the difference between the “ideal” results, labels, and the result of the models performing on the “ideal” features. Thereafter, we introduce a metric TopoInf, to locally characterize the influence of each edge on the overall compatibility, by evaluating the change of such compatibility after topology perturbation. We provide motivating analysis based on the decoupled GNNs [13, 8] on cSBMs [7] to validate the effectiveness of the metric and demonstrate that TopoInf is an effective metric through extensive experiments.

Our main contributions can be summarized as follows:

  • To the best of our knowledge, we are the first to measure compatibility between graph topology and learning tasks associated with graph models.

  • We propose a new metric, TopoInf, to measure the influence of edges on the performance of GNN models in node classification tasks, and conduct extensive experiments to validate the effectiveness.

  • The proposed TopoInf can be leveraged to improve performance by modifying the topology on different GNN models, and such a scheme can be applied further to different scenarios.

2 Preliminaries

Notations. We denote an undirected graph as 𝒢=(𝒱,,A)\mathcal{G}=(\mathcal{V},\mathcal{E},\textbf{A}), where 𝒱\mathcal{V} is the node set with cardinality |𝒱|=n|\mathcal{V}|=n, \mathcal{E} is the edge set without self-loop, and An×n\textbf{A}\in\mathbb{R}^{n\times n} is the symmetric adjacency matrix with Ai,j=1\textbf{A}_{i,j}=1 when eije_{ij}\in\mathcal{E} otherwise Ai,j=0\textbf{A}_{i,j}=0. D denotes the diagonal degree matrix of 𝒢\mathcal{G} where Di,i=di\textbf{D}_{i,i}=d_{i}, the degree of node viv_{i}. We use A~=A+I\tilde{\textbf{A}}=\textbf{A}+\textbf{I} to represent the adjacency matrix with self-loops and D~=D+I\tilde{\textbf{D}}=\textbf{D}+\textbf{I}. The symmetric normalized adjacency matrix is A^=D1/2A~D~1/2\hat{{\textbf{A}}}={\textbf{D}}^{-1/2}\tilde{\textbf{A}}\tilde{\textbf{D}}^{-1/2}. For node classification with cc classes, Ln×c\textbf{L}\in\mathbb{R}^{n\times c} denotes the label matrix, whose ithi^{\text{th}} row represents the one-hot encoding of node viv_{i}, while Xn×d\textbf{X}\in\mathbb{R}^{n\times d} is the feature matrix, whose ithi^{\text{th}} row Xi,:\textbf{X}_{i,:} represents a dd-dimensional feature vector of node viv_{i}.

Graph Filters. Recent studies show that GNN models, such as ChebNet [6], APPNP [13] and GPRGNN [4], can be viewed as operations of polynomial graph spectral filters. Specifically, the effect of such GNN models on a graph signal xn×1\textbf{x}\in\mathbb{R}^{n\times 1}, can be formulated as graph spectral filtering operation 𝒇(A)\boldsymbol{f}(\textbf{A}) based on adjacency matrix A, such that 𝒇(A)x=k=0KγkA^kx,\boldsymbol{f}(\textbf{A})\textbf{x}=\sum_{k=0}^{K}\gamma_{k}\hat{\textbf{A}}^{k}\textbf{x}, where A^\hat{\textbf{A}} is the normalized adjacency matrix, KK is the order of the graph filter and γk\gamma_{k}’s are the weights, which can be assigned [26, 13] or learned [4, 11]. The graph filter can be extracted as the effect of topology associated with GNN models.

Decoupled GNN. Recent works [13, 29, 8] show that neighborhood aggregation and feature transformation can be decoupled and formulate the graph learning tasks as

L^=softmax(𝒇(A)𝒈θ(X))\hat{\textbf{L}}=\text{softmax}\left(\boldsymbol{f}(\textbf{A})\boldsymbol{g}_{\theta}(\textbf{X})\right) (1)

where the prediction L^\hat{\textbf{L}} can be obtained by operating on graph feature matrix X, through the feature transformation function 𝒈θ()\boldsymbol{g}_{\theta}(\cdot) with learnable parameters, thereafter applying the graph filter 𝒇(A)\boldsymbol{f}(\textbf{A}) acting as the neighborhood aggregation, and finally passing through a softmax layer.

3 Methodology

In this section, we present our methodology to study the influence of topology on graph learning tasks. We evaluate the global compatibility between graph topology and tasks, and propose our metric, TopoInf.

3.1 Matching graph topology and tasks

We first relate the graph topology to the graph tasks. Thanks to the graph filters in the preliminaries, the influence of the graph topology on a GNN model can be simplified and approximated as a graph filter 𝒇(A)\boldsymbol{f}(\textbf{A}). 𝒇(A)\boldsymbol{f}(\textbf{A}) can be viewed as a graph filter function 𝒇():n×nn×n\boldsymbol{f}(\cdot):\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{n\times n} applied to the adjacency matrix A of the observed graph. With the decoupling analysis of GNN, the prediction is obtained by 𝒇(A)\boldsymbol{f}(\textbf{A}) working on prediction results 𝒈θ(X)\boldsymbol{g}_{\theta}(\textbf{X}) based on feature matrix X through graph learning models [5, 29, 8], where 𝒈θ(X)\boldsymbol{g}_{\theta}(\textbf{X}) on X can be viewed as extracting task-related information in X and ideally, one of the best predictions on X could be the label matrix L. Therefore, we replace the functions with 𝒇(A)𝒈θ(X)\boldsymbol{f}(\textbf{A})\boldsymbol{g}_{\theta}(\textbf{X}) in Equation 1 to be 𝒇(A)L\boldsymbol{f}(\textbf{A})\textbf{L}, which simplifies the analysis from features, makes us focus on the labels and graph tasks, and provides an “upper bound” of the predicted results in the ideal situation. In the next section, we will provide a motivation analysis for this setting.

Further, instead of applying the softmax function, we perform row-normalization [9], which normalizes the summation of entries on the same row to be one, on 𝒇(A)L\boldsymbol{f}(\textbf{A})\textbf{L} as the prediction as used in previous works [16, 9]. The row-normalized matrix 𝒇(A)L\boldsymbol{f}(\textbf{A})\textbf{L} provides a summary of the edge statistics at the individual node level, as each row of 𝒇(A)L\boldsymbol{f}(\textbf{A})\textbf{L} represents the label neighborhood distribution of the node, which has direct connections to GNN learning [15]. As RowNorm(𝒇(A)L)=RowNorm(𝒇(A))L\text{RowNorm}(\boldsymbol{f}(\textbf{A})\textbf{L})=\text{RowNorm}(\boldsymbol{f}(\textbf{A}))\textbf{L}, we perform row-normalization on graph filter 𝒇(A)\boldsymbol{f}(\textbf{A}), and 𝒇(A)\boldsymbol{f}(\textbf{A}) is row-normalized in our subsequent analysis.

Therefore, to quantify the compatibility between graph topology and graph tasks, we propose measuring the similarity between the ideal prediction L and the graph filter that performs the ideal feature 𝒇(A)L\boldsymbol{f}(\textbf{A})\textbf{L}. The more similar L and 𝒇(A)L\boldsymbol{f}(\textbf{A})\textbf{L} is (𝒇(A)LL\boldsymbol{f}(\textbf{A})\textbf{L}\sim\textbf{L} for short), the better the graph topology matches the graph tasks, which we provide a motivating example with mild assumptions to validate in the next section. This can be understood through various scenarios:

  • In graph filtering, the fact that 𝒇(A)LL\boldsymbol{f}(\textbf{A})\textbf{L}\sim\textbf{L} indicates that signals in L, whether low or high frequency, are well preserved after being filtered by 𝒇(A)\boldsymbol{f}(\textbf{A}). This preservation of information suggests a good match between task labels and the graph topology.

  • Considering the concept of the homophily/heterophily, when 𝒇(A)LL\boldsymbol{f}(\textbf{A})\textbf{L}\sim\textbf{L} means linked nodes described by 𝒇(A)\boldsymbol{f}(\textbf{A}) likely belong to the same class or have similar characteristics, which is aligned with the homophily principle.

  • In terms of Dirichlet energy, if 𝒇(A)LL\boldsymbol{f}(\textbf{A})\textbf{L}\sim\textbf{L} means low Dirichlet energy of L w.r.t. the (augmented) normalized Laplacian derived from 𝒇(A)\boldsymbol{f}(\textbf{A}). This indicates the smoothness of L on 𝒇(A)\boldsymbol{f}(\textbf{A}), suggesting a good match between the graph topology and tasks.

All above, we define (A)\mathcal{I}(\textbf{A}) to quantify the degree of compatibility between graph topology and graph tasks,

(A)=𝒮(L𝒇(A)L),\mathcal{I}(\textbf{A})=\mathcal{S}\left(\textbf{L}\|\boldsymbol{f}(\textbf{A})\textbf{L}\right),

where 𝒮()\mathcal{S}(\cdot) is a similarity measurement, which can be chosen according to the requirements of the problem, for example, similarity induced by the inner product or Euclidean distance.

To find a graph topology that matches the graph tasks well, our goal is to maximize (A)\mathcal{I}(\textbf{A}) over A. However, (A)\mathcal{I}(\textbf{A}) would be trivially maximized with 𝒇(A)=I\boldsymbol{f}(\textbf{A})=\textbf{I}, where the graph topology is completely removed. To address this issue, we introduce a regularization function ()\mathcal{R}(\cdot), which penalizes (A)\mathcal{I}(\textbf{A}) if 𝒇(A)\boldsymbol{f}(\textbf{A}) is close to the identity matrix, to minimize the modification of 𝒇(A)\boldsymbol{f}(\textbf{A}). We also demonstrate such regularization from the perspective of graph denoising in the next section. Therefore, the optimization problem is formulated as

maxA𝒞(A)=(A)λ(A)\max_{\textbf{A}}\ \mathcal{C}(\textbf{A})=\mathcal{I}(\textbf{A})-\lambda\mathcal{R}\left(\textbf{A}\right) (2)

where λ\lambda is a hyperparameter used to balance the trade-off between (A)\mathcal{I}(\textbf{A}) and (A)\mathcal{R}(\textbf{A}). However, there are 2n×n2^{n\times n} possible states of A, making the optimization of 𝒞(A)\mathcal{C}(\textbf{A}) computationally intractable due to its nondifferentiability and combinatorial nature. Moreover, the optimization does not consider the original topology. Therefore, instead of optimizing the problem globally, a more practical and effective approach is to start with the originally observed graph and locally optimize it, as we will present in the next subsection.

Now we zoom in the general compatibility to the node level: for every node viv_{i}, we use the node level similarity function 𝒮v()\mathcal{S}_{\textbf{v}}(\cdot) to measure the similarity of its one-hot label vector Li,:\textbf{L}_{i,:} and its normalized soft label vector L¯i,:\overline{\textbf{L}}_{i,:} filtered by 𝒇(A)\boldsymbol{f}(\textbf{A}), and node level regularization function v()\mathcal{R}_{\textbf{v}}(\cdot) to measure the regularization of its connectivity to other nodes, specifically, the ithi^{\text{th}} row of 𝒇(A)\boldsymbol{f}(\textbf{A}). Let 𝒞A(vi)\mathcal{C}_{\textbf{A}}(v_{i}) denote the compatibility between the topology of node viv_{i}’ and graph tasks. The overall compatibility metric becomes the following

𝒞(A)=vi𝒱t𝒞A(vi)=vi𝒱tA(vi)λA(vi),\mathcal{C}(\textbf{A})=\sum_{v_{i}\in\mathcal{V}_{t}}\mathcal{C}_{\textbf{A}}(v_{i})=\sum_{v_{i}\in\mathcal{V}_{t}}\mathcal{I}_{\textbf{A}}(v_{i})-\lambda\mathcal{R}_{\textbf{A}}(v_{i}), (3)

where A(vi)=𝒮v(Li,:L¯i,:)\mathcal{I}_{\textbf{A}}(v_{i})=\mathcal{S}_{\textbf{v}}(\textbf{L}_{i,:}\|\overline{\textbf{L}}_{i,:}) and 𝒱t𝒱\mathcal{V}_{t}\subseteq\mathcal{V} is the node set. Notice that 𝒱t\mathcal{V}_{t} can be not only the whole node set but also be chosen flexibly as the node set whose compatibility we care about most in correspondence to the circumstances. For example, in adversarial attack tasks, 𝒱t\mathcal{V}_{t} can be a collection of targeted nodes to be attacked; in graph node classification tasks, 𝒱t\mathcal{V}_{t} can be the nodes to be predicted. In this work, we choose the inner product as the similarity metric, where A(vi)\mathcal{I}_{\textbf{A}}(v_{i}) can be evaluated as A(vi)=L¯i,ci\mathcal{I}_{\textbf{A}}(v_{i})=\overline{\textbf{L}}_{i,c_{i}}, and cic_{i} is the true label of viv_{i}, and the reciprocal of degree as the regularizer, where A(vi)\mathcal{R}_{\textbf{A}}(v_{i}) can be evaluated as A(vi)=1/di\mathcal{R}_{\textbf{A}}(v_{i})=1/d_{i}, and did_{i} is the degree of node viv_{i}.

3.2 TopoInf: measuring the influence of graph topology on tasks

From the above analysis, we can evaluate how well the overall graph topology and learning tasks are in accordance with 𝒞(A)\mathcal{C}(A). We are interested in optimizing 𝒞(A)\mathcal{C}(A), which can improve the compatibility between graph topology and tasks, and, by extension, improve the performance of the learning model. We are also interested in characterizing the influence of modifying part of topology on the degree of task compatibility, which can provide more interpretability of graph learning and meanwhile guide the topology modification for better task performance. To maximize 𝒞(A)\mathcal{C}(\textbf{A}), we can take the “derivative” of 𝒞(A)\mathcal{C}(\textbf{A}) by obtaining the change of 𝒞(A)\mathcal{C}(\textbf{A}) with local topology perturbation. Here we focus on the topology of edges and the same analysis can be simply extended to nodes and substructures. The influence of edge eije_{ij} can be measured by the difference between 𝒞(A)\mathcal{C}(\textbf{A}) on original (normalized) adjacency matrix A and 𝒞(A)\mathcal{C}(\textbf{A}^{\prime}) on the modified (normalized) adjacency matrix A\textbf{A}^{\prime} obtained by removing edge eije_{ij}. Therefore, we define such topology influence as TopoInf of edge eije_{ij}, denoted as 𝒞A(eij)\nabla\mathcal{C}_{\textbf{A}}(e_{ij}), is given by

𝒞A(eij)=𝒞(A)𝒞(A)=vk𝒱𝒞A(vk)𝒞A(vk).\nabla\mathcal{C}_{\textbf{A}}(e_{ij})=\mathcal{C}(\textbf{A}^{\prime})-\mathcal{C}(\textbf{A})=\sum_{v_{k}\in\mathcal{V}}\mathcal{C}_{\textbf{A}^{\prime}}(v_{k})-\mathcal{C}_{\textbf{A}}(v_{k}). (4)

The sign of TopoInf reflects the positive or negative influence of removing the edge on the matching of topology and tasks, which also indicates that removing the edge can increase or decrease the model performance. Remark that edges with positive TopoInf correspond to edges with a “negative” effect on the model performance, such that removing those can bring a positive influence, and vice versa. The absolute value of TopoInf measures the magnitude of influence: a higher absolute value means a higher influence.

It should be noted that to compute TopoInf of an edge, we do not need to recompute 𝒇(A)L\boldsymbol{f}(\textbf{A}^{\prime})\textbf{L}, which is computationally expensive. The difference between (A)\mathcal{I}(\textbf{A}^{\prime}) and (A)\mathcal{I}(\textbf{A}) after removing eije_{ij} is limited within KK hop neighborhood of viv_{i} and vjv_{j} considering a KK-order filter 𝒇(A)\boldsymbol{f}(\textbf{A}). Therefore, we only need to compute the difference of node influence on the neighborhood affected by the edge removal in Equation 3.

Remark that we do not assume that we have all the true labels. Using all labels is for the convenience of analysis to demonstrate the effectiveness of the metric. In practical graph tasks where not all labels are available, we can use pseudo labels obtained by MLP or GNN models as replacements for true labels and compute estimated TopoInf based on pseudo labels. This will introduce errors, but can still be effective which we will show in our experiment.

Moreover, due to the presence of nonlinearity in GNN models, the extraction of 𝒇(A)\boldsymbol{f}(\textbf{A}) can be challenging for graph models, and an approximation of 𝒇(A)\boldsymbol{f}(\textbf{A}) is required. We assume that 𝒇(A)\boldsymbol{f}(\textbf{A}) can be obtained or approximated for the graph models, and a better approximation can improve the accuracy of the influence metric. Here, we present our approach to obtaining, approximating and learning 𝒇(A)\boldsymbol{f}(\textbf{A}) for several representative GNN models.

  • Obtained graph filters. For SGC [26], S2GC [30], APPNP [13], and other decoupled GNNs, 𝒇(A)\boldsymbol{f}(\textbf{A}) can be directly obtained by Equation 1.

  • Approximated graph filters. For GCN [12] and other GCN-like models, the presence of nonlinearities introduces additional complexities to 𝒇(A)\boldsymbol{f}(\textbf{A}), as it becomes dependent not only on the observed graph topology but also on the learnable parameters and the activation function. In this study, we adopt an approximate approach to estimate 𝒇(A)\boldsymbol{f}(\textbf{A}) by removing the nonlinear activation to simplify the analysis, as works in [26, 15]. For GCNII [3] and other GNNs with residual connections between layers, we can employ a similar approximate approach as in GCN to estimate 𝒇(A)\boldsymbol{f}(\textbf{A}) by removing the nonlinear normalization. This approximation without nonlinearity, can still capture the effects of graph learning models performing on the topology, as we will show in our experiments.

  • Learned graph filters. For GPRGNN [4], ChebNet [6], and other GNNs driven by learning filters, the filter weights are learned based on backpropagation, where we are not able to obtain the graph filter 𝒇(A)\boldsymbol{f}(\textbf{A}) before training. In this study, we adopt two approximate approaches to deal with this problem. One approach is to train the filter-based GNN model, obtain the trained filter weights, and then apply the learned graph filter as 𝒇(A)\boldsymbol{f}(\textbf{A}). Another approach is to use fixed filter weights or predefined filters as the graph filter 𝒇(A)\boldsymbol{f}(\textbf{A}), such as A^K\hat{\textbf{A}}^{K}. This approach ignores the effect of the learned filter weights in the GNN model but can still preserve the order information and other prior knowledge from the GNN model. Moreover, for GAT [25] and other attention-based GNNs, unlike GNNs driven by learning filters, the learned filters are implicitly determined by the weights of the neural network and the input features, and the learned filters may change due to the randomness during initialization and training. In this case, we apply only the second approach with a predefined fixed filter as an approximation.

Table 1: Approximation approach of f(A)\boldsymbol{f}(\textbf{A}) for representative GNN models. Here we take K-layer GNN models as examples. α\alpha is predefined hyper-parameter and γ\gamma is learnable parameter. 𝐇(0)=𝒈θ(X)\mathbf{H}^{(0)}=\boldsymbol{g}_{\theta}(\textbf{X}).
Model Neighbor Aggregation Filter Type
SGC L^=softmax(A^KXW)\hat{\textbf{L}}=\text{softmax}(\hat{\textbf{A}}^{K}\textbf{X}\textbf{W}) 𝒇(A)=A^K\boldsymbol{f}(\textbf{A})=\hat{\textbf{A}}^{K} obtained
S2GC L^=softmax(1Kk=1K((1α)A^k+αI)𝐗W)\hat{\textbf{L}}=\operatorname{softmax}\left(\frac{1}{K}\sum_{k=1}^{K}\left((1-\alpha)\hat{\textbf{A}}^{k}+\alpha I\right)\mathbf{X}\textbf{W}\right) 𝒇(A)=1Kk=1K(1α)A^k+αI\boldsymbol{f}(\textbf{A})=\frac{1}{K}\sum_{k=1}^{K}(1-\alpha)\hat{\textbf{A}}^{k}+\alpha I obtained
APPNP 𝐇(+1)=(1α)𝐀^𝐇()+α𝐇(0)\mathbf{H}^{(\ell+1)}=\left(1-\alpha\right)\hat{\mathbf{A}}\mathbf{H}^{(\ell)}+\alpha\mathbf{H}^{(0)} 𝒇(A)=(1α)A^K+k=0K1α(1α)kA^k\boldsymbol{f}(\textbf{A})=(1-\alpha)\hat{\textbf{A}}^{K}+\sum_{k=0}^{K-1}\alpha(1-\alpha)^{k}\hat{\textbf{A}}^{k} obtained
GCN 𝐇(+1)=σ(A^𝐇()𝐖())\mathbf{H}^{(\ell+1)}=\sigma\left(\hat{\textbf{A}}\mathbf{H}^{(\ell)}\mathbf{W}^{(\ell)}\right) 𝒇(A)=A^K\boldsymbol{f}(\textbf{A})=\hat{\textbf{A}}^{K} approximated
GCNII 𝐇(+1)=σ(((1α)𝐀^𝐇()+α𝐇(0))𝐖())\mathbf{H}^{(\ell+1)}=\sigma\left(\left(\left(1-\alpha\right)\hat{\mathbf{A}}\mathbf{H}^{(\ell)}+\alpha\mathbf{H}^{(0)}\right)\mathbf{W}^{(\ell)}\right) 𝒇(A)=(1α)A^K+k=0K1α(1α)kA^k\boldsymbol{f}(\textbf{A})=(1-\alpha)\hat{\textbf{A}}^{K}+\sum_{k=0}^{K-1}\alpha(1-\alpha)^{k}\hat{\textbf{A}}^{k} approximated
GPRGNN L^=softmax(k=0KγkA^k𝐇(0))\hat{\textbf{L}}=\operatorname{softmax}\left(\sum_{k=0}^{K}\gamma_{k}\hat{\textbf{A}}^{k}\mathbf{H}^{(0)}\right) 𝒇(A)=k=0KγkA^k\boldsymbol{f}(\textbf{A})=\sum_{k=0}^{K}\gamma_{k}\hat{\textbf{A}}^{k} learned

4 Motivation

In this section, we provide a theoretical analysis and motivating example to validate our methodology and increase its comprehensibility. Specifically, we provide alternative perspectives to explain the meaning and validate the necessity of (A)\mathcal{I}(\textbf{A}) and (A)\mathcal{R}(\textbf{A}) in 𝒞(A)\mathcal{C}(\textbf{A}).

We use contextual stochastic block model (cSBM) [7] for analysis, which is a widely used model for complex graph structures that integrates contextual features and graph topology [1, 9, 7]. In cSBM, node features are generated by the spiked covariance model, which follows a Gaussian distribution, the mean of which depends on the community assignment. The underlying assumption is that the node features are “informative” and can be perceived as label embedding vectors with Gaussian noise. The graph topology is generated by SBMs, resulting in communities of nodes connected by certain edge densities.

Definition 1

(Contextual Stochastic Block Model, cSBM). 𝒢\mathcal{G} has nn nodes belonging to cc communities, with intra/inter-community edge probabilities of pp and qq. The feature matrix is X=F+N\textbf{X}=\textbf{F}+\textbf{N}, where Nij𝒩(0;σ2)\textbf{N}_{ij}\sim\mathcal{N}(0;\sigma^{2}) are i.i.d. Gaussian noise, Fi,:=𝝁𝒄𝒊\textbf{F}_{i,:}=\boldsymbol{\mu_{c_{i}}^{\top}} is the dd dimensional feature vector for the center of the community cic_{i}, which is the community viv_{i} belongs to. Therefore, F=L𝝁\textbf{F}=\textbf{L}\boldsymbol{\mu}, where L is the matrix of one hot label that denotes the community to which each node belongs, and 𝝁=(𝝁1,𝝁2,,𝝁c)c×d\boldsymbol{\mu}=\left(\boldsymbol{\mu}_{1},\boldsymbol{\mu}_{2},\dots,\boldsymbol{\mu}_{c}\right)^{\top}\in\mathbb{R}^{c\times d}.

4.1 Theoretical Analysis

Here we present a theoretical analysis to validate our TopoInf and enhance its comprehensibility. According to the definition of cSBMs [7], the feature matrix F=Lμ\textbf{F}=\textbf{L}\mu is directly related to the prediction L and contains sufficient information for the graph learning task, which is also discussed and empirically verified in [16]. In this part, we follow the setting in [16] as a motivating example, and treat F as the true signals and 𝒇(A)X\boldsymbol{f}(\textbf{A})\textbf{X} as the prediction, and evaluate the effect of the low-pass filter by measuring the difference between the filtered feature matrix 𝒇(A)X\boldsymbol{f}(\textbf{A})\textbf{X} and the true feature matrix F. To analyze the difference, we focus on the low-pass filter, which is true for most GCN-like models [12, 26, 13, 16], use 𝒇(A)XF\left\|\boldsymbol{f}(\textbf{A})\textbf{X}-\textbf{F}\right\| as the prediction error introduced by the learning model and the topology, and consequently establish the following theorem.

Theorem 4.1

For 0<δ<10<\delta<1, with probability at least 1δ1-\delta, we have

𝒇(A)XFc1𝒇(A)LL+c2𝒇(A),\left\|\boldsymbol{f}(\textbf{A})\textbf{X}-\textbf{F}\right\|\leq c_{1}\left\|\boldsymbol{f}(\textbf{A})\textbf{L}-\textbf{L}\right\|+c_{2}\left\|\boldsymbol{f}(\textbf{A})\right\|, (5)

where c1=O(𝛍)c_{1}=O\left(\left\|\boldsymbol{\mu}\right\|\right), c2=O(𝔼{N}/δ)c_{2}=O\left(\mathbb{E}\left\{\left\|\textbf{N}\right\|\right\}/\delta\right).

Proof

By substituting X=F+N\textbf{X}=\textbf{F}+\textbf{N}, we obtain 𝒇(A)XF𝒇(A)FF+𝒇(A)N.\left\|\boldsymbol{f}(\textbf{A})\textbf{X}-\textbf{F}\right\|\leq\left\|\boldsymbol{f}(\textbf{A})\textbf{F}-\textbf{F}\right\|+\left\|\boldsymbol{f}(\textbf{A})\textbf{N}\right\|. For the first term, by substituting F=Lμ\textbf{F}=\textbf{L}\mu, we obtain 𝒇(A)FF𝒇(A)LLO(𝝁).\left\|\boldsymbol{f}(\textbf{A})\textbf{F}-\textbf{F}\right\|\leq\left\|\boldsymbol{f}(\textbf{A})\textbf{L}-\textbf{L}\right\|O\left(\left\|\boldsymbol{\mu}\right\|\right). For the second term, by Markov inequality, we obtain, t>0\forall t>0, Prob{𝒇(A)N>t}𝔼(𝒇(A)N)t.\text{Prob}\left\{\left\|\boldsymbol{f}(\textbf{A})\textbf{N}\right\|>t\right\}\leq\frac{\mathbb{E}(\left\|\boldsymbol{f}(\textbf{A})\textbf{N}\right\|)}{t}. By substituting t=𝔼(𝒇(A)N)δt=\frac{\mathbb{E}(\left\|\boldsymbol{f}(\textbf{A})\textbf{N}\right\|)}{\delta}, we obtain Prob{𝒇(A)NO(𝔼{N}δ)𝒇(A)}1δ.\text{Prob}\left\{\left\|\boldsymbol{f}(\textbf{A})\textbf{N}\right\|\leq O\left(\mathbb{E}\frac{\left\{\left\|\textbf{N}\right\|\right\}}{\delta}\right)\left\|\boldsymbol{f}(\textbf{A})\right\|\right\}\leq 1-\delta.

Theorem 4.1 suggests that the effect of a graph filter is upper bounded by two parts, as shown on the right side of Equation 5. For the first part, 𝒇(A)LL\left\|\boldsymbol{f}(\textbf{A})\textbf{L}-\textbf{L}\right\| represents the bias introduced by the filter 𝒇(A)\boldsymbol{f}(\textbf{A}) on the ideal feature matrix F=Lμ\textbf{F}=\textbf{L}\mu, which is related to the label matrix L, weighted by a constant of the embedding vectors of the labels. For the second part, 𝒇(A)\left\|\boldsymbol{f}(\textbf{A})\right\| weighted by the constant related to the noise, reflects the effect that 𝒇(A)\boldsymbol{f}(\textbf{A}) filters the noise N. Intuitively, better performance can be achieved when bias is minimized and more noise is filtered out. Our definition of 𝒞(A)\mathcal{C}(\textbf{A}), which captures the compatibility between graph topology and tasks, aligns with Theorem 4.1, as it considers both parts. On the one hand, (A)\mathcal{I}(\textbf{A}) corresponds to 𝒇(A)LL\left\|\boldsymbol{f}(\textbf{A})\textbf{L}-\textbf{L}\right\|, as a smaller value of 𝒇(A)LL\left\|\boldsymbol{f}(\textbf{A})\textbf{L}-\textbf{L}\right\| indicates a higher similarity between 𝒇(A)L\boldsymbol{f}(\textbf{A})\textbf{L} and L. On the other hand, (A)\mathcal{R}(\textbf{A}) aligns with 𝒇(A)\left\|\boldsymbol{f}(\textbf{A})\right\|, as a smaller value of 𝒇(A)\left\|\boldsymbol{f}(\textbf{A})\right\| implies less regularization on 𝒇(A)\boldsymbol{f}(\textbf{A}). Furthermore, our definition of 𝒞(A)\mathcal{C}(\textbf{A}) offers more flexibility, as the similarity function 𝒮()\mathcal{S}(\cdot) is not limited to Euclidean distance and the regularization function ()\mathcal{R}(\cdot) is not limited to L2-norm. This flexibility allows for a broader range of choices in defining the compatibility measure based on specific requirements and characteristics of the graph learning task at hand.

We further investigate the bias introduced by 𝒇(A)\boldsymbol{f}(\textbf{A}) on F by analyzing the change in the distance between a node and its farthest inter-class node, which represents the radius of nodes belonging to a different community distributed, centered on the node. A larger radius indicates easier classification. For noise filtering, we measure the change in variance after applying the filter. In the following, we present the results in detail.

Theorem 4.2

Suppose that 𝐟(A)=RowNorm(k=0KγkA^k)\boldsymbol{f}(\textbf{A})=\text{RowNorm}(\sum_{k=0}^{K}\gamma_{k}\hat{\textbf{A}}^{k}), k=0Kγk=1\sum_{k=0}^{K}\gamma_{k}=1 and γk>=0\gamma_{k}>=0, which are low-pass filters [4, 11]. We define the distance between node viv_{i} and the farthest node to viv_{i} that belongs to a different community as 𝒟(X,vi)=maxvj𝒱,cicjXi,:Xj,:\mathcal{D}(\textbf{X},v_{i})=\max_{v_{j}\in\mathcal{V},c_{i}\neq c_{j}}\left\|\textbf{X}_{i,:}-\textbf{X}_{j,:}\right\|. We have

  • The maximum distance between viv_{i} and the farthest node to viv_{i} that belongs to a different community decreases after applying the graph filter: vi𝒱,𝒟(F,vi)𝒟(𝒇(A)F,vi)\forall v_{i}\in\mathcal{V},\mathcal{D}(\textbf{F},v_{i})\geq\mathcal{D}(\boldsymbol{f}(\textbf{A})\textbf{F},v_{i}).

  • The total variance of the noise decreases: Var{N}1Var{𝒇(A)N}1\left\|\text{Var}\{\textbf{N}\}\right\|_{1}\geq\left\|\text{Var}\{\boldsymbol{f}(\textbf{A})\textbf{N}\}\right\|_{1}, where X1=ij|Xi,j|\left\|X\right\|_{1}=\sum_{i}\sum_{j}|X_{i,j}|.

Proof

For the first result, by definition,

𝒟(𝒇(A)F,vi)\displaystyle\mathcal{D}(\boldsymbol{f}(\textbf{A})\textbf{F},v_{i}) =maxvj𝒱,cicjvk𝒱(𝒇(A))i,kFk,:vk𝒱(𝒇(A))j,kFk,:\displaystyle=\max_{v_{j}\in\mathcal{V},c_{i}\neq c_{j}}\left\|\sum_{v_{k}\in\mathcal{V}}\left(\boldsymbol{f}(\textbf{A})\right)_{i,k}\textbf{F}_{k,:}-\sum_{v_{k}\in\mathcal{V}}\left(\boldsymbol{f}(\textbf{A})\right)_{j,k}\textbf{F}_{k,:}\right\|
=maxvj𝒱,cicj(𝒇(A))i,:Lμ(𝒇(A))j,:Lμ=maxvj𝒱,cicjc=1Clc(i,j)μc,\displaystyle=\max_{v_{j}\in\mathcal{V},c_{i}\neq c_{j}}\left\|\left(\boldsymbol{f}(\textbf{A})\right)_{i,:}\textbf{L}\mu-\left(\boldsymbol{f}(\textbf{A})\right)_{j,:}\textbf{L}\mu\right\|=\max_{v_{j}\in\mathcal{V},c_{i}\neq c_{j}}\left\|\sum_{c=1}^{C}l_{c}^{(i,j)}\mu_{c}\right\|,

where l(i,j)=(𝒇(A))i,:L(𝒇(A))j,:Ll^{(i,j)}=\left(\boldsymbol{f}(\textbf{A})\right)_{i,:}\textbf{L}-\left(\boldsymbol{f}(\textbf{A})\right)_{j,:}\textbf{L} and l(i,j)1×Cl^{(i,j)}\in\mathbb{R}^{1\times C}, and (𝒇(A))i,:\left(\boldsymbol{f}(\textbf{A})\right)_{i,:}, (𝒇(A))j,:\left(\boldsymbol{f}(\textbf{A})\right)_{j,:} represents the ithi^{\text{th}} and jthj^{\text{th}} row of 𝒇(A)\boldsymbol{f}(\textbf{A}) respectively. And wm,n(i,j),c=1Clc(i,j)μc=m=1Cn=1Cwm,n(i,j)(μmμn)\exists w_{m,n}^{(i,j)},\sum_{c=1}^{C}l_{c}^{(i,j)}\mu_{c}=\sum_{m=1}^{C}\sum_{n=1}^{C}w_{m,n}^{(i,j)}(\mu_{m}-\mu_{n}), such that 1c,m,n,C,wc,c(i,j)=0,wc,m(i,j)wc,n(i,j)0,min(wm,n(i,j),wn,m(i,j))=0\forall 1\leq c,m,n,\leq C,w_{c,c}^{(i,j)}=0,w_{c,m}^{(i,j)}w_{c,n}^{(i,j)}\geq 0,\min(w_{m,n}^{(i,j)},w_{n,m}^{(i,j)})=0. This can be seen as the process that the positive elements in l(i,j)l^{(i,j)} distribute its value to negative elements in l(i,j)l^{(i,j)}. Then we can obtain c=1C|lc(i,j)|=2m=1Cn=1C|wm,n(i,j)|\sum_{c=1}^{C}|l_{c}^{(i,j)}|=2\sum_{m=1}^{C}\sum_{n=1}^{C}|w_{m,n}^{(i,j)}|. Due to the fact that all elements in 𝒇(A)\boldsymbol{f}(\textbf{A}) are non-negative and each row sum of 𝒇(A)\boldsymbol{f}(\textbf{A}) equals one, we obtain vi,vj𝒱\forall v_{i},v_{j}\in\mathcal{V}, c=1Clc(i,j)=0\sum_{c=1}^{C}l_{c}^{(i,j)}=0 and c=1C|lc(i,j)|<=2\sum_{c=1}^{C}|l_{c}^{(i,j)}|<=2. Further, we denote vj=argmaxvj𝒱,cicjFi,:Fj,:v_{j^{*}}=\text{argmax}_{v_{j}\in\mathcal{V},c_{i}\neq c_{j}}\left\|\textbf{F}_{i,:}-\textbf{F}_{j,:}\right\| for vi𝒱\forall v_{i}\in\mathcal{V}, then we obtain 𝒟(F,vi)=Fi,:Fj,:=μciμcj.\mathcal{D}(\textbf{F},v_{i})=\left\|\textbf{F}_{i,:}-\textbf{F}_{j^{*},:}\right\|=\left\|\mu_{c_{i}}-\mu_{c_{j^{*}}}\right\|. Subsequently, we derive

𝒟(𝒇(A)F,vi)\displaystyle\mathcal{D}(\boldsymbol{f}(\textbf{A})\textbf{F},v_{i}) =maxvj𝒱,cicjm=1Cn=1Cwm,n(i,j)(μmμn)maxvj𝒱,cicjm=1Cn=1C|wm,n(i,j)|(μmμn)\displaystyle=\max_{v_{j}\in\mathcal{V},c_{i}\neq c_{j}}\left\|\sum_{m=1}^{C}\sum_{n=1}^{C}w_{m,n}^{(i,j)}(\mu_{m}-\mu_{n})\right\|\leq\max_{v_{j}\in\mathcal{V},c_{i}\neq c_{j}}\sum_{m=1}^{C}\sum_{n=1}^{C}|w_{m,n}^{(i,j)}|\left\|(\mu_{m}-\mu_{n})\right\|
maxvj𝒱,cicjm=1Cn=1C|wm,n(i,j)|μciμcjμciμcj=𝒟(F,vi).\displaystyle\leq\max_{v_{j}\in\mathcal{V},c_{i}\neq c_{j}}\sum_{m=1}^{C}\sum_{n=1}^{C}|w_{m,n}^{(i,j)}|\left\|\mu_{c_{i}}-\mu_{c_{j^{*}}}\right\|\leq\left\|\mu_{c_{i}}-\mu_{c_{j^{*}}}\right\|=\mathcal{D}(\textbf{F},v_{i}).

For the second result, 𝒇(A)\boldsymbol{f}(\textbf{A}) is a row stochastic matrix, so absolute values of all eigenvalues of 𝒇(A)\boldsymbol{f}(\textbf{A}) are within 1. Therefore, Var{𝒇(A)N}1=trace(𝔼(𝒇(A)NN𝒇(A)))nσ2trace(𝔼(𝒇(A)𝒇(A)))n2σ2=Var{N}1\left\|\text{Var}\{\boldsymbol{f}(\textbf{A})\textbf{N}\}\right\|_{1}=\textbf{trace}\left(\mathbb{E}(\boldsymbol{f}(\textbf{A})\textbf{N}\textbf{N}^{\top}\boldsymbol{f}(\textbf{A})^{\top})\right)\leq n\sigma^{2}\cdot\textbf{trace}\left(\mathbb{E}(\boldsymbol{f}(\textbf{A})\boldsymbol{f}(\textbf{A})^{\top})\right)\leq n^{2}\sigma^{2}=\left\|\text{Var}\{\textbf{N}\}\right\|_{1}.

Theorem 4.2 shows that the distances analyzed decrease after applying the graph filter, indicating that the difficulty increases for classification and reflects that bias is introduced. For noise, the variance of the noise decreases through the low-pass filter, suggesting that the noise is reduced. In order to analyze the compatibility comprehensively, we need to consider these two effects simultaneously. This also explains why (A)\mathcal{I}(\textbf{A}) and (A)\mathcal{R}(\textbf{A}) are necessary in 𝒞(A)\mathcal{C}(\textbf{A}).

4.2 Motivating Example

We further present a case study on cSBM to analyze the performance of GCN with different feature noise and show the effects of 𝒇(A)\boldsymbol{f}(\textbf{A}).

Refer to caption
Figure 1: The performance of GCN and MLP on the synthetic graph datasets generated by cSBM with different σ,p,q\sigma,p,q, corresponding to the noise variance of features, intra-community and inter-community connection probability. The synthetic datasets share the same statistics with Cora [20], such as the number of nodes and edges and the dimension of features. pp and qq used to generate SBMs are (0.9,0.1)(0.9,0.1), (0.8,0.2)(0.8,0.2), and (0.7,0.3)(0.7,0.3) respectively from left to right.

As shown in Figure 1, MLP outperforms GCN when the noise variance σ\sigma is close to zero. The performance gap occurs because 𝒇(A)\boldsymbol{f}(\textbf{A}) biases the prediction of GCN and the lower the quality of the topology, i.e., the lower pp, the larger the gap, from left to right. However, as σ\sigma increases, the performance of both MLP and GCN decreases, but the performance of GCN decreases more slowly than that of MLP because GCN smooths the noise through topology. During this phase, the performance of MLP gradually approaches that of GCN until GCN outperforms MLP. When the noise variance is too large to dominate the feature, the performance of MLP and GCN drops and converges. The convergence performance of GCN is usually higher than MLP, as MLP degenerates into random guesses, while GCN is still able to learn from topology.

This motivating experiment shows that, compared with MLPs which do not utilize graph topology, the bias introduced by the low-pass filter 𝒇(A)\boldsymbol{f}(\textbf{A}) results in that MLPs outperform GNNs when the noise variance in features is near zero, suggesting that this effect is generally bad for performance. However, compared to MLPs, low-pass filters 𝒇(A)\boldsymbol{f}(\textbf{A}) also provide an additional noise reduction ability, which improves the performance of GNNs when the noise variance in features is greater than zero. Therefore, to fully analyze the compatibility between topology and task in GNN models, we design both (A)\mathcal{I}(\textbf{A}) and (A)\mathcal{R}(\textbf{A}) in 𝒞(A)\mathcal{C}(\textbf{A}) to analyze and quantify these two effects simultaneously.

5 Experiments

We conduct experiments to validate the performance of TopoInf:

  • validate the effectiveness of TopoInf on graphs with ground truth labels;

  • estimate TopoInf based on pseudo labels and utilize the estimated TopoInf to refine graph topology for improving model performance;

  • show further applications of model improvement, by modifying topology guided by TopoInf, with DropEdge [19] as an example.

Setup. We choose six real-world graph datasets, namely Cora, CiteSeer, PubMed, Computers, Photos and Actor in our experiments [20, 28, 21, 17]. For Computers and Photos, we randomly select 20 nodes from each class for training, 30 nodes from each class for validation, and use the remaining nodes for testing. For other datasets, we use their public train/val/test splits. We choose nine widely adopted models, namely GCN, SGC, APPNP, GAT, GIN, TAGCN, GPRGNN, BernNet and GCNII to work on [12, 26, 13, 25, 27, 10, 4, 11, 3].

5.1 Validating TopoInf on Graphs

Can TopoInf reflect edges’ influence on model performance? We first conduct experiments on TopoInf computed using ground truth labels. Based on the former analysis, the sign of TopoInf implicates the directional influence of removing eije_{ij} on the model performance, while the absolute value of TopoInf reflects the magnitude of the influence. We validate these implications in real datasets. For each dataset, we compute TopoInf based on ground truth labels and partition edges with positive TopoInf into the positive set and edges with negative TopoInf into the negative set. Then we sequentially remove edges from the positive/negative set ordered by the absolute values of their TopoInf in a descendant manner and record the performance of models.

As is shown in Figure 2, performance curves exhibit an S-shaped pattern. Specifically, after removing edges in the positive set, the model performance increases, while removing edges in the negative set diminishes the performance. Moreover, removing edges with higher absolute TopoInf values results in a greater derivative of the performance plots. These are consistent with our previous analysis and show the effectiveness of TopoInf.

Refer to captionRefer to caption
Figure 2: Performance change while deleting edges by TopoInf. Horizontal axis’s meaning: zero point corresponds to the original graph without deleting any edges. The absolute value of the coordinate denotes the ratio of deleted edges to all edges. The negative/positive coordinate corresponds to the case that the graph is obtained by deleting edges with negative/positive TopoInf values of the corresponding proportion, in the descendant order of the absolute value of TopoInf. The vertical axis is the accuracy of the node classification in the GNN models. The TopoInf values are calculated with 𝒱t\mathcal{V}_{t} in Equation 3 set as the test set. For each step, we remove 10% of edges in the negative/positive set.
Table 2: Performance after deleting edges based on estimated TopoInf on different models and datasets. The estimated TopoInf values are calculated with 𝒱t\mathcal{V}_{t} in Equation 3 set as the whole nodes. The ratio of the edges removed to all edges is a hyperparameter chosen from {0.02,0.04,0.06,0.08}\{0.02,0.04,0.06,0.08\} and is determined by validation. We run experiments 10 times and report testing accuracy with a 95% confidence interval.
Datasets Method GCN SGC APPNP GAT GIN TAGCN GPRGNN BERNNET GCNII
CORA original 80.3±0.3 80.8±0.4 80.4±0.4 80.4±0.6 74.8±0.8 82.0±0.3 81.3±0.7 82.2±0.3 83.6±0.4
w/o retrain(our) 81.6±0.4 81.2±0.5 81.1±0.5 80.9±0.8 77.1±0.8 82.0±0.2 81.6±0.6 82.6±0.4 83.6±0.1
w retrain(our) 81.6±0.4 81.2±0.4 81.0±0.4 81.5±0.4 77.0±0.8 82.0±0.2 81.6±0.6 82.6±0.4 83.5±0.1
\hdashlineCITESEER original 68.7±0.5 65.6±1.0 65.1±0.5 68.1±0.7 61.6±1.4 69.8±0.4 69.4±2.1 71.2±0.3 72.4±0.6
w/o retrain(our) 70.0±0.7 66.8±1.0 67.3±0.8 70.5±1.1 62.1±1.4 70.6±0.3 71.3±0.6 71.0±0.7 72.7±0.6
w retrain(our) 69.8±0.7 66.9±1.2 67.4±0.8 70.5±1.0 62.1±1.4 70.8±0.2 71.3±0.5 71.0±0.6 72.7±0.5
\hdashlinePUBMED original 77.9±0.3 77.6±0.2 77.6±0.4 77.4±0.3 75.5±0.6 79.4±0.1 78.6±0.5 78.9±0.3 79.2±0.2
w/o retrain(our) 78.7±0.3 78.5±0.5 78.5±0.4 77.8±0.2 76.0±0.6 80.0±0.3 79.3±0.5 79.2±0.2 79.4±0.2
w retrain(our) 78.8±0.2 78.4±0.5 78.5±0.3 77.9±0.2 76.1±0.7 79.7±0.3 79.2±0.5 79.1±0.5 79.3±0.2
Table 3: Performance of different DropEdge strategies on different datasets and models. The estimated TopoInf values are calculated with 𝒱t\mathcal{V}_{t} in Equation 3 set as the whole nodes. The DropEdge rate and the temperature are hyper-parameters chosen from {0.3,0.4,0.5}\{0.3,0.4,0.5\} and {0.5,0.75,1}\{0.5,0.75,1\} respectively and are determined by validation. We run experiments 10 times and report testing accuracy with a 95% confidence interval.
Datasets DropEdge GCN SGC APPNP GAT GIN TAGCN GPRGNN BERNNET GCNII
CORA without 80.3±0.4 80.7±0.7 80.5±0.5 80.6±0.6 74.7±0.8 82.0±0.4 81.5±1.7 82.2±0.4 83.6±0.5
random 81.9±0.7 81.4±0.8 80.8±0.7 81.7±0.6 76.1±0.9 83.0±0.4 82.0±0.9 82.3±0.2 84.2±0.4
TopoInf(our) 82.2±0.6 81.5±0.9 81.6±0.5 82.2±0.5 77.2±0.7 83.0±0.3 82.1±0.8 82.8±0.3 84.5±0.4
\hdashlineCITESEER without 68.8±0.7 65.9±1.3 65.1±0.6 69.3±0.9 61.4±1.4 69.9±0.4 70.4±0.7 71.2±0.5 72.5±0.6
random 68.7±0.9 67.1±1.0 67.8±0.8 69.8±0.8 62.4±1.0 71.2±0.4 70.9±0.7 71.1±0.6 72.5±0.6
TopoInf(our) 69.3±1.1 67.8±0.9 68.3±0.9 70.3±0.5 63.5±0.9 71.7±0.3 71.1±0.6 71.8±0.4 72.9±0.6
\hdashlinePUBMED without 77.9±0.5 77.6±0.2 77.6±0.4 77.5±0.5 75.5±0.7 79.5±0.3 78.7±0.7 78.9±0.4 79.2±0.2
random 77.5±0.4 77.9±0.3 77.6±0.5 77.0±0.4 76.9±0.5 79.2±0.3 78.5±0.7 78.7±0.5 78.8±0.3
TopoInf(our) 78.0±0.6 78.3±0.3 78.0±0.4 77.3±0.4 77.4±0.7 79.7±0.2 79.3±0.4 78.8±0.5 79.1±0.2

In addition, we compare our edge removal strategies with other methods. We choose two baseline methods, namely Random and AdaEdge. For Random, we randomly remove edges with equal probability. AdaEdge [2] divides edges between nodes with the same labels into the negative set and different ones into the positive set. AdaEdge randomly removes edges in each set, while TopoInf removes edges in each set in descending order, sorted by the absolute values of their TopoInf.

Figure 3 shows our results. When an equal number of positive/negative edges are removed, the model performance of Random remains unchanged almost, while the model performance of AdaEdge and TopoInf increases/decreases distinctly, indicating a correlation between both the homophily metric and our compatibility metric with the model performance. Moreover, the model performance of TopoInf increases/decreases more significantly than that of AdaEdge. This shows the effectiveness of TopoInf as it not only identifies the direction of influence but also measures the magnitude of influence.

5.2 Validating Estimated TopoInf

When ground truth labels are generally only available on training sets, we employ an initial training phase using the GNN model to obtain pseudo labels for each node. Thereafter, we estimate TopoInf using these pseudo labels for the nodes without true labels and remove edges based on the estimated TopoInf.

Refer to caption
(a)
Refer to caption
(b)
Figure 3: GCN performance with different edge deletion strategies on Cora. The left figure shows the results of removing positive edges to increase performance (higher is better), and the right figure shows the results of removing negative edges to decrease performance (lower is better). The horizontal axis denotes the ratio of deleted edges to all edges. The vertical axis is the accuracy of the node classification in GCN. The TopoInf values are calculated with 𝒱t\mathcal{V}_{t} in the Equation 3 set as a test set.

How well does estimated TopoInf work to improve model performance? We conduct experiments on nine GNN models. We train the model and obtain pseudo labels. After that, we can estimate TopoInf based on pseudo labels and refine the original topology based on the estimated TopoInf. Specifically, we remove a given number of edges, which is a hyper-parameter determined by validation, with the highest estimated TopoInf. Since we have already obtained the model parameters through the initial training, we can choose to either continue to apply these parameters on the refined topology (denoted as w retrain) or retrain the model on the refined topology (denoted as w/o retrain).

Table 3 shows the results. We can see that in almost all cases, the refined topologies based on estimated TopoInf outperform the original ones, in both retraining and non-retraining settings. On SOTA models, such as BernNet, TAGCN and GCNII, TopoInf still achieves competitive results or improves performance. This experiment shows that the estimated TopoInf is still an effective metric for enhancing topology.

Are there any other approaches to utilize estimated TopoInf? In this work, we also demonstrate that the estimated TopoInf can be combined with other methods to guide topology modification and improve model performance. We take DropEdge [19] as an example, where edges are randomly removed from the original graph at each training epoch to help prevent over-fitting and over-smoothing and achieve better performance. We replace the random dropping edge scheme with a TopoInf-guided scheme, where edges with higher TopoInf are more likely to be dropped. Specifically, for TopoInf-guided DropEdge, we define our edge-dropping probability as PijAijexp(𝒞A(eij)/τ),P_{ij}\propto\textbf{A}_{ij}\cdot\exp\left(\nabla\mathcal{C}_{\textbf{A}}(e_{ij})/\tau\right), where Aij\textbf{A}_{ij} indicates the existence of eije_{ij}, 𝒞A(eij)\nabla\mathcal{C}_{\textbf{A}}(e_{ij}) is the TopoInf of eije_{ij}, and τ\tau is the temperature, which is a hyper-parameter and determined by validation.

Table 3 shows the results. In almost all cases, our TopoInf-guided DropEdge method achieves the highest accuracy, and in all cases, our method outperforms DropEdge. Even in cases where DropEdge degenerates the performance of GNNs, such as BernNet on CiteSeer, and GPRGNN on PubMed, TopoInf-guided DropEdge can still improve the performance. This experiment shows the potential to combine TopoInf with other methods for topology modification and performance improvement.

6 Related Works

Graph filters optimization. These studies approach graphs as filters for features and aim to find the optimal filter coefficients for the graph learning task. ChebNet [6] approximates the spectral graph convolutions using Chebyshev polynomials. GPRGNN [4] adaptively learns a polynomial filter and jointly optimizes the neural parameters of the neural network as well as filter coefficients. ASGC [1] obtains filter coefficients by minimizing the approximation error between the raw feature and the feature filtered by graph topology, to make SGC appropriate in heterophilic settings. These works improve graph learning by optimizing coefficients of graph filters, while we focus on optimizing graph topology.

Graph topology optimization. Curvature-based methods [23] view graphs from a geometric view, measure the effect of edges by the graph curvature, and rewire the graph topology with the help of curvature, to improve the model performance. Node pair distances [2, 24] and edge signal-to-noise ratio [9] are proposed to measure the topological information and alleviate the over-smoothing to further build deeper models to exploit the multi-hop neighborhood structures. In the era of large models, Sun et al. [22] utilize LLMs to evaluate the edges between nodes based on node attributes in order to optimize graph topology.

7 Conclusion

In this work, we model and analyze the compatibility between the topological information of graph data and downstream task objectives, and propose a new metric called TopoInf to characterize the influence of one single edge by measuring the change of the compatibility. We also discuss the potential applications of leveraging TopoInf to model performance adjustment and empirically demonstrate the effectiveness of TopoInf. In future, a better estimation of the topological effect of GNNs may further improve our metric.

Acknowledgment

This work was supported by the National Key Research and Development Plan No. 2022YFB3904204, NSF China under Grant No. 62202299, 62020106005, 61960206002, Shanghai Natural Science Foundation No. 22ZR1429100.

References

  • [1] Chanpuriya, S., Musco, C.: Simplified graph convolution with heterophily. In: NeurIPS (2022)
  • [2] Chen, D., Lin, Y., Li, W., et al.: Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In: AAAI (2020)
  • [3] Chen, M., Wei, Z., Huang, Z., et al.: Simple and deep graph convolutional networks. In: ICML (2020)
  • [4] Chien, E., Peng, J., Li, P., Milenkovic, O.: Adaptive universal generalized pagerank graph neural network. In: ICLR (2021)
  • [5] Chin, A., Chen, Y., M. Altenburger, K., Ugander, J.: Decoupled smoothing on graphs. In: WWW (2019)
  • [6] Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: NeurIPS (2016)
  • [7] Deshpande, Y., Montanari, A., Mossel, E., Sen, S.: Contextual stochastic block models. In: NeurIPS (2018)
  • [8] Dong, H., Chen, J., Feng, F., et al.: On the equivalence of decoupled graph convolution network and label propagation. In: WWW (2021)
  • [9] Dong, M., Kluger, Y.: Towards understanding and reducing graph structural noise for GNNs. In: ICML (2023)
  • [10] Du, J., Zhang, S., Wu, G., Moura, J.M.F., Kar, S.: Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370 (2018)
  • [11] He, M., Wei, Z., Huang, Z., Xu, H.: Bernnet: Learning arbitrary graph spectral filters via Bernstein approximation. In: NeurIPS (2021)
  • [12] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)
  • [13] Klicpera, J., Bojchevski, A., Günnemann, S.: Predict then propagate: Graph neural networks meet personalized PageRank. In: ICLR (2019)
  • [14] Luo, D., Cheng, W., Yu, W., et al.: Learning to drop: Robust graph neural network via topological denoising. In: WSDM (2021)
  • [15] Ma, Y., Liu, X., Shah, N., et al.: Is homophily a necessity for graph neural networks? arXiv preprint arXiv:2106.06134 (2021)
  • [16] Nt, H., Maehara, T.: Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550 (2019)
  • [17] Pei, H., Wei, B., Chang, K.C., Lei, Y., Yang, B.: Geom-gcn: Geometric graph convolutional networks. In: ICLR (2020)
  • [18] Ren, Y., Bai, J., Zhang, J.: Label contrastive coding based graph neural network for graph classification. In: DASFAA (2021)
  • [19] Rong, Y., Huang, W., Xu, T., Huang, J.: Dropedge: Towards deep graph convolutional networks on node classification. In: ICLR (2020)
  • [20] Sen, P., Namata, G., Bilgic, M., et al.: Collective classification in network data. AI magazine (2008)
  • [21] Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018)
  • [22] Sun, S., Yuxiang, R., et al.: Large language models as topological structure enhancers for text-attributed graphs. arXiv preprint arXiv:2311.14324 (2024)
  • [23] Topping, J., Di Giovanni, F., Chamberlain, B.P., et al.: Understanding over-squashing and bottlenecks on graphs via curvature. In: ICLR (2022)
  • [24] Bai, J., Ren, Y., Zhang, J.: Measuring and sampling: A metric-guided subgraph learning framework for graph neural network. Int. J. Intell. Syst. (2022)
  • [25] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: ICLR (2018)
  • [26] Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., Weinberger, K.: Simplifying graph convolutional networks. In: ICML (2019)
  • [27] Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? In: ICLR (2019)
  • [28] Yang, Z., Cohen, W., Salakhudinov, R.: Revisiting semi-supervised learning with graph embeddings. In: WWW (2016)
  • [29] Zhang, W., Yang, M., Sheng, Z., et al.: Node-dependent local smoothing for scalable graph learning. In: NeurIPS (2021)
  • [30] Zhu, H., Koniusz, P.: Simple spectral graph convolution. In: ICLR (2021)
  • [31] Zhu, J., Rossi, R.A., Rao, A., et al.: Graph neural networks with heterophily. In: AAAI (2021)
  • [32] Zhu, J., Yan, Y., et al.: Beyond homophily in graph neural networks: Current limitations and effective designs. In: NeurIPS (2020)