

GraphGLOW: Universal and Generalizable Structure Learning for Graph Neural Networks

Wentao Zhao ([email protected]), Qitian Wu ([email protected]), Chenxiao Yang ([email protected]), and Junchi Yan ([email protected]), Shanghai Jiao Tong University, Shanghai, China
(2023)
Abstract.

Graph structure learning is a well-established problem that aims at optimizing graph structures adaptive to specific graph datasets to help message-passing neural networks (i.e., GNNs) yield effective and robust node embeddings. However, the common limitation of existing models lies in the underlying closed-world assumption: the testing graph is the same as the training graph. This premise requires independently training the structure learning model from scratch for each graph dataset, which leads to prohibitive computation costs and potential risk of serious over-fitting. To mitigate these issues, this paper explores a new direction that moves forward to learn a universal structure learning model that can generalize across graph datasets in an open world. We first introduce the mathematical definition of this novel problem setting, and describe the model formulation from a probabilistic data-generative perspective. Then we devise a general framework that coordinates a single graph-shared structure learner and multiple graph-specific GNNs to capture the generalizable patterns of optimal message-passing topology across datasets. The well-trained structure learner can directly produce adaptive structures for unseen target graphs without any fine-tuning. Across diverse datasets and various challenging cross-graph generalization protocols, our experiments show that even without training on target graphs, the proposed model i) significantly outperforms expressive GNNs trained on the input (non-optimized) topology, and ii) surprisingly performs on par with state-of-the-art models that independently optimize adaptive structures for specific target graphs, with notably orders-of-magnitude acceleration in training on the target graph.

Graph Structure Learning, Out-of-Distribution Generalization, Graph Neural Networks, Network Representation Learning
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 6–10, 2023, Long Beach, CA, USA; booktitle: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23), August 6–10, 2023, Long Beach, CA, USA; price: 15.00; doi: 10.1145/3580305.3599373; isbn: 979-8-4007-0103-0/23/08; ccs: Computing methodologies → Machine learning algorithms

1. Introduction

Graph neural networks (GNNs) (Kipf and Welling, 2016; Veličković et al., 2018; Hamilton et al., 2017), as a de facto model class based on the message passing principle, have shown promising efficacy for learning node representations on graph-structured data, with extensive applications to, e.g., physics simulation (Sanchez-Gonzalez et al., 2020), traffic prediction (Jiang and Luo, 2022), and drug recommendation (Yang et al., 2023b). However, due to inevitably error-prone data collection (Chen et al., 2020), the input graph may contain spurious and unobserved edges that lead to sub-optimal results of GNNs and degrade the downstream performance.

Graph structure learning (Zhu et al., 2021c) serves as a plausible remedy for this issue by optimizing graph structures and GNN classifiers at the same time. To this end, recent endeavors explore different technical aspects, e.g., parameterizing each potential edge between any pair of nodes (Franceschi et al., 2019; Elinas et al., 2020) or estimating potential links through a parameterized network (Chen et al., 2020; Wu et al., 2022b; Li et al., 2018). However, existing models limit their applicability to a closed-world hypothesis: the training and testing of structure learning models, which optimize the graph structures, are performed on the same graph. The issue is that since structure learning is often heavyweight and requires sophisticated optimization, it can be prohibitively resource-consuming to train structure learning models from scratch for each graph dataset. Moreover, due to limited labels in common graph-based predictive tasks, structure learning models are prone to over-fitting, given that they cannot utilize the common knowledge shared across different graph datasets.

To resolve the above dilemma, this paper attempts to explore a novel problem setting termed Open-World Graph Structure Learning. Specifically, we target learning a generalizable graph structure learning model which is trained with multiple source graphs and can be directly adapted for inference (without re-training or fine-tuning) on new unseen target graphs. We formulate the problem as a bi-level optimization target that jointly learns a single dataset-shared structure learner and multiple dataset-specific GNNs tailored for particular graph datasets, as shown in Fig. 1. Under such a framework, the well-trained structure learner can leverage the common transferrable knowledge across datasets for enhancing generalization and more critically, be readily utilized to yield adaptive message-passing topology for arbitrarily given target graphs.

With the guidance of the aforementioned general goal, we propose GraphGLOW (short for A Graph Structure Learning Model for Open-World Generalization) that aims at learning the generalizable patterns of optimal message-passing topology across source graphs. Specifically, we first take a bottom-up perspective and formulate the generative process for observed data in a probabilistic manner. On top of this, we derive a tractable and feasible learning objective through the lens of variational inference. The structure learner is specified as a multi-head weighted similarity function so as to guarantee enough expressivity for accommodating diverse structural information, and we further harness an approximation scheme to reduce the quadratic complexity overhead of learning potential edges from arbitrary node pairs.

Refer to caption
Figure 1. Illustration of Open-World Graph Structure Learning. In a diverse set of source graphs, we train multiple dataset-specific GNNs and a shared structure learner. In the target graph, we directly utilize the learned structure learner and only need to train a new GNN.

To reasonably and comprehensively evaluate the model, we devise experiments with a diverse set of protocols that measure the generalization ability under different difficulty levels (according to the intensity of distribution shifts between source graphs and target graphs). Concretely, we consider: 1) in-domain generalization, in which we generalize from some citation (social) networks to other citation (social) networks; 2) cross-domain generalization between citation and social networks. The results, which are consistent across various combinations of source and target graph datasets, demonstrate that when evaluated on the target graphs, our approach i) consistently outperforms directly training the GNN counterpart on the original non-optimized graph structures of the target datasets, and ii) performs on par with state-of-the-art structure learning methods (Franceschi et al., 2019; Chen et al., 2020; Elinas et al., 2020) trained on target graphs from scratch, with up to 25× less training time. Our code is available at https://github.com/WtaoZhao/GraphGLOW.

Refer to caption
Figure 2. Illustration of the proposed framework GraphGLOW targeting open-world graph structure learning. The middle part of the figure presents the training process for the structure learner together with multiple dataset-specific GNNs on source graphs. In (a)-(e) we illustrate the details of the graph structure learner, the backbone GNN, the iterative training process, the training procedure and the transferring procedure. When training is finished, the structure learner is fixed and we only need to train a dataset-specific GNN on the new target graph, with latent structures inferred by the well-trained structure learner.

2. Preliminary and Problem Definition

Node-Level Predictive Tasks. Denote a graph with $N$ nodes as $\mathcal{G}=(\mathbf{A},\mathbf{X},\mathbf{Y})$, where $\mathbf{A}=\{a_{uv}\}_{N\times N}$ is an adjacency matrix ($a_{uv}=1$ if the edge between nodes $u$ and $v$ exists and $0$ otherwise), $\mathbf{X}=\{\mathbf{x}_{u}\}_{N\times D}$ is a feature matrix with $\mathbf{x}_{u}$ the $D$-dimensional feature vector of node $u$, and $\mathbf{Y}=\{y_{u}\}_{N\times C}$ with $y_{u}$ the label vector of node $u$ and $C$ the number of classes. The node labels are partially observed as training data, based on which node-level prediction aims to infer the unobserved labels of testing nodes in the graph using node features and graph structures. The latter is often achieved via a GNN model, denoted as $h_{w}$, that yields predicted node labels $\hat{\mathbf{Y}}=h_{w}(\mathbf{A},\mathbf{X})$ and is optimized with the classification loss $w^{*}=\operatorname*{arg\,min}_{w}\mathcal{L}(\hat{\mathbf{Y}},\mathbf{Y})$ using observed labels from training nodes.

Closed-World Graph Structure Learning (GLCW). Standard graph structure learning for node-level predictive tasks trains a graph structure learner $g_{\theta}$ to refine the given structure, i.e., $\hat{\mathbf{A}}=g_{\theta}(\mathbf{A},\mathbf{X})$, over which the GNN classifier $h_{w}$ conducts message passing to produce node representations and predictions. The learner $g_{\theta}$ is expected to produce optimal graph structures that give rise to satisfactory downstream classification performance of the GNN classifier. Formally, the goal of training $g_{\theta}$ along with $h_{w}$ can be expressed as a nested optimization problem:

(1) $\theta^{*}=\operatorname*{arg\,min}_{\theta}\min_{w}\mathcal{L}\left(h_{w}(g_{\theta}(\mathbf{A},\mathbf{X}),\mathbf{X}),\mathbf{Y}\right).$

The above formulation of graph structure learning under closed-world assumptions constrains the training and testing nodes to lie in the same graph, which requires $g_{\theta}$ to be trained from scratch on each graph dataset. Since $g_{\theta}$ is often far more complicated (e.g., with orders-of-magnitude more trainable parameters) and harder to optimize (due to the bi-level optimization in (1)) than the GNN $h_{w}$, GLCW leads to undesirable inefficiency and vulnerability to serious over-fitting (due to limited labeled information).

Open-World Graph Structure Learning (GLOW). In this work, we turn to a new learning paradigm that generalizes graph structure learning to open-world assumptions, borrowing concepts from domain generalization (Muandet et al., 2013) and, more broadly, out-of-distribution generalization (Wu et al., 2022a). Specifically, assume that we are given multiple source graphs, denoted as $\{\mathcal{G}^{s}_{m}\}_{m=1}^{M}=\{(\mathbf{A}^{s}_{m},\mathbf{X}^{s}_{m},\mathbf{Y}^{s}_{m})\}_{m=1}^{M}$, and a target graph $\mathcal{G}^{t}=(\mathbf{A}^{t},\mathbf{X}^{t},\mathbf{Y}^{t})$, whose distribution often differs from that of any source graph. The goal is to train a universal structure learner $g_{\theta}$ on source graphs which can be directly used for inference on the target graph without any re-training or fine-tuning. The trained structure learner is expected to produce desirable graph structures that enable better downstream classification by a GNN classifier optimized for the target graph.

More specifically, we consider a one-to-many framework that coordinates a shared graph structure learner $g_{\theta}$ and multiple dataset-specific GNNs $\{h_{w_{m}}\}_{m=1}^{M}$, where $h_{w_{m}}$ with independent parameterization $w_{m}$ is optimized for a given source graph $\mathcal{G}_{m}^{s}$. With the aim of learning a universal $g_{\theta}$ that can generalize to new unseen target graphs, our training goal can be formulated as the following bi-level optimization problem:

(2) $\theta^{*}=\operatorname*{arg\,min}_{\theta}\min_{w_{1},\cdots,w_{M}}\sum_{m=1}^{M}\mathcal{L}\left(h_{w_{m}}(g_{\theta}(\mathbf{A}^{s}_{m},\mathbf{X}^{s}_{m}),\mathbf{X}^{s}_{m}),\mathbf{Y}^{s}_{m}\right),$

where the inner optimization is a multi-task learning objective. Generally, (2) aims at finding an optimal $g_{\theta}$ that jointly minimizes the classification losses induced by the $M$ GNN models, each trained for a particular source graph. After training, we can directly adapt $g_{\theta^{*}}$ to the target graph for testing purposes, and only need to train a GNN $h_{w}$ on the target graph:

(3) $w^{*}=\operatorname*{arg\,min}_{w}\mathcal{L}\left(h_{w}(g_{\theta^{*}}(\mathbf{A}^{t},\mathbf{X}^{t}),\mathbf{X}^{t}),\mathbf{Y}^{t}\right).$
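To make the one-to-many objective in (2) concrete, the following minimal sketch (our own illustration rather than the released implementation) alternately evaluates a single shared structure learner and per-graph GNN heads on toy data; the module definitions, random data, and hyper-parameters are assumptions for exposition only.

```python
# Minimal sketch of the training objective in Eq. (2): one shared structure
# learner g_theta and one GNN h_{w_m} per source graph, optimized jointly on
# the sum of per-graph classification losses. All modules and data are toy
# stand-ins, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyStructureLearner(nn.Module):      # stands in for g_theta
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                  # x: [N, dim] node features
        z = self.proj(x)
        return torch.sigmoid(z @ z.t())    # [N, N] latent edge probabilities

class ToyGNN(nn.Module):                   # stands in for h_{w_m}
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.lin = nn.Linear(in_dim, num_classes)

    def forward(self, adj, x):             # one propagation step on the latent graph
        adj = adj / adj.sum(-1, keepdim=True).clamp(min=1e-6)
        return self.lin(adj @ x)

# toy source graphs (features, labels); sizes are arbitrary assumptions
graphs = [(torch.randn(30, 16), torch.randint(0, 3, (30,))) for _ in range(2)]
g_theta = ToyStructureLearner(dim=16)
gnns = [ToyGNN(16, 3) for _ in graphs]
opt = torch.optim.Adam(list(g_theta.parameters()) +
                       [p for h in gnns for p in h.parameters()], lr=1e-2)

for epoch in range(50):
    opt.zero_grad()
    loss = 0.0
    for (x, y), h in zip(graphs, gnns):    # sum of per-graph losses as in Eq. (2)
        a_hat = g_theta(x)                 # latent structure from the shared learner
        loss = loss + F.cross_entropy(h(a_hat, x), y)
    loss.backward()
    opt.step()
```

At test time, following (3), the shared learner would be frozen and only a fresh GNN would be fit on the target graph with the structures it infers.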

3. Proposed Model

To handle the above problem, we present an end-to-end learning framework GraphGLOW that guides the central graph structure learner to learn adaptive message-passing structures exploited by multiple GNNs. The overview of GraphGLOW is shown in Fig. 2.

The fundamental challenge of GLOW lies in how to model and capture the generalizable patterns among adaptive structures of different graphs. To this end, we first take a data-generative perspective that treats the inputs and intermediate results as random variables and investigate their dependencies, based on which we present the high-level model formulation in a probabilistic form (Sec. 3.1). Then we proceed to instantiate the model components (Sec. 3.2). Finally, we discuss differentiable training approaches for optimization (Sec. 3.3).

3.1. Model Formulation

To commence, we characterize the data generation process by a latent variable model, based on which we derive the formulation of our method. We treat the latent graph $\hat{\mathbf{A}}$ (given by $g_{\theta}$) as a latent variable whose prior distribution is given by $p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$. The prior distribution reflects how one presumes on the latent structures before observed labels arrive. Then, the prediction is given by a predictive distribution $p(\mathbf{Y}|\hat{\mathbf{A}},\mathbf{A},\mathbf{X})$. The learning objective aims at maximizing the log-likelihood of observed labels, which can be written as $\log p(\mathbf{Y}|\mathbf{A},\mathbf{X})=\log\int_{\hat{\mathbf{A}}}p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})d\hat{\mathbf{A}}$. To estimate latent graphs that could enhance message passing for downstream tasks, one plausible way is to sample from the posterior $p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})$, conditioned on the labels from downstream tasks. Using Bayes' rule, we have

(4) $p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})=\frac{p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}{\int_{\hat{\mathbf{A}}}p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})d\hat{\mathbf{A}}}.$

However, the integration over $\hat{\mathbf{A}}$ in the denominator is computationally intractable due to the exponentially large space of $\hat{\mathbf{A}}$.

To circumvent the difficulty, we introduce a variational distribution $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$ over $\hat{\mathbf{A}}$ as an approximation to $p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})$. We can sample latent graphs from $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$, i.e., instantiate it as the structure learner $g_{\theta}$, and once $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})=p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})$, we obtain samples from the posterior that ideally generates the optimal graph structures for downstream prediction. By this principle, we can start with minimizing the Kullback-Leibler divergence between $q$ and $p$ and derive the learning objective as follows:

(5) $\mathcal{D}_{KL}(q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})\,\|\,p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X}))=-\underbrace{\mathbb{E}_{\hat{\mathbf{A}}\sim q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}\left[\log\frac{p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}{q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}\right]}_{\text{Evidence Lower Bound}}+\log p(\mathbf{Y}|\mathbf{A},\mathbf{X}).$

Based on this equation, we further have the inequality which bridges the relationship between the Evidence Lower Bound (ELBO) and the observed data log-likelihood:

(6) $\log p(\mathbf{Y}|\mathbf{A},\mathbf{X})\geq\mathbb{E}_{\hat{\mathbf{A}}\sim q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}\left[\log\frac{p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}{q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}\right].$

The equality holds if and only if $\mathcal{D}_{KL}(q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})\,\|\,p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X}))=0$. The above fact suggests that we can optimize the ELBO as a surrogate for $\log p(\mathbf{Y}|\mathbf{A},\mathbf{X})$, which involves the intractable integration. More importantly, when the ELBO is optimized w.r.t. the $q$ distribution, the variational bound is lifted to the original log-likelihood and one has $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})=p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})$, i.e., the variational distribution equals the true posterior, which is what we expect.

Pushing further and incorporating source graphs $\mathcal{G}_{m}$ (we omit the superscript for simplicity), we arrive at the following objective:

(7) $\mathbb{E}_{\mathcal{G}_{m}\sim p(\mathcal{G})}\Big[\mathbb{E}_{\hat{\mathbf{A}}\sim q_{\theta}(\hat{\mathbf{A}}|\mathbf{A}=\mathbf{A}_{m},\mathbf{X}=\mathbf{X}_{m})}\big[\log p_{w_{m}}(\mathbf{Y}|\mathbf{A}=\mathbf{A}_{m},\mathbf{X}=\mathbf{X}_{m},\hat{\mathbf{A}})+\log p_{0}(\hat{\mathbf{A}}|\mathbf{A}=\mathbf{A}_{m},\mathbf{X}=\mathbf{X}_{m})-\log q_{\theta}(\hat{\mathbf{A}}|\mathbf{A}=\mathbf{A}_{m},\mathbf{X}=\mathbf{X}_{m})\big]\Big].$

Here we instantiate $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$ as the shared structure learner $g_{\theta}$, $p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$ as a (shared) non-parametric prior distribution $p_{0}$ over latent structures, and $p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})$ as the dataset-specific GNN model $h_{w_{m}}$, so as to suit the framework for the problem formulated in Section 2. The formulation of (7) shares the spirit of Bayesian meta learning (Finn et al., 2018). We can treat the GNN training as a dataset-specific learning task and the latent graph as a certain ‘learning algorithm’ or ‘hyper-parameter’, so (7) essentially aims at learning a structure learner that can yield a desirable ‘learning algorithm’ for each specific learning task on graphs. Furthermore, the three terms in (7) have distinct effects: i) the predictive term $\log p_{w_{m}}$ acts as a supervised classification loss; ii) the prior term $\log p_{0}$ serves as regularization on the generated structures; iii) the third term, which is essentially the entropy of $q_{\theta}$, penalizes high confidence on certain structures.

To sum up, we can optimize (7) with joint learning of the structure learner $g_{\theta}$ and the GNN models $\{h_{w_{m}}\}_{m=1}^{M}$ on source graphs $\{\mathcal{G}_{m}\}_{m=1}^{M}$ for training the structure learner. After that, we can generalize the well-trained $g_{\theta^{*}}$ to estimate latent graph structures for a new target graph $\mathcal{G}^{t}=(\mathbf{A}^{t},\mathbf{X}^{t})$ and only need to train the GNN model $h_{w}$ w.r.t. the predictive objective with fixed $\theta^{*}$:

(8) $\mathbb{E}_{\hat{\mathbf{A}}\sim q_{\theta^{*}}(\hat{\mathbf{A}}|\mathbf{A}=\mathbf{A}^{t},\mathbf{X}=\mathbf{X}^{t})}\left[\log p_{w}(\mathbf{Y}|\mathbf{A}=\mathbf{A}^{t},\mathbf{X}=\mathbf{X}^{t},\hat{\mathbf{A}})\right].$

We next discuss how to specify $g_{\theta}$, $h_{w_{m}}$ and $p_{0}$, with a special focus on their expressiveness and efficiency, in Section 3.2. Later, we present the details of loss computation and model training based on the above formulation in Section 3.3.

3.2. Model Instantiations

3.2.1. Instantiation for $q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$

The variational distribution aims at learning the conditional distribution that generates suitable latent structures for message passing based on input observations. A natural means is to treat each edge of the latent graph as a Bernoulli random variable and let the distribution $q$ be a product of $N\times N$ independent Bernoulli random variables (Elinas et al., 2020; Wu et al., 2020).

The graph structure learner $g_{\theta}$ can be used for predicting the Bernoulli parameter matrix. To accommodate the information from node features and graph structure, we use the node representation, denoted as $\mathbf{z}_{u}\in\mathbb{R}^{d}$, where $d$ is the embedding dimension, to compute the edge probability $\alpha_{uv}$ for edge $(u,v)$ as

(9) $\alpha_{uv}=\delta\left(\frac{1}{H}\sum_{h=1}^{H}s(\mathbf{w}_{h}^{1}\odot\mathbf{z}_{u},\mathbf{w}_{h}^{2}\odot\mathbf{z}_{v})\right),$

where $s(\cdot,\cdot)$ is a similarity function for two vectors, $\odot$ denotes the Hadamard product, $\delta$ is a function that maps its input into values within $[0,1]$, and $\mathbf{w}_{h}^{1},\mathbf{w}_{h}^{2}\in\mathbb{R}^{d}$ are two weight vectors of the $h$-th head. Common choices for $s(\cdot,\cdot)$ include simple dot-product, cosine distance (Nguyen and Bai, 2010), RBF kernel (Li et al., 2018), etc. Here we introduce $H$ heads and aggregate their results to enhance the model's expressiveness for capturing the underlying influence between nodes from multifaceted causes, following the spirit of multi-head attention (Vaswani et al., 2017; Veličković et al., 2018). Besides, the weight vectors in (9) can learn to scale the input node representations element-wise and adaptively attend to dominant features. Apart from these, the two weight vectors $\mathbf{w}_{h}^{1},\mathbf{w}_{h}^{2}$ with independent parameterization can have the same or distinct directions, which makes the model capable of connecting similar or dissimilar nodes and expressive enough to handle both homophilous and non-homophilous graphs.
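As a concrete reading of (9), the sketch below (an assumption-laden illustration, not the official code) instantiates $s(\cdot,\cdot)$ as cosine similarity and $\delta$ as a sigmoid; the embedding size and head count are arbitrary choices.

```python
# Sketch of the multi-head weighted similarity in Eq. (9): each head rescales
# the node embeddings element-wise, scores all pairs, and the head-averaged
# score is squashed into an edge probability.
import torch
import torch.nn.functional as F

def edge_probabilities(z, w1, w2):
    """z: [N, d] node embeddings; w1, w2: [H, d] per-head weight vectors.
    Returns alpha: [N, N] with alpha[u, v] as in Eq. (9)."""
    scores = []
    for h in range(w1.shape[0]):
        zu = w1[h] * z                          # element-wise scaling, one head
        zv = w2[h] * z
        sim = F.cosine_similarity(zu.unsqueeze(1), zv.unsqueeze(0), dim=-1)
        scores.append(sim)
    return torch.sigmoid(torch.stack(scores).mean(dim=0))

# usage on random embeddings (assumed sizes)
z = torch.randn(100, 32)
w1, w2 = torch.randn(4, 32), torch.randn(4, 32)
alpha = edge_probabilities(z, w1, w2)           # [100, 100]
```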

To obtain the discrete latent graph $\hat{\mathbf{A}}=\{\hat{a}_{uv}\}_{N\times N}$, one can sample each latent edge from $\hat{a}_{uv}\sim\mathrm{Bernoulli}(\alpha_{uv})$. However, such an approach induces quadratic algorithmic complexity $O(N^{2})$ for computing and storing an estimated structure that entails potential links between any node pair, which could be prohibitive for large graphs. To reduce space and time complexity, we adopt a pivot-based structure learning method, as shown in Fig. 3. Concretely, we randomly choose $P$ nodes in the graph as pivots, where $P$ is a hyperparameter much smaller than $N$ (e.g., $P\approx\frac{1}{10}N$). We then leverage pivot nodes as intermediates and convert the $N\times N$ graph $\hat{\mathbf{A}}$, which can be prohibitively large with dense edges, into a cascade of one $N\times P$ node-pivot bipartite graph $\hat{\mathbf{B}}_{1}$ and one $P\times N$ pivot-node bipartite graph $\hat{\mathbf{B}}_{2}=\hat{\mathbf{B}}_{1}^{\top}$, which effectively controls the computational cost with a proper $P$. In this way, we can compute a node-pivot similarity matrix $\mathbf{\Gamma}=\{\alpha_{up}\}_{N\times P}$ based on (9) to parameterize the distribution of latent graph structures. This only requires $O(NP)$ time and space complexity, and one can sample from each $\mathrm{Bernoulli}(\alpha_{up})$ to obtain $\hat{\mathbf{B}}_{1}$ and $\hat{\mathbf{B}}_{2}$. Meanwhile, the original $N\times N$ adjacency matrix can be retrieved by $\hat{\mathbf{A}}=\hat{\mathbf{B}}_{1}\hat{\mathbf{B}}_{2}$, which suggests that one can execute message passing on $\hat{\mathbf{B}}_{1}$ and $\hat{\mathbf{B}}_{2}$ to approximate that on $\hat{\mathbf{A}}$ (see more details in Section 3.2.2). In terms of accelerating structure learning, other strategies, such as the all-pair message passing schemes with linear complexity explored by (Wu et al., 2022b, 2023b), can also serve this purpose.
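The following sketch illustrates the pivot-based approximation under assumed sizes (random pivot selection, cosine similarity, sigmoid squashing); it is meant only to show where the $O(NP)$ saving comes from, not to reproduce the exact implementation.

```python
# Sketch of the pivot-based approximation: instead of an N x N matrix, compute
# an N x P node-pivot similarity matrix Gamma and sample the bipartite graph
# B1 from Bernoulli(Gamma). Pivot selection and sizes are assumptions.
import torch
import torch.nn.functional as F

N, P, d = 1000, 100, 32                       # P << N, e.g. roughly N / 10
z = torch.randn(N, d)                         # node embeddings
pivot_idx = torch.randperm(N)[:P]             # randomly chosen pivot nodes
z_pivot = z[pivot_idx]                        # [P, d]

# node-pivot similarities, O(N * P) instead of O(N^2)
gamma = torch.sigmoid(F.cosine_similarity(z.unsqueeze(1),
                                           z_pivot.unsqueeze(0), dim=-1))  # [N, P]

b1 = torch.bernoulli(gamma)                   # sampled node-pivot bipartite graph
b2 = b1.t()                                   # pivot-node graph, B2 = B1^T
# The full latent graph A_hat = B1 @ B2 is never materialized; message passing
# is run on B1 and B2 directly (see Eq. (11)).
```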

3.2.2. Instantiation for $p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})$

The predictive distribution, parameterized by the GNN network $h_{w_{m}}$, recursively propagates features along the latent graph to update node representations and produces the prediction result for each node. We now present the details of the GNN's message passing on the latent graph, designed for sufficient expressiveness, stability and efficiency.

To begin with, we review the message-passing rule in common GNN models, such as GCN (Kipf and Welling, 2016), that operates on the original graph $\mathbf{A}$:

(10) $\mathbf{Z}^{(l+1)}=\sigma\left(\operatorname{MP}_{1}(\mathbf{Z}^{(l)},\mathbf{A})\mathbf{W}^{(l)}\right)=\sigma\left(\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\mathbf{Z}^{(l)}\mathbf{W}^{(l)}\right),$

where $\mathbf{W}^{(l)}\in\mathbb{R}^{d\times d}$ is a weight matrix, $\sigma$ is a non-linear activation, $\mathbf{D}$ denotes the diagonal degree matrix of the input graph $\mathbf{A}$, and $\mathbf{Z}^{(l)}=\{\mathbf{z}_{u}^{(l)}\}_{N\times d}$ is the stack of node representations at the $l$-th layer.

Refer to caption
Figure 3. Illustration of scalable structure learning and message passing, which reduces the algorithmic complexity from $O(N^{2})$ to $O(NP)$. We choose $P$ nodes as pivots and convert the $N\times N$ matrix into the product of an $N\times P$ node-pivot matrix and its $P\times N$ transpose (where message passing is executed in two steps, i.e., node-to-pivot and pivot-to-node).

With the estimated latent graph $\hat{\mathbf{A}}=\hat{\mathbf{B}}_{1}\hat{\mathbf{B}}_{2}$, we perform message passing $\mathrm{MP}_{2}(\cdot)$ in a two-step fashion to update node representations:

(11) i) node-to-pivot passing: $\mathbf{C}^{(l+\frac{1}{2})}=\mathrm{RowNorm}(\mathbf{\Gamma}^{\top})\mathbf{Z}^{(l)}$; ii) pivot-to-node passing: $\mathbf{C}^{(l+1)}=\mathrm{RowNorm}(\mathbf{\Gamma})\mathbf{C}^{(l+\frac{1}{2})}$,

where $\mathbf{C}^{(l+\frac{1}{2})}$ is an intermediate node representation and $\mathbf{\Gamma}=\{\alpha_{up}\}_{N\times P}$ is the node-pivot similarity matrix calculated by (9). Such a two-step procedure can be conducted efficiently within $O(NP)$ time and space complexity.
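A minimal sketch of the two-step passing in (11), assuming a dense $\mathbf{\Gamma}$ and row-sum normalization for $\mathrm{RowNorm}$:

```python
# Sketch of the two-step message passing in Eq. (11) over the node-pivot
# similarity matrix Gamma; shapes are illustrative assumptions.
import torch

def row_norm(m, eps=1e-8):
    return m / m.sum(dim=-1, keepdim=True).clamp(min=eps)

def pivot_message_passing(gamma, z):
    """gamma: [N, P] node-pivot similarities, z: [N, d] node states.
    Returns updated node states of shape [N, d] in O(N * P * d)."""
    c_half = row_norm(gamma.t()) @ z          # i) node-to-pivot passing, [P, d]
    c_next = row_norm(gamma) @ c_half         # ii) pivot-to-node passing, [N, d]
    return c_next

gamma = torch.rand(1000, 100)
z = torch.randn(1000, 64)
z_new = pivot_message_passing(gamma, z)       # [1000, 64]
```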

Although feature propagation on the estimated latent structure can presumably yield better node representations, the original input graph structure also contains useful information, such as effective inductive bias (Bronstein et al., 2016). Therefore, we integrate the two message-passing functions to compute the layer-wise update of node representations:

(12) $\mathbf{Z}^{(l+1)}=\sigma\left(\lambda\,\mathrm{MP}_{1}(\mathbf{Z}^{(l)},\mathbf{A})\mathbf{W}^{(l)}+(1-\lambda)\,\mathrm{MP}_{2}(\mathbf{Z}^{(l)},\hat{\mathbf{A}})\mathbf{W}^{(l)}\right),$

where $\lambda$ is a trade-off hyper-parameter that controls the concentration weight on input structures. This design also improves training stability by reducing the impact of large variations in the latent structures throughout the training procedure.

With $L$ GNN layers, one can obtain the prediction $\hat{\mathbf{Y}}$ by setting $\hat{\mathbf{Y}}=\mathbf{Z}^{(L)}$ and $\mathbf{W}^{(L-1)}\in\mathbb{R}^{d\times C}$, where $C$ is the number of classes. Alg. 1 shows the feed-forward computation of message passing.
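The hybrid layer in (12) can be sketched as below; the normalization details, activation and toy inputs are simplifying assumptions rather than the exact implementation behind Alg. 1.

```python
# Sketch of the hybrid propagation layer in Eq. (12): a lambda-weighted mix of
# GCN-style passing on the observed graph A and pivot-based passing on the
# learned structure, followed by a shared weight matrix.
import torch
import torch.nn as nn

class HybridLayer(nn.Module):
    def __init__(self, in_dim, out_dim, lam=0.5):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^{(l)}
        self.lam = lam

    def forward(self, z, adj, gamma):
        # MP1: symmetric normalization D^{-1/2} A D^{-1/2}
        deg = adj.sum(-1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        mp1 = (d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)) @ z
        # MP2: two-step passing over the node-pivot matrix (cf. Eq. (11))
        rn = lambda m: m / m.sum(-1, keepdim=True).clamp(min=1e-8)
        mp2 = rn(gamma) @ (rn(gamma.t()) @ z)
        return torch.relu(self.weight(self.lam * mp1 + (1 - self.lam) * mp2))

layer = HybridLayer(64, 32)
z = torch.randn(200, 64)
adj = torch.bernoulli(torch.full((200, 200), 0.05))
gamma = torch.rand(200, 30)
out = layer(z, adj, gamma)                    # [200, 32]
```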

3.2.3. Instantiation for $p_{0}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$

The prior distribution reflects our presumption on the latent graph structures without the information of observed labels. In other words, it characterizes how likely a given graph structure is to provide enough potential for feature propagation by GNNs. The prior can be leveraged as regularization on the estimated latent graph $\hat{\mathbf{A}}$. With this consideration, we choose the prior as an energy function that quantifies the smoothness of the graph:

(13) $p_{0}(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})\propto\exp\left(-\alpha\sum_{u,v}\hat{\mathbf{A}}_{uv}\|\mathbf{x}_{u}-\mathbf{x}_{v}\|_{2}^{2}-\rho\|\hat{\mathbf{A}}\|_{F}^{2}\right),$

where $\|\cdot\|_{F}$ is the Frobenius norm. The first term in (13) measures the smoothness of the latent graph (Belkin and Niyogi, 2001), under the hypothesis that graphs with smoother features have lower energy (i.e., higher probability). The second term helps avoid overly large node degrees (Kalofolias, 2016). The hyperparameters $\alpha$ and $\rho$ control the strength of the regularization effects.

While we can retrieve the latent graph via $\hat{\mathbf{A}}=\hat{\mathbf{B}}_{1}\hat{\mathbf{B}}_{2}$, the computation of (13) still requires $O(N^{2})$ cost. To reduce the overhead, we apply the regularization to the $P\times P$ pivot-pivot adjacency matrix $\hat{\mathbf{E}}=\hat{\mathbf{B}}_{2}\hat{\mathbf{B}}_{1}$ as a proxy:

(14) $\mathcal{R}(\hat{\mathbf{E}})=\log p_{0}(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})\approx-\alpha\sum_{p,q}\hat{\mathbf{E}}_{pq}\|\mathbf{x}^{\prime}_{p}-\mathbf{x}^{\prime}_{q}\|_{2}^{2}-\rho\|\hat{\mathbf{E}}\|_{F}^{2},$

where $\mathbf{x}^{\prime}_{p}$ denotes the input feature of the $p$-th pivot node.
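A hedged sketch of the proxy regularizer $\mathcal{R}(\hat{\mathbf{E}})$ in (14); the hyper-parameter values and tensor sizes are illustrative assumptions.

```python
# Sketch of the smoothness-plus-sparsity regularizer in Eq. (14), evaluated on
# the P x P pivot-pivot proxy E_hat = B2 @ B1.
import torch

def structure_regularizer(e_hat, x_pivot, alpha=0.1, rho=0.01):
    """e_hat: [P, P] pivot-pivot adjacency, x_pivot: [P, D] pivot features."""
    dist = torch.cdist(x_pivot, x_pivot, p=2).pow(2)  # ||x'_p - x'_q||^2
    smoothness = (e_hat * dist).sum()     # penalizes edges across dissimilar nodes
    sparsity = e_hat.pow(2).sum()         # Frobenius norm, discourages dense graphs
    return -alpha * smoothness - rho * sparsity       # log p_0 up to a constant

b1 = torch.bernoulli(torch.rand(1000, 100))  # sampled node-pivot graph
e_hat = b1.t() @ b1                          # [P, P] proxy of the latent graph
x_pivot = torch.randn(100, 16)               # pivot node features (assumed)
reg = structure_regularizer(e_hat, x_pivot)
```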

3.3. Model Training

For optimization with (7), we proceed to derive the loss functions and updating gradients for $\theta$ and $w_{m}$ based on the three terms $\mathbb{E}_{q_{\theta}}[\log p_{w_{m}}]$, $\mathbb{E}_{q_{\theta}}[\log p_{0}]$ and $\mathbb{E}_{q_{\theta}}[\log q_{\theta}]$.

3.3.1. Optimization for $\mathbb{E}_{q_{\theta}}[\log p_{w_{m}}]$

The optimization difficulty stems from the expectation over $q_{\theta}$, where the sampling process is non-differentiable and hinders back-propagation. Common strategies for approximating sampling of discrete random variables include the Gumbel-Softmax trick (Jang et al., 2017) and the REINFORCE trick (Williams, 1992). However, both strategies yield a sparse graph structure at each sampling step, which could lead to high variance in the prediction result $\log p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})$ produced by message passing over a sampled graph. To mitigate the issue, we alternatively adopt the Normalized Weighted Geometric Mean (NWGM) (Xu et al., 2015) to move the outer expectation to the feature level. Specifically, we have (see Appendix A for detailed derivations)

(15) $\nabla_{\theta}\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[\log p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})]\approx\nabla_{\theta}\log p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}}=\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[\hat{\mathbf{A}}]).$

We denote the opposite of the above term as $\nabla_{\theta}\mathcal{L}_{s}(\theta)$. The gradient w.r.t. $w_{m}$ can be derived similarly. The above form is a biased estimate of the original objective, yet it reduces the variance from sampling and also improves training efficiency (without the need for message passing over multiple sampled graphs). (15) induces the supervised cross-entropy loss.
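The practical effect of (15) is that the supervised loss is computed by passing messages once over the expected structure (the Bernoulli parameters themselves) instead of averaging over sampled graphs; the toy sketch below, with assumed shapes, shows how this keeps the loss differentiable w.r.t. the structure learner.

```python
# Sketch of the NWGM-style relaxation in Eq. (15): message passing is run once
# on the expected structure E_q[A_hat], i.e. directly on the node-pivot
# probabilities Gamma, so the supervised loss stays differentiable.
import torch
import torch.nn.functional as F

gamma = torch.rand(200, 30, requires_grad=True)   # expected node-pivot structure
z = torch.randn(200, 3)                           # logits from a toy GNN layer
labels = torch.randint(0, 3, (200,))

rn = lambda m: m / m.sum(-1, keepdim=True).clamp(min=1e-8)
logits = rn(gamma) @ (rn(gamma.t()) @ z)          # deterministic expected passing
loss_s = F.cross_entropy(logits, labels)          # supervised term induced by Eq. (15)
loss_s.backward()                                 # gradients flow into gamma (i.e., theta)
```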

3.3.2. Optimization for $\mathbb{E}_{q_{\theta}}[\log p_{0}]$

For the second term in (7), we adopt the REINFORCE trick, i.e., policy gradient, to tackle the non-differentiability of sampling from $q_{\theta}$. Specifically, in each feed-forward computation, we sample from the Bernoulli distribution for each edge given by the estimated node-pivot similarity matrix, i.e., $\mathrm{Bernoulli}(\alpha_{up})$, obtain the sampled latent bipartite graph $\hat{\mathbf{B}}_{1}$, and subsequently have $\hat{\mathbf{E}}=\hat{\mathbf{B}}_{2}\hat{\mathbf{B}}_{1}=\hat{\mathbf{B}}_{1}^{\top}\hat{\mathbf{B}}_{1}$. The probability of the latent structure can be computed by

(16) $\pi_{\theta}(\hat{\mathbf{E}})=\prod_{u,p}\left(\hat{\mathbf{B}}_{1,up}\alpha_{up}+(1-\hat{\mathbf{B}}_{1,up})(1-\alpha_{up})\right).$

Denoting $\hat{\mathbf{E}}_{k}$ as the sampled result at the $k$-th time, we can independently sample $K$ times and obtain $\{\hat{\mathbf{E}}_{k}\}_{k=1}^{K}$ and $\{\pi_{\theta}(\hat{\mathbf{E}}_{k})\}_{k=1}^{K}$. Recall that the regularization reward from $\log p_{0}$ is given by (14). The policy gradient (Williams, 1992) yields the gradient of the loss for $\theta$ as

(17) $\nabla_{\theta}\mathcal{L}_{r}(\theta)=-\nabla_{\theta}\mathbb{E}_{\hat{\mathbf{A}}\sim q(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})}[\log p_{0}(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})]\approx-\nabla_{\theta}\frac{1}{K}\sum_{k=1}^{K}\log\pi_{\theta}(\hat{\mathbf{E}}_{k})\left(\mathcal{R}(\hat{\mathbf{E}}_{k})-\overline{\mathcal{R}}\right),$

where $\overline{\mathcal{R}}$ acts as a baseline function obtained by averaging the regularization rewards $\mathcal{R}(\hat{\mathbf{E}}_{k})$ in one feed-forward computation, which helps to reduce the variance during policy gradient training (Sutton et al., 1999).
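A compact sketch of the estimator in (16)-(17) under assumed shapes and hyper-parameters; the regularization reward is the proxy $\mathcal{R}(\hat{\mathbf{E}})$ from (14) and the baseline is the mean of the sampled rewards.

```python
# Sketch of the REINFORCE estimator in Eqs. (16)-(17): sample K bipartite
# graphs from Bernoulli(Gamma), score each with the regularizer, and weight
# the log-probabilities by baseline-subtracted rewards.
import torch

def reinforce_loss(gamma, x_pivot, K=4, alpha=0.1, rho=0.01):
    """gamma: [N, P] Bernoulli parameters, x_pivot: [P, D] pivot features."""
    log_probs, rewards = [], []
    dist = torch.cdist(x_pivot, x_pivot).pow(2)               # ||x'_p - x'_q||^2
    for _ in range(K):
        b1 = torch.bernoulli(gamma).detach()                  # sampled bipartite graph
        log_pi = (b1 * torch.log(gamma + 1e-8)
                  + (1 - b1) * torch.log(1 - gamma + 1e-8)).sum()   # Eq. (16)
        e_hat = b1.t() @ b1                                   # pivot-pivot proxy
        rewards.append(-alpha * (e_hat * dist).sum() - rho * e_hat.pow(2).sum())
        log_probs.append(log_pi)
    rewards = torch.stack(rewards)
    baseline = rewards.mean()                                 # variance-reducing baseline
    return -(torch.stack(log_probs) * (rewards - baseline)).mean()

gamma = torch.rand(1000, 100, requires_grad=True)
loss_r = reinforce_loss(gamma, torch.randn(100, 16))
loss_r.backward()
```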

3.3.3. Optimization with $\mathbb{E}_{q_{\theta}}[\log q_{\theta}]$

The last entropy term for $q_{\theta}$ can be computed directly by

(18) $\mathcal{L}_{e}(\theta)=\mathbb{E}_{\hat{\mathbf{A}}\sim q(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})}[\log q(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})]\approx\frac{1}{NP}\sum_{u=1}^{N}\sum_{p=1}^{P}\left[\alpha_{up}\log\alpha_{up}+(1-\alpha_{up})\log(1-\alpha_{up})\right],$

where we again adopt the node-pivot similarity matrix as a proxy for the estimated latent graph.
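The entropy term in (18) reduces to a per-entry Bernoulli entropy over the node-pivot probabilities; the short sketch below assumes a small clamp to guard against $\log 0$.

```python
# Sketch of the entropy term in Eq. (18), computed directly on the node-pivot
# Bernoulli parameters.
import torch

def entropy_loss(gamma, eps=1e-8):
    """gamma: [N, P] edge probabilities; returns the averaged negative entropy."""
    g = gamma.clamp(eps, 1 - eps)
    return (g * g.log() + (1 - g) * (1 - g).log()).mean()

loss_e = entropy_loss(torch.rand(1000, 100))
```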

Table 1. Test accuracy (%) on target graphs for in-domain generalizations. For each social network (resp. citation network) as target dataset, we consider the other social networks (resp. citation networks) as source graphs. GraphGLOW* is an oracle model that shares the same architecture as our model GraphGLOW and is directly trained on target graphs.
Type Method Cornell5 Johns.55 Amherst41 Reed98 Cora CiteSeer PubMed
Pure GNN GCN 68.6 ± 0.5 70.8 ± 1.0 65.8 ± 1.6 60.8 ± 1.6 81.6 ± 0.4 71.6 ± 0.3 78.8 ± 0.6
SAGE 68.7 ± 0.8 67.5 ± 0.9 66.3 ± 1.8 63.9 ± 1.9 81.4 ± 0.6 71.6 ± 0.5 78.6 ± 0.7
GAT 69.6 ± 1.2 69.4 ± 0.7 68.7 ± 2.1 64.5 ± 2.5 83.0 ± 0.7 72.1 ± 1.1 79.0 ± 0.4
GPR 68.8 ± 0.7 69.6 ± 1.3 66.2 ± 1.5 62.7 ± 2.0 83.1 ± 0.7 72.4 ± 0.8 79.6 ± 0.5
APPNP 68.5 ± 0.8 69.1 ± 1.4 65.9 ± 1.3 62.3 ± 1.5 82.7 ± 0.5 71.9 ± 0.5 79.2 ± 0.3
H2GCN 71.4 ± 0.5 68.3 ± 1.0 66.5 ± 2.2 65.4 ± 1.3 82.5 ± 0.8 71.4 ± 0.7 79.4 ± 0.4
CPGNN 71.1 ± 0.5 68.7 ± 1.3 66.7 ± 0.8 63.6 ± 1.8 80.8 ± 0.4 71.6 ± 0.4 78.5 ± 0.7
Graph Structure Learning $\mathrm{GraphGLOW_{dp}}$ 71.5 ± 0.7 71.3 ± 1.2 68.5 ± 1.6 63.2 ± 1.2 83.1 ± 0.8 71.7 ± 1.0 77.3 ± 0.8
$\mathrm{GraphGLOW_{knn}}$ 69.4 ± 0.8 71.0 ± 1.3 64.8 ± 1.2 63.6 ± 1.6 81.7 ± 0.8 71.5 ± 0.8 79.4 ± 0.6
$\mathrm{GraphGLOW_{cos}}$ 69.9 ± 0.7 70.8 ± 1.4 65.2 ± 1.8 62.7 ± 1.3 82.0 ± 0.7 71.9 ± 0.9 78.7 ± 0.8
$\mathrm{GraphGLOW_{at}}$ 69.3 ± 0.8 70.9 ± 1.3 65.0 ± 1.3 65.0 ± 1.7 81.9 ± 0.9 71.3 ± 0.7 78.8 ± 0.6
GraphGLOW 71.8 ± 0.9 71.5 ± 0.8 70.6 ± 1.4 66.8 ± 1.1 83.5 ± 0.6 73.6 ± 0.6 79.8 ± 0.8
GraphGLOW* 71.1 ± 0.3 72.2 ± 0.5 70.3 ± 0.9 66.8 ± 1.4 83.5 ± 0.6 73.9 ± 0.7 79.9 ± 0.5

3.3.4. Iterative Structure Learning for Acceleration

A straightforward approach is to perform one round of structure inference and one round of GNN message passing for prediction in each feed-forward computation. To let structure learning and GNN learning mutually reinforce each other (Chen et al., 2020), we instead consider multiple iterative updates of graph structures and node representations before each back-propagation. More specifically, in each epoch, we repeatedly update the node representations $\mathbf{Z}^{t}$ (where the superscript $t$ denotes the $t$-th iteration) and the latent graph $\hat{\mathbf{A}}^{t}$ until a given maximum budget is reached. To accelerate training, we aggregate the losses $\mathcal{L}^{t}$ of all iteration steps for parameter updating. As different graphs have different feature spaces, we utilize the first layer of the GNN as an encoder at the very beginning and then feed the encoded representations to the structure learner. The training algorithm for the structure learner $g_{\theta}$ on source graphs is described in Alg. 2 (in the appendix), where we train the structure learner for multiple episodes and, in each episode, train $g_{\theta}$ on each source graph for several epochs. At test time, the well-trained $g_{\theta}$ is fixed and we train a GNN $h_{w}$ on the target graph with latent structures inferred by $g_{\theta}$, as described in Alg. 3.
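A high-level sketch of this iterative scheme follows, with toy stand-ins for the encoder, structure learner and classifier (all sizes and the number of iterations T are assumptions); it only illustrates how per-iteration losses are accumulated before a single back-propagation, not the full Alg. 2.

```python
# Sketch of Sec. 3.3.4: within one epoch, node representations and latent
# structures are refined alternately for T iterations, per-iteration losses are
# summed, and one back-propagation updates the learner and the GNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

N, P, d, C, T = 300, 30, 32, 3, 3
x = torch.randn(N, d); y = torch.randint(0, C, (N,))
pivots = torch.randperm(N)[:P]

encoder = nn.Linear(d, d)                      # first GNN layer used as encoder
learner = nn.Linear(d, d)                      # toy stand-in for g_theta
classifier = nn.Linear(d, C)                   # toy stand-in for the rest of h_w
opt = torch.optim.Adam([*encoder.parameters(), *learner.parameters(),
                        *classifier.parameters()], lr=1e-2)

rn = lambda m: m / m.sum(-1, keepdim=True).clamp(min=1e-8)

for epoch in range(20):
    opt.zero_grad()
    z, total_loss = torch.relu(encoder(x)), 0.0
    for t in range(T):                                   # iterative refinement
        q = learner(z)
        gamma = torch.sigmoid(q @ q[pivots].t())         # latent structure from current z
        z = torch.relu(rn(gamma) @ (rn(gamma.t()) @ z))  # message passing on latent graph
        total_loss = total_loss + F.cross_entropy(classifier(z), y)
    total_loss.backward()                                # one back-propagation per epoch
    opt.step()
```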

4. Related Works

Graph Neural Networks

Graph neural networks (GNNs) (Kipf and Welling, 2016; Veličković et al., 2018; Hamilton et al., 2017; Xu et al., 2018; Abu-El-Haija et al., 2019; Geng et al., 2023) have achieved impressive performance in modeling graph-structured data. Nonetheless, there is increasing evidence of GNNs' deficiency on graph structures that are inconsistent with the principle of message passing. One typical situation lies in non-homophilous graphs (Zhu et al., 2020), where adjacent nodes tend to have dissimilar features/labels. Recent studies devise adaptive feature propagation/aggregation to tackle such heterophily (Li et al., 2021; Zhu et al., 2021b, 2020; Chien et al., 2021). Another situation stems from graphs with noisy or spurious links, for which several works propose to purify the observed structures for more robust node representations (Zheng et al., 2020; Gao et al., 2021). Our work is related to these works in searching for adaptive graph structures that are suitable for GNNs' message passing. Yet, the key difference is that our method targets learning a new graph beyond the scope of the input one, while the above works focus on message passing within the input graph.

Graph Structure Learning

To effectively address the limitations of GNNs' feature propagation within observed structures, many recent works attempt to jointly learn graph structures and the GNN model. For instance, (Franceschi et al., 2019) models each edge as a Bernoulli random variable and optimizes graph structures along with the GCN. To exploit the information in the observed structure for structure learning, (Li et al., 2018) proposes a metric learning approach based on the RBF kernel to compute edge probabilities from node representations, while (Jiang et al., 2019) adopts an attention mechanism to achieve a similar goal. Furthermore, (Chen et al., 2020) considers an iterative method that enables mutual reinforcement between learning graph structures and node embeddings. Also, (Zhang et al., 2019) presents a probabilistic framework that views the input graph as a random sample from a collection modeled by a parametric random graph model. (Lao et al., 2022; Elinas et al., 2020) harness variational inference to estimate a posterior over graph structures and GNN parameters. While learning graph structures often requires $O(N^{2})$ complexity, a recent work (Wu et al., 2022b) proposes an efficient Transformer that achieves latent structure learning in each layer with $O(N)$ complexity. However, though these methods have shown promising results, they assume that training and testing nodes are from the same graph and consider only one graph. By contrast, we consider graph structure learning under the cross-graph setting and propose a general framework to learn a shared structure learner which can generalize to target graphs without any re-training.

Out-of-Distribution Generalization on Graphs.

Due to the demand for handling testing data in the wild, improving the capability of neural networks to perform satisfactorily on out-of-distribution data has received increasing attention (Wu et al., 2022a; Ma et al., 2021; Li et al., 2022; Wu et al., 2023a, 2021b; Yang et al., 2023a). Recent studies, e.g., (Zhu et al., 2021a; Wu et al., 2022a; Sui et al., 2022), explore effective treatments for tackling general distribution shifts on graphs, and there are also works focusing on particular categories of distribution shifts, such as size generalization (Bevilacqua et al., 2021), molecular scaffold generalization (Yang et al., 2022b), feature/attribute shifts (Wu et al., 2021a; Bi et al., 2023), and topological shifts (Yang et al., 2022a). To the best of our knowledge, there are no prior works considering OOD generalization in the context of graph structure learning. In our case, the target graph, where the structure learner is expected to yield adaptive structures, can have a disparate distribution from the source graphs. The distribution shifts could potentially stem from the feature/label space, graph sizes or domains (e.g., from social networks to citation networks). As the first attempt along this path, our work fills this research gap and enables graph structure learning models to deal with new unseen graphs in an open world.

5. Experiments

We apply GraphGLOW to real-world datasets for node classification to test the efficacy of the proposed structure learner in boosting the performance of GNN learning on target graphs with distribution shifts from source graphs. We specify the backbone GNN network of GraphGLOW as a two-layer GCN (Kipf and Welling, 2016). We focus on the following research questions:

• 1) How does GraphGLOW perform compared with directly training GNN models on the input structure of target graphs?

• 2) How does GraphGLOW perform compared to state-of-the-art structure learning models that are directly trained on target datasets, in terms of both accuracy and training time?

• 3) Are the proposed components of GraphGLOW effective and necessary for the achieved performance?

• 4) What is the impact of hyper-parameters on performance, and what is the impact of attacks on observed edges?

• 5) What are the properties of the inferred latent graphs, and what generalizable patterns does the structure learner capture?

5.1. Experimental Protocols

Datasets. Our experiments are conducted on several public graph datasets. First we consider three commonly used citation networks Cora, CiteSeer and PubMed. We use the same splits as in (Yang et al., 2016). These three datasets have high homophily ratios (i.e., adjacent nodes tend to have similar labels) (Lim et al., 2021). Apart from this, we also consider four social networks from Facebook-100 (Traud et al., 2012), which have low homophily ratios. Readers may refer to Appendix B for more dataset information like splitting ratios.

Competitors. We mainly compare with GCN (Kipf and Welling, 2016), the GNN counterpart trained on the input structure, to test the efficacy of the latent graphs produced by GraphGLOW. As further investigation, we also compare with other advanced GNN models: GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), APPNP (Klicpera et al., 2019), $\mathrm{H_{2}GCN}$ (Zhu et al., 2020) and GPRGNN (Chien et al., 2021). Here APPNP, $\mathrm{H_{2}GCN}$ and GPRGNN are all strong GNN models equipped with adaptive feature propagation and high-order aggregation. For these pure GNN models, training and testing are conducted on (the same) target graphs. Furthermore, we compare GraphGLOW with state-of-the-art graph structure learning models: LDS (Franceschi et al., 2019), IDGL (Chen et al., 2020) and VGCN (Elinas et al., 2020). Since these models are all designed for training on one dataset from scratch, we directly train them on the target graph, so they in principle could yield better performance than GraphGLOW.

We also consider variants of GraphGLOW as baselines. We replace the similarity function $s$ with an attention-based structure learner, denoted as $\mathrm{GraphGLOW_{at}}$, which follows the same training scheme as GraphGLOW. Besides, we consider some non-parametric similarity functions, namely dot-product, KNN and cosine distance (denoted as $\mathrm{GraphGLOW_{dp}}$, $\mathrm{GraphGLOW_{knn}}$ and $\mathrm{GraphGLOW_{cos}}$, respectively). For these models, we only need to train the GNN network on target graphs, with the non-parametric structure learners yielding the latent structures. In addition, we introduce a variant GraphGLOW* that shares the same architecture as GraphGLOW and is directly trained on target graphs. GraphGLOW* in principle could also produce superior results to GraphGLOW. We report the test accuracy given by the model that produces the highest validation accuracy within 500 training epochs.

5.2. In-domain Generalization

Table 2. Test accuracy (%) on target graphs for cross-domain generalizations. For each social network (resp. citation network) as target dataset, we consider citation networks (resp. social networks) as source graphs.
Type Method Cornell5 Johns.55 Amherst41 Reed98 Cora CiteSeer PubMed
Pure GNN GCN 68.6 ± 0.5 70.8 ± 1.0 65.8 ± 1.6 60.8 ± 1.6 81.6 ± 0.4 71.6 ± 0.3 78.8 ± 0.6
SAGE 68.7 ± 0.8 67.5 ± 0.9 66.3 ± 1.8 63.9 ± 1.9 81.4 ± 0.6 71.6 ± 0.5 78.6 ± 0.7
GAT 69.6 ± 1.2 69.4 ± 0.7 68.7 ± 2.1 64.5 ± 2.5 83.0 ± 0.7 72.1 ± 1.1 79.0 ± 0.4
GPR 68.8 ± 0.7 69.6 ± 1.3 66.2 ± 1.5 62.7 ± 2.0 83.1 ± 0.7 72.4 ± 0.8 79.6 ± 0.5
APPNP 68.5 ± 0.8 69.1 ± 1.4 65.9 ± 1.3 62.3 ± 1.5 82.7 ± 0.5 71.9 ± 0.5 79.2 ± 0.3
H2GCN 71.4 ± 0.5 68.3 ± 1.0 66.5 ± 2.2 65.4 ± 1.3 82.5 ± 0.8 71.4 ± 0.7 79.4 ± 0.4
CPGNN 71.1 ± 0.5 68.7 ± 1.3 66.7 ± 0.8 63.6 ± 1.8 80.8 ± 0.4 71.6 ± 0.4 78.5 ± 0.7
Graph Structure Learning $\mathrm{GraphGLOW_{dp}}$ 71.5 ± 0.7 71.3 ± 1.2 68.5 ± 1.6 63.2 ± 1.2 83.1 ± 0.8 71.7 ± 1.0 77.3 ± 0.8
$\mathrm{GraphGLOW_{knn}}$ 69.4 ± 0.8 71.0 ± 1.3 64.8 ± 1.2 63.6 ± 1.6 81.7 ± 0.8 71.5 ± 0.8 79.4 ± 0.6
$\mathrm{GraphGLOW_{cos}}$ 69.9 ± 0.7 70.8 ± 1.4 65.2 ± 1.8 62.7 ± 1.3 82.0 ± 0.7 71.9 ± 0.9 78.7 ± 0.8
$\mathrm{GraphGLOW_{at}}$ 69.9 ± 1.0 70.4 ± 1.5 64.4 ± 1.2 65.0 ± 1.7 82.5 ± 0.9 71.8 ± 0.8 78.5 ± 0.7
GraphGLOW 72.0 ± 1.0 71.8 ± 0.7 69.8 ± 1.3 67.3 ± 1.2 83.2 ± 0.4 73.8 ± 0.9 79.6 ± 0.7
GraphGLOW* 71.1 ± 0.3 72.2 ± 0.5 70.3 ± 0.9 66.8 ± 1.4 83.5 ± 0.6 73.9 ± 0.7 79.9 ± 0.5
Refer to caption
(a) Cora
Refer to caption
(b) CiteSeer
Refer to caption
(c) Amherst41
Refer to caption
(d) Johns Hopkins55
Refer to caption
(e) Reed98
Figure 4. Comparison of test accuracy and training time with SOTA structure learning models (LDS (Franceschi et al., 2019), IDGL (Chen et al., 2020) and VGCN (Elinas et al., 2020)). The radius of each circle is proportional to the standard deviation. The experiments are run on one Tesla V4 GPU with 16GB memory. We adopt the same setting as Table 2 and report the results on target datasets. For Cornell5 and PubMed, the competitor models run out of memory.

We first consider transferring within social networks or within citation networks. The results are reported in Table 1, where for each social network (resp. citation network) as the target, we use the other social networks (resp. citation networks) as the source datasets. GraphGLOW performs consistently better than GCN, i.e., the counterpart using the observed graph for message passing, which shows that GraphGLOW can capture generalizable patterns of desirable message-passing structure for unseen datasets that indeed boost the GCN backbone's performance on downstream tasks. In particular, the improvement over GCN is over 5% on Cornell5 and Reed98, two datasets with low homophily ratios (as shown in Table 3). The reason is that for non-homophilous graphs, where message passing may propagate inconsistent signals (as mentioned in Section 1), GNN learning can benefit more from structure learning than on homophilous graphs. Furthermore, compared to other strong GNN models, GraphGLOW still achieves a slight improvement over the best competitors even though the backbone GCN network is less expressive. One could expect further performance gains from GraphGLOW if the GNN backbone were specified as another, more advanced architecture.

In contrast with the non-parametric structure learning models and $\mathrm{GraphGLOW_{at}}$, GraphGLOW outperforms them by a large margin throughout all cases, which verifies the superiority of our design of the multi-head weighted similarity function that can accommodate multi-faceted, diverse structural information. Compared with GraphGLOW*, GraphGLOW performs on par with it and even exceeds it on Cornell5 and Amherst41. The possible reasons are two-fold. First, there exist sufficient shared patterns among citation networks (resp. social networks), which paves the way for successful generalization of GraphGLOW. Second, GraphGLOW* could sometimes overfit specific datasets, since the number of free parameters is regularly orders-of-magnitude larger than the number of labeled nodes in the dataset. The results also imply that our transfer learning approach can help to mitigate over-fitting on one dataset. Moreover, GraphGLOW can generalize the structure learner to unseen graphs that are nearly three times larger than the training graphs, i.e., Cornell5.

5.3. Cross-domain Generalization

We next consider a more difficult task: transferring between social networks and citation networks. The difficulty stems from two aspects: 1) social networks and citation networks are from distinct categories and thus have larger gaps between the underlying data-generating distributions; 2) they have varied homophily ratios, which indicates that the observed edges play different roles in the original graphs. We report the results in Table 2. Despite the task difficulty, GraphGLOW manages to achieve superior results to GCN and also outperforms the other non-parametric graph structure learning methods throughout all cases. This suggests GraphGLOW's ability to handle target graphs with distinct properties.

In Fig. 4 we further compare GraphGLOW with three state-of-the-art graph structure learning models that are directly trained on target graphs. Here we follow the setting in Table 2. The results show that even when trained on source graphs different from the target one, GraphGLOW still performs on par with the competitors that are trained and tested on (the same) target graphs. Notably, GraphGLOW significantly reduces training time. For instance, on Johns Hopkins55, GraphGLOW is 6x, 9x and 40x faster than IDGL, LDS and VGCN, respectively. This shows a clear advantage of GraphGLOW in terms of training efficiency and also verifies that our model indeed helps to reduce the significant training-time cost of structure learning on target graphs.

5.4. Ablation Studies

We conduct ablation studies to test the effectiveness of iterative learning scheme and regularization on graphs.

Effect of Iterative Learning. We replace the iterative learning process with a one-step prediction (i.e., one structure estimation and one node-representation update per feed-forward computation) and compare its test accuracy with GraphGLOW. The results are shown in Fig. 5, where we follow the setting of Table 1. The non-iterative version exhibits a considerable drop in accuracy (as large as 5.4% and 8.8% when tested on target graphs Cornell5 and Amherst41, respectively). Therefore, the iterative updates indeed help to learn better graph structures and node embeddings, contributing to higher accuracy for downstream prediction.

Effect of Regularization on Structures. We remove the regularization on structures (i.e., setting $\alpha=\rho=0$) and compare with GraphGLOW. As shown in Fig. 5, there is more or less performance degradation across datasets. In fact, the regularization loss derived from the prior distribution over latent structures can provide useful guidance for structure learning, especially when labeled information is limited.

Refer to caption
Refer to caption
Figure 5. (a) Ablation study for GraphGLOW. (b) Performance comparison of GraphGLOW and GCN w.r.t. randomly removing certain ratios of edges in Johns Hopkins55.
Refer to caption
Refer to caption
Figure 6. (a) The curves of homophily ratios for latent structures during the learning process. (b) The variance of neighborhood distribution of nodes with the same label in original graphs and learnt structure.

5.5. Hyper-parameter Sensitivity

In Fig. 7 (in the appendix), we study the variation of the model's performance w.r.t. $\lambda$ (the weight on input graphs) and $P$ (the number of pivots) on target datasets Cora and CiteSeer. Overall, the model is not sensitive to $\lambda$. For Cora, larger $\lambda$ contributes to higher accuracy, while for CiteSeer, smaller $\lambda$ yields better performance. The possible reason is that the initial graph of Cora is more suitable for message passing (due to its higher homophily ratio). As for the impact of the pivot number, as shown in Fig. 7(b), a moderate value of $P$ provides decent downstream performance.

5.6. Robustness Analysis

In addition, we find that GraphGLOW is more immune to edge deletion attack than GCN. We randomly remove 10-50% edges of target graphs respectively, and then apply GraphGLOW and GCN. We present the results in Johns Hopkins55 in Fig. 5 and leave more results in Appendix D. When the drop ratio increases, the performance gap between two models becomes more significant. This is due to our structure learner’s ability for learning new graph structures from node embeddings, making it less reliant on initial graph structures and more robust to attack on input edges.

5.7. Case Study

We further probe into why our approach is effective for node classification by dissecting the learnt graph structures. Specifically, we measure the homophily ratios of the learnt structures and the variance of neighborhood distributions of nodes with the same label. As nodes receive messages from neighbors during message passing, the more similar the neighborhood patterns of nodes within one class are, the easier it is for GNNs to correctly classify them (Ma et al., 2022). We use the homophily metric proposed in (Lim et al., 2021) to measure homophily ratios. For the variance of neighborhood distributions, we first calculate the variance for each class, and then take a weighted sum to obtain the final variance, where the weight is proportional to the number of nodes within the corresponding class.
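For illustration, the sketch below computes a simple edge-homophily ratio and the weighted intra-class variance of neighborhood label distributions on a dense toy graph; it is our own simplified version of these diagnostics, not the exact metric of (Lim et al., 2021).

```python
# Sketch of the two diagnostics: edge homophily and the weighted variance of
# class-wise neighborhood label distributions.
import torch

def edge_homophily(adj, labels):
    """Fraction of edges whose endpoints share a label."""
    src, dst = adj.nonzero(as_tuple=True)
    return (labels[src] == labels[dst]).float().mean().item()

def neighborhood_variance(adj, labels, num_classes):
    """Weighted variance of per-node neighbor-label distributions within classes."""
    onehot = torch.nn.functional.one_hot(labels, num_classes).float()
    neigh = adj.float() @ onehot
    neigh = neigh / neigh.sum(-1, keepdim=True).clamp(min=1e-8)  # per-node distribution
    var, n = 0.0, labels.shape[0]
    for c in range(num_classes):
        mask = labels == c
        if mask.sum() > 1:
            var += mask.sum().item() / n * neigh[mask].var(dim=0).sum().item()
    return var

adj = torch.bernoulli(torch.full((200, 200), 0.05))
labels = torch.randint(0, 4, (200,))
print(edge_homophily(adj, labels), neighborhood_variance(adj, labels, 4))
```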

Homophily Ratio. We choose Amherst41, Johns Hopkins55 and Reed98 as target graphs, and record the homophily ratios of the inferred latent structures every five epochs during training. As shown in Fig. 6, the homophily ratios of the inferred latent graphs exhibit a clear increase as training proceeds, and the final ratio is considerably larger than that of the input graph. The results indicate that the trained structure learner inclines to output more homophilous latent structures, which are reckoned to be more suitable for message passing.

Neighborhood Distribution Variance. As shown in Fig. 6, the variance of the neighborhood distributions of nodes with the same label is significantly smaller in our learnt structures, making it easier to classify nodes through message passing. The results also imply that a high homophily ratio and similar intra-class neighborhood patterns could be two of the underlying transferable patterns of optimal message-passing structure identified by GraphGLOW.

6. Conclusion

This paper proposes Graph Structure Learning Under Cross-Graph Distribution Shift, a new problem that requires the structure learner to transfer to new target graphs without re-training and to handle distribution shifts. We develop a transfer learning framework that guides the structure learner to discover shared knowledge across source datasets with respect to optimal message-passing structures for boosting downstream performance. We also carefully design the model components and the training approach in terms of expressiveness, scalability and stability. We devise experiments of various difficulties and demonstrate the efficacy and robustness of our approach. Although our framework is fairly general, we believe there are other potential methods that can lead to equally competitive results, which we leave as future work.

Acknowledgements.
The work was supported in part by National Key Research and Development Program of China (2020AAA0107600), NSFC (62222607), Science and Technology Commission of Shanghai Municipality (22511105100), and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

References

  • Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing. In ICML.
  • Belkin and Niyogi (2001) Mikhail Belkin and Partha Niyogi. 2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In NeurIPS.
  • Bevilacqua et al. (2021) Beatrice Bevilacqua, Yangze Zhou, and Bruno Ribeiro. 2021. Size-invariant graph representations for graph classification extrapolations. In ICML.
  • Bi et al. (2023) Wendong Bi, Bingbing Xu, Xiaoqian Sun, Li Xu, Huawei Shen, and Xueqi Cheng. 2023. Predicting the Silent Majority on Graphs: Knowledge Transferable Graph Neural Network. In The Web Conference.
  • Bronstein et al. (2016) Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2016. Geometric deep learning: going beyond Euclidean data. arXiv preprint abs/1611.08097 (2016).
  • Chen et al. (2020) Yu Chen, Lingfei Wu, and Mohammed Zaki. 2020. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. In NeurIPS.
  • Chien et al. (2021) Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. 2021. Adaptive universal generalized pagerank graph neural network. In ICLR.
  • Elinas et al. (2020) Pantelis Elinas, Edwin V Bonilla, and Louis Tiao. 2020. Variational inference for graph convolutional networks in the absence of graph data and adversarial settings. In NeurIPS.
  • Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. 2018. Probabilistic Model-Agnostic Meta-Learning. In NeurIPS.
  • Franceschi et al. (2019) Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. 2019. Learning discrete structures for graph neural networks. In ICML.
  • Gao et al. (2021) Zhan Gao, Subhrajit Bhattacharya, Leiming Zhang, Rick S. Blum, Alejandro Ribeiro, and Brian M. Sadler. 2021. Training Robust Graph Neural Networks with Topology Adaptive Edge Dropping. arXiv preprint arXiv:2106.02892 (2021).
  • Geng et al. (2023) Haoyu Geng, Chao Chen, Yixuan He, Gang Zeng, Zhaobing Han, Hua Chai, and Junchi Yan. 2023. Pyramid Graph Neural Network: a Graph Sampling and Filtering Approach for Multi-scale Disentangled Representations. In SIGKDD.
  • Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In ICLR.
  • Jiang et al. (2019) Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, and Bin Luo. 2019. Semi-supervised learning with graph learning-convolutional networks. In CVPR.
  • Jiang and Luo (2022) Weiwei Jiang and Jiayun Luo. 2022. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications (2022), 117921.
  • Kalofolias (2016) Vassilis Kalofolias. 2016. How to Learn a Graph from Smooth Signals. In AISTATS.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. In ICLR.
  • Klicpera et al. (2019) Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. 2019. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In ICLR.
  • Lao et al. (2022) Danning Lao, Xinyu Yang, Qitian Wu, and Junchi Yan. 2022. Variational Inference for Training Graph Neural Networks in Low-Data Regime through Joint Structure-Label Estimation. In SIGKDD.
  • Li et al. (2018) Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2018. Adaptive graph convolutional neural networks. In AAAI.
  • Li et al. (2021) Sean Li, Dongwoo Kim, and Qing Wang. 2021. Beyond Low-Pass Filters: Adaptive Feature Propagation on Graphs. arXiv preprint arXiv:2103.14187 (2021).
  • Li et al. (2022) Zenan Li, Qitian Wu, Fan Nie, and Junchi Yan. 2022. GraphDE: A Generative Framework for Debiased Learning and Out-of-Distribution Detection on Graphs. In NeurIPS.
  • Lim et al. (2021) Derek Lim, Xiuyu Li, Felix Hohne, and Ser-Nam Lim. 2021. New Benchmarks for Learning on Non-Homophilous Graphs. arXiv preprint arXiv:2104.01404 (2021).
  • Ma et al. (2021) Jiaqi Ma, Junwei Deng, and Qiaozhu Mei. 2021. Subgroup generalization and fairness of graph neural networks. In NeurIPS.
  • Ma et al. (2022) Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. 2022. Is Homophily a Necessity for Graph Neural Networks?. In ICLR.
  • Muandet et al. (2013) Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. 2013. Domain Generalization via Invariant Feature Representation. In ICML.
  • Nguyen and Bai (2010) Hieu V Nguyen and Li Bai. 2010. Cosine similarity metric learning for face verification. In ACCV.
  • Sanchez-Gonzalez et al. (2020) Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. 2020. Learning to simulate complex physics with graph networks. In ICML.
  • Sui et al. (2022) Yongduo Sui, Xiang Wang, Jiancan Wu, Min Lin, Xiangnan He, and Tat-Seng Chua. 2022. Causal attention for interpretable and generalizable graph classification. In SIGKDD.
  • Sutton et al. (1999) Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NeurIPS.
  • Traud et al. (2012) Amanda L Traud, Peter J Mucha, and Mason A Porter. 2012. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391, 16 (2012), 4165–4180.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.
  • Williams (1992) Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8 (1992), 229–256.
  • Wu et al. (2023a) Qitian Wu, Yiting Chen, Chenxiao Yang, and Junchi Yan. 2023a. Energy-based Out-of-Distribution Detection for Graph Neural Networks. In ICLR.
  • Wu et al. (2021a) Qitian Wu, Chenxiao Yang, and Junchi Yan. 2021a. Towards open-world feature extrapolation: An inductive graph learning approach. In NeurIPS.
  • Wu et al. (2023b) Qitian Wu, Chenxiao Yang, Wentao Zhao, Yixuan He, David Wipf, and Junchi Yan. 2023b. DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion. In ICLR.
  • Wu et al. (2021b) Qitian Wu, Hengrui Zhang, Xiaofeng Gao, Junchi Yan, and Hongyuan Zha. 2021b. Towards open-world recommendation: An inductive model-based collaborative filtering approach. In ICML.
  • Wu et al. (2022a) Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. 2022a. Handling distribution shifts on graphs: An invariance perspective. In ICLR.
  • Wu et al. (2022b) Qitian Wu, Wentao Zhao, Zenan Li, David Wipf, and Junchi Yan. 2022b. NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification. In NeurIPS.
  • Wu et al. (2020) Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. 2020. Graph Information Bottleneck. In NeurIPS.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML.
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. In ICML.
  • Yang et al. (2023a) Chenxiao Yang, Qitian Wu, Jiahua Wang, and Junchi Yan. 2023a. Graph Neural Networks are Inherently Good Generalizers: Insights by Bridging GNNs and MLPs. In ICLR.
  • Yang et al. (2022a) Chenxiao Yang, Qitian Wu, and Junchi Yan. 2022a. Geometric Knowledge Distillation: Topology Compression for Graph Neural Networks. In NeurIPS.
  • Yang et al. (2022b) Nianzu Yang, Kaipeng Zeng, Qitian Wu, Xiaosong Jia, and Junchi Yan. 2022b. Learning substructure invariance for out-of-distribution molecular representations. In NeurIPS.
  • Yang et al. (2023b) Nianzu Yang, Kaipeng Zeng, Qitian Wu, and Junchi Yan. 2023b. MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning. In The Web Conference.
  • Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In ICML.
  • Zhang et al. (2019) Yingxue Zhang, Soumyasundar Pal, Mark Coates, and Deniz Ustebay. 2019. Bayesian graph convolutional neural networks for semi-supervised classification. In AAAI.
  • Zheng et al. (2020) Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, and Wei Wang. 2020. Robust graph representation learning via neural sparsification. In ICML.
  • Zhu et al. (2021b) Jiong Zhu, Ryan A Rossi, Anup Rao, Tung Mai, Nedim Lipka, Nesreen K Ahmed, and Danai Koutra. 2021b. Graph neural networks with heterophily. In AAAI.
  • Zhu et al. (2020) Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020. Beyond homophily in graph neural networks: Current limitations and effective designs. In NeurIPS.
  • Zhu et al. (2021a) Qi Zhu, Natalia Ponomareva, Jiawei Han, and Bryan Perozzi. 2021a. Shift-robust gnns: Overcoming the limitations of localized graph training data. In NeurIPS.
  • Zhu et al. (2021c) Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. 2021c. Deep Graph Structure Learning for Robust Representations: A Survey. arXiv preprint abs/2103.03036 (2021).
Input: node features $\mathbf{X}$, input adjacency $\mathbf{A}$, latent-graph node-pivot similarity matrix $\mathbf{\Gamma}=\{\alpha_{up}\}_{N\times P}$.
$\mathbf{Z}^{(0)}\leftarrow\mathbf{X}$;
for $l=0,1,\cdots,L-1$ do
       $\mathbf{Z}^{(l+1)}\leftarrow\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\mathbf{Z}^{(l)}\mathbf{W}^{(l)}$;
       $\mathbf{C}^{(l+\frac{1}{2})}=\mathrm{RowNorm}(\mathbf{\Gamma}^{\top})\mathbf{Z}^{(l)}$;
       $\mathbf{C}^{(l+1)}=\mathrm{RowNorm}(\mathbf{\Gamma})\mathbf{C}^{(l+\frac{1}{2})}$;
       $\mathbf{Z}^{(l+1)}\leftarrow\sigma\left(\lambda\mathbf{Z}^{(l+1)}+(1-\lambda)\mathbf{C}^{(l+1)}\mathbf{W}^{(l)}\right)$;
end for
Output: node representations $\mathbf{Z}=\mathbf{Z}^{(L-1)}$, node-level prediction $\hat{\mathbf{Y}}=\mathbf{Z}^{(L)}$.
Algorithm 1 Message Passing over Latent Graphs $[\mathbf{Z},\hat{\mathbf{Y}}]=\mathrm{GNN}(\mathbf{A},\mathbf{X},\mathbf{\Gamma};w)$
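For readers who prefer code, a dense-matrix PyTorch sketch of the message passing in Algorithm 1 is given below; the variable names, the ReLU nonlinearity, and the precomputed dense normalized adjacency A_norm (i.e., $\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}$) are our own illustrative choices, and a practical implementation may use sparse operations.

import torch
import torch.nn.functional as F

def row_norm(M, eps=1e-8):
    return M / (M.sum(dim=1, keepdim=True) + eps)

def message_passing_latent(A_norm, X, Gamma, weights, lam=0.5):
    # A_norm: (N, N) normalized input adjacency; X: (N, d0) features;
    # Gamma: (N, P) node-pivot similarities; weights: list of L layer weight matrices.
    Z = X
    for W in weights:
        Z_in = A_norm @ Z @ W                          # propagation over the input graph
        C_half = row_norm(Gamma.t()) @ Z               # node -> pivot aggregation
        C = row_norm(Gamma) @ C_half                   # pivot -> node aggregation (latent graph)
        Z = F.relu(lam * Z_in + (1.0 - lam) * C @ W)   # fuse the two views
    return Z  # last-layer output serves as the prediction (Y_hat = Z^(L) in Algorithm 1)

# Example shapes: X (N, d0), Gamma (N, P),
# weights = [torch.randn(d0, h), torch.randn(h, num_classes)].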
Input: observed source graphs $\{\mathcal{G}_{m}\}_{m=1}^{M}=\{(\mathbf{A}_{m},\mathbf{X}_{m},\mathbf{Y}_{m})\}_{m=1}^{M}$, maximum training episode $E$, maximum iteration $T$, a shared graph structure learner $g_{\theta}$, GNN networks $\{h_{w_{m}}\}_{m=1}^{M}$, one for each source graph.
Initialize $\theta$ and $\{w_{m}\}_{m=1}^{M}$;
for episode $=1,2,\dots,E$ do
       for each source graph $\mathcal{G}_{m}$ do
             for $e=1:T$ do
                   $\Delta_{\theta}\leftarrow 0$, $\Delta_{w}\leftarrow 0$, $t\leftarrow 0$;
                   $\mathbf{Z}^{0}=\mathrm{MLP}(\mathbf{X};w_{m})$ or $\mathbf{Z}^{0}=\mathrm{GCN}(\mathbf{A},\mathbf{X};w_{m})$;
                   while not converged do
                         $t\leftarrow t+1$;
                         Compute $\mathbf{\Gamma}^{t}=\{\alpha_{up}\}$ using (9) with $\mathbf{Z}^{t-1}$;
                         Sample $K$ times over $\mathbf{\Gamma}^{t}$ to obtain $\{\hat{\mathbf{E}}_{k}^{t}\}_{k=1}^{K}$;
                         Compute $\{\pi_{\theta}(\hat{\mathbf{E}}_{k}^{t})\}_{k=1}^{K}$ using (16);
                         $[\mathbf{Z}^{t},\hat{\mathbf{Y}}^{t}]=\mathrm{GNN}(\mathbf{A},\mathbf{X},\mathbf{\Gamma}^{t};w_{m})$;
                         Compute $\nabla_{\theta}\mathcal{L}_{s},\nabla_{\theta}\mathcal{L}_{r},\nabla_{\theta}\mathcal{L}_{e}$ using (15), (17), (18), respectively;
                         $\Delta_{\theta}^{(t)}\leftarrow\nabla_{\theta}\mathcal{L}_{s}+\nabla_{\theta}\mathcal{L}_{r}+\nabla_{\theta}\mathcal{L}_{e}$;
                         $\Delta_{w}^{(t)}\leftarrow\nabla_{w_{m}}\mathcal{L}_{s}$;
                   end while
                   $\Delta_{\theta}\leftarrow\Delta_{\theta}^{(1)}+\sum_{i=2}^{t}\Delta_{\theta}^{(i)}/(t-1)$;
                   $\Delta_{w}\leftarrow\Delta_{w}^{(1)}+\sum_{i=2}^{t}\Delta_{w}^{(i)}/(t-1)$;
                   Use gradient $\Delta_{\theta}$ to update $\theta$;
                   Use gradient $\Delta_{w}$ to update $w_{m}$;
             end for
       end for
end for
Output: trained graph structure learner $g_{\theta^{*}}$.
Algorithm 2 Training Graph Structure Learner
Input: observed target graph $\mathcal{G}=(\mathbf{A},\mathbf{X},\mathbf{Y})$, GNN network $h_{w}$ for the target graph, maximum iteration $T$.
Initialize $w$;
for $e=1:T$ do
       $t\leftarrow 0$, $\Delta_{w}\leftarrow 0$;
       $\mathbf{Z}^{0}=\mathrm{MLP}(\mathbf{X};w)$ or $\mathbf{Z}^{0}=\mathrm{GCN}(\mathbf{A},\mathbf{X};w)$;
       while not converged do
             $t\leftarrow t+1$;
             Compute $\mathbf{\Gamma}^{t}=\{\alpha_{up}\}$ using (9) with $\mathbf{Z}^{t-1}$;
             Sample $K$ times over $\mathbf{\Gamma}^{t}$ to obtain $\{\hat{\mathbf{E}}_{k}^{t}\}_{k=1}^{K}$;
             Compute $\{\pi_{\theta}(\hat{\mathbf{E}}_{k}^{t})\}_{k=1}^{K}$ using (16);
             $[\mathbf{Z},\hat{\mathbf{Y}}]=\mathrm{GNN}(\mathbf{A},\mathbf{X},\mathbf{\Gamma}^{t};w)$;
             Compute $\nabla_{w}\mathcal{L}_{s}$ using (15);
             $\Delta_{w}^{(t)}\leftarrow\nabla_{w}\mathcal{L}_{s}$;
       end while
       $\Delta_{w}\leftarrow\Delta_{w}^{(1)}+\sum_{i=2}^{t}\Delta_{w}^{(i)}/(t-1)$;
       Use gradient $\Delta_{w}$ to update $w$;
end for
Output: trained GNN network $h_{w^{*}}$ for the target graph.
Algorithm 3 Supervised Learning for the Target GNN

Appendix A Derivations for NWGM

First, when taking the gradient, we have $\nabla_{\theta}\mathbb{E}_{q_{\theta}}[\log p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})]\approx c\,\nabla_{\theta}\log\mathbb{E}_{q_{\theta}}[p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})]$ with basic applications of the chain rule. We then adopt the Normalized Weighted Geometric Mean (NWGM) (Xu et al., 2015):

(19)
\begin{split}
&\nabla_{\theta}\log\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[p_{w_{m}}(\mathbf{Y}_{u,c}=1|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})]\\
=&\nabla_{\theta}\log\sum_{\hat{\mathbf{A}}}\frac{\exp(s_{1}(\hat{\mathbf{A}}))}{\exp(s_{1}(\hat{\mathbf{A}}))+\exp(s_{2}(\hat{\mathbf{A}}))}\,q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})\\
=&\nabla_{\theta}\log\sum_{\hat{\mathbf{A}}}\mathrm{Softmax}(s_{1}(\hat{\mathbf{A}}))\,q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})\\
\approx&\nabla_{\theta}\log\mathrm{NWGM}(\mathrm{Softmax}(s_{1}(\hat{\mathbf{A}})))\\
=&\nabla_{\theta}\log\frac{\prod_{\hat{\mathbf{A}}}[\exp(s_{1}(\hat{\mathbf{A}}))]^{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}}{\prod_{\hat{\mathbf{A}}}[\exp(s_{1}(\hat{\mathbf{A}}))]^{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}+\prod_{\hat{\mathbf{A}}}[\exp(s_{2}(\hat{\mathbf{A}}))]^{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}}\\
=&\nabla_{\theta}\log\frac{\exp(\sum_{\hat{\mathbf{A}}}s_{1}(\hat{\mathbf{A}})q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X}))}{\exp(\sum_{\hat{\mathbf{A}}}s_{1}(\hat{\mathbf{A}})q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X}))+\exp(\sum_{\hat{\mathbf{A}}}s_{2}(\hat{\mathbf{A}})q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X}))}\\
=&\nabla_{\theta}\log\frac{\exp(\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[s_{1}(\hat{\mathbf{A}})])}{\exp(\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[s_{1}(\hat{\mathbf{A}})])+\exp(\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[s_{2}(\hat{\mathbf{A}})])}\\
=&\nabla_{\theta}\log p_{w_{m}}(\mathbf{Y}_{u,c}=1|\mathbf{A},\mathbf{X},\hat{\mathbf{A}}=\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[\hat{\mathbf{A}}]),
\end{split}

where $s_{1}$ denotes the positive predicted score for class $c$, which is indeed associated with node $u$, and $s_{2}(\hat{\mathbf{A}})=0$ in our case. We thus conclude the proof for (15).
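As a quick numerical illustration of the approximation used above (replacing the expectation of a softmax with the softmax of the expected score, as NWGM suggests), the following self-contained snippet compares the two quantities under an arbitrary score distribution; it is not part of the proof, and the Gaussian score distribution is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# s1 is the positive score for the correct class under sampled structures A_hat;
# s2 is fixed to 0, so Softmax(s1) reduces to a sigmoid.
s1_samples = rng.normal(loc=1.5, scale=0.5, size=10000)

expectation_of_softmax = sigmoid(s1_samples).mean()   # E_q[Softmax(s1(A_hat))]
softmax_of_expectation = sigmoid(s1_samples.mean())   # Softmax(E_q[s1(A_hat)]), the NWGM surrogate

print(expectation_of_softmax, softmax_of_expectation)  # close when the score variance is small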

Table 3. Statistics of the experimental datasets. The column Homo. reports the homophily ratio, measured by the metric proposed in (Lim et al., 2021).
Dataset | Type | # Nodes | # Edges | Homo.
Cora | citation | 2,708 | 5,429 | 0.77
CiteSeer | citation | 3,327 | 4,732 | 0.63
PubMed | citation | 19,717 | 44,338 | 0.66
Amherst41 | social | 2,235 | 90,964 | 0.06
Cornell5 | social | 18,660 | 790,777 | 0.09
Johns Hopkins55 | social | 5,180 | 186,586 | 0.10
Reed98 | social | 962 | 18,812 | 0.04

Appendix B Datasets and Experimental Details

The statistics of the datasets are displayed in Table 3. For the splits of Cora, CiteSeer and PubMed, we follow (Yang et al., 2016) and randomly select 20 instances per class for training and 500/1000 instances for validation/testing in each dataset. For the remaining datasets, we employ random train/valid/test splits of 50%/25%/25%.
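For concreteness, the two splitting schemes could be implemented roughly as follows; the function names and the fixed seed are our own choices.

import numpy as np

def per_class_split(labels, train_per_class=20, num_val=500, num_test=1000, seed=0):
    # Planetoid-style split (Yang et al., 2016) for Cora / CiteSeer / PubMed.
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        train_idx.extend(rng.choice(members, train_per_class, replace=False))
    rest = rng.permutation(np.setdiff1d(np.arange(len(labels)), train_idx))
    return np.array(train_idx), rest[:num_val], rest[num_val:num_val + num_test]

def random_split(num_nodes, seed=0):
    # 50% / 25% / 25% split for the remaining (social-network) datasets.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    n_train, n_val = int(0.5 * num_nodes), int(0.25 * num_nodes)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]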

The backbone GNN network is specified as a two-layer GCN model. We set the similarity function $s$ in (9) to cosine similarity and $\delta$ to a threshold-based truncation. Besides, since the dimensions of input node features differ across datasets, we adopt a transformation network that converts input features into $d$-dimensional node representations before the structure learning module, as shown in Alg. 2 ($\mathbf{Z}^{0}=\mathrm{MLP}(\mathbf{X};w_{m})$ or $\mathbf{Z}^{0}=\mathrm{GCN}(\mathbf{A},\mathbf{X};w_{m})$). The transformation can be specified as a one-layer MLP or a one-layer GCN network (what we adopt). Most of the experiments were conducted on an NVIDIA GeForce RTX 2080 Ti with 11GB memory. For experiments involving the two larger datasets, PubMed and Cornell5, we utilized an NVIDIA GeForce RTX 3090 with 24GB memory.
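As an illustration of the similarity module described above, a minimal multi-head cosine-similarity sketch with threshold-based truncation is given below; the per-head linear projections and the variable names are assumptions for illustration rather than the exact parameterization of (9), and the default threshold is one of the values in the search space of Appendix C.

import torch
import torch.nn.functional as F

def node_pivot_similarity(Z, pivots, head_weights, threshold=8.5e-5):
    # Z: (N, d) node embeddings; pivots: (P, d) pivot embeddings;
    # head_weights: iterable of H (d, d) projection matrices (assumed parameterization).
    sims = []
    for W in head_weights:
        a = F.normalize(Z @ W, dim=-1)        # cosine similarity = dot product of
        b = F.normalize(pivots @ W, dim=-1)   # L2-normalized projected embeddings
        sims.append(a @ b.t())                # (N, P) per-head similarities
    Gamma = torch.stack(sims).mean(0)         # average over H heads
    # delta: threshold-based truncation of small similarities.
    return torch.where(Gamma > threshold, Gamma, torch.zeros_like(Gamma))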

Appendix C Hyperparameters

We use grid search on the validation set to tune the hyperparameters (a generic sketch of the search loop is given after the lists below). The learning rate is searched in $\{0.001, 0.005, 0.01, 0.05\}$; the dropout rate is searched in $\{0, 0.2, 0.3, 0.5, 0.6\}$; the number of hidden channels is searched in $\{16, 32, 64, 96\}$. Other hyperparameters for specific models are stated below.

For GCN, GraphSAGE and H2GCN, we use 2 layers. For GAT, we search the number of attention heads in $\{2, 4\}$ and use 2 layers. For APPNP and GPR, we search $\alpha$ in $\{0.1, 0.2, 0.5\}$ and set $K$ to 10. We list the search spaces of the structure learning methods below.

• GraphGLOW and its variants: pivot number $P\in$ {800, 1000, 1200, 1400}, embedding size $d\in$ {16, 32, 64, 96}, $\lambda\in$ [0.1, 0.9], $\alpha\in$ {0, 0.1, 0.15, 0.2, 0.25, 0.3}, $\rho\in$ {0, 0.1, 0.15, 0.2, 0.25, 0.3}, threshold $\in$ {4e-5, 8.5e-5}, $H\in$ {4, 6}, $T=10$, $E\in$ {1, 2, 3}.

• LDS: the sampling time $S=16$, the patience window size $\rho\in$ {10, 20}, the hidden size $\in$ {8, 16, 32, 64}, the inner learning rate $\gamma\in$ {1e-4, 1e-3, 1e-2, 1e-1}, and the number of updates used to compute the truncated hypergradient $\tau\in$ {5, 10, 15}.

• IDGL: $\epsilon=0.01$, hidden size $\in$ {16, 64, 96, 128}, $\lambda\in$ {0.5, 0.6, 0.7, 0.8}, $\eta\in$ {0, 0.1, 0.2}, $\alpha\in$ {0, 0.1, 0.2}, $\beta\in$ {0, 0.1}, $\gamma\in$ {0.1, 0.2}, $m\in$ {6, 9, 12}.

• VGCN: $\bar{\rho}_{1}\in$ {0.25, 0.5, 0.75, 0.99}, $\bar{\rho}_{0}=10^{-5}$, $\tau_{0}\in$ {0.1, 0.5}, $\tau\in$ {0.1, 0.5}, $\beta\in$ {$10^{-4}$, $10^{-3}$, $10^{-2}$, 1}. The sampling time is 3. The maximum number of training epochs is 5000.
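The grid search itself is a standard loop over the Cartesian product of the search spaces; a generic sketch is shown below, where train_eval_fn is a hypothetical callback that trains a model under one configuration and returns its validation accuracy.

import itertools

def grid_search(train_eval_fn, search_space):
    # search_space: dict mapping hyperparameter names to lists of candidate values.
    best_cfg, best_acc = None, -1.0
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        acc = train_eval_fn(cfg)          # train with cfg, return validation accuracy
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Example (shared hyperparameters):
# grid_search(train_eval_fn, {"lr": [0.001, 0.005, 0.01, 0.05],
#                             "dropout": [0, 0.2, 0.3, 0.5, 0.6],
#                             "hidden": [16, 32, 64, 96]})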

Figure 7. Hyper-parameter sensitivity analysis on (a) the weight of input graphs $\lambda$ and (b) the pivot number $P$.
Figure 8. Performance comparison of GraphGLOW and GCN w.r.t. randomly removing certain ratios of edges in input graphs: (a) CiteSeer, (b) Johns Hopkins55.

Appendix D More Experimental Results

Table 4. Comparison between a one-layer GCN and an MLP as the transformation network before the structure learning module.
Dataset | GraphGLOW-GCN | GraphGLOW-MLP
Cora | 83.2 ± 0.4 | 82.5 ± 0.5
CiteSeer | 73.8 ± 0.9 | 73.3 ± 0.7
PubMed | 79.6 ± 0.7 | 79.6 ± 0.7
Amherst41 | 68.1 ± 1.3 | 68.3 ± 1.5
Johns Hopkins55 | 71.8 ± 0.7 | 72.1 ± 0.9
Cornell5 | 70.5 ± 1.0 | 69.5 ± 1.0
Reed98 | 67.3 ± 1.2 | 65.9 ± 1.4

We compare using an MLP and a GCN, respectively, as the transformation network before the structure learning module and report the results in Table 4. In summary, the two variants are equally competent, which suggests that GraphGLOW is not sensitive to the transformation network used for converting node features of varying dimensions into embeddings with a shared dimension. This also implies that simple neural architectures, e.g., an MLP or a GCN, provide enough capacity for extracting the information in the input observations, which is leveraged by the shared graph structure learner to discover generalizable patterns in optimal message-passing structures.

We also provide more results of the edge deletion experiments in Fig. 8. We randomly remove 10-50% of the edges of the target graphs, respectively, and then apply GraphGLOW and GCN. The results demonstrate that GraphGLOW is more robust to edge deletion. This is due to our structure learner's ability to learn new structures, which makes it less reliant on the initial graph structure and more robust to attacks on input edges.