

GraphGLOW: Universal and Generalizable Structure Learning for Graph Neural Networks

Wentao Zhao ([email protected]), Qitian Wu ([email protected]), Chenxiao Yang ([email protected]), and Junchi Yan ([email protected]), Shanghai Jiao Tong University, Shanghai, China
(2023)
Abstract.

Graph structure learning is a well-established problem that aims at optimizing graph structures adaptive to specific graph datasets to help message-passing neural networks (i.e., GNNs) yield effective and robust node embeddings. However, the common limitation of existing models lies in the underlying closed-world assumption: the testing graph is the same as the training graph. This premise requires independently training the structure learning model from scratch for each graph dataset, which leads to prohibitive computation costs and potential risk of serious over-fitting. To mitigate these issues, this paper explores a new direction that moves forward to learn a universal structure learning model that can generalize across graph datasets in an open world. We first introduce the mathematical definition of this novel problem setting, and describe the model formulation from a probabilistic data-generative perspective. Then we devise a general framework that coordinates a single graph-shared structure learner and multiple graph-specific GNNs to capture the generalizable patterns of optimal message-passing topology across datasets. The well-trained structure learner can directly produce adaptive structures for unseen target graphs without any fine-tuning. Across diverse datasets and various challenging cross-graph generalization protocols, our experiments show that even without training on target graphs, the proposed model i) significantly outperforms expressive GNNs trained on the input (non-optimized) topology, and ii) surprisingly performs on par with state-of-the-art models that independently optimize adaptive structures for specific target graphs, with notably orders-of-magnitude acceleration in training on the target graph.

Graph Structure Learning, Out-of-Distribution Generalization, Graph Neural Networks, Network Representation Learning
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 6–10, 2023, Long Beach, CA, USA; booktitle: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23), August 6–10, 2023, Long Beach, CA, USA; price: 15.00; doi: 10.1145/3580305.3599373; isbn: 979-8-4007-0103-0/23/08; ccs: Computing methodologies → Machine learning algorithms

1. Introduction

Graph neural networks (GNNs) (Kipf and Welling, 2016; Veličković et al., 2018; Hamilton et al., 2017), as a de facto model class based on the message passing principle, have shown promising efficacy for learning node representations on graph-structured data, with extensive applications to, e.g., physics simulation (Sanchez-Gonzalez et al., 2020), traffic prediction (Jiang and Luo, 2022), and drug recommendation (Yang et al., 2023b). However, due to inevitably error-prone data collection (Chen et al., 2020), the input graph may contain spurious and unobserved edges that lead to sub-optimal results of GNNs and degrade the downstream performance.

Graph structure learning (Zhu et al., 2021c) serves as a plausible remedy for this issue by optimizing graph structures and GNN classifiers at the same time. To this end, recent endeavors explore different technical aspects, e.g., parameterizing each potential edge between any pair of nodes (Franceschi et al., 2019; Elinas et al., 2020) or estimating potential links through a parameterized network (Chen et al., 2020; Wu et al., 2022b; Li et al., 2018). However, existing models limit their applicability to a closed-world hypothesis: the training and testing of structure learning models, which optimize the graph structures, are performed on the same graph. The issue is that since structure learning is often heavyweight and requires sophisticated optimization, it can be prohibitively resource-consuming to train structure learning models from scratch for each graph dataset. Moreover, due to limited labels in common graph-based predictive tasks, structure learning models are prone to over-fitting, given that they cannot utilize the common knowledge shared across different graph datasets.

To resolve the above dilemma, this paper attempts to explore a novel problem setting termed Open-World Graph Structure Learning. Specifically, we target learning a generalizable graph structure learning model which is trained with multiple source graphs and can be directly adapted for inference (without re-training or fine-tuning) on new unseen target graphs. We formulate the problem as a bi-level optimization target that jointly learns a single dataset-shared structure learner and multiple dataset-specific GNNs tailored for particular graph datasets, as shown in Fig. 1. Under such a framework, the well-trained structure learner can leverage the common transferrable knowledge across datasets for enhancing generalization and more critically, be readily utilized to yield adaptive message-passing topology for arbitrarily given target graphs.

With the guidance of the aforementioned general goal, we propose GraphGLOW (short for A Graph Structure Learning Model for Open-World Generalization) that aims at learning the generalizable patterns of optimal message-passing topology across source graphs. Specifically, we first take a bottom-up perspective and formulate the generative process for observed data in a probabilistic manner. On top of this, we derive a tractable and feasible learning objective through the lens of variational inference. The structure learner is specified as a multi-head weighted similarity function so as to guarantee enough expressivity for accommodating diverse structural information, and we further harness an approximation scheme to reduce the quadratic complexity overhead of learning potential edges from arbitrary node pairs.

Refer to caption
Figure 1. Illustration of Open-World Graph Structure Learning. In a diverse set of source graphs, we train multiple dataset-specific GNNs and a shared structure learner. In the target graph, we directly utilize the learned structure learner and only need to train a new GNN.

To reasonably and comprehensively evaluate the model, we devise experiments with a diverse set of protocols that measure the generalization ability under different difficulty levels (according to the intensity of distribution shifts between source graphs and target graphs). Concretely, we consider: 1) in-domain generalization, in which we generalize from some citation (social) networks to other citation (social) networks; 2) cross-domain generalization between citation and social networks. The results, which are consistent across various combinations of source and target graph datasets, demonstrate that when evaluated on the target graphs, our approach i) consistently outperforms directly training the GNN counterpart on the original non-optimized graph structures of the target datasets, and ii) performs on par with state-of-the-art structure learning methods (Franceschi et al., 2019; Chen et al., 2020; Elinas et al., 2020) trained on target graphs from scratch, with up to 25× less training time. Our code is available at https://github.com/WtaoZhao/GraphGLOW.

Refer to caption
Figure 2. Illustration of the proposed framework GraphGLOW targeting open-world graph structure learning. The middle part of the figure presents the training process for the structure learner together with multiple dataset-specific GNNs on source graphs. In (a)-(e) we illustrate the details of the graph structure learner, the backbone GNN, the iterative training process, the training procedure and the transferring procedure. When training is finished, the structure learner is fixed and we only need to train a dataset-specific GNN on the new target graph, with latent structures inferred by the well-trained structure learner.

2. Preliminary and Problem Definition

Node-Level Predictive Tasks. Denote a graph with $N$ nodes as $\mathcal{G}=(\mathbf{A},\mathbf{X},\mathbf{Y})$, where $\mathbf{A}=\{a_{uv}\}_{N\times N}$ is an adjacency matrix ($a_{uv}=1$ if the edge between nodes $u$ and $v$ exists and $0$ otherwise), $\mathbf{X}=\{\mathbf{x}_{u}\}_{N\times D}$ is a feature matrix with $\mathbf{x}_{u}$ the $D$-dimensional feature vector of node $u$, and $\mathbf{Y}=\{y_{u}\}_{N\times C}$ with $y_{u}$ the label vector of node $u$ and $C$ the number of classes. The node labels are partially observed as training data, based on which node-level prediction aims to infer the unobserved labels of testing nodes in the graph using node features and graph structures. The latter is often achieved via a GNN model, denoted as $h_{w}$, that yields predicted node labels $\hat{\mathbf{Y}}=h_{w}(\mathbf{A},\mathbf{X})$ and is optimized with the classification loss $w^{*}=\operatorname*{arg\,min}_{w}\mathcal{L}(\hat{\mathbf{Y}},\mathbf{Y})$ using observed labels from training nodes.

Closed-World Graph Structure Learning (GLCW). Standard graph structure learning for node-level predictive tasks trains a graph structure learner $g_{\theta}$ to refine the given structure, i.e., $\hat{\mathbf{A}}=g_{\theta}(\mathbf{A},\mathbf{X})$, over which the GNN classifier $h_{w}$ conducts message passing to produce node representations and predictions. The learner $g_{\theta}$ is expected to produce optimal graph structures that give rise to satisfactory downstream classification performance of the GNN classifier. Formally, the goal of training $g_{\theta}$ along with $h_{w}$ can be expressed as a nested optimization problem:

(1) $\theta^{*}=\operatorname*{arg\,min}_{\theta}\min_{w}\mathcal{L}\left(h_{w}(g_{\theta}(\mathbf{A},\mathbf{X}),\mathbf{X}),\mathbf{Y}\right).$

The above formulation of graph structure learning under closed-world assumptions constrains the training and testing nodes to lie in the same graph, which requires $g_{\theta}$ to be trained from scratch on each graph dataset. Since $g_{\theta}$ is often far more complicated (e.g., with orders-of-magnitude more trainable parameters) and harder to optimize (due to the bi-level optimization in (1)) than the GNN $h_{w}$, GLCW leads to undesirable inefficiency and vulnerability to serious over-fitting (due to limited labeled information).

Open-World Graph Structure Learning (GLOW). In this work, we turn to a new learning paradigm that generalizes graph structure learning to open-world assumptions, borrowing concepts from domain generalization (Muandet et al., 2013) and, more broadly, out-of-distribution generalization (Wu et al., 2022a). Specifically, assume that we are given multiple source graphs, denoted as $\{\mathcal{G}^{s}_{m}\}_{m=1}^{M}=\{(\mathbf{A}^{s}_{m},\mathbf{X}^{s}_{m},\mathbf{Y}^{s}_{m})\}_{m=1}^{M}$, and a target graph $\mathcal{G}^{t}=(\mathbf{A}^{t},\mathbf{X}^{t},\mathbf{Y}^{t})$, whose distribution often differs from that of any source graph. The goal is to train a universal structure learner $g_{\theta}$ on source graphs which can be directly used for inference on the target graph without any re-training or fine-tuning. The trained structure learner is expected to produce desirable graph structures that enable better downstream classification by a GNN classifier optimized for the target graph.

More specifically, we consider a one-to-many framework that coordinates a shared graph structure learner $g_{\theta}$ and multiple dataset-specific GNNs $\{h_{w_{m}}\}_{m=1}^{M}$, where $h_{w_{m}}$ with independent parameterization $w_{m}$ is optimized for a given source graph $\mathcal{G}_{m}^{s}$. With the aim of learning a universal $g_{\theta}$ that can generalize to new unseen target graphs, our training goal can be formulated as the following bi-level optimization problem:

(2) $\theta^{*}=\operatorname*{arg\,min}_{\theta}\min_{w_{1},\cdots,w_{M}}\sum_{m=1}^{M}\mathcal{L}\left(h_{w_{m}}(g_{\theta}(\mathbf{A}^{s}_{m},\mathbf{X}^{s}_{m}),\mathbf{X}^{s}_{m}),\mathbf{Y}^{s}_{m}\right),$

where the inner optimization is a multi-task learning objective. Generally, (2) aims at finding an optimal $g_{\theta}$ that jointly minimizes the classification losses induced by the $M$ GNN models, each trained for a particular source graph. After training, we can directly adapt $g_{\theta^{*}}$ to the target graph for testing purposes, and only need to train a GNN $h_{w}$ on the target graph:

(3) $w^{*}=\operatorname*{arg\,min}_{w}\mathcal{L}\left(h_{w}(g_{\theta^{*}}(\mathbf{A}^{t},\mathbf{X}^{t}),\mathbf{X}^{t}),\mathbf{Y}^{t}\right).$
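To make the one-to-many objective in (2) concrete, the following minimal sketch (our own illustration rather than the released implementation) alternately evaluates a single shared structure learner and per-graph GNN heads on toy data; the module definitions, random data, and hyper-parameters are assumptions for exposition only.

```python
# Minimal sketch of the training objective in Eq. (2): one shared structure
# learner g_theta and one GNN h_{w_m} per source graph, optimized jointly on
# the sum of per-graph classification losses. All modules and data are toy
# stand-ins, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyStructureLearner(nn.Module):      # stands in for g_theta
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                  # x: [N, dim] node features
        z = self.proj(x)
        return torch.sigmoid(z @ z.t())    # [N, N] latent edge probabilities

class ToyGNN(nn.Module):                   # stands in for h_{w_m}
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.lin = nn.Linear(in_dim, num_classes)

    def forward(self, adj, x):             # one propagation step on the latent graph
        adj = adj / adj.sum(-1, keepdim=True).clamp(min=1e-6)
        return self.lin(adj @ x)

# toy source graphs (features, labels); sizes are arbitrary assumptions
graphs = [(torch.randn(30, 16), torch.randint(0, 3, (30,))) for _ in range(2)]
g_theta = ToyStructureLearner(dim=16)
gnns = [ToyGNN(16, 3) for _ in graphs]
opt = torch.optim.Adam(list(g_theta.parameters()) +
                       [p for h in gnns for p in h.parameters()], lr=1e-2)

for epoch in range(50):
    opt.zero_grad()
    loss = 0.0
    for (x, y), h in zip(graphs, gnns):    # sum of per-graph losses as in Eq. (2)
        a_hat = g_theta(x)                 # latent structure from the shared learner
        loss = loss + F.cross_entropy(h(a_hat, x), y)
    loss.backward()
    opt.step()
```

At test time, following (3), the shared learner would be frozen and only a fresh GNN would be fit on the target graph with the structures it infers.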

3. Proposed Model

To handle the above problem, we present an end-to-end learning framework GraphGLOW that guides the central graph structure learner to learn adaptive message-passing structures exploited by multiple GNNs. The overview of GraphGLOW is shown in Fig. 2.

The fundamental challenge of GLOW lies in how to model and capture the generalizable patterns among adaptive structures of different graphs. To this end, we first take a data-generative perspective that treats the inputs and intermediate results as random variables and investigate their dependencies, based on which we present the high-level model formulation in a probabilistic form (Sec. 3.1). Then we proceed to instantiate the model components (Sec. 3.2). Finally, we discuss differentiable training approaches for optimization (Sec. 3.3).

3.1. Model Formulation

To commence, we characterize the data generation process by a latent variable model, based on which we derive the formulation of our method. We treat the latent graph $\hat{\mathbf{A}}$ (given by $g_{\theta}$) as a latent variable whose prior distribution is given by $p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$. The prior distribution reflects how one presumes on the latent structures before observed labels arrive. Then, the prediction is given by a predictive distribution $p(\mathbf{Y}|\hat{\mathbf{A}},\mathbf{A},\mathbf{X})$. The learning objective aims at maximizing the log-likelihood of observed labels, which can be written as $\log p(\mathbf{Y}|\mathbf{A},\mathbf{X})=\log\int_{\hat{\mathbf{A}}}p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})d\hat{\mathbf{A}}$. To estimate latent graphs that could enhance message passing for downstream tasks, one plausible way is to sample from the posterior $p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})$, conditioned on the labels from downstream tasks. Using Bayes' rule, we have

(4) $p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})=\frac{p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}{\int_{\hat{\mathbf{A}}}p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})d\hat{\mathbf{A}}}.$

However, the integration over $\hat{\mathbf{A}}$ in the denominator is computationally intractable due to the exponentially large space of $\hat{\mathbf{A}}$.

To circumvent the difficulty, we introduce a variational distribution $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$ over $\hat{\mathbf{A}}$ as an approximation to $p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})$. We can sample latent graphs from $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$, i.e., instantiate it as the structure learner $g_{\theta}$, and once $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})=p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})$, we obtain samples from the posterior that ideally generates the optimal graph structures for downstream prediction. By this principle, we can start with minimizing the Kullback-Leibler divergence between $q$ and $p$ and derive the learning objective as follows:

(5) $\mathcal{D}_{KL}(q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})\,\|\,p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X}))=-\underbrace{\mathbb{E}_{\hat{\mathbf{A}}\sim q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}\left[\log\frac{p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}{q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}\right]}_{\text{Evidence Lower Bound}}+\log p(\mathbf{Y}|\mathbf{A},\mathbf{X}).$

Based on this equation, we further have the inequality which bridges the relationship between the Evidence Lower Bound (ELBO) and the observed data log-likelihood:

(6) $\log p(\mathbf{Y}|\mathbf{A},\mathbf{X})\geq\mathbb{E}_{\hat{\mathbf{A}}\sim q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}\left[\log\frac{p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}{q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}\right].$

The equality holds if and only if $\mathcal{D}_{KL}(q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})\,\|\,p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X}))=0$. The above fact suggests that we can optimize the ELBO as a surrogate for $\log p(\mathbf{Y}|\mathbf{A},\mathbf{X})$, which involves the intractable integration. More importantly, when the ELBO is optimized w.r.t. the $q$ distribution, the variational bound is lifted to the original log-likelihood and one has $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})=p(\hat{\mathbf{A}}|\mathbf{Y},\mathbf{A},\mathbf{X})$, i.e., the variational distribution equals the true posterior, which is what we expect.

Pushing further and incorporating source graphs $\mathcal{G}_{m}$ (we omit the superscript for simplicity), we arrive at the following objective:

(7) $\mathbb{E}_{\mathcal{G}_{m}\sim p(\mathcal{G})}\Big[\mathbb{E}_{\hat{\mathbf{A}}\sim q_{\theta}(\hat{\mathbf{A}}|\mathbf{A}=\mathbf{A}_{m},\mathbf{X}=\mathbf{X}_{m})}\big[\log p_{w_{m}}(\mathbf{Y}|\mathbf{A}=\mathbf{A}_{m},\mathbf{X}=\mathbf{X}_{m},\hat{\mathbf{A}})+\log p_{0}(\hat{\mathbf{A}}|\mathbf{A}=\mathbf{A}_{m},\mathbf{X}=\mathbf{X}_{m})-\log q_{\theta}(\hat{\mathbf{A}}|\mathbf{A}=\mathbf{A}_{m},\mathbf{X}=\mathbf{X}_{m})\big]\Big].$

Here we instantiate $q(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$ as the shared structure learner $g_{\theta}$, $p(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$ as a (shared) non-parametric prior distribution $p_{0}$ over latent structures, and $p(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})$ as the dataset-specific GNN model $h_{w_{m}}$, so as to suit the framework for the problem formulated in Section 2. The formulation of (7) shares the spirit of Bayesian meta learning (Finn et al., 2018). We can treat the GNN training as a dataset-specific learning task and the latent graph as a certain ‘learning algorithm’ or ‘hyper-parameter’, so (7) essentially aims at learning a structure learner that can yield a desirable ‘learning algorithm’ for each specific learning task on graphs. Furthermore, the three terms in (7) have distinct effects: i) the predictive term $\log p_{w_{m}}$ acts as a supervised classification loss; ii) the prior term $\log p_{0}$ serves as regularization on the generated structures; iii) the third term, which is essentially the entropy of $q_{\theta}$, penalizes high confidence on certain structures.

To sum up, we can optimize (7) with joint learning of the structure learner $g_{\theta}$ and the GNN models $\{h_{w_{m}}\}_{m=1}^{M}$ on source graphs $\{\mathcal{G}_{m}\}_{m=1}^{M}$ for training the structure learner. After that, we can generalize the well-trained $g_{\theta^{*}}$ to estimate latent graph structures for a new target graph $\mathcal{G}^{t}=(\mathbf{A}^{t},\mathbf{X}^{t})$ and only need to train the GNN model $h_{w}$ w.r.t. the predictive objective with fixed $\theta^{*}$:

(8) $\mathbb{E}_{\hat{\mathbf{A}}\sim q_{\theta^{*}}(\hat{\mathbf{A}}|\mathbf{A}=\mathbf{A}^{t},\mathbf{X}=\mathbf{X}^{t})}\left[\log p_{w}(\mathbf{Y}|\mathbf{A}=\mathbf{A}^{t},\mathbf{X}=\mathbf{X}^{t},\hat{\mathbf{A}})\right].$

We next discuss how to specify $g_{\theta}$, $h_{w_{m}}$ and $p_{0}$, with a special focus on their expressiveness and efficiency, in Section 3.2. Later, we present the details of loss computation and model training based on the above formulation in Section 3.3.

3.2. Model Instantiations

3.2.1. Instantiation for $q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$

The variational distribution aims at learning the conditional distribution that generates suitable latent structures for message passing based on input observations. A natural means is to treat each edge of the latent graph as a Bernoulli random variable and let the distribution $q$ be a product of $N\times N$ independent Bernoulli random variables (Elinas et al., 2020; Wu et al., 2020).

The graph structure learner $g_{\theta}$ can be used for predicting the Bernoulli parameter matrix. To accommodate the information from node features and graph structure, we use the node representation, denoted as $\mathbf{z}_{u}\in\mathbb{R}^{d}$, where $d$ is the embedding dimension, to compute the edge probability $\alpha_{uv}$ for edge $(u,v)$ as

(9) $\alpha_{uv}=\delta\left(\frac{1}{H}\sum_{h=1}^{H}s(\mathbf{w}_{h}^{1}\odot\mathbf{z}_{u},\mathbf{w}_{h}^{2}\odot\mathbf{z}_{v})\right),$

where $s(\cdot,\cdot)$ is a similarity function for two vectors, $\odot$ denotes the Hadamard product, $\delta$ is a function that maps its input into values within $[0,1]$, and $\mathbf{w}_{h}^{1},\mathbf{w}_{h}^{2}\in\mathbb{R}^{d}$ are two weight vectors of the $h$-th head. Common choices for $s(\cdot,\cdot)$ include simple dot-product, cosine distance (Nguyen and Bai, 2010), RBF kernel (Li et al., 2018), etc. Here we introduce $H$ heads and aggregate their results to enhance the model's expressiveness for capturing the underlying influence between nodes from multifaceted causes, following the spirit of multi-head attention (Vaswani et al., 2017; Veličković et al., 2018). Besides, the weight vectors in (9) can learn to scale the input node representations element-wise and adaptively attend to dominant features. Apart from these, the two weight vectors $\mathbf{w}_{h}^{1},\mathbf{w}_{h}^{2}$ with independent parameterization can have the same or distinct directions, which makes the model capable of connecting similar or dissimilar nodes and expressive enough to handle both homophilous and non-homophilous graphs.
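As a concrete reading of (9), the sketch below (an assumption-laden illustration, not the official code) instantiates $s(\cdot,\cdot)$ as cosine similarity and $\delta$ as a sigmoid; the embedding size and head count are arbitrary choices.

```python
# Sketch of the multi-head weighted similarity in Eq. (9): each head rescales
# the node embeddings element-wise, scores all pairs, and the head-averaged
# score is squashed into an edge probability.
import torch
import torch.nn.functional as F

def edge_probabilities(z, w1, w2):
    """z: [N, d] node embeddings; w1, w2: [H, d] per-head weight vectors.
    Returns alpha: [N, N] with alpha[u, v] as in Eq. (9)."""
    scores = []
    for h in range(w1.shape[0]):
        zu = w1[h] * z                          # element-wise scaling, one head
        zv = w2[h] * z
        sim = F.cosine_similarity(zu.unsqueeze(1), zv.unsqueeze(0), dim=-1)
        scores.append(sim)
    return torch.sigmoid(torch.stack(scores).mean(dim=0))

# usage on random embeddings (assumed sizes)
z = torch.randn(100, 32)
w1, w2 = torch.randn(4, 32), torch.randn(4, 32)
alpha = edge_probabilities(z, w1, w2)           # [100, 100]
```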

To obtain the discrete latent graph $\hat{\mathbf{A}}=\{\hat{a}_{uv}\}_{N\times N}$, one can sample each latent edge from $\hat{a}_{uv}\sim\mathrm{Bernoulli}(\alpha_{uv})$. However, such an approach induces quadratic algorithmic complexity $O(N^{2})$ for computing and storing an estimated structure that entails potential links between any node pair, which could be prohibitive for large graphs. To reduce space and time complexity, we adopt a pivot-based structure learning method, as shown in Fig. 3. Concretely, we randomly choose $P$ nodes in the graph as pivots, where $P$ is a hyperparameter much smaller than $N$ (e.g., $P\approx\frac{1}{10}N$). We then leverage pivot nodes as intermediates and convert the $N\times N$ graph $\hat{\mathbf{A}}$, which can be prohibitively large with dense edges, into a cascade of one $N\times P$ node-pivot bipartite graph $\hat{\mathbf{B}}_{1}$ and one $P\times N$ pivot-node bipartite graph $\hat{\mathbf{B}}_{2}=\hat{\mathbf{B}}_{1}^{\top}$, which effectively controls the computational cost with a proper $P$. In this way, we can compute a node-pivot similarity matrix $\mathbf{\Gamma}=\{\alpha_{up}\}_{N\times P}$ based on (9) to parameterize the distribution of latent graph structures. This only requires $O(NP)$ time and space complexity, and one can sample from each $\mathrm{Bernoulli}(\alpha_{up})$ to obtain $\hat{\mathbf{B}}_{1}$ and $\hat{\mathbf{B}}_{2}$. Meanwhile, the original $N\times N$ adjacency matrix can be retrieved by $\hat{\mathbf{A}}=\hat{\mathbf{B}}_{1}\hat{\mathbf{B}}_{2}$, which suggests that one can execute message passing on $\hat{\mathbf{B}}_{1}$ and $\hat{\mathbf{B}}_{2}$ to approximate that on $\hat{\mathbf{A}}$ (see more details in Section 3.2.2). In terms of accelerating structure learning, other strategies, such as the all-pair message passing schemes with linear complexity explored by (Wu et al., 2022b, 2023b), can also serve this purpose.
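The following sketch illustrates the pivot-based approximation under assumed sizes (random pivot selection, cosine similarity, sigmoid squashing); it is meant only to show where the $O(NP)$ saving comes from, not to reproduce the exact implementation.

```python
# Sketch of the pivot-based approximation: instead of an N x N matrix, compute
# an N x P node-pivot similarity matrix Gamma and sample the bipartite graph
# B1 from Bernoulli(Gamma). Pivot selection and sizes are assumptions.
import torch
import torch.nn.functional as F

N, P, d = 1000, 100, 32                       # P << N, e.g. roughly N / 10
z = torch.randn(N, d)                         # node embeddings
pivot_idx = torch.randperm(N)[:P]             # randomly chosen pivot nodes
z_pivot = z[pivot_idx]                        # [P, d]

# node-pivot similarities, O(N * P) instead of O(N^2)
gamma = torch.sigmoid(F.cosine_similarity(z.unsqueeze(1),
                                           z_pivot.unsqueeze(0), dim=-1))  # [N, P]

b1 = torch.bernoulli(gamma)                   # sampled node-pivot bipartite graph
b2 = b1.t()                                   # pivot-node graph, B2 = B1^T
# The full latent graph A_hat = B1 @ B2 is never materialized; message passing
# is run on B1 and B2 directly (see Eq. (11)).
```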

3.2.2. Instantiation for $p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})$

The predictive distribution, parameterized by the GNN network $h_{w_{m}}$, recursively propagates features along the latent graph to update node representations and produces the prediction result for each node. We now present the details of the GNN's message passing on the latent graph, designed for sufficient expressiveness, stability and efficiency.

To begin with, we review the message-passing rule in common GNN models, such as GCN (Kipf and Welling, 2016), that operates on the original graph $\mathbf{A}$:

(10) $\mathbf{Z}^{(l+1)}=\sigma\left(\operatorname{MP}_{1}(\mathbf{Z}^{(l)},\mathbf{A})\mathbf{W}^{(l)}\right)=\sigma\left(\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\mathbf{Z}^{(l)}\mathbf{W}^{(l)}\right),$

where $\mathbf{W}^{(l)}\in\mathbb{R}^{d\times d}$ is a weight matrix, $\sigma$ is a non-linear activation, $\mathbf{D}$ denotes the diagonal degree matrix of the input graph $\mathbf{A}$, and $\mathbf{Z}^{(l)}=\{\mathbf{z}_{u}^{(l)}\}_{N\times d}$ is the stack of node representations at the $l$-th layer.

Refer to caption
Figure 3. Illustration of scalable structure learning and message passing, which reduces the algorithmic complexity from $O(N^{2})$ to $O(NP)$. We choose $P$ nodes as pivots and convert the $N\times N$ matrix into the product of an $N\times P$ node-pivot matrix and its $P\times N$ transpose (where message passing is executed in two steps, i.e., node-to-pivot and pivot-to-node).

With the estimated latent graph $\hat{\mathbf{A}}=\hat{\mathbf{B}}_{1}\hat{\mathbf{B}}_{2}$, we perform message passing $\mathrm{MP}_{2}(\cdot)$ in a two-step fashion to update node representations:

(11) i) node-to-pivot passing: $\mathbf{C}^{(l+\frac{1}{2})}=\mathrm{RowNorm}(\mathbf{\Gamma}^{\top})\mathbf{Z}^{(l)}$; ii) pivot-to-node passing: $\mathbf{C}^{(l+1)}=\mathrm{RowNorm}(\mathbf{\Gamma})\mathbf{C}^{(l+\frac{1}{2})}$,

where $\mathbf{C}^{(l+\frac{1}{2})}$ is an intermediate node representation and $\mathbf{\Gamma}=\{\alpha_{up}\}_{N\times P}$ is the node-pivot similarity matrix calculated by (9). Such a two-step procedure can be conducted efficiently within $O(NP)$ time and space complexity.
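A minimal sketch of the two-step passing in (11), assuming a dense $\mathbf{\Gamma}$ and row-sum normalization for $\mathrm{RowNorm}$:

```python
# Sketch of the two-step message passing in Eq. (11) over the node-pivot
# similarity matrix Gamma; shapes are illustrative assumptions.
import torch

def row_norm(m, eps=1e-8):
    return m / m.sum(dim=-1, keepdim=True).clamp(min=eps)

def pivot_message_passing(gamma, z):
    """gamma: [N, P] node-pivot similarities, z: [N, d] node states.
    Returns updated node states of shape [N, d] in O(N * P * d)."""
    c_half = row_norm(gamma.t()) @ z          # i) node-to-pivot passing, [P, d]
    c_next = row_norm(gamma) @ c_half         # ii) pivot-to-node passing, [N, d]
    return c_next

gamma = torch.rand(1000, 100)
z = torch.randn(1000, 64)
z_new = pivot_message_passing(gamma, z)       # [1000, 64]
```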

Although feature propagation on the estimated latent structure can presumably yield better node representations, the original input graph structure also contains useful information, such as effective inductive bias (Bronstein et al., 2016). Therefore, we integrate the two message-passing functions to compute the layer-wise update of node representations:

(12) $\mathbf{Z}^{(l+1)}=\sigma\left(\lambda\,\mathrm{MP}_{1}(\mathbf{Z}^{(l)},\mathbf{A})\mathbf{W}^{(l)}+(1-\lambda)\,\mathrm{MP}_{2}(\mathbf{Z}^{(l)},\hat{\mathbf{A}})\mathbf{W}^{(l)}\right),$

where $\lambda$ is a trade-off hyper-parameter that controls the concentration weight on input structures. This design also improves training stability by reducing the impact of large variations in the latent structures throughout the training procedure.

With $L$ GNN layers, one can obtain the prediction $\hat{\mathbf{Y}}$ by setting $\hat{\mathbf{Y}}=\mathbf{Z}^{(L)}$ and $\mathbf{W}^{(L-1)}\in\mathbb{R}^{d\times C}$, where $C$ is the number of classes. Alg. 1 shows the feed-forward computation of message passing.
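The hybrid layer in (12) can be sketched as below; the normalization details, activation and toy inputs are simplifying assumptions rather than the exact implementation behind Alg. 1.

```python
# Sketch of the hybrid propagation layer in Eq. (12): a lambda-weighted mix of
# GCN-style passing on the observed graph A and pivot-based passing on the
# learned structure, followed by a shared weight matrix.
import torch
import torch.nn as nn

class HybridLayer(nn.Module):
    def __init__(self, in_dim, out_dim, lam=0.5):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^{(l)}
        self.lam = lam

    def forward(self, z, adj, gamma):
        # MP1: symmetric normalization D^{-1/2} A D^{-1/2}
        deg = adj.sum(-1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        mp1 = (d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)) @ z
        # MP2: two-step passing over the node-pivot matrix (cf. Eq. (11))
        rn = lambda m: m / m.sum(-1, keepdim=True).clamp(min=1e-8)
        mp2 = rn(gamma) @ (rn(gamma.t()) @ z)
        return torch.relu(self.weight(self.lam * mp1 + (1 - self.lam) * mp2))

layer = HybridLayer(64, 32)
z = torch.randn(200, 64)
adj = torch.bernoulli(torch.full((200, 200), 0.05))
gamma = torch.rand(200, 30)
out = layer(z, adj, gamma)                    # [200, 32]
```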

3.2.3. Instantiation for $p_{0}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})$

The prior distribution reflects our presumption on the latent graph structures without the information of observed labels. In other words, it characterizes how likely a given graph structure is to provide enough potential for feature propagation by GNNs. The prior can be leveraged as regularization on the estimated latent graph $\hat{\mathbf{A}}$. With this consideration, we choose the prior as an energy function that quantifies the smoothness of the graph:

(13) $p_{0}(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})\propto\exp\left(-\alpha\sum_{u,v}\hat{\mathbf{A}}_{uv}\|\mathbf{x}_{u}-\mathbf{x}_{v}\|_{2}^{2}-\rho\|\hat{\mathbf{A}}\|_{F}^{2}\right),$

where $\|\cdot\|_{F}$ is the Frobenius norm. The first term in (13) measures the smoothness of the latent graph (Belkin and Niyogi, 2001), under the hypothesis that graphs with smoother features have lower energy (i.e., higher probability). The second term helps avoid overly large node degrees (Kalofolias, 2016). The hyperparameters $\alpha$ and $\rho$ control the strength of the regularization effects.

While we can retrieve the latent graph via $\hat{\mathbf{A}}=\hat{\mathbf{B}}_{1}\hat{\mathbf{B}}_{2}$, the computation of (13) still requires $O(N^{2})$ cost. To reduce the overhead, we apply the regularization to the $P\times P$ pivot-pivot adjacency matrix $\hat{\mathbf{E}}=\hat{\mathbf{B}}_{2}\hat{\mathbf{B}}_{1}$ as a proxy:

(14) $\mathcal{R}(\hat{\mathbf{E}})=\log p_{0}(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})\approx-\alpha\sum_{p,q}\hat{\mathbf{E}}_{pq}\|\mathbf{x}^{\prime}_{p}-\mathbf{x}^{\prime}_{q}\|_{2}^{2}-\rho\|\hat{\mathbf{E}}\|_{F}^{2},$

where $\mathbf{x}^{\prime}_{p}$ denotes the input feature of the $p$-th pivot node.
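A hedged sketch of the proxy regularizer $\mathcal{R}(\hat{\mathbf{E}})$ in (14); the hyper-parameter values and tensor sizes are illustrative assumptions.

```python
# Sketch of the smoothness-plus-sparsity regularizer in Eq. (14), evaluated on
# the P x P pivot-pivot proxy E_hat = B2 @ B1.
import torch

def structure_regularizer(e_hat, x_pivot, alpha=0.1, rho=0.01):
    """e_hat: [P, P] pivot-pivot adjacency, x_pivot: [P, D] pivot features."""
    dist = torch.cdist(x_pivot, x_pivot, p=2).pow(2)  # ||x'_p - x'_q||^2
    smoothness = (e_hat * dist).sum()     # penalizes edges across dissimilar nodes
    sparsity = e_hat.pow(2).sum()         # Frobenius norm, discourages dense graphs
    return -alpha * smoothness - rho * sparsity       # log p_0 up to a constant

b1 = torch.bernoulli(torch.rand(1000, 100))  # sampled node-pivot graph
e_hat = b1.t() @ b1                          # [P, P] proxy of the latent graph
x_pivot = torch.randn(100, 16)               # pivot node features (assumed)
reg = structure_regularizer(e_hat, x_pivot)
```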

3.3. Model Training

For optimization with (7), we proceed to derive the loss functions and updating gradients for $\theta$ and $w_{m}$ based on the three terms $\mathbb{E}_{q_{\theta}}[\log p_{w_{m}}]$, $\mathbb{E}_{q_{\theta}}[\log p_{0}]$ and $\mathbb{E}_{q_{\theta}}[\log q_{\theta}]$.

3.3.1. Optimization for $\mathbb{E}_{q_{\theta}}[\log p_{w_{m}}]$

The optimization difficulty stems from the expectation over $q_{\theta}$, where the sampling process is non-differentiable and hinders back-propagation. Common strategies for approximating sampling of discrete random variables include the Gumbel-Softmax trick (Jang et al., 2017) and the REINFORCE trick (Williams, 1992). However, both strategies yield a sparse graph structure at each sampling step, which could lead to high variance in the prediction result $\log p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})$ produced by message passing over a sampled graph. To mitigate the issue, we alternatively adopt the Normalized Weighted Geometric Mean (NWGM) (Xu et al., 2015) to move the outer expectation to the feature level. Specifically, we have (see Appendix A for detailed derivations)

(15) $\nabla_{\theta}\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[\log p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})]\approx\nabla_{\theta}\log p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}}=\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[\hat{\mathbf{A}}]).$

We denote the opposite of the above term as $\nabla_{\theta}\mathcal{L}_{s}(\theta)$. The gradient w.r.t. $w_{m}$ can be derived similarly. The above form is a biased estimate of the original objective, yet it reduces the variance from sampling and also improves training efficiency (without the need for message passing over multiple sampled graphs). (15) induces the supervised cross-entropy loss.
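The practical effect of (15) is that the supervised loss is computed by passing messages once over the expected structure (the Bernoulli parameters themselves) instead of averaging over sampled graphs; the toy sketch below, with assumed shapes, shows how this keeps the loss differentiable w.r.t. the structure learner.

```python
# Sketch of the NWGM-style relaxation in Eq. (15): message passing is run once
# on the expected structure E_q[A_hat], i.e. directly on the node-pivot
# probabilities Gamma, so the supervised loss stays differentiable.
import torch
import torch.nn.functional as F

gamma = torch.rand(200, 30, requires_grad=True)   # expected node-pivot structure
z = torch.randn(200, 3)                           # logits from a toy GNN layer
labels = torch.randint(0, 3, (200,))

rn = lambda m: m / m.sum(-1, keepdim=True).clamp(min=1e-8)
logits = rn(gamma) @ (rn(gamma.t()) @ z)          # deterministic expected passing
loss_s = F.cross_entropy(logits, labels)          # supervised term induced by Eq. (15)
loss_s.backward()                                 # gradients flow into gamma (i.e., theta)
```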

3.3.2. Optimization for $\mathbb{E}_{q_{\theta}}[\log p_{0}]$

For the second term in (7), we adopt the REINFORCE trick, i.e., policy gradient, to tackle the non-differentiability of sampling from $q_{\theta}$. Specifically, in each feed-forward computation, we sample from the Bernoulli distribution for each edge given by the estimated node-pivot similarity matrix, i.e., $\mathrm{Bernoulli}(\alpha_{up})$, obtain the sampled latent bipartite graph $\hat{\mathbf{B}}_{1}$, and subsequently have $\hat{\mathbf{E}}=\hat{\mathbf{B}}_{2}\hat{\mathbf{B}}_{1}=\hat{\mathbf{B}}_{1}^{\top}\hat{\mathbf{B}}_{1}$. The probability of the latent structure can be computed by

(16) $\pi_{\theta}(\hat{\mathbf{E}})=\prod_{u,p}\left(\hat{\mathbf{B}}_{1,up}\alpha_{up}+(1-\hat{\mathbf{B}}_{1,up})(1-\alpha_{up})\right).$

Denoting $\hat{\mathbf{E}}_{k}$ as the sampled result at the $k$-th time, we can independently sample $K$ times and obtain $\{\hat{\mathbf{E}}_{k}\}_{k=1}^{K}$ and $\{\pi_{\theta}(\hat{\mathbf{E}}_{k})\}_{k=1}^{K}$. Recall that the regularization reward from $\log p_{0}$ is given by (14). The policy gradient (Williams, 1992) yields the gradient of the loss for $\theta$ as

(17) $\nabla_{\theta}\mathcal{L}_{r}(\theta)=-\nabla_{\theta}\mathbb{E}_{\hat{\mathbf{A}}\sim q(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})}[\log p_{0}(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})]\approx-\nabla_{\theta}\frac{1}{K}\sum_{k=1}^{K}\log\pi_{\theta}(\hat{\mathbf{E}}_{k})\left(\mathcal{R}(\hat{\mathbf{E}}_{k})-\overline{\mathcal{R}}\right),$

where $\overline{\mathcal{R}}$ acts as a baseline function obtained by averaging the regularization rewards $\mathcal{R}(\hat{\mathbf{E}}_{k})$ in one feed-forward computation, which helps to reduce the variance during policy gradient training (Sutton et al., 1999).
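A compact sketch of the estimator in (16)-(17) under assumed shapes and hyper-parameters; the regularization reward is the proxy $\mathcal{R}(\hat{\mathbf{E}})$ from (14) and the baseline is the mean of the sampled rewards.

```python
# Sketch of the REINFORCE estimator in Eqs. (16)-(17): sample K bipartite
# graphs from Bernoulli(Gamma), score each with the regularizer, and weight
# the log-probabilities by baseline-subtracted rewards.
import torch

def reinforce_loss(gamma, x_pivot, K=4, alpha=0.1, rho=0.01):
    """gamma: [N, P] Bernoulli parameters, x_pivot: [P, D] pivot features."""
    log_probs, rewards = [], []
    dist = torch.cdist(x_pivot, x_pivot).pow(2)               # ||x'_p - x'_q||^2
    for _ in range(K):
        b1 = torch.bernoulli(gamma).detach()                  # sampled bipartite graph
        log_pi = (b1 * torch.log(gamma + 1e-8)
                  + (1 - b1) * torch.log(1 - gamma + 1e-8)).sum()   # Eq. (16)
        e_hat = b1.t() @ b1                                   # pivot-pivot proxy
        rewards.append(-alpha * (e_hat * dist).sum() - rho * e_hat.pow(2).sum())
        log_probs.append(log_pi)
    rewards = torch.stack(rewards)
    baseline = rewards.mean()                                 # variance-reducing baseline
    return -(torch.stack(log_probs) * (rewards - baseline)).mean()

gamma = torch.rand(1000, 100, requires_grad=True)
loss_r = reinforce_loss(gamma, torch.randn(100, 16))
loss_r.backward()
```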

3.3.3. Optimization with $\mathbb{E}_{q_{\theta}}[\log q_{\theta}]$

The last entropy term for $q_{\theta}$ can be computed directly by

(18) $\mathcal{L}_{e}(\theta)=\mathbb{E}_{\hat{\mathbf{A}}\sim q(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})}[\log q(\hat{\mathbf{A}}|\mathbf{X},\mathbf{A})]\approx\frac{1}{NP}\sum_{u=1}^{N}\sum_{p=1}^{P}\left[\alpha_{up}\log\alpha_{up}+(1-\alpha_{up})\log(1-\alpha_{up})\right],$

where we again adopt the node-pivot similarity matrix as a proxy for the estimated latent graph.
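The entropy term in (18) reduces to a per-entry Bernoulli entropy over the node-pivot probabilities; the short sketch below assumes a small clamp to guard against $\log 0$.

```python
# Sketch of the entropy term in Eq. (18), computed directly on the node-pivot
# Bernoulli parameters.
import torch

def entropy_loss(gamma, eps=1e-8):
    """gamma: [N, P] edge probabilities; returns the averaged negative entropy."""
    g = gamma.clamp(eps, 1 - eps)
    return (g * g.log() + (1 - g) * (1 - g).log()).mean()

loss_e = entropy_loss(torch.rand(1000, 100))
```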

Table 1. Test accuracy (%) on target graphs for in-domain generalizations. For each social network (resp. citation network) as target dataset, we consider the other social networks (resp. citation networks) as source graphs. GraphGLOW* is an oracle model that shares the same architecture as our model GraphGLOW and is directly trained on target graphs.
Type Method Cornell5 Johns.55 Amherst41 Reed98 Cora CiteSeer PubMed
Pure GNN GCN 68.6 ± 0.5 70.8 ± 1.0 65.8 ± 1.6 60.8 ± 1.6 81.6 ± 0.4 71.6 ± 0.3 78.8 ± 0.6
SAGE 68.7 ± 0.8 67.5 ± 0.9 66.3 ± 1.8 63.9 ± 1.9 81.4 ± 0.6 71.6 ± 0.5 78.6 ± 0.7
GAT 69.6 ± 1.2 69.4 ± 0.7 68.7 ± 2.1 64.5 ± 2.5 83.0 ± 0.7 72.1 ± 1.1 79.0 ± 0.4
GPR 68.8 ± 0.7 69.6 ± 1.3 66.2 ± 1.5 62.7 ± 2.0 83.1 ± 0.7 72.4 ± 0.8 79.6 ± 0.5
APPNP 68.5 ± 0.8 69.1 ± 1.4 65.9 ± 1.3 62.3 ± 1.5 82.7 ± 0.5 71.9 ± 0.5 79.2 ± 0.3
H2GCN 71.4 ± 0.5 68.3 ± 1.0 66.5 ± 2.2 65.4 ± 1.3 82.5 ± 0.8 71.4 ± 0.7 79.4 ± 0.4
CPGNN 71.1 ± 0.5 68.7 ± 1.3 66.7 ± 0.8 63.6 ± 1.8 80.8 ± 0.4 71.6 ± 0.4 78.5 ± 0.7
Graph Structure Learning $\mathrm{GraphGLOW_{dp}}$ 71.5 ± 0.7 71.3 ± 1.2 68.5 ± 1.6 63.2 ± 1.2 83.1 ± 0.8 71.7 ± 1.0 77.3 ± 0.8
$\mathrm{GraphGLOW_{knn}}$ 69.4 ± 0.8 71.0 ± 1.3 64.8 ± 1.2 63.6 ± 1.6 81.7 ± 0.8 71.5 ± 0.8 79.4 ± 0.6
$\mathrm{GraphGLOW_{cos}}$ 69.9 ± 0.7 70.8 ± 1.4 65.2 ± 1.8 62.7 ± 1.3 82.0 ± 0.7 71.9 ± 0.9 78.7 ± 0.8
$\mathrm{GraphGLOW_{at}}$ 69.3 ± 0.8 70.9 ± 1.3 65.0 ± 1.3 65.0 ± 1.7 81.9 ± 0.9 71.3 ± 0.7 78.8 ± 0.6
GraphGLOW 71.8 ± 0.9 71.5 ± 0.8 70.6 ± 1.4 66.8 ± 1.1 83.5 ± 0.6 73.6 ± 0.6 79.8 ± 0.8
GraphGLOW* 71.1 ± 0.3 72.2 ± 0.5 70.3 ± 0.9 66.8 ± 1.4 83.5 ± 0.6 73.9 ± 0.7 79.9 ± 0.5

3.3.4. Iterative Structure Learning for Acceleration

A straightforward approach is to perform one round of structure inference and one round of GNN message passing for prediction in each feed-forward computation. To let structure learning and GNN learning mutually reinforce each other (Chen et al., 2020), we instead consider multiple iterative updates of graph structures and node representations before each back-propagation. More specifically, in each epoch, we repeatedly update the node representations $\mathbf{Z}^{t}$ (where the superscript $t$ denotes the $t$-th iteration) and the latent graph $\hat{\mathbf{A}}^{t}$ until a given maximum budget is reached. To accelerate training, we aggregate the losses $\mathcal{L}^{t}$ of all iteration steps for parameter updating. As different graphs have different feature spaces, we utilize the first layer of the GNN as an encoder at the very beginning and then feed the encoded representations to the structure learner. The training algorithm for the structure learner $g_{\theta}$ on source graphs is described in Alg. 2 (in the appendix), where we train the structure learner for multiple episodes and, in each episode, train $g_{\theta}$ on each source graph for several epochs. At test time, the well-trained $g_{\theta}$ is fixed and we train a GNN $h_{w}$ on the target graph with latent structures inferred by $g_{\theta}$, as described in Alg. 3.
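A high-level sketch of this iterative scheme follows, with toy stand-ins for the encoder, structure learner and classifier (all sizes and the number of iterations T are assumptions); it only illustrates how per-iteration losses are accumulated before a single back-propagation, not the full Alg. 2.

```python
# Sketch of Sec. 3.3.4: within one epoch, node representations and latent
# structures are refined alternately for T iterations, per-iteration losses are
# summed, and one back-propagation updates the learner and the GNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

N, P, d, C, T = 300, 30, 32, 3, 3
x = torch.randn(N, d); y = torch.randint(0, C, (N,))
pivots = torch.randperm(N)[:P]

encoder = nn.Linear(d, d)                      # first GNN layer used as encoder
learner = nn.Linear(d, d)                      # toy stand-in for g_theta
classifier = nn.Linear(d, C)                   # toy stand-in for the rest of h_w
opt = torch.optim.Adam([*encoder.parameters(), *learner.parameters(),
                        *classifier.parameters()], lr=1e-2)

rn = lambda m: m / m.sum(-1, keepdim=True).clamp(min=1e-8)

for epoch in range(20):
    opt.zero_grad()
    z, total_loss = torch.relu(encoder(x)), 0.0
    for t in range(T):                                   # iterative refinement
        q = learner(z)
        gamma = torch.sigmoid(q @ q[pivots].t())         # latent structure from current z
        z = torch.relu(rn(gamma) @ (rn(gamma.t()) @ z))  # message passing on latent graph
        total_loss = total_loss + F.cross_entropy(classifier(z), y)
    total_loss.backward()                                # one back-propagation per epoch
    opt.step()
```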

4. Related Works

Graph Neural Networks

Graph neural networks (GNNs) (Kipf and Welling, 2016; Veličković et al., 2018; Hamilton et al., 2017; Xu et al., 2018; Abu-El-Haija et al., 2019; Geng et al., 2023) have achieved impressive performance in modeling graph-structured data. Nonetheless, there is increasing evidence of GNNs' deficiency on graph structures that are inconsistent with the principle of message passing. One typical situation lies in non-homophilous graphs (Zhu et al., 2020), where adjacent nodes tend to have dissimilar features/labels. Recent studies devise adaptive feature propagation/aggregation to tackle such heterophily (Li et al., 2021; Zhu et al., 2021b, 2020; Chien et al., 2021). Another situation stems from graphs with noisy or spurious links, for which several works propose to purify the observed structures for more robust node representations (Zheng et al., 2020; Gao et al., 2021). Our work is related to these works in searching for adaptive graph structures that are suitable for GNNs' message passing. Yet, the key difference is that our method targets learning a new graph beyond the scope of the input one, while the above works focus on message passing within the input graph.

Graph Structure Learning

To effectively address the limitations of GNNs' feature propagation within observed structures, many recent works attempt to jointly learn graph structures and the GNN model. For instance, (Franceschi et al., 2019) models each edge as a Bernoulli random variable and optimizes graph structures along with the GCN. To exploit the information in the observed structure for structure learning, (Li et al., 2018) proposes a metric learning approach based on the RBF kernel to compute edge probabilities from node representations, while (Jiang et al., 2019) adopts an attention mechanism to achieve a similar goal. Furthermore, (Chen et al., 2020) considers an iterative method that enables mutual reinforcement between learning graph structures and node embeddings. Also, (Zhang et al., 2019) presents a probabilistic framework that views the input graph as a random sample from a collection modeled by a parametric random graph model. (Lao et al., 2022; Elinas et al., 2020) harness variational inference to estimate a posterior over graph structures and GNN parameters. While learning graph structures often requires $O(N^{2})$ complexity, a recent work (Wu et al., 2022b) proposes an efficient Transformer that achieves latent structure learning in each layer with $O(N)$ complexity. However, though these methods have shown promising results, they assume that training and testing nodes are from the same graph and consider only one graph. By contrast, we consider graph structure learning under the cross-graph setting and propose a general framework to learn a shared structure learner which can generalize to target graphs without any re-training.

Out-of-Distribution Generalization on Graphs.

Due to the demand for handling testing data in the wild, improving the capability of neural networks to perform satisfactorily on out-of-distribution data has received increasing attention (Wu et al., 2022a; Ma et al., 2021; Li et al., 2022; Wu et al., 2023a, 2021b; Yang et al., 2023a). Recent studies, e.g., (Zhu et al., 2021a; Wu et al., 2022a; Sui et al., 2022), explore effective treatments for tackling general distribution shifts on graphs, and there are also works focusing on particular categories of distribution shifts, such as size generalization (Bevilacqua et al., 2021), molecular scaffold generalization (Yang et al., 2022b), feature/attribute shifts (Wu et al., 2021a; Bi et al., 2023), and topological shifts (Yang et al., 2022a). To the best of our knowledge, there are no prior works considering OOD generalization in the context of graph structure learning. In our case, the target graph, where the structure learner is expected to yield adaptive structures, can have a disparate distribution from the source graphs. The distribution shifts could potentially stem from the feature/label space, graph sizes or domains (e.g., from social networks to citation networks). As the first attempt along this path, our work fills this research gap and enables graph structure learning models to deal with new unseen graphs in an open world.

5. Experiments

We apply GraphGLOW to real-world datasets for node classification to test the efficacy of the proposed structure learner in boosting the performance of GNN learning on target graphs with distribution shifts from source graphs. We specify the backbone GNN network of GraphGLOW as a two-layer GCN (Kipf and Welling, 2016). We focus on the following research questions:

• 1) How does GraphGLOW perform compared with directly training GNN models on the input structure of target graphs?

• 2) How does GraphGLOW perform compared to state-of-the-art structure learning models that are directly trained on target datasets, in terms of both accuracy and training time?

• 3) Are the proposed components of GraphGLOW effective and necessary for the achieved performance?

• 4) What is the impact of hyper-parameters on performance, and what is the impact of attacks on observed edges?

• 5) What are the properties of the inferred latent graphs, and what generalizable patterns does the structure learner capture?

5.1. Experimental Protocols

Datasets. Our experiments are conducted on several public graph datasets. First we consider three commonly used citation networks Cora, CiteSeer and PubMed. We use the same splits as in (Yang et al., 2016). These three datasets have high homophily ratios (i.e., adjacent nodes tend to have similar labels) (Lim et al., 2021). Apart from this, we also consider four social networks from Facebook-100 (Traud et al., 2012), which have low homophily ratios. Readers may refer to Appendix B for more dataset information like splitting ratios.

Competitors. We mainly compare with GCN (Kipf and Welling, 2016), the GNN counterpart trained on the input structure, to test the efficacy of the latent graphs produced by GraphGLOW. As further investigation, we also compare with other advanced GNN models: GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), APPNP (Klicpera et al., 2019), $\mathrm{H_{2}GCN}$ (Zhu et al., 2020) and GPRGNN (Chien et al., 2021). Here APPNP, $\mathrm{H_{2}GCN}$ and GPRGNN are all strong GNN models equipped with adaptive feature propagation and high-order aggregation. For these pure GNN models, training and testing are conducted on (the same) target graphs. Furthermore, we compare GraphGLOW with state-of-the-art graph structure learning models: LDS (Franceschi et al., 2019), IDGL (Chen et al., 2020) and VGCN (Elinas et al., 2020). Since these models are all designed for training on one dataset from scratch, we directly train them on the target graph, so they in principle could yield better performance than GraphGLOW.

We also consider variants of GraphGLOW as baselines. We replace the similarity function $s$ with an attention-based structure learner, denoted as $\mathrm{GraphGLOW_{at}}$, which follows the same training scheme as GraphGLOW. Besides, we consider some non-parametric similarity functions, namely dot-product, KNN and cosine distance (denoted as $\mathrm{GraphGLOW_{dp}}$, $\mathrm{GraphGLOW_{knn}}$ and $\mathrm{GraphGLOW_{cos}}$, respectively). For these models, we only need to train the GNN network on target graphs, with the non-parametric structure learners yielding the latent structures. In addition, we introduce a variant GraphGLOW* that shares the same architecture as GraphGLOW and is directly trained on target graphs. GraphGLOW* in principle could also produce superior results to GraphGLOW. We report the test accuracy given by the model that produces the highest validation accuracy within 500 training epochs.

5.2. In-domain Generalization

Table 2. Test accuracy (%) on target graphs for cross-domain generalizations. For each social network (resp. citation network) as target dataset, we consider citation networks (resp. social networks) as source graphs.
Type Method Cornell5 Johns.55 Amherst41 Reed98 Cora CiteSeer PubMed
Pure GNN GCN 68.6 ± 0.5 70.8 ± 1.0 65.8 ± 1.6 60.8 ± 1.6 81.6 ± 0.4 71.6 ± 0.3 78.8 ± 0.6
SAGE 68.7 ± 0.8 67.5 ± 0.9 66.3 ± 1.8 63.9 ± 1.9 81.4 ± 0.6 71.6 ± 0.5 78.6 ± 0.7
GAT 69.6 ± 1.2 69.4 ± 0.7 68.7 ± 2.1 64.5 ± 2.5 83.0 ± 0.7 72.1 ± 1.1 79.0 ± 0.4
GPR 68.8 ± 0.7 69.6 ± 1.3 66.2 ± 1.5 62.7 ± 2.0 83.1 ± 0.7 72.4 ± 0.8 79.6 ± 0.5
APPNP 68.5 ± 0.8 69.1 ± 1.4 65.9 ± 1.3 62.3 ± 1.5 82.7 ± 0.5 71.9 ± 0.5 79.2 ± 0.3
H2GCN 71.4 ± 0.5 68.3 ± 1.0 66.5 ± 2.2 65.4 ± 1.3 82.5 ± 0.8 71.4 ± 0.7 79.4 ± 0.4
CPGNN 71.1 ± 0.5 68.7 ± 1.3 66.7 ± 0.8 63.6 ± 1.8 80.8 ± 0.4 71.6 ± 0.4 78.5 ± 0.7
Graph Structure Learning $\mathrm{GraphGLOW_{dp}}$ 71.5 ± 0.7 71.3 ± 1.2 68.5 ± 1.6 63.2 ± 1.2 83.1 ± 0.8 71.7 ± 1.0 77.3 ± 0.8
$\mathrm{GraphGLOW_{knn}}$ 69.4 ± 0.8 71.0 ± 1.3 64.8 ± 1.2 63.6 ± 1.6 81.7 ± 0.8 71.5 ± 0.8 79.4 ± 0.6
$\mathrm{GraphGLOW_{cos}}$ 69.9 ± 0.7 70.8 ± 1.4 65.2 ± 1.8 62.7 ± 1.3 82.0 ± 0.7 71.9 ± 0.9 78.7 ± 0.8
$\mathrm{GraphGLOW_{at}}$ 69.9 ± 1.0 70.4 ± 1.5 64.4 ± 1.2 65.0 ± 1.7 82.5 ± 0.9 71.8 ± 0.8 78.5 ± 0.7
GraphGLOW 72.0 ± 1.0 71.8 ± 0.7 69.8 ± 1.3 67.3 ± 1.2 83.2 ± 0.4 73.8 ± 0.9 79.6 ± 0.7
GraphGLOW* 71.1 ± 0.3 72.2 ± 0.5 70.3 ± 0.9 66.8 ± 1.4 83.5 ± 0.6 73.9 ± 0.7 79.9 ± 0.5
Refer to caption
(a) Cora
Refer to caption
(b) CiteSeer
Refer to caption
(c) Amherst41
Refer to caption
(d) Johns Hopkins55
Refer to caption
(e) Reed98
Figure 4. Comparison of test accuracy and training time with SOTA structure learning models (LDS (Franceschi et al., 2019), IDGL (Chen et al., 2020) and VGCN (Elinas et al., 2020)). The radius of each circle is proportional to the standard deviation. The experiments are run on one Tesla V4 GPU with 16GB memory. We adopt the same setting as Table 2 and report the results on target datasets. For Cornell5 and PubMed, the competitor models run out of memory.

We first consider transferring within social networks or within citation networks. The results are reported in Table 1, where for each social network (resp. citation network) as the target, we use the other social networks (resp. citation networks) as the source datasets. GraphGLOW performs consistently better than GCN, i.e., the counterpart using the observed graph for message passing, which shows that GraphGLOW can capture generalizable patterns of desirable message-passing structure for unseen datasets that indeed boost the GCN backbone's performance on downstream tasks. In particular, the improvement over GCN is over 5% on Cornell5 and Reed98, two datasets with low homophily ratios (as shown in Table 3). The reason is that for non-homophilous graphs, where message passing may propagate inconsistent signals (as mentioned in Section 1), GNN learning can benefit more from structure learning than on homophilous graphs. Furthermore, compared to other strong GNN models, GraphGLOW still achieves a slight improvement over the best competitors even though the backbone GCN network is less expressive. One could expect further performance gains from GraphGLOW if the GNN backbone were specified as another, more advanced architecture.

In contrast with the non-parametric structure learning models and $\mathrm{GraphGLOW_{at}}$, GraphGLOW outperforms them by a large margin throughout all cases, which verifies the superiority of our design of the multi-head weighted similarity function that can accommodate multi-faceted, diverse structural information. Compared with GraphGLOW*, GraphGLOW performs on par with it and even exceeds it on Cornell5 and Amherst41. The possible reasons are two-fold. First, there exist sufficient shared patterns among citation networks (resp. social networks), which paves the way for successful generalization of GraphGLOW. Second, GraphGLOW* could sometimes overfit specific datasets, since the number of free parameters is regularly orders-of-magnitude larger than the number of labeled nodes in the dataset. The results also imply that our transfer learning approach can help to mitigate over-fitting on one dataset. Moreover, GraphGLOW can generalize the structure learner to unseen graphs that are nearly three times larger than the training graphs, i.e., Cornell5.

5.3. Cross-domain Generalization

We next consider a more difficult task: transferring between social networks and citation networks. The difficulty stems from two aspects: 1) social networks and citation networks are from distinct categories and thus have larger gaps between the underlying data-generating distributions; 2) they have varied homophily ratios, which indicates that the observed edges play different roles in the original graphs. We report the results in Table 2. Despite the task difficulty, GraphGLOW manages to achieve superior results to GCN and also outperforms the other non-parametric graph structure learning methods throughout all cases. This suggests GraphGLOW's ability to handle target graphs with distinct properties.

In Fig. 4 we further compare GraphGLOW with three state-of-the-art graph structure learning models that are directly trained on target graphs. Here we follow the setting in Table 2. The results show that even when trained on source graphs different from the target one, GraphGLOW still performs on par with the competitors that are trained and tested on (the same) target graphs. Notably, GraphGLOW significantly reduces training time. For instance, on Johns Hopkins55, GraphGLOW is 6x, 9x and 40x faster than IDGL, LDS and VGCN, respectively. This shows a clear advantage of GraphGLOW in terms of training efficiency and also verifies that our model indeed helps to reduce the significant training-time cost of structure learning on target graphs.

5.4. Ablation Studies

We conduct ablation studies to test the effectiveness of iterative learning scheme and regularization on graphs.

Effect of Iterative Learning. We replace the iterative learning process with a one-step prediction (i.e., one structure estimation and one node-representation update per feed-forward computation) and compare its test accuracy with GraphGLOW. The results are shown in Fig. 5, where we follow the setting of Table 1. The non-iterative version exhibits a considerable drop in accuracy (as large as 5.4% and 8.8% when tested on target graphs Cornell5 and Amherst41, respectively). Therefore, the iterative updates indeed help to learn better graph structures and node embeddings, contributing to higher accuracy for downstream prediction.

Effect of Regularization on Structures. We remove the regularization on structures (i.e., setting $\alpha=\rho=0$) and compare with GraphGLOW. As shown in Fig. 5, there is more or less performance degradation across datasets. In fact, the regularization loss derived from the prior distribution over latent structures can provide useful guidance for structure learning, especially when labeled information is limited.

Refer to caption
Refer to caption
Figure 5. (a) Ablation study for GraphGLOW. (b) Performance comparison of GraphGLOW and GCN w.r.t. randomly removing certain ratios of edges in Johns Hopkins55.
Refer to caption
Refer to caption
Figure 6. (a) The curves of homophily ratios for latent structures during the learning process. (b) The variance of neighborhood distribution of nodes with the same label in original graphs and learnt structure.

5.5. Hyper-parameter Sensitivity

In Fig. 7 (in the appendix), we study the variation of the model's performance w.r.t. $\lambda$ (the weight on input graphs) and $P$ (the number of pivots) on target datasets Cora and CiteSeer. Overall, the model is not sensitive to $\lambda$. For Cora, larger $\lambda$ contributes to higher accuracy, while for CiteSeer, smaller $\lambda$ yields better performance. The possible reason is that the initial graph of Cora is more suitable for message passing (due to its higher homophily ratio). As for the impact of the pivot number, as shown in Fig. 7(b), a moderate value of $P$ provides decent downstream performance.

5.6. Robustness Analysis

In addition, we find that GraphGLOW is more immune to edge deletion attack than GCN. We randomly remove 10-50% edges of target graphs respectively, and then apply GraphGLOW and GCN. We present the results in Johns Hopkins55 in Fig. 5 and leave more results in Appendix D. When the drop ratio increases, the performance gap between two models becomes more significant. This is due to our structure learner’s ability for learning new graph structures from node embeddings, making it less reliant on initial graph structures and more robust to attack on input edges.

5.7. Case Study

We further probe into why our approach is effective for node classification by dissecting the learnt graph structures. Specifically, we measure the homophily ratios of the learnt structures and the variance of neighborhood distributions of nodes with the same label. As nodes receive messages from neighbors during message passing, the more similar the neighborhood patterns of nodes within one class are, the easier it is for GNNs to correctly classify them (Ma et al., 2022). We use the homophily metric proposed in (Lim et al., 2021) to measure homophily ratios. For the variance of neighborhood distributions, we first calculate the variance for each class, and then take a weighted sum to obtain the final variance, where the weight is proportional to the number of nodes within the corresponding class.
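For illustration, the sketch below computes a simple edge-homophily ratio and the weighted intra-class variance of neighborhood label distributions on a dense toy graph; it is our own simplified version of these diagnostics, not the exact metric of (Lim et al., 2021).

```python
# Sketch of the two diagnostics: edge homophily and the weighted variance of
# class-wise neighborhood label distributions.
import torch

def edge_homophily(adj, labels):
    """Fraction of edges whose endpoints share a label."""
    src, dst = adj.nonzero(as_tuple=True)
    return (labels[src] == labels[dst]).float().mean().item()

def neighborhood_variance(adj, labels, num_classes):
    """Weighted variance of per-node neighbor-label distributions within classes."""
    onehot = torch.nn.functional.one_hot(labels, num_classes).float()
    neigh = adj.float() @ onehot
    neigh = neigh / neigh.sum(-1, keepdim=True).clamp(min=1e-8)  # per-node distribution
    var, n = 0.0, labels.shape[0]
    for c in range(num_classes):
        mask = labels == c
        if mask.sum() > 1:
            var += mask.sum().item() / n * neigh[mask].var(dim=0).sum().item()
    return var

adj = torch.bernoulli(torch.full((200, 200), 0.05))
labels = torch.randint(0, 4, (200,))
print(edge_homophily(adj, labels), neighborhood_variance(adj, labels, 4))
```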

Homophily Ratio. We choose Amherst41, Johns Hopkins55 and Reed98 as target graphs, and record the homophily ratios of the inferred latent structures every five epochs during training. As shown in Fig. 6, the homophily ratios of the inferred latent graphs exhibit a clear increase as training proceeds, and the final ratio is considerably larger than that of the input graph. The results indicate that the trained structure learner inclines to output more homophilous latent structures, which are reckoned to be more suitable for message passing.

Neighborhood Distribution Variance. As shown in Fig. 6, the variance of the neighborhood distributions of nodes with the same label is significantly smaller in our learnt structures, making it easier to classify nodes through message passing. The results also imply that a high homophily ratio and similar intra-class neighborhood patterns could be two of the underlying transferable patterns of optimal message-passing structure identified by GraphGLOW.

6. Conclusion

This paper proposes Graph Structure Learning Under Cross-Graph Distribution Shift, a new problem that requires the structure learner to transfer to new target graphs without re-training and to handle distribution shifts. We develop a transfer learning framework that guides the structure learner to discover shared knowledge across source datasets with respect to optimal message-passing structures for boosting downstream performance. We also carefully design the model components and the training approach in terms of expressiveness, scalability and stability. We devise experiments of various difficulties and demonstrate the efficacy and robustness of our approach. Although our framework is fairly general, we believe there are other potential methods that can lead to equally competitive results, which we leave as future work.

Acknowledgements.
The work was supported in part by National Key Research and Development Program of China (2020AAA0107600), NSFC (62222607), Science and Technology Commission of Shanghai Municipality (22511105100), and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

References

  • Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing. In ICML.
  • Belkin and Niyogi (2001) Mikhail Belkin and Partha Niyogi. 2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In NeurIPS.
  • Bevilacqua et al. (2021) Beatrice Bevilacqua, Yangze Zhou, and Bruno Ribeiro. 2021. Size-invariant graph representations for graph classification extrapolations. In ICML.
  • Bi et al. (2023) Wendong Bi, Bingbing Xu, Xiaoqian Sun, Li Xu, Huawei Shen, and Xueqi Cheng. 2023. Predicting the Silent Majority on Graphs: Knowledge Transferable Graph Neural Network. In The Web Conference.
  • Bronstein et al. (2016) Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2016. Geometric deep learning: going beyond Euclidean data. arXiv preprint abs/1611.08097 (2016).
  • Chen et al. (2020) Yu Chen, Lingfei Wu, and Mohammed Zaki. 2020. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. In NeurIPS.
  • Chien et al. (2021) Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. 2021. Adaptive universal generalized pagerank graph neural network. In ICLR.
  • Elinas et al. (2020) Pantelis Elinas, Edwin V Bonilla, and Louis Tiao. 2020. Variational inference for graph convolutional networks in the absence of graph data and adversarial settings. In NeurIPS.
  • Finn et al. (2018) Chelsea Finn, Kelvin Xu, and Sergey Levine. 2018. Probabilistic Model-Agnostic Meta-Learning. In NeurIPS.
  • Franceschi et al. (2019) Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. 2019. Learning discrete structures for graph neural networks. In ICML.
  • Gao et al. (2021) Zhan Gao, Subhrajit Bhattacharya, Leiming Zhang, Rick S. Blum, Alejandro Ribeiro, and Brian M. Sadler. 2021. Training Robust Graph Neural Networks with Topology Adaptive Edge Dropping. arXiv preprint arXiv:2106.02892 (2021).
  • Geng et al. (2023) Haoyu Geng, Chao Chen, Yixuan He, Gang Zeng, Zhaobing Han, Hua Chai, and Junchi Yan. 2023. Pyramid Graph Neural Network: a Graph Sampling and Filtering Approach for Multi-scale Disentangled Representations. In SIGKDD.
  • Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In ICLR.
  • Jiang et al. (2019) Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, and Bin Luo. 2019. Semi-supervised learning with graph learning-convolutional networks. In CVPR.
  • Jiang and Luo (2022) Weiwei Jiang and Jiayun Luo. 2022. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications (2022), 117921.
  • Kalofolias (2016) Vassilis Kalofolias. 2016. How to Learn a Graph from Smooth Signals. In AISTATS.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. In ICLR.
  • Klicpera et al. (2019) Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. 2019. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In ICLR.
  • Lao et al. (2022) Danning Lao, Xinyu Yang, Qitian Wu, and Junchi Yan. 2022. Variational Inference for Training Graph Neural Networks in Low-Data Regime through Joint Structure-Label Estimation. In SIGKDD.
  • Li et al. (2018) Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2018. Adaptive graph convolutional neural networks. In AAAI.
  • Li et al. (2021) Sean Li, Dongwoo Kim, and Qing Wang. 2021. Beyond Low-Pass Filters: Adaptive Feature Propagation on Graphs. arXiv preprint arXiv:2103.14187 (2021).
  • Li et al. (2022) Zenan Li, Qitian Wu, Fan Nie, and Junchi Yan. 2022. GraphDE: A Generative Framework for Debiased Learning and Out-of-Distribution Detection on Graphs. In NeurIPS.
  • Lim et al. (2021) Derek Lim, Xiuyu Li, Felix Hohne, and Ser-Nam Lim. 2021. New Benchmarks for Learning on Non-Homophilous Graphs. arXiv preprint arXiv:2104.01404 (2021).
  • Ma et al. (2021) Jiaqi Ma, Junwei Deng, and Qiaozhu Mei. 2021. Subgroup generalization and fairness of graph neural networks. In NeurIPS.
  • Ma et al. (2022) Yao Ma, Xiaorui Liu, Neil Shah, and Jiliang Tang. 2022. Is Homophily a Necessity for Graph Neural Networks?. In ICLR.
  • Muandet et al. (2013) Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. 2013. Domain Generalization via Invariant Feature Representation. In ICML.
  • Nguyen and Bai (2010) Hieu V Nguyen and Li Bai. 2010. Cosine similarity metric learning for face verification. In ACCV.
  • Sanchez-Gonzalez et al. (2020) Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. 2020. Learning to simulate complex physics with graph networks. In ICML.
  • Sui et al. (2022) Yongduo Sui, Xiang Wang, Jiancan Wu, Min Lin, Xiangnan He, and Tat-Seng Chua. 2022. Causal attention for interpretable and generalizable graph classification. In SIGKDD.
  • Sutton et al. (1999) Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NeurIPS.
  • Traud et al. (2012) Amanda L Traud, Peter J Mucha, and Mason A Porter. 2012. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391, 16 (2012), 4165–4180.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.
  • Williams (1992) Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8 (1992), 229–256.
  • Wu et al. (2023a) Qitian Wu, Yiting Chen, Chenxiao Yang, and Junchi Yan. 2023a. Energy-based Out-of-Distribution Detection for Graph Neural Networks. In ICLR.
  • Wu et al. (2021a) Qitian Wu, Chenxiao Yang, and Junchi Yan. 2021a. Towards open-world feature extrapolation: An inductive graph learning approach. In NeurIPS.
  • Wu et al. (2023b) Qitian Wu, Chenxiao Yang, Wentao Zhao, Yixuan He, David Wipf, and Junchi Yan. 2023b. DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion. In ICLR.
  • Wu et al. (2021b) Qitian Wu, Hengrui Zhang, Xiaofeng Gao, Junchi Yan, and Hongyuan Zha. 2021b. Towards open-world recommendation: An inductive model-based collaborative filtering approach. In ICML.
  • Wu et al. (2022a) Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. 2022a. Handling distribution shifts on graphs: An invariance perspective. In ICLR.
  • Wu et al. (2022b) Qitian Wu, Wentao Zhao, Zenan Li, David Wipf, and Junchi Yan. 2022b. NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification. In NeurIPS.
  • Wu et al. (2020) Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. 2020. Graph Information Bottleneck. In NeurIPS.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML.
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. In ICML.
  • Yang et al. (2023a) Chenxiao Yang, Qitian Wu, Jiahua Wang, and Junchi Yan. 2023a. Graph Neural Networks are Inherently Good Generalizers: Insights by Bridging GNNs and MLPs. In ICLR.
  • Yang et al. (2022a) Chenxiao Yang, Qitian Wu, and Junchi Yan. 2022a. Geometric Knowledge Distillation: Topology Compression for Graph Neural Networks. In NeurIPS.
  • Yang et al. (2022b) Nianzu Yang, Kaipeng Zeng, Qitian Wu, Xiaosong Jia, and Junchi Yan. 2022b. Learning substructure invariance for out-of-distribution molecular representations. In NeurIPS.
  • Yang et al. (2023b) Nianzu Yang, Kaipeng Zeng, Qitian Wu, and Junchi Yan. 2023b. MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning. In The Web Conference.
  • Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In ICML.
  • Zhang et al. (2019) Yingxue Zhang, Soumyasundar Pal, Mark Coates, and Deniz Ustebay. 2019. Bayesian graph convolutional neural networks for semi-supervised classification. In AAAI.
  • Zheng et al. (2020) Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, and Wei Wang. 2020. Robust graph representation learning via neural sparsification. In ICML.
  • Zhu et al. (2021b) Jiong Zhu, Ryan A Rossi, Anup Rao, Tung Mai, Nedim Lipka, Nesreen K Ahmed, and Danai Koutra. 2021b. Graph neural networks with heterophily. In AAAI.
  • Zhu et al. (2020) Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020. Beyond homophily in graph neural networks: Current limitations and effective designs. In NeurIPS.
  • Zhu et al. (2021a) Qi Zhu, Natalia Ponomareva, Jiawei Han, and Bryan Perozzi. 2021a. Shift-robust gnns: Overcoming the limitations of localized graph training data. In NeurIPS.
  • Zhu et al. (2021c) Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. 2021c. Deep Graph Structure Learning for Robust Representations: A Survey. arXiv preprint abs/2103.03036 (2021).
Input: node features $\mathbf{X}$, input adjacency $\mathbf{A}$, latent-graph node-pivot similarity matrix $\mathbf{\Gamma}=\{\alpha_{up}\}_{N\times P}$.
$\mathbf{Z}^{(0)}\leftarrow\mathbf{X}$;
for $l=0,1,\cdots,L-1$ do
       $\mathbf{Z}^{(l+1)}\leftarrow\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\mathbf{Z}^{(l)}\mathbf{W}^{(l)}$;
       $\mathbf{C}^{(l+\frac{1}{2})}=\mathrm{RowNorm}(\mathbf{\Gamma}^{\top})\mathbf{Z}^{(l)}$;
       $\mathbf{C}^{(l+1)}=\mathrm{RowNorm}(\mathbf{\Gamma})\mathbf{C}^{(l+\frac{1}{2})}$;
       $\mathbf{Z}^{(l+1)}\leftarrow\sigma\left(\lambda\mathbf{Z}^{(l+1)}+(1-\lambda)\mathbf{C}^{(l+1)}\mathbf{W}^{(l)}\right)$;
end for
Output: node representations $\mathbf{Z}=\mathbf{Z}^{(L-1)}$, node-level prediction $\hat{\mathbf{Y}}=\mathbf{Z}^{(L)}$.
Algorithm 1 Message Passing over Latent Graphs $[\mathbf{Z},\hat{\mathbf{Y}}]=\mathrm{GNN}(\mathbf{A},\mathbf{X},\mathbf{\Gamma};w)$
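For readers who prefer code, a dense-matrix PyTorch sketch of the message passing in Algorithm 1 is given below; the variable names, the ReLU nonlinearity, and the precomputed dense normalized adjacency A_norm (i.e., $\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}$) are our own illustrative choices, and a practical implementation may use sparse operations.

import torch
import torch.nn.functional as F

def row_norm(M, eps=1e-8):
    return M / (M.sum(dim=1, keepdim=True) + eps)

def message_passing_latent(A_norm, X, Gamma, weights, lam=0.5):
    # A_norm: (N, N) normalized input adjacency; X: (N, d0) features;
    # Gamma: (N, P) node-pivot similarities; weights: list of L layer weight matrices.
    Z = X
    for W in weights:
        Z_in = A_norm @ Z @ W                          # propagation over the input graph
        C_half = row_norm(Gamma.t()) @ Z               # node -> pivot aggregation
        C = row_norm(Gamma) @ C_half                   # pivot -> node aggregation (latent graph)
        Z = F.relu(lam * Z_in + (1.0 - lam) * C @ W)   # fuse the two views
    return Z  # last-layer output serves as the prediction (Y_hat = Z^(L) in Algorithm 1)

# Example shapes: X (N, d0), Gamma (N, P),
# weights = [torch.randn(d0, h), torch.randn(h, num_classes)].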
Input: observed source graphs $\{\mathcal{G}_{m}\}_{m=1}^{M}=\{(\mathbf{A}_{m},\mathbf{X}_{m},\mathbf{Y}_{m})\}_{m=1}^{M}$, maximum training episode $E$, maximum iteration $T$, a shared graph structure learner $g_{\theta}$, GNN networks $\{h_{w_{m}}\}_{m=1}^{M}$, one for each source graph.
Initialize $\theta$ and $\{w_{m}\}_{m=1}^{M}$;
for episode $=1,2,\dots,E$ do
       for each source graph $\mathcal{G}_{m}$ do
             for $e=1:T$ do
                   $\Delta_{\theta}\leftarrow 0$, $\Delta_{w}\leftarrow 0$, $t\leftarrow 0$;
                   $\mathbf{Z}^{0}=\mathrm{MLP}(\mathbf{X};w_{m})$ or $\mathbf{Z}^{0}=\mathrm{GCN}(\mathbf{A},\mathbf{X};w_{m})$;
                   while not converged do
                         $t\leftarrow t+1$;
                         Compute $\mathbf{\Gamma}^{t}=\{\alpha_{up}\}$ using (9) with $\mathbf{Z}^{t-1}$;
                         Sample $K$ times over $\mathbf{\Gamma}^{t}$ to obtain $\{\hat{\mathbf{E}}_{k}^{t}\}_{k=1}^{K}$;
                         Compute $\{\pi_{\theta}(\hat{\mathbf{E}}_{k}^{t})\}_{k=1}^{K}$ using (16);
                         $[\mathbf{Z}^{t},\hat{\mathbf{Y}}^{t}]=\mathrm{GNN}(\mathbf{A},\mathbf{X},\mathbf{\Gamma}^{t};w_{m})$;
                         Compute $\nabla_{\theta}\mathcal{L}_{s},\nabla_{\theta}\mathcal{L}_{r},\nabla_{\theta}\mathcal{L}_{e}$ using (15), (17), (18), respectively;
                         $\Delta_{\theta}^{(t)}\leftarrow\nabla_{\theta}\mathcal{L}_{s}+\nabla_{\theta}\mathcal{L}_{r}+\nabla_{\theta}\mathcal{L}_{e}$;
                         $\Delta_{w}^{(t)}\leftarrow\nabla_{w_{m}}\mathcal{L}_{s}$;
                   end while
                   $\Delta_{\theta}\leftarrow\Delta_{\theta}^{(1)}+\sum_{i=2}^{t}\Delta_{\theta}^{(i)}/(t-1)$;
                   $\Delta_{w}\leftarrow\Delta_{w}^{(1)}+\sum_{i=2}^{t}\Delta_{w}^{(i)}/(t-1)$;
                   Use gradient $\Delta_{\theta}$ to update $\theta$;
                   Use gradient $\Delta_{w}$ to update $w_{m}$;
             end for
       end for
end for
Output: trained graph structure learner $g_{\theta^{*}}$.
Algorithm 2 Training Graph Structure Learner
Input: observed target graph $\mathcal{G}=(\mathbf{A},\mathbf{X},\mathbf{Y})$, GNN network $h_{w}$ for the target graph, maximum iteration $T$.
Initialize $w$;
for $e=1:T$ do
       $t\leftarrow 0$, $\Delta_{w}\leftarrow 0$;
       $\mathbf{Z}^{0}=\mathrm{MLP}(\mathbf{X};w)$ or $\mathbf{Z}^{0}=\mathrm{GCN}(\mathbf{A},\mathbf{X};w)$;
       while not converged do
             $t\leftarrow t+1$;
             Compute $\mathbf{\Gamma}^{t}=\{\alpha_{up}\}$ using (9) with $\mathbf{Z}^{t-1}$;
             Sample $K$ times over $\mathbf{\Gamma}^{t}$ to obtain $\{\hat{\mathbf{E}}_{k}^{t}\}_{k=1}^{K}$;
             Compute $\{\pi_{\theta}(\hat{\mathbf{E}}_{k}^{t})\}_{k=1}^{K}$ using (16);
             $[\mathbf{Z},\hat{\mathbf{Y}}]=\mathrm{GNN}(\mathbf{A},\mathbf{X},\mathbf{\Gamma}^{t};w)$;
             Compute $\nabla_{w}\mathcal{L}_{s}$ using (15);
             $\Delta_{w}^{(t)}\leftarrow\nabla_{w}\mathcal{L}_{s}$;
       end while
       $\Delta_{w}\leftarrow\Delta_{w}^{(1)}+\sum_{i=2}^{t}\Delta_{w}^{(i)}/(t-1)$;
       Use gradient $\Delta_{w}$ to update $w$;
end for
Output: trained GNN network $h_{w^{*}}$ for the target graph.
Algorithm 3 Supervised Learning for the Target GNN

Appendix A Derivations for NWGM

First, when taking the gradient, we have $\nabla_{\theta}\mathbb{E}_{q_{\theta}}[\log p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})]\approx c\,\nabla_{\theta}\log\mathbb{E}_{q_{\theta}}[p_{w_{m}}(\mathbf{Y}|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})]$ with basic applications of the chain rule. We then adopt the Normalized Weighted Geometric Mean (NWGM) (Xu et al., 2015):

(19)
\begin{split}
&\nabla_{\theta}\log\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[p_{w_{m}}(\mathbf{Y}_{u,c}=1|\mathbf{A},\mathbf{X},\hat{\mathbf{A}})]\\
=&\nabla_{\theta}\log\sum_{\hat{\mathbf{A}}}\frac{\exp(s_{1}(\hat{\mathbf{A}}))}{\exp(s_{1}(\hat{\mathbf{A}}))+\exp(s_{2}(\hat{\mathbf{A}}))}\,q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})\\
=&\nabla_{\theta}\log\sum_{\hat{\mathbf{A}}}\mathrm{Softmax}(s_{1}(\hat{\mathbf{A}}))\,q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})\\
\approx&\nabla_{\theta}\log\mathrm{NWGM}(\mathrm{Softmax}(s_{1}(\hat{\mathbf{A}})))\\
=&\nabla_{\theta}\log\frac{\prod_{\hat{\mathbf{A}}}[\exp(s_{1}(\hat{\mathbf{A}}))]^{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}}{\prod_{\hat{\mathbf{A}}}[\exp(s_{1}(\hat{\mathbf{A}}))]^{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}+\prod_{\hat{\mathbf{A}}}[\exp(s_{2}(\hat{\mathbf{A}}))]^{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}}\\
=&\nabla_{\theta}\log\frac{\exp(\sum_{\hat{\mathbf{A}}}s_{1}(\hat{\mathbf{A}})q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X}))}{\exp(\sum_{\hat{\mathbf{A}}}s_{1}(\hat{\mathbf{A}})q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X}))+\exp(\sum_{\hat{\mathbf{A}}}s_{2}(\hat{\mathbf{A}})q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X}))}\\
=&\nabla_{\theta}\log\frac{\exp(\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[s_{1}(\hat{\mathbf{A}})])}{\exp(\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[s_{1}(\hat{\mathbf{A}})])+\exp(\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[s_{2}(\hat{\mathbf{A}})])}\\
=&\nabla_{\theta}\log p_{w_{m}}(\mathbf{Y}_{u,c}=1|\mathbf{A},\mathbf{X},\hat{\mathbf{A}}=\mathbb{E}_{q_{\theta}(\hat{\mathbf{A}}|\mathbf{A},\mathbf{X})}[\hat{\mathbf{A}}]),
\end{split}

where $s_{1}$ denotes the positive predicted score for class $c$, which is indeed associated with node $u$, and $s_{2}(\hat{\mathbf{A}})=0$ in our case. We thus conclude the proof for (15).
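As a quick numerical illustration of the approximation used above (replacing the expectation of a softmax with the softmax of the expected score, as NWGM suggests), the following self-contained snippet compares the two quantities under an arbitrary score distribution; it is not part of the proof, and the Gaussian score distribution is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# s1 is the positive score for the correct class under sampled structures A_hat;
# s2 is fixed to 0, so Softmax(s1) reduces to a sigmoid.
s1_samples = rng.normal(loc=1.5, scale=0.5, size=10000)

expectation_of_softmax = sigmoid(s1_samples).mean()   # E_q[Softmax(s1(A_hat))]
softmax_of_expectation = sigmoid(s1_samples.mean())   # Softmax(E_q[s1(A_hat)]), the NWGM surrogate

print(expectation_of_softmax, softmax_of_expectation)  # close when the score variance is small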

Table 3. Statistics of the experimental datasets. The column Homo. reports the homophily ratio, measured by the metric proposed in (Lim et al., 2021).
Dataset | Type | # Nodes | # Edges | Homo.
Cora | citation | 2,708 | 5,429 | 0.77
CiteSeer | citation | 3,327 | 4,732 | 0.63
PubMed | citation | 19,717 | 44,338 | 0.66
Amherst41 | social | 2,235 | 90,964 | 0.06
Cornell5 | social | 18,660 | 790,777 | 0.09
Johns Hopkins55 | social | 5,180 | 186,586 | 0.10
Reed98 | social | 962 | 18,812 | 0.04

Appendix B Datasets and Experimental Details

The statistics of the datasets are displayed in Table 3. For the splits of Cora, CiteSeer and PubMed, we follow (Yang et al., 2016) and randomly select 20 instances per class for training and 500/1000 instances for validation/testing in each dataset. For the remaining datasets, we employ random train/valid/test splits of 50%/25%/25%.
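For concreteness, the two splitting schemes could be implemented roughly as follows; the function names and the fixed seed are our own choices.

import numpy as np

def per_class_split(labels, train_per_class=20, num_val=500, num_test=1000, seed=0):
    # Planetoid-style split (Yang et al., 2016) for Cora / CiteSeer / PubMed.
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        train_idx.extend(rng.choice(members, train_per_class, replace=False))
    rest = rng.permutation(np.setdiff1d(np.arange(len(labels)), train_idx))
    return np.array(train_idx), rest[:num_val], rest[num_val:num_val + num_test]

def random_split(num_nodes, seed=0):
    # 50% / 25% / 25% split for the remaining (social-network) datasets.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    n_train, n_val = int(0.5 * num_nodes), int(0.25 * num_nodes)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]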

The backbone GNN network is specified as a two-layer GCN model. We set the similarity function $s$ in (9) to cosine similarity and $\delta$ to a threshold-based truncation. Besides, since the dimensions of input node features differ across datasets, we adopt a transformation network that converts input features into $d$-dimensional node representations before the structure learning module, as shown in Alg. 2 ($\mathbf{Z}^{0}=\mathrm{MLP}(\mathbf{X};w_{m})$ or $\mathbf{Z}^{0}=\mathrm{GCN}(\mathbf{A},\mathbf{X};w_{m})$). The transformation can be specified as a one-layer MLP or a one-layer GCN network (what we adopt). Most of the experiments were conducted on an NVIDIA GeForce RTX 2080 Ti with 11GB memory. For experiments involving the two larger datasets, PubMed and Cornell5, we utilized an NVIDIA GeForce RTX 3090 with 24GB memory.
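As an illustration of the similarity module described above, a minimal multi-head cosine-similarity sketch with threshold-based truncation is given below; the per-head linear projections and the variable names are assumptions for illustration rather than the exact parameterization of (9), and the default threshold is one of the values in the search space of Appendix C.

import torch
import torch.nn.functional as F

def node_pivot_similarity(Z, pivots, head_weights, threshold=8.5e-5):
    # Z: (N, d) node embeddings; pivots: (P, d) pivot embeddings;
    # head_weights: iterable of H (d, d) projection matrices (assumed parameterization).
    sims = []
    for W in head_weights:
        a = F.normalize(Z @ W, dim=-1)        # cosine similarity = dot product of
        b = F.normalize(pivots @ W, dim=-1)   # L2-normalized projected embeddings
        sims.append(a @ b.t())                # (N, P) per-head similarities
    Gamma = torch.stack(sims).mean(0)         # average over H heads
    # delta: threshold-based truncation of small similarities.
    return torch.where(Gamma > threshold, Gamma, torch.zeros_like(Gamma))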

Appendix C Hyperparameters

We use grid search on the validation set to tune the hyperparameters (a generic sketch of the search loop is given after the lists below). The learning rate is searched in $\{0.001, 0.005, 0.01, 0.05\}$; the dropout rate is searched in $\{0, 0.2, 0.3, 0.5, 0.6\}$; the number of hidden channels is searched in $\{16, 32, 64, 96\}$. Other hyperparameters for specific models are stated below.

For GCN, GraphSAGE and H2GCN, we use 2 layers. For GAT, we search the number of attention heads in $\{2, 4\}$ and use 2 layers. For APPNP and GPR, we search $\alpha$ in $\{0.1, 0.2, 0.5\}$ and set $K$ to 10. We list the search spaces of the structure learning methods below.

• GraphGLOW and its variants: pivot number $P\in$ {800, 1000, 1200, 1400}, embedding size $d\in$ {16, 32, 64, 96}, $\lambda\in$ [0.1, 0.9], $\alpha\in$ {0, 0.1, 0.15, 0.2, 0.25, 0.3}, $\rho\in$ {0, 0.1, 0.15, 0.2, 0.25, 0.3}, threshold $\in$ {4e-5, 8.5e-5}, $H\in$ {4, 6}, $T=10$, $E\in$ {1, 2, 3}.

• LDS: the sampling time $S=16$, the patience window size $\rho\in$ {10, 20}, the hidden size $\in$ {8, 16, 32, 64}, the inner learning rate $\gamma\in$ {1e-4, 1e-3, 1e-2, 1e-1}, and the number of updates used to compute the truncated hypergradient $\tau\in$ {5, 10, 15}.

• IDGL: $\epsilon=0.01$, hidden size $\in$ {16, 64, 96, 128}, $\lambda\in$ {0.5, 0.6, 0.7, 0.8}, $\eta\in$ {0, 0.1, 0.2}, $\alpha\in$ {0, 0.1, 0.2}, $\beta\in$ {0, 0.1}, $\gamma\in$ {0.1, 0.2}, $m\in$ {6, 9, 12}.

• VGCN: $\bar{\rho}_{1}\in$ {0.25, 0.5, 0.75, 0.99}, $\bar{\rho}_{0}=10^{-5}$, $\tau_{0}\in$ {0.1, 0.5}, $\tau\in$ {0.1, 0.5}, $\beta\in$ {$10^{-4}$, $10^{-3}$, $10^{-2}$, 1}. The sampling time is 3. The maximum number of training epochs is 5000.
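The grid search itself is a standard loop over the Cartesian product of the search spaces; a generic sketch is shown below, where train_eval_fn is a hypothetical callback that trains a model under one configuration and returns its validation accuracy.

import itertools

def grid_search(train_eval_fn, search_space):
    # search_space: dict mapping hyperparameter names to lists of candidate values.
    best_cfg, best_acc = None, -1.0
    keys = list(search_space)
    for values in itertools.product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        acc = train_eval_fn(cfg)          # train with cfg, return validation accuracy
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Example (shared hyperparameters):
# grid_search(train_eval_fn, {"lr": [0.001, 0.005, 0.01, 0.05],
#                             "dropout": [0, 0.2, 0.3, 0.5, 0.6],
#                             "hidden": [16, 32, 64, 96]})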

Figure 7. Hyper-parameter sensitivity analysis on (a) the weight of input graphs $\lambda$ and (b) the pivot number $P$.
Figure 8. Performance comparison of GraphGLOW and GCN w.r.t. randomly removing certain ratios of edges in input graphs: (a) CiteSeer, (b) Johns Hopkins55.

Appendix D More Experimental Results

Table 4. Comparison between a one-layer GCN and an MLP as the transformation network before the structure learning module.
Dataset | GraphGLOW-GCN | GraphGLOW-MLP
Cora | 83.2 ± 0.4 | 82.5 ± 0.5
CiteSeer | 73.8 ± 0.9 | 73.3 ± 0.7
PubMed | 79.6 ± 0.7 | 79.6 ± 0.7
Amherst41 | 68.1 ± 1.3 | 68.3 ± 1.5
Johns Hopkins55 | 71.8 ± 0.7 | 72.1 ± 0.9
Cornell5 | 70.5 ± 1.0 | 69.5 ± 1.0
Reed98 | 67.3 ± 1.2 | 65.9 ± 1.4

We compare using an MLP and a GCN, respectively, as the transformation network before the structure learning module and report the results in Table 4. In summary, the two variants are equally competent, which suggests that GraphGLOW is not sensitive to the transformation network used for converting node features of varying dimensions into embeddings with a shared dimension. This also implies that simple neural architectures, e.g., an MLP or a GCN, provide enough capacity for extracting the information in the input observations, which is leveraged by the shared graph structure learner to discover generalizable patterns in optimal message-passing structures.

We also provide more results of the edge deletion experiments in Fig. 8. We randomly remove 10-50% of the edges of the target graphs, respectively, and then apply GraphGLOW and GCN. The results demonstrate that GraphGLOW is more robust to edge deletion. This is due to our structure learner's ability to learn new structures, which makes it less reliant on the initial graph structure and more robust to attacks on input edges.