
Transfer Learning of Graph Neural Networks with Ego-graph Information Maximization

Qi Zhu1∗, Carl Yang2∗, Yidan Xu3, Haonan Wang1, Chao Zhang4, Jiawei Han1
1University of Illinois Urbana-Champaign, 2Emory University,
3University of Washington, 4Georgia Institute of Technology
1{qiz3,haonan3,hanj}@illinois.edu, 2[email protected],
3[email protected], 4[email protected]
∗These two authors contributed equally.
Abstract

Graph neural networks (GNNs) have achieved superior performance in various applications, but training dedicated GNNs can be costly for large-scale graphs. Some recent works have started to study the pre-training of GNNs. However, none of them provide theoretical insights into the design of their frameworks, or clear requirements and guarantees towards their transferability. In this work, we establish a theoretically grounded and practically useful framework for the transfer learning of GNNs. Firstly, we propose a novel view of the essential graph information and advocate capturing it as the goal of transferable GNN training, which motivates the design of Egi (Ego-Graph Information maximization) to analytically achieve this goal. Secondly, when node features are structure-relevant, we conduct an analysis of Egi transferability regarding the difference between the local graph Laplacians of the source and target graphs. We conduct controlled synthetic experiments to directly justify our theoretical conclusions. Comprehensive experiments on two real-world network datasets show consistent results in the analyzed setting of direct-transferring, while those on large-scale knowledge graphs show promising results in the more practical setting of transferring with fine-tuning. Code and processed data are available at https://github.com/GentleZhu/EGI.

1 Introduction

Graph neural networks (GNNs) have been intensively studied recently [29, 26, 39, 68], due to their established performance on various real-world tasks [15, 69, 53], as well as close connections to spectral graph theory [12, 9, 16]. While most GNN architectures are not very complicated, the training of GNNs can still be costly regarding both memory and computation resources on real-world large-scale graphs [10, 63]. Moreover, it is intriguing to transfer learned structural information across different graphs and even domains in settings like few-shot learning [56, 44, 25]. Therefore, several very recent studies have been conducted on the transferability of GNNs [21, 23, 22, 71, 31, 3, 47]. However, it is unclear in what situations the models will excel or fail, especially when the pre-training and fine-tuning tasks are different. To provide rigorous analysis of and guarantees on the transferability of GNNs, we focus on the setting of direct-transferring between the source and target graphs, under an analogous setting of “domain adaptation” [7, 59, 71].

In this work, we establish a theoretically grounded framework for the transfer learning of GNNs, and leverage it to design a practically transferable GNN model. Figure 1 gives an overview of our framework. It is based on a novel view of a graph as samples from the joint distribution of its k-hop ego-graph structures and node features, which allows us to define graph information and similarity, so as to analyze GNN transferability (§3). This view motivates us to design Egi, a novel GNN training objective based on ego-graph information maximization, which is effective in capturing the graph information as we define it (§3.1). We then further specify the requirement on transferable node features and analyze the transferability of Egi, which depends on the local graph Laplacians of the source and target graphs (§3.2).

All of our theoretical conclusions have been directly validated through controlled synthetic experiments (Table 1), where we use structurally equivalent role identification in a direct-transferring setting to analyze the impacts of different model designs, node features and source-target structure similarities on GNN transferability. In §4, we conduct real-world experiments on multiple publicly available network datasets. On the Airport and Gene graphs (§4.1), we closely follow the settings of our synthetic experiments and observe consistent but more detailed results supporting the design of Egi and the utility of our theoretical analysis. On the YAGO graphs (§4.2), we further evaluate Egi in the more generalized and practical setting of transfer learning with task-specific fine-tuning. We find our theoretical insights still indicative in such scenarios, where Egi consistently outperforms state-of-the-art GNN representation and transfer learning frameworks by significant margins.

Figure 1: Overview of our GNN transfer learning framework: (1) we represent the toy graph as a combination of its 1-hop ego-graph and node feature distributions; (2) we design a transferable GNN towards capturing such essential graph information; (3) we establish a rigorous guarantee of GNN transferability based on the node feature requirement and the graph structure difference.

2 Related Work

Representation learning on graphs has been studied for decades, with earlier spectral-based methods [6, 46, 52] theoretically grounded but hardly scaling up to graphs with over a thousand nodes. With the emergence of neural networks, unsupervised network embedding methods based on the Skip-gram objective [37] have reinvigorated the field [51, 14, 42, 45, 66, 62, 65]. Equipped with efficient structural sampling (random walk, neighborhood, etc.) and negative sampling schemes, these methods are easily parallelizable and scalable to graphs with thousands to millions of nodes. However, these models are essentially transductive, as they compute fully parameterized embeddings only for nodes seen during training, which cannot be transferred to unseen graphs.

More recently, researchers have introduced the family of graph neural networks (GNNs), which are capable of inductive learning and generalizing to unseen nodes given meaningful node features [29, 12, 15, 67]. Yet, most existing GNNs require task-specific labels for training in a semi-supervised fashion to achieve satisfactory performance [29, 15, 53, 64], and their usage is limited to single graphs where the downstream task is fixed. To this end, several unsupervised GNNs have been presented, such as the auto-encoder-based ones like VGAE [28] and GNFs [35], as well as the deep-infomax-based ones like DGI [54] and InfoGraph [50]. Their potential for the transfer learning of GNNs remains unclear when the node features and link structures vary across different graphs.

Although the architectures of popular GNNs such as GCN [29] may not be very complicated compared with heavy vision and language models, training a dedicated GNN for each graph can still be cumbersome [10, 63]. Moreover, as the pre-training of neural networks has proven successful in other domains [13, 18], it is intriguing to transfer well-trained GNNs from relevant source graphs to improve the modeling of target graphs or enable few-shot learning [59, 31, 3] when labeled data are scarce. In light of this, pioneering works have studied both generative [22] and discriminative [21, 23] GNN pre-training schemes. Though Graph Contrastive Coding [43] shares the view of graph structures most similar to ours, it utilizes contrastive learning across all graphs instead of focusing on the transfer learning between specific pairs of graphs. On the other hand, unsupervised domain adaptive GCNs [59] study the domain adaptation problem only when the source and target tasks are homogeneous.

Most previous pre-training and self-supervised GNNs lack a rigorous analysis of their transferability and thus have unpredictable effectiveness. The only existing theoretical works on GNN transferability study the performance of GNNs across different permutations of a single original graph [33, 34] and the tradeoff between the discriminability and transferability of GNNs [47]. We, instead, are the first to rigorously study the more practical setting of transferring GNNs across pairs of different source and target graphs.

3 Transferable Graph Neural Networks

In this paper, we design a more transferable training objective for GNNs (Egi) based on our novel view of essential graph information (§3.1). We then analyze its transferability as the gap between its abilities to model the source and target graphs, based on their local graph Laplacians (§3.2).

Based on the connection between GNNs and spectral graph theory [29], we describe the output of a GNN as a combination of its input node features $X$, fixed graph Laplacian $L$ and learnable graph filters $\Psi$. The goal of training a GNN is then to improve its utility by learning graph filters that are compatible with the other two components towards specific tasks.

In the graph transfer learning setting where downstream tasks are often unknown during pre-training, we argue that the general utility of a GNN should be optimized and quantified w.r.t. its ability to capture the essential graph information in terms of the joint distribution of its topology structures and node features, which motivates us to design a novel ego-graph information maximization model (Egi) (§3.1). The general transferability of a GNN is then quantified by the gap between its abilities to model the source and target graphs. Under reasonable requirements such as using structure-respecting node features as the GNN input, we analyze this gap for Egi based on the structural difference between two graphs w.r.t. their local graph Laplacians (§3.2).

3.1 Transferable GNN via Ego-graph Information Maximization

In this work, we focus on the direct-transferring setting where a GNN is pre-trained on a source graph $G_a$ in an unsupervised fashion and applied to a target graph $G_b$ without fine-tuning. (In the experiments, we show our model to be generalizable to the more practical settings with task-specific pre-training and fine-tuning, while the study of rigorous bounds in such scenarios is left as future work.) Consider a graph $G=\{V,E\}$, where the set of nodes $V$ is associated with certain features $X$ and the set of edges $E$ forms the graph structure. Intuitively, the transfer learning will be successful only if both the features and structures of $G_a$ and $G_b$ are similar in some ways, so that the graph filters of a GNN learned on $G_a$ are compatible with the features and structures of $G_b$.

Graph kernels [57, 8, 30, 38] are well known for their capability of measuring similarity between pairs of graphs. Motivated by k-hop subgraph kernels [4], we introduce a novel view of a graph as samples from the joint distribution of its k-hop ego-graph structures and node features. Since a GNN essentially encodes such k-hop ego-graph samples, this view allows us to give concrete definitions of the structural information of graphs in the transfer learning setting, which facilitates measuring the similarity (difference) among graphs. Yet, none of the existing GNN training objectives are capable of recovering such distributional signals of ego-graphs. To this end, we design Ego-Graph Information maximization (Egi), which reconstructs the k-hop ego-graph of each center node via mutual information maximization [20].

Definition 3.1 (K-hop ego-graph).

We call a graph $g_i=\{V(g_i), E(g_i)\}$ a $k$-hop ego-graph centered at node $v_i$ if it has a $k$-layer centroid expansion [4] such that the greatest distance between $v_i$ and any other node in the ego-graph is $k$, i.e., $\forall v_j \in V(g_i), |d(v_i, v_j)| \leq k$, where $d(v_i, v_j)$ is the graph distance between $v_i$ and $v_j$.

In this paper, we use directed $k$-hop ego-graphs, whose direction is decided by whether they are composed of incoming or outgoing edges w.r.t. the center node, i.e., $g_i$ and $\tilde{g}_i$. The results apply trivially to undirected graphs with $g_i=\tilde{g}_i$.
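To make the ego-graph view concrete, the following minimal sketch (our own illustration, not the released implementation; it assumes an undirected NetworkX graph) extracts the k-hop ego-graph samples that constitute the structural view introduced above.

```python
import networkx as nx

def k_hop_ego_graph(G: nx.Graph, center, k: int) -> nx.Graph:
    # All nodes within graph distance k of the center form the ego-graph
    # (Definition 3.1); nx.ego_graph performs exactly this radius-k expansion.
    return nx.ego_graph(G, center, radius=k)

def ego_graph_samples(G: nx.Graph, k: int):
    # A graph is then viewed as samples {g_i} of k-hop ego-graphs,
    # one per center node.
    return [k_hop_ego_graph(G, v, k) for v in G.nodes()]

if __name__ == "__main__":
    G = nx.barabasi_albert_graph(100, 2, seed=0)
    samples = ego_graph_samples(G, k=2)
    print(len(samples), samples[0].number_of_nodes())
```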

Definition 3.2 (Structural information).

Let $\mathcal{G}$ be a topological space of subgraphs. We view a graph $G$ as samples of $k$-hop ego-graphs $\{g_i\}_{i=1}^{n}$ drawn i.i.d. from $\mathcal{G}$ with probability $\mu$, i.e., $g_i \overset{\text{i.i.d.}}{\sim} \mu \;\; \forall i=1,\cdots,n$. The structural information of $G$ is then defined to be the set of $k$-hop ego-graphs $\{g_i\}_{i=1}^{n}$ and their empirical distribution.

As shown in Figure 1, three graphs $G_0$, $G_1$ and $G_2$ are characterized by a set of 1-hop ego-graphs and their empirical distributions, which allows us to quantify the structural similarity among graphs as shown in §3.2 (i.e., $G_0$ is more similar to $G_1$ than to $G_2$ under such characterization). In practice, the nodes in a graph $G$ are characterized not only by their $k$-hop ego-graph structures but also by their associated node features. Therefore, $G$ should be regarded as samples $\{(g_i, x_i)\}$ drawn from the joint distribution $\mathbb{P}$ on the product space of $\mathcal{G}$ and a node feature space $\mathcal{X}$.

Figure 2: The overall EGI training framework.

Ego-Graph Information Maximization. Given a set of ego-graphs $\{(g_i,x_i)\}_i$ drawn from an empirical joint distribution $(g_i,x_i)\sim\mathbb{P}$, we aim to train a GNN encoder $\Psi$ to maximize the mutual information MI$(g_i, \Psi(g_i,x_i))$ between the defined structural information $g_i$ (i.e., the $k$-hop ego-graph; later in §3.2, we discuss the equivalence between MI$(g_i, z_i)$ and MI$((g_i,x_i), z_i)$ when node features are structure-respecting) and the node embedding $z_i=\Psi(g_i,x_i)$. To maximize the MI, a discriminator $\mathcal{D}(g_i,z_i): E(g_i)\times z_i \rightarrow \mathbb{R}^{+}$ is introduced to compute the probability that an edge $e$ belongs to the given ego-graph $g_i$. We use the Jensen-Shannon MI estimator [20] in the Egi objective,

\mathcal{L}_{\textsc{Egi}} = -\text{MI}^{\text{(JSD)}}\left(\mathcal{G},\Psi\right) = \frac{1}{N}\sum_{i=1}^{N}\left[\text{sp}\left(\mathcal{D}(g_{i},z_{i}^{\prime})\right)+\text{sp}\left(-\mathcal{D}(g_{i},z_{i})\right)\right], \quad (1)

where $\text{sp}(x)=\log(1+e^{x})$ is the softplus function and $(g_i, z_i^{\prime})$ is randomly drawn from the product of marginal distributions, i.e., $z_i^{\prime}=\Psi(g_{i^{\prime}},x_{i^{\prime}})$ with $(g_{i^{\prime}},x_{i^{\prime}})\sim\mathbb{P},\, i^{\prime}\neq i$. In general, we could also randomly draw negative $g_i^{\prime}$ in the topological space, but enumerating all possible graphs $g_{i^{\prime}}$ leads to high computation cost.
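As a minimal PyTorch sketch of Eq. 1 (function and variable names are our own assumptions, not the released code), the Jensen-Shannon estimator is computed from discriminator scores of positive pairs $(g_i, z_i)$ and negative pairs $(g_i, z_i^{\prime})$; negatives can be obtained, e.g., by pairing each ego-graph with a shuffled center embedding within the batch.

```python
import torch
import torch.nn.functional as F

def egi_jsd_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """L_EGI = mean[ sp(D(g_i, z_i')) + sp(-D(g_i, z_i)) ], i.e. Eq. 1.

    pos_scores: D(g_i, z_i)  for matched ego-graph / center-embedding pairs.
    neg_scores: D(g_i, z_i') for mismatched pairs from the product of marginals.
    """
    return (F.softplus(neg_scores) + F.softplus(-pos_scores)).mean()

if __name__ == "__main__":
    pos = torch.randn(8)                  # stand-in discriminator scores D(g_i, z_i)
    neg = torch.randn(8)                  # stand-in scores D(g_i, z_i') on negatives
    print(egi_jsd_loss(pos, neg).item())  # minimizing this maximizes the JSD MI bound
```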

In Eq. 1, the computation of $\mathcal{D}$ on $E(g_i)$ depends on the node order. Following the common practice in graph generation [70], we characterize the decision process of $\mathcal{D}$ with a fixed graph ordering, i.e., the BFS ordering $\pi$ over the edges $E(g_i)$. $\mathcal{D}=f\circ\Phi$ is composed of another GNN encoder $\Phi$ and a scoring function $f$ over an edge sequence $E^{\pi}:\{e_1,e_2,\ldots,e_n\}$, which makes predictions on the BFS-ordered edges.
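A small sketch of one way to realize the BFS ordering $\pi$ over the edges of an ego-graph (our own illustration using NetworkX; the released code may order edges differently): edges are sorted by the BFS hop numbers of their endpoints, which keeps the hop-crossing order fixed while permutations within a hop are irrelevant (see the discussion below Eq. 3).

```python
import networkx as nx

def bfs_edge_order(g: nx.Graph, center):
    # Hop number (BFS distance from the center) for every node in the ego-graph.
    hop = nx.single_source_shortest_path_length(g, center)
    # Orient each edge from its lower-hop endpoint to its higher-hop endpoint,
    # then sort all edges by the hops of their endpoints.
    oriented = [tuple(sorted((u, v), key=lambda n: hop[n])) for u, v in g.edges()]
    return sorted(oriented, key=lambda e: (hop[e[0]], hop[e[1]]))

if __name__ == "__main__":
    g = nx.ego_graph(nx.karate_club_graph(), 0, radius=2)
    print(bfs_edge_order(g, center=0)[:5])
```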

Recalling our previous definition of the direction of a $k$-hop ego-graph, the center node encoder $\Psi$ receives pairs $(g_i, x_i)$, while the neighbor node encoder $\Phi$ in the discriminator $\mathcal{D}$ receives $(\tilde{g}_i, x_i)$. Both encoders are parameterized as GNNs,

\Psi(g_i, x_i) = \text{GNN}_{\Psi}(A_i, X_i), \quad \Phi(\tilde{g}_i, x_i) = \text{GNN}_{\Phi}(A_i^{\prime}, X_i),

where $A_i$ and $A_i^{\prime}$ are the adjacency matrices (with self-loops) of $g_i$ and $\tilde{g}_i$, respectively, and $A_i = {A_i^{\prime}}^{\intercal}$. The self-loops are added following the common design of GNNs, which allows the convolutional node embeddings to always incorporate the influence of the center node. The output of $\Psi$, i.e., $z_i \in \mathbb{R}^n$, is the center node embedding, while $\Phi$ outputs the representation $H \in \mathbb{R}^{|g_i| \times n}$ for the neighbor nodes in the ego-graph.

Once the node representations $H$ are computed, we now describe the scoring function $f$. For each node pair $(p,q)\in E^{\pi}$, $h_p$ is the source node representation from $\Phi$ and $x_q$ is the destination node feature. The scoring function is

f(h_p, x_q, z_i) = \sigma\left(U^{T}\cdot\tau\left(W^{T}[h_p \,\|\, x_q \,\|\, z_i]\right)\right), \quad (2)

where $\sigma$ and $\tau$ are the Sigmoid and ReLU activation functions, respectively. Thus, the discriminator $\mathcal{D}$ is asked to distinguish a positive pair $((p,q), z_i)$ from a negative pair $((p,q), z_i^{\prime})$ for each edge in $g_i$:

\mathcal{D}(g_i, z_i) = \sum_{(p,q)\in E^{\pi}} \log f(h_p, x_q, z_i), \quad \mathcal{D}(g_i, z_i^{\prime}) = \sum_{(p,q)\in E^{\pi}} \log f(h_p, x_q, z_i^{\prime}). \quad (3)
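A minimal PyTorch sketch of the scoring function in Eq. 2 and the edge-wise discriminator sums in Eq. 3; module names, hidden sizes, and the edge-indexing convention are our assumptions for illustration only.

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """f(h_p, x_q, z_i) = Sigmoid(U^T ReLU(W^T [h_p || x_q || z_i]))  (Eq. 2)."""

    def __init__(self, h_dim: int, x_dim: int, z_dim: int, hidden: int = 64):
        super().__init__()
        self.W = nn.Linear(h_dim + x_dim + z_dim, hidden)
        self.U = nn.Linear(hidden, 1)

    def forward(self, h_p, x_q, z_i):
        cat = torch.cat([h_p, x_q, z_i], dim=-1)
        return torch.sigmoid(self.U(torch.relu(self.W(cat)))).squeeze(-1)

def discriminator_score(scorer, H, X, z_i, bfs_edges):
    """D(g_i, z_i): sum of log f(h_p, x_q, z_i) over BFS-ordered edges (Eq. 3)."""
    p = torch.tensor([e[0] for e in bfs_edges])
    q = torch.tensor([e[1] for e in bfs_edges])
    z = z_i.unsqueeze(0).expand(len(bfs_edges), -1)
    # A small epsilon keeps the log finite when the sigmoid saturates at 0.
    return torch.log(scorer(H[p], X[q], z) + 1e-10).sum()
```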

There are two types of edges $(p,q)$ in our consideration of node orders: type-a, the edges across different hops (from the center node), and type-b, the edges within the same hop (from the center node). The aforementioned BFS-based node ordering guarantees that Eq. 3 is sensitive to the ordering of type-a edges and invariant to the ordering of type-b edges, which is consistent with the requirement of our theoretical analysis on $\Delta_{\mathcal{D}}$. Since the output of a $k$-layer GNN only depends on a $k$-hop ego-graph for both encoders $\Psi$ and $\Phi$, Egi can be trained in parallel by sampling batches of $g_i$'s. Besides, the training objective of Egi is transferable as long as $(g_i, x_i)$ across the source graph $G_a$ and target graph $G_b$ satisfy the conditions given in §3.2. More model details are in Appendix §B and the source code is in the Supplementary Materials.

Connection with existing work. To provide more insight into the Egi objective, we also present it as a dual problem of ego-graph reconstruction. Recall our definition of ego-graph mutual information MI$(g_i, \Psi(g_i,x_i))$. It can be related to an ego-graph reconstruction loss $R(g_i|\Psi(g_i,x_i))$ as

\max \text{MI}(g_i, \Psi(g_i,x_i)) = H(g_i) - H(g_i|\Psi(g_i,x_i)) \leq H(g_i) - R(g_i|\Psi(g_i,x_i)). \quad (4)

When Egi maximizes the mutual information, it simultaneously minimizes the upper error bound of reconstructing an ego-graph $g_i$. In this view, the key difference between Egi and VGAE [28] is that VGAE assumes each edge in a graph is observed independently during reconstruction, while in Egi, the edges in an ego-graph are observed jointly during the GNN decoding. Moreover, existing mutual-information-based GNNs such as DGI [54] and GMI [41] explicitly measure the mutual information between node features $x$ and the GNN output $\Psi$. In this way, they tend to capture node features instead of graph structures, which we deem more essential in graph transfer learning as discussed in §3.2.

Use cases of the Egi framework. In this paper, we focus on the classical domain adaptation (direct-transferring) setting [7], where no target domain labels are available and transferability is measured by the performance discrepancy without fine-tuning. In this setting, the transferability of Egi is theoretically guaranteed by Theorem 3.1. In §4.1, we validate this with the Airport datasets. Beyond direct-transferring, Egi is also useful in the more generalized and practical setting of transfer learning with fine-tuning, which we introduce in §4.2 and validate with the YAGO datasets. In this setting, the transferability of Egi is not rigorously studied yet, but is empirically shown to be promising.

Supportive observations. In the first three columns of our synthetic experimental results (Table 1), in both cases of transferring GNNs between similar graphs (F-F) and dissimilar graphs (B-F), Egi significantly outperforms all competitors when using node degree one-hot encodings as transferable node features. In particular, the performance gains over the untrained GIN show the effectiveness of training and transferring, and our gains are always larger than those of the two state-of-the-art unsupervised GNNs. Such results clearly indicate the advantageous structure-preserving capability and transferability of Egi.

3.2 Transferability analysis based on local graph Laplacians

We now study the transferability of a GNN (in particular, with the training objective of $\mathcal{L}_{\textsc{Egi}}$) between the source graph $G_a$ and target graph $G_b$ based on their graph similarity. We first establish the requirement towards node features, under which we then focus on analyzing the transferability of Egi w.r.t. the structural information of $G_a$ and $G_b$.

Recall our view of the GNN output as a combination of its input node features, fixed graph Laplacian and learnable graph filters. The utility of a GNN is determined by the compatibility among the three. In order to fulfill such compatibility, we require the node features to be structure-respecting:

Definition 3.3 (Structure-respecting node features).

Let $g_i$ be an ordered ego-graph centered on node $v_i$ with a set of node features $\{x^{i}_{p,q}\}_{p=0,q=1}^{k,|V_p(g_i)|}$, where $V_p(g_i)$ is the set of nodes in the $p$-th hop of $g_i$. Then we say the node features on $g_i$ are structure-respecting if $x_{p,q}^{i}=[f(g_i)]_{p,q}\in\mathbb{R}^{d}$ for any node $v_q\in V_p(g_i)$, where $f:\mathcal{G}\to\mathbb{R}^{d\times|V(g_i)|}$ is a function. In the strict case, $f$ should be injective.

In its essence, Def 3.3 requires the node features to be a function of the graph structures, which is sensitive to changes in the graph structures, and, in an ideal case, injective to the graph structures (i.e., mapping different graphs to different features). In this way, when the learned graph filters of a transferred GNN are compatible with the structure of $G$, they are also compatible with the node features of $G$. As we will explain in Remark 2 of Theorem 3.1, this requirement is also essential for the analysis of Egi transferability, which eventually only depends on the structural difference between two graphs.

In practice, commonly used node features like node degrees, PageRank scores [40], spectral embeddings [11], and many pre-computed unsupervised network embeddings [42, 51, 14] are all structure-respecting in nature. However, other commonly used node features like random vectors [68] or uniform vectors [60] are not, and are thus non-transferable. When raw node attributes are available, they are transferable as long as the concept of homophily [36] applies, which also implies Def 3.3, but we do not have a rigorous analysis of this case yet.
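For concreteness, a minimal sketch of the default structure-respecting features used in our experiments, node degree one-hot encodings (the degree cap is our own assumption; any maximum degree shared between the source and target graphs works):

```python
import networkx as nx
import numpy as np

def degree_onehot_features(G: nx.Graph, max_degree: int = 64) -> np.ndarray:
    # Node degree is a function of the local graph structure and hence
    # structure-respecting in the sense of Def 3.3. Degrees above the shared
    # cap are clipped so source and target graphs use one feature vocabulary.
    degrees = np.array([min(d, max_degree) for _, d in G.degree()])
    feats = np.zeros((G.number_of_nodes(), max_degree + 1), dtype=np.float32)
    feats[np.arange(G.number_of_nodes()), degrees] = 1.0
    return feats
```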

Supportive observations. In the fifth and sixth columns of Table 1, where we use the same fixed vectors as non-transferable node features to contrast with the first three columns, there is almost no transferability (see $\delta$(acc.)) for all compared methods when non-transferable features are used, as the performance of the trained GNNs is similar to or worse than that of their untrained baselines. More detailed experiments on different transferable and non-transferable features can be found in Appendix §C.1.

With our view of graphs and requirement on node features both established, now we derive the following theorem by characterizing the performance difference of Egi on two graphs based on Eq. 1.

Theorem 3.1 (GNN transferability).

Let $G_a=\{(g_i,x_i)\}_{i=1}^{n}$ and $G_b=\{(g_{i^{\prime}},x_{i^{\prime}})\}_{i^{\prime}=1}^{m}$ be two graphs, and assume the node features are structure-relevant. Consider a GCN $\Psi_{\theta}$ with $k$ layers and a 1-hop polynomial filter $\phi$. With reasonable assumptions on the local spectrum of $G_a$ and $G_b$, the empirical performance difference of $\Psi_{\theta}$ evaluated on $\mathcal{L}_{\textsc{Egi}}$ satisfies

|\mathcal{L}_{\textsc{Egi}}(G_a) - \mathcal{L}_{\textsc{Egi}}(G_b)| \leq \mathcal{O}\left(\Delta_{\mathcal{D}}(G_a, G_b) + C\right). \quad (5)

On the RHS, $C$ is only dependent on the graph encoders and node features, while $\Delta_{\mathcal{D}}(G_a, G_b)$ measures the structural difference between the source and target graphs as follows,

\Delta_{\mathcal{D}}(G_a, G_b) = \tilde{C}\,\frac{1}{nm}\sum_{i=1}^{n}\sum_{i^{\prime}=1}^{m}\lambda_{\max}(\tilde{L}_{g_i} - \tilde{L}_{g_{i^{\prime}}}), \quad (6)

where $\lambda_{\max}(A):=\lambda_{\max}(A^{T}A)^{1/2}$, and $\tilde{L}_{g_i}$ denotes the in-degree normalised graph Laplacian of $\tilde{g}_i$. $\tilde{C}$ is a constant dependent on $\lambda_{\max}(\tilde{L}_{g_i})$ and $\mathcal{D}$.

Proof.

The full proof is detailed in Appendix §A. ∎

The analysis in Theorem 3.1 naturally instantiates our insight about the correspondence between structural similarity and GNN transferability. It allows us to tell how well an Egi model trained on $G_a$ can work on $G_b$ by only checking the local graph Laplacians of $G_a$ and $G_b$, without actually training any model. In particular, we define the EGI gap as $\Delta_{\mathcal{D}}$ in Eq. 6, since the other term $C$ is the same for different methods using the same GNN encoder. It can be computed to bound the transferability of Egi regarding its loss difference on the source and target graphs.
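A minimal NumPy/NetworkX sketch of the EGI gap in Eq. 6 (up to the constant $\tilde{C}$), written for undirected graphs; zero-padding the Laplacians of differently sized ego-graphs to a common size is our own implementation simplification, not part of the theorem.

```python
import networkx as nx
import numpy as np

def ego_laplacian(G: nx.Graph, center, k: int) -> np.ndarray:
    # Normalised Laplacian of the k-hop ego-graph around `center`.
    g = nx.ego_graph(G, center, radius=k)
    return nx.normalized_laplacian_matrix(g).toarray()

def egi_gap(Ga: nx.Graph, Gb: nx.Graph, k: int = 2) -> float:
    """Average spectral-norm difference of ego-graph Laplacians (Eq. 6, without C~)."""
    La = [ego_laplacian(Ga, v, k) for v in Ga.nodes()]
    Lb = [ego_laplacian(Gb, v, k) for v in Gb.nodes()]
    size = max(L.shape[0] for L in La + Lb)

    def pad(L):
        out = np.zeros((size, size))
        out[: L.shape[0], : L.shape[0]] = L
        return out

    # lambda_max(A) := lambda_max(A^T A)^{1/2} is the spectral norm (ord=2).
    diffs = [np.linalg.norm(pad(Li) - pad(Lj), ord=2) for Li in La for Lj in Lb]
    return float(np.mean(diffs))
```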

Remark 1.

Our view of a graph $G$ as samples of $k$-hop ego-graphs is important, as it allows us to obtain a node-wise characterization of GNNs similar to that in [55]. It also allows us to set the depth of the ego-graphs in the analysis to be the same as the number of GNN layers ($k$), since the GNN embedding of each node mostly depends on its $k$-hop ego-graph instead of the whole graph.

Remark 2.

For Eq. 1, Def 3.3 ensures that sampling the GNN embedding at a node always corresponds to sampling an ego-graph from $\mathcal{G}$, which reduces to uniformly sampling from $G=\{g_i\}_{i=1}^{n}$ under the setting of Theorem 3.1. Therefore, the requirement of Def 3.3 in the context of Theorem 3.1 guarantees that the analysis depends only on the structural information of the graph.

Supportive observations. In Table 1, in the structural difference columns, we compute the average structural difference between two Forest-fire graphs ($\Delta_{\mathcal{D}}$(F,F)) and between Barabasi and Forest-fire graphs ($\Delta_{\mathcal{D}}$(B,F)), based on the RHS of Eq. 5. The results validate the topological difference between graphs generated by different random-graph models, while also verifying our view of a graph as $k$-hop ego-graph samples and the way we characterize structural information of graphs based on it. We further highlight in the $\delta$(acc.) columns the accuracy difference between the GNNs transferred from Forest-fire graphs and from Barabasi graphs to Forest-fire graphs. Since Forest-fire graphs are more similar to other Forest-fire graphs than to Barabasi graphs (as verified in the $\Delta_{\mathcal{D}}$ columns), we expect $\delta$(acc.) to be positive and large, indicating more positive transfer between the more similar graphs. Indeed, the behavior of Egi aligns well with this expectation, which indicates its well-understood transferability and the utility of our theoretical analysis.

Use cases of Theorem 3.1. Our Theorem 3.1 naturally allows for two practical use cases among many others: point-wise pre-judge and pair-wise pre-selection for Egi pre-training. Suppose we have a target graph $G_b$ which does not have sufficient training labels. In the first setting, we have a single source graph $G_a$ which might be useful for pre-training a GNN to be used on $G_b$. The Egi gap $\Delta_{\mathcal{D}}(G_a, G_b)$ in Eq. 6 can then be computed between $G_a$ and $G_b$ to pre-judge whether such transfer learning would be successful before any actual GNN training (i.e., yes if $\Delta_{\mathcal{D}}(G_a, G_b)$ is empirically much smaller than $1.0$; no otherwise). In the second setting, we have two or more source graphs $\{G_a^1, G_a^2, \ldots\}$ which might be useful for pre-training the GNN. The Egi gap can then be computed between every pair of $G_a^i$ and $G_b$ to pre-select the best source graph (i.e., select the one with the least Egi gap).
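Using the `egi_gap` sketch above, pair-wise pre-selection reduces to computing the gap once per candidate source graph and picking the smallest; a hedged illustration (names are ours):

```python
def preselect_source(candidates: dict, Gb, k: int = 2):
    # candidates: {name: source graph}. Pick the source whose EGI gap to the
    # target G_b is smallest; a gap well below 1.0 also serves as the
    # point-wise pre-judge that transfer is likely to be positive.
    gaps = {name: egi_gap(Ga, Gb, k) for name, Ga in candidates.items()}
    best = min(gaps, key=gaps.get)
    return best, gaps
```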

In practice, the computation of eigenvalues on the small ego-graphs can be rather efficient [2], and we do not need to enumerate all pairs of ego-graphs on the two compared graphs, especially if the graphs are really large (e.g., with more than a thousand nodes). Instead, we can randomly sample pairs of ego-graphs from the two graphs, update the average difference on-the-fly, and stop when it converges. Suppose we need to sample $M$ pairs of $k$-hop ego-graphs to compare two large graphs, and the average size of the ego-graphs is $L$; then the overall complexity of computing Eq. 5 is $\mathcal{O}(ML^{2})$, where $M$ is often less than 1K and $L$ less than 50. In Appendix §C.4, we report the approximated $\Delta_{\mathcal{D}}$'s w.r.t. different sampling frequencies, and they are indeed quite close to the actual value even with smaller sampling frequencies, showing the feasible efficiency of computing $\Delta_{\mathcal{D}}$ through sampling.
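The sampling-based approximation described above can be sketched as follows (reusing `ego_laplacian` from the earlier sketch; the batch size, tolerance, and stopping rule are our assumptions):

```python
import random
import numpy as np

def egi_gap_sampled(Ga, Gb, k: int = 2, max_pairs: int = 1000,
                    check_every: int = 100, tol: float = 1e-3, seed: int = 0):
    rng = random.Random(seed)
    nodes_a, nodes_b = list(Ga.nodes()), list(Gb.nodes())
    diffs, prev_mean = [], None
    for t in range(1, max_pairs + 1):
        La = ego_laplacian(Ga, rng.choice(nodes_a), k)
        Lb = ego_laplacian(Gb, rng.choice(nodes_b), k)
        size = max(La.shape[0], Lb.shape[0])
        pa, pb = np.zeros((size, size)), np.zeros((size, size))
        pa[: La.shape[0], : La.shape[0]] = La
        pb[: Lb.shape[0], : Lb.shape[0]] = Lb
        diffs.append(np.linalg.norm(pa - pb, ord=2))
        if t % check_every == 0:  # stop once the running mean stabilizes
            mean = float(np.mean(diffs))
            if prev_mean is not None and abs(mean - prev_mean) < tol:
                break
            prev_mean = mean
    return float(np.mean(diffs))
```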

Limitations. Egi is designed to account for the structural difference captured by GNNs (i.e., $k$-hop ego-graphs). The effectiveness of Egi could be limited if the tasks on target graphs depend on different structural signals. For example, since Eq. 6 computes the average pairwise distance between the graph Laplacians of local ego-graphs, $\Delta_{\mathcal{D}}$ is possibly less effective in explicitly capturing global graph properties such as the number of connected components (CCs). In some specific tasks (such as counting CCs or community detection) where such properties become the key factors, $\Delta_{\mathcal{D}}$ may fail to predict the transferability of GNNs.

Table 1: Synthetic experiments on identifying structurally equivalent nodes. We randomly generate 40 graphs with the Forest-fire model (F) [32] and 40 graphs with the Barabasi model (B) [1]. The compared models are GIN [60] with random parameters (a baseline with only the neighborhood aggregation function), VGAE [28], DGI [54], and Egi, all with the GIN encoder. We train VGAE, DGI and Egi on one graph from either set (F or B), and test them on the rest of the Forest-fire graphs (F). The transferable features are node degree one-hot encodings, and the non-transferable features are uniform vectors. More details about the results and dataset can be found in Appendix §C.1.

Method | transferable features (F-F / B-F / δ(acc.)) | non-transferable features (F-F / B-F / δ(acc.))
GIN (untrained) | 0.572 / 0.572 / – | 0.358 / 0.358 / –
VGAE (GIN) | 0.498 / 0.432 / +0.066 | 0.240 / 0.239 / 0.001
DGI (GIN) | 0.578 / 0.591 / -0.013 | 0.394 / 0.213 / +0.181
Egi (GIN) | 0.710 / 0.616 / +0.094 | 0.376 / 0.346 / +0.03
Structural difference: $\Delta_{\mathcal{D}}$(F,F) = 0.752, $\Delta_{\mathcal{D}}$(B,F) = 0.883

4 Real Data Experiments

Baselines. We compare the proposed model against existing self-supervised GNNs and pre-training GNN algorithms. To exclude the impact of different GNN encoders $\Psi$ on transferability, we always use the same encoder architecture for all compared methods (i.e., GIN [60] for the direct-transferring experiments and GCN [29] for transferring with fine-tuning).

The self-supervised GNN baselines are GVAE [28], DGI [54] and two recent mutual information estimation methods, GMI [41] and MVC [17]. As for pre-training GNN algorithms, MaskGNN and ContextPredGNN are two node-level pre-training models proposed in [21]. Besides, Structural Pre-train [23] also conducts unsupervised node-level pre-training with structural features like node degrees and clustering coefficients.

Experimental Settings. The main hyperparameter $k$ is set to 2 in Egi as a common practice. We use Adam [27] as the optimizer and the learning rate is 0.01. We provide experimental results with varying $k$ in Appendix §C.4. All baselines are set with their default parameters. Our experiments were run on an AWS g4dn.2xlarge machine with 1 Nvidia T4 GPU. By default, we use node degree one-hot encodings as the transferable features across all different graphs. As stated before, other transferable features like spectral and other pre-computed node embeddings are also applicable. We focus on the setting where the downstream tasks on target graphs are unspecified but assumed to be structure-relevant, and thus pre-train the GNNs on source graphs in an unsupervised fashion. (The downstream tasks are unspecified because we aim to study the general transferability of GNNs that is not bound to specific tasks; nevertheless, we assume the tasks to be relevant to graph structures.) In terms of evaluation, we design two realistic experimental settings: (1) direct-transferring on the more structure-relevant task of role identification without given node features, to directly evaluate the utility and transferability of Egi; (2) few-shot learning on relation prediction with task-specific node features, to evaluate the generalization ability of Egi.

4.1 Direct-transferring on role identification

First, we use role identification without node features in a direct-transferring setting as a reliable proxy to evaluate transfer learning performance regarding different pre-training objectives. A role in a network is defined as a set of nodes with similar structural behaviors, such as clique members, hubs and bridges [19]. Across graphs in the same domain, we assume the definition of roles to be consistent, and the task of role identification is highly structure-relevant, which can directly reflect the transferability of different methods and allows us to conduct the analysis according to Theorem 3.1. Upon convergence of pre-training each model on the source graphs, we directly apply them to the target graphs and further train a multi-layer perceptron (MLP) upon their outputs. The GNN parameters are frozen during the MLP training. We refer to this strategy as direct-transferring since there is no fine-tuning of the models after transferring to the target graphs.
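A minimal PyTorch sketch of this direct-transferring protocol: the pre-trained encoder is frozen and only an MLP head is trained on its outputs for the target graph (the encoder interface, hidden size, and the omitted train/test split are our simplifications):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def direct_transfer_eval(encoder: nn.Module, graph_inputs: tuple,
                         labels: torch.Tensor, num_classes: int,
                         epochs: int = 100, lr: float = 0.01) -> float:
    # Freeze the pre-trained GNN encoder; only the MLP head is trained.
    encoder.eval()
    with torch.no_grad():
        Z = encoder(*graph_inputs)          # node embeddings on the target graph
    mlp = nn.Sequential(nn.Linear(Z.size(1), 64), nn.ReLU(),
                        nn.Linear(64, num_classes))
    opt = torch.optim.Adam(mlp.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(mlp(Z), labels)
        loss.backward()
        opt.step()
    # For brevity, accuracy is computed on the same nodes here; the actual
    # experiments use a proper split of the target nodes.
    return (mlp(Z).argmax(dim=1) == labels).float().mean().item()
```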

We use two real-world network datasets with role-based node labels: (1) Airport [45] contains three networks from different regions: Brazil, USA and Europe. Each node is an airport and each link is a flight between airports. The airports are assigned external labels based on their level of popularity. (2) Gene [68] contains the gene interactions regarding 50 different cancers. Each gene has a binary label indicating whether it is a transcription factor. More details about the results and datasets can be found in Appendix §C.2.

The experimental setup on the Airport dataset closely resembles that of our synthetic experiments in Table 1, but with real data and more detailed comparisons. We train all models (except for the untrained ones) on the Europe network, and test them on all three networks. The results are presented in Table 2. We notice that the node degree features themselves (with an MLP) show reasonable performance on all three networks, which is not surprising since the popularity-based airport role labels are highly relevant to node degrees. The untrained GIN encoder yields a significant margin over the node features alone, as the GNN encoder incorporates structural information into the node representations. While training DGI can further improve the performance on the source graph, Egi shows the best performance there with the structure-relevant node degree features, corroborating the claimed effectiveness of Egi in capturing the essential graph information (i.e., recovering the k-hop ego-graph distributions) as we stress in §3.

When transferring the models to the USA and Brazil networks, Egi further achieves the best performance compared with all baselines when structure-relevant features are used (64.55 and 73.15), which reflects the most significant positive transfer. Interestingly, direct application of GVAE, DGI and MVC, which do not capture the input k-hop ego-graph jointly, leads to rather limited and even negative transferability (through comparison against the untrained GIN encoders). The recently proposed transfer learning frameworks for GNNs like MaskGNN and Structural Pre-train are able to mitigate negative transfer to some extent, but their performance is still inferior to Egi. We believe this is because their models are prone to learning graph-specific information that is less transferable across different graphs. GMI is also known to capture both graph structure and node features, so it achieves the second-best results after Egi.

Similarly as in Table 1, we also compute the structural differences among the three networks w.r.t. the Egi gap in Eq. 6. The structural difference is 0.869 between the Europe and USA networks, and 0.851 between the Europe and Brazil networks, which are pretty close. Consequently, the transferability of Egi regarding its performance gain over the untrained GIN baseline is 4.8% on the USA network and 4.4% on the Brazil network, which are also close. Such observations again align well with our conclusion in Theorem 3.1 that the transferability of Egi is closely related to the structural differences between the source and target graphs.

Table 2: Results of role identification with direct-transferring on the Airport dataset [45]. We report the mean and standard deviation over 100 runs. The scores marked with ∗∗ passed the t-test with $p<0.01$ against the second-best results.
Method | Europe | USA | Brazil
features | 0.528±0.052 | 0.557±0.028 | 0.671±0.089
GIN (random-init) | 0.558±0.050 | 0.616±0.030 | 0.700±0.082
GVAE (GIN) [28] | 0.539±0.053 | 0.555±0.029 | 0.663±0.089
DGI (GIN) [54] | 0.578±0.050 | 0.549±0.028 | 0.673±0.084
Mask-GIN [21] | 0.564±0.053 | 0.608±0.027 | 0.667±0.073
ContextPred-GIN [21] | 0.527±0.048 | 0.504±0.030 | 0.621±0.078
Structural Pre-train [23] | 0.560±0.050 | 0.622±0.030 | 0.688±0.082
MVC [17] | 0.532±0.050 | 0.597±0.030 | 0.661±0.093
GMI [41] | 0.581±0.054 | 0.593±0.031 | 0.731±0.107
Egi (GIN) | 0.592±0.046∗∗ | 0.646±0.029∗∗ | 0.732±0.078

On the Gene dataset, with more graphs available, we focus on Egi to further validate the utility of Eq. 5 in Theorem 3.1, regarding the connection between the Egi gap (Eq. 6) and the performance gap (micro-F1) of Egi on them. Due to severe label imbalance, which obscures the performance gaps, we only use the seven brain cancer networks that have a more consistent balance of labels. As shown in Figure 3, we train Egi on one graph and test it on the other graphs. The $x$-axis shows the Egi gap, and the $y$-axis shows the improvement in micro-F1 compared with an untrained GIN. The negative correlation between the two quantities is evident. Specifically, when the structural difference is smaller than 1, positive transfer is observed (upper left area), as the performance of the transferred Egi is better than that of the untrained GIN; when the structural difference becomes large ($>1$), negative transfer is observed. We also notice a similar graph pattern, i.e., a single dense cluster, shared between the source graph and the positively transferred target graph $G_2$.

Figure 3: Transfer learning performance of role identification on the Gene dataset. We visualize the source graph $G_0$ and two example target graphs, one relatively more different from $G_0$ ($G_4$) and one more similar to it ($G_2$).

4.2 Few-shot learning on relation prediction

Here we evaluate Egi in the more generalized and practical setting of few-shot learning on the less structure-relevant task of relation prediction, with task-specific node features and fine-tuning. The source graph contains a cleaned full dump of 579K entities from YAGO [49], and we investigate 20-shot relation prediction on a target graph with 24 relation types, which is a sub-graph of 115K entities sampled from the same dump. In post-fine-tuning, the models are pre-trained with an unsupervised loss on the source graph and fine-tuned with the task-specific loss on the target graph. In joint-fine-tuning, the same pre-trained models are jointly optimized w.r.t. the unsupervised pre-training loss and the task-specific fine-tuning loss on the target graph. In Table 3, we observe that most of the existing models fail to transfer across the pre-training and fine-tuning tasks, especially in the joint-fine-tuning setting. In particular, both Mask-GIN and ContextPred-GIN rely heavily on task-specific fine-tuning, while Egi focuses on capturing similar ego-graph structures that are transferable across graphs. The mutual-information-based method GMI also demonstrates considerable transferability, and we believe the ability to capture graph structure is the key to such transferability. As a consequence, Egi significantly outperforms all compared methods in both settings. More detailed statistics and running times are in Appendix §C.3.

Table 3: Performance of few-shot relation prediction on YAGO. The scores marked with ∗∗ passed the t-test with $p<0.01$ over the second best results.
Method | post-fine-tuning AUROC | post-fine-tuning MRR | joint-fine-tuning AUROC | joint-fine-tuning MRR
No pre-train | 0.687±0.002 | 0.596±0.003 | N.A. | N.A.
GVAE | 0.701±0.003 | 0.601±0.007 | 0.679±0.004 | 0.568±0.008
DGI | 0.689±0.011 | 0.586±0.025 | 0.688±0.012 | 0.537±0.023
MaskGNN | 0.713±0.009 | 0.631±0.015 | 0.712±0.005 | 0.560±0.010
ContextPredGNN | 0.692±0.030 | 0.662±0.030 | 0.705±0.011 | 0.575±0.021
GMI | 0.728±0.005 | 0.625±0.009 | 0.721±0.007 | 0.643±0.011
Structural Pre-train | OOM | OOM | OOM | OOM
MVC | OOM | OOM | OOM | OOM
Egi | 0.739±0.009∗∗ | 0.670±0.014 | 0.787±0.011∗∗ | 0.729±0.016∗∗

5 Conclusion

To the best of our knowledge, this is the first research effort towards establishing a theoretically grounded framework to analyze GNN transferability, which we also demonstrate to be practically useful for guiding the design and conduct of transfer learning with GNNs. For future work, it is intriguing to further strengthen the bound with relaxed assumptions, to rigorously extend it to more complicated and less restricted settings regarding node features and downstream tasks, as well as to analyze and improve the proposed framework over more transfer learning scenarios and datasets. It is also important to protect the privacy of pre-training data to avoid potential negative societal impacts.

Acknowledgments and Disclosure of Funding

Research was supported in part by US DARPA KAIROS Program No. FA8750-19-2-1004, SocialSim Program No. W911NF-17-C-0099, and INCAS Program No. HR001121C0165, National Science Foundation IIS-19-56151, IIS-17-41317, and IIS 17-04532, and the Molecule Maker Lab Institute: An AI Research Institutes program supported by NSF under Award No. 2019897. Chao Zhang is supported by NSF IIS-2008334, IIS-2106961, and ONR MURI N00014-17-1-2656. We would like to thank the AWS Machine Learning Research Awards program for providing computational resources for the experiments in this paper. This work is also partially supported by the internal funding and GPU servers provided by the Computer Science Department of Emory University. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily represent the views, either expressed or implied, of DARPA or the U.S. Government.

References

  • [1] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.
  • [2] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In FOCS, pages 339–348, 2005.
  • [3] Jinheon Baek, Dong Bok Lee, and Sung Ju Hwang. Learning to extrapolate knowledge: Transductive few-shot out-of-graph link prediction. Advances in Neural Information Processing Systems, 33, 2020.
  • [4] Lu Bai and Edwin R Hancock. Fast depth-based subgraph kernels for unattributed graphs. Pattern Recognition, 50:233–245, 2016.
  • [5] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. science, 286(5439):509–512, 1999.
  • [6] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pages 585–591, 2002.
  • [7] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In NIPS, pages 137–144, 2007.
  • [8] Karsten Borgwardt, Elisabetta Ghisu, Felipe Llinares-López, Leslie O’Bray, and Bastian Rieck. Graph kernels: State-of-the-art and future challenges. arXiv preprint arXiv:2011.03854, 2020.
  • [9] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.
  • [10] Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
  • [11] Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
  • [12] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3844–3852, 2016.
  • [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In ACL, pages 4171–4186, 2019.
  • [14] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.
  • [15] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1024–1034, 2017.
  • [16] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. ACHA, 30(2):129–150, 2011.
  • [17] Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning, pages 4116–4126. PMLR, 2020.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [19] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. Rolx: structural role extraction & mining in large graphs. In KDD, pages 1231–1239, 2012.
  • [20] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
  • [21] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In ICLR, 2019.
  • [22] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. Gpt-gnn: Generative pre-training of graph neural networks. In KDD, pages 1857–1867, 2020.
  • [23] Ziniu Hu, Changjun Fan, Ting Chen, Kai-Wei Chang, and Yizhou Sun. Pre-training graph neural networks for generic structural feature extraction. arXiv preprint arXiv:1905.13728, 2019.
  • [24] Suk-Geun Hwang. Cauchy’s interlace theorem for eigenvalues of hermitian matrices. The American Mathematical Monthly, 111(2):157–159, 2004.
  • [25] Xuan Kan, Hejie Cui, and Carl Yang. Zero-shot scene graph relation prediction through commonsense knowledge integration. In ECML-PKDD, 2021.
  • [26] Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In NIPS, pages 7090–7099, 2019.
  • [27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [28] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
  • [29] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [30] Nils M Kriege, Fredrik D Johansson, and Christopher Morris. A survey on graph kernels. Applied Network Science, 5(1):1–42, 2020.
  • [31] Lin Lan, Pinghui Wang, Xuefeng Du, Kaikai Song, Jing Tao, and Xiaohong Guan. Node classification on graphs with few-shot novel labels via meta transformed network embedding. Advances in Neural Information Processing Systems, 33, 2020.
  • [32] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187, 2005.
  • [33] Ron Levie, Wei Huang, Lorenzo Bucci, Michael M Bronstein, and Gitta Kutyniok. Transferability of spectral graph convolutional neural networks. arXiv preprint arXiv:1907.12972, 2019.
  • [34] Ron Levie, Elvin Isufi, and Gitta Kutyniok. On the transferability of spectral graph filters. In 2019 13th International conference on Sampling Theory and Applications (SampTA), pages 1–5. IEEE, 2019.
  • [35] Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, and Kevin Swersky. Graph normalizing flows. In Advances in Neural Information Processing Systems, pages 13556–13566, 2019.
  • [36] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.
  • [37] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
  • [38] Giannis Nikolentzos, Giannis Siglidis, and Michalis Vazirgiannis. Graph kernels: A survey. arXiv preprint arXiv:1904.12218, 2019.
  • [39] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020.
  • [40] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
  • [41] Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. Graph representation learning via graphical mutual information maximization. In WWW, pages 259–270, 2020.
  • [42] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In KDD, pages 701–710, 2014.
  • [43] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. Gcc: Graph contrastive coding for graph neural network pre-training. In KDD, pages 1150–1160, 2020.
  • [44] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • [45] Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. struc2vec: Learning node representations from structural identity. In KDD, pages 385–394, 2017.
  • [46] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • [47] Luana Ruiz, Luiz Chamon, and Alejandro Ribeiro. Graphon neural networks and the transferability of graph neural networks. Advances in Neural Information Processing Systems, 33, 2020.
  • [48] Yu Shi, Qi Zhu, Fang Guo, Chao Zhang, and Jiawei Han. Easing embedding learning by comprehensive transcription of heterogeneous information networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2190–2199, 2018.
  • [49] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In WWW, pages 697–706, 2007.
  • [50] Fan-Yun Sun, Jordan Hoffman, Vikas Verma, and Jian Tang. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In ICLR, 2019.
  • [51] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW, pages 1067–1077, 2015.
  • [52] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
  • [53] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  • [54] Petar Velickovic, William Fedus, William L Hamilton, Pietro Lio, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In ICLR, 2019.
  • [55] Saurabh Verma and Zhi-Li Zhang. Stability and generalization of graph convolutional neural networks. In KDD, 2019.
  • [56] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
  • [57] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.
  • [58] Boris Weisfeiler and Andrei A Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
  • [59] Man Wu, Shirui Pan, Chuan Zhou, Xiaojun Chang, and Xingquan Zhu. Unsupervised domain adaptive graph convolutional networks. In WWW, pages 1457–1467, 2020.
  • [60] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
  • [61] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
  • [62] Carl Yang, Yichen Feng, Pan Li, Yu Shi, and Jiawei Han. Meta-graph based hin spectral embedding: Methods, analyses, and insights. In ICDM, 2018.
  • [63] Carl Yang, Aditya Pal, Andrew Zhai, Nikil Pancha, Jiawei Han, Chuck Rosenberg, and Jure Leskovec. Multisage: Empowering graphsage with contextualized multi-embedding on web-scale multipartite networks. In KDD, 2020.
  • [64] Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, and Jiawei Han. Heterogeneous network representation learning: A unified framework with survey and benchmark. In TKDE, 2020.
  • [65] Carl Yang, Chao Zhang, Xuewen Chen, Jieping Ye, and Jiawei Han. Did you enjoy the ride? understanding passenger experience via heterogeneous network embedding. In ICDE, 2018.
  • [66] Carl Yang, Jieyu Zhang, and Jiawei Han. Co-embedding network nodes and hierarchical labels with taxonomy based generative adversarial nets. In ICDM, 2020.
  • [67] Carl Yang, Jieyu Zhang, Haonan Wang, Sha Li, Myungwan Kim, Matt Walker, Yiou Xiao, and Jiawei Han. Relation learning on social networks with multi-modal graph edge variational autoencoders. In WSDM, 2020.
  • [68] Carl Yang, Peiye Zhuang, Wenhan Shi, Alan Luu, and Pan Li. Conditional structure generation through graph variational generative adversarial nets. In NIPS, pages 1338–1349, 2019.
  • [69] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In NIPS, 2018.
  • [70] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. In Proceedings of the 35th International Conference on Machine Learning, pages 5708–5717. PMLR, 2018.
  • [71] Qi Zhu, Natalia Ponomareva, Jiawei Han, and Bryan Perozzi. Shift-robust gnns: Overcoming the limitations of localized graph training data. In NeurIPS, 2021.

Appendix A Theory Details

From the Egi\mathcal{L}_{\textsc{Egi}} objective, we have assumed gii.i.d.μg_{i}\overset{\emph{i.i.d.}}{\sim}\mu, xii.i.d.νx_{i}\overset{\emph{i.i.d.}}{\sim}\nu, and (gi,xi)i.i.d.p(g_{i},x_{i})\overset{\emph{i.i.d.}}{\sim}p, for (gi,xi)𝒢×𝒳(g_{i},x_{i})\in\mathcal{G}\times\mathcal{X}. Then for a sample {(gi,xi)}i\{(g_{i},x_{i})\}_{i}, we have access to the empirical distributions of the three. In the procedure of evaluating the objective, we sample uniformly.

Note that, in Eq. 2 of the main paper, we used a dd dimensional hidden state hph_{p} to denote an edge’s source node representation and xqx_{q} as destination node features from the structure of the ego-graph and the associated source node feature with GNN. In our proof, we denote vp,qv_{p,q} as the qq-th node in the pp-th layer of the ego-graph and let hp,q=hph_{p,q}=h_{p} and xp,q=xqx_{p,q}=x_{q}. For simplicity, in i-th layer, we denote f(xi)=hp,qixp,qif(x^{i})=h_{p,q}^{i}\|x^{i}_{p,q}, where [][\cdot\|\cdot] is the concatenation operation.

Finally, as we are considering a GNN with $k$ layers, its computation only depends on the $k$-hop ego-graphs of $G$, which is an important consideration when unfolding the embedding of the GNN at a center node with Lemma A.1.

Lemma A.1.

For any $A\in\mathbb{R}^{m\times n}$ with $m\geq n$ that is a submatrix of $B\in\mathbb{R}^{m^{\prime}\times n}$ with $m<m^{\prime}$, we have

\|A\|_{2}\leq\|B\|_{2}.
Proof.

Note that $AA^{T}$ is a principal submatrix of $BB^{T}$, i.e., $AA^{T}$ is obtained by removing the same set of rows and columns from $BB^{T}$. Then, by the Eigenvalue Interlacing Theorem [24] and the fact that $A^{T}A$ and $AA^{T}$ have the same set of non-zero eigenvalues, the matrix operator norm satisfies $\|A\|_{2}=\sqrt{\lambda_{\max}(A^{T}A)}=\sqrt{\lambda_{\max}(AA^{T})}\leq\sqrt{\lambda_{\max}(BB^{T})}=\|B\|_{2}$. ∎
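As a quick sanity check of Lemma A.1 (not part of the proof), the following sketch verifies numerically that the spectral norm of a row-submatrix never exceeds that of the full matrix; the matrix sizes are arbitrary choices for illustration.

```python
import numpy as np

# Numerical check of Lemma A.1: removing rows of B (keeping all columns)
# cannot increase the spectral norm, since AA^T is a principal submatrix of BB^T.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 5))                        # B in R^{m' x n}
rows = np.sort(rng.choice(8, size=6, replace=False))   # keep a subset of rows
A = B[rows, :]                                         # submatrix A in R^{m x n}, m >= n
assert np.linalg.norm(A, 2) <= np.linalg.norm(B, 2) + 1e-12
print(np.linalg.norm(A, 2), np.linalg.norm(B, 2))
```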

A.1 Center-node view of GCN

Recall that $V_{p}(g_{i})$ denotes the set of nodes in the $p$th hop of the $k$-hop ego-graph $g_{i}$, and $x_{p,q}^{i}$ denotes the feature of the $q$th node in the $p$th hop of $g_{i}$, for any $p=0,\dots,k$ and $q=1,\dots,|V_{p}(g_{i})|$. Similarly, $V(g_{i})$ denotes the entire set of nodes in $g_{i}$. In each ego-graph sample $\{g_{i},x_{i}\}$, the layer-wise propagation rule for the center node embedding in the encoder $\Psi$ and discriminator $\mathcal{D}$ can be written in the form of GCN as follows,

Z^{(l)}=\text{ReLU}(D^{-\frac{1}{2}}(I+A)D^{-\frac{1}{2}}Z^{(l-1)}\theta^{(l)}),

where $A$ is the adjacency matrix of $G$, $I$ adds the self-loops, and $D_{ii}=\sum_{j}A_{ij}$ is the degree matrix.
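As a concrete reference, this propagation rule can be sketched with dense NumPy operations as below; this is a minimal illustration of the update, not the implementation used in our experiments, and it assumes every node has non-zero degree (which holds inside a connected ego-graph).

```python
import numpy as np

def gcn_layer(A, Z, theta):
    """One propagation step Z' = ReLU(D^{-1/2}(I + A)D^{-1/2} Z theta),
    with D_ii = sum_j A_ij as defined above (dense sketch)."""
    d = A.sum(axis=1)                        # node degrees of A
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^{-1/2}; assumes d > 0
    A_hat = np.eye(A.shape[0]) + A           # I + A adds the self-loops
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ Z @ theta, 0.0)
```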

We focus on the center node's embedding obtained from a $k$-layer GCN with the 1-hop polynomial filter $\phi(L)=Id-L$. Inspired by the characterization of GCN from a node-wise view in [55], we similarly denote the embedding of node $x_{i}$, $\forall i=1,\cdots,n$, in the final layer of the GCN as

z_{i}^{(k)}=z_{i}=\Psi_{\theta}(x_{i})=\sigma\Big(\sum_{j\in\mathcal{N}(x_{i})}e_{ij}{z_{j}^{(k-1)}}^{T}\theta^{(k)}\Big)\in\mathbb{R}^{d},

where $e_{ij}=[\phi(L)]_{ij}\in\mathbb{R}$ is the weighted link between nodes $i$ and $j$, and $\theta^{(k)}\in\mathbb{R}^{d\times d}$ is the weight of the $k$th layer, shared across nodes. Then $\theta=\{\theta^{(\ell)}\}_{\ell=1}^{k}$. We may denote $z_{i}^{(\ell)}\in\mathbb{R}^{d}$ similarly for $\ell=1,\cdots,k-1$, and $z_{i}^{(0)}=x_{i}\in\mathbb{R}^{d}$ as the node feature of the center node $x_{i}$. With the assumption of GCN in the statement, it is clear that only the $k$-hop ego-graph $g_{i}$ centered at $x_{i}$ is needed to compute $z_{i}^{(k)}$ for any $i=1,\cdots,n$, instead of the whole of $G$. Precisely, the $p$th hop of the subgraph corresponds to the $\ell=(k-p)$th layer of the model.

With such observation in mind, let us denote the matrix of node embeddings of $g_{i}$ at the $\ell$th layer as $[z_{i}^{(\ell)}]\in\mathbb{R}^{|V(g_{i})|\times d}$ for $\ell=1,\cdots,k$; and let $[z_{i}^{(0)}]\equiv[x_{i}]\in(\mathbb{R}^{d})^{|V(g_{i})|}$ denote the matrix of node features in the $k$-hop ego-graph $g_{i}$. In addition, denote $[z_{i}^{(\ell)}]_{p}$ as the principal submatrix that includes the embeddings of nodes in the 0 to $p$th hops of $g_{i}$, $0\leq p\leq k$.

We denote $L_{g_{i}}$ as the out-degree normalised graph Laplacian of $g_{i}$. Here, the out-degree is defined with respect to the direction from leaves to the center node in $g_{i}$. Similarly, denote $\tilde{L}_{g_{i}}$ as the in-degree normalised graph Laplacian of $g_{i}$, where the direction is from the center to the leaves.

WLOG, we write the $\ell$th layer embedding in matrix notation of the following form

[z_{i}^{(\ell)}]_{k-\ell+1}=\sigma([\phi(L_{g_{i}})]_{k-\ell+1}[z_{i}^{(\ell-1)}]_{k-\ell+1}\theta^{(\ell)}),

where the GCN only updates the embeddings of nodes in the 0 to $(k-\ell)$th hops. We also implicitly assume the embeddings of nodes in the $(k-\ell+1)$ to $k$th hops are unchanged through the update, due to the directed nature of $g_{i}$. Hence, we obtain $z_{i}\equiv[z_{i}^{(k)}]_{0}$ from the following

[z_{i}^{(k)}]_{1}=\sigma([\phi(L_{g_{i}})]_{1}[z_{i}^{(k-1)}]_{1}\theta^{(k)}).

Similarly, we are able to write down the form of the discriminator using the matrix representation of GCN. The edge information at the $\ell$th layer for nodes in $V(g_{i})$ can be described as follows,

[h_{i}^{(\ell)}]=\text{ReLU}(\phi(\tilde{L}_{g_{i}})[h_{i}^{(\ell-1)}]\tilde{\theta}^{(\ell)}).

A.2 Proof for Theorem 4.1

We restate Theorem 4.1 from the main paper below.

Theorem A.2.

Let $G_{a}=\{(g_{i},x_{i})\}_{i=1}^{n}$ and $G_{b}=\{(g_{i^{\prime}},x_{i^{\prime}})\}_{i^{\prime}=1}^{m}$ be two graphs whose node features are structure-respecting, with $x_{i}=f(L_{g_{i}})$ and $x_{i^{\prime}}=f(L_{g_{i^{\prime}}})$ for some function $f:\mathbb{R}^{|V(g_{i})|\times|V(g_{i})|}\to\mathbb{R}^{d}$. Consider a GCN $\Psi_{\theta}$ with $k$ layers and a 1-hop polynomial filter $\phi$. The empirical performance difference of $\Psi_{\theta}$ with $\mathcal{L}_{\textsc{Egi}}$ satisfies

|\mathcal{L}_{\textsc{Egi}}(G_{a})-\mathcal{L}_{\textsc{Egi}}(G_{b})|\leq\mathcal{O}\left(\frac{1}{nm}\sum_{i=1}^{n}\sum_{i^{\prime}=1}^{m}\left[M+C\lambda_{\max}(L_{g_{i}}-L_{g_{i^{\prime}}})+\tilde{C}\lambda_{\max}(\tilde{L}_{g_{i}}-\tilde{L}_{g_{i^{\prime}}})\right]\right), \quad (7)

where $M$ is dependent on $\Psi$, $\mathcal{D}$, the node features, and the largest eigenvalue of $L_{g_{i}}$ and $\tilde{L}_{g_{i}}$; $C$ is a constant dependent on the encoder, while $\tilde{C}$ is a constant dependent on the decoder. With a slight abuse of notation, we denote $\lambda_{\max}(A):=\lambda_{\max}(A^{T}A)^{1/2}$. Note that, in the main paper, we have $C:=M+C\lambda_{\max}(L_{g_{i}}-L_{g_{i^{\prime}}})$, and $\Delta_{\mathcal{D}}(G_{a},G_{b}):=\tilde{C}\lambda_{\max}(\tilde{L}_{g_{i}}-\tilde{L}_{g_{i^{\prime}}})$.

Proof.

Now,

|\mathcal{L}_{\textsc{Egi}}(G_{a})-\mathcal{L}_{\textsc{Egi}}(G_{b})|
=\left|\left(\frac{1}{n^{2}}\sum_{i,j=1}^{n}\mathcal{D}(g_{i},z_{j})-\frac{1}{n}\sum_{i=1}^{n}\mathcal{D}(g_{i},z_{i})\right)-\left(\frac{1}{m^{2}}\sum_{i^{\prime},j^{\prime}=1}^{m}\mathcal{D}(g_{i^{\prime}},z_{j^{\prime}})-\frac{1}{m}\sum_{i^{\prime}=1}^{m}\mathcal{D}(g_{i^{\prime}},z_{i^{\prime}})\right)\right|
\leq\frac{1}{n^{2}m^{2}}\sum_{i,j=1}^{n}\sum_{i^{\prime},j^{\prime}=1}^{m}\left|\mathcal{D}(g_{i},z_{j})-\mathcal{D}(g_{i^{\prime}},z_{j^{\prime}})\right|+\frac{1}{nm}\sum_{i=1}^{n}\sum_{i^{\prime}=1}^{m}\left|\mathcal{D}(g_{i},z_{i})-\mathcal{D}(g_{i^{\prime}},z_{i^{\prime}})\right|
=\frac{1}{n^{2}m^{2}}\sum_{i,j=1}^{n}\sum_{i^{\prime},j^{\prime}=1}^{m}A+\frac{1}{nm}\sum_{i=1}^{n}\sum_{i^{\prime}=1}^{m}B.

We make the following assumptions in the proof,

  1. Assume the size of the neighborhood of each node is bounded by $0<r<\infty$; then the maximum number of nodes in the $p$-th hop of a subgraph is bounded by $r^{p}$. WLOG, let $1\leq|V_{p}(g_{i})|\leq|V_{p}(g_{i^{\prime}})|\leq r^{p}$;

  2. Assume $h_{p,q}^{i}\|x^{i}_{p,q}=0$ if $|V_{p}(g_{i})|<q$, i.e., assume non-informative edge information and node features for non-existent nodes in the smaller neighborhood with no links.

From Assumption 2, we add isolated nodes to the smaller neighborhood $V_{p}(g_{i})$ such that the neighborhood sizes at each hop match; this corresponds to the pad_nbhd routine in our code for computing the Egi gap, and is sketched below. For the following proof, we WLOG assume $|V_{p}(g_{i})|=|V_{p}(g_{i^{\prime}})|\;\forall p$.
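For illustration only, the padding of Assumption 2 and the per-pair structural term $\lambda_{\max}(L_{g_{i}}-L_{g_{i^{\prime}}})$ in Theorem A.2 can be sketched as follows. The helper names (pad_to, laplacian_gap) are ours, the undirected normalized Laplacian from networkx is used as a stand-in for the directed out/in-degree normalized Laplacians, and the padding is appended at the end for simplicity rather than within each hop as in the proof; this is not the released pad_nbhd implementation.

```python
import numpy as np
import networkx as nx

def pad_to(L, size):
    # Zero-pad a Laplacian, i.e., append isolated nodes (cf. Assumption 2).
    P = np.zeros((size, size))
    P[:L.shape[0], :L.shape[1]] = L
    return P

def laplacian_gap(g_a, g_b):
    # lambda_max(L_a - L_b) in the notation of Theorem A.2, i.e., the largest
    # singular value of the (padded) difference of ego-graph Laplacians.
    L_a = nx.normalized_laplacian_matrix(g_a).toarray()
    L_b = nx.normalized_laplacian_matrix(g_b).toarray()
    n = max(L_a.shape[0], L_b.shape[0])
    return np.linalg.norm(pad_to(L_a, n) - pad_to(L_b, n), 2)
```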

First we consider $B$. Recall that $V_{p}(g_{i})$ is the set of nodes in hop $p$ of $g_{i}$,

\mathcal{D}(g_{i},z_{i})=\sum\limits_{p=1}^{k}\sum\limits_{q=1}^{|V_{p}(g_{i})|}\log\big(\sigma_{sig}\big(U^{T}\tau\big(W^{T}[f(x^{i})\|z_{i}]\big)\big)\big),

where $\sigma_{sig}(t)=\frac{1}{1+e^{-t}}$ is the sigmoid function, $\tau$ is some $\gamma_{\tau}$-Lipschitz activation function, and $[\cdot\|\cdot]$ denotes the concatenation of two vectors. Then we obtain

U^{T}\tau\left(W^{T}[f(x^{i})\|z_{i}]\right)=U^{T}\tau\left(W_{1}^{T}f(x^{i})+W_{2}^{T}z_{i}\right).

Since $\log(\sigma_{sig}(t))=-\log(1+e^{-t})$, which is $1$-Lipschitz, it gives

B \leq\sum_{p=1}^{k}\Big|\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\sigma_{s}\big(U^{T}\tau\big(W_{1}^{T}f(x^{i})+W_{2}^{T}z_{i}\big)\big)-\sigma_{s}\big(U^{T}\tau\big(W_{1}^{T}f(x^{i^{\prime}})+W_{2}^{T}z_{i^{\prime}}\big)\big)\Big| \quad (8)
\leq\gamma_{\tau}\|U\|_{2}\sum_{p=1}^{k}\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\big(\|W_{1}^{T}f(x^{i})-W_{1}^{T}f(x^{i^{\prime}})\|_{2}+\|W_{2}^{T}z_{i}-W_{2}^{T}z_{i^{\prime}}\|_{2}\big)
\leq\gamma_{\tau}\|U\|_{2}s_{w}\left(\sum_{p=1}^{k}\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\left[\|h^{i}_{p,q}-h^{i^{\prime}}_{p,q}\|_{2}+\|x^{i}_{p,q}-x^{i^{\prime}}_{p,q}\|_{2}\right]+\sum_{p=1}^{k}\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\|z_{i}-z_{i^{\prime}}\|_{2}\right)
\leq C_{1}\left(\sum_{p=1}^{k}\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\left[\|h^{i}_{p,q}-h^{i^{\prime}}_{p,q}\|_{2}+\|x^{i}_{p,q}-x^{i^{\prime}}_{p,q}\|_{2}\right]\Big/\sum_{p=1}^{k}r^{p}+\|z_{i}-z_{i^{\prime}}\|_{2}\right)
=C_{1}\left(I_{1}+I_{2}\right).

We provide the derivation for the unfolding of the $\ell$th GCN layer from the center-node view in Lemma A.3. This will be used in the derivation of $I_{1}$ and $I_{2}$.

Lemma A.3.

For any $\ell=1,\cdots,k$, we have the following upper bound on the hidden representation difference between $g_{i}$ and $g_{i^{\prime}}$,

\|[z_{i}^{(\ell)}]_{k-\ell}-[z_{i^{\prime}}^{(\ell)}]_{k-\ell}\|_{2} \leq(\gamma_{\sigma}c_{\theta})^{\ell}\|\phi(L_{g_{i}})\|_{2}^{\ell}\|[x_{i}]-[x_{i^{\prime}}]\|_{2}+\frac{(\gamma_{\sigma}c_{\theta})^{\ell}\|\phi(L_{g_{i}})\|_{2}^{\ell}+1}{\gamma_{\sigma}c_{\theta}\|\phi(L_{g_{i}})\|_{2}-1}\gamma_{\sigma}c_{\theta}c_{z}\|\phi(L_{g_{i}})-\phi(L_{g_{i^{\prime}}})\|_{2}. \quad (9)

Specifically, for $\ell=k$, we obtain the expansion for the center node embedding $\|[z_{i}^{(k)}]_{0}-[z_{i^{\prime}}^{(k)}]_{0}\|\equiv\|z_{i}-z_{i^{\prime}}\|$.

Proof.

By Lemma A.1, for any $\ell=1,\cdots,k$, the following holds

\|[z_{i}^{(\ell)}]_{k-\ell}-[z_{i^{\prime}}^{(\ell)}]_{k-\ell}\|_{2}\leq\|[z_{i}^{(\ell)}]_{k-\ell+1}-[z_{i^{\prime}}^{(\ell)}]_{k-\ell+1}\|_{2}.

Assume $\max_{\ell}\|[z_{i}^{(\ell)}]\|_{2}\leq c_{z}<\infty\;\forall i$, and $\max_{\ell}\|\theta^{(\ell)}\|_{2}\leq c_{\theta}<\infty$, where $c_{\theta}=\vee_{\ell}s_{\theta^{(\ell)}}$ is the largest singular value across layers.

Then, for $\ell=1,\cdots,k-1$, we have

\|[z_{i}^{(\ell)}]_{k-\ell}-[z_{i^{\prime}}^{(\ell)}]_{k-\ell}\|_{2} \quad (10)
\leq\|[\sigma([\phi(L_{g_{i}})]_{k-\ell+1}[z_{i}^{(\ell-1)}]_{k-\ell+1}\theta^{(\ell)})-\sigma([\phi(L_{g_{i^{\prime}}})]_{k-\ell+1}[z_{i^{\prime}}^{(\ell-1)}]_{k-\ell+1}\theta^{(\ell)})]_{k-\ell}\|_{2}
\leq\gamma_{\sigma}\|[\phi(L_{g_{i}})]_{k-\ell+1}[z_{i}^{(\ell-1)}]_{k-\ell+1}-[\phi(L_{g_{i^{\prime}}})]_{k-\ell+1}[z_{i^{\prime}}^{(\ell-1)}]_{k-\ell+1}\|_{2}\|\theta^{(\ell)}\|_{2}
\leq\gamma_{\sigma}c_{\theta}\|[\phi(L_{g_{i}})]_{k-\ell+1}\|_{2}\|[z_{i}^{(\ell-1)}]_{k-\ell+1}-[z_{i^{\prime}}^{(\ell-1)}]_{k-\ell+1}\|_{2}+\gamma_{\sigma}c_{\theta}\|[z_{i^{\prime}}^{(\ell-1)}]_{k-\ell+1}\|_{2}\|[\phi(L_{g_{i}})]_{k-\ell+1}-[\phi(L_{g_{i^{\prime}}})]_{k-\ell+1}\|_{2}
\leq\gamma_{\sigma}c_{\theta}\|\phi(L_{g_{i}})\|_{2}\|[z_{i}^{(\ell-1)}]_{k-\ell+1}-[z_{i^{\prime}}^{(\ell-1)}]_{k-\ell+1}\|_{2}+\gamma_{\sigma}c_{\theta}c_{z}\|\phi(L_{g_{i}})-\phi(L_{g_{i^{\prime}}})\|_{2},

since $[\phi(L_{g_{i}})]_{k-\ell+1}$ is a principal submatrix of $\phi(L_{g_{i}})$. We can then equivalently write the above recursion as $E_{\ell}\leq bE_{\ell-1}+a$, which gives

E_{\ell}\leq b^{\ell}E_{1}+\frac{b^{\ell}+1}{b-1}a.

With $[x_{i}]=[z_{i}^{(0)}]_{k}$, we see the following is only dependent on the structure of $g_{i}$ and $g_{i^{\prime}}$,

\|[z_{i}^{(\ell)}]_{k-\ell}-[z_{i^{\prime}}^{(\ell)}]_{k-\ell}\|_{2} \leq(\gamma_{\sigma}c_{\theta})^{\ell}\|\phi(L_{g_{i}})\|_{2}^{\ell}\|[x_{i}]-[x_{i^{\prime}}]\|_{2}+\frac{(\gamma_{\sigma}c_{\theta})^{\ell}\|\phi(L_{g_{i}})\|_{2}^{\ell}+1}{\gamma_{\sigma}c_{\theta}\|\phi(L_{g_{i}})\|_{2}-1}\gamma_{\sigma}c_{\theta}c_{z}\|\phi(L_{g_{i}})-\phi(L_{g_{i^{\prime}}})\|_{2}.

Since the graph Laplacians are normalised, we have $\|\phi(L_{g_{i}})\|_{2}\leq c_{L}<\infty\;\forall i$. In addition, let

\|x_{p,q}^{i}-x_{p,q}^{i^{\prime}}\|_{2}\leq\sup_{i}\sup_{p,q}\|x_{p,q}^{i}-x_{p,q}^{i^{\prime}}\|_{2}=\sup_{i}\|f(L_{g_{i}})-f(L_{g_{i^{\prime}}})\|_{2}:=\delta_{x}.

Hence, $\|[x_{i}]-[x_{i^{\prime}}]\|_{2}\leq\delta_{x}(\sum_{p=1}^{k}r^{p})^{1/2}:=c_{x}$. From Lemma A.3, it is clear that we obtain the following at the final layer

I_{2}=\|z_{i}-z_{i^{\prime}}\|_{2} \leq(\gamma_{\sigma}c_{\theta}c_{L})^{k}c_{x}+\frac{(\gamma_{\sigma}c_{\theta}c_{L})^{k}+1}{\gamma_{\sigma}c_{\theta}c_{L}-1}\gamma_{\sigma}c_{\theta}c_{z}\|\phi(L_{g_{i}})-\phi(L_{g_{i^{\prime}}})\|_{2} \quad (11)
\leq C(Mc_{x}+\|L_{g_{i}}-L_{g_{i^{\prime}}}\|_{2})
=C(Mc_{x}+\lambda_{\max}(L_{g_{i}}-L_{g_{i^{\prime}}})^{1/2}),

since $\phi$ is a linear function of $L$. Indeed, this can be generalised to polynomial filters $\phi$ of higher order.

Now, consider the term related to the discriminator $\mathcal{D}$,

I_{1}=\sum_{p=1}^{k}\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\left[\|h^{i}_{p,q}-h^{i^{\prime}}_{p,q}\|_{2}+\|x^{i}_{p,q}-x^{i^{\prime}}_{p,q}\|_{2}\right]\Big/\sum_{p=1}^{k}r^{p}.

Firstly, we denote $\tilde{L}_{p,q}$ as the in-degree graph Laplacian derived from the subgraph $g_{q}$ of $g_{i}$ centered at $q\in V_{p}(g_{i})$. Different from the encoder, we utilize every node's hidden embedding in this computation. Specifically, $g_{q}$ is obtained by retrieving the links in $g_{i}$ that connect to the $q$th node in the $p$th hop. This is a principal submatrix of the in-degree graph Laplacian $\tilde{L}_{g_{i}}$ of $g_{i}$.

Just as defined in §A.1, we denote $[h_{q}^{(p)}]_{\ell}$ as the $p$th-layer GCN embedding for nodes in hops 0 to $\ell\in[0,p]$ of $g_{q}$. Note that in this case, $[h_{q}^{(p)}]_{0}=h_{q}^{(p)}$, which is one row of $[h_{i}^{(p)}]$, corresponding to the $q$-th node in the neighborhood. So we may write the first term in $I_{1}$ as

\sum_{p=1}^{k}\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\|h_{q}^{(p)}-h_{q^{\prime}}^{(p)}\|,

where $h_{q^{\prime}}^{(p)}:=h_{p,q}^{i^{\prime}}$ for short. In this way, we regard each node $q\in V_{p}(g_{i})$ as the center node, which allows us to unfold the convolution similarly as in expanding the $I_{2}$ term. Now, for any $q\in V_{k}(g_{i})$, i.e., when $p=k$, we apply Lemma A.3 similarly as for $\|z_{i}-z_{i^{\prime}}\|_{2}$. Then,

\|h_{q}^{(k)}-h_{q^{\prime}}^{(k)}\| \leq(\gamma_{\sigma}c_{\tilde{\theta}}c_{\tilde{L}})^{k}c_{x}+\frac{(\gamma_{\sigma}c_{\tilde{\theta}}c_{\tilde{L}})^{k}+1}{\gamma_{\sigma}c_{\tilde{\theta}}c_{\tilde{L}}-1}\gamma_{\sigma}c_{\tilde{\theta}}c_{h}\|\phi(\tilde{L}_{k,q})-\phi(\tilde{L}_{k,q^{\prime}})\|_{2}
\leq\tilde{C}_{k}(\tilde{M}_{k}c_{x}+\|\phi(\tilde{L}_{g_{i}})-\phi(\tilde{L}_{g_{i^{\prime}}})\|_{2}),

where $\tilde{L}_{p,q}$ is a principal submatrix of $\tilde{L}_{g_{i}}$ and Lemma A.1 is applied in the last inequality. In addition, $\tilde{C}_{k}$ and $\tilde{M}_{k}$ are taken to be the maximum over any $q\in V_{k}(g_{i})$. In general, for $q\in V_{p}(g_{i})$, $0<p<k$, we have

\|h_{q}^{(p)}-h_{q^{\prime}}^{(p)}\|_{2}\leq\tilde{C}_{p}(\tilde{M}_{p}c_{x}+\|\phi(\tilde{L}_{g_{i}})-\phi(\tilde{L}_{g_{i^{\prime}}})\|_{2}).

Taking a common upper bound for $\tilde{C}_{p},\tilde{M}_{p}$ over $0<p\leq k$, we obtain

\sum_{p=1}^{k}\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\|h_{q}^{(p)}-h_{q^{\prime}}^{(p)}\|\Big/\sum_{p=1}^{k}r^{p} \leq\tilde{C}(\tilde{M}c_{x}+\|\tilde{L}_{g_{i}}-\tilde{L}_{g_{i^{\prime}}}\|_{2})=\tilde{C}(\tilde{M}c_{x}+\lambda_{\max}(\tilde{L}_{g_{i}}-\tilde{L}_{g_{i^{\prime}}})^{1/2}).

In addition, for the other half of $I_{1}$, we have

\sum_{p=1}^{k}\sum_{q=1}^{|V_{p}(g_{i^{\prime}})|}\|x_{p,q}^{i}-x_{p,q}^{i^{\prime}}\|_{2}\Big/\sum_{p=1}^{k}r^{p}\leq\sup_{i}\sup_{p,q}\|x_{p,q}^{i}-x_{p,q}^{i^{\prime}}\|_{2}=\delta_{x}=c_{x}\Big/\big(\sum_{p=1}^{k}r^{p}\big)^{1/2}.

We can thus write $B$ in terms of the weights $C$ and $\tilde{C}$, which are dependent on the activation function $\sigma$, $k$, and $\sup_{i}\lambda_{\max}(L_{g_{i}})$. Hence,

B \leq(CM+\tilde{C}\tilde{M}+1/(\sum_{p=1}^{k}r^{p}))c_{x}+C\lambda_{\max}(L_{g_{i}}-L_{g_{i^{\prime}}})+\tilde{C}\lambda_{\max}(\tilde{L}_{g_{i}}-\tilde{L}_{g_{i^{\prime}}})
=M^{\prime}c_{x}+C\lambda_{\max}(L_{g_{i}}-L_{g_{i^{\prime}}})+\tilde{C}\lambda_{\max}(\tilde{L}_{g_{i}}-\tilde{L}_{g_{i^{\prime}}}).

Note that the bound on $I_{1}$ derived for $B$ is the same for $A$, since the node features, edge information and embedded features are bounded by separate terms in Eq. 8. The only difference is in $I_{2}$, where a different set of graph Laplacians $L_{g_{j}},\,L_{g_{j^{\prime}}}$ and node features $(x_{j})$ are used. Therefore,

A \leq M^{\prime}c_{x}+C\lambda_{\max}(L_{g_{j}}-L_{g_{j^{\prime}}})+\tilde{C}\lambda_{\max}(\tilde{L}_{g_{i}}-\tilde{L}_{g_{i^{\prime}}}).

Hence the result. ∎

Note that our view of structural information is closely related to graph kernels [4] and graph perturbation [55]. Specifically, our definition of the k-hop ego-graph is motivated by the concept of the k-layer expansion sub-graph in [4]. However, [4] used the Jensen-Shannon divergence between pairwise representations of sub-graphs to define a depth-based sub-graph kernel, while we depict $G$ as samples of its ego-graphs. In this sense, our view is related to the setup in [55], which derived a uniform algorithmic stability bound of a 1-layer GNN under 1-hop structure perturbation of $G$.

In the setting of domain adaptation, [7] draws a connection between the difference in the distributions of the source and target domains and model transferability, and learns a transferable model by minimizing such distribution differences. This coincides with our approach in the above theory, which connects the structural difference between two graphs, in terms of their k-hop ego-graph distributions, to the transferability of GNNs.

Appendix B Model Details

Following the same notations used in the main paper, Egi consists of a GNN encoder $\Psi$ and a GNN discriminator $\mathcal{D}$. In general, the GNN encoder $\Psi$ and discriminator $\mathcal{D}$ can be any existing GNN models. For each ego-graph and its node features $\{g_{i},x_{i}\}$, the GNN encoder returns the node embedding $z_{i}$ of the center node $v_{i}$. As mentioned in Eq. 2 in the main paper, the GNN discriminator $\mathcal{D}$ makes edge-level predictions as follows,

\mathcal{D}(e_{\tilde{v}v}|h_{p,q}^{\tilde{q}},x^{i}_{p,q},z_{i})=\sigma\left(U^{T}\cdot\tau\left(W^{T}[h_{p,q}^{\tilde{q}}\|x^{i}_{p,q}\|z_{i}]\right)\right), \quad (12)

where $e_{\tilde{v}v}\in E(g_{i})$ and $h_{p,q}^{\tilde{q}}\in\mathbb{R}^{d}$ (simplified as $h_{p}$ in the main paper, and likewise $x^{i}_{p,q}=x_{q}$) is the representation of the edge $e_{\tilde{v}v}$ between node $v_{p-1,\tilde{q}}$ in hop $p-1$ and node $v_{p,q}$ in hop $p$. The prediction relies on the combination of the center node embedding $z_{i}$, the destination node feature $x^{i}_{p,q}$ and the source node representation $h_{p,q}^{\tilde{q}}$. We now describe how the source node representation in $\mathcal{D}$ is calculated.

Input: The GNN encoder $\Psi$ and the GNN discriminator $\mathcal{D}$; k-hop ego-graphs and features $\{g_{i},x_{i}\}$.
/* EGI training starts */
while $\mathcal{L}_{\textsc{Egi}}$ has not converged do
      Sample M ego-graphs $\{(g_{1},x_{1}),...,(g_{M},x_{M})\}$ from the empirical distribution $\mathbb{P}$ without replacement, and obtain their positive and negative node embeddings $z_{i},z_{i}^{\prime}$ through $\Psi$:
            $z_{i}=\Psi(g_{i},x_{i}),\;z_{i}^{\prime}=\Psi(g_{i}^{\prime},x_{i}^{\prime})$
      /* Initialize the positive and negative expectations in Eq. 1 of the main paper */
      $E_{\text{pos}}=0,\;E_{\text{neg}}=0$
      for p = 1 to $k$ do
            /* Compute JSD on edges at each hop */
            for $e_{(p-1,\tilde{q})(p,q)}\in E(g_{i})$ do
                  generate the source node embedding $h_{p,q}^{\tilde{q}}$ via Eq. 13;
                  $E_{\text{pos}}=E_{\text{pos}}+\sigma\left(U^{T}\cdot\tau\left(W^{T}[h_{p,q}^{\tilde{q}}\|x^{i}_{p,q}\|z_{i}]\right)\right)$
                  $E_{\text{neg}}=E_{\text{neg}}+\sigma\left(U^{T}\cdot\tau\left(W^{T}[h_{p,q}^{\tilde{q}}\|x^{i}_{p,q}\|z_{i}^{\prime}]\right)\right)$
            end for
      end for
      /* Compute the batch loss */
      $\mathcal{L}_{\text{EGI}}=E_{\text{neg}}-E_{\text{pos}}$
      /* Update $\Psi$, $\mathcal{D}$ */
      $\theta_{\Psi}\xleftarrow{+}-\nabla_{\Psi}\mathcal{L}_{\text{EGI}}$, $\theta_{\mathcal{D}}\xleftarrow{+}-\nabla_{\mathcal{D}}\mathcal{L}_{\text{EGI}}$
end while

Algorithm 1: Pseudo code for training Egi

To obtain the source node representations $h$, the GNN in the discriminator $\mathcal{D}$ operates on a reversed ego-graph $\tilde{g_{i}}$, while the encoder $\Psi$ performs forward propagation on $g_{i}$. The discriminator GNN starts from the center node $v_{i}$ and computes the hidden representation $m_{p-1,\tilde{q}}$ for node $v_{p-1,\tilde{q}}$ at each hop. We denote the source nodes at hop $p-1$ as $\tilde{q}\in\tilde{Q}_{p,q}$, where $\tilde{Q}_{p,q}=\{\tilde{q}:v_{p-1,\tilde{q}}\in V_{p-1}(g_{i}),e_{(p-1,\tilde{q})(p,q)}\in E(g_{i})\}$. Although $h_{p,q}$ is calculated as a node embedding, each node in the reversed ego-graph $\tilde{g_{i}}$ has only one incoming edge. Thus, we can also interpret $h_{p,q}^{\tilde{q}}$ as an edge embedding, as it combines the source node's hidden representation $m_{p-1,\tilde{q}}$ and the destination node feature $x_{p,q}$ as follows,

h_{p,q}^{\tilde{q}}=\text{ReLU}\left(W_{p}^{T}\left(m_{p-1,\tilde{q}}+x^{i}_{p,q}\right)\right),\quad m_{p-1,\tilde{q}}=\frac{1}{|\tilde{Q}_{p-1,\tilde{q}}|}\sum_{q^{\prime}\in\tilde{Q}_{p-1,\tilde{q}}}h_{p-1,\tilde{q}}^{q^{\prime}} \quad (13)

When $p=1$, every edge originates from the center node $v_{i}$ and $m_{0,q^{\prime}}$ is the center node feature $x_{v_{i}}$. Note that the elaborated aggregation rule is equivalent to the layer-wise propagation rule (with a different in-degree matrix for each $h_{p,q}$) of Egi described earlier in §A.1.
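For illustration, the edge-level scoring of Eq. 12 can be sketched in PyTorch as below; the module and argument names (EdgeDiscriminator, h_src, x_dst, z_center) are ours, PReLU stands in for the Lipschitz activation $\tau$, and the hop-wise message passing of Eq. 13 is assumed to have produced h_src already, so this is a sketch rather than the released implementation.

```python
import torch
import torch.nn as nn

class EdgeDiscriminator(nn.Module):
    """Sketch of Eq. 12: sigma(U^T tau(W^T [h_src || x_dst || z_center]))."""
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(3 * d, d)   # W^T acting on the concatenated vector
        self.U = nn.Linear(d, 1)       # U^T producing a scalar score per edge
        self.tau = nn.PReLU()          # Lipschitz activation tau

    def forward(self, h_src, x_dst, z_center):
        s = torch.cat([h_src, x_dst, z_center], dim=-1)
        return torch.sigmoid(self.U(self.tau(self.W(s))))
```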

In every batch, we sample a set of ego-graphs and their node features $\{g_{i},x_{i}\}$. During the forward pass of the encoder $\Psi$, it aggregates from the neighbor nodes to the center node $v_{i}$. Then, the discriminator calculates the edge embeddings of Eq. 13 from the center node $v_{i}$ to its neighbors and makes edge-level predictions (fake or true). Besides the training framework in Figure 2 of the main paper, the Egi training procedure is depicted in Algorithm 1.

We implement our method and all of the baselines using the same encoders $\Psi$: a 2-layer GIN [60] for the synthetic and role identification experiments, and a 2-layer GraphSAGE [15] for the relation prediction experiments. We set the hidden dimension to 32 for both the synthetic and role identification experiments, and to 256 for the relation prediction fine-tuning task. We train Egi in a mini-batch fashion since all the information for the encoder and discriminator is within the k-hop ego-graph $g_{i}$ and its features $x_{i}$. Further, we conduct neighborhood sampling with a maximum of 10 neighbors to speed up the parallel training. The space and time complexity of Egi is $O(BN^{k})$, where $B$ is the batch size, $N$ is the number of neighbors and $k$ is the number of hops of the ego-graphs. Notice that both the encoder $\Psi$ and discriminator $\mathcal{D}$ propagate messages on the k-hop ego-graphs, so the extra computation cost of $\mathcal{D}$ compared with a common GNN module is a constant multiplier over the original one. The scalability of Egi on the million-scale YAGO network is reported in Section C.3.
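A condensed PyTorch-style sketch of the training loop in Algorithm 1 is shown below; encoder, disc and the in-batch shuffling used to form negative ego-graph/center pairs are our own simplifications (disc(g, x, z) is assumed to return the summed Eq. 12 scores over all hops of g), so this illustrates the objective rather than reproducing the released code.

```python
import torch

def egi_step(encoder, disc, ego_batch, optimizer):
    """One mini-batch update of L_EGI = E_neg - E_pos (cf. Algorithm 1), as a sketch.
    `ego_batch` is a list of (g_i, x_i) ego-graphs; `disc(g, x, z)` is assumed to
    return the summed edge scores of Eq. 12 over all hops of g."""
    z = torch.stack([encoder(g, x) for g, x in ego_batch])   # positive center embeddings
    z_neg = z[torch.randperm(z.size(0))]                     # negatives: mismatched centers
    e_pos = sum(disc(g, x, z[i]) for i, (g, x) in enumerate(ego_batch))
    e_neg = sum(disc(g, x, z_neg[i]) for i, (g, x) in enumerate(ego_batch))
    loss = e_neg - e_pos
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```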

B.1 Transfer Learning Settings

The goal of transfer learning is to train a model on a dataset or task, and use it on another. In our graph learning setting, we focus on training the model on one graph and using it on another. In particular, we focus our study on the setting of unsupervised-transfering, where the model learned on the source graph is directly applied on the target graph without fine-tuning. We study this setting because it allows us to directly measure the transferability of GNNs, which is not affected by the fine-tuning process on the target graph. In other words, the fine-tuning process introduces significant uncertainty to the analysis, because there is no guarantee on how much the fine-tuned GNN differs from the pre-trained one. Depending on the specific tasks and label distributions on the two graphs, the fine-tuned GNN might be quite similar to the pre-trained one, or it can be significantly different. It is then very hard to analyze how much the pre-trained GNN itself is able to help. Another reason is efficiency. The fine-tuning of GNNs requires the same environment set-up and computation resources as training GNNs from scratch, although it may take less training time eventually if pre-training is effective. It is thus appealing to eliminate this whole process when the performance of unsupervised-transfering can be guaranteed.

In our experiments, we also study the setting of transfer learning with fine-tuning, particularly on the real-world large-scale YAGO graphs. Since we aim to study the general transferability of GNNs not bounded to specific tasks, we always pre-train GNNs with the unsupervised pre-training objective on the source graphs. Then we enable two types of fine-tuning. The first one is post-fine-tuning ($\mathcal{L}=\mathcal{L}_{s}$), where the pre-trained GNNs are fine-tuned with the supervised task-specific objective $\mathcal{L}_{s}$ on the target graphs. The second one is joint-fine-tuning ($\mathcal{L}=\mathcal{L}_{s}+\mathcal{L}_{u}$), where pre-training is the same, but fine-tuning is done w.r.t. both the pre-training objective $\mathcal{L}_{u}$ and the task-specific objective $\mathcal{L}_{s}$ on the target graphs in a semi-supervised learning fashion. The unsupervised pre-training objective $\mathcal{L}_{u}$ of Egi is Algorithm 1, while those of the compared algorithms are as defined in their papers. The supervised fine-tuning objective $\mathcal{L}_{s}$ is the same as in the DistMult paper [61] for all algorithms.

Appendix C Additional Experiment Details

C.1 Synthetic Experiments

Data. As mentioned in the main paper, we use two traditional graph generation models for synthetic data generation: (1) the Barabasi-Albert graph [5] and (2) the forest-fire graph [32]. We generate 40 graphs of 100 nodes each with each model. We control the parameters of the two models so that they generate graphs with different ego-graph distributions. Specifically, we set the number of attached edges to 2 for the Barabasi-Albert model and set $p_{\text{forward}}=0.4$, $p_{\text{backward}}=0.3$ for the forest-fire model. In Figure 4(a) and 4(b), we show example graphs from the two families in our datasets. They have the same size but different appearance, which leads to our study on the transferability gaps $\Delta_{\mathcal{D}}$(F, F) and $\Delta_{\mathcal{D}}$(B, F) in Table 1 of the main paper. The accuracy of this task is defined as the percentage of nearest neighbors of the target node in the embedding space $z=\Psi(\cdot)$ that are structure-equivalent, i.e., #correct k-nn neighbors / #ground truth equivalent nodes.

Figure 4: Visualizations of the graphs and labels we use in the synthetic experiments (a, b: example graphs from the two families; c: structural-equivalence labels).

Results. The structural equivalence labels are obtained by a 2-hop WL-test [58] on the ego-graphs. If two nodes have the same 2-hop ego-graphs, they are assigned the same label. In the example of Figure 4(c), the nodes labeled with the same number (e.g., 2, 4) have isomorphic 2-hop ego-graphs. Note that this task is exactly solvable when the node features and GNN architectures are powerful enough, as with GIN [60]. In order to show the performance differences among methods, we set the length of the one-hot node degree encoding to 3 (all nodes with degrees higher than 3 share the same encoding). Here, we present the performance comparison with different lengths of degree encoding (d) in Table 4. When the capacity of the initial node features is high (d=10), the transfer learning gap diminishes between different methods and different graphs, because the structural equivalence problem can be exactly solved by neighborhood aggregation. However, when the information in the initial node features is limited, the advantage of Egi in learning and transfering the graph structural information is obvious. In Table 5, we also show the performance of the different transferable and non-transferable features discussed after Definition 4.3 in the main paper, i.e., node embeddings [42] and random feature vectors. The observation is similar to Table 1 in the main paper: the transferable features can reflect the performance gap between similar and dissimilar graphs, while the non-transferable features cannot.
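The 2-hop WL labels described above can be reproduced with a sketch like the following, assuming networkx (its ego_graph and weisfeiler_lehman_graph_hash utilities); the function name wl_structure_labels is ours, and the WL graph hash is used as a practical proxy for the pairwise 2-hop WL-test.

```python
import networkx as nx

def wl_structure_labels(G, radius=2, iterations=2):
    """Assign the same label to nodes whose `radius`-hop ego-graphs are
    indistinguishable by the WL graph hash (a proxy for the 2-hop WL-test)."""
    hashes = {
        v: nx.weisfeiler_lehman_graph_hash(nx.ego_graph(G, v, radius=radius),
                                           iterations=iterations)
        for v in G.nodes()
    }
    ids = {h: i for i, h in enumerate(sorted(set(hashes.values())))}
    return {v: ids[h] for v, h in hashes.items()}

# Example: structural-equivalence labels for a 100-node Barabasi-Albert graph.
labels = wl_structure_labels(nx.barabasi_albert_graph(100, 2, seed=0))
```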

In both Tables 4 and 5 here, as well as Table 1 in the main paper, we report the average structural difference between graphs in the two sets, calculated w.r.t. the term $\Delta_{\mathcal{D}}(G_{a},G_{b})$ on the RHS of Theorem 4.1 in the main paper. This indicates that the forest-fire graphs are structurally similar to other forest-fire graphs, while less similar to the Barabasi graphs, as can be verified from Figure 4(a) and 4(b). Our bound in Theorem 4.1 then tells us that GNNs (in particular, Egi) should be more transferable in the F-F case than in the B-F case. This is verified in Tables 4 and 5 when using the transferable node features of degree encoding with limited dimension (d=3) as well as the DeepWalk embedding, as Egi pre-trained on forest-fire graphs performs significantly better on forest-fire graphs than Egi pre-trained on Barabasi graphs (with +0.094 and +0.057 differences, respectively).

Table 4: Synthetic experiments on identifying structurally equivalent nodes with different degree encoding dimensions.
Method | F-F (d=3) | B-F (d=3) | δ(acc.) | F-F (d=10) | B-F (d=10) | δ(acc.) | $\Delta_{\mathcal{D}}$(F,F) | $\Delta_{\mathcal{D}}$(B,F)
GCN (untrained) | 0.478 | 0.478 | / | 0.940 | 0.940 | / | 0.752 | 0.883
GIN (untrained) | 0.572 | 0.572 | / | 0.940 | 0.940 | / | |
VGAE (GIN) | 0.498 | 0.432 | +0.066 | 0.939 | 0.937 | 0.002 | |
DGI (GIN) | 0.578 | 0.591 | -0.013 | 0.939 | 0.941 | -0.002 | |
Egi (GIN) | 0.710 | 0.616 | +0.094 | 0.942 | 0.942 | 0 | |
Table 5: Synthetic experiments on identifying structurally equivalent nodes with different transferable and non-transferable features.
Method | F-F (DeepWalk) | B-F (DeepWalk) | δ(acc.) | F-F (random) | B-F (random) | δ(acc.) | $\Delta_{\mathcal{D}}$(F,F) | $\Delta_{\mathcal{D}}$(B,F)
GCN (untrained) | 0.658 | 0.658 | / | 0.246 | 0.246 | / | 0.752 | 0.883
GIN (untrained) | 0.663 | 0.663 | / | 0.520 | 0.520 | / | |
GVAE (GIN) | 0.713 | 0.659 | +0.054 | 0.266 | 0.264 | 0.002 | |
DGI (GIN) | 0.640 | 0.613 | +0.027 | 0.512 | 0.576 | -0.064 | |
Egi (GIN) | 0.772 | 0.715 | +0.057 | 0.507 | 0.485 | +0.022 | |

C.2 Real-world Role Identification Experiments

Data.

Table 6: Overall dataset statistics.
Dataset | # Nodes | # Edges | # Classes
Europe | 399 | 5,995 | 4
USA | 1,190 | 13,599 | 4
Brazil | 131 | 1,074 | 4
Gene | 9,228 | 57,029 | 2

We report the number of nodes, edges and classes for both the airport and gene datasets. The numbers for the Gene dataset are aggregated over all 52 gene networks it contains. For the three airport networks, Figure 5 shows the power-law degree distributions on a log-log scale. The class labels range from 0 to 3, reflecting the level of airport activity [45]. For the Gene dataset, we matched the gene names in the TCGA dataset [68] to the list of transcription factors on Wikipedia (https://en.wikipedia.org/wiki/Transcription_factor). 75% of the genes are marked as 1 (transcription factors), and some gene graphs have extremely imbalanced class distributions, so we conduct experiments on the relatively balanced gene graphs of brain cancers (Figure 2 in the main paper). Neither dataset has organic node attributes. The role-based node labels are highly relevant to the local graph structures, but are not trivially computable, e.g., from node degrees.

Figure 5: Visualizations of the power-law degree distributions of the three airport datasets.

Results. As we can observe from Figure 5, the three airport graphs have quite different sizes and structures (e.g., regarding edge density and connectivity patterns). Thus, the absolute classification accuracy in both Table 2 in the main paper and Table 7 here varies across graphs. However, as we mention in the main paper, the structural differences we compute based on Eq. 5 in Theorem 3.1 are close for the Europe-USA and Europe-Brazil graph pairs (0.869 and 0.851), which leads to close transferability of Egi from Europe to USA and Brazil. This indicates the effectiveness of our view of essential structural information. In Table 7, we also provide the comparison between transferable and non-transferable features on the airport datasets. As expected, Egi only yields good transferability with transferable features.

Table 7: Results of role identification with direct-transfering on the Airport dataset (Table 2, main paper). The reported performance (%) is averaged over 100 runs. The "same feat." columns use identical node features for all nodes as the non-transferable feature.
Method | Europe (source): node degree | Europe (source): same feat. | USA (target): node degree | USA (target): same feat. | Brazil (target): node degree | Brazil (target): same feat.
features | 52.81 | 20.59 | 55.67 | 20.22 | 67.11 | 19.63
GIN (untrained) | 55.75 | 53.88 | 61.56 | 58.32 | 70.04 | 70.37
GVAE | 53.90 | 21.12 | 55.51 | 22.39 | 66.33 | 17.70
DGI | 57.75 | 22.13 | 54.90 | 21.76 | 67.93 | 18.78
MaskGNN | 56.37 | 55.53 | 60.82 | 54.64 | 66.71 | 74.54
ContextPredGNN | 52.69 | 49.95 | 50.38 | 54.75 | 62.11 | 70.66
Structural Pre-train | 56.00 | 53.83 | 62.17 | 57.49 | 68.78 | 72.41
MVC | 53.16 | 51.69 | 59.66 | 50.42 | 66.07 | 61.55
GMI | 58.12 | 46.25 | 59.28 | 47.64 | 73.07 | 62.96
Egi (GIN) | 59.15∗∗ | 54.98 | 64.55∗∗ | 57.40 | 73.15 | 70.00

Besides that, the results presented in Table 8 are the accuracies of GNNs directly trained and evaluated on each network without transfering. Therefore, only the Europe column has the same results as in Table 2 in the main paper, while the USA and Brazil columns can be regarded as providing upper-bound performance for GNNs transfered from other graphs. As we can see, Egi gives the results from Table 2 (main paper) that are closest to Table 8 here, demonstrating its plausible transferability. The scores are close enough to suggest the possibility of skipping fine-tuning when the source and target graphs are similar enough. Also note that, although the variances are pretty large (which is also observed in other works such as [45], since the networks are small), our t-tests have shown the improvements of Egi to be significant.

Table 8: Role identification of structurally similar nodes on the real-world Airport networks [45]. The reported performance is the average and standard deviation over 10 runs. Our classification accuracy on all three datasets passed the t-test (p < 0.01) against the second best result in the table.
Method | Europe | USA | Brazil
node degree | 52.81% ± 5.81% | 55.67% ± 3.63% | 67.11% ± 7.58%
GCN (random-init) | 52.96% ± 4.51% | 56.18% ± 3.82% | 55.93% ± 1.38%
GIN (random-init) | 55.75% ± 5.84% | 62.77% ± 2.35% | 69.26% ± 9.08%
GVAE (GIN) | 53.90% ± 4.65% | 58.99% ± 2.44% | 55.56% ± 6.83%
DGI (GIN) | 57.75% ± 4.47% | 62.44% ± 4.46% | 68.15% ± 6.24%
Mask-GIN | 56.37% ± 5.07% | 63.78% ± 2.79% | 61.85% ± 10.74%
ContextPred-GIN | 52.69% ± 6.12% | 56.22% ± 4.05% | 58.52% ± 10.18%
Structural Pre-train | 56.00% ± 4.58% | 62.29% ± 3.51% | 71.48% ± 9.38%
MVC | 53.16% ± 4.07% | 62.81% ± 3.12% | 67.78% ± 4.79%
GMI | 58.12% ± 5.28% | 63.36% ± 2.92% | 73.70% ± 4.21%
Egi (GIN) | 59.15% ± 4.44% | 65.88% ± 3.65% | 74.07% ± 5.49%
Table 9: Dataset statistics and running time of Egi.
Dataset | # Nodes | # Edges | # Relations | # Train/Test | Training time per epoch
YAGO-Source | 579,721 | 2,191,464 | / | / | 338 seconds
YAGO-Target | 115,186 | 409,952 | 24 | 480/409,472 | 134 seconds

C.3 Real-world large-scale Relation Prediction Experiments

Data. As shown in Table 9, the source graph we use to pre-train GNNs is the full graph cleaned from the YAGO dump [49], where we assume the relations among entities are unknown. The target graph is a subgraph uniformly sampled from the same YAGO dump (we sample the nodes and then include all edges among the sampled nodes). A similar ratio between the numbers of nodes and edges can be observed in Table 9. On the target graph, we also have access to 24 different relations [48], such as isAdvisedBy and isMarriedTo. Such relation labels are still relevant to the graph structures, but the relevance is lower than that of the structural role labels. We use 256-dim degree encodings as node features for pre-training on the source graph, and the 128-dim positional embeddings generated by LINE [51] for fine-tuning on the target graph, to explicitly make the features differ across the source and target graphs.

Results. In Section B.1, we introduced two different types of fine-tuning, i.e., post-fine-tuning and joint-fine-tuning. For both types, we add one feature encoder $\mathcal{E}$ before feeding the features into the GNNs, for two purposes. First, the target-graph fine-tuning features usually have different dimensions from the pre-training features, such as the node degree encoding we use. Second, the semantics and distributions of the fine-tuning features can differ from the pre-training features. The feature encoder aims to bridge such feature differences in practice. The supervised loss used in this experiment is the same as in DistMult [61]. In particular, the bilinear score function is calculated as $s(h,r,t)=z_{h}^{T}M_{r}z_{t}$, where $M_{r}$ is a diagonal matrix for each relation $r$, and $z_{h}$ and $z_{t}$ are the embeddings from the GNN encoder $\Psi$ for the head and tail entities. The experiments were run on a GTX 1080 with 12GB of memory. We report the average training time per epoch of our algorithm in the pre-training and fine-tuning stages in Table 9 as well; Egi takes about 338 seconds per epoch to optimize the ego-graph information maximization objective on YAGO-Source. The pre-training and fine-tuning take about 40 epochs and 10 epochs to converge, respectively. As we can see, fine-tuning also takes significant time compared to pre-training, which strengthens our argument about avoiding or reducing fine-tuning through structural analysis. We implement all baselines within the same pipeline, and their running times are all on the same scale.
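For reference, the bilinear DistMult score $s(h,r,t)=z_{h}^{T}M_{r}z_{t}$ with a diagonal $M_{r}$ can be written as the short PyTorch sketch below; the class name RelationScorer and the use of an embedding table for the relation diagonals are our own choices, not the exact fine-tuning code.

```python
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    """Sketch of the DistMult score s(h, r, t) = z_h^T M_r z_t with diagonal M_r."""
    def __init__(self, num_relations, d):
        super().__init__()
        self.rel_diag = nn.Embedding(num_relations, d)   # diagonal of M_r per relation

    def forward(self, z_head, rel_ids, z_tail):
        # Element-wise product implements the diagonal bilinear form.
        return (z_head * self.rel_diag(rel_ids) * z_tail).sum(dim=-1)
```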

C.4 Parameter study

In this section, we provide additional parameter analysis of the proposed EGI model (the choice of $k$) and an efficiency study of the EGI gap $\Delta_{\mathcal{D}}$ (sampling frequencies).

Performance with different sizes of ego-graphs. In Theorem 3.1 and the EGI algorithm (Eq. 1), the number of hops $k$ determines the size of the ego-graphs. In principle, $k$ may affect the transferability of EGI in two ways: (1) a larger $k$ may make the EGI model (both the center node encoder $\Psi$ and the neighbor node encoder $\Phi$) more expressive (better precision) and the EGI gap $\Delta_{\mathcal{D}}$ more accurate (better predictiveness); (2) however, the GNN encoders may suffer from the over-smoothing problem and the computations may suffer from more noise. Therefore, it is hard to determine the influence of $k$ without empirical analysis. As we can observe in Table 10, when $k=1$ or $k=3$, the classification accuracy on the source graph is worse than with $k=2$, likely because the GNN encoder is either less powerful or over-smoothed. As a result, $k=2$ obtains the best transferability to both the USA and Brazil networks. When $k=3$, $\Delta_{\mathcal{D}}$ likely accounts for too subtle/noisy ego-graph differences and may become less effective in predicting the transferability. Therefore, we choose $k=2$ for the experiments in the main paper.

Table 10: Comparison of EGI with different $k$. Accuracy and the EGI gap $\Delta_{\mathcal{D}}$ are reported.
Method | Europe (source): acc. | USA (target): acc. | USA (target): $\Delta_{\mathcal{D}}$ | Brazil (target): acc. | Brazil (target): $\Delta_{\mathcal{D}}$
EGI (k=1) | 58.25 | 60.08 | 0.385 | 60.74 | 0.335
EGI (k=2) | 59.15 | 64.55 | 0.869 | 73.15 | 0.851
EGI (k=3) | 57.63 | 64.12 | 0.912 | 72.22 | 0.909

Precision of $\Delta_{\mathcal{D}}$ under different sampling frequencies. In Table 11, we present the estimated $\Delta_{\mathcal{D}}$ versus the sampling frequency over 10 runs on the airport datasets. A theoretical study of its convergence could be an interesting future direction. As we can observe, a larger sampling frequency leads to a more accurate and robust estimation of $\Delta_{\mathcal{D}}$. Between Europe and USA, although 100 pairs of ego-graphs amount to only 2.1% of all pairwise enumerations, the estimated $\Delta_{\mathcal{D}}$ is already quite close.

Table 11: EGI gap $\Delta_{\mathcal{D}}$ on the airport datasets with different sampling frequencies.
Sampling frequency | $\Delta_{\mathcal{D}}$(Europe, USA) | $\Delta_{\mathcal{D}}$(Europe, Brazil)
100 pairs | 0.872 ± 0.039 | 0.854 ± 0.042
1000 pairs | 0.859 ± 0.012 | 0.848 ± 0.007
All pairs | 0.869 ± 0.000 | 0.851 ± 0.000
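The sampling-based estimation behind Table 11 is a plain Monte-Carlo average over ego-graph pairs; the sketch below illustrates it with a generic per-pair gap function (e.g., the laplacian_gap sketch in Appendix A), and the function and argument names are ours rather than the released interface.

```python
import random

def estimate_egi_gap(egos_a, egos_b, pair_gap, num_pairs=1000, seed=0):
    """Monte-Carlo estimate of the EGI gap by averaging a per-pair structural
    gap (e.g., the laplacian_gap sketch above) over sampled ego-graph pairs."""
    rng = random.Random(seed)
    samples = [pair_gap(rng.choice(egos_a), rng.choice(egos_b))
               for _ in range(num_pairs)]
    return sum(samples) / num_pairs
```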