
Hyperbolic Graph Neural Networks at Scale:
A Meta Learning Approach

Nurendra Choudhary
Virginia Tech
Arlington, VA, USA
[email protected]

Nikhil Rao
Microsoft
Sunnyvale, CA, USA
[email protected]

Chandan K. Reddy
Virginia Tech
Arlington, VA, USA
[email protected]
Abstract

The progress in hyperbolic neural networks (HNNs) research is hindered by their absence of inductive bias mechanisms, which are essential for generalizing to new tasks and facilitating scalable learning over large datasets. In this paper, we aim to alleviate these issues by learning generalizable inductive biases from a node’s local subgraph and transferring them for faster learning over new subgraphs with a disjoint set of nodes, edges, and labels in a few-shot setting. We introduce a novel method, Hyperbolic GRAph Meta Learner (H-GRAM), that, for the tasks of node classification and link prediction, learns transferable information from a set of support local subgraphs in the form of hyperbolic meta gradients and label hyperbolic protonets to enable faster learning over a query set of new tasks dealing with disjoint subgraphs. Furthermore, we show that an extension of our meta-learning framework also mitigates the scalability challenges faced by existing HNN approaches. Our comparative analysis shows that H-GRAM effectively learns and transfers information in multiple challenging few-shot settings compared to other state-of-the-art baselines. Additionally, we demonstrate that, unlike standard HNNs, our approach is able to scale over large graph datasets and improve performance over its Euclidean counterparts.

1 Introduction

Graphs are extensively used in various applications, including image processing, natural language processing, chemistry, and bioinformatics. With modern graph datasets ranging from hundreds of thousands to billions of nodes, research in the graph domain has shifted towards larger and more complex graphs; e.g., the nodes and edges in the traditional Cora dataset [29] are on the order of $10^{3}$, whereas the more recent OGBN datasets [17] are on the order of $10^{6}$. However, despite their potential, Hyperbolic Neural Networks (HNNs), which capture the hierarchy of graphs in hyperbolic space, have struggled to scale up with the increasing size and complexity of modern datasets.

HNNs have outperformed their Euclidean counterparts by taking advantage of the inherent hierarchy present in many modern datasets [13, 4, 10]. Unlike a standard Graph Neural Network (GNN), an HNN learns node representations based on an “anchor” or a “root” node for the entire graph, and the operations needed to learn these embeddings are a function of this root node. Specifically, HNN formulations [13] depend on the global origin (root node) for several transformation operations (such as Möbius addition, Möbius multiplication, and others), and hence focusing on subgraph structures to learn representations becomes meaningless when their relationship to the root node is not considered. Thus, state-of-the-art HNNs such as HGCN [4], HAT [15], and HypE [8] require access to the entire dataset to learn representations, and hence only scale to experimental datasets (with ${\approx}10^{3}$ nodes). Despite this major drawback, HNNs have shown impressive performance in several research domains including recommendation systems [32], e-commerce [9], natural language processing [11], and knowledge graphs [5, 8]. It is thus imperative to develop methods that scale HNNs to larger datasets, so as to realize their full potential. To this end, we introduce a novel method, Hyperbolic GRAph Meta Learner (H-GRAM), that utilizes meta-learning to learn information from local subgraphs for HNNs and transfer it for faster learning on a disjoint set of nodes, edges, and labels contained in the larger graph. As a consequence of meta-learning, H-GRAM also achieves several desirable benefits that extend HNNs’ applicability, including the ability to transfer information to new graphs (inductive learning), the elimination of over-smoothing, and few-shot learning. We experimentally show that H-GRAM can scale to graphs of size $\mathcal{O}(10^{6})$, which is $\mathcal{O}(10^{3})$ times larger than previous state-of-the-art HNNs. To the best of our knowledge, there are no other methods that can scale HNNs to datasets of this size.

Refer to caption
Figure 1: Meta-learning on hyperbolic neural networks. The procedure consists of two phases - (i) meta-training to update the parameters of the HNNs and learn inductive biases (meta gradients and label protonets), and (ii) meta-testing that initializes the HNNs with the inductive biases for faster learning over new graphs with a disjoint set of nodes, edges, or labels.

Recent research has shown that both node-level and edge-level tasks depend only on the local neighborhood for evidence of prediction [42, 18, 19]. Inspired by the insights from such research, our model handles large graph datasets using their node-centric subgraph partitions, where each subgraph consists of a root node and the $k$-hop neighborhood around it. In H-GRAM, the HNN formulations establish the root node as the local origin to encode the subgraph. We theoretically show that, due to the locality of tangent space transformations in HNNs (more details in Section 3), the evidence for a node’s prediction can predominantly be found in its immediate neighborhood. Thus, the subgraph encodings do not lose a significant amount of feature information despite not having access to a “global” root node. However, due to the node-centric graph partitioning, the subgraphs are non-exhaustive, i.e., they do not contain all the nodes, edges, and labels, as previously assumed by HNNs. Thus, to overcome the issue of non-exhaustive subgraphs, we formulate four meta-learning problems (illustrated in Figure 1) that learn inductive biases on a support set and transfer them for faster learning on a query set with disjoint nodes, edges, or labels. Our model learns inductive biases in the meta-training phase, which contains two steps: HNN updates and meta updates. HNN updates are regular stochastic gradient descent steps based on the loss function for each support task. The updated HNN parameters are used to calculate the loss on query tasks, and the gradients are accumulated into a meta-gradient for the meta update. In the meta-testing phase, the models are evaluated on query tasks with the parameters obtained after the meta updates, as this snapshot of the model is the most adaptable for faster learning in a few-shot setting. Our main contributions are as follows:

  1.

    We theoretically prove that HNNs rely on a node’s local neighborhood for prediction evidence, and formulate HNNs to encode node-centric local subgraphs with root nodes as the local origin, using the locality of tangent space transformations.

  2.

    We develop Hyperbolic GRAph Meta Learner (H-GRAM), a novel method that learns meta information (as meta gradients and label protonets) from local subgraphs and generalizes it to new graphs with a disjoint set of nodes, edges, and labels. Our experiments show that H-GRAM can be used to generalize information from subgraph partitions, thus enabling scalability.

  3.

    Our analysis on a diverse set of datasets demonstrates that our meta-learning setup also addresses several challenges in HNNs, including inductive learning, the elimination of over-smoothing, and few-shot learning in challenging scenarios.

2 Related Work

This section reviews the relevant work in the areas of hyperbolic neural networks and meta learning.

Hyperbolic Neural Networks: Due to its ability to efficiently encode tree-like structures, hyperbolic space has been a significant development in the modeling of hierarchical datasets [43, 44]. Among its different isometric interpretations, the Poincaré ball model is the most popular one and has been applied in several HNN formulations of Euclidean networks, including recurrent (HGRU [13]), convolution (HGCN [4]), and attention layers (HAT [15]). As a result of their performance gains on hierarchical graphs, these formulations have also been extended to applications in knowledge graphs for efficiently encoding hierarchical relations in different tasks such as representation learning (MuRP [1], AttH [5]) and logical reasoning (HypE [8]). However, the above approaches have been designed for experimental datasets with a relatively small number of nodes (on the order of $10^{3}$), and do not scale to real-world datasets. Hence, we have designed H-GRAM as a meta-learning algorithm to translate the performance gains of HNNs to large datasets in a scalable manner.

Graph Meta-learning: Few-shot meta-learning transfers knowledge from prior tasks for faster learning over new tasks with few labels. Due to its wide applicability, it has been adopted in several domains including computer vision [34, 14], natural language processing [22], and, more recently, graphs [18, 39]. One early approach in graph meta-learning is Gated Propagation Networks [23], which learns to propagate information between label prototypes to improve the information available while learning new related labels. Subsequent developments such as MetaR [7], Meta-NA [41], and G-Matching [39] relied on metric-based meta-learning algorithms for relational graphs, network alignment, and generic graphs, respectively. These approaches show impressive performance on few-shot learning, but are only defined for single graphs. G-Meta [18] extends the metric-based techniques to handle multiple graphs with disjoint labels. However, the method processes information from local subgraphs in a Euclidean GNN, and thus is not as capable as hyperbolic networks at encoding tree-like structures. Thus, we design H-GRAM to encode hierarchical information from local subgraphs and transfer it to new subgraphs with disjoint nodes, edges, and labels.

Refer to caption
Figure 2: An overview of the proposed H-GRAM meta-learning framework. Here, the input graphs $\mathcal{G}^{\cup}$ are first partitioned into node-centric subgraph partitions. We theoretically show that encoding these subgraph neighborhoods is equivalent to encoding the entire graph in the context of node classification and link prediction tasks. H-GRAM then uses an HGCN encoder to produce subgraph encodings, which are further utilized to obtain label prototypes. Using the HGCN gradient updates and label prototypes, the HNN model’s parameters $P_{\theta^{*}}$ are updated through a series of weight updates and meta updates for $\eta$ meta-training steps. The parameters are then transferred to the meta-testing stage $P_{\theta^{*}\rightarrow\theta}$ and further trained on $\mathcal{D}_{test}^{s}$ and evaluated on $\mathcal{D}_{test}^{q}$.

3 The Proposed H-GRAM Model

In this section, we define the problem setup for different meta-learning scenarios and describe our proposed model, H-GRAM, illustrated in Figure 2. For better clarity, we explain the problem setup for node classification and use HGCN as the exemplar HNN model. However, the provided setup can be extended to link prediction or to other HNN models, which we have evaluated in our experiments. The preliminaries of hyperbolic operations and meta-learning are provided in Appendix A.

3.1 Problem Setup

Our problem consists of a group of tasks $\mathcal{T}_{i}\in\mathcal{T}$ which are divided into a support set $\mathcal{T}^{s}$ and a query set $\mathcal{T}^{q}$, where $\mathcal{T}^{s}\cap\mathcal{T}^{q}=\phi$. Furthermore, each task $\mathcal{T}_{i}$ is a batch of node-centric subgraphs $S_{u}$ with a corresponding label $Y_{u}$ (the class of the root node in node classification or of the root link in link prediction). The subgraphs $S_{u}$ could be partitions derived from a single graph or from multiple graphs, both denoted by $\mathcal{G}^{\cup}=\mathcal{G}^{s}\cup\mathcal{G}^{q}$. We also define $Y_{s}=\{Y_{u}\in\mathcal{T}^{s}\}$ and $Y_{q}=\{Y_{u}\in\mathcal{T}^{q}\}$ as the sets of labels in the support and query set, respectively. The primary goal of meta-learning is to learn a predictor $P_{\theta_{*}}(Y_{s}|\mathcal{T}^{s})$ using the support set such that the model can quickly learn a predictor $P_{\theta}(Y_{q}|\mathcal{T}^{q})$ on the query set. Following the literature in this area [18], the problem categories are defined as follows:

  1.

    Single graph, shared labels $\langle SG,SL\rangle$: The objective is to learn the meta-learning model $P_{\theta_{*}\rightarrow\theta}$, where $Y_{s}=Y_{q}$ and $|\mathcal{G}^{s}|=|\mathcal{G}^{q}|=1$. It should be noted that the tasks are based on subgraphs, so $|\mathcal{G}^{s}|=|\mathcal{G}^{q}|=1 \not\Rightarrow |\mathcal{T}^{s}|=|\mathcal{T}^{q}|=1$. Also, this problem setup is identical to the standard node classification task if we consider $\mathcal{T}_{i}^{q}\in D_{test}$ to be the test set.

  2.

    Single graph, disjoint labels $\langle SG,DL\rangle$: This problem operates on the same graph in the support and query sets; however, unlike the previous one, the label sets are disjoint. The goal is to learn the model $P_{\theta_{*}\rightarrow\theta}$, where $|\mathcal{G}^{s}|=|\mathcal{G}^{q}|=1$ and $Y_{s}\cap Y_{q}=\phi$.

  3.

    Multiple graphs, shared labels $\langle MG,SL\rangle$: This problem setup varies in terms of the dataset it handles, i.e., the dataset can contain multiple graphs instead of a single one. However, our method focuses on tasks that contain node-centric subgraphs, and hence, the model’s aim is the same as in Problem 1: learn the predictor model $P_{\theta_{*}\rightarrow\theta}$, where $Y_{s}=Y_{q}$ and $|\mathcal{G}^{s}|,|\mathcal{G}^{q}|>1$.

  4.

    Multiple graphs, disjoint labels $\langle MG,DL\rangle$: The setup is similar to the previous one, but with disjoint labels instead of shared ones, i.e., learn a predictor model $P_{\theta_{*}\rightarrow\theta}$, where $Y_{s}\cap Y_{q}=\phi$ and $|\mathcal{G}^{s}|,|\mathcal{G}^{q}|>1$.

From the problem setups, we observe that, while they handle different dataset variants, the base HNN model always operates on $(S_{u},Y_{u})$ pairs. So, for our meta-learning algorithm, we utilize a hyperbolic subgraph encoder to encode $S_{u}$ and prototypical labels to obtain a continuous version of $Y_{u}$.
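To make the task construction concrete, the sketch below (in Python with networkx; the helper names and batching scheme are illustrative assumptions rather than the paper's exact pipeline) builds tasks as batches of $(S_{u},Y_{u})$ pairs from $k$-hop node-centric subgraphs, which can then be split into support and query sets.

```python
import random
import networkx as nx

def extract_subgraph(G, root, k=2):
    """Return the k-hop node-centric subgraph S_u around `root`."""
    nodes = nx.single_source_shortest_path_length(G, root, cutoff=k).keys()
    return G.subgraph(nodes).copy()

def build_tasks(G, labels, task_size=8, k=2, seed=0):
    """Group (S_u, Y_u) pairs into tasks; tasks are later split into
    disjoint support and query sets for meta-learning."""
    roots = list(G.nodes())
    random.Random(seed).shuffle(roots)
    tasks = []
    for i in range(0, len(roots), task_size):
        batch = roots[i:i + task_size]
        tasks.append([(extract_subgraph(G, u, k), labels[u]) for u in batch])
    return tasks
```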

3.2 Hyperbolic Subgraph Encoder

In previous methods [16, 18, 38], the authors have shown that a node’s local neighborhood provides informative signals for the prediction task. While that theory is not trivially extensible, we use the local tangent space of the Poincaré ball model to prove that the local neighborhood property holds even more strongly for HNN models. The reason is that, while message propagation is linear in Euclidean GNNs [42], it is exponential in HNNs. Hence, a node’s influence, as given by Definition 1, decreases exponentially outside its neighborhood.

Definition 1.

The influence of a hyperbolic node vector $x^{\mathbb{H}}_{v}$ on node $x^{\mathbb{H}}_{u}$ is defined by the influence score $\mathcal{I}_{uv}=\exp_{0}^{c}\left(\left\|\frac{\partial\log^{c}_{0}\left(x^{\mathbb{H}}_{u}\right)}{\partial\log^{c}_{0}\left(x^{\mathbb{H}}_{v}\right)}\right\|\right)$.

Definition 2.

The influence of a graph $\mathcal{G}$ with a set of vertices $\mathcal{V}$ on a node $u\in\mathcal{V}$ is defined as $\mathcal{I}_{\mathcal{G}}(u)=\exp_{0}^{c}\left(\sum_{v\in\mathcal{V}}\log_{0}^{c}\left(\mathcal{I}_{uv}\right)\right)$.

Theorem 1.

For a set of paths $P_{uv}$ between nodes $u$ and $v$, let us define $D^{p_{i}}_{g\mu}$ as the geometric mean of the nodes' degrees in a path $p_{i}\in P_{uv}$, $p_{uv}$ as the shortest path, and $\mathcal{I}_{uv}$ as the influence of node $v$ on $u$. Also, let $D^{min}_{g\mu}=\min\left\{D^{p_{i}}_{g\mu}~\forall p_{i}\in P_{uv}\right\}$; then the relation $\mathcal{I}_{uv}\leq \exp_{u}^{c}\left(K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|}\right)$ (where $K$ is a constant) holds for message propagation in HGCN models.

Theorem 1 shows that the influence of a node decreases at an exponent-of-exponential rate ($\exp_{u}^{c}=\mathcal{O}(e^{n})$) with increasing distance $\|p_{uv}\|$. Thus, we conclude that encoding the local neighborhood of a node is sufficient to encode its features for label prediction.

Definition 3.

The information loss between encoding an entire graph $\mathcal{G}$ and a subgraph $S_{u}$ with root node $u$ is defined as $\delta_{\mathbb{H}}(\mathcal{G},S_{u})=\exp^{c}_{0}\left(\log^{c}_{0}\left(\mathcal{I}_{\mathcal{G}}(u)\right)-\log^{c}_{0}\left(\mathcal{I}_{S_{u}}(u)\right)\right)$.

Theorem 2.

For a subgraph $S_{u}$ of graph $\mathcal{G}$ centered at node $u$, let us define a node $v\in\mathcal{G}$ with maximum influence on $u$, i.e., $v=\operatorname*{arg\,max}_{t}(\{\mathcal{I}_{ut},t\in\mathcal{N}(u)\setminus u\})$. For a set of paths $P_{uv}$ between nodes $u$ and $v$, let us define $D^{p_{i}}_{g\mu}$ as the geometric mean of the degrees of nodes in a path $p_{i}\in P_{uv}$, $\|p_{uv}\|$ as the shortest path length, and $D^{min}_{g\mu}=\min\left\{D^{p_{i}}_{g\mu}~\forall p_{i}\in P_{uv}\right\}$. Then, the information loss is bounded by $\delta_{\mathbb{H}}(\mathcal{G},S_{u})\leq \exp_{u}^{c}\left(K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|+1}\right)$ (where $K$ is a constant).

Theorem 2 shows that encoding the local subgraph is an $e^{\|p_{uv}\|}$-order approximation of encoding the entire graph, and thus, with a sufficiently large $\|p_{uv}\|$, the encodings are equivalent. Note that $\|p_{uv}\|$ is equivalent to the subgraph’s neighborhood size ($k$ in $k$-hop). This shows that encoding a node’s local neighborhood is sufficient to encode its features. Theorem proofs are provided in Appendix B. The above theorems provide the theoretical justification for encoding local subgraphs in our meta-learning framework. Hence, we partition the input $\mathcal{G}^{\cup}$ into subgraphs $S_{u}=(V,E)$ centered at root node $u$, with $|V|$ nodes, $|E|$ edges, a neighborhood size $k$, and the corresponding label $Y_{u}$. The subgraphs are processed through a $k$-layer HGCN network, and the Einstein midpoint [33] of all the node encodings is taken as the overall encoding of the subgraph $S_{u}$. For a given subgraph $S_{u}=(A,X)$, where $A\in\mathbb{H}^{\|V\|\times\|V\|}$ is the local adjacency matrix and $X\in\mathbb{H}^{\|V\|\times m}$ are the $m$-dimensional node feature vectors, the encoding procedure can be formalized as:

$h_{u} = HGCN_{\theta^{*}}(A,X)$, where $h_{u}\in\mathbb{H}^{\|V\|\times d}$   (1)
$e_{u} = \frac{\sum_{i=1}^{\|V\|}\gamma_{iu}h_{iu}}{\sum_{i=1}^{\|V\|}\gamma_{iu}}$, where $\gamma_{iu}=\frac{1}{\sqrt{1-\|h_{iu}\|^{2}}}$   (2)

where $HGCN(A,X)\in\mathbb{H}^{\|V\|\times d}$ is the output of the $k$-layer HGCN with $d$ output units and $e_{u}\in\mathbb{H}^{d}$ is the final subgraph encoding of $S_{u}$. $\gamma_{iu}$ is the Lorentz factor of the hyperbolic vector $h_{iu}$, which determines its weight in the Einstein midpoint.
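As an illustration, the following is a minimal PyTorch sketch of the Einstein midpoint aggregation in Eq. (2); it assumes the node encodings already lie inside the Poincaré ball (norms strictly below 1) and is not the released implementation.

```python
import torch

def einstein_midpoint(h, eps=1e-5):
    """Aggregate node encodings h (|V| x d, ||h_i|| < 1) into a single
    subgraph encoding e_u, weighting each node by its Lorentz factor
    gamma_i = 1 / sqrt(1 - ||h_i||^2)."""
    sq_norm = (h * h).sum(dim=-1, keepdim=True).clamp(max=1.0 - eps)
    gamma = 1.0 / torch.sqrt(1.0 - sq_norm)        # |V| x 1
    return (gamma * h).sum(dim=0) / gamma.sum()    # d

# e_u = einstein_midpoint(hgcn(A, X))  # hgcn: a k-layer HGCN encoder (assumed)
```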

3.3 Label Prototypes

Label information is generally categorical in node classification tasks. However, categorical labels do not allow us to pass inductive biases from the support set to the query set. Hence, to circumvent this issue, we use prototypical networks [31] as our label encoding. Our approach constructs continuous label prototypes as the mean of the meta-training nodes’ features that belong to the label. These prototypes are then employed to classify meta-testing samples based on their similarity to the corresponding meta-training label prototypes. This enables our model to handle new, non-exhaustive labels in an inductive manner, without the need for additional training data. Concretely, the continuous label prototype of a label $y_{k}$ is defined as $c_{k}=\frac{\sum_{Y_{u}=y_{k}}\gamma_{i}e_{i}}{\sum_{Y_{u}=y_{k}}\gamma_{i}}$, where $\gamma_{i}=\frac{1}{\sqrt{1-\|e_{i}\|^{2}}}$ and $e_{i}\in\mathbb{H}^{d}$ is the encoding of a subgraph $S_{u}$ with label $Y_{u}=y_{k}$. For each $S_{u}$ with class $y_{k}$, we compute the class distribution vector as $p_{k}=\frac{e^{-d_{\mathbb{H}}^{c}(e_{u},c_{k})}}{\sum_{k}e^{-d_{\mathbb{H}}^{c}(e_{u},c_{k})}}$, where $d_{\mathbb{H}}^{c}(e_{u},c_{k})=\frac{2}{\sqrt{c}}\tanh^{-1}\left(\sqrt{c}\|-e_{u}\oplus c_{k}\|\right)$, and the loss for the HNN updates as $\mathcal{L}(p,y)=-\sum_{j}y_{j}\log{p_{j}}$, where $y_{j}$ is the one-hot encoding of the ground truth. The class distribution vector $p_{k}$ is a softmax over the negative hyperbolic distances of the subgraph encoding to the label prototypes, and indicates the probability that the subgraph belongs to class $y_{k}$. The loss function $\mathcal{L}(p,y)$ is the cross-entropy between the ground-truth labels $y$ and the class distribution vector $p$.
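The following PyTorch sketch spells out the prototype construction and the distance-based class distribution directly from the formulas above (the curvature value, clamping constants, and tensor layout are our own assumptions, not the authors' code).

```python
import torch

def mobius_add(x, y, c=1.0, eps=1e-15):
    """Mobius addition x (+)_c y on the Poincare ball."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2).clamp(min=eps)

def hyp_dist(x, y, c=1.0):
    """d_H^c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||-x (+)_c y||)."""
    arg = (c ** 0.5) * mobius_add(-x, y, c).norm(dim=-1)
    return (2.0 / c ** 0.5) * torch.atanh(arg.clamp(max=1 - 1e-5))

def label_prototypes(E, Y, num_classes):
    """c_k: Lorentz-factor-weighted mean of subgraph encodings with label k."""
    gamma = 1.0 / torch.sqrt(1.0 - (E * E).sum(-1, keepdim=True).clamp(max=1 - 1e-5))
    protos = []
    for k in range(num_classes):
        mask = (Y == k).float().unsqueeze(-1)
        protos.append((gamma * mask * E).sum(0) / (gamma * mask).sum().clamp(min=1e-15))
    return torch.stack(protos)                     # num_classes x d

def class_distribution(e_u, protos, c=1.0):
    """p: softmax over negative hyperbolic distances to the label prototypes."""
    return torch.softmax(-hyp_dist(e_u, protos, c), dim=-1)
```

The cross-entropy loss then follows as `loss = -(y_onehot * torch.log(p)).sum()` for a one-hot ground-truth vector `y_onehot`.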

3.4 Meta-Learning

In the previous section, we described a continuous label encoding that is able to capture inductive biases from the subgraph. In this section, we utilize the optimization-based MAML algorithm [12] to transfer the inductive biases from the support set to the query set. To this end, we sample a batch of tasks, where each task $\mathcal{T}_{i}=\{S_{i},Y_{i}\}_{i=1}^{\|\mathcal{T}_{i}\|}$. In the meta-training phase, we first optimize the HNN parameters using Riemannian stochastic gradient descent (RSGD) [2] on the support loss, i.e., for each $\mathcal{T}^{s}_{i}\in\mathcal{T}^{s}_{train}:\theta^{*}_{j}\leftarrow \exp^{c}_{\theta^{*}_{j}}(-\alpha\nabla\mathcal{L}^{s})$, where $\alpha$ is the learning rate of RSGD. Using the updated parameters $\theta^{*}_{j}$, we record the evaluation results on the query set, i.e., the loss on task $\mathcal{T}^{q}_{i}\in\mathcal{T}^{q}_{train}$ is $\mathcal{L}^{q}_{i}$. The above procedure is repeated $\eta$ times, after which $\mathcal{L}^{q}_{i}$ over the batch of tasks $\mathcal{T}^{q}_{i}\in\mathcal{T}^{q}_{train}$ is accumulated for the meta update $\theta^{*}\leftarrow \exp^{c}_{\theta^{*}}(-\beta\nabla\sum_{i}\mathcal{L}^{q}_{i})$. These steps are repeated with the updated $\theta^{*}$ and a new batch of tasks until convergence. The final updated parameter set $\theta^{*}\rightarrow\theta$ is transferred to the meta-testing phase. In meta-testing, the tasks $\mathcal{T}^{s}_{i}\in\mathcal{T}^{s}_{test}$ are used for RSGD parameter updates, i.e., $\mathcal{T}^{s}_{i}\in\mathcal{T}^{s}_{test}:\theta_{j}\leftarrow \exp^{c}_{\theta_{j}}(-\alpha\nabla\mathcal{L}^{s})$, until convergence. The updated parameters $\theta$ are used for the final evaluation on $\mathcal{T}^{q}_{i}\in\mathcal{T}^{q}_{test}$. Our meta-learning procedure is further detailed in Appendix C, and the implementation code for our experiments is available at https://github.com/Akirato/HGRAM. Details on implementation and broader impacts are provided in Section 3.5 and Appendix E, respectively.
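The loop below is a compact, first-order sketch of this procedure: plain SGD steps stand in for the Riemannian exponential-map (RSGD) updates, and `loss_fn` together with the `(support, query)` task layout are assumed interfaces rather than the authors' code; it only illustrates the inner HNN updates on support tasks followed by the accumulated meta update.

```python
import copy
import torch

def meta_train_step(model, task_batch, loss_fn, alpha=1e-2, beta=1e-3, eta=5):
    """One meta-training step: `task_batch` is a list of (support, query) pairs
    and `loss_fn(model, data)` returns the prototypical cross-entropy loss.
    True RSGD would replace each gradient step with exp_theta^c(-lr * grad)."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support, query in task_batch:
        fast = copy.deepcopy(model)                          # theta*
        inner_opt = torch.optim.SGD(fast.parameters(), lr=alpha)
        for _ in range(eta):                                 # HNN (inner) updates
            inner_opt.zero_grad()
            loss_fn(fast, support).backward()
            inner_opt.step()
        grads = torch.autograd.grad(loss_fn(fast, query), fast.parameters())
        for m, g in zip(meta_grads, grads):                  # accumulate query grads
            m.add_(g)
    with torch.no_grad():                                    # meta update on theta*
        for p, g in zip(model.parameters(), meta_grads):
            p.sub_(beta * g)
    return model
```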

3.5 Implementation Details

H-GRAM is primarily implemented in PyTorch [26], with geoopt [21] and GraphZoo [35] as support libraries for the hyperbolic formulations. Our experiments are conducted on an Nvidia V100 GPU with 16 GB of VRAM. For gradient descent, we employ Riemannian Adam [28] with an initial learning rate of 0.01 and standard $\beta$ values of 0.9 and 0.999. The other hyper-parameters were selected based on the best performance on the validation set ($\mathcal{D}_{val}$) under the given computational constraints. In our experiments, we empirically set $k=2$, $d=32$, $h=4$, and $\eta=10$. We explore the following search space and tune our hyper-parameters for the best performance: the number of tasks in each batch is varied among 4, 8, 16, 32, and 64; the learning rates explored for both HNN updates and meta updates are $10^{-2}$, $5\times 10^{-3}$, $10^{-3}$, and $5\times 10^{-4}$; and the hidden dimension size is selected from 64, 128, and 256. The final best-performing hyper-parameter setup for the real-world datasets is presented in Table 5.
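For reference, a minimal sketch of the optimizer setup described above is shown below; it assumes geoopt's `PoincareBall`, `ManifoldParameter`, and `RiemannianAdam` APIs and uses a toy embedding table in place of H-GRAM's actual HGCN parameters.

```python
import torch
import geoopt

ball = geoopt.PoincareBall(c=1.0)
# Toy hyperbolic parameter (an embedding table at the origin) plus a Euclidean layer.
hyp_emb = geoopt.ManifoldParameter(torch.zeros(100, 32), manifold=ball)
euc_lin = torch.nn.Linear(32, 32)
# Riemannian Adam applies exponential-map updates to manifold parameters and
# behaves like standard Adam for Euclidean ones.
optimizer = geoopt.optim.RiemannianAdam(
    [hyp_emb, *euc_lin.parameters()], lr=1e-2, betas=(0.9, 0.999))
```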

4 Experimental Setup

Our experiments aim to evaluate the performance of the proposed H-GRAM model and investigate the following research questions:

  • RQ1:

    Does our hyperbolic meta-learning algorithm outperform the Euclidean baselines on various meta-learning problems?

  • RQ2:

    How does our model perform and scale in comparison to other HNN formulations in standard graph problems?

  • RQ3:

    How does the H-GRAM model’s performance vary with different few-shot settings, i.e., different values of $k$ and $N$?

  • RQ4:

    What is the importance of different meta information components?

We use a set of standard benchmark datasets and baseline methods to compare the performance of H-GRAM on meta-learning and graph analysis tasks. The HNN models do not scale to the large datasets used in the meta-learning task, and hence, we limit our tests to Euclidean baselines. To compare against HNN models, we rely on standard node classification and link prediction on small datasets. Also, we do not consider other learning paradigms, such as self-supervised learning, because they require an exhaustive set of nodes and labels and do not handle disjoint problem settings.

4.1 Datasets

For the task of meta-learning, we utilize the experimental setup from earlier approaches [18]: two synthetic datasets to understand whether H-GRAM is able to capture local graph information, and five real-world datasets to evaluate our model’s performance in a few-shot setting.

  • Synthetic Cycle [18] contains multiple graphs with a cycle as the basis and different topologies (House, Star, Diamond, and Fan) attached to the nodes on the cycle. The class of a node is defined by its topology.

  • Synthetic BA [18] uses a Barabási-Albert (BA) graph as the basis with different topologies planted within it. The nodes are labeled using spectral clustering over the Graphlet Distribution Vector [27] of each node.

  • ogbn-arxiv [17] is a large citation graph of papers, where titles are the node features and subject areas are the labels.

  • Tissue-PPI [43, 16] contains multiple protein-protein interaction (PPI) networks collected from different tissues, with gene signatures and ontology functions as features and labels, respectively.

  • FirstMM-DB [25] is a standard 3D point cloud link prediction dataset.

  • Fold-PPI [43] is a set of tissue PPI networks, where the node features and labels are the conjoint triad protein descriptor [30] and protein structures, respectively.

  • Tree-of-Life [44] is a large collection of PPI networks, originating from different species.

4.2 Baseline Methods

For comparison with HNNs, we utilize the standard benchmark citation graphs of Cora [29], Pubmed [24], and Citeseer [29]. For the baselines, we select the following methods to understand H-GRAM’s performance compared to state-of-the-art models in the tasks of meta-learning and standard graph processing.

  • Meta-Graph [3], developed for few-shot link prediction over multiple graphs, utilizes VGAE [20] model with additional graph encoding signals.

  • Meta-GNN [40] is a MAML-based model built over the simple graph convolution (SGC) network [36].

  • FS-GIN [37] runs Graph Isomorphism Network (GIN) on the entire graph and then uses the few-shot labelled nodes to propagate loss and learn.

  • FS-SGC [36] is the same as FS-GIN but uses SGC instead of GIN as the GNN network.

  • ProtoNet [31] learns a metric space over label prototypes to generalize over unseen classes.

  • MAML [12] is a Model-Agnostic Meta Learning (MAML) method that learns on multiple tasks to adapt the gradients faster on unseen tasks.

  • HMLP [13], HGCN [4], and HAT [15] are the hyperbolic variants of Euclidean multi-layer perceptron (MLP), Graph Convolution Network (GCN), and Attention (AT) networks that use hyperbolic gyrovector operations instead of the vector space model.

It should be noted that not all the baselines can be applied to both node classification and link prediction. Hence, we compare our model against the baselines only on the applicable scenarios.

5 Experimental Results

We adopt the following standard problem setting, which is widely studied in the literature [18]. The details of the datasets used in our experiments are provided in Table 1. In the case of the synthetic datasets, we use a 2-way setup for the disjoint label problems, and for the shared label problems the cycle graph and Barabási–Albert (BA) graph have 17 and 10 labels, respectively. The evaluation of our model uses 5 and 10 gradient update steps in meta-training and meta-testing, respectively. In the case of the real-world datasets, we use a 3-shot and 16-shot setup for node classification and link prediction, respectively. For the real-world disjoint labels problem, we use the 3-way classification setting. The evaluation of our model uses 20 and 10 gradient update steps in meta-training and meta-testing, respectively. For the Tissue-PPI dataset, we perform each 2-way protein function task three times and average it over 10 iterations for the final result. For the link prediction task, we need to ensure that the support and query sets remain distinct across all meta-training tasks. For this, as a preprocessing step, we hold out fixed sets comprising 30% and 70% of the edges of every graph for the support and query sets, respectively.
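A simple sketch of this per-graph edge hold-out is shown below (the function name is ours, and negative sampling as well as the subsequent subgraph extraction are omitted).

```python
import random

def split_edges(G, support_frac=0.3, seed=0):
    """Hold out a fixed fraction of G's edges for the support set and the
    remainder for the query set of link-prediction tasks."""
    edges = list(G.edges())
    random.Random(seed).shuffle(edges)
    cut = int(support_frac * len(edges))
    return edges[:cut], edges[cut:]    # (support_edges, query_edges)
```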

Table 1: Basic statistics of the datasets used in our experiments. The columns present the dataset task (node classification or link prediction), number of graphs $|\mathcal{G}^{\cup}|$, nodes $|V|$, edges $|E|$, node features $|X|$, and labels $|Y|$. Node, Link, and N/L indicate whether the datasets are used for node classification, link prediction, or both, respectively.
Dataset Task $|\mathcal{G}^{\cup}|$ $|V|$ $|E|$ $|X|$ $|Y|$
Synth. Cycle Node 10 11,476 19,687 - 17
Synth. BA Node 10 2,000 7,647 - 10
ogbn-arxiv Node 1 169,343 1,166,243 128 40
Tissue-PPI Node 24 51,194 1,350,412 50 10
FirstMM-DB Link 41 56,468 126,024 5 2
Fold-PPI Node 144 274,606 3,666,563 512 29
Tree-of-Life Link 1,840 1,450,633 8,762,166 - 2
Cora N/L 1 2,708 5,429 1,433 7
Pubmed N/L 1 19,717 44,338 500 3
Citeseer N/L 1 3,312 4,732 3,703 6
Table 2: Performance of H-GRAM and the baselines on synthetic and real-world datasets. The top three rows define the task, problem setup (Single Graph (SG), Multiple Graphs (MG), Shared Labels (SL), or Disjoint Labels (DL)), and dataset. The problems with disjoint labels use a 2-way meta-learning setup, and in the case of shared labels, the cycle (S. Cy) and BA (S. BA) graphs have 17 and 10 labels, respectively. In our evaluation, we use 5 and 10 gradient update steps in meta-training and meta-testing, respectively. The columns present the average multi-class classification accuracy over five folds. Note that the baselines are only defined for certain tasks; “-” implies that the baseline is not defined for the task and setup. Meta-Graph is only defined for link prediction. The confidence intervals for the results are provided in Appendix D.
Synthetic Datasets Real-world (Node Classification) Real-world (Link Prediction)
Task Setup $\langle SG,DL\rangle$ $\langle MG,SL\rangle$ $\langle MG,DL\rangle$ $\langle SG,DL\rangle$ $\langle MG,SL\rangle$ $\langle MG,DL\rangle$ $\langle MG,SL\rangle$ $\langle MG,SL\rangle$
Dataset S. Cy S. BA S. Cy S. BA S. Cy S. BA ogbn-arxiv Tissue-PPI Fold-PPI FirstMM-DB Tree-of-Life
Meta-Graph - - - - - - - - - .719 .705
Meta-GNN .720 .694 - - - - .273 - - - -
FS-GIN .684 .749 - - - - .336 - - - -
FS-SGC .574 .715 - - - - .347 - - - -
ProtoNet .821 .858 .282 .657 .749 .866 .372 .546 .382 .779 .697
MAML .842 .848 .511 .726 .653 .844 .389 .745 .482 .758 .719
G-META .872 .867 .542 .734 .767 .867 .451 .768 .561 .784 .722
H-GRAM .883 .873 .555 .746 .779 .888 .472 .786 .584 .804 .742

5.1 RQ1: Performance of Meta-Learning

To analyze the meta-learning capability of H-GRAM, we compare it against previous approaches in this area in a standard evaluation setup. We consider two experimental setups, in line with previous evaluations in the literature [18]: (i) synthetic datasets to analyze performance on different problem setups without altering the graph topology, and (ii) real-world datasets to analyze performance in practical applications. Based on the problem setup, the datasets are partitioned into node-centric subgraphs with the corresponding root node’s label as the subgraph’s ground-truth label. The subgraphs are subsequently batched into tasks, which are further divided into a support set and a query set for meta-learning. The evaluation metric for both node classification and link prediction is accuracy $\mathcal{A}=|Y=\hat{Y}|/|Y|$. For robust comparison, the metrics are computed over five folds of validation splits in a 2-shot setting for node classification and a 32-shot setting for link prediction. Table 2 presents the five-fold averages of our experiments on the synthetic and real-world datasets. From the results, we observe that H-GRAM consistently outperforms the baseline methods on a diverse set of datasets and meta-learning problem setups. For the disjoint labels setting, H-GRAM outperforms the best baseline in both the single-graph and multiple-graph cases. In the case of synthetic graphs, we observe that the subgraph-based methods, H-GRAM and G-Meta, outperform the approaches that encode the entire graph, showing that subgraph methods are able to limit the over-smoothing problem [6] and improve performance. Also, we observe that the meta-learning methods ProtoNet and MAML are unreliable, producing good results for some tasks and worse ones for others, whereas H-GRAM is consistently better across the board. Hence, we conclude that using label prototypes to learn inductive biases and transferring them using MAML meta updates is a more robust technique. We note that H-GRAM, unlike previous HNN models, is able to handle graphs with nodes and edges in the order of millions, as evidenced by its performance on large real-world datasets including ogbn-arxiv, Tissue-PPI, Fold-PPI, and Tree-of-Life. Our experiments clearly demonstrate the strong performance of H-GRAM in a wide range of applications and the effectiveness of meta-learning in HNN models.

Table 3: Comparison with HNN models on standard benchmarks. We compare the Single Graph, Shared Labels (SG,SL) setup of the H-GRAM model to the baselines. The columns report the average multi-class classification accuracy and 95% confidence interval over five folds on the tasks of node classification (Node) and link prediction (Link) on the standard citation graphs.
Dataset Cora Pubmed Citeseer
Task Node Link Node Link Node Link
HMLP .754±.029 .765±.047 .657±.045 .848±.038 .879±.078 .877±.090
HAT .796±.036 .792±.038 .681±.034 .908±.038 .939±.034 .922±.036
HGCN .779±.026 .789±.030 .696±.029 .914±.031 .950±.032 .928±.030
H-GRAM .827±.037 .790±.026 .716±.029 .896±.025 .924±.033 .936±.030

5.2 RQ2: Comparison with HNN models

The current HNN formulations of HMLP, HAT, and HGCN do not scale to large datasets, and hence, we could not compare them against H-GRAM in that setting. However, it is still necessary to compare the standard HNN formulations with H-GRAM to understand the importance of the subgraph encoder and meta-learning.

Refer to caption
Figure 3: Time taken (per epoch) by H-GRAM compared to other HNNs with a varying number of nodes $|\mathcal{V}|=\{10^{i}\}_{i=1}^{7}$ in the Syn. BA graph. H-GRAM(multi) is the multi-GPU version of H-GRAM.

Thus, we utilize the single graph, shared labels setup of H-GRAM to evaluate its performance on the citation networks Cora, Pubmed, and Citeseer for both node classification and link prediction. We also compare the time taken by our model and other HNNs for a varying number of nodes $\left(|\mathcal{V}|=\{10^{i}\}_{i=1}^{7}\right)$ in the Synthetic BA graph. For this experiment, we also consider a multi-GPU version of H-GRAM that parallelizes the HNN update computations and accumulates them for the meta update. From our results, presented in Table 3, we observe that H-GRAM is, on average, able to outperform the best baseline. This shows that our formulation of HNN models using meta-learning over node-centric subgraphs is more effective than the traditional models, while also being scalable to large datasets. The scalability allows for multi-GPU training and translates the performance gains of HNNs to larger datasets. In the results provided in Figure 3, we observe that the time taken by the models is in line with their parameter complexity (HMLP $\leq$ HGCN $\leq$ HAT $\leq$ H-GRAM). However, the traditional HNNs are not able to scale beyond $|\mathcal{V}|=10^{4}$, whereas H-GRAM is able to accommodate large graphs. Another point to note is that H-GRAM(multi) parallelizes well over multiple GPUs, with its time taken remaining stable after $10^{4}$ nodes (the number of nodes that a single GPU can accommodate).

5.3 RQ3: Challenging Few-shot Settings

To understand the effect of different few-shot learning scenarios, we vary the number of shots $M$ and the number of hops $K$ in the neighborhood. For the experiment on few-shot classification, we consider node classification on the Fold-PPI dataset and link prediction on the FirstMM-DB dataset. We vary $M=1,2,3$ for node classification and $M=16,32,64$ for link prediction, and calculate the corresponding accuracy and 95% confidence interval of H-GRAM. The results for this experiment are presented in Figures 4(a) and 4(b) for node classification and link prediction, respectively. To determine the effect of the number of hops in the neighborhood, we vary $K=1,2,3$ for the same problem setting and compute the corresponding performance of our model. The results for varying neighborhood sizes are reported in Figure 4(c). In the results on varying the number of shots, we observe a linear trend in both node classification and link prediction, i.e., a linear increase in H-GRAM’s accuracy with an increase in the number of shots in meta-testing. Thus, we conclude that H-GRAM, like other generic learning models, performs better with an increasing number of training samples. In the experiment on increasing the neighborhood size, we observe that $K=3$ shows the best performance for node classification, but $K=2$ performs best for link prediction, with a significant drop at $K=3$. Thus, for stability, we choose $K=2$ in our experiments. The trend in link prediction also shows that larger neighborhoods can introduce noise, which can negatively affect performance.

Refer to caption
(a) Number of Shots (vs) Accuracy for node classification on Fold-PPI dataset.
Refer to caption
(b) Number of Shots (vs) Accuracy for link prediction on FirstMM-DB dataset.
Refer to caption
(c) Number of hops (vs) Accuracy on the task of node classification and link prediction.
Figure 4: Performance of H-GRAM on challenging few-shot settings. The reported accuracies are multi-class classification accuracy averaged over five-fold runs of our model.

5.4 RQ4: Ablation Study

In this section, we aim to understand the contribution of the different components of our model. To this end, we compare variants of our model by (i) varying the base HNN model (HMLP, HAT, and HGCN), and (ii) deleting individual meta-learning components: H-ProtoNet denotes H-GRAM without meta updates and H-MAML denotes H-GRAM without label prototypes. The model variants are compared on the real-world datasets and the results are presented in Table 4. The ablation study indicates that meta gradient updates and label prototypes contribute ≈16% and ≈6% improvements in H-GRAM’s performance, respectively. This clearly demonstrates the ability of label prototypes to encode inductive biases and that of meta gradients to transfer knowledge from the meta-training to the meta-testing phase. Additionally, from our study on different HNN bases for H-GRAM, we note that the HGCN base outperforms the HMLP and HAT bases by ≈19% and ≈2%, respectively. Thus, we choose HGCN as the base in our final model.

Table 4: Ablation study results. H-ProtoNet and H-MAML can be considered H-GRAM’s model variants without meta updates and label prototypes, respectively. H-GRAM(HMLP) and H-GRAM(HAT) represent the variants of H-GRAM with HMLP and HAT as the base, respectively. Our final model, presented in the last row, uses HGCN as the base model. The columns report the average multi-class classification accuracy and 95% confidence interval over five folds on different tasks.
Task Node Classification Link Prediction
Setup $\langle SG,DL\rangle$ $\langle MG,SL\rangle$ $\langle MG,DL\rangle$ $\langle MG,SL\rangle$ $\langle MG,SL\rangle$
Dataset ogbn-arxiv Tissue-PPI Fold-PPI FirstMM-DB Tree-of-Life
H-ProtoNet .389±.019 .559±.027 .398±.023 .799±.015 .716±.004
H-MAML .407±.023 .762±.056 .502±.046 .777±.018 .739±.005
H-GRAM(HMLP) .370±.036 .537±.044 .372±.036 .772±.028 .688±.019
H-GRAM(HAT) .462±.032 .777±.028 .573±.048 .794±.023 .732±.021
H-GRAM(ours) .472±.035 .786±.031 .584±.044 .804±.021 .742±.013

6 Conclusion

In this paper, we introduce H-GRAM, a scalable hyperbolic meta-learning model that learns inductive biases from a support set and transfers them to adapt quickly to a query set with disjoint nodes, edges, and labels. We have theoretically proven the effectiveness of node-centric subgraph information in HNN models, and used that result to formulate a meta-learning model that scales to large datasets. Our model handles challenging few-shot learning scenarios and outperforms previous Euclidean baselines in the area of meta-learning. Additionally, unlike previous HNN models, H-GRAM is able to scale to large graph datasets.

References

  • Balažević et al. [2019] Ivana Balažević, Carl Allen, and Timothy Hospedales. Multi-relational poincaré graph embeddings. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.
  • Bonnabel [2013] Silvère Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013. doi: 10.1109/TAC.2013.2254619.
  • Bose et al. [2020] Avishek Joey Bose, Ankit Jain, Piero Molino, and William L. Hamilton. Meta-graph: Few shot link prediction via meta learning, 2020. URL https://openreview.net/forum?id=BJepcaEtwB.
  • Chami et al. [2019] Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. Advances in neural information processing systems, 32, 2019.
  • Chami et al. [2020] Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, and Christopher Ré. Low-dimensional hyperbolic knowledge graph embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6901–6914, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.617. URL https://aclanthology.org/2020.acl-main.617.
  • Chen et al. [2020] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):3438–3445, Apr. 2020. doi: 10.1609/aaai.v34i04.5747. URL https://ojs.aaai.org/index.php/AAAI/article/view/5747.
  • Chen et al. [2019] Mingyang Chen, Wen Zhang, Wei Zhang, Qiang Chen, and Huajun Chen. Meta relational learning for few-shot link prediction in knowledge graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4208–4217, Hong Kong, China, November 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D19-1431.
  • Choudhary et al. [2021] Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan K. Reddy. Self-supervised hyperboloid representations from logical queries over knowledge graphs. In Proceedings of the Web Conference 2021, WWW ’21, page 1373–1384, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383127. doi: 10.1145/3442381.3449974. URL https://doi.org/10.1145/3442381.3449974.
  • Choudhary et al. [2022a] Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, and Chandan K. Reddy. Anthem: Attentive hyperbolic entity model for product search. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM ’22, page 161–171, New York, NY, USA, 2022a. Association for Computing Machinery. ISBN 9781450391320. doi: 10.1145/3488560.3498456. URL https://doi.org/10.1145/3488560.3498456.
  • Choudhary et al. [2022b] Nurendra Choudhary, Nikhil Rao, Karthik Subbian, Srinivasan H. Sengamedu, and Chandan K. Reddy. Hyperbolic neural networks: Theory, architectures and applications. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, page 4778–4779, New York, NY, USA, 2022b. Association for Computing Machinery. ISBN 9781450393850. doi: 10.1145/3534678.3542613. URL https://doi.org/10.1145/3534678.3542613.
  • Dhingra et al. [2018] Bhuwan Dhingra, Christopher Shallue, Mohammad Norouzi, Andrew Dai, and George Dahl. Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69, New Orleans, Louisiana, USA, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-1708. URL https://aclanthology.org/W18-1708.
  • Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 1126–1135. JMLR.org, 2017.
  • Ganea et al. [2018] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. Advances in neural information processing systems, 31, 2018.
  • Gao et al. [2022] Ning Gao, Hanna Ziesche, Ngo Anh Vien, Michael Volpp, and Gerhard Neumann. What matters for meta-learning vision regression tasks? In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14756–14766, 2022. doi: 10.1109/CVPR52688.2022.01436.
  • Gulcehre et al. [2019] Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, and Nando de Freitas. Hyperbolic attention networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJxHsjRqFQ.
  • Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
  • Huang and Zitnik [2020] Kexin Huang and Marinka Zitnik. Graph meta learning via local subgraphs. Advances in Neural Information Processing Systems, 33:5862–5874, 2020.
  • Khatir et al. [2023] Mehrdad Khatir, Nurendra Choudhary, Sutanay Choudhury, Khushbu Agarwal, and Chandan K. Reddy. A unification framework for euclidean and hyperbolic graph neural networks. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pages 3875–3883. ijcai.org, 2023. doi: 10.24963/ijcai.2023/431. URL https://doi.org/10.24963/ijcai.2023/431.
  • Kipf and Welling [2016] Thomas N Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.
  • Kochurov et al. [2020] Max Kochurov, Rasul Karimov, and Serge Kozlukov. Geoopt: Riemannian optimization in pytorch. arXiv preprint arXiv:2005.02819, 2020.
  • Lee et al. [2022] Hung-yi Lee, Shang-Wen Li, and Thang Vu. Meta learning for natural language processing: A survey. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 666–684, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.49. URL https://aclanthology.org/2022.naacl-main.49.
  • Liu et al. [2019] Lu Liu, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Learning to Propagate for Graph Meta-Learning. Curran Associates Inc., Red Hook, NY, USA, 2019.
  • Namata et al. [2012] Galileo Mark Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active surveying for collective classification. In International Workshop on Mining and Learning with Graphs, Edinburgh, Scotland, 2012.
  • Neumann et al. [2013] Marion Neumann, Plinio Moreno, Laura Antanas, Roman Garnett, and Kristian Kersting. Graph kernels for object category prediction in task-dependent robot grasping. In Online proceedings of the eleventh workshop on mining and learning with graphs, pages 0–6, 2013.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
  • Pržulj [2007] Nataša Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177–e183, 2007.
  • Reddi et al. [2018] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.
  • Rossi and Ahmed [2015] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, page 4292–4293. AAAI Press, 2015. ISBN 0262511290.
  • Shen et al. [2007] Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen, Yixue Li, and Hualiang Jiang. Predicting protein–protein interactions based only on sequences information. Proceedings of the National Academy of Sciences, 104(11):4337–4341, 2007.
  • Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/cb8da6767461f2812ae4290eac7cbc42-Paper.pdf.
  • Sun et al. [2021] Jianing Sun, Zhaoyue Cheng, Saba Zuberi, Felipe Perez, and Maksims Volkovs. Hgcf: Hyperbolic graph convolution networks for collaborative filtering. In Proceedings of the Web Conference 2021, WWW ’21, page 593–601, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383127. doi: 10.1145/3442381.3450101. URL https://doi.org/10.1145/3442381.3450101.
  • Ungar [2005] Abraham A Ungar. Analytic hyperbolic geometry: Mathematical foundations and applications. World Scientific, 2005.
  • Vilalta and Drissi [2002] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial intelligence review, 18(2):77–95, 2002.
  • Vyas et al. [2022] Anoushka Vyas, Nurendra Choudhary, Mehrdad Khatir, and Chandan K. Reddy. Graphzoo: A development toolkit for graph neural networks with hyperbolic geometries. In Companion Proceedings of the Web Conference 2022, WWW ’22, page 184–188, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450391306. doi: 10.1145/3487553.3524241. URL https://doi.org/10.1145/3487553.3524241.
  • Wu et al. [2019] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6861–6871. PMLR, 2019.
  • Xu et al. [2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
  • Zeng et al. [2020] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. GraphSAINT: Graph sampling based inductive learning method. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJe8pkHFwS.
  • Zhang et al. [2020] Chuxu Zhang, Huaxiu Yao, Chao Huang, Meng Jiang, Zhenhui Li, and Nitesh V Chawla. Few-shot knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3041–3048, 2020.
  • Zhou et al. [2019] Fan Zhou, Chengtai Cao, Kunpeng Zhang, Goce Trajcevski, Ting Zhong, and Ji Geng. Meta-gnn: On few-shot node classification in graph meta-learning. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, page 2357–2360, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450369763. doi: 10.1145/3357384.3358106. URL https://doi.org/10.1145/3357384.3358106.
  • Zhou et al. [2020a] Fan Zhou, Chengtai Cao, Goce Trajcevski, Kunpeng Zhang, Ting Zhong, and Ji Geng. Fast network alignment via graph meta-learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 686–695, 2020a. doi: 10.1109/INFOCOM41043.2020.9155456.
  • Zhou et al. [2020b] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020b. ISSN 2666-6510. doi: https://doi.org/10.1016/j.aiopen.2021.01.001. URL https://www.sciencedirect.com/science/article/pii/S2666651021000012.
  • Zitnik and Leskovec [2017] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.
  • Zitnik et al. [2019] Marinka Zitnik, Rok Sosič, Marcus W. Feldman, and Jure Leskovec. Evolution of resilience in protein interactomes across the tree of life. Proceedings of the National Academy of Sciences, 116(10):4426–4433, 2019. doi: 10.1073/pnas.1818013116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1818013116.

Appendix

Appendix A Preliminaries

In this section, we discuss the hyperbolic operations used in HNN formulations and set up the meta-learning problem.

Hyperbolic operations: The hyperbolic gyrovector operations required for training neural networks, in a Poincaré ball with curvature $c$, are defined by Möbius addition ($\oplus_{c}$), Möbius subtraction ($\ominus_{c}$), the exponential map ($\exp_{x}^{c}$), the logarithmic map ($\log_{x}^{c}$), the Möbius scalar product ($\odot_{c}$), and the Möbius matrix-vector product ($\otimes_{c}$).

g_{x}^{\mathbb{H}} \coloneqq \lambda_{x}^{2}\, g^{\mathbb{E}} \quad\text{where } \lambda_{x}\coloneqq\frac{2}{1-\|x\|^{2}}   (3)
x\oplus_{c}y \coloneqq \frac{\left(1+2c\langle x,y\rangle+c\|y\|^{2}\right)x+\left(1-c\|x\|^{2}\right)y}{1+2c\langle x,y\rangle+c^{2}\|x\|^{2}\|y\|^{2}}   (4)
x\ominus_{c}y \coloneqq x\oplus_{c}(-y)   (5)
\exp_{x}^{c}(v) \coloneqq x\oplus_{c}\left(\tanh\left(\sqrt{c}\,\frac{\lambda_{x}^{c}\|v\|}{2}\right)\frac{v}{\sqrt{c}\|v\|}\right)   (6)
\log_{x}^{c}(y) \coloneqq \frac{2}{\sqrt{c}\,\lambda_{x}^{c}}\tanh^{-1}\left(\sqrt{c}\,\|-x\oplus_{c}y\|\right)\frac{-x\oplus_{c}y}{\|-x\oplus_{c}y\|}   (7)
r\odot_{c}x \coloneqq \exp_{0}^{c}\left(r\log_{0}^{c}(x)\right),\quad\forall r\in\mathbb{R},\, x\in\mathbb{H}^{d}   (8)
M\otimes_{c}x \coloneqq \frac{1}{\sqrt{c}}\tanh\left(\frac{\|Mx\|}{\|x\|}\tanh^{-1}\left(\sqrt{c}\,\|x\|\right)\right)\frac{Mx}{\|Mx\|}   (9)

Here, \coloneqq denotes the definition of the Möbius operations. The operands x,y\in\mathbb{H}^{d} are hyperbolic gyrovectors, g^{\mathbb{E}}=\textbf{I}_{n} is the Euclidean identity metric tensor, and \|x\| is the Euclidean norm of x. \lambda_{x} is the conformal factor between the Euclidean and hyperbolic metric tensors [13].
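
As a point of reference, the following NumPy sketch implements Eqs. (4), (6), (7), and (9) for a single gyrovector. It is an illustrative re-implementation under stated assumptions (the function names and the use of NumPy are ours, not part of any released codebase), and it assumes all inputs lie strictly inside the Poincaré ball.

import numpy as np

def mobius_add(x, y, c=1.0):
    # Möbius addition (Eq. 4) on the Poincaré ball with curvature c.
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def exp_map(x, v, c=1.0):
    # Exponential map exp_x^c(v) (Eq. 6); lam is the conformal factor lambda_x^c.
    lam = 2.0 / (1 - c * np.dot(x, x))
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return x
    return mobius_add(x, np.tanh(np.sqrt(c) * lam * nv / 2) * v / (np.sqrt(c) * nv), c)

def log_map(x, y, c=1.0):
    # Logarithmic map log_x^c(y) (Eq. 7), the inverse of exp_map at x.
    lam = 2.0 / (1 - c * np.dot(x, x))
    d = mobius_add(-x, y, c)
    nd = np.linalg.norm(d)
    if nd < 1e-15:
        return np.zeros_like(x)
    return (2.0 / (np.sqrt(c) * lam)) * np.arctanh(np.sqrt(c) * nd) * d / nd

def mobius_matvec(M, x, c=1.0):
    # Möbius matrix-vector product (Eq. 9).
    Mx = M @ x
    nx, nMx = np.linalg.norm(x), np.linalg.norm(Mx)
    if nMx < 1e-15:
        return np.zeros_like(Mx)
    return np.tanh(nMx / nx * np.arctanh(np.sqrt(c) * nx)) * Mx / (np.sqrt(c) * nMx)

# Round-trip sanity check: log_map(x, exp_map(x, v)) should recover v.
x = np.array([0.1, -0.2]); v = np.array([0.05, 0.03])
print(np.allclose(log_map(x, exp_map(x, v)), v, atol=1e-6))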

Meta-Learning setup: In meta-learning, the dataset consists of multiple tasks \mathcal{T}_{i}\in\mathcal{D}. The dataset is divided into training (\mathcal{D}_{train}), test (\mathcal{D}_{test}), and validation (\mathcal{D}_{val}) sets, and each task \mathcal{T}_{i} is divided into a support set (\mathcal{T}_{i}^{s}) and a query set (\mathcal{T}_{i}^{q}) with corresponding labels Y_{s} and Y_{q}, respectively. The number of query labels is N=|Y_{q}|, and the number of support examples per task in meta-testing is K=|\mathcal{T}_{i}^{s}|, \mathcal{T}_{i}\in\mathcal{D}_{test}. This setup is also known as the N-way K-shot learning problem. In the meta-training phase of MAML [12], the model trains on tasks \mathcal{T}_{i}^{s}\in\mathcal{D}_{train} and is evaluated on \mathcal{T}_{i}^{q}\in\mathcal{D}_{train} to learn the meta-information. For few-shot evaluation, the model is trained on \mathcal{T}_{i}^{s}\in\mathcal{D}_{test} and evaluated on \mathcal{T}_{i}^{q}\in\mathcal{D}_{test}, where |\mathcal{T}_{i}^{s}|\ll|\mathcal{T}_{i}^{q}|. The goal is to learn initial parameters \theta_{*} from \mathcal{D}_{train} such that they can quickly adapt to the parameters \theta for new tasks in \mathcal{D}_{test}. The hyper-parameters of the meta-learning setup are tuned using \mathcal{D}_{val}.
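
For concreteness, a minimal sketch of how an N-way K-shot episode (support/query split) could be assembled is shown below; the nodes_by_label mapping and the sizes are illustrative placeholders rather than the actual task-construction code used in our experiments.

import random

def sample_episode(nodes_by_label, n_way, k_shot, n_query):
    # Sample N labels, then K support and n_query query nodes per label.
    labels = random.sample(sorted(nodes_by_label), n_way)
    support, query = [], []
    for y in labels:
        nodes = random.sample(nodes_by_label[y], k_shot + n_query)
        support += [(u, y) for u in nodes[:k_shot]]
        query += [(u, y) for u in nodes[k_shot:]]
    return support, query

# Example: a 3-way 5-shot episode over a toy label-to-nodes mapping.
nodes_by_label = {y: list(range(100 * y, 100 * y + 50)) for y in range(6)}
support, query = sample_episode(nodes_by_label, n_way=3, k_shot=5, n_query=10)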

Appendix B Theorem Proofs

This section provides the theoretical proofs of the theorems presented in our main paper.

B.1 Proof of Theorem 1

Theorem.

For a set of paths P_{uv} between nodes u and v, let us define D^{p_{i}}_{g\mu} as the geometric mean of the degrees of the nodes in a path p_{i}\in P_{uv}, p_{uv} as the shortest path, and \mathcal{I}_{uv} as the influence of node v on u. Also, let us define D^{min}_{g\mu}=\min\left\{D^{p_{i}}_{g\mu}~\forall p_{i}\in P_{uv}\right\}. Then, the relation \mathcal{I}_{uv}\leq \exp_{u}^{c}\left(K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|}\right) (where K is a constant) holds for message propagation in HGCN models.

Proof.

The aggregation in the HGCN model is defined as:

x^{\mathbb{H}}_{u}=\exp_{x^{\mathbb{H}}_{u}}^{c}\left(\frac{1}{D_{u}}\sum_{i\in\mathcal{N}(u)}w_{ui}\log_{x^{\mathbb{H}}_{u}}^{c}\left(x^{\mathbb{H}}_{i}\right)\right)

where x^{\mathbb{H}}_{u}, D_{u}, and \mathcal{N}(u) are the embedding, degree, and neighborhood of the root node u, respectively. Note that points in the local tangent space follow Euclidean algebra. For simplicity, let us denote the tangent (Euclidean) vector as x_{i}=\log_{x^{\mathbb{H}}_{u}}^{c}\left(x^{\mathbb{H}}_{i}\right). Simplifying the aggregation function:

x^{\mathbb{H}}_{u}=\exp_{x^{\mathbb{H}}_{u}}^{c}\left(\frac{1}{D_{u}}\sum_{i\in\mathcal{N}(u)}w_{ui}x_{i}\right)

Expanding the aggregation function recursively to cover all possible paths from u to the nodes connected to it:

x^{\mathbb{H}}_{u}=\exp_{x^{\mathbb{H}}_{u}}^{c}\Bigg(\frac{1}{D_{u}}\sum_{i\in\mathcal{N}(u)}w_{ui}\frac{1}{D_{i}}\sum_{j\in\mathcal{N}(i)}w_{ij}\dots\frac{1}{D_{m}}\sum_{n\in\mathcal{N}(m)}w_{mn}\dots\frac{1}{D_{k}}\sum_{o\in\mathcal{N}(k)}w_{ko}x_{o}\Bigg)

The influence of a node v on u is given by:

\mathcal{I}_{uv}=\exp_{0}^{c}\left(\left\|\frac{\partial \log_{x}^{c}\left(x^{\mathbb{H}}_{u}\right)}{\partial \log_{x}^{c}\left(x^{\mathbb{H}}_{v}\right)}\right\|\right)

Simplifying tangent space vectors,

\mathcal{I}_{uv}=\exp_{0}^{c}\left(\left\|\frac{\partial x_{u}}{\partial x_{v}}\right\|\right)
\left\|\frac{\partial x_{u}}{\partial x_{v}}\right\|=\Bigg\|\frac{\partial}{\partial x_{v}}\Bigg(\frac{1}{D_{u}}\sum_{i\in\mathcal{N}(u)}w_{ui}\frac{1}{D_{i}}\sum_{j\in\mathcal{N}(i)}w_{ij}\dots\frac{1}{D_{m}}\sum_{n\in\mathcal{N}(m)}w_{mn}\dots\frac{1}{D_{k}}\sum_{o\in\mathcal{N}(k)}w_{ko}x_{o}\Bigg)\Bigg\|

Since the partial derivative is taken with respect to x_{v}, only the terms corresponding to paths between u and v are non-zero; all other terms vanish, i.e.,

\left\|\frac{\partial x_{u}}{\partial x_{v}}\right\|=\Bigg\|\frac{\partial}{\partial x_{v}}\Bigg(\frac{1}{D_{u}}w_{up^{1}_{i}}\frac{1}{D_{p^{1}_{i}}}w_{p^{1}_{i}p^{1}_{j}}\dots\frac{1}{D_{p^{1}_{k}}}w_{p^{1}_{k}v}x_{v}+\dots+\frac{1}{D_{u}}w_{up^{m}_{i}}\frac{1}{D_{p^{m}_{i}}}w_{p^{m}_{i}p^{m}_{j}}\dots\frac{1}{D_{p^{m}_{k}}}w_{p^{m}_{k}v}x_{v}\Bigg)\Bigg\|

where (u,p^{t}_{i},p^{t}_{j},\dots,p^{t}_{k},v)~\forall t\in[1,m] are the paths between nodes u and v. Collecting the terms and pulling the constants out of the derivative,

\left\|\frac{\partial x_{u}}{\partial x_{v}}\right\|=\Bigg\|\frac{w_{up^{1}_{i}}w_{p^{1}_{i}p^{1}_{j}}\dots w_{p^{1}_{k}v}}{D_{u}D_{p^{1}_{i}}\dots D_{p^{1}_{k}}}+\dots+\frac{w_{up^{m}_{i}}w_{p^{m}_{i}p^{m}_{j}}\dots w_{p^{m}_{k}v}}{D_{u}D_{p^{m}_{i}}\dots D_{p^{m}_{k}}}\Bigg\|\cdot\Bigg\|\frac{\partial x_{v}}{\partial x_{v}}\Bigg\|
=\Bigg\|\frac{w_{up^{1}_{i}}w_{p^{1}_{i}p^{1}_{j}}\dots w_{p^{1}_{k}v}}{D_{u}D_{p^{1}_{i}}\dots D_{p^{1}_{k}}}+\dots+\frac{w_{up^{m}_{i}}w_{p^{m}_{i}p^{m}_{j}}\dots w_{p^{m}_{k}v}}{D_{u}D_{p^{m}_{i}}\dots D_{p^{m}_{k}}}\Bigg\|
\leq\Bigg\|m\cdot\max\left(\frac{w_{up^{1}_{i}}w_{p^{1}_{i}p^{1}_{j}}\dots w_{p^{1}_{k}v}}{D_{u}D_{p^{1}_{i}}\dots D_{p^{1}_{k}}},\dots,\frac{w_{up^{m}_{i}}w_{p^{m}_{i}p^{m}_{j}}\dots w_{p^{m}_{k}v}}{D_{u}D_{p^{m}_{i}}\dots D_{p^{m}_{k}}}\right)\Bigg\|

Let us say,

t^{*}=\operatorname*{arg\,max}_{t}\left(\left\{\frac{w_{up^{t}_{i}}w_{p^{t}_{i}p^{t}_{j}}\dots w_{p^{t}_{k}v}}{D_{u}D_{p^{t}_{i}}\dots D_{p^{t}_{k}}}\right\}_{t=1}^{t=m}\right) (10)

Then,

\left\|\frac{\partial x_{u}}{\partial x_{v}}\right\|\leq\left\|m\cdot\frac{w_{up^{t^{*}}_{i}}w_{p^{t^{*}}_{i}p^{t^{*}}_{j}}\dots w_{p^{t^{*}}_{k}v}}{D_{u}D_{p^{t^{*}}_{i}}\dots D_{p^{t^{*}}_{k}}}\right\|

Aggregating the constants into K and substituting the geometric mean of the degrees along the path (with n_{*} denoting the number of nodes in the path p^{t^{*}}), we get:

\left\|\frac{\partial x_{u}}{\partial x_{v}}\right\|\leq K\times\left(\frac{1}{\left(D_{u}D_{p^{t^{*}}_{i}}\dots D_{p^{t^{*}}_{k}}\right)^{1/n_{*}}}\right)^{n_{*}}=\frac{K}{\left(D^{t^{*}}_{g\mu}\right)^{n_{*}}}

Substituting the minimum geometric-mean degree D^{min}_{g\mu}\leq D^{t^{*}}_{g\mu} and the shortest-path length \|p_{uv}\|\leq n_{*},

\left\|\frac{\partial x_{u}}{\partial x_{v}}\right\|\leq\frac{K}{\left(D^{t^{*}}_{g\mu}\right)^{n_{*}}}\leq\frac{K}{\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|}}

By the transitive property, and applying the exponential map on both sides:

\exp_{u}^{c}\left(\left\|\frac{\partial x_{u}}{\partial x_{v}}\right\|\right)\leq\exp_{u}^{c}\left(K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|}\right)
\mathcal{I}_{uv}\leq\exp_{u}^{c}\left(K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|}\right)
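
To make the quantities in this bound concrete, the short script below enumerates the u–v paths of a toy graph, computes the geometric mean of the node degrees on each path, and evaluates K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|}. It uses networkx purely for illustration; the toy graph, the constant K = 1, and the choice to include every node on the path in the geometric mean are our own assumptions, not the paper's code.

import math
import networkx as nx

G = nx.barbell_graph(4, 2)   # small toy graph: two 4-cliques joined by a 2-node path
u, v = 0, 9

def geo_mean_degree(graph, path):
    # Geometric mean of the degrees of the nodes on a path.
    degs = [graph.degree(n) for n in path]
    return math.prod(degs) ** (1.0 / len(degs))

paths = list(nx.all_simple_paths(G, u, v))
d_gmu_min = min(geo_mean_degree(G, p) for p in paths)
p_uv = nx.shortest_path_length(G, u, v)
K = 1.0
print(K / d_gmu_min ** p_uv)   # the argument of exp_u^c in the Theorem 1 bound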

B.2 Proof of Theorem 2

Theorem.

Given the subgraph S_{u} of graph \mathcal{G} centered at node u, with the corresponding label Y_{u}, let us define a node v\in\mathcal{G} with maximum influence on u, i.e., v=\operatorname*{arg\,max}_{t}\left(\{\mathcal{I}_{ut},~t\in\mathcal{N}(u)\setminus u\}\right). For a set of paths P_{uv} between nodes u and v, let us define D^{p_{i}}_{g\mu} as the geometric mean of the degrees of the nodes in a path p_{i}\in P_{uv}, \|p_{uv}\| as the shortest path length, and D^{min}_{g\mu}=\min\left\{D^{p_{i}}_{g\mu}~\forall p_{i}\in P_{uv}\right\}. Then, the information loss between encoding \mathcal{G} and S_{u} is bounded by \delta_{\mathbb{H}}(\mathcal{G},S_{u})\leq \exp_{u}^{c}\left(K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|+1}\right) (where K is a constant).

Proof.

The information loss between encoding the entire graph \mathcal{G} and the node-centric local subgraph S_{u} with root node u is given by:

\delta_{\mathbb{H}}(\mathcal{G},S_{u})=\exp^{c}_{0}\left(\delta(\mathcal{G},S_{u})\right)
\delta(\mathcal{G},S_{u})=\log^{c}_{0}\left(\mathcal{I}_{\mathcal{G}}(u)\right)-\log^{c}_{0}\left(\mathcal{I}_{S_{u}}(u)\right)
=\left(\left\|\frac{\partial x_{u}}{\partial x_{1}}\right\|+\dots+\left\|\frac{\partial x_{u}}{\partial x_{n}}\right\|\right)-\left(\left\|\frac{\partial x_{u}}{\partial x_{i_{1}}}\right\|+\dots+\left\|\frac{\partial x_{u}}{\partial x_{i_{m}}}\right\|\right)
Cancelling the terms for nodes that appear in both \mathcal{G} and S_{u},
=\left\|\frac{\partial x_{u}}{\partial x_{t_{1}}}\right\|+\left\|\frac{\partial x_{u}}{\partial x_{t_{2}}}\right\|+\dots+\left\|\frac{\partial x_{u}}{\partial x_{t_{n-m}}}\right\|
Using Theorem 1,
\leq\frac{K_{t_{1}}}{\left(D^{t_{1}}_{g\mu}\right)^{\|p^{t_{1}}_{uv}\|}}+\frac{K_{t_{2}}}{\left(D^{t_{2}}_{g\mu}\right)^{\|p^{t_{2}}_{uv}\|}}+\dots+\frac{K_{t_{n-m}}}{\left(D^{t_{n-m}}_{g\mu}\right)^{\|p^{t_{n-m}}_{uv}\|}}
\leq(n-m)\times K_{min}/\left(D^{min}_{g\mu}\right)^{\|p^{min}_{uv}\|+1}
\leq(n-m)\times K_{min}/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|+1}=K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|+1}
\delta(\mathcal{G},S_{u})\leq K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|+1}
\exp^{c}_{0}\left(\delta(\mathcal{G},S_{u})\right)\leq\exp^{c}_{0}\left(K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|+1}\right)
\delta_{\mathbb{H}}(\mathcal{G},S_{u})\leq\exp_{u}^{c}\left(K/\left(D^{min}_{g\mu}\right)^{\|p_{uv}\|+1}\right)
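
For intuition, plugging illustrative numbers into this bound (the values below are arbitrary and not taken from any dataset): with D^{min}_{g\mu}=4 and \|p_{uv}\|=3,

\delta_{\mathbb{H}}(\mathcal{G},S_{u})\leq\exp_{u}^{c}\left(K/4^{3+1}\right)=\exp_{u}^{c}\left(K/256\right)

i.e., omitting nodes that can only reach u through long, high-degree paths changes the encoding of u by a vanishingly small amount.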

Table 5: Hyper-parameter setup of real-world datasets. The columns present the number of tasks in each batch (# Tasks), the HNN update learning rate (\alpha), the meta update learning rate (\beta), and the hidden dimension size (d).
Dataset # Tasks \alpha \beta d
ogbn-arxiv 32 10^{-2} 10^{-3} 256
Tissue-PPI 4 10^{-2} 5\times 10^{-3} 128
Fold-PPI 16 5\times 10^{-3} 10^{-3} 128
FirstMM-DB 8 10^{-2} 5\times 10^{-4} 128
Tree-of-Life 8 5\times 10^{-3} 5\times 10^{-4} 256

Appendix C H-GRAM Algorithm

Algorithm 1 details the training procedure of the H-GRAM model.

Input: Graphs \mathcal{G}^{\cup}=\{\mathcal{G}_{i}\}_{i=1}^{\|\mathcal{G}^{\cup}\|}, ground truth Y;
Output: Predictor P_{\theta}, parameters \theta;
Initialize \theta_{*} and \theta as the HGCN model and meta-update parameters, respectively;
# Partition graphs into node-centric subgraphs
S_{1},S_{2},\dots,S_{\|V\|}=Partition\left(\mathcal{G}^{\cup}\right);
# Batch subgraphs into tasks for meta-learning
\mathcal{T}=\{\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{\|\mathcal{T}\|}\};
while not converged do
      \mathcal{T}_{train}\leftarrow sample(\mathcal{T});
      for \mathcal{T}_{i}\in\mathcal{T}_{train} do
            # Batch of support and query sets from the task
            \{S_{u}\}^{s},\{Y_{u}\}^{s}\leftarrow\mathcal{T}_{i}^{s};
            \{S_{u}\}^{q},\{Y_{u}\}^{q}\leftarrow\mathcal{T}_{i}^{q};
            for j\in[1,\eta] do
                  # Update \theta_{*} using the support set
                  h_{u}^{s}=HGCN_{\theta^{*}_{j-1}}({S_{u}}^{s})
                  e_{u}^{s}=\frac{\sum_{i=1}^{\|V\|}\upgamma_{iu}h^{s}_{iu}}{\sum_{i=1}^{\|V\|}\upgamma_{iu}}
                  c_{k}^{s}=\frac{\sum_{Y_{u}=y_{k}}\upgamma_{i}e^{s}_{i}}{\sum_{Y_{u}=y_{k}}\upgamma_{i}}
                  p_{k}^{s}=\frac{e^{(-d_{\mathbb{H}}^{c}(e^{s}_{u},c^{s}_{k}))}}{\sum_{k}e^{(-d_{\mathbb{H}}^{c}(e^{s}_{u},c^{s}_{k}))}}
                  \mathcal{L}^{s}=\mathcal{L}(p^{s},Y_{u}^{s})=\sum_{j}y^{s}_{j}\log{p^{s}_{j}}
                  \theta^{*}_{j}\leftarrow\exp^{c}_{\theta^{*}_{j-1}}(-\alpha\nabla\mathcal{L}^{s})
                  # Record evaluation with \theta_{*} on the query set
                  h_{u}^{q}=HGCN_{\theta^{*}_{j}}({S_{u}}^{q})
                  e_{u}^{q}=\frac{\sum_{i=1}^{\|V\|}\upgamma_{iu}h^{q}_{iu}}{\sum_{i=1}^{\|V\|}\upgamma_{iu}}
                  c_{k}^{q}=\frac{\sum_{Y_{u}=y_{k}}\upgamma_{i}e^{q}_{i}}{\sum_{Y_{u}=y_{k}}\upgamma_{i}}
                  p_{k}^{q}=\frac{e^{(-d_{\mathbb{H}}^{c}(e^{q}_{u},c^{q}_{k}))}}{\sum_{k}e^{(-d_{\mathbb{H}}^{c}(e^{q}_{u},c^{q}_{k}))}}
                  \mathcal{L}^{q}_{ij}=\mathcal{L}(p^{q},Y_{u}^{q})=\sum_{j}y^{q}_{j}\log{p^{q}_{j}}
            end for
      end for
      # Update the meta-learning parameter \theta using the query losses
      \theta\leftarrow\exp_{\theta}^{c}(-\beta\nabla\sum_{i}\mathcal{L}^{q}_{i\eta})
end while
Algorithm 1 H-GRAM meta-learning algorithm
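
To illustrate the prototype and prediction steps inside the inner loop of Algorithm 1, a self-contained Python sketch is given below. The helper names (poincare_dist, hyperbolic_protonet) are ours, the curvature is fixed to c = 1, and the weighted Euclidean mean of ball coordinates is a simplification used only to show the control flow; it is not the exact hyperbolic averaging used by H-GRAM.

import numpy as np

def poincare_dist(x, y, c=1.0):
    # Closed-form hyperbolic distance d_H^c on the Poincaré ball.
    diff2 = np.sum((x - y) ** 2)
    denom = (1 - c * np.sum(x ** 2)) * (1 - c * np.sum(y ** 2))
    return (1.0 / np.sqrt(c)) * np.arccosh(1 + 2 * c * diff2 / denom)

def hyperbolic_protonet(embeddings, labels, weights=None):
    # Build per-label prototypes c_k and a predictor p_k = softmax(-d_H^c(e_u, c_k)).
    if weights is None:
        weights = np.ones(len(labels))
    classes = sorted(set(labels))
    protos = {}
    for k in classes:
        idx = [i for i, y in enumerate(labels) if y == k]
        w = weights[idx][:, None]
        protos[k] = (w * embeddings[idx]).sum(axis=0) / w.sum()   # simplified label prototype
    def predict(e_u):
        d = np.array([poincare_dist(e_u, protos[k]) for k in classes])
        p = np.exp(-d) / np.exp(-d).sum()
        return dict(zip(classes, p))
    return protos, predict

# Illustrative usage with random points kept well inside the unit ball.
rng = np.random.default_rng(0)
emb = 0.2 * rng.standard_normal((12, 2))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
protos, predict = hyperbolic_protonet(emb, labels)
print(predict(0.2 * rng.standard_normal(2)))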
Table 6: Performance of H-GRAM and the baselines on synthetic and real-world datasets. The top three rows define the task, problem setup (Single Graph (SG), Multiple Graphs (MG), Shared Labels (SL), or Disjoint Labels (DL)), and dataset. The problems with disjoint labels use a 2-way meta-learning setup; in the case of shared labels, the cycle and BA graphs have 17 and 10 labels, respectively. In our evaluation, we use 5 and 10 gradient update steps in meta-training and meta-testing, respectively. The columns present the average multi-class classification accuracy and 95% confidence interval over five folds. Note that the baselines are only defined for certain tasks; “-” indicates that the baseline is not defined for that task and setup. Meta-Graph is only defined for link prediction.
Task Node Classification (first nine columns) Link Prediction (last two columns)
Setup \langle SG,DL\rangle (Syn. Cycle, Syn. BA) \langle MG,SL\rangle (Syn. Cycle, Syn. BA) \langle MG,DL\rangle (Syn. Cycle, Syn. BA) \langle SG,DL\rangle (ogbn-arxiv) \langle MG,SL\rangle (Tissue-PPI) \langle MG,DL\rangle (Fold-PPI) \langle MG,SL\rangle (FirstMM-DB) \langle MG,SL\rangle (Tree-of-Life)
Dataset Syn. Cycle Syn. BA Syn. Cycle Syn. BA Syn. Cycle Syn. BA ogbn-arxiv Tissue-PPI Fold-PPI FirstMM-DB Tree-of-Life
Meta-Graph - - - - - - - - - .719±.018 .705±.004
Meta-GNN .720±.191 .694±.098 - - - - .273±.107 - - - -
FS-GIN .684±.126 .749±.093 - - - - .336±.037 - - - -
FS-SGC .574±.081 .715±.088 - - - - .347±.004 - - - -
ProtoNet .821±.173 .858±.126 .282±.039 .657±.030 .749±.160 .866±.186 .372±.015 .546±.022 .382±.027 .779±.018 .697±.009
MAML .842±.181 .848±.186 .511±.044 .726±.020 .653±.082 .844±.177 .389±.018 .745±.045 .482±.054 .758±.022 .719±.011
G-META .872±.113 .867±.129 .542±.039 .734±.033 .767±.156 .867±.183 .451±.028 .768±.025 .561±.052 .784±.025 .722±.028
H-GRAM .883±.145 .873±.120 .555±.041 .746±.028 .779±.132 .888±.182 .472±.035 .786±.031 .584±.044 .804±.021 .742±.013

Appendix D Confidence Intervals

In the results presented in Table 6, the reported mean demonstrates the predictive power of H-GRAM, and the 95% confidence interval estimates the degree of uncertainty. The confidence intervals on the synthetic datasets are large because, in meta-testing, we only sample two labels in each fold. In cases where the structure of the local subgraphs in meta-training differs significantly from that in meta-testing, our model performs poorly due to the limited scope of knowledge transfer. Given the limited number of possible data splits in the synthetic datasets, there is generally at least one split on which our model does not perform well, which results in a larger confidence interval. Real-world datasets contain many more labels, and hence we are able to sample more labels for meta-testing, e.g., 5 labels for 3-way classification. This reduces the possibility of atypical results, thus leading to smaller intervals.

Appendix E Broader Impacts

Our model has the potential to impact various applications that involve graph-structured data, such as social network analysis, bioinformatics, and recommendation systems. Furthermore, the ability to generalize information from subgraph partitions of large datasets can be especially beneficial for applications with limited labeled data, such as healthcare and finance. Moreover, H-GRAM also addresses several challenges in HNNs, including inductive learning, over-smoothing, and few-shot learning. These capabilities can be used to improve the performance of HNNs in various tasks such as node classification, link prediction, and graph classification. However, it is important to note that this model also has certain limitations. In particular, H-GRAM is based on a specific model of hyperbolic space, which may not be suitable for certain types of graph-structured data, and some assumptions made in the proofs of our theoretical results may not hold in general. Additionally, the meta-learning setup may not be suitable for all types of tasks, and further research is needed to evaluate the performance of H-GRAM on other task types. As future directions, it would be interesting to investigate the effect of inadequate local subgraphs, the scalability of H-GRAM on even larger datasets, and the effectiveness of H-GRAM on tasks involving temporal or multi-modal graph data.