Attention-Driven Metapath Encoding in Heterogeneous Graphs
Abstract
One of the emerging techniques for node classification in heterogeneous graphs is to restrict message aggregation to pre-defined, semantically meaningful structures called metapaths. This work is the first attempt to incorporate attention into the process of encoding entire metapaths without dropping intermediate nodes. In particular, we construct two encoders: the first uses sequential attention to extend the multi-hop message passing algorithm designed in Wang et al. [6] to the metapath setting, and the second incorporates direct attention to extract semantic relations in the metapath. The model then employs the intra-metapath and inter-metapath aggregation mechanisms of Wang et al. [7]. We furthermore use the training scheduler specialized for heterogeneous graphs developed in Wong et al. [8], ensuring the model gradually learns to classify the most difficult nodes. The result is a resilient, general-purpose framework for capturing semantic structures in heterogeneous graphs. In particular, we demonstrate that our model is competitive with state-of-the-art models at node classification on the IMDB dataset, a popular benchmark introduced in Lv et al. [2].
1 Background
Since the invention of Graph Attention Networks (GATs) in Veličković et al. [5], a variety of approaches have sought to extend the attention mechanism to heterogeneous domains. The GAT fails to account for the semantic relationships introduced by varying node and edge types, so a mechanism that incorporates this information into its base architecture is required. The heterogeneous graph attention network (HAN) in Wang et al. [7] was the first method to incorporate semantic information into the attention mechanism by combining node-level and semantic-level attention. In particular, it employs metapaths, where a metapath $\Phi$ is formally defined as a sequence of nodes and edges of the form $v_1 \xrightarrow{R_1} v_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} v_{k+1}$ that represents a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between nodes $v_1$ and $v_{k+1}$. For example, in the IMDB dataset a particular metapath may take the form ['Movie', 'Actor', 'Movie']. A metapath instance is a particular realization of that pattern (for example, {Inception, Leonardo DiCaprio, Titanic}). We define the metapath-based neighbors of a node $v$ under a metapath $\Phi$ as the set of all nodes that are terminal nodes of some instance of $\Phi$ starting at $v$. For each node $v$ and metapath $\Phi$, HAN first calculates node-level attention by (1) applying a node-type-specific projection so that all nodes share a feature space, and (2) aggregating the result of attending to each metapath-based neighbor over each metapath instance of $\Phi$ (discarding non-terminal "intermediate" nodes) to obtain groups of semantic-specific embeddings. It then calculates semantic-level attention by transforming the semantic embeddings via an MLP and measuring their similarity with a learnable semantic-level attention vector $q$. This step measures the importance of each metapath.
Addressing (2), Fu et al. [1] incorporate intermediate nodes by employing intra-metapath aggregation. Specifically, for each node $v$ and metapath $\Phi$, they use a metapath instance encoder to transform the features of all of the nodes in each metapath instance originating at $v$ into a single vector. Encoder options include mean, linear, and relational rotation (Sun et al. [4]). Attention is then used to form a weighted sum over all of the metapath instances of $\Phi$ that contain $v$. Unlike HAN, Yang et al. [9] propose a model that removes node-level attention entirely, replacing it with simple aggregation. Other methods choose not to employ metapaths at all. Wong et al. [8] show that many of these methods can be augmented with a Loss-aware Training Schedule (LTS): the scheduler first prioritizes the easiest nodes during training and gradually introduces the more difficult ones to mitigate the impact of noisy data.
A completely different approach is found in Wang et al. [6]. Their model, the Multi-hop Attention Graph Neural Network (MAGNA), uses a diffusion mechanism that lets nodes multiple hops apart attend to each other within a single layer. It supports the heterogeneous graph setting by incorporating relation embeddings, but it does not exploit long-range semantic structures such as metapaths; doing so is the main subject of our work.
2 Method
We construct two novel metapath instance encoders. The goal is to produce an embedding for each metapath instance such that the instances can be meaningfully aggregated using the simple attention-based method of HAN (Wang et al. [7]). Our codebase is publicly available at https://github.com/calderkatyal/CPSC483FinalProject; we have also implemented the base HAN model in PyTorch Geometric for ablation.
2.1 Metapath Extraction
As in Fu et al. [1], we seek to incorporate intermediate nodes into our metapaths, and we develop a simple but novel method to do so. In particular, we modify transforms.AddMetaPaths to store a dictionary of the specific node types and node IDs in each instance of each metapath. Before applying the metapath transform, which adds edges between the terminal nodes of each instance and removes intermediate nodes, we store the features of the original graph so that the features of every node along each instance remain accessible.
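To illustrate the idea, the sketch below enumerates metapath instances hop by hop while keeping every intermediate node ID; the plain edge-list representation and the helper name are simplifications for exposition, not our exact modification of transforms.AddMetaPaths.

```python
from collections import defaultdict

def extract_metapath_instances(edge_index_dict, metapath):
    """Enumerate all instances of a metapath, keeping intermediate node IDs.

    edge_index_dict: {(src_type, rel, dst_type): [(src_id, dst_id), ...]}
    metapath: list of (src_type, rel, dst_type) triples describing the hops.
    Returns: {source_id: [tuple of node IDs along each instance]}
    """
    # Build adjacency lists per edge type for fast hop expansion.
    adj = defaultdict(lambda: defaultdict(list))
    for etype, edges in edge_index_dict.items():
        for src, dst in edges:
            adj[etype][src].append(dst)

    # Start a partial instance at every source node of the first hop,
    # then extend each partial path by one hop per relation in the metapath.
    partial = [[src] for src in adj[metapath[0]].keys()]
    for etype in metapath:
        extended = []
        for path in partial:
            for nxt in adj[etype].get(path[-1], []):
                extended.append(path + [nxt])
        partial = extended

    instances = defaultdict(list)
    for path in partial:
        instances[path[0]].append(tuple(path))
    return instances


# Toy usage with the Movie-Actor-Movie pattern (IDs are illustrative).
edges = {
    ("movie", "has_actor", "actor"): [(0, 10), (1, 10)],
    ("actor", "acts_in", "movie"): [(10, 0), (10, 1)],
}
mam = [("movie", "has_actor", "actor"), ("actor", "acts_in", "movie")]
print(extract_metapath_instances(edges, mam))
```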
2.2 Multi-hop Encoder
Our first encoder extends the multi-hop framework MAGNA of Wang et al. [6] to metapaths, using all node features along each instance (which we stored while extracting metapaths). Recall that MAGNA first calculates one-hop attention scores via:
$$s^{(l)}_{i,j} = v_a^{\top}\tanh\!\left(W_h h_i^{(l)} + W_t h_j^{(l)} + W_r r_{r(i,j)}\right) \qquad (1)$$
where $v_a$, $W_h$, $W_t$, and $W_r$ are trainable weights and $r_{r(i,j)}$ is the embedding of the relation type of the edge between $v_i$ and $v_j$. The attention score matrix $S^{(l)}$ is then computed as:
$$S^{(l)}_{i,j} = \begin{cases} s^{(l)}_{i,j} & \text{if } v_j \in \mathcal{N}(v_i) \\ -\infty & \text{otherwise} \end{cases} \qquad (2)$$
Finally, the attention matrix $A^{(l)}$ is obtained by performing a row-wise softmax over $S^{(l)}$:
$$A^{(l)} = \operatorname{softmax}\!\left(S^{(l)}\right) \qquad (3)$$
where $A^{(l)}_{i,j}$ denotes the attention value at layer $l$ when aggregating the message from node $v_j$ to node $v_i$. Next, MAGNA builds a multi-hop attention diffusion matrix:
$$\mathcal{A}^{(l)} = \sum_{k=0}^{\infty} \theta_k \big(A^{(l)}\big)^{k}, \qquad \theta_k = \alpha(1-\alpha)^{k}, \qquad (4)$$
where $\alpha \in (0,1]$ is a teleportation rate. The node embeddings are then updated by aggregating messages based on this multi-hop attention:
$$h^{(l+1)}_i = \sum_{j \in \tilde{\mathcal{N}}(i)} \mathcal{A}^{(l)}_{i,j}\, h^{(l)}_j \qquad (5)$$
where $\tilde{\mathcal{N}}(i)$ represents the set of multi-hop neighbors of node $v_i$.
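To make (1)–(5) concrete, here is a minimal, self-contained sketch of a single diffusion-attention layer in PyTorch. It omits relation embeddings, value projections, layer normalization, and the feed-forward block of the full MAGNA model, and it truncates the infinite series in (4) at a fixed number of hops, so it should be read as an illustration of the mechanism rather than a faithful reimplementation of Wang et al. [6].

```python
import torch
import torch.nn as nn


class OneLayerDiffusionAttention(nn.Module):
    """Simplified single layer of MAGNA-style attention diffusion."""

    def __init__(self, dim, alpha=0.4, num_hops=3):
        super().__init__()
        self.W_h = nn.Linear(dim, dim, bias=False)   # projection of the attending node
        self.W_t = nn.Linear(dim, dim, bias=False)   # projection of the attended node
        self.v_a = nn.Parameter(torch.randn(dim))    # attention vector
        self.alpha = alpha                           # teleportation rate in Eq. (4)
        self.num_hops = num_hops                     # truncation of the series in Eq. (4)

    def forward(self, h, adj_mask):
        # h: (N, dim) node features; adj_mask: (N, N) bool, True where v_i may attend to v_j.
        n = h.size(0)
        # Eq. (1) without relation embeddings: score for every ordered pair (i, j).
        scores = torch.tanh(self.W_h(h).unsqueeze(1) + self.W_t(h).unsqueeze(0)) @ self.v_a
        # Eq. (2): keep scores only on existing edges.
        scores = scores.masked_fill(~adj_mask, float("-inf"))
        # Eq. (3): row-wise softmax gives the one-hop attention matrix A.
        A = torch.nan_to_num(torch.softmax(scores, dim=-1))  # isolated rows become zero
        # Eq. (4): truncated attention diffusion with PPR-style weights theta_k.
        diffusion = self.alpha * torch.eye(n)
        A_power = torch.eye(n)
        for k in range(1, self.num_hops + 1):
            A_power = A_power @ A
            diffusion = diffusion + self.alpha * (1 - self.alpha) ** k * A_power
        # Eq. (5): aggregate messages from multi-hop neighbors.
        return diffusion @ h
```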
We now show that these calculations simplify significantly when restricted to metapaths, and that the result is readily interpretable. Consider a metapath instance $P$ of length $l$ with vertices $v_0, v_1, \dots, v_l$, where we seek to learn an embedding for the source node $v_0$. The key observation is that we can treat $P$ as a directed chain $v_0 \leftarrow v_1 \leftarrow \cdots \leftarrow v_l$. This structure makes intuitive sense, as we would expect the nearest neighbor $v_1$ of $v_0$ to provide a more informative message than the faraway node $v_l$. It also makes sense that the source's own features should influence its updated embedding; this is equivalent to adding a virtual self-loop at the source. Let $a_{i,j}$ denote the attention score between $v_i$ and $v_j$ (i.e., how much $v_i$ "attends to" or "is influenced by" $v_j$). Due to the virtual self-loop, the self-attention score $a_{0,0} > 0$ is also computed. The following result follows:
Theorem 2.1.
For a metapath instance of length $l$ with vertices $v_0, v_1, \dots, v_l$ and a virtual self-loop at the source, the updated embedding of the source node produced by MAGNA (with a single layer) is given by:
$$h'_0 = \frac{\alpha}{1-(1-\alpha)\,a_{0,0}}\left(h_0 + \sum_{j=1}^{l}(1-\alpha)^{j}\Big(\prod_{i=1}^{j} a_{i-1,i}\Big)h_j\right) \qquad (6)$$
The proof is given in Appendix A. As shown in Wang et al. [6], multi-hop attention is equivalent to placing a Personalized PageRank (PPR) prior on the attention values. We thus obtain good interpretability for our metapath encodings: not only do faraway nodes contribute less to the encoding than closer ones, but information also flows in a semantically meaningful way. For instance, consider the metapath instance {Inception, Leonardo DiCaprio, Titanic}, where the goal is to obtain a good embedding of Inception for movie genre classification. Intuitively, the genre of Inception is influenced directly by the fact that Leonardo DiCaprio starred in it, and also indirectly by the movie Titanic, mediated through the intermediary of Leonardo DiCaprio himself (in other words, the semantic relationship between the two movies exists only through his shared involvement in both).
Therefore, it suffices to compute attention between consecutive pairs $(v_{i-1}, v_i)$ along the instance and the self-attention of the source. To do so, we use a single-layer version of MAGNA's algorithm, employing (1) without relation embeddings and (3) with a sigmoid instead of a softmax (a simplex constraint on the attention weights is too strong for structures with such semantic richness). We then use (6) to obtain the metapath instance encodings; a teleportation rate of $\alpha = 0.4$ seems to work well in practice.
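A minimal sketch of the resulting instance encoder is shown below: it computes sigmoid scores on consecutive pairs plus the source self-loop and applies the closed form of (6). The module name and the per-instance loop are illustrative, not our exact implementation.

```python
import torch
import torch.nn as nn


class MultiHopMetapathEncoder(nn.Module):
    """Encodes one metapath instance via the closed form of Theorem 2.1."""

    def __init__(self, dim, alpha=0.4):
        super().__init__()
        self.W_h = nn.Linear(dim, dim, bias=False)
        self.W_t = nn.Linear(dim, dim, bias=False)
        self.v_a = nn.Parameter(torch.randn(dim))
        self.alpha = alpha  # teleportation rate

    def pair_score(self, h_i, h_j):
        # Eq. (1) without relation embeddings, normalized with a sigmoid instead of Eq. (3).
        return torch.sigmoid(torch.tanh(self.W_h(h_i) + self.W_t(h_j)) @ self.v_a)

    def forward(self, feats):
        # feats: (l + 1, dim) features of v_0, ..., v_l along one metapath instance.
        a_self = self.pair_score(feats[0], feats[0])          # self-loop score a_{0,0}
        scale = self.alpha / (1 - (1 - self.alpha) * a_self)  # geometric factor in Eq. (6)
        out = feats[0].clone()
        prod = torch.tensor(1.0)
        for j in range(1, feats.size(0)):
            prod = prod * self.pair_score(feats[j - 1], feats[j])  # running chain product
            out = out + (1 - self.alpha) ** j * prod * feats[j]
        return scale * out
```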
2.3 Direct Attention Encoder
Our second encoder is a simplified version that intuitively works well for short metapaths (which are common in practice). Let $P = (v_0, v_1, \dots, v_l)$ be a metapath instance with source $v_0$. We compute the embedding by directly attending to all nodes in the metapath.
First, features are transformed using learnable matrices $W_{\mathrm{src}}$ for the source node and $W_{\mathrm{oth}}$ for all other nodes:
$$z_0 = W_{\mathrm{src}} h_0, \qquad z_j = W_{\mathrm{oth}} h_j \quad (1 \le j \le l) \qquad (7)$$
Attention scores are computed between $v_0$ and each node $v_j$ (including self-attention for $v_0$) as:
$$e_j = \operatorname{LeakyReLU}\!\left(a^{\top}\left[z_0 \,\|\, z_j\right]\right), \qquad \beta_j = \frac{\exp(e_j)}{\sum_{k=0}^{l}\exp(e_k)}, \qquad (8)$$
where $a$ is a learnable attention vector.
The final embedding for the source node is then computed as a weighted sum of all transformed node features:
$$h'_0 = \sum_{j=0}^{l} \beta_j z_j \qquad (9)$$
This encoder allows the source node to perform weighted aggregation over all node features in the metapath, at the cost of reduced interpretability for longer metapaths (where it is important that distant nodes attend to each other in a controlled manner).
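A minimal sketch of this encoder, following (7)–(9) as written above, is shown below; the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DirectAttentionEncoder(nn.Module):
    """Encodes a metapath instance by letting the source attend to every node."""

    def __init__(self, dim):
        super().__init__()
        self.W_src = nn.Linear(dim, dim, bias=False)    # Eq. (7): source projection
        self.W_oth = nn.Linear(dim, dim, bias=False)    # Eq. (7): projection for the rest
        self.att = nn.Parameter(torch.randn(2 * dim))   # Eq. (8): attention vector a

    def forward(self, feats):
        # feats: (l + 1, dim) features of v_0, ..., v_l along one metapath instance.
        z_src = self.W_src(feats[0])
        z = torch.cat([z_src.unsqueeze(0), self.W_oth(feats[1:])], dim=0)
        # Eq. (8): score every node (including the source itself) against the source.
        pairs = torch.cat([z_src.expand(z.size(0), -1), z], dim=-1)
        weights = torch.softmax(F.leaky_relu(pairs @ self.att), dim=0)
        # Eq. (9): the weighted sum of transformed features is the instance encoding.
        return (weights.unsqueeze(-1) * z).sum(dim=0)
```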
2.4 Algorithm
We now give the full algorithm for our model, HAN-ME (HAN with Metapath Encoding). We first perform a type-specific projection to ensure all node features share a common feature space, as in HAN (Wang et al. [7]). We then diverge from HAN by encoding each metapath instance using either the multi-hop or the direct attention encoder. To aggregate over metapath instances, we apply attention between the embedding of the source node and each metapath instance encoding. Inter-metapath (semantic-level) aggregation is performed as in HAN. Our loss function is binary cross-entropy with logits.
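The sketch below shows how these stages might compose for a single node: the instances of each metapath are encoded (with either encoder sketched above), attended against the source embedding, and the per-metapath results are fused with semantic-level attention. Module names are illustrative, and multi-head attention, batching, and the final classifier head are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HANMELayer(nn.Module):
    """Skeleton of one HAN-ME layer: encode instances, then aggregate twice."""

    def __init__(self, dim, instance_encoder):
        super().__init__()
        self.encoder = instance_encoder                          # multi-hop or direct encoder
        self.instance_att = nn.Parameter(torch.randn(2 * dim))   # intra-metapath attention
        self.semantic_proj = nn.Linear(dim, dim)                 # inter-metapath MLP
        self.semantic_q = nn.Parameter(torch.randn(dim))         # semantic attention vector

    def aggregate_instances(self, h_src, instance_feats):
        # Encode every instance rooted at the source node, then attend over the encodings.
        enc = torch.stack([self.encoder(f) for f in instance_feats])        # (P, dim)
        pairs = torch.cat([h_src.expand(enc.size(0), -1), enc], dim=-1)
        weights = torch.softmax(F.leaky_relu(pairs @ self.instance_att), dim=0)
        return (weights.unsqueeze(-1) * enc).sum(dim=0)                      # (dim,)

    def aggregate_semantics(self, per_metapath):
        # per_metapath: (M, dim), one embedding of the same node for each metapath.
        scores = torch.tanh(self.semantic_proj(per_metapath)) @ self.semantic_q
        beta = torch.softmax(scores, dim=0)
        return (beta.unsqueeze(-1) * per_metapath).sum(dim=0)
```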
3 Experiments
We use the multi-label IMDB dataset introduced by Lv et al. [3] for heterogeneous node classification. The dataset is one of the most common for this task and is known for its noisiness and difficulty. It comprises 21,420 nodes (4,932 movies, 2,393 directors, 6,124 actors, 7,971 keywords) with 86,642 bidirectional edges for each relation type. Only movie nodes contain (sparse) features, such as "budget," "duration," and "language." Movies are labeled with up to five genres (Drama: 2,517, Comedy: 1,837, Thriller: 1,384, Action: 1,135, Romance: 1,090), with most having one or two labels. We split the data into training (1,096 nodes, 22.22%), validation (275 nodes, 5.58%), and test (3,202 nodes, 64.92%) sets, omitting unlabeled nodes, as is standard practice with this benchmark. We perform minimal preprocessing, except for assigning features to director, actor, and keyword nodes through mean pooling of the features of connected movie nodes. As in Wang et al. [7], we employ the metapaths {Movie, Director, Movie} and {Movie, Actor, Movie}.
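As an illustration of this preprocessing step, the sketch below mean-pools movie features onto an initially featureless node type for a single relation; the function name and tensor layout are assumptions rather than our exact preprocessing code.

```python
import torch


def mean_pool_features(movie_x, edge_index, num_targets):
    """Give each featureless node the mean of its connected movies' features.

    movie_x: (num_movies, dim) movie feature matrix.
    edge_index: (2, E) long tensor of (movie_id, target_id) pairs for one relation.
    num_targets: number of nodes of the featureless type (e.g., actors).
    """
    dim = movie_x.size(1)
    summed = torch.zeros(num_targets, dim)
    counts = torch.zeros(num_targets, 1)
    # Scatter-add movie features and edge counts onto the target nodes.
    summed.index_add_(0, edge_index[1], movie_x[edge_index[0]])
    counts.index_add_(0, edge_index[1], torch.ones(edge_index.size(1), 1))
    return summed / counts.clamp(min=1)
```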
To train our model we use a learning rate of 0.005, weight decay of 0.001, a dropout rate of 0.6, 8 attention heads, and 128 hidden units as in Wang et al. [7], as well as a random seed of 483 and a patience of 100 epochs for early stopping. Furthermore, a teleportation rate of $\alpha = 0.4$ works well in practice for the multi-hop encoder.
We also implement LTS as in Wong et al. [8]. LTS ranks nodes by their loss values during training and progressively introduces them from easiest to hardest: at epoch $t$, a proportion $\lambda(t)$ of the nodes is included according to one of three pacing functions, linear ($\lambda(t) = \min\big(1,\ \lambda_0 + (1-\lambda_0)\tfrac{t}{T}\big)$), root ($\lambda(t) = \min\big(1,\ \sqrt{\lambda_0^2 + (1-\lambda_0^2)\tfrac{t}{T}}\big)$), or geometric ($\lambda(t) = \min\big(1,\ 2^{\log_2(\lambda_0)(1-t/T)}\big)$), with $\lambda_0$ being the initial proportion and $T$ the epoch at which all nodes are included. For our results we use the linear scheduler (a minimal sketch of these pacing functions is given at the end of this section). We select the model with the highest validation performance as our final model. Training performance is summarized below:
| Model | Micro F1 (Standard) | Micro F1 (With LTS) | Macro F1 (Standard) | Macro F1 (With LTS) |
|---|---|---|---|---|
| HAN-ME Multihop | | | | |
| HAN-ME Direct | | | | |
| HAN | | | | |
[Figure 1: training and validation performance curves]
We see that the multi-hop encoder yields the best test Micro F1 and the direct attention encoder yields the best test Macro F1. Notably, while both encoders perform significantly better than HAN, they perform at roughly the same level as each other. This is likely due to two factors: (1) we only use metapaths with 3 nodes, for which the direct attention encoder has good interpretability, and (2) we did not have explicit node features for non-movie nodes, which would significantly benefit the multi-hop method. Additionally, LTS (which we tested with multiple configurations) did not cause a statistically significant difference in performance. This is likely because the dataset features many very difficult nodes that require significant message passing and neighbor information to classify correctly. As seen in Figure 1, the models overfit to the training data (as expected) and suffer somewhat in generalization.
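For reference, here is a minimal sketch of the three pacing functions as written above; the helper name and example values are illustrative, and the exact formulation in Wong et al. [8] may differ in detail.

```python
import math


def lts_pace(t, lambda0, total_epochs, kind="linear"):
    """Fraction of training nodes (ranked easiest-first by loss) used at epoch t."""
    progress = t / total_epochs
    if kind == "linear":
        frac = lambda0 + (1 - lambda0) * progress
    elif kind == "root":
        frac = math.sqrt(lambda0 ** 2 + (1 - lambda0 ** 2) * progress)
    elif kind == "geometric":
        frac = 2 ** (math.log2(lambda0) * (1 - progress))
    else:
        raise ValueError(f"unknown pacing function: {kind}")
    return min(1.0, frac)


# Example: with lambda0 = 0.25 and T = 100, the linear schedule reaches 1.0 at epoch 100.
print([round(lts_pace(t, 0.25, 100), 2) for t in (0, 50, 100)])  # [0.25, 0.62, 1.0]
```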
4 Conclusion
In this paper we introduced HAN-ME, a novel framework for incorporating attention-based metapath instance encoding into heterogeneous graph neural networks. We presented two distinct encoding approaches: a multi-hop encoder that extends MAGNA's diffusion mechanism to metapaths, and a direct attention encoder. Both encoders show significant improvements over the baseline HAN model. We believe the multi-hop method, with its strong interpretability, has the potential to deliver better performance if applied to longer metapaths or if specific components, such as its activation function, were further optimized. Our experimental results provide compelling evidence for the effectiveness of HAN-ME in capturing complex heterogeneous graph structures.
References
- Fu et al. [2020] Xinyu Fu, Jiani Zhang, Ziqiao Meng, and Irwin King. MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding. In Proceedings of The Web Conference 2020, WWW ’20. ACM, April 2020. doi: 10.1145/3366423.3380297. URL http://dx.doi.org/10.1145/3366423.3380297.
- Lv et al. [2021a] Qingsong Lv, Ming Ding, Qiang Liu, Yuxiang Chen, Wenzheng Feng, Siming He, Chang Zhou, Jianguo Jiang, Yuxiao Dong, and Jie Tang. Are we really making much progress? revisiting, benchmarking, and refining heterogeneous graph neural networks, 2021a. URL https://arxiv.org/abs/2112.14936.
- Lv et al. [2021b] Qingsong Lv, Ming Ding, Qiang Liu, Yuxiang Chen, Wenzheng Feng, Siming He, Chang Zhou, Jianguo Jiang, Yuxiao Dong, and Jie Tang. Are we really making much progress? revisiting, benchmarking and refining the heterogeneous graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2021b.
- Sun et al. [2019] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space, 2019. URL https://arxiv.org/abs/1902.10197.
- Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks, 2018. URL https://arxiv.org/abs/1710.10903.
- Wang et al. [2021a] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Multi-hop attention graph neural network, 2021a. URL https://arxiv.org/abs/2009.14332.
- Wang et al. [2021b] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Peng Cui, P. Yu, and Yanfang Ye. Heterogeneous graph attention network, 2021b. URL https://arxiv.org/abs/1903.07293.
- Wong et al. [2024] Zhen Hao Wong, Hansi Yang, Xiaoyi Fu, and Quanming Yao. Loss-aware curriculum learning for heterogeneous graph neural networks, 2024. URL https://arxiv.org/abs/2402.18875.
- Yang et al. [2023] Xiaocheng Yang, Mingyu Yan, Shirui Pan, Xiaochun Ye, and Dongrui Fan. Simple and efficient heterogeneous graph neural network, 2023. URL https://arxiv.org/abs/2207.02547.
Appendix
Appendix A Proof of Theorem 2.1
We prove (6). Order the vertices of the instance as $v_0, v_1, \dots, v_l$ and let $A$ denote the one-hop attention matrix of the directed chain with the virtual self-loop at the source, i.e.,
$$A_{0,0} = a_{0,0}, \qquad A_{i,i+1} = a_{i,i+1} \ \ (0 \le i \le l-1), \qquad A_{i,j} = 0 \ \text{otherwise.}$$
Note that the chain part $C = A - a_{0,0}E_{0,0}$ is nilpotent (e.g., $C^2$ is supported only on the second superdiagonal, and $C^{l+1} = 0$), so no walk can advance more than $l$ steps along the chain. A walk of length $k$ starting at $v_0$ and ending at $v_j$ must therefore traverse the self-loop $k-j$ times and then advance $j$ steps, giving
$$\big(A^{k}\big)_{0,j} = a_{0,0}^{\,k-j}\prod_{i=1}^{j} a_{i-1,i}, \qquad 0 \le j \le \min(k, l).$$
For a metapath instance with vertices $v_0, \dots, v_l$, the entry $\mathcal{A}_{0,j}$ represents the attention contribution from node $v_j$ to $v_0$. Substituting the expression above into (4), it is computed as:
$$\mathcal{A}_{0,j} = \sum_{k \ge j} \alpha(1-\alpha)^{k}\, a_{0,0}^{\,k-j}\prod_{i=1}^{j} a_{i-1,i} = \frac{\alpha(1-\alpha)^{j}}{1-(1-\alpha)\,a_{0,0}}\prod_{i=1}^{j} a_{i-1,i},$$
where the geometric series converges because $(1-\alpha)\,a_{0,0} < 1$. The embedding for the source node is then updated by aggregating messages from all nodes in the metapath instance. Using the definition of $\mathcal{A}$ in (5), we have:
$$h'_0 = \sum_{j=0}^{l} \mathcal{A}_{0,j}\, h_j,$$
which expands to:
$$h'_0 = \frac{\alpha}{1-(1-\alpha)\,a_{0,0}}\left(h_0 + \sum_{j=1}^{l}(1-\alpha)^{j}\Big(\prod_{i=1}^{j} a_{i-1,i}\Big)h_j\right),$$
which is exactly (6). $\blacksquare$
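As a sanity check on the derivation (not part of the original experiments), the following snippet compares the closed form against a numerically truncated version of the diffusion series in (4) on a length-3 chain with arbitrary illustrative attention scores.

```python
import numpy as np

alpha = 0.4
a = {"00": 0.7, "01": 0.9, "12": 0.5, "23": 0.8}   # illustrative sigmoid scores
l = 3

# One-hop attention matrix of the chain v_0 <- v_1 <- v_2 <- v_3 with a source self-loop.
A = np.zeros((l + 1, l + 1))
A[0, 0] = a["00"]
A[0, 1], A[1, 2], A[2, 3] = a["01"], a["12"], a["23"]

# Diffusion series of Eq. (4), truncated far enough to converge numerically.
diffusion = sum(alpha * (1 - alpha) ** k * np.linalg.matrix_power(A, k) for k in range(200))

h = np.random.default_rng(0).normal(size=(l + 1, 4))   # random node features
lhs = diffusion[0] @ h                                  # source row of Eq. (5)

# Closed form of Eq. (6).
scale = alpha / (1 - (1 - alpha) * a["00"])
rhs = h[0].copy()
prod = 1.0
for j, key in enumerate(["01", "12", "23"], start=1):
    prod *= a[key]
    rhs += (1 - alpha) ** j * prod * h[j]
rhs *= scale

print(np.allclose(lhs, rhs))   # expected: True
```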