Attention-Driven Metapath Encoding in Heterogeneous Graphs
Abstract
One of the emerging techniques for node classification in heterogeneous graphs is to restrict message aggregation to pre-defined, semantically meaningful structures called metapaths. This work is the first attempt to incorporate attention into the process of encoding entire metapaths without dropping intermediate nodes. In particular, we construct two encoders: the first uses sequential attention to extend the multi-hop message passing algorithm designed in Wang et al. [6] to the metapath setting, and the second incorporates direct attention to extract semantic relations in the metapath. The model then employs the intra-metapath and inter-metapath aggregation mechanisms of Wang et al. [7]. We furthermore use the training scheduler specialized for heterogeneous graphs developed in Wong et al. [8], ensuring the model gradually learns to classify the most difficult nodes. The result is a resilient, general-purpose framework for capturing semantic structures in heterogeneous graphs. In particular, we demonstrate that our model is competitive with state-of-the-art models at node classification on the IMDB dataset, a popular benchmark introduced in Lv et al. [2].
1 Background
Since the invention of Graph Attention Networks (GATs) in Veličković et al. [5], a variety of approaches have sought to extend the attention mechanism to heterogeneous domains. The GAT fails to account for the semantic relationships introduced by varying node and edge types, so a mechanism that incorporates this information into its base architecture is required. The heterogeneous graph attention network (HAN) in Wang et al. [7] was the first method to incorporate semantic information into the attention mechanism by combining node-level and semantic-level attention. In particular, it employs metapaths, where a metapath $\Phi$ is formally defined as a sequence of nodes and edges of the form $v_1 \xrightarrow{R_1} v_2 \xrightarrow{R_2} \cdots \xrightarrow{R_k} v_{k+1}$ that represents a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_k$ between nodes $v_1$ and $v_{k+1}$. For example, in the IMDB dataset a particular metapath may take the form ['Movie', 'Actor', 'Movie']. A metapath instance is a particular realization of that pattern (for example, {Inception, Leonardo DiCaprio, Titanic}). We define the metapath-based neighbors of a node $v$ under a metapath $\Phi$ as the set of all nodes that are terminal nodes of some instance of $\Phi$ starting at $v$. For each node $v$ and metapath $\Phi$, HAN first calculates node-level attention by (1) applying a node-type-specific projection so that all nodes share a feature space, and (2) aggregating the result of attending to each metapath-based neighbor over each metapath instance of $\Phi$ (discarding non-terminal "intermediate" nodes) to obtain groups of semantic-specific embeddings. It then calculates semantic-level attention by transforming the semantic embeddings via an MLP and measuring their similarity with a learnable semantic-level attention vector $q$. This step measures the importance of each metapath.
Addressing (2), Fu et al. [1] incorporate intermediate nodes by employing intra-metapath aggregation. Specifically, for each node $v$ and metapath $\Phi$, they use a metapath instance encoder to transform the features of all of the nodes in each metapath instance originating at $v$ into a single vector. Encoder options include mean, linear, and relational rotation (Sun et al. [4]). Attention is then used to form a weighted sum over all of the metapath instances of $\Phi$ that contain $v$. Unlike HAN, Yang et al. [9] propose a model that removes node-level attention entirely, replacing it with simple aggregation. Other methods choose not to employ metapaths at all. Wong et al. [8] show that many of these methods can be augmented with a Loss-aware Training Schedule (LTS): the scheduler first prioritizes the easiest nodes during training and gradually introduces the more difficult ones to mitigate the impact of noisy data.
A completely different approach is found in Wang et al. [6]. Their model, the Multi-hop Attention Graph Neural Network (MAGNA), uses a diffusion mechanism that lets nodes multiple hops apart attend to each other within a single layer. It supports the heterogeneous graph setting by incorporating relation embeddings, but it does not exploit long-range semantic structures such as metapaths; doing so is the main subject of our work.
2 Method
We construct two novel metapath instance encoders. The goal is to produce an embedding for each metapath instance such that the instances can be meaningfully aggregated using the simple attention-based method of HAN (Wang et al. [7]). Our codebase is publicly available at https://github.com/calderkatyal/CPSC483FinalProject; we have also implemented the base HAN model in PyTorch Geometric for ablation.
2.1 Metapath Extraction
As in Fu et al. [1], we seek to incorporate intermediate nodes into our metapaths, and we develop a simple but novel method to do so. In particular, we modify transforms.AddMetaPaths to store a dictionary of the specific node types and node IDs in each instance of each metapath. Before applying the metapath transform, which adds edges between the terminal nodes of each instance and removes intermediate nodes, we store the features of the original graph so that the features of every node along each instance remain accessible.
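To illustrate the idea, the sketch below enumerates metapath instances hop by hop while keeping every intermediate node ID; the plain edge-list representation and the helper name are simplifications for exposition, not our exact modification of transforms.AddMetaPaths.

```python
from collections import defaultdict

def extract_metapath_instances(edge_index_dict, metapath):
    """Enumerate all instances of a metapath, keeping intermediate node IDs.

    edge_index_dict: {(src_type, rel, dst_type): [(src_id, dst_id), ...]}
    metapath: list of (src_type, rel, dst_type) triples describing the hops.
    Returns: {source_id: [tuple of node IDs along each instance]}
    """
    # Build adjacency lists per edge type for fast hop expansion.
    adj = defaultdict(lambda: defaultdict(list))
    for etype, edges in edge_index_dict.items():
        for src, dst in edges:
            adj[etype][src].append(dst)

    # Start a partial instance at every source node of the first hop,
    # then extend each partial path by one hop per relation in the metapath.
    partial = [[src] for src in adj[metapath[0]].keys()]
    for etype in metapath:
        extended = []
        for path in partial:
            for nxt in adj[etype].get(path[-1], []):
                extended.append(path + [nxt])
        partial = extended

    instances = defaultdict(list)
    for path in partial:
        instances[path[0]].append(tuple(path))
    return instances


# Toy usage with the Movie-Actor-Movie pattern (IDs are illustrative).
edges = {
    ("movie", "has_actor", "actor"): [(0, 10), (1, 10)],
    ("actor", "acts_in", "movie"): [(10, 0), (10, 1)],
}
mam = [("movie", "has_actor", "actor"), ("actor", "acts_in", "movie")]
print(extract_metapath_instances(edges, mam))
```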
2.2 Multi-hop Encoder
Our first encoder extends the multi-hop framework MAGNA of Wang et al. [6] to metapaths, using all node features along each instance (which we stored while extracting metapaths). Recall that MAGNA first calculates one-hop attention scores via:
$$s^{(l)}_{i,j} = v_a^{\top}\tanh\!\left(W_h h_i^{(l)} + W_t h_j^{(l)} + W_r r_{r(i,j)}\right) \qquad (1)$$
where $v_a$, $W_h$, $W_t$, and $W_r$ are trainable weights and $r_{r(i,j)}$ is the embedding of the relation type of the edge between $v_i$ and $v_j$. The attention score matrix $S^{(l)}$ is then computed as:
$$S^{(l)}_{i,j} = \begin{cases} s^{(l)}_{i,j} & \text{if } v_j \in \mathcal{N}(v_i) \\ -\infty & \text{otherwise} \end{cases} \qquad (2)$$
Finally, the attention matrix $A^{(l)}$ is obtained by performing a row-wise softmax over $S^{(l)}$:
$$A^{(l)} = \operatorname{softmax}\!\left(S^{(l)}\right) \qquad (3)$$
where $A^{(l)}_{i,j}$ denotes the attention value at layer $l$ when aggregating the message from node $v_j$ to node $v_i$. Next, MAGNA builds a multi-hop attention diffusion matrix:
$$\mathcal{A}^{(l)} = \sum_{k=0}^{\infty} \theta_k \big(A^{(l)}\big)^{k}, \qquad \theta_k = \alpha(1-\alpha)^{k}, \qquad (4)$$
where $\alpha \in (0,1]$ is a teleportation rate. The node embeddings are then updated by aggregating messages based on this multi-hop attention:
$$h^{(l+1)}_i = \sum_{j \in \tilde{\mathcal{N}}(i)} \mathcal{A}^{(l)}_{i,j}\, h^{(l)}_j \qquad (5)$$
where $\tilde{\mathcal{N}}(i)$ represents the set of multi-hop neighbors of node $v_i$.
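To make (1)–(5) concrete, here is a minimal, self-contained sketch of a single diffusion-attention layer in PyTorch. It omits relation embeddings, value projections, layer normalization, and the feed-forward block of the full MAGNA model, and it truncates the infinite series in (4) at a fixed number of hops, so it should be read as an illustration of the mechanism rather than a faithful reimplementation of Wang et al. [6].

```python
import torch
import torch.nn as nn


class OneLayerDiffusionAttention(nn.Module):
    """Simplified single layer of MAGNA-style attention diffusion."""

    def __init__(self, dim, alpha=0.4, num_hops=3):
        super().__init__()
        self.W_h = nn.Linear(dim, dim, bias=False)   # projection of the attending node
        self.W_t = nn.Linear(dim, dim, bias=False)   # projection of the attended node
        self.v_a = nn.Parameter(torch.randn(dim))    # attention vector
        self.alpha = alpha                           # teleportation rate in Eq. (4)
        self.num_hops = num_hops                     # truncation of the series in Eq. (4)

    def forward(self, h, adj_mask):
        # h: (N, dim) node features; adj_mask: (N, N) bool, True where v_i may attend to v_j.
        n = h.size(0)
        # Eq. (1) without relation embeddings: score for every ordered pair (i, j).
        scores = torch.tanh(self.W_h(h).unsqueeze(1) + self.W_t(h).unsqueeze(0)) @ self.v_a
        # Eq. (2): keep scores only on existing edges.
        scores = scores.masked_fill(~adj_mask, float("-inf"))
        # Eq. (3): row-wise softmax gives the one-hop attention matrix A.
        A = torch.nan_to_num(torch.softmax(scores, dim=-1))  # isolated rows become zero
        # Eq. (4): truncated attention diffusion with PPR-style weights theta_k.
        diffusion = self.alpha * torch.eye(n)
        A_power = torch.eye(n)
        for k in range(1, self.num_hops + 1):
            A_power = A_power @ A
            diffusion = diffusion + self.alpha * (1 - self.alpha) ** k * A_power
        # Eq. (5): aggregate messages from multi-hop neighbors.
        return diffusion @ h
```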
We now show that these calculations simplify significantly when restricted to metapaths, and that the result is readily interpretable. Consider a metapath instance $P$ of length $l$ with vertices $v_0, v_1, \dots, v_l$, where we seek to learn an embedding for the source node $v_0$. The key observation is that we can treat $P$ as a directed chain $v_0 \leftarrow v_1 \leftarrow \cdots \leftarrow v_l$. This structure makes intuitive sense, as we would expect the nearest neighbor $v_1$ of $v_0$ to provide a more informative message than the faraway node $v_l$. It also makes sense that the source's own features should influence its updated embedding; this is equivalent to adding a virtual self-loop at the source. Let $a_{i,j}$ denote the attention score between $v_i$ and $v_j$ (i.e., how much $v_i$ "attends to" or "is influenced by" $v_j$). Due to the virtual self-loop, the self-attention score $a_{0,0} > 0$ is also computed. The following result follows:
Theorem 2.1.
For a metapath instance of length $l$ with vertices $v_0, v_1, \dots, v_l$ and a virtual self-loop at the source, the updated embedding of the source node produced by MAGNA (with a single layer) is given by:
$$h'_0 = \frac{\alpha}{1-(1-\alpha)\,a_{0,0}}\left(h_0 + \sum_{j=1}^{l}(1-\alpha)^{j}\Big(\prod_{i=1}^{j} a_{i-1,i}\Big)h_j\right) \qquad (6)$$
The proof is given in Appendix A. As shown in Wang et al. [6], multi-hop attention is equivalent to placing a Personalized PageRank (PPR) prior on the attention values. We thus obtain good interpretability for our metapath encodings: not only do faraway nodes contribute less to the encoding than closer ones, but information also flows in a semantically meaningful way. For instance, consider the metapath instance {Inception, Leonardo DiCaprio, Titanic}, where the goal is to obtain a good embedding of Inception for movie genre classification. Intuitively, the genre of Inception is influenced directly by the fact that Leonardo DiCaprio starred in it, and also indirectly by the movie Titanic, mediated through the intermediary of Leonardo DiCaprio himself (in other words, the semantic relationship between the two movies exists only through his shared involvement in both).
Therefore, it suffices to compute attention between consecutive pairs $(v_{i-1}, v_i)$ along the instance and the self-attention of the source. To do so, we use a single-layer version of MAGNA's algorithm, employing (1) without relation embeddings and (3) with a sigmoid instead of a softmax (a simplex constraint on the attention weights is too strong for structures with such semantic richness). We then use (6) to obtain the metapath instance encodings; a teleportation rate of $\alpha = 0.4$ seems to work well in practice.
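A minimal sketch of the resulting instance encoder is shown below: it computes sigmoid scores on consecutive pairs plus the source self-loop and applies the closed form of (6). The module name and the per-instance loop are illustrative, not our exact implementation.

```python
import torch
import torch.nn as nn


class MultiHopMetapathEncoder(nn.Module):
    """Encodes one metapath instance via the closed form of Theorem 2.1."""

    def __init__(self, dim, alpha=0.4):
        super().__init__()
        self.W_h = nn.Linear(dim, dim, bias=False)
        self.W_t = nn.Linear(dim, dim, bias=False)
        self.v_a = nn.Parameter(torch.randn(dim))
        self.alpha = alpha  # teleportation rate

    def pair_score(self, h_i, h_j):
        # Eq. (1) without relation embeddings, normalized with a sigmoid instead of Eq. (3).
        return torch.sigmoid(torch.tanh(self.W_h(h_i) + self.W_t(h_j)) @ self.v_a)

    def forward(self, feats):
        # feats: (l + 1, dim) features of v_0, ..., v_l along one metapath instance.
        a_self = self.pair_score(feats[0], feats[0])          # self-loop score a_{0,0}
        scale = self.alpha / (1 - (1 - self.alpha) * a_self)  # geometric factor in Eq. (6)
        out = feats[0].clone()
        prod = torch.tensor(1.0)
        for j in range(1, feats.size(0)):
            prod = prod * self.pair_score(feats[j - 1], feats[j])  # running chain product
            out = out + (1 - self.alpha) ** j * prod * feats[j]
        return scale * out
```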
2.3 Direct Attention Encoder
Our second encoder is a simplified version that intuitively works well for short metapaths (which are common in practice). Let $P = (v_0, v_1, \dots, v_l)$ be a metapath instance with source $v_0$. We compute the embedding by directly attending to all nodes in the metapath.
First, features are transformed using learnable matrices $W_{\mathrm{src}}$ for the source node and $W_{\mathrm{oth}}$ for all other nodes:
$$z_0 = W_{\mathrm{src}} h_0, \qquad z_j = W_{\mathrm{oth}} h_j \quad (1 \le j \le l) \qquad (7)$$
Attention scores are computed between $v_0$ and each node $v_j$ (including self-attention for $v_0$) as:
$$e_j = \operatorname{LeakyReLU}\!\left(a^{\top}\left[z_0 \,\|\, z_j\right]\right), \qquad \beta_j = \frac{\exp(e_j)}{\sum_{k=0}^{l}\exp(e_k)}, \qquad (8)$$
where $a$ is a learnable attention vector.
The final embedding for the source node is then computed as a weighted sum of all transformed node features:
$$h'_0 = \sum_{j=0}^{l} \beta_j z_j \qquad (9)$$
This encoder allows the source node to perform weighted aggregation over all node features in the metapath, at the cost of reduced interpretability for longer metapaths (where it is important that distant nodes attend to each other in a controlled manner).
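A minimal sketch of this encoder, following (7)–(9) as written above, is shown below; the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DirectAttentionEncoder(nn.Module):
    """Encodes a metapath instance by letting the source attend to every node."""

    def __init__(self, dim):
        super().__init__()
        self.W_src = nn.Linear(dim, dim, bias=False)    # Eq. (7): source projection
        self.W_oth = nn.Linear(dim, dim, bias=False)    # Eq. (7): projection for the rest
        self.att = nn.Parameter(torch.randn(2 * dim))   # Eq. (8): attention vector a

    def forward(self, feats):
        # feats: (l + 1, dim) features of v_0, ..., v_l along one metapath instance.
        z_src = self.W_src(feats[0])
        z = torch.cat([z_src.unsqueeze(0), self.W_oth(feats[1:])], dim=0)
        # Eq. (8): score every node (including the source itself) against the source.
        pairs = torch.cat([z_src.expand(z.size(0), -1), z], dim=-1)
        weights = torch.softmax(F.leaky_relu(pairs @ self.att), dim=0)
        # Eq. (9): the weighted sum of transformed features is the instance encoding.
        return (weights.unsqueeze(-1) * z).sum(dim=0)
```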
2.4 Algorithm
We now give the full algorithm for our model, HAN-ME (HAN with Metapath Encoding). We first perform a type-specific projection to ensure all node features share a common feature space, as in HAN (Wang et al. [7]). We then diverge from HAN by encoding each metapath instance using either the multi-hop or the direct attention encoder. To aggregate over metapath instances, we apply attention between the embedding of the source node and each metapath instance encoding. Inter-metapath (semantic-level) aggregation is performed as in HAN. Our loss function is binary cross-entropy with logits.
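The sketch below shows how these stages might compose for a single node: the instances of each metapath are encoded (with either encoder sketched above), attended against the source embedding, and the per-metapath results are fused with semantic-level attention. Module names are illustrative, and multi-head attention, batching, and the final classifier head are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HANMELayer(nn.Module):
    """Skeleton of one HAN-ME layer: encode instances, then aggregate twice."""

    def __init__(self, dim, instance_encoder):
        super().__init__()
        self.encoder = instance_encoder                          # multi-hop or direct encoder
        self.instance_att = nn.Parameter(torch.randn(2 * dim))   # intra-metapath attention
        self.semantic_proj = nn.Linear(dim, dim)                 # inter-metapath MLP
        self.semantic_q = nn.Parameter(torch.randn(dim))         # semantic attention vector

    def aggregate_instances(self, h_src, instance_feats):
        # Encode every instance rooted at the source node, then attend over the encodings.
        enc = torch.stack([self.encoder(f) for f in instance_feats])        # (P, dim)
        pairs = torch.cat([h_src.expand(enc.size(0), -1), enc], dim=-1)
        weights = torch.softmax(F.leaky_relu(pairs @ self.instance_att), dim=0)
        return (weights.unsqueeze(-1) * enc).sum(dim=0)                      # (dim,)

    def aggregate_semantics(self, per_metapath):
        # per_metapath: (M, dim), one embedding of the same node for each metapath.
        scores = torch.tanh(self.semantic_proj(per_metapath)) @ self.semantic_q
        beta = torch.softmax(scores, dim=0)
        return (beta.unsqueeze(-1) * per_metapath).sum(dim=0)
```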
3 Experiments
We use the multi-label IMDB dataset introduced by Lv et al. [3] for heterogeneous node classification. The dataset is one of the most common for this task and is known for its noisiness and difficulty. It comprises 21,420 nodes (4,932 movies, 2,393 directors, 6,124 actors, 7,971 keywords) with 86,642 bidirectional edges for each relation type. Only movie nodes contain (sparse) features, such as "budget," "duration," and "language." Movies are labeled with up to five genres (Drama: 2,517, Comedy: 1,837, Thriller: 1,384, Action: 1,135, Romance: 1,090), with most having one or two labels. We split the data into training (1,096 nodes, 22.22%), validation (275 nodes, 5.58%), and test (3,202 nodes, 64.92%) sets, omitting unlabeled nodes, as is standard practice with this benchmark. We perform minimal preprocessing, except for assigning features to director, actor, and keyword nodes through mean pooling of the features of connected movie nodes. As in Wang et al. [7], we employ the metapaths {Movie, Director, Movie} and {Movie, Actor, Movie}.
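As an illustration of this preprocessing step, the sketch below mean-pools movie features onto an initially featureless node type for a single relation; the function name and tensor layout are assumptions rather than our exact preprocessing code.

```python
import torch


def mean_pool_features(movie_x, edge_index, num_targets):
    """Give each featureless node the mean of its connected movies' features.

    movie_x: (num_movies, dim) movie feature matrix.
    edge_index: (2, E) long tensor of (movie_id, target_id) pairs for one relation.
    num_targets: number of nodes of the featureless type (e.g., actors).
    """
    dim = movie_x.size(1)
    summed = torch.zeros(num_targets, dim)
    counts = torch.zeros(num_targets, 1)
    # Scatter-add movie features and edge counts onto the target nodes.
    summed.index_add_(0, edge_index[1], movie_x[edge_index[0]])
    counts.index_add_(0, edge_index[1], torch.ones(edge_index.size(1), 1))
    return summed / counts.clamp(min=1)
```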
To train our model we use a learning rate of 0.005, weight decay of 0.001, a dropout rate of 0.6, 8 attention heads, and 128 hidden units as in Wang et al. [7], as well as a random seed of 483 and a patience of 100 epochs for early stopping. Furthermore, a teleportation rate of $\alpha = 0.4$ works well in practice for the multi-hop encoder.
We also implement LTS as in Wong et al. [8]. LTS ranks nodes by their loss values during training and progressively introduces them from easiest to hardest: at epoch $t$, a proportion $\lambda(t)$ of the nodes is included according to one of three pacing functions, linear ($\lambda(t) = \min\big(1,\ \lambda_0 + (1-\lambda_0)\tfrac{t}{T}\big)$), root ($\lambda(t) = \min\big(1,\ \sqrt{\lambda_0^2 + (1-\lambda_0^2)\tfrac{t}{T}}\big)$), or geometric ($\lambda(t) = \min\big(1,\ 2^{\log_2(\lambda_0)(1-t/T)}\big)$), with $\lambda_0$ being the initial proportion and $T$ the epoch at which all nodes are included. For our results we use the linear scheduler (a minimal sketch of these pacing functions is given at the end of this section). We select the model with the highest validation performance as our final model. Training performance is summarized below:
| Model | Micro F1 (Standard) | Micro F1 (With LTS) | Macro F1 (Standard) | Macro F1 (With LTS) |
|---|---|---|---|---|
| HAN-ME Multihop | | | | |
| HAN-ME Direct | | | | |
| HAN | | | | |
[Figure 1: training and validation performance curves]
We see that the multi-hop encoder yields the best test Micro F1 and the direct attention encoder yields the best test Macro F1. Notably, while both encoders perform significantly better than HAN, they perform at roughly the same level as each other. This is likely due to two factors: (1) we only use metapaths with 3 nodes, for which the direct attention encoder has good interpretability, and (2) we did not have explicit node features for non-movie nodes, which would significantly benefit the multi-hop method. Additionally, LTS (which we tested with multiple configurations) did not cause a statistically significant difference in performance. This is likely because the dataset features many very difficult nodes that require significant message passing and neighbor information to classify correctly. As seen in Figure 1, the models overfit to the training data (as expected) and suffer somewhat in generalization.
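For reference, here is a minimal sketch of the three pacing functions as written above; the helper name and example values are illustrative, and the exact formulation in Wong et al. [8] may differ in detail.

```python
import math


def lts_pace(t, lambda0, total_epochs, kind="linear"):
    """Fraction of training nodes (ranked easiest-first by loss) used at epoch t."""
    progress = t / total_epochs
    if kind == "linear":
        frac = lambda0 + (1 - lambda0) * progress
    elif kind == "root":
        frac = math.sqrt(lambda0 ** 2 + (1 - lambda0 ** 2) * progress)
    elif kind == "geometric":
        frac = 2 ** (math.log2(lambda0) * (1 - progress))
    else:
        raise ValueError(f"unknown pacing function: {kind}")
    return min(1.0, frac)


# Example: with lambda0 = 0.25 and T = 100, the linear schedule reaches 1.0 at epoch 100.
print([round(lts_pace(t, 0.25, 100), 2) for t in (0, 50, 100)])  # [0.25, 0.62, 1.0]
```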
4 Conclusion
In this paper we introduced HAN-ME, a novel framework for incorporating attention-based metapath instance encoding into heterogeneous graph neural networks. We presented two distinct encoding approaches: a multi-hop encoder that extends MAGNA's diffusion mechanism to metapaths, and a direct attention encoder. Both encoders show significant improvements over the baseline HAN model. We believe the multi-hop method, with its strong interpretability, has the potential to deliver better performance if applied to longer metapaths or if specific components, such as its activation function, were further optimized. Our experimental results provide compelling evidence for the effectiveness of HAN-ME in capturing complex heterogeneous graph structures.
References
- Fu et al. [2020] Xinyu Fu, Jiani Zhang, Ziqiao Meng, and Irwin King. MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding. In Proceedings of The Web Conference 2020, WWW ’20. ACM, April 2020. doi: 10.1145/3366423.3380297. URL http://dx.doi.org/10.1145/3366423.3380297.
- Lv et al. [2021a] Qingsong Lv, Ming Ding, Qiang Liu, Yuxiang Chen, Wenzheng Feng, Siming He, Chang Zhou, Jianguo Jiang, Yuxiao Dong, and Jie Tang. Are we really making much progress? revisiting, benchmarking, and refining heterogeneous graph neural networks, 2021a. URL https://arxiv.org/abs/2112.14936.
- Lv et al. [2021b] Qingsong Lv, Ming Ding, Qiang Liu, Yuxiang Chen, Wenzheng Feng, Siming He, Chang Zhou, Jianguo Jiang, Yuxiao Dong, and Jie Tang. Are we really making much progress? revisiting, benchmarking and refining the heterogeneous graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2021b.
- Sun et al. [2019] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space, 2019. URL https://arxiv.org/abs/1902.10197.
- Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks, 2018. URL https://arxiv.org/abs/1710.10903.
- Wang et al. [2021a] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Multi-hop attention graph neural network, 2021a. URL https://arxiv.org/abs/2009.14332.
- Wang et al. [2021b] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Peng Cui, P. Yu, and Yanfang Ye. Heterogeneous graph attention network, 2021b. URL https://arxiv.org/abs/1903.07293.
- Wong et al. [2024] Zhen Hao Wong, Hansi Yang, Xiaoyi Fu, and Quanming Yao. Loss-aware curriculum learning for heterogeneous graph neural networks, 2024. URL https://arxiv.org/abs/2402.18875.
- Yang et al. [2023] Xiaocheng Yang, Mingyu Yan, Shirui Pan, Xiaochun Ye, and Dongrui Fan. Simple and efficient heterogeneous graph neural network, 2023. URL https://arxiv.org/abs/2207.02547.
Appendix
Appendix A Proof of Theorem 2.1
We prove (6). Order the vertices of the instance as $v_0, v_1, \dots, v_l$ and let $A$ denote the one-hop attention matrix of the directed chain with the virtual self-loop at the source, i.e.,
$$A_{0,0} = a_{0,0}, \qquad A_{i,i+1} = a_{i,i+1} \ \ (0 \le i \le l-1), \qquad A_{i,j} = 0 \ \text{otherwise.}$$
Note that the chain part $C = A - a_{0,0}E_{0,0}$ is nilpotent (e.g., $C^2$ is supported only on the second superdiagonal, and $C^{l+1} = 0$), so no walk can advance more than $l$ steps along the chain. A walk of length $k$ starting at $v_0$ and ending at $v_j$ must therefore traverse the self-loop $k-j$ times and then advance $j$ steps, giving
$$\big(A^{k}\big)_{0,j} = a_{0,0}^{\,k-j}\prod_{i=1}^{j} a_{i-1,i}, \qquad 0 \le j \le \min(k, l).$$
For a metapath instance with vertices $v_0, \dots, v_l$, the entry $\mathcal{A}_{0,j}$ represents the attention contribution from node $v_j$ to $v_0$. Substituting the expression above into (4), it is computed as:
$$\mathcal{A}_{0,j} = \sum_{k \ge j} \alpha(1-\alpha)^{k}\, a_{0,0}^{\,k-j}\prod_{i=1}^{j} a_{i-1,i} = \frac{\alpha(1-\alpha)^{j}}{1-(1-\alpha)\,a_{0,0}}\prod_{i=1}^{j} a_{i-1,i},$$
where the geometric series converges because $(1-\alpha)\,a_{0,0} < 1$. The embedding for the source node is then updated by aggregating messages from all nodes in the metapath instance. Using the definition of $\mathcal{A}$ in (5), we have:
$$h'_0 = \sum_{j=0}^{l} \mathcal{A}_{0,j}\, h_j,$$
which expands to:
$$h'_0 = \frac{\alpha}{1-(1-\alpha)\,a_{0,0}}\left(h_0 + \sum_{j=1}^{l}(1-\alpha)^{j}\Big(\prod_{i=1}^{j} a_{i-1,i}\Big)h_j\right),$$
which is exactly (6). $\blacksquare$
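As a sanity check on the derivation (not part of the original experiments), the following snippet compares the closed form against a numerically truncated version of the diffusion series in (4) on a length-3 chain with arbitrary illustrative attention scores.

```python
import numpy as np

alpha = 0.4
a = {"00": 0.7, "01": 0.9, "12": 0.5, "23": 0.8}   # illustrative sigmoid scores
l = 3

# One-hop attention matrix of the chain v_0 <- v_1 <- v_2 <- v_3 with a source self-loop.
A = np.zeros((l + 1, l + 1))
A[0, 0] = a["00"]
A[0, 1], A[1, 2], A[2, 3] = a["01"], a["12"], a["23"]

# Diffusion series of Eq. (4), truncated far enough to converge numerically.
diffusion = sum(alpha * (1 - alpha) ** k * np.linalg.matrix_power(A, k) for k in range(200))

h = np.random.default_rng(0).normal(size=(l + 1, 4))   # random node features
lhs = diffusion[0] @ h                                  # source row of Eq. (5)

# Closed form of Eq. (6).
scale = alpha / (1 - (1 - alpha) * a["00"])
rhs = h[0].copy()
prod = 1.0
for j, key in enumerate(["01", "12", "23"], start=1):
    prod *= a[key]
    rhs += (1 - alpha) ** j * prod * h[j]
rhs *= scale

print(np.allclose(lhs, rhs))   # expected: True
```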