
Attention-Driven Metapath Encoding in Heterogeneous Graphs

Calder K. Katyal
Undergraduate
Yale University
New Haven, CT 06511
[email protected]
Abstract

One of the emerging techniques for node classification in heterogeneous graphs is to restrict message aggregation to pre-defined, semantically meaningful structures called metapaths. This work is the first attempt to incorporate attention into the process of encoding entire metapaths without dropping intermediate nodes. In particular, we construct two encoders: the first uses sequential attention to extend the multi-hop message passing algorithm of Wang et al. [6] to the metapath setting, and the second uses direct attention to extract semantic relations within the metapath. The model then employs the intra-metapath and inter-metapath aggregation mechanisms of Wang et al. [7]. We furthermore use the training scheduler specialized for heterogeneous graphs developed in Wong et al. [8], ensuring the model gradually learns to classify the most difficult nodes. The result is a resilient, general-purpose framework for capturing semantic structures in heterogeneous graphs. In particular, we demonstrate that our model is competitive with state-of-the-art models at node classification on the IMDB dataset, a popular benchmark introduced in Lv et al. [2].

1 Background

Since the invention of Graph Attention Networks (GATs) in Veličković et al. [5], a variety of approaches have sought to extend the attention mechanism to heterogeneous domains. The GAT fails to account for the semantic relationships introduced by varying node and edge types, so a mechanism that incorporates this information into its base architecture is required. The heterogeneous attention network (HAN) in Wang et al. [7] was the first method to incorporate semantic information into the attention mechanism by combining node-level and semantic-level attention. In particular, it employs metapaths, where a metapath \Phi_{j} is formally defined as a sequence of nodes and edges of the form V_{1}\xrightarrow{E_{1}}V_{2}\xrightarrow{E_{2}}\cdots\xrightarrow{E_{l}}V_{l+1} that represents a composite relation between nodes V_{1} and V_{l+1}. For example, in the IMDB dataset a particular metapath may take the form ['Movie', 'Actor', 'Movie']. A metapath instance is a particular realization of that pattern (for example, {Inception, Leonardo DiCaprio, Titanic}). We define the metapath-based neighbors of a node v_{i} under a metapath \Phi_{j} as the set of all nodes v_{j} that are terminal nodes of any instance of \Phi_{j} starting at v_{i}. For each v_{i} and \Phi_{j}\in\{\Phi_{1},\Phi_{2},\dots,\Phi_{P}\}, HAN first calculates node-level attention by (1) applying a node-type-specific projection so that all nodes share a feature space, and (2) aggregating the result of attending to each metapath-based neighbor over the metapath instances of \Phi_{j} (discarding non-terminal "intermediate" nodes) to obtain P groups of semantic-specific embeddings. It then calculates semantic-level attention by transforming the semantic embeddings via an MLP and measuring similarity with a learnable semantic-level attention vector \mathbf{q}. This step measures the importance of each metapath.

Addressing the loss of intermediate nodes in step (2), Fu et al. [1] incorporate intermediate nodes by employing intra-metapath aggregation. Specifically, for each v_{i} and metapath \Phi_{j}, they use a metapath instance encoder to separately transform the features of all of the nodes in each metapath instance originating at v_{i} into a single vector. Encoder options include mean, linear, and relational rotation (Sun et al. [4]). Attention is then used to create a weighted sum over all of the metapath instances of \Phi_{j} that contain v_{i}. Unlike HAN, Yang et al. [9] propose a model that completely removes node-level attention, replacing it with simple aggregation. Other methods choose not to employ metapaths at all. It is shown in Wong et al. [8] that many of these methods can be augmented by a Loss-aware Training Schedule (LTS). The scheduler first prioritizes the easiest nodes during training and gradually introduces the more difficult nodes to mitigate the impact of noisy data.

A completely different approach is found in Wang et al. [6]. The model in that paper, the Multi-hop Attention Graph Neural Network (MAGNA), uses a diffusion mechanism so that nodes multiple hops apart attend to each other within each layer. It supports the heterogeneous graph setting by incorporating relation embeddings, but it does not exploit pre-defined semantic structures such as metapaths. Extending it to do so is the main subject of our work.

2 Method

We construct two novel metapath instance encoders. The goal is to create an embedding for each instance such that the instances can be meaningfully aggregated using the simple attention-based method of HAN (Wang et al. [7]). Our codebase is public; we have also implemented the baseline HAN model for ablation in PyTorch Geometric (https://github.com/calderkatyal/CPSC483FinalProject).

2.1 Metapath Extraction

As in Fu et al. [1], we seek to incorporate intermediate nodes into our metapaths. We develop a simple but novel method to do so. In particular, we modify transforms.AddMetaPaths to store a dictionary of the node types and node IDs in each instance of each metapath. Before performing the metapath transform, which constructs edges between the terminal nodes of each metapath instance and removes intermediate nodes, we store the features of the original graph so that the features of every node along an instance remain accessible. A minimal sketch of this bookkeeping step is given below.
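The following sketch illustrates how metapath instances and the node IDs along them can be recorded before the transform is applied. It is an illustration rather than the modified transform in our repository, and the IMDB-style edge-type names in the example are assumed.

import torch
from torch_geometric.data import HeteroData

def extract_metapath_instances(data: HeteroData, metapath):
    """Enumerate all instances of `metapath` (a list of edge types) and
    record the node IDs along each instance, keyed by the source node.

    Illustrative helper, not the modified AddMetaPaths transform itself.
    """
    # Start with length-1 "paths": every edge of the first relation.
    src, dst = data[metapath[0]].edge_index
    paths = [[int(s), int(d)] for s, d in zip(src.tolist(), dst.tolist())]

    # Extend paths one relation at a time by joining on the last node.
    for edge_type in metapath[1:]:
        src, dst = data[edge_type].edge_index
        nxt = {}  # last node id -> list of possible next node ids
        for s, d in zip(src.tolist(), dst.tolist()):
            nxt.setdefault(s, []).append(d)
        paths = [p + [d] for p in paths for d in nxt.get(p[-1], [])]

    # Group instances by their source node so the encoders can look them up.
    instances = {}
    for p in paths:
        instances.setdefault(p[0], []).append(p)
    return instances

# Example with IMDB-style relations (edge-type names assumed for illustration):
# mam = extract_metapath_instances(
#     data, [('movie', 'to', 'actor'), ('actor', 'to', 'movie')])
# mam[0] -> [[0, 12, 0], [0, 12, 57], ...]  # node IDs along each M-A-M instance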

2.2 Multi-hop Encoder

Our first encoder extends the multi-hop framework MAGNA of Wang et al. [6] to metapaths, using all node features along the metapath (which we stored while extracting metapaths). Recall that MAGNA first calculates one-hop attention scores via:

s_{i,k,j}^{(l)}=\delta\left(v_{a}^{(l)}\tanh\left(W_{h}^{(l)}h_{i}^{(l)}\,\|\,W_{t}^{(l)}h_{j}^{(l)}\,\|\,W_{r}^{(l)}r_{k}\right)\right) (1)

where \delta=\text{LeakyReLU}, and W_{h}^{(l)},W_{t}^{(l)}\in\mathbb{R}^{d^{(l)}\times d^{(l)}}, W_{r}^{(l)}\in\mathbb{R}^{d^{(l)}\times d_{r}}, and v_{a}^{(l)}\in\mathbb{R}^{1\times 3d^{(l)}} are trainable weights. The attention score matrix S^{(l)} is then computed as:

S_{i,j}^{(l)}=\begin{cases}s_{i,k,j}^{(l)},&\text{if }(v_{i},r_{k},v_{j})\text{ appears in }\mathcal{G}\\ -\infty,&\text{otherwise}\end{cases} (2)

Finally, the attention matrix A^{(l)} is obtained by performing row-wise softmax over S^{(l)}:

A^{(l)}=\text{softmax}(S^{(l)}) (3)

where A_{i,j}^{(l)} denotes the attention value at layer l when aggregating messages from node j to node i. Next, MAGNA builds a multi-hop attention diffusion matrix:

\mathcal{A}=\sum_{m=0}^{\infty}\gamma(1-\gamma)^{m}\left(A^{(l-1)}\right)^{m},\quad 0<\gamma<1 (4)

The node embeddings are then updated by aggregating messages based on this multi-hop attention:

h_{i}^{(l)}=\sum_{j\in N_{i}}\mathcal{A}_{ij}h_{j}^{(l-1)} (5)

where N_{i} represents the set of multi-hop neighbors of node i. A small sketch of this diffusion step is given below.
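In practice the infinite series in (4) is truncated (or approximated iteratively). The following short sketch, our own illustration rather than MAGNA's released code, shows a K-hop truncation of (4)-(5) applied to a dense attention matrix:

import torch

def attention_diffusion(A: torch.Tensor, H: torch.Tensor,
                        gamma: float = 0.4, num_hops: int = 6) -> torch.Tensor:
    """Truncated version of Eqs. (4)-(5): H' = sum_m gamma * (1 - gamma)^m * A^m @ H.

    A: (N, N) one-hop attention matrix, H: (N, d) node features.
    Illustration only; the original model uses an iterative approximation
    rather than materializing powers of A.
    """
    out = gamma * H          # m = 0 term (A^0 = I)
    Z = H
    for m in range(1, num_hops + 1):
        Z = A @ Z            # computes A^m @ H without forming A^m explicitly
        out = out + gamma * (1 - gamma) ** m * Z
    return out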

We now show that these calculations simplify significantly when restricted to metapaths, and that the result admits a clear interpretation. Consider a metapath instance \Phi of length |\Phi|-1=k with vertices \{v_{0},v_{1},v_{2},\dots,v_{k}\}. We seek to learn an embedding (\mathbf{h}_{0})^{\prime} for the source node. The key observation is that we can think of \Phi as a directed chain v_{0}\leftarrow v_{1}\leftarrow v_{2}\leftarrow\dots\leftarrow v_{k}. This structure makes intuitive sense, as we would expect the nearest neighbor v_{1} of v_{0} to provide a more informative message than the far-away node v_{k}. It also makes sense that the feature of the source \mathbf{h}_{0} should influence the updated embedding; this is equivalent to adding a virtual self-loop at the source. Let \alpha_{ij}=A_{ij} denote the attention score (i.e., how much v_{i} "attends to" or "is influenced by" v_{j}). Due to the virtual self-loop, \mathbf{A}^{0}=\text{diag}(\alpha_{ii})_{i=0}^{k}. The following result follows:

Theorem 2.1.

For a metapath instance \Phi of length k with vertices \{v_{0},v_{1},\dots,v_{k}\} and a self-loop at the source, the updated embedding (\mathbf{h}_{0})^{\prime} for the source node produced by a single layer of MAGNA is:

\begin{aligned}
(\mathbf{h}_{0})^{\prime} &=\gamma\mathbf{h}_{0}a_{00}+\sum_{i=1}^{k}\gamma(1-\gamma)^{i}\mathbf{h}_{i}\prod_{j=1}^{i}a_{j(j-1)} \qquad (6)\\
&=\gamma\mathbf{h}_{0}a_{00}+\gamma(1-\gamma)\mathbf{h}_{1}a_{10}+\gamma(1-\gamma)^{2}\mathbf{h}_{2}(a_{21}a_{10})\\
&\quad+\gamma(1-\gamma)^{3}\mathbf{h}_{3}(a_{32}a_{21}a_{10})+\ldots+\gamma(1-\gamma)^{k}\mathbf{h}_{k}(a_{k(k-1)}\cdots a_{21}a_{10}).
\end{aligned}

The proof is given in Appendix A. As shown in Wang et al. [6], multi-hop attention is equivalent to placing a Personalized PageRank (PPR) prior on attention values. Thus our metapath encodings are readily interpretable: not only do far-away nodes contribute less to the encoding than closer nodes, but information also flows in a semantically meaningful way. For instance, consider the metapath instance {Inception, Leonardo DiCaprio, Titanic}, where the goal is to obtain a good embedding of Inception for movie genre classification. Intuitively, the genre of Inception is influenced directly by the fact that Leonardo DiCaprio starred in it and indirectly by the movie Titanic, mediated through DiCaprio himself (in other words, the semantic relationship between the two movies exists only through his shared involvement in both).

Therefore, it suffices to compute attention between all pairs (v_{i},v_{i-1})_{i=1}^{k} together with self-attention for the source. To do so, we use a one-layer version of MAGNA's algorithm, employing (1) without relation embeddings and (3) with sigmoid instead of softmax (a simplex constraint on the attention weights is too strong for structures with such semantic richness). We then use (6) to obtain the metapath instance encodings (a \gamma value of 0.4 works well in practice). A sketch of this encoder is given below.
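The following is a minimal sketch of the multi-hop encoder under the assumptions just stated (a single attention head, no relation embeddings, sigmoid-normalized scores); it is an illustration, not the exact implementation in our repository.

import torch
import torch.nn as nn

class MultiHopMetapathEncoder(nn.Module):
    """Sketch of the multi-hop metapath instance encoder (Theorem 2.1 / Eq. (6))."""

    def __init__(self, dim: int, gamma: float = 0.4):
        super().__init__()
        self.w_h = nn.Linear(dim, dim, bias=False)    # W_h in Eq. (1)
        self.w_t = nn.Linear(dim, dim, bias=False)    # W_t in Eq. (1)
        self.v_a = nn.Linear(2 * dim, 1, bias=False)  # v_a, no relation term
        self.act = nn.LeakyReLU()
        self.gamma = gamma

    def _score(self, h_recv: torch.Tensor, h_send: torch.Tensor) -> torch.Tensor:
        # Eq. (1) without relation embeddings, with sigmoid replacing the softmax of Eq. (3).
        z = torch.tanh(torch.cat([self.w_h(h_recv), self.w_t(h_send)], dim=-1))
        return torch.sigmoid(self.act(self.v_a(z)))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (k + 1, dim) projected features along one metapath instance; h[0] is the source.
        k = h.size(0) - 1
        a00 = self._score(h[0], h[0])        # virtual self-loop at the source
        out = self.gamma * a00 * h[0]
        prod = torch.ones(1, device=h.device)
        for i in range(1, k + 1):
            # Score on the chain edge between v_{i-1} and v_i; the running product
            # accumulates the factors a_{10} a_{21} ... a_{i(i-1)} of Eq. (6).
            prod = prod * self._score(h[i - 1], h[i])
            out = out + self.gamma * (1 - self.gamma) ** i * prod * h[i]
        return out

Each metapath instance stored in Section 2.1 would be passed through such a module independently, and the resulting encodings aggregated as described in Section 2.4.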

2.3 Direct Attention Encoder

Our second encoder is a simplified version that intuitively works well for short metapaths (which are common in practice). Let \Phi=\{v_{0},v_{1},v_{2},\dots,v_{k}\} be a metapath instance. We compute the embedding \mathbf{h}_{0}^{\prime} by directly attending to all nodes in the metapath.

First, features are transformed using learnable matrices \mathbf{W}_{t} for the source node v_{0} and \mathbf{W}_{h} for all other nodes:

\mathbf{h}_{0}\triangleq\mathbf{h}_{\text{raw},0}\mathbf{W}_{t},\quad\mathbf{h}_{i}\triangleq\mathbf{h}_{\text{raw},i}\mathbf{W}_{h},\quad i\in\{1,2,\dots,k\}. (7)

Attention scores \alpha_{i0} are computed between v_{0} and each node v_{i} (including self-attention for v_{0}) as:

\alpha_{i0}=\text{sigmoid}\left(\frac{\langle\mathbf{h}_{0},\mathbf{h}_{i}\rangle}{\sqrt{d}}\right),\quad i\in\{0,1,2,\dots,k\}. (8)

where d is the dimension of the transformed features. The final embedding for the source node v_{0} is then computed as a weighted sum of all transformed node features:

\mathbf{h}_{0}^{\prime}=\sum_{i=0}^{k}\alpha_{i0}\mathbf{h}_{i}. (9)

This encoder allows the source node to perform weighted aggregation over all node features in the metapath, at the cost of reduced interpretability for longer metapaths (where it is vital that distant nodes attend to each other in a controlled manner). A sketch is given below.
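A corresponding sketch of the direct attention encoder, again illustrative rather than the repository implementation, follows; it implements (7)-(9) directly.

import math
import torch
import torch.nn as nn

class DirectMetapathEncoder(nn.Module):
    """Sketch of the direct attention encoder, Eqs. (7)-(9)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_t = nn.Linear(dim, dim, bias=False)  # W_t for the source node
        self.w_h = nn.Linear(dim, dim, bias=False)  # W_h for the other nodes

    def forward(self, h_raw: torch.Tensor) -> torch.Tensor:
        # h_raw: (k + 1, dim) raw features along one instance; h_raw[0] is the source v_0.
        d = h_raw.size(-1)
        h0 = self.w_t(h_raw[0])                            # Eq. (7), source transform
        h = torch.cat([h0.unsqueeze(0), self.w_h(h_raw[1:])], dim=0)
        # Eq. (8): sigmoid of scaled dot products between the source and every node.
        alpha = torch.sigmoid(h @ h0 / math.sqrt(d))       # shape (k + 1,)
        # Eq. (9): weighted sum of the transformed features.
        return (alpha.unsqueeze(-1) * h).sum(dim=0)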

2.4 Algorithm

We now give the full algorithm for our model, HAN-ME (HAN with Metapath Encoding). We first perform a type-specific projection to ensure all node features share a common feature space, as in HAN (Wang et al. [7]). We then diverge from HAN by encoding each metapath instance using either multi-hop or direct attention. To aggregate over metapath instances, we use attention between the embedding of the source node and the metapath instance encodings. Inter-metapath aggregation is performed as in HAN; a sketch of this semantic-level step is given after Algorithm 1. Our loss function is binary cross-entropy with logits.

Algorithm 1 HAN-ME: Heterogeneous Attention Network with Metapath Encoding
1: Input:
2:       The heterogeneous graph \mathcal{G}=(\mathcal{V},\mathcal{E}); the node features \{\mathbf{h}_{i},\forall i\in\mathcal{V}\}
3:       The metapath set \{\Phi_{0},\Phi_{1},\ldots,\Phi_{P}\}; the number of attention heads K
4: Output:
5:       The final embedding \mathbf{Z}
6: for \Phi_{i}\in\{\Phi_{0},\Phi_{1},\ldots,\Phi_{P}\} do
7:     for k=1\ldots K do
8:         Type-specific transformation: \mathbf{h}_{i}\leftarrow\mathbf{M}_{\phi_{i}}\cdot\mathbf{h}_{i} \triangleright \mathbf{M}_{\phi_{i}} is the transformation matrix for node type \phi_{i}
9:         for i\in\mathcal{V} do
10:             for each metapath instance \Phi_{ij} rooted at v_{i} do
11:                 Metapath encoding: \mathbf{h}_{ij}\leftarrow\text{Encoder}(\Phi_{ij}) \triangleright Multi-hop or direct attention
12:                 Calculate importance: e_{ij}^{\Phi}=\text{att}_{\text{node}}(\mathbf{h}_{i},\mathbf{h}_{ij};\Phi_{ij}) \triangleright Node-level attention score
13:                 Normalize weights: \alpha_{ij}^{\Phi}=\frac{\exp(\sigma(\mathbf{a}_{\Phi}^{\top}\cdot[\mathbf{h}_{i}\|\mathbf{h}_{ij}]))}{\sum_{k\in\mathcal{N}_{i}^{\Phi}}\exp(\sigma(\mathbf{a}_{\Phi}^{\top}\cdot[\mathbf{h}_{i}\|\mathbf{h}_{ik}]))} \triangleright \mathbf{a}_{\Phi} is the attention vector
14:             end for
15:             Calculate semantic-specific node embedding:
16:             \mathbf{z}_{i}^{\Phi}\leftarrow\sigma(\sum_{j\in\mathcal{N}_{i}^{\Phi}}\alpha_{ij}^{\Phi}\cdot\mathbf{h}_{ij}) \triangleright Aggregate metapath instance encodings
17:         end for
18:         Concatenate learned embeddings from all attention heads:
19:         \mathbf{z}_{i}^{\Phi}\leftarrow\|_{k=1}^{K}\sigma(\sum_{j\in\mathcal{N}_{i}^{\Phi}}\alpha_{ij}^{\Phi}\cdot\mathbf{h}_{ij})
20:     end for
21:     Calculate metapath importance:
22:     w_{\Phi_{i}}=\frac{1}{|\mathcal{V}|}\sum_{i\in\mathcal{V}}\mathbf{q}^{\top}\cdot\tanh(\mathbf{W}\cdot\mathbf{z}_{i}^{\Phi}+\mathbf{b}) \triangleright \mathbf{q} is the semantic attention vector
23:     Normalize metapath weights:
24:     \beta_{\Phi_{i}}=\frac{\exp(w_{\Phi_{i}})}{\sum_{i=1}^{P}\exp(w_{\Phi_{i}})} \triangleright Softmax over metapaths
25:     Fuse the semantic-specific embeddings:
26:     \mathbf{Z}\leftarrow\sum_{i=1}^{P}\beta_{\Phi_{i}}\cdot\mathbf{Z}_{\Phi_{i}} \triangleright Weighted combination of metapaths
27: end for
28: Calculate binary cross-entropy with logits:
29: L=-\sum_{l\in\mathcal{Y}_{L}}\left[\mathbf{Y}_{l}\log(\sigma(\mathbf{Z}_{l}))+(1-\mathbf{Y}_{l})\log(1-\sigma(\mathbf{Z}_{l}))\right] \triangleright Optionally use LTS to select the lowest-loss nodes
30: Back-propagate and update the parameters of HAN-ME
31: return \mathbf{Z}
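For concreteness, the semantic-level (inter-metapath) aggregation of lines 21-26 can be sketched as follows. This mirrors HAN's semantic attention; it is an illustration rather than our exact implementation, and the hidden size is illustrative.

import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Sketch of inter-metapath aggregation (lines 21-26 of Algorithm 1)."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.proj = nn.Linear(dim, hidden)           # W and b in line 22
        self.q = nn.Parameter(torch.randn(hidden))   # semantic attention vector q

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (P, N, dim) semantic-specific embeddings, one slice per metapath.
        w = (torch.tanh(self.proj(z)) @ self.q).mean(dim=1)  # (P,) metapath importance w_Phi
        beta = torch.softmax(w, dim=0)                       # normalized weights beta_Phi
        return (beta.view(-1, 1, 1) * z).sum(dim=0)          # fused embedding Z: (N, dim)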

3 Experiments

We use the multi-label IMDB dataset introduced by Lv et al. [3] for heterogeneous node classification. The dataset is one of the most common for this task and is known for its noisiness and difficulty. The dataset comprises 21,420 nodes (4,932 movies, 2,393 directors, 6,124 actors, 7,971 keywords) with 86,642 bidirectional edges for each relation type. Only movie nodes contain sparse features, such as "budget," "duration," and "language." They are labeled with up to five genres (Drama: 2,517, Comedy: 1,837, Thriller: 1,384, Action: 1,135, Romance: 1,090), with most having one or two labels. We split the data into training (1,096 nodes, 22.22%), validation (275 nodes, 5.58%), and test (3,202 nodes, 64.92%) sets, omitting unlabeled nodes as is standard practice with this benchmark. We perform minimal preprocessing except for assigning features to director, actor, and keyword nodes through mean pooling of the connected node features. As in Wang et al. [7], we employ the metapaths {Movie, Director, Movie} and {Movie, Actor, Movie}.

To train our model we use a learning rate of 0.005, weight decay of 0.001, a dropout rate of 0.6, 8 attention heads, and 128 hidden units as in Wang et al. [7], as well as a random seed of 483 and a patience of 100 epochs for early stopping. As noted above, \gamma=0.4 works well in practice as the teleportation rate in the multi-hop encoder.

We also implement LTS as in Wong et al. [8]. LTS ranks nodes by their loss values during training and progressively introduces them from easiest to hardest: at epoch t, a proportion \lambda_{t} of nodes is included according to one of three pacing functions: linear (\lambda_{t}=\lambda_{0}+(1-\lambda_{0})\frac{t}{T}), root (\lambda_{t}=\sqrt{\lambda_{0}^{2}+(1-\lambda_{0}^{2})\frac{t}{T}}), or geometric (\lambda_{t}=2^{\log_{2}(\lambda_{0})-\log_{2}(\lambda_{0})\frac{t}{T}}), where \lambda_{0} is the initial proportion and T is the epoch at which all nodes are included; a sketch of these pacing functions is given below. For our results we use \lambda_{0}=0.1 and the linear scheduler. We select the model with the highest validation performance as our final model. Training performance is summarized in Table 1.
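The three pacing functions can be written compactly as follows (our own sketch of the schedules quoted above):

import math

def pacing(t: int, T: int, lambda0: float = 0.1, kind: str = "linear") -> float:
    """Fraction of the (easiest) training nodes included at epoch t under LTS pacing."""
    frac = min(t / T, 1.0)
    if kind == "linear":
        lam = lambda0 + (1 - lambda0) * frac
    elif kind == "root":
        lam = math.sqrt(lambda0 ** 2 + (1 - lambda0 ** 2) * frac)
    elif kind == "geometric":
        lam = 2 ** (math.log2(lambda0) - math.log2(lambda0) * frac)
    else:
        raise ValueError(kind)
    return min(lam, 1.0)

# At each epoch, nodes are ranked by their current loss and the ceil(lambda_t * N)
# lowest-loss (easiest) nodes are kept in the training objective.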

Table 1: Performance of HAN-ME Multihop and HAN-ME Direct with the baseline HAN (* marks the best overall value for each metric)

Model              Micro F1 (Standard)   Micro F1 (With LTS)   Macro F1 (Standard)   Macro F1 (With LTS)
HAN-ME Multihop    0.6801 ± 0.0001*      0.6737 ± 0.0005       0.6353 ± 0.0008       0.6222 ± 0.0005
HAN-ME Direct      0.6767 ± 0.0006       0.6769 ± 0.0003       0.6418 ± 0.0007*      0.6380 ± 0.0004
HAN                0.6426 ± 0.0006       0.6427 ± 0.0007       0.5949 ± 0.0010       0.5966 ± 0.0009
Figure 1: Micro F1 and Macro F1 on Training and Validation Sets

We see that the multi-hop encoder yields the best test Micro F1 and the direct attention encoder yields the best test Macro F1. Notably, while both perform significantly better than HAN, they perform at around the same level as each other. This is likely due to two factors: (1) we only use metapaths with three nodes, for which the direct attention encoder already has good interpretability, and (2) we did not have explicit node features for non-movie nodes, which would significantly improve the multi-hop method. Additionally, LTS (which we tested with multiple configurations) did not cause a statistically significant difference in performance. This is likely because the dataset features many very difficult nodes that require significant message passing and neighbor information to classify correctly. As seen in Figure 1, the models overfit to the training data (as expected) and suffer somewhat in generalization.

4 Conclusion

In this paper we introduced HAN-ME, a novel framework for incorporating attention-based metapath instance encoding into heterogeneous graph neural networks. We presented two distinct encoding approaches: a multi-hop encoder that extends MAGNA’s diffusion mechanism to metapaths and a direct attention encoder. Both encoders show significant improvements over the baseline HAN model. We believe the multihop method, with its strong interpretability, has the potential to deliver better performance if applied to longer metapaths or if specific components, such as its activation function, were optimized. Our experimental results provide compelling evidence for the effectiveness of HAN-ME in capturing complex heterogeneous graph structures.

References

  • Fu et al. [2020] Xinyu Fu, Jiani Zhang, Ziqiao Meng, and Irwin King. MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding. In Proceedings of The Web Conference 2020, WWW ’20. ACM, April 2020. doi: 10.1145/3366423.3380297. URL http://dx.doi.org/10.1145/3366423.3380297.
  • Lv et al. [2021a] Qingsong Lv, Ming Ding, Qiang Liu, Yuxiang Chen, Wenzheng Feng, Siming He, Chang Zhou, Jianguo Jiang, Yuxiao Dong, and Jie Tang. Are we really making much progress? revisiting, benchmarking, and refining heterogeneous graph neural networks, 2021a. URL https://arxiv.org/abs/2112.14936.
  • Lv et al. [2021b] Qingsong Lv, Ming Ding, Qiang Liu, Yuxiang Chen, Wenzheng Feng, Siming He, Chang Zhou, Jianguo Jiang, Yuxiao Dong, and Jie Tang. Are we really making much progress? revisiting, benchmarking and refining the heterogeneous graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2021b.
  • Sun et al. [2019] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space, 2019. URL https://arxiv.org/abs/1902.10197.
  • Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks, 2018. URL https://arxiv.org/abs/1710.10903.
  • Wang et al. [2021a] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Multi-hop attention graph neural network, 2021a. URL https://arxiv.org/abs/2009.14332.
  • Wang et al. [2021b] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Peng Cui, P. Yu, and Yanfang Ye. Heterogeneous graph attention network, 2021b. URL https://arxiv.org/abs/1903.07293.
  • Wong et al. [2024] Zhen Hao Wong, Hansi Yang, Xiaoyi Fu, and Quanming Yao. Loss-aware curriculum learning for heterogeneous graph neural networks, 2024. URL https://arxiv.org/abs/2402.18875.
  • Yang et al. [2023] Xiaocheng Yang, Mingyu Yan, Shirui Pan, Xiaochun Ye, and Dongrui Fan. Simple and efficient heterogeneous graph neural network, 2023. URL https://arxiv.org/abs/2207.02547.

Appendix

Appendix A Proof of Theorem 2.1

We prove (6). Due to the virtual self-loop, we have \mathbf{A}^{0}=\text{diag}(\alpha_{ii})_{i=0}^{k}. Then:

\mathbf{A}^{1}=\begin{bmatrix}0&\alpha_{01}&0&\cdots&0&0\\ 0&0&\alpha_{12}&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&\alpha_{(k-2)(k-1)}&0\\ 0&0&0&\cdots&0&\alpha_{(k-1)k}\\ 0&0&0&\cdots&0&0\end{bmatrix},
\mathbf{A}^{2}=\begin{bmatrix}0&0&\alpha_{01}\alpha_{12}&\cdots&0&0\\ 0&0&0&\ddots&0&0\\ \vdots&\vdots&\vdots&\ddots&\alpha_{(k-3)(k-2)}\alpha_{(k-2)(k-1)}&0\\ 0&0&0&\cdots&0&\alpha_{(k-2)(k-1)}\alpha_{(k-1)k}\\ 0&0&0&\cdots&0&0\\ 0&0&0&\cdots&0&0\end{bmatrix},\quad\ldots,
\mathbf{A}^{k}=\begin{bmatrix}0&0&0&\cdots&0&\alpha_{01}\alpha_{12}\cdots\alpha_{(k-1)k}\\ 0&0&0&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&0&0\\ 0&0&0&\cdots&0&0\end{bmatrix}

Note that \mathbf{A} (excluding the virtual self-loop handled by \mathbf{A}^{0}) is nilpotent, i.e., \mathbf{A}^{p}=\mathbf{0} for all p>k. Therefore (4) becomes

\mathcal{A}=\sum_{m=0}^{\infty}\gamma(1-\gamma)^{m}\mathbf{A}^{m}=\sum_{m=0}^{k}\gamma(1-\gamma)^{m}\mathbf{A}^{m}.

For a metapath instance \Phi with vertices \{v_{0},v_{1},v_{2},\dots,v_{k}\}, the entry (\mathcal{A})_{0j} represents the attention contribution from node v_{j} to v_{0}. For j\geq 1 it is given by:

(\mathcal{A})_{0j}=\gamma(1-\gamma)^{j}\prod_{t=1}^{j}a_{t(t-1)},\qquad\text{with }(\mathcal{A})_{00}=\gamma a_{00}.

The embedding \mathbf{h}_{0}^{\prime} for the source node is then updated by aggregating messages from all nodes v_{j} in the metapath instance. Using the definition of \mathcal{A}, we have:

\mathbf{h}_{0}^{\prime}=\sum_{j=0}^{k}(\mathcal{A})_{0j}\mathbf{h}_{j},

which expands to:

\begin{aligned}
(\mathbf{h}_{0})^{\prime} &=\gamma\mathbf{h}_{0}a_{00}+\sum_{i=1}^{k}\gamma(1-\gamma)^{i}\mathbf{h}_{i}\prod_{j=1}^{i}a_{j(j-1)}\\
&=\gamma\mathbf{h}_{0}a_{00}+\gamma(1-\gamma)\mathbf{h}_{1}a_{10}+\gamma(1-\gamma)^{2}\mathbf{h}_{2}(a_{21}a_{10})\\
&\quad+\gamma(1-\gamma)^{3}\mathbf{h}_{3}(a_{32}a_{21}a_{10})+\ldots+\gamma(1-\gamma)^{k}\mathbf{h}_{k}(a_{k(k-1)}\cdots a_{21}a_{10}).
\end{aligned}
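As a sanity check, the following snippet (our own illustration) numerically verifies the closed form of Theorem 2.1 against the truncated diffusion sum for a random chain, where a[0] plays the role of a_{00} and a[i] (for i >= 1) is the attention score on the chain edge between v_{i-1} and v_i:

import torch

torch.manual_seed(0)
k, d, gamma = 4, 8, 0.4
H = torch.randn(k + 1, d)        # features h_0 ... h_k along the chain
a = torch.rand(k + 1)            # a[0] = a_00; a[i] = score on the edge between v_{i-1} and v_i

# One-hop attention matrix of the chain with a virtual self-loop at the source.
A = torch.zeros(k + 1, k + 1)
A[0, 0] = a[0]
for i in range(1, k + 1):
    A[i - 1, i] = a[i]

# Diffusion of Eq. (4) restricted to the chain: the series terminates after k hops.
N = A.clone(); N[0, 0] = 0.0                      # nilpotent (off-diagonal) part
diff = gamma * torch.diag(torch.diag(A))          # m = 0 term, keeping only the self-loop
P = torch.eye(k + 1)
for m in range(1, k + 1):
    P = P @ N
    diff = diff + gamma * (1 - gamma) ** m * P
lhs = (diff @ H)[0]

# Closed form of Theorem 2.1 / Eq. (6).
rhs = gamma * a[0] * H[0]
prod = 1.0
for i in range(1, k + 1):
    prod = prod * a[i]
    rhs = rhs + gamma * (1 - gamma) ** i * prod * H[i]

assert torch.allclose(lhs, rhs)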