
Layer-stacked Attention for Heterogeneous Network Embedding

Nhat Tran, Jean Gao
Abstract

The heterogeneous network is a robust data abstraction that can model entities of different types interacting in various ways. Such heterogeneity brings rich semantic information but presents nontrivial challenges in aggregating the heterogeneous relationships between objects, especially those of higher-order indirect relations. Recent graph neural network approaches for representation learning on heterogeneous networks typically employ the attention mechanism, which is often only optimized for predictions based on direct links. Furthermore, even though most deep learning methods can aggregate higher-order information by building deeper models, such a scheme can diminish the degree of interpretability. To overcome these challenges, we explore an architecture, Layer-stacked ATTention Embedding (LATTE), that automatically decomposes higher-order meta relations at each layer to extract the relevant heterogeneous neighborhood structures for each node. Additionally, by successively stacking layer representations, the learned node embedding offers a more interpretable aggregation scheme for nodes of different types at different neighborhood ranges. We conducted experiments on several benchmark heterogeneous network datasets. In both transductive and inductive node classification tasks, LATTE achieves state-of-the-art performance compared to existing approaches, all while offering a lightweight model. With extensive experimental analyses and visualizations, we demonstrate that the framework can extract informative insights on heterogeneous networks.

Introduction

Heterogeneous networks have been commonly used to model complex systems where there are multiple types of relationships among objects of different types. Such a rich semantic structure brings ripe graph mining opportunities for various real-world systems, including knowledge bases, academic networks, social networks, bioinformatic interactomes, and other multimodal abstractions. Recently, a significant line of research has been explored for representation learning of heterogeneous networks (Dong et al. 2020). The basic principle behind these dimensionality-reduction approaches is to aggregate the high-dimensional information about a node's heterogeneous neighborhood into an embedding vector representation. These node embeddings can then aid in downstream machine learning tasks such as node classification, clustering, and link prediction.

Among the most effective approaches for representation learning on networks, graph neural network (GNN) methods have seen a dramatic increase in popularity in recent years (Kipf and Welling 2016; Hamilton, Ying, and Leskovec 2017; Veličković et al. 2018). While these powerful methods were designed for homogeneous networks, one can apply them to heterogeneous networks by ignoring the link/node type distinction and assuming the network structure to be homogeneous. However, this would be suboptimal, as it has been shown that neglecting the structural dependencies between relations by combining the multi-relations into a single network omits important topological properties of the system (Battiston, Nicosia, and Latora 2014). Therefore, the primary challenges for heterogeneous network embedding are maintaining the semantic information and aggregating the multi-relations for the respective node types.

There have been several attempts to adapt GNNs to learn multi-relational networks (Schlichtkrull et al. 2018; Zhang et al. 2019). More recently, several GNN models designed for heterogeneous networks have introduced the attention mechanism for more interpretable aggregation of heterogeneous structures (Wang et al. 2019; Yun et al. 2019; Hu et al. 2020). However, these approaches for heterogeneous networks face at least one of the following issues. First, some of them are only fitted to aggregate the multi-relations for a single primary node type; thus, they may require a manual design of meta paths. Second, they only optimize for prediction between directly interacting nodes, which is insufficient to capture the heterogeneous network's global properties and higher-order structures. Third, although these GNNs with multiple hidden layers can flexibly propagate high-order information across layers, they do not explicitly preserve the semantics of higher-order meta relations. These shortcomings can affect the model's scalability, hinder its generalizability for inductive predictions, or limit the interpretability of the model.

In consideration of these limitations and challenges, we aim to design an approach for heterogeneous GNNs to extract higher-order structures by leveraging the semantic information of all relations and node types. To handle heterogeneity in the network, we introduce a relation-specific attention mechanism, i.e., depending on the types and direction of a link. As each node type is involved in a subset of all relations, only the relevant relations are aggregated. The mechanism can then capture individual node heterogeneity, where each node is allowed to selectively determine which of its relation-specific neighborhoods contain a more prevalent signal.

To generate higher-order meta path connections between nodes of different types, we propose a novel scheme that combines transitive meta relations at each layer successively. As a result, all meta relation sequences of arbitrary length can be enumerated while retaining their semantic context. This process allows the model to distill the unique global structure of each node type by decomposing its heterogeneous neighborhoods at different ranges. With a combination of the mechanisms proposed, our approach can generate higher-order meta paths and infer the most effective meta paths for prediction without requiring domain knowledge while maintaining the interpretability of the process.

Our main contributions with the proposed Layer-stacked ATTention Embedding (LATTE) method for heterogeneous networks are as follows. To the best of our knowledge, we are the first to introduce a GNN architecture that can successively stack hidden layers to generate higher-order meta relations specifically for each node type. Secondly, we utilize an aggregation technique that only combines the specific meta relations relevant to each node type. Lastly, we adopt a Noise Contrastive Estimation loss function to preserve the higher-order proximities of weighted heterogeneous links, which improves performance for inductive node classification.

Related Work

Graph Neural Networks

In recent years, many classes of GNN methods have been developed for a variety of heterogeneous network types (Schlichtkrull et al. 2018; Zhang et al. 2019; Wang et al. 2019; Zhou et al. 2019; Hu et al. 2020). Although these types of GNNs are flexible for end-to-end supervised prediction tasks, they only optimize for predictions between direct interactions. Compared to conventional network embedding methods (Grover and Leskovec 2016; Tang et al. 2015), standard GNNs generally do not take advantage of second-order relationships between indirect neighboring nodes. Recently, Huang et al. (2020) applied a fusion technique to combine first-order and second-order embeddings at alternating steps. Additionally, the Jumping Knowledge architecture (Xu et al. 2018) and GraphSAGE (sampling and aggregation) (Hamilton, Ying, and Leskovec 2017) have been proposed to extend neighborhood ranges; however, there has yet to be an extension of such techniques to heterogeneous networks.

Notably, GTN (Yun et al. 2019) was recently proposed to enable learning on higher-order meta paths in heterogeneous networks. It proposes a mechanism that soft-selects a convex combination of meta path layers using attention weights, then applies multiplication of adjacency matrices successively to reveal arbitrary-length transitive meta paths. This mechanism is unique in that it can infer attention weights not only on the given relations but also on higher-order relations generated by deeper layers, a feature that most existing GNN methods neglect. GTN has a few limitations: it necessarily assumes that the feature distributions and representation spaces of different node and link types are the same, and it cannot weigh the importance of each meta path separately for each node type. Additionally, GTN can be computationally expensive, since it requires computations involving the adjacency structure of all node types at once.

Multiplex Network Embedding

It is worth mentioning the approaches designed for a subclass of the heterogeneous network, the multiplex network. Many of the current multiplex or multi-view network embedding methods (Fu, Lee, and Lei 2017; Matsuno and Murata 2018; Qu et al. 2017; Shi et al. 2018) have proposed strategies for aggregating the learned embeddings of multiple network "layers" into a single unified embedding. This class of methods typically specifies separate objectives for each of the layers to estimate the node features independently, then applies another objective to aggregate the information from all layers together.

Another paradigm is to use random walks over meta paths to model heterogeneous structures, as proposed in (Perozzi, Al-Rfou, and Skiena 2014; Dong, Chawla, and Swami 2017; Fu, Lee, and Lei 2017). This class of approaches can learn network representations without supervised training for a specific task. However, they only learn representations for the primary node type, which consequently requires a customized design of meta paths. Also, they can be sensitive to the random walk's hyper-parameter settings, which may introduce unwanted biases or incur high computational cost, and can thus lead to lacking performance. Another class of algorithms utilizes embedding translations and can also be applied to embed heterogeneous networks. For instance, (Bordes et al. 2013) learned linear transformations for each relation to model semantic relationships between entities. While embedding translations can effectively model heterogeneous networks, they are mainly fitted for link prediction tasks.

Method

Preliminary

We consider a heterogeneous network as a complex system involving multiple types of links between nodes of various types. To effectively represent the complex structure of the system, it is important to define separate adjacency matrices to distinguish the nature of relationships. In this section, we define coherent notations to study the class of attributed heterogeneous networks.

Definition 3.1: Attributed Heterogeneous Network

An attributed heterogeneous network is defined as a graph $G=(\mathcal{V},\mathcal{E},\mathcal{T})$ in which each node $i\in\mathcal{V}$ and each link $e_{ij}\in\mathcal{E}$ are associated with the mapping functions $\phi(i):\mathcal{V}\rightarrow\mathcal{T_{V}}$ and $\phi(e_{ij}):\mathcal{E}\rightarrow\mathcal{T_{E}}$. $\mathcal{T_{V}}$ and $\mathcal{T_{E}}$ denote the sets of object and relation types, where $|\mathcal{T_{V}}|+|\mathcal{T_{E}}|>2$. In the case of an attributed heterogeneous network, the node feature representation is given by $\Phi(i)=\mathbf{x}_{i}\in\mathbb{R}^{D_{m}}$, which maps node $i$ of node type $m\in\mathcal{T_{V}}$ to its corresponding feature vector $\mathbf{x}_{i}$ of dimension $D_{m}$.

We represent the heterogeneous link types as a set of biadjacency matrices $\mathcal{A}=\{\mathbf{A}^{(m,n)}\mid\exists m,n\in\mathcal{T_{V}}\}$ where $|\mathcal{A}|=|\mathcal{T_{E}}|$. Each meta relation $(m,n)$ specifies a link type between source node type $m$ and target node type $n$, such that $\mathbf{A}^{(m,n)}=\{a^{(m,n)}_{ij}\mid i\in\mathcal{V}_{m},j\in\mathcal{V}_{n}\}$. The biadjacency matrix may consist of weighted links, where $a^{(m,n)}_{ij}>0$ if a link exists and $a^{(m,n)}_{ij}=0$ otherwise, indicating the absence of evidence for an interaction. For an $\mathbf{A}^{(m,n)}$ subnetwork, we define node $i$'s neighbor set as $\mathcal{N}^{(m,n)}_{i}=\{j\mid\forall j\in\mathcal{V}_{n}\ \text{s.t.}\ a^{(m,n)}_{ij}>0\}$. Note that $\mathbf{A}^{(m,n)}\in\mathbb{R}^{|\mathcal{V}_{m}|\times|\mathcal{V}_{n}|}$ is generally non-square and thus has no diagonal. Furthermore, this definition assumes relations of directed links, but for a relation $\mathbf{A}^{(m,n)}$ with inherently undirected links, we may inject a reverse relation $\mathbf{A}^{(n,m)}=\{a_{ji}\mid\forall a_{ij}\in\mathbf{A}^{(m,n)}\}$ into the set $\mathcal{A}$.
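
To make the notation concrete, the following is a minimal Python sketch (our own illustration, not the authors' released code) of how an attributed heterogeneous network could be stored: one feature matrix $\mathbf{X}_{m}$ per node type and one biadjacency matrix $\mathbf{A}^{(m,n)}$ per meta relation, with reverse relations injected for undirected links. The node types, sizes, and densities below are placeholder assumptions.

```python
# A minimal sketch (our illustration, not the authors' released code) of the structures
# in Definition 3.1: per-node-type feature matrices X_m and per-relation biadjacency
# matrices A^{(m,n)}. Node types, sizes, and densities are arbitrary placeholders.
import numpy as np
import scipy.sparse as sp

num_nodes = {"paper": 5, "author": 3, "term": 4}    # |V_m| for each node type m
feat_dim  = {"paper": 8, "author": 6, "term": 10}   # D_m for each node type m

# X_m in R^{|V_m| x D_m}: one feature matrix per node type
X = {m: np.random.rand(n, feat_dim[m]) for m, n in num_nodes.items()}

# A^{(m,n)} in R^{|V_m| x |V_n|}: one (possibly weighted) biadjacency matrix per relation
A = {
    ("paper", "author"): sp.random(5, 3, density=0.4, format="csr"),
    ("paper", "term"):   sp.random(5, 4, density=0.4, format="csr"),
}

# Inject a reverse relation for every inherently undirected relation, as in the text
for (m, n) in list(A.keys()):
    A[(n, m)] = A[(m, n)].T.tocsr()

def neighbors(A, rel, i):
    """N_i^{(m,n)}: indices j with a_ij > 0 in the (m, n) biadjacency matrix."""
    return A[rel].getrow(i).indices

print(neighbors(A, ("paper", "author"), 0))
```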

LATTE-1: First-order Heterogeneous Network Embedding

In this section, we start by describing the attention-based layers used in the LATTE heterogeneous network embedding architecture. The attention mechanism utilized in our method closely follows GAT (Veličković et al. 2018) but is extended to infer higher-order link proximity scores for nodes and links of heterogeneous types. We also introduce the layer building blocks where each layer has the roles of inferring node embeddings from heterogeneous node content and preserving higher-order link proximities.

The input to our model is the set of heterogeneous adjacency matrices $\mathcal{A}$ and the heterogeneous node features $\mathcal{X}=\{\mathbf{X}_{m}\mid\exists m\in\mathcal{T_{V}}\}$, where $\mathbf{X}_{m}=\{\mathbf{x}_{i}\in\mathbb{R}^{D_{m}}\mid\forall i\in\mathcal{V}_{m}\}$. At each $t^{\text{th}}$ layer, we define the node embedding output $\mathbf{h}^{t}\in\mathbb{R}^{|\mathcal{V}|\times F}$, where $F$ is the embedding dimension, as:

$$\mathbf{h}^{t}=f_{t}(\mathbf{h}^{t-1},\mathcal{A}^{t})$$

where $\mathbf{h}_{i}^{0}=\mathbf{x}_{i}$, and $\mathcal{A}^{t}$ is the set of heterogeneous adjacency matrices for the $t^{\text{th}}$-order relations. In the next section, we describe the operations involved when $t=1$.

Heterogeneous First-order Proximities

The first-order proximity refers to direct links between any two nodes in the network among the heterogeneous relations in $\mathcal{A}$. In order to model the different distribution of links in each relation type $\mathbf{A}^{(m,n)}\in\mathcal{A}$, we utilize a node-level attentional kernel depending on the type of the relation. Additionally, to sufficiently encode node features into higher-level features, each node type $m$ requires a separate linear transformation applied to every node in $\mathcal{V}_{m}$. Given any node $i$ of type $m$ and node $j$ of type $n$, the respective kernel parameter $\mathbf{q}^{1}_{(m,n)}$ is utilized to compute the scoring mechanism:

$$e^{1}_{ij}={\mathbf{q}^{1}_{(m,n)}}^{\top}[\mathbf{U}^{1}_{m}\mathbf{x}_{i}\,||\,\mathbf{V}^{1}_{n}\mathbf{x}_{j}] \qquad (1)$$

where $\cdot^{\top}$ denotes transposition and $||$ is the concatenation operation. We utilize two weight matrices $\mathbf{U}^{1}_{m}\in\mathbb{R}^{F\times D_{m}}$ and $\mathbf{V}^{1}_{n}\in\mathbb{R}^{F\times D_{n}}$ to obtain the "source" context and the "target" context, respectively, for a pair of nodes depending on the node types and the direction of the link. Note that the attention-based proximity score $e^{1}_{ij}$ is asymmetric, hence capable of modeling directed relationships where $a_{ij}\neq a_{ji}$.

Inferring Node-level Attention Coefficients

Next, our goal is to infer the importance of each neighbor node in the neighborhood around node $i$ for a given relation. Similar to GAT, we compute masked attention on existing links, such that $e^{1}_{ij}$ is only computed for first-order neighbor nodes $j\in\mathcal{N}^{(m,n)}_{i}$. The attention coefficients are computed by softmax normalization of the scores across all $j$, as:

$$\alpha^{(m,n)}_{ij}=\frac{\exp(\tau^{(m,n)}\,e^{1}_{ij})}{\sum_{j^{\prime}\in\mathcal{N}^{(m,n)}_{i}}\exp(\tau^{(m,n)}\,e^{1}_{ij^{\prime}})} \qquad (2)$$

where $\tau^{(m,n)}$ is a learnable "temperature" variable initialized at 1 that has the role of "sharpening" the attention scores (Chorowski et al. 2015) across the link distribution in the $(m,n)$ relation. It is expected that $\tau^{(m,n)}>1$ when the particular link distribution is dense or noisy; thus, integrating this technique allows the attention mechanism to focus on fewer neighbors. Once obtained, the normalized attention coefficients are used to compute a node's feature distribution as a linear combination of its neighbors for each relation.
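
The score of Eq. (1) and the temperature-sharpened softmax of Eq. (2) can be sketched for a single relation as follows. This is a hedged, loop-based illustration under assumed shapes and a toy edge list, not the paper's implementation; a practical version would use a scatter-based softmax over edge lists.

```python
# A hedged PyTorch sketch of Eq. (1)-(2) for one relation (m, n); shapes, the toy edge
# list, and the loop-based softmax are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

F_dim, D_m, D_n = 16, 8, 6                       # embedding dim F, feature dims D_m, D_n
U_m = torch.nn.Linear(D_m, F_dim, bias=False)    # "source" transform for node type m
V_n = torch.nn.Linear(D_n, F_dim, bias=False)    # "target" transform for node type n
q   = torch.nn.Parameter(torch.randn(2 * F_dim)) # attention kernel q^1_{(m,n)}
tau = torch.nn.Parameter(torch.ones(1))          # learnable temperature, initialized at 1

x_src = torch.rand(5, D_m)                       # features of V_m nodes
x_dst = torch.rand(3, D_n)                       # features of V_n nodes
edge_index = torch.tensor([[0, 0, 2, 4],         # source nodes i (type m)
                           [1, 2, 0, 2]])        # target nodes j (type n)
src, dst = edge_index

# Eq. (1): e_ij = q^T [U_m x_i || V_n x_j], computed only on existing (masked) links
e = torch.cat([U_m(x_src)[src], V_n(x_dst)[dst]], dim=-1) @ q

# Eq. (2): temperature-sharpened softmax over each node i's neighborhood N_i^{(m,n)}
alpha = torch.zeros_like(e)
for i in src.unique():
    mask = src == i
    alpha[mask] = F.softmax(tau * e[mask], dim=0)
print(alpha)
```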

Inferring Relation Weighing Coefficients

Since a node type $m$ is assumed to be involved in multiple types of relations, we must aggregate the relation-specific representations for each node. Previous works (Wang et al. 2019; Yun et al. 2019) have proposed to measure the importance of each relation type by a set of semantic-level attention coefficients shared by all nodes. Instead, our method assigns the relation attention coefficients differently for each node $i$, which enables the capacity to capture individual node heterogeneity in the network. First, we denote $\mathcal{A}_{(m\shortrightarrow)}\subset\mathcal{A}$ as the subset of meta relations with source type $m$. Since the number of relations involved in each node type can be different, each node of type $m$ only needs to soft-select from the subset of relevant relations. We utilize a linear transformation directly on node features to predict a normalized coefficient vector of size $|\mathcal{A}_{(m\shortrightarrow)}|+1$ that soft-selects among the set of associated relations $\mathcal{A}_{(m\shortrightarrow)}$ or the node itself. This operation is computed by:

$$\boldsymbol{\beta}^{1,i}=\mathrm{softmax}(\mathbf{W}^{1}_{m}\mathbf{x}_{i}+\mathbf{b}^{1}_{m}) \qquad (3)$$

where $\boldsymbol{\beta}^{1,i}\in\mathbb{R}^{|\mathcal{A}_{(m\shortrightarrow)}|+1}$ is parameterized by the weights $\mathbf{W}^{1}_{m}\in\mathbb{R}^{(|\mathcal{A}_{(m\shortrightarrow)}|+1)\times D_{m}}$ and bias $\mathbf{b}^{1}_{m}$ for each node type $m\in\mathcal{T_{V}}$. Since $\boldsymbol{\beta}^{1,i}$ is softmax normalized, $\beta^{1,i}_{0}+\sum^{\mathcal{A}_{(m\shortrightarrow)}}_{(m,n)}\beta^{1,i}_{(m,n)}=1$, where $\beta^{1,i}_{0}$ is the coefficient indexed for the "self" choice.

Aggregating First-order Neighborhoods

It is important not only to capture the local neighborhood of a node in a single relation but also to aggregate the neighborhoods among multiple relations and integrate the node's own feature representation. First, we gather information obtained from each relation's local neighborhood, then combine the relation-specific embeddings. We apply both the node-level and relation-level attention coefficients in a weighted-average aggregation scheme:

$$\mathbf{h}_{i}^{1}=\sigma\left(\beta^{1,i}_{0}\mathbf{U}^{1}_{m}\mathbf{x}_{i}+\sum_{(m,n)}^{\mathcal{A}_{(m\shortrightarrow)}}\beta^{1,i}_{(m,n)}\sum_{j}^{\mathcal{N}^{(m,n)}_{i}}\alpha^{(m,n)}_{ij}\mathbf{V}^{1}_{n}\mathbf{x}_{j}\right) \qquad (4)$$

where $\sigma$ is a nonlinear activation such as ReLU, and $m$ is node $i$'s type. The first-order node embedding is computed as an aggregation of linearly transformed immediate neighbor nodes. Next, we show that multiple LATTE layers can be stacked successively in a manner that allows the attention mechanism to capture higher-order relationships.
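
A small sketch of Eqs. (3)-(4) for one node of type $m$ is given below; the relation names and the pre-computed neighborhood summaries are placeholders, so this only illustrates how the softmax-normalized $\boldsymbol{\beta}^{1,i}$ mixes the node's own transformed features with its relation-specific aggregations.

```python
# A small sketch of Eq. (3)-(4) for a single node i of type m; the relation names and the
# pre-computed neighbor summaries are placeholders standing in for sum_j alpha_ij V_n x_j.
import torch

F_dim, D_m = 16, 8
relations_of_m = [("m", "n1"), ("m", "n2")]           # A_(m->); the "self" choice is implicit
W_m = torch.nn.Linear(D_m, len(relations_of_m) + 1)   # predicts beta^{1,i} from x_i (Eq. 3)
U_m = torch.nn.Linear(D_m, F_dim, bias=False)

x_i = torch.rand(D_m)                                 # features of node i (type m)
beta = torch.softmax(W_m(x_i), dim=-1)                # [beta_self, beta_(m,n1), beta_(m,n2)]

# Placeholder attention-weighted neighborhood summaries, one per relation of type m
neighborhood = {rel: torch.rand(F_dim) for rel in relations_of_m}

# Eq. (4): mix the node's own transformed features with its relation-specific aggregations
h_i = beta[0] * U_m(x_i)
for r, rel in enumerate(relations_of_m, start=1):
    h_i = h_i + beta[r] * neighborhood[rel]
h_i = torch.relu(h_i)                                 # sigma
print(h_i.shape)                                      # torch.Size([16])
```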

LATTE-T: Higher-order Heterogeneous Network Embedding

In this section, we describe the layer-stacking operations involved in extracting higher-order proximities when $t\geq 2$. The $t^{\text{th}}$-order proximity applies to indirect $t$-length meta paths obtained by combining two matching meta relations. For instance, when $t=2$, we can connect a relation $\mathbf{A}^{(m,n)}\in\mathcal{A}^{t-1}$ with target type $n$ to another relation $\mathbf{A}^{(n,p)}\in\mathcal{A}$ with matching source type $n$. Then, computing the Adamic-Adar index (Adamic and Adar 2003) as:

$$\begin{split}\mathbf{A}^{(m,n,p)}&=\mathbf{A}^{(m,n)}\mathbf{D}^{-1}\mathbf{A}^{(n,p)}\\ \mathbf{D}_{jj}&=\sum_{i\in\mathcal{V}_{m}}a^{(m,n)}_{ij}+\sum_{k\in\mathcal{V}_{p}}a^{(n,p)}_{jk}\end{split} \qquad (5)$$

yields $\mathbf{A}^{(m,n,p)}$ as the degree-normalized biadjacency matrix consisting of length-2 meta paths from $\mathcal{V}_{m}$ nodes to $\mathcal{V}_{p}$ nodes. We define the set of meta relations containing all length-$t$ meta paths in the network as:

$$\mathcal{A}^{t}=\mathcal{A}^{t-1}\times\mathcal{A} \qquad (6)$$

where $\times$ behaves as a Cartesian product that yields the Adamic-Adar composition only for matching pairs of relations. A length-$t$ sequence of meta relations with source type $m$ and target type $p$ is denoted as $(m\overset{t}{\shortrightarrow}p)$. This is directly applicable to the classical meta path paradigm (Sun et al. 2011), where all possible $t$-length meta paths are decomposed into separate relations in $\mathcal{A}^{t}$. Note that throughout this paper, the meta relation notation $(m,n)$ is overloaded for brevity. In fact, this architecture can handle multiple meta relation types with the same source type and target type, i.e., $\phi(e_{ij})=\left<\phi(i),\phi(e),\phi(j)\right>$, without loss of generality.
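
The composition in Eqs. (5)-(6) can be illustrated with dense NumPy arrays as below; this is an assumed, simplified realization (sparse matrices would be used in practice), where meta relations are keyed by tuples of node types and only pairs with a matching intermediate type are joined.

```python
# An assumed, simplified NumPy realization of Eq. (5)-(6): dense matrices, meta relations
# keyed by tuples of node types, and joins only where the intermediate node type matches.
import numpy as np

def adamic_adar(A_mn, A_np):
    """Degree-normalized product A^{(m,n)} D^{-1} A^{(n,p)} over the shared node type n."""
    deg = A_mn.sum(axis=0) + A_np.sum(axis=1)      # D_jj for intermediate nodes j in V_n
    deg[deg == 0] = 1.0                            # guard against isolated intermediates
    return A_mn @ np.diag(1.0 / deg) @ A_np

def compose(prev_relations, base_relations):
    """A^t = A^{t-1} x A: join every pair whose target and source node types match."""
    out = {}
    for path, A_prev in prev_relations.items():
        for (m, p), A_base in base_relations.items():
            if path[-1] == m:                      # matching intermediate node type
                out[path + (p,)] = adamic_adar(A_prev, A_base)
    return out

A = {("P", "A"): np.random.binomial(1, 0.3, (5, 3)).astype(float),
     ("A", "P"): np.random.binomial(1, 0.3, (3, 5)).astype(float)}
A2 = compose(A, A)   # contains the length-2 meta paths ("P","A","P") and ("A","P","A")
print(list(A2.keys()))
```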

Heterogeneous Higher-order Proximities

Learning the higher-order attention structure for $t^{\text{th}}$-order relations involves the composition of the $\mathcal{A}^{t-1}$ and $\mathcal{A}$ meta relation sets. Since the $t^{\text{th}}$-order proximity is a measure between a node's $(t-1)^{\text{th}}$-order context and another node in the network, we naturally must take into account $\mathbf{h}^{t-1}$ as the prior-order context embeddings. Similar to the first-order attention score, $e^{t}_{ik}$ is the $t^{\text{th}}$-order attention score between node $i\in\mathcal{V}_{m}$ and node $k\in\mathcal{V}_{p}$, defined as:

$$e^{t}_{ik}={\mathbf{q}^{t}_{(m\overset{t}{\shortrightarrow}p)}}^{\top}[\mathbf{U}^{t}_{m}\mathbf{h}^{t-1}_{i}\,||\,\mathbf{V}^{t}_{p}\mathbf{x}_{k}] \qquad (7)$$

The $t^{\text{th}}$-order attention scoring mechanism is parameterized by $\mathbf{U}^{t}_{m}\in\mathbb{R}^{F\times F}$ and $\mathbf{V}^{t}_{p}\in\mathbb{R}^{F\times D_{p}}$ for all node types, as well as $\mathbf{q}^{t}_{(m\overset{t}{\shortrightarrow}p)}\in\mathbb{R}^{2F}$ for each relation type in $\mathcal{A}^{t}$. Then, in the same manner as in Eq. (2), the attention coefficients for the $t^{\text{th}}$-order neighbors in each relation are the softmax-normalized $e^{t}_{ik}$ along with the temperature $\tau^{(m\overset{t}{\shortrightarrow}p)}$:

$$\alpha^{(m\overset{t}{\shortrightarrow}p)}_{ik}=\frac{\exp(\tau^{(m\overset{t}{\shortrightarrow}p)}\,e^{t}_{ik})}{\sum_{k^{\prime}\in\mathcal{N}^{(m\overset{t}{\shortrightarrow}p)}_{i}}\exp(\tau^{(m\overset{t}{\shortrightarrow}p)}\,e^{t}_{ik^{\prime}})}$$

Obtaining the relation-weighing coefficients in the $t^{\text{th}}$-order also involves the prior-order context embedding for each node. For a node $i$ of type $m$, we apply the relation weighing mechanism using its prior-order embedding $\mathbf{h}_{i}^{t-1}$ with:

$$\boldsymbol{\beta}^{t,i}=\mathrm{softmax}(\mathbf{W}^{t}_{m}\mathbf{h}^{t-1}_{i}+\mathbf{b}^{t}_{m}) \qquad (8)$$

where $\boldsymbol{\beta}^{t,i}\in\mathbb{R}^{|\mathcal{A}^{t}_{(m\shortrightarrow)}|+1}$ is parameterized by weights $\mathbf{W}^{t}_{m}\in\mathbb{R}^{(|\mathcal{A}^{t}_{(m\shortrightarrow)}|+1)\times F}$ and bias $\mathbf{b}^{t}_{m}$. In this way, LATTE can automatically identify important meta relations of arbitrary length $t$ by learning an adaptive relation weighing mechanism.

Aggregating Layer-wise Embeddings

While the first-order embedding represents the local neighborhood among the multiple relations, the $t^{\text{th}}$-order embedding expands the receptive field's vicinity by traversing higher-order meta paths. The $t^{\text{th}}$-order embedding of node $i$ is expressed as:

| Dataset | Relations (A-B)       | # nodes (A) | # nodes (B) | # links | # features | Training | Testing |
|---------|-----------------------|-------------|-------------|---------|------------|----------|---------|
| DBLP    | Paper-Author (PA)     | 14328       | 4057        | 19645   | 334        | 20%      | 70%     |
|         | Paper-Conference (PC) | 14328       | 20          | 14328   |            |          |         |
|         | Paper-Term (PT)       | 14328       | 4057        | 88420   |            |          |         |
| ACM     | Paper-Author (PA)     | 2464        | 5835        | 9744    | 1830       | 20%      | 70%     |
|         | Paper-Subject (PS)    | 3025        | 56          | 3025    |            |          |         |
| IMDB    | Movie-Actor (MA)      | 4780        | 5841        | 9744    | 1232       | 10%      | 80%     |
|         | Movie-Director (MD)   | 4780        | 2269        | 3025    |            |          |         |

Table 1: Statistics for the heterogeneous network datasets.
$$\mathbf{h}_{i}^{t}=\sigma\left(\beta^{t,i}_{0}\mathbf{U}^{t}_{m}\mathbf{h}^{t-1}_{i}+\sum_{(m\overset{t}{\shortrightarrow}p)}^{\mathcal{A}^{t}_{(m\shortrightarrow)}}\beta_{(m\overset{t}{\shortrightarrow}p)}^{t,i}\sum_{k}^{\mathcal{N}^{(m\overset{t}{\shortrightarrow}p)}_{i}}\alpha^{(m\overset{t}{\shortrightarrow}p)}_{ik}\mathbf{V}^{t}_{p}\mathbf{x}_{k}\right) \qquad (9)$$

With this framework, the receptive field of $t^{\text{th}}$-order relations is contained within each $t^{\text{th}}$-order context embedding. Furthermore, as $\boldsymbol{\beta}^{t,i}$ encapsulates each relation in $\mathcal{A}^{t}$ separately, it is possible to identify the specific relation types involved in the composite representation.

Given the layer-wise representations $\mathbf{h}^{1}_{i},\ldots,\mathbf{h}^{T}_{i}$ of node $i$, we obtain the final embedding output by concatenating all the $t$-order context embeddings, as:

$$\mathbf{h}_{i}=\big\Vert_{t=1}^{T}f_{t}(\mathbf{h}^{t-1},\mathcal{A}^{t})=\big\Vert_{t=1}^{T}\mathbf{h}^{t}_{i} \qquad (10)$$

where $\mathbf{h}_{i}\in\mathbb{R}^{TF},\forall i\in\mathcal{V}$, with $T\cdot F$ as the unified embedding dimension size for all node types.
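
Eq. (10) amounts to a per-node concatenation of the $T$ context embeddings; a brief sketch with placeholder tensors:

```python
# A brief sketch of Eq. (10): per-node concatenation of the T layer-wise context embeddings.
import torch

T, num_nodes, F_dim = 2, 10, 128
layer_embeddings = [torch.rand(num_nodes, F_dim) for _ in range(T)]   # h^1, ..., h^T
h = torch.cat(layer_embeddings, dim=-1)                               # shape (|V|, T*F)
print(h.shape)                                                        # torch.Size([10, 256])
```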

Preserving Proximities with Attention Scores

We repurpose the computed attention scores to explicitly estimate the heterogeneous pairwise proximities in the network. Incorporating this objective not only enables our model to perform unsupervised learning but also allows the node-level attention mechanism to reinforce highly connected node pairs by taking advantage of weighted links. To preserve the pairwise $t^{\text{th}}$-order proximities for all links in each $(m\overset{t}{\shortrightarrow}p)$ relation, we apply the Noise Contrastive Estimation objective with negative sampling (Mikolov et al. 2013) as:

$$\begin{split}L_{t}(\mathbf{A}^{(m\overset{t}{\shortrightarrow}p)})=&-\frac{1}{|\mathbf{A}^{(m\overset{t}{\shortrightarrow}p)}|}\sum_{a_{ik}}^{\mathbf{A}^{(m\overset{t}{\shortrightarrow}p)}}a_{ik}\log(\rho(e^{t}_{ik}))\\ &-\frac{1}{K}\sum_{k}^{K}\mathbb{E}_{a_{uv}\sim P(\mathbf{A}^{(m\overset{t}{\shortrightarrow}p)})}[\log\rho(-e^{t}_{uv})]\end{split} \qquad (11)$$

where $\rho$ denotes the sigmoid function applied to the attention score to infer a probability value. The first term models the observed links, the second term models the negative links drawn from the noise distribution in $(m\overset{t}{\shortrightarrow}p)$, and $K$ is the number of sampled negative links. Typically, $K$ is chosen to be between 2 and 5 times the number of positive links.
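
A hedged PyTorch sketch of Eq. (11) is shown below; the tensor names, the use of means in place of the exact normalizations, and the way negative links are drawn are our assumptions for illustration only.

```python
# A hedged PyTorch sketch of the negative-sampling objective in Eq. (11): observed
# (possibly weighted) links push sigmoid(e_ik) up, sampled noise links push it down.
# Tensor names, the use of means, and the sampling scheme are assumptions.
import torch
import torch.nn.functional as F

def proximity_loss(pos_scores, pos_weights, neg_scores):
    """pos_scores: e^t_ik of observed links; pos_weights: a_ik; neg_scores: e^t_uv of noise links."""
    pos_term = -(pos_weights * F.logsigmoid(pos_scores)).mean()
    neg_term = -F.logsigmoid(-neg_scores).mean()
    return pos_term + neg_term

# Example: 8 observed links of one (m ->t p) relation with a 5x negative sampling ratio
pos_scores  = torch.randn(8, requires_grad=True)
pos_weights = torch.ones(8)                  # a_ik; all 1 for an unweighted relation
neg_scores  = torch.randn(40, requires_grad=True)
loss = proximity_loss(pos_scores, pos_weights, neg_scores)
loss.backward()
print(float(loss))
```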

Model Optimization

To learn from both the heterogeneous network's attributes and topology, we optimize the proximity-preserving objectives together with the downstream objective on the embedding outputs using the standard back-propagation algorithm. For semi-supervised node classification, a multi-layer perceptron $g(\mathbf{h}_{i})=\widetilde{\mathbf{y}}_{i}\in[0,1]^{G}$ follows the LATTE layers in order to predict the $G$ labels given the node embedding. The cross-entropy minimization objective is defined as:

$$L(\mathcal{X},\mathcal{A})=-\sum_{i\in\mathcal{V}_{Y}}\mathbf{y}_{i}\log(g(\mathbf{h}_{i}))+\sum_{t=1}^{T}\sum^{\mathcal{A}^{t}}_{\mathbf{A}^{(m\overset{t}{\shortrightarrow}n)}}L_{t}(\mathbf{A}^{(m\overset{t}{\shortrightarrow}n)}) \qquad (12)$$

where $\mathcal{V}_{Y}$ is the set of nodes that have labels, and $\mathbf{y}_{i}$ is the true label. The first term aims to encode the node embedding representations with attention mechanisms, while the second term reinforces the attention scores by iterating through weighted positive and sampled negative links.
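
The combined objective of Eq. (12) can be sketched as a cross-entropy term over labeled nodes plus the summed proximity losses; the classifier logits, the label mask, and the constant proximity terms below are illustrative placeholders.

```python
# An illustrative sketch of how the two terms of Eq. (12) could be combined per training
# step; the classifier logits, label mask, and constant proximity terms are placeholders,
# and single-label cross-entropy is assumed.
import torch
import torch.nn.functional as F

def total_loss(logits, labels, labeled_mask, proximity_losses):
    ce = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])   # first term, over V_Y
    return ce + sum(proximity_losses)                                  # plus all L_t terms

logits = torch.randn(10, 4, requires_grad=True)   # g(h_i) for G = 4 classes
labels = torch.randint(0, 4, (10,))
mask = torch.zeros(10, dtype=torch.bool)
mask[:6] = True                                   # V_Y: the labeled nodes
loss = total_loss(logits, labels, mask, [torch.tensor(0.3), torch.tensor(0.1)])
loss.backward()
```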

Our model allows for computing embeddings for a subnetwork at each iteration; thus, it does not require computations involving the global network structure of all nodes at once. This approach not only enables mini-batch training on large networks that do not fit in memory but also makes our technique well suited for inductive learning. To perform online training at each iteration, an input batch is generated by recursively sampling a fixed number of neighbor nodes (Hamilton, Ying, and Leskovec 2017). Then, LATTE can yield embedding outputs for a sampled subnetwork given only the local links and node attributes.
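
A simple toy illustration of the recursive fixed-fanout neighbor sampling (in the spirit of GraphSAGE, not the exact sampler used here) on a small adjacency list:

```python
# A simple toy illustration (not the exact sampler used here) of recursively sampling a
# fixed number of neighbors per hop, in the spirit of GraphSAGE mini-batch training.
import random

def sample_neighbors(adj, seeds, fanouts):
    """adj: dict node -> neighbor list; fanouts: e.g. [25, 20] for two hops."""
    batch_nodes = set(seeds)
    frontier = list(seeds)
    for fanout in fanouts:
        next_frontier = []
        for u in frontier:
            nbrs = adj.get(u, [])
            picked = nbrs if len(nbrs) <= fanout else random.sample(nbrs, fanout)
            next_frontier.extend(picked)
        batch_nodes.update(next_frontier)
        frontier = next_frontier
    return batch_nodes   # node set whose links and attributes form the sampled subnetwork

adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0], 4: [1]}
print(sample_neighbors(adj, seeds=[0], fanouts=[2, 2]))
```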

Experiments

An effective network representation learning method can generalize to an unseen node by accurately encoding its links and attributes and then “aligning” them to the embedding space learned from seen (trained) nodes. In this section, we evaluate our method’s effectiveness on several node classification experiments, where the task is to predict node labels for a portion of the network hidden during training.

| Dataset | Metric     | metapath2vec | HIN2Vec | HAN    | GTN    | LATTE-1      | LATTE-2      | LATTE-2prox  |
|---------|------------|--------------|---------|--------|--------|--------------|--------------|--------------|
| DBLP    | F1 (trans) | 0.7518       | 0.7431  | 0.9121 | 0.9203 | 0.8911±0.003 | 0.9240±0.003 | 0.9156±0.003 |
|         | F1 (induc) | –            | –       | 0.8666 | 0.8721 | 0.8620±0.004 | 0.8631±0.003 | 0.8822±0.032 |
|         | # params   | 2.3M         | 2.3M    | 240K   | 125K   | 78K          | 111K         | 111K         |
| ACM     | F1 (trans) | 0.8879       | 0.8466  | 0.8725 | 0.9085 | 0.9118±0.005 | 0.9134±0.005 | 0.9153±0.003 |
|         | F1 (induc) | –            | –       | 0.7909 | 0.8860 | 0.8988±0.003 | 0.9007±0.003 | 0.9156±0.003 |
|         | # params   | 387K         | 1.1M    | 1.5M   | 326K   | 250K         | 273K         | 273K         |
| IMDB    | F1 (trans) | 0.4310       | 0.4404  | 0.5394 | 0.5924 | 0.6066±0.018 | 0.6135±0.014 | 0.6363±0.007 |
|         | F1 (induc) | –            | –       | 0.3877 | 0.5810 | 0.6036±0.009 | 0.6117±0.038 | 0.6355±0.004 |
|         | # params   | 611K         | 1.6M    | 1.4M   | 243K   | 170K         | 196K         | 196K         |

± denotes the mean and standard deviation over 10 trials; – indicates no reported inductive result.

Table 2: Performance comparison of Macro F1 for various methods on transductive and inductive node classification.

Datasets

We conduct performance comparison experiments over several benchmark heterogeneous network datasets. In Table 1, a summary of the network statistics is provided for each of the following datasets:

  1. DBLP (https://dblp.uni-trier.de): a heterogeneous network extracted from a bibliography dataset on major computer science journals and proceedings. The dataset has been preprocessed to contain 14328 papers, 4057 authors, 20 conferences, and 8789 terms. Three relation types are considered: paper-author, paper-conference, and paper-term. The authors' attributes are a bag-of-words representation of publication keywords. The classification task is to predict the label for each author among four domain areas: database, data mining, machine learning, and information retrieval.

  2. ACM (https://dl.acm.org): a small citation network dataset containing paper-author and paper-subject relation types among 3025 paper, 5835 author, and 56 subject nodes. Paper nodes are associated with a bag-of-words representation of keywords as features. The task is to label the conference each paper is published in, among the KDD, SIGMOD, SIGCOM, MobiCOMM, and VLDB venues.

  3. IMDB (Cantador, Brusilovsky, and Kuflik 2011): a movie database network containing movie-actor and movie-director relations among 4780 movies, 5841 actors, and 2269 directors. Each movie contains bag-of-words features of its plot, and the prediction task is to label the movie's genre among Action, Comedy, and Drama.

In each of the datasets, every directed relation has a reverse relation included. All self-loop links have been removed, unless required by a particular algorithm.

Experimental Setup

To provide a consistent and reproducible experimental setup, the preprocessed networks were obtained from the CogDL Toolkit (Cen et al. 2020) benchmark datasets. Each dataset is provided with a standard separation of train, validation, and test sets, as well as the full input feature and label sets. Since our model is evaluated in this standard environment, results from different experiments can be compared directly.

Baselines

We verify the effectiveness of our framework by testing multiple variants of LATTE along with other existing approaches. For comparison with some of the state-of-the-art baselines, we consider various heterogeneous network embedding and GNN methods, including:

  • Metapath2Vec (Dong, Chawla, and Swami 2017): An unsupervised random walk method that utilizes the skip-gram along with negative sampling on meta paths to embed heterogeneous nodes. It has been shown to achieve prominent performance among random walk based approaches.

  • HIN2Vec (Fu, Lee, and Lei 2017): a state-of-the-art deep neural network that learns embeddings by considering the meta paths in an attributed heterogeneous network. It utilizes random walk preprocessing and does not consider weighing different meta paths.

  • HAN (Wang et al. 2019): A GNN that employs a GAT-based node-level attention mechanism for heterogeneous networks. It proposes a hierarchical attention procedure that weighs the importance of each meta path, but only among pre-defined, hand-crafted meta paths.

  • GTN (Yun et al. 2019): A GNN with an attention mechanism that weighs and combines heterogeneous meta paths successively into higher-order structures, then performs graph convolution on the resulting adjacency matrix.

  • LATTE-1: A variant of the proposed LATTE model with one layer that only considers first-order meta relations. The pairwise proximity-preserving objective is excluded.

  • LATTE-2: A variant of LATTE with two layers that considers both first-order and second-order meta relations. The pairwise proximity-preserving objective is excluded.

  • LATTE-2prox: Same as LATTE-2, but additionally optimizes the higher-order proximity-preserving objectives.

Every method was evaluated on an identical split of training, validation, and testing sets for fairness and reproducibility. The final model is trained on the training set until the early stopping criterion on the validation set is met, then evaluated on the test set. Additionally, each method must exploit all relations and the available node attributes in the dataset, except for metapath2vec due to its limitations. If a particular node type in the heterogeneous network is not attributed, we instantiate a set of learnable embeddings to replace $\mathcal{X}$ as node features.

Implementation Details

We set the following hyper-parameters identically for all methods: embedding dimension size at 128, learning rate at 0.001, mini-batch size at 2048, and early stopping if the validation loss doesn’t decrease after ten epochs. For HAN and GTN, the number of GNN hidden layers is 2, preceding an MLP that predicts node labels given the embedding outputs in an end-to-end manner. For random walk based methods, a logistic classifier is employed to perform node classification given the learned node embeddings. The hyper-parameters for metapath2vec and HIN2Vec are walk length at 100, window size at 5, walks per node at 40, and the number of negative samples at 5. Among GNN-based methods, the batch sampling procedure that recursively samples a fixed number of neighbor nodes (Hamilton, Ying, and Leskovec 2017) is utilized, with neighborhood sample sizes 25 and 20. Where possible, the standard implementation of baseline methods has been provided by the CogDL Toolkit.

For all LATTE variants, the best-performing hyper-parameters selected ReLU as the embedding activation function, drop-out at 30% on the embedding outputs, and weight decay regularization (excluding biases) at 0.01. In LATTE-2prox, the negative sampling ratio is set to 5.0. The models were implemented with PyTorch Geometric (PyG), and the experiments were conducted on a GeForce RTX 2080 Ti with 11 GB of GPU memory. Hyper-parameter tuning was conducted with Weights and Biases (Biewald 2020), and the parameter ranges tested are reported in the technical appendix.

Node Classification Experiments and Results

We consider semi-supervised classification tasks in both inductive and transductive settings to perform a thorough evaluation of representation learning on heterogeneous networks. In the transductive setting, models can traverse the subgraph containing the test-set nodes during training. In contrast, the inductive setting requires the models never to encounter the test subgraph during the training phase; they must predict the testing nodes' labels on the novel subgraph at the testing phase. We train and evaluate all baseline methods to predict test nodes in each transductive and inductive setting over ten trials.

To measure the classification performance of the prediction outputs, we record the precision and recall for each class label to compute the F1 score. Due to the apparent class imbalance in the three datasets, we report only the averaged Macro-F1 score, which was the more challenging metric in similar experiments (Wang et al. 2019). The performance comparisons are reported in Table 2. For metapath2vec, HIN2Vec, HAN, and GTN, the benchmark Macro F1 scores in the transductive setting were provided by the CogDL Toolkit, while the Macro F1 scores in the inductive setting are averaged over 10 experiment runs.

The top performance of LATTE-2prox indicates its effectiveness at learning node representations over the higher-order meta relation structures, especially with 80-90% of the network set aside for testing. Compared to HAN, which does not consider higher-order relations, GTN and LATTE-2 have a significant edge in inductive prediction because both can capture global properties. Compared to GTN, which does not maintain the semantic space of individual meta paths, LATTE-2prox outperforms with explicit proximity-preserving objectives for each of the decomposed higher-order meta relations.

Interpretation of the Attention Mechanism

LATTE's fundamental properties are the construction of higher-order meta relations and the attention mechanism that weighs the importance of those relations. To demonstrate the benefits of these features, we interpret the importance levels chosen for each meta relation and verify whether they reflect the structural topology of the heterogeneous network. Given the learned weights $\boldsymbol{\beta}^{t,i}$ for each node $i$ at a layer $t$, we can assess not only the averaged meta relation weights for a node type but also the individual meta relation weights for each node. In Fig. 1, we report the average and standard deviation of the meta relation attention weights for IMDB, as well as the correlation between those weights and the node degrees for each relation. The meta relation weights for DBLP and ACM are reported in the supplementary material.

For IMDB movies, it can be observed that, on average, the MA, MD, MDM, and MAM meta relations have the highest attention weights. This indicates that information from the movie-actor neighborhoods, the movie-director neighborhoods, and the node's own features is relatively more represented in each movie's first-order embedding. This selection also persists in the second-order embeddings, where MDM and MAM have higher weights. Additionally, when looking at the correlation between MA's weights and the degree of MA links over all nodes, there is a 0.73 correlation, which indicates the attention mechanism can adaptively weigh the relation based on the number of connections present at the node. Interestingly, there is a substantial negative correlation of -0.88 between the M "self" relation weights and the node degree. This indicates that nodes with few or no links will choose a higher weight for their own features, since little information can be gained from other modalities. As individual nodes may have varying levels of participation among the various relations, this result demonstrates that LATTE can select the most effective meta relations for individual nodes depending on their local and global properties in the heterogeneous topology.

Figure 1: (a) Average and standard deviation of the 1st- and 2nd-order meta relation attention weights, where relations starting with M are aggregated to embed IMDB movie nodes. (b) Correlation between node degrees and relation weights for each meta relation in IMDB. A single-letter relation (e.g., M, M1) denotes the "self" choice.

Discussion and Conclusion

The task of aggregating heterogeneous relations remains a fundamental challenge in designing a representation learning method for heterogeneous networks. Multiple relations can represent different semantics, and their link distributions can be overlapping, interconnected, or non-complementary. Therefore, it is an appropriate first step to consider them as separate components of the network in order to unravel their structural dependencies. One of the key differences between existing GNN methods and the proposed LATTE is that the latter exploits the semantic information in each meta relation. Instead of conflating heterogeneous relations for all node types as in HAN and GTN, LATTE aggregates only the relevant relations for each node type. Furthermore, by considering the source type and target type of each meta relation, only relevant pairs of relations are joined when generating higher-order meta paths. A significant benefit of this approach is that it relieves the computational burden of multiplying adjacency matrices over all nodes while allowing distinct representations for the different node types.

This work has proposed an architecture for heterogeneous network embedding that can generate higher-order meta relations. The benefits of the proposed mechanism are not only to improve inductive node classification performance but also to improve the interpretability of deep GNN models. In the future, we will explore whether to incorporate a self-attention mechanism to learn the structural dependencies between relations by propagating information between the different relation-specific embeddings. Other interesting future developments are to enable LATTE to pre-train without supervision and to extend LATTE to link prediction tasks.


References

  • Adamic and Adar (2003) Adamic, L. A.; and Adar, E. 2003. Friends and neighbors on the web. Social networks 25(3): 211–230.
  • Battiston, Nicosia, and Latora (2014) Battiston, F.; Nicosia, V.; and Latora, V. 2014. Structural measures for multiplex networks. Physical Review E 89(3): 032804.
  • Biewald (2020) Biewald, L. 2020. Experiment Tracking with Weights and Biases. URL https://www.wandb.com/. Software available from wandb.com.
  • Bordes et al. (2013) Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, 2787–2795.
  • Cantador, Brusilovsky, and Kuflik (2011) Cantador, I.; Brusilovsky, P.; and Kuflik, T. 2011. Second workshop on information heterogeneity and fusion in recommender systems (HetRec2011). In Proceedings of the fifth ACM conference on Recommender systems, 387–388.
  • Cen et al. (2020) Cen, Y.; Wang, Y.; Hou, Z.; Chen, Q.; and Tang, J. 2020. CogDL: An Extensive Research Toolkit for Deep Learning on Graphs. URL https://github.com/thudm/cogdl.
  • Chorowski et al. (2015) Chorowski, J. K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; and Bengio, Y. 2015. Attention-based models for speech recognition. In Advances in neural information processing systems, 577–585.
  • Dong, Chawla, and Swami (2017) Dong, Y.; Chawla, N. V.; and Swami, A. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 135–144.
  • Dong et al. (2020) Dong, Y.; Hu, Z.; Wang, K.; Sun, Y.; and Tang, J. 2020. Heterogeneous Network Representation Learning. IJCAI.
  • Fu, Lee, and Lei (2017) Fu, T.-y.; Lee, W.-C.; and Lei, Z. 2017. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 1797–1806.
  • Grover and Leskovec (2016) Grover, A.; and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 855–864.
  • Hamilton, Ying, and Leskovec (2017) Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems, 1024–1034.
  • Hu et al. (2020) Hu, Z.; Dong, Y.; Wang, K.; and Sun, Y. 2020. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, 2704–2710.
  • Huang et al. (2020) Huang, K.; Xiao, C.; Glass, L.; Zitnik, M.; and Sun, J. 2020. SkipGNN: Predicting Molecular Interactions with Skip-Graph Networks. arXiv preprint arXiv:2004.14949 .
  • Kipf and Welling (2016) Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 .
  • Matsuno and Murata (2018) Matsuno, R.; and Murata, T. 2018. MELL: effective embedding method for multiplex networks. In Companion Proceedings of the The Web Conference 2018, 1261–1268.
  • Mikolov et al. (2013) Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
  • Perozzi, Al-Rfou, and Skiena (2014) Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 701–710.
  • Qu et al. (2017) Qu, M.; Tang, J.; Shang, J.; Ren, X.; Zhang, M.; and Han, J. 2017. An attention-based collaboration framework for multi-view network representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 1767–1776.
  • Schlichtkrull et al. (2018) Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, 593–607. Springer.
  • Shi et al. (2018) Shi, Y.; Han, F.; He, X.; He, X.; Yang, C.; Luo, J.; and Han, J. 2018. mvn2vec: Preservation and collaboration in multi-view network embedding. arXiv preprint arXiv:1801.06597 .
  • Sun et al. (2011) Sun, Y.; Han, J.; Yan, X.; Yu, P. S.; and Wu, T. 2011. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment 4(11): 992–1003.
  • Tang et al. (2015) Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, 1067–1077.
  • Veličković et al. (2018) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. International Conference on Learning Representations .
  • Wang et al. (2019) Wang, X.; Ji, H.; Shi, C.; Wang, B.; Ye, Y.; Cui, P.; and Yu, P. S. 2019. Heterogeneous graph attention network. In The World Wide Web Conference, 2022–2032.
  • Xu et al. (2018) Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536 .
  • Yun et al. (2019) Yun, S.; Jeong, M.; Kim, R.; Kang, J.; and Kim, H. J. 2019. Graph transformer networks. In Advances in Neural Information Processing Systems, 11983–11993.
  • Zhang et al. (2019) Zhang, C.; Song, D.; Huang, C.; Swami, A.; and Chawla, N. V. 2019. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 793–803.
  • Zhou et al. (2019) Zhou, S.; Bu, J.; Wang, X.; Chen, J.; and Wang, C. 2019. HAHE: Hierarchical attentive heterogeneous information network embedding. arXiv preprint arXiv:1902.01475 .

Technical Appendix

Figure 2: Conceptual illustration of the LATTE architecture demonstrating the layer-stacking operations that aggregate first-order and second-order meta relations. The heterogeneous network contains Paper-Author (PA), Paper-Conference (PC), and Paper-Term (PT) relations, in addition to their reverse relations (i.e., AP, CP, TP). The node feature inputs for each node type are $\mathbf{p}^{0}$, $\mathbf{a}^{0}$, $\mathbf{c}^{0}$, and $\mathbf{t}^{0}$, and the LATTE-$t$ embedding outputs for the respective node types are $\mathbf{p}^{t}$, $\mathbf{a}^{t}$, $\mathbf{c}^{t}$, and $\mathbf{t}^{t}$. The $t^{\text{th}}$-order meta relations are generated by combining relations from $\mathcal{A}^{t-1}$ and $\mathcal{A}$.

Hyper-parameter Tuning

Hyper-parameter tuning was conducted on the Weights and Biases platform (Biewald 2020; https://www.wandb.com/), where we utilize a random search approach that chooses random sets of parameter values. The parameters tested are the embedding dimension, the $t$-order, the attention score activation function, the number of neighbors sampled, the negative sampling ratio, the embedding output activation function, and the dropout probability.

Figure 3: Hyper-parameter tuning for Macro F1 performance on the ACM (inductive) dataset. Lighter colors indicate trial runs with a higher Macro F1 score.

Interpretation of the Attention Mechanism for DBLP and ACM

Following the interpretation of the learned attention weights on the IMDB dataset, we report the same attention weights and weight-degree correlation results for the DBLP and ACM datasets. In Figs. 4 and 5, it can be observed that the correlation between the meta relation weights and the node degrees exhibits the same phenomenon described for IMDB.

Figure 4: (a) Average and standard deviation of the 1st- and 2nd-order meta relation attention weights for the DBLP dataset. (b) Correlation between node degrees and relation weights for each meta relation in DBLP.

Figure 5: (a) Average and standard deviation of the 1st- and 2nd-order meta relation attention weights for the ACM dataset. (b) Correlation between node degrees and relation weights for each meta relation in ACM.