
Tree Structure-Aware Graph Representation Learning via Integrated Hierarchical Aggregation and Relational Metric Learning

Ziyue Qiao1,2, Pengyang Wang3, Yanjie Fu3, Yi Du1,∗, Pengfei Wang4, Yuanchun Zhou1
∗Corresponding author. The research is supported by the Natural Science Foundation of China under Grant No. 61836013, the National Key Research and Development Plan of China (No. 2016YFB0501901), the Ministry of Science and Technology Innovation Methods Special Work Project under Grant 2019IM020100, and the Beijing Nova Program of Science and Technology under Grant No. Z191100001119090.
1Computer Network Information Center, Chinese Academy of Sciences, Beijing
2University of Chinese Academy of Sciences, Beijing
3Department of Computer Science, University of Central Florida, Orlando
4Alibaba DAMO Academy, Alibaba Group, China
[email protected], {pengyang.wang@knights., yanjie.fu@}ucf.edu, {duyi, zyc}@cnic.cn
Abstract

While Graph Neural Networks (GNNs) have shown superiority in learning node representations of homogeneous graphs, leveraging GNNs on heterogeneous graphs remains a challenging problem. The dominant reason is that GNNs learn node representations by aggregating neighbors' information regardless of node types. Some works alleviate this issue by exploiting relations or meta-paths to sample neighbors of distinct categories and then using attention mechanisms to learn different importance weights for different categories. However, one limitation is that the learned representations for different types of nodes should own different feature spaces, while all the above works still project node representations into one feature space. Moreover, after exploring massive heterogeneous graphs, we identify the fact that multiple nodes with the same type always connect to a node with another type, which reveals a many-to-one schema, a.k.a. the hierarchical tree structure. None of the above works can preserve such tree structure, since the exact multi-hop path correlation from neighbors to the target node would be erased through aggregation. Therefore, to overcome the limitations of the literature, we propose T-GNN, a tree structure-aware graph neural network model for graph representation learning. Specifically, the proposed T-GNN consists of two modules: (1) an integrated hierarchical aggregation module and (2) a relational metric learning module. The integrated hierarchical aggregation module preserves the tree structure by combining GNN with a gated recurrent unit to integrate the hierarchical and sequential neighborhood information on the tree structure into node representations. The relational metric learning module preserves the heterogeneity by embedding each type of nodes into a type-specific space with a distinct distribution based on similarity metrics. In this way, the proposed T-GNN is capable of simultaneously preserving the heterogeneity and the tree structure inherent in heterogeneous graphs. Finally, we conduct extensive experiments to show the outstanding performance of T-GNN in tasks of node clustering and classification, inductive node clustering and classification, and link prediction.

Index Terms:
Graph Neural Network; Graph Representation Learning; Metric Learning; Heterogeneous Graph.

I INTRODUCTION

Graph Neural Network (GNN) is a family of graph representation learning approaches that encode node features into low-dimensional representation vectors by aggregating neighbors' information [1]. GNNs have drawn increasing attention in recent years, due to their superior performance in numerous real-world applications, such as recommender systems [2, 3], urban computing [4, 5], chemistry [6], etc.

While GNN models conform well with learning tasks on homogeneous graphs, leveraging GNNs to learn node representations for heterogeneous graphs remains a challenging problem. First of all, heterogeneous graphs consist of nodes with different types, but vanilla GNNs indiscriminately aggregate neighbors' information regardless of node types. Second, the relations between different types of nodes usually reveal a many-to-one structure, a.k.a. the hierarchical tree structure, which indicates the schema that multiple nodes with the same type connect to a node with another type. For example, Figure 1 shows an instance in an academic graph: papers $p_1$, $p_2$ and $p_3$ are linked to authors $a_1$ and $a_2$ by the relation Paper $\xrightarrow{authored}$ Author (abbreviated as $\overrightarrow{PA}$), and authors $a_1$ and $a_2$ are linked to organization $o_1$ by the relation Author $\xrightarrow{employed}$ Organization (abbreviated as $\overrightarrow{AO}$). By composing $\overrightarrow{PA}$ and $\overrightarrow{AO}$, the nodes of types P, A and O form a three-level hierarchical tree with structure $\overrightarrow{PAO}$ and $o_1$ as the root node. This tree implies a two-hop structured neighborhood for $o_1$, where the papers are in the ground level and the authors are in the second level.

Figure 1: An illustration of hierarchical tree structures in an academic heterogeneous graph.

In the literature, some works have been proposed to alleviate this issue. The main idea is to sample neighbors of distinct categories via different types of relations or meta-paths (i.e., composed relations), and then use attention mechanisms or coefficients to assign different importance to different categories of nodes when aggregating neighbors [7, 8, 9, 10, 11]. However, there still exist limitations. First, some methods [12, 13, 11, 14] only differentiate neighbor information by atomic relations, i.e., the one-hop links, and propagate information of multi-hop neighborhoods by stacking multiple layers; however, the exact multi-hop path correlation from neighbors to the target node would be erased through aggregation. Second, for some methods [9, 2] that use meta-paths to sample multi-hop neighbors, given the target node, meta-paths directly transfer the information between the two end nodes of the meta-paths while ignoring the relay nodes in the middle. Moreover, meta-path based neighborhoods preserve relatively less structural information than tree-based neighborhoods. Third, the learned representations for different types of nodes should own different feature spaces, while all the above works still project node representations into one feature space.

Therefore, to overcome the limitations of the literature, we propose a Tree structure-aware Graph Neural Network (T-GNN) for heterogeneous graph representation learning. Next, we outline the key steps and insights of our proposed T-GNN.

Figure 2: An illustration of embedding three venues and their corresponding papers into the same space versus type-specific spaces. (a)(b) show the cases where different types of nodes are embedded into one space. (c) shows the result of embedding nodes into type-specific spaces.

Preserving Tree Structure via Integrated Hierarchical Aggregation. As mentioned above, multi-hop neighborhoods in heterogeneous graphs can be decomposed into different trees, which denote different organizational forms of neighbors and relations. On these trees, the neighbor information at different distances can be further formulated as a variable-length sequence from the bottom level to the top level. This indicates that we can sequentially aggregate the information of lower-level nodes and propagate it to the higher-level nodes to finally contribute to the root nodes' representations. Therefore, we propose an integrated hierarchical aggregation module to preserve the tree structures of heterogeneous graphs. Specifically, we differentiate tree structured neighborhoods by introducing hierarchical tree schemas, and partition neighborhoods into organized trees of different categories. Then we combine GNN models with Gated Recurrent Unit (GRU) modules to aggregate the hierarchical and sequential neighborhood information on these trees into node representations.

Preserving Heterogeneity via Relational Metric Learning. Most heterogeneous graph representation learning methods aim to embed different types of nodes into the same feature space. However, in one space, different types of nodes share the same distribution, and the similarities between nodes of different types could be incompatible and dampen each other.

For example, in Figure 2(a), the closeness of two similar venues would result in the papers published by these two venues tending to be inseparable, while in Figure 2(b), well separated papers make their two venues distant, so the proximity between the venues would not be well preserved. Such incompatibility may also exist between other types of nodes in one embedding space, while model optimization usually needs to learn specific distributions of nodes for certain tasks. Therefore, we propose a relational metric learning module to embed each type of nodes into a type-specific space with an independent distribution. Specifically, we introduce metrics to measure the similarities between nodes of different types: as shown in Figure 2(c), the similarity between nodes of the same type is calculated in their type space, and that between nodes of different types is calculated in a relation type-specific metric space. Finally, we leverage a relation-aware graph context loss based on random walks and negative sampling to train the whole model.

In summary, we propose the T-GNN model to improve the capability of GNNs in learning node representations of heterogeneous graphs. The proposed T-GNN includes two key modules: (1) the hierarchical aggregation module, which preserves the tree structure by integrating GNN with GRU; (2) the relational metric learning module, which preserves the heterogeneity by mapping different types of nodes into different embedding spaces with similarity measurements. The main contributions of our work are summarized as follows.

  1. To the best of our knowledge, we are the first to introduce the tree structures inherent in heterogeneous graphs into the architecture of a graph neural network model to learn node representations.

  2. We propose a tree structure-aware graph neural network model named T-GNN, which contains aggregation modules and sequence propagation modules to capture both the node attributes and the information in multi-hop neighborhoods with different hierarchical organizations into node representations.

  3. We propose a relational metric learning module for unsupervised graph representation learning that embeds nodes into type-specific spaces with independent distributions.

  4. We conduct extensive experiments to evaluate the performance of our proposed model on three datasets. The results demonstrate the superiority of our proposed method over several state-of-the-art methods.

II PRELIMINARIES

In this section, we give formal definitions of some related concepts and notations.

Definition 1 (Heterogeneous Graph).

A heterogeneous graph is denoted as $G=(N,E,\mathcal{T},\mathcal{R})$, in which each node $n\in N$ is associated with a node type mapping function $\phi(n): N\to\mathcal{T}$ and each edge $e\in E$ is associated with a relation type mapping function $\psi(e): E\to\mathcal{R}$. This suggests that $G$ has multiple node types and relation types, and $|\mathcal{T}|+|\mathcal{R}|>2$.
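For concreteness, the snippet below is a minimal sketch (not the authors' implementation) of how such a typed graph can be held in memory: the mapping $\phi$ is a plain dictionary from node ids to node types, and adjacency lists are keyed by relation type. All identifiers are illustrative.

```python
from collections import defaultdict

# phi maps node ids to node types; edges are stored per relation type, so several
# node types and relation types coexist in one structure (|T| + |R| > 2).
node_type = {"p1": "P", "p2": "P", "a1": "A", "a2": "A", "o1": "O"}
adj = defaultdict(list)                 # (source node, relation) -> target nodes

def add_edge(src, rel, dst):
    adj[(src, rel)].append(dst)

add_edge("p1", "authored", "a1")
add_edge("p2", "authored", "a1")
add_edge("p2", "authored", "a2")
add_edge("a1", "employed", "o1")
add_edge("a2", "employed", "o1")

print(node_type["a1"], adj[("a1", "employed")])   # A ['o1']
```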

Figure 3 shows the schemas of three heterogeneous graphs. We resolve the undirected links between nodes into fine-grained directed relations to represent the hierarchical structures. The directions of the relations are consistent with the many-to-one structures between different types of nodes.

Definition 2 (Meta-path).

In a heterogeneous graph $G=(N,E,\mathcal{T},\mathcal{R})$, a meta-path is defined as a path in the form of $t_0\xrightarrow{r_1}t_1\xrightarrow{r_2}\cdots\xrightarrow{r_l}t_l$ (abbreviated as $t_0t_1...t_l$), where $r_i\in\mathcal{R}$ and $t_i\in\mathcal{T}$. The sequence of relations represents the composite relation $r=r_1\circ r_2\circ\cdots\circ r_l$ between node types $t_0$ and $t_l$.

For example, we can design $PAP$ and $APV$ as meta-paths: $PAP$ means two papers are coauthored by the same person, and $APV$ means an author publishes a paper at a venue. Meta-paths are widely utilized to extract paths in previous works. Similarly, in our paper, we define hierarchical tree schemas to extract hierarchical trees.

Figure 3: Schemas of three heterogeneous graphs. (a) DBLP academic graph. (b) IMDB movie graph. (c) YELP review graph. Each solid arrow represents a relation type, each circle represents one node type, and the bold letters represent their abbreviations.
Definition 3 (Hierarchical Tree Schema).

A hierarchical tree schema $s$ is a directed meta-path in the form of $s=\{t_0\xrightarrow{r_1}t_1\xrightarrow{r_2}\cdots\xrightarrow{r_m}t_m\}$ (abbreviated as $\overrightarrow{t_0t_1...t_m}$), where the direction of each $r_i$ is from $t_{i-1}$ to $t_i$. These schemas are used to extract trees where $t_m$ is the type of the root node and each node in the $i$th level has the type $t_i$, representing the $(m-i)$-hop neighbors of the root node.

Definition 4 (Hierarchical Tree Structured Neighborhood).

Given a hierarchical tree schema $s=\{t_0\xrightarrow{r_1}t_1\xrightarrow{r_2}\cdots\xrightarrow{r_m}t_m\}$ and a node $n_o$ with type $t_m$, we first sample the neighbors of $n_o$ connected by relation $r_m$ into the $(m-1)$th level, then sample the neighbors of the nodes in the $(m-1)$th level connected by relation $r_{m-1}$ into the $(m-2)$th level, and so on, sampling nodes until the 0th level. Finally, we obtain a hierarchical tree with $n_o$ in level $m$ as the root node, and the nodes in level $m-k$ with type $t_{m-k}$ are $n_o$'s $k$-hop neighbors.

For example, in the academic dataset, the hierarchical tree schema $\overrightarrow{TPV}$ represents a kind of structured neighborhood for venue nodes, where the venue is the root node on the 2nd level, its papers are on the 1st level, and the terms of these papers are on the 0th level. $\overrightarrow{AMD}$ in the movie dataset represents a structured neighborhood for each director, containing its movies on the 1st level and the actors of these movies on the 0th level.
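The level-by-level sampling of Definition 4 can be sketched as follows on a toy academic graph; the adjacency layout, the inverse relation names ("employs", "writes") and the fanout cap are illustrative assumptions rather than the authors' implementation.

```python
import random

def sample_tree(adj, root, schema_relations, fanout=5):
    """adj maps (node, relation) -> neighbor list. schema_relations lists the
    relations of the schema read from the root level downward (r_m, ..., r_1).
    Returns the sampled nodes grouped by tree level, index 0 = bottom level."""
    levels = [[root]]                              # the root node alone at level m
    for rel in schema_relations:
        nxt = []
        for node in levels[-1]:
            nbrs = adj.get((node, rel), [])
            nxt.extend(random.sample(nbrs, min(fanout, len(nbrs))))
        levels.append(nxt)
    return levels[::-1]

# Toy graph for the schema PAO rooted at o1: the sampler walks the inverse
# relations from the organization to its authors and from the authors to papers.
adj = {("o1", "employs"): ["a1", "a2"],
       ("a1", "writes"): ["p1", "p2"],
       ("a2", "writes"): ["p2", "p3"]}
print(sample_tree(adj, "o1", ["employs", "writes"]))
# e.g. [['p1', 'p2', 'p2', 'p3'], ['a1', 'a2'], ['o1']]
```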

Given a node $n_o$ with node type $t_m=\phi(n_o)$, we can extract $k$ kinds of hierarchical tree schemas ending with $n_o$'s type: $\mathcal{S}_t=\{s_1,s_2,...,s_k\}$, that is, $\forall s\in\mathcal{S}_t$, $s=\{t_0\xrightarrow{r_1}t_1\xrightarrow{r_2}\cdots\xrightarrow{r_m}t_m\}$. To differentiate the neighborhoods that different schemas refer to, we require that no schema in $\mathcal{S}_t$ is a subsequence of another, i.e., $\forall s_i,s_j\in\mathcal{S}_t \wedge i\neq j \rightarrow s_i\nsubseteq s_j$.

These hierarchical tree schemas partition $n_o$'s neighborhoods into different categories of hierarchical trees. These hierarchical trees preserve different semantics and sequence information and are distinguished in the aggregation model. For example, $\overrightarrow{TPV}$ and $\overrightarrow{APV}$ represent two kinds of structured neighborhoods of venues. For $\overrightarrow{TPV}$, term information is first aggregated to papers, and then the papers containing the terms' information are aggregated to the venues, while for $\overrightarrow{APV}$, the aggregation order is authors, papers and venues.

Definition 5 (Heterogeneous Graph Representation Learning).

Given a heterogeneous graph $G=(N,E,\mathcal{T},\mathcal{R})$, learning the representations is to learn node type-specific mapping functions $\{f_t: N_t\to\mathbb{R}^{d'}_t\}, t\in\mathcal{T}$, which project the nodes in $N_t$ (the node set of type $t$) to representation vectors in the $d'$-dimensional latent space $\mathbb{R}^{d'}_t$, corresponding to the feature space of the node type $t$.

Figure 4: An illustration of our proposed propagation model. (a) The whole propagation model, which consists of hierarchical tree structured neighborhood extraction, hierarchical aggregation and representation integration. (b) The architecture of the proposed hierarchical aggregation.

III PROPOSED MODEL

In this section, we introduce the T-GNN model for heterogeneous graph representation learning. Specifically, we combine gated recurrent units with GNNs to process the information on hierarchical tree structured neighborhoods. Then we use an attention mechanism to measure the importance of the information aggregated on different trees and integrate it into node representations. Finally, we present a relational metric learning module for model optimization.

III-A Hierarchical Aggregation

For a node $n_o$, we have discussed that we can use hierarchical tree schemas to divide its multi-hop neighborhood into multiple structured trees, each of which represents a distinctive neighborhood of $n_o$. Given the hierarchical tree extracted by one schema $s$ where $n_o$ is the root node, and the initial feature vectors $\mathbf{x}_i\in\mathbb{R}^{d_t}$ ($d_t$: the initial feature dimension) of each node on this tree, our intuition is to aggregate the information from all the neighbors below $n_o$'s level on the tree, together with $\mathbf{x}_o$, to form its representation.

Since the neighbor information in the propagation can be regarded as sequential inputs divided by levels, and a GNN cannot process sequential data, we combine a Gated Recurrent Unit (GRU) module with the GNN on this tree to encode $\mathbf{x}_o$ and its neighbors' information into a schema-specific output $\mathbf{z}^s_o$. The basic recurrence of the aggregation and propagation modules on hierarchical trees is:

$$\widehat{h^{(0)}_{i}}=\mathbf{x}_{i},\quad\phi(n_{i})=t_{0}$$
$$\widehat{h^{(a)}_{i}}=GRU(\mathbf{x}_{i},h_{i}^{(a-1)}),\quad\phi(n_{i})=t_{a}$$
$$h_{i}^{(a-1)}=AGGREGATE_{r_{a}}(\{\widehat{h_{j}^{(a-1)}},\,n_{j}\in N_{i}^{r_{a}}\}) \qquad (1)$$

where $0<a\leq m$ and $\widehat{h^{(a)}_{i}}$ represents the output hidden state of node $n_i$ on the $a$th level of the tree with schema $s$. The hidden state $h_{i}^{(a-1)}$ represents the neighborhood message of $n_i$ passed from all its neighbors on the $(a-1)$th level. $GRU(\mathbf{x}_{i},h_{i}^{(a-1)})$ is formulated as:

$$z_{t}=\sigma(A_{z}\mathbf{x}_{i}+B_{z}h_{i}^{(a-1)})$$
$$r_{t}=\sigma(A_{r}\mathbf{x}_{i}+B_{r}h_{i}^{(a-1)})$$
$$\tilde{h}^{(a)}_{i}=\tanh[A_{h}\mathbf{x}_{i}+B_{h}(r_{t}\circ h_{i}^{(a-1)})]$$
$$\widehat{h^{(a)}_{i}}=z_{t}\circ h_{i}^{(a-1)}+(1-z_{t})\circ\tilde{h}^{(a)}_{i} \qquad (2)$$

where $\sigma(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function and $\circ$ is element-wise multiplication. The aggregator function $AGGREGATE_{r_{a}}(\cdot)$ is an information aggregator over $n_i$'s sub-level neighbors specific to relation type $r_a$, which could be a mean pooling layer, a max pooling layer, or a GNN aggregator, such as the mean, LSTM and pooling aggregators introduced in GraphSAGE [15], or the graph attentional layer introduced in GAT [16]. In our paper, we use a weighted mean aggregator, so $h_{i}^{(a-1)}$ in Eq. 1 is reformulated as:

$$h_{i}^{(a-1)}=\sum_{n_{j}\in N^{r_{a}}_{i}}c^{r_{a}}_{ij}W_{r_{a}}\widehat{h^{(a-1)}_{j}} \qquad (3)$$

where $r_a$ is the relation type between level $a$ and level $a-1$ on $s$ that connects $n_i$ with its sub-level neighbors, and $N^{r}_{i}$ denotes the set of neighbor indices of node $n_i$ under the relation $r\in\mathcal{R}$. $W_r$ is a trainable weight matrix specific to relation type $r$, and $c^{r}_{ij}$ is the weight or normalization constant of the edge $(n_i,n_j)$ with type $r$.

After $m$ levels of propagation on the hierarchical tree structured neighborhood, the hidden output of node $n_o$ on schema $s$ can be obtained by:

$$\mathbf{z}_{o}=GRU(\mathbf{x}_{o},h_{o}^{(m-1)}) \qquad (4)$$

where $\mathbf{z}_{o}\in\mathbb{R}^{d'}$ ($d'$: representation dimension). To make model tuning easy, we also set the dimension of the hidden states to $d'$. In the neighbor information aggregation, the parameters in the GRU modules are shared, and those in the GNN modules are shared for the same relation types. In practice, to save calculation time, we use tensor operations and avoid repeated aggregation on different trees with the same schemas; for example, for the schema $\overrightarrow{TP}$ of paper nodes and $\overrightarrow{TPA}$ of author nodes, the aggregation processes from $T$ to $P$ are the same for the two schemas, and we only calculate them once.
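The recurrence of Eqs. (1)-(4) can be sketched in PyTorch as below, assuming for simplicity that all node features already share one dimension and that the edge weights $c^{r}_{ij}$ reduce to an unweighted mean; the data-structure names (levels, level_rels, children, feats) are illustrative and this is not the authors' released code.

```python
import torch
import torch.nn as nn

class TreeAggregator(nn.Module):
    """Hierarchical aggregation along one tree schema (a sketch of Eqs. 1-4)."""
    def __init__(self, relations, dim):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)                    # shared GRU module (Eq. 2)
        self.W = nn.ModuleDict({r: nn.Linear(dim, dim, bias=False)
                                for r in relations})       # relation-specific W_r (Eq. 3)

    def forward(self, levels, level_rels, children, feats):
        # levels[a]      : node ids on the a-th level of the tree (0 = bottom)
        # level_rels[a]  : relation r_a linking level a to level a-1
        # children[(n,r)]: n's sub-level neighbors under relation r
        # feats[n]       : feature vector x_n of shape (dim,)
        h_hat = {n: feats[n] for n in levels[0]}           # Eq. 1: bottom level
        for a in range(1, len(levels)):
            r = level_rels[a]
            for n in levels[a]:
                msgs = torch.stack([self.W[r](h_hat[j]) for j in children[(n, r)]])
                h = msgs.mean(dim=0)                       # Eq. 3: aggregate sub-level
                h_hat[n] = self.gru(feats[n].unsqueeze(0),
                                    h.unsqueeze(0)).squeeze(0)   # Eq. 2: GRU update
        return h_hat[levels[-1][0]]                        # z_o of the root (Eq. 4)
```

For the schema $\overrightarrow{TPV}$, for instance, `levels` would hold the term, paper and venue nodes of one extracted tree and `level_rels` the relations linking consecutive levels.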

We have noticed that gated neural units have been applied to some graph neural network models [17, 18, 19], the majority of which originate from GGNN [20]. Their main idea is stacking $k$-layer GNNs and using a GRU module to process the $k$ hidden layer outputs of each node as sequential information. Our proposed model is different from this idea: 1) we use the GRU module to help the GNN directly aggregate the sequential information composed of neighbor nodes with arbitrary length; 2) instead of stacking a fixed number of layers, our proposed T-GNN can stack different numbers of propagation layers for different node types according to their neighborhood depths.

III-B Neighborhood Information Integrating

Given the hierarchical tree schema set $\mathcal{S}_t=\{s_1,s_2,...,s_k\}$ of $n_o$, we can obtain the hidden output set $\mathbf{Z}_o=\{\mathbf{z}^{1}_{o},\mathbf{z}^{2}_{o},...,\mathbf{z}^{k}_{o}\}$ of $n_o$, where $\mathbf{z}^{i}_{o}\in\mathbb{R}^{d'}$. These hidden representations contain different neighborhood information and may make different contributions to $n_o$'s final representation. We employ the attention mechanism to combine these $k$ hidden representations with $n_o$'s feature vector into the final representation vector of $n_o$, formulated as:

$$\mathbf{u}_{o}=ReLU\left(\alpha^{0}_{o}\cdot W_{t}\mathbf{x}_{o}+\sum_{\mathbf{z}^{i}_{o}\in\mathbf{Z}_{o}}\alpha^{i}_{o}\cdot\mathbf{z}^{i}_{o}\right) \qquad (5)$$

where $\mathbf{u}_{o}\in\mathbb{R}^{d'}$ is the final representation vector of $n_o$, $\alpha^{*}_{o}$ indicates the importance of the different hidden representations for $n_o$, and $W_t\in\mathbb{R}^{d'\times d}$ is a type-specific linear transformation that projects the feature vector of a node with type $t$ into a hidden representation vector, which represents the feature information aggregated from the node itself. For brevity, we denote $\mathbf{z}^{0}_{o}=W_{t}\mathbf{x}_{o}$ and add it into the representation set $\mathbf{Z}_o$; then $\mathbf{u}_o$ can be reformulated as:

$$\mathbf{u}_{o}=ReLU\left(\sum_{\mathbf{z}^{i}_{o}\in\mathbf{Z}_{o}}\alpha^{i}_{o}\cdot\mathbf{z}^{i}_{o}\right) \qquad (6)$$

We leverage self-attention to learn the attention weights $\alpha^{i}_{o}$, which are calculated via a softmax function:

$$\alpha^{i}_{o}=\frac{\exp\{LeakyReLU(a^{T}[\mathbf{z}^{0}_{o}\|\mathbf{z}^{i}_{o}])\}}{\sum_{\mathbf{z}^{j}_{o}\in\mathbf{Z}_{o}}\exp\{LeakyReLU(a^{T}[\mathbf{z}^{0}_{o}\|\mathbf{z}^{j}_{o}])\}} \qquad (7)$$

which can be interpreted as the importance of the hidden representation $\mathbf{z}^{i}_{o}$: the higher the $\alpha^{i}_{o}$, the more important $\mathbf{z}^{i}_{o}$. $LeakyReLU$ denotes the leaky version of a Rectified Linear Unit, and $a\in\mathbb{R}^{1\times 2d'}$ is the attention parameter.
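The integration of Eqs. (5)-(7) can be sketched as the following module, assuming the schema-specific outputs have already been computed (e.g., by the aggregation sketch above); the class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SchemaAttention(nn.Module):
    """Fuse z_o^1..z_o^k with the node's projected features z_o^0 (Eqs. 5-7 sketch)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W_t = nn.Linear(in_dim, out_dim, bias=False)    # type-specific W_t
        self.a = nn.Linear(2 * out_dim, 1, bias=False)       # attention vector a

    def forward(self, x_o, z_list):
        z0 = self.W_t(x_o)                                   # z_o^0 = W_t x_o
        Z = torch.stack([z0] + z_list)                       # (k+1, out_dim)
        scores = F.leaky_relu(
            self.a(torch.cat([z0.expand_as(Z), Z], dim=-1))).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)                 # Eq. 7
        return F.relu((alpha.unsqueeze(-1) * Z).sum(dim=0))  # Eq. 6: u_o
```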

Then we can apply the final node representation vector $\mathbf{u}_o$ to the loss functions of supervised, semi-supervised, or unsupervised tasks, and optimize the propagation model via back propagation.

III-C Optimization via Relational Metric Learning

In this section, to perform heterogeneous graph representation learning, we leverage a graph context loss to optimize the model. Given a heterogeneous graph $G=(N,E,\mathcal{T},\mathcal{R})$, we aim to learn effective node representations by maximizing the probability of any node $n_i$ having its neighbor node $n_c$:

$$\arg\max_{\theta}\sum_{n_{i}\in N}\sum_{r\in\mathcal{R}}\sum_{n_{c}\in N_{r}(n_{i})}\log p(n_{c}|n_{i},\theta) \qquad (8)$$

where $N_r(n_i)$ is the set of neighbors of $n_i$ that connect to $n_i$ with relation $r$, and $\theta$ represents the parameters of the whole model. $p(n_c|n_i,\theta)$ is the probability of node $n_i$ having its neighbor node $n_c$, defined as a softmax function:

$$p(n_{c}|n_{i},\theta)=\frac{\exp(s(\mathbf{u}_{i},\mathbf{u}_{c}))}{\sum_{n_{j}\in N_{t}}\exp(s(\mathbf{u}_{i},\mathbf{u}_{j}))} \qquad (9)$$

where $\mathbf{u}_i$ represents the encoded representation vector of $n_i$, $s(\mathbf{u}_i,\mathbf{u}_j)$ is the similarity between nodes $n_i$ and $n_j$, and $N_t$ represents the node set of type $t$ with $t=\phi(n_c)$. The softmax function is normalized with respect to the node type of the context $n_c$, so each type of neighborhood in the output layer is specified by one distinct multinomial distribution.

As we aim to embed each type $t\in\mathcal{T}$ of nodes into a distinct space, we set $s(\mathbf{u}_i,\mathbf{u}_j)=\mathbf{u}_i^T\mathbf{u}_j$ if $n_i$ and $n_j$ have the same type and lie in the same embedding space. For each node pair $(\mathbf{u}_i,\mathbf{u}_j)$ with distinct node types $t_a$ and $t_b$ connected by a relation of type $r$ (which can be an atomic relation or a composite relation), we formulate the similarity $s(\mathbf{u}_i,\mathbf{u}_j)$ by introducing two candidate similarity metrics specific to $r$, described as follows:

  • Bilinear. A multiplicative similarity metric. We introduce a bilinear metric $M_r\in\mathbb{R}^{d'\times d'}$, which is shared by each node pair $(\mathbf{u}_i,\mathbf{u}_j)$ with relation type $r$ and is specific to different relation types, so that similarity calculations on respective semantics do not dampen each other.

    $$s(\mathbf{u}_{i},\mathbf{u}_{j})=\mathbf{u}_{i}^{T}M_{r}\mathbf{u}_{j} \qquad (10)$$

  • Perceptron. An additive similarity metric. We first input $\mathbf{u}_i$ and $\mathbf{u}_j$ to node-type-specific multilayer perceptrons with output dimension $d_m$, then add their outputs, and with a tanh activation layer we obtain the hidden representation vector of the node pair $(\mathbf{u}_i,\mathbf{u}_j)$. Then a relation-type-specific metric $m_r\in\mathbb{R}^{d_m}$ is multiplied to calculate the similarity.

    $$s(\mathbf{u}_{i},\mathbf{u}_{j})=m_{r}^{T}\tanh(M_{t_{a}}\mathbf{u}_{i}+M_{t_{b}}\mathbf{u}_{j}) \qquad (11)$$

The metrics are trainable parameters and are learned to measure which $\mathbf{u}_j$ under the relation $r$ is closer to $\mathbf{u}_i$, a mechanism similar to that of attention models. Note that the parameters of the metrics are shared for the same relation type $r$, while different relation types make use of different metrics, so the corresponding distances between nodes under these relations are calculated in different metric spaces.
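A minimal sketch of the two candidate metrics is given below; the class and parameter names are illustrative, and the Perceptron variant uses a single linear layer per node type in place of a full multilayer perceptron.

```python
import torch
import torch.nn as nn

class BilinearMetric(nn.Module):
    """Eq. 10: s(u_i, u_j) = u_i^T M_r u_j, with one metric M_r per relation type."""
    def __init__(self, relations, dim):
        super().__init__()
        self.M = nn.ParameterDict({r: nn.Parameter(torch.eye(dim)) for r in relations})

    def forward(self, u_i, u_j, r):
        return u_i @ self.M[r] @ u_j

class PerceptronMetric(nn.Module):
    """Eq. 11: s(u_i, u_j) = m_r^T tanh(M_ta u_i + M_tb u_j), with node-type-specific
    projections and a relation-specific vector m_r."""
    def __init__(self, node_types, relations, dim, metric_dim=32):
        super().__init__()
        self.proj = nn.ModuleDict({t: nn.Linear(dim, metric_dim, bias=False)
                                   for t in node_types})
        self.m = nn.ParameterDict({r: nn.Parameter(torch.randn(metric_dim))
                                   for r in relations})

    def forward(self, u_i, u_j, t_a, t_b, r):
        return self.m[r] @ torch.tanh(self.proj[t_a](u_i) + self.proj[t_b](u_j))
```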

We then adopt the popular negative sampling method proposed in [21] to sample negative nodes and increase the optimization efficiency. The probability $\log p(n_c|n_i,\theta)$ can then be approximately defined as:

$$\log\sigma(s(\mathbf{u}_{i},\mathbf{u}_{c}))+\sum\nolimits_{n_{j}\in N_{neg}^{i}}\log\sigma(-s(\mathbf{u}_{i},\mathbf{u}_{j})) \qquad (12)$$

where $\sigma(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function and $N_{neg}^{i}\subseteq N_t$ is a negative node set for $n_i$ sampled from a pre-computed node frequency distribution $P_t(n_i)$; we set $P(n_i)\propto f_{n_i}^{3/4}$, where $f_{n_i}$ is the frequency of $n_i$ in the set of paths $\mathcal{P}$, which is obtained by conducting random walks on $G$. Therefore, we can use the skip-gram model on these paths and reformulate the objective in Eq. 8 as follows:

$$\mathcal{L}=-\sum_{p\in\mathcal{P}}\sum_{n_{i}\in p}\sum_{n_{c}\in C_{i}^{k}}\log p(n_{c}|n_{i},\theta)+\lambda\|\theta\|^{2} \qquad (13)$$

where $k$ is the context window size and $C_i^k$ is the context set containing the previous $k$ context nodes and the next $k$ context nodes of $n_i$ in the path $p$. The parameter $\lambda$ controls the regularization penalty against over-fitting. Finally, we use a mini-batch Adam optimizer to minimize $\mathcal{L}$ and optimize the parameters of the whole model.
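For one (center, context) pair drawn from a random-walk window, the negative-sampling term of Eq. (12) can be sketched as below; `sim` stands for any of the similarity functions above with the relation type already fixed, and the snippet is illustrative rather than the released training loop.

```python
import torch
import torch.nn.functional as F

def pair_loss(u_i, u_c, u_negs, sim):
    """Negative-sampling approximation of log p(n_c | n_i) for one context pair."""
    pos = F.logsigmoid(sim(u_i, u_c))
    neg = torch.stack([F.logsigmoid(-sim(u_i, u_j)) for u_j in u_negs]).sum()
    return -(pos + neg)   # summed over all path windows and L2-regularized (Eq. 13)

# Negative node ids can be drawn from the 3/4-smoothed frequency distribution, e.g.
# probs = freqs ** 0.75; probs /= probs.sum(); numpy.random.choice(ids, 3, p=probs)
```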

IV EXPERIMENTS

In this section, we conduct several experiments to validate the performance of our proposed T-GNN model on three datasets. The results demonstrate the superiority of our proposed method over several state-of-the-art methods. All code of our method is publicly available at https://github.com/joe817/T-GNN.

IV-A Datasets

To evaluate the proposed method, we use three kinds of real-world datasets: an academic graph (DBLP), a movie graph (IMDB), and a review graph (YELP). The detailed descriptions of these datasets are as follows:

  • DBLP (https://www.aminer.cn/citation): DBLP is an academic bibliography with millions of publications. We extract a subset of DBLP covering four areas: Data Mining (DM), Database (DB), Natural Language Processing (NLP) and Computer Vision (CV) from the year 2013 to 2017. For each area, we choose three top venues (DM: KDD, ICDM, WSDM; DB: SIGMOD, VLDB, ICDE; NLP: ACL, EMNLP, NAACL; CV: CVPR, ICCV, ECCV) and the related papers (P), terms (T) and authors (A) to construct a heterogeneous graph.

  • IMDB (https://www.kaggle.com/tmdb/tmdb-movie-metadata): The IMDB dataset contains knowledge about movies. We extract three genres of movie information from IMDB: Action, Comedy and Drama. Then we construct a movie heterogeneous graph which contains movies (M), actors (A), directors (D) and tags (T).

  • YELP (https://www.kaggle.com/yelp-dataset/yelp-dataset): YELP is a dataset containing restaurant reviews. We extract a subset of YELP containing customers (C), reviews (V), keywords (K) and restaurants (R) from the information of restaurants with the types American, Chinese, Japanese and French to form a restaurant review heterogeneous graph.

We select some types of nodes and use graph representation learning methods to learn their representation vectors. Table I shows the statistics of these nodes and their corresponding hierarchical tree schemas used in our proposed model.

TABLE I: Statistics of three datasets
Dataset | Node (# Number) | Hierarchical Tree Schema
DBLP | Paper (#20,552) | $\overrightarrow{AP}$, $\overrightarrow{TP}$
DBLP | Author (#19,247) | $\overrightarrow{TPA}$
DBLP | Venue (#12) | $\overrightarrow{TPV}$, $\overrightarrow{APV}$
IMDB | Movie (#3,630) | $\overrightarrow{AM}$, $\overrightarrow{TM}$
IMDB | Actor (#13,156) | $\overrightarrow{TMA}$
IMDB | Director (#1,932) | $\overrightarrow{AMD}$, $\overrightarrow{TMD}$
YELP | Customer (#2,536) | $\overrightarrow{KVC}$
YELP | Restaurant (#3,332) | $\overrightarrow{KVR}$

IV-B Comparison Methods

To validate the performance of our proposed method, we compare it with several state-of-the-art baselines, including a homogeneous graph representation learning method (DeepWalk), heterogeneous graph representation learning methods (Metapath2vec, RHINE), homogeneous graph neural networks (GraphSAGE, GAT) and heterogeneous graph neural networks (HAN, HetGNN). Brief descriptions of these baselines are as follows:

  • DeepWalk: DeepWalk [22] is a graph representation learning method based on random walks and the skip-gram model to learn latent node representations.

  • Metapath2vec: Metapath2vec [23] uses meta-path based random walks to construct the heterogeneous neighborhood of nodes and then leverages a heterogeneous skip-gram model to learn node representations. We use the meta-paths APVPA, AMDMA and CRC in DBLP, IMDB, and YELP, respectively.

  • RHINE: RHINE [24] explores the different structural characteristics of relations in HINs and presents two structure-related measures for two distinct categories of relations: IRs and ARs. We select the same relations as in their paper on DBLP and YELP; for IMDB, we select AM and AMD as IRs, and TM, TMD and TMA as ARs.

  • GraphSAGE: GraphSAGE [15] learns node representations through different aggregation functions applied to the local neighborhood of nodes. We use the GCN module as the aggregator of GraphSAGE.

  • GAT: GAT [16] uses the attention mechanism and measures the impacts of different neighbors' feature information by a multi-head self-attention neural network.

  • HAN: HAN [9] leverages node-level attention and semantic-level attention to respectively learn the importance of neighbors based on the same meta-path and based on different meta-paths. We select the meta-paths CRC and RCR for YELP, and the same meta-paths as in their paper for DBLP and IMDB.

  • HetGNN: HetGNN [11] jointly considers heterogeneous neighbor sampling, heterogeneous node content encoding, type-based neighbor aggregation, and heterogeneous type combination.

We use Par2vec [25] to pre-train the text content of nodes and set the dimension of the initial feature vectors to 128. For the proposed T-GNN, we set all the hidden layer dimensions and the final representation dimension of nodes to 128. We use the Adam optimizer with a learning rate of $10^{-3}$, and the $\ell_2$ penalty parameter $\lambda$ is $10^{-4}$. We set the window size $k$ of the skip-gram model to 2 and the number of negative samples to 3 for all datasets; the similarity metric is chosen to be Perceptron and the metric space dimension $d_m$ is 32. As GAT and HAN only mention semi-supervised objective functions for node representations, for a fair comparison, we optimize their models in the same way as our method to learn unsupervised node representations. For the baselines, we also set the node representation dimension on the three datasets to 128, and the other hyper-parameter settings are based on the default values or the values specified in their own papers. All the experiments are repeated multiple times to make sure the results reflect the performance of the methods.

TABLE II: Results of Clustering
Method | DBLP (NMI / ARI) | IMDB (NMI / ARI) | YELP (NMI / ARI)
DeepWalk | 0.735 / 0.616 | 0.023 / 0.015 | 0.260 / 0.279
Metapath2vec | 0.864 / 0.899 | 0.096 / 0.091 | 0.261 / 0.282
RHINE | 0.866 / 0.902 | 0.055 / 0.036 | 0.342 / 0.351
GraphSAGE | 0.865 / 0.912 | 0.128 / 0.135 | 0.396 / 0.433
GAT | 0.855 / 0.897 | 0.114 / 0.113 | 0.413 / 0.457
HAN | 0.900 / 0.933 | 0.136 / 0.144 | 0.413 / 0.458
HetGNN | 0.891 / 0.939 | 0.131 / 0.139 | 0.403 / 0.440
T-GNN | 0.916 / 0.955 | 0.145 / 0.152 | 0.420 / 0.484
TABLE III: Results of Multi-class Classification
Method | DBLP (Micro-F1 / Macro-F1) | IMDB (Micro-F1 / Macro-F1) | YELP (Micro-F1 / Macro-F1)
DeepWalk | 0.905 / 0.896 | 0.478 / 0.473 | 0.703 / 0.660
Metapath2vec | 0.922 / 0.918 | 0.505 / 0.509 | 0.705 / 0.660
RHINE | 0.927 / 0.920 | 0.449 / 0.448 | 0.726 / 0.676
GraphSAGE | 0.962 / 0.943 | 0.583 / 0.584 | 0.739 / 0.748
GAT | 0.967 / 0.958 | 0.543 / 0.542 | 0.744 / 0.758
HAN | 0.979 / 0.971 | 0.596 / 0.598 | 0.746 / 0.759
HetGNN | 0.983 / 0.979 | 0.594 / 0.593 | 0.730 / 0.753
T-GNN | 0.997 / 0.996 | 0.608 / 0.609 | 0.760 / 0.772

IV-C Clustering and Classification

First, we conduct clustering and classification tasks to evaluate the performance of the methods. For DBLP, we label papers by the areas of their venues and label authors by their representative areas (the area of the majority of their papers). For IMDB, we label the movies by genre and label actors by the genre of the majority of their acted movies. For YELP, we label the restaurants by type. Then we evaluate these nodes' representations by clustering and multi-class classification. We learn the node representations on the heterogeneous graphs with each model. For the clustering task, we feed the node representations into the K-Means algorithm, with the number of clusters set to the true number of classes for each dataset, and evaluate the clustering performance in terms of normalized mutual information (NMI) and adjusted rand index (ARI). For the multi-class classification task, the learned node representations are used as the input to a logistic regression classifier; the proportions of training and validation data are set to 50% and 10%, and the remaining nodes are used for testing. We use both Micro-F1 and Macro-F1 as classification evaluation metrics.
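For reference, this evaluation protocol can be sketched with scikit-learn as below, assuming `emb` is the matrix of learned node representations and `labels` the ground-truth classes; the validation split is omitted for brevity.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (normalized_mutual_info_score, adjusted_rand_score,
                             f1_score)
from sklearn.model_selection import train_test_split

def evaluate(emb, labels, n_classes):
    # clustering: K-Means with the true number of classes, scored by NMI and ARI
    pred = KMeans(n_clusters=n_classes, n_init=10).fit_predict(emb)
    nmi = normalized_mutual_info_score(labels, pred)
    ari = adjusted_rand_score(labels, pred)
    # classification: logistic regression trained on 50% of the nodes
    X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, train_size=0.5,
                                              random_state=0)
    y_pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    return (nmi, ari,
            f1_score(y_te, y_pred, average="micro"),
            f1_score(y_te, y_pred, average="macro"))
```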

Results. Table II shows the clustering performance on the three datasets and Table III shows the classification results. The proposed T-GNN model consistently outperforms all baselines on the two tasks, showing that it can learn more effective node representations than the other methods. The relative improvements of T-GNN over the best baselines range from 1.7% to 10.7% for the clustering task and from 1.4% to 2.2% for the classification task. We can also observe that: 1) GNN-based graph representation learning models achieve better performance than the traditional graph representation learning methods on these two tasks, because besides the graph structure information, GNNs can also encode node attribute information into node representations. 2) Heterogeneous models perform better than homogeneous models because they can differentiate different types of relations and capture more semantic information in node representations. 3) T-GNN performs better than the two heterogeneous graph neural network models HAN and HetGNN, showing that it can better capture the information of the heterogeneous neighborhood.

Figure 5: Results of inductive clustering and inductive multi-class classification: (a) inductive clustering (10%), (b) inductive clustering (20%), (c) inductive classification (10%), (d) inductive classification (20%). The percentages represent the invisible ratios of nodes.

IV-D Inductive Clustering and Classification

In this task, we evaluate the inductive capability of our proposed T-GNN on the DBLP dataset and compare it with the graph neural network based baselines GraphSAGE, GAT, HAN and HetGNN. Specifically, for the author and paper nodes to be evaluated, we randomly set a part of them as invisible nodes for the graph representation learning methods, and use the remaining visible nodes and the links between them to train the models. Then we use the learned models on the whole graph to infer the representations of all invisible nodes. Finally, we use the inferred representations as the input of the clustering tasks. For the classification tasks, the visible nodes are used to train the classifier, and the invisible nodes are used for evaluation. The labels and evaluation metrics are the same as in the previous clustering and classification tasks. We report the results with invisible node ratios of 10% and 20%, respectively.

Results. Figure 5 shows the performance of the GNN models on the inductive learning task. Our proposed T-GNN model achieves the best performance among all the homogeneous and heterogeneous models. Specifically, among the heterogeneous models, our proposed T-GNN outperforms HAN and HetGNN, because the representations of invisible nodes are mainly inferred from their neighborhood information, but HAN uses meta-paths to sample neighbors, ignoring the intermediate nodes, and HetGNN uses random walks to sample neighbors; thus both are unable to process tree structured information and path correlations in multi-hop neighborhoods.

IV-E Link Prediction

In this section, we evaluate the performance of the graph representation learning methods on the link prediction task, which is a practical problem in many applications such as user/item recommendation. We formulate this task as a binary classification problem to predict whether a relation link between two nodes exists. We consider two types of links: links of two authors collaborating on one paper in the DBLP dataset, and links of a user giving a review to a restaurant in the YELP dataset. Specifically, for DBLP, we sample the collaborating links before 2013 for training, in 2013 for validation, and after 2013 for testing. For YELP, we randomly sample 60% of the reviewing links for training, 10% for validation, and the rest for testing. In training, we remove the links in the validation and testing sets and use the graph representation learning methods on the graphs to learn the node representations. We use the element-wise multiplication of two candidate nodes' representations as the link representation, then input the link representations into a binary logistic classifier. Negative links (not connected in the graphs) with three times the number of true links are randomly sampled and added to the training, validation and testing sets. We use AUC and F1 as evaluation metrics.
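A sketch of this protocol is shown below, assuming the positive and negative node pairs have already been split into training and testing sets as described; the function and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

def link_features(emb, pairs):
    # element-wise product of the two endpoints' representations as the link feature
    return np.stack([emb[a] * emb[b] for a, b in pairs])

def evaluate_links(emb, train_pairs, train_y, test_pairs, test_y):
    clf = LogisticRegression(max_iter=1000).fit(link_features(emb, train_pairs),
                                                train_y)
    prob = clf.predict_proba(link_features(emb, test_pairs))[:, 1]
    return roc_auc_score(test_y, prob), f1_score(test_y, prob > 0.5)
```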

Results. Table IV shows the prediction results on the two datasets. We can find that our proposed T-GNN model achieves the best performance or is comparable to the best methods on the link prediction task. We believe the reason is that our proposed T-GNN model has an outstanding ability to preserve heterogeneity in node representations, which helps to overcome the mutual interference of different relations on the node distributions and improves the quality of node representations.

TABLE IV: Results of Link Prediction
Method | DBLP, Collaborating (AUC / F1) | YELP, Reviewing (AUC / F1)
DeepWalk | 0.691 / 0.539 | 0.519 / 0.392
Metapath2vec | 0.723 / 0.622 | 0.518 / 0.390
RHINE | 0.729 / 0.630 | 0.552 / 0.410
GraphSAGE | 0.713 / 0.584 | 0.665 / 0.536
GAT | 0.719 / 0.603 | 0.652 / 0.550
HAN | 0.743 / 0.635 | 0.655 / 0.551
HetGNN | 0.760 / 0.660 | 0.646 / 0.536
T-GNN | 0.824 / 0.764 | 0.663 / 0.571

IV-F Comparison of Variant Models

In order to further verify the effectiveness of the metrics in learning node representations, we design three variant heterogeneous graph representation learning models based on T-GNN and different output models: T-GNN$_{dp}$, T-GNN$_{pe}$ and T-GNN$_{bi}$. T-GNN$_{dp}$ does not use metrics and leverages the dot product as the similarity $s(n_i,n_j)$ of any two nodes $n_i$ and $n_j$; T-GNN$_{pe}$ uses the Perceptron as the similarity metric for nodes of different types and T-GNN$_{bi}$ uses the Bilinear metric, both of which are introduced in Section III-C. We evaluate these three models on the node clustering and classification tasks. The experiment settings are the same as mentioned above. The results are shown in Figure 6. It is evident that using the two kinds of metrics achieves better performance. Besides, the performance of T-GNN$_{pe}$ is better than that of T-GNN$_{bi}$. We believe the reason is that there are more parameters in the Perceptron metric than in the Bilinear metric, so the Perceptron metric is better at preserving the complicated proximities between multiple nodes.

Figure 6: Performance evaluation of variant models: (a) clustering, (b) classification.
Figure 7: t-SNE visualization of the representation distributions of authors and papers learned by three different graph representation learning methods: Metapath2vec, GraphSAGE and T-GNN. (a), (b), (c) denote the author representations learned by these three methods and (d), (e), (f) denote the paper representations. The four research areas to which the authors and papers belong are colored differently.

IV-G Visualization

We have discussed that our model can embed each type of nodes into a type-specific space with a distinct distribution. In this section, we compare our model with Metapath2vec and GraphSAGE by visualizing node representations of different types. Specifically, we first learn the node representations of papers and authors in the year 2015 from four venues, CVPR, ACL, KDD and SIGMOD, in the DBLP data, representing four research areas. Then we use t-SNE to project the author representations and the paper representations into 2-dimensional spaces, respectively.
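The projection step can be sketched as below, assuming `emb` is the matrix of learned 128-dimensional representations and `areas` the research-area label of each node; the names are illustrative.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

def plot_embeddings(emb, areas, title):
    # project the representations to 2-D and color each node by its research area
    xy = TSNE(n_components=2, random_state=0).fit_transform(emb)
    areas = np.asarray(areas)
    for area in np.unique(areas):
        pts = xy[areas == area]
        plt.scatter(pts[:, 0], pts[:, 1], s=5, label=str(area))
    plt.title(title)
    plt.legend()
    plt.show()
```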

The results are shown in Figure 7. We can observe that while Metapath2vec clusters papers and authors into small groups, some nodes belonging to the same research area are far away from each other, and some belonging to different areas are mixed with each other. GraphSAGE can roughly partition nodes of different research areas, but the boundaries between different clusters are not sharp, and many nodes overlap at the same points. In the visualization of T-GNN, we can observe that 1) the representations of paper nodes in the same area cluster closely and can be well distinguished from each other; author nodes are also embedded well compared with the other two methods, although some of them lie on the boundaries between different clusters, because some authors have multiple research areas. 2) Authors and papers have independent distributions in their own spaces, showing high intra-class similarities and distinct boundaries between different research areas.

V RELATED WORK

Graph Representation Learning. Graph representation learning, also known as network embedding, has become one of the most popular research interests in data mining recently. Network embedding methods aim to embed a network into a low-dimensional space that preserves the network properties and node proximities, and the learned node embeddings can be utilized in many graph mining tasks. Among homogeneous network embedding methods, DeepWalk [22] and Node2Vec [26] use a random walk strategy on the network and the skip-gram [27, 28] model to learn the representation of each node. LINE [29] aims to learn node embeddings that preserve both first-order and second-order proximities. Some methods use deep neural networks for homogeneous network embedding, such as DNGR [30] and SDNE [31]. Different from homogeneous graphs, heterogeneous information networks (HINs) contain multiple types of nodes and relations, and many real-world graphs can be modeled as HINs. Metapath2vec [23] designs a meta-path based random walk and utilizes skip-gram to perform heterogeneous graph embedding, and HINE [32] defines an objective function modeling meta-path based proximities in the embedded low-dimensional space. HIN2Vec [33] learns the embeddings of HINs by jointly conducting multiple prediction training tasks, while RHINE [24] explores the relation characteristics in HINs, partitions them into two categories, and uses different models to optimize the node representations. Besides, some methods introduce metric learning into heterogeneous network embedding: HEER [34] embeds HINs via edge representations coupled with properly learned heterogeneous metrics, and PME [35] models node and link heterogeneities in elaborately designed relation-specific spaces.

Graph Neural Network. Graph neural networks extend deep neural networks to encode node features and local neighborhoods into low-dimensional representation vectors [1]. Kipf et al. propose GCN [36], which designs a graph convolutional network via a localized first-order approximation of spectral graph convolutions. GraphSAGE [15] proposes a general aggregation strategy containing multiple aggregators, such as the mean aggregator, LSTM aggregator and pooling aggregator. For heterogeneous graphs, R-GCN [12] introduces a relational graph convolutional network for the link prediction and entity classification tasks. Zhang et al. propose a heterogeneous graph neural network model, HetGNN [11], which considers both the types and the heterogeneous attributes of nodes. RSHN [13] utilizes graph structure and implicit relation structural information to simultaneously learn node and edge type embeddings. Some works introduce attention mechanisms into network architectures: GAT [16] employs a self-attention mechanism to measure the impacts of different neighbors and combines these impacts to obtain node representations, and HAN [9] employs node-level and semantic-level attention on heterogeneous networks to learn the importance of neighbors based on meta-paths. Some works also introduce gated recurrent units: GGNN [20] aims at learning representations of the internal state during the process of producing a sequence of outputs. Inspired by this idea, [17] proposes a novel encoder-decoder architecture for graph-to-sequence learning. [37] proposes a general conceptual message passing neural network framework that subsumes most GNNs and evaluates it on a molecular property prediction task. [38] proposes a differentiable graph pooling module to learn hierarchical representations of graphs for the task of graph classification.

VI CONCLUDING REMARKS

In this paper, we study the problem of tree structure-aware graph representation learning. To effectively aggregate the multi-hop neighborhoods, which usually present the structure of hierarchical trees in most heterogeneous graphs, we propose a tree structure-aware graph neural network model, which contains aggregation modules and sequence propagation modules to transfer the information of neighbors within multi-hop links to the root nodes. To overcome the problem of mutual interference between the distributions of different types of nodes, we propose a relational metric learning module to optimize the model, which utilizes metrics to calculate node similarities in relation-specific metric spaces and embeds nodes into type-specific spaces. The experiments show that the proposed heterogeneous graph representation learning method can better extract the rich information of neighborhoods and improve the quality of node representations.

References

  • [1] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” arXiv preprint arXiv:1812.08434, 2018.
  • [2] S. Fan, J. Zhu, X. Han, C. Shi, L. Hu, B. Ma, and Y. Li, “Metapath-guided heterogeneous graph neural network for intent recommendation,” in Proceedings of the 25th ACM SIGKDD, 2019, pp. 2478–2486.
  • [3] L. Wu, Y. Yang, K. Zhang, R. Hong, Y. Fu, and M. Wang, “Joint item recommendation and attribute inference: An adaptive graph convolutional network approach,” arXiv preprint arXiv:2005.12021, 2020.
  • [4] P. Wang, Y. Fu, H. Xiong, and X. Li, “Adversarial substructured representation learning for mobile user profiling,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 130–138.
  • [5] P. Wang, Y. Fu, Y. Zhou, K. Liu, X. Li, and K. Hua, “Exploiting mutual information for substructure-aware graph representation learning,” in Proceedings of the 29th International Joint Conference on Artificial Intelligence. AAAI Press, 2020.
  • [6] C. W. Coley, W. Jin, L. Rogers, T. F. Jamison, T. S. Jaakkola, W. H. Green, R. Barzilay, and K. F. Jensen, “A graph-convolutional neural network model for the prediction of chemical reactivity,” Chemical science, vol. 10, no. 2, pp. 370–377, 2019.
  • [7] A. Sankar, X. Zhang, and K. C.-C. Chang, “Meta-gnn: metagraph neural network for semi-supervised learning in attributed heterogeneous information networks,” in Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019, pp. 137–144.
  • [8] X. Fu, J. Zhang, Z. Meng, and I. King, “Magnn: Metapath aggregated graph neural network for heterogeneous graph embedding,” in Proceedings of The Web Conference 2020, 2020, pp. 2331–2341.
  • [9] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu, “Heterogeneous graph attention network,” in The World Wide Web Conference, 2019, pp. 2022–2032.
  • [10] Z. Qiao, Y. Du, Y. Fu, P. Wang, and Y. Zhou, “Unsupervised author disambiguation using heterogeneous graph convolutional network embedding,” in 2019 IEEE International Conference on Big Data (Big Data).   IEEE, 2019, pp. 910–919.
  • [11] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, “Heterogeneous graph neural network,” in Proceedings of the 25th ACM SIGKDD, 2019, pp. 793–803.
  • [12] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European Semantic Web Conference.   Springer, 2018, pp. 593–607.
  • [13] S. Zhu, C. Zhou, S. Pan, X. Zhu, and B. Wang, “Relation structure-aware heterogeneous graph neural network,” in 2019 IEEE International Conference on Data Mining (ICDM).   IEEE, 2019, pp. 1534–1539.
  • [14] P. Wang, J. Gui, Z. Chen, J. Rhee, H. Chen, and Y. Fu, “A generic edge-empowered graph convolutional network via node-edge mutual enhancement,” in Proceedings of The Web Conference 2020, 2020, pp. 2144–2154.
  • [15] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
  • [16] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [17] D. Beck, G. Haffari, and T. Cohn, “Graph-to-sequence learning using gated graph neural networks,” arXiv preprint arXiv:1806.09835, 2018.
  • [18] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, “Structured sequence modeling with graph convolutional recurrent networks,” in International Conference on Neural Information Processing.   Springer, 2018, pp. 362–373.
  • [19] X. Bresson and T. Laurent, “Residual gated graph convnets,” arXiv preprint arXiv:1711.07553, 2017.
  • [20] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
  • [21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
  • [22] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in SIGKDD.   ACM, 2014, pp. 701–710.
  • [23] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in SIGKDD.   ACM, 2017, pp. 135–144.
  • [24] Y. Lu, C. Shi, L. Hu, and Z. Liu, “Relation structure-aware heterogeneous information network embedding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4456–4463.
  • [25] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in ICML, 2014, pp. 1188–1196.
  • [26] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in SIGKDD.   ACM, 2016, pp. 855–864.
  • [27] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [28] X. Rong, “word2vec parameter learning explained,” arXiv preprint arXiv:1411.2738, 2014.
  • [29] J. Tang, M. Qu, M. Wang, M. Zhang et al., “Line: Large-scale information network embedding,” in WWW.   International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.
  • [30] S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning graph representations,” in AAAI, 2016, pp. 1145–1152.
  • [31] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proceedings of the 22nd ACM SIGKDD, 2016, pp. 1225–1234.
  • [32] Z. Huang and N. Mamoulis, “Heterogeneous information network embedding for meta path based proximity,” arXiv preprint arXiv:1701.05291, 2017.
  • [33] T. Fu, W. Lee, and Z. Lei, “Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning,” in CIKM.   ACM, 2017, pp. 1797–1806.
  • [34] Y. Shi, Q. Zhu, F. Guo, C. Zhang, and J. Han, “Easing embedding learning by comprehensive transcription of heterogeneous information networks,” in SIGKDD, 2018, pp. 2190–2199.
  • [35] H. Chen, H. Yin, W. Wang, H. Wang, Q. V. H. Nguyen, and X. Li, “Pme: projected metric embedding on heterogeneous networks for link prediction,” in SIGKDD, 2018, pp. 1177–1186.
  • [36] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [37] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR. org, 2017, pp. 1263–1272.
  • [38] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” in Advances in neural information processing systems, 2018, pp. 4800–4810.