
Heterogeneous Network Representation Learning: A Unified Framework with
Survey and Benchmark

Carl Yang1, Yuxin Xiao1, Yu Zhang1, Yizhou Sun, and Jiawei Han
1The first three authors contributed equally. Carl Yang is with Emory University; Yuxin Xiao is with Carnegie Mellon University; Yu Zhang and Jiawei Han are with the University of Illinois at Urbana-Champaign; Yizhou Sun is with the University of California, Los Angeles. Manuscript received Jun 2020; revised Oct 2020; accepted Dec 2020.
Abstract

Since real-world objects and their interactions are often multi-modal and multi-typed, heterogeneous networks have been widely used as a more powerful, realistic, and generic superclass of traditional homogeneous networks (graphs). Meanwhile, representation learning (a.k.a. embedding) has recently been intensively studied and shown effective for various network mining and analytical tasks. In this work, we aim to provide a unified framework to deeply summarize and evaluate existing research on heterogeneous network embedding (HNE), which includes but goes beyond a normal survey. Since there is already a broad body of HNE algorithms, as the first contribution of this work, we provide a generic paradigm for the systematic categorization and analysis of the merits of various existing HNE algorithms. Moreover, existing HNE algorithms, though mostly claimed to be generic, are often evaluated on different datasets. While understandable due to the application-driven nature of HNE, such indirect comparisons largely hinder the proper attribution of improved task performance to effective data preprocessing versus novel technical design, especially considering the various possible ways to construct a heterogeneous network from real-world application data. Therefore, as the second contribution, we create four benchmark datasets from different sources with various properties regarding scale, structure, attribute/label availability, etc., towards handy and fair evaluations of HNE algorithms. As the third contribution, we carefully refactor and amend the implementations of 13 popular HNE algorithms, create friendly interfaces for them, and provide all-around comparisons among them over multiple tasks and experimental settings.

By putting all existing HNE algorithms under a unified framework, we aim to provide a universal reference and guideline for the understanding and development of HNE algorithms. Meanwhile, by open-sourcing all data and code, we envision serving the community with a ready-to-use benchmark platform to test and compare the performance of existing and future HNE algorithms (https://github.com/yangji9181/HNE).


Index Terms:
heterogeneous network, representation learning, survey, benchmark

1 Introduction

Networks and graphs constitute a canonical and ubiquitous paradigm for the modeling of interactive objects, which has drawn significant research attention from various scientific domains [1, 2, 3, 4, 5, 6]. However, real-world objects and interactions are often multi-modal and multi-typed (e.g., authors, papers, venues and terms in a publication network [7, 8]; users, places, categories and GPS-coordinates in a location-based social network [9, 10, 11]; and genes, proteins, diseases and species in a biomedical network [12, 13]). To capture and exploit such node and link heterogeneity, heterogeneous networks have been proposed and widely used in many real-world network mining scenarios, such as meta-path based similarity search [14, 15, 16], node classification and clustering [17, 18, 19], knowledge base completion [20, 21, 22], and recommendations [23, 24, 25].

In the meantime, current research on graphs has largely focused on representation learning (embedding), especially following the pioneering neural network based algorithms that demonstrate compelling empirical evidence of unprecedentedly effective yet efficient graph mining [26, 27, 28]. These methods aim to convert graph data (e.g., nodes [29, 30, 31, 32, 33, 34, 35, 36], links [37, 38, 39, 40], and subgraphs [41, 42, 43, 44]) into low-dimensional distributed vectors in an embedding space where the graph topological information (e.g., higher-order proximity [45, 46, 47, 48] and structure [49, 50, 51, 52]) is preserved. Such embedding vectors are then directly usable by various downstream machine learning algorithms [53, 54, 55].

Right at the intersection of heterogeneous networks and graph embedding, heterogeneous network embedding (HNE) has also recently received significant research attention [56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76]. Due to the application-driven nature of HNE, many algorithms have been separately developed in different application domains such as search and recommendations [23, 77, 78, 5]. Moreover, since knowledge bases (KBs) also fall under the general umbrella of heterogeneous networks, many KB embedding algorithms can be compared with HNE ones [79, 4, 80, 20, 81, 82, 21, 83, 84].

Unfortunately, various HNE algorithms have been developed in quite disparate communities across academia and industry. They have never been systematically and comprehensively analyzed, either conceptually or through experiments. In fact, due to the lack of benchmark platforms (with ready-to-use datasets and baselines), researchers often tend to construct their own datasets and re-implement a few of the most popular (sometimes outdated) algorithms for comparison, which renders fair performance evaluation and clear improvement attribution extremely hard, if not impossible.

Figure 1: Different heterogeneous networks constructed from the same real-world application data. (a) Plain heterogeneous net. (b) Labeled attributed net.

Simply consider the toy examples of a publication dataset (https://dblp.uni-trier.de/) in Figure 1. Earlier HNE algorithms like metapath2vec [59] were developed on the heterogeneous network with node types of authors, papers, and venues as in (a). However, one can enrich papers with a large number of terms and topics as additional nodes as in (b), which makes the random-walk based shallow embedding algorithms rather inefficient, but favors neighborhood aggregation based deep graph neural networks like R-GCN [67]. Moreover, one can further include node attributes like term embeddings and labels like research fields, and make them available only to the semi-supervised attributed embedding models, which may introduce even more bias [66, 75, 85, 86]. Eventually, it is often hard to clearly attribute performance gains between technical novelty and data tweaking.

In this work, we first formulate a unified yet flexible mathematical paradigm of HNE algorithms, easing the understanding of the critical merits of each model (Section 2). Particularly, based on a uniform taxonomy that clearly categorizes and summarizes the existing models (and likely future models), we propose a generic objective function of network smoothness, and reformulate all existing models into this uniform paradigm while highlighting their individual novel contributions (Section 3). We envision this paradigm to be helpful in guiding the development of future novel HNE algorithms, and in the meantime to facilitate their conceptual comparison with existing ones.

As the second contribution, we prepare four benchmark heterogeneous network datasets through exhaustive data collection, cleaning, analysis, and curation (Section 4, https://github.com/yangji9181/HNE). The datasets cover a wide spectrum of application domains (i.e., publication, recommendation, knowledge base, and biomedicine) and have various properties regarding scale, structure, and attribute/label availability. This diverse set of data, together with a series of standard network mining tasks and evaluation metrics, constitutes a handy and fair benchmark resource for future HNE algorithms.

As the third contribution, many existing HNE algorithms (including some very popular ones) either do not have a flexible implementation (e.g., hard-coded node and edge types, fixed sets of meta-paths, etc.) or do not scale to larger networks (e.g., high memory requirements during training), which adds much burden to novel research (i.e., requiring significant engineering effort for correct re-implementation). To this end, we focus on 13 popular HNE algorithms, for which we carefully refactor and scale up the original implementations and add interfaces for plug-and-run experiments on our prepared datasets (Section 5). Based on these ready-to-use and efficient implementations, we then conduct all-around empirical evaluations of the algorithms and report their benchmark performances. The empirical results, while providing insights into the merits of different models that are consistent with the conceptual analysis in Section 3, also serve as example usage of our benchmark platform that can be followed by future studies on HNE.

Note that, although there have been several attempts to survey or benchmark heterogeneous network models [7, 8, 79, 87] and homogeneous graph embedding [26, 27, 28, 88, 89, 90], none of them has deeply looked into the intersection of the two. We advocate that our unified framework for research and experiments on HNE is timely and necessary. Firstly, as we will cover in this work, there has been a significant amount of research on the particular problem of HNE, especially in the past few years, but most of it is scattered across different domains, lacking proper connections and comparisons. Secondly, none of the existing surveys has proposed a generic, mathematically complete paradigm for the conceptual analysis of all HNE models. Thirdly, existing surveys mostly do not provide systematic benchmark evaluation results, nor do they come with benchmark datasets and open-source baselines to facilitate future algorithm development.

The rest of this paper is organized as follows. Section 2 first introduces our proposed generic HNE paradigm. Subsequently, representative models in our survey are conceptually categorized and analyzed in Section 3. We then present in Section 4 our prepared benchmark datasets with detailed analysis. In Section 5, we provide a systematic empirical study over 13 popular HNE algorithms to benchmark the current state-of-the-art of HNE. Section 6 concludes the paper with visions towards future usage of our platform and research on HNE.

2 Generic Paradigm

2.1 Problem Definitions

Definition 2.1

Heterogeneous network. A heterogeneous network H=\{V,E,X,R,\phi,\psi\} is a network with multiple types of nodes and links. Particularly, within H, each node v_{i}\in V is associated with a node type \phi(v_{i}), and each link e_{ij}\in E is associated with a link type \psi(e_{ij}). Each node v_{i} of type o=\phi(v_{i}) is also potentially associated with an attribute X^{o}_{i}, while each link e_{ij} of type o=\psi(e_{ij}) is potentially associated with an attribute R^{o}_{ij}. It is worth noting that the type of a link e_{ij} automatically defines the types of the nodes v_{i} and v_{j} on its two ends.
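
To make the definition concrete, below is a minimal Python sketch (ours, not part of the released benchmark code) of one possible container for such typed nodes, typed links, and optional attributes; all class and field names are illustrative.

```python
# A minimal sketch of the heterogeneous network in Def 2.1 using plain dictionaries.
from collections import defaultdict

class HetNet:
    def __init__(self):
        self.phi = {}                 # node -> node type (phi)
        self.edges = []               # list of (u, v, link_type, edge_attr), link type plays the role of psi
        self.node_attr = {}           # node -> attribute X_i^o (optional)
        self.adj = defaultdict(list)  # node -> [(neighbor, link_type), ...]

    def add_node(self, v, node_type, attr=None):
        self.phi[v] = node_type
        if attr is not None:
            self.node_attr[v] = attr

    def add_edge(self, u, v, link_type, edge_attr=None):
        # the link type implicitly constrains phi(u) and phi(v)
        self.edges.append((u, v, link_type, edge_attr))
        self.adj[u].append((v, link_type))
        self.adj[v].append((u, link_type))

# toy DBLP-style example
g = HetNet()
g.add_node("a1", "author"); g.add_node("p1", "paper"); g.add_node("v1", "venue")
g.add_edge("a1", "p1", "writes")
g.add_edge("p1", "v1", "published_in")
```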

Heterogeneous networks have been intensively studied due to their power of accommodating multi-modal, multi-typed interconnected data. Besides the classic example of DBLP data used in most existing works as well as Figure 1, consider a different yet illustrative example from NYTimes (https://www.nytimes.com/) in Figure 2. Nodes in this heterogeneous network include news articles, categories, phrases, locations, and datetimes. To illustrate the power of heterogeneous networks, we introduce the concept of meta-path, which has been leveraged by most existing works on heterogeneous network modeling [7, 8].

Definition 2.2

Meta-path. A meta-path is a path defined on the network schema, denoted in the form of o_{1}\xrightarrow{l_{1}}o_{2}\xrightarrow{l_{2}}\cdots\xrightarrow{l_{m}}o_{m+1}, where the o's and l's are node types and link types, respectively.

Each meta-path captures the proximity among the nodes on its two ends from a particular semantic perspective. Continuing with our example of the heterogeneous network from news data in Figure 2, the meta-path article \xrightarrow{\text{belongs to}} category \xrightarrow{\text{includes}} article carries different semantics from article \xrightarrow{\text{mentions}} location \xrightarrow{\text{mentioned by}} article. Thus, leveraging different meta-paths allows heterogeneous network models to compute multi-modal, multi-typed node proximities and relations, which has been shown beneficial to many real-world network mining applications [15, 91, 25].

Next, we introduce the problem of general network embedding (representation learning).

Definition 2.3

Network embedding. For a given network G=\{V,E\}, where V is the set of nodes (vertices) and E is the set of links (edges), a network embedding is a mapping function \Phi: V\mapsto\mathbb{R}^{|V|\times d}, where d\ll|V|. This mapping \Phi defines the latent representation (a.k.a. embedding) of each node v\in V, which captures the network topological information in E.

In most cases, network proximity is the major topological information to be captured. For example, DeepWalk [29] captures the random-walk based node proximity and illustrates the 2-dim node representations learned on the famous Zachary’s Karate network of small groups, where a clear correspondence between the node position in the input graph and learned embedding space can be observed. Various follow-up works have improved or extended DeepWalk, while a complete coverage of them is beyond the scope of this work. In this work, we focus on the embedding of heterogeneous networks.

Figure 2: Toy example of a heterogeneous network constructed from the news data.

Now we define the main problem of focus in this work, heterogeneous network embedding (HNE), which lies in the intersection between Def 2.1 and Def 2.3.

Definition 2.4

Heterogeneous network embedding. For a given heterogeneous network H, a heterogeneous network embedding is a set of mapping functions \{\Phi_{k}: V_{k}\mapsto\mathbb{R}^{|V_{k}|\times d}\}_{k=1}^{K}, where K is the number of node types, \forall v_{i}\in V_{k}, \phi(v_{i})=k, and d\ll|V|. Each mapping \Phi_{k} defines the latent representation (a.k.a. embedding) of all nodes of type k, which captures the network topological information regarding the heterogeneous links in E.

Compared with homogeneous networks, the definition of topological information in heterogeneous networks is even more diverse. As we will show in Section 3, the major distinctions among different HNE algorithms mostly lie in their different ways of capturing such topological information. Particularly, the leverage of meta-paths as in Def 2.2 often plays an essential role, since many popular HNE algorithms exactly aim to model the different proximity indicated by meta-paths [59, 63, 73, 70, 69, 75, 66, 77].

2.2 Proposed Paradigm

In this work, we stress that one of the most important principles underlying HNE (as well as most other scenarios of network modeling and mining) is homophily [92]. Particularly, in the network embedding setting, homophily can be translated as ‘nodes close on a network should have similar embeddings’, which matches the requirement of Def 2.3. In fact, we further find intrinsic connections between the well-perceived homophily principle and widely-used smoothness enforcement technique on networks [93, 94, 95], which leads to a generic mathematical paradigm covering most existing and likely many future HNE algorithms.

Based on earlier well-established concepts underlying network modeling and embedding learning [96, 97, 93, 94, 95], we introduce the following key objective function of network smoothness enforcement:

\mathcal{J}=\sum_{u,v\in V}w_{uv}\,d({\bm{e}}_{u},{\bm{e}}_{v})+\mathcal{J}_{R}, \quad (1)

where {\bm{e}}_{u}=\Phi(u) and {\bm{e}}_{v}=\Phi(v) are the node embedding vectors to be learned, w_{uv} is the proximity weight, d(\cdot,\cdot) is the embedding distance function, and \mathcal{J}_{R} denotes possible additional objectives such as regularizers, all three of which can be defined and implemented differently by the particular HNE algorithms.
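
To ground Eq. (1), the following minimal numpy sketch (our illustration, not code from the benchmark repository) evaluates the smoothness term with squared Euclidean distance as d(·,·) and without the additional objectives \mathcal{J}_{R}; the `emb` and `weights` inputs are hypothetical.

```python
# A minimal sketch of the generic smoothness objective in Eq. (1).
import numpy as np

def smoothness_objective(emb, weights):
    """emb: dict node -> embedding vector; weights: dict (u, v) -> proximity weight w_uv."""
    J = 0.0
    for (u, v), w_uv in weights.items():
        d_uv = np.sum((emb[u] - emb[v]) ** 2)   # d(e_u, e_v) = ||e_u - e_v||^2
        J += w_uv * d_uv
    return J                                    # a concrete algorithm would add its own J_R

emb = {"a1": np.array([0.1, 0.2]), "p1": np.array([0.0, 0.3])}
print(smoothness_objective(emb, {("a1", "p1"): 1.0}))
```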

3 Algorithm Taxonomy

In this section, we propose a universal taxonomy for existing HNE algorithms with three categories based on their common objectives, and elaborate in detail how they fit into our paradigm of Eq. (1). The main challenge of instantiating Eq. (1) on heterogeneous networks is the consideration of complex interactions regarding multi-typed links and higher-order meta-paths. In fact, our Eq. (1) also readily generalizes to homogeneous networks, though that is beyond the scope of this work.

3.1 Proximity-Preserving Methods

As mentioned above, one basic goal of network embedding is to capture network topological information. This can be achieved by preserving different types of proximity among nodes. There are two major categories of proximity-preserving methods in HNE: random walk approaches (inspired by DeepWalk [29]) and first/second-order proximity based ones (inspired by LINE [30]). Both types of proximity-preserving methods are considered as shallow network embedding, due to their essential single-layer decomposition of certain affinity matrices [98].

3.1.1 Random Walk Approaches

metapath2vec [59]. Following homogeneous network embedding [29, 31], metapath2vec utilizes the node paths traversed by meta-path guided random walks to model the context of a node regarding heterogeneous semantics. Formally, given a meta-path \mathcal{M}=o_{1}\xrightarrow{l_{1}}o_{2}\xrightarrow{l_{2}}\cdots\xrightarrow{l_{m-1}}o_{m}, the transition probability at step i is defined as

p(v_{i+1}|v_{i},\mathcal{M})=\begin{cases}\frac{1}{|\mathcal{N}_{l_{i}}(v_{i})|}&\phi(v_{i+1})=o_{i+1},\;\psi(v_{i},v_{i+1})=l_{i}\\ 0&\text{otherwise}\end{cases} \quad (2)

where \mathcal{N}_{l}(v)=\{u\,|\,\psi(u,v)=l\} denotes the neighbors of v associated with edge type l. Assume \mathcal{P}=\{\mathcal{P}_{1},...,\mathcal{P}_{M}\} is the set of generated random walk sequences. The objective of metapath2vec is

\mathcal{J}=\sum_{v\in V}\sum_{u\in\mathcal{C}(v)}\log\frac{\exp({\bm{e}}_{u}^{T}{\bm{e}}_{v})}{\sum_{u^{\prime}\in V}\exp({\bm{e}}_{u^{\prime}}^{T}{\bm{e}}_{v})}, \quad (3)

where \mathcal{C}(v) is the context (i.e., skip-grams) of v in \mathcal{P}. For example, if \mathcal{P}_{1}=v_{1}v_{2}v_{3}v_{4}v_{5}... and the context window size is 2, then \{v_{1},v_{2},v_{4},v_{5}\}\subseteq\mathcal{C}(v_{3}). Let w_{uv} be the number of times that u\in\mathcal{C}(v); we can then rewrite Eq. (3) as

\mathcal{J}=\sum_{u,v\in V}w_{uv}\log\frac{\exp({\bm{e}}_{u}^{T}{\bm{e}}_{v})}{\sum_{u^{\prime}\in V}\exp({\bm{e}}_{u^{\prime}}^{T}{\bm{e}}_{v})}.

Calculating the denominator in this objective requires summing over all nodes, which is computationally expensive. In actual computation, it is approximated using negative sampling [99, 30].
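
As an illustration of Eq. (2), below is a minimal Python sketch (ours, not the authors' implementation) of a meta-path guided random walk; the adjacency and node-type containers are hypothetical, the meta-path is assumed to start and end with the same node type so it can be followed cyclically, and the subsequent skip-gram training of Eq. (3) is omitted.

```python
# A minimal sketch of a meta-path guided random walk following Eq. (2).
import random

def metapath_walk(adj, phi, start, metapath, walk_length):
    """adj: node -> list of (neighbor, link_type); phi: node -> node type;
    metapath: alternating [o_1, l_1, o_2, l_2, ..., o_m] with o_m == o_1 so it can repeat."""
    walk, pos = [start], 0                                # pos indexes the current node type o_i
    for _ in range(walk_length - 1):
        if pos + 2 >= len(metapath):                      # reached the end of the meta-path: wrap around
            pos = 0
        link_type, next_type = metapath[pos + 1], metapath[pos + 2]
        v = walk[-1]
        candidates = [u for u, l in adj[v] if phi[u] == next_type and l == link_type]
        if not candidates:
            break                                         # Eq. (2) assigns probability 0
        walk.append(random.choice(candidates))            # uniform over N_{l_i}(v), as in Eq. (2)
        pos += 2
    return walk

adj = {"a1": [("p1", "writes")], "a2": [("p1", "writes")],
       "p1": [("a1", "written_by"), ("a2", "written_by")]}
phi = {"a1": "author", "a2": "author", "p1": "paper"}
print(metapath_walk(adj, phi, "a1", ["author", "writes", "paper", "written_by", "author"], 5))
```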

HIN2Vec [63]. HIN2Vec considers the probability that there is a meta-path \mathcal{M} between nodes u and v. Specifically,

p(\mathcal{M}|u,v)=\sigma\Big{(}{\bf 1}^{T}\big{(}{\bm{W}}_{X}^{T}{\bm{u}}\odot{\bm{W}}_{Y}^{T}{\bm{v}}\odot f_{01}({\bm{W}}_{R}^{T}{\bm{m}})\big{)}\Big{)},

where {\bf 1} is an all-ones vector, \odot is the Hadamard product, and f_{01} is a normalization function. Here {\bm{e}}_{u}={\bm{W}}_{X}^{T}{\bm{u}}, {\bm{e}}_{v}={\bm{W}}_{Y}^{T}{\bm{v}}, and {\bm{e}}_{\mathcal{M}}=f_{01}({\bm{W}}_{R}^{T}{\bm{m}}) can be viewed as the embeddings of u, v, and \mathcal{M}, respectively. Let {\bm{A}}_{\mathcal{M}}={\rm diag}({\bm{e}}_{\mathcal{M}1},...,{\bm{e}}_{\mathcal{M}d}). We have

p(\mathcal{M}|u,v)=\sigma\Big{(}{\bf 1}^{T}\big{(}{\bm{e}}_{u}\odot{\bm{e}}_{v}\odot{\bm{e}}_{\mathcal{M}}\big{)}\Big{)}=\sigma({\bm{e}}_{u}^{T}{\bm{A}}_{\mathcal{M}}{\bm{e}}_{v}).

Since \sigma is the sigmoid function, we have

1-p(\mathcal{M}|u,v)=1-\sigma({\bm{e}}_{u}^{T}{\bm{A}}_{\mathcal{M}}{\bm{e}}_{v})=\sigma(-{\bm{e}}_{u}^{T}{\bm{A}}_{\mathcal{M}}{\bm{e}}_{v}).

HIN2Vec generates positive tuples (u,v,\mathcal{M}) (i.e., u connects with v via the meta-path \mathcal{M}) using homogeneous random walks [29] regardless of node/link types. For each positive tuple (u,v,\mathcal{M}), it generates several negative tuples by replacing u with a random node u^{\prime}. Its objective is

\begin{split}\mathcal{J}_{0}&=\sum_{(u,v,\mathcal{M})}\log p(\mathcal{M}|u,v)+\sum_{(u^{\prime},v,\mathcal{M})}\log(1-p(\mathcal{M}|u^{\prime},v))\\ &=\sum_{(u,v,\mathcal{M})}\Big{(}\log\sigma({\bm{e}}_{u}^{T}{\bm{A}}_{\mathcal{M}}{\bm{e}}_{v})+\sum_{u^{\prime}}\log\sigma(-{\bm{e}}_{u^{\prime}}^{T}{\bm{A}}_{\mathcal{M}}{\bm{e}}_{v})\Big{)}.\end{split}

This is essentially the negative sampling approximation of the following objective:

\mathcal{J}=\sum_{\mathcal{M}}\sum_{u,v\in V}w_{uv}^{(\mathcal{M})}\log\frac{\exp({\bm{e}}_{u}^{T}{\bm{A}}_{\mathcal{M}}{\bm{e}}_{v})}{\sum_{u^{\prime}\in V}\exp({\bm{e}}_{u^{\prime}}^{T}{\bm{A}}_{\mathcal{M}}{\bm{e}}_{v})},

where w_{uv}^{(\mathcal{M})} is the number of path instances between u and v following the meta-path \mathcal{M}.
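
For illustration, below is a minimal numpy sketch (ours, with hypothetical embedding vectors) of the simplified HIN2Vec score p(\mathcal{M}|u,v)=\sigma({\bf 1}^{T}({\bm{e}}_{u}\odot{\bm{e}}_{v}\odot{\bm{e}}_{\mathcal{M}})); the random-walk based tuple generation and training loop are omitted.

```python
# A minimal sketch of the HIN2Vec meta-path scoring function.
import numpy as np

def hin2vec_score(e_u, e_v, e_M):
    # sigmoid of the elementwise (Hadamard) triple product, summed over dimensions
    return 1.0 / (1.0 + np.exp(-np.sum(e_u * e_v * e_M)))

e_u, e_v = np.array([0.3, -0.1, 0.5]), np.array([0.2, 0.4, 0.1])
e_M = np.array([0.9, 0.1, 0.7])   # normalized meta-path embedding f_01(W_R^T m)
print(hin2vec_score(e_u, e_v, e_M))
```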

Other random walk approaches are summarized in Table I. To be specific, MRWNN [57] incorporates content priors into DeepWalk for image retrieval; SHNE [69] incorporates additional node information like categorical attributes, images, etc. by leveraging domain-specific deep encoders; HHNE [73] extends metapath2vec to the hyperbolic space; GHE [19] proposes a semi-supervised meta-path weighting technique; MNE [100] conducts random walks separately for each view in a multi-view network; JUST [65] proposes random walks with Jump and Stay strategies that do not rely on pre-defined meta-paths; HeteSpaceyWalk [101] introduces a scalable embedding framework based on heterogeneous personalized spacey random walks; TapEm [102] proposes a task-guided node pair embedding approach for author identification.

3.1.2 First/Second-Order Proximity Based Approaches

PTE [103]. PTE proposes to decompose a heterogeneous network into multiple bipartite networks, each of which describes one edge type. Its objective is the sum of log-likelihoods over all bipartite networks:

\begin{split}\mathcal{J}&=\sum_{l\in\mathcal{T}_{E}}\sum_{u,v\in V}w_{uv}^{(l)}\log\frac{\exp({\bm{e}}_{u}^{T}{\bm{e}}_{v})}{\sum_{u^{\prime}\in V_{\phi(u)}}\exp({\bm{e}}_{u^{\prime}}^{T}{\bm{e}}_{v})}\\ &=\sum_{u,v\in V}w_{uv}\log\frac{\exp({\bm{e}}_{u}^{T}{\bm{e}}_{v})}{\sum_{u^{\prime}\in V_{\phi(u)}}\exp({\bm{e}}_{u^{\prime}}^{T}{\bm{e}}_{v})}.\end{split}

Here \mathcal{T}_{E} is the set of edge types; w_{uv}^{(l)} is the type-l edge weight of (u,v) (if there is no edge between u and v with type l, then w_{uv}^{(l)}=0); w_{uv}=\sum_{l}w_{uv}^{(l)} is the total edge weight between u and v.
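
As a sketch of PTE's first step, the following minimal Python snippet (ours, with a hypothetical edge-list format) groups a heterogeneous edge list into one sub-network per edge type; the subsequent LINE-style training on each sub-network is omitted.

```python
# A minimal sketch of decomposing a heterogeneous network by edge type, as in PTE.
from collections import defaultdict

def decompose_by_edge_type(edges):
    """edges: iterable of (u, v, link_type, weight) -> dict link_type -> {(u, v): weight}."""
    subnets = defaultdict(dict)
    for u, v, link_type, w in edges:
        subnets[link_type][(u, v)] = subnets[link_type].get((u, v), 0.0) + w
    return subnets   # PTE sums its log-likelihood objective over these sub-networks

edges = [("w1", "d1", "word-doc", 1.0), ("w1", "w2", "word-word", 2.0), ("w1", "l1", "word-label", 1.0)]
print({k: dict(v) for k, v in decompose_by_edge_type(edges).items()})
```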

AspEm [60]. AspEm assumes that each heterogeneous network has multiple aspects, and each aspect is defined as a subgraph of the network schema [14]. An incompatibility measure is proposed to select appropriate aspects for embedding learning. Given an aspect a, its objective is

\mathcal{J}=\sum_{l\in\mathcal{T}_{E}^{a}}\frac{1}{Z_{l}}\sum_{u,v\in V}w_{uv}^{(l)}\log\frac{\exp({\bm{e}}_{u,a}^{T}{\bm{e}}_{v,a})}{\sum_{u^{\prime}\in V_{\phi(u)}}\exp({\bm{e}}_{u^{\prime},a}^{T}{\bm{e}}_{v,a})},

where \mathcal{T}_{E}^{a} is the set of edge types in aspect a; Z_{l}=\sum_{u,v}w_{uv}^{(l)} is a normalization factor; {\bm{e}}_{u,a} is the aspect-specific embedding of u.

Algorithm w_{uv} / w_{uv}^{(l)} / w_{uv}^{(\mathcal{M})} d({\bm{e}}_{u},{\bm{e}}_{v}) \mathcal{J}_{R0} Applications
MRWNN [57] number of times that u\in\mathcal{C}(v) in homogeneous random walks ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2} -\sum_{v}||{\bm{e}}_{v}-f_{\rm ENC}(X_{v})||^{2} image retrieval
metapath2vec [59] number of times that u\in\mathcal{C}(v) in heterogeneous random walks following \mathcal{M} N/A node classification, node clustering, link prediction, recommendation
SHNE [69] ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2}, {\bm{e}}_{u}=f_{\rm ENC}({\bm{x}}_{u})
HHNE [73] d_{\mathbb{D}}({\bm{e}}_{u},{\bm{e}}_{v})
GHE [19] number of meta-path instances between u and v ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2} author identification
HIN2Vec [63] following \mathcal{M} ||\sqrt{{\bm{A}}_{\mathcal{M}}}({\bm{e}}_{u}-{\bm{e}}_{v})||^{2} node classification, node clustering, link prediction, recommendation
MNE [100] number of times that u\in\mathcal{C}(v) in homogeneous random walks in (V,E_{l}) ||{\bm{e}}_{u}-{\bm{e}}_{v,l}||^{2}
JUST [65] number of times that u\in\mathcal{C}(v) in Jump/Stay random walks [65] ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2}
HeteSpaceyWalk [101] number of times that u\in\mathcal{C}(v) in heterogeneous spacey random walks [101]
TapEm [102] number of times that u\in\mathcal{C}(v) in heterogeneous random walks following \mathcal{M} ||f_{\rm MAP}({\bm{e}}_{u},{\bm{e}}_{v})-{\bm{h}}_{uv}^{\mathcal{M}}||^{2} ({\bm{h}}_{uv}^{\mathcal{M}}: representation of the node sequence from u to v following \mathcal{M}) supervised loss \sum y\log\hat{y}+(1-y)\log(1-\hat{y}) author identification
HNE [56] {\bf 1}_{(u,v)\in E} ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2}, {\bm{e}}_{u}={\bm{A}}_{\phi(u)}{\bm{x}}_{u} -\sum_{o\in\mathcal{T}_{V}}||{\bm{A}}_{o}||_{F}^{2} text classification, image retrieval
PTE [103] edge weight of (u,v) ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2} N/A text classification
DRL [68] node classification
CMF [58] -\sum_{v}||{\bm{e}}_{v}||^{2} relatedness measurement of Wikipedia entities
HEBE [62] edge weight of hyperedge (u,C) [62] ||{\bm{e}}_{u}-{\bm{e}}_{C}||^{2}, {\bm{e}}_{C}=\sum_{v\in C}{\bm{e}}_{v}/|C| N/A event embedding
Phine [11] number of times that u and v co-occur in a meta-graph instance ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2} supervised loss -\sum|\hat{y}-y|^{2} user rating prediction
MVE [104] edge weight of (u,v) with type l ||{\bm{e}}_{u,l}-{\bm{e}}_{v}||^{2} -\sum_{v}\sum_{l}\lambda_{v,l}||{\bm{e}}_{v,l}-{\bm{e}}_{v}||^{2} node classification, link prediction
AspEm [60] ||{\bm{e}}_{u,a}-{\bm{e}}_{v,a}||^{2} (a: aspect) N/A aspect mining
HEER [61] ||\sqrt{{\bm{A}}_{l}}({\bm{e}}_{u}-{\bm{e}}_{v})||^{2} relation prediction in knowledge graphs
GERM [105] edge weight of (u,v) with type l (if l is activated by the genetic algorithm [105]) ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2} citation recommendation, coauthor recommendation, feeds recommendation
mg2vec [106] 1st: number of meta-graph instances containing node u; 2nd: number of meta-graph instances containing both nodes u and v 1st: ||{\bm{e}}_{v}-{\bm{e}}_{m}||^{2}; 2nd: ||f_{\rm MAP}({\bm{e}}_{u},{\bm{e}}_{v})-{\bm{e}}_{m}||^{2} (m: meta-graph) relationship prediction, relationship search
TABLE I: A summary of proximity-preserving based HNE algorithms. (Additional notations: f_{\rm ENC}: a function to encode text/image/attribute information; d_{\mathbb{D}}: distance between two points in the Poincaré ball; f_{\rm MAP}: a function to map two d-dimensional embeddings to one d-dimensional vector.)

HEER [61]. HEER extends PTE by considering typed closeness. Specifically, each edge type l has an embedding {\bm{\mu}}_{l}, and its objective is

\mathcal{J}=\sum_{l\in\mathcal{T}_{E}}\sum_{u,v\in V}w_{uv}^{(l)}\log\frac{\exp({\bm{\mu}}_{l}^{T}{\bm{g}}_{uv})}{\sum_{(u^{\prime},v^{\prime})\in P_{l}(u,v)}\exp({\bm{\mu}}_{l}^{T}{\bm{g}}_{u^{\prime}v^{\prime}})},

where {\bm{g}}_{uv} is the edge embedding of (u,v) and P_{l}(u,v)=\{(u^{\prime},v)\,|\,\psi(u^{\prime},v)=l\}\cup\{(u,v^{\prime})\,|\,\psi(u,v^{\prime})=l\}. In [61], {\bm{g}}_{uv} has different definitions for directed and undirected edges based on the Hadamard product. To simplify our discussion, we assume {\bm{g}}_{uv}={\bm{e}}_{u}\odot{\bm{e}}_{v}. Let {\bm{A}}_{l}={\rm diag}({\bm{\mu}}_{l1},...,{\bm{\mu}}_{ld}). Then we have {\bm{\mu}}_{l}^{T}{\bm{g}}_{uv}={\bm{e}}_{u}^{T}{\bm{A}}_{l}{\bm{e}}_{v} and

\mathcal{J}=\sum_{l\in\mathcal{T}_{E}}\sum_{u,v\in V}w_{uv}^{(l)}\log\frac{\exp({\bm{e}}_{u}^{T}{\bm{A}}_{l}{\bm{e}}_{v})}{\sum_{(u^{\prime},v^{\prime})\in P_{l}(u,v)}\exp({\bm{e}}_{u^{\prime}}^{T}{\bm{A}}_{l}{\bm{e}}_{v^{\prime}})}.

Other first/second-order proximity based approaches are summarized in Table I. To be specific, Chang et al. [56] introduce a node type-aware content encoder; CMF [58] performs joint matrix factorization over the decomposed bipartite networks; HEBE [62] preserves proximity regarding each meta-graph; Phine [11] combines additional regularization towards semi-supervised training; MVE [104] proposes an attention based framework to consider multiple views of the same nodes; DRL [68] proposes to learn one type of edges at each step and uses deep reinforcement learning approaches to select the edge type for the next step; GERM [105] adopts a genetic evolutionary approach to preserve the critical edge types while removing the noisy or incompatible ones given specific tasks; mg2vec [106] jointly embeds nodes and meta-graphs into the same space by exploiting both first-order and second-order proximity.

3.1.3 Unified Objectives

Based on the discussions above, the objective of most proximity-preserving methods can be unified as

\max\mathcal{J}=\sum_{u,v}w_{uv}\log\frac{\exp(s(u,v))}{\sum_{u^{\prime}}\exp(s(u^{\prime},v))}+\mathcal{J}_{R0}. \quad (4)

Here, w_{uv} is the weight of node pair (u,v) in the objective (w_{uv} can be specific to a meta-path \mathcal{M} or an edge type l, in which cases we denote it as w_{uv}^{(\mathcal{M})} or w_{uv}^{(l)} accordingly; in Eq. (4) and the following derivations, we omit the superscript for ease of notation); \mathcal{J}_{R0} is an algorithm-specific regularization term (summarized in Table I); s(u,v) is a proximity function between u and v.

Negative Sampling. Directly optimizing the objective in Eq. (4) is computationally expensive because it needs to traverse all nodes when computing the softmax function. Therefore, in actual computation, most studies adopt the negative sampling strategy [99, 30], which modifies the objective as follows:

\sum_{u,v}w_{uv}\Big{(}\log\sigma(s(u,v))+b\,\mathbb{E}_{u^{\prime}\sim P_{N}}[\log\sigma(-s(u^{\prime},v))]\Big{)}+\mathcal{J}_{R0}.

Here, b is the number of negative samples and P_{N} is the noise distribution that generates the negative samples.
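
As an illustration, the following minimal numpy sketch (ours, with hypothetical inputs) evaluates the negative-sampled term above for a single node pair (u,v) with s(u,v)={\bm{e}}_{u}^{T}{\bm{e}}_{v}, approximating the expectation over P_{N} by b drawn samples.

```python
# A minimal sketch of the negative sampling objective for one weighted node pair.
import numpy as np

def neg_sampled_term(e_u, e_v, neg_embs):
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(e_u @ e_v))                                   # observed (positive) pair
    neg = sum(np.log(sigmoid(-(e_neg @ e_v))) for e_neg in neg_embs)   # b samples drawn from P_N
    return pos + neg                                                   # multiply by w_uv and sum over pairs

e_u, e_v = np.array([0.2, 0.1]), np.array([0.3, -0.2])
neg_embs = [np.array([0.5, 0.5]), np.array([-0.4, 0.1])]               # b = 2 negative node embeddings
print(neg_sampled_term(e_u, e_v, neg_embs))
```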

Negative sampling serves as a generic paradigm to unify network embedding approaches. For example, starting from the negative sampling objective, Qiu et al. [98] unify DeepWalk [29], LINE [30], PTE [103] and node2vec [31] into a matrix factorization framework. In this paper, as mentioned in Section 2.2, we introduce another objective (i.e., Eq. (1)) that is equivalent to Eq. (4) and consider network embedding from the perspective of network smoothness enforcement.

Network Smoothness Enforcement. Note that in most cases, we can write s(u,v) in Eq. (4) as f({\bm{e}}_{u})^{T}f({\bm{e}}_{v}). For example, f({\bm{e}}_{u})={\bm{e}}_{u} in metapath2vec, PTE, etc.; f({\bm{e}}_{u})=\sqrt{{\bm{A}}_{\mathcal{M}}}{\bm{e}}_{u} in HIN2Vec; f({\bm{e}}_{u})=\sqrt{{\bm{A}}_{l}}{\bm{e}}_{u} in HEER. In these cases,

\begin{split}\mathcal{J}&=\sum_{u,v}w_{uv}s(u,v)-\sum_{u,v}w_{uv}\log\sum_{u^{\prime}}\exp(s(u^{\prime},v))+\mathcal{J}_{R0}\\ &=\sum_{u,v}w_{uv}f({\bm{e}}_{u})^{T}f({\bm{e}}_{v})-\sum_{u,v}w_{uv}\log\sum_{u^{\prime}}\exp(f({\bm{e}}_{u^{\prime}})^{T}f({\bm{e}}_{v}))+\mathcal{J}_{R0}\\ &=\sum_{u,v}\frac{w_{uv}}{2}\Big{(}||f({\bm{e}}_{u})||^{2}+||f({\bm{e}}_{v})||^{2}-||f({\bm{e}}_{u})-f({\bm{e}}_{v})||^{2}\Big{)}-\sum_{u,v}w_{uv}\log\sum_{u^{\prime}}\exp(f({\bm{e}}_{u^{\prime}})^{T}f({\bm{e}}_{v}))+\mathcal{J}_{R0}.\end{split}

The last step holds because ||{\bm{x}}-{\bm{y}}||^{2}=({\bm{x}}-{\bm{y}})^{T}({\bm{x}}-{\bm{y}})=||{\bm{x}}||^{2}+||{\bm{y}}||^{2}-2{\bm{x}}^{T}{\bm{y}}. Therefore, our goal is equivalent to

\begin{split}\min-\mathcal{J}&=\sum_{u,v}\frac{w_{uv}}{2}\underbrace{||f({\bm{e}}_{u})-f({\bm{e}}_{v})||^{2}}_{d({\bm{e}}_{u},{\bm{e}}_{v})}-\mathcal{J}_{R0}\\ &\ \ \ \ -\underbrace{\sum_{u,v}\frac{w_{uv}}{2}\Big{(}||f({\bm{e}}_{u})||^{2}+||f({\bm{e}}_{v})||^{2}\Big{)}}_{\mathcal{J}_{R1}}+\underbrace{\sum_{u,v}w_{uv}\log\sum_{u^{\prime}}\exp(f({\bm{e}}_{u^{\prime}})^{T}f({\bm{e}}_{v}))}_{\mathcal{J}_{R2}}.\end{split} \quad (5)

Here \mathcal{J}_{R1} and \mathcal{J}_{R2} are two regularization terms. Without \mathcal{J}_{R1}, d({\bm{e}}_{u},{\bm{e}}_{v}) can be minimized by letting ||f({\bm{e}}_{u})||\rightarrow 0\ (\forall u\in V); without \mathcal{J}_{R2}, d({\bm{e}}_{u},{\bm{e}}_{v}) can be minimized by letting {\bm{e}}_{u}\equiv{\bm{c}}\ (\forall u\in V). \mathcal{J}_{R} in Eq. (1) is then the sum of -\mathcal{J}_{R0}, -\mathcal{J}_{R1} and \mathcal{J}_{R2}. Among them, -\mathcal{J}_{R0} is algorithm-specific, while -\mathcal{J}_{R1} and \mathcal{J}_{R2} are commonly shared across most HNE algorithms in the proximity-preserving group. We summarize the choices of w_{uv}, d({\bm{e}}_{u},{\bm{e}}_{v}) and \mathcal{J}_{R0} in Table I.

Although Eq. (4) can cover most existing and likely many future approaches, we would like to remark that there are also studies adopting other forms of proximity-preserving objectives. For example, SHINE [107] uses reconstruction loss of autoencoders; HINSE [64] adopts spectral embedding based on adjacency matrices with different meta-graphs; MetaDynaMix [108] introduces a low-rank matrix factorization framework for dynamic HIN embedding; HeGAN [72] proposes an adversarial learning approach with a relation type-aware discriminator.

3.2 Message-Passing Methods

Each node in a network can have attribute information represented as a feature vector {\bm{x}}_{u}. Message-passing methods aim to learn node embeddings {\bm{e}}_{u} based on {\bm{x}}_{u} by aggregating the information from u's neighbors. In recent studies, Graph Neural Networks (GNNs) [33] are widely adopted to facilitate this aggregation/message-passing process. Compared to the proximity-preserving HNE methods, message-passing methods, especially the GNN based ones, are often considered deep network embedding, due to their multiple layers of learnable projection functions.

R-GCN [67]. R-GCN has K convolutional layers. The initial node representation {\bm{h}}_{u}^{(0)} is simply the node feature {\bm{x}}_{u}. In the k-th convolutional layer, each representation vector is updated by accumulating the vectors of neighboring nodes through a normalized sum:

{\bm{h}}_{u}^{(k+1)}=\sigma\Big{(}\sum_{l\in\mathcal{T}_{E}}\sum_{v\in\mathcal{N}_{l}(u)}\frac{1}{|\mathcal{N}_{l}(u)|}{\bm{W}}_{l}^{(k)}{\bm{h}}_{v}^{(k)}+{\bm{W}}_{0}^{(k)}{\bm{h}}_{u}^{(k)}\Big{)}.

Different from the regular GCN model [33], R-GCN considers edge heterogeneity by learning multiple convolution matrices {\bm{W}}'s, each corresponding to one edge type. During message passing, neighbors under the same edge type are aggregated and normalized first. The node embedding is the output of the K-th layer (i.e., {\bm{e}}_{v}={\bm{h}}_{v}^{(K)}).

In unsupervised settings, message-passing approaches use link prediction as their downstream task to train GNNs. To be specific, the likelihood of observing edges in the heterogeneous network is maximized. R-GCN optimizes a cross-entropy loss through negative sampling. Essentially, it is the approximation of the following objective:

\mathcal{J}=\sum_{l\in\mathcal{T}_{E}}\sum_{u,v\in V}w_{uv}^{(l)}\log\frac{\exp({\bm{e}}_{u}^{T}{\bm{A}}_{l}{\bm{e}}_{v})}{\sum_{u^{\prime}\in V}\exp({\bm{e}}_{u^{\prime}}^{T}{\bm{A}}_{l}{\bm{e}}_{v})},

where w_{uv}^{(l)}=\textbf{1}_{(u,v)\in E_{l}}.
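
Below is a minimal numpy sketch (our illustration, not the official R-GCN implementation) of a single R-GCN layer following the update rule above; the containers for typed neighborhoods and the weight shapes are hypothetical, and ReLU stands in for the nonlinearity \sigma.

```python
# A minimal sketch of one R-GCN layer:
# h_u^{(k+1)} = sigma( sum_l sum_{v in N_l(u)} 1/|N_l(u)| W_l h_v^{(k)} + W_0 h_u^{(k)} ).
import numpy as np

def rgcn_layer(H, adj_by_type, W_rel, W_self):
    """H: node -> current representation; adj_by_type: link_type -> {node: [neighbors]};
    W_rel: link_type -> weight matrix; W_self: self-loop weight matrix."""
    H_next = {}
    for u, h_u in H.items():
        msg = W_self @ h_u
        for l, adj in adj_by_type.items():
            neigh = adj.get(u, [])
            if neigh:
                # normalized sum of neighbor messages under edge type l
                msg = msg + (W_rel[l] @ sum(H[v] for v in neigh)) / len(neigh)
        H_next[u] = np.maximum(msg, 0.0)   # ReLU as the nonlinearity
    return H_next

H = {"a": np.array([1.0, 0.0]), "p": np.array([0.0, 1.0])}   # h^{(0)} = node features x
adj_by_type = {"writes": {"a": ["p"], "p": ["a"]}}
print(rgcn_layer(H, adj_by_type, {"writes": np.eye(2)}, np.eye(2)))
```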

Recently, CompGCN [109] extends R-GCN by leveraging a variety of entity-relation composition operations so as to jointly embed nodes and relations.

Algorithm w_{uv} / w_{uv}^{(l)} / w_{uv}^{\mathcal{M}} d({\bm{e}}_{u},{\bm{e}}_{v}) Aggregation Function Applications
R-GCN [67] \textbf{1}_{(u,v)\in E_{l}} ||\sqrt{{\bm{A}}_{l}}({\bm{e}}_{u}-{\bm{e}}_{v})||^{2} {\bm{h}}_{u}^{(k+1)}=\sigma\Big{(}\sum_{l\in\mathcal{T}_{E}}\sum_{v\in\mathcal{N}_{l}(u)}\frac{1}{|\mathcal{N}_{l}(u)|}{\bm{W}}_{r}^{(k)}{\bm{h}}_{v}^{(k)}+{\bm{W}}_{0}^{(k)}{\bm{h}}_{u}^{(k)}\Big{)},   {\bm{h}}_{u}^{(0)}={\bm{x}}_{u},   {\bm{e}}_{u}={\bm{h}}_{u}^{(K)} entity classification, KB completion
HEP [110] \textbf{1}_{(u,v)\in E} ||\sqrt{{\bm{A}}}({\bm{e}}_{u}-{\bm{e}}_{v})||^{2} \tilde{{\bm{e}}}_{u}=\sigma\Big{(}{\bm{W}}_{\phi(u)}\Big{(}{\big{|}\big{|}}_{o\in\mathcal{T}_{V}}\sum_{v\in\mathcal{N}_{o}(u)}\alpha_{uv}{\bm{e}}_{v}\Big{)}+{\bm{b}}_{\phi(u)}\Big{)} user alignment
HAN [75] edge weight of (u,v) ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2} {\bm{e}}_{u}=\sum_{\mathcal{M}}\beta_{\mathcal{M}}\sigma\Big{(}\sum_{v\in\mathcal{N}_{\mathcal{M}}(u)}\alpha_{uv}^{\mathcal{M}}{\bm{M}}_{\phi(u)}{\bm{x}}_{u}\Big{)} node classification, node clustering, link prediction, recommendation
HetGNN [71] number of times that u\in\mathcal{C}(v) in homogeneous random walks {\bm{h}}_{u}=f_{\rm ENC}({\bm{x}}_{u}),\ \ {\bm{h}}_{u,o}=f_{\rm AGG}\Big{(}\{{\bm{h}}_{v}|v\in\mathcal{N}_{\rm RWR}(u),\phi(v)=o\}\Big{)},   {\bm{e}}_{u}=\alpha_{u}{\bm{h}}_{u}+\sum_{o\in\mathcal{T}_{V}}\alpha_{u,o}{\bm{h}}_{u,o}
GATNE [70] number of times that u\in\mathcal{C}(v) in random walks following \mathcal{M} ||{\bm{e}}_{u,l}-{\bm{e}}_{v}||^{2} {\bm{h}}_{u,l}^{(k+1)}=f_{\rm AGG}\Big{(}\{{\bm{h}}_{v,l}^{(k)}|v\in\mathcal{N}_{l}(u)\}\Big{)},   {\bm{h}}_{u,l}^{(0)}={\bm{x}}_{u},   {\bm{e}}_{u,l}=\alpha_{l}{\bm{W}}_{l}\Big{(}{\big{|}\big{|}}_{l\in\mathcal{T}_{E}}{\bm{h}}_{u,l}^{(K)}\Big{)}{\bm{a}}_{u,l}+{\bm{b}}_{u}
MV-ACM [111] {\bm{h}}_{u,l}^{(k+1)}=\sigma\Big{(}{\bm{W}}_{l}^{(k)}\frac{1}{|\mathcal{N}_{l}(u)|}\sum_{v\in\mathcal{N}_{l}(u)}{\bm{h}}_{v,l}^{(k)}\Big{)},   {\bm{h}}_{u,l}^{(0)}={\bm{x}}_{u},   {\bm{e}}_{u,l}={\bm{M}}_{l}\Big{(}{\bm{h}}_{u,l}^{(K)}+\sum_{l^{\prime}\in\mathcal{T}_{E}}\alpha_{u,l,l^{\prime}}{\bm{h}}_{u,l^{\prime}}^{(K)}\Big{)}+{\bm{b}}_{u}
MAGNN [112] edge weight of (u,v) ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2} {\bm{h}}_{u}^{\mathcal{M}}=\sigma\Big{(}\sum_{v\in\mathcal{N}_{\mathcal{M}}(u)}\alpha_{uv}^{\mathcal{M}}f_{\rm ENC}\big{(}\{{\bm{M}}_{\phi(t)}{\bm{x}}_{t}|t\in\mathcal{P}_{u\rightarrow v}^{\mathcal{M}}\}\big{)}\Big{)},   {\bm{e}}_{u}=\sigma\Big{(}{\bm{W}}\big{(}\sum_{\mathcal{M}}\beta_{\mathcal{M}}{\bm{h}}_{u}^{\mathcal{M}}\big{)}\Big{)}
HGT [85] \textbf{1}_{(u,v)\in E} refer to NTN [20] in Table III \hat{{\bm{h}}}^{(k+1)}_{v}=\sum_{u\in N(v)}{\bf Attention}(u,v)\odot{\bf Message}(u,v),   {\bm{h}}^{(k+1)}_{v}=A_{\phi(v)}(\sigma(\hat{{\bm{h}}}^{(k+1)}_{v}))+{\bm{h}}^{(k)}_{v},   {\bm{h}}_{u}^{(0)}={\bm{x}}_{u},   {\bm{e}}_{u}={\bm{h}}_{u}^{(K)} node classification, author identification
TABLE II: A summary of message-passing based HNE algorithms. (Additional notations: \mathcal{N}_{\rm RWR}(v): neighbors of v defined through random walk with restart [113]; \mathcal{N}_{l}(u): neighbors of u with edge type l; \mathcal{N}_{o}(u): neighbors of u with node type o; \mathcal{N}_{\mathcal{M}}(u): nodes connected with u via meta-path \mathcal{M}; \mathcal{P}_{u\rightarrow v}^{\mathcal{M}}: a meta-path instance of \mathcal{M} connecting u and v; f_{\rm AGG}: a function to aggregate information from neighbors. We show the transductive version of GATNE.)

HAN [75]. Instead of considering one-hop neighbors, HAN utilizes meta-paths to model higher-order proximity. Given a meta-path \mathcal{M}, the representation of node u is aggregated from its meta-path based neighbors \mathcal{N}_{\mathcal{M}}(u)=\{u\}\cup\{v\,|\,v connects with u via the meta-path \mathcal{M}\}. HAN proposes an attention mechanism to learn the weights of different neighbors:

\alpha_{uv}^{\mathcal{M}}=\frac{\exp\big{(}\sigma({\bm{a}}_{\mathcal{M}}^{T}[{\bm{x}}^{\prime}_{u}||{\bm{x}}^{\prime}_{v}])\big{)}}{\sum_{v^{\prime}\in\mathcal{N}_{\mathcal{M}}(u)}\exp\big{(}\sigma({\bm{a}}_{\mathcal{M}}^{T}[{\bm{x}}^{\prime}_{u}||{\bm{x}}^{\prime}_{v^{\prime}}])\big{)}},
{\bm{h}}_{u}^{\mathcal{M}}=\sigma\Big{(}\sum_{v\in\mathcal{N}_{\mathcal{M}}(u)}\alpha_{uv}^{\mathcal{M}}{\bm{x}}^{\prime}_{v}\Big{)},

where {\bm{a}}_{\mathcal{M}} is the node-level attention vector of \mathcal{M}; {\bm{x}}^{\prime}_{u}={\bm{M}}_{\phi(u)}{\bm{x}}_{u} is the projected feature vector of node u; || is the concatenation operator. Given the meta-path specific embedding {\bm{h}}_{u}^{\mathcal{M}}, HAN uses semantic-level attention to weigh different meta-paths:

\beta_{\mathcal{M}}=\frac{\exp\big{(}\frac{1}{|V|}\sum_{v\in V}{\bm{q}}^{T}\tanh({\bm{W}}{\bm{h}}_{v}^{\mathcal{M}}+{\bm{b}})\big{)}}{\sum_{\mathcal{M}^{\prime}}\exp\big{(}\frac{1}{|V|}\sum_{v\in V}{\bm{q}}^{T}\tanh({\bm{W}}{\bm{h}}_{v}^{\mathcal{M}^{\prime}}+{\bm{b}})\big{)}},
{\bm{e}}_{u}=\sum_{\mathcal{M}}\beta_{\mathcal{M}}{\bm{h}}_{u}^{\mathcal{M}},

where {\bm{q}} is the semantic-level attention vector.

In the original HAN paper, the authors mainly consider the task of semi-supervised node classification. For unsupervised learning (i.e., without any node labels), according to [112], HAN can use the link prediction loss introduced in GraphSAGE [34], which is the negative sampling approximation of the following objective:

\mathcal{J}=\sum_{u,v\in V}w_{uv}\log\frac{\exp({\bm{e}}_{u}^{T}{\bm{e}}_{v})}{\sum_{u^{\prime}\in V}\exp({\bm{e}}_{u^{\prime}}^{T}{\bm{e}}_{v})}. \quad (6)

Here, w_{uv} is the edge weight of (u,v).
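
The following minimal numpy sketch (ours, with hypothetical projected features and attention vector, and tanh standing in for the unspecified nonlinearity \sigma) illustrates HAN's node-level attention for a single meta-path; the semantic-level attention over meta-paths is omitted.

```python
# A minimal sketch of HAN's node-level attention:
# alpha_{uv} = softmax_v( sigma(a_M^T [x'_u || x'_v]) ),  h_u^M = sigma( sum_v alpha_{uv} x'_v ).
import numpy as np

def han_node_level(u, neighbors, x_proj, a_M):
    scores = np.array([np.tanh(a_M @ np.concatenate([x_proj[u], x_proj[v]])) for v in neighbors])
    alpha = np.exp(scores) / np.exp(scores).sum()                          # attention over N_M(u)
    h_u = np.tanh(sum(a * x_proj[v] for a, v in zip(alpha, neighbors)))    # meta-path specific embedding
    return h_u

x_proj = {"p1": np.array([0.2, 0.1]), "p2": np.array([0.4, -0.3]), "p3": np.array([0.0, 0.5])}
a_M = np.array([0.3, -0.2, 0.1, 0.4])                                      # node-level attention vector of M
print(han_node_level("p1", ["p1", "p2", "p3"], x_proj, a_M))
```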

MAGNN [112]. MAGNN extends HAN by considering both the meta-path based neighborhood \mathcal{N}_{\mathcal{M}}(u)=\{v\,|\,v connects with u via meta-path \mathcal{M}\} and the nodes along the meta-path instances. Given a meta-path \mathcal{M}, MAGNN first employs an encoder to transform all the node features along an instance of \mathcal{M} into a single vector:

{\bm{h}}_{uv}^{\mathcal{M}}=f_{\rm ENC}\big{(}\{{\bm{x}}^{\prime}_{t}\,|\,t\in\mathcal{P}_{u\rightarrow v}^{\mathcal{M}}\}\big{)},

where {\bm{x}}^{\prime}_{u}={\bm{M}}_{\phi(u)}{\bm{x}}_{u} is the projected feature vector of node u; \mathcal{P}_{u\rightarrow v}^{\mathcal{M}} denotes a meta-path instance of \mathcal{M} connecting u and v; f_{\rm ENC} is a relational rotation encoder inspired by [114]. After encoding each meta-path instance, MAGNN proposes an intra-meta-path aggregation to learn the weight of different neighbors:

\alpha_{uv}^{\mathcal{M}}=\frac{\exp(\text{LeakyReLU}({\bm{a}}_{\mathcal{M}}^{T}[{\bm{x}}^{\prime}_{v}||{\bm{h}}_{uv}^{\mathcal{M}}]))}{\sum_{v^{\prime}\in\mathcal{N}_{\mathcal{M}}(u)}\exp(\text{LeakyReLU}({\bm{a}}_{\mathcal{M}}^{T}[{\bm{x}}^{\prime}_{v^{\prime}}||{\bm{h}}_{uv^{\prime}}^{\mathcal{M}}]))},
{\bm{h}}_{u}^{\mathcal{M}}=\sigma\Big{(}\sum_{v\in\mathcal{N}_{\mathcal{M}}(u)}\alpha_{uv}^{\mathcal{M}}{\bm{h}}_{uv}^{\mathcal{M}}\Big{)},

where {\bm{a}}_{\mathcal{M}} is the node-level attention vector of \mathcal{M}. This attention mechanism can also be extended to multiple heads. After aggregating the information within each meta-path, MAGNN further combines the semantics revealed by all meta-paths using an inter-meta-path aggregation:

\beta_{\mathcal{M}}=\frac{\exp\big{(}\frac{1}{|V_{\phi(u)}|}\sum_{v\in V_{\phi(u)}}{\bm{q}}_{\phi(u)}^{T}{\rm tanh}({\bm{W}}_{\phi(u)}{\bm{h}}_{v}^{\mathcal{M}}+{\bm{b}}_{\phi(u)})\big{)}}{\sum_{\mathcal{M}^{\prime}}\exp\big{(}\frac{1}{|V_{\phi(u)}|}\sum_{v\in V_{\phi(u)}}{\bm{q}}_{\phi(u)}^{T}{\rm tanh}({\bm{W}}_{\phi(u)}{\bm{h}}_{v}^{\mathcal{M}^{\prime}}+{\bm{b}}_{\phi(u)})\big{)}},
{\bm{e}}_{u}=\sigma\Big{(}{\bm{W}}\big{(}\sum_{\mathcal{M}}\beta_{\mathcal{M}}{\bm{h}}_{u}^{\mathcal{M}}\big{)}\Big{)},

where {\bm{q}}_{\phi(u)} is the semantic-level attention vector for node type \phi(u). In comparison with HAN, MAGNN employs an additional projection to get the final representation {\bm{e}}_{u}.

For link prediction, MAGNN adopts the loss introduced in GraphSAGE [34], which is equivalent to the objective in Eq. (6).

HGT [85]. Inspired by the success of the Transformer [115, 116] in text representation learning, Hu et al. propose to use each edge's type to parameterize a Transformer-like self-attention architecture. To be specific, for each edge (u,v), their Heterogeneous Graph Transformer (HGT) model maps v into a Query vector and u into a Key vector, and calculates their dot product as attention:

\begin{gathered}{\bm{Q}}_{v}^{i}=Q_{\phi(v)}^{i}({\bm{h}}_{v}^{(k)}),\ \ \ {\bm{K}}_{u}^{i}=K_{\phi(u)}^{i}({\bm{h}}_{u}^{(k)}),\\ {\rm Head}^{{\rm ATT}}_{i}(u,v)=\Big{(}\frac{{\bm{K}}_{u}^{i}{\bm{W}}^{{\rm ATT}}_{\psi(u,v)}{\bm{Q}}_{v}^{i\,T}}{\sqrt{d}}\Big{)}\mu(\phi(u),\psi(u,v),\phi(v)),\\ {\bf Attention}(u,v)={\rm Softmax}_{u\in N(v)}\Big{(}{\big{|}\big{|}}_{i}{\rm Head}^{{\rm ATT}}_{i}(u,v)\Big{)}.\end{gathered}

Here, {\bm{h}}_{u}^{(k)} is the output of the k-th HGT layer ({\bm{h}}_{u}^{(0)}={\bm{x}}_{u}); Q_{\phi(v)}^{i} and K_{\phi(u)}^{i} are node type-aware linear mappings; {\rm Head}^{{\rm ATT}}_{i} is the i-th attention head; \mu is a prior tensor representing the weight of each edge type in the attention. Parallel to the calculation of attention, the message passing process can be computed in a similar way by incorporating node and edge types:

\begin{gathered}{\rm Head}^{{\rm MSG}}_{i}(u,v)=M_{\phi(u)}^{i}({\bm{h}}_{u}^{(k)}){\bm{W}}^{{\rm MSG}}_{\psi(u,v)},\\ {\bf Message}(u,v)={\big{|}\big{|}}_{i}{\rm Head}^{{\rm MSG}}_{i}(u,v),\end{gathered}

where M_{\phi(u)}^{i} is also a node type-aware linear mapping. To aggregate the messages from v's neighborhood, the attention vector serves as the weight to get the updated vector:

\hat{{\bm{h}}}^{(k+1)}_{v}=\sum_{u\in N(v)}{\bf Attention}(u,v)\odot{\bf Message}(u,v),

and following the residual connection [117], the output of the k-th layer is

{\bm{h}}^{(k+1)}_{v}=A_{\phi(v)}(\sigma(\hat{{\bm{h}}}^{(k+1)}_{v}))+{\bm{h}}^{(k)}_{v}.

Here, A_{\phi(v)} is a linear function mapping v's vector back to its node type-specific distribution.

For the unsupervised link prediction task, HGT borrows the objective function from Neural Tensor Network [20], which will be further discussed in Section 3.3.

Algorithm w_{uv}^{(l)} d({\bm{e}}_{u},{\bm{e}}_{v}) \mathcal{J}_{R0} Applications
TransE [4] \textbf{1}_{(u,v)\in E_{l}} ||{\bm{e}}_{u}+{\bm{e}}_{l}-{\bm{e}}_{v}|| \sum_{v}(||{\bm{e}}_{v}||-1) KB completion, relation extraction from text
TransH [118] ||{\bm{e}}_{u,l}+{\bm{e}}_{l}-{\bm{e}}_{v,l}||^{2}, {\bm{e}}_{v,l}={\bm{e}}_{v}-{\bm{w}}_{l}^{T}{\bm{e}}_{v}{\bm{w}}_{l} \sum_{l}(||{\bm{w}}_{l}||-1)+\sum_{v}[||{\bm{e}}_{v}||-1]_{+}+\sum_{l}\Big{[}\frac{({\bm{w}}_{l}^{T}{\bm{e}}_{l})^{2}}{||{\bm{e}}_{l}||^{2}}-\epsilon^{2}\Big{]}_{+}
TransR [80] ||{\bm{e}}_{u,l}+{\bm{e}}_{l}-{\bm{e}}_{v,l}||^{2}, {\bm{e}}_{v,l}={\bm{A}}_{l}{\bm{e}}_{v} \sum_{v}[||{\bm{e}}_{v}||-1]_{+}+\sum_{l}[||{\bm{e}}_{l}||-1]_{+}+\sum_{v}[||{\bm{A}}_{l}{\bm{e}}_{v}||-1]_{+}
RHINE [76] edge weight of (u,v) with type l ||{\bm{e}}_{u}-{\bm{e}}_{v}||^{2} if l models affiliation, ||{\bm{e}}_{u}+{\bm{e}}_{l}-{\bm{e}}_{v}|| if l models interaction N/A link prediction, node classification
RotatE [114] \textbf{1}_{(u,v)\in E_{l}} ||{\bm{e}}_{u}\odot{\bm{e}}_{l}-{\bm{e}}_{v}||^{2}, {\bm{e}}_{u},{\bm{e}}_{v},{\bm{e}}_{l}\in\mathbb{C}^{n} \sum_{l}(||{\bm{e}}_{l}||-1) KB completion, relation pattern inference
RESCAL [119] ||\sqrt{{\bm{A}}_{l}}({\bm{e}}_{u}-{\bm{e}}_{v})||^{2} \sum_{v}[||{\bm{e}}_{v}||-1]_{+}+\sum_{l}[||{\bm{A}}_{l}||_{F}-1]_{+} entity resolution, link prediction
DistMult [81] ||\sqrt{{\bm{A}}_{l}}({\bm{e}}_{u}-{\bm{e}}_{v})||^{2}, {\bm{A}}_{l}={\rm diag}({\bm{e}}_{l}) \sum_{v}(||{\bm{e}}_{v}||-1)+\sum_{l}[||{\bm{e}}_{l}||-1]_{+} KB completion, triplet classification
HolE [120] ||{\bm{e}}_{r}-\mathcal{F}^{-1}(\overline{\mathcal{F}({\bm{e}}_{u})}\odot\mathcal{F}({\bm{e}}_{v}))||^{2} \sum_{v}[||{\bm{e}}_{v}||-1]_{+}+\sum_{l}[||{\bm{e}}_{l}||-1]_{+}
ComplEx [121] ||{\bm{A}}_{l}{\bm{e}}_{u}-{\bm{e}}_{v}||^{2}, {\bm{A}}_{l}={\rm diag}({\bm{e}}_{l}), {\bm{e}}_{u},{\bm{e}}_{v},{\bm{e}}_{l}\in\mathbb{C}^{n} \sum_{v}||{\bm{e}}_{v}||^{2}+\sum_{l}||{\bm{e}}_{l}||^{2}
SimplE [122] \frac{1}{2}\Big{(}||\sqrt{{\bm{A}}_{l}}({\bm{e}}_{u}-{\bm{e}}_{v})||^{2}+||\sqrt{{\bm{A}}_{l^{-1}}}({\bm{e}}_{v}-{\bm{e}}_{u})||^{2}\Big{)}, {\bm{A}}_{l}={\rm diag}({\bm{e}}_{l}), {\bm{A}}_{l^{-1}}={\rm diag}({\bm{e}}_{l^{-1}})
TuckER [123] C-\mathcal{W}\times_{1}{\bm{e}}_{u}\times_{2}{\bm{e}}_{l}\times_{3}{\bm{e}}_{v} N/A
NTN [20] C-{\bm{e}}_{l}^{T}\tanh\Big{(}{\bm{e}}_{u}^{T}\mathcal{M}_{l}{\bm{e}}_{v}+{\bm{M}}_{l,1}{\bm{e}}_{u}+{\bm{M}}_{l,2}{\bm{e}}_{v}+{\bm{b}}_{l}\Big{)} ||\Theta||_{2}^{2}
ConvE [82] C-\sigma\Big{(}{\rm vec}\big{(}\sigma([{\bm{E}}_{u};{\bm{E}}_{l}]*\omega)\big{)}{\bm{W}}\Big{)}{\bm{e}}_{v} N/A
NKGE [83] same as TransE or ConvE, where {\bm{e}}_{u}=\sigma({\bm{g}}_{u})\odot{\bm{e}}_{u}^{s}+\big{(}1-\sigma({\bm{g}}_{u})\big{)}\odot{\bm{e}}_{u}^{n} ||\Theta||_{2}^{2}
SACN [84] C-f\Big{(}{\rm vec}\big{(}{\bm{M}}({\bm{e}}_{u},{\bm{e}}_{l})\big{)}{\bm{W}}\Big{)}{\bm{e}}_{v} N/A
TABLE III: A summary of relation-learning based HNE algorithms. (Additional notations: f: a non-linear activation function; [x]_{+}: \max\{x,0\}; ||\cdot||_{F}: the Frobenius norm of a matrix; \mathcal{F},\mathcal{F}^{-1}: the fast Fourier transform and its inverse; \overline{x}: the complex conjugate of x; l^{-1}: the inverse of relation l; \times_{i}: the tensor product along the i-th mode; \Theta: the set of learned parameters; {\bm{E}}_{u}, {\bm{E}}_{l}: 2D reshaping matrices of {\bm{e}}_{u} and {\bm{e}}_{l} [82]; {\rm vec}: the vectorization operator; *: the convolution operator; {\bm{M}}({\bm{e}}_{u},{\bm{e}}_{l}): a matrix aligning the output vectors from the convolution with all kernels [84].)

Some other message-passing approaches are summarized in Table II. For example, HEP [110] aggregates uu’s representation from 𝒩o(u)\mathcal{N}_{o}(u) (i.e., the node type-aware neighborhood) to reconstruct uu’s own embedding; HetGNN [71] also adopts a node type-aware neighborhood aggregation, but the neighborhood is defined through random walk with restart [113]; GATNE [70] aggregates uu’s representation from 𝒩l(u)\mathcal{N}_{l}(u) (i.e., the edge type-aware neighborhood) and is applicable to both transductive and inductive network embedding settings; MV-ACM [111] aggregates uu’s representation from 𝒩(u)\mathcal{N}_{\mathcal{M}}(u) (i.e., the meta-path-aware neighborhood) and utilizes an adversarial learning framework to learn the reciprocity between different edge types.

The objective of message-passing approaches mentioned above can also be written as Eq. (4) (except HGT, whose objective is the same as NTN [20] and will be discussed in Section 3.3), where s(u,v)s(u,v) is a function of 𝒆u{\bm{e}}_{u} and 𝒆v{\bm{e}}_{v}. The only difference is that 𝒆u{\bm{e}}_{u} here is aggregated from 𝒙v{\bm{x}}_{v} using GNNs. Following the derivation of proximity-preserving approaches, if we still write 𝒥R\mathcal{J}_{R} in Eq. (1) as the sum of 𝒥R0-\mathcal{J}_{R_{0}}, 𝒥R1-\mathcal{J}_{R_{1}} and 𝒥R2\mathcal{J}_{R_{2}}, we can get the exactly same 𝒥R1\mathcal{J}_{R_{1}} and 𝒥R2\mathcal{J}_{R_{2}} as in Eq. (5). We summarize the choices of wuvw_{uv}, d(𝒆u,𝒆v)d({\bm{e}}_{u},{\bm{e}}_{v}) and the aggregation function in Table II.

Within this group of algorithms, HEP has an additional reconstruction loss $\mathcal{J}_{R_{0}}=\sum_{v}||\tilde{{\bm{e}}}_{v}-{\bm{e}}_{v}||^{2}$, and MV-ACM [111] has an adversarial loss $\mathcal{J}_{R_{0}}=\sum_{v\in V,l\in\mathcal{T}_{E}}\mathbb{E}_{Z\sim\mathcal{G}(\cdot|l,v)}\big[\log(1-\mathcal{D}(Z,l,v))\big]$, where $\mathcal{G}$ and $\mathcal{D}$ are the generator and discriminator of complementary edge types, respectively. All the other algorithms in this group have $\mathcal{J}_{R_{0}}=0$.

Similar to the case of proximity-preserving approaches, there are also message-passing methods adopting other forms of objectives. For example, GraphInception [66] studies collective classification (where the labels within a group of instances are correlated and should be inferred collectively) using meta-path-aware aggregation; HDGI [86] maximizes local-global mutual information to improve unsupervised training based on HAN; NEP [74] performs semi-supervised node classification using an edge type-aware propagation function to mimic the process of label propagation; NLAH [124] also considers semi-supervised node classification by extending HAN with several pre-computed non-local features; HGCN [125] explores the collective classification task and improves GraphInception by considering semantics of different types of edges/nodes and relations among different node types; HetETA [126] studies a specific task of estimating the time of arrival in intelligent transportation using a heterogeneous graph network with fast localized spectral filtering.

3.3 Relation-Learning Methods

As discussed above, knowledge graphs can be regarded as a special case of heterogeneous networks, which are schema-rich [127]. To model network heterogeneity, existing knowledge graph embedding approaches explicitly model the relation types of edges via parametric algebraic operators, and often yield shallow network embeddings [79, 128]. Compared with the shallow proximity-preserving HNE models, they typically focus on the design of triplet-based scoring functions rather than meta-paths or meta-graphs, due to the large numbers of entity and relation types.

Each edge in a heterogeneous network can be viewed as a triplet $(u,l,v)$ composed of two nodes $u,v\in V$ and an edge type $l\in\mathcal{T}_{E}$ (i.e., entities and relations, in the terminology of KG). The goal of relation-learning methods is to learn a scoring function $s_{l}(u,v)$ which evaluates an arbitrary triplet and outputs a scalar to measure the acceptability of this triplet. This idea is widely adopted in KB embedding. Since there are surveys of KB embedding algorithms already [79], we only cover the most popular approaches here and highlight their connections to HNE.

TransE [4]. TransE assumes that the relation induced by $l$-labeled edges corresponds to a translation of the embedding (i.e., ${\bm{e}}_{u}+{\bm{e}}_{l}\approx{\bm{e}}_{v}$) when $(u,l,v)$ holds. Therefore, the scoring function of TransE is defined as

s_{l}(u,v)=-||{\bm{e}}_{u}+{\bm{e}}_{l}-{\bm{e}}_{v}||_{p},   (7)

where $p=1$ or $2$. The objective is to minimize a margin-based ranking loss.

\mathcal{J}=\sum_{(u,l,v)\in T}\sum_{(u^{\prime},l,v^{\prime})\in T^{\prime}_{(u,l,v)}}\max(0,\gamma-s_{l}(u,v)+s_{l}(u^{\prime},v^{\prime})),   (8)

where $T$ is the set of positive triplets (i.e., edges); $T^{\prime}_{(u,l,v)}$ is the set of corrupted triplets, which are constructed by replacing either $u$ or $v$ with an arbitrary node. Formally,

T^{\prime}_{(u,l,v)}=\{(u^{\prime},l,v)\,|\,u^{\prime}\in V\}\cup\{(u,l,v^{\prime})\,|\,v^{\prime}\in V\}.
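For concreteness, the following sketch evaluates the TransE score of Eq. (7) and the margin-based ranking loss of Eq. (8) for one positive triplet against tail-corrupted triplets; the toy embeddings, the margin value, and the single-triplet batch are assumptions made only for illustration.

```python
import numpy as np

d, n_nodes, gamma = 16, 50, 1.0                     # dimension, #nodes, margin (assumed)
E = np.random.randn(n_nodes, d)                     # node (entity) embeddings
R = {"writes": np.random.randn(d)}                  # relation (link type) embeddings

def score(u, l, v):
    # Eq. (7): s_l(u, v) = -||e_u + e_l - e_v||_p with p = 2
    return -np.linalg.norm(E[u] + R[l] - E[v], ord=2)

u, l, v = 3, "writes", 7                            # one positive triplet
corrupted = [(u, l, vp) for vp in range(n_nodes) if vp != v]   # replace the tail

# Eq. (8): margin-based ranking loss, summed over the corrupted triplets
loss = sum(max(0.0, gamma - score(u, l, v) + score(up, l, vp))
           for (up, _, vp) in corrupted)
print(loss)
```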

TransE is the most representative model using “a translational distance” to define the scoring function. It has many extensions. For example, TransH [118] projects each entity vector to a relation-specific hyperplane when calculating the distance; TransR [80] further extends relation-specific hyperplanes to relation-specific spaces; RHINE [76] distinguishes affiliation relations from interaction relations and adopts different objectives for the two types of relations. For more extensions of TransE, please refer to [83].

Recently, Sun et al. [114] propose the RotatE model, which defines each relation as a rotation (instead of a translation) from the source entity to the target entity in the complex vector space. Their model is able to describe various relation patterns including symmetry/antisymmetry, inversion, and composition.

DistMult [81]. In contrast to translational distance models [4, 118, 80], DistMult exploits a similarity based scoring function. Each relation is represented by a diagonal matrix ${\bm{A}}_{l}={\rm diag}({\bm{e}}_{l1},...,{\bm{e}}_{ld})$, and $s_{l}(u,v)$ is defined using a bilinear function:

s_{l}(u,v)={\bm{e}}_{u}^{T}{\bm{A}}_{l}{\bm{e}}_{v}.

Note that $s_{l}(u,v)=s_{l}(v,u)$ for any $u$ and $v$. Therefore, DistMult is mainly designed for symmetric relations.

Using the equation $||{\bm{x}}-{\bm{y}}||^{2}=({\bm{x}}-{\bm{y}})^{T}({\bm{x}}-{\bm{y}})=||{\bm{x}}||^{2}+||{\bm{y}}||^{2}-2{\bm{x}}^{T}{\bm{y}}$, we have

s_{l}(u,v)=(\sqrt{{\bm{A}}_{l}}{\bm{e}}_{u})^{T}(\sqrt{{\bm{A}}_{l}}{\bm{e}}_{v})=\frac{1}{2}\big(||\sqrt{{\bm{A}}_{l}}{\bm{e}}_{u}||^{2}+||\sqrt{{\bm{A}}_{l}}{\bm{e}}_{v}||^{2}-||\sqrt{{\bm{A}}_{l}}({\bm{e}}_{u}-{\bm{e}}_{v})||^{2}\big).

ComplEx [121]. Instead of considering a real-valued embedding space, ComplEx introduces complex-valued representations for ${\bm{e}}_{u}$, ${\bm{e}}_{v}$ and ${\bm{e}}_{l}$. Similar to DistMult, it utilizes a similarity based scoring function:

s_{l}(u,v)={\rm Re}({\bm{e}}_{u}^{T}{\bm{A}}_{l}\overline{{\bm{e}}_{v}}),

where ${\rm Re}(\cdot)$ is the real part of a complex number; $\overline{{\bm{e}}}$ is the complex conjugate of ${\bm{e}}$; ${\bm{A}}_{l}={\rm diag}({\bm{e}}_{l1},...,{\bm{e}}_{ld})$. Here, it is possible that $s_{l}(u,v)\neq s_{l}(v,u)$, which allows ComplEx to capture asymmetric relations.

In the complex space, using the equation $||{\bm{x}}-{\bm{y}}||^{2}=({\bm{x}}-{\bm{y}})^{T}(\overline{{\bm{x}}-{\bm{y}}})=||{\bm{x}}||^{2}+||{\bm{y}}||^{2}-{\bm{x}}^{T}\overline{{\bm{y}}}-{\bm{y}}^{T}\overline{{\bm{x}}}=||{\bm{x}}||^{2}+||{\bm{y}}||^{2}-2{\rm Re}({\bm{x}}^{T}\overline{{\bm{y}}})$, we have

s_{l}(u,v)={\rm Re}(({\bm{A}}_{l}{\bm{e}}_{u})^{T}\overline{{\bm{e}}_{v}})=\frac{1}{2}\big(||{\bm{A}}_{l}{\bm{e}}_{u}||^{2}+||{\bm{e}}_{v}||^{2}-||{\bm{A}}_{l}{\bm{e}}_{u}-{\bm{e}}_{v}||^{2}\big).
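The contrast between the symmetric DistMult score and the asymmetric ComplEx score can be verified directly; the toy (complex-valued) embeddings in the following sketch are assumptions made only for illustration.

```python
import numpy as np

d = 8
e_u, e_v = np.random.randn(d), np.random.randn(d)
e_l = np.random.randn(d)                           # DistMult relation vector, A_l = diag(e_l)
c_u = np.random.randn(d) + 1j * np.random.randn(d)
c_v = np.random.randn(d) + 1j * np.random.randn(d)
c_l = np.random.randn(d) + 1j * np.random.randn(d)  # ComplEx relation vector

def distmult(eu, el, ev):
    # s_l(u, v) = e_u^T diag(e_l) e_v
    return np.sum(eu * el * ev)

def complex_score(eu, el, ev):
    # s_l(u, v) = Re(e_u^T diag(e_l) conj(e_v))
    return np.real(np.sum(eu * el * np.conj(ev)))

print(np.isclose(distmult(e_u, e_l, e_v), distmult(e_v, e_l, e_u)))             # True: symmetric
print(np.isclose(complex_score(c_u, c_l, c_v), complex_score(c_v, c_l, c_u)))   # generally False: asymmetric
```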

Besides ComplEx, there are many other extensions of DistMult. RESCAL [119] uses a bilinear scoring function similar to the one in DistMult, but ${\bm{A}}_{l}$ is no longer restricted to be diagonal; HolE [120] composes the node representations using the circular correlation operation, which combines the expressive power of RESCAL with the efficiency of DistMult; SimplE [122] considers the inverse of relations and calculates the average score of $(u,l,v)$ and $(v,l^{-1},u)$; TuckER [123] proposes to use a three-way Tucker tensor decomposition approach to learning node and relation embeddings.

ConvE [82]. ConvE goes beyond simple distance or similarity functions and proposes deep neural models to score a triplet. The score is defined by a convolution over 2D shaped embeddings. Formally,

s_{l}(u,v)=\sigma({\rm vec}(\sigma([{\bm{E}}_{u};{\bm{E}}_{r}]*\omega)){\bm{W}}){\bm{e}}_{v},

where ${\bm{E}}_{u}$ and ${\bm{E}}_{r}$ denote the 2D reshaping matrices of the node embedding and relation embedding, respectively; ${\rm vec}$ is the vectorization operator that maps an $m$ by $n$ matrix to an $mn$-dimensional vector; “$*$” is the convolution operator.
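A minimal PyTorch sketch of this convolutional scoring function is shown below; the embedding dimension, the 2D reshaping shape, the number of filters, and the choice of ReLU for the inner nonlinearity and logistic sigmoid for the outer one are illustrative assumptions, not the hyperparameters of the original ConvE implementation.

```python
import torch
import torch.nn.functional as F

d, h, w = 200, 10, 20                     # embedding dim and 2D reshaping with h * w = d (assumed)
e_u, e_l, e_v = torch.randn(d), torch.randn(d), torch.randn(d)
kernels = torch.randn(32, 1, 3, 3)        # 32 assumed convolution filters (omega)
W = torch.randn(32 * 2 * h * w, d)        # projection back to the embedding space

# stack the 2D reshapings E_u and E_l, convolve, project, and score against e_v
x = torch.cat([e_u.view(1, 1, h, w), e_l.view(1, 1, h, w)], dim=2)   # (1, 1, 2h, w)
x = torch.relu(F.conv2d(x, kernels, padding=1))                      # inner nonlinearity over [E_u; E_l] * omega
s = torch.sigmoid(x.reshape(-1) @ W) @ e_v                           # s_l(u, v)
print(s.item())
```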

There are several other models leveraging deep neural scoring functions. For example, NTN [20] proposes to combine the two node embedding vectors by a relation-specific tensor $\mathcal{M}_{l}\in\mathbb{R}^{d\times d\times d_{l}}$, where $d_{l}$ is the dimension of ${\bm{e}}_{l}$; NKGE [83] develops a deep memory network to encode information from neighbors and employs a gating mechanism to integrate the structure representation ${\bm{e}}_{u}^{s}$ and the neighbor representation ${\bm{e}}_{u}^{n}$; SACN [84] encodes node representations using a graph neural network and then scores the triplet using a convolutional neural network with the translational property.

Most relation-learning approaches adopt a margin-based ranking loss with some regularization terms, which generalizes Eq. (8):

\sum_{(u,l,v)}w_{uv}^{(l)}\sum_{(u^{\prime},l,v^{\prime})}\max(0,\gamma-s_{l}(u,v)+s_{l}(u^{\prime},v^{\prime}))+\mathcal{J}_{R0}.   (9)

In [129], Qiu et al. point out that the margin based loss shares a very similar form with the following negative sampling loss:

-\sum_{(u,l,v)}\Big(\log(\sigma(s_{l}(u,v)))-b\,\mathbb{E}_{(u^{\prime},l,v^{\prime})}\big[\log(\sigma(s_{l}(u^{\prime},v^{\prime})))\big]\Big).
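To make the structural similarity between Eq. (9) and this negative sampling loss concrete, the following sketch evaluates both objectives on the same toy scores; the score values, the margin, and the number of negative samples are assumptions made only for illustration.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
gamma, b = 1.0, 5                                   # margin and #negative samples (assumed)
s_pos = 2.3                                         # s_l(u, v) for one positive triplet (toy value)
s_neg = np.random.randn(b) - 1.0                    # s_l(u', v') for corrupted triplets (toy values)

# margin-based ranking loss of Eq. (9), for a single positive triplet with unit weight
margin_loss = np.sum(np.maximum(0.0, gamma - s_pos + s_neg))

# negative sampling loss following the form displayed above (expectation approximated by the mean)
neg_sampling_loss = -(np.log(sigmoid(s_pos)) - b * np.mean(np.log(sigmoid(s_neg))))
print(margin_loss, neg_sampling_loss)
```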

Following [129], if we use the negative sampling loss to rewrite Eq. (9), we are approximately maximizing

\mathcal{J}=\sum_{(u,l,v)}w_{uv}^{(l)}\log\frac{\exp(s_{l}(u,v))}{\sum_{(u^{\prime},l,v^{\prime})}\exp(s_{l}(u^{\prime},v^{\prime}))}+\mathcal{J}_{R0}.

For translational models [4, 118, 80, 76] whose $s_{l}(u,v)$ is described by a distance function, maximizing $\mathcal{J}$ is equivalent to

\begin{split}\min-\mathcal{J}&=\sum_{(u,l,v)}w_{uv}^{(l)}\underbrace{||{\bm{e}}_{u}+{\bm{e}}_{l}-{\bm{e}}_{v}||^{1\text{ or }2}}_{d({\bm{e}}_{u},{\bm{e}}_{v})}-\mathcal{J}_{R0}\\&+\underbrace{\sum_{(u,l,v)}w_{uv}^{(l)}\log\sum_{(u^{\prime},l,v^{\prime})}\exp(s_{l}(u^{\prime},v^{\prime}))}_{\mathcal{J}_{R1}}.\end{split}   (10)

In this case, we can write $\mathcal{J}_{R}$ as $-\mathcal{J}_{R0}+\mathcal{J}_{R1}$. For RotatE [114], the objective is the same except that $d({\bm{e}}_{u},{\bm{e}}_{v})=||{\bm{e}}_{u}\odot{\bm{e}}_{l}-{\bm{e}}_{v}||^{2}$.

For similarity based approaches [119, 81, 121, 122], we follow the derivation of Eq. (5), and the objective will be

\begin{split}\min-\mathcal{J}&=\sum_{(u,l,v)}\frac{w^{(l)}_{uv}}{2}\underbrace{||f({\bm{e}}_{u})-f({\bm{e}}_{v})||^{2}}_{d({\bm{e}}_{u},{\bm{e}}_{v})}-\mathcal{J}_{R0}\\&-\underbrace{\sum_{(u,l,v)}\frac{w^{(l)}_{uv}}{2}\Big(||f({\bm{e}}_{u})||^{2}+||f({\bm{e}}_{v})||^{2}\Big)}_{\mathcal{J}_{R1}}\\&+\underbrace{\sum_{(u,l,v)}w^{(l)}_{uv}\log\sum_{(u^{\prime},l,v^{\prime})}\exp(s_{l}(u^{\prime},v^{\prime}))}_{\mathcal{J}_{R2}}.\end{split}

Here, $f({\bm{e}}_{u})=\sqrt{{\bm{A}}_{l}}{\bm{e}}_{u}$ in RESCAL [119] and DistMult [81]; $f({\bm{e}}_{u})=\sqrt{{\bm{A}}_{l}}{\bm{e}}_{u}$ or $\sqrt{{\bm{A}}_{l^{-1}}}{\bm{e}}_{u}$ in SimplE [122]; $f({\bm{e}}_{u})={\bm{A}}_{l}{\bm{e}}_{u}$ and $f({\bm{e}}_{v})={\bm{e}}_{v}$ in ComplEx [121]. The regularization term is $\mathcal{J}_{R}=-\mathcal{J}_{R0}-\mathcal{J}_{R1}+\mathcal{J}_{R2}$.

For neural triplet scorers [82, 20, 21, 84], the forms of $s_{l}(u,v)$ are more complicated than translational distances or bilinear products. In these cases, since distance (or dissimilarity) and proximity can be viewed as reverse metrics, we define $d({\bm{e}}_{u},{\bm{e}}_{v})$ as $C-s_{l}(u,v)$, where $C$ is a constant upper bound of $s_{l}(\cdot,\cdot)$. Then the derivation of the loss function is similar to that of Eq. (10), i.e.,

\begin{split}\min-\mathcal{J}&=\sum_{(u,l,v)}w_{uv}^{(l)}\underbrace{\Big(C-s_{l}(u,v)\Big)}_{d({\bm{e}}_{u},{\bm{e}}_{v})}-\mathcal{J}_{R0}\\&+\underbrace{\sum_{(u,l,v)}w_{uv}^{(l)}\log\sum_{(u^{\prime},l,v^{\prime})}\exp(s_{l}(u^{\prime},v^{\prime}))}_{\mathcal{J}_{R1}}.\end{split}

In this case, $\mathcal{J}_{R}=-\mathcal{J}_{R0}+\mathcal{J}_{R1}$.

We summarize the choices of $w_{uv}^{(l)}$, $d({\bm{e}}_{u},{\bm{e}}_{v})$ and $\mathcal{J}_{R0}$ in Table III. Note that for relation learning methods, $d(\cdot,\cdot)$ is usually not a distance metric. For example, $d({\bm{e}}_{u},{\bm{e}}_{v})\neq d({\bm{e}}_{v},{\bm{e}}_{u})$ in most translational distance models and deep neural models. This is intuitive because $(u,l,v)$ and $(v,l,u)$ often express different meanings.

4 Benchmark

4.1 Dataset Preparation

Towards off-the-shelf evaluation of HNE algorithms with standard settings, in this work, we collect, process, analyze, and publish four new real-world heterogeneous network datasets from different domains, which we aim to set up as a handy and fair benchmark for existing and future HNE algorithms.

Dataset | #node types | #nodes | #link types | #links | #attributes | #attributed nodes | #label types | #labeled nodes
DBLP | 4 | 1,989,077 | 6 | 275,940,913 | 300 | ALL | 13 | 618
Yelp | 4 | 82,465 | 4 | 30,542,675 | N/A | N/A | 16 | 7,417
Freebase | 8 | 12,164,758 | 36 | 62,982,566 | N/A | N/A | 8 | 47,190
PubMed | 4 | 63,109 | 10 | 244,986 | 200 | ALL | 8 | 454
TABLE IV: A summary of the statistics on four real-world heterogeneous network datasets.
Figure 3: Portion of different node types in four real-world heterogeneous network datasets (panels (a)-(d): DBLP, Yelp, Freebase, PubMed).
Figure 4: Degree distribution of different node types in four real-world heterogeneous network datasets (panels (a)-(d): DBLP, Yelp, Freebase, PubMed).
Figure 5: Local clustering coefficient of different node types in four real-world heterogeneous network datasets (panels (a)-(d): DBLP, Yelp, Freebase, PubMed).
Figure 6: Number of five most frequent 2-hop meta-paths in four real-world heterogeneous network datasets (panels (a)-(d): DBLP, Yelp, Freebase, PubMed).

DBLP.

We construct a network of authors, papers, venues, and phrases from DBLP. Phrases are extracted by the popular AutoPhrase [130] algorithm from paper texts and further filtered by human experts. We compute word2vec [99] on all paper texts and aggregate the word embeddings to get 300-dim paper and phrase features. Author and venue features are the aggregations of their corresponding paper features. We further manually label a relatively small portion of authors into 12 research groups from four research areas by crawling the web. Each labeled author has only one label.
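As a rough sketch of this feature-construction step (assuming gensim 4.x), the snippet below trains word2vec on tokenized paper texts and mean-pools the word vectors into a 300-dim paper feature; the toy corpus, the hyperparameters, and the plain mean-pooling are illustrative simplifications rather than the exact pipeline used to build the dataset.

```python
import numpy as np
from gensim.models import Word2Vec

# toy "paper texts", already tokenized; the real corpus is all DBLP paper texts
papers = [["heterogeneous", "network", "embedding", "survey"],
          ["graph", "neural", "network", "benchmark"]]

w2v = Word2Vec(sentences=papers, vector_size=300, window=5, min_count=1, epochs=10)

def paper_feature(tokens):
    # aggregate (mean-pool) word embeddings into a 300-dim paper feature
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0)

x_paper = paper_feature(papers[0])
print(x_paper.shape)   # (300,)
```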

Yelp.

We construct a network of businesses, users, locations, and reviews from Yelp (https://www.yelp.com/dataset/challenge). Nodes do not have features, but a large portion of businesses are labeled into sixteen categories. Each labeled business has one or multiple labels.

Freebase.

We construct a network of books, films, music, sports, people, locations, organizations, and businesses from Freebase (http://www.freebase.com/). Nodes are not associated with any features, but a large portion of books are labeled into eight genres of literature. Each labeled book has only one label.

PubMed.

We construct a network of genes, diseases, chemicals, and species from PubMed (https://www.ncbi.nlm.nih.gov/pubmed/). All nodes are extracted by AutoPhrase [130], typed by bioNER [131], and further filtered by human experts. The links are constructed through open relation pattern mining [132] and manual selection. We compute word2vec [99] on all PubMed papers and aggregate the word embeddings to get 200-dim features for all types of nodes. We further label a relatively small portion of diseases into eight categories. Each labeled disease has only one label.

The four datasets we prepare are from four different domains, which have been individually studied in some existing works on HNE. Among them, DBLP has been the most commonly used, because all information about authors, papers, etc. is public with no privacy issues, and the results are often more interpretable to researchers in computer science related domains. Other real-world networks like Yelp and IMDB have been commonly studied for recommender systems. These networks are naturally heterogeneous, including at least users and items (e.g., businesses and movies), as well as some additional item descriptors (e.g., categories of businesses and genres of movies). Freebase is one of the most popular open-source knowledge graphs, which is relatively smaller but cleaner compared with the others (e.g., YAGO [133] and Wikidata [134]), where most entity and relation types are well defined. One major difference between conventional heterogeneous networks and knowledge graphs is the number of types of nodes and links. We further restrict the types of entities and relations inside Freebase, so as to get a heterogeneous network that is closer to a knowledge graph while not having too many types of nodes and links. Therefore, most conventional HNE algorithms can be applied to this dataset and properly compared against the KB embedding ones. PubMed is a novel biomedical network that we directly construct through text mining and manual processing on biomedical literature. This is the first time we make it available to the public, and we hope it will serve both the evaluation of HNE algorithms and novel downstream tasks in biomedical science such as biomedical information retrieval and disease evolution study.

We notice that several other heterogeneous network datasets such as OAG [135, 85, 87] and IMDB [75, 112, 125] have been constructed recently in parallel to this work. Due to their similar nature and organization as some of our datasets (e.g., DBLP, Yelp), our pipeline can be easily applied to these datasets, so we do not duplicate them here.

4.2 Structure Analysis

A summary of the statistics on the four datasets is provided in Table IV and Figure 3. As can be observed, the datasets have different sizes (numbers of nodes and links) and heterogeneity (numbers and distributions of node/link types). Moreover, due to the nature of the data sources, DBLP and PubMed networks are attributed, whereas Yelp and Freebase networks are abundantly labeled. A combination of these four datasets thus allows researchers to flexibly start by testing an HNE algorithm in the most appropriate settings, and eventually complete an all-around evaluation over all settings.

We also provide detailed analysis regarding several of the most widely concerned properties of heterogeneous networks, i.e., degree distribution (Figure 4), clustering coefficient (Figure 5), and number of frequent meta-paths (Figure 6). In particular, degree distribution is known to significantly influence the performance of HNE algorithms due to the widely used node sampling process, whereas clustering coefficient impacts HNE algorithms that utilize latent community structures. Moreover, since many HNE algorithms rely on meta-paths, a more skewed distribution of meta-paths can bias evaluation towards algorithms using fewer meta-paths.
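These type-wise statistics can be reproduced from the released node and edge lists with standard tooling; below is a minimal networkx sketch on a toy typed graph, where the toy edges and the "type" attribute name are assumptions made only for illustration.

```python
import collections
import networkx as nx

G = nx.Graph()
# toy typed nodes and edges; the benchmark datasets ship as typed node and edge lists
G.add_nodes_from([(0, {"type": "author"}), (1, {"type": "paper"}),
                  (2, {"type": "paper"}), (3, {"type": "venue"})])
G.add_edges_from([(0, 1), (0, 2), (1, 3), (2, 3)])

# degree distribution per node type (cf. Figure 4)
deg_by_type = collections.defaultdict(list)
for v, data in G.nodes(data=True):
    deg_by_type[data["type"]].append(G.degree(v))
print(dict(deg_by_type))

# average local clustering coefficient per node type (cf. Figure 5)
cc = nx.clustering(G)
cc_by_type = collections.defaultdict(list)
for v, data in G.nodes(data=True):
    cc_by_type[data["type"]].append(cc[v])
print({t: sum(vals) / len(vals) for t, vals in cc_by_type.items()})
```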

As we can see, the properties of concern differ considerably across the four datasets we prepare. For example, there are tighter links and more labels in Yelp, while there are more types of nodes and links in Freebase; compared with nodes in Freebase and PubMed, which clearly follow long-tail degree distributions, certain types of nodes in DBLP and Yelp are always well connected (e.g., phrases in DBLP and businesses in Yelp), forming more star-shaped subgraphs; the type-wise clustering coefficients and meta-path distributions are the most skewed in DBLP and the most balanced in PubMed. Together, the four datasets provide a comprehensive benchmark towards the robustness and generalizability of various HNE algorithms (as we will also see in Section 5.2).

4.3 Settings, Tasks, and Metrics

We mainly compare all 13 algorithms under the setting of unsupervised unattributed HNE over all datasets, where the essential goal is to preserve different types of edges in the heterogeneous networks. Moreover, for message-passing algorithms that are particularly designed for attributed and semi-supervised HNE, we also conduct additional experiments in the corresponding settings. Particularly, due to the nature of the datasets, we evaluate attributed HNE on the DBLP and PubMed datasets where node attributes are available, and semi-supervised HNE on Yelp and Freebase where node labels are abundant. We always test the computed network embeddings on the two standard network mining tasks of node classification and link prediction. Note that, while most existing studies on novel HNE algorithms have been focusing on these two standard tasks [59, 63, 103, 61, 75, 112, 85], we notice that there are also various other tasks due to the wide usage of heterogeneous networks in real-world applications [11, 136, 137, 138, 139, 140, 126]. While the performance of HNE there can be rather task-dependent and data-dependent, first simplifying the task into either a standard node classification or link prediction problem can often provide more insights into the task and dataset, which can help the further development of novel HNE algorithms.

For the standard unattributed unsupervised HNE setting, we first randomly hide 20% of the links and train all HNE algorithms with the remaining 80%. For node classification, we then train a separate linear Support Vector Machine (LinearSVC) [141] based on the learned embeddings on 80% of the labeled nodes and predict on the remaining 20%. We repeat the process five times and compute the average scores regarding macro-F1 (across all labels) and micro-F1 (across all nodes). For link prediction, we use the Hadamard function to construct feature vectors for node pairs, train a two-class LinearSVC on the 80% training links, and evaluate towards the 20% held-out links. We also repeat the process five times and compute the two metrics of AUC (area under the ROC curve) and MRR (mean reciprocal rank). AUC is a standard measure for classification, where we regard link prediction as a binary classification problem, and MRR is a standard measure for ranking, where we regard link prediction as a link retrieval problem. Since exhaustive computation over all node pairs is too heavy, we always use the two-hop neighbors as the candidates for all nodes. For attributed HNE, node features are used during the training of HNE algorithms, whereas for semi-supervised HNE, certain amounts of node labels are used (80% by default).
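A minimal sklearn sketch of this evaluation protocol is given below; the toy embeddings, labels, and candidate node pairs are assumptions for illustration, while the full protocol (five repetitions, MRR over two-hop candidates) follows the description above.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

emb = np.random.randn(200, 50)                    # learned node embeddings (toy)
labels = np.random.randint(0, 8, size=200)        # node labels (toy)

# node classification: LinearSVC on 80% of the labeled nodes, macro-/micro-F1 on the rest
X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, test_size=0.2, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f1_score(y_te, pred, average="macro"), f1_score(y_te, pred, average="micro"))

# link prediction: Hadamard product of node embeddings as pair features, two-class LinearSVC, AUC
pos = [(i, i + 1) for i in range(0, 100, 2)]      # toy positive links
neg = [(i, i + 7) for i in range(0, 100, 2)]      # toy non-links
pairs = pos + neg
y = np.array([1] * len(pos) + [0] * len(neg))
X = np.array([emb[u] * emb[v] for u, v in pairs]) # Hadamard features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
lp = LinearSVC().fit(X_tr, y_tr)
print(roc_auc_score(y_te, lp.decision_function(X_te)))
```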

5 Experimental Evaluations

Model | Node classification (Macro-F1/Micro-F1) | | | | Link prediction (AUC/MRR) | | |
 | DBLP | Yelp | Freebase | PubMed | DBLP | Yelp | Freebase | PubMed
metapath2vec | 43.85/55.07 | 5.16/23.32 | 20.55/46.43 | 12.90/15.51 | 65.26/90.68 | 80.52/99.72 | 56.14/78.24 | 69.38/84.79
PTE | 43.34/54.53 | 5.10/23.24 | 10.25/39.87 | 09.74/12.27 | 57.72/77.51 | 50.32/68.84 | 57.89/78.23 | 70.36/89.54
HIN2Vec | 12.17/25.88 | 5.12/23.25 | 17.40/41.92 | 10.93/15.31 | 53.29/75.47 | 51.64/66.71 | 58.11/81.65 | 69.68/84.48
AspEm | 33.07/43.85 | 5.40/23.82 | 23.26/45.42 | 11.19/14.44 | 67.20/91.46 | 76.10/95.18 | 55.80/77.70 | 68.31/87.43
HEER | 09.72/27.72 | 5.03/22.92 | 12.96/37.51 | 11.73/15.29 | 53.00/72.76 | 73.72/95.92 | 55.78/78.31 | 69.06/88.42
R-GCN | 09.38/13.39 | 5.10/23.24 | 06.89/38.02 | 10.75/12.73 | 50.50/73.35 | 72.17/97.46 | 50.18/74.01 | 63.33/81.19
HAN | 07.91/16.98 | 5.10/23.24 | 06.90/38.01 | 09.54/12.18 | 50.24/73.10 | N/A | 51.50/74.13 | 65.85/85.33
MAGNN | 06.74/10.35 | 5.10/23.24 | 06.89/38.02 | 10.30/12.60 | 50.10/73.26 | 50.03/69.81 | 50.12/74.18 | 61.11/90.01
HGT | 15.17/32.05 | 5.07/23.12 | 23.06/46.51 | 11.24/18.72 | 59.98/83.13 | 79.00/99.66 | 55.68/79.46 | 73.00/88.05
TransE | 22.76/37.18 | 5.05/23.03 | 31.83/52.04 | 11.40/15.16 | 63.53/86.29 | 69.13/83.66 | 52.84/75.80 | 67.95/84.69
DistMult | 11.42/25.07 | 5.04/23.00 | 23.82/45.50 | 11.27/15.79 | 52.87/74.84 | 80.28/99.73 | 54.91/78.04 | 70.61/90.64
ComplEx | 20.48/37.34 | 5.05/23.03 | 35.26/52.03 | 09.84/18.51 | 65.92/90.01 | 80.11/99.73 | 60.43/84.22 | 75.96/92.47
ConvE | 12.42/26.42 | 5.09/23.02 | 24.57/47.61 | 13.00/14.49 | 54.03/75.31 | 78.55/99.70 | 54.29/76.11 | 71.81/89.82
TABLE V: Performance comparison (%) under the standard setting of unattributed unsupervised HNE.
Figure 7: Performance comparison under controlled experiments with varying emb. sizes and link removals (PubMed).
Figure 8: Performance comparison under controlled experiments with varying label amount and attr. noise (PubMed).

5.1 Algorithms and Modifications

We amend the implementations of 13 popular HNE algorithms for seamless and efficient experimental evaluations on our prepared datasets. The algorithms we choose and the modifications we make are as follows.

  • metapath2vec [59]: Since the original implementation contains a large amount of hard-coded data-specific settings such as node types and meta-paths, and the optimization is unstable and limited as it only examines one type of meta-path based context, we completely reimplement the algorithm. In particular, we first run random walks to learn the weights of different meta-paths based on the number of sampled instances, and then train the model using the unified loss function, which is a weighted sum over the loss functions of individual meta-paths (a minimal sketch of such a meta-path-guided walk is given after this list). Both the random walk and the meta-path-based embedding optimization are implemented with multiple threads in parallel.

  • PTE [103]: Instead of accepting labeled texts as input and working on text networks with the specific three types of nodes (word, document, and label) and three types of links (word-word, document-word, and label-word), we revise the original implementation and allow the model to consume heterogeneous networks directly with arbitrary types of nodes and links by adding more type-specific objective functions.

  • HIN2Vec [63]: We remove unnecessary data preprocessing codes and modify the original implementation so that the program first generates random walks, then trains the model, and finally outputs node embeddings only.

  • AspEm [60]: We clean up the hard-coded data-specific settings in the original implementation and write a script to connect the different components of automatically selecting the aspects with the least incompatibilities, as well as learning, matching, and concatenating the embeddings based on different aspects.

  • HEER [61]: We remove the hard-coded data-specific settings and largely simplify the data preprocessing step in the original implementation by skipping the knockout step and disentangling the graph building step.

  • R-GCN [67]: The existing implementation from DGL [142] is only scalable to heterogeneous networks with thousands of nodes, due to the requirement of putting the whole graph into memory during graph convolutions. To scale up R-GCN, we perform fixed-size node and link sampling for batch-wise training following the framework of GraphSAGE [34].

  • HAN [75]: Since the original implementation of HAN contains a large amount of hard-coded data-specific settings such as node types and meta-paths, and is infeasible for large-scale datasets due to the same reason as R-GCN, we completely reimplement the HAN algorithm based on our implementation of R-GCN. In particular, we first automatically construct meta-path based adjacency lists for the chosen node type, and then sample the neighborhood for the seed nodes during batch-wise training.

  • MAGNN [112]: The original DGL-based [142] unsupervised implementation of MAGNN solely looks at predicting the links between users and artists in a music website dataset with the use of a single neural layer. Hence, we remove all the hard-coded data-specific settings and refactor the entire pipeline so that the model can support multi-layer mini-batch training over arbitrary link types.

  • HGT [85]: The original PyG-based [143] implementation of HGT targets the specific task of author disambiguation among the associated papers in a dynamic academic graph [135]. Therefore, we refactor it by removing hard-coded data-specific settings, assigning the same timestamp to all the nodes, and conducting training over all types of links.

  • TransE [4]: We modify the OpenKE [144] implementation so that the model outputs node embeddings only.

  • DistMult [81]: We remove the hard-coded data-specific settings and largely simplify the data preprocessing step in the original implementation.

  • ComplEx [121]: Same as for TransE.

  • ConvE [82]: Same as for DistMult.
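As referenced in the metapath2vec item above, the following is a minimal sketch of a meta-path-guided random walk; the toy typed graph and the author–paper meta-path are illustrative assumptions, and the released implementation additionally learns meta-path weights from the sampled instances and parallelizes the walks.

```python
import random

node_type = {0: "author", 1: "paper", 2: "author", 3: "paper", 4: "venue"}
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 3], 3: [0, 2, 4], 4: [1, 3]}

def metapath_walk(start, metapath, length):
    # follow the meta-path type pattern cyclically, e.g. ("author", "paper") -> A-P-A-P-...
    walk = [start]
    for step in range(1, length):
        wanted = metapath[step % len(metapath)]
        candidates = [v for v in adj[walk[-1]] if node_type[v] == wanted]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

random.seed(0)
print(metapath_walk(0, ("author", "paper"), length=7))   # e.g. an A-P-A-P-A-P-A walk
```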

We set the embedding size of all algorithms to 50 by default, and tune other hyperparameters following the original papers through standard five-fold cross validation on all datasets. We have put the implementations of all compared algorithms in a python package and released them together with the datasets to constitute an open-source ready-to-use HNE benchmark.

5.2 Performance Benchmarks

We provide systematic experimental comparisons of the 13 popular state-of-the-art HNE algorithms across our four datasets, on the scenarios of unsupervised unattributed HNE, attributed HNE, and semi-supervised HNE.

Table V shows the performance of compared algorithms on unsupervised unattributed HNE, evaluated towards node classification and link prediction. We have the following observations.

From the perspective of compared algorithms:

(1) Proximity-preserving algorithms often perform well on both tasks under the unsupervised unattributed HNE setting, which explains why proximity-preserving is the most widely used HNE (or even general network embedding) framework when node attributes and labels are unavailable. Among the proximity-preserving methods, HIN2Vec and HEER show reasonable results on link prediction but do not perform as well on node classification (especially on DBLP and Freebase). In fact, these two methods focus on modeling link representations in their objectives (${\bm{A}}_{\mathcal{M}}$ in HIN2Vec and ${\bm{A}}_{l}$ in HEER), and are thus more suitable for link prediction.

(2) Under the unsupervised unattributed HNE setting, message-passing methods perform poorly except for HGT, especially on node classification. As we discussed before, message-passing methods are known to excel due to their integration of node attributes, link structures, and training labels. When neither node attributes nor labels are available, we use random vectors as node features and adopt a link prediction loss, which largely limits the performance of R-GCN, HAN, and MAGNN. We will focus our evaluation of the message-passing algorithms on the attributed and semi-supervised HNE settings later. On the contrary, HGT exhibits competitive results on both node classification and link prediction. This is attributed to the usage of node-type- and link-type-dependent parameters, which maintains dedicated representations for different types of nodes and links. In addition, the heterogeneous mini-batch graph sampling algorithm designed by HGT further reduces the loss of structural information due to sampling and boosts the performance to a greater extent. Finally, the link prediction result of HAN on the Yelp dataset is not available. This is because HAN can only embed one type of nodes at a time (we embed Business in Yelp) and thus can only predict links between two nodes of the same type (i.e., Business-Business). However, all links in Yelp connect distinct types of nodes (e.g., Business-Location, Business-User), and HAN cannot predict such links (thus marked as N/A).

(3) Relation-learning methods such as TransE and ComplEx perform better on Freebase and PubMed on both tasks, especially on link prediction. In fact, in Table IV and Figure 3 we can observe that both datasets (especially Freebase) have more link types. Relation-learning approaches, which are mainly designed to embed knowledge graphs (e.g., Freebase), can better capture the semantics of numerous types of direct links.

From the perspective of datasets:

(1) All approaches have relatively low F1 scores on Yelp and PubMed (especially Yelp) on node classification. This is because both datasets have larger numbers of classes (i.e., 16 in Yelp and 8 in PubMed) as shown in Table IV. Moreover, unlike the cases of the other datasets, a node in Yelp can have multiple labels, which makes the classification task more challenging.

(2) In Figure 4, we can observe that the degree distribution of Freebase is more skewed. Therefore, when we conduct link sampling or random walks on Freebase during representation learning, nodes with lower degrees will be sampled less frequently and their representations may not be learned accurately. This observation may explain why the link prediction metrics on Freebase are in general lower than those on Yelp and PubMed.

(3) As we can see in Figure 3-6, most studied network properties are more balanced on Freebase and PubMed (especially PubMed) across different types of nodes and links. This in general makes both the node classification and link prediction tasks harder for all algorithms, and also makes the gaps among different algorithms smaller.

5.3 Ablation Studies

To provide an in-depth performance comparison among various HNE algorithms, we further conduct controlled experiments by varying the embedding sizes and randomly removing links from the training set.

In Figure 7, we show the micro-F1 scores for node classification and the AUC scores for link prediction computed on the PubMed dataset. We omit the other results here, which can be easily computed with our provided benchmark package. As we can observe, some algorithms are more robust to varying settings while others are more sensitive. In general, varying the embedding size and removing links can significantly impact the performance of most algorithms on both tasks, and sometimes can even lead to different orderings of certain algorithms. This again emphasizes the importance of setting up a standard benchmark, including datasets and evaluation protocols, for systematic HNE algorithm evaluation. In particular, on PubMed, larger embedding sizes (e.g., over 50) can harm the performance of most algorithms, especially on node classification, probably due to overfitting with the limited labeled data. Interestingly, the random removal of links does have a negative impact on link prediction, but it does not necessarily harm node classification. This means that node classes and link structures may not always be tightly correlated, and even a portion of the links may already provide enough information for node classification.

Towards the evaluation of the message-passing HNE algorithms designed to integrate node attributes and labels into representation learning, such as R-GCN, HAN, MAGNN, and HGT, we also conduct controlled experiments by adding random Gaussian noise to the node attributes and masking different amounts of training labels.
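A minimal sketch of these two perturbations is given below; the noise scale and the fraction of visible labels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 200))               # node attributes (toy)
y = rng.integers(0, 8, size=1000)                  # node labels (toy)

# 1) add random Gaussian noise with a chosen variance to the node attributes
sigma = 0.5                                        # assumed noise scale
X_noisy = X + rng.normal(0.0, sigma, size=X.shape)

# 2) mask a portion of the training labels for semi-supervised training
label_ratio = 0.4                                  # assumed fraction of labels kept
mask = rng.random(len(y)) < label_ratio            # True where the label is visible
y_train = np.where(mask, y, -1)                    # -1 marks a masked (unused) label
print(X_noisy.shape, mask.sum())
```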

In Figure 8, we show the results on the PubMed dataset. As we can observe, the scores in most subfigures are significantly higher than those in Table V, indicating the effectiveness of R-GCN, HAN, MAGNN, and HGT in integrating node attributes and labels for HNE. In particular, the incorporation of node attributes boosts the node classification results of R-GCN, HAN, and MAGNN significantly (almost tripling the F1 scores, which become significantly higher than those of all algorithms that do not use node attributes), but it offers very little help to HGT. This suggests that R-GCN, HAN, and MAGNN can effectively leverage the semantic information associated with attributes, whereas HGT relies more on the network structures and the type information of nodes and links. Moreover, MAGNN achieves the highest node classification results with large margins as it successfully incorporates the attributes of intermediate nodes along the meta-paths (which are ignored by HAN). In addition, when random noise with larger variance is added to the node attributes, the performance of node classification drops significantly, while the performance of link prediction is less affected. As more training labels become available, unsurprisingly, the node classification results of all four algorithms increase, but surprisingly the link prediction results are almost unaffected. These observations again reveal the different natures of the two tasks, where node classes are more related to node contents, whereas links should typically be inferred from structural information.

6 Future

In this work, we present a comprehensive survey on various existing HNE algorithms, and provide benchmark datasets and baseline implementations to ease future research in this direction. While HNE has already demonstrated strong performance across a variety of downstream tasks, it is still in its infancy with many open challenges. To conclude this work and inspire future research, we now briefly discuss the limitations of current HNE and several specific directions potentially worth pursuing.

Beyond homophily.

As we formulate in Eq. (1), current HNE algorithms focus on leveraging network homophily. Following recent research on homogeneous networks that studies the combination of positional and structural embedding [145, 49], it would be interesting to explore how to generalize such design principles and paradigms to HNE. Particularly, in heterogeneous networks, the relative positions and structural roles of nodes can both be measured under different meta-paths or meta-graphs, which are naturally more informative and diverse. However, such considerations also introduce harder computational challenges.

Beyond accuracy.

Most, if not all, existing research on HNE has primarily focused on the accuracy towards different downstream tasks. It would be interesting to further study the scalability and efficiency (for large-scale networks) [34, 146], temporal adaptability (for dynamic evolving networks) [147, 148], robustness (towards adversarial attacks) [32, 149], interpretability [150], uncertainty [151], fairness [152, 153] of HNE, and so on.

Beyond node embedding.

Graph- and subgraph-level embeddings have been intensively studied on homogeneous networks to enable graph-level classification [154, 43] and unsupervised message-passing model training [36, 155], but they are hardly studied on heterogeneous networks [156]. Although existing works like HIN2Vec [63] study the embedding of meta-paths to improve the embedding of nodes, direct applications of graph- and subgraph-level embeddings in the context of heterogeneous networks largely remain nascent.

Revisiting KB embedding.

The difference between KB embedding and other types of HNE is mainly due to the numbers of node and link types. Direct application of KB embedding to heterogeneous networks fails to consider meta-paths with rich semantics, whereas directly applying HNE to KB is unrealistic due to the exponential number of meta-paths. However, it would still be interesting to study the intersection between these two groups of methods (as well as two types of data). For example, how can we combine the ideas of meta-paths on heterogeneous networks and embedding transformation on KB for HNE with more semantic-aware transformations? How can we devise truncated random walk based methods for KB embedding to include higher-order relations?

Modeling heterogeneous contexts.

Heterogeneous networks mainly model different types of nodes and links. However, networks nowadays are often associated with rich contents, which provide contexts of the nodes, links, and subnetworks [157, 158, 159]. There have been studies exploiting text-rich heterogeneous information networks for taxonomy construction [160, 161] and text classification [162, 163]. Meanwhile, how to model heterogeneous interactions under multi-faceted contexts through the integration of multi-modal content and structure could be a challenging but rewarding research area.

Understanding the limitation.

While HNE (as well as many neural representation learning models) has demonstrated strong performance in various domains, it is worthwhile to understand its potential limits. For example, when do modern HNE algorithms work better compared with traditional network mining approaches (e.g., path counting, subgraph matching, non-neural or linear propagation)? How can we join the advantages of both worlds? Moreover, while there has been intensive research on the mathematical mechanisms behind neural networks for homogeneous network data (e.g., smoothing, low-pass filtering, invariant and equivariant transformations), by unifying existing models on HNE, this work also aims to stimulate further theoretical studies on the power and limitation of HNE.

Acknowledgement

Special thanks to Dr. Hanghang Tong for his generous help in the polish of this work. Research was sponsored in part by the U.S.  Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 and W911NF-13-1-0193, DARPA No. W911NF-17-C-0099 and FA8750-19-2-1004, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative.

References

  • [1] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, “Structured sequence modeling with graph convolutional recurrent networks,” in ICONIP, 2018.
  • [2] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang, “Manifold-ranking based image retrieval,” in SIGMM, 2004.
  • [3] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in ICML, 2017.
  • [4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in NIPS, 2013.
  • [5] C. Yang, L. Bai, C. Zhang, Q. Yuan, and J. Han, “Bridging collaborative filtering and semi-supervised learning: A neural approach for poi recommendation,” in KDD, 2017.
  • [6] Y. Xiao, A. Krishnan, and H. Sundaram, “Discovering strategic behaviors for collaborative content-production in social networks,” in WWW, 2020.
  • [7] Y. Sun and J. Han, “Mining heterogeneous information networks: principles and methodologies,” Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 3, no. 2, pp. 1–159, 2012.
  • [8] C. Shi, Y. Li, J. Zhang, Y. Sun, and S. Y. Philip, “A survey of heterogeneous information network analysis,” TKDE, vol. 29, no. 1, pp. 17–37, 2016.
  • [9] J.-D. Zhang and C.-Y. Chow, “Geosoca: Exploiting geographical, social and categorical correlations for point-of-interest recommendations,” in SIGIR, 2015.
  • [10] C. Yang, D. H. Hoang, T. Mikolov, and J. Han, “Place deduplication with embeddings,” in WWW, 2019.
  • [11] C. Yang, C. Zhang, X. Chen, J. Ye, and J. Han, “Did you enjoy the ride: Understanding passenger experience via heterogeneous network embedding,” in ICDE, 2018.
  • [12] Y. Li and J. C. Patra, “Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network,” Bioinformatics, vol. 26, no. 9, pp. 1219–1224, 2010.
  • [13] A. P. Davis, C. J. Grondin, R. J. Johnson, D. Sciaky, B. L. King, R. McMorran, J. Wiegers, T. C. Wiegers, and C. J. Mattingly, “The comparative toxicogenomics database: update 2017,” Nucleic acids research, vol. 45, no. D1, pp. D972–D978, 2016.
  • [14] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “Pathsim: Meta path-based top-k similarity search in heterogeneous information networks,” in VLDB, 2011.
  • [15] C. Shi, X. Kong, Y. Huang, S. Y. Philip, and B. Wu, “Hetesim: A general framework for relevance measure in heterogeneous networks,” TKDE, vol. 26, no. 10, pp. 2479–2492, 2014.
  • [16] C. Yang, M. Liu, F. He, X. Zhang, J. Peng, and J. Han, “Similarity modeling on heterogeneous networks via automatic path discovery,” in ECML-PKDD, 2018.
  • [17] L. Dos Santos, B. Piwowarski, and P. Gallinari, “Multilabel classification on heterogeneous graphs with gaussian embeddings,” in ECML-PKDD, 2016.
  • [18] D. Eswaran, S. Günnemann, C. Faloutsos, D. Makhija, and M. Kumar, “Zoobp: Belief propagation for heterogeneous networks,” in VLDB, 2017.
  • [19] T. Chen and Y. Sun, “Task-guided and path-augmented heterogeneous network embedding for author identification,” in WSDM, 2017.
  • [20] R. Socher, D. Chen, C. D. Manning, and A. Ng, “Reasoning with neural tensor networks for knowledge base completion,” in NIPS, 2013.
  • [21] B. Oh, S. Seo, and K.-H. Lee, “Knowledge graph completion by context-aware convolutional learning with multi-hop neighborhoods,” in CIKM, 2018.
  • [22] W. Zhang, B. Paudel, L. Wang, J. Chen, H. Zhu, W. Zhang, A. Bernstein, and H. Chen, “Iteratively learning embeddings and rules for knowledge graph reasoning,” in WWW, 2019.
  • [23] X. Geng, H. Zhang, J. Bian, and T.-S. Chua, “Learning image and user features for recommendation in social networks,” in ICCV, 2015.
  • [24] H. Zhao, Q. Yao, J. Li, Y. Song, and D. L. Lee, “Meta-graph based recommendation fusion over heterogeneous information networks,” in KDD, 2017.
  • [25] S. Hou, Y. Ye, Y. Song, and M. Abdulhayoglu, “Hindroid: An intelligent android malware detection system based on structured heterogeneous information network,” in KDD, 2017.
  • [26] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
  • [27] H. Cai, V. W. Zheng, and K. C.-C. Chang, “A comprehensive survey of graph embedding: Problems, techniques, and applications,” TKDE, vol. 30, no. 9, pp. 1616–1637, 2018.
  • [28] P. Cui, X. Wang, J. Pei, and W. Zhu, “A survey on network embedding,” TKDE, vol. 31, no. 5, pp. 833–852, 2018.
  • [29] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in KDD, 2014.
  • [30] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in WWW, 2015.
  • [31] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in KDD, 2016.
  • [32] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo, “Graphgan: Graph representation learning with generative adversarial nets,” in AAAI, 2018.
  • [33] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
  • [34] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017.
  • [35] J. Chen, T. Ma, and C. Xiao, “Fastgcn: fast learning with graph convolutional networks via importance sampling,” in ICLR, 2018.
  • [36] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax,” in ICLR, 2019.
  • [37] N. Zhao, H. Zhang, M. Wang, R. Hong, and T.-S. Chua, “Learning content–social influential features for influence analysis,” IJMIR, 2016.
  • [38] S. Abu-El-Haija, B. Perozzi, and R. Al-Rfou, “Learning edge representations via low-rank asymmetric projections,” in CIKM, 2017.
  • [39] B. Perozzi, V. Kulkarni, H. Chen, and S. Skiena, “Don’t walk, skip!: online learning of multi-scale network embeddings,” in ASONAM, 2017.
  • [40] C. Yang, J. Zhang, H. Wang, S. Li, M. Kim, M. Walker, Y. Xiao, and J. Han, “Relation learning on social networks with multi-modal graph edge variational autoencoders,” in WSDM, 2020.
  • [41] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in ICML, 2016.
  • [42] C. Yang, M. Liu, V. W. Zheng, and J. Han, “Node, motif and subgraph: Leveraging network functional blocks through structural convolution,” in ASONAM, 2018.
  • [43] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” in NIPS, 2018.
  • [44] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe, “Weisfeiler and leman go neural: Higher-order graph neural networks,” in AAAI, 2019.
  • [45] S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations with global structural information,” in CIKM, 2015.
  • [46] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in KDD, 2016.
  • [47] Z. Zhang, P. Cui, X. Wang, J. Pei, X. Yao, and W. Zhu, “Arbitrary-order proximity preserved network embedding,” in KDD, 2018.
  • [48] J. Huang, X. Liu, and Y. Song, “Hyper-path-based representation learning for hyper-networks,” in CIKM, 2019.
  • [49] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo, “struc2vec: Learning node representations from structural identity,” in KDD, 2017.
  • [50] M. Zhang and Y. Chen, “Weisfeiler-lehman neural machine for link prediction,” in KDD, 2017.
  • [51] T. Lyu, Y. Zhang, and Y. Zhang, “Enhancing the network embedding quality with structural similarity,” in CIKM, 2017.
  • [52] C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec, “Learning structural node embeddings via diffusion wavelets,” in KDD, 2018.
  • [53] B. Scholkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond.   MIT press, 2001.
  • [54] A. Liaw, M. Wiener et al., “Classification and regression by randomforest,” R news, vol. 2, no. 3, pp. 18–22, 2002.
  • [55] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in KDD, 2016.
  • [56] S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, and T. S. Huang, “Heterogeneous network embedding via deep architectures,” in KDD, 2015.
  • [57] F. Wu, X. Lu, J. Song, S. Yan, Z. M. Zhang, Y. Rui, and Y. Zhuang, “Learning of multimodal representations with random walks on the click graph,” TIP, vol. 25, no. 2, pp. 630–642, 2015.
  • [58] Y. Zhao, Z. Liu, and M. Sun, “Representation learning for measuring entity relatedness with rich information,” in AAAI, 2015.
  • [59] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in KDD, 2017.
  • [60] Y. Shi, H. Gui, Q. Zhu, L. Kaplan, and J. Han, “Aspem: Embedding learning by aspects in heterogeneous information networks,” in SDM, 2018.
  • [61] Y. Shi, Q. Zhu, F. Guo, C. Zhang, and J. Han, “Easing embedding learning by comprehensive transcription of heterogeneous information networks,” in KDD, 2018.
  • [62] H. Gui, J. Liu, F. Tao, M. Jiang, B. Norick, and J. Han, “Large-scale embedding learning in heterogeneous event data,” in ICDM, 2016.
  • [63] T.-y. Fu, W.-C. Lee, and Z. Lei, “Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning,” in CIKM, 2017.
  • [64] C. Yang, Y. Feng, P. Li, Y. Shi, and J. Han, “Meta-graph based hin spectral embedding: Methods, analyses, and insights,” in ICDM, 2018.
  • [65] R. Hussein, D. Yang, and P. Cudré-Mauroux, “Are meta-paths necessary? revisiting heterogeneous graph embeddings,” in CIKM, 2018.
  • [66] Y. Zhang, Y. Xiong, X. Kong, S. Li, J. Mi, and Y. Zhu, “Deep collective classification in heterogeneous information networks,” in WWW, 2018.
  • [67] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in ESWC, 2018.
  • [68] M. Qu, J. Tang, and J. Han, “Curriculum learning for heterogeneous star network embedding via deep reinforcement learning,” in WSDM, 2018.
  • [69] C. Zhang, A. Swami, and N. V. Chawla, “Shne: Representation learning for semantic-associated heterogeneous networks,” in WSDM, 2019.
  • [70] Y. Cen, X. Zou, J. Zhang, H. Yang, J. Zhou, and J. Tang, “Representation learning for attributed multiplex heterogeneous network,” in KDD, 2019.
  • [71] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla, “Heterogeneous graph neural network,” in KDD, 2019.
  • [72] B. Hu, Y. Fang, and C. Shi, “Adversarial learning on heterogeneous information networks,” in KDD, 2019.
  • [73] X. Wang, Y. Zhang, and C. Shi, “Hyperbolic heterogeneous information network embedding,” in AAAI, 2019.
  • [74] C. Yang, J. Zhang, and J. Han, “Neural embedding propagation on heterogeneous networks,” in ICDM, 2019.
  • [75] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu, “Heterogeneous graph attention network,” in WWW, 2019.
  • [76] Y. Lu, C. Shi, L. Hu, and Z. Liu, “Relation structure-aware heterogeneous information network embedding,” in AAAI, 2019.
  • [77] C. Shi, B. Hu, W. X. Zhao, and S. Y. Philip, “Heterogeneous information network embedding for recommendation,” TKDE, vol. 31, no. 2, pp. 357–370, 2018.
  • [78] Y. Cao, X. Wang, X. He, Z. Hu, and T.-S. Chua, “Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences,” in WWW, 2019.
  • [79] Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,” TKDE, vol. 29, no. 12, pp. 2724–2743, 2017.
Carl Yang received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign under the supervision of Jiawei Han, and his B.Eng. in Computer Science and Engineering from Zhejiang University in 2014. His research develops data-driven techniques and neural architectures for learning with context-rich network data. His results have been published in top conferences such as KDD, NeurIPS, WWW, ICDE, SIGIR, and ICML. Carl received the Dissertation Completion Fellowship of UIUC in 2020 and the Best Paper Award of ICDM 2020, and joined Emory University as an Assistant Professor after his Ph.D. graduation.
Yuxin Xiao is a master's student at Carnegie Mellon University. He received his B.S. in Computer Science and B.S. in Statistics and Mathematics from the University of Illinois at Urbana-Champaign in 2020. His research focuses on machine learning on graph data, and he has published first-authored papers in WWW and IEEE BigData. Yuxin received the CRA Outstanding Undergraduate Researcher Award and the C.W. Gear Outstanding Undergraduate Award at UIUC in 2020.
Yu Zhang is a Ph.D. candidate in Computer Science at the University of Illinois at Urbana-Champaign, advised by Prof. Jiawei Han. He received his M.S. degree from UIUC in 2019 and his B.S. degree from Peking University in 2017. His research interests include text-rich network mining, text classification, and their applications to bioinformatics. He has published first-authored papers in SIGIR, WSDM, ICDM, and CIKM. Yu also received the WWW Best Poster Award Honorable Mention in 2018.
Yizhou Sun is an associate professor in the Department of Computer Science at UCLA. She received her Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2012. Her principal research interest is in mining graphs/networks, and more generally in data mining, machine learning, and network science, with a focus on modeling novel problems and proposing scalable algorithms for large-scale, real-world applications. She is a pioneer researcher in mining heterogeneous information networks, with a recent focus on deep learning on graphs/networks. Yizhou has over 100 publications in books, journals, and major conferences, and tutorials of her research have been given in many premier conferences. She received the 2012 ACM SIGKDD Best Student Paper Award, 2013 ACM SIGKDD Doctoral Dissertation Award, 2013 Yahoo ACE (Academic Career Enhancement) Award, 2015 NSF CAREER Award, 2016 CS@ILLINOIS Distinguished Educator Award, 2018 Amazon Research Award, and 2019 Okawa Foundation Research Grant.
Jiawei Han is the Michael Aiken Chair Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been researching data mining, information network analysis, database systems, and their various applications, with over 600 publications. He served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data (TKDD) (2007-2012). Jiawei has received the ACM SIGKDD Innovation Award (2004), IEEE Computer Society Technical Achievement Award (2005), IEEE Computer Society W. Wallace McDowell Award (2009), Daniel C. Drucker Eminent Faculty Award at UIUC (2011), and Japan's Funai Achievement Award (2018). He is a Fellow of ACM and a Fellow of IEEE. He served as co-Director of KnowEnG, a Center of Excellence in Big Data Computing (2014-2019) funded by the NIH Big Data to Knowledge (BD2K) Initiative, and as Director of the Information Network Academic Research Center (INARC) (2009-2016) supported by the Network Science-Collaborative Technology Alliance (NS-CTA) program of the U.S. Army Research Lab. His co-authored book "Data Mining: Concepts and Techniques" (Morgan Kaufmann) has been widely adopted as a textbook worldwide.