Deep Partial Multiplex Network Embedding

Qifan Wang Facebook AIMenlo ParkCAUSA [email protected] , Yi Fang Santa Clara UniversitySanta ClaraCAUSA [email protected] , Anirudh Ravula Google ResearchMountain ViewCAUSA [email protected] , Ruining He Google ResearchMountain ViewCAUSA [email protected] , Bin Shen Google ResearchMountain ViewCAUSA [email protected] , Jingang Wang Meituan NLP CenterBeijingChina [email protected] , Xiaojun Quan Sun Yat-sen UniversityGuangzhouChina [email protected] and Dongfang Liu Rochester Institute of TechnologyRochesterNYUSA [email protected]

(2022)

Abstract.

Network embedding is an effective technique to learn the low-dimensional representations of nodes in networks. Real-world networks are usually with multiplex or having multi-view representations from different relations. Recently, there has been increasing interest in network embedding on multiplex data. However, most existing multiplex approaches assume that the data is complete in all views. But in real applications, it is often the case that each view suffers from the missing of some data and therefore results in partial multiplex data.

In this paper, we present a novel Deep Partial Multiplex Network Embedding approach to deal with incomplete data. In particular, the network embeddings are learned by simultaneously minimizing the deep reconstruction loss with the autoencoder neural network, enforcing the data consistency across views via common latent subspace learning, and preserving the data topological structure within the same network through graph Laplacian. We further prove the orthogonal invariant property of the learned embeddings and connect our approach with the binary embedding techniques. Experiments on four multiplex benchmarks demonstrate the superior performance of the proposed approach over several state-of-the-art methods on node classification, link prediction and clustering tasks.

network embedding, multiplex learning, social network representation, graph representation

^†^†journalyear: 2022^†^†copyright: acmcopyright^†^†conference: Companion Proceedings of the Web Conference 2022; April 25–29, 2022; Virtual Event, Lyon, France.^†^†booktitle: Companion Proceedings of the Web Conference 2022 (WWW ’22 Companion), April 25–29, 2022, Virtual Event, Lyon, France^†^†price: 15.00^†^†isbn: 978-1-4503-9130-6/22/04^†^†doi: 10.1145/XXXXXX.XXXXXX^†^†ccs: Information systems Web searching and information discovery

1. Introduction

Network embedding is designed for learning low-dimensional and typically non-linear representations of nodes in the network, which is able to preserve network information. Network embedding has been shown to be useful in many downstream tasks, such as node classification (WangTAL16, ), node clustering (GaoH18b, ), link prediction (Li0ZZC19, ) and community detection (WeiCSLY16, ). A variety of network embedding techniques have been proposed in the literature (DeepWalk, ; DRNE, ; wang2017instance, ; DuLWSW018, ; WangLDWZ19, ; ZhangSDYJ19, ; MengLBZ19, ; 0004ZMK20, ; BhowmickMDGM20, ; SunY0CCSH21, ; Pan0T0021, ; JiangKS21, ; LiuHYD21, ; MAVE, ; WebFormer, ). However, most of these methods focus on single networks, where nodes in the networks are only associated with one type of features or relations.

In many real-world applications, data usually have multiplex representations (MNE, ), where nodes are associated with multiple features from different sources. Multiple types of edges/relations are then generated from these disparate features. For example, in document corpus, a document has hyperlink feature that connects to other related documents. It can also have semantic representation such as attribute or tag feature. Documents are linked together in the attribute view if they share at least one attribute. In Flickr (FlickrData, ), users can be represented with their friendship to others, public comments, photos, reviews, tags, etc. Similarly, users are linked in the photo network or tag network if they share same photos or tags. Previous research on multiplex representation learning (ICML15, ; LiuZLTZY20, ) has demonstrated improved performance by leveraging complementary information from different views. Therefore, there is a growing interest in multiplex network embedding that effectively integrates information from disparate views (ZhangLSC18, ; Kawamae19, ; ZhaoWSLY20, ).

Although existing multiplex network embedding methods generate promising results in dealing with multiplex data, most of them assume that all nodes in the network have full information in all views. However, in real-world tasks, it is often the case that a view suffers from some missing information, which results in partial data (LiJZ14, ; WangSS15, ; abs-2003-13088, ). For instance, in document corpus, many documents may not contain any hyperlink or tag information. In Flickr, a user might have few or even no friend connections, reviews or tags, resulting in an isolated node in the corresponding relationship network. Moreover, it is also common that users don’t share some of their information, such as photos and comments, for privacy consideration. Therefore, it is a practical and important research problem to design effective network embedding methods on partial multiplex data.

There are several ways to apply existing multiplex network embedding methods to partial data. One can either remove the data that suffer from missing information, or preprocess the partial data by first filling in the missing data. The first strategy is clearly not suitable since the purpose is to map all nodes to their corresponding embedding vectors, and our experiments show that the second strategy does not achieve good performance either. In this paper, we propose a novel Deep Partial Multiplex Network Embedding (DPMNE) approach to deal with such incomplete data. More specifically, a deep autoencoder network is introduced to learn the deep representations of node features. A unified learning framework is developed to learn the network embedding, which simultaneously minimizes the reconstruction error from the deep autoencoder, enforces the data consistency across different views via common latent subspace learning, and preserves data proximity within the same network through graph Laplacian. A coordinate descent algorithm is applied as the optimization procedure. We then further connect our approach to binary embedding methods (WangZSSS18, ) based on the orthogonal invariant property of our formulation. Experiments on four multiplex datasets demonstrate the advantages of the proposed approach over several state-of-the-art single and multiplex network embedding methods.

2. Related Work

2.1. Single Network Embedding

Single network embedding methods (GongLSW20, ; aaai2021_a, ; aaai2021_b, ; feng2021should, ; chen2021catgcn, ; wu2022graph, ) learn an information preserving embedding of a single-view network for node classification, node clustering and many other related tasks. A spectral based method (BelkinN01, ) has been proposed, which uses the top-k eigenvectors to represent the network nodes. DeepWalk (DeepWalk, ) introduces the idea of Skip-gram to learn node representations from random-walk sequences. LINE (LINE, ) tries to use the embedding to approximate the first-order and second-order proximities of the network. On top of DeepWalk, Node2Vec (node2vec, ) adds two parameters to control the random walk process and make it biased random walk. SDNE (SDNE, ) uses deep neural networks to preserve the neighbors structure proximity in network embedding. Recently, graph neural network (GNN) based methods (SAGE, ; ArmandpourDHH19, ; YangPZP0RL20, ) have been proposed, which generate embedding by sampling and aggregating features from a node local neighborhood in the network. GraphSAGE (SAGE, ) is a general inductive framework that leverages node feature information to efficiently generate node embeddings for previously unseen data. DEGNN (LiWWL20, ) presents a distance encoding GNN that learns a generator to approximate the node connectivity distribution and a discriminator to differentiate fake nodes and the nodes sampled from the true data distribution. For a comprehensive review on GNN models, please refer to (WuPCLZY21, ).

methods	Deep	Partial	Multiplex	Network
DeepWalk (DeepWalk, )	x	x	x	✓
Node2Vec (node2vec, )	x	x	x	✓
LINE (LINE, )	x	x	x	✓
GraphSAGE (SAGE, )	✓	x	x	✓
SDNE (SDNE, )	✓	x	x	✓
DANE (GaoH18b, )	✓	x	✓	✓
MVE (MVE, )	x	x	✓	✓
MNE (MNE, )	x	x	✓	✓
CFANE (Pan0T0021, )	✓	x	✓	✓
MAGCN (ChengWTXG20, )	✓	x	✓	✓
HWNN (SunY0CCSH21, )	✓	x	✓	✓
PVC (LiJZ14, )	x	✓	✓	x
IMVC (LiuZLTZY20, )	x	✓	✓	x
DPMNE	✓	✓	✓	✓

Table 1. Summary of existing network embedding and partial data representation approaches. Deep: learning deep representations. Partial: handling partial data. Multiplex: incorporating multiplex data. Network: modeling network information.

2.2. Multiplex Network Embedding

Various approaches have been proposed to learn network embedding on multiplex data (ChangHTQAH15, ; CenZZYZ019, ; ChenY0DLZ19, ; abs-2008-10085, ; abs-2004-00216, ; abs-2012-12517, ; YanZT21, ; XieOCLXYZ21, ; qifan2021deep, ). Such methods effectively integrate information from individual views while exploiting complementary information supplied by different views. MVE (MVE, ) is one of the pioneer work in this category. This work combines the information from multiple sources using a weighted voting scheme. MVN2Vec (mvn2vec, ) further studies how varied extent of preservation and collaboration can impact the multiplex embedding learning. DANE (GaoH18b, ), DONE (BandyopadhyayNV20, ) and ProGAN (ProGAN, ) treat the attribute/tag information associated with the nodes as additional feature and learn embedding with deep neural networks. More recently, MNE (MNE, ) uses a latent space to integrate the information across multiple views. MAGCN (ChengWTXG20, ) proposes a multi-view attribute graph convolution networks model for the clustering task. A quick survey on multiplex network representation is provided in (abs-2004-00216, ). These multiplex network embedding methods (WangJSWYCY19, ; HuDWS20, ; XueYRJWL21, ) achieve promising results. However, the partial data scenarios are not considered.

Refer to caption — Figure 1. The architecture of our proposed DPMNE. There are three views in the multiplex network example, with missing data from each individual views, i.e., $x_{4}$ is missing from view 1, $x_{5}$ is missing from view 2, $x_{1}$ and $x_{4}$ are missing from view 3. The three key components of DPMNE are: 1) Deep reconstruction loss from all views. 2) Data consistency among views via latent common subspace learning. 3) Proximity preservation from each view.

2.3. Partial Data Learning

It is also worth mentioning that there are few single network embedding approaches that deal with incomplete data (WeiCSLY16, ; ZhangYZZ18, ). Obviously, these methods are not suitable in multiplex network setting. There are also some multi-view approaches (LiJZ14, ; RaiNCD16, ; GuoY19, ; XuGZWNL19, ; LiuZLTZY20, ) proposed to deal with incomplete data in clustering task. For example, PVC (LiJZ14, ) develops a two-view clustering method that learns a unique representation between views to handle incomplete data in each view. Most recently, IMVC (LiuZLTZY20, ) uses multiple kernel k-means with incomplete kernels to jointly conduct clustering and kernel matrix imputation. Although these methods obtain reasonable performance, they are not directly applicable to network embedding as network information can not be modeled by these methods. Moreover, none of these methods learn the deep representations for the multiplex network. We compare and summarize these approaches in Table 1.

3. Deep Partial Multiplex Network Embedding

3.1. Problem Definition

Given a multiplex network $\boldsymbol{G}$ = $(\boldsymbol{V},\boldsymbol{E})$ , where $\boldsymbol{V}$ = $\{v_{i}\ |\ i=1,\dots,n\}$ denotes the set of nodes. $\boldsymbol{X}$ = $\{\boldsymbol{X}^{1},\boldsymbol{X}^{2},\dots,\boldsymbol{X}^{t}\}$ are the multiplex features associated with the nodes, where $t$ is the total number of views, $\boldsymbol{X}^{s}$ ={ $x_{i}^{s}\ |\ i=1,2,\dots,n\}$ with $x_{i}^{s}\in\mathbb{R}^{d_{s}}$ is the feature of the $i$ - $th$ node in the $s$ - $th$ view. $\boldsymbol{E}$ = $\{\boldsymbol{E}^{s}\}$ denotes the edge sets in the multiplex network, where $\boldsymbol{E}_{i,j}^{s}$ = $1$ indicates that $v_{i}$ and $v_{j}$ are linked in the $s$ - $th$ view. In the partial data setting, a partial data ${\bar{\boldsymbol{X}}}$ = $\{{\bar{\boldsymbol{X}}}^{1},{\bar{\boldsymbol{X}}}^{2},\dots,{\bar{\boldsymbol{X}}}^{t}\}$ instead of $\boldsymbol{X}$ is given, where some of the node features are missing from each individual views, e.g., $x_{4}^{1}$ is missing from View 1 and $x_{5}^{2}$ is missing from View 2 in Figure 1. The purpose of DPMNE is to learn a low-dimensional embedding representation $\boldsymbol{Y}$ = $\{y_{1},y_{2},\dots,y_{n}\}\in\mathbb{R}^{n\times d}$ of $\boldsymbol{G}$ , where $d\ll d_{s}$ is the latent embedding dimension.

The overall model architecture is shown in Figure 1. The objective function of DPMNE is composed of three components: (1) Deep reconstruction loss within each views, where we learn a deep representation of the node feature using autoencoder model. (2) Data consistency among views, where latent common subspace learning is utilized to ensure that the node embeddings generated from different views are consistent. (3) Proximity preservation within view, where graph Laplacian is applied to enforce that linked nodes within each network should have close embeddings.

3.2. Deep Representation

To capture the sparsity and highly non-linear structure in the feature space, we adopt a deep autoencoder to map the input data to the representation space. Autoencoder is a powerful unsupervised deep model for feature learning. It has been widely used for various machine learning applications (JiangGCH16, ; SDNE, ). The basic autoencoder contains three layers, they are the input layer, the hidden layer, and the output layer, which are defined as follows:

(1)

h=\sigma(\boldsymbol{W}^{(en)}x+b^{(en)}),\ \ \ \ \tilde{x}=\sigma(\boldsymbol{W}^{(de)}h+b^{(de)})

where $h$ is the deep representation from the encoder, $\tilde{x}$ is the reconstructed feature from the decoder. $\boldsymbol{H}$ = $\{\boldsymbol{W}^{(en)},b^{(en)},\boldsymbol{W}^{(de)},b^{(de)}\}$ are autoencoder parameters. $\sigma(.)$ denotes the non-linear activation function. In our implementation, we employ $K$ layers in the encoder:

(2)

\begin{split}h_{1}=\sigma(\boldsymbol{W}_{1}^{(en)}x+b_{1}^{(en)})\ \ \ \ \ \ \\ h_{k}=\sigma(\boldsymbol{W}_{k}^{(en)}h_{k-1}+b_{k}^{(en)})\end{split}

Similarly, there will be $K$ layers in the decoder. The goal of the autoencoder is to minimize the reconstruction loss of the reconstructed features $\tilde{x}$ and the input features $x$ , in each individual views (note that there is one autoencoder per view):

(3)

\min_{\boldsymbol{H}}\sum_{s=1}^{t}\|{{\boldsymbol{X}}}^{s}-{\tilde{\boldsymbol{X}}}^{s}\|_{F}^{2}

Although minimizing the reconstruction loss does not explicitly preserve the similarity between samples, as shown in (SalakhutdinovH09, ), the reconstruction criterion can effectively capture the data manifolds and thus preserve the similarity between data examples.

3.3. Data Consistency among Views

In the partial multiplex network setting, the nodes are represented by heterogeneous features of different dimensions, which makes it difficult for learning their embeddings. By investigating the problem from view perspective, in each individual view, the nodes are sharing the same feature space, and different views are coupled/bridged by the shared common nodes. If we can learn a common latent subspace across views, where embeddings belonging to the same node among different views are consistent, while at the same time for each view, the representations for linked nodes are close in the latent subspace. Then the embeddings can be directly learned from this subspace, and we do not need to fill in or complete the partial data. Following the above idea, the deep latent subspace learning can be formulated as:

(4)

\begin{split}\min_{\boldsymbol{Y},\boldsymbol{B},\boldsymbol{H}}\sum_{s=1}^{t}\|{\boldsymbol{I}}^{s}({\boldsymbol{H}}^{s}-{\boldsymbol{Y}}{\boldsymbol{B}}^{s})\|_{F}^{2}+\lambda\ R({\boldsymbol{Y},\boldsymbol{B},\boldsymbol{H}})\end{split}

where $\boldsymbol{I}^{s}\in\mathbb{R}^{n\times n}$ is a diagonal matrix with element $\boldsymbol{I}^{s}_{ii}$ = 0 or 1 indicating whether $x_{i}^{s}$ is missing or not. $\boldsymbol{H}^{s}\in\mathbb{R}^{n\times d_{s}^{h}}$ are the deep representations of the data in $s$ - $th$ view as described in Section 3.2. Note that the $i$ - $th$ row of $\boldsymbol{H}^{s}$ will be 0 if $x_{i}^{s}$ is missing. $\boldsymbol{B}^{s}\in\mathbb{R}^{d\times d_{s}^{h}}$ is the basis matrix for $s$ - $th$ view’s latent space. The same latent space dimension $d$ is shared across views. $\boldsymbol{Y}\in\mathbb{R}^{n\times d}$ is the common representation/embedding of nodes in the latent space. $R(\cdotp)=\|\cdotp\|_{F}^{2}$ is the regularization term and $\lambda$ is the trade-off parameter.

In the above formulation, the individual basis matrix $\boldsymbol{B}^{s}$ , which are learned from all available instances from all views, are connected by the common latent representation $\boldsymbol{Y}$ . Moreover, no interpolation is needed for the missing data beforehand. In fact, the deep representation of the missing data can even be reconstructed with the learned basis and the common latent embedding, i.e., $h_{i}^{s}=y_{i}B^{s}$ . By solving the above problem, the deep representation $\boldsymbol{H}$ , the latent space basis $\boldsymbol{B}$ and the homogeneous feature representation $\boldsymbol{Y}$ are simultaneously learned to minimize the latent representation error.

3.4. Proximity Preservation within Views

One of the key problems in network embedding algorithms is proximity preserving, which indicates that linked nodes should be mapped to similar embedding within a close distance. Therefore, besides the data consistency across different views, we also preserve the data proximity within each individual network. In other words, we want the learned embedding $\boldsymbol{Y}$ to preserve the proximity structure in each network. In this work, we use the $L_{2}$ distance to measure the proximity between $y_{i}$ and $y_{j}$ as $\|y_{i}-y_{j}\|^{2}$ as in most network embedding work. Then one natural way to preserve the proximity in each network is to minimize the weighted average distance as follows:

(5)

\sum_{s=1}^{t}\sum_{i,j}{\boldsymbol{P}}_{ij}^{s}\|y_{i}-y_{j}\|^{2}=\sum_{i,j}\boldsymbol{P}_{ij}\|y_{i}-y_{j}\|^{2}

here $\boldsymbol{P}_{ij}=\sum_{s=1}^{t}\boldsymbol{P}_{ij}^{s}$ and $\boldsymbol{P}^{s}$ is the proximity matrix in $s$ - $th$ view, which can be obtained from the edges in $s$ - $th$ network $\boldsymbol{E}^{s}$ . A simple way to define $\boldsymbol{P}^{s}$ is to directly use the first-order proximity, i.e., $\boldsymbol{P}^{s}=\boldsymbol{E}^{s}$ . However, the first-order proximity is usually very sparse and insufficient to fully model the relationships between nodes in most cases, especially under the partial data setting. In order to characterize the connections between nodes better, we adopt high-order proximity (GaoH18b, ; Li0ZZC19, ) and define $\boldsymbol{P}^{s}$ as:

(6)

\boldsymbol{P}^{s}=w_{1}\boldsymbol{E}^{s}+w_{2}(\boldsymbol{E}^{s})^{2}+\dots+w_{l}(\boldsymbol{E}^{s})^{l}

where $l$ is the order, and $w_{1}$ , $\dots$ , $w_{l}$ are the weights for each term¹¹1In our implementation, we set $l$ to 5, $w_{1}$ to 1 and $w_{i}$ = $0.5w_{i-1}$ .. Matrix $(\boldsymbol{E}^{s})^{l}$ denotes the $l$ -order proximity matrix. To meet the proximity preservation criterion, we seek to minimize the quantity in Eqn.5 in the network since it incurs a heavy penalty if two connected nodes have very different embedding representations. By introducing a diagonal matrix $\boldsymbol{D}$ , whose entries are given by $\boldsymbol{D}_{ii}=\sum_{j=1}^{n}\boldsymbol{P}_{ij}$ . Eqn.5 can be rewritten as:

(7)

\begin{split}tr\left({\boldsymbol{Y}}^{T}(\boldsymbol{D}-\boldsymbol{P}){\boldsymbol{Y}}\right)=tr\left(\boldsymbol{Y}^{T}\boldsymbol{L}{\boldsymbol{Y}}\right)\end{split}

where $\boldsymbol{L}$ is called graph $Laplacian$ (BelkinN01, ) and $tr(\cdotp)$ is the matrix trace function. By minimizing the above objective in all networks, the proximity between different nodes can be preserved in the learned embedding.

3.5. Overall Objective and Optimization

The entire objective function consists of three components: the deep reconstruction loss in Eqn.3, the data consistency among views in Eqn.4 and proximity preservation within views given in Eqn.7 as follows:

(8)

\begin{split}\min_{\boldsymbol{Y},\boldsymbol{B},\boldsymbol{H}}O=\sum_{s=1}^{t}\|{{\boldsymbol{X}}}^{s}-{\tilde{\boldsymbol{X}}}^{s}\|_{F}^{2}+\alpha\sum_{s=1}^{t}\|{\boldsymbol{I}}^{s}({\boldsymbol{H}}^{s}-{\boldsymbol{Y}}{\boldsymbol{B}}^{s})\|_{F}^{2}\\ +\beta\ tr\left(\boldsymbol{Y}^{T}\boldsymbol{L}{\boldsymbol{Y}}\right)+\lambda\ R(\boldsymbol{Y},\boldsymbol{B},\boldsymbol{H})\ \ \ \ \ \ \ \ \ \ \ \ \end{split}

where $\alpha$ , $\beta$ and $\lambda$ are trade-off parameters to balance the weights among the terms. Directly minimizing the objective function in Eqn.8 is intractable since it is a non-convex optimization problem with $\boldsymbol{Y}$ , $\boldsymbol{B}$ and $\boldsymbol{H}$ coupled together. We propose to use coordinate descent scheme by iteratively solving the optimization problem with respect to $\boldsymbol{Y}$ , $\boldsymbol{B}$ and $\boldsymbol{H}$ as follows:

(1) Update ${Y}$ by fixing $B$ and $H$ . Given the basis matrix $\boldsymbol{B}^{s}$ and encoders $\boldsymbol{H}^{s}$ for all views, we seek to solve the following sub-problem:

(9)

\begin{split}\min_{\boldsymbol{Y}}\ O(\boldsymbol{Y})\ =\alpha\sum_{s=1}^{t}\|{\boldsymbol{I}}^{s}({\boldsymbol{H}}^{s}-{\boldsymbol{Y}}{\boldsymbol{B}}^{s})\|_{F}^{2}\\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \\ +\beta\ tr\left(\boldsymbol{Y}^{T}\boldsymbol{L}{\boldsymbol{Y}}\right)+\lambda\ R({\boldsymbol{Y}})+const\end{split}

where $const$ is the constant value independent with the parameter that to be optimized with. Although Eqn.9 is still non-convex, it is smooth and differentiable which enables gradient descent methods for efficient optimization with the calculated gradient:

(10)

\begin{split}\partial\frac{O(\boldsymbol{Y})}{\boldsymbol{Y}}=2{\boldsymbol{I}}^{s}(\boldsymbol{Y}\boldsymbol{B}^{s}-{\boldsymbol{H}}^{s})(\boldsymbol{B}^{s})^{T}+2\beta\boldsymbol{L}{\boldsymbol{Y}}+4\lambda\boldsymbol{Y}(\boldsymbol{Y}^{T}\boldsymbol{Y}-\boldsymbol{I}_{d})\end{split}

(2) Update ${B}^{s}$ by fixing $Y$ and $H$ . It is equivalent to solve the following least square problems:

(11)

\min_{\boldsymbol{B}^{s}}O(\boldsymbol{B}^{s})=\alpha\|{\boldsymbol{I}}^{s}({\boldsymbol{H}}^{s}-{\boldsymbol{Y}}{\boldsymbol{B}}^{s})\|_{F}^{2}+\lambda\|\boldsymbol{B}^{s}\|_{F}^{2}

which has a closed form solution and can be simply derived.

(3) Update ${H}^{s}$ by fixing $Y$ and $B$ . It is a standard autoencoder with an additional regression loss:

(12)

\min_{\boldsymbol{H}^{s}}O(\boldsymbol{H}^{s})=\|{{\boldsymbol{X}}}^{s}-{\tilde{\boldsymbol{X}}}^{s}\|_{F}^{2}+\alpha\|{\boldsymbol{I}}^{s}({\boldsymbol{H}}^{s}-{\boldsymbol{Y}}{\boldsymbol{B}}^{s})\|_{F}^{2}+\lambda\ R(\boldsymbol{H}^{s})

which can be solved with gradient back-propagation. We then alternate the process of updating $\boldsymbol{Y}$ , $\boldsymbol{B}$ and $\boldsymbol{H}$ for several iterations to find a locally optimal solution.

3.6. Binary Embedding

This section connects our work to the quantization-based binary embedding techniques (GongLGP13, ; WangZSSS18, ; wang2015learning, ), which learn compact binary representations of the data examples for efficient similarity search tasks. Quantization-based binary embedding methods directly binarize the low-dimensional representation to achieve the binary codes. In this work, we can easily obtain the binary codes $\boldsymbol{C}$ for the nodes in the network by binarizing the learned embedding $\boldsymbol{Y}$ . However, the quantization error can be further reduced based on the orthogonal invariant property, by minimizing the quantization error between the binary codes and the orthogonal rotation of the embeddings (since $\boldsymbol{YQ}$ is also an optimal embedding):

(13)

\begin{split}\min_{{\boldsymbol{C},\boldsymbol{Q}}}\|{\boldsymbol{C}}-\boldsymbol{YQ}\|_{F}^{2}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \\ s.t.\ \ \ \ \ \ \boldsymbol{C}\in\{-1,1\}^{n\times d},\ \ \ \boldsymbol{Q}^{T}\boldsymbol{Q}={\boldsymbol{I}_{d}}\ \ \ \ \ \ \end{split}

The above quantization problem is well studied in the literature (GongLGP13, ).

3.7. Theoretical Analysis

This section provides some complexity analysis on the training cost of the learning algorithm. The optimization algorithm of DPMNE consists of three steps in each iteration to update $Y$ , $B$ and $H$ . The time complexities for solving ${Y}$ and $B$ are bounded by $O(tndd_{s}+tnd^{2}+n^{2}d)$ and $O(tnd^{2}+tndd_{s})$ respectively. In practice, $L$ is usually a sparse matrix, and the cost can be reduced from $O(n^{2}d)$ to $O(ld)$ with sparse matrix multiplication, where $l$ is the number of non-zero elements in $L$ . The cost of updating $H$ depends on the number of hidden layers and units in the autoencoder network, which is roughly $O(tnmd_{s})$ . Here $m$ is the number of unique units in the network. Thus, the total time complexity of the learning algorithm is bounded by $O(tndd_{s}+ld+tnd^{2}+tnmd_{s})$ .

Dataset	#nodes	#total edges	#views	#labels	avg. PDR
Cora	2,708	12,887	2	7	0.02
DBLP	69,110	1,884,236	3	8	0.39
Flickr	6,163	378,547	5	10	0.46
Last.fm	10,197	1,325,367	12	11	0.52

Table 2. A summary of the statistics on all datasets.

	Cora		DBLP		Flickr		Last.fm
methods	Micro-F1	Macro-F1	Micro-F1	Macro-F1	Micro-F1	Macro-F1	Micro-F1	Macro-F1
DeepWalk	0.817 $\pm$ 0.022	0.809 $\pm$ 0.019	0.709 $\pm$ 0.014	0.713 $\pm$ 0.010	0.472 $\pm$ 0.015	0.451 $\pm$ 0.013	0.482 $\pm$ 0.011	0.453 $\pm$ 0.014
GraphSAGE	0.826 $\pm$ 0.015	0.814 $\pm$ 0.017	0.718 $\pm$ 0.019	0.720 $\pm$ 0.021	0.479 $\pm$ 0.015	0.462 $\pm$ 0.014	0.503 $\pm$ 0.011	0.465 $\pm$ 0.012
SDNE	0.821 $\pm$ 0.021	0.816 $\pm$ 0.016	0.706 $\pm$ 0.014	0.709 $\pm$ 0.018	0.482 $\pm$ 0.015	0.454 $\pm$ 0.009	0.508 $\pm$ 0.011	0.479 $\pm$ 0.019
DANE	0.841 $\pm$ 0.015	0.832 $\pm$ 0.012	0.727 $\pm$ 0.015	0.721 $\pm$ 0.014	0.485 $\pm$ 0.021	0.467 $\pm$ 0.013	0.503 $\pm$ 0.014	0.481 $\pm$ 0.012
IMVC	0.828 $\pm$ 0.017	0.820 $\pm$ 0.016	0.741 $\pm$ 0.013	0.734 $\pm$ 0.016	0.478 $\pm$ 0.020	0.464 $\pm$ 0.012	0.496 $\pm$ 0.017	0.478 $\pm$ 0.013
MAGCN	0.856 $\pm$ 0.022	0.847 $\pm$ 0.021	0.750 $\pm$ 0.025	0.736 $\pm$ 0.019	0.508 $\pm$ 0.018	0.480 $\pm$ 0.015	0.530 $\pm$ 0.011	0.508 $\pm$ 0.012
HWNN	0.851 $\pm$ 0.017	0.843 $\pm$ 0.016	0.752 $\pm$ 0.023	0.741 $\pm$ 0.014	0.502 $\pm$ 0.013	0.474 $\pm$ 0.021	0.532 $\pm$ 0.013	0.511 $\pm$ 0.016
DPMNE	0.859 $\pm$ 0.016	0.844 $\pm$ 0.015	0.784 $\pm$ 0.016	0.769 $\pm$ 0.009	0.526 $\pm$ 0.011	0.512 $\pm$ 0.014	0.558 $\pm$ 0.013	0.534 $\pm$ 0.015

Table 3. Node classification results on all datasets. Results are statistically significant with p-value

<

0.001.

4. Experiments

4.1. Experimental Setting

4.1.1. Datasets

The proposed approach is evaluated on four benchmarks: Cora, DBLP, Flickr and Last.fm.

•

Cora²²2https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz is a widely used document corpus from paper citation networks. It contains 2,708 scientific publications classified into one of 7 classes. Citation links and attributes are used as the multiplex, where the features are represented by a 0/1-valued vector indicating the absence/presence of the corresponding link and attribute respectively. The citation network consists of 5,429 edges. For the attribute network, there is an edge between two papers if they share at least one attribute, resulting in 7,458 edges.
•

DBLP³³3https://www.aminer.org/aminernetwork is an author network from the DBLP dataset (TangZYLZS08, ). It contains 69,110 nodes. Three views are identified including the co-authorship, author-citation and text views. For co-authorship and author-citation views, 0/1-valued feature vectors are used indicating co-authorship and citation between two authors, resulting in 430,117 and 763,029 edges in the corresponding views. For the text view, TF-IDF features are extracted from author’s title and abstract. The network is construct based on the text similarity, i.e., there is a link between two authors if their text similarity is high, resulting in 691,090 edges. We select eight diverse research fields as labels including “machine learning”, “computational linguistics”, “programming language”, “data mining”, “database”, “system technology”, “hardware” and “theory”. For each field, several representative conferences are selected, and only papers published in these conferences are kept to construct the three views.
•

Flickr (FlickrData, ) data were collected from the Flickr photo sharing service. It consists of 6,163 users with 10 unique labels. There are 5 views associated with this data: Comment, Favorite, Photo, Tag and User. Here, the views correspond to different aspects of Flickr and edges denote shared interests between users. For example, in the comment view, there is a link between 2 users if they have both commented on the same set of 5 or more photos. All features are 0/1-valued vectors. The resulting five views are: CommentView (2,358 nodes, 13,789 links), FavoriteView (2,724 nodes, 30,757 links), PhotoView (4,061 nodes, 91,329 links), TagView (1,341 nodes, 154,620 links), and UserView (6,163 nodes, 88,052 links).
•

Last.fm (FlickrData, ) dataset was collected from the music network, with the nodes representing the users and the edges corresponding to different relationships between Last.fm users and other entities. In each view, two users are connected by an edge if they share similar interests, yielding 12 views: ArtistView (2,118 nodes, 149,495 links), EventView (7,240 nodes, 177,000 links), NeighborView (5,320 nodes, 8,387 links), ShoutView (7,488 nodes, 14,486 links), ReleaseView (4,132 nodes, 129,167 links), TagView (1,024 nodes, 118,770 links), TopAlbumView (4,122 nodes, 128,865 links), TopArtistView (6,436 nodes, 12,4731 links), TopTagView (1,296 nodes, 136,104 links), TopTrackView (6,164 nodes, 87,491 links), TrackView (2,680 nodes, 93,358 links), and UserView (10,197 nodes, 38,743 links).

All these multiplex network are suffering from missing data. For example, the CommentView of Flickr only has 2,358 users (out of 6,163), where comments are missing from a certain amount of users. In the TagView of Last.fm, only 1,024 out of 10,197 users have associated with tags, resulting partial multiplex data. We use Partial Data Ratio (PDR) to represent the fraction of the missing data, e.g., the PDR of the CommentView of Flickr is 0.62⁴⁴4(6,163-2,358)/6,163 = 0.62. The statistics of the datasets with average PDR are given in Table 2.

	Cora	DBLP	Flickr	Last.fm
DeepWalk	0.676	0.573	0.429	0.435
GraphSAGE	0.683	0.589	0.422	0.441
SDNE	0.677	0.594	0.397	0.430
DANE	0.695	0.613	0.454	0.444
IMVC	0.692	0.635	0.457	0.447
MAGCN	0.707	0.628	0.480	0.461
HWNN	0.702	0.631	0.482	0.465
DPMNE	0.711	0.649	0.506	0.483

Table 4. Node clustering accuracy results on all datasets.

4.1.2. Baselines

The proposed approach is compared with seven different state-of-the-art baselines, including three single-view methods, DeepWalk, GraphSAGE and SDNE with four multi-view methods, DANE, IMVC, MAGCN and HWNN.

•

DeepWalk (DeepWalk, ) learns node representations from random-walk on the networks.
•

GraphSAGE (SAGE, ) aggregates the neighbor information to generate node embeddings with graph neural network.
•

SDNE (SDNE, ) uses deep neural networks to preserve the structure proximity in network embedding.
•

DANE (GaoH18b, ) uses attribute/tag information associated with the nodes as additional feature and learn embedding with deep neural networks.
•

IMVC (LiuZLTZY20, ) is a multi-view clustering method that explicitly handles incomplete data.
•

MAGCN (ChengWTXG20, ) learns network embeddings with a multi-view graph convolution networks (GCN).
•

HWNN (SunY0CCSH21, ) is a GNN-based representation learning for heterogeneous hypergraphs, which models multiple non-pairwise relations.

4.1.3. Implementation Details

To apply single-view method, we generate a single network from the multiplex network by placing an edge between a pair of nodes if they are linked by an edge in any view. For DeepWalk, we set the window size as 10, the walk length as 80, and the number of walks as 10 which are the optimal parameters tuned in our experiments. The number of units in the hidden layer is set to 200 in the deep neural network for GraphSAGE, SDNE and DANE. For IMVC, the code is public available⁵⁵5https://github.com/xinwangliu/TPAMI_EEIMVC and we tune the best hyperparameter with 5-fold cross validation.

For our method, the parameters $\alpha$ , $\beta$ and $\lambda$ are tuned by 5-fold cross validation on the training set. To get a fair comparison with deep models, we adopt the same architecture of the neural network, with 200 units in the hidden layer. We set the maximum number of iterations to 60. The number of embedding dimension is set to 128 for all methods (for DANE, each view has dimension of 64). We repeat each experiment 10 times and report the result based on the average over these runs. 50% of the data with random split is used as training.

4.2. Results and Discussion

4.2.1. Evaluation of Different Methods

We first evaluate the performance of different methods. To apply the compared deep neural network methods to the partial data, a simple way is to fill in the missing features with 0. However, this may result in large fitting errors across views for the multi-view methods, since the embedding for the missing instance will be 0. Therefore, to achieve stronger baseline results, we replace the missing feature using the linear combination of its 5-nearest neighbor examples, weighted by the similarities, which appear across views. Then the baseline deep methods can be directly applied on these extended data.

We conduct both node classification and clustering on the learned node embedding. Specifically, for classification, we employ L2-regularized logistic regression as the classifier, with Micro-F1 and Macro-F1 as metrics. For clustering, we employ $K$ -means as the clustering method and use clustering accuracy as the metric. The classification results with standard deviations on all datasets are reported in Table 3. From these comparison results, we can see that DPMNE provides the best results among all compared methods on all datasets. For example, the Micro-F1 of DPMNE increases over 4.3% and 7.8% compared with both HWNN and DANE on DBLP. The reason is that DPMNE can effectively handle the partial data by common latent subspace learning across views and proximity preservation within individual networks, while the compared methods fail to accurately extract a common space from the partial nodes. We observe that DPMNE outperforms IMVC by 8.8%. Although IMVC tries to deal with incomplete data, the network information is not fully explored. We also observe that DPMNE achieves comparable or slightly better results compare with other baselines on Cora, whose PDR is very small, i.e., 0.02 from Table 2. This further validates that DPMNE is equally effective on multiplex network without missing data. It can be seen that multi-view methods outperform the single-view methods on all four datasets. The reason is that multi-view methods construct embedding that incorporates complementary information from all views. The clustering result is summarized in Table 4. From this table, we can find that our approach achieves much better clustering performance than the others for most cases, which further verifies the effectiveness of DPMNE.

Micro-F1	Cora	DBLP	Flickr	Last.fm
w/o deep autoencoder	0.835	0.747	0.479	0.524
w/o partial multiplex	0.852	0.739	0.488	0.517
w/o proximity preservation	0.830	0.721	0.477	0.535
DPMNE	0.859	0.784	0.526	0.558

Table 5. Performance of node classification with different model ablations.

4.2.2. Effect of Partial Data Ratio

To evaluate the effectiveness of the proposed DPMNE under different PDRs, we progressively increase the PDR by randomly removing features from the multiplex network, and compare our method with the other baselines. The node classification results are shown in Figure 2. It can be seen from the figure that when the partial data ratio is 0 (on Cora), the data actually becomes the traditional multiplex setting without missing data. As aforementioned, DPMNE is also comparable with other baselines. However, as the PDR increases, our DPMNE approach significantly outperforms other baselines on all datasets. In other words, the performance of DPMNE drops much slower compared to the baseline methods. For example, the Micro-F1 of DPMNE increases over 20% compared with the state-of-the-art GNN-based models, HWNN and MAGCN, on Cora with 0.4 PDR. Our hypothesis is that, although the missing data are recovered from the common nodes across views, the baseline deep methods seem less effective in the view missing case. The missing data may not be accurately recovered when the data are missing blockwise for the partial data setting. In other words, the missing nodes can be dissimilar to all the nodes appear across views.

4.2.3. Ablation Study

We conduct a series of ablation studies of our model. We first analyze the behavior of each component in DPMNE. There are three key components in our DPMNE, the deep autoencoder, the common latent subspace learning and the network proximity preservation. We train three additional models by removing each of these components separately. Specifically, we remove the deep autoencoder by fixing all its parameters to 1, i.e., both encoder and decoder will map the input to itself. For the other two components, we simply remove the loss terms in the objective. The comparison results are shown in Table 5. It can be seen that all these components are indispensable in DPMNE. Without the common latent subspace learning part, our model degrades to HWNN or MAGCN which is not able to model the partial multiplex data effectively. On the other hand, we observe that deep autoencoder clearly improve the model quality. The reason is that it captures the sparsity and non-linearity in the original feature space. Lastly, it is obvious that network proximity preservation is crucial in learning network embeddings.

To understand which views are important for learning the network embeddings, we conduct another set of experiments on view importance analysis. Specifically, we build and evaluate the model performance by removing one view from the multiplex network at a time. The node classification results are shown in Figure 3. It can be seen that co-authorship and author-citation views are more important than the text view in DBLP dataset, which is consistent with our expectation, as text similarity might not truly reflect the relationships among authors. We also observe that the UserView is the most important view in both Flickr and Last.fm datasets. The reason is that the UserView directly reveals the connections among the users. Some other useful views are TagView and FavoriteView.

We evaluate the performance of DPMNE with different embedding dimensions by varying the dimension $d$ from {16, 32, 64, 128, 256, 512}. The node classification results on all datasets are shown in Fig.4. It can be seen from the figure that the values of both Micro-F1 and Macro-F1 of DPMNE consistently increase with the increasing of the embedding dimension, from 16 to 256, on all datasets. Our approach achieves similar results between dimension 256 and 512. This observation indicates that very large embedding size is not needed for node representation, which is consistent with the observation in MEGAN (MEGAN, ).

4.2.4. Effect of Binary Embedding

We further evaluate our approach in learning binary embeddings. The binary embeddings can be direct achieved by binarizing the learned embeddings, and this method is referred to bi-DPMNE. To obtain more effective binary embeddings, we further conduct iterative quantization based on the orthogonal invariant property of the learned embeddings from Eqn.13. This method is referred to bi-DPMNE-ITQ. We then compare these two methods with two state-of-the-art multiplex binary embedding methods, CCA-ITQ (GongLGP13, ; GongKIL14, ) and DCMH (JiangL17, ). The Micro-F1 results with 128 bits are reported in Table 6. It can be seen from the table that bi-DPMNE consistently performs better than the two baselines on the incomplete data. On the other hand, the bi-DPMNE-ITQ achieves even better results compared to bi-DPMNE, which is consistent with our expectation as it further minimizes the quantization errors.

Micro-F1	Cora	DBLP	Flickr	Last.fm
CCA-ITQ	0.748	0.657	0.438	0.451
DCMH	0.762	0.683	0.449	0.474
bi-DPMNE	0.788	0.709	0.463	0.482
bi-DPMNE-ITQ	0.803	0.727	0.472	0.495

Table 6. Node classification comparison of binary embedding with 128 bits on all datasets.

5. Conclusions

In this paper, we propose a novel deep network embedding approach to deal with partial multiplex data. We formulate a unified learning framework by simultaneously minimizing the deep reconstruction loss with the autoencoder neural network, ensuring data consistency among different views via common latent subspace learning, and preserving data proximity within the same view through graph Laplacian. Extensive experiments on four benchmarks have demonstrated the effectiveness of the proposed approach. In future, we plan to adopt distributed optimization to speed up the learning process. We also plan to further extend the subspace partial view learning to nonlinear cases.

Acknowledgements.

This work is supported by the National Natural Science Foundation of China (No. 62176270).

References

[1] M. Armandpour, P. Ding, J. Huang, and X. Hu. Robust negative sampling for network embedding. In AAAI, pages 3191–3198, 2019.
[2] S. Bandyopadhyay, L. N, S. V. Vivek, and M. N. Murty. Outlier resistant unsupervised deep architectures for attributed network embedding. In WSDM, pages 25–33, 2020.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pages 585–591, 2001.
[4] A. K. Bhowmick, K. Meneni, M. Danisch, J. Guillaume, and B. Mitra. Louvainne: Hierarchical louvain method for high quality and scalable network embedding. In WSDM, pages 43–51, 2020.
[5] N. Bui, T. Le, and V. G. Honavar. Labeling actors in multi-view social networks by integrating information from within and across multiple views. In BigData, pages 616–625, 2016.
[6] Y. Cen, X. Zou, J. Zhang, H. Yang, J. Zhou, and J. Tang. Representation learning for attributed multiplex heterogeneous network. In SIGKDD, pages 1358–1368, 2019.
[7] S. Chang, W. Han, J. Tang, G. Qi, C. C. Aggarwal, and T. S. Huang. Heterogeneous network embedding via deep architectures. In SIGKDD, pages 119–128, 2015.
[8] W. Chen, F. Feng, Q. Wang, X. He, C. Song, G. Ling, and Y. Zhang. Catgcn: Graph convolutional networks with categorical node features. IEEE Transactions on Knowledge and Data Engineering, 2021.
[9] X. Chen, G. Yu, J. Wang, C. Domeniconi, Z. Li, and X. Zhang. Activehne: Active heterogeneous network embedding. In IJCAI, pages 2123–2129, 2019.
[10] J. Cheng, Q. Wang, Z. Tao, D. Xie, and Q. Gao. Multi-view attribute graph convolution networks for clustering. In IJCAI, pages 2973–2979, 2020.
[11] C. Cruceru, G. Bécigneul, and O. Ganea. Computationally tractable riemannian manifolds for graph embeddings. In AAAI, 2021.
[12] L. Du, Z. Lu, Y. Wang, G. Song, Y. Wang, and W. Chen. Galaxy network embedding: A hierarchical community structure preserving approach. In IJCAI, pages 2079–2085, 2018.
[13] F. Feng, W. Huang, X. He, X. Xin, Q. Wang, and T.-S. Chua. Should graph convolution trust neighbors? a simple causal inference method. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1208–1218, 2021.
[14] M. Feng, C. Hsu, C. Li, M. Yeh, and S. Lin. MARINE: multi-relational network embeddings with relational proximity and node attributes. In WWW, pages 470–479, 2019.
[15] X. Fu, J. Zhang, Z. Meng, and I. King. MAGNN: metapath aggregated graph neural network for heterogeneous graph embedding. In WWW, pages 2331–2341, 2020.
[16] H. Gao and H. Huang. Deep attributed network embedding. In IJCAI, pages 3364–3370, 2018.
[17] H. Gao, J. Pei, and H. Huang. Progan: Network embedding via proximity generative adversarial network. In SIGKDD, pages 1308–1316, 2019.
[18] L. Gong, L. Lin, W. Song, and H. Wang. JNET: learning user representations via joint network embedding and topic embedding. In WSDM, pages 205–213, 2020.
[19] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis., 106(2):210–233, 2014.
[20] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI, 35(12):2916–2929, 2013.
[21] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In SIGKDD, pages 855–864, 2016.
[22] J. Guo and J. Ye. Anchors bring ease: An embarrassingly simple approach to partial multi-view clustering. In AAAI, pages 118–125, 2019.
[23] W. L. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1024–1034, 2017.
[24] Z. Hu, Y. Dong, K. Wang, and Y. Sun. Heterogeneous graph transformer. In WWW, pages 2704–2710, 2020.
[25] Q. Jiang and W. Li. Deep cross-modal hashing. In CVPR, pages 3270–3278, 2017.
[26] S. Jiang, B. Koch, and Y. Sun. HINTS: citation time series prediction for new publications via dynamic heterogeneous information network embedding. In WWW, pages 3158–3167, 2021.
[27] W. Jiang, H. Gao, F. Chung, and H. Huang. The l2, 1-norm stacked robust autoencoders for domain adaptation. In AAAI, pages 1723–1729, 2016.
[28] P. Li, Y. Wang, H. Wang, and J. Leskovec. Distance encoding: Design provably more powerful neural networks for graph representation learning. In NeurIPS, pages 4465–4478, 2020.
[29] S. Li, Y. Jiang, and Z. Zhou. Partial multi-view clustering. In AAAI, pages 1968–1974, 2014.
[30] X. Li, L. Wen, C. Qian, and J. Wang. GAHNE: graph-aggregated heterogeneous network embedding. CoRR, abs/2012.12517, 2020.
[31] Y. Li, Y. Wang, T. Zhang, J. Zhang, and Y. Chang. Learning network embedding with community structural information. In IJCAI, pages 2937–2943, 2019.
[32] X. Liu, M. Li, C. Tang, J. Xia, J. Xiong, L. Liu, M. Kloft, and E. Zhu. Efficient and effective regularized incomplete multi-view clustering. In TPAMI, 2020.
[33] Z. Liu, C. Huang, Y. Yu, and J. Dong. Motif-preserving dynamic attributed network embedding. In WWW, pages 1629–1638, 2021.
[34] Z. Meng, S. Liang, H. Bao, and X. Zhang. Co-embedding attributed networks. In WSDM, pages 393–401, 2019.
[35] G. Pan, Y. Yao, H. Tong, F. Xu, and J. Lu. Unsupervised attributed network embedding via cross fusion. In WSDM, pages 797–805, 2021.
[36] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: online learning of social representations. In SIGKDD, pages 701–710, 2014.
[37] L. Pio-Lopez, A. Valdeolivas, L. Tichit, E. Remy, and A. Baudot. Multiverse: a multiplex and multiplex-heterogeneous network embedding approach. CoRR, abs/2008.10085, 2020.
[38] M. Qu, J. Tang, J. Shang, X. Ren, M. Zhang, and J. Han. An attention-based collaboration framework for multi-view network representation learning. In CIKM, pages 1767–1776, 2017.
[39] N. Rai, S. Negi, S. Chaudhury, and O. Deshmukh. Partial multi-view clustering using graph regularized NMF. In ICPR, pages 2192–2197, 2016.
[40] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reason., 50(7):969–978, 2009.
[41] Y. Shi, F. Han, X. He, C. Yang, J. Luo, and J. Han. mvn2vec: Preservation and collaboration in multi-view network embedding. CoRR, abs/1801.06597, 2018.
[42] X. Sun, H. Yin, B. Liu, H. Chen, J. Cao, Y. Shao, and N. Q. V. Hung. Heterogeneous hypergraph embedding for graph classification. In WSDM, pages 725–733, 2021.
[43] Y. Sun, S. Wang, T. Hsieh, X. Tang, and V. G. Honavar. MEGAN: A generative adversarial network for multi-view network embedding. In IJCAI, pages 3527–3533, 2019.
[44] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: large-scale information network embedding. In WWW, pages 1067–1077, 2015.
[45] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. Arnetminer: extraction and mining of academic social networks. In SIGKDD, pages 990–998, 2008.
[46] K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu. Deep recursive network embedding with regular equivalence. In SIGKDD, pages 2357–2366, 2018.
[47] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding. In SIGKDD, pages 1225–1234, 2016.
[48] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen. A survey on learning to hash. TPAMI, 40(4):769–790, 2018.
[49] Q. Wang. Learning compact hashing codes with complex objectives from multiple sources for large scale similarity search. PhD thesis, Purdue University, 2015.
[50] Q. Wang, G. Chechik, C. Sun, and B. Shen. Instance-level label propagation with multi-instance learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2943–2949, 2017.
[51] Q. Wang, Z. Ding, Z. Tao, Q. Gao, and Y. Fu. Generative partial multi-view clustering. CoRR, abs/2003.13088, 2020.
[52] Q. Wang, Y. Fang, A. Ravula, F. Feng, X. Quan, and D. Liu. Webformer: The web-page transformer for structure information extraction. In WWW, 2022.
[53] Q. Wang and R. He. Deep multi-view network embedding on incomplete data, 2021. US Patent App. 17/154,123.
[54] Q. Wang, L. Si, and B. Shen. Learning to hash on partial multi-modal data. In IJCAI, pages 3904–3910, 2015.
[55] S. Wang, J. Tang, C. C. Aggarwal, and H. Liu. Linked document embedding for classification. In CIKM, pages 115–124, 2016.
[56] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.
[57] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu. Heterogeneous graph attention network. In WWW, pages 2022–2032, 2019.
[58] Z. Wang, H. Liu, Y. Du, Z. Wu, and X. Zhang. Unified embedding model over heterogeneous information network for personalized recommendation. In IJCAI, pages 3813–3819, 2019.
[59] X. Wei, B. Cao, W. Shao, C. Lu, and P. S. Yu. Community detection with partially observable links and node attributes. In BigData, 2016.
[60] J. Wu, X. He, X. Wang, Q. Wang, W. Chen, J. Lian, and X. Xie. Graph convolution machine for context-aware recommender system. Frontiers of Computer Science, 16(6):1–12, 2022.
[61] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst., 32(1):4–24, 2021.
[62] Y. Xie, Z. Ou, L. Chen, Y. Liu, K. Xu, C. Yang, and Z. Zheng. Learning and updating node embedding on dynamic heterogeneous information network. In WSDM, pages 184–192, 2021.
[63] C. Xu, Z. Guan, W. Zhao, H. Wu, Y. Niu, and B. Ling. Adversarial incomplete multi-view clustering. In IJCAI, pages 3933–3939, 2019.
[64] H. Xue, L. Yang, V. Rajan, W. Jiang, Y. Wei, and Y. Lin. Multiplex bipartite network embedding using dual hypergraph convolutional networks. In WWW, pages 1649–1660, 2021.
[65] Y. Yan, S. Zhang, and H. Tong. BRIGHT: A bridging algorithm for network alignment. In WWW, pages 3907–3917, 2021.
[66] C. Yang, A. Pal, A. Zhai, N. Pancha, J. Han, C. Rosenberg, and J. Leskovec. Multisage: Empowering GCN with contextualized multi-embeddings on web-scale multipartite networks. In SIGKDD, pages 2434–2443, 2020.
[67] C. Yang, Y. Xiao, Y. Zhang, Y. Sun, and J. Han. Heterogeneous network representation learning: Survey, benchmark, evaluation, and beyond. CoRR, abs/2004.00216, 2020.
[68] L. Yang, Q. Wang, Z. Yu, A. Kulkarni, S. Sanghai, B. Shu, J. Elsas, and B. Kanagal. Mave: A product dataset for multi-source attribute value extraction. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, page 1256–1265, 2022.
[69] D. Zhang, J. Yin, X. Zhu, and C. Zhang. SINE: scalable incomplete network embedding. In ICDM, pages 737–746, 2018.
[70] H. Zhang, L. Qiu, L. Yi, and Y. Song. Scalable multiplex network embedding. In IJCAI, pages 3082–3088, 2018.
[71] X. Zhang, Y. Li, D. Shen, and L. Carin. Diffusion maps for textual network embedding. In NIPS, pages 7598–7608, 2018.
[72] Y. Zhang, G. Song, L. Du, S. Yang, and Y. Jin. DANE: domain adaptive network embedding. In IJCAI, pages 4362–4368, 2019.
[73] J. Zhao, X. Wang, C. Shi, Z. Liu, and Y. Ye. Network schema preserving heterogeneous information network embedding. In C. Bessiere, editor, IJCAI, pages 1366–1372, 2020.
[74] S. Zhu, J. Li, H. Peng, S. Wang, P. S. Yu, and L. He. Adversarial directed graph embedding. In AAAI, pages 4741–4748, 2021.