
Representation Learning of
Reconstructed Graphs Using Random
Walk Graph Convolutional Network

Xing Li, Wei Wei, Xiangnan Feng, Zhiming Zheng W. Wei was with the School of Mathematical Science, Beihang University, Beijing, China, Key Laboratory of Mathematics Informatics Behavioral Semantics, Ministry of Education, China, Peng Cheng Laboratory, Shenzhen, Guangdong, China and Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China.
E-mail: see [email protected] X. Li, X. Feng and Z. Zheng were with the School of Mathematical Science, Beihang University, Beijing, China, Key Laboratory of Mathematics Informatics Behavioral Semantics, Ministry of Education, China and Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China.
Abstract

Graphs are often used to organize data because of their simple topological structure, and therefore play a key role in machine learning. It turns out that the low-dimensional embedded representations obtained by graph representation learning are extremely useful in various typical tasks, such as node classification, content recommendation and link prediction. However, existing methods mostly start from the microstructure (i.e., the edges) of the graph and ignore its mesoscopic structure (high-order local structure). Here, we propose wGCN, a novel framework that uses random walks to obtain the node-specific mesoscopic structures of the graph, and then uses these mesoscopic structures to reconstruct the graph and organize the feature information of the nodes. Our method can effectively generate node embeddings for previously unseen data, as demonstrated in a series of experiments on citation networks and social networks, where it has advantages over the baseline methods. We believe that combining high-order local structural information can more efficiently explore the potential of the network, which will greatly improve the learning efficiency of graph neural networks and promote the establishment of new learning models.

Index Terms:
Representation learning, Graph neural network, Random walk, Node classification.

1 Introduction

A graph is a flexible data structure that can store data and reflect the underlying topological relationships between data items. Because of this, graph structures are widely used in various fields, including social networks [1], biological protein-protein networks [2], drug molecule graphs [3], knowledge networks [4], etc. For example, in a drug molecule graph, a node can represent an atom, and an edge between two nodes indicates that there is a chemical bond between the corresponding atoms. In this way, people can easily store data information in the network and access relational knowledge about the interacting entities from this structured knowledge base at any time.

The traditional methods of extracting information from graphs depend on hand-crafted statistics (degree, clustering coefficient [5], centrality [6, 7], etc.) or carefully designed kernel functions [8] (or other characteristic functions). However, with the development of information technology, the amount of data increases rapidly, which makes graph networks more and more complex. As a result, traditional manual feature extraction becomes expensive, time-consuming and unreliable, and cannot extract valid information from such complicated structures.

In this case, representation learning plays a key role in efficiently extracting information from the graph. Graph representation learning is a technique for learning from graph-structured data that aims to transform complex raw data into easy-to-process low-dimensional vector representations. In essence, a graph representation learning method learns a function that maps the input graph or its nodes to points in a low-dimensional vector space. Compared with traditional methods, representation learning treats the problem of capturing graph information as part of the learning task itself, rather than merely as a preprocessing step. In fact, representation learning obtains features in a data-driven way, avoiding the trouble of traditional manual feature extraction.

The goal of representation learning is not simply to obtain results directly, but to obtain an efficient representation of the original data. In other words, the choice of representation usually depends on the subsequent learning task, and a good representation should make learning the downstream task easier. In recent years, representation learning on graphs has become very popular, and many good results have been obtained. Belkin et al. [9] proposed the Laplacian eigenmap in 2002, which is one of the earliest and most famous representation learning methods. Since then a large number of methods have been proposed, with representatives such as GraRep (Cao et al. [10]), DeepWalk (Perozzi et al. [11]), node2vec (Grover et al. [12]), GNN (Scarselli et al. [13]) and GCN (Kipf et al. [14]).

However, on the one hand, these methods are based on the binary relationships (i.e., edges) in the graph and cannot leverage its higher-order local structure; on the other hand, due to the sparseness of the edges, many methods encounter difficulties in generalization. Our method is therefore proposed to leverage the high-order connection patterns that are essential for understanding the control and regulation of the basic structure of complex network systems, and to alleviate the problem of edge sparsity in the graph.

Present work. This paper develops a new framework, wGCN, a controllable and supervised representation learning method. wGCN can be regarded as a two-stage task: in the first stage, starting from the original graph data, we run a random walk algorithm on the graph and use the generated walks to redistribute the weights of the network, obtaining the reconstructed graph; in the second stage, we feed the reconstructed graph into a graph convolutional network and train it with the original features and labels. This new framework improves the accuracy of prediction tasks at the cost of only a little extra time. We prove that it has the same order of time complexity as the graph convolutional network, and we conduct extensive experiments on a variety of real datasets, all of which yield better test results than the baselines.

The rest of this article is organized as follows. Section 2 summarizes related work, and Section 3 introduces our representation learning method wGCN in detail. The experiments and results are presented in Section 4, and Section 5 concludes and discusses future work.

2 Related Work

This section reviews previous work closely related to wGCN. Specifically, we first introduce random walks, and then introduce some classic graph representation learning methods that inspired our approach.

2.1 Random Walk

A random walk describes a process whose future steps and directions cannot be predicted from its past behavior. The core idea is that the conserved quantities of any such irregular walker correspond to a law of diffusion and transport; it is the idealized mathematical description of Brownian motion.

Figure 1: The image on the left is the original graph, and the image on the right shows a random walk of length 8 (8 nodes, marked in red). The red arrows show the walking order, and this random walk is recorded as $v_{i_{1}}v_{i_{2}}\ldots v_{i_{8}}$.

A random walk on a graph starts from one node (or a set of nodes) and moves between nodes according to specific rules. For example, a walker randomly selects a vertex to start from, walks to a neighbor of the current vertex with probability $p\ (0<p\leq 1)$, and jumps to an arbitrary node in the graph with probability $1-p$ (the jump probability). Each step yields a probability distribution over the nodes to be visited, which serves as the input of the next step of the random walk, and this process is iterated. Under certain conditions, the iteration converges to a stationary probability distribution. Fig. 1 shows a specific random walk on a graph. Random walks are widely used in information retrieval; the well-known PageRank algorithm [15] is a classic application.
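For concreteness, the following is a minimal Python sketch of the kind of walk just described, using networkx; the graph, the probability p, and the function name are illustrative and not taken from the paper's implementation.

```python
import random
import networkx as nx

def random_walk(G, start, length, p=0.85):
    """Sketch: from the current node, move to a random neighbor with
    probability p, or jump to an arbitrary node with probability 1 - p."""
    nodes = list(G.nodes())
    walk = [start]
    current = start
    for _ in range(length - 1):
        neighbors = list(G.neighbors(current))
        if neighbors and random.random() < p:
            current = random.choice(neighbors)   # local step along an edge
        else:
            current = random.choice(nodes)       # jump anywhere in the graph
        walk.append(current)
    return walk

# Toy example on a small built-in graph.
G = nx.karate_club_graph()
print(random_walk(G, start=0, length=8))
```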

2.2 Representation Learning Method

The graph representation learning method is a technique for learning from graph-structured data. It aims to transform complex raw data into a low-dimensional representation that is convenient for machine learning methods to develop and process; to a certain extent, it can also be regarded as a dimensionality reduction method. According to whether a neural network is used, graph representation learning methods can be divided into two categories: i) direct encoding methods that do not use neural networks; ii) neural network-based encoding methods.

Direct encoding methods. Early node representation learning methods are concentrated in the matrix factorization framework. Belkin et al. present the Laplacian eigenmaps method, an encoder-decoder framework in which similarity is measured by Euclidean distance in the embedding space [9]. Following Laplacian eigenmaps, Ahmed et al. [16] and Cao et al. [10] propose Graph Factorization (GF) and GraRep respectively; their main difference lies in the matrix being factorized: GF uses the original adjacency matrix of the graph, while GraRep is based on various powers of the adjacency matrix, capturing high-order relationships. Ou et al. present High-Order Proximity preserved Embedding (HOPE), which preserves the asymmetric transitivity of directed graphs [17].

Figure 2: Explanation of node2vec. In subfigure (a), node $s$ is the source node of a random walk, and node $c$ is the current node. Hyperparameters $p$ and $q$ control the transition probabilities of the next step (from node $c$ to its neighbors). As marked in subfigure (a), the unnormalized probability of returning to node $s$ from node $c$ is $1/p$; the probability from node $c$ to node $v_{1}$ or $v_{4}$ is $1$, since $v_{1}$ and $v_{4}$ are also neighbors of node $s$; and the probability from node $c$ to node $v_{2}$ or $v_{3}$ is $1/q$, since $v_{2}$ and $v_{3}$ are second-order neighbors of node $s$. Thus, when $p$ decreases, the walk tends to return to the source node (i.e., "more local"); when $q$ decreases, the walk tends to move away from the source node (i.e., "more global"). The consequence is shown in subfigure (b): red arrows show a walk that is "more local", and blue arrows show a walk that is "more global".

On the other hand, there are also some classic methods based on random walks instead of matrix factorization, which makes them more flexible. Two representative methods are DeepWalk [11] and node2vec [12]. DeepWalk uses random walks to decompose the graph, a nonlinear structure, into multiple linear node sequences; the node sequences are then treated as "sentences" (and the nodes as "words") and processed by SkipGram [18]. node2vec allows an adjustable random walk on the graph: it creatively introduces two hyperparameters $p$ and $q$ that control whether the walk is "more local" or "more global", in other words, closer to depth-first or breadth-first search relative to the starting node (see Fig. 2).
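To make the role of $p$ and $q$ concrete, the next snippet sketches one biased step of a node2vec-style walk. It follows the rule illustrated in Fig. 2 and is an illustrative sketch, not the authors' code or the node2vec reference implementation.

```python
import random
import networkx as nx

def node2vec_step(G, prev, curr, p=1.0, q=1.0):
    """Choose the next node from curr, biased by where the walk came from:
    weight 1/p to return to prev, 1 for common neighbors of prev and curr,
    and 1/q for nodes two hops away from prev."""
    neighbors = list(G.neighbors(curr))
    weights = []
    for x in neighbors:
        if x == prev:
            weights.append(1.0 / p)      # return to the previous node ("more local")
        elif G.has_edge(x, prev):
            weights.append(1.0)          # common neighbor of prev and curr
        else:
            weights.append(1.0 / q)      # move further from prev ("more global")
    return random.choices(neighbors, weights=weights, k=1)[0]
```

A small $p$ therefore keeps the walk near its source, while a small $q$ pushes it outward, matching the "more local"/"more global" behaviour described above.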

Neural network-based encoding methods. The direct encoding methods above generate a representation vector for each node independently, which results in no parameter sharing between nodes, high computational complexity, and underutilized node attribute information. To address these problems, many graph neural network methods have been proposed in recent years. Scarselli et al. [13] present the Graph Neural Network (GNN) model, which implements a function that maps the graph and one of its nodes to Euclidean space. Inspired by the success of convolutional neural networks in image processing (CNNs drastically reduce the number of parameters by using convolution kernels to gather information from local pixels; however, they run into difficulties on graphs because of the irregularity of the structure, i.e., the number of neighbors varies from node to node), Kipf et al. [14] propose the well-known Graph Convolutional Network (GCN). GCN cleverly applies the convolution operation to the graph structure, aggregating the information of neighboring nodes on the irregular graph.

3 Method

We first perform a random walk operation on the original graph, and then use the obtained "walks" to reconstruct the graph network. After that, the graph convolution operation is performed on the reconstructed graph to obtain the representation vectors of the nodes. We believe that this approach can combine the nodes of the graph at multiple levels to obtain a more informative representation. Finally, the obtained representation vectors are sent to a downstream classifier (such as kNN, MLP, and so on) to complete the node classification task.

Next, we introduce the technical details of our method. First, the random walk and the reconstructed graph are described in detail in Section 3.1; then, the wGCN embedding algorithm that generates node embeddings is presented in Section 3.2; finally, in Section 3.3, we analyze the complexity of the algorithm and prove the corresponding bound.

3.1 Reconstructed Graph

In this section we explain in detail how the graph is reconstructed using random walks. For the convenience of explanation, some commonly used symbols are given below.

Formally, let a graph be $G=(V,E)$, where $V$ is the set of nodes in the network, $|V|$ is the number of nodes, and $E\subseteq(V\times V)$ is the set of edges. Given a labeled network with node feature information $G=(V,E,X,Y)$, where $X\in R^{N\times T}$ is the feature information matrix ($N$ is the number of nodes and $T$ is the feature dimension) and $Y\in R^{N\times L}$ is the label information matrix ($L$ is the number of label categories), our goal is to use the labels of some of the nodes for training and to generate a vector representation matrix $Z$ of the nodes.

Then, we give:

Definition 1: Given a graph $G=(V,E,X,Y)$ with $V=\{v_{1},v_{2},\ldots,v_{|V|}\}$ and an initial node $v_{i_{1}}$, a random walk of length $n$ rooted at $v_{i_{1}}$ is denoted as $W_{v_{i_{1}}}=\{(v_{i_{1}},v_{i_{2}}),(v_{i_{2}},v_{i_{3}}),\ldots,(v_{i_{n-1}},v_{i_{n}})\}$, where $1\leq i_{1},i_{2},\ldots,i_{n}\leq|V|$ and the two nodes in the same parenthesis are neighbors (i.e., there is an edge between $v_{i_{j}}$ and $v_{i_{j+1}}$). For convenience, we also write $W_{v_{i_{1}}}=\{v_{i_{1}}v_{i_{2}}v_{i_{3}}\ldots v_{i_{n-1}}v_{i_{n}}\}$.

Figure 3: Example of reconstructing a graph using random walks. Subfigure (a) is the original graph, in which node $c$ is the source node, marked in red. The "walks" are marked with red lines and extracted in subfigure (b). Subfigure (c) is the reconstructed graph, where the weights $\alpha^{k},\ k=1,2,3$ are used to weight the nodes that are originally $k$ hops away from node $c$.

As shown in Fig. 3, we first perform the random walk operation on the graph (subfigure (a)) to obtain the "walks" (subfigure (b)). We then use the obtained "walks" to reconstruct the graph network. A node that appears in the same walk as node $c$ is considered to be related to node $c$, and it is connected to node $c$ in the reconstructed graph (the red lines in subfigure (c)). We assign different weights to distinguish the distances between these nodes and node $c$ within the "walks". Thus, in the reconstructed graph, the weight $w_{ij}$ for nodes $i,j$ is given by the following formula:

w_{ij}=a_{ij}+\sum_{k\in\{\text{the spacings of }i,j\text{ in ``walks''}\}}\alpha^{k}, \qquad (1)

where $a_{ij}$ is the entry of the original adjacency matrix, defined as follows:

a_{ij}=\begin{cases}0,&\text{if nodes }i,j\text{ are not connected in the original graph}\\ 1,&\text{if nodes }i,j\text{ are connected in the original graph,}\end{cases} \qquad (2)

and $\alpha$ is a parameter to be determined indicating the decay speed with distance, satisfying $0<\alpha<1$; $k$ is the exponent, whose values are the distances between nodes $i,j$ in the "walks". For example, node $c$ and node $v_{12}$ are not connected in the original graph (subfigure (a)), so $a_{cv_{12}}=0$; node $c$ and node $v_{12}$ have a distance of $2$ in the "walks" (subfigure (b)), so $k=2$; in summary:

w_{cv_{12}} = a_{cv_{12}}+\sum_{k\in\{2\}}\alpha^{k} = 0+\alpha^{2} = \alpha^{2}. \qquad (3)
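As a minimal sketch of Eq. (1), in the source-rooted form used by the complete algorithm given in the next section (each walk contributes $\alpha^{k}$ to the weight between its root and the node $k$ steps away), the following assumes walks are given as lists of node indices; the function name and the symmetric update are illustrative.

```python
import numpy as np

def reconstruct_weights(A, walks, alpha=0.8):
    """Sketch of Eq. (1): start from the original adjacency matrix A and,
    for every walk, add alpha**k to the weight between the walk's root and
    the node that is k steps away from it inside the walk."""
    W = np.array(A, dtype=float)
    for walk in walks:                 # each walk is a list of node indices, walk[0] is the root
        root = walk[0]
        for k, node in enumerate(walk[1:], start=1):
            if node != root:
                W[root, node] += alpha ** k
                W[node, root] += alpha ** k   # keep the weights symmetric
    return W
```

With a single walk in which $v_{12}$ is two steps from the root $c$ and $\alpha=0.8$, this reproduces the example above: $w_{cv_{12}}=0+0.8^{2}=0.64$.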

In the next section, we give the complete algorithm.

3.2 Embedding Algorithm

Algorithm 1 : wGCN embedding algorithm.
Input: Graph $G=(V,E)$; adjacency matrix $A$; number of walks from each node $T$; walk length $L$; reconstruction scale parameter $\lambda$; node feature matrix $X$; number of graph convolutional layers $H$;
Output: Representation matrix $Z$;
1: Initialize the random walk matrix $\mathcal{A}$ as zero;
2: for $t=1...T$ do
3:     for $v\in V$ do
4:         $W_{v}^{t}=RandomWalk(G,v,L)$;
5:     end for
6: end for
7: for $t=1...T$ do
8:     for $v\in V$ do
9:         for $l=2...L$ do
10:             $\alpha_{vW_{v}^{t}[l]}=\alpha_{vW_{v}^{t}[l]}+\alpha^{l-1}$;
11:         end for
12:     end for
13: end for
14: $\tilde{A}=Normalized(A+\lambda\cdot\mathcal{A})$;
15: $h_{v}^{0}=x_{v},\ \forall v\in V$;
16: for $k=1...H$ do
17:     for $v\in V$ do
18:         $h_{v}^{k}\leftarrow f(h_{v}^{k-1},h_{u}^{k-1}),\ u\in N_{\tilde{A}}(v)$;
19:     end for
20: end for
21: $z_{v}=h_{v}^{H},\ \forall v\in V$;
22: return $Z$ (whose column vectors are $z_{v},\ v\in V$);

After the required data is entered, we first initialize the random walk matrix $\mathcal{A}$ to a zero matrix in line 1. The random walk matrix $\mathcal{A}$ is the matrix generated by the random walks, whose element $\alpha_{uv}$ is the weight attached to the node pair $u,v$ by the "walks". In lines 2-14, the reconstructed graph matrix is built as described in Section 3.1:

1. Generate $T$ "walks" of length $L$ at each node ($RandomWalk(\cdot)$) in lines 2-6;

2. Update the elements of the random walk matrix $\mathcal{A}$ in lines 7-13;

3. Obtain the reconstructed graph matrix $\tilde{A}$ in line 14.

Note that the "walk" returned by $RandomWalk(\cdot)$ is a node list. For example, $W_{v_{1}}^{1}=RandomWalk(G,v_{1},L)=[v_{1}v_{2}\ldots v_{L}]$, in which $v_{i+1}$ is a neighbor of $v_{i}$ for $1\leq i\leq L-1$, and $W_{v_{1}}^{1}[2]=v_{2}$. In addition, the adjacency matrix $A$ and the random walk matrix $\mathcal{A}$ are mixed and normalized to obtain the reconstructed graph matrix $\tilde{A}$ in line 14, where $\lambda$ is the mixing ratio coefficient and $Normalized(\cdot)$ is a symmetric normalization function:

Normalized(X)=D^{-\frac{1}{2}}(I+X)D^{-\frac{1}{2}}, \qquad (4)

where $X$ is a square matrix, $I$ is the identity matrix and $D$ is a diagonal matrix satisfying $D_{ii}=\sum_{j}X_{ij}$.
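A minimal NumPy sketch of line 14 of Algorithm 1 and of Eq. (4), assuming dense matrices; the function names are illustrative, and $D$ is built from the row sums of $X$ exactly as stated above.

```python
import numpy as np

def normalized(X):
    """Eq. (4): D^{-1/2} (I + X) D^{-1/2} with D_ii = sum_j X_ij."""
    d = X.sum(axis=1).astype(float)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(d, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0   # guard isolated nodes
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return D_inv_sqrt @ (np.eye(X.shape[0]) + X) @ D_inv_sqrt

def reconstructed_matrix(A, A_walk, lam=0.9):
    """Line 14 of Algorithm 1 (sketch): mix the adjacency matrix A with the
    random walk matrix A_walk and symmetrically normalize the result."""
    return normalized(A + lam * A_walk)
```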

Then the graph convolution operation is performed. The number of graph convolutional layers $H$ is specified by the user in advance. The initial representation of every node is set to $h_{v}^{0}=x_{v},\ \forall v\in V$ in line 15. In lines 16-20, we perform the graph convolution operation based on the reconstructed graph: in the update $h_{v}^{k}\leftarrow f(h_{v}^{k-1},h_{u}^{k-1}),\ u\in N_{\tilde{A}}(v)$, the function $f(\cdot)$ is a weighted nonlinear aggregation function whose purpose is to recombine the information of the target node and its neighbors. Formally,

h_{v}^{k}=\sigma\Big(\sum_{u\in N_{\tilde{A}}(v)\cup\{v\}}\tilde{a}_{vu}\cdot h_{u}^{k-1}\cdot W_{k}\Big), \qquad (5)

where $h_{v}^{k}$ is the hidden representation of node $v$ in the $k$-th layer; $\tilde{a}_{vu}$ is the entry in row $v$ and column $u$ of the reconstructed graph matrix $\tilde{A}$, indicating the closeness between nodes $v$ and $u$; $W_{k}$ is the trainable parameter matrix of layer $k$; $N_{\tilde{A}}(v)$ is the set of neighbors of node $v$ in the reconstructed graph; and $\sigma(\cdot)$ is the ReLU function:

\sigma(x)=\max(0,x). \qquad (6)
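The propagation rule (5) can also be written as a dense matrix product. The following sketch (hypothetical shapes and random data) illustrates one wGCN layer, where A_tilde already contains the self-connections through the normalization of Eq. (4).

```python
import numpy as np

def wgcn_layer(A_tilde, H_prev, W_k):
    """Eq. (5) in matrix form: each node aggregates its own and its neighbors'
    hidden states, weighted by the reconstructed matrix, then applies W_k and ReLU."""
    return np.maximum(A_tilde @ H_prev @ W_k, 0.0)

# Hypothetical shapes: N nodes, input dimension F0, hidden dimension F1.
N, F0, F1 = 100, 32, 16
A_tilde = np.eye(N)                       # stands in for the normalized reconstructed matrix
H0 = np.random.rand(N, F0)                # initial features, h_v^0 = x_v
W1 = 0.01 * np.random.randn(F0, F1)
H1 = wgcn_layer(A_tilde, H0, W1)          # hidden representations after one layer
```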

Then the final representation vector $z_{v}$ of node $v$ is obtained. The representation vectors can be sent to a downstream classifier (such as a softmax classifier) to obtain the predicted category vectors $\hat{y}_{v},\ v\in V$.

If the softmax classifier is chosen (its form is as follows),

\text{Softmax}(z)_{i}=\frac{\exp(z_{i})}{\sum_{j=1}^{d}\exp(z_{j})},\quad i=1,\ldots,d, \qquad (7)

where $z$ is the representation vector, $\text{Softmax}(z)_{i}$ is the $i$-th component of the vector $\text{Softmax}(z)$, and $d$ is the dimension of the representation vector $z$,

then the cross entropy function can be used as the loss function to train the parameters of our model:

loss=-\sum_{v}\big(y_{v}\cdot\log(\hat{y}_{v})+(1-y_{v})\cdot\log(1-\hat{y}_{v})\big),\quad v\in trainset, \qquad (8)

where $y_{v}$ is the true label of node $v$.
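A small sketch of the classifier and loss in Eqs. (7) and (8) for a single node, with a hypothetical three-class example; in training, the loss is summed over the nodes of the training set.

```python
import numpy as np

def softmax(z):
    """Eq. (7): softmax over the components of a representation vector."""
    e = np.exp(z - z.max())               # shift for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_hat, eps=1e-12):
    """Eq. (8) for one node: cross entropy summed over the label vector."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat))

z_v = np.array([2.0, 0.5, -1.0])          # hypothetical representation of one node
y_v = np.array([1.0, 0.0, 0.0])           # its one-hot label
print(cross_entropy(y_v, softmax(z_v)))
```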

3.3 Complexity Analysis

Our method is based on GCN. From the work of Kipf et al. [14], we know that the computational complexity of the original GCN, based on the following formula, is $\mathcal{O}(|\mathcal{E}|CHF)$, where $\mathcal{E}$ is the edge set of the graph:

Z=f(X,A)=\text{softmax}\big(\hat{A}\,\text{ReLU}\big(\hat{A}XW^{(0)}\big)W^{(1)}\big), \qquad (9)

where $A$ is the adjacency matrix, $X$ is the feature matrix, and $\hat{A}$ is the normalized version of the adjacency matrix $A$. $W^{(0)}\in\mathbb{R}^{C\times H}$ is an input-to-hidden weight matrix and $W^{(1)}\in\mathbb{R}^{H\times F}$ is a hidden-to-output weight matrix, where $C$ is the number of input channels, $H$ is the number of feature maps in the hidden layer and $F$ is the number of feature maps in the output layer [14].
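For reference, Eq. (9) corresponds to the following two-layer forward pass (a sketch with dense NumPy arrays and a row-wise softmax; not the original implementation, which uses sparse operations).

```python
import numpy as np

def gcn_forward(A_hat, X, W0, W1):
    """Eq. (9): Z = softmax(A_hat ReLU(A_hat X W0) W1), softmax taken row-wise."""
    H = np.maximum(A_hat @ X @ W0, 0.0)                     # hidden layer with ReLU
    logits = A_hat @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable row-wise softmax
    return e / e.sum(axis=1, keepdims=True)
```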

The running time of our algorithm is mainly consumed in the training phase of the neural network. Thus, we prove that the computational complexity of our method in the training phase is also $\mathcal{O}(|\mathcal{E}|CHF)$, while keeping the number of hidden layers unchanged.

Proof.

Let $L$ be the length of the "walks", $T$ the number of "walks" from each node, and $A,\mathcal{A},\tilde{A}$ the original adjacency matrix, the random walk matrix and the reconstructed graph matrix respectively. We compare the number of non-zero elements in $A$ and $\tilde{A}$.

Since each node has $T$ "walks" of length $L$, the maximum number of non-zero elements contributed per node to the random walk matrix $\mathcal{A}$ is $T(L-1)$ (this occurs when the "walks" rooted at a node contain no duplicate nodes except the root itself).

So $\tilde{A}$ has at most $|V|\,T(L-1)$ more non-zero elements than $A$ ($|V|$ is the number of nodes). In our experiments, $L$ is set to 5 and $T$ is generally set to $2|\mathcal{E}|/|V|$ (the average degree of the graph), so $|V|\,T(L-1)=8|\mathcal{E}|$ is of the same order as $|\mathcal{E}|$. Therefore, the computational complexity satisfies:

\mathcal{O}\big((|\mathcal{E}|+|V|T(L-1))CHF\big) = \mathcal{O}\big((|\mathcal{E}|+2|V|(|\mathcal{E}|/|V|)(5-1))CHF\big) = \mathcal{O}(9|\mathcal{E}|CHF) = \mathcal{O}(|\mathcal{E}|CHF).

4 Experiments and Result

In Section 4.1, we introduce the datasets used in the experiments, and the specific experimental settings are described in Section 4.2. In Section 4.3, the baselines and previous approaches are introduced, and the results are presented in Section 4.4.

4.1 Datasets

Table 1: Summary of the datasets.

Datasets    Types     Undirected  Nodes  Edges    Features  Classes
Cora        Citation  yes         2708   5429     1433      7
Citeseer    Citation  yes         3327   4732     3703      6
Pubmed      Citation  yes         19717  44338    500       3
107Ego(F)   Social    yes         1045   53498    576       9
414Ego(F)   Social    yes         159    3386     105       7
1684Ego(F)  Social    yes         792    28048    319       16
1912Ego(F)  Social    yes         755    60050    480       46
2106Ego(G)  Social    no          2457   174309   2094      2
3600Ego(G)  Social    no          2356   582827   995       3
5249Ego(G)  Social    no          826    435569   2627      2
7461Ego(G)  Social    no          997    1270141  2882      2

Citation Network Datasets: Citeseer, Cora, and Pubmed. In these three standard citation network datasets, nodes represent documents and (undirected) edges represent citation links. Each document has a sparse feature vector and a category label [14]. As shown in Table 1, the Cora, Citeseer and Pubmed datasets contain 1433, 3703 and 500 features per node respectively, and the number of label categories is 7, 6 and 3 respectively.

Social Network Datasets: Ego-Facebook and Ego-Gplus. The Ego-Facebook dataset consists of 'circles' (or 'friends lists') from Facebook [19] and contains many subsets. Take '107Ego(F)' as an example: this dataset is a network with node '107' as the core, where nodes represent users and (undirected) edges represent interactions between users, and each user has a feature attribute vector and a category label. Ego-Gplus is similar to Ego-Facebook, except that the data comes from Google+ and the edges are directed [19]. We choose suitable subsets of Ego-Facebook and Ego-Gplus for the experiments (to facilitate the distinction, (F) denotes a subset of Ego-Facebook and (G) a subset of Ego-Gplus). After preprocessing, data whose information has been lost is removed.

4.2 Experimental Setup

Table 2: Parameter setting on social network datasets.

Datasets    W_1             W_2            W_3          T    α    λ
107Ego(F)   R^{576×128}     R^{128×32}     R^{32×9}     40   0.8  0.8
414Ego(F)   R^{105×36}      R^{36×12}      R^{12×4}     24   0.8  0.7
1684Ego(F)  R^{319×128}     R^{128×40}     R^{40×13}    35   0.5  0.9
1912Ego(F)  R^{480×160}     R^{160×60}     R^{60×19}    80   0.8  0.9
2106Ego(G)  R^{2094×128}    R^{128×16}     R^{16×2}     30   0.8  0.8
3600Ego(G)  R^{995×128}     R^{128×16}     R^{16×3}     200  0.5  0.9
5249Ego(G)  R^{2627×256}    R^{256×16}     R^{16×2}     50   0.8  0.9
7461Ego(G)  R^{2882×256}    R^{256×16}     R^{16×2}     30   0.8  0.7

Citation Network Datasets. On the citation network datasets, we apply a two-layer wGCN model. Specifically, on the Cora dataset we perform the random walk operation 8 times for each node, stopping each walk after it has passed 4 distinct nodes; the decay rate ($\alpha$, introduced in Section 3.1) is set to 0.8 and the mixing ratio coefficient ($\lambda$, introduced in Section 3.2) is set to 0.9. On the Citeseer dataset, the random walk operation is performed 3 times for each node, again stopping after 4 distinct nodes; the decay rate is set to 0.8 and the mixing ratio coefficient to 0.73. On the Pubmed dataset, the random walk operation is performed 5 times for each node, also stopping after 4 distinct nodes; the decay rate and the mixing ratio coefficient are set to 0.8 and 0.9, respectively. The remaining parameters follow the settings in [14].
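For readability, the citation-network settings just listed can be summarized as follows (a hypothetical configuration dictionary; the key names are illustrative and do not come from the authors' code).

```python
# Walks per node, distinct nodes visited before stopping, decay rate alpha,
# and mixing ratio lambda for each citation dataset (values from the text above).
CITATION_CONFIG = {
    "Cora":     {"walks_per_node": 8, "distinct_nodes": 4, "alpha": 0.8, "lambda": 0.90},
    "Citeseer": {"walks_per_node": 3, "distinct_nodes": 4, "alpha": 0.8, "lambda": 0.73},
    "Pubmed":   {"walks_per_node": 5, "distinct_nodes": 4, "alpha": 0.8, "lambda": 0.90},
}
```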

Social Network Datasets. On the social network datasets, we apply a three-layer wGCN model. Specifically, the shapes of the parameter matrices ($W_{1},W_{2},W_{3}$) of the three-layer model, the number $T$ of random walks from each node, the decay rate $\alpha$ and the mixing ratio coefficient $\lambda$ on the 8 subdatasets are shown in Table 2. In addition, the learning rates on the subsets of Ego-Facebook and Ego-Gplus are set to 0.02 and 0.01 respectively. The random walk operation for each node is also stopped after passing 4 distinct nodes.

4.3 Baselines and Previous Approaches

Citation Network Datasets. On the citation network datasets (Citeseer, Cora, and Pubmed), our method is compared with the same strong baselines and previous approaches as in [14], including label propagation (LP) [20], semi-supervised embedding (SemiEmb) [21], manifold regularization (ManiReg) [22], the iterative classification algorithm (ICA) [23], Planetoid [24], DeepWalk [11] and GCN [14]. DeepWalk is a method based on random walks, as stated at the beginning of the article, whose sampling strategy can be seen as a special case of node2vec with $p=1$ and $q=1$. GCN, the first method to achieve convolution on the graph, is the best performing baseline method.

Social Network Datasets. On social network datasets (Ego-Facebook and Ego-Gplus), our method is compared against Deepwalk, GraRep and again GCN, which is the strongest baseline in the above experiment. And here, GraRep [10] works by utilizing the adjacency matrix of each order and defining a more accurate loss function that allows non-linear combinations of different local relationship information to be integrated.

4.4 Results

The results of our comparative evaluation experiments are summarized in Tables 3 and 4.

Table 3: The results of classification accuracy on citation network datasets.

Method     Cora   Citeseer  Pubmed
ManiReg    59.5   60.1      70.7
SemiEmb    59.0   60.1      71.1
LP         68.0   45.3      63.0
DeepWalk   67.2   43.2      65.3
ICA        75.1   69.1      73.9
Planetoid  75.7   64.7      77.2
GCN        81.5   70.3      79.0
wGCN       81.8   72.4      79.5

As shown in Table 3, our method achieves the best results on all datasets; compared with the strongest baseline, it improves upon GCN by margins of 0.3, 2.1 and 0.5 on Cora, Citeseer and Pubmed respectively.

Next, we use the social network datasets (Ego-Facebook and Ego-Gplus) for the experiments, and compare the results with the classic adjacency-matrix-based method GraRep [10], the random-walk-based method DeepWalk [11], and the best performing baseline method GCN [14]. The experimental results are shown in Table 4.

Table 4: The results of classification accuracy on social network datasets.

Method     107Ego(F)  414Ego(F)  1684Ego(F)  1912Ego(F)
DeepWalk   77.5       79.2       64.4        66.5
GraRep     90.0       85.4       76.3        77.0
GCN        92.5       93.8       81.9        77.0
wGCN       95.0       97.9       85.0        81.0

Method     2106Ego(G)  3600Ego(G)  5249Ego(G)  7461Ego(G)
DeepWalk   75.8        47.5        76.2        65.6
GraRep     87.9        62.3        98.9        77.4
GCN        92.3        95.7        99.5        80.2
wGCN       93.4        98.1        99.5        83.1

As can be seen from Table 4, on both parts of the social network experiments the results of our method are significantly higher than those of DeepWalk and GraRep. Except for '5249Ego(G)' (where the result matches that of GCN), our method also outperforms GCN on all the other subdatasets.

5 Conclusion

In this paper, we have designed a new framework combined with random walks, wGCN, which reconstructs the graph through random walks to capture higher-order features and effectively aggregates node information. We conduct experiments on a series of datasets (citation networks and social networks, directed and undirected). The results show that wGCN can effectively generate embeddings for nodes of unknown category and obtains better results than the baseline methods.

There are many extensions and potential improvements to our method, such as exploring random walks with different strategies and extending wGCN to handle multi-graph or time-series-graph modes. Another interesting direction for future work is to extend the method to handle edge features, which would allow the model to have wider applications.

Acknowledgments

This work is supported by the Research and Development Program of China (No.2018AAA0101100), the Fundamental Research Funds for the Central Universities, the International Cooperation Project No.2010DFR00700, Fundamental Research of Civil Aircraft No. MJ-F-2012-04, the Beijing Natural Science Foundation (1192012, Z180005) and National Natural Science Foundation of China (No.62050132).

References

  • [1] L. Backstrom and J. Leskovec, “Supervised random walks: Predicting and recommending links in social networks,” in Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM ’11), 2011.
  • [2] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” 2017.
  • [3] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” ser. NIPS’15.   MIT Press, 2015, pp. 2224–2232.
  • [4] X. Wang, Y. Ye, and A. Gupta, “Zero-shot recognition via semantic embeddings and knowledge graphs,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • [5] S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social networks,” Computer Science, vol. 16, no. 3, pp. 115–148, 2011.
  • [6] D. Koschützki, K. A. Lehmann, L. Peeters, S. Richter, D. Tenfelde-Podehl, and O. Zlotowski, Centrality Indices.   Springer Berlin Heidelberg, 2005.
  • [7] N. Perra and S. Fortunato, “Spectral centrality measures in complex networks,” Physical Review E, vol. 78, no. 3 Pt 2, p. 036107, 2008.
  • [8] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt, “Graph kernels,” Journal of Machine Learning Research, vol. 11, no. 2, pp. 1201–1242, 2010.
  • [9] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” Advances in Neural Information Processing Systems, vol. 14, no. 6, pp. 585–591, 2001.
  • [10] S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations with global structural information.”   Association for Computing Machinery, 2015, pp. 891–900.
  • [11] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.
  • [12] A. Grover and J. Leskovec, “Node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
  • [13] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
  • [14] T. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” 2016.
  • [15] A. Langville and C. Meyer, “Deeper inside pagerank,” Internet Mathematics, vol. 1, no. 3, pp. 335–380, 2004.
  • [16] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola, “Distributed large-scale natural graph factorization,” in Proceedings of the 22nd international conference on World Wide Web, 2013.
  • [17] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in Acm Sigkdd International Conference, 2016.
  • [18] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” Computer Science, 2013.
  • [19] J. McAuley and J. Leskovec, “Learning to discover social circles in ego networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12.   Curran Associates Inc., 2012, pp. 539–547.
  • [20] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003.
  • [21] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the Trade: Second Edition, G. Montavon, G. B. Orr, and K.-R. Müller, Eds.   Springer Berlin Heidelberg, 2012, pp. 639–655.
  • [22] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” Journal of Machine Learning Research, vol. 7, no. 1, pp. 2399–2434, 2006.
  • [23] Q. Lu and L. Getoor, “Link-based classification,” in Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, 2003.
  • [24] Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings.”   JMLR.org, 2016, pp. 40–48.