Deep Manifold Learning with Graph Mining

Xuelong Li,, Ziheng Jiao, Hongyuan Zhang, and Rui Zhang^∗

*

Rui Zhang is the Corresponding Author.Rui Zhang, Ziheng Jiao, Hongyuan Zhang, and Xuelong Li are with School of Computer Science and School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, Shaanxi, P. R. China.E-mail: [email protected]; [email protected]; [email protected]; [email protected].

Abstract

Admittedly, Graph Convolution Network (GCN) has achieved excellent results on graph datasets such as social networks, citation networks, etc. However, softmax used as the decision layer in these frameworks is generally optimized with thousands of iterations via gradient descent. Furthermore, due to ignoring the inner distribution of the graph nodes, the decision layer might lead to an unsatisfactory performance in semi-supervised learning with less label support. To address the referred issues, we propose a novel graph deep model with a non-gradient decision layer for graph mining. Firstly, manifold learning is unified with label local-structure preservation to capture the topological information of the nodes. Moreover, owing to the non-gradient property, closed-form solutions is achieved to be employed as the decision layer for GCN. Particularly, a joint optimization method is designed for this graph model, which extremely accelerates the convergence of the model. Finally, extensive experiments show that the proposed model has achieved state-of-the-art performance compared to the current models.

Index Terms:

Deep graph learning, orthogonal manifold, closed-form solution, topological information.

Refer to caption — Figure 1: A framework of the proposed model. The dark blue node is the label data and the gray node is the unlabeled data. As is shown, the topology of the prior graph is fully embedded into the model. After the deep graph network extracts the graph embedding, the non-gradient decision layer, unifying the orthogonal manifold and label local-structure preservation, mines the distribution of these embeddings to predict the labels.

1 Introduction

Classification,a classic task in machine learning, aims to classify the unlabeled data by learning the information from the labeled data. However, considering the expensive handcraft labeling, privacy, and security, it is difficult to obtain abundant labeled data for training classifiers. In recent years, semi-supervised algorithms trained with scarce labeled data make some progress in the practical application[1]. In general, the semi-supervised models could be divided into two categories, based on label information and based on the data distribution[2] .

Based on label information, these models mainly focus on learning high-quality features from the labeled data, such as co-training method [3, 4, 5, 6]. Besides, latent structure information often helps to enhance classification accuracy [7, 8, 9, 10, 11]. Among them, manifold learning is a classical method and can explore the latent topological information of the data [12]. The KPCA [13] integrates manifold learning into semi-supervised dimensionality reduction. Besides, a novel generalized power iteration (GPI) [14] is employed to solve the problem on the Stiefel manifold. Accordingly, the orthogonal regression manifold (OR) is the least square problem defined on the Stiefel manifold.

However, it is known that the classic convolution neural network mainly deals with regular data such as images and weakly handles irregular data tasks like graph node classification. Some researchers utilize spectral decomposition to explore the structure of the graph and expand the deep network into irregular data space [15, 16, 17, 18]. [15] offers flexibility to compose and evaluate different deep networks using classic machine learning algorithms. Besides, the pseudo-labels generated by label propagation are utilizing to train graph convolution networks[16]. Masked-GCN [17] only propagates some attributes of nodes to the neighbors, which jointly considers the distribution of local neighbors. Apart from that, [19] proposes Propagation-regularization to boost the performance of existing GNN models. Aiming to accelerate the speed of the model, [20] suggests a method called distant compatibility estimation.

Generally, softmax utilizes the deep feature to predict the label and is extensively employed as the decision layer in the graph networks. However, it ignores the distribution and latent structure of the irregular data during classification. Apart from that, this decision layer is optimized via gradient descent with thousands of iterations to obtain the approximate solutions. Besides, due to the lack of sufficient prior labels under the semi-supervised tasks, trivial solutions can not be avoided during the optimization.

To tackle the referred deficiency, we propose a novel deep model for graph knowledge learning with a non-gradient decision layer. Our major contributions are listed by the following items:

1)

Unifying the orthogonal manifold with label local-structure preservation to mine the topological information of the deep embeddings, the novel non-gradient graph decision layer is put forward.
2)

The theorems are designed to solve the proposed layer with an elegant analytical solution, which accelerates the convergence of the model.
3)

A novel deep graph convolution network is proposed and optimized jointly. Moreover, extensive experiments suggest that the model has achieved state-of-the-art results.

2 Notations and Background

2.1 Notations

In this paper, the boldface capital letters and boldface lower letters are employed to present matrics and vectors respectively. Suppose the $\bm{m_{i}}$ as the $i$ -th row of the matrix $\bm{M}\in R^{d\times c}$ . The element of $\bm{M}$ is $m_{ij}$ . Besides, the $F$ -norm of the matrix $\bm{M}$ is defined as $||\bm{M}||_{F}=\sqrt{\sum_{i=1}^{d}\sum_{j=1}^{c}m_{ij}^{2}}$ . The trace and transpose of matrix $\bm{M}$ are denoted as $Tr(\bm{M})$ and $\bm{M}^{T}$ , respectively. $\nabla_{\bm{x}}f(\bm{x})$ means the the gradient of $f(\bm{x})$ w.r.t. $\bm{x}$ . And $\bm{1}_{c}$ is a unit column vector with dimension c. For the semi-supervised classification, $\bm{X}=[\bm{x_{1},x_{2},...,x_{n}}]\in R^{n\times d}$ is the feature matrix with $n$ samples and $d$ dimensions. And we have $l$ labeled data points and $u$ unlabeled data samples where $l+u=n$ . Let $\bm{Y}=[\bm{Y}_{l};\bm{Y}_{u}]\in R^{n\times c}$ denotes the labeled and unlabeled with $c$ classes. And $\bm{Y}_{u}$ is initialized according to $\bm{Y}_{u}\bm{1}_{c}=\bm{1}_{u}$ . Apart from that, we denote $\bm{H}^{(m)}\in R^{n\times d_{m}}$ as the $m$ -th layer hidden feature.

2.2 Background

The classic convolutional neural network is designed to process regular data like images. However, there exists numerous irregular data, such as citation networks and social networks. Owing to the distribution of these data is not shift-invariant, the traditional neural networks could not extract the latent topological information of these data. Aiming to tackle these issues, some researchers try to transfer traditional Laplacian kernel and convolution operators into graph data space by spectral decomposition based on the graph [21, 22, 23]. However, the spectral decomposition of the Laplacian matrix is very time-consuming. Defferrard et al [22] employ the Chebyshev polynomial to fit the results of spectral decomposition and design a Chebyshev convolution kernel as follows

g_{\bm{\theta}}=g_{\bm{\theta}}(\bm{\Lambda})\approx\sum_{i=0}^{K}\theta_{i}T_{k}(\tilde{\bm{\Lambda}}),

(1)

where $\tilde{\bm{\Lambda}}=2\bm{\Lambda}/\lambda_{max}-\bm{I}_{N}$ is the normalized feature vector matrix, $\lambda_{max}$ is the spectral radius, $\bm{\theta}\in R^{K}$ is Chebyshev coefficient vector, and the Chebyshev polynomial is defined as $T_{k}(x)=2xT_{k-1}(x)-T_{k-2}(x)$ .

Based on this, Kipf et al [24] use the first-order expansion of the Chebyshev polynomial to fit the graph convolution kernel with $K=1,\lambda_{max}=2$ . To prevent overfitting and reduce complexity, the authors introduce some tricks, $\bm{\theta}_{0}=-\bm{\theta}_{1}$ , $\bm{\tilde{A}}=\bm{A}+\bm{I}_{N}$ and the diagnonal element in degree matrix is $\widetilde{\bm{D}}_{ii}=\sum_{j}\widetilde{\bm{A}}_{ij}$ . Finally, the classic GCN is defined as follows

\bm{H}^{(l+1)}=\sigma(\widetilde{\bm{D}}^{-\frac{1}{2}}\widetilde{\bm{A}}\widetilde{\bm{D}}^{-\frac{1}{2}}\bm{H}^{(l)}\bm{W}^{(l)}).

(2)

Apart from that, manifold learning [25] is a classic nonlinear algorithm for dimensionality reduction, representation learning, and classification tasks. It believes that the data nodes are generated from a low-dimensional manifold embedded in a high-dimensional space and this procedure is often described by a function with few underlying parameters. Based on this idea, manifold learning aims to uncover the function in order to explore the latent topological structure of the data.

3 Deep Manifold Learning with Graph Mining

In this section, we propose a novel graph deep model with a non-gradient decision layer for graph mining. The framework is illustrated in Fig. 1.

3.1 Problem Revisited: GCN with Softmax

As is known to us, Softmax has been widely used as a decision layer in lots of classic neural networks. Moreover, on some computer vision tasks such as image classification, these networks also have made excellent progress and can be generally formulated as

\left\{\begin{array}[]{l}\bm{H}^{(a)}=\sigma(f_{n}(\bm{W}^{(a-1)},\bm{H}^{(a-1)}))\\ \bm{\hat{Y}}=softmax(\bm{H}^{(m)})\\ \end{array}\right.,

(3)

where $f_{n}$ is a neural network such as CNN to extract the deep features, $a\in\{0,1,...,m\}$ is the $a$ -th layer in the network, and $\bm{W}^{(a)}$ is the weight matrix in the $a$ -th layer. The latent feature in the $a$ -th layer is defined as $\bm{H}^{(a)}$ and the $\bm{H}^{(0)}$ is equivalent to $\bm{X}$ . $\bm{\hat{Y}}\in R^{n\times c}$ is the prediction label. The latent feature $\bm{H}^{(m)}$ has the same shape with $\bm{\hat{Y}}$ . $\sigma(\cdot)$ is the activation function. Aiming to predict the classes, Softmax maps and normalizes these deep features into real numbers in $[0,1]$ via the formula defined as

\hat{y}_{ij}=\frac{e^{h^{(m)}_{ij}}}{\sum_{j=1}^{c}e^{h^{(m)}_{ij}}}.

(4)

The cross-entropy is employed as the loss function and optimized via gradient descent. Aiming to classify the irregular data such as citation network, [24] defines the graph convolution to extract the graph deep features and also classify them via softmax.

From the Eq. (4), softmax mainly forecasts the probability according to the value of the embedding. However, it ignores the latent topological distribution of the graph embedding. For example, in social networks, people ought to have a high degree of similarity to those with whom they socialize frequently. Therefore, introducing the softmax into GCN for irregular data classification may not get the same achievements as utilizing CNN for the regular data without this topological structure. More importantly, the softmax is optimized via gradient descent with thousands of iteration to obtain approximate solutions, whose time complexity is non-linear with the number of the convolutional layers. Aiming to improve the performance of the GCN, we propose a graph neural network with a non-gradient decision layer.

3.2 Graph-based Deep Model with a Non-Gradient Decision Layer

3.2.1 Unify Orthogonal Manifold and Label Local-structure Preservation

Owing to obtaining the closed-form result via the analytical method, the least square regression can also be a classic method to predict the label in machine learning. Apart from this, the softmax has low complexity and high interpretability. In semi-supervised learning, it can be defined as

\min\limits_{\bm{U},\bm{Y}_{u}}||\bm{X}\bm{U}-[\bm{Y}_{l};\bm{Y}_{u}]||_{F}^{2},

(5)

where $\bm{X}\in R^{n\times d}$ is the feature matrix, $\bm{U}\in R^{d\times c}$ is the projection matrix with $c$ classes, and $[\bm{Y}_{l};\bm{Y}_{u}]=\bm{Y}\in R^{n\times c}$ is a indices matrix including one-hot labeled and unlabeled points. We can easily optimize the projection matrix $\bm{U}$ via taking the derivative of Eq. (5) w.r.t $\bm{U}$ and setting it to 0,

\nabla_{\bm{U}}||\bm{X}\bm{U}-[\bm{Y}_{l};\bm{Y}_{u}]||_{F}^{2}=0.

(6)

Unfortunately, in the semi-supervised or unsupervised task, Eq. (6) may cause the trivial solution. The label rate is greatly small in semi-supervised learning, $i.e.$ , $l\ll u$ . Under this circumstance, when $\bm{U}=[\bm{1}_{d},\bm{0}_{d\times(c-1)}]$ , the trivial solution that $\bm{Y}_{u}=[\bm{1}_{n},\bm{0}_{n\times(c-1)}]$ may be triggered. Apart from the potential trivial solution, directly projecting the feature into the label space may ignore the original data structure.

In order to tackle these problems, we firstly introduce the orthogonal manifold, $i.e.$ , $\bm{U}^{T}\bm{U}=\bm{I}$ , to learn the low-dimensional distribution of the graph data, which can avoid projecting the unlabeled graph embedding into the same class and improve the robustness of the model. Meanwhile, since the structural similarity and local information are often credible (especially in manifold learning), a label local-structure preservation is unified into the learned low-dimensional distribution. It guides the classification via label similarity measured by local structure and the decision layer can be defined as

\min\limits_{\bm{U}^{T}\bm{U}=I,\bm{Y}_{u}}||\bm{X}\bm{U}-[\bm{Y}_{l};\bm{Y}_{u}]||_{F}^{2}\\ +\lambda\sum_{i,j=1}^{n}a_{ij}\|\bm{y}_{i}-\bm{y}_{j}\|_{2}^{2},

(7)

where the element of adjacent matrix $\bm{A}$ is $a_{ij}$ , $\bm{y}_{i}$ is the one-hot label in $\bm{Y}=[\bm{Y}_{l};\bm{Y}_{u}]$ , and $\lambda$ is a trade-off parameter. The Eq. 7 can be directly solved with the proposed Theorem 1 and Theorem 2, which is non-gradient and proved in Section 4.

Theorem 1.

Given a feature matrix $\bm{X}$ and a label matrix $\bm{Y}$ , the problem $\min\limits_{\bm{U}^{T}\bm{U}=I}||\bm{X}\bm{U}-\bm{Y}||_{F}^{2}$ can be solved by $\bm{U}=\bm{S}[:,:c]$ , where $\bm{S}=\bm{E}\bm{F}^{T}$ and $[\bm{E},\sim,\bm{F}^{T}]=svd(\bm{X}\bm{R})$ . $\bm{R}$ is defined as $\bm{R}=[\bm{Y},\bm{X}\bm{V}]$ and $\bm{V}\in R^{d\times(d-c)}$ is generated in the orthogonal complement spaces of $\bm{U}$ .

Theorem 2.

Given a feature matrix $\bm{X}$ , a projection matrix $\bm{U}$ and the labeled matrix $\bm{Y}_{l}$ a label matrix, the problem defined as

\begin{split}\min\limits_{\bm{Y}_{u}}||\bm{X}\bm{U}-[\bm{Y}_{l};\bm{Y}_{u}]||_{F}^{2}+\lambda\sum_{i,j=1}^{n}a_{ij}\|\bm{y}_{i}-\bm{y}_{j}\|_{2}^{2}\end{split}

(8)

can be solved by $\bm{Y}_{u}=(\bm{I}+\lambda\bm{L}_{uu})^{-1}(\bm{X}_{u}\bm{U}-\lambda\bm{L}_{ul}\bm{Y}_{l})$ , where $\bm{X}=[\bm{X}_{l};\bm{X}_{u}]$ , $\bm{D}=diag(\sum_{j=1}a_{ij})$ and Laplacian

\bm{L}=\bm{D}-\bm{A}=\left[\begin{array}[]{cc}\bm{L}_{ll}&\bm{L}_{lu}\\ \bm{L}_{ul}&\bm{L}_{uu}\end{array}\right].

(9)

3.2.2 A Novel Graph Deep Network

Because the orthogonal manifold projection is a linear transform, Eq. (7) has not good performance when facing non-linear problems. As is known to us, the kernel trick and neural network are widely employed to handle the nonlinearity. However, the kernel trick is weak in representation and severely sensitive to the choice of the kernel function. On the contrary, graph convolution not only has an excellent ability in representation learning for irregular data but also successfully mines the complex relationships and interdependence between graph nodes via spectral and spatial operators. Based on this, we finally put forward a novel GCN with a non-gradient decision layer like

\left\{\begin{split}&\bm{H}^{(a)}=\sigma(\varphi(\widetilde{\bm{A}})\bm{H}^{(a-1)}\bm{W}^{(a)}+\bm{b}^{(a)})\\ &\begin{split}\min\limits_{\bm{U}^{T}\bm{U}=I,\bm{Y}_{u}}&||\bm{H}^{(m)}\bm{U}-[\bm{Y}_{l};\bm{Y}_{u}]||_{F}^{2}\\ &+\lambda\sum_{i,j=1}^{n}a_{ij}\|\bm{y}_{i}-\bm{y}_{j}\|_{2}^{2},\\ \end{split}\end{split}\right.

(10)

where the self-loop adjacent matrix is $\widetilde{\bm{A}}=\bm{A}+\bm{I}\in R^{n\times n}$ , and the graph kernel is $\varphi(\bm{A})=\widetilde{\bm{D}}^{-\frac{1}{2}}\widetilde{\bm{A}}\widetilde{\bm{D}}^{-\frac{1}{2}}$ . Moreover, the whole network is jointly optimized via Algorithm 1.

In this network, the non-gradient decision layer not only preserve and utilize the topological information of the deep graph embeddings but also can be solved with the closed-form solutions, which significantly improves the accuracy and accelerates the convergence.

The Merits of the Proposed Model: Compared with the classic softmax decision layer, the proposed model unifies the orthogonal manifold and label local-structure preservation to learn the distribution and topology of the graph nodes, which successfully avoids the trivial solutions and improves the robustness. Moreover, contrasting with optimized via gradient descent with thousands of iterations, our model can be solved with the analytical solutions via the proposed non-gradient theorems. Finally, we propose a novel deep graph convolution network and design a joint optimization strategy.

Algorithm 1 Deep Graph Model with a Non-Gradient Decision Layer

0: data matrix

\bm{X}

, (i.e.,

\bm{H}^{0}

), adjacent matrix

\bm{A}

, one-hot label matrix

\bm{Y}_{l}

, parameter

\lambda

0: the

\bm{Y}_{u}

1: Initialize

\bm{Y}_{u}

with random pseudo label matrix;

2: Calculate the degree matrix

\bm{D}

, the Laplacian matrix

\bm{L}

and the graph kernel

\varphi(\bm{A})

3: for

a=0

m

4: Initialize

\bm{W}^{(a)}

and

\bm{b}^{(a)}

;

5: end for

6: while

not\;Convergence

7: for

a=0

m

\bm{H}^{(a)}\leftarrow\sigma(\varphi(\widetilde{\bm{A}})\bm{H}^{(a-1)}\bm{W}^{(a)}+\bm{b}^{(a)})

;

9: end for

10: repeat

11: Calculate

\bm{U}

by Eq. (17);

12: Update

\bm{Y}_{u}

by Eq. (24);

13: until

Convergence

14: Backward propagation with gradient descent;

15: end while

3.3 Time Complexity

In each iteration of the back propagation, the computaional complexity of graph convolution in Eq. (10) is $O(|\mathcal{E}|d_{a})$ , where $\bm{A}$ is a sparse matrix and the amount of non-zero entries is $|\mathcal{E}|$ . Assume that the proposed deep graph model converges after $t$ iterations. The time complexity of a non-gradient decision layer is $O(u\log u+u^{2}c+ud_{m}c)$ , where $u$ is unlabel nodes. Therefore, the total complexity is $O(|\mathcal{E}|d_{a}+t(ulogu+u^{2}c+ud_{m}c))$ . Considering that the solution of the non-gradient decision layer is independent with gradient and there is no need to optimize the classifier every iteration, the complexity can be approximate to $O(|\mathcal{E}|d_{a})$ .

4 Proof

In this section, we will prove the proposed Theorem 1 and Theorem 2 individually.

4.1 Proof of Theorem 1

Since problem $\min\limits_{\bm{U}^{T}\bm{U}=I}||\bm{X}\bm{U}-\bm{Y}||_{F}^{2}$ is difficult to solve directly, we introduce the matrix $\bm{V}\in R^{d\times(d-c)}$ generated in the orthogonal complement spaces of $\bm{U}$ according to [26]. The problem is equivalent to the following balanced problem

\min\limits_{\bm{U}^{T}\bm{U}=\bm{I}}||\bm{X}[\bm{U},\bm{V}]-[\bm{Y},\bm{X}\bm{V}]||_{F}^{2}.

(11)

To express more conveniently, we define the matrix $\bm{S}=[\bm{U},\bm{V}]$ and problem can be transformed into

\begin{split}\min\limits_{\bm{S}^{T}\bm{S}=\bm{I}}&||\bm{X}\bm{S}-\bm{R}||_{F}^{2}\\ =\min\limits_{\bm{S}^{T}\bm{S}=\bm{I}}&Tr((\bm{X}\bm{S}-\bm{R})^{T}(\bm{X}\bm{S}-\bm{R}))\\ =\min\limits_{\bm{S}^{T}\bm{S}=\bm{I}}&Tr({\bm{X}}^{T}\bm{S}^{T}\bm{S\bm{X}})-2Tr(\bm{S}^{T}\bm{X}\bm{R}),\end{split}

(12)

where $\bm{R}=[\bm{Y},\bm{X}\bm{V}]$ .

Owing to the orthogonal embedding, the problem can be simplified into

\min\limits_{\bm{S}^{T}\bm{S}=\bm{I}}Tr({\bm{X}}^{T}\bm{X})-2Tr(\bm{S}^{T}\bm{X}\bm{R}).

(13)

Notice that the first term in Eq. (13) is equal to a constant under the constraints. Therefore, the problem can be transformed into

\max\limits_{\bm{S}^{T}\bm{S}=\bm{I}}Tr(\bm{S}^{T}\bm{X}\bm{R}).

(14)

The following lemma [27] points out that the transformed problem Eq. (14) has a closed-form solution.

Lemma 1.

For $\bm{Q}$ , $\bm{P}\in{R}^{m\times n}$ where $m>n$ , the problem

\max\limits_{\bm{Q}^{T}\bm{Q}=\bm{I}}Tr(\bm{Q}^{T}\bm{P})

(15)

can be solved by $\bm{Q}=\bm{B}\bm{C}^{T}$ where $[\bm{B},\bm{\Psi},\bm{C}^{T}]=svd(\bm{P})$ , $\bm{B}\in R^{m\times r}$ , $\bm{C}\in R^{n\times r}$ , and $r=rank(\bm{P})$ .

According to Lemma 1, $\bm{S}$ can be calculated by

\bm{S}=\bm{E}\bm{F}^{T},

(16)

where $[\bm{E},\sim,\bm{F}^{T}]=svd(\bm{H}^{(m)}\bm{R})$ . Therefore, the orthogonal matrix is obtained

\bm{U}=\bm{S}[:,:c],

(17)

where $\bm{S}[:,:c]$ means that the first $c$ columns of all the rows of $\bm{S}$ is taken.

TABLE I: Datasets Description

Dataset	Type	Nodes	Edges	Features	Classes	Label Rate
Cora	Citation network	2,708	5,429	1,433	7	5.2%
Citeseer	Citation network	3,327	4,732	3,703	6	3.6%
Pubmed	Citation network	19,717	44,338	500	3	0.3%
Coauthor-CS	Coauthor network	18,333	81,894	6,805	15	1.6%
Coauthor-Phy	Coauthor network	34,496	247,962	8,415	5	57.9%

4.2 Proof of Theorem 2

It is difficult to optimize the second term in problem (8) directly. Therefore, we transform it like

\begin{split}min_{\bm{Y}_{u}}&\sum_{i,j=1}^{n}a_{ij}\|\bm{y}_{i}-\bm{y}_{j}\|_{2}^{2}\\ =&\sum_{i,j=1}^{n}a_{ij}\|\bm{y}_{i}\|_{2}^{2}+\sum_{i,j=1}^{n}a_{ij}\|\bm{y}_{j}\|_{2}^{2}-2\sum_{i,j=1}^{n}a_{ij}y_{ij}^{2}.\\ \end{split}

(18)

Considering that $\bm{D}=diag(\sum_{j}a_{ij})$ , the Eq. (18) is formulated as

\begin{split}&\sum_{i,j=1}^{n}a_{ij}\|\bm{y}_{i}\|_{2}^{2}+\sum_{i,j=1}^{n}a_{ij}\|\bm{y}_{j}\|_{2}^{2}-2\sum_{i,j=1}^{n}a_{ij}y_{ij}^{2}.\\ =&\sum_{i=1}^{n}d_{i}\|\bm{y}_{i}\|_{2}^{2}+\sum_{j=1}^{n}d_{j}\|\bm{y}_{j}\|_{2}^{2}-2\sum_{i,j=1}^{n}a_{ij}y_{ij}^{2}\\ =&2(\sum_{i=1}^{n}d_{i}\|\bm{y}_{i}\|_{2}^{2}-\sum_{i,j=1}^{n}a_{ij}y_{ij}^{2})\\ =&Tr(\bm{Y}^{T}\bm{D}\bm{Y})-Tr(\bm{Y}^{T}\bm{A}\bm{Y})\\ =&Tr([\bm{Y}_{l}^{T},\bm{Y}_{u}^{T}]\bm{L}[\bm{Y}_{l};\bm{Y}_{u}]),\\ \end{split}

(19)

where $d_{i}$ represents the diagonal element in $\bm{D}$ . Furthermore, the inital problem is written as

\begin{split}\min\limits_{\bm{Y}_{u}}\mathcal{J}=&\underbrace{||[\bm{X}_{l};\bm{X}_{u}]\bm{U}-[\bm{Y}_{l};\bm{Y}_{u}]||_{F}^{2}}_{\mathcal{J}_{1}}\\ &+\underbrace{\lambda Tr([\bm{Y}_{l}^{T},\bm{Y}_{u}^{T}][\begin{array}[]{cc}\bm{L}_{ll}&\bm{L}_{lu}\\ \bm{L}_{ul}&\bm{L}_{uu}\end{array}][\bm{Y}_{l};\bm{Y}_{u}])}_{\mathcal{J}_{2}},\end{split}

(20)

where the $\bm{X}_{l}$ and $\bm{X}_{u}$ are the label and unlabel graph feature, respectively.

The first parts in Eq. (20) can be transformed by

\begin{split}\mathcal{J}_{1}&=||[\bm{X}_{l};\bm{X}_{u}]\bm{U}-[\bm{Y}_{l};\bm{Y}_{u}]||_{F}^{2}\\ &=Tr(\bm{U}^{T}(\bm{X}_{l}^{T}\bm{X}_{l}+\bm{X}_{u}^{T}\bm{X}_{u})\bm{U})+Tr(\bm{Y}_{l}^{T}\bm{Y}_{l}+\bm{Y}_{u}^{T}\bm{Y}_{u})\\ &~{}~{}~{}~{}-2Tr(\bm{U}^{T}(\bm{X}_{l}^{T}\bm{Y}_{l}+\bm{X}_{u}^{T}\bm{Y}_{u}))\\ &\Leftrightarrow Tr(\bm{Y}_{u}^{T}\bm{Y}_{u})-2Tr(\bm{U}^{T}\bm{X}_{u}^{T}\bm{Y}_{u}).\end{split}

(21)

And the $\mathcal{J}_{2}$ can also be simplified by

\begin{split}\mathcal{J}_{2}&=\lambda Tr([\bm{Y}_{l}^{T},\bm{Y}_{u}^{T}][\begin{array}[]{cc}\bm{L}_{ll}&\bm{L}_{lu}\\ \bm{L}_{ul}&\bm{L}_{uu}\end{array}][\bm{Y}_{l};\bm{Y}_{u}])\\ &=\lambda Tr((\bm{Y}_{l}^{T}\bm{L}_{ll}+\bm{Y}_{u}^{T}\bm{L}_{ul})\bm{Y}_{l}+(\bm{Y}_{l}^{T}\bm{L}_{lu}+\bm{Y}_{u}^{T}\bm{L}_{uu})\bm{Y}_{u})\\ &\Leftrightarrow\lambda Tr(\bm{Y}_{u}^{T}\bm{L}_{ul}\bm{Y}_{l}+(\bm{Y}_{l}^{T}\bm{L}_{lu}+\bm{Y}_{u}^{T}\bm{L}_{uu})\bm{Y}_{u})\\ &\Leftrightarrow\lambda Tr(\bm{Y}_{u}^{T}\bm{L}_{uu}\bm{Y}_{u})+2\lambda Tr(\bm{Y}_{u}^{T}\bm{L}_{ul}\bm{Y}_{l}),\end{split}

(22)

with $\bm{L}_{lu}=\bm{L}_{ul}^{T}$ .

Combing the sub-problems Eq. (21) and Eq. (22), the primal problem can be define as

\begin{split}\mathcal{J}=&Tr(\bm{Y}_{u}^{T}\bm{Y}_{u})-2tr(\bm{U}^{T}\bm{X}_{u}^{T}\bm{Y}_{u})\\ &+\lambda Tr(\bm{Y}_{u}^{T}\bm{L}_{uu}\bm{Y}_{u})+2\lambda Tr(\bm{Y}_{u}^{T}\bm{L}_{ul}\bm{Y}_{l}).\end{split}

(23)

Owing to a constraint on $\bm{Y}_{u}$ , the problem Eq. (23) is derivated w.r.t $\bm{Y}_{u}$ and set to 0. Then, we have

\begin{split}\bm{Y}_{u}=(\bm{I}+\lambda\bm{L}_{uu})^{-1}(\bm{X}_{u}\bm{U}-\lambda\bm{L}_{ul}\bm{Y}_{l}).\end{split}

(24)

4.3 Proof of Lemma 1

Having be decomposed with SVD, we can obtain the $[\bm{B},\bm{\Psi},\bm{C}^{T}]=svd(\bm{P})$ . Note that

\begin{split}Tr(\bm{Q}^{T}\bm{P})&=Tr(\bm{Q}^{T}\bm{B}\bm{\Psi}\bm{C}^{T})=Tr(\bm{C}^{T}\bm{Q}^{T}\bm{B}\bm{\Psi})\\ &=Tr(\bm{K}\bm{\Psi})=\sum_{i=1}^{n}k_{ii}\psi_{ii}\end{split}.

(25)

where $\bm{K}=\bm{C}^{T}\bm{Q}^{T}\bm{B}$ . Clearly, $\bm{K}\bm{K}^{T}=\bm{I}$ such that $k_{ij}\leq 1$ . Hence, we have

Tr(\bm{K}\bm{\Psi})\leq\sum_{i=1}^{n}\psi_{ii}.

(26)

We can simply set $\bm{K}=\bm{C}^{T}\bm{Q}^{T}\bm{B}=\bm{I}_{r}$ . In other words,

\bm{Q}=\bm{B}\bm{C}^{T}.

(27)

Consequently, the lemma is proved.

TABLE II: Semi-supervised classification accuracy (%) on the benchmark dataset.

	Input	CORA	CITESEER	PUBMED	Coauthor-CS	Coauthor-Phy
DeepEmbedding	$\bm{X}$ , $\bm{Y}_{l}$ , $\widetilde{\bm{A}}$	59.00 $\pm$ 1.72	59.60 $\pm$ 2.29	71.10 $\pm$ 1.37	70.93 $\pm$ 0.55	85.77 $\pm$ 0.08
GAE	$\bm{X}$ , $\bm{Y}_{l}$ , $\widetilde{\bm{A}}$	71.50 $\pm$ 0.02	65.80 $\pm$ 0.02	72.10 $\pm$ 0.01	56.84 $\pm$ 0.43	66.20 $\pm$ 0.20
GraphEmbedding	$\bm{X}$ , $\bm{Y}_{l}$	75.70 $\pm$ 3.83	64.70 $\pm$ 1.11	77.20 $\pm$ 4.38	28.91 $\pm$ 0.02	50.32 $\pm$ 1.11
GCN	$\bm{X}$ , $\bm{Y}_{l}$ , $\widetilde{\bm{A}}$	81.50 $\pm$ 0.50	70.30 $\pm$ 0.50	78.33 $\pm$ 0.70	78.12 $\pm$ 3.42	92.64 $\pm$ 3.80
DGI	$\bm{X}$ , $\bm{Y}_{l}$	82.30 $\pm$ 0.60	71.80 $\pm$ 0.70	76.80 $\pm$ 0.60	84.34 $\pm$ 0.09	92.54 $\pm$ 0.03
PFS-LS	$\bm{X}$ , $\bm{Y}_{l}$	70.09 $\pm$ 1.38	43.23 $\pm$ 1.02	74.42 $\pm$ 14.2	58.80 $\pm$ 14.91	91.16 $\pm$ 6.55
APPNP	$\bm{X}$ , $\bm{Y}_{l}$ , $\widetilde{\bm{A}}$	85.09 $\pm$ 0.25	71.93 $\pm$ 2.00	79.73 $\pm$ 0.31	91.52 $\pm$ 0.13	95.56 $\pm$ 0.48
GCN-LPA	$\bm{X}$ , $\bm{Y}_{l}$ , $\widetilde{\bm{A}}$	77.28 $\pm$ 0.16	70.72 $\pm$ 0.16	73.63 $\pm$ 0.40	87.63 $\pm$ 0.38	95.80 $\pm$ 0.03
MulGraph	$\bm{X}$ , $\bm{Y}_{l}$ , $\widetilde{\bm{A}}$	86.80 $\pm$ 0.50	73.30 $\pm$ 0.50	80.10 $\pm$ 0.70	88.36 $\pm$ 0.09	94.30 $\pm$ 0.01
Ours	$\bm{X}$ , $\bm{Y}_{l}$ , $\widetilde{\bm{A}}$	85.65 $\pm$ 0.76	74.80 $\pm$ 0.41	80.87 $\pm$ 0.17	91.60 $\pm$ 0.23	95.89 $\pm$ 0.09

5 Experiment

To verify the performance of the proposed model, in this section we will compare the proposed model with seven state-of-the-art methods on citation network benchmark datasets.

5.1 Benchmark Dataset Description

We employ citation networks and coauthor networks to evaluate the performance of models:

Citation Networks: The three classic citation networks involved in our experiments are Cora, Citeseer, and Pubmed, which are closely following the experimental setup in [18]. In these networks, the nodes are articles and the edge reflects the citations between these articles.

Coauthor Networks: The two co-authorship networks [29], Coauthor-CS and Coauthor-Phy, are also used for evaluation. In these networks, the nodes are authors and the edge suggests that two co-authored a paper. These features represent the keywords of the author’s papers. Besides, the coauthor networks are split according to [30].

The detail of these datasets is summarized in TABLE I.

5.2 Experiment Setting

The ratio of labeled data for training is divided according to TABLE I. Like the [18], we employ the extra 500 labeled nodes as a valid set to optimize the hyperparameters including the hidden units in each layer, weight coefficient $\lambda$ in Eq. (10), dropout, and learning rate.

Before classification, the adjacent and feature matrix are accordingly (row-)normalized to the range of $[0,1]$ . The two graph convolution layers are used as a deep graph feature extractor. For the hidden units in each layer, dropout rate, learning rate, and weight coefficient $\lambda$ , we choose values of parameters from $\left\{10,32,64,128,256,512\right\}$ , $\left\{0.1,0.3,0.5,0.7,0.9\right\}$ , $\left\{10^{-3},10^{-2},10^{-1},...,10^{2},10^{3}\right\}$ and $\left\{2^{-3},2^{-2},2^{-1},...,2^{2},2^{3}\right\}$ respectively via 4-fold cross-validation and record the best records. The initialization described in [31] is employed to initialize the network weights. We train the model for a maximum of 500 training iterations with batch gradient descent. After tuning the hyperparameters, we adopt the adam optimizer with a learning rate of 0.01 and optimize the orthogonal manifold classifier every 50 iterations. Accuracy is employed to evaluate the performance of the model.

5.3 Comparsion with Methods

We compare the proposed model with eight state-of-the-art semi-supervised models including Graph neural networks with personalized pagerank ( $APPNP$ ) [32], Unifying GCN and label propagation ( $GCN$ - $LPA$ ) [30], Multi-graph representation learning ( $MulGraph$ ) [33], Deep graph infomax ( $DGI$ ) [34],Parameter-Free Similarity of Label and Side Information ( $PFS$ - $LS$ ) [35], Graph Convolution Network ( $GCN$ ) [24], Variational Graph Auto-Encoder ( $GAE$ ) [36], Semi-supervised learning with Graph Embedding ( $GraphEmbedding$ )[18] and Deep learning via semi-supervised embedding ( $DeepEmbedding$ ) [37]. Among them, $PFS$ - $LS$ is a non-GNN model.

To be fair, we utilize the same experiment setting in section 5.2 train these comparison models and tune the correlated hyper-parameters. For classification models based graph, we utilize the self-loop adjacent $\widetilde{\bm{A}}$ to train as the same as in Eq. [10]. The models mentioned above are evaluated 10 times on the benchmark datasets. The accuracy and standard deviation are reported in TABLE II. Therefore, we could conclude that:

1)

The proposed model has achieved state-of-the-art accuracy with respect to the comparison methods. For example, on Citeseer and Pubmed dataset, we achieve $74.8\%$ and $80.87\%$ accuracy respectively. The scores are higher than the other methods, which indicates that the proposed model could mine the conceal relationship among points and work well on classification tasks with few labeled data.
2)

When dealing with the datasets with high dimensions, huge points, and low labeling ratios like Pubmed and Coauthor-CS, it has a $1\%$ relative improvement over previous state-of-the-art. Besides, the proposed model has a good performance on Coauthor-Phy which has nearly 2.5 million edges and complex topological relationships.
3)

The proposed model is more successful and reasonable to integrate the graph topological structure with the label information than some state-of-the-art models like GCN-LPA, APPNP, and MulGraph. Among them, GCN-LPA utilizes label propagation to learn the edge weights of graphs. Compared with these models, ours integrates manifold learning with label local-structure as a non-gradient classifier. In this way, the deep graph networks could extract meaningful features to contribute to the decision layer.

The proposed model and comparison models are all implemented with PyTorch 1.2.0 on Windows 10 PC.

5.4 Convergence & Visualization

Fig. 3 shows the results on each benchmark datasets. It suggests that the proposed model converges rapidly on each dataset with $\lambda=1$ during the training step. The decision layer (defined in Eq. (7)) is optimized after every 100, 50, and 80 backpropagation on each dataset respectively, which contributes to the objective value dropped sharply and accelerates the training of the proposed model. Besides, we investigate the impacts of the dropout in graph network and parameter $\lambda$ (defined in Eq. (10)) on the model. The dropout ratio and $\lambda$ value are chosen like in Sec. 5.2. Convergence Epoch and Accuracy are utilized to evaluate the training speed and classification performance of the model. The result is shown in Fig. 4. From the results, we notice that the Convergence Epoch varies considerably with the $\lambda$ changing. On the contrary, Accuracy is not sensitive to weight value. In conclusion, the $\lambda$ mainly has an impact on the convergence rate of the model, and the proper dropout ratio could enhance the performance of the proposed model via discarding some neurons during the training.

Apart from that, we utilize the t-SNE [28] to reduce the deep graph embedding $\bm{H}^{(m)}$ (defined in Eq. (10)) to 2D. Before visualize, these embeddings are not normalized and the dimension is not reduced via PCA and normalized. The perplexity of t-SNE is set to 30. And the visualization of the reduction results in Fig. 2. We notice that the topological data is not preserved well at the beginning of the training. When the algorithm has converged, the distribution of the deep graph embedding is evenly and could be easy to classify.

TABLE III: Ablation Study on Benchmark Dataset.

	Cora	Citeseer	Pubmed
GCN-Softmax	78.99 $\pm$ 0.61	70.39 $\pm$ 0.58	78.39 $\pm$ 0.61
GCN-LP	79.70 $\pm$ 0.52	71.52 $\pm$ 0.28	78.67 $\pm$ 0.47
GCN-OM	83.91 $\pm$ 0.23	73.71 $\pm$ 0.16	79.82 $\pm$ 0.72
OURS	85.65 $\pm$ 0.76	74.80 $\pm$ 0.41	80.87 $\pm$ 0.17

5.5 Ablation Study

We conduct an ablation study to evaluate how each part of the proposed non-gradient graph layer, Orthogonal Manifold (OM) and Local-structure Preservation (LP), contribute to overall model performances. Besides, GCN with a softmax classification strategy is employed as a baseline. The setting and results are summarized in Table III. We confirm that graph mining methods, orthogonal manifold learning and local-structure preservation, can both learn the relationship knowledge among the deep embedding. Besides, it is shown that the accuracy of GCN-OR and GCN-LP are both superior to traditional GCN-softmax. Moreover, the proposed deep model with a non-decision layer not only properly unifies the OM and LP parts but also successfully utilizes the topological structure and label information to enhance the performance.

6 Conclusion

In this paper, we propose a deep graph model with a non-gradient decision layer. Unifying the orthogonal manifold and label local-structure preservation, the deep graph model successfully learns the topological knowledge of the deep embedding. Furthermore, compared with softmax optimized via gradient descent, the non-gradient layer can be solved with the analytical solutions via the proposed theorems. On the benchmark datasets, the proposed model can achieve higher accuracy over the previous state-of-the-art works.

References

[1] Xiaojin Zhu and Andrew B Goldberg, “Introduction to semi-supervised learning,” Synthesis lectures on artificial intelligence and machine learning, vol. 3, no. 1, pp. 1–130, 2009.
[2] Xiaojin Jerry Zhu, “Semi-supervised learning literature survey,” Tech. Rep., University of Wisconsin-Madison Department of Computer Sciences, 2005.
[3] Zhi-Hua Zhou and Ming Li, “Semi-supervised regression with co-training.,” in IJCAI, 2005, vol. 5, pp. 908–913.
[4] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin, “A co-regularization approach to semi-supervised learning with multiple views,” in Proceedings of ICML workshop on learning with multiple views. Citeseer, 2005, vol. 2005, pp. 74–79.
[5] Zhi-Hua Zhou and Ming Li, “Semisupervised regression with cotraining-style algorithms,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 11, pp. 1479–1493, 2007.
[6] Ching-Hao Mao, Hahn-Ming Lee, Devi Parikh, Tsuhan Chen, and Si-Yu Huang, “Semi-supervised co-training and active learning based approach for multi-view intrusion detection,” in Proceedings of the 2009 ACM symposium on Applied Computing, 2009, pp. 2042–2048.
[7] X. Zhu, “Learning from labeled and unlabeled data with label propagation,” Tech Report, 2002.
[8] Tianhang Long, Junbin Gao, Mingyan Yang, Yongli Hu, and Baocai Yin, “Locality preserving projection via deep neural network,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[9] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum, “Label propagation for deep semi-supervised learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 5070–5079.
[10] Mingchen Gao, Ziyue Xu, Le Lu, Aaron Wu, Isabella Nogues, Ronald M Summers, and Daniel J Mollura, “Segmentation label propagation using deep convolutional neural networks and dense conditional random field,” in 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI). IEEE, 2016, pp. 1265–1268.
[11] Y. Yi, Y. Chen, J. Dai, X. Gui, C. Chen, Gang Lei, and W. Wang, “Semi-supervised ridge regression with adaptive graph-based label propagation,” Applied Sciences, vol. 8, no. 12, 2018.
[12] Antonio Criminisi, Jamie Shotton, Ender Konukoglu, et al., “Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning,” 2012.
[13] Ratthachat Chatpatanasiri and Boonserm Kijsirikul, “A unified semi-supervised dimensionality reduction framework for manifold learning,” Neurocomputing, vol. 73, no. 10-12, pp. 1631–1640, 2010.
[14] “A generalized power iteration method for solving quadratic problem on the stiefel manifold,” Science China(Information Sciences), 2017.
[15] Rahul Ragesh, Sundararajan Sellamanickam, Vijay Lingam, and Arun Iyer, “A graph convolutional network composition framework for semi-supervised classification,” 2020.
[16] Hande Dong, Jiawei Chen, Fuli Feng, Xiangnan He, Shuxian Bi, Zhaolin Ding, and Peng Cui, “On the equivalence of decoupled graph convolution network and label propagation,” 2021.
[17] Liang Yang, Fan Wu, Yingkui Wang, Junhua Gu, and Yuanfang Guo, “Masked graph convolutional network,” in Twenty-Eighth International Joint Conference on Artificial Intelligence IJCAI-19, 2019.
[18] Zhilin Yang, William Cohen, and Ruslan Salakhudinov, “Revisiting semi-supervised learning with graph embeddings,” in International conference on machine learning. PMLR, 2016, pp. 40–48.
[19] Han Yang, Kaili Ma, and James Cheng, “Rethinking graph regularization for graph neural networks,” 2020.
[20] Krishna Kumar P., Paul Langton, and Wolfgang Gatterbauer, “Factorized graph representations for semi-supervised learning from sparse data,” 2020.
[21] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
[22] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in neural information processing systems, 2016, pp. 3844–3852.
[23] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
[24] Thomas N Kipf and Max Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[25] Lawrence Cayton, “Algorithms for manifold learning,” Univ. of California at San Diego Tech. Rep, vol. 12, no. 1-17, pp. 1, 2005.
[26] Rui Zhang, Feiping Nie, and Xuelong Li, “Feature selection under regularized orthogonal least square regression with optimal scaling,” Neurocomputing, vol. 273, no. jan.17, pp. 547–553, 2017.
[27] R. Zhang, H. Zhang, and X. Li, “Robust multi-task learning with flexible manifold constraint,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
[28] Van Der Maaten Laurens and Geoffrey Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. 2605, pp. 2579–2605, 2008.
[29] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann, “Pitfalls of graph neural network evaluation,” 2018.
[30] Hongwei Wang and Jure Leskovec, “Unifying graph convolutional neural networks and label propagation,” 2020.
[31] Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.
[32] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann, “Predict then propagate: Graph neural networks meet personalized pagerank,” 2019.
[33] Kaveh Hassani and Amir Hosein Khasahmadi, “Contrastive multi-view representation learning on graphs,” arXiv preprint arXiv:2006.05582, 2020.
[34] Petar Velickovic, William Fedus, William L Hamilton, Pietro Lio, Yoshua Bengio, and R Devon Hjelm, “Deep graph infomax.,” in ICLR (Poster), 2019.
[35] R. Zhang, F. Nie, and X. Li, “Semisupervised learning with parameter-free similarity of label and side information,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 2, pp. 405–414, 2019.
[36] Thomas N Kipf and Max Welling, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
[37] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert, “Deep learning via semi-supervised embedding,” in Neural networks: Tricks of the trade, pp. 639–655. Springer, 2012.