
Non-Graph Data Clustering via $\mathcal{O}(n)$ Bipartite Graph Convolution

Hongyuan Zhang, Jiankun Shi, Rui Zhang, and Xuelong Li (corresponding author). This work is supported by The National Natural Science Foundation of China (No. 61871470). The authors are with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, P.R. China. ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. E-mail: [email protected], [email protected], [email protected], [email protected].
Abstract

Since the representative capacity of graph-based clustering methods is usually limited by the graph constructed on the original features, it is attractive to explore whether graph neural networks (GNNs), a strong extension of neural networks to graphs, can be applied to augment the capacity of graph-based clustering methods. The core problems mainly come from two aspects. On the one hand, the graph is unavailable in most general clustering scenes, so that how to construct a graph on the non-graph data and the quality of the constructed graph are usually the most important parts. On the other hand, given $n$ samples, graph-based clustering methods usually consume at least $\mathcal{O}(n^{2})$ time to build the graph, and the graph convolution requires nearly $\mathcal{O}(n^{2})$ time for a dense graph and $\mathcal{O}(|\mathcal{E}|)$ for a sparse one with $|\mathcal{E}|$ edges. Accordingly, both graph-based clustering and GNNs suffer from a severe inefficiency problem. To tackle these problems, we propose a novel clustering method, AnchorGAE, with a self-supervised estimation of the graph and an efficient graph convolution. We first show how to convert a non-graph dataset into a graph dataset by introducing a generative graph model and anchors. A bipartite graph is built via generating anchors and estimating the connectivity distributions of original points and anchors. We then show that the constructed bipartite graph can reduce the computational complexity of graph convolution from $\mathcal{O}(n^{2})$ and $\mathcal{O}(|\mathcal{E}|)$ to $\mathcal{O}(n)$. The succeeding steps for clustering can be easily designed as $\mathcal{O}(n)$ operations. Interestingly, the anchors naturally lead to a siamese architecture with the help of the Markov process. Furthermore, the estimated bipartite graph is updated dynamically according to the features extracted by the GNN modules, to promote the quality of the graph by exploiting the high-level information extracted by GNNs. However, we theoretically prove that this self-supervised paradigm frequently results in a collapse, which often occurs after 2-3 update iterations in experiments, especially when the model is well-trained. A specific strategy is accordingly designed to prevent the collapse. The experiments support the theoretical analysis and show the superiority of AnchorGAE.

Index Terms:
Graph Convolution Network, Efficient Clustering, Anchors, Siamese Network, Self-Supervised Learning.

1 Introduction

Graph-based clustering methods [1, 2, 3, 4, 5, 6] are important in the clustering field since they can capture the data topology and group data points non-linearly via constructing a graph. For instance, spectral clustering [1] originates from a relaxed graph-cut problem, e.g., Ratio Cut [2], Normalized Cut [3], Balanced Cut [7], etc. In graph-based methods, it is easier to formally study diverse clustering-related tasks with the help of graph theory [8].

Given $n$ data points, graph-based clustering methods suffer from severe inefficiency in practice due to the construction of graphs, which usually needs $\mathcal{O}(n^{2})$ time without any acceleration tricks. For some methods such as spectral clustering [1], the eigenvalue decomposition, an $\mathcal{O}(n^{3})$ operation, is also required. To alleviate this problem, plenty of variants of spectral clustering have been developed [9, 10, 11]. An important variant is to introduce some representative points [12] to accelerate both graph construction and optimization [13, 14]. Some works [15, 16] recently attempt to extend these methods to multi-view scenarios. However, the graph constructed from the original features may only contain low-level information (i.e., the representative capacity is limited), since the construction is usually based on the original features without any mapping (e.g., neural networks). The limited ability to extract the high-level relationships of data results in inaccurate graphs and thus becomes the bottleneck of graph-based clustering methods.

With the rise of deep learning, the utilization of deep neural networks (DNN) is widely regarded as an effective way to exploit deep information of data via extracting latent features [17]. Some models [18, 19, 20] based on convolutional neural networks (CNN) have achieved impressive improvements on clustering, but CNN cannot be applied to non-Euclidean data [21], such as social networks, recommendation systems, word vectors, data mining data, etc. Compared with Euclidean data (e.g., images, videos, etc.), the most significant difference of non-Euclidean data is that the relationships of features are not regular (or even do not exist). This indicates that the classical convolution operation cannot be utilized, and the existing CNN-based clustering models cannot be applied to general clustering scenes. Although some deep clustering models [22, 23, 24, 18, 19] have been proposed, most methods aim to introduce some widely used modules from supervised scenarios to clustering, and few models focus on how to utilize deep models to further promote the capacity of graph-based methods. SpectralNet [25] is an important attempt to connect neural networks and classical spectral clustering. It utilizes neural networks to promote capacity and simulate spectral decomposition.

Figure 1: The core idea of AnchorGAE. First, we iteratively calculate the connectivity distribution and anchors. Then they are used as the input of a siamese GNN for acceleration. After training the network, the anchors and connectivity distribution are updated according to the embedding. The two steps are repeated multiple times until the high-level information is exploited.

Graph neural networks (GNNs) [26, 27, 28, 29, 30, 31], which attempt to extend classical neural networks to graph-type data while considering the graph structure, have attracted much attention. GNN is designed for graph-type data, such as social networks, recommendation systems, citation networks, etc. In these practical scenarios, links are easy and cheap to obtain, and graphs are provided as prior information. For example, in citation networks, each paper is viewed as a node in the graph and the citation relationships, which are easily available, are modeled as links. As GNNs have shown impressive capacity and performance on graph-type data, it is attractive to employ GNNs to promote the capacity of graph-based clustering methods. Nevertheless, there are two major hurdles. On the one hand, in most scenarios (e.g., text, signals, etc.), graphs are not provided as prior information, which indicates that the existing GNN-based models for node clustering cannot be applied to these non-graph data without the artificial construction of graphs. This goes back to the same dilemma as graph-based clustering, i.e., how to construct a graph containing enough information. On the other hand, inefficiency is also a well-known problem for GNNs [32, 33]. Forward propagation of each node in a GNN layer relies on neighboring samples, such that the computational complexity of the graph convolution increases exponentially with the growth of layers. For a dense graph, the complexity of a multi-layer GNN is almost $\mathcal{O}(n^{2})$. For a sparse one, the computational cost becomes $\mathcal{O}(|\mathcal{E}|)$, where $|\mathcal{E}|$ represents the number of edges. Accordingly, each training step approximately depends on all data points and the training is quite time-consuming. It also results in the unavailability of batch gradient descent. To accelerate GNNs, some works focus on how to apply batch gradient descent [34, 35, 36, 32]. A similar inefficiency problem also exists in graph-based clustering.

The main motivation of this paper is how to efficiently improve the capacity of graph-based clustering via GNNs on non-graph data. The efficiency guarantees the practicability, and the non-graph property of the data ensures the universality of our model. Although SDCN [37] aims to extend GNNs to the data clustering task, it simply constructs KNN graphs and thus fails to exploit the high-level information of data. The major contributions of the proposed model, namely AnchorGAE, are listed as follows:

  • As no graph is provided in general, we aim to construct the graph via estimating the connectivity distribution of each sample from the perspective of generative models. Under this generative perspective, anchors are introduced to accelerate the graph construction and the computation of GNNs. The produced anchors and connectivity distributions lead to a bipartite graph convolution operation so that the computational complexity of each forward propagation of GNNs is reduced from $\mathcal{O}(n^{2})$ or $\mathcal{O}(|\mathcal{E}|)$ to $\mathcal{O}(n)$.

  • With the help of the Markov process and the reconstruction goal provided by generative graph models, a siamese architecture is naturally designed to transform anchors into the embedding space.

  • We theoretically prove that a collapse, where points are split into plenty of tiny groups and these groups scatter randomly, may happen due to the self-supervised update of the graph (and connectivity distributions), especially when the model is well-trained. We therefore devise a specific strategy to update the graph to prevent the collapse. The experiments also support the theoretical analysis and verify the effectiveness of the designed strategy.

2 Preliminary and Related Work

In this section, we introduce the existing work about graph neural networks and fast clustering algorithms.

2.1 Preliminary

Let $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{W})$ be a graph where $\mathcal{V}$, $\mathcal{E}$, and $\mathcal{W}$ represent nodes, edges, and weights, respectively. For an unweighted graph, $\mathcal{W}$ can be ignored since the weights of edges are all regarded as 1. All matrices and vectors are represented by upper-case letters and lower-case letters in boldface, respectively. $|\cdot|$ represents the size of a set and $(\cdot)_{+}=\max(\cdot,0)$. For a real number $x$, $\lfloor x\rfloor$ represents the maximum integer that is not larger than $x$. $\textbf{1}_{n}$ denotes an $n$-dimensional vector whose entries all equal 1. In computer science, the adjacency matrix, which is denoted by $\bm{A}$ in this paper, is a common way to represent the edges of a graph. Specifically, for an unweighted graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, $A_{ij}=1$ if $\langle v_{i},v_{j}\rangle\in\mathcal{E}$ and $A_{ij}=0$ otherwise. A popular GNN model is the graph convolution network (GCN) [28], where the graph convolution layer is formulated as [28]

\bm{H}=\varphi(\bm{L}\bm{X}\bm{W}), \quad (1)

where $\varphi(\cdot)$ is an activation function, $\bm{X}\in\mathbb{R}^{n\times d}$ is the representation of data points, $\bm{W}\in\mathbb{R}^{d\times d^{\prime}}$ denotes the coefficients to learn, $\bm{L}=\bm{D}^{-\frac{1}{2}}(\bm{A}+\bm{I})\bm{D}^{-\frac{1}{2}}$ is often viewed as a variant of the normalized Laplacian matrix, and $\bm{D}$ is the degree matrix of $(\bm{A}+\bm{I})$. Specifically, $\bm{D}$ is a diagonal matrix with $D_{ii}=\sum_{j=1}^{n}(A_{ij}+1)$.
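As a concrete illustration of Eq. (1), a minimal NumPy sketch of one GCN layer is given below; the function name, the dense-matrix representation, and the choice of activation are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def gcn_layer(A, X, W, activation=np.tanh):
    """One GCN layer H = phi(L X W) as in Eq. (1), with
    L = D^{-1/2} (A + I) D^{-1/2} and D the degree matrix of A + I."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                      # add self-loops
    deg = A_hat.sum(axis=1)                    # D_ii = sum_j (A_ij + 1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = D_inv_sqrt @ A_hat @ D_inv_sqrt        # normalized propagation matrix
    return activation(L @ X @ W)
```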

2.2 Graph Neural Networks

In recent years, how to effectively extract features of graphs has attracted a lot of attention [28, 31]. Inspired by CNN, a few works [27, 38, 28] attempt to investigate convolution operations on graph-type data. With the rise of the attention mechanism, GAT [29, 39, 40] introduces self-attention into graph neural networks. Meanwhile, GNNs can be applied to sequential tasks [41, 30], which is known as the spatial-temporal graph neural network (STGNN). Some theoretical works [42, 43, 44, 45, 46, 47] have also been investigated in recent years. Some works [46] study the impact of over-smoothing, while [47] proposes over-squashing. To extend GCN [28] to unsupervised learning, graph auto-encoders [48, 49, 50, 51, 52], based on the auto-encoder [53, 54], have been developed. Different from traditional auto-encoders, GCN [28] is used as the encoder and the decoder calculates inner products of the embedding to reconstruct the graph. To build a symmetric architecture, GALA [50] designs a sharpening operator since the graph convolution is equivalent to a smoothing operator. SDCN [37] has tried to utilize GNNs for clustering, while it just constructs a KNN graph as preprocessing and suffers from the inefficiency problem as well. Although GNNs have achieved outstanding performance [28, 55], they require too much memory and time. Some methods [33, 34, 36, 35, 56] for accelerating training and reducing memory consumption have been proposed. For instance, SGC [35] attempts to accelerate training by removing non-linear activations in GCN, which reduces the number of parameters. More importantly, it compresses multiple layers into one layer such that stochastic gradient descent can be used. GraphSAGE [33] attempts to sample part of the neighbors of each node to reduce the computational complexity, while FastGCN [32] samples nodes independently. ClusterGCN [34] tries to find an approximate graph with multiple connected components via METIS [57] to employ batch gradient descent.

2.3 Clustering on Large Scale Data

In traditional clustering, $k$-means [58] works efficiently on large-scale data, but it can only deal with spherical data, which limits its application. By virtue of deep learning, deep clustering [22, 17, 23, 18, 19, 59] often aims to learn expressive representations and then feed the representations into a simple clustering module. The whole optimization is based on batch gradient descent so that these methods are still available on large-scale data. However, they frequently suffer from over-fitting, and the performance heavily relies on the tuning of neural networks. For graph-based clustering methods [3, 2, 1], both the computational complexity and the space complexity of graph construction are $\mathcal{O}(n^{2})$. In particular, solving spectral clustering needs eigenvalue decomposition, which is also time-consuming. To apply spectral clustering to large-scale data, a lot of variants [12, 9, 10, 11] have been developed. Anchor-based methods [12, 13, 14] are important extensions of spectral clustering. The core idea is to construct a graph implicitly via some representative points, which are usually named anchors. By constructing a bipartite graph between samples and anchors, they only need to perform singular value decomposition on the Laplacian of the bipartite graph [13]. Anchor-based spectral clustering provides an elegant scheme to improve the efficiency of graph-based models.

Remark 1.

Although several CNN-based clustering methods [19, 18, 20] have achieved good performance, they focus on vision data and their main improvement comes from the classical convolution operator, so that they cannot be applied to non-Euclidean data, e.g., word vectors, data mining data, etc. Therefore, the CNN-based methods are not used as the main competitors in this paper. It should be emphasized that CNN-based methods and AnchorGAE are not mutually exclusive. AnchorGAE aims to find a general paradigm to improve the clustering baseline under the most general settings. It may be an attractive topic in the future to extend the core idea of AnchorGAE to vision-specific clustering by combining CNNs.

3 Methodology

In this section, we first revisit the main problems that we attempt to address in this paper. In Section 3.2, we show how to construct a bipartite graph on the non-graph data from a generative perspective. The probability perspective is vital to the derivation of the fast bipartite graph convolution and the reconstruction goal of the unsupervised model. In Section 3.3, the core idea of AnchorGAE and its efficiency are elaborated. In Section 3.4, we will find that a siamese structure is naturally induced by the bipartite graph convolution. The rigorous analysis of the degeneration caused by the simple update of graphs is provided in Section 3.5. Finally, the methods to obtain the final clustering assignments are discussed. The core idea of AnchorGAE is illustrated in Figure 1.

3.1 Revisit the Critical Problems

Before elaborating on how to efficiently improve the capacity of graph-based clustering via GNNs on non-graph data, we introduce the major hurdles respectively.

No Prior Graph: In clustering scenarios, only features are given as prior information. The graph is usually constructed as preprocessing in graph-based clustering methods. The key to performance is the quality of the constructed graph. When the scattering of data is too complicated to precisely capture the key information based on the original data (which is the main bottleneck of most existing graph-based clustering methods), it is a rational scheme to map the data into a latent space and then build a more precise graph in this space. As we focus on graph-based models, GNNs are a more rational scheme compared with classical neural networks. Moreover, the utilization of the estimated graph (via GNNs) to update the graph leads to a self-supervised procedure. The graph construction usually results in $\mathcal{O}(n^{2})$ costs, which also causes the inefficiency of graph-based clustering methods. The scheme to solve this problem is elaborated in Section 3.2.

Inefficiency of GNNs: In the forward propagation of GNNs, computing the hidden feature of one node usually requires its neighboring nodes. This computation results in the inefficiency of GNNs on large graphs [60], especially with the increase of depth. In this paper, we simply focus on GCN, the widely used variant of GNN, since the theoretical analysis does not depend on specific GNN modules. For instance, it requires $\mathcal{O}(|\mathcal{E}|d)$ time to compute $\bm{L}\bm{X}$ in the forward propagation. If $\mathcal{G}$ is sparse enough, the computational complexity is acceptable. If there are plenty of edges in $\mathcal{G}$, the computational complexity can be roughly regarded as $\mathcal{O}(n^{2}d)$, so that GCNs cannot be applied in this case. We show how to reduce it to $\mathcal{O}(n)$ in Sections 3.3 and 3.4.

Inefficiency of Graph-Based Clustering Methods: Even after the $\mathcal{O}(n^{2})$ graph construction, the computational complexity reaches $\mathcal{O}(n^{3})$ due to the eigenvalue decomposition. The problem is easy to address after introducing the $\mathcal{O}(n)$ bipartite graph convolution, and the details can be found in Section 3.6.

3.2 Generative Perspective for Bipartite Graphs

By virtue of the generative perspective introduced in this section, the probability transition matrix is accordingly built so that the Markov process can be used to implement the efficient bipartite graph convolution, which is elaborated in Section 3.3. The generative perspective also provides a training goal, reconstructing the distribution, for AnchorGAE.

3.2.1 Generative Perspective for Graphs

For a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, an edge $\langle v_{i},v_{j}\rangle\in\mathcal{E}$ can be regarded as a sampling from an underlying connectivity distribution $p(\cdot|v_{i})$ of $v_{i}$, where $v_{i}$ denotes the $i$-th node. Note that $\sum_{j=1}^{n}p(v_{j}|v_{i})=1$ where $n=|\mathcal{V}|$. Accordingly, a graph can be viewed as multiple samplings from the connectivity distributions $\{p(\cdot|v_{i})\}_{i=1}^{n}$. Before the further discussion of weighted graphs, we clarify the difference between the generative perspective and the discriminative perspective. In the discriminative view, the existence of an edge is certain and the model should train a classifier to predict $p(\langle v_{i},v_{j}\rangle\in\mathcal{E})$ and $p(\langle v_{i},v_{j}\rangle\notin\mathcal{E})$. Note that $\sum_{j}p(\langle v_{i},v_{j}\rangle\in\mathcal{E})\neq 1$ in most cases. In this paper, the generative view provides an elegant Bayesian framework to simultaneously capture the local topology and offer a reconstruction goal for the GAE.

With the generative perspective, for a weighted graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{W})$ where $\mathcal{W}$ represents the weights on edges, $\mathcal{W}$ can be viewed as an estimation of the underlying connectivity distributions. From the generative perspective, the underlying connectivity distributions can be supposed to exist in any dataset, while they are agnostic on non-graph datasets. In other words, the type of data does not affect the existence of the connectivity distributions. The vital idea that allows AnchorGAE to be applied to non-graph data is to estimate these distributions. Besides, they are also the target to be reconstructed by our unsupervised model. The following assumption helps to estimate the connectivity distributions.

Assumption 1.

Given an ideal metric of difference between two nodes $d(v_{i},v_{j})$, $p(v_{j}|v_{i})$ is negatively related to $d(v_{i},v_{j})$, i.e., $p(v_{j}|v_{i})\varpropto-d(v_{i},v_{j})$. An ideal metric indicates $\forall d(v_{i},v_{j})\leq d(v_{i},v_{l}),~p(v_{j}|v_{i})\geq p(v_{l}|v_{i})$.

In this paper, each node is described by a $d$-dimensional vector, i.e., $\bm{x}_{i}\in\mathbb{R}^{d}$. Therefore, the metric can be reformulated as

d(v_{i},v_{j})=\|f(\bm{x}_{i})-f(\bm{x}_{j})\|_{2}^{2}, \quad (2)

where $f(\cdot)$ represents an ideal mapping of the raw features. The implementation is discussed in Section 3.3.

Remark 2.

Although the above assumption seems too strong to hold on the raw data representation, the goal of our model is to find an approximately ideal mapping function $f(\cdot)$ to build a well-structured graph.

3.2.2 Bipartite Graph Simplification with Anchors

To efficiently apply graph convolution, we attempt to simplify the graph via some representative points, which are named anchors or landmarks. Let $\mathcal{U}$ denote the anchors, where $|\mathcal{U}|=m$ and the $i$-th anchor is also described by a $d$-dimensional vector $\bm{c}_{i}$. Then, we can get a new bipartite graph with anchors, $\mathcal{G}_{a}=(\mathcal{V}_{a},\mathcal{E}_{a},\mathcal{W}_{a})$. Specifically, $\mathcal{V}_{a}=\mathcal{V}\cup\mathcal{U}$ and $\mathcal{W}_{a}$ is the corresponding weights of $\mathcal{V}_{a}$. The bipartite property means that only edges between $\mathcal{V}$ and $\mathcal{U}$ are allowed in $\mathcal{E}_{a}$. This bipartite property is from the following assumption.

Assumption 2.

Given an ideal set of anchors $\mathcal{U}$, $p(v_{j}|v_{i})$ can be constructed by $\{p(u_{l}|v_{i})\}_{l=1}^{m}$ and $\{p(v_{j}|u_{l})\}_{l=1}^{m}$.

Since both connectivity distributions and anchors are unknown in general data, we need to estimate them. First, we intend to estimate the anchors and $p(u_{j}|v_{i})$ alternately by

\min\limits_{\mathcal{U},p(\cdot|v_{i})}\sum\limits_{i=1}^{n}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}d(v_{i},u_{j}), \quad (3)

according to Assumption 1. Specifically, in the ideal metric space introduced by Assumption 1, the expected divergence between $\mathcal{V}$ and $\mathcal{U}$ should be minimized. Note that we only focus on the estimation of $p(u_{j}|v_{i})$ and overlook $p(v_{i}|u_{j})$ for now. However, the above problem has a trivial solution, i.e.,

p_{\dagger}(u_{j}|v_{i})=\left\{\begin{array}{l l}1,&j=\arg\min_{j}d(v_{i},u_{j}),\\ 0,&\text{else}.\end{array}\right. \quad (4)

It indicates that each original node (from $\mathcal{V}$) only connects with its nearest anchor (from $\mathcal{U}$). To avoid such ill connectivity distributions and anchor graphs, we turn to the following regularized problem,

\min\limits_{\mathcal{U},p(\cdot|v_{i})}\sum\limits_{i=1}^{n}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}d(v_{i},u_{j})+\ell(p(\cdot|v_{i}),\pi(\cdot|v_{i})), \quad (5)

where $\pi(\cdot|v_{i})$ represents the uniform distribution and $\ell(\cdot,\cdot)$ is a metric of two discrete distributions. Intuitively, $p(u_{j}|v_{i})$ should be sparse such that anchors far from $v_{i}$ are ignored. In practice, the Kullback-Leibler divergence (KL-divergence) is usually used as $\ell(\cdot,\cdot)$ to measure the divergence of two probability distributions. However, it will return a dense solution of $p(\cdot|v_{i})$. Instead, we employ a simple measure as

\min\limits_{\mathcal{U},p(\cdot|v_{i})}\sum\limits_{i=1}^{n}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}d(v_{i},u_{j})+\gamma_{i}\sum\limits_{j=1}^{m}(p(u_{j}|v_{i})-\pi(u_{j}|v_{i}))^{2}. \quad (6)

Let $d_{k}(v_{i},\cdot)$ be the $k$-th smallest value of $\{d(v_{i},u_{j})\}_{j=1}^{m}$. If $\gamma_{i}=\frac{1}{2}(kd_{k+1}(v_{i},\cdot)-\sum_{l=1}^{k}d_{l}(v_{i},\cdot))$, then $p(\cdot|v_{i})$ will be $k$-sparse (only $k$ entries are non-zero). Meanwhile, the problem can be transformed into the form studied in [4] and solved by

p(u_{j}|v_{i})=\Big(\frac{d_{k+1}(v_{i},\cdot)-d(v_{i},u_{j})}{\sum_{l=1}^{k}(d_{k+1}(v_{i},\cdot)-d_{l}(v_{i},\cdot))}\Big)_{+}. \quad (7)

In other words, the hyper-parameter $\gamma_{i}$ is converted to the sparsity $k$, which is much easier to tune.
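The closed-form solution in Eq. (7) is simple enough to transcribe directly; the following NumPy sketch (with an illustrative function name) computes the $k$-sparse connectivity distribution of one sample from its squared distances to the $m$ anchors, assuming the $(k+1)$-th smallest distance is strictly larger than the $k$-th.

```python
import numpy as np

def sparse_connectivity(dist_row, k):
    """k-sparse distribution p(. | v_i) from Eq. (7).
    dist_row: (m,) squared distances d(v_i, u_j) to the anchors."""
    d_sorted = np.sort(dist_row)
    d_k1 = d_sorted[k]                          # (k+1)-th smallest distance
    p = np.maximum(d_k1 - dist_row, 0.0)        # (d_{k+1} - d_ij)_+
    denom = k * d_k1 - d_sorted[:k].sum()       # sum_l (d_{k+1} - d_l)
    return p / denom
```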

Algorithm 1 Iterative update of the connectivity distributions and anchors, i.e., optimization of problem (6).
Input: $\{f(\bm{x}_{i})\}_{i=1}^{n}$, $\{f^{(0)}(\bm{c}_{j})\}_{j=1}^{m}$, sparsity $k$.
  $t=0$.
  $d^{(0)}(v_{i},u_{j})=\|f(\bm{x}_{i})-f^{(0)}(\bm{c}_{j})\|_{2}^{2}$.
  repeat
     $t=t+1$.
     $p^{(t)}(u_{j}|v_{i})=\Big(\frac{d^{(t-1)}_{k+1}(v_{i},\cdot)-d^{(t-1)}(v_{i},u_{j})}{\sum_{l=1}^{k}(d^{(t-1)}_{k+1}(v_{i},\cdot)-d_{l}^{(t-1)}(v_{i},\cdot))}\Big)_{+}$.
     $f^{(t)}(\bm{c}_{j})=\frac{\sum_{i=1}^{n}p^{(t)}(u_{j}|v_{i})f(\bm{x}_{i})}{\sum_{i=1}^{n}p^{(t)}(u_{j}|v_{i})}$.
     $d^{(t)}(v_{i},u_{j})=\|f(\bm{x}_{i})-f^{(t)}(\bm{c}_{j})\|_{2}^{2}$.
  until convergence.
  $f(\bm{c}_{j})=f^{(t)}(\bm{c}_{j})$, $p(\cdot|v_{i})=p^{(t)}(\cdot|v_{i})$.
Output: $f(\bm{c}_{j})$ and $p(\cdot|v_{i})$.

When $p(u_{j}|v_{i})$ is fixed, the anchors are computed by solving the following problem,

\min\limits_{f(\bm{c}_{j})}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}\|f(\bm{x}_{i})-f(\bm{c}_{j})\|_{2}^{2}, \quad (8)

where $\bm{c}_{j}$ is the representation of anchor $u_{j}$ in the original space. If we take the derivative of the above objective with respect to $f(\bm{c}_{j})$,

\nabla_{f(\bm{c}_{j})}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}\|f(\bm{x}_{i})-f(\bm{c}_{j})\|_{2}^{2}=2\sum\limits_{i=1}^{n}\big(p(u_{j}|v_{i})f(\bm{c}_{j})-p(u_{j}|v_{i})f(\bm{x}_{i})\big), \quad (9)

and set it to 0, then we have

f(\bm{c}_{j})=\frac{\sum_{i=1}^{n}p(u_{j}|v_{i})f(\bm{x}_{i})}{\sum_{i=1}^{n}p(u_{j}|v_{i})}. \quad (10)

In sum, problem (6) can be solved iteratively through Eq. (7) and (10), which is summarized as Algorithm 1.
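A compact NumPy sketch of Algorithm 1 is given below, vectorizing Eqs. (7) and (10) over all samples and anchors; the fixed iteration count and the small epsilon guarding empty anchors are our own assumptions, not part of the paper.

```python
import numpy as np

def update_graph(F, C0, k, n_iter=20, eps=1e-12):
    """Alternating optimization of problem (6) (Algorithm 1).
    F: (n, d) embedded samples f(x_i); C0: (m, d) initial anchors f(c_j).
    Returns B with B[i, j] = p(u_j | v_i) and the updated anchors C."""
    C = C0.copy()
    for _ in range(n_iter):
        # d(v_i, u_j) = ||f(x_i) - f(c_j)||^2
        D = ((F[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        d_sorted = np.sort(D, axis=1)
        d_k1 = d_sorted[:, k:k + 1]                       # (k+1)-th smallest per row
        B = np.maximum(d_k1 - D, 0.0)                     # Eq. (7), numerator
        B /= k * d_k1 - d_sorted[:, :k].sum(1, keepdims=True)
        # Eq. (10): anchors as probability-weighted means of the samples
        C = (B.T @ F) / (B.sum(0)[:, None] + eps)
    return B, C
```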

Build a Generative Bipartite Graph by $\{p(u_{j}|v_{i})\}_{i,j}$ and $\{f(\bm{c}_{j})\}_{j}$: We now turn to model $p(v_{i}|u_{j})$, which is used to compute $p(v_{j}|v_{i})$, the probability between two raw nodes. Rather than solving a problem like problem (6), $p(v_{i}|u_{j})$ is set by a simple normalization step,

p(v_{i}|u_{j})=\frac{p(u_{j}|v_{i})}{\sum_{l=1}^{n}p(u_{j}|v_{l})}, \quad (11)

via utilizing $p(u_{j}|v_{i})$ calculated by Eq. (7). To show how the above formulations construct a bipartite graph and to simplify the succeeding discussion of the fast convolution, we further reformulate $p(u_{j}|v_{i})$ and $p(v_{i}|u_{j})$ in matrix form. Let $\bm{B}\in\mathbb{R}^{n\times m}$ be a matrix where $b_{ij}=p(u_{j}|v_{i})$. Then

\bm{T}=\left[\begin{array}{c c}\bm{0}&\bm{B}\\ \bm{B}^{T}&\bm{0}\end{array}\right]~\text{and}~\bm{D}_{a}=\left[\begin{array}{c c}\bm{I}&\bm{0}\\ \bm{0}&\bm{\Delta}\end{array}\right] \quad (12)

represent an unnormalized bipartite graph and its degree matrix, respectively. Note that $\bm{T}$ does not conform to the generative perspective for graphs, though it is a bipartite graph. According to Eq. (11), the generative bipartite graph can be defined as

\bm{P}=\bm{D}_{a}^{-1}\bm{T}=\left[\begin{array}{c c}\bm{0}&\bm{B}\\ \bm{\Delta}^{-1}\bm{B}^{T}&\bm{0}\end{array}\right]\in\mathbb{R}^{(n+m)\times(n+m)}, \quad (13)

which can also be regarded as a probability transition matrix.
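For illustration only, the matrices of Eqs. (12) and (13) can be materialized as follows; in AnchorGAE they never need to be formed explicitly, since $\bm{B}$ and the anchor degrees $\bm{\Delta}$ are sufficient.

```python
import numpy as np

def bipartite_transition(B):
    """Build T, D_a, and P of Eqs. (12)-(13) from B[i, j] = p(u_j | v_i)."""
    n, m = B.shape
    delta = B.sum(axis=0)                                 # diagonal of Delta
    T = np.block([[np.zeros((n, n)), B],
                  [B.T, np.zeros((m, m))]])
    D_a = np.diag(np.concatenate([np.ones(n), delta]))
    P = np.block([[np.zeros((n, n)), B],                  # P = D_a^{-1} T
                  [B.T / delta[:, None], np.zeros((m, m))]])
    return T, D_a, P
```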

Algorithm 2 Optimization of AnchorGAE.
Input: Raw features $\bm{X}$, initial sparsity $k_{0}$, the estimated upper bound of sparsity $k_{m}$, maximum epochs $E$.
1:  Initialize $\bm{B}$ and $\bm{C}$ via solving problem (6).
2:  Initialize $\bm{A}_{a}$ by Eq. (15).
3:  $k\leftarrow k_{0}$.
4:  for $i=1,2,\cdots,E$ do
5:     # Obtain the mapping $f(\cdot)$.
6:     Update the parameters of the GCN layers by gradient descent.
7:     Update $\bm{B}$ and $f(\bm{c}_{j})$ on $\bm{Z}$ and $\bm{Z}_{t}$ via Algorithm 1.
8:     Update the anchors in the original space for the bipartite graph convolution: $\bm{c}_{j}=f^{-1}(f(\bm{c}_{j}))=\bm{X}^{T}\bm{b}_{j}$.
9:     Update $\bm{A}_{a}$ by Eq. (15).
10:    $k\leftarrow k+\Delta k$.
11: end for
12: Perform fast clustering on $\bm{Z}$.
Output: Clustering assignments and embedding $\bm{Z}$.

3.3 Efficient $\mathcal{O}(n)$ Bipartite Graph Convolution

As the goal is to capture the structural information, a multi-layer GCN is a rational scheme to implement $f(\bm{x}_{i})$, which is defined in Eq. (2). In this subsection, we show why the bipartite graph convolution accelerates GCN.

We utilize the bipartite graph defined in Eq. (12) to accelerate the graph convolution. By virtue of the Markov process [61], $p(v_{j}|v_{i})$ can be obtained by the one-step transition probability, i.e.,

p(v_{j}|v_{i})=\sum_{l=1}^{m}p(v_{j}|u_{l})p(u_{l}|v_{i}). \quad (14)

Similarly, $p(u_{j}|u_{i})=\sum_{l=1}^{n}p(u_{j}|v_{l})p(v_{l}|u_{i})$. Formally, the constructed graphs of $\mathcal{V}$ and $\mathcal{U}$ are defined as

\bm{A}_{a}=\bm{P}^{2}=\left[\begin{array}{c c}\bm{B}\bm{\Delta}^{-1}\bm{B}^{T}&\bm{0}\\ \bm{0}&\bm{\Delta}^{-1}\bm{B}^{T}\bm{B}\end{array}\right]=\left[\begin{array}{c c}\bm{A}&\bm{0}\\ \bm{0}&\bm{A}_{t}\end{array}\right]. \quad (15)

Accordingly, $\bm{A}$ is the weighted adjacency matrix of the $n$ samples constructed via the anchors, and $\bm{A}_{t}$ is the weighted adjacency matrix of the $m$ anchors.

Remark 3.

In GCNs, the graph is conventionally preprocessed by adding self-loops as shown in Eq. (1). The graph constructed by Eq. (15) is equivalent to one with adaptive self-loops, since $\bm{A}_{ii}$ and $(\bm{A}_{t})_{ii}$ are non-zero in most cases.

Therefore, $\bm{A}$ can be directly used as a part of the graph convolution operator. Since

\bm{A}\textbf{1}=\bm{B}\bm{\Delta}^{-1}\bm{B}^{T}\textbf{1}=\bm{B}\textbf{1}=\textbf{1}, \quad (16)

the degree matrix of $\bm{A}$ is the identity matrix, i.e., $\bm{L}=\bm{A}$. Therefore, the output of GCN is formulated as

\bm{H}=\varphi(\bm{B}\bm{\Delta}^{-1}\bm{B}^{T}\bm{X}\bm{W}). \quad (17)

Here, we explain why the above equation accelerates the operation. It should be emphasized that $\bm{A}$ is not required to be computed explicitly. The core is to rearrange the order of matrix multiplications as follows

\bm{T}_{1}=\bm{B}^{T}\bm{X}~\Rightarrow~\bm{T}_{2}=\bm{\Delta}^{-1}\bm{T}_{1}~\Rightarrow~\bm{T}_{3}=\bm{T}_{2}\bm{W}~\Rightarrow~\bm{H}=\varphi(\bm{B}\bm{T}_{3}). \quad (18)

According to the above order of computation, the computational complexity is reduced to $\mathcal{O}(nmd+m^{2}d+mdd^{\prime}+nmd^{\prime})$. Since $d$, $d^{\prime}$, and $m$ are usually much smaller than $n$, the complexity can be regarded as $\mathcal{O}(nm)$. Moreover, if the number of anchors is small enough, the required time of each forward propagation of a GCN layer is $\mathcal{O}(n)$.
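The factorized computation of Eq. (18) can be sketched in a few lines of NumPy; the function below is an illustrative reimplementation, not the released code, and assumes a single layer with a generic activation.

```python
import numpy as np

def bipartite_gcn_layer(B, X, W, activation=np.tanh):
    """O(n) bipartite graph convolution H = phi(B Delta^{-1} B^T X W)
    of Eq. (17), computed in the order of Eq. (18) so that the dense
    n x n matrix A = B Delta^{-1} B^T is never formed."""
    delta = B.sum(axis=0)              # anchor degrees, diagonal of Delta
    T1 = B.T @ X                       # (m, d):  O(nmd)
    T2 = T1 / delta[:, None]           # (m, d):  Delta^{-1} T1
    T3 = T2 @ W                        # (m, d'): O(mdd')
    return activation(B @ T3)          # (n, d'): O(nmd')
```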

3.4 Siamese Architecture for Anchor Mapping

After showing the accelerated GCN mapping based on anchors and the bipartite graph, we discuss more details of the whole procedure in this section, especially the changes of architecture brought by anchors.

As the task is unsupervised, the whole architecture is based on graph auto-encoder (GAE). The encoder consists of multiple bipartite GCN layers defined in Eq. (17),

\bm{Z}=f(\bm{X})=\varphi_{L}(\bm{A}\varphi_{L-1}(\cdots\varphi_{1}(\bm{A}\bm{X}\bm{W}_{1})\cdots)\bm{W}_{L}), \quad (19)

where $L$ denotes the number of layers. The embedding (denoted by $\bm{Z}$), which is produced by the encoder, is a non-linear mapping of the raw representations of $\mathcal{V}$.

However, the above encoder only transforms $\mathcal{V}$, and there is no edge between $\mathcal{V}$ and $\mathcal{U}$. In other words, $\mathcal{U}$ cannot be projected via the network with the adjacency $\bm{A}$. Meanwhile, GCN layers suffer from the out-of-sample problem. Accordingly, $\mathcal{U}$ cannot be directly projected into the deep feature space by the encoder. To map anchors into the same feature space, a siamese [62] encoder is thus designed. The encoder for anchors shares the same parameters but has its own graph structure. Surprisingly, the one-step probability transition matrix defined in Eq. (15) gives us a graph that is only composed of anchors, $\bm{A}_{t}=\bm{\Delta}^{-1}\bm{B}^{T}\bm{B}$. As

\bm{A}_{t}\textbf{1}=\bm{\Delta}^{-1}\bm{B}^{T}\bm{B}\textbf{1}=\bm{\Delta}^{-1}\bm{B}^{T}\textbf{1}=\textbf{1}, \quad (20)

the corresponding degree matrix, $\bm{D}_{t}$, is also the identity matrix, i.e.,

\bm{L}_{t}=\bm{D}_{t}^{-\frac{1}{2}}\bm{A}_{t}\bm{D}_{t}^{-\frac{1}{2}}=\bm{A}_{t}. \quad (21)

Accordingly, the embedding of anchors is formulated as

\bm{Z}_{t}=f(\bm{C})=\varphi_{L}(\bm{A}_{t}\varphi_{L-1}(\cdots\varphi_{1}(\bm{A}_{t}\bm{C}\bm{W}_{1})\cdots)\bm{W}_{L}), \quad (22)

where $\bm{C}\in\mathbb{R}^{m\times d}$ represents the $m$ anchors.
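A sketch of the siamese encoder follows: samples and anchors share the weight matrices $\{\bm{W}_l\}$ but propagate over $\bm{A}$ and $\bm{A}_t$ respectively, both evaluated in the factorized form; using the same activation for every layer is a simplification (the paper sets the last layer to linear).

```python
import numpy as np

def siamese_encode(B, X, C, weights, activation=np.tanh):
    """Shared-weight encoders of Eqs. (19) and (22).
    Samples use A = B Delta^{-1} B^T, anchors use A_t = Delta^{-1} B^T B."""
    delta = B.sum(axis=0)
    Z, Zt = X, C
    for W in weights:
        # samples:  A Z W   = B (Delta^{-1} (B^T Z)) W
        Z = activation(B @ (((B.T @ Z) / delta[:, None]) @ W))
        # anchors:  A_t Zt W = (Delta^{-1} (B^T (B Zt))) W
        Zt = activation(((B.T @ (B @ Zt)) / delta[:, None]) @ W)
    return Z, Zt
```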

Loss: As the underlying connectivity distributions have been estimated, we use them, rather than the adjacency matrix used in existing GAEs, as the target for the decoder to rebuild. According to Assumption 1, the connectivity probability should be reconstructed according to the Euclidean distances between $\mathcal{V}$ and $\mathcal{U}$. Formally speaking, the decoder is

q(u_{j}|v_{i})=\frac{\exp(-d(v_{i},u_{j}))}{\sum_{l=1}^{m}\exp(-d(v_{i},u_{l}))}. \quad (23)

Instead of using the mean-square error (MSE) to reconstruct $\bm{A}$, AnchorGAE intends to reconstruct the underlying connectivity distributions and employs the cross-entropy to measure the divergence,

\mathcal{L}=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}p(u_{j}|v_{i})\log\frac{1}{q(u_{j}|v_{i})}. \quad (24)

The GCN part of AnchorGAE is trained via minimizing $\mathcal{L}$.
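Given the embeddings of samples and anchors, the decoder of Eq. (23) and the loss of Eq. (24) reduce to a softmax over negative squared distances followed by a cross-entropy; the sketch below is a straightforward NumPy transcription with a standard log-sum-exp stabilization added by us.

```python
import numpy as np

def anchor_gae_loss(Z, Zt, B):
    """Cross-entropy of Eq. (24) between p(u_j | v_i) (rows of B) and the
    decoder q(u_j | v_i) of Eq. (23), computed from embeddings Z and Zt."""
    D = ((Z[:, None, :] - Zt[None, :, :]) ** 2).sum(-1)   # d(v_i, u_j)
    logits = -D - (-D).max(axis=1, keepdims=True)         # numerical stability
    log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(B * log_q).sum()
```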

So far, how to obtain the initial anchors has been ignored. In deep learning, a common manner is to pre-train a neural network. Nevertheless, the pre-training of raw graph auto-encoders will result in nearly $\mathcal{O}(n^{2})$ time, which violates the motivation of our model. Therefore, we first compute the anchors and transition probabilities on the raw features. Then the anchors and transition probabilities are rectified dynamically, which is elaborated in Section 3.5.

3.4.1 Why not reconstruct features?

One may ask why AnchorGAE reconstructs the constructed graph rather than the features, which are given a priori. The goal of the classical auto-encoder is to restore the input features. However, the learned representations are often not promising enough due to the lack of limitations on the latent space. Accordingly, several works (e.g., the adversarial auto-encoder [63] and the variational auto-encoder [54]) constrain the representations. More importantly, contrastive learning [64] does not reconstruct the features anymore and focuses on how to pull similar samples together and push dissimilar samples away. Contrastive learning has achieved impressive performance in recent years.

In AnchorGAE, the graph is constructed on the learned representations. Define a mapping $\psi$ from the representation space to the graph space. Then the graph can be roughly formulated as $\bm{A}=\psi(f(\bm{X}))$, where $f$ denotes the GNN mapping. On the one hand, the composite mapping, $\psi\circ f$, removes the redundant information of $\bm{X}$. The reconstruction of $\bm{A}$ can be regarded as the restoration of the simplified $\bm{X}$. On the other hand, the edges of the graph indicate similar sample pairs and dissimilar pairs. The reconstruction of the graph is therefore consistent with the idea of contrastive learning. That is why AnchorGAE only reconstructs the constructed graph.

Figure 2: Illustration of the strategy to avoid degeneration on MNIST-test: (a) Collapse, from AnchorGAE with fixed $k$; (b) AnchorGAE with dynamically increasing $k$.
TABLE I: Experimental Results: ACC and NMI ("-" denotes methods that fail to run on MNIST-full or Fashion-MNIST)
Method            ISOLET         SEGMENT        USPS           MNIST-test     MNIST-full     Fashion
                  ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI
K-Means           0.621  0.775   0.559  0.590   0.648  0.628   0.548  0.501   0.534  0.500   0.474  0.512
Nystrom           0.648  0.780   0.562  0.549   0.668  0.619   0.575  0.505   0.589  0.502   0.563  0.522
CSC               0.661  0.785   0.606  0.555   0.726  0.696   0.623  0.572   0.592  0.569   0.515  0.501
KASP              0.630  0.754   0.554  0.544   0.712  0.693   0.654  0.600   0.620  0.555   0.538  0.529
LSC               0.657  0.774   0.562  0.557   0.727  0.694   0.702  0.615   0.714  0.623   0.634  0.629
SNC               0.553  0.631   0.567  0.535   0.703  0.589   0.570  0.523   -      -       -      -
SGC               0.464  0.665   0.524  0.611   0.657  0.801   0.719  0.757   -      -       -      -
GAE               0.612  0.783   0.442  0.475   0.679  0.821   0.696  0.758   -      -       -      -
DEC               0.425  0.685   0.143  0.000   0.621  0.588   0.813  0.803   0.797  0.810   0.516  0.541
SpectralNet       0.528  0.781   0.515  0.540   0.678  0.818   0.820  0.817   -      -       -      -
ClusterGCN-L2     0.581  0.772   0.580  0.601   0.659  0.769   0.632  0.666   0.707  0.730   0.532  0.605
ClusterGCN-L4     0.574  0.738   0.593  0.599   0.662  0.769   0.631  0.665   0.708  0.732   0.532  0.606
Ours-A (fixed B)  0.615  0.814   0.443  0.490   0.490  0.475   0.290  0.351   0.468  0.635   0.588  0.630
Ours-B (fixed k)  0.373  0.532   0.285  0.125   0.265  0.174   0.284  0.228   0.289  0.232   0.194  0.112
Ours-C (KNN)      0.654  0.778   0.555  0.511   0.729  0.746   0.776  0.735   0.793  0.743   0.533  0.553
AnchorGAE-L2      0.663  0.787   0.635  0.613   0.853  0.828   0.823  0.773   0.833  0.773   0.645  0.607
AnchorGAE-L4      0.591  0.756   0.576  0.508   0.781  0.775   0.810  0.762   0.808  0.771   0.613  0.590

3.5 Will the Adaptive Update Work?

Since the initialized anchors and transition probabilities are calculated from the original features, the information hidden in $\bm{A}$ may be limited, especially when the data distribution is complicated. To precisely exploit the high-level information in a self-supervised way, a feasible method is to adaptively update the anchors according to the learned embedding.

To achieve this goal, we first need to recompute the anchors, $\mathcal{U}$. According to Eqs. (7) and (10), we can obtain the anchors under the deep representations, $\{f(\bm{c}_{j})\}_{j=1}^{m}$. However, the input of our model requires anchors under the raw features. In other words, it is inevitable to compute $\bm{c}_{j}=f^{-1}(f(\bm{c}_{j}))$. Unfortunately, it is hard to ensure that $f(\cdot)$ is an invertible mapping. One scheme is to utilize classical auto-encoders, since the decoder that tries to reconstruct the raw input from the embedding can be regarded as an approximation of the inverse mapping of the encoder. To simplify the discussion in this paper, we simply set

\bm{c}_{j}=f^{-1}(f(\bm{c}_{j}))\approx\bm{X}^{T}\bm{b}_{j}, \quad (25)

to estimate $\bm{c}_{j}$ in the original feature space.
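In code, Eq. (25) is a single matrix product; the sketch below only illustrates the pull-back of the anchors to the raw feature space, with the column $\bm{b}_j$ of $\bm{B}$ acting as the combination weights.

```python
import numpy as np

def anchors_in_input_space(X, B):
    """Approximate inverse mapping of Eq. (25): c_j ~= X^T b_j,
    where b_j is the j-th column of B. Returns an (m, d) matrix
    whose j-th row is the estimated anchor c_j."""
    return (X.T @ B).T
```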

We would expect the self-supervised updates of the anchors and transition probabilities to become more and more precise. However, the performance of AnchorGAE becomes worse and worse after several updates of the anchors, which we show both theoretically and empirically. An interesting phenomenon is that a better reconstruction indicates a more severe collapse. The following theorem provides a rigorous description of this collapse.

Theorem 1.

Let $p_{k}(\cdot|v_{i})$ be the $k$-th largest one of $\{p(u_{j}|v_{i})\}_{j=1}^{m}$ and $\{\hat{p}(u_{j}|v_{i})\}_{j=1}^{m}$ be the updated distribution with the same $k$ after step 6 of Algorithm 2. If $|q(\cdot|v_{i})-p(\cdot|v_{i})|\leq\varepsilon$, then there exists a constant $\delta>0$ such that when $\varepsilon\leq\delta$,

|\hat{p}_{j}(\cdot|v_{i})-\frac{1}{k}|\leq\mathcal{O}(\log^{-1/2}(1/\varepsilon))

holds for any $j\leq k$.

The above conclusion means that $\hat{p}(\cdot|v_{i})$ is a sparse and uniform distribution. In other words, the constructed graph degenerates into an unweighted graph, which is usually undesired in clustering. More intuitively, the degeneration is caused by projecting samples from the identical cluster into multiple tiny and cohesive clusters, as shown in Figure 2. These tiny clusters have no connection with each other and they scatter randomly in the representation space, such that the clustering result is disturbed. To avoid this collapse, we propose two strategies: (1) increase $k$, which represents the sparsity of neighbors; (2) decrease the number of anchors, $m$. In this paper, we dynamically increase $k$, which is equivalent to attempting to connect these tiny groups before they scatter randomly, as follows

k\leftarrow k+\Delta k. \quad (26)

Suppose that there are $c$ clusters and the number of samples in the smallest cluster is represented by $n_{s}$. Then we can define the upper bound of $k$ and the increment of sparsity $\Delta k$ as

k_{m}=\lfloor m\times\frac{n_{s}}{n}\rfloor~~\textrm{and}~~\Delta k=\lfloor\frac{k_{m}-k_{0}}{E}\rfloor, \quad (27)

where $k_{0}$ denotes the sparsity of the initialized anchor graph and $E$ is the number of iterations to update the anchors. If we have no prior information about $n_{s}$, it can be simply set as $\lfloor n/c\rfloor$ or $\lfloor n/(2c)\rfloor$. This brief modification helps a lot to guarantee the stability and performance of AnchorGAE.
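The schedule of Eqs. (26)-(27) is easy to implement directly; the numbers in the usage example below are purely illustrative.

```python
import numpy as np

def sparsity_schedule(m, n, n_s, k0, E):
    """Upper bound k_m and increment Delta_k of Eq. (27); k then grows
    as k <- k + Delta_k after each anchor update (Eq. (26))."""
    k_m = int(np.floor(m * n_s / n))
    delta_k = int(np.floor((k_m - k0) / E))
    return k_m, delta_k

# e.g., m=500 anchors, n=10000 samples, smallest cluster n_s=1000,
# initial sparsity k0=3, E=5 anchor updates
k_m, dk = sparsity_schedule(m=500, n=10000, n_s=1000, k0=3, E=5)
```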

Figure 3: Visualization of SEGMENT by t-SNE: (a) original features; (b) fixed $k$; (c) fixed $\bm{B}$; (d) AnchorGAE.

When AnchorGAE is trained, we can perform some fast clustering algorithms on $\bm{B}$ or $\bm{Z}$, which is elaborated in Section 3.6. The optimization of AnchorGAE is summarized in Algorithm 2.

3.6 Obtain Clustering Assignments

After training AnchorGAE, we can simply run $k$-means on the learned embedding. Due to Assumption 1, the Euclidean distances of similar points will be small, so it is appropriate to use $k$-means on the deep representations.

Another method is to perform spectral clustering on the constructed graph, which is the same as anchor-based spectral clustering. Let

\bm{\mathcal{L}}=\bm{I}-\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}=\bm{I}-\bm{A}=\bm{I}-\bm{B}\bm{\Delta}^{-1}\bm{B}^{T} \quad (28)

be the normalized Laplacian matrix. It should be emphasized that $\bm{\mathcal{L}}$ is also the unnormalized Laplacian matrix. Accordingly, the objective of spectral clustering is

\max\limits_{\bm{F}^{T}\bm{F}=\bm{I}}{\rm tr}(\bm{F}^{T}\bm{B}\bm{\Delta}^{-1}\bm{B}^{T}\bm{F}), \quad (29)

where $\bm{F}\in\mathbb{R}^{n\times c}$ is the soft indicator matrix. Let $\hat{\bm{B}}=\bm{B}\bm{\Delta}^{-\frac{1}{2}}$. Then the eigenvalue decomposition of $\bm{A}$ can be transformed into the singular value decomposition of $\hat{\bm{B}}$, which reduces the complexity from $\mathcal{O}(n^{2}c)$ to $\mathcal{O}(mnc)$.

Finally, one can perform clustering on the bipartite graph directly rather than the constructed graph. Let the bipartite graph be

\bm{W}=\left[\begin{array}{c c}\bm{0}&\bm{B}\\ \bm{B}^{T}&\bm{0}\end{array}\right]\in\mathbb{R}^{(n+m)\times(n+m)}. \quad (30)

The degree matrix is represented as

\bm{D}=\left[\begin{array}{c c}\bm{D}_{v}&\bm{0}\\ \bm{0}&\bm{D}_{u}\end{array}\right]. \quad (31)

Then spectral clustering with the normalized cut optimizes

\max\limits_{\bm{F}^{T}\bm{F}=\bm{I}}{\rm tr}(\bm{F}^{T}\bm{D}^{-\frac{1}{2}}\bm{W}\bm{D}^{-\frac{1}{2}}\bm{F}), \quad (32)

where

\bm{F}=\left[\begin{array}{c}\bm{V}\\ \bm{U}\end{array}\right]. \quad (33)

Since $\bm{D}_{v}=\bm{I}$, the above problem can be further reformulated as

\max\limits_{\bm{U}^{T}\bm{U}+\bm{V}^{T}\bm{V}=\bm{I}}{\rm tr}(\bm{V}^{T}\bm{B}\bm{D}_{u}^{-\frac{1}{2}}\bm{U}). \quad (34)

The closed-form solution can be calculated by

\left\{\begin{array}{l}\bm{V}=\frac{\sqrt{2}}{2}\tilde{\bm{V}},\\ \bm{U}=\frac{\sqrt{2}}{2}\tilde{\bm{U}},\end{array}\right. \quad (35)

where $\tilde{\bm{V}}$ and $\tilde{\bm{U}}$ are the $c$ leading left and right singular vectors of $\bm{B}\bm{D}_{u}^{-\frac{1}{2}}$. The details can be found in [65]. In our experiments, we report the results obtained by the above technique based on bipartite graphs, since it usually outperforms the simple $k$-means.
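A sketch of this bipartite-graph clustering step is given below, assuming SciPy and scikit-learn are available; running $k$-means on the row-normalized left singular vectors is a common post-processing choice and is not prescribed by the paper.

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def bipartite_spectral_clustering(B, c):
    """Clustering on the bipartite graph of Eq. (30): the c leading
    singular vectors of B D_u^{-1/2} give the solution of Eq. (35),
    and the sample part is fed to k-means for the final assignments."""
    d_u = B.sum(axis=0)                                   # anchor degrees D_u
    B_hat = B / np.sqrt(d_u)[None, :]                     # B D_u^{-1/2}
    U, s, Vt = svds(B_hat, k=c)                           # c leading singular vectors
    V = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=c, n_init=10).fit_predict(V)
```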

Figure 4: The influence of the number of anchors on ACC, NMI, and TIME on 6 datasets: (a) SEGMENT, (b) USPS, (c) MNIST-full, (d) ISOLET, (e) MNIST-test, (f) Fashion-MNIST.
TABLE II: Information of Datasets
Dataset # Feature # Size # Classes
ISOLET 617 1560 26
SEGMENT 19 2310 7
USPS 256 9298 10
MNIST-test 784 10000 10
MNIST-full 784 70000 10
Fashion-MNIST 784 70000 10

4 Experiment

In this section, the experimental results and parametric settings of AnchorGAE are illustrated. According to these experimental results, we analyze the performance of AnchorGAE objectively. Additionally, the visualization of data is displayed. The experimental results strongly support the theoretical analysis.

4.1 Datasets and Comparative Methods

The performance of AnchorGAE is evaluated on 6 non-graph datasets, including 2 UCI datasets (ISOLET [66] and SEGMENT [66]) and 4 image datasets (USPS [67], MNIST-test [68], MNIST-full [68], and Fashion-MNIST [69]). Remark that the two UCI datasets highlight the difference between AnchorGAE and CNN-based clustering models: since each data point can only be represented by a vector, CNN-based models cannot be applied to these datasets. Note that Fashion-MNIST is denoted by Fashion in Table I. The details of these datasets are shown in Table II.

AnchorGAE is compared with 11 methods, including 5 anchor-based methods (Nystrom [9], CSC [10], KASP [11], LSC [13], and SNC [14]), 3 GAE-based methods (SGC [35], GAE [48], and ClusterGCN [34]), 2 deep methods (DEC [17] and SpectralNet [25]), and the classical K-Means [58]. All codes are downloaded from the authors' homepages.

4.2 Experimental Setup

Three clustering evaluation criteria are used in our experiments: clustering accuracy (ACC), normalized mutual information (NMI), and running time. In our experiments, the number of anchor points is set from 100 to 1000 with a step of 100. The best results are reported in Table I. The activation function of the last layer is set to linear, while the other layers use the ReLU function. In particular, ClusterGCN recommends a 4-layer convolution structure. To keep the convolution structure of ClusterGCN consistent with the other methods, we add a 2-layer variant (denoted by ClusterGCN-L2) in addition to the recommended 4-layer structure (denoted by ClusterGCN-L4). To apply ClusterGCN to unsupervised tasks, we use the same GAE loss and decoder architecture as the other models. To test the scalability of AnchorGAE with deeper networks, we design 2-layer (denoted by AnchorGAE-L2) and 4-layer (denoted by AnchorGAE-L4) convolution structures separately. The 2-layer neurons are designed as 128 and 64, or 256 and 32. For the 4-layer structure, we set the first three layers as 16 and the last layer as $L$+1, where $L$ is the number of labels. The maximum number of iterations to update the network is 200, while the maximum number of epochs to update the anchors is set to 5. The initial sparsity $k_{0}$ is set to 3.

Figure 5: Runtime of several GNN-based methods. If a model cannot run on some dataset, it is marked by $\infty$.

To investigate the impact of anchors, we run AnchorGAE with different numbers of anchors and different initial sparsities, i.e., $m$ and $k_{0}$. We also conduct 3 ablation experiments based on AnchorGAE-L2 to study the impact of different parts. Firstly, we fix the anchors and transition probabilities during training, which is denoted by Ours-A. Secondly, we fix the sparsity $k$, which is denoted by Ours-B. Thirdly, we use a KNN graph, instead of a weighted graph, with the adaptive update and incremental $k$ in our model, which is denoted by Ours-C. All methods are run 10 times and the average results are reported. All experiments are conducted on a PC with an NVIDIA GeForce GTX 1660 SUPER GPU.

Figure 6: The influence of $k_{0}$ on ACC and NMI on (a) SEGMENT and (b) MNIST-test.

4.3 Experimental Results

ACC and NMI are shown in Table I, while the runtime is reported in Figure 5. For ACC and NMI, the best results are bolded and suboptimal results are underlined. In particular, some methods either require too much memory to run on MNIST-full and Fashion-MNIST, or the codes provided by the authors fail to run on them, so their entries are marked by a dash in Table I and by $\infty$ in Figure 5. Besides, the visualization of SEGMENT is shown in Figure 3. Figure 4 shows the impact of the number of anchors, while the effect of the initial sparsity is illustrated in Figure 6.

From the experimental results, we can conclude that:

  1. On almost all datasets, AnchorGAE-L2 obtains the best results in all metrics. On MNIST-full, AnchorGAE obtains the suboptimal NMI, while DEC, which consists of many more layers, outperforms it. Meanwhile, AnchorGAE, with only 2 layers, achieves better ACC than DEC. AnchorGAE needs to compute anchors, so it requires more time.

  2. On MNIST-full, normal GAE-based methods fail to run due to out-of-memory errors, since these methods need to store the graph structure for each convolution. AnchorGAE can perform the convolution without constructing the graph explicitly, which not only needs less memory but also reduces the runtime.

  3. When the sparsity $k$ is fixed, or the anchors and transition probabilities (i.e., the matrix $\bm{B}$) are not updated, AnchorGAE usually gets poor results on many datasets. For instance, Ours-B loses about 55% in ACC and 55% in NMI on MNIST-full, while Ours-A loses about 40% in ACC and about 15% in NMI.

  4. To ensure fair competition and distinguish between anchor-based methods and GAE-based methods, we set the same number of anchors for AnchorGAE-L4 and the other anchor-based methods. Note that the first two layers of AnchorGAE-L4 are the same as AnchorGAE-L2. From the experimental results, AnchorGAE-L4 obtains acceptable performance on most of the datasets but not the best results, which may be caused by the overly small dimension of the final embedding.

  5. If the initial sparsity, $k_{0}$, is set too large, it results in dense connections. As a result, a great quantity of wrong information is captured.

  6. From the comparison between AnchorGAE and Ours-C, we find that weighted graphs are still important for clustering even after introducing non-linear GNN mapping modules.

5 Proof of Theorem 1

For an arbitrary probability distribution $\{p(u_{j}|v_{i})\}_{j=1}^{m}$, let $p_{k}(\cdot|v_{i})$ be the $k$-th largest one. For simplicity, let $d_{ij}=d(u_{j},v_{i})=\|f(\bm{x}_{i})-f(\bm{c}_{j})\|_{2}^{2}$, where $f(\cdot)$ represents the mapping function implemented by the trained GCN encoder. Note that the connectivity distribution is estimated through the current representations, i.e.,

p^{(0)}(\cdot|v_{i})=\arg\min_{p(\cdot|v_{i})}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}\hat{d}_{ij}+\gamma_{i}\sum_{j=1}^{m}(p(u_{j}|v_{i})-\frac{1}{m})^{2}, \quad (36)

where $\hat{d}_{ij}=\|\hat{f}(\bm{x}_{i})-f(\bm{c}_{j})\|_{2}^{2}$ and $\hat{f}(\cdot)$ denotes the mapping function before the new training of the GCN encoder.

We define $p^{(t)}(u_{j}|v_{i})$ and $f^{(t)}(\bm{c}_{j})$ as the solutions at the $t$-th iteration. To present the proof clearly, the iterative update of $p(\cdot|v_{i})$ and the anchors is shown in Algorithm 1. The following lemma depicts the first estimated connectivity distribution after one perfect training of the GCN.

Lemma 1.

Suppose that

p^{(1)}(\cdot|v_{i})=\arg\min_{p(\cdot|v_{i})}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}d_{ij}+\gamma_{i}\sum_{j=1}^{m}(p(u_{j}|v_{i})-\frac{1}{m})^{2},

where $\gamma_{i}$ ensures that the sparsity is $k$. Let $p_{k}^{(0)}(\cdot|v_{i})=\mu$. If $|p_{j}^{(0)}(\cdot|v_{i})-q_{j}(\cdot|v_{i})|\leq\varepsilon\leq\mu$ for any $i$, then $\forall j\leq k$,

|p^{(1)}_{j}(\cdot|v_{i})-\frac{1}{k}|\leq\mathcal{O}\Big(\frac{\log(1/\mu)}{\log(1/\varepsilon)}\Big). \quad (37)
Proof.

Without loss of generality, we focus on the connectivity distribution of $v$ and suppose that $p^{(0)}(u_{1}|v)\geq p^{(0)}(u_{2}|v)\geq\cdots\geq p^{(0)}(u_{k}|v)>0=p^{(0)}(u_{k+1}|v)=\cdots=p^{(0)}(u_{m}|v)$. Let $p_{i}=p^{(0)}(u_{i}|v)$ and $q_{i}=q(u_{i}|v)$. According to the definitions,

q_{i}=\frac{\exp(-d_{i})}{\sum_{j=1}^{m}\exp(-d_{j})}. \quad (38)

For any $i$, we have $|p_{i}-q_{i}|\leq\varepsilon$. Clearly, $p_{k+1}=0$. Suppose that $q_{k+1}=\tau\leq\varepsilon$. Therefore, we have

\frac{\exp(-d_{k+1})}{\sum_{j=1}^{m}\exp(-d_{j})}=\tau~\Leftrightarrow~d_{k+1}=\log\frac{1}{\tau}-\log C, \quad (39)

where $C=\sum_{j=1}^{m}\exp(-d_{j})$. Combined with the condition $p_{k}\geq\mu$, we have

\frac{\exp(-d_{k})}{\sum_{j=1}^{m}\exp(-d_{j})}\geq p_{k}-\varepsilon\geq\mu-\varepsilon~\Rightarrow~d_{k}\leq-\log C-\log(\mu-\varepsilon). \quad (40)

Similarly, since $p_{1}\geq\frac{1}{k}$,

\frac{\exp(-d_{1})}{\sum_{j=1}^{m}\exp(-d_{j})}\leq 1~\Rightarrow~d_{1}\geq-\log C. \quad (41)

If we update the connectivity distribution based on $\{d_{i}\}_{i=1}^{m}$, then for any $i\leq k$,

p_{i}^{(1)}=\frac{d_{k+1}-d_{i}}{\sum_{j=1}^{k}(d_{k+1}-d_{j})}. \quad (42)

Furthermore, for any $i,j\leq k$,

\begin{split}
|p_{i}^{(1)}-p_{j}^{(1)}|&=\frac{|d_{j}-d_{i}|}{\sum_{j=1}^{k}(d_{k+1}-d_{j})}\\
&\leq\frac{|d_{k}-d_{1}|}{\sum_{j=1}^{k}(d_{k+1}-d_{j})}\\
&=\frac{-\log(\mu-\varepsilon)}{\sum_{j=1}^{k}(\log\frac{1}{\tau}-\log C-d_{j})}\\
&\leq\frac{-\log(\mu-\varepsilon)}{\sum_{j=1}^{k}(\log\frac{1}{\tau}-\log C-d_{k})}\\
&\leq\frac{-\log(\mu-\varepsilon)}{k(\log(\mu-\varepsilon)-\log\tau)}\\
&=\frac{1}{k}\cdot\frac{\log(\mu-\varepsilon)}{\log\tau-\log(\mu-\varepsilon)}\\
&=\frac{1}{k}\cdot\frac{1}{\frac{\log(1/\tau)}{\log(1/(\mu-\varepsilon))}-1}\\
&\leq\frac{1}{k}\cdot\frac{1}{\frac{\log(1/\varepsilon)}{\log(1/(\mu-\varepsilon))}-1}=\mathcal{O}\Big(\frac{\log(1/\mu)}{\log(1/\varepsilon)}\Big).
\end{split}

For any $i \le k$, the above result yields

$$\begin{aligned}
\Big|p_i^{(1)} - \frac{1}{k}\Big| &= \Big|p_i^{(1)} - \frac{1}{k}\sum_{j=1}^{k} p_j^{(1)}\Big| \\
&\le \frac{1}{k}\Big|\sum_{j=1}^{k}\big(p_i^{(1)} - p_j^{(1)}\big)\Big| \\
&\le \frac{1}{k}\sum_{j=1}^{k}\big|p_i^{(1)} - p_j^{(1)}\big| \\
&\le \mathcal{O}\Big(\frac{\log(1/\mu)}{\log(1/\varepsilon)}\Big),
\end{aligned}$$

where the first equality holds because $\sum_{j=1}^{k} p_j^{(1)} = 1$. The same argument applies to every other node. Hence, the lemma is proved. ∎
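As a sanity check on Lemma 1 (our illustration, not part of the original analysis), the following sketch builds a $k$-sparse distribution $p^{(0)}$, perturbs it into a softmax distribution $q$ with $|q_j - p_j^{(0)}| \le \varepsilon$, recovers distances via $d_j = -\log q_j$ (so that $C = 1$), and re-estimates $p^{(1)}$ with the closed form of Eq. (42); the deviation from $1/k$ is small and shrinks as $\varepsilon$ decreases, in line with the $\mathcal{O}(\log(1/\mu)/\log(1/\varepsilon))$ bound. The concrete numbers ($k = 4$, $\mu = 0.1$, $\tau = 10^{-12}$) are arbitrary choices for illustration.

```python
import numpy as np

def connectivity(d, k):
    """k-sparse closed-form update of Eq. (42)."""
    ds = np.sort(d)
    return np.maximum(ds[k] - d, 0.0) / (k * ds[k] - ds[:k].sum())

k, m, tau = 4, 20, 1e-12
p0 = np.array([0.4, 0.3, 0.2, 0.1] + [0.0] * (m - k))   # k-sparse p^{(0)}, mu = 0.1
q = np.where(p0 > 0, p0 * (1 - (m - k) * tau), tau)      # |q_j - p0_j| <= eps, eps ~ (m-k)*tau
d = -np.log(q)                                           # distances whose softmax equals q (C = 1)
p1 = connectivity(d, k)
print(np.abs(p1[:k] - 1.0 / k).max())                    # ~8e-3 here; decreases as tau -> 0
```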

The following lemma shows that the updated anchors approximate the mean vectors of their connected nodes once $p^{(t)}(u_j|v_i)$ approaches the sparse and uniform distribution.

Lemma 2.

Let $\mathcal{N}_j$ denote the set of nodes connected to the anchor $u_j$, and let $p(u_j|v_i)$ be a $k$-sparse solution given by Theorem 1. If $|p_j(\cdot|v_i) - \frac{1}{k}| \le \varepsilon$ for any $j \le k$, then

$$\Big\|\frac{\sum_{i=1}^{n} p(u_j|v_i) f(\bm{x}_i)}{\sum_{i=1}^{n} p(u_j|v_i)} - \frac{1}{n_j}\sum_{i\in\mathcal{N}_j} f(\bm{x}_i)\Big\|_2^2 \le \mathcal{O}(\varepsilon). \qquad (43)$$
Proof.

To keep the notation uncluttered, let $\bm{f}_i = f(\bm{x}_i)$ and

$$\bm{h}_j = \frac{\sum_{i=1}^{n} p(u_j|v_i) f(\bm{x}_i)}{\sum_{i=1}^{n} p(u_j|v_i)}. \qquad (44)$$

Note that $\sum_{i=1}^{n} p(u_j|v_i) f(\bm{x}_i) = \sum_{i\in\mathcal{N}_j} p(u_j|v_i) f(\bm{x}_i)$. Let $n_j = |\mathcal{N}_j|$; then

$$\Big\|\bm{h}_j - \frac{1}{n_j}\sum_{i\in\mathcal{N}_j}\bm{f}_i\Big\|_2 = \frac{1}{s_j}\Big\|\sum_{i\in\mathcal{N}_j}\Big(p(u_j|v_i) - \frac{s_j}{n_j}\Big)\bm{f}_i\Big\|_2, \qquad (45)$$

where $s_j = \sum_{i=1}^{n} p(u_j|v_i)$. According to the condition, we have

$$\frac{1}{k} - \varepsilon \le p(u_j|v_i) \le \frac{1}{k} + \varepsilon \;\Rightarrow\; \frac{n_j}{k} - n_j\varepsilon \le s_j \le \frac{n_j}{k} + n_j\varepsilon.$$

Accordingly, we have

$$\begin{aligned}
\Big\|\bm{h}_j - \frac{1}{n_j}\sum_{i\in\mathcal{N}_j}\bm{f}_i\Big\|_2
&\le \sum_{i\in\mathcal{N}_j}\frac{1}{s_j}\Big|p(u_j|v_i) - \frac{s_j}{n_j}\Big|\cdot\|\bm{f}_i\|_2 \\
&\le \sum_{i\in\mathcal{N}_j}\frac{\|\bm{f}_i\|_2}{s_j}\max\Big\{\Big|p(u_j|v_i) - \frac{1}{k} + \varepsilon\Big|,\; \Big|p(u_j|v_i) - \frac{1}{k} - \varepsilon\Big|\Big\} \\
&\le \sum_{i\in\mathcal{N}_j}\frac{\|\bm{f}_i\|_2}{s_j}\Big(\Big|p(u_j|v_i) - \frac{1}{k}\Big| + \varepsilon\Big) \\
&\le \sum_{i\in\mathcal{N}_j}\frac{2\|\bm{f}_i\|_2}{s_j}\varepsilon = \mathcal{O}(\varepsilon).
\end{aligned}$$

Hence, the lemma is proved. ∎
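The following sketch (our illustration, not from the paper) checks Lemma 2 numerically: when the connectivity weights of the nodes in $\mathcal{N}_j$ deviate from $1/k$ by at most $\varepsilon$, the probability-weighted anchor of Eq. (44) stays within $\mathcal{O}(\varepsilon)$ of the plain mean over $\mathcal{N}_j$. The embedding dimension, the size of $\mathcal{N}_j$, and $\varepsilon$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, k, eps = 60, 5, 3, 1e-3
F = rng.normal(size=(n, dim))                       # embeddings f(x_i)
members = rng.choice(n, size=20, replace=False)     # N_j: nodes connected to anchor u_j

p = np.zeros(n)                                     # p(u_j | v_i), zero outside N_j
p[members] = 1.0 / k + rng.uniform(-eps, eps, size=members.size)   # |p - 1/k| <= eps

h = (p @ F) / p.sum()                               # probability-weighted anchor, Eq. (44)
mean = F[members].mean(axis=0)                      # plain mean over N_j
print(np.linalg.norm(h - mean))                     # small, scales roughly linearly with eps
```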

With the following lemma, we can prove Theorem 1 by mathematical induction.

Lemma 3.

Suppose that $q_j(\cdot|v_i) \le \varepsilon$ for $j > k$, and that $d(v_i, u_j) \le M$. Then there exists $\delta > 0$ such that, if $\varepsilon \le \delta$ and $v_i$ does not connect with $u_j$, then for any $v'\in\mathcal{N}_j$, $v_i$ and $v'$ share no anchors.

Proof.

At first, let $\mu = \min_i q_k(\cdot|v_i)$; hence $q_k(\cdot|v_i) \ge \mu \ge \varepsilon$. Consider an arbitrary node $v_i$ connected with $u_a$ and $u_b$, and an anchor $u_c$ that is disconnected from $v_i$, and suppose that there exists a node $v_j$ which connects with $u_b$ and $u_c$ but not with $u_a$. According to the triangle inequality, we have

$$\begin{aligned}
\frac{\exp(-d(u_a,u_c))}{\exp(-d(v_i,u_a))} &\le \frac{\exp(-d(v_i,u_c) + d(v_i,u_a))}{\exp(-d(v_i,u_a))} \\
&= \frac{\exp(-d(v_i,u_c))}{\exp(-2d(v_i,u_a))} \\
&= \frac{q(u_c|v_i)}{q(u_a|v_i)\exp(-d(v_i,u_a))} \\
&\le \frac{\varepsilon}{\mu}e^{M}.
\end{aligned}$$

Again according to the triangle inequality,

$$d(v_i,u_a) + \log\frac{\mu}{\varepsilon} - M \le d(u_a,u_c) < d(u_a,u_b) + d(u_b,u_c) < d(v_i,u_a) + d(v_i,u_b) + d(v_j,u_c) + d(v_j,u_b).$$

Therefore,

$$\log\frac{\mu}{\varepsilon} < 4M. \qquad (46)$$

If we define $\delta$ as

$$\delta = \min\{\mu\exp(-4M),\, \mu\}, \qquad (47)$$

then Ineq. (46) results in a contradiction when $\varepsilon \le \delta$. ∎
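Lemma 3 says that once the estimated $q$ is sharp enough, samples that do not share an anchor belong to groups that later updates cannot merge. The sketch below (our own toy construction, not the paper's data) computes the $k$-sparse connectivity matrix for two well-separated groups of samples and anchors and verifies that no sample of one group shares an anchor with any sample of the other; all sizes and coordinates are arbitrary.

```python
import numpy as np

def connectivity(d, k):
    ds = np.sort(d)
    return np.maximum(ds[k] - d, 0.0) / (k * ds[k] - ds[:k].sum())

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])   # two groups of samples
C = np.vstack([rng.normal(0, 0.1, (4, 2)), rng.normal(5, 0.1, (4, 2))])     # two groups of anchors
k = 2
D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)                          # d(v_i, u_j)
P = np.stack([connectivity(D[i], k) for i in range(len(X))])

share = (P > 0).astype(int) @ (P > 0).astype(int).T                         # do (i, i') share an anchor?
print(bool(share[:30, 30:].any()))                                          # False: groups never mix
```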

Proof of Theorem 1.

The proof analyzes each iteration of Step 6 in Algorithm 1. Let $\mu = \min_i p_k^{(0)}(\cdot|v_i)$; then, according to Lemma 1, $|p_j^{(1)}(\cdot|v_i) - \frac{1}{k}| \le \mathcal{O}(\frac{\log(1/\mu)}{\log(1/\varepsilon)})$ provided that $\varepsilon \le \mu$. This conclusion supplies exactly the precondition of Lemma 2.

Let $\bm{\theta}_j = \frac{1}{|\mathcal{N}_j|}\sum_{\bm{x}_i\in\mathcal{N}_j} f(\bm{x}_i)$. According to Lemma 2, we have $\|f^{(1)}(\bm{c}_j) - \bm{\theta}_j\|_2^2 \le \mathcal{O}(\frac{\log(1/\mu)}{\log(1/\varepsilon)})$, and we can further derive that

$$\begin{aligned}
\|f(\bm{x}_i) - f^{(1)}(\bm{c}_j)\|_2 &\le \|f(\bm{x}_i) - \bm{\theta}_j\|_2 + \|f^{(1)}(\bm{c}_j) - \bm{\theta}_j\|_2 \\
&\le \|f(\bm{x}_i) - \bm{\theta}_j\|_2 + \mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big),
\end{aligned}$$

and

$$\begin{aligned}
\|f(\bm{x}_i) - f^{(1)}(\bm{c}_j)\|_2 &\ge \|f(\bm{x}_i) - \bm{\theta}_j\|_2 - \|f^{(1)}(\bm{c}_j) - \bm{\theta}_j\|_2 \\
&\ge \|f(\bm{x}_i) - \bm{\theta}_j\|_2 - \mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).
\end{aligned}$$

Therefore, for $d^{(1)}(v_i,u_j)$, we have

$$\begin{aligned}
\mathcal{O}\Big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\Big) - 2\|f(\bm{x}_i) - \bm{\theta}_j\|_2\,\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big)
&\le d^{(1)}(v_i,u_j) - \|f(\bm{x}_i) - \bm{\theta}_j\|_2^2 \\
&\le \mathcal{O}\Big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\Big) + 2\|f(\bm{x}_i) - \bm{\theta}_j\|_2\,\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).
\end{aligned}$$

Since $d^{(1)}(v_i,u_j)$ is upper-bounded, the above conclusions provide the preconditions of Lemma 3. For simplicity, let $r = \|f(\bm{x}_i) - \bm{\theta}_j\|_2$. Therefore, for any $l \le k$,

$$d^{(1)}_{k+1}(v_i,\cdot) - d^{(1)}_l(v_i,\cdot) \le d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\Big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\Big) + 2r\,\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big),$$

and

$$d^{(1)}_{k+1}(v_i,\cdot) - d^{(1)}_l(v_i,\cdot) \ge d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\Big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\Big) - 2r\,\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).$$

Furthermore, since

$$p^{(2)}(u_j|v_i) = \frac{\big(d^{(1)}_{k+1}(v_i,\cdot) - d^{(1)}(v_i,u_j)\big)_+}{\sum_{l=1}^{k}\big(d^{(1)}_{k+1}(v_i,\cdot) - d^{(1)}_l(v_i,\cdot)\big)} \le \frac{d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\big) + 2r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)}{\sum_{l=1}^{k}\big[d^{(1)}_{k+1}(v_i,\cdot) - r^2\big] - k\,\mathcal{O}\big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\big) - 2kr\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)},$$

we have

$$p^{(2)}(u_j|v_i) - \frac{1}{k} \le \frac{4r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)}{k\big[d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\big) - 2r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)\big]} = \mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).$$

Similarly, we can also infer that

$$p^{(2)}(u_j|v_i) - \frac{1}{k} \ge \frac{-4r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)}{k\big[d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\big) + 2r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)\big]} = -\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).$$

According to Lemma 3, there exists a constant $\delta$ such that, when $\varepsilon \le \delta$, $p^{(2)}(u_j|v_i)$ approximates the sparse and uniform distribution,

$$\Big|p^{(2)}(u_j|v_i) - \frac{1}{k}\Big| \le \mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big) = \mathcal{O}\big(\log^{-1/2}(1/\varepsilon)\big). \qquad (48)$$

Repeating the above argument for every iteration that updates $p^{(t)}(u_j|v_i)$ and $f^{(t)}(\bm{c}_j)$ completes the proof of the theorem. ∎
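To make the statement of Theorem 1 concrete, the following sketch (an idealized simulation of ours, not the actual AnchorGAE training) models a "perfect" retraining step by letting the decoder softmax reproduce the current connectivity distributions exactly, i.e. $d_{ij} = -\log p_{ij}$, and then re-estimates them with the closed form of Eq. (42). Every distribution collapses toward the uniform distribution over its $k$ anchors after the first re-estimation and stays there, which is exactly the degenerate behavior the theorem predicts after a few self-supervised updates.

```python
import numpy as np

def connectivity(d, k):
    ds = np.sort(d)
    return np.maximum(ds[k] - d, 0.0) / (k * ds[k] - ds[:k].sum())

rng = np.random.default_rng(3)
n, m, k, tau = 5, 8, 3, 1e-12
P = np.zeros((n, m))                                   # arbitrary non-uniform k-sparse start
for i in range(n):
    idx = rng.choice(m, size=k, replace=False)
    w = rng.uniform(0.5, 1.5, size=k)
    P[i, idx] = w / w.sum()

for t in range(3):
    D = -np.log(np.maximum(P, tau))                    # "perfect" training: softmax matches P
    P = np.stack([connectivity(D[i], k) for i in range(n)])
    dev = np.abs(np.sort(P, axis=1)[:, -k:] - 1.0 / k).max()
    print(t, dev)                                      # drops toward 0: the collapse of Theorem 1
```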

6 Conclusion

In this paper, we propose an anchor-siamese GCN clustering method with adaptive $\mathcal{O}(n)$ bipartite convolution (AnchorGAE). The generative perspective on weighted graphs helps us build the relationship between non-graph data and graph data, and provides a reconstruction goal. Since the anchor-based bipartite graph factorizes the adjacency matrix, we can rearrange the order of matrix multiplications to accelerate the graph convolution. Experiments show that AnchorGAE consumes less time, especially on MNIST-full, and obtains impressive performance. Ablation experiments also verify the necessity of updating the anchors, the transition probabilities, and the sparsity in a self-supervised way. From both theoretical and empirical perspectives, we show that the simple update would cause a collapse when the model is well-trained, and the designed mechanism that connects different groups in advance is also verified by experiments. Since the pre-images of the anchors are only estimated roughly, we will focus on how to exactly compute them from the embedding space in future work. To further speed up the model, it is also worthwhile to develop stochastic gradient descent for AnchorGAE.

Hongyuan Zhang received the B.E. degree in software engineering from Xidian University, Xi'an, China, in 2019. He is currently pursuing the Ph.D. degree with the School of Computer Science and the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an, China.
Jiankun Shi received the B.E. degree in software engineering from Henan University, Zhengzhou, China, in 2020. He is currently working toward the master's degree with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an, China.
Rui Zhang (M'19) received the Ph.D. degree in computer science from Northwestern Polytechnical University, Xi'an, China, in 2018. He currently serves as an Associate Professor with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an, China.
Xuelong Li (M’02-SM’07-F’12) is a Full Professor with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an, China.