
Non-Graph Data Clustering via $\mathcal{O}(n)$ Bipartite Graph Convolution

Hongyuan Zhang, Jiankun Shi, Rui Zhang, and Xuelong Li (corresponding author). This work is supported by The National Natural Science Foundation of China (No. 61871470). The authors are with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, P.R. China. ©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. E-mail: [email protected], [email protected], [email protected], [email protected].
Abstract

Since the representative capacity of graph-based clustering methods is usually limited by the graph constructed on the original features, it is attractive to explore whether graph neural networks (GNNs), a strong extension of neural networks to graphs, can be applied to augment the capacity of graph-based clustering methods. The core problems mainly come from two aspects. On the one hand, the graph is unavailable in most general clustering scenes, so that how to construct a graph on the non-graph data and the quality of the constructed graph are usually the most important parts. On the other hand, given $n$ samples, graph-based clustering methods usually consume at least $\mathcal{O}(n^{2})$ time to build the graph, and the graph convolution requires nearly $\mathcal{O}(n^{2})$ time for a dense graph and $\mathcal{O}(|\mathcal{E}|)$ for a sparse one with $|\mathcal{E}|$ edges. Accordingly, both graph-based clustering and GNNs suffer from a severe inefficiency problem. To tackle these problems, we propose a novel clustering method, AnchorGAE, with a self-supervised estimation of the graph and an efficient graph convolution. We first show how to convert a non-graph dataset into a graph dataset by introducing a generative graph model and anchors. A bipartite graph is built via generating anchors and estimating the connectivity distributions of original points and anchors. We then show that the constructed bipartite graph can reduce the computational complexity of graph convolution from $\mathcal{O}(n^{2})$ and $\mathcal{O}(|\mathcal{E}|)$ to $\mathcal{O}(n)$. The succeeding steps for clustering can be easily designed as $\mathcal{O}(n)$ operations. Interestingly, the anchors naturally lead to a siamese architecture with the help of the Markov process. Furthermore, the estimated bipartite graph is updated dynamically according to the features extracted by the GNN modules, to promote the quality of the graph by exploiting the high-level information extracted by GNNs. However, we theoretically prove that this self-supervised paradigm frequently results in a collapse, which often occurs after 2-3 update iterations in experiments, especially when the model is well-trained. A specific strategy is accordingly designed to prevent the collapse. The experiments support the theoretical analysis and show the superiority of AnchorGAE.

Index Terms:
Graph Convolution Network, Efficient Clustering, Anchors, Siamese Network, Self-Supervised Learning.

1 Introduction

Graph-based clustering methods [1, 2, 3, 4, 5, 6] are important in the clustering field since they can capture the data topology and group data points non-linearly via constructing a graph. For instance, spectral clustering [1] originates from a relaxed graph-cut problem, e.g., Ratio Cut [2], Normalized Cut [3], Balanced Cut [7], etc. In graph-based methods, it is easier to formally study diverse clustering-related tasks with the help of graph theory [8].

Given $n$ data points, graph-based clustering methods suffer from severe inefficiency in practice due to the construction of graphs, which usually needs $\mathcal{O}(n^{2})$ time without any acceleration tricks. For some methods such as spectral clustering [1], the eigenvalue decomposition, an $\mathcal{O}(n^{3})$ operation, is also required. To alleviate this problem, plenty of variants of spectral clustering have been developed [9, 10, 11]. An important variant is to introduce some representative points [12] to accelerate both graph construction and optimization [13, 14]. Some works [15, 16] recently attempt to extend these methods to multi-view scenarios. However, the graph constructed from the original features may only contain low-level information (i.e., the representative capacity is limited), since the construction is usually based on the original features without any mapping (e.g., neural networks). The limited ability to extract the high-level relationships of data results in inaccurate graphs and thus becomes the bottleneck of graph-based clustering methods.

With the rise of deep learning, the utilization of deep neural networks (DNN) is widely regarded as an effective way to exploit deep information of data via extracting latent features [17]. Some models [18, 19, 20] based on convolutional neural networks (CNN) have achieved impressive improvements on clustering, but CNN cannot be applied to non-Euclidean data [21], such as social networks, recommendation systems, word vectors, data mining data, etc. Compared with Euclidean data (e.g., images, videos, etc.), the most significant difference of non-Euclidean data is that the relationships of features are not regular (or even do not exist). This indicates that the classical convolution operation cannot be utilized, and the existing CNN-based clustering models cannot be applied to general clustering scenes. Although some deep clustering models [22, 23, 24, 18, 19] have been proposed, most methods aim to introduce some widely used modules from supervised scenarios to clustering, and few models focus on how to utilize deep models to further promote the capacity of graph-based methods. SpectralNet [25] is an important attempt to connect neural networks and classical spectral clustering. It utilizes neural networks to promote capacity and simulate spectral decomposition.

Figure 1: The core idea of AnchorGAE. First, we iteratively calculate the connectivity distribution and anchors. Then they are used as the input of a siamese GNN for acceleration. After training the network, the anchors and connectivity distribution are updated according to the embedding. The two steps are repeated multiple times until the high-level information is exploited.

Graph neural networks (GNNs) [26, 27, 28, 29, 30, 31], which attempt to extend classical neural networks to graph-type data while considering the graph structure, have attracted much attention. GNN is designed for graph-type data, such as social networks, recommendation systems, citation networks, etc. In these practical scenarios, links are easy and cheap to obtain, and graphs are provided as prior information. For example, in citation networks, each paper is viewed as a node in the graph and the citation relationships, which are easily available, are modeled as links. As GNNs have shown impressive capacity and performance on graph-type data, it is attractive to employ GNNs to promote the capacity of graph-based clustering methods. Nevertheless, there are two major hurdles. On the one hand, in most scenarios (e.g., text, signals, etc.), graphs are not provided as prior information, which indicates that the existing GNN-based models for node clustering cannot be applied to these non-graph data without the artificial construction of graphs. This goes back to the same dilemma as graph-based clustering, i.e., how to construct a graph containing enough information. On the other hand, inefficiency is also a well-known problem for GNNs [32, 33]. Forward propagation of each node in a GNN layer relies on neighboring samples, such that the computational complexity of the graph convolution increases exponentially with the growth of layers. For a dense graph, the complexity of a multi-layer GNN is almost $\mathcal{O}(n^{2})$. For a sparse one, the computational cost becomes $\mathcal{O}(|\mathcal{E}|)$, where $|\mathcal{E}|$ represents the number of edges. Accordingly, each training step approximately depends on all data points and the training is quite time-consuming. It also results in the unavailability of batch gradient descent. To accelerate GNNs, some works focus on how to apply batch gradient descent [34, 35, 36, 32]. A similar inefficiency problem also exists in graph-based clustering.

The main motivation of this paper is how to efficiently improve the capacity of graph-based clustering via GNNs on non-graph data. The efficiency guarantees the practicability, and the non-graph property of the data ensures the universality of our model. Although SDCN [37] aims to extend GNNs to the data clustering task, it simply constructs KNN graphs and thus fails to exploit the high-level information of data. The major contributions of the proposed model, namely AnchorGAE, are listed as follows:

  • As no graph is provided in general, we aim to construct the graph via estimating the connectivity distribution of each sample from the perspective of generative models. Under this generative perspective, anchors are introduced to accelerate the graph construction and the computation of GNNs. The produced anchors and connectivity distributions lead to a bipartite graph convolution operation so that the computational complexity of each forward propagation of GNNs is reduced from $\mathcal{O}(n^{2})$ or $\mathcal{O}(|\mathcal{E}|)$ to $\mathcal{O}(n)$.

  • With the help of the Markov process and the reconstruction goal provided by generative graph models, a siamese architecture is naturally designed to transform anchors into the embedding space.

  • We theoretically prove that a collapse, where points are split into plenty of tiny groups and these groups scatter randomly, may happen due to the self-supervised update of the graph (and connectivity distributions), especially when the model is well-trained. We therefore devise a specific strategy to update the graph to prevent the collapse. The experiments also support the theoretical analysis and verify the effectiveness of the designed strategy.

2 Preliminary and Related Work

In this section, we introduce the existing work about graph neural networks and fast clustering algorithms.

2.1 Preliminary

Let $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{W})$ be a graph where $\mathcal{V}$, $\mathcal{E}$, and $\mathcal{W}$ represent nodes, edges, and weights, respectively. For an unweighted graph, $\mathcal{W}$ can be ignored since the weights of edges are all regarded as 1. All matrices and vectors are represented by upper-case letters and lower-case letters in boldface, respectively. $|\cdot|$ represents the size of a set and $(\cdot)_{+}=\max(\cdot,0)$. For a real number $x$, $\lfloor x\rfloor$ represents the maximum integer that is not larger than $x$. $\textbf{1}_{n}$ denotes an $n$-dimensional vector whose entries all equal 1. In computer science, the adjacency matrix, which is denoted by $\bm{A}$ in this paper, is a common way to represent the edges of a graph. Specifically, for an unweighted graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, $A_{ij}=1$ if $\langle v_{i},v_{j}\rangle\in\mathcal{E}$ and $A_{ij}=0$ otherwise. A popular GNN model is the graph convolution network (GCN) [28], where the graph convolution layer is formulated as [28]

\bm{H}=\varphi(\bm{L}\bm{X}\bm{W}), \quad (1)

where $\varphi(\cdot)$ is an activation function, $\bm{X}\in\mathbb{R}^{n\times d}$ is the representation of data points, $\bm{W}\in\mathbb{R}^{d\times d^{\prime}}$ denotes the coefficients to learn, $\bm{L}=\bm{D}^{-\frac{1}{2}}(\bm{A}+\bm{I})\bm{D}^{-\frac{1}{2}}$ is often viewed as a variant of the normalized Laplacian matrix, and $\bm{D}$ is the degree matrix of $(\bm{A}+\bm{I})$. Specifically, $\bm{D}$ is a diagonal matrix with $D_{ii}=\sum_{j=1}^{n}(A_{ij}+1)$.
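As a concrete illustration of Eq. (1), a minimal NumPy sketch of one GCN layer is given below; the function name, the dense-matrix representation, and the choice of activation are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def gcn_layer(A, X, W, activation=np.tanh):
    """One GCN layer H = phi(L X W) as in Eq. (1), with
    L = D^{-1/2} (A + I) D^{-1/2} and D the degree matrix of A + I."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                      # add self-loops
    deg = A_hat.sum(axis=1)                    # D_ii = sum_j (A_ij + 1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = D_inv_sqrt @ A_hat @ D_inv_sqrt        # normalized propagation matrix
    return activation(L @ X @ W)
```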

2.2 Graph Neural Networks

In recent years, how to effectively extract features of graphs has attracted a lot of attention [28, 31]. Inspired by CNN, a few works [27, 38, 28] attempt to investigate convolution operations on graph-type data. With the rise of the attention mechanism, GAT [29, 39, 40] introduces self-attention into graph neural networks. Meanwhile, GNNs can be applied to sequential tasks [41, 30], which is known as the spatial-temporal graph neural network (STGNN). Some theoretical works [42, 43, 44, 45, 46, 47] have also been investigated in recent years. Some works [46] study the impact of over-smoothing, while [47] proposes over-squashing. To extend GCN [28] to unsupervised learning, graph auto-encoders [48, 49, 50, 51, 52], based on the auto-encoder [53, 54], have been developed. Different from traditional auto-encoders, GCN [28] is used as the encoder and the decoder calculates inner products of the embedding to reconstruct the graph. To build a symmetric architecture, GALA [50] designs a sharpening operator since the graph convolution is equivalent to a smoothing operator. SDCN [37] has tried to utilize GNNs for clustering, while it just constructs a KNN graph as preprocessing and suffers from the inefficiency problem as well. Although GNNs have achieved outstanding performance [28, 55], they require too much memory and time. Some methods [33, 34, 36, 35, 56] for accelerating training and reducing memory consumption have been proposed. For instance, SGC [35] attempts to accelerate training by removing non-linear activations in GCN, which reduces the number of parameters. More importantly, it compresses multiple layers into one layer such that stochastic gradient descent can be used. GraphSAGE [33] attempts to sample part of the neighbors of each node to reduce the computational complexity, while FastGCN [32] samples nodes independently. ClusterGCN [34] tries to find an approximate graph with multiple connected components via METIS [57] to employ batch gradient descent.

2.3 Clustering on Large Scale Data

In traditional clustering, $k$-means [58] works efficiently on large-scale data, but it can only deal with spherical data, which limits its application. By virtue of deep learning, deep clustering [22, 17, 23, 18, 19, 59] often aims to learn expressive representations and then feed the representations into a simple clustering module. The whole optimization is based on batch gradient descent so that these methods are still available on large-scale data. However, they frequently suffer from over-fitting, and the performance heavily relies on the tuning of neural networks. For graph-based clustering methods [3, 2, 1], both the computational complexity and the space complexity of graph construction are $\mathcal{O}(n^{2})$. In particular, solving spectral clustering needs eigenvalue decomposition, which is also time-consuming. To apply spectral clustering to large-scale data, a lot of variants [12, 9, 10, 11] have been developed. Anchor-based methods [12, 13, 14] are important extensions of spectral clustering. The core idea is to construct a graph implicitly via some representative points, which are usually named anchors. By constructing a bipartite graph between samples and anchors, they only need to perform singular value decomposition on the Laplacian of the bipartite graph [13]. Anchor-based spectral clustering provides an elegant scheme to improve the efficiency of graph-based models.

Remark 1.

Although several CNN-based clustering methods [19, 18, 20] have achieved good performance, they focus on vision data and their main improvement comes from the classical convolution operator, so that they cannot be applied to non-Euclidean data, e.g., word vectors, data mining data, etc. Therefore, the CNN-based methods are not used as the main competitors in this paper. It should be emphasized that CNN-based methods and AnchorGAE are not mutually exclusive. AnchorGAE aims to find a general paradigm to improve the clustering baseline under the most general settings. It may be an attractive topic in the future to extend the core idea of AnchorGAE to vision-specific clustering by combining CNNs.

3 Methodology

In this section, we first revisit the main problems that we attempt to address in this paper. In Section 3.2, we show how to construct a bipartite graph on the non-graph data from a generative perspective. The probability perspective is vital to the derivation of the fast bipartite graph convolution and the reconstruction goal of the unsupervised model. In Section 3.3, the core idea of AnchorGAE and its efficiency are elaborated. In Section 3.4, we will find that a siamese structure is naturally induced by the bipartite graph convolution. The rigorous analysis of the degeneration caused by the simple update of graphs is provided in Section 3.5. Finally, the methods to obtain the final clustering assignments are discussed. The core idea of AnchorGAE is illustrated in Figure 1.

3.1 Revisit the Critical Problems

Before elaborating on how to efficiently improve the capacity of graph-based clustering via GNNs on non-graph data, we introduce the major hurdles respectively.

No Prior Graph: In clustering scenarios, only features are given as prior information. The graph is usually constructed as preprocessing in graph-based clustering methods. The key to performance is the quality of the constructed graph. When the scattering of data is too complicated to precisely capture the key information based on the original data (which is the main bottleneck of most existing graph-based clustering methods), it is a rational scheme to map the data into a latent space and then build a more precise graph in this space. As we focus on graph-based models, GNNs are a more rational scheme compared with classical neural networks. Moreover, the utilization of the estimated graph (via GNNs) to update the graph leads to a self-supervised procedure. The graph construction usually results in $\mathcal{O}(n^{2})$ costs, which also causes the inefficiency of graph-based clustering methods. The scheme to solve this problem is elaborated in Section 3.2.

Inefficiency of GNNs: In the forward propagation of GNNs, computing the hidden feature of one node usually requires its neighboring nodes. This computation results in the inefficiency of GNNs on large graphs [60], especially with the increase of depth. In this paper, we simply focus on GCN, the widely used variant of GNN, since the theoretical analysis does not depend on specific GNN modules. For instance, it requires $\mathcal{O}(|\mathcal{E}|d)$ time to compute $\bm{L}\bm{X}$ in the forward propagation. If $\mathcal{G}$ is sparse enough, the computational complexity is acceptable. If there are plenty of edges in $\mathcal{G}$, the computational complexity can be roughly regarded as $\mathcal{O}(n^{2}d)$, so that GCNs cannot be applied in this case. We show how to reduce it to $\mathcal{O}(n)$ in Sections 3.3 and 3.4.

Inefficiency of Graph-Based Clustering Methods: Even after the $\mathcal{O}(n^{2})$ graph construction, the computational complexity reaches $\mathcal{O}(n^{3})$ due to the eigenvalue decomposition. The problem is easy to address after introducing the $\mathcal{O}(n)$ bipartite graph convolution, and the details can be found in Section 3.6.

3.2 Generative Perspective for Bipartite Graphs

By virtue of the generative perspective introduced in this section, the probability transition matrix is accordingly built so that the Markov process can be used to implement the efficient bipartite graph convolution, which is elaborated in Section 3.3. The generative perspective also provides a training goal, reconstructing the distribution, for AnchorGAE.

3.2.1 Generative Perspective for Graphs

For a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, an edge $\langle v_{i},v_{j}\rangle\in\mathcal{E}$ can be regarded as a sampling from an underlying connectivity distribution $p(\cdot|v_{i})$ of $v_{i}$, where $v_{i}$ denotes the $i$-th node. Note that $\sum_{j=1}^{n}p(v_{j}|v_{i})=1$ where $n=|\mathcal{V}|$. Accordingly, a graph can be viewed as multiple samplings from the connectivity distributions $\{p(\cdot|v_{i})\}_{i=1}^{n}$. Before the further discussion of weighted graphs, we clarify the difference between the generative perspective and the discriminative perspective. In the discriminative view, the existence of an edge is certain and the model should train a classifier to predict $p(\langle v_{i},v_{j}\rangle\in\mathcal{E})$ and $p(\langle v_{i},v_{j}\rangle\notin\mathcal{E})$. Note that $\sum_{j}p(\langle v_{i},v_{j}\rangle\in\mathcal{E})\neq 1$ in most cases. In this paper, the generative view provides an elegant Bayesian framework to simultaneously capture the local topology and offer a reconstruction goal for the GAE.

With the generative perspective, for a weighted graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{W})$ where $\mathcal{W}$ represents the weights on edges, $\mathcal{W}$ can be viewed as an estimation of the underlying connectivity distributions. From the generative perspective, the underlying connectivity distributions can be supposed to exist in any dataset, while they are agnostic on non-graph datasets. In other words, the type of data does not affect the existence of the connectivity distributions. The vital idea that allows AnchorGAE to be applied to non-graph data is to estimate these distributions. Besides, they are also the target to be reconstructed by our unsupervised model. The following assumption helps to estimate the connectivity distributions.

Assumption 1.

Given an ideal metric of difference between two nodes $d(v_{i},v_{j})$, $p(v_{j}|v_{i})$ is negatively related to $d(v_{i},v_{j})$, i.e., $p(v_{j}|v_{i})\varpropto-d(v_{i},v_{j})$. An ideal metric indicates $\forall d(v_{i},v_{j})\leq d(v_{i},v_{l}),~p(v_{j}|v_{i})\geq p(v_{l}|v_{i})$.

In this paper, each node is described by a $d$-dimensional vector, i.e., $\bm{x}_{i}\in\mathbb{R}^{d}$. Therefore, the metric can be reformulated as

d(v_{i},v_{j})=\|f(\bm{x}_{i})-f(\bm{x}_{j})\|_{2}^{2}, \quad (2)

where $f(\cdot)$ represents an ideal mapping of the raw features. The implementation is discussed in Section 3.3.

Remark 2.

Although the above assumption seems too strong to hold on the raw data representation, the goal of our model is to find an approximately ideal mapping function $f(\cdot)$ to build a well-structured graph.

3.2.2 Bipartite Graph Simplification with Anchors

To efficiently apply graph convolution, we attempt to simplify the graph via some representative points, which are named anchors or landmarks. Let $\mathcal{U}$ denote the anchors, where $|\mathcal{U}|=m$ and the $i$-th anchor is also described by a $d$-dimensional vector $\bm{c}_{i}$. Then, we can get a new bipartite graph with anchors, $\mathcal{G}_{a}=(\mathcal{V}_{a},\mathcal{E}_{a},\mathcal{W}_{a})$. Specifically, $\mathcal{V}_{a}=\mathcal{V}\cup\mathcal{U}$ and $\mathcal{W}_{a}$ is the corresponding weights of $\mathcal{V}_{a}$. The bipartite property means that only edges between $\mathcal{V}$ and $\mathcal{U}$ are allowed in $\mathcal{E}_{a}$. This bipartite property is from the following assumption.

Assumption 2.

Given an ideal set of anchors $\mathcal{U}$, $p(v_{j}|v_{i})$ can be constructed by $\{p(u_{l}|v_{i})\}_{l=1}^{m}$ and $\{p(v_{j}|u_{l})\}_{l=1}^{m}$.

Since both connectivity distributions and anchors are unknown in general data, we need to estimate them. First, we intend to estimate the anchors and $p(u_{j}|v_{i})$ alternately by

\min\limits_{\mathcal{U},p(\cdot|v_{i})}\sum\limits_{i=1}^{n}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}d(v_{i},u_{j}), \quad (3)

according to Assumption 1. Specifically, in the ideal metric space introduced by Assumption 1, the expected divergence between $\mathcal{V}$ and $\mathcal{U}$ should be minimized. Note that we only focus on the estimation of $p(u_{j}|v_{i})$ and overlook $p(v_{i}|u_{j})$ for now. However, the above problem has a trivial solution, i.e.,

p_{\dagger}(u_{j}|v_{i})=\left\{\begin{array}{l l}1,&j=\arg\min_{j}d(v_{i},u_{j}),\\ 0,&\text{else}.\end{array}\right. \quad (4)

It indicates that each original node (from $\mathcal{V}$) only connects with its nearest anchor (from $\mathcal{U}$). To avoid such ill connectivity distributions and anchor graphs, we turn to the following regularized problem,

\min\limits_{\mathcal{U},p(\cdot|v_{i})}\sum\limits_{i=1}^{n}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}d(v_{i},u_{j})+\ell(p(\cdot|v_{i}),\pi(\cdot|v_{i})), \quad (5)

where $\pi(\cdot|v_{i})$ represents the uniform distribution and $\ell(\cdot,\cdot)$ is a metric of two discrete distributions. Intuitively, $p(u_{j}|v_{i})$ should be sparse such that anchors far from $v_{i}$ are ignored. In practice, the Kullback-Leibler divergence (KL-divergence) is usually used as $\ell(\cdot,\cdot)$ to measure the divergence of two probability distributions. However, it will return a dense solution of $p(\cdot|v_{i})$. Instead, we employ a simple measure as

\min\limits_{\mathcal{U},p(\cdot|v_{i})}\sum\limits_{i=1}^{n}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}d(v_{i},u_{j})+\gamma_{i}\sum\limits_{j=1}^{m}(p(u_{j}|v_{i})-\pi(u_{j}|v_{i}))^{2}. \quad (6)

Let $d_{k}(v_{i},\cdot)$ be the $k$-th smallest value of $\{d(v_{i},u_{j})\}_{j=1}^{m}$. If $\gamma_{i}=\frac{1}{2}(kd_{k+1}(v_{i},\cdot)-\sum_{l=1}^{k}d_{l}(v_{i},\cdot))$, then $p(\cdot|v_{i})$ will be $k$-sparse (only $k$ entries are non-zero). Meanwhile, the problem can be transformed into the form studied in [4] and solved by

p(u_{j}|v_{i})=\Big(\frac{d_{k+1}(v_{i},\cdot)-d(v_{i},u_{j})}{\sum_{l=1}^{k}(d_{k+1}(v_{i},\cdot)-d_{l}(v_{i},\cdot))}\Big)_{+}. \quad (7)

In other words, the hyper-parameter $\gamma_{i}$ is converted to the sparsity $k$, which is much easier to tune.
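The closed-form solution in Eq. (7) is simple enough to transcribe directly; the following NumPy sketch (with an illustrative function name) computes the $k$-sparse connectivity distribution of one sample from its squared distances to the $m$ anchors, assuming the $(k+1)$-th smallest distance is strictly larger than the $k$-th.

```python
import numpy as np

def sparse_connectivity(dist_row, k):
    """k-sparse distribution p(. | v_i) from Eq. (7).
    dist_row: (m,) squared distances d(v_i, u_j) to the anchors."""
    d_sorted = np.sort(dist_row)
    d_k1 = d_sorted[k]                          # (k+1)-th smallest distance
    p = np.maximum(d_k1 - dist_row, 0.0)        # (d_{k+1} - d_ij)_+
    denom = k * d_k1 - d_sorted[:k].sum()       # sum_l (d_{k+1} - d_l)
    return p / denom
```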

Algorithm 1 Iterative update of the connectivity distributions and anchors, i.e., optimization of problem (6).
Input: $\{f(\bm{x}_{i})\}_{i=1}^{n}$, $\{f^{(0)}(\bm{c}_{j})\}_{j=1}^{m}$, sparsity $k$.
  $t=0$.
  $d^{(0)}(v_{i},u_{j})=\|f(\bm{x}_{i})-f^{(0)}(\bm{c}_{j})\|_{2}^{2}$.
  repeat
     $t=t+1$.
     $p^{(t)}(u_{j}|v_{i})=\Big(\frac{d^{(t-1)}_{k+1}(v_{i},\cdot)-d^{(t-1)}(v_{i},u_{j})}{\sum_{l=1}^{k}(d^{(t-1)}_{k+1}(v_{i},\cdot)-d_{l}^{(t-1)}(v_{i},\cdot))}\Big)_{+}$.
     $f^{(t)}(\bm{c}_{j})=\frac{\sum_{i=1}^{n}p^{(t)}(u_{j}|v_{i})f(\bm{x}_{i})}{\sum_{i=1}^{n}p^{(t)}(u_{j}|v_{i})}$.
     $d^{(t)}(v_{i},u_{j})=\|f(\bm{x}_{i})-f^{(t)}(\bm{c}_{j})\|_{2}^{2}$.
  until convergence.
  $f(\bm{c}_{j})=f^{(t)}(\bm{c}_{j})$, $p(\cdot|v_{i})=p^{(t)}(\cdot|v_{i})$.
Output: $f(\bm{c}_{j})$ and $p(\cdot|v_{i})$.

When $p(u_{j}|v_{i})$ is fixed, the anchors are computed by solving the following problem,

\min\limits_{f(\bm{c}_{j})}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}\|f(\bm{x}_{i})-f(\bm{c}_{j})\|_{2}^{2}, \quad (8)

where $\bm{c}_{j}$ is the representation of anchor $u_{j}$ in the original space. If we take the derivative of the above objective with respect to $f(\bm{c}_{j})$,

\nabla_{f(\bm{c}_{j})}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}\|f(\bm{x}_{i})-f(\bm{c}_{j})\|_{2}^{2}=2\sum\limits_{i=1}^{n}\big(p(u_{j}|v_{i})f(\bm{c}_{j})-p(u_{j}|v_{i})f(\bm{x}_{i})\big), \quad (9)

and set it to 0, then we have

f(\bm{c}_{j})=\frac{\sum_{i=1}^{n}p(u_{j}|v_{i})f(\bm{x}_{i})}{\sum_{i=1}^{n}p(u_{j}|v_{i})}. \quad (10)

In sum, problem (6) can be solved iteratively through Eq. (7) and (10), which is summarized as Algorithm 1.
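A compact NumPy sketch of Algorithm 1 is given below, vectorizing Eqs. (7) and (10) over all samples and anchors; the fixed iteration count and the small epsilon guarding empty anchors are our own assumptions, not part of the paper.

```python
import numpy as np

def update_graph(F, C0, k, n_iter=20, eps=1e-12):
    """Alternating optimization of problem (6) (Algorithm 1).
    F: (n, d) embedded samples f(x_i); C0: (m, d) initial anchors f(c_j).
    Returns B with B[i, j] = p(u_j | v_i) and the updated anchors C."""
    C = C0.copy()
    for _ in range(n_iter):
        # d(v_i, u_j) = ||f(x_i) - f(c_j)||^2
        D = ((F[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        d_sorted = np.sort(D, axis=1)
        d_k1 = d_sorted[:, k:k + 1]                       # (k+1)-th smallest per row
        B = np.maximum(d_k1 - D, 0.0)                     # Eq. (7), numerator
        B /= k * d_k1 - d_sorted[:, :k].sum(1, keepdims=True)
        # Eq. (10): anchors as probability-weighted means of the samples
        C = (B.T @ F) / (B.sum(0)[:, None] + eps)
    return B, C
```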

Build a Generative Bipartite Graph by $\{p(u_{j}|v_{i})\}_{i,j}$ and $\{f(\bm{c}_{j})\}_{j}$: We now turn to model $p(v_{i}|u_{j})$, which is used to compute $p(v_{j}|v_{i})$, the probability between two raw nodes. Rather than solving a problem like problem (6), $p(v_{i}|u_{j})$ is set by a simple normalization step,

p(v_{i}|u_{j})=\frac{p(u_{j}|v_{i})}{\sum_{l=1}^{n}p(u_{j}|v_{l})}, \quad (11)

via utilizing $p(u_{j}|v_{i})$ calculated by Eq. (7). To show how the above formulations construct a bipartite graph and to simplify the succeeding discussion of the fast convolution, we further reformulate $p(u_{j}|v_{i})$ and $p(v_{i}|u_{j})$ in matrix form. Let $\bm{B}\in\mathbb{R}^{n\times m}$ be a matrix where $b_{ij}=p(u_{j}|v_{i})$. Then

\bm{T}=\left[\begin{array}{c c}\bm{0}&\bm{B}\\ \bm{B}^{T}&\bm{0}\end{array}\right]~\text{and}~\bm{D}_{a}=\left[\begin{array}{c c}\bm{I}&\bm{0}\\ \bm{0}&\bm{\Delta}\end{array}\right] \quad (12)

represent an unnormalized bipartite graph and its degree matrix, respectively. Note that $\bm{T}$ does not conform to the generative perspective for graphs, though it is a bipartite graph. According to Eq. (11), the generative bipartite graph can be defined as

\bm{P}=\bm{D}_{a}^{-1}\bm{T}=\left[\begin{array}{c c}\bm{0}&\bm{B}\\ \bm{\Delta}^{-1}\bm{B}^{T}&\bm{0}\end{array}\right]\in\mathbb{R}^{(n+m)\times(n+m)}, \quad (13)

which can also be regarded as a probability transition matrix.
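For illustration only, the matrices of Eqs. (12) and (13) can be materialized as follows; in AnchorGAE they never need to be formed explicitly, since $\bm{B}$ and the anchor degrees $\bm{\Delta}$ are sufficient.

```python
import numpy as np

def bipartite_transition(B):
    """Build T, D_a, and P of Eqs. (12)-(13) from B[i, j] = p(u_j | v_i)."""
    n, m = B.shape
    delta = B.sum(axis=0)                                 # diagonal of Delta
    T = np.block([[np.zeros((n, n)), B],
                  [B.T, np.zeros((m, m))]])
    D_a = np.diag(np.concatenate([np.ones(n), delta]))
    P = np.block([[np.zeros((n, n)), B],                  # P = D_a^{-1} T
                  [B.T / delta[:, None], np.zeros((m, m))]])
    return T, D_a, P
```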

Algorithm 2 Optimization of AnchorGAE.
Input: Raw features $\bm{X}$, initial sparsity $k_{0}$, the estimated upper bound of sparsity $k_{m}$, maximum epochs $E$.
1:  Initialize $\bm{B}$ and $\bm{C}$ via solving problem (6).
2:  Initialize $\bm{A}_{a}$ by Eq. (15).
3:  $k\leftarrow k_{0}$.
4:  for $i=1,2,\cdots,E$ do
5:     # Obtain the mapping $f(\cdot)$.
6:     Update the parameters of the GCN layers by gradient descent.
7:     Update $\bm{B}$ and $f(\bm{c}_{j})$ on $\bm{Z}$ and $\bm{Z}_{t}$ via Algorithm 1.
8:     Update the anchors in the original space for the bipartite graph convolution: $\bm{c}_{j}=f^{-1}(f(\bm{c}_{j}))=\bm{X}^{T}\bm{b}_{j}$.
9:     Update $\bm{A}_{a}$ by Eq. (15).
10:    $k\leftarrow k+\Delta k$.
11: end for
12: Perform fast clustering on $\bm{Z}$.
Output: Clustering assignments and embedding $\bm{Z}$.

3.3 Efficient $\mathcal{O}(n)$ Bipartite Graph Convolution

As the goal is to capture the structural information, a multi-layer GCN is a rational scheme to implement $f(\bm{x}_{i})$, which is defined in Eq. (2). In this subsection, we show why the bipartite graph convolution accelerates GCN.

We utilize the bipartite graph defined in Eq. (12) to accelerate the graph convolution. By virtue of the Markov process [61], $p(v_{j}|v_{i})$ can be obtained by the one-step transition probability, i.e.,

p(v_{j}|v_{i})=\sum_{l=1}^{m}p(v_{j}|u_{l})p(u_{l}|v_{i}). \quad (14)

Similarly, $p(u_{j}|u_{i})=\sum_{l=1}^{n}p(u_{j}|v_{l})p(v_{l}|u_{i})$. Formally, the constructed graphs of $\mathcal{V}$ and $\mathcal{U}$ are defined as

\bm{A}_{a}=\bm{P}^{2}=\left[\begin{array}{c c}\bm{B}\bm{\Delta}^{-1}\bm{B}^{T}&\bm{0}\\ \bm{0}&\bm{\Delta}^{-1}\bm{B}^{T}\bm{B}\end{array}\right]=\left[\begin{array}{c c}\bm{A}&\bm{0}\\ \bm{0}&\bm{A}_{t}\end{array}\right]. \quad (15)

Accordingly, $\bm{A}$ is the weighted adjacency matrix of the $n$ samples constructed via the anchors, and $\bm{A}_{t}$ is the weighted adjacency matrix of the $m$ anchors.

Remark 3.

In GCNs, the graph is conventionally preprocessed by adding self-loops as shown in Eq. (1). The graph constructed by Eq. (15) is equivalent to one with adaptive self-loops, since $\bm{A}_{ii}$ and $(\bm{A}_{t})_{ii}$ are non-zero in most cases.

Therefore, $\bm{A}$ can be directly used as a part of the graph convolution operator. Since

\bm{A}\textbf{1}=\bm{B}\bm{\Delta}^{-1}\bm{B}^{T}\textbf{1}=\bm{B}\textbf{1}=\textbf{1}, \quad (16)

the degree matrix of $\bm{A}$ is the identity matrix, i.e., $\bm{L}=\bm{A}$. Therefore, the output of GCN is formulated as

\bm{H}=\varphi(\bm{B}\bm{\Delta}^{-1}\bm{B}^{T}\bm{X}\bm{W}). \quad (17)

Here, we explain why the above equation accelerates the operation. It should be emphasized that $\bm{A}$ is not required to be computed explicitly. The core is to rearrange the order of matrix multiplications as follows

\bm{T}_{1}=\bm{B}^{T}\bm{X}~\Rightarrow~\bm{T}_{2}=\bm{\Delta}^{-1}\bm{T}_{1}~\Rightarrow~\bm{T}_{3}=\bm{T}_{2}\bm{W}~\Rightarrow~\bm{H}=\varphi(\bm{B}\bm{T}_{3}). \quad (18)

According to the above order of computation, the computational complexity is reduced to $\mathcal{O}(nmd+m^{2}d+mdd^{\prime}+nmd^{\prime})$. Since $d$, $d^{\prime}$, and $m$ are usually much smaller than $n$, the complexity can be regarded as $\mathcal{O}(nm)$. Moreover, if the number of anchors is small enough, the required time of each forward propagation of a GCN layer is $\mathcal{O}(n)$.
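The factorized computation of Eq. (18) can be sketched in a few lines of NumPy; the function below is an illustrative reimplementation, not the released code, and assumes a single layer with a generic activation.

```python
import numpy as np

def bipartite_gcn_layer(B, X, W, activation=np.tanh):
    """O(n) bipartite graph convolution H = phi(B Delta^{-1} B^T X W)
    of Eq. (17), computed in the order of Eq. (18) so that the dense
    n x n matrix A = B Delta^{-1} B^T is never formed."""
    delta = B.sum(axis=0)              # anchor degrees, diagonal of Delta
    T1 = B.T @ X                       # (m, d):  O(nmd)
    T2 = T1 / delta[:, None]           # (m, d):  Delta^{-1} T1
    T3 = T2 @ W                        # (m, d'): O(mdd')
    return activation(B @ T3)          # (n, d'): O(nmd')
```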

3.4 Siamese Architecture for Anchor Mapping

After showing the accelerated GCN mapping based on anchors and the bipartite graph, we discuss more details of the whole procedure in this section, especially the changes of architecture brought by anchors.

As the task is unsupervised, the whole architecture is based on graph auto-encoder (GAE). The encoder consists of multiple bipartite GCN layers defined in Eq. (17),

\bm{Z}=f(\bm{X})=\varphi_{L}(\bm{A}\varphi_{L-1}(\cdots\varphi_{1}(\bm{A}\bm{X}\bm{W}_{1})\cdots)\bm{W}_{L}), \quad (19)

where $L$ denotes the number of layers. The embedding (denoted by $\bm{Z}$), which is produced by the encoder, is a non-linear mapping of the raw representations of $\mathcal{V}$.

However, the above encoder only transforms $\mathcal{V}$, and there is no edge between $\mathcal{V}$ and $\mathcal{U}$. In other words, $\mathcal{U}$ cannot be projected via the network with the adjacency $\bm{A}$. Meanwhile, GCN layers suffer from the out-of-sample problem. Accordingly, $\mathcal{U}$ cannot be directly projected into the deep feature space by the encoder. To map anchors into the same feature space, a siamese [62] encoder is thus designed. The encoder for anchors shares the same parameters but has its own graph structure. Surprisingly, the one-step probability transition matrix defined in Eq. (15) gives us a graph that is only composed of anchors, $\bm{A}_{t}=\bm{\Delta}^{-1}\bm{B}^{T}\bm{B}$. As

\bm{A}_{t}\textbf{1}=\bm{\Delta}^{-1}\bm{B}^{T}\bm{B}\textbf{1}=\bm{\Delta}^{-1}\bm{B}^{T}\textbf{1}=\textbf{1}, \quad (20)

the corresponding degree matrix, $\bm{D}_{t}$, is also the identity matrix, i.e.,

\bm{L}_{t}=\bm{D}_{t}^{-\frac{1}{2}}\bm{A}_{t}\bm{D}_{t}^{-\frac{1}{2}}=\bm{A}_{t}. \quad (21)

Accordingly, the embedding of anchors is formulated as

\bm{Z}_{t}=f(\bm{C})=\varphi_{L}(\bm{A}_{t}\varphi_{L-1}(\cdots\varphi_{1}(\bm{A}_{t}\bm{C}\bm{W}_{1})\cdots)\bm{W}_{L}), \quad (22)

where $\bm{C}\in\mathbb{R}^{m\times d}$ represents the $m$ anchors.
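A sketch of the siamese encoder follows: samples and anchors share the weight matrices $\{\bm{W}_l\}$ but propagate over $\bm{A}$ and $\bm{A}_t$ respectively, both evaluated in the factorized form; using the same activation for every layer is a simplification (the paper sets the last layer to linear).

```python
import numpy as np

def siamese_encode(B, X, C, weights, activation=np.tanh):
    """Shared-weight encoders of Eqs. (19) and (22).
    Samples use A = B Delta^{-1} B^T, anchors use A_t = Delta^{-1} B^T B."""
    delta = B.sum(axis=0)
    Z, Zt = X, C
    for W in weights:
        # samples:  A Z W   = B (Delta^{-1} (B^T Z)) W
        Z = activation(B @ (((B.T @ Z) / delta[:, None]) @ W))
        # anchors:  A_t Zt W = (Delta^{-1} (B^T (B Zt))) W
        Zt = activation(((B.T @ (B @ Zt)) / delta[:, None]) @ W)
    return Z, Zt
```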

Loss: As the underlying connectivity distributions have been estimated, we use them, rather than the adjacency matrix used in existing GAEs, as the target for the decoder to rebuild. According to Assumption 1, the connectivity probability should be reconstructed according to the Euclidean distances between $\mathcal{V}$ and $\mathcal{U}$. Formally speaking, the decoder is

q(u_{j}|v_{i})=\frac{\exp(-d(v_{i},u_{j}))}{\sum_{l=1}^{m}\exp(-d(v_{i},u_{l}))}. \quad (23)

Instead of using the mean-square error (MSE) to reconstruct $\bm{A}$, AnchorGAE intends to reconstruct the underlying connectivity distributions and employs the cross-entropy to measure the divergence,

\mathcal{L}=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}p(u_{j}|v_{i})\log\frac{1}{q(u_{j}|v_{i})}. \quad (24)

The GCN part of AnchorGAE is trained via minimizing $\mathcal{L}$.
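Given the embeddings of samples and anchors, the decoder of Eq. (23) and the loss of Eq. (24) reduce to a softmax over negative squared distances followed by a cross-entropy; the sketch below is a straightforward NumPy transcription with a standard log-sum-exp stabilization added by us.

```python
import numpy as np

def anchor_gae_loss(Z, Zt, B):
    """Cross-entropy of Eq. (24) between p(u_j | v_i) (rows of B) and the
    decoder q(u_j | v_i) of Eq. (23), computed from embeddings Z and Zt."""
    D = ((Z[:, None, :] - Zt[None, :, :]) ** 2).sum(-1)   # d(v_i, u_j)
    logits = -D - (-D).max(axis=1, keepdims=True)         # numerical stability
    log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(B * log_q).sum()
```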

So far, how to obtain the initial anchors has been ignored. In deep learning, a common manner is to pre-train a neural network. Nevertheless, the pre-training of raw graph auto-encoders will result in nearly $\mathcal{O}(n^{2})$ time, which violates the motivation of our model. Therefore, we first compute the anchors and transition probabilities on the raw features. Then the anchors and transition probabilities are rectified dynamically, which is elaborated in Section 3.5.

3.4.1 Why not reconstruct features?

One may ask why AnchorGAE reconstructs the constructed graph rather than the features, which are given a priori. The goal of the classical auto-encoder is to restore the input features. However, the learned representations are often not promising enough due to the lack of limitations on the latent space. Accordingly, several works (e.g., the adversarial auto-encoder [63] and the variational auto-encoder [54]) constrain the representations. More importantly, contrastive learning [64] does not reconstruct the features anymore and focuses on how to pull similar samples together and push dissimilar samples away. Contrastive learning has achieved impressive performance in recent years.

In AnchorGAE, the graph is constructed on the learned representations. Define a mapping $\psi$ from the representation space to the graph space. Then the graph can be roughly formulated as $\bm{A}=\psi(f(\bm{X}))$, where $f$ denotes the GNN mapping. On the one hand, the composite mapping, $\psi\circ f$, removes the redundant information of $\bm{X}$. The reconstruction of $\bm{A}$ can be regarded as the restoration of the simplified $\bm{X}$. On the other hand, the edges of the graph indicate similar sample pairs and dissimilar pairs. The reconstruction of the graph is therefore consistent with the idea of contrastive learning. That is why AnchorGAE only reconstructs the constructed graph.

Figure 2: Illustration of the strategy to avoid degeneration on MNIST-test: (a) Collapse, from AnchorGAE with fixed $k$; (b) AnchorGAE with dynamically increasing $k$.
TABLE I: Experimental Results: ACC and NMI ("-" denotes methods that fail to run on MNIST-full or Fashion-MNIST)
Method            ISOLET         SEGMENT        USPS           MNIST-test     MNIST-full     Fashion
                  ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI
K-Means           0.621  0.775   0.559  0.590   0.648  0.628   0.548  0.501   0.534  0.500   0.474  0.512
Nystrom           0.648  0.780   0.562  0.549   0.668  0.619   0.575  0.505   0.589  0.502   0.563  0.522
CSC               0.661  0.785   0.606  0.555   0.726  0.696   0.623  0.572   0.592  0.569   0.515  0.501
KASP              0.630  0.754   0.554  0.544   0.712  0.693   0.654  0.600   0.620  0.555   0.538  0.529
LSC               0.657  0.774   0.562  0.557   0.727  0.694   0.702  0.615   0.714  0.623   0.634  0.629
SNC               0.553  0.631   0.567  0.535   0.703  0.589   0.570  0.523   -      -       -      -
SGC               0.464  0.665   0.524  0.611   0.657  0.801   0.719  0.757   -      -       -      -
GAE               0.612  0.783   0.442  0.475   0.679  0.821   0.696  0.758   -      -       -      -
DEC               0.425  0.685   0.143  0.000   0.621  0.588   0.813  0.803   0.797  0.810   0.516  0.541
SpectralNet       0.528  0.781   0.515  0.540   0.678  0.818   0.820  0.817   -      -       -      -
ClusterGCN-L2     0.581  0.772   0.580  0.601   0.659  0.769   0.632  0.666   0.707  0.730   0.532  0.605
ClusterGCN-L4     0.574  0.738   0.593  0.599   0.662  0.769   0.631  0.665   0.708  0.732   0.532  0.606
Ours-A (fixed B)  0.615  0.814   0.443  0.490   0.490  0.475   0.290  0.351   0.468  0.635   0.588  0.630
Ours-B (fixed k)  0.373  0.532   0.285  0.125   0.265  0.174   0.284  0.228   0.289  0.232   0.194  0.112
Ours-C (KNN)      0.654  0.778   0.555  0.511   0.729  0.746   0.776  0.735   0.793  0.743   0.533  0.553
AnchorGAE-L2      0.663  0.787   0.635  0.613   0.853  0.828   0.823  0.773   0.833  0.773   0.645  0.607
AnchorGAE-L4      0.591  0.756   0.576  0.508   0.781  0.775   0.810  0.762   0.808  0.771   0.613  0.590

3.5 Will the Adaptive Update Work?

Since the initialized anchors and transition probabilities are calculated from the original features, the information hidden in $\bm{A}$ may be limited, especially when the data distribution is complicated. To precisely exploit the high-level information in a self-supervised way, a feasible method is to adaptively update the anchors according to the learned embedding.

To achieve this goal, we first need to recompute the anchors, $\mathcal{U}$. According to Eqs. (7) and (10), we can obtain the anchors under the deep representations, $\{f(\bm{c}_{j})\}_{j=1}^{m}$. However, the input of our model requires anchors under the raw features. In other words, it is inevitable to compute $\bm{c}_{j}=f^{-1}(f(\bm{c}_{j}))$. Unfortunately, it is hard to ensure that $f(\cdot)$ is an invertible mapping. One scheme is to utilize classical auto-encoders, since the decoder that tries to reconstruct the raw input from the embedding can be regarded as an approximation of the inverse mapping of the encoder. To simplify the discussion in this paper, we simply set

\bm{c}_{j}=f^{-1}(f(\bm{c}_{j}))\approx\bm{X}^{T}\bm{b}_{j}, \quad (25)

to estimate $\bm{c}_{j}$ in the original feature space.
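In code, Eq. (25) is a single matrix product; the sketch below only illustrates the pull-back of the anchors to the raw feature space, with the column $\bm{b}_j$ of $\bm{B}$ acting as the combination weights.

```python
import numpy as np

def anchors_in_input_space(X, B):
    """Approximate inverse mapping of Eq. (25): c_j ~= X^T b_j,
    where b_j is the j-th column of B. Returns an (m, d) matrix
    whose j-th row is the estimated anchor c_j."""
    return (X.T @ B).T
```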

We would expect the self-supervised updates of the anchors and transition probabilities to become more and more precise. However, the performance of AnchorGAE becomes worse and worse after several updates of the anchors, which we show both theoretically and empirically. An interesting phenomenon is that a better reconstruction indicates a more severe collapse. The following theorem provides a rigorous description of this collapse.

Theorem 1.

Let $p_{k}(\cdot|v_{i})$ be the $k$-th largest one of $\{p(u_{j}|v_{i})\}_{j=1}^{m}$ and $\{\hat{p}(u_{j}|v_{i})\}_{j=1}^{m}$ be the updated distribution with the same $k$ after step 6 of Algorithm 2. If $|q(\cdot|v_{i})-p(\cdot|v_{i})|\leq\varepsilon$, then there exists a constant $\delta>0$ such that when $\varepsilon\leq\delta$,

|\hat{p}_{j}(\cdot|v_{i})-\frac{1}{k}|\leq\mathcal{O}(\log^{-1/2}(1/\varepsilon))

holds for any $j\leq k$.

The above conclusion means that $\hat{p}(\cdot|v_{i})$ is a sparse and uniform distribution. In other words, the constructed graph degenerates into an unweighted graph, which is usually undesired in clustering. More intuitively, the degeneration is caused by projecting samples from the identical cluster into multiple tiny and cohesive clusters, as shown in Figure 2. These tiny clusters have no connection with each other and they scatter randomly in the representation space, such that the clustering result is disturbed. To avoid this collapse, we propose two strategies: (1) increase $k$, which represents the sparsity of neighbors; (2) decrease the number of anchors, $m$. In this paper, we dynamically increase $k$, which is equivalent to attempting to connect these tiny groups before they scatter randomly, as follows

k\leftarrow k+\Delta k. \quad (26)

Suppose that there are $c$ clusters and the number of samples in the smallest cluster is represented by $n_{s}$. Then we can define the upper bound of $k$ and the increment of sparsity $\Delta k$ as

k_{m}=\lfloor m\times\frac{n_{s}}{n}\rfloor~~\textrm{and}~~\Delta k=\lfloor\frac{k_{m}-k_{0}}{E}\rfloor, \quad (27)

where $k_{0}$ denotes the sparsity of the initialized anchor graph and $E$ is the number of iterations to update the anchors. If we have no prior information about $n_{s}$, it can be simply set as $\lfloor n/c\rfloor$ or $\lfloor n/(2c)\rfloor$. This brief modification helps a lot to guarantee the stability and performance of AnchorGAE.
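The schedule of Eqs. (26)-(27) is easy to implement directly; the numbers in the usage example below are purely illustrative.

```python
import numpy as np

def sparsity_schedule(m, n, n_s, k0, E):
    """Upper bound k_m and increment Delta_k of Eq. (27); k then grows
    as k <- k + Delta_k after each anchor update (Eq. (26))."""
    k_m = int(np.floor(m * n_s / n))
    delta_k = int(np.floor((k_m - k0) / E))
    return k_m, delta_k

# e.g., m=500 anchors, n=10000 samples, smallest cluster n_s=1000,
# initial sparsity k0=3, E=5 anchor updates
k_m, dk = sparsity_schedule(m=500, n=10000, n_s=1000, k0=3, E=5)
```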

Figure 3: Visualization of SEGMENT by t-SNE: (a) original features; (b) fixed $k$; (c) fixed $\bm{B}$; (d) AnchorGAE.

When AnchorGAE is trained, we can perform some fast clustering algorithms on $\bm{B}$ or $\bm{Z}$, which is elaborated in Section 3.6. The optimization of AnchorGAE is summarized in Algorithm 2.

3.6 Obtain Clustering Assignments

After training AnchorGAE, we can simply run $k$-means on the learned embedding. Due to Assumption 1, the Euclidean distances of similar points will be small, so it is appropriate to use $k$-means on the deep representations.

Another method is to perform spectral clustering on the constructed graph, which is the same as anchor-based spectral clustering. Let

\bm{\mathcal{L}}=\bm{I}-\bm{D}^{-\frac{1}{2}}\bm{A}\bm{D}^{-\frac{1}{2}}=\bm{I}-\bm{A}=\bm{I}-\bm{B}\bm{\Delta}^{-1}\bm{B}^{T} \quad (28)

be the normalized Laplacian matrix. It should be emphasized that $\bm{\mathcal{L}}$ is also the unnormalized Laplacian matrix. Accordingly, the objective of spectral clustering is

\max\limits_{\bm{F}^{T}\bm{F}=\bm{I}}{\rm tr}(\bm{F}^{T}\bm{B}\bm{\Delta}^{-1}\bm{B}^{T}\bm{F}), \quad (29)

where $\bm{F}\in\mathbb{R}^{n\times c}$ is the soft indicator matrix. Let $\hat{\bm{B}}=\bm{B}\bm{\Delta}^{-\frac{1}{2}}$. Then the eigenvalue decomposition of $\bm{A}$ can be transformed into the singular value decomposition of $\hat{\bm{B}}$, which reduces the complexity from $\mathcal{O}(n^{2}c)$ to $\mathcal{O}(mnc)$.

Finally, one can perform clustering on the bipartite graph directly rather than the constructed graph. Let the bipartite graph be

\bm{W}=\left[\begin{array}{c c}\bm{0}&\bm{B}\\ \bm{B}^{T}&\bm{0}\end{array}\right]\in\mathbb{R}^{(n+m)\times(n+m)}. \quad (30)

The degree matrix is represented as

\bm{D}=\left[\begin{array}{c c}\bm{D}_{v}&\bm{0}\\ \bm{0}&\bm{D}_{u}\end{array}\right]. \quad (31)

Then spectral clustering with the normalized cut optimizes

\max\limits_{\bm{F}^{T}\bm{F}=\bm{I}}{\rm tr}(\bm{F}^{T}\bm{D}^{-\frac{1}{2}}\bm{W}\bm{D}^{-\frac{1}{2}}\bm{F}), \quad (32)

where

\bm{F}=\left[\begin{array}{c}\bm{V}\\ \bm{U}\end{array}\right]. \quad (33)

Since $\bm{D}_{v}=\bm{I}$, the above problem can be further reformulated as

\max\limits_{\bm{U}^{T}\bm{U}+\bm{V}^{T}\bm{V}=\bm{I}}{\rm tr}(\bm{V}^{T}\bm{B}\bm{D}_{u}^{-\frac{1}{2}}\bm{U}). \quad (34)

The closed-form solution can be calculated by

\left\{\begin{array}{l}\bm{V}=\frac{\sqrt{2}}{2}\tilde{\bm{V}},\\ \bm{U}=\frac{\sqrt{2}}{2}\tilde{\bm{U}},\end{array}\right. \quad (35)

where $\tilde{\bm{V}}$ and $\tilde{\bm{U}}$ are the $c$ leading left and right singular vectors of $\bm{B}\bm{D}_{u}^{-\frac{1}{2}}$. The details can be found in [65]. In our experiments, we report the results obtained by the above technique based on bipartite graphs, since it usually outperforms the simple $k$-means.
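A sketch of this bipartite-graph clustering step is given below, assuming SciPy and scikit-learn are available; running $k$-means on the row-normalized left singular vectors is a common post-processing choice and is not prescribed by the paper.

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def bipartite_spectral_clustering(B, c):
    """Clustering on the bipartite graph of Eq. (30): the c leading
    singular vectors of B D_u^{-1/2} give the solution of Eq. (35),
    and the sample part is fed to k-means for the final assignments."""
    d_u = B.sum(axis=0)                                   # anchor degrees D_u
    B_hat = B / np.sqrt(d_u)[None, :]                     # B D_u^{-1/2}
    U, s, Vt = svds(B_hat, k=c)                           # c leading singular vectors
    V = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=c, n_init=10).fit_predict(V)
```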

Figure 4: The influence of the number of anchors on ACC, NMI, and TIME on 6 datasets: (a) SEGMENT, (b) USPS, (c) MNIST-full, (d) ISOLET, (e) MNIST-test, (f) Fashion-MNIST.
TABLE II: Information of Datasets
Dataset # Feature # Size # Classes
ISOLET 617 1560 26
SEGMENT 19 2310 7
USPS 256 9298 10
MNIST-test 784 10000 10
MNIST-full 784 70000 10
Fashion-MNIST 784 70000 10

4 Experiment

In this section, the experimental results and parametric settings of AnchorGAE are illustrated. According to these experimental results, we analyze the performance of AnchorGAE objectively. Additionally, the visualization of data is displayed. The experimental results strongly support the theoretical analysis.

4.1 Datasets and Comparative Methods

The performance of AnchorGAE is evaluated on 6 non-graph datasets, including 2 UCI datasets (ISOLET [66] and SEGMENT [66]) and 4 image datasets (USPS [67], MNIST-test [68], MNIST-full [68], and Fashion-MNIST [69]). Remark that the two UCI datasets highlight the difference between AnchorGAE and CNN-based clustering models: since each data point can only be represented by a vector, CNN-based models cannot be applied to these datasets. Note that Fashion-MNIST is denoted by Fashion in Table I. The details of these datasets are shown in Table II.

AnchorGAE is compared with 11 methods, including 5 anchor-based methods (Nystrom [9], CSC [10], KASP [11], LSC [13], and SNC [14]), 3 GAE-based methods (SGC [35], GAE [48], and ClusterGCN [34]), 2 deep methods (DEC [17] and SpectralNet [25]), and the classical K-Means [58]. All codes are downloaded from the authors' homepages.

4.2 Experimental Setup

Three clustering evaluation criteria are used in our experiments: clustering accuracy (ACC), normalized mutual information (NMI), and running time. In our experiments, the number of anchor points is set from 100 to 1000 with a step of 100. The best results are reported in Table I. The activation function of the last layer is set to linear, while the other layers use the ReLU function. In particular, ClusterGCN recommends a 4-layer convolution structure. To keep the convolution structure of ClusterGCN consistent with the other methods, we add a 2-layer variant (denoted by ClusterGCN-L2) in addition to the recommended 4-layer structure (denoted by ClusterGCN-L4). To apply ClusterGCN to unsupervised tasks, we use the same GAE loss and decoder architecture as the other models. To test the scalability of AnchorGAE with deeper networks, we design 2-layer (denoted by AnchorGAE-L2) and 4-layer (denoted by AnchorGAE-L4) convolution structures separately. The 2-layer neurons are designed as 128 and 64, or 256 and 32. For the 4-layer structure, we set the first three layers as 16 and the last layer as $L$+1, where $L$ is the number of labels. The maximum number of iterations to update the network is 200, while the maximum number of epochs to update the anchors is set to 5. The initial sparsity $k_{0}$ is set to 3.

Figure 5: Runtime of several GNN-based methods. If a model cannot run on some dataset, it is marked by $\infty$.

To investigate the impact of anchors, we run AnchorGAE with different numbers of anchors and different initial sparsities, i.e., $m$ and $k_{0}$. We also conduct 3 ablation experiments based on AnchorGAE-L2 to study the impact of different parts. Firstly, we fix the anchors and transition probabilities during training, which is denoted by Ours-A. Secondly, we fix the sparsity $k$, which is denoted by Ours-B. Thirdly, we use a KNN graph, instead of a weighted graph, with the adaptive update and incremental $k$ in our model, which is denoted by Ours-C. All methods are run 10 times and the average results are reported. All experiments are conducted on a PC with an NVIDIA GeForce GTX 1660 SUPER GPU.

Figure 6: The influence of $k_{0}$ on ACC and NMI on (a) SEGMENT and (b) MNIST-test.

4.3 Experimental Results

ACC and NMI are shown in Table I, while the runtime is reported in Figure 5. For ACC and NMI, the best results are bolded and suboptimal results are underlined. In particular, some methods either require too much memory to run on MNIST-full and Fashion-MNIST, or the codes provided by the authors fail to run on them, so their entries are marked by a dash in Table I and by $\infty$ in Figure 5. Besides, the visualization of SEGMENT is shown in Figure 3. Figure 4 shows the impact of the number of anchors, while the effect of the initial sparsity is illustrated in Figure 6.

From the experimental results, we can conclude that:

  1. On almost all datasets, AnchorGAE-L2 obtains the best results in all metrics. On MNIST-full, AnchorGAE obtains the suboptimal NMI, while DEC, which consists of many more layers, outperforms it. Meanwhile, AnchorGAE, with only 2 layers, achieves better ACC than DEC. AnchorGAE needs to compute anchors, so it requires more time.

  2. On MNIST-full, normal GAE-based methods fail to run due to out-of-memory errors, since these methods need to store the graph structure for each convolution. AnchorGAE can perform the convolution without constructing the graph explicitly, which not only needs less memory but also reduces the runtime.

  3. When the sparsity $k$ is fixed, or the anchors and transition probabilities (i.e., the matrix $\bm{B}$) are not updated, AnchorGAE usually gets poor results on many datasets. For instance, Ours-B loses about 55% in ACC and 55% in NMI on MNIST-full, while Ours-A loses about 40% in ACC and about 15% in NMI.

  4. To ensure fair competition and distinguish between anchor-based methods and GAE-based methods, we set the same number of anchors for AnchorGAE-L4 and the other anchor-based methods. Note that the first two layers of AnchorGAE-L4 are the same as AnchorGAE-L2. From the experimental results, AnchorGAE-L4 obtains acceptable performance on most of the datasets but not the best results, which may be caused by the overly small dimension of the final embedding.

  5. If the initial sparsity, $k_{0}$, is set too large, it results in dense connections. As a result, a great quantity of wrong information is captured.

  6. From the comparison between AnchorGAE and Ours-C, we find that weighted graphs are still important for clustering even after introducing non-linear GNN mapping modules.

5 Proof of Theorem 1

For an arbitrary probability distribution $\{p(u_{j}|v_{i})\}_{j=1}^{m}$, let $p_{k}(\cdot|v_{i})$ be the $k$-th largest one. For simplicity, let $d_{ij}=d(u_{j},v_{i})=\|f(\bm{x}_{i})-f(\bm{c}_{j})\|_{2}^{2}$, where $f(\cdot)$ represents the mapping function implemented by the trained GCN encoder. Note that the connectivity distribution is estimated through the current representations, i.e.,

p^{(0)}(\cdot|v_{i})=\arg\min_{p(\cdot|v_{i})}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}\hat{d}_{ij}+\gamma_{i}\sum_{j=1}^{m}(p(u_{j}|v_{i})-\frac{1}{m})^{2}, \quad (36)

where $\hat{d}_{ij}=\|\hat{f}(\bm{x}_{i})-f(\bm{c}_{j})\|_{2}^{2}$ and $\hat{f}(\cdot)$ denotes the mapping function before the new training of the GCN encoder.

We define $p^{(t)}(u_{j}|v_{i})$ and $f^{(t)}(\bm{c}_{j})$ as the solutions at the $t$-th iteration. To present the proof clearly, the iterative update of $p(\cdot|v_{i})$ and the anchors is shown in Algorithm 1. The following lemma depicts the first estimated connectivity distribution after one perfect training of the GCN.

Lemma 1.

Suppose that

p^{(1)}(\cdot|v_{i})=\arg\min_{p(\cdot|v_{i})}\mathbb{E}_{u_{j}\sim p(\cdot|v_{i})}d_{ij}+\gamma_{i}\sum_{j=1}^{m}(p(u_{j}|v_{i})-\frac{1}{m})^{2},

where $\gamma_{i}$ ensures that the sparsity is $k$. Let $p_{k}^{(0)}(\cdot|v_{i})=\mu$. If $|p_{j}^{(0)}(\cdot|v_{i})-q_{j}(\cdot|v_{i})|\leq\varepsilon\leq\mu$ for any $i$, then $\forall j\leq k$,

|p^{(1)}_{j}(\cdot|v_{i})-\frac{1}{k}|\leq\mathcal{O}\Big(\frac{\log(1/\mu)}{\log(1/\varepsilon)}\Big). \quad (37)
Proof.

Without loss of generality, we focus on the connectivity distribution of $v$ and suppose that $p^{(0)}(u_{1}|v)\geq p^{(0)}(u_{2}|v)\geq\cdots\geq p^{(0)}(u_{k}|v)>0=p^{(0)}(u_{k+1}|v)=\cdots=p^{(0)}(u_{m}|v)$. Let $p_{i}=p^{(0)}(u_{i}|v)$ and $q_{i}=q(u_{i}|v)$. According to the definitions,

q_{i}=\frac{\exp(-d_{i})}{\sum_{j=1}^{m}\exp(-d_{j})}. \quad (38)

For any $i$, we have $|p_{i}-q_{i}|\leq\varepsilon$. Clearly, $p_{k+1}=0$. Suppose that $q_{k+1}=\tau\leq\varepsilon$. Therefore, we have

\frac{\exp(-d_{k+1})}{\sum_{j=1}^{m}\exp(-d_{j})}=\tau~\Leftrightarrow~d_{k+1}=\log\frac{1}{\tau}-\log C, \quad (39)

where $C=\sum_{j=1}^{m}\exp(-d_{j})$. Combined with the condition $p_{k}\geq\mu$, we have

\frac{\exp(-d_{k})}{\sum_{j=1}^{m}\exp(-d_{j})}\geq p_{k}-\varepsilon\geq\mu-\varepsilon~\Rightarrow~d_{k}\leq-\log C-\log(\mu-\varepsilon). \quad (40)

Similarly, since $p_{1}\geq\frac{1}{k}$,

\frac{\exp(-d_{1})}{\sum_{j=1}^{m}\exp(-d_{j})}\leq 1~\Rightarrow~d_{1}\geq-\log C. \quad (41)

If we update the connectivity distribution based on $\{d_{i}\}_{i=1}^{m}$, then for any $i\leq k$,

p_{i}^{(1)}=\frac{d_{k+1}-d_{i}}{\sum_{j=1}^{k}(d_{k+1}-d_{j})}. \quad (42)

Furthermore, for any $i,j\leq k$,

\begin{split}
|p_{i}^{(1)}-p_{j}^{(1)}|&=\frac{|d_{j}-d_{i}|}{\sum_{j=1}^{k}(d_{k+1}-d_{j})}\\
&\leq\frac{|d_{k}-d_{1}|}{\sum_{j=1}^{k}(d_{k+1}-d_{j})}\\
&=\frac{-\log(\mu-\varepsilon)}{\sum_{j=1}^{k}(\log\frac{1}{\tau}-\log C-d_{j})}\\
&\leq\frac{-\log(\mu-\varepsilon)}{\sum_{j=1}^{k}(\log\frac{1}{\tau}-\log C-d_{k})}\\
&\leq\frac{-\log(\mu-\varepsilon)}{k(\log(\mu-\varepsilon)-\log\tau)}\\
&=\frac{1}{k}\cdot\frac{\log(\mu-\varepsilon)}{\log\tau-\log(\mu-\varepsilon)}\\
&=\frac{1}{k}\cdot\frac{1}{\frac{\log(1/\tau)}{\log(1/(\mu-\varepsilon))}-1}\\
&\leq\frac{1}{k}\cdot\frac{1}{\frac{\log(1/\varepsilon)}{\log(1/(\mu-\varepsilon))}-1}=\mathcal{O}\Big(\frac{\log(1/\mu)}{\log(1/\varepsilon)}\Big).
\end{split}

For any $i \le k$, the above result yields

$$\begin{aligned}
\Big|p_i^{(1)} - \frac{1}{k}\Big| &= \Big|p_i^{(1)} - \frac{1}{k}\sum_{j=1}^{k} p_j^{(1)}\Big| \\
&\le \frac{1}{k}\Big|\sum_{j=1}^{k}\big(p_i^{(1)} - p_j^{(1)}\big)\Big| \\
&\le \frac{1}{k}\sum_{j=1}^{k}\big|p_i^{(1)} - p_j^{(1)}\big| \\
&\le \mathcal{O}\Big(\frac{\log(1/\mu)}{\log(1/\varepsilon)}\Big),
\end{aligned}$$

where the first equality holds because $\sum_{j=1}^{k} p_j^{(1)} = 1$. The same argument applies to every other node. Hence, the lemma is proved. ∎
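As a sanity check on Lemma 1 (our illustration, not part of the original analysis), the following sketch builds a $k$-sparse distribution $p^{(0)}$, perturbs it into a softmax distribution $q$ with $|q_j - p_j^{(0)}| \le \varepsilon$, recovers distances via $d_j = -\log q_j$ (so that $C = 1$), and re-estimates $p^{(1)}$ with the closed form of Eq. (42); the deviation from $1/k$ is small and shrinks as $\varepsilon$ decreases, in line with the $\mathcal{O}(\log(1/\mu)/\log(1/\varepsilon))$ bound. The concrete numbers ($k = 4$, $\mu = 0.1$, $\tau = 10^{-12}$) are arbitrary choices for illustration.

```python
import numpy as np

def connectivity(d, k):
    """k-sparse closed-form update of Eq. (42)."""
    ds = np.sort(d)
    return np.maximum(ds[k] - d, 0.0) / (k * ds[k] - ds[:k].sum())

k, m, tau = 4, 20, 1e-12
p0 = np.array([0.4, 0.3, 0.2, 0.1] + [0.0] * (m - k))   # k-sparse p^{(0)}, mu = 0.1
q = np.where(p0 > 0, p0 * (1 - (m - k) * tau), tau)      # |q_j - p0_j| <= eps, eps ~ (m-k)*tau
d = -np.log(q)                                           # distances whose softmax equals q (C = 1)
p1 = connectivity(d, k)
print(np.abs(p1[:k] - 1.0 / k).max())                    # ~8e-3 here; decreases as tau -> 0
```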

The following lemma shows that the updated anchors approximate the mean vectors of their connected nodes once $p^{(t)}(u_j|v_i)$ approaches the sparse and uniform distribution.

Lemma 2.

Let $\mathcal{N}_j$ denote the set of nodes connected to the anchor $u_j$, and let $p(u_j|v_i)$ be a $k$-sparse solution given by Theorem 1. If $|p_j(\cdot|v_i) - \frac{1}{k}| \le \varepsilon$ for any $j \le k$, then

$$\Big\|\frac{\sum_{i=1}^{n} p(u_j|v_i) f(\bm{x}_i)}{\sum_{i=1}^{n} p(u_j|v_i)} - \frac{1}{n_j}\sum_{i\in\mathcal{N}_j} f(\bm{x}_i)\Big\|_2^2 \le \mathcal{O}(\varepsilon). \qquad (43)$$
Proof.

To keep the notation uncluttered, let $\bm{f}_i = f(\bm{x}_i)$ and

$$\bm{h}_j = \frac{\sum_{i=1}^{n} p(u_j|v_i) f(\bm{x}_i)}{\sum_{i=1}^{n} p(u_j|v_i)}. \qquad (44)$$

Note that $\sum_{i=1}^{n} p(u_j|v_i) f(\bm{x}_i) = \sum_{i\in\mathcal{N}_j} p(u_j|v_i) f(\bm{x}_i)$. Let $n_j = |\mathcal{N}_j|$; then

$$\Big\|\bm{h}_j - \frac{1}{n_j}\sum_{i\in\mathcal{N}_j}\bm{f}_i\Big\|_2 = \frac{1}{s_j}\Big\|\sum_{i\in\mathcal{N}_j}\Big(p(u_j|v_i) - \frac{s_j}{n_j}\Big)\bm{f}_i\Big\|_2, \qquad (45)$$

where $s_j = \sum_{i=1}^{n} p(u_j|v_i)$. According to the condition, we have

$$\frac{1}{k} - \varepsilon \le p(u_j|v_i) \le \frac{1}{k} + \varepsilon \;\Rightarrow\; \frac{n_j}{k} - n_j\varepsilon \le s_j \le \frac{n_j}{k} + n_j\varepsilon.$$

Accordingly, we have

$$\begin{aligned}
\Big\|\bm{h}_j - \frac{1}{n_j}\sum_{i\in\mathcal{N}_j}\bm{f}_i\Big\|_2
&\le \sum_{i\in\mathcal{N}_j}\frac{1}{s_j}\Big|p(u_j|v_i) - \frac{s_j}{n_j}\Big|\cdot\|\bm{f}_i\|_2 \\
&\le \sum_{i\in\mathcal{N}_j}\frac{\|\bm{f}_i\|_2}{s_j}\max\Big\{\Big|p(u_j|v_i) - \frac{1}{k} + \varepsilon\Big|,\; \Big|p(u_j|v_i) - \frac{1}{k} - \varepsilon\Big|\Big\} \\
&\le \sum_{i\in\mathcal{N}_j}\frac{\|\bm{f}_i\|_2}{s_j}\Big(\Big|p(u_j|v_i) - \frac{1}{k}\Big| + \varepsilon\Big) \\
&\le \sum_{i\in\mathcal{N}_j}\frac{2\|\bm{f}_i\|_2}{s_j}\varepsilon = \mathcal{O}(\varepsilon).
\end{aligned}$$

Hence, the lemma is proved. ∎
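The following sketch (our illustration, not from the paper) checks Lemma 2 numerically: when the connectivity weights of the nodes in $\mathcal{N}_j$ deviate from $1/k$ by at most $\varepsilon$, the probability-weighted anchor of Eq. (44) stays within $\mathcal{O}(\varepsilon)$ of the plain mean over $\mathcal{N}_j$. The embedding dimension, the size of $\mathcal{N}_j$, and $\varepsilon$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, k, eps = 60, 5, 3, 1e-3
F = rng.normal(size=(n, dim))                       # embeddings f(x_i)
members = rng.choice(n, size=20, replace=False)     # N_j: nodes connected to anchor u_j

p = np.zeros(n)                                     # p(u_j | v_i), zero outside N_j
p[members] = 1.0 / k + rng.uniform(-eps, eps, size=members.size)   # |p - 1/k| <= eps

h = (p @ F) / p.sum()                               # probability-weighted anchor, Eq. (44)
mean = F[members].mean(axis=0)                      # plain mean over N_j
print(np.linalg.norm(h - mean))                     # small, scales roughly linearly with eps
```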

With the following lemma, we can prove Theorem 1 by mathematical induction.

Lemma 3.

Suppose that $q_j(\cdot|v_i) \le \varepsilon$ for $j > k$, and that $d(v_i, u_j) \le M$. Then there exists $\delta > 0$ such that, if $\varepsilon \le \delta$ and $v_i$ does not connect with $u_j$, then for any $v'\in\mathcal{N}_j$, $v_i$ and $v'$ share no anchors.

Proof.

At first, let $\mu = \min_i q_k(\cdot|v_i)$; hence $q_k(\cdot|v_i) \ge \mu \ge \varepsilon$. Consider an arbitrary node $v_i$ connected with $u_a$ and $u_b$, and an anchor $u_c$ that is disconnected from $v_i$, and suppose that there exists a node $v_j$ which connects with $u_b$ and $u_c$ but not with $u_a$. According to the triangle inequality, we have

$$\begin{aligned}
\frac{\exp(-d(u_a,u_c))}{\exp(-d(v_i,u_a))} &\le \frac{\exp(-d(v_i,u_c) + d(v_i,u_a))}{\exp(-d(v_i,u_a))} \\
&= \frac{\exp(-d(v_i,u_c))}{\exp(-2d(v_i,u_a))} \\
&= \frac{q(u_c|v_i)}{q(u_a|v_i)\exp(-d(v_i,u_a))} \\
&\le \frac{\varepsilon}{\mu}e^{M}.
\end{aligned}$$

Again according to the triangle inequality,

$$d(v_i,u_a) + \log\frac{\mu}{\varepsilon} - M \le d(u_a,u_c) < d(u_a,u_b) + d(u_b,u_c) < d(v_i,u_a) + d(v_i,u_b) + d(v_j,u_c) + d(v_j,u_b).$$

Therefore,

$$\log\frac{\mu}{\varepsilon} < 4M. \qquad (46)$$

If we define $\delta$ as

$$\delta = \min\{\mu\exp(-4M),\, \mu\}, \qquad (47)$$

then Ineq. (46) results in a contradiction when $\varepsilon \le \delta$. ∎
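Lemma 3 says that once the estimated $q$ is sharp enough, samples that do not share an anchor belong to groups that later updates cannot merge. The sketch below (our own toy construction, not the paper's data) computes the $k$-sparse connectivity matrix for two well-separated groups of samples and anchors and verifies that no sample of one group shares an anchor with any sample of the other; all sizes and coordinates are arbitrary.

```python
import numpy as np

def connectivity(d, k):
    ds = np.sort(d)
    return np.maximum(ds[k] - d, 0.0) / (k * ds[k] - ds[:k].sum())

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])   # two groups of samples
C = np.vstack([rng.normal(0, 0.1, (4, 2)), rng.normal(5, 0.1, (4, 2))])     # two groups of anchors
k = 2
D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)                          # d(v_i, u_j)
P = np.stack([connectivity(D[i], k) for i in range(len(X))])

share = (P > 0).astype(int) @ (P > 0).astype(int).T                         # do (i, i') share an anchor?
print(bool(share[:30, 30:].any()))                                          # False: groups never mix
```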

Proof of Theorem 1.

The proof analyzes each iteration of Step 6 in Algorithm 1. Let $\mu = \min_i p_k^{(0)}(\cdot|v_i)$; then, according to Lemma 1, $|p_j^{(1)}(\cdot|v_i) - \frac{1}{k}| \le \mathcal{O}(\frac{\log(1/\mu)}{\log(1/\varepsilon)})$ provided that $\varepsilon \le \mu$. This conclusion supplies exactly the precondition of Lemma 2.

Let $\bm{\theta}_j = \frac{1}{|\mathcal{N}_j|}\sum_{\bm{x}_i\in\mathcal{N}_j} f(\bm{x}_i)$. According to Lemma 2, we have $\|f^{(1)}(\bm{c}_j) - \bm{\theta}_j\|_2^2 \le \mathcal{O}(\frac{\log(1/\mu)}{\log(1/\varepsilon)})$, and we can further derive that

$$\begin{aligned}
\|f(\bm{x}_i) - f^{(1)}(\bm{c}_j)\|_2 &\le \|f(\bm{x}_i) - \bm{\theta}_j\|_2 + \|f^{(1)}(\bm{c}_j) - \bm{\theta}_j\|_2 \\
&\le \|f(\bm{x}_i) - \bm{\theta}_j\|_2 + \mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big),
\end{aligned}$$

and

$$\begin{aligned}
\|f(\bm{x}_i) - f^{(1)}(\bm{c}_j)\|_2 &\ge \|f(\bm{x}_i) - \bm{\theta}_j\|_2 - \|f^{(1)}(\bm{c}_j) - \bm{\theta}_j\|_2 \\
&\ge \|f(\bm{x}_i) - \bm{\theta}_j\|_2 - \mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).
\end{aligned}$$

Therefore, for $d^{(1)}(v_i,u_j)$, we have

$$\begin{aligned}
\mathcal{O}\Big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\Big) - 2\|f(\bm{x}_i) - \bm{\theta}_j\|_2\,\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big)
&\le d^{(1)}(v_i,u_j) - \|f(\bm{x}_i) - \bm{\theta}_j\|_2^2 \\
&\le \mathcal{O}\Big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\Big) + 2\|f(\bm{x}_i) - \bm{\theta}_j\|_2\,\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).
\end{aligned}$$

Since $d^{(1)}(v_i,u_j)$ is upper-bounded, the above conclusions provide the preconditions of Lemma 3. For simplicity, let $r = \|f(\bm{x}_i) - \bm{\theta}_j\|_2$. Therefore, for any $l \le k$,

$$d^{(1)}_{k+1}(v_i,\cdot) - d^{(1)}_l(v_i,\cdot) \le d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\Big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\Big) + 2r\,\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big),$$

and

$$d^{(1)}_{k+1}(v_i,\cdot) - d^{(1)}_l(v_i,\cdot) \ge d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\Big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\Big) - 2r\,\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).$$

Furthermore, since

$$p^{(2)}(u_j|v_i) = \frac{\big(d^{(1)}_{k+1}(v_i,\cdot) - d^{(1)}(v_i,u_j)\big)_+}{\sum_{l=1}^{k}\big(d^{(1)}_{k+1}(v_i,\cdot) - d^{(1)}_l(v_i,\cdot)\big)} \le \frac{d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\big) + 2r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)}{\sum_{l=1}^{k}\big[d^{(1)}_{k+1}(v_i,\cdot) - r^2\big] - k\,\mathcal{O}\big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\big) - 2kr\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)},$$

we have

$$p^{(2)}(u_j|v_i) - \frac{1}{k} \le \frac{4r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)}{k\big[d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\big) - 2r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)\big]} = \mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).$$

Similarly, we can also infer that

$$p^{(2)}(u_j|v_i) - \frac{1}{k} \ge \frac{-4r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)}{k\big[d^{(1)}_{k+1}(v_i,\cdot) - r^2 - \mathcal{O}\big(\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}\big) + 2r\,\mathcal{O}\big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\big)\big]} = -\mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big).$$

According to Lemma 3, there exists a constant $\delta$ such that, when $\varepsilon \le \delta$, $p^{(2)}(u_j|v_i)$ approximates the sparse and uniform distribution,

$$\Big|p^{(2)}(u_j|v_i) - \frac{1}{k}\Big| \le \mathcal{O}\Big(\sqrt{\tfrac{\log(1/\mu)}{\log(1/\varepsilon)}}\Big) = \mathcal{O}\big(\log^{-1/2}(1/\varepsilon)\big). \qquad (48)$$

Repeating the above argument for every iteration that updates $p^{(t)}(u_j|v_i)$ and $f^{(t)}(\bm{c}_j)$ completes the proof of the theorem. ∎
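To make the statement of Theorem 1 concrete, the following sketch (an idealized simulation of ours, not the actual AnchorGAE training) models a "perfect" retraining step by letting the decoder softmax reproduce the current connectivity distributions exactly, i.e. $d_{ij} = -\log p_{ij}$, and then re-estimates them with the closed form of Eq. (42). Every distribution collapses toward the uniform distribution over its $k$ anchors after the first re-estimation and stays there, which is exactly the degenerate behavior the theorem predicts after a few self-supervised updates.

```python
import numpy as np

def connectivity(d, k):
    ds = np.sort(d)
    return np.maximum(ds[k] - d, 0.0) / (k * ds[k] - ds[:k].sum())

rng = np.random.default_rng(3)
n, m, k, tau = 5, 8, 3, 1e-12
P = np.zeros((n, m))                                   # arbitrary non-uniform k-sparse start
for i in range(n):
    idx = rng.choice(m, size=k, replace=False)
    w = rng.uniform(0.5, 1.5, size=k)
    P[i, idx] = w / w.sum()

for t in range(3):
    D = -np.log(np.maximum(P, tau))                    # "perfect" training: softmax matches P
    P = np.stack([connectivity(D[i], k) for i in range(n)])
    dev = np.abs(np.sort(P, axis=1)[:, -k:] - 1.0 / k).max()
    print(t, dev)                                      # drops toward 0: the collapse of Theorem 1
```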

6 Conclusion

In this paper, we propose an anchor-siamese GCN clustering method with adaptive $\mathcal{O}(n)$ bipartite convolution (AnchorGAE). The generative perspective on weighted graphs helps us build the relationship between non-graph data and graph data, and provides a reconstruction goal. Since the anchor-based bipartite graph factorizes the adjacency matrix, we can rearrange the order of matrix multiplications to accelerate the graph convolution. Experiments show that AnchorGAE consumes less time, especially on MNIST-full, and obtains impressive performance. Ablation experiments also verify the necessity of updating the anchors, the transition probabilities, and the sparsity in a self-supervised way. From both theoretical and empirical perspectives, we show that the simple update would cause a collapse when the model is well-trained, and the designed mechanism that connects different groups in advance is also verified by experiments. Since the pre-images of the anchors are only estimated roughly, we will focus on how to exactly compute them from the embedding space in future work. To further speed up the model, it is also worthwhile to develop stochastic gradient descent for AnchorGAE.

Hongyuan Zhang received the B.E. degree in software engineering from Xidian University, Xi'an, China, in 2019. He is currently pursuing the Ph.D. degree with the School of Computer Science and the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an, China.
Jiankun Shi received the B.E. degree in software engineering from Henan University, Zhengzhou, China, in 2020. He is currently working toward the master's degree with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an, China.
Rui Zhang (M'19) received the Ph.D. degree in computer science from Northwestern Polytechnical University, Xi'an, China, in 2018. He currently serves as an Associate Professor with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an, China.
Xuelong Li (M’02-SM’07-F’12) is a Full Professor with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an, China.