
Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation

Abstract

Spectral clustering is an important graph clustering technique. However, it suffers from a scalability problem due to its computationally expensive eigen-decomposition procedure, and it is highly sensitive to noisy nodes in the graph. In this work, we address both problems simultaneously by using spectrum-preserving node aggregation to generate a nearly-noiseless, concise representation of the data set. Specifically, we create a small set of spectrally-representative pseudo-nodes based on spectral analysis. Then, the standard spectral clustering algorithm is performed on the smaller node set. Finally, we map the clusters of pseudo-nodes back to the original data set to find the cluster memberships of the data points. The proposed framework has a nearly-linear time complexity while significantly improving clustering accuracy. Experimental results show dramatically improved clustering performance compared with state-of-the-art methods.

Index Terms—  Spectral Clustering, Node Reduction

1 Introduction

Clustering is one of the most fundamental machine learning problems [1]. It aims to assign data samples in a data set into different clusters in such a way that samples in the same cluster are more similar compared to those in different clusters.

In the past decades, many clustering techniques have been proposed. Among them, spectral clustering has drawn considerable attention and has been widely applied to computer vision, image processing and speech processing [2, 3]. Although spectral clustering has superior performance, the involved eigen-decomposition procedure has a time complexity of O(N^3), where N is the number of samples in the data set. The high computational cost can easily become an obstacle in many large-scale applications [4, 5]. To address this problem, considerable effort has been devoted to approximate spectral clustering: [6] proposed a k-means-based hierarchical spectral clustering framework (KASP); inspired by sparse coding theory, [5] proposed a landmark-based representation method (LSC); [7] leveraged the Nyström method to approximate the affinity matrix. However, none of these methods preserves the spectrum of the original graph, which results in a significant degradation of clustering accuracy. For example, on the Covtype data set the clustering accuracy drops by more than 26% with LSC, more than 21% with KASP and more than 15% with the Nyström method.

[8] attempted to apply a graph coarsening method to divide a graph topology into an arbitrarily chosen number of partitions. However, its claims about spectral clustering have the following problems:

1) Methodologically, the algorithm flow in [8] only partitions the few pseudo aggregated nodes. Without mapping the pseudo clusters back to the original real nodes via the mapping operators, [8] can only provide an auxiliary intermediate result for data clustering.

2) Partitioning pure graph topologies into small parts and finding clusters of data samples are two different tasks. Clustering is the task of grouping similar samples into the same clusters. According to the definition of spectral clustering [9], the input of spectral clustering is the similarity matrix of data samples calculated from their feature vectors. However, in [8]'s experiment on spectral clustering, no data samples are used as input, only pure graph topologies. Without "data" points represented by features, there is no "similarity" among nodes.

3) To convincingly evaluate clustering performance, each input sample should have a ground-truth label (cluster membership) so that the correctness of the clustering algorithm's result can be checked [5, 4]. However, in [8], clustering correctness cannot be measured because there are no labels. Therefore, the experiments in [8] cannot demonstrate the effectiveness of spectrum-preserving node aggregation for spectral clustering.

4) In clustering, the number of clusters is meaningful and naturally determined by the data. For example, handwritten digit tasks have 10 clusters corresponding to digits 0-9, and the ModelNet40 data set (https://modelnet.cs.princeton.edu/) has 40 clusters corresponding to its 40 kinds of 3D objects (tables, planes, chairs, …). However, [8] chose an arbitrary number, 30, as the number of subsets for all graphs. This does not make sense for spectral clustering experiments and evaluations.

In contrast to [8], in this paper we integrate spectrum-preserving node aggregation into the general approximate spectral clustering framework [6] and construct a correspondence table based on the mapping operators. We then use this table to map the clusters of aggregated nodes back to the original samples in the data set to find the cluster memberships of the real data points. Based on this complete framework, we use real-world data sets to demonstrate the effectiveness of the aggregation scheme for data clustering. The major contributions of this work are summarized as follows:

  1. We propose a complete framework for improving spectral clustering by integrating spectrum-preserving node aggregation into the general approximate spectral clustering framework [6] and demonstrate its effectiveness on real-world data sets.

  2. In contrast to other acceleration methods based on representative nodes, such as KASP, Nyström and LSC, our method uses spectrally-representative nodes that directly preserve the spectrum of the graph Laplacian, which is key to clustering accuracy. As a result, for a very large data set such as Covtype, which has 581,012 instances, only our method can accelerate clustering without loss of accuracy; all the other representative-node-based methods suffer 15% to 26% accuracy degradation.

  3. Our method enables the construction of a nearly-noiseless manifold representation of the data set, which yields dramatic clustering accuracy improvements. For example, by removing more than 99.84% of the nodes of the MNIST data set, its noisy nodes are removed as well, and the clustering accuracy improves from 64.20% to 82.94%.

2 Preliminaries

k-means is the most fundamental clustering method [9]. It discovers k clusters by minimizing the following objective function:

F = \sum_{i=1}^{n}\sum_{j=1}^{k}\eta_{ij}\|\mathbf{x}_{i}-\boldsymbol{\mu}_{j}\|_{2},   (1)

where:

\eta_{ij} = \begin{cases} 1 & \text{if } \mathbf{x}_{i}\in C_{j} \\ 0 & \text{otherwise}, \end{cases}   (2)

and \boldsymbol{\mu}_{j} is the centroid of cluster C_{j}.

However, it often fails to handle complicated geometric shapes, such as non-convex shapes [10]. In contrast, spectral clustering is good at detecting non-convex and linearly non-separable patterns [11]. Spectral clustering includes the following three steps: 1) construct a Laplacian matrix L_G according to the similarities between data points; 2) embed the nodes into a k-dimensional space using the first k nontrivial eigenvectors of L_G; 3) apply k-means to partition the embedded data points into k clusters. As shown in Figure 1, the two moons data set contains two slightly entangled non-convex shapes, where each cluster corresponds to a moon; in the two circles data set, the samples are arranged in two concentric circles, where each cluster corresponds to a circle. k-means gives incorrect clustering results for both data sets, while spectral clustering produces an ideal clustering, because k-means only considers distances whereas spectral clustering considers the connectivity between nodes. The connectivity information is encoded in the spectrum of the underlying graph, so it is critical to preserve the spectrum when manipulating graphs [12].

Fig. 1: Performance of k-means and spectral clustering: (a) k-means on the two moons data set; (b) spectral clustering on the two moons data set; (c) k-means on the two circles data set; (d) spectral clustering on the two circles data set.
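As an illustration of this behavior (this is not the paper's code), the comparison of Fig. 1 can be sketched with scikit-learn, which builds a k-NN affinity graph and embeds it with Laplacian eigenvectors before running k-means; the dataset generator and parameter values below are illustrative choices.

```python
# Illustrative sketch (assumes scikit-learn is available): reproducing the
# Fig. 1 comparison on the two moons data set.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means only uses Euclidean distances to centroids, so it tends to split
# the two moons with a straight boundary.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering embeds a k-NN graph with Laplacian eigenvectors first,
# so connectivity rather than convexity drives the result.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print("k-means ARI:  ", adjusted_rand_score(y_true, km))
print("spectral ARI: ", adjusted_rand_score(y_true, sc))
```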

3 Methods

3.1 Algorithmic Framework

We first construct a standard k-NN graph G. Then, based on the node proximity measure proposed in [13], the spectral similarity of two nodes p and q can be calculated as:

Sim_{p,q} = \frac{|(\mathbf{X}_{p},\mathbf{X}_{q})|^{2}}{(\mathbf{X}_{p},\mathbf{X}_{p})(\mathbf{X}_{q},\mathbf{X}_{q})}, \quad (\mathbf{X}_{p},\mathbf{X}_{q}) := \sum_{k=1}^{K}\left(\mathbf{x}_{p}^{(k)}\cdot\mathbf{x}_{q}^{(k)}\right),   (3)

where \mathbf{X} := (\mathbf{x}^{(1)},\dots,\mathbf{x}^{(K)}) is obtained by applying Gauss-Seidel relaxation to solve L_G \mathbf{x}^{(i)} = 0 for i = 1,\dots,K, starting with K random vectors. We aggregate the nodes with high spectral similarity to form a few spectrally-representative nodes, reducing the graph size while preserving the spectrum of the original graph. By aggregating nodes based on spectral similarity, the reduced graph can best represent the original data set in the sense of minimizing the structural distortion [12], as shown in Figure 2.

Fig. 2: Visualization of node aggregation results on the USPS data set: (a) the adjacency graph corresponding to the original node set of 9,298 nodes; (b) the adjacency graph corresponding to the 50X-reduced node set of 138 pseudo-nodes.
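To make Eq. (3) concrete, the following is a minimal illustrative sketch (not the implementation of [13]): it smooths K random vectors with a few Gauss-Seidel sweeps on L_G x = 0 and evaluates the pairwise similarity, assuming a small dense Laplacian with nonzero diagonal for readability; practical implementations operate on sparse matrices.

```python
# Minimal illustrative sketch of Eq. (3); assumes a small dense Laplacian L
# with nonzero diagonal entries (real implementations use sparse matrices).
import numpy as np

def smoothed_test_vectors(L, K=5, sweeps=3, seed=0):
    """Smooth K random vectors with Gauss-Seidel sweeps on L x = 0."""
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    X = rng.standard_normal((n, K))
    for k in range(K):
        x = X[:, k]
        for _ in range(sweeps):
            for i in range(n):                  # one Gauss-Seidel sweep
                x[i] = -(L[i, :] @ x - L[i, i] * x[i]) / L[i, i]
    return X                                    # row p is the vector X_p of Eq. (3)

def spectral_similarity(X, p, q):
    """Sim_{p,q} from Eq. (3)."""
    return np.dot(X[p], X[q]) ** 2 / (np.dot(X[p], X[p]) * np.dot(X[q], X[q]))
```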

Then standard spectral clustering is performed on the aggregated nodes. The spectral clustering algorithm needs to perform a computationally expensive eigen-decomposition with O(n^3) time complexity to calculate the first k non-trivial eigenvectors of the graph Laplacian, where n is the number of nodes. By using spectrally-representative nodes to represent the global structure of the original data set, the number of nodes n is reduced, which speeds up the eigen-decomposition. If the desired node set size is not reached, the nodes can be further aggregated layer by layer using a multi-level aggregation scheme [8]: given the graph Laplacian L_G and the mapping operators from the finest level 1 to the coarsest level r, L_R = \mathbf{H}_{G}^{R} L_G \mathbf{H}_{R}^{G}, where \mathbf{H}_{G}^{R} = \mathbf{H}_{1}^{2}\mathbf{H}_{2}^{3}\cdots\mathbf{H}_{r-1}^{r} and \mathbf{H}_{R}^{G} = \mathbf{H}_{2}^{1}\mathbf{H}_{3}^{2}\cdots\mathbf{H}_{r}^{r-1}.
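As an illustration of this composition (a sketch only, under the assumption that each level's mapping operator is stored as a binary sparse aggregation matrix P_l of shape n_l x n_{l+1}, where P_l[i, a] = 1 iff node i of level l is aggregated into coarse node a; the operators in [8] may be weighted), the reduced Laplacian can be formed as follows.

```python
# Illustrative sketch (assumption: binary sparse aggregation matrices per level).
import numpy as np
import scipy.sparse as sp

def reduce_laplacian(L_G, mappings):
    """Compose per-level mapping operators and form the reduced Laplacian."""
    H = mappings[0]
    for P in mappings[1:]:
        H = H @ P                         # finest level -> coarsest level
    return (H.T @ L_G @ H).tocsr(), H     # L_R = H_G^R L_G H_R^G (here H_G^R = H^T)

# toy usage: a 4-node path graph coarsened by merging node pairs {0,1} and {2,3}
A = sp.csr_matrix(np.array([[0, 1, 0, 0], [1, 0, 1, 0],
                            [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float))
L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A
P = sp.csr_matrix(np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float))
L_R, H = reduce_laplacian(L, [P])
print(L_R.toarray())                      # 2x2 reduced Laplacian
```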

However, the goal of data clustering is to find the cluster memberships of the original samples in the data set rather than the clusters of the pseudo aggregated nodes. In this paper, we propose to map the cluster memberships back from the pseudo aggregated nodes to the original real data points. Specifically, we build a correspondence table using the mapping operators of each layer to associate each original sample with its aggregated node in the final layer. Each original data point is then assigned to the same cluster as its corresponding aggregated node. The complete algorithm flow is shown in Algorithm 1.

Algorithm 1 spectrum-preserving node aggregation-based high-performance spectral clustering framework

Input: A data set D with N samples x_1, ..., x_N \in R^d, number of clusters k.
Output: Clusters C_1, ..., C_k.

1:  Construct a k-nearest neighbor (kNN) graph G from the input data;
2:  Compute the adjacency matrix A_G and diagonal matrix D_G of graph G;
3:  Compute the Laplacian matrix L_G = D_G - A_G;
4:  Perform spectrum-preserving node aggregation to obtain the reduced graph Laplacian L_r;
5:  Build a correspondence table to associate each node with its corresponding aggregated node;
6:  Compute the eigenvectors u_1, ..., u_k that correspond to the bottom k nonzero eigenvalues of L_r;
7:  Construct U \in R^{|V_{L_r}| \times k}, with the k eigenvectors stored as column vectors;
8:  Perform the k-means algorithm to partition the rows of U into k clusters;
9:  Retrieve the cluster membership of each x_i by assigning it to the cluster of its corresponding aggregated node, and return the result.
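A minimal sketch of steps 6-9 is given below. It assumes L_r is a scipy sparse reduced Laplacian small enough (e.g., after 50X reduction) for a dense eigen-decomposition, and that `table[i]` stores the index of the aggregated node containing original sample x_i from step 5; these names are illustrative, not the paper's code.

```python
# Sketch of steps 6-9 of Algorithm 1 (illustrative names; `table[i]` is assumed
# to give the aggregated-node index of original sample x_i from step 5).
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_map_back(L_r, table, k):
    # Steps 6-7: eigenvectors for the bottom k nonzero eigenvalues of L_r.
    # L_r is small after aggregation, so a dense eigen-decomposition is cheap.
    vals, vecs = np.linalg.eigh(L_r.toarray())
    U = vecs[:, 1:k + 1]                  # skip the trivial constant eigenvector
    # Step 8: k-means on the rows of U gives pseudo-node cluster labels.
    pseudo_labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    # Step 9: each original sample inherits the label of its aggregated node.
    return pseudo_labels[np.asarray(table)]
```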

3.2 Nearly noiseless representation

One drawback of spectral clustering is that it is highly sensitive to noisy nodes in the graph, which limits its ability to accurately discover clusters: if a few noisy nodes form a narrow bridge between two densely connected communities, the algorithm tends to report the two communities plus the noisy nodes as a single community.

Relying on the strong capability of spectrum-preserving pseudo-nodes in preserving the spectrum of the graph Laplacian and the global manifold of the data set, almost all of the original nodes, including the noisy nodes, can be removed. For example, for the MNIST data set with 70,000 nodes, only 109 spectrally-representative nodes are enough to represent the global structure of the data set, which means 99.84% of the nodes can be removed so that the noisy nodes no longer exist in the aggregated graph. As a result, the clustering accuracy improves from 64.20% to 82.94% for the MNIST data set.

Although other approximate spectral clustering methods can also significantly reduce the node set to a small number of representative nodes, without using spectral analysis to choose spectrally-critical nodes their representative nodes cannot faithfully encode the spectrum of the original graph Laplacian. As a result, their clustering accuracy gains from denoising are counteracted by the loss of spectral information.

3.3 Algorithm Complexity

The computational complexity of spectrum-preserving node reduction is O(|E_G| log(|V|)), where |E_G| is the number of edges in the original graph and |V| is the number of nodes. The computational complexity of standard spectral clustering on the reduced graph is O(P^3), where P is the number of aggregated nodes. The complexity of cluster membership retrieval is O(N), where N is the number of samples in the original data set.

4 Experiment

Experiments are performed using MATLAB R2020b running on a laptop. The reported results are averaged over 10 runs.

4.1 Experiment Setup

One mid-sized data set (USPS), one large data set (MNIST) and one very large data set (Covtype) are used in our experiments. They can be downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/) and LibSVM Data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). USPS includes 9,298 images of USPS handwritten digits with 256 attributes; MNIST is a data set from Yann LeCun's website (http://yann.lecun.com/exdb/mnist/), which includes 70,000 images, each represented by 784 attributes; Covtype includes 581,012 instances for predicting forest cover type from cartographic variables, and each instance, with 54 attributes, belongs to one of seven classes.

We compare the proposed method against both the baseline and state-of-the-art fast spectral clustering methods, including: (1) the standard spectral clustering algorithm [9], (2) the Nyström method [7], (3) the landmark-based spectral clustering (LSC) method that uses random sampling for landmark selection [5], and (4) the KASP method, which uses k-means for centroid selection [6]. For a fair comparison, we use the same parameter setting as in [5] for the compared algorithms: the number of sampled points in the Nyström method (or the number of landmarks in LSC, or the number of centroids in KASP) is set to 500.

4.2 Evaluation Metrics

Clustering accuracy (ACC) is the most widely used measure of clustering quality [5, 4]. It is defined as follows:

ACC = \frac{\sum_{i=1}^{N}\delta(y_{i},\,map(c_{i}))}{N},   (4)

where N is the number of data samples, y_i is the ground-truth label, and c_i is the label generated by the algorithm. \delta(x,y) is a delta function defined as \delta(x,y) = 1 for x = y and \delta(x,y) = 0 otherwise. map(\cdot) is a permutation function, which can be realized using the Hungarian algorithm [14]. A higher value of ACC indicates better clustering quality.
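For illustration, Eq. (4) can be computed with scipy's Hungarian-algorithm routine (linear_sum_assignment); the sketch below, with hypothetical function and variable names, maximizes the number of correctly matched samples over all label permutations.

```python
# Illustrative sketch of Eq. (4): map(.) is realized with the Hungarian
# algorithm via scipy's linear_sum_assignment, maximizing matched samples.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels_true = np.unique(y_true)
    labels_pred = np.unique(y_pred)
    # confusion matrix: rows = predicted labels, columns = ground-truth labels
    C = np.zeros((labels_pred.size, labels_true.size), dtype=np.int64)
    for i, lp in enumerate(labels_pred):
        for j, lt in enumerate(labels_true):
            C[i, j] = np.sum((y_pred == lp) & (y_true == lt))
    row, col = linear_sum_assignment(-C)     # negate to maximize matches
    return C[row, col].sum() / y_true.size
```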

4.3 Experimental Results

Table 1: Spectral clustering accuracy (%)
Data Set   Standard SC   Nyström   KASP    LSC     Ours (5X)   Ours (10X)   Ours (50X)
USPS       64.31         69.31     70.62   66.28   70.65       72.25        81.39
MNIST      64.20         55.86     71.23   58.94   70.99       72.42        82.18
Covtype    48.81         33.24     27.56   22.60   48.77       44.50        48.28
Table 2: Runtime (seconds)
Data Set   Standard SC   Nyström   KASP    LSC     Ours (5X)   Ours (10X)   Ours (50X)
USPS       0.72          0.29      0.16    0.22    0.21        0.18         0.17
MNIST      252.59        0.95      0.18    0.49    0.59        0.35         0.21
Covtype    128.70        5.48      0.70    3.77    0.92        0.50         0.49
Table 3: Graph complexity comparison
Data Set   |V_orig|    |V_reduced| (5X)   |V_reduced| (10X)   |V_reduced| (50X)
USPS       9,298       1,692              767                 138
MNIST      70,000      13,658             6,368               1,182
Covtype    581,012     104,260            44,188              8,192
Fig. 3: ACC vs. reduction ratio for the MNIST data set.
Fig. 4: Graph size vs. reduction ratio for the MNIST data set.

Table 1 shows the clustering accuracy of the compared methods and of our method with node reduction ratios of 5X, 10X and 50X. The runtimes of the eigen-decomposition and k-means steps are reported in Table 2. As observed, the proposed method consistently leads to dramatic performance improvements, beating all competitors in clustering accuracy across all three data sets: our method achieves more than 10% accuracy gain on USPS, 11% gain on MNIST and 15% gain on Covtype over the second-best methods; for the USPS and MNIST data sets, our method achieves over 17% and 18% accuracy gains over standard spectral clustering, respectively. The superior clustering results of our method clearly illustrate that a spectrum-preserving concise representation can improve clustering accuracy by removing redundant or false relations among data samples.

As shown in Table 3, the 50X-reduced graphs generated by our framework have only 138, 1,182 and 8,192 nodes for the USPS, MNIST and Covtype data sets, respectively, thereby allowing much faster eigen-decompositions. As shown in Fig. 3 and Fig. 4, the proposed method consistently produces high clustering accuracy as the node reduction ratio increases. For the very large Covtype data set, our method is the only one that accelerates the spectral clustering algorithm without loss of clustering accuracy.

5 Conclusion

In this paper, a complete framework for improving spectral clustering is proposed. We use real-world data sets to demonstrate the effectiveness of the method. Our method outperforms state-of-the-art methods by a large margin.

References

  • [1] Feiping Nie, Dong Xu, Ivor Wai-Hung Tsang, and Changshui Zhang, “Spectral embedded clustering,” in Twenty-First International Joint Conference on Artificial Intelligence, 2009.
  • [2] Francis Bach and Michael Jordan, “Learning spectral clustering,” Advances in neural information processing systems, vol. 16, no. 2, pp. 305–312, 2004.
  • [3] Xia Wan and C-CJ Kuo, “A new approach to image retrieval with hierarchical color clustering,” IEEE transactions on circuits and systems for video technology, vol. 8, no. 5, pp. 628–643, 1998.
  • [4] Jialu Liu, Chi Wang, Marina Danilevsky, and Jiawei Han, “Large-scale spectral clustering on graphs,” in Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. AAAI Press, 2013, pp. 1486–1492.
  • [5] Xinlei Chen and Deng Cai, “Large scale spectral clustering with landmark-based representation.,” in AAAI, 2011.
  • [6] Donghui Yan, Ling Huang, and Michael I Jordan, “Fast approximate spectral clustering,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009, pp. 907–916.
  • [7] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik, “Spectral grouping using the nystrom method,” IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 2, pp. 214–225, 2004.
  • [8] Zhiqiang Zhao, Ying Zhang, and Zhuo Feng, “Towards scalable spectral embedding and data visualization via spectral coarsening,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 869–877.
  • [9] Ulrike Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.
  • [10] Feiping Nie, Cheng-Long Wang, and Xuelong Li, “K-multiple-means: A multiple-means clustering method with specified k clusters,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 959–967.
  • [11] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al., “On spectral clustering: Analysis and an algorithm,” Advances in neural information processing systems, vol. 2, pp. 849–856, 2002.
  • [12] Yongyu Wang, High Performance Spectral Methods for Graph-Based Machine Learning, Ph.D. thesis, Michigan Technological University, 2021.
  • [13] Oren E Livne and Achi Brandt, “Lean algebraic multigrid (lamg): Fast graph laplacian linear solver,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. B499–B522, 2012.
  • [14] Christos H Papadimitriou and Kenneth Steiglitz, Combinatorial optimization: algorithms and complexity, Courier Corporation, 1982.