
Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation

Abstract

Spectral clustering is an important graph clustering technique. However, it suffers from a scalability problem due to its computationally expensive eigen-decomposition procedure, and it is highly sensitive to noisy nodes in the graph. In this work, we address both problems simultaneously by using spectrum-preserving node aggregation to generate a nearly-noiseless, concise representation of the data set. Specifically, we create a small set of spectrally-representative pseudo-nodes based on spectral analysis. Then, the standard spectral clustering algorithm is performed on the smaller node set. Finally, we map the clusters of pseudo-nodes back to the original data set to find the cluster memberships of the data points. The proposed framework has a nearly-linear time complexity while significantly improving clustering accuracy. Experimental results show dramatically improved clustering performance compared with state-of-the-art methods.

Index Terms—  Spectral Clustering, Node Reduction

1 Introduction

Clustering is one of the most fundamental machine learning problems [1]. It aims to assign data samples in a data set into different clusters in such a way that samples in the same cluster are more similar compared to those in different clusters.

In the past decades, many clustering techniques have been proposed. Among them, spectral clustering has drawn considerable attention and has been widely applied to computer vision, image processing and speech processing [2, 3]. Although spectral clustering has superior performance, the involved eigen-decomposition procedure has a time complexity of O(N^3), where N is the number of samples in the data set. The high computational cost can easily become an obstacle in many large-scale applications [4, 5]. To address this problem, considerable effort has been devoted to approximate spectral clustering: [6] proposed a k-means-based hierarchical spectral clustering framework (KASP); inspired by sparse coding theory, [5] proposed a landmark-based representation method (LSC); [7] leveraged the Nyström method to approximate the affinity matrix. However, none of these methods preserves the spectrum of the original graph, which results in a significant degradation of clustering accuracy. For example, on the Covtype data set the clustering accuracy drops by more than 26% with LSC, more than 21% with KASP and more than 15% with the Nyström method.

[8] attempted to apply a graph coarsening method to divide a graph topology into an arbitrarily chosen number of partitions. However, its claims about spectral clustering have the following problems:

1) Methodologically, the algorithm flow in [8] only partitions the few pseudo aggregated nodes. Without mapping the pseudo clusters back to the original real nodes via the mapping operators, [8] can only provide an auxiliary intermediate result for data clustering.

2) Partitioning pure graph topologies into small parts and finding clusters of data samples are two different tasks. Clustering is the task of grouping similar samples into the same clusters. According to the definition of spectral clustering [9], the input of spectral clustering is the similarity matrix of data samples calculated from their feature vectors. However, in [8]'s experiment on spectral clustering, no data samples are used as input, only pure graph topologies. Without "data" points represented by features, there is no "similarity" among nodes.

3) To convincingly evaluate clustering performance, each input sample should have a ground-truth label (cluster membership) so that the correctness of the clustering algorithm's result can be checked [5, 4]. However, in [8], clustering correctness cannot be measured because there are no labels. Therefore, the experiments in [8] cannot demonstrate the effectiveness of spectrum-preserving node aggregation for spectral clustering.

4) In clustering, the number of clusters is meaningful and naturally determined by the data. For example, handwritten digit tasks have 10 clusters corresponding to digits 0-9, and the ModelNet40 data set (https://modelnet.cs.princeton.edu/) has 40 clusters corresponding to its 40 kinds of 3D objects (tables, planes, chairs, …). However, [8] chose an arbitrary number, 30, as the number of subsets for all graphs. This does not make sense for spectral clustering experiments and evaluations.

In contrast to [8], in this paper we integrate spectrum-preserving node aggregation into the general approximate spectral clustering framework [6] and construct a correspondence table based on the mapping operators. We then use this table to map the clusters of aggregated nodes back to the original samples in the data set to find the cluster memberships of the real data points. Based on this complete framework, we use real-world data sets to demonstrate the effectiveness of the aggregation scheme for data clustering. The major contributions of this work are summarized as follows:

  1. We propose a complete framework for improving spectral clustering by integrating spectrum-preserving node aggregation into the general approximate spectral clustering framework [6] and demonstrate its effectiveness on real-world data sets.

  2. In contrast to other acceleration methods based on representative nodes, such as KASP, Nyström and LSC, our method uses spectrally-representative nodes that directly preserve the spectrum of the graph Laplacian, which is key to clustering accuracy. As a result, for a very large data set such as Covtype, which has 581,012 instances, only our method can accelerate clustering without loss of accuracy; all the other representative-node-based methods suffer 15% to 26% accuracy degradation.

  3. Our method enables the construction of a nearly-noiseless manifold representation of the data set, which yields dramatic clustering accuracy improvements. For example, by removing more than 99.84% of the nodes of the MNIST data set, its noisy nodes are removed as well, and the clustering accuracy improves from 64.20% to 82.94%.

2 Preliminaries

k-means is the most fundamental clustering method [9]. It discovers k clusters by minimizing the following objective function:

F = \sum_{i=1}^{n}\sum_{j=1}^{k}\eta_{ij}\|\mathbf{x}_{i}-\boldsymbol{\mu}_{j}\|_{2},   (1)

where:

\eta_{ij} = \begin{cases} 1 & \text{if } \mathbf{x}_{i}\in C_{j} \\ 0 & \text{otherwise}, \end{cases}   (2)

and \boldsymbol{\mu}_{j} is the centroid of cluster C_{j}.

However, it often fails to handle complicated geometric shapes, such as non-convex shapes [10]. In contrast, spectral clustering is good at detecting non-convex and linearly non-separable patterns [11]. Spectral clustering includes the following three steps: 1) construct a Laplacian matrix L_G according to the similarities between data points; 2) embed the nodes into a k-dimensional space using the first k nontrivial eigenvectors of L_G; 3) apply k-means to partition the embedded data points into k clusters. As shown in Figure 1, the two moons data set contains two slightly entangled non-convex shapes, where each cluster corresponds to a moon; in the two circles data set, the samples are arranged in two concentric circles, where each cluster corresponds to a circle. k-means gives incorrect clustering results for both data sets, while spectral clustering produces an ideal clustering, because k-means only considers distances whereas spectral clustering considers the connectivity between nodes. The connectivity information is encoded in the spectrum of the underlying graph, so it is critical to preserve the spectrum when manipulating graphs [12].

Fig. 1: Performance of k-means and spectral clustering: (a) k-means on the two moons data set; (b) spectral clustering on the two moons data set; (c) k-means on the two circles data set; (d) spectral clustering on the two circles data set.
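As an illustration of this behavior (this is not the paper's code), the comparison of Fig. 1 can be sketched with scikit-learn, which builds a k-NN affinity graph and embeds it with Laplacian eigenvectors before running k-means; the dataset generator and parameter values below are illustrative choices.

```python
# Illustrative sketch (assumes scikit-learn is available): reproducing the
# Fig. 1 comparison on the two moons data set.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means only uses Euclidean distances to centroids, so it tends to split
# the two moons with a straight boundary.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering embeds a k-NN graph with Laplacian eigenvectors first,
# so connectivity rather than convexity drives the result.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print("k-means ARI:  ", adjusted_rand_score(y_true, km))
print("spectral ARI: ", adjusted_rand_score(y_true, sc))
```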

3 Methods

3.1 Algorithmic Framework

We first construct a standard k-NN graph G. Then, based on the node proximity measure proposed in [13], the spectral similarity of two nodes p and q can be calculated as:

Sim_{p,q} = \frac{|(\mathbf{X}_{p},\mathbf{X}_{q})|^{2}}{(\mathbf{X}_{p},\mathbf{X}_{p})(\mathbf{X}_{q},\mathbf{X}_{q})}, \quad (\mathbf{X}_{p},\mathbf{X}_{q}) := \sum_{k=1}^{K}\left(\mathbf{x}_{p}^{(k)}\cdot\mathbf{x}_{q}^{(k)}\right),   (3)

where \mathbf{X} := (\mathbf{x}^{(1)},\dots,\mathbf{x}^{(K)}) is obtained by applying Gauss-Seidel relaxation to solve L_G \mathbf{x}^{(i)} = 0 for i = 1,\dots,K, starting with K random vectors. We aggregate the nodes with high spectral similarity to form a few spectrally-representative nodes, reducing the graph size while preserving the spectrum of the original graph. By aggregating nodes based on spectral similarity, the reduced graph can best represent the original data set in the sense of minimizing the structural distortion [12], as shown in Figure 2.

Fig. 2: Visualization of node aggregation results on the USPS data set: (a) the adjacency graph corresponding to the original node set of 9,298 nodes; (b) the adjacency graph corresponding to the 50X-reduced node set of 138 pseudo-nodes.
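To make Eq. (3) concrete, the following is a minimal illustrative sketch (not the implementation of [13]): it smooths K random vectors with a few Gauss-Seidel sweeps on L_G x = 0 and evaluates the pairwise similarity, assuming a small dense Laplacian with nonzero diagonal for readability; practical implementations operate on sparse matrices.

```python
# Minimal illustrative sketch of Eq. (3); assumes a small dense Laplacian L
# with nonzero diagonal entries (real implementations use sparse matrices).
import numpy as np

def smoothed_test_vectors(L, K=5, sweeps=3, seed=0):
    """Smooth K random vectors with Gauss-Seidel sweeps on L x = 0."""
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    X = rng.standard_normal((n, K))
    for k in range(K):
        x = X[:, k]
        for _ in range(sweeps):
            for i in range(n):                  # one Gauss-Seidel sweep
                x[i] = -(L[i, :] @ x - L[i, i] * x[i]) / L[i, i]
    return X                                    # row p is the vector X_p of Eq. (3)

def spectral_similarity(X, p, q):
    """Sim_{p,q} from Eq. (3)."""
    return np.dot(X[p], X[q]) ** 2 / (np.dot(X[p], X[p]) * np.dot(X[q], X[q]))
```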

Then standard spectral clustering is performed on the aggregated nodes. The spectral clustering algorithm needs to perform a computationally expensive eigen-decomposition with O(n^3) time complexity to calculate the first k non-trivial eigenvectors of the graph Laplacian, where n is the number of nodes. By using spectrally-representative nodes to represent the global structure of the original data set, the number of nodes n is reduced, which speeds up the eigen-decomposition. If the desired node set size is not reached, the nodes can be further aggregated layer by layer using a multi-level aggregation scheme [8]: given the graph Laplacian L_G and the mapping operators from the finest level 1 to the coarsest level r, L_R = \mathbf{H}_{G}^{R} L_G \mathbf{H}_{R}^{G}, where \mathbf{H}_{G}^{R} = \mathbf{H}_{1}^{2}\mathbf{H}_{2}^{3}\cdots\mathbf{H}_{r-1}^{r} and \mathbf{H}_{R}^{G} = \mathbf{H}_{2}^{1}\mathbf{H}_{3}^{2}\cdots\mathbf{H}_{r}^{r-1}.
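As an illustration of this composition (a sketch only, under the assumption that each level's mapping operator is stored as a binary sparse aggregation matrix P_l of shape n_l x n_{l+1}, where P_l[i, a] = 1 iff node i of level l is aggregated into coarse node a; the operators in [8] may be weighted), the reduced Laplacian can be formed as follows.

```python
# Illustrative sketch (assumption: binary sparse aggregation matrices per level).
import numpy as np
import scipy.sparse as sp

def reduce_laplacian(L_G, mappings):
    """Compose per-level mapping operators and form the reduced Laplacian."""
    H = mappings[0]
    for P in mappings[1:]:
        H = H @ P                         # finest level -> coarsest level
    return (H.T @ L_G @ H).tocsr(), H     # L_R = H_G^R L_G H_R^G (here H_G^R = H^T)

# toy usage: a 4-node path graph coarsened by merging node pairs {0,1} and {2,3}
A = sp.csr_matrix(np.array([[0, 1, 0, 0], [1, 0, 1, 0],
                            [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float))
L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A
P = sp.csr_matrix(np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float))
L_R, H = reduce_laplacian(L, [P])
print(L_R.toarray())                      # 2x2 reduced Laplacian
```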

However, the goal of data clustering is to find the cluster memberships of the original samples in the data set rather than the clusters of the pseudo aggregated nodes. In this paper, we propose to map the cluster memberships back from the pseudo aggregated nodes to the original real data points. Specifically, we build a correspondence table using the mapping operators of each layer to associate each original sample with its aggregated node in the final layer. Each original data point is then assigned to the same cluster as its corresponding aggregated node. The complete algorithm flow is shown in Algorithm 1.

Algorithm 1 spectrum-preserving node aggregation-based high-performance spectral clustering framework

Input: A data set D with N samples x_1, ..., x_N \in R^d, number of clusters k.
Output: Clusters C_1, ..., C_k.

1:  Construct a k-nearest neighbor (kNN) graph G from the input data;
2:  Compute the adjacency matrix A_G and diagonal matrix D_G of graph G;
3:  Compute the Laplacian matrix L_G = D_G - A_G;
4:  Perform spectrum-preserving node aggregation to obtain the reduced graph Laplacian L_r;
5:  Build a correspondence table to associate each node with its corresponding aggregated node;
6:  Compute the eigenvectors u_1, ..., u_k that correspond to the bottom k nonzero eigenvalues of L_r;
7:  Construct U \in R^{|V_{L_r}| \times k}, with the k eigenvectors stored as column vectors;
8:  Perform the k-means algorithm to partition the rows of U into k clusters;
9:  Retrieve the cluster membership of each x_i by assigning it to the cluster of its corresponding aggregated node, and return the result.
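A minimal sketch of steps 6-9 is given below. It assumes L_r is a scipy sparse reduced Laplacian small enough (e.g., after 50X reduction) for a dense eigen-decomposition, and that `table[i]` stores the index of the aggregated node containing original sample x_i from step 5; these names are illustrative, not the paper's code.

```python
# Sketch of steps 6-9 of Algorithm 1 (illustrative names; `table[i]` is assumed
# to give the aggregated-node index of original sample x_i from step 5).
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_map_back(L_r, table, k):
    # Steps 6-7: eigenvectors for the bottom k nonzero eigenvalues of L_r.
    # L_r is small after aggregation, so a dense eigen-decomposition is cheap.
    vals, vecs = np.linalg.eigh(L_r.toarray())
    U = vecs[:, 1:k + 1]                  # skip the trivial constant eigenvector
    # Step 8: k-means on the rows of U gives pseudo-node cluster labels.
    pseudo_labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    # Step 9: each original sample inherits the label of its aggregated node.
    return pseudo_labels[np.asarray(table)]
```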

3.2 Nearly noiseless representation

One drawback of spectral clustering is that it is highly sensitive to noisy nodes in the graph, which limits its ability to accurately discover clusters: if a few noisy nodes form a narrow bridge between two densely connected communities, the algorithm tends to report the two communities plus the noisy nodes as a single community.

Relying on the strong capability of spectrum-preserving pseudo-nodes in preserving the spectrum of the graph Laplacian and the global manifold of the data set, almost all of the original nodes, including the noisy nodes, can be removed. For example, for the MNIST data set with 70,000 nodes, only 109 spectrally-representative nodes are enough to represent the global structure of the data set, which means 99.84% of the nodes can be removed so that the noisy nodes no longer exist in the aggregated graph. As a result, the clustering accuracy improves from 64.20% to 82.94% for the MNIST data set.

Although other approximate spectral clustering methods can also significantly reduce the node set to a small number of representative nodes, without using spectral analysis to choose spectrally-critical nodes their representative nodes cannot faithfully encode the spectrum of the original graph Laplacian. As a result, their clustering accuracy gains from denoising are counteracted by the loss of spectral information.

3.3 Algorithm Complexity

The computational complexity of spectrum-preserving node reduction is O(|E_G| log(|V|)), where |E_G| is the number of edges in the original graph and |V| is the number of nodes. The computational complexity of standard spectral clustering on the reduced graph is O(P^3), where P is the number of aggregated nodes. The complexity of cluster membership retrieval is O(N), where N is the number of samples in the original data set.

4 Experiment

Experiments are performed using MATLAB R2020b running on a laptop. The reported results are averaged over 10 runs.

4.1 Experiment Setup

One mid-sized data set (USPS), one large data set (MNIST) and one very large data set (Covtype) are used in our experiments. They can be downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/) and LibSVM Data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). USPS includes 9,298 images of USPS handwritten digits with 256 attributes; MNIST is a data set from Yann LeCun's website (http://yann.lecun.com/exdb/mnist/), which includes 70,000 images, each represented by 784 attributes; Covtype includes 581,012 instances for predicting forest cover type from cartographic variables, and each instance, with 54 attributes, belongs to one of seven classes.

We compare the proposed method against both the baseline and state-of-the-art fast spectral clustering methods, including: (1) the standard spectral clustering algorithm [9], (2) the Nyström method [7], (3) the landmark-based spectral clustering (LSC) method that uses random sampling for landmark selection [5], and (4) the KASP method, which uses k-means for centroid selection [6]. For a fair comparison, we use the same parameter setting as in [5] for the compared algorithms: the number of sampled points in the Nyström method (or the number of landmarks in LSC, or the number of centroids in KASP) is set to 500.

4.2 Evaluation Metrics

Clustering accuracy (ACC) is the most widely used measure of clustering quality [5, 4]. It is defined as follows:

ACC = \frac{\sum_{i=1}^{N}\delta(y_{i},\,map(c_{i}))}{N},   (4)

where N is the number of data samples, y_i is the ground-truth label, and c_i is the label generated by the algorithm. \delta(x,y) is a delta function defined as \delta(x,y) = 1 for x = y and \delta(x,y) = 0 otherwise. map(\cdot) is a permutation function, which can be realized using the Hungarian algorithm [14]. A higher value of ACC indicates better clustering quality.
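For illustration, Eq. (4) can be computed with scipy's Hungarian-algorithm routine (linear_sum_assignment); the sketch below, with hypothetical function and variable names, maximizes the number of correctly matched samples over all label permutations.

```python
# Illustrative sketch of Eq. (4): map(.) is realized with the Hungarian
# algorithm via scipy's linear_sum_assignment, maximizing matched samples.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels_true = np.unique(y_true)
    labels_pred = np.unique(y_pred)
    # confusion matrix: rows = predicted labels, columns = ground-truth labels
    C = np.zeros((labels_pred.size, labels_true.size), dtype=np.int64)
    for i, lp in enumerate(labels_pred):
        for j, lt in enumerate(labels_true):
            C[i, j] = np.sum((y_pred == lp) & (y_true == lt))
    row, col = linear_sum_assignment(-C)     # negate to maximize matches
    return C[row, col].sum() / y_true.size
```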

4.3 Experimental Results

Table 1: Spectral clustering accuracy (%)
Data Set   Standard SC   Nyström   KASP    LSC     Ours (5X)   Ours (10X)   Ours (50X)
USPS       64.31         69.31     70.62   66.28   70.65       72.25        81.39
MNIST      64.20         55.86     71.23   58.94   70.99       72.42        82.18
Covtype    48.81         33.24     27.56   22.60   48.77       44.50        48.28
Table 2: Runtime (seconds)
Data Set   Standard SC   Nyström   KASP    LSC     Ours (5X)   Ours (10X)   Ours (50X)
USPS       0.72          0.29      0.16    0.22    0.21        0.18         0.17
MNIST      252.59        0.95      0.18    0.49    0.59        0.35         0.21
Covtype    128.70        5.48      0.70    3.77    0.92        0.50         0.49
Table 3: Graph complexity comparison
Data Set   |V_orig|    |V_reduced| (5X)   |V_reduced| (10X)   |V_reduced| (50X)
USPS       9,298       1,692              767                 138
MNIST      70,000      13,658             6,368               1,182
Covtype    581,012     104,260            44,188              8,192
Fig. 3: ACC vs. reduction ratio for the MNIST data set.
Fig. 4: Graph size vs. reduction ratio for the MNIST data set.

Table 1 shows the clustering accuracy of the compared methods and of our method with node reduction ratios of 5X, 10X and 50X. The runtimes of the eigen-decomposition and k-means steps are reported in Table 2. As observed, the proposed method consistently leads to dramatic performance improvements, beating all competitors in clustering accuracy across all three data sets: our method achieves more than 10% accuracy gain on USPS, 11% gain on MNIST and 15% gain on Covtype over the second-best methods; for the USPS and MNIST data sets, our method achieves over 17% and 18% accuracy gains over standard spectral clustering, respectively. The superior clustering results of our method clearly illustrate that a spectrum-preserving concise representation can improve clustering accuracy by removing redundant or false relations among data samples.

As shown in Table 3, the 50X-reduced graphs generated by our framework have only 138, 1,182 and 8,192 nodes for the USPS, MNIST and Covtype data sets, respectively, thereby allowing much faster eigen-decompositions. As shown in Fig. 3 and Fig. 4, the proposed method consistently produces high clustering accuracy as the node reduction ratio increases. For the very large Covtype data set, our method is the only one that accelerates the spectral clustering algorithm without loss of clustering accuracy.

5 Conclusion

In this paper, a complete framework for improving spectral clustering is proposed. We use real-world data sets to demonstrate the effectiveness of the method. Our method outperforms state-of-the-art methods by a large margin.

References

  • [1] Feiping Nie, Dong Xu, Ivor Wai-Hung Tsang, and Changshui Zhang, “Spectral embedded clustering,” in Twenty-First International Joint Conference on Artificial Intelligence, 2009.
  • [2] Francis Bach and Michael Jordan, “Learning spectral clustering,” Advances in neural information processing systems, vol. 16, no. 2, pp. 305–312, 2004.
  • [3] Xia Wan and C-CJ Kuo, “A new approach to image retrieval with hierarchical color clustering,” IEEE transactions on circuits and systems for video technology, vol. 8, no. 5, pp. 628–643, 1998.
  • [4] Jialu Liu, Chi Wang, Marina Danilevsky, and Jiawei Han, “Large-scale spectral clustering on graphs,” in Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. AAAI Press, 2013, pp. 1486–1492.
  • [5] Xinlei Chen and Deng Cai, “Large scale spectral clustering with landmark-based representation.,” in AAAI, 2011.
  • [6] Donghui Yan, Ling Huang, and Michael I Jordan, “Fast approximate spectral clustering,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009, pp. 907–916.
  • [7] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik, “Spectral grouping using the nystrom method,” IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 2, pp. 214–225, 2004.
  • [8] Zhiqiang Zhao, Ying Zhang, and Zhuo Feng, “Towards scalable spectral embedding and data visualization via spectral coarsening,” in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 869–877.
  • [9] Ulrike Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.
  • [10] Feiping Nie, Cheng-Long Wang, and Xuelong Li, “K-multiple-means: A multiple-means clustering method with specified k clusters,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 959–967.
  • [11] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al., “On spectral clustering: Analysis and an algorithm,” Advances in neural information processing systems, vol. 2, pp. 849–856, 2002.
  • [12] Yongyu Wang, High Performance Spectral Methods for Graph-Based Machine Learning, Ph.D. thesis, Michigan Technological University, 2021.
  • [13] Oren E Livne and Achi Brandt, “Lean algebraic multigrid (lamg): Fast graph laplacian linear solver,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. B499–B522, 2012.
  • [14] Christos H Papadimitriou and Kenneth Steiglitz, Combinatorial optimization: algorithms and complexity, Courier Corporation, 1982.