A Compressed Sensing Based Least Squares Approach to Semi-supervised Local Cluster Extraction
Abstract
A least squares semi-supervised local clustering algorithm based on the idea of compressed sensing is proposed to extract clusters from a graph with known adjacency matrix. The algorithm is based on a two-stage approach similar to the one in [26]. However, under a weaker assumption and with lower computational complexity than the method in [26], our algorithm is shown to be able to find a desired cluster with high probability. The “one cluster at a time” feature of our method distinguishes it from other global clustering methods. Several numerical experiments are conducted on synthetic data such as the stochastic block model and real data such as MNIST, the political blogs network, and the AT&T and YaleB human face data sets to demonstrate the effectiveness and efficiency of our algorithm.
1 Introduction
Informally speaking, graph clustering aims at dividing the set of vertices of a graph into subsets in such a way that there are more edges within each subset and fewer edges between different subsets. When analyzing a graph, one of the primary interests is to find its underlying clustered structure, as vertices in the same cluster can reasonably be assumed to share some latent similarity. For data sets which are not presented as graphs, we can create a suitable auxiliary graph such as the $k$-nearest-neighbors ($k$-NN) graph based on the given data; see, for example, [42], [30] and [21]. Then we can apply graph clustering techniques on this auxiliary graph.
The graph clustering problem has recently become prevalent in areas such as social network analysis [16], [20] and [24], image classification [6], [7] and [42], and natural language processing [11] and [31]. For example, suppose the vertices of a social network graph represent users of a social network (e.g. Facebook, LinkedIn); then the edges could represent pairs of users who are connected to each other. Sets of nodes with high inter-connectivity, which we call communities or clusters, could represent friendship groups or co-workers. By identifying those communities we can suggest new connections to users. Note that some networks are directed (e.g. Twitter, citation networks), which can make community detection more subtle. For the scope of this paper, we will only focus on weighted undirected graphs.
The classical graph based clustering problem is a global clustering problem which assigns every vertex to a unique cluster, assuming there are no multi-class vertices. It is usually considered an unsupervised learning problem, which can be solved by methods such as spectral clustering [29], [35] and [50] or by finding an optimal cut of the graph [12], [13]; these approaches are generally computationally expensive and hard to implement for large data sets. It can also be solved in a semi-supervised fashion, as in [25], [21] and [49]. However, sometimes one is only interested in finding a single cluster which contains the target vertices, given some prior knowledge of the labels for a small portion of the true cluster, which is usually attainable for real data. This type of problem is called local clustering, or local cluster extraction, which, loosely speaking, takes a set of labeled vertices (called seeds) as input and returns a cluster containing those seeds. The local clustering problem has not been studied exhaustively, and many aspects of it remain open. Some recent related works are [18], [47], [48], [44] and [26]. In particular, the work in [26] is one of the recent works with the same problem setting as this paper. We propose a new semi-supervised local clustering approach using the ideas of compressed sensing and the method of least squares to make the clustering effective and efficient. Indeed, as we will see in the numerical experiments section, our approach outperforms the work in [26] in terms of both accuracy and efficiency.
The main contribution of this paper is the local cluster extraction Algorithms 3 and 4, which improve the performance of the algorithms in [26] and also slightly improve the state-of-the-art result in [2] for the political blogs network [3]. They also achieve better or comparable results on the synthetic stochastic block model, human face data, and MNIST data compared with several other modern local clustering or semi-supervised algorithms.
The subsequent sections of this paper are structured as follows. In Section 2, we give a brief introduction to spectral clustering and the concept of the graph Laplacian, and we state the assumptions on the graph model which we will use later for the theoretical analysis. In Section 3, we explain the main algorithms for solving the local cluster extraction problem in two stages and show the correctness of our algorithms asymptotically. In Section 4, we analyze the computational complexity of our algorithms. In Section 5, several synthetic and real data sets are used to evaluate the performance of our algorithms, and we compare their performance with state-of-the-art results.
2 Preliminaries and Models
2.1 Notations and Definitions
We use the standard notation $G=(V,E)$ to denote a graph with vertex set $V$ and edge set $E$. For the case $|V|=n$, we identify $V$ with the set of integers $[n]=\{1,2,\dots,n\}$. We use $A$ to denote the (possibly non-negatively weighted) adjacency matrix of $G$, so in the undirected case $A$ is a symmetric matrix. Let $D$ be the diagonal matrix $D=\mathrm{diag}(d_1,\dots,d_n)$, where each $d_i$ is the degree of vertex $i$. We have the following definition.
Definition 2.1.
The unnormalized graph Laplacian is defined as $L=D-A$. There are also two other normalized graph Laplacians, namely the symmetric graph Laplacian $L_{sym}=D^{-1/2}LD^{-1/2}=I-D^{-1/2}AD^{-1/2}$ and the random walk graph Laplacian $L_{rw}=D^{-1}L=I-D^{-1}A$.
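To make these definitions concrete, the following minimal Python sketch (our own illustration; the experiments in this paper are implemented in MATLAB) computes the three Laplacians from a given adjacency matrix and checks the statement of Lemma 2.1 below on a toy graph with two connected components.

```python
import numpy as np

def graph_laplacians(A):
    """Return the unnormalized, symmetric, and random walk Laplacians of A.

    A is assumed to be a symmetric (possibly weighted) adjacency matrix
    with strictly positive vertex degrees.
    """
    d = A.sum(axis=1)                       # vertex degrees
    L = np.diag(d) - A                      # L = D - A
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]   # D^{-1/2} L D^{-1/2}
    L_rw = L / d[:, None]                   # D^{-1} L
    return L, L_sym, L_rw

# Toy example: two disjoint triangles, i.e. two connected components.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
L, L_sym, L_rw = graph_laplacians(A)
# The kernel of L_rw is two-dimensional, matching the two components.
print(6 - np.linalg.matrix_rank(L_rw))      # prints 2
```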
The following result serves as the foundation of our approach for solving the graph clustering problem; we omit its proof and refer directly to [8] and [29].
Lemma 2.1.
Let $G$ be an undirected graph of size $n$ with non-negative weights. Then the multiplicity of the eigenvalue $0$ of $L_{rw}$ equals the number of connected components $C_1,\dots,C_k$ in $G$, and the indicator vectors $\mathbf{1}_{C_1},\dots,\mathbf{1}_{C_k}$ of these components span the kernel of $L_{rw}$.
Let us introduce some more notation which we will use later. Suppose for the moment that we have information about the underlying cluster of each vertex; then it is useful to write $G$ as a union of two edge-disjoint subgraphs $G=G^{in}\cup G^{out}$, where $G^{in}$ consists of only intra-connection edges and $G^{out}$ consists of only inter-connection edges. We will use $d^{in}_i$ to denote the degree of vertex $i$ in the subgraph $G^{in}$, and $d^{out}_i$ to denote the degree of vertex $i$ in the subgraph $G^{out}$. We will also use $A^{in}$ and $L^{in}$ to denote the adjacency matrix and graph Laplacian associated with $G^{in}$, and $A^{out}$ and $L^{out}$ to denote the adjacency matrix and graph Laplacian associated with $G^{out}$. Note that these notations are just for convenience in the analysis of the next section; in reality we have no assurance about which cluster each individual vertex belongs to, so we have no access to $G^{in}$ and $G^{out}$. It is also worthwhile to point out that $A=A^{in}+A^{out}$, but $L\neq L^{in}+L^{out}$ in general. Furthermore, we will use $|M|$ or $|v|$ to denote the matrix or vector whose entries are replaced by their absolute values, and we will use $|S|$ to denote the size of $S$ whenever $S$ is a set. In the later sections, we will write $L$ and $L^{in}$ for the random walk graph Laplacians of $G$ and $G^{in}$, and use $L_T$ and $L^{in}_T$ to denote the submatrices of $L$ and $L^{in}$ whose columns are indexed by a subset $T$. For convenience, we collect the notation used throughout the paper in Table 1.
Symbol | Meaning
---|---
$G$ | A general graph
$n$ | Size of $G$
$V(G)$ | Set of vertices of graph $G$ (we identify $V(G)$ with $[n]=\{1,\dots,n\}$ throughout the paper)
$|S|$ | Size of a set $S$
$E(G)$ | Set of edges of graph $G$
$E^{in}$ | Subset of $E(G)$ which consists of only intra-connection edges
$E^{out}$ | Subset of $E(G)$ which consists of only inter-connection edges
$G^{in}$ | Subgraph of $G$ on $V(G)$ with edge set $E^{in}$
$G^{out}$ | Subgraph of $G$ on $V(G)$ with edge set $E^{out}$
$A$ | Adjacency matrix of graph $G$
$A^{in}$ | Adjacency matrix of graph $G^{in}$
$A^{out}$ | Adjacency matrix of graph $G^{out}$
$L$ | Random walk graph Laplacian of $G$
$L^{in}$ | Random walk graph Laplacian of $G^{in}$
$L^{out}$ | Random walk graph Laplacian of $G^{out}$
$\mathbf{1}_S$ | Indicator vector on a subset $S$
$L_T$ | Submatrix of $L$ with column indices $T$
$L^{in}_T$ | Submatrix of $L^{in}$ with column indices $T$
$|A|$ | Entrywise absolute value of a matrix $A$
$|v|$ | Entrywise absolute value of a vector $v$
$\|A\|$ | Norm of a matrix $A$
$\|v\|$ | Norm of a vector $v$
2.2 Graph Model Assumptions
We make the following assumptions for our graph model from an asymptotic perspective.
Assumption 1.
Suppose $V(G)$ can be partitioned into $k$ underlying clusters $C_1,\dots,C_k$ such that $V(G)=C_1\cup\cdots\cup C_k$, where each $C_i$ is the vertex set of one connected component of $G^{in}$.

- (I) The degree $d^{in}_i$ is asymptotically the same for all vertices $i$ belonging to the same cluster $C_j$.

- (II) The degree $d^{out}_i$ is asymptotically small relative to the degree $d^{in}_i$ for each vertex $i$.
Random graphs which satisfy assumption (I) are not uncommon; for example, the Erdős–Rényi (ER) model $G(n,p)$ with suitable edge probability $p$ satisfies it, see [15] and [9]. A natural generalization of the ER model is the stochastic block model (SBM) [19], which is a generative model for random graphs with certain edge densities within and between underlying clusters, such that the edges within clusters are denser than the edges between clusters. In the case where each cluster has the same size and the intra- and inter-connection probabilities are the same among all vertices, we have the symmetric stochastic block model (SSBM). It is worthwhile to note that information-theoretic bounds for exact cluster recovery in the SBM are given in [1] and [2]. It was also shown in [26] that a general SBM under certain assumptions on the parameters can be clustered using a compressed sensing approach. Our model requires a weaker assumption than the one in [26]; indeed, we remove the assumption imposed on the eigenvalues in [26]. Therefore, our model is applicable to a broader range of random graphs.
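To illustrate, the following sketch (our own, with hypothetical parameter values) samples an adjacency matrix from the SSBM: $k$ equal-size clusters, intra-cluster edge probability $p$, and inter-cluster edge probability $q<p$.

```python
import numpy as np

def ssbm_adjacency(n, k, p, q, seed=None):
    """Sample an adjacency matrix from SSBM(n, k, p, q).

    The n vertices are split into k equal-size clusters; an edge appears
    with probability p inside a cluster and probability q between clusters.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n // k)
    same_cluster = labels[:, None] == labels[None, :]
    probs = np.where(same_cluster, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # sample the upper triangle
    A = (upper | upper.T).astype(float)                # symmetrize, no self-loops
    return A, labels

# Hypothetical parameters: 300 vertices, 3 clusters, dense inside, sparse between.
A, labels = ssbm_adjacency(n=300, k=3, p=0.2, q=0.02, seed=0)
```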
3 Main Algorithms
Our analysis is based on the following key observation. Suppose that the graph $G^{in}$ has $k$ connected components $C_1,\dots,C_k$, i.e., $V(G)=C_1\cup\cdots\cup C_k$. Suppose further that we temporarily have access to the information about the structure of $G^{in}$. Then, after a suitable permutation of the vertices, we can write the graph Laplacian $L^{in}$ in the block diagonal form
$$L^{in}=\begin{bmatrix} L^{in}_1 & & \\ & \ddots & \\ & & L^{in}_k\end{bmatrix}, \qquad\qquad (1)$$
where $L^{in}_i$ is the random walk graph Laplacian of the component of $G^{in}$ on $C_i$.
Suppose now we are interested in finding the cluster with the smallest number of vertices, say $C_1$, which corresponds to the block $L^{in}_1$. By Lemma 2.1, $\{\mathbf{1}_{C_1},\dots,\mathbf{1}_{C_k}\}$ forms a basis of the kernel of $L^{in}$. Note that all the $\mathbf{1}_{C_i}$ have disjoint supports, so for any $w\in\ker(L^{in})$ with $w\neq 0$, we can write
$$w=\sum_{i=1}^{k}\alpha_i\mathbf{1}_{C_i} \qquad\qquad (2)$$
with some $\alpha_i\in\mathbb{R}$. Therefore, if $\mathbf{1}_{C_1}$ has the fewest non-zero entries among all non-zero elements of $\ker(L^{in})$, then we can find it by solving the following minimization problem:
$$\min_{w\neq 0}\ \|w\|_0 \quad\text{subject to}\quad w\in\ker(L^{in}). \qquad\qquad (3)$$
Here the $\ell_0$ norm $\|\cdot\|_0$ indicates the number of nonzero components of the input vector. Problem (3) can be solved using methods such as the greedy algorithms in compressed sensing, as explained in [26]. However, we will propose a different approach to tackle it in this paper and demonstrate that the new approach is more effective numerically and requires fewer assumptions.
3.1 Least Squares Cluster Pursuit
Let us consider problem (3) again. Instead of finding $\mathbf{1}_{C_1}$ directly, let us try to find which vertices are not in $C_1$. Suppose there is a superset $\Omega$ such that $C_1\subseteq\Omega\subseteq V(G)$, and $\Omega\cap C_i$ is a proper subset of $C_i$ for all $i\neq 1$. Since $L^{in}\mathbf{1}_{C_1}=0$ and $\mathbf{1}_{C_1}=\mathbf{1}_{\Omega}-\mathbf{1}_{\Omega\setminus C_1}$, we have
$$L^{in}\mathbf{1}_{\Omega\setminus C_1}=L^{in}\mathbf{1}_{\Omega}. \qquad\qquad (4)$$
Letting $y^{in}:=L^{in}\mathbf{1}_{\Omega}$, finding the vertices of $\Omega$ which are not in $C_1$ is equivalent to solving the following problem (5):
$$\text{find } x \text{ supported on } \Omega \text{ such that } L^{in}x=y^{in}. \qquad\qquad (5)$$
Note that solving problem (5) directly will give both $\mathbf{1}_{\Omega}$ and $\mathbf{1}_{\Omega\setminus C_1}$ as solutions. By setting the columns of $L^{in}$ outside $\Omega$ to zero, solving problem (5) is equivalent to solving
$$\min_{x\in\mathbb{R}^{|\Omega|}}\ \|L^{in}_{\Omega}x-y^{in}\|_2. \qquad\qquad (6)$$
Directly solving problem (6) gives at least two solutions $\mathbf{1}_{\Omega}$ and $\mathbf{1}_{\Omega\setminus C_1}=\mathbf{1}_{\Omega\cap C_1^{c}}$, where $C_1^{c}$ indicates the complement set of $C_1$. Between these two solutions, the latter is much more informative for extracting $C_1$ from $\Omega$ than the former. We need to find a way to avoid the non-informative solution $\mathbf{1}_{\Omega}$ but keep the informative solution $\mathbf{1}_{\Omega\setminus C_1}$.
We can achieve this by removing a subset of columns from the index set $\Omega$. Let us use $T\subset\Omega$ to indicate the indices of the columns we aim to remove. Suppose we could choose $T$ such that $T\subseteq C_1$. Now consider the following variation (7) of the minimization problem (6):
$$\min_{x\in\mathbb{R}^{|\Omega\setminus T|}}\ \|L^{in}_{\Omega\setminus T}x-y^{in}\|_2. \qquad\qquad (7)$$
Different from solving (6), which gives two solutions, solving (7) gives only the solution $\mathbf{1}_{\Omega\setminus C_1}$, as the all-ones vector on $\Omega\setminus T$ is no longer a solution because of the removal of the columns indexed by $T$. The vector $\mathbf{1}_{\Omega\setminus C_1}$ is indeed still a solution to (7) because its support is disjoint from $T\subseteq C_1$. Furthermore, the solution to (7) is unique since it is a least squares problem whose matrix has full column rank; therefore $\mathbf{1}_{\Omega\setminus C_1}$ is the unique solution to (7).
However, there is no way in theory to select $T$ and guarantee the condition $T\subseteq C_1$. In practice, the way we choose $T$ is based on the following observation. Since $L^{in}$ is block diagonal with respect to the clusters and $C_1\subseteq\Omega$, the vector $y^{in}=L^{in}\mathbf{1}_{\Omega}$ vanishes on $C_1$. Then $\langle|L^{in}e_i|,|y^{in}|\rangle=0$ for all $i\in C_1$, while $\langle|L^{in}e_j|,|y^{in}|\rangle$ is typically nonzero for $j\in\Omega\setminus C_1$. Therefore, we can choose $T$ in such a way that $\langle|Le_i|,|y|\rangle$ is small for all $i\in T$, where $y:=L\mathbf{1}_{\Omega}$. These ideas lead to Algorithm 1.
Algorithm 1: Least Squares Cluster Pursuit. Input: adjacency matrix $A$, superset $\Omega\supseteq C_1$, least squares threshold parameter $\gamma$, rejection parameter $R$. Output: $W^{\#}\approx C_1$.

1. Compute $L_{\Omega}$ and $y=L\mathbf{1}_{\Omega}$.
2. Let $T$ be the set of column indices of the $\lceil\gamma|\Omega|\rceil$ smallest components of the vector $|L_{\Omega}|^{\top}|y|$.
3. Let $x^{\#}$ be the solution to
$$\min_{x}\ \|L_{\Omega\setminus T}x-y\|_2 \qquad\qquad (8)$$
obtained by using an iterative least squares solver.
4. Let $W^{\#}$ consist of $T$ together with the vertices in $\Omega\setminus T$ whose corresponding entries of $x^{\#}$ are at most the rejection threshold $R$.
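The following Python sketch illustrates one way Algorithm 1 can be implemented. It is our own illustration under stated assumptions: scipy's lsqr stands in for MATLAB's lsqr, the fraction gamma of removed columns and the rejection cutoff are hypothetical parameter values, and the final thresholding rule is one plausible reading of step 4; the exact choices used in the paper may differ.

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import lsqr

def least_squares_cluster_pursuit(A, Omega, gamma=0.2, cutoff=0.5):
    """Hedged sketch of Algorithm 1 (Least Squares Cluster Pursuit).

    A      : sparse symmetric adjacency matrix (n x n)
    Omega  : indices of a superset believed to contain the target cluster
    gamma  : fraction of columns of L_Omega to remove (assumed parameter)
    cutoff : entries of the solution above this value are rejected (assumed)
    """
    Omega = np.asarray(Omega)
    n = A.shape[0]
    d = np.asarray(A.sum(axis=1)).ravel()
    L = identity(n, format="csr") - diags(1.0 / d) @ A      # random walk Laplacian
    L_Omega = L[:, Omega]                                    # columns indexed by Omega
    y = np.asarray(L_Omega.sum(axis=1)).ravel()              # y = L * 1_Omega

    # Step 2: columns with small <|L e_i|, |y|> are likely inside the cluster.
    scores = np.abs(L_Omega).T @ np.abs(y)
    m = max(1, int(np.ceil(gamma * len(Omega))))
    T = np.argsort(scores)[:m]                               # positions (within Omega) to drop
    keep = np.setdiff1d(np.arange(len(Omega)), T)

    # Step 3: iterative least squares for problem (8) on the remaining columns.
    x = lsqr(L_Omega[:, keep], y)[0]

    # Step 4: the solution approximates the indicator of Omega \ C_1, so
    # vertices with small entries (plus the removed ones) form the cluster estimate.
    inside = Omega[keep][x <= cutoff]
    return np.union1d(inside, Omega[T])
```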
Remark 3.1.
We impose the absolute values rather than the direct dot product in order to have less cancellation between vector components when summing over the entrywise products. In practice, the value of $\gamma$ does not affect the performance much as long as it is not too extreme, and the value we use works well for all our numerical experiments.
Remark 3.2.
In practice, we choose to use MATLAB's lsqr function to solve the least squares problem (8). As we will see in Lemma 3.2, our problem is well conditioned, so it is also possible to solve the normal equation exactly for problems which are not of very large scale. However, we choose to solve it iteratively rather than exactly because a high-accuracy numerical solution is not essential for our task here; we are only interested in an approximate solution, since we apply a cutoff to its entries for clustering.
Remark 3.3.
As indicated in [26], we can reformulate problem (3) as the sparsity constrained least squares problem
$$\min_{x}\ \|L_{\Omega}x-y\|_2\quad\text{subject to}\quad \|x\|_0\le s \qquad\qquad (9)$$
for an appropriate sparsity level $s$, and solve it by greedy algorithms such as subspace pursuit [10] and compressive sampling matching pursuit (CoSaMP) [34]. Alternatively, we can consider the LASSO formulation of the problem, see [41] and [43],
$$\min_{x}\ \|L_{\Omega}x-y\|_2^2+\lambda\|x\|_1. \qquad\qquad (10)$$
The reason that LASSO is a natural way to interpret this problem is that the solution we are trying to find is the sparse indicator vector $\mathbf{1}_{\Omega\setminus C_1}$, which satisfies $L^{in}_{\Omega}\mathbf{1}_{\Omega\setminus C_1}=y^{in}$. We do not analyze these formulations further here.
However, in reality we have no access to $L^{in}$; what we know is only $L$, and in general $L\neq L^{in}$. We argue that the solution to the perturbed problem (8) associated with $L$ will not be too far away from the solution to the unperturbed problem (7) associated with $L^{in}$, provided the difference between $L$ and $L^{in}$ is relatively small. Let us make this precise by first quoting the following standard result from numerical analysis.
Lemma 3.1.
Let $\|\cdot\|$ be an operator norm, $M$ a non-singular square matrix, and $Mx=b$ with $b\neq 0$. Let $\tilde{M}$, $\tilde{b}$, $\tilde{x}$ be perturbed measurements of $M$, $b$, $x$ respectively, with $\tilde{M}\tilde{x}=\tilde{b}$. Suppose $\|M-\tilde{M}\|\le\epsilon\|M\|$, $\|b-\tilde{b}\|\le\epsilon\|b\|$, and suppose further $\epsilon\,\mathrm{cond}(M)<1$; then
$$\frac{\|x-\tilde{x}\|}{\|x\|}\le\frac{2\epsilon\,\mathrm{cond}(M)}{1-\epsilon\,\mathrm{cond}(M)}.$$
The above lemma asserts that the size of $\mathrm{cond}(M)$ is significant in determining the stability of the solution with respect to small perturbations of $M$ and $b$. For the discussion from now on, we will use $\|\cdot\|$ to denote the standard vector two-norm or the matrix norm it induces, unless stated otherwise. The next lemma establishes the invertibility of the matrix in the normal equation associated with problem (7) and gives a bound on its condition number.
Lemma 3.2.
Let be the disjoint union of underlying clusters with size and assume . Let be the degree for vertex , , and suppose be such that and for . Then
-
(i)
If , then is invertible.
-
(ii)
Suppose further and . Then
almost surely as , e.g. when .
Proof.
Without loss of generality, let us assume that the column indices are already permuted so that they are ordered according to their underlying clusters. The invertibility follows directly from the fact that the matrix is of full column rank, so let us show that it is of full column rank. Because of the reordering, it is in a block diagonal form
It then suffices to show that each block is of full column rank. By Lemma 2.1, each block of $L^{in}$ has $0$ as an eigenvalue with multiplicity one, and the corresponding eigenspace is spanned by the indicator vector of its component. Now suppose by contradiction that the columns of the first block are linearly dependent, so that some non-trivial linear combination of them is zero. This means that a non-zero vector supported on a proper subset of $C_1$ is an eigenvector associated with the eigenvalue zero, which contradicts the fact that the eigenspace is spanned by the indicator vector of $C_1$. Therefore the columns are linearly independent, hence the first block is of full column rank. For the other blocks, the corresponding column index set is a proper subset of the respective cluster, so the strategy above applies as well. Therefore all blocks are of full column rank, and so is the whole matrix.
Now since is in a block form, to estimate the condition number, we only need to estimate the largest and smallest eigenvalues for each block. Writing and , for each , , and for with , . Note that the probability of having an edge between and given degree sequences equals to , as the existence of an edge between two vertices is proportional to their degrees. So equals to with probability , which implies ; equals to with probability , which implies . Hence the expectation
By the law of large numbers, almost surely as . Therefore for , we have
almost surely as .
Similarly, for each , , we have , and almost surely as .
Now we apply Gershgorin’s circle theorem to bound the spectrum of .
For all , the circles are centered at , with radius less than or equal to
almost surely, hence and .
almost surely. Therefore we have
almost surely, as desired. ∎
Remark 3.4.
Note that there is a minor difficulty in estimating the expectation of the inner product between two different columns. The computation assumes the independence of the degree distributions of the individual vertices within each cluster, which may not hold for an arbitrary graph. However, the independence does occur if the asymptotic uniformity of the degree distribution within each cluster is assumed, which is why our model needs this assumption.
Now the perturbed problem (8) is equivalent to solving the normal equation associated with $L_{\Omega\setminus T}$, while the unperturbed problem (7) is equivalent to solving the normal equation associated with $L^{in}_{\Omega\setminus T}$. Let us give an estimate for the size of the perturbation $L-L^{in}$.
Lemma 3.3.
Let be the graph Laplacian of and . Let for all and . Then .
Proof.
Let denote the Kronecker delta symbol, observe that
Since , we have . So we have
Therefore . To bound the spectral norm we apply Gershgorin’s circle theorem, noting that for all , hence
This completes the proof. ∎
Next, we have the following result.
Lemma 3.4.
almost surely.
Proof.
Note that . We want to give an estimate of . Similar to the computation we did in Lemma 3.2, for each , we have , , and . For each , , we have , , and almost surely. Therefore, the row sum of for row equals to zero, and the row sum for row larger than almost surely. Hence almost surely. ∎
Now let us use the previous lemmas to establish that the difference between the perturbed solution and the unperturbed solution is small.
Theorem 3.1.
Proof.
Let , , , . We will apply Lemma 3.2 with , , , .
First by Lemma 3.3, we have . Therefore
For each , we have , and . Hence
(11)
(12)
(13)
We also have
Next by Lemma 3.4, almost surely. Therefore
The second inequality holds since . The third inequality holds since , which comes from the similar reasoning as in Lemma 3.2 by using Gershgorin’s circle theorem. Consequently, we have . Now putting Lemma 3.2 and Lemma 3.1 together with , , , , we have
∎
Next we can estimate the size of the symmetric difference between the output $W^{\#}$ and the true cluster $C_1$ relative to the size of $C_1$, where the symmetric difference between two sets $S_1$ and $S_2$ is defined as $S_1\triangle S_2=(S_1\setminus S_2)\cup(S_2\setminus S_1)$. Let us state another lemma before we establish the result.
Lemma 3.5.
Let , , and . Suppose , then .
Proof.
Let and write , where and are the parts of supported on and . Then we can write
Note that and . We have
Therefore as desired. ∎
Theorem 3.2.
Under the same assumptions as Theorem 3.1, we have
In other words, the error rate $|W^{\#}\triangle C_1|/|C_1|$ of successfully recovering $C_1$ is at most a constant multiple of the size of the perturbation.
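For concreteness, the error rate appearing in Theorem 3.2 can be computed as in the following small sketch (our own illustration).

```python
def symmetric_difference_error(W_hat, C_true):
    """Return |W_hat (symmetric difference) C_true| / |C_true|, i.e. the
    recovery error rate used in Theorem 3.2."""
    W_hat, C_true = set(W_hat), set(C_true)
    return len(W_hat.symmetric_difference(C_true)) / len(C_true)

print(symmetric_difference_error([0, 1, 2, 5], [0, 1, 2, 3]))   # 0.5
```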
3.2 Random Walk Threshold
In order to apply Algorithm 1, we need a “nice” superset $\Omega$ which contains $C_1$. The task of this subsection is to find such a superset from the given seed vertices. We will apply a simple diffusion-based random walk algorithm on $G$ to find such an $\Omega$. This leads to Algorithm 2, which is described in [26] as well. However, the difference between the random walk threshold algorithm in this paper and the one in [26] is that the threshold parameter here is heuristically chosen to be larger than the corresponding threshold parameter in [26]. This is another advantage of our method, as it increases the chance of having $C_1$ entirely contained in $\Omega$; such a choice is made based on the natural differences between the two approaches. It is worthwhile to point out that there are also other, more sophisticated algorithms, such as the ones described in [5], [23] and [45], which can achieve the same goal. We avoid using these methods here, as our purpose is just to implement a fast way of obtaining a set $\Omega$.
Algorithm 2: Random Walk Threshold. Input: adjacency matrix $A$, seed set $\Gamma$, estimated cluster size $\hat{n}_1$, threshold parameter $\epsilon$, random walk depth $t$. Output: $\Omega$.

1. Compute the random walk transition matrix $P=AD^{-1}$ and the initial vector $v_0=\mathbf{1}_{\Gamma}$.
2. Compute the $t$-step random walk vector $v_t=P^{t}v_0$.
3. Define $\Omega=T(v_t,\epsilon\hat{n}_1)$.

The thresholding operator $T(v,s)$ is defined as the set of indices of the $\lceil s\rceil$ largest components of the vector $v$.
The motivation of Algorithm 2 is the following intuitive observation. Suppose we are given seed vertices in $C_1$. Starting from these seeds, since the edges within each cluster are denser than those between different clusters, the probability of staying within $C_1$ is much higher than the probability of entering other clusters within a small number of steps. Therefore, by performing a random walk up to a certain small depth $t$, we obtain a set $\Omega$ such that $C_1$ is contained in $\Omega$ with high probability. Let us make this more precise in Theorem 3.3.
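A minimal sketch of this diffusion step is given below (our own illustration; the transition matrix convention, the seeding of the initial vector, and the exact thresholding rule are assumptions, and the default parameter values are hypothetical).

```python
import numpy as np
from scipy.sparse import diags

def random_walk_threshold(A, seeds, n_hat, epsilon=1.2, depth=3):
    """Hedged sketch of Algorithm 2 (Random Walk Threshold).

    A       : sparse symmetric adjacency matrix (n x n)
    seeds   : indices of the labeled seed vertices
    n_hat   : estimated size of the target cluster
    epsilon : threshold parameter; Omega keeps about epsilon * n_hat vertices,
              with epsilon slightly above 1 so the cluster is likely contained
    depth   : number of random walk steps t
    """
    seeds = np.asarray(seeds)
    n = A.shape[0]
    d = np.asarray(A.sum(axis=1)).ravel()
    P = A @ diags(1.0 / d)                    # column-stochastic transition matrix
    v = np.zeros(n)
    v[seeds] = 1.0 / len(seeds)               # start with uniform mass on the seeds
    for _ in range(depth):                    # t-step diffusion
        v = P @ v
    size = min(n, int(np.ceil(epsilon * n_hat)))
    Omega = np.argsort(-v)[:size]             # vertices carrying the most mass
    return np.union1d(Omega, seeds)           # always keep the seeds themselves
```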
Theorem 3.3.
Assume and in Algorithm 2, the probability . In other words, the probability that the -steps random walk with seed vertices being not in is at most a constant multiple of .
Proof.
Let us first consider the case . Suppose . Then we have . It is also easy to see that . For , we have . So by assuming , we have .
Suppose now , we can apply the above argument to each individual vertex in , where the random walk starting from each vertex can be considered independently, therefore we have . ∎
Remark 3.5.
It is worthwhile to note that we do not want the depth $t$ to be too large; one reason is that Theorem 3.3 tells us that the probability of staying within the target cluster decreases as $t$ increases. An alternative interpretation is that we can treat our graph $G$, supposing it is connected, as a time-homogeneous finite-state Markov chain whose transition probabilities between adjacent vertices are determined by the vertex degrees. Since $G$ is connected, the chain is certainly irreducible and aperiodic. By the fundamental theorem of Markov chains, the limiting probability of being at each vertex will be the same regardless of what the seed set is.
Meanwhile, we do not want $t$ to be too small either; otherwise the random walk will not be able to explore all the reachable vertices. There is also a trade-off between the size of the seed set and the random walk depth $t$: a smaller seed set usually requires a larger $t$ in order to fully explore the target cluster.
3.3 Local Cluster Extraction
Let us now combine the previous two subroutines into Algorithm 3. In practice, we may want to vary the number of iterations based on the number of examples in the data set in order to achieve better performance. For the purpose of theoretical analysis, let us fix the number of iterations.
Algorithm 3: Least Squares Clustering (LSC). Input: adjacency matrix $A$, seed set $\Gamma_1=\Gamma$, estimated cluster size $\hat{n}_1$, threshold parameter $\epsilon$, depth $t$, least squares parameter $\gamma$, rejection parameter $R$, number of iterations $N$. Output: $C_1^{\#}\approx C_1$.

1. for $i=1,\dots,N$ do
2. $\Omega_i\leftarrow$ Random Walk Threshold($A$, $\Gamma_i$, $\hat{n}_1$, $\epsilon$, $t$).
3. $\Gamma_{i+1}\leftarrow$ Least Squares Cluster Pursuit($A$, $\Omega_i$, $\gamma$, $R$).
4. end
5. Let $C_1^{\#}=\Gamma_{N+1}$.
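Putting the two subroutines together, a minimal sketch of Algorithm 3 might look as follows (our own illustration; it reuses the sketches above, and the rule of feeding the recovered vertices back in as the next seed set is an assumption).

```python
import numpy as np

# Assumes random_walk_threshold and least_squares_cluster_pursuit from the
# sketches earlier in this section are in scope.

def local_cluster_extraction(A, seeds, n_hat, n_iter=2,
                             epsilon=1.2, depth=3, gamma=0.2, cutoff=0.5):
    """Hedged sketch of Algorithm 3 (LSC): alternate the random walk threshold
    and the least squares cluster pursuit, using the output of each round as
    the seed set of the next round."""
    current = np.asarray(seeds)
    for _ in range(n_iter):
        Omega = random_walk_threshold(A, current, n_hat, epsilon, depth)
        current = least_squares_cluster_pursuit(A, Omega, gamma, cutoff)
    return current
```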
Remark 3.6.
The hyperparameter $N$ in the algorithm is usually chosen based on the size of the initial seed set relative to the target cluster; we do not have a formal way of choosing the best $N$, so we choose it heuristically. In practice, we believe the heuristically chosen value will do a very good job most of the time.
The analysis in the previous two subsections shows that the difference between the true cluster $C_1$ and the estimated cluster $C_1^{\#}$ is relatively small compared to the size of $C_1$; this can be written more formally using asymptotic notation.
3.4 From Local to Global
We can go one step further by applying Algorithm 3 iteratively on the entire graph to extract all the underlying clusters. That is, each time Algorithm 3 finds a cluster, we remove it and update the graph by deleting the subgraph spanned by its vertices, and then continue with the next cluster. This leads to Algorithm 4. We will not analyze the theoretical guarantees of the iterative version of the algorithm further, but rather provide numerical examples in a later section to show its effectiveness and efficiency.
Algorithm 4: Iterative Least Squares Clustering (ILSC). Input: adjacency matrix $A$ of $G$ and a seed set for each cluster. Output: estimated clusters $C_1^{\#},\dots,C_k^{\#}$.

1. for $i=1,\dots,k$ do
2. Let $C_i^{\#}$ be the output of Algorithm 3 applied to the current graph.
3. Let $G_i^{\#}$ be the subgraph spanned by $C_i^{\#}$.
4. Update the graph by removing $G_i^{\#}$.
5. end
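A corresponding sketch of the iterative version is shown below (our own illustration; the bookkeeping of vertex indices after each removal is an implementation detail not spelled out in the pseudocode above).

```python
import numpy as np

def iterative_cluster_extraction(A, seed_sets, size_estimates):
    """Hedged sketch of Algorithm 4 (ILSC): extract one cluster at a time with
    Algorithm 3, then remove its vertices from the graph and repeat."""
    remaining = np.arange(A.shape[0])
    clusters = []
    for seeds, n_hat in zip(seed_sets, size_estimates):
        A_sub = A[remaining][:, remaining]                    # current subgraph
        local_seeds = np.searchsorted(remaining, seeds)       # re-index the seeds
        found_local = local_cluster_extraction(A_sub, local_seeds, n_hat)
        found = remaining[found_local]                        # back to original labels
        clusters.append(found)
        remaining = np.setdiff1d(remaining, found)            # delete the found cluster
    return clusters
```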
Remark 3.7.
It is worth noting that Algorithm 4 extracts one cluster at a time, which is different from most other global unsupervised clustering algorithms. In practice, those global clustering methods can have impractically high run times [33] or be tricky to implement [2]. In contrast, our method requires much less computational time and can be implemented easily. In addition, the “one cluster at a time” feature of our method provides more flexibility for problems under certain circumstances.
4 Computational Complexity
In this section, let us discuss the run time of the algorithms introduced previously.
Theorem 4.1.
Algorithm 2 requires operations, where is the depth of the random walk.
Proof.
Notice that if the matrices are stored in sparse format, then each step of the random walk in the second step of Algorithm 2 requires one sparse matrix-vector multiplication, whose cost is governed by the maximal degree among all the vertices. The remaining part of the cost comes from the sorting in the third step. In practice, the random walk depth $t$ is a constant with respect to the graph size $n$, which gives the claimed bound. ∎
Theorem 4.2.
Algorithm 1 requires operations.
Proof.
For Algorithm 1, its first step requires a matrix-vector multiplication, and its second step requires a matrix-vector multiplication plus a sorting step. For its third step, to avoid solving the normal equation exactly for large scale problems, we recommend using an iterative method, for example conjugate gradient descent (we use MATLAB's lsqr function in our implementation). As we have shown, the matrices involved have well-behaved condition numbers, so only a constant number of iterations is required to obtain a well-approximated least squares solution to problem (8). Since the cost of each iteration of conjugate gradient descent equals a few matrix-vector multiplications, the total cost of Algorithm 1 follows. ∎
As a consequence, the total run time of Algorithm 3 is the sum of the costs of its two subroutines. If the number of clusters is bounded by a constant, then Algorithm 4 also runs within the same order of time.
Remark 4.1.
The computational scheme of our method follows a framework similar to that of CP+RWT in [26]. However, one of the differences between the two approaches is that we apply lsqr to solve the least squares problem (8), whereas CP+RWT applies several iterations of the subspace pursuit algorithm to solve (8), and each subspace pursuit iteration uses lsqr as a subroutine. So essentially, our proposed method is several times cheaper than CP+RWT. We can also see this difference by comparing the run times in our numerical experiments in the next section.
5 Numerical Experiments
In this section, we evaluate the performance of our algorithms on the synthetic symmetric stochastic block model (SSBM), the network data of political blogs [3], the AT&T Database of Faces (https://git-disl.github.io/GTDLBench/datasets/att_face_dataset/), the Extended Yale Face Database B (YaleB, http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html), and the MNIST data (http://yann.lecun.com/exdb/mnist/).
For single cluster extraction tasks, we consider the diffusion based methods, possibly with a refinement procedure, CP+RWT [26], HK-Grow [23], PPR [5], and LBSA [36] as our baseline methods. For multi-cluster extraction tasks, we consider ICP+RWT [26], LapRF and TVRF [49], Multi-class MBO Auction Dynamics [21], AtlasRBF [37], Pseudo-label [27], DGN [22] and Ladder Networks [39] as the baseline methods. The standard spectral clustering algorithm [35] is also applied in some of the experiments. For the implementation of Algorithms 3 and 4, we use MATLAB's lsqr function as our iterative least squares solver for equation (8). We tune the rejection parameters of all algorithms appropriately to make the outputs of the algorithms approximately the same size for comparison purposes. For the evaluation metrics of our experiments, we consider the Jaccard index for the symmetric stochastic block model, the F1 score for the human face data, and classification accuracy for the political blogs data and MNIST data. More implementation details are summarized in the supplementary material.
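For reference, the two set-based metrics can be computed as in the following small sketch (our own illustration); classification accuracy is the usual fraction of correctly labeled vertices.

```python
def jaccard_index(found, truth):
    """|found ∩ truth| / |found ∪ truth| for one extracted cluster."""
    found, truth = set(found), set(truth)
    return len(found & truth) / len(found | truth)

def f1_score_sets(found, truth):
    """F1 score of an extracted cluster against the true cluster."""
    found, truth = set(found), set(truth)
    tp = len(found & truth)
    precision, recall = tp / len(found), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

print(jaccard_index([1, 2, 3], [2, 3, 4]))              # 0.5
print(round(f1_score_sets([1, 2, 3], [2, 3, 4]), 3))    # 0.667
```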
5.1 Stochastic Block Model Data
We first test Algorithm 3 on the SSBM with different choices of parameters. The parameter $n$ indicates the total number of vertices, $k$ indicates the number of clusters, $p$ is the probability of having an edge between any two vertices within each cluster, and $q$ is the probability of having an edge between any two vertices from different clusters. The left panel of Figure 1 demonstrates such a synthetic random graph model with three underlying clusters, and the right panel of Figure 1 illustrates an adjacency matrix of a random graph generated from the symmetric stochastic block model with three underlying clusters. In our experiments, we fix the number of clusters and vary $n$; the connection probabilities $p$ and $q$ are chosen accordingly. By choosing the parameters carefully, we obtain the Jaccard index and the logarithm of the running time of each method shown in Figure 2. We also run the experiments on the stochastic block model in the non-symmetric case and obtain similar gaps in accuracy and run time. For the implementation of the symmetric stochastic block model, we use three vertices with given labels as our seeds, and we focus on recovering only the target cluster, say $C_1$. The experiments are performed with 500 repetitions.
5.2 Political Blogs Network Data
Next we test Algorithm 3 on the data from ”The political blogosphere and the 2004 US Election” [3], which contains a list of political blogs that were classified as liberal or conservative, together with the links between the blogs. See Figure 3 for an illustration of the community structure (figure source: [3]).

As explained by Abbe and Sandon in [2], their simplified algorithm gave a reasonably good classification in 37 out of 40 trials, and each of these trials classified all but 56 to 67 of the 1222 vertices in the main component of the graph correctly. According to [2], the state-of-the-art method described in [51] prior to the work in [2] achieves a lowest error count of around 60, while regularized spectral algorithms such as the one in [38] obtain about 80 errors. In our experiments, given three labeled seeds, Algorithm 3 succeeds in 35 out of a total of 40 trials. Among these 35 trials, the average number of misclassified nodes in the main component of the graph is 55, which compares slightly favorably with the state-of-the-art result. We also tested CP+RWT on this data set and found the results not very satisfactory.
5.3 AT&T Database of Faces
The AT&T Database of Faces contains 400 gray scale images of 40 different people, each image of pixel size 92 x 112. The images of each person are taken under different conditions, varying three aspects: the perspective of the face, the lighting conditions, and the facial expressions.
We use part of this data set by randomly selecting 10 people such that each individual has 10 images. We randomly permute the images, as shown on the left side of Figure 4, and compute the adjacency matrix based on the preprocessing method discussed in Appendix B. Then we iteratively apply Algorithm 3 and try to recover all 10 images belonging to the same individual. The desired permutation of these individual images after iteratively performing Algorithm 3 is shown on the right side of Figure 4. More implementation details regarding hyperparameter tuning are summarized in Appendix A. The performance of our algorithm compared with CP+RWT and spectral clustering (SC) is summarized in Table 2 under 500 repetitions. Note that spectral clustering is unsupervised, hence its accuracy is not affected by the percentage of labeled data.
5.4 Extended Yale Face Database B (YaleB)
The YaleB data set contains 16128 gray scale images of 28 human subjects under 9 poses and 64 illumination conditions. We use part of this data set by randomly selecting 20 images from each person after the preprocessing described in Appendix B. The images are randomly permuted, and we aim to recover all the clusters by iteratively performing Algorithm 3. Figure 5 shows this data set with randomly permuted images on the left side and the desired clustering results on the right side. Figure 6 enlarges the small regions inside the red boxes of Figure 5. The performance of our algorithm compared with CP+RWT and spectral clustering (SC) is summarized in Table 3 under 500 repetitions.
Method / Labeled Data % | | |
---|---|---|---
LSC | 92.1% | |
CP+RWT [26] | 89.2% | |
SC [35] | 93.8% | 93.8% | 93.8%
5.5 MNIST Data
We also test Algorithm 4 on the MNIST data, a famous machine learning benchmark data set for classification which consists of 28 x 28 grayscale images of the handwritten digits 0-9, with approximately 7000 images of each digit. We use a certain percentage of vertices with given labels within each of the ten clusters as our seed vertices; the performance of ILSC and ICP+RWT is summarized in Table 4 under 100 repetitions.
Labeled Data % | ILSC | Run Time | ICP+RWT [26] | Run Time
---|---|---|---|---
0.5 | | 15.5 | | 18.1
1 | | 15.3 | | 19.1
1.5 | | 15.4 | | 19.8
2 | | 15.5 | | 21.4
2.5 | | 15.4 | | 22.1
We also compare ILSC with several other state-of-the-art semi-supervised methods on MNIST. As we can see in Table 5, ILSC outperforms the other algorithms except for Ladder Networks, which uses more label information and involves a deep neural network architecture that requires training on GPUs for several hours.
References
- [1] E. Abbe, Community Detection and Stochastic Block Models: Recent Developments, Journal of Machine Learning Research. vol.18, no.177, pp.1-86, 2018
- [2] E. Abbe and C. Sandon, Recovering communities in the general stochastic block model without knowing the parameters, In Advances in Neural Information Processing Systems, 676–684, 2015
- [3] L. A. Adamic and N. Glance, The political blogosphere and the 2004 US election: Divided they blog, In Proceedings of the 3rd International Workshop on Link Discovery, 36-43, 2005.
- [4] D. J. Aldous, Random walks on finite groups and rapidly mixing Markov chains, In Seminaire de Probabilites XVII, pages 243–297. Springer Verlag, 1983. Lecture Notes in Mathematics 986.
- [5] R. Andersen, F. Chung, and K. Lang, Using pagerank to locally partition a graph, Internet Mathematics, 4(1):35-64, 2007.
- [6] G. Camps-Valls, T. Bandos and D. Zhou, Semisupervised graph-based hyperspectral image classification, IEEE Trans. Geosci. Remote Sensing, vol. 45, no. 10, pp. 3044-3054, 2007.
- [7] Y. Chen, J. Z. Wang, and R. Krovetz, Clue: Cluster-based retrieval of images by unsupervised learning, IEEE Trans. Image Process. 14, 8, 1187–1201, 2005.
- [8] F. Chung, Spectral Graph Theory, Vol. 92. American Mathematical Society., 1997.
- [9] F. Chung and L. Lu, Complex Graphs and Networks, Vol. 107. American Mathematical Soc., 2006.
- [10] W. Dai and O. Milenkovic, Subspace pursuit for compressive sensing signal reconstruction, IEEE Trans. Inform. Theory, vol. 55, no. 5, pp. 2230–2249, May 2009.
- [11] S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 269-274, 2001.
- [12] S. Dhillon, Y. Guan, and B. Kulis, Kernel k-means: spectral clustering and normalized cuts, In Proc. of the 10th ACM SIGKDD Conference, 2004.
- [13] C. Ding, X. He, H. Zha, M. Gu, and H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, In Proceedings of IEEE ICDM 2001, pages 107–114, 2001.
- [14] W. Doeblin, Expose de la theorie des chaınes simples constantes de Markov a un nombre fini d’etats, Mathematique de l’Union Interbalkanique, 2:77–105, 1938
- [15] P. Erdős and A. Rényi, On random graphs, I. Publ. Math. Debrecen 6 (1959), 290-297.
- [16] S. Fortunato, Community detection in graphs, Physics Reports, 486 (3–5), 75–174, 2010.
- [17] A. Georghiades, P. Belhumeur, and D. Kriegman, From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose, PAMI, 2001, Vol 23, pp 643-660.
- [18] W. Ha, K. Fountoulakis, and M. W. Mahoney, Statistical guarantees for local graph clustering, The 23rd International Conference on Artificial Intelligence and Statistics, 2020.
- [19] P. W. Holland, K. Laskey, and S. Leinhardt, Stochastic blockmodels: First steps, Social Networks 5 (1983), no. 2,109–137
- [20] D. Hric, R. K. Darst, and S. Fortunato, Community detection in networks: Structural clusters versus ground truth, Physical Review E, 9, 062805, 2014
- [21] M. Jacobs, E. Merkurjev, and S. Esedoglu, Auction Dynamics: A Volume Constrained MBO Scheme, Journal of Computational Physics, 354:288-310, 2018.
- [22] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, Semi-supervised learning with deep generative models, In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3581–3589, 2014
- [23] K. Kloster and D. F. Gleich, Heat kernel based community detection, In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 1386-1395, 2014.
- [24] G. Kossinets and D. J. Watts, Empirical analysis of an evolving social network, Science, 311 (5757), 88–90, 2006
- [25] B. Kulis, S. Basu, I. Dhillon, and R. Mooney, Semi-supervised graph clustering: A kernel approach, Proc. ICML-2005.
- [26] M.-J. Lai, D. Mckenzie, Compressive Sensing Approach to Cut Improvement and Local Clustering, SIAM Journal on Mathematics of Data Science, 2(2020), 368–395.
- [27] D.-H. Lee, Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, In Workshop on Challenges in Representation Learning, ICML 2013.
- [28] T. Lindvall, Lectures on the coupling method, John Wiley & Sons Inc., New York, 1992.
- [29] U. V. Luxburg, A tutorial on spectral clustering, Statistics and computing, 17(4):395– 416, 2007.
- [30] M. W. Mahoney, L. Orecchia, and N. K. Vishnoi, A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally, Journal of Machine Learning Research, 13(Aug):2339-2365, 2012.
- [31] R. Mihalcea and D. Radev, Graph-based natural language processing and information retrieval, Cambridge university press, 2011.
- [32] T. Miyato, S-i. Maeda, M. Koyama, K. Nakae, and S. Ishii, Distributional smoothing by virtual adversarial examples, arXiv:1507.00677, 2015
- [33] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. Combinatorica 38.3 (2018): 665-708.
- [34] D. Needell and Joel A. Tropp, CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.
- [35] A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, In Advances in Neural Information Processing Systems: 849-856, 2002.
- [36] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, Locally-biased spectral approximation for community detection, Knowledge-Based Systems, 164:459–472, 2019.
- [37] N. Pitelis, C. Russell, and L. Agapito, Semi-supervised learning using an unsupervised atlas, In Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), pages 565–580. Springer, 2014.
- [38] T. Qin and K. Rohe, Regularized spectral clustering under the degree-corrected stochastic block model, Advances in Neural Information Processing Systems 26 (C.j.c. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.q. Weinberger, eds.), 2013, pp. 3120–3128.
- [39] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, Semi-supervised Learning with Ladder Networks, In Advances in Neural Information Processing Systems: 3546-3554, 2015.
- [40] F. Samaria and A. Harter, Parameterisation of a Stochastic Model for Human Face Identification, Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, 1994.
- [41] F. Santosa and W. Symes, Linear inversion of band-limited reflection seismograms, SIAM Journal on Scientific and Statistical Computing. SIAM. 7 (4): 1307–1330. doi:10.1137/0907087, 1986.
- [42] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
- [43] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B (methodological). Wiley. 58 (1): 267–88. JSTOR 2346178, 1996
- [44] N. Veldt, C. Klymko, and D. F. Gleich, Flow-Based Local Graph Clustering with Better Seed Set Inclusion, In Proceedings of the SIAM International Conference on Data Mining, 2019
- [45] D. Wang, K. Fountoulakis, M. Henziger, Michael W. Mahoney, S. Rao, Capacity releasing diffusion for speed and locality, Proceedings of the 34th International Conference on Machine Learning, 70:3598-3607, 2017.
- [46] Y. Wu, M. Rosca, and T. Lillicrap, Deep compressed sensing. International Conference on Machine Learning, pp. 6850–6860, 2019.
- [47] Y. Yan, Y. Bian, D. Luo, D. Lee, and X. Zhang, Constrained Local Graph Clustering by Colored Random Walk, in Proc. World Wide Web Conf., 2019, pp. 2137–2146.
- [48] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich, Local higher-order graph clustering, In Proceedings of the 23rd ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2017, pp. 555–564.
- [49] K. Yin, X.-C. Tai, An Effective Region Force for Some Variational Models for Learning and Clustering, Journal of Scientific Computing, 74 (2018), 175-196.
- [50] L. Zelnik-Manor and P. Perona, Self-tuning spectral clustering, In Advances in neural information processing systems, pages 1601–1608, 2004.
- [51] A. Y. Zhang, H. H. Zhou, C. Gao, Z. Ma, Achieving optimal misclassification proportion in stochastic block model, arXiv:1505.03772 (2015).
Declarations
Funding
The first author is supported by the Simons Foundation collaboration grant #864439.
Competing Interests
The authors have no competing interests to disclose.
Data Availability
A sample demo program for reproducing Fig. 4 in this paper can be found at https://github.com/zzzzms/LeastSquareClustering. All other demonstration codes or data are available upon request.
Appendix A Hyperparameters for Numerical Experiments
For each cluster to be recovered, we sampled the seed vertices uniformly at random from that cluster in all implementations. We fix the random walk depth $t$, and use one random walk threshold parameter for the political blogs network and another for all the other experiments. We vary the rejection parameter for each specific experiment based on the estimated sizes of the clusters. In the case where neither the estimated cluster sizes nor the number of clusters is given, we may have to refer to the spectrum of the graph Laplacian and use a large gap between two consecutive eigenvalues to estimate the number of clusters, and then use the average size to estimate the size of each cluster.
We fix the least squares threshold parameter $\gamma$ for all the experiments; this choice is entirely heuristic. However, we have observed experimentally that the performance of the algorithms is not affected much by varying $\gamma$. The hyperparameter $N$ is chosen according to the size of the initial seed set relative to the total number of vertices in the cluster. For comparison purposes, we keep it fixed for the MNIST data. By experimenting with different choices of $N$, we find that the best performance for the AT&T data and the YaleB data occurs at values which depend on the number of seeds. All the numerical experiments are implemented in MATLAB and can be run on a local personal machine; for the authenticity of our results, we provide a sample demo code at https://github.com/zzzzms/LeastSquareClustering for verification purposes.
Appendix B Image Data Preprocessing
For the YaleB human face data, we performed some preprocessing to remove poor quality images. Specifically, we discarded the pictures which are too dark, and we cropped each image to reduce the effect of background noise. From the remaining qualified pictures, we randomly selected 20 images for each person.
All the image data in MNIST, AT&T, and YaleB first need to be converted into an auxiliary graph before the algorithms can be applied. Let $x_i$ be the vectorization of the $i$-th image from the original data set. Following [21] and [50], we define the affinity matrix of the $k$-NN auxiliary graph based on the Gaussian kernel as
$$A_{ij}=\begin{cases}\exp\!\big(-\|x_i-x_j\|^{2}/(\sigma_i\sigma_j)\big) & \text{if } x_j\in NN(x_i,k),\\ 0 & \text{otherwise}.\end{cases}$$
The notation $NN(x_i,k)$ indicates the set of $k$-nearest neighbours of $x_i$, and $\sigma_i=\|x_i-x_i^{(r)}\|$, where $x_i^{(r)}$ is the $r$-th closest point to $x_i$. Note that the matrix $A$ above is not necessarily symmetric, so we consider a symmetrized version $\bar{A}$, for example $\bar{A}=(A+A^{\top})/2$; alternatively, one may also consider other symmetrization schemes. We use $\bar{A}$ as the input adjacency matrix for our algorithms.
We fix the local scaling parameters $k$ and $r$ separately for the MNIST data, the YaleB data, and the AT&T data during implementation.
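A minimal sketch of this construction (our own illustration; the particular values of $k$ and $r$ and the averaging symmetrization are assumptions) is given below. It forms the full pairwise distance matrix, so it is only meant for data sets of modest size.

```python
import numpy as np

def knn_gaussian_affinity(X, k=10, r=7):
    """Self-tuning Gaussian k-NN affinity matrix as described above.

    X : (N, d) array whose rows are vectorized images.
    k : number of nearest neighbours kept for each point (assumed value).
    r : neighbour index used for the local scale sigma_i (assumed value).
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diff, axis=2)                    # pairwise distances
    order = np.argsort(dist, axis=1)                       # order[i, 0] is i itself
    sigma = dist[np.arange(len(X)), order[:, r]]           # distance to r-th neighbour
    W = np.exp(-dist**2 / np.outer(sigma, sigma))          # self-tuning Gaussian kernel
    A = np.zeros_like(W)
    rows = np.repeat(np.arange(len(X)), k)
    cols = order[:, 1:k + 1].ravel()                       # k nearest neighbours, no self
    A[rows, cols] = W[rows, cols]                          # keep only k-NN entries
    return (A + A.T) / 2                                   # symmetrize
```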