On the Power of SVD in the Stochastic Block Model
Abstract
A popular heuristic method for improving clustering results is to apply dimensionality reduction before running clustering algorithms. It has been observed that spectral-based dimensionality reduction tools, such as PCA or SVD, improve the performance of clustering algorithms in many applications. This phenomenon indicates that the spectral method not only serves as a dimensionality reduction tool, but also contributes to the clustering procedure itself. It is an interesting question to understand the behavior of spectral steps in clustering problems.
As an initial step in this direction, this paper studies the power of the vanilla-SVD algorithm in the stochastic block model (SBM). We show that, in the symmetric setting, the vanilla-SVD algorithm recovers all clusters correctly. This result answers an open question posed by Van Vu (Combinatorics, Probability and Computing, 2018) in the symmetric setting.
1 Introduction
Clustering is a fundamental task in machine learning, with applications in many fields, such as biology, data mining, and statistical physics. Given a set of objects, the goal is to partition them into clusters according to their similarities. Objects and known relations can be represented in various ways. In most cases, objects are represented as vectors in , forming a data set ; each coordinate is called a feature, whose value is directly derived from raw data.
In many applications, the number of features is very large. It has been observed that the performance of classical clustering algorithms such as K-means often degrades on high-dimensional datasets. This phenomenon is sometimes called the curse of dimensionality in machine learning [SEK04]. A popular heuristic method to address this issue is to apply dimensionality reduction before clustering. Among tools for dimensionality reduction, it has been noted in practice that spectral methods such as principal component analysis (PCA) and singular value decomposition (SVD) significantly improve clustering results, e.g., [SEK04, KAH19].
A natural question arises: why do spectral methods help to cluster high-dimensional datasets? Some practitioners believe one reason is that the spectral method filters some noise from the high-dimensional data [ARS+04, SEK04, ZLWZ09, KAH19], and many theoretical works also (partially) support this explanation [AFWZ20, EBW18, LZK22, MZ22a]. With this explanation in mind, people analyzed the behavior of spectral-based algorithms under noise perturbation. Based on these analyses, many algorithms were proposed to recover clusters in probabilistic generative models. Among them, a well-studied model is the signal-plus-noise model.
Signal-Plus-Noise model
In this model, we assume that each observed sample has the form , where is a ground-truth vector and is a random noise vector. For any two sample vectors , if they are from the same cluster, their corresponding ground-truth vectors are identical, i.e., . The signal-plus-noise model is very general; it has many variants with different types of ground-truth vectors and noise distributions. In this paper, we focus on an important instance known as the stochastic block model (SBM). Though the SBM is not as broad as the general signal-plus-noise model, it usually serves as a benchmark for clustering and provides preliminary intuition about random graphs.
Stochastic block model
The SBM was first introduced by [HLL83] and is widely used as a theoretical benchmark for graph clustering algorithms. In this paper, we focus on the symmetric version of the stochastic block model (SSBM), described as follows. Given a set of vertices , we uniformly partition it into disjoint sets (clusters), denoted by . Based on this partition, a random (undirected) graph is sampled in the following way: for every pair of vertices , an edge is added independently with probability if for some ; otherwise, an edge is added independently with probability .
We usually assume that . The task is to recover the hidden partition from the random graph . We denote this model as .
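To make the model concrete, the following is a minimal Python sketch of the sampling procedure described above. The names n, k, p, q stand for the (elided) number of vertices, number of clusters, and intra-/inter-cluster edge probabilities, and assigning labels i.i.d. uniformly is one reading of "uniformly partition"; both are assumptions on our part.

```python
import numpy as np

def sample_ssbm(n, k, p, q, rng=None):
    """Sample an adjacency matrix from a symmetric stochastic block model.

    n: number of vertices; k: number of clusters;
    p: intra-cluster edge probability; q: inter-cluster edge probability (p > q).
    Returns (A, labels), where A is the symmetric 0/1 adjacency matrix.
    """
    rng = np.random.default_rng(rng)
    labels = rng.integers(0, k, size=n)               # uniformly random cluster labels
    same = labels[:, None] == labels[None, :]         # indicator: same cluster?
    probs = np.where(same, p, q)                      # per-pair edge probability
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # sample entries above the diagonal
    A = (upper | upper.T).astype(float)               # symmetrize; no self-loops
    return A, labels

A, labels = sample_ssbm(n=600, k=3, p=0.3, q=0.1, rng=0)
```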
SBM as a signal-plus-noise model
Though the SBM was originally designed for graph clustering, we view it as a special form of vector clustering. Namely, given the adjacency matrix of a graph , the columns of form a set of vectors. To see that the SBM fits into the signal-plus-noise model, note that in the SBM, the adjacency matrix can be viewed as a fixed matrix plus random noise, i.e., , where is the mean and is a zero-mean random matrix. More precisely, in the case of the SSBM,
where the random variables are independent and for all . (Footnote 1: Assume we fix an order of the vertices in .)
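In symbols, the decomposition presumably takes the following form (our reconstruction of the elided display, writing p and q for the intra- and inter-cluster edge probabilities and ignoring the convention on diagonal entries):

\[
A = \bar{A} + E, \qquad
\bar{A}_{uv} =
\begin{cases}
p & \text{if } u, v \text{ lie in the same cluster},\\
q & \text{otherwise},
\end{cases}
\qquad \mathbb{E}[E_{uv}] = 0 .
\]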
1.1 Motivations: Analyzing Vanilla Spectral Algorithms
Since the seminal work by McSherry [McS01], many spectral-based algorithms have been proposed and studied in the SBM [GM05, Vu18, LR15, EBW18, Col19, AFWZ20, MZ22b] and even more general signal-plus-noise models [AFWZ20, EBW18, CTP19, LZK22, MZ22a]. These algorithms are largely based on the spectral analysis of random matrices. The purpose of designing and analyzing such algorithms is twofold.
Understanding the limitation of spectral-based algorithms
The SBM is specified by parameters, such as in the symmetric case. Clustering usually becomes harder as grows and the gap shrinks. Many existing works aim to understand in which regimes of these parameters it is possible to recover the hidden partition. In this regard, the state-of-the-art bound is given by Vu [Vu18]. Concretely, [Vu18] proved that, in the symmetric setting, there is an algorithm that recovers all clusters if , where and is a constant.
Understanding spectral-based algorithms in practice
Besides analyzing spectral algorithms in theory, the other purpose, which is the primary purpose of this paper, is to explain the usefulness of such algorithms in practice. Indeed, as mentioned before, many spectral-based algorithms, as observed in practice, can filter noise and mitigate the curse of dimensionality [ARS+04, SEK04, ZLWZ09, KAH19]; representative examples are PCA and SVD. Furthermore, it has been observed that the spectral algorithms used in practice, such as PCA or SVD, are usually very simple: they just project data points onto some lower-dimensional subspace, and no extra steps are performed.
In stark contrast, most of the aforementioned theoretical algorithms have pre-processing or post-processing steps. For example, the idea in [LR15] is to first apply SVD and then run a variant of K-means to clean up the clustering; the main algorithm in [Vu18] partitions the graph into several parts and uses these parts in different ways. As noted in [Vu18], these extra steps exist only for the purpose of theoretical analysis: from the perspective of algorithm design, they appear redundant. Later on, [AFWZ20] coined the phrase vanilla spectral algorithms to describe spectral algorithms that do not include any additional steps. Both [Vu18] and [AFWZ20] conjectured that vanilla spectral algorithms are themselves good clustering algorithms. In practice, this is a widely used heuristic; in theory, however, the analysis of vanilla spectral algorithms remains unsatisfactory due to a lack of analytic techniques. We refer to [MZ22b] for a detailed discussion of the barriers in current analyses.
Why do we study vanilla algorithms?
We focus particularly on vanilla spectral algorithms for two reasons:
1. Vanilla spectral algorithms are the most popular in practice: extra steps are rarely used, and their performance already seems good enough. The lack of theoretical analysis is mostly due to technical obstacles.
2. A vanilla spectral algorithm is simple and is not specifically designed for theoretical models such as the SBM. In contrast, some more complicated algorithms use extra steps tailored to the SBM; these steps make the analysis go through (as commented in [Vu18]), but they exploit model-specific structure and may cause "overfitting" to the SBM, making such algorithms less powerful in practice.
The main purpose of this paper is to theoretically understand the power of practically successful vanilla spectral algorithms. To this end, we study the SBM as a preliminary demonstration; we do not aim to design algorithms for the SBM that outperform existing algorithms.
1.2 Our Results
The contribution of this paper is twofold. On the one hand, we show that the vanilla-SVD algorithm (Algorithm 1) is indeed a good clustering algorithm in the SSBM for a wide range of parameters, breaking the previous barrier of analyses that handle only a constant number of clusters. On the other hand, we provide a novel analysis of matrix perturbation under random noise. We discuss this part in more detail in Section 1.4.
Recall that the parameters of the SBM are specified by , where . Let . Our main result is stated below.
Theorem 1.1.
There exists a constant with the following property. In the model , if and , then Algorithm 1 recovers all clusters with probability .
Here we describe the vanilla-SVD algorithm in more detail. The algorithms in [McS01, Vu18, Col19] share a common idea: they all use SVD-based methods to find a clear-cut vector representation of the vertices. That is, every node is associated with a vector , and we say a vector representation is clear-cut if the following holds for some threshold : if for some , then ; otherwise, .
Once a clear-cut representation is found, the clustering task is easy: if the parameters are all known, we can calculate and simply decide whether two vertices are in the same cluster based on their distance; in the case where is unknown, we need one more step. (Footnote 2: For example, one possible implementation is as follows: create a minimum spanning tree according to the distances under , then remove the heaviest edges, resulting in connected components, and output these components as clusters.) Following [Vu18], we denote by an algorithm that recovers the partition from a clear-cut representation. One natural representation is obtained by SVD as follows. Let be the adjacency matrix of the input graph, and let be the orthogonal projection matrix onto the space spanned by the first eigenvectors of . Then set , where is the column indexed by . This yields Algorithm 1, the vanilla-SVD algorithm.
Algorithm 1 (vanilla-SVD):
1. Compute for each .
2. Run with representation .
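For illustration, the following Python sketch implements the vanilla-SVD step together with one simple stand-in for the abstract "recover from a clear-cut representation" subroutine (a greedy distance-threshold rule); the function names and the threshold choice are ours, not the paper's.

```python
import numpy as np

def vanilla_svd_representation(A, k):
    """Project the columns of the adjacency matrix A onto the span of its
    top-k eigenvectors (the vanilla-SVD step of Algorithm 1)."""
    vals, vecs = np.linalg.eigh(A)       # eigenvalues in ascending order
    order = np.argsort(-vals)            # indices of eigenvalues, largest first
    U = vecs[:, order[:k]]               # top-k eigenvectors
    P_k = U @ U.T                        # orthogonal projection onto their span
    return P_k @ A                       # column u is the representation of vertex u

def cluster_by_distance(X, threshold):
    """Greedy grouping: a vertex joins the first existing cluster whose
    representative is within `threshold`; otherwise it starts a new cluster.
    This is only a stand-in for the abstract recovery subroutine."""
    n = X.shape[1]
    labels = -np.ones(n, dtype=int)
    centers = []                         # index of the first member of each cluster
    for u in range(n):
        for c, v in enumerate(centers):
            if np.linalg.norm(X[:, u] - X[:, v]) <= threshold:
                labels[u] = c
                break
        else:
            centers.append(u)
            labels[u] = len(centers) - 1
    return labels

# Usage with the SSBM sampler sketched earlier (threshold chosen heuristically):
# X = vanilla_svd_representation(A, k=3)
# pred = cluster_by_distance(X, threshold=0.5 * np.median(np.linalg.norm(X, axis=0)))
```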
1.3 Comparison with Existing Analysis for Vanilla Spectral Algorithms in the SBM
To the best of our knowledge, there are very few works on the analysis of vanilla spectral algorithms [AFWZ20, EBW18, PPV+19]. All of them only apply to the case of . In this work, we obtain the first analysis for general parameters , in the symmetric SBM setting.
Davis-Kahan approaches
To study spectral algorithms in signal-plus-noise models, a key step is to understand how random noise perturbs the eigenvectors of a matrix. A commonly-used technical ingredient is the Davis-Kahan theorem (or its variant). However, this type of approach faces two challenges in the SBM.
- The Davis-Kahan theorem leads to worst-case perturbation bounds. For perturbations caused by random noise, as in signal-plus-noise models, it is sometimes suboptimal.
- These theorems only yield bounds in the -norm, whereas the SBM analysis may require -norm bounds. See [CTP19] for more discussion.
Previous works such as [AFWZ20, EBW18, PPV+19] mainly followed this approach. They proposed novel ideas to (partially) address these two challenges, but their results only apply to the case of . In contrast, our approach, following the power-iteration-based analysis proposed by [MZ22b], completely avoids the Davis-Kahan theorem and can handle the case of .
Comparison with [MZ22b]
Inspired by power iteration methods, Mukherjee and Zhang [MZ22b] proposed a new approach to analyzing the perturbation of random matrices. The idea is to approximate the eigenvectors of a matrix by its powers. In fact, this method has been widely used in practice as a fast way to approximate eigenvectors. However, [MZ22b] has two limitations.
- Their analysis requires a nice structure of the mean matrix, i.e., all large eigenvalues are more or less the same.
- Their algorithm is not vanilla, as it has a "centering step". Moreover, their algorithm requires knowledge of the parameters , and in particular the centering step alone requires knowledge of . In comparison, we only need to know ; further, one can also guess (by checking the number of large eigenvalues) and thus make Algorithm 1 fully parameter-free.
To overcome these limitations, we introduce a novel "polynomial approximation + entrywise analysis" method, which makes the analysis more robust and requires less structure. More details are discussed in Section 1.4.
Regarding parameters in TheoremΒ 1.1
The difference between the parameters in our result and those in Vu's paper is that we replaced by . If , then and are equal up to a constant factor, so our bound is essentially the same as that of [Vu18] except for the factor. We believe the extra factor can be removed in future work; it stems from the new concentration inequality we use as a black box, and a refined analysis of that inequality may remove it. Here are two example settings of parameters that satisfy the conditions in Theorem 1.1:
- ;
- and .
1.4 Proof Outline and Technical Contributions
Let denote the size of the cluster to which belongs. Assume for now that all clusters are of size roughly . Indeed, this happens with high probability since the partition is uniformly sampled.
Our goal is to show that there exists some threshold such that for every : if for some , then ; otherwise, . Write . Then . Note that if for some ; otherwise, . Therefore, setting , it suffices to show that for every .
We decompose into two terms:
(1)
We shall bound the two terms from above separately. Intuitively, the noise term being small means that reduces the noise, while the deviation term being small means that preserves the data.
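A plausible form of the elided decomposition, with $P_k$ the rank-$k$ projection, $\bar{A}$ the mean matrix, and $E = A - \bar{A}$ the noise, is the following (our reconstruction):

\[
P_k A_u - \bar{A}_u
  \;=\; \underbrace{P_k E_u}_{\text{noise term}}
  \;+\; \underbrace{\bigl(P_k \bar{A}_u - \bar{A}_u\bigr)}_{\text{deviation term}} .
\]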
Upper bound of the noise term
It is known that (resp., ) can be written as a polynomial of (resp., ). By Weyl's inequality, the eigenvalues of are not too far from those of . Therefore, in our case, one can find a simple polynomial which only depends on , such that (resp., ) is a good approximation of (resp., ); this is formalized in Lemma 3.2. We then have the following decomposition: where the first inequality follows from Lemma 3.2, which roughly says that is a good approximation of .
1. The first term, , is small with high probability. To see this, we use Lemma 3.2 again: . According to a known result (cf. Proposition 2.4), is small with high probability, largely because the projection and the vector are independent.
2. The second term is the tricky part, and we handle it with an entrywise analysis. Namely, we study every entry of , using the new inequality from [MZ22b]. See Lemma 3.3 for more details.
The upper bound for the noise term is encapsulated in LemmaΒ 3.4.
Upper bound of the deviation term
The following argument is reminiscent of [Vu18]. Say . Note that where is the normalized characteristic vector of (i.e.,Β ). It follows that
where the second inequality holds because is the best rank- approximation of and , and in the third inequality we use , as is a projection matrix. Therefore,
(2)
A typical result in random matrix theory (cf. Proposition 2.3) states that with high probability, . Combining Equation 2 and , we get . By our assumption on , we have .
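One standard route to such a bound, stated here only for orientation since the paper's exact chain of inequalities was lost in extraction, is

\[
\|P_k A - \bar{A}\|
  \;\le\; \|P_k A - A\| + \|A - \bar{A}\|
  \;\le\; \|\bar{A} - A\| + \|A - \bar{A}\|
  \;=\; 2\|E\|,
\]

using that $P_k A$ is the best rank-$k$ approximation of $A$ while $\bar{A}$ has rank at most $k$.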
Technical contribution
The major novelty of our analysis is the use of the polynomial . [MZ22b] used a centering step to make the mean matrix nicely structured, whereas our analysis uses polynomial approximation to address this issue. Another difference is that in [MZ22b], the centering step appears explicitly in the algorithm; by contrast, our polynomial approximation appears only in the analysis, and the algorithm remains vanilla.
As a byproduct, we developed new techniques for studying eigenspace perturbation, a typical topic in random matrix theory. Our high-level idea is "polynomial approximation + entrywise analysis": we reduce the analysis of eigenspace perturbation to the analysis of a simple polynomial (of a matrix) under perturbation, for which more tools are available.
1.5 Discussion and Future Directions
In this paper, we studied the behavior of vanilla-SVD in the SSBM, a benchmark signal-plus-noise model widely studied in random matrix theory. We showed that vanilla-SVD indeed filters noise in the SSBM. In fact, our analysis technique, "polynomial approximation + entrywise analysis", is not limited to the SSBM. A direct and interesting question yet to be answered is: can our method be extended to prove that vanilla-SVD works in the general SBM, where partitions are not uniformly sampled and edges appear with different probabilities? Moreover, our method may be useful for analyzing other realistic probabilistic models, such as the factor model, which has been widely used in economics and modern portfolio theory.
In the long term, it would be very interesting to understand the behavior of vanilla spectral algorithms on real data: 1) Why do they succeed in some applications? 2) How can we fix them when they fail in other cases? A deeper understanding of vanilla spectral algorithms will provide guidelines for using them in many machine learning tasks.
2 Preliminaries
Notations
Let denote the -dimensional vector whose entries are all 1's, and let be the matrix whose entries are all 1's. Let denote the size of the cluster to which belongs. For a matrix , denotes the row of indexed by , and denotes the column indexed by ; is the -th largest eigenvalue of ; let denote the orthogonal projection matrix onto the space spanned by the first eigenvectors of . For a vector , denotes the Euclidean norm.
Definition 2.1 (Matrix operator norms).
Let . Define and .
Proposition 2.1 (e.g.,Β [CTP19]).
For all matrices , it holds that (1) ; (2) .
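Since the norm symbols were lost in extraction, the following is one common pair used in this line of work (an assumption on our part, guided by the citation to [CTP19]):

\[
\|M\| \;=\; \max_{\|x\|_2 = 1} \|Mx\|_2 \quad \text{(spectral norm)},
\qquad
\|M\|_{2\to\infty} \;=\; \max_{u} \|M_u\|_2 \quad \text{(maximum row norm)} .
\]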
Proposition 2.2 (Weylβs inequality).
For all , we have .
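For reference, the standard form of Weyl's inequality for symmetric matrices, which is presumably the content of the elided statement, reads:

\[
\bigl|\lambda_i(A) - \lambda_i(\bar{A})\bigr| \;\le\; \|A - \bar{A}\| \;=\; \|E\|
\qquad \text{for all } i .
\]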
Proposition 2.3 (Norm of a random matrix [Vu18]).
There is a constant . Let be a symmetric matrix whose upper diagonal entries are independent random variables where or with probabilities and respectively, where . Let . If , then
Proposition 2.4 (Projection of a random vector, lemma 2.1 in [Vu18]).
There exists a constant such that the following holds. Let be a random vector in whose coordinates are independent random variables with mean 0 and variance at most . Assume furthermore that the are bounded by 1 in absolute value. Let be a subspace of dimension and let be the length of the orthogonal projection of onto . Then
Proposition 2.5.
For and , if , then .
3 Analysis of Vanilla SVD Algorithm
Write . We say the partition is balanced if . By the Chernoff bound, the partition is balanced with probability at least ; hence, we assume that the partition is balanced in the remaining argument. Since , the event holds with high probability (see Proposition 2.3).
Recall the decomposition into the deviation term and the noise term in Equation 1. We first state our upper bound on the deviation term, which readily follows from the argument in Section 1.4; the complete proof is in Appendix B.
Lemma 3.1 (Upper bound of deviation term).
Let be the constant in PropositionΒ 2.3. If the partition is balanced and , then with probability at least we have
Section 3.1 and Section 3.2 lead to an upper bound on the noise term, and Section 3.3 contains the proof of the main theorem.
3.1 An Approximation of and
To give some intuition for the choice of , we first analyze the spectrum of ; the result is summarized in Theorem 3.1.
The eigenvalues of
Note that , where . Without loss of generality, assume that . It is easy to see that the eigenvalues of are . Viewing as a rank-one perturbation of , we have the following theorem, which characterizes the eigenvalues of . Its proof, given in Appendix C, readily follows from a theorem of [BNS79] on eigenvalues under rank-one perturbation.
Theorem 3.1.
Write and assume that . Define . Then (1) and ; (2) , and hence
The choice of the polynomial
Let , and let be the quadratic polynomial such that , i.e.,Β where . Finally, let where .
Here we give some intuition for the choice of . Let be the spectral decomposition of . Then . The spectral decomposition of is . Hence,
(3)
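Writing $\lambda_i$ and $u_i$ for the eigenvalues and eigenvectors of $A$, Equation 3 presumably has the following shape (our reconstruction):

\[
f(A) - P_k
 \;=\; \sum_{i=1}^{n} f(\lambda_i)\, u_i u_i^{\top} - \sum_{i=1}^{k} u_i u_i^{\top}
 \;=\; \sum_{i=1}^{k} \bigl(f(\lambda_i) - 1\bigr)\, u_i u_i^{\top}
       \;+\; \sum_{i=k+1}^{n} f(\lambda_i)\, u_i u_i^{\top},
\]

so choosing $f$ close to 1 on the top-$k$ eigenvalues and close to 0 on the remaining ones makes $f(A)$ a good proxy for $P_k$.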
Recall that is bounded by Weyl's inequality. Moreover, when the partition is balanced, Theorem 3.1 shows that the eigenvalues of are nicely distributed: except for , all other eigenvalues are close to . Hence, our choice of makes small, and thus is a good approximation of . Formally, we have the following lemma.
Lemma 3.2 (Polynomial approximation).
Assume that the partition is balanced and , where is the constant in PropositionΒ 2.3. Then with probability at least , it holds that for all , and
Proof.
Let (resp.,Β ) Β be the spectral decomposition of (resp.,Β ). We shall use the following claim.
Claim 3.1.
The following holds with probability (over the choice of ): for every , ; and for every , .
Fix . On the one hand, , and hence
which means . On the other hand,
Since , we have . This establishes the first part.
Note that and we also have , and thus a similar argument applies to . This finishes the proof of Lemma 3.2.
It remains to prove Claim 3.1. The claim readily follows from the choice of and the fact that are close. A complete proof can be found in Appendix C. ∎
3.2 The Upper Bound of the Noise Term
According to EquationΒ 1, in order to derive an upper bound of , it remains to bound from above. This is done by the following lemma.
Lemma 3.3.
Let be the constant in PropositionΒ 2.3. Assume that the partition is balanced and . For every , it holds that
where is a constant.
Combining LemmaΒ 3.2 and PropositionΒ 2.4, we get an upper bound of the noise term:
Lemma 3.4 (Upper bound of noise term).
Let be the constant in PropositionΒ 2.3. Assume that . Then with probability at least , we have for all , where is a constant.
The proof of LemmaΒ 3.3 is deferred to SectionΒ 4. We use it to prove LemmaΒ 3.4 here.
Proof of LemmaΒ 3.4.
It follows from LemmaΒ 3.2 that
By Proposition 2.4, with probability at least , is bounded by , where is a universal constant. Meanwhile, by Lemma 3.3 and a union bound over all , with probability at least , for every . Therefore, with probability , it holds that for all . Setting , we obtain the desired result. ∎
3.3 Putting It Together
Now we are well-equipped to prove TheoremΒ 1.1.
Proof of TheoremΒ 1.1.
Let , where are the constants in Proposition 2.3 and Lemma 3.4. By our assumption on , we have . It is easy to verify that satisfies the conditions in Lemma 3.4 and Lemma 3.1.
Write . We aim to show that for every : if for some , then ; otherwise, . Then by calling , AlgorithmΒ 1 recovers all large clusters correctly.
Let . According to the argument in SectionΒ 1.4, it suffices to show that for all . We further decompose into noise term and deviation term, i.e.,Β , where and . By LemmaΒ 3.4 and LemmaΒ 3.1, with probability at least , the following hold for all : (1) ; (2) . Therefore, with probability at least , we indeed have . This completes the proof. β
4 Proof of LemmaΒ 3.3: Entrywise Analysis
This section is dedicated to proving LemmaΒ 3.3.
Since both and are symmetric, we have . The high-level idea is to write as a sum of matrices, where each matrix is of the form such that . This way, we have , and is bounded by a lemma from [MZ22b].
Let and write . Then
where the last step is a decomposition based on the first location of in the product terms. And
That is, . We bound the three terms separately.
Here we first list some definitions and estimates of the quantities involved.
- According to Proposition 2.3, with probability at least , we have , where is a constant. In the following argument, we always assume this holds.
- . By our assumption on , we have .
- ; , and thus .
- By Claim 3.1, . By Proposition 2.5, .
Upper bound of
Note that , and for all . Moreover, . The following lemma gives an upper bound on .
Lemma 4.1.
.
Therefore, by a union bound, the following holds with probability at least :
(4)
Upper bound of
Since , we have
(5)
Lemma 4.2 (Upper bound of ).
With probability (over the choice of ), we have , where is a constant.
Finally, combining EquationΒ 4, EquationΒ 5, and the above lemma, we conclude that with probability at least ,
This establishes LemmaΒ 3.3.
Proofs of LemmaΒ 4.1 and LemmaΒ 4.2 are deferred to AppendixΒ D.
Acknowledgments and Disclosure of Funding
We are grateful to anonymous NeurIPS reviewers for their helpful comments. This research is supported by NSF CAREER award 2141536.
References
- [AFWZ20] Emmanuel Abbe, Jianqing Fan, Kaizheng Wang, and Yiqiao Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics, 48(3):1452–1474, 2020.
- [ARS+04] Paolo Antonelli, H. E. Revercomb, L. A. Sromovsky, W. L. Smith, R. O. Knuteson, D. C. Tobin, R. K. Garcia, H. B. Howell, H.-L. Huang, and F. A. Best. A principal component noise filter for high spectral resolution infrared measurements. Journal of Geophysical Research: Atmospheres, 109(D23), 2004.
- [BNS79] James R. Bunch, Christopher P. Nielsen, and Danny C. Sorensen. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31–48, 1978/79.
- [Col19] Sam Cole. Recovering nonuniform planted partitions via iterated projection. Linear Algebra and its Applications, 576:79–107, 2019.
- [CTP19] Joshua Cape, Minh Tang, and Carey E. Priebe. The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. The Annals of Statistics, 47(5):2405–2439, 2019.
- [EBW18] Justin Eldridge, Mikhail Belkin, and Yusu Wang. Unperturbed: spectral analysis beyond Davis-Kahan. In Algorithmic Learning Theory 2018, volume 83 of Proceedings of Machine Learning Research, page 38. PMLR, 2018.
- [GM05] Joachim Giesen and Dieter Mitsche. Reconstructing many partitions using spectral techniques. In International Symposium on Fundamentals of Computation Theory, pages 433–444. Springer, 2005.
- [HLL83] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
- [KAH19] Vladimir Yu. Kiselev, Tallulah S. Andrews, and Martin Hemberg. Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics, 20(5):273–282, 2019.
- [LR15] Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.
- [LZK22] Boris Landa, Thomas T. C. K. Zhang, and Yuval Kluger. Biwhitening reveals the rank of a count matrix. SIAM Journal on Mathematics of Data Science, 4(4):1420–1446, 2022.
- [McS01] Frank McSherry. Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001), pages 529–537. IEEE Computer Society, 2001.
- [MZ22a] Chandra Sekhar Mukherjee and Jiapeng Zhang. Compressibility: Power of PCA in clustering problems beyond dimensionality reduction. arXiv preprint arXiv:2204.10888, 2022.
- [MZ22b] Chandra Sekhar Mukherjee and Jiapeng Zhang. Detecting hidden communities by power iterations with connections to vanilla spectral algorithms, 2022.
- [PPV+19] Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, John M. Conroy, Vince Lyzinski, Minh Tang, Avanti Athreya, Joshua Cape, and Eric Bridgeford. On a two-truths phenomenon in spectral graph clustering. Proceedings of the National Academy of Sciences, 116(13):5995–6000, 2019.
- [SEK04] Michael Steinbach, Levent Ertöz, and Vipin Kumar. The challenges of clustering high dimensional data. In New Directions in Statistical Physics, pages 273–309. Springer, 2004.
- [Vu18] Van Vu. A simple SVD algorithm for finding hidden partitions. Combinatorics, Probability and Computing, 27(1):124–140, 2018.
- [ZLWZ09] Lei Zhang, Rastislav Lukac, Xiaolin Wu, and David Zhang. PCA-based spatially adaptive denoising of CFA images for single-sensor digital cameras. IEEE Transactions on Image Processing, 18(4):797–812, 2009.
Appendix A Useful inequalities
Proposition A.1 (Chernoff bound).
Let be i.i.d random variables that can take values in , with for . Then it holds that
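The displayed bound was lost in extraction; a standard multiplicative form for independent variables taking values in [0, 1], which is presumably what is intended, is

\[
\Pr\Bigl[\Bigl|\sum_{i=1}^{m} X_i - \mu\Bigr| \ge \delta\mu\Bigr]
  \;\le\; 2\exp\!\bigl(-\delta^{2}\mu/3\bigr),
\qquad \mu = \mathbb{E}\Bigl[\sum_{i=1}^{m} X_i\Bigr],\; 0 < \delta < 1 .
\]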
Proposition A.2 (Hoeffding bound).
Let be independent random variables such that , and write . Then it holds that
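Again for orientation, the standard Hoeffding bound for independent $X_i \in [a_i, b_i]$ with $S = \sum_{i=1}^{m} X_i$ reads:

\[
\Pr\bigl[\,|S - \mathbb{E}[S]| \ge t\,\bigr]
  \;\le\; 2\exp\!\Bigl(-\frac{2t^{2}}{\sum_{i=1}^{m}(b_i - a_i)^{2}}\Bigr) .
\]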
Definition A.1.
Let be a Bernoulli random variable with parameter , i.e.,Β . The random variable is called centered Bernoulli random variable with parameter .
Proposition A.3 (Adapted from [MZ22b]).
Let , and let be an symmetric random matrix, where
are independent, centered Bernoulli random variables with parameter at most for all . Suppose that every entry of takes value in , and each column of has at most non-zero entries. Then for every , it holds that
where
By union bound,
Remark A.1.
The parameter is determined by , which equals in our case. The above bound is particularly useful when are small, that is, we want the matrix to have either small entries or sparse columns.
Proposition A.4 (PropositionΒ 2.5 restated).
For and , if , then .
Proof.
Let . If , we have . If , then and hence
β
Appendix B Bounding the Deviation Term
Proof of LemmaΒ 3.1.
Our assumption on implies that . By PropositionΒ 2.3, with probability at least , we have
According to EquationΒ 2 and , we have
β
Appendix C Polynomial Approximation
The proof of Theorem 3.1 relies on the following result on rank-one perturbation.
Proposition C.1 (Eigenvalues under rank-one perturbation, Theorem 1 in [BNS79]).
Let , where is diagonal, . Let be the eigenvalues of , and let be the eigenvalues of . Then
where and .
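Since the statement of the cited result was lost, we record the form in which Theorem 1 of [BNS79] is usually quoted; the paper's instantiation may differ in notation. For $C = D + \rho\, b b^{\top}$ with $D$ diagonal and $\|b\|_2 = 1$, the eigenvalues $\tilde{\lambda}_i$ of $C$ and $\lambda_i$ of $D$ satisfy

\[
\tilde{\lambda}_i \;=\; \lambda_i + \rho\,\mu_i,
\qquad \mu_i \ge 0, \qquad \sum_{i=1}^{n} \mu_i = 1,
\]

so each eigenvalue shifts in the direction of $\rho$ and the total shift across the spectrum is exactly $\rho$.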
Proof of TheoremΒ 3.1.
Let be the indicator vector for , i.e.,Β iff . It is easy to see that the eigenvectors of are . Write , , then we have . Note that , and hence
where . This means the eigenvalues of are the same as those of . Since , Item 1 follows directly from Proposition C.1. To see Item 2, we use the Rayleigh quotient characterization of the largest eigenvalue:
where the last inequality follows from . ∎
Proof of ClaimΒ 3.1.
The assumption on in Lemma 3.2 implies that . By Weyl's inequality, we have
Meanwhile, by Theorem 3.1,
for . Hence, writing , we have
1. ;
2. ;
3. for every , .
First, according to the definition of , and hence . As for ,
(since )
(as )
Consequently, by PropositionΒ 2.5.
Next, for , the argument is similar:
where the second inequality follows from . This yields by PropositionΒ 2.5.
Finally, for , it holds that
which means . β
Appendix D Bounding the Noise Term
Proof of LemmaΒ 4.1.
The lemma readily follows from the following entrywise bound and Chernoff bound.
Claim D.1 (Entries of ).
For every , if for some , then ; otherwise, .
We decompose , where is the intra-cluster part, i.e., if for some , and otherwise. For every column of , its non-zero entries are identical and at most by the above claim. Hence, every entry of equals the sum of at most independent, centered Bernoulli variables with parameter , scaled by some factor at most . By the Chernoff bound, , and we have by a union bound. Analogously, by the Hoeffding bound, . Since , the lemma follows from the above two inequalities and a union bound. ∎
Proof of ClaimΒ D.1.
Write and recall that (i) for all ; (ii) ; (iii) , and for all . Assume that for some . Then
Since , the numerator is at least
Because , we have
which means . Meanwhile,
where the first term is at most by (ii); the second term is at most . Therefore, .
For the second part, assume that are not in the same cluster. Then
Hence,
By (ii), the first term is at most ; by (i), the second term is at most ; hence, . β
Upper bound of (Proof of LemmaΒ 4.2)
Write . Then . It suffices to derive a good upper bound on and , as can be further expressed as a sum of powers of . This is done by the following lemma:
Lemma D.1.
The following holds with probability over the choice of : for all , it holds that
- ,
- ,
where is an absolute constant.
Specifically,
Note that . It follows that for every ,
where the second inequality is by LemmaΒ D.1, and the last step follows from PropositionΒ 2.5. Similarly, for every ,
where the second inequality follows from , and the third inequality is by LemmaΒ D.1. In sum,
(6)
where in the last inequality we also use .
It remains to prove LemmaΒ D.1; the proof draws on the entrywise bound in PropositionΒ A.3.
Proof of LemmaΒ D.1.
Fix an . Write for ease of notation. Observe that . We can apply Proposition A.3 to and , with . That is, with probability at least ,
Our assumption on yields ; moreover, . Therefore,
(7)
Similarly, by applying PropositionΒ A.3 to with , we have, with probability at least ,
Since , we have
(8)
Combining EquationΒ 7 and EquationΒ 8, we have, with probability at least , where .
For the second part, we decompose , where is the intra-cluster part, i.e.,Β if for some ; and otherwise. Equipped with ClaimΒ D.1, we can apply PropositionΒ A.3 on (with ), and (with ):
where the second inequality holds with probability at least . The lemma follows from a union bound over all . β