
On the Power of SVD in the Stochastic Block Model

Xinyu Mao
University of Southern California
[email protected]
Jiapeng Zhang
University of Southern California
[email protected]
Abstract

A popular heuristic for improving clustering results is to apply dimensionality reduction before running clustering algorithms. It has been observed that spectral-based dimensionality-reduction tools, such as PCA or SVD, improve the performance of clustering algorithms in many applications. This phenomenon indicates that spectral methods not only serve as dimensionality-reduction tools but also contribute to the clustering procedure itself. It is an interesting question to understand the behavior of spectral steps in clustering problems.

As an initial step in this direction, this paper studies the power of the vanilla-SVD algorithm in the stochastic block model (SBM). We show that, in the symmetric setting, the vanilla-SVD algorithm recovers all clusters correctly. This result answers an open question posed by Van Vu (Combinatorics, Probability and Computing, 2018) in the symmetric setting.

1 Introduction

Clustering is a fundamental task in machine learning, with applications in many fields, such as biology, data mining, and statistical physics. Given a set of objects, the goal is to partition them into clusters according to their similarities. Objects and known relations can be represented in various ways. In most cases, objects are represented as vectors in $\mathbb{R}^d$, forming a data set $\mathcal{D}\subset\mathbb{R}^d$; each coordinate is called a feature, whose value is directly derived from raw data.

In many applications, the number of features is very large. It has been observed that the performance of classical clustering algorithms such as K-means may degrade on high-dimensional datasets; this phenomenon is often called the curse of dimensionality in machine learning [SEK04]. A popular heuristic to address this issue is to apply dimensionality reduction before clustering. Among dimensionality-reduction tools, it has been noted in practice that spectral methods such as principal component analysis (PCA) and singular value decomposition (SVD) significantly improve clustering results, e.g., [SEK04, KAH19].

A natural question arises: why do spectral methods help to cluster high-dimensional datasets? Some practitioners believe one reason is that spectral methods filter some of the noise out of the high-dimensional data [ARS+04, SEK04, ZLWZ09, KAH19], and many theoretical works (partially) support this explanation [AFWZ20, EBW18, LZK22, MZ22a]. With this explanation in mind, people have analyzed the behavior of spectral-based algorithms under noise perturbation, and based on these analyses, many algorithms were proposed to recover clusters in probabilistic generative models. Among them, a well-studied model is the signal-plus-noise model.

Signal-Plus-Noise model

In this model, we assume that each observed sample $\widehat{v}_i$ has the form $\widehat{v}_i=v_i+e_i$, where $v_i$ is a ground-truth vector and $e_i$ is a random noise vector. For any two samples $\widehat{v}_i,\widehat{v}_j$ from the same cluster, the corresponding ground-truth vectors are identical, i.e., $v_i=v_j$. The signal-plus-noise model is very general; it has many variants with different types of ground-truth vectors and noise distributions. In this paper, we focus on an important instance known as the stochastic block model (SBM). Though the SBM is not as broad as the general signal-plus-noise model, it usually serves as a benchmark for clustering and provides preliminary intuition about random graphs.

Stochastic block model

The SBM was first introduced by [HLL83] and is widely used as a theoretical benchmark for graph clustering algorithms. In this paper, we focus on the symmetric version of the stochastic block model (SSBM), described as follows. Given a set of $n$ vertices $V$, we uniformly partition them into $k$ disjoint sets (clusters), denoted by $V_1,\dots,V_k$. Based on this partition, a random (undirected) graph $\widehat{G}=(V,E)$ is sampled in the following way: for each pair of vertices $u,v\in V$, the edge $\{u,v\}$ is added independently with probability $p$ if $u,v\in V_\ell$ for some $\ell$, and with probability $q$ otherwise.

We usually assume that $p>q$. The task is to recover the hidden partition $V_1,\dots,V_k$ from the random graph $\widehat{G}$. We denote this model by $\mathsf{SSBM}(V,n,k,p,q)$.

SBM as a signal-plus-noise model

Though the SBM was originally designed for graph clustering, we can view it as a special form of vector clustering. Namely, given the adjacency matrix $\widehat{G}\in\{0,1\}^{V\times V}$ of a graph, the columns of $\widehat{G}$ form a set of $n=|V|$ vectors. To see that the SBM fits into the signal-plus-noise model, note that the adjacency matrix $\widehat{G}$ can be viewed as a fixed matrix $G$ plus random noise, i.e., $\widehat{G}=G+E$, where $G\stackrel{\mathsf{def}}{=}\mathbf{E}[\widehat{G}]$ is the mean and $E$ is a zero-mean random matrix. More precisely, in the case of the SSBM,

$$G_{uv}=\begin{cases}p&\text{if }u,v\in V_\ell\text{ for some }\ell,\\ q&\text{otherwise},\end{cases}\qquad E_{uv}=\begin{cases}1-G_{uv}&\text{with probability }G_{uv},\\ -G_{uv}&\text{with probability }1-G_{uv},\end{cases}$$

where the random variables $\{E_{uv}:u\leq v\}$ are independent and $E_{vu}=E_{uv}$ for all $u,v\in V$ (here we fix an order of the vertices in $V$, so that $u\leq v$ makes sense).
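The decomposition $\widehat{G}=G+E$ can be checked numerically. The following is a minimal NumPy sketch (our illustration, not from the paper; all parameter values are ad hoc) that samples an SSBM instance and verifies that $E$ is symmetric, zero-mean, and takes exactly the stated values.

```python
import numpy as np

# Illustrative parameters (chosen for this sketch only).
rng = np.random.default_rng(0)
n, k, p, q = 200, 4, 0.6, 0.1

# A balanced hidden partition and the mean matrix G = E[G_hat].
labels = rng.permutation(np.arange(n) % k)
G = np.where(labels[:, None] == labels[None, :], p, q)

# Sample G_hat entry by entry for u <= v, then mirror to make it symmetric.
U = np.triu(rng.random((n, n)) < G)
G_hat = (U | np.triu(U, 1).T).astype(float)

# The noise E = G_hat - G is symmetric, and each entry equals
# 1 - G_uv (with probability G_uv) or -G_uv (with probability 1 - G_uv).
E = G_hat - G
print(np.allclose(E, E.T))                                        # True
print(np.all(np.isin(np.round(E, 12),
                     np.round([1 - p, -p, 1 - q, -q], 12))))      # True
print(abs(E.mean()) < 0.05)                                       # True: zero-mean, empirically
```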

1.1 Motivations: Analyzing Vanilla Spectral Algorithms

Since the seminal work by McSherry [McS01], many spectral-based algorithms have been proposed and studied in the SBM [GM05, Vu18, LR15, EBW18, Col19, AFWZ20, MZ22b] and even more general signal-plus-noise models [AFWZ20, EBW18, CTP19, LZK22, MZ22a]. These algorithms are largely based on the spectral analysis of random matrices. The purpose of designing and analyzing such algorithms is twofold.

Understanding the limitation of spectral-based algorithms

The SBM is specified by parameters, such as $n,k,p,q$ in the symmetric case. Clustering usually gets harder as $k$ grows and the gap $(p-q)$ shrinks. Many existing works aim to understand in which regimes of these parameters it is possible to recover the hidden partition. In this regard, the state-of-the-art bound is given by Vu [Vu18]. Concretely, [Vu18] proved that, in the symmetric setting, there is an algorithm that recovers all clusters if $n\geq C\cdot k\left(\frac{\sigma\sqrt{k}+\sqrt{\log n}}{p-q}\right)^2$, where $\sigma^2\stackrel{\mathsf{def}}{=}\max\{p(1-p),q(1-q)\}$ and $C$ is a constant.

Understanding spectral-based algorithms in practice

Besides analyzing spectral algorithms in theory, the other purpose, which is the primary one of this paper, is to explain the usefulness of such algorithms in practice. Indeed, as mentioned before, many spectral-based algorithms, as observed in practice, can filter the noise and address the curse of dimensionality [ARS+04, SEK04, ZLWZ09, KAH19]; representative examples are PCA and SVD. Furthermore, it has been observed that spectral algorithms used in practice, such as PCA or SVD, are usually very simple: they just project data points onto some lower-dimensional subspace, and no extra steps are conducted.

In stark contrast, most of the aforementioned theoretical algorithms have pre-processing or post-processing steps. For example, the idea in [LR15] is to first apply SVD and then run a variant of K-means to clean up the clustering; the main algorithm in [Vu18] partitions the graph into several parts and uses these parts in different ways. As noted in [Vu18], these extra steps exist only for the purpose of theoretical analysis: from the perspective of algorithm design, they appear redundant. Later on, [AFWZ20] coined the phrase vanilla spectral algorithms to describe spectral algorithms that do not include any additional steps. Both [Vu18] and [AFWZ20] conjectured that vanilla spectral algorithms are themselves good clustering algorithms. In practice, this is a widely used heuristic; in theory, however, the analysis of vanilla spectral algorithms remains unsatisfactory due to a lack of analytic techniques. We refer to [MZ22b] for a detailed discussion of the barriers in current analyses.

Why do we study vanilla algorithms?

Our main focus is particularly on vanilla spectral algorithms for two reasons:

  1. Vanilla spectral algorithms are the most popular in practice; versions with extra steps are not widely used. Moreover, their performance seems good enough. The lack of theoretical analysis is mostly due to technical obstacles.

  2. A vanilla spectral algorithm is often simple and is not specifically designed for theoretical models such as the SBM. In contrast, some more complicated algorithms use extra steps that are designed for the SBM; these steps make the analysis go through (as commented in [Vu18]). Meanwhile, such extra steps exploit model-specific structure and may cause 'overfitting' to the SBM, which makes these algorithms less powerful in practice.

The main purpose of this paper is to theoretically understand the power of practically successful vanilla spectral algorithms. To this end, we study the SBM as a preliminary demonstration. We do not aim to design algorithms for the SBM that outperform existing ones.

1.2 Our Results

The contribution of this paper is twofold. On the one hand, we show that the vanilla-SVD algorithm (Algorithm 1) is indeed a clustering algorithm in the SSBM for a wide range of parameters, breaking the previous barrier of analyses that handle only a constant number of clusters. On the other hand, we provide a novel analysis of matrix perturbation under random noise. We discuss this part in more detail in Section 1.4.

Recall that the SSBM is specified by $\mathsf{SSBM}(V,n,k,p,q)$, where $n=|V|$. Let $\sigma^2=\max\{p(1-p),q(1-q)\}$. Our main result is stated below.

Theorem 1.1.

There exists a constant $C>0$ with the following property. In the model $\mathsf{SSBM}(V,n,k,p,q)$, if $\sigma^2\geq C\log n/n$ and $n\geq C\cdot k\left(\frac{\sqrt{kp}\cdot\log^6 n+\sqrt{\log n}}{p-q}\right)^2$, then Algorithm 1 recovers all clusters with probability $1-O(n^{-1})$.

Here we describe the vanilla-SVD algorithm in more detail. The algorithms in [McS01, Vu18, Col19] share a common idea: they all use SVD-based methods to find a clear-cut vector representation of the vertices. That is, every node $v\in V$ is associated with a vector $\rho(v)$, and we say a vector representation $\rho$ is clear-cut if the following holds for some threshold $\Delta$: if $u,v\in V_\ell$ for some $\ell$, then $\|\rho(u)-\rho(v)\|\leq\Delta/4$; otherwise, $\|\rho(u)-\rho(v)\|\geq\Delta$.

Once a clear-cut representation is found, the clustering task is easy: if the parameters $n,k,p,q$ are all known, we can calculate $\Delta$ and simply decide whether two vertices are in the same cluster based on their distance; in the case where $\Delta$ is unknown, we need one more step. For example, one possible implementation is as follows: create a minimum spanning tree according to the distances under $\rho$, then remove the $(k-1)$ heaviest edges, resulting in $k$ connected components, and output these components as clusters. Following [Vu18], we denote by $\mathtt{ClusterByDistance}$ an algorithm that recovers the partition from a clear-cut representation. One natural representation is obtained by SVD as follows. Let $\widehat{G}\in\{0,1\}^{V\times V}$ be the adjacency matrix of the input graph, and let $P_{\widehat{G}_k}$ be the orthogonal projection matrix onto the space spanned by the first $k$ eigenvectors of $\widehat{G}$. Then set $\rho(u)\stackrel{\mathsf{def}}{=}P_{\widehat{G}_k}\widehat{G}_u$, where $\widehat{G}_u$ is the column indexed by $u\in V$. This yields Algorithm 1, the vanilla-SVD algorithm.

Input: adjacency matrix $\widehat{G}\in\{0,1\}^{V\times V}$
Output: a partition of $V$

  1. Compute $\rho(u)\stackrel{\mathsf{def}}{=}P_{\widehat{G}_k}\widehat{G}_u$ for each $u\in V$.

  2. Run $\mathtt{ClusterByDistance}$ with representation $\rho$.

Algorithm 1: Vanilla-SVD algorithm for graph clustering
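To make the two steps concrete, here is a self-contained NumPy sketch (our illustration, not the authors' code): `vanilla_svd_rho` computes step 1, and `cluster_by_distance` implements the MST-based variant of $\mathtt{ClusterByDistance}$ described above. All parameter values are ad hoc, chosen so the representation is comfortably clear-cut.

```python
import numpy as np

def vanilla_svd_rho(A, k):
    """Step 1: rho(u) = P_{A_k} A_u, the projection of column u onto the
    span of the first k eigenvectors of A."""
    vals, vecs = np.linalg.eigh(A)
    top = vecs[:, np.argsort(vals)[::-1][:k]]
    return top @ (top.T @ A)                 # column u is rho(u)

def cluster_by_distance(R, k):
    """ClusterByDistance (MST variant): build a minimum spanning tree on the
    pairwise distances between columns of R, drop the k-1 heaviest MST edges,
    and return the connected components as cluster labels."""
    n = R.shape[1]
    D = np.linalg.norm(R[:, :, None] - R[:, None, :], axis=0)
    # Prim's algorithm for the MST.
    in_tree = np.zeros(n, dtype=bool); in_tree[0] = True
    best, parent, edges = D[0].copy(), np.zeros(n, dtype=int), []
    for _ in range(n - 1):
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((parent[v], v, best[v]))
        in_tree[v] = True
        closer = (~in_tree) & (D[v] < best)
        parent[closer], best[closer] = v, D[v][closer]
    edges.sort(key=lambda e: e[2])
    # Union-find over the n-k lightest MST edges; the rest are cut.
    root = list(range(n))
    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]; x = root[x]
        return x
    for a, b, _ in edges[: n - k]:
        root[find(a)] = find(b)
    return np.array([find(i) for i in range(n)])

# Demo on a strongly separated SSBM instance (illustrative parameters).
rng = np.random.default_rng(1)
n, k, p, q = 90, 3, 0.9, 0.05
labels = rng.permutation(np.arange(n) % k)
probs = np.where(labels[:, None] == labels[None, :], p, q)
U = np.triu(rng.random((n, n)) < probs, 1)
A = (U | U.T).astype(float)

pred = cluster_by_distance(vanilla_svd_rho(A, k), k)
truth = {frozenset(np.flatnonzero(labels == c)) for c in range(k)}
found = {frozenset(np.flatnonzero(pred == r)) for r in set(pred)}
print(truth == found)   # exact recovery up to relabeling on this instance
```

With these strong parameters the within-cluster distances are far below the between-cluster distances, so single-linkage cutting of the MST recovers the hidden partition exactly.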

1.3 Comparison with Existing Analysis for Vanilla Spectral Algorithms in the SBM

To the best of our knowledge, there are very few works on the analysis of vanilla spectral algorithms [AFWZ20, EBW18, PPV+19], and all of them apply only to the case of $k=O(1)$. In this work, we obtain the first analysis for general parameters $n,k,p,q$ in the symmetric SBM setting.

Davis-Kahan approaches

To study spectral algorithms in signal-plus-noise models, a key step is to understand how random noise perturbs the eigenvectors of a matrix. A commonly used technical ingredient is the Davis-Kahan $\sin\Theta$ theorem (or one of its variants). However, this type of approach faces two challenges in the SBM.

  • Davis-Kahan gives worst-case perturbation bounds. For perturbations caused by random noise, as in signal-plus-noise models, the Davis-Kahan $\sin\Theta$ theorem is sometimes suboptimal.

  • These $\sin\Theta$ theorems only lead to bounds in the $2$-norm. However, the SBM analysis may require $(2\to\infty)$-norm bounds. See [CTP19] for more discussion.

Previous works such as [AFWZ20, EBW18, PPV+19] mainly followed this approach. They proposed some novel ideas to (partially) address these two challenges, but their results only apply to the case of $k=O(1)$. In contrast, our approach, following the power-iteration-based analysis proposed by [MZ22b], completely avoids the Davis-Kahan $\sin\Theta$ theorem and can handle the case of $k=\omega(1)$.

Comparison with [MZ22b]

Inspired by power iteration methods, Mukherjee and Zhang [MZ22b] proposed a new approach to analyzing the perturbation of random matrices. The idea is to approximate the eigenvectors of a matrix by its powers; in fact, this method has been widely used in practice as a fast algorithm for approximating eigenvectors. However, [MZ22b] has two limitations.

  • Their analysis requires the mean matrix to have a nice structure, namely that all large eigenvalues are more or less the same.

  • Their algorithm is not vanilla, as it has a 'centering step'. Moreover, their algorithm requires knowledge of the parameters $p,q,k$; in particular, the centering step alone requires knowledge of $q$. In comparison, we only need to know $k$; further, we can also guess $k$ (by checking the number of large eigenvalues) and thereby make Algorithm 1 fully parameter-free.
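The 'guess $k$' idea above can be sketched as follows (our illustration; the cutoff $2\sqrt{n}$ and all parameters are ad hoc choices). The signal eigenvalues of $\widehat{G}$ scale like $(p-q)n/k$, while the noise spectrum stays at $O(\sqrt{n})$, so counting eigenvalues above a cutoff in between recovers $k$.

```python
import numpy as np

# Sample an SSBM instance whose k we pretend not to know (illustrative parameters).
rng = np.random.default_rng(2)
n, true_k, p, q = 300, 3, 0.6, 0.1
labels = rng.permutation(np.arange(n) % true_k)
probs = np.where(labels[:, None] == labels[None, :], p, q)
U = np.triu(rng.random((n, n)) < probs, 1)
A = (U | U.T).astype(float)

# Signal eigenvalues are around (p - q) * n / k = 50 or larger here, while the
# noise eigenvalues stay near 2 * sigma * sqrt(n) ~ 17; cut at 2 * sqrt(n).
vals = np.linalg.eigvalsh(A)
guessed_k = int(np.sum(vals > 2 * np.sqrt(n)))
print(guessed_k)
```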

To overcome these limitations, we introduce a novel 'polynomial approximation + entrywise analysis' method, which makes the analysis more robust and requires less structure. More details are discussed in Section 1.4.

Regarding parameters in Theorem 1.1

The difference between the parameters in our result and those in Vu's paper is that we replaced $\sigma$ by $\sqrt{p}\log^6 n$. If $p<0.9$, then $\sigma$ and $\sqrt{p}$ are equal up to a constant, so our bound is essentially the same as [Vu18] except for the $\log^6 n$ factor. We believe the extra $\log^6 n$ factor can be removed by future work: it stems from the new concentration inequality we use as a black box, and a refined analysis of that inequality may eliminate it. Here are two example settings of parameters that satisfy the conditions in Theorem 1.1:

  • $q=\Theta(\frac{\log n}{n})$ and $p=\Omega(\frac{k\log^6 n}{\sqrt{n}})$;

  • $p-q=\Theta(1)$ and $k=O(\frac{\sqrt{n}}{\log^6 n})$.

1.4 Proof Outline and Technical Contributions

Let $s_u$ denote the size of the cluster to which $u$ belongs. Assume for now that all $V_i$'s have size roughly $n/k$; indeed, this happens with high probability since the partition is uniformly sampled.

Our goal is to show that there exists some threshold $\Delta>0$ such that for every $u,v\in V$: if $u,v\in V_\ell$ for some $\ell$, then $\|P_{\widehat{G}_k}\widehat{G}_u-P_{\widehat{G}_k}\widehat{G}_v\|\leq\Delta/4$; otherwise, $\|P_{\widehat{G}_k}\widehat{G}_u-P_{\widehat{G}_k}\widehat{G}_v\|\geq\Delta$. Write $\varepsilon(u)\stackrel{\mathsf{def}}{=}\|P_{\widehat{G}_k}\widehat{G}_u-G_u\|$. Then $\left|\|P_{\widehat{G}_k}\widehat{G}_u-P_{\widehat{G}_k}\widehat{G}_v\|-\|G_v-G_u\|\right|\leq\varepsilon(u)+\varepsilon(v)$. Note that $\|G_v-G_u\|=0$ if $u,v\in V_\ell$ for some $\ell$; otherwise, $\|G_v-G_u\|=(p-q)\cdot\sqrt{s_u+s_v}>(p-q)\sqrt{n/k}$. Therefore, setting $\Delta=0.8(p-q)\sqrt{n/k}$, it suffices to show that $\varepsilon(u)\leq 0.1(p-q)\sqrt{n/k}$ for every $u\in V$.

We decompose $\varepsilon(u)$ into two terms:

$$\varepsilon(u)\leq\left\|P_{\widehat{G}_k}(\widehat{G}_u-G_u)\right\|+\left\|(P_{\widehat{G}_k}-I)G_u\right\|=\underbrace{\left\|P_{\widehat{G}_k}E_u\right\|}_{\text{noise term}}+\underbrace{\left\|(P_{\widehat{G}_k}-I)G_u\right\|}_{\text{deviation term}}. \quad (1)$$

We shall bound the two terms from above separately. Intuitively, a small noise term means that $P_{\widehat{G}_k}$ reduces the noise, while a small deviation term means that $P_{\widehat{G}_k}$ preserves the data.

Upper bound of the noise term

It is known that $P_{\widehat{G}_k}$ (resp., $P_{G_k}$) can be written as a polynomial of $\widehat{G}$ (resp., $G$). By Weyl's inequality, the eigenvalues of $\widehat{G}$ are not too far from those of $G$. Therefore, in our case, one can find a simple polynomial $\varphi$, depending only on $G$, such that $\varphi(\widehat{G})$ (resp., $\varphi(G)$) is a good approximation of $P_{\widehat{G}_k}$ (resp., $P_{G_k}$); this is formalized in Lemma 3.2. Then we have the decomposition $\|P_{\widehat{G}_k}E_u\|\leq 2\|\varphi(\widehat{G})E_u\|\leq 2\|\varphi(G)E_u\|+2\|(\varphi(\widehat{G})-\varphi(G))E_u\|$, where the first inequality follows from Lemma 3.2.

  1. The first term, $\|\varphi(G)E_u\|$, is small with high probability. To see this, we use Lemma 3.2 again: $\|\varphi(G)E_u\|\leq\frac{3}{2}\|P_{G_k}E_u\|$. By a known result (cf. Proposition 2.4), $\|P_{G_k}E_u\|$ is small with high probability, largely because the projection $P_{G_k}$ and the vector $E_u$ are independent.

  2. The second term is the tricky part, and we draw on an entrywise analysis: we study every entry of $(\varphi(\widehat{G})-\varphi(G))E_u$, using the new inequality from [MZ22b]. See Lemma 3.3 for more details.

The upper bound for the noise term is encapsulated in Lemma 3.4.

Upper bound of the deviation term

The following argument is reminiscent of [Vu18]. Say $u\in V_\ell$. Note that $G\chi_\ell=\sqrt{s_u}\cdot G_u$, where $\chi_\ell=\frac{1}{\sqrt{s_u}}\cdot 1_{V_\ell}$ is the normalized characteristic vector of $V_\ell$ (i.e., $1_{V_\ell}(v)=1\iff v\in V_\ell$). It follows that

$$\left\|(P_{\widehat{G}_k}-I)G\right\|_2\leq\left\|(P_{\widehat{G}_k}-I)\widehat{G}\right\|_2+\left\|(P_{\widehat{G}_k}-I)E\right\|_2\leq\left\|G-\widehat{G}\right\|_2+\left\|(P_{\widehat{G}_k}-I)E\right\|_2\leq 2\left\|E\right\|_2,$$

where the second inequality holds because $P_{\widehat{G}_k}\widehat{G}$ is the best rank-$k$ approximation of $\widehat{G}$ and $\mathrm{rank}(G)=k$, and the third inequality uses $\|P_{\widehat{G}_k}-I\|_2\leq 1$, as $P_{\widehat{G}_k}$ is a projection matrix. Therefore,

$$\left\|(P_{\widehat{G}_k}-I)G_u\right\|=\frac{1}{\sqrt{s_u}}\left\|(P_{\widehat{G}_k}-I)G\chi_\ell\right\|\leq\frac{1}{\sqrt{s_u}}\left\|(P_{\widehat{G}_k}-I)G\right\|_2\leq\frac{2\left\|E\right\|_2}{\sqrt{s_u}}. \quad (2)$$

A typical result in random matrix theory (cf. Proposition 2.3) states that, with high probability, $\|E\|_2=O(\sqrt{n})$. Combining Equation 2 with $s_u\approx n/k$, we get $\|(P_{\widehat{G}_k}-I)G_u\|=O(\sqrt{k})$. And by our assumption on $n$, we have $\sqrt{k}=o((p-q)\sqrt{n/k})=o(\Delta)$.
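The inequality chain behind Equation 2 is deterministic: it uses only the triangle inequality, the Eckart-Young theorem, and $\mathrm{rank}(G)=k$. It can therefore be verified directly on any instance; the following is our own NumPy sketch with illustrative parameters (the projection is taken onto the top-$k$ eigenvectors by absolute eigenvalue, so that $P\widehat{G}$ is a best rank-$k$ approximation in spectral norm).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, p, q = 200, 4, 0.6, 0.1
labels = rng.permutation(np.arange(n) % k)
G = np.where(labels[:, None] == labels[None, :], p, q)   # rank-k mean matrix
U = np.triu(rng.random((n, n)) < G)
G_hat = (U | np.triu(U, 1).T).astype(float)
E = G_hat - G

# Projection onto the top-k eigenvectors of G_hat (by absolute eigenvalue).
vals, vecs = np.linalg.eigh(G_hat)
V = vecs[:, np.argsort(np.abs(vals))[::-1][:k]]
P = V @ V.T

spec = lambda M: np.linalg.norm(M, 2)                    # spectral norm
lhs = spec((P - np.eye(n)) @ G)
print(lhs <= 2 * spec(E) + 1e-9)          # True: ||(P - I) G||_2 <= 2 ||E||_2

# Column version of Equation 2: ||(P - I) G_u|| <= ||(P - I) G||_2 / sqrt(s_u).
s = np.bincount(labels)[labels]           # s_u for every vertex u
cols = np.linalg.norm((P - np.eye(n)) @ G, axis=0)
print(np.all(cols * np.sqrt(s) <= lhs + 1e-9))           # True
```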

Technical contribution

The major novelty of our analysis is the use of the polynomial $\varphi$. [MZ22b] used a centering step to make the mean matrix nicely structured, whereas our analysis uses polynomial approximation to address this issue. Another difference is that in [MZ22b], the centering step appears explicitly in the algorithm; by contrast, our polynomial approximation appears only in the analysis, so the algorithm stays vanilla.

As a byproduct, we developed new techniques for studying eigenspace perturbation, a central topic in random matrix theory. Our high-level idea is 'polynomial approximation + entrywise analysis': we reduce the analysis of eigenspace perturbation to the analysis of a simple matrix polynomial under perturbation, for which more tools are available.

1.5 Discussion and Future Directions

In this paper, we studied the behavior of vanilla-SVD in the SSBM, a benchmark signal-plus-noise model widely studied in random matrix theory, and showed that vanilla-SVD indeed filters noise in the SSBM. In fact, our analysis technique, 'polynomial approximation + entrywise analysis', is not limited to the SSBM. A direct and interesting question yet to be answered is: can our method be extended to prove that vanilla-SVD works in the general SBM, where partitions are not uniformly sampled and edges appear with different probabilities? Moreover, our method may be useful for analyzing other realistic probabilistic models such as the factor model, which has been widely used in economics and modern portfolio theory.

In the long term, it would be very interesting to understand the behavior of vanilla spectral algorithms on real data: 1) why do they succeed in some applications? 2) how could we fix them when they fail in other cases? A deeper understanding of vanilla spectral algorithms will provide guidelines for using them in many machine learning tasks.

2 Preliminaries

Notations

Let $\mathbf{1}_n$ denote the $n$-dimensional all-ones vector, and let $J_n$ be the $n\times n$ all-ones matrix. Let $s_u$ denote the size of the cluster to which $u$ belongs. For a matrix $A$, $A[i]$ denotes the row of $A$ indexed by $i$, and $A_i$ denotes the column indexed by $i$; $\lambda_i(A)$ is the $i$-th largest eigenvalue of $A$; $P_{A_k}$ denotes the orthogonal projection matrix onto the space spanned by the first $k$ eigenvectors of $A$. For a vector $x\in\mathbb{R}^n$, $\|x\|\stackrel{\mathsf{def}}{=}\sqrt{x_1^2+\cdots+x_n^2}$ denotes the Euclidean norm.

Definition 2.1 (Matrix operator norms).

Let $A\in\mathbb{R}^{n\times n}$. Define $\|A\|_2\stackrel{\mathsf{def}}{=}\max_{\|x\|=1}\|Ax\|$ and $\|A\|_{2\to\infty}\stackrel{\mathsf{def}}{=}\max_{\|x\|=1}\|Ax\|_\infty$.

Proposition 2.1 (e.g., [CTP19]).

For all matrices $A,B\in\mathbb{R}^{n\times n}$, it holds that (1) $\|A\|_{2\to\infty}=\max_{i\in[n]}\|A[i]\|$; (2) $\|AB\|_{2\to\infty}\leq\|A\|_{2\to\infty}\|B\|_2$.

Proposition 2.2 (Weyl’s inequality).

For all symmetric $A,E\in\mathbb{R}^{n\times n}$, we have $|\lambda_i(A)-\lambda_i(A+E)|\leq\|E\|_2$.

Proposition 2.3 (Norm of a random matrix [Vu18]).

There is a constant $C_0>0$ such that the following holds. Let $E$ be a symmetric matrix whose upper-triangular entries $e_{ij}$ are independent random variables taking value $1-p_{ij}$ with probability $p_{ij}$ and $-p_{ij}$ with probability $1-p_{ij}$, where $p_{ij}\in[0,1]$. Let $\sigma^2:=\max_{ij}\{p_{ij}(1-p_{ij})\}$. If $\sigma^2\geq C_0\log n/n$, then $\mathbf{Pr}[\|E\|_2\geq C_0\sigma n^{1/2}]\leq n^{-3}$.

Proposition 2.4 (Projection of a random vector, Lemma 2.1 in [Vu18]).

There exists a constant $C_1$ such that the following holds. Let $X=(\xi_1,\ldots,\xi_n)$ be a random vector in $\mathbb{R}^n$ whose coordinates $\xi_i$ are independent random variables with mean 0 and variance at most $\sigma^2\leq 1$. Assume furthermore that the $\xi_i$ are bounded by 1 in absolute value. Let $H$ be a subspace of dimension $d$ and let $\Pi_H X$ be the length of the orthogonal projection of $X$ onto $H$. Then $\mathbf{Pr}\left[\Pi_H X\geq\sigma\sqrt{d}+C_1\sqrt{\log n}\right]\leq n^{-3}$.

Proposition 2.5.

For $a\in[0,2]$ and $r\in\mathbb{N}$, if $|a-1|\leq\delta<\frac{1}{2r}$, then $|a^r-1|\leq 2r\delta$.
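The proposition is stated without proof; here is a short sketch (our own filling-in, not from the paper) via the factorization of $a^r-1$:

```latex
a^r - 1 = (a-1)\sum_{j=0}^{r-1} a^j,
\qquad
0 \le \sum_{j=0}^{r-1} a^j \le r(1+\delta)^{r-1} \le r\,e^{(r-1)\delta} \le r\,e^{1/2} < 2r,
```

since $0\le a\le 1+\delta$ and $(r-1)\delta<\tfrac{1}{2}$; hence $|a^r-1|\le |a-1|\cdot 2r\le 2r\delta$.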

3 Analysis of Vanilla SVD Algorithm

Write $s_i\stackrel{\mathsf{def}}{=}|V_i|$. We say the partition $V_1,\dots,V_k$ is balanced if $\left(1-\frac{1}{16\log n}\right)\frac{n}{k}\leq s_i\leq\left(1+\frac{1}{16\log n}\right)\frac{n}{k}$ for all $i\in[k]$. By the Chernoff bound, the partition $V_1,\dots,V_k$ is balanced with probability at least $1-n^{-1}$; hence, we assume that the partition is balanced in the following argument. Since $\sigma^2\geq C\log n/n$, the event $\|E\|_2=O(\sqrt{n})$ holds with high probability (see Proposition 2.3).

Recall the decomposition into deviation and noise terms in Equation 1. We first state our upper bound on the deviation term, which readily follows from the argument in Section 1.4; the complete proof is in Appendix B.

Lemma 3.1 (Upper bound of deviation term).

Let $C_0$ be the constant in Proposition 2.3. If the partition is balanced and $n\geq 10^4\cdot C_0^2\frac{k^2\sigma^2}{(p-q)^2}$, then with probability at least $1-n^{-3}$ we have $\|(P_{\widehat{G}_k}-I)G_u\|\leq 0.04(p-q)\sqrt{n/k}$ for all $u\in V$.

Sections 3.1 and 3.2 lead to an upper bound on the noise term, and Section 3.3 contains the proof of the main theorem.

3.1 An Approximation of $P_{G_k}$ and $P_{\widehat{G}_k}$

To give some intuition for the choice of $\varphi$, we first analyze the spectrum of $G$; the result is summarized in Theorem 3.1.

The eigenvalues of $G$

Note that $G=H+q\mathbf{1}_n\mathbf{1}_n^\top$, where

$$H=\begin{pmatrix}(p-q)J_{s_1}&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&(p-q)J_{s_2}&\cdots&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&(p-q)J_{s_k}\end{pmatrix}.$$

Without loss of generality, assume that $s_1\geq s_2\geq\cdots\geq s_k$. It is easy to see that the eigenvalues of $H$ are $(p-q)s_1,\dots,(p-q)s_k$ and $0$. Viewing $G$ as a rank-one perturbation of $H$, we have the following theorem characterizing the eigenvalues of $G$. Its proof, given in Appendix C, readily follows from a theorem in [BNS79], which studies eigenvalues under rank-one perturbation.

Theorem 3.1.

Write $s_i\stackrel{\mathsf{def}}{=}|V_i|$ and assume that $s_1\geq s_2\geq\cdots\geq s_k$. Define $\delta_i\stackrel{\mathsf{def}}{=}\lambda_i(G)-(p-q)s_i$. Then (1) $\delta_i\geq 0$ and $\sum_{i=1}^k\delta_i=nq$; (2) $\lambda_1(G)\geq nq+(p-q)\frac{n}{k}$, and hence $\sum_{i=2}^k\delta_i\leq(p-q)(s_1-\frac{n}{k})$.

The choice of the polynomial $\varphi$

Let $\mu\stackrel{\mathsf{def}}{=}(p-q)\frac{n}{k}$, and let $\psi(t)$ be the quadratic polynomial with $\psi(\lambda_1(G))=\psi(\mu)=1$ and $\psi(0)=0$, i.e.,

$$\psi(t)\stackrel{\mathsf{def}}{=}-\frac{1}{\lambda_1(G)\mu}(t-\lambda_1(G))(t-\mu)+1=At^2+Bt,\quad\text{where } A=-\frac{1}{\lambda_1(G)\mu},\ B=\frac{1}{\lambda_1(G)}+\frac{1}{\mu}.$$

Finally, let $\varphi(t)\stackrel{\mathsf{def}}{=}(\psi(t))^r$, where $r\stackrel{\mathsf{def}}{=}\log n$.
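A quick numerical illustration of this choice (our sketch; the values of $\lambda_1(G)$, $\mu$, and $r$ are placeholders): $\varphi$ maps $\lambda_1(G)$ and $\mu$ to $1$, maps $0$ to $0$, and crushes eigenvalues at the noise scale.

```python
import numpy as np

# Placeholder spectral data for illustration: lambda_1 > mu >> noise scale.
lam1, mu, r = 70.0, 40.0, 8
A = -1.0 / (lam1 * mu)
B = 1.0 / lam1 + 1.0 / mu

psi = lambda t: A * t**2 + B * t      # psi(lam1) = psi(mu) = 1, psi(0) = 0
phi = lambda t: psi(t) ** r

print(abs(phi(lam1) - 1) < 1e-9)      # True: top eigenvalue preserved
print(abs(phi(mu) - 1) < 1e-9)        # True: cluster eigenvalues preserved
print(phi(0.0) == 0.0)                # True
print(phi(17.0) < 0.02)               # True: a noise-scale eigenvalue is crushed
```

Raising $\psi$ to the power $r$ is what drives $\varphi$ toward $0$ on the noise part of the spectrum while keeping it near $1$ at the signal eigenvalues.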

Here we give some intuition for the choice of $\varphi$. Let $\widehat{G}=\sum_{i=1}^n\widehat{\lambda}_i v_i v_i^\top$ be the spectral decomposition of $\widehat{G}$. Then $\varphi(\widehat{G})=\sum_{i=1}^n\varphi(\widehat{\lambda}_i)v_i v_i^\top$ and $P_{\widehat{G}_k}=\sum_{i=1}^k v_i v_i^\top$. The spectral decomposition of $\varphi(\widehat{G})-P_{\widehat{G}_k}$ is $\varphi(\widehat{G})-P_{\widehat{G}_k}=\sum_{i=1}^k(\varphi(\widehat{\lambda}_i)-1)v_i v_i^\top+\sum_{i=k+1}^n\varphi(\widehat{\lambda}_i)v_i v_i^\top$. Hence,

‖φ​(G^)βˆ’PG^kβ€–2=max⁑{|φ​(Ξ»^1)βˆ’1|,…,|φ​(Ξ»^k)βˆ’1|,|φ​(Ξ»^k+1)|,…,|φ​(Ξ»^n)|}.\displaystyle\left\|\varphi(\widehat{G})-P_{\widehat{G}_{k}}\right\|_{2}=\max\{|\varphi(\widehat{\lambda}_{1})-1|,\dots,|\varphi(\widehat{\lambda}_{k})-1|,|\varphi(\widehat{\lambda}_{k+1})|,\dots,|\varphi(\widehat{\lambda}_{n})|\}. (3)

Recall that $\widehat{\lambda}_{i}-\lambda_{i}(G)$ is bounded by Weyl's inequality. Moreover, when the partition is balanced, Theorem 3.1 shows that the eigenvalues of $G$ are nicely distributed: except for $\lambda_{1}(G)$, all other eigenvalues are close to $\mu$. Hence, our choice of $\varphi$ makes $\left\|\varphi(\widehat{G})-P_{\widehat{G}_{k}}\right\|_{2}$ small, and thus $\varphi(\widehat{G})$ is a good approximation of $P_{\widehat{G}_{k}}$. Formally, we have the following lemma.
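The identity in Equation (3) is an instance of the spectral mapping property of symmetric matrices. The following sketch verifies it on random test data (the matrix and the polynomial are arbitrary assumptions, not tied to the SBM).

```python
import numpy as np

# Random symmetric test matrix and an arbitrary polynomial phi.
rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.standard_normal((n, n))
Ghat = (X + X.T) / 2
lam, V = np.linalg.eigh(Ghat)
lam, V = lam[::-1], V[:, ::-1]                 # descending eigenvalues

phi = lambda t: (0.3 * t ** 2 + 0.1 * t) ** 2  # arbitrary polynomial
phi_G = V @ np.diag(phi(lam)) @ V.T            # phi applied to Ghat
P_k = V[:, :k] @ V[:, :k].T                    # top-k spectral projector

lhs = np.linalg.norm(phi_G - P_k, 2)           # spectral norm
rhs = max(np.abs(phi(lam[:k]) - 1).max(), np.abs(phi(lam[k:])).max())
```

The two sides agree up to floating-point error, matching Equation (3).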

Lemma 3.2 (Polynomial approximation).

Assume that the partition is balanced and nβ‰₯104β‹…C02β‹…k2β‹…pβ‹…log⁑n(pβˆ’q)2n\geq 10^{4}\cdot C_{0}^{2}\cdot\frac{k^{2}\cdot p\cdot\log n}{(p-q)^{2}}, where C0C_{0} is the constant in PropositionΒ 2.3. Then with probability at least 1βˆ’nβˆ’31-n^{-3}, it holds that for all xβˆˆβ„nx\in\mathbb{R}^{n}, 12​‖PG^k​x‖≀‖φ​(G^)​x‖≀32​‖PG^k​xβ€–+β€–xβ€–/nlog⁑log⁑n,\frac{1}{2}\left\|P_{\widehat{G}_{k}}x\right\|\leq\left\|\varphi(\widehat{G})x\right\|\leq\frac{3}{2}\left\|P_{\widehat{G}_{k}}x\right\|+\left\|x\right\|/n^{\log\log n}, and 12​‖PGk​x‖≀‖φ​(G)​x‖≀32​‖PGk​xβ€–.\frac{1}{2}\left\|P_{G_{k}}x\right\|\leq\left\|\varphi(G)x\right\|\leq\frac{3}{2}\left\|P_{G_{k}}x\right\|.

Proof.

Let G=βˆ‘i=1kΞ»i​ui​ui⊀G=\sum_{i=1}^{k}\lambda_{i}u_{i}u_{i}^{\top}(resp.,Β G^=βˆ‘i=1nΞ»^i​vi​vi⊀\widehat{G}=\sum_{i=1}^{n}\widehat{\lambda}_{i}v_{i}v_{i}^{\top}) Β be the spectral decomposition of GG (resp.,Β G^\widehat{G}). We shall use the following claim.

Claim 3.1.

The following holds with probability 1βˆ’nβˆ’31-n^{-3} (over the choice of EE): for every i∈[k]i\in[k], |φ​(Ξ»^i)βˆ’1|<12,|φ​(Ξ»i)βˆ’1|<12\left|\varphi(\widehat{\lambda}_{i})-1\right|<\frac{1}{2},\left|\varphi(\lambda_{i})-1\right|<\frac{1}{2}; and for every i=k+1,…,ni=k+1,\dots,n, |φ​(Ξ»i^)|<nβˆ’log⁑log⁑n\left|\varphi(\widehat{\lambda_{i}})\right|<n^{-\log\log n}.

Fix xβˆˆβ„nx\in\mathbb{R}^{n}. On the one hand, 12≀φ​(Ξ»^i)≀32,βˆ€i∈[k]\frac{1}{2}\leq\varphi(\widehat{\lambda}_{i})\leq\frac{3}{2},\forall i\in[k], and hence

‖φ​(G^)​xβ€–2=βˆ‘i=1nφ​(Ξ»^i)2β€‹βŸ¨x,vi⟩2β‰₯βˆ‘i=1kφ​(Ξ»^i)2β€‹βŸ¨x,vi⟩2β‰₯βˆ‘i=1k14β€‹βŸ¨x,vi⟩2=14​‖PG^k​xβ€–2,\left\|\varphi(\widehat{G})x\right\|^{2}=\sum_{i=1}^{n}\varphi(\widehat{\lambda}_{i})^{2}\langle x,v_{i}\rangle^{2}\geq\sum_{i=1}^{k}\varphi(\widehat{\lambda}_{i})^{2}\langle x,v_{i}\rangle^{2}\geq\sum_{i=1}^{k}\frac{1}{4}\langle x,v_{i}\rangle^{2}=\frac{1}{4}\left\|P_{\widehat{G}_{k}}x\right\|^{2},

which means ‖φ​(G^)​xβ€–β‰₯12​‖PG^k​xβ€–\left\|\varphi(\widehat{G})x\right\|\geq\frac{1}{2}\left\|P_{\widehat{G}_{k}}x\right\|. On the other hand,

‖φ​(G^)​xβ€–2=βˆ‘i=1nφ​(Ξ»^i)2β€‹βŸ¨x,vi⟩2β‰€βˆ‘i=1k(32)2β€‹βŸ¨x,vi⟩2+βˆ‘i=k+1n⟨x,vi⟩2n2​log⁑log⁑n≀94​‖PG^k​xβ€–2+β€–xβ€–2n2​log⁑log⁑n.\left\|\varphi(\widehat{G})x\right\|^{2}=\sum_{i=1}^{n}\varphi(\widehat{\lambda}_{i})^{2}\langle x,v_{i}\rangle^{2}\leq\sum_{i=1}^{k}\left(\frac{3}{2}\right)^{2}\langle x,v_{i}\rangle^{2}+\sum_{i=k+1}^{n}\frac{\langle x,v_{i}\rangle^{2}}{n^{2\log\log n}}\leq\frac{9}{4}\left\|P_{\widehat{G}_{k}}x\right\|^{2}+\frac{\left\|x\right\|^{2}}{n^{2\log\log n}}.

Since a+b≀a+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, we have ‖φ​(G^)​x‖≀32​‖PG^k​xβ€–+β€–xβ€–nlog⁑log⁑n\left\|\varphi(\widehat{G})x\right\|\leq\frac{3}{2}\left\|P_{\widehat{G}_{k}}x\right\|+\frac{\left\|x\right\|}{n^{\log\log n}}. This establishes the first part.

Note that $\left\|\varphi(G)x\right\|=\sqrt{\sum_{i=1}^{k}\varphi(\lambda_{i})^{2}\langle x,u_{i}\rangle^{2}}$ and $\frac{1}{2}\leq\varphi(\lambda_{i})\leq\frac{3}{2}$ for all $i\in[k]$, so a similar argument applies to $G$. This finishes the proof of Lemma 3.2.

It remains to prove ClaimΒ 3.1. The claim readily follows from the choice of Ο†\varphi and the fact that Ξ»i,Ξ»^i\lambda_{i},\widehat{\lambda}_{i} are close. A complete proof can be found in AppendixΒ C. ∎

3.2 The Upper Bound of the Noise Term

According to Equation 1, in order to derive an upper bound of $\left\|P_{\widehat{G}_{k}}E_{u}\right\|$, it remains to bound $\left\|\left(\varphi(\widehat{G})-\varphi(G)\right)E_{u}\right\|$ from above. This is done by the following lemma.

Lemma 3.3.

Let C0C_{0} be the constant in PropositionΒ 2.3. Assume that the partition is balanced and nβ‰₯(100+C0)2β‹…k2β‹…pβ‹…log12⁑n(pβˆ’q)2n\geq(100+C_{0})^{2}\cdot\frac{k^{2}\cdot p\cdot\log^{12}n}{(p-q)^{2}}. For every u∈Vu\in V, it holds that

$\operatorname*{\mathbf{Pr}}_{E}\left[\left\|\left(\varphi(\widehat{G})-\varphi(G)\right)E_{u}\right\|\leq C_{2}\left(\sqrt{kp}\log^{2}n+\frac{1}{\log n}\right)\right]\geq 1-O(n^{-2}),$

where C2=𝖽𝖾𝖿7β‹…106C_{2}\stackrel{{\scriptstyle\mathsf{def}}}{{=}}7\cdot 10^{6} is a constant.

Combining LemmaΒ 3.2 and PropositionΒ 2.4, we get an upper bound of the noise term:

Lemma 3.4 (Upper bound of noise term).

Let C0C_{0} be the constant in PropositionΒ 2.3. Assume that nβ‰₯(100+C0)2β‹…k2β‹…pβ‹…log12⁑n(pβˆ’q)2n\geq(100+C_{0})^{2}\cdot\frac{k^{2}\cdot p\cdot\log^{12}n}{(p-q)^{2}}. Then with probability at least 1βˆ’O​(nβˆ’1)1-O(n^{-1}), we have β€–PG^k​Eu‖≀C3​(k​p​log2⁑n+log⁑n)\left\|P_{\widehat{G}_{k}}E_{u}\right\|\leq C_{3}(\sqrt{kp}\log^{2}n+\sqrt{\log n}) for all u∈Vu\in V, where C3C_{3} is a constant.

The proof of LemmaΒ 3.3 is deferred to SectionΒ 4. We use it to prove LemmaΒ 3.4 here.

Proof of LemmaΒ 3.4.

It follows from LemmaΒ 3.2 that

β€–PG^k​Eu‖≀2​‖φ​(G^)​Euβ€–\displaystyle\left\|P_{\widehat{G}_{k}}E_{u}\right\|\leq 2\left\|\varphi(\widehat{G})E_{u}\right\| ≀2​‖(φ​(G^)βˆ’Ο†β€‹(G))​Euβ€–+2​‖φ​(G)​Euβ€–\displaystyle\leq 2\left\|\left(\varphi(\widehat{G})-\varphi(G)\right)E_{u}\right\|+2\left\|\varphi(G)E_{u}\right\|
≀2​‖(φ​(G^)βˆ’Ο†β€‹(G))​Euβ€–+3​‖PGk​Euβ€–.\displaystyle\leq 2\left\|\left(\varphi(\widehat{G})-\varphi(G)\right)E_{u}\right\|+3\left\|P_{G_{k}}E_{u}\right\|.

By Proposition 2.4, with probability at least $1-n^{-1}$, $\left\|P_{G_{k}}E_{u}\right\|$ is bounded by $\sigma\sqrt{k}+C_{1}\sqrt{\log n}$, where $C_{1}$ is a universal constant. Meanwhile, by Lemma 3.3 and a union bound over all $u$, with probability at least $1-O(n^{-1})$, $\left\|(\varphi(\widehat{G})-\varphi(G))E_{u}\right\|\leq 7\cdot 10^{6}(\sqrt{kp}\log^{2}n+1/\log n)$ for every $u\in V$. Therefore, with probability $1-O(n^{-1})$, it holds that $\left\|P_{\widehat{G}_{k}}E_{u}\right\|\leq 1.4\times 10^{7}\sqrt{kp}\log^{2}n+3\sigma\sqrt{k}+3C_{1}\sqrt{\log n}$ for all $u\in V$. Setting $C_{3}\stackrel{\mathsf{def}}{=}(1.4\times 10^{7}+3+3C_{1})$, we have the desired result. ∎

3.3 Putting It Together

Now we are well-equipped to prove TheoremΒ 1.1.

Proof of TheoremΒ 1.1.

Let C=𝖽𝖾𝖿(100+100​C0+100​C3)2C\stackrel{{\scriptstyle\mathsf{def}}}{{=}}(100+100C_{0}+100C_{3})^{2}, where C0,C3C_{0},C_{3} are the constants in PropositionΒ 2.3 and LemmaΒ 3.4. By our assumption on nn, we have (pβˆ’q)​n/k>100​C3​(k​p​log6⁑n+log⁑n).(p-q)\sqrt{n/k}>100C_{3}(\sqrt{kp}\log^{6}n+\sqrt{\log n}). It is easy to verify nn satisfies the conditions in LemmaΒ 3.4 and LemmaΒ 3.1.

Write Ξ”=𝖽𝖾𝖿0.8​(pβˆ’q)​n/k\Delta\stackrel{{\scriptstyle\mathsf{def}}}{{=}}0.8(p-q)\sqrt{n/k}. We aim to show that for every u,v∈Vu,v\in V: if u,v∈Vβ„“u,v\in V_{\ell} for some β„“\ell, then β€–PG^k​G^uβˆ’PG^k​G^v‖≀Δ/4\left\|P_{\widehat{G}_{k}}\widehat{G}_{u}-P_{\widehat{G}_{k}}\widehat{G}_{v}\right\|\leq\Delta/4; otherwise, β€–PG^k​G^uβˆ’PG^k​G^vβ€–β‰₯Ξ”\left\|P_{\widehat{G}_{k}}\widehat{G}_{u}-P_{\widehat{G}_{k}}\widehat{G}_{v}\right\|\geq\Delta. Then by calling π™²πš•πšžπšœπšπšŽπš›π™±πš’π™³πš’πšœπšπšŠπš—πšŒπšŽ\mathtt{ClusterByDistance}, AlgorithmΒ 1 recovers all large clusters correctly.

Let Ρ​(u)=𝖽𝖾𝖿‖PG^k​G^uβˆ’Guβ€–\varepsilon(u)\stackrel{{\scriptstyle\mathsf{def}}}{{=}}\left\|P_{\widehat{G}_{k}}\widehat{G}_{u}-G_{u}\right\|. According to the argument in SectionΒ 1.4, it suffices to show that Ρ​(u)≀0.1​(pβˆ’q)​n/k\varepsilon(u)\leq 0.1(p-q)\sqrt{n/k} for all u∈Vu\in V. We further decompose Ρ​(u)\varepsilon(u) into noise term and deviation term, i.e., Ρ​(u)β‰€π—‡π—ˆπ—‚π—Œπ–Ύβ€‹(u)+𝖽𝖾𝗏​(u)\varepsilon(u)\leq\mathsf{noise}(u)+\mathsf{dev}(u), where π—‡π—ˆπ—‚π—Œπ–Ύβ€‹(u)=𝖽𝖾𝖿‖PG^​Euβ€–\mathsf{noise}(u)\stackrel{{\scriptstyle\mathsf{def}}}{{=}}\left\|P_{\widehat{G}}E_{u}\right\| and 𝖽𝖾𝗏​(u)=𝖽𝖾𝖿‖(PG^βˆ’I)​Guβ€–\mathsf{dev}(u)\stackrel{{\scriptstyle\mathsf{def}}}{{=}}\left\|(P_{\widehat{G}}-I)G_{u}\right\|. By LemmaΒ 3.4 and LemmaΒ 3.1, with probability at least 1βˆ’O​(nβˆ’1)1-O(n^{-1}), the following hold for all u∈Vu\in V: (1) π—‡π—ˆπ—‚π—Œπ–Ύβ€‹(u)≀C3​(k​p​log2⁑n+log⁑n)≀0.01​(pβˆ’q)​n/k\mathsf{noise}(u)\leq C_{3}(\sqrt{kp}\log^{2}n+\sqrt{\log n})\leq 0.01(p-q)\sqrt{n/k}; (2) 𝖽𝖾𝗏​(u)≀0.04​(pβˆ’q)​n/k\mathsf{dev}(u)\leq 0.04(p-q)\sqrt{n/k}. Therefore, with probability at least 1βˆ’O​(nβˆ’1)1-O(n^{-1}), we indeed have Ρ​(u)≀0.1​(pβˆ’q)​n/k,βˆ€u∈V\varepsilon(u)\leq 0.1(p-q)\sqrt{n/k},\forall u\in V. This completes the proof. ∎

4 Proof of LemmaΒ 3.3: Entrywise Analysis

This section is dedicated to proving LemmaΒ 3.3.

Since both (φ​(G^)βˆ’Ο†β€‹(G))(\varphi(\widehat{G})-\varphi(G)) and EE are symmetric, we have β€–(φ​(G^)βˆ’Ο†β€‹(G))​Eu‖≀‖E​(φ​(G^)βˆ’Ο†β€‹(G))β€–2β†’βˆž.\left\|(\varphi(\widehat{G})-\varphi(G))E_{u}\right\|\leq\left\|E(\varphi(\widehat{G})-\varphi(G))\right\|_{2\to\infty}. The high-level idea is to write E​(φ​(G^)βˆ’Ο†β€‹(G))E(\varphi(\widehat{G})-\varphi(G)) as a sum of matrices, where each matrix is of the form Et​S​QE^{t}SQ such that β€–Qβ€–2=O​(1)\left\|Q\right\|_{2}=O(1). This way, we have β€–Et​S​Qβ€–2β†’βˆžβ‰€β€–Et​Sβ€–2β†’βˆžβ‹…O​(1)\left\|E^{t}SQ\right\|_{2\to\infty}\leq\left\|E^{t}S\right\|_{2\to\infty}\cdot O(1), and β€–Et​Sβ€–2β†’βˆž\left\|E^{t}S\right\|_{2\to\infty} is bounded by a lemma from [MZ22b].
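The first step above rests on a simple fact: for symmetric $M$ and $E$, $ME=(EM)^{\top}$, so every column $ME_{u}$ is a row of $EM$ and its norm is at most $\|EM\|_{2\to\infty}$ (the largest Euclidean row norm). A sketch on random symmetric test matrices:

```python
import numpy as np

# Random symmetric stand-ins for phi(Ghat) - phi(G) and the noise E.
rng = np.random.default_rng(5)
n = 7
M = rng.standard_normal((n, n)); M = M + M.T
E = rng.standard_normal((n, n)); E = E + E.T

col_norms = np.linalg.norm(M @ E, axis=0)           # ||M E_u|| over columns u
two_to_inf = np.linalg.norm(E @ M, axis=1).max()    # ||E M||_{2->inf}
```

Since $ME=(EM)^{\top}$ exactly, the largest column norm of $ME$ equals $\|EM\|_{2\to\infty}$.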

Let D=π–½π–Ύπ–ΏΟˆβ€‹(G^)βˆ’Οˆβ€‹(G)=A​(E​G+G​E+E2)+B​ED\stackrel{{\scriptstyle\mathsf{def}}}{{=}}\psi(\widehat{G})-\psi(G)=A(EG+GE+E^{2})+BE and write F=π–½π–Ύπ–ΏΟˆβ€‹(G),F^=π–½π–Ύπ–ΏΟˆβ€‹(G^)F\stackrel{{\scriptstyle\mathsf{def}}}{{=}}\psi(G),\widehat{F}\stackrel{{\scriptstyle\mathsf{def}}}{{=}}\psi(\widehat{G}). Then

φ​(G^)βˆ’Ο†β€‹(G)=Οˆβ€‹(G^)rβˆ’Οˆβ€‹(G)r\displaystyle\varphi(\widehat{G})-\varphi(G)=\psi(\widehat{G})^{r}-\psi(G)^{r} =(F+D)rβˆ’Fr\displaystyle=(F+D)^{r}-F^{r}
=Frβˆ’1​D+Frβˆ’2​D​F^+β‹―+F​D​F^rβˆ’2⏟=𝖽𝖾𝖿M+D​F^rβˆ’1,\displaystyle=\underbrace{F^{r-1}D+F^{r-2}D\widehat{F}+\cdots+FD\widehat{F}^{r-2}}_{\stackrel{{\scriptstyle\mathsf{def}}}{{=}}M}+D\widehat{F}^{r-1},

where the last step decomposes the difference according to the first position of $D$ in each product term. Furthermore,

D​F^rβˆ’1=D​(D+F)rβˆ’1=Dr+D​F​F^rβˆ’2+D2​F​F^rβˆ’3+β‹―+Drβˆ’1​F⏟=𝖽𝖾𝖿Mβ€².\displaystyle D\widehat{F}^{r-1}=D(D+F)^{r-1}=D^{r}+\underbrace{DF\widehat{F}^{r-2}+D^{2}F\widehat{F}^{r-3}+\cdots+D^{r-1}F}_{\stackrel{{\scriptstyle\mathsf{def}}}{{=}}M^{\prime}}.

That is, E​(φ​(G^)βˆ’Ο†β€‹(G))=E​M+E​Dr+E​Mβ€²E(\varphi(\widehat{G})-\varphi(G))=EM+ED^{r}+EM^{\prime}. We bound the three terms respectively.

We first list some definitions and estimates for the quantities involved.

  β€’

    According to PropositionΒ 2.3, with probability at least 1βˆ’nβˆ’31-n^{-3}, we have β€–Eβ€–2≀C0​σ​n,\left\|E\right\|_{2}\leq C_{0}\sigma\sqrt{n}, where C0C_{0} is a constant. In the following argument, we always assume this holds.

  β€’

    ΞΌ=𝖽𝖾𝖿(pβˆ’q)​n/k\mu\stackrel{{\scriptstyle\mathsf{def}}}{{=}}(p-q)n/k. By our assumption on nn, we have ΞΌβ‰₯(100+C0)​n​p​log6⁑n\mu\geq(100+C_{0})\sqrt{np}\log^{6}n.

  β€’

    A=βˆ’1Ξ»1​(G)​μ,B=(1Ξ»1​(G)+1ΞΌ),r=log⁑nA=-\frac{1}{\lambda_{1}(G)\mu},B=(\frac{1}{\lambda_{1}(G)}+\frac{1}{\mu}),r=\log n; Ξ»1​(G)>ΞΌ\lambda_{1}(G)>\mu, and thus B≀2ΞΌ,|A|≀1ΞΌ2B\leq\frac{2}{\mu},|A|\leq\frac{1}{\mu^{2}}.

  β€’

    By ClaimΒ 3.1, β€–Fβ€–2≀1+14​log⁑n,β€–F^β€–2≀1+14​log⁑n\left\|F\right\|_{2}\leq 1+\frac{1}{4\log n},\left\|\widehat{F}\right\|_{2}\leq 1+\frac{1}{4\log n}. By PropositionΒ 2.5, β€–Fβ€–2t,β€–F^β€–2t≀2,βˆ€t≀log⁑n\left\|F\right\|_{2}^{t},\left\|\widehat{F}\right\|_{2}^{t}\leq 2,\forall t\leq\log n.

Upper bound of β€–E​Mβ€–2β†’βˆž\left\|EM\right\|_{2\to\infty}

Note that $\left\|EF^{t}D\right\|_{2\to\infty}\leq\left\|EF\right\|_{2\to\infty}\left\|F\right\|_{2}^{t-1}\left\|D\right\|_{2}$, and $\left\|F\right\|_{2}^{t-1}\leq 2$ for all $t\leq r$. Moreover, $\left\|D\right\|_{2}\leq|A|(2\left\|E\right\|_{2}\left\|G\right\|_{2}+\left\|E\right\|_{2}^{2})+B\left\|E\right\|_{2}\leq 3\frac{\left\|E\right\|_{2}}{\mu}+\frac{\left\|E\right\|_{2}^{2}}{\mu^{2}}\leq 4\frac{\left\|E\right\|_{2}}{\mu}\leq 4(\log^{6}n)^{-1}<\frac{1}{\log^{3}n}$. The following lemma gives an upper bound of $\left\|EF\right\|_{2\to\infty}$.

Lemma 4.1.

𝐏𝐫E[β€–E​Fβ€–2β†’βˆžβ‰€10​(k​p​log⁑n+log⁑n)]β‰₯1βˆ’2​nβˆ’2\operatorname*{\mathbf{Pr}}_{E}\left[\left\|EF\right\|_{2\to\infty}\leq 10(\sqrt{kp\log n}+\sqrt{\log n})\right]\geq 1-2n^{-2}.

Therefore, by union bound, we have the following holds with probability at least 1βˆ’nβˆ’11-n^{-1}:

β€–E​Mβ€–2β†’βˆžβ‰€rβ‹…10​(k​p​log⁑n+log⁑n)β‹…2β‹…1log3⁑n≀40​(k​p+1)log⁑n.\displaystyle\left\|EM\right\|_{2\to\infty}\leq r\cdot 10(\sqrt{kp\log n}+\sqrt{\log n})\cdot 2\cdot\frac{1}{\log^{3}n}\leq\frac{40(\sqrt{kp}+1)}{\log n}. (4)

Upper bound of β€–E​Drβ€–2β†’βˆž\left\|ED^{r}\right\|_{2\to\infty}

Since β€–Dβ€–2<1log3⁑n\left\|D\right\|_{2}<\frac{1}{\log^{3}n}, we have

β€–E​Drβ€–2β†’βˆžβ‰€β€–Eβ€–2β†’βˆžβ€‹β€–Dβ€–2r≀nβ‹…(log3⁑n)βˆ’log⁑n<1n.\displaystyle\left\|ED^{r}\right\|_{2\to\infty}\leq\left\|E\right\|_{2\to\infty}\left\|D\right\|_{2}^{r}\leq\sqrt{n}\cdot(\log^{3}n)^{-\log n}<\frac{1}{n}. (5)
Lemma 4.2 (Upper bound of β€–E​Mβ€²β€–2β†’βˆž\left\|EM^{\prime}\right\|_{2\to\infty}).

With probability 1βˆ’O​(nβˆ’2)1-O(n^{-2}) (over the choice of EE), we have β€–E​Mβ€²β€–2β†’βˆžβ‰€6​C2​k​p​log2⁑n\left\|EM^{\prime}\right\|_{2\to\infty}\leq 6C_{2}\sqrt{kp}\log^{2}n, where C2=106C_{2}=10^{6} is a constant.

Finally, combining EquationΒ 4, EquationΒ 5, and the above lemma, we conclude that with probability at least 1βˆ’O​(nβˆ’2)1-O(n^{-2}),

β€–E​(φ​(G^)βˆ’Ο†β€‹(G))β€–2β†’βˆžβ‰€40​(k​p+1)log⁑n+1n+6​C2​k​p​log2⁑n≀7​C2​(k​p​log2⁑n+1log⁑n).\left\|E(\varphi(\widehat{G})-\varphi(G))\right\|_{2\to\infty}\leq\frac{40(\sqrt{kp}+1)}{\log n}+\frac{1}{n}+6C_{2}\sqrt{kp}\log^{2}n\leq 7C_{2}(\sqrt{kp}\log^{2}n+\frac{1}{\log n}).

This establishes LemmaΒ 3.3.

Proofs of LemmaΒ 4.1 and LemmaΒ 4.2 are deferred to AppendixΒ D.

Acknowledgments and Disclosure of Funding

We are grateful to anonymous NeurIPS reviewers for their helpful comments. This research is supported by NSF CAREER award 2141536.

References

  • [AFWZ20] Emmanuel Abbe, Jianqing Fan, Kaizheng Wang, and Yiqiao Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. Ann. Statist., 48(3):1452–1474, 2020.
  • [ARS+04] Paolo Antonelli, HEΒ Revercomb, LAΒ Sromovsky, WLΒ Smith, ROΒ Knuteson, DCΒ Tobin, RKΒ Garcia, HBΒ Howell, H-L Huang, and FAΒ Best. A principal component noise filter for high spectral resolution infrared measurements. Journal of Geophysical Research: Atmospheres, 109(D23), 2004.
  • [BNS79] JamesΒ R. Bunch, ChristopherΒ P. Nielsen, and DannyΒ C. Sorensen. Rank-one modification of the symmetric eigenproblem. Numer. Math., 31(1):31–48, 1978/79.
  • [Col19] Sam Cole. Recovering nonuniform planted partitions via iterated projection. Linear Algebra and its Applications, 576:79–107, 2019.
  • [CTP19] Joshua Cape, Minh Tang, and CareyΒ E Priebe. The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. The Annals of Statistics, 47(5):2405–2439, 2019.
  β€’ [EBW18] Justin Eldridge, Mikhail Belkin, and Yusu Wang. Unperturbed: spectral analysis beyond Davis-Kahan. In Algorithmic Learning Theory 2018, volumeΒ 83 of Proc. Mach. Learn. Res. (PMLR), pageΒ 38. PMLR, 2018.
  • [GM05] Joachim Giesen and Dieter Mitsche. Reconstructing many partitions using spectral techniques. In International Symposium on Fundamentals of Computation Theory, pages 433–444. Springer, 2005.
  • [HLL83] PaulΒ W Holland, KathrynΒ Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109–137, 1983.
  • [KAH19] VladimirΒ Yu Kiselev, TallulahΒ S Andrews, and Martin Hemberg. Challenges in unsupervised clustering of single-cell rna-seq data. Nature Reviews Genetics, 20(5):273–282, 2019.
  • [LR15] Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.
  • [LZK22] Boris Landa, ThomasΒ TCK Zhang, and Yuval Kluger. Biwhitening reveals the rank of a count matrix. SIAM Journal on Mathematics of Data Science, 4(4):1420–1446, 2022.
  • [McS01] Frank McSherry. Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001), pages 529–537. IEEE Computer Soc., Los Alamitos, CA, 2001.
  • [MZ22a] ChandraΒ Sekhar Mukherjee and Jiapeng Zhang. Compressibility: Power of pca in clustering problems beyond dimensionality reduction. arXiv preprint arXiv:2204.10888, 2022.
  • [MZ22b] ChandraΒ Sekhar Mukherjee and Jiapeng Zhang. Detecting hidden communities by power iterations with connections to vanilla spectral algorithms, 2022.
  • [PPV+19] CareyΒ E Priebe, Youngser Park, JoshuaΒ T Vogelstein, JohnΒ M Conroy, Vince Lyzinski, Minh Tang, Avanti Athreya, Joshua Cape, and Eric Bridgeford. On a two-truths phenomenon in spectral graph clustering. Proceedings of the National Academy of Sciences, 116(13):5995–6000, 2019.
  • [SEK04] Michael Steinbach, Levent ErtΓΆz, and Vipin Kumar. The challenges of clustering high dimensional data. In New directions in statistical physics, pages 273–309. Springer, 2004.
  • [Vu18] Van Vu. A simple SVD algorithm for finding hidden partitions. Combin. Probab. Comput., 27(1):124–140, 2018.
  • [ZLWZ09] Lei Zhang, Rastislav Lukac, Xiaolin Wu, and David Zhang. Pca-based spatially adaptive denoising of cfa images for single-sensor digital cameras. IEEE transactions on image processing, 18(4):797–812, 2009.

Appendix A Useful inequalities

Proposition A.1 (Chernoff bound).

Let $X_{1},\ldots,X_{m}$ be i.i.d.\ random variables taking values in $\{0,1\}$, with $\operatorname*{\mathbf{E}}[X_{i}]\leq p$ for $1\leq i\leq m$. Then it holds that

$\operatorname*{\mathbf{Pr}}\left[\left|\sum_{i=1}^{m}X_{i}-mp\right|\geq t\right]\leq\exp\left(-\frac{3t^{2}}{mp}\right).$
Proposition A.2 (Hoeffding bound).

Let $X_{1},\dots,X_{m}$ be independent random variables such that $a_{i}\leq X_{i}\leq b_{i}$, and write $S\stackrel{\mathsf{def}}{=}\sum_{i=1}^{m}X_{i}$. Then it holds that

𝐏𝐫[|Sβˆ’π„[S]|>t]≀2​exp⁑(βˆ’2​t2βˆ‘i=1m(biβˆ’ai)2).\operatorname*{\mathbf{Pr}}\left[\left|S-\operatorname*{\mathbf{E}}\left[S\right]\right|>t\right]\leq 2\exp\left(-\frac{2t^{2}}{\sum_{i=1}^{m}(b_{i}-a_{i})^{2}}\right).
Definition A.1.

Let $X$ be a Bernoulli random variable with parameter $p$, i.e., $\operatorname*{\mathbf{Pr}}\left[X=1\right]=p$ and $\operatorname*{\mathbf{Pr}}\left[X=0\right]=1-p$. The random variable $Y\stackrel{\mathsf{def}}{=}X-p=X-\operatorname*{\mathbf{E}}\left[X\right]$ is called a centered Bernoulli random variable with parameter $p$.

Proposition A.3 (Adapted from [MZ22b]).

Let Sβˆˆβ„nΓ—nS\in\mathbb{R}^{n\times n}, and let E=(ΞΎi​j)E=(\xi_{ij}) be an nΓ—nn\times n symmetric random matrix, where

{ΞΎi​j:1≀i≀j≀n}\left\{\xi_{ij}:1\leq i\leq j\leq n\right\}

are independent, centered Bernoulli random variables with parameter at most Ξ±\alpha for all i,ji,j. Suppose that every entry of SS takes value in [βˆ’Ξ²,Ξ²][-\beta,\beta], and each column of SS has at most Ξ³\gamma non-zero entries. Then for every t∈[log⁑n]t\in[\log n], it holds that

𝐏𝐫[|(Et​S)i​j|>(log⁑n)5​t​Ct]=O​(nβˆ’4),βˆ€i,j∈[n],\operatorname*{\mathbf{Pr}}\left[\left|(E^{t}S)_{ij}\right|>(\log n)^{5t}C_{t}\right]=O(n^{-4}),\forall i,j\in[n],

where

Ct=𝖽𝖾𝖿500​β​α​γ⋅(100​n​α)tβˆ’1.C_{t}\stackrel{{\scriptstyle\mathsf{def}}}{{=}}500\beta\sqrt{\alpha}\sqrt{\gamma}\cdot\left(100\sqrt{n\alpha}\right)^{t-1}.

By union bound,

𝐏𝐫[β€–Et​Sβ€–2β†’βˆž>n​(log⁑n)5​t​Ct]=O​(nβˆ’2).\operatorname*{\mathbf{Pr}}\left[\left\|E^{t}S\right\|_{2\to\infty}>\sqrt{n}(\log n)^{5t}C_{t}\right]=O(n^{-2}).
Remark A.1.

The parameter $\alpha$ is determined by $E$, and equals $p$ in our case. The above bound is particularly useful when $\beta,\gamma$ are small, that is, when the matrix $S$ has either small entries or sparse columns.

Proposition A.4 (PropositionΒ 2.5 restated).

For a∈[0,2]a\in[0,2] and rβˆˆβ„•r\in\mathbb{N}, if |aβˆ’1|≀δ<12​r\left|a-1\right|\leq\delta<\frac{1}{2r}, then |arβˆ’1|≀2​r​δ\left|a^{r}-1\right|\leq 2r\delta.

Proof.

Let x=aβˆ’1∈[βˆ’Ξ΄,Ξ΄]x=a-1\in[-\delta,\delta]. If 0≀a≀10\leq a\leq 1, we have 1β‰₯ar=(1+x)rβ‰₯1+r​xβ‰₯1βˆ’r​δ1\geq a^{r}=(1+x)^{r}\geq 1+rx\geq 1-r\delta. If 1<a<1+1/r1<a<1+1/r, then 0<x<1/r0<x<1/r and hence

1≀ar=(1+x)r=βˆ‘i=0r(ri)​xiβ‰€βˆ‘i=0rri​xi<βˆ‘i=0∞(r​x)i=11βˆ’r​x=1+r​x1βˆ’r​x≀1+2​r​δ.1\leq a^{r}=(1+x)^{r}=\sum_{i=0}^{r}{r\choose i}x^{i}\leq\sum_{i=0}^{r}r^{i}x^{i}<\sum_{i=0}^{\infty}(rx)^{i}=\frac{1}{1-rx}=1+\frac{rx}{1-rx}\leq 1+2r\delta.

∎
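A grid check of Proposition A.4: under $|a-1|\leq\delta<\frac{1}{2r}$, the deviation of $a^{r}$ from $1$ stays within $2r\delta$ (the choice $r=20$ and the grid resolution are arbitrary).

```python
import numpy as np

# r = 20 is an arbitrary choice; delta is just below 1/(2r).
r = 20
delta = 1.0 / (2 * r) - 1e-3
grid = np.linspace(1 - delta, 1 + delta, 1001)
worst = np.abs(grid ** r - 1.0).max()
```

The worst deviation over the grid is attained at the endpoints $1\pm\delta$.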

Appendix B Bounding the Deviation Term

Proof of LemmaΒ 3.1.

Our assumption on nn implies that (pβˆ’q)​n/k>100​C0​σ​n(p-q)n/k>100C_{0}\sigma\sqrt{n}. By PropositionΒ 2.3, with probability at least 1βˆ’nβˆ’31-n^{-3}, we have

β€–Eβ€–2≀C0​σ​n≀0.01​(pβˆ’q)​n/k.\left\|E\right\|_{2}\leq C_{0}\sigma\sqrt{n}\leq 0.01(p-q)n/k.

According to EquationΒ 2 and suβ‰₯n2​ks_{u}\geq\frac{n}{2k}, we have

$\displaystyle\left\|(P_{\widehat{G}}-I)G_{u}\right\|\leq\frac{2\left\|E\right\|_{2}}{\sqrt{s_{u}}}\leq 0.04(p-q)\sqrt{n/k}.$

∎

Appendix C Polynomial Approximation

The proof of Theorem 3.1 relies on the following result on rank-one perturbation.

Proposition C.1 (Eigenvalues under rank-one perturbation, Theorem 1 in [BNS79]).

Let C=D+ρ​z​zTC=D+\rho zz^{T}, where DD is diagonal, β€–zβ€–2=1\left\|z\right\|_{2}=1. Let d1β‰₯d2β‰₯β‹―β‰₯dnd_{1}\geq d_{2}\geq\cdots\geq d_{n} be the eigenvalues of DD, and let d~1β‰₯d~2β‰₯β‹―β‰₯d~n\widetilde{d}_{1}\geq\widetilde{d}_{2}\geq\cdots\geq\widetilde{d}_{n} be the eigenvalues of CC. Then

d~i=di+ρ​μi,1≀i≀n,\widetilde{d}_{i}=d_{i}+\rho\mu_{i},\quad 1\leq i\leq n,

where βˆ‘i=1nΞΌi=1\sum_{i=1}^{n}\mu_{i}=1 and 0≀μi≀10\leq\mu_{i}\leq 1.
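Proposition C.1 is easy to check numerically: the sketch below builds a random instance (the dimension and $\rho=2.5$ are arbitrary) and recovers the weights $\mu_{i}$ from the perturbed spectrum.

```python
import numpy as np

# Random instance; rho > 0 and the dimension are arbitrary choices.
rng = np.random.default_rng(4)
n, rho = 6, 2.5
d = np.sort(rng.standard_normal(n))[::-1]    # d_1 >= ... >= d_n
z = rng.standard_normal(n)
z /= np.linalg.norm(z)                       # ||z||_2 = 1

C = np.diag(d) + rho * np.outer(z, z)
d_tilde = np.sort(np.linalg.eigvalsh(C))[::-1]
mu = (d_tilde - d) / rho                     # the weights mu_i
```

The weights are nonnegative, at most $1$, and sum to $1$ (the sum follows from the trace identity $\sum_{i}\widetilde{d}_{i}=\sum_{i}d_{i}+\rho$).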

Proof of TheoremΒ 3.1.

Let Ο‡i∈{0,1}V\chi_{i}\in\left\{0,1\right\}^{V} be the indicator vector for ViV_{i}, i.e.,Β Ο‡i​(u)=1\chi_{i}(u)=1 iff u∈Viu\in V_{i}. It is easy to see that the eigenvectors of HH are 1s1​χ1,…,1sk​χk\frac{1}{\sqrt{s_{1}}}\chi_{1},\dots,\frac{1}{\sqrt{s_{k}}}\chi_{k}. Write U=(1s1​χ1,…,1sk​χk)βˆˆβ„VΓ—VU=\left(\frac{1}{\sqrt{s_{1}}}\chi_{1},\dots,\frac{1}{\sqrt{s_{k}}}\chi_{k}\right)\in\mathbb{R}^{V\times V}, D=diag​((pβˆ’q)​s1,…,(pβˆ’q)​sk,0,…,0)D=\mathrm{diag}((p-q)s_{1},\dots,(p-q)s_{k},0,\dots,0), then we have H=U​D​U⊀H=UDU^{\top}. Note that 𝟏n=U​(s1,…,sn)⊀\mathbf{1}_{n}=U(\sqrt{s_{1}},\dots,\sqrt{s_{n}})^{\top}, and hence

G=H+qβ€‹πŸnβ€‹πŸn⊀=U​(D+ρ​z​z⊀)​U⊀,G=H+q\mathbf{1}_{n}\mathbf{1}_{n}^{\top}=U(D+\rho zz^{\top})U^{\top},

where $\rho=nq$ and $z=\frac{1}{\sqrt{n}}(\sqrt{s_{1}},\dots,\sqrt{s_{k}},0,\dots,0)^{\top}$. This means the eigenvalues of $G$ are the same as those of $D+\rho zz^{\top}$. Since $\left\|z\right\|=1$, Item 1 follows directly from Proposition C.1. To see Item 2, we use the Rayleigh quotient characterization of the largest eigenvalue:

Ξ»1​(G)=maxv⁑vβŠ€β€‹G​vβ€–vβ€–2β‰₯𝟏nβŠ€β€‹Gβ€‹πŸnn=βˆ‘u,v∈XGu​vn\displaystyle\lambda_{1}(G)=\max_{v}\frac{v^{\top}Gv}{\left\|v\right\|^{2}}\geq\frac{\mathbf{1}_{n}^{\top}G\mathbf{1}_{n}}{n}=\frac{\sum_{u,v\in X}G_{uv}}{n} =n2​q+(pβˆ’q)β‹…(s12+β‹―+sk2)n\displaystyle=\frac{n^{2}q+(p-q)\cdot(s_{1}^{2}+\cdots+s_{k}^{2})}{n}
β‰₯n​q+(pβˆ’q)​nk.\displaystyle\geq nq+(p-q)\frac{n}{k}.

where the last inequality follows from $\frac{n}{k}=\frac{1}{k}\sum_{i=1}^{k}s_{i}\leq\sqrt{\sum_{i=1}^{k}s_{i}^{2}/k}$ (Cauchy–Schwarz). ∎

Proof of ClaimΒ 3.1.

The assumption on nn in LemmaΒ 3.2 implies that ΞΌ=(pβˆ’q)​n/kβ‰₯100​C0​σ​nβ‹…log⁑n\mu=(p-q)n/k\geq 100C_{0}\sigma\sqrt{n}\cdot\log n. By Weyl’s inequality, we have

|Ξ»^iβˆ’Ξ»i|≀‖Eβ€–2≀C0​σ​n≀μ100​log⁑n,βˆ€i∈[n].\displaystyle\left|\widehat{\lambda}_{i}-\lambda_{i}\right|\leq\left\|E\right\|_{2}\leq C_{0}\sigma\sqrt{n}\leq\frac{\mu}{100\log n},\forall i\in[n].

Meanwhile, by TheoremΒ 3.1,

|Ξ»iβˆ’ΞΌ|≀|Ξ»i​(G)βˆ’(pβˆ’q)​si|+|(pβˆ’q)​siβˆ’ΞΌ|≀(pβˆ’q)​n/k16​log⁑n+(pβˆ’q)​n/k16​log⁑n≀μ8​log⁑n.\displaystyle\left|\lambda_{i}-\mu\right|\leq\left|\lambda_{i}(G)-(p-q)s_{i}\right|+\left|(p-q)s_{i}-\mu\right|\leq\frac{(p-q)n/k}{16\log n}+\frac{(p-q)n/k}{16\log n}\leq\frac{\mu}{8\log n}.

for $i=2,3,\dots,k$. Hence, writing $\varepsilon\stackrel{\mathsf{def}}{=}\frac{\mu}{6\log n}$, we have

  1.

    |Ξ»^1βˆ’Ξ»1|≀Ρ\left|\widehat{\lambda}_{1}-\lambda_{1}\right|\leq\varepsilon;

  2.

    Ξ»2,…,Ξ»k,Ξ»^2,…,Ξ»^k∈[ΞΌβˆ’Ξ΅,ΞΌ+Ξ΅]\lambda_{2},\dots,\lambda_{k},\widehat{\lambda}_{2},\dots,\widehat{\lambda}_{k}\in[\mu-\varepsilon,\mu+\varepsilon];

  3.

    for every iβ‰₯k+1i\geq k+1, |Ξ»i^|≀Ρ\left|\widehat{\lambda_{i}}\right|\leq\varepsilon.

First, Οˆβ€‹(Ξ»1)=1\psi(\lambda_{1})=1 according to the definition of ψ\psi, and hence φ​(Ξ»1)=1\varphi(\lambda_{1})=1. As for Ξ»^1\widehat{\lambda}_{1},

$\left|\psi(\widehat{\lambda}_{1})-1\right|=\left|\psi(\widehat{\lambda}_{1})-\psi(\lambda_{1})\right|\leq|A|\varepsilon^{2}+|2A\lambda_{1}+B|\varepsilon$ (by definition of $\psi$)
≀Ρ2Ξ»1​μ+Ρμ\displaystyle\leq\frac{\varepsilon^{2}}{\lambda_{1}\mu}+\frac{\varepsilon}{\mu} (since 2AΞ»1+B=1Ξ»1βˆ’1ΞΌβ‰₯βˆ’1ΞΌ)2A\lambda_{1}+B=\frac{1}{\lambda_{1}}-\frac{1}{\mu}\geq-\frac{1}{\mu})
≀136​log2⁑n+16​log⁑n\displaystyle\leq\frac{1}{36\log^{2}n}+\frac{1}{6\log n} (as Ρμ=16​log⁑n\frac{\varepsilon}{\mu}=\frac{1}{6\log n})
<14​log⁑n.\displaystyle<\frac{1}{4\log n}.

Consequently, |φ​(Ξ»^1)βˆ’1|<2​r4​log⁑n≀1/2|\varphi(\widehat{\lambda}_{1})-1|<\frac{2r}{4\log n}\leq 1/2 by PropositionΒ 2.5.

Next, for a∈{Ξ»2,…,Ξ»k,Ξ»^2,…,Ξ»^k}a\in\left\{\lambda_{2},\dots,\lambda_{k},\widehat{\lambda}_{2},\dots,\widehat{\lambda}_{k}\right\}, the argument is similar:

|Οˆβ€‹(a)βˆ’1|=|Οˆβ€‹(a)βˆ’Οˆβ€‹(ΞΌ)|≀|A|​Ρ2+|2​A​μ+B|​Ρ≀Ρ2Ξ»1​μ+Ρμ<14​log⁑n,\displaystyle\left|\psi(a)-1\right|=\left|\psi(a)-\psi(\mu)\right|\leq|A|\varepsilon^{2}+|2A\mu+B|\varepsilon\leq\frac{\varepsilon^{2}}{\lambda_{1}\mu}+\frac{\varepsilon}{\mu}<\frac{1}{4\log n},

where the second inequality follows from 2​A​μ+B=1ΞΌβˆ’1Ξ»1≀1ΞΌ2A\mu+B=\frac{1}{\mu}-\frac{1}{\lambda_{1}}\leq\frac{1}{\mu}. This yields |φ​(a)βˆ’1|≀1/2\left|\varphi(a)-1\right|\leq 1/2 by PropositionΒ 2.5.

Finally, for iβ‰₯k+1i\geq k+1, it holds that

$\left|\psi(\widehat{\lambda}_{i})\right|\leq|A|\varepsilon^{2}+B\varepsilon\leq\frac{\varepsilon^{2}}{\lambda_{1}\mu}+\frac{2\varepsilon}{\mu}<\frac{1}{2\log n},$

which means $\left|\varphi(\widehat{\lambda}_{i})\right|=\left|\psi(\widehat{\lambda}_{i})\right|^{r}<\left(\frac{1}{2\log n}\right)^{\log n}<n^{-\log\log n}$. ∎

Appendix D Bounding the Noise Term

Proof of LemmaΒ 4.1.

The lemma readily follows from the following entrywise bound and Chernoff bound.

Claim D.1 (Entries of Οˆβ€‹(G)\psi(G)).

For every u,v∈Xu,v\in X, if u,v∈Vβ„“u,v\in V_{\ell} for some β„“\ell, then 0≀Fu​v≀5​kn0\leq F_{uv}\leq\frac{5k}{n}; otherwise, |Fu​v|≀10n\left|F_{uv}\right|\leq\frac{10}{n}.

We decompose $F=F^{\prime}+F^{\prime\prime}$, where $F^{\prime}$ is the intra-cluster part, i.e., $F^{\prime}_{uv}=F_{uv}$ if $u,v\in V_{\ell}$ for some $\ell$, and $F^{\prime}_{uv}=0$ otherwise. For every column $F^{\prime}_{v}$, the non-zero entries are identical and at most $5k/n$ by the above claim. Hence, every entry of $EF^{\prime}$ equals the sum of at most $2n/k$ independent, centered Bernoulli variables with parameter $p$, scaled by some factor at most $\frac{5k}{n}$. By the Chernoff bound, $\operatorname*{\mathbf{Pr}}_{E}\left[|(EF^{\prime})_{uv}|>10\sqrt{kp\log n/n}\right]\leq n^{-4}$ for all $u,v\in V$, and thus $\operatorname*{\mathbf{Pr}}_{E}\left[\left\|EF^{\prime}\right\|_{2\to\infty}\leq 10\sqrt{kp\log n}\right]\geq 1-n^{-2}$ by a union bound. Analogously, by the Hoeffding bound, $\operatorname*{\mathbf{Pr}}_{E}\left[\left\|EF^{\prime\prime}\right\|_{2\to\infty}\leq 10\sqrt{\log n}\right]\geq 1-n^{-2}$. Since $\left\|EF\right\|_{2\to\infty}\leq\left\|EF^{\prime}\right\|_{2\to\infty}+\left\|EF^{\prime\prime}\right\|_{2\to\infty}$, the lemma follows from the above two inequalities and a union bound. ∎

Proof of ClaimΒ D.1.

Write $\lambda=\lambda_{1}(G)$ and recall that (i) $(p-q)s_{u}\leq 2\mu$ for all $u\in V$; (ii) $nq+\mu\leq\lambda\leq nq+(p-q)s_{1}<nq+2\mu$; (iii) $\lambda>p\cdot\mu$. Assume that $u,v\in V_{\ell}$ for some $\ell$. Then

Fu​v=A​GuβŠ€β€‹Gv+B​Gu​v\displaystyle F_{uv}=AG_{u}^{\top}G_{v}+BG_{uv} =βˆ’n​q2+(p2βˆ’q2)​suλ​μ+(1Ξ»+1ΞΌ)​p\displaystyle=-\frac{nq^{2}+(p^{2}-q^{2})s_{u}}{\lambda\mu}+\left(\frac{1}{\lambda}+\frac{1}{\mu}\right)p
=βˆ’n​q2βˆ’(p2βˆ’q2)​su+(pβˆ’q)​(Ξ»+ΞΌ)+q​(Ξ»+ΞΌ)λ​μ\displaystyle=\frac{-nq^{2}-(p^{2}-q^{2})s_{u}+(p-q)(\lambda+\mu)+q(\lambda+\mu)}{\lambda\mu}
=q​(ΞΌ+Ξ»βˆ’n​q)+(pβˆ’q)​(Ξ»+ΞΌβˆ’(p+q)​su)λ​μ.\displaystyle=\frac{q(\mu+\lambda-nq)+(p-q)(\lambda+\mu-(p+q)s_{u})}{\lambda\mu}.

Since Ξ»βˆ’n​qβ‰₯ΞΌ\lambda-nq\geq\mu, the numerator is at least

2​q​μ+(pβˆ’q)​(Ξ»+ΞΌβˆ’(p+q)​su)=(pβˆ’q)​(2​q​n/k+Ξ»+ΞΌβˆ’(p+q)​su).2q\mu+(p-q)(\lambda+\mu-(p+q)s_{u})=(p-q)\left(2qn/k+\lambda+\mu-(p+q)s_{u}\right).

Because su≀2​n/k,Ξ»β‰₯n​q+(pβˆ’q)​n/ks_{u}\leq 2n/k,\lambda\geq nq+(p-q)n/k, we have

2​q​n/k+Ξ»+ΞΌβˆ’(p+q)​su>2​q​n/k+n​q+(pβˆ’q)​2​n/kβˆ’(p+q)​2​n/k=(nβˆ’2​n/k)​qβ‰₯0,2qn/k+\lambda+\mu-(p+q)s_{u}>2qn/k+nq+(p-q)2n/k-(p+q)2n/k=(n-2n/k)q\geq 0,

which means $F_{uv}\geq 0$. Meanwhile,

Fu​v≀q​(ΞΌ+Ξ»βˆ’n​q)λ​μ+(pβˆ’q)​(Ξ»+ΞΌ)λ​μ,F_{uv}\leq\frac{q(\mu+\lambda-nq)}{\lambda\mu}+\frac{(p-q)(\lambda+\mu)}{\lambda\mu},

where the first term is at most $\frac{3q}{\lambda}\leq\frac{3}{n}$ by (ii), and the second term is at most $\frac{2(p-q)}{\mu}\leq\frac{2k}{n}$. Therefore, $|F_{uv}|\leq\frac{5k}{n}$.

For the second part, assume that u,vu,v are not in the same cluster. Then

Fu​v\displaystyle F_{uv} =A​GuβŠ€β€‹Gv+B​Gu​v=βˆ’n​q2+(p​qβˆ’q2)​(su+sv)λ​μ+(1Ξ»+1ΞΌ)​q.\displaystyle=AG_{u}^{\top}G_{v}+BG_{uv}=-\frac{nq^{2}+(pq-q^{2})(s_{u}+s_{v})}{\lambda\mu}+\left(\frac{1}{\lambda}+\frac{1}{\mu}\right)q.

Hence,

|Fu​v|≀|q​(Ξ»+ΞΌβˆ’n​q)λ​μ|+|q​(pβˆ’q)​(su+sv)λ​μ|.\left|F_{uv}\right|\leq\left|\frac{q(\lambda+\mu-nq)}{\lambda\mu}\right|+\left|\frac{q(p-q)(s_{u}+s_{v})}{\lambda\mu}\right|.

By (ii), the first term is at most 3​qΞ»<3n\frac{3q}{\lambda}<\frac{3}{n}; by (i), the second term is at most 4​qλ≀4n\frac{4q}{\lambda}\leq\frac{4}{n}; hence, |Fu​v|≀10n\left|F_{uv}\right|\leq\frac{10}{n}. ∎
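As a sanity check on the algebra in the two derivations above, both rewritings of Fu​vF_{uv} can be verified in exact rational arithmetic: they are polynomial identities in (n,p,q,su,sv,λ,μ)(n,p,q,s_{u},s_{v},\lambda,\mu), so any generic choice of values certifies them. The concrete numbers below are hypothetical and play no role beyond being generic.

```python
from fractions import Fraction as Fr

# Hypothetical parameter values (not from the paper); the identities hold
# for any positive choice, so exact equality here certifies the algebra.
n, p, q = Fr(1000), Fr(1, 2), Fr(1, 10)
s_u, s_v = Fr(100), Fr(80)
lam, mu = Fr(137), Fr(40)

# Intra-cluster entry: first and last lines of the first derivation.
intra_raw = -(n*q**2 + (p**2 - q**2)*s_u)/(lam*mu) + (Fr(1)/lam + Fr(1)/mu)*p
intra_final = (q*(mu + lam - n*q) + (p - q)*(lam + mu - (p + q)*s_u))/(lam*mu)
assert intra_raw == intra_final

# Cross-cluster entry and the two-term split used in the triangle inequality.
cross_raw = -(n*q**2 + (p*q - q**2)*(s_u + s_v))/(lam*mu) \
            + (Fr(1)/lam + Fr(1)/mu)*q
cross_split = (q*(lam + mu - n*q) - q*(p - q)*(s_u + s_v))/(lam*mu)
assert cross_raw == cross_split
```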

Upper bound of β€–E​Mβ€²β€–2β†’βˆž\left\|EM^{\prime}\right\|_{2\to\infty} (Proof of LemmaΒ 4.2)

Write L=𝖽𝖾𝖿​A​(E​G+G​E),R=𝖽𝖾𝖿​A​E2+B​EL\stackrel{{\scriptstyle\mathsf{def}}}{{=}}A(EG+GE),R\stackrel{{\scriptstyle\mathsf{def}}}{{=}}AE^{2}+BE. Then Dt​F=(L+R)t​F=Rt​F+Rt−1​L​F+Rt−2​L​D​F+⋯+R​L​Dt−2​F+L​Dt−1​FD^{t}F=(L+R)^{t}F=R^{t}F+R^{t-1}LF+R^{t-2}LDF+\cdots+RLD^{t-2}F+LD^{t-1}F. It suffices to derive a good upper bound on ‖Eη​L‖2→∞\left\|E^{\eta}L\right\|_{2\to\infty} and ‖Eη​F‖2→∞\left\|E^{\eta}F\right\|_{2\to\infty}, as RwR^{w} can be further expressed as a sum of powers of EE. This is done by the following lemma:

Lemma D.1.

The following holds with probability 1βˆ’O​(nβˆ’2)1-O(n^{-2}) over the choice of EE: for all η≀log⁑n\eta\leq\log n, it holds that

  •

    ‖Eη​L‖2→∞≤C2​k​p​(100​n​p)η−1​log5​η⁡n\left\|E^{\eta}L\right\|_{2\to\infty}\leq C_{2}\sqrt{kp}(100\sqrt{np})^{\eta-1}\log^{5\eta}n,

  •

    ‖Eη​F‖2→∞≤C2​k​p​(100​n​p)η−1​log5​η⁡n\left\|E^{\eta}F\right\|_{2\to\infty}\leq C_{2}\sqrt{kp}(100\sqrt{np})^{\eta-1}\log^{5\eta}n,

where C2=𝖽𝖾𝖿106C_{2}\stackrel{{\scriptstyle\mathsf{def}}}{{=}}10^{6} is an absolute constant.
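The telescoping expansion Dt​F=Rt​F+∑i=0t−1Ri​L​Dt−1−i​FD^{t}F=R^{t}F+\sum_{i=0}^{t-1}R^{i}LD^{t-1-i}F used above holds for arbitrary square matrices with D=L+RD=L+R, commuting or not. A minimal numeric sketch with random matrices (the dimension and power below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
m, t = 5, 4  # hypothetical small dimension and power
L_, R_, F_ = (rng.standard_normal((m, m)) for _ in range(3))
D_ = L_ + R_  # D = L + R

mp = np.linalg.matrix_power
lhs = mp(D_, t) @ F_
# Telescoping: D^t F = R^t F + sum_{i=0}^{t-1} R^i L D^{t-1-i} F.
rhs = mp(R_, t) @ F_ + sum(mp(R_, i) @ L_ @ mp(D_, t - 1 - i) @ F_
                           for i in range(t))
assert np.allclose(lhs, rhs)
```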

Granting Lemma D.1 for now, we complete the bound on ‖E​M′‖2→∞\left\|EM^{\prime}\right\|_{2\to\infty}. Specifically,

E​Dt​F\displaystyle ED^{t}F =E​Rt​F+βˆ‘i=0tβˆ’1E​Ri​L​Dtβˆ’1βˆ’i​F\displaystyle=ER^{t}F+\sum_{i=0}^{t-1}ER^{i}LD^{t-1-i}F
=Eβ€‹βˆ‘j=0t(tj)​Aj​Btβˆ’j​Ej+t​F⏟=𝖽𝖾𝖿Mt+βˆ‘i=0tβˆ’1Eβ€‹βˆ‘j=0i(ij)​Aj​Biβˆ’j​Ei+j​L​Dtβˆ’1βˆ’i​F⏟=𝖽𝖾𝖿Nt.\displaystyle=\underbrace{E\sum_{j=0}^{t}{t\choose j}A^{j}B^{t-j}E^{j+t}F}_{\stackrel{{\scriptstyle\mathsf{def}}}{{=}}M_{t}}\quad+\underbrace{\sum_{i=0}^{t-1}E\sum_{j=0}^{i}{i\choose j}A^{j}B^{i-j}E^{i+j}LD^{t-1-i}F}_{\stackrel{{\scriptstyle\mathsf{def}}}{{=}}N_{t}}.
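Here the inner sums come from expanding Ri=(A​E2+B​E)iR^{i}=(AE^{2}+BE)^{i} by the binomial theorem, which is valid because AA and BB are scalars, so A​E2AE^{2} and B​EBE commute. A numeric sketch with hypothetical scalar stand-ins and a random matrix:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
m, t = 5, 4          # hypothetical small dimension and power
A_, B_ = 0.7, -1.3   # arbitrary scalar stand-ins for A and B
E_ = rng.standard_normal((m, m))
R_ = A_ * (E_ @ E_) + B_ * E_  # R = A E^2 + B E

mp = np.linalg.matrix_power
# R^t = sum_j C(t, j) A^j B^{t-j} E^{t+j}, since A E^2 and B E commute.
lhs = mp(R_, t)
rhs = sum(comb(t, j) * A_**j * B_**(t - j) * mp(E_, t + j)
          for j in range(t + 1))
assert np.allclose(lhs, rhs)
```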

Note that |Aj​Bwβˆ’j|≀2wβ‹…ΞΌβˆ’(w+j)≀2wβ‹…(100​n​p​log6⁑n)βˆ’(w+j)|A^{j}B^{w-j}|\leq 2^{w}\cdot\mu^{-(w+j)}\leq 2^{w}\cdot(100\sqrt{np}\log^{6}n)^{-(w+j)}. It follows that for every t∈[rβˆ’1]t\in[r-1],

β€–Mtβ€–2β†’βˆž\displaystyle\left\|M_{t}\right\|_{2\to\infty} β‰€βˆ‘j=0t(tj)​|Aj​Btβˆ’j|​‖Et+j+1​Fβ€–2β†’βˆž\displaystyle\leq\sum_{j=0}^{t}{t\choose j}|A^{j}B^{t-j}|\left\|E^{t+j+1}F\right\|_{2\to\infty}
β‰€βˆ‘j=0t(tj)​2tβ‹…(100​n​p​log6⁑n)βˆ’(t+j)β‹…C2​k​pβ‹…(100​n​p)t+j​log5​(t+j)⁑n\displaystyle\leq\sum_{j=0}^{t}{t\choose j}2^{t}\cdot(100\sqrt{np}\log^{6}n)^{-(t+j)}\cdot C_{2}\sqrt{kp}\cdot(100\sqrt{np})^{t+j}\log^{5(t+j)}n
≀C2​k​pβ‹…2tβ€‹βˆ‘j=0t(tj)​(log⁑n)βˆ’(t+j)\displaystyle\leq C_{2}\sqrt{kp}\cdot 2^{t}\sum_{j=0}^{t}{t\choose j}(\log n)^{-(t+j)}
=C2​k​pβ‹…2t​(log⁑n)βˆ’tβ‹…(1+1log⁑n)t≀C2​k​p,\displaystyle=C_{2}\sqrt{kp}\cdot 2^{t}(\log n)^{-t}\cdot(1+\frac{1}{\log n})^{t}\leq C_{2}\sqrt{kp},

where the second inequality is by LemmaΒ D.1, and the last step follows from PropositionΒ 2.5. Similarly, for every t∈[rβˆ’1]t\in[r-1],

β€–Ntβ€–2β†’βˆž\displaystyle\left\|N_{t}\right\|_{2\to\infty} β‰€βˆ‘i=0tβˆ’1βˆ‘j=0i(ij)​|Aj​Biβˆ’j|​‖Ei+j+1​Lβ€–2β†’βˆžβ€‹β€–Dtβˆ’1βˆ’i​Fβ€–2\displaystyle\leq\sum_{i=0}^{t-1}\sum_{j=0}^{i}{i\choose j}|A^{j}B^{i-j}|\left\|E^{i+j+1}L\right\|_{2\to\infty}\left\|D^{t-1-i}F\right\|_{2}
≀2β€‹βˆ‘i=0tβˆ’1βˆ‘j=0i(ij)​|Aj​Biβˆ’j|​‖Ei+j+1​Lβ€–2β†’βˆž\displaystyle\leq 2\sum_{i=0}^{t-1}\sum_{j=0}^{i}{i\choose j}|A^{j}B^{i-j}|\left\|E^{i+j+1}L\right\|_{2\to\infty}
≀2β€‹βˆ‘i=0tβˆ’1βˆ‘j=0i(ij)​2iβ‹…(100​n​p​log6⁑n)βˆ’(i+j)β‹…C2​k​pβ‹…(100​n​p)i+j​log5​(i+j)⁑n\displaystyle\leq 2\sum_{i=0}^{t-1}\sum_{j=0}^{i}{i\choose j}2^{i}\cdot(100\sqrt{np}\log^{6}n)^{-(i+j)}\cdot C_{2}\sqrt{kp}\cdot(100\sqrt{np})^{i+j}\log^{5(i+j)}n
≤2​C2​k​p​∑i=0t−12i​∑j=0i(ij)​(log⁡n)−(i+j)\displaystyle\leq 2C_{2}\sqrt{kp}\sum_{i=0}^{t-1}2^{i}\sum_{j=0}^{i}{i\choose j}(\log n)^{-(i+j)}
≀2​C2​k​pβ‹…t≀2​C2​k​pβ‹…log⁑n,\displaystyle\leq 2C_{2}\sqrt{kp}\cdot t\leq 2C_{2}\sqrt{kp}\cdot\log n,

where the second inequality follows from β€–Dtβˆ’1βˆ’i​Fβ€–2≀2\left\|D^{t-1-i}F\right\|_{2}\leq 2, and the third inequality is by LemmaΒ D.1. In sum,

β€–E​Mβ€²β€–2β†’βˆžβ‰€βˆ‘t=1rβˆ’1(β€–Mtβ€–2β†’βˆž+β€–Ntβ€–2β†’βˆž)​‖F^β€–2rβˆ’1βˆ’t≀6​C2​k​p​log2⁑n,\displaystyle\left\|EM^{\prime}\right\|_{2\to\infty}\leq\sum_{t=1}^{r-1}(\left\|M_{t}\right\|_{2\to\infty}+\left\|N_{t}\right\|_{2\to\infty})\left\|\widehat{F}\right\|^{r-1-t}_{2}\leq 6C_{2}\sqrt{kp}\log^{2}n, (6)

where in the last inequality we also use β€–F^β€–2t≀2\left\|\widehat{F}\right\|_{2}^{t}\leq 2.
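The elementary steps collapsing the binomial sums in the bounds on ‖Mt‖2→∞\left\|M_{t}\right\|_{2\to\infty} and ‖Nt‖2→∞\left\|N_{t}\right\|_{2\to\infty} reduce to the identity ∑j(tj)​y−(t+j)=y−t​(1+1/y)t\sum_{j}{t\choose j}y^{-(t+j)}=y^{-t}(1+1/y)^{t} with y=log⁡ny=\log n, after which the prefactor 2t2^{t} is absorbed. A quick numeric check with hypothetical values of nn and tt:

```python
from math import comb, log

n_hyp, t = 10**6, 5          # hypothetical n, and t <= log n
y = log(n_hyp)               # plays the role of log n (about 13.8)
s = sum(comb(t, j) * y**-(t + j) for j in range(t + 1))
closed = y**-t * (1 + 1/y)**t
assert abs(s - closed) < 1e-12
# The prefactor 2^t is absorbed because 2(1 + 1/y)/y < 1 for this y:
assert 2**t * s <= 1
```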

It remains to prove LemmaΒ D.1; the proof draws on the entrywise bound in PropositionΒ A.3.

Proof of LemmaΒ D.1.

Fix an η≤log⁡n\eta\leq\log n. Write s∗=𝖽𝖾𝖿​n/ks^{*}\stackrel{{\scriptstyle\mathsf{def}}}{{=}}n/k for ease of notation. Observe that Eη​L=A​(Eη+1​G+Eη​G​E)=A​(Eη+1​H+Eη​H​E)+A​q​(Eη+1​Jn+Eη​Jn​E)E^{\eta}L=A(E^{\eta+1}G+E^{\eta}GE)=A(E^{\eta+1}H+E^{\eta}HE)+Aq(E^{\eta+1}J_{n}+E^{\eta}J_{n}E). We can apply Proposition A.3 to Eη+1​HE^{\eta+1}H and Eη​HE^{\eta}H, with α=p,β=p−q,γ=2​s∗\alpha=p,\beta=p-q,\gamma=2s^{*}. That is, with probability at least 1−O​(n−2)1-O(n^{-2}),

β€–Ej​Hβ€–2β†’βˆžβ‰€500​(log⁑n)5​j​n​(pβˆ’q)​p​2​sβˆ—β‹…(100​n​p)jβˆ’1​ for ​j=Ξ·,Ξ·+1.\displaystyle\left\|E^{j}H\right\|_{2\to\infty}\leq 500(\log n)^{5j}\sqrt{n}(p-q)\sqrt{p}\sqrt{2s^{*}}\cdot\left(100\sqrt{np}\right)^{j-1}\text{ for }j=\eta,\eta+1.

Our assumption on nn yields ΞΌβ‰₯C​n​p​log6⁑n\mu\geq C\sqrt{np}\log^{6}n; moreover, |A|​(pβˆ’q)≀pβˆ’qΞΌ2≀1/sβˆ—β‹…1μ≀1/sβˆ—β‹…(C​n​p​log6⁑n)βˆ’1|A|(p-q)\leq\frac{p-q}{\mu^{2}}\leq 1/s^{*}\cdot\frac{1}{\mu}\leq 1/s^{*}\cdot(C\sqrt{np}\log^{6}n)^{-1}. Therefore,

β€–A​(EΞ·+1​H+Eη​H​E)β€–2β†’βˆž\displaystyle\left\|A(E^{\eta+1}H+E^{\eta}HE)\right\|_{2\to\infty}
≀A​(β€–EΞ·+1​Hβ€–2β†’βˆž+β€–Eη​Hβ€–2β†’βˆžβ€‹β€–Eβ€–2)\displaystyle\leq A\left(\left\|E^{\eta+1}H\right\|_{2\to\infty}+\left\|E^{\eta}H\right\|_{2\to\infty}\left\|E\right\|_{2}\right)
≀1sβˆ—β‹…(C​n​p​log6⁑n)βˆ’1β‹…500​(log⁑n)5​η+5β‹…nβ‹…pβ‹…2​sβˆ—β€‹((100​n​p)Ξ·+(100​n​p)Ξ·βˆ’1​C0​σ​n)\displaystyle\leq\frac{1}{s^{*}}\cdot(C\sqrt{np}\log^{6}n)^{-1}\cdot 500(\log n)^{5\eta+5}\cdot\sqrt{n}\cdot\sqrt{p}\cdot\sqrt{2s^{*}}\left(\left(100\sqrt{np}\right)^{\eta}+\left(100\sqrt{np}\right)^{\eta-1}C_{0}\sigma\sqrt{n}\right)
≀500000​n​p/sβˆ—β€‹(log⁑n)5​η​(100​n​p)Ξ·βˆ’1.\displaystyle\leq 500000\sqrt{np/s^{*}}(\log n)^{5\eta}(100\sqrt{np})^{\eta-1}. (7)

Similarly, by applying PropositionΒ A.3 to Ej​JnE^{j}J_{n} with Ξ±=p,Ξ²=1,Ξ³=n\alpha=p,\beta=1,\gamma=n, we have, with probability at least 1βˆ’O​(nβˆ’2)1-O(n^{-2}),

β€–Ej​Jnβ€–2β†’βˆžβ‰€500​(log⁑n)5​jβ‹…pβ‹…nβ‹…(100​n​p)jβˆ’1,j=Ξ·,Ξ·+1.\left\|E^{j}J_{n}\right\|_{2\to\infty}\leq 500(\log n)^{5j}\cdot\sqrt{p}\cdot n\cdot\left(100\sqrt{np}\right)^{j-1},j=\eta,\eta+1.

Since |A​q|=qλ⋅1μ≤1n⋅(C​n​p​log6⁡n)−1|Aq|=\frac{q}{\lambda}\cdot\frac{1}{\mu}\leq\frac{1}{n}\cdot(C\sqrt{np}\log^{6}n)^{-1}, we have

β€–A​q​(EΞ·+1​Jn+Eη​Jn​E)β€–2β†’βˆž\displaystyle\left\|Aq(E^{\eta+1}J_{n}+E^{\eta}J_{n}E)\right\|_{2\to\infty}
≀|A​q|​(β€–EΞ·+1​Jnβ€–2β†’βˆž+β€–Eη​Jnβ€–2β†’βˆžβ€‹β€–Eβ€–2)\displaystyle\leq|Aq|\left(\left\|E^{\eta+1}J_{n}\right\|_{2\to\infty}+\left\|E^{\eta}J_{n}\right\|_{2\to\infty}\left\|E\right\|_{2}\right)
≀1nβ‹…(C​n​p​log6⁑n)βˆ’1β‹…500β‹…(log⁑n)5​η+5β‹…nβ‹…((100​n​p)Ξ·+(100​n​p)Ξ·βˆ’1​C0​σ​n)\displaystyle\leq\frac{1}{n}\cdot(C\sqrt{np}\log^{6}n)^{-1}\cdot 500\cdot(\log n)^{5\eta+5}\cdot n\cdot\left(\left(100\sqrt{np}\right)^{\eta}+\left(100\sqrt{np}\right)^{\eta-1}C_{0}\sigma\sqrt{n}\right)
≀500000​p​(log⁑n)5​η​(100​n​p)Ξ·βˆ’1.\displaystyle\leq 500000\sqrt{p}(\log n)^{5\eta}(100\sqrt{np})^{\eta-1}. (8)

Combining Equation 7 and Equation 8, we have, with probability at least 1−O​(n−2)1-O(n^{-2}), ‖Eη​L‖2→∞≤C2​n​p/s∗​(log⁡n)5​η​(100​n​p)η−1\left\|E^{\eta}L\right\|_{2\to\infty}\leq C_{2}\sqrt{np/s^{*}}(\log n)^{5\eta}(100\sqrt{np})^{\eta-1}, where C2=𝖽𝖾𝖿​106C_{2}\stackrel{{\scriptstyle\mathsf{def}}}{{=}}10^{6}. Since n​p/s∗=k​pnp/s^{*}=kp, this is exactly the first bound in Lemma D.1.

For the second part, we decompose F=F′+F′′F=F^{\prime}+F^{\prime\prime}, where F′F^{\prime} is the intra-cluster part, i.e., Fu​v′=Fu​vF^{\prime}_{uv}=F_{uv} if u,v∈Vℓu,v\in V_{\ell} for some ℓ\ell, and Fu​v′=0F^{\prime}_{uv}=0 otherwise. Equipped with Claim D.1, we can apply Proposition A.3 to Eη​F′E^{\eta}F^{\prime} (with α=p,β=10/s∗,γ=2​s∗\alpha=p,\beta=10/s^{*},\gamma=2s^{*}) and to Eη​F′′E^{\eta}F^{\prime\prime} (with α=p,β=10n,γ=n\alpha=p,\beta=\frac{10}{n},\gamma=n):

β€–Eη​Fβ€–2β†’βˆž\displaystyle\left\|E^{\eta}F\right\|_{2\to\infty} ≀‖Eη​Fβ€²β€–2β†’βˆž+β€–Eη​Fβ€²β€²β€–2β†’βˆž\displaystyle\leq\left\|E^{\eta}F^{\prime}\right\|_{2\to\infty}+\left\|E^{\eta}F^{\prime\prime}\right\|_{2\to\infty}
≤20000​log5​η⁡n​n​p​(10s∗⋅2​s∗+10n⋅n)​(100​n​p)η−1\displaystyle\leq 20000\log^{5\eta}n\sqrt{n}\sqrt{p}\left(\frac{10}{s^{*}}\cdot\sqrt{2s^{*}}+\frac{10}{n}\cdot\sqrt{n}\right)(100\sqrt{np})^{\eta-1}
≤C2​(log⁡n)5​η​n​p/s∗​(100​n​p)η−1,\displaystyle\leq C_{2}(\log n)^{{5\eta}}\sqrt{np/s^{*}}(100\sqrt{np})^{\eta-1},

where the second inequality holds with probability at least 1βˆ’O​(nβˆ’2)1-O(n^{-2}). The lemma follows from a union bound over all η≀log⁑n\eta\leq\log n. ∎