
A Sublinear-Time Spectral Clustering Oracle with Improved Preprocessing Time

Ranran Shen
[email protected]
Pan Peng
[email protected]
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China. Corresponding author: Pan Peng.
Abstract

We address the problem of designing a sublinear-time spectral clustering oracle for graphs that exhibit strong clusterability. Such graphs contain $k$ latent clusters, each characterized by a large inner conductance (at least $\varphi$) and a small outer conductance (at most $\varepsilon$). Our aim is to preprocess the graph to enable clustering membership queries, with the key requirement that both preprocessing and query answering be performed in sublinear time, and that the resulting partition be consistent with a $k$-partition that is close to the ground-truth clustering. Previous oracles have relied on either a $\textrm{poly}(k)\log n$ gap between inner and outer conductances or exponential (in $k/\varepsilon$) preprocessing time. Our algorithm relaxes these assumptions, albeit at the cost of a slightly higher misclassification ratio. We also show that our clustering oracle is robust against a few random edge deletions. To validate our theoretical bounds, we conducted experiments on synthetic networks.

1 Introduction

Graph clustering is a fundamental task in the field of graph analysis. Given a graph $G=(V,E)$ and an integer $k$, the objective of graph clustering is to partition the vertex set $V$ into $k$ disjoint clusters $C_{1},\dots,C_{k}$. Each cluster should exhibit tight connections within the cluster while maintaining loose connections with the other clusters. This task finds applications in various domains, including community detection [32, 12], image segmentation [11] and bioinformatics [29].

However, global graph clustering algorithms, such as spectral clustering [27], modularity maximization [26], and density-based clustering [10], can be computationally expensive, especially for large datasets. For instance, spectral clustering, a prominent algorithm for the graph clustering problem, involves two steps. The first step maps all vertices to a $k$-dimensional Euclidean space using the Laplacian matrix of the graph. The second step clusters all the points in this $k$-dimensional space, often employing the $k$-means algorithm. The time complexity of spectral clustering, as well as of other global clustering algorithms, is $\textrm{poly}(n)$, where $n=|V|$ denotes the size of the graph. As the graph size increases, the computational demands of these global clustering algorithms become impractical.

Addressing this challenge, an effective approach lies in the use of local algorithms that operate in sublinear time. In this paper, our primary focus is on a particular category of such algorithms designed for graph clustering, known as sublinear-time spectral clustering oracles [30, 14]. These algorithms consist of two phases, the preprocessing phase and the query phase, both of which can be executed in sublinear time. During the preprocessing phase, these algorithms sample a subset of vertices from $V$, enabling them to locally explore a small portion of the graph and gain insight into its cluster structure. In the query phase, they utilize the cluster structure learned during the preprocessing phase to respond to WhichCluster$(G,x)$ queries. The resulting partition defined by the outputs of WhichCluster$(G,x)$ should be consistent with a $k$-partition that is close to the ground-truth clustering.

We study such oracles for graphs that exhibit strong clusterability, i.e., graphs that contain $k$ latent clusters, each characterized by a large inner conductance (at least $\varphi$) and a small outer conductance (at most $\varepsilon$). Let us assume $\varphi>0$ is some constant. In [30] (see also [8]), a robust clustering oracle was designed with preprocessing time approximately $O(\sqrt{n}\cdot\textrm{poly}(\frac{k\log n}{\varepsilon}))$, query time approximately $O(\sqrt{n}\cdot\textrm{poly}(\frac{k\log n}{\varepsilon}))$, and misclassification error (i.e., the number of vertices that are misclassified with respect to a ground-truth clustering) approximately $O(kn\sqrt{\varepsilon})$. The oracle relied on a $\textrm{poly}(k)\log n$ gap between inner and outer conductance. In [14], a clustering oracle was designed with preprocessing time approximately $2^{\textrm{poly}(\frac{k}{\varepsilon})}\textrm{poly}(\log n)\cdot n^{1/2+O(\varepsilon)}$, query time approximately $\textrm{poly}(\frac{k\log n}{\varepsilon})\cdot n^{1/2+O(\varepsilon)}$, and misclassification error $O(\log k\cdot\varepsilon)|C_{i}|$ for each cluster $C_{i}$, $i\in[k]$; it takes approximately $O({\rm poly}(\frac{k}{\varepsilon})\cdot n^{1/2+O(\varepsilon)}\cdot{\rm poly}(\log n))$ space. This oracle relied on a $\log k$ gap between inner and outer conductance.

One of our key contributions in this research is a new sublinear-time spectral clustering oracle that offers enhanced preprocessing efficiency. Specifically, we introduce an oracle that significantly reduces both the preprocessing and query time, running in $\textrm{poly}(k\log n)\cdot n^{1/2+O(\varepsilon)}$ time, and reduces the space complexity, taking $O({\rm poly}(k)\cdot n^{1/2+O(\varepsilon)}\cdot{\rm poly}(\log n))$ space. This approach relies on a $\textrm{poly}(k)$ gap between the inner and outer conductances, while maintaining a misclassification error of $O(\textrm{poly}(k)\cdot\varepsilon^{1/3})|C_{i}|$ for each cluster $C_{i}$, $i\in[k]$. Moreover, our oracle offers practical implementation feasibility, making it well suited for real-world applications. In contrast, the clustering oracle proposed in [14] presents challenges in terms of implementation (mainly due to the exponential dependency on $k/\varepsilon$).

We also investigate the sensitivity of our clustering oracle to edge perturbations. This analysis holds significance in various practical scenarios where the input graph may be unreliable due to factors such as privacy concerns, adversarial attacks, or random noise [31]. We demonstrate the robustness of our clustering oracle by showing that it can accurately identify the underlying clusters in the resulting graph even after the random deletion of one or a few edges from a well-clusterable graph.

1.1 Basic definitions

Graph clustering problems often rely on conductance as a metric to assess the quality of a cluster. Several recent studies ([8, 9, 30, 14, 22]) have employed conductance in their investigations. Hence, in this paper, we adopt the same definition to characterize cluster quality. We state our results for $d$-regular graphs for some constant $d\geq 3$, though they can be easily generalized to graphs with maximum degree at most $d$ (see Section 2).

Definition 1.1 (Inner and outer conductance).

Let $G=(V,E)$ be a $d$-regular $n$-vertex graph. For a set $S\subseteq C\subseteq V$, we let $E(S,C\backslash S)$ denote the set of edges with one endpoint in $S$ and the other endpoint in $C\backslash S$. The outer conductance of a set $C$ is defined to be $\phi_{\rm out}(C,V)=\frac{|E(C,V\backslash C)|}{d|C|}$. The inner conductance of a set $C$ is defined to be $\phi_{\rm in}(C)=\min_{S\subseteq C,0<|S|\leq\frac{|C|}{2}}\phi_{\rm out}(S,C)=\min_{S\subseteq C,0<|S|\leq\frac{|C|}{2}}\frac{|E(S,C\backslash S)|}{d|S|}$ if $|C|>1$, and one otherwise. In particular, the conductance of the graph $G$ is defined to be $\phi(G)=\min_{C\subseteq V,0<|C|\leq\frac{n}{2}}\phi_{\rm out}(C,V)$.
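As a quick illustration of Definition 1.1, both conductances can be computed by brute force on a tiny $d$-bounded example; the function names and the toy graph below are ours, not part of the paper's algorithms, and the brute-force enumeration is only feasible for very small clusters:

```python
from itertools import combinations

def outer_conductance(adj, C, d):
    """phi_out(C, V) = |E(C, V \\ C)| / (d|C|); V is the vertex set of adj."""
    C = set(C)
    cut = sum(1 for u in C for v in adj[u] if v not in C)
    return cut / (d * len(C))

def inner_conductance(adj, C, d):
    """phi_in(C): min of phi_out(S, C) over S with 0 < |S| <= |C|/2,
    or 1 if |C| = 1. Brute force over subsets, so tiny clusters only."""
    C = list(C)
    if len(C) == 1:
        return 1.0
    best = 1.0
    for r in range(1, len(C) // 2 + 1):
        for S in combinations(C, r):
            S = set(S)
            rest = set(C) - S
            cut = sum(1 for u in S for v in adj[u] if v in rest)
            best = min(best, cut / (d * len(S)))
    return best

# assumed toy example: two triangles joined by one edge (degrees <= d = 3)
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi_out = outer_conductance(adj, {0, 1, 2}, 3)   # 1/9
phi_in = inner_conductance(adj, {0, 1, 2}, 3)    # 2/3
```

For the cluster $\{0,1,2\}$ this gives $\phi_{\rm out}=\frac{1}{9}\ll\phi_{\rm in}=\frac{2}{3}$, the regime of a high-quality cluster.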

Note that based on the above definition, for a cluster $C$, the smaller $\phi_{\rm out}(C,V)$ is, the more loosely $C$ is connected to the other clusters, and the larger $\phi_{\rm in}(C)$ is, the more tightly connected $C$ is internally. For a high-quality cluster $C$, we have $\phi_{\rm out}(C,V)\ll\phi_{\rm in}(C)\leq 1$.

Definition 1.2 ($k$-partition).

Let $G=(V,E)$ be a graph. A $k$-partition of $V$ is a collection of disjoint subsets $C_{1},\dots,C_{k}$ such that $\cup_{i=1}^{k}C_{i}=V$.

Based on the above, we have the following definition of clusterable graphs.

Definition 1.3 ($(k,\varphi,\varepsilon)$-clustering).

Let $G=(V,E)$ be a $d$-regular graph. A $(k,\varphi,\varepsilon)$-clustering of $G$ is a $k$-partition of $V$, denoted by $C_{1},\dots,C_{k}$, such that for all $i\in[k]$, $\phi_{\rm in}(C_{i})\geq\varphi$, $\phi_{\rm out}(C_{i},V)\leq\varepsilon$, and for all $i,j\in[k]$ one has $\frac{|C_{i}|}{|C_{j}|}\in O(1)$. $G$ is called a $(k,\varphi,\varepsilon)$-clusterable graph if there exists a $(k,\varphi,\varepsilon)$-clustering of $G$.

1.2 Main results

Our main contribution is a sublinear-time spectral clustering oracle with improved preprocessing time for $d$-regular $(k,\varphi,\varepsilon)$-clusterable graphs. We assume query access to the adjacency list of a graph $G$, that is, one can query the $i$-th neighbor of any vertex in constant time.

Theorem 1.

Let $k\geq 2$ be an integer and $\varphi\in(0,1)$. Let $G=(V,E)$ be a $d$-regular $n$-vertex graph that admits a $(k,\varphi,\varepsilon)$-clustering $C_{1},\dots,C_{k}$ with $\frac{\varepsilon}{\varphi^{2}}\ll\frac{\gamma^{3}}{k^{\frac{9}{2}}\cdot\log^{3}k}$ and, for all $i\in[k]$, $\gamma\frac{n}{k}\leq|C_{i}|\leq\frac{n}{\gamma k}$, where $\gamma\in(0.001,1]$ is a constant. There exists an algorithm that has query access to the adjacency list of $G$, constructs a clustering oracle in $O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma\varphi}))$ preprocessing time, and takes $O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma}))$ space. Furthermore, with probability at least $0.95$, the following hold:

  1. Using the oracle, the algorithm can answer any WhichCluster query in $O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma\varphi}))$ time, and a WhichCluster query takes $O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma}))$ space.

  2. Let $U_{i}:=\{x\in V:\textsc{WhichCluster}(G,x)=i\}$, $i\in[k]$, be the clusters recovered by the algorithm. Then there exists a permutation $\pi:[k]\rightarrow[k]$ such that for all $i\in[k]$, $|U_{\pi(i)}\triangle C_{i}|\leq O(\frac{k^{\frac{3}{2}}}{\gamma}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3})|C_{i}|$.

Specifically, for every graph $G=(V,E)$ that admits a $k$-partition $C_{1},\dots,C_{k}$ with constant inner conductance $\varphi$ and outer conductance $\varepsilon\ll O(\frac{1}{\textrm{poly}(k)})$, our oracle has preprocessing time $\approx n^{1/2+O(\varepsilon)}\cdot\textrm{poly}(k\log n)$, query time $\approx n^{1/2+O(\varepsilon)}\cdot\textrm{poly}(k\log n)$, space $\approx O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(k\log n))$, and misclassification error $\approx O(\textrm{poly}(k)\cdot\varepsilon^{1/3})|C_{i}|$ for each cluster $C_{i}$, $i\in[k]$. In comparison to [30], our oracle relies on a smaller gap between inner and outer conductance (specifically, $\textrm{poly}(k)$ instead of $O(\textrm{poly}(k)\log n)$). In comparison to [14], our oracle has a smaller preprocessing time and smaller space, at the expense of a slightly higher misclassification error of $O(\textrm{poly}(k)\cdot\varepsilon^{1/3})|C_{i}|$ instead of $O(\log k\cdot\varepsilon)|C_{i}|$ and a slightly worse conductance gap of $\varepsilon\ll O(\varphi^{2}/{\rm poly}(k))$ instead of $\varepsilon\ll O(\varphi^{3}/\log(k))$. It is worth highlighting that our space complexity significantly outperforms that of [14] (i.e., $O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k}{\varepsilon}\cdot\log n))$), particularly in cases where $k$ is fixed and $\varepsilon$ takes on exceptionally small values, such as $\varepsilon=\frac{1}{n^{c}}$ for a sufficiently small constant $c>0$, since the second term in our space complexity does not depend on $\varepsilon$, in contrast to the one in [14].

Another contribution of our work is the verification of the robustness of our oracle against the deletion of one or a few random edges. The main idea underlying the proof is that a well-clusterable graph is still well-clusterable (with a slightly worse clustering quality) after removing a few random edges, which in turn is built upon the intuition that after removing a few random edges, an expander graph remains an expander. See the complete statement and proof of this claim in Appendix B.

Theorem 2 (Informal; Robust against random edge deletions).

Let $c>0$ be a constant. Let $G_{0}$ be a graph satisfying conditions similar to those stated in Theorem 1. Let $G$ be a graph obtained from $G_{0}$ by randomly deleting $c$ edges. Then there exists a clustering oracle for $G$ with the same guarantees as presented in Theorem 1.

1.3 Related work

Sublinear-time algorithms for graph clustering have been extensively researched. Czumaj et al. [8] proposed a property testing algorithm capable of determining, in sublinear time, whether a graph is $k$-clusterable or significantly far from being $k$-clusterable. This algorithm, which can be adapted to a sublinear-time clustering oracle, was later extended by Peng [30] to handle graphs with noisy partial information through a robust clustering oracle. Subsequent improvements to both the testing algorithm and the oracle were introduced by Chiplunkar et al. [6] and Gluchowski et al. [14]. Recently, Kapralov et al. [16, 17] presented a hierarchical clustering oracle specifically designed for graphs exhibiting a pronounced hierarchical structure. This oracle offers query access to a high-quality hierarchical clustering at a cost of $\textrm{poly}(k)\cdot n^{1/2+O(\gamma)}$ per query. However, it is important to note that their algorithm does not provide an oracle for flat $k$-clustering, as considered in our work, with the same query complexity. Sublinear-time clustering oracles for signed graphs have also been studied recently [25].

The field of local graph clustering [33, 1, 3, 2, 34, 28] is also closely related to our research. In this framework, the objective is to identify a cluster starting from a given vertex within a running time that is bounded by the size of the output set, with a weak dependence on $n$. Zhu et al. [34] proposed a local clustering algorithm that produces a set with low conductance when both inner and outer conductance are used as measures of cluster quality. It is worth noting that the running times of these algorithms are sublinear only if the target set's size (or volume) is small, for example, at most $o(n)$. In contrast, in our setting, the clusters of interest have a minimum size of $\Omega(n/k)$.

Extensive research has been conducted on fully or partially recovering clusters in the presence of noise within the “global algorithm regimes”. Examples include recovering the planted partition in the stochastic block model with modeling errors or noise [4, 15, 24, 21], correlation clustering on different ground-truth graphs in the semi-random model [23, 5, 13, 20], and graph partitioning in the average-case model [18, 19, 20]. It is important to note that all these algorithms require at least linear time to run.

2 Preliminaries

Let $G=(V,E)$ denote a $d$-regular undirected and unweighted graph, where $V:=\{1,\dots,n\}$. Throughout the paper, we use $i\in[n]$ to denote $1\leq i\leq n$, and all vectors will be column vectors unless otherwise specified or transposed to row vectors. For a vertex $x\in V$, let $\mathbbm{1}_{x}\in\mathbb{R}^{n}$ denote the indicator of $x$, i.e., $\mathbbm{1}_{x}(i)=1$ if $i=x$ and $0$ otherwise. For a vector $\mathbf{x}$, we let $\|\mathbf{x}\|_{2}=\sqrt{\sum_{i}\mathbf{x}(i)^{2}}$ denote its $\ell_{2}$ norm. For a matrix $A\in\mathbb{R}^{n\times n}$, we use $\|A\|$ to denote the spectral norm of $A$, and $\|A\|_{F}$ to denote the Frobenius norm of $A$. For any two vectors $\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}$, we let $\langle\mathbf{x},\mathbf{y}\rangle=\mathbf{x}^{T}\mathbf{y}$ denote the dot product of $\mathbf{x}$ and $\mathbf{y}$. For a matrix $A\in\mathbb{R}^{n\times n}$, we use $A_{[i]}\in\mathbb{R}^{n\times i}$ to denote the first $i$ columns of $A$, $1\leq i\leq n$.

Let $A\in\mathbb{R}^{n\times n}$ denote the adjacency matrix of $G$ and let $D\in\mathbb{R}^{n\times n}$ denote a diagonal matrix. For the adjacency matrix $A$, $A(i,j)=1$ if $(i,j)\in E$ and $0$ otherwise. For the diagonal matrix $D$, $D(i,i)=\deg(i)$, where $\deg(i)$ is the degree of vertex $i$, $i\in[n]$. We denote by $L$ the normalized Laplacian of $G$, where $L=D^{-1/2}(D-A)D^{-1/2}=I-\frac{A}{d}$. For $L$, we use $0\leq\lambda_{1}\leq\dots\leq\lambda_{n}\leq 2$ [7] to denote its eigenvalues and we use $u_{1},\dots,u_{n}\in\mathbb{R}^{n}$ to denote the corresponding eigenvectors. Since the corresponding eigenvectors are not unique, in this paper we let $u_{1},\dots,u_{n}$ be an orthonormal basis of eigenvectors of $L$. Let $U\in\mathbb{R}^{n\times n}$ be the matrix whose $i$-th column is $u_{i}$, $i\in[n]$; then for every vertex $x\in V$, $f_{x}=U_{[k]}^{T}\mathbbm{1}_{x}$. For any two sets $S_{1}$ and $S_{2}$, we let $S_{1}\triangle S_{2}$ denote the symmetric difference between $S_{1}$ and $S_{2}$.

From $d$-bounded graphs to $d$-regular graphs. For a $d$-bounded graph $G'=(V,E)$, we can obtain a $d$-regular graph $G$ from $G'$ by adding $d-\deg(x)$ self-loops with weight $1/2$ to each vertex $x\in V$. Note that the lazy random walk on $G$ is equivalent to the random walk on $G'$ in which, at vertex $x$, we jump to each neighbor with probability $\frac{1}{2d}$ and stay at $x$ with probability $1-\frac{\deg(x)}{2d}$. We use $w_{\rm self}(x)=(d-\deg(x))\cdot\frac{1}{2}$ to denote the weight of all self-loops of $x\in V$.
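A minimal sketch of this reduction, propagating a probability distribution through one step of the lazy walk (the function name and the toy graph are our own assumptions, for illustration):

```python
def lazy_walk_distribution_step(adj, d, p):
    """One step of the lazy random walk that simulates the d-regularized
    graph: each neighbor of x receives mass p[x]/(2d), and x keeps
    p[x] * (1 - deg(x)/(2d)), which accounts for the weight-1/2 self-loops."""
    q = {x: p[x] * (1 - len(adj[x]) / (2 * d)) for x in adj}
    for x in adj:
        for y in adj[x]:
            q[y] += p[x] / (2 * d)
    return q

# assumed toy d-bounded graph (d = 3): two triangles joined by an edge
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
uniform = {x: 1 / 6 for x in adj}
after = lazy_walk_distribution_step(adj, 3, uniform)
```

With uniform input the distribution is unchanged, reflecting that the regularized graph is $d$-regular and hence the lazy walk has a uniform stationary distribution.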

Our algorithms in this paper are based on the properties of the dot product of spectral embeddings, so we also need the following definition.

Definition 2.1 (Spectral embedding).

For a graph $G=(V,E)$ with $n=|V|$ and an integer $2\leq k\leq n$, let $L$ denote the normalized Laplacian of $G$. Let $U_{[k]}\in\mathbb{R}^{n\times k}$ denote the matrix of the bottom $k$ eigenvectors of $L$. Then for every $x\in V$, the spectral embedding of $x$, denoted by $f_{x}\in\mathbb{R}^{k}$, is the $x$-th row of $U_{[k]}$, i.e., $f_{x}(i)=u_{i}(x)$ for $i\in[k]$.

Definition 2.2 (Cluster centers).

Let $G=(V,E)$ be a $d$-regular graph that admits a $(k,\varphi,\varepsilon)$-clustering $C_{1},\dots,C_{k}$. The cluster center $\mu_{i}$ of $C_{i}$ is defined to be $\mu_{i}=\frac{1}{|C_{i}|}\sum_{x\in C_{i}}f_{x}$, $i\in[k]$.
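For intuition, Definitions 2.1 and 2.2 can be computed directly with a dense eigendecomposition (a linear-time-and-space shortcut for illustration only; the two-block 3-regular graph below is our own assumed example, not one from the paper):

```python
import numpy as np

def spectral_embedding(A, d, k):
    """Rows f_x of U_[k] (Definition 2.1): bottom-k eigenvectors of L = I - A/d."""
    L = np.eye(A.shape[0]) - A / d
    _, vecs = np.linalg.eigh(L)      # eigenvalues ascending, orthonormal columns
    return vecs[:, :k]               # n x k; row x is f_x

def cluster_centers(F, clusters):
    """mu_i (Definition 2.2): the average of f_x over x in C_i."""
    return [F[list(C)].mean(axis=0) for C in clusters]

# assumed toy example: two 3-regular blocks of four vertices, two cut edges
A = np.zeros((8, 8))
for u, v in [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
             (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (0, 4), (1, 5)]:
    A[u, v] = A[v, u] = 1.0
F = spectral_embedding(A, 3, 2)
mu = cluster_centers(F, [[0, 1, 2, 3], [4, 5, 6, 7]])
```

On this instance the two centers come out nearly orthogonal, matching the picture behind the analysis that follows.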

The following lemma shows that the dot product of two spectral embeddings can be approximated in $\widetilde{O}(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot\textrm{poly}(k))$ time.

Lemma 2.1 (Theorem 2, [14]).

Let $\varepsilon,\varphi\in(0,1)$ with $\varepsilon\leq\frac{\varphi^{2}}{10^{5}}$. Let $G=(V,E)$ be a $d$-regular graph that admits a $(k,\varphi,\varepsilon)$-clustering $C_{1},\dots,C_{k}$. Let $\frac{1}{n^{5}}<\xi<1$. Then there exists an algorithm InitializeOracle$(G,1/2,\xi)$ that computes in time $(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot(\log n)^{3}\cdot\frac{1}{\varphi^{2}}$ a sublinear-space data structure $\mathcal{D}$ of size $(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot(\log n)^{3}$ such that with probability at least $1-n^{-100}$ the following property is satisfied: For every pair of vertices $x,y\in V$, the algorithm SpectralDotProductOracle$(G,x,y,1/2,\xi,\mathcal{D})$ computes an output value $\langle f_{x},f_{y}\rangle_{\rm apx}$ such that with probability at least $1-n^{-100}$,

$\big|\langle f_{x},f_{y}\rangle_{\rm apx}-\langle f_{x},f_{y}\rangle\big|\leq\frac{\xi}{n}.$

The running time of SpectralDotProductOracle$(G,x,y,1/2,\xi,\mathcal{D})$ is $(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot(\log n)^{2}\cdot\frac{1}{\varphi^{2}}$.

For the completeness of this paper, we describe the algorithms InitializeOracle$(G,1/2,\xi)$ and SpectralDotProductOracle$(G,x,y,1/2,\xi,\mathcal{D})$ in Appendix A.

3 Spectral clustering oracle

3.1 Our techniques

We begin by outlining the main ideas of the spectral clustering oracle presented in [14]. Firstly, the authors introduce a sublinear-time oracle that provides dot product access to the spectral embedding of a graph $G$ by estimating distributions of short random walks originating from vertices in $G$ (as described in Lemma 2.1). Subsequently, they demonstrate that (1) the spectral embeddings of all vertices are well concentrated around the corresponding cluster centers $\mu_{i}$ (refer to Definition 2.2), and (2) the cluster centers are approximately orthogonal to each other. The clustering oracle in [14] operates as follows: it initially guesses the $k$ cluster centers from a set of $\textrm{poly}(k/\varepsilon)$ sampled vertices, which requires a time complexity of $2^{\textrm{poly}(k/\varepsilon)}n^{1/2+O(\varepsilon)}$. Subsequently, it iteratively employs the dot product oracle to estimate $\langle f_{x},\mu_{i}\rangle$; if the value of $\langle f_{x},\mu_{i}\rangle$ is significant, it allows them to infer that vertex $x$ likely belongs to cluster $C_{i}$.

Figure 1: The angle between embeddings of vertices in the same cluster is small, and the angle between embeddings of vertices in different clusters is close to orthogonal ($k=3$).

Now we present our algorithm, which builds upon the dot product oracle in [14]. Our main insight is to avoid relying directly on cluster centers. By doing so, we eliminate the need to guess cluster centers and consequently remove the exponential time required in the preprocessing phase of [14]. The underlying intuition is as follows: if two vertices $x$ and $y$ belong to the same cluster $C_{i}$, their spectral embeddings $f_{x}$ and $f_{y}$ will both be close to the cluster center $\mu_{i}$. As a result, the angle between $f_{x}$ and $f_{y}$ will be small, and the dot product $\langle f_{x},f_{y}\rangle$ will be large (roughly on the order of $O(\frac{k}{n})$). Conversely, if $x$ and $y$ belong to different clusters, their embeddings $f_{x}$ and $f_{y}$ will tend to be orthogonal, resulting in a dot product $\langle f_{x},f_{y}\rangle$ close to $0$. We prove that this desirable property holds for the majority of vertices in $d$-regular $(k,\varphi,\varepsilon)$-clusterable graphs (see Figure 1 for an illustrative example). Slightly more formally, we introduce the definitions of good and bad vertices (refer to Definition 4.1) such that the set of good vertices corresponds to the core part of the clusters and each pair of good vertices satisfies the aforementioned property; the remaining vertices are the bad vertices. Leveraging this property, we can directly utilize the dot product of spectral embeddings to construct a sublinear clustering oracle.
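This dot-product separation can be checked numerically on an assumed toy instance: two 3-regular blocks of four vertices joined by two cut edges, so $k=2$, $n=8$ and $k/n=0.25$ (the graph and the thresholds in the assertions are ours, chosen only to illustrate the claim):

```python
import numpy as np

A = np.zeros((8, 8))
for u, v in [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3),   # cluster {0,1,2,3}
             (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),   # cluster {4,5,6,7}
             (0, 4), (1, 5)]:                          # two cut edges
    A[u, v] = A[v, u] = 1.0
L = np.eye(8) - A / 3                                  # L = I - A/d with d = 3
_, vecs = np.linalg.eigh(L)
M = vecs[:, :2] @ vecs[:, :2].T                        # M[x, y] = <f_x, f_y>
within = min(M[x, y] for x in range(4) for y in range(4))
cross = max(abs(M[x, y]) for x in range(4) for y in range(4, 8))
```

On this instance every same-cluster dot product lies in roughly $[0.19, 0.31]$, on the order of $k/n=0.25$, while every cross-cluster dot product is below $0.06$ in absolute value.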

Based on this desirable property of $d$-regular $(k,\varphi,\varepsilon)$-clusterable graphs, we can devise a sublinear spectral clustering oracle. Let $G=(V,E)$ be a $d$-regular graph that possesses a $(k,\varphi,\varepsilon)$-clustering $C_{1},\dots,C_{k}$. In the preprocessing phase, we sample a set $S$ of $s$ vertices from $V$ and construct a similarity graph, denoted $H$, on $S$. For each pair of vertices $x,y\in S$, we utilize the dot product oracle from [14] to estimate $\langle f_{x},f_{y}\rangle$. If $x$ and $y$ belong to the same cluster $C_{i}$, yielding a large $\langle f_{x},f_{y}\rangle$, we add an edge $(x,y)$ to $H$. Conversely, if $x$ and $y$ belong to different clusters, resulting in a $\langle f_{x},f_{y}\rangle$ close to $0$, we make no modification to $H$. Consequently, only vertices within the same cluster $C_{i}$ ($i\in[k]$) can be connected by edges. We can also establish that, by appropriately selecting $s$, the sample set $S$ will, with high probability, contain at least one vertex from each of $C_{1},\dots,C_{k}$. Thus, the similarity graph $H$ will have $k$ connected components, with each component corresponding to a cluster in the ground truth. We utilize these $k$ connected components, denoted $S_{1},\dots,S_{k}$, to represent $C_{1},\dots,C_{k}$.

During the query phase, we determine whether the queried vertex $x$ belongs to a connected component of $H$. Specifically, we estimate $\langle f_{x},f_{y}\rangle$ for all $y\in S$. If there exists a unique index $i\in[k]$ for which $\langle f_{x},f_{u}\rangle$ is significant (approximately $O(\frac{k}{n})$) for all $u\in S_{i}$, we conclude that $x$ belongs to cluster $C_{i}$, associated with $S_{i}$. If no such unique index is found, we assign $x$ a random index $i\in[k]$.

3.2 The clustering oracle

Next, we present our algorithms for constructing a spectral clustering oracle and handling WhichCluster queries. In the preprocessing phase, the algorithm ConstructOracle$(G,k,\varphi,\varepsilon,\gamma)$ learns the cluster-structure representation of $G$: it constructs a similarity graph $H$ on a sampled vertex set $S$ and assigns membership labels $\ell$ to all vertices in $S$. During the query phase, the algorithm WhichCluster$(G,x)$ determines the clustering membership index of the queried vertex $x$. More specifically, WhichCluster$(G,x)$ uses the function SearchIndex$(H,\ell,x)$ to check whether $x$ belongs to a unique connected component of $H$; if it does, SearchIndex$(H,\ell,x)$ returns the index of that component.

The algorithm in the preprocessing phase is given in Algorithm 1, ConstructOracle$(G,k,\varphi,\varepsilon,\gamma)$.

Let $\xi=\frac{\sqrt{\gamma}}{1000}$ and let $s=\frac{10\cdot k\log k}{\gamma}$
Let $\theta=0.96(1-\frac{4\sqrt{\varepsilon}}{\varphi})\frac{\gamma k}{n}-\frac{\sqrt{k}}{n}(\frac{\varepsilon}{\varphi^{2}})^{1/6}-\frac{\xi}{n}$
Sample a set $S$ of $s$ vertices independently and uniformly at random from $V$
Generate a similarity graph $H=(S,\emptyset)$
Let $\mathcal{D}=$ InitializeOracle$(G,1/2,\xi)$
for any $u,v\in S$ do
      Let $\langle f_{u},f_{v}\rangle_{\rm apx}=$ SpectralDotProductOracle$(G,u,v,1/2,\xi,\mathcal{D})$
      if $\langle f_{u},f_{v}\rangle_{\rm apx}\geq\theta$ then
            Add an edge $(u,v)$ to the similarity graph $H$
if $H$ has exactly $k$ connected components then
      Label the connected components with $1,2,\dots,k$ (we write them as $S_{1},\dots,S_{k}$)
      Label $x\in S$ with $i$ if $x\in S_{i}$
      Return $H$ and the vertex labeling $\ell$
else
      return fail
Algorithm 1 ConstructOracle$(G,k,\varphi,\varepsilon,\gamma)$

See Appendix A for the algorithms InitializeOracle and SpectralDotProductOracle invoked by ConstructOracle$(G,k,\varphi,\varepsilon,\gamma)$.

Our algorithms used in the query phase are described in Algorithm 2, SearchIndex$(H,\ell,x)$, and Algorithm 3, WhichCluster$(G,x)$.

for any vertex $u\in S$ do
      Let $\langle f_{u},f_{x}\rangle_{\rm apx}=$ SpectralDotProductOracle$(G,u,x,1/2,\xi,\mathcal{D})$
if there exists a unique index $1\leq i\leq k$ such that $\langle f_{u},f_{x}\rangle_{\rm apx}\geq\theta$ for all $u\in S_{i}$ then
      return index $i$
else
      return outlier
Algorithm 2 SearchIndex$(H,\ell,x)$
if the preprocessing phase failed then
      return fail
if SearchIndex$(H,\ell,x)$ returns outlier then
      return a random index $\in[k]$
else
      return SearchIndex$(H,\ell,x)$
Algorithm 3 WhichCluster$(G,x)$
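To make Algorithms 1–3 concrete, here is a minimal end-to-end sketch in which exact dot products $\langle f_{u},f_{v}\rangle$ from a full eigendecomposition stand in for SpectralDotProductOracle, so it is linear-time and for illustration only; the sample size and threshold $\theta$ are picked by hand for an assumed toy instance rather than by the formulas in Algorithm 1:

```python
import random
import numpy as np

def construct_oracle(F, S, k, theta):
    """Sketch of Algorithm 1: build the similarity graph H on the sample S,
    connecting u, v when <f_u, f_v> >= theta, and return its connected
    components S_1, ..., S_k (or None if there are not exactly k)."""
    adj = {u: set() for u in S}
    for i, u in enumerate(S):
        for v in S[i + 1:]:
            if float(F[u] @ F[v]) >= theta:
                adj[u].add(v)
                adj[v].add(u)
    comps, seen = [], set()
    for u in S:                              # connected components of H
        if u in seen:
            continue
        comp, stack = [], [u]
        while stack:
            w = stack.pop()
            if w not in seen:
                seen.add(w)
                comp.append(w)
                stack.extend(adj[w])
        comps.append(comp)
    return comps if len(comps) == k else None

def which_cluster(F, comps, theta, x, rng):
    """Sketch of Algorithms 2 and 3: the unique component i with
    <f_x, f_u> >= theta for all u in S_i, else a random index."""
    hits = [i for i, Si in enumerate(comps)
            if all(float(F[x] @ F[u]) >= theta for u in Si)]
    return hits[0] if len(hits) == 1 else rng.randrange(len(comps))

# assumed toy instance: two 3-regular blocks joined by two cut edges
A = np.zeros((8, 8))
for u, v in [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
             (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (0, 4), (1, 5)]:
    A[u, v] = A[v, u] = 1.0
_, vecs = np.linalg.eigh(np.eye(8) - A / 3)
F = vecs[:, :2]
rng = random.Random(0)
S = rng.sample(range(8), 6)       # any 6 of the 8 vertices hit both clusters
comps = construct_oracle(F, S, 2, theta=0.12)
labels = [which_cluster(F, comps, 0.12, x, rng) for x in range(8)]
```

On this instance every vertex recovers its ground-truth side; on a real input, $s$ and $\theta$ would be set as in Algorithm 1 and the dot products answered by the sublinear oracle.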

4 Analysis of the oracle

4.1 Key properties

We now prove the following properties: for most vertex pairs $x,y\in V$, if $x$ and $y$ are in the same cluster, then $\langle f_{x},f_{y}\rangle$ is roughly $O(\frac{k}{n})$ (Lemma 4.6); and if $x$ and $y$ are in different clusters, then $\langle f_{x},f_{y}\rangle$ is close to $0$ (Lemma 4.8). To prove these two key properties, we make use of the following three lemmas (Lemma 4.1, Lemma 4.2 and Lemma 4.4).

The following lemma shows that for most vertices $x$, the norm $\|f_{x}\|_{2}$ is small.

Lemma 4.1.

Let $\alpha\in(0,1)$. Let $k\geq 2$ be an integer, $\varphi\in(0,1)$, and $\varepsilon\in(0,1)$. Let $G=(V,E)$ be a $d$-regular $(k,\varphi,\varepsilon)$-clusterable graph with $|V|=n$. There exists a subset $\widehat{V}\subseteq V$ with $|\widehat{V}|\geq(1-\alpha)|V|$ such that for all $x\in\widehat{V}$, it holds that $\|f_{x}\|_{2}\leq\sqrt{\frac{1}{\alpha}\cdot\frac{k}{n}}$.

Proof.

Recall that $u_{1},\dots,u_{k}$ are orthonormal eigenvectors of $L$, so $\|u_{i}\|_{2}^{2}=1$ for all $i\in[k]$. Hence $\sum_{i=1}^{k}\|u_{i}\|_{2}^{2}=\sum_{i=1}^{n}\|f_{x_{i}}\|_{2}^{2}=k$. Let $X$ be a random variable such that $X=\|f_{x_{i}}\|_{2}^{2}$ with probability $\frac{1}{n}$, for each $i\in[n]$. Then we have $\mathrm{E}[X]=\frac{1}{n}\sum_{i=1}^{n}\|f_{x_{i}}\|_{2}^{2}=\frac{k}{n}$. Using Markov's inequality, we have $\Pr[X\geq\frac{1}{\alpha}\cdot\mathrm{E}[X]]=\Pr[X\geq\frac{1}{\alpha}\cdot\frac{k}{n}]\leq\alpha$. This gives us that $\Pr[X\leq\frac{1}{\alpha}\cdot\frac{k}{n}]\geq 1-\alpha$, which means that at least a $(1-\alpha)$ fraction of the vertices in $V$ satisfy $\|f_{x}\|_{2}^{2}\leq\frac{1}{\alpha}\cdot\frac{k}{n}$. We define $\widehat{V}:=\{x\in V:\|f_{x}\|_{2}^{2}\leq\frac{1}{\alpha}\cdot\frac{k}{n}\}$; then we have $|\widehat{V}|\geq(1-\alpha)|V|$. This ends the proof. ∎
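The identity $\sum_{x\in V}\|f_{x}\|_{2}^{2}=k$ and the Markov-inequality step can be checked numerically on an assumed toy 3-regular graph (ours, for illustration only):

```python
import numpy as np

# assumed toy graph: two 3-regular blocks of four vertices, two cut edges
A = np.zeros((8, 8))
for u, v in [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
             (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (0, 4), (1, 5)]:
    A[u, v] = A[v, u] = 1.0
_, vecs = np.linalg.eigh(np.eye(8) - A / 3)
k, n = 2, 8
F = vecs[:, :k]                          # rows are the embeddings f_x
row_norms_sq = (F ** 2).sum(axis=1)
total = float(row_norms_sq.sum())        # equals k, since columns of U_[k] are orthonormal
alpha = 0.5
n_heavy = int((row_norms_sq >= k / (alpha * n)).sum())   # Markov: at most alpha * n
```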

We then show that for most vertices $x$, $f_{x}$ is close to the center $\mu_{x}$ of the cluster containing $x$.

Lemma 4.2.

Let $\beta\in(0,1)$. Let $k\geq 2$ be an integer, $\varphi\in(0,1)$, and $\varepsilon\in(0,1)$. Let $G=(V,E)$ be a $d$-regular graph that admits a $(k,\varphi,\varepsilon)$-clustering $C_{1},\dots,C_{k}$ with $|V|=n$. There exists a subset $\widetilde{V}\subseteq V$ with $|\widetilde{V}|\geq(1-\beta)|V|$ such that for all $x\in\widetilde{V}$, it holds that $\|f_{x}-\mu_{x}\|_{2}\leq\sqrt{\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}}$.

The following result will be used in our proof:

Lemma 4.3 (Lemma 6, [14]).

Let k2k\geq 2 be an integer, φ(0,1)\varphi\in(0,1), and ε(0,1)\varepsilon\in(0,1). Let G=(V,E)G=(V,E) be a dd-regular graph that admits a (k,φ,ε)(k,\varphi,\varepsilon)-clustering C1,,CkC_{1},\dots,C_{k}. Then for all αk\alpha\in\mathbb{R}^{k} with α2=1\|\alpha\|_{2}=1, we have

i=1kxCifxμi,α24εφ2.\sum_{i=1}^{k}{\sum_{x\in C_{i}}{\langle f_{x}-\mu_{i},\alpha\rangle^{2}\leq\frac{4\varepsilon}{\varphi^{2}}}}.
Proof of Lemma 4.2.

Summing the bound of Lemma 4.3 over α\alpha ranging over an orthonormal basis e1,,eke_{1},\dots,e_{k} of k\mathbb{R}^{k}, and using j=1kfxμx,ej2=fxμx22\sum_{j=1}^{k}\langle f_{x}-\mu_{x},e_{j}\rangle^{2}=\|f_{x}-\mu_{x}\|_{2}^{2}, we get

xVfxμx22k4εφ2=4kεφ2,\sum_{x\in V}{\|f_{x}-\mu_{x}\|_{2}^{2}\leq k\cdot\frac{4\varepsilon}{\varphi^{2}}}=\frac{4k\varepsilon}{\varphi^{2}},

where μx\mu_{x} is the cluster center of the cluster that xx belongs to. Define V={xV:fxμx224kεβφ21n}V^{*}=\{x\in V:\|f_{x}-\mu_{x}\|_{2}^{2}\geq\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}\}. Then,

4kεφ2xVfxμx22xVfxμx22xV4kεβφ21n=|V|4kεβφ21n.\begin{split}\frac{4k\varepsilon}{\varphi^{2}}&\geq\sum_{x\in V}{\|f_{x}-\mu_{x}\|_{2}^{2}}\geq\sum_{x\in V^{*}}{\|f_{x}-\mu_{x}\|_{2}^{2}}\geq\sum_{x\in V^{*}}{\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}}=|V^{*}|\cdot\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}.\end{split}

So we get |V|βn|V^{*}|\leq\beta n. We define V~=V\V={xV:fxμx22<4kεβφ21n}\widetilde{V}=V\backslash V^{*}=\{x\in V:\|f_{x}-\mu_{x}\|_{2}^{2}<\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}\}. Therefore, we have |V~|(1β)n=(1β)|V||\widetilde{V}|\geq(1-\beta)n=(1-\beta)|V|, and every xV~x\in\widetilde{V} satisfies the bound in the lemma statement. This ends the proof. ∎

The next lemma shows that for most vertices xx in a cluster CiC_{i}, the inner product fx,μi\langle f_{x},\mu_{i}\rangle is large.

Lemma 4.4.

Let k2k\geq 2 be an integer, φ(0,1)\varphi\in(0,1), and εφ2\frac{\varepsilon}{\varphi^{2}} be smaller than a sufficiently small constant. Let G=(V,E)G=(V,E) be a dd-regular graph that admits a (k,φ,ε)(k,\varphi,\varepsilon)-clustering C1,,CkC_{1},\dots,C_{k}. Let CiC_{i} denote the cluster corresponding to the center μi\mu_{i}, i[k]i\in[k]. Then for every CiC_{i}, i[k]i\in[k], there exists a subset Ci~Ci\widetilde{C_{i}}\subseteq C_{i} with |Ci~|(1104εφ2)|Ci||\widetilde{C_{i}}|\geq(1-\frac{10^{4}\varepsilon}{\varphi^{2}})|C_{i}| such that for all xCi~x\in\widetilde{C_{i}}, it holds that fx,μi0.96μi22.\langle f_{x},\mu_{i}\rangle\geq 0.96\|\mu_{i}\|_{2}^{2}.

The following result will be used in our proof:

Lemma 4.5 (Lemma 31, [14]).

Let k2k\geq 2, φ(0,1)\varphi\in(0,1), and εφ2\frac{\varepsilon}{\varphi^{2}} be smaller than a sufficiently small constant. Let G=(V,E)G=(V,E) be a dd-regular graph that admits a (k,φ,ε)(k,\varphi,\varepsilon)-clustering C1,,CkC_{1},\dots,C_{k}, and let μ1,,μk\mu_{1},\dots,\mu_{k} be the cluster centers. Then the following holds. Let S{μ1,,μk}S\subset\{\mu_{1},\dots,\mu_{k}\}. Let Π\Pi denote the orthogonal projection matrix onto span(S)span(S)^{\perp}. Let μ{μ1,,μk}\S\mu\in\{\mu_{1},\dots,\mu_{k}\}\backslash S. Let CC denote the cluster corresponding to the center μ\mu. Let

C^:={xV:Πfx,Πμ0.96Πμ22}\widehat{C}:=\{x\in V:\langle\Pi f_{x},\Pi\mu\rangle\geq 0.96\|\Pi\mu\|_{2}^{2}\}

then we have:

|C\C^|104εφ2|C|.|C\backslash\widehat{C}|\leq\frac{10^{4}\varepsilon}{\varphi^{2}}|C|.
Proof of Lemma 4.4.

We apply Lemma 4.5 with S=S=\emptyset, so that Π\Pi is the identity matrix, and obtain |Ci\Ci^|104εφ2|Ci||C_{i}\backslash\widehat{C_{i}}|\leq\frac{10^{4}\varepsilon}{\varphi^{2}}|C_{i}|, where Ci^:={xV:fx,μi0.96μi22}\widehat{C_{i}}:=\{x\in V:\langle f_{x},\mu_{i}\rangle\geq 0.96\|\mu_{i}\|_{2}^{2}\}, i[k]i\in[k]. So

|CiCi^|(1104εφ2)|Ci|.|C_{i}\cap\widehat{C_{i}}|\geq\left(1-\frac{10^{4}\varepsilon}{\varphi^{2}}\right)|C_{i}|.

We define Ci~=CiCi^\widetilde{C_{i}}=C_{i}\cap\widehat{C_{i}}, i[k]i\in[k]. Therefore, for every CiC_{i}, i[k]i\in[k], there exists a subset Ci~Ci\widetilde{C_{i}}\subseteq C_{i} with |Ci~|(1104εφ2)|Ci||\widetilde{C_{i}}|\geq(1-\frac{10^{4}\varepsilon}{\varphi^{2}})|C_{i}| such that for all xCi~x\in\widetilde{C_{i}}, it holds that fx,μi0.96μi22.\langle f_{x},\mu_{i}\rangle\geq 0.96\|\mu_{i}\|_{2}^{2}. ∎

For the sake of description, we introduce the following definition.

Definition 4.1 (Good and bad vertices).

Let k2k\geq 2 be an integer, φ(0,1)\varphi\in(0,1), and εφ2\frac{\varepsilon}{\varphi^{2}} be smaller than a sufficiently small constant. Let G=(V,E)G=(V,E) be a dd-regular nn-vertex graph that admits a (k,φ,ε)(k,\varphi,\varepsilon)-clustering C1,,CkC_{1},\dots,C_{k}. We call a vertex xVx\in V a good vertex with respect to α(0,1)\alpha\in(0,1) and β(0,1)\beta\in(0,1) if x(V^V~(i=1kCi~))x\in(\widehat{V}\cap\widetilde{V}\cap(\cup_{i=1}^{k}{\widetilde{C_{i}}})), where V^\widehat{V} is the set as defined in Lemma 4.1, V~\widetilde{V} is the set as defined in Lemma 4.2 and Ci~\widetilde{C_{i}} (i[k])(i\in[k]) is the set as defined in Lemma 4.4. We call a vertex xVx\in V a bad vertex with respect to α(0,1)\alpha\in(0,1) and β(0,1)\beta\in(0,1) if it’s not a good vertex with respect to α\alpha and β\beta.

Note that for a good vertex xx with respect to α(0,1)\alpha\in(0,1) and β(0,1)\beta\in(0,1), the following hold: (1) fx21αkn\|f_{x}\|_{2}\leq\sqrt{\frac{1}{\alpha}\cdot\frac{k}{n}}; (2) fxμx24kεβφ21n\|f_{x}-\mu_{x}\|_{2}\leq\sqrt{\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}}; (3) fx,μx0.96μx22\langle f_{x},\mu_{x}\rangle\geq 0.96\|\mu_{x}\|_{2}^{2}. For a bad vertex xx with respect to α(0,1)\alpha\in(0,1) and β(0,1)\beta\in(0,1), it does not satisfy at least one of the above three conditions.
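The three conditions above can be written as a simple predicate. The helper below is hypothetical (its name, inputs, and the toy numbers are our own, not from the paper's algorithms); it reports whether a vertex with embedding f_x and cluster center mu_x is good:

```python
import numpy as np

def is_good_vertex(f_x, mu_x, n, k, alpha, beta, eps, phi):
    """Check the three good-vertex conditions from Definition 4.1.

    f_x  : spectral embedding of the vertex (length-k array)
    mu_x : center of the cluster containing the vertex
    """
    cond1 = np.linalg.norm(f_x) <= np.sqrt(k / (alpha * n))  # bounded norm
    cond2 = np.linalg.norm(f_x - mu_x) <= np.sqrt(4 * k * eps / (beta * phi**2) / n)  # close to center
    cond3 = np.dot(f_x, mu_x) >= 0.96 * np.dot(mu_x, mu_x)   # large inner product with center
    return cond1 and cond2 and cond3

# A vertex sitting exactly at a small cluster's center is good.
n, k = 1000, 4
mu = np.zeros(k); mu[0] = 1 / np.sqrt(250)   # center of a hypothetical cluster of size 250
print(is_good_vertex(mu, mu, n, k, alpha=0.5, beta=0.5, eps=1e-4, phi=0.9))  # True
```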

The following lemma shows that if vertex xx and vertex yy are in the same cluster and both of them are good vertices with respect to α\alpha and β\beta (α\alpha and β\beta should be chosen appropriately), then the spectral dot product fx,fy\langle f_{x},f_{y}\rangle is roughly 0.961|Ci|0.96\cdot\frac{1}{|C_{i}|}.

Lemma 4.6.

Let k2k\geq 2, φ(0,1)\varphi\in(0,1) and εφ2\frac{\varepsilon}{\varphi^{2}} be smaller than a sufficiently small constant. Let G=(V,E)G=(V,E) be a dd-regular nn-vertex graph that admits a (k,φ,ε)(k,\varphi,\varepsilon)-clustering C1,,CkC_{1},\dots,C_{k}. Suppose that x,yVx,y\in V are in the same cluster Ci,i[k]C_{i},i\in[k] and both of them are good vertices with respect to α=2k(εφ2)1/3\alpha=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3} and β=2k(εφ2)1/3\beta=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}. Then

fx,fy0.96(14εφ)1|Ci|kn(εφ2)1/6.\langle f_{x},f_{y}\rangle\geq 0.96\left(1-\frac{4\sqrt{\varepsilon}}{\varphi}\right)\frac{1}{|C_{i}|}-\frac{\sqrt{k}}{n}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/6}.

The following result will also be used in our proof:

Lemma 4.7 (Lemma 7, [14]).

Let k2k\geq 2 be an integer, φ(0,1)\varphi\in(0,1), and ε(0,1)\varepsilon\in(0,1). Let G=(V,E)G=(V,E) be a dd-regular graph that admits a (k,φ,ε)(k,\varphi,\varepsilon)-clustering C1,,CkC_{1},\dots,C_{k}. Then we have

  1. for all i[k]i\in[k], |μi221|Ci||4εφ1|Ci|\left|\|\mu_{i}\|_{2}^{2}-\frac{1}{|C_{i}|}\right|\leq\frac{4\sqrt{\varepsilon}}{\varphi}\frac{1}{|C_{i}|},

  2. for all ij[k]i\neq j\in[k], |μi,μj|8εφ1|Ci||Cj||\langle\mu_{i},\mu_{j}\rangle|\leq\frac{8\sqrt{\varepsilon}}{\varphi}\frac{1}{\sqrt{|C_{i}||C_{j}|}}.

Proof of Lemma 4.6.

By linearity of the inner product, we have

fx,fy=fx,fyμi+μi=fx,fyμi+fx,μi.\langle f_{x},f_{y}\rangle=\langle f_{x},f_{y}-\mu_{i}+\mu_{i}\rangle=\langle f_{x},f_{y}-\mu_{i}\rangle+\langle f_{x},\mu_{i}\rangle.

By the Cauchy-Schwarz inequality, we have |fx,fyμi|fx2fyμi2|\langle f_{x},f_{y}-\mu_{i}\rangle|\leq\|f_{x}\|_{2}\cdot\|f_{y}-\mu_{i}\|_{2}. Since xx and yy are both good vertices with respect to α=2k(εφ2)1/3\alpha=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3} and β=2k(εφ2)1/3\beta=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}, we have

|fx,fyμi|fx2fyμi21αkn4kεβφ21n=kn(εφ2)1/6,|\langle f_{x},f_{y}-\mu_{i}\rangle|\leq\|f_{x}\|_{2}\cdot\|f_{y}-\mu_{i}\|_{2}\leq\sqrt{\frac{1}{\alpha}\cdot\frac{k}{n}}\cdot\sqrt{\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}}=\frac{\sqrt{k}}{n}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/6},

which gives us that fx,fyμikn(εφ2)1/6\langle f_{x},f_{y}-\mu_{i}\rangle\geq-\frac{\sqrt{k}}{n}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/6}. Since xx is a good vertex, we have fx,μi0.96μi22\langle f_{x},\mu_{i}\rangle\geq 0.96\|\mu_{i}\|_{2}^{2}. Hence, it holds that

fx,fy\displaystyle\langle f_{x},f_{y}\rangle =fx,fyμi+fx,μi\displaystyle=\langle f_{x},f_{y}-\mu_{i}\rangle+\langle f_{x},\mu_{i}\rangle
0.96μi22kn(εφ2)1/6\displaystyle\geq 0.96\|\mu_{i}\|_{2}^{2}-\frac{\sqrt{k}}{n}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/6}
0.96(14εφ)1|Ci|kn(εφ2)1/6.\displaystyle\geq 0.96\left(1-\frac{4\sqrt{\varepsilon}}{\varphi}\right)\frac{1}{|C_{i}|}-\frac{\sqrt{k}}{n}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/6}.

The last inequality follows from item 1 of Lemma 4.7. ∎

The following lemma shows that if vertex xx and vertex yy are in different clusters and both of them are good vertices with respect to α\alpha and β\beta (α\alpha and β\beta should be chosen appropriately), then the spectral dot product fx,fy\langle f_{x},f_{y}\rangle is close to 0.

Lemma 4.8.

Let k2k\geq 2, φ(0,1)\varphi\in(0,1) and εφ2\frac{\varepsilon}{\varphi^{2}} be smaller than a sufficiently small constant. Let G=(V,E)G=(V,E) be a dd-regular nn-vertex graph that admits a (k,φ,ε)(k,\varphi,\varepsilon)-clustering C1,,CkC_{1},\dots,C_{k}. Suppose that xCix\in C_{i} and yCjy\in C_{j} (i,j[k]i,j\in[k], iji\neq j) and both of them are good vertices with respect to α=2k(εφ2)1/3\alpha=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3} and β=2k(εφ2)1/3\beta=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}. Then the following holds:

fx,fykn(εφ2)1/6+2k1/4n(εφ2)1/3(1+4εφ)1|Cj|+8εφ1|Ci||Cj|.\langle f_{x},f_{y}\rangle\leq\frac{\sqrt{k}}{n}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/6}+\frac{\sqrt{2}k^{1/4}}{\sqrt{n}}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/3}\cdot\sqrt{\left(1+\frac{4\sqrt{\varepsilon}}{\varphi}\right)\frac{1}{|C_{j}|}}+\frac{8\sqrt{\varepsilon}}{\varphi}\cdot\frac{1}{\sqrt{|C_{i}|\cdot|C_{j}|}}.
Proof.

By linearity of the inner product, we have

fx,fy\displaystyle\langle f_{x},f_{y}\rangle =fx,fyμj+μj\displaystyle=\langle f_{x},f_{y}-\mu_{j}+\mu_{j}\rangle
=fx,fyμj+fx,μj\displaystyle=\langle f_{x},f_{y}-\mu_{j}\rangle+\langle f_{x},\mu_{j}\rangle
=fx,fyμj+fxμi+μi,μj\displaystyle=\langle f_{x},f_{y}-\mu_{j}\rangle+\langle f_{x}-\mu_{i}+\mu_{i},\mu_{j}\rangle
=fx,fyμj+fxμi,μj+μi,μj.\displaystyle=\langle f_{x},f_{y}-\mu_{j}\rangle+\langle f_{x}-\mu_{i},\mu_{j}\rangle+\langle\mu_{i},\mu_{j}\rangle.

By the Cauchy-Schwarz inequality, we have

|fx,fyμj|fx2fyμj2|\langle f_{x},f_{y}-\mu_{j}\rangle|\leq\|f_{x}\|_{2}\cdot\|f_{y}-\mu_{j}\|_{2}

and

|fxμi,μj|μj2fxμi2.|\langle f_{x}-\mu_{i},\mu_{j}\rangle|\leq\|\mu_{j}\|_{2}\cdot\|f_{x}-\mu_{i}\|_{2}.

Since xx and yy are both good vertices with respect to α=2k(εφ2)1/3\alpha=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3} and β=2k(εφ2)1/3\beta=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}, we have

|fx,fyμj|fx2fyμj21αkn4kεβφ21n=kn(εφ2)1/6|\langle f_{x},f_{y}-\mu_{j}\rangle|\leq\|f_{x}\|_{2}\cdot\|f_{y}-\mu_{j}\|_{2}\leq\sqrt{\frac{1}{\alpha}\cdot\frac{k}{n}}\cdot\sqrt{\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}}=\frac{\sqrt{k}}{n}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/6}

and

|fxμi,μj|μj2fxμi2μj24kεβφ21n=2k1/4n(εφ2)1/3μj2.|\langle f_{x}-\mu_{i},\mu_{j}\rangle|\leq\|\mu_{j}\|_{2}\cdot\|f_{x}-\mu_{i}\|_{2}\leq\|\mu_{j}\|_{2}\cdot\sqrt{\frac{4k\varepsilon}{\beta\varphi^{2}}\cdot\frac{1}{n}}=\frac{\sqrt{2}k^{1/4}}{\sqrt{n}}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/3}\cdot\|\mu_{j}\|_{2}.

So we have

fx,fy\displaystyle\langle f_{x},f_{y}\rangle =fx,fyμj+fxμi,μj+μi,μj\displaystyle=\langle f_{x},f_{y}-\mu_{j}\rangle+\langle f_{x}-\mu_{i},\mu_{j}\rangle+\langle\mu_{i},\mu_{j}\rangle
fx2fyμj2+μj2fxμi2+μi,μj\displaystyle\leq\|f_{x}\|_{2}\cdot\|f_{y}-\mu_{j}\|_{2}+\|\mu_{j}\|_{2}\cdot\|f_{x}-\mu_{i}\|_{2}+\langle\mu_{i},\mu_{j}\rangle
kn(εφ2)1/6+2k1/4n(εφ2)1/3μj2+μi,μj\displaystyle\leq\frac{\sqrt{k}}{n}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/6}+\frac{\sqrt{2}k^{1/4}}{\sqrt{n}}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/3}\cdot\|\mu_{j}\|_{2}+\langle\mu_{i},\mu_{j}\rangle
kn(εφ2)1/6+2k1/4n(εφ2)1/3(1+4εφ)1|Cj|+8εφ1|Ci||Cj|.\displaystyle\leq\frac{\sqrt{k}}{n}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/6}+\frac{\sqrt{2}k^{1/4}}{\sqrt{n}}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/3}\cdot\sqrt{\left(1+\frac{4\sqrt{\varepsilon}}{\varphi}\right)\frac{1}{|C_{j}|}}+\frac{8\sqrt{\varepsilon}}{\varphi}\cdot\frac{1}{\sqrt{|C_{i}|\cdot|C_{j}|}}.

The last inequality follows from items 1 and 2 of Lemma 4.7. ∎
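Lemmas 4.6 and 4.8 together predict a gap between intra-cluster and inter-cluster spectral dot products. The sketch below (our own illustration; it computes exact eigenvectors of a small two-cluster stochastic block model rather than using the paper's sublinear-time oracle) exhibits this gap:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100                      # vertices per cluster, k = 2 clusters
p, q = 0.5, 0.02             # intra- / inter-cluster edge probabilities

# Symmetric SBM adjacency matrix (sample the upper triangle, then mirror).
blocks = np.block([[rng.random((m, m)) < p, rng.random((m, m)) < q],
                   [np.zeros((m, m), bool), rng.random((m, m)) < p]])
A = np.triu(blocks, 1)
A = (A | A.T).astype(float)

# Spectral embeddings: rows of the bottom-k eigenvectors of the normalized Laplacian.
deg = A.sum(axis=1)
L = np.eye(2 * m) - A / np.sqrt(np.outer(deg, deg))
_, vecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
F = vecs[:, :2]              # f_x = row x of F

intra = F[0] @ F[1]          # both vertices in cluster 1
inter = F[0] @ F[m + 1]      # vertices in different clusters
print(intra > inter)         # True: same-cluster embeddings align
```

For these parameters the intra-cluster dot product lands near 1/m1/m while the inter-cluster one is near 0, matching the two lemmas.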

4.2 Proof of Theorem 1

We now prove our main result, Theorem 1.

Proof.

Let s=10klogkγs=\frac{10k\log k}{\gamma} be the size of the sampling set SS, and let α=β=2k(εφ2)1/3\alpha=\beta=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}. Recall that we call a vertex xx a bad vertex if x(V\V^)(V\V~)(V\(i=1kCi~))x\in(V\backslash\widehat{V})\cup(V\backslash\widetilde{V})\cup(V\backslash(\cup_{i=1}^{k}\widetilde{C_{i}})), where V^,V~,Ci~,i[k]\widehat{V},\widetilde{V},\widetilde{C_{i}},i\in[k] are the sets defined in Lemmas 4.1, 4.2 and 4.4, respectively. We use BB to denote the set of all bad vertices. Then we have |B|(α+β+104εφ2)n=(4k(εφ2)1/3+104εφ2)n|B|\leq(\alpha+\beta+\frac{10^{4}\varepsilon}{\varphi^{2}})\cdot n=(4\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}+\frac{10^{4}\varepsilon}{\varphi^{2}})\cdot n.

Let κ=|B|/n\kappa=|B|/n be the fraction of bad vertices in VV, so that κ4k(εφ2)1/3+104εφ2\kappa\leq 4\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}+\frac{10^{4}\varepsilon}{\varphi^{2}}. Since εφ2<γ343109k92log3k\frac{\varepsilon}{\varphi^{2}}<\frac{\gamma^{3}}{4^{3}\cdot 10^{9}\cdot k^{\frac{9}{2}}\cdot\log^{3}k}, we have κ4k(εφ2)1/3+104εφ2γ103klogk+γ343105k92log3k2γ103klogk=150s.\kappa\leq 4\sqrt{k}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/3}+\frac{10^{4}\varepsilon}{\varphi^{2}}\leq\frac{\gamma}{10^{3}k\log k}+\frac{\gamma^{3}}{4^{3}\cdot 10^{5}\cdot k^{\frac{9}{2}}\log^{3}k}\leq\frac{2\gamma}{10^{3}k\log k}=\frac{1}{50s}.

Therefore, by a union bound, with probability at least 1κs1150ss=11501-\kappa\cdot s\geq 1-\frac{1}{50s}\cdot s=1-\frac{1}{50}, all the vertices in SS are good (we fixed α=β=2k(εφ2)1/3\alpha=\beta=2\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}, so we will omit “with respect to α\alpha and β\beta” in the following). In the rest of the proof, we assume all the vertices in SS are good.

Recall that for i[k],|Ci|γnki\in[k],|C_{i}|\geq\gamma\frac{n}{k}. So for each fixed i[k]i\in[k], with probability at least 1(1γk)s=1(11kγ)kγ10logk11k101150k1-(1-\frac{\gamma}{k})^{s}=1-(1-\frac{1}{\frac{k}{\gamma}})^{\frac{k}{\gamma}\cdot 10\log k}\geq 1-\frac{1}{k^{10}}\geq 1-\frac{1}{50k}, at least one vertex in SS is from CiC_{i}. By a union bound over the kk clusters, with probability at least 11501-\frac{1}{50}, SS contains at least one vertex from every cluster C1,,CkC_{1},\dots,C_{k}.

Let ξ=γ1000\xi=\frac{\sqrt{\gamma}}{1000}. By Lemma 2.1, we know that with probability at least 11n1001-\frac{1}{n^{100}}, for any pair of x,yVx,y\in V, SpectralDotProductOracle(G,x,y,1/2,ξ,𝒟G,x,y,1/2,\xi,\mathcal{D}) computes an output value fx,fyapx\langle f_{x},f_{y}\rangle_{\rm apx} such that |fx,fyapxfx,fy|ξn\Big{|}\langle f_{x},f_{y}\rangle_{\rm apx}-\langle f_{x},f_{y}\rangle\Big{|}\leq\frac{\xi}{n}. So, with probability at least 1ssn10011n501-\frac{s\cdot s}{n^{100}}\geq 1-\frac{1}{n^{50}}, for all pairs x,ySx,y\in S, SpectralDotProductOracle(G,x,y,1/2,ξ,𝒟G,x,y,1/2,\xi,\mathcal{D}) computes an output value fx,fyapx\langle f_{x},f_{y}\rangle_{\rm apx} such that |fx,fyapxfx,fy|ξn\Big{|}\langle f_{x},f_{y}\rangle_{\rm apx}-\langle f_{x},f_{y}\rangle\Big{|}\leq\frac{\xi}{n}. In the following, we will assume the above inequality holds for any x,ySx,y\in S.

By Lemma 4.6, we know that if x,yx,y are in the same cluster and both of them are good vertices, then we have fx,fy0.96(14εφ)1|Ci|kn(εφ2)1/60.96(14εφ)γknkn(εφ2)1/6\langle f_{x},f_{y}\rangle\geq 0.96(1-\frac{4\sqrt{\varepsilon}}{\varphi})\frac{1}{|C_{i}|}-\frac{\sqrt{k}}{n}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/6}\geq 0.96(1-\frac{4\sqrt{\varepsilon}}{\varphi})\frac{\gamma k}{n}-\frac{\sqrt{k}}{n}(\frac{\varepsilon}{\varphi^{2}})^{1/6} since |Ci|nγk|C_{i}|\leq\frac{n}{\gamma k}. By Lemma 4.8, we know that if x,yx,y are in different clusters and both of them are good vertices, then we have fx,fykn(εφ2)1/6+2k1/4n(εφ2)1/3(1+4εφ)1|Cj|+8εφ1|Ci||Cj|kn(εφ2)1/6+2γk3/4n1+4εφ(εφ2)1/3+8εφkγn\langle f_{x},f_{y}\rangle\leq\frac{\sqrt{k}}{n}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/6}+\frac{\sqrt{2}k^{1/4}}{\sqrt{n}}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}\cdot\sqrt{(1+\frac{4\sqrt{\varepsilon}}{\varphi})\frac{1}{|C_{j}|}}+\frac{8\sqrt{\varepsilon}}{\varphi}\cdot\frac{1}{\sqrt{|C_{i}|\cdot|C_{j}|}}\leq\frac{\sqrt{k}}{n}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/6}+\sqrt{\frac{2}{\gamma}}\frac{k^{3/4}}{n}\sqrt{1+\frac{4\sqrt{\varepsilon}}{\varphi}}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}+\frac{8\sqrt{\varepsilon}}{\varphi}\cdot\frac{k}{\gamma n} since γnk|Ci|\frac{\gamma n}{k}\leq|C_{i}| for all i[k]i\in[k].

Recall that εφ2<γ343109k92log3k\frac{\varepsilon}{\varphi^{2}}<\frac{\gamma^{3}}{4^{3}\cdot 10^{9}\cdot k^{\frac{9}{2}}\cdot\log^{3}k} and γ(0.001,1]\gamma\in(0.001,1]. Let θ=0.96(14εφ)γknkn(εφ2)1/6ξn\theta=0.96(1-\frac{4\sqrt{\varepsilon}}{\varphi})\frac{\gamma k}{n}-\frac{\sqrt{k}}{n}(\frac{\varepsilon}{\varphi^{2}})^{1/6}-\frac{\xi}{n}, then we have θ>γn(0.96γk0.48109/2k5/4log3/2k12103/2k1/4log1/2k11000)>0.034γn\theta>\frac{\sqrt{\gamma}}{n}\cdot(0.96\sqrt{\gamma}k-\frac{0.48}{10^{9/2}\cdot k^{5/4}\log^{3/2}k}-\frac{1}{2\cdot 10^{3/2}\cdot k^{1/4}\log^{1/2}k}-\frac{1}{1000})>0.034\cdot\frac{\sqrt{\gamma}}{n}. Suppose that all the vertices in SS are good and that SS contains at least one vertex from CiC_{i} for all i=1,,ki=1,\dots,k. Then for any x,ySx,y\in S:

  1. If x,yx,y belong to the same cluster, then by the above analysis, we know that fx,fy0.96(14εφ)γknkn(εφ2)1/6\langle f_{x},f_{y}\rangle\geq 0.96(1-\frac{4\sqrt{\varepsilon}}{\varphi})\frac{\gamma k}{n}-\frac{\sqrt{k}}{n}(\frac{\varepsilon}{\varphi^{2}})^{1/6}. Then it holds that fx,fyapxfx,fyξn0.96(14εφ)γknkn(εφ2)1/6ξn=θ\langle f_{x},f_{y}\rangle_{\rm apx}\geq\langle f_{x},f_{y}\rangle-\frac{\xi}{n}\geq 0.96(1-\frac{4\sqrt{\varepsilon}}{\varphi})\frac{\gamma k}{n}-\frac{\sqrt{k}}{n}(\frac{\varepsilon}{\varphi^{2}})^{1/6}-\frac{\xi}{n}=\theta. Thus, an edge (x,y)(x,y) will be added to HH (at lines 8 and 9 of Alg.1).

  2. If x,yx,y belong to two different clusters, then by the above analysis, we know that fx,fykn(εφ2)1/6+2γk3/4n1+4εφ(εφ2)1/3+8εφkγn\langle f_{x},f_{y}\rangle\leq\frac{\sqrt{k}}{n}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/6}+\sqrt{\frac{2}{\gamma}}\frac{k^{3/4}}{n}\sqrt{1+\frac{4\sqrt{\varepsilon}}{\varphi}}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}+\frac{8\sqrt{\varepsilon}}{\varphi}\cdot\frac{k}{\gamma n}. Then it holds that fx,fyapxfx,fy+ξnkn(εφ2)1/6+2γk3/4n1+4εφ(εφ2)1/3+8εφkγn+ξn<γn(12103/2k1/4log1/2k+12103k3/4logk+1109/2k5/4log3/2k+11000)<0.027γn<θ\langle f_{x},f_{y}\rangle_{\rm apx}\leq\langle f_{x},f_{y}\rangle+\frac{\xi}{n}\leq\frac{\sqrt{k}}{n}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/6}+\sqrt{\frac{2}{\gamma}}\frac{k^{3/4}}{n}\sqrt{1+\frac{4\sqrt{\varepsilon}}{\varphi}}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}+\frac{8\sqrt{\varepsilon}}{\varphi}\cdot\frac{k}{\gamma n}+\frac{\xi}{n}<\frac{\sqrt{\gamma}}{n}\cdot(\frac{1}{2\cdot 10^{3/2}\cdot k^{1/4}\log^{1/2}k}+\frac{1}{2\cdot 10^{3}\cdot k^{3/4}\log k}+\frac{1}{10^{9/2}\cdot k^{5/4}\log^{3/2}k}+\frac{1}{1000})<0.027\cdot\frac{\sqrt{\gamma}}{n}<\theta, since εφ2<γ343109k92log3k\frac{\varepsilon}{\varphi^{2}}<\frac{\gamma^{3}}{4^{3}\cdot 10^{9}\cdot k^{\frac{9}{2}}\cdot\log^{3}k} and ξ=γ1000\xi=\frac{\sqrt{\gamma}}{1000}. Thus, the edge (x,y)(x,y) will not be added to HH.

Therefore, with probability at least 11501501n500.951-\frac{1}{50}-\frac{1}{50}-\frac{1}{n^{50}}\geq 0.95, the similarity graph HH has the following properties: (1) all vertices in V(H)V(H) (i.e., SS) are good; (2) all vertices in SS that belong to the same cluster CiC_{i} form a connected component, denoted by SiS_{i}; (3) there is no edge between SiS_{i} and SjS_{j} for iji\neq j; (4) there are exactly kk connected components in HH, each corresponding to a cluster.

Now we are ready to consider a query WhichCluster(G,xG,x).

Assume xx is good. We use CxC_{x} to denote the cluster that xx belongs to. Since all the vertices in SS are good, fix any yCxSy\in C_{x}\cap S; then with probability at least 1sn10011n501-\frac{s}{n^{100}}\geq 1-\frac{1}{n^{50}}, by the above analysis, we have fx,fyapxfx,fyξnθ\langle f_{x},f_{y}\rangle_{\rm apx}\geq\langle f_{x},f_{y}\rangle-\frac{\xi}{n}\geq\theta. On the other hand, for any yS\Cxy\in S\backslash C_{x}, with probability at least 1sn10011n501-\frac{s}{n^{100}}\geq 1-\frac{1}{n^{50}}, by the above analysis, we have fx,fyapxfx,fy+ξn<θ\langle f_{x},f_{y}\rangle_{\rm apx}\leq\langle f_{x},f_{y}\rangle+\frac{\xi}{n}<\theta. Thus, WhichCluster(G,xG,x) will output the label of yCxSy\in C_{x}\cap S as xx’s label (at line 3 of Alg.2).

Therefore, with probability at least 11501501n50nn500.951-\frac{1}{50}-\frac{1}{50}-\frac{1}{n^{50}}-\frac{n}{n^{50}}\geq 0.95, all the good vertices will be correctly recovered. So the misclassified vertices come from BB. We know that |B|(α+β+104εφ2)n=(4k(εφ2)1/3+104εφ2)n.|B|\leq\left(\alpha+\beta+\frac{10^{4}\varepsilon}{\varphi^{2}}\right)\cdot n=\left(4\sqrt{k}\cdot\left(\frac{\varepsilon}{\varphi^{2}}\right)^{1/3}+\frac{10^{4}\varepsilon}{\varphi^{2}}\right)\cdot n. Since |Ci|γnk|C_{i}|\geq\frac{\gamma n}{k}, we have nkγ|Ci|n\leq\frac{k}{\gamma}|C_{i}|. So, |B|(4k(εφ2)1/3+104εφ2)kγ|Ci|O(k32γ(εφ2)1/3)|Ci||B|\leq(4\sqrt{k}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3}+\frac{10^{4}\varepsilon}{\varphi^{2}})\cdot\frac{k}{\gamma}|C_{i}|\leq O(\frac{k^{\frac{3}{2}}}{\gamma}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3})|C_{i}|. This implies that there exists a permutation π:[k][k]\pi:[k]\rightarrow[k] such that for all i[k]i\in[k], |Uπ(i)Ci|O(k32γ(εφ2)1/3)|Ci||U_{\pi(i)}\triangle C_{i}|\leq O(\frac{k^{\frac{3}{2}}}{\gamma}\cdot(\frac{\varepsilon}{\varphi^{2}})^{1/3})|C_{i}|.

Running time. By Lemma 2.1, we know that InitializeOracle(G,1/2,ξG,1/2,\xi) computes in time (kξ)O(1)n1/2+O(ε/φ2)(logn)31φ2(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot(\log n)^{3}\cdot\frac{1}{\varphi^{2}} a sublinear space data structure 𝒟\mathcal{D} and for every pair of vertices x,yVx,y\in V, SpectralDotProductOracle(G,x,y,1/2,ξ,𝒟G,x,y,1/2,\xi,\mathcal{D}) computes an output value fx,fyapx\langle f_{x},f_{y}\rangle_{\rm apx} in (kξ)O(1)n1/2+O(ε/φ2)(logn)21φ2(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot(\log n)^{2}\cdot\frac{1}{\varphi^{2}} time.

In the preprocessing phase, algorithm ConstructOracle(G,k,φ,ε,γG,k,\varphi,\varepsilon,\gamma) invokes InitializeOracle once to construct a data structure 𝒟\mathcal{D} and SpectralDotProductOracle sss\cdot s times to construct a similarity graph HH. So the preprocessing time of our oracle is (kξ)O(1)n1/2+O(ε/φ2)(logn)31φ2+100k2log2kγ2(kξ)O(1)n1/2+O(ε/φ2)(logn)21φ2=O(n1/2+O(ε/φ2)poly(klognγφ))(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot(\log n)^{3}\cdot\frac{1}{\varphi^{2}}+\frac{100k^{2}\log^{2}k}{\gamma^{2}}\cdot(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot(\log n)^{2}\cdot\frac{1}{\varphi^{2}}=O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\cdot\log n}{\gamma\varphi})).

In the query phase, to answer the cluster index that xx belongs to, algorithm WhichCluster(G,xG,x) invokes SpectralDotProductOracle ss times. So the query time of our oracle is 10klogkγ(kξ)O(1)n1/2+O(ε/φ2)(logn)21φ2=O(n1/2+O(ε/φ2)poly(klognγφ))\frac{10k\log k}{\gamma}\cdot(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot(\log n)^{2}\cdot\frac{1}{\varphi^{2}}=O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\cdot\log n}{\gamma\varphi})).

Space. By the proof of Lemma 2.1 in [14], we know that overall both algorithm InitializeOracle and SpectralDotProductOracle take (kξ)O(1)n1/2+O(ε/φ2)poly(logn)(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot\textrm{poly}(\log n) space. Therefore, overall the preprocessing phase takes (kξ)O(1)n1/2+O(ε/φ2)poly(logn)=O(n1/2+O(ε/φ2)poly(klognγ))(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot\textrm{poly}(\log n)=O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot\textrm{poly}(\frac{k\log n}{\gamma})) space (at lines 5 and 7 of Alg.1). In the query phase, a WhichCluster query takes (kξ)O(1)n1/2+O(ε/φ2)poly(logn)=O(n1/2+O(ε/φ2)poly(klognγ))(\frac{k}{\xi})^{O(1)}\cdot n^{1/2+O(\varepsilon/\varphi^{2})}\cdot\textrm{poly}(\log n)=O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot\textrm{poly}(\frac{k\log n}{\gamma})) space (at line 2 of Alg.2). ∎
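The preprocessing and query logic of the proof can be summarized in a short sketch. The code below is a simplified, non-sublinear stand-in (function names and the toy embeddings are our own): `dot_apx` stands for any estimator of the spectral dot product, which in the paper is SpectralDotProductOracle; here exact embeddings are used for illustration.

```python
import numpy as np
from itertools import combinations

def build_similarity_graph(sample, dot_apx, theta):
    """Connect sampled vertices whose approximate dot product clears theta,
    then label each connected component (one per latent cluster)."""
    parent = {x: x for x in sample}          # union-find over the sample

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for x, y in combinations(sample, 2):
        if dot_apx(x, y) >= theta:
            parent[find(x)] = find(y)
    return {x: find(x) for x in sample}      # vertex -> component label

def which_cluster(x, sample, labels, dot_apx, theta):
    """Return the label of any sampled vertex whose dot product with x clears theta."""
    for y in sample:
        if dot_apx(x, y) >= theta:
            return labels[y]
    return None

# Toy instance: two clusters of 4 vertices with well-separated embeddings.
F = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
dot = lambda x, y: float(F[x] @ F[y])
sample = [0, 1, 4, 5]
labels = build_similarity_graph(sample, dot, theta=0.5)
print(which_cluster(2, sample, labels, dot, 0.5) == labels[0])  # True
print(which_cluster(6, sample, labels, dot, 0.5) == labels[4])  # True
```

The quadratic pass over the sample mirrors the sss\cdot s oracle calls in the preprocessing phase, and each query makes at most ss calls, matching the analysis above.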

5 Experiments

To evaluate the performance of our oracle, we conducted experiments on random graphs generated by the Stochastic Block Model (SBM). In this model, we are given parameters p,qp,q and n,kn,k, where n,kn,k denote the number of vertices and the number of clusters respectively; pp denotes the probability that any pair of vertices within each cluster is connected by an edge, and qq denotes the probability that any pair of vertices from different clusters is connected by an edge. Setting pq>c\frac{p}{q}>c for a sufficiently large constant cc yields a well-clusterable graph. All experiments were implemented in Python 3.9 and performed on an Intel(R) Core(TM) i7-12700F CPU @ 2.10GHz processor with 32 GB RAM.
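A minimal generator matching this description can be sketched as follows (our own illustration; the paper does not prescribe an implementation):

```python
import numpy as np

def sbm(n, k, p, q, seed=0):
    """Adjacency matrix and ground-truth labels of an SBM graph
    with k equal-size clusters (n must be divisible by k)."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n // k)
    same = labels[:, None] == labels[None, :]          # same-cluster indicator
    probs = np.where(same, p, q)                       # per-pair edge probability
    upper = np.triu(rng.random((n, n)) < probs, 1)     # sample each pair once
    A = (upper | upper.T).astype(np.int8)              # symmetrize, no self-loops
    return A, labels

A, labels = sbm(n=300, k=3, p=0.3, q=0.01)
intra_deg = (A * (labels[:, None] == labels[None, :])).sum(axis=1).mean()
inter_deg = (A * (labels[:, None] != labels[None, :])).sum(axis=1).mean()
print(intra_deg > inter_deg)  # True: clusters are internally dense
```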

Figure 2: For a random graph GG generated by SBM with n=3000,k=3,p=0.03,q=0.002n=3000,k=3,p=0.03,q=0.002, we build the dot product oracle for several different parameters for t,s,Rinit,Rqueryt,s,R_{init},R_{query} and plot the density graph. The setting with the most prominent gap in the density graph, i.e., the one on the right, is selected. We can further set θ=0.0005\theta=0.0005 for GG according to the right graph.

Practical changes to our oracle.

In order to implement our oracle, we need to make some modifications to the theoretical algorithms. To adapt the dot product oracle parameters (see Algorithm 7 and Algorithm 8 in Appendix A), i.e., tt (random walk length), ss (sampling set size), and RinitR_{init}, RqueryR_{query} (number of random walks), we exploit the theoretical gap between intra-cluster and inter-cluster dot products in clusterable graphs. Given a clusterable graph GG, by constructing the dot product oracle with various parameter settings and calculating some intra-cluster and inter-cluster dot products, we generate density graphs. The setting with the most prominent gap in the density graph is selected (see Figure 2 for an illustrative example).

Determining the appropriate threshold θ\theta (at lines 2, 8, 9 of Alg.1 and line 3 of Alg.2) is the next step. By inspecting the density graph associated with the chosen dot product oracle parameters, we identify an appropriate θ\theta (see Figure 2 for an illustrative example).

Determining the appropriate sampling set size ss (at line 3 of Alg.1) of our oracle is the final step. Given a graph G=(V,E)G=(V,E) generated by SBM, we know the ground-truth cluster of every vertex in VV. We build our clustering oracle for several choices of ss. For each parameter setting, we run WhichCluster(G,xG,x) for some xVx\in V and check whether xx was classified correctly. We pick the parameter setting with the most correct answers.

Misclassification error.

To evaluate the misclassification error of our oracle, we conducted the following experiment. We set k=3k=3, n=3000n=3000, q=0.002q=0.002, p[0.02,0.07]p\in[0.02,0.07] in the SBM, where each cluster has 10001000 vertices. For each graph G=(V,E)G=(V,E), we run WhichCluster(G,xG,x) for all xVx\in V and get a kk-partition U1,,UkU_{1},\dots,U_{k} of VV. In the experiments, the misclassification error of our oracle is defined to be 11nmaxπi=1k|Uπ(i)Ci|1-\frac{1}{n}\cdot\max_{\pi}{\sum_{i=1}^{k}{|U_{\pi(i)}\cap C_{i}|}}, where π:[k][k]\pi:[k]\rightarrow[k] is a permutation and C1,,CkC_{1},\dots,C_{k} are the ground-truth clusters of GG.
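For small kk this error measure can be computed directly by maximizing over all label permutations. A sketch (our own helper, operating on label arrays rather than the paper's oracle output):

```python
from itertools import permutations

def misclassification_error(pred, truth, k):
    """1 - (1/n) * max over permutations pi of sum_i |U_pi(i) ∩ C_i|,
    where U_j = {x : pred[x] = j} and C_i = {x : truth[x] = i}."""
    n = len(pred)
    best = 0
    for pi in permutations(range(k)):
        agree = sum(1 for u, c in zip(pred, truth) if pi[u] == c)
        best = max(best, agree)
    return 1 - best / n

truth = [0] * 4 + [1] * 4 + [2] * 4
pred = [2] * 4 + [0] * 4 + [1] * 3 + [0]   # relabeled clusters, one mistake
print(misclassification_error(pred, truth, 3))  # 1/12 ≈ 0.0833
```

The brute-force maximum over k!k! permutations is fine for k=3k=3; for larger kk the same maximum can be found with a bipartite matching instead.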

Moreover, we implemented the oracle in [8] to compare with our oracle. (We remark that the oracle is implicit in [8]; see also [30]. Instead of using the inner product of spectral embeddings of vertex pairs, the authors of [8] used the pairwise 2\ell_{2}-distance between the distributions of two random walks starting from the two corresponding vertices.) The oracle in [8] can be seen as a non-robust version of the oracle in [30]. Note that our primary advancement over [8] (and [30]) is the significantly reduced conductance gap we achieve.

We did not compare with the oracle in [14], since implementing it poses challenges. As described in Section 3.1, the oracle in [14] initially approximates the kk cluster centers by sampling O(1/εk4logk)O(1/\varepsilon\cdot k^{4}\log k) vertices, and subsequently enumerates approximately 2O(1/εk4log2k)2^{O(1/\varepsilon\cdot k^{4}\log^{2}k)} candidate kk-partitions (Algorithm 10 in [14]). This enumeration is extremely time-intensive and becomes impractical even for modest values of kk.

According to the results of our experiment (Table 1), the misclassification error of our oracle is quite small when p0.025p\geq 0.025, and drops to 0 when p0.035p\geq 0.035. These outcomes demonstrate that our oracle's misclassification error remains notably small whenever the input graph has an underlying latent cluster structure. In addition, Table 1 shows that our oracle can handle graphs with a smaller conductance gap than the oracle in [8], which is consistent with the theoretical results. This empirical validation reinforces the practical utility of our oracle beyond the theoretical guarantees.

Table 1: The misclassification error of the oracle in [8] and our oracle
pp 0.02 0.025 0.03 0.035 0.04 0.05 0.06 0.07
min-error1 ([8]) - 0.5570 0.1677 0.0363 0.0173 0.0010 0 0
max-error1 ([8]) - 0.6607 0.6610 0.6533 0.4510 0.0773 0.0227 0.0013
average-error1 ([8]) - 0.6208 0.4970 0.1996 0.0829 0.0168 0.0030 0.0003
error (our) 0.2273 0.0113 0.0003 0 0 0 0 0
  1 In the experiment, we found that the misclassification error of the oracle in [8] fluctuated greatly, so for each value of pp, we conducted 30 independent experiments and recorded the minimum, maximum and average error of the oracle in [8].

Query complexity.

We conducted an experiment on an SBM graph with k=3,n=15000,q=0.002,p=0.2k=3,n=15000,q=0.002,p=0.2. We calculated the fraction of edges accessed for a given number of invocations of WhichCluster(G,xG,x) (Table 2). (Note that there is a trade-off between computational cost and clustering quality; we point out that the parameters of this experiment are set reasonably and the resulting misclassification error is 0.) Table 2 shows that as long as the number of WhichCluster queries is not too large, our algorithm reads only a small portion of the input graph.

The above experiment shows that for a small target misclassification error, our algorithms only require a sublinear amount of data, which is often critical when analyzing large social networks, since one typically does not have access to the entire network.

Table 2: The fraction of accessed edges of queries
# queries 0 50 100 200 400 800 1600 3200
fraction 0.1277 0.2539 0.3637 0.5377 0.7517 0.9273 0.9929 0.9999

Running time.

To evaluate the running time of our oracle, we conducted this experiment on random graphs generated by the SBM with n=3000, k=3, q=0.002 and p ∈ [0.02, 0.06]. Note that there is a trade-off between running time and clustering quality; in this experiment, we set the parameters the same as in the misclassification-error experiment, which ensures a small error. We recorded the time to construct the similarity graph H as construct-time. For each p, we queried all vertices of the input graph and recorded the average time over the n=3000 queries as query-time (Table 3).

Table 3: The average query time of our oracle
pp 0.02 0.025 0.03 0.035 0.04 0.05 0.06
construct-time (s) 11.6533 12.4221 11.8358 11.6097 12.2473 12.2124 12.5851
query-time (s) 0.3504 0.4468 0.4446 0.4638 0.4650 0.4751 0.4874

Robustness experiment.

The base graph G_0=(V,E) is generated from the SBM with n=3000, k=3, p=0.05, q=0.002. Note that randomly deleting edges within each cluster is equivalent to reducing p, and randomly deleting edges between clusters is equivalent to reducing q, so we consider the worst case. We randomly choose one vertex from each cluster; for each selected vertex x_i, we randomly delete delNum edges connected to x_i in cluster C_i. If x_i has fewer than delNum neighbors within C_i, we delete all the edges incident to x_i in C_i. We then run WhichCluster queries for all vertices in V on the resulting graph. We repeated this process five times for each value of delNum and recorded the average misclassification error (Table 4). The results show that our oracle is robust against a small number of random edge deletions.

Table 4: The average misclassification error with respect to delNum random edge deletions
delNum 25 32 40 45 50 55 60 65
error 0.00007 0.00007 0.00013 0.00047 0.00080 0.00080 0.00080 0.00087
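The deletion procedure of this experiment can be sketched as follows; this is our illustrative code, with `delete_incident_edges` a hypothetical helper operating on adjacency lists:

```python
import random

def delete_incident_edges(adj, clusters, del_num, seed=0):
    """For one randomly chosen vertex per cluster, delete del_num random
    intra-cluster edges incident to it (all of them if it has fewer),
    mirroring the robustness experiment. Mutates and returns adj."""
    rng = random.Random(seed)
    for cluster in clusters:
        x = rng.choice(cluster)
        in_cluster = set(cluster)
        nbrs = [y for y in adj[x] if y in in_cluster]
        rng.shuffle(nbrs)
        for y in nbrs[:del_num]:
            adj[x].remove(y)
            adj[y].remove(x)
    return adj
```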

6 Conclusion

We have devised a new spectral clustering oracle with sublinear preprocessing and query time. In comparison to the approach presented in [14], our oracle exhibits improved preprocessing efficiency, albeit with a slightly higher misclassification error rate. Furthermore, our oracle can be readily implemented in practice, whereas the clustering oracle proposed in [14] poses challenges in terms of implementation feasibility. To obtain our oracle, we established a property of the spectral embeddings of the vertices of a d-bounded n-vertex graph G=(V,E) that admits a (k,φ,ε)-clustering C_1,…,C_k: if x and y belong to the same cluster, the dot product of their spectral embeddings, denoted ⟨f_x, f_y⟩, is on the order of k/n, whereas if x and y belong to different clusters, ⟨f_x, f_y⟩ is close to 0. We also showed that our clustering oracle is robust against a few random edge deletions, and we conducted experiments on synthetic networks to validate our theoretical results.

Acknowledgments and Disclosure of Funding

The work is supported in part by the Huawei-USTC Joint Innovation Project on Fundamental System Software, NSFC grant 62272431 and “the Fundamental Research Funds for the Central Universities”.

References

  • [1] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank vectors. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 475–486. IEEE, 2006.
  • [2] Reid Andersen, Shayan Oveis Gharan, Yuval Peres, and Luca Trevisan. Almost optimal local graph clustering using evolving sets. Journal of the ACM (JACM), 63(2):1–31, 2016.
  • [3] Reid Andersen and Yuval Peres. Finding sparse cuts locally using evolving sets. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 235–244, 2009.
  • [4] T Tony Cai and Xiaodong Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. The Annals of Statistics, 43(3):1027–1059, 2015.
  • [5] Yudong Chen, Ali Jalali, Sujay Sanghavi, and Huan Xu. Clustering partially observed graphs via convex optimization. The Journal of Machine Learning Research, 15(1):2213–2238, 2014.
  • [6] Ashish Chiplunkar, Michael Kapralov, Sanjeev Khanna, Aida Mousavifar, and Yuval Peres. Testing graph clusterability: Algorithms and lower bounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 497–508. IEEE, 2018.
  • [7] Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997.
  • [8] Artur Czumaj, Pan Peng, and Christian Sohler. Testing cluster structure of graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 723–732, 2015.
  • [9] Tamal K Dey, Pan Peng, Alfred Rossi, and Anastasios Sidiropoulos. Spectral concentration and greedy k-clustering. Computational Geometry, 76:19–32, 2019.
  • [10] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd, volume 96, pages 226–231, 1996.
  • [11] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. International journal of computer vision, 59:167–181, 2004.
  • [12] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
  • [13] Amir Globerson, Tim Roughgarden, David Sontag, and Cafer Yildirim. Tight error bounds for structured prediction. arXiv preprint arXiv:1409.5834, 2014.
  • [14] Grzegorz Gluch, Michael Kapralov, Silvio Lattanzi, Aida Mousavifar, and Christian Sohler. Spectral clustering oracles in sublinear time. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1598–1617. SIAM, 2021.
  • [15] Olivier Guédon and Roman Vershynin. Community detection in sparse networks via Grothendieck’s inequality. Probability Theory and Related Fields, 165(3-4):1025–1049, 2016.
  • [16] Michael Kapralov, Akash Kumar, Silvio Lattanzi, and Aida Mousavifar. Learning hierarchical structure of clusterable graphs. CoRR, abs/2207.02581, 2022.
  • [17] Michael Kapralov, Akash Kumar, Silvio Lattanzi, and Aida Mousavifar. Learning hierarchical cluster structure of graphs in sublinear time. In Nikhil Bansal and Viswanath Nagarajan, editors, Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, Florence, Italy, January 22-25, 2023, pages 925–939. SIAM, 2023.
  • [18] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Approximation algorithms for semi-random partitioning problems. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 367–384. ACM, 2012.
  • [19] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Constant factor approximation for balanced cut in the pie model. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 41–49. ACM, 2014.
  • [20] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Correlation clustering with noisy partial information. In Proceedings of The 28th Conference on Learning Theory, pages 1321–1342, 2015.
  • [21] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Learning communities in the presence of errors. In Conference on Learning Theory, pages 1258–1291, 2016.
  • [22] Bogdan-Adrian Manghiuc and He Sun. Hierarchical clustering: o(1)o(1)-approximation for well-clustered graphs. Advances in Neural Information Processing Systems, 34:9278–9289, 2021.
  • [23] Claire Mathieu and Warren Schudy. Correlation clustering with noisy input. In Proceedings of the twenty-first annual ACM-SIAM symposium on discrete algorithms, pages 712–728. Society for Industrial and Applied Mathematics, 2010.
  • [24] Ankur Moitra, William Perry, and Alexander S Wein. How robust are reconstruction thresholds for community detection? In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 828–841. ACM, 2016.
  • [25] Stefan Neumann and Pan Peng. Sublinear-time clustering oracle for signed graphs. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 16496–16528. PMLR, 2022.
  • [26] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.
  • [27] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14, 2001.
  • [28] Lorenzo Orecchia and Zeyuan Allen Zhu. Flow-based algorithms for local graph clustering. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1267–1286. SIAM, 2014.
  • [29] Alberto Paccanaro, James A Casbon, and Mansoor AS Saqi. Spectral clustering of protein sequences. Nucleic acids research, 34(5):1571–1580, 2006.
  • [30] Pan Peng. Robust clustering oracle and local reconstructor of cluster structure of graphs. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2953–2972. SIAM, 2020.
  • [31] Pan Peng and Yuichi Yoshida. Average sensitivity of spectral clustering. In Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash, editors, KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 1132–1140. ACM, 2020.
  • [32] Mason A Porter, Jukka-Pekka Onnela, Peter J Mucha, et al. Communities in networks. Notices of the AMS, 56(9):1082–1097, 2009.
  • [33] Daniel A Spielman and Shang-Hua Teng. A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM Journal on computing, 42(1):1–26, 2013.
  • [34] Zeyuan Allen Zhu, Silvio Lattanzi, and Vahab Mirrokni. A local algorithm for finding well-connected clusters. In International Conference on Machine Learning, pages 396–404. PMLR, 2013.

Appendix

Appendix A Missing algorithms from Sections 2 and 3

1: Run R random walks of length t starting from x
2: Let m̂_x(y) be the fraction of these walks that end at y
3: return m̂_x
Algorithm 4: RunRandomWalks(G, R, t, x)
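A minimal Python rendering of Algorithm 4 on an adjacency-list representation (our sketch; the use of lazy walks and the `seed` parameter are assumptions, not taken from the paper's implementation):

```python
import random
from collections import Counter

def run_random_walks(adj, R, t, x, seed=0):
    """Empirical endpoint distribution of R lazy random walks of length t
    started at x: an estimate m_hat_x of the t-step walk distribution."""
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(R):
        u = x
        for _ in range(t):
            if rng.random() < 0.5 or not adj[u]:  # lazy step: stay put
                continue
            u = rng.choice(adj[u])
        ends[u] += 1
    return {y: c / R for y, c in ends.items()}
```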
1: for each sample x ∈ I_S do
2:     m̂_x := RunRandomWalks(G, R, t, x)
3: end for
4: Let Q̂ be the matrix whose columns are the vectors m̂_x for x ∈ I_S
5: return Q̂
Algorithm 5: EstimateTransitionMatrix(G, I_S, R, t)
1: for i = 1 to O(log n) do
2:     Q̂_i := EstimateTransitionMatrix(G, I_S, R, t)
3:     P̂_i := EstimateTransitionMatrix(G, I_S, R, t)
4:     𝒢_i := (1/2)(P̂_i^T Q̂_i + Q̂_i^T P̂_i)
5: end for
6: Let 𝒢 be the matrix obtained by taking the entrywise median of the 𝒢_i's
7: return 𝒢
Algorithm 6: EstimateCollisionProbabilities(G, I_S, R, t)
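The entrywise-median aggregation used in Algorithm 6 (and again in Algorithm 8) can be sketched as follows, with matrices represented as plain lists of lists (our illustrative code):

```python
from statistics import median

def entrywise_median(mats):
    """Entrywise median of a list of equal-shape matrices, as used to
    aggregate the O(log n) independent estimates G_i into G."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[median(m[i][j] for m in mats) for j in range(cols)]
            for i in range(rows)]
```

Taking the median entry-by-entry over independent repetitions boosts the success probability of each individual estimate, which is why the algorithms repeat their estimation O(log n) times.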
1: t := 20·log n / φ²
2: R_init := O(n^{1−δ+980ε/φ²} · k¹⁷ / ξ²)
3: s := O(n^{480ε/φ²} · log n · k⁸ / ξ²)
4: Let I_S be the multiset of s indices chosen independently and uniformly at random from {1, …, n}
5: for i = 1 to O(log n) do
6:     Q̂_i := EstimateTransitionMatrix(G, I_S, R_init, t)
7: end for
8: 𝒢 := EstimateCollisionProbabilities(G, I_S, R_init, t)
9: Let (n/s)·𝒢 = ŴΣ̂Ŵ^T be the eigendecomposition of (n/s)·𝒢
10: if Σ̂⁻¹ exists then
11:     Ψ := (n/s)·Ŵ_[k] Σ̂_[k]⁻² Ŵ_[k]^T
12:     return 𝒟 := {Ψ, Q̂_1, …, Q̂_{O(log n)}}
13: end if
Algorithm 7: InitializeOracle(G, δ, ξ)    Need: ε/φ² ≤ 1/10⁵
1: R_query := O(n^{δ+500ε/φ²} · k⁹ / ξ²)
2: for i = 1 to O(log n) do
3:     m̂_x^i := RunRandomWalks(G, R_query, t, x)
4:     m̂_y^i := RunRandomWalks(G, R_query, t, y)
5: end for
6: Let α_x be the vector obtained by taking the entrywise median of (Q̂_i)^T m̂_x^i over all runs
7: Let α_y be the vector obtained by taking the entrywise median of (Q̂_i)^T m̂_y^i over all runs
8: return ⟨f_x, f_y⟩_apx := α_x^T Ψ α_y
Algorithm 8: SpectralDotProductOracle(G, x, y, δ, ξ, 𝒟)    Need: ε/φ² ≤ 1/10⁵
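The estimate returned by Algorithm 8 is the bilinear form α_x^T Ψ α_y. A plain-Python sketch of this final step (ours, operating on lists):

```python
def dot_product_estimate(alpha_x, Psi, alpha_y):
    """Bilinear form alpha_x^T Psi alpha_y, the value returned by
    SpectralDotProductOracle as the estimate of <f_x, f_y>."""
    s = len(alpha_x)
    return sum(alpha_x[i] * Psi[i][j] * alpha_y[j]
               for i in range(s) for j in range(s))
```

When Ψ is the identity, this reduces to the ordinary dot product of α_x and α_y; in the oracle, Ψ reweights the coordinates using the eigendecomposition computed in Algorithm 7.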

Appendix B Formal statement of Theorem 2 and proof

Theorem 2 (Formal; Robust against random edge deletions).

Let k ≥ 2 be an integer and φ ∈ (0,1). Let G_0=(V,E_0) be a d-regular n-vertex graph that admits a (k,φ,ε)-clustering C_1,…,C_k with ε/φ⁴ ≪ γ³/(k^{9/2}·log³ k) and γ·(n/k) ≤ |C_i| ≤ n/(γk) for all i ∈ [k], where γ ∈ (0.001, 1] is a constant.

  1. 1.

    Let GG be a graph obtained from G0G_{0} by deleting at most cc edges in each cluster, where cc is a constant. If 0cdφ22100\leq c\leq\frac{d\varphi^{2}}{2\sqrt{10}}, then there exists an algorithm that has query access to the adjacency list of GG and constructs a clustering oracle in O(n1/2+O(ε/φ2)poly(klognγφ))O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma\varphi})) preprocessing time and takes O(n1/2+O(ε/φ2)poly(klognγ))O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma})) space. Furthermore, with probability at least 0.950.95, the following hold:

    1. 1).

      Using the oracle, the algorithm can answer any WhichCluster query in O(n1/2+O(ε/φ2)poly(klognγφ))O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma\varphi})) time and a WhichCluster query takes O(n1/2+O(ε/φ2)poly(klognγ))O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma})) space.

    2. 2).

      Let Ui:={xV:WhichCluster(G,x)=i},i[k]U_{i}:=\{x\in V:\textsc{WhichCluster}(G,x)=i\},i\in[k] be the clusters recovered by the algorithm. There exists a permutation π:[k][k]\pi:[k]\rightarrow[k] such that for all i[k]i\in[k], |Uπ(i)Ci|O(k32γ(εφ4)1/3)|Ci||U_{\pi(i)}\triangle C_{i}|\leq O(\frac{k^{\frac{3}{2}}}{\gamma}\cdot(\frac{\varepsilon}{\varphi^{4}})^{1/3})|C_{i}|.

  2. 2.

    Let G be a graph obtained from G_0 by randomly deleting at most O(kd²/(log k + d)) edges of G_0. Then with probability at least 1 − 1/k², there exists an algorithm that has query access to the adjacency list of G and constructs a clustering oracle in O(n^{1/2+O(ε/φ²)}·poly(k log n/(γφ))) preprocessing time and O(n^{1/2+O(ε/φ²)}·poly(k log n/γ)) space. Furthermore, with probability at least 0.95, the following hold:

    1. 1).

      Using the oracle, the algorithm can answer any WhichCluster query in O(n1/2+O(ε/φ2)poly(klognγφ))O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma\varphi})) time and a WhichCluster query takes O(n1/2+O(ε/φ2)poly(klognγ))O(n^{1/2+O(\varepsilon/\varphi^{2})}\cdot{\rm poly}(\frac{k\log n}{\gamma})) space.

    2. 2).

      Let Ui:={xV:WhichCluster(G,x)=i},i[k]U_{i}:=\{x\in V:\textsc{WhichCluster}(G,x)=i\},i\in[k] be the clusters recovered by the algorithm. There exists a permutation π:[k][k]\pi:[k]\rightarrow[k] such that for all i[k]i\in[k], |Uπ(i)Ci|O(k32γ(εφ4)1/3)|Ci||U_{\pi(i)}\triangle C_{i}|\leq O(\frac{k^{\frac{3}{2}}}{\gamma}\cdot(\frac{\varepsilon}{\varphi^{4}})^{1/3})|C_{i}|.

To prove Theorem 2, we need the following lemmas.

Lemma B.1 (Cheeger’s Inequality).

It holds for any graph G that

λ22ϕ(G)2λ2.\frac{\lambda_{2}}{2}\leq\phi(G)\leq\sqrt{2\lambda_{2}}.
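As a concrete sanity check (our example, not from the paper), consider the even cycle C_n: its normalized Laplacian has eigenvalues 1 − cos(2πj/n) for j = 0, …, n−1, and the balanced cut into two arcs has 2 cut edges and volume n, so

```latex
\lambda_2(C_n) \;=\; 1-\cos\!\Big(\frac{2\pi}{n}\Big),
\qquad
\phi(C_n) \;=\; \frac{2}{n}.
```

For n = 8 this gives λ_2 ≈ 0.293, hence λ_2/2 ≈ 0.146 ≤ φ(C_8) = 0.25 ≤ √(2λ_2) ≈ 0.765, as the lemma requires.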
Lemma B.2 (Weyl’s Inequality).

Let A,Bn×nA,B\in\mathbb{R}^{n\times n} be symmetric matrices. Let α1,,αn\alpha_{1},\dots,\alpha_{n} and β1,,βn\beta_{1},\dots,\beta_{n} be the eigenvalues of AA and BB respectively. Then for any i[n]i\in[n], we have

|αiβi|AB,\left|\alpha_{i}-\beta_{i}\right|\leq\|A-B\|,

where AB\|A-B\| is the spectral norm of ABA-B.
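A quick numerical illustration of Lemma B.2 for 2×2 symmetric matrices, using the closed-form eigenvalues (our example, not from the paper):

```python
import math

def eig_sym2(a, b, c):
    """Eigenvalues (ascending) of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean, r = (a + c) / 2, math.hypot((a - c) / 2, b)
    return mean - r, mean + r

# Perturb A by B; Weyl's inequality bounds each eigenvalue shift by ||B||.
A = (2.0, 1.0, 0.0)        # [[2, 1], [1, 0]]
B = (0.1, 0.05, 0.2)       # perturbation [[0.1, 0.05], [0.05, 0.2]]
A_pert = tuple(x + y for x, y in zip(A, B))
shifts = [abs(x - y) for x, y in zip(eig_sym2(*A), eig_sym2(*A_pert))]
spectral_norm_B = max(abs(e) for e in eig_sym2(*B))
assert all(s <= spectral_norm_B for s in shifts)
```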

Proof of Theorem 2.

Proof of item 1: For any dd-bounded graph G=(V,E)G^{\prime}=(V,E), we can get a dd-regular graph GG from GG^{\prime} by adding ddeg(x)d-\deg(x) self-loops with weight 1/21/2 to each vertex xVx\in V. Then according to [7], the normalized Laplacian of GG (denoted as LL) satisfies

L(x,y)={1wself(x)difx=y1difxyand(x,y)E0otherwise.L(x,y)=\begin{cases}1-\frac{w_{self}(x)}{d}&\mbox{if}\quad x=y\\ -\frac{1}{d}&\mbox{if}\quad x\neq y\quad\mbox{and}\quad(x,y)\in E\\ 0&\mbox{otherwise}\end{cases}.
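The case formula above can be rendered directly in code. The sketch below is ours (`normalized_laplacian` is a hypothetical helper on adjacency lists of simple d-bounded graphs); with each self-loop of weight 1/2 counted twice toward the degree, w_self(x) = (d − deg(x))/2:

```python
def normalized_laplacian(adj, d):
    """Normalized Laplacian of the d-regularized version of a d-bounded
    graph, following the case formula: L[x][x] = 1 - w_self(x)/d and
    L[x][y] = -1/d for each edge (x, y)."""
    n = len(adj)
    L = [[0.0] * n for _ in range(n)]
    for x in range(n):
        w_self = (d - len(adj[x])) / 2  # total self-loop weight at x
        L[x][x] = 1.0 - w_self / d
        for y in adj[x]:
            L[x][y] = -1.0 / d
    return L
```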

Let G_0=(V,E_0) be a d-regular graph that admits a (k,φ,ε)-clustering C_1,…,C_k. Consider a cluster C_i, i ∈ [k]. Let C_i^0 be the d-regular graph obtained by adding d − deg_i(x) self-loops to each vertex x ∈ C_i, where deg_i(x) is the degree of x in the subgraph induced by C_i. For j ∈ [c], let C_i^j be the graph obtained from C_i^{j−1} by (1) randomly deleting one edge (u,v) ∈ E(C_i^{j−1}), where E(C_i^{j−1}) is the set of edges with both endpoints in C_i^{j−1}, and (2) making the resulting subgraph d-regular again. Let L_i^j be the normalized Laplacian of C_i^j for j = 0, 1, …, c, and let H_i^j = L_i^j − L_i^{j−1} for j ∈ [c]. Then if u ≠ v, we have

Hij(x,y)={1difx=u,y=vorx=v,y=u12difx=y=uorx=y=v0otherwise,H_{i}^{j}(x,y)=\begin{cases}\frac{1}{d}&\mbox{if}\quad x=u,y=v\quad\mbox{or}\quad x=v,y=u\\ -\frac{1}{2d}&\mbox{if}\quad x=y=u\quad\mbox{or}\quad x=y=v\\ 0&\mbox{otherwise}\end{cases},

and if u = v, H_i^j is the all-zero matrix. Since the spectral norm of a symmetric matrix is at most its Frobenius norm, we have ‖H_i^j‖ ≤ ‖H_i^j‖_F = √(2·(1/d²) + 2·(1/(4d²))) = √(5/(2d²)) = √10/(2d) for all i ∈ [k], j ∈ [c]. Let H_i = Σ_{j=1}^c H_i^j = L_i^c − L_i^0; then ‖H_i‖ ≤ (√10/(2d))·c for i ∈ [k]. Let λ_2(L_i^0) and λ_2(L_i^c) be the second smallest eigenvalues of L_i^0 and L_i^c respectively. By Lemma B.2, we have |λ_2(L_i^c) − λ_2(L_i^0)| ≤ ‖H_i‖ ≤ (√10/(2d))·c, which gives λ_2(L_i^c) ≥ λ_2(L_i^0) − (√10/(2d))·c for i ∈ [k]. By Lemma B.1 and the precondition that c ≤ dφ²/(2√10), we have λ_2(L_i^0) ≥ φ²/2 ≥ (√10/d)·c. Therefore,

λ2(Lic)\displaystyle\lambda_{2}(L_{i}^{c}) λ2(Li0)102dc\displaystyle\geq\lambda_{2}(L_{i}^{0})-\frac{\sqrt{10}}{2d}\cdot c
=12λ2(Li0)+12λ2(Li0)102dc\displaystyle=\frac{1}{2}\lambda_{2}(L_{i}^{0})+\frac{1}{2}\lambda_{2}(L_{i}^{0})-\frac{\sqrt{10}}{2d}\cdot c
12λ2(Li0)+102dc102dc\displaystyle\geq\frac{1}{2}\lambda_{2}(L_{i}^{0})+\frac{\sqrt{10}}{2d}\cdot c-\frac{\sqrt{10}}{2d}\cdot c
=12λ2(Li0).\displaystyle=\frac{1}{2}\lambda_{2}(L_{i}^{0}).

Again by Lemma B.1, for the graph C_i^c we have φ_in(C_i^c) = φ(C_i^c) ≥ (1/2)λ_2(L_i^c) ≥ (1/4)λ_2(L_i^0) ≥ φ²/8 for i ∈ [k]. Note that we slightly abuse the notation C_i^c: for φ(C_i^c) we treat C_i^c as a d-regular graph, while for φ_in(C_i^c) we treat C_i^c as the cluster obtained by deleting c edges from E(C_i). So, for a (k,φ,ε)-clusterable graph G_0=(V,E_0), if we delete at most c ≤ dφ²/(2√10) edges in each cluster, the resulting graph G is (k, φ²/8, ε)-clusterable. Since ε/φ⁴ ≪ γ³/(k^{9/2}·log³ k), we also have ε/(φ²/8)² ≪ γ³/(k^{9/2}·log³ k). The statement of item 1 of this theorem then follows from the same arguments as those in the proof of Theorem 1, with parameter φ' = φ²/8 in G.
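The Frobenius-norm computation in the proof of item 1 (two off-diagonal entries of 1/d and two diagonal entries of −1/(2d) per deleted edge) can be checked numerically; this is our sketch, not part of the proof:

```python
import math

def deletion_perturbation_frobenius(d):
    """Frobenius norm of H_i^j, the Laplacian perturbation from deleting
    one edge (u, v): entries 1/d at (u, v) and (v, u), and -1/(2d) at
    (u, u) and (v, v); all other entries are zero."""
    entries = [1 / d, 1 / d, -1 / (2 * d), -1 / (2 * d)]
    return math.sqrt(sum(e * e for e in entries))

# agrees with the closed form sqrt(10) / (2d) derived in the proof
assert abs(deletion_perturbation_frobenius(3) - math.sqrt(10) / 6) < 1e-12
```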

Proof of item 2: Let c = dφ²/(2√10). Since |C_i| ≤ n/(γk) for all i ∈ [k], we have |E(C_i)| ≤ (n/(γk))·(d/2), where E(C_i) is the set of edges with both endpoints in C_i. So |E(C_i)|/|E_0| ≤ ((n/(γk))·(d/2))/(nd/2) = 1/(γk) for i ∈ [k]. Since |C_i| ≥ γn/k and φ_out(C_i, V) ≤ ε, we have |E(C_i)| ≥ (nd/(2k))·(γ − ε/γ). So |E(C_i)|/|E_0| ≥ ((nd/(2k))·(γ − ε/γ))/(nd/2) = (1/k)(γ − ε/γ) for i ∈ [k]. Combining the two bounds, we have (1/k)(γ − ε/γ) ≤ |E(C_i)|/|E_0| ≤ 1/(γk).

In the following, let X_i denote the number of edges deleted from E(C_i). If we randomly delete kc²γ(γ²−ε)/(9 log k + 2(γ²−ε)c) = O(kd²/(log k + d)) edges from G_0, then (1/k)(γ − ε/γ)·kc²γ(γ²−ε)/(9 log k + 2(γ²−ε)c) ≤ E(X_i) ≤ (1/(γk))·kc²γ(γ²−ε)/(9 log k + 2(γ²−ε)c), which gives us

c2(γ2ε)29logk+2(γ2ε)cE(Xi)c2(γ2ε)9logk+2(γ2ε)c.\frac{c^{2}(\gamma^{2}-\varepsilon)^{2}}{9\log k+2(\gamma^{2}-\varepsilon)c}\leq\textrm{E}(X_{i})\leq\frac{c^{2}(\gamma^{2}-\varepsilon)}{9\log k+2(\gamma^{2}-\varepsilon)c}.

The Chernoff–Hoeffding bound implies that Pr[X_i > (1+δ)·E(X_i)] ≤ e^{−E(X_i)·δ²/3}. Setting δ = (9 log k + 2(γ²−ε)c)/((γ²−ε)c) − 1, we have

Pr[Xi>c]\displaystyle\Pr\left[X_{i}>c\right] =Pr[Xi>(1+δ)c2(γ2ε)9logk+2(γ2ε)c]\displaystyle=\Pr\left[X_{i}>(1+\delta)\cdot\frac{c^{2}(\gamma^{2}-\varepsilon)}{9\log k+2(\gamma^{2}-\varepsilon)c}\right]
Pr[Xi>(1+δ)E(Xi)]\displaystyle\leq\Pr\left[X_{i}>(1+\delta)\cdot\textrm{E}(X_{i})\right]
eE(Xi)δ23\displaystyle\leq e^{\frac{-\textrm{E}(X_{i})\cdot\delta^{2}}{3}}
e13c2(γ2ε)29logk+2(γ2ε)cδ2\displaystyle\leq e^{\frac{-1}{3}\cdot\frac{c^{2}(\gamma^{2}-\varepsilon)^{2}}{9\log k+2(\gamma^{2}-\varepsilon)c}\cdot\delta^{2}}
e13c2(γ2ε)29logk+2(γ2ε)c(δ+1)(δ1)\displaystyle\leq e^{\frac{-1}{3}\cdot\frac{c^{2}(\gamma^{2}-\varepsilon)^{2}}{9\log k+2(\gamma^{2}-\varepsilon)c}\cdot(\delta+1)(\delta-1)}
=1k3.\displaystyle=\frac{1}{k^{3}}.

By a union bound, with probability at least 1 − (1/k³)·k = 1 − 1/k², if we randomly delete kc²γ(γ²−ε)/(9 log k + 2(γ²−ε)c) = O(kd²/(log k + d)) edges from G_0, then no cluster has more than c edges deleted. Therefore, by item 1 of Theorem 2, with probability at least 1 − 1/k², G is a (k, φ²/8, ε)-clusterable graph. The statement of item 2 of this theorem then follows from the same arguments as those in the proof of Theorem 1, with parameter φ' = φ²/8 in G. ∎