
Fast Approximation of Similarity Graphs
with Kernel Density Estimation

Peter Macgregor
School of Informatics
University of Edinburgh
United Kingdom

He Sun
School of Informatics
University of Edinburgh
United Kingdom
Abstract

Constructing a similarity graph from a set $X$ of data points in $\mathbb{R}^d$ is the first step of many modern clustering algorithms. However, typical constructions of a similarity graph have high time complexity, and a quadratic space dependency with respect to $|X|$. We address this limitation and present a new algorithmic framework that constructs a sparse approximation of the fully connected similarity graph while preserving its cluster structure. Our algorithm is based on the kernel density estimation problem, and is applicable to arbitrary kernel functions. We compare our algorithm with the well-known implementations from the scikit-learn and FAISS libraries, and find that our method significantly outperforms the implementations from both libraries on a variety of datasets.

1 Introduction

Given a set $X=\{x_1,\ldots,x_n\}\subset\mathbb{R}^d$ of data points and a similarity function $k:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}_{\geq 0}$ defined on any pair of data points $x_i$ and $x_j$, the objective of clustering is to partition the $n$ data points into clusters such that similar points lie in the same cluster. As a fundamental data analysis technique, clustering has been extensively studied in disciplines ranging from algorithms, machine learning, and social network analysis to data science and statistics.

One prominent approach for clustering data points in Euclidean space consists of two simple steps: the first step is to construct a similarity graph $\mathsf{G}=(V,E,w)$ from $X$, where every vertex $v_i$ of $\mathsf{G}$ corresponds to $x_i\in X$, and distinct vertices $v_i$ and $v_j$ are connected by an edge with weight $w(v_i,v_j)$ if their similarity $k(x_i,x_j)$ is positive, or higher than some threshold. Secondly, we apply spectral clustering on $\mathsf{G}$, and its output naturally corresponds to a clustering of $X$ [19]. Because it outperforms traditional clustering algorithms like $k$-means, this approach has become one of the most popular modern clustering algorithms.
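As a concrete illustration of this two-step pipeline, the following minimal sketch (our own, not taken from the paper's code) forms the fully connected Gaussian-kernel graph implicitly through scikit-learn's SpectralClustering with the rbf affinity; the bandwidth value $\sigma=0.1$ is borrowed from the experiments in Section 5.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two-step pipeline: (1) build a fully connected Gaussian-kernel similarity
# graph, (2) run spectral clustering on it. scikit-learn's "rbf" affinity
# forms the full n x n kernel matrix internally, so memory grows quadratically.
X, _ = make_moons(n_samples=2000, noise=0.05, random_state=0)
sigma = 0.1
sc = SpectralClustering(n_clusters=2, affinity="rbf", gamma=1.0 / sigma**2)
labels = sc.fit_predict(X)   # cluster label for every data point
```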

On the other hand, different constructions of similarity graphs have a significant impact on the quality and time complexity of spectral clustering, as clearly acknowledged and discussed by von Luxburg [31]. Generally speaking, there are two types of similarity graphs:

  • the first is the $k$-nearest neighbour graph ($k$-NN graph), in which every vertex $v_i$ connects to $v_j$ if $v_j$ is among the $k$ nearest neighbours of $v_i$. A $k$-NN graph is sparse by construction, but loses some of the structural information in the dataset since $k$ is usually small and the added edges are unweighted.

  • the second is the fully connected graph, in which distinct vertices $v_i$ and $v_j$ are connected by an edge of weight $w(v_i,v_j)=k(x_i,x_j)$. While a fully connected graph maintains most of the distance information about $X$, it is dense, and storing it requires memory quadratic in $n$.

Taking the pros and cons of the two constructions into account, one would naturally ask the question:

Is it possible to directly construct a sparse graph that preserves the cluster structure of a fully connected similarity graph?

We answer this question affirmatively, and present a fast algorithm that constructs an approximation of the fully connected similarity graph. Our constructed graph consists of only $\widetilde{O}(n)$ edges (we use $\widetilde{O}(n)$ to represent $O(n\cdot\log^c(n))$ for some constant $c$), and preserves the cluster structure of the fully connected similarity graph.

1.1 Our Result

Given any set $X=\{x_1,\ldots,x_n\}\subset\mathbb{R}^d$ and a kernel function $k:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}_{\geq 0}$, the fully connected similarity graph $\mathsf{K}=(V,E,w)$ of $X$ consists of $n$ vertices, where every $v_i\in V$ corresponds to $x_i\in X$ and we set $w(v_i,v_j)\triangleq k(x_i,x_j)$ for any distinct $v_i$ and $v_j$. We introduce an efficient algorithm that constructs a sparse graph $\mathsf{G}$ directly from $X$, such that $\mathsf{K}$ and $\mathsf{G}$ share the same cluster structure, and the graph matrices of $\mathsf{K}$ and $\mathsf{G}$ have approximately the same eigen-gap. This ensures that spectral clustering on $\mathsf{G}$ and on $\mathsf{K}$ return approximately the same result.

The design of our algorithm is based on a novel reduction from the approximate construction of similarity graphs to the problem of Kernel Density Estimation (KDE). This reduction shows that any KDE algorithm can be employed to construct a sparse representation of a fully connected similarity graph that preserves the cluster structure of the input data points. This is summarised as follows:

Theorem 1 (Informal Statement of Theorem 2).

Given a set of data points $X=\{x_1,\ldots,x_n\}\subset\mathbb{R}^d$ as input, there is a randomised algorithm that constructs a sparse graph $\mathsf{G}$ of $X$, such that with probability at least $9/10$:

  1. graph $\mathsf{G}$ has $\widetilde{O}(n)$ edges,

  2. graph $\mathsf{G}$ has the same cluster structure as the fully connected similarity graph $\mathsf{K}$ of $X$.

The algorithm uses an approximate KDE algorithm as a black box, and has running time $\widetilde{O}(T_{\mathsf{KDE}}(n,n,\epsilon))$ for $\epsilon\leq 1/(6\log n)$, where $T_{\mathsf{KDE}}(n,n,\epsilon)$ is the running time of solving the KDE problem for $n$ data points up to a $(1+\epsilon)$-approximation.

This result builds a novel connection between the KDE and the fast construction of similarity graphs, and further yields a state-of-the-art algorithm for constructing similarity graphs. For instance, when employing the fast Gauss transform [10] as the KDE solver, Theorem 1 shows that a sparse representation of the fully connected similarity graph with the Gaussian kernel can be constructed in $\widetilde{O}(n)$ time when $d$ is constant. As such, in the case of low dimensions, spectral clustering runs as fast (up to a poly-logarithmic factor) as the time needed to read the input data points. Moreover, any improved algorithm for the KDE would result in a faster construction of approximate similarity graphs.

Refer to caption
Figure 1: Output of spectral clustering with different similarity graph constructions.

To demonstrate the significance of this work in practice, we compare the performance of our algorithm with five competing algorithms from the well-known scikit-learn library [20] and FAISS library [11]: the algorithm that constructs a fully connected Gaussian kernel graph, and four algorithms that construct different variants of $k$-nearest neighbour graphs. We apply spectral clustering on the six constructed similarity graphs, and compare the quality of the resulting clusterings. For a typical input dataset of 15,000 points in $\mathbb{R}^2$, our algorithm runs in 4.7 seconds, compared with 16.1 to 128.9 seconds for the five competing algorithms from the scikit-learn and FAISS libraries. As shown in Figure 1, all six algorithms return reasonable output.

We further compare the quality of the six algorithms on the BSDS image segmentation dataset [2], and our algorithm presents a clear improvement over the other five algorithms based on the output's average Rand Index. In particular, due to its quadratic memory requirement in the input size, one would need to reduce the resolution of every image to about 20,000 pixels in order to construct the fully connected similarity graph with scikit-learn on a typical laptop. In contrast, our algorithm is able to segment the full-resolution image with over 150,000 pixels. Our experimental result on the BSDS dataset is showcased in Figure 2 and demonstrates that, in comparison with SKLearn GK, our algorithm identifies a more detailed pattern on the butterfly's wing. In contrast, none of the $k$-nearest neighbour based algorithms from the two libraries is able to identify the wings of the butterfly.

Refer to caption
(a) Original Image
Refer to caption
(b) SKLearn GK
Refer to caption
(c) Our Algorithm
Refer to caption
(d) SKLearn $k$-NN
Refer to caption
(e) FAISS Exact
Refer to caption
(f) FAISS HNSW
Refer to caption
(g) FAISS IVF
Figure 2: Comparison of the performance of spectral clustering with different similarity graph constructions. Here, SKLearn GK is based on the fully connected similarity graph construction, and (d) – (g) are based on different $k$-nearest neighbour graph constructions from the two libraries.

1.2 Related Work

There are a number of works on efficient constructions of $\varepsilon$-neighbourhood graphs and $k$-NN graphs. For instance, Dong et al. [9] present an algorithm for approximate $k$-NN graph construction based on local search. Wang et al. [32] present an LSH-based algorithm for constructing an approximate $k$-NN graph, and employ several sampling and hashing techniques to reduce the computational and parallelisation cost. These two algorithms [9, 32] have been shown to work very well in practice, but lack a theoretical guarantee on their performance.

Our work also relates to a large and growing body of work on KDE algorithms. Charikar and Siminelakis [7] study the KDE problem through LSH, and present a class of unbiased estimators for kernel density in high dimensions for a variety of commonly used kernels. Their work has been improved through the sketching technique [24], and through a revised description and analysis of the original algorithm [4]. Charikar et al. [6] present a data structure for the KDE problem whose guarantees essentially match the query time and space complexity for most kernels studied in the literature. In addition, there are studies on designing efficient KDE algorithms based on interpolation of kernel density estimators [29], and on coresets [12].

Our work further relates to efficient constructions of spectral sparsifiers for kernel graphs. Quanrud [22] studies smooth kernel functions, and shows that an explicit $(1+\varepsilon)$-approximate spectral approximation of the geometric graph with $\widetilde{O}(n/\varepsilon^2)$ edges can be computed in $\widetilde{O}(n/\varepsilon^2)$ time. Bakshi et al. [5] prove that, under the strong exponential time hypothesis, constructing an $O(1)$-approximate spectral sparsifier with $O(n^{2-\delta})$ edges for the Gaussian kernel graph requires $\Omega\left(n\cdot 2^{\log(1/\tau)^{0.32}}\right)$ time, where $\delta<0.01$ is a fixed universal constant and $\tau$ is the minimum entry of the kernel matrix. Compared with their results, we show that, when the similarity graph with the Gaussian kernel presents a well-defined structure of clusters, an approximation of this similarity graph can be constructed in nearly-linear time.

2 Preliminaries

Let $\mathsf{G}=(V,E,w_{\mathsf{G}})$ be an undirected graph with weight function $w_{\mathsf{G}}:E\rightarrow\mathbb{R}_{\geq 0}$, and let $n\triangleq|V|$. The degree of any vertex $v$ is defined as $\deg_{\mathsf{G}}(v)\triangleq\sum_{u\sim v}w_{\mathsf{G}}(u,v)$, where we write $u\sim v$ if $\{u,v\}\in E(\mathsf{G})$. For any $S\subset V$, the volume of $S$ is defined by $\mathrm{vol}_{\mathsf{G}}(S)\triangleq\sum_{v\in S}\deg_{\mathsf{G}}(v)$, and the conductance of $S$ is defined by

\phi_{\mathsf{G}}(S)\triangleq\frac{\partial_{\mathsf{G}}(S)}{\mathrm{vol}_{\mathsf{G}}(S)},

where $\partial_{\mathsf{G}}(S)\triangleq\sum_{u\in S,v\not\in S}w_{\mathsf{G}}(u,v)$. For any $k\geq 2$, we call subsets of vertices $A_1,\ldots,A_k$ a $k$-way partition if $A_i\neq\emptyset$ for any $1\leq i\leq k$, $A_i\cap A_j=\emptyset$ for any $i\neq j$, and $\bigcup_{i=1}^{k}A_i=V$. Moreover, we define the $k$-way expansion constant by

\rho_{\mathsf{G}}(k)\triangleq\min_{\mathrm{partition}~A_1,\ldots,A_k}\max_{1\leq i\leq k}\phi_{\mathsf{G}}(A_i).

Note that a lower value of $\rho_{\mathsf{G}}(k)$ ensures the existence of $k$ clusters $A_1,\ldots,A_k$ of low conductance, i.e., $\mathsf{G}$ has at least $k$ clusters.

For any undirected graph $\mathsf{G}$, the adjacency matrix $\mathbf{A}_{\mathsf{G}}$ of $\mathsf{G}$ is defined by $\mathbf{A}_{\mathsf{G}}(u,v)=w_{\mathsf{G}}(u,v)$ if $u\sim v$, and $\mathbf{A}_{\mathsf{G}}(u,v)=0$ otherwise. We write $\mathbf{D}_{\mathsf{G}}$ for the diagonal matrix defined by $\mathbf{D}_{\mathsf{G}}(v,v)=\deg_{\mathsf{G}}(v)$, and the normalised Laplacian of $\mathsf{G}$ is defined by $\mathbf{N}_{\mathsf{G}}\triangleq\mathbf{I}-\mathbf{D}_{\mathsf{G}}^{-1/2}\mathbf{A}_{\mathsf{G}}\mathbf{D}_{\mathsf{G}}^{-1/2}$. For any PSD matrix $\mathbf{B}\in\mathbb{R}^{n\times n}$, we write the eigenvalues of $\mathbf{B}$ as $\lambda_1(\mathbf{B})\leq\ldots\leq\lambda_n(\mathbf{B})$.

It is well known that, while computing $\rho_{\mathsf{G}}(k)$ exactly is NP-hard, $\rho_{\mathsf{G}}(k)$ is closely related to $\lambda_k$ through the higher-order Cheeger inequality [13]: it holds for any $k$ that

\lambda_k(\mathbf{N}_{\mathsf{G}})/2\leq\rho_{\mathsf{G}}(k)\leq O(k^3)\sqrt{\lambda_k(\mathbf{N}_{\mathsf{G}})}.
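The short sketch below (our own illustration, not part of the paper) computes the quantities just defined for a toy weighted graph: the eigenvalues of the normalised Laplacian, whose low-order values indicate the cluster structure via the higher-order Cheeger inequality, and the conductance of a candidate cluster.

```python
import numpy as np

# A small weighted graph with two obvious clusters {0,1} and {2,3},
# joined by weak cross edges of weight 0.01.
A = np.array([[0.00, 1.00, 0.01, 0.00],
              [1.00, 0.00, 0.00, 0.01],
              [0.01, 0.00, 0.00, 1.00],
              [0.00, 0.01, 1.00, 0.00]])

deg = A.sum(axis=1)                                   # vertex degrees
N = np.eye(4) - A / np.sqrt(np.outer(deg, deg))       # I - D^{-1/2} A D^{-1/2}
print(np.linalg.eigvalsh(N))                          # lambda_2 tiny, lambda_3 bounded away from 0

def conductance(A, S):
    S = np.asarray(S)
    not_S = np.setdiff1d(np.arange(A.shape[0]), S)
    return A[np.ix_(S, not_S)].sum() / A[S].sum()     # cut weight / volume

print(conductance(A, [0, 1]))                         # low conductance: {0,1} is a good cluster
```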

2.1 Fully Connected Similarity Graphs

We use $X\triangleq\{x_1,\ldots,x_n\}$ to represent the set of input data points, where every $x_i\in\mathbb{R}^d$. Given $X$ and some kernel function $k:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}_{\geq 0}$, we use $\mathsf{K}=(V_{\mathsf{K}},E_{\mathsf{K}},w_{\mathsf{K}})$ to represent the fully connected similarity graph of $X$, which is constructed as follows: every $v_i\in V_{\mathsf{K}}$ corresponds to $x_i\in X$, and any pair of distinct $v_i$ and $v_j$ is connected by an edge with weight $w_{\mathsf{K}}(v_i,v_j)=k(x_i,x_j)$. One of the most common kernels used for clustering is the Gaussian kernel, defined by

k(x_i,x_j)=\exp\left(-\frac{\|x_i-x_j\|_2^2}{\sigma^2}\right)

for some bandwidth parameter $\sigma$. Other popular kernels include the Laplacian kernel and the exponential kernel, which use $\|x_i-x_j\|_1$ and $\|x_i-x_j\|_2$ in the exponent respectively.
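For concreteness, a minimal sketch of the three kernels follows; the bandwidth convention for the Laplacian and exponential kernels is our own choice, since the text only fixes the Gaussian form explicitly.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    # k(x, y) = exp(-||x - y||_2^2 / sigma^2), as defined above
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def laplacian_kernel(x, y, sigma):
    # uses the l1 distance in the exponent
    return np.exp(-np.linalg.norm(x - y, ord=1) / sigma)

def exponential_kernel(x, y, sigma):
    # uses the l2 distance in the exponent
    return np.exp(-np.linalg.norm(x - y, ord=2) / sigma)
```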

2.2 Kernel Density Estimation

Our work is based on algorithms for kernel density estimation (KDE), which is defined as follows. Given a kernel function $k:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}_{\geq 0}$, $n$ source points $x_1,\ldots,x_n\in\mathbb{R}^d$ and $m$ target points $y_1,\ldots,y_m\in\mathbb{R}^d$, the KDE problem is to compute $g_{[1,n]}(y_1),\ldots,g_{[1,n]}(y_m)$, where

g_{[a,b]}(y_i)\triangleq\sum_{j=a}^{b}k(y_i,x_j) \qquad (1)

for $1\leq i\leq m$. While a direct computation of the $m$ values $g_{[1,n]}(y_1),\ldots,g_{[1,n]}(y_m)$ requires $mn$ operations, there is substantial research on faster algorithms approximating these $m$ quantities.

In this paper we are interested in algorithms that approximately compute $g_{[1,n]}(y_i)$ for all $1\leq i\leq m$ up to a $(1\pm\epsilon)$-multiplicative error, and use $T_{\mathsf{KDE}}(m,n,\epsilon)$ to denote the asymptotic complexity of such a KDE algorithm. We also require that $T_{\mathsf{KDE}}(m,n,\epsilon)$ is superadditive in $m$ and $n$; that is, for $m=m_1+m_2$ and $n=n_1+n_2$, we have

T_{\mathsf{KDE}}(m_1,n_1,\epsilon)+T_{\mathsf{KDE}}(m_2,n_2,\epsilon)\leq T_{\mathsf{KDE}}(m,n,\epsilon);

it is known that this property holds for many KDE algorithms (e.g., [1, 6, 10]).
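As a point of reference, the sketch below evaluates the kernel density exactly in the naive way with $mn$ kernel evaluations, using the Gaussian kernel and a bandwidth of our choosing; the approximate KDE algorithms used as a black box in this paper replace exactly this computation.

```python
import numpy as np

def kde_exact(targets, sources, sigma=0.1):
    # g_{[1,n]}(y_i) = sum_j k(y_i, x_j), computed with m * n kernel evaluations
    d2 = ((targets[:, None, :] - sources[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
g = kde_exact(X, X)       # here m = n = 1000, i.e. about 10^6 kernel evaluations
```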

3 Cluster-Preserving Sparsifiers

A graph sparsifier is a sparse representation of an input graph that inherits certain properties of the original dense graph. The efficient construction of sparsifiers plays an important role in designing a number of nearly-linear time graph algorithms. However, most algorithms for constructing sparsifiers rely on the recursive decomposition of an input graph [26], sampling with respect to effective resistances [15, 25], or fast SDP solvers [14]; all of these need an explicit representation of the input graph, requiring $\Omega(n^2)$ time and space for a fully connected graph.

Sun and Zanetti [27] study a variant of graph sparsifiers that mainly preserves the cluster structure of an input graph, and introduce the notion of a cluster-preserving sparsifier defined as follows:

Definition 1 (Cluster-preserving sparsifier).

Let $\mathsf{K}=(V,E,w_{\mathsf{K}})$ be any graph, and $\{A_i\}_{i=1}^{k}$ the $k$-way partition of $\mathsf{K}$ corresponding to $\rho_{\mathsf{K}}(k)$. We call a re-weighted subgraph $\mathsf{G}=(V,F\subset E,w_{\mathsf{G}})$ a cluster-preserving sparsifier of $\mathsf{K}$ if $\phi_{\mathsf{G}}(A_i)=O(k\cdot\phi_{\mathsf{K}}(A_i))$ for $1\leq i\leq k$, and $\lambda_{k+1}(\mathbf{N}_{\mathsf{G}})=\Omega(\lambda_{k+1}(\mathbf{N}_{\mathsf{K}}))$.

Notice that graph $\mathsf{K}$ has exactly $k$ clusters if (i) $\mathsf{K}$ has $k$ disjoint subsets $A_1,\ldots,A_k$ of low conductance, and (ii) any $(k+1)$-way partition of $\mathsf{K}$ would include some $A\subset V$ of high conductance, which is implied by a lower bound on $\lambda_{k+1}(\mathbf{N}_{\mathsf{K}})$ due to the higher-order Cheeger inequality. Together with the well-known eigen-gap heuristic [13, 31] and the theoretical analysis of spectral clustering [16, 21], the two properties in Definition 1 ensure that spectral clustering returns approximately the same output on $\mathsf{K}$ and $\mathsf{G}$. (The most interesting regime for this definition is $k=\widetilde{O}(1)$ and $\lambda_{k+1}(\mathbf{N}_{\mathsf{K}})=\Omega(1)$, and we assume this in the rest of the paper.)

Now we present the algorithm in [27] for constructing a cluster-preserving sparsifier, which we call the SZ algorithm for simplicity. Given any input graph $\mathsf{K}=(V,E,w_{\mathsf{K}})$ with weight function $w_{\mathsf{K}}$, the algorithm computes

p_u(v)\triangleq\min\left\{C\cdot\frac{\log n}{\lambda_{k+1}}\cdot\frac{w_{\mathsf{K}}(u,v)}{\deg_{\mathsf{K}}(u)},1\right\},\quad\mbox{and}\quad p_v(u)\triangleq\min\left\{C\cdot\frac{\log n}{\lambda_{k+1}}\cdot\frac{w_{\mathsf{K}}(v,u)}{\deg_{\mathsf{K}}(v)},1\right\},

for every edge $e=\{u,v\}$, where $C\in\mathbb{R}^{+}$ is some constant. Then, the algorithm samples every edge $e=\{u,v\}$ with probability

p_e\triangleq p_u(v)+p_v(u)-p_u(v)\cdot p_v(u),

and sets the weight of every sampled $e=\{u,v\}$ in $\mathsf{G}$ as $w_{\mathsf{G}}(u,v)\triangleq w_{\mathsf{K}}(u,v)/p_e$. Setting $F$ to be the set of sampled edges, the algorithm returns $\mathsf{G}=(V,F,w_{\mathsf{G}})$ as output. It is shown in [27] that, with high probability, the constructed $\mathsf{G}$ has $\widetilde{O}(n)$ edges and is a cluster-preserving sparsifier of $\mathsf{K}$.
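A minimal sketch of this sampling scheme on an explicitly stored weighted graph is given below (our own rendering; the constant C and the value of $\lambda_{k+1}$ are treated as known inputs). It makes the quadratic cost of working with the explicit graph apparent, which is exactly what Section 4 avoids.

```python
import numpy as np

def sz_sparsifier(A, lam_k1, C=1.0, seed=0):
    # A: dense (n x n) weighted adjacency matrix of the input graph K.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    deg = A.sum(axis=1)
    factor = C * np.log(n) / lam_k1
    P_u = np.minimum(factor * A / deg[:, None], 1.0)   # p_u(v) for every ordered pair (u, v)
    P = P_u + P_u.T - P_u * P_u.T                      # p_e = p_u(v) + p_v(u) - p_u(v) p_v(u)
    keep = np.triu(rng.random((n, n)) < P, k=1)        # sample each edge {u, v} once
    W = np.zeros_like(A)
    W[keep] = A[keep] / P[keep]                        # reweight each sampled edge by 1 / p_e
    return W + W.T                                     # sparse re-weighted subgraph G
```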

We remark that a cluster-preserving sparsifier is a much weaker notion than a spectral sparsifier, which approximately preserves all cut values and the eigenvalues of the graph Laplacian matrices. On the other hand, while a cluster-preserving sparsifier is sufficient for the task of graph clustering, the SZ algorithm runs in $\Omega(n^2)$ time for a fully connected input graph, since it is based on the computation of the vertex degrees as well as the sampling probabilities $p_u(v)$ for every pair of vertices $u$ and $v$.

4 Algorithm

This section presents our algorithm, which directly constructs an approximation of a fully connected similarity graph from $X\subset\mathbb{R}^d$ with $|X|=n$. As our main theoretical contribution, we demonstrate that neither the quadratic space complexity of directly constructing a fully connected similarity graph nor the quadratic time complexity of the SZ algorithm is necessary when approximately constructing a fully connected similarity graph for the purpose of clustering. The performance of our algorithm is as follows:

Theorem 2 (Main Result).

Given a set of data points $X=\{x_1,\ldots,x_n\}\subset\mathbb{R}^d$ as input, there is a randomised algorithm that constructs a sparse graph $\mathsf{G}$ of $X$, such that with probability at least $9/10$:

  1. graph $\mathsf{G}$ has $\widetilde{O}(n)$ edges,

  2. graph $\mathsf{G}$ has the same cluster structure as the fully connected similarity graph $\mathsf{K}$ of $X$; that is, if $\mathsf{K}$ has $k$ well-defined clusters, then it holds that $\rho_{\mathsf{G}}(k)=O(k\cdot\rho_{\mathsf{K}}(k))$ and $\lambda_{k+1}(\mathbf{N}_{\mathsf{G}})=\Omega(\lambda_{k+1}(\mathbf{N}_{\mathsf{K}}))$.

The algorithm uses an approximate KDE algorithm as a black box, and has running time $\widetilde{O}(T_{\mathsf{KDE}}(n,n,\epsilon))$ for $\epsilon\leq 1/(6\log n)$.

4.1 Algorithm Description

At a very high level, our algorithm applies a KDE algorithm as a black box, and constructs a cluster-preserving sparsifier by sampling edges from the fully connected graph simultaneously, without ever constructing it explicitly. To explain our technique, we first claim that, for an arbitrary $x_i$, a random $x_j$ can be sampled with probability $k(x_i,x_j)/\deg_{\mathsf{K}}(v_i)$ through $O(\log n)$ queries to a KDE algorithm. To see this, notice that we can apply a KDE algorithm to compute the probability that the sampled neighbour $z$ lies in some set $X_1\subset X$, i.e.,

\mathbb{P}\left[z\in X_1\right]=\sum_{x_j\in X_1}\frac{k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)}=\frac{g_{X_1}(x_i)}{g_X(x_i)},

where we use $g_{X_1}(x_i)$ and $g_X(x_i)$ to denote KDEs taken with respect to the source point sets $X_1$ and $X$ respectively. Based on this, we recursively split the set of possible neighbours in half and choose between the two subsets with the correct probability. The sampling procedure is summarised as follows, illustrated in Figure 3, and sketched in code after the list. We remark that the method of sampling a random neighbour of a vertex in $\mathsf{K}$ through KDE and binary search also appears in Bakshi et al. [5].

  1. Set the feasible neighbours to be $X=\{x_1,\ldots,x_n\}$.

  2. While $|X|>1$:

     • Split $X$ into $X_1$ and $X_2$ with $|X_1|=\lfloor|X|/2\rfloor$ and $|X_2|=\lceil|X|/2\rceil$.

     • Compute $g_X(x_i)$ and $g_{X_1}(x_i)$; set $X\leftarrow X_1$ with probability $g_{X_1}(x_i)/g_X(x_i)$, and $X\leftarrow X_2$ with probability $1-g_{X_1}(x_i)/g_X(x_i)$.

  3. Return the remaining element in $X$ as the sampled neighbour.
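The sketch below implements this binary-search sampling for a single vertex. The helper kde is a stand-in for the approximate KDE oracle (here an exact Gaussian sum), and, for simplicity, the candidate set is not pruned of $x_i$ itself; both choices are ours and made for illustration only.

```python
import numpy as np

def kde(x, S, sigma=0.1):
    # stand-in for the approximate KDE oracle: sum_j exp(-||x - s_j||^2 / sigma^2)
    return np.exp(-((S - x) ** 2).sum(axis=1) / sigma ** 2).sum()

def sample_neighbour(x_i, X, rng=None):
    rng = rng or np.random.default_rng(0)
    lo, hi = 0, len(X)                     # feasible neighbours are X[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        p_left = kde(x_i, X[lo:mid]) / kde(x_i, X[lo:hi])
        if rng.random() <= p_left:
            hi = mid                       # recurse into the left half
        else:
            lo = mid                       # recurse into the right half
    return lo                              # index of the sampled neighbour
```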

Refer to caption
Figure 3: The procedure of sampling a neighbour $v_j$ of $v_i$ with probability $k(x_i,x_j)/\deg_{\mathsf{K}}(v_i)$. Our algorithm performs a binary search to find the sampled neighbour. At each step, the values of two kernel density estimates are used to determine where the sample lies. Notice that the algorithm doesn't compute any edge weights directly until the last step.

Next we generalise this idea and show that, instead of sampling a neighbour of one vertex at a time, a KDE algorithm allows us to sample neighbours of every vertex in the graph "simultaneously". Our sampling procedure is formalised in Algorithm 1.

Algorithm 1 Sample
1:  Input: set $S$ of $\{y_i\}$; set $X$ of $\{x_j\}$
2:  Output: $E=\{(y_i,x_j)~\mbox{for some $i$ and $j$}\}$
3:  if $|X|=1$ then
4:     return $S\times X$
5:  else
6:     $X_1=\{x_j:j<|X|/2\}$
7:     $X_2=\{x_j:j\geq|X|/2\}$
8:     Compute $g_{X_1}(y_i)$ for all $i$ with a KDE algorithm
9:     Compute $g_{X_2}(y_i)$ for all $i$ with a KDE algorithm
10:     $S_1=S_2=\emptyset$
11:     for $y_i\in S$ do
12:        $r\sim\mathrm{Unif}[0,1]$
13:        if $r\leq g_{X_1}(y_i)/(g_{X_1}(y_i)+g_{X_2}(y_i))$ then
14:           $S_1=S_1\cup\{y_i\}$
15:        else
16:           $S_2=S_2\cup\{y_i\}$
17:        end if
18:     end for
19:     return $\textsc{Sample}(S_1,X_1)\cup\textsc{Sample}(S_2,X_2)$
20:  end if
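A direct, unoptimised Python rendering of Algorithm 1 is sketched below; the function names sample and kde are our own, X and Y are (n, d) numpy arrays, and kde again stands in for the approximate KDE black box with an exact Gaussian sum.

```python
import numpy as np

def kde(ys, xs, sigma=0.1):
    # g_{X}(y) for every target y in ys, with source set xs (exact Gaussian sum)
    d2 = ((ys[:, None, :] - xs[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2).sum(axis=1)

def sample(S_idx, X_idx, Y, X, rng):
    """Return a list of sampled pairs (i, j) with i in S_idx and j in X_idx."""
    if len(X_idx) == 1:
        return [(i, X_idx[0]) for i in S_idx]
    mid = len(X_idx) // 2
    X1, X2 = X_idx[:mid], X_idx[mid:]
    g1 = kde(Y[S_idx], X[X1])                  # g_{X_1}(y_i) for every y_i in S
    g2 = kde(Y[S_idx], X[X2])                  # g_{X_2}(y_i) for every y_i in S
    left = rng.random(len(S_idx)) <= g1 / (g1 + g2)
    S1 = [i for i, go_left in zip(S_idx, left) if go_left]
    S2 = [i for i, go_left in zip(S_idx, left) if not go_left]
    out = []
    if S1:
        out += sample(S1, X1, Y, X, rng)       # recurse on the left half
    if S2:
        out += sample(S2, X2, Y, X, rng)       # recurse on the right half
    return out
```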

Finally, to construct a cluster-preserving sparsifier, we apply Algorithm 1 to sample $O(\log n)$ neighbours for every vertex $v_i$, and set the weight of every sampled edge $v_i\sim v_j$ as

w_{\mathsf{G}}(v_i,v_j)=\frac{k(x_i,x_j)}{\widehat{p}(i,j)}, \qquad (2)

where $\widehat{p}(i,j)\triangleq\widehat{p}_i(j)+\widehat{p}_j(i)-\widehat{p}_i(j)\cdot\widehat{p}_j(i)$ is an estimate of the sampling probability of edge $v_i\sim v_j$, and

\widehat{p}_i(j)\triangleq\min\left\{6C\cdot\frac{\log n}{\lambda_{k+1}}\cdot\frac{k(x_i,x_j)}{g_{[1,n]}(x_i)},1\right\}

for some constant $C\in\mathbb{R}^{+}$; see Algorithm 2 for the formal description.

Algorithm 2 FastSimilarityGraph
1:  Input: data point set $X=\{x_1,\ldots,x_n\}$
2:  Output: similarity graph $\mathsf{G}$
3:  $E=\emptyset$, $L=6C\cdot\log(n)/\lambda_{k+1}$
4:  for $\ell\in[1,L]$ do
5:     $E=E\cup\textsc{Sample}(X,X)$
6:  end for
7:  Compute $g_{[1,n]}(x_i)$ for each $i$ with a KDE algorithm
8:  for $(v_i,v_j)\in E$ do
9:     $\widehat{p}_i(j)=\min\left\{L\cdot k(x_i,x_j)/g_{[1,n]}(x_i),1\right\}$
10:     $\widehat{p}_j(i)=\min\left\{L\cdot k(x_i,x_j)/g_{[1,n]}(x_j),1\right\}$
11:     $\widehat{p}(i,j)=\widehat{p}_i(j)+\widehat{p}_j(i)-\widehat{p}_i(j)\cdot\widehat{p}_j(i)$
12:     Set $w_{\mathsf{G}}(v_i,v_j)=k(x_i,x_j)/\widehat{p}(i,j)$
13:  end for
14:  return graph $\mathsf{G}=(X,E,w_{\mathsf{G}})$
Refer to caption
Figure 4: The recursion tree for Algorithm 1.
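Putting the pieces together, the following sketch mirrors Algorithm 2 using the sample and kde helpers from the previous sketch; the constant C and the eigenvalue $\lambda_{k+1}$ are treated as known inputs (under the assumption from Section 3, $\lambda_{k+1}=\Omega(1)$, so $L=O(\log n)$). This is an illustrative rendering, not the paper's optimised C++ implementation.

```python
import numpy as np

def fast_similarity_graph(X, sigma=0.1, C=1.0, lam_k1=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    L = int(np.ceil(6 * C * np.log(n) / lam_k1))
    idx = list(range(n))
    edges = set()
    for _ in range(L):                                   # L rounds of simultaneous sampling
        edges |= set(sample(idx, idx, X, X, rng))
    g = kde(X, X, sigma)                                 # g_{[1,n]}(x_i) for every i
    weights = {}
    for (i, j) in edges:
        if i == j:                                       # skip self-samples (no self loops in K)
            continue
        kij = np.exp(-((X[i] - X[j]) ** 2).sum() / sigma ** 2)
        p_i = min(L * kij / g[i], 1.0)
        p_j = min(L * kij / g[j], 1.0)
        p = p_i + p_j - p_i * p_j
        weights[(min(i, j), max(i, j))] = kij / p        # w_G(v_i, v_j) = k(x_i, x_j) / p_hat(i, j)
    return weights                                       # sparse weighted edge list
```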

4.2 Algorithm Analysis

We now analyse Algorithm 2 and sketch the proof of Theorem 2. We first analyse the running time of Algorithm 2. Since it involves $O(\log n)$ executions of Algorithm 1 in total, it suffices to examine the running time of Algorithm 1.

We visualise the recursion of Algorithm 1 with respect to $S$ and $X$ in Figure 4. Notice that, although the number of nodes doubles at each level of the recursion tree, the total number of samples $S$ and data points $X$ remains constant at each level of the tree: it holds for any $i$ that $\bigcup_{j=1}^{2^i}S_{i,j}=S_0$ and $\bigcup_{j=1}^{2^i}X_{i,j}=X_0$. Since the running time of the KDE algorithm is superadditive, the combined running time of all nodes at level $i$ of the tree is

T_i=\sum_{j=1}^{2^i}T_{\mathsf{KDE}}(|S_{i,j}|,|X_{i,j}|,\epsilon)\leq T_{\mathsf{KDE}}\left(\sum_{j=1}^{2^i}|S_{i,j}|,\sum_{j=1}^{2^i}|X_{i,j}|,\epsilon\right)=T_{\mathsf{KDE}}(n,n,\epsilon).

Hence, the total running time of Algorithm 2 is $\widetilde{O}(T_{\mathsf{KDE}}(n,n,\epsilon))$, as the tree has $\log_2(n)$ levels. Moreover, when applying the Fast Gauss Transform (FGT) [10] as the KDE algorithm, the total running time of Algorithm 2 is $\widetilde{O}(n)$ when $d=O(1)$ and $\tau=\Omega(1/\mathrm{poly}(n))$, where $\tau$ is the minimum entry of the kernel matrix.

Finally, we prove that the output of Algorithm 2 is a cluster-preserving sparsifier of the fully connected similarity graph, and has $\widetilde{O}(n)$ edges. Notice that, compared with the sampling probabilities $p_u(v)$ and $p_v(u)$ used in the SZ algorithm, Algorithm 2 samples each edge through $O(\log n)$ recursive executions of a KDE algorithm, each of which introduces a $(1+\epsilon)$-multiplicative error. We carefully track these $(1+\epsilon)$-multiplicative errors and prove that the actual sampling probability $\widetilde{p}(i,j)$ used in Algorithm 2 is an $O(1)$-approximation of $p_e$ for every $e=\{v_i,v_j\}$. As such, the output of Algorithm 2 is a cluster-preserving sparsifier of the fully connected similarity graph, and satisfies the two properties of Theorem 2. The bound on the total number of edges in $\mathsf{G}$ follows from the sampling scheme. We refer the reader to Appendix A for the complete proof of Theorem 2.

5 Experiments

In this section, we empirically evaluate the performance of spectral clustering with our new algorithm for constructing similarity graphs. We compare our algorithm with the similarity graph constructions provided by the scikit-learn library [20] and the FAISS library [11], which are commonly used machine learning libraries for Python. In the remainder of this section, we compare the following six spectral clustering methods.

  1. SKLearn GK: spectral clustering with the fully connected Gaussian kernel similarity graph constructed with the scikit-learn library.

  2. SKLearn $k$-NN: spectral clustering with the $k$-nearest neighbour similarity graph constructed with the scikit-learn library.

  3. FAISS Exact: spectral clustering with the exact $k$-nearest neighbour graph constructed with the FAISS library.

  4. FAISS HNSW: spectral clustering with the approximate $k$-nearest neighbour graph constructed with the "Hierarchical Navigable Small World" method [18].

  5. FAISS IVF: spectral clustering with the approximate $k$-nearest neighbour graph constructed with the "Inverted File Index" method [3].

  6. Our Algorithm: spectral clustering with the Gaussian kernel similarity graph constructed by Algorithm 2.

We implement Algorithm 2 in C++, using the implementation of the Fast Gauss Transform provided by Yang et al. [33], and use the Python SciPy [30] and stag [17] libraries for eigenvector computation and graph operations respectively. The scikit-learn and FAISS libraries are well-optimised and use C, C++, and FORTRAN for efficient implementation of core algorithms. We first employ classical synthetic clustering datasets to clearly compare how the running time of different algorithms is affected by the number of data points. This experiment highlights the nearly-linear time complexity of our algorithm. Next we demonstrate the applicability of our new algorithm for image segmentation on the Berkeley Image Segmentation Dataset (BSDS) [2].

For each experiment, we set $k=10$ for the approximate nearest neighbour algorithms. For the synthetic datasets, we set the $\sigma$ value of the Gaussian kernel used by SKLearn GK and Our Algorithm to $0.1$, and for the BSDS experiment we set $\sigma=0.2$. This choice follows the heuristic suggested by von Luxburg [31] of choosing $\sigma$ to approximate the distance from a point to its $k$th nearest neighbour; a minimal sketch of this heuristic follows below. All experiments are performed on an HP ZBook laptop with an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz processor and 32 GB RAM. The code to reproduce our results is available at https://github.com/pmacg/kde-similarity-graph.
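One possible way to apply this heuristic in practice is sketched below (our own illustration; the subsample size and the use of the median are choices we make here, not prescribed by the text).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_sigma(X, k=10, sample_size=1000, seed=0):
    # Estimate the typical distance from a point to its k-th nearest neighbour
    # on a random subsample, and use it as the Gaussian bandwidth sigma.
    rng = np.random.default_rng(seed)
    sub = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(sub).kneighbors(sub)
    return float(np.median(dists[:, k]))   # column 0 is the point itself
```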

5.1 Synthetic Dataset

In this experiment we use the scikit-learn library to generate 15,000 data points in $\mathbb{R}^2$ from a variety of classical synthetic clustering datasets. The data is generated with the make_circles, make_moons, and make_blobs methods of the scikit-learn library with a noise parameter of $0.05$. The linear transformation $\{\{0.6,-0.6\},\{-0.4,0.8\}\}$ is applied to the blobs data to create asymmetric clusters. As shown in Figure 1, all of the algorithms return approximately the same clustering, but our algorithm runs much faster than the ones from scikit-learn and FAISS.
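A sketch of this data generation is given below; the random seeds and the make_circles factor are our own choices and are not specified in the text.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_circles, make_moons

n = 15000
circles, _ = make_circles(n_samples=n, factor=0.5, noise=0.05, random_state=0)
moons, _ = make_moons(n_samples=n, noise=0.05, random_state=0)
blobs, _ = make_blobs(n_samples=n, random_state=0)

# Linear transformation applied to the blobs data to create asymmetric clusters.
transform = np.array([[0.6, -0.6], [-0.4, 0.8]])
aniso = blobs @ transform
```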

We further compare the speedup of our algorithm against the five competitors on the two_moons dataset with an increasing number of data points, and our result is reported in Figure 5. Notice that the running time of the scikit-learn and FAISS algorithms scales roughly quadratically with the size of the input, while the running time of our algorithm is nearly linear. Furthermore, we note that on a laptop with 32 GB RAM, the SKLearn GK algorithm could not scale beyond 20,000 data points due to its quadratic memory requirement, while our new algorithm can cluster 1,000,000 data points in 240 seconds under the same memory constraint.

Refer to caption
Refer to caption
Figure 5: Comparison of the running time of spectral clustering on the two moons dataset. In every case, all algorithms return the correct clusters.

5.2 BSDS Image Segmentation Dataset

Finally, we study the application of spectral clustering to image segmentation on the BSDS dataset. The dataset contains 500 images with several ground-truth segmentations for each image. Given an input image, we consider each pixel to be a point $(r,g,b,x,y)^{\intercal}\in\mathbb{R}^5$, where $r$, $g$, $b$ are the red, green, and blue pixel values and $x$, $y$ are the coordinates of the pixel in the image (a sketch of this featurisation is given below). We then construct a similarity graph from these points and apply spectral clustering, setting the number of clusters to be the median number of clusters in the provided ground-truth segmentations. Our experimental results are reported in Table 1, where "Downsampled" refers to reducing the resolution of each image to 20,000 pixels in order to run SKLearn GK, and "Full Resolution" refers to using the original image of over 150,000 pixels as input. This set of experiments demonstrates that our algorithm produces better clustering results with respect to the average Rand Index [23].
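The sketch below turns an image into such a point set; the use of Pillow and the scaling of the colour and coordinate channels to $[0,1]$ are our own choices for illustration and are not specified in the text.

```python
import numpy as np
from PIL import Image

def image_to_points(path):
    # Every pixel becomes a point (r, g, b, x, y) in R^5, all channels scaled to [0, 1].
    img = np.asarray(Image.open(path).convert("RGB"), dtype=float) / 255.0
    h, w, _ = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs / w, ys / h], axis=-1)
    return np.concatenate([img, coords], axis=-1).reshape(-1, 5)
```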

Table 1: The average Rand Index of the output from the six algorithms. Due to the quadratic space complexity constraint, the SKLearn GK cannot be applied to the full resolution image.
Algorithm Downsampled Full Resolution
SKLearn GK 0.681 -
SKLearn $k$-NN 0.675 0.682
FAISS Exact 0.675 0.680
FAISS HNSW 0.679 0.677
FAISS IVF 0.675 0.680
Our Algorithm 0.680 0.702

6 Conclusion

In this paper we develop a new technique for constructing a similarity graph, and show that an approximation algorithm for the KDE can be employed to construct a similarity graph with a proven approximation guarantee. Applying publicly available implementations of the KDE as a black box, our algorithm outperforms five competing ones from the well-known scikit-learn and FAISS libraries on classical datasets of low dimension. We believe that our work will motivate more research on faster algorithms for the KDE in higher dimensions and their efficient implementation, as well as on more efficient constructions of similarity graphs.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments on the paper. This work is supported by an EPSRC Early Career Fellowship (EP/T00729X/1).

References

  • [1] Josh Alman, Timothy Chu, Aaron Schild, and Zhao Song. Algorithms and hardness for linear algebra on geometric graphs. In 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS’20), pages 541–552, 2020.
  • [2] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
  • [3] Artem Babenko and Victor Lempitsky. The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6):1247–1260, 2014.
  • [4] Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimation in high dimensions. In 33rd Advances in Neural Information Processing Systems (NeurIPS’19), pages 15773–15782, 2019.
  • [5] Ainesh Bakshi, Piotr Indyk, Praneeth Kacham, Sandeep Silwal, and Samson Zhou. Subquadratic algorithms for kernel matrices via kernel density estimation. In 11th International Conference on Learning Representations (ICLR ’23), 2023.
  • [6] Moses Charikar, Michael Kapralov, Navid Nouri, and Paris Siminelakis. Kernel density estimation through density constrained near neighbor search. In 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS’20), pages 172–183, 2020.
  • [7] Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS’17), pages 1032–1043, 2017.
  • [8] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1):79–127, 2006.
  • [9] Wei Dong, Moses Charikar, and Kai Li. Efficient $k$-nearest neighbor graph construction for generic similarity measures. In 20th International Conference on World Wide Web (WWW ’11), pages 577–586, 2011.
  • [10] Leslie Greengard and John Strain. The fast gauss transform. SIAM Journal on Scientific & Statistical Computing, 12(1):79–94, 1991.
  • [11] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2021.
  • [12] Zohar Karnin and Edo Liberty. Discrepancy, coresets, and sketches in machine learning. In 32nd Conference on Learning Theory (COLT ’19), pages 1975–1993, 2019.
  • [13] James R. Lee, Shayan Oveis Gharan, and Luca Trevisan. Multiway spectral partitioning and higher-order Cheeger inequalities. Journal of the ACM, 61(6):37:1–37:30, 2014.
  • [14] Yin Tat Lee and He Sun. An SDP-based algorithm for linear-sized spectral sparsification. In 49th Annual ACM Symposium on Theory of Computing (STOC’17), pages 678–687, 2017.
  • [15] Yin Tat Lee and He Sun. Constructing linear-sized spectral sparsification in almost-linear time. SIAM Journal on Computing, 47(6):2315–2336, 2018.
  • [16] Peter Macgregor and He Sun. A tighter analysis of spectral clustering, and beyond. In 39th International Conference on Machine Learning (ICML’22), pages 14717–14742, 2022.
  • [17] Peter Macgregor and He Sun. Spectral toolkit of algorithms for graphs: Technical report (1). CoRR, abs/2304.03170, 2023.
  • [18] Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836, 2018.
  • [19] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In 14th Advances in Neural Information Processing Systems (NeurIPS’01), pages 849–856, 2001.
  • [20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [21] Richard Peng, He Sun, and Luca Zanetti. Partitioning well-clustered graphs: Spectral clustering works! SIAM Journal on Computing, 46(2):710–743, 2017.
  • [22] Kent Quanrud. Spectral sparsification of metrics and kernels. In 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’21), pages 1445–1464, 2021.
  • [23] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
  • [24] Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Levis. Rehashing kernel evaluation in high dimensions. In 36th International Conference on Machine Learning (ICML’19), Proceedings of Machine Learning Research, pages 5789–5798, 2019.
  • [25] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
  • [26] Daniel A. Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.
  • [27] He Sun and Luca Zanetti. Distributed graph clustering and sparsification. ACM Transactions on Parallel Computing, 6(3):17:1–17:23, 2019.
  • [28] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
  • [29] Paxton Turner, Jingbo Liu, and Philippe Rigollet. Efficient interpolation of density estimators. In 24th International Conference on Artificial Intelligence and Statistics (AISTATS ’21), pages 2503–2511, 2021.
  • [30] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.
  • [31] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
  • [32] Yiqiu Wang, Anshumali Shrivastava, Jonathan Wang, and Junghee Ryu. Randomized algorithms accelerated over CPU-GPU for ultra-high dimensional similarity search. In 2018 International Conference on Management of Data (SIGMOD ’18), pages 889–903, 2018.
  • [33] Changjiang Yang, Ramani Duraiswami, Nail A. Gumerov, and Larry Davis. Improved fast Gauss transform and efficient kernel density estimation. In 9th International Conference on Computer Vision (ICCV’03), pages 664–671, 2003.

Appendix A Proof of Theorem 2

This section presents the complete proof of Theorem 2. Let $y_{i,1},\ldots,y_{i,L}$ be random variables equal to the indices of the $L$ points sampled for $x_i$. Recall that, following the SZ algorithm, the "ideal" sampling probability for $x_j$ from $x_i$ is

p_i(j)\triangleq\min\left\{C\cdot\frac{\log(n)}{\lambda_{k+1}}\cdot\frac{k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)},1\right\}.

We denote the actual probability that $x_j$ is sampled from $x_i$ under Algorithm 2 by

\widetilde{p}_i(j)\triangleq\mathbb{P}\left[x_j\in\{y_{i,1},\ldots,y_{i,L}\}\right].

Finally, for each added edge, Algorithm 2 also computes an estimate of $p_i(j)$, which we denote by

\widehat{p}_i(j)\triangleq\min\left\{6C\cdot\frac{\log(n)}{\lambda_{k+1}}\cdot\frac{k(x_i,x_j)}{g_{[1,n]}(x_i)},1\right\}.

Similarly to the definition of $p_e$ in Section 3, we define

  • $p(i,j)=p_i(j)+p_j(i)-p_i(j)\cdot p_j(i)$,

  • $\widetilde{p}(i,j)=\widetilde{p}_i(j)+\widetilde{p}_j(i)-\widetilde{p}_i(j)\cdot\widetilde{p}_j(i)$, and

  • $\widehat{p}(i,j)=\widehat{p}_i(j)+\widehat{p}_j(i)-\widehat{p}_i(j)\cdot\widehat{p}_j(i)$.

Notice that, following the convention of [27], we use $p_i(j)$ to refer to the probability that a given edge is sampled from the vertex $x_i$, and $p(i,j)$ for the probability that the edge $\{v_i,v_j\}$ is sampled at all by the algorithm. We use the same convention for $\widetilde{p}_i(j)$ and $\widehat{p}_i(j)$.

We first prove a sequence of lemmas showing that these probabilities are all within a constant factor of each other.

Lemma 1.

For any point $x_i$, the probability that a given sampled neighbour $y_{i,l}$ is equal to $j$ satisfies

\frac{k(x_i,x_j)}{2\deg_{\mathsf{K}}(v_i)}\leq\mathbb{P}\left[y_{i,l}=j\right]\leq\frac{2k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)}.

Proof.

Let $X=\{x_1,\ldots,x_n\}$ be the input data points for Algorithm 2, and $[n]=\{1,\ldots,n\}$ the indices of the input data points. Furthermore, let $[a,b]=\{a,\ldots,b\}$ be the set of indices between $a$ and $b$. Then, in each recursive call to Algorithm 1, we are given a range of indices $[a,b]$ as input and assign $y_{i,l}$ to one half of it: either $[a,\lfloor b/2\rfloor]$ or $[\lfloor b/2\rfloor+1,b]$. By Algorithm 1, the probability of assigning $y_{i,l}$ to $[a,\lfloor b/2\rfloor]$ is

\mathbb{P}\left[y_{i,l}\in[a,\lfloor b/2\rfloor]~|~y_{i,l}\in[a,b]\right]=\frac{g_{[a,\lfloor b/2\rfloor]}(x_i)}{g_{[a,b]}(x_i)}.

By the performance guarantee of the KDE algorithm, we have that $g_{[a,b]}(x_i)\in(1\pm\epsilon)\deg_{[a,b]}(v_i)$, where we define

\deg_{[a,b]}(v_i)\triangleq\sum_{j=a}^{b}k(x_i,x_j).

This gives

\left(\frac{1-\epsilon}{1+\epsilon}\right)\frac{\deg_{[a,\lfloor b/2\rfloor]}(v_i)}{\deg_{[a,b]}(v_i)}\leq\mathbb{P}\left[y_{i,l}\in[a,\lfloor b/2\rfloor]~|~y_{i,l}\in[a,b]\right]\leq\left(\frac{1+\epsilon}{1-\epsilon}\right)\frac{\deg_{[a,\lfloor b/2\rfloor]}(v_i)}{\deg_{[a,b]}(v_i)}. \qquad (3)

Next, notice that for a sequence of intervals $[a_1,b_1]\subset[a_2,b_2]\subset\ldots\subset[1,n]$ we can write

\mathbb{P}\left[y_{i,l}=j\right]=\mathbb{P}\left[y_{i,l}=j~|~y_{i,l}\in[a_1,b_1]\right]\times\mathbb{P}\left[y_{i,l}\in[a_1,b_1]~|~y_{i,l}\in[a_2,b_2]\right]\times\ldots\times\mathbb{P}\left[y_{i,l}\in[a_k,b_k]~|~y_{i,l}\in[1,n]\right],

where each term corresponds to one level of recursion of Algorithm 1 and there are at most $\lceil\log_2(n)\rceil$ terms. Then, by (3) and the fact that the denominator and numerator of adjacent terms cancel out, we have

\left(\frac{1-\epsilon}{1+\epsilon}\right)^{\lceil\log_2(n)\rceil}\frac{k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)}\leq\mathbb{P}\left[y_{i,l}=j\right]\leq\left(\frac{1+\epsilon}{1-\epsilon}\right)^{\lceil\log_2(n)\rceil}\frac{k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)},

since $\deg_{[j,j]}(v_i)=k(x_i,x_j)$ and $\deg_{[1,n]}(v_i)=\deg_{\mathsf{K}}(v_i)$.

For the lower bound, we have that

\left(\frac{1-\epsilon}{1+\epsilon}\right)^{\lceil\log_2(n)\rceil}\geq\left(1-2\epsilon\right)^{\lceil\log_2(n)\rceil}\geq 1-3\log_2(n)\epsilon\geq 1/2,

where the final inequality follows by the condition that $\epsilon\leq 1/(6\log_2(n))$.

For the upper bound, we similarly have

\left(\frac{1+\epsilon}{1-\epsilon}\right)^{\lceil\log_2(n)\rceil}\leq\left(1+3\epsilon\right)^{\lceil\log_2(n)\rceil}\leq\exp\left(3\lceil\log_2(n)\rceil\epsilon\right)\leq\mathrm{e}^{2/3}\leq 2,

where the first inequality follows since $\epsilon<1/(6\log_2(n))$. ∎

The next lemma shows that the probability of sampling each edge $\{v_i,v_j\}$ is approximately equal to the "ideal" sampling probability $p_i(j)$.

Lemma 2.

For every $i$ and $j\neq i$, we have

\frac{9}{10}\cdot p_i(j)\leq\widetilde{p}_i(j)\leq 12\cdot p_i(j).

Proof.

Let $Y=\{y_{i,1},\ldots,y_{i,L}\}$ be the neighbours of $x_i$ sampled by Algorithm 2, where $L=6C\log(n)/\lambda_{k+1}$. Then,

\mathbb{P}\left[j\in Y\right]=1-\prod_{l=1}^{L}\left(1-\mathbb{P}\left[y_{i,l}=j\right]\right)\geq 1-\left(1-\frac{k(x_i,x_j)}{2\deg_{\mathsf{K}}(v_i)}\right)^{L}\geq 1-\exp\left(-L\cdot\frac{k(x_i,x_j)}{2\deg_{\mathsf{K}}(v_i)}\right).

The proof proceeds by case distinction.

Case 1: $p_i(j)\leq 9/10$. In this case, we have

\mathbb{P}\left[j\in Y\right]\geq 1-\exp\left(-6p_i(j)/2\right)\geq p_i(j).

Case 2: $p_i(j)>9/10$. In this case, we have

\mathbb{P}\left[j\in Y\right]\geq 1-\exp\left(-\frac{9\cdot 6}{20}\right)\geq\frac{9}{10},

which completes the proof of the lower bound on $\widetilde{p}_i(j)$.

For the upper bound, we have

\mathbb{P}\left[j\in Y\right]\leq 1-\left(1-\frac{2k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)}\right)^{L}\leq\frac{2k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)}\cdot L=12C\cdot\frac{\log(n)}{\lambda_{k+1}}\cdot\frac{k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)},

from which the statement follows. ∎

An immediate corollary of Lemma 2 is as follows.

Corollary 1.

For all distinct $i,j\in[n]$, it holds that

\frac{9}{10}\cdot p(i,j)\leq\widetilde{p}(i,j)\leq 12\cdot p(i,j)

and

\frac{6}{7}\cdot p(i,j)\leq\widehat{p}(i,j)\leq\frac{36}{5}\cdot p(i,j).

Proof.

For the upper bound of the first statement, we can assume that $p_i(j)\leq 1/12$ and $p_j(i)\leq 1/12$, since otherwise we have $\widetilde{p}(i,j)\leq 1\leq 12\cdot p(i,j)$ and the statement holds trivially. Then, by Lemma 2, we have

\widetilde{p}(i,j)=\widetilde{p}_i(j)+\widetilde{p}_j(i)-\widetilde{p}_i(j)\cdot\widetilde{p}_j(i)
\leq 12p_i(j)+12p_j(i)-12p_i(j)\cdot 12p_j(i)
\leq 12\left(p_i(j)+p_j(i)-p_i(j)\cdot p_j(i)\right)
=12\cdot p(i,j)

and

\widetilde{p}(i,j)=\widetilde{p}_i(j)+\widetilde{p}_j(i)-\widetilde{p}_i(j)\cdot\widetilde{p}_j(i)
\geq\frac{9}{10}\cdot p_i(j)+\frac{9}{10}\cdot p_j(i)-\frac{9}{10}p_i(j)\cdot\frac{9}{10}p_j(i)
\geq\frac{9}{10}\left(p_i(j)+p_j(i)-p_i(j)p_j(i)\right)
=\frac{9}{10}\cdot p(i,j).

For the second statement, notice that

\widehat{p}_i(j)=\min\left\{\frac{6C\log(n)}{\lambda_{k+1}}\cdot\frac{k(x_i,x_j)}{g_{[1,n]}(x_i)},1\right\}
\geq\min\left\{\frac{1}{1+\varepsilon}\cdot\frac{C\log(n)}{\lambda_{k+1}}\cdot\frac{k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)},1\right\}
\geq\frac{1}{1+\varepsilon}\cdot\min\left\{\frac{C\log(n)}{\lambda_{k+1}}\cdot\frac{k(x_i,x_j)}{\deg_{\mathsf{K}}(v_i)},1\right\}
=\frac{1}{1+\varepsilon}\cdot p_i(j)
\geq\frac{6}{7}\cdot p_i(j),

since $g_{[1,n]}(x_i)$ is a $(1\pm\varepsilon)$-approximation of $\deg_{\mathsf{K}}(v_i)$ and $\varepsilon\leq 1/6$. Similarly,

\widehat{p}_i(j)\leq\frac{6}{1-\varepsilon}\cdot p_i(j)\leq\frac{36}{5}\cdot p_i(j).

Then, for the upper bound of the second statement, we can assume that $p_i(j)\leq 5/36$ and $p_j(i)\leq 5/36$, since otherwise $\widehat{p}(i,j)\leq 1\leq(36/5)\cdot p(i,j)$ and the statement holds trivially. This implies that

\widehat{p}(i,j)=\widehat{p}_i(j)+\widehat{p}_j(i)-\widehat{p}_i(j)\cdot\widehat{p}_j(i)
\leq\frac{36}{5}p_i(j)+\frac{36}{5}p_j(i)-\frac{36}{5}p_i(j)\cdot\frac{36}{5}p_j(i)
\leq\frac{36}{5}\left(p_i(j)+p_j(i)-p_i(j)\cdot p_j(i)\right)
=\frac{36}{5}\cdot p(i,j)

and

\widehat{p}(i,j)\geq\frac{6}{7}p_i(j)+\frac{6}{7}p_j(i)-\frac{6}{7}p_i(j)\cdot\frac{6}{7}p_j(i)
\geq\frac{6}{7}\left(p_i(j)+p_j(i)-p_i(j)\cdot p_j(i)\right)
=\frac{6}{7}\cdot p(i,j),

which completes the proof. ∎

We are now ready to prove Theorem 2. It is important to note that, although some of the proofs below parallel those of [27], our analysis needs to carefully take into account the approximation errors introduced by the approximate KDE algorithm, which makes the analysis more involved. The following concentration inequalities will be used in our proof.

Lemma 3 (Bernstein's Inequality [8]).

Let $X_1,\ldots,X_n$ be independent random variables such that $|X_i|\leq M$ for any $i\in\{1,\ldots,n\}$. Let $X=\sum_{i=1}^{n}X_i$, and $R=\sum_{i=1}^{n}\mathbb{E}\left[X_i^2\right]$. Then, it holds that

\mathbb{P}\left[|X-\mathbb{E}\left[X\right]|\geq t\right]\leq 2\exp\left(-\frac{t^2}{2(R+Mt/3)}\right).

Lemma 4 (Matrix Chernoff Bound [28]).

Consider a finite sequence $\{X_i\}$ of independent, random, PSD matrices of dimension $d$ that satisfy $\|X_i\|\leq R$. Let $\mu_{\mathrm{min}}\triangleq\lambda_{\mathrm{min}}(\mathbb{E}\left[\sum_i X_i\right])$ and $\mu_{\mathrm{max}}\triangleq\lambda_{\mathrm{max}}(\mathbb{E}\left[\sum_i X_i\right])$. Then, it holds that

\mathbb{P}\left[\lambda_{\mathrm{min}}\left(\sum_i X_i\right)\leq(1-\delta)\mu_{\mathrm{min}}\right]\leq d\left(\frac{\mathrm{e}^{-\delta}}{(1-\delta)^{1-\delta}}\right)^{\mu_{\mathrm{min}}/R}

for $\delta\in[0,1]$, and

\mathbb{P}\left[\lambda_{\mathrm{max}}\left(\sum_i X_i\right)\geq(1+\delta)\mu_{\mathrm{max}}\right]\leq d\left(\frac{\mathrm{e}^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu_{\mathrm{max}}/R}

for $\delta\geq 0$.

Proof of Theorem 2.

We first show that the degrees of all the vertices in the similarity graph 𝖪\mathsf{K} are preserved with high probability in the sparsifier 𝖦\mathsf{G}. For any vertex viv_{i}, and let yi,1,,yi,Ly_{i,1},\ldots,y_{i,L} be the indices of the neighbours of viv_{i} sampled by Algorithm 2.

For every pair of indices iji\neq j, and for every 1lL1\leq l\leq L, we define the random variable Zi,j,lZ_{i,j,l} to be the weight of the sampled edge if jj is the neighbour sampled from ii at iteration ll, and 0 otherwise:

Zi,j,l{k(xi,xj)p^(i,j)if yi,l=j0otherwise.Z_{i,j,l}\triangleq\left\{\begin{array}[]{ll}\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}&\mbox{if }y_{i,l}=j\\ 0&\mbox{otherwise.}\end{array}\right.
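To make the role of $Z_{i,j,l}$ concrete, the following simplified sketch mimics the sampling and reweighting scheme analysed here: every vertex draws $L$ neighbours with probability proportional to the kernel weight, and each sampled edge $\{v_{i},v_{j}\}$ receives weight $k(x_{i},x_{j})/\widehat{p}(i,j)$. For simplicity it uses exact sampling probabilities (capped at $1$) in place of the KDE-based approximations and a Gaussian kernel; it is not the paper's Algorithm 2.

```python
# A simplified sketch (not the paper's Algorithm 2) of the sampling and
# reweighting scheme analysed here.  Every vertex draws L neighbours with
# probability proportional to the kernel weight, and each sampled edge
# {v_i, v_j} receives weight k(x_i, x_j) / p_hat(i, j).  For simplicity we
# use the exact probabilities p_i(j) = min(L * k(x_i, x_j) / deg(v_i), 1)
# instead of the KDE-based approximations, and a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(0)
n, d, L = 200, 2, 12
X = rng.normal(size=(n, d))
K = np.exp(-np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1)))
np.fill_diagonal(K, 0.0)
deg = K.sum(axis=1)                              # degrees in the dense graph

P = np.minimum(L * K / deg[:, None], 1.0)        # p_i(j)
P_hat = P + P.T - P * P.T                        # p(i, j) for i != j

W = np.zeros((n, n))                             # adjacency of the sparsifier
for i in range(n):
    nbrs = rng.choice(n, size=L, p=K[i] / deg[i])
    for j in nbrs:
        w = K[i, j] / P_hat[i, j]
        W[i, j] += w
        W[j, i] += w

# The degrees of the sparsifier should match those of the dense graph up to
# constant factors, mirroring the first part of the proof below.
ratio = W.sum(axis=1) / deg
print(f"degree ratio: min = {ratio.min():.2f}, max = {ratio.max():.2f}")
```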

Then, fixing an arbitrary vertex $v_{i}$, we can write

\deg_{\mathsf{G}}(v_{i})=\sum_{l=1}^{L}\sum_{j\neq i}\left(Z_{i,j,l}+Z_{j,i,l}\right).

We have

𝔼[deg𝖦(vi)]\displaystyle\mathbb{E}\left[\deg_{\mathsf{G}}(v_{i})\right] =l=1Lji𝔼[Zi,j,l]+𝔼[Zj,i,l]\displaystyle=\sum_{l=1}^{L}\sum_{j\neq i}\mathbb{E}\left[Z_{i,j,l}\right]+\mathbb{E}\left[Z_{j,i,l}\right]
=l=1Lji[[yi,l=j]k(xi,xj)p^(i,j)+[yj,l=i]k(xi,xj)p^(i,j)].\displaystyle=\sum_{l=1}^{L}\sum_{j\neq i}\left[\mathbb{P}\left[y_{i,l}=j\right]\cdot\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}+\mathbb{P}\left[y_{j,l}=i\right]\cdot\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\right].

By Lemmas 1 and 2 and Corollary 1, we have

𝔼[deg𝖦(vi)]\displaystyle\mathbb{E}\left[\deg_{\mathsf{G}}(v_{i})\right] jik(xi,xj)p^(i,j)(Lk(xi,xj)2deg𝖪(vi)+Lk(xi,xj)2deg𝖪(vj))\displaystyle\geq\sum_{j\neq i}\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\left(\frac{L\cdot k(x_{i},x_{j})}{2\deg_{\mathsf{K}}(v_{i})}+\frac{L\cdot k(x_{i},x_{j})}{2\deg_{\mathsf{K}}(v_{j})}\right)
\displaystyle=\sum_{j\neq i}\frac{3k(x_{i},x_{j})}{\widehat{p}(i,j)}\left(p_{i}(j)+p_{j}(i)\right)
\displaystyle\geq\sum_{j\neq i}\frac{5k(x_{i},x_{j})}{12}=\frac{5\deg_{\mathsf{K}}(v_{i})}{12},

where the last inequality follows by the fact that p^(i,j)(36/5)p(i,j)(36/5)(pi(j)+pj(i))\widehat{p}(i,j)\leq(36/5)\cdot p(i,j)\leq(36/5)\cdot\left(p_{i}(j)+p_{j}(i)\right). Similarly, we have

𝔼[deg𝖦(vi)]\displaystyle\mathbb{E}\left[\deg_{\mathsf{G}}(v_{i})\right] jik(xi,xj)p^(i,j)(2Lk(xi,xj)deg𝖪(vi)+2Lk(xi,xj)deg𝖪(vj))\displaystyle\leq\sum_{j\neq i}\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\left(\frac{2\cdot L\cdot k(x_{i},x_{j})}{\deg_{\mathsf{K}}(v_{i})}+\frac{2\cdot L\cdot k(x_{i},x_{j})}{\deg_{\mathsf{K}}(v_{j})}\right)
=ji12k(xi,xj)p^(i,j)(pi(j)+pj(i))\displaystyle=\sum_{j\neq i}\frac{12\cdot k(x_{i},x_{j})}{\widehat{p}(i,j)}\left(p_{i}(j)+p_{j}(i)\right)
ji28k(xi,xj)=28deg𝖪(vi),\displaystyle\leq\sum_{j\neq i}28\cdot k(x_{i},x_{j})=28\cdot\deg_{\mathsf{K}}(v_{i}),

where the inequality follows by the fact that

p^(i,j)67p(i,j)=67(pi(j)+pj(i)pi(j)pj(i))37(pi(j)+pj(i)).\widehat{p}(i,j)\geq\frac{6}{7}\cdot p(i,j)=\frac{6}{7}\cdot\left(p_{i}(j)+p_{j}(i)-p_{i}(j)\cdot p_{j}(i)\right)\geq\frac{3}{7}\cdot\left(p_{i}(j)+p_{j}(i)\right).

In order to prove a concentration bound on this degree estimate, we apply Bernstein's inequality, for which we need to bound

R\displaystyle R =l=1Lji𝔼[Zi,j,l2]+𝔼[Zj,i,l2]\displaystyle=\sum_{l=1}^{L}\sum_{j\neq i}\mathbb{E}\left[Z_{i,j,l}^{2}\right]+\mathbb{E}\left[Z_{j,i,l}^{2}\right]
=l=1Lji[yi,l=j]k(xi,xj)2p^(i,j)2+[yj,l=i]k(xi,xj)2p^(i,j)2\displaystyle=\sum_{l=1}^{L}\sum_{j\neq i}\mathbb{P}\left[y_{i,l}=j\right]\cdot\frac{k(x_{i},x_{j})^{2}}{\widehat{p}(i,j)^{2}}+\mathbb{P}\left[y_{j,l}=i\right]\cdot\frac{k(x_{i},x_{j})^{2}}{\widehat{p}(i,j)^{2}}
ji12k(xi,xj)2p^(i,j)2(pi(j)+pj(i))\displaystyle\leq\sum_{j\neq i}\frac{12\cdot k(x_{i},x_{j})^{2}}{\widehat{p}(i,j)^{2}}\cdot\left(p_{i}(j)+p_{j}(i)\right)
ji28k(xi,xj)2p^(i,j)\displaystyle\leq\sum_{j\neq i}28\cdot\frac{k(x_{i},x_{j})^{2}}{\widehat{p}(i,j)}
ji983k(xi,xj)2pi(j)\displaystyle\leq\sum_{j\neq i}\frac{98}{3}\cdot\frac{k(x_{i},x_{j})^{2}}{p_{i}(j)}
=ji983k(xi,xj)deg𝖪(vi)λk+1Clog(n)\displaystyle=\sum_{j\neq i}\frac{98}{3}\cdot\frac{k(x_{i},x_{j})\cdot\deg_{\mathsf{K}}(v_{i})\cdot\lambda_{k+1}}{C\log(n)}
=98deg𝖪(vi)2λk+13Clog(n),\displaystyle=\frac{98\cdot\deg_{\mathsf{K}}(v_{i})^{2}\cdot\lambda_{k+1}}{3\cdot C\log(n)},

where the third inequality follows by the fact that

p^(i,j)67p(i,j)67pi(j).\widehat{p}(i,j)\geq\frac{6}{7}\cdot p(i,j)\geq\frac{6}{7}\cdot p_{i}(j).

Then, by applying Bernstein’s inequality we have for any constant C2C_{2} that

[|deg𝖦(vi)𝔼[deg𝖦(vi)]|1C2deg𝖪(vi)]\displaystyle\mathbb{P}\left[\left|\deg_{\mathsf{G}}(v_{i})-\mathbb{E}[\deg_{\mathsf{G}}(v_{i})]\right|\geq\frac{1}{C_{2}}\deg_{\mathsf{K}}(v_{i})\right] 2exp(deg𝖪(vi)2/(2C22)98deg𝖪(vi)2λk+13Clog(n)+7deg𝖪(vi)2λk+16CC2log(n))\displaystyle\leq 2\exp\left(-\frac{\deg_{\mathsf{K}}(v_{i})^{2}/(2\cdot C_{2}^{2})}{\frac{98\deg_{\mathsf{K}}(v_{i})^{2}\lambda_{k+1}}{3C\log(n)}+\frac{7\deg_{\mathsf{K}}(v_{i})^{2}\lambda_{k+1}}{6CC_{2}\cdot\log(n)}}\right)
2exp(Clog(n)((196/3)C22+(7/3)C2)λk+1)\displaystyle\leq 2\exp\left(-\frac{C\cdot\log(n)}{((196/3)\cdot C_{2}^{2}+(7/3)\cdot C_{2})\cdot\lambda_{k+1}}\right)
=o(1/n),\displaystyle=o(1/n),

where we use the fact that

Zi,j,l7k(xi,xj)6pi(j)=7deg𝖪(vi)λk+16Clog(n).Z_{i,j,l}\leq\frac{7k(x_{i},x_{j})}{6p_{i}(j)}=\frac{7\deg_{\mathsf{K}}(v_{i})\cdot\lambda_{k+1}}{6C\cdot\log(n)}.

Therefore, by taking $C$ to be sufficiently large and by the union bound, it holds with high probability that the degrees of all the nodes in $\mathsf{G}$ are preserved up to a constant factor. For the remainder of the proof, we assume that this is the case. Note in particular that this implies $\mathrm{vol}_{\mathsf{G}}(S)=\Theta(\mathrm{vol}_{\mathsf{K}}(S))$ for any subset $S\subseteq V$.

Next, we prove that $\phi_{\mathsf{G}}(S_{i})=O\left(k\cdot\phi_{\mathsf{K}}(S_{i})\right)$ holds for any $1\leq i\leq k$, where $S_{1},\ldots,S_{k}$ form an optimal clustering in $\mathsf{K}$.

By the definition of Zi,j,lZ_{i,j,l}, it holds for any 1ik1\leq i\leq k that

𝔼[𝖦(Si)]\displaystyle\mathbb{E}\left[\partial_{\mathsf{G}}(S_{i})\right] =𝔼[aSibSil=1LZa,b,l+Zb,a,l]\displaystyle=\mathbb{E}\left[\sum_{a\in S_{i}}\sum_{b\not\in S_{i}}\sum_{l=1}^{L}Z_{a,b,l}+Z_{b,a,l}\right]
aSibSi12k(xa,xb)p^(a,b)(pa(b)+pb(a))\displaystyle\leq\sum_{a\in S_{i}}\sum_{b\not\in S_{i}}\frac{12k(x_{a},x_{b})}{\widehat{p}(a,b)}\cdot\left(p_{a}(b)+p_{b}(a)\right)
=O(𝖪(Si))\displaystyle=O\left(\partial_{\mathsf{K}}(S_{i})\right)

where the last line follows by Corollary 1. By Markov’s inequality and the union bound, with constant probability it holds for all i=1,,ki=1,\ldots,k that

𝖦(Si)=O(k𝖪(Si)).\partial_{\mathsf{G}}(S_{i})=O(k\cdot\partial_{\mathsf{K}}(S_{i})).

Therefore, it holds with constant probability that

ρ𝖦(k)max1ikϕ𝖦(Si)=max1ikO(kϕ𝖪(Si))=O(kρ𝖪(k)).\rho_{\mathsf{G}}(k)\leq\max_{1\leq i\leq k}\phi_{\mathsf{G}}(S_{i})=\max_{1\leq i\leq k}O(k\cdot\phi_{\mathsf{K}}(S_{i}))=O(k\cdot\rho_{\mathsf{K}}(k)).

Finally, we prove that $\lambda_{k+1}(\mathbf{N}_{\mathsf{G}})=\Omega(\lambda_{k+1}(\mathbf{N}_{\mathsf{K}}))$. Let $\overline{\mathbf{N}}_{\mathsf{K}}$ be the projection of $\mathbf{N}_{\mathsf{K}}$ onto its top $n-k$ eigenspace, and notice that $\overline{\mathbf{N}}_{\mathsf{K}}$ can be written as

𝐍¯𝖪=i=k+1nλi(𝐍𝖪)fifi\overline{\mathbf{N}}_{\mathsf{K}}=\sum_{i=k+1}^{n}\lambda_{i}(\mathbf{N}_{\mathsf{K}})f_{i}f_{i}^{\intercal}

where f1,,fnf_{1},\ldots,f_{n} are the eigenvectors of 𝐍𝖪\mathbf{N}_{\mathsf{K}}. Let 𝐍¯𝖪1/2\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2} be the square root of the pseudoinverse of 𝐍¯𝖪\overline{\mathbf{N}}_{\mathsf{K}}.
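Concretely, assuming $\mathbf{N}_{\mathsf{K}}$ is available as a symmetric matrix and $\lambda_{k+1}(\mathbf{N}_{\mathsf{K}})>0$, the matrix $\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}$ can be formed from an eigendecomposition as in the following sketch; the function name is an illustrative choice.

```python
# An illustrative way (assuming N_K is a symmetric numpy array with
# lambda_{k+1}(N_K) > 0) of forming the square root of the pseudoinverse of
# the projection of N_K onto its top n - k eigenspace.
import numpy as np

def projected_pinv_sqrt(N_K: np.ndarray, k: int) -> np.ndarray:
    lam, F = np.linalg.eigh(N_K)        # eigenvalues in non-decreasing order
    lam_top, F_top = lam[k:], F[:, k:]  # the top n - k eigenpairs
    return F_top @ np.diag(1.0 / np.sqrt(lam_top)) @ F_top.T
```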

We prove that the top nkn-k eigenvalues of 𝐍𝖪\mathbf{N}_{\mathsf{K}} are preserved, which implies that λk+1(𝐍𝖦)=Θ(λk+1(𝐍𝖪))\lambda_{k+1}(\mathbf{N}_{\mathsf{G}})=\Theta(\lambda_{k+1}(\mathbf{N}_{\mathsf{K}})). To prove this, for each data point xix_{i} and sample 1lL1\leq l\leq L, we define a random matrix Xi,ln×nX_{i,l}\in\mathbb{R}^{n\times n} by

Xi,l=w𝖦(vi,vj)𝐍¯𝖪1/2bebe𝐍¯𝖪1/2;X_{i,l}=w_{\mathsf{G}}(v_{i},v_{j})\cdot\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}b_{e}b_{e}^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2};

where $j=y_{i,l}$, $b_{e}\triangleq\chi_{v_{i}}-\chi_{v_{j}}$ is the edge indicator vector, and $\chi_{v_{i}}\in\mathbb{R}^{n}$ is defined by

χvi(a){1deg𝖪(vi)if a=i0otherwise.\chi_{v_{i}}(a)\triangleq\left\{\begin{array}[]{ll}\frac{1}{\sqrt{\deg_{\mathsf{K}}(v_{i})}}&\mbox{if }a=i\\ 0&\mbox{otherwise.}\end{array}\right.

Notice that

i=1nl=1LXi,l=sampled edges e={vi,vj}w𝖦(vi,vj)𝐍¯𝖪1/2bebe𝐍¯𝖪1/2=𝐍¯𝖪1/2𝐍𝖦𝐍¯𝖪1/2\sum_{i=1}^{n}\sum_{l=1}^{L}X_{i,l}=\sum_{\text{sampled edges }e=\{v_{i},v_{j}\}}w_{\mathsf{G}}(v_{i},v_{j})\cdot\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}b_{e}b_{e}^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}=\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}

where

𝐍𝖦=sampled edges e={vi,vj}w𝖦(vi,vj)bebe\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}=\sum_{\text{sampled edges }e=\{v_{i},v_{j}\}}w_{\mathsf{G}}(v_{i},v_{j})\cdot b_{e}b_{e}^{\intercal}

is the Laplacian matrix of $\mathsf{G}$ normalised with respect to the degrees of the nodes in $\mathsf{K}$. We prove that, with high probability, the top $n-k$ eigenvalues of $\mathbf{N}_{\mathsf{G}}^{\prime}$ and $\mathbf{N}_{\mathsf{K}}$ agree up to constant factors. Then, we show the same for $\mathbf{N}_{\mathsf{G}}$ and $\mathbf{N}_{\mathsf{G}}^{\prime}$, which implies that $\lambda_{k+1}(\mathbf{N}_{\mathsf{G}})=\Omega(\lambda_{k+1}(\mathbf{N}_{\mathsf{K}}))$.
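The following small check (illustrative only, on a random weighted graph) confirms that summing the rank-one terms $w_{\mathsf{G}}(v_{i},v_{j})\,b_{e}b_{e}^{\intercal}$ with $b_{e}=\chi_{v_{i}}-\chi_{v_{j}}$ indeed yields $\mathbf{D}_{\mathsf{K}}^{-1/2}\mathbf{L}_{\mathsf{G}}\mathbf{D}_{\mathsf{K}}^{-1/2}$, where $\mathbf{L}_{\mathsf{G}}$ denotes the combinatorial Laplacian of $\mathsf{G}$ and $\mathbf{D}_{\mathsf{K}}$ holds the degrees of $\mathsf{K}$.

```python
# Illustrative check on a small random weighted graph: summing the rank-one
# terms w_G(v_i, v_j) * b_e b_e^T with b_e = chi_{v_i} - chi_{v_j} yields
# D_K^{-1/2} L_G D_K^{-1/2}.  The graph and the degrees of K are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = np.triu(rng.random((n, n)) * (rng.random((n, n)) < 0.5), k=1)
A = A + A.T                                    # weighted adjacency of G
deg_K = rng.uniform(1.0, 3.0, size=n)          # degrees taken from K

L_G = np.diag(A.sum(axis=1)) - A               # combinatorial Laplacian of G
N_prime = np.zeros((n, n))
for u in range(n):
    for v in range(u + 1, n):
        if A[u, v] > 0:
            b = np.zeros(n)
            b[u] = 1.0 / np.sqrt(deg_K[u])
            b[v] = -1.0 / np.sqrt(deg_K[v])
            N_prime += A[u, v] * np.outer(b, b)

D_K_inv_sqrt = np.diag(1.0 / np.sqrt(deg_K))
assert np.allclose(N_prime, D_K_inv_sqrt @ L_G @ D_K_inv_sqrt)
```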

We begin by looking at the first moment of i=1nl=1LXi,l\sum_{i=1}^{n}\sum_{l=1}^{L}X_{i,l}, and have that

λmin(𝔼[i=1nl=1LXi,l])\displaystyle\lambda_{\mathrm{min}}\left(\mathbb{E}\left[\sum_{i=1}^{n}\sum_{l=1}^{L}X_{i,l}\right]\right) =λmin(i=1nl=1Ljie={vi,vj}[yi,l=j]k(xi,xj)p^(i,j)𝐍¯𝖪1/2bebe𝐍¯𝖪1/2)\displaystyle=\lambda_{\mathrm{min}}\left(\sum_{i=1}^{n}\sum_{l=1}^{L}\sum_{\begin{subarray}{c}j\neq i\\ e=\{v_{i},v_{j}\}\end{subarray}}\mathbb{P}\left[y_{i,l}=j\right]\cdot\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\cdot\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}b_{e}b_{e}^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\right)
λmin(i=1njie={vi,vj}3pi(j)k(xi,xj)p^(i,j)𝐍¯𝖪1/2bebe𝐍¯𝖪1/2)\displaystyle\geq\lambda_{\mathrm{min}}\left(\sum_{i=1}^{n}\sum_{\begin{subarray}{c}j\neq i\\ e=\{v_{i},v_{j}\}\end{subarray}}3p_{i}(j)\cdot\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\cdot\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}b_{e}b_{e}^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\right)
λmin(512𝐍¯𝖪1/2𝐍𝖪𝐍¯𝖪1/2)=512,\displaystyle\geq\lambda_{\mathrm{min}}\left(\frac{5}{12}\cdot\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\mathbf{N}_{\mathsf{K}}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\right)=\frac{5}{12},

where the last inequality follows by the fact that

p^(i,j)365p(i,j)365(pi(j)+pj(i)).\widehat{p}(i,j)\leq\frac{36}{5}\cdot p(i,j)\leq\frac{36}{5}\cdot\left(p_{i}(j)+p_{j}(i)\right).

Similarly,

λmax(𝔼[i=1nl=1LXi,l])\displaystyle\lambda_{\mathrm{max}}\left(\mathbb{E}\left[\sum_{i=1}^{n}\sum_{l=1}^{L}X_{i,l}\right]\right) =λmax(i=1nl=1Ljie={vi,vj}[yi,l=j]k(xi,xj)p^(i,j)𝐍¯𝖪1/2bebe𝐍¯𝖪1/2)\displaystyle=\lambda_{\mathrm{max}}\left(\sum_{i=1}^{n}\sum_{l=1}^{L}\sum_{\begin{subarray}{c}j\neq i\\ e=\{v_{i},v_{j}\}\end{subarray}}\mathbb{P}\left[y_{i,l}=j\right]\cdot\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\cdot\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}b_{e}b_{e}^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\right)
λmax(i=1njie={vi,vj}12pi(j)k(xi,xj)p^(i,j)𝐍¯𝖪1/2bebe𝐍¯𝖪1/2)\displaystyle\leq\lambda_{\mathrm{max}}\left(\sum_{i=1}^{n}\sum_{\begin{subarray}{c}j\neq i\\ e=\{v_{i},v_{j}\}\end{subarray}}12\cdot p_{i}(j)\cdot\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\cdot\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}b_{e}b_{e}^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\right)
λmax(28𝐍¯𝖪1/2𝐍𝖪𝐍¯𝖪1/2)=28,\displaystyle\leq\lambda_{\mathrm{max}}\left(28\cdot\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\mathbf{N}_{\mathsf{K}}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\right)=28,

where the last inequality follows by the fact that

p^(i,j)67p(i,j)37(pi(j)+pj(i)).\widehat{p}(i,j)\geq\frac{6}{7}\cdot p(i,j)\geq\frac{3}{7}\cdot(p_{i}(j)+p_{j}(i)).

Additionally, for any ii and j=yi,lj=y_{i,l} and e={vi,vj}e=\{v_{i},v_{j}\}, we have that

Xi,l\displaystyle\|X_{i,l}\| w𝖦(vi,vj)be𝐍¯𝖪1/2𝐍¯𝖪1/2be\displaystyle\leq w_{\mathsf{G}}(v_{i},v_{j})\cdot b_{e}^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}b_{e}
=k(xi,xj)p^(i,j)be𝐍¯𝖪1be\displaystyle=\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\cdot b_{e}^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1}b_{e}
k(xi,xj)p^(i,j)1λk+1be2\displaystyle\leq\frac{k(x_{i},x_{j})}{\widehat{p}(i,j)}\cdot\frac{1}{\lambda_{k+1}}\|b_{e}\|^{2}
7λk+13Clog(n)(1deg𝖪(vi)+1deg𝖪(vj))1λk+1(1deg𝖪(vi)+1deg𝖪(vj))\displaystyle\leq\frac{7\cdot\lambda_{k+1}}{3C\log(n)\left(\frac{1}{\deg_{\mathsf{K}}(v_{i})}+\frac{1}{\deg_{\mathsf{K}}(v_{j})}\right)}\cdot\frac{1}{\lambda_{k+1}}\left(\frac{1}{\deg_{\mathsf{K}}(v_{i})}+\frac{1}{\deg_{\mathsf{K}}(v_{j})}\right)
73Clog(n).\displaystyle\leq\frac{7}{3C\log(n)}.

Now, we apply the matrix Chernoff bound and have that

[λmax(i=1nl=1LXi,l)42]n(e1/2(1+1/2)3/2)12Clog(n)=O(1/nc)\displaystyle\mathbb{P}\left[\lambda_{\mathrm{max}}\left(\sum_{i=1}^{n}\sum_{l=1}^{L}X_{i,l}\right)\geq 42\right]\leq n\left(\frac{e^{1/2}}{(1+1/2)^{3/2}}\right)^{12C\cdot\log(n)}=O(1/n^{c})

for some constant cc. The other side of the matrix Chernoff bound gives us that

[λmin(i=1nl=1LXi,l)5/24]O(1/nc).\displaystyle\mathbb{P}\left[\lambda_{\mathrm{min}}\left(\sum_{i=1}^{n}\sum_{l=1}^{L}X_{i,l}\right)\leq 5/24\right]\leq O(1/n^{c}).

Combining these, with probability 1O(1/nc)1-O(1/n^{c}) it holds for any non-zero xnx\in\mathbb{R}^{n} in the space spanned by fk+1,,fnf_{k+1},\ldots,f_{n} that

x𝐍¯𝖪1/2𝐍𝖦𝐍¯𝖪1/2xxx(5/24,42).\frac{x^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}x}{x^{\intercal}x}\in(5/24,42).

By setting y=𝐍¯𝖪1/2xy=\overline{\mathbf{N}}_{\mathsf{K}}^{-1/2}x, we can rewrite this as

y𝐍𝖦yy𝐍¯𝖪1/2𝐍¯𝖪1/2y=y𝐍𝖦yy𝐍¯𝖪y=y𝐍𝖦yyyyyy𝐍¯𝖪y(5/24,42).\frac{y^{\intercal}\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}y}{y^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}^{1/2}\overline{\mathbf{N}}_{\mathsf{K}}^{1/2}y}=\frac{y^{\intercal}\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}y}{y^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}y}=\frac{y^{\intercal}\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}y}{y^{\intercal}y}\frac{y^{\intercal}y}{y^{\intercal}\overline{\mathbf{N}}_{\mathsf{K}}y}\in(5/24,42).

Since $\mathrm{dim}(\mathrm{span}\{f_{k+1},\ldots,f_{n}\})=n-k$, we have shown that there exist $n-k$ orthogonal vectors whose Rayleigh quotient with respect to $\mathbf{N}_{\mathsf{G}}^{\prime}$ is $\Omega(\lambda_{k+1}(\mathbf{N}_{\mathsf{K}}))$. By the Courant-Fischer Theorem, we have $\lambda_{k+1}(\mathbf{N}_{\mathsf{G}}^{\prime})=\Omega(\lambda_{k+1}(\mathbf{N}_{\mathsf{K}}))$.

It only remains to show that λk+1(𝐍𝖦)=Ω(λk+1(𝐍𝖦))\lambda_{k+1}(\mathbf{N}_{\mathsf{G}})=\Omega(\lambda_{k+1}(\mathbf{N}_{\mathsf{G}}^{{}^{\prime}})), which implies that λk+1(𝐍𝖦)=Ω(λk+1(𝐍𝖪))\lambda_{k+1}(\mathbf{N}_{\mathsf{G}})=\Omega(\lambda_{k+1}(\mathbf{N}_{\mathsf{K}})). By the definition of 𝐍𝖦\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}, we have that 𝐍𝖦=𝐃𝖦1/2𝐃𝖪1/2𝐍𝖦𝐃𝖪1/2𝐃𝖦1/2\mathbf{N}_{\mathsf{G}}=\mathbf{D}_{\mathsf{G}}^{-1/2}\mathbf{D}_{\mathsf{K}}^{1/2}\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}\mathbf{D}_{\mathsf{K}}^{1/2}\mathbf{D}_{\mathsf{G}}^{-1/2}. Therefore, for any xnx\in\mathbb{R}^{n} and y=𝐃𝖪1/2𝐃𝖦1/2xy=\mathbf{D}_{\mathsf{K}}^{1/2}\mathbf{D}_{\mathsf{G}}^{-1/2}x, it holds that

x𝐍𝖦xxx=y𝐍𝖦yxx=Ω(y𝐍𝖦yyy),\frac{x^{\intercal}\mathbf{N}_{\mathsf{G}}x}{x^{\intercal}x}=\frac{y^{\intercal}\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}y}{x^{\intercal}x}=\Omega\left(\frac{y^{\intercal}\mathbf{N}_{\mathsf{G}}^{{}^{\prime}}y}{y^{\intercal}y}\right),

where the final guarantee follows from the fact that the degrees in 𝖦\mathsf{G} are preserved up to a constant factor. The conclusion of the theorem follows by the Courant-Fischer Theorem.

Finally, we bound the running time of Algorithm 2, which is dominated by the recursive calls to Algorithm 1. We note that, although the number of nodes doubles at each level of the recursion tree (visualised in Figure 4), the total numbers of samples $S$ and data points $X$ remain constant at each level of the tree. Then, since the running time of the KDE algorithm is superadditive, the total running time of the KDE algorithms at level $i$ of the tree is

Ti\displaystyle T_{i} =j=12iTKDE(|Si,j|,|Xi,j|,ϵ)\displaystyle=\sum_{j=1}^{2^{i}}T_{\textsf{KDE}}(|S_{i,j}|,|X_{i,j}|,\epsilon)
TKDE(j=12i|Si,j|,j=12i|Xi,j|,ϵ)=TKDE(|S|,|X|,ϵ).\displaystyle\leq T_{\textsf{KDE}}\left(\sum_{j=1}^{2^{i}}|S_{i,j}|,\sum_{j=1}^{2^{i}}|X_{i,j}|,\epsilon\right)=T_{\textsf{KDE}}(|S|,|X|,\epsilon).

Since there are $O(\log(n))$ levels of the tree, the total running time over all recursive calls to Algorithm 1, and hence of Algorithm 2, is $\widetilde{O}(T_{\textsf{KDE}}(|S|,|X|,\epsilon))$. This completes the proof. ∎
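As a purely illustrative instance (the running time assumed here is hypothetical and not a claim about any particular KDE implementation), if the KDE data structure satisfied $T_{\textsf{KDE}}(m,n,\epsilon)=O\left(\epsilon^{-2}(m+n)^{1+o(1)}\right)$, then summing the per-level bound over the $O(\log(n))$ levels would give a total running time of

O(\log(n))\cdot T_{\textsf{KDE}}(|S|,|X|,\epsilon)=O\left(\epsilon^{-2}\cdot(|S|+|X|)^{1+o(1)}\cdot\log(n)\right)=\widetilde{O}\left(\epsilon^{-2}\cdot(|S|+|X|)^{1+o(1)}\right).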

Appendix B Additional Experimental Results

In this section, we include in Figures 6 and 7 some additional examples of the performance of the six spectral clustering algorithms on the BSDS image segmentation dataset. Due to its quadratic memory requirement, the SKLearn GK algorithm cannot be applied to the full-resolution images. Therefore, we present its results on each image downsampled to 20,000 pixels; for every other algorithm, we show the results on the full-resolution image. In every case, we find that our algorithm identifies more refined detail in the image than the alternative algorithms.
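For reference, the following is a rough sketch of the SKLearn GK baseline described above, and not the exact experimental code: the image is downsampled to roughly 20,000 pixels, each pixel is represented by its colour and normalised position, and scikit-learn's spectral clustering with a fully connected Gaussian-kernel affinity is applied. The file name, number of clusters and kernel parameter are illustrative choices.

```python
# A rough sketch (not the paper's experimental code) of the SKLearn GK
# baseline: downsample the image to about 20,000 pixels, build per-pixel
# features from colour and normalised position, and run spectral clustering
# with a fully connected Gaussian-kernel (RBF) affinity.
import numpy as np
from PIL import Image
from sklearn.cluster import SpectralClustering

img = Image.open("example_bsds_image.jpg").convert("RGB")   # hypothetical file
scale = (20_000 / (img.width * img.height)) ** 0.5
img = img.resize((int(img.width * scale), int(img.height * scale)))

rgb = np.asarray(img, dtype=float) / 255.0
h, w, _ = rgb.shape
ys, xs = np.mgrid[0:h, 0:w]
features = np.column_stack([
    rgb.reshape(-1, 3),
    xs.reshape(-1, 1) / w,     # normalised pixel coordinates
    ys.reshape(-1, 1) / h,
])

labels = SpectralClustering(
    n_clusters=5, affinity="rbf", gamma=10.0, assign_labels="kmeans",
).fit_predict(features)
segmentation = labels.reshape(h, w)    # cluster label per pixel
```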

Figure 6: More examples on the performance of the spectral clustering algorithms for image segmentation. For each of the three images, the panels show the original image followed by the segmentations produced by SKLearn GK, Our Algorithm, SKLearn $k$-NN, FAISS Exact, FAISS HNSW, and FAISS IVF.
Figure 7: More examples on the performance of the spectral clustering algorithms for image segmentation. For each of the two images, the panels show the original image followed by the segmentations produced by SKLearn GK, Our Algorithm, SKLearn $k$-NN, FAISS Exact, FAISS HNSW, and FAISS IVF.