
Dimensionality Reduction for General KDE Mode Finding

Xinyu Luo    Christopher Musco    Cas Widdershoven
Abstract

Finding the mode of a high dimensional probability distribution \mathcal{D} is a fundamental algorithmic problem in statistics and data analysis. There has been particular interest in efficient methods for solving the problem when \mathcal{D} is represented as a mixture model or kernel density estimate, although few algorithmic results with worst-case approximation and runtime guarantees are known. In this work, we significantly generalize a result of (Lee et al., 2021) on mode approximation for Gaussian mixture models. We develop randomized dimensionality reduction methods for mixtures involving a broader class of kernels, including the popular logistic, sigmoid, and generalized Gaussian kernels. As in Lee et al.’s work, our dimensionality reduction results yield quasi-polynomial algorithms for mode finding with multiplicative accuracy (1-\epsilon) for any \epsilon>0. Moreover, when combined with gradient descent, they yield efficient practical heuristics for the problem. In addition to our positive results, we prove a hardness result for box kernels, showing that there is no polynomial time algorithm for finding the mode of a kernel density estimate, unless \mathit{P}=\mathit{NP}. Obtaining similar hardness results for kernels used in practice (like Gaussian or logistic kernels) is an interesting future direction.

Kernel density estimation, Johnson-Lindenstrauss, Sketching, Approximation algorithm, Mode finding, Kirszbraun theorem

1 Introduction

We consider the basic computational problem of finding the mode of a high dimensional probability distribution \mathcal{D} over \mathbb{R}^{d}. Specifically, if \mathcal{D} has probability density function (PDF) p, our goal is to find any x^{*}\in\mathbb{R}^{d} such that:

\displaystyle x^{*}\in\operatorname{argmax}_{x\in\mathbb{R}^{d}}p(x)

A natural setting for this problem is when \mathcal{D} is specified as a kernel density estimate (KDE) or mixture distribution (Scott, 2015; Silverman, 2018). In this setting, we are given a set of points M\subset\mathbb{R}^{d} and a non-negative kernel function \kappa:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{+}, and our PDF equals:

\displaystyle p(x)=\mathcal{K}_{M}(x)=\frac{1}{|M|}\sum_{m\in M}\kappa(x,m).

As is typically the case, we will assume that \kappa is shift-invariant and thus only depends on the difference between x and m, meaning that it can be reparameterized as \kappa(x-m). A classic example of a shift-invariant KDE is any mixture of Gaussians distribution, for which \kappa(x-m)=C\cdot e^{-\|x-m\|_{2}^{2}} is taken to be the Gaussian kernel. Here C=\pi^{-d/2} is a normalizing constant. Kernel density estimates are widely used to approximate other distributions in a compact way (Botev et al., 2010; Kim & Scott, 2012), and they have been used in applications ranging from image annotation (Yavlinsky et al., 2005), to medical data analysis (Sheikhpour et al., 2016), to outlier detection (Kamalov & Leung, 2020). The specific problem of finding the mode of a KDE has found applications in object tracking (Shen et al., 2007), super levelset computation (Phillips et al., 2015), typical object finding (Gasser et al., 1997), and more (Lee et al., 2021).

1.1 Prior Work

Despite its many applications, the KDE mode finding problem presents a computational challenge in high dimensions. For any practically relevant kernel \kappa (e.g., Gaussian) there are no known algorithms with runtime polynomial in both n and d for KDEs on n=|M| base points. This is even the case when we only want to find an \epsilon-approximate mode for some \epsilon\in(0,1), i.e. a point \tilde{x}^{*} satisfying

\displaystyle\mathcal{K}_{M}(\tilde{x}^{*})\geq(1-\epsilon)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x).

There has been extensive work on heuristic local search methods like the well-known “mean-shift” algorithm (Carreira-Perpiñán, 2000, 2007), which can be viewed as a variant of gradient descent, and often works well in practice. However, these methods do not come with theoretical guarantees and can fail on natural problem instances.

While polynomial time methods are not known, for some kernels it is possible to provably solve the \epsilon-approximate mode finding problem in quasi-polynomial time. For example, Shenmaier’s work on universal approximate centers for clustering can be used to reduce the problem to evaluating the quality of a quasi-polynomial number of candidate modes (Shenmaier, 2019). For the Gaussian kernel, the total runtime is d\cdot 2^{O(\log^{2}n)} for constant \epsilon. Similar runtimes can be obtained by appealing to results on the approximate Carathéodory problem (Blum et al., 2019; Barman, 2015).

More recently, Lee et al. explore dimensionality reduction as an approach to obtaining quasi-polynomial time algorithms for KDE mode finding (Lee et al., 2021). Their work shows that, for the Gaussian kernel, any high-dimensional KDE instance can be reduced to a lower dimensional instance using randomized dimensionality reduction methods – specifically Johnson-Lindenstrauss projection. An approximate mode for the lower dimensional problem can then be found with a method that depends exponentially on the dimension d, and finally, the low-dimensional solution can be “mapped back” to the high-dimensional space (methods that run in time exponential in d are straightforward to obtain via discretization/brute force search; see Section 5). Ultimately, the result in (Lee et al., 2021) allows all dependencies on d to be replaced with terms that are polynomial in \log(n) and \epsilon. The conclusion is that the mode of a Gaussian KDE can be approximated to accuracy \epsilon in time O\left(ndw+2^{w}\right), where w=\operatorname{poly}(\log n,1/\epsilon). The leading ndw term is the cost of performing the dimensionality reduction.

In addition to nearly matching prior quasi-polynomial time methods in theory (e.g., Shenmaier’s approach), there are a number of benefits to an approach based on dimensionality reduction. For one, sketching directly reduces the space complexity of the mode finding problem, and vectors sketched with JL random projections can be useful in other downstream data analysis tasks. Another benefit is that dimensionality reduction can speed up even heuristic algorithms: instead of using a brute-force approach to solve the low-dimensional KDE instance, a practical alternative is to apply a local search method, like mean-shift, in the low-dimensional space. This approach sacrifices theoretical guarantees, but can lead to faster practical algorithms.

1.2 Our Results

The main contribution of our work is to generalize the dimensionality reduction results of (Lee et al., 2021) to a much broader class of kernels, beyond the Gaussian kernel studied in that work. In particular, we introduce a carefully defined class of kernels called “relative-distance smooth kernels”. This class includes the Gaussian kernel, as well as the sigmoid, logistic, and any generalized Gaussian kernel of the form \kappa(x,y)=e^{-\|x-y\|_{2}^{\alpha}} for \alpha>0. See Definition 3.3 for more details. Our first result (Lemma 3.4) is that, for any relative-distance smooth kernel, we can approximate the value of the mode \max_{x}\mathcal{K}_{M}(x) up to multiplicative error (1-\epsilon) by solving a lower dimensional instance obtained by sketching the points in M using a Johnson-Lindenstrauss random projection. The required dimension of the projection is O(\log^{c}(n)/\epsilon^{2}), where c is a constant depending on parameters of the kernel \kappa. For most commonly used relative-distance smooth kernels, including the Gaussian, logistic, and sigmoid kernels, c=3. This leads to a dimensionality reduction that is completely independent of the original problem dimension d and only depends polylogarithmically on the number of points in the KDE, n.

Moreover, in Section 4, we show how to recover an approximate mode \tilde{x} satisfying \mathcal{K}_{M}(\tilde{x})\geq(1-\epsilon)\max_{x}\mathcal{K}_{M}(x) from the solution of the low-dimensional sketched problem. When the kernel satisfies an additional convexity property, recovery can be performed in O(nd) time using a generalization of the mean-shift algorithm used in (Lee et al., 2021). When the kernel does not satisfy the property, we obtain a slightly slower method using a recent result of (Biess et al., 2019) on constructive Lipschitz extensions. One consequence of our general results is the following claim for a number of common kernels:

Theorem 1.1.

Let \mathcal{K}_{M}=(\kappa,M) be a KDE on n=|M| points in d dimensions, where \kappa is a Gaussian, logistic, sigmoid, Cauchy, or generalized Gaussian kernel with parameter \alpha\leq 1. (For the Cauchy kernel, we can actually obtain a better bound with dimension w=O\left(\frac{\log(n/\epsilon)}{\epsilon^{2}}\right); see Corollary 3.2.) Let \Pi be a random JL matrix with w=O\left(\frac{\log^{2}(n/\epsilon)\log(n/\delta)}{\epsilon^{2}}\right) rows and let \tilde{x} be any point such that \mathcal{K}_{\Pi M}(\tilde{x})\geq(1-\beta)\max_{x\in\mathbb{R}^{w}}\mathcal{K}_{\Pi M}(x). Given \tilde{x} as input, Algorithm 2 runs in O(nd) time and returns, with probability 1-\delta, a point x^{\prime}\in\mathbb{R}^{d} satisfying:

\displaystyle\mathcal{K}_{M}(x^{\prime})\geq(1-\epsilon-\beta)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x).

Above, \Pi M is the point set \Pi M=\{\Pi m\text{ for }m\in M\} and \mathcal{K}_{\Pi M} is the low-dimensional KDE defined by \Pi M and \kappa. Theorem 1.1 implies that an approximate high-dimensional mode can be found by (approximately) solving a much lower dimensional problem. The result exactly matches that of (Lee et al., 2021) in the Gaussian case.

When combined with a simple brute-force method for maximizing \mathcal{K}_{\Pi M}, Theorem 1.1 immediately yields a quasi-polynomial time algorithm for mode finding. Again we state a natural special case of this result, proven in Section 5.

Theorem 1.2.

Let \mathcal{K}_{M}=(\kappa,M) be a KDE on n points in d dimensions, where \kappa is a Gaussian, logistic, sigmoid, Cauchy, or generalized Gaussian kernel with \alpha\leq 1. There is an algorithm which finds a point \tilde{x} satisfying:

\displaystyle\mathcal{K}_{M}(\tilde{x})\geq(1-\epsilon)\max_{x}\mathcal{K}_{M}(x)

in 2^{\tilde{O}(\log^{3}n/\epsilon^{2})}+O(nd\log^{3}(n)/\epsilon^{2}) time. Here \tilde{O}(x) denotes O(x\log^{c}x) for a constant c.

Interestingly, as in (Lee et al., 2021), the above result falls just short of providing a polynomial time algorithm: doing so would require improving the \log^{3}n dependence in the exponent to \log n. It is possible to achieve polynomial time by making additional assumptions. For example, if we assume that \mathcal{K}_{M}(x^{*})\geq\rho for some constant \rho, then dependencies on \log(n) can be replaced with \log(1/\rho) using existing coreset methods (Lee et al., 2021; Phillips & Tai, 2018). However, the question still remains as to whether the general KDE mode finding problem can be solved in polynomial time for any natural kernel. Our final contribution is to take a step towards answering this question in the negative by relating the mode finding problem to the k-clique problem, and showing an NP-hardness result for box kernels (defined in the next section). Formally, in Section 6, we prove:

Lemma 1.3.

The problem of computing a \frac{1}{n}-approximate mode of a box kernel KDE is NP-hard.

Unfortunately, our lower bound does not extend to commonly used kernels like the Gaussian, logistic, or sigmoid kernels. Proving lower bounds (or finding polynomial time algorithms) for these kernels is a compelling future goal.

Paper Structure.

Section 2 contains notation and definitions. In Section 3 we provide our main dimensionality reduction results for approximating the objective value of the mode. Then, in Section 4, we show how to recover a high-dimensional mode from a low-dimensional one, providing different approaches for when the kernel is convex and when it is not. Section 5 outlines a brute force method for finding an approximate mode in low dimensions. In Section 6 we show that the approximate mode finding problem is NP-hard for box kernels. Finally, we provide experimental results in Section 7, confirming that dimensionality reduction combined with a heuristic mode finding method yields a practical algorithm for a variety of kernels and data sets.

2 Preliminaries

Notation.

For our purposes, a kernel density estimate (KDE) is defined by a set of n points (a.k.a. centers) M\subset\mathbb{R}^{d} and a non-negative, shift-invariant kernel function. All of the kernels discussed in this work are also radially symmetric. This means that we can actually rewrite the kernel function \kappa to be a scalar function of the squared Euclidean distance \|x-m\|_{2}^{2}. (We let \|\cdot\|_{2}^{2} denote the squared Euclidean norm: \|a\|_{2}^{2}=\sum_{i=1}^{d}a_{i}^{2}, where a_{i} is the i^{\text{th}} entry in the length d vector a.) Our KDE then has the form:

\displaystyle\mathcal{K}_{M}(x)=\frac{1}{n}\sum_{m\in M}C\cdot\kappa(\|x-m\|_{2}^{2}).

We further assume that \kappa:\mathbb{R}\rightarrow\mathbb{R} is non-increasing, so it satisfies \kappa(t)\geq\kappa(t^{\prime})\geq 0 for all t^{\prime}\geq t. In the expression above, C is a normalizing constant that only depends on \kappa. It is chosen to ensure that \int_{t\in\mathbb{R}^{d}}C\cdot\kappa(t)\,dt=1 and thus \mathcal{K}_{M} is a probability density function. The above function \mathcal{K}_{M}(x) is invariant to scaling \kappa, so to ease notation we further assume that \kappa(0)=1. Note that since \kappa is non-increasing, we thus always have that \max_{t}\kappa(t)=\kappa(0)=1. We write \kappa^{\prime} to denote the first-order derivative of \kappa (whenever it exists).

Many common kernels are radially symmetric and non-increasing, and so fit the form described above (Silverman, 2018; Altman, 1992; Cleveland & Devlin, 1988). We list a few:

\displaystyle\text{Gaussian: }\kappa(t)=e^{-t}
\displaystyle\text{Logistic: }\kappa(t)=\frac{4}{e^{\sqrt{t}}+2+e^{-\sqrt{t}}}
\displaystyle\text{Sigmoid: }\kappa(t)=\frac{2}{e^{\sqrt{t}}+e^{-\sqrt{t}}}
\displaystyle\text{Cauchy: }\kappa(t)=\frac{1}{1+t}
\displaystyle\text{Generalized Gaussian: }\kappa(t)=e^{-t^{\alpha}}
\displaystyle\text{Box: }\kappa(t)=1\text{ for }|t|\leq 1,\,\kappa(t)=0\text{ otherwise}
\displaystyle\text{Epanechnikov: }\kappa(t)=\max(0,1-t)

We are interested in finding a value for x which maximizes or approximately maximizes the kernel density estimate \mathcal{K}_{M}(x). Again since the problem is invariant to positive scaling, we will consider the problem of maximizing the unnormalized KDE, which we denote by \bar{\mathcal{K}}_{M}(x):

\displaystyle\bar{\mathcal{K}}_{M}(x)=\sum_{m\in M}\kappa(\|x-m\|_{2}^{2})=\frac{n}{C}\cdot\mathcal{K}_{M}(x)
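To make this setup concrete, here is a minimal Python sketch (our own illustration; the function names and toy data are not from the paper) that evaluates \bar{\mathcal{K}}_{M}(x) for a few of the kernels listed above, each written as a function of the squared distance t=\|x-m\|_{2}^{2}:

```python
import numpy as np

# Each kernel takes the *squared* Euclidean distance t and satisfies kappa(0) = 1.
KERNELS = {
    "gaussian": lambda t: np.exp(-t),
    "logistic": lambda t: 4.0 / (np.exp(np.sqrt(t)) + 2.0 + np.exp(-np.sqrt(t))),
    "sigmoid": lambda t: 2.0 / (np.exp(np.sqrt(t)) + np.exp(-np.sqrt(t))),
    "cauchy": lambda t: 1.0 / (1.0 + t),
}

def unnormalized_kde(x, M, kappa):
    """Evaluate bar{K}_M(x) = sum_m kappa(||x - m||_2^2) for centers M (n x d array)."""
    sq_dists = np.sum((M - x) ** 2, axis=1)  # squared distances to all centers
    return np.sum(kappa(sq_dists))

# Toy example: 5 random centers in 3 dimensions.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))
x = np.zeros(3)
print(unnormalized_kde(x, M, KERNELS["gaussian"]))
```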

Our general dimensionality reduction result depends on a parameter of \kappa that we call the “critical radius”. For common kernels we later show how to bound this parameter to obtain specific dimensionality reduction results.

Definition 2.1 (\alpha-critical radius, \xi_{\kappa}(\alpha)).

For any non-increasing kernel function \kappa:\mathbb{R}\rightarrow\mathbb{R}, the \alpha-critical radius \xi_{\kappa}(\alpha) is the smallest value of t such that \kappa(t)\leq\alpha.

Note that for any t\geq\xi_{\kappa}(\alpha), we have that \kappa(t)\leq\alpha. The values of \xi_{\kappa}(\epsilon/2n) and \xi_{\kappa}(1/n) will be especially important in our proofs. Specifically, since \kappa is assumed to have \kappa(0)=1, it is easy to check that any mode of \mathcal{K}_{M} must lie within squared distance \xi_{\kappa}(1/n) of at least one point in M, a region which we will call the critical area. We will use this fact.
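For intuition, the critical radius of a non-increasing kernel can be computed numerically. The sketch below is purely illustrative (the helper name and the bisection approach are our own); for the Gaussian kernel it should agree with the closed form \xi_{\kappa}(\alpha)=\log(1/\alpha).

```python
import math

def critical_radius(kappa, alpha, t_hi=1.0, tol=1e-10):
    """Smallest t with kappa(t) <= alpha, for a non-increasing kernel with kappa(0) = 1."""
    # Grow the search interval until kappa(t_hi) drops below alpha.
    while kappa(t_hi) > alpha:
        t_hi *= 2.0
    t_lo = 0.0
    while t_hi - t_lo > tol:
        mid = 0.5 * (t_lo + t_hi)
        if kappa(mid) <= alpha:
            t_hi = mid
        else:
            t_lo = mid
    return t_hi

gaussian = lambda t: math.exp(-t)
n, eps = 1000, 0.1
alpha = eps / (2 * n)
print(critical_radius(gaussian, alpha), math.log(1 / alpha))  # these should agree
```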

Johnson-Lindenstrauss Lemma.

Our results leverage the Johnson-Lindenstrauss (JL) lemma, which shows that a set of high dimensional points can be mapped into a space of much lower dimension in such a way that distances between the points are nearly preserved. We use the standard variant of the lemma where the mapping is an easy to compute random linear transformation (Achlioptas, 2001; Dasgupta & Gupta, 2003). Specifically, we are interested in random transformations satisfying the following guarantee:

Definition 2.2 ((\gamma,n,\delta)-Johnson-Lindenstrauss Guarantee).

A randomly selected matrix \Pi\in\mathbb{R}^{w\times d} satisfies the (\gamma,n,\delta)-JL guarantee for positive error parameter \gamma if, for any n data points v_{1},...,v_{n}\in\mathbb{R}^{d}, with probability 1-\delta,

\left\|v_{i}-v_{j}\right\|_{2}^{2}\leq\left\|\Pi v_{i}-\Pi v_{j}\right\|_{2}^{2}\leq(1+\gamma)\left\|v_{i}-v_{j}\right\|_{2}^{2} (1)

for all pairs i,j\in\{1,...,n\} simultaneously.

Note that we require one-sided error: most statements of the JL guarantee have a (1-\gamma) factor on the left side of the inequality. This is easily removed by scaling \Pi by \frac{1}{1-\gamma}. It is well known that Definition 2.2 is satisfied by a properly scaled i.i.d. random Gaussian or random \pm 1 matrix with

\displaystyle w=O\left(\frac{\log(n/\delta)}{\min(1,\gamma^{2})}\right)

rows, and this is tight (Larsen & Nelson, 2017). General sub-Gaussian random matrices also work, as do constructions that admit faster computation of \Pi v_{i} (Kane & Nelson, 2014; Ailon & Chazelle, 2009).
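A minimal sketch of one such construction, assuming dense Gaussian entries; the rescaling by \frac{1}{1-\gamma} follows the remark above and makes the guarantee one-sided:

```python
import numpy as np

def jl_matrix(w, d, gamma, rng=None):
    """Random Gaussian JL matrix with w rows, rescaled so the distortion guarantee
    is one-sided (squared distances never shrink), as in Definition 2.2."""
    rng = np.random.default_rng() if rng is None else rng
    Pi = rng.normal(scale=1.0 / np.sqrt(w), size=(w, d))  # standard JL scaling
    return Pi / (1.0 - gamma)  # push the (1 - gamma) lower bound up to 1

# Empirical check on a toy point set: sketched squared distances should land in
# roughly [1, (1 + gamma) / (1 - gamma)^2] times the originals with high probability.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 500))            # n = 20 points in d = 500 dimensions
Pi = jl_matrix(w=400, d=500, gamma=0.2, rng=rng)
Y = X @ Pi.T                              # sketched points, one per row
i, j = 0, 1
orig = np.sum((X[i] - X[j]) ** 2)
sketched = np.sum((Y[i] - Y[j]) ** 2)
print(sketched / orig)
```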

Kirszbraun Extension Theorem.

We also rely on a classic result of (Kirszbraun, 1934). Let H_{1} and H_{2} be Hilbert spaces. Kirszbraun’s theorem states that if S is a subset of H_{1}, and f:S\rightarrow H_{2} is a Lipschitz-continuous map, then there is a Lipschitz-continuous map g:H_{1}\rightarrow H_{2} that extends f (i.e., g(s)=f(s) for all s\in S) and has the same Lipschitz constant. Formally, when applied to Euclidean spaces \mathbb{R}^{w} and \mathbb{R}^{d} we have:

Fact 2.3.

(Kirszbraun Extension Theorem). For any \mathcal{S}\subset\mathbb{R}^{w}, let f:\mathcal{S}\rightarrow\mathbb{R}^{d} be an L-Lipschitz function. That is, \forall x,y\in\mathcal{S}, \left\|f(x)-f(y)\right\|_{2}\leq L\left\|x-y\right\|_{2}. Then, there always exists some function g:\mathbb{R}^{w}\rightarrow\mathbb{R}^{d} such that:

  1. g(x)=f(x) for all x\in\mathcal{S},

  2. g is also L-Lipschitz. That is, for all x,y\in\mathbb{R}^{w}, \left\|g(x)-g(y)\right\|_{2}\leq L\left\|x-y\right\|_{2}.

3 Dimensionality Reduction for Approximating the Mode Value

In this section, we show that using a JL random projection, we can reduce the problem of approximating the value of the mode of a KDE in d dimensions – i.e., \max_{x}\bar{\mathcal{K}}_{M}(x) – to the problem of approximating the value of the mode for a KDE in d^{\prime} dimensions, where d^{\prime} depends only on n, \kappa, and the desired approximation quality. This problem of recovering the mode value is a prerequisite for the harder problem of recovering the location of an approximate mode (i.e., a point x^{*}\in\mathbb{R}^{d} such that \mathcal{K}_{M}(x^{*})\geq(1-\epsilon)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x)), which is addressed in Section 4.

We begin with an analysis for JL projections that bounds d^{\prime} based on generic properties of \kappa. Then, in Section 3.1 we analyze these properties for specific kernels of interest, and prove that d^{\prime} is in fact small for these kernels – specifically, it depends just polylogarithmically on n and polynomially on the approximation factor \epsilon. Our general result follows:

Theorem 3.1.

Let \mathcal{K}_{M}=(\kappa,M) be a d-dimensional KDE on a differentiable kernel as defined in Section 2 and let 0<\epsilon\leq 1 be an approximation factor. Let \xi\geq\xi_{\kappa}(\frac{\epsilon}{2n}) and let \kappa^{\prime}_{\min}\leq\min_{0\leq t\leq 2\xi}\frac{\kappa^{\prime}(t)t}{\kappa(t)}. Note that \kappa^{\prime}_{\min}\leq 0 since \kappa is assumed to be non-increasing; we can assume that \kappa^{\prime}_{\min}\neq 0. Let \gamma=-\frac{\epsilon}{2\kappa^{\prime}_{\min}}>0. Then with probability (1-\delta), for any \Pi\in\mathbb{R}^{w\times d} satisfying the (\gamma,n+1,\delta)-JL guarantee, we have:

(1-\epsilon)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x)\leq\max_{x\in\mathbb{R}^{w}}\mathcal{K}_{\Pi M}(x)\leq\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x). (2)

Recall that a random \Pi with w=O\left(\frac{\log((n+1)/\delta)}{\min(1,\gamma^{2})}\right) rows will satisfy the required (\gamma,n+1,\delta)-JL guarantee.

Note that in the theorem statement above, \Pi M=\{\Pi m:m\in M\} denotes the point set M with dimension reduced by multiplying each point in the set by \Pi. Our proof of Theorem 3.1 is included in Appendix A. It leverages Kirszbraun’s Extension Theorem, and follows along the same lines as the proof in (Lee et al., 2021). However, we need to more carefully track the effect of properties of the kernel function \kappa, since we do not assume that it has the simple form of a Gaussian kernel.

With Theorem 3.1 in place, we can apply it to any non-increasing differentiable kernel to obtain a dimensionality reduction result: we just need to compute a lower bound \kappa^{\prime}_{\min}\leq\min_{0\leq t\leq 2\xi}\frac{\kappa^{\prime}(t)t}{\kappa(t)}. For some kernels we can do so directly. For example, consider the Cauchy kernel, \kappa(t)=\frac{1}{1+t}. It can be shown that we can pick \kappa^{\prime}_{\min}=-1 (since \kappa^{\prime}(t)t/\kappa(t)\geq-1 for all t). Plugging into Theorem 3.1 we obtain:

Corollary 3.2.

Let \mathcal{K}_{M}=(\kappa,M) be a KDE and, for any \delta,\epsilon\in(0,1), let \Pi be a random JL matrix with w=O\left(\frac{\log(n/\delta)}{\epsilon^{2}}\right) rows. If \kappa is a Cauchy kernel, then with probability 1-\delta,

\displaystyle(1-\epsilon)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x)\leq\max_{x\in\mathbb{R}^{w}}\mathcal{K}_{\Pi M}(x)\leq\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x).
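For completeness, the bound \kappa^{\prime}_{\min}=-1 used for the Cauchy kernel follows from a short calculation:

\displaystyle\kappa^{\prime}(t)=-\frac{1}{(1+t)^{2}},\qquad\frac{\kappa^{\prime}(t)\,t}{\kappa(t)}=-\frac{t}{1+t}\geq-1\quad\text{for all }t\geq 0,

so we may take \kappa^{\prime}_{\min}=-1, which gives \gamma=\epsilon/2 in Theorem 3.1.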

In the following subsection we will describe a broader class of kernels for which we can also obtain good dimensionality reduction results, but for which bounding \kappa^{\prime}_{\min} is a bit more challenging.

3.1 Relative-Distance Smooth Kernels

Specifically, we consider a broad class of kernels that includes the Gaussian kernel:

Definition 3.3 (Relative-distance smooth kernel).

A non-increasing differentiable kernel \kappa is relative-distance smooth if there exist constants c_{1},d_{1},q_{1},c_{2},d_{2}>0 such that

\displaystyle c_{1}t^{d_{1}}-q_{1}\leq\frac{-\kappa^{\prime}(t)t}{\kappa(t)}\leq c_{2}t^{d_{2}}\qquad\text{for all }t\geq 0.

In addition to the Gaussian kernel, this class includes other kernels commonly used in practice, like the logistic, sigmoid, and generalized Gaussian kernels:

\displaystyle\text{Gaussian: }t^{1}\leq\frac{-\kappa^{\prime}(t)t}{\kappa(t)}=t\leq t^{1}
\displaystyle\text{Logistic: }\frac{t^{1/2}}{2}-\frac{1}{2}\leq\frac{-\kappa^{\prime}(t)t}{\kappa(t)}=\frac{(e^{\sqrt{t}}-1)\sqrt{t}}{2(e^{\sqrt{t}}+1)}\leq\frac{t^{1/2}}{2}
\displaystyle\text{Sigmoid: }\frac{t^{1/2}}{2}-\frac{1}{2}\leq\frac{-\kappa^{\prime}(t)t}{\kappa(t)}=\frac{(e^{2\sqrt{t}}-1)\sqrt{t}}{2(e^{2\sqrt{t}}+1)}\leq\frac{t^{1/2}}{2}
\displaystyle\text{Generalized Gaussian: }\alpha t^{\alpha}\leq\frac{-\kappa^{\prime}(t)t}{\kappa(t)}=\alpha t^{\alpha}\leq\alpha t^{\alpha}
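As a quick sanity check (illustrative only, not part of the formal argument), the stated bounds for the logistic and sigmoid kernels, with c_{1}=c_{2}=\frac{1}{2}, d_{1}=d_{2}=\frac{1}{2}, and q_{1}=\frac{1}{2}, can be verified numerically:

```python
import numpy as np

t = np.linspace(1e-6, 50.0, 2000)

# Relative-distance ratio -kappa'(t) * t / kappa(t) for the logistic and sigmoid
# kernels, using the closed forms stated above.
logistic_ratio = (np.exp(np.sqrt(t)) - 1) * np.sqrt(t) / (2 * (np.exp(np.sqrt(t)) + 1))
sigmoid_ratio = (np.exp(2 * np.sqrt(t)) - 1) * np.sqrt(t) / (2 * (np.exp(2 * np.sqrt(t)) + 1))

lower = np.sqrt(t) / 2 - 0.5   # c1 * t^{d1} - q1 with c1 = 1/2, d1 = 1/2, q1 = 1/2
upper = np.sqrt(t) / 2         # c2 * t^{d2} with c2 = 1/2, d2 = 1/2

assert np.all(lower <= logistic_ratio) and np.all(logistic_ratio <= upper)
assert np.all(lower <= sigmoid_ratio) and np.all(sigmoid_ratio <= upper)
print("bounds hold on the sampled range")
```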

A few common non-increasing kernels, including the rational quadratic kernel, are not relative-distance smooth. Our main result is that for any relative-distance smooth kernel, we can sketch the KDE to a dimension w which depends only polylogarithmically on n=|M| and quadratically on 1/\epsilon:

Lemma 3.4.

Let \mathcal{K}_{M} be a KDE for a relative-distance smooth kernel \kappa with parameters c_{1},d_{1},q_{1},c_{2},d_{2}. There is a fixed constant c^{\prime} such that if \gamma=\frac{\epsilon}{c^{\prime}}\log^{-d_{2}/d_{1}}\left(\frac{2n}{\epsilon}\right), then with probability (1-\delta), for any \Pi\in\mathbb{R}^{w\times d} satisfying the (\gamma,n+1,\delta)-JL guarantee, Equation 2 holds. To obtain this JL guarantee, it suffices to take \Pi to be a random JL matrix with w=O\left(\frac{\log^{2d_{2}/d_{1}}(n/\epsilon)\log(n/\delta)}{\epsilon^{2}}\right) rows.

Lemma 3.4 is proven in Appendix A. It uses an intermediate result that bounds the \frac{\epsilon}{2n}-critical radius for any relative-distance smooth kernel, which is required to invoke Theorem 3.1. Interestingly, the polylogarithmic factor in Lemma 3.4 only depends on the ratio of the parameters d_{2} and d_{1} of the relative-distance smooth kernel \kappa. For all of the example kernels discussed above, this ratio equals 1, so we obtain a dimensionality reduction result exactly matching that of (Lee et al., 2021) for the Gaussian kernel:

Corollary 3.5.

Let \mathcal{K}_{M}=(\kappa,M) be a KDE and, for any \delta,\epsilon\in(0,1), let \Pi be a random JL matrix with w=O\left(\frac{\log^{2}(n/\epsilon)\log(n/\delta)}{\epsilon^{2}}\right) rows. If \kappa is a Gaussian, logistic, sigmoid, or generalized Gaussian kernel, then with probability 1-\delta,

\displaystyle(1-\epsilon)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x)\leq\max_{x\in\mathbb{R}^{w}}\mathcal{K}_{\Pi M}(x)\leq\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x).

4 Recovering an Approximate Mode in High Dimensions

In Section 3, we discussed how to convert a high dimensional KDE into a lower dimensional KDE whose mode has an approximately equal value. However, in applications, we are typically interested in computing a point in the high-dimensional space whose KDE value is approximately equal to the value of the mode. I.e., using our dimensionality reduced dataset, we want to find some \tilde{x} such that:

\displaystyle\mathcal{K}_{M}(\tilde{x})\geq(1-\epsilon)\max_{x}\mathcal{K}_{M}(x).

We present two approaches for doing so. The first is based on Kirszbraun’s extension theorem and the widely used mean-shift heuristic. It extends the approach of (Lee et al., 2021) to a wider class of kernels – specifically, to any convex and non-increasing kernel \kappa. This class contains most of the relative-distance smooth kernels discussed in Section 3.1, including the Gaussian, sigmoid, and logistic kernels, and generalized Gaussian kernels with \alpha\leq 1. It also includes common kernels like the Cauchy kernel, for which we have shown a strong dimensionality reduction result, and the Epanechnikov, biweight, and triweight kernels. Recall that we define \kappa(t) so that t represents the squared Euclidean distance between two points; we specifically need \kappa as defined in this way to be convex.

For non-convex kernels, we briefly discuss a second approach in Appendix B based on recent work on explicit one point extensions of Lipschitz functions (Biess et al., 2019). While less computationally efficient, this approach works for any non-increasing \kappa. Common examples of non-convex kernels include the tricube kernel \kappa(t)=(1-t^{3/2})^{3} (Altman, 1992), the kernel \kappa(t)=1-t^{2} (Comaniciu, 2000), and any generalized Gaussian kernel with \alpha>1.

4.1 Mean-shift for Convex Kernels

Algorithm 1 Mean-Shift Algorithm
0:  Set of n points M\subset\mathbb{R}^{d}, number of iterations \tau, differentiable kernel function \kappa.
1:  Select initial point x^{(0)}\in\mathbb{R}^{d}
2:  For i=0,...,\tau-1:
\displaystyle x^{(i+1)}=\sum_{m\in M}m\cdot\frac{\kappa^{\prime}\left(\left\|x^{(i)}-m\right\|_{2}^{2}\right)}{\sum_{j\in M}\kappa^{\prime}\left(\left\|x^{(i)}-j\right\|_{2}^{2}\right)}
3:  return x^{(\tau)}

Based on ideas proposed by Fukunaga and Hostetler (Fukunaga & Hostetler, 1975), the mean-shift method is a commonly used heuristic for finding an approximate mode (Cheng, 1995). The idea behind the algorithm is to iteratively refine a guess for the mode. At each update, a new guess x^{(i+1)} is obtained by computing a weighted average of all points in M that define the KDE. Points that are closer to the previous guess x^{(i)} are included with higher weight than points that are further. The exact choice of weights depends on the first derivative \kappa^{\prime}(t), where t is the squared distance from the current guess to a point in M. For any non-increasing, convex kernel, \kappa^{\prime}(t) is non-positive and decreasing in magnitude – i.e., |\kappa^{\prime}(t)| is largest for t close to 0, which ensures that points closest to the current guess for the mode are weighted highest when computing the new guess. (Note that for the Gaussian kernel, \kappa(t)=e^{-t}, so |\kappa^{\prime}(t)|=\kappa(t), and the method presented here is equivalent to the version of mean-shift used in prior work on dimensionality reduction for mode finding (Lee et al., 2021).) We include pseudocode for mean-shift as Algorithm 1. The method can alternatively be viewed as an instantiation of gradient ascent for the KDE mode objective with a specifically chosen step size – we do not discuss details here.
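A direct Python transcription of Algorithm 1 might look as follows (a sketch with our own naming; the Gaussian derivative \kappa^{\prime}(t)=-e^{-t} is supplied as an example):

```python
import numpy as np

def mean_shift(M, kappa_prime, num_iters, x0):
    """Algorithm 1: repeatedly replace the current guess by a weighted average of the
    centers, with weights proportional to kappa'(squared distance to each center)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        sq_dists = np.sum((M - x) ** 2, axis=1)
        weights = kappa_prime(sq_dists)          # non-positive; signs cancel in the ratio
        x = (weights[:, None] * M).sum(axis=0) / weights.sum()
    return x

# Example with the Gaussian kernel kappa(t) = exp(-t), so kappa'(t) = -exp(-t).
gaussian_prime = lambda t: -np.exp(-t)
rng = np.random.default_rng(0)
M = rng.normal(size=(100, 2)) + np.array([3.0, 3.0])
x_hat = mean_shift(M, gaussian_prime, num_iters=20, x0=M[0])
print(x_hat)  # should land near the cluster center (3, 3)
```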

A powerful property of the mean-shift algorithm is that it always converges for kernels that are non-increasing and convex. In fact, it is known to provide a monotonically improving solution. Specifically:

Fact 4.1 (Comaniciu & Meer (2002)).

Let x^{(0)}\in\mathbb{R}^{d} be an arbitrary starting point and let x^{(1)},\ldots,x^{(\tau)} be the resulting iterates of Algorithm 1 run on point set M with kernel \kappa. If \kappa is convex and non-increasing, then for any i\in\{1,\ldots,\tau\}:

\displaystyle\mathcal{K}_{M}(x^{(i)})\geq\mathcal{K}_{M}(x^{(i-1)}).

In Appendix A, we use this fact to prove that with a (modified) mean-shift method, run for only a single iteration, we can translate any approximate solution for a dimensionality reduced KDE problem to a solution for the original high dimensional problem. Formally, we prove the following:

Theorem 4.2.

Let M be a set of points in \mathbb{R}^{d} and let \mathcal{K}_{M}=(\kappa,M) be a KDE defined by a shift-invariant, non-increasing, and convex kernel function \kappa. Let x^{*}\in\operatorname{argmax}_{x}\mathcal{K}_{M}(x). Let \Pi\in\mathbb{R}^{w\times d} be a JL matrix as in Definition 2.2 and assume that w is chosen large enough so that for all a,b in the set \{x^{*}\}\cup M,

\displaystyle\|a-b\|_{2}^{2}\leq\|\Pi a-\Pi b\|_{2}^{2} (3)
\displaystyle\text{and }(1-\epsilon)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x)\leq\max_{x\in\mathbb{R}^{w}}\mathcal{K}_{\Pi M}(x)\leq\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x).

Let \tilde{x}\in\mathbb{R}^{w} be an approximate maximizer for \mathcal{K}_{\Pi M} satisfying \mathcal{K}_{\Pi M}(\tilde{x})\geq(1-\alpha)\max_{x\in\mathbb{R}^{w}}\mathcal{K}_{\Pi M}(x). Then if we choose x^{\prime}=\sum_{m\in M}m\cdot\frac{\kappa^{\prime}(\left\|\tilde{x}-\Pi m\right\|_{2}^{2})}{\sum_{m\in M}\kappa^{\prime}(\left\|\tilde{x}-\Pi m\right\|_{2}^{2})}, we have:

\displaystyle\mathcal{K}_{M}(x^{\prime})\geq(1-\epsilon-\alpha)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x).

Note that x^{\prime} above is set using a single iteration of what looks like mean-shift. However, instead of using weights based on the distances of points in M to a previous guess for a high-dimensional mode, we use distances between the points \Pi M in our low-dimensional space and the approximate low-dimensional mode, \tilde{x}. Also note that Theorem 4.2 is independent of exactly how \tilde{x} is computed – it could be computed using brute force search, using an approximation algorithm tailored to low-dimensional problems, as in (Lee et al., 2021), or using a heuristic like mean-shift itself.

Algorithm 2 Mode Recovery for Convex Kernels
0:  Shift-invariant, non-increasing, and convex kernel function \kappa with derivative \kappa^{\prime}. Set of n points M\subset\mathbb{R}^{d}, dimensionality reduction parameter \gamma, accuracy parameter \alpha, failure probability \delta.
1:  Construct a random JL matrix \Pi with w=O\left(\frac{\log((n+1)/\delta)}{\min(1,\gamma^{2})}\right) rows.
2:  Construct a set of n points \Pi M\subset\mathbb{R}^{w} that contains \Pi m for each m\in M.
3:  Compute \tilde{x} such that \mathcal{K}_{\Pi M}(\tilde{x})\geq(1-\alpha)\max_{x\in\mathbb{R}^{w}}\mathcal{K}_{\Pi M}(x).
4:  Return x^{\prime}=\sum_{m\in M}m\cdot\frac{\kappa^{\prime}(\left\|\tilde{x}-\Pi m\right\|_{2}^{2})}{\sum_{m\in M}\kappa^{\prime}(\left\|\tilde{x}-\Pi m\right\|_{2}^{2})}
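A compact, self-contained Python sketch of Algorithm 2 follows. As in the paper’s experiments, the low-dimensional solver on line 3 is replaced by the mean-shift heuristic, and the constant in the sketch dimension w is an arbitrary illustrative choice rather than the constant from the theory:

```python
import numpy as np

def recover_mode(M, kappa_prime, gamma, delta, low_dim_iters=50, seed=0):
    """Sketch of Algorithm 2. Line 3 uses mean-shift as a heuristic stand-in; the
    constant 2 in the sketch dimension w is an arbitrary illustrative choice."""
    n, d = M.shape
    rng = np.random.default_rng(seed)
    w = min(d, int(np.ceil(2 * np.log((n + 1) / delta) / min(1.0, gamma ** 2))))
    Pi = rng.normal(scale=1.0 / np.sqrt(w), size=(w, d)) / (1.0 - gamma)  # JL matrix
    PM = M @ Pi.T                                    # line 2: sketched centers

    # Line 3 (heuristic): mean-shift on the sketched KDE, started at a random center.
    x = PM[rng.integers(n)].copy()
    for _ in range(low_dim_iters):
        weights = kappa_prime(np.sum((PM - x) ** 2, axis=1))
        x = (weights[:, None] * PM).sum(axis=0) / weights.sum()

    # Line 4: one weighted-average step with low-dimensional distances, high-dim centers.
    weights = kappa_prime(np.sum((PM - x) ** 2, axis=1))
    return (weights[:, None] * M).sum(axis=0) / weights.sum()

# Example with the (convex, non-increasing) Cauchy kernel kappa(t) = 1 / (1 + t).
cauchy = lambda t: 1.0 / (1.0 + t)
cauchy_prime = lambda t: -1.0 / (1.0 + t) ** 2
M = np.random.default_rng(1).normal(size=(500, 300))
x_prime = recover_mode(M, cauchy_prime, gamma=0.5, delta=0.1)
print(np.sum(cauchy(np.sum((M - x_prime) ** 2, axis=1))))  # unnormalized KDE value at x'
```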

For convex kernels, Theorem 4.2 implies a strengthening of Theorem 3.1 that allows for recovering an approximate mode, not just the value of the mode. Formally, the combined dimensionality reduction and recovery procedure we propose is included as Algorithm 2, and we have the following result on its accuracy:

Corollary 4.3.

Let \mathcal{K}_{M}=(\kappa,M) be a d-dimensional shift-invariant KDE as defined in Section 2 and let \epsilon and \gamma (which depends on \kappa and \epsilon) be as in Theorem 3.1. If \kappa is differentiable, non-increasing, and convex, then Algorithm 2 run with parameters \gamma and \alpha returns x^{\prime} satisfying:

\displaystyle\mathcal{K}_{M}(x^{\prime})\geq(1-\epsilon-\alpha)\max_{x\in\mathbb{R}^{d}}\mathcal{K}_{M}(x).

Note that Line 4 in Algorithm 2 can be evaluated in O(nw+nd) time, so our headline result, Theorem 1.1, follows as a direct corollary.

5 Solving the Low-Dimensional Problem

We next discuss a simple brute-force search method for approximate mode finding for any KDE with a continuous kernel function κ\kappa. The method has an exponential runtime dependency on the dimension, so its use for high-dimensional problems is limited, but combined with the dimensionality reduction techniques from Section 3 and the mode recovery techniques from Section 4, it yields a quasi-polynomial mode finding algorithm for a large class of kernels.

Recall that the mode of a KDE \mathcal{K}=(\kappa,M) with |M|=n must lie within its critical area, i.e. in a ball of squared radius \xi_{\kappa}(1/n) around one of the points in M (where \xi_{\kappa}(1/n) denotes the 1/n-critical radius). For any \delta>0 we define a finite \delta-covering \mathcal{N}(\mathcal{K},\delta) to be a finite set of points such that, for every point p in the critical area of \mathcal{K}, there exists a p^{\prime}\in\mathcal{N}(\mathcal{K},\delta) such that \left\|p-p^{\prime}\right\|_{2}^{2}\leq\delta. Formally:

Lemma 5.1.

Given a KDE \mathcal{K}=(\kappa,M) in \mathbb{R}^{d} with |M|=n, and parameter \delta>0, let \xi=\xi_{\kappa}(1/n) and let \mathcal{N}(\mathcal{K},\delta) be a set that contains all points of the form

m+\sum_{i=1}^{d}\frac{k_{i}\sqrt{\delta}}{\sqrt{d}}e_{i},

where m\in M, k_{i}\in\mathbb{Z}, and -\frac{\sqrt{d}\xi}{\sqrt{\delta}}\leq k_{i}\leq\frac{\sqrt{d}\xi}{\sqrt{\delta}}. Above, e_{i} are the canonical basis vectors of \mathbb{R}^{d}. Then for any point p in a \xi-ball surrounding one of the points in M, there exists a point p^{\prime}\in\mathcal{N}(\mathcal{K},\delta) such that \left\|p-p^{\prime}\right\|_{2}^{2}\leq\delta. Moreover, we have that |\mathcal{N}(\mathcal{K},\delta)|=n(2\sqrt{d}\xi/\sqrt{\delta})^{d}.

By checking every point in \mathcal{N}(\mathcal{K},\delta) and returning one that maximizes \mathcal{K}, we obtain the following result on finding an approximate mode, which is proven in Appendix A:

Theorem 5.2.

Given a KDE \mathcal{K}=(\kappa,M) in \mathbb{R}^{d} with |M|=n and a precision parameter \epsilon>0, let \xi=\xi_{\kappa}(1/n) and let \delta be at most the largest number such that \kappa(c)-\kappa(c+\delta)\leq\epsilon\kappa(c) for all c\leq\xi. Then we can find an \epsilon-approximate mode in O\left(n(2\sqrt{d}\xi/\sqrt{\delta})^{d}\right) time. In particular, if d\leq O(\log^{c}(n)), \xi\leq O(n^{c}), and \delta\geq O(n^{-c}) for some constant c, then we can find an \epsilon-approximate mode in time quasi-polynomial in the number of data points n.
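The following sketch (our own code) implements this brute-force search over the covering grid of Lemma 5.1; it is only practical when the dimension is very small, e.g. after sketching:

```python
import itertools
import numpy as np

def brute_force_mode(M, kappa, xi, delta):
    """Evaluate the KDE on the delta-covering of Lemma 5.1 and return the best grid
    point. The grid has n * (2*sqrt(d)*xi/sqrt(delta))^d points, so d must be tiny."""
    n, d = M.shape
    step = np.sqrt(delta / d)                       # grid spacing sqrt(delta)/sqrt(d)
    k_max = int(np.ceil(np.sqrt(d) * xi / np.sqrt(delta)))
    offsets = np.arange(-k_max, k_max + 1) * step
    best_val, best_x = -np.inf, None
    for m in M:
        for shift in itertools.product(offsets, repeat=d):
            x = m + np.array(shift)
            val = np.sum(kappa(np.sum((M - x) ** 2, axis=1)))
            if val > best_val:
                best_val, best_x = val, x
    return best_x, best_val

# Tiny 2-dimensional Gaussian example; xi = log(n) is the 1/n-critical radius.
kappa = lambda t: np.exp(-t)
M = np.random.default_rng(0).normal(size=(20, 2))
x_star, val = brute_force_mode(M, kappa, xi=np.log(len(M)), delta=0.5)
print(x_star, val)
```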

Our headline result, Theorem 1.2, follows by combining the dimensionality reduction guarantee of Lemma 3.4 with the observation that, for \bar{\xi}=\max(1,\xi_{\kappa}(1/n)), choosing \delta=\min\left(\left(\frac{d_{2}}{c_{2}}\epsilon\right)^{1/d_{2}},\frac{\epsilon}{c_{2}}(2\bar{\xi})^{1-d_{2}}\right) satisfies the requirement of Theorem 5.2 for any relative-distance smooth kernel \kappa with parameters c_{1},d_{1},q_{1},c_{2},d_{2}. Moreover, as established in Lemma A.1, \xi_{\kappa}(\frac{1}{n})\leq c\log^{1/d_{1}}n, so the runtime in Theorem 5.2 is O(n(\log^{c}n)^{d}) for constant \epsilon. The claim in Theorem 1.2 for the Cauchy kernel follows by noting that \xi=n and we can take \delta=\frac{1}{\epsilon} in Theorem 5.2. Finally, note that Theorem 1.2 also includes the polynomial time cost of multiplying the original data set by a random JL matrix.

Overall, we conclude that one can compute an approximate mode in quasi-polynomial time for the Cauchy kernel, or any KDE on a relative-distance smooth kernel, and in particular the approximate mode finding problem for KDEs on Gaussian, logistic, sigmoid, or generalized Gaussian kernels can be solved in quasi-polynomial time.

6 Hardness Results

The results from the previous sections place the approximate mode finding problem in quasi-polynomial time for a large class of kernels. The question arises whether we can do much better; in this section, we provide some preliminary negative evidence for this possibility. Specifically, we prove NP-hardness of finding an approximate mode of a box kernel KDE, where we recall that this kernel takes the form \kappa(t)=1 for |t|\leq 1 and \kappa(t)=0 otherwise. Our hardness result is based on the hardness of the k-clique problem:

Problem 6.0 (k-clique).

Given a \Delta-regular graph G and an integer k, does G have a complete k-vertex subgraph?

The k-clique problem is known to be NP-hard when k is a parameter of the input. We show how to reduce this problem to KDE mode finding using a reduction inspired by work of Shenmaier on the k-enclosing ball problem (Shenmaier, 2015). We start by creating a point set given an input G to the k-clique problem. Specifically, we embed G in \mathbb{R}^{|E|} as follows: let P be the set of rows of the incidence matrix of G, i.e. the matrix B such that B_{v,e}=1 if e is an edge incident to the node v and B_{v,e}=0 otherwise (Shenmaier, 2015). See Figure 1 for an example.

Figure 1: A simple 3-regular graph and its incidence matrix B.

We will base our hardness result on the following lemma:

Lemma 6.1 (Shenmaier (2015)).

Given a \Delta-regular graph G=(V,E) and integer k, let P be defined as above. Let A=(1-1/k)(\Delta-1) and let R be the radius of the smallest ball containing at least k points of P. Then R^{2}\leq A if there is a k-clique in G, and R^{2}\geq A+2/k^{2} otherwise.

By rescaling P, we can show NP-hardness of the KDE mode finding problem for box kernels:

Theorem 6.2.

The problem of computing a \frac{1}{n}-approximate mode of a box kernel KDE is NP-hard.

Proof.

The proof follows almost directly from Lemma 6.1. Note that the value of the mode of a box kernel KDE is given by the largest number of centers in a ball of radius 1. Let G be an instance of the k-clique problem, and let P be the set of rows of the incidence matrix of G as described above. Now let M=\{p/\sqrt{A}\;\mid\;p\in P\}. From the lemma, we know that there is a ball of radius 1 containing k points if G has a k-clique, so \max_{x}\bar{\mathcal{K}}_{M}(x)\geq k. On the other hand, if G does not have a k-clique then every ball of radius 1 contains at most k-1 points, i.e., \max_{x}\bar{\mathcal{K}}_{M}(x)\leq k-1. So, an approximation algorithm with error \epsilon=1/k\geq 1/n can distinguish between the two cases. Hence, the (\epsilon-approximate) mode finding problem for box kernel KDEs is at least as hard as the k-clique problem when \epsilon\leq 1/n. ∎
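The reduction is simple to instantiate. The sketch below (our own illustration) builds the scaled point set M=\{p/\sqrt{A}\} from the edge list of a \Delta-regular graph and evaluates the unnormalized box kernel KDE at a candidate center:

```python
import numpy as np

def box_kernel_instance(num_nodes, edges, k):
    """Rows of the graph's incidence matrix, scaled by 1/sqrt(A) as in the reduction."""
    B = np.zeros((num_nodes, len(edges)))
    for e_idx, (u, v) in enumerate(edges):
        B[u, e_idx] = 1.0
        B[v, e_idx] = 1.0
    delta = B.sum(axis=1)[0]                 # common degree (the graph is Delta-regular)
    A = (1.0 - 1.0 / k) * (delta - 1.0)
    return B / np.sqrt(A)

def box_kde_value(M, x, tol=1e-9):
    """Unnormalized box kernel KDE: number of centers within squared distance 1 of x
    (a small tolerance guards against floating point error at the boundary)."""
    return int(np.sum(np.sum((M - x) ** 2, axis=1) <= 1.0 + tol))

# K_4 is 3-regular and any 3 of its vertices form a 3-clique.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
M = box_kernel_instance(num_nodes=4, edges=edges, k=3)
# The centroid of the rows of a clique achieves KDE value >= k (here, exactly 3).
print(box_kde_value(M, M[:3].mean(axis=0)))
```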

While it provides an initial result and hints at why the mode finding problem might be challenging, the above hardness result leaves a number of open questions. First off, it does not rule out a constant factor approximation, or a method whose dependence on the approximation parameter \epsilon is exponential (as in our quasi-polynomial time methods). Moreover, the result does not apply to kernels like the Gaussian kernel – it strongly requires that the value of the box kernel differs significantly between t=1 and t=1+\frac{1}{k^{2}}. Proving stronger hardness of approximation for the box kernel, or any hardness for kernels used in practice (like a relative-distance smooth kernel), are promising future directions.

7 Experiments

We support our theoretical results with experiments on two datasets, MNIST (60000 data points, 784 dimensions) and Text8 (71290 data points, 300 dimensions). We use both the Gaussian and generalized Gaussian kernels with a variety of different bandwidths, \sigma. A bandwidth of \sigma means that the kernel function as defined in Section 2 was applied to values t=\frac{\|m-x\|_{2}^{2}}{\sigma^{2}}. In general, a larger \sigma leads to a larger mode value. It also leads to a smoother KDE, which is intuitively easier to maximize. We chose values of \sigma that lead to substantially varying mode values to check the performance of our method across a variety of optimization surfaces.
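In code, applying a bandwidth simply rescales the squared distances before the kernel is applied; a minimal sketch (with stand-in data, not the actual MNIST or Text8 sets):

```python
import numpy as np

def kde_with_bandwidth(x, M, kappa, sigma):
    """Unnormalized KDE value where kappa is applied to t = ||m - x||^2 / sigma^2."""
    t = np.sum((M - x) ** 2, axis=1) / sigma ** 2
    return np.sum(kappa(t))

gen_gaussian = lambda t, alpha=0.5: np.exp(-t ** alpha)  # generalized Gaussian kernel
M = np.random.default_rng(0).normal(size=(1000, 784))    # stand-in for MNIST-sized data
print(kde_with_bandwidth(M[0], M, gen_gaussian, sigma=60.0))
```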

Since these are high-dimensional datasets, it is not computationally feasible to find an exact mode to compare against. Instead, we obtain a baseline mode value by running the mean-shift heuristic (gradient descent) for 100 iterations, with 60 randomly chosen starting points. To avoid convergence to local optima at KDE centers, these starting points were chosen to be random linear combinations of either all dataset points, or a random pair of points in the data set. The best mode value found was taken as a baseline.

After establishing a baseline, we applied JL dimensionality reduction to each data set and kernel for a variety of sketching dimensions, w. Again, for efficiency, we use mean-shift to find an approximate low-dimensional mode, instead of the brute force search method from Section 5. We ran for 10 iterations with 30 random restarts, chosen as described above. To recover a high-dimensional mode from our approximate low-dimensional mode, we use Algorithm 2, since the kernels tested are convex. For each dimension w, we ran 10 trials and report the mean and standard deviation of the KDE value of our approximate mode. Results are included in Figures 2-5. Note that, for visual clarity, the y-axis in these figures does not start at zero.

Figure 2: MNIST data using a Gaussian kernel with bandwidths 70, 200, 600, and 1800.
Figure 3: MNIST data using a Generalized Gaussian kernel with parameter \alpha=.5 and bandwidths 20, 60, 200, and 1100.
Figure 4: Text8 data using a Gaussian kernel with bandwidths .02, .035, .1, and .4.
Figure 5: Text8 data using a Generalized Gaussian kernel with parameter \alpha=.5 and bandwidths .004, .01, .03, and .25.

As apparent from the plots, our Johnson-Lindenstrauss dimensionality reduction approach combined with the mean-shift heuristic performs very well overall. In all cases, it was able to recover an approximate mode with value close to the baseline with sketching dimension wdw\ll d. As expected, performance improves with increasing sketching dimension.

8 Acknowledgements

This work was partially funded through NSF Award No. 2045590. Cas Widdershoven’s work has been partially funded through the CAS Project for Young Scientists in Basic Research, Grant No. YSBR-040.

References

  • Achlioptas (2001) Achlioptas, D. Database-friendly random projections. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp.  274–281, 2001.
  • Ailon & Chazelle (2009) Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., pp.  302–322, 2009.
  • Altman (1992) Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
  • Barman (2015) Barman, S. Approximating Nash equilibria and dense bipartite subgraphs via an approximate version of Carathéodory’s theorem. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp.  361–369, 2015.
  • Biess et al. (2019) Biess, A., Kontorovich, A., Makarychev, Y., and Zaichyk, H. Regression via kirszbraun extension with applications to imitation learning. ArXiv Preprint, abs/1905.11930, 2019. URL http://arxiv.org/abs/1905.11930.
  • Blum et al. (2019) Blum, A., Har-Peled, S., and Raichel, B. Sparse approximation via generating point sets. ACM Trans. Algorithms, 15(3), 6 2019.
  • Botev et al. (2010) Botev, Z. I., Grotowski, J. F., and Kroese, D. P. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916–2957, 2010.
  • Carreira-Perpiñán (2000) Carreira-Perpiñán, M. A. Mode-finding for mixtures of Gaussian distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1318–1323, 2000.
  • Carreira-Perpiñán (2007) Carreira-Perpiñán, M. A. Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:767–776, 2007.
  • Cheng (1995) Cheng, Y. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
  • Cleveland & Devlin (1988) Cleveland, W. S. and Devlin, S. J. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.
  • Comaniciu & Meer (2002) Comaniciu, D. and Meer, P. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
  • Comaniciu (2000) Comaniciu, D. I. Nonparametric robust methods for computer vision. PhD thesis, 2000.
  • Dasgupta & Gupta (2003) Dasgupta, S. and Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 1 2003.
  • Fukunaga & Hostetler (1975) Fukunaga, K. and Hostetler, L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.
  • Gasser et al. (1997) Gasser, T., Hall, P., and Presnell, B. Nonparametric estimation of the mode of a distribution of random curves. Journal of the Royal Statistical Society: Series B, 60:681–691, 1997.
  • Kamalov & Leung (2020) Kamalov, F. and Leung, H. H. Outlier detection in high dimensional data. Journal of Information & Knowledge Management, 19(01), 2020.
  • Kane & Nelson (2014) Kane, D. M. and Nelson, J. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM), 61(1):1–23, 2014.
  • Kim & Scott (2012) Kim, J. and Scott, C. D. Robust kernel density estimation. The Journal of Machine Learning Research, 13(1):2529–2565, 2012.
  • Kirszbraun (1934) Kirszbraun, M. Über die zusammenziehende und lipschitzsche transformationen. Fundamenta Mathematicae, 22(1):77–108, 1934.
  • Larsen & Nelson (2017) Larsen, K. G. and Nelson, J. Optimality of the Johnson-Lindenstrauss lemma. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pp.  633–638, 2017.
  • Lee et al. (2021) Lee, J. C., Li, J., Musco, C., Phillips, J. M., and Tai, W. M. Finding an approximate mode of a kernel density estimate. In 29th Annual European Symposium on Algorithms (ESA 2021), volume 204, pp.  61:1–61:19, 2021.
  • Phillips & Tai (2018) Phillips, J. M. and Tai, W. M. Near-optimal coresets of kernel density estimates. In 34th International Symposium on Computational Geometry, 2018.
  • Phillips et al. (2015) Phillips, J. M., Wang, B., and Zheng, Y. Geometric inference on kernel density estimates. In International Symposium on Computational Geometry, 2015.
  • Scott (2015) Scott, D. W. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
  • Sheikhpour et al. (2016) Sheikhpour, R., Sarram, M. A., and Sheikhpour, R. Particle swarm optimization for bandwidth determination and feature selection of kernel density estimation based classifiers in diagnosis of breast cancer. Applied Soft Computing, 40:113–131, 2016.
  • Shen et al. (2007) Shen, C., Brooks, M. J., and van den Hengel, A. Fast global kernel density mode seeking: Applications to localization and tracking. IEEE Transactions on Image Processing, 16:1457 – 1469, 2007.
  • Shenmaier (2015) Shenmaier, V. Complexity and approximation of the smallest k-enclosing ball problem. European Journal of Combinatorics, 48:81–87, 2015.
  • Shenmaier (2019) Shenmaier, V. A structural theorem for center-based clustering in high-dimensional Euclidean space. In Machine Learning, Optimization, and Data Science, pp. 284–295. Springer International Publishing, 2019.
  • Silverman (2018) Silverman, B. W. Density estimation for statistics and data analysis. Routledge, 2018.
  • Yavlinsky et al. (2005) Yavlinsky, A., Schofield, E., and Rüger, S. Automated image annotation using global features and robust nonparametric density estimation. In International Conference on Image and Video Retrieval, pp. 507–517, 2005.

Appendix A Additional Proofs

A.1 Proofs for Section 3

We first prove Theorem 3.1.

Proof.

First recall the definitions of \bar{\mathcal{K}}_{M}(x) and \bar{\mathcal{K}}_{\Pi M}(x), which are just fixed positive scalings of \mathcal{K}_{M}(x) and \mathcal{K}_{\Pi M}(x). It suffices to prove that:

(1-\epsilon)\max_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x)\leq\max_{x\in\mathbb{R}^{w}}\bar{\mathcal{K}}_{\Pi M}(x)\leq\max_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x). (4)

To prove (4) we will apply the guarantee of Definition 2.2 to the n+1 points in \{x^{*}\}\cup M, where x^{*}\in\operatorname{argmax}_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x). This guarantee ensures that with probability (1-\delta), \|a-b\|_{2}^{2}\leq\|\Pi a-\Pi b\|_{2}^{2}\leq(1+\gamma)\|a-b\|_{2}^{2} for all a,b in this set, where \Pi\in\mathbb{R}^{w\times d} is the JL matrix from the theorem.

We first prove the right hand side of (4) using an argument identical to the proof of Lemma 10 from (Lee et al., 2021). Consider the set of n points \Pi M that contains \Pi m for all m\in M. Let g be a map from each point in this set to the corresponding point in M. Since \|\Pi m_{1}-\Pi m_{2}\|_{2}^{2}\geq\|m_{1}-m_{2}\|_{2}^{2} for all m_{1},m_{2}\in M as guaranteed above, we have that g is 1-Lipschitz. From Kirszbraun’s theorem (Fact 2.3) it follows that there is a function \tilde{g}:\mathbb{R}^{w}\rightarrow\mathbb{R}^{d} which agrees with g on inputs in \Pi M and satisfies \|\tilde{g}(s)-\tilde{g}(t)\|_{2}^{2}\leq\|s-t\|_{2}^{2} for all s,t\in\mathbb{R}^{w}. So for any x\in\mathbb{R}^{w}, there is some x^{\prime}=\tilde{g}(x) such that, for all m\in M,

\displaystyle\|x^{\prime}-m\|_{2}^{2}\leq\|x-\Pi m\|_{2}^{2}.

The right hand side of (4) then follows: there must be some point x^{\prime} such that for all m, \|x^{\prime}-m\|_{2}^{2}\leq\|\tilde{x}^{*}-\Pi m\|_{2}^{2}, where \tilde{x}^{*}\in\operatorname{argmax}_{x\in\mathbb{R}^{w}}\bar{\mathcal{K}}_{\Pi M}(x). Overall we have:

\displaystyle\max_{x\in\mathbb{R}^{w}}\bar{\mathcal{K}}_{\Pi M}(x)=\bar{\mathcal{K}}_{\Pi M}(\tilde{x}^{*})=\sum_{m\in M}\kappa(\|\tilde{x}^{*}-\Pi m\|_{2}^{2})\leq\sum_{m\in M}\kappa(\|x^{\prime}-m\|_{2}^{2})\leq\max_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x).

In the second to last inequality we used that \kappa is non-increasing.

We next prove the left hand side of (4). We first have:

\displaystyle\max_{x\in\mathbb{R}^{w}}\bar{\mathcal{K}}_{\Pi M}(x)=\max_{x\in\mathbb{R}^{w}}\sum_{m\in M}\kappa(\left\|x-\Pi m\right\|^{2}_{2})\geq\sum_{m\in M}\kappa(\left\|\Pi x^{*}-\Pi m\right\|^{2}_{2})\geq\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa(\left\|\Pi x^{*}-\Pi m\right\|^{2}_{2}),

where x^{*}\in\operatorname{argmax}_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x). Applying the JL guarantee to the n+1 points in \{x^{*}\}\cup M, we have that for all m, \left\|\Pi x^{*}-\Pi m\right\|^{2}_{2}\leq(1+\gamma)\left\|x^{*}-m\right\|^{2}_{2}. So plugging into the equation above, we have:

\displaystyle\max_{x\in\mathbb{R}^{w}}\bar{\mathcal{K}}_{\Pi M}(x)\geq\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa((1+\gamma)\left\|x^{*}-m\right\|^{2}_{2})

We can then bound:

\displaystyle\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa((1+\gamma)\left\|x^{*}-m\right\|^{2}_{2})
\displaystyle\geq\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa(\left\|x^{*}-m\right\|^{2}_{2})+\gamma\left\|x^{*}-m\right\|_{2}^{2}\min_{z\in[\left\|x^{*}-m\right\|_{2}^{2},(1+\gamma)\left\|x^{*}-m\right\|_{2}^{2}]}\kappa^{\prime}(z)
\displaystyle\geq\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa(\left\|x^{*}-m\right\|^{2}_{2})+\gamma\cdot\min_{z\in[\left\|x^{*}-m\right\|_{2}^{2},(1+\gamma)\left\|x^{*}-m\right\|_{2}^{2}]}\kappa^{\prime}(z)\cdot z
\displaystyle\geq\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa(\left\|x^{*}-m\right\|^{2}_{2})+\gamma\cdot\min_{z\in[\left\|x^{*}-m\right\|_{2}^{2},(1+\gamma)\left\|x^{*}-m\right\|_{2}^{2}]}\kappa^{\prime}(z)\cdot z\cdot\frac{\kappa(\left\|x^{*}-m\right\|_{2}^{2})}{\kappa(z)}.

The last inequality follows from the fact that \kappa is non-increasing, so \kappa^{\prime}(z)\cdot z is negative or zero and \frac{\kappa(\left\|x^{*}-m\right\|_{2}^{2})}{\kappa(z)}\geq 1. Invoking our definition of \kappa^{\prime}_{\min} and choice of \gamma=-\frac{\epsilon}{2\kappa^{\prime}_{\min}} we can continue:

mM,xm22ξκ(xm22)+γminz[xm22,(1+γ)xm22]κ(z)zκ(xm22)κ(z)\displaystyle\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa(\left\|x^{*}-m\right\|^{2}_{2})+\gamma\cdot\min_{z\in[\left\|x^{*}-m\right\|_{2}^{2},(1+\gamma)\left\|x^{*}-m\right\|_{2}^{2}]}\kappa^{\prime}(z)\cdot z\cdot\frac{\kappa(\left\|x^{*}-m\right\|_{2}^{2})}{\kappa(z)}
mM,xm22ξκ(xm22)+γκminκ(xm22)\displaystyle\geq\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa(\left\|x^{*}-m\right\|^{2}_{2})+\gamma\cdot\kappa^{\prime}_{\min}\cdot\kappa(\left\|x^{*}-m\right\|_{2}^{2})
=mM,xm22ξκ(xm22)ϵ2κ(xm22)\displaystyle=\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa(\left\|x^{*}-m\right\|^{2}_{2})-\frac{\epsilon}{2}\cdot\kappa(\left\|x^{*}-m\right\|_{2}^{2})
=(1ϵ2)mM,xm22ξκ(xm22)\displaystyle=\left(1-\frac{\epsilon}{2}\right)\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}\leq\xi}\kappa(\left\|x^{*}-m\right\|^{2}_{2})
=(1ϵ2)(mMκ(xm22)mM,xm22>ξκ(xm22))\displaystyle=\left(1-\frac{\epsilon}{2}\right)\left(\sum_{m\in M}\kappa(\left\|x^{*}-m\right\|^{2}_{2})-\sum\limits_{m\in M,\left\|x^{*}-m\right\|_{2}^{2}>\xi}\kappa(\left\|x^{*}-m\right\|_{2}^{2})\right)
(1ϵ2)(mMκ(xm22)ϵ2)\displaystyle\geq\left(1-\frac{\epsilon}{2}\right)\left(\sum_{m\in M}\kappa(\left\|x^{*}-m\right\|^{2}_{2})-\frac{\epsilon}{2}\right)
=(1ϵ2)(maxxd𝒦¯M(x)ϵ2)(1ϵ2)2maxxd𝒦¯M(x)(1ϵ)maxxd𝒦¯M(x)\displaystyle=\left(1-\frac{\epsilon}{2}\right)\left(\max_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x)-\frac{\epsilon}{2}\right)\geq\left(1-\frac{\epsilon}{2}\right)^{2}\max_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x)\geq\left(1-\epsilon\right)\max_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x)

Note that in the second to last line we invoked the definition of ξξκ(ϵ2n)\xi\geq\xi_{\kappa}(\frac{\epsilon}{2n}). Specifically, we used that, for any mm with xm22>ξ\left\|x^{*}-m\right\|_{2}^{2}>\xi, κ(xm22)ϵ2n\kappa(\left\|x^{*}-m\right\|^{2}_{2})\leq\frac{\epsilon}{2n}. In the last line we use that maxxd𝒦¯M(x)1\max_{x\in\mathbb{R}^{d}}\bar{\mathcal{K}}_{M}(x)\geq 1. ∎
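To make the role of γ concrete, the following small numerical check (ours, not part of the original argument) verifies the key first-order step above for the Gaussian kernel κ(t)=e^{−t}: for this kernel κ′(z)z/κ(z)=−z ≥ −2ξ on [0,2ξ], so one may take κ′_min=−2ξ and γ=−ε/(2κ′_min)=ε/(4ξ), and the claim κ((1+γ)t) ≥ (1−ε/2)κ(t) should hold for all t ≤ ξ.

import numpy as np

# Sanity check (a sketch, not from the paper): Gaussian kernel kappa(t) = exp(-t).
kappa = lambda t: np.exp(-t)
eps, n = 0.1, 1000
xi = np.log(2 * n / eps)           # xi_kappa(eps/(2n)) for this kernel
gamma = eps / (4 * xi)             # gamma = -eps / (2 * kappa'_min) with kappa'_min = -2*xi
t = np.linspace(0, xi, 10_000)
assert np.all(kappa((1 + gamma) * t) >= (1 - eps / 2) * kappa(t))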

Lemma A.1.

Let 𝒦M=(κ,M)\mathcal{K}_{M}=(\kappa,M) be a KDE for a point set MM with cardinality nn and relative-distance smooth kernel κ\kappa with parameters c1,d1,q1,c2,d2c_{1},d_{1},q_{1},c_{2},d_{2}. Then for any ϵ(0,1]\epsilon\in(0,1], ξκ(ϵ2n)clog1/d1(2nϵ)\xi_{\kappa}(\frac{\epsilon}{2n})\leq c\log^{1/d_{1}}\left(\frac{2n}{\epsilon}\right) for a fixed constant cc that depends on c1,d1,q1,c2,c_{1},d_{1},q_{1},c_{2}, and d2d_{2}.

Proof.

Since κ\kappa is positive, non-increasing, and κ(0)=1\kappa(0)=1 we can write κ(t)=ef(t)\kappa(t)=e^{-f(t)} for some non-negative, non-decreasing function ff with f(0)=0f(0)=0. We have κ(t)tκ(t)=f(t)tc1td1q1\frac{-\kappa^{\prime}(t)t}{\kappa(t)}=f^{\prime}(t)t\geq c_{1}t^{d_{1}}-q_{1} and thus f(t)max(0,c1td11q1t)f^{\prime}(t)\geq\max(0,c_{1}t^{d_{1}-1}-\frac{q_{1}}{t}). Writing κ(t)=e0tf(x)𝑑x\kappa(t)=e^{-\int_{0}^{t}f^{\prime}(x)dx}, we will upper bound κ(t)\kappa(t) by lower bounding 0tf(x)𝑑x\int_{0}^{t}f^{\prime}(x)dx. Specifically, we have:

0tf(x)𝑑x\displaystyle\int_{0}^{t}f^{\prime}(x)dx 0tmax(0,c1xd11q1x)𝑑x\displaystyle\geq\int_{0}^{t}\max\left(0,c_{1}x^{d_{1}-1}-\frac{q_{1}}{x}\right)dx
=0tc1xd11𝑑x0tmin(c1xd11,q1x)𝑑x\displaystyle=\int_{0}^{t}c_{1}x^{d_{1}-1}dx-\int_{0}^{t}\min(c_{1}x^{d_{1}-1},\frac{q_{1}}{x})dx
=c1d1td10(q1/c1)1/d1c1xd11𝑑x(q1/c1)1/d1tq1x𝑑x\displaystyle=\frac{c_{1}}{d_{1}}t^{d_{1}}-\int_{0}^{(q_{1}/c_{1})^{1/d_{1}}}c_{1}x^{d_{1}-1}dx-\int_{(q_{1}/c_{1})^{1/d_{1}}}^{t}\frac{q_{1}}{x}dx
=c1d1td1q1d1q1log(t)+q1d1log(q1/c1).\displaystyle=\frac{c_{1}}{d_{1}}t^{d_{1}}-\frac{q_{1}}{d_{1}}-q_{1}\log(t)+\frac{q_{1}}{d_{1}}\log(q_{1}/c_{1}).

It follows that κ(t)ec1d1td1+q1d1+q1log(t)q1d1logq1c1.\kappa(t)\leq e^{-\frac{c_{1}}{d_{1}}t^{d_{1}}+\frac{q_{1}}{d_{1}}+q_{1}\log(t)-\frac{q_{1}}{d_{1}}\log\frac{q_{1}}{c_{1}}}. We want to upper bound the smallest tt such that κ(t)ϵ2n\kappa(t)\leq\frac{\epsilon}{2n}. Let zz be a sufficiently large constant so that:

c1d1zd12(q1d1+q1log(z)q1d1logq1c1)\displaystyle\frac{c_{1}}{d_{1}}z^{d_{1}}\geq 2\left(\frac{q_{1}}{d_{1}}+q_{1}\log(z)-\frac{q_{1}}{d_{1}}\log\frac{q_{1}}{c_{1}}\right)

Since zz is a sufficiently large constant, for any tzt\geq z the remaining terms in the exponent total at most c12d1td1\frac{c_{1}}{2d_{1}}t^{d_{1}}, so κ(t)ec12d1td1\kappa(t)\leq e^{-\frac{c_{1}}{2d_{1}}t^{d_{1}}}. It therefore suffices to pick tmax(z,(2d1c1log(2nϵ))1/d1)=O(log1/d1(2nϵ))t\geq\max\left(z,\left(\frac{2d_{1}}{c_{1}}\log(\frac{2n}{\epsilon})\right)^{1/d_{1}}\right)=O\left(\log^{1/d_{1}}(\frac{2n}{\epsilon})\right) to ensure that κ(t)ϵ2n\kappa(t)\leq\frac{\epsilon}{2n}. ∎
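As a quick illustration (ours, not part of the proof), the bound of Lemma A.1 can be checked numerically for the kernel κ(t)=e^{−t^{d₁}}, which satisfies −κ′(t)t/κ(t)=d₁t^{d₁}, i.e. c₁=c₂=d₁, q₁=0, and d₂=d₁; in this case the exact threshold is log^{1/d₁}(2n/ε).

import numpy as np

# Sketch (ours): compare the exact threshold xi_kappa(eps/(2n)) for kappa(t) = exp(-t**d1)
# against the log^{1/d1}(2n/eps) scaling promised by Lemma A.1.
d1 = 0.5
eps, n = 0.05, 10_000
kappa = lambda t: np.exp(-t ** d1)
target = eps / (2 * n)
t_grid = np.linspace(1e-9, 1e6, 2_000_000)
xi_exact = t_grid[np.argmax(kappa(t_grid) <= target)]  # first grid point with kappa(t) <= eps/(2n)
bound = np.log(2 * n / eps) ** (1 / d1)                 # exact closed form for this kernel (q1 = 0)
print(xi_exact, bound)                                  # agree up to the grid resolution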

See 3.4

Proof.

With Lemma A.1 in place, our main result for relative-distance smooth kernels follows directly. By Lemma A.1, ξ=clog1/d1(2nϵ)ξκ(ϵ2n)\xi=c\log^{1/d_{1}}\left(\frac{2n}{\epsilon}\right)\geq\xi_{\kappa}(\frac{\epsilon}{2n}). Since κ\kappa is relative-distance smooth, we have that:

min0x2ξκ(x)xκ(x)\displaystyle\min_{0\leq x\leq 2\xi}\frac{\kappa^{\prime}(x)x}{\kappa(x)} min0x2ξc2xd2=c2(2ξ)d2clogd2/d1(2nϵ),\displaystyle\geq\min_{0\leq x\leq 2\xi}-c_{2}x^{d_{2}}=-c_{2}(2\xi)^{d_{2}}\geq-c^{\prime}\log^{d_{2}/d_{1}}\left(\frac{2n}{\epsilon}\right),

for sufficiently large constant cc^{\prime}. Let κmin=clogd2/d1(2nϵ)\kappa^{\prime}_{\min}=-c^{\prime}\log^{d_{2}/d_{1}}\left(\frac{2n}{\epsilon}\right). Invoking Theorem 3.1, we require that γ=ϵ2κmin=ϵ2clogd2/d1(2nϵ)\gamma=-\frac{\epsilon}{2\kappa^{\prime}_{\min}}=\frac{\epsilon}{2c^{\prime}}\log^{-d_{2}/d_{1}}\left(\frac{2n}{\epsilon}\right). ∎
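For concreteness, the following sketch (ours) collects the resulting parameter choices; the absolute constants c and c′, as well as the constant hidden in the O(·) for the number of JL rows w, are unspecified by the asymptotic statements and are simply set to 1 below.

import math

def jl_parameters(n, eps, delta, d1, d2, c=1.0, c_prime=1.0, C_w=1.0):
    # Sketch of the parameter choices suggested by Lemma A.1 and Theorem 3.1.
    # c, c_prime, C_w stand in for unspecified absolute constants.
    xi = c * math.log(2 * n / eps) ** (1.0 / d1)                    # xi >= xi_kappa(eps/(2n))
    kappa_prime_min = -c_prime * math.log(2 * n / eps) ** (d2 / d1)
    gamma = -eps / (2 * kappa_prime_min)                            # = (eps / 2c') * log^{-d2/d1}(2n/eps)
    w = math.ceil(C_w * math.log((n + 1) / delta) / min(1.0, gamma ** 2))
    return xi, gamma, w

print(jl_parameters(n=100_000, eps=0.1, delta=0.05, d1=1, d2=1))    # e.g. the Gaussian kernel regime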

A.2 Proofs for Section 4

See 4.2

Proof.

We will show that:

𝒦M(x)𝒦ΠM(x~)\displaystyle{\mathcal{K}}_{M}(x^{\prime})\geq{\mathcal{K}}_{\Pi M}(\tilde{x}) (5)

where x~w\tilde{x}\in\mathbb{R}^{w} is the approximate maximizer of 𝒦ΠM{\mathcal{K}}_{\Pi M}, as defined in the theorem statement. If we can prove (5) then the theorem follows by the following chain of inequalities:

𝒦M(x)𝒦ΠM(x~)(1α)maxx𝒦ΠM(x)(1α)(1ϵ)maxx𝒦M(x)(1αϵ)maxx𝒦M(x).\displaystyle{\mathcal{K}}_{M}(x^{\prime})\geq{\mathcal{K}}_{\Pi M}(\tilde{x})\geq(1-\alpha)\max_{x}{\mathcal{K}}_{\Pi M}({x})\geq(1-\alpha)(1-\epsilon)\max_{x}{\mathcal{K}}_{M}({x})\geq(1-\alpha-\epsilon)\max_{x}{\mathcal{K}}_{M}({x}).

To prove (5), we follow a similar approach to (Lee et al., 2021). Since Π\Pi satisfies (3), the function ff mapping {Πx}ΠM{x}M\{\Pi x^{*}\}\cup\Pi M\rightarrow\{x^{*}\}\cup M is 1-Lipschitz. Accordingly, by Kirszbraun’s Extension Theorem (2.3), there is some function g:wdg:\mathbb{R}^{w}\rightarrow\mathbb{R}^{d} that agrees with ff on inputs from {Πx}ΠM\{\Pi x^{*}\}\cup\Pi M and remains 1-Lipschitz. It follows that, if we apply gg to x~\tilde{x}, then for all mMm\in M,

g(x~)m2x~Πm2.\displaystyle\left\|{g}(\tilde{x})-m\right\|_{2}\leq\left\|\tilde{x}-\Pi m\right\|_{2}.

In other words, for our approximate low-dimensional mode x~\tilde{x}, there is a high-dimensional point g(x~)g(\tilde{x}) that is at least as close to each point mMm\in M as x~\tilde{x} is to the corresponding point ΠmΠM\Pi m\in\Pi M. In fact, by extending all points by one extra dimension, we can obtain an exact equality. In particular, let x′′d+1x^{\prime\prime}\in\mathbb{R}^{d+1} equal g(x~)g(\tilde{x}) on its first dd coordinates, and 0 on its last coordinate. For each mMm\in M let m′′d+1m^{\prime\prime}\in\mathbb{R}^{d+1} equal mm on its first dd coordinates, and x~Πm22g(x~)m22\sqrt{\left\|\tilde{x}-\Pi m\right\|_{2}^{2}-\left\|{g}(\tilde{x})-m\right\|_{2}^{2}} on its last coordinate. Let M′′d+1M^{\prime\prime}\subset\mathbb{R}^{d+1} denote the set of these transformed points. First observe that for all m′′M′′m^{\prime\prime}\in M^{\prime\prime},

x′′m′′2=x~Πm2.\displaystyle\left\|x^{\prime\prime}-m^{\prime\prime}\right\|_{2}=\left\|\tilde{x}-\Pi m\right\|_{2}. (6)

Accordingly, xx^{\prime} as defined in the theorem statement is exactly equal to the first dd entries of the d+1d+1 dimensional vector that would be obtained from running one iteration of mean-shift on x′′x^{\prime\prime} using point set M′′M^{\prime\prime}. Call this d+1d+1 dimensional vector x¯\bar{x}^{\prime}. By 4.1, we have that:

𝒦M′′(x¯)𝒦M′′(x′′)=𝒦ΠM(x~).\displaystyle\mathcal{K}_{M^{\prime\prime}}(\bar{x}^{\prime})\geq\mathcal{K}_{M^{\prime\prime}}(x^{\prime\prime})=\mathcal{K}_{\Pi M}(\tilde{x}).

The last equality follows from (6). Finally, for any non-increasing kernel we have that:

𝒦M(x)𝒦M′′(x¯),\displaystyle\mathcal{K}_{M}({x}^{\prime})\geq\mathcal{K}_{M^{\prime\prime}}(\bar{x}^{\prime}),

because xm22x¯m′′22\|x^{\prime}-m\|_{2}^{2}\leq\|\bar{x}^{\prime}-m^{\prime\prime}\|_{2}^{2} for all mm. This is simply because xx^{\prime} and mm are equal to x¯\bar{x}^{\prime} and m′′m^{\prime\prime}, but with their last entry removed, so they can only be closer together. Combining the previous two inequalities proves (5), which establishes the theorem. ∎
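For intuition, the following sketch (ours) shows why x′ can be computed without ever forming x″ or M″: since ‖x″−m″‖₂ = ‖x̃−Πm‖₂, the mean-shift weights for the lifted problem depend only on the low-dimensional distances, and the first d coordinates of the update are just a weighted average of the original points. The sketch assumes the Gaussian kernel, for which the standard mean-shift weight −κ′(t) is proportional to κ(t).

import numpy as np

# Sketch (ours) of the recovery step analyzed above, for the Gaussian kernel kappa(t) = exp(-t).
def recover_high_dim_mode(x_tilde, M, Pi):
    # x_tilde: approximate mode of the projected KDE (length w)
    # M: n x d array of original points; Pi: w x d JL matrix
    sq_dists = np.sum((Pi @ M.T - x_tilde[:, None]) ** 2, axis=0)  # ||x_tilde - Pi m||_2^2
    weights = np.exp(-sq_dists)                                    # -kappa'(t), up to a constant, for this kernel
    return (weights @ M) / weights.sum()                           # first d coordinates of one mean-shift step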

See 4.3

Proof.

Corollary 4.3 immediately follows by combining Theorem 3.1 with Theorem 4.2. In particular, if Π\Pi is chosen with w=O(log((n+1)/δ)min(1,γ2))w=O\left(\frac{\log((n+1)/\delta)}{\min(1,\gamma^{2})}\right) rows (as in Algorithm 2), then with probability 1δ1-\delta both (2) and (3) hold, and these are the only conditions needed for Theorem 4.2. ∎

A.3 Proofs for Section 5

See 5.1

Proof.

The second claim on the size of 𝒩(𝒦,δ)\mathcal{N}(\mathcal{K},\delta) is immediate. For the first claim, note that pp can be written as p′′+ikiδdeip^{\prime\prime}+\sum_{i}\frac{k^{\prime}_{i}\sqrt{\delta}}{\sqrt{d}}e_{i} with p′′Mp^{\prime\prime}\in M and |ki|dξδ|k^{\prime}_{i}|\leq\frac{\sqrt{d}\xi}{\sqrt{\delta}}. Let p=p′′+ikiδdei𝒩(𝒦,δ)p^{\prime}=p^{\prime\prime}+\sum_{i}\frac{\lfloor k^{\prime}_{i}\rfloor\sqrt{\delta}}{\sqrt{d}}e_{i}\in\mathcal{N}(\mathcal{K},\delta). Then we have

pp22\displaystyle\left\|p-p^{\prime}\right\|_{2}^{2} =p′′+ikiδdeip′′ikiδdei22=i=1d(kiki)δdei2\displaystyle=\left\|p^{\prime\prime}+\sum_{i}\frac{k^{\prime}_{i}\sqrt{\delta}}{\sqrt{d}}e_{i}-p^{\prime\prime}-\sum_{i}\frac{\lfloor k^{\prime}_{i}\rfloor\sqrt{\delta}}{\sqrt{d}}e_{i}\right\|_{2}^{2}=\left\|\sum_{i=1}^{d}\frac{(k^{\prime}_{i}-\lfloor k^{\prime}_{i}\rfloor)\sqrt{\delta}}{\sqrt{d}}e_{i}\right\|^{2}
=i=1d((kiki)δd)2i=1d(δd)2=δ.\displaystyle=\sum_{i=1}^{d}\left(\frac{(k^{\prime}_{i}-\lfloor k^{\prime}_{i}\rfloor)\sqrt{\delta}}{\sqrt{d}}\right)^{2}\leq\sum_{i=1}^{d}\left(\frac{\sqrt{\delta}}{\sqrt{d}}\right)^{2}=\delta. ∎

See 5.2

Proof.

Since there always exists a mode in the critical area of 𝒦\mathcal{K}, we can use Lemma 5.1 to find a point pp^{\prime} with pp22δ\left\|p-p^{\prime}\right\|_{2}^{2}\leq\delta for some mode pp of 𝒦\mathcal{K} in O(n(2dξ/δ)d)O\left(n(2\sqrt{d}\xi/\sqrt{\delta})^{d}\right) time. Then we have

𝒦(p)\displaystyle\mathcal{K}(p^{\prime}) =mMκ(mp22)mMκ(mp22+pp22)mMκ(mp22+δ)\displaystyle=\sum_{m\in M}\kappa(\|m-p^{\prime}\|_{2}^{2})\geq\sum_{m\in M}\kappa(\|m-p\|_{2}^{2}+\|p-p^{\prime}\|_{2}^{2})\geq\sum_{m\in M}\kappa(\|m-p\|_{2}^{2}+\delta)
mMκ(mp22)ϵκ(mp22)=(1ϵ)mMκ(mp22)=(1ϵ)𝒦(p)\displaystyle\geq\sum_{m\in M}\kappa(\|m-p\|_{2}^{2})-\epsilon\kappa(\|m-p\|_{2}^{2})=(1-\epsilon)\sum_{m\in M}\kappa(\|m-p\|_{2}^{2})=(1-\epsilon)\mathcal{K}(p) ∎
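The brute-force search implicit in this argument can be sketched as follows (ours); the candidate set is exactly the net 𝒩(𝒦,δ) from Lemma 5.1, and the cost is exponential in d, matching the O(n(2√dξ/√δ)^d) bound.

import itertools
import numpy as np

# Sketch (ours): enumerate the net N(K, delta) around each point of M and return the
# net point with the largest (unnormalized) KDE value.
def brute_force_mode(M, kappa, xi, delta):
    n, d = M.shape
    step = np.sqrt(delta / d)
    k_max = int(np.ceil(np.sqrt(d) * xi / np.sqrt(delta)))
    offsets = np.array(list(itertools.product(range(-k_max, k_max + 1), repeat=d))) * step
    candidates = (M[:, None, :] + offsets[None, :, :]).reshape(-1, d)
    kde_vals = kappa(((candidates[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)).sum(axis=1)
    return candidates[np.argmax(kde_vals)]

# Tiny example (the net has size n * (2*k_max + 1)^d, so this is only feasible for small d):
M = np.array([[0.0, 0.0], [1.0, 0.5], [0.9, 0.6]])
print(brute_force_mode(M, kappa=lambda t: np.exp(-t), xi=1.0, delta=0.5))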

A.4 Analysis for Relative-Distance Smooth Kernels

Let ξ¯=max(1,ξκ(1/n))\bar{\xi}=\max(1,\xi_{\kappa}(1/n)). We will prove that for any relative-distance smooth kernel κ\kappa with parameters c1,d1,q1,c2c_{1},d_{1},q_{1},c_{2}, and d2d_{2}, we have κ(c)κ(c+δ)ϵκ(c)\kappa(c)-\kappa(c+\delta)\leq\epsilon\kappa(c) for all cξ=ξκ(1/n)c\leq\xi=\xi_{\kappa}(1/n) as long as:

δ=min((d2c2ϵ)1/d2,ϵc2(2ξ¯)1d2).\displaystyle\delta=\min\left(\left(\frac{d_{2}}{c_{2}}\epsilon\right)^{1/d_{2}},\;\frac{\epsilon}{c_{2}}(2\bar{\xi})^{1-d_{2}}\right).

By the definition of relative distance smooth kernels, we have that κ(t)c2td21κ(t)-\kappa^{\prime}(t)\leq c_{2}t^{d_{2}-1}\kappa(t). Hence,

κ(c)κ(c+δ)cc+δc2td21κ(t)𝑑tκ(c)cc+δc2td21𝑑t.\displaystyle\kappa(c)-\kappa(c+\delta)\leq\int_{c}^{c+\delta}c_{2}t^{d_{2}-1}\kappa(t)dt\leq\kappa(c)\int_{c}^{c+\delta}c_{2}t^{d_{2}-1}dt.

The last step follows from the fact that κ(t)\kappa(t) is non-increasing in tt. Since d2>0d_{2}>0, this simplifies to

κ(c)κ(c+δ)κ(c)cc+δc2td21𝑑t=c2d2κ(c)((c+δ)d2cd2).\displaystyle\kappa(c)-\kappa(c+\delta)\leq\kappa(c)\int_{c}^{c+\delta}c_{2}t^{d_{2}-1}dt=\frac{c_{2}}{d_{2}}\kappa(c)((c+\delta)^{d_{2}}-c^{d_{2}}).

So, we need to show that c2d2((c+δ)d2cd2)ϵ\frac{c_{2}}{d_{2}}\left((c+\delta)^{d_{2}}-c^{d_{2}}\right)\leq\epsilon. We consider two cases:

Case 1: When d2<1d_{2}<1, consider the function f(x)=(x+δ)d2xd2f(x)=(x+\delta)^{d_{2}}-x^{d_{2}}. This function is non-increasing: indeed, f(x)=d2((x+δ)d21xd21)<0f^{\prime}(x)=d_{2}((x+\delta)^{d_{2}-1}-x^{d_{2}-1})<0 for x>0x>0. Hence, we have that

((c+δ)d2cd2)δd2.\displaystyle((c+\delta)^{d_{2}}-c^{d_{2}})\leq\delta^{d_{2}}.

We can thus pick δ=(d2c2ϵ)1/d2\delta=\left(\frac{d_{2}}{c_{2}}\epsilon\right)^{1/d_{2}}, so that c2d2δd2=ϵ\frac{c_{2}}{d_{2}}\delta^{d_{2}}=\epsilon.

Case 2: On the other hand, when d21d_{2}\geq 1, the function f(x)=xd2f(x)=x^{d_{2}} is convex, so we have:

(c+δ)d2cd2δf(c+δ)=δd2(c+δ)d21δd22d21max(ξd21,δd21)δd22d21ξ¯d21\displaystyle(c+\delta)^{d_{2}}-c^{d_{2}}\leq\delta f^{\prime}(c+\delta)=\delta d_{2}(c+\delta)^{d_{2}-1}\leq\delta d_{2}2^{d_{2}-1}\max(\xi^{d_{2}-1},\delta^{d_{2}-1})\leq\delta d_{2}2^{d_{2}-1}\bar{\xi}^{d_{2}-1}

In this case, we can choose δ=ϵc2(2ξ¯)1d2\delta=\frac{\epsilon}{c_{2}}(2\bar{\xi})^{1-d_{2}}.

Hence, picking δ=min((d2c2ϵ)1/d2,ϵc2(2ξ¯)1d2)\delta=\min\left(\left(\frac{d_{2}}{c_{2}}\epsilon\right)^{1/d_{2}},\;\frac{\epsilon}{c_{2}}(2\bar{\xi})^{1-d_{2}}\right) ensures that (c+δ)d2cd2d2c2ϵ(c+\delta)^{d_{2}}-c^{d_{2}}\leq\frac{d_{2}}{c_{2}}\epsilon, and thus that c2d2((c+δ)d2cd2)ϵ\frac{c_{2}}{d_{2}}\left((c+\delta)^{d_{2}}-c^{d_{2}}\right)\leq\epsilon, as required.
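As a quick numerical check (ours), the choice of δ can be verified for the Gaussian kernel κ(t)=e^{−t}, which is relative-distance smooth with c₂=d₂=1 (and c₁=1, q₁=0); in that case both candidate values of δ coincide and equal ε.

import numpy as np

# Sketch (ours): verify kappa(c) - kappa(c + delta) <= eps * kappa(c) for all c <= xi.
kappa = lambda t: np.exp(-t)
eps, n = 0.1, 1000
c2, d2 = 1.0, 1.0
xi = np.log(n)                                # xi_kappa(1/n) for this kernel
xi_bar = max(1.0, xi)
delta = min((d2 / c2 * eps) ** (1 / d2), eps / c2 * (2 * xi_bar) ** (1 - d2))
c = np.linspace(0, xi, 10_000)
assert np.all(kappa(c) - kappa(c + delta) <= eps * kappa(c))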

Appendix B Mode Recovery for Non-convex Kernels

In Section 4.1, we show that the mean-shift method can rapidly recover an approximate mode for any convex, non-increasing kernel from an approximation to the JL reduced problem. In this section, we briefly comment on an alternative method that can also handle non-convex kernels, albeit at the cost of worse runtime. Specifically, it is possible to leverage a recent result from (Biess et al., 2019) on an algorithmic version of the Kirszbraun extension theorem. That work provides an algorithm for explicitly extending a function ff that is Lipschitz on some fixed set of points to one additional point. Their main result is as follows:

Theorem B.1 ((Biess et al., 2019)).

Consider a finite set (xi)i[n]X=w(x_{i})_{i\in[n]}\subset X=\mathbb{R}^{w}, and its image (yi)i[n]Y=d(y_{i})_{i\in[n]}\subset Y=\mathbb{R}^{d} under some LL-Lipschitz map f:XYf:X\rightarrow Y. There is an algorithm running in O(nw+ndlogn/ϵ2)O(nw+nd\log n/\epsilon^{2}) time which returns, for any point zwz\in\mathbb{R}^{w} and precision parameter ϵ>0\epsilon>0, a point zdz^{\prime}\in\mathbb{R}^{d} satisfying, for all i[n]i\in[n],

zf(xi)2(1+ϵ)Lzxi2\displaystyle\left\|z^{\prime}-f(x_{i})\right\|^{2}\leq(1+\epsilon)L\left\|z-x_{i}\right\|^{2}

From this result we can obtain a claim comparable to Corollary 4.3:

Theorem B.2.

Let 𝒦M=(κ,M)\mathcal{K}_{M}=(\kappa,M) be a dd-dimensional shift-invariant KDE where κ\kappa is differentiable and non-increasing but not necessarily convex. Let ϵ\epsilon and γ\gamma (which depends on κ\kappa and ϵ\epsilon) be as in Theorem 3.1 and let Π\Pi be a random JL matrix as in Definition 2.2 with w=O(log((n+1)/δ)min(1,γ2))w=O\left(\frac{\log((n+1)/\delta)}{\min(1,\gamma^{2})}\right) rows. Let x~\tilde{x} be an approximate maximizer for 𝒦ΠM{\mathcal{K}}_{\Pi M} satisfying 𝒦ΠM(x~)(1α)maxxw𝒦ΠM(x){\mathcal{K}}_{\Pi M}(\tilde{x})\geq(1-\alpha)\max_{x\in\mathbb{R}^{w}}{\mathcal{K}}_{\Pi M}(x). If we run the algorithm of Theorem B.1 with X=ΠMX=\Pi M, Y=MY=M, z=x~z=\tilde{x}, and error parameter γ\gamma, then the method returns xdx^{\prime}\in\mathbb{R}^{d} satisfying:

𝒦M(x)(12ϵα)maxxd𝒦M(x).\displaystyle{\mathcal{K}}_{M}(x^{\prime})\geq(1-2\epsilon-\alpha)\max_{x\in\mathbb{R}^{d}}{\mathcal{K}}_{M}(x).
Proof.

For conciseness, we sketch the proof. As discussed, by Definition 2.2, ΠMM\Pi M\rightarrow M is a 1-Lipschitz map. So it follows that the xx^{\prime} returned by the algorithm of (Biess et al., 2019) satisfies for all mMm\in M,

xm2(1+γ)x~Πm2\displaystyle\left\|x^{\prime}-m\right\|^{2}\leq(1+\gamma)\left\|\tilde{x}-\Pi m\right\|^{2}

It follows that:

𝒦M(x)mMκ((1+γ)x~Πm2).\displaystyle\mathcal{K}_{M}(x^{\prime})\geq\sum_{m\in M}\kappa\left((1+\gamma)\left\|\tilde{x}-\Pi m\right\|^{2}\right).

By the same argument used in the proof of Theorem 3.1, we have that

mMκ((1+γ)x~Πm2)mM(1ϵ)κ(x~Πm2)=(1ϵ)𝒦ΠM(x~).\displaystyle\sum_{m\in M}\kappa\left((1+\gamma)\left\|\tilde{x}-\Pi m\right\|^{2}\right)\geq\sum_{m\in M}(1-\epsilon)\kappa\left(\left\|\tilde{x}-\Pi m\right\|^{2}\right)=(1-\epsilon)\mathcal{K}_{\Pi M}(\tilde{x}).

In turn, since Theorem 3.1 holds under the same conditions as Theorem B.2, we have:

𝒦ΠM(x~)(1α)maxx𝒦ΠM(x)(1α)(1ϵ)maxx𝒦M(x).\displaystyle\mathcal{K}_{\Pi M}(\tilde{x})\geq(1-\alpha)\max_{x}\mathcal{K}_{\Pi M}(x)\geq(1-\alpha)(1-\epsilon)\max_{x}\mathcal{K}_{M}(x).

Chaining the three inequalities above and noting that (1α)(1ϵ)2(12ϵα)(1-\alpha)(1-\epsilon)^{2}\geq(1-2\epsilon-\alpha) completes the proof. By rescaling ϵ\epsilon we can obtain precision equivalent to Corollary 4.3. ∎

Note that for the common relative-distance smooth kernels addressed in Theorem 1.1, we have that 1/γ=O(log(n/ϵ)/ϵ)1/\gamma=O({\log(n/\epsilon)}/{\epsilon}). So, the runtime of recovering a high-dimensional mode using the method of (Biess et al., 2019) is O(ndlog3(n)/ϵ2)O(nd\log^{3}(n)/\epsilon^{2}). This exceeds the O(nd)O(nd) runtime of the mean-shift method. However, in contrast to mean-shift, the method can be applied to non-convex kernels like generalized Gaussian kernels with α>1\alpha>1.
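To make the alternative pipeline concrete, the following sketch (ours) replaces the extension algorithm of (Biess et al., 2019) with a naive subgradient stand-in that searches for a point z′ approximately satisfying ‖z′−yᵢ‖₂² ≤ (1+γ)‖z−xᵢ‖₂² for all i. It is not their algorithm and carries no runtime guarantee; it only illustrates the feasibility problem being solved.

import numpy as np

# Naive stand-in (ours) for the one-point Kirszbraun extension step: minimize the worst
# constraint violation F(z') = max_i ( ||z' - y_i||^2 - (1 + gamma) * ||z - x_i||^2 )
# by subgradient descent. When the map x_i -> y_i is 1-Lipschitz, Kirszbraun's theorem
# guarantees that a point with F(z') <= 0 exists.
def extend_one_point(X, Y, z, gamma=0.1, iters=2000, lr=0.05):
    # X: n x w projected points, Y: n x d original points, z: length-w query point
    budgets = (1 + gamma) * np.sum((X - z) ** 2, axis=1)    # allowed squared distances
    z_prime = Y.mean(axis=0)                                # start at the centroid of Y
    for _ in range(iters):
        violations = np.sum((Y - z_prime) ** 2, axis=1) - budgets
        i = np.argmax(violations)                           # most violated constraint
        if violations[i] <= 0:
            break
        z_prime -= lr * 2 * (z_prime - Y[i])                # subgradient step on ||z' - y_i||^2
    return z_prime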