Dimensionality Reduction for General KDE Mode Finding
Abstract
Finding the mode of a high dimensional probability distribution is a fundamental algorithmic problem in statistics and data analysis. There has been particular interest in efficient methods for solving the problem when is represented as a mixture model or kernel density estimate, although few algorithmic results with worst-case approximation and runtime guarantees are known. In this work, we significantly generalize a result of (Lee et al., 2021) on mode approximation for Gaussian mixture models. We develop randomized dimensionality reduction methods for mixtures involving a broader class of kernels, including the popular logistic, sigmoid, and generalized Gaussian kernels. As in Lee et al.’s work, our dimensionality reduction results yield quasi-polynomial algorithms for mode finding with multiplicative accuracy for any . Moreover, when combined with gradient descent, they yield efficient practical heuristics for the problem. In addition to our positive results, we prove a hardness result for box kernels, showing that there is no polynomial time algorithm for finding the mode of a kernel density estimate, unless P = NP. Obtaining similar hardness results for kernels used in practice (like Gaussian or logistic kernels) is an interesting future direction.
1 Introduction
We consider the basic computational problem of finding the mode of a high dimensional probability distribution over . Specifically, if has probability density function (PDF) , our goal is to find any such that:
A natural setting for this problem is when is specified as a kernel density estimate (KDE) or mixture distribution (Scott, 2015; Silverman, 2018). In this setting, we are given a set of points and a non-negative kernel function and our PDF equals:
As is typically the case, we will assume that is shift-invariant and thus only depends on the difference between and , meaning that it can be reparameterized as . A classic example of a shift-invariant KDE is any mixture of Gaussians distribution, for which is taken to be the Gaussian kernel. Here is a normalizing constant. Kernel density estimates are widely used to approximate other distributions in a compact way (Botev et al., 2010; Kim & Scott, 2012), and they have been applied to applications ranging from image annotation (Yavlinsky et al., 2005), to medical data analysis (Sheikhpour et al., 2016), to outlier detection (Kamalov & Leung, 2020). The specific problem of finding the mode of a KDE has found applications in object tracking (Shen et al., 2007), super levelset computation (Phillips et al., 2015), typical object finding (Gasser et al., 1997), and more (Lee et al., 2021).
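To make the objective concrete, the following minimal sketch (Python with NumPy, assuming a Gaussian kernel profile and unit bandwidth) evaluates an unnormalized shift-invariant KDE at a query point; the normalizing constant is dropped since it does not affect where the maximum is attained.

```python
import numpy as np

def kde_value(P, y, g=lambda t: np.exp(-t)):
    """Unnormalized shift-invariant KDE value at y: sum_i g(||x_i - y||^2).

    P : (n, d) array of centers; y : (d,) query point.
    g : kernel profile as a function of the squared distance (Gaussian by default).
    """
    sq_dists = np.sum((P - y) ** 2, axis=1)   # squared distances from y to every center
    return float(np.sum(g(sq_dists)))

# Tiny example: for two centers at distance 1, the midpoint beats either center.
P = np.array([[0.0, 0.0], [1.0, 0.0]])
print(kde_value(P, np.array([0.5, 0.0])))     # ~1.558
print(kde_value(P, np.array([0.0, 0.0])))     # ~1.368
```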
1.1 Prior Work
Despite its many applications, the KDE mode finding problem presents a computational challenge in high dimensions. For any practically relevant kernel (e.g., Gaussian) there are no known algorithms with runtime polynomial in both and for KDEs on base points. This is even the case when we only want to find an -approximate mode for some , i.e., a point satisfying
There has been extensive work on heuristic local search methods like the well-known “mean-shift” algorithm (Carreira-Perpiñán, 2000, 2007), which can be viewed as a variant of gradient descent, and often works well in practice. However, these methods do not come with theoretical guarantees and can fail on natural problem instances.
While polynomial time methods are not known, for some kernels it is possible to provably solve the -approximate mode finding problem in quasi-polynomial time. For example, Shenmaier’s work on universal approximate centers for clustering can be used to reduce the problem to evaluating the quality of a quasi-polynomial number of candidate modes (Shenmaier, 2019). For the Gaussian kernel, the total runtime is for constant . Similar runtimes can be obtained by appealing to results on the approximate Carathéodory problem (Blum et al., 2019; Barman, 2015).
More recently, Lee et al. explore dimensionality reduction as an approach to obtaining quasi-polynomial time algorithms for KDE mode finding (Lee et al., 2021). Their work shows that, for the Gaussian kernel, any high-dimensional KDE instance can be reduced to a lower dimensional instance using randomized dimensionality reduction methods – specifically Johnson-Lindenstrauss projection. An approximate mode for the lower dimensional problem can then be found with a method that depends exponentially on the dimension , and finally, the low-dimensional solution can be “mapped back” to the high-dimensional space (methods that run in time exponential in are straightforward to obtain via discretization/brute-force search; see Section 5). Ultimately, the result in (Lee et al., 2021) allows all dependencies on to be replaced with terms that are polynomial in and . The conclusion is that the mode of a Gaussian KDE can be approximated to accuracy in time , where . The leading term is the cost of performing the dimensionality reduction.
In addition to nearly matching prior quasi-polynomial time methods in theory (e.g., Shenmaier’s approach), there are a number of benefits to an approach based on dimensionality reduction. For one, sketching directly reduces the space complexity of the mode finding problem, and vectors sketched with JL random projections can be useful in other downstream data analysis tasks. Another benefit is that dimensionality reduction can speed up even heuristic algorithms: instead of using a brute-force approach to solve the low-dimensional KDE instance, a practical alternative is to apply a local search method, like mean-shift, in the low-dimensional space. This approach sacrifices theoretical guarantees, but can lead to faster practical algorithms.
1.2 Our Results
The main contribution of our work is to generalize the dimensionality reduction results of (Lee et al., 2021) to a much broader class of kernels, beyond the Gaussian kernel studied in that work. In particular, we introduce a carefully defined class of kernels called “relative-distance smooth kernels”. This class includes the Gaussian kernel, as well as the sigmoid, logistic, and any generalized Gaussian kernel of the form for . See Definition 3 for more details. Our first result (Lemma 3.4) is that, for any relative-distance smooth kernel, we can approximate the value of the mode up to multiplicative error by solving a lower dimensional instance obtained by sketching the points in using a Johnson-Lindenstrauss random projection. The required dimension of the projection is , where is a constant depending on parameters of the kernel . For most commonly used relative-distance smooth kernels, including the Gaussian, logistic, and sigmoid kernels, . This leads to a dimensionality reduction that is completely independent of the original problem dimension and only depends polylogarithmically on the number of points in the KDE, .
Moreover, in Section 4, we show how to recover an approximate mode satisfying from the solution of the low-dimensional sketched problem. When the kernel satisfies an additional convexity property, recovery can be performed in time using a generalization of the mean-shift algorithm used in (Lee et al., 2021). When the kernel does not satisfy the property, we obtain a slightly slower method using a recent result of (Biess et al., 2019) on constructive Lipschitz extensions. One consequence of our general results is the following claim for a number of common kernels:
Theorem 1.1.
Let be a KDE on points in dimensions, where is a Gaussian, logistic, sigmoid, Cauchy (for the Cauchy kernel, we can actually obtain a better bound with dimension ; see Corollary 3.2), or generalized Gaussian kernel with parameter . Let be a random JL matrix with rows and let be any point such that . Given as input, Algorithm 2 runs in time and returns, with probability , a point satisfying:
Above, is the point set and is the low-dimensional KDE defined by and . Theorem 1.1 implies that an approximate high-dimensional mode can be found by (approximately) solving a much lower dimensional problem. The result exactly matches that of (Lee et al., 2021) in the Gaussian case.
When combined with a simple brute-force method for maximizing , Theorem 1.1 immediately yields a quasi-polynomial time algorithm for mode finding. Again we state a natural special case of this result, proven in Section 5.
Theorem 1.2.
Let be a KDE on points in dimensions, where is a Gaussian, logistic, sigmoid, Cauchy, or generalized Gaussian kernel with . There is an algorithm which finds a point satisfying:
in time. Here denotes for constant .
Interestingly, as in (Lee et al., 2021), the above result falls just short of providing a polynomial time algorithm: doing so would require improving the dependence in the exponent to . It is possible to achieve polynomial time by making additional assumptions. For example, if we assume that for some constant , then dependencies on can be replaced with using existing coreset methods (Lee et al., 2021; Phillips & Tai, 2018). However, the question still remains as to whether the general KDE mode finding problem can be solved in polynomial time for any natural kernel. Our final contribution is to take a step towards answering this question in the negative by relating the mode finding problem to the -clique problem, and showing an NP-hardness result for box kernels (defined in the next section). Formally, in Section 6, we prove:
Lemma 1.3.
The problem of computing a -approximate mode of a box kernel KDE is NP-hard.
Unfortunately, our lower bound does not extend to commonly used kernels like the Gaussian, logistic, or sigmoid kernels. Proving lower bounds (or finding polynomial time algorithms) for these kernels is a compelling future goal.
Paper Structure.
Section 2 contains notation and definitions. In Section 3 we provide our main dimensionality reduction results for approximating the objective value of the mode. Then, in Section 4, we show how to recover a high-dimensional mode from a low-dimensional one, providing different approaches depending on whether or not the kernel is convex. Section 5 outlines a brute-force method for finding an approximate mode in low dimensions. In Section 6 we show that the approximate mode finding problem is NP-hard for box kernels. Finally, we provide experimental results in Section 7, confirming that dimensionality reduction combined with a heuristic mode finding method yields a practical algorithm for a variety of kernels and data sets.
2 Preliminaries
Notation.
For our purposes, a kernel density estimate (KDE) is defined by a set of points (a.k.a. centers) and a non-negative, shift-invariant kernel function. All of the kernels discussed in this work are also radially symmetric. This means that we can rewrite the kernel function as a scalar function of the squared Euclidean distance . (We let denote the squared Euclidean norm: , where is the entry in the length vector .) Our KDE then has the form:
We further assume that is non-increasing, so satisfies for all . In the expression above, is a normalizing constant that only depends on . It is chosen to ensure that and thus is a probability density function. The above function is invariant to scaling , so to ease notation we further assume that . Note that since is non-increasing, we thus always have that . We write to denote the first-order derivative of (whenever it exists).
Many common kernels are radially symmetric and non-increasing, and thus fit the form described above (Silverman, 2018; Altman, 1992; Cleveland & Devlin, 1988). We list a few:
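For concreteness, the sketch below writes several standard kernels as profile functions of the squared distance t, normalized so that the value at t = 0 is 1 as above. The exact parameterizations of the logistic, sigmoid, and generalized Gaussian profiles are common textbook forms and should be read as assumptions rather than this paper's precise definitions.

```python
import numpy as np

# Kernel profiles g(t), where t is the *squared* Euclidean distance and g(0) = 1.
def gaussian(t):
    return np.exp(-np.asarray(t))

def generalized_gaussian(t, p=1.5):
    # assumed form exp(-||x - y||^p) = exp(-t^{p/2})
    return np.exp(-np.asarray(t) ** (p / 2.0))

def cauchy(t):
    return 1.0 / (1.0 + np.asarray(t))

def logistic(t):
    # assumed form sech^2(sqrt(t)/2), rescaled so that g(0) = 1
    u = np.sqrt(np.asarray(t))
    return 4.0 / (np.exp(u) + 2.0 + np.exp(-u))

def sigmoid(t):
    # assumed form sech(sqrt(t)), rescaled so that g(0) = 1
    u = np.sqrt(np.asarray(t))
    return 2.0 / (np.exp(u) + np.exp(-u))

def box(t):
    # box kernel from Section 6: 1 within distance 1 of the center, 0 otherwise
    return np.where(np.asarray(t) <= 1.0, 1.0, 0.0)
```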
We are interested in finding a value for which maximizes or approximately maximizes the kernel density estimate . Again since the problem is invariant to positive scaling, we will consider the problem of maximizing the unnormalized KDE, which we denote by :
Our general dimensionality reduction result depends on a parameter of that we call the “critical radius”. For common kernels we later show how to bound this parameter to obtain specific dimensionality reduction results.
Definition 2.1 (-critical radius, ).
For any non-increasing kernel function , the -critical radius is the smallest value of such that .
Note that for any , we have that . The value of and will be especially important in our proofs. Specifically, since is assumed to have , it is easy to check that any mode for must lie within squared distance from at least one point in , a region which we will call the critical area. We will use this fact.
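Because the profile is non-increasing, the critical radius can be found numerically by bisection. The sketch below assumes the critical radius is the smallest squared distance at which the profile drops to the threshold (e.g., 1/m); the bracket-growing step and tolerance are illustrative choices.

```python
import numpy as np

def critical_radius(g, tau, hi=1.0, tol=1e-10):
    """Smallest t with g(t) <= tau, for a continuous non-increasing profile g with g(0) = 1."""
    while g(hi) > tau:          # grow the bracket until it contains the crossing point
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:        # bisect: g is non-increasing, so the crossing is unique
        mid = 0.5 * (lo + hi)
        if g(mid) <= tau:
            hi = mid
        else:
            lo = mid
    return hi

# For the Gaussian profile g(t) = exp(-t) the answer is ln(m) exactly.
m = 1000
print(critical_radius(lambda t: np.exp(-t), 1.0 / m), np.log(m))
```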
Johnson-Lindenstrauss Lemma.
Our results leverage the Johnson-Lindenstrauss (JL) lemma, which shows that a set of high dimensional points can be mapped into a space of much lower dimension in such a way that distances between the points are nearly preserved. We use the standard variant of the lemma where the mapping is an easy to compute random linear transformation (Achlioptas, 2001; Dasgupta & Gupta, 2003). Specifically, we are interested in random transformations satisfying the following guarantee:
Definition 2.2 (-Johnson-Lindenstrauss Guarantee).
A randomly selected matrix satisfies the -JL guarantee for positive error parameter , if for any data points , with probability ,
(1)
for all pairs simultaneously.
Note that we require one-sided error: most statements of the JL guarantee have a factor on the left side of the inequality. This is easily removed by scaling by . It is well known that Definition 2.2 is satisfied by a properly scaled i.i.d. random Gaussian or random matrix with
rows, and this is tight (Larsen & Nelson, 2017). General sub-Gaussian random matrices also work, as well as constructions that admit faster computation of (Kane & Nelson, 2014; Ailon & Chazelle, 2009).
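The sketch below builds a standard dense Gaussian JL matrix and empirically checks the pairwise distance distortion on random data; it illustrates the construction but is not a proof of Definition 2.2, and the one-sided variant would additionally rescale the matrix slightly.

```python
import numpy as np

rng = np.random.default_rng(0)

def jl_matrix(k, d, rng=rng):
    """Dense Gaussian JL matrix with k rows and i.i.d. N(0, 1/k) entries."""
    return rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))

# Empirical distortion check on random data (an illustration, not a proof).
n, d, k = 200, 1000, 100
X = rng.normal(size=(n, d))
Y = X @ jl_matrix(k, d).T                       # sketched points, shape (n, k)

i, j = np.triu_indices(n, 1)                    # all pairs of points
orig = np.linalg.norm(X[i] - X[j], axis=1)
sketched = np.linalg.norm(Y[i] - Y[j], axis=1)
print("max relative distance distortion:", np.max(np.abs(sketched / orig - 1.0)))
```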
Kirszbraun Extension Theorem.
We also rely on a classic result of (Kirszbraun, 1934). Let and be Hilbert spaces. Kirszbraun’s theorem states that if is a subset of , and is a Lipschitz-continuous map, then there is a Lipschitz-continuous map that extends (i.e., for all ) and has the same Lipschitz constant. Formally, when applied to Euclidean spaces and we have:
Fact 2.3.
(Kirszbraun Extension Theorem). For any , let be an L-Lipschitz function. That is, , . Then, there always exists some function such that:
1. for all ,
2. is also L-Lipschitz, that is, for all , .
3 Dimensionality Reduction for Approximating the Mode Value
In this section, we show that using a JL random projection, we can reduce the problem of approximating the value of the mode of a KDE in dimensions – i.e., – to the problem of approximating the value of the mode for a KDE in dimensions, where depends only on , , and the desired approximation quality. This problem of recovering the mode value is a prerequisite for the harder problem of recovering the location of an approximate mode (i.e., a point such that ), which is addressed in Section 4.
We begin with an analysis for JL projections that bounds based on generic properties of . Then, in Section 3.1 we analyze these properties for specific kernels of interest, and prove that is in fact small for these kernels – specifically, it depends just polylogarithmically on and polynomially on the approximation factor . Our general result follows:
Theorem 3.1.
Let be a -dimensional KDE on a differentiable kernel as defined in Section 2 and let be an approximation factor. Let and let . Note that since is assumed to be non-increasing. We can assume that . Let . Then with probability , for any satisfying the -JL guarantee, we have:
(2)
Recall that a random with rows will satisfy the required -JL guarantee.
Note that in the theorem statement above, denotes the point set with dimension reduced by multiplying each point in the set by . Our proof of Theorem 3.1 is included in Appendix A. It leverages Kirszbraun’s Extension Theorem, and follows along the same lines as (Lee et al., 2021). However, we need to more carefully track the effect of properties of the kernel function , since we do not assume that it has the simple form of a Gaussian kernel.
With Theorem 3.1 in place, we can apply it to any non-increasing differentiable kernel to obtain a dimensionality reduction result: we just need to compute a lower bound . For some kernels we can do so directly. For example, consider the Cauchy kernel, . It can be shown that we can pick (since for all ). Plugging into Theorem 3.1 we obtain:
Corollary 3.2.
Let be a KDE and, for any , let be a random JL matrix with rows. If is a Cauchy kernel, then with probability ,
In the following subsection we will describe a broader class of kernels for which we can also obtain good dimensionality reduction results, but for which bounding is a bit more challenging.
3.1 Relative-Distance Smooth Kernels
Specifically, we consider a broad class of kernels that includes the Gaussian kernel:
Definition 3.3 (Relative-distance smooth kernel).
A non-increasing differentiable kernel is relative-distance smooth if there exist constants such that
for all .
In addition to the Gaussian kernel, this class includes other kernels commonly used in practice, like the logistic, sigmoid, and generalized Gaussian kernels:
A few common non-increasing kernels, including the rational quadratic kernel, are not relative-distance smooth. Our main result is that for any relative-distance smooth kernel, we can sketch the KDE to a dimension which depends only polylogarithmically on and quadratically on :
Lemma 3.4.
Let be a KDE for a relative-distance smooth kernel with parameters . There is a fixed constant such that if , then with probability , for any satisfying the -JL guarantee, Equation 2 holds. To obtain this JL guarantee, it suffices to take to be a random JL matrix with rows.
Lemma 3.4 is proven in Appendix A. It uses an intermediate result that bounds the -critical radius for any relative-distance smooth kernel, which is required to invoke Theorem 3.1. Interestingly, the polylogarithmic factor in Lemma 3.4 only depends on the ratio of the parameters and of the relative-distance smooth kernel . For all of the example kernels discussed above, this ratio equals , so we obtain a dimensionality reduction result exactly matching (Lee et al., 2021) for the Gaussian kernel:
Corollary 3.5.
Let be a KDE and, for any , let be a random JL matrix with rows. If is a Gaussian, logistic, sigmoid kernel, or generalized Gaussian kernel, then with probability ,
4 Recovering an Approximate Mode in High Dimensions
In Section 3, we discussed how to convert a high dimensional KDE into a lower dimensional KDE whose mode has an approximately equal value. However, in applications, we are typically interested in computing a point in the high-dimensional space whose value is approximately equal to the value of the mode. I.e., using our dimensionality reduced dataset, we want to find some such that:
We present two approaches for doing so. The first is based on Kirszbraun’s extension theorem and the widely used mean-shift heuristic. It extends the approach of (Lee et al., 2021) to a wider class of kernels – specifically to any convex and non-increasing kernel . This class contains most of the relative-distance smooth kernels discussed in Section 3.1, including the Gaussian, sigmoid, and logistic kernels, and generalized Gaussian kernels when . It also includes common kernels like the Cauchy kernel, for which we have shown a strong dimensionality reduction result, and the Epanechnikov, biweight, and triweight kernels. Recall that we define so that represents the squared Euclidean distance between two points; we specifically need as defined in this way to be convex.
For non-convex kernels, we briefly discuss a second approach in Appendix B based on recent work on explicit one point extensions of Lipschitz functions (Biess et al., 2019). While less computationally efficient, this approach works for any non-increasing . Common examples of non-convex kernels include the tricube kernel (Altman, 1992), (Comaniciu, 2000), or any generalized Gaussian kernel with .
4.1 Mean-shift for Convex Kernels
Based on ideas proposed by Fukunaga and Hostetler (Fukunaga & Hostetler, 1975), the mean-shift method is a commonly used heuristic for finding an approximate mode (Cheng, 1995). The idea behind the algorithm is to iteratively refine a guess for the mode. At each update, a new guess , is obtained by computing a weighted average of all points in that define the KDE. Points that are closer to the previous guess are included with higher weight than points that are further. The exact choice of weights depends on the first derivative , where is the distance from the current mode to a point in . For any non-increasing, convex kernel, is non-positive and decreasing in magnitude – i.e., is largest for close to , which ensures that points closest to the current guess for the mode are weighted highest when computing the new guess. (Note that for the Gaussian kernel, , so , and the method presented here is equivalent to the version of mean-shift used in prior work on dimensionality reduction for mode finding (Lee et al., 2021).) We include pseudocode for mean-shift as Algorithm 1. The method can alternatively be viewed as an instantiation of gradient ascent for the KDE mode objective with a specifically chosen step size – we do not discuss details here.
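The pseudocode for Algorithm 1 is not reproduced here, so the following is a minimal illustrative sketch of the update just described: each new guess is a weighted average of the centers, with weights given by the magnitude of the profile's derivative at the current squared distances. For the Gaussian profile this reduces to the classic Gaussian mean-shift update; the stopping rule and iteration cap are assumptions.

```python
import numpy as np

def mean_shift(P, y0, g_prime, iters=100, tol=1e-9):
    """Mean-shift for a KDE with a convex, non-increasing profile g.

    P       : (n, d) centers.
    y0      : (d,) starting guess.
    g_prime : derivative g'(t) as a function of the squared distance t (non-positive).
    """
    y = np.asarray(y0, dtype=float)
    for _ in range(iters):
        sq_dists = np.sum((P - y) ** 2, axis=1)
        w = -g_prime(sq_dists)                  # non-negative; largest for nearby centers
        total = w.sum()
        if total == 0.0:                        # flat region: nothing to move towards
            break
        y_new = (w[:, None] * P).sum(axis=0) / total
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# Gaussian profile: g(t) = exp(-t), so g'(t) = -exp(-t).
P = np.random.default_rng(1).normal(size=(500, 20))
y_hat = mean_shift(P, P.mean(axis=0), lambda t: -np.exp(-t))
```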
A powerful property of the mean-shift algorithm is that it always converges for kernels that are non-increasing and convex. In fact, it is known to provide a monotonically improving solution. Specifically:
Fact 4.1 (Comaniciu & Meer (2002)).
Let be an arbitrary starting point and let be the resulting iterates of Algorithm 1 run on point set with kernel . If is convex and non-increasing, then for any :
In Appendix A, we use this fact to prove that with a (modified) mean-shift method, run for only a single iteration, we can translate any approximate solution for a dimensionality reduced KDE problem to a solution for the original high dimensional problem. Formally, we prove the following:
Theorem 4.2.
Let be a set of points in and let be a KDE defined by a shift-invariant, non-increasing, and convex kernel function . Let . Let be a JL matrix as in Definition 2.2 and assume that is chosen large enough so that for all in the set ,
(3)
Let be an approximate maximizer for satisfying . Then if we choose , we have:
Note that above is set using a single iteration of what looks like mean-shift. However, instead of using weights based on the distances of points in to a previous guess for a high-dimensional mode, we use distances between the points in our low-dimensional space to the approximate low-dimensional mode, . Also note that Theorem 4.2 is independent of exactly how is computed – it could be computed using brute-force search, using an approximation algorithm tailored to low-dimensional problems, as in (Lee et al., 2021), or using a heuristic like mean-shift itself.
For convex kernels, Theorem 4.2 implies a strengthening of Theorem 3.1 that allows for recovering an approximate mode, not just the value of the mode. Formally, the combined dimensionality reduction and recovery procedure we propose is included as Algorithm 2, and we have the following result on its accuracy:
Corollary 4.3.
Let be a -dimensional shift-invariant KDE as defined in Section 2 and let and (which depends on and ) be as in Theorem 3.1. If is differentiable, non-increasing, and convex, then Algorithm 2 run with parameters and returns satisfying:
Note that Line 4 in Algorithm 2 can be evaluated in time. So our headline result, Theorem 1.1, follows as a direct corollary.
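A minimal end-to-end sketch of the kind of project, solve, recover procedure that Algorithm 2 describes: sketch the centers with a random JL matrix, find an approximate low-dimensional mode by any available method (here a crude stand-in rather than the brute-force search of Section 5), and recover a high-dimensional point with a single weighted-average step whose weights come from the low-dimensional distances. The function names and the toy low-dimensional solver are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def recover_high_dim_mode(P, P_low, z, g_prime):
    """One mean-shift-style step: weights come from low-dimensional distances to z,
    but the average is taken over the original high-dimensional points."""
    sq_dists = np.sum((P_low - z) ** 2, axis=1)
    w = -g_prime(sq_dists)                                  # non-negative for non-increasing g
    return (w[:, None] * P).sum(axis=0) / w.sum()

def approximate_mode(P, k, g_prime, low_dim_solver, rng=None):
    """Sketch the centers, solve the low-dimensional problem, recover a high-dim point."""
    rng = rng or np.random.default_rng(0)
    Pi = rng.normal(scale=1.0 / np.sqrt(k), size=(k, P.shape[1]))   # random JL matrix
    P_low = P @ Pi.T                                                # sketched centers
    z = low_dim_solver(P_low)                                       # approximate low-dim mode
    return recover_high_dim_mode(P, P_low, z, g_prime)

# Example usage with a crude low-dimensional "solver": return the best sketched center.
g = lambda t: np.exp(-t)                 # Gaussian profile
g_prime = lambda t: -np.exp(-t)
def best_center(P_low):
    vals = [np.sum(g(np.sum((P_low - z) ** 2, axis=1))) for z in P_low]
    return P_low[int(np.argmax(vals))]

P = np.random.default_rng(2).normal(size=(300, 200))
x_hat = approximate_mode(P, k=25, g_prime=g_prime, low_dim_solver=best_center)
```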
5 Solving the Low-Dimensional Problem
We next discuss a simple brute-force search method for approximate mode finding for any KDE with a continuous kernel function . The method has an exponential runtime dependency on the dimension, so its use for high-dimensional problems is limited, but combined with the dimensionality reduction techniques from Section 3 and the mode recovery techniques from Section 4, it yields a quasi-polynomial mode finding algorithm for a large class of kernels.
Recall that the mode of a KDE with must lie within its critical area, i.e. in a ball of squared radius around one of the points in (where denotes the -critical radius). For any we define a finite -covering to be a finite set of points such that, for every point in the critical area of , there exists a such that . Formally:
Lemma 5.1.
Given a KDE in with , and parameter , let and let be a set that contains all points of the form
where , , and . Above, are the canonical basis vectors of . Then for any point in a -ball surrounding one of the points in , there exists a point such that . Moreover, we have that .
By checking every point in and returning one that maximizes , we obtain the following result on finding an approximate mode, which is proven in Appendix A:
Theorem 5.2.
Given a KDE in with and a precision parameter , let and let be at most the largest number such that for all . Then we can find an -approximate mode in . In particular, if , , and for some constant , then we can find an -approximate mode in quasi-polynomial time in the number of data points .
Our headline result, Theorem 1.2, follows by combining the dimensionality reduction guarantee of Lemma 3.4 with the observation that for , choosing satisfies the requirement of Theorem 5.2 for any relative-distance smooth kernel with parameters . Moreover, as established in Lemma A.1, , so we have that the runtime in Theorem 5.2 is for constant . The claim in Theorem 1.2 for the Cauchy kernel follows by noting that and we can take in Theorem 5.2. Finally, note that Theorem 1.2 also includes the polynomial time cost of multiplying the original data set by a random JL matrix.
Overall, we conclude that one can compute an approximate mode in quasi-polynomial time for the Cauchy kernel, or any KDE on a relative-distance smooth kernel, and in particular the approximate mode finding problem for KDEs on Gaussian, logistic, sigmoid, or generalized Gaussian kernels can be solved in quasi-polynomial time.
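The following is a minimal sketch of the brute-force idea: enumerate an axis-aligned grid of spacing delta around every center (out to squared radius r) and return the grid point with the largest unnormalized KDE value. The grid construction and the choice of delta and r are simplified stand-ins for the covering of Lemma 5.1, whose exact parameters depend on the kernel's Lipschitz constant and the critical radius.

```python
import itertools
import numpy as np

def brute_force_mode(P, g, r, delta):
    """Exhaustive grid search within squared distance r of each center, grid spacing delta.

    Runs in time exponential in the dimension d = P.shape[1]; intended for the
    sketched, low-dimensional problem rather than the original one.
    """
    d = P.shape[1]
    steps = np.arange(-np.sqrt(r), np.sqrt(r) + delta, delta)   # covers the enclosing cube
    best_val, best_y = -np.inf, None
    for c in P:
        for offset in itertools.product(steps, repeat=d):
            y = c + np.array(offset)
            val = np.sum(g(np.sum((P - y) ** 2, axis=1)))
            if val > best_val:
                best_val, best_y = val, y
    return best_y, best_val

# Tiny 2-dimensional example with the Gaussian profile.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.8]])
y, v = brute_force_mode(P, lambda t: np.exp(-t), r=2.0, delta=0.25)
```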
6 Hardness Results
The results from the previous sections place the approximate mode finding problem in quasi-polynomial time for a large class of kernels. The question arises whether we can do much better; in this section, we provide some preliminary negative evidence for this possibility. Specifically, we prove NP-hardness of finding an approximate mode of a box kernel KDE, where we recall that this kernel takes the form for and otherwise. Our hardness result is based on the hardness of the -clique problem:
Problem 6.0 (-clique).
Given a -regular graph and an integer , does have a complete -vertex subgraph?
The -clique problem above is known to be NP-hard when is a parameter of the input. We show how to reduce this problem to KDE mode finding using a reduction inspired by work of Shenmaier on the -enclosing ball problem (Shenmaier, 2015). We start by creating a point set given an input to the -clique problem. Specifically, we embed in as follows: let be the set of rows of the incidence matrix of , i.e. the matrix such that if is an edge adjacent to the node and otherwise (Shenmaier, 2015). See Figure 1 for an example.
[Figure 1: Example of the incidence-matrix embedding of a graph.]
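A minimal sketch of the embedding just described: each vertex of the graph is mapped to the corresponding row of the incidence matrix, a 0/1 vector with a 1 in the coordinate of every edge touching that vertex. The helper below is illustrative; the edge ordering is arbitrary.

```python
import numpy as np

def incidence_rows(num_vertices, edges):
    """Rows of the incidence matrix: row v has a 1 in coordinate e iff edge e touches vertex v."""
    X = np.zeros((num_vertices, len(edges)))
    for e, (u, v) in enumerate(edges):
        X[u, e] = 1.0
        X[v, e] = 1.0
    return X

# Example: K4, a 3-regular graph. Here any two rows differ in exactly 4 coordinates,
# since each pair of (adjacent) vertices shares exactly one incident edge.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
X = incidence_rows(4, edges)        # shape (4, 6)
```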
We will base our hardness result on the following lemma:
Lemma 6.1 (Shenmaier (2015)).
Given a -regular graph and integer , let be defined as above. Let and let be the radius of the smallest ball containing points in . Then if there is a -clique in , and otherwise.
By rescaling we can show NP-hardness of the KDE mode finding problem for box kernels:
Theorem 6.2.
The problem of computing a -approximate mode of a box kernel KDE is NP-hard.
Proof.
The proof follows almost directly from Lemma 6.1. Note that the value of the mode of a box kernel KDE is given by the largest number of centers in a ball of radius 1. Let be an instance of the -clique problem, and let be the set of rows of the incidence matrix of as described above. Now let . From the lemma, we know that there is a ball of radius containing points if has a -clique, so . On the other hand, if does not have a -clique then every ball of radius contains at most points, i.e., . So, an approximation algorithm with error can distinguish between the two cases. Hence, the (-approximate) mode finding problem for box kernel KDEs is at least as hard as the -clique problem when . ∎
While it provides an initial result and hints at why the mode finding problem might be challenging, the above hardness result leaves a number of open questions. First off, it does not rule out a constant factor approximation, or a method whose dependence on the approximation parameter is exponential (as in our quasi-polynomial time methods). Moreover, the result does not apply to kernels like the Gaussian kernel – it strongly requires that the value of the box kernel differs significantly between and . Proving stronger hardness of approximation for the box kernel, or any hardness for kernels used in practice (like relative-distance smooth kernels), is a promising future direction.
7 Experiments
We support our theoretical results with experiments on two datasets, MNIST (60000 data points, 784 dimensions) and Text8 (71290 data points, 300 dimensions). We use both the Gaussian and generalized Gaussian kernels with a variety of different bandwidths, . A bandwidth of means that the kernel function as defined in Section 2 was applied to values . In general, a large leads to a larger mode value. It also leads to a smoother KDE, which is intuitively easier to maximize. We chose values of that lead to substantially varying mode values to check the performance of our method across a variety of optimization surfaces.
Since these are high-dimensional datasets, it is not computationally feasible to find an exact mode to compare against. Instead, we obtain a baseline mode value by running the mean-shift heuristic (gradient descent) for 100 iterations, with 60 randomly chosen starting points. To avoid convergence to local optima at KDE centers, these starting points were chosen to be random linear combinations of either all dataset points, or a random pair of points in the data set. The best mode value found was taken as a baseline.
After establishing a baseline, we applied JL dimensionality reduction to each data set and kernel for a variety of sketching dimensions, . Again, for efficiency, we use mean-shift to find an approximate low-dimensional mode, instead of the brute-force search method from Section 5. We ran for 10 iterations with 30 random restarts, chosen as described above. To recover a high-dimensional mode from our approximate low-dimensional mode, we use Algorithm 2, since the kernels tested are convex. For each dimension , we ran trials and report the mean and standard deviation of the KDE value of our approximate mode. Results are included in Figures 2-5. Note that, for visual clarity, the y-axis in these figures does not start at zero.
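A minimal sketch of the initialization scheme described above: starting points are random convex combinations of either the whole data set or a random pair of points, which avoids starting exactly at a KDE center. The alternation between the two schemes and the parameter names are illustrative choices.

```python
import numpy as np

def random_starts(P, num_starts, rng=None):
    """Starting points for mean-shift: random mixtures of all points or of a random pair."""
    rng = rng or np.random.default_rng(0)
    n, _ = P.shape
    starts = []
    for s in range(num_starts):
        if s % 2 == 0:                                   # convex combination of the whole data set
            w = rng.random(n)
            starts.append((w / w.sum()) @ P)
        else:                                            # convex combination of a random pair
            i, j = rng.choice(n, size=2, replace=False)
            lam = rng.random()
            starts.append(lam * P[i] + (1.0 - lam) * P[j])
    return np.array(starts)
```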
[Figures 2–5: KDE value of the recovered approximate mode as a function of the sketching dimension, for MNIST and Text8 with Gaussian and generalized Gaussian kernels at several bandwidths; means and standard deviations over repeated trials.]
As apparent from the plots, our Johnson-Lindenstrauss dimensionality reduction approach combined with the mean-shift heuristic performs very well overall. In all cases, it was able to recover an approximate mode with value close to the baseline with sketching dimension . As expected, performance improves with increasing sketching dimension.
8 Acknowledgements
This work was partially funded through NSF Award No. 2045590. Cas Widdershoven’s work has been partially funded through the CAS Project for Young Scientists in Basic Research, Grant No. YSBR-040.
References
- Achlioptas (2001) Achlioptas, D. Database-friendly random projections. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281, 2001.
- Ailon & Chazelle (2009) Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., pp. 302–322, 2009.
- Altman (1992) Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
- Barman (2015) Barman, S. Approximating Nash equilibria and dense bipartite subgraphs via an approximate version of Carathéodory’s theorem. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 361–369, 2015.
- Biess et al. (2019) Biess, A., Kontorovich, A., Makarychev, Y., and Zaichyk, H. Regression via Kirszbraun extension with applications to imitation learning. ArXiv Preprint, abs/1905.11930, 2019. URL http://arxiv.org/abs/1905.11930.
- Blum et al. (2019) Blum, A., Har-Peled, S., and Raichel, B. Sparse approximation via generating point sets. ACM Trans. Algorithms, 15(3), 6 2019.
- Botev et al. (2010) Botev, Z. I., Grotowski, J. F., and Kroese, D. P. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916–2957, 2010.
- Carreira-Perpiñán (2000) Carreira-Perpiñán, M. A. Mode-finding for mixtures of Gaussian distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1318–1323, 2000.
- Carreira-Perpiñán (2007) Carreira-Perpiñán, M. A. Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:767–776, 2007.
- Cheng (1995) Cheng, Y. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
- Cleveland & Devlin (1988) Cleveland, W. S. and Devlin, S. J. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.
- Comaniciu & Meer (2002) Comaniciu, D. and Meer, P. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
- Comaniciu (2000) Comaniciu, D. I. Nonparametric robust methods for computer vision. PhD thesis, 2000.
- Dasgupta & Gupta (2003) Dasgupta, S. and Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60–65, 1 2003.
- Fukunaga & Hostetler (1975) Fukunaga, K. and Hostetler, L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.
- Gasser et al. (1997) Gasser, T., Hall, P., and Presnell, B. Nonparametric estimation of the mode of a distribution of random curves. Journal of the Royal Statistical Society: Series B, 60:681–691, 1997.
- Kamalov & Leung (2020) Kamalov, F. and Leung, H. H. Outlier detection in high dimensional data. Journal of Information & Knowledge Management, 19(01), 2020.
- Kane & Nelson (2014) Kane, D. M. and Nelson, J. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM), 61(1):1–23, 2014.
- Kim & Scott (2012) Kim, J. and Scott, C. D. Robust kernel density estimation. The Journal of Machine Learning Research, 13(1):2529–2565, 2012.
- Kirszbraun (1934) Kirszbraun, M. Über die zusammenziehende und Lipschitzsche Transformationen. Fundamenta Mathematicae, 22(1):77–108, 1934.
- Larsen & Nelson (2017) Larsen, K. G. and Nelson, J. Optimality of the Johnson-Lindenstrauss lemma. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 633–638, 2017.
- Lee et al. (2021) Lee, J. C., Li, J., Musco, C., Phillips, J. M., and Tai, W. M. Finding an approximate mode of a kernel density estimate. In 29th Annual European Symposium on Algorithms (ESA 2021), volume 204, pp. 61:1–61:19, 2021.
- Phillips & Tai (2018) Phillips, J. M. and Tai, W. M. Near-optimal coresets of kernel density estimates. In 34th International Symposium on Computational Geometry, 2018.
- Phillips et al. (2015) Phillips, J. M., Wang, B., and Zheng, Y. Geometric inference on kernel density estimates. In International Symposium on Computational Geometry, 2015.
- Scott (2015) Scott, D. W. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
- Sheikhpour et al. (2016) Sheikhpour, R., Sarram, M. A., and Sheikhpour, R. Particle swarm optimization for bandwidth determination and feature selection of kernel density estimation based classifiers in diagnosis of breast cancer. Applied Soft Computing, 40:113–131, 2016.
- Shen et al. (2007) Shen, C., Brooks, M. J., and van den Hengel, A. Fast global kernel density mode seeking: Applications to localization and tracking. IEEE Transactions on Image Processing, 16:1457 – 1469, 2007.
- Shenmaier (2015) Shenmaier, V. Complexity and approximation of the smallest k-enclosing ball problem. European Journal of Combinatorics, 48:81–87, 2015.
- Shenmaier (2019) Shenmaier, V. A structural theorem for center-based clustering in high-dimensional Euclidean space. In Machine Learning, Optimization, and Data Science, pp. 284–295. Springer International Publishing, 2019.
- Silverman (2018) Silverman, B. W. Density estimation for statistics and data analysis. Routledge, 2018.
- Yavlinsky et al. (2005) Yavlinsky, A., Schofield, E., and Rüger, S. Automated image annotation using global features and robust nonparametric density estimation. In International Conference on Image and Video Retrieval, pp. 507–517, 2005.
Appendix A Additional Proofs
A.1 Proofs for Section 3
Theorem 3.1 (restated).
Proof.
First recall the definitions of and , which are just fixed positive scalings of and . It suffices to prove that:
(4)
To prove (4) we will apply the guarantee of Definition 2.2 to the points in , where . This guarantee ensures that with probability , for all in this set, where is the JL matrix from the theorem.
We first prove the right hand side of (4) using an argument identical to the proof of Lemma 10 from (Lee et al., 2021). Consider the set of points that contains for all . Let be a map from each point in this set to the corresponding point in . Since for all as guaranteed above, we have that is 1-Lipschitz. From Kirszbraun’s theorem (Fact 2.3) it follows that there is a function which agrees with on inputs in and satisfies for all . So for any , there is some such that, for all ,
The right hand side of (4) then follows: there must be some point such that for all , where . Overall we have:
In the second to last inequality we used that is non-increasing.
We next prove the left hand side of (4). We first have:
where . Applying the JL guarantee to the points in , we have that for all , . So plugging into the equation above, we have:
We can then bound:
The last inequality follows from the fact that is non-increasing, so is negative or zero and . Invoking our definition of and choice of we can continue:
Note that in the second to last line we invoked the definition of . Specifically, we used that, for any with , . In the last line we use that . ∎
Lemma A.1.
Let be a KDE for a point set with cardinality and relative-distance smooth kernel with parameters . Then for any , for a fixed constant that depends on and .
Proof.
Since is positive, non-increasing, and we can write for some positive, non-decreasing function with . We have and thus . Writing , we will upper bound by lower bounding . Specifically, we have:
It follows that . We want to upper bound the smallest such that . Let be a sufficiently large constant so that:
Then it suffices to pick to ensure that . ∎
Lemma 3.4 (restated).
Proof.
With Lemma A.1 in place, our main result for relative-distance smooth kernels follows directly. By Lemma A.1, . And since is relative-distance smooth, we have that:
for sufficiently large constant . Let . Invoking Theorem 3.1, we require that . ∎
A.2 Proofs for Section 4
Theorem 4.2 (restated).
Proof.
We will show that:
(5)
where is the approximate maximizer of , as defined in the theorem statement. If we can prove (5) then the theorem follows by the following chain of inequalities:
To prove (5), we follow a similar approach to (Lee et al., 2021). Since satisfies (3), the function mapping is 1-Lipschitz. Accordingly, by Kirszbraun’s Extension Theorem (Fact 2.3), there is some function that agrees with on inputs from and remains 1-Lipschitz. It follows that, if we apply to then for all ,
In other words, for our approximate low-dimensional mode , there is a high-dimensional point that is closer to all points in than is to the points in . In fact, by extending all points by one extra dimension, we can obtain an exact equality. In particular, let equal on its first coordinates, and on its last coordinate. For each let equal on its first coordinates, and on its last coordinate. Let denote the set of these transformed points. First observe that for all
(6)
Accordingly, as defined in the theorem statement is exactly equivalent to the first entries of the dimensional vector that would be obtained from running one iteration of mean-shift on using point set . Call this dimensional vector . By Fact 4.1, we have that:
The last equality follows from (6). Finally, for any non-increasing kernel we have that:
because for all . This is simply because and are equal to and , but with their last entry removed, so they can only be closer together. Combining the previous two inequalities proves (5), which establishes the theorem. ∎
Corollary 4.3 (restated).
Proof.
Corollary 4.3 immediately follows by combining Theorem 3.1 with Theorem 4.2. In particular, if is chosen with rows (as in Algorithm 2) then with probability , we have that both (2) and (3) hold with probability , which are the only conditions needed for Theorem 4.2 to hold. ∎
A.3 Proofs for Section 5
Lemma 5.1 (restated).
Proof.
The second claim on the size of is immediate. For the first claim, note that can be written as with and . Let . Then we have
∎
Theorem 5.2 (restated).
Proof.
Since there always exists a mode in the critical area of , we can use Lemma 5.1 to find a point at most away from a mode of in . Then we have
∎
A.4 Analysis for Relative-Distance Smooth Kernels
Let . We will prove that for any relative-distance smooth kernel with parameters , and , we have for all as long as:
By the definition of relative-distance smooth kernels, we have that . Hence,
The last step follows from the fact that is non-increasing in . Since , this simplifies to
So, we need to show that . We consider two cases:
Case 1: When is , consider the function . This function is non-increasing; indeed, . Hence, we have that
We can pick .
Case 2: On the other hand, when , the function is convex, so we have:
In this case, we can choose .
Hence, picking ensures that , and thus that , as required.
Appendix B Mode Recovery for Non-convex Kernels
In Section 4.1, we show that the mean-shift method can rapidly recover an approximate mode for any convex, non-increasing kernel from an approximation to the JL reduced problem. In this section, we briefly comment on an alternative method that can also handle non-convex kernels, albeit at the cost of worse runtime. Specifically, it is possible to leverage a recent result from (Biess et al., 2019) on an algorithmic version of the Kirszbraun extension theory. This work provides an algorithm for explicitly extending a function that is Lipschitz on some fixed set of points to one additional point. The main result follows:
Theorem B.1 ((Biess et al., 2019)).
Consider a finite set , and its image under some -Lipschitz map . There is an algorithm running in time which returns, for any point , and a precision parameter , a point satisfying for all ,
From this result we can obtain a claim comparable to Corollary 4.3:
Theorem B.2.
Let be a -dimensional shift-invariant KDE where is differentiable and non-increasing but not necessarily convex. Let and (which depends on and ) be as in Theorem 3.1 and let be a random JL matrix as in Definition 2.2 with rows. Let be an approximate maximizer for satisfying . If we run the algorithm of Theorem B.1 with , , , and error parameter , then the method returns satisfying:
Proof.
For conciseness, we sketch the proof. As discussed, by Definition 2.2, is a 1-Lipschitz map. So it follows that the returned by the algorithm of (Biess et al., 2019) satisfies for all ,
It follows that:
By the same argument used in the proof of Theorem 3.1, we have that
In turn, since Theorem 3.1 holds under the same conditions as Theorem B.2, we have:
The result follows from noting that . By rescaling we can obtain precision equivalent to Corollary 4.3. ∎
Note that for the common relative-distance smooth kernels addressed in Theorem 1.1, we have that . So, the runtime of recovering a high-dimensional mode using the method of (Biess et al., 2019) is . This exceeds the runtime of the mean-shift method. However, in contrast to mean-shift, the method can be applied to non-convex kernels like generalized Gaussian kernels with .