
Learning Mixtures of Spherical Gaussians via Fourier Analysis

Somnath Chakraborty
School of Technology and Computer Science,
Tata Institute of Fundamental Research,
Mumbai, India
   Hariharan Narayanan
School of Technology and Computer Science,
Tata Institute of Fundamental Research,
Mumbai, India
Abstract

Suppose that we are given independent, identically distributed samples $x_1,\dots,x_n$ from a mixture $\mu$ of at most $k$ spherical Gaussian distributions $\mu_1,\dots,\mu_{k_0}$ in $d$ dimensions, of identical known variance $\sigma^2>0$, such that the minimum $\ell_2$ distance between two distinct centers $y_l$ and $y_j$ is greater than $\sigma\sqrt{d}\,\Delta$ for some $\Delta\geq 2c_0$, where $c_0\in(0,1)$ is a universal constant. We develop a randomized algorithm that learns the centers $y_l$ of the Gaussian components to within an $\ell_2$ distance of $\delta<\frac{\sigma\Delta\sqrt{d}}{2}$ with probability greater than $1-\exp(-k/c)$ for an arbitrary universal constant $c>0$, when the weights $w_1,\dots,w_{k_0}$ and the variance $\sigma^2$ are known, and the smallest weight satisfies $w_{\min}\geq c/k$. The number of samples and the computational time are bounded above by $\mathrm{poly}(k,d,\frac{1}{\delta},\sigma^{-1})$. Such a bound on the sample and computational complexity was previously unknown for $\omega(1)\leq d\leq O(\log k)$. When $d=O(1)$, this complexity bound follows from the work of Regev and Vijayaraghavan ([13]), where it is also shown that the sample complexity of learning a random mixture of Gaussians in a ball of radius $\Theta(\sqrt{d})$ in $d$ dimensions, when $d$ is $\Theta(\log k)$, is at least $\mathrm{poly}(k,\frac{1}{\delta},\sigma^{-1})$; our result is therefore tight in this case.

1 Introduction

Designing algorithms that estimate the parameters of an underlying probability distribution is a central theme in statistical learning theory. An important special instance of this learning problem arises when the underlying distribution is known to be a finite mixture of Gaussian distributions in $d$-dimensional Euclidean space: such mixtures are popular models for high-dimensional data clustering, and learning mixtures of Gaussians in an unsupervised setting has been a topic of intensive research for the last few decades.

In its most explicit form, the underlying problem is as follows: we have access to random samples drawn independently from some Gaussian mixture $\mu:=\omega_1\mu_1+\cdots+\omega_{k_0}\mu_{k_0}$, where $(\omega_1,\dots,\omega_{k_0})$ is a probability vector with strictly positive components, and each $\mu_l$ is a Gaussian density in $\mathbb{R}^d$ with mean $y_l\in\mathbb{R}^d$ and covariance $\Sigma_l$. The main task is to estimate the parameter set $\{(\omega_1,y_1,\Sigma_1),\dots,(\omega_{k_0},y_{k_0},\Sigma_{k_0})\}$ of the density function $\mu$ to within a pre-specified accuracy $\delta$. For the purpose of this paper, we restrict to the scenario where all the Gaussian components are spherical.

Many of the earlier approaches to this learning problem were based on local search heuristics, e.g. the EM algorithm and $k$-means heuristics, that resulted in weak performance guarantees. In [6], Dasgupta presented the first provably correct algorithm for learning a mixture of Gaussians, with a common unknown covariance matrix, using only polynomially many (in the dimension as well as the number of components) samples and computation time, under the assumption that the minimum separation between the centers of component Gaussians is at least $\Omega(\mathrm{polylog}(kd)\sqrt{d})$. In subsequent work, Dasgupta and Schulman ([4]) showed that a variant of EM works with minimum separation only $\mathrm{polylog}(kd)\,d^{\frac{1}{4}}$, when the components are all spherical. Many more algorithms (see [12] and the references therein) with improved sample complexity and/or computational complexity have since been devised that work under somewhat weaker separation assumptions; in particular, the SVD-based algorithm of Vempala and Wang [14] learns a mixture of spherical Gaussians with polynomially many samples and polynomially many steps, under the separation assumption

\min_{1\leq l\neq j\leq k}||y_{l}-y_{j}||_{2}\geq C\max\{\sigma_{l},\sigma_{j}\}\left(\left(\min(k,d)\log\left(dk/\delta\right)\right)^{\frac{1}{4}}+\left(\log\left(dk/\delta\right)\right)^{\frac{1}{2}}\right)

We note that when $k=\Theta(2^d)$, the above requirement translates to a minimum separation of $\Omega(d^{\frac{3}{4}})$. In another line of work, Kalai, Moitra, and Valiant [10] and Moitra and Valiant [12] investigated the question of polynomial learnability of arbitrarily separated mixtures of Gaussians (more general families of distributions have been investigated by Belkin and Sinha in [2]). It has been established that, for a Gaussian mixture with a fixed number of components whose centers have a minimum required separation, there is an algorithm that runs in polynomial time and uses polynomially many samples to learn the parameters of the mixture to any desired accuracy, with arbitrarily high probability.

Recently, Hopkins and Li ([7]) and Diakonikolas et al. ([5]) devised polynomial time learning algorithms that work with minimum separation $k^{\epsilon}$ (for any $\epsilon>0$). In [13], Regev and Vijayaraghavan considered the question of obtaining a lower bound on the separation of centers necessary for polynomial learnability of Gaussian mixtures. They devised an iterative algorithm for amplifying the accuracy of parameter estimates that, given initializers $y_1^0,\dots,y_k^0$ and a desired accuracy parameter $\delta>0$, uses polynomially many samples from the underlying mixture and polynomially many computation steps to return $y_1',\dots,y_k'$ that lie within Hausdorff distance at most $\delta$ from the true centers; for more details, see Theorem 4.1 of the full version of [13]. One of their results establishes that, in constant dimension $d=O(1)$, with minimum separation at least $\Omega(1)$, any uniform mixture of spherical Gaussians can be learned to desired accuracy $\delta$ (with high probability) with the number of samples and the computation time depending polynomially on the number of components and $\delta^{-1}$.

In the present paper we consider the question of learning the centers of a mixture of Gaussians, with minimum separation $\Omega(\sqrt{d})$, with the number of samples and the computational time depending polynomially on the ambient dimension $d$ and the number of components $k$. Here is the main theorem:

Main Theorem: There is an absolute constant $c_0\in(0,1)$ such that the following holds. Given a mixture of at most $k$ spherical Gaussians in $\mathbb{R}^d$, with known weights and known identical variance $\sigma^2$, for which all the mixing weights lie in $[c,c^{-1}]$ for some arbitrary absolute constant $c\in(0,1)$, and the minimum separation of the centers of the components satisfies

\min_{1\leq l\neq j\leq k}||y_{l}-y_{j}||_{2}\geq 2c_{0}\sigma\sqrt{d},

there is a randomized algorithm that recovers (with high probability) the centers of the mixture up to accuracy $\delta$ in time $\mathrm{poly}(k,d,\delta^{-1},\sigma^{-1})$.

In particular, we generalize the main upper bound in [13] (see Theorem 5.1 of [13]) to arbitrary dimension $d$ and number of components $k$, when the variance and the weights of the underlying mixture are known. We note that our minimum separation requirement is independent of $k$, the number of components, and for $d=O(\log k)$ this is a weak minimum separation requirement.

In what follows, we will assume that $\sigma=1$; by scale-invariance of the condition in the theorem above, this is without loss of generality.

2 Overview

We first observe that if two clusters of centers are very far apart (i.e., every center in one cluster is at least $\sqrt{d}$ away from every center of the other cluster), then each sample unambiguously belongs to one cluster only. This is shown in Lemma 1, allowing us to reduce the question to the case when a certain proximity graph, defined on the centers, is connected.

In algorithm $LearnMixture$, we use the Johnson-Lindenstrauss lemma to project the data in the ambient space onto $d+1$ carefully chosen subspaces of dimension at most $O(1+\min(\log k,d))$, and show that if we can separate the Gaussians in these subspaces, the resulting centers can be used to obtain a good estimate of the centers in the original question. Thus the question is further reduced to one where the dimension is $O(\log k)$.

Next, in Lemma 2, we show that if the number of samples is chosen appropriately, then with high probability all the $k_0$ centers are contained in a union of balls $B_i$ of radius $2\sqrt{d}$ centered around the data points. This allows us to focus our attention on $n$ balls of radius $2\sqrt{d}$.

The main idea is that in low dimensions, it is possible to efficiently implement a deconvolution of a mixture of identical spherical Gaussians having standard deviation $\sigma=1$, and recover a good approximation to a mixture of Gaussians with the same centers and weights, but in which each component now has standard deviation $\bar{\Delta}/\sqrt{d}<c\Delta/\sqrt{d}$, where $\Delta$ is the minimum separation between two centers. Once this density is available within small $L_\infty$ error, the local maxima can be approximately obtained using a robust randomized zeroth order concave optimization method developed in [3], started from all elements of a sufficiently rich $\ell_\infty^d$ net on $\bigcup_i B_i$ (which has polynomial size by Lemma 5); the resulting centers are good approximations (i.e., within $cd^{-\frac{5}{2}}$ of the true centers in Hausdorff distance) of the true centers with high probability. We then feed these $cd^{-\frac{5}{2}}$-approximate centers into the iterative procedure developed by Regev and Vijayaraghavan in [13]; by Theorem 4.1 of the full version of that paper, a seed of this quality suffices to produce, in $\mathrm{poly}(k,d,\delta^{-1})$ time, a set of centers that are within $\delta$ of the true centers in Hausdorff distance.

The deconvolution is achieved by convolving the empirical measure $\mu_e$ with $\zeta$, the inverse Fourier transform of a certain carefully chosen $\hat{\zeta}$. The function $\hat{\zeta}$ is, up to scalar multiplication, the reciprocal of a Gaussian with standard deviation $\sqrt{1-\bar{\Delta}^2}$ restricted to a ball of radius $(\sqrt{\log k}+\sqrt{d})\bar{\Delta}^{-1}$. It follows from Lemma 3 that the effect of this truncation on the deconvolution process can be controlled. The pointwise evaluation of the convolution is done using the Monte Carlo method. The truncation plays an important role: without it, the reciprocal of a Gaussian would not be in $L_p$ for any $p\in[1,\infty]$, leading to difficulties in the Fourier analysis.
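To make the deconvolution idea concrete, the following is a minimal one-dimensional sketch, not the procedure of this paper (which works directly with the empirical measure in higher dimension): it forms an empirical density from samples of a two-component mixture of unit-variance Gaussians, multiplies its discrete Fourier transform by a truncated reciprocal Gaussian, and inverts, so that the peaks of the resulting function sit near the true centers. The grid, the target width `delta_bar`, and the truncation radius `cutoff` are illustrative choices, not the constants prescribed by the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unit-variance Gaussians with centers 3 apart, equal weights.
true_centers = np.array([-1.5, 1.5])
n = 200_000
samples = rng.normal(loc=rng.choice(true_centers, size=n), scale=1.0)

# Empirical density on a grid (a stand-in for the empirical measure mu_e).
edges = np.linspace(-10.0, 10.0, 2049)
dx = edges[1] - edges[0]
grid = 0.5 * (edges[:-1] + edges[1:])
density, _ = np.histogram(samples, bins=edges, density=True)

# Frequency-domain deconvolution: inside the truncation radius, multiply the
# DFT by exp(w^2 (1 - delta_bar^2) / 2), which (up to the truncation) turns a
# unit-variance component into one of standard deviation delta_bar.
delta_bar = 0.7          # post-deconvolution standard deviation (illustrative)
cutoff = 4.0             # truncation radius in frequency (illustrative)
w = 2.0 * np.pi * np.fft.fftfreq(len(grid), d=dx)
multiplier = np.where(np.abs(w) <= cutoff,
                      np.exp(w ** 2 * (1.0 - delta_bar ** 2) / 2.0), 0.0)
sharpened = np.real(np.fft.ifft(np.fft.fft(density) * multiplier))

# Local maxima of the sharpened density approximate the mixture centers.
is_peak = (sharpened > np.roll(sharpened, 1)) & (sharpened > np.roll(sharpened, -1))
peaks = grid[is_peak & (sharpened > 0.5 * sharpened.max())]
print("estimated centers:", np.round(peaks, 2))   # expected: roughly [-1.5, 1.5]
```

Without the frequency cutoff, the multiplier would amplify the statistical noise of the empirical density without bound, which is the one-dimensional analogue of the integrability issue mentioned above.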

3 Preprocessing and Reduction

Suppose we are given independent, identically distributed samples $x_1,\dots,x_n$ from a mixture $\mu$ of no more than $k$ $d$-dimensional spherical Gaussian distributions $\mu_i$ with variance $1$, such that the minimum $\ell_2$ distance between two distinct centers $y_l$ and $y_j$ is greater than $\sqrt{d}\,\Delta$ for some $\Delta\geq 2c_0$, where $c_0$ is a certain universal constant in $(0,1)$. Suppose that $\mu:=\sum_{l=1}^{k_0}w_l\mu_l$, where the $\mu_l$ are $d$-dimensional Gaussians with centers $y_l$, $k_0\leq k$, and $(w_1,\dots,w_{k_0})$ is a known probability vector such that $w_{\min}:=\min_{l\in[k_0]}w_l$ satisfies $w_{\min}\geq\frac{c}{k}$.

Notation 1.

For $l\in(1/10)\mathbb{Z}\cap[0,\infty)$, we denote by $C_l\geq 10$ and $c_l\leq 1/10$ positive universal constants such that $C_lc_l=1$, and $C_l$ depends on the values of the constants in $\{C_j\,|\,j\in(1/10)\mathbb{Z}\cap[0,l)\}$. If $C_l$ is undefined for some index $l\in(1/10)\mathbb{Z}\cap[0,\infty)$, we set $C_l:=1$.

Let $Y:=\{y_1,\dots,y_{k_0}\}$ be the set of centers of the component Gaussians in the mixture $\mu$:

\mu(x):=(2\pi)^{-\frac{d}{2}}\sum_{l=1}^{k_{0}}w_{l}\exp\left(-\frac{||x-y_{l}||^{2}}{2}\right)

Let us fix $1-\eta:=\frac{9}{10}$ to be the success probability we will require. This can be made $1-\eta_0$, arbitrarily close to $1$, by the following simple clustering technique in the metric space associated with the Hausdorff distance, applied to the outputs of $100(\log\eta_0^{-1})$ runs of the algorithm.

3.0.1 Algorithm $Boost$

  1. Find the median of the numbers of centers output by the different runs of the algorithm, and set that to be $k_0$.

  2. Pick a set of centers $Y$ (that is the output of one of the runs) which has the property that $|Y|=k_0$ and at least half of the runs output a set of centers that is within a Hausdorff distance of less than $\frac{\Delta\sqrt{d}}{k^C}$ of $Y$. It is readily seen that, provided each run succeeds with probability $1-\eta$, this clustering technique produces an acceptable set of centers with probability at least $1-\eta_0$ (a schematic sketch of this step follows). $\square$
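The following is a minimal sketch of the $Boost$ step, assuming each run of the base algorithm returns its candidate centers as the rows of a numpy array; the acceptance threshold `tau` stands in for $\Delta\sqrt{d}/k^C$.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets (rows of A and B)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def boost(runs, tau):
    """Select a run whose output is within Hausdorff distance tau of the
    outputs of at least half of the runs; `runs` is a list of (k_i x d) arrays."""
    k0 = int(np.median([len(Y) for Y in runs]))           # step 1: median number of centers
    candidates = [Y for Y in runs if len(Y) == k0]
    for Y in candidates:                                   # step 2: majority agreement
        close = sum(hausdorff(Y, Z) < tau for Z in runs if len(Z) == k0)
        if 2 * close >= len(runs):
            return Y
    return None   # no run was consistent with a majority
```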

Let $\{(x_1,y(x_1)),\dots,(x_n,y(x_n))\}$ be a set of $n$ independent identically distributed random samples from $\mu$, generated by first sampling the mixture component $y(x_l)$ with probability $w_l$ and then picking $x_l$ from the corresponding Gaussian. With probability $1$, all the $x_l$ are distinct, and this is part of the hypothesis in the lemma below.

Lemma 1.

Let $\mathcal{G}$ be a graph whose vertex set is $X=\{x_1,\dots,x_n\}$, in which the vertices $x_l$ and $x_j$ are connected by an edge if the $\ell_2$ distance between $x_l$ and $x_j$ is less than $2\sqrt{3d\ln(C_1dn)}$. Decompose $\mathcal{G}$ into its connected components $\mathcal{G}_1,\dots,\mathcal{G}_r$. Then, the probability that there exist $l\neq j$ and $x\in\mathcal{G}_l$, $x'\in\mathcal{G}_j$ such that $y(x)=y(x')$ is less than $\eta/100$.

Proof.

By bounds on the tail of a $\chi^2$ random variable with parameter $d$ (see equation (4.3) of [9]), the probability that $||x_l-y(x_l)||\geq\sqrt{3d\ln(C_1dn)}$ can be bounded from above as follows:

\mathbb{P}\left[||x-y(x)||\geq\sqrt{3d\ln(C_{1}dn)}\right]\leq\left(3e\ln(C_{1}dn)\right)^{\frac{d}{2}}(C_{1}dn)^{-\frac{3d}{2}} \qquad (1)

A union bound yields

\mathbb{P}\left[\exists~x_{l}~\mbox{such that}~||x_{l}-y(x_{l})||\geq\sqrt{3d\ln(C_{1}dn)}\right]\leq n\left(3e\ln(C_{1}dn)\right)^{\frac{d}{2}}(C_{1}dn)^{-\frac{3d}{2}}\leq\frac{\eta}{100}

On the complement of this event, any two samples drawn from the same component are each within $\sqrt{3d\ln(C_1dn)}$ of their common center, hence within $2\sqrt{3d\ln(C_1dn)}$ of each other; they are therefore joined by an edge of $\mathcal{G}$ and lie in the same connected component. ∎

By Principal Component Analysis (PCA) it is possible to find a linear subspace $S_k$ of dimension $k$, such that all the $y_l$ are within $\frac{\sqrt{d}\Delta}{k^C}$ of $S_k$ in $\ell_2$, with probability at least $1-\frac{\eta}{100}$, using $\mathrm{poly}(d,k)$ samples and computational time (see Appendix C of [13]). PCA has been used previously in this context (see [14]). We then take the push-forward of $\mu$ via an orthogonal projection of $\mathbb{R}^d$ onto $S_k$, and work with that instead. This allows us to reduce the dimension $d$ to $k$ (if $d$ started out being greater than $k$), while retaining the same $\Delta$ to within a multiplicative factor of $2$.
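As an illustration of the reduction afforded by Lemma 1, the following sketch builds the proximity graph on the samples and splits them into connected components with a union-find structure; each component can then be treated as a separate, smaller mixture. The threshold follows the lemma, with $C_1$ treated here as a user-supplied constant.

```python
import numpy as np

def sample_components(X, C1=10.0):
    """Split samples (rows of X) into the connected components of the
    proximity graph of Lemma 1: x_l ~ x_j iff ||x_l - x_j|| < 2*sqrt(3 d ln(C1 d n))."""
    n, d = X.shape
    threshold = 2.0 * np.sqrt(3.0 * d * np.log(C1 * d * n))

    parent = list(range(n))                      # union-find over the samples
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())                 # lists of sample indices, one per component
```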

Remark 1.

In what follows, we suppose that all the centers are contained within an origin-centered $\ell_2$ ball of radius $R:=k\sqrt{d\ln(C_1dn)}$. This can be ensured by Lemma 1.

3.0.2 Algorithm $LearnMixture$

  1. Let $e_1,\dots,e_d$ be an orthonormal set of vectors, sampled uniformly at random from the Haar measure.

  2. Define the $\bar{d}=\min(d,O(\log k))$-dimensional subspace $A_{\bar{d}}$ to be the span of $e_1,\dots,e_{\bar{d}}$. By the Johnson-Lindenstrauss Lemma (see Theorem 5.3.1 of [15]), with probability greater than $1-\frac{\eta}{100}$, the distance between the projections of any two centers is at least $\left(\frac{\Delta}{2}\right)\sqrt{\bar{d}}$.

  3. Now, use the low dimensional Gaussian learning primitive, Algorithm $FindSpikes$ from Subsection 4.1, on $\{\Pi_{\bar{d}}x_j\}$; it solves the $(\bar{d}+O(1))$-dimensional problem with high probability, provided the distance between any two projected centers is at least $\left(\frac{\Delta}{2}\right)\sqrt{\bar{d}}$. If this fails to produce such a set of centers, go back to step 1; else we recover the projections onto $A_{\bar{d}}$ of the centers, $y_1^{(\bar{d})},\dots,y_{k_0}^{(\bar{d})}$, to within a Hausdorff distance of $\frac{\delta}{\sqrt{d}}$.

  4. For any fixed $l\geq\bar{d}+1$, let $A_l$ denote the span of $e_1,\dots,e_{\bar{d}}$ augmented with the vector $e_l$. Suppose that we have succeeded in identifying the projections of the centers onto $A_l$ for $\bar{d}+1\leq l\leq d$ to within $\Delta k^{-C_3}$ in $\ell_2$ distance with high probability. Together with the knowledge of $y_1^{(\bar{d})},\dots,y_{k_0}^{(\bar{d})}$, this allows us to patch these projections together and obtain the centers $y_1,\dots,y_k$ to within a Hausdorff distance of $\delta$ with high probability (a schematic sketch of this projection-and-patching step is given after the reduction below).

By the above discussion, it is clear that it suffices to consider the case when

d\leq C_{1.5}\left(\log k\right).
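A minimal sketch of the projection-and-patching skeleton of $LearnMixture$ follows, assuming a black-box routine `find_spikes_low_dim` that solves the low-dimensional subproblem (a stand-in for Algorithm $FindSpikes$ of Subsection 4.1); the matching of centers across subspaces is done here by nearest neighbour on the shared first $\bar d$ coordinates.

```python
import numpy as np

def learn_mixture(X, find_spikes_low_dim, d_bar, rng=None):
    """Skeleton of LearnMixture: draw a Haar-random orthonormal basis, recover
    the projected centers on A_dbar and on each A_l, then patch coordinates.
    `find_spikes_low_dim` is assumed to map a projected data set (n x dim array)
    to an array of recovered centers (k0 x dim)."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    E, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal basis e_1..e_d
    Z = X @ E                                          # coordinates of the data in this basis

    base = find_spikes_low_dim(Z[:, :d_bar])           # projections of the centers onto A_dbar
    k0 = base.shape[0]
    full = np.zeros((k0, d))
    full[:, :d_bar] = base

    for l in range(d_bar, d):                          # patch in the l-th coordinate
        cols = list(range(d_bar)) + [l]
        cand = find_spikes_low_dim(Z[:, cols])         # projections of the centers onto A_l
        for c in cand:
            j = np.argmin(np.linalg.norm(base - c[:d_bar], axis=1))
            full[j, l] = c[-1]                         # attach the new coordinate to the matched center

    return full @ E.T                                  # centers expressed in the original coordinates
```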

4 The case of $d\leq C_{1.5}\log k$

Lemma 2.

The following statement holds with probability at least $1-\frac{\eta}{100}$: if

n\geq 1+\frac{k}{c}\log\left\lceil\frac{300k}{\eta}\right\rceil\left\lceil\log\left\lceil\frac{300k\log\left\lceil\frac{300k}{\eta}\right\rceil}{\eta}\right\rceil\right\rceil,

and $x_1,\dots,x_n$ are independent random $\mu$-samples, then

\{y_{1},\dots,y_{k}\}\subseteq\bigcup_{l\in[n]}B_{2}(x_{l},2\sqrt{d}).
Proof.

Recall that $w_{\min}\geq ck^{-1}$. Suppose that $m>k$ random independent $\mu$-samples, denoted $x_1,\dots,x_m$, have been drawn. Let $C:=C(k)$ be a positive integer valued function. For any $l\in[k_0]$, the probability that $x_1,\dots,x_m$ contain no $\mu_l$-sample is at most $\left(1-ck^{-1}\right)^m\leq e^{-mck^{-1}}$. Thus, the probability that $x_1,\dots,x_m$ contain no sample from at least one Gaussian component of $\mu$ is at most $e^{\ln k-mck^{-1}}$. We ensure this probability is at most $\frac{\eta}{300C}$ by having

m\geq\frac{k}{c}\left\lceil\log\left(\frac{300Ck}{\eta}\right)\right\rceil

It follows that, if at least

n_{0}:=\frac{Ck}{c}\left\lceil\log\left(\frac{300Ck}{\eta}\right)\right\rceil \qquad (2)

random independent $\mu$-samples are drawn, then with probability at least $1-\frac{\eta}{300}$ every component is sampled at least $C$ times.

Let $E$ denote the event that $n_0$ random independent $\mu$-samples contain at least $C$ points from each Gaussian component. For $l\in[k_0]$, let $A_l$ be the event that none of the $n_0$ random samples satisfy $||x_j-y_l||<2\sqrt{d}$. Applying the $\chi^2$ tail bound (equation (4.3) of [9]) and a union bound, we obtain

\mathbb{P}\left[\bigcup_{l\in[k_{0}]}A_{l}~\middle|~E\right]\leq 2^{\log k-Cd}

Thus, letting

C\geq\log\left(\frac{300k}{\eta}\right) \qquad (3)

it follows that

\mathbb{P}\left[\bigcap_{l\in[k]}A_{l}^{c}\right]\geq\mathbb{P}\left[\bigcap_{l\in[k_{0}]}A_{l}^{c}~\middle|~E\right]\mathbb{P}\left[E\right]\geq\left(1-\frac{\eta}{300}\right)^{2}

provided

n\geq\frac{k}{c}\log\left\lceil\frac{300k}{\eta}\right\rceil\left\lceil\log\left\lceil\frac{300k\log\left\lceil\frac{300k}{\eta}\right\rceil}{\eta}\right\rceil\right\rceil

∎

Let $\mathcal{F}$ denote the Fourier transform. For any function $f\in L^2(\mathbb{R}^d)$, we have $\hat{f}=\mathcal{F}(f)$, where

\hat{f}(\xi)=\left(\frac{1}{\sqrt{2\pi}}\right)^{d}\int_{\mathbb{R}^{d}}f(x)e^{-\imath\xi\cdot x}dx. \qquad (4)

By the Fourier inversion formula,

f(x)=\left(\frac{1}{\sqrt{2\pi}}\right)^{d}\int_{\mathbb{R}^{d}}\hat{f}(\xi)e^{\imath\xi\cdot x}d\xi, \qquad (5)

where these are not necessarily Lebesgue integrals, but need to be interpreted as improper integrals. Thus,

\hat{f}(\xi)=\left(\frac{1}{\sqrt{2\pi}}\right)^{d}\lim\limits_{R\rightarrow\infty}\int_{|x|\leq R}f(x)e^{-\imath\xi\cdot x}dx, \qquad (6)

and

f(x)=\left(\frac{1}{\sqrt{2\pi}}\right)^{d}\lim\limits_{R\rightarrow\infty}\int_{|\xi|\leq R}\hat{f}(\xi)e^{\imath\xi\cdot x}d\xi, \qquad (7)

where the limits are taken in the $L^2$ sense.

Let

\gamma(z):=\left(\frac{1}{\sqrt{2\pi}}\right)^{d}e^{-\frac{||z||^{2}}{2}}

denote the standard Gaussian density in $d$ dimensions. Then $\hat{\gamma}=\mathcal{F}(\gamma)=\gamma$. Let $\bar{\Delta}$ equal $\Delta/C_{3.2}$. Let

\hat{s}(w):=\gamma\left(w\bar{\Delta}\right)\,\mathbb{I}\left(\left(\sqrt{C_{3.5}\ln k}+\sqrt{d}\right)B_{2}(0,\bar{\Delta}^{-1})\right),

where $B_2(0,\bar{\Delta}^{-1})$ is the Euclidean ball of radius $\bar{\Delta}^{-1}$ and $\mathbb{I}(S)$ is the indicator function of $S$, and let $s=\mathcal{F}^{-1}(\hat{s})$. Let $\gamma_\sigma(w):=\sigma^{-d}\gamma(\sigma^{-1}w)$ denote the spherical Gaussian whose one dimensional marginals have variance $\sigma^2$.

Lemma 3.

For all $z\in\mathbb{R}^d$, we have

|s(z)-\gamma_{\bar{\Delta}}(z)|\leq(2\pi{\bar{\Delta}}^{2})^{-\frac{d}{2}}k^{-\frac{C_{3.5}}{2}}.
Proof.

Let $d\hat{\kappa}^2$ be a $\chi^2$ random variable with $d$ degrees of freedom, i.e., the sum of squares of $d$ independent standard Gaussians.

By equation (4.3) of [9], we know that

\mathbb{P}\left[\hat{\kappa}\geq\sqrt{\frac{C_{3.5}\ln k}{d}}+1\right]=\mathbb{P}\left[d\hat{\kappa}^{2}-d\geq C_{3.5}\ln k+2\sqrt{C_{3.5}d\ln k}\right]\leq\mathbb{P}\left[d\hat{\kappa}^{2}-d\geq C_{3.5}\ln k+\sqrt{2C_{3.5}d\ln k}\right]\leq\exp\left(-\frac{C_{3.5}\ln k}{2}\right)=k^{-\frac{C_{3.5}}{2}} \qquad (8)

For $w\in\mbox{supp}(\hat{s})$, one has $\hat{s}(w)={\bar{\Delta}}^{-d}\gamma_{(1/{\bar{\Delta}})}(w)$. Therefore,

\|\hat{s}-{\bar{\Delta}}^{-d}\gamma_{(1/{\bar{\Delta}})}\|_{L^{1}}=\int_{||{\bar{\Delta}}w||\geq\sqrt{C_{3.5}\ln k}+\sqrt{d}}\gamma({\bar{\Delta}}w)~dw={\bar{\Delta}}^{-d}\,\mathbb{P}\left[\hat{\kappa}\geq\sqrt{\frac{C_{3.5}\ln k}{d}}+1\right]\leq{\bar{\Delta}}^{-d}k^{-\frac{C_{3.5}}{2}}

The lemma follows from the fact that $\mathcal{F}^{-1}$ is a bounded operator with operator norm $\left(2\pi\right)^{-\frac{d}{2}}$ from $L^1$ to $L^\infty$. ∎

Let $\mu_e$ denote the uniform probability measure on $\{x_1,\dots,x_n\}$. Let $\star$ denote convolution in $\mathbb{R}^d$. Note that the Fourier convolution identity is

(2\pi)^{-\frac{d}{2}}\mathcal{F}(f\star g)=\hat{f}\hat{g}.

Let $\hat{\zeta}:=\hat{\gamma}^{-1}\cdot\hat{s}$, and $\zeta=\mathcal{F}^{-1}(\hat{\zeta})$. We will recover the centers and weights of the Gaussians from $\zeta\star\mu_e=(2\pi)^{\frac{d}{2}}\mathcal{F}^{-1}(\hat{\zeta}\cdot\hat{\mu}_e)$. The heuristics are as follows.

Let $\nu$ denote the unique probability measure satisfying $\gamma\star\nu=\mu$. Thus

\nu=\sum_{j=1}^{k_{0}}w_{j}\delta_{y_{j}},

where $\delta_{y_j}$ is a Dirac delta supported on $y_j$. From this it follows, roughly speaking, that inside $\mbox{supp}(\hat{s})$, we get

{\hat{\nu}}(w)=(2\pi)^{-d}{\hat{\gamma}}(w)^{-1}\mathbb{E}_{X\approx\mu}\left[e^{-iX\cdot w}\right]\approx(2\pi)^{-\frac{d}{2}}{\hat{\gamma}}(w)^{-1}\hat{\mu}_{e}(w)\qquad\Rightarrow\qquad\hat{s}(w){\hat{\nu}}(w)\approx(2\pi)^{-\frac{d}{2}}\hat{\zeta}(w)\hat{\mu}_{e}(w)

pointwise, and this should (roughly) yield

s\star\nu\approx(2\pi)^{-\frac{d}{2}}\zeta\star\mu_{e}

On the other hand, Lemma 3 shows that $s\star\nu\approx\gamma_{\bar{\Delta}}\star\nu$, and because the spikes of $\nu$ are approximately (up to scaling) the spikes of $\gamma_{\bar{\Delta}}\star\nu$, we restrict to learning the spikes of $\gamma_{\bar{\Delta}}\star\nu$ by accessing an approximation to it via $(2\pi)^{-\frac{d}{2}}\zeta\star\mu_e$. For notational convenience, we write $\xi_e:=(2\pi)^{-\frac{d}{2}}\zeta\star\mu_e$.

In order to formalize this, we proceed as follows. We will employ Hoeffding’s inequality for $\mathbb{C}$-valued random variables:

Lemma 4 (Hoeffding).

Let $b>0$. Let $Y_1,\dots,Y_r$ be independent identically distributed $\mathbb{C}$-valued random variables such that $|Y_j|\leq b^{-1}$ for all $j\in[r]$. Then

\mathbb{P}\left[\left|\frac{1}{r}\sum_{j=1}^{r}(\mathbb{E}[Y_{j}]-Y_{j})\right|\geq t\right]\leq e^{-c_{0.1}rt^{2}b^{2}} \qquad (9)

where $c_{0.1}$ is a universal constant. $\square$

We write ${\mathcal{B}}:=\mbox{supp}(\hat{s})$, so that $z\in{\mathcal{B}}$ if and only if $z\in\mathbb{R}^d$ satisfies $||{\bar{\Delta}}z||\leq\sqrt{C_{3.5}\ln k}+\sqrt{d}$.

The following proposition allows us to construct a (random) black box oracle that outputs a good additive approximation of $\nu\star\gamma_{\bar{\Delta}}$ at any given point $x$.

Proposition 1.

Let $z_1,\dots,z_m$ be independent random variables drawn from the uniform (normalized Lebesgue) probability measure on ${\mathcal{B}}$. Let $x_1,\dots,x_n$ be independent $\mu$-distributed random points. If $m=C_{3.6}k^{C_{3.6}}\ln k$, and

n\geq C_{3.6}k^{C_{3.6}+C_{3.5}}\ln\left(C_{3.6}k^{C_{3.6}+C_{3.5}}\ln k\right), \qquad (10)

where

C_{3.6}\geq C_{0.1}C_{3.5}+C_{1.5}\left(\ln(2\pi)+2\ln\left(\frac{2C_{3.2}}{c}\right)\right)+C_{3.5}\left(8+\frac{2C_{3.2}^{2}}{c^{2}}\right),

then, for any $x\in\mathbb{R}^d$, the random variable

f_{x}:=\frac{\mbox{vol}({\mathcal{B}})}{m}\sum_{l\in[m]}e^{ix\cdot z_{l}}\gamma(z_{l}{\bar{\Delta}})e^{\frac{||z_{l}||^{2}}{2}}{\hat{\mu}}_{e}(z_{l}) \qquad (11)

satisfies the following inequality with probability at least $1-8k^{-C_{3.5}}$:

\left|(\gamma_{\bar{\Delta}}\star\nu)(x)-\Re f_{x}\right|\leq 3k^{-C_{3.5}}

where $\Re f_x$ denotes the real part of $f_x$.

Proof.

For any $x\in\mathbb{R}^d$, one has

\left|(\gamma_{\bar{\Delta}}\star\nu)(x)-(2\pi)^{-d}\sum_{j=1}^{k_{0}}w_{j}\int_{\mathcal{B}}e^{-\frac{\bar{\Delta}^{2}||z||^{2}}{2}}e^{i(x-y_{j})\cdot z}~dz\right|\leq(2\pi)^{-d}\int_{\mathbb{R}^{\bar{d}}\setminus{\mathcal{B}}}e^{-\frac{\bar{\Delta}^{2}||z||^{2}}{2}}~dz\leq k^{-\frac{C_{3.5}}{4}} \qquad (12)

Hence, it suffices to estimate

I_{x}:=(2\pi)^{-d}\sum_{j=1}^{k_{0}}w_{j}\int_{\mathcal{B}}e^{-\frac{\bar{\Delta}^{2}||z||^{2}}{2}}e^{i(x-y_{j})\cdot z}~dz

Next, one has

I_{x}=(2\pi)^{-\frac{d}{2}}\int_{\mathcal{B}}e^{ix\cdot z}\left({\hat{s}}(z)\sum_{j=1}^{k_{0}}w_{j}e^{-iy_{j}\cdot z}\right)~dz=(2\pi)^{-\frac{d}{2}}\mbox{vol}({\mathcal{B}})~\mathbb{E}_{z}\left[e^{ix\cdot z}\left({\hat{s}}(z)\sum_{j=1}^{k_{0}}w_{j}e^{-iy_{j}\cdot z}\right)\right]

where $z$ is a sample from the uniform probability distribution on ${\mathcal{B}}$. For brevity of notation, write

\phi(z)=(2\pi)^{-\frac{d}{2}}\mbox{vol}({\mathcal{B}})\left({\hat{s}}(z)\sum_{j=1}^{k_{0}}w_{j}e^{-iy_{j}\cdot z}\right)=\mbox{vol}({\mathcal{B}})\,{\hat{s}}(z){\hat{\nu}}(z)

so that $I_x=\mathbb{E}_z[e^{ix\cdot z}\phi(z)]$. By Ramanujan’s approximation of $\Gamma$ (see Theorem 1 of [11]) we get

\mbox{vol}({\mathcal{B}})={\bar{\Delta}}^{-d}\left(1+\sqrt{\frac{C_{3.5}\ln k}{d}}\right)^{d}\frac{(d\pi)^{\frac{d}{2}}}{\Gamma\left(\frac{d}{2}+1\right)}\leq(2\pi)^{\frac{d}{2}}k^{C_{1.5}\ln\left(\frac{2C_{3.2}}{c}\right)+C_{3.5}} \qquad (13)

so that

|\phi(z)|\leq(2\pi)^{-\frac{d}{2}}\mbox{vol}({\mathcal{B}})~|{\hat{s}}(z)|<k^{C_{1.5}\ln\left(\frac{2C_{3.2}}{c}\right)+C_{3.5}}\leq k^{2C_{3.5}}

If $z_1,\dots,z_m$ are independent (Lebesgue) uniformly distributed points in ${\mathcal{B}}$, then by Hoeffding’s inequality (9), one has

\mathbb{P}\left[\left|I_{x}-\frac{1}{m}\sum_{l=1}^{m}e^{ix\cdot z_{l}}\phi(z_{l})\right|\geq k^{-2C_{3.5}}\right]\leq k^{-2C_{3.5}} \qquad (14)

if we set

m=C_{0.1}C_{3.5}k^{8C_{3.5}+\frac{2C_{3.5}C_{3.2}^{2}}{c^{2}}+2C_{1.5}\ln\left(\frac{2C_{3.2}}{c}\right)+C_{1.5}\ln(2\pi)}\ln k. \qquad (15)

Recalling that $||\hat{\nu}||_{L^{\infty}},||\hat{\mu}_{e}||_{L^{\infty}}\leq(2\pi)^{-\frac{d}{2}}$, and

\hat{\nu}(z)=\mathbb{E}_{X\approx\mu}\left[(2\pi)^{-\frac{d}{2}}e^{-iX\cdot z+\frac{||z||^{2}}{2}}\right]=\mathbb{E}\left[e^{\frac{||z||^{2}}{2}}\hat{\mu}_{e}(z)\right],

for any fixed $z\in{\mathcal{B}}$, applying Hoeffding’s inequality (9) once again, we obtain

\mathbb{P}\left[\left|\hat{\nu}(z)-e^{\frac{||z||^{2}}{2}}\hat{\mu}_{e}(z)\right|>\frac{k^{-C_{3.5}}}{\mbox{vol}({\mathcal{B}})}\right]<\exp\left(-\frac{c_{0.1}n}{(2\pi)^{-\frac{d}{2}}k^{2C_{3.5}+\frac{2C_{3.2}^{2}C_{3.5}}{c^{2}}}\mbox{vol}^{2}({\mathcal{B}})}\right)

In particular, letting

C_{3.6}\geq C_{0.1}C_{3.5}+C_{1.5}\left(\ln(2\pi)+2\ln\left(\frac{2C_{3.2}}{c}\right)\right)+C_{3.5}\left(8+\frac{2C_{3.2}^{2}}{c^{2}}\right)

and

n\geq C_{3.6}k^{C_{3.6}+C_{3.5}}\ln\left(C_{3.6}k^{C_{3.6}+C_{3.5}}\ln k\right),\qquad m=C_{3.6}k^{C_{3.6}}\ln k, \qquad (16)

one has

\mathbb{P}\left[\left|\hat{\nu}(z)-e^{\frac{||z||^{2}}{2}}\hat{\mu}_{e}(z)\right|>\frac{k^{-C_{3.5}}}{\mbox{vol}({\mathcal{B}})}\right]<m^{-1}k^{-C_{3.5}}. \qquad (17)

By an application of the union bound, equations (12), (14), and (17) give us this proposition. ∎
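The oracle of Proposition 1 amounts to a Monte Carlo evaluation of equation (11). A minimal sketch follows, assuming the $\mu$-samples are the rows of an array `X`; the function name `make_oracle`, its arguments, and the rejection-free ball sampler are illustrative choices, and `m` is left as a parameter rather than set to the value prescribed by the proposition.

```python
import numpy as np
from math import gamma, pi

def make_oracle(X, delta_bar, radius, m, rng=None):
    """Monte Carlo oracle of Proposition 1: returns a function x -> Re f_x,
    an additive approximation of (gamma_{delta_bar} * nu)(x), following (11).
    `radius` is the radius of the ball B = supp(s_hat) and m is the number of
    random frequencies z_l."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape

    # z_1, ..., z_m drawn uniformly from the ball of the given radius.
    g = rng.standard_normal((m, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    Z = directions * (radius * rng.random((m, 1)) ** (1.0 / d))

    vol_B = pi ** (d / 2) * radius ** d / gamma(d / 2 + 1)

    # Empirical characteristic function mu_e_hat(z) = (2 pi)^{-d/2} mean_j exp(-i z.x_j),
    # and the weight gamma(z*delta_bar) * exp(||z||^2 / 2), combined into one exponent.
    sq = (Z ** 2).sum(axis=1)
    mu_e_hat = (2 * pi) ** (-d / 2) * np.exp(-1j * Z @ X.T).mean(axis=1)
    weight = (2 * pi) ** (-d / 2) * np.exp(0.5 * (1.0 - delta_bar ** 2) * sq)

    def oracle(x):
        phases = np.exp(1j * Z @ np.asarray(x, dtype=float))   # e^{i x . z_l}
        return float(np.real(vol_B / m * np.sum(phases * weight * mu_e_hat)))

    return oracle
```

With the notation above one would take, for instance, radius $=(\sqrt{C_{3.5}\ln k}+\sqrt{d})/\bar{\Delta}$; the returned oracle can then be queried at any point of the search region.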

We note that, for any $x\in\mathbb{R}^d$, one has

\xi_{e}(x)=\mbox{vol}({\mathcal{B}})\,\mathbb{E}_{z}\left[\hat{s}(z)e^{\frac{||z||^{2}}{2}}\hat{\mu}_{e}(z)e^{ix\cdot z}\right]

Writing $\xi(z):=\hat{s}(z)e^{\frac{||z||^{2}}{2}}\hat{\mu}_{e}(z)e^{ix\cdot z}$, one has

\mathbb{P}\left[\left|f_{x}-\xi_{e}(x)\right|>k^{-C_{3.5}}\right]=\mathbb{P}\left[\frac{\mbox{vol}({\mathcal{B}})}{m}\left|\sum_{l\in[m]}\xi(z_{l})-\mathbb{E}\left[\xi(z_{l})\right]\right|>k^{-C_{3.5}}\right]

Using (15) and Hoeffding’s inequality (9), we get

\mathbb{P}\left[\left|f_{x}-\xi_{e}(x)\right|>k^{-C_{3.5}}\right]\leq k^{-C_{3.5}} \qquad (18)

In the following algorithm $FindSpikes$, at any point

z\in\bigcup_{i}B_{2}(x_{i},2\sqrt{d}),

we shall therefore have access to a random variable $f_z$ in $\mathbb{C}$, such that

\mathbb{P}\left[|f_{z}-(\gamma_{\bar{\Delta}}\star\nu)(z)|<k^{-C_{4}}\right]>1-k^{-C_{4}}.

As established by Proposition 1 above, these $f_z$ can be constructed using polynomially many samples and computational steps, in such a way that for any subset

\{z_{1},\dots,z_{m}\}\subseteq\mathbb{R}^{d},

$\{f_{z_1},\dots,f_{z_m}\}$ is a set of independent random variables.

For the next part we will employ the efficient zeroth order stochastic concave maximization algorithm devised in Belloni et al. ([3]). We denote this algorithm by ${\mathfrak{A}}_0$. In $d$-dimensional Euclidean space, the algorithm returns an $\epsilon$-approximate maximum of a $d^{-1}\epsilon$-approximately concave, $t$-Lipschitz function, and the number of function evaluations used depends polynomially on $d$, $\epsilon^{-1}$, and $\log t$. The performance guarantee of algorithm ${\mathfrak{A}}_0$ is summarized in the following:

Fact 1 (Belloni-Liang-Narayanan-Rakhlin).

Suppose that ${\mathcal{B}}\subseteq\mathbb{R}^{\bar{d}}$ is a convex subset, and $\chi,\psi:{\mathcal{B}}\rightarrow\mathbb{R}$ are functions satisfying

\sup_{z\in{\mathcal{B}}}|\chi(z)-\psi(z)|\leq\frac{\epsilon}{d}

Suppose that $\psi$ is concave and $t$-Lipschitz; then algorithm ${\mathfrak{A}}_0$ returns a point $q\in{\mathcal{B}}$ satisfying

\chi(q)\geq\max_{z\in{\mathcal{B}}}\chi(z)-\epsilon,

and uses $O(d^{8}\epsilon^{-2}\log t)$ computation steps. $\square$
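We treat ${\mathfrak{A}}_0$ as a black box. For experimentation, any zeroth order maximizer over a Euclidean ball can stand in for it; the following crude sketch (random search with a shrinking step, not the simulated annealing method of [3] and carrying none of its guarantees) has the calling convention assumed by the sketch after Algorithm $FindSpikes$ below.

```python
import numpy as np

def maximize_in_ball(f, center, radius, iters=200, per_round=32, rng=None):
    """Crude zeroth order maximization of f over the ball B(center, radius):
    sample candidates around the current best point with a shrinking step
    size, project them into the ball, and keep the best value seen.  A
    stand-in for the subroutine A_0 of Fact 1, without its guarantees."""
    rng = rng or np.random.default_rng(0)
    center = np.asarray(center, dtype=float)
    best_x, best_val = center.copy(), f(center)
    step = radius
    for _ in range(iters):
        cand = best_x + step * rng.standard_normal((per_round, len(center)))
        offsets = cand - center                           # project back into the ball
        norms = np.linalg.norm(offsets, axis=1, keepdims=True)
        cand = center + offsets * np.minimum(1.0, radius / np.maximum(norms, 1e-12))
        vals = np.array([f(x) for x in cand])
        if vals.max() > best_val:
            best_val, best_x = vals.max(), cand[vals.argmax()].copy()
        step *= 0.97                                      # shrink the search radius
    return best_x
```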

In the following, we use $\bar{d}$ instead of $d$ in order to keep the notation consistent with Algorithm $LearnMixture$. Let $diam(Q_\ell)$ denote the $\ell_2$ diameter of $Q_\ell$.

4.1 Algorithm $FindSpikes$

Let $count_{max}=C_{4.5}k$, and initialize $count=1$.

  1. While $count\leq count_{max}$, do the following:

    (a) For each point

\ell\in\left(\frac{\bar{\Delta}}{1000}\sqrt{\frac{\bar{d}}{C_{1.5}\log k}}\right)\mathbb{Z}^{\bar{d}}\bigcap\bigcup_{i}B_{2}(x_{i},2\sqrt{\bar{d}}),

    let $Q_\ell$ be the ball of radius $\left(\frac{\bar{\Delta}}{400}\frac{\bar{d}}{\sqrt{C_{1.5}\log k}}\right)$, centered at $\ell$.

    (b) Use an efficient zeroth order stochastic concave maximization subroutine (see Fact 1) to produce a point $q_\ell$ in $Q_\ell$ at which $(2\pi)^{-\frac{\bar{d}}{2}}(\zeta\star\mu_e)(z)$ is within $k^{-C_{4.6}}$ of the maximum of $(2\pi)^{-\frac{\bar{d}}{2}}(\zeta\star\mu_e)$ restricted to $Q_\ell$.

    (c) Create a sequence $L=(q_{\ell_1},\dots,q_{\ell_{k_1}})$ that consists of all those $q_\ell$ such that $(2\pi)^{-\frac{\bar{d}}{2}}(\zeta\star\mu_e)(q_\ell)>(\frac{w_{\min}}{2})\gamma_{\bar{\Delta}}(0)$, and

\|\ell-q_{\ell}\|_{2}<diam(Q_{\ell})/4.

    If $k_1<1$, return “ERROR”.

    (d) Form a subsequence $M=(\ell_{m_1},\dots,\ell_{m_{k_2}})$ of $(\ell_1,\dots,\ell_{k_1})$ by the following iterative procedure:

      i. Let $m_1=1$. Set $M:=\{\ell_1\}$, and $j=1$.

      ii. While $j\leq k_1$:

        A. If $\ell_{j+1}$ is not contained in $B(\ell_{j'},\sqrt{\bar{d}}\sqrt{\Delta\bar{\Delta}}/2)$ for any $j'\leq j$, append $\ell_{j+1}$ to $M$.

        B. Increment $j$.

    (e) Pass

\{(q_{\ell_{m_{j}}})\}_{j\in[k_{2}]}

    to algorithm $Boost$. Increment $count$.

  2. Pass the output of $Boost$, obtained by processing $count_{max}$ copies of $M$, to the iterative algorithm of Regev and Vijayaraghavan [13], which will correctly output the centers to the desired accuracy $\delta/k$ with the required probability $1-\exp(-k/c)$. (A compressed, single-pass sketch of steps (a)-(d) is given below.)
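The sketch below compresses one pass of steps (a)-(d) of $FindSpikes$ (without the boosting wrapper), assuming an `oracle` such as the one sketched after Proposition 1 and a local maximizer `maximize_in_ball` such as the stand-in sketched after Fact 1; the grid spacing, ball radius, deduplication radius, and threshold are passed in as parameters rather than set to the constants of the analysis.

```python
import numpy as np
from itertools import product

def find_spikes_once(X, oracle, maximize_in_ball, spacing, ball_radius,
                     dedup_radius, threshold):
    """One pass of FindSpikes: enumerate lattice points near the data, locally
    maximize the deconvolved density proxy around each, keep maxima that clear
    the threshold and sit well inside their ball, and deduplicate candidates."""
    n, d = X.shape
    reach = 2.0 * np.sqrt(d)

    # Step (a): lattice points within 2*sqrt(d) of some sample.
    lo, hi = X.min(axis=0) - reach, X.max(axis=0) + reach
    ranges = [np.arange(np.floor(a / spacing), np.ceil(b / spacing) + 1) * spacing
              for a, b in zip(lo, hi)]
    lattice = [np.array(p) for p in product(*ranges)
               if np.min(np.linalg.norm(X - np.array(p), axis=1)) <= reach]

    candidates = []
    for ell in lattice:
        # Step (b): approximate maximizer q_ell of the oracle on the ball Q_ell.
        q = maximize_in_ball(oracle, center=ell, radius=ball_radius)
        # Step (c): keep spikes that are high enough and close to the lattice point.
        if oracle(q) > threshold and np.linalg.norm(q - ell) < ball_radius / 2:
            candidates.append(q)

    # Step (d): greedy deduplication of nearby candidates.
    spikes = []
    for q in candidates:
        if all(np.linalg.norm(q - s) >= dedup_radius for s in spikes):
            spikes.append(q)
    return np.array(spikes)
```

The exhaustive lattice enumeration is affordable here because, after the reduction of Section 3, the dimension is $\bar d=O(\log k)$ and Lemma 5 bounds the number of lattice points polynomially in $k$.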

4.2 Analysis of $FindSpikes$

The following lemma shows that the number of cubes in step (a) of $FindSpikes$ that need to be considered is polynomial in $k$.

Lemma 5.

For any $x\in\mathbb{R}^{\bar{d}}$,

\left|\left(\frac{\bar{\Delta}}{1000}\sqrt{\frac{\bar{d}}{C_{1.5}\log k}}\right)\mathbb{Z}^{\bar{d}}\bigcap B_{2}(x,2\sqrt{\bar{d}})\right|\leq k^{C_{7}}. \qquad (19)
Proof.

Set $r$ to $\sqrt{\bar{d}}$. We observe that the number of lattice cubes of side length $cr/\sqrt{\log k}$ centered at a point $x$ belonging to $\left(cr/\sqrt{\log k}\right)\mathbb{Z}^{\bar{d}}$ that intersect a ball $B$ of radius $r$ in $\bar{d}$ dimensions is less than or equal to the number of lattice points inside a concentric ball $B'$ of dimension $\bar{d}$ and radius $r+\frac{c\sqrt{\bar{d}}r}{\sqrt{\log k}}$. Every lattice cube of side length $cr/\sqrt{\log k}$ centered at a lattice point belonging to $\left(cr/\sqrt{\log k}\right)\mathbb{Z}^{\bar{d}}\cap B'$ is contained in the ball $B''$ with center $x$ and radius $r+\frac{2c\sqrt{\bar{d}}r}{\sqrt{\log k}}$. By volumetric considerations, $|\left(cr/\sqrt{\log k}\right)\mathbb{Z}^{\bar{d}}\cap B'|$ is therefore bounded above by $\frac{vol(B'')}{\left(cr/\sqrt{\log k}\right)^{\bar{d}}}$. We write $v_{\bar{d}}:=\mbox{vol}\left(B_2(0,\sqrt{\bar{d}})\right)$. By Ramanujan’s approximation of $\Gamma$ (see Theorem 1 of [11]) we get

v_{\bar{d}}\leq{\frac{1}{\sqrt{\pi}}}\left({\frac{2\pi e}{\bar{d}}}\right)^{\frac{\bar{d}}{2}}d^{\frac{\bar{d}}{2}}.

This tells us that

\frac{vol(B^{\prime\prime})}{\left(cr/\sqrt{\log k}\right)^{\bar{d}}}\leq\left(\frac{C\sqrt{\log k}}{\sqrt{\bar{d}}}\right)^{\bar{d}}\leq\left(\frac{C\sqrt{\log k}}{\sqrt{\bar{d}}}\right)^{\left(\frac{\sqrt{\bar{d}}}{C\sqrt{\log k}}\right)\left(C\sqrt{\bar{d}\log k}\right)}\leq\left(e^{1/e}\right)^{\left(C\sqrt{\bar{d}\log k}\right)}\leq k^{C_{7}}.

We have used the fact that for $\alpha>0$, $\alpha^{\alpha^{-1}}$ is maximized when $\alpha=e$, a fact easily verified using calculus. ∎

Next, we will need some results on the structure of $\gamma_{\bar{\Delta}}\star\nu$.

Definition 1.

Let $\mathcal{B}\subseteq\mathbb{R}^{\bar{d}}$ be a convex set, and $\xi>0$. A continuous function $\phi:{\mathcal{B}}\rightarrow\mathbb{R}_+$ is said to be $\xi$-approximately log-concave if there exists a continuous function $\psi:{\mathcal{B}}\rightarrow\mathbb{R}_+$ such that $\log\psi$ is concave, and $||\log\phi-\log\psi||_{L^{\infty}({\mathcal{B}})}\leq\xi$. We say $\phi$ is approximately log-concave if it is $\xi$-approximately log-concave for some $\xi>0$.

By Theorem 4.1 of the full version of [13], it suffices to approximate the centers of the Gaussians to within $cd^{-\frac{5}{2}}$, and then pass these approximate centers to the iterative algorithm developed in that paper. To this end, if $\nu_j:=w_j\delta_{y_j}$, it suffices to have, in the vicinity of $y_j$, access to an $L_\infty$ approximation of $\log(\gamma_{\bar{\Delta}}\star\nu_j)$ to within an additive $c{\bar{d}}^{-5}$. This is achieved by Lemma 6.

Lemma 6.

If $q_\ell\in B\left(y_j,diam(Q_\ell)\right)$ for some $j\in[k_0]$, then the restriction of $\gamma_{\bar{\Delta}}\star\nu$ to $Q_\ell$ is approximately log-concave. Specifically, for $\nu_j:=w_j\delta_{y_j}$, one has

0\leq\log(\gamma_{\bar{\Delta}}\star\nu)(x)-\log(\gamma_{\bar{\Delta}}\star\nu_{j})(x)\leq{\frac{(10)^{31}e^{\frac{1.21}{20000}}}{C_{3.2}^{30}c^{2}{\bar{d}}^{15}}}
Proof.

Fix $x\in Q_\ell$, and write $a_r:=x-y_r$ for $r\in[k]$. One has

||a_{j}||\leq{\frac{1.1\Delta\sqrt{\bar{d}}}{100C_{3.2}}}\sqrt{\frac{\bar{d}}{C_{1.5}\log k}}+{\frac{\Delta\sqrt{\bar{d}}}{200C_{3.2}}}\sqrt{\frac{\bar{d}}{C_{1.5}\log k}}\leq{\frac{3.2\Delta\sqrt{\bar{d}}}{200C_{3.2}}}

and

||a_{r}-a_{s}||\geq\Delta\sqrt{\bar{d}}

if $r\neq s$. Rewriting $\Delta=C_{3.2}\bar{\Delta}$, one has

(2\pi{\bar{\Delta}}^{2})^{\frac{\bar{d}}{2}}\left|(\gamma_{\bar{\Delta}}\star\nu)(x)-(\gamma_{\bar{\Delta}}\star\nu_{j})(x)\right|=\sum_{r\in[k_{0}]\setminus\{j\}}w_{r}e^{-\frac{||a_{r}||^{2}}{2{\bar{\Delta}}^{2}}}=\sum_{m=1}^{\infty}\sum_{\begin{subarray}{c}r\in[k_{0}]\setminus\{j\}\\ \frac{m}{2}\leq||\frac{a_{r}}{\Delta\sqrt{\bar{d}}}||<\frac{m+1}{2}\end{subarray}}w_{r}e^{-\frac{C_{3.2}^{2}||a_{r}||^{2}}{2\Delta^{2}}}

We write $p_m:=|S_m|$ where

S_{m}:=\left\{r\in[k_{0}]:~\frac{m}{2}\Delta\sqrt{\bar{d}}\leq||a_{r}||<\frac{m+1}{2}\Delta\sqrt{\bar{d}}\right\}

Since $||a_r-a_s||\geq\Delta\sqrt{\bar{d}}$, we can put disjoint balls of radius $0.5\Delta\sqrt{\bar{d}}$ around each center. Thus, letting $\omega_{\bar{d}}$ be the volume of the unit ball in $\mathbb{R}^{\bar{d}}$, one has

\omega_{\bar{d}}\left(\frac{\Delta\sqrt{\bar{d}}}{2}\right)^{\bar{d}}p_{m}\leq\omega_{\bar{d}}\left(\frac{\Delta\sqrt{\bar{d}}}{2}\right)^{\bar{d}}\left(\left(m+2\right)^{\bar{d}}-\left(m-1\right)^{\bar{d}}\right)\qquad\Rightarrow\qquad p_{m}\leq\left(\left(m+2\right)^{\bar{d}}-\left(m-1\right)^{\bar{d}}\right)<\left(m+2\right)^{\bar{d}}

which gives

(2\pi{\bar{\Delta}}^{2})^{\frac{\bar{d}}{2}}\left|(\gamma_{\bar{\Delta}}\star\nu)(x)-(\gamma_{\bar{\Delta}}\star\nu_{j})(x)\right|\leq(ck)^{-1}\sum_{m=1}^{\infty}e^{-\frac{\bar{d}(C_{3.2}^{2}m^{2}-8\ln\left(m+2\right))}{8}}\leq(ck)^{-1}\sum_{m=1}^{\infty}e^{-\frac{\bar{d}C_{3.2}^{2}m^{2}}{16}}

We use $e^{-x}<\frac{6^{15}}{x^{15}}$ and $\sum m^{-30}<10$ to obtain

(2\pi{\bar{\Delta}}^{2})^{\frac{\bar{d}}{2}}\left|(\gamma_{\bar{\Delta}}\star\nu)(x)-(\gamma_{\bar{\Delta}}\star\nu_{j})(x)\right|\leq(ck)^{-1}\sum_{m=1}^{\infty}e^{-\frac{{\bar{d}}C_{3.2}^{2}m^{2}}{16}}\leq\frac{(100)^{15}}{ck{\bar{d}}^{15}C_{3.2}^{30}}\sum_{m=1}^{\infty}\frac{1}{m^{30}}=\frac{(10)^{31}}{ck{\bar{d}}^{15}C_{3.2}^{30}}

Note that $\gamma_{\bar{\Delta}}\star\nu_j$ is log-concave. Moreover, for any $x\in Q_\ell$, one has

(\gamma_{\bar{\Delta}}\star\nu_{j})(x)=(2\pi\bar{\Delta}^{2})^{-\frac{\bar{d}}{2}}w_{j}e^{-\frac{||x-y_{j}||^{2}}{2{\bar{\Delta}}^{2}}}\geq{\frac{c}{k(2\pi{\bar{\Delta}}^{2})^{\frac{\bar{d}}{2}}}}e^{-\frac{1.21{\bar{d}}}{20000C_{1.5}\ln k}}\geq{\frac{c}{k(2\pi{\bar{\Delta}}^{2})^{\frac{\bar{d}}{2}}}}e^{-\frac{1.21}{20000}}

so that

\left|\frac{(\gamma_{\bar{\Delta}}\star\nu)(x)}{(\gamma_{\bar{\Delta}}\star\nu_{j})(x)}-1\right|\leq\frac{1}{(2\pi{\bar{\Delta}}^{2})^{\frac{\bar{d}}{2}}}\frac{(10)^{31}}{ck{\bar{d}}^{15}C_{3.2}^{30}}{\frac{k(2\pi{\bar{\Delta}}^{2})^{\frac{\bar{d}}{2}}}{c}}e^{\frac{1.21}{20000}}\leq{\frac{(10)^{31}e^{\frac{1.21}{20000}}}{C_{3.2}^{30}c^{2}{\bar{d}}^{15}}}

This gives

0\leq\log(\gamma_{\bar{\Delta}}\star\nu)(x)-\log(\gamma_{\bar{\Delta}}\star\nu_{j})(x)\leq\log\left(1+{\frac{(10)^{31}e^{\frac{1.21}{20000}}}{C_{3.2}^{30}c^{2}{\bar{d}}^{15}}}\right)\leq{\frac{(10)^{31}e^{\frac{1.21}{20000}}}{C_{3.2}^{30}c^{2}{\bar{d}}^{15}}} \qquad (20)

∎

Remark 2.

Note that, by Remark 1 and convexity of $Q_\ell$, if

q_{\ell}\in B\left(y_{j},\left(\frac{\bar{\Delta}\bar{d}}{200}\sqrt{\frac{1}{C_{1.5}\log k}}\right)\right)

then (with high probability) $\log(\gamma_{\bar{\Delta}}\star\nu_j)$ is $t$-Lipschitz for

t\leq\frac{(ck\sqrt{\ln(C_{1}dn)}+C_{3.2})\sqrt{C_{1.5}\log k}}{c^{3}}

The following lemma shows that every true spike of $\gamma_{\bar{\Delta}}\star\nu$ corresponding to a center gets detected by $FindSpikes$.

Lemma 7.

If $q_\ell\in B\left(y_j,\left(\frac{\bar{\Delta}\bar{d}}{200}\sqrt{\frac{1}{C_{1.5}\log k}}\right)\right)$ for some $j\in[k]$, then

(\gamma_{\bar{\Delta}}\star\nu)(q_{\ell})-k^{-C_{3.5}}\geq\left(\frac{w_{\min}}{2}\right)\gamma_{\bar{\Delta}}(0). \qquad (21)
Proof.

By Lemma 6, the restriction to $Q_\ell$ of $\gamma_{\bar{\Delta}}\star\nu$ is approximately log-concave. The output $q_\ell$ of the stochastic optimization algorithm ${\mathfrak{A}}_0$ satisfies

\left|\log m_{\ell}-\log(\gamma_{\bar{\Delta}}\star\nu)(q_{\ell})\right|\leq{\frac{1}{C_{3.7}{\bar{d}}^{15}}}

where $m_\ell:=\max\{(\gamma_{\bar{\Delta}}\star\nu)(x):x\in Q_\ell\}$. Equivalently,

m_{\ell}-(\gamma_{\bar{\Delta}}\star\nu)(q_{\ell})\leq m_{\ell}\left(1-e^{-\frac{1}{C_{3.7}{\bar{d}}^{15}}}\right)\leq\frac{m_{\ell}}{C_{3.7}{\bar{d}}^{15}}\qquad\Rightarrow\qquad(\gamma_{\bar{\Delta}}\star\nu)(q_{\ell})\geq m_{\ell}\left(1-\frac{1}{C_{3.7}{\bar{d}}^{15}}\right)

which proves the lemma. ∎

The next lemma shows that there are no false spikes of $\gamma_{\bar{\Delta}}\star\nu$.

Lemma 8.

If $(\gamma_{\bar{\Delta}}\star\nu)(q_\ell)+k^{-C_{3.5}}\geq\left(\frac{w_{\min}}{2}\right)\gamma_{\bar{\Delta}}(0)$, then there exists some $j\in[k]$ such that $q_\ell\in B\left(y_j,\frac{\sqrt{\bar{d}}(\Delta\bar{\Delta})^{\frac{1}{2}}}{5}\right)$.

Proof.

The arguments are similar to those in Lemma 6 above. Suppose that $q_\ell\notin\bigcup_{j\in[k]}B\left(y_j,\frac{\sqrt{\bar{d}}(\Delta\bar{\Delta})^{\frac{1}{2}}}{5}\right)$, and write $a_j:=y_j-q_\ell$. For $m\geq 1$, let $p_m:=|S_m|$ where

S_{m}:=\left\{r\in[k_{0}]:~\left(\frac{m}{2}+\frac{1}{5\sqrt{C_{3.2}}}\right)\Delta\sqrt{\bar{d}}\leq||a_{r}||<\left(\frac{m+1}{2}+\frac{1}{5\sqrt{C_{3.2}}}\right)\Delta\sqrt{\bar{d}}\right\}

One has

(\gamma_{\bar{\Delta}}\star\nu)(q_{\ell})=\gamma_{\bar{\Delta}}(0)\sum_{j\in[k_{0}]}w_{j}e^{-\frac{||a_{j}||^{2}}{2{\bar{\Delta}}^{2}}}=\gamma_{\bar{\Delta}}(0)\left(\sum_{\begin{subarray}{c}j\in[k_{0}]\\ 0\leq||\frac{a_{j}}{\Delta\sqrt{\bar{d}}}||-\frac{1}{5\sqrt{C_{3.2}}}<\frac{1}{2}\end{subarray}}w_{j}e^{-\frac{||a_{j}||^{2}}{2\bar{\Delta}^{2}}}+\sum_{m=1}^{\infty}\sum_{j\in S_{m}}w_{j}e^{-\frac{||a_{j}||^{2}}{2\bar{\Delta}^{2}}}\right)

Since $||a_r-a_s||\geq\Delta\sqrt{\bar{d}}$, we can put disjoint balls of radius $0.5\Delta\sqrt{\bar{d}}$ around each center. Thus,

p_{m}\leq 2^{\bar{d}}\left(\left(\frac{m+2}{2}+\frac{1}{5\sqrt{C_{3.2}}}\right)^{\bar{d}}-\left(\frac{m-1}{2}+\frac{1}{5\sqrt{C_{3.2}}}\right)^{\bar{d}}\right)<\left(m+2\right)^{\bar{d}}

which gives

(\gamma_{\bar{\Delta}}\star\nu)(q_{\ell})\leq\frac{C\gamma_{\bar{\Delta}}(0)}{k}\left(e^{-\frac{C_{3.2}\bar{d}}{50}}+\sum_{m=1}^{\infty}e^{-\frac{\bar{d}C_{3.2}m^{2}}{16}}\right)<\frac{C\gamma_{\bar{\Delta}}(0)}{k}\left(e^{-\frac{C_{3.2}\bar{d}}{50}}+\frac{2000}{{\bar{d}}^{2}C_{3.2}^{2}}\right)

Thus, for $C_{3.2}$ sufficiently large, the inequality

(\gamma_{\bar{\Delta}}\star\nu)(q_{\ell})+k^{-C_{3.5}}<\left(\frac{w_{\min}}{2}\right)\gamma_{\bar{\Delta}}(0)

holds, contradicting the hypothesis. ∎

Lemma 9.

If there exists some $j\in[k]$ such that $q_\ell\in B\left(y_j,\frac{\sqrt{\bar{d}}(\Delta\bar{\Delta})^{\frac{1}{2}}}{5}\right)$, and $|q_\ell-\ell|<diam(Q_\ell)/4$, then with high probability there exists some $j\in[k]$ such that $q_\ell\in B\left(y_j,c{\bar{d}}^{-\frac{5}{2}}\right)$.

Proof.

If $x\in B\left(y_j,\frac{\sqrt{\bar{d}}(\Delta\bar{\Delta})^{\frac{1}{2}}}{5}\right)$, then by Lemma 6,

0\leq\ln(\gamma_{\bar{\Delta}}\star\nu)(x)-\ln(\gamma_{\bar{\Delta}}\star\nu_{j})(x)\leq{\frac{1}{K{\bar{d}}^{5}}},

where $K$ is an absolute constant that can be made arbitrarily large. We note that $\ln(\gamma_{\bar{\Delta}}\star\nu_j)(x)=a-\left(\frac{1}{2{\bar{\Delta}}^2}\right)\|x-y_j\|^2$, for some constant $a$. This implies that if

|\ln(\gamma_{\bar{\Delta}}\star\nu)(q_{\ell})-\sup_{x}\ln(\gamma_{\bar{\Delta}}\star\nu)(x)|<\frac{2}{K{\bar{d}}^{5}},

then $q_\ell\in B\left(y_j,c{\bar{\Delta}}{\bar{d}}^{-\frac{5}{2}}\right)$. However, $|\ln(\gamma_{\bar{\Delta}}\star\nu)(q_\ell)-\sup_x\ln(\gamma_{\bar{\Delta}}\star\nu)(x)|$ is indeed less than $\frac{2}{K{\bar{d}}^{5}}$ by Proposition 1 and Fact 1. Noting that $\bar{\Delta}<c$, this completes the proof of this lemma. ∎

The following proposition shows that every spike extracted from $\gamma_{\bar{\Delta}}\star\nu$ by $FindSpikes$ is within $\delta/k$ of some $y_i$.

Proposition 2.

With probability at least $1-\exp(-k/c)$, the following is true: the Hausdorff distance between $\{y_1,\dots,y_{k_0}\}$ and $\{\ell_{m_1},\dots,\ell_{m_{k_2}}\}$ (which corresponds to the output of $Boost$ in step 2 of $FindSpikes$ above) is less than $\delta/k$.

Proof.

Consider the sequence $L$ in step 1(c) of $FindSpikes$. We know from Lemma 7, Lemma 8 and Lemma 9 that:

  1. If $q_\ell\in\bigcup_{j\in[k_0]}B\left(y_j,\left(\frac{\bar{\Delta}\bar{d}}{200}\sqrt{\frac{1}{C_{1.5}\log k}}\right)\right)$, then

(\nu\star\gamma_{\bar{\Delta}})(q_{\ell})-k^{-C_{3.5}}\geq\left(\frac{w_{\min}}{2}\right)\gamma_{\bar{\Delta}}(0).

  2. If $(\gamma_{\bar{\Delta}}\star\nu)(q_\ell)+k^{-C_{3.5}}\geq\left(\frac{w_{\min}}{2}\right)\gamma_{\bar{\Delta}}(0)$, then there exists some $j\in[k_0]$ such that $q_\ell\in B\left(y_j,\frac{\sqrt{\bar{d}}(\Delta\bar{\Delta})^{\frac{1}{2}}}{5}\right)$.

  3. If there exists some $j\in[k_0]$ such that $q_\ell\in B\left(y_j,\frac{\sqrt{\bar{d}}(\Delta\bar{\Delta})^{\frac{1}{2}}}{5}\right)$, and (as is true from step 1(c) of $FindSpikes$) $|q_\ell-\ell|<diam(Q_\ell)/4$, then with probability at least $1-\exp(\frac{-k}{c})$, there exists some $j\in[k]$ such that $q_\ell\in B\left(y_j,c{\bar{d}}^{-\frac{5}{2}}\right)$.

We have shown that $L$ is, with high probability, contained in $\bigcup_{j\in[k_0]}B(y_j,c{\bar{d}}^{-5/2})$. Lastly, we observe that for each $j\in[k]$ there must exist a $q_\ell$ such that $q_\ell\in B\left(y_j,c{\bar{d}}^{-\frac{5}{2}}\right)$, by the exhaustive choice of starting points. This proposition now follows from Theorem 4.1 of the full version of [13]. ∎

5 Conclusion and open problems

We developed a randomized algorithm that learns the centers $y_i$ of standard Gaussians in a mixture with known weights, to within an $\ell_2$ distance of $\delta<\frac{\Delta\sqrt{d}}{2}$ with high probability, when the minimum separation between two centers is at least $\sqrt{d}\,\Delta$, where $\Delta$ is larger than a universal constant in $(0,1)$. The number of samples and the computational time are bounded above by $\mathrm{poly}(k,d,\frac{1}{\delta})$. Such a polynomial bound on the sample and computational complexity was previously unknown when $d\geq\omega(1)$. There is a matching lower bound due to Regev and Vijayaraghavan [13] on the sample complexity of learning a random mixture of Gaussians in a ball of radius $\Theta(\sqrt{d})$ in $d$ dimensions, when $d$ is $\Theta(\log k)$. It remains open whether (as raised in [13]) $\mathrm{poly}(k,d,1/\delta)$ upper bounds on the computational complexity of this task can be obtained when the minimum separation between two centers is $\Omega(\sqrt{\log k})$ in general, although when $d\leq O(\log k)$, this follows from our results. It would also be interesting to extend the results of this paper to mixtures of spherical Gaussians whose variances are not necessarily equal.

Acknowledgments

HN was partially supported by a Ramanujan Fellowship and NSF award no. 1620102. Both authors acknowledge the support of DAE project no. 12-R&D-TFR-5.01-0500. This research was supported in part by the International Centre for Theoretical Sciences (ICTS) during a visit for the program - Statistical Physics of Machine Learning (Code: ICTS/SPMML2020/01).

References

  • [1] Sanjeev Arora and Ravi Kannan; Learning mixtures of separated nonspherical Gaussians. Ann. Appl. Probab. 15 (2005), no. 1A, 69-92.
  • [2] Mikhail Belkin and Kaushik Sinha; Polynomial learning of distribution families. SIAM J. Comput. 44 (2015), no. 4, 889-911.
  • [3] Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, Alexander Rakhlin; Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions. Proceedings of The 28th Conference on Learning Theory, in JMLR, vol 40, 240-265.
  • [4] Sanjoy Dasgupta and Leonard Schulman; A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. J. Mach. Learn. Res. 8 (2007), 203–226.
  • [5] Ilias Diakonikolas, Daniel Kane, and Alistair Stewart; List-decodable robust mean estimation and learning mixtures of spherical Gaussians. STOC’18—Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 1047–1060, ACM, New York, 2018.
  • [6] Sanjoy Dasgupta; Learning mixtures of Gaussians. FOCS (New York, 1999), 634-644, IEEE Computer Soc., Los Alamitos, CA, (1999).
  • [7] Samuel Hopkins and Jerry Li; Mixture models, robustness, and sum of squares proofs. STOC’18 - Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 1021–1034, ACM, New York, 2018.
  • [8] Daniel Hsu and Sham Kakade; Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. ITCS’13 - Proceedings of the 2013 ACM Conference on Innovations in Theoretical Computer Science, 11-19, ACM, New York, (2013).
  • [9] Béatrice Laurent and Pascal Massart; Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28 (2000), no. 5, 1302-1338.
  • [10] Adam Kalai, Ankur Moitra, Gregory Valiant; Efficiently learning mixtures of two Gaussians. STOC’10 - Proceedings of the 2010 ACM International Symposium on Theory of Computing, 553–562, ACM, New York, (2010).
  • [11] Ekatherina Karatsuba; On the asymptotic representation of the Euler gamma function by Ramanujan. J. Comput. Appl. Math. 135 (2001), no. 2, 225-240.
  • [12] Ankur Moitra and Gregory Valiant; Settling the polynomial learnability of mixtures of Gaussians. FOCS 2010, 93-102, IEEE Computer Soc., Los Alamitos, CA, 2010.
  • [13] Oded Regev and Aravindan Vijayaraghavan; On learning mixtures of well-separated Gaussians. 58th Annual IEEE Symposium on Foundations of Computer Science FOCS 2017, 85-96, IEEE Computer Soc. (2017), Los Alamitos, CA.
    Full version: https://arxiv.org/abs/1710.11592
  • [14] Santosh Vempala and Grant Wang; A spectral algorithm for learning mixture models. J. Comput. System Sci. 68 (2004), no. 4, 841-860.
  • [15] Roman Vershynin; High-dimensional probability. Cambridge Series in Statistical and Probabilistic Mathematics (47), Cambridge University Press, Cambridge, (2018).