Abstract
Suppose that we are given independent, identically distributed samples from a mixture of at most of -dimensional spherical Gaussian distributions of identical known variance , such that the minimum distance between two distinct centers and is greater than for some , where is a universal constant. We develop a randomized algorithm that learns the centers of the Gaussian components to within an distance of with probability greater than for an arbitrary universal constant , when the weights and the variance are known and the smallest weight satisfies . The number of samples and the computational time are bounded above by . Such a bound on the sample and computational complexity was previously unknown for . When , this complexity bound follows from the work of Regev and Vijayaraghavan ([13]), where it has also been shown that the sample complexity of learning a random mixture of Gaussians in a ball of radius in dimensions, when is , is at least ; this shows that our result is tight in this case.
1 Introduction
Designing algorithms that estimate the parameters of an underlying probability distribution is a central theme in statistical learning theory. An important special instance of this learning problem is the case when the underlying distribution is known to be a finite mixture of Gaussian distributions in -dimensional Euclidean space: such mixtures are popular models for high-dimensional data clustering, and learning mixtures of Gaussians in an unsupervised setting has been a topic of intensive research for the last few decades.
In its most explicit form, the underlying problem is as follows: we have access to random samples drawn independently from some Gaussian mixture , where is a probability vector with strictly positive components, and each is a Gaussian density in with mean and covariance . The task is to estimate the parameter set of the density function to within a pre-specified accuracy . For the purposes of this paper, we restrict attention to the case where all the Gaussian components are spherical.
Many of the earlier approaches to this learning problem were based on local search heuristics such as the EM algorithm and -means, and came with weak performance guarantees. In [6], Dasgupta presented the first provably correct algorithm for learning a mixture of Gaussians with a common unknown covariance matrix, using only polynomially many (in the dimension as well as the number of components) samples and computation time, under the assumption that the minimum separation between the centers of the component Gaussians is at least . In subsequent work, Dasgupta and Schulman ([4]) showed that a variant of EM works with a minimum separation of only , when the components are all spherical. Many more algorithms with improved sample complexity and/or computational complexity have since been devised (see [12] and the references therein) that work under somewhat weaker separation assumptions; in particular, the SVD-based algorithm of Vempala and Wang [14] learns a mixture of spherical Gaussians using polynomially many samples and polynomially many steps, under the separation assumption
We note that when , the above separation requirement translates to a minimum separation of . In another line of work, Kalai, Moitra, and Valiant [10] and Moitra and Valiant [12] investigated the question of polynomial learnability of arbitrarily separated mixtures of Gaussians (more general families of distributions were investigated by Belkin and Sinha in [2]). It has been established that, for a Gaussian mixture with a fixed number of components whose centers have the minimum required separation, there is an algorithm that runs in polynomial time and uses polynomially many samples to learn the parameters of the mixture to any desired accuracy, with arbitrarily high probability.
Recently, Hopkins and Li ([7]) and Diakonikolas et al. ([5]) devised polynomial-time learning algorithms that work with a minimum separation of (for any ).
In [13], Regev and Vijayaraghavan considered the question of obtaining a lower bound on the separation of centers necessary for the polynomial learnability of Gaussian mixtures. They devised an iterative algorithm for amplifying the accuracy of parameter estimation that, given an initializer and a desired accuracy parameter , uses polynomially many samples from the underlying mixture and polynomially many computation steps to return , which lie within a Hausdorff distance of at most from the true centers; for more details, see Theorem 4.1 of the full version of [13]. One of their results establishes that, in constant dimension , with minimum separation at least , any uniform mixture of spherical Gaussians can be learned to any desired accuracy (with high probability) using a number of samples and an amount of computation time that depend polynomially on the number of components and .
In the present paper, we consider the question of learning the centers of a mixture of Gaussians with minimum separation , using a number of samples and an amount of computational time that depend polynomially on the ambient dimension and the number of components . Here is the main theorem:
Main Theorem: There is an absolute constant such that, given a mixture of at most spherical Gaussians in , with known weights and known identical variance , for which all the mixing weights lie in for some arbitrary absolute constant , and for which the minimum separation of the centers of the components satisfies , there is a randomized algorithm that recovers (with high probability) the centers of the mixture up to accuracy in time .
In particular, we generalize the main upper bound in [13] (see Theorem 5.1 of [13]) to arbitrary dimension () and number of components (), when the variance and weights of the underlying mixture are known. We note that our minimum separation requirement is independent of , the number of components, and for , this is a weak minimum separation requirement. In what follows, we will assume that ; by scale-invariance of the condition in the theorem above, this is without loss of generality.
2 Overview
We first observe that if two clusters of centers are very far apart (i.e., every center in one cluster is at least a distance of from every center of the other cluster), then each sample can be unambiguously attributed to one cluster. This is shown in Lemma 1, allowing us to reduce the question to the case when a certain proximity graph defined on the centers is connected. In algorithm , we use the Johnson-Lindenstrauss lemma to project the data in the ambient space to carefully chosen subspaces of dimension at most , and show that if we can separate the Gaussians in these subspaces, the resulting centers can be used to obtain a good estimate of the centers in the original problem. Thus the question is further reduced to one where the dimension .
Next, in Lemma 2, we show that if the number of samples is chosen appropriately, then with high probability all the centers are contained in a union of balls of radius centered at the data points. This allows us to focus our attention on balls of radius . The main idea is that, in low dimensions, it is possible to efficiently implement a deconvolution on a mixture of identical spherical Gaussians having standard deviation , and recover a good approximation to a mixture of Gaussians with the same centers and weights, but in which each component now has standard deviation , where is the minimum separation between two centers. Once this density is available within small error, its local maxima can be approximately obtained using a robust randomized zeroth-order concave optimization method developed in [3], started from all elements of a sufficiently rich net on (which has polynomial size by Lemma 5); the resulting centers are, with high probability, good approximations of the true centers (i.e., within of them in Hausdorff distance).
We then feed these approximate centers into the iterative procedure developed by Regev and Vijayaraghavan in [13]; by Theorem 4.1 of the full version of that paper, a seed of this quality suffices to produce, in time, a set of centers that are within of the true centers in Hausdorff distance.
The deconvolution is achieved by convolving the empirical measure with the Fourier transform of a certain carefully chosen . The function is, up to scalar multiplication, the reciprocal of a Gaussian with standard deviation restricted to a ball of radius . It follows from Lemma 3 that the effect of this truncation on the deconvolution process can be controlled. The pointwise evaluation of the convolution is done using a Monte Carlo method. The truncation plays an important role: without it, the reciprocal of a Gaussian would not be in for any , leading to difficulties in the Fourier analysis.
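To make the deconvolution step concrete, the following one-dimensional numerical sketch (not the procedure analyzed in this paper, and with illustrative parameter choices of our own, such as the frequency cutoff and the target standard deviation) forms the empirical characteristic function of the samples, multiplies it by the truncated reciprocal Gaussian, and inverts the result on a grid; the large local maxima should then sit near the true centers.

```python
import numpy as np

def deconvolved_density_1d(samples, sigma, target_sigma, cutoff, grid):
    """Crude 1-D illustration of Fourier deconvolution of a Gaussian mixture.

    The empirical characteristic function of the data is multiplied by the
    reciprocal characteristic function of N(0, sigma^2 - target_sigma^2),
    truncated to |xi| <= cutoff, and then inverted on `grid`.  All parameter
    choices are illustrative; they are not the ones used in the paper.
    """
    xi = np.linspace(-cutoff, cutoff, 401)                      # truncated frequency window
    emp_cf = np.exp(1j * np.outer(xi, samples)).mean(axis=1)    # empirical characteristic function
    extra_var = sigma**2 - target_sigma**2                      # variance removed by deconvolution
    spectrum = emp_cf * np.exp(0.5 * extra_var * xi**2)         # truncated reciprocal-Gaussian multiplier
    dxi = xi[1] - xi[0]
    return (np.exp(-1j * np.outer(grid, xi)) @ spectrum).real * dxi / (2 * np.pi)

# Toy usage: two well-separated unit-variance components.
rng = np.random.default_rng(0)
centers = np.array([-3.0, 3.0])
data = rng.normal(loc=rng.choice(centers, size=10000), scale=1.0)
grid = np.linspace(-6.0, 6.0, 121)
vals = deconvolved_density_1d(data, sigma=1.0, target_sigma=0.5, cutoff=3.0, grid=grid)
peaks = [i for i in range(1, len(grid) - 1)
         if vals[i] > vals[i - 1] and vals[i] >= vals[i + 1] and vals[i] > 0.5 * vals.max()]
print(sorted(float(grid[i]) for i in peaks))   # expected: values close to -3 and 3
```

In higher (but constant) dimension, the same pointwise evaluation is carried out by a Monte Carlo method rather than on a frequency grid, which is the content of Proposition 1 below.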
3 Preprocessing and Reduction
Suppose we are given independent, identically distributed samples from a mixture of no more than of -dimensional spherical Gaussian distributions with variance , such that the minimum distance between two distinct centers and is greater than for some , where is a certain universal constant in . Suppose that , where are -dimensional Gaussians with center at and , and is a known probability vector such that satisfies .
Notation 1.
For , we denote by and positive universal constants such that , and depends on the values of the constants in . If is undefined for some index , we set .
Let be the set of centers of the component Gaussians in the mixture :
Let us fix to be the success probability we will require. This can be made arbitrarily close to by the following simple clustering technique in the metric space associated with the Hausdorff distance, applied to the outputs of runs of the algorithm.
3.0.1 Algorithm
1. Find the median of the numbers of centers output by the different runs of the algorithm, and set that to be .
2. Pick a set of centers (that is the output of one of the runs) with the property that and at least half of the runs output a set of centers that is within a Hausdorff distance of less than of . It is readily seen that, provided each run succeeds with probability , this clustering technique succeeds in producing an acceptable set of centers with probability at least . (A code sketch of this selection step is given after this list.)
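The following is a minimal sketch of the amplification step above, assuming each run returns its candidate centers as the rows of a NumPy array; the threshold `tol` is our own placeholder name for the Hausdorff-distance bound in item 2.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets given as rows of A and B."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def select_run(runs, tol):
    """Pick a run whose number of centers equals the median count and which is
    within Hausdorff distance tol of at least half of all runs."""
    k_med = int(np.median([len(r) for r in runs]))
    for r in (r for r in runs if len(r) == k_med):
        if 2 * sum(hausdorff(r, s) < tol for s in runs) >= len(runs):
            return r
    raise RuntimeError("no run was consistent with the majority")
```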
Let be a set of independent, identically distributed random samples from , generated by first sampling the mixture component with probability and then picking from the corresponding Gaussian. With probability , all the are distinct, and this is part of the hypothesis in the lemma below.
Lemma 1.
Let be a graph whose vertex set is , in which the vertices and are connected by an edge if the distance between and is less than . Decompose into the connected components of . Then the probability that there exist and such that is less than .
Proof.
By bounds on the tail of a random variable with parameter (see equation (4.3) of [9]), the probability that can be bounded from above as follows:
(1)
A union bound now yields the claim.
∎
Using Principal Component Analysis (PCA), it is possible to find a linear subspace of dimension such that all the are within of in , with probability at least , using samples and computational time (see Appendix C of [13]). PCA has been used previously in this context (see [14]). We then take the push-forward of via an orthogonal projection of onto , and work with that instead. This allows us to reduce the dimension to (if started out being greater than ), while retaining the same to within a multiplicative factor of .
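A schematic of this reduction, assuming the target dimension k is known, is given below; centering the data and keeping the top k right singular vectors is the standard PCA recipe, used here only as an illustration and not as a transcription of the procedure in Appendix C of [13].

```python
import numpy as np

def pca_project(X, k):
    """Project the samples X (an n x d array) onto the affine subspace spanned
    by the top-k principal directions, returned together with an orthonormal
    basis V (k x d) of that subspace."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[:k]                         # top-k right singular vectors
    return mu + (X - mu) @ V.T @ V, V  # projection expressed in the ambient space
```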
3.0.2 Algorithm
1. Let be an orthonormal set of vectors, sampled uniformly at random according to the Haar measure.
2. Define the -dimensional subspace to be the span of . By the Johnson-Lindenstrauss Lemma (see Theorem 5.3.1 of [15]), with probability greater than , the distance between the projections of any two centers is at least . (A sketch of this random projection appears after this list.)
3. Now use the low-dimensional Gaussian learning primitive, Algorithm from Subsection 4.1, on ; it solves the -dimensional problem with high probability if the distance between any two centers is at least . If this fails to produce such a set of centers, go back to step 1; otherwise we recover the projections onto of the centers to within a Hausdorff distance of .
4. For any fixed , let denote the span of augmented with the vector . Suppose that we have succeeded in identifying the projections of the centers onto for to within in distance, with high probability. Together with the knowledge of , this allows us to patch up these projections and obtain the centers to within a Hausdorff distance of , with high probability.
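A minimal sketch of steps 1 and 2: the column span of the Q factor of a Gaussian matrix is a Haar-random subspace, so an orthonormal frame for a uniformly random low-dimensional subspace can be generated as follows (the target dimension `m` is a placeholder name of ours).

```python
import numpy as np

def random_orthonormal_frame(d, m, rng):
    """Rows form an orthonormal basis of a Haar-random m-dimensional subspace of R^d.

    The subspace spanned by the columns of the Q factor of a d x m Gaussian
    matrix is uniformly distributed, which is all that is needed here.
    """
    Q, _ = np.linalg.qr(rng.standard_normal((d, m)))
    return Q.T

def project_onto_frame(X, frame):
    """Coordinates of the rows of X in the subspace spanned by the frame rows."""
    return X @ frame.T
```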
By the above discussion, it is clear that it suffices to consider the case when
4 The case of
Lemma 2.
The following statement holds with probability at least
: if
and are independent random -samples,
then
Proof.
Recall that . Suppose that random independent -samples, denoted , have been picked up. Let be a positive integer-valued function. For any , the probability that contain no -sample is at most . Thus, the probability that contain no sample from at least one Gaussian component in is at most . We ensure that this probability is at most by having
It follows that, if at least
(2)
random independent -samples were picked up, then with probability at least every component is sampled at least times.
Let denote the event that random independent -samples contain at least points from each Gaussian component. For , let be the event that none of the random samples satisfy . Applying the Gaussian isoperimetric inequality (equation (4.3) of [9]) and a union bound, we obtain
Thus, letting
(3)
it follows that
provided
∎
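Remark (illustration). The first step of the proof above is a standard coupon-collector estimate. Writing, purely for illustration, $m$ for the number of samples drawn, $w_{\min}$ for the smallest mixing weight, $k$ for the number of components and $\delta$ for the allowed failure probability (this notation is ours and is not used elsewhere in the paper), the basic one-sample-per-component version of the estimate reads
\[
\Pr[\text{component } i \text{ is never sampled}] \le (1-w_i)^m \le e^{-m w_{\min}},
\qquad\text{so}\qquad
m \ge \frac{1}{w_{\min}}\log\frac{k}{\delta}
\;\Longrightarrow\;
\Pr[\text{some component is never sampled}] \le k\, e^{-m w_{\min}} \le \delta .
\]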
Let denote the Fourier transform. For any function , we have , where
(4)
By the Fourier inversion formula,
(5)
where these are not necessarily Lebesgue integrals, but need to be interpreted as improper integrals. Thus,
(6)
and
(7)
where the limit is taken in the sense.
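For orientation, one standard normalization of the Fourier transform that is compatible with the convolution identity used below is the following (this choice of convention is ours, made only to fix ideas; the constants in the paper depend on the convention actually adopted):
\[
(\mathcal{F}f)(\xi) = \hat f(\xi) = \int_{\mathbb{R}^d} f(x)\, e^{-2\pi i \langle x,\xi\rangle}\, dx,
\qquad
f(x) = \int_{\mathbb{R}^d} \hat f(\xi)\, e^{2\pi i \langle x,\xi\rangle}\, d\xi,
\qquad
\widehat{f * g} = \hat f\, \hat g .
\]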
Let
denote the standard Gaussian in dimensions. Then . Let equal . Let
where is the Euclidean ball of radius and is the indicator function of , and let . Let denote the spherical Gaussian whose one-dimensional marginals have variance .
Lemma 3.
For all , we have
Proof.
Let be a random variable with degrees of freedom, i.e., the sum of squares of independent standard Gaussians. By equation (4.3) of [9], we know that
(8)
For , one has . Therefore,
The lemma now follows from the fact that is a bounded operator with operator norm from to .
∎
Let denote the uniform probability measure on . Let denote convolution in . Note that the Fourier convolution identity is
Let and . We will recover the centers and weights of the Gaussians from . The heuristic reasoning is as follows.
Let denote the unique probability measure satisfying . Thus
where is a Dirac delta supported on . From this it follows, roughly speaking, that inside we get
pointwise, and this should (roughly) yield
On the other hand, notice that Lemma 3 shows that , and because the spikes of are approximately (up to scaling) the spikes of , we restrict ourselves to learning the spikes of by accessing an approximation via . For notational convenience, we write .
In order to formalize this, we proceed as follows. We will employ Hoeffding’s
inequality for -valued random variables:
Lemma 4 (Hoeffding).
Let . Let be independent, identically distributed -valued random variables such that for all . Then
(9)
where is a universal constant.
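For orientation, one standard form of such a bound (stated with explicit constants of our own choosing, which need not match those intended above) is the following: if $x_1,\dots,x_N$ are independent, identically distributed complex-valued random variables with $|x_i|\le M$ almost surely, then for every $t>0$,
\[
\Pr\!\left[\,\Bigl|\tfrac{1}{N}\textstyle\sum_{i=1}^{N} x_i - \mathbb{E}\,x_1\Bigr| \ge t \right]
\;\le\; 4\exp\!\left(-\frac{N t^2}{4 M^2}\right),
\]
which follows by applying the real-valued Hoeffding inequality separately to the real and imaginary parts.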
We write , so that if and only if satisfies .
The following proposition allows us to construct a (random) black-box oracle that outputs a good additive approximation of at any given point .
Proposition 1.
Let be independent random variables drawn from the uniform (normalized Lebesgue) probability measure on . Let be independent -distributed random points. If , and
(10)
where
then, for any , the random variable
(11)
satisfies the following inequality with probability at least :
where denotes the real part of .
Proof.
For any , one has
(12)
Hence, it suffices to estimate
Next, one has
where is a sample from the uniform probability distribution on . For brevity of notation, write
so that . By Ramanujan’s approximation of (see Theorem 1 of [11]), we get
(13)
so that
If are independent (Lebesgue) uniformly distributed points in , then by Hoeffding’s inequality (9), one has
(14)
if we set
(15)
Recalling that , and
for any fixed , applying Hoeffding’s inequality (9) once again, we obtain
In particular, letting
and
(16)
one has
(17)
By an application of the union bound, equations (12), (14), and (17) yield the proposition.
∎
We note that, for any , one has
Writing , one has
Using (15) and Hoeffding’s inequality (9), we get
(18)
In the following algorithm , at any point
we shall therefore have access to the random variable in , such that
As established by Proposition 1 above, these can be constructed using polynomially many samples and computational steps, in such a way that for any subset
is a set of independent random variables.
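The following is a schematic of such an oracle, in the spirit of Proposition 1 but with our own variable names, normalization and parameter choices (the ball radius, the number of frequencies and the variances are placeholders, and the overall scale factor is dropped since only relative values matter for locating spikes): frequencies are drawn uniformly from a ball, the truncated reciprocal-Gaussian multiplier is evaluated there, and the convolution with the empirical measure is averaged by Monte Carlo.

```python
import numpy as np

def uniform_ball(n, dim, radius, rng):
    """n points drawn uniformly from the Euclidean ball of the given radius."""
    g = rng.standard_normal((n, dim))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g * (radius * rng.random(n) ** (1.0 / dim))[:, None]

def make_oracle(samples, sigma, target_sigma, radius, n_freq, rng):
    """Monte Carlo oracle x -> estimate (up to an overall constant) of the empirical
    measure convolved with the inverse Fourier transform of the truncated
    reciprocal Gaussian, using the 2*pi*i Fourier convention stated earlier."""
    freqs = uniform_ball(n_freq, samples.shape[1], radius, rng)
    extra_var = sigma**2 - target_sigma**2
    mult = np.exp(2 * np.pi**2 * extra_var * np.sum(freqs**2, axis=1))

    def oracle(x):
        phase = np.exp(2j * np.pi * (freqs @ (x - samples).T))  # shape (n_freq, n_samples)
        return float(np.real(phase * mult[:, None]).mean())

    return oracle
```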
For the next part, we will employ the efficient zeroth-order stochastic concave maximization algorithm devised in Belloni et al. ([3]). We denote this algorithm by . In -dimensional Euclidean space, the algorithm returns an -approximate maximum of an -approximate -Lipschitz concave function, and the number of function evaluations used depends polynomially on , and . The performance guarantee of algorithm is summarized in the following:
Fact 1 (Belloni-Liang-Narayanan-Rakhlin).
Suppose that is a convex subset, and are functions satisfying
Suppose that is concave and -Lipschitz; then algorithm returns a point satisfying
and uses computation steps.
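We do not reproduce the algorithm of [3] here. The following naive stand-in (repeated noisy evaluations averaged to damp the noise, combined with a shrinking random local search) is shown only to illustrate the interface assumed below, namely a maximizer that sees nothing but noisy zeroth-order evaluations of a function on a ball; all parameters are arbitrary placeholders.

```python
import numpy as np

def zeroth_order_maximize(noisy_f, center, radius, rng,
                          n_rounds=60, n_probes=30, n_repeats=8):
    """Naive zeroth-order maximization over the ball B(center, radius), using
    only noisy evaluations of the objective (a stand-in for the method of [3])."""
    center = np.asarray(center, dtype=float)

    def averaged(x):
        return np.mean([noisy_f(x) for _ in range(n_repeats)])

    best, best_val, step = center.copy(), averaged(center), radius
    for _ in range(n_rounds):
        probes = best + step * rng.standard_normal((n_probes, center.size))
        shift = probes - center                       # clip probes back into the ball
        norms = np.maximum(np.linalg.norm(shift, axis=1, keepdims=True), 1e-12)
        probes = center + shift * np.minimum(1.0, radius / norms)
        vals = np.array([averaged(p) for p in probes])
        if vals.max() > best_val:
            best, best_val = probes[vals.argmax()], vals.max()
        step *= 0.93                                  # shrink the search neighbourhood
    return best
```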
In the following, we use instead of in order to keep the notation consistent with Algorithm .
Let denote the diameter of .
4.1 Algorithm .
Let
1. While , do the following:
(a) For each point
let be the ball of radius centered at .
(b) Use an efficient zeroth-order stochastic concave maximization subroutine (see Fact 1) that produces a point in at which is within of the maximum of restricted to .
(c) Create a sequence that consists of all those such that , and
If , return “ERROR”.
(d) Form a subsequence of by the following iterative procedure:
i. Let . Set and .
ii. While :
A. If is not contained in for any , append to .
B.
(e) Pass
to algorithm . Increment .
2. Pass the output of , obtained by processing copies of , to the iterative algorithm of Regev and Vijayaraghavan [13], which will correctly output the centers to the desired accuracy with the required probability . (A schematic sketch of one pass of step 1 is given below.)
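The following is a schematic of one pass of the loop above, with hypothetical names of our own: `net` is the polynomial-size set of starting points from step (a), `maximize_in_ball` plays the role of the subroutine of Fact 1, `value` is the noisy evaluation oracle of Proposition 1, and `threshold` and `sep` stand for the acceptance level in step (c) and the pruning radius in step (d).

```python
import numpy as np

def one_pass(net, maximize_in_ball, value, ball_radius, threshold, sep):
    """Steps (a)-(d) of the loop: local maximization from every net point,
    thresholding of the resulting values, and greedy pruning of near-duplicates."""
    # (a)-(b): approximate maximizer of the deconvolved density in each ball
    candidates = [maximize_in_ball(p, ball_radius) for p in net]
    # (c): keep only candidates whose (noisy) value clears the threshold
    kept = [c for c in candidates if value(c) > threshold]
    if not kept:
        raise RuntimeError("ERROR")   # mirrors the ERROR branch in step (c)
    # (d): greedily drop candidates lying within distance sep of one already chosen
    pruned = []
    for c in kept:
        if all(np.linalg.norm(np.asarray(c) - np.asarray(q)) > sep for q in pruned):
            pruned.append(c)
    return pruned
```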
4.2 Analysis of
The following lemma shows that the number of cubes in step of that need to be considered is polynomial in .
Lemma 5.
For any ,
(19)
Proof.
Set to . We observe that the number of lattice cubes of side length centered at a point belonging to that intersect a ball of radius in dimensions is less than or equal to the number of lattice points inside a concentric ball of dimension of radius . Every lattice cube of side length centered at a lattice point belonging to is contained in the ball with center and radius . By volumetric considerations, is therefore bounded above by
We write . By Ramanujan’s approximation of (see Theorem 1 of [11]), we get
This tells us that
We have used the fact that for , is maximized when , a fact easily verified using calculus.
Next, we will need some results on the structure of .
Definition 1.
Let be a convex set, and .
A continuous function is said to be -approximately
log-concave if there exists a continuous function such that is concave, and
. We say is approximately
log-concave if it is -approximately log-concave for some .
By Theorem 4.1 of the full version of [13], it suffices to approximate the centers of the Gaussians to within and then pass these approximate centers on to an iterative algorithm developed in that paper. To this end, if , it suffices to have, in the vicinity of , access to an approximation of to within an additive . This is achieved by Lemma 6.
Lemma 6.
If for some , then the restriction of to is approximately log-concave. Specifically,
for , one has
Proof.
Fix , and write for . One has
and
if . Rewriting , one has
We write , where
Since , we can put disjoint balls of radius around each center. Thus, letting be the volume of the unit ball in , one has
which gives
We use and to obtain
Note that is log-concave. Moreover, for any , one has
so that
This gives
(20)
∎
The following lemma shows that every true spike corresponding to a center gets detected by .
Lemma 7.
If for some , then
(21)
Proof.
By Lemma 6, the restriction to of is approximately log-concave. The output of the stochastic optimization algorithm satisfies
where . Equivalently,
which proves the lemma.
∎
The next lemma shows that there are no false spikes in .
Lemma 8.
If , then there exists some such that .
Proof.
The argument is similar to that in Lemma 6 above. Suppose that , and write . For , let , where
One has
Since , we can put disjoint balls of radius around each center. Thus,
which gives
Hence, for sufficiently large, the desired inequality holds.
∎
Lemma 9.
If there exists some such that , and , then with high probability there exists some such that .
Proof.
If , then by Lemma 6,
where is an absolute constant that can be made arbitrarily large. We note that for some constant . This implies that if
then . However, is indeed less than by Proposition 1 and Fact 1. Noting that , this completes the proof of the lemma.
∎
The following proposition shows that every spike extracted out of by is within of some .
Proposition 2.
With probability at least , the following is true: the Hausdorff distance between and (which corresponds to the output of in step 2 of above) is less than .
Proof.
Consider the sequence in of . We know from the statements in Lemma 7, Lemma 8, and Lemma 9 that:
1. If , then (21) holds.
2. If , then there exists some such that .
3. If there exists some such that , and (as is true from step 1(c) of ), then with probability at least , there exists some such that .
We have shown that is, with high probability, contained in .
Lastly, we observe that for each there must exist a such that , by the exhaustive choice of starting points.
This proposition now follows from Theorem 4.1 of the full version of [13].
∎