A Statistical Perspective on Coreset Density Estimation
Abstract
Coresets have emerged as a powerful tool to summarize data by selecting a small subset of the original observations while retaining most of its information. This approach has led to significant computational speedups, but the performance of statistical procedures run on coresets is largely unexplored. In this work, we develop a statistical framework to study coresets and focus on the canonical task of nonparametric density estimation. Our contributions are twofold. First, we establish the minimax rate of estimation achievable by coreset-based estimators. Second, we show that practical coreset kernel density estimators are near-minimax optimal over a large class of Hölder-smooth densities.
Keywords: data summarization, kernel density estimator, Carathéodory's theorem, minimax risk, compression

1 Introduction
The ever-growing size of datasets that are routinely collected has led practitioners across many fields to contemplate effective data summarization techniques that aim at reducing the size of the data while preserving the information that it contains. While there are many ways to achieve this goal, including standard data compression algorithms, they often prevent direct manipulation of data for learning purposes. Coresets have emerged as a flexible and efficient set of techniques that permit direct data manipulation. Coresets are well-studied in machine learning (Har-Peled and Kushal, 2007; Feldman et al., 2013; Bachem et al., 2017, 2018; Karnin and Liberty, 2019), statistics (Feldman et al., 2011; Zheng and Phillips, 2017; Munteanu et al., 2018; Huggins et al., 2016; Phillips and Tai, 2018a, b), and computational geometry (Agarwal et al., 2005; Clarkson, 2010; Frahling and Sohler, 2005; Gärtner and Jaggi, 2009; Claici et al., 2020).
Given a dataset $X_1, \dots, X_n$ and a task (density estimation, logistic regression, etc.), a coreset is given by $\{X_i\}_{i \in S}$ for some subset $S \subseteq [n] = \{1, \dots, n\}$ of size $m$. A good coreset should suffice to perform the task at hand with the same accuracy as the whole dataset $X_1, \dots, X_n$.
In this work we study the canonical task of density estimation. Given $n$ i.i.d. random variables $X_1, \dots, X_n$ that admit a common density $f$ with respect to the Lebesgue measure over $\mathbb{R}^d$, the goal of density estimation is to estimate $f$. It is well known that the minimax rate of estimation over the class $\mathcal{P}_H(\beta)$ of Hölder-smooth densities of order $\beta$ is given by
$$\inf_{\hat f}\ \sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \hat f - f \big\|_2 \asymp_{\beta, d} n^{-\frac{\beta}{2\beta + d}}, \tag{1}$$
where the infimum is taken over all estimators $\hat f$ based on the dataset $X_1, \dots, X_n$. Moreover, the minimax rate above is achieved by a kernel density estimator
$$\hat f_n(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big) \tag{2}$$
for suitable choices of kernel $K$ and bandwidth $h > 0$ (see, e.g., Tsybakov, 2009, Theorem 1.2).
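To fix ideas, here is a minimal Python sketch of the estimator (2) with the bandwidth choice $h \asymp n^{-1/(2\beta+d)}$; the Gaussian kernel and the Beta-distributed sample are illustrative assumptions only (for $\beta > 2$ a higher-order kernel would be needed).

```python
import numpy as np

def kde(x, X, beta=1.0):
    """Kernel density estimate (2) at the rows of x, from a sample X of
    shape (n, d), using a Gaussian kernel and h ~ n^(-1/(2*beta + d))."""
    n, d = X.shape
    h = n ** (-1.0 / (2 * beta + d))
    z = (x[:, None, :] - X[None, :, :]) / h             # shape (q, n, d)
    K = np.exp(-0.5 * (z ** 2).sum(axis=-1)) / (2 * np.pi) ** (d / 2)
    return K.sum(axis=1) / (n * h ** d)

rng = np.random.default_rng(0)
X = rng.beta(2.0, 2.0, size=(1000, 1))                  # a smooth density on [0, 1]
print(kde(np.linspace(0.1, 0.9, 5)[:, None], X))        # estimates on a small grid
```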
The main goal of this paper is to extend this understanding of rates for density estimation to estimators based on coresets. Specifically, we would like to characterize the statistical performance of coresets in terms of their cardinality $m$. To do so, we investigate two families of estimators built on coresets: one that is quite flexible and allows arbitrary estimators to be used on the coreset, and another that is more structured and driven by practical considerations; it consists of weighted kernel density estimators built on coresets.
1.1 Two statistical frameworks for coreset density estimation
We formally define a coreset as follows. Throughout this work, $m$ denotes the cardinality of the coreset. Let $\pi$ denote a conditional probability measure on $\binom{[n]}{m}$ given $X_1, \dots, X_n$, where $\binom{[n]}{m}$ denotes the collection of subsets of $[n]$ of cardinality $m$. In information-theoretic language, $\pi$ is a channel from $(X_1, \dots, X_n)$ to subsets of cardinality $m$. We refer to the channel $\pi$ as a coreset scheme because it designates a data-driven method of choosing a subset of data points. In what follows, we abuse notation and let $S$ denote an instantiation of a sample from the measure $\pi(\cdot \mid X_1, \dots, X_n)$. A coreset is then defined to be the projection of the dataset onto the subset indicated by $S$: $X_S = \{X_i\}_{i \in S}$.
The first family of estimators that we investigate is quite general and allows the statistician to select a coreset and then employ an estimator that only manipulates data points in the coreset to estimate an unknown density. To study coresets, it is convenient to make the dependence of estimators on observations more explicit than in the traditional literature. More specifically, a density estimator based on $m$ observations is a function denoted by $\hat f_m : (\mathbb{R}^d)^m \to L^2(\mathbb{R}^d)$. Similarly, a coreset-based estimator is constructed from a coreset scheme $\pi$ of size $m$ and an estimator (measurable function) $\hat f_m$ on $m$ observations. We enforce the additional restriction on $\hat f_m$ that for all $x_1, \dots, x_m$ and for all bijections $\sigma$ of $[m]$, it holds that $\hat f_m(x_{\sigma(1)}, \dots, x_{\sigma(m)}) = \hat f_m(x_1, \dots, x_m)$. Given $\pi$ and $\hat f_m$ as above, we define the coreset-based estimator to be the function $\hat f_m(X_S)$, where $S \sim \pi(\cdot \mid X_1, \dots, X_n)$. We evaluate the performance of coreset-based estimators in Section 2 by characterizing their rate of estimation over Hölder classes. (Our notion of coreset-based estimators bears conceptual similarity to various notions of compression schemes studied in the literature, e.g., Littlestone and Warmuth (1986); Ashtiani et al. (2020); Hanneke et al. (2019).)
The symmetry restriction on $\hat f_m$ prevents the user from exploiting information about the ordering of data points to their advantage: the only information that can be used by the estimator is contained in the unordered collection of distinct vectors given by the coreset $X_S$.
As evident from the results in Section 2, the information-theoretically optimal coreset estimator does not resemble coreset estimators employed in practice. To remedy this limitation, we also study weighted coreset kernel density estimators (KDEs) in Section 3. Here the statistician selects a kernel $K$, a bandwidth parameter $h > 0$, and a coreset $S$ of cardinality $m$ as defined above, and then employs the estimator
$$\hat f_{S, w}(x) = \sum_{i \in S} w_i\, K_h(x - X_i),$$
where $K_h(\cdot) = h^{-d} K(\cdot / h)$ and the weights $w_i$ are nonnegative, sum to one, and are allowed to depend on the full dataset.
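In code, the object just defined is a two-line evaluation once the coreset and weights are available; the sketch below (our own helper, with a Gaussian kernel for concreteness) makes explicit that evaluation touches only the $m$ coreset points.

```python
import numpy as np

def coreset_kde(x, X, S, w, h):
    """Weighted coreset KDE sum_{i in S} w_i K_h(x - X_i), with K Gaussian.
    S indexes the coreset; w is nonnegative and sums to one."""
    d = X.shape[1]
    z = (x[:, None, :] - X[S][None, :, :]) / h      # only coreset points are used
    K = np.exp(-0.5 * (z ** 2).sum(axis=-1)) / (2 * np.pi) ** (d / 2)
    return (K * w[None, :]).sum(axis=1) / h ** d
```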
In the case of uniform weights, where $w_i = 1/m$ for all $i \in S$, coreset KDEs are well studied (see e.g. Bach et al., 2012; Harvey and Samadi, 2014; Phillips and Tai, 2018a, b; Karnin and Liberty, 2019). Interestingly, our results show that allowing flexibility in the weights gives a definitive advantage for the task of density estimation: by Theorems 2 and 5, uniformly weighted coreset KDEs require a much larger coreset than weighted coreset KDEs to attain the minimax rate of estimation over univariate Lipschitz densities.
1.2 Setup and Notation
We reserve the notation $\|\cdot\|_2$ for the $L^2$ norm and $\|\cdot\|_\infty$ for the sup-norm. The constants $C, c$, etc. vary from line to line, and subscripts indicate parameter dependences.
Fix an integer $d \ge 1$. For any multi-index $s = (s_1, \dots, s_d) \in \mathbb{N}^d$ and $x \in \mathbb{R}^d$, define $|s| = s_1 + \cdots + s_d$, and let $D^s$ denote the differential operator defined by
$$D^s = \frac{\partial^{|s|}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}.$$
Fix a positive real number $\beta$ and let $\ell$ denote the maximal integer strictly less than $\beta$. Given a multi-index $s$ and $x \in \mathbb{R}^d$, the notation $x^s$ signifies the coordinate-wise application of $s$ to $x$, i.e., $x^s = x_1^{s_1} \cdots x_d^{s_d}$.
Given $L > 0$, we let $\mathcal{H}(\beta, L)$ denote the space of Hölder functions $g$ that are supported on the cube $[0, 1]^d$, are $\ell$ times differentiable, and satisfy
$$\big| D^s g(x) - D^s g(y) \big| \le L \|x - y\|^{\beta - \ell}$$
for all $x, y$ and for all multi-indices $s$ such that $|s| = \ell$.
Let $\mathcal{P}_H(\beta)$ denote the set of probability density functions contained in $\mathcal{H}(\beta, L)$. For $f \in \mathcal{P}_H(\beta)$, let $\mathbb{P}_f$ (resp. $\mathbb{E}_f$) denote the probability distribution (resp. expectation) associated to $f$.
For an integer $\alpha \ge 1$ and $L' > 0$, we also define the Sobolev functions $\mathcal{S}(\alpha, L')$ that consist of all $g$ that are $\alpha$ times differentiable and satisfy
$$\|D^s g\|_2 \le L'$$
for all multi-indices $s$ such that $|s| \le \alpha$.
2 Coreset-based estimators
In this section we study the performance of coreset-based estimators. Recall that coreset-based estimators are estimators that only depend on the data points in the coreset.
Define the minimax risk for coreset-based estimators over $\mathcal{P}_H(\beta)$ to be
$$\mathcal{R}_{m,n}(\beta) := \inf_{\pi,\, \hat f_m}\ \sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \hat f_m(X_S) - f \big\|_2, \tag{3}$$
where the infimum above is over all choices of coreset scheme $\pi$ of cardinality $m$ and all estimators $\hat f_m$.
Our main result on coreset-based estimators characterizes their minimax risk.
Theorem 1.
Fix $\beta, L > 0$ and an integer $d \ge 1$. Assume that $m \le n$. Then the minimax risk of coreset-based estimators satisfies
$$\mathcal{R}_{m,n}(\beta) \asymp_{\beta, d}\ n^{-\frac{\beta}{2\beta + d}} \vee (m \log n)^{-\frac{\beta}{d}}.$$
The above theorem readily yields a characterization of the minimal size $m^*$ that a coreset can have while still enjoying the minimax optimal rate from (1). More specifically, let $m^* = m^*(n)$ be such that
(i) if $m = m(n)$ is a sequence such that $m / m^* \to \infty$, then $\mathcal{R}_{m,n}(\beta) \lesssim n^{-\frac{\beta}{2\beta+d}}$, and
(ii) if $m / m^* \to 0$, then $\mathcal{R}_{m,n}(\beta) \ge C n^{-\frac{\beta}{2\beta+d}}$ for some constant $C > 1$.
Then it follows readily from Theorem 1 that $m^* \asymp n^{\frac{d}{2\beta+d}} / \log n$.
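To see where this threshold comes from, balance the two terms in the rate from Theorem 1: the compression term dominates until
$$(m \log n)^{-\beta/d} \le n^{-\frac{\beta}{2\beta+d}} \iff m \ge \frac{n^{\frac{d}{2\beta+d}}}{\log n},$$
so coresets of this size suffice (and, up to constants, are needed) for minimax estimation.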
Theorem 1 illustrates two different curses of dimensionality: the first stems from the original estimation problem, and the second stems from the compression problem. As $d \to \infty$, the exponent $\frac{d}{2\beta+d}$ tends to $1$, and in this regime there is essentially no compression; moreover, the implicit constant in Theorem 1 grows rapidly with $d$. (In fact, even for the classical estimation problem (1), this constant grows with $d$; see McDonald, 2017, Theorem 3.)
Our proof of the lower bound in Theorem 1 first uses a standard reduction from estimation to a multiple hypothesis testing problem over a finite function class. While Fano's inequality is the workhorse of our second step, note that the lower bound must hold only for coreset estimators rather than for arbitrary estimators as in standard minimax lower bounds. This additional difficulty is overcome by a careful handling of the information structure generated by coreset scheme channels rather than by using off-the-shelf results for minimax lower bounds. The full details of the lower bound are in the Appendix.
The estimator achieving the rate in Theorem 1 relies on an encoding procedure. It is constructed by building a dictionary between the subsets of $[n]$ of size $m$ and an $\varepsilon$-net on the space of Hölder functions. The key idea is that, for $m \ll n$, the number of subsets of size $m$ is extremely large, so for $m$ large enough there is enough information to encode a near neighbor, within the net, of the kernel density estimator built on the entire dataset.
2.1 Proof of the upper bound in Theorem 1
Fix $\varepsilon > 0$ to be determined and let $\mathcal{F}_\varepsilon$ denote an $\varepsilon$-net of $\mathcal{H}(\beta, L)$ with respect to the $L^2$ norm. It follows from the classical Kolmogorov–Tikhomirov bound (see, e.g., Theorem XIV of Tikhomirov, 1993) that there exists a constant $C_{\beta, d} > 0$ such that we can choose $\mathcal{F}_\varepsilon$ with $\log |\mathcal{F}_\varepsilon| \le C_{\beta, d}\, \varepsilon^{-d/\beta}$. In particular, there exists $\bar f \in \mathcal{F}_\varepsilon$ such that $\|\bar f - \hat f_n\|_2 \le \varepsilon$, where $\hat f_n$ is the minimax optimal kernel density estimator defined in (2).
We now develop our encoding procedure for $\bar f$. To that end, fix an integer $T$ to be determined and let $\varphi$ be any surjective map from the admissible coresets onto the net $\mathcal{F}_\varepsilon$ (such a map exists by the counting argument below). Our procedure only looks at the first coordinates of the sample $X_1, \dots, X_n$. Denote these coordinates by $X_1^{(1)}, \dots, X_n^{(1)}$ and note that these numbers are almost surely distinct. Define the intervals
$$I_j = \Big[ \frac{j-1}{T},\ \frac{j}{T} \Big), \qquad j \in [T].$$
For $j \in [T]$, define the bin $B_j = \{ i \in [n] : X_i^{(1)} \in I_j \}$; in what follows we take $T = m$.
The next lemma, whose proof is in the Appendix, ensures that with high probability every bin contains the first coordinate of at least one data point.
Lemma 1.
Let $m \le n / (C \log n)$ for a sufficiently large absolute constant $C > 0$, and let $c > 0$ denote a sufficiently small constant. Then for all $n$ and all densities $f$ under consideration, the event that for every $j \in [m]$ there exists some $X_i^{(1)}$ in bin $I_j$ holds with probability at least $1 - n^{-c}$.
In the high-probability event that every bin contains the first coordinate of some data point, choose for each bin a unique representative data point whose first coordinate lies in that bin, in such a way that the resulting subset $S$ satisfies $\varphi(X_S) = \bar f$, and then define $\hat f = \varphi(X_S)$. If there exists a bin with no observation, then let $S$ consist of two data points lying in the same bin together with arbitrary additional data points, and set $\hat f = 0$.
Note that $\hat f$ is indeed a coreset-based estimator. The function $\hat f_m$ looks at the data points in the coreset, and if their first coordinates lie in distinct bins, then the coreset is decoded as above to output the corresponding element of the net $\mathcal{F}_\varepsilon$. Otherwise, $\hat f_m$ outputs $0$.
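The following simulation sketch checks the binning step numerically, with the number of bins at its largest allowed order $n/\log n$; the uniform density and the constant $C = 4$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, C = 100_000, 4.0
T = int(n / (C * np.log(n)))             # number of bins of width ~ C log(n)/n
trials, failures = 200, 0
for _ in range(trials):
    first_coords = rng.random(n)          # first coordinates, uniform for simplicity
    occupied = np.unique((first_coords * T).astype(int))
    failures += int(occupied.size < T)    # did some bin receive no data point?
print(f"empty-bin failures: {failures}/{trials}")   # expected: 0/200
```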
Next, it suffices to show the upper bound of Theorem 1 in the case when $m \le c\, n / \log n$ for a sufficiently small absolute constant $c > 0$. For $n$ sufficiently large, by Stirling's formula and our choice of $\varepsilon$, the number of admissible coresets exceeds $|\mathcal{F}_\varepsilon|$. Hence, the surjection $\varphi$ and our encoding estimator are well-defined.
By the Cauchy-Schwarz inequality,
Put together, the previous three displays yield the upper bound of Theorem 1.
3 Coreset kernel density estimators
In this section, we consider the family of weighted kernel density estimators built on coresets and study its rate of estimation over the Hölder densities. In this framework, the statistician first computes a minimax estimator $\hat f_n$ using the entire dataset and then approximates $\hat f_n$ with a weighted kernel density estimator over the coreset. Here we allow the weights to be a measurable function of the entire dataset rather than just the coreset.
As is typical in density estimation, we consider kernels of the form $K(x) = \prod_{j=1}^{d} k(x_j)$, where $k$ is an even function and $\int k = 1$. Given a bandwidth parameter $h > 0$, we define $K_h(\cdot) = h^{-d} K(\cdot / h)$.
3.1 Carathéodory coreset method
Given a KDE with uniform weights and bandwidth $h$ defined by
$$\hat f_n(x) = \frac{1}{n} \sum_{j=1}^{n} K_h(x - X_j)$$
on a sample $X_1, \dots, X_n$, we define a coreset KDE as follows in terms of a cutoff frequency $M$. Define $\Xi = \{ \xi \in \mathbb{Z}^d : \|\xi\|_\infty \le M \}$. Consider the complex vectors $v_j = \big( e^{\mathrm{i} \pi \xi \cdot X_j} \big)_{\xi \in \Xi} \in \mathbb{C}^{|\Xi|}$ for $j \in [n]$. By Carathéodory's theorem (Carathéodory, 1907), there exists a subset $S \subseteq [n]$ of cardinality at most $2 (2M+1)^d + 1$ and nonnegative weights $(w_j)_{j \in S}$ with $\sum_{j \in S} w_j = 1$ such that
$$\frac{1}{n} \sum_{j=1}^{n} v_j = \sum_{j \in S} w_j v_j. \tag{4}$$
Then the Carathéodory coreset estimator $\tilde f$ is defined to be
$$\tilde f(x) = \sum_{j \in S} w_j K_h(x - X_j).$$
3.1.1 Algorithmic considerations
For a convex polyhedron with vertices $v_1, \dots, v_n \in \mathbb{R}^D$, the proof of Carathéodory's theorem is constructive and yields a polynomial-time algorithm in $n$ and $D$ to find a convex combination of at most $D + 1$ vertices that represents a given point in the polyhedron (Carathéodory, 1907) (see also Hiriart-Urruty and Lemaréchal, 2004, Theorem 1.3.6). For completeness, we describe below this algorithm applied to our problem. Note that, more generally, for a large class of convex bodies, Carathéodory's theorem may be implemented efficiently using standard tools from convex optimization (Grötschel et al., 2012, Chapter 6).
Set $b = \frac{1}{n} \sum_{j=1}^{n} v_j$. For $j \in [n]$, let
$$u_j = \begin{pmatrix} v_j \\ 1 \end{pmatrix} \in \mathbb{R}^{D+1}.$$
Let $U$ denote the matrix with columns $u_1, \dots, u_n$, and let $\Delta_n$ denote the standard simplex in $\mathbb{R}^n$. Assume without loss of generality that $n > D + 1$. Next, starting from the uniform weight vector $w = (1/n, \dots, 1/n) \in \Delta_n$:
1. Find a nonzero vector $z$ with $U z = 0$.
2. Find $\alpha > 0$ so that $w' = w - \alpha z$ lies on the boundary of $\Delta_n$.
Observe that $U w' = U w$, and since $w'$ has at least one zero coordinate, the average $b$ is now represented using a convex combination of at most $n - 1$ of the vertices $v_j$. As long as at least $D + 2$ vertices remain, we can continue reducing the number of vertices used to represent $b$ by applying steps 1 and 2 to the columns indexed by the support of the current weights. Thus after at most $n - D - 1$ iterations, we obtain $w \in \Delta_n$ supported on at most $D + 1$ coordinates and satisfying $\sum_j w_j v_j = b$, as desired.
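The two steps translate directly into the following Python sketch; the function name, the SVD-based choice of null-space vector in step 1, and the numerical tolerance are our own implementation choices.

```python
import numpy as np

def caratheodory_reduce(V, w, tol=1e-10):
    """Given columns v_1, ..., v_n of V (shape (D, n)) and simplex weights w
    with V @ w = b, return simplex weights supported on at most D + 1
    columns that represent the same point b."""
    D, n = V.shape
    w = np.asarray(w, dtype=float).copy()
    U = np.vstack([V, np.ones(n)])           # the row of ones preserves sum(w) = 1
    support = np.flatnonzero(w > tol)
    while support.size > D + 1:
        # Step 1: a nonzero z with U[:, support] @ z = 0 exists since the
        # submatrix has more than D + 1 columns; take it from the SVD.
        z = np.linalg.svd(U[:, support])[2][-1]
        if not (z > tol).any():              # z sums to zero, so flip if needed
            z = -z
        # Step 2: the largest step keeping w - alpha * z >= 0 puts w on the
        # boundary of the simplex: some coordinate hits exactly zero.
        pos = z > tol
        alpha = np.min(w[support][pos] / z[pos])
        w[support] -= alpha * z
        w[np.abs(w) < tol] = 0.0
        support = np.flatnonzero(w > tol)
    return w, support
```

In the setting of Section 3.1, the columns of $V$ would be the real and imaginary parts of the frequency vectors $v_j$, so that $D = 2(2M+1)^d$ and the returned support plays the role of the coreset $S$.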
3.2 Results on Carathéodory coresets
Proposition 1 is key to our results and specifies conditions on the kernel guaranteeing that the Carathéodory method yields an accurate estimator.
Proposition 1.
Let $K$ denote a kernel with $\|K\|_2 < \infty$ such that $K$ and its Fourier transform $\hat K$ decay rapidly at infinity, and suppose that the KDE
$$\hat f_n(x) = \frac{1}{n} \sum_{j=1}^{n} K_h(x - X_j)$$
with bandwidth $h \asymp n^{-\frac{1}{2\beta+d}}$ satisfies
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \hat f_n - f \big\|_2 \lesssim n^{-\frac{\beta}{2\beta+d}}. \tag{5}$$
Then the Carathéodory coreset estimator $\tilde f$ constructed from $\hat f_n$ with cutoff frequency $M \asymp 1/h$ satisfies
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \lesssim n^{-\frac{\beta}{2\beta+d}}.$$
There exists a kernel $K^*$ that satisfies the conditions above for all $\beta > 0$ and $d \ge 1$. We sketch the details here and postpone the full argument to the proof of Theorem 2 in the Appendix. Let $g$ denote a cutoff function that has the following properties: $g \in C^\infty(\mathbb{R}^d)$, $g \equiv 1$ on $[-1, 1]^d$, and $g$ is compactly supported on $[-2, 2]^d$. Define $\hat{K}^* = g$, and let $K^*$ denote the resulting kernel. Observe that for all $\ell \ge 1$, the kernel $K^*$ satisfies
$$\int x^s K^*(x)\, dx = 0 \qquad \text{for all multi-indices } s \text{ with } 1 \le |s| \le \ell.$$
Using standard results from Tsybakov (2009), this implies that the resulting KDE satisfies (5). Since $g \in L^1$, the Riemann–Lebesgue lemma guarantees that the decay condition on $K^*$ is satisfied. Since $g$ is compactly supported, an application of Parseval's identity yields $\|K^*\|_2 < \infty$. Applying Proposition 1 to $K^*$, we conclude that for the task of density estimation, weighted KDEs built on coresets are nearly as powerful as the coreset-based estimators studied in Section 2.
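Here is a numerical sketch of the univariate analogue of this construction; the particular smooth cutoff g and the quadrature grid are illustrative choices, not the ones from the formal proof.

```python
import numpy as np

def bump(t):
    """Smooth function vanishing for t <= 0 (the standard exp(-1/t) bump)."""
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, np.exp(-1.0 / np.where(t > 0, t, 1.0)), 0.0)

def g(xi):
    """Smooth cutoff: identically 1 on [-1, 1], supported on [-2, 2]."""
    s = bump(2.0 - np.abs(xi))
    return s / (s + bump(np.abs(xi) - 1.0))

def K_star(x, grid=np.linspace(-2.0, 2.0, 2001)):
    """Kernel whose Fourier transform is g, by quadrature (g even => K real)."""
    dxi = grid[1] - grid[0]
    return (g(grid) * np.cos(np.outer(np.atleast_1d(x), grid))).sum(axis=1) * dxi / (2 * np.pi)

xs = np.linspace(-40.0, 40.0, 1601)
mass = K_star(xs).sum() * (xs[1] - xs[0])
print(mass)   # ~ 1.0 = g(0): K* integrates to one by Fourier inversion
```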
Theorem 2.
Let $\beta, L > 0$. The Carathéodory coreset estimator $\tilde f$ built using the kernel $K^*$ and setting $M \asymp 1/h \asymp n^{\frac{1}{2\beta + d}}$ satisfies
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \lesssim_{\beta, d} n^{-\frac{\beta}{2\beta+d}}.$$
The corresponding coreset has cardinality
$$m \lesssim_{\beta, d} n^{\frac{d}{2\beta + d}}.$$
Theorem 2 shows that the Carathéodory coreset estimator achieves the minimax rate of estimation with near-optimal coreset size. In fact, a small modification yields a near-optimal rate of convergence for any coreset size as in Theorem 1.
Corollary 1.
Let $\beta, L > 0$ and $m \le n$. The Carathéodory coreset estimator $\tilde f$ built using the kernel $K^*$, setting $M \asymp m^{1/d}$ and $h \asymp 1/M$, satisfies
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \lesssim_{\beta, d}\ n^{-\frac{\beta}{2\beta+d}} \vee m^{-\frac{\beta}{d}},$$
and the corresponding coreset has cardinality $O(m)$.
Next we apply Proposition 1 to the popular Gaussian kernel $G(x) = (2\pi)^{-d/2} e^{-\|x\|^2/2}$. This kernel has rapid decay in the real domain and in Fourier space, and is thus amenable to our techniques. Moreover, $G$ is a kernel of order 1 (Tsybakov, 2009, Definition 1.3 and Theorem 1.2), and so the standard KDE on the full dataset attains the minimax rate of estimation over the Lipschitz densities $\mathcal{P}_H(1)$.
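To see why a logarithmic factor enters for the Gaussian kernel, note that $\hat G(\xi) = e^{-\|\xi\|^2/2}$, so the Fourier tail beyond the rescaled cutoff $M h$ is of order $e^{-(M h)^2 / 2}$. Driving this tail below the target rate forces
$$e^{-(M h)^2 / 2} \le n^{-\frac{1}{2+d}} \iff M \gtrsim \frac{\sqrt{\log n}}{h},$$
which inflates the coreset cardinality $\asymp M^d$ by a polylogarithmic factor relative to Theorem 2.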
Theorem 3.
Let $d \ge 1$. The Carathéodory coreset estimator $\tilde f$ built using the kernel $G$ and setting $M \asymp \sqrt{\log n}/h$ with $h \asymp n^{-\frac{1}{2+d}}$ satisfies
$$\sup_{f \in \mathcal{P}_H(1)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \lesssim_d n^{-\frac{1}{2+d}}.$$
The corresponding coreset has cardinality
$$m \lesssim_d n^{\frac{d}{2+d}} (\log n)^{d/2}.$$
In addition, we have a nearly matching lower bound to Theorem 2 for coreset KDEs. In fact, our lower bound applies to a generalization of coreset KDEs where the vector of weights $w$ is not constrained to lie in the simplex but can range within a hypercube whose width may grow polynomially with $n$: $w \in [-n^C, n^C]^m$ for some constant $C > 0$.
Theorem 4.
Let $\beta, L, C > 0$. Let $K$ denote a kernel with $\|K\|_2 < \infty$. Let $\tilde f$ denote a weighted coreset KDE with bandwidth $h > 0$ built from a coreset of cardinality $m$ with weights satisfying $\|w\|_\infty \le n^C$. Then
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \gtrsim_{\beta, d, C}\ n^{-\frac{\beta}{2\beta+d}} \vee (m \log n)^{-\frac{\beta}{d}}.$$
This result is essentially a consequence of the lower bound in Theorem 1 because, in an appropriate sense, coreset KDEs with bounded weights are well-approximated by coreset-based estimators. Hence, in the case of bounded weights, allowing these weights to be measurable functions of the entire dataset rather than just the coreset, as would be required in Section 2, does not make a significant difference for the purpose of estimation. The full details of Theorem 4 are postponed to the Appendix.
3.3 Proof sketch of Proposition 1
Here we sketch the proof of Proposition 1, our main tool in constructing effective coreset KDEs. Full details of the argument may be found in the Appendix.
Let $K$ denote a kernel, and suppose that $\hat f_n$ is a good estimator of an unknown density $f$ in that
$$\mathbb{E}_f \big\| \hat f_n - f \big\|_2 \lesssim n^{-\frac{\beta}{2\beta+d}}$$
on setting $h \asymp n^{-\frac{1}{2\beta+d}}$. Our goal is to find a subset $S \subseteq [n]$ of small cardinality and weights $(w_j)_{j \in S}$ such that
$$\Big\| \sum_{j \in S} w_j K_h(\cdot - X_j) - \hat f_n \Big\|_2 \lesssim n^{-\frac{\beta}{2\beta+d}}.$$
Suppose for simplicity that is compactly supported on . By hypothesis and Parseval’s theorem , and we can further show that and . Let denote the Fourier transform on the interval . Using the Fourier decay of , we have
(6) |
when . Observe that this matches the setting of in Proposition 1.
The approximation (6) implies that for ,
Using the Carathéodory coreset and weights constructed in Section 3.1, it follows that
Applying (6) again, we see that the right-hand side is approximately equal to $\tilde f$, the estimator produced in Section 3.1. By the triangle inequality, we conclude that $\tilde f$ is close to $\hat f_n$, and hence to $f$, as desired.
4 Lower bounds for coreset KDEs with uniform weights
In this section we study the performance of univariate uniformly weighted coreset KDEs
$$\hat f_S(x) = \frac{1}{m} \sum_{i \in S} K_h(x - X_i),$$
where $S$ is the coreset and $|S| = m$. The next results demonstrate that for a large class of kernels, there is a significant gap between the rate of estimation achieved by uniformly weighted coreset KDEs and that of coreset KDEs with general weights. First we focus on the particular case of estimating Lipschitz densities, the class $\mathcal{P}_H(1)$ with $d = 1$. For this class, the minimax rate of estimation (over all estimators) is $n^{-1/3}$, and this can be achieved by a weighted coreset KDE of cardinality of order $n^{1/3}$ by Theorem 2.
Theorem 5.
Let denote a nonnegative kernel satisfying
for some . Suppose that . If
then
(7) |
The infimum above is over all possible choices of bandwidth and all coreset schemes of cardinality at most .
By this result, if $K$ has lighter-than-quadratic tails and fast Fourier decay, the error in (7) is a polynomial factor larger than the minimax rate in the stated range of $m$. Hence, our result covers a wide variety of kernels typically used for density estimation and shows that the uniformly weighted coreset KDE performs much worse than the encoding estimator or the Carathéodory method. In addition, for very smooth univariate kernels with rapid decay, we have the following lower bound that applies for all coreset sizes $m$.
Theorem 6.
Fix and a nonnegative kernel on satisfying the following fast decay and smoothness conditions:
(8) | ||||
(9) |
where we recall that denotes the Fourier transform. Let be the uniformly weighted coreset KDE. Then there exists such that for and any and , we have
Therefore, attaining the minimax rate with a uniformly weighted coreset KDE requires a coreset of size polynomially larger than $n^{1/3}$ for such kernels. Next, note that the Gaussian kernel satisfies the hypotheses of Theorems 5 and 6. As we show in Theorem 7, results of Phillips and Tai (2018b) imply that our lower bounds are tight up to logarithmic factors: there exists a uniformly weighted Gaussian coreset KDE, of size matching our lower bounds up to logarithmic factors, that attains the minimax rate for estimating univariate Lipschitz densities. In general, we expect a matching lower bound to hold for uniformly weighted coreset KDEs attaining the minimax rate. The proofs of Theorems 5 and 6 can be found in the Appendix.
5 Comparison to other methods
Three methods for constructing coreset kernel density estimators have previously been explored: random sampling (Joshi et al., 2011; Lopez-Paz et al., 2015), the Frank–Wolfe algorithm (Bach et al., 2012; Harvey and Samadi, 2014; Phillips and Tai, 2018a), and discrepancy-based approaches (Phillips and Tai, 2018b; Karnin and Liberty, 2019). These procedures all result in a uniformly weighted coreset KDE. To compare these results with ours on the problem of density estimation, for each method under consideration we raise the question: how large does $m$, the size of the coreset, need to be to guarantee that
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f_S - f \big\|_2 \lesssim n^{-\frac{\beta}{2\beta+d}}\,? \tag{10}$$
Here $\tilde f_S$ is the resulting coreset KDE and the right-hand side is the minimax rate over all estimators on the full dataset $X_1, \dots, X_n$.
Uniform random sampling of a subset of cardinality $m$ yields an i.i.d. dataset of size $m$, so the error incurred is at least of order $m^{-\frac{\beta}{2\beta+d}}$. Hence, we must take $m \asymp n$ to achieve the minimax rate.
The Frank–Wolfe algorithm is a greedy method that iteratively constructs a sparse approximation to a given element of a convex set (Frank and Wolfe, 1956; Bubeck, 2015). Thus Frank–Wolfe may be applied directly in the RKHS corresponding to a positive-semidefinite kernel, as shown in Phillips and Tai (2018b), to approximate the KDE on the full dataset. However, due to the shrinking bandwidth in our problem, this approach also requires $m = \Omega(n)$ to guarantee the bound in (10). Another strategy is to approximately solve the linear equation (4) using the Frank–Wolfe algorithm. Unfortunately, a direct implementation again uses $\Omega(n)$ data points.
A more effective strategy utilizes discrepancy theory (Phillips, 2013; Phillips and Tai, 2018b; Karnin and Liberty, 2019) (see Matoušek, 1999; Chazelle, 2000, for a comprehensive exposition of discrepancy theory). By the well-known halving algorithm (see e.g. Chazelle and Matoušek, 1996; Phillips and Tai, 2018b), if the kernel discrepancy
$$\mathrm{disc}_n := \min_{\varepsilon \in \{-1, +1\}^n}\ \sup_{x} \Big| \sum_{i=1}^{n} \varepsilon_i K_h(x - X_i) \Big|$$
is at most $D$ for all $n$, then there exists a coreset $S$ of size $m$ such that
$$\Big\| \frac{1}{m} \sum_{i \in S} K_h(\cdot - X_i) - \frac{1}{n} \sum_{i=1}^{n} K_h(\cdot - X_i) \Big\|_\infty \lesssim \frac{D}{m}. \tag{11}$$
The idea of the halving algorithm is to maintain a set of data points at each iteration and then to keep the subset of vectors that receive sign $+1$ upon minimizing the discrepancy. Starting with the original dataset and repeating this procedure $\log_2(n/m)$ times yields the desired coreset satisfying (11).
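A schematic sketch of the halving loop follows; the signing oracle is the key ingredient, and the random signing used here is a placeholder for a low-discrepancy signing such as the Gram–Schmidt walk of Bansal et al. (2018) used by Phillips and Tai (2018b).

```python
import numpy as np

def random_signs(S, rng):
    """Placeholder oracle: balanced random +/-1 signs. A real implementation
    would return a signing with small kernel discrepancy on S."""
    eps = np.ones(len(S), dtype=int)
    eps[rng.permutation(len(S))[: len(S) // 2]] = -1
    return eps

def halving_coreset(X, target_size, signs=random_signs, seed=0):
    """Keep the +1 half at each round until at most target_size points
    remain; with a low-discrepancy oracle the uniform KDE on the output
    approximates the uniform KDE on X as in (11)."""
    rng = np.random.default_rng(seed)
    S = np.asarray(X)
    while len(S) > target_size and len(S) >= 2:
        S = S[signs(S, rng) > 0]
    return S
```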
Phillips and Tai (2018b, Theorem 4) use a state-of-the-art algorithm from Bansal et al. (2018) called the Gram–Schmidt walk to give strong bounds on the kernel discrepancy of bounded and Lipschitz kernels that are positive definite and decay rapidly away from the diagonal. With a careful handling of the Lipschitz constant and error in their argument when the bandwidth is set to be polynomially small in $n$, their techniques yield the following result applied to the kernel $K^*$. For completeness we give details of the argument in the Appendix.
Theorem 7.
This result also applies to more general kernels, for example, the univariate Gaussian kernel. We suspect that this is the best result achievable by discrepancy-based methods. In particular, for nonnegative univariate kernels with fast decay in the real and Fourier domains, such as the Gaussian kernel, Theorem 5 implies that this rate is optimal for estimating Lipschitz densities with uniformly weighted coreset KDEs.
In contrast, the Carathéodory coreset KDE of Theorem 2 only needs cardinality of order $n^{\frac{d}{2\beta+d}}$ to be a minimax estimator. By Theorem 4, this result is nearly optimal for coreset KDEs with bounded kernels and weights. And as with the other three methods described, our construction is computationally efficient. Hence, allowing more general weights results in more powerful coreset KDEs for the problem of density estimation.
Appendix A PROOFS FROM SECTION 2
A.1 Proof of Lemma 1
Here we prove Lemma 1, restated below for convenience.
Lemma.
Let $m \le n / (C \log n)$ for a sufficiently large absolute constant $C > 0$, and let $c > 0$ denote a sufficiently small constant. Then for all $n$ and all densities $f$ under consideration, the event that for every $j \in [m]$ there exists some $X_i^{(1)}$ in bin $I_j$ holds with probability at least $1 - n^{-c}$.
Proof.
Note that as a univariate density because . Hence, satisfies
for some absolute constants and . If for , then
(12) |
Thus for all ,
(13) |
It follows that for all ,
(14) |
Let denote the event that every bin contains at least one observation . By the union bound,
By (14), choosing small enough ensures that for all . In fact, by (12) one may take . Hence, setting for sufficiently large, we have
∎
A.2 Proof of the lower bound in Theorem 1
In this section, $X_1, \dots, X_n$ denotes the sample. It is convenient to consider a more general family of decorated coreset-based estimators. A decorated coreset consists of a coreset along with a data-dependent binary string of length $k$. A decorated coreset-based estimator is then given by a measurable function applied to the decorated coreset. As with coreset-based estimators, we require that this function is invariant under permutation of the coreset vectors. We slightly abuse notation and refer to the corresponding channel as a decorated coreset scheme and to the resulting function as the decorated coreset-based estimator. The next proposition implies the lower bound in Theorem 1 on setting $k = 0$, in which case a decorated coreset-based estimator is just a coreset-based estimator. This more general framework allows us to prove Theorem 4 on lower bounds for weighted coreset KDEs.
Proposition 2.
Let denote a decorated coreset-based estimator with decorated coreset scheme such that . Then
A.2.1 Choice of function class
Fix such that is integral to be chosen later. Let label the points in , where denotes the all-ones vector of . We consider a class of functions of the form indexed by . Here, is defined to be
where is -Hölder smooth of order , has , and has .
Informally, the function puts a bump on the uniform distribution over a given cell if and only if the corresponding coordinate of the index is $1$. Using a standard argument (Tsybakov, 2009, Chapter 2), we can construct a packing which results in a subclass of the function class such that
(i) the functions in the packing are pairwise well separated in $L^2$, and
(ii) the packing is large in the sense that its log-cardinality is proportional to the number of bumps.
A.2.2 Minimax lower bound
Using standard reductions from estimation to testing, we obtain that
(15) |
where the infimum in the last line is over all tests of the form for a decorated coreset scheme and a measurable function .
Let denote a random variable that is distributed uniformly over and observe that
where denotes the joint distribution of characterized by the conditional distribution which is assumed to have density for all .
Next, by Fano’s inequality (Cover and Thomas, 2006, Theorem 2.10.1) and the chain rule, we have
(16) |
where denotes the mutual information between and and we used the fact that the entropy of is . Therefore, it remains to control . To that end, note that it follows from the data processing inequality that
where and denote the distributions of , and respectively and observe that is the mixture distribution given by for and . Denote by the mixed density of , where the continuous component is with respect to the Lebesgue measure on . Denote by the mixed density of the uniform mixture of these:
By a standard information-theoretic inequality, for all measures it holds that
(17) |
In fact, we have equality precisely when , and (17) follows immediately from the nonnegativity of the KL-divergence. Setting , for all we have
(18) |
Our next goal is to bound the first term on the right-hand-side above.
Lemma 2.
For any , we have
Proof.
Let denote the distribution of the (undecorated) coreset , and note that the density of this distribution is given by . Then because the logarithm is increasing,
By the union bound,
It follows readily that . Next, let be a random variable with density and note that
where in the last inequality, we use the fact that . The lemma follows. ∎
Appendix B PROOFS FROM SECTION 3
B.1 Proof of Proposition 1
We restate the result below.
Proposition.
Let $K$ denote a kernel with $\|K\|_2 < \infty$ such that $K$ and its Fourier transform $\hat K$ decay rapidly at infinity, and suppose that the KDE
$$\hat f_n(x) = \frac{1}{n} \sum_{j=1}^{n} K_h(x - X_j)$$
with bandwidth $h \asymp n^{-\frac{1}{2\beta+d}}$ satisfies
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \hat f_n - f \big\|_2 \lesssim n^{-\frac{\beta}{2\beta+d}}.$$
Then the Carathéodory coreset estimator $\tilde f$ constructed from $\hat f_n$ with cutoff frequency $M \asymp 1/h$ satisfies
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \lesssim n^{-\frac{\beta}{2\beta+d}}.$$
Let $g$ denote a cutoff function that has the following properties: $g \in C^\infty(\mathbb{R}^d)$, $g \equiv 1$ on $[-1, 1]^d$, and $g$ is compactly supported on $[-2, 2]^d$.
Lemma 3.
Let where . Then
Proof.
∎
The triangle inequality and the previous lemma yield the next result.
Lemma 4.
Let denote a kernel such that . Recall the definition of from Lemma 3. Let , and let
denote where and . Let
Then
Next we show that is well approximated by its Fourier expansion on . Since is a smooth periodic function on , it is expressed in as a Fourier series on . Thus we bound the tail of this expansion. In what follows, is a multi-index and
denotes the (rescaled) Fourier transform on , where .
Lemma 5.
Suppose that the kernel . Let , and define
Then
Proof.
Observe that for , it holds that
Therefore,
(19) |
where in the last line we used Parseval’s identity. For any multi-index with ,
(20) |
where we used that the derivatives of are bounded. Next by Parseval’s identity,
(21) |
For , we have
(22) |
as desired.
∎
Applying the previous lemma and linearity of the Fourier transform, we have the next corollary that gives an expansion for a general KDE on the smaller domain .
Corollary 2.
Now we have all the ingredients needed to prove Proposition 1.
Proof of Proposition 1 .
Let
and
Also consider their expansions and as defined in Lemma 5. Observe that, by construction of the Carathéodory coreset,
In what follows, is computed on . By the triangle inequality,
(23) |
On the right-hand side of the first line, the first and last terms are bounded via Lemma 4. The second and fourth terms are bounded via Lemma 5, and the third term is $0$ by construction of the Carathéodory coreset. By our choice of $M$ and the decay properties of the kernel, we have
The conclusion follows by the hypothesis on , the previous display, and the triangle inequality. ∎
B.2 Proof of Theorem 2
We restate Theorem 2 here for convenience.
Theorem.
Let $\beta, L > 0$. The Carathéodory coreset estimator $\tilde f$ built using the kernel $K^*$ and setting $M \asymp 1/h \asymp n^{\frac{1}{2\beta + d}}$ satisfies
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \lesssim_{\beta, d} n^{-\frac{\beta}{2\beta+d}}.$$
The corresponding coreset has cardinality
$$m \lesssim_{\beta, d} n^{\frac{d}{2\beta + d}}.$$
Proof.
Our goal is to apply Proposition 1 to . First we show that the standard KDE built from attains the minimax rate on . The Fourier condition
implies that is a kernel of order (Tsybakov, 2009, Definition 1.3). Since , it remains to show that the ‘moments’ of order at most of vanish. In fact all of the moments vanish. We have, expanding the exponential and using the multinomial formula,
Since $\hat{K}^* \equiv 1$ in a neighborhood of the origin, it follows that all of these terms vanish. Thus $K^*$ is a kernel of order $\ell$ for all $\ell \ge 1$, and the standard KDE on all of the dataset with bandwidth $h \asymp n^{-\frac{1}{2\beta+d}}$ attains the rate $n^{-\frac{\beta}{2\beta+d}}$ of estimation over $\mathcal{P}_H(\beta)$ (see e.g. Tsybakov, 2009, Theorem 1.2).
Next, for . This is because
Moreover for all , . By Parseval’s identity,
because has compact support (see e.g. Katznelson, 2004, Chapter VI).
All of the hypotheses of Proposition 1 are satisfied, so we apply the result with
to derive Theorem 2.
∎
B.3 Proof of Corollary 1
Corollary.
Let $\beta, L > 0$ and $m \le n$. The Carathéodory coreset estimator $\tilde f$ built using the kernel $K^*$, setting $M \asymp m^{1/d}$ and $h \asymp 1/M$, satisfies
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \lesssim_{\beta, d}\ n^{-\frac{\beta}{2\beta+d}} \vee m^{-\frac{\beta}{d}},$$
and the corresponding coreset has cardinality $O(m)$.
Proof.
Recall from the proof of Theorem 2 that is a kernel of all orders. By a standard bias-variance trade-off (see e.g. Tsybakov, 2009, Section 1.2), it holds that for the KDE with bandwidth (on the entire dataset)
(24) |
Moreover, from (23) applied to , setting , we get
(25) |
Choosing
(assuming without loss of generality that is sufficiently small so that ), then the triangle inequality, (24), (25), and the upper bound on yield the conclusion of Corollary 1.
∎
B.4 Proof of Theorem 4
For convenience, we restate Theorem 4 here.
Theorem.
Let $\beta, L, C > 0$. Let $K$ denote a kernel with $\|K\|_2 < \infty$. Let $\tilde f$ denote a weighted coreset KDE with bandwidth $h > 0$ built from a coreset of cardinality $m$ with weights satisfying $\|w\|_\infty \le n^C$. Then
$$\sup_{f \in \mathcal{P}_H(\beta)} \mathbb{E}_f \big\| \tilde f - f \big\|_2 \gtrsim_{\beta, d, C}\ n^{-\frac{\beta}{2\beta+d}} \vee (m \log n)^{-\frac{\beta}{d}}.$$
Proof.
Let and let . Observe that
(26) |
Using this we develop a decorated coreset-based estimator (see Section A.2) that approximates well. Set for sufficiently small and to be chosen later. Order the points of the coreset according to their first coordinate. This gives rise to an ordering so that
denote the elements of . Let denote the correspondingly reordered collection of weights so that
Construct a -net with respect to the sup-norm on the set . Observe that
(27) |
Define to be the smallest integer larger than the right-hand-side above. Then we can construct a surjection . Note that is constructed before observing any data: it simply labels the elements of the -net by strings of length .
Given , define as follows:
1. Let $w'$ denote the closest element in the net to the reordered weight vector.
2. Choose a binary string that maps to $w'$ under the surjection.
3. Define the decorated coreset to be the coreset together with this binary string.
4. Order the points of the coreset by their first coordinate. Pair the $i$-th element of the coreset with the $i$-th coordinate of $w'$, and define
Appendix C PROOFS FROM SECTION 4
Notation: Given a set of points (not necessarily a sample), we let
denote the uniformly weighted KDE on .
C.1 Proof of Theorem 5
Theorem.
Let denote a nonnegative kernel satisfying
for some . Suppose that . If
then
The infimum above is over all possible choices of bandwidth and all coreset schemes of cardinality at most .
The proof of Theorem 5 follows directly from Propositions 3 and 4, which are presented in Sections C.1.1 and C.1.2, respectively.
C.1.1 Small bandwidth
First we show that uniformly weighted coreset KDEs on points poorly approximate densities that are very close to everywhere.
Lemma 6.
Let denote a uniformly weighted coreset KDE built from an even kernel with bandwidth on points . Suppose that quantiles satisfy
(29) | ||||
(30) |
Let denote an interval where
(31) |
and suppose that satisfies
(32) |
for all .
Then
Proof.
Let denote the number of such that . The argument proceeds in two cases. With foresight, we set . Also let and .
Proposition 3.
Let . Let denote an absolute constant. Let denote a uniformly weighted coreset KDE with bandwidth built from a kernel on . Suppose that for some absolute constants . If , then for
it holds that
(34) |
Proof.
Let
where is a normalizing constant so that . Observe that . Our first goal is to show that
holds for all and for all , where is an absolute constant to be determined.
We apply Lemma 6 to the density . Let be defined as in Lemma 6, and set and . Set . Let
The function satisfies the bounds (32) from Lemma 6. Observe that the length of is
We set the parameter in Lemma 6 to be
By the decay assumption on , we may set
Therefore,
(35) | ||||
(36) | ||||
(37) |
for sufficiently large, because we assume , , and . Hence, condition (31) is satisfied for in the specified range, so we apply Cauchy–Schwarz and Lemma 6 to conclude that for all and ,
(38) |
Suppose first that . Then clearly the right-hand side of (38) is for . Otherwise, we have for all that if is in the range
then (38) implies
(39) |
Moreover, a uniformly weighted coreset KDE on points can be expressed as a uniformly weighted coreset KDE on points by setting some of the ’s to be duplicates. Hence (39) holds for all . Since is a decreasing function of , it follows that (39) holds for all and , as desired.
∎
C.1.2 Large bandwidth
Lemma 7.
Let , and let denote the uniformly weighted coreset KDE on with bandwidth . Suppose that is an odd function supported on . Let denote the density
Then
(40) |
Proof.
Let and . Observe that
(41) |
because is an odd function. Next, using ,
(42) |
By the triangle inequality and Parseval’s formula,
Proposition 4.
Let for some absolute constant . Let denote a uniformly weighted coreset KDE with bandwidth built from a kernel on . Suppose that . If for sufficiently large, then for all it holds that
(45) |
C.2 Proof of Theorem 6
Theorem.
Fix and a nonnegative kernel on satisfying the following fast decay and smoothness conditions:
(46) | ||||
(47) |
where we recall that denotes the Fourier transform. Let be the uniformly weighted coreset KDE. Then there exists such that for and any and , we have
Proof.
We follow a similar strategy to the proof of Theorem 5 by handling the cases of small and large bandwidth separately.
Let be the minimum number such that . By the assumption in the theorem, there exists such that
Note that we can set large such that for any , there exists such that for . We first show that for any given and , we have
(48) |
Let be an arbitrary function in such that
Let be the set of for which .
Case 1: . Since , we have
On the other hand,
therefore,
Case 2: . Define
and
Note that to verify (48) we only need to consider the event of , in which case
Moreover since we see that . Now define
Then for , we have
while for we have
Thus,
On the other hand, by the union bound we see that the Lebesgue measure of is at least
where we used the fact that . Then
and hence
This concludes the proof of (48).
The second step is to show that for given and , we have
(49) |
sufficiently large and to be determined later, and is such that
whose existence is guaranteed by the assumption of the theorem. Let be a smooth, even, nonnegative function supported on satisfying . Define
where is chosen so that . Then , and in particular when for some . Moreover we can find such that for all . Now
(50) |
where (50) used the fact that is even. Since , there exists such that
for any . Now define
There exists such that whenever . With the choice of , we can continue lower bounding (50) as (for ):
Finally, we collect the results for step 1 and step 2. First observe that the main term in the risk in step 1 can be simplified as
(51) |
where denotes the event in the left side of (51).
Thus up to multiplicative constant depending on , , we can lower bound the risk by taking the max of the risks in the two steps:
(52) |
whenever . We can use the distributive law to open up the parentheses in (52). By checking the and cases respectively, it is easy to verify that
Next, if is true, we evidently have
If is not true, then , and we have
In either case the risk with respect to is . It remains to convert this to a lower bound in .
We consider two cases. First note that by the fast decay condition on the Fourier transform, . Let denote a constant such that
(53) |
Set .
Case 1: .
Let , and let . If , then because and by the exponential decay of ,
for sufficiently large. Thus by Cauchy–Schwarz,
Case 2:
In this case, is nearly constant for all . By (53) and Taylor’s theorem,
for all and for all . Hence, for all , using ,
For large enough, we see that for the function with ,
∎
Appendix D PROOFS FROM SECTION 5
D.1 Proof of Theorem 7
The result is restated below.
Theorem.
Proof.
Here we adapt the results in Section 2 of Phillips and Tai (2018b) to our setting where the bandwidth is shrinking. Using their notation, we define and study the kernel discrepancy of the kernel . First we verify the assumptions on the kernel (bounded influence, Lipschitz, and positive semidefiniteness) needed to apply their results.
First, the kernel is bounded influence (see Phillips and Tai, 2018b, Section 2) with constant and , which means that
if . This follows from the fast decay of .
Note that if and differ on a single coordinate , then
because for all and the function is -Lipschitz for some constant . Hence by the triangle and Cauchy–Schwarz inequalities, the function is Lipschitz:
Therefore the kernel is Lipschitz (see Phillips and Tai, 2018b) with constant . Moreover, the kernel is positive semidefinite because the Fourier transform of is nonnegative.
Given the shrinking bandwidth , we slightly modify the lattice used in Phillips and Tai (2018b, Lemma 1). Define the lattice
where
The calculation at the top of page 6 of Phillips and Tai (2018b, Lemma 1) yields
where is the closest point to in the lattice , and assigns either or to each element of . Moreover, with the bounded influence of , if
then
On defining
we see that
for all signings . This is precisely the conclusion of Phillips and Tai (2018b, Lemma 1).
This established, the positive definiteness and bounded diagonal entries of and Phillips and Tai (2018b, Lemmas 2 and 3) imply that
Given , the halving algorithm can be applied to as in Phillips and Tai (2018b, Corollary 5) to yield a coreset of size such that
Rescaling by , we have
Recall from Section B.2 that the KDE built from $K^*$ attains the minimax rate of estimation over $\mathcal{P}_H(\beta)$. Thus, setting the bandwidth accordingly, we get a coreset that attains the minimax rate, as desired. Moreover, by the results of Phillips and Tai (2018b), this coreset can be constructed in polynomial time.
∎
Acknowledgments We thank Cole Franks for helpful discussions regarding algorithmic aspects of Carathéodory’s theorem.
References
- Agarwal et al. [2005] Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1–30, 2005.
- Ashtiani et al. [2020] Hassan Ashtiani, Shai Ben-David, Nicholas JA Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of gaussian mixtures via compression schemes. Journal of the ACM (JACM), 67(6):1–42, 2020.
- Bach et al. [2012] Francis R Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In ICML, 2012.
- Bachem et al. [2017] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
- Bachem et al. [2018] Olivier Bachem, Mario Lucic, and Andreas Krause. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1119–1127, 2018.
- Bansal et al. [2018] Nikhil Bansal, Daniel Dadush, Shashwat Garg, and Shachar Lovett. The gram-schmidt walk: a cure for the banaszczyk blues. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 587–597, 2018. 10.1145/3188745.3188850. URL https://doi.org/10.1145/3188745.3188850.
- Bubeck [2015] Sébastien Bubeck. Convex optimization: algorithms and complexity. Now Publishers Inc., 2015.
- Carathéodory [1907] C. Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64:95–115, March 1907. URL https://doi.org/10.1007/bf01449883.
- Chazelle and Matoušek [1996] Bernard Chazelle and Jiří Matoušek. On linear-time deterministic algorithms for optimization problems in fixed dimension. Journal of Algorithms, 21(3):579–597, 1996.
- Chazelle [2000] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, Cambridge, 2000.
- Claici et al. [2020] Sebastian Claici, Aude Genevay, and Justin Solomon. Wasserstein measure coresets. arXiv preprint arXiv:1805.07412, 2020.
- Clarkson [2010] Kenneth L Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):1–30, 2010.
- Cover and Thomas [2006] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.
- Feldman et al. [2011] Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In Advances in neural information processing systems, pages 2142–2150, 2011.
- Feldman et al. [2013] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1434–1453. SIAM, 2013.
- Frahling and Sohler [2005] Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’05, pages 209–217, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1581139608. 10.1145/1060590.1060622. URL https://doi.org/10.1145/1060590.1060622.
- Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
- Gärtner and Jaggi [2009] Bernd Gärtner and Martin Jaggi. Coresets for polytope distance. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 33–42, 2009.
- Grötschel et al. [2012] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
- Hanneke et al. [2019] Steve Hanneke, Aryeh Kontorovich, and Menachem Sadigurschi. Sample compression for real-valued learners. In Algorithmic Learning Theory, pages 466–488, 2019.
- Har-Peled and Kushal [2007] Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3–19, 2007.
- Harvey and Samadi [2014] Nick Harvey and Samira Samadi. Near-optimal herding. In Conference on Learning Theory, pages 1165–1182, 2014.
- Hiriart-Urruty and Lemaréchal [2004] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2004.
- Huggins et al. [2016] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.
- Joshi et al. [2011] Sarang Joshi, Raj Varma Kommaraji, Jeff M. Phillips, and Suresh Venkatasubramanian. Comparing distributions and shapes using the kernel distance. In Proceedings of the Twenty-Seventh Annual Symposium on Computational Geometry, SoCG ’11, page 47–56, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306829. 10.1145/1998196.1998204. URL https://doi.org/10.1145/1998196.1998204.
- Karnin and Liberty [2019] Zohar Karnin and Edo Liberty. Discrepancy, coresets, and sketches in machine learning. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1975–1993, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/karnin19a.html.
- Katznelson [2004] Yitzhak Katznelson. An Introduction to Harmonic Analysis. Cambridge Mathematical Library. Cambridge University Press, 3 edition, 2004. 10.1017/CBO9781139165372.
- Littlestone and Warmuth [1986] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Unpublished manuscript, 1986.
- Lopez-Paz et al. [2015] David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, and Iliya Tolstikhin. Towards a learning theory of cause-effect inference. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1452–1461, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/lopez-paz15.html.
- Matoušek [1999] J. Matoušek. Geometric Discrepancy: an Illustrated Guide. Springer, New York, 1999.
- McDonald [2017] Daniel McDonald. Minimax Density Estimation for Growing Dimension. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 194–203, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/mcdonald17a.html.
- Munteanu et al. [2018] Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruff. On coresets for logistic regression. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 6562–6571, Red Hook, NY, USA, 2018. Curran Associates Inc.
- Phillips [2013] Jeff M Phillips. ε-samples for kernels. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1622–1632. SIAM, 2013.
- Phillips and Tai [2018a] Jeff M Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2718–2727. SIAM, 2018a.
- Phillips and Tai [2018b] Jeff M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. In 34th International Symposium on Computational Geometry, SoCG 2018, June 11-14, 2018, Budapest, Hungary, pages 66:1–66:13, 2018b. 10.4230/LIPIcs.SoCG.2018.66. URL https://doi.org/10.4230/LIPIcs.SoCG.2018.66.
- Tikhomirov [1993] V. M. Tikhomirov. -Entropy and -Capacity of Sets In Functional Spaces, pages 86–170. Springer Netherlands, Dordrecht, 1993. ISBN 978-94-017-2973-4. 10.1007/978-94-017-2973-4_7. URL https://doi.org/10.1007/978-94-017-2973-4_7.
- Tsybakov [2009] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer series in statistics. Springer, 2009. ISBN 978-0-387-79051-0. 10.1007/b13794. URL https://doi.org/10.1007/b13794.
- Zheng and Phillips [2017] Yan Zheng and Jeff M Phillips. Coresets for kernel regression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 645–654, 2017.