
A Statistical Perspective on Coreset Density Estimation

Paxton Turner ([email protected]), Jingbo Liu ([email protected]), and Philippe Rigollet ([email protected])

P.R. was supported by NSF awards IIS-1838071, DMS-1712596, DMS-1740751, and DMS-2022448.

Massachusetts Institute of Technology and University of Illinois at Urbana-Champaign

Paxton Turner
Department of Mathematics
Massachusetts Institute of Technology
77 Massachusetts Avenue,
Cambridge, MA 02139-4307, USA
Jingbo Liu
Department of Statistics
University of Illinois at Urbana-Champaign
725 S. Wright St.,
Champaign, IL 61820, USA
Philippe Rigollet
Department of Mathematics
Massachusetts Institute of Technology
77 Massachusetts Avenue,
Cambridge, MA 02139-4307, USA
Abstract

Coresets have emerged as a powerful tool to summarize data by selecting a small subset of the original observations while retaining most of their information. This approach has led to significant computational speedups, but the performance of statistical procedures run on coresets is largely unexplored. In this work, we develop a statistical framework to study coresets and focus on the canonical task of nonparametric density estimation. Our contributions are twofold. First, we establish the minimax rate of estimation achievable by coreset-based estimators. Second, we show that practical coreset kernel density estimators are near-minimax optimal over a large class of Hölder-smooth densities.

AMS subject classifications: 62G07, 68Q32

Keywords: data summarization, kernel density estimator, Carathéodory's theorem, minimax risk, compression

1 Introduction

The ever-growing size of datasets that are routinely collected has led practitioners across many fields to contemplate effective data summarization techniques that aim at reducing the size of the data while preserving the information that it contains. While there are many ways to achieve this goal, including standard data compression algorithms, they often prevent direct manipulation of data for learning purposes. Coresets have emerged as a flexible and efficient set of techniques that permit direct data manipulation. Coresets are well-studied in machine learning (Har-Peled and Kushal, 2007; Feldman et al., 2013; Bachem et al., 2017, 2018; Karnin and Liberty, 2019), statistics (Feldman et al., 2011; Zheng and Phillips, 2017; Munteanu et al., 2018; Huggins et al., 2016; Phillips and Tai, 2018a, b), and computational geometry (Agarwal et al., 2005; Clarkson, 2010; Frahling and Sohler, 2005; Gärtner and Jaggi, 2009; Claici et al., 2020).

Given a dataset $\mathcal{D}=\{X_{1},\ldots,X_{n}\}\subset\mathbb{R}^{d}$ and a task (density estimation, logistic regression, etc.), a coreset $\mathcal{C}$ is given by $\mathcal{C}=\{X_{i}\,:\,i\in S\}$ for some subset $S$ of $\{1,\ldots,n\}$ of size $|S|\ll n$. A good coreset should suffice to perform the task at hand with the same accuracy as the whole dataset $\mathcal{D}$.

In this work we study the canonical task of density estimation. Given i.i.d. random variables $X_{1},\ldots,X_{n}\sim\mathbb{P}_{f}$ that admit a common density $f$ with respect to the Lebesgue measure over $\mathbb{R}^{d}$, the goal of density estimation is to estimate $f$. It is well known that the minimax rate of estimation over the $L$-Hölder smooth densities $\mathcal{P}_{\mathcal{H}}(\beta,L)$ of order $\beta$ is given by

$$\inf_{\hat{f}}\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}-f\rVert_{2}=\Theta_{\beta,d,L}\big(n^{-\frac{\beta}{2\beta+d}}\big)\,, \qquad (1)$$

where the infimum is taken over all estimators based on the dataset $\mathcal{D}$. Moreover, the minimax rate above is achieved by a kernel density estimator

$$\hat{f}_{n}(x):=\frac{1}{nh^{d}}\sum_{j=1}^{n}K\left(\frac{X_{j}-x}{h}\right) \qquad (2)$$

for suitable choices of kernel $K:\mathbb{R}^{d}\to\mathbb{R}$ and bandwidth $h>0$ (see e.g. Tsybakov, 2009, Theorem 1.2).
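To fix ideas, the estimator (2) is only a few lines of code. The following sketch (our illustration, not taken from the paper) evaluates a product-kernel KDE in Python with the rate-optimal bandwidth $h=n^{-1/(2\beta+d)}$; the Epanechnikov kernel used here is an assumption, appropriate for $\beta\leq 1$, while larger $\beta$ requires higher-order kernels.

    import numpy as np

    def kde(X, y, beta=1.0, kappa=lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None)):
        """Evaluate the KDE (2) at a point y in R^d.

        X has shape (n, d); kappa is a univariate kernel (Epanechnikov here),
        and K(u) = prod_i kappa(u_i) is the corresponding product kernel.
        """
        n, d = X.shape
        h = n ** (-1.0 / (2 * beta + d))      # minimax bandwidth
        U = (X - y[None, :]) / h              # scaled differences (X_j - y) / h
        return np.prod(kappa(U), axis=1).sum() / (n * h**d)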

The main goal of this paper is to extend this understanding of rates for density estimation to estimators based on coresets. Specifically, we would like to characterize the statistical performance of coresets in terms of their cardinality. To do so, we investigate two families of estimators built on coresets: one that is quite flexible and allows arbitrary estimators to be run on the coreset, and another that is more structured and driven by practical considerations, consisting of weighted kernel density estimators built on coresets.

1.1 Two statistical frameworks for coreset density estimation

We formally define a coreset as follows. Throughout this work, $m=o(n)$ denotes the cardinality of the coreset. Let $S=S(y|x)$ denote a conditional probability measure on $\binom{[n]}{m}$, where $x\in\mathbb{R}^{d\times n}$. In information-theoretic language, $S$ is a channel from $\mathbb{R}^{d\times n}$ to subsets of cardinality $m$. We refer to the channel $S$ as a coreset scheme because it designates a data-driven method of choosing a subset of data points. In what follows, we abuse notation and let $S=S(x)$ denote an instantiation of a sample from the measure $S(y|x)$ for $x\in\mathbb{R}^{d\times n}$. A coreset $X_{S}$ is then defined to be the projection of the dataset $X=(X_{1},\ldots,X_{n})$ onto the subset indicated by $S(X)$: $X_{S}:=\{X_{i}\}_{i\in S(X)}$.

The first family of estimators that we investigate is quite general and allows the statistician to select a coreset and then employ an estimator that only manipulates data points in the coreset to estimate an unknown density. To study coresets, it is convenient to make the dependence of estimators on observations more explicit than in the traditional literature. More specifically, a density estimator $\hat{f}$ based on $n$ observations $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ is a function $\hat{f}:\mathbb{R}^{d\times n}\to L_{2}(\mathbb{R}^{d})$ denoted by $\hat{f}[X_{1},\ldots,X_{n}](\cdot)$. Similarly, a coreset-based estimator $\hat{f}_{S}$ is constructed from a coreset scheme $S$ of size $m$ and an estimator (measurable function) $\hat{f}:\mathbb{R}^{d\times m}\to L_{2}(\mathbb{R}^{d})$ on $m$ observations. We enforce the additional restriction on $\hat{f}$ that for all $y_{1},\ldots,y_{m}\in\mathbb{R}^{d}$ and for all bijections $\pi:[m]\to[m]$, it holds that $\hat{f}[y_{1},\ldots,y_{m}](\cdot)=\hat{f}[y_{\pi(1)},\ldots,y_{\pi(m)}](\cdot)$. Given $S$ and $\hat{f}$ as above, we define the coreset-based estimator $\hat{f}_{S}:\mathbb{R}^{d\times n}\to L_{2}(\mathbb{R}^{d})$ to be the function $\hat{f}_{S}[X](\cdot):=\hat{f}[X_{S}](\cdot):\mathbb{R}^{d}\to\mathbb{R}$. We evaluate the performance of coreset-based estimators in Section 2 by characterizing their rate of estimation over Hölder classes.¹

¹Our notion of coreset-based estimators bears conceptual similarity to various notions of compression schemes studied in the literature, e.g. Littlestone and Warmuth (1986); Ashtiani et al. (2020); Hanneke et al. (2019).

The symmetry restriction on $\hat{f}$ prevents the user from exploiting information about the ordering of data points to their advantage: the only information that can be used by the estimator $\hat{f}$ is contained in the unordered collection of distinct vectors given by the coreset $X_{S}$.

As evident from the results in Section 2, the information-theoretically optimal coreset estimator does not resemble coreset estimators employed in practice. To remedy this limitation, we also study weighted coreset kernel density estimators (KDEs) in Section 3. Here the statistician selects a kernel $k$, a bandwidth parameter $h$, and a coreset $X_{S}$ of cardinality $m$ as defined above, and then employs the estimator

$$\hat{f}_{S}(y)=\sum_{j\in S}\lambda_{j}h^{-d}k\left(\frac{X_{j}-y}{h}\right),$$

where the weights $\{\lambda_{j}\}_{j\in S}$ are nonnegative, sum to one, and are allowed to depend on the full dataset.
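In code, the weighted coreset KDE differs from (2) only in that the sum runs over the coreset with data-dependent weights. A minimal sketch, reusing numpy and a univariate kernel kappa as in the example above (the interface is our assumption):

    def coreset_kde(X, S, lam, y, h, kappa):
        """Weighted coreset KDE: sum_{j in S} lam_j h^{-d} k((X_j - y) / h),
        where lam is nonnegative and sums to one."""
        d = X.shape[1]
        U = (X[S] - y[None, :]) / h
        return float(lam @ np.prod(kappa(U), axis=1)) / h**d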

In the case of uniform weights, where $\lambda_{j}=\frac{1}{m}$ for all $j\in S$, coreset KDEs are well studied (see e.g. Bach et al., 2012; Harvey and Samadi, 2014; Phillips and Tai, 2018a, b; Karnin and Liberty, 2019). Interestingly, our results show that allowing flexibility in the weights gives a definitive advantage for the task of density estimation. By Theorems 2 and 5, uniformly weighted coreset KDEs require a much larger coreset than weighted coreset KDEs to attain the minimax rate of estimation over univariate Lipschitz densities.

1.2 Setup and Notation

We reserve the notation $\lVert\cdot\rVert_{2}$ for the $L_{2}$ norm and $|\cdot|_{p}$ for the $\ell_{p}$ norm. The constants $c, c_{\beta,d}, c_{L}$, etc. vary from line to line, and the subscripts indicate parameter dependences.

Fix an integer $d\geq 1$. For any multi-index $s=(s_{1},\ldots,s_{d})\in\mathbb{Z}_{\geq 0}^{d}$ and $x=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$, define $s!=s_{1}!\cdots s_{d}!$, $x^{s}=x_{1}^{s_{1}}\cdots x_{d}^{s_{d}}$, and let $D^{s}$ denote the differential operator defined by

$$D^{s}=\frac{\partial^{|s|_{1}}}{\partial x_{1}^{s_{1}}\cdots\partial x_{d}^{s_{d}}}\,.$$

Fix a positive real number $\beta$, and let $\lfloor\beta\rfloor$ denote the maximal integer strictly less than $\beta$. Given a multi-index $s$, the notation $|s|$ signifies the coordinate-wise application of $|\cdot|$ to $s$.

Given $L>0$, we let $\mathcal{H}(\beta,L)$ denote the space of Hölder functions $f:\mathbb{R}^{d}\to\mathbb{R}$ that are supported on the cube $[-1/2,1/2]^{d}$, are $\lfloor\beta\rfloor$ times differentiable, and satisfy

$$|D^{s}f(x)-D^{s}f(y)|\leq L\left|x-y\right|_{2}^{\beta-\lfloor\beta\rfloor}$$

for all $x,y\in\mathbb{R}^{d}$ and for all multi-indices $s$ such that $|s|_{1}=\lfloor\beta\rfloor$.

Let $\mathcal{P}_{\mathcal{H}}(\beta,L)$ denote the set of probability density functions contained in $\mathcal{H}(\beta,L)$. For $f\in\mathcal{P}_{\mathcal{H}}(\beta,L)$, let $\mathbb{P}_{f}$ (resp. $\mathbb{E}_{f}$) denote the probability distribution (resp. expectation) associated to $f$.

For $d\geq 1$ and $\gamma\in\mathbb{Z}_{\geq 0}$, we also define the Sobolev class $\mathcal{S}(\gamma,L')$ consisting of all $f:\mathbb{R}^{d}\to\mathbb{R}$ that are $\gamma$ times differentiable and satisfy

$$\lVert D^{\alpha}f\rVert_{2}\leq L'$$

for all multi-indices $\alpha$ such that $|\alpha|_{1}=\gamma$.

2 Coreset-based estimators

In this section we study the performance of coreset-based estimators. Recall that coreset-based estimators are estimators that only depend on the data points in the coreset.

Define the minimax risk $\psi_{n,m}(\beta,L)$ for coreset-based estimators over $\mathcal{P}_{\mathcal{H}}(\beta,L)$ to be

$$\psi_{n,m}(\beta,L)=\inf_{\hat{f},|S|=m}\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}, \qquad (3)$$

where the infimum above is over all choices of coreset scheme $S$ of cardinality $m$ and all estimators $\hat{f}:\mathbb{R}^{d\times m}\to L_{2}(\mathbb{R}^{d})$.

Our main result on coreset-based estimators characterizes their minimax risk.

Theorem 1.

Fix $\beta,L>0$ and an integer $d\geq 1$. Assume that $m=o(n)$. Then the minimax risk of coreset-based estimators satisfies

$$\inf_{\hat{f},|S|=m}\,\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}=\Theta_{\beta,d,L}\big(n^{-\frac{\beta}{2\beta+d}}+(m\log{n})^{-\frac{\beta}{d}}\big).$$

The above theorem readily yields a characterization of the minimal size $m^{*}(\beta,d)$ that a coreset can have while still enjoying the minimax optimal rate $n^{-\frac{\beta}{2\beta+d}}$ from (1). More specifically, let $m^{*}=m^{*}(n)$ be such that

  • (i) if $m(n)$ is a sequence such that $m=o(m^{*})$, then $\liminf_{n\to\infty}n^{\frac{\beta}{2\beta+d}}\psi_{n,m}(\beta)=\infty$, and

  • (ii) if $m=\Omega(m^{*})$, then $\limsup_{n\to\infty}\psi_{n,m}(\beta)n^{\frac{\beta}{2\beta+d}}\leq C_{\beta,d,L}$ for some constant $C_{\beta,d,L}>0$.

Then it follows readily from Theorem 1 that $m^{*}=\Theta_{\beta,d,L}(n^{\frac{d}{2\beta+d}}/\log n)$.
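The value of $m^{*}$ follows from balancing the two terms in Theorem 1: the compression term dominates until

$$(m\log n)^{-\frac{\beta}{d}}\asymp n^{-\frac{\beta}{2\beta+d}}\iff m\log n\asymp n^{\frac{d}{2\beta+d}}\iff m\asymp\frac{n^{\frac{d}{2\beta+d}}}{\log n}\,.$$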

Theorem 1 illustrates two different curses of dimensionality: the first stems from the original estimation problem, and the second stems from the compression problem. As $d\to\infty$, it holds that $m^{*}\sim n/\log n$, and in this regime there is essentially no compression, as the implicit constant in Theorem 1 grows rapidly with $d$.²

²In fact, even for the classical estimation problem (1), this constant scales as $d^{d}$ (see McDonald, 2017, Theorem 3).

Our proof of the lower bound in Theorem 1 first uses a standard reduction from estimation to a multiple hypothesis testing problem over a finite function class. While Fano's inequality is the workhorse of our second step, note that the lower bound must hold specifically for coreset-based estimators and exceeds the bound that holds for arbitrary estimators in standard minimax theory. This additional difficulty is overcome by a careful handling of the information structure generated by coreset scheme channels rather than by using off-the-shelf results for minimax lower bounds. The full details of the lower bound are in the Appendix.

The estimator achieving the rate in Theorem 1 relies on an encoding procedure. It is constructed by building a dictionary between the subsets in $\binom{[n]}{m}$ and an $\varepsilon$-net on the space of Hölder functions. The key idea is that, for $\omega(1)=m\leq n/2$, the number of subsets of size $m$ is extremely large, so for $m$ large enough, there is enough information to encode a near neighbor in $L_{2}(\mathbb{R}^{d})$ of the kernel density estimator on the entire dataset.

2.1 Proof of the upper bound in Theorem 1

Fix $\varepsilon=c^{*}(m\log n)^{-\frac{\beta}{d}}$ for $c^{*}$ to be determined, and let $\mathcal{N}_{\varepsilon}$ denote an $\varepsilon$-net of $\mathcal{P}_{\mathcal{H}}(\beta,L)$ with respect to the $L_{2}([-\frac{1}{2},\frac{1}{2}]^{d})$ norm. It follows from the classical Kolmogorov–Tikhomirov bound (see, e.g., Theorem XIV of Tikhomirov, 1993) that there exists a constant $C_{\mathsf{KT}}(\beta,d,L)>0$ such that we can choose $\mathcal{N}_{\varepsilon}$ with $\log|\mathcal{N}_{\varepsilon}|\leq C_{\mathsf{KT}}(\beta,d,L)\,\varepsilon^{-d/\beta}$. In particular, there exists $\mathsf{f}\in\mathcal{N}_{\varepsilon}$ such that $\|\hat{f}_{n}-\mathsf{f}\|_{L^{2}([-1/2,1/2]^{d})}\leq\varepsilon$, where $\hat{f}_{n}$ is the minimax optimal kernel density estimator defined in (2).

We now develop our encoding procedure for $\mathsf{f}$. To that end, fix an integer $K\geq m$ such that $\binom{K}{m}\geq|\mathcal{N}_{\varepsilon}|$ and let $\phi:\binom{[K]}{m}\to\mathcal{N}_{\varepsilon}$ be any surjective map. Our procedure only looks at the first coordinates of the sample $X=\{X_{1},\ldots,X_{n}\}$. Denote these coordinates by $x=\{x_{1},\ldots,x_{n}\}$ and note that these $n$ numbers are almost surely distinct. Let $A$ denote a parameter to be determined, and define the intervals

$$B_{ik}=\big[(i-1)K^{-1}A+(k-1)A,\;(i-1)K^{-1}A+(k-1)A+K^{-1}A\big].$$

For $i=1,\ldots,K$, define

$$B_{i}=\bigcup_{k=1}^{1/A}B_{ik}.$$

The next lemma, whose proof is in the Appendix, ensures that with high probability every bin $B_{i}$ contains the first coordinate of at least one data point.

Lemma 1.

Let $K^{-1}=c(\log n)/n$ for $c>0$ a sufficiently large absolute constant, and let $A=A_{\beta,L,K}$ denote a sufficiently small constant. Then for all $f\in\mathcal{P}_{\mathcal{H}}(\beta,L)$ and $X_{1},\ldots,X_{n}\stackrel{iid}{\sim}\mathbb{P}_{f}$, the event that for every $j=1,\ldots,K$ there exists some $x_{i}$ in bin $B_{j}$ holds with probability at least $1-O(n^{-2})$.

On the high-probability event $\mathcal{E}$ that every bin $B_{i}$ contains the first coordinate of some data point, choose a unique representative $x_{j}^{\circ}\in x$ such that $x_{j}^{\circ}\in B_{j}$ and pick any $T_{\mathsf{f}}\in\phi^{-1}(\mathsf{f})$. Then define $S=\{i\,:\,x_{i}=x^{\circ}_{j},\,j\in T_{\mathsf{f}}\}$. If there exists a bin with no observation, then let $X_{S}$ consist of two data points lying in the same bin and $m-2$ random data points, and set $\hat{f}_{S}\equiv 0$.

Note that $\hat{f}_{S}$ is indeed a coreset-based estimator. The function $\hat{f}$ such that $\hat{f}_{S}=\hat{f}[X_{S}]$ looks at the $m$ data points in the coreset, and if their first coordinates lie in distinct bins, then $X_{S}$ is decoded as above to output the corresponding element $\mathsf{f}$ of the net $\mathcal{N}_{\varepsilon}$. Otherwise, $\hat{f}\equiv 0$.
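To make the dictionary concrete, one convenient realization of $\phi$ (our illustration; any surjection works) is the combinatorial number system, which bijects $\binom{[K]}{m}$ with $\{0,\ldots,\binom{K}{m}-1\}$. The subset of selected bins then encodes an integer index into the net $\mathcal{N}_{\varepsilon}$:

    from math import comb

    def index_to_subset(r, K, m):
        """Unrank r in [0, C(K, m)) to an m-subset of {0, ..., K-1}
        via the combinatorial number system (a possible phi^{-1})."""
        T, x = [], r
        for i in range(m, 0, -1):
            c = i - 1
            while comb(c + 1, i) <= x:   # largest c with C(c, i) <= x
                c += 1
            T.append(c)
            x -= comb(c, i)
        return sorted(T)

    def subset_to_index(T):
        """Rank an m-subset back to its integer index (the decoding step)."""
        return sum(comb(c, i + 1) for i, c in enumerate(sorted(T)))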

Next, it suffices to show the upper bound of Theorem 1 in the case when $m\leq cn^{d/(2\beta+d)}$ for $c$ a sufficiently small absolute constant. For $c^{*}=c^{*}_{\beta,d,L}$ sufficiently large, by Stirling's formula and our choice of $K$, it holds that

$$\log\binom{K}{m}\geq C_{\mathsf{KT}}(\beta,d,L)\,\left(\frac{1}{\varepsilon}\right)^{\frac{d}{\beta}}\geq\log|\mathcal{N}_{\varepsilon}|.$$

Hence, the surjection $\phi$ and our encoding estimator $\hat{f}_{S}$ are well-defined.

Next we have

$$\mathbb{E}_{f}\lVert\hat{f}_{S}-f\rVert_{2}=\mathbb{E}_{f}\big[\lVert\mathsf{f}-f\rVert_{2}\mathbf{1}_{\mathcal{E}}\big]+\mathbb{E}_{f}\big[\lVert 0-f\rVert_{2}\mathbf{1}_{\mathcal{E}^{c}}\big].$$

We control the first term as follows, using (1) and the fact that $\lVert\mathsf{f}-\hat{f}_{n}\rVert_{2}\leq\varepsilon$ on $\mathcal{E}$:

$$\mathbb{E}_{f}\big[\lVert\mathsf{f}-f\rVert_{2}\mathbf{1}_{\mathcal{E}}\big]\leq\mathbb{E}_{f}\lVert\hat{f}_{n}-f\rVert_{2}+\mathbb{E}_{f}\lVert\mathsf{f}-\hat{f}_{n}\rVert_{2}\leq c_{\beta,d,L}\,\big(n^{-\frac{\beta}{2\beta+d}}+(m\log n)^{-\frac{\beta}{d}}\big).$$

By the Cauchy–Schwarz inequality,

$$\mathbb{E}_{f}\big[\lVert 0-f\rVert_{2}\mathbf{1}_{\mathcal{E}^{c}}\big]\leq\big(\mathbb{E}_{f}\lVert f\rVert_{2}^{2}\,\mathbb{P}(\mathcal{E}^{c})\big)^{1/2}\leq c_{\beta,d,L}\,n^{-1}\,.$$

Put together, the previous three displays yield the upper bound of Theorem 1.

3 Coreset kernel density estimators

In this section, we consider the family of weighted kernel density estimators built on coresets and study its rate of estimation over the Hölder densities. In this framework, the statistician first computes a minimax estimator $\hat{f}$ using the entire dataset and then approximates $\hat{f}$ with a weighted kernel density estimator over the coreset. Here we allow the weights to be a measurable function of the entire dataset rather than just the coreset.

As is typical in density estimation, we consider kernels $k:\mathbb{R}^{d}\to\mathbb{R}$ of the form $k(x)=\prod_{i=1}^{d}\kappa(x_{i})$, where $\kappa$ is an even function and $\int\kappa(x)\,\mathrm{d}x=1$. Given a bandwidth parameter $h$, we define $k_{h}(x)=h^{-d}\,k(\frac{x}{h})$.

3.1 Carathéodory coreset method

Given a KDE with uniform weights and bandwidth $h$ defined by

$$\hat{f}(y)=\frac{1}{n}\sum_{j=1}^{n}k_{h}(X_{j}-y)$$

on a sample $X_{1},\ldots,X_{n}$, we define a coreset KDE $\hat{g}_{S}$ as follows in terms of a cutoff frequency $T>0$. Define $A=\{\omega\in\frac{\pi}{2}\mathbb{Z}^{d}:\,|\omega|_{\infty}\leq T\}$ and consider the complex vectors $(e^{i\langle X_{j},\omega\rangle})_{\omega\in A}$. By Carathéodory's theorem (Carathéodory, 1907), there exists a subset $S\subset[n]$ of cardinality at most $2(1+\frac{4T}{\pi})^{d}+1$ and nonnegative weights $\{\lambda_{j}\}_{j\in S}$ with $\sum_{j\in S}\lambda_{j}=1$ such that

$$\frac{1}{n}\sum_{j=1}^{n}(e^{i\langle X_{j},\omega\rangle})_{\omega\in A}=\sum_{j\in S}\lambda_{j}(e^{i\langle X_{j},\omega\rangle})_{\omega\in A}. \qquad (4)$$

Then $\hat{g}_{S}(y)$ is defined to be

$$\hat{g}_{S}(y)=\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y).$$

3.1.1 Algorithmic considerations

For a convex polyhedron $P$ with vertices $v_{1},\ldots,v_{n}\in\mathbb{R}^{D}$, the proof of Carathéodory's theorem is constructive and yields an algorithm, polynomial-time in $n$ and $D$, to find a convex combination of $D+1$ vertices that represents a given point in $P$ (Carathéodory, 1907) (see also Hiriart-Urruty and Lemaréchal, 2004, Theorem 1.3.6). For completeness, we describe below this algorithm applied to our problem. Note that, more generally, for a large class of convex bodies, Carathéodory's theorem may be implemented efficiently using standard tools from convex optimization (Grötschel et al., 2012, Chapter 6).

Set $D=2|A|\leq 2(1+\frac{4T}{\pi})^{d}$. For $j=1,\ldots,n$, let

$$v_{j}=(\mathsf{Re}\,e^{i\langle X_{j},\omega\rangle},\mathsf{Im}\,e^{i\langle X_{j},\omega\rangle})_{\omega\in A}\in\mathbb{R}^{D}.$$

Let $M$ denote the matrix with columns $(v_{1},1)^{T},\ldots,(v_{n},1)^{T}\in\mathbb{R}^{D+1}$, and let $\Delta_{n-1}\subset\mathbb{R}^{n}$ denote the standard simplex. Assume without loss of generality that $n\geq D+2$. Next,

  1. Find a nonzero vector $w\in\ker(M)$.

  2. Find $\alpha>0$ so that $\lambda_{1}:=\frac{1}{n}\mathbf{1}+\alpha w$ lies on the boundary of $\Delta_{n-1}$.

Observe that $M\lambda_{1}=(\frac{1}{n}\sum v_{i},1)^{T}$, and since $\lambda_{1}\in\partial\Delta_{n-1}$, the average is now represented using a convex combination of at most $n-1$ of the vertices $v_{1},\ldots,v_{n}$. As long as at least $D+2$ vertices remain, we can continue reducing the number of vertices used to represent $\frac{1}{n}\sum v_{j}$ by applying steps 1 and 2. Thus after at most $n-D-1$ iterations, we obtain $\lambda\in\Delta_{D}$ that satisfies $\sum\lambda_{j}v_{j}=\frac{1}{n}\sum v_{i}$, as desired. A numerical sketch of this reduction is given below.
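The following numpy sketch implements steps 1 and 2 in a loop; it is a minimal illustration under our own tolerance conventions, not an optimized routine. In our application, the rows of V would be the vectors $v_{j}$ above and the initial weights $w_{j}=1/n$.

    import numpy as np

    def caratheodory(V, w, tol=1e-12):
        """Reduce the convex combination sum_j w_j V[j] (V of shape (n, D),
        w in the simplex) to one supported on at most D + 1 rows."""
        V, w = np.asarray(V, float), np.array(w, float)
        D = V.shape[1]
        active = np.flatnonzero(w > tol)
        while active.size > D + 1:
            # columns of M are (v_j, 1); any kernel vector preserves the mean
            M = np.vstack([V[active].T, np.ones(active.size)])
            u = np.linalg.svd(M)[2][-1]               # step 1: M u = 0
            if not (u < -tol).any():
                u = -u
            neg = u < -tol
            alpha = np.min(w[active][neg] / -u[neg])  # step 2: hit the boundary
            w[active] = w[active] + alpha * u
            w[active[np.argmin(w[active])]] = 0.0     # vertex that hit zero
            active = np.flatnonzero(w > tol)
        return active, w[active] / w[active].sum()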

3.2 Results on Carathéodory coresets

Proposition 1 is key to our results and specifies conditions on the kernel guaranteeing that the Carathéodory method yields an accurate estimator.

Proposition 1.

Let $k(x)=\prod_{i=1}^{d}\kappa(x_{i})$ denote a kernel with $\kappa\in\mathcal{S}(\gamma,L')$ such that $|\kappa(x)|\leq c_{\beta,d}|x|^{-\nu}$ for some $\nu\geq\beta+d$, and such that the KDE

$$\hat{f}(y)=\frac{1}{n}\sum_{i=1}^{n}k_{h}(X_{i}-y)$$

with bandwidth $h=n^{-\frac{1}{2\beta+d}}$ satisfies

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert f-\hat{f}\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}. \qquad (5)$$

Then the Carathéodory coreset estimator $\hat{g}_{S}(y)$ constructed from $\hat{f}$ with $T=c_{d,\gamma,L'}\,n^{\frac{d/2+\beta+\gamma}{\gamma(2\beta+d)}}$ satisfies

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.$$

There exists a kernel $k_{s}\in\mathcal{C}^{\infty}$ that satisfies the conditions above for all $\beta$ and $\gamma$. We sketch the details here and postpone the full argument to the proof of Theorem 2 in the Appendix. Let $\psi:\mathbb{R}\to[0,1]$ denote a cutoff function with the following properties: $\psi\in\mathcal{C}^{\infty}$, $\psi|_{[-1,1]}\equiv 1$, and $\psi$ is compactly supported on $[-2,2]$. Define $\kappa_{s}(x)=\mathcal{F}[\psi](x)$, and let $k_{s}(x)=\prod_{i=1}^{d}\kappa_{s}(x_{i})$ denote the resulting kernel. Observe that for all $\beta>0$, the kernel $k_{s}$ satisfies

$$\operatorname{ess\,sup}_{\omega\neq 0}\frac{1-\mathcal{F}[k_{s}](\omega)}{|\omega|^{\alpha}}\leq 1,\quad\forall\alpha\preceq\beta.$$

Using standard results from Tsybakov (2009), this implies that the resulting KDE $\hat{f}_{s}$ satisfies (5). Since $\psi=\mathcal{F}^{-1}[\kappa_{s}]\in\mathcal{C}^{\infty}$, the Riemann–Lebesgue lemma guarantees that the decay condition $|\kappa_{s}(x)|\leq c_{\beta,d}|x|^{-\nu}$ is satisfied for $\nu=\lceil\beta+d\rceil$. Since $\psi$ is compactly supported, an application of Parseval's identity yields $\kappa_{s}\in\mathcal{S}(\gamma,c_{\gamma})$. Applying Proposition 1 to $k_{s}$, we conclude that for the task of density estimation, weighted KDEs built on coresets are nearly as powerful as the coreset-based estimators studied in Section 2.

Theorem 2.

Let $\varepsilon>0$. The Carathéodory coreset estimator $\hat{g}_{S}(y)$ built using the kernel $k_{s}$ and setting $T=c_{d,\beta,\varepsilon}\,n^{\frac{\varepsilon}{d}+\frac{1}{2\beta+d}}$ satisfies

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.$$

The corresponding coreset has cardinality

$$m=c_{d,\beta,\varepsilon}\,n^{\frac{d}{2\beta+d}+\varepsilon}.$$

Theorem 2 shows that the Carathéodory coreset estimator achieves the minimax rate of estimation with near-optimal coreset size. In fact, a small modification yields a near-optimal rate of convergence for any coreset size as in Theorem 1.

Corollary 1.

Let $\varepsilon>0$ and $m\leq c_{\beta,d,\varepsilon}\,n^{\frac{d}{2\beta+d}+\varepsilon}$. The Carathéodory coreset estimator $\hat{g}_{S}(y)$ built using the kernel $k_{s}$, setting $h=m^{-\frac{1}{d}+\frac{\varepsilon}{\beta}}$ and $T=c_{d}\,m^{1/d}$, satisfies

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,\varepsilon,L}\,\left(m^{-\frac{\beta}{d}+\varepsilon}+n^{-\frac{\beta}{2\beta+d}+\varepsilon}\right),$$

and the corresponding coreset has cardinality $m$.

Next we apply Proposition 1 to the popular Gaussian kernel $\phi(x)=(2\pi)^{-d/2}\exp(-\frac{1}{2}|x|_{2}^{2})$. This kernel has rapid decay in the real domain and in Fourier space, and is thus amenable to our techniques. Moreover, $k_{\phi}$ is a kernel of order $\ell=1$ (Tsybakov, 2009, Definition 1.3 and Theorem 1.2), and so the standard KDE $\hat{f}_{\phi}$ on the full dataset attains the minimax rate of estimation $c_{d,L}\,n^{-1/(2+d)}$ over the Lipschitz densities $\mathcal{P}_{\mathcal{H}}(1,L)$.

Theorem 3.

Let $\varepsilon>0$. The Carathéodory coreset estimator $\hat{g}_{\phi}(y)$ built using the kernel $\phi$ and setting $T=c_{d,\varepsilon}\,n^{\frac{1}{2+d}+\frac{\varepsilon}{d}}$ satisfies

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\mathbb{E}\lVert\hat{g}_{\phi}-f\rVert_{2}\leq c_{d,L}\,n^{-\frac{1}{2+d}}.$$

The corresponding coreset has cardinality

$$m=c_{d,\varepsilon}\,n^{\frac{d}{2+d}+\varepsilon}.$$

In addition, we have a nearly matching lower bound to Theorem 2 for coreset KDEs. In fact, our lower bound applies to a generalization of coreset KDEs where the vector of weights $(\lambda_{j})_{j}$ is not constrained to lie in the simplex but may range within a hypercube whose width may grow polynomially with $n$: $\max_{j\in S}|\lambda_{j}|\leq n^{B}$.

Theorem 4.

Let $A,B\geq 1$. Let $k$ denote a kernel with $\lVert k\rVert_{2}\leq n$. Let $\hat{g}_{S}$ denote a weighted coreset KDE with bandwidth $h\geq n^{-A}$ built from $k$ with weights $\{\lambda_{j}\}_{j\in S}$ satisfying $\max_{j\in S}|\lambda_{j}|\leq n^{B}$. Then

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}\geq c_{\beta,d,L}\left[(A+B)^{-\frac{\beta}{d}}(m\log{n})^{-\frac{\beta}{d}}+n^{-\frac{\beta}{2\beta+d}}\right].$$

This result is essentially a consequence of the lower bound in Theorem 1 because, in an appropriate sense, coreset KDEs with bounded weights are well-approximated by coreset-based estimators. Hence, in the case of bounded weights, allowing these weights to be measurable functions of the entire dataset rather than just the coreset, as would be required in Section 2, does not make a significant difference for the purpose of estimation. The full details of Theorem 4 are postponed to the Appendix.

3.3 Proof sketch of Proposition 1

Here we sketch the proof of Proposition 1, our main tool in constructing effective coreset KDEs. Full details of the argument may be found in the Appendix.

Let $k(x)=\prod_{i=1}^{d}\kappa(x_{i})$ denote a kernel, and suppose that $\hat{f}(y)=\frac{1}{n}\sum_{i=1}^{n}k_{h}(X_{i}-y)$ is a good estimator for an unknown density $f$ in the sense that

$$\lVert f-\hat{f}\rVert_{2}\leq\varepsilon:=c_{\beta,d}\,n^{-\frac{\beta}{2\beta+d}}$$

upon setting $h=n^{-1/(2\beta+d)}$. Our goal is to find a subset $S\subset[n]$ and weights $\{\lambda_{j}\}_{j\in S}$ such that

$$\frac{1}{n}\sum_{i=1}^{n}k_{h}(X_{i}-y)\approx\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y).$$

Suppose for simplicity that $\kappa$ is compactly supported on $[-1/2,1/2]$. By hypothesis and Parseval's theorem, $\kappa\in\mathcal{S}(\gamma,L')$, and we can further show that $k\in\mathcal{S}(\gamma,c_{d,L'})$ and $k_{h}\in\mathcal{S}(\gamma,c_{d,L'}h^{-d/2-\gamma})$. Let $\bar{\mathcal{F}}$ denote the Fourier transform on the interval $[-1,1]$. Using the Fourier decay of $k_{h}$, we have

$$\Big\lVert k_{h}(x)-\sum_{|\omega|_{\infty}<T}\bar{\mathcal{F}}[k_{h}](\omega)e^{i\langle x,\omega\rangle}\Big\rVert_{2}\leq\varepsilon \qquad (6)$$

when $T=\big(\frac{c_{d,\gamma,L'}h^{-\frac{d}{2}-\gamma}}{\varepsilon}\big)^{1/\gamma}=c_{d,\gamma,L'}\,n^{\frac{d/2+\beta+\gamma}{\gamma(2\beta+d)}}$. Observe that this matches the setting of $T$ in Proposition 1.

The approximation (6) implies that for $X_{i}\in[-1/2,1/2]^{d}$,

$$\hat{f}(y)\approx\sum_{|\omega|_{\infty}<T}\bar{\mathcal{F}}[k_{h}](\omega)\left(\frac{1}{n}\sum_{i=1}^{n}e^{i\langle X_{i},\omega\rangle}\right)e^{i\langle y,\omega\rangle}.$$

Using the Carathéodory coreset and weights $\{\lambda_{j}\}_{j\in S}$ constructed in Section 3.1, it follows that

$$\sum_{|\omega|_{\infty}<T}\bar{\mathcal{F}}[k_{h}](\omega)\left(\frac{1}{n}\sum_{i=1}^{n}e^{i\langle X_{i},\omega\rangle}\right)e^{i\langle y,\omega\rangle}=\sum_{|\omega|_{\infty}<T}\bar{\mathcal{F}}[k_{h}](\omega)\left(\sum_{j\in S}\lambda_{j}e^{i\langle X_{j},\omega\rangle}\right)e^{i\langle y,\omega\rangle}.$$

Applying (6) again, we see that the right-hand side is approximately equal to $\hat{g}_{S}(y)$, the estimator produced in Section 3.1. By the triangle inequality, we conclude that $\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d}\,\varepsilon$, as desired.

4 Lower bounds for coreset KDEs with uniform weights

In this section we study the performance of univariate uniformly weighted coreset KDEs

$$\hat{f}_{S}^{\mathsf{unif}}(y)=\frac{1}{m}\sum_{i\in S}k_{h}(X_{i}-y),$$

where $X_{S}$ is the coreset and $|S|=m$. The next results demonstrate that for a large class of kernels, there is a significant gap between the rate of estimation achieved by $\hat{f}_{S}^{\mathsf{unif}}$ and that of coreset KDEs with general weights. First we focus on the particular case of estimating Lipschitz densities, the class $\mathcal{P}_{\mathcal{H}}(1,L)$. For this class, the minimax rate of estimation (over all estimators) is $n^{-1/3}$, and by Theorem 2 this rate can be achieved by a weighted coreset KDE of cardinality $c_{\varepsilon}n^{1/3+\varepsilon}$ for any $\varepsilon>0$.

Theorem 5.

Let $k$ denote a nonnegative kernel satisfying

$$k(t)=O(|t|^{-(k+1)}),\quad\text{and}\quad\mathcal{F}[k](\omega)=O(|\omega|^{-\ell})$$

for some $\ell>0$ and $k>1$. Suppose that $0<\alpha<1/3$. If

$$m\leq\frac{n^{\frac{2}{3}-2\left(\alpha(1-\frac{2}{\ell})+\frac{2}{3\ell}\right)}}{\log n},$$

then

$$\inf_{h,S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{2}=\Omega_{k}\Big(\frac{n^{-\frac{1}{3}+\alpha}}{\log n}\Big). \qquad (7)$$

The infimum above is over all possible choices of bandwidth $h$ and all coreset schemes $S$ of cardinality at most $m$.

By this result, if $k$ has lighter-than-quadratic tails and fast Fourier decay, the error in (7) is a polynomial factor larger than the minimax rate $n^{-1/3}$ when $m\ll n^{2/3}$. Hence, our result covers a wide variety of kernels typically used for density estimation and shows that the uniformly weighted coreset KDE performs much worse than the encoding estimator or the Carathéodory method. In addition, for very smooth univariate kernels with rapid decay, we have the following lower bound that applies for all $\beta>0$.

Theorem 6.

Fix $\beta>0$ and a nonnegative kernel $k$ on $\mathbb{R}$ satisfying the following fast decay and smoothness conditions:

$$\lim_{s\to+\infty}\frac{1}{s}\log\frac{1}{\int_{|t|>s}k(t)\,\mathrm{d}t}>0, \qquad (8)$$

$$\lim_{\omega\to\infty}\frac{1}{|\omega|}\log\frac{1}{|\mathcal{F}[k](\omega)|}>0, \qquad (9)$$

where we recall that $\mathcal{F}[k]$ denotes the Fourier transform. Let $\hat{f}_{S}^{\mathsf{unif}}$ be the uniformly weighted coreset KDE. Then there exists $L_{\beta}>0$ such that for $L\geq L_{\beta}$, any $m$, and any $h>0$, we have

$$\inf_{h,S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{2}=\Omega_{\beta,k}\left(\frac{m^{-\frac{\beta}{1+\beta}}}{\log^{\beta+\frac{1}{2}}m}\right).$$

Therefore, attaining the minimax rate with $\hat{f}_{S}^{\mathsf{unif}}$ requires $m\geq n^{\frac{\beta+1}{2\beta+1}}$ for such kernels. Next, note that the Gaussian kernel satisfies the hypotheses of Theorems 5 and 6. As we show in Theorem 7, the results of Phillips and Tai (2018b) imply that our lower bounds are tight up to logarithmic factors: there exists a uniformly weighted Gaussian coreset KDE of size $m=\tilde{O}(n^{2/3})$ that attains the minimax rate $n^{-1/3}$ for estimating univariate Lipschitz densities ($\beta=1$). In general, we expect a lower bound $m=\Omega(n^{\frac{\beta+d}{2\beta+d}})$ to hold for uniformly weighted coreset KDEs attaining the minimax rate. The proofs of Theorems 5 and 6 can be found in the Appendix.
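Indeed, ignoring logarithmic factors, requiring the lower bound of Theorem 6 to fall below the univariate minimax rate forces

$$m^{-\frac{\beta}{1+\beta}}\lesssim n^{-\frac{\beta}{2\beta+1}}\iff m\gtrsim n^{\frac{1+\beta}{2\beta+1}}\,.$$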

5 Comparison to other methods

Three methods for constructing coreset kernel density estimators that have previously been explored are random sampling (Joshi et al., 2011; Lopez-Paz et al., 2015), the Frank–Wolfe algorithm (Bach et al., 2012; Harvey and Samadi, 2014; Phillips and Tai, 2018a), and discrepancy-based approaches (Phillips and Tai, 2018b; Karnin and Liberty, 2019). These procedures all result in a uniformly weighted coreset KDE. To compare these results with ours on the problem of density estimation, for each method under consideration we ask: how large does $m$, the size of the coreset, need to be to guarantee that

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}=O_{\beta,d,L}\left(n^{-\frac{\beta}{2\beta+d}}\right)? \qquad (10)$$

Here $\hat{g}_{S}$ is the resulting coreset KDE and the right-hand side is the minimax rate over all estimators on the full dataset $X_{1},\ldots,X_{n}$.

Uniform random sampling of a subset of cardinality $m$ yields an i.i.d. dataset, so the rate obtained is at least $m^{-\beta/(2\beta+d)}$. Hence, we must take $m=\Omega(n)$ to achieve the minimax rate.

The Frank–Wolfe algorithm is a greedy method that iteratively constructs a sparse approximation to a given element of a convex set (Frank et al., 1956; Bubeck, 2015). Thus Frank–Wolfe may be applied directly in the RKHS corresponding to a positive-semidefinite kernel, as shown in Phillips and Tai (2018b), to approximate the KDE on the full dataset. However, due to the shrinking bandwidth in our problem, this approach also requires $m=\Omega(n)$ to guarantee the bound in (10). Another strategy is to approximately solve the linear equation (4) using the Frank–Wolfe algorithm. Unfortunately, a direct implementation again uses $m=\Omega(n)$ data points.
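For concreteness, a minimal sketch of this RKHS Frank–Wolfe strategy (also known as kernel herding) with a Gaussian kernel and uniform step sizes appears below; it is our illustration of the general approach, not the algorithm analyzed by Phillips and Tai (2018b).

    import numpy as np

    def herd_coreset(X, m, h):
        """Greedily pick m points of the univariate sample X so that the
        uniform KDE on them tracks the full-data KDE in the Gaussian RKHS."""
        G = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * h**2))  # Gram matrix
        mu = G.mean(axis=1)              # mean embedding evaluated at the data
        S, run = [], np.zeros(len(X))
        for t in range(m):
            j = int(np.argmax(mu - run / (t + 1)))   # Frank-Wolfe vertex
            S.append(j)
            run += G[:, j]               # kernel columns selected so far
        return S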

A more effective strategy utilizes discrepancy theory (Phillips, 2013; Phillips and Tai, 2018b; Karnin and Liberty, 2019) (see Matoušek, 1999; Chazelle, 2000, for a comprehensive exposition of discrepancy theory). By the well-known halving algorithm (see e.g. Chazelle and Matoušek, 1996; Phillips and Tai, 2018b), if for all $N\leq n$ the kernel discrepancy

$$\mathsf{disc}_{k}=\sup_{x_{1},\ldots,x_{N}}\,\min_{\substack{\sigma\in\{-1,+1\}^{N}\\ \mathbf{1}^{T}\sigma=0}}\Big\lVert\sum_{i=1}^{N}\sigma_{i}k(x_{i}-y)\Big\rVert_{\infty}$$

is at most $D$, then there exists a coreset $X_{S}$ of size $\tilde{O}_{D}(\varepsilon^{-1})$ such that

$$\Big\lVert\frac{1}{n}\sum_{i=1}^{n}k(X_{i}-y)-\frac{1}{m}\sum_{j\in S}k(X_{j}-y)\Big\rVert_{\infty}=\tilde{O}_{D}(\varepsilon). \qquad (11)$$

The idea of the halving algorithm is to maintain a set of data points $\mathcal{C}_{\ell}$ at each iteration and then set $\mathcal{C}_{\ell+1}$ to be the set of vectors that receive sign $+1$ upon minimizing $\lVert\sum_{x\in\mathcal{C}_{\ell}}\sigma_{x}k(x-y)\rVert_{\infty}$. Starting with the original dataset and repeating this procedure $O(\log\frac{n}{m})$ times yields the desired coreset $X_{S}$ satisfying (11).
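A sketch of the halving loop is given below, with a cheap greedy balanced sign assignment standing in for the discrepancy-minimizing step (the Gram–Schmidt walk in the references); the univariate setting, the grid of test locations, and the pairing heuristic are our simplifications.

    import numpy as np

    def balanced_signs(Phi):
        """Assign +1/-1 in balanced pairs, greedily keeping the running
        signed feature sum small in sup-norm."""
        sig = np.zeros(len(Phi))
        run = np.zeros(Phi.shape[1])
        for i in range(0, len(Phi) - 1, 2):
            diff = Phi[i] - Phi[i + 1]
            s = 1.0 if np.abs(run + diff).max() <= np.abs(run - diff).max() else -1.0
            sig[i], sig[i + 1] = s, -s
            run += s * diff
        return sig

    def halving_coreset(X, m, h, grid):
        """Halve the dataset until at most m points remain."""
        C = np.array(X, float)
        while len(C) // 2 >= m:
            Phi = np.exp(-((C[:, None] - grid[None, :]) ** 2) / (2 * h**2))
            C = C[balanced_signs(Phi) > 0]
        return C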

Phillips and Tai (2018b, Theorem 4) use a state-of-the-art algorithm from Bansal et al. (2018), called the Gram–Schmidt walk, to give strong bounds on the kernel discrepancy of bounded, Lipschitz kernels $k:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ that are positive definite and decay rapidly away from the diagonal. With a careful handling of the Lipschitz constant and the error in their argument when the bandwidth is set to $h=n^{-1/(2\beta+d)}$, their techniques yield the following result applied to the kernel $k_{s}$. For completeness, we give details of the argument in the Appendix.

Theorem 7.

Let $k_{s}$ denote the kernel from Section 3.2. The algorithm of Phillips and Tai (2018b) yields in polynomial time a subset $S$ with $|S|=m=\tilde{O}(n^{\frac{\beta+d}{2\beta+d}})$ such that the uniformly weighted coreset KDE $\hat{g}_{S}$ satisfies

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert f-\hat{g}_{S}\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.$$

This result also applies to more general kernels, for example the Gaussian kernel when $\beta=1$. We suspect that this is the best result achievable by discrepancy-based methods. In particular, for nonnegative univariate kernels with fast decay in the real and Fourier domains, such as the Gaussian kernel, Theorem 5 implies that this rate is optimal for estimating Lipschitz densities with uniformly weighted coreset KDEs.

In contrast, the Carathéodory coreset KDE of Theorem 2 only needs cardinality $m=O_{\varepsilon}(n^{\frac{d}{2\beta+d}+\varepsilon})$ to be a minimax estimator. By Theorem 4, this result is nearly optimal for coreset KDEs with bounded kernels and weights. And as with the other three methods described, our construction is computationally efficient. Hence, allowing more general weights results in more powerful coreset KDEs for the problem of density estimation.

Appendix A PROOFS FROM SECTION 2

A.1 Proof of Lemma 1

Here we prove Lemma 1, restated below for convenience.

Lemma.

Let $K^{-1}=c(\log n)/n$ for $c>0$ a sufficiently large absolute constant, and let $A=A_{\beta,L,K}$ denote a sufficiently small constant. Then for all $f\in\mathcal{P}_{\mathcal{H}}(\beta,L)$ and $X_{1},\ldots,X_{n}\stackrel{iid}{\sim}\mathbb{P}_{f}$, the event that for every $j=1,\ldots,K$ there exists some $x_{i}$ in bin $B_{j}$ holds with probability at least $1-O(n^{-2})$.

Proof.

Note that the marginal density $f_{1}(x_{1})$ of the first coordinate belongs to $\mathcal{P}_{\mathcal{H}}(\beta,L)$ as a univariate density because $f(x)\in\mathcal{P}_{\mathcal{H}}(\beta,L)$. Hence, $f_{1}$ satisfies

$$|f_{1}(x)-f_{1}(y)|\leq L|x-y|^{\alpha}$$

for some absolute constants $L>0$ and $\alpha\in(0,1)$. If $B_{ik}=B_{jk}+s$ for $s\leq A$, then

$$|\mathbb{P}(B_{ik})-\mathbb{P}(B_{jk})|\leq\int_{B_{ik}}|f_{1}(x_{1})-f_{1}(x_{1}+s)|\,\mathrm{d}x_{1}\leq LK^{-1}A^{1+\alpha}. \qquad (12)$$

Thus for all $i,j$,

$$|\mathbb{P}(B_{i})-\mathbb{P}(B_{j})|\leq\sum_{k=1}^{1/A}|\mathbb{P}(B_{ik})-\mathbb{P}(B_{jk})|\leq LK^{-1}A^{\alpha}. \qquad (13)$$

It follows that for all $i=1,\ldots,K$,

$$\lim_{A\to 0}\mathbb{P}(B_{i})=K^{-1}. \qquad (14)$$

Let $\mathcal{E}$ denote the event that every bin $B_{i}$ contains at least one observation $x_{k}$. By the union bound,

$$\mathbb{P}(\mathcal{E}^{c})\leq\sum_{j=1}^{K}\mathbb{P}(X_{11}\notin B_{j})^{n}\leq K\max_{j}(1-\mathbb{P}(B_{j}))^{n}.$$

By (14), choosing $A$ small enough ensures that $\mathbb{P}(B_{j})\geq(1/2)K^{-1}$ for all $j$. In fact, by (13) it suffices to take $A=(\frac{1}{2L})^{1/\alpha}$. Hence, setting $K^{-1}=c(\log n)/n$ for $c$ sufficiently large, we have

$$\mathbb{P}(\mathcal{E}^{c})=O(n^{-2}).\qquad\square$$
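As a quick numerical sanity check (our illustration: contiguous equal-mass bins are used in place of the interleaved bins $B_{j}$, which does not change the coupon-collector effect), one can verify that with $K^{-1}=c(\log n)/n$ every bin is occupied:

    import numpy as np

    rng = np.random.default_rng(0)
    n, c = 100_000, 4.0
    K = int(n / (c * np.log(n)))            # so that K^{-1} = c log(n) / n
    x = rng.uniform(-0.5, 0.5, n)           # first coordinates under f = Unif
    counts, _ = np.histogram(x, bins=K, range=(-0.5, 0.5))
    print((counts > 0).all())               # True except with prob. O(n^{-2})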

A.2 Proof of the lower bound in Theorem 1

In this section, $X=X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ denotes the sample. It is convenient to consider a more general family of decorated coreset-based estimators. A decorated coreset consists of a coreset $X_{S}$ along with a data-dependent binary string $\sigma$ of length $R$. A decorated coreset-based estimator is then given by $\hat{f}[X_{S},\sigma]$, where $\hat{f}:\mathbb{R}^{d\times m}\times\{0,1\}^{R}\to L^{2}([-1/2,1/2]^{d})$ is a measurable function. As with coreset-based estimators, we require that $\hat{f}[x_{1},\ldots,x_{m},\sigma]$ is invariant under permutation of the vectors $x_{1},\ldots,x_{m}\in\mathbb{R}^{d}$. We slightly abuse notation and refer to the channel $S:X\to Y_{S}=(X_{S},\sigma)$ as a decorated coreset scheme and to $\hat{f}_{S}$ as the decorated coreset-based estimator. The next proposition implies the lower bound in Theorem 1 on setting $R=0$, in which case a decorated coreset-based estimator is just a coreset-based estimator. This more general framework allows us to prove Theorem 4 on lower bounds for weighted coreset KDEs.

Proposition 2.

Let $\hat{f}_{S}$ denote a decorated coreset-based estimator with decorated coreset scheme $S$ such that $\sigma\in\{0,1\}^{R}$. Then

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}\left((m\log n+R)^{-\frac{\beta}{d}}+n^{-\frac{\beta}{2\beta+d}}\right).$$

A.2.1 Choice of function class

Fix $h\in(0,1)$ such that $1/h^{d}$ is an integer, to be chosen later. Let $z_{1},\ldots,z_{1/h^{d}}$ label the points in $\{\frac{1}{2}h\cdot\mathbf{1}_{d}+h\mathbb{Z}^{d}\}\cap[-1/2,1/2]^{d}$, where $\mathbf{1}_{d}$ denotes the all-ones vector of $\mathbb{R}^{d}$. We consider a class of functions of the form $f_{\omega}(x)=1+\sum_{j=1}^{1/h^{d}}\omega_{j}g_{j}(x)$ indexed by $\omega\in\{0,1\}^{1/h^{d}}$. Here, $g_{j}(x)$ is defined to be

$$g_{j}(x)=h^{\beta}\phi\left(\frac{x-z_{j}}{h}\right),$$

where $\phi:\mathbb{R}^{d}\to\mathbb{R}$ is $L$-Hölder smooth of order $\beta$, has $\lVert\phi\rVert_{\infty}=1$, and has $\int\phi(x)\,\mathrm{d}x=0$.

Informally, $f_{\omega}$ puts a bump with amplitude $h^{\beta}$ on the uniform distribution over $z_{j}$ if and only if $\omega_{j}=1$. Using a standard argument (Tsybakov, 2009, Chapter 2), we can construct a packing $\mathcal{V}$ of $\{0,1\}^{1/h^{d}}$, which yields a subset $\mathcal{G}=\{f_{\omega}:\omega\in\mathcal{V}\}$ of the function class $\{f_{\omega}:\omega\in\{0,1\}^{1/h^{d}}\}$ such that

  • (i) $\|f-g\|_{2}\geq c_{\beta,d,L}\,h^{\beta}$ for all $f,g\in\mathcal{G}$, $f\neq g$, and

  • (ii) $\mathcal{G}$ is large in the sense that $M:=|\mathcal{G}|\geq 2^{c_{\beta,d,L}/h^{d}}$.

A.2.2 Minimax lower bound

Using standard reductions from estimation to testing, we obtain that

$$\inf_{\substack{\hat{f},|S|=m,\\ \sigma\in\{0,1\}^{R}}}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}\geq\inf_{\substack{\hat{f},|S|=m,\\ \sigma\in\{0,1\}^{R}}}\,\max_{f\in\mathcal{G}}\,\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}\,h^{\beta}\cdot\inf_{\psi_{S}}\frac{1}{M}\sum_{\omega\in\mathcal{V}}\mathbb{P}_{f_{\omega}}[\psi_{S}(X)\neq\omega], \qquad (15)$$

where the infimum in the last line is over all tests $\psi_{S}:\mathbb{R}^{d\times n}\to[M]$ of the form $\psi_{S}(X)=\psi(Y_{S})$ for a decorated coreset scheme $S$ and a measurable function $\psi:\mathbb{R}^{d\times m}\times\{0,1\}^{R}\to[M]$.

Let $V$ denote a random variable distributed uniformly over $\mathcal{V}$ and observe that

$$\frac{1}{M}\sum_{\omega\in\mathcal{V}}\mathbb{P}_{f_{\omega}}[\psi_{S}(X)\neq\omega]=\mathbb{P}[\psi_{S}(X)\neq V],$$

where $\mathbb{P}$ denotes the joint distribution of $(X,V)$ characterized by the conditional distribution $X|V=\omega$, which is assumed to have density $f_{\omega}$ for all $\omega\in\mathcal{V}$.

Next, by Fano's inequality (Cover and Thomas, 2006, Theorem 2.10.1) and the chain rule, we have

$$\mathbb{P}[\psi_{S}(X)\neq V]\geq 1-\frac{I(V;\psi_{S}(X))+1}{\log M}\,, \qquad (16)$$

where $I(V;\psi_{S}(X))$ denotes the mutual information between $V$ and $\psi_{S}(X)$, and we used the fact that the entropy of $V$ is $\log M$. Therefore, it remains to control $I(V;\psi_{S}(X))$. To that end, note that it follows from the data processing inequality that

$$I(V;\psi_{S}(X))\leq I(V;(X_{S},\sigma))=I(V;Y_{S})=\mathsf{KL}(P_{V,Y_{S}}\|P_{V}\otimes P_{Y_{S}})\,,$$

where $P_{V,Y_{S}}$, $P_{V}$, and $P_{Y_{S}}$ denote the distributions of $(V,Y_{S})$, $V$, and $Y_{S}$, respectively, and observe that $P_{Y_{S}}$ is the mixture distribution given by $P_{Y_{S}}(A,t)=M^{-1}\sum_{\omega\in\mathcal{V}}P_{f_{\omega}}(X_{S}\in A,\sigma=t)$ for $A\subset\mathbb{R}^{d\times m}$ and $t\in\{0,1\}^{R}$. Denote by $f_{\omega,Y_{S}}$ the mixed density of $P_{f_{\omega}}(X_{S}\in\cdot,\sigma=\cdot)$, where the continuous component is with respect to the Lebesgue measure on $[-1/2,1/2]^{d\times m}$. Denote by $\bar{f}_{Y_{S}}$ the mixed density of the uniform mixture of these:

$$\bar{f}_{Y_{S}}:=\frac{1}{M}\sum_{\omega\in\mathcal{V}}f_{\omega,Y_{S}}\,.$$

By a standard information-theoretic inequality, for all measures $\mathbb{Q}$ it holds that

$$\mathsf{KL}(P_{V,Y_{S}}\|P_{V}\otimes P_{Y_{S}})=\frac{1}{M}\sum_{\omega}\mathsf{KL}(P_{Y_{S}|\omega}\|\,P_{Y_{S}})\leq\frac{1}{M}\sum_{\omega}\mathsf{KL}(P_{Y_{S}|\omega}\|\,\mathbb{Q}). \qquad (17)$$

In fact, we have equality precisely when $\mathbb{Q}=P_{Y_{S}}$, and (17) follows immediately from the nonnegativity of the KL divergence. Setting $\mathbb{Q}=\mathsf{Unif}[-\frac{1}{2},\frac{1}{2}]^{d\times m}\otimes\mathsf{Unif}\{0,1\}^{R}$, for all $\omega$ we have

$$\mathsf{KL}(P_{Y_{S}|\omega}\|\,\mathbb{Q})=\sum_{t\in\{0,1\}^{R}}\,\int_{[-\frac{1}{2},\frac{1}{2}]^{d\times m}}f_{\omega,Y_{S}}(x,t)\log\frac{f_{\omega,Y_{S}}(x,t)}{2^{-R}}\,\mathrm{d}x\leq\sum_{t\in\{0,1\}^{R}}\int_{[-\frac{1}{2},\frac{1}{2}]^{d\times m}}f_{\omega,Y_{S}}(x,t)\log f_{\omega,Y_{S}}(x,t)\,\mathrm{d}x+R. \qquad (18)$$

Our next goal is to bound the first term on the right-hand side above.

Lemma 2.

For any $\omega\in\mathcal{V}$, we have

$$\sum_{t\in\{0,1\}^{R}}\int_{[-\frac{1}{2},\frac{1}{2}]^{d\times m}}f_{\omega,Y_{S}}(x,t)\log f_{\omega,Y_{S}}(x,t)\,\mathrm{d}x\leq 3m\log{n}.$$
Proof.

Let $\mathbb{P}_{X_{S}}$ denote the distribution of the (undecorated) coreset $X_{S}$, and note that the density of this distribution is given by $f_{\omega,X_{S}}(x):=\sum_{t\in\{0,1\}^{R}}f_{\omega,Y_{S}}(x,t)$. Then, because the logarithm is increasing,

$$\sum_{t\in\{0,1\}^{R}}\int f_{\omega,Y_{S}}(x,t)\log f_{\omega,Y_{S}}(x,t)\,\mathrm{d}x\leq\sum_{t\in\{0,1\}^{R}}\int f_{\omega,Y_{S}}(x,t)\log f_{\omega,X_{S}}(x)\,\mathrm{d}x=\int f_{\omega,X_{S}}(x)\log f_{\omega,X_{S}}(x)\,\mathrm{d}x.$$

By the union bound,

$$\mathbb{P}_{X_{S}}(\cdot)\leq\sum_{s\in\binom{[n]}{m}}\mathbb{P}_{X_{s}}(\cdot)=\binom{n}{m}\mathbb{P}_{X_{[m]}}(\cdot)\,.$$

It follows readily that $f_{\omega,X_{S}}(\cdot)\leq\binom{n}{m}f_{\omega,X_{[m]}}(\cdot)$. Next, let $Z\in[-1/2,1/2]^{d\times m}$ be a random variable with density $f_{\omega,X_{S}}$ and note that

$$\int f_{\omega,X_{S}}\log f_{\omega,X_{S}}=\mathbb{E}\log f_{\omega,X_{S}}(Z)\leq\log\binom{n}{m}+\mathbb{E}\log f_{\omega,X_{[m]}}(Z)\leq m\log\Big(\frac{en}{m}\Big)+m\log 2\,,$$

where in the last inequality we use the fact that $f_{\omega,X_{[m]}}=f_{\omega}^{m}\leq 2^{m}$. The lemma follows. ∎

Since $\log M\geq c_{\beta,d,L}h^{-d}$, it follows from (16)–(18) and Lemma 2 that

$$\mathbb{P}[\psi_{S}(X)\neq V]\geq 1-\frac{3m\log n+R+1}{\log M}\geq 0.5$$

upon setting $h=c_{\beta,d,L}(m\log n+R)^{-1/d}$. Plugging this value back into (15) yields

$$\inf_{\hat{f},|S|=m}\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}(m\log n+R)^{-\beta/d}\,.$$

Moreover, it follows from standard minimax theory (see e.g. Tsybakov, 2009, Chapter 2) that

$$\inf_{\hat{f},|S|=m}\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}\,.$$

Combined, the two displays above give the lower bound of Proposition 2.

Appendix B PROOFS FROM SECTION 3

B.1 Proof of Proposition 1

We restate the result below.

Proposition.

Let $k(x)=\prod_{i=1}^{d}\kappa(x_{i})$ denote a kernel with $\kappa\in\mathcal{S}(\gamma,L')$ such that $|\kappa(x)|\leq c_{\beta,d}|x|^{-\nu}$ for some $\nu\geq\beta+d$, and such that the KDE

$$\hat{f}(y)=\frac{1}{n}\sum_{i=1}^{n}k_{h}(X_{i}-y)$$

with bandwidth $h=n^{-\frac{1}{2\beta+d}}$ satisfies

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert f-\hat{f}\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.$$

Then the Carathéodory coreset estimator $\hat{g}_{S}(y)$ constructed from $\hat{f}$ with $T=c_{d,\gamma,L'}\,n^{\frac{d/2+\beta+\gamma}{\gamma(2\beta+d)}}$ satisfies

$$\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.$$

Let $\varphi:\mathbb{R}^{d}\to[0,1]$ denote a cutoff function with the following properties: $\varphi\in\mathcal{C}^{\infty}$, $\varphi|_{[-1,1]^{d}}\equiv 1$, and $\varphi$ is compactly supported on $[-2,2]^{d}$.

Lemma 3.

Let $\tilde{k}_{h}(x)=k_{h}(x)\varphi(x)$, where $|\kappa(x)|\leq c_{\beta,d}|x|^{-\nu}$. Then

$$\lVert\tilde{k}_{h}-k_{h}\rVert_{2}\leq c_{\beta,d}\,h^{-d+\nu}.$$

Proof.

$$\lVert\tilde{k}_{h}-k_{h}\rVert_{2}=\lVert(1-\varphi)k_{h}\rVert_{2}\leq\lVert(1-\mathbf{1}_{[-1,1]^{d}})k_{h}\rVert_{2}=h^{-d/2}\lVert(1-\mathbf{1}_{[-\frac{1}{h},\frac{1}{h}]^{d}})k\rVert_{2}\leq dh^{-d/2}\lVert\mathbf{1}_{|x_{1}|\geq\frac{1}{h}}\,k\rVert_{2}\leq c_{\beta,d}\,h^{-d/2}\sqrt{\int_{|x_{1}|\geq\frac{1}{h}}\kappa^{2}(x_{1})\,\mathrm{d}x_{1}}\leq c_{\beta,d}\,h^{-d+\nu}.\qquad\square$$

The triangle inequality and the previous lemma yield the next result.

Lemma 4.

Let $k$ denote a kernel such that $|\kappa(x)|\leq c_{\beta,d}|x|^{-\nu}$, and recall the definition of $\tilde{k}_{h}$ from Lemma 3. Let $X_{1},\ldots,X_{m}\in\mathbb{R}^{d}$, let

$$\hat{g}_{S}(y)=\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y),$$

where $\lambda_{j}\geq 0$ and $\mathbf{1}^{T}\lambda=1$, and let

$$\tilde{g}_{S}(y)=\sum_{j\in S}\lambda_{j}\tilde{k}_{h}(X_{j}-y).$$

Then

$$\lVert\hat{g}_{S}-\tilde{g}_{S}\rVert_{2}\leq c_{\beta,d}\,h^{-d+\nu}.$$

Next we show that $\tilde{k}_{h}$ is well approximated by its Fourier expansion on $[-2,2]^{d}$. Since $\tilde{k}_{h}$ is a smooth periodic function on $[-2,2]^{d}$, it is expressed in $L_{2}$ as a Fourier series over the frequencies $\frac{\pi}{2}\mathbb{Z}^{d}$; thus we bound the tail of this expansion. In what follows, $\alpha\in\mathbb{Z}^{d}_{\geq 0}$ is a multi-index and

$$\bar{\mathcal{F}}[f](\omega)=\frac{1}{4^{d}}\int f(x)e^{-i\langle x,\omega\rangle}\,\mathrm{d}x$$

denotes the (rescaled) Fourier transform on $[-2,2]^{d}$, where $\omega\in\frac{\pi}{2}\mathbb{Z}^{d}$.

Lemma 5.

Suppose that the kernel $k\in\mathcal{S}(\gamma,L')$. Let $A=\{\omega\in\frac{\pi}{2}\mathbb{Z}^{d}:\,|\omega|_{1}\leq T\}$, and define

$$\tilde{k}_{h}^{T}(y)=\sum_{\omega\in A}\bar{\mathcal{F}}[\tilde{k}_{h}](\omega)e^{i\langle y,\omega\rangle}.$$

Then

$$\lVert(\tilde{k}_{h}-\tilde{k}_{h}^{T})\mathbf{1}_{[-2,2]^{d}}\rVert_{2}\leq c_{\gamma,d,L'}\,T^{-\gamma}h^{-d/2-\gamma}.$$
Proof.

Observe that for $\omega\notin A$, it holds that

$$\sum_{|\alpha|_{1}=\gamma}\frac{\gamma!}{\alpha!}|\omega|^{\alpha}=(|\omega_{1}|+\cdots+|\omega_{d}|)^{\gamma}\geq T^{\gamma}.$$

Therefore,

¯[k~h](ω)1IωA2\displaystyle\lVert\mathcal{\bar{F}}[\tilde{k}_{h}](\omega){\rm 1}\kern-2.40005pt{\rm I}_{\omega\notin A}\rVert_{\ell_{2}} Tγ|α|1=γγ!α!|ω|α¯[k~h](ω)1IωA2\displaystyle\leq\,T^{-\gamma}\lVert\sum_{\left|\alpha\right|_{1}=\gamma}\frac{\gamma!}{\alpha!}\left|\omega\right|^{\alpha}\mathcal{\bar{F}}[\tilde{k}_{h}](\omega){\rm 1}\kern-2.40005pt{\rm I}_{\omega\notin A}\rVert_{\ell_{2}}
Tγ|α|1=γγ!α!ωα¯[k~h](ω)2\displaystyle\leq\,T^{-\gamma}\sum_{\left|\alpha\right|_{1}=\gamma}\frac{\gamma!}{\alpha!}\,\lVert\omega^{\alpha}\mathcal{\bar{F}}[\tilde{k}_{h}](\omega)\rVert_{\ell_{2}}
=cdTγ|α|1=γγ!α!αxαk~h(x)2,\displaystyle=\,c_{d}\,T^{-\gamma}\sum_{\left|\alpha\right|_{1}=\gamma}\frac{\gamma!}{\alpha!}\,\lVert\frac{\partial^{\alpha}}{\partial x^{\alpha}}\tilde{k}_{h}(x)\rVert_{2}, (19)

where in the last line we used Parseval’s identity. For any multi-index α\alpha with |α|1=γ\left|\alpha\right|_{1}=\gamma,

\displaystyle\lVert\frac{\partial^{\alpha}}{\partial x^{\alpha}}\tilde{k}_{h}(x)\rVert_{2} =\lVert\sum_{\eta\preceq\alpha}\binom{\alpha}{\eta}\frac{\partial^{\eta}}{\partial x^{\eta}}k_{h}(x)\,\frac{\partial^{\alpha-\eta}}{\partial x^{\alpha-\eta}}\varphi(x)\rVert_{2}
hd2γηαcd,γηxηk(x)2,\displaystyle\leq h^{-\frac{d}{2}-\gamma}\sum_{\eta\preceq\alpha}c_{d,\gamma}\,\lVert\frac{\partial^{\eta}}{\partial x^{\eta}}k(x)\rVert_{2}, (20)

where we used that the derivatives of φ\varphi are bounded. Next by Parseval’s identity,

ηxηk(x)22\displaystyle\lVert\frac{\partial^{\eta}}{\partial x^{\eta}}k(x)\rVert_{2}^{2} =i=1dωiηi[κ](ωi)22.\displaystyle=\prod_{i=1}^{d}\lVert\omega_{i}^{\eta_{i}}\mathcal{F}[\kappa](\omega_{i})\rVert_{2}^{2}. (21)

For 0aγ0\leq a\leq\gamma, we have

\displaystyle\int\left|\omega^{a}\mathcal{F}[\kappa](\omega)\right|^{2}\mathrm{d}\omega\leq 2\lVert\kappa\rVert_{1}^{2}+\int_{\left|\omega\right|\geq 1}\left|\omega^{\gamma}\mathcal{F}[\kappa](\omega)\right|^{2}\mathrm{d}\omega\leq 2\lVert\kappa\rVert_{1}^{2}+L^{\prime}. (22)

By (19)–(22),

¯[k~h](ω)1IωA2cd,γ,LTγhd2γ,\displaystyle\lVert\mathcal{\bar{F}}[\tilde{k}_{h}](\omega){\rm 1}\kern-2.40005pt{\rm I}_{\omega\notin A}\rVert_{\ell_{2}}\leq c_{d,\gamma,L^{\prime}}\,T^{-\gamma}h^{-\frac{d}{2}-\gamma},

as desired. ∎

Applying the previous lemma and linearity of the Fourier transform, we have the next corollary that gives an expansion for a general KDE on the smaller domain [12,12]d[-\frac{1}{2},\frac{1}{2}]^{d}.

Corollary 2.

Let \tilde{g_{S}} denote the KDE built from \tilde{k}_{h} as in Lemma 4, where X_{1},\ldots,X_{m}\in[-\frac{1}{2},\frac{1}{2}]^{d} and moreover \kappa\in\mathcal{S}(\gamma,L^{\prime}) for some \gamma\in\mathbb{Z}_{>0}. Let A=\{\omega\in\frac{\pi}{2}\mathbb{Z}^{d}:\,\left|\omega\right|_{1}\leq T\}, and define

gS~T(y)=ωA¯[gS~](ω)eiy,ω.\tilde{g_{S}}^{T}(y)=\sum_{\omega\in A}\mathcal{\bar{F}}[\tilde{g_{S}}](\omega)e^{i\langle y,\omega\rangle}.

Then

\lVert(\tilde{g_{S}}-\tilde{g_{S}}^{T}){\rm 1}\kern-2.40005pt{\rm I}_{[-\frac{1}{2},\frac{1}{2}]^{d}}\rVert_{2}\leq c_{d,\gamma,L^{\prime}}\,T^{-\gamma}h^{-d/2-\gamma}.
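The T^{-\gamma}h^{-d/2-\gamma} rate above reflects the rapid decay of Fourier coefficients of smooth, compactly supported functions. The following numerical sketch (our illustration, with a generic \mathcal{C}^{\infty} bump standing in for \tilde{k}_{h}) shows the L_{2} tail of the Fourier series on [-2,2] collapsing as the frequency cutoff T grows.

```python
import numpy as np

N = 2 ** 12
x = np.linspace(-2.0, 2.0, N, endpoint=False)
f = np.zeros(N)
mask = np.abs(x) < 1.0
f[mask] = np.exp(-1.0 / (1.0 - x[mask] ** 2))        # generic smooth bump

coef = np.fft.fft(f) / N                             # series coefficients
freqs = 2.0 * np.pi * np.fft.fftfreq(N, d=4.0 / N)   # frequencies in (pi/2)Z
for T in (2.0, 8.0, 32.0):
    tail = np.where(np.abs(freqs) > T, np.abs(coef), 0.0)
    print(T, np.sqrt(4.0 * np.sum(tail ** 2)))       # Parseval: L2 tail norm
```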

Now we have all the ingredients needed to prove Proposition 1.

Proof of Proposition 1 .

Let

f~(y)=1nj=1nk~h(Xjy),\tilde{f}(y)=\frac{1}{n}\sum_{j=1}^{n}\tilde{k}_{h}(X_{j}-y),

and

gS~(y)=jSλjk~h(Xjy).\tilde{g_{S}}(y)=\sum_{j\in S}\lambda_{j}\tilde{k}_{h}(X_{j}-y).

Also consider their truncated expansions \tilde{f}^{T} and \tilde{g}^{T} as defined in Corollary 2. Observe that, by construction of the Carathéodory coreset,

f~T(y)=g~T(y)y[12,12]d.\tilde{f}^{T}(y)=\tilde{g}^{T}(y)\quad\forall y\in[-\frac{1}{2},\frac{1}{2}]^{d}.

In what follows, 2\lVert\cdot\rVert_{2} is computed on [12,12]d[-\frac{1}{2},\frac{1}{2}]^{d}. By the triangle inequality,

\displaystyle\lVert\hat{g}_{S}-\hat{f}\rVert_{2} \leq\lVert\hat{g}_{S}-\tilde{g}\rVert_{2}+\lVert\tilde{g}-\tilde{g}^{T}\rVert_{2}+\lVert\tilde{g}^{T}-\tilde{f}^{T}\rVert_{2}
+f~Tf~2+f~f^2\displaystyle\quad+\lVert\tilde{f}^{T}-\tilde{f}\rVert_{2}+\lVert\tilde{f}-\hat{f}\rVert_{2}
cβ,dhd+ν+cd,γ,LTγhd/2γ+0\displaystyle\leq c_{\beta,d}\,h^{-d+\nu}+c_{d,\gamma,L^{\prime}}\,T^{-\gamma}h^{-d/2-\gamma}+0
+cd,γ,LTγhd/2γ+cβ,dhd+ν\displaystyle\quad+c_{d,\gamma,L^{\prime}}\,T^{-\gamma}h^{-d/2-\gamma}+c_{\beta,d}\,h^{-d+\nu} (23)

On the right-hand-side of the first line, the first and last terms are bounded via Lemma 4. The second and fourth terms are bounded via Lemma 5, and the third term is 0 by Carathéodory. By our choice of TT and the decay properties of kk, we have

g^Sf^2cβ,d,Lhβcβ,d,Lnβ/(2β+d).\lVert\hat{g}_{S}-\hat{f}\rVert_{2}\leq c_{\beta,d,L}\,h^{\beta}\leq c_{\beta,d,L}\,n^{-\beta/(2\beta+d)}.

The conclusion follows by the hypothesis on kk, the previous display, and the triangle inequality. ∎
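For intuition, the Carathéodory step that produces S and \lambda can be carried out by generic null-space pruning: stack the (real and imaginary parts of the) Fourier coefficients (\mathcal{\bar{F}}[\tilde{k}_{h}(X_{j}-\cdot)](\omega))_{\omega\in A} into vectors v_{1},\ldots,v_{n}\in\mathbb{R}^{D}, then repeatedly eliminate a point along an affine dependence until at most D+1 remain, preserving the weighted mean. A minimal numpy sketch of this generic routine (our illustration, not the implementation used in the paper):

```python
import numpy as np

def caratheodory_prune(V, w):
    """Reduce the convex combination sum_i w[i] * V[i] to at most D + 1
    atoms with the same mean, where V has shape (n, D)."""
    D = V.shape[1]
    idx = np.flatnonzero(w > 0)
    while len(idx) > D + 1:
        # find c != 0 with sum_j c_j V[idx[j]] = 0 and sum_j c_j = 0
        M = np.vstack([V[idx].T, np.ones(len(idx))])  # (D+1) x s, s > D+1
        c = np.linalg.svd(M)[2][-1]                   # exact right-null vector
        if c.max() <= 0:
            c = -c
        pos = c > 1e-12
        t = np.min(w[idx][pos] / c[pos])              # largest feasible step
        w[idx] = w[idx] - t * c                       # some weight hits zero
        w[np.abs(w) < 1e-14] = 0.0
        idx = np.flatnonzero(w > 0)
    return w

rng = np.random.default_rng(0)
n, D = 200, 5
V = rng.normal(size=(n, D))
w = caratheodory_prune(V, np.full(n, 1.0 / n))
print(np.count_nonzero(w), np.abs(V.T @ w - V.mean(axis=0)).max())
```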

B.2 Proof of Theorem 2

We restate Theorem 2 here for convenience.

Theorem.

Let ε>0\varepsilon>0. The Carathéodory coreset estimator g^S(y)\hat{g}_{S}(y) built using the kernel ksk_{s} and setting T=cd,β,εnεd+12β+dT=c_{d,\beta,\varepsilon}\,n^{\frac{\varepsilon}{d}+\frac{1}{2\beta+d}} satisfies

supf𝒫(β,L)𝔼fg^Sf2cβ,d,Lnβ2β+d.\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

The corresponding coreset has cardinality

m=cd,β,εnd2β+d+ε.m=c_{d,\beta,\varepsilon}n^{\frac{d}{2\beta+d}+\varepsilon}.
Proof.

Our goal is to apply Proposition 1 to ksk_{s}. First we show that the standard KDE built from ksk_{s} attains the minimax rate on 𝒫(β,L)\mathcal{P}_{\mathcal{H}}(\beta,L). The Fourier condition

ess supω01[ks](ω)|ω|α1,αβ,\text{ess sup}_{\omega\neq 0}\frac{1-\mathcal{F}[k_{s}](\omega)}{\left|\omega\right|^{\alpha}}\leq 1,\quad\forall\alpha\preceq\beta,

implies that ksk_{s} is a kernel of order β\beta (Tsybakov, 2009, Definition 1.3). Since [ks](0)=1=ks(x)dx\mathcal{F}[k_{s}](0)=1=\int k_{s}(x)\,\mathrm{d}x, it remains to show that the ‘moments’ of order at most β\beta of ksk_{s} vanish. In fact all of the moments vanish. We have, expanding the exponential and using the multinomial formula,

ψ(ω)\displaystyle\psi(\omega) =1[ks](ω)\displaystyle=\mathcal{F}^{-1}[k_{s}](\omega)
=ks(x)eix,ωdx\displaystyle=\int k_{s}(x)e^{-i\langle x,\omega\rangle}\mathrm{d}x
=t=0ks(x)(ix,ω)tt!dx\displaystyle=\sum_{t=0}^{\infty}\int k_{s}(x)\frac{(-i\langle x,\omega\rangle)^{t}}{t!}\mathrm{d}x
\displaystyle=\sum_{t=0}^{\infty}\sum_{\left|\alpha\right|_{1}=t}\frac{(-i)^{t}}{\alpha!}\omega^{\alpha}\left\{\int k_{s}(x)x^{\alpha}\mathrm{d}x\right\}.

Since \psi(\omega)\equiv 1 in a neighborhood of the origin, it follows that \int k_{s}(x)x^{\alpha}\mathrm{d}x=0 for every \alpha with \left|\alpha\right|_{1}\geq 1. Thus k_{s} is a kernel of order \beta for all \beta\in\mathbb{Z}_{\geq 0}, and the standard KDE on all of the dataset with bandwidth h=n^{-1/(2\beta+d)} attains the rate of estimation n^{-\beta/(2\beta+d)} over \mathcal{P}_{\mathcal{H}}(\beta,L) (see e.g. Tsybakov, 2009, Theorem 1.2).

Next, \left|\kappa_{s}(x)\right|\leq c_{\beta,d}\left|x\right|^{-\nu} for \nu=\lceil\beta+d\rceil. This is because

\left|x^{\nu}\kappa_{s}(x)\right|=\left|x^{\nu}\mathcal{F}[\psi](x)\right|=\left|\mathcal{F}\left[\frac{\mathrm{d}^{\nu}}{\mathrm{d}x^{\nu}}\psi\right](x)\right|\leq\lVert\frac{\mathrm{d}^{\nu}}{\mathrm{d}x^{\nu}}\psi\rVert_{1}\leq c_{\beta,d}.

Moreover for all γ>0\gamma\in\mathbb{Z}_{>0}, κs𝒮(γ,cγ)\kappa_{s}\in\mathcal{S}(\gamma,c_{\gamma}). By Parseval’s identity,

dγdxγκs2=[dγdxγκs]2=ωγψ(ω)2cγ\lVert\frac{\mathrm{d}^{\gamma}}{\mathrm{d}x^{\gamma}}\kappa_{s}\rVert_{2}=\lVert\mathcal{F}[\frac{\mathrm{d}^{\gamma}}{\mathrm{d}x^{\gamma}}\kappa_{s}]\rVert_{2}=\lVert\omega^{\gamma}\psi(\omega)\rVert_{2}\leq c_{\gamma}

because ψ\psi has compact support (see e.g. Katznelson, 2004, Chapter VI).

All of the hypotheses of Proposition 1 are satisfied, so we apply the result with

\gamma=\frac{d(\frac{d}{2}+\beta)}{\varepsilon(2\beta+d)}.

With this choice, T=c_{d,\beta,\varepsilon}\,n^{\frac{\varepsilon}{d}+\frac{1}{2\beta+d}}, and the resulting Carathéodory coreset has cardinality m=O(T^{d})=c_{d,\beta,\varepsilon}\,n^{\frac{d}{2\beta+d}+\varepsilon}, which proves Theorem 2. ∎
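The exponent bookkeeping behind this choice of \gamma can be checked mechanically. The following snippet (a sanity check only, not part of the proof) verifies that it turns the T of Proposition 1 into T=c\,n^{\varepsilon/d+1/(2\beta+d)}, and hence m\asymp T^{d}=c\,n^{d/(2\beta+d)+\varepsilon}.

```python
import sympy as sp

beta, d, gamma, eps = sp.symbols('beta d gamma epsilon', positive=True)

# exponent of n in T = c * n^{(d/2 + beta + gamma) / (gamma * (2 beta + d))}
T_exp = (d / 2 + beta + gamma) / (gamma * (2 * beta + d))
choice = d * (d / 2 + beta) / (eps * (2 * beta + d))   # the gamma used above

print(sp.simplify(T_exp.subs(gamma, choice) - (eps / d + 1 / (2 * beta + d))))
print(sp.simplify(d * T_exp.subs(gamma, choice) - (eps + d / (2 * beta + d))))
# both differences simplify to 0
```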

B.3 Proof of Corollary 1

Corollary.

Let ε>0\varepsilon>0 and mcβ,d,εnd2β+d+εm\leq c_{\beta,d,\varepsilon}\,n^{\frac{d}{2\beta+d}+\varepsilon}. The Carathéodory coreset estimator g^S(y)\hat{g}_{S}(y) built using the kernel ksk_{s}, setting h=m1d+εβh=m^{-\frac{1}{d}+\frac{\varepsilon}{\beta}} and T=cdm1/dT=c_{d}\,m^{1/d}, satisfies

supf𝒫(β,L)𝔼g^Sf2cβ,d,ε,L(mβd+ε+nβ2β+d+ε),\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,\varepsilon,L}\,\left(m^{-\frac{\beta}{d}+\varepsilon}+n^{-\frac{\beta}{2\beta+d}+\varepsilon}\right),

and the corresponding coreset has cardinality mm.

Proof.

Recall from the proof of Theorem 2 that ksk_{s} is a kernel of all orders. By a standard bias-variance trade-off (see e.g. Tsybakov, 2009, Section 1.2), it holds that for the KDE f^\hat{f} with bandwidth hh (on the entire dataset)

𝔼ff^f2cβ,d,L(hβ+1nhd).\mathbb{E}_{f}\lVert\hat{f}-f\rVert_{2}\leq c_{\beta,d,L}\left(h^{\beta}+\frac{1}{\sqrt{nh^{d}}}\right). (24)

Moreover, from (23) applied to k_{s}, setting T=m^{1/d}, we get

\lVert\hat{g}_{S}-\hat{f}\rVert_{2}\leq c_{\beta,d}\,h^{\beta}+c_{d,\gamma}\,m^{-\gamma/d}h^{-d/2-\gamma}. (25)

Choosing

γ=(β+d2)(βdε1),h=m1d+εβ\gamma=(\beta+\frac{d}{2})(\frac{\beta}{d\varepsilon}-1),\quad h=m^{-\frac{1}{d}+\frac{\varepsilon}{\beta}}

(assuming without loss of generality that \varepsilon>0 is sufficiently small so that \gamma>0), the triangle inequality, (24), (25), and the upper bound on m yield the conclusion of Corollary 1. ∎
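As in the previous proof, the exponent arithmetic can be verified symbolically (a sanity check only): with T=m^{1/d} and h=m^{-1/d+\varepsilon/\beta}, the second term of (25) is exactly m^{-\beta/d+\varepsilon}.

```python
import sympy as sp

beta, d, eps = sp.symbols('beta d epsilon', positive=True)
gamma = (beta + d / 2) * (beta / (d * eps) - 1)

# exponent of m in m^{-gamma/d} * h^{-d/2-gamma} with h = m^{-1/d + eps/beta}
e = -gamma / d + (sp.Integer(1) / d - eps / beta) * (d / 2 + gamma)
print(sp.simplify(e - (-beta / d + eps)))   # 0: matches the m^{-beta/d+eps} term
```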

B.4 Proof of Theorem 4

For convenience, we restate Theorem 4 here.

Theorem.

Let A,B1A,B\geq 1. Let kk denote a kernel with k2n\lVert k\rVert_{2}\leq n. Let g^S\hat{g}_{S} denote a weighted coreset KDE with bandwidth hnAh\geq n^{-A} built from kk with weights {λj}jS\{\lambda_{j}\}_{j\in S} satisfying maxjS|λj|nB\max_{j\in S}\left|\lambda_{j}\right|\leq n^{B}. Then

supf𝒫(β,L)𝔼fg^Sf2cβ,d,L[(A+B)βd(mlogn)βd+nβ2β+d].\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}\geq\\ c_{\beta,d,L}\left[(A+B)^{-\frac{\beta}{d}}(m\log{n})^{-\frac{\beta}{d}}+n^{-\frac{\beta}{2\beta+d}}\right].
Proof.

Let \lambda=(\lambda_{1},\ldots,\lambda_{m}) and let \tilde{\lambda}=(\tilde{\lambda}_{1},\ldots,\tilde{\lambda}_{m}). Observe that

jSλjkh(Xjy)jSλ~jkh(Xjy)2\displaystyle\lVert\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y)-\sum_{j\in S}\tilde{\lambda}_{j}k_{h}(X_{j}-y)\rVert_{2} jS|λjλ~j|kh(Xjy)2\displaystyle\leq\sum_{j\in S}\left|\lambda_{j}-\tilde{\lambda}_{j}\right|\lVert k_{h}(X_{j}-y)\rVert_{2}
|λλ~|n2hd/2.\displaystyle\leq\left|\lambda-\tilde{\lambda}\right|_{\infty}n^{2}h^{-d/2}. (26)

Using this we develop a decorated coreset-based estimator f^S\hat{f}_{S} (see Section A.2) that approximates g^S\hat{g}_{S} well. Set δ=cβ,d,Ln4hd/2\delta=c_{\beta,d,L}n^{-4}h^{d/2} for cβ,d,Lc_{\beta,d,L} sufficiently small and to be chosen later. Order the points of the coreset XSX_{S} according to their first coordinate. This gives rise to an ordering \preceq so that

X1X2XmX^{\prime}_{1}\preceq X^{\prime}_{2}\preceq\cdots\preceq X^{\prime}_{m}

denote the elements of XSX_{S}. Let λm\lambda\in\mathbb{R}^{m} denote the correspondingly reordered collection of weights so that

g^S(y)=j=1mλjkh(Xjy).\hat{g}_{S}(y)=\sum_{j=1}^{m}\lambda_{j}k_{h}(X^{\prime}_{j}-y).

Construct a δ\delta-net 𝒩δ\mathcal{N}_{\delta} with respect to the sup-norm ||\left|\cdot\right|_{\infty} on the set {νm:|ν|nB}\{\nu\in\mathbb{R}^{m}:\left|\nu\right|_{\infty}\leq n^{B}\}. Observe that

\log\left|\mathcal{N}_{\delta}\right|\leq m\log(cn^{B}\delta^{-1})\leq c_{\beta,d,L}\,(B+A)m\log n. (27)

Define RR to be the smallest integer larger than the right-hand-side above. Then we can construct a surjection ϕ:{0,1}R𝒩δ\phi:\{0,1\}^{R}\to\mathcal{N}_{\delta}. Note that ϕ\phi is constructed before observing any data: it simply labels the elements of the δ\delta-net 𝒩δ\mathcal{N}_{\delta} by strings of length RR.

Given g^S(y)=jSλjkh(Xjy)\hat{g}_{S}(y)=\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y), define f^S\hat{f}_{S} as follows:

  1. Let \tilde{\lambda}\in\mathbb{R}^{m} denote the closest element in \mathcal{N}_{\delta} to \lambda\in\mathbb{R}^{m}.

  2. Choose \sigma\in\{0,1\}^{R} such that \phi(\sigma)=\tilde{\lambda}.

  3. Define the decorated coreset Y_{S}=(X_{S},\sigma).

  4. Order the points of X_{S} by their first coordinate. Pair the i-th element of \tilde{\lambda} with the i-th element X_{i}^{\prime} of X_{S}, and define

    \hat{f}_{S}(y)=\sum_{j=1}^{m}\tilde{\lambda}_{j}k_{h}(X_{j}^{\prime}-y).

We see that f^S\hat{f}_{S} is a decorated-coreset based estimator because in step 4 this estimator is constructed only by looking at the coreset XSX_{S} and the bit string σ\sigma. Moreover, by (26) and the setting of δ\delta,

f^Sg^S2cβ,d,Ln2.\lVert\hat{f}_{S}-\hat{g}_{S}\rVert_{2}\leq c_{\beta,d,L}\,n^{-2}. (28)

By Proposition 2 and our choice of RR,

supf𝒫(β,L)𝔼ff^Sf2cβ,d,L((A+B)βd(mlogn)βd+nβ2β+d).\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}\left((A+B)^{-\frac{\beta}{d}}(m\log n)^{-\frac{\beta}{d}}+n^{-\frac{\beta}{2\beta+d}}\right).

Applying the triangle inequality and (28) yields Theorem 4. ∎
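The quantization in steps 1–3 above is elementary to implement: round each weight to the \delta-grid and write the grid indices in binary. A small sketch (our illustration, with hypothetical parameter values) showing that the bit string has length m\lceil\log_{2}(cn^{B}\delta^{-1})\rceil and that the rounding error is at most \delta/2 per weight:

```python
import numpy as np

def encode_weights(lam, n, B, delta):
    """Round weights in [-n^B, n^B] to a delta-grid and serialize the
    grid indices as a bit string sigma (steps 1-3 above)."""
    levels = int(np.ceil(2 * n ** B / delta)) + 1      # grid points per weight
    bits_per = int(np.ceil(np.log2(levels)))           # ~ log(n^B / delta) bits
    q = np.round((np.asarray(lam) + n ** B) / delta).astype(np.int64)
    sigma = ''.join(format(int(v), '0{}b'.format(bits_per)) for v in q)
    lam_tilde = q * delta - n ** B                     # the net point
    return sigma, lam_tilde

lam = np.array([0.20, 0.55, 0.25])                     # hypothetical weights
sigma, lam_tilde = encode_weights(lam, n=100, B=1, delta=1e-4)
print(len(sigma), np.max(np.abs(lam - lam_tilde)))     # R bits, error <= delta/2
```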

Appendix C PROOFS FROM SECTION 4

Notation: Given a set of points X=x1,,xm[1/2,1/2]X=x_{1},\ldots,x_{m}\in[-1/2,1/2] (not necessarily a sample), we let

\hat{f}_{X}(y)=\frac{1}{m}\sum_{i=1}^{m}k_{h}(x_{i}-y)

denote the uniformly weighted KDE on XX.
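In code, this estimator is a one-liner; a minimal sketch with a generic kernel (for illustration only):

```python
import numpy as np

def f_hat(points, h, k):
    """Uniformly weighted KDE (1/m) sum_i k_h(x_i - y) with k_h = k(./h)/h."""
    pts = np.asarray(points, dtype=float)
    return lambda y: np.mean(k((pts - y) / h)) / h

# example with a Gaussian kernel; any kernel integrating to 1 fits the template
gauss = lambda t: np.exp(-t ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
est = f_hat([-0.30, 0.00, 0.25], h=0.1, k=gauss)
print(est(0.1))
```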

C.1 Proof of Theorem 5

Theorem.

Let kk denote a nonnegative kernel satisfying

k(t)=O(|t|(k+1)),and[k](ω)=O(|ω|)k(t)=O(\left|t\right|^{-(k+1)}),\quad\text{and}\quad\mathcal{F}[k](\omega)=O(\left|\omega\right|^{-\ell})

for some >0,k>1\ell>0,\,k>1. Suppose that 0<α<1/30<\alpha<1/3. If

mn232(α(12)+23)logn,m\leq\frac{n^{\frac{2}{3}-2\left(\alpha(1-\frac{2}{\ell})+\frac{2}{3\ell}\right)}}{\log n},

then

infh,S:|S|msupf𝒫(1,L)𝔼f^S𝗎𝗇𝗂𝖿f2=Ωk(n13+αlogn).\inf_{h,S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{2}=\Omega_{k}\Big{(}\frac{n^{-\frac{1}{3}+\alpha}}{\log n}\Big{)}.

The infimum above is over all possible choices of bandwidth hh and all coreset schemes SS of cardinality at most mm.

The proof of Theorem 5 follows directly from Propositions 3 and 4, which are presented in Sections C.1.1 and C.1.2, respectively.

C.1.1 Small bandwidth

First we show that uniformly weighted coreset KDEs on m points poorly approximate densities that take very small values on a long interval.

Lemma 6.

Let \hat{f}_{X} denote a uniformly weighted coreset KDE built from a nonnegative, even kernel k:\mathbb{R}\to\mathbb{R} with bandwidth h on m points X=x_{1},\ldots,x_{m}\in\mathbb{R}. Suppose that quantiles 0\leq q_{1}\leq q_{2} satisfy

q1q1k(t)dt\displaystyle\int_{-q_{1}}^{q_{1}}k(t)\mathrm{d}t 0.9,and\displaystyle\geq 0.9,\quad\quad\text{and} (29)
q2q2k(t)dt\displaystyle\int_{-q_{2}}^{q_{2}}k(t)\mathrm{d}t 1γ.\displaystyle\geq 1-\gamma. (30)

Let UU denote an interval [0,u][0,u] where

u8q2h,u\geq 8q_{2}h, (31)

and suppose that f:Uf:U\to\mathbb{R} satisfies

1100q1mhf(x)45441100q1mh\frac{1}{100q_{1}mh}\leq f(x)\leq\frac{45}{44}\cdot\frac{1}{100q_{1}mh} (32)

for all xUx\in U.

Then

infX:|X|=m(f^Xf)1IU1u440q1mhγ.\inf_{X:|X|=m}\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\geq\frac{u}{440q_{1}mh}-\gamma.
Proof.

Let NN denote the number of xiXx_{i}\in X such that [xiq1h,xi+q1h][0,u][x_{i}-q_{1}h,x_{i}+q_{1}h]\subset[0,u]. The argument proceeds in two cases. With foresight, we set α=1/(44q1)\alpha=1/(44q_{1}). Also let C1=1/(100q1)C_{1}=1/(100q_{1}) and C2=45/(4400q1)C_{2}=45/(4400q_{1}).

Case 1: NαuhN\geq\frac{\alpha u}{h}. Then by (29) and the nonnegativity of kk,

f^X1IU10.9Nm0.9αumh.\lVert\hat{f}_{X}{\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\geq\frac{0.9N}{m}\geq\frac{0.9\alpha u}{mh}.

By (32),

f1C2umh.\lVert f\rVert_{1}\leq\frac{C_{2}u}{mh}.

Hence,

(f^Xf)1IU1umh(0.9αC2)=C2umh=454400uq1mh.\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\geq\frac{u}{mh}(0.9\alpha-C_{2})=C_{2}\frac{u}{mh}=\frac{45}{4400}\cdot\frac{u}{q_{1}mh}.

Thus Lemma 6 holds in Case 1 where Nαu/hN\geq\alpha u/h.

Case 2: NαuhN\leq\frac{\alpha u}{h}. Let

V=[2hq2,u2hq2]\jT[xjq1h,xj+q1h]V=[2hq_{2},u-2hq_{2}]\,\backslash\bigcup_{j\in T}[x_{j}-q_{1}h,x_{j}+q_{1}h]

where TT is the set of indices jj so that [xjq1h,xj+q1h]U[x_{j}-q_{1}h,x_{j}+q_{1}h]\subset U. Observe that if jTj\notin T, then by (30),

V1hk(xjth)dtγ.\int_{V}\frac{1}{h}k\left(\frac{x_{j}-t}{h}\right)\mathrm{d}t\leq\gamma.

If jTj\in T, then by (29),

V1hk(xjth)dt0.1.\int_{V}\frac{1}{h}k\left(\frac{x_{j}-t}{h}\right)\mathrm{d}t\leq 0.1.

Thus,

f^X1IV10.1Nm+γα0.1umh+γ.\lVert\hat{f}_{X}{\rm 1}\kern-2.40005pt{\rm I}_{V}\rVert_{1}\leq\frac{0.1N}{m}+\gamma\leq\frac{\alpha 0.1u}{mh}+\gamma.

By the union bound, observe that the Lebesgue measure of VV is at least

u4hq22Nhq1u22Nhq1u(122αq1).u-4hq_{2}-2Nhq_{1}\geq\frac{u}{2}-2Nhq_{1}\geq u(\frac{1}{2}-2\alpha q_{1}).

Next, by (32),

f1IV1C1umh(122αq1).\lVert f{\rm 1}\kern-2.40005pt{\rm I}_{V}\rVert_{1}\geq C_{1}\frac{u}{mh}(\frac{1}{2}-2\alpha q_{1}).

Therefore,

\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\geq\frac{u}{mh}(C_{1}(1/2-2\alpha q_{1})-0.1\alpha)-\gamma=\frac{u}{440q_{1}mh}-\gamma. (33) ∎

Proposition 3.

Let L>2. Let 0<\delta<1/3 denote an absolute constant. Let \hat{f}_{X} denote a uniformly weighted coreset KDE with bandwidth h built from a nonnegative, even kernel k on X=x_{1},\ldots,x_{m}. Suppose that k(t)\leq\Delta|t|^{-(k+1)} for some absolute constants \Delta>0,\,k>1. If h\leq n^{-1/3+\delta}, then for

mn2/32δlognm\leq\frac{n^{2/3-2\delta}}{\log n}

it holds that

supf𝒫(1,L)infX:|X|=mf^Xf2=Ω(n1/3+δlogn).\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\,\inf_{X:|X|=m}\,\lVert\hat{f}_{X}-f\rVert_{2}=\Omega\left(\frac{n^{-1/3+\delta}}{\log n}\right). (34)
Proof.

Let

f(t)=\lambda\left(e^{-1/t}{\rm 1}\kern-2.40005pt{\rm I}(t\in[0,1/2])+e^{-1/(1-t)}{\rm 1}\kern-2.40005pt{\rm I}(t\in[1/2,1])\right),

where λ\lambda is a normalizing constant so that f=1\int f=1. Observe that f𝒫(1,L)f\in\mathcal{P}_{\mathcal{H}}(1,L). Our first goal is to show that

f^Xf1=Ω(1mhlog2(mh))\lVert\hat{f}_{X}-f\rVert_{1}=\Omega\left(\frac{1}{mh\log^{2}(mh)}\right)

holds for all τ/hmh2\tau/h\leq m\leq h^{-2} and for all hn1/3+δh\leq n^{-1/3+\delta}, where τ\tau is an absolute constant to be determined.

We apply Lemma 6 to the density ff. Let q1q_{1} be defined as in Lemma 6, and set C1=1/(100q1)C_{1}=1/(100q_{1}) and C2=45/(4400q1)C_{2}=45/(4400q_{1}). Set τ=10C2/λ\tau=10C_{2}/\lambda. Let

U=[t1,t2]:=[1log(λmh/C1),1log(λmh/C2)].U=[t_{1},t_{2}]:=\left[\frac{1}{\log(\lambda mh/C_{1})},\,\,\frac{1}{\log(\lambda mh/C_{2})}\right].

The function f|Uf|_{U} satisfies the bounds (32) from Lemma 6. Observe that the length of UU is

u:=t_{2}-t_{1}=\frac{\log(C_{2}/C_{1})}{\log(\lambda mh/C_{1})\log(\lambda mh/C_{2})}=\Omega(\frac{1}{\log^{2}(mh)}).

We set the parameter γ\gamma in Lemma 6 to be

γ=1800q1mhlog2(mh).\gamma=\frac{1}{800q_{1}mh\log^{2}(mh)}.

By the decay assumption on kk, we may set

q2:=(2Δkγ)1/k.q_{2}:=\left(\frac{2\Delta}{k\gamma}\right)^{1/k}.

Therefore,

u8q2h\displaystyle u-8q_{2}h =Ω(1log2(mh))8h(2Δkγ)1/k\displaystyle=\Omega(\frac{1}{\log^{2}(mh)})-8h\left(\frac{2\Delta}{k\gamma}\right)^{1/k} (35)
=Ω(1log2(mh))O(h(mhlog2(mh))1/k)\displaystyle=\Omega(\frac{1}{\log^{2}(mh)})-O(h(mh\log^{2}(mh))^{1/k}) (36)
=Ω(1log2(h1))O(h11/klog2(h1))>0\displaystyle=\Omega(\frac{1}{\log^{2}(h^{-1})})-O(h^{1-1/k}\log^{2}(h^{-1}))>0 (37)

for nn sufficiently large, because we assume τ/hmh2\tau/h\leq m\leq h^{-2}, hn1/3+δh\leq n^{-1/3+\delta}, and k>1k>1. Hence, condition (31) is satisfied for m,hm,h in the specified range, so we apply Cauchy–Schwarz and Lemma 6 to conclude that for all τ/hmh2\tau/h\leq m\leq h^{-2} and hn1/3+δh\leq n^{-1/3+\delta},

f^Xf2f^Xf1=Ω(1mhlog2(mh))=Ω(1mhlog2(h1)).\lVert\hat{f}_{X}-f\rVert_{2}\geq\lVert\hat{f}_{X}-f\rVert_{1}=\Omega\left(\frac{1}{mh\log^{2}(mh)}\right)=\Omega\left(\frac{1}{mh\log^{2}(h^{-1})}\right). (38)

Suppose first that log2(1/h)n1/3δ\log^{2}(1/h)\geq n^{1/3-\delta}. Then clearly the right-hand side of (38) is Ω(1)\Omega(1) for mnm\leq n. Otherwise, we have for all hn1/3+δh\leq n^{-1/3+\delta} that if mm is in the range

τhmmin(n1/3δlognhlog2(1/h),h2)=:Nh,\frac{\tau}{h}\leq m\leq\min\left(\frac{n^{1/3-\delta}\log n}{h\log^{2}(1/h)},\,\,h^{-2}\right)=:N_{h},

then (38) implies

f^Xf2=Ω(n1/3+δlogn).\lVert\hat{f}_{X}-f\rVert_{2}=\Omega\left(\frac{n^{-1/3+\delta}}{\log n}\right). (39)

Moreover, a uniformly weighted coreset KDE on m=O(1/h) points can be expressed as a uniformly weighted coreset KDE on \Omega(1/h) points by setting some of the x_{i}'s to be duplicates. Hence (39) holds for all 1\leq m\leq N_{h}. Since N_{h} is a decreasing function of h, it follows that (39) holds for all m\leq n^{2/3-2\delta}/\log n and h\leq n^{-1/3+\delta}, as desired. ∎
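The choice of q_{2} in the proof above simply inverts the tail of the polynomial envelope: if k(t)\leq\Delta|t|^{-(k+1)}, then \int_{|t|>q}k\leq 2\Delta q^{-k}/k, and q_{2}=(2\Delta/(k\gamma))^{1/k} drives this bound down to \gamma. A quick numerical confirmation (sanity check only, with arbitrary values of \Delta, k, \gamma):

```python
import numpy as np
from scipy.integrate import quad

Delta, kk, gamma = 1.0, 2.0, 1e-3            # arbitrary test values; kk > 1
q2 = (2.0 * Delta / (kk * gamma)) ** (1.0 / kk)

# tail mass of the envelope Delta * |t|^{-(kk+1)} beyond q2 (both tails)
tail = 2.0 * quad(lambda t: Delta * t ** -(kk + 1.0), q2, np.inf)[0]
print(q2, tail, np.isclose(tail, gamma))     # the bound is attained exactly
```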

C.1.2 Large bandwidth

Lemma 7.

Let ε=ε(n)>0\varepsilon=\varepsilon(n)>0, and let f^X\hat{f}_{X} denote the uniformly weighted coreset KDE on XX with bandwidth hh. Suppose that ϕ:\phi:\mathbb{R}\to\mathbb{R} is an odd 𝒞\mathcal{C}^{\infty} function supported on [1/4,1/4][-1/4,1/4]. Let f(t):[1/2,1/2]0f(t):[-1/2,1/2]\to\mathbb{R}_{\geq 0} denote the density

f(t)=1211(1t2)+εϕ(t)cos(tε).f(t)=\frac{12}{11}(1-t^{2})+\varepsilon\phi(t)\cos\left(\frac{t}{\varepsilon}\right).

Then

f^Xf2212ε2(ϕ22|[ϕ2](2ε1)|)ϕ1sup|ω|hε1/2|[k](ω)|2ε|ω|ε1/2|[ϕ](ω)|dω.\lVert\hat{f}_{X}-f\rVert_{2}^{2}\geq\frac{1}{2}\varepsilon^{2}\left(\lVert\phi\rVert_{2}^{2}-\left|\mathcal{F}[\phi^{2}](2\varepsilon^{-1})\right|\right)\\ -\lVert\phi\rVert_{1}\sup_{|\omega|\geq h\varepsilon^{-1}/2}\left|\mathcal{F}[k](\omega)\right|-2\varepsilon\int_{|\omega|\geq\varepsilon^{-1}/2}\left|\mathcal{F}[\phi](\omega)\right|\mathrm{d}\omega. (40)
Proof.

Let g(t)=(12/11)(1t2)g(t)=(12/11)(1-t^{2}) and ψ(t)=εϕ(t)cos(t/ε)\psi(t)=\varepsilon\phi(t)\cos(t/\varepsilon). Observe that

f^Xf22\displaystyle\lVert\hat{f}_{X}-f\rVert_{2}^{2} gf222f^X,gf+2g,ψ(t)\displaystyle\geq\lVert g-f\rVert_{2}^{2}-2\langle\hat{f}_{X},g-f\rangle+2\langle g,\psi(t)\rangle
=gf222f^X,gf\displaystyle=\lVert g-f\rVert_{2}^{2}-2\langle\hat{f}_{X},g-f\rangle (41)

because g(t)ψ(t)g(t)\psi(t) is an odd function. Next, using cos2(θ)=(1/2)(cos(2θ)+1)\cos^{2}(\theta)=(1/2)(\cos(2\theta)+1),

gf22\displaystyle\lVert g-f\rVert_{2}^{2} =ε21/21/2cos2(t/ε)ϕ2(t)dt\displaystyle=\varepsilon^{2}\int_{-1/2}^{1/2}\cos^{2}(t/\varepsilon)\phi^{2}(t)\mathrm{d}t
ε22ϕ22ε22|[ϕ2](2ε1)|.\displaystyle\geq\frac{\varepsilon^{2}}{2}\lVert\phi\rVert_{2}^{2}-\frac{\varepsilon^{2}}{2}\left|\mathcal{F}[\phi^{2}](2\varepsilon^{-1})\right|. (42)

By the triangle inequality and Parseval’s formula,

|f^X,gf|ε(|ω|hε1/2=:A+|ω|hε1/2=:B)|[k](hεω)1h[ϕ](ωh)|dω.\frac{\left|\langle\hat{f}_{X},g-f\rangle\right|}{\varepsilon}\leq\Big{(}\underbrace{\int_{|\omega|\leq h\varepsilon^{-1}/2}}_{=:A}+\underbrace{\int_{|\omega|\geq h\varepsilon^{-1}/2}}_{=:B}\Big{)}\Big{|}\mathcal{F}[k]\left(-\frac{h}{\varepsilon}-\omega\right)\frac{1}{h}\mathcal{F}[\phi]\left(-\frac{\omega}{h}\right)\Big{|}\,\mathrm{d}\omega.

Moreover,

A\displaystyle A 12εϕ1sup|ω|hε1/2|[k](ω)|,\displaystyle\leq\frac{1}{2\varepsilon}\lVert\phi\rVert_{1}\cdot\sup_{|\omega|\geq h\varepsilon^{-1}/2}\left|\mathcal{F}[k](\omega)\right|, (43)
B\displaystyle B k1|ω|>ε1/2|[ϕ](ω)|dω.\displaystyle\leq\lVert k\rVert_{1}\cdot\int_{|\omega|>\varepsilon^{-1}/2}\left|\mathcal{F}[\phi](\omega)\right|\mathrm{d}\omega. (44)

Then (40) follows from \lVert k\rVert_{1}=1 and equations (41), (42), (43), and (44). ∎

Proposition 4.

Let ε=n1/3+γ\varepsilon=n^{-1/3+\gamma} for some absolute constant γ>0\gamma>0. Let f^X\hat{f}_{X} denote a uniformly weighted coreset KDE with bandwidth hh built from a kernel kk on X=x1,,xmX=x_{1},\ldots,x_{m}. Suppose that |[k](ω)||ω|\left|\mathcal{F}[k](\omega)\right|\leq|\omega|^{-\ell}. If hcε12/=cn(1/3+γ)(12/)h\geq c\varepsilon^{1-2/\ell}=cn^{(-1/3+\gamma)(1-2/\ell)} for cc sufficiently large, then for all mm it holds that

\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\,\inf_{X:|X|=m}\lVert\hat{f}_{X}-f\rVert_{2}=\Omega(\varepsilon)=\Omega\left(n^{-1/3+\gamma}\right). (45)
Proof.

The proof is a direct application of Lemma 7. Let f(t)=g(t)+εϕ(t)cos(t/ε)f(t)=g(t)+\varepsilon\phi(t)\cos(t/\varepsilon), where we set

\phi(t)=-e^{\frac{1}{t(t+1/4)}}{\rm 1}\kern-2.40005pt{\rm I}(t\in[-1/4,0])+e^{\frac{1}{t(t-1/4)}}{\rm 1}\kern-2.40005pt{\rm I}(t\in[0,1/4]).

Observe that \phi is odd and \phi\in\mathcal{C}^{\infty}. Thus, \phi^{2}\in\mathcal{C}^{\infty}, so by the Riemann–Lebesgue lemma (see e.g. Katznelson, 2004, Chapter VI), \left|\mathcal{F}[\phi^{2}](2\varepsilon^{-1})\right|\leq 10\varepsilon for \varepsilon sufficiently small. Using a similar argument and noting that \left|\mathcal{F}[\phi](\omega)\right|=\left|\omega\right|^{-2}\left|\mathcal{F}[\phi^{\prime\prime}](\omega)\right|\leq 10\left|\omega\right|^{-3}, we obtain

\int_{\left|\omega\right|\geq\varepsilon^{-1}/2}\left|\mathcal{F}[\phi](\omega)\right|\mathrm{d}\omega\leq 100\varepsilon^{2}.

Also \lVert\phi\rVert_{2}\geq c_{0} for a small absolute constant c_{0}>0, and \lVert\phi\rVert_{1}\leq 2.

Thus Lemma 7, the hypothesis on k, and h\geq c\varepsilon^{1-2/\ell} imply that

\lVert\hat{f}_{X}-f\rVert_{2}^{2}\geq\frac{c_{0}^{2}}{2}\varepsilon^{2}-2\left(\frac{2\varepsilon}{h}\right)^{\ell}-200\varepsilon^{3}=\Omega(\varepsilon^{2}).

Since f\in\mathcal{P}_{\mathcal{H}}(1,L), the statement of the proposition follows. ∎

C.2 Proof of Theorem 6

Theorem.

Fix β>0\beta>0 and a nonnegative kernel kk on \mathbb{R} satisfying the following fast decay and smoothness conditions:

lims+1slog1|t|>sk(t)𝑑t\displaystyle\lim_{s\to+\infty}\frac{1}{s}\log\frac{1}{\int_{|t|>s}k(t)dt} >0,\displaystyle>0, (46)
limω1|ω|log1|[k](ω)|\displaystyle\lim_{\omega\to\infty}\frac{1}{|\omega|}\log\frac{1}{|\mathcal{F}[k](\omega)|} >0,\displaystyle>0, (47)

where we recall that \mathcal{F}[k] denotes the Fourier transform. Let \hat{f}_{S}^{\mathsf{unif}} be the uniformly weighted coreset KDE. Then there exists L_{\beta}>0 such that for L\geq L_{\beta} and any m, we have

infh,S:|S|msupf𝒫(β,L)𝔼f^S𝗎𝗇𝗂𝖿f2\displaystyle\inf_{h,S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{2} =Ωβ,k(mβ1+βlogβ+12m).\displaystyle=\Omega_{\beta,k}\left(\tfrac{m^{-\frac{\beta}{1+\beta}}}{\log^{\beta+\frac{1}{2}}m}\right).
Proof.

We follow a similar strategy to the proof of Theorem 5 by handling the cases of small and large bandwidth separately.

Let q1=q1(k)>0q_{1}=q_{1}(k)>0 be the minimum number such that |t|>q1k(t)𝑑t0.1\int_{|t|>q_{1}}k(t)dt\leq 0.1. By the assumption in the theorem, there exists a>0a>0 such that

|t|>sk(t)𝑑t1aexp(as),s0.\displaystyle\int_{|t|>s}k(t)dt\leq\frac{1}{a}\exp(-as),\quad\forall s\geq 0.

Note that we can set Lβ(1)L_{\beta}^{(1)} large such that for any δ[0,1]\delta\in[0,1], there exists f𝒫(β,Lβ(1))f\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}^{(1)}) such that f(x)=δf(x)=\delta for x[0,1/2]x\in[0,1/2]. We first show that for any given mm and hh, we have

infS:|S|msupf𝒫(β,Lβ(1))𝔼f^S𝗎𝗇𝗂𝖿f10.2(11100q1mh)1{h0.02alog(mq10.001a10a)1}.\displaystyle\inf_{S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}^{(1)})}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{1}\geq 0.2\left(1\wedge\frac{1}{100q_{1}mh}\right)1\left\{h\leq\frac{0.02a}{\log\left(\frac{mq_{1}}{0.001a}\vee\frac{10}{a}\right)}\wedge 1\right\}. (48)

Let f be an arbitrary density in \mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}^{(1)}) such that

f(x)=11100q1mh,x[0,1/2].\displaystyle f(x)=1\wedge\frac{1}{100q_{1}mh},\quad\forall x\in[0,1/2].

Let TT be the set of iSi\in S for which xi[q1h,1/2q1h]x_{i}\in[q_{1}h,1/2-q_{1}h].

Case 1: |T|m(11100q1mh)|T|\geq m\left(1\wedge\frac{1}{100q_{1}mh}\right). Since k0k\geq 0, we have

f^X1[0,1/2]10.9|T|m0.9(11100q1mh).\displaystyle\|\hat{f}_{X}1_{[0,1/2]}\|_{1}\geq\frac{0.9|T|}{m}\geq 0.9\left(1\wedge\frac{1}{100q_{1}mh}\right).

On the other hand,

f1[0,1/2]112(11100q1mh),\displaystyle\|f1_{[0,1/2]}\|_{1}\leq\frac{1}{2}\left(1\wedge\frac{1}{100q_{1}mh}\right),

therefore,

(f^Xf)1[0,1/2]10.4(11100q1mh).\displaystyle\|(\hat{f}_{X}-f)1_{[0,1/2]}\|_{1}\geq 0.4\left(1\wedge\frac{1}{100q_{1}mh}\right).

Case 2: |T|<m(11100q1mh)|T|<m\left(1\wedge\frac{1}{100q_{1}mh}\right). Define

γ:=0.1(11100q1mh)\displaystyle\gamma:=0.1\left(1\wedge\frac{1}{100q_{1}mh}\right)

and

q2:=0.02h.\displaystyle q_{2}:=\frac{0.02}{h}.

Note that to verify (48) we only need to consider the event of h0.02alog(mq10.001a10a)1h\leq\frac{0.02a}{\log\left(\frac{mq_{1}}{0.001a}\vee\frac{10}{a}\right)}\wedge 1, in which case

|t|>q2k(t)𝑑t\displaystyle\int_{|t|>q_{2}}k(t)dt 1aexp(aq2)\displaystyle\leq\frac{1}{a}\exp(-aq_{2})
1a(0.001amq10.1a)\displaystyle\leq\frac{1}{a}\cdot\left(\frac{0.001a}{mq_{1}}\wedge 0.1a\right)
1a(0.001aq1mh0.1a)\displaystyle\leq\frac{1}{a}\cdot\left(\frac{0.001a}{q_{1}mh}\wedge 0.1a\right)
=0.1(11100q1mh)\displaystyle=0.1(1\wedge\frac{1}{100q_{1}mh})
=γ.\displaystyle=\gamma.

Moreover since γ0.1\gamma\leq 0.1 we see that q2q1q_{2}\geq q_{1}. Now define

\displaystyle V:=[2hq_{2},1/2-2hq_{2}]\setminus\bigcup_{j\in T}[x_{j}-q_{1}h,x_{j}+q_{1}h].

Then for jTj\notin T, we have

V1hk(xjth)𝑑tγ\displaystyle\int_{V}\frac{1}{h}k\left(\frac{x_{j}-t}{h}\right)dt\leq\gamma

while for jTj\in T we have

V1hk(xjth)𝑑t0.1.\displaystyle\int_{V}\frac{1}{h}k\left(\frac{x_{j}-t}{h}\right)dt\leq 0.1.

Thus,

f^X1V10.1|T|m+γ0.2(11100q1mh).\displaystyle\|\hat{f}_{X}1_{V}\|_{1}\leq\frac{0.1|T|}{m}+\gamma\leq 0.2\left(1\wedge\frac{1}{100q_{1}mh}\right).

On the other hand, by the union bound we see that the Lebesgue measure of VV is at least

124q2h2q1h|T|0.54q2h0.020.4\displaystyle\frac{1}{2}-4q_{2}h-2q_{1}h|T|\geq 0.5-4q_{2}h-0.02\geq 0.4

where we used the fact that q2h=0.02q_{2}h=0.02. Then

f1V10.4(11100q1mh)\displaystyle\|f1_{V}\|_{1}\geq 0.4\left(1\wedge\frac{1}{100q_{1}mh}\right)

and hence

(f^Xf)1[0,1/2]1(f^Xf)1V10.2(11100q1mh).\displaystyle\|(\hat{f}_{X}-f)1_{[0,1/2]}\|_{1}\geq\|(\hat{f}_{X}-f)1_{V}\|_{1}\geq 0.2\left(1\wedge\frac{1}{100q_{1}mh}\right).

This concludes the proof of (48).

The second step is to show that for given mm and hh, we have

infS:|S|msupf𝒫(β,L)𝔼f^S𝗎𝗇𝗂𝖿f114(b(h1)logm)β1bm2\displaystyle\inf_{S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{1}\geq\frac{1}{4}\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}} (49)

for all sufficiently large m and L (both to be determined later), where 0<b<\infty is a constant such that

\displaystyle\left|\mathcal{F}[k](\omega)\right|\leq\frac{1}{b}\exp(-b\left|\omega\right|),\quad\forall\omega\in\mathbb{R}

whose existence is guaranteed by the assumption of the theorem. Let ϕ\phi be a smooth, even, nonnegative function supported on [1/2,1/2][-1/2,1/2] satisfying [1/2,1/2]ϕ=1\int_{[-1/2,1/2]}\phi=1. Define

fϵ(t):=ϕ(t)(cϵ+ϵβsintϵ)\displaystyle f_{\epsilon}(t):=\phi(t)\left(c_{\epsilon}+\epsilon^{\beta}\sin\frac{t}{\epsilon}\right)

where cϵ>0c_{\epsilon}>0 is chosen so that [1/2,1/2]fϵ=1\int_{[-1/2,1/2]}f_{\epsilon}=1. Then limϵ0cϵ=1\lim_{\epsilon\to 0}c_{\epsilon}=1, and in particular fϵ0f_{\epsilon}\geq 0 when ϵ<ϵ(ϕ,β)\epsilon<\epsilon(\phi,\beta) for some ϵ(ϕ,β)\epsilon(\phi,\beta). Moreover we can find Lβ(2)<L_{\beta}^{(2)}<\infty such that fϵ𝒫(β,Lβ(2))f_{\epsilon}\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}^{(2)}) for all ϵ<ϵ(ϕ,β)\epsilon<\epsilon(\phi,\beta). Now

fϵf^X1\displaystyle\|f_{\epsilon}-\hat{f}_{X}\|_{1} |[fϵ](1/ϵ)[f^X](1/ϵ)|\displaystyle\geq|\mathcal{F}[f_{\epsilon}](1/\epsilon)-\mathcal{F}[\hat{f}_{X}](1/\epsilon)|
|[1/2,1/2]fϵ(t)eit/ϵ𝑑t||[k](hϵ)|\displaystyle\geq\left|\int_{[-1/2,1/2]}f_{\epsilon}(t)e^{-it/\epsilon}dt\right|-\left|\mathcal{F}[k](\frac{h}{\epsilon})\right|
|[1/2,1/2]fϵ(t)sintϵdt||[k](hϵ)|\displaystyle\geq\left|\int_{[-1/2,1/2]}f_{\epsilon}(t)\sin\frac{t}{\epsilon}dt\right|-\left|\mathcal{F}[k](\frac{h}{\epsilon})\right|
=ϵβ|[1/2,1/2]ϕ(t)sin2tϵdt||[k](hϵ)|\displaystyle=\epsilon^{\beta}\left|\int_{[-1/2,1/2]}\phi(t)\sin^{2}\frac{t}{\epsilon}dt\right|-\left|\mathcal{F}[k](\frac{h}{\epsilon})\right| (50)

where (50) used the fact that ϕ\phi is even. Since limϵ0[1/2,1/2]ϕ(t)sin2tϵdt=12\lim_{\epsilon\to 0}\int_{[-1/2,1/2]}\phi(t)\sin^{2}\frac{t}{\epsilon}dt=\frac{1}{2}, there exists ϵ(ϕ)\epsilon^{\prime}(\phi) such that

[1/2,1/2]ϕ(t)sin2tϵdt14\displaystyle\int_{[-1/2,1/2]}\phi(t)\sin^{2}\frac{t}{\epsilon}dt\geq\frac{1}{4}

for any ϵϵ(ϕ)\epsilon\leq\epsilon^{\prime}(\phi). Now define

ϵ′′(h,m)=b(h1)2logm.\displaystyle\epsilon^{\prime\prime}(h,m)=\frac{b(h\wedge 1)}{2\log m}.

There exists m(ϕ,β,b)<m(\phi,\beta,b)<\infty such that suph>0ϵ′′(h,m)<ϵ(ϕ,β)ϵ(ϕ)\sup_{h>0}\epsilon^{\prime\prime}(h,m)<\epsilon(\phi,\beta)\wedge\epsilon^{\prime}(\phi) whenever mm(ϕ,β,b)m\geq m(\phi,\beta,b). With the choice of ϵ=ϵ′′(h,m)\epsilon=\epsilon^{\prime\prime}(h,m), we can continue lower bounding (50) as (for mm(ϕ,β,b)m\geq m(\phi,\beta,b)):

14(b(h1)logm)β1bm2.\displaystyle\frac{1}{4}\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}.

Finally, we collect the results for step 1 and step 2. First observe that the main term in the risk in step 1 can be simplified as

(11100q1mh)1{h0.02alog(mq10.001a10a)1}\displaystyle\left(1\wedge\frac{1}{100q_{1}mh}\right)1\left\{h\leq\frac{0.02a}{\log\left(\frac{mq_{1}}{0.001a}\vee\frac{10}{a}\right)}\wedge 1\right\}
=1100q1mh1{𝒜}\displaystyle=\frac{1}{100q_{1}mh}\wedge 1\left\{\mathcal{A}\right\} (51)

where \mathcal{A} denotes the event appearing in the indicator on the left-hand side of (51).

Thus, up to a multiplicative constant depending on k and \beta, we can lower bound the risk by taking the maximum of the risks from the two steps:

(1mh1{𝒜})((b(h1)logm)β1bm2)\displaystyle\left(\frac{1}{mh}\wedge 1\{\mathcal{A}\}\right)\vee\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right) (52)

whenever L\geq L_{\beta}:=L_{\beta}^{(1)}\vee L_{\beta}^{(2)}. We can use the distributive law to open up the parentheses in (52). By checking the cases h>m^{-\frac{1}{\beta+1}} and h\leq m^{-\frac{1}{\beta+1}} respectively, it is easy to verify that

1mh((b(h1)logm)β1bm2)=Ω(mββ+1logβm).\displaystyle\frac{1}{mh}\vee\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right)=\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right).

Next, if 𝒜\mathcal{A} is true, we evidently have

1{𝒜}((b(h1)logm)β1bm2)=1=Ω(mββ+1logβm).\displaystyle 1\{\mathcal{A}\}\vee\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right)=1=\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right).

If 𝒜\mathcal{A} is not true, then h>0.02alog(mq10.001a10a)1h>\frac{0.02a}{\log\left(\frac{mq_{1}}{0.001a}\vee\frac{10}{a}\right)}\wedge 1, and we have

1{𝒜}((b(h1)logm)β1bm2)\displaystyle 1\{\mathcal{A}\}\vee\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right) =((b(h1)logm)β1bm2)\displaystyle=\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right)
=Ω(log2βm)\displaystyle=\Omega\left(\log^{-2\beta}m\right)
=Ω(mββ+1logβm).\displaystyle=\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right).

In either case the risk with respect to L1L_{1} is Ω(mββ+1logβm)\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right). It remains to convert this to a lower bound in L2L_{2}.

We consider two cases. First note that by the fast decay condition on the Fourier transform, k𝒞1k\in\mathcal{C}^{1}. Let B=BkB=B_{k} denote a constant such that

\sup_{x\in[-1,1]}\left|k^{\prime}(x)\right|\leq B. (53)

Set \Delta=B\vee k(0)\vee 1.

Case 1: hΔh\leq\Delta.

Let U={|y|12+cβ,Δ,alogm}U=\{|y|\geq\frac{1}{2}+c_{\beta,\Delta,a}\log m\}, and let Uc=\UU^{c}=\mathbb{R}\backslash U. If hΔh\leq\Delta, then because Xi[1/2,1/2]X_{i}\in[-1/2,1/2] and by the exponential decay of kk,

f^X(y)1IU1m2\lVert\hat{f}_{X}(y){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\leq m^{-2}

for cβ,Δ,ac_{\beta,\Delta,a} sufficiently large. Thus by Cauchy–Schwarz,

\displaystyle\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U^{c}}\rVert_{2} \geq c^{\prime}_{\beta,\Delta,a}(\log m)^{-1/2}\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U^{c}}\rVert_{1}
=cβ,Δ,a(logm)1/2((f^Xf)1(f^Xf)1IU1)\displaystyle=c^{\prime}_{\beta,\Delta,a}(\log m)^{-1/2}\left(\lVert(\hat{f}_{X}-f)\rVert_{1}-\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\right)
cβ,Δ,a(logm)1/2(cβ,k(mββ+1logβm)m2)\displaystyle\geq c^{\prime}_{\beta,\Delta,a}(\log m)^{-1/2}\left(c_{\beta,k}\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right)-m^{-2}\right)
\displaystyle=\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta+\frac{1}{2}}m}\right).

Case 2: h\geq\Delta.

In this case, k(Xiy)k(X_{i}-y) is nearly constant for all ii. By (53) and Taylor’s theorem,

|k(0)k(Xiyh)|2B\left|k(0)-k\left(\frac{X_{i}-y}{h}\right)\right|\leq 2B

for all y[1/2,1/2]y\in[-1/2,1/2] and for all ii. Hence, for all y[1/2,1/2]y\in[-1/2,1/2], using hΔh\geq\Delta,

f^X(y)=1mhi=1mk(Xiyh)1h(k(0)+2B)3.\hat{f}_{X}(y)=\frac{1}{mh}\sum_{i=1}^{m}k\left(\frac{X_{i}-y}{h}\right)\leq\frac{1}{h}(k(0)+2B)\leq 3.

For LβL_{\beta} large enough, we see that for the function f𝒫(β,Lβ)f\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}) with f|[0,1100]4f|_{[0,\frac{1}{100}]}\equiv 4,

\lVert\hat{f}_{X}-f\rVert_{2}\geq\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{[0,\frac{1}{100}]}\rVert_{1}=\Omega(1). ∎
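The rate in Theorem 6 comes from balancing the two branches of (52): the small-bandwidth term 1/(mh) against the smoothing term ((h\wedge 1)/\log m)^{\beta}. A short numerical check of this trade-off (illustration only, with arbitrary m and \beta):

```python
import numpy as np

m, beta = 10 ** 6, 2.0
hs = np.logspace(-6.0, 0.0, 2001)
risk = np.minimum(1.0 / (m * hs), (hs / np.log(m)) ** beta)

h_star = hs[np.argmax(risk)]
print(h_star, risk.max(), m ** (-beta / (beta + 1)))
# the best achievable value sits at m^{-beta/(beta+1)} up to log factors
```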

Appendix D PROOFS FROM SECTION 5

D.1 Proof of Theorem 7

The result is restated below.

Theorem.

Let ksk_{s} denote the kernel from Section 3.2. The algorithm of Phillips and Tai (2018b) yields in polynomial time a subset SS with |S|=m=O~(nβ+d2β+d)\left|S\right|=m=\tilde{O}(n^{\frac{\beta+d}{2\beta+d}}) such that the uniformly weighted coreset KDE g^S\hat{g}_{S} satisfies

supf𝒫(β,L)𝔼fg^S2cβ,d,Lnβ2β+d.\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert f-\hat{g}_{S}\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.
Proof.

Here we adapt the results in Section 2 of Phillips and Tai (2018b) to our setting where the bandwidth h=n1/(2β+d)h=n^{-1/(2\beta+d)} is shrinking. Using their notation, we define Ks(x,y)=ks(xyh)K_{s}(x,y)=k_{s}\left(\frac{x-y}{h}\right) and study the kernel discrepancy of the kernel KsK_{s}. First we verify the assumptions on the kernel (bounded influence, Lipschitz, and positive semidefiniteness) needed to apply their results.

First, the kernel KsK_{s} is bounded influence (see Phillips and Tai, 2018b, Section 2) with constant cK=2c_{K}=2 and δ=n1\delta=n^{-1}, which means that

|Ks(x,y)|1n\left|K_{s}(x,y)\right|\leq\frac{1}{n}

if |xy|n2\left|x-y\right|_{\infty}\geq n^{2}. This follows from the fast decay of κs\kappa_{s}.

Note that if xx and yy differ on a single coordinate ii, then

|ks(x)ks(y)||c(xiyi)jiκs(xj)|c|xiyi|\left|k_{s}(x)-k_{s}(y)\right|\leq\left|c(x_{i}-y_{i})\prod_{j\neq i}\kappa_{s}(x_{j})\right|\leq c\left|x_{i}-y_{i}\right|

because |κs(x)|ψ1\left|\kappa_{s}(x)\right|\leq\lVert\psi\rVert_{1} for all xx and the function κs\kappa_{s} is cc-Lipschitz for some constant cc. Hence by the triangle and Cauchy–Schwarz inequalities, the function ksk_{s} is Lipschitz:

\displaystyle\left|k_{s}(x)-k_{s}(y)\right|\leq dc_{\kappa}\left|x-y\right|_{1}\leq d^{3/2}c_{\kappa}\left|x-y\right|_{2}.

Therefore the kernel Ks(x,y)K_{s}(x,y) is Lipschitz (see Phillips and Tai, 2018b) with constant CK=d3/2cκh1C_{K}=d^{3/2}c_{\kappa}h^{-1}. Moreover, the kernel KsK_{s} is positive semidefinite because the Fourier transform of κs\kappa_{s} is nonnegative.

Given the shrinking bandwidth h=n1/(2β+d)h=n^{-1/(2\beta+d)}, we slightly modify the lattice used in Phillips and Tai (2018b, Lemma 1). Define the lattice

={(i1δ,i2δ,,idδ)|ij},\mathcal{L}=\left\{\left.(i_{1}\delta,i_{2}\delta,\ldots,i_{d}\delta)\,\right|\,i_{j}\in\mathbb{Z}\right\},

where

δ=1cκd2nh1.\delta=\frac{1}{c_{\kappa}d^{2}nh^{-1}}.

The calculation at the top of page 6 of Phillips and Tai (2018b, Lemma 1) yields

𝖽𝗂𝗌𝖼(X,χ,y)\displaystyle\mathsf{disc}(X,\chi,y) :=|i=1nχ(Xi)Ks(Xi,y)|\displaystyle:=\left|\sum_{i=1}^{n}\chi(X_{i})K_{s}(X_{i},y)\right|
|i=1nχ(Xi)Ks(Xi,y0)|+1\displaystyle\leq\left|\sum_{i=1}^{n}\chi(X_{i})K_{s}(X_{i},y_{0})\right|+1

where y0y_{0} is the closest point to yy in the lattice \mathcal{L}, and χ\chi assigns either +1+1 or 1-1 to each element of X=X1,,XnX=X_{1},\ldots,X_{n}. Moreover, with the bounded influence of KsK_{s}, if

mini|yXi|n2,\min_{i}\left|y-X_{i}\right|_{\infty}\geq n^{2},

then

𝖽𝗂𝗌𝖼(X,χ,y)=|i=1nχ(Xi)Ks(Xi,y)|1.\mathsf{disc}(X,\chi,y)=\left|\sum_{i=1}^{n}\chi(X_{i})K_{s}(X_{i},y)\right|\leq 1.

On defining

X={y:mini|yXi|n2},\mathcal{L}_{X}=\mathcal{L}\cap\{y:\,\min_{i}\left|y-X_{i}\right|_{\infty}\leq n^{2}\},

we see that

maxyd𝖽𝗂𝗌𝖼(X,χ,y)maxyX𝖽𝗂𝗌𝖼(X,χ,y)+1\max_{y\in\mathbb{R}^{d}}\mathsf{disc}(X,\chi,y)\leq\max_{y\in\mathcal{L}_{X}}\mathsf{disc}(X,\chi,y)+1

for all signings χ:X{1,+1}\chi:X\to\{-1,+1\}. This is precisely the conclusion of Phillips and Tai (2018b, Lemma 1).

With this established, the positive semidefiniteness and bounded diagonal entries of K_{s}, together with Phillips and Tai (2018b, Lemmas 2 and 3), imply that

𝖽𝗂𝗌𝖼Ks=O(dlogn).\mathsf{disc}_{K_{s}}=O(\sqrt{d\log n}).

Given ε>0\varepsilon>0, the halving algorithm can be applied to KsK_{s} as in Phillips and Tai (2018b, Corollary 5) to yield a coreset XSX_{S} of size m=O(ε1dlogε1)m=O(\varepsilon^{-1}\sqrt{d\log\varepsilon^{-1}}) such that

1nj=1nKs(Xj,y)1mjSKs(Xj,y)ε.\lVert\frac{1}{n}\sum_{j=1}^{n}K_{s}(X_{j},y)-\frac{1}{m}\sum_{j\in S}K_{s}(X_{j},y)\rVert_{\infty}\leq\varepsilon.

Rescaling by hdh^{-d}, we have

\lVert\hat{f}-\hat{f}_{S}^{\mathsf{unif}}\rVert_{\infty}=h^{-d}\lVert\frac{1}{n}\sum_{j=1}^{n}K_{s}(X_{j},y)-\frac{1}{m}\sum_{j\in S}K_{s}(X_{j},y)\rVert_{\infty}\leq\varepsilon h^{-d}.

Recall from Section B.2 that \hat{f} attains the minimax rate of estimation on \mathcal{P}_{\mathcal{H}}(\beta,L). Thus setting \varepsilon=h^{d}n^{-\beta/(2\beta+d)} we get a coreset of size \tilde{O}_{d}(n^{\frac{\beta+d}{2\beta+d}}) that attains the minimax rate c_{\beta,d,L}\,n^{-\beta/(2\beta+d)}, as desired. Moreover, by the results of Phillips and Tai (2018b), this coreset can be constructed in polynomial time. ∎
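The size bound follows from one line of exponent arithmetic: up to logarithmic factors m\asymp\varepsilon^{-1}, with \varepsilon=h^{d}n^{-\beta/(2\beta+d)} and h=n^{-1/(2\beta+d)}. A symbolic check (sanity check only):

```python
import sympy as sp

beta, d = sp.symbols('beta d', positive=True)
h_exp = -1 / (2 * beta + d)                     # h = n^{h_exp}
eps_exp = d * h_exp - beta / (2 * beta + d)     # eps = h^d * n^{-beta/(2beta+d)}
print(sp.simplify(-eps_exp - (beta + d) / (2 * beta + d)))   # 0: m ~ eps^{-1}
```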

Acknowledgments We thank Cole Franks for helpful discussions regarding algorithmic aspects of Carathéodory’s theorem.

References

  • Agarwal et al. [2005] Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1–30, 2005.
  • Ashtiani et al. [2020] Hassan Ashtiani, Shai Ben-David, Nicholas JA Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of Gaussian mixtures via compression schemes. Journal of the ACM (JACM), 67(6):1–42, 2020.
  • Bach et al. [2012] Francis R Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In ICML, 2012.
  • Bachem et al. [2017] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
  • Bachem et al. [2018] Olivier Bachem, Mario Lucic, and Andreas Krause. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1119–1127, 2018.
  • Bansal et al. [2018] Nikhil Bansal, Daniel Dadush, Shashwat Garg, and Shachar Lovett. The Gram-Schmidt walk: a cure for the Banaszczyk blues. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 587–597, 2018. 10.1145/3188745.3188850. URL https://doi.org/10.1145/3188745.3188850.
  • Bubeck [2015] Sébastien Bubeck. Convex optimization: algorithms and complexity. Now Publishers Inc., 2015.
  • Carathéodory [1907] C. Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen, March 1907. URL https://doi.org/10.1007/bf01449883.
  • Chazelle and Matoušek [1996] Bernard Chazelle and Jiří Matoušek. On linear-time deterministic algorithms for optimization problems in fixed dimension. Journal of Algorithms, 21(3):579–597, 1996.
  • Chazelle [2000] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, Cambridge, 2000.
  • Claici et al. [2020] Sebastian Claici, Aude Genevay, and Justin Solomon. Wasserstein measure coresets. arXiv preprint arXiv:1805.07412, 2020.
  • Clarkson [2010] Kenneth L Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):1–30, 2010.
  • Cover and Thomas [2006] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.
  • Feldman et al. [2011] Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In Advances in neural information processing systems, pages 2142–2150, 2011.
  • Feldman et al. [2013] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1434–1453. SIAM, 2013.
  • Frahling and Sohler [2005] Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’05, pages 209–217, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1581139608. 10.1145/1060590.1060622. URL https://doi.org/10.1145/1060590.1060622.
  • Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
  • Gärtner and Jaggi [2009] Bernd Gärtner and Martin Jaggi. Coresets for polytope distance. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 33–42, 2009.
  • Grötschel et al. [2012] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
  • Hanneke et al. [2019] Steve Hanneke, Aryeh Kontorovich, and Menachem Sadigurschi. Sample compression for real-valued learners. In Algorithmic Learning Theory, pages 466–488, 2019.
  • Har-Peled and Kushal [2007] Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3–19, 2007.
  • Harvey and Samadi [2014] Nick Harvey and Samira Samadi. Near-optimal herding. In Conference on Learning Theory, pages 1165–1182, 2014.
  • Hiriart-Urruty and Lemaréchal [2004] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2004.
  • Huggins et al. [2016] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.
  • Joshi et al. [2011] Sarang Joshi, Raj Varma Kommaraji, Jeff M. Phillips, and Suresh Venkatasubramanian. Comparing distributions and shapes using the kernel distance. In Proceedings of the Twenty-Seventh Annual Symposium on Computational Geometry, SoCG ’11, page 47–56, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306829. 10.1145/1998196.1998204. URL https://doi.org/10.1145/1998196.1998204.
  • Karnin and Liberty [2019] Zohar Karnin and Edo Liberty. Discrepancy, coresets, and sketches in machine learning. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1975–1993, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/karnin19a.html.
  • Katznelson [2004] Yitzhak Katznelson. An Introduction to Harmonic Analysis. Cambridge Mathematical Library. Cambridge University Press, 3 edition, 2004. 10.1017/CBO9781139165372.
  • Littlestone and Warmuth [1986] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Unpublished manuscript, 1986.
  • Lopez-Paz et al. [2015] David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, and Iliya Tolstikhin. Towards a learning theory of cause-effect inference. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1452–1461, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/lopez-paz15.html.
  • Matoušek [1999] J. Matoušek. Geometric Discrepancy: an Illustrated Guide. Springer, New York, 1999.
  • McDonald [2017] Daniel McDonald. Minimax Density Estimation for Growing Dimension. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 194–203, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/mcdonald17a.html.
  • Munteanu et al. [2018] Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruff. On coresets for logistic regression. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 6562–6571, Red Hook, NY, USA, 2018. Curran Associates Inc.
  • Phillips [2013] Jeff M Phillips. ε\varepsilon-samples for kernels. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1622–1632. SIAM, 2013.
  • Phillips and Tai [2018a] Jeff M Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2718–2727. SIAM, 2018a.
  • Phillips and Tai [2018b] Jeff M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. In 34th International Symposium on Computational Geometry, SoCG 2018, June 11-14, 2018, Budapest, Hungary, pages 66:1–66:13, 2018b. 10.4230/LIPIcs.SoCG.2018.66. URL https://doi.org/10.4230/LIPIcs.SoCG.2018.66.
  • Tikhomirov [1993] V. M. Tikhomirov. ε\varepsilon-Entropy and ε\varepsilon-Capacity of Sets In Functional Spaces, pages 86–170. Springer Netherlands, Dordrecht, 1993. ISBN 978-94-017-2973-4. 10.1007/978-94-017-2973-4_7. URL https://doi.org/10.1007/978-94-017-2973-4_7.
  • Tsybakov [2009] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer series in statistics. Springer, 2009. ISBN 978-0-387-79051-0. 10.1007/b13794. URL https://doi.org/10.1007/b13794.
  • Zheng and Phillips [2017] Yan Zheng and Jeff M Phillips. Coresets for kernel regression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 645–654, 2017.