A Statistical Perspective on Coreset Density Estimation

Paxton Turnerlabel=paxton][email protected] [ Jingbo Liulabel=jingbo][email protected] [ and Philippe Rigollet label=rigollet][email protected] P.R. was supported by NSF awards IIS-1838071, DMS-1712596, DMS-1740751, and DMS-2022448.[ Massachusetts Institute of Technology and University of Illinois at Urbana-Champaign Paxton Turner
Department of Mathematics
Massachusetts Institute of Technology
77 Massachusetts Avenue,
Cambridge, MA 02139-4307, USA
Jingbo Liu
Department of Statistics
University of Illinois at Urbana-Champaign
725 S. Wright St.,
Champaign, IL 61820, USA
Philippe Rigollet
Department of Mathematics
Massachusetts Institute of Technology
77 Massachusetts Avenue,
Cambridge, MA 02139-4307, USA

Abstract

Coresets have emerged as a powerful tool to summarize data by selecting a small subset of the original observations while retaining most of its information. This approach has led to significant computational speedups but the performance of statistical procedures run on coresets is largely unexplored. In this work, we develop a statistical framework to study coresets and focus on the canonical task of nonparameteric density estimation. Our contributions are twofold. First, we establish the minimax rate of estimation achievable by coreset-based estimators. Second, we show that the practical coreset kernel density estimators are near-minimax optimal over a large class of Hölder-smooth densities.

62G07,

68Q32,

keywords:

[class=AMS]

keywords:

[class=KWD] data summarization, kernel density estimator, Carathéodory’s theorem, minimax risk, compression

1 Introduction

The ever-growing size of datasets that are routinely collected has led practitioners across many fields to contemplate effective data summarization techniques that aim at reducing the size of the data while preserving the information that it contains. While there are many ways to achieve this goal, including standard data compression algorithms, they often prevent direct manipulation of data for learning purposes. Coresets have emerged as a flexible and efficient set of techniques that permit direct data manipulation. Coresets are well-studied in machine learning (Har-Peled and Kushal, 2007; Feldman et al., 2013; Bachem et al., 2017, 2018; Karnin and Liberty, 2019), statistics (Feldman et al., 2011; Zheng and Phillips, 2017; Munteanu et al., 2018; Huggins et al., 2016; Phillips and Tai, 2018a, b), and computational geometry (Agarwal et al., 2005; Clarkson, 2010; Frahling and Sohler, 2005; Gärtner and Jaggi, 2009; Claici et al., 2020).

Given a dataset $\mathcal{D}=\{X_{1},\ldots,X_{n}\}\subset\mathbb{R}^{d}$ and task (density estimation, logistic regression, etc.) a coreset $\mathcal{C}$ is given by $\mathcal{C}=\{X_{i}\,:\,i\in S\}$ for some subset $S$ of $\{1,\ldots,n\}$ of size $|S|\ll n$ . A good coreset should suffice to perform the task at hand with the same accuracy as with the whole dataset $\mathcal{D}$ .

In this work we study the canonical task of density estimation. Given i.i.d random variables $X_{1},\ldots,X_{n}\sim\mathbb{P}_{f}$ that admit a common density $f$ with respect to the Lebesgue measure over $\mathbb{R}^{d}$ , the goal of density estimation is to estimate $f$ . It is well known that the minimax rate of estimation over the $L$ -Hölder smooth densities $\mathcal{P}_{\mathcal{H}}(\beta,L)$ of order $\beta$ is given by

\inf_{\hat{f}}\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}-f\rVert_{2}=\Theta_{\beta,d,L}(n^{-\frac{\beta}{2\beta+d}})\,,

(1)

where the infimum is taken over all estimators based on the dataset $\mathcal{D}$ . Moreover the minimax rate above is achieved by a kernel density estimator

\hat{f}_{n}(x):=\frac{1}{nh^{d}}\sum_{j=1}^{n}K\left(\frac{X_{i}-x}{h}\right)

(2)

for suitable choices of kernel $K:\mathbb{R}^{d}\to\mathbb{R}$ and bandwidth $h>0$ (see e.g. Tsybakov, 2009, Theorem 1.2).

The main goal of this paper is to extend this understanding of rates for density estimation to estimators based on coresets. Specifically we would like to characterize the statistical performance of coresets in terms of their cardinality. To do so, we investigate two families of estimators built on coresets: one that is quite flexible and allows arbitrary estimators to be used on the coreset and another that is more structured and driven by practical considerations; it consists of weighted kernel density estimators built on coresets.

1.1 Two statistical frameworks for coreset density estimation

We formally define a coreset as follows. Throughout this work $m=o(n)$ denotes the cardinality of the coreset. Let $S=S(y|x)$ denote a conditional probability measure on $\binom{[n]}{m}$ , where $x\in\mathbb{R}^{d\times n}$ . In information theoretic language, $S$ is a channel from $\mathbb{R}^{d\times n}$ to subsets of cardinality $m$ . We refer to the channel $S$ as a coreset scheme because it designates a data-driven method of choosing a subset of data points. In what follows, we abuse notation and let $S=S(x)$ denote an instantiation of a sample from the measure $S(y|x)$ for $x\in\mathbb{R}^{d\times n}$ . A coreset $X_{S}$ is then defined to be the projection of the dataset $X=(X_{1},\ldots,X_{n})$ onto the subset indicated by $S(X)$ : $X_{S}:=\{X_{i}\}_{i\in S(X)}$ .

The first family of estimators that we investigate is quite general and allows the statistician to select a coreset and then employ an estimator that only manipulates data points in the coreset to estimate an unknown density. To study coresets, it is convenient to make the dependence of estimators on observations more explicit than in the traditional literature. More specifically, a density estimator $\hat{f}$ based on $n$ observations $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ is a function $\hat{f}:\mathbb{R}^{d\times n}\to L_{2}(\mathbb{R}^{d})$ denoted by $\hat{f}[X_{1},\ldots,X_{n}](\cdot)$ . Similarly, a coreset-based estimator $\hat{f}_{S}$ is constructed from a coreset scheme $S$ of size $m$ and an estimator (measurable function) $\hat{f}:\mathbb{R}^{d\times m}\to L_{2}(\mathbb{R}^{d})$ on $m$ observations. We enforce the additional restriction on $\hat{f}$ that for all $y_{1},\ldots,y_{m}\in\mathbb{R}^{d}$ and for all bijections $\pi:[m]\to[m]$ , it holds that $\hat{f}[y_{1},\ldots,y_{m}](\cdot)=\hat{f}[y_{\pi(1)},\ldots,y_{\pi(m)}](\cdot)$ . Given $S$ and $\hat{f}$ as above, we define the coreset-based estimator $\hat{f}_{S}:\mathbb{R}^{d\times n}\to L_{2}(\mathbb{R}^{d})$ to be the function $\hat{f}_{S}[X](\cdot):=\hat{f}[X_{S}](\cdot):\mathbb{R}^{d}\to\mathbb{R}$ . We evaluate the performance of coreset-based estimators in Section 2 by characterizing their rate of estimation over Hölder classes.¹¹1Our notion of coreset-based estimators bares conceptual similarity to various notions of compression schemes as studied in the literature, e.g. Littlestone and Warmuth (1986); Ashtiani et al. (2020); Hanneke et al. (2019).

The symmetry restriction on $\hat{f}$ prevents the user from exploiting information about the ordering of data points to their advantage: the only information that can be used by the estimator $\hat{f}$ is contained in the unordered collection of distinct vectors given by the coreset $X_{S}$ .

As evident from the the results in Section 2, the information-theoretically optimal coreset estimator does not resemble coreset estimators employed in practice. To remedy this limitation, we also study weighted coreset kernel density estimators (KDEs) in Section 3. Here the statistician selects a kernel $k$ , bandwidth parameter $h$ , and a coreset $X_{S}$ of cardinality $m$ as defined above and then employs the estimator

\hat{f}_{S}(y)=\sum_{j\in S}\lambda_{j}h^{-d}k\left(\frac{X_{j}-y}{h}\right),

where the weights $\{\lambda_{j}\}_{j\in S}$ are nonnegative, sum to one and are allowed to depend on the full dataset.

In the case of uniform weights where $\lambda_{j}=\frac{1}{m}$ for all $j\in S$ , coreset KDEs are well-studied (see e.g. Bach et al., 2012; Harvey and Samadi, 2014; Phillips and Tai, 2018a, b; Karnin and Liberty, 2019). Interestingly, our results show that allowing flexibility in the weights gives a definitive advantage for the task of density estimation. By Theorems 2 and 5, the uniformly weighted coreset KDEs require a much larger coreset than that of weighted coreset KDEs to attain the minimax rate of estimation over univariate Lipschitz densities.

1.2 Setup and Notation

We reserve the notation $\lVert\cdot\rVert_{2}$ for the $L_{2}$ norm and $\left|\cdot\right|_{p}$ for the $\ell_{p}$ -norm. The constants $c,c_{\beta,d},c_{L},$ etc. vary from line to line and the subscripts indicate parameter dependences.

Fix an integer $d\geq 1$ . For any multi-index $s=(s_{1},\ldots,s_{d})\in\mathbb{Z}_{\geq 0}^{d}$ and $x=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$ , define $s!=s_{1}!\cdots s_{d}!$ , $x^{s}=x_{1}^{s_{1}}\cdots x_{d}^{s_{d}}$ and let $D^{s}$ denote the differential operator defined by

D^{s}=\frac{\partial^{|s|_{1}}}{\partial x_{1}^{s_{1}}\cdots\partial x_{d}^{s_{d}}}\,.

Fix a positive real number $\beta,$ and let $\lfloor\beta\rfloor$ denote the maximal integer strictly less than $\beta$ . Given a multi-index $s$ , the notation $\left|s\right|$ signifies the coordinate-wise application of $\left|\cdot\right|$ to $s$ .

Given $L>0$ we let $\mathcal{H}(\beta,L)$ denote the space of Hölder functions $f:\mathbb{R}^{d}\to\mathbb{R}$ that are supported on the cube $[-1/2,1/2]^{d}$ , are $\lfloor\beta\rfloor$ times differentiable, and satisfy

|D^{s}f(x)-D^{s}f(y)|\leq L\left|x-y\right|_{2}^{\beta-\lfloor\beta\rfloor}\,,\quad

for all $x,y\in\mathbb{R}^{d}$ and for all multi-indices $s$ such that $\left|s\right|_{1}=\lfloor\beta\rfloor$ .

Let $\mathcal{P}_{\mathcal{H}}(\beta,L)$ denote the set of probability density functions contained in $\mathcal{H}(\beta,L)$ . For $f\in\mathcal{P}_{\mathcal{H}}(\beta,L)$ , let $\mathbb{P}_{f}$ (resp. $\mathbb{E}_{f}$ ) denote the probability distribution (resp. expectation) associated to $f$ .

For $d\geq 1$ and $\gamma\in\mathbb{Z}_{\geq 0}$ , we also define the Sobolev functions $\mathcal{S}(\gamma,L^{\prime})$ that consist of all $f:\mathbb{R}^{d}\to\mathbb{R}$ that are $\gamma$ times differentiable and satisfy

\lVert D^{\alpha}f\rVert_{2}\leq L^{\prime}

for all multi-indices $\alpha$ such that $\left|\alpha\right|_{1}=\gamma$ .

2 Coreset-based estimators

In this section we study the performance of coreset-based estimators. Recall that coreset-based estimators are estimators that only depend on the data points in the coreset.

Define the minimax risk for coreset-based estimators $\psi_{n,m}(\beta,L)$ over $\mathcal{P}_{\mathcal{H}}(\beta,L)$ to be

\psi_{n,m}(\beta,L)=\inf_{\hat{f},|S|=m}\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2},

(3)

where the infimum above is over all choices of coreset scheme $S$ of cardinality $m$ and all estimators $\hat{f}:\mathbb{R}^{d\times m}\to L_{2}(\mathbb{R}^{d})$ .

Our main result on coreset-based estimators characterizes their minimax risk.

Theorem 1.

Fix $\beta,L>0$ and an integer $d\geq 1$ . Assume that $m=o(n)$ . Then the minimax risk of coreset-based estimators satisfies

\inf_{\hat{f},|S|=m}\,\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}=\\ \Theta_{\beta,d,L}(n^{-\frac{\beta}{2\beta+d}}+(m\log{n})^{-\frac{\beta}{d}}).

The above theorem readily yields a characterization of the minimal size $m^{*}(\beta,d)$ that a coreset can have while still enjoying the minimax optimal rate $n^{-\frac{\beta}{2\beta+d}}$ from (1). More specifically, let $m^{*}=m^{*}(n)$ be such that

(i)

if $m(n)$ is a sequence such that $m=o(m^{*})$ , then $\liminf_{n\to\infty}n^{\frac{\beta}{2\beta+d}}\psi_{n,m}(\beta)=\infty$ , and
(ii)

if $m=\Omega(m^{*})$ then $\limsup_{n\to\infty}\psi_{n,m}(\beta)n^{\frac{\beta}{2\beta+d}}\leq C_{\beta,d,L}$ for some constant $C_{\beta,d,L}>0$ .

Then it follows readily from from Theorem 1 that $m^{*}=\Theta_{\beta,d,L}(n^{\frac{d}{2\beta+d}}/\log n)$ .

Theorem 1 illustrates two different curses of dimensionality: the first stems from the original estimation problem, and the second stems from the compression problem. As $d\to\infty$ , it holds that $m^{*}\sim n/\log n$ , and in this regime there is essentially no compression, as the implicit constant in Theorem 1 grows rapidly with $d$ .²²2 In fact, even for the classical estimation problem (1), this constant scales as $d^{d}$ (see McDonald, 2017, Theorem 3).

Our proof of the lower bound in Theorem 1 first uses a standard reduction from estimation to multiple hypothesis testing problem over a finite function class. While Fano’s inequality is the workhorse of our second step, note that the lower bound must hold only for coreset estimators and not any estimator as in standard minimax lower bounds. This additional difficulty is overcome by a careful handling of the information structure generated by coreset scheme channels rather than using off-the-shelf results for minimax lower bounds. The full details of the lower bound are in the Appendix.

The estimator achieving the rate in Theorem (1) relies on an encoding procedure. It is constructed by building a dictionary between the subsets in $\binom{[n]}{m}$ and an $\varepsilon$ -net on the space of Hölder functions. The key idea is that, for $\omega(1)=m\leq n/2$ , the amount of subsets of size $m$ is extremely large, so for $m$ large enough, there is enough information to encode a nearby-neighbor in $L_{2}(\mathbb{R}^{d})$ to the kernel density estimator on the entire dataset.

2.1 Proof of the upper bound in Theorem 1

Fix $\varepsilon=c^{*}(m\log n)^{-\frac{\beta}{d}}$ for $c^{*}$ to be determined and let $\mathcal{N}_{\varepsilon}$ denote an $\varepsilon$ -net of $\mathcal{P}_{\mathcal{H}}(\beta,L)$ with respect to the $L_{2}([-\frac{1}{2},\frac{1}{2}]^{d})$ norm. It follows from the classical Kolmogorov-Tikhomorov bound (see, e.g., Theorem XIV of Tikhomirov, 1993) that there exists a constant $C_{\mathsf{KT}}(\beta,d,L)>0$ such that we can choose $\mathcal{N}_{\varepsilon}$ with $\log|\mathcal{N}_{\varepsilon}|\leq C_{\mathsf{KT}}(\beta,d,L)\,\varepsilon^{-d/\beta}$ . In particular, there exists $\mathsf{f}\in\mathcal{N}_{\varepsilon}$ such that $\|\hat{f}_{n}-\mathsf{f}\|_{L^{2}([-1/2,1/2]^{d})}\leq\varepsilon$ where $\hat{f}_{n}$ is the minimax optimal kernel density estimator defined in (2).

We now develop our encoding procedure for $\mathsf{f}$ . To that end, fix an integer $K\geq m$ such that $\binom{K}{m}\geq|\mathcal{N}_{\varepsilon}|$ and let $\phi:\binom{[K]}{m}\to\mathcal{N}_{\varepsilon}$ be any surjective map. Our procedure only looks at the first coordinates of the sample $X=\{X_{1},\ldots,X_{n}\}$ . Denote these coordinates by $x=\{x_{1},\ldots,x_{n}\}$ and note that these $n$ numbers are almost surely distinct. Let $A$ denote a parameter to be determined, and define the intervals

B_{ik}=[(i-1)K^{-1}A+(k-1)A,\\ \quad(i-1)K^{-1}A+(k-1)A+K^{-1}A].

For $i=1,\ldots,K$ , define

B_{i}=\bigcup_{k=1}^{1/A}B_{ik}.

The next lemma, whose proof is in the Appendix, ensures that with high probability every bin $B_{i}$ contains the first coordinate $x_{i}$ of at least one data point.

Lemma 1.

Let $K^{-1}=c(\log n)/n$ for $c>0$ a sufficiently large absolute constant, and let $A=A_{\beta,L,K}$ denote a sufficiently small constant. Then for all $f\in\mathcal{P}_{\mathcal{H}}(\beta,L)$ and $X_{1},\ldots,X_{n}\stackrel{{\scriptstyle iid}}{{\sim}}\mathbb{P}_{f}$ , the event that for every $j=1,\ldots,K$ there exists some $x_{i}$ in bin $B_{j}$ holds with probability at least $1-O(n^{-2})$ .

In the high-probability event $\mathcal{E}$ that every bin $B_{i}$ contains the first coordinate of some data point, choose a unique representative $x_{j}^{\circ}\in x$ such that $x_{j}^{\circ}\in B_{j}$ and pick any $T_{\mathsf{f}}\in\phi^{-1}(\mathsf{f})$ . Then define $S=\{i\,:\,x_{i}=x^{\circ}_{j},j\in T_{\mathsf{f}}\}$ . If there exists a bin with no observation, then let $X_{S}$ consist of two data points lying in the same bin and $m-2$ random data points. Then set $\hat{f}_{S}\equiv 0$ .

Note that $\hat{f}_{S}$ is indeed a coreset-based estimator. The function $\hat{f}$ such that $\hat{f}_{S}=\hat{f}[X_{S}]$ looks at the $m$ data points in the coreset, and if their first coordinates lie in distinct bins, then $X_{S}$ is decoded as above to output the corresponding element $\mathsf{f}$ of the net $\mathcal{N}_{\varepsilon}$ . Otherwise, $\hat{f}\equiv 0$ .

Next, it suffices to show the upper bound of Theorem 1 in the case when $m\leq cn^{d/(2\beta+d)}$ for $c$ a sufficiently small absolute constant. For $c^{*}=c^{*}_{\beta,d,L}$ sufficiently large, by Stirling’s formula and our choice of $K$ it holds that

\log\binom{K}{m}\geq C_{\mathsf{KT}}(\beta,d,L)\,\left(\frac{1}{\varepsilon}\right)^{\frac{d}{\beta}}\geq\log|\mathcal{N}_{\varepsilon}|.

Hence, the surjection $\phi$ and our encoding estimator $\hat{f}_{S}$ are well-defined.

Next we have

\mathbb{E}_{f}\lVert\hat{f}_{S}-f\rVert_{2}=\mathbb{E}_{f}\big{[}\lVert\mathsf{f}-f\rVert_{2}{\rm 1}\kern-2.40005pt{\rm I}_{\mathcal{E}}\big{]}+\mathbb{E}_{f}\big{[}\lVert 0-f\rVert_{2}{\rm 1}\kern-2.40005pt{\rm I}_{\mathcal{E}^{c}}\big{]}.

We control the first term as follows using (1) and the fact that $\lVert\mathsf{f}-\hat{f}_{n}\rVert_{2}\leq\varepsilon$ on $\mathcal{E}$ :

	$\displaystyle\mathbb{E}_{f}\big{[}\lVert\mathsf{f}-f\rVert_{2}{\rm 1}\kern-2.40005pt{\rm I}_{\mathcal{E}}\big{]}$	$\displaystyle\leq\mathbb{E}_{f}\lVert\hat{f}_{n}-f\rVert_{2}+\mathbb{E}_{f}\lVert\mathsf{f}-\hat{f}_{n}\rVert_{2}$
		$\displaystyle\leq c_{\beta,d,L}\,\big{(}n^{\frac{-\beta}{2\beta+d}}+(m\log n)^{-\frac{\beta}{d}}\big{)}.$

By the Cauchy-Schwarz inequality,

	$\displaystyle\mathbb{E}_{f}\big{[}\lVert 0-f\rVert_{2}{\rm 1}\kern-2.40005pt{\rm I}_{\mathcal{E}^{c}}\big{]}$	$\displaystyle\leq\big{(}\mathbb{E}_{f}\lVert f\rVert_{2}^{2}\,\mathbb{P}(\mathcal{E}^{c})\big{)}^{1/2}$
		$\displaystyle\leq c_{\beta,d,L}\,n^{-1}\,.$

Put together, the previous three displays yield the upper bound of Theorem 1.

3 Coreset kernel density estimators

In this section, we consider the family of weighted kernel density estimators built on coresets and study its rate of estimation over the Hölder densities. In this framework, the statistician first computes a minimax estimator $\hat{f}$ using the entire dataset and then approximates $\hat{f}$ with a weighted kernel density estimator over the coreset. Here we allow the weights to be a measurable function over the entire dataset rather than just the coreset.

As is typical in density estimation, we consider kernels $k:\mathbb{R}^{d}\to\mathbb{R}$ of the form $k(x)=\prod_{i=1}^{d}\kappa(x_{i})$ where $\kappa$ is an even function and $\int\kappa(x)\,\mathrm{d}x=1$ . Given bandwidth parameter $h$ , we define $k_{h}(x)=h^{-d}\,k(\frac{x}{h})$ .

3.1 Carathéodory coreset method

Given a KDE with uniform weights and bandwidth $h$ defined by

\hat{f}(y)=\frac{1}{n}\sum_{j=1}^{n}k_{h}(X_{j}-y),

on a sample $X_{1},\ldots,X_{n}$ , we define a coreset KDE $\hat{g}_{S}$ as follows in terms of a cutoff frequency $T>0$ . Define $A=\{\omega\in\frac{\pi}{2}\mathbb{Z}^{d}:\,\left|\omega\right|_{\infty}\leq T\}$ . Consider the complex vectors $(e^{i\langle X_{j},\omega\rangle})_{\omega\in A}$ . By Carathéodory’s theorem (Carathéodory, 1907), there exists a subset $S\subset[n]$ of cardinality at most $2(1+\frac{4T}{\pi})^{d}+1$ and nonnegative weights $\{\lambda_{j}\}_{j\in S}$ with $\sum_{j\in S}\lambda_{j}=1$ such that

\frac{1}{n}\sum_{j=1}^{n}(e^{i\langle X_{j},\omega\rangle})_{\omega\in A}=\sum_{j\in S}\lambda_{j}(e^{i\langle X_{j},\omega\rangle})_{\omega\in A}.

(4)

Then $\hat{g}_{S}(y)$ is defined to be

\hat{g}_{S}(y)=\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y).

3.1.1 Algorithmic considerations

For a convex polyhedron $P$ with vertices $v_{1},\ldots,v_{n}\in\mathbb{R}^{D}$ , the proof of Carathéodory’s theorem is constructive and yields a polynomial-time algorithm in $n$ and $D$ to find a convex combination of $D+1$ vertices that represents a given point in $P$ (Carathéodory, 1907) (see also Hiriart-Urruty and Lemaréchal, 2004, Theorem 1.3.6). For completeness, we describe below this algorithm applied to our problem. Note that, more generally, for a large class of convex bodies, Carathéodory’s theorem may be implemented efficiently using standard tools from convex optimization (Grötschel et al., 2012, Chapter 6).

Set $D=2|A|\leq 2(1+\frac{4T}{\pi})^{d}$ . For $j=1,\ldots,n$ , let

v_{j}=(\mathsf{Re}\,e^{i\langle X_{j},\omega\rangle},\mathsf{Im}\,e^{i\langle X_{j},\omega\rangle})_{\omega\in A}\in\mathbb{R}^{D}.

Let $M$ denote the matrix with columns $(v_{1},1)^{T},\ldots,(v_{n},1)^{T}\in\mathbb{R}^{D+1}$ , and let $\Delta_{n-1}\subset\mathbb{R}^{n}$ denote the standard simplex. Assume without loss of generality that $n\geq D+2$ . Next,

1.

Find a nonzero vector $w\in\text{ker}(M)$
2.

Find $\alpha>0$ so that $\lambda_{1}:=\frac{1}{n}{\rm 1}\kern-2.40005pt{\rm I}+\alpha w$ lies on the boundary of $\Delta_{n-1}$

Observe that $M\lambda_{1}=(\frac{1}{n}\sum v_{i},1)^{T}$ , and since $\lambda_{1}\in\partial\Delta_{n-1}$ the average is now represented using a convex combination of at most $n-1$ of the vertices $v_{1},\ldots,v_{n}$ . As long as at least $D+2$ vertices remain, we can continue reducing the number of vertices used to represent $\frac{1}{n}\sum v_{j}$ by applying steps 1 and 2. Thus after at most $n-D-1$ iterations, we obtain $\lambda\in\Delta_{D}$ that satisfies $\sum\lambda_{j}v_{j}=\frac{1}{n}\sum v_{i}$ , as desired.

3.2 Results on Carathéodory coresets

Proposition 1 is key to our results and specifies conditions on the kernel guaranteeing that the Carathéodory method yields an accurate estimator.

Proposition 1.

Let $k(x)=\prod_{i=1}^{d}\kappa(x_{i})$ denote a kernel with $\kappa\in\mathcal{S}(\gamma,L^{\prime})$ such that $\left|\kappa(x)\right|\leq c_{\beta,d}\left|x\right|^{-\nu}$ for some $\nu\geq\beta+d$ , and the KDE

\hat{f}(y)=\frac{1}{n}\sum_{i=1}^{n}k_{h}(X_{i}-y)

with bandwidth $h=n^{-\frac{1}{2\beta+d}}$ satisfies

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert f-\hat{f}\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

(5)

Then the Carathéodory coreset estimator $\hat{g}_{S}(y)$ constructed from $\hat{f}$ with $T=c_{d,\gamma,L^{\prime}}\,n^{\frac{d/2+\beta+\gamma}{\gamma(2\beta+d)}}$ satisfies

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

There exists a kernel $k_{s}\in\mathcal{C}^{\infty}$ that satisfies the conditions above for all $\beta$ and $\gamma$ . We sketch the details here and postpone the full argument to the Proof of Theorem 2 in the Appendix. Let $\psi:[-1,1]\to[0,1]$ denote a cutoff function that has the following properties: $\psi\in\mathcal{C}^{\infty}$ , $\psi\big{|}_{[-1,1]}\equiv 1$ , and $\psi$ is compactly supported on $[-2,2]$ . Define $\kappa_{S}(x)=\mathcal{F}[\psi](x)$ , and let $k_{s}(x)=\prod_{i=1}^{d}\kappa_{S}(x_{i})$ denote the resulting kernel. Observe that for all $\beta>0$ , the kernel $k_{s}$ satisfies

\text{ess sup}_{\omega\neq 0}\frac{1-\mathcal{F}[k_{s}](\omega)}{\left|\omega\right|^{\alpha}}\leq 1,\quad\forall\alpha\preceq\beta.

Using standard results from (Tsybakov, 2009), this implies that the resulting KDE $\hat{f}_{s}$ satisfies (5). Since $\psi=\mathcal{F}^{-1}[k_{s}]\in\mathcal{C}^{\infty}$ , the Riemann–Lebesgue lemma guarantees that $\left|\kappa_{s}(x)\right|\leq c_{\beta,d}\left|x\right|^{\nu}$ is satisfied for $\nu=\lceil\beta+d\rceil$ . Since $\psi$ is compactly supported, an application of Parseval’s identity yields $\kappa_{s}\in\mathcal{S}(\gamma,c_{\gamma})$ . Applying Proposition 1 to $k_{s}$ , we conclude that for the task of density estimation, weighted KDEs built on coresets are nearly as powerful as the coreset-based estimators studied in Section 2.

Theorem 2.

Let $\varepsilon>0$ . The Carathéodory coreset estimator $\hat{g}_{S}(y)$ built using the kernel $k_{s}$ and setting $T=c_{d,\beta,\varepsilon}\,n^{\frac{\varepsilon}{d}+\frac{1}{2\beta+d}}$ satisfies

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

The corresponding coreset has cardinality

m=c_{d,\beta,\varepsilon}n^{\frac{d}{2\beta+d}+\varepsilon}.

Theorem 2 shows that the Carathéodory coreset estimator achieves the minimax rate of estimation with near-optimal coreset size. In fact, a small modification yields a near-optimal rate of convergence for any coreset size as in Theorem 1.

Corollary 1.

Let $\varepsilon>0$ and $m\leq c_{\beta,d,\varepsilon}\,n^{\frac{d}{2\beta+d}+\varepsilon}$ . The Carathéodory coreset estimator $\hat{g}_{S}(y)$ built using the kernel $k_{s}$ , setting $h=m^{-\frac{1}{d}+\frac{\varepsilon}{\beta}}$ and $T=c_{d}\,m^{1/d}$ , satisfies

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,\varepsilon,L}\,\left(m^{-\frac{\beta}{d}+\varepsilon}+n^{-\frac{\beta}{2\beta+d}+\varepsilon}\right),

and the corresponding coreset has cardinality $m$ .

Next we apply Proposition 1 to the popular Gaussian kernel $\phi(x)=(2\pi)^{-d/2}\exp(-\frac{1}{2}\left|x\right|_{2}^{2})$ . This kernel has rapid decay in the real domain and Fourier space, and is thus amenable to our techniques. Moreover, $k_{\phi}$ is a kernel of order $\ell=1$ , (Tsybakov, 2009, Definition 1.3 and Theorem 1.2) and so the standard KDE $\hat{f}_{\phi}$ on the full dataset attains the minimax rate of estimation $c_{d,L}n^{1/(2+d)}$ over the Lipschitz densities $\mathcal{P}_{\mathcal{H}}(1,L)$ .

Theorem 3.

Let $\varepsilon>0$ . The Carathéodory coreset estimator $\hat{g}_{\phi}(y)$ built using the kernel $\phi$ and setting $T=c_{d,\varepsilon}\,n^{\frac{1}{2+d}+\frac{\varepsilon}{d}}$ satisfies

\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\mathbb{E}\lVert\hat{g}_{\phi}-f\rVert_{2}\leq c_{d,L}\,n^{-\frac{1}{2+d}}.

The corresponding coreset has cardinality

m=c_{d,\varepsilon}n^{\frac{d}{2+d}+\varepsilon}.

In addition, we have a nearly matching lower bound to Theorem 2 for coreset KDEs. In fact, our lower bound applies to a generalization of coreset KDEs where the vector of weights $(\lambda_{j})_{j}$ is not constrained to be in the simplex but can range within a hypercube of width that may grow polynomially with $n$ : $\max_{j\in S}\left|\lambda_{j}\right|\leq n^{B}$ .

Theorem 4.

Let $A,B\geq 1$ . Let $k$ denote a kernel with $\lVert k\rVert_{2}\leq n$ . Let $\hat{g}_{S}$ denote a weighted coreset KDE with bandwidth $h\geq n^{-A}$ built from $k$ with weights $\{\lambda_{j}\}_{j\in S}$ satisfying $\max_{j\in S}\left|\lambda_{j}\right|\leq n^{B}$ . Then

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}\geq\\ c_{\beta,d,L}\left[(A+B)^{-\frac{\beta}{d}}(m\log{n})^{-\frac{\beta}{d}}+n^{-\frac{\beta}{2\beta+d}}\right].

This result is essentially a consequence of the lower bound in Theorem 1 because, in an appropriate sense, coreset KDEs with bounded weights are well-approximated by coreset-based estimators. Hence, in the case of bounded weights, allowing these weights to be measurable functions of the entire dataset rather than just the coreset, as would be required in Section 2, does not make a significant difference for the purpose of estimation. The full details of Theorem 4 are postponed to the Appendix.

3.3 Proof sketch of Proposition 1

Here we sketch the proof of Proposition 1, our main tool in constructing effective coreset KDEs. Full details of the argument may be found in the Appendix.

Let $k(x)=\prod_{i=1}^{d}\kappa(x_{i})$ denote a kernel, and suppose that $\hat{f}(y)=\frac{1}{n}\sum_{i=1}^{n}k_{h}(X_{i}-y)$ is a good estimator for an unknown density $f$ in that

\lVert f-\hat{f}\rVert_{2}\leq\varepsilon:=c_{\beta,d}n^{-\frac{\beta}{2\beta+d}}

on setting $h=n^{-1/(2\beta+d)}$ . Our goal is to find a subset $S\subset[n]$ and weights $\{\lambda_{j}\}_{j\in S}$ such that

\frac{1}{n}\sum_{i=1}^{n}k_{h}(X_{i}-y)\approx\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y).

Suppose for simplicity that $\kappa$ is compactly supported on $[-1/2,1/2]$ . By hypothesis and Parseval’s theorem $\kappa\in\mathcal{S}(\gamma,L^{\prime})$ , and we can further show that $k\in\mathcal{S}(\gamma,c_{d,L^{\prime}})$ and $k_{h}\in\mathcal{S}(\gamma,c_{d,L^{\prime}}h^{-d/2-\gamma})$ . Let $\mathcal{\bar{F}}$ denote the Fourier transform on the interval $[-1,1]$ . Using the Fourier decay of $k_{h}$ , we have

\lVert k_{h}(x)-\sum_{\left|\omega\right|_{\infty}<T}\mathcal{\bar{F}}[k_{h}](\omega)e^{i\langle x,\omega\rangle}\rVert_{2}\leq\varepsilon

(6)

when $T=(\frac{c_{d,\gamma,L^{\prime}}h^{-\frac{d}{2}-\gamma}}{\varepsilon})^{1/\gamma}=c_{d,\gamma,L^{\prime}}\,n^{\frac{d/2+\beta+\gamma}{\gamma(2\beta+d)}}$ . Observe that this matches the setting of $T$ in Proposition 1.

The approximation (6) implies that for $X_{i}\in[-1/2,1/2]^{d}$ ,

\hat{f}(y)\approx\sum_{\left|\omega\right|_{\infty}<T}\mathcal{\bar{F}}[k_{h}](\omega)\left(\frac{1}{n}\sum_{i=1}^{n}e^{i\langle X_{i},\omega\rangle}\right)e^{i\langle y,\omega\rangle}.

Using the Carathéodory coreset and weights $\{\lambda_{j}\}_{j\in S}$ constructed in Section 3.1, it follows that

\sum_{\left|\omega\right|_{\infty}<T}\mathcal{\bar{F}}[k_{h}](\omega)\left(\frac{1}{n}\sum_{i=1}^{n}e^{i\langle X_{i},\omega\rangle}\right)e^{i\langle y,\omega\rangle}=\\ \sum_{\left|\omega\right|_{\infty}<T}\mathcal{\bar{F}}[k_{h}](\omega)\left(\sum_{i=1}^{n}\lambda_{j}e^{i\langle X_{i},\omega\rangle}\right)e^{i\langle y,\omega\rangle}.

Applying (6) again, we see that the right-hand-side is approximately equal to $\hat{g}_{S}(y)$ , the estimator produced in Section (3.1). By the triangle inequality, we conclude that $\lVert\hat{g}_{S}(y)-f\rVert_{2}\leq c_{\beta,d}\,\varepsilon$ , as desired.

4 Lower bounds for coreset KDEs with uniform weights

In this section we study the performance of univariate uniformly weighted coreset KDEs

\hat{f}_{S}^{\mathsf{unif}}(y)=\frac{1}{m}\sum_{i\in S}k_{h}(X_{i}-y),

where $X_{S}$ is the coreset and $|S|=m$ . The next results demonstrate that for a large class of kernels, there is significant gap between the rate of estimation achieved by $\hat{f}_{S}^{\text{unif}}(y)$ and that of coreset KDEs with general weights. First we focus on the particular case of estimating Lipschitz densities, the class $\mathcal{P}_{\mathcal{H}}(1,L)$ . For this class, the minimax rate of estimation (over all estimators) is $n^{-1/3}$ , and this can be achieved by a weighted coreset KDE of cardinality $c_{\varepsilon}n^{1/3+\varepsilon}$ by Theorem 2, for all $\varepsilon>0$ .

Theorem 5.

Let $k$ denote a nonnegative kernel satisfying

k(t)=O(\left|t\right|^{-(k+1)}),\quad\text{and}\quad\mathcal{F}[k](\omega)=O(\left|\omega\right|^{-\ell})

for some $\ell>0,\,k>1$ . Suppose that $0<\alpha<1/3$ . If

m\leq\frac{n^{\frac{2}{3}-2\left(\alpha(1-\frac{2}{\ell})+\frac{2}{3\ell}\right)}}{\log n},

then

\inf_{h,S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{2}=\Omega_{k}\Big{(}\frac{n^{-\frac{1}{3}+\alpha}}{\log n}\Big{)}.

(7)

The infimum above is over all possible choices of bandwidth $h$ and all coreset schemes $S$ of cardinality at most $m$ .

By this result, if $k$ has lighter than quadratic tails and fast Fourier decay, the error in (7) is a polynomial factor larger than the minimax rate $n^{-1/3}$ when $m\ll n^{2/3}$ . Hence, our result covers a wide variety of kernels typically used for density estimation and shows that the uniformly weighted coreset KDE performs much worse than the encoding estimator or the Carathéodory method. In addition, for very smooth univariate kernels with rapid decay, we have the following lower bound that applies for all $\beta>0$ .

Theorem 6.

Fix $\beta>0$ and a nonnegative kernel $k$ on $\mathbb{R}$ satisfying the following fast decay and smoothness conditions:

	$\displaystyle\lim_{s\to+\infty}\frac{1}{s}\log\frac{1}{\int_{\|t\|>s}k(t)dt}$	$\displaystyle>0,$		(8)
	$\displaystyle\lim_{\omega\to\infty}\frac{1}{\|\omega\|}\log\frac{1}{\|\mathcal{F}[k](\omega)\|}$	$\displaystyle>0,$		(9)

where we recall that $\mathcal{F}[k]$ denotes the Fourier transform. Let $\hat{f}_{S}^{\mathsf{unif}}$ be the uniformly weighted coreset KDE. Then there exists $L_{\beta}>0$ such that for $L\geq L_{\beta}$ and any $m$ and $h>0$ , we have

\displaystyle\inf_{h,S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{2}

\displaystyle=\Omega_{\beta,k}\left(\tfrac{m^{-\frac{\beta}{1+\beta}}}{\log^{\beta+\frac{1}{2}}m}\right).

Therefore attaining the minimax rate with $\hat{f}_{S}^{\mathsf{unif}}$ requires $m\geq n^{\frac{\beta+1}{2\beta+1}}$ for such kernels. Next, note that the Gaussian kernel satisfies the hypotheses of Theorem 5 and 6. As we show in Theorem 7, results of (Phillips and Tai, 2018b) imply that our lower bounds are tight up to logarithmic factors: there exists a uniformly weighted Gaussian coreset KDE of size $m=\tilde{O}(n^{2/3})$ that attains the minimax rate $n^{-1/3}$ for estimating univariate Lipschitz densities ( $\beta=1$ ). In general, we expect a lower bound $m=\Omega(n^{\frac{\beta+d}{2\beta+d}})$ to hold for uniformly weighted coreset KDEs attaining the minimax rate. The proofs of Theorems 5 and 6 can be found in the Appendix.

5 Comparison to other methods

Three methods for constructing coreset kernel density estimators that have previously been explored include random sampling (Joshi et al., 2011; Lopez-Paz et al., 2015), the Frank–Wolfe algorithm (Bach et al., 2012; Harvey and Samadi, 2014; Phillips and Tai, 2018a), and discrepancy-based approaches (Phillips and Tai, 2018b; Karnin and Liberty, 2019). These procedures all result in a uniformly weighted coreset KDE. To compare these results with ours on the problem of density estimation, for each method under consideration we raise the question: How large does $m$ , the size of the coreset, need to be to guarantee that

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}=O_{\beta,d,L}\left(n^{-\frac{\beta}{2\beta+d}}\right)\,?

(10)

Here $\hat{g}_{S}$ is the resulting coreset KDE and the right-hand-side is the minimax rate over all estimators on the full dataset $X_{1},\ldots,X_{n}$ .

Uniform random sampling of a subset of cardinality $m$ yields an i.i.d dataset, so the rate obtained is at least $m^{-\beta/(2\beta+d)}$ . Hence, we must take $m=\Omega(n)$ to achieve the minimax rate.

The Frank–Wolfe algorithm is a greedy method that iteratively constructs a sparse approximation to a given element in a convex set (Frank et al., 1956; Bubeck, 2015). Thus Frank–Wolfe may be applied directly in the RKHS corresponding to a positive-semidefinite kernel as shown in Phillips and Tai (2018b) to approximate the KDE on the full dataset. However, due to the shrinking bandwidth in our problem, this approach also requires $m=\Omega(n)$ to guarantee the bound in (10). Another strategy is to approximately solve the linear equation (4) using the Frank–Wolfe algorithm. Unfortunately, a direct implementation again uses $m=\Omega(n)$ data points.

A more effective strategy utilizes discrepancy theory (Phillips, 2013; Phillips and Tai, 2018b; Karnin and Liberty, 2019) (see Matoušek, 1999; Chazelle, 2000, for a comprehensive exposition of discrepancy theory). By the well-known halving algorithm (see e.g. Chazelle and Matoušek, 1996; Phillips and Tai, 2018b) if for all $N\leq n$ , the kernel discrepancy

\mathsf{disc}_{k}=\sup_{x_{1},\ldots,x_{N}}\,\min_{\begin{subarray}{c}\sigma\in\{-1,+1\}^{n}\\ {\rm 1}\kern-1.68004pt{\rm I}^{T}\sigma=0\end{subarray}}\lVert\sum_{i=1}^{N}\sigma_{i}k(x_{i}-y)\rVert_{\infty}

is at most $D$ , then there exists a coreset $X_{S}$ of size $\tilde{O}_{D}(\varepsilon^{-1})$ such that

\lVert\frac{1}{n}\sum_{i=1}^{n}k(X_{i}-y)-\frac{1}{m}\sum_{j\in S}k(X_{i}-y)\rVert_{\infty}=\tilde{O}_{D}(\varepsilon).

(11)

The idea of the halving algorithm is to maintain a set of datapoints $\mathcal{C}_{\ell}$ at each iteration and then set $\mathcal{C}_{\ell+1}$ to be the set of vectors that receive sign $+1$ upon minimizing $\lVert\sum_{x\in\mathcal{C}_{\ell}}\sigma_{x}k(x-y)\rVert_{\infty}$ . Starting with the original dataset and repeating this procedure $O(\log\frac{n}{m})$ times yields the desired coreset $X_{S}$ satisfying (11).

Phillips and Tai (2018b, Theorem 4) use a state-of-the-art algorithm from Bansal et al. (2018) called the Gram–Schmidt walk to give strong bounds on the kernel discrepancy of bounded and Lipschitz kernels $k:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ that are positive definite and decay rapidly away from the diagonal. With a careful handling of the Lipschitz constant and error in their argument when the bandwidth is set to be $h=n^{-1/(2\beta+d)}$ , their techniques yield the following result applied to the kernel $k_{s}$ . For completeness we give details of the argument in the Appendix.

Theorem 7.

Let $k_{s}$ denote the kernel from Section 3.2. The algorithm of Phillips and Tai (2018b) yields in polynomial time a subset $S$ with $\left|S\right|=m=\tilde{O}(n^{\frac{\beta+d}{2\beta+d}})$ such that the uniformly weighted coreset KDE $\hat{g}_{S}$ satisfies

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert f-\hat{g}_{S}\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

This result also applies to more general kernels, for example, the Gaussian kernel when $\beta=1$ . We suspect that this is the best result achievable by discrepancy-based methods. In particular for nonnegative univariate kernels with fast decay in the real and Fourier domains, such as the Gaussian kernel, Theorem 5 implies that this rate is optimal for estimating Lipschitz densities with uniformly weighted coreset KDEs.

In contrast, the Carathéodory coreset KDE as in Theorem 2 only needs cardinality $m=O_{\varepsilon}(n^{\frac{d}{2\beta+d}+\varepsilon})$ to be a minimax estimator. By Theorem 4, this result is nearly optimal for coreset KDEs with bounded kernels and weights. And as with the other three methods described, our construction is computationally efficient. Hence allowing more general weights results in more powerful coreset KDEs for the problem of density estimation.

Appendix A PROOFS FROM SECTION 2

A.1 Proof of Lemma 1

Here we prove Lemma 1, restated below for convenience.

Lemma.

Proof.

Note that $f_{1}(x_{1})\in\mathcal{P}_{\mathcal{H}}(\beta,L)$ as a univariate density because $f(x)\in\mathcal{P}_{\mathcal{H}}(\beta,L)$ . Hence, $f_{1}$ satisfies

|f_{1}(x)-f_{1}(y)|\leq L|x-y|^{\alpha}

for some absolute constants $L>0$ and $\alpha\in(0,1)$ . If $B_{ik}=B_{jk}+s$ for $s\leq A$ , then

\displaystyle\left|\mathbb{P}(B_{ik})-\mathbb{P}(B_{jk})\right|\leq\int_{B_{ik}}\left|f(x_{1})-f(x_{1}+s)\right|\mathrm{d}x_{1}\leq LK^{-1}A^{1+\alpha}.

(12)

Thus for all $i,j$ ,

\displaystyle\left|\mathbb{P}(B_{i})-\mathbb{P}(B_{j})\right|\leq\sum_{k=1}^{1/A}\left|\mathbb{P}(B_{ik})-\mathbb{P}(B_{jk})\right|\leq LK^{-1}A^{\alpha}.

(13)

It follows that for all $i=1,\ldots,K$ ,

\lim_{A\to 0}\mathbb{P}(B_{i})=K^{-1}.

(14)

Let $\mathcal{E}$ denote the event that every bin $B_{i}$ contains at least one observation $x_{k}$ . By the union bound,

\mathbb{P}(\mathcal{E}^{c})\leq\sum_{j=1}\mathbb{P}(X_{11}\notin B_{j})^{n}\leq K\max_{j}(1-\mathbb{P}(B_{j}))^{n}.

By (14), choosing $A$ small enough ensures that $\mathbb{P}[B_{j}]\geq(1/2)K^{-1}$ for all $j$ . In fact, by (12) one may take $A=(\frac{1}{2K^{-2}L})^{1/\alpha}$ . Hence, setting $K^{-1}=c(\log n)/n$ for $c$ sufficiently large, we have

\mathbb{P}(\mathcal{E}^{c})=O(n^{-2}).

∎

A.2 Proof of the lower bound in Theorem 1

In this section, $X=X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ denotes the sample. It is convenient to consider a more general family of decorated coreset-based estimators. A decorated coreset consists of a coreset $X_{S}$ along with a data-dependent binary string $\sigma$ of length $R$ . A decorated coreset-based estimator is then given by $\hat{f}[X_{S},\sigma]$ , where $\hat{f}:\mathbb{R}^{d\times m}\times\{0,1\}^{R}\to L^{2}([-1/2,1/2]^{d})$ is a measurable function. As with coreset-based estimators, we require that $\hat{f}[x_{1},\ldots,x_{m},\sigma]$ is invariant under permutation of the vectors $x_{1},\ldots,x_{m}\in\mathbb{R}^{d}$ . We slightly abuse notation and refer to the channel $S:X\to Y_{S}=(X_{S},\sigma)$ as a decorated coreset scheme and $\hat{f}_{S}$ as the decorated coreset-based estimator. The next proposition implies the lower bound in Theorem 1 on setting $R=0$ , in which case a decorated coreset-based estimator is just a coreset-based estimator. This more general framework allows us to prove Theorem 1 on lower bounds for weighted coreset KDEs.

Proposition 2.

Let $\hat{f}_{S}$ denote a decorated coreset-based estimator with decorated coreset scheme $S$ such that $\sigma\in\{0,1\}^{R}$ . Then

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}\left((m\log n+R)^{-\frac{\beta}{d}}+n^{-\frac{\beta}{2\beta+d}}\right).

A.2.1 Choice of function class

Fix $h\in(0,1)$ such that $1/h^{d}$ is integral to be chosen later. Let $z_{1},\ldots,z_{1/h^{d}}$ label the points in $\{\frac{1}{2}h\cdot{\rm 1}\kern-2.40005pt{\rm I}_{d}+h\mathbb{Z}^{d}\}\cap[-1/2,1/2]^{d}$ , where ${\rm 1}\kern-2.40005pt{\rm I}_{d}$ denotes the all-ones vector of $\mathbb{R}^{d}$ . We consider a class of functions of the form $f_{\omega}(x)=1+\sum_{j=1}^{1/h^{d}}\omega_{j}g_{j}(x)$ indexed by $\omega\in\{0,1\}^{1/h^{d}}$ . Here, $g_{j}(x)$ is defined to be

g_{j}(x)=h^{\beta}\phi\left(\frac{x-z_{j}}{h}\right)

where $\phi:\mathbb{R}^{d}\to\mathbb{R}$ is $L$ -Hölder smooth of order $\beta$ , has $\lVert\phi\rVert_{\infty}=1$ , and has $\int\phi(x)\,\mathrm{d}x=0$ .

Informally, $f_{\omega}$ puts a bump on the uniform distribution with amplitude $h^{\beta}$ over $z_{j}$ if and only if $\omega_{i}=1$ . Using a standard argument (Tsybakov, 2009, Chapter 2) we can construct a packing $\mathcal{V}$ of $\{0,1\}^{1/h^{d}}$ which results $\mathcal{G}=\{f_{\omega}:\omega\in\mathcal{V}\}$ of the function class $\{f_{\omega}:\omega\in\{0,1\}^{1/h^{d}}\}$ such that

(i)

$\|f-g\|_{2}\geq c_{\beta,d,L}\,h^{\beta}$ for all $f,g\in\mathcal{G},f\neq g$ and,
(ii)

$\mathcal{G}$ is large in the sense that $M:=|\mathcal{G}|\geq 2^{c_{\beta,d,L}/h^{d}}$ .

A.2.2 Minimax lower bound

Using standard reductions from estimation to testing, we obtain that

	$\displaystyle\inf_{\begin{subarray}{c}\hat{f},\|S\|=m,\\ \sigma\in\{0,1\}^{R}\end{subarray}}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}$	$\displaystyle\geq\inf_{\begin{subarray}{c}\hat{f},\|S\|=m,\\ \sigma\in\{0,1\}^{R}\end{subarray}}\,\max_{f\in\mathcal{G}}\,\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}$
		$\displaystyle\geq c_{\beta,d,L}\,h^{\beta}\cdot\inf_{\psi_{S}}\frac{1}{M}\sum_{\omega\in\mathcal{V}}\mathbb{P}_{f_{\omega}}[\psi_{S}(X)\neq\omega].$		(15)

where the infimum in the last line is over all tests $\psi_{S}:\mathbb{R}^{d\times n}\to[M]$ of the form $\psi_{S}(X)=\psi(Y_{S})$ for a decorated coreset scheme $S$ and a measurable function $\psi:\mathbb{R}^{d\times m}\times\{0,1\}^{R}\to[M]$ .

Let $V$ denote a random variable that is distributed uniformly over $\mathcal{V}$ and observe that

\frac{1}{M}\sum_{\omega\in\mathcal{V}}\mathbb{P}_{f_{\omega}}[\psi_{S}(X)\neq\omega]=\mathbb{P}[\psi_{S}(X)\neq V]

where $\mathbb{P}$ denotes the joint distribution of $(X,V)$ characterized by the conditional distribution $X|V=\omega$ which is assumed to have density $f_{\omega}$ for all $\omega\in\mathcal{V}$ .

Next, by Fano’s inequality (Cover and Thomas, 2006, Theorem 2.10.1) and the chain rule, we have

\mathbb{P}[\psi_{S}(X)\neq V]\geq 1-\frac{I(V;\psi_{S}(X))+1}{\log M}\,,

(16)

where $I(V;\psi_{S}(X))$ denotes the mutual information between $V$ and $\psi_{S}(X)$ and we used the fact that the entropy of $V$ is $\log M$ . Therefore, it remains to control $I(V;\psi_{S}(X))$ . To that end, note that it follows from the data processing inequality that

I(V;\psi_{S}(X))\leq I(V;(X_{S},\sigma))=I(V;Y_{S})=\mathsf{KL}(P_{V,Y_{S}}\|P_{V}\otimes P_{Y_{S}})\,,

where $P_{V,Y_{S}},P_{V}$ and $P_{Y_{S}}$ denote the distributions of $(V,Y_{S})$ , $V$ and $Y_{S}$ respectively and observe that $P_{Y_{S}}$ is the mixture distribution given by $P_{Y_{S}}(A,t)=M^{-1}\sum_{\omega\in\mathcal{V}}P_{f_{\omega}}(X_{S}\in A,\sigma=t)$ for $A\subset\mathbb{R}^{d\times m}$ and $t\in\{0,1\}^{R}$ . Denote by $f_{\omega,Y_{S}}$ the mixed density of $P_{f_{\omega}}(X_{S}\in\cdot,\sigma=\cdot)$ , where the continuous component is with respect to the Lebesgue measure on $[-1/2,1/2]^{d\times m}$ . Denote by $\bar{f}_{Y_{S}}$ the mixed density of the uniform mixture of these:

\bar{f}_{Y_{S}}:=\frac{1}{M}\sum_{\omega\in\mathcal{V}}f_{\omega,Y_{S}}\,.

By a standard information-theoretic inequality, for all measures $\mathbb{Q}$ it holds that

\mathsf{KL}(P_{V,Y_{S}}\|P_{V}\otimes P_{Y_{S}})=\frac{1}{M}\sum_{\omega}\mathsf{KL}(P_{Y_{S}|\omega}\|\,P_{Y_{S}})\leq\frac{1}{M}\sum_{\omega}\mathsf{KL}(P_{Y_{S}|\omega}\|\,\mathbb{Q}).

(17)

In fact, we have equality precisely when $\mathbb{Q}=P_{Y_{S}}$ , and (17) follows immediately from the nonnegativity of the KL-divergence. Setting $\mathbb{Q}=\mathsf{Unif}[-\frac{1}{2},\frac{1}{2}]^{d}\otimes\mathsf{Unif}\{0,1\}^{R}$ , for all $\omega$ we have

	$\displaystyle{\sf KL}(P_{Y_{S}\|\omega},\mathbb{Q})$	$\displaystyle=\sum_{t\in\{0,1\}^{R}}\,\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}f_{\omega,Y_{S}}(x,t)\log\frac{f_{\omega,Y_{S}}(x,t)}{2^{-R}}\,\mathrm{d}x$
		$\displaystyle\leq\sum_{t\in\{0,1\}^{R}}\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}f_{\omega,Y_{S}}(x,t)\log f_{\omega,Y_{S}}(x,t)\,\mathrm{d}x+R.$		(18)

Our next goal is to bound the first term on the right-hand-side above.

Lemma 2.

For any $\omega\in\mathcal{V}$ , we have

\sum_{t\in\{0,1\}^{R}}\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}f_{\omega,Y_{S}}(x,t)\log f_{\omega,Y_{S}}(x,t)\,\mathrm{d}x\leq 3m\log{n}.

Proof.

Let $\mathbb{P}_{X_{S}}$ denote the distribution of the (undecorated) coreset $X_{S}$ , and note that the density of this distribution is given by $f_{\omega,X_{S}}(x):=\sum_{t\in\{0,1\}^{R}}f_{\omega,Y_{S}}(x,t)$ . Then because the logarithm is increasing,

	$\displaystyle\sum_{t\in\{0,1\}^{R}}\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}f_{\omega,Y_{S}}(x,t)\log f_{\omega,Y_{S}}(x,t)\,\mathrm{d}x$	$\displaystyle\leq\sum_{t\in\{0,1\}^{R}}\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}f_{\omega,Y_{S}}(x,t)\log f_{\omega,X_{S}}(x)\,\mathrm{d}x$
		$\displaystyle=\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}f_{\omega,X_{S}}(x)\log f_{\omega,X_{S}}(x)\,\mathrm{d}x.$

By the union bound,

\mathbb{P}_{X_{S}}(\cdot)\leq\sum_{s\in\binom{[n]}{m}}\mathbb{P}_{X_{s}}(\cdot)=\binom{n}{m}\mathbb{P}_{X_{[m]}}(\cdot)\,.

It follows readily that $f_{\omega,X_{S}}(\cdot)\leq\binom{n}{m}f_{\omega,X_{[m]}}(\cdot)$ . Next, let $Z\in[-1/2,1/2]^{d\times m}$ be a random variable with density $f_{\omega,X_{S}}$ and note that

\int f_{\omega,X_{S}}\log f_{\omega,X_{S}}=\mathbb{E}\log f_{\omega,X_{S}}(Z)\leq\log\binom{n}{m}+\mathbb{E}\log f_{\omega,X_{[m]}}(Z)\leq m\log\big{(}\frac{en}{m}\big{)}+m\log 2\,,

where in the last inequality, we use the fact that $f_{\omega,X_{[m]}}=f_{\omega}^{m}\leq 2^{m}$ . The lemma follows. ∎

Since $\log M\geq c_{\beta,d,L}h^{-d}$ , it follows from (16)–(18) and Lemma 2 that

\mathbb{P}[\psi_{S}(X)\neq V]\geq 1-\frac{3m\log n+R+1}{\log M}\geq 0.5

on setting $h=c_{\beta,d,L}(m\log n+R)^{-1/d}$ . Plugging this value back into (15) yields

\inf_{\hat{f},|S|=m}\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}(m\log n+R)^{-\beta/d}\,.

Moreover, it follows from standard minimax theory (see e.g. Tsybakov, 2009, Chapter 2) that

\inf_{\hat{f},|S|=m}\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\,\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}n^{-\frac{\beta}{2\beta+d}}\,.

Combined together, the above two displays give the lower bound of Proposition 2.

Appendix B PROOFS FROM SECTION 3

B.1 Proof of Proposition 1

We restate the result below.

Proposition.

\hat{f}(y)=\frac{1}{n}\sum_{i=1}^{n}k_{h}(X_{i}-y)

with bandwidth $h=n^{-\frac{1}{2\beta+d}}$ satisfies

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert f-\hat{f}\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

Then the Carathéodory coreset estimator $\hat{g}_{S}(y)$ constructed from $\hat{f}$ with $T=c_{d,\gamma,L^{\prime}}\,n^{\frac{d/2+\beta+\gamma}{\gamma(2\beta+d)}}$ satisfies

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

Let $\varphi:\mathbb{R}^{d}\to[0,1]$ denote a cutoff function that has the following properties: $\varphi\in\mathcal{C}^{\infty}$ , $\varphi\big{|}_{[-1,1]^{d}}\equiv 1$ , and $\varphi$ is compactly supported on $[-2,2]^{d}$ .

Lemma 3.

Let $\tilde{k}_{h}(x)=k_{h}(x)\varphi(x)$ where $\left|\kappa(x)\right|\leq c_{\beta,d}\left|x\right|^{-\nu}$ . Then

\lVert\tilde{k}_{h}-k_{h}\rVert_{2}\leq c_{\beta,d}\,h^{-d+\nu}.

Proof.

	$\displaystyle\lVert\tilde{k}_{h}-k_{h}\rVert_{2}$	$\displaystyle=\lVert(1-\varphi)k_{h}\rVert_{2}$
		$\displaystyle\leq\lVert(1-{\rm 1}\kern-2.40005pt{\rm I}_{[-1,1]^{d}})k_{h}\rVert_{2}$
		$\displaystyle=h^{-d/2}\lVert(1-{\rm 1}\kern-2.40005pt{\rm I}_{[-\frac{1}{h},\frac{1}{h}]^{d}})k\rVert_{2}$
		$\displaystyle\leq dh^{-d/2}\lVert{\rm 1}\kern-2.40005pt{\rm I}_{\left\|x_{1}\right\|\geq\frac{1}{h}}\,k\rVert_{2}$
		$\displaystyle\leq c_{\beta,d}\,h^{-d/2}\sqrt{\int_{\left\|x_{1}\right\|\geq\frac{1}{h}}\kappa^{2}(x_{1})\,\mathrm{d}x_{1}}$
		$\displaystyle\leq c_{\beta,d}\,h^{-d+\nu}.$

∎

The triangle inequality and the previous lemma yield the next result.

Lemma 4.

Let $k$ denote a kernel such that $\left|\kappa(x)\right|\leq c_{\beta,d}\left|x\right|_{2}^{-\nu}$ . Recall the definition of $\tilde{k}_{h}$ from Lemma 3. Let $X_{1},\ldots,X_{m}\in\mathbb{R}^{d}$ , and let

\hat{g}_{S}(y)=\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y)

denote where $\lambda_{j}\geq 0$ and ${\rm 1}\kern-2.40005pt{\rm I}^{T}\lambda=1$ . Let

\tilde{g_{S}}(y)=\sum_{j\in S}\lambda_{j}\tilde{k}_{h}(X_{j}-y).

Then

\lVert\hat{g}_{S}-\tilde{g_{S}}\rVert_{2}\leq c_{\beta,d}h^{-\nu+d}.

Next we show that $\tilde{k}_{h}$ is well approximated by its Fourier expansion on $[-2,2]^{d}$ . Since $\tilde{k}_{h}$ is a smooth periodic function on $[-2,2]^{d}$ , it is expressed in $L_{2}$ as a Fourier series on $\frac{\pi}{2}\mathbb{Z}^{d}$ . Thus we bound the tail of this expansion. In what follows, $\alpha\in\mathbb{Z}^{d}_{\geq 0}$ is a multi-index and

\mathcal{\bar{F}}[f](\omega)=\frac{1}{4^{2d}}\int f(x)e^{i\langle x,\omega\rangle}\,\mathrm{d}x

denotes the (rescaled) Fourier transform on $[-2,2]^{d}$ , where $\omega\in\frac{\pi}{2}\mathbb{Z}^{d}$ .

Lemma 5.

Suppose that the kernel $k\in\mathcal{S}(\beta,L^{\prime})$ . Let $A=\{\omega\in\frac{\pi}{2}\mathbb{Z}^{d}:\,\left|\omega\right|_{1}\leq T\}$ , and define

\tilde{k}_{h}^{T}(y)=\sum_{\omega\in A}\mathcal{\bar{F}}[\tilde{k}_{h}](\omega)e^{i\langle y,\omega\rangle}.

Then

\lVert(\tilde{k}_{h}-\tilde{k}_{h}^{T}){\rm 1}\kern-2.40005pt{\rm I}_{[-2,2]^{d}}\rVert_{2}\leq c_{\gamma,d,L^{\prime}}\,T^{-\gamma}h^{-d/2-\gamma}

Proof.

Observe that for $\omega\notin A$ , it holds that

\sum_{\left|\alpha\right|_{1}=\gamma}\frac{\gamma!}{\alpha!}\left|\omega\right|^{\alpha}=(\left|\omega_{1}\right|+\cdots+\left|\omega_{d}\right|)^{\gamma}\geq T^{\gamma}.

Therefore,

$\displaystyle\lVert\mathcal{\bar{F}}[\tilde{k}_{h}](\omega){\rm 1}\kern-2.40005pt{\rm I}_{\omega\notin A}\rVert_{\ell_{2}}$	$\displaystyle\leq\,T^{-\gamma}\lVert\sum_{\left\|\alpha\right\|_{1}=\gamma}\frac{\gamma!}{\alpha!}\left\|\omega\right\|^{\alpha}\mathcal{\bar{F}}[\tilde{k}_{h}](\omega){\rm 1}\kern-2.40005pt{\rm I}_{\omega\notin A}\rVert_{\ell_{2}}$
	$\displaystyle\leq\,T^{-\gamma}\sum_{\left\|\alpha\right\|_{1}=\gamma}\frac{\gamma!}{\alpha!}\,\lVert\omega^{\alpha}\mathcal{\bar{F}}[\tilde{k}_{h}](\omega)\rVert_{\ell_{2}}$
	$\displaystyle=\,c_{d}\,T^{-\gamma}\sum_{\left\|\alpha\right\|_{1}=\gamma}\frac{\gamma!}{\alpha!}\,\lVert\frac{\partial^{\alpha}}{\partial x^{\alpha}}\tilde{k}_{h}(x)\rVert_{2},$	(19)

where in the last line we used Parseval’s identity. For any multi-index $\alpha$ with $\left|\alpha\right|_{1}=\gamma$ ,

	$\displaystyle\lVert\frac{\partial^{\alpha}}{\partial x^{\alpha}}\tilde{k}_{h}(x)\rVert_{2}$	$\displaystyle=\lVert\sum_{\eta\preceq\alpha}\frac{\partial^{\eta}}{\partial x^{\eta}}k_{h}(x)\,\frac{\partial^{\alpha-\eta}}{\partial x^{\alpha-\eta}}\varphi(x)\rVert_{2}$
		$\displaystyle\leq h^{-\frac{d}{2}-\gamma}\sum_{\eta\preceq\alpha}c_{d,\gamma}\,\lVert\frac{\partial^{\eta}}{\partial x^{\eta}}k(x)\rVert_{2},$		(20)

where we used that the derivatives of $\varphi$ are bounded. Next by Parseval’s identity,

\displaystyle\lVert\frac{\partial^{\eta}}{\partial x^{\eta}}k(x)\rVert_{2}^{2}

\displaystyle=\prod_{i=1}^{d}\lVert\omega_{i}^{\eta_{i}}\mathcal{F}[\kappa](\omega_{i})\rVert_{2}^{2}.

(21)

For $0\leq a\leq\gamma$ , we have

\displaystyle\int\left|\omega^{a}\mathcal{F}[\kappa](\omega)\right|^{2}\mathrm{d}\omega\leq\lVert k\rVert_{1}+\int_{\left|\omega\right|\geq 1}\left|\omega^{\gamma}\mathcal{F}[\kappa](\omega)\right|^{2}\mathrm{d}\omega\leq\lVert k\rVert_{1}+L^{\prime}.

(22)

By (19)–(22),

\displaystyle\lVert\mathcal{\bar{F}}[\tilde{k}_{h}](\omega){\rm 1}\kern-2.40005pt{\rm I}_{\omega\notin A}\rVert_{\ell_{2}}\leq c_{d,\gamma,L^{\prime}}\,T^{-\gamma}h^{-\frac{d}{2}-\gamma},

as desired.

∎

Applying the previous lemma and linearity of the Fourier transform, we have the next corollary that gives an expansion for a general KDE on the smaller domain $[-\frac{1}{2},\frac{1}{2}]^{d}$ .

Corollary 2.

Let $\tilde{g_{S}}$ denote the KDE built from $\tilde{k}_{h}$ from Lemma 4 where $X_{1},\ldots,X_{m}\in[-\frac{1}{2},\frac{1}{2}]^{d}$ and moreover $\kappa\in\mathcal{S}(\beta,L^{\prime})$ . Let $A=\{\omega\in\frac{\pi}{2}\mathbb{Z}^{d}:\,\left|\omega\right|_{1}\leq T\}$ , and define

\tilde{g_{S}}^{T}(y)=\sum_{\omega\in A}\mathcal{\bar{F}}[\tilde{g_{S}}](\omega)e^{i\langle y,\omega\rangle}.

Then

\lVert(\tilde{g_{S}}-\tilde{g_{S}}^{T}){\rm 1}\kern-2.40005pt{\rm I}_{[-\frac{1}{2},\frac{1}{2}]^{d}}\rVert_{2}\leq c_{d,\gamma,L^{\prime}}\,T^{-\gamma}h^{-d/2-\gamma}L.

Now we have all the ingredients needed to prove Proposition 1.

Proof of Proposition 1 .

Let

\tilde{f}(y)=\frac{1}{n}\sum_{j=1}^{n}\tilde{k}_{h}(X_{j}-y),

and

\tilde{g_{S}}(y)=\sum_{j\in S}\lambda_{j}\tilde{k}_{h}(X_{j}-y).

Also consider their expansions $\tilde{f}^{T}$ and $\tilde{g}^{T}$ as defined in Lemma 5. Observe that, by construction of the Carathéodory coreset,

\tilde{f}^{T}(y)=\tilde{g}^{T}(y)\quad\forall y\in[-\frac{1}{2},\frac{1}{2}]^{d}.

In what follows, $\lVert\cdot\rVert_{2}$ is computed on $[-\frac{1}{2},\frac{1}{2}]^{d}$ . By the triangle inequality,

$\displaystyle\lVert\hat{g}_{S}-\hat{f}\rVert_{2}$	$\displaystyle\leq\lVert\hat{g}_{S}-\tilde{g}\rVert_{2}+\lVert\tilde{g}-\tilde{g}^{T}\rVert_{2}+\lVert\tilde{g}^{T}-\tilde{f}^{T}\rVert$
	$\displaystyle\quad+\lVert\tilde{f}^{T}-\tilde{f}\rVert_{2}+\lVert\tilde{f}-\hat{f}\rVert_{2}$
	$\displaystyle\leq c_{\beta,d}\,h^{-d+\nu}+c_{d,\gamma,L^{\prime}}\,T^{-\gamma}h^{-d/2-\gamma}+0$
	$\displaystyle\quad+c_{d,\gamma,L^{\prime}}\,T^{-\gamma}h^{-d/2-\gamma}+c_{\beta,d}\,h^{-d+\nu}$	(23)

On the right-hand-side of the first line, the first and last terms are bounded via Lemma 4. The second and fourth terms are bounded via Lemma 5, and the third term is $0$ by Carathéodory. By our choice of $T$ and the decay properties of $k$ , we have

\lVert\hat{g}_{S}-\hat{f}\rVert_{2}\leq c_{\beta,d,L}\,h^{\beta}\leq c_{\beta,d,L}\,n^{-\beta/(2\beta+d)}.

The conclusion follows by the hypothesis on $k$ , the previous display, and the triangle inequality. ∎

B.2 Proof of Theorem 2

We restate Theorem 2 here for convenience.

Theorem.

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

The corresponding coreset has cardinality

m=c_{d,\beta,\varepsilon}n^{\frac{d}{2\beta+d}+\varepsilon}.

Proof.

Our goal is to apply Proposition 1 to $k_{s}$ . First we show that the standard KDE built from $k_{s}$ attains the minimax rate on $\mathcal{P}_{\mathcal{H}}(\beta,L)$ . The Fourier condition

\text{ess sup}_{\omega\neq 0}\frac{1-\mathcal{F}[k_{s}](\omega)}{\left|\omega\right|^{\alpha}}\leq 1,\quad\forall\alpha\preceq\beta,

implies that $k_{s}$ is a kernel of order $\beta$ (Tsybakov, 2009, Definition 1.3). Since $\mathcal{F}[k_{s}](0)=1=\int k_{s}(x)\,\mathrm{d}x$ , it remains to show that the ‘moments’ of order at most $\beta$ of $k_{s}$ vanish. In fact all of the moments vanish. We have, expanding the exponential and using the multinomial formula,

	$\displaystyle\psi(\omega)$	$\displaystyle=\mathcal{F}^{-1}[k_{s}](\omega)$
		$\displaystyle=\int k_{s}(x)e^{-i\langle x,\omega\rangle}\mathrm{d}x$
		$\displaystyle=\sum_{t=0}^{\infty}\int k_{s}(x)\frac{(-i\langle x,\omega\rangle)^{t}}{t!}\mathrm{d}x$
		$\displaystyle=\sum_{t=0}^{\infty}\sum_{\left\|\alpha\right\|_{1}=t}\frac{-i^{t}}{\alpha!}w^{\alpha}\left\{\int k_{s}(x)x^{\alpha}\mathrm{d}x\right\}.$

Since $\psi(\omega)\equiv 1$ in a neighborhood near the origin, it follows that all of the terms $\int k_{s}(x)x^{\alpha}\mathrm{d}x=0$ . Thus $k_{s}$ is a kernel of order $\beta$ for all $\beta\in\mathbb{Z}_{\geq 0}$ , and the standard KDE on all of the dataset with bandwidth $h=n^{-1/(2\beta+d)}$ attains the rate of estimation $n^{-\beta/(2\beta+d)}$ over $\mathcal{P}_{\mathcal{H}}(\beta,L)$ (see e.g. Tsybakov, 2009, Theorem 1.2).

Next, $\left|\kappa_{s}(x)\right|\leq c_{\beta,d}\left|x\right|^{\nu}$ for $\nu=\lceil\beta+d\rceil$ . This is because

x^{\nu}\kappa_{s}(x)=x^{\nu}\mathcal{F}[\psi](x)=\mathcal{F}\left[\frac{\mathrm{d}^{\nu}}{\mathrm{d}x^{\nu}}\psi\right](x)\leq\lVert\frac{\mathrm{d}^{\nu}}{\mathrm{d}x^{\nu}}\psi\rVert_{1}\leq c_{\beta,d}.

Moreover for all $\gamma\in\mathbb{Z}_{>0}$ , $\kappa_{s}\in\mathcal{S}(\gamma,c_{\gamma})$ . By Parseval’s identity,

\lVert\frac{\mathrm{d}^{\gamma}}{\mathrm{d}x^{\gamma}}\kappa_{s}\rVert_{2}=\lVert\mathcal{F}[\frac{\mathrm{d}^{\gamma}}{\mathrm{d}x^{\gamma}}\kappa_{s}]\rVert_{2}=\lVert\omega^{\gamma}\psi(\omega)\rVert_{2}\leq c_{\gamma}

because $\psi$ has compact support (see e.g. Katznelson, 2004, Chapter VI).

All of the hypotheses of Proposition 1 are satisfied, so we apply the result with

\gamma=\frac{d(\frac{d}{2}+\beta)}{\varepsilon(2\beta+d)}

to derive Theorem 2.

∎

B.3 Proof of Corollary 1

Corollary.

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{g}_{S}-f\rVert_{2}\leq c_{\beta,d,\varepsilon,L}\,\left(m^{-\frac{\beta}{d}+\varepsilon}+n^{-\frac{\beta}{2\beta+d}+\varepsilon}\right),

and the corresponding coreset has cardinality $m$ .

Proof.

Recall from the proof of Theorem 2 that $k_{s}$ is a kernel of all orders. By a standard bias-variance trade-off (see e.g. Tsybakov, 2009, Section 1.2), it holds that for the KDE $\hat{f}$ with bandwidth $h$ (on the entire dataset)

\mathbb{E}_{f}\lVert\hat{f}-f\rVert_{2}\leq c_{\beta,d,L}\left(h^{\beta}+\frac{1}{\sqrt{nh^{d}}}\right).

(24)

Moreover, from (23) applied to $k_{s}$ , setting $T=m^{1/d}$ , we get

\lVert\hat{g}_{S}-\hat{f}\rVert_{2}\leq c_{\beta,d}\,h^{\beta}+c_{d,\gamma}\,m^{-d/\gamma}h^{-d/2-\gamma}.

(25)

Choosing

\gamma=(\beta+\frac{d}{2})(\frac{\beta}{d\varepsilon}-1),\quad h=m^{-\frac{1}{d}+\frac{\varepsilon}{\beta}}

(assuming without loss of generality that $\varepsilon>0$ is sufficiently small so that $\gamma>0$ ), then the triangle inequality, (24), (25), and the upper bound on $m$ yield the conclusion of Corollary 1.

∎

B.4 Proof of Theorem 4

For convenience, we restate Theorem 4 here.

Theorem.

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{g}_{S}-f\rVert_{2}\geq\\ c_{\beta,d,L}\left[(A+B)^{-\frac{\beta}{d}}(m\log{n})^{-\frac{\beta}{d}}+n^{-\frac{\beta}{2\beta+d}}\right].

Proof.

Let $\lambda=\lambda_{1},\ldots,\lambda_{m}$ and let $\tilde{\lambda}=\tilde{\lambda}_{1},\ldots,\tilde{\lambda}_{m}$ . Observe that

	$\displaystyle\lVert\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y)-\sum_{j\in S}\tilde{\lambda}_{j}k_{h}(X_{j}-y)\rVert_{2}$	$\displaystyle\leq\sum_{j\in S}\left\|\lambda_{j}-\tilde{\lambda}_{j}\right\|\lVert k_{h}(X_{j}-y)\rVert_{2}$
		$\displaystyle\leq\left\|\lambda-\tilde{\lambda}\right\|_{\infty}n^{2}h^{-d/2}.$		(26)

Using this we develop a decorated coreset-based estimator $\hat{f}_{S}$ (see Section A.2) that approximates $\hat{g}_{S}$ well. Set $\delta=c_{\beta,d,L}n^{-4}h^{d/2}$ for $c_{\beta,d,L}$ sufficiently small and to be chosen later. Order the points of the coreset $X_{S}$ according to their first coordinate. This gives rise to an ordering $\preceq$ so that

X^{\prime}_{1}\preceq X^{\prime}_{2}\preceq\cdots\preceq X^{\prime}_{m}

denote the elements of $X_{S}$ . Let $\lambda\in\mathbb{R}^{m}$ denote the correspondingly reordered collection of weights so that

\hat{g}_{S}(y)=\sum_{j=1}^{m}\lambda_{j}k_{h}(X^{\prime}_{j}-y).

Construct a $\delta$ -net $\mathcal{N}_{\delta}$ with respect to the sup-norm $\left|\cdot\right|_{\infty}$ on the set $\{\nu\in\mathbb{R}^{m}:\left|\nu\right|_{\infty}\leq n^{B}\}$ . Observe that

\log\left|\mathcal{N}_{\delta}\right|=\log(n^{B}\delta^{-1})^{m}=c_{\beta,d,L}\,(B+A)m\log n

(27)

Define $R$ to be the smallest integer larger than the right-hand-side above. Then we can construct a surjection $\phi:\{0,1\}^{R}\to\mathcal{N}_{\delta}$ . Note that $\phi$ is constructed before observing any data: it simply labels the elements of the $\delta$ -net $\mathcal{N}_{\delta}$ by strings of length $R$ .

Given $\hat{g}_{S}(y)=\sum_{j\in S}\lambda_{j}k_{h}(X_{j}-y)$ , define $\hat{f}_{S}$ as follows:

1.

Let $\tilde{\lambda}\in\mathbb{R}^{m}$ denote the closest element in $\mathcal{N}_{\delta}$ to $\lambda\in\mathbb{R}^{m}$ .
2.

Choose $\sigma\in\{0,1\}^{R}$ such that $\phi(\sigma)=\tilde{\lambda}$ .
3.

Define the decorated coreset $Y_{S}=(X_{S},\sigma)$ .
4.

Order the points of $X_{S}$ by their first coordinate. Pair the $i$ -th element of $\tilde{\lambda}$ with the $i$ -th element $X_{i}^{\prime}$ of $X_{S}$ , and define

$\hat{f}_{S}(y)=\sum_{j=1}^{m}\tilde{\lambda}_{j}k_{h}(X_{j}^{\prime}-y)$

We see that $\hat{f}_{S}$ is a decorated-coreset based estimator because in step 4 this estimator is constructed only by looking at the coreset $X_{S}$ and the bit string $\sigma$ . Moreover, by (26) and the setting of $\delta$ ,

\lVert\hat{f}_{S}-\hat{g}_{S}\rVert_{2}\leq c_{\beta,d,L}\,n^{-2}.

(28)

By Proposition 2 and our choice of $R$ ,

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}_{f}\lVert\hat{f}_{S}-f\rVert_{2}\geq c_{\beta,d,L}\left((A+B)^{-\frac{\beta}{d}}(m\log n)^{-\frac{\beta}{d}}+n^{-\frac{\beta}{2\beta+d}}\right).

Applying the triangle inequality and (28) yields Theorem 4. ∎

Appendix C PROOFS FROM SECTION 4

Notation: Given a set of points $X=x_{1},\ldots,x_{m}\in[-1/2,1/2]$ (not necessarily a sample), we let

\hat{f}_{X}(y)=\frac{1}{m}\sum_{i=1}^{m}k_{h}(X_{i}-y)

denote the uniformly weighted KDE on $X$ .

C.1 Proof of Theorem 5

Theorem.

Let $k$ denote a nonnegative kernel satisfying

k(t)=O(\left|t\right|^{-(k+1)}),\quad\text{and}\quad\mathcal{F}[k](\omega)=O(\left|\omega\right|^{-\ell})

for some $\ell>0,\,k>1$ . Suppose that $0<\alpha<1/3$ . If

m\leq\frac{n^{\frac{2}{3}-2\left(\alpha(1-\frac{2}{\ell})+\frac{2}{3\ell}\right)}}{\log n},

then

\inf_{h,S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{2}=\Omega_{k}\Big{(}\frac{n^{-\frac{1}{3}+\alpha}}{\log n}\Big{)}.

The infimum above is over all possible choices of bandwidth $h$ and all coreset schemes $S$ of cardinality at most $m$ .

The proof of Theorem 5 follows directly from Propositions 3 and 4, which are presented in Sections C.1.1 and C.1.2, respectively.

C.1.1 Small bandwidth

First we show that uniformly weighted coreset KDEs on $m$ points poorly approximate densities that are very close to $0$ everywhere.

Lemma 6.

Let $\hat{f}_{X}$ denote a uniformly weighted coreset KDE built from an even kernel $k:\mathbb{R}\to\mathbb{R}$ with bandwidth $h$ on $m$ points $X=x_{1},\ldots,x_{m}\in\mathbb{R}$ . Suppose that quantiles $0\leq q_{1}\leq q_{2}$ satisfy

	$\displaystyle\int_{-q_{1}}^{q_{1}}k(t)\mathrm{d}t$	$\displaystyle\geq 0.9,\quad\quad\text{and}$		(29)
	$\displaystyle\int_{-q_{2}}^{q_{2}}k(t)\mathrm{d}t$	$\displaystyle\geq 1-\gamma.$		(30)

Let $U$ denote an interval $[0,u]$ where

u\geq 8q_{2}h,

(31)

and suppose that $f:U\to\mathbb{R}$ satisfies

\frac{1}{100q_{1}mh}\leq f(x)\leq\frac{45}{44}\cdot\frac{1}{100q_{1}mh}

(32)

for all $x\in U$ .

Then

\inf_{X:|X|=m}\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\geq\frac{u}{440q_{1}mh}-\gamma.

Proof.

Let $N$ denote the number of $x_{i}\in X$ such that $[x_{i}-q_{1}h,x_{i}+q_{1}h]\subset[0,u]$ . The argument proceeds in two cases. With foresight, we set $\alpha=1/(44q_{1})$ . Also let $C_{1}=1/(100q_{1})$ and $C_{2}=45/(4400q_{1})$ .

Case 1: $N\geq\frac{\alpha u}{h}$ . Then by (29) and the nonnegativity of $k$ ,

\lVert\hat{f}_{X}{\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\geq\frac{0.9N}{m}\geq\frac{0.9\alpha u}{mh}.

By (32),

\lVert f\rVert_{1}\leq\frac{C_{2}u}{mh}.

Hence,

\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\geq\frac{u}{mh}(0.9\alpha-C_{2})=C_{2}\frac{u}{mh}=\frac{45}{4400}\cdot\frac{u}{q_{1}mh}.

Thus Lemma 6 holds in Case 1 where $N\geq\alpha u/h$ .

Case 2: $N\leq\frac{\alpha u}{h}$ . Let

V=[2hq_{2},u-2hq_{2}]\,\backslash\bigcup_{j\in T}[x_{j}-q_{1}h,x_{j}+q_{1}h]

where $T$ is the set of indices $j$ so that $[x_{j}-q_{1}h,x_{j}+q_{1}h]\subset U$ . Observe that if $j\notin T$ , then by (30),

\int_{V}\frac{1}{h}k\left(\frac{x_{j}-t}{h}\right)\mathrm{d}t\leq\gamma.

If $j\in T$ , then by (29),

\int_{V}\frac{1}{h}k\left(\frac{x_{j}-t}{h}\right)\mathrm{d}t\leq 0.1.

Thus,

\lVert\hat{f}_{X}{\rm 1}\kern-2.40005pt{\rm I}_{V}\rVert_{1}\leq\frac{0.1N}{m}+\gamma\leq\frac{\alpha 0.1u}{mh}+\gamma.

By the union bound, observe that the Lebesgue measure of $V$ is at least

u-4hq_{2}-2Nhq_{1}\geq\frac{u}{2}-2Nhq_{1}\geq u(\frac{1}{2}-2\alpha q_{1}).

Next, by (32),

\lVert f{\rm 1}\kern-2.40005pt{\rm I}_{V}\rVert_{1}\geq C_{1}\frac{u}{mh}(\frac{1}{2}-2\alpha q_{1}).

Therefore,

\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\geq\frac{u}{mh}(C_{1}(1/2-2\alpha q_{1})-0.1\alpha)-\gamma=\frac{u}{440q_{1}mh}-\gamma.

(33)

∎

Proposition 3.

Let $L>2$ . Let $0<\delta<1/3$ denote an absolute constant. Let $\hat{f}_{X}$ denote a uniformly weighted coreset KDE with bandwidth $h$ built from a kernel $k$ on $X=x_{1},\ldots,x_{m}$ . Suppose that $k(t)\leq\Delta|t|^{-(k+1)}$ for some absolute constants $\Delta>0,k\geq 1$ . If $h\leq n^{-1/3+\delta}$ , then for

m\leq\frac{n^{2/3-2\delta}}{\log n}

it holds that

\sup_{f\in\mathcal{P}_{\mathcal{H}}(1,L)}\,\inf_{X:|X|=m}\,\lVert\hat{f}_{X}-f\rVert_{2}=\Omega\left(\frac{n^{-1/3+\delta}}{\log n}\right).

(34)

Proof.

Let

f(t)=\lambda\left(e^{-1/t}{\rm 1}\kern-2.40005pt{\rm I}(t\in[-1/2,0])+e^{-1/(1-t)}{\rm 1}\kern-2.40005pt{\rm I}(t\in[0,1/2])\right),

where $\lambda$ is a normalizing constant so that $\int f=1$ . Observe that $f\in\mathcal{P}_{\mathcal{H}}(1,L)$ . Our first goal is to show that

\lVert\hat{f}_{X}-f\rVert_{1}=\Omega\left(\frac{1}{mh\log^{2}(mh)}\right)

holds for all $\tau/h\leq m\leq h^{-2}$ and for all $h\leq n^{-1/3+\delta}$ , where $\tau$ is an absolute constant to be determined.

We apply Lemma 6 to the density $f$ . Let $q_{1}$ be defined as in Lemma 6, and set $C_{1}=1/(100q_{1})$ and $C_{2}=45/(4400q_{1})$ . Set $\tau=10C_{2}/\lambda$ . Let

U=[t_{1},t_{2}]:=\left[\frac{1}{\log(\lambda mh/C_{1})},\,\,\frac{1}{\log(\lambda mh/C_{2})}\right].

The function $f|_{U}$ satisfies the bounds (32) from Lemma 6. Observe that the length of $U$ is

u:=t_{2}-t_{1}=\Omega(\frac{1}{\log^{2}(mh)}).

We set the parameter $\gamma$ in Lemma 6 to be

\gamma=\frac{1}{800q_{1}mh\log^{2}(mh)}.

By the decay assumption on $k$ , we may set

q_{2}:=\left(\frac{2\Delta}{k\gamma}\right)^{1/k}.

Therefore,

$\displaystyle u-8q_{2}h$	$\displaystyle=\Omega(\frac{1}{\log^{2}(mh)})-8h\left(\frac{2\Delta}{k\gamma}\right)^{1/k}$	(35)
	$\displaystyle=\Omega(\frac{1}{\log^{2}(mh)})-O(h(mh\log^{2}(mh))^{1/k})$	(36)
	$\displaystyle=\Omega(\frac{1}{\log^{2}(h^{-1})})-O(h^{1-1/k}\log^{2}(h^{-1}))>0$	(37)

for $n$ sufficiently large, because we assume $\tau/h\leq m\leq h^{-2}$ , $h\leq n^{-1/3+\delta}$ , and $k>1$ . Hence, condition (31) is satisfied for $m,h$ in the specified range, so we apply Cauchy–Schwarz and Lemma 6 to conclude that for all $\tau/h\leq m\leq h^{-2}$ and $h\leq n^{-1/3+\delta}$ ,

\lVert\hat{f}_{X}-f\rVert_{2}\geq\lVert\hat{f}_{X}-f\rVert_{1}=\Omega\left(\frac{1}{mh\log^{2}(mh)}\right)=\Omega\left(\frac{1}{mh\log^{2}(h^{-1})}\right).

(38)

Suppose first that $\log^{2}(1/h)\geq n^{1/3-\delta}$ . Then clearly the right-hand side of (38) is $\Omega(1)$ for $m\leq n$ . Otherwise, we have for all $h\leq n^{-1/3+\delta}$ that if $m$ is in the range

\frac{\tau}{h}\leq m\leq\min\left(\frac{n^{1/3-\delta}\log n}{h\log^{2}(1/h)},\,\,h^{-2}\right)=:N_{h},

then (38) implies

\lVert\hat{f}_{X}-f\rVert_{2}=\Omega\left(\frac{n^{-1/3+\delta}}{\log n}\right).

(39)

Moreover, a uniformly weighted coreset KDE on $m=O(1/h)$ points can be expressed as a uniformly weighted coreset KDE on $\Omega(1/h)$ points by setting some of the $x_{i}$ ’s to be duplicates. Hence (39) holds for all $1\leq m\leq N_{h}$ . Since $N_{h}$ is a decreasing function of $h$ , it follows that (39) holds for all $m\leq n^{2/3-2\delta}/\log n$ and $h\leq n^{-1/3+\delta}$ , as desired.

∎

C.1.2 Large bandwidth

Lemma 7.

Let $\varepsilon=\varepsilon(n)>0$ , and let $\hat{f}_{X}$ denote the uniformly weighted coreset KDE on $X$ with bandwidth $h$ . Suppose that $\phi:\mathbb{R}\to\mathbb{R}$ is an odd $\mathcal{C}^{\infty}$ function supported on $[-1/4,1/4]$ . Let $f(t):[-1/2,1/2]\to\mathbb{R}_{\geq 0}$ denote the density

f(t)=\frac{12}{11}(1-t^{2})+\varepsilon\phi(t)\cos\left(\frac{t}{\varepsilon}\right).

Then

\lVert\hat{f}_{X}-f\rVert_{2}^{2}\geq\frac{1}{2}\varepsilon^{2}\left(\lVert\phi\rVert_{2}^{2}-\left|\mathcal{F}[\phi^{2}](2\varepsilon^{-1})\right|\right)\\ -\lVert\phi\rVert_{1}\sup_{|\omega|\geq h\varepsilon^{-1}/2}\left|\mathcal{F}[k](\omega)\right|-2\varepsilon\int_{|\omega|\geq\varepsilon^{-1}/2}\left|\mathcal{F}[\phi](\omega)\right|\mathrm{d}\omega.

(40)

Proof.

Let $g(t)=(12/11)(1-t^{2})$ and $\psi(t)=\varepsilon\phi(t)\cos(t/\varepsilon)$ . Observe that

	$\displaystyle\lVert\hat{f}_{X}-f\rVert_{2}^{2}$	$\displaystyle\geq\lVert g-f\rVert_{2}^{2}-2\langle\hat{f}_{X},g-f\rangle+2\langle g,\psi(t)\rangle$
		$\displaystyle=\lVert g-f\rVert_{2}^{2}-2\langle\hat{f}_{X},g-f\rangle$		(41)

because $g(t)\psi(t)$ is an odd function. Next, using $\cos^{2}(\theta)=(1/2)(\cos(2\theta)+1)$ ,

	$\displaystyle\lVert g-f\rVert_{2}^{2}$	$\displaystyle=\varepsilon^{2}\int_{-1/2}^{1/2}\cos^{2}(t/\varepsilon)\phi^{2}(t)\mathrm{d}t$
		$\displaystyle\geq\frac{\varepsilon^{2}}{2}\lVert\phi\rVert_{2}^{2}-\frac{\varepsilon^{2}}{2}\left\|\mathcal{F}[\phi^{2}](2\varepsilon^{-1})\right\|.$		(42)

By the triangle inequality and Parseval’s formula,

\frac{\left|\langle\hat{f}_{X},g-f\rangle\right|}{\varepsilon}\leq\Big{(}\underbrace{\int_{|\omega|\leq h\varepsilon^{-1}/2}}_{=:A}+\underbrace{\int_{|\omega|\geq h\varepsilon^{-1}/2}}_{=:B}\Big{)}\Big{|}\mathcal{F}[k]\left(-\frac{h}{\varepsilon}-\omega\right)\frac{1}{h}\mathcal{F}[\phi]\left(-\frac{\omega}{h}\right)\Big{|}\,\mathrm{d}\omega.

Moreover,

	$\displaystyle A$	$\displaystyle\leq\frac{1}{2\varepsilon}\lVert\phi\rVert_{1}\cdot\sup_{\|\omega\|\geq h\varepsilon^{-1}/2}\left\|\mathcal{F}[k](\omega)\right\|,$		(43)
	$\displaystyle B$	$\displaystyle\leq\lVert k\rVert_{1}\cdot\int_{\|\omega\|>\varepsilon^{-1}/2}\left\|\mathcal{F}[\phi](\omega)\right\|\mathrm{d}\omega.$		(44)

Then (40) follows from $\lVert k\rVert_{1}=1$ and equations (41), (C.1.2), (43), and (44). ∎

Proposition 4.

Let $\varepsilon=n^{-1/3+\gamma}$ for some absolute constant $\gamma>0$ . Let $\hat{f}_{X}$ denote a uniformly weighted coreset KDE with bandwidth $h$ built from a kernel $k$ on $X=x_{1},\ldots,x_{m}$ . Suppose that $\left|\mathcal{F}[k](\omega)\right|\leq|\omega|^{-\ell}$ . If $h\geq c\varepsilon^{1-2/\ell}=cn^{(-1/3+\gamma)(1-2/\ell)}$ for $c$ sufficiently large, then for all $m$ it holds that

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\,\inf_{X:|X|=m}\lVert\hat{f}_{X}-f\rVert_{2}=\Omega(\varepsilon)=\Omega\left(n^{-1/3+\gamma}\right)

(45)

Proof.

The proof is a direct application of Lemma 7. Let $f(t)=g(t)+\varepsilon\phi(t)\cos(t/\varepsilon)$ , where we set

\phi(t)=-e^{\frac{1}{x(x+1/4)}}{\rm 1}\kern-2.40005pt{\rm I}(x\in[-1/4,0])+e^{-\frac{1}{x(x-1/4)}}{\rm 1}\kern-2.40005pt{\rm I}(x\in[0,1/4]).

Observe that $\phi$ is odd and $\phi\in\mathcal{C}^{\infty}$ . Thus, $\phi^{2}\in\mathcal{C}^{\infty}$ , so by the Riemann–Lebesgue lemma (see e.g. Katznelson, 2004, Chapter VI), $\mathcal{F}[\phi^{2}](\varepsilon^{-1})\leq 10\varepsilon$ . Using a similar argument and noting that $\mathcal{F}[\phi](\omega)=\omega^{-2}\mathcal{F}[\phi^{\prime\prime}](\omega)\leq 10\omega^{-3}$ , we obtain

\int_{\left|\omega\right|\geq 2\varepsilon^{-1}}\left|\mathcal{F}[\phi](\omega)\right|\mathrm{d}\omega\leq 100\varepsilon^{2}.

Also $\lVert\phi\rVert_{2}\geq c^{\prime}$ for a small absolute constant, and $\lVert\phi\rVert_{1}\leq 2$ .

Thus Lemma 7, the hypothesis on $k$ , and $h\geq c^{\prime}\varepsilon^{1-2/\ell}$ imply that

\lVert\hat{f}_{X}-f\rVert_{2}^{2}\geq\frac{c^{2}}{2}\varepsilon^{2}-2\left(\frac{\varepsilon}{h}\right)^{\ell}-200\varepsilon^{3}=\Omega(\varepsilon^{2}).

Since $f\in\mathcal{P}_{\mathcal{H}}(1,L)$ , the statement of the lemma follows. ∎

C.2 Proof of Theorem 6

Theorem.

Fix $\beta>0$ and a nonnegative kernel $k$ on $\mathbb{R}$ satisfying the following fast decay and smoothness conditions:

	$\displaystyle\lim_{s\to+\infty}\frac{1}{s}\log\frac{1}{\int_{\|t\|>s}k(t)dt}$	$\displaystyle>0,$		(46)
	$\displaystyle\lim_{\omega\to\infty}\frac{1}{\|\omega\|}\log\frac{1}{\|\mathcal{F}[k](\omega)\|}$	$\displaystyle>0,$		(47)

\displaystyle\inf_{h,S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{2}

\displaystyle=\Omega_{\beta,k}\left(\tfrac{m^{-\frac{\beta}{1+\beta}}}{\log^{\beta+\frac{1}{2}}m}\right).

Proof.

We follow a similar strategy to the proof of Theorem 5 by handling the cases of small and large bandwidth separately.

Let $q_{1}=q_{1}(k)>0$ be the minimum number such that $\int_{|t|>q_{1}}k(t)dt\leq 0.1$ . By the assumption in the theorem, there exists $a>0$ such that

\displaystyle\int_{|t|>s}k(t)dt\leq\frac{1}{a}\exp(-as),\quad\forall s\geq 0.

Note that we can set $L_{\beta}^{(1)}$ large such that for any $\delta\in[0,1]$ , there exists $f\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}^{(1)})$ such that $f(x)=\delta$ for $x\in[0,1/2]$ . We first show that for any given $m$ and $h$ , we have

\displaystyle\inf_{S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}^{(1)})}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{1}\geq 0.2\left(1\wedge\frac{1}{100q_{1}mh}\right)1\left\{h\leq\frac{0.02a}{\log\left(\frac{mq_{1}}{0.001a}\vee\frac{10}{a}\right)}\wedge 1\right\}.

(48)

Let $f$ be an arbitrary function in $f\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}^{(1)})$ such that

\displaystyle f(x)=1\wedge\frac{1}{100q_{1}mh},\quad\forall x\in[0,1/2].

Let $T$ be the set of $i\in S$ for which $x_{i}\in[q_{1}h,1/2-q_{1}h]$ .

Case 1: $|T|\geq m\left(1\wedge\frac{1}{100q_{1}mh}\right)$ . Since $k\geq 0$ , we have

\displaystyle\|\hat{f}_{X}1_{[0,1/2]}\|_{1}\geq\frac{0.9|T|}{m}\geq 0.9\left(1\wedge\frac{1}{100q_{1}mh}\right).

On the other hand,

\displaystyle\|f1_{[0,1/2]}\|_{1}\leq\frac{1}{2}\left(1\wedge\frac{1}{100q_{1}mh}\right),

therefore,

\displaystyle\|(\hat{f}_{X}-f)1_{[0,1/2]}\|_{1}\geq 0.4\left(1\wedge\frac{1}{100q_{1}mh}\right).

Case 2: $|T|<m\left(1\wedge\frac{1}{100q_{1}mh}\right)$ . Define

\displaystyle\gamma:=0.1\left(1\wedge\frac{1}{100q_{1}mh}\right)

and

\displaystyle q_{2}:=\frac{0.02}{h}.

Note that to verify (48) we only need to consider the event of $h\leq\frac{0.02a}{\log\left(\frac{mq_{1}}{0.001a}\vee\frac{10}{a}\right)}\wedge 1$ , in which case

	$\displaystyle\int_{\|t\|>q_{2}}k(t)dt$	$\displaystyle\leq\frac{1}{a}\exp(-aq_{2})$
		$\displaystyle\leq\frac{1}{a}\cdot\left(\frac{0.001a}{mq_{1}}\wedge 0.1a\right)$
		$\displaystyle\leq\frac{1}{a}\cdot\left(\frac{0.001a}{q_{1}mh}\wedge 0.1a\right)$
		$\displaystyle=0.1(1\wedge\frac{1}{100q_{1}mh})$
		$\displaystyle=\gamma.$

Moreover since $\gamma\leq 0.1$ we see that $q_{2}\geq q_{1}$ . Now define

\displaystyle V:=[2hq_{2},1/2-2hq_{2}]\setminus\bigcup_{j\in T}[x_{j}-q_{1}h,x_{j}-q_{1}h].

Then for $j\notin T$ , we have

\displaystyle\int_{V}\frac{1}{h}k\left(\frac{x_{j}-t}{h}\right)dt\leq\gamma

while for $j\in T$ we have

\displaystyle\int_{V}\frac{1}{h}k\left(\frac{x_{j}-t}{h}\right)dt\leq 0.1.

Thus,

\displaystyle\|\hat{f}_{X}1_{V}\|_{1}\leq\frac{0.1|T|}{m}+\gamma\leq 0.2\left(1\wedge\frac{1}{100q_{1}mh}\right).

On the other hand, by the union bound we see that the Lebesgue measure of $V$ is at least

\displaystyle\frac{1}{2}-4q_{2}h-2q_{1}h|T|\geq 0.5-4q_{2}h-0.02\geq 0.4

where we used the fact that $q_{2}h=0.02$ . Then

\displaystyle\|f1_{V}\|_{1}\geq 0.4\left(1\wedge\frac{1}{100q_{1}mh}\right)

and hence

\displaystyle\|(\hat{f}_{X}-f)1_{[0,1/2]}\|_{1}\geq\|(\hat{f}_{X}-f)1_{V}\|_{1}\geq 0.2\left(1\wedge\frac{1}{100q_{1}mh}\right).

This concludes the proof of (48).

The second step is to show that for given $m$ and $h$ , we have

\displaystyle\inf_{S:|S|\leq m}\,\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert\hat{f}_{S}^{\mathsf{unif}}-f\rVert_{1}\geq\frac{1}{4}\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}

(49)

sufficiently large $m$ and $L$ to be determined later, and $0<b<\infty$ is such that

\displaystyle\mathcal{F}[k](\omega)\leq\frac{1}{b}\exp(-b\omega),\quad\forall\omega\in\mathbb{R}

whose existence is guaranteed by the assumption of the theorem. Let $\phi$ be a smooth, even, nonnegative function supported on $[-1/2,1/2]$ satisfying $\int_{[-1/2,1/2]}\phi=1$ . Define

\displaystyle f_{\epsilon}(t):=\phi(t)\left(c_{\epsilon}+\epsilon^{\beta}\sin\frac{t}{\epsilon}\right)

where $c_{\epsilon}>0$ is chosen so that $\int_{[-1/2,1/2]}f_{\epsilon}=1$ . Then $\lim_{\epsilon\to 0}c_{\epsilon}=1$ , and in particular $f_{\epsilon}\geq 0$ when $\epsilon<\epsilon(\phi,\beta)$ for some $\epsilon(\phi,\beta)$ . Moreover we can find $L_{\beta}^{(2)}<\infty$ such that $f_{\epsilon}\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta}^{(2)})$ for all $\epsilon<\epsilon(\phi,\beta)$ . Now

$\displaystyle\\|f_{\epsilon}-\hat{f}_{X}\\|_{1}$	$\displaystyle\geq\|\mathcal{F}[f_{\epsilon}](1/\epsilon)-\mathcal{F}[\hat{f}_{X}](1/\epsilon)\|$
	$\displaystyle\geq\left\|\int_{[-1/2,1/2]}f_{\epsilon}(t)e^{-it/\epsilon}dt\right\|-\left\|\mathcal{F}[k](\frac{h}{\epsilon})\right\|$
	$\displaystyle\geq\left\|\int_{[-1/2,1/2]}f_{\epsilon}(t)\sin\frac{t}{\epsilon}dt\right\|-\left\|\mathcal{F}[k](\frac{h}{\epsilon})\right\|$
	$\displaystyle=\epsilon^{\beta}\left\|\int_{[-1/2,1/2]}\phi(t)\sin^{2}\frac{t}{\epsilon}dt\right\|-\left\|\mathcal{F}[k](\frac{h}{\epsilon})\right\|$	(50)

where (50) used the fact that $\phi$ is even. Since $\lim_{\epsilon\to 0}\int_{[-1/2,1/2]}\phi(t)\sin^{2}\frac{t}{\epsilon}dt=\frac{1}{2}$ , there exists $\epsilon^{\prime}(\phi)$ such that

\displaystyle\int_{[-1/2,1/2]}\phi(t)\sin^{2}\frac{t}{\epsilon}dt\geq\frac{1}{4}

for any $\epsilon\leq\epsilon^{\prime}(\phi)$ . Now define

\displaystyle\epsilon^{\prime\prime}(h,m)=\frac{b(h\wedge 1)}{2\log m}.

There exists $m(\phi,\beta,b)<\infty$ such that $\sup_{h>0}\epsilon^{\prime\prime}(h,m)<\epsilon(\phi,\beta)\wedge\epsilon^{\prime}(\phi)$ whenever $m\geq m(\phi,\beta,b)$ . With the choice of $\epsilon=\epsilon^{\prime\prime}(h,m)$ , we can continue lower bounding (50) as (for $m\geq m(\phi,\beta,b)$ ):

\displaystyle\frac{1}{4}\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}.

Finally, we collect the results for step 1 and step 2. First observe that the main term in the risk in step 1 can be simplified as

	$\displaystyle\left(1\wedge\frac{1}{100q_{1}mh}\right)1\left\{h\leq\frac{0.02a}{\log\left(\frac{mq_{1}}{0.001a}\vee\frac{10}{a}\right)}\wedge 1\right\}$
	$\displaystyle=\frac{1}{100q_{1}mh}\wedge 1\left\{\mathcal{A}\right\}$		(51)

where $\mathcal{A}$ denotes the event in the left side of (51).

Thus up to multiplicative constant depending on $k$ , $\beta$ , we can lower bound the risk by taking the max of the risks in the two steps:

\displaystyle\left(\frac{1}{mh}\wedge 1\{\mathcal{A}\}\right)\vee\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right)

(52)

whenever $L\geq L_{\beta}:=L_{\beta}^{(1)}\vee L_{\beta}^{(2)}$ . We can use the distributive law to open up the parentheses in (52). By checking the $h>m^{-\frac{1}{\beta}}$ and $h\leq m^{-\frac{1}{\beta}}$ cases respectively, it is easy to verify that

\displaystyle\frac{1}{mh}\vee\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right)=\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right).

Next, if $\mathcal{A}$ is true, we evidently have

\displaystyle 1\{\mathcal{A}\}\vee\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right)=1=\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right).

If $\mathcal{A}$ is not true, then $h>\frac{0.02a}{\log\left(\frac{mq_{1}}{0.001a}\vee\frac{10}{a}\right)}\wedge 1$ , and we have

	$\displaystyle 1\{\mathcal{A}\}\vee\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right)$	$\displaystyle=\left(\left(\frac{b(h\wedge 1)}{\log m}\right)^{\beta}-\frac{1}{bm^{2}}\right)$
		$\displaystyle=\Omega\left(\log^{-2\beta}m\right)$
		$\displaystyle=\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right).$

In either case the risk with respect to $L_{1}$ is $\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right)$ . It remains to convert this to a lower bound in $L_{2}$ .

We consider two cases. First note that by the fast decay condition on the Fourier transform, $k\in\mathcal{C}^{1}$ . Let $B=B_{k}$ denote a constant such that

\sup_{x\in[-1/2,1/2]}\left|k^{\prime}(x)\right|\leq B.

(53)

Set $\Delta=B^{1/2}\vee k(0)\vee 1$ .

Case 1: $h\leq\Delta$ .

Let $U=\{|y|\geq\frac{1}{2}+c_{\beta,\Delta,a}\log m\}$ , and let $U^{c}=\mathbb{R}\backslash U$ . If $h\leq\Delta$ , then because $X_{i}\in[-1/2,1/2]$ and by the exponential decay of $k$ ,

\lVert\hat{f}_{X}(y){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\leq m^{-2}

for $c_{\beta,\Delta,a}$ sufficiently large. Thus by Cauchy–Schwarz,

	$\displaystyle\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U^{c}}\rVert_{2}$	$\displaystyle\geq c^{\prime}_{\beta,\Delta,a}(\log m)^{-1/2}\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U^{c}}\rVert_{2}$
		$\displaystyle=c^{\prime}_{\beta,\Delta,a}(\log m)^{-1/2}\left(\lVert(\hat{f}_{X}-f)\rVert_{1}-\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{U}\rVert_{1}\right)$
		$\displaystyle\geq c^{\prime}_{\beta,\Delta,a}(\log m)^{-1/2}\left(c_{\beta,k}\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta}m}\right)-m^{-2}\right)$
		$\displaystyle=\Omega\left(\frac{m^{-\frac{\beta}{\beta+1}}}{\log^{\beta+\frac{1}{2}}m}\right)$

Case 2: $h\geq\Delta$

In this case, $k(X_{i}-y)$ is nearly constant for all $i$ . By (53) and Taylor’s theorem,

\left|k(0)-k\left(\frac{X_{i}-y}{h}\right)\right|\leq 2B

for all $y\in[-1/2,1/2]$ and for all $i$ . Hence, for all $y\in[-1/2,1/2]$ , using $h\geq\Delta$ ,

\hat{f}_{X}(y)=\frac{1}{mh}\sum_{i=1}^{m}k\left(\frac{X_{i}-y}{h}\right)\leq\frac{1}{h}(k(0)+2B)\leq 3.

For $L_{\beta}$ large enough, we see that for the function $f\in\mathcal{P}_{\mathcal{H}}(\beta,L_{\beta})$ with $f|_{[0,\frac{1}{100}]}\equiv 4$ ,

\lVert\hat{f}_{X}-f\rVert_{2}\geq\lVert(\hat{f}_{X}-f){\rm 1}\kern-2.40005pt{\rm I}_{[0,\frac{1}{100}]}\rVert_{1}=\Omega(1).

∎

Appendix D PROOFS FROM SECTION 5

D.1 Proof of Theorem 7

The result is restated below.

Theorem.

\sup_{f\in\mathcal{P}_{\mathcal{H}}(\beta,L)}\mathbb{E}\lVert f-\hat{g}_{S}\rVert_{2}\leq c_{\beta,d,L}\,n^{-\frac{\beta}{2\beta+d}}.

Proof.

Here we adapt the results in Section 2 of Phillips and Tai (2018b) to our setting where the bandwidth $h=n^{-1/(2\beta+d)}$ is shrinking. Using their notation, we define $K_{s}(x,y)=k_{s}\left(\frac{x-y}{h}\right)$ and study the kernel discrepancy of the kernel $K_{s}$ . First we verify the assumptions on the kernel (bounded influence, Lipschitz, and positive semidefiniteness) needed to apply their results.

First, the kernel $K_{s}$ is bounded influence (see Phillips and Tai, 2018b, Section 2) with constant $c_{K}=2$ and $\delta=n^{-1}$ , which means that

\left|K_{s}(x,y)\right|\leq\frac{1}{n}

if $\left|x-y\right|_{\infty}\geq n^{2}$ . This follows from the fast decay of $\kappa_{s}$ .

Note that if $x$ and $y$ differ on a single coordinate $i$ , then

\left|k_{s}(x)-k_{s}(y)\right|\leq\left|c(x_{i}-y_{i})\prod_{j\neq i}\kappa_{s}(x_{j})\right|\leq c\left|x_{i}-y_{i}\right|

because $\left|\kappa_{s}(x)\right|\leq\lVert\psi\rVert_{1}$ for all $x$ and the function $\kappa_{s}$ is $c$ -Lipschitz for some constant $c$ . Hence by the triangle and Cauchy–Schwarz inequalities, the function $k_{s}$ is Lipschitz:

\displaystyle\left|k_{s}(x)-k_{s}(y)\right|\leq dc_{k}\left|x-y\right|_{1}\leq d^{3/2}c_{\kappa}\left|x-y\right|_{2}.

Therefore the kernel $K_{s}(x,y)$ is Lipschitz (see Phillips and Tai, 2018b) with constant $C_{K}=d^{3/2}c_{\kappa}h^{-1}$ . Moreover, the kernel $K_{s}$ is positive semidefinite because the Fourier transform of $\kappa_{s}$ is nonnegative.

Given the shrinking bandwidth $h=n^{-1/(2\beta+d)}$ , we slightly modify the lattice used in Phillips and Tai (2018b, Lemma 1). Define the lattice

\mathcal{L}=\left\{\left.(i_{1}\delta,i_{2}\delta,\ldots,i_{d}\delta)\,\right|\,i_{j}\in\mathbb{Z}\right\},

where

\delta=\frac{1}{c_{\kappa}d^{2}nh^{-1}}.

The calculation at the top of page 6 of Phillips and Tai (2018b, Lemma 1) yields

	$\displaystyle\mathsf{disc}(X,\chi,y)$	$\displaystyle:=\left\|\sum_{i=1}^{n}\chi(X_{i})K_{s}(X_{i},y)\right\|$
		$\displaystyle\leq\left\|\sum_{i=1}^{n}\chi(X_{i})K_{s}(X_{i},y_{0})\right\|+1$

where $y_{0}$ is the closest point to $y$ in the lattice $\mathcal{L}$ , and $\chi$ assigns either $+1$ or $-1$ to each element of $X=X_{1},\ldots,X_{n}$ . Moreover, with the bounded influence of $K_{s}$ , if

\min_{i}\left|y-X_{i}\right|_{\infty}\geq n^{2},

then

\mathsf{disc}(X,\chi,y)=\left|\sum_{i=1}^{n}\chi(X_{i})K_{s}(X_{i},y)\right|\leq 1.

On defining

\mathcal{L}_{X}=\mathcal{L}\cap\{y:\,\min_{i}\left|y-X_{i}\right|_{\infty}\leq n^{2}\},

we see that

\max_{y\in\mathbb{R}^{d}}\mathsf{disc}(X,\chi,y)\leq\max_{y\in\mathcal{L}_{X}}\mathsf{disc}(X,\chi,y)+1

for all signings $\chi:X\to\{-1,+1\}$ . This is precisely the conclusion of Phillips and Tai (2018b, Lemma 1).

This established, the positive definiteness and bounded diagonal entries of $K_{s}$ and Phillips and Tai (2018b, Lemmas 2 and 3) imply that

\mathsf{disc}_{K_{s}}=O(\sqrt{d\log n}).

Given $\varepsilon>0$ , the halving algorithm can be applied to $K_{s}$ as in Phillips and Tai (2018b, Corollary 5) to yield a coreset $X_{S}$ of size $m=O(\varepsilon^{-1}\sqrt{d\log\varepsilon^{-1}})$ such that

\lVert\frac{1}{n}\sum_{j=1}^{n}K_{s}(X_{j},y)-\frac{1}{m}\sum_{j\in S}K_{s}(X_{j},y)\rVert_{\infty}\leq\varepsilon.

Rescaling by $h^{-d}$ , we have

\lVert\hat{f}-\hat{f}_{S}^{\mathsf{unif}}\rVert_{\infty}=\lVert\frac{1}{n}\sum_{j=1}^{n}k_{s}(X_{j},y)-\frac{1}{m}\sum_{j\in S}k_{s}(X_{j},y)\rVert_{\infty}\leq\varepsilon h^{-d}.

Recall from Section B.2 that $\hat{f}$ attains the minimax rate of estimation on $\mathcal{P}_{\mathcal{H}}(\beta,L)$ . Thus setting $\varepsilon=h^{d}n^{-\beta/(2\beta+d)}$ we get a coreset of size $\tilde{O}_{d}(n^{\frac{\beta+d}{2\beta+d}})$ that attains the minimax rate $c_{\beta,d,L}\,n^{-\beta/(2\beta+d)}$ , as desired. Moreover, by the results of Phillips and Tai (2018b), this coreset can be constructed in polynomial time.

∎

Acknowledgments We thank Cole Franks for helpful discussions regarding algorithmic aspects of Carathéodory’s theorem.

References

Agarwal et al. [2005] Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1–30, 2005.
Ashtiani et al. [2020] Hassan Ashtiani, Shai Ben-David, Nicholas JA Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of gaussian mixtures via compression schemes. Journal of the ACM (JACM), 67(6):1–42, 2020.
Bach et al. [2012] Francis R Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In ICML, 2012.
Bachem et al. [2017] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
Bachem et al. [2018] Olivier Bachem, Mario Lucic, and Andreas Krause. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1119–1127, 2018.
Bansal et al. [2018] Nikhil Bansal, Daniel Dadush, Shashwat Garg, and Shachar Lovett. The gram-schmidt walk: a cure for the banaszczyk blues. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 587–597, 2018. 10.1145/3188745.3188850. URL https://doi.org/10.1145/3188745.3188850.
Bubeck [2015] Sébastien Bubeck. Convex optimization: algorithms and complexity. Now Publishers Inc., 2015.
Carathéodory [1907] C. Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen, March 1907. URL https://doi.org/10.1007/bf01449883.
Chazelle and Matoušek [1996] Chazelle and Matoušek. On linear-time deterministic algorithms for optimization problems in fixed dimension. Journal of Algorithms, 21(3):579–597, 1996.
Chazelle [2000] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, Cambridge, 2000.
Claici et al. [2020] Sebastian Claici, Aude Genevay, and Justin Solomon. Wasserstein measure coresets. arXiv preprint arXiv:1805.07412, 2020.
Clarkson [2010] Kenneth L Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):1–30, 2010.
Cover and Thomas [2006] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.
Feldman et al. [2011] Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In Advances in neural information processing systems, pages 2142–2150, 2011.
Feldman et al. [2013] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1434–1453. SIAM, 2013.
Frahling and Sohler [2005] Gereon Frahling and Christian Sohler. Coresets in dynamic geometric data streams. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’05, pages 209–217, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1581139608. 10.1145/1060590.1060622. URL https://doi.org/10.1145/1060590.1060622.
Frank et al. [1956] Marguerite Frank, Philip Wolfe, et al. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
Gärtner and Jaggi [2009] Bernd Gärtner and Martin Jaggi. Coresets for polytope distance. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 33–42, 2009.
Grötschel et al. [2012] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
Hanneke et al. [2019] Steve Hanneke, Aryeh Kontorovich, and Menachem Sadigurschi. Sample compression for real-valued learners. In Algorithmic Learning Theory, pages 466–488, 2019.
Har-Peled and Kushal [2007] Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3–19, 2007.
Harvey and Samadi [2014] Nick Harvey and Samira Samadi. Near-optimal herding. In Conference on Learning Theory, pages 1165–1182, 2014.
Hiriart-Urruty and Lemaréchal [2004] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2004.
Huggins et al. [2016] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.
Joshi et al. [2011] Sarang Joshi, Raj Varma Kommaraji, Jeff M. Phillips, and Suresh Venkatasubramanian. Comparing distributions and shapes using the kernel distance. In Proceedings of the Twenty-Seventh Annual Symposium on Computational Geometry, SoCG ’11, page 47–56, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306829. 10.1145/1998196.1998204. URL https://doi.org/10.1145/1998196.1998204.
Karnin and Liberty [2019] Zohar Karnin and Edo Liberty. Discrepancy, coresets, and sketches in machine learning. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1975–1993, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/karnin19a.html.
Katznelson [2004] Yitzhak Katznelson. An Introduction to Harmonic Analysis. Cambridge Mathematical Library. Cambridge University Press, 3 edition, 2004. 10.1017/CBO9781139165372.
Littlestone and Warmuth [1986] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Unpublished manuscript, 1986.
Lopez-Paz et al. [2015] David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, and Iliya Tolstikhin. Towards a learning theory of cause-effect inference. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1452–1461, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/lopez-paz15.html.
Matoušek [1999] J. Matoušek. Geometric Discrepancy: an Illustrated Guide. Springer, New York, 1999.
McDonald [2017] Daniel McDonald. Minimax Density Estimation for Growing Dimension. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 194–203, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/mcdonald17a.html.
Munteanu et al. [2018] Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David P. Woodruff. On coresets for logistic regression. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 6562–6571, Red Hook, NY, USA, 2018. Curran Associates Inc.
Phillips [2013] Jeff M Phillips. $\varepsilon$ -samples for kernels. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1622–1632. SIAM, 2013.
Phillips and Tai [2018a] Jeff M Phillips and Wai Ming Tai. Improved coresets for kernel density estimates. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2718–2727. SIAM, 2018a.
Phillips and Tai [2018b] Jeff M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. In 34th International Symposium on Computational Geometry, SoCG 2018, June 11-14, 2018, Budapest, Hungary, pages 66:1–66:13, 2018b. 10.4230/LIPIcs.SoCG.2018.66. URL https://doi.org/10.4230/LIPIcs.SoCG.2018.66.
Tikhomirov [1993] V. M. Tikhomirov. $\varepsilon$ -Entropy and $\varepsilon$ -Capacity of Sets In Functional Spaces, pages 86–170. Springer Netherlands, Dordrecht, 1993. ISBN 978-94-017-2973-4. 10.1007/978-94-017-2973-4_7. URL https://doi.org/10.1007/978-94-017-2973-4_7.
Tsybakov [2009] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer series in statistics. Springer, 2009. ISBN 978-0-387-79051-0. 10.1007/b13794. URL https://doi.org/10.1007/b13794.
Zheng and Phillips [2017] Yan Zheng and Jeff M Phillips. Coresets for kernel regression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 645–654, 2017.

$\displaystyle\lVert\mathcal{\bar{F}}[\tilde{k}_{h}](\omega){\rm 1}\kern-2.40005pt{\rm I}_{\omega\notin A}\rVert_{\ell_{2}}$	$\displaystyle\leq\,T^{-\gamma}\lVert\sum_{\left\|\alpha\right\|_{1}=\gamma}\frac{\gamma!}{\alpha!}\left\|\omega\right\|^{\alpha}\mathcal{\bar{F}}[\tilde{k}_{h}](\omega){\rm 1}\kern-2.40005pt{\rm I}_{\omega\notin A}\rVert_{\ell_{2}}$
	$\displaystyle\leq\,T^{-\gamma}\sum_{\left\|\alpha\right\|_{1}=\gamma}\frac{\gamma!}{\alpha!}\,\lVert\omega^{\alpha}\mathcal{\bar{F}}[\tilde{k}_{h}](\omega)\rVert_{\ell_{2}}$
	$\displaystyle=\,c_{d}\,T^{-\gamma}\sum_{\left\|\alpha\right\|_{1}=\gamma}\frac{\gamma!}{\alpha!}\,\lVert\frac{\partial^{\alpha}}{\partial x^{\alpha}}\tilde{k}_{h}(x)\rVert_{2},$	(19)

$\displaystyle\\|f_{\epsilon}-\hat{f}_{X}\\|_{1}$	$\displaystyle\geq\|\mathcal{F}[f_{\epsilon}](1/\epsilon)-\mathcal{F}[\hat{f}_{X}](1/\epsilon)\|$
	$\displaystyle\geq\left\|\int_{[-1/2,1/2]}f_{\epsilon}(t)e^{-it/\epsilon}dt\right\|-\left\|\mathcal{F}[k](\frac{h}{\epsilon})\right\|$
	$\displaystyle\geq\left\|\int_{[-1/2,1/2]}f_{\epsilon}(t)\sin\frac{t}{\epsilon}dt\right\|-\left\|\mathcal{F}[k](\frac{h}{\epsilon})\right\|$
	$\displaystyle=\epsilon^{\beta}\left\|\int_{[-1/2,1/2]}\phi(t)\sin^{2}\frac{t}{\epsilon}dt\right\|-\left\|\mathcal{F}[k](\frac{h}{\epsilon})\right\|$	(50)