
Improving Nonparametric Density Estimation with Tensor Decompositions

Robert A. Vandermeulen
Machine Learning Group
Technische Universität Berlin
Berlin, Germany
[email protected]
Abstract

While nonparametric density estimators often perform well on low dimensional data, their performance can suffer when applied to higher dimensional data, owing presumably to the curse of dimensionality. One technique for avoiding this is to assume no dependence between features and that the data are sampled from a separable density. This allows one to estimate each marginal distribution independently thereby avoiding the slow rates associated with estimating the full joint density. This is a strategy employed in naive Bayes models and is analogous to estimating a rank-one tensor. In this paper we investigate whether these improvements can be extended to other simplified dependence assumptions which we model via nonnegative tensor decompositions. In our central theoretical results we prove that restricting estimation to low-rank nonnegative PARAFAC or Tucker decompositions removes the dimensionality exponent on bin width rates for multidimensional histograms. These results are validated experimentally with high statistical significance via direct application of an existing nonnegative tensor factorization to histogram estimators.

1 Introduction

Nonparametric density estimation is the task of estimating a density $f$ from data without assuming that $f$ belongs to some parametric class of densities, e.g. the space of multivariate Gaussian distributions. Some common nonparametric density estimators include the histogram estimator and the kernel density estimator (KDE). While nonparametric density estimation has been shown to be effective for many tasks, it has been observed empirically that estimator performance typically declines as data dimensionality increases, a manifestation of the curse of dimensionality. For the histogram and KDE this phenomenon also has a precise mathematical analog. For these estimators universal consistency is achieved iff $n\to\infty$ and $h\to 0$, with $nh^{d}\to\infty$, where $n$ is the number of samples and $h$ is the bin width for the histogram and the bandwidth parameter for the KDE [12].

One of the most common approaches to alleviating the curse of dimensionality is dimensionality reduction. A dimensionality reduction technique typically attempts to transform a high dimensional representation into a lower dimensional one that removes dependences within the data. Feature selection techniques, for example, usually explicitly remove highly dependent features [30, 7]. In PCA one finds the best $d$-dimensional affine subspace for approximating the data. If the PCA model fits well, it implies that, given $d$ features of a $D$-dimensional sample, there exists a good linear prediction of the remaining $D-d$ features. The manifold hypothesis [6], which is often touted as a general explanation for why high dimensional problems are learnable [5], can also be viewed as a model of dependence. For example, if we assume that a random vector $[X,Y]^{T}$ lies on the one-dimensional sphere, $X^{2}+Y^{2}=1$, and $Y$ is known, then $X$ can only assume one of two values, $X=\pm\sqrt{1-Y^{2}}$. More generally, the assumption that the data lie on a manifold implies local dependence. This is because any sufficiently small region of the support manifold can be well approximated by an affine subspace, and thus there exists a linear dependence between the features, as in PCA. Interestingly, assuming that features are not dependent yields another approach to overcoming the curse of dimensionality.

In a naive Bayes model one assumes no dependence between features, i.e. our target density is separable,

f\left(x_{1},\ldots,x_{d}\right)=f_{1}\left(x_{1}\right)f_{2}\left(x_{2}\right)\cdots f_{d}\left(x_{d}\right). (1)

In order to estimate $f$ one can now simply estimate the marginal distributions and multiply them. Because the dimensionality of each marginal is one, one can use a histogram or KDE and achieve an $nh\to\infty$ rate on bin width/bandwidth while preserving consistent estimation, thereby circumventing the curse of dimensionality. The separability assumption is rarely satisfied in practice, so naive Bayes models are typically not used for density estimation directly, but may be used for some other task such as classification via a likelihood ratio test.
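To make this strategy concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of a separable density estimate on $[0,1)^{d}$ built from one-dimensional histograms; the function name and the Beta-distributed example data are ours.

import numpy as np

def naive_bayes_histogram(X, b):
    # Fit one 1-D histogram per feature and return the product density (naive Bayes).
    # X: array of shape (n, d) with entries in [0, 1); b: bins per dimension.
    n, d = X.shape
    # Per-dimension bin weights; each row sums to one.
    weights = np.stack(
        [np.histogram(X[:, j], bins=b, range=(0.0, 1.0))[0] / n for j in range(d)]
    )
    def density(x):
        # Bin index of each coordinate; each marginal density is b times its bin weight.
        idx = np.minimum((np.asarray(x) * b).astype(int), b - 1)
        return np.prod(b * weights[np.arange(d), idx])
    return density

rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=(1000, 3))      # a genuinely separable toy density
f_hat = naive_bayes_histogram(X, b=20)
print(f_hat([0.2, 0.3, 0.1]))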

In this paper we consider relaxations of the naive Bayes model based on the assumption that a density is a mixture of separable densities. Our models are inspired by nonnegative tensor factorizations, so we refer to them collectively as nonparametric nonnegative tensor factorization (NNTF) models. The first model is related to nonnegative PARAFAC [28] and is commonly known as a multi-view model in the statistics and machine learning literature:

f\left(x_{1},\ldots,x_{d}\right)=\sum_{i=1}^{k}w_{i}f_{i,1}\left(x_{1}\right)f_{i,2}\left(x_{2}\right)\cdots f_{i,d}\left(x_{d}\right). (2)

This is equivalent to the assumption that the features are independent conditioned on an unobserved discrete random variable taking on $k$ values. Our second model is based on the nonnegative Tucker decomposition [17]. In this model it is assumed that there are $d$ collections of $k$ one-dimensional densities, $\mathcal{F}_{1},\ldots,\mathcal{F}_{d}$ with $\mathcal{F}_{i}=\left\{f_{i,1},\ldots,f_{i,k}\right\}$, and some probability measure which randomly selects one density from each collection. This measure can be represented by a tensor $W\in\mathbb{R}^{k^{\times d}}$ where the probability of selecting $f_{1,i_{1}},\ldots,f_{d,i_{d}}$ is $W_{i_{1},\ldots,i_{d}}$. To sample from this model we first randomly select the marginal distributions $f_{1,i_{1}},\ldots,f_{d,i_{d}}$ according to $W$ and then independently sample each feature according to the randomly selected marginal distribution, $X_{j}\sim f_{j,i_{j}}$. The density of this model is

f\left(x_{1},\ldots,x_{d}\right)=\sum_{i_{1}=1}^{k}\cdots\sum_{i_{d}=1}^{k}W_{i_{1},\ldots,i_{d}}f_{1,i_{1}}(x_{1})\cdots f_{d,i_{d}}(x_{d}). (3)

We are unaware of previous literature investigating this model so we will simply term it the Tucker model.
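For intuition, here is a small NumPy sketch (ours; the Beta marginals and all sizes are arbitrary stand-ins for the $f_{j,i}$) of the two-stage sampling procedure behind the Tucker model (3). The multi-view model (2) corresponds to the special case where the core tensor $W$ is supported on its diagonal.

import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 4                                  # dimensions and marginals per dimension

# Core probability tensor W over the k^d combinations of marginals; entries sum to one.
W = rng.random((k,) * d)
W /= W.sum()

# d collections of k one-dimensional densities; here Beta densities with random parameters.
beta_params = rng.uniform(0.5, 5.0, size=(d, k, 2))

def sample_tucker(n):
    # Select one marginal per dimension according to W, then sample each feature independently.
    flat = rng.choice(W.size, size=n, p=W.ravel())
    S = np.stack(np.unravel_index(flat, W.shape), axis=1)   # shape (n, d) of marginal indices
    X = np.empty((n, d))
    for j in range(d):
        a, b = beta_params[j, S[:, j], 0], beta_params[j, S[:, j], 1]
        X[:, j] = rng.beta(a, b)
    return X

print(sample_tucker(500).shape)              # (500, 3)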

In Section 2 we prove that there exists a trade-off between the rate on bin width $h$ and the number of components $k$: to control estimation error (for an estimator $V$ restricted to a space of densities $\mathcal{P}$, the estimation error refers to the difference between $\left\|V-p\right\|$ and $\min_{q\in\mathcal{P}}\left\|q-p\right\|$, where $p$ is the target density) we require $k/h$ to be asymptotically dominated by $n$ for the multi-view histogram and $k/h+k^{d}$ to be asymptotically dominated by $n$ for the Tucker histogram (both of these rates ignore logarithmic factors). Note that for the multi-view histogram this rate does not depend on $d$. Allowing $h$ to shrink as aggressively as possible (which we pay for with a slow rate on $k$) we show that there exist universally consistent histogram estimators which achieve an $nh/\log\left(h^{-1}\right)\to\infty$ rate on bin width, thereby removing the dependence on dimension and approximately attaining the rates possible for densities of the form in (1) while still controlling estimation error. We show that these are approximately the fastest possible rates via matching lower bounds. In Section 3 we show that we can use an existing low-rank nonnegative Tucker factorization algorithm to fit our model and demonstrate empirically that fitting histograms to a Tucker model outperforms the standard histogram estimator with very high statistical significance.

While this paper focuses on histogram estimation and presents a promising, readily implementable improvement to the standard histogram estimator in Section 3, its primary purpose is to showcase the potential of utilizing concepts from nonnegative tensor factorization to improve performance in nonparametric statistical methods.

1.1 Previous Work

Nonparametric density estimation has been extensively studied, with the histogram estimator and KDE being by far the most well known. There do exist, however, alternative methods for density estimation, e.g. the forest density estimator [19] and the $k$-nearest neighbor density estimator [20]. The $L_{1}$, $L_{2}$, and $L_{\infty}$ convergence of the histogram and KDE has been studied extensively [12, 8, 33, 14]. The KDE is generally accepted as being the superior density estimator, with some mathematical justification [29]. Numerous modifications and extensions of the KDE have been proposed, including variable bandwidth [32], robust KDEs [15, 34, 35], and methods for enforcing support boundary constraints [27]. Finally we mention one recent paper [16] that demonstrated that the uniform convergence of a KDE to its population estimate suffers when the intrinsic dimension of the data is lower than the ambient dimension, a phenomenon seemingly at odds with the curse of dimensionality.

For our review of NNTF models we also include a general review of tensor/matrix factorizations, since both can be viewed as low-rank models. In particular, for the multi-view model we have

\sum_{i=1}^{k}w_{i}f_{i,1}\left(x_{1}\right)\cdots f_{i,d}\left(x_{d}\right)\sim\sum_{i=1}^{k}\lambda_{i}\mathbf{v}_{i,1}\otimes\cdots\otimes\mathbf{v}_{i,d}. (4)

We further note that for histogram estimation, once data has been assigned to bins, finding a good multi-view or Tucker histogram is analogous to estimating a probability tensor with a nonnegative factorization (we show this rigorously in Section 3).

A great deal of work has gone into leveraging low-rank assumptions to improve matrix estimation, particularly in the field of compressed sensing [10, 24]. In compressed sensing one has access to a collection of random linear measurements of an unknown low-rank matrix to be estimated. Fitting a matrix to these measurements with a nuclear norm regularized optimization problem achieves estimation bounds better than those possible without the low-rank assumption. These techniques have been effectively applied to problems such as matrix completion, multivariate regression, and estimating autoregressive models [21, 22]. Unfortunately such techniques are not extensible to histogram estimation because, in the density estimation setting, data are not linearly sampled from the target model. Furthermore how to extend compressed sensing techniques to general tensors is not clear.

General matrix/tensor factorization, including nonnegative matrix/tensor factorization, has been extensively studied despite its inherent difficulty due to non-convexity. The works [9, 3] present potential theoretical grounds for avoiding the computational difficulties of nonnegative matrix factorization. Some algorithms for finding nonnegative tensor factorizations are mentioned in Section 3. One notable approach to tensor factorization is to assume, in the tensor representation in (4), that $d\geq 3$ and that the collections of vectors $\mathbf{v}_{1,j},\ldots,\mathbf{v}_{k,j}$ are linearly independent for all $j$. Under this assumption we are guaranteed that the factorization (4) is unique [1]. In [2] the authors present a method for recovering this factorization efficiently and demonstrate its utility for a variety of tasks. This work was extended in [31] to recover a multi-view KDE satisfying an analogous linear independence assumption. This is the only work of its type of which we are aware. In this work the authors investigate the sample complexity of their estimator but do not demonstrate that their technique has potential for improving rates for nonparametric density estimation in general. Finally we note that nonparametric applications of the Tucker decomposition have been utilized in Bayesian statistics [26]. We are unaware of any literature describing the model we introduce in (3).

2 Theoretical Results

In this section we mathematically demonstrate that histogram estimators can achieve greater estimation accuracy by restricting to NNTF models. To simplify analysis we will only consider densities on $\left[0,1\right)^{d}$ and analyze the number of bins per dimension $b$, which is the inverse of the bin width, i.e. $b=1/h$. We prove that there exists a trade-off between rates on $b$ and $k$. Furthermore we show that approximately the fastest possible rate on $b$, while still uniformly controlling estimator variance and remaining universally consistent, is $n/\left(b\log b\right)\to\infty$. Before proving these results we must introduce a fair amount of notation.

2.1 Notation

All norms in Section 2 are the $\ell^{1}$, $L^{1}$, or total variation norm for vectors/tensors, densities, and measures respectively. Note that these norms are analogous, e.g. the $L^{1}$ norm of a probability density function is the same as the total variation norm of the probability measure associated with the density. Let $\mathcal{D}_{d}$ be the set of all densities on $\left[0,1\right)^{d}$. By density we mean a probability measure that is absolutely continuous with respect to the $d$-dimensional Lebesgue measure. We define a probability vector or probability tensor to simply mean a vector or tensor whose entries are nonnegative and sum to one. Let $\Delta_{b}$ denote the set of probability vectors in $\mathbb{R}^{b}$ and $\mathcal{T}_{d,b}$ the set of probability tensors in $\mathbb{R}^{b^{\times d}}$. Let $\mathcal{T}_{d,b}^{k}$ be the set of tensors which are a convex combination of $k$ separable probability tensors, i.e.

\mathcal{T}_{d,b}^{k}\triangleq\left\{\sum_{i=1}^{k}w_{i}\prod_{j=1}^{d}p_{i,j}\,\middle|\,w\in\Delta_{k},\ p_{i,j}\in\Delta_{b}\right\}.

In this paper, the product symbol $\prod$ will always mean the standard outer product, e.g. set product or tensor product. For any natural number $b$ let $[b]=\left\{1,\ldots,b\right\}$. For a multi-index $A\in\left[b\right]^{d}$ we define $\mathbf{e}_{d,b,A}$ as the element of $\mathcal{T}_{d,b}$ whose $(A_{1},\ldots,A_{d})$-th entry is one and which is zero elsewhere.
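As an illustration (ours, not from the paper), an element of $\mathcal{T}_{d,b}^{k}$ can be built in NumPy as a convex combination of $k$ outer products of probability vectors; the sizes below are arbitrary.

import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
d, b, k = 3, 5, 2

def random_simplex(shape):
    # Random probability vectors along the last axis.
    v = rng.random(shape)
    return v / v.sum(axis=-1, keepdims=True)

w = random_simplex(k)                 # mixture weights in Delta_k
P = random_simplex((k, d, b))         # P[i, j] is the j-th marginal of component i

# T = sum_i w_i * p_{i,1} (outer) ... (outer) p_{i,d}, an element of T_{d,b}^k.
T = sum(w[i] * reduce(np.multiply.outer, [P[i, j] for j in range(d)]) for i in range(k))
print(T.shape, T.sum())               # (5, 5, 5) and 1.0 up to floating point error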

The following is the set of probability tensors constructed via a nonnegative Tucker factorization

\widetilde{\mathcal{T}}_{d,b}^{k}\triangleq\left\{\sum_{S\in\left[k\right]^{\times d}}W_{S}\prod_{i=1}^{d}p_{i,S_{i}}\,\middle|\,W\in\mathcal{T}_{d,k},\ p_{i,j}\in\Delta_{b}\right\}.

We will now construct the space of histograms on $\left[0,1\right)^{d}$. We begin with one-dimensional histograms. We define $h_{1,b,i}$ with $i\in\left[b\right]$ to be the one-dimensional histogram with all weight allocated to the $i$th bin. Formally we define this as

h_{1,b,i}\left(x\right)\triangleq b\,\mathbbm{1}\left(\frac{i-1}{b}\leq x<\frac{i}{b}\right).

Note that this is a valid density due to the leading $b$ coefficient. We use these to construct higher-dimensional histograms. For a multi-index $A\in\left[b\right]^{d}$, let

h_{d,b,A}\triangleq\prod_{i=1}^{d}h_{1,b,A_{i}},

the histogram whose entire density is allocated to the bin indexed by $A$. Finally we define $\Lambda_{d,b,A}$ to be the support of $h_{d,b,A}$, i.e. the “bins” of a histogram estimator,

\Lambda_{d,b,A}\triangleq\prod_{i=1}^{d}\left[\frac{A_{i}-1}{b},\frac{A_{i}}{b}\right).

For a sequence of points in $\left[0,1\right)^{d}$, $\mathcal{X}=\left(X_{1},\ldots,X_{n}\right)$, the standard histogram estimator is

H_{d,b}\left(\mathcal{X}\right)\triangleq\frac{1}{n}\sum_{i=1}^{n}\sum_{A\in[b]^{d}}h_{d,b,A}\,\mathbbm{1}\left(X_{i}\in\Lambda_{d,b,A}\right).

Let $\mathcal{H}_{d,b}\triangleq\operatorname{Conv}\left(\left\{h_{d,b,A}\middle|A\in\left[b\right]^{d}\right\}\right)$, the set of all $d$-dimensional histograms with $b$ bins per dimension. Let $\mathcal{H}_{d,b}^{k}$ be the set of histograms with at most $k$ separable components, i.e.

\mathcal{H}_{d,b}^{k}\triangleq\left\{\sum_{i=1}^{k}w_{i}\prod_{j=1}^{d}f_{i,j}\,\middle|\,w\in\Delta_{k},\ f_{i,j}\in\mathcal{H}_{1,b}\right\}. (5)

We will refer to elements of this space as multi-view histograms. Analogously we define the space of Tucker histograms to be

\widetilde{\mathcal{H}}_{d,b}^{k}=\left\{\sum_{S\in\left[k\right]^{\times d}}W_{S}\prod_{i=1}^{d}f_{i,S_{i}}\,\middle|\,W\in\mathcal{T}_{d,k},\ f_{i,j}\in\mathcal{H}_{1,b}\right\}.

We emphasize that the collections of densities $\mathcal{H}_{d,b}^{k}$ and $\widetilde{\mathcal{H}}_{d,b}^{k}$ are the primary objects of interest in this paper, as they represent NNTF histograms. The theoretical results we present are concerned with finding good density estimators restricted to these sets as $k$ and $b$ vary.

Note that there exists an $\ell^{1}\to L^{1}$ linear isometry $U_{d,b}:\mathcal{T}_{d,b}\to\mathcal{H}_{d,b}$ with $U_{d,b}$ defined as

U_{d,b}(\mathbf{e}_{d,b,A})=h_{d,b,A}.

The inverse map $U^{-1}_{d,b}$ simply transforms a histogram into the tensor of its bin weights, and $U_{d,b}$ performs the reverse transformation. Note that $U_{d,b}$ also restricts to bijections $\mathcal{T}_{d,b}^{k}\to\mathcal{H}_{d,b}^{k}$ and $\widetilde{\mathcal{T}}_{d,b}^{k}\to\widetilde{\mathcal{H}}_{d,b}^{k}$. Much of our analysis on histograms will be performed on the space of probability tensors, with the analysis translated to histograms via this operator.
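The following NumPy sketch (our own illustration) makes this correspondence concrete: $U_{d,b}^{-1}$ applied to the standard histogram is simply the tensor of normalized bin counts, while $U_{d,b}$ evaluates the histogram density induced by a probability tensor by rescaling the bin weight by the inverse bin volume $b^{d}$.

import numpy as np

def U_inverse_of_histogram(X, b):
    # Probability tensor of normalized bin counts for the standard histogram on [0,1)^d.
    n, d = X.shape
    counts, _ = np.histogramdd(X, bins=[b] * d, range=[(0.0, 1.0)] * d)
    return counts / n

def U(T):
    # Histogram density induced by a probability tensor T; each bin has volume b^{-d}.
    b = T.shape[0]
    def density(x):
        idx = tuple(np.minimum((np.asarray(x) * b).astype(int), b - 1))
        return T[idx] * b ** T.ndim
    return density

rng = np.random.default_rng(0)
X = rng.random((1000, 3))
T = U_inverse_of_histogram(X, b=4)
print(T.sum(), U(T)([0.1, 0.5, 0.9]))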

For a set of vectors $\mathcal{V}$ we define $k\operatorname{-mix}\left(\mathcal{V}\right)\triangleq\left\{\sum_{i=1}^{k}w_{i}v_{i}\middle|w\in\Delta_{k},v_{i}\in\mathcal{V}\right\}$, i.e. the set of convex combinations of collections of $k$ vectors from $\mathcal{V}$. We define $N\left(\mathcal{V},\varepsilon\right)$ to be the minimum cardinality of a subset of $\mathcal{V}$ which $\varepsilon$-covers $\mathcal{V}$ (with closed balls) with respect to the $\left\|\cdot\right\|$ metric. It will be clear from context whether $\left\|\cdot\right\|$ represents the $\ell^{1}$, $L^{1}$, or total variation norm.

2.2 Preliminary Results

For brevity the main text only contains a full proof of Lemma 2.7. The remaining full proofs can be found in the appendix. We include general descriptions of the proof techniques we use for the multi-view histogram results; these are similar to the techniques we use for Tucker histograms. Our general proof technique is to find good covers of the spaces of densities, i.e. $\mathcal{H}_{d,b}^{k}$ and $\widetilde{\mathcal{H}}_{d,b}^{k}$, and then apply an existing algorithm for selecting a good estimator from a finite collection of densities given data. We begin by establishing a covering number bound on the space of probability vectors via an adaptation of a standard result presented in [8].

Lemma 2.1.

For all $0<\varepsilon\leq 1$ we have that $N\left(\Delta_{b},\varepsilon\right)\leq\left(\frac{2b}{\varepsilon}\right)^{b}$.

We can extend this to a covering number for the space of separable probability tensors.

Lemma 2.2.

For all $0<\varepsilon\leq 1$ we have that $N\left(\mathcal{T}_{d,b}^{1},\varepsilon\right)\leq\left(\frac{2bd}{\varepsilon}\right)^{bd}$.

Proof sketch.

Combine Lemma 2.1 with the following standard bound for product measures (see Lemma 3.3.7 in [25]): $\left\|\prod_{i=1}^{d}q_{i}-\prod_{j=1}^{d}\widetilde{q}_{j}\right\|\leq\sum_{i=1}^{d}\left\|q_{i}-\widetilde{q}_{i}\right\|$.

Now we establish the following lemma for covering numbers of mixtures of densities.

Lemma 2.3.

Let $\mathcal{P}$ be a set of probability measures; then

N\left(k\operatorname{-mix}\left(\mathcal{P}\right),\varepsilon+\delta\right)\leq N\left(\mathcal{P},\varepsilon\right)^{k}N\left(\Delta_{k},\delta\right).
Proof sketch.

Use Lemma 2.1 to construct different weightings of $k$ elements from an $\varepsilon$-cover of $\mathcal{P}$. ∎

By combining Lemma 2.3 with Lemma 2.2 we arrive at covering numbers for the space of multi-view probability tensors.

Lemma 2.4.

For all $0<\varepsilon\leq 1$ the following holds: $N\left(\mathcal{T}_{d,b}^{k},\varepsilon\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k}{\varepsilon}\right)^{k}$.

Through application of the $U_{d,b}$ operator we now have a characterization of the complexity of the space $\mathcal{H}_{d,b}^{k}$.

Corollary 2.1.

For all $0<\varepsilon\leq 1$ the following holds: $N\left(\mathcal{H}_{d,b}^{k},\varepsilon\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k}{\varepsilon}\right)^{k}$.

The following are analogous results for Tucker histograms.

Lemma 2.5.

For all $0<\varepsilon\leq 1$ the following holds: $N\left(\widetilde{\mathcal{T}}_{d,b}^{k},\varepsilon\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k^{d}}{\varepsilon}\right)^{k^{d}}$.

Corollary 2.2.

For all $0<\varepsilon\leq 1$ the following holds: $N\left(\widetilde{\mathcal{H}}_{d,b}^{k},\varepsilon\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k^{d}}{\varepsilon}\right)^{k^{d}}$.

The following lemma from [4] provides us with a way to choose good estimators from finite collections of densities. It can be proven by applying a Chernoff bound to [8], Theorem 6.3.

Lemma 2.6 ([4]).

There exists a deterministic algorithm that, given a collection of distributions $p_{1},\ldots,p_{M}$, a parameter $\varepsilon>0$, and at least $\frac{\log\left(3M^{2}/\delta\right)}{2\varepsilon^{2}}$ iid samples from an unknown distribution $p$, outputs an index $j\in\left[M\right]$ such that

\left\|p_{j}-p\right\|\leq 3\min_{i\in\left[M\right]}\left\|p_{i}-p\right\|+4\varepsilon

with probability at least $1-\frac{\delta}{3}$.

We present the following asymptotic version of the previous lemma and include its full proof. We highlight the use of finding sufficiently slow rates on parameters in order to establish asymptotic results, a technique which we will use in later proofs.

Lemma 2.7.

Let $\left(\mathcal{P}_{n}\right)_{n\in\mathbb{N}}$ be a sequence of finite collections of densities in $\mathcal{D}_{d}$ where $\left|\mathcal{P}_{n}\right|\to\infty$ with $n/\log\left(\left|\mathcal{P}_{n}\right|\right)\to\infty$. Then there exists a sequence of estimators $V_{n}\in\mathcal{P}_{n}$ such that, for all $\gamma>0$,

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{P}_{n}}\left\|p-q\right\|+\gamma\right)\to 0,

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Proof.

Let $M=M(n)=\left|\mathcal{P}_{n}\right|$. Since $n/\log\left(M\right)\to\infty$ we have that for all $c>0$ there exists an $N_{c}$ such that, for all $n\geq N_{c}$, we have $n/\log\left(M\right)\geq c$, or equivalently $n\geq c\log\left(M\right)$. Because of this there exists a sequence of positive values $C=C(n)$ such that $C\to\infty$ and $n\geq C\log\left(M\right)$.

We will be making use of the algorithm in Lemma 2.6 as well as its notation. If we can show that there exist sequences of positive values $\varepsilon(n)\to 0,\ \delta(n)\to 0$ such that, for sufficiently large $n$, the following holds

\frac{\log\left(3M^{2}/\delta\right)}{2\varepsilon^{2}}\leq n,

then we can simply set $V_{n}$ to be the estimator from Lemma 2.6 for sufficiently large $n$ and, because the lemma holds independent of the choice of $p$, the theorem statement follows.

Let $\varepsilon=\left(2/C\right)^{1/4}$ and $\delta=3/\exp\left(2\sqrt{\frac{C}{2}}\right)$. Note that these are both positive sequences which converge to zero. Now we have

\frac{\log\left(3M^{2}/\delta\right)}{2\varepsilon^{2}} =\frac{\log\left(M^{2}\right)+\log\left(3/\delta\right)}{2\varepsilon^{2}}
=\frac{2\log\left(M\right)+\log\left(3/\delta\right)}{2\varepsilon^{2}}=\frac{\log\left(M\right)+\frac{1}{2}\log\left(3/\delta\right)}{\varepsilon^{2}}
=\varepsilon^{-2}\left(\log\left(M\right)+\frac{1}{2}\log\left(3/\delta\right)\right)
=\left(\left(2/C\right)^{1/4}\right)^{-2}\left(\log\left(M\right)+\frac{1}{2}\log\left(\exp\left(2\sqrt{\frac{C}{2}}\right)\right)\right)
=\sqrt{\frac{C}{2}}\left(\log(M)+\sqrt{\frac{C}{2}}\right)=\sqrt{\frac{C}{2}}\log(M)+\frac{C}{2}. (6)

For sufficiently large $C$ and $M$ we have that the RHS of (6) is less than or equal to

\frac{C}{2}\log(M)+\frac{C}{2} \leq\frac{C}{2}\log(M)+\frac{C}{2}\log(M)
=C\log(M)\leq n,

which completes our proof. ∎

2.3 Main Theoretical Results

We can now state the central results of this paper. The following theorem states that one can control the estimation error of multi-view histograms with $k$ components and $b$ bins per dimension so long as $n\sim bk$ (omitting logarithmic factors). Recall that the standard histogram estimator requires $n\sim b^{d}$, so we have removed the exponential dependence of the bin rate on dimensionality. Here and elsewhere the $\sim$ symbol is not a precise mathematical statement but rather indicates that the two values should be of the same order. In the following, $b$ and $k$ are functions of $n$; the space of histograms which we are fitting changes as we acquire more data.

Theorem 2.1.

For any pair of sequences $b\to\infty$ and $k\to\infty$ satisfying $n/(bk\log(b)+k\log(k))\to\infty$, there exists an estimator $V_{n}\in\mathcal{H}_{d,b}^{k}$ such that, for all $\varepsilon>0$,

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0,

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Proof sketch.

Apply Lemma 2.7 to the cover in Corollary 2.1 and choose appropriately slow rates for terms not involving $b$ or $k$. ∎

The sample complexity for the multi-view histogram is perhaps more accurately approximated as being on the order of $dbk$; however, the $d$ disappears in the asymptotic analysis.

The following theorem states that we can control the error of Tucker histogram estimates so long as $n\sim bk+k^{d}$ (omitting logarithmic factors).

Theorem 2.2.

For any pair of sequences $b\to\infty$ and $k\to\infty$ satisfying $n/(bk\log(b)+k^{d}\log\left(k^{d}\right))\to\infty$, there exists an estimator $V_{n}\in\widetilde{\mathcal{H}}_{d,b}^{k}$ such that, for all $\varepsilon>0$,

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\widetilde{\mathcal{H}}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0,

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Allowing $b$ to grow as aggressively as possible, we achieve consistent estimation using either the multi-view or Tucker histograms so long as $n\sim b\log b$, regardless of dimensionality.

Corollary 2.3.

For all $d,b,k$ fix $\mathcal{R}_{d,b}^{k}$ to be either $\mathcal{H}_{d,b}^{k}$ or $\widetilde{\mathcal{H}}_{d,b}^{k}$. For any sequence $b\to\infty$ with $n/\left(b\log b\right)\to\infty$, there exists a sequence $k\to\infty$ and an estimator $V_{n}\in\mathcal{R}_{d,b}^{k}$ such that, for all $\varepsilon>0$,

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{R}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0,

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Substituting $b=1/h$ yields the rates mentioned in Section 1. The following result shows that the bias of the estimators in Theorems 2.1 and 2.2 goes to zero for all densities. Thus these estimators are universally consistent even when the NNTF model assumption is not satisfied.

Lemma 2.8.

Let $p\in\mathcal{D}_{d}$. If $k\to\infty$ and $b\to\infty$ then $\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|\to 0$.

A straightforward consequence of this is that the Tucker histogram bias also goes to zero.

Lemma 2.9.

Let $p\in\mathcal{D}_{d}$. If $k\to\infty$ and $b\to\infty$ then $\min_{q\in\widetilde{\mathcal{H}}_{d,b}^{k}}\left\|p-q\right\|\to 0$.

Finally we have that the rate on $bk$ in Theorem 2.1 cannot be made significantly faster.

Theorem 2.3.

Let $d\geq 2$, $b\to\infty$, and $k\to\infty$ with $b\geq k$ and $n/\left(bk\right)\to 0$. There exists no estimator $V_{n}\in\mathcal{H}_{d,b}^{k}$ such that, for all $\varepsilon>0$, the following limit holds:

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Proof sketch.

We use $V_{n}$ to construct an estimator over $\Delta_{b}$ and show that such an estimator is impossible using a result in [13]. ∎

Likewise, the rate on $bk+k^{d}$ in Theorem 2.2 also cannot be significantly improved.

Theorem 2.4.

Let $d\geq 2$, $b\to\infty$, and $k\to\infty$ with $b\geq k$ and $n/\left(bk+k^{d}\right)\to 0$. There exists no estimator $V_{n}\in\widetilde{\mathcal{H}}_{d,b}^{k}$ such that, for all $\varepsilon>0$, the following limit holds:

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\widetilde{\mathcal{H}}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

2.4 Discussion

Naturally, real world data likely never exactly satisfies the NNTF model assumption. Our results are meant to highlight a trade-off between model assumptions of smoothness (low $b$) and simple dependence between features (low $k$). Here we explore this trade-off for the multi-view histogram. Letting $k=b^{d}$ gives $\mathcal{H}_{d,b}^{k}=\mathcal{H}_{d,b}$, since we can allocate one component to each bin. Using the estimator from Theorem 2.1 then gives a sample complexity of approximately $b^{d+1}$. Thus setting $k=b^{d}$ in the multi-view histogram gives us something which behaves similarly to the standard histogram estimator, with a similar sample complexity. On the other hand, setting $k=1$ gives a naive Bayes model with a sample complexity of approximately $b$. The Tucker histogram can be analyzed similarly, with $k=b$ corresponding to the standard histogram. Thus we have a range of $k$ yielding different estimators, with maximal $k$ corresponding to the standard histogram and minimal $k$ corresponding to a naive Bayes assumption. We observe in Section 3 that this trade-off is useful in practice: we virtually never want $k$ to be maximized.
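As a rough numerical illustration of this trade-off (our own arithmetic, ignoring logarithmic factors and constants), take $d=5$, $b=10$, and $k=3$; the sample complexities suggested by the rates above are then approximately

n_{\text{standard}}\sim b^{d}=10^{5},\qquad n_{\text{multi-view}}\sim bk=30,\qquad n_{\text{Tucker}}\sim bk+k^{d}=30+3^{5}=273,

so both NNTF histograms require far fewer samples than the standard histogram, at the price of the low-rank dependence assumption.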

3 Experiments

While Theorems 2.1 and 2.2 guarantee the existence of estimators which can effectively estimate NNTF models, these estimators are unfortunately computationally intractable. Fortunately there exist estimators which can be adapted to our problem setting, though they lack the theoretical guarantees of the algorithm described in Theorems 2.1 and 2.2. Specifically, we will utilize an existing algorithm for nonnegative tensor decompositions. Due to the difficulties of estimating $L^{1}$ distances between densities we will instead focus on estimates minimizing the $L^{2}$ norm. In this section inner products and norms will be $L^{2}$ for functions and $\ell^{2}$ for tensors, i.e. the standard Euclidean norm or inner product applied to flattened tensors. We will again restrict our analysis to densities supported on $\left[0,1\right)^{d}$.

Consider the problem of finding some density estimator $\hat{p}$ with minimal $L^{2}$ distance to an unknown density $p$. This is equivalent to minimizing the squared $L^{2}$ loss:

\int_{\left[0,1\right]^{d}}\left(p(x)-\hat{p}\left(x\right)\right)^{2}dx (7)
\quad=\int_{\left[0,1\right]^{d}}\hat{p}\left(x\right)^{2}dx-2\int_{\left[0,1\right]^{d}}p(y)\hat{p}(y)dy+\int_{\left[0,1\right]^{d}}p(z)^{2}dz. (8)

Because the right term in (8) does not depend on $\hat{p}$ it can be ignored when finding the optimal $\hat{p}$. The left term in (8) is known. The middle term in (8) can be estimated with the following approximation

\int_{\left[0,1\right]^{d}}p(x)\hat{p}(x)dx=\mathbb{E}_{X\sim p}\left[\hat{p}(X)\right]\approx\frac{1}{n}\sum_{i=1}^{n}\hat{p}\left(X_{i}\right)

where $\mathcal{X}=X_{1},\ldots,X_{n}\overset{iid}{\sim}p$. We can use this to find a good estimate for $p$ in $\mathcal{R}^{k}_{d,b}$, which represents either $\mathcal{H}^{k}_{d,b}$ or $\widetilde{\mathcal{H}}^{k}_{d,b}$:

\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\int_{\left[0,1\right]^{d}}\left(\hat{H}\left(x\right)-p\left(x\right)\right)^{2}dx
=\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left<\hat{H},\hat{H}\right>-2\int_{\left[0,1\right]^{d}}\hat{H}(x)p(x)dx (9)
\approx\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left<\hat{H},\hat{H}\right>-\frac{2}{n}\sum_{i=1}^{n}\hat{H}(X_{i}). (10)

Recall that the standard histogram estimator is $H\left(\mathcal{X}\right)=\frac{1}{n}\sum_{i=1}^{n}\sum_{A\in\left[b\right]^{d}}h_{d,b,A}\mathbbm{1}\left(X_{i}\in\Lambda_{d,b,A}\right)$ and let $\hat{H}=\sum_{A\in\left[b\right]^{d}}\hat{w}_{A}h_{d,b,A}$. We have the following:

\left<\hat{H},H\right> =\left<\sum_{A\in\left[b\right]^{d}}\hat{w}_{A}h_{d,b,A},\ \frac{1}{n}\sum_{i=1}^{n}\sum_{B\in\left[b\right]^{d}}h_{d,b,B}\mathbbm{1}\left(X_{i}\in\Lambda_{d,b,B}\right)\right>
=\frac{1}{n}\sum_{i=1}^{n}\sum_{A\in\left[b\right]^{d}}\hat{w}_{A}\mathbbm{1}\left(X_{i}\in\Lambda_{d,b,A}\right)b^{d}=\frac{1}{n}\sum_{i=1}^{n}\hat{H}(X_{i}).

As a consequence (10) is equal to

\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left<\hat{H},\hat{H}\right>-2\left<H,\hat{H}\right> =\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left<\hat{H},\hat{H}\right>-2\left<H,\hat{H}\right>+\left<H,H\right>
=\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left\|H-\hat{H}\right\|_{2}^{2}.

Using the $U_{d,b}$ operator we can reformulate this into a tensor factorization problem:

\min_{\hat{T}\in\mathcal{Q}_{d,b}^{k}}\left\|H-U_{d,b}(\hat{T})\right\|_{2}^{2}=\min_{\hat{T}\in\mathcal{Q}_{d,b}^{k}}b^{d}\left\|U_{d,b}^{-1}\left(H\right)-\hat{T}\right\|_{2}^{2},

where $\mathcal{Q}_{d,b}^{k}$ denotes either $\mathcal{T}_{d,b}^{k}$ or $\widetilde{\mathcal{T}}_{d,b}^{k}$. Because of this equivalence, to find estimates in $\mathcal{H}_{d,b}^{k}$ or $\widetilde{\mathcal{H}}_{d,b}^{k}$ we can simply use nonnegative tensor decomposition algorithms, which minimize $\ell^{2}$ loss, to find NNTF tensors which approximate $H$.
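As a concrete illustration of the objective in (10), the following NumPy sketch (ours, not the authors' code; Section 3.1 uses this expression as its performance measure) evaluates $\left<\hat{H},\hat{H}\right>-\frac{2}{n}\sum_{i}\hat{H}(X_{i})$ for a histogram represented by its bin-weight tensor.

import numpy as np

def l2_criterion(T_hat, X):
    # <H_hat, H_hat> - (2/n) * sum_i H_hat(X_i) for the histogram with bin-weight
    # tensor T_hat (shape (b,)*d), evaluated on samples X in [0,1)^d.
    b, d = T_hat.shape[0], T_hat.ndim
    quad = (b ** d) * np.sum(T_hat ** 2)            # <H_hat, H_hat>; each bin has volume b^{-d}
    idx = np.minimum((X * b).astype(int), b - 1)    # bin index of each sample
    vals = (b ** d) * T_hat[tuple(idx.T)]           # H_hat evaluated at each sample
    return quad - 2.0 * vals.mean()

rng = np.random.default_rng(0)
X_test = rng.random((500, 3))
T_uniform = np.full((4, 4, 4), 1.0 / 64)
print(l2_criterion(T_uniform, X_test))              # exactly -1.0 for the uniform density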

Table 1: Experimental Results
Dataset   Red. Dim.  d  Hist. Perf.    Tucker Perf.   Hist. Bins    Tucker Bins   Tucker r      p-val.
MNIST     PCA        2  -1.455±0.089   -1.502±0.102   6.531±1.499   8.375±1.780   4.968±1.976   5e-4
MNIST     PCA        3  -2.040±0.196   -2.268±0.195   4.781±0.738   6.718±1.565   5.781±1.340   2e-4
MNIST     PCA        4  -3.532±0.996   -4.014±0.655   4.031±0.585   5.343±1.018   4.375±0.695   2e-3
MNIST     PCA        5  -4.673±1.026   -6.157±2.924   3.468±0.499   4.343±0.592   3.281±0.514   4e-5
MNIST     Rand.      2  -2.034±0.100   -2.099±0.102   6.062±1.197   7.562±1.657   2.062±1.784   3e-5
MNIST     Rand.      3  -3.086±0.207   -3.331±0.387   4.812±0.526   6.843±1.227   2.687±1.959   1e-4
MNIST     Rand.      4  -4.307±0.290   -5.731±0.435   3.500±0.559   5.656±0.642   2.593±1.497   8e-7
MNIST     Rand.      5  -6.327±0.522   -9.539±1.053   3.250±0.433   4.718±0.571   2.562±1.087   8e-7
Diabetes  PCA        2  -2.079±0.122   -2.212±0.132   5.718±1.304   7.468±1.478   1.062±0.242   8e-6
Diabetes  PCA        3  -3.010±0.364   -3.606±0.420   3.593±0.860   7.062±1.058   1.843±1.543   2e-6
Diabetes  PCA        4  -4.002±0.415   -4.423±0.701   3.000±0.000   5.906±0.804   2.343±1.107   2e-3
Diabetes  PCA        5  -6.139±0.661   -6.043±1.192   3.000±0.000   3.750±0.968   1.843±0.617   0.91
Diabetes  Rand.      2  -3.074±0.224   -3.277±0.287   6.843±1.227   9.250±1.936   1.093±0.384   7e-5
Diabetes  Rand.      3  -4.726±0.483   -5.353±0.751   4.968±0.769   8.406±1.343   1.625±1.672   2e-5
Diabetes  Rand.      4  -6.017±0.873   -7.732±1.497   4.062±0.704   6.718±1.328   2.093±1.155   1e-5
Diabetes  Rand.      5  -8.986±1.292   -12.61±2.477   3.062±0.242   5.093±0.521   2.531±0.865   2e-6

For our experiments we used the TensorLy library [18] to perform the nonnegative Tucker decomposition [17] with Tucker rank $[k,k,\ldots,k]$, which was then projected onto the simplex of probability tensors using [11]. We also performed experiments with nonnegative PARAFAC decompositions using [28, 18]. These decompositions performed poorly. This is potentially because the PARAFAC optimization is more difficult, or because the additional flexibility of the Tucker decomposition was more appropriate for the experimental datasets.
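A minimal sketch of this fitting procedure might look as follows; it assumes TensorLy's non_negative_tucker and multi_mode_dot behave as shown (exact signatures vary across TensorLy versions) and replaces the simplex projection of [11] with a crude clip-and-renormalize step, so it approximates rather than reproduces the experimental pipeline.

import numpy as np
from tensorly.decomposition import non_negative_tucker
from tensorly.tenalg import multi_mode_dot

def fit_tucker_histogram(X, b, k, n_iter_max=200):
    # Bin-weight tensor of the standard histogram on [0,1)^d.
    n, d = X.shape
    counts, _ = np.histogramdd(X, bins=[b] * d, range=[(0.0, 1.0)] * d)
    T = counts / n
    # Nonnegative Tucker decomposition with Tucker rank [k, ..., k].
    core, factors = non_negative_tucker(T, rank=[k] * d, n_iter_max=n_iter_max)
    # Reconstruct the low-rank tensor; clip and renormalize in place of the projection of [11].
    T_hat = multi_mode_dot(core, factors)
    T_hat = np.clip(T_hat, 0.0, None)
    return T_hat / T_hat.sum()

rng = np.random.default_rng(0)
T_hat = fit_tucker_histogram(rng.random((200, 3)), b=8, k=3)
print(T_hat.shape, T_hat.sum())                     # (8, 8, 8) and 1.0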

3.1 Experimental Setup and Results

Our experiments were performed on the Scikit-learn “toy” datasets MNIST and Diabetes [23], with labels removed. We used the expression inside the minimization in (10) to evaluate performance. Our experiments considered estimating histograms in $d=2,3,4,5$ dimensional space. We consider two forms of dimensionality reduction. First we consider projecting the dataset onto its top $d$ principal components. As an alternative we consider projecting our dataset onto a random subspace of dimension $d$. We have constructed our random subspace dimensionality reduction so that each additional dimension adds a new index without affecting the others. For each dataset we randomly select an orthonormal basis $v_{1},v_{2},\ldots$ that remains unchanged for all experiments. To transform a point $X$ to dimension $d$ we perform the following transform:

X_{\text{reduced dim.}}=\left[v_{1}\cdots v_{d}\right]^{T}X.

We consider both transforms since PCA may select dimensions where the features tend to be independent, as is the case when the distribution is a multivariate Gaussian. After dimensionality reduction we scale and translate the data to fit in the unit cube.
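A sketch of the random-subspace reduction (our own illustration) is given below; drawing the orthonormal basis once via a QR decomposition ensures that increasing $d$ only appends coordinates without affecting the existing ones, as described above.

import numpy as np

def random_projection_basis(D, d_max, seed=0):
    # Fixed random orthonormal basis v_1, ..., v_{d_max} via QR of a Gaussian matrix.
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((D, d_max)))
    return Q                                          # columns are orthonormal

def reduce_dim(X, Q, d):
    # X_reduced = [v_1 ... v_d]^T X for every row of X.
    return X @ Q[:, :d]

X = np.random.default_rng(1).random((100, 64))        # placeholder for the flattened images
Q = random_projection_basis(D=64, d_max=5)
X2 = reduce_dim(X, Q, d=2)                            # first two coordinates of the 5-D reduction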

With our preprocessed dataset, each experiment consisted of randomly selecting 200 samples for training and using the rest to evaluate performance. For the estimators we tested all combinations using 1 to $b_{\text{max}}$ bins per dimension and $k$ from 1 to $k_{\text{max}}$. As $d$ increased the best cross-validated $b$ and $k$ values decreased, so we reduced $b_{\text{max}}$ and $k_{\text{max}}$ for larger $d$ to reduce computational time, while still leaving a sizable gap between the best cross-validated $b$ and $k$ across all runs of all experiments. For $d=2,3$ we have $b_{\text{max}}=15$ and $k_{\text{max}}=10$; for $d=4$ we have $b_{\text{max}}=12$ and $k_{\text{max}}=8$; for $d=5$ we have $b_{\text{max}}=8$ and $k_{\text{max}}=6$. For parameter fitting we used random subset cross validation repeated 80 times, using 40 of the 200 samples to cross validate. Performing 80 folds of cross validation was necessary because of the variance of the histogram-estimated loss. This high variance is likely due to the noncontinuous nature of the histogram estimator itself and the noncontinuity of the histogram as a function of the data, i.e. slightly moving one training sample can potentially change the histogram bin in which it lies. Each experiment was run 32 times and we report the mean and standard deviation of estimator performance as well as the best parameters found via cross validation. We additionally apply the Wilcoxon signed rank test to the 32 pairs of performance results to statistically determine whether the mean performance of the standard histogram and our algorithm differ, and report the corresponding $p$-value. Our results are in Table 1, where the Tucker histogram dominates. We also observe that the Tucker histogram can estimate more bins per dimension than the standard histogram and is able to estimate more bins per dimension when the number of components is lower. This corroborates the number of components versus bins per dimension trade-off from Section 2.4.
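For reference, the paired significance test can be run as follows (illustrative numbers only, not the paper's results); we assume SciPy's wilcoxon for the two-sided signed-rank test.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
hist_perf = rng.normal(-2.0, 0.1, size=32)                         # 32 paired runs (synthetic)
tucker_perf = hist_perf - np.abs(rng.normal(0.05, 0.02, size=32))  # Tucker slightly better here

stat, p_value = wilcoxon(hist_perf, tucker_perf)                   # two-sided signed-rank test
print(p_value)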

4 Conclusion

Through analysis of the histogram estimator, we have theoretically and empirically demonstrated that NNTF models can be used to improve nonparametric density estimation. Though the histogram estimator is not a particularly popular estimator, we hope that the ideas presented here can be adapted to improve other techniques in nonparametric statistics.

Acknowledgements

This work was supported by the Berlin Institute for the Foundations of Learning and Data (BIFOLD) sponsored by the German Federal Ministry of Education and Research (BMBF).

References

  • Allman et al. [2009] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Ann. Statist., 37(6A):3099–3132, 12 2009. doi: 10.1214/09-AOS689. URL http://dx.doi.org/10.1214/09-AOS689.
  • Anandkumar et al. [2014] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014. URL http://jmlr.org/papers/v15/anandkumar14b.html.
  • Arora et al. [2012] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, STOC ’12, pages 145–162, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1245-5. doi: 10.1145/2213977.2213994. URL http://doi.acm.org/10.1145/2213977.2213994.
  • Ashtiani et al. [2018] H. Ashtiani, S. Ben-David, N. Harvey, C. Liaw, A. Mehrabian, and Y. Plan. Nearly tight sample complexity bounds for learning mixtures of gaussians via sample compression schemes. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3412–3421. Curran Associates, Inc., 2018.
  • Bengio [2013] Y. Bengio. Deep learning of representations: Looking forward. In A.-H. Dediu, C. Martín-Vide, R. Mitkov, and B. Truthe, editors, Statistical Language and Speech Processing, pages 1–37, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-39593-2.
  • Cayton [2005] L. Cayton. Algorithms for manifold learning. Technical report, 2005.
  • Chen et al. [2017] J. Chen, M. Stern, M. J. Wainwright, and M. I. Jordan. Kernel feature selection via conditional covariance minimization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6946–6955. Curran Associates, Inc., 2017.
  • Devroye and Lugosi [2001] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, New York, 2001.
  • Donoho and Stodden [2004] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 1141–1148. MIT Press, 2004.
  • Donoho [2006] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006. ISSN 0018-9448. doi: 10.1109/TIT.2006.871582.
  • Duchi et al. [2008] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the $\ell_{1}$-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
  • Györfi et al. [1985] L. Györfi, L. Devroye, and L. Gyorfi. Nonparametric density estimation: the L1 view. John Wiley & Sons, New York; Chichester, 1985.
  • Han et al. [2014] Y. Han, J. Jiao, and T. Weissman. Minimax estimation of discrete distributions under $\ell_{1}$ loss. CoRR, abs/1411.1467, 2014. URL http://arxiv.org/abs/1411.1467.
  • Jiang [2017] H. Jiang. Uniform convergence rates for kernel density estimation. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1694–1703, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/jiang17b.html.
  • Kim and Scott [2012] J. Kim and C. Scott. Robust kernel density estimation. J. Machine Learning Res., 13:2529–2565, 2012.
  • Kim et al. [2018] J. Kim, J. Shin, A. Rinaldo, and L. Wasserman. Uniform Convergence Rate of the Kernel Density Estimator Adaptive to Intrinsic Volume Dimension. arXiv e-prints, art. arXiv:1810.05935, Oct 2018.
  • Kim and Choi [2007] Y.-D. Kim and S. Choi. Nonnegative tucker decomposition. 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
  • Kossaifi et al. [2016] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic. TensorLy: Tensor Learning in Python. arXiv e-prints, art. arXiv:1610.09555, Oct 2016.
  • Liu et al. [2011] H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman. Forest density estimation. J. Mach. Learn. Res., 12:907–951, July 2011. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1953048.2021032.
  • Mack and Rosenblatt [1979] Y. P. Mack and M. Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of Multivariate Analysis, 9(1):1–15, March 1979.
  • Negahban and Wainwright [2011] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist., 39(2):1069–1097, 04 2011. doi: 10.1214/10-AOS850. URL https://doi.org/10.1214/10-AOS850.
  • Negahban and Wainwright [2012] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res., 13:1665–1697, May 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2188385.2343697.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Recht et al. [2010] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, Aug. 2010. ISSN 0036-1445. doi: 10.1137/070697835. URL http://dx.doi.org/10.1137/070697835.
  • Reiss [1989] R. Reiss. Approximate distributions of order statistics: with applications to nonparametric statistics. Springer series in statistics. Springer, 1989. ISBN 9783540968511. URL https://books.google.de/books?id=DxzvAAAAMAAJ.
  • Schein et al. [2016] A. Schein, M. Zhou, D. Blei, and H. Wallach. Bayesian poisson tucker decomposition for learning the structure of international relations. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2810–2819, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/schein16.html.
  • Schuster [1985] E. F. Schuster. Incorporating support constraints into nonparametric estimators of densities. Communications in Statistics - Theory and Methods, 14(5):1123–1136, 1985. doi: 10.1080/03610928508828965. URL https://doi.org/10.1080/03610928508828965.
  • Shashua and Hazan [2005] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 792–799, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351.1102451. URL http://doi.acm.org/10.1145/1102351.1102451.
  • Silverman [1986] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
  • Song et al. [2012] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. J. Mach. Learn. Res., 13(1):1393–1434, May 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2503308.2343691.
  • Song et al. [2014] L. Song, A. Anandkumar, B. Dai, and B. Xie. Nonparametric estimation of multi-view latent variable models. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 640–648, 2014. URL http://jmlr.org/proceedings/papers/v32/songa14.html.
  • Terrell and Scott [1992] G. R. Terrell and D. W. Scott. Variable kernel density estimation. Ann. Statist., 20(3):1236–1265, 09 1992. doi: 10.1214/aos/1176348768. URL https://doi.org/10.1214/aos/1176348768.
  • Tsybakov [2008] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
  • Vandermeulen and Scott [2013] R. Vandermeulen and C. Scott. Consistency of robust kernel density estimators. COLT, 30:568–591, 2013.
  • Vandermeulen and Scott [2014] R. A. Vandermeulen and C. Scott. Robust kernel density estimation by scaling and projection in hilbert space. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 433–441. Curran Associates, Inc., 2014.

Appendix A Proofs Omitted from Main Text

All norms are either the $\ell^{1}$, $L^{1}$, or total variation norm, which are equivalent with respect to our analysis, and the proper norm will be clear from context.

Proof of Lemma 2.1.

In Section 7.4 of [8], the authors show that for any collection of measures $\mu_{1},\ldots,\mu_{b}$ and for all $\varepsilon>0$,

N\left(\operatorname{Conv}\left(\left\{\mu_{1},\ldots,\mu_{b}\right\}\right),\varepsilon\right)\leq\left(b+\frac{b}{\varepsilon}\right)^{b}.

With the additional assumption that $\varepsilon\leq 1$ we have that $b+\frac{b}{\varepsilon}\leq\frac{b}{\varepsilon}+\frac{b}{\varepsilon}=\frac{2b}{\varepsilon}$ and thus

N\left(\operatorname{Conv}\left(\left\{\mu_{1},\ldots,\mu_{b}\right\}\right),\varepsilon\right)\leq\left(\frac{2b}{\varepsilon}\right)^{b}.

If we let $\mu_{i}=\mathbf{e}_{i}$, the indicator vector at index $i$, then the lemma follows. ∎

Proof of Lemma 2.2.

From Lemma 2.1 we know there exists a finite collection of probability vectors $\widetilde{\mathcal{P}}$ such that $\widetilde{\mathcal{P}}$ is an $\varepsilon/d$-covering of $\Delta_{b}$ and $\left|\widetilde{\mathcal{P}}\right|\leq\left(\frac{2bd}{\varepsilon}\right)^{b}$. Note that the set $\left\{\widetilde{p}_{1}\otimes\dots\otimes\widetilde{p}_{d}:\widetilde{p}_{i}\in\widetilde{\mathcal{P}}\right\}$ contains at most $\left(\left(\frac{2bd}{\varepsilon}\right)^{b}\right)^{d}=\left(\frac{2bd}{\varepsilon}\right)^{bd}$ elements. We will now show that this set is an $\varepsilon$-cover of $\mathcal{T}_{d,b}^{1}$. Let $p_{1}\otimes\cdots\otimes p_{d}\in\mathcal{T}_{d,b}^{1}$ be arbitrary. From our construction of $\widetilde{\mathcal{P}}$ there exist elements $\widetilde{p}_{1},\ldots,\widetilde{p}_{d}\in\widetilde{\mathcal{P}}$ such that $\left\|p_{i}-\widetilde{p}_{i}\right\|\leq\frac{\varepsilon}{d}$.

We will now make use of Lemma 3.3.7 in [25], which states that, for any collections of probability vectors $q_{1},\ldots,q_{d}$ and $\widetilde{q}_{1},\ldots,\widetilde{q}_{d}$, the following holds:

\left\|\prod_{i=1}^{d}q_{i}-\prod_{j=1}^{d}\widetilde{q}_{j}\right\|\leq\sum_{i=1}^{d}\left\|q_{i}-\widetilde{q}_{i}\right\|.

From this it follows that

\left\|\prod_{i=1}^{d}p_{i}-\prod_{j=1}^{d}\widetilde{p}_{j}\right\|\leq\sum_{i=1}^{d}\left\|p_{i}-\widetilde{p}_{i}\right\|\leq d\,\frac{\varepsilon}{d}=\varepsilon,

thus completing our proof. ∎

Proof of Lemma 2.3.

Let $\widetilde{\mathcal{P}}$ be a finite collection of probability measures with $|\widetilde{\mathcal{P}}|=N\left(\mathcal{P},\varepsilon\right)$ which $\varepsilon$-covers $\mathcal{P}$. Similarly let $W\subset\Delta_{k}$ with $|W|=N\left(\Delta_{k},\delta\right)$ be a $\delta$-cover of $\Delta_{k}$. Consider the set

\Omega=\left\{\sum_{i=1}^{k}\widetilde{w}_{i}\widetilde{p}_{i}\,\middle|\,\widetilde{w}\in W,\ \widetilde{p}_{i}\in\widetilde{\mathcal{P}}\right\}.

Note that this set contains at most $N\left(\mathcal{P},\varepsilon\right)^{k}N\left(\Delta_{k},\delta\right)$ elements. We will now show that it $\left(\delta+\varepsilon\right)$-covers $k\operatorname{-mix}\left(\mathcal{P}\right)$, which completes the proof. Let $\sum_{i=1}^{k}p_{i}w_{i}\in k\operatorname{-mix}\left(\mathcal{P}\right)$. We know there exist elements $\widetilde{p}_{1},\ldots,\widetilde{p}_{k}\in\widetilde{\mathcal{P}}$ such that $\left\|\widetilde{p}_{i}-p_{i}\right\|\leq\varepsilon$ and $\widetilde{w}\in W$ such that $\left\|w-\widetilde{w}\right\|\leq\delta$, and thus $\sum_{i=1}^{k}\widetilde{p}_{i}\widetilde{w}_{i}\in\Omega$. Now observe that

\left\|\sum_{i=1}^{k}\widetilde{p}_{i}\widetilde{w}_{i}-\sum_{j=1}^{k}p_{j}w_{j}\right\| =\left\|\sum_{i=1}^{k}\widetilde{p}_{i}\widetilde{w}_{i}-\sum_{j=1}^{k}p_{j}\widetilde{w}_{j}+\sum_{l=1}^{k}p_{l}\widetilde{w}_{l}-\sum_{r=1}^{k}p_{r}w_{r}\right\|
\leq\left\|\sum_{i=1}^{k}\left(\widetilde{p}_{i}-p_{i}\right)\widetilde{w}_{i}\right\|+\left\|\sum_{i=1}^{k}p_{i}\left(\widetilde{w}_{i}-w_{i}\right)\right\|
\leq\sum_{i=1}^{k}\widetilde{w}_{i}\left\|\widetilde{p}_{i}-p_{i}\right\|+\sum_{i=1}^{k}\left|\widetilde{w}_{i}-w_{i}\right|
\leq\sum_{i=1}^{k}\widetilde{w}_{i}\varepsilon+\left\|\widetilde{w}-w\right\|
\leq\varepsilon+\delta.

Proof of Lemma 2.4.

Note that $\mathcal{T}_{d,b}^{k}=k\operatorname{-mix}\left(\mathcal{T}_{d,b}^{1}\right)$. Applying Lemma 2.3 followed by Lemmas 2.1 and 2.2 we have that

N\left(\mathcal{T}_{d,b}^{k},\varepsilon\right)\leq N\left(\mathcal{T}_{d,b}^{1},\varepsilon/2\right)^{k}N\left(\Delta_{k},\varepsilon/2\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k}{\varepsilon}\right)^{k}.

Proof of Lemma 2.5.

Fix $k,d,b$ and $0<\varepsilon\leq 1$. We are going to construct an $\varepsilon$-cover of $\widetilde{\mathcal{T}}_{d,b}^{k}$. From Lemma 2.1 we know that there exists a set $\mathcal{B}\subset\Delta_{b}$ which $\left(\frac{\varepsilon}{2d}\right)$-covers $\Delta_{b}$ and contains no more than $\left(\frac{4bd}{\varepsilon}\right)^{b}$ elements. Let $\mathcal{P}$ be the collection of all $d\times k$ arrays whose entries are elements of $\mathcal{B}$. So we have that

\left|\mathcal{P}\right|=\left|\mathcal{B}\right|^{dk}\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}.

From Lemma 2.1 there exists a set $\mathcal{W}$ which is an $\varepsilon/2$-cover of $\mathcal{T}_{d,k}$ and contains no more than $\left(4k^{d}/\varepsilon\right)^{k^{d}}$ elements. Now let

\mathcal{L}_{d,b}^{k}=\left\{\sum_{S\in[k]^{d}}\widetilde{W}_{S}\prod_{i=1}^{d}\widetilde{p}_{i,S_{i}}\,\middle|\,\widetilde{W}\in\mathcal{W},\ \widetilde{p}\in\mathcal{P}\right\}.

Note that

|d,bk||𝒲||𝒫|(4kdε)kd(4bdε)bdk.\displaystyle\left|\mathcal{L}_{d,b}^{k}\right|\leq\left|\mathcal{W}\right|\left|\mathcal{P}\right|\leq\left(\frac{4k^{d}}{\varepsilon}\right)^{k^{d}}\left(\frac{4bd}{\varepsilon}\right)^{bdk}.

We will now show that d,bk\mathcal{L}_{d,b}^{k} is an ε\varepsilon-cover of 𝒯~d,bk\widetilde{\mathcal{T}}_{d,b}^{k}. To this end let S[k]dWSi=1dpi,Si𝒯~d,bk\sum_{S\in\left[k\right]^{d}}W_{S}\prod_{i=1}^{d}p_{i,S_{i}}\in\widetilde{\mathcal{T}}_{d,b}^{k} be arbitrary, where W𝒯d,kW\in\mathcal{T}_{d,k} and pi,jΔbp_{i,j}\in\Delta_{b}. From our construction of 𝒲\mathcal{W}, there exists W~𝒲\widetilde{W}\in\mathcal{W} such that WW~ε/2\left\|W-\widetilde{W}\right\|\leq\varepsilon/2. There also exists p~𝒫\widetilde{p}\in\mathcal{P} such that p~i,jpi,jε/(2d)\left\|\widetilde{p}_{i,j}-p_{i,j}\right\|\leq\varepsilon/(2d) for all i,ji,j. Therefore we have that

S[k]dW~Si=1dp~i,Sid,bk.\displaystyle\sum_{S\in\left[k\right]^{d}}\widetilde{W}_{S}\prod_{i=1}^{d}\widetilde{p}_{i,S_{i}}\in\mathcal{L}_{d,b}^{k}.

So finally

S[k]dWSi=1dpi,SiR[k]dW~Rj=1dp~j,Rj\displaystyle\left\|\sum_{S\in\left[k\right]^{d}}W_{S}\prod_{i=1}^{d}p_{i,S_{i}}-\sum_{R\in\left[k\right]^{d}}\widetilde{W}_{R}\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|
S[k]dWSi=1dpi,SiR[k]dWRj=1dp~j,Rj+S[k]dWSi=1dp~i,SiR[k]dW~Rj=1dp~j,Rj\displaystyle\leq\left\|\sum_{S\in\left[k\right]^{d}}W_{S}\prod_{i=1}^{d}p_{i,S_{i}}-\sum_{R\in\left[k\right]^{d}}W_{R}\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|+\left\|\sum_{S\in\left[k\right]^{d}}W_{S}\prod_{i=1}^{d}\widetilde{p}_{i,S_{i}}-\sum_{R\in\left[k\right]^{d}}\widetilde{W}_{R}\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|
S[k]dWSi=1dpi,Sij=1dp~j,Sj+R[k]d|WRW~R|j=1dp~j,Rj\displaystyle\leq\sum_{S\in\left[k\right]^{d}}W_{S}\left\|\prod_{i=1}^{d}p_{i,S_{i}}-\prod_{j=1}^{d}\widetilde{p}_{j,S_{j}}\right\|+\sum_{R\in\left[k\right]^{d}}|W_{R}-\widetilde{W}_{R}|\left\|\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|
\displaystyle\leq\sum_{S\in\left[k\right]^{d}}W_{S}\sum_{i=1}^{d}\left\|p_{i,S_{i}}-\widetilde{p}_{i,S_{i}}\right\|+\left\|W-\widetilde{W}\right\|
\displaystyle\leq\sum_{S\in\left[k\right]^{d}}W_{S}\,d\,\frac{\varepsilon}{2d}+\frac{\varepsilon}{2}
\displaystyle=\varepsilon/2+\varepsilon/2=\varepsilon,

where we have used the product measure inequality \left\|\prod_{i=1}^{d}p_{i,S_{i}}-\prod_{i=1}^{d}\widetilde{p}_{i,S_{i}}\right\|\leq\sum_{i=1}^{d}\left\|p_{i,S_{i}}-\widetilde{p}_{i,S_{i}}\right\|, the fact that \left\|\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|=1, the bounds \left\|\widetilde{p}_{i,j}-p_{i,j}\right\|\leq\varepsilon/(2d) and \left\|W-\widetilde{W}\right\|\leq\varepsilon/2, and the fact that \sum_{S\in[k]^{d}}W_{S}=1. ∎
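This chain of inequalities can also be checked numerically: perturbing the core tensor by at most ε/2 and each p_{i,j} by at most ε/(2d) in ℓ1 keeps the resulting densities within ε. A minimal sketch follows (numpy assumed; the sizes d, k, b and the value of ε are arbitrary assumptions).

    import numpy as np
    from functools import reduce
    from itertools import product

    rng = np.random.default_rng(5)

    def rand_simplex(shape):
        x = rng.exponential(size=shape)
        return x / x.sum()

    def outer(vs):
        # Outer (tensor) product of a list of probability vectors.
        return reduce(lambda a, c: np.multiply.outer(a, c), vs)

    def tucker_density(W, p):
        # sum_{S in [k]^d} W_S prod_i p[i][S_i], represented as a tensor of bin weights.
        k, d, b = W.shape[0], W.ndim, len(p[0][0])
        T = np.zeros((b,) * d)
        for S in product(range(k), repeat=d):
            T += W[S] * outer([p[i][S[i]] for i in range(d)])
        return T

    d, k, b, eps = 3, 3, 6, 0.3
    W = rand_simplex((k,) * d)
    p = [[rand_simplex(b) for _ in range(k)] for _ in range(d)]   # p[i][j] plays the role of p_{i,j}

    # Perturb the core by at most eps/2 and every p_{i,j} by at most eps/(2d) in l1.
    W_t = (1 - eps / 4) * W + (eps / 4) * rand_simplex((k,) * d)
    p_t = [[(1 - eps / (4 * d)) * pij + (eps / (4 * d)) * rand_simplex(b) for pij in pi] for pi in p]

    diff = np.abs(tucker_density(W, p) - tucker_density(W_t, p_t)).sum()
    assert diff <= eps + 1e-12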

Proof of Theorem 2.1.

We will apply the estimator from Lemma 2.7 to a series of δ\delta-covers of d,bk\mathcal{H}_{d,b}^{k}. We begin by constructing a series of δ\delta-covers whose cardinalities do not grow too quickly. Corollary 2.1 states that, for all 0<δ10<\delta\leq 1, N(d,bk,δ)(4bdδ)bdk(4kδ)kN\left(\mathcal{H}_{d,b}^{k},\delta\right)\leq\left(\frac{4bd}{\delta}\right)^{bdk}\left(\frac{4k}{\delta}\right)^{k}. For sufficiently large bb and kk and sufficiently small δ\delta, the following holds

log((4bdδ)bdk(4kδ)k)\displaystyle\log\left(\left(\frac{4bd}{\delta}\right)^{bdk}\left(\frac{4k}{\delta}\right)^{k}\right) =bdklog(4bdδ)+klog(4kδ)\displaystyle=bdk\log\left(\frac{4bd}{\delta}\right)+k\log\left(\frac{4k}{\delta}\right)
=bdk[log(b)+log(4dδ)]+k[log(k)+log(4δ)]\displaystyle=bdk\left[\log\left(b\right)+\log\left(\frac{4d}{\delta}\right)\right]+k\left[\log\left(k\right)+\log\left(\frac{4}{\delta}\right)\right]
bdk[log(b)+log(b)log(4dδ)]+dk[log(k)+log(k)log(4dδ)]\displaystyle\leq bdk\left[\log\left(b\right)+\log\left(b\right)\log\left(\frac{4d}{\delta}\right)\right]+dk\left[\log\left(k\right)+\log\left(k\right)\log\left(\frac{4d}{\delta}\right)\right]
=(bklog(b)+klog(k))d(1+log(4dδ)).\displaystyle=\left(bk\log\left(b\right)+k\log\left(k\right)\right)d\left(1+\log\left(\frac{4d}{\delta}\right)\right). (11)

Using the argument from the proof of Lemma 2.7 we have that, because n/(bklog(b)+klog(k))n/(bk\log(b)+k\log(k))\to\infty, there exists a sequence of positive values C=C(n)C=C(n) such that CC\to\infty and n>C[bklog(b)+klog(k)]n>C\left[bk\log(b)+k\log(k)\right]. If we let δ=4dexp(Cd1)\delta=\frac{4d}{\exp\left(\frac{C}{d}-1\right)}, then δ0\delta\to 0 and

(bklog(b)+klog(k))d(1+log(4dδ))n.\displaystyle\left(bk\log\left(b\right)+k\log\left(k\right)\right)d\left(1+\log\left(\frac{4d}{\delta}\right)\right)\leq n.

Because of this we can construct collections of densities 𝒫~nd,bk\widetilde{\mathcal{P}}_{n}\subset\mathcal{H}_{d,b}^{k} such that 𝒫~n\widetilde{\mathcal{P}}_{n} is a δ\delta-covering of d,bk\mathcal{H}_{d,b}^{k} with |𝒫~n|\left|\widetilde{\mathcal{P}}_{n}\right|\to\infty, n/log|𝒫~n|n/\log\left|\widetilde{\mathcal{P}}_{n}\right|\to\infty and δ0\delta\to 0. Let VnV_{n} be the estimator from Lemma 2.7 applied to the sequence 𝒫~n\widetilde{\mathcal{P}}_{n}.
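The choice of δ above can be verified with a throwaway computation: with δ = 4d/exp(C/d − 1) the factor 1 + log(4d/δ) collapses to C/d, so the left-hand side of (11) equals C(bk log(b) + k log(k)), which is at most n. A sketch follows (numpy assumed; the specific values of b, k, d, and n are arbitrary).

    import numpy as np

    def lhs_11(b, k, d, delta):
        # Left-hand side of (11): (bk log b + k log k) * d * (1 + log(4d / delta)).
        return (b * k * np.log(b) + k * np.log(k)) * d * (1 + np.log(4 * d / delta))

    b, k, d, n = 32, 8, 6, 10**6
    C = 0.99 * n / (b * k * np.log(b) + k * np.log(k))   # any C with n > C * [bk log b + k log k]
    delta = 4 * d / np.exp(C / d - 1)
    assert lhs_11(b, k, d, delta) <= n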

Let ε>0\varepsilon>0 be arbitrary. Due to the way that we have constructed the sequence 𝒫~n\widetilde{\mathcal{P}}_{n}, for sufficiently large nn, we have that 3supqd,bkminq~𝒫~nqq~ε/23\sup_{q\in\mathcal{H}_{d,b}^{k}}\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|q-\widetilde{q}\right\|\leq\varepsilon/2. It therefore follows that, for sufficiently large nn, the following holds for all p𝒟dp\in\mathcal{D}_{d}

3minqd,bkpq+ε\displaystyle 3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon 3minqd,bkpq+3supqd,bkminq~𝒫~nqq~+ε/2\displaystyle\geq 3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+3\sup_{q\in\mathcal{H}_{d,b}^{k}}\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|q-\widetilde{q}\right\|+\varepsilon/2
3minqd,bk[pq+minq~𝒫~nqq~]+ε/2\displaystyle\geq 3\min_{q\in\mathcal{H}_{d,b}^{k}}\left[\left\|p-q\right\|+\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|q-\widetilde{q}\right\|\right]+\varepsilon/2
=3minqd,bkminq~𝒫~n[pq+qq~]+ε/2\displaystyle=3\min_{q\in\mathcal{H}_{d,b}^{k}}\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left[\left\|p-q\right\|+\left\|q-\widetilde{q}\right\|\right]+\varepsilon/2
3minq~𝒫~npq~+ε/2.\displaystyle\geq 3\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|p-\widetilde{q}\right\|+\varepsilon/2.

From this we have that, for sufficiently large nn

supp𝒟dP(Vnp>3minqd,bkpq+ε)supp𝒟dP(Vnp>3minq~𝒫~npq~+ε/2)\displaystyle\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\leq\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|p-\widetilde{q}\right\|+\varepsilon/2\right)

and the right side goes to zero due to Lemma 2.7, thus completing the proof. ∎

Proof of Theorem 2.2.

This proof is very similar to the proof of Theorem 2.1. We will apply the estimator from Lemma 2.7 to a series of δ\delta-covers of ~d,bk\widetilde{\mathcal{H}}_{d,b}^{k}. We begin by constructing a series of δ\delta-covers whose cardinalities do not grow too quickly. Corollary 2.2 states that, for all 0<δ10<\delta\leq 1, N(~d,bk,δ)(4bdδ)bdk(4kdδ)kdN\left(\widetilde{\mathcal{H}}_{d,b}^{k},\delta\right)\leq\left(\frac{4bd}{\delta}\right)^{bdk}\left(\frac{4k^{d}}{\delta}\right)^{k^{d}}. For sufficiently large bb and kk and sufficiently small δ\delta, the following holds

log((4bdδ)bdk(4kdδ)kd)\displaystyle\log\left(\left(\frac{4bd}{\delta}\right)^{bdk}\left(\frac{4k^{d}}{\delta}\right)^{k^{d}}\right) =bdklog(4bdδ)+kdlog(4kdδ)\displaystyle=bdk\log\left(\frac{4bd}{\delta}\right)+k^{d}\log\left(\frac{4k^{d}}{\delta}\right)
d(bklog(4bdδ)+kdlog(4kdδ))\displaystyle\leq d\left(bk\log\left(\frac{4bd}{\delta}\right)+k^{d}\log\left(\frac{4k^{d}}{\delta}\right)\right)
=d(bk(log(b)+log(4dδ))+kd(log(kd)+log(4δ)))\displaystyle=d\left(bk\left(\log(b)+\log\left(\frac{4d}{\delta}\right)\right)+k^{d}\left(\log\left(k^{d}\right)+\log\left(\frac{4}{\delta}\right)\right)\right)
d(bk(log(b)+log(4dδ))+kd(log(kd)+log(4dδ)))\displaystyle\leq d\left(bk\left(\log(b)+\log\left(\frac{4d}{\delta}\right)\right)+k^{d}\left(\log\left(k^{d}\right)+\log\left(\frac{4d}{\delta}\right)\right)\right)
(bklog(b)+kdlog(kd))d(1+log(4dδ)).\displaystyle\leq\left(bk\log(b)+k^{d}\log\left(k^{d}\right)\right)d\left(1+\log\left(\frac{4d}{\delta}\right)\right).

Note that the last line is exactly (11) from our proof of Theorem 2.1 with bklog(b)+klog(k)bk\log\left(b\right)+k\log\left(k\right) replaced by bklog(b)+kdlog(kd)bk\log\left(b\right)+k^{d}\log\left(k^{d}\right). From here we can proceed exactly as in the proof of Theorem 2.1, replacing d,bk\mathcal{H}_{d,b}^{k} with ~d,bk\widetilde{\mathcal{H}}_{d,b}^{k} and bklog(b)+klog(k)bk\log\left(b\right)+k\log\left(k\right) with bklog(b)+kdlog(kd)bk\log\left(b\right)+k^{d}\log\left(k^{d}\right). ∎
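To see how the two sample-complexity conditions compare, one can evaluate the term bk log(b) + k log(k) from Theorem 2.1 against the term bk log(b) + k^d log(k^d) appearing here; the short sketch below is illustration only (the specific values of b, k, and d are arbitrary assumptions).

    import numpy as np

    def lowrank_term(b, k):
        # Complexity term from Theorem 2.1 (nonnegative rank-k histograms).
        return b * k * np.log(b) + k * np.log(k)

    def tucker_term(b, k, d):
        # Complexity term from Theorem 2.2 (nonnegative Tucker histograms with a [k]^d core).
        return b * k * np.log(b) + k ** d * d * np.log(k)

    for d in (2, 4, 8):
        print(d, lowrank_term(64, 8), tucker_term(64, 8, d))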

Proof of Lemma 2.8.

Let ε>0\varepsilon>0. Theorem 5 in Chapter 2 of [12] states that, for any p𝒟dp\in\mathcal{D}_{d}, minhd,bph0\min_{h\in\mathcal{H}_{d,b}}\left\|p-h\right\|\to 0 as bb\to\infty, i.e. the bias of a histogram estimator goes to zero as the number of bins per dimension goes to infinity. Thus there exists a sufficiently large BB and a histogram hd,Bh\in\mathcal{H}_{d,B} which is a good approximation of pp, ph<ε/2\left\|p-h\right\|<\varepsilon/2. In this proof we will argue that, once kBdk\geq B^{d} and bb is sufficiently large, we can find an element of d,bk\mathcal{H}_{d,b}^{k} whose multi-view components approximate the BdB^{d} bins of hh.

We have that

h=A[B]×dwAi=1dh1,B,Ai.\displaystyle h=\sum_{A\in\left[B\right]^{\times d}}w_{A}\prod_{i=1}^{d}h_{1,B,A_{i}}.

From the same theorem there exists a0a_{0} such that, for all aa0a\geq a_{0} and all i[B]i\in[B], there exists h~1,a,i1,a\widetilde{h}_{1,a,i}\in\mathcal{H}_{1,a} such that h1,B,ih~1,a,i<ε/(2d)\left\|h_{1,B,i}-\widetilde{h}_{1,a,i}\right\|<\varepsilon/(2d). For any multi-index A[B]dA\in[B]^{d}, we define

h~d,a,A=j=1dh~1,a,Aj.\displaystyle\widetilde{h}_{d,a,A}=\prod_{j=1}^{d}\widetilde{h}_{1,a,A_{j}}.

Now we have that, for all aa0a\geq a_{0} and A[B]dA\in[B]^{d},

hd,B,Ah~d,a,A\displaystyle\left\|h_{d,B,A}-\widetilde{h}_{d,a,A}\right\| =i=1dh1,B,Aij=1dh~1,a,Aj\displaystyle=\left\|\prod_{i=1}^{d}h_{1,B,A_{i}}-\prod_{j=1}^{d}\widetilde{h}_{1,a,A_{j}}\right\|
i=1dh1,B,Aih~1,a,Ai\displaystyle\leq\sum_{i=1}^{d}\left\|h_{1,B,A_{i}}-\widetilde{h}_{1,a,A_{i}}\right\| (12)
dε2d\displaystyle\leq d\frac{\varepsilon}{2d}
=ε/2,\displaystyle=\varepsilon/2,

where we use the previously mentioned product measure inequality for (12). As soon as kBdk\geq B^{d} and aa0a\geq a_{0} the set d,ak\mathcal{H}_{d,a}^{k} contains the element,

h~A[B]×dwAh~d,a,A.\displaystyle\widetilde{h}\triangleq\sum_{A\in\left[B\right]^{\times d}}w_{A}\widetilde{h}_{d,a,A}.

Now we have that, for all aa0a\geq a_{0},

hh~\displaystyle\left\|h-\widetilde{h}\right\| =A[B]×dwAhd,B,AQ[B]×dwQh~d,a,Q\displaystyle=\left\|\sum_{A\in\left[B\right]^{\times d}}w_{A}h_{d,B,A}-\sum_{Q\in\left[B\right]^{\times d}}w_{Q}\widetilde{h}_{d,a,Q}\right\|
A[B]×dwAhd,B,Ah~d,a,A\displaystyle\leq\sum_{A\in\left[B\right]^{\times d}}w_{A}\left\|h_{d,B,A}-\widetilde{h}_{d,a,A}\right\|
ε/2.\displaystyle\leq\varepsilon/2.

From the triangle inequality we have that

ph~ph+hh~ε.\displaystyle\left\|p-\widetilde{h}\right\|\leq\left\|p-h\right\|+\left\|h-\widetilde{h}\right\|\leq\varepsilon.

So we have that, for sufficiently large bb and kk

minqd,bkpqε\displaystyle\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|\leq\varepsilon

which completes our proof. ∎
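The product measure inequality used in (12), \left\|\prod_{i=1}^{d}p_{i}-\prod_{i=1}^{d}q_{i}\right\|\leq\sum_{i=1}^{d}\left\|p_{i}-q_{i}\right\| for probability vectors, can also be checked numerically on discrete product distributions (the ℓ1 analogue). A minimal sketch, with numpy assumed and arbitrary sizes, follows.

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(1)

    def rand_simplex(m):
        x = rng.exponential(size=m)
        return x / x.sum()

    def product_measure(vs):
        # Outer (tensor) product of a list of probability vectors.
        return reduce(lambda a, c: np.multiply.outer(a, c), vs)

    d, m = 4, 10
    for _ in range(200):
        p = [rand_simplex(m) for _ in range(d)]
        q = [rand_simplex(m) for _ in range(d)]
        lhs = np.abs(product_measure(p) - product_measure(q)).sum()
        rhs = sum(np.abs(pi - qi).sum() for pi, qi in zip(p, q))
        assert lhs <= rhs + 1e-12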

Proof of Lemma 2.9.

We will show that d,bk~d,bk\mathcal{H}_{d,b}^{k}\subset\widetilde{\mathcal{H}}_{d,b}^{k}; the lemma then follows from Lemma 2.8. Any element of d,bk\mathcal{H}_{d,b}^{k} has a representation of the form

i=1kwij=1dfi,j:wΔk,fi,j1,b.\displaystyle\sum_{i=1}^{k}w_{i}\prod_{j=1}^{d}f_{i,j}:w\in\Delta_{k},f_{i,j}\in\mathcal{H}_{1,b}. (13)

Letting W𝒯d,kW\in\mathcal{T}_{d,k} be the tensor with Wi,,i=wiW_{i,\ldots,i}=w_{i} for all ii and all other entries zero, and letting f~j,i=fi,j\widetilde{f}_{j,i}=f_{i,j} for all i,ji,j, we have that

S[k]dWSj=1df~j,Sj\displaystyle\sum_{S\in[k]^{d}}W_{S}\prod_{j=1}^{d}\widetilde{f}_{j,S_{j}} =i=1kWi,,ij=1df~j,i\displaystyle=\sum_{i=1}^{k}W_{i,\ldots,i}\prod_{j=1}^{d}\widetilde{f}_{j,i}
=i=1kwij=1dfi,j\displaystyle=\sum_{i=1}^{k}w_{i}\prod_{j=1}^{d}f_{i,j}

so we have that (13) is an element of ~d,bk\widetilde{\mathcal{H}}_{d,b}^{k} and we are done. ∎
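This embedding is easy to verify numerically: placing the mixture weights on the diagonal of a core tensor and transposing the factor indices reproduces the rank-k mixture exactly. A small sketch (numpy assumed; the sizes are arbitrary assumptions) is below.

    import numpy as np
    from functools import reduce
    from itertools import product

    rng = np.random.default_rng(2)

    def rand_simplex(m):
        x = rng.exponential(size=m)
        return x / x.sum()

    def outer(vs):
        return reduce(lambda a, c: np.multiply.outer(a, c), vs)

    k, d, b = 3, 3, 5
    w = rand_simplex(k)
    f = [[rand_simplex(b) for _ in range(d)] for _ in range(k)]   # f[i][j] plays the role of f_{i,j}

    # Mixture (PARAFAC) form: sum_i w_i prod_j f_{i,j}.
    parafac = sum(w[i] * outer(f[i]) for i in range(k))

    # Tucker form with a diagonal core: W_{i,...,i} = w_i, zero elsewhere, and f_tilde_{j,i} = f_{i,j}.
    W = np.zeros((k,) * d)
    for i in range(k):
        W[(i,) * d] = w[i]
    tucker = sum(W[S] * outer([f[S[j]][j] for j in range(d)])
                 for S in product(range(k), repeat=d))

    assert np.allclose(parafac, tucker)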

Proof of Theorem 2.3.

We will proceed by contradiction. Suppose VnV_{n} is an estimator violating the theorem statement, i.e. there exist sequences bb\to\infty and kk\to\infty with n/(bk)0n/\left(bk\right)\to 0 and bkb\geq k such that, for all ε>0\varepsilon>0,

supp𝒟dP(Vnp>3minqd,bkpq+ε)0.\displaystyle\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0.

Let (pn)n=1\left(p_{n}\right)_{n=1}^{\infty} be a sequence of probability vectors pnΔb(n)×k(n)p_{n}\in\Delta_{b(n)\times k(n)} which represent distributions over [b(n)]×[k(n)]\left[b(n)\right]\times\left[k(n)\right]. Let 𝒳n(Xn,1,,Xn,n)\mathcal{X}_{n}\triangleq\left(X_{n,1},\ldots,X_{n,n}\right) with Xn,1,,Xn,niidpnX_{n,1},\ldots,X_{n,n}\overset{iid}{\sim}p_{n}.

We will now construct a series of estimators for pnp_{n} using VnV_{n}. Let 𝒳~n=(X~n,1,,X~n,n)\widetilde{\mathcal{X}}_{n}=\left(\widetilde{X}_{n,1},\ldots,\widetilde{X}_{n,n}\right) be independent random variables with X~n,ihd,b,(Xn,i,1,,1)\widetilde{X}_{n,i}\sim h_{d,b,\left(X_{n,i},1,\ldots,1\right)}. For this proof we will assume d>2d>2; the proof simplifies in a straightforward manner to the d=2d=2 case by ignoring the indices and modes beyond the second. Note that Xn,iX_{n,i} contains two indices. Now we have the following for the densities of X~n,i\widetilde{X}_{n,i}

pX~n,i\displaystyle p_{\widetilde{X}_{n,i}} =(j,)[b]×[k]pX~n,i|Xn,i=(j,)P(Xn,i=(j,))\displaystyle=\sum_{(j,\ell)\in\left[b\right]\times\left[k\right]}p_{\widetilde{X}_{n,i}|X_{n,i}=\left(j,\ell\right)}P(X_{n,i}=\left(j,\ell\right))
=(j,)[b]×[k]hd,b,(j,,1,,1)pn(j,)\displaystyle=\sum_{(j,\ell)\in\left[b\right]\times\left[k\right]}h_{d,b,\left(j,\ell,1,\ldots,1\right)}p_{n}\left(j,\ell\right)
=[k]j[b]hd,b,(j,,1,,1)pn(j,)\displaystyle=\sum_{\ell\in\left[k\right]}\sum_{j\in\left[b\right]}h_{d,b,\left(j,\ell,1,\ldots,1\right)}p_{n}\left(j,\ell\right)
=[k]j[b]pn(j,)h1,b,jh1,b,a[d2]h1,b,1\displaystyle=\sum_{\ell\in\left[k\right]}\sum_{j\in\left[b\right]}p_{n}\left(j,\ell\right)h_{1,b,j}\otimes h_{1,b,\ell}\otimes\prod_{a\in[d-2]}h_{1,b,1}
=[k](j[b]pn(j,)h1,b,j)h1,b,a[d2]h1,b,1\displaystyle=\sum_{\ell\in\left[k\right]}\left(\sum_{j\in\left[b\right]}p_{n}\left(j,\ell\right)h_{1,b,j}\right)\otimes h_{1,b,\ell}\otimes\prod_{a\in[d-2]}h_{1,b,1} (14)
=[k](q[b]pn(q,))(j[b]pn(j,)q[b]pn(q,)h1,b,j)h1,b,a[d2]h1,b,1.\displaystyle=\sum_{\ell\in\left[k\right]}\left(\sum_{q\in\left[b\right]}p_{n}\left(q,\ell\right)\right)\left(\sum_{j\in\left[b\right]}\frac{p_{n}\left(j,\ell\right)}{\sum_{q\in\left[b\right]}p_{n}\left(q,\ell\right)}h_{1,b,j}\right)\otimes h_{1,b,\ell}\otimes\prod_{a\in[d-2]}h_{1,b,1}. (15)

This last line is in the form of (5) in the main text and is thus an element of d,bk\mathcal{H}_{d,b}^{k}. To see this, we give the correspondence between the terms in (15) here and the terms in (5) in the main text:

w\displaystyle w_{\ell} :=(q[b]pn(q,))\displaystyle:=\left(\sum_{q\in\left[b\right]}p_{n}\left(q,\ell\right)\right)
f,1\displaystyle f_{\ell,1} :=(j[b]pn(j,)q[b]pn(q,)h1,b,j)\displaystyle:=\left(\sum_{j\in\left[b\right]}\frac{p_{n}\left(j,\ell\right)}{\sum_{q\in\left[b\right]}p_{n}\left(q,\ell\right)}h_{1,b,j}\right)
f,2\displaystyle f_{\ell,2} :=h1,b,\displaystyle:=h_{1,b,\ell}
fi,j\displaystyle f_{i,j} :=h1,b,1,j>2,i.\displaystyle:=h_{1,b,1},\forall j>2,\forall i.
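As a concrete check of this correspondence (a sketch, not part of the proof; numpy assumed, sizes arbitrary), the discrete analogue below builds the tensor of bin weights from an arbitrary distribution p_n on [b]×[k] using exactly the w_ℓ and f_{ℓ,1} defined above, with standard basis vectors standing in for the one-dimensional histogram basis densities h_{1,b,i}, and recovers p_n in the slice A=(1,…,1), as in (14).

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(3)
    b, k, d = 6, 4, 3            # k <= b and d >= 2
    p = rng.exponential(size=(b, k))
    p /= p.sum()                 # p_n: a distribution on [b] x [k]

    def e(i, m):
        # Standard basis vector: discrete stand-in for the histogram basis density h_{1,b,i}.
        v = np.zeros(m)
        v[i] = 1.0
        return v

    def outer(vs):
        return reduce(lambda a, c: np.multiply.outer(a, c), vs)

    T = np.zeros((b,) * d)
    for l in range(k):
        w_l = p[:, l].sum()              # w_l    := sum_q p_n(q, l)
        f_l1 = p[:, l] / w_l             # f_{l,1}: the normalized l-th column of p_n
        T += w_l * outer([f_l1, e(l, b)] + [e(0, b)] * (d - 2))

    # The slice at A = (1, ..., 1) (index 0 here) recovers p_n.
    assert np.allclose(T[(slice(None), slice(0, k)) + (0,) * (d - 2)], p)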

Let VnV_{n} estimate P~npX~n,i\widetilde{P}_{n}\triangleq p_{\widetilde{X}_{n,i}} so X~n,1,,X~n,niidP~n\widetilde{X}_{n,1},\ldots,\widetilde{X}_{n,n}\overset{iid}{\sim}\widetilde{P}_{n}. We will use VnV_{n} to construct an estimator vnv_{n} for pnp_{n}.

Because P~nd,bk\widetilde{P}_{n}\in\mathcal{H}_{d,b}^{k} for all nn, we have that VnP~np 0\left\|V_{n}-\widetilde{P}_{n}\right\|\,{\buildrel p\over{\rightarrow}}\,0 and thus Ud,b1(Vn)Ud,b1(P~n)p 0\left\|U_{d,b}^{-1}(V_{n})-U_{d,b}^{-1}(\widetilde{P}_{n})\right\|\,{\buildrel p\over{\rightarrow}}\,0. (We will use this portion of the proof again in our proof of Theorem 2.4.) Note that [Ud,b1(P~n)]j,,A=pn(j,)\left[U_{d,b}^{-1}(\widetilde{P}_{n})\right]_{j,\ell,A}=p_{n}(j,\ell) when A=(1,,1)A=\left(1,\ldots,1\right) and zero otherwise (see (14)). We define the linear operator Bn:𝒯d,bΔb×kB_{n}:\mathcal{T}_{d,b}\to\Delta_{b\times k} as

[Bn(T)]j,A[b]×d2Tj,,A\displaystyle\left[B_{n}(T)\right]_{j,\ell}\triangleq\sum_{A\in\left[b\right]^{\times d-2}}T_{j,\ell,A}

i.e. the linear operator which sums out all modes except for the first two. We have that Bn(Ud,b1(P~n))=pnB_{n}(U_{d,b}^{-1}(\widetilde{P}_{n}))=p_{n}. Now let vn=Bn(Ud,b1(Vn))v_{n}=B_{n}(U_{d,b}^{-1}(V_{n})) be our estimator for pnp_{n}. Then we have that

vnpn=Bn(Ud,b1(P~n))Bn(Ud,b1(Vn))=Bn(Ud,b1(P~nVn)).\left\|v_{n}-p_{n}\right\|=\left\|B_{n}(U_{d,b}^{-1}(\widetilde{P}_{n}))-B_{n}(U_{d,b}^{-1}(V_{n}))\right\|=\left\|B_{n}(U_{d,b}^{-1}(\widetilde{P}_{n}-V_{n}))\right\|.

We have that BnB_{n} is a nonexpansive operator due to the triangle inequality,

Bn(T)=j,l|A[b]×d2Tj,,A|j,lA[b]×d2|Tj,,A|=T,\displaystyle\left\|B_{n}\left(T\right)\right\|=\sum_{j,l}\left|\sum_{A\in\left[b\right]^{\times d-2}}T_{j,\ell,A}\right|\leq\sum_{j,l}\sum_{A\in\left[b\right]^{\times d-2}}\left|T_{j,\ell,A}\right|=\left\|T\right\|,

so the operator norm of BnB_{n} is less than or equal to one. We also know that Ud,b1U^{-1}_{d,b} is an isometry and P~nVnp 0\left\|\widetilde{P}_{n}-V_{n}\right\|\,{\buildrel p\over{\rightarrow}}\,0, so it follows that vnpnp 0\left\|v_{n}-p_{n}\right\|\,{\buildrel p\over{\rightarrow}}\,0 for any sequence of pnΔ[b(n)]×[k(n)]p_{n}\in\Delta_{\left[b(n)\right]\times\left[k(n)\right]}. We will now use the following theorem from [13] to show that no such estimator vnv_{n} can exist.

Theorem A.1 ([13] Theorem 2.).

For any ζ(0,1]\zeta\in\left(0,1\right], we have

infp^suppΔa𝔼pp^p18ea(1+ζ)n𝟙((1+ζ)na>e16)\displaystyle\inf_{\hat{p}}\sup_{p\in\Delta_{a}}\mathbb{E}_{p}\left\|\hat{p}-p\right\|\geq\frac{1}{8}\sqrt{\frac{ea}{\left(1+\zeta\right)n}}\mathbbm{1}\left(\frac{\left(1+\zeta\right)n}{a}>\frac{e}{16}\right)
+exp(2(1+ζ)na)𝟙((1+ζ)nae16)exp(ζ2n24)12exp(ζ2a32(loga)2)\displaystyle\quad+\exp\left(-\frac{2\left(1+\zeta\right)n}{a}\right)\mathbbm{1}\left(\frac{\left(1+\zeta\right)n}{a}\leq\frac{e}{16}\right)-\exp\left(-\frac{\zeta^{2}n}{24}\right)-12\exp\left(-\frac{\zeta^{2}a}{32\left(\log a\right)^{2}}\right)

where the infimum is over all estimators.

Estimating pnp_{n} is equivalent to estimating a categorical distribution with a=bka=bk categories. Letting ζ=1\zeta=1, bkbk\to\infty, and nn\to\infty, with n/(bk)0n/\left(bk\right)\to 0, we get that for sufficiently large nn

infp^suppΔbk𝔼pp^pexp(4nbk)exp(n24)12exp(bk32(logbk)2)\displaystyle\inf_{\hat{p}}\sup_{p\in\Delta_{bk}}\mathbb{E}_{p}\left\|\hat{p}-p\right\|\geq\exp\left(-\frac{4n}{bk}\right)-\exp\left(-\frac{n}{24}\right)-12\exp\left(-\frac{bk}{32\left(\log bk\right)^{2}}\right)

whose right hand side converges to 1. From this we get that

lim infnsuppnΔbk𝔼pnvnpn>12\displaystyle\liminf_{n\to\infty}\sup_{p_{n}\in\Delta_{bk}}\mathbb{E}_{p_{n}}\left\|v_{n}-p_{n}\right\|>\frac{1}{2}

which contradicts vnpnp 0\left\|v_{n}-p_{n}\right\|\,{\buildrel p\over{\rightarrow}}\,0 for arbitrary sequences pnp_{n}. ∎
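For intuition (illustration only; the scaling bk = n^2 is an arbitrary assumption satisfying n/(bk) → 0), the right-hand side of the specialized lower bound above can be evaluated numerically and is seen to approach 1.

    import numpy as np

    for n in (10**3, 10**4, 10**5, 10**6):
        bk = n ** 2   # any choice with n / (bk) -> 0
        rhs = (np.exp(-4 * n / bk)
               - np.exp(-n / 24)
               - 12 * np.exp(-bk / (32 * np.log(bk) ** 2)))
        print(n, rhs)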

Proof of Theorem 2.4.

We will proceed by contradiction. Suppose VnV_{n} is an estimator violating the theorem statement, i.e. there exist sequences bb\to\infty and kk\to\infty with n/(bk+kd)0n/\left(bk+k^{d}\right)\to 0 and bkb\geq k such that, for all ε>0\varepsilon>0,

supp𝒟dP(Vnp>3minq~d,bkpq+ε)0.\displaystyle\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\widetilde{\mathcal{H}}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0.

Since n/(bk+kd)n/(bk+k^{d})\to 0, we have that (bk+kd)/n(bk+k^{d})/n\to\infty, so there is a subsequence nin_{i} such that b(ni)k(ni)/nib(n_{i})k(n_{i})/n_{i}\to\infty or k(ni)d/nik(n_{i})^{d}/n_{i}\to\infty, or equivalently ni/(b(ni)k(ni))0n_{i}/(b(n_{i})k(n_{i}))\to 0 or ni/k(ni)d0n_{i}/k(n_{i})^{d}\to 0. We will show that both cases lead to a contradiction. From now on we implicitly treat bb and kk as functions of nin_{i} when taking limits.
Case ni/(bk)0n_{i}/(bk)\to 0: We proceed similarly to the proof of Theorem 2.3. Let (pn)n=1\left(p_{n}\right)_{n=1}^{\infty}, P~n\widetilde{P}_{n}, and 𝒳n\mathcal{X}_{n} be defined as in the proof of Theorem 2.3. Note that d,bk~d,bk\mathcal{H}_{d,b}^{k}\subset\widetilde{\mathcal{H}}_{d,b}^{k} (see proof of Lemma 2.9) and thus P~n~d,bk\widetilde{P}_{n}\in\widetilde{\mathcal{H}}_{d,b}^{k}. We can proceed exactly as in our proof of Theorem 2.3, starting from the observation that P~nd,bk\widetilde{P}_{n}\in\mathcal{H}_{d,b}^{k}, by simply replacing d,bk\mathcal{H}_{d,b}^{k} with ~d,bk\widetilde{\mathcal{H}}_{d,b}^{k} and nn with nin_{i}, which finishes this case.
Case ni/kd0n_{i}/k^{d}\to 0: Let (pn)n=1(p_{n})_{n=1}^{\infty} be a sequence of elements in 𝒯d,k\mathcal{T}_{d,k} which represent distributions over [k]d[k]^{d}. Let 𝒳n(Xn,1,,Xn,n)\mathcal{X}_{n}\triangleq\left(X_{n,1},\ldots,X_{n,n}\right) with Xn,1,,Xn,niidpnX_{n,1},\ldots,X_{n,n}\overset{iid}{\sim}p_{n}. Let 𝒳~n=(X~n,1,,X~n,n)\widetilde{\mathcal{X}}_{n}=\left(\widetilde{X}_{n,1},\ldots,\widetilde{X}_{n,n}\right) be independent random variables with X~n,ihd,b,Xn,i\widetilde{X}_{n,i}\sim h_{d,b,X_{n,i}}. Let P~n\widetilde{P}_{n} be the density for X~n,i\widetilde{X}_{n,i}. Note that kbk\leq b. So we have that

P~n\displaystyle\widetilde{P}_{n} =S[k]dpX~n,i|Xn,i=SP(Xn,i=S)\displaystyle=\sum_{S\in[k]^{d}}p_{\widetilde{X}_{n,i}|X_{n,i}=S}P(X_{n,i}=S)
=S[k]dhd,b,Spn(S)\displaystyle=\sum_{S\in[k]^{d}}h_{d,b,S}p_{n}(S)
=S[k]dpn(S)i=1dh1,b,Si\displaystyle=\sum_{S\in[k]^{d}}p_{n}(S)\prod_{i=1}^{d}h_{1,b,S_{i}}

and thus P~n~d,bk\widetilde{P}_{n}\in\widetilde{\mathcal{H}}_{d,b}^{k}. Proceeding as in the proof of Theorem 2.3, we obtain from VnV_{n} an estimator for elements of 𝒯d,k\mathcal{T}_{d,k}, which is equivalent to estimating elements of Δkd\Delta_{k^{d}}; this is impossible since ni/kd0n_{i}/k^{d}\to 0. ∎