
Improving Nonparametric Density Estimation with Tensor Decompositions

Robert A. Vandermeulen
Machine Learning Group
Technische Universität Berlin
Berlin, Germany
[email protected]
Abstract

While nonparametric density estimators often perform well on low dimensional data, their performance can suffer when applied to higher dimensional data, owing presumably to the curse of dimensionality. One technique for avoiding this is to assume no dependence between features and that the data are sampled from a separable density. This allows one to estimate each marginal distribution independently thereby avoiding the slow rates associated with estimating the full joint density. This is a strategy employed in naive Bayes models and is analogous to estimating a rank-one tensor. In this paper we investigate whether these improvements can be extended to other simplified dependence assumptions which we model via nonnegative tensor decompositions. In our central theoretical results we prove that restricting estimation to low-rank nonnegative PARAFAC or Tucker decompositions removes the dimensionality exponent on bin width rates for multidimensional histograms. These results are validated experimentally with high statistical significance via direct application of an existing nonnegative tensor factorization to histogram estimators.

1 Introduction

Nonparametric density estimation is the task of estimating a density $f$ from data without assuming that $f$ belongs to some parametric class of densities, e.g. the space of multivariate Gaussian distributions. Some common nonparametric density estimators include the histogram estimator and the kernel density estimator (KDE). While nonparametric density estimation has been shown to be effective for many tasks, it has been observed empirically that estimator performance typically declines as data dimensionality increases, a manifestation of the curse of dimensionality. For the histogram and KDE this phenomenon also has a precise mathematical analog. For these estimators universal consistency is achieved iff $n\to\infty$ and $h\to 0$, with $nh^{d}\to\infty$, where $n$ is the number of samples and $h$ is the bin width for the histogram and the bandwidth parameter for the KDE [12].

One of the most common approaches to alleviating the curse of dimensionality is dimensionality reduction. A dimensionality reduction technique typically attempts to transform a high dimensional representation into a lower dimensional one that removes dependences within the data. Feature selection techniques, for example, usually explicitly remove highly dependent features [30, 7]. In PCA one finds the best $d$-dimensional affine subspace for approximating the data. If the PCA model fits well, it implies that, given $d$ features of a $D$-dimensional sample, there exists a good linear prediction of the remaining $D-d$ features. The manifold hypothesis [6], which is often touted as a general explanation for why high dimensional problems are learnable [5], can also be viewed as a model of dependence. For example, if we assume that a random vector $[X,Y]^{T}$ lies on the one-dimensional sphere, $X^{2}+Y^{2}=1$, and $Y$ is known, then $X$ can only assume one of two values, $X=\pm\sqrt{1-Y^{2}}$. More generally, the assumption that the data lie on a manifold implies local dependence. This is because any sufficiently small region of the support manifold can be well approximated by an affine subspace, and thus there exists a linear dependence between the features, as in PCA. Interestingly, assuming that features are not dependent yields another approach to overcoming the curse of dimensionality.

In a naive Bayes model one assumes no dependence between features, i.e. our target density is separable,

f\left(x_{1},\ldots,x_{d}\right)=f_{1}\left(x_{1}\right)f_{2}\left(x_{2}\right)\cdots f_{d}\left(x_{d}\right). (1)

In order to estimate $f$ one can now simply estimate the marginal distributions and multiply them. Because the dimensionality of each marginal is one, one can use a histogram or KDE and achieve an $nh\to\infty$ rate on bin width/bandwidth while preserving consistent estimation, thereby circumventing the curse of dimensionality. The separability assumption is rarely satisfied in practice, so naive Bayes models are typically not used for density estimation directly, but may be used for some other task such as classification via a likelihood ratio test.
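To make this strategy concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of a separable density estimate on $[0,1)^{d}$ built from one-dimensional histograms; the function name and the Beta-distributed example data are ours.

import numpy as np

def naive_bayes_histogram(X, b):
    # Fit one 1-D histogram per feature and return the product density (naive Bayes).
    # X: array of shape (n, d) with entries in [0, 1); b: bins per dimension.
    n, d = X.shape
    # Per-dimension bin weights; each row sums to one.
    weights = np.stack(
        [np.histogram(X[:, j], bins=b, range=(0.0, 1.0))[0] / n for j in range(d)]
    )
    def density(x):
        # Bin index of each coordinate; each marginal density is b times its bin weight.
        idx = np.minimum((np.asarray(x) * b).astype(int), b - 1)
        return np.prod(b * weights[np.arange(d), idx])
    return density

rng = np.random.default_rng(0)
X = rng.beta(2.0, 5.0, size=(1000, 3))      # a genuinely separable toy density
f_hat = naive_bayes_histogram(X, b=20)
print(f_hat([0.2, 0.3, 0.1]))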

In this paper we consider relaxations of the naive Bayes model based on the assumption that a density is a mixture of separable densities. Our models are inspired by nonnegative tensor factorizations, so we refer to them collectively as nonparametric nonnegative tensor factorization (NNTF) models. The first model is related to nonnegative PARAFAC [28] and is commonly known as a multi-view model in the statistics and machine learning literature:

f\left(x_{1},\ldots,x_{d}\right)=\sum_{i=1}^{k}w_{i}f_{i,1}\left(x_{1}\right)f_{i,2}\left(x_{2}\right)\cdots f_{i,d}\left(x_{d}\right). (2)

This is equivalent to the assumption that the features are independent conditioned on an unobserved discrete random variable taking on $k$ values. Our second model is based on the nonnegative Tucker decomposition [17]. In this model it is assumed that there are $d$ collections of $k$ one-dimensional densities, $\mathcal{F}_{1},\ldots,\mathcal{F}_{d}$ with $\mathcal{F}_{i}=\left\{f_{i,1},\ldots,f_{i,k}\right\}$, and some probability measure which randomly selects one density from each collection. This measure can be represented by a tensor $W\in\mathbb{R}^{k^{\times d}}$ where the probability of selecting $f_{1,i_{1}},\ldots,f_{d,i_{d}}$ is $W_{i_{1},\ldots,i_{d}}$. To sample from this model we first randomly select the marginal distributions $f_{1,i_{1}},\ldots,f_{d,i_{d}}$ according to $W$ and then independently sample each feature according to the randomly selected marginal distribution, $X_{j}\sim f_{j,i_{j}}$. The density of this model is

f\left(x_{1},\ldots,x_{d}\right)=\sum_{i_{1}=1}^{k}\cdots\sum_{i_{d}=1}^{k}W_{i_{1},\ldots,i_{d}}f_{1,i_{1}}(x_{1})\cdots f_{d,i_{d}}(x_{d}). (3)

We are unaware of previous literature investigating this model so we will simply term it the Tucker model.
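For intuition, here is a small NumPy sketch (ours; the Beta marginals and all sizes are arbitrary stand-ins for the $f_{j,i}$) of the two-stage sampling procedure behind the Tucker model (3). The multi-view model (2) corresponds to the special case where the core tensor $W$ is supported on its diagonal.

import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 4                                  # dimensions and marginals per dimension

# Core probability tensor W over the k^d combinations of marginals; entries sum to one.
W = rng.random((k,) * d)
W /= W.sum()

# d collections of k one-dimensional densities; here Beta densities with random parameters.
beta_params = rng.uniform(0.5, 5.0, size=(d, k, 2))

def sample_tucker(n):
    # Select one marginal per dimension according to W, then sample each feature independently.
    flat = rng.choice(W.size, size=n, p=W.ravel())
    S = np.stack(np.unravel_index(flat, W.shape), axis=1)   # shape (n, d) of marginal indices
    X = np.empty((n, d))
    for j in range(d):
        a, b = beta_params[j, S[:, j], 0], beta_params[j, S[:, j], 1]
        X[:, j] = rng.beta(a, b)
    return X

print(sample_tucker(500).shape)              # (500, 3)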

In Section 2 we prove that there exists a trade-off between the rate on bin width $h$ and the number of components $k$: to control estimation error (for an estimator $V$ restricted to a space of densities $\mathcal{P}$, the estimation error refers to the difference between $\left\|V-p\right\|$ and $\min_{q\in\mathcal{P}}\left\|q-p\right\|$, where $p$ is the target density) we require $k/h$ to be asymptotically dominated by $n$ for the multi-view histogram and $k/h+k^{d}$ to be asymptotically dominated by $n$ for the Tucker histogram (both of these rates ignore logarithmic factors). Note that for the multi-view histogram this rate does not depend on $d$. Allowing $h$ to shrink as aggressively as possible (which we pay for with a slow rate on $k$) we show that there exist universally consistent histogram estimators which achieve an $nh/\log\left(h^{-1}\right)\to\infty$ rate on bin width, thereby removing the dependence on dimension and approximately attaining the rates possible for densities of the form in (1) while still controlling estimation error. We show that these are approximately the fastest possible rates via matching lower bounds. In Section 3 we show that we can use an existing low-rank nonnegative Tucker factorization algorithm to fit our model and demonstrate empirically that fitting histograms to a Tucker model outperforms the standard histogram estimator with very high statistical significance.

While this paper focuses on histogram estimation and presents a promising, readily implementable improvement to the standard histogram estimator in Section 3, its primary purpose is to showcase the potential of utilizing concepts from nonnegative tensor factorization to improve performance in nonparametric statistical methods.

1.1 Previous Work

Nonparametric density estimation has been extensively studied, with the histogram estimator and KDE being by far the most well known. There do exist, however, alternative methods for density estimation, e.g. the forest density estimator [19] and the $k$-nearest neighbor density estimator [20]. The $L_{1}$, $L_{2}$, and $L_{\infty}$ convergence of the histogram and KDE has been studied extensively [12, 8, 33, 14]. The KDE is generally accepted as being the superior density estimator, with some mathematical justification [29]. Numerous modifications and extensions of the KDE have been proposed, including variable bandwidth [32], robust KDEs [15, 34, 35], and methods for enforcing support boundary constraints [27]. Finally we mention one recent paper [16] that demonstrated that the uniform convergence of a KDE to its population estimate suffers when the intrinsic dimension of the data is lower than the ambient dimension, a phenomenon seemingly at odds with the curse of dimensionality.

For our review of NNTF models we also include a general review of tensor/matrix factorizations, since both can be viewed as low-rank models. In particular, for the multi-view model we have

\sum_{i=1}^{k}w_{i}f_{i,1}\left(x_{1}\right)\cdots f_{i,d}\left(x_{d}\right)\sim\sum_{i=1}^{k}\lambda_{i}\mathbf{v}_{i,1}\otimes\cdots\otimes\mathbf{v}_{i,d}. (4)

We further note that for histogram estimation, once data has been assigned to bins, finding a good multi-view or Tucker histogram is analogous to estimating a probability tensor with a nonnegative factorization (we show this rigorously in Section 3).

A great deal of work has gone into leveraging low-rank assumptions to improve matrix estimation, particularly in the field of compressed sensing [10, 24]. In compressed sensing one has access to a collection of random linear measurements of an unknown low-rank matrix to be estimated. Fitting a matrix to these measurements with a nuclear norm regularized optimization problem achieves estimation bounds better than those possible without the low-rank assumption. These techniques have been effectively applied to problems such as matrix completion, multivariate regression, and estimating autoregressive models [21, 22]. Unfortunately such techniques are not extensible to histogram estimation because, in the density estimation setting, data are not linearly sampled from the target model. Furthermore how to extend compressed sensing techniques to general tensors is not clear.

General matrix/tensor factorization, including nonnegative matrix/tensor factorization, has been extensively studied despite its inherent difficulty due to non-convexity. The works [9, 3] present potential theoretical grounds for avoiding the computational difficulties of nonnegative matrix factorization. Some algorithms for finding nonnegative tensor factorizations are mentioned in Section 3. One notable approach to tensor factorization is to assume, in the tensor representation in (4), that $d\geq 3$ and that the collections of vectors $\mathbf{v}_{1,j},\ldots,\mathbf{v}_{k,j}$ are linearly independent for all $j$. Under this assumption we are guaranteed that the factorization (4) is unique [1]. In [2] the authors present a method for recovering this factorization efficiently and demonstrate its utility for a variety of tasks. This work was extended in [31] to recover a multi-view KDE satisfying an analogous linear independence assumption. This is the only work of its type of which we are aware. In this work the authors investigate the sample complexity of their estimator but do not demonstrate that their technique has potential for improving rates for nonparametric density estimation in general. Finally we note that nonparametric applications of the Tucker decomposition have been utilized in Bayesian statistics [26]. We are unaware of any literature describing the model we introduce in (3).

2 Theoretical Results

In this section we mathematically demonstrate that histogram estimators can achieve greater estimation accuracy by restricting to NNTF models. To simplify analysis we will only consider densities on $\left[0,1\right)^{d}$ and analyze the number of bins per dimension $b$, which is the inverse of the bin width, i.e. $b=1/h$. We prove that there exists a trade-off between rates on $b$ and $k$. Furthermore we show that approximately the fastest possible rate on $b$, while still uniformly controlling estimator variance and remaining universally consistent, is $n/\left(b\log b\right)\to\infty$. Before proving these results we must introduce a fair amount of notation.

2.1 Notation

All norms in Section 2 are the $\ell^{1}$, $L^{1}$, or total variation norm for vectors/tensors, densities, and measures respectively. Note that these norms are analogous, e.g. the $L^{1}$ norm of a probability density function is the same as the total variation norm of the probability measure associated with the density. Let $\mathcal{D}_{d}$ be the set of all densities on $\left[0,1\right)^{d}$. By density we mean a probability measure that is absolutely continuous with respect to the $d$-dimensional Lebesgue measure. We define a probability vector or probability tensor to simply mean a vector or tensor whose entries are nonnegative and sum to one. Let $\Delta_{b}$ denote the set of probability vectors in $\mathbb{R}^{b}$ and $\mathcal{T}_{d,b}$ the set of probability tensors in $\mathbb{R}^{b^{\times d}}$. Let $\mathcal{T}_{d,b}^{k}$ be the set of tensors which are a convex combination of $k$ separable probability tensors, i.e.

\mathcal{T}_{d,b}^{k}\triangleq\left\{\sum_{i=1}^{k}w_{i}\prod_{j=1}^{d}p_{i,j}\,\middle|\,w\in\Delta_{k},\ p_{i,j}\in\Delta_{b}\right\}.

In this paper, the product symbol $\prod$ will always mean the standard outer product, e.g. set product or tensor product. For any natural number $b$ let $[b]=\left\{1,\ldots,b\right\}$. For a multi-index $A\in\left[b\right]^{d}$ we define $\mathbf{e}_{d,b,A}$ as the element of $\mathcal{T}_{d,b}$ whose $(A_{1},\ldots,A_{d})$-th entry is one and which is zero elsewhere.
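As an illustration (ours, not from the paper), an element of $\mathcal{T}_{d,b}^{k}$ can be built in NumPy as a convex combination of $k$ outer products of probability vectors; the sizes below are arbitrary.

import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
d, b, k = 3, 5, 2

def random_simplex(shape):
    # Random probability vectors along the last axis.
    v = rng.random(shape)
    return v / v.sum(axis=-1, keepdims=True)

w = random_simplex(k)                 # mixture weights in Delta_k
P = random_simplex((k, d, b))         # P[i, j] is the j-th marginal of component i

# T = sum_i w_i * p_{i,1} (outer) ... (outer) p_{i,d}, an element of T_{d,b}^k.
T = sum(w[i] * reduce(np.multiply.outer, [P[i, j] for j in range(d)]) for i in range(k))
print(T.shape, T.sum())               # (5, 5, 5) and 1.0 up to floating point error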

The following is the set of probability tensors constructed via a nonnegative Tucker factorization

\widetilde{\mathcal{T}}_{d,b}^{k}\triangleq\left\{\sum_{S\in\left[k\right]^{\times d}}W_{S}\prod_{i=1}^{d}p_{i,S_{i}}\,\middle|\,W\in\mathcal{T}_{d,k},\ p_{i,j}\in\Delta_{b}\right\}.

We will now construct the space of histograms on $\left[0,1\right)^{d}$. We begin with one-dimensional histograms. We define $h_{1,b,i}$ with $i\in\left[b\right]$ to be the one-dimensional histogram with all weight allocated to the $i$th bin. Formally we define this as

h_{1,b,i}\left(x\right)\triangleq b\,\mathbbm{1}\left(\frac{i-1}{b}\leq x<\frac{i}{b}\right).

Note that this is a valid density due to the leading $b$ coefficient. We use these to construct higher-dimensional histograms. For a multi-index $A\in\left[b\right]^{d}$, let

h_{d,b,A}\triangleq\prod_{i=1}^{d}h_{1,b,A_{i}},

the histogram whose entire density is allocated to the bin indexed by $A$. Finally we define $\Lambda_{d,b,A}$ to be the support of $h_{d,b,A}$, i.e. the “bins” of a histogram estimator,

\Lambda_{d,b,A}\triangleq\prod_{i=1}^{d}\left[\frac{A_{i}-1}{b},\frac{A_{i}}{b}\right).

For a sequence of points in $\left[0,1\right)^{d}$, $\mathcal{X}=\left(X_{1},\ldots,X_{n}\right)$, the standard histogram estimator is

H_{d,b}\left(\mathcal{X}\right)\triangleq\frac{1}{n}\sum_{i=1}^{n}\sum_{A\in[b]^{d}}h_{d,b,A}\,\mathbbm{1}\left(X_{i}\in\Lambda_{d,b,A}\right).

Let $\mathcal{H}_{d,b}\triangleq\operatorname{Conv}\left(\left\{h_{d,b,A}\middle|A\in\left[b\right]^{d}\right\}\right)$, the set of all $d$-dimensional histograms with $b$ bins per dimension. Let $\mathcal{H}_{d,b}^{k}$ be the set of histograms with at most $k$ separable components, i.e.

\mathcal{H}_{d,b}^{k}\triangleq\left\{\sum_{i=1}^{k}w_{i}\prod_{j=1}^{d}f_{i,j}\,\middle|\,w\in\Delta_{k},\ f_{i,j}\in\mathcal{H}_{1,b}\right\}. (5)

We will refer to elements of this space as multi-view histograms. Analogously we define the space of Tucker histograms to be

\widetilde{\mathcal{H}}_{d,b}^{k}=\left\{\sum_{S\in\left[k\right]^{\times d}}W_{S}\prod_{i=1}^{d}f_{i,S_{i}}\,\middle|\,W\in\mathcal{T}_{d,k},\ f_{i,j}\in\mathcal{H}_{1,b}\right\}.

We emphasize that the collections of densities $\mathcal{H}_{d,b}^{k}$ and $\widetilde{\mathcal{H}}_{d,b}^{k}$ are the primary objects of interest in this paper, as they represent NNTF histograms. The theoretical results we present are concerned with finding good density estimators restricted to these sets as $k$ and $b$ vary.

Note that there exists an $\ell^{1}\to L^{1}$ linear isometry $U_{d,b}:\mathcal{T}_{d,b}\to\mathcal{H}_{d,b}$ with $U_{d,b}$ defined as

U_{d,b}(\mathbf{e}_{d,b,A})=h_{d,b,A}.

The inverse map $U^{-1}_{d,b}$ simply transforms a histogram into the tensor of its bin weights, and $U_{d,b}$ performs the reverse transformation. Note that $U_{d,b}$ also restricts to bijections $\mathcal{T}_{d,b}^{k}\to\mathcal{H}_{d,b}^{k}$ and $\widetilde{\mathcal{T}}_{d,b}^{k}\to\widetilde{\mathcal{H}}_{d,b}^{k}$. Much of our analysis on histograms will be performed on the space of probability tensors, with the analysis translated to histograms via this operator.
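The following NumPy sketch (our own illustration) makes this correspondence concrete: $U_{d,b}^{-1}$ applied to the standard histogram is simply the tensor of normalized bin counts, while $U_{d,b}$ evaluates the histogram density induced by a probability tensor by rescaling the bin weight by the inverse bin volume $b^{d}$.

import numpy as np

def U_inverse_of_histogram(X, b):
    # Probability tensor of normalized bin counts for the standard histogram on [0,1)^d.
    n, d = X.shape
    counts, _ = np.histogramdd(X, bins=[b] * d, range=[(0.0, 1.0)] * d)
    return counts / n

def U(T):
    # Histogram density induced by a probability tensor T; each bin has volume b^{-d}.
    b = T.shape[0]
    def density(x):
        idx = tuple(np.minimum((np.asarray(x) * b).astype(int), b - 1))
        return T[idx] * b ** T.ndim
    return density

rng = np.random.default_rng(0)
X = rng.random((1000, 3))
T = U_inverse_of_histogram(X, b=4)
print(T.sum(), U(T)([0.1, 0.5, 0.9]))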

For a set of vectors $\mathcal{V}$ we define $k\operatorname{-mix}\left(\mathcal{V}\right)\triangleq\left\{\sum_{i=1}^{k}w_{i}v_{i}\middle|w\in\Delta_{k},v_{i}\in\mathcal{V}\right\}$, i.e. the set of convex combinations of collections of $k$ vectors from $\mathcal{V}$. We define $N\left(\mathcal{V},\varepsilon\right)$ to be the minimum cardinality of a subset of $\mathcal{V}$ which $\varepsilon$-covers $\mathcal{V}$ (with closed balls) with respect to the $\left\|\cdot\right\|$ metric. It will be clear from context whether $\left\|\cdot\right\|$ represents the $\ell^{1}$, $L^{1}$, or total variation norm.

2.2 Preliminary Results

For brevity the main text only contains a full proof of Lemma 2.7. The remaining full proofs can be found in the appendix. We include general descriptions of the proof techniques we use for the multi-view histogram results; these are similar to the techniques we use for Tucker histograms. Our general proof technique is to find good covers of the spaces of densities, i.e. $\mathcal{H}_{d,b}^{k}$ and $\widetilde{\mathcal{H}}_{d,b}^{k}$, and then apply an existing algorithm for selecting a good estimator from a finite collection of densities given data. We begin by establishing a covering number bound on the space of probability vectors via an adaptation of a standard result presented in [8].

Lemma 2.1.

For all $0<\varepsilon\leq 1$ we have that $N\left(\Delta_{b},\varepsilon\right)\leq\left(\frac{2b}{\varepsilon}\right)^{b}$.

We can extend this to a covering number for the space of separable probability tensors.

Lemma 2.2.

For all $0<\varepsilon\leq 1$ we have that $N\left(\mathcal{T}_{d,b}^{1},\varepsilon\right)\leq\left(\frac{2bd}{\varepsilon}\right)^{bd}$.

Proof sketch.

Combine Lemma 2.1 with the following standard bound for product measures (see Lemma 3.3.7 in [25]): $\left\|\prod_{i=1}^{d}q_{i}-\prod_{j=1}^{d}\widetilde{q}_{j}\right\|\leq\sum_{i=1}^{d}\left\|q_{i}-\widetilde{q}_{i}\right\|$.

Now we establish the following lemma for covering numbers of mixtures of densities.

Lemma 2.3.

Let $\mathcal{P}$ be a set of probability measures; then

N\left(k\operatorname{-mix}\left(\mathcal{P}\right),\varepsilon+\delta\right)\leq N\left(\mathcal{P},\varepsilon\right)^{k}N\left(\Delta_{k},\delta\right).
Proof sketch.

Use Lemma 2.1 to construct different weightings of $k$ elements from an $\varepsilon$-cover of $\mathcal{P}$. ∎

By combining Lemma 2.3 with Lemma 2.2 we arrive at covering numbers for the space of multi-view probability tensors.

Lemma 2.4.

For all $0<\varepsilon\leq 1$ the following holds: $N\left(\mathcal{T}_{d,b}^{k},\varepsilon\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k}{\varepsilon}\right)^{k}$.

Through application of the $U_{d,b}$ operator we now have a characterization of the complexity of the space $\mathcal{H}_{d,b}^{k}$.

Corollary 2.1.

For all $0<\varepsilon\leq 1$ the following holds: $N\left(\mathcal{H}_{d,b}^{k},\varepsilon\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k}{\varepsilon}\right)^{k}$.

The following are analogous results for Tucker histograms.

Lemma 2.5.

For all $0<\varepsilon\leq 1$ the following holds: $N\left(\widetilde{\mathcal{T}}_{d,b}^{k},\varepsilon\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k^{d}}{\varepsilon}\right)^{k^{d}}$.

Corollary 2.2.

For all $0<\varepsilon\leq 1$ the following holds: $N\left(\widetilde{\mathcal{H}}_{d,b}^{k},\varepsilon\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k^{d}}{\varepsilon}\right)^{k^{d}}$.

The following lemma from [4] provides us with a way to choose good estimators from finite collections of densities. It can be proven by applying a Chernoff bound to [8], Theorem 6.3.

Lemma 2.6 ([4]).

There exists a deterministic algorithm that, given a collection of distributions $p_{1},\ldots,p_{M}$, a parameter $\varepsilon>0$, and at least $\frac{\log\left(3M^{2}/\delta\right)}{2\varepsilon^{2}}$ iid samples from an unknown distribution $p$, outputs an index $j\in\left[M\right]$ such that

\left\|p_{j}-p\right\|\leq 3\min_{i\in\left[M\right]}\left\|p_{i}-p\right\|+4\varepsilon

with probability at least $1-\frac{\delta}{3}$.

We present the following asymptotic version of the previous lemma and include its full proof. We highlight the use of finding sufficiently slow rates on parameters in order to establish asymptotic results, a technique which we will use in later proofs.

Lemma 2.7.

Let $\left(\mathcal{P}_{n}\right)_{n\in\mathbb{N}}$ be a sequence of finite collections of densities in $\mathcal{D}_{d}$ where $\left|\mathcal{P}_{n}\right|\to\infty$ with $n/\log\left(\left|\mathcal{P}_{n}\right|\right)\to\infty$. Then there exists a sequence of estimators $V_{n}\in\mathcal{P}_{n}$ such that, for all $\gamma>0$,

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{P}_{n}}\left\|p-q\right\|+\gamma\right)\to 0,

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Proof.

Let $M=M(n)=\left|\mathcal{P}_{n}\right|$. Since $n/\log\left(M\right)\to\infty$ we have that for all $c>0$ there exists an $N_{c}$ such that, for all $n\geq N_{c}$, we have $n/\log\left(M\right)\geq c$, or equivalently $n\geq c\log\left(M\right)$. Because of this there exists a sequence of positive values $C=C(n)$ such that $C\to\infty$ and $n\geq C\log\left(M\right)$.

We will be making use of the algorithm in Lemma 2.6 as well as its notation. If we can show that there exist sequences of positive values $\varepsilon(n)\to 0,\ \delta(n)\to 0$ such that, for sufficiently large $n$, the following holds

\frac{\log\left(3M^{2}/\delta\right)}{2\varepsilon^{2}}\leq n,

then we can simply set $V_{n}$ to be the estimator from Lemma 2.6 for sufficiently large $n$ and, because the lemma holds independent of the choice of $p$, the theorem statement follows.

Let $\varepsilon=\left(2/C\right)^{1/4}$ and $\delta=3/\exp\left(2\sqrt{\frac{C}{2}}\right)$. Note that these are both positive sequences which converge to zero. Now we have

\frac{\log\left(3M^{2}/\delta\right)}{2\varepsilon^{2}} =\frac{\log\left(M^{2}\right)+\log\left(3/\delta\right)}{2\varepsilon^{2}}
=\frac{2\log\left(M\right)+\log\left(3/\delta\right)}{2\varepsilon^{2}}=\frac{\log\left(M\right)+\frac{1}{2}\log\left(3/\delta\right)}{\varepsilon^{2}}
=\varepsilon^{-2}\left(\log\left(M\right)+\frac{1}{2}\log\left(3/\delta\right)\right)
=\left(\left(2/C\right)^{1/4}\right)^{-2}\left(\log\left(M\right)+\frac{1}{2}\log\left(\exp\left(2\sqrt{\frac{C}{2}}\right)\right)\right)
=\sqrt{\frac{C}{2}}\left(\log(M)+\sqrt{\frac{C}{2}}\right)=\sqrt{\frac{C}{2}}\log(M)+\frac{C}{2}. (6)

For sufficiently large $C$ and $M$ we have that the RHS of (6) is less than or equal to

\frac{C}{2}\log(M)+\frac{C}{2} \leq\frac{C}{2}\log(M)+\frac{C}{2}\log(M)
=C\log(M)\leq n,

which completes our proof. ∎

2.3 Main Theoretical Results

We can now state the central results of this paper. The following theorem states that one can control the estimation error of multi-view histograms with $k$ components and $b$ bins per dimension so long as $n\sim bk$ (omitting logarithmic factors). Recall that the standard histogram estimator requires $n\sim b^{d}$, so we have removed the exponential dependence of the bin rate on dimensionality. Here and elsewhere the $\sim$ symbol is not a precise mathematical statement but rather indicates that the two values should be of the same order. In the following, $b$ and $k$ are functions of $n$; the space of histograms which we are fitting changes as we acquire more data.

Theorem 2.1.

For any pair of sequences $b\to\infty$ and $k\to\infty$ satisfying $n/(bk\log(b)+k\log(k))\to\infty$, there exists an estimator $V_{n}\in\mathcal{H}_{d,b}^{k}$ such that, for all $\varepsilon>0$,

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0,

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Proof sketch.

Apply Lemma 2.7 to the cover in Corollary 2.1 and choose appropriately slow rates for terms not involving $b$ or $k$. ∎

The sample complexity for the multi-view histogram is perhaps more accurately approximated as being on the order of $dbk$; however, the $d$ disappears in the asymptotic analysis.

The following theorem states that we can control the error of Tucker histogram estimates so long as $n\sim bk+k^{d}$ (omitting logarithmic factors).

Theorem 2.2.

For any pair of sequences $b\to\infty$ and $k\to\infty$ satisfying $n/(bk\log(b)+k^{d}\log\left(k^{d}\right))\to\infty$, there exists an estimator $V_{n}\in\widetilde{\mathcal{H}}_{d,b}^{k}$ such that, for all $\varepsilon>0$,

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\widetilde{\mathcal{H}}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0,

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Allowing $b$ to grow as aggressively as possible, we achieve consistent estimation using either the multi-view or Tucker histograms so long as $n\sim b\log b$, regardless of dimensionality.

Corollary 2.3.

For all $d,b,k$ fix $\mathcal{R}_{d,b}^{k}$ to be either $\mathcal{H}_{d,b}^{k}$ or $\widetilde{\mathcal{H}}_{d,b}^{k}$. For any sequence $b\to\infty$ with $n/\left(b\log b\right)\to\infty$, there exists a sequence $k\to\infty$ and an estimator $V_{n}\in\mathcal{R}_{d,b}^{k}$ such that, for all $\varepsilon>0$,

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{R}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0,

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Substituting $b=1/h$ yields the rates mentioned in Section 1. The following result shows that the bias of the estimators in Theorems 2.1 and 2.2 goes to zero for all densities. Thus these estimators are universally consistent even when the NNTF model assumption is not satisfied.

Lemma 2.8.

Let $p\in\mathcal{D}_{d}$. If $k\to\infty$ and $b\to\infty$ then $\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|\to 0$.

A straightforward consequence of this is that the Tucker histogram bias also goes to zero.

Lemma 2.9.

Let $p\in\mathcal{D}_{d}$. If $k\to\infty$ and $b\to\infty$ then $\min_{q\in\widetilde{\mathcal{H}}_{d,b}^{k}}\left\|p-q\right\|\to 0$.

Finally we have that the rate on $bk$ in Theorem 2.1 cannot be made significantly faster.

Theorem 2.3.

Let $d\geq 2$, $b\to\infty$, and $k\to\infty$ with $b\geq k$ and $n/\left(bk\right)\to 0$. There exists no estimator $V_{n}\in\mathcal{H}_{d,b}^{k}$ such that, for all $\varepsilon>0$, the following limit holds:

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

Proof sketch.

We use $V_{n}$ to construct an estimator over $\Delta_{b}$ and show that such an estimator is impossible using a result in [13]. ∎

Likewise, the rate on $bk+k^{d}$ in Theorem 2.2 also cannot be significantly improved.

Theorem 2.4.

Let $d\geq 2$, $b\to\infty$, and $k\to\infty$ with $b\geq k$ and $n/\left(bk+k^{d}\right)\to 0$. There exists no estimator $V_{n}\in\widetilde{\mathcal{H}}_{d,b}^{k}$ such that, for all $\varepsilon>0$, the following limit holds:

\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\widetilde{\mathcal{H}}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0

where $V_{n}$ is a function of $X_{1},\ldots,X_{n}\overset{iid}{\sim}p$.

2.4 Discussion

Naturally, real world data likely never exactly satisfies the NNTF model assumption. Our results are meant to highlight a trade-off between model assumptions of smoothness (low $b$) and simple dependence between features (low $k$). Here we explore this trade-off for the multi-view histogram. Letting $k=b^{d}$ gives $\mathcal{H}_{d,b}^{k}=\mathcal{H}_{d,b}$, since we can allocate one component to each bin. Using the estimator from Theorem 2.1 then gives a sample complexity of approximately $b^{d+1}$. Thus setting $k=b^{d}$ in the multi-view histogram gives us something which behaves similarly to the standard histogram estimator, with a similar sample complexity. On the other hand, setting $k=1$ gives a naive Bayes model with a sample complexity of approximately $b$. The Tucker histogram can be analyzed similarly, with $k=b$ corresponding to the standard histogram. Thus we have a range of $k$ yielding different estimators, with maximal $k$ corresponding to the standard histogram and minimal $k$ corresponding to a naive Bayes assumption. We observe in Section 3 that this trade-off is useful in practice: we virtually never want $k$ to be maximized.
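As a rough numerical illustration of this trade-off (our own arithmetic, ignoring logarithmic factors and constants), take $d=5$, $b=10$, and $k=3$; the sample complexities suggested by the rates above are then approximately

n_{\text{standard}}\sim b^{d}=10^{5},\qquad n_{\text{multi-view}}\sim bk=30,\qquad n_{\text{Tucker}}\sim bk+k^{d}=30+3^{5}=273,

so both NNTF histograms require far fewer samples than the standard histogram, at the price of the low-rank dependence assumption.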

3 Experiments

While Theorems 2.1 and 2.2 guarantee the existence of estimators which can effectively estimate NNTF models, these estimators are unfortunately computationally intractable. Fortunately there exist estimators which can be adapted to our problem setting, though they lack the theoretical guarantees of the algorithm described in Theorems 2.1 and 2.2. Specifically, we will utilize an existing algorithm for nonnegative tensor decompositions. Due to the difficulties of estimating $L^{1}$ distances between densities we will instead focus on estimates minimizing the $L^{2}$ norm. In this section inner products and norms will be $L^{2}$ for functions and $\ell^{2}$ for tensors, i.e. the standard Euclidean norm or inner product applied to flattened tensors. We will again restrict our analysis to densities supported on $\left[0,1\right)^{d}$.

Consider the problem of finding some density estimator $\hat{p}$ with minimal $L^{2}$ distance to an unknown density $p$. This is equivalent to minimizing the squared $L^{2}$ loss:

\int_{\left[0,1\right]^{d}}\left(p(x)-\hat{p}\left(x\right)\right)^{2}dx (7)
\quad=\int_{\left[0,1\right]^{d}}\hat{p}\left(x\right)^{2}dx-2\int_{\left[0,1\right]^{d}}p(y)\hat{p}(y)dy+\int_{\left[0,1\right]^{d}}p(z)^{2}dz. (8)

Because the right term in (8) does not depend on $\hat{p}$ it can be ignored when finding the optimal $\hat{p}$. The left term in (8) is known. The middle term in (8) can be estimated with the following approximation

\int_{\left[0,1\right]^{d}}p(x)\hat{p}(x)dx=\mathbb{E}_{X\sim p}\left[\hat{p}(X)\right]\approx\frac{1}{n}\sum_{i=1}^{n}\hat{p}\left(X_{i}\right)

where $\mathcal{X}=X_{1},\ldots,X_{n}\overset{iid}{\sim}p$. We can use this to find a good estimate for $p$ in $\mathcal{R}^{k}_{d,b}$, which represents either $\mathcal{H}^{k}_{d,b}$ or $\widetilde{\mathcal{H}}^{k}_{d,b}$:

\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\int_{\left[0,1\right]^{d}}\left(\hat{H}\left(x\right)-p\left(x\right)\right)^{2}dx
=\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left<\hat{H},\hat{H}\right>-2\int_{\left[0,1\right]^{d}}\hat{H}(x)p(x)dx (9)
\approx\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left<\hat{H},\hat{H}\right>-\frac{2}{n}\sum_{i=1}^{n}\hat{H}(X_{i}). (10)

Recall that the standard histogram estimator is $H\left(\mathcal{X}\right)=\frac{1}{n}\sum_{i=1}^{n}\sum_{A\in\left[b\right]^{d}}h_{d,b,A}\mathbbm{1}\left(X_{i}\in\Lambda_{d,b,A}\right)$ and let $\hat{H}=\sum_{A\in\left[b\right]^{d}}\hat{w}_{A}h_{d,b,A}$. We have the following:

\left<\hat{H},H\right> =\left<\sum_{A\in\left[b\right]^{d}}\hat{w}_{A}h_{d,b,A},\ \frac{1}{n}\sum_{i=1}^{n}\sum_{B\in\left[b\right]^{d}}h_{d,b,B}\mathbbm{1}\left(X_{i}\in\Lambda_{d,b,B}\right)\right>
=\frac{1}{n}\sum_{i=1}^{n}\sum_{A\in\left[b\right]^{d}}\hat{w}_{A}\mathbbm{1}\left(X_{i}\in\Lambda_{d,b,A}\right)b^{d}=\frac{1}{n}\sum_{i=1}^{n}\hat{H}(X_{i}).

As a consequence (10) is equal to

\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left<\hat{H},\hat{H}\right>-2\left<H,\hat{H}\right> =\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left<\hat{H},\hat{H}\right>-2\left<H,\hat{H}\right>+\left<H,H\right>
=\arg\min_{\hat{H}\in\mathcal{R}_{d,b}^{k}}\left\|H-\hat{H}\right\|_{2}^{2}.

Using the $U_{d,b}$ operator we can reformulate this into a tensor factorization problem:

\min_{\hat{T}\in\mathcal{Q}_{d,b}^{k}}\left\|H-U_{d,b}(\hat{T})\right\|_{2}^{2}=\min_{\hat{T}\in\mathcal{Q}_{d,b}^{k}}b^{d}\left\|U_{d,b}^{-1}\left(H\right)-\hat{T}\right\|_{2}^{2},

where $\mathcal{Q}_{d,b}^{k}$ denotes either $\mathcal{T}_{d,b}^{k}$ or $\widetilde{\mathcal{T}}_{d,b}^{k}$. Because of this equivalence, to find estimates in $\mathcal{H}_{d,b}^{k}$ or $\widetilde{\mathcal{H}}_{d,b}^{k}$ we can simply use nonnegative tensor decomposition algorithms, which minimize $\ell^{2}$ loss, to find NNTF tensors which approximate $H$.
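As a concrete illustration of the objective in (10), the following NumPy sketch (ours, not the authors' code; Section 3.1 uses this expression as its performance measure) evaluates $\left<\hat{H},\hat{H}\right>-\frac{2}{n}\sum_{i}\hat{H}(X_{i})$ for a histogram represented by its bin-weight tensor.

import numpy as np

def l2_criterion(T_hat, X):
    # <H_hat, H_hat> - (2/n) * sum_i H_hat(X_i) for the histogram with bin-weight
    # tensor T_hat (shape (b,)*d), evaluated on samples X in [0,1)^d.
    b, d = T_hat.shape[0], T_hat.ndim
    quad = (b ** d) * np.sum(T_hat ** 2)            # <H_hat, H_hat>; each bin has volume b^{-d}
    idx = np.minimum((X * b).astype(int), b - 1)    # bin index of each sample
    vals = (b ** d) * T_hat[tuple(idx.T)]           # H_hat evaluated at each sample
    return quad - 2.0 * vals.mean()

rng = np.random.default_rng(0)
X_test = rng.random((500, 3))
T_uniform = np.full((4, 4, 4), 1.0 / 64)
print(l2_criterion(T_uniform, X_test))              # exactly -1.0 for the uniform density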

Table 1: Experimental Results
Dataset   Red. Dim.  d  Hist. Perf.    Tucker Perf.   Hist. Bins    Tucker Bins   Tucker r      p-val.
MNIST     PCA        2  -1.455±0.089   -1.502±0.102   6.531±1.499   8.375±1.780   4.968±1.976   5e-4
MNIST     PCA        3  -2.040±0.196   -2.268±0.195   4.781±0.738   6.718±1.565   5.781±1.340   2e-4
MNIST     PCA        4  -3.532±0.996   -4.014±0.655   4.031±0.585   5.343±1.018   4.375±0.695   2e-3
MNIST     PCA        5  -4.673±1.026   -6.157±2.924   3.468±0.499   4.343±0.592   3.281±0.514   4e-5
MNIST     Rand.      2  -2.034±0.100   -2.099±0.102   6.062±1.197   7.562±1.657   2.062±1.784   3e-5
MNIST     Rand.      3  -3.086±0.207   -3.331±0.387   4.812±0.526   6.843±1.227   2.687±1.959   1e-4
MNIST     Rand.      4  -4.307±0.290   -5.731±0.435   3.500±0.559   5.656±0.642   2.593±1.497   8e-7
MNIST     Rand.      5  -6.327±0.522   -9.539±1.053   3.250±0.433   4.718±0.571   2.562±1.087   8e-7
Diabetes  PCA        2  -2.079±0.122   -2.212±0.132   5.718±1.304   7.468±1.478   1.062±0.242   8e-6
Diabetes  PCA        3  -3.010±0.364   -3.606±0.420   3.593±0.860   7.062±1.058   1.843±1.543   2e-6
Diabetes  PCA        4  -4.002±0.415   -4.423±0.701   3.000±0.000   5.906±0.804   2.343±1.107   2e-3
Diabetes  PCA        5  -6.139±0.661   -6.043±1.192   3.000±0.000   3.750±0.968   1.843±0.617   0.91
Diabetes  Rand.      2  -3.074±0.224   -3.277±0.287   6.843±1.227   9.250±1.936   1.093±0.384   7e-5
Diabetes  Rand.      3  -4.726±0.483   -5.353±0.751   4.968±0.769   8.406±1.343   1.625±1.672   2e-5
Diabetes  Rand.      4  -6.017±0.873   -7.732±1.497   4.062±0.704   6.718±1.328   2.093±1.155   1e-5
Diabetes  Rand.      5  -8.986±1.292   -12.61±2.477   3.062±0.242   5.093±0.521   2.531±0.865   2e-6

For our experiments we used the TensorLy library [18] to perform the nonnegative Tucker decomposition [17] with Tucker rank $[k,k,\ldots,k]$, which was then projected onto the simplex of probability tensors using [11]. We also performed experiments with nonnegative PARAFAC decompositions using [28, 18]. These decompositions performed poorly. This is potentially because the PARAFAC optimization is more difficult, or because the additional flexibility of the Tucker decomposition was more appropriate for the experimental datasets.
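A minimal sketch of this fitting procedure might look as follows; it assumes TensorLy's non_negative_tucker and multi_mode_dot behave as shown (exact signatures vary across TensorLy versions) and replaces the simplex projection of [11] with a crude clip-and-renormalize step, so it approximates rather than reproduces the experimental pipeline.

import numpy as np
from tensorly.decomposition import non_negative_tucker
from tensorly.tenalg import multi_mode_dot

def fit_tucker_histogram(X, b, k, n_iter_max=200):
    # Bin-weight tensor of the standard histogram on [0,1)^d.
    n, d = X.shape
    counts, _ = np.histogramdd(X, bins=[b] * d, range=[(0.0, 1.0)] * d)
    T = counts / n
    # Nonnegative Tucker decomposition with Tucker rank [k, ..., k].
    core, factors = non_negative_tucker(T, rank=[k] * d, n_iter_max=n_iter_max)
    # Reconstruct the low-rank tensor; clip and renormalize in place of the projection of [11].
    T_hat = multi_mode_dot(core, factors)
    T_hat = np.clip(T_hat, 0.0, None)
    return T_hat / T_hat.sum()

rng = np.random.default_rng(0)
T_hat = fit_tucker_histogram(rng.random((200, 3)), b=8, k=3)
print(T_hat.shape, T_hat.sum())                     # (8, 8, 8) and 1.0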

3.1 Experimental Setup and Results

Our experiments were performed on the Scikit-learn “toy” datasets MNIST and Diabetes [23], with labels removed. We used the expression inside the minimization in (10) to evaluate performance. Our experiments considered estimating histograms in $d=2,3,4,5$ dimensional space. We consider two forms of dimensionality reduction. First we consider projecting the dataset onto its top $d$ principal components. As an alternative we consider projecting our dataset onto a random subspace of dimension $d$. We have constructed our random subspace dimensionality reduction so that each additional dimension adds a new index without affecting the others. For each dataset we randomly select an orthonormal basis $v_{1},v_{2},\ldots$ that remains unchanged for all experiments. To transform a point $X$ to dimension $d$ we perform the following transform:

X_{\text{reduced dim.}}=\left[v_{1}\cdots v_{d}\right]^{T}X.

We consider both transforms since PCA may select dimensions where the features tend to be independent, as is the case when the distribution is a multivariate Gaussian. After dimensionality reduction we scale and translate the data to fit in the unit cube.
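A sketch of the random-subspace reduction (our own illustration) is given below; drawing the orthonormal basis once via a QR decomposition ensures that increasing $d$ only appends coordinates without affecting the existing ones, as described above.

import numpy as np

def random_projection_basis(D, d_max, seed=0):
    # Fixed random orthonormal basis v_1, ..., v_{d_max} via QR of a Gaussian matrix.
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((D, d_max)))
    return Q                                          # columns are orthonormal

def reduce_dim(X, Q, d):
    # X_reduced = [v_1 ... v_d]^T X for every row of X.
    return X @ Q[:, :d]

X = np.random.default_rng(1).random((100, 64))        # placeholder for the flattened images
Q = random_projection_basis(D=64, d_max=5)
X2 = reduce_dim(X, Q, d=2)                            # first two coordinates of the 5-D reduction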

With our preprocessed dataset, each experiment consisted of randomly selecting 200 samples for training and using the rest to evaluate performance. For the estimators we tested all combinations using 1 to $b_{\text{max}}$ bins per dimension and $k$ from 1 to $k_{\text{max}}$. As $d$ increased the best cross-validated $b$ and $k$ values decreased, so we reduced $b_{\text{max}}$ and $k_{\text{max}}$ for larger $d$ to reduce computational time, while still leaving a sizable gap between the best cross-validated $b$ and $k$ across all runs of all experiments. For $d=2,3$ we have $b_{\text{max}}=15$ and $k_{\text{max}}=10$; for $d=4$ we have $b_{\text{max}}=12$ and $k_{\text{max}}=8$; for $d=5$ we have $b_{\text{max}}=8$ and $k_{\text{max}}=6$. For parameter fitting we used random subset cross validation repeated 80 times, using 40 of the 200 samples to cross validate. Performing 80 folds of cross validation was necessary because of the variance of the histogram-estimated loss. This high variance is likely due to the noncontinuous nature of the histogram estimator itself and the noncontinuity of the histogram as a function of the data, i.e. slightly moving one training sample can potentially change the histogram bin in which it lies. Each experiment was run 32 times and we report the mean and standard deviation of estimator performance as well as the best parameters found via cross validation. We additionally apply the Wilcoxon signed rank test to the 32 pairs of performance results to statistically determine whether the mean performance of the standard histogram and our algorithm differ, and report the corresponding $p$-value. Our results are in Table 1, where the Tucker histogram dominates. We also observe that the Tucker histogram can estimate more bins per dimension than the standard histogram and is able to estimate more bins per dimension when the number of components is lower. This corroborates the number of components versus bins per dimension trade-off from Section 2.4.
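For reference, the paired significance test can be run as follows (illustrative numbers only, not the paper's results); we assume SciPy's wilcoxon for the two-sided signed-rank test.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
hist_perf = rng.normal(-2.0, 0.1, size=32)                         # 32 paired runs (synthetic)
tucker_perf = hist_perf - np.abs(rng.normal(0.05, 0.02, size=32))  # Tucker slightly better here

stat, p_value = wilcoxon(hist_perf, tucker_perf)                   # two-sided signed-rank test
print(p_value)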

4 Conclusion

Through analysis of the histogram estimator, we have theoretically and empirically demonstrated that NNTF models can be used to improve nonparametric density estimation. Though the histogram estimator is not a particularly popular estimator, we hope that the ideas presented here can be adapted to improve other techniques in nonparametric statistics.

Acknowledgements

This work was supported by the Berlin Institute for the Foundations of Learning and Data (BIFOLD) sponsored by the German Federal Ministry of Education and Research (BMBF).

References

  • Allman et al. [2009] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Ann. Statist., 37(6A):3099–3132, 12 2009. doi: 10.1214/09-AOS689. URL http://dx.doi.org/10.1214/09-AOS689.
  • Anandkumar et al. [2014] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014. URL http://jmlr.org/papers/v15/anandkumar14b.html.
  • Arora et al. [2012] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, STOC ’12, pages 145–162, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1245-5. doi: 10.1145/2213977.2213994. URL http://doi.acm.org/10.1145/2213977.2213994.
  • Ashtiani et al. [2018] H. Ashtiani, S. Ben-David, N. Harvey, C. Liaw, A. Mehrabian, and Y. Plan. Nearly tight sample complexity bounds for learning mixtures of gaussians via sample compression schemes. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3412–3421. Curran Associates, Inc., 2018.
  • Bengio [2013] Y. Bengio. Deep learning of representations: Looking forward. In A.-H. Dediu, C. Martín-Vide, R. Mitkov, and B. Truthe, editors, Statistical Language and Speech Processing, pages 1–37, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-39593-2.
  • Cayton [2005] L. Cayton. Algorithms for manifold learning. Technical report, 2005.
  • Chen et al. [2017] J. Chen, M. Stern, M. J. Wainwright, and M. I. Jordan. Kernel feature selection via conditional covariance minimization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6946–6955. Curran Associates, Inc., 2017.
  • Devroye and Lugosi [2001] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, New York, 2001.
  • Donoho and Stodden [2004] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 1141–1148. MIT Press, 2004.
  • Donoho [2006] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006. ISSN 0018-9448. doi: 10.1109/TIT.2006.871582.
  • Duchi et al. [2008] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the $\ell_{1}$-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
  • Györfi et al. [1985] L. Györfi, L. Devroye, and L. Gyorfi. Nonparametric density estimation: the L1 view. John Wiley & Sons, New York; Chichester, 1985.
  • Han et al. [2014] Y. Han, J. Jiao, and T. Weissman. Minimax estimation of discrete distributions under $\ell_{1}$ loss. CoRR, abs/1411.1467, 2014. URL http://arxiv.org/abs/1411.1467.
  • Jiang [2017] H. Jiang. Uniform convergence rates for kernel density estimation. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1694–1703, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/jiang17b.html.
  • Kim and Scott [2012] J. Kim and C. Scott. Robust kernel density estimation. J. Machine Learning Res., 13:2529–2565, 2012.
  • Kim et al. [2018] J. Kim, J. Shin, A. Rinaldo, and L. Wasserman. Uniform Convergence Rate of the Kernel Density Estimator Adaptive to Intrinsic Volume Dimension. arXiv e-prints, art. arXiv:1810.05935, Oct 2018.
  • Kim and Choi [2007] Y.-D. Kim and S. Choi. Nonnegative tucker decomposition. 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
  • Kossaifi et al. [2016] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic. TensorLy: Tensor Learning in Python. arXiv e-prints, art. arXiv:1610.09555, Oct 2016.
  • Liu et al. [2011] H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman. Forest density estimation. J. Mach. Learn. Res., 12:907–951, July 2011. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1953048.2021032.
  • Mack and Rosenblatt [1979] Y. P. Mack and M. Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of Multivariate Analysis, 9(1):1–15, March 1979.
  • Negahban and Wainwright [2011] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist., 39(2):1069–1097, 04 2011. doi: 10.1214/10-AOS850. URL https://doi.org/10.1214/10-AOS850.
  • Negahban and Wainwright [2012] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res., 13:1665–1697, May 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2188385.2343697.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Recht et al. [2010] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, Aug. 2010. ISSN 0036-1445. doi: 10.1137/070697835. URL http://dx.doi.org/10.1137/070697835.
  • Reiss [1989] R. Reiss. Approximate distributions of order statistics: with applications to nonparametric statistics. Springer series in statistics. Springer, 1989. ISBN 9783540968511. URL https://books.google.de/books?id=DxzvAAAAMAAJ.
  • Schein et al. [2016] A. Schein, M. Zhou, D. Blei, and H. Wallach. Bayesian poisson tucker decomposition for learning the structure of international relations. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2810–2819, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/schein16.html.
  • Schuster [1985] E. F. Schuster. Incorporating support constraints into nonparametric estimators of densities. Communications in Statistics - Theory and Methods, 14(5):1123–1136, 1985. doi: 10.1080/03610928508828965. URL https://doi.org/10.1080/03610928508828965.
  • Shashua and Hazan [2005] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 792–799, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351.1102451. URL http://doi.acm.org/10.1145/1102351.1102451.
  • Silverman [1986] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
  • Song et al. [2012] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. J. Mach. Learn. Res., 13(1):1393–1434, May 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2503308.2343691.
  • Song et al. [2014] L. Song, A. Anandkumar, B. Dai, and B. Xie. Nonparametric estimation of multi-view latent variable models. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 640–648, 2014. URL http://jmlr.org/proceedings/papers/v32/songa14.html.
  • Terrell and Scott [1992] G. R. Terrell and D. W. Scott. Variable kernel density estimation. Ann. Statist., 20(3):1236–1265, 09 1992. doi: 10.1214/aos/1176348768. URL https://doi.org/10.1214/aos/1176348768.
  • Tsybakov [2008] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
  • Vandermeulen and Scott [2013] R. Vandermeulen and C. Scott. Consistency of robust kernel density estimators. COLT, 30:568–591, 2013.
  • Vandermeulen and Scott [2014] R. A. Vandermeulen and C. Scott. Robust kernel density estimation by scaling and projection in hilbert space. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 433–441. Curran Associates, Inc., 2014.

Appendix A Proofs Omitted from Main Text

All norms are either the $\ell^{1}$, $L^{1}$, or total variation norm, which are equivalent with respect to our analysis, and the proper norm will be clear from context.

Proof of Lemma 2.1.

In Section 7.4 of [8], the authors show that for any collection of measures $\mu_{1},\ldots,\mu_{b}$ and for all $\varepsilon>0$,

N\left(\operatorname{Conv}\left(\left\{\mu_{1},\ldots,\mu_{b}\right\}\right),\varepsilon\right)\leq\left(b+\frac{b}{\varepsilon}\right)^{b}.

With the additional assumption that $\varepsilon\leq 1$ we have that $b+\frac{b}{\varepsilon}\leq\frac{b}{\varepsilon}+\frac{b}{\varepsilon}=\frac{2b}{\varepsilon}$ and thus

N\left(\operatorname{Conv}\left(\left\{\mu_{1},\ldots,\mu_{b}\right\}\right),\varepsilon\right)\leq\left(\frac{2b}{\varepsilon}\right)^{b}.

If we let $\mu_{i}=\mathbf{e}_{i}$, the indicator vector at index $i$, then the lemma follows. ∎

Proof of Lemma 2.2.

From Lemma 2.1 we know there exists a finite collection of probability vectors $\widetilde{\mathcal{P}}$ such that $\widetilde{\mathcal{P}}$ is an $\varepsilon/d$-covering of $\Delta_{b}$ and $\left|\widetilde{\mathcal{P}}\right|\leq\left(\frac{2bd}{\varepsilon}\right)^{b}$. Note that the set $\left\{\widetilde{p}_{1}\otimes\dots\otimes\widetilde{p}_{d}:\widetilde{p}_{i}\in\widetilde{\mathcal{P}}\right\}$ contains at most $\left(\left(\frac{2bd}{\varepsilon}\right)^{b}\right)^{d}=\left(\frac{2bd}{\varepsilon}\right)^{bd}$ elements. We will now show that this set is an $\varepsilon$-cover of $\mathcal{T}_{d,b}^{1}$. Let $p_{1}\otimes\cdots\otimes p_{d}\in\mathcal{T}_{d,b}^{1}$ be arbitrary. From our construction of $\widetilde{\mathcal{P}}$ there exist elements $\widetilde{p}_{1},\ldots,\widetilde{p}_{d}\in\widetilde{\mathcal{P}}$ such that $\left\|p_{i}-\widetilde{p}_{i}\right\|\leq\frac{\varepsilon}{d}$.

We will now make use of Lemma 3.3.7 in [25], which states that, for any collections of probability vectors $q_{1},\ldots,q_{d}$ and $\widetilde{q}_{1},\ldots,\widetilde{q}_{d}$, the following holds:

\left\|\prod_{i=1}^{d}q_{i}-\prod_{j=1}^{d}\widetilde{q}_{j}\right\|\leq\sum_{i=1}^{d}\left\|q_{i}-\widetilde{q}_{i}\right\|.

From this it follows that

\left\|\prod_{i=1}^{d}p_{i}-\prod_{j=1}^{d}\widetilde{p}_{j}\right\|\leq\sum_{i=1}^{d}\left\|p_{i}-\widetilde{p}_{i}\right\|\leq d\,\frac{\varepsilon}{d}=\varepsilon,

thus completing our proof. ∎

Proof of Lemma 2.3.

Let $\widetilde{\mathcal{P}}$ be a finite collection of probability measures with $|\widetilde{\mathcal{P}}|=N\left(\mathcal{P},\varepsilon\right)$ which $\varepsilon$-covers $\mathcal{P}$. Similarly let $W\subset\Delta_{k}$ with $|W|=N\left(\Delta_{k},\delta\right)$ be a $\delta$-cover of $\Delta_{k}$. Consider the set

\Omega=\left\{\sum_{i=1}^{k}\widetilde{w}_{i}\widetilde{p}_{i}\,\middle|\,\widetilde{w}\in W,\ \widetilde{p}_{i}\in\widetilde{\mathcal{P}}\right\}.

Note that this set contains at most $N\left(\mathcal{P},\varepsilon\right)^{k}N\left(\Delta_{k},\delta\right)$ elements. We will now show that it $\left(\delta+\varepsilon\right)$-covers $k\operatorname{-mix}\left(\mathcal{P}\right)$, which completes the proof. Let $\sum_{i=1}^{k}p_{i}w_{i}\in k\operatorname{-mix}\left(\mathcal{P}\right)$. We know there exist elements $\widetilde{p}_{1},\ldots,\widetilde{p}_{k}\in\widetilde{\mathcal{P}}$ such that $\left\|\widetilde{p}_{i}-p_{i}\right\|\leq\varepsilon$ and $\widetilde{w}\in W$ such that $\left\|w-\widetilde{w}\right\|\leq\delta$, and thus $\sum_{i=1}^{k}\widetilde{p}_{i}\widetilde{w}_{i}\in\Omega$. Now observe that

\left\|\sum_{i=1}^{k}\widetilde{p}_{i}\widetilde{w}_{i}-\sum_{j=1}^{k}p_{j}w_{j}\right\| =\left\|\sum_{i=1}^{k}\widetilde{p}_{i}\widetilde{w}_{i}-\sum_{j=1}^{k}p_{j}\widetilde{w}_{j}+\sum_{l=1}^{k}p_{l}\widetilde{w}_{l}-\sum_{r=1}^{k}p_{r}w_{r}\right\|
\leq\left\|\sum_{i=1}^{k}\left(\widetilde{p}_{i}-p_{i}\right)\widetilde{w}_{i}\right\|+\left\|\sum_{i=1}^{k}p_{i}\left(\widetilde{w}_{i}-w_{i}\right)\right\|
\leq\sum_{i=1}^{k}\widetilde{w}_{i}\left\|\widetilde{p}_{i}-p_{i}\right\|+\sum_{i=1}^{k}\left|\widetilde{w}_{i}-w_{i}\right|
\leq\sum_{i=1}^{k}\widetilde{w}_{i}\varepsilon+\left\|\widetilde{w}-w\right\|
\leq\varepsilon+\delta.

Proof of Lemma 2.4.

Note that $\mathcal{T}_{d,b}^{k}=k\operatorname{-mix}\left(\mathcal{T}_{d,b}^{1}\right)$. Applying Lemma 2.3 followed by Lemmas 2.1 and 2.2 we have that

N\left(\mathcal{T}_{d,b}^{k},\varepsilon\right)\leq N\left(\mathcal{T}_{d,b}^{1},\varepsilon/2\right)^{k}N\left(\Delta_{k},\varepsilon/2\right)\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}\left(\frac{4k}{\varepsilon}\right)^{k}.

Proof of Lemma 2.5.

Fix $k,d,b$ and $0<\varepsilon\leq 1$. We are going to construct an $\varepsilon$-cover of $\widetilde{\mathcal{T}}_{d,b}^{k}$. From Lemma 2.1 we know that there exists a set $\mathcal{B}\subset\Delta_{b}$ which $\left(\frac{\varepsilon}{2d}\right)$-covers $\Delta_{b}$ and contains no more than $\left(\frac{4bd}{\varepsilon}\right)^{b}$ elements. Let $\mathcal{P}$ be the collection of all $d\times k$ arrays whose entries are elements of $\mathcal{B}$. So we have that

\left|\mathcal{P}\right|=\left|\mathcal{B}\right|^{dk}\leq\left(\frac{4bd}{\varepsilon}\right)^{bdk}.

From Lemma 2.1 there exists a set $\mathcal{W}$ which is an $\varepsilon/2$-cover of $\mathcal{T}_{d,k}$ and contains no more than $\left(4k^{d}/\varepsilon\right)^{k^{d}}$ elements. Now let

\mathcal{L}_{d,b}^{k}=\left\{\sum_{S\in[k]^{d}}\widetilde{W}_{S}\prod_{i=1}^{d}\widetilde{p}_{i,S_{i}}\,\middle|\,\widetilde{W}\in\mathcal{W},\ \widetilde{p}\in\mathcal{P}\right\}.

Note that

|d,bk||𝒲||𝒫|(4kdε)kd(4bdε)bdk.\displaystyle\left|\mathcal{L}_{d,b}^{k}\right|\leq\left|\mathcal{W}\right|\left|\mathcal{P}\right|\leq\left(\frac{4k^{d}}{\varepsilon}\right)^{k^{d}}\left(\frac{4bd}{\varepsilon}\right)^{bdk}.

We will now show that d,bk\mathcal{L}_{d,b}^{k} is an ε\varepsilon-cover of 𝒯~d,bk\widetilde{\mathcal{T}}_{d,b}^{k}. To this end let S[k]dWSi=1dpi,Si𝒯~d,bk\sum_{S\in\left[k\right]^{d}}W_{S}\prod_{i=1}^{d}p_{i,S_{i}}\in\widetilde{\mathcal{T}}_{d,b}^{k} be arbitrary, where W𝒯d,kW\in\mathcal{T}_{d,k} and pi,jΔbp_{i,j}\in\Delta_{b}. From our construction of 𝒲\mathcal{W}, there exists W~𝒲\widetilde{W}\in\mathcal{W} such that WW~ε/2\left\|W-\widetilde{W}\right\|\leq\varepsilon/2. There also exists p~𝒫\widetilde{p}\in\mathcal{P} such that p~i,jpi,jε/(2d)\left\|\widetilde{p}_{i,j}-p_{i,j}\right\|\leq\varepsilon/(2d) for all i,ji,j. Therefore we have that

S[k]dW~Si=1dp~i,Sid,bk.\displaystyle\sum_{S\in\left[k\right]^{d}}\widetilde{W}_{S}\prod_{i=1}^{d}\widetilde{p}_{i,S_{i}}\in\mathcal{L}_{d,b}^{k}.

So finally

S[k]dWSi=1dpi,SiR[k]dW~Rj=1dp~j,Rj\displaystyle\left\|\sum_{S\in\left[k\right]^{d}}W_{S}\prod_{i=1}^{d}p_{i,S_{i}}-\sum_{R\in\left[k\right]^{d}}\widetilde{W}_{R}\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|
S[k]dWSi=1dpi,SiR[k]dWRj=1dp~j,Rj+S[k]dWSi=1dp~i,SiR[k]dW~Rj=1dp~j,Rj\displaystyle\leq\left\|\sum_{S\in\left[k\right]^{d}}W_{S}\prod_{i=1}^{d}p_{i,S_{i}}-\sum_{R\in\left[k\right]^{d}}W_{R}\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|+\left\|\sum_{S\in\left[k\right]^{d}}W_{S}\prod_{i=1}^{d}\widetilde{p}_{i,S_{i}}-\sum_{R\in\left[k\right]^{d}}\widetilde{W}_{R}\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|
S[k]dWSi=1dpi,Sij=1dp~j,Sj+R[k]d|WRW~R|j=1dp~j,Rj\displaystyle\leq\sum_{S\in\left[k\right]^{d}}W_{S}\left\|\prod_{i=1}^{d}p_{i,S_{i}}-\prod_{j=1}^{d}\widetilde{p}_{j,S_{j}}\right\|+\sum_{R\in\left[k\right]^{d}}|W_{R}-\widetilde{W}_{R}|\left\|\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|
\displaystyle\leq\sum_{S\in\left[k\right]^{d}}W_{S}\sum_{i=1}^{d}\left\|p_{i,S_{i}}-\widetilde{p}_{i,S_{i}}\right\|+\left\|W-\widetilde{W}\right\|
\displaystyle\leq\sum_{S\in\left[k\right]^{d}}W_{S}\,d\,\frac{\varepsilon}{2d}+\frac{\varepsilon}{2}
\displaystyle=\varepsilon/2+\varepsilon/2=\varepsilon,

where we have used the product measure inequality \left\|\prod_{i=1}^{d}p_{i,S_{i}}-\prod_{i=1}^{d}\widetilde{p}_{i,S_{i}}\right\|\leq\sum_{i=1}^{d}\left\|p_{i,S_{i}}-\widetilde{p}_{i,S_{i}}\right\|, the fact that \left\|\prod_{j=1}^{d}\widetilde{p}_{j,R_{j}}\right\|=1, the bounds \left\|\widetilde{p}_{i,j}-p_{i,j}\right\|\leq\varepsilon/(2d) and \left\|W-\widetilde{W}\right\|\leq\varepsilon/2, and the fact that \sum_{S\in[k]^{d}}W_{S}=1. ∎
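This chain of inequalities can also be checked numerically: perturbing the core tensor by at most ε/2 and each p_{i,j} by at most ε/(2d) in ℓ1 keeps the resulting densities within ε. A minimal sketch follows (numpy assumed; the sizes d, k, b and the value of ε are arbitrary assumptions).

    import numpy as np
    from functools import reduce
    from itertools import product

    rng = np.random.default_rng(5)

    def rand_simplex(shape):
        x = rng.exponential(size=shape)
        return x / x.sum()

    def outer(vs):
        # Outer (tensor) product of a list of probability vectors.
        return reduce(lambda a, c: np.multiply.outer(a, c), vs)

    def tucker_density(W, p):
        # sum_{S in [k]^d} W_S prod_i p[i][S_i], represented as a tensor of bin weights.
        k, d, b = W.shape[0], W.ndim, len(p[0][0])
        T = np.zeros((b,) * d)
        for S in product(range(k), repeat=d):
            T += W[S] * outer([p[i][S[i]] for i in range(d)])
        return T

    d, k, b, eps = 3, 3, 6, 0.3
    W = rand_simplex((k,) * d)
    p = [[rand_simplex(b) for _ in range(k)] for _ in range(d)]   # p[i][j] plays the role of p_{i,j}

    # Perturb the core by at most eps/2 and every p_{i,j} by at most eps/(2d) in l1.
    W_t = (1 - eps / 4) * W + (eps / 4) * rand_simplex((k,) * d)
    p_t = [[(1 - eps / (4 * d)) * pij + (eps / (4 * d)) * rand_simplex(b) for pij in pi] for pi in p]

    diff = np.abs(tucker_density(W, p) - tucker_density(W_t, p_t)).sum()
    assert diff <= eps + 1e-12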

Proof of Theorem 2.1.

We will apply the estimator from Lemma 2.7 to a series of δ\delta-covers of d,bk\mathcal{H}_{d,b}^{k}. We begin by constructing a series of δ\delta-covers whose cardinalities do not grow too quickly. Corollary 2.1 states that, for all 0<δ10<\delta\leq 1, N(d,bk,δ)(4bdδ)bdk(4kδ)kN\left(\mathcal{H}_{d,b}^{k},\delta\right)\leq\left(\frac{4bd}{\delta}\right)^{bdk}\left(\frac{4k}{\delta}\right)^{k}. For sufficiently large bb and kk and sufficiently small δ\delta, the following holds

log((4bdδ)bdk(4kδ)k)\displaystyle\log\left(\left(\frac{4bd}{\delta}\right)^{bdk}\left(\frac{4k}{\delta}\right)^{k}\right) =bdklog(4bdδ)+klog(4kδ)\displaystyle=bdk\log\left(\frac{4bd}{\delta}\right)+k\log\left(\frac{4k}{\delta}\right)
=bdk[log(b)+log(4dδ)]+k[log(k)+log(4δ)]\displaystyle=bdk\left[\log\left(b\right)+\log\left(\frac{4d}{\delta}\right)\right]+k\left[\log\left(k\right)+\log\left(\frac{4}{\delta}\right)\right]
bdk[log(b)+log(b)log(4dδ)]+dk[log(k)+log(k)log(4dδ)]\displaystyle\leq bdk\left[\log\left(b\right)+\log\left(b\right)\log\left(\frac{4d}{\delta}\right)\right]+dk\left[\log\left(k\right)+\log\left(k\right)\log\left(\frac{4d}{\delta}\right)\right]
=(bklog(b)+klog(k))d(1+log(4dδ)).\displaystyle=\left(bk\log\left(b\right)+k\log\left(k\right)\right)d\left(1+\log\left(\frac{4d}{\delta}\right)\right). (11)

Using the argument from the proof of Lemma 2.7 we have that, because n/(bklog(b)+klog(k))n/(bk\log(b)+k\log(k))\to\infty, there exists a sequence of positive values C=C(n)C=C(n) such that CC\to\infty and n>C[bklog(b)+klog(k)]n>C\left[bk\log(b)+k\log(k)\right]. If we let δ=4dexp(Cd1)\delta=\frac{4d}{\exp\left(\frac{C}{d}-1\right)}, then δ0\delta\to 0 and

(bklog(b)+klog(k))d(1+log(4dδ))n.\displaystyle\left(bk\log\left(b\right)+k\log\left(k\right)\right)d\left(1+\log\left(\frac{4d}{\delta}\right)\right)\leq n.

Because of this we can construct collections of densities 𝒫~nd,bk\widetilde{\mathcal{P}}_{n}\subset\mathcal{H}_{d,b}^{k} such that 𝒫~n\widetilde{\mathcal{P}}_{n} is a δ\delta-covering of d,bk\mathcal{H}_{d,b}^{k} with |𝒫~n|\left|\widetilde{\mathcal{P}}_{n}\right|\to\infty, n/log|𝒫~n|n/\log\left|\widetilde{\mathcal{P}}_{n}\right|\to\infty and δ0\delta\to 0. Let VnV_{n} be the estimator from Lemma 2.7 applied to the sequence 𝒫~n\widetilde{\mathcal{P}}_{n}.
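The choice of δ above can be verified with a throwaway computation: with δ = 4d/exp(C/d − 1) the factor 1 + log(4d/δ) collapses to C/d, so the left-hand side of (11) equals C(bk log(b) + k log(k)), which is at most n. A sketch follows (numpy assumed; the specific values of b, k, d, and n are arbitrary).

    import numpy as np

    def lhs_11(b, k, d, delta):
        # Left-hand side of (11): (bk log b + k log k) * d * (1 + log(4d / delta)).
        return (b * k * np.log(b) + k * np.log(k)) * d * (1 + np.log(4 * d / delta))

    b, k, d, n = 32, 8, 6, 10**6
    C = 0.99 * n / (b * k * np.log(b) + k * np.log(k))   # any C with n > C * [bk log b + k log k]
    delta = 4 * d / np.exp(C / d - 1)
    assert lhs_11(b, k, d, delta) <= n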

Let ε>0\varepsilon>0 be arbitrary. Due to the way that we have constructed the sequence 𝒫~n\widetilde{\mathcal{P}}_{n}, for sufficiently large nn, we have that 3supqd,bkminq~𝒫~nqq~ε/23\sup_{q\in\mathcal{H}_{d,b}^{k}}\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|q-\widetilde{q}\right\|\leq\varepsilon/2. It therefore follows that, for sufficiently large nn, the following holds for all p𝒟dp\in\mathcal{D}_{d}

3minqd,bkpq+ε\displaystyle 3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon 3minqd,bkpq+3supqd,bkminq~𝒫~nqq~+ε/2\displaystyle\geq 3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+3\sup_{q\in\mathcal{H}_{d,b}^{k}}\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|q-\widetilde{q}\right\|+\varepsilon/2
3minqd,bk[pq+minq~𝒫~nqq~]+ε/2\displaystyle\geq 3\min_{q\in\mathcal{H}_{d,b}^{k}}\left[\left\|p-q\right\|+\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|q-\widetilde{q}\right\|\right]+\varepsilon/2
=3minqd,bkminq~𝒫~n[pq+qq~]+ε/2\displaystyle=3\min_{q\in\mathcal{H}_{d,b}^{k}}\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left[\left\|p-q\right\|+\left\|q-\widetilde{q}\right\|\right]+\varepsilon/2
3minq~𝒫~npq~+ε/2.\displaystyle\geq 3\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|p-\widetilde{q}\right\|+\varepsilon/2.

From this we have that, for sufficiently large nn

supp𝒟dP(Vnp>3minqd,bkpq+ε)supp𝒟dP(Vnp>3minq~𝒫~npq~+ε/2)\displaystyle\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\leq\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{\widetilde{q}\in\widetilde{\mathcal{P}}_{n}}\left\|p-\widetilde{q}\right\|+\varepsilon/2\right)

and the right side goes to zero due to Lemma 2.7, thus completing the proof. ∎

Proof of Theorem 2.2.

This proof is very similar to the proof of Theorem 2.1. We will apply the estimator from Lemma 2.7 to a series of δ\delta-covers of ~d,bk\widetilde{\mathcal{H}}_{d,b}^{k}. We begin by constructing a series of δ\delta-covers whose cardinalities do not grow too quickly. Corollary 2.2 states that, for all 0<δ10<\delta\leq 1, N(~d,bk,δ)(4bdδ)bdk(4kdδ)kdN\left(\widetilde{\mathcal{H}}_{d,b}^{k},\delta\right)\leq\left(\frac{4bd}{\delta}\right)^{bdk}\left(\frac{4k^{d}}{\delta}\right)^{k^{d}}. For sufficiently large bb and kk and sufficiently small δ\delta, the following holds

log((4bdδ)bdk(4kdδ)kd)\displaystyle\log\left(\left(\frac{4bd}{\delta}\right)^{bdk}\left(\frac{4k^{d}}{\delta}\right)^{k^{d}}\right) =bdklog(4bdδ)+kdlog(4kdδ)\displaystyle=bdk\log\left(\frac{4bd}{\delta}\right)+k^{d}\log\left(\frac{4k^{d}}{\delta}\right)
d(bklog(4bdδ)+kdlog(4kdδ))\displaystyle\leq d\left(bk\log\left(\frac{4bd}{\delta}\right)+k^{d}\log\left(\frac{4k^{d}}{\delta}\right)\right)
=d(bk(log(b)+log(4dδ))+kd(log(kd)+log(4δ)))\displaystyle=d\left(bk\left(\log(b)+\log\left(\frac{4d}{\delta}\right)\right)+k^{d}\left(\log\left(k^{d}\right)+\log\left(\frac{4}{\delta}\right)\right)\right)
d(bk(log(b)+log(4dδ))+kd(log(kd)+log(4dδ)))\displaystyle\leq d\left(bk\left(\log(b)+\log\left(\frac{4d}{\delta}\right)\right)+k^{d}\left(\log\left(k^{d}\right)+\log\left(\frac{4d}{\delta}\right)\right)\right)
(bklog(b)+kdlog(kd))d(1+log(4dδ)).\displaystyle\leq\left(bk\log(b)+k^{d}\log\left(k^{d}\right)\right)d\left(1+\log\left(\frac{4d}{\delta}\right)\right).

Note that the last line is exactly (11) from our proof of Theorem 2.1 with bklog(b)+klog(k)bk\log\left(b\right)+k\log\left(k\right) replaced by bklog(b)+kdlog(kd)bk\log\left(b\right)+k^{d}\log\left(k^{d}\right). From here we can proceed exactly as in the proof of Theorem 2.1, replacing d,bk\mathcal{H}_{d,b}^{k} with ~d,bk\widetilde{\mathcal{H}}_{d,b}^{k} and bklog(b)+klog(k)bk\log\left(b\right)+k\log\left(k\right) with bklog(b)+kdlog(kd)bk\log\left(b\right)+k^{d}\log\left(k^{d}\right). ∎
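To see how the two sample-complexity conditions compare, one can evaluate the term bk log(b) + k log(k) from Theorem 2.1 against the term bk log(b) + k^d log(k^d) appearing here; the short sketch below is illustration only (the specific values of b, k, and d are arbitrary assumptions).

    import numpy as np

    def lowrank_term(b, k):
        # Complexity term from Theorem 2.1 (nonnegative rank-k histograms).
        return b * k * np.log(b) + k * np.log(k)

    def tucker_term(b, k, d):
        # Complexity term from Theorem 2.2 (nonnegative Tucker histograms with a [k]^d core).
        return b * k * np.log(b) + k ** d * d * np.log(k)

    for d in (2, 4, 8):
        print(d, lowrank_term(64, 8), tucker_term(64, 8, d))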

Proof of Lemma 2.8.

Let ε>0\varepsilon>0. Theorem 5 in Chapter 2 of [12] states that, for any p𝒟dp\in\mathcal{D}_{d}, minhd,bph0\min_{h\in\mathcal{H}_{d,b}}\left\|p-h\right\|\to 0 as bb\to\infty, i.e. the bias of a histogram estimator goes to zero as the number of bins per dimension goes to infinity. Thus there exists a sufficiently large BB and a histogram hd,Bh\in\mathcal{H}_{d,B} which is a good approximation of pp, ph<ε/2\left\|p-h\right\|<\varepsilon/2. In this proof we will argue that, once kBdk\geq B^{d} and bb is sufficiently large, we can find an element of d,bk\mathcal{H}_{d,b}^{k} whose multi-view components approximate the BdB^{d} bins of hh.

We have that

h=A[B]×dwAi=1dh1,B,Ai.\displaystyle h=\sum_{A\in\left[B\right]^{\times d}}w_{A}\prod_{i=1}^{d}h_{1,B,A_{i}}.

From the same theorem there exists a0a_{0} such that, for all aa0a\geq a_{0} and all i[B]i\in[B], there exists h~1,a,i1,a\widetilde{h}_{1,a,i}\in\mathcal{H}_{1,a} such that h1,B,ih~1,a,i<ε/(2d)\left\|h_{1,B,i}-\widetilde{h}_{1,a,i}\right\|<\varepsilon/(2d). For any multi-index A[B]dA\in[B]^{d}, we define

h~d,a,A=j=1dh~1,a,Aj.\displaystyle\widetilde{h}_{d,a,A}=\prod_{j=1}^{d}\widetilde{h}_{1,a,A_{j}}.

Now we have that, for all aa0a\geq a_{0} and A[B]dA\in[B]^{d},

hd,B,Ah~d,a,A\displaystyle\left\|h_{d,B,A}-\widetilde{h}_{d,a,A}\right\| =i=1dh1,B,Aij=1dh~1,a,Aj\displaystyle=\left\|\prod_{i=1}^{d}h_{1,B,A_{i}}-\prod_{j=1}^{d}\widetilde{h}_{1,a,A_{j}}\right\|
i=1dh1,B,Aih~1,a,Ai\displaystyle\leq\sum_{i=1}^{d}\left\|h_{1,B,A_{i}}-\widetilde{h}_{1,a,A_{i}}\right\| (12)
dε2d\displaystyle\leq d\frac{\varepsilon}{2d}
=ε/2,\displaystyle=\varepsilon/2,

where we use the previously mentioned product measure inequality for (12). As soon as kBdk\geq B^{d} and aa0a\geq a_{0} the set d,ak\mathcal{H}_{d,a}^{k} contains the element,

h~A[B]×dwAh~d,a,A.\displaystyle\widetilde{h}\triangleq\sum_{A\in\left[B\right]^{\times d}}w_{A}\widetilde{h}_{d,a,A}.

Now we have that, for all aa0a\geq a_{0},

hh~\displaystyle\left\|h-\widetilde{h}\right\| =A[B]×dwAhd,B,AQ[B]×dwQh~d,a,Q\displaystyle=\left\|\sum_{A\in\left[B\right]^{\times d}}w_{A}h_{d,B,A}-\sum_{Q\in\left[B\right]^{\times d}}w_{Q}\widetilde{h}_{d,a,Q}\right\|
A[B]×dwAhd,B,Ah~d,a,A\displaystyle\leq\sum_{A\in\left[B\right]^{\times d}}w_{A}\left\|h_{d,B,A}-\widetilde{h}_{d,a,A}\right\|
ε/2.\displaystyle\leq\varepsilon/2.

From the triangle inequality we have that

ph~ph+hh~ε.\displaystyle\left\|p-\widetilde{h}\right\|\leq\left\|p-h\right\|+\left\|h-\widetilde{h}\right\|\leq\varepsilon.

So we have that, for sufficiently large bb and kk

minqd,bkpqε\displaystyle\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|\leq\varepsilon

which completes our proof. ∎
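The product measure inequality used in (12), \left\|\prod_{i=1}^{d}p_{i}-\prod_{i=1}^{d}q_{i}\right\|\leq\sum_{i=1}^{d}\left\|p_{i}-q_{i}\right\| for probability vectors, can also be checked numerically on discrete product distributions (the ℓ1 analogue). A minimal sketch, with numpy assumed and arbitrary sizes, follows.

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(1)

    def rand_simplex(m):
        x = rng.exponential(size=m)
        return x / x.sum()

    def product_measure(vs):
        # Outer (tensor) product of a list of probability vectors.
        return reduce(lambda a, c: np.multiply.outer(a, c), vs)

    d, m = 4, 10
    for _ in range(200):
        p = [rand_simplex(m) for _ in range(d)]
        q = [rand_simplex(m) for _ in range(d)]
        lhs = np.abs(product_measure(p) - product_measure(q)).sum()
        rhs = sum(np.abs(pi - qi).sum() for pi, qi in zip(p, q))
        assert lhs <= rhs + 1e-12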

Proof of Lemma 2.9.

We will show that d,bk~d,bk\mathcal{H}_{d,b}^{k}\subset\widetilde{\mathcal{H}}_{d,b}^{k}; the lemma then follows from Lemma 2.8. Any element of d,bk\mathcal{H}_{d,b}^{k} has a representation of the form

i=1kwij=1dfi,j:wΔk,fi,j1,b.\displaystyle\sum_{i=1}^{k}w_{i}\prod_{j=1}^{d}f_{i,j}:w\in\Delta_{k},f_{i,j}\in\mathcal{H}_{1,b}. (13)

Letting W𝒯d,kW\in\mathcal{T}_{d,k} be the tensor with Wi,,i=wiW_{i,\ldots,i}=w_{i} for all ii and all other entries zero, and letting f~j,i=fi,j\widetilde{f}_{j,i}=f_{i,j} for all i,ji,j, we have that

S[k]dWSj=1df~j,Sj\displaystyle\sum_{S\in[k]^{d}}W_{S}\prod_{j=1}^{d}\widetilde{f}_{j,S_{j}} =i=1kWi,,ij=1df~j,i\displaystyle=\sum_{i=1}^{k}W_{i,\ldots,i}\prod_{j=1}^{d}\widetilde{f}_{j,i}
=i=1kwij=1dfi,j\displaystyle=\sum_{i=1}^{k}w_{i}\prod_{j=1}^{d}f_{i,j}

so we have that (13) is an element of ~d,bk\widetilde{\mathcal{H}}_{d,b}^{k} and we are done. ∎
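This embedding is easy to verify numerically: placing the mixture weights on the diagonal of a core tensor and transposing the factor indices reproduces the rank-k mixture exactly. A small sketch (numpy assumed; the sizes are arbitrary assumptions) is below.

    import numpy as np
    from functools import reduce
    from itertools import product

    rng = np.random.default_rng(2)

    def rand_simplex(m):
        x = rng.exponential(size=m)
        return x / x.sum()

    def outer(vs):
        return reduce(lambda a, c: np.multiply.outer(a, c), vs)

    k, d, b = 3, 3, 5
    w = rand_simplex(k)
    f = [[rand_simplex(b) for _ in range(d)] for _ in range(k)]   # f[i][j] plays the role of f_{i,j}

    # Mixture (PARAFAC) form: sum_i w_i prod_j f_{i,j}.
    parafac = sum(w[i] * outer(f[i]) for i in range(k))

    # Tucker form with a diagonal core: W_{i,...,i} = w_i, zero elsewhere, and f_tilde_{j,i} = f_{i,j}.
    W = np.zeros((k,) * d)
    for i in range(k):
        W[(i,) * d] = w[i]
    tucker = sum(W[S] * outer([f[S[j]][j] for j in range(d)])
                 for S in product(range(k), repeat=d))

    assert np.allclose(parafac, tucker)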

Proof of Theorem 2.3.

We will proceed by contradiction. Suppose VnV_{n} is an estimator violating the theorem statement, i.e. there exist sequences bb\to\infty and kk\to\infty with n/(bk)0n/\left(bk\right)\to 0 and bkb\geq k such that, for all ε>0\varepsilon>0,

supp𝒟dP(Vnp>3minqd,bkpq+ε)0.\displaystyle\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\mathcal{H}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0.

Let (pn)n=1\left(p_{n}\right)_{n=1}^{\infty} be a sequence of probability vectors pnΔb(n)×k(n)p_{n}\in\Delta_{b(n)\times k(n)} which represent distributions over [b(n)]×[k(n)]\left[b(n)\right]\times\left[k(n)\right]. Let 𝒳n(Xn,1,,Xn,n)\mathcal{X}_{n}\triangleq\left(X_{n,1},\ldots,X_{n,n}\right) with Xn,1,,Xn,niidpnX_{n,1},\ldots,X_{n,n}\overset{iid}{\sim}p_{n}.

We will now construct a series of estimators for pnp_{n} using VnV_{n}. Let 𝒳~n=(X~n,1,,X~n,n)\widetilde{\mathcal{X}}_{n}=\left(\widetilde{X}_{n,1},\ldots,\widetilde{X}_{n,n}\right) be independent random variables with X~n,ihd,b,(Xn,i,1,,1)\widetilde{X}_{n,i}\sim h_{d,b,\left(X_{n,i},1,\ldots,1\right)}. For this proof we will assume d>2d>2; the proof simplifies in a straightforward manner to the d=2d=2 case by ignoring the indices and modes beyond the second. Note that Xn,iX_{n,i} contains two indices. Now we have the following for the densities of X~n,i\widetilde{X}_{n,i}

pX~n,i\displaystyle p_{\widetilde{X}_{n,i}} =(j,)[b]×[k]pX~n,i|Xn,i=(j,)P(Xn,i=(j,))\displaystyle=\sum_{(j,\ell)\in\left[b\right]\times\left[k\right]}p_{\widetilde{X}_{n,i}|X_{n,i}=\left(j,\ell\right)}P(X_{n,i}=\left(j,\ell\right))
=(j,)[b]×[k]hd,b,(j,,1,,1)pn(j,)\displaystyle=\sum_{(j,\ell)\in\left[b\right]\times\left[k\right]}h_{d,b,\left(j,\ell,1,\ldots,1\right)}p_{n}\left(j,\ell\right)
=[k]j[b]hd,b,(j,,1,,1)pn(j,)\displaystyle=\sum_{\ell\in\left[k\right]}\sum_{j\in\left[b\right]}h_{d,b,\left(j,\ell,1,\ldots,1\right)}p_{n}\left(j,\ell\right)
=[k]j[b]pn(j,)h1,b,jh1,b,a[d2]h1,b,1\displaystyle=\sum_{\ell\in\left[k\right]}\sum_{j\in\left[b\right]}p_{n}\left(j,\ell\right)h_{1,b,j}\otimes h_{1,b,\ell}\otimes\prod_{a\in[d-2]}h_{1,b,1}
=[k](j[b]pn(j,)h1,b,j)h1,b,a[d2]h1,b,1\displaystyle=\sum_{\ell\in\left[k\right]}\left(\sum_{j\in\left[b\right]}p_{n}\left(j,\ell\right)h_{1,b,j}\right)\otimes h_{1,b,\ell}\otimes\prod_{a\in[d-2]}h_{1,b,1} (14)
=[k](q[b]pn(q,))(j[b]pn(j,)q[b]pn(q,)h1,b,j)h1,b,a[d2]h1,b,1.\displaystyle=\sum_{\ell\in\left[k\right]}\left(\sum_{q\in\left[b\right]}p_{n}\left(q,\ell\right)\right)\left(\sum_{j\in\left[b\right]}\frac{p_{n}\left(j,\ell\right)}{\sum_{q\in\left[b\right]}p_{n}\left(q,\ell\right)}h_{1,b,j}\right)\otimes h_{1,b,\ell}\otimes\prod_{a\in[d-2]}h_{1,b,1}. (15)

This last line is in the form of (5) in the main text and is thus an element of d,bk\mathcal{H}_{d,b}^{k}. To see this, we give the correspondence between the terms in (15) here and the terms in (5) in the main text:

w\displaystyle w_{\ell} :=(q[b]pn(q,))\displaystyle:=\left(\sum_{q\in\left[b\right]}p_{n}\left(q,\ell\right)\right)
f,1\displaystyle f_{\ell,1} :=(j[b]pn(j,)q[b]pn(q,)h1,b,j)\displaystyle:=\left(\sum_{j\in\left[b\right]}\frac{p_{n}\left(j,\ell\right)}{\sum_{q\in\left[b\right]}p_{n}\left(q,\ell\right)}h_{1,b,j}\right)
f,2\displaystyle f_{\ell,2} :=h1,b,\displaystyle:=h_{1,b,\ell}
fi,j\displaystyle f_{i,j} :=h1,b,1,j>2,i.\displaystyle:=h_{1,b,1},\forall j>2,\forall i.
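As a concrete check of this correspondence (a sketch, not part of the proof; numpy assumed, sizes arbitrary), the discrete analogue below builds the tensor of bin weights from an arbitrary distribution p_n on [b]×[k] using exactly the w_ℓ and f_{ℓ,1} defined above, with standard basis vectors standing in for the one-dimensional histogram basis densities h_{1,b,i}, and recovers p_n in the slice A=(1,…,1), as in (14).

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(3)
    b, k, d = 6, 4, 3            # k <= b and d >= 2
    p = rng.exponential(size=(b, k))
    p /= p.sum()                 # p_n: a distribution on [b] x [k]

    def e(i, m):
        # Standard basis vector: discrete stand-in for the histogram basis density h_{1,b,i}.
        v = np.zeros(m)
        v[i] = 1.0
        return v

    def outer(vs):
        return reduce(lambda a, c: np.multiply.outer(a, c), vs)

    T = np.zeros((b,) * d)
    for l in range(k):
        w_l = p[:, l].sum()              # w_l    := sum_q p_n(q, l)
        f_l1 = p[:, l] / w_l             # f_{l,1}: the normalized l-th column of p_n
        T += w_l * outer([f_l1, e(l, b)] + [e(0, b)] * (d - 2))

    # The slice at A = (1, ..., 1) (index 0 here) recovers p_n.
    assert np.allclose(T[(slice(None), slice(0, k)) + (0,) * (d - 2)], p)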

Let VnV_{n} estimate P~npX~n,i\widetilde{P}_{n}\triangleq p_{\widetilde{X}_{n,i}} so X~n,1,,X~n,niidP~n\widetilde{X}_{n,1},\ldots,\widetilde{X}_{n,n}\overset{iid}{\sim}\widetilde{P}_{n}. We will use VnV_{n} to construct an estimator vnv_{n} for pnp_{n}.

Because P~nd,bk\widetilde{P}_{n}\in\mathcal{H}_{d,b}^{k} for all nn, we have that VnP~np 0\left\|V_{n}-\widetilde{P}_{n}\right\|\,{\buildrel p\over{\rightarrow}}\,0 and thus Ud,b1(Vn)Ud,b1(P~n)p 0\left\|U_{d,b}^{-1}(V_{n})-U_{d,b}^{-1}(\widetilde{P}_{n})\right\|\,{\buildrel p\over{\rightarrow}}\,0. (We will use this portion of the proof again in our proof of Theorem 2.4.) Note that [Ud,b1(P~n)]j,,A=pn(j,)\left[U_{d,b}^{-1}(\widetilde{P}_{n})\right]_{j,\ell,A}=p_{n}(j,\ell) when A=(1,,1)A=\left(1,\ldots,1\right) and zero otherwise (see (14)). We define the linear operator Bn:𝒯d,bΔb×kB_{n}:\mathcal{T}_{d,b}\to\Delta_{b\times k} as

[Bn(T)]j,A[b]×d2Tj,,A\displaystyle\left[B_{n}(T)\right]_{j,\ell}\triangleq\sum_{A\in\left[b\right]^{\times d-2}}T_{j,\ell,A}

i.e. the linear operator which sums out all modes except for the first two. We have that Bn(Ud,b1(P~n))=pnB_{n}(U_{d,b}^{-1}(\widetilde{P}_{n}))=p_{n}. Now let vn=Bn(Ud,b1(Vn))v_{n}=B_{n}(U_{d,b}^{-1}(V_{n})) be our estimator for pnp_{n}. Then we have that

vnpn=Bn(Ud,b1(P~n))Bn(Ud,b1(Vn))=Bn(Ud,b1(P~nVn)).\left\|v_{n}-p_{n}\right\|=\left\|B_{n}(U_{d,b}^{-1}(\widetilde{P}_{n}))-B_{n}(U_{d,b}^{-1}(V_{n}))\right\|=\left\|B_{n}(U_{d,b}^{-1}(\widetilde{P}_{n}-V_{n}))\right\|.

We have that BnB_{n} is a nonexpansive operator due to the triangle inequality,

Bn(T)=j,l|A[b]×d2Tj,,A|j,lA[b]×d2|Tj,,A|=T,\displaystyle\left\|B_{n}\left(T\right)\right\|=\sum_{j,l}\left|\sum_{A\in\left[b\right]^{\times d-2}}T_{j,\ell,A}\right|\leq\sum_{j,l}\sum_{A\in\left[b\right]^{\times d-2}}\left|T_{j,\ell,A}\right|=\left\|T\right\|,

so the operator norm of BnB_{n} is less than or equal to one. We also know that Ud,b1U^{-1}_{d,b} is an isometry and P~nVnp 0\left\|\widetilde{P}_{n}-V_{n}\right\|\,{\buildrel p\over{\rightarrow}}\,0, so it follows that vnpnp 0\left\|v_{n}-p_{n}\right\|\,{\buildrel p\over{\rightarrow}}\,0 for any sequence of pnΔ[b(n)]×[k(n)]p_{n}\in\Delta_{\left[b(n)\right]\times\left[k(n)\right]}. We will now use the following theorem from [13] to show that no such estimator vnv_{n} can exist.

Theorem A.1 ([13] Theorem 2.).

For any ζ(0,1]\zeta\in\left(0,1\right], we have

infp^suppΔa𝔼pp^p18ea(1+ζ)n𝟙((1+ζ)na>e16)\displaystyle\inf_{\hat{p}}\sup_{p\in\Delta_{a}}\mathbb{E}_{p}\left\|\hat{p}-p\right\|\geq\frac{1}{8}\sqrt{\frac{ea}{\left(1+\zeta\right)n}}\mathbbm{1}\left(\frac{\left(1+\zeta\right)n}{a}>\frac{e}{16}\right)
+exp(2(1+ζ)na)𝟙((1+ζ)nae16)exp(ζ2n24)12exp(ζ2a32(loga)2)\displaystyle\quad+\exp\left(-\frac{2\left(1+\zeta\right)n}{a}\right)\mathbbm{1}\left(\frac{\left(1+\zeta\right)n}{a}\leq\frac{e}{16}\right)-\exp\left(-\frac{\zeta^{2}n}{24}\right)-12\exp\left(-\frac{\zeta^{2}a}{32\left(\log a\right)^{2}}\right)

where the infimum is over all estimators.

Estimating pnp_{n} is equivalent to estimating a categorical distribution with a=bka=bk categories. Letting ζ=1\zeta=1, bkbk\to\infty, and nn\to\infty, with n/(bk)0n/\left(bk\right)\to 0, we get that for sufficiently large nn

infp^suppΔbk𝔼pp^pexp(4nbk)exp(n24)12exp(bk32(logbk)2)\displaystyle\inf_{\hat{p}}\sup_{p\in\Delta_{bk}}\mathbb{E}_{p}\left\|\hat{p}-p\right\|\geq\exp\left(-\frac{4n}{bk}\right)-\exp\left(-\frac{n}{24}\right)-12\exp\left(-\frac{bk}{32\left(\log bk\right)^{2}}\right)

whose right hand side converges to 1. From this we get that

lim infnsuppnΔbk𝔼pnvnpn>12\displaystyle\liminf_{n\to\infty}\sup_{p_{n}\in\Delta_{bk}}\mathbb{E}_{p_{n}}\left\|v_{n}-p_{n}\right\|>\frac{1}{2}

which contradicts vnpnp 0\left\|v_{n}-p_{n}\right\|\,{\buildrel p\over{\rightarrow}}\,0 for arbitrary sequences pnp_{n}. ∎
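For intuition (illustration only; the scaling bk = n^2 is an arbitrary assumption satisfying n/(bk) → 0), the right-hand side of the specialized lower bound above can be evaluated numerically and is seen to approach 1.

    import numpy as np

    for n in (10**3, 10**4, 10**5, 10**6):
        bk = n ** 2   # any choice with n / (bk) -> 0
        rhs = (np.exp(-4 * n / bk)
               - np.exp(-n / 24)
               - 12 * np.exp(-bk / (32 * np.log(bk) ** 2)))
        print(n, rhs)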

Proof of Theorem 2.4.

We will proceed by contradiction. Suppose VnV_{n} is an estimator violating the theorem statement, i.e. there exist sequences bb\to\infty and kk\to\infty with n/(bk+kd)0n/\left(bk+k^{d}\right)\to 0 and bkb\geq k such that, for all ε>0\varepsilon>0,

supp𝒟dP(Vnp>3minq~d,bkpq+ε)0.\displaystyle\sup_{p\in\mathcal{D}_{d}}P\left(\left\|V_{n}-p\right\|>3\min_{q\in\widetilde{\mathcal{H}}_{d,b}^{k}}\left\|p-q\right\|+\varepsilon\right)\to 0.

Since n/(bk+kd)n/(bk+k^{d})\to 0, we have that (bk+kd)/n(bk+k^{d})/n\to\infty, so there is a subsequence nin_{i} such that b(ni)k(ni)/nib(n_{i})k(n_{i})/n_{i}\to\infty or k(ni)d/nik(n_{i})^{d}/n_{i}\to\infty, or equivalently ni/(b(ni)k(ni))0n_{i}/(b(n_{i})k(n_{i}))\to 0 or ni/k(ni)d0n_{i}/k(n_{i})^{d}\to 0. We will show that both cases lead to a contradiction. From now on we implicitly treat bb and kk as functions of nin_{i} when taking limits.
Case ni/(bk)0n_{i}/(bk)\to 0: We proceed similarly to the proof of Theorem 2.3. Let (pn)n=1\left(p_{n}\right)_{n=1}^{\infty}, P~n\widetilde{P}_{n}, and 𝒳n\mathcal{X}_{n} be defined as in the proof of Theorem 2.3. Note that d,bk~d,bk\mathcal{H}_{d,b}^{k}\subset\widetilde{\mathcal{H}}_{d,b}^{k} (see proof of Lemma 2.9) and thus P~n~d,bk\widetilde{P}_{n}\in\widetilde{\mathcal{H}}_{d,b}^{k}. We can proceed exactly as in our proof of Theorem 2.3, starting from the observation that P~nd,bk\widetilde{P}_{n}\in\mathcal{H}_{d,b}^{k}, by simply replacing d,bk\mathcal{H}_{d,b}^{k} with ~d,bk\widetilde{\mathcal{H}}_{d,b}^{k} and nn with nin_{i}, which finishes this case.
Case ni/kd0n_{i}/k^{d}\to 0: Let (pn)n=1(p_{n})_{n=1}^{\infty} be a sequence of elements in 𝒯d,k\mathcal{T}_{d,k} which represent distributions over [k]d[k]^{d}. Let 𝒳n(Xn,1,,Xn,n)\mathcal{X}_{n}\triangleq\left(X_{n,1},\ldots,X_{n,n}\right) with Xn,1,,Xn,niidpnX_{n,1},\ldots,X_{n,n}\overset{iid}{\sim}p_{n}. Let 𝒳~n=(X~n,1,,X~n,n)\widetilde{\mathcal{X}}_{n}=\left(\widetilde{X}_{n,1},\ldots,\widetilde{X}_{n,n}\right) be independent random variables with X~n,ihd,b,Xn,i\widetilde{X}_{n,i}\sim h_{d,b,X_{n,i}}. Let P~n\widetilde{P}_{n} be the density for X~n,i\widetilde{X}_{n,i}. Note that kbk\leq b. So we have that

P~n\displaystyle\widetilde{P}_{n} =S[k]dpX~n,i|Xn,i=SP(Xn,i=S)\displaystyle=\sum_{S\in[k]^{d}}p_{\widetilde{X}_{n,i}|X_{n,i}=S}P(X_{n,i}=S)
=S[k]dhd,b,Spn(S)\displaystyle=\sum_{S\in[k]^{d}}h_{d,b,S}p_{n}(S)
=S[k]dpn(S)i=1dh1,b,Si\displaystyle=\sum_{S\in[k]^{d}}p_{n}(S)\prod_{i=1}^{d}h_{1,b,S_{i}}

and thus P~n~d,bk\widetilde{P}_{n}\in\widetilde{\mathcal{H}}_{d,b}^{k}. Proceeding as in the proof of Theorem 2.3, we obtain from VnV_{n} an estimator for elements of 𝒯d,k\mathcal{T}_{d,k}, which is equivalent to estimating elements of Δkd\Delta_{k^{d}}; this is impossible since ni/kd0n_{i}/k^{d}\to 0. ∎