Improving Nonparametric Density Estimation with Tensor Decompositions
Abstract
While nonparametric density estimators often perform well on low-dimensional data, their performance can suffer when applied to higher-dimensional data, owing presumably to the curse of dimensionality. One technique for avoiding this is to assume that the features are independent, i.e. that the data are sampled from a separable density. This allows one to estimate each marginal distribution independently, thereby avoiding the slow rates associated with estimating the full joint density. This strategy is employed in naive Bayes models and is analogous to estimating a rank-one tensor. In this paper we investigate whether these improvements can be extended to other simplified dependence assumptions, which we model via nonnegative tensor decompositions. In our central theoretical results we prove that restricting estimation to low-rank nonnegative PARAFAC or Tucker decompositions removes the dimensionality exponent on bin width rates for multidimensional histograms. These results are validated experimentally with high statistical significance via direct application of an existing nonnegative tensor factorization to histogram estimators.
1 Introduction
Nonparametric density estimation is the task of estimating a density from data without assuming that it belongs to some parametric class of densities, e.g. the space of multivariate Gaussian distributions. Some common nonparametric density estimators include the histogram estimator and the kernel density estimator (KDE). While nonparametric density estimation has been shown to be effective for many tasks, it has been observed empirically that estimator performance typically declines as data dimensionality increases, a manifestation of the curse of dimensionality. For the histogram and KDE this phenomenon also has a precise mathematical analog: these estimators achieve universal consistency iff $h_n \to 0$ and $n h_n^d \to \infty$, where $n$ is the number of samples, $d$ is the data dimension, and $h_n$ is the bin width for the histogram and the bandwidth parameter for the KDE [12].
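To illustrate how restrictive the condition $n h_n^d \to \infty$ becomes in high dimensions, consider a back-of-the-envelope calculation (our own illustrative numbers, not part of the analysis below): for a density supported on the unit cube, a bin width of $h = 0.1$ yields $h^{-d} = 10^{d}$ bins, so for the average number of samples per bin to be large one needs
$$d = 2:\; n \gg 10^{2}, \qquad d = 5:\; n \gg 10^{5}, \qquad d = 10:\; n \gg 10^{10}.$$
Shrinking the bin width further, as consistency also requires, only worsens this exponential dependence on $d$.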
One of the most common approaches to alleviating the curse of dimensionality is dimensionality reduction. A dimensionality reduction technique typically attempts to transform a high-dimensional representation into a lower-dimensional one that removes dependencies within the data. Feature selection techniques, for example, usually explicitly remove highly dependent features [30, 7]. In PCA one finds the best low-dimensional affine subspace for approximating the data. If the PCA model fits well, it implies that, given some of the features of a sample, there exists a good linear prediction of the remaining features. The manifold hypothesis [6], which is often touted as a general explanation for why high-dimensional problems are learnable [5], can also be viewed as a model of dependence. For example, if we assume that a random vector $(X_1, X_2)$ lies on the one-dimensional sphere $S^1 \subset \mathbb{R}^2$ and $X_1$ is known, then $X_2$ can only assume one of the two values $\pm\sqrt{1 - X_1^2}$. More generally, the assumption that the data lie on a manifold implies local dependence: any sufficiently small region of the support manifold can be well approximated by an affine subspace, and thus there exists a linear dependence between the features, as in PCA. Interestingly, assuming that features are not dependent yields another approach to overcoming the curse of dimensionality.
In a naive Bayes model one assumes no dependence between features, i.e. our target density $p$ is separable,
$$p(x_1, \ldots, x_d) = \prod_{j=1}^{d} p_j(x_j). \qquad (1)$$
In order to estimate $p$ one can now simply estimate the marginal distributions $p_j$ and multiply them. Because the dimensionality of each marginal is one, one can use a histogram or KDE whose bin width/bandwidth need only satisfy the one-dimensional conditions ($h_n \to 0$ and $n h_n \to \infty$) while preserving consistent estimation, thereby circumventing the curse of dimensionality. The separability assumption is rarely satisfied in practice, so naive Bayes models are typically not used for density estimation directly, but may be used for some other task such as classification via a likelihood ratio test.
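As a minimal illustrative sketch (our own code, not the implementation used later in the paper), the separable model can be estimated on $[0,1]^d$ by fitting one one-dimensional histogram per feature and multiplying the resulting marginal densities; `b` denotes the number of bins per dimension:

```python
import numpy as np

def fit_marginal_histograms(X, b):
    """Fit one 1-D histogram per feature for data X in [0, 1]^d; returns a (d, b) array of bin weights."""
    n, d = X.shape
    bins = np.clip((X * b).astype(int), 0, b - 1)  # bin index of each sample, per feature
    return np.stack([np.bincount(bins[:, j], minlength=b) / n for j in range(d)])

def naive_bayes_density(weights, x):
    """Evaluate the product-of-marginals (separable) histogram estimate at a point x in [0, 1]^d."""
    d, b = weights.shape
    idx = np.clip((x * b).astype(int), 0, b - 1)
    # each 1-D histogram has height b * weight on its bin; the joint density is the product
    return float(np.prod([b * weights[j, idx[j]] for j in range(d)]))

# usage sketch:
# X = np.random.rand(500, 3); W = fit_marginal_histograms(X, b=10)
# naive_bayes_density(W, np.array([0.5, 0.2, 0.9]))
```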
In this paper we consider relaxations of the naive Bayes model based on the assumption that a density is a mixture of separable densities. Our models are inspired by nonnegative tensor factorizations, so we term them generally nonparametric nonnegative tensor factorization (NNTF) models. The first model is related to nonnegative PARAFAC [28] and is commonly known as a multi-view model in the statistics and machine learning literature:
$$p(x_1, \ldots, x_d) = \sum_{i=1}^{k} w_i \prod_{j=1}^{d} p_{i,j}(x_j), \qquad w_i \ge 0, \quad \sum_{i=1}^{k} w_i = 1. \qquad (2)$$
This is equivalent to the assumption that the features are independent conditioned on an unobserved discrete random variable taking on $k$ values. Our second model is based on the nonnegative Tucker decomposition [17]. In this model it is assumed that there are $d$ collections of $k$ one-dimensional densities, $p_{j,1}, \ldots, p_{j,k}$ for $j = 1, \ldots, d$, and some probability measure which randomly selects one density from each collection. This measure can be represented by a tensor $W$ with nonnegative entries summing to one, where $W_{i_1, \ldots, i_d}$ is the probability of selecting $(p_{1, i_1}, \ldots, p_{d, i_d})$. To sample from this model we first randomly select the marginal distributions according to $W$ and then independently sample each feature according to the randomly selected marginal distribution for that feature. The density of this model is
$$p(x_1, \ldots, x_d) = \sum_{i_1=1}^{k} \cdots \sum_{i_d=1}^{k} W_{i_1, \ldots, i_d} \prod_{j=1}^{d} p_{j, i_j}(x_j). \qquad (3)$$
We are unaware of previous literature investigating this model so we will simply term it the Tucker model.
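To make the generative process behind (3) concrete, the following is a hedged sketch (names and array shapes are our own, not the paper's) of sampling from the Tucker model: the core probability tensor `W` jointly selects one marginal index per feature, and each feature is then drawn from the selected one-dimensional histogram density.

```python
import numpy as np

def sample_tucker_model(W, marginals, n, seed=None):
    """
    W         : core probability tensor of shape (k,) * d with entries summing to one.
    marginals : array of shape (d, k, b); marginals[j, i] holds the bin weights of the
                i-th candidate 1-D histogram density for feature j (bins on [0, 1]).
    Returns n samples from the corresponding Tucker-model density on [0, 1]^d.
    """
    rng = np.random.default_rng(seed)
    d, k, b = marginals.shape
    flat = rng.choice(W.size, size=n, p=W.ravel())                # jointly pick one marginal per feature
    modes = np.stack(np.unravel_index(flat, W.shape), axis=1)     # shape (n, d), entries in {0, ..., k-1}
    X = np.empty((n, d))
    for s in range(n):
        for j in range(d):
            bin_j = rng.choice(b, p=marginals[j, modes[s, j]])    # pick a bin of the selected marginal ...
            X[s, j] = (bin_j + rng.random()) / b                  # ... then sample uniformly within it
    return X
```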
In Section 2 we prove that there exists a trade-off between the rate on the bin width and the number of components $k$: to control estimation error (for an estimator restricted to a space of densities, the estimation error refers to the difference between the error of the estimate and the best error achievable within that space, measured with respect to the target density) the number of bins per dimension must not grow too quickly relative to $n$ and $k$, with separate admissible rates for the multi-view and Tucker histograms (both rates ignore logarithmic factors). Note that for the multi-view histogram this rate does not depend on the dimension $d$. Allowing the bin width to shrink as aggressively as possible (which we pay for with a slow rate on $k$) we show that there exist universally consistent histogram estimators whose bin width rate carries no exponential dependence on the dimension, approximately attaining the rates possible for densities of the form in (1) while still controlling estimation error. We show that these are approximately the fastest possible rates via matching lower bounds. In Section 3 we show that we can use an existing low-rank nonnegative Tucker factorization algorithm to fit our model and demonstrate empirically that fitting histograms to a Tucker model outperforms the standard histogram estimator with very high statistical significance.
While this paper focuses on histogram estimation and presents a promising, readily implementable improvement to the standard histogram estimator in Section 3, its primary purpose is to showcase the potential of utilizing concepts from nonnegative tensor factorization to improve performance in nonparametric statistical methods.
1.1 Previous Work
Nonparametric density estimation has been extensively studied, with the histogram estimator and KDE being by far the most well known. There do exist, however, alternative methods for density estimation, e.g. the forest density estimator [19] and the $k$-nearest neighbor density estimator [20]. The convergence of the histogram and KDE has been studied extensively [12, 8, 33, 14]. The KDE is generally accepted as being the superior density estimator, with some mathematical justification [29]. Numerous modifications and extensions of the KDE have been proposed, including variable bandwidths [32], robust KDEs [15, 34, 35], and methods for enforcing support boundary constraints [27]. Finally we mention one recent paper [16] which demonstrated that the uniform convergence of a KDE to its population estimate suffered when the intrinsic dimension of the data was lower than the ambient dimension, a phenomenon seemingly at odds with the curse of dimensionality.
For our review of NNTF models we also include a general review of tensor/matrix factorizations, since both can be viewed as low-rank models. In particular, for the multi-view model we have
$$P = \sum_{i=1}^{k} w_i \, v_{i,1} \otimes \cdots \otimes v_{i,d}, \qquad (4)$$
a rank-$k$ nonnegative decomposition in which $P$ is a probability tensor, the $v_{i,j}$ are probability vectors, and the weights $w_i$ are nonnegative and sum to one.
We further note that for histogram estimation, once data has been assigned to bins, finding a good multi-view or Tucker histogram is analogous to estimating a probability tensor with a nonnegative factorization (we show this rigorously in Section 3).
A great deal of work has gone into leveraging low-rank assumptions to improve matrix estimation, particularly in the field of compressed sensing [10, 24]. In compressed sensing one has access to a collection of random linear measurements of an unknown low-rank matrix to be estimated. Fitting a matrix to these measurements with a nuclear-norm-regularized optimization problem achieves estimation bounds better than those possible without the low-rank assumption. These techniques have been effectively applied to problems such as matrix completion, multivariate regression, and estimating autoregressive models [21, 22]. Unfortunately such techniques do not extend to histogram estimation because, in the density estimation setting, data are not linearly sampled from the target model. Furthermore, it is not clear how to extend compressed sensing techniques to general tensors.
General matrix/tensor factorizations, including nonnegative matrix/tensor factorizations, have been extensively studied despite their inherent difficulty due to non-convexity. The works [9, 3] present potential theoretical grounds for avoiding the computational difficulties of nonnegative matrix factorization. Some algorithms for finding nonnegative tensor factorizations are mentioned in Section 3. One notable approach to tensor factorization is to assume, in the tensor representation in (4), that the collections of vectors $v_{1,j}, \ldots, v_{k,j}$ are linearly independent for each mode $j$. Under this assumption we are guaranteed that the factorization (4) is unique [1]. In [2] the authors present a method for recovering this factorization efficiently and demonstrate its utility for a variety of tasks. This work was extended in [31] to recover a multi-view KDE satisfying an analogous linear independence assumption; this is the only work of its type of which we are aware. In that work the authors investigate the sample complexity of their estimator but do not demonstrate that their technique has potential for improving rates for nonparametric density estimation in general. Finally we note that nonparametric applications of the Tucker decomposition have been utilized in Bayesian statistics [26]. We are unaware of any literature describing the model we introduce in (3).
2 Theoretical Results
In this section we mathematically demonstrate that histogram estimators can achieve greater estimation accuracy by restricting to NNTF models. To simplify analysis we only consider densities on $[0,1]^d$ and analyze the number of bins per dimension $b$, which is the inverse of the bin width, i.e. $b = 1/h$. We prove that there exists a trade-off between the rates on $b$ and $k$. Furthermore we characterize the approximately fastest possible rate on $b$ that still uniformly controls estimator variance and retains universal consistency. Before proving these results we must introduce a fair amount of notation.
2.1 Notation
All norms in Section 2 are the $\ell_1$, $L_1$, or total variation norm for vectors/tensors, densities, and measures respectively. Note that these norms are analogous, e.g. the $L_1$ norm of a probability density function is the same as the total variation norm on the probability measure associated with the density. Let be the set of all densities on $[0,1]^d$. By density we mean probability measures that are absolutely continuous with respect to the $d$-dimensional Lebesgue measure. We define a probability vector or probability tensor to simply mean a vector or tensor whose entries are nonnegative and sum to one. Let denote the set of probability vectors in and the set of probability tensors in . Let be the set of tensors which are a convex combination of separable probability tensors, i.e.
In this paper, the product symbol will always mean the standard outer product, e.g. set product or tensor product. For any natural number $m$, let $[m] = \{1, \ldots, m\}$. For a multi-index we define the corresponding indicator to be the element whose entry at that index is one and whose other entries are zero.
The following is the set of probability tensors constructed via a nonnegative Tucker factorization
We will now construct the space of histograms on $[0,1]^d$. We begin with one-dimensional histograms. For each bin index we define the one-dimensional histogram where all weight is allocated to that bin. Formally we define this as
Note that this is a valid density due to the leading coefficient. We use these to construct higher-dimensional histograms. For a multi-index , let
the histogram whose entire density is allocated to the bin indexed by . Finally we define to be the support of , i.e. the “bins” of a histogram estimator,
For a sequence of points in , , the standard histogram estimator is
Let denote the set of all $d$-dimensional histograms with $b$ bins per dimension. Let the set of histograms with at most $k$ separable components be
$$\left\{\, \sum_{i=1}^{k} w_i \, f_{i,1} \otimes \cdots \otimes f_{i,d} \;:\; w \text{ a probability vector in } \mathbb{R}^k,\ f_{i,j} \text{ one-dimensional histograms with } b \text{ bins} \,\right\}. \qquad (5)$$
We will refer to elements of this space as multi-view histograms. Analogously, we define the space of Tucker histograms to be
We emphasize that the collections of multi-view and Tucker histograms are the primary objects of interest in this paper, as they represent NNTF histograms. The theoretical results we present are concerned with finding good density estimators restricted to these sets as $b$ and $k$ vary.
Note that there exists a linear isometry $T$ between probability tensors and histograms, defined as
The inverse function $T^{-1}$ simply transforms a histogram into the tensor representing its bin weights, and $T$ performs the reverse transformation. Note that $T$ is also a bijection between the NNTF tensor classes and the corresponding NNTF histogram classes. Much of our analysis on histograms will be performed on the space of probability tensors, with the analysis translated to histograms via this operator.
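As a small hedged sketch (our own function names; the paper only defines $T$ abstractly), the correspondence can be made concrete: a probability tensor of bin weights maps to the piecewise-constant density it represents, and data can be mapped the other way into the bin-weight tensor of the standard histogram estimator.

```python
import numpy as np

def tensor_to_histogram(W):
    """T: map a probability tensor W of shape (b,) * d to the piecewise-constant density on [0, 1]^d."""
    b, d = W.shape[0], W.ndim
    def density(x):
        idx = tuple(np.clip((np.asarray(x) * b).astype(int), 0, b - 1))
        return (b ** d) * W[idx]   # each bin has volume b^{-d}, so the height is b^d * weight
    return density

def standard_histogram_tensor(X, b):
    """T^{-1} applied to the standard histogram estimator: empirical bin weights of the data X."""
    d = X.shape[1]
    counts, _ = np.histogramdd(X, bins=b, range=[(0.0, 1.0)] * d)
    return counts / len(X)
```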
For a set of vectors we define , i.e. the set of convex combinations of collections of vectors from the set. We define the covering number to be the minimum cardinality of a subset which $\varepsilon$-covers the set (with closed balls) with respect to the metric. It will be clear from context whether the norm is the $\ell_1$, $L_1$, or total variation norm.
2.2 Preliminary Results
For brevity the main text only contains a full proof of Lemma 2.7; the remaining full proofs can be found in the appendix. We include general descriptions of the proof techniques for the multi-view histogram results; these are similar to the techniques we use for Tucker histograms. Our general proof technique is to find good covers of the relevant spaces of densities and then apply an existing algorithm for selecting a good estimator from a finite collection of densities given data. We begin by establishing a covering number bound on the space of probability vectors via an adaptation of a standard result presented in [8].
Lemma 2.1.
For all we have that .
We can extend this to a covering number for the space of separable probability tensors.
Lemma 2.2.
For all we have that .
Proof sketch.
Now we establish the following lemma for covering numbers of mixtures of densities.
Lemma 2.3.
Let be a set of probability measures, then
Proof sketch.
Use Lemma 2.1 to construct different weightings of elements from an -cover of . ∎
By combining Lemma 2.3 with Lemma 2.2 we arrive at covering numbers for the space of multi-view probability tensors.
Lemma 2.4.
For all the following holds .
Through application of the operator we now have a characterization of the complexity of the space .
Corollary 2.1.
For all following holds
The following are analogous results for Tucker histograms.
Lemma 2.5.
For all the following holds .
Corollary 2.2.
For all following holds
The following lemma from [4] provides us with a way to choose good estimators from finite collections of densities. It can be proven by applying a Chernoff bound to Theorem 6.3 of [8].
Lemma 2.6 ([4]).
There exists a deterministic algorithm that, given a collection of distributions , a parameter and at least iid samples from an unknown distribution , outputs an index such that
with probability at least .
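Lemma 2.6 is stated abstractly, so as a rough, hedged illustration only (the actual algorithm in [4] and its constants differ in detail), a Scheffé-style tournament for selecting among finitely many candidate densities that are piecewise constant on a common grid could look as follows:

```python
import numpy as np

def scheffe_select(candidates, empirical):
    """
    candidates : list of probability tensors of a common shape, one per candidate density.
    empirical  : probability tensor of empirical bin frequencies computed from the sample.
    Returns the index of the candidate winning the most pairwise Scheffe comparisons.
    """
    m = len(candidates)
    wins = np.zeros(m, dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            A = candidates[i] > candidates[j]                      # Scheffe set of the pair (i, j)
            err_i = abs(candidates[i][A].sum() - empirical[A].sum())
            err_j = abs(candidates[j][A].sum() - empirical[A].sum())
            wins[i if err_i <= err_j else j] += 1                  # winner is closer to the empirical measure on A
    return int(np.argmax(wins))
```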
We present the following asymptotic version of the previous lemma and include its full proof. We highlight the technique of choosing sufficiently slow rates on parameters in order to establish asymptotic results, a technique which we will use in later proofs.
Lemma 2.7.
Let be a sequence of finite collections of densities in where with . Then there exists a sequence of estimators such that, for all ,
where is a function of .
Proof.
Let . Since we have that for all there exists a such that, for all we have or equivalently . Because of this there exists sequence of positive values such that and .
We will be making use of the algorithm in Lemma 2.6 as well as its notation. If we can show that there exist sequences of positive values such that, for sufficiently large , the following holds
then we can simply set our estimator equal to the estimator from Lemma 2.6 for sufficiently large and, because the lemma holds independent of choice of , the theorem statement follows.
Let and . Note that these are both positive sequences which converge to zero. Now we have
(6) |
For sufficiently large and we have that the RHS of (6) is less than or equal to
which completes our proof. ∎
2.3 Main Theoretical Results
We can now state the central results of this paper. The following theorem states that one can control the estimation error of multi-view histograms with $k$ components and $b$ bins per dimension so long as neither grows too quickly with $n$ (omitting logarithmic factors). Recall that the standard histogram estimator requires $n h^d \to \infty$, i.e. $b^d = o(n)$, so we have removed the exponential dependence of the bin rate on dimensionality. Here and elsewhere such informal rate statements are not precise mathematical statements but rather indicate that the two values should be of the same order. In the following, $b$ and $k$ are functions of $n$; the space of histograms which we are fitting changes as we acquire more data.
Theorem 2.1.
For any pairs of sequences and satisfying , there exists an estimator such that, for all
where is a function of .
Proof sketch.
The sample complexity for the multi-view histogram is perhaps more accurately approximated as being of a slightly larger order; however, the additional factor disappears in the asymptotic analysis.
The following theorem states that we can control the error of Tucker histogram estimates under an analogous growth condition on $b$ and $k$ (omitting logarithmic factors).
Theorem 2.2.
For any pairs of sequences and satisfying , there exists an estimator such that, for all
where is a function of .
Allowing the number of bins per dimension to grow as aggressively as possible, we achieve consistent estimation, using either the multi-view or Tucker histograms, under a rate condition on $b$ that holds regardless of dimensionality.
Corollary 2.3.
For all fix to be either or . For any sequence with , there exists a sequence and estimator such that, for all
where is a function of .
Replacing $b$ with the inverse bin width $1/h$ allows us to arrive at the rates mentioned in Section 1. The following result shows that the bias of the estimators in Theorems 2.1 and 2.2 goes to zero for all densities. Thus these estimators are universally consistent even when the NNTF model assumption is not satisfied.
Lemma 2.8.
Let . If and then .
A straightforward consequence of this is that the Tucker histogram bias also goes to zero.
Lemma 2.9.
Let . If and then .
Finally, we have that the rate in Theorem 2.1 cannot be made significantly faster.
Theorem 2.3.
Let , , and with and . There exists no estimator such that, for all , the following limit holds
where is a function of .
Proof sketch.
We use such an estimator to construct an estimator over a class of discrete distributions and show that such an estimator is impossible using a result from [13]. ∎
Likewise, the rate in Theorem 2.2 also cannot be significantly improved.
Theorem 2.4.
Let , , and with and . There exists no estimator such that, for all , the following limit holds
where is a function of .
2.4 Discussion
Naturally, real-world data likely never exactly satisfies the NNTF model assumption. Our results are meant to highlight a trade-off between model assumptions of smoothness (few bins per dimension) and simple dependence between features (few components). Here we explore this trade-off for the multi-view histogram. Letting $k = b^d$ recovers the full histogram class, since we can allocate one separable component to each bin; using the estimator from Theorem 2.1 then gives something which behaves similarly to the standard histogram estimator, with a similar sample complexity. On the other hand, setting $k = 1$ gives a naive Bayes model with a correspondingly smaller sample complexity. The Tucker histogram can be similarly analyzed, with $k = b$ corresponding to the standard histogram. Thus we have a span of values of $k$ yielding different estimators, with maximal $k$ corresponding to the standard histogram and minimal $k$ corresponding to a naive Bayes assumption. We observe in Section 3 that this trade-off is useful in practice: we virtually never want $k$ to be maximized.
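A rough parameter count, based only on the model definitions in (2) and (3) and ignoring normalization constraints, makes this trade-off concrete: with $b$ bins per dimension, $d$ features, and $k$ components, the number of free bin weights is
$$\underbrace{b^{d}}_{\text{standard histogram}} \quad\text{vs.}\quad \underbrace{k\,d\,b + k}_{\text{multi-view}} \quad\text{vs.}\quad \underbrace{k\,d\,b + k^{d}}_{\text{Tucker}},$$
so for small $k$ both NNTF classes have far fewer parameters than the $b^{d}$ bins of the full histogram, while $k = b^d$ (multi-view) or $k = b$ (Tucker) recovers it.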
3 Experiments
While Theorems 2.1 and 2.2 guarantee the existence of estimators which can effectively estimate NNTF models, these estimators are unfortunately computationally intractable. Fortunately there exist algorithms which can be adapted to our problem setting, though they lack the theoretical guarantees of the estimators described in Theorems 2.1 and 2.2. Specifically, we will utilize an existing algorithm for nonnegative tensor decompositions. Due to the difficulty of estimating $L_1$ distances between densities we will instead focus on estimates minimizing the $L_2$ norm. In this section inner products and norms will be $L_2$ for functions and $\ell_2$ for tensors, i.e. the standard Euclidean norm and inner product applied to flattened tensors. We will again restrict our analysis to densities supported on $[0,1]^d$.
Consider the problem of finding some density estimate $q$ with minimal $L_2$ distance to an unknown density $p$. This is equivalent to minimizing the squared loss:
$$\operatorname*{argmin}_{q} \|q - p\|_2^2 \qquad (7)$$
$$= \operatorname*{argmin}_{q} \left( \|q\|_2^2 - 2\langle q, p\rangle + \|p\|_2^2 \right). \qquad (8)$$
Because the right term in (8) does not depend on $q$ it can be ignored when finding the optimal $q$. The left term in (8) is known. The middle term can be estimated with the following approximation
$$\langle q, p\rangle = \mathbb{E}_{X \sim p}\left[q(X)\right] \approx \frac{1}{n}\sum_{i=1}^{n} q(X_i),$$
where $X_1, \ldots, X_n$ are iid samples from $p$. We can use this to find a good estimate for $p$ in $\mathcal{H}$, where $\mathcal{H}$ represents either the multi-view or the Tucker histogram class:
$$\operatorname*{argmin}_{q \in \mathcal{H}} \left( \|q\|_2^2 - 2\langle q, p\rangle \right) \qquad (9)$$
$$\approx \operatorname*{argmin}_{q \in \mathcal{H}} \left( \|q\|_2^2 - \frac{2}{n}\sum_{i=1}^{n} q(X_i) \right). \qquad (10)$$
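For a histogram-type estimate the expression inside the minimization in (10) is easy to compute from the bin-weight tensor; the following is a hedged sketch (our own helper, assuming bin weights `W` with $b$ bins per dimension on $[0,1]^d$) of evaluating it on a set of samples:

```python
import numpy as np

def l2_surrogate_loss(W, X):
    """
    Evaluate ||q||_2^2 - (2 / n) * sum_i q(X_i), the data-driven surrogate in (10),
    where q is the piecewise-constant density with bin-weight tensor W on [0, 1]^d.
    """
    b, d = W.shape[0], W.ndim
    norm_sq = (b ** d) * np.sum(W ** 2)               # ||q||_2^2 for a histogram density
    idx = np.clip((X * b).astype(int), 0, b - 1)
    q_vals = (b ** d) * W[tuple(idx.T)]               # q evaluated at the sample points
    return norm_sq - 2.0 * q_vals.mean()
```

Evaluating the same expression on held-out samples is also how performance is reported in Section 3.1.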
Recall that the standard histogram estimator assigns each bin its empirical frequency, and let $\hat{P}$ denote its bin-weight tensor, i.e. $\hat{P} = T^{-1}$ applied to the standard histogram estimator. For any histogram $q$ with bin-weight tensor $Q = T^{-1}(q)$ we have the following
$$\frac{1}{n}\sum_{i=1}^{n} q(X_i) = b^d \langle Q, \hat{P}\rangle \quad \text{and} \quad \|q\|_2^2 = b^d \|Q\|_2^2.$$
As a consequence (10) is equal to
$$\operatorname*{argmin}_{q \in \mathcal{H}} \, b^d \left( \|T^{-1}(q)\|_2^2 - 2\langle T^{-1}(q), \hat{P}\rangle \right).$$
Using the operator $T$ we can reformulate this into the tensor approximation problem
$$\operatorname*{argmin}_{Q} \|Q - \hat{P}\|_2^2,$$
where $Q$ ranges over the multi-view or Tucker probability tensors. Because of this equivalence, to find estimates in the multi-view or Tucker histogram classes we can simply use nonnegative tensor decomposition algorithms, which minimize $\ell_2$ (Frobenius) loss, to find NNTF tensors which approximate $\hat{P}$.
Dataset | Red. | Dim. | Hist. Perf. | Tucker Perf. | Hist. Bins | Tucker Bins | Tucker $k$ | p-val.
---|---|---|---|---|---|---|---|---
MNIST | PCA | 2 | -1.455 ± 0.089 | -1.502 ± 0.102 | 6.531 ± 1.499 | 8.375 ± 1.780 | 4.968 ± 1.976 | 5e-4
 | | 3 | -2.040 ± 0.196 | -2.268 ± 0.195 | 4.781 ± 0.738 | 6.718 ± 1.565 | 5.781 ± 1.340 | 2e-4
 | | 4 | -3.532 ± 0.996 | -4.014 ± 0.655 | 4.031 ± 0.585 | 5.343 ± 1.018 | 4.375 ± 0.695 | 2e-3
 | | 5 | -4.673 ± 1.026 | -6.157 ± 2.924 | 3.468 ± 0.499 | 4.343 ± 0.592 | 3.281 ± 0.514 | 4e-5
 | Rand. | 2 | -2.034 ± 0.100 | -2.099 ± 0.102 | 6.062 ± 1.197 | 7.562 ± 1.657 | 2.062 ± 1.784 | 3e-5
 | | 3 | -3.086 ± 0.207 | -3.331 ± 0.387 | 4.812 ± 0.526 | 6.843 ± 1.227 | 2.687 ± 1.959 | 1e-4
 | | 4 | -4.307 ± 0.290 | -5.731 ± 0.435 | 3.500 ± 0.559 | 5.656 ± 0.642 | 2.593 ± 1.497 | 8e-7
 | | 5 | -6.327 ± 0.522 | -9.539 ± 1.053 | 3.250 ± 0.433 | 4.718 ± 0.571 | 2.562 ± 1.087 | 8e-7
Diabetes | PCA | 2 | -2.079 ± 0.122 | -2.212 ± 0.132 | 5.718 ± 1.304 | 7.468 ± 1.478 | 1.062 ± 0.242 | 8e-6
 | | 3 | -3.010 ± 0.364 | -3.606 ± 0.420 | 3.593 ± 0.860 | 7.062 ± 1.058 | 1.843 ± 1.543 | 2e-6
 | | 4 | -4.002 ± 0.415 | -4.423 ± 0.701 | 3.000 ± 0.000 | 5.906 ± 0.804 | 2.343 ± 1.107 | 2e-3
 | | 5 | -6.139 ± 0.661 | -6.043 ± 1.192 | 3.000 ± 0.000 | 3.750 ± 0.968 | 1.843 ± 0.617 | 0.91
 | Rand. | 2 | -3.074 ± 0.224 | -3.277 ± 0.287 | 6.843 ± 1.227 | 9.250 ± 1.936 | 1.093 ± 0.384 | 7e-5
 | | 3 | -4.726 ± 0.483 | -5.353 ± 0.751 | 4.968 ± 0.769 | 8.406 ± 1.343 | 1.625 ± 1.672 | 2e-5
 | | 4 | -6.017 ± 0.873 | -7.732 ± 1.497 | 4.062 ± 0.704 | 6.718 ± 1.328 | 2.093 ± 1.155 | 1e-5
 | | 5 | -8.986 ± 1.292 | -12.61 ± 2.477 | 3.062 ± 0.242 | 5.093 ± 0.521 | 2.531 ± 0.865 | 2e-6
For our experiments we used the TensorLy library [18] to perform the nonnegative Tucker decomposition [17] with Tucker rank $(k, \ldots, k)$; the resulting tensor was then projected onto the simplex of probability tensors using the method of [11]. We also performed experiments with nonnegative PARAFAC decompositions using [28, 18]. These decompositions performed poorly, potentially because the PARAFAC optimization is more difficult or because the additional flexibility of the Tucker decomposition was more appropriate for the experimental datasets.
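The projection onto the simplex of probability tensors can be carried out with the standard sort-based Euclidean projection of [11]; below is a hedged sketch applied to a flattened tensor (our own implementation, not the paper's code):

```python
import numpy as np

def project_to_probability_simplex(W):
    """Euclidean projection of a tensor onto {W : W >= 0, sum(W) = 1}, following Duchi et al. [11]."""
    v = W.ravel()
    u = np.sort(v)[::-1]                                             # entries sorted in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)                             # shift making the positive part sum to one
    return np.maximum(v + theta, 0.0).reshape(W.shape)
```

Applied to the low-rank reconstruction from the sketch above, this yields a valid probability tensor, which corresponds to a Tucker histogram via the operator $T$.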
3.1 Experimental Setup and Results
Our experiments were performed on the Scikit-learn “toy” datasets MNIST and Diabetes [23], with labels removed. We used the expression inside the minimization in (10) to evaluate performance. Our experiments considered estimating histograms in $d \in \{2, 3, 4, 5\}$ dimensional space. We consider two forms of dimensionality reduction. First we consider projecting the dataset onto its top $d$ principal components. As an alternative we consider projecting our dataset onto a random subspace of dimension $d$. We have constructed our random subspace dimensionality reduction so that each additional dimension adds a new coordinate without affecting the others: for each dataset we randomly select an orthonormal basis that remains unchanged for all experiments, and to transform a point to dimension $d$ we project it onto the first $d$ basis vectors.
We consider both transforms since PCA may select dimensions along which the features tend to be independent, as is the case when the underlying distribution is a multivariate Gaussian. After dimensionality reduction we scale and translate the data to fit in the unit cube.
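A hedged sketch of this preprocessing (our own code; the exact scaling details in the experiments may differ) with the fixed random orthonormal basis and the rescaling into the unit cube:

```python
import numpy as np

def make_fixed_basis(ambient_dim, seed=0):
    """One orthonormal basis per dataset, fixed across all experiments and all values of d."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((ambient_dim, ambient_dim)))
    return q

def reduce_and_scale(X, basis, d, eps=1e-12):
    """Project onto the first d basis vectors (nested across d), then scale into [0, 1]^d."""
    Z = X @ basis[:, :d]
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    return (Z - lo) / (hi - lo + eps)
```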
With our preprocessed dataset, each experiment consisted of randomly selecting 200 samples for training and using the rest to evaluate performance. For the estimators we tested all combinations of bins per dimension and number of components, from 1 up to dimension-dependent maxima. As $d$ increased the best cross-validated number of bins and number of components decreased, so we reduced these maxima for larger $d$ to reduce computational time, while still leaving a sizable gap between the best cross-validated values and the maxima across all runs of all experiments. For parameter fitting we used random-subset cross-validation repeated 80 times, using 40 of the 200 samples to cross-validate. Performing 80 folds of cross-validation was necessary because of the variance of the histogram-estimated loss. This high variance is likely due to the noncontinuous nature of the histogram estimator itself and the noncontinuity of the histogram as a function of the data, i.e. slightly moving one training sample can potentially change the histogram bin in which it lies. Each experiment was run 32 times and we report the mean and standard deviation of estimator performance as well as the best parameters found by cross-validation. We additionally apply the Wilcoxon signed-rank test to the 32 pairs of performance results to statistically determine whether the mean performance of the standard histogram and our algorithm differ, and report the corresponding p-value. Our results are in Table 1, where the Tucker histogram dominates. We also observe that the Tucker histogram can estimate more bins per dimension than the standard histogram and is able to estimate more bins per dimension when the number of components is lower. This corroborates the trade-off between the number of components and the bins per dimension from Section 2.4.
4 Conclusion
Through analysis of the histogram estimator, we have theoretically and empirically demonstrated that NNTF models can be used to improve nonparametric density estimation. Though the histogram estimator is not a particularly popular estimator, we hope that the ideas presented here can be adapted to improve other techniques in nonparametric statistics.
Acknowledgements
This work was supported by the Berlin Institute for the Foundations of Learning and Data (BIFOLD) sponsored by the German Federal Ministry of Education and Research (BMBF).
References
- Allman et al. [2009] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Ann. Statist., 37(6A):3099–3132, 12 2009. doi: 10.1214/09-AOS689. URL http://dx.doi.org/10.1214/09-AOS689.
- Anandkumar et al. [2014] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014. URL http://jmlr.org/papers/v15/anandkumar14b.html.
- Arora et al. [2012] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, STOC ’12, pages 145–162, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1245-5. doi: 10.1145/2213977.2213994. URL http://doi.acm.org/10.1145/2213977.2213994.
- Ashtiani et al. [2018] H. Ashtiani, S. Ben-David, N. Harvey, C. Liaw, A. Mehrabian, and Y. Plan. Nearly tight sample complexity bounds for learning mixtures of gaussians via sample compression schemes. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3412–3421. Curran Associates, Inc., 2018.
- Bengio [2013] Y. Bengio. Deep learning of representations: Looking forward. In A.-H. Dediu, C. Martín-Vide, R. Mitkov, and B. Truthe, editors, Statistical Language and Speech Processing, pages 1–37, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-39593-2.
- Cayton [2005] L. Cayton. Algorithms for manifold learning. Technical report, 2005.
- Chen et al. [2017] J. Chen, M. Stern, M. J. Wainwright, and M. I. Jordan. Kernel feature selection via conditional covariance minimization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6946–6955. Curran Associates, Inc., 2017.
- Devroye and Lugosi [2001] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, New York, 2001.
- Donoho and Stodden [2004] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 1141–1148. MIT Press, 2004.
- Donoho [2006] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006. ISSN 0018-9448. doi: 10.1109/TIT.2006.871582.
- Duchi et al. [2008] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
- Györfi et al. [1985] L. Györfi, L. Devroye, and L. Gyorfi. Nonparametric density estimation: the L1 view. John Wiley & Sons, New York; Chichester, 1985.
- Han et al. [2014] Y. Han, J. Jiao, and T. Weissman. Minimax estimation of discrete distributions under loss. CoRR, abs/1411.1467, 2014. URL http://arxiv.org/abs/1411.1467.
- Jiang [2017] H. Jiang. Uniform convergence rates for kernel density estimation. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1694–1703, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/jiang17b.html.
- Kim and Scott [2012] J. Kim and C. Scott. Robust kernel density estimation. J. Machine Learning Res., 13:2529–2565, 2012.
- Kim et al. [2018] J. Kim, J. Shin, A. Rinaldo, and L. Wasserman. Uniform Convergence Rate of the Kernel Density Estimator Adaptive to Intrinsic Volume Dimension. arXiv e-prints, art. arXiv:1810.05935, Oct 2018.
- Kim and Choi [2007] Y.-D. Kim and S. Choi. Nonnegative tucker decomposition. 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
- Kossaifi et al. [2016] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic. TensorLy: Tensor Learning in Python. arXiv e-prints, art. arXiv:1610.09555, Oct 2016.
- Liu et al. [2011] H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman. Forest density estimation. J. Mach. Learn. Res., 12:907–951, July 2011. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1953048.2021032.
- Mack and Rosenblatt [1979] Y. P. Mack and M. Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of Multivariate Analysis, 9(1):1–15, March 1979.
- Negahban and Wainwright [2011] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist., 39(2):1069–1097, 04 2011. doi: 10.1214/10-AOS850. URL https://doi.org/10.1214/10-AOS850.
- Negahban and Wainwright [2012] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res., 13:1665–1697, May 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2188385.2343697.
- Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Recht et al. [2010] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, Aug. 2010. ISSN 0036-1445. doi: 10.1137/070697835. URL http://dx.doi.org/10.1137/070697835.
- Reiss [1989] R. Reiss. Approximate distributions of order statistics: with applications to nonparametric statistics. Springer series in statistics. Springer, 1989. ISBN 9783540968511. URL https://books.google.de/books?id=DxzvAAAAMAAJ.
- Schein et al. [2016] A. Schein, M. Zhou, D. Blei, and H. Wallach. Bayesian poisson tucker decomposition for learning the structure of international relations. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2810–2819, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/schein16.html.
- Schuster [1985] E. F. Schuster. Incorporating support constraints into nonparametric estimators of densities. Communications in Statistics - Theory and Methods, 14(5):1123–1136, 1985. doi: 10.1080/03610928508828965. URL https://doi.org/10.1080/03610928508828965.
- Shashua and Hazan [2005] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 792–799, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351.1102451. URL http://doi.acm.org/10.1145/1102351.1102451.
- Silverman [1986] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
- Song et al. [2012] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. J. Mach. Learn. Res., 13(1):1393–1434, May 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2503308.2343691.
- Song et al. [2014] L. Song, A. Anandkumar, B. Dai, and B. Xie. Nonparametric estimation of multi-view latent variable models. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 640–648, 2014. URL http://jmlr.org/proceedings/papers/v32/songa14.html.
- Terrell and Scott [1992] G. R. Terrell and D. W. Scott. Variable kernel density estimation. Ann. Statist., 20(3):1236–1265, 09 1992. doi: 10.1214/aos/1176348768. URL https://doi.org/10.1214/aos/1176348768.
- Tsybakov [2008] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
- Vandermeulen and Scott [2013] R. Vandermeulen and C. Scott. Consistency of robust kernel density estimators. COLT, 30:568–591, 2013.
- Vandermeulen and Scott [2014] R. A. Vandermeulen and C. Scott. Robust kernel density estimation by scaling and projection in hilbert space. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 433–441. Curran Associates, Inc., 2014.
Appendix A Proofs Omitted from Main Text
All norms are either the $\ell_1$, $L_1$, or total variation norm, which are equivalent with respect to our analysis; the proper norm will be clear from context.
Proof of Lemma 2.1.
In Section 7.4 from [8], the authors show that for any collection of measures , for all , that
With the additional assumption that we have that and thus
If we let , the indicator vector at index , then the lemma follows. ∎
Proof of Lemma 2.2.
From Lemma 2.1 we know there exists a finite collection of probability vectors such that is an -covering of and . Note that the set contains at most elements. We will now show that this set is an -cover of . Let be arbitrary. From our construction of there exist elements such that .
We will now make use of Lemma 3.3.7 in [25], which states that, for any collection of probability vectors and , the following holds
From this it follows that
thus completing our proof. ∎
Proof of Lemma 2.3.
Let be the finite collection of probability measures with which -covers . Similarly let with such that is a -cover of . Consider the set
Note that this set contains at most elements. We will now show that it -covers , which completes the proof. Let . We know there exist elements such that and such that and thus . Now observe that
∎
Proof of Lemma 2.5.
Fix and . We are going to construct an -cover of . From Lemma 2.1 we know that there exists a set which -covers of and contains no more than elements. Let be the collection of all arrays whose entries are elements from . So we have that
From Lemma 2.1 there exists which is an -cover of and contains no more than elements. Now let
Note that
We will now show that is an -cover of . To this end let be arbitrary, where and . From our construction of , there exists such that . There also exists such that for all . Therefore we have that
So finally
∎
Proof of Theorem 2.1.
We will be applying the estimator from Lemma 2.7 to a series of -covers of . We begin by constructing a series of -covers whose cardinality doesn’t grow too quickly. Corollary 2.1 states that, for all , that . For sufficiently large and and sufficiently small , the following holds
(11) |
Using the argument from the proof of Lemma 2.7 we have that, because there exists a sequence of positive values such that and . If we let we have that and
Because of this we can construct collections of densities such that is a -covering of with , and . Let be the estimator from Lemma 2.7 applied to the sequence .
Let be arbitrary. Due to the way that we have constructed the sequence , for sufficiently large , we have that . It therefore follows that, for sufficiently large , the following holds for all
From this we have that, for sufficiently large
and the right side goes to zero due to Lemma 2.7, thus completing the proof. ∎
Proof of Theorem 2.2.
This proof is very similar to the proof of Theorem 2.1. We will be applying the estimator from Lemma 2.7 to a series of -covers of . We begin by constructing a series of -covers whose cardinality doesn’t grow too quickly. Corollary 2.2 states that, for all , that . For sufficiently large and and sufficiently small , the following holds
Note that replacing with in the last line is exactly (11) in our proof of Theorem 2.1 . From here we can proceed exactly as in the proof of Theorem 2.1 by replacing with and with . ∎
Proof of Lemma 2.8.
Let . Theorem 5 in Chapter 2 of [12] states that, for any , as , i.e. the bias of a histogram estimator goes to zero as the number of bins per dimension goes to infinity. Thus there exists a sufficiently large such that there exists a histogram which is a good approximation of , . In this proof we will argue that once and is sufficiently large, we can find an element of where the multi-view components can approximate the bins of .
We have that
From the same theorem there exists such that, for all , for all , there exists such that for all . For any multi-index , we define
Now we have that, for all and ,
(12) | ||||
where we use the previously mentioned product measure inequality for (12). As soon as and the set contains the element,
Now we have that, for all .
From the triangle inequality we have that
So we have that, for sufficiently large and
which completes our proof. ∎
Proof of Lemma 2.9.
Proof of Theorem 2.3.
We will proceed by contradiction. Suppose is an estimator violating the theorem statement, i.e. there exist sequences and with and such that, for all ,
Let be a sequence of probability vectors which represent distributions over . Let with .
We will now construct a series of estimators for using . Let which are independent random variables with . For this proof we will assume but the proof can be simplified in a straightforward manner to the case by ignoring the indices and modes beyond the second. Note that that contains two indices. Now we have the following for the densities of
(14) | ||||
(15) |
This last line is in the form of (5) in the main text and is thus an element of . To see this we will show the correspondence between the terms in (15) from here and the terms in (5) in the main text:
Let estimate so . We will use to construct an estimator for .
Because 222We will use this portion of the proof again for our proof of Theorem 2.4 for all we have that and thus . Note that when and zero otherwise (see (14)). We define the linear operator as
i.e. the linear operator which sums out all modes except for the first two. We have that . Now let be the estimator for . Now we have that
We have that is a nonexpansive operator due to the triangle inequality,
so the operator norm of is less than or equal to one. We also know that is an isometry and , so it follows that for any sequence of . We will now use the following theorem from [13] to show that no such estimator can exist.
Theorem A.1 ([13] Theorem 2.).
For any , we have
where the infimum is over all estimators.
Our estimator is equivalent to estimating a categorical distribution with categories. Letting , , and , with , we get that for sufficiently large
whose right hand side converges to 1. From this we get that
which contradicts for arbitrary sequences . ∎
Proof of Theorem 2.4.
We will proceed by contradiction. Suppose is an estimator violating the theorem statement, i.e. there exist sequences and with and such that, for all ,
Since we have that so there is a subsequence such that or , or equivalently or . We will show that both cases lead to a contradiction. We will let and be functions of $n$ implicitly when defining limits.
Case :
We proceed similarly to the proof of Theorem 2.3. Let , , and be defined as in the proof of Theorem 2.3. Note that (see proof of Lemma 2.9) and thus . We can proceed exactly as in our proof of Theorem 2.3 at footnote 2, by simply replacing with and with which finishes this case.
Case : Let be a sequence of elements in which represents distributions over . Let with . Let which are independent random variables with . Let be the density for . Note that . So we have that
and thus . We proceed as in Theorem 2.3 to find an estimator for elements of which is equivalent to estimating elements of which is impossible since . ∎