Nonparametric Density Estimation via Variance-Reduced Sketching
Abstract
Nonparametric density models are of great interest in various scientific and engineering disciplines. Classical kernel density methods, while numerically robust and statistically sound in low-dimensional settings, become inadequate even in moderately high-dimensional settings due to the curse of dimensionality. In this paper, we introduce a new framework called Variance-Reduced Sketching (VRS), specifically designed to estimate multivariable density functions with a reduced curse of dimensionality. Our framework conceptualizes multivariable functions as infinite-size matrices and facilitates a new sketching technique, motivated by the numerical linear algebra literature, to reduce the variance in density estimation problems. We demonstrate the robust numerical performance of VRS through a series of simulated experiments and real-world data applications. Notably, VRS shows remarkable improvement over existing neural network estimators and classical kernel methods in numerous density models. Additionally, we offer theoretical justifications for VRS to support its ability to deliver nonparametric density estimation with a reduced curse of dimensionality.
Keywords— Density estimation, Matrix sketching, Tensor sketching, Range estimation, High-dimensional function approximation
1 Introduction
Nonparametric density estimation has extensive applications across diverse fields, including biostatistics (Chen (2017)), machine learning (Sriperumbudur and Steinwart (2012) and Liu et al. (2023)), engineering (Chaudhuri et al. (2014)), and economics (Zambom and Dias (2013)). In this context, the goal is to estimate the underlying density function based on a collection of independently and identically distributed data. Classical density estimation methods such as histograms (Scott (1979) and Scott (1985)) and kernel density estimators (Parzen (1962) and Davis et al. (2011)), known for their numerical robustness and statistical stability in lower-dimensional settings, often suffer from the curse of dimensionality even in moderately high-dimensional spaces. Mixture models (Dempster (1977) and Escobar and West (1995)) offer a potential solution for higher-dimensional problems, but these methods lack the flexibility to extend to nonparametric settings. Additionally, adaptive methods (Liu et al. (2007) and Liu et al. (2023)) have been developed to address higher-dimensional challenges. Recently, deep generative modeling has emerged as a popular technique to approximate high-dimensional densities from given samples, especially in the context of images. This category includes generative adversarial networks (GAN) (Goodfellow et al. (2020); Liu et al. (2021)), normalizing flows (Dinh et al. (2014); Rezende and Mohamed (2015); Dinh et al. (2016)), and autoregressive models (Germain et al. (2015); Uria et al. (2016); Papamakarios et al. (2017); Huang et al. (2018)). Despite their remarkable performance, particularly with high-resolution images, statistical guarantees for these neural network methods remain a challenge. In this paper, we aim to develop a new framework specifically designed to estimate multivariate nonparametric density functions. Within this framework, we conceptualize multivariate density functions as matrices or tensors and extend matrix/tensor approximation algorithms to this nonparametric setting, in order to reduce the curse of dimensionality in higher dimensions.
1.1 Contributions
Motivated by Hur et al. (2023), we propose a new matrix/tensor-based sketching framework for estimating multivariate density functions in nonparametric models, referred to as Variance-Reduced Sketching (VRS). To illustrate the advantages of VRS, consider the setting of estimating an -times differentiable density function in dimensions. In terms of squared norm, the error rate of the classical kernel density estimator (KDE) with sample size is of order
(1) |
while the VRS estimator can achieve error rates of order (assuming the minimal singular values are all constants, the error bound in (2) matches Theorem 2)
(2) |
Here, are the ranks of the true density function; the precise definition can be found in Section 3. Note that when are bounded constants, the error rate in (2) reduces to , which is a nonparametric error rate in one dimension, significantly better than (1), especially when is large. In Figure 1, we conduct a simulation study to numerically compare deep neural network estimators, classical kernel density estimators, and VRS. Additional empirical studies are provided in Section 5. Extensive numerical evidence suggests that VRS outperforms various deep neural network estimators and KDE by a considerable margin.
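For reference, the benchmark rates being compared can be written out explicitly. Writing $\beta$ for the order of differentiability, $d$ for the dimension, and $N$ for the sample size (symbols introduced here only for illustration), the classical squared-$L_2$ rate of the kernel density estimator and its one-dimensional counterpart take the standard forms

```latex
% standard nonparametric benchmarks: \beta = smoothness, d = dimension, N = sample size
\mathbb{E}\,\bigl\|\widehat p_{\mathrm{KDE}} - p\bigr\|_{L_2}^2 \;\asymp\; N^{-\frac{2\beta}{2\beta+d}},
\qquad\text{versus the one-dimensional rate}\quad N^{-\frac{2\beta}{2\beta+1}} .
```

so the improvement claimed in (2) amounts to removing the dimension $d$ from the exponent when the ranks stay bounded.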

The promising performance of VRS comes from its ability to reduce the problem of multivariate density estimation to the problem of estimating low-rank matrices/tensors in spaces of functions. For example, we can think of a two-variable density function as an infinite-size matrix. When estimating this two-variable density using classical kernel methods, the estimator typically exhibits a two-dimensional error rate. In contrast, VRS reduces the matrix estimation problem to an estimation problem of its range. Since the range of the density function consists of single-variable functions when the rank is low, we achieve a single-variable estimation error rate. This philosophy generalizes to density estimation in arbitrary dimensions. To estimate the range, VRS employs a new sketching strategy, wherein the empirical density function is sketched with a set of carefully chosen sketch functions, as illustrated in Figure 2. These sketch functions are selected to retain the range of the density while reducing the variance of the range estimation from any multidimensional scaling to -dimensional scaling, hence the name Variance-Reduced Sketching.
Our VRS sketching approach allows us to achieve the error bound demonstrated in (2), in contrast with the error bound of KDE in (1), which suffers from the curse of dimensionality. Our VRS framework is fundamentally different from the randomized sketching algorithm for finite-dimensional matrices previously studied in Halko et al. (2011), as randomly chosen sketching does not address the curse of dimensionality in multivariable density estimation models. Additionally, the statistical guarantees for VRS are derived from a computationally tractable algorithm. This contrasts with deep learning methods, where generalization errors depend heavily on the architecture of the neural network estimators, and the statistical analysis of generalization errors of neural network estimators is not necessarily related to the optimization errors achieved by computationally tractable optimizers.

1.2 Related literature
Matrix approximation algorithms, such as singular value decomposition and QR decomposition, play a crucial role in computational mathematics and statistics. A notable advancement in this area is the emergence of randomized low-rank approximation algorithms. These algorithms excel at substantially reducing time and space complexity without sacrificing much numerical accuracy. Seminal contributions to this area are outlined in works such as Liberty et al. (2007) and Halko et al. (2011). Additionally, review papers such as Woodruff et al. (2014), Drineas and Mahoney (2016), Martinsson (2019), and Martinsson and Tropp (2020) have provided comprehensive summaries of these randomized approaches, along with their theoretical stability guarantees. Randomized low-rank approximation algorithms typically start by estimating the range of a large low-rank matrix by forming a reduced-size sketch. This is achieved by right-multiplying with a random matrix , where . The random matrix is selected to ensure that the range of remains a close approximation of the range of , even when the column size of is significantly reduced from . As such, the random matrix is referred to as a randomized linear embedding or sketching matrix by Tropp et al. (2017b) and Nakatsukasa and Tropp (2021). The sketching approach reduces the cost of singular value decomposition from to , where represents the complexity of matrix multiplication.
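To make the classical sketching idea concrete, the following minimal sketch (a plain randomized range finder in the spirit of Halko et al. (2011), not the VRS construction introduced later) recovers an orthonormal basis of the range of a low-rank matrix from a reduced-size sketch; the matrix sizes, rank, and Gaussian test matrix are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, k = 2000, 1500, 10, 15                   # k: sketch size, slightly above the rank r
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # a rank-r matrix
Omega = rng.standard_normal((n, k))               # random sketching (test) matrix
Y = A @ Omega                                     # reduced-size sketch: range(Y) ~ range(A)
Q, _ = np.linalg.qr(Y)                            # orthonormal basis for the estimated range
print(np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A))    # ~ 1e-15: range captured
```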
Recently, a series of studies have extended the matrix sketching technique to range estimation for high-order tensor structures, such as the Tucker structure (Che and Wei (2019); Sun et al. (2020); Minster et al. (2020)) and the tensor train structure (Al Daas et al. (2023); Kressner et al. (2023); Shi et al. (2023)). These studies developed specialized structures for sketching to reduce computational complexity while maintaining high levels of numerical accuracy in handling high-order tensors.
Previous work has also explored randomized sketching techniques in specific estimation problems. For instance, Mahoney et al. (2011) and Raskutti and Mahoney (2016) utilized randomized sketching to solve unconstrained least squares problems. Williams and Seeger (2000), Rahimi and Recht (2007), Kumar et al. (2012), and Tropp et al. (2017a) improved the Nyström method with randomized sketching techniques. Similarly, Alaoui and Mahoney (2015), Wang et al. (2017), and Yang et al. (2017) applied randomized sketching to kernel matrices in kernel ridge regression to reduce computational complexity. While these studies mark significant progress in the literature, they usually require extensive observation of the estimated function prior to employing the randomized sketching technique in order to maintain acceptable accuracy. This step would be prohibitively expensive in higher-dimensional settings. Notably, Hur et al. (2023) and subsequent studies (Tang et al. (2022); Ren et al. (2023); Peng et al. (2023); Chen and Khoo (2023); Tang and Ying (2023)) addressed this issue by incorporating the variance of the data-generation process into the construction of the sketching for high-dimensional tensor estimation. This sketching technique allows for the direct estimation of the range of a tensor with reduced sample complexity, rather than directly estimating the full tensor.
1.3 Organization
Our manuscript is organized as follows. In Section 2, we detail the procedure for implementing the range estimation through sketching and study the corresponding error analysis. In Section 3, we extend our study to density function estimation based on the range estimators developed in Section 2, and provide the corresponding statistical error analysis with a reduced curse of dimensionality. Additionally, we extend our method to image PCA denoising in Section 4. In Section 5, we present comprehensive numerical results to demonstrate the superior numerical performance of our method. Finally, Section 6 summarizes our conclusions.
1.4 Notations
We use to denote the natural numbers and to denote all the real numbers. We say that if for any given , there exists a such that . For real numbers , and , we denote if and if . Let denote the unit interval in . For positive integer , denote the -dimensional unit cube as .
Let be a collection of elements in the Hilbert space . Then
(3) |
Note that is a linear subspace of . For a generic measurable set , denote For any , let the inner product between and be
We say that is a collection of orthonormal functions in if if , and if . Let and be the indicator function at the point . We define
If spans , then is a collection of orthonormal basis functions in . In what follows, we briefly introduce the notations for Sobolev spaces. Let be any measurable set. For multi-index , and , define the -derivative of as Then
where and represents the total order of derivatives. The Sobolev norm of is
1.5 Background: linear algebra in tensorized function spaces
We briefly introduce linear algebra in tensorized function spaces, which is necessary to develop our Variance-Reduced Sketching (VRS) framework for nonparametric density estimation.
Multivariable functions as tensors
Given positive integers , let . Let be a generic multivariable function and for . Denote
(4) |
Define the Frobenius norm of as
(5) |
where are any orthonormal basis functions of . Note that in (5), is independent of the choices of basis functions in for . The operator norm of is defined as
(6) |
We say that is a tensor in tensor product space if there exist scalars and functions for each such that
(7) |
and that . From classical functional analysis, we have that
(8) |
Projection operators in function spaces
Let , where and are any orthonormal functions. Then is an -dimensional linear subspace of and we denote . Let be the projection operator onto the subspace . Therefore for any , the projection of on is
(9) |
Note that we always have for any projection operator .
For , let be any orthonormal functions of and where . With the expansion in (7), define the tensor as
(10) |
We have by Lemma 16 and the fact that . It follows that is a tensor in and therefore can be viewed as a function in .
More generally given , let and . We can view as a function in . For , let be any subspace. Suppose is such that
Then we have as shown in Lemma 18.
2 Density range estimation by sketching
Let and be arbitrary positive integers, and let and be two measurable sets. Let be the unknown density function, and be the observed data sampled from . Denote the empirical measure formed by . More precisely,
(11) |
In this section, we introduce a sketching algorithm to estimate based on with a reduced curse of dimensionality, where
(12) |
To this end, let be a linear subspace of that acts as an estimation subspace and be a linear subspace of that acts as a sketching subspace. More details on how to choose and are deferred to Remark 1. Our procedure is composed of the following three stages.
-
Sketching stage. Let be the orthonormal basis functions of . We apply the projection operator to by computing
(13) Note that for each , is a function solely depending on . This stage aims to reduce the curse of dimensionality associated with the variable .
-
Estimation stage. We estimate the functions by utilizing the estimation space . Specifically, for each , we approximate by
(14)
-
Orthogonalization stage. Let
(15) Compute the leading singular functions in the variable of to estimate the .
We formally summarize our procedure in Algorithm 1.
Suppose the estimation space is spanned by the orthonormal basis functions . In what follows, we provide an explicit expression of in (15) based on the .
In the sketching stage, computing (13) is equivalent to computing , as
In the estimation stage, we have the following explicit expression for (14) by Lemma 20:
where . Therefore, in (15) can be rewritten as
(16) |
By Lemma 19, has the exact same expression as . Therefore, we establish the identification
(17) |
In 5 of the appendix, we provide further implementation details on how to compute the leading singular functions in the variable of using singular value decomposition. Let be the projection operator onto the , and let be the projection operator onto the space spanned by , the output of Algorithm 1. In Section 2.1, we show that the range estimator in Algorithm 1 is consistent by providing theoretical quantification on the difference between and .
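To make the expression in (16) and the subsequent singular value decomposition concrete, here is a minimal numerical sketch of the three-stage procedure, assuming one-dimensional x and y supported on [0, 1], shifted Legendre polynomials as both the estimation and the sketching basis, and a toy two-component mixture of product densities; the basis sizes, the mixture, and the chosen rank are illustrative rather than the paper's exact implementation.

```python
import numpy as np
from numpy.polynomial import legendre

def shifted_legendre(k, t):
    """Orthonormal shifted Legendre polynomial of degree k on [0, 1]."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return np.sqrt(2 * k + 1) * legendre.legval(2 * t - 1, c)

def vrs_range_estimate(x, y, d1=10, d2=10, rank=2):
    """Estimate the leading singular functions in the x-variable of p(x, y).

    x, y: samples in [0, 1] of shape (N,).  Returns the coefficients (in the
    Legendre basis) of the estimated range functions and the singular values.
    """
    B = np.stack([shifted_legendre(k, x) for k in range(d1)])   # estimation basis, (d1, N)
    W = np.stack([shifted_legendre(j, y) for j in range(d2)])   # sketching basis, (d2, N)
    M = B @ W.T / x.shape[0]       # M[k, j] = (1/N) * sum_i b_k(x_i) w_j(y_i), cf. (16)
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank], s          # leading left singular vectors span the estimated range

# toy data: an equal mixture of two product densities on [0, 1]^2, hence rank at most 2
rng = np.random.default_rng(0)
N = 5000
mix = rng.random(N) < 0.5
x = np.where(mix, rng.random(N), rng.beta(2, 5, N))
y = np.where(mix, rng.random(N), rng.beta(5, 2, N))
U, s = vrs_range_estimate(x, y)
print("leading singular values of the sketched matrix:", np.round(s[:4], 3))
```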
2.1 Error analysis of Algorithm 1
We start with introducing necessary conditions to establish the consistency of our range estimators.
Assumption 1.
Suppose and are two measurable sets. Let be a generic population function with .
where , , and and are orthonormal basis functions in and respectively.
Assumption 1 postulates that the density is a finite-rank function. Finite-rank conditions are commonly observed in the literature. In Example 1 and Example 2 of Appendix A, we illustrate that both additive models and mean-field models satisfy Assumption 1. Additionally, in Example 3, we demonstrate that all multivariable differentiable functions can be effectively approximated by finite-rank functions.
The following assumption quantifies the bias between and its projection.
Assumption 2.
Let and be two linear subspaces such that and , where . For , suppose that
(18)
(19)
(20)
Remark 1.
When and , Assumption 2 directly follows from approximation theories in Sobolev spaces. Indeed, in Appendix B, under the assumption that , we justify Assumption 2 when and are derived from three popular nonparametric approaches: reproducing kernel Hilbert spaces in Lemma 2, the Legendre polynomial system in Lemma 5, and the spline basis in Lemma 7.
Note that by the identification in (17), . The following theorem shows that can be well-approximated by the leading singular functions in of .
Theorem 1.
The proof of Theorem 1 can be found in Section C.1. If , for a sufficiently large constant , and the sample size satisfies for a sufficiently large constant , then Theorem 1 implies that
(23) |
To interpret our result in Theorem 1, consider a simplified scenario where the minimal spectral value is a positive constant. Then (23) further reduces to
which matches the optimal nonparametric density estimation rate in . This indicates that our method is able to estimate without the curse of dimensionality introduced by the variable . Utilizing the estimator of , we can further estimate the population function with a reduced curse of dimensionality as detailed in Section 3.
3 Density estimation by sketching
In this section, we study multivariable density estimation by utilizing the range estimator outlined in Algorithm 1. Let be a measurable subset of and be the unknown population function. We propose a tensor-based algorithm to estimate the function with a reduced curse of dimensionality.
Remark 2.
In density estimation, it is sufficient to assume . This is a common assumption widely used in the nonparametric statistics literature. Indeed, if the density function has compact support, through necessary scaling, we can assume the support is a subset of .
We begin by stating the necessary assumptions for our tensor-based estimator of .
Assumption 3.
For , it holds that
where , , and and are orthonormal functions in and respectively. Furthermore, .
Assumption 3 is a direct extension of Assumption 1, and Examples 1, 2, and 3 in the appendix continue to hold under Assumption 3. Throughout this section, for , denote the operator as the projection operator onto . More precisely,
In what follows, we formally introduce our algorithm to estimate the density function . Let be a collection of orthonormal basis functions. For , let and denote
(24)
(25)
Remark 3.
The collection of orthonormal basis functions can be derived through various nonparametric estimation methods. In the appendix, we present three examples of , including reproducing kernel Hilbert space basis functions (Section B.1), Legendre polynomial basis functions (Section B.2), and spline basis functions (Section B.3) to illustrate the potential choices.
In Algorithm 2, we formally summarize our tensor-based estimator of , which utilizes the range estimator developed in Section 2.1.
In Section 5.1, we provide an in-depth discussion on how to choose the tuning parameters , and in a data-driven way. The time complexity of Algorithm 2 is
(26) |
In (26), the first term is the cost of computing , the second term corresponds to the cost of singular value decomposition of , the third term represents the cost of computing , and the last term reflects the cost of computing given . In the following theorem, we show that the difference between and is well-controlled.
Theorem 2.
Suppose that are independently sampled from a density satisfying with and .
Suppose in addition that
satisfies 3, and that
are in the form of (24) and (25), where are derived from reproducing kernel Hilbert spaces, the Legendre polynomial basis, or spline basis functions.
Let in (11), , and be the input of Algorithm 2, and
be the corresponding output. Denote
and suppose for a sufficiently large constant ,
If for some sufficiently large constant and , then it holds that
(27) |
The proof of Theorem 2 can be found in the appendix. To interpret Theorem 2, consider the simplified scenario where the ranks and the minimal spectral values are both positive constants for . Then (27) implies that
which matches the minimax optimal rate of estimating non-parametric functions in . Note that the error rate of estimating a nonparametric density in using classical kernel methods is of order . Therefore, as long as
then by (27), with high probability
and the error bound we obtain in Theorem 2 is strictly better than that of classical kernel methods.
4 Extension: image PCA denoising
In this section, we demonstrate that our nonparametric sketching methodology, introduced in Section 2, can be used to estimate the principal component functions in the continuum limit. The most representative application of PCA is image denoising, which has a wide range of applications in machine learning and data science, such as image clustering, classification, and segmentation. We refer the readers to Mika et al. (1998) and Bakır et al. (2004) for a detailed introduction to image processing.
Let . We define and . Motivated by the image denoising model, in our approach, data are treated as discrete functions in and therefore the resolution of the image data is . In such a setting for and , we have
where indicates the pixel of . Note that the norm in differs from the Euclidean norm in by a factor of . Let and define
(28) |
The operator norm of is defined as
(29) |
Motivated by the tremendous success of the discrete wavelet basis functions in the image denoising literature (e.g., see Mohideen et al. (2008)), we study PCA in reproducing kernel Hilbert spaces (RKHS) generated by wavelet functions. Specifically, let be a collection of orthonormal discrete wavelet functions in . The RKHS generated by is
(30) |
For , define . For any , we have
Let be the estimation space and be the sketching space such that
(31) |
where
.
Suppose we observe a collection of noisy images
where
for each ,
(32) |
Here are i.i.d. sub-Gaussian random variables and are i.i.d. sub-Gaussian stochastic functions in such that for every and ,
(33) |
Our objective is to estimate the principal components of . Denote , and define the covariance operator estimator as
(34) |
The following result shows that the principal components of can be consistently estimated by the singular value decomposition of with suitably chosen subspaces and .
Corollary 1.
Suppose the
data
satisfy (32) and (33), and that . Suppose in addition that for ,
where , and
are orthonormal discrete functions in .
Suppose that in (30). Let and be defined as in (31). For sufficiently large constant , suppose that
(35) |
Denote as the projection operator onto the , and the projection operator onto the space spanned by the leading singular functions in variable of . Then
(36) |
The proof of Corollary 1 can be found in Appendix E. To interpret the result in Corollary 1, consider the scenario where the minimal spectral value is a positive constant. Then (36) simplifies to
The term aligns with the optimal rate for estimating a function in an RKHS with degree of smoothness in a two-dimensional space. The additional term accounts for the measurement errors . This term is typically negligible in applications, as , the resolution of the images, is usually much larger than the sample size for high-resolution images.
5 Simulations and real data examples
In this section, we compare the numerical performance of the proposed estimator VRS with classical kernel methods and neural network estimators through density estimation models and image denoising models.
5.1 Implementations
As detailed in Algorithm 2, our approach involves three groups of tuning parameters: , , and . In all our numerical experiments, the optimal choices for and are determined through cross-validation. To select , we apply a popular method in low-rank matrix estimation known as adaptive thresholding. Specifically, for each , we compute , the set of singular values of and set
Adaptive thresholding is a very popular strategy in the matrix completion literature (Candes and Plan (2010)), and it has been proven to be empirically robust in many scientific and engineering applications. We use built-in functions provided by the popular Python package scikit-learn to train kernel estimators, and scikit-learn also utilizes cross-validation for tuning parameter selection. For neural networks, we use PyTorch to train various models and make predictions. The implementations of our numerical studies can be found at this link.
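As an illustration of this rank-selection step, a minimal sketch is given below; it assumes the threshold is a fixed fraction c of the largest singular value, where the constant c is a hypothetical choice standing in for the exact thresholding rule and can itself be tuned together with the other parameters.

```python
import numpy as np

def adaptive_rank(singular_values, c=0.05):
    """Return the number of singular values exceeding c times the largest one."""
    s = np.sort(np.asarray(singular_values))[::-1]   # sort in decreasing order
    return int(np.sum(s > c * s[0]))

print(adaptive_rank([3.2, 1.1, 0.9, 0.01, 0.004]))   # -> 3
```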
5.2 Density estimation
We study the numerical performance of Variance-Reduced Sketching (VRS), kernel density estimators (KDE), and neural networks (NN) in various density estimation problems. We use Legendre polynomials to span linear subspaces as in (24) and as in (25) in Algorithm 2, and the inputs of VRS in the density estimation setting are detailed in Theorem 2. Once we compute the density function via VRS, we use Gibbs sampling to efficiently generate samples from this multivariate probability distribution. The details are listed in Appendix I. For neural network estimators, we use two popular density estimation architectures for comparison: Masked Autoregressive Flow (MAF) (Papamakarios et al. (2017)) and Neural Autoregressive Flows (NAF) (Huang et al. (2018)). The details of implementing neural network density estimators are provided in Appendix I. We measure the estimation accuracy by the relative -error defined as
where is the density estimator of a given estimator. We also compute the standard Kullback–Leibler (KL) divergence to measure the distance between two probability density functions:
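Both evaluation metrics can be approximated numerically; the sketch below assumes grid-based quadrature with weights grid_weight and the direction KL(p || p_hat), which are illustrative conventions rather than a verbatim transcription of the definitions above.

```python
import numpy as np

def relative_l2_error(p_hat, p_true, grid_weight):
    """Relative L2 error ||p_hat - p_true|| / ||p_true||, approximated on a grid."""
    num = np.sqrt(np.sum((p_hat - p_true) ** 2 * grid_weight))
    den = np.sqrt(np.sum(p_true ** 2 * grid_weight))
    return num / den

def kl_divergence(p_true, p_hat, grid_weight, eps=1e-12):
    """KL(p_true || p_hat), approximated by quadrature on the same grid."""
    return np.sum(p_true * np.log((p_true + eps) / (p_hat + eps)) * grid_weight)
```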
Simulation . In the first simulation, we use a two-dimensional two-mode Gaussian mixture model to numerically compare VRS, KDE, and neural network estimators. We generate 1,000 samples from the density function
where . The relative error and KL divergence for each method are reported in Table 1. As demonstrated in this example, VRS achieves decent accuracy in the classical low-dimensional setting.
| | VRS | KDE | NN-MAF | NN-NAF |
|---|---|---|---|---|
| relative error | 0.1270(0.0054) | 0.1636(0.0068) | 0.2225(0.0135) | 0.3265(0.0103) |
| KL divergence | 0.0092(0.0033) | 0.0488(0.0029) | 0.0785(0.0134) | 0.0983(0.0098) |
To further visualize the performance of the four different methods, the true density and the estimated density from each method are plotted in Figure 3. Direct comparison in Figure 3 demonstrates that VRS provides a relatively better estimator for the true density.

Simulation . We consider a moderately high-dimensional density model in this simulation. Specifically, we generate data from the 30-dimensional Gaussian mixture density function
The experiments are repeated 50 times. Since computing high-dimensional errors is computationally prohibitive, we only report the averaged KL divergence for performance evaluation. Table 2 showcases that VRS outperforms kernel and neural network estimators by a remarkable margin.
| | VRS | KDE | MAF | NAF |
|---|---|---|---|---|
| KL divergence | 0.0195(0.0056) | 4.3823(0.0047) | 0.9260(0.0523) | 0.1613(0.0823) |
To further visualize the performance of the four different methods, we provide visualizations of a few estimated marginal densities and compare them with the ground truth marginal densities. Figure 4 depicts the comparison of the two-dimensional marginal densities corresponding to , , and , respectively. Direct comparison in Figure 4 demonstrates that VRS provides a relatively better fit for the true density.



Simulation . Ginzburg-Landau theory is widely used to model the microscopic behavior of superconductors. The Ginzburg-Landau density has the following expression
(37) |
where . We sample data from the Ginzburg-Landau density with coefficient using the Metropolis-Hastings sampling algorithm. This type of density concentrates around two centers and , and all the coordinates are correlated in a non-trivial way due to the interaction term in the density function. We consider two sets of experiments for the Ginzburg-Landau density model. In the first set of experiments, we fix and change the sample size from to . In the second set of experiments, we keep the sample size at and vary from to . We summarize the averaged relative -error for each method in Figure 5. Furthermore, in Figure 6, we visualize several two-dimensional marginal densities estimated by the four different methods with sample size . Direct comparison showcases that our VRS method recovers these marginal densities with decent accuracy.
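For the sampling step mentioned above, the sketch below pairs a generic random-walk Metropolis-Hastings sampler with a Ginzburg-Landau-type double-well energy with nearest-neighbour coupling; the potential gl_log_density, its parameters lam and beta, and the proposal step size are hypothetical stand-ins rather than the exact density in (37).

```python
import numpy as np

def metropolis_hastings(log_density, d, n_samples, step=0.3, burn_in=2000, seed=0):
    """Random-walk Metropolis-Hastings sampler for an unnormalized log-density."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    samples = []
    for t in range(n_samples + burn_in):
        prop = x + step * rng.standard_normal(d)
        if np.log(rng.random()) < log_density(prop) - log_density(x):
            x = prop                       # accept the proposal
        if t >= burn_in:
            samples.append(x.copy())
    return np.array(samples)

def gl_log_density(x, lam=1.0, beta=2.0):
    """Hypothetical Ginzburg-Landau-type energy: double wells plus neighbour coupling."""
    wells = np.sum((1.0 - x ** 2) ** 2) / 4.0
    coupling = np.sum(np.diff(x) ** 2) / 2.0
    return -beta * (lam * wells + coupling / lam)

samples = metropolis_hastings(gl_log_density, d=8, n_samples=5000)
print(samples.shape)    # (5000, 8); the chain concentrates around the two wells +1 and -1
```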




Simulation . We consider a four-mode Gaussian mixture model in two dimensions. We generate 20,000 samples from the density
where and , . This setting is more difficult than Simulation due to the larger number of modes and the stronger singularity in the true density. We report the relative error and KL divergence for each method in Table 3.
| | VRS | KDE | MAF | NAF |
|---|---|---|---|---|
| Relative Error | 0.0721(0.0029) | 0.3987(0.0039) | 0.2441(0.0411) | 0.4617(0.0621) |
| KL Divergence | 0.0142(0.0015) | 0.1223(0.0014) | 0.0819(0.0161) | 0.2356(0.0417) |
To further visualize the performance of the four different methods, the true density and the estimated density from each method are plotted in Figure 7. Direct comparison in Figure 7 demonstrates that VRS provides a relatively better estimate of the true density.

Real data . We analyze density estimation for the Portugal wine quality dataset from the UCI Machine Learning Repository. This dataset contains samples of red and white wines, along with eight continuous variables: volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, density, sulphates, and alcohol. To provide a comprehensive comparison between different methods, we estimate the joint density of the first variables in this dataset, allowing this number to vary from 2 to 8. For instance, corresponds to the joint density of volatile acidity and citric acid. Since the true density is unknown, we randomly split the dataset into 90% training and 10% test data and evaluate the performance of various approaches using the averaged log-likelihood of the test data. The averaged log-likelihood is defined as follows: let be the density estimator based on the training data. The averaged log-likelihood of the test data is
The numerical performance of VRS, NN, and KDE are summarized in Figure 8. Notably, VRS achieves the highest averaged log-likelihood values, indicating its superior numerical performance.

5.3 Image PCA denoising
Principal Component Analysis (PCA) is a popular technique for reducing noise in images. In this subsection, we examine the numerical performance of VRS in image denoising problems. The state-of-the-art method in the image denoising literature is kernel PCA. We direct interested readers to Mika et al. (1998) and Bakır et al. (2004) for a comprehensive introduction to the kernel PCA method.
The main advantage of VRS lies in its computational complexity. Consider image data with resolution , where . The time complexity of kernel PCA is , where corresponds to the cost of generating the kernel matrix in , and reflects the cost of computing the principal components of this matrix. In contrast, the time complexity of VRS is analyzed in (26) with . Empirical evidence (see, e.g., Pope et al. (2021)) suggests that image data possess low intrinsic dimension, making practical choices of and in (26) significantly smaller than and . Even in the worst-case scenario where takes the upper bound in (26), the practical time complexity of VRS is , which is considerably more efficient than the kernel PCA approach.
In the numerical experiments, we work with real datasets and treat images from these datasets as the ground truth images. To evaluate the numerical performance of a given approach, we add i.i.d. Gaussian noise to each pixel of the images and randomly split the dataset into 90% training and 10% test data. We then use the training data to compute the principal components based on the given approach and project the test data onto these estimated principal components. Denote the noiseless ground truth image as , the corresponding projected noisy test data as , and the corresponding Gaussian noise added to the images as in (32). The numerical performance of the given approach is evaluated through the relative denoising error:
where indicates the Euclidean norm of . In addition, we use the relative variance to measure the noise level. For the time complexity comparison, we execute each method on Google Colab's CPU with high RAM and record its execution time.
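To make the evaluation protocol concrete, the sketch below runs it with a plain linear-PCA baseline from scikit-learn; this is only a stand-in for the kernel PCA and VRS estimators compared in the paper, and the noise level and number of components are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_denoise_eval(train_clean, test_clean, noise_std=0.5, n_components=32, seed=0):
    """Add Gaussian noise, learn principal components on the noisy training images
    (flattened to shape (n_images, n_pixels)), project the noisy test images,
    and report the relative denoising error described above."""
    rng = np.random.default_rng(seed)
    train_noisy = train_clean + noise_std * rng.standard_normal(train_clean.shape)
    test_noisy = test_clean + noise_std * rng.standard_normal(test_clean.shape)
    pca = PCA(n_components=n_components).fit(train_noisy)
    denoised = pca.inverse_transform(pca.transform(test_noisy))
    return np.linalg.norm(denoised - test_clean) / np.linalg.norm(test_clean)
```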
Real data . Our first study focuses on the USPS digits dataset.
This dataset comprises images of handwritten digits (0 through 9) that were originally scanned from envelopes by the USPS. It contains a total of 9,298 images, each with a resolution of .
After adding the Gaussian noise, the relative noise variance of the noisy data is 0.7191.
The relative denoising errors for VRS and kernel PCA are 0.2951 and 0.2959, respectively, which reflects the excellent denoising performance of both methods. Although the errors show minimal difference, the computational cost of VRS is significantly lower than that of kernel PCA: the execution time for VRS is 0.40 seconds, compared to 36.91 seconds for kernel PCA. In addition to this numerical comparison, in Figure 9(a) we have randomly selected five images from the test set to illustrate the denoised results using VRS and kernel PCA.


Real data . We analyze the MNIST dataset, which comprises 70,000 images of handwritten digits (0 through 9), each labeled with the true digit. The size of each image is . After adding the Gaussian noise, the relative noise variance of the noisy data is 0.9171. The relative denoising errors for VRS and kernel PCA are 0.4044 and 0.4170, respectively. Although the numerical accuracy of the two methods is quite similar, the computational cost of VRS is significantly lower than that of kernel PCA. The execution time for VRS is only 4.33 seconds, in contrast to 3218.35 seconds for kernel PCA. In addition to this numerical comparison, Figure 9 includes a random selection of five images from the test set to demonstrate the denoised images using VRS and kernel PCA.
6 Conclusion
In this paper, we develop a comprehensive framework, Variance-Reduced Sketching (VRS), for nonparametric density estimation problems in higher dimensions. Our approach leverages the concept of sketching from numerical linear algebra to address the curse of dimensionality in function spaces. Our method treats multivariable functions as infinite-dimensional matrices or tensors, and the sketching is specifically tailored to the regularity of the estimated function. This design takes the variance of the random samples in nonparametric problems into consideration and is intended to reduce the curse of dimensionality in density estimation problems. Extensive simulated experiments and real data examples demonstrate that our sketching-based method substantially outperforms both neural network estimators and classical kernel density methods in terms of numerical performance.
References
- Al Daas et al. (2023) Hussam Al Daas, Grey Ballard, Paul Cazeaux, Eric Hallman, Agnieszka Miedlar, Mirjeta Pasha, Tim W Reid, and Arvind K Saibaba. Randomized algorithms for rounding in the tensor-train format. SIAM Journal on Scientific Computing, 45(1):A74–A95, 2023.
- Alaoui and Mahoney (2015) Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with statistical guarantees. Advances in neural information processing systems, 28, 2015.
- Bakır et al. (2004) Gökhan H Bakır, Jason Weston, and Bernhard Schölkopf. Learning to find pre-images. Advances in neural information processing systems, 16:449–456, 2004.
- Bell (2014) Jordan Bell. The singular value decomposition of compact operators on Hilbert spaces, 2014.
- Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
- Candes and Plan (2010) Emmanuel J Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
- Chaudhuri et al. (2014) Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike Von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory, 60(12):7900–7912, 2014.
- Che and Wei (2019) Maolin Che and Yimin Wei. Randomized algorithms for the approximations of tucker and the tensor train decompositions. Advances in Computational Mathematics, 45(1):395–428, 2019.
- Chen (2017) Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017.
- Chen and Khoo (2023) Yian Chen and Yuehaw Khoo. Combining particle and tensor-network methods for partial differential equations via sketching. arXiv preprint arXiv:2305.17884, 2023.
- Chen et al. (2021) Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, et al. Spectral methods for data science: A statistical perspective. Foundations and Trends® in Machine Learning, 14(5):566–806, 2021.
- Davis et al. (2011) Richard A Davis, Keh-Shin Lii, and Dimitris N Politis. Remarks on some nonparametric estimates of a density function. Selected Works of Murray Rosenblatt, pages 95–100, 2011.
- Dempster (1977) Arthur Dempster. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–38, 1977.
- Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
- Drineas and Mahoney (2016) Petros Drineas and Michael W Mahoney. Randnla: randomized numerical linear algebra. Communications of the ACM, 59(6):80–90, 2016.
- Escobar and West (1995) Michael D Escobar and Mike West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.
- Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International conference on machine learning, pages 881–889. PMLR, 2015.
- Goodfellow et al. (2020) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
- Halko et al. (2011) Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
- Han et al. (2018) Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling using matrix product states. Physical Review X, 8(3):031012, 2018.
- Horn and Johnson (1994) Roger A Horn and Charles R Johnson. Topics in matrix analysis. Cambridge university press, 1994.
- Huang et al. (2018) Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2078–2087. PMLR, 2018.
- Hur et al. (2023) Yoonhaeng Hur, Jeremy G Hoskins, Michael Lindsey, E Miles Stoudenmire, and Yuehaw Khoo. Generative modeling via tensor train sketching. Applied and Computational Harmonic Analysis, 67:101575, 2023.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kressner et al. (2023) Daniel Kressner, Bart Vandereycken, and Rik Voorhaar. Streaming tensor train approximation. SIAM Journal on Scientific Computing, 45(5):A2610–A2631, 2023.
- Kumar et al. (2012) Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the nyström method. The Journal of Machine Learning Research, 13(1):981–1006, 2012.
- Liberty et al. (2007) Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.
- Liu et al. (2007) Han Liu, John Lafferty, and Larry Wasserman. Sparse nonparametric density estimation in high dimensions using the rodeo. In Artificial Intelligence and Statistics, pages 283–290. PMLR, 2007.
- Liu et al. (2023) Linxi Liu, Dangna Li, and Wing Hung Wong. Convergence rates of a class of multivariate density estimation methods based on adaptive partitioning. Journal of machine learning research, 24(50):1–64, 2023.
- Liu et al. (2021) Qiao Liu, Jiaze Xu, Rui Jiang, and Wing Hung Wong. Density estimation using deep generative neural networks. Proceedings of the National Academy of Sciences, 118(15):e2101344118, 2021.
- Mahoney et al. (2011) Michael W Mahoney et al. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.
- Martinsson (2019) Per-Gunnar Martinsson. Randomized methods for matrix computations. The Mathematics of Data, 25(4):187–231, 2019.
- Martinsson and Tropp (2020) Per-Gunnar Martinsson and Joel A Tropp. Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 29:403–572, 2020.
- Mika et al. (1998) Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. Advances in neural information processing systems, 11, 1998.
- Minster et al. (2020) Rachel Minster, Arvind K Saibaba, and Misha E Kilmer. Randomized algorithms for low-rank tensor decompositions in the tucker format. SIAM Journal on Mathematics of Data Science, 2(1):189–215, 2020.
- Mohideen et al. (2008) S Kother Mohideen, S Arumuga Perumal, and M Mohamed Sathik. Image de-noising using discrete wavelet transform. International Journal of Computer Science and Network Security, 8(1):213–216, 2008.
- Nakatsukasa and Tropp (2021) Yuji Nakatsukasa and Joel A Tropp. Fast & accurate randomized algorithms for linear systems and eigenvalue problems. arXiv preprint arXiv:2111.00113, 2021.
- Papamakarios et al. (2017) George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in neural information processing systems, 30, 2017.
- Parzen (1962) Emanuel Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
- Peng et al. (2023) Yifan Peng, Yian Chen, E Miles Stoudenmire, and Yuehaw Khoo. Generative modeling via hierarchical tensor sketching. arXiv preprint arXiv:2304.05305, 2023.
- Pope et al. (2021) Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894, 2021.
- Rahimi and Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007.
- Raskutti and Mahoney (2016) Garvesh Raskutti and Michael W Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508–7538, 2016.
- Ren et al. (2023) Yinuo Ren, Hongli Zhao, Yuehaw Khoo, and Lexing Ying. High-dimensional density estimation with tensorizing flow. Research in the Mathematical Sciences, 10(3):30, 2023.
- Rezende and Mohamed (2015) Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015.
- Sande et al. (2020) Espen Sande, Carla Manni, and Hendrik Speleers. Explicit error estimates for spline approximation of arbitrary smoothness in isogeometric analysis. Numerische Mathematik, 144(4):889–929, 2020.
- Scott (1979) David W Scott. On optimal and data-based histograms. Biometrika, 66(3):605–610, 1979.
- Scott (1985) David W Scott. Averaged shifted histograms: effective nonparametric density estimators in several dimensions. The Annals of Statistics, pages 1024–1040, 1985.
- Shi et al. (2023) Tianyi Shi, Maximilian Ruth, and Alex Townsend. Parallel algorithms for computing the tensor-train decomposition. SIAM Journal on Scientific Computing, 45(3):C101–C130, 2023.
- Sriperumbudur and Steinwart (2012) Bharath Sriperumbudur and Ingo Steinwart. Consistency and rates for clustering with dbscan. In Artificial Intelligence and Statistics, pages 1090–1098. PMLR, 2012.
- Sun et al. (2020) Yiming Sun, Yang Guo, Charlene Luo, Joel Tropp, and Madeleine Udell. Low-rank tucker approximation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science, 2(4):1123–1150, 2020.
- Tang and Ying (2023) Xun Tang and Lexing Ying. Solving high-dimensional fokker-planck equation with functional hierarchical tensor. arXiv preprint arXiv:2312.07455, 2023.
- Tang et al. (2022) Xun Tang, Yoonhaeng Hur, Yuehaw Khoo, and Lexing Ying. Generative modeling via tree tensor network states. arXiv preprint arXiv:2209.01341, 2022.
- Tropp et al. (2017a) Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Fixed-rank approximation of a positive-semidefinite matrix from streaming data. Advances in Neural Information Processing Systems, 30, 2017a.
- Tropp et al. (2017b) Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):1454–1485, 2017b.
- Uria et al. (2016) Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
- Vershynin (2018) Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- Wainwright (2019) Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
- Wang et al. (2017) Shusen Wang, Alex Gittens, and Michael W Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. In International Conference on Machine Learning, pages 3608–3616. PMLR, 2017.
- Wasserman (2006) Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
- Williams and Seeger (2000) Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. Advances in neural information processing systems, 13, 2000.
- Woodruff et al. (2014) David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
- Xu (2018) Yuan Xu. Approximation by polynomials in sobolev spaces with jacobi weight. Journal of Fourier Analysis and Applications, 24:1438–1459, 2018.
- Yang et al. (2017) Yun Yang, Mert Pilanci, and Martin J Wainwright. Randomized sketches for kernels: Fast and optimal nonparametric regression. Annals of Statistics, pages 991–1023, 2017.
- Zambom and Dias (2013) Adriano Z Zambom and Ronaldo Dias. A review of kernel density estimation with applications to econometrics. International Econometric Review, 5(1):20–42, 2013.
Appendix A Examples of finite-rank functions
In this section, we present three examples commonly encountered in the nonparametric statistics literature that satisfy Assumption 1. Note that the range of is defined as
(38) |
In addition, we provide a classical result in function spaces that allows us to conceptualize multivariable functions as infinite-dimensional matrices. Let and be arbitrary positive integers, and let and be two measurable sets.
Theorem 3.
[Singular value decomposition in function space] Let be any function such that . There exists a collection of strictly positive singular values , and two collections of orthonormal basis functions and where such that
(39) |
By viewing as an infinite-dimensional matrix, it follows that the rank of is and that an equivalent definition of is that
(40) |
where the definition of Span can be found in (3). Consequently, the rank of is the same as the dimensionality of .
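As a small numerical illustration of Theorem 3, one can discretize a two-variable function on a grid and read its rank off the singular values of the resulting matrix; the rank-two test function below is an arbitrary choice.

```python
import numpy as np

# f(x, y) = sin(pi x) sin(pi y) + x y is a sum of two separable terms, hence rank 2
t = np.linspace(0, 1, 300)
F = np.outer(np.sin(np.pi * t), np.sin(np.pi * t)) + np.outer(t, t)
s = np.linalg.svd(F, compute_uv=False)
print("numerical rank:", int(np.sum(s > 1e-10 * s[0])))   # -> 2
```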
Example 1 (Additive models in regression).
In the multivariate nonparametric statistical literature, it is commonly assumed that the underlying unknown nonparametric function possesses an additive structure, meaning that there exists a collection of univariate functions such that
To connect this with Assumption 1, let and . Then by (38), and , where
The dimensionality of is at most , and consequently the rank of is at most .
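A quick numerical check of this bound, under the assumption of a three-variable additive function with arbitrary summands: unfolding the discretized function along its first variable yields a matrix of numerical rank 2, since it splits into a term depending only on the first variable plus a term depending only on the remaining ones.

```python
import numpy as np

# additive f(x1, x2, x3) = sin(x1) + x2**2 + exp(x3), unfolded along x1
t = np.linspace(0, 1, 40)
X1, X2, X3 = np.meshgrid(t, t, t, indexing="ij")
F = np.sin(X1) + X2 ** 2 + np.exp(X3)
M = F.reshape(40, -1)                       # rows indexed by x1, columns by (x2, x3)
s = np.linalg.svd(M, compute_uv=False)
print("numerical rank:", int(np.sum(s > 1e-8 * s[0])))   # -> 2
```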
Example 2 (Mean-field models in density estimation).
Mean-field theory is a popular framework in computational physics and Bayesian probability as it studies the behavior of high-dimensional stochastic models. The main idea of the mean-field theory is to replace all interactions to any one body with an effective interaction in a physical system. Specifically, the mean-field model assumes that the density function can be well-approximated by , where for , are univariate marginal density functions. The readers are referred to Blei et al. (2017) for further discussion.
In a large physical system with multiple interacting sub-systems, the underlying density can be well-approximated by a mixture of mean-field densities. Specifically, let be a collection of positive probabilities summing to . In the mean-field mixture model, with probability , data are sampled from a mean-field density . Therefore
To connect the mean-field mixture model to Assumption 1, let and . Then according to (38), and , where
The dimensionality of is at most , and therefore the rank of is at most .
Example 3 (Multivariate Taylor expansion).
Suppose is an -times continuously differentiable function. Then Taylor’s theorem in the multivariate setting states that for and , , where
(41) |
and For example, , and so on. To simplify our discussion, let . Then (41) becomes
Let and Then by (38), . The dimensionality of is at most , and therefore can be well-approximated by finite rank functions.
Appendix B Examples of and satisfying Assumption 2
In this section, we provide three examples of the subspaces and such that Assumption 2 holds.
B.1 Reproducing kernel Hilbert space basis
Let be a measurable set in . The two most commonly used examples are
for non-parametric estimation and for image PCA.
For ,
let be a kernel function such that
(42) |
where , and is a collection of basis functions in . If , are orthonormal functions. If , then can be identified as orthogonal vectors in . In this case, for all .
The
reproducing kernel Hilbert space generated by is
(43) |
For any functions , the inner product in is given by
Denote Then are the orthonormal basis functions in as we have that
and that
Define the tensor product space
The induced Frobenius norm in is
(44) |
where is defined by (4). The following lemma shows that the space is naturally motivated by multidimensional Sobolev spaces.
Lemma 1.
Let . With and suitable choices of , it holds that
Proof.
Let . When , it is a classical Sobolev space result that with and suitable choices of ,
We refer interested readers to Chapter 12 of Wainwright (2019) for more details. In general, it is well-known in functional analysis that for and , then
Therefore by induction
(45) |
∎
Let . In what follows, we show Assumption 2 holds when
Lemma 2.
Let be a kernel in the form of (42). Suppose that , and that is such that . Then for any two positive integers , it holds that
(46) |
where is some absolute constant. Consequently
(47) | |||
(48) |
Proof.
Since , without loss of generality, throughout the proof we assume that
as otherwise all of our analysis still holds up to an absolute constant. Observe that
Then
Observe that
where the first inequality holds because and the last equality follows from (44). Similarly
where the first inequality holds because and the last inequality follows from (44).
Thus (46) follows immediately.
For
(47), note that when , . In this case
becomes the identity operator and
Therefore (47) follows from
(46) by taking .
For (48), similar to (47), we have that
It follows that
where the last inequality follows from the fact that . ∎
B.2 Legendre polynomial basis
Legendre polynomials form a well-known classical orthonormal polynomial system in . We can define the Legendre polynomials in the following inductive way. Let and suppose are defined. Let be a polynomial of degree such that , and for all .
As a quick example, we have that
Let . Then form an orthonormal polynomial system in . In this subsection, we show that Assumption 2 holds when in (3) is chosen to be . More precisely, let
and denote the projection operator from to . Then is the subspace of polynomials of degree at most . In addition, for any , is the best -degree polynomial approximation of in the sense that
(49) |
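A quick numerical illustration of this projection, assuming the domain is the unit interval [0, 1] (consistent with the notation of Section 1.4), so that the orthonormal system is L_k(t) = sqrt(2k+1) P_k(2t-1); the test function and the quadrature rule are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import legendre

def L(k, t):
    """Orthonormal shifted Legendre polynomial of degree k on [0, 1]."""
    c = np.zeros(k + 1)
    c[k] = 1.0
    return np.sqrt(2 * k + 1) * legendre.legval(2 * t - 1, c)

nodes, w = legendre.leggauss(80)
t, w = (nodes + 1) / 2, w / 2                    # Gauss-Legendre quadrature mapped to [0, 1]
f = np.exp(np.sin(2 * np.pi * t))                # a smooth test function

for deg in (2, 4, 8, 16):
    coeffs = np.array([np.sum(w * f * L(k, t)) for k in range(deg + 1)])
    proj = sum(c * L(k, t) for k, c in enumerate(coeffs))
    err = np.sqrt(np.sum(w * (f - proj) ** 2))   # L2 error of the degree-deg projection
    print(deg, round(err, 8))                    # the error decays rapidly with the degree
```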
We begin with a well-known polynomial approximation result. For , denote to be the class of functions that are times continuously differentiable.
Theorem 4.
Suppose . Then for any , there exists a polynomial of degree such that
where is an absolute constant.
Proof.
This is Theorem 1.2 of Xu (2018). ∎
Therefore by (49) and Theorem 4,
(50) |
Let and let denote the linear space spanned by polynomials of of degree at most .
Corollary 2.
Suppose . Then for any ,
Proof.
It suffices to consider . For any fixed , . Therefore by (50),
Therefore
The desired result follows from the observation that
and that . ∎
Lemma 3.
Under the same conditions as in Corollary 2, it holds that
(51) |
Proof.
Note that is a projection operator. So for any ,
(52) |
Given , and therefore is well-defined and is a function mapping from to . To show (51), it suffices to observe that for any test functions ,
∎
In what follows, we present a polynomial approximation theory in multidimensions.
Lemma 4.
For , let denote the linear space spanned by polynomials of of degree and let be the corresponding projection operator. Then for any , it holds that
(53) |
Proof.
Since is dense in , it suffices to show (53) for all . We proceed by induction. The base case
is a direct consequence of Corollary 2. Suppose by induction that the following inequality holds for ,
(54) |
Then
The desired result follows from (54) and the observation that for all , and therefore
where the last inequality follows from Corollary 2. ∎
Lemma 5.
Suppose . Then for and ,
(55)
(56)
(57)
Proof.
B.3 Spline basis
Let be given and be a collection of grid points such that
Denote the subspace in spanned by the spline functions, i.e., piecewise polynomials defined on of degree . Specifically,
where
and
Let be the sub-basis functions spanning . In this section, we show that Assumption 2 holds when
where and are positive integers such that and . We begin with a spline space approximation theorem for multivariate functions.
Lemma 6.
Suppose . Suppose in addition that is a collection of positive integers strictly greater than . Then
Proof.
This is Example 13 on page 26 of Sande et al. (2020). ∎
In the following lemma, we justify Assumption 2 when and .
Lemma 7.
Suppose where is a fixed constant. Then for and ,
(58)
(59)
(60)
Proof.
Appendix C Proofs of the main results
C.1 Proofs related to Theorem 1
Proof of Theorem 1.
By Lemma 8, is the projection operator of . By Corollary 3,
(61) |
Suppose this good event holds. Observe that
(62) |
where the last inequality follows from 2 and (61). In addition by (64) in Lemma 8, the minimal eigenvalue of is lower bounded by
.
The rank of is bounded by the dimensionality of , so the rank of is finite. Similarly, has finite rank.
Corollary 4 in Section F.2 implies that
By (62), we have that , and by condition (21) in Theorem 1, we have that . The desired result follows immediately:
(63) |
∎
Proof of Lemma 8.
By Lemma 15 in Appendix F and Assumption 2, the singular values of the operator satisfy
As a result if for sufficiently large constant , then
(64) |
Since by construction, , and the leading singular values of are positive, it follows that the rank of is . So . ∎
Lemma 9.
Suppose and are defined as in Theorem 2. Let be such that . Suppose is a collection of basis functions such that . For positive integers and , denote
(65) |
Then it holds that
Proof.
Denote
For positive integers and , by ordering the indexes and in (65), we can also write
(66) |
Note that and are both zero outside the subspace . Recall that are independent samples forming . Let and be two matrices in such that
where and . Note that
(67) |
Step 1. Let and suppose that . Then by orthonormality of in it follows that
In addition, since
it follows that for all and
Step 2. Let and suppose that . Then by orthonormality of in ,
In addition, since
it follows that . Therefore
Step 3. For fixed and , we bound . Let . Then
where the last equality follows from Step 1 and Step 2. In addition,
So for given , by Bernstein’s inequality
Step 4. Let be a covering net of the unit ball in and be a covering net of the unit ball in ; then by 4.4.3 on page 90 of Vershynin (2018),
So by union bound and the fact that the size of is bounded by and the size of is bounded by ,
This implies that
∎
Corollary 3.
Suppose and are defined as in Theorem 2. Let be a collection of basis functions such that . Let
If in addition and , then
(68) |
Proof.
Since with the above choice of and , Corollary 3 is a direct consequence of Lemma 9. ∎
C.2 Proof of Theorem 2
Proof of Theorem 2.
It suffices to confirm that all the conditions in Theorem 5 in Section C.3 are met.
In particular, Assumption 2 is verified in Section B.1 for reproducing kernel Hilbert spaces, Section B.2 for Legendre polynomials, and Section B.3 for the spline basis. Assumption 4 is shown in (69) and (70). Therefore, Theorem 2 immediately follows. ∎
C.3 Proofs related to Theorem 5
Assumption 4.
Suppose for any non-random function . In addition, suppose that
Assumption 4 requires that is a consistent estimator of for any generic non-random function . It is straightforward to check that in the density estimation model satisfies Assumption 4: for any ,
(69)
(70)
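For concreteness, the standard computation behind bounds of this type, assuming the empirical estimator \(\widehat{P}(f) = n^{-1}\sum_{i=1}^n f(X_i)\) with \(X_1,\dots,X_n\) drawn i.i.d. from a bounded density \(p\), reads as follows; the exact norms and constants appearing in (69) and (70) may differ.
\[
\mathbb{E}\bigl[\widehat{P}(f)\bigr]
  = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\bigl[f(X_i)\bigr]
  = \int f(x)\,p(x)\,\mathrm{d}x = P(f),
\qquad
\mathbb{E}\Bigl[\bigl(\widehat{P}(f) - P(f)\bigr)^{2}\Bigr]
  = \frac{\operatorname{Var}\bigl(f(X_1)\bigr)}{n}
  \le \frac{\|p\|_{\infty}\,\|f\|_{L^2}^{2}}{n}.
\]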
Theorem 5.
Proof of Theorem 5.
Observe that , where is the projection matrix of . As a result,
In the following, we bound each of the above terms individually.
Let denote the linear subspace spanned by the basis and . So is a non-random projection with rank at most . Since the column space of is spanned by the basis and the column space of is contained in , it follows that . In addition, by condition (71) in Theorem 5, both (21) in Theorem 1 and (74) in Lemma 11 hold. Therefore
where the second inequality follows from
Theorem 1 and Lemma 11, and the last equality follows from the fact that so that from the condition (71) in Theorem 5 and .
Similarly, let denote the linear subspace spanned by the basis and . So is non-random with rank at most . Since the column space of is spanned by the basis and the column space of is contained in , it follows that .
Here we provide two lemmas required in the above proof.
Lemma 10.
Suppose for each , is a non-random linear operator on and that the rank of is . Then under Assumption 4, it holds that
(73) |
Consequently
Proof.
Since the rank of is , we can write
where and are both orthonormal in . Note that for any . Denote
Note that is zero in the orthogonal complement of the subspace . Therefore,
and so
where the equality follows from the assumption that for any . Consequently,
∎
Lemma 11.
Let be any estimator satisfying Assumption 4. Suppose is a collection of non-random operators on such that has rank and . Let and suppose in addition that
(74) |
Then for any , it holds that
(75) |
Proof.
We prove (75) by induction. The base case is exactly Lemma 10. Suppose (75) holds for any . Then
(76)
(77)
By induction,
Let denote the space spanned by the basis defined in Assumption 3 and . So is non-random with rank at most . Since the column space of is spanned by and the column space of is contained in , it follows that . Consequently,
where the second inequality follows from induction and Theorem 1, and the last inequality follows from the assumption that . Consequently,
Therefore, (75) holds for any . ∎
Appendix D Additional discussions related to Section 2
D.1 Implementation details for Algorithm 1
Let and be two subspaces and be any (random) function. Suppose that and are the orthonormal basis functions of and , respectively, with . Our general assumption is that can be computed efficiently for any and . This assumption is easily verified for the density estimation model. The following algorithm provides additional implementation details for Algorithm 1.
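As a concrete illustration, the following is a minimal sketch of one way the required inner products can be computed for the density model, assuming they reduce to sample averages of products of basis functions evaluated at the observations; the helper names (`cosine_basis`, `empirical_sketch`) and the cosine basis are illustrative choices, not the paper's implementation.

```python
import numpy as np

def cosine_basis(m):
    """Orthonormal cosine basis on [0, 1]: 1, sqrt(2)cos(pi x), sqrt(2)cos(2 pi x), ..."""
    return [(lambda x, j=j: np.ones_like(x) if j == 0 else np.sqrt(2) * np.cos(np.pi * j * x))
            for j in range(m)]

def empirical_sketch(x, y, phis, psis):
    """Estimate M[j, k] ~ E[phi_j(X) psi_k(Y)], i.e. the inner product of the
    bivariate density with phi_j (x) psi_k, by a sample average over the data."""
    Phi = np.column_stack([phi(x) for phi in phis])   # shape (n, m1)
    Psi = np.column_stack([psi(y) for psi in psis])   # shape (n, m2)
    return Phi.T @ Psi / len(x)                       # shape (m1, m2)

rng = np.random.default_rng(0)
x, y = rng.random(10_000), rng.random(10_000)         # placeholder samples on [0, 1]^2
M = empirical_sketch(x, y, cosine_basis(8), cosine_basis(8))
print(M.shape)  # (8, 8)
```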
D.2 Sketching in finite-dimensional vector spaces
In this subsection, we illustrate the intuition of sketching in a finite-dimensional matrix example. Suppose is a finite-dimensional matrix with rank . Let denote the linear subspace spanned by the columns of , and the linear subspace spanned by the rows of . Our goal is to illustrate how to estimate with reduced variance and reduced computational complexity when .
By singular value decomposition, we can write
where are singular values, are orthonormal vectors in , and are orthonormal vectors in . Therefore is spanned by , and is spanned by .
The sketch-based estimation procedure of is as follows. First, we choose a linear subspace such that and that forms a -cover of . Let be the projection matrix from to and we form the sketch matrix . Then in the second stage, we use the singular value decomposition to compute and return as the estimator of .
With the sketching technique, we only need to work with the reduced-size matrix instead of . Therefore, the effective variance of the sketching procedure is reduced to , significantly smaller than , which is the cost of directly using to estimate the range.
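As a concrete illustration of this procedure, the following NumPy sketch forms the reduced-size sketch matrix and recovers the column space from its top singular vectors; the dimensions, the perturbation level, and the way the covering subspace is built are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, sketch_dim = 1000, 5, 20

# A rank-r ground-truth matrix A plus a small perturbation (the noisy observation A_hat).
U0, _ = np.linalg.qr(rng.standard_normal((d, r)))
V0, _ = np.linalg.qr(rng.standard_normal((d, r)))
A = U0 @ np.diag(np.linspace(10.0, 2.0, r)) @ V0.T
A_hat = A + 1e-3 * rng.standard_normal((d, d))

# Step 1: choose a subspace (spanned by Q) that covers the row space of A, and
# form the reduced-size sketch A_hat @ Q instead of working with A_hat directly.
Q, _ = np.linalg.qr(np.column_stack([V0, rng.standard_normal((d, sketch_dim - r))]))
sketch = A_hat @ Q                      # d x sketch_dim, much smaller than d x d

# Step 2: the top-r left singular vectors of the sketch estimate range(A).
U_est, _, _ = np.linalg.svd(sketch, full_matrices=False)
P_est = U_est[:, :r] @ U_est[:, :r].T   # projection onto the estimated column space
print(np.linalg.norm(P_est @ A - A) / np.linalg.norm(A))  # close to 0
```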
We also provide an intuitive argument to support the above sketching procedure. Since , it holds that
(78) |
Let denote the spectral norm for matrices and the Euclidean norm for vectors. Since the subspace is a -cover of , it follows that for . Therefore
where the last equality follows from the fact that and for . Let be the leading singular values of . By matrix singular value perturbation theory,
(79) |
for . Suppose is chosen so that , where is the minimal singular value of . Then (79) implies that for . Therefore, has at least positive singular values and the rank of is at least . This observation, together with (78) and the fact that implies that
This justifies the sketching procedure in finite dimensions.
Appendix E Proofs in Section 4
Lemma 12.
Let be a generic element in . Then
Proof.
Let and . Then
It suffices to observe that
∎
Proof of Corollary 1.
From the proof of Theorem 1 and Lemma 13, it follows that
(80) |
The desired result follows by setting
∎
Lemma 13.
Let and be subspaces of the form (31). Suppose in addition that . Then under the same conditions as in Corollary 1, it holds that
Proof.
Let and . By reordering and in (31), we can assume that
(81) |
where and are orthonormal basis functions of . Note that and are both zero on the orthogonal complement of the subspace . Let
(82) |
where and . Therefore and . Let be such that for and ,
where and are defined according to (28). By the definition of and ,
(83) |
Note that
We estimate the above two terms separately.
Step 1. In this step, we control .
Denote .
Since and are independent,
Therefore for any ,
and
So
and
(84) |
By Lemma 12, it follows that
Therefore
Step 2. In this step, we bound . The procedure is similar to that of Lemma 9 and Lemma 22. Let and be such that and . Denote
Since and are orthonormal basis functions of , it follows that
(85) |
Therefore,
(86)
(87)
(88)
(89)
where the third equality follows from (34) and (28).
Step 3. Here we bound the above four terms separately. Observe that
Since is a subGaussian process with parameter and, by (85), , it follows that is a subGaussian random variable with parameter . Similarly, is subGaussian with parameter . Therefore is sub-exponential with parameter .
It follows that
For (87), note that is subGaussian with parameter and is subGaussian with parameter . Therefore is sub-exponential with parameter .
Consequently
Similarly, it holds that
Step 4. Summarizing the above four terms, the first term is dominant. Therefore,
Let be a covering net of unit ball in and be a covering net of unit ball in , then by 4.4.3 on page 90 of Vershynin (2018)
So by union bound and the fact that the size of is bounded by and the size of is bounded by ,
(90) |
This implies that
Therefore by Step 1 and Step 2
where the last equality follows from the fact that . The desired result follows from (83). ∎
Appendix F Perturbation bounds
F.1 Compact operators on Hilbert spaces
Lemma 14.
Let and be two compact self-adjoint operators on a Hilbert space . Denote and to be the k-th eigenvalue of and respectively. Then
Proof.
By the min-max principle, for any compact self-adjoint operators and any being a -dimensional subspace
It follows that
The other direction follows from symmetry. ∎
For any compact operator , by Theorem 13 of Bell (2014), there exist orthogonal bases and such that
where are the singular values of . So
(91) |
Lemma 15.
Let and be two separable Hilbert spaces. Suppose and are two compact operators from . Then
Proof.
Let and be the orthogonal basis of and . Let
Denote
Note that due to linearity. Since and are compact,
Then and are two compact self-adjoint operators on and that
By Lemma 14, . Since and are both positive, by (91),
Similarly
By finite-dimensional SVD perturbation theory (see Theorem 3.3.16 on page 178 of Horn and Johnson (1994)), it follows that
The desired result follows by taking the limit as .
∎
F.2 Subspace perturbation bounds
Theorem 6 (Wedin).
Suppose without loss of generality that . Let be two matrices in whose SVDs are given respectively by
where and . For any , let
Denote and . If , then
Proof.
This follows from Lemma 2.6 and Theorem 2.9 of Chen et al. (2021). ∎
Corollary 4.
Let and be two Hilbert spaces. Let and be two finite-rank operators on and denote . Let the SVDs of and be given respectively by
where and . For , denote
Let be the projection matrix from to , and be the projection matrix from to . If , then
Proof.
Let
Then and can be viewed as finite-dimensional matrices on . Since and , the desired result follows from Theorem 6.
∎
Appendix G Additional technical results
Lemma 16.
For positive integers , let . Let and for , let be a collection of operators such that . Then
Lemma 17.
For , let be a collection of orthogonal basis functions of . Suppose is such that . Then is a function in and
(92) |
Note that (92) is independent of the choice of basis functions.
Proof.
This is a classical functional analysis result. ∎
Lemma 18.
For , let and . Let . For , let be a collection of subspaces and be such that . Then
(93) |
Proof.
Lemma 19.
Let be any tensor. For , suppose
where are orthonormal functions in . Then
Therefore the core size of the tensor is .
Proof.
It suffices to observe that, as a linear map, is in the orthogonal complement of the subspace and are the orthonormal basis of . ∎
Lemma 20.
Let be a linear subspace of spanned by the orthonormal basis functions . Suppose is a generic function in . If
Then , where
Proof.
This is a well-known projection property in Hilbert space. ∎
Appendix H Extension: multivariable nonparametric regression
In this section, we apply our sketching estimator to nonparametric regression models. To begin, suppose the observed data satisfy
(94) |
where are measurement errors and is the unknown regression function. We first present our theory assuming that the random designs are independently sampled from the uniform density on the domain in Corollary 5. The general setting, where are sampled from an unknown generic density function, will be discussed in Corollary 6.
Let be the estimator such that for any non-random function ,
(95) |
where is the value of the function evaluated at the sample point . In the following corollary, we formally summarize the statistical guarantee of the regression function estimator detailed in Algorithm 2.
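For concreteness, a minimal sketch of one natural reading of the estimator in (95) for two variables, namely averaging the responses against products of basis functions evaluated at the design points, is given below; the cosine basis, the helper names, and the toy regression function are hypothetical.

```python
import numpy as np

def cosine_basis(m):
    # Orthonormal cosine basis on [0, 1], used purely for illustration.
    return [(lambda x, j=j: np.ones_like(x) if j == 0 else np.sqrt(2) * np.cos(np.pi * j * x))
            for j in range(m)]

def regression_sketch(X, Y, phis, psis):
    """Hedged reading of (95) for two variables: estimate
    M[j, k] ~ (1/n) * sum_i Y_i * phi_j(X_i1) * psi_k(X_i2),
    which targets the inner product of the regression function with
    phi_j (x) psi_k under a uniform design."""
    Phi = np.column_stack([phi(X[:, 0]) for phi in phis])   # (n, m1)
    Psi = np.column_stack([psi(X[:, 1]) for psi in psis])   # (n, m2)
    return (Phi * Y[:, None]).T @ Psi / len(Y)              # (m1, m2)

rng = np.random.default_rng(0)
X = rng.random((50_000, 2))                                  # uniform design on [0, 1]^2
Y = np.cos(np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1]) + 0.1 * rng.standard_normal(50_000)
M = regression_sketch(X, Y, cosine_basis(6), cosine_basis(6))
print(M.shape)   # (6, 6); the (1, 1) entry should be close to 0.5
```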
Corollary 5.
Suppose the observed data satisfy (94), where are i.i.d. centered subGaussian noise with subGaussian parameter , are independently sampled from the uniform distribution on , and with and . Suppose in addition that satisfies Assumption 3, and that are in the form of (24) and (25), where are derived from reproducing kernel Hilbert spaces, the Legendre polynomial basis, or spline basis functions. Let in (95), , and be the inputs of Algorithm 2, and let be the corresponding output. Denote
and suppose that, for a sufficiently large constant ,
If for some sufficiently large constant and , then it holds that
(96) |
Proof.
Assumption 2 is verified in Section B.1 for reproducing kernel Hilbert spaces, Section B.2 for Legendre polynomials, and Section B.3 for the spline basis. Assumption 4 is shown in Lemma 21. Therefore all the conditions in Theorem 5 are met and Corollary 5 immediately follows. ∎
In the following result, we extend our approach to the general setting where the random designs are sampled from a generic density function . To achieve consistent regression estimation in this context, we propose adjusting our estimator to incorporate an additional density estimator. This modification aligns with techniques commonly used in classical nonparametric methods, such as the Nadaraya–Watson kernel regression estimator.
Corollary 6.
Suppose are random designs independently sampled from a common density function such that where is a universal positive constant. Let
where is defined in Corollary 5, and is any generic density estimator of . Denote . Suppose in addition that all of the other conditions in Corollary 5 hold. Then
Note that if is also a low-rank density function, then we can estimate via Algorithm 2 with a reduced curse of dimensionality. Consequently, the regression function can be estimated with a reduced curse of dimensionality even when the random designs are sampled from a non-uniform density.
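For concreteness, a minimal sketch of this adjustment, assuming it amounts to dividing the estimate of the product of the regression function and the design density by a density estimate truncated away from zero, in the Nadaraya–Watson spirit described above; the callables and the truncation level `floor` are hypothetical placeholders for the quantities in the corollary.

```python
import numpy as np

def adjusted_regression_estimate(g_hat, p_hat, x, floor=1e-3):
    """g_hat: estimate of (regression function) * (design density);
    p_hat: generic density estimate of the design density.
    Returns the ratio, with p_hat truncated below at `floor` so the division
    is never taken against (near-)zero density values."""
    return g_hat(x) / np.maximum(p_hat(x), floor)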
Proof of Corollary 6.
Suppose is sufficiently large so that . Let be a generic element in . Based on the definition, . Thus, when , . When , note that
where the first equality follows from , the second equality follows from , the inequality follows from and the last equality follows from . Therefore for all and it follows that
(97) |
By Corollary 5,
(98) |
Therefore
(99) |
The desired result follows from the observation that
and that
where the last inequality follows from (97). ∎
H.1 Additional technical results for regression
Lemma 21.
Let be defined as in (95). Suppose all the conditions in Corollary 5 hold. Then satisfies Assumption 4.
Proof.
Note that are sampled from the uniform density and that . Therefore
where the second equality holds since and are independent. Suppose . Then
∎
Lemma 22.
Let be such that . Suppose is a collection of basis functions such that . For positive integers and , denote
(100) |
Then it holds that
Proof.
Similar to Lemma 9, by ordering the indexes and in (100), we can also write
(101) |
Note that and are both zero in the orthogonal complement of the subspace . Let and be two matrices in such that
where and . Therefore
(102) |
Since are subGaussian, it follows from a union bound argument that there exists a sufficiently large constant such that
(103) |
The following steps are similar to those in Lemma 9, but here we also need to account for the variance introduced by the additional random variables .
Step 1. Let and suppose that . Then by orthonormality of in , it follows that
In addition, since
it follows that for all and
Step 2. Let and suppose that . Then by orthonormality of in ,
In addition, since
it follows that . Therefore
Step 3. For fixed and , we bound . Let . Since the measurement errors and the random designs are independent, it follows that
where the last equality follows from Step 1 and Step 2. In addition, suppose the good event in (103) holds. Then uniformly for all ,
So for given , by Bernstein’s inequality
Step 4. Let be a covering net of unit ball in and be a covering net of unit ball in , then by 4.4.3 on page 90 of Vershynin (2018)
So by union bound and the fact that the size of is bounded by and the size of is bounded by ,
(104) |
Therefore
as desired. ∎
Corollary 7.
Suppose is a collection of basis functions such that . Let
If in addition and , then
(105) |
Proof.
The proof is almost the same as that of Corollary 3 with the above choice of and . ∎
H.2 Simulations and real data
We analyze the numerical performance of our VRS, Nadaraya–Watson kernel regression (NWKR) estimators, and neural networks (NN) in various nonparametric regression problems. The implementation of VRS is provided in Algorithm 2, and the inputs of VRS in the regression setting are detailed in Corollary 6. For the neural network estimators, we use feedforward architectures that are either wide or deep. We measure the estimation accuracy by the relative -error, defined as
where is the regression function estimator of a given method. The subsequent simulations and real data examples consistently demonstrate that VRS outperforms both NN and NWKR in a range of nonparametric regression problems.
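For concreteness, one plausible Monte Carlo implementation of such a relative error is sketched below; the integration domain, grid size, and seed are arbitrary choices and the helper name is hypothetical.

```python
import numpy as np

def relative_l2_error(f_hat, f_star, d, n_mc=100_000, seed=0):
    """Monte Carlo estimate of ||f_hat - f*|| / ||f*|| over [0, 1]^d."""
    x = np.random.default_rng(seed).random((n_mc, d))
    num = np.sqrt(np.mean((f_hat(x) - f_star(x)) ** 2))
    den = np.sqrt(np.mean(f_star(x) ** 2))
    return num / den
```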
Simulation . We sample data from the regression model
where are independently sampled from the standard normal distribution, are sampled from the uniform distribution on , and
In the first set of experiments, we set and vary the sample size from million to million. In the second set of experiments, the sample size is fixed at million, while the dimensionality varies from to . Each experimental setup is replicated 100 times to ensure robustness, and we present the average relative -error for each method in Figure 10.
[Figure 10: average relative error for each method.]
Simulation . We sample data from the regression model
where are independently sampled from the standard normal distribution, and are independently sampled in from a -dimensional truncated Gaussian distribution with mean vector and covariance matrix . Here is the identity matrix in dimensions. In addition,
for . In the first set of experiments, we fix and vary vary from million to million. In the second set of experiments, we fix the sample size at million and let vary from to . We repeat each experiment setting times and report the averaged relative -error for each method in Figure 11.
[Figure 11: average relative error for each method.]
Real data . We study the problem of predicting house prices in California using the California housing dataset. This dataset contains house price data from the 1990 California census, along with continuous features such as location, median house age, and total number of bedrooms, for house price prediction. Since the true regression function is unknown, we randomly split the dataset into 90% training and 10% test data and evaluate the performance of the various approaches by the relative test error. Let be any regression estimator computed based on the training data. The relative test error of this estimator is defined as
where are the test data. The relative test errors for VRS, NWKR, and NN are , , and , respectively, showing that VRS numerically surpasses the other methods in this real data example.
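A minimal sketch of this evaluation protocol, assuming the 1990-census California housing data as distributed with scikit-learn and a relative test error of the form sqrt(sum_i (f_hat(x_i) - y_i)^2 / sum_i y_i^2) over the held-out pairs; whether the paper used exactly this loader or preprocessing is an assumption.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# 90% / 10% random split of the California housing data.
X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

def relative_test_error(predict, X_test, y_test):
    # predict: the prediction function of any estimator fitted on (X_tr, y_tr).
    resid = predict(X_test) - y_test
    return np.sqrt(np.sum(resid ** 2) / np.sum(y_test ** 2))
```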
Appendix I Additional numerical results and details
I.1 Kernel methods
In our simulated experiments and real data examples, we choose the Gaussian kernel for the kernel density estimators and the Nadaraya–Watson kernel regression (NWKR) estimators. The bandwidths in all the numerical examples are chosen using cross-validation. We refer interested readers to Wasserman (2006) for an introduction to nonparametric statistics.
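For concreteness, a minimal sketch of such a baseline, a Gaussian-kernel density estimator with its bandwidth selected by cross-validation, is given below; the bandwidth grid, the number of folds, and the placeholder data are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Gaussian KDE with bandwidth chosen by 5-fold log-likelihood cross-validation.
X = np.random.default_rng(0).normal(size=(2000, 2))        # placeholder data
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-2, 0, 20)}, cv=5)
grid.fit(X)
kde = grid.best_estimator_
log_density = kde.score_samples(X[:5])                      # log-density at 5 points
```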
I.2 Sample generation in VRS
Following Algorithm 5, we obtain the tensor-represented density . To generate samples, we follow a Gibbs-sampling scheme and rewrite the joint density as a product of one-dimensional conditional densities:
where is the one-dimensional marginal density, which requires integration over the other variables. In our setting, we use orthogonal basis functions, and the first basis function is commonly a constant. By orthogonality, we only need the first slices of the coefficient tensor to compute this one-dimensional marginal density, which requires only complexity. We then use Markov chain Monte Carlo (MCMC) to generate a sample from this one-dimensional marginal density. Next, we compute the next one-dimensional density . We fix at and choose the first slices of the coefficient tensor to represent the integration over the other variables. After that, we compute this one-dimensional density and use MCMC to generate a sample . We repeat the procedure for all conditional probabilities, so that a full sample can be generated in complexity; see the sketch below.
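For concreteness, here is a minimal sketch (for three variables and a cosine basis chosen purely for illustration) of the marginalization step described above: because the first basis function is constant and the remaining ones integrate to zero, integrating out a coordinate keeps only the corresponding zeroth slice of the coefficient tensor. The inverse-CDF step stands in for the MCMC sampler in the text, and the toy tensor C is not a fitted VRS output.

```python
import numpy as np

def cosine_basis_matrix(x, m):
    # Orthonormal cosine basis on [0, 1]; the first basis function is constant.
    B = np.ones((len(x), m))
    for j in range(1, m):
        B[:, j] = np.sqrt(2) * np.cos(np.pi * j * x)
    return B

def sample_1d(density_vals, grid, rng):
    """Inverse-CDF sampling of a 1-D density tabulated on a grid."""
    w = np.clip(density_vals, 0.0, None)        # small negative values thresholded away
    cdf = np.cumsum(w); cdf /= cdf[-1]
    return np.interp(rng.random(), cdf, grid)

rng = np.random.default_rng(0)
m = 8
C = rng.standard_normal((m, m, m)) * 0.01; C[0, 0, 0] = 1.0   # toy coefficient tensor
grid = np.linspace(0.0, 1.0, 512)
B = cosine_basis_matrix(grid, m)

p1 = B @ C[:, 0, 0]                              # marginal density of x1 on the grid
x1 = sample_1d(p1, grid, rng)
b1 = cosine_basis_matrix(np.array([x1]), m)[0]   # basis evaluated at the sampled x1
p2 = B @ (C[:, :, 0].T @ b1)                     # (unnormalized) density of x2 given x1
x2 = sample_1d(p2, grid, rng)
```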
The density produced by the tensor-sketching-based method may not always be positive. There are postprocessing techniques to deal with this issue. For example, Han et al. (2018) propose minimizing the distance between the output density and the square of a test density, where the test density shares the same tensor architecture; the square of the test density can then be used for MCMC sampling. In addition, based on our observations, the negative part of the output density is quite small, and we can choose a small threshold such as to avoid the negative component.
I.3 Neural network density estimator
For neural network estimators, we use two popular density estimation architectures: Masked Autoregressive Flow (MAF) (Papamakarios et al. (2017)) and Neural Autoregressive Flows (NAF) (Huang et al. (2018)) for comparisons. Both neural networks are trained using the Adam optimizer (Kingma and Ba (2014)). For MAF, we use 5 transforms and each transform is a 3-hidden layer neural network with width 128. For NAF, we choose 3 transforms and each transform is a 3-hidden layer neural network with width 128.
I.4 Additional image denoising result
We provide additional image denoising results in this subsection. In Figure 12, we have randomly selected another five images from the test sets of the USPS digits dataset and the MNIST dataset to illustrate the denoised results using VRS and kernel PCA.
[Figure 12: denoising results of VRS and kernel PCA on USPS and MNIST test images.]