
Nonparametric Density Estimation via Variance-Reduced Sketching

Yifan Peng (Committee on Computational and Applied Mathematics, University of Chicago), Yuehaw Khoo (corresponding author; Department of Statistics, University of Chicago), Daren Wang (Department of Statistics, University of Notre Dame)
Abstract

Nonparametric density models are of great interest in various scientific and engineering disciplines. Classical kernel density methods, while numerically robust and statistically sound in low-dimensional settings, become inadequate even in moderately high-dimensional settings due to the curse of dimensionality. In this paper, we introduce a new framework called Variance-Reduced Sketching (VRS), specifically designed to estimate multivariable density functions with a reduced curse of dimensionality. Our framework conceptualizes multivariable functions as infinite-size matrices and facilitates a new sketching technique, motivated by the numerical linear algebra literature, to reduce the variance in density estimation problems. We demonstrate the robust numerical performance of VRS through a series of simulated experiments and real-world data applications. Notably, VRS shows remarkable improvement over existing neural network estimators and classical kernel methods in numerous density models. Additionally, we offer theoretical justifications for VRS to support its ability to deliver nonparametric density estimation with a reduced curse of dimensionality.

Keywords— Density estimation, Matrix sketching, Tensor sketching, Range estimation, High-dimensional function approximation

1 Introduction

Nonparametric density estimation has extensive applications across diverse fields, including biostatistics (Chen (2017)), machine learning (Sriperumbudur and Steinwart (2012) and Liu et al. (2023)), engineering (Chaudhuri et al. (2014)), and economics (Zambom and Dias (2013)). In this context, the goal is to estimate the underlying density function based on a collection of independently and identically distributed data. Classical density estimation methods such as histograms (Scott (1979) and Scott (1985)) and kernel density estimators (Parzen (1962) and Davis et al. (2011)), known for their numerical robustness and statistical stability in lower-dimensional settings, often suffer from the curse of dimensionality even in moderately high-dimensional spaces. Mixture models (Dempster (1977) and Escobar and West (1995)) offer a potential solution for higher-dimensional problems, but these methods lack the flexibility to extend to nonparametric settings. Additionally, adaptive methods (Liu et al. (2007) and Liu et al. (2023)) have been developed to address higher-dimensional challenges. Recently, deep generative modeling has emerged as a popular technique to approximate high-dimensional densities from given samples, especially in the context of images. This category includes generative adversarial networks (GAN) (Goodfellow et al. (2020); Liu et al. (2021)), normalizing flows (Dinh et al. (2014); Rezende and Mohamed (2015); Dinh et al. (2016)), and autoregressive models (Germain et al. (2015); Uria et al. (2016); Papamakarios et al. (2017); Huang et al. (2018)). Despite their remarkable performance, particularly with high-resolution images, statistical guarantees for these neural network methods remain a challenge. In this paper, we aim to develop a new framework specifically designed to estimate multivariate nonparametric density functions. Within this framework, we conceptualize multivariate density functions as matrices or tensors and extend matrix/tensor approximation algorithms to this nonparametric setting, in order to reduce the curse of dimensionality in higher dimensions.

1.1 Contributions

Motivated by Hur et al. (2023), we propose a new matrix/tensor-based sketching framework for estimating multivariate density functions in nonparametric models, referred to as Variance-Reduced Sketching (VRS). To illustrate the advantages of VRS, consider the setting of estimating an $\alpha$-times differentiable density function in $d$ dimensions. In terms of squared ${\bf L}_{2}$ norm, the error rate of the classical kernel density estimator (KDE) with sample size $N$ is of order

\[
\operatorname{O}\bigg(\frac{1}{N^{2\alpha/(2\alpha+d)}}\bigg), \tag{1}
\]

while the VRS estimator can achieve error rates of order

\[
\operatorname{O}\bigg(\frac{1}{N^{2\alpha/(2\alpha+1)}}+\frac{\prod_{j=1}^{d}r_{j}}{N}\bigg). \tag{2}
\]

(Assuming the minimal singular values are all constants, the error bound in (2) matches Theorem 2.) Here, $\{r_{j}\}_{j=1}^{d}$ are the ranks of the true density function, whose precise definition can be found in Assumption 3. Note that when $\{r_{j}\}_{j=1}^{d}$ are bounded constants, the error rate in (2) reduces to $\operatorname{O}\big(\frac{1}{N^{2\alpha/(2\alpha+1)}}\big)$, which is a one-dimensional nonparametric error rate, significantly better than (1), especially when $d$ is large. In Figure 1, we conduct a simulation study to numerically compare deep neural network estimators, classical kernel density estimators, and VRS. Additional empirical studies are provided in Section 5. Extensive numerical evidence suggests that VRS outperforms various deep neural network estimators and KDE by a considerable margin.

Figure 1: Density estimation with data sampled from the Ginzburg-Landau density in (37). The $x$-axis represents dimensionality, varying from 2 to 10. The performance of different estimators is evaluated using ${\bf L}_{2}$-errors. Additional details are provided in Simulation III of Section 5.

The promising performance of VRS comes from its ability to reduce the problem of multivariate density estimation to the problem of estimating low-rank matrices/tensors in a space of functions. For example, we can think of a two-variable density function as an infinite-size matrix. When this two-variable density is estimated using classical kernel methods, the estimator typically exhibits a two-dimensional error rate. In contrast, VRS reduces the matrix estimation problem to an estimation problem of its range. Since the range of the density function consists of single-variable functions when the rank is low, we achieve a single-variable estimation error rate. This philosophy can be generalized to density estimation in arbitrary dimensions. To estimate the range, VRS employs a new sketching strategy, wherein the empirical density function is sketched with a set of carefully chosen sketch functions, as illustrated in Figure 2. These sketch functions are selected to retain the range of the density while reducing the variance of the range estimation from multidimensional scaling to 1-dimensional scaling, hence the name Variance-Reduced Sketching.

Our VRS sketching approach allows us to achieve the error bound demonstrated in (2), contrasting with the error bound of KDE in (1), which suffers from the curse of dimensionality. Our VRS framework is fundamentally different from the randomized sketching algorithms for finite-dimensional matrices previously studied in Halko et al. (2011), as randomly chosen sketches do not address the curse of dimensionality in multivariable density estimation models. Additionally, the statistical guarantees for VRS are derived from a computationally tractable algorithm. This contrasts with deep learning methods, where generalization errors depend heavily on the architecture of the neural network estimators, and the statistical analysis of generalization errors of neural network estimators is not necessarily related to the optimization errors achieved by computationally tractable optimizers.

Figure 2: The sketched function $AS$ produced by VRS retains the range in the variable $x$ of $A(x,y)$. The complexity of estimating the range of $A$ using $AS$ is much lower than the complexity of directly estimating $A$.

1.2 Related literature

Matrix approximation algorithms, such as the singular value decomposition and the QR decomposition, play a crucial role in computational mathematics and statistics. A notable advancement in this area is the emergence of randomized low-rank approximation algorithms. These algorithms excel in reducing time and space complexity substantially without sacrificing too much numerical accuracy. Seminal contributions to this area are outlined in works such as Liberty et al. (2007) and Halko et al. (2011). Additionally, review papers like Woodruff et al. (2014), Drineas and Mahoney (2016), Martinsson (2019), and Martinsson and Tropp (2020) have provided comprehensive summaries of these randomized approaches, along with their theoretical stability guarantees. Randomized low-rank approximation algorithms typically start by estimating the range of a large low-rank matrix $A\in\mathbb{R}^{n\times n}$ by forming a reduced-size sketch. This is achieved by right multiplying $A$ with a random matrix $S\in\mathbb{R}^{n\times k}$, where $k\ll n$. The random matrix $S$ is selected to ensure that the range of $AS$ remains a close approximation of the range of $A$, even when the column size of $AS$ is significantly reduced from that of $A$. As such, the random matrix $S$ is referred to as a randomized linear embedding or sketching matrix by Tropp et al. (2017b) and Nakatsukasa and Tropp (2021). The sketching approach reduces the cost of the singular value decomposition from $\operatorname{O}(n^{3})$ to $\operatorname{O}(n^{2})$, where $\operatorname{O}(n^{2})$ represents the complexity of forming the sketch $AS$.

Recently, a series of studies have extended the matrix sketching technique to range estimation for high-order tensor structures, such as the Tucker structure (Che and Wei (2019); Sun et al. (2020); Minster et al. (2020)) and the tensor train structure (Al Daas et al. (2023); Kressner et al. (2023); Shi et al. (2023)). These studies developed specialized structures for sketching to reduce computational complexity while maintaining high levels of numerical accuracy in handling high-order tensors.

Previous work has also explored randomized sketching techniques in specific estimation problems. For instance, Mahoney et al. (2011) and Raskutti and Mahoney (2016) utilized randomized sketching to solve unconstrained least squares problems. Williams and Seeger (2000), Rahimi and Recht (2007), Kumar et al. (2012), and Tropp et al. (2017a) improved the Nyström method with randomized sketching techniques. Similarly, Alaoui and Mahoney (2015), Wang et al. (2017), and Yang et al. (2017) applied randomized sketching to kernel matrices in kernel ridge regression to reduce computational complexity. While these studies mark significant progress in the literature, they usually require extensive observation of the estimated function prior to employing the randomized sketching technique in order to maintain acceptable accuracy. This step would be significantly expensive in the higher-dimensional setting. Notably, Hur et al. (2023) and subsequent studies Tang et al. (2022); Ren et al. (2023); Peng et al. (2023); Chen and Khoo (2023); Tang and Ying (2023) addressed these issues by taking the variance of the data generation process into account when constructing the sketch for high-dimensional tensor estimation. This sketching technique allows for the direct estimation of the range of a tensor with reduced sample complexity, rather than directly estimating the full tensor.

1.3 Organization

Our manuscript is organized as follows. In Section 2, we detail the procedure for implementing the range estimation through sketching and study the corresponding error analysis. In Section 3, we extend our study to density function estimation based on the range estimators developed in Section 2, and provide the corresponding statistical error analysis with a reduced curse of dimensionality. Additionally, we extend our method to image PCA denoising in Section 4. In Section 5, we present comprehensive numerical results to demonstrate the superior numerical performance of our method. Finally, Section 6 summarizes our conclusions.

1.4 Notations

We use $\mathbb{N}$ to denote the natural numbers $\{1,2,3,\ldots\}$ and $\mathbb{R}$ to denote the real numbers. We say that $X_{n}=\operatorname{O}_{\mathbb{P}}(a_{n})$ if for any given $\epsilon>0$, there exists a $K_{\epsilon}>0$ such that $\limsup_{n\to\infty}\mathbb{P}(|X_{n}/a_{n}|\geq K_{\epsilon})<\epsilon$. For real sequences $\{a_{n}\}_{n=1}^{\infty}$, $\{b_{n}\}_{n=1}^{\infty}$, and $\{c_{n}\}_{n=1}^{\infty}$, we denote $a_{n}=\operatorname{O}(b_{n})$ if $\lim_{n\to\infty}a_{n}/b_{n}<\infty$ and $a_{n}=\operatorname{o}(c_{n})$ if $\lim_{n\to\infty}a_{n}/c_{n}=0$. Let $[0,1]$ denote the unit interval in $\mathbb{R}$. For a positive integer $d$, denote the $d$-dimensional unit cube as $[0,1]^{d}$.

Let $\{f_{i}\}_{i=1}^{n}$ be a collection of elements in the Hilbert space $\mathcal{H}$. Then

\[
\operatorname{Span}\{f_{i}\}_{i=1}^{n}=\{b_{1}f_{1}+\cdots+b_{n}f_{n}:\{b_{i}\}_{i=1}^{n}\subset\mathbb{R}\}. \tag{3}
\]

Note that $\operatorname{Span}\{f_{i}\}_{i=1}^{n}$ is a linear subspace of $\mathcal{H}$. For a generic measurable set $\Omega\subset\mathbb{R}^{d}$, denote ${\bf L}_{2}(\Omega)=\big\{f:\Omega\to\mathbb{R}:\|f\|^{2}_{{\bf L}_{2}(\Omega)}=\int_{\Omega}f^{2}(z)\mathrm{d}z<\infty\big\}$. For any $f,g\in{\bf L}_{2}(\Omega)$, let the inner product between $f$ and $g$ be

\[
\langle f,g\rangle=\int_{\Omega}f(z)g(z)\mathrm{d}z.
\]

We say that $\{\phi_{k}\}_{k=1}^{\infty}$ is a collection of orthonormal functions in ${\bf L}_{2}(\Omega)$ if $\langle\phi_{k},\phi_{l}\rangle=1$ if $k=l$, and $0$ if $k\neq l$. Let $Z\in\Omega$ and let $\mathbb{I}_{Z}$ be the indicator function at the point $Z$. We define

\[
\langle\mathbb{I}_{\{Z\}},f\rangle=f(Z).
\]

If $\{\phi_{k}\}_{k=1}^{\infty}$ spans ${\bf L}_{2}(\Omega)$, then $\{\phi_{k}\}_{k=1}^{\infty}$ is a collection of orthonormal basis functions of ${\bf L}_{2}(\Omega)$. In what follows, we briefly introduce the notation for Sobolev spaces. Let $\Omega\subset\mathbb{R}^{d}$ be any measurable set. For a multi-index $\beta=(\beta_{1},\ldots,\beta_{d})\in\mathbb{N}^{d}$ and $f(z_{1},\ldots,z_{d}):\Omega\to\mathbb{R}$, define the $\beta$-derivative of $f$ as $D^{\beta}f=\partial^{\beta_{1}}_{1}\cdots\partial^{\beta_{d}}_{d}f$. Then

\[
W^{\alpha}_{2}(\Omega):=\{f\in{\bf L}_{2}(\Omega):\ D^{\beta}f\in{\bf L}_{2}(\Omega)\text{ for all }|\beta|\leq\alpha\},
\]

where $|\beta|=\beta_{1}+\cdots+\beta_{d}$ and $\alpha$ represents the total order of derivatives. The Sobolev norm of $f\in W^{\alpha}_{2}(\Omega)$ is $\|f\|_{W^{\alpha}_{2}(\Omega)}^{2}=\sum_{0\leq|\beta|\leq\alpha}\|D^{\beta}f\|_{{\bf L}_{2}(\Omega)}^{2}$.

1.5 Background: linear algebra in tensorized function spaces

We briefly introduce linear algebra in tensorized function spaces, which is necessary for developing our Variance-Reduced Sketching (VRS) framework for nonparametric density estimation.

Multivariable functions as tensors

Given positive integers $\{s_{j}\}_{j=1}^{d}\subset\mathbb{N}$, let $\Omega_{j}\subset\mathbb{R}^{s_{j}}$. Let $A(z_{1},\ldots,z_{d}):\Omega_{1}\times\Omega_{2}\times\cdots\times\Omega_{d}\to\mathbb{R}$ be a generic multivariable function and $u_{j}(z_{j})\in{\bf L}_{2}(\Omega_{j})$ for $j=1,\ldots,d$. Denote

\[
A[u_{1},\ldots,u_{d}]=\int_{\Omega_{1}}\cdots\int_{\Omega_{d}}A(z_{1},\ldots,z_{d})u_{1}(z_{1})\cdots u_{d}(z_{d})\mathrm{d}z_{1}\cdots\mathrm{d}z_{d}. \tag{4}
\]

Define the Frobenius norm of $A$ as

\[
\|A\|_{F}^{2}=\sum_{k_{1},\ldots,k_{d}=1}^{\infty}A^{2}[\phi_{1,k_{1}},\ldots,\phi_{d,k_{d}}], \tag{5}
\]

where $\{\phi_{j,k_{j}}\}_{k_{j}=1}^{\infty}$ are any orthonormal basis functions of ${\bf L}_{2}(\Omega_{j})$. Note that in (5), $\|A\|_{F}$ is independent of the choice of basis functions $\{\phi_{j,k_{j}}\}_{k_{j}=1}^{\infty}$ in ${\bf L}_{2}(\Omega_{j})$ for $j=1,\ldots,d$. The operator norm of $A$ is defined as

\[
\|A\|_{\operatorname{op}}=\sup_{\|u_{j}\|_{{\bf L}_{2}(\Omega_{j})}\leq 1,\ j=1,\ldots,d}A[u_{1},\ldots,u_{d}]. \tag{6}
\]

We say that $A$ is a tensor in the tensor product space ${\bf L}_{2}(\Omega_{1})\otimes{\bf L}_{2}(\Omega_{2})\otimes\cdots\otimes{\bf L}_{2}(\Omega_{d})$ if there exist scalars $\{b_{k_{1},\ldots,k_{d}}\}_{k_{1},\ldots,k_{d}=1}^{\infty}\subset\mathbb{R}$ and functions $\{f_{j,k_{j}}\}_{k_{j}=1}^{\infty}\subset{\bf L}_{2}(\Omega_{j})$ for each $j=1,\ldots,d$ such that

\[
A(z_{1},\ldots,z_{d})=\sum_{k_{1},\ldots,k_{d}=1}^{\infty}b_{k_{1},\ldots,k_{d}}f_{1,k_{1}}(z_{1})\cdots f_{d,k_{d}}(z_{d}) \tag{7}
\]

and that $\|A\|_{F}<\infty$. From classical functional analysis, we have that

\[
{\bf L}_{2}(\Omega_{1})\otimes{\bf L}_{2}(\Omega_{2})\otimes\cdots\otimes{\bf L}_{2}(\Omega_{d})={\bf L}_{2}(\Omega_{1}\times\Omega_{2}\times\cdots\times\Omega_{d})\quad\text{and}\quad\|A\|_{{\bf L}_{2}(\Omega_{1}\times\cdots\times\Omega_{d})}=\|A\|_{F}. \tag{8}
\]
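As a concrete illustration of (4), (5), and the identification in (8), the following minimal NumPy sketch (ours, not the authors' implementation; the test function, basis truncation, and quadrature rule are illustrative assumptions) encodes a smooth two-variable function by its coefficient matrix in an orthonormal shifted Legendre basis and checks numerically that the Frobenius norm of the coefficient matrix agrees with the ${\bf L}_{2}$ norm of the function.

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_basis(t, m):
    """First m orthonormal shifted Legendre polynomials on [0,1], shape (len(t), m)."""
    V = L.legvander(2.0 * np.asarray(t) - 1.0, m - 1)      # P_0, ..., P_{m-1}
    return V * np.sqrt(2.0 * np.arange(m) + 1.0)

# Gauss-Legendre quadrature mapped from [-1,1] to [0,1].
nodes, weights = L.leggauss(60)
t, w = (nodes + 1.0) / 2.0, weights / 2.0

A = lambda x, y: np.exp(-3.0 * (x - y) ** 2) + np.sin(np.pi * x) * y  # smooth test function

m = 20
Phi = legendre_basis(t, m)                                 # (q, m)
Avals = A(t[:, None], t[None, :])                          # function values on the grid

# Coefficient matrix B[k1, k2] = A[phi_{k1}, phi_{k2}], as in (4)-(5).
B = (Phi * w[:, None]).T @ Avals @ (Phi * w[:, None])

l2_sq = np.sum((w[:, None] * w[None, :]) * Avals ** 2)     # ||A||_{L2([0,1]^2)}^2
print(l2_sq, np.sum(B ** 2))                               # ||B||_F^2 approximately equals ||A||_{L2}^2
```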

Projection operators in function spaces

Let $\mathcal{U}=\operatorname{span}\{\phi_{1},\ldots,\phi_{m}\}\subset{\bf L}_{2}(\Omega)$, where $m\in\mathbb{N}$ and $\{\phi_{\mu}\}_{\mu=1}^{m}$ are any orthonormal functions. Then $\mathcal{U}$ is an $m$-dimensional linear subspace of ${\bf L}_{2}(\Omega)$ and we denote $\dim(\mathcal{U})=m$. Let $\mathcal{P}_{\mathcal{U}}$ be the projection operator onto the subspace $\mathcal{U}$. Therefore, for any $f\in{\bf L}_{2}(\Omega)$, the projection of $f$ onto $\mathcal{U}$ is

\[
\mathcal{P}_{\mathcal{U}}(f)=\sum_{\mu=1}^{m}\langle f,\phi_{\mu}\rangle\phi_{\mu}. \tag{9}
\]

Note that we always have $\|\mathcal{P}_{\mathcal{U}}\|_{\operatorname{op}}\leq 1$ for any projection operator $\mathcal{P}_{\mathcal{U}}$.
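The projection (9) is straightforward to realize numerically once an orthonormal basis of $\mathcal{U}$ is fixed. The short sketch below (illustrative only; the Legendre basis, quadrature rule, and test function are our own choices) computes the coefficients $\langle f,\phi_{\mu}\rangle$ by Gauss-Legendre quadrature and evaluates $\mathcal{P}_{\mathcal{U}}(f)$.

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_basis(t, m):
    V = L.legvander(2.0 * np.asarray(t) - 1.0, m - 1)
    return V * np.sqrt(2.0 * np.arange(m) + 1.0)   # orthonormal on [0,1]

nodes, weights = L.leggauss(60)
t, w = (nodes + 1.0) / 2.0, weights / 2.0          # quadrature on [0,1]

f = lambda z: np.exp(np.sin(2.0 * np.pi * z))      # a smooth test function
m = 12
Phi = legendre_basis(t, m)                         # (q, m)

coeffs = Phi.T @ (w * f(t))                        # <f, phi_mu> for mu = 1, ..., m
proj_f = lambda z: legendre_basis(z, m) @ coeffs   # P_U(f)(z), as in (9)

grid = np.linspace(0.0, 1.0, 5)
print(np.c_[f(grid), proj_f(grid)])                # the projection tracks f closely
```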

For $j=1,\ldots,d$, let $\{\phi_{j,\mu_{j}}\}_{\mu_{j}=1}^{\infty}$ be any orthonormal functions of ${\bf L}_{2}(\Omega_{j})$ and $\mathcal{U}_{j}=\operatorname{span}\{\phi_{j,\mu_{j}}\}_{\mu_{j}=1}^{m_{j}}$, where $m_{j}\in\mathbb{N}$. With the expansion in (7), define the tensor $A\times_{1}\mathcal{P}_{\mathcal{U}_{1}}\cdots\times_{d}\mathcal{P}_{\mathcal{U}_{d}}$ as

\[
\left(A\times_{1}\mathcal{P}_{\mathcal{U}_{1}}\cdots\times_{d}\mathcal{P}_{\mathcal{U}_{d}}\right)(z_{1},\ldots,z_{d})=\sum_{k_{1},\ldots,k_{d}=1}^{\infty}b_{k_{1},\ldots,k_{d}}\mathcal{P}_{\mathcal{U}_{1}}(f_{1,k_{1}})(z_{1})\cdots\mathcal{P}_{\mathcal{U}_{d}}(f_{d,k_{d}})(z_{d}). \tag{10}
\]

We have $\|A\times_{1}\mathcal{P}_{\mathcal{U}_{1}}\cdots\times_{d}\mathcal{P}_{\mathcal{U}_{d}}\|_{F}\leq\|A\|_{F}\|\mathcal{P}_{\mathcal{U}_{1}}\|_{\operatorname{op}}\cdots\|\mathcal{P}_{\mathcal{U}_{d}}\|_{\operatorname{op}}=\|A\|_{F}<\infty$ by Lemma 16 and the fact that $\|\mathcal{P}_{\mathcal{U}_{j}}\|_{\operatorname{op}}\leq 1$. It follows that $A\times_{1}\mathcal{P}_{\mathcal{U}_{1}}\cdots\times_{d}\mathcal{P}_{\mathcal{U}_{d}}$ is a tensor in ${\bf L}_{2}(\Omega_{1})\otimes{\bf L}_{2}(\Omega_{2})\otimes\cdots\otimes{\bf L}_{2}(\Omega_{d})$ and therefore can be viewed as a function in ${\bf L}_{2}(\Omega_{1}\times\Omega_{2}\times\cdots\times\Omega_{d})$.

More generally, given $k\leq d$, let $x=(z_{1},\ldots,z_{k})\in\Omega_{1}\times\cdots\times\Omega_{k}$ and $y=(z_{k+1},\ldots,z_{d})\in\Omega_{k+1}\times\cdots\times\Omega_{d}$. We can view $A(z_{1},\ldots,z_{d})=A(x,y)$ as a function in ${\bf L}_{2}(\Omega_{1}\times\cdots\times\Omega_{k})\otimes{\bf L}_{2}(\Omega_{k+1}\times\cdots\times\Omega_{d})$. For $j=1,\ldots,k$, let $\mathcal{U}_{j}\subset{\bf L}_{2}(\Omega_{j})$ be any subspace. Suppose $\mathcal{U}_{x}\subset{\bf L}_{2}(\Omega_{1}\times\cdots\times\Omega_{k})$ is such that

\[
\mathcal{U}_{x}=\mathcal{U}_{1}\otimes\cdots\otimes\mathcal{U}_{k}=\operatorname{Span}\{f_{1}(z_{1})\cdots f_{k}(z_{k}):\ f_{j}(z_{j})\in\mathcal{U}_{j}\text{ for all }1\leq j\leq k\}.
\]

Then we have $A\times_{x}\mathcal{P}_{\mathcal{U}_{x}}=A\times_{1}\mathcal{P}_{\mathcal{U}_{1}}\cdots\times_{k}\mathcal{P}_{\mathcal{U}_{k}}$, as shown in Lemma 18.

2 Density range estimation by sketching

Let $d_{1}$ and $d_{2}$ be arbitrary positive integers, and let $\Omega_{1}\subset\mathbb{R}^{d_{1}}$ and $\Omega_{2}\subset\mathbb{R}^{d_{2}}$ be two measurable sets. Let $A^{*}(x,y):\Omega_{1}\times\Omega_{2}\to\mathbb{R}$ be the unknown density function, and let $\{Z_{i}\}_{i=1}^{N}\subset\Omega_{1}\times\Omega_{2}$ be the observed data sampled from $A^{*}$. Denote by $\widehat{A}$ the empirical measure formed by $\{Z_{i}\}_{i=1}^{N}$. More precisely,

\[
\widehat{A}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}_{\{Z_{i}\}}. \tag{11}
\]

In this section, we introduce a sketching algorithm to estimate $\operatorname{Range}_{x}(A^{*})$ based on $\widehat{A}$ with a reduced curse of dimensionality, where

\[
\operatorname{Range}_{x}(A^{*})=\bigg\{f(x):f(x)=\int_{\Omega_{2}}A^{*}(x,y)g(y)\mathrm{d}y\text{ for some }g(y)\in{\bf L}_{2}(\Omega_{2})\bigg\}. \tag{12}
\]

To this end, let $\mathcal{M}$ be a linear subspace of ${\bf L}_{2}(\Omega_{1})$ that acts as an estimation subspace, and let $\mathcal{L}$ be a linear subspace of ${\bf L}_{2}(\Omega_{2})$ that acts as a sketching subspace. More details on how to choose $\mathcal{M}$ and $\mathcal{L}$ are deferred to Remark 1. Our procedure is composed of the following three stages.

• Sketching stage. Let $\{w_{\eta}\}_{\eta=1}^{\dim(\mathcal{L})}$ be the orthonormal basis functions of $\mathcal{L}$. We apply the projection operator $\mathcal{P}_{\mathcal{L}}$ to $\widehat{A}$ by computing

\[
\bigg\{\int_{\Omega_{2}}\widehat{A}(x,y)w_{\eta}(y)\mathrm{d}y\bigg\}_{\eta=1}^{\dim(\mathcal{L})}. \tag{13}
\]

Note that for each $\eta=1,\ldots,\dim(\mathcal{L})$, $\int_{\Omega_{2}}\widehat{A}(x,y)w_{\eta}(y)\mathrm{d}y$ is a function solely depending on $x$. This stage aims at reducing the curse of dimensionality associated with the variable $y$.

• Estimation stage. We estimate the functions $\big\{\int_{\Omega_{2}}\widehat{A}(x,y)w_{\eta}(y)\mathrm{d}y\big\}_{\eta=1}^{\dim(\mathcal{L})}$ by utilizing the estimation space $\mathcal{M}$. Specifically, for each $\eta=1,\ldots,\dim(\mathcal{L})$, we approximate $\int_{\Omega_{2}}\widehat{A}(x,y)w_{\eta}(y)\mathrm{d}y$ by

\[
\widetilde{f}_{\eta}(x)=\operatorname*{arg\,min}_{f\in\mathcal{M}}\bigg\|\int_{\Omega_{2}}\widehat{A}(x,y)w_{\eta}(y)\mathrm{d}y-f(x)\bigg\|^{2}_{{\bf L}_{2}(\Omega_{1})}. \tag{14}
\]

• Orthogonalization stage. Let

\[
\widetilde{A}(x,y)=\sum_{\eta=1}^{\dim(\mathcal{L})}\widetilde{f}_{\eta}(x)w_{\eta}(y). \tag{15}
\]

Compute the leading singular functions in the variable $x$ of $\widetilde{A}(x,y)$ to estimate $\operatorname{Range}_{x}(A^{*})$.

We formally summarize our procedure in Algorithm 1.

1: Input: estimator $\widehat{A}(x,y):\Omega_{1}\times\Omega_{2}\to\mathbb{R}$, parameter $r\in\mathbb{Z}^{+}$, linear subspaces $\mathcal{M}\subset{\bf L}_{2}(\Omega_{1})$ and $\mathcal{L}\subset{\bf L}_{2}(\Omega_{2})$.
2: Compute $\big\{\int_{\Omega_{2}}\widehat{A}(x,y)w_{\eta}(y)\mathrm{d}y\big\}_{\eta=1}^{\dim(\mathcal{L})}$, the projection of $\widehat{A}$ onto $\{w_{\eta}(y)\}_{\eta=1}^{\dim(\mathcal{L})}$, the basis functions of $\mathcal{L}$.
3: Compute the estimated functions $\{\widetilde{f}_{\eta}(x)\}_{\eta=1}^{\dim(\mathcal{L})}$ in $\mathcal{M}$ by (14).
4: Compute the leading $r$ singular functions in the variable $x$ of $\widetilde{A}(x,y)=\sum_{\eta=1}^{\dim(\mathcal{L})}\widetilde{f}_{\eta}(x)w_{\eta}(y)$ and denote them as $\{\widehat{\Phi}_{\rho}(x)\}_{\rho=1}^{r}$.
5: Output: $\{\widehat{\Phi}_{\rho}(x)\}_{\rho=1}^{r}$.
Algorithm 1: Density Range Estimation via Variance-Reduced Sketching

Suppose the estimation space $\mathcal{M}$ is spanned by the orthonormal basis functions $\{v_{\mu}(x)\}_{\mu=1}^{\dim(\mathcal{M})}$. In what follows, we provide an explicit expression of $\widetilde{A}(x,y)$ in (15) based on $\widehat{A}(x,y)$.
In the sketching stage, computing (13) is equivalent to computing $\widehat{A}\times_{y}\mathcal{P}_{\mathcal{L}}$, as

\[
\widehat{A}\times_{y}\mathcal{P}_{\mathcal{L}}=\sum_{\eta=1}^{\dim(\mathcal{L})}\bigg(\int_{\Omega_{2}}\widehat{A}(x,y)w_{\eta}(y)\mathrm{d}y\bigg)w_{\eta}(y).
\]

In the estimation stage, we have the following explicit expression for (14) by Lemma 20:

\[
\widetilde{f}_{\eta}(x)=\sum_{\mu=1}^{\dim(\mathcal{M})}\widehat{A}[v_{\mu},w_{\eta}]v_{\mu}(x),
\]

where $\widehat{A}[v_{\mu},w_{\eta}]=\iint\widehat{A}(x,y)v_{\mu}(x)w_{\eta}(y)\mathrm{d}x\mathrm{d}y$. Therefore, $\widetilde{A}(x,y)$ in (15) can be rewritten as

\[
\widetilde{A}(x,y)=\sum_{\mu=1}^{\dim(\mathcal{M})}\sum_{\eta=1}^{\dim(\mathcal{L})}\widehat{A}[v_{\mu},w_{\eta}]v_{\mu}(x)w_{\eta}(y). \tag{16}
\]

By Lemma 19, $\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}$ has exactly the same expression as $\widetilde{A}$. Therefore, we establish the identification

\[
\widetilde{A}=\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}. \tag{17}
\]

In the appendix, we provide further implementation details on how to compute the leading $r$ singular functions in the variable $x$ of $\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}$ using the singular value decomposition. Let $\mathcal{P}_{x}^{*}$ be the projection operator onto $\operatorname{Range}_{x}(A^{*})$, and let $\widehat{\mathcal{P}}_{x}$ be the projection operator onto the space spanned by $\{\widehat{\Phi}_{\rho}(x)\}_{\rho=1}^{r}$, the output of Algorithm 1. In Section 2.1, we show that the range estimator in Algorithm 1 is consistent by providing a theoretical quantification of the difference between $\mathcal{P}_{x}^{*}$ and $\widehat{\mathcal{P}}_{x}$.
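Because the empirical measure pairs as $\langle\mathbb{I}_{\{Z\}},f\rangle=f(Z)$, the coefficient matrix $\widehat{A}[v_{\mu},w_{\eta}]$ in (16) is simply an average of basis evaluations at the samples, and the orthogonalization stage reduces to an SVD of this matrix. The minimal Python sketch below (ours, not the authors' code) carries out Algorithm 1 in the simplest case $d_{1}=d_{2}=1$ with shifted Legendre bases for both $\mathcal{M}$ and $\mathcal{L}$; the toy rank-2 density and all parameter values are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_basis(t, m):
    V = L.legvander(2.0 * np.asarray(t) - 1.0, m - 1)
    return V * np.sqrt(2.0 * np.arange(m) + 1.0)        # orthonormal on [0,1]

def range_vrs(X, Y, m, ell, r):
    """Algorithm 1 for scalar x and y: returns the estimated range basis in x."""
    # Coefficient matrix A_hat[v_mu, w_eta] = (1/N) sum_i v_mu(X_i) w_eta(Y_i), cf. (16).
    C = legendre_basis(X, m).T @ legendre_basis(Y, ell) / len(X)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U_r = U[:, :r]                                       # top-r left singular vectors
    return lambda x: legendre_basis(x, m) @ U_r          # {Phi_hat_rho(x)}_{rho=1}^r

# Toy rank-2 density p(x,y) = 0.5 * (4xy + 4(1-x)(1-y)) on [0,1]^2, sampled
# as a mixture of independent Beta distributions (an illustrative assumption).
rng = np.random.default_rng(0)
N = 20000
comp = rng.integers(0, 2, size=N)
X = np.where(comp == 0, rng.beta(2, 1, N), rng.beta(1, 2, N))
Y = np.where(comp == 0, rng.beta(2, 1, N), rng.beta(1, 2, N))

Phi_hat = range_vrs(X, Y, m=15, ell=5, r=2)
print(Phi_hat(np.linspace(0, 1, 5)))   # estimated range basis evaluated on a grid
```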

2.1 Error analysis of Algorithm 1

We start by introducing the conditions needed to establish the consistency of our range estimators.

Assumption 1.

Suppose $\Omega_{1}\subset\mathbb{R}^{d_{1}}$ and $\Omega_{2}\subset\mathbb{R}^{d_{2}}$ are two measurable sets. Let $A^{*}(x,y):\Omega_{1}\times\Omega_{2}\to\mathbb{R}$ be a generic population function with $\|A^{*}\|_{{\bf L}_{2}(\Omega_{1}\times\Omega_{2})}<\infty$ and

\[
A^{*}(x,y)=\sum_{\rho=1}^{r}\sigma_{\rho}\Phi^{*}_{\rho}(x)\Psi^{*}_{\rho}(y),
\]

where $r\in\mathbb{N}$, $\sigma_{1}\geq\sigma_{2}\geq\ldots\geq\sigma_{r}>0$, and $\{\Phi_{\rho}^{*}(x)\}_{\rho=1}^{r}$ and $\{\Psi^{*}_{\rho}(y)\}_{\rho=1}^{r}$ are orthonormal functions in ${\bf L}_{2}(\Omega_{1})$ and ${\bf L}_{2}(\Omega_{2})$, respectively.

Assumption 1 postulates that the density $A^{*}$ is a finite-rank function. Finite-rank conditions are commonly observed in the literature. In Example 1 and Example 2 of Appendix A, we illustrate that both additive models and mean-field models satisfy Assumption 1. Additionally, in Example 3, we demonstrate that all multivariable differentiable functions can be effectively approximated by finite-rank functions.

The following assumption quantifies the bias between $A^{*}$ and its projection.

Assumption 2.

Let $\mathcal{M}\subset{\bf L}_{2}(\Omega_{1})$ and $\mathcal{L}\subset{\bf L}_{2}(\Omega_{2})$ be two linear subspaces such that $\dim(\mathcal{M})=m^{d_{1}}$ and $\dim(\mathcal{L})=\ell^{d_{2}}$, where $m,\ell\in\mathbb{N}$. For $\alpha\geq 1$, suppose that

\begin{align}
\|A^{*}-A^{*}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{\bf L}_{2}(\Omega_{1}\times\Omega_{2})}&=\operatorname{O}(m^{-2\alpha}+\ell^{-2\alpha}), \tag{18}\\
\|A^{*}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{\bf L}_{2}(\Omega_{1}\times\Omega_{2})}&=\operatorname{O}(\ell^{-2\alpha}),\quad\text{and} \tag{19}\\
\|A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}-A^{*}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{\bf L}_{2}(\Omega_{1}\times\Omega_{2})}&=\operatorname{O}(m^{-2\alpha}). \tag{20}
\end{align}
Remark 1.

When $\Omega_{1}=[0,1]^{d_{1}}$ and $\Omega_{2}=[0,1]^{d_{2}}$, Assumption 2 directly follows from approximation theory in Sobolev spaces. Indeed, in Appendix B, under the assumption that $\|A^{*}\|_{W^{\alpha}_{2}([0,1]^{d_{1}+d_{2}})}<\infty$, we justify Assumption 2 when $\mathcal{M}$ and $\mathcal{L}$ are derived from three different popular nonparametric approaches: reproducing kernel Hilbert spaces in Lemma 2, the Legendre polynomial system in Lemma 5, and the spline basis in Lemma 7.

Note that by the identification in (17), $\widetilde{A}=\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}$. The following theorem shows that $\operatorname{Range}_{x}(A^{*})$ can be well-approximated by the leading singular functions in $x$ of $\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}$.

Theorem 1.

Suppose Assumptions 1 and 2 hold. For a sufficiently large constant $C$, suppose that

\[
\sigma_{r}>C\max\bigg\{\ell^{-\alpha},\ m^{-\alpha},\ \sqrt{\frac{m^{d_{1}}+\ell^{d_{2}}}{N}}\bigg\}. \tag{21}
\]

Let $\mathcal{P}_{x}^{*}$ be the projection operator onto $\operatorname{Range}_{x}(A^{*})$, and let $\widehat{\mathcal{P}}_{x}$ be the projection operator onto the space spanned by the leading $r$ singular functions in the variable $x$ of $\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}$. Then

\[
\|\widehat{\mathcal{P}}_{x}-\mathcal{P}^{*}_{x}\|_{\operatorname{op}}^{2}=\operatorname{O}_{\mathbb{P}}\bigg\{\sigma_{r}^{-2}\bigg(\frac{m^{d_{1}}+\ell^{d_{2}}}{N}+m^{-2\alpha}\bigg)\bigg\}. \tag{22}
\]

The proof of Theorem 1 can be found in Section C.1. It follows that if $m\asymp N^{1/(2\alpha+d_{1})}$, $\ell=C_{L}\sigma_{r}^{-1/\alpha}$ for a sufficiently large constant $C_{L}$, and the sample size satisfies $N\geq C_{\sigma}\max\{\sigma_{r}^{-2-d_{1}/\alpha},\sigma_{r}^{-2-d_{2}/\alpha}\}$ for a sufficiently large constant $C_{\sigma}$, then Theorem 1 implies that

\[
\|\widehat{\mathcal{P}}_{x}-\mathcal{P}^{*}_{x}\|_{\operatorname{op}}^{2}=\operatorname{O}_{\mathbb{P}}\bigg\{\frac{\sigma_{r}^{-2}}{N^{2\alpha/(2\alpha+d_{1})}}+\frac{\sigma_{r}^{-2-d_{2}/\alpha}}{N}\bigg\}. \tag{23}
\]

To interpret our result in Theorem 1, consider a simplified scenario where the minimal spectral value $\sigma_{r}$ is a positive constant. Then (23) further reduces to

\[
\|\widehat{\mathcal{P}}_{x}-\mathcal{P}^{*}_{x}\|_{\operatorname{op}}^{2}=\operatorname{O}_{\mathbb{P}}\bigg\{\frac{1}{N^{2\alpha/(2\alpha+d_{1})}}\bigg\},
\]

which matches the optimal nonparametric density estimation rate in $W^{\alpha}_{2}(\mathbb{R}^{d_{1}})$. This indicates that our method is able to estimate $\operatorname{Range}_{x}(A^{*})$ without the curse of dimensionality introduced by the variable $y\in\mathbb{R}^{d_{2}}$. Utilizing the estimator of $\mathcal{P}^{*}_{x}$, we can further estimate the population function $A^{*}$ with a reduced curse of dimensionality, as detailed in Section 3.

3 Density estimation by sketching

In this section, we study multivariable density estimation by utilizing the range estimator outlined in Algorithm 1. Let $\mathcal{O}$ be a measurable subset of $\mathbb{R}$ and let $A^{*}(z_{1},\ldots,z_{d}):\mathcal{O}^{d}\to\mathbb{R}$ be the unknown population function. We propose a tensor-based algorithm to estimate the function $A^{*}$ with a reduced curse of dimensionality.

Remark 2.

In density estimation, it is sufficient to assume $\mathcal{O}=[0,1]$. This is a common assumption in the nonparametric statistics literature. Indeed, if the density function has compact support, then after suitable rescaling, we can assume the support is a subset of $\mathcal{O}^{d}=[0,1]^{d}$.

We begin by stating the necessary assumptions for our tensor-based estimator of $A^{*}$.

Assumption 3.

For $j=1,\ldots,d$, it holds that

\[
A^{*}(z_{1},\ldots,z_{d})=\sum_{\rho=1}^{r_{j}}\sigma_{j,\rho}\Phi^{*}_{j,\rho}(z_{j})\Psi^{*}_{j,\rho}(z_{1},\ldots,z_{j-1},z_{j+1},\ldots,z_{d}),
\]

where $r_{j}\in\mathbb{N}$, $\sigma_{j,1}\geq\sigma_{j,2}\geq\ldots\geq\sigma_{j,r_{j}}>0$, and $\{\Phi_{j,\rho}^{*}\}_{\rho=1}^{r_{j}}$ and $\{\Psi^{*}_{j,\rho}\}_{\rho=1}^{r_{j}}$ are orthonormal functions in ${\bf L}_{2}(\mathcal{O})$ and ${\bf L}_{2}(\mathcal{O}^{d-1})$, respectively. Furthermore, $\|A^{*}\|_{{\bf L}_{2}(\mathcal{O}^{d})}<\infty$.

Assumption 3 is a direct extension of Assumption 1, and Examples 1, 2, and 3 in the appendix continue to hold under Assumption 3. Throughout this section, for $j\in\{1,\ldots,d\}$, denote by $\mathcal{P}_{j}^{*}$ the projection operator onto $\operatorname{Range}_{j}(A^{*})=\operatorname{Span}\{\Phi^{*}_{j,\rho}(z_{j})\}_{\rho=1}^{r_{j}}$. More precisely, $\mathcal{P}_{j}^{*}=\sum_{\rho=1}^{r_{j}}\Phi^{*}_{j,\rho}\otimes\Phi^{*}_{j,\rho}\in{\bf L}_{2}(\mathcal{O})\otimes{\bf L}_{2}(\mathcal{O})$.

In what follows, we formally introduce our algorithm to estimate the density function $A^{*}$. Let $\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty}\subset{\bf L}_{2}(\mathcal{O})$ be a collection of orthonormal basis functions. For $j\in\{1,\ldots,d\}$, let $m,\ell_{j}\in\mathbb{N}$ and denote

\begin{align}
\mathcal{M}_{j}&=\operatorname{span}\big\{\phi^{\mathbb{S}}_{\mu}(z_{j})\big\}_{\mu=1}^{m}\quad\text{and} \tag{24}\\
\mathcal{L}_{j}&=\operatorname{span}\big\{\phi^{\mathbb{S}}_{\eta_{1}}(z_{1})\cdots\phi^{\mathbb{S}}_{\eta_{j-1}}(z_{j-1})\phi^{\mathbb{S}}_{\eta_{j+1}}(z_{j+1})\cdots\phi^{\mathbb{S}}_{\eta_{d}}(z_{d})\big\}_{\eta_{1},\ldots,\eta_{j-1},\eta_{j+1},\ldots,\eta_{d}=1}^{\ell_{j}}. \tag{25}
\end{align}
Remark 3.

The collection of orthonormal basis functions $\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty}\subset{\bf L}_{2}(\mathcal{O})$ can be derived through various nonparametric estimation methods. In the appendix, we present three examples of $\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty}$, including reproducing kernel Hilbert space basis functions (Section B.1), Legendre polynomial basis functions (Section B.2), and spline basis functions (Section B.3), to illustrate the potential choices.

In Algorithm 2, we formally summarize our tensor-based estimator of $A^{*}$, which utilizes the range estimator developed in Section 2.1.

1: Input: empirical measure $\widehat{A}$ as in (11), parameters $\{r_{j}\}_{j=1}^{d}\subset\mathbb{Z}^{+}$, linear subspaces $\{\mathcal{M}_{j}\}_{j=1}^{d}$ as in (24) and $\{\mathcal{L}_{j}\}_{j=1}^{d}$ as in (25).
2: for $j\in\{1,\ldots,d\}$ do
3:     Input $\{\widehat{A},r_{j},\mathcal{M}_{j},\mathcal{L}_{j}\}$ to Algorithm 1 to estimate $\operatorname{Range}_{j}(A^{*})$, obtaining the leading $r_{j}$ singular functions $\{\widehat{\Phi}_{j,\rho_{j}}\}_{\rho_{j}=1}^{r_{j}}$. Compute the projection operator $\widehat{\mathcal{P}}_{j}=\sum_{\rho_{j}=1}^{r_{j}}\widehat{\Phi}_{j,\rho_{j}}\otimes\widehat{\Phi}_{j,\rho_{j}}$.
4: end for
5: Compute the coefficient tensor $B\in\mathbb{R}^{r_{1}\times\cdots\times r_{d}}$:
\[
B_{\rho_{1},\ldots,\rho_{d}}=\widehat{A}[\widehat{\Phi}_{1,\rho_{1}},\ldots,\widehat{\Phi}_{d,\rho_{d}}],\quad\text{for all }\rho_{1}\in\{1,\ldots,r_{1}\},\ldots,\rho_{d}\in\{1,\ldots,r_{d}\}.
\]
6: Compute the estimated multivariable function
\[
\left(\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\right)(z_{1},\ldots,z_{d})=\sum_{\rho_{1}=1}^{r_{1}}\cdots\sum_{\rho_{d}=1}^{r_{d}}\Big(B_{\rho_{1},\ldots,\rho_{d}}\widehat{\Phi}_{1,\rho_{1}}(z_{1})\cdots\widehat{\Phi}_{d,\rho_{d}}(z_{d})\Big).
\]
7: Output: the estimated density function $\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}$.
Algorithm 2: Multivariable Density Estimation via Variance-Reduced Sketching

In Section 5.1, we provide an in-depth discussion on how to choose the tuning parameters $m$, $\{\ell_{j}\}_{j=1}^{d}$, and $\{r_{j}\}_{j=1}^{d}$ in a data-driven way. The time complexity of Algorithm 2 is

\[
\operatorname{O}\bigg(Nm\sum_{j=1}^{d}\ell_{j}^{d-1}+m^{2}\sum_{j=1}^{d}\ell_{j}^{d-1}+Nm\sum_{j=1}^{d}r_{j}+N\prod_{j=1}^{d}r_{j}\bigg). \tag{26}
\]

In (26), the first term is the cost of computing $\{\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}_{j}}\times_{y}\mathcal{P}_{\mathcal{L}_{j}}\}_{j=1}^{d}$, the second term corresponds to the cost of the singular value decomposition of $\{\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}_{j}}\times_{y}\mathcal{P}_{\mathcal{L}_{j}}\}_{j=1}^{d}$, the third term represents the cost of computing $\{\widehat{\mathcal{P}}_{j}\}_{j=1}^{d}$, and the last term reflects the cost of computing $\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}$ given $\{\widehat{\mathcal{P}}_{j}\}_{j=1}^{d}$. In the following theorem, we show that the difference between $\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}$ and $A^{*}$ is well-controlled.
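For concreteness, the following Python sketch (ours, not the authors' implementation) mirrors the steps of Algorithm 2 for $d=3$ with orthonormal shifted Legendre bases, so that $\mathcal{M}_{j}$ has dimension $m$ and $\mathcal{L}_{j}$ has dimension $\ell^{2}$; the uniform toy data, the basis choice, and all parameter values are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_basis(t, m):
    V = L.legvander(2.0 * np.asarray(t) - 1.0, m - 1)
    return V * np.sqrt(2.0 * np.arange(m) + 1.0)             # orthonormal on [0,1]

def vrs_density_3d(Z, m, ell, ranks):
    """Z: (N, 3) samples in [0,1]^3; returns a callable density estimate (d = 3 sketch)."""
    N, d = Z.shape
    Phi = [legendre_basis(Z[:, j], m) for j in range(d)]      # estimation features
    Psi = [legendre_basis(Z[:, j], ell) for j in range(d)]    # sketching features
    U = []
    for j in range(d):
        k1, k2 = [k for k in range(d) if k != j]
        # Sketched coefficient matrix of A_hat, shape (m, ell, ell).
        C = np.einsum('nm,na,nb->mab', Phi[j], Psi[k1], Psi[k2]) / N
        Uj, _, _ = np.linalg.svd(C.reshape(m, -1), full_matrices=False)
        U.append(Uj[:, :ranks[j]])                            # range-basis coefficients
    # G[j][i, rho] = Phi_hat_{j, rho}(Z_i[j]); core tensor B as in step 5.
    G = [Phi[j] @ U[j] for j in range(d)]
    B = np.einsum('na,nb,nc->abc', G[0], G[1], G[2]) / N

    def density(points):                                      # points: (M, 3) in [0,1]^3
        F = [legendre_basis(points[:, j], m) @ U[j] for j in range(d)]
        return np.einsum('abc,na,nb,nc->n', B, F[0], F[1], F[2])

    return density

# Toy usage with uniform samples (an assumption); real data would replace Z.
rng = np.random.default_rng(1)
Z = rng.uniform(size=(50000, 3))
p_hat = vrs_density_3d(Z, m=10, ell=5, ranks=(2, 2, 2))
print(p_hat(np.array([[0.5, 0.5, 0.5], [0.1, 0.9, 0.5]])))    # values near 1 for uniform data
```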

Theorem 2.

Suppose that $\{Z_{i}\}_{i=1}^{N}\subset\mathcal{O}^{d}$ are independently sampled from a density $A^{*}:\mathcal{O}^{d}\to\mathbb{R}^{+}$ satisfying $\|A^{*}\|_{W^{\alpha}_{2}(\mathcal{O}^{d})}<\infty$ with $\alpha\geq 1$ and $\|A^{*}\|_{\infty}<\infty$.

Suppose in addition that $A^{*}$ satisfies Assumption 3, and that $\{\mathcal{M}_{j},\mathcal{L}_{j}\}_{j=1}^{d}$ are in the form of (24) and (25), where $\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty}\subset{\bf L}_{2}(\mathcal{O})$ are derived from reproducing kernel Hilbert spaces, the Legendre polynomial basis, or spline basis functions.

Let $\widehat{A}$ in (11), $\{r_{j}\}_{j=1}^{d}$, and $\{\mathcal{M}_{j},\mathcal{L}_{j}\}_{j=1}^{d}$ be the input of Algorithm 2, and let $\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}$ be the corresponding output. Denote $\sigma_{\min}=\min_{j=1,\ldots,d}\{\sigma_{j,r_{j}}\}$ and suppose that, for a sufficiently large constant $C_{snr}$,

\[
N^{2\alpha/(2\alpha+1)}>C_{snr}\max\bigg\{\prod_{j=1}^{d}r_{j},\ \frac{1}{\sigma^{(d-1)/\alpha+2}_{\min}},\ \frac{1}{\sigma^{2\alpha/(\alpha-1/2)}_{\min}}\bigg\}.
\]

If $\ell_{j}=C_{L}\sigma_{j,r_{j}}^{-1/\alpha}$ for some sufficiently large constant $C_{L}$ and $m\asymp N^{1/(2\alpha+1)}$, then it holds that

\[
\|\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}-A^{*}\|^{2}_{{\bf L}_{2}(\mathcal{O}^{d})}=\operatorname{O}_{\mathbb{P}}\bigg(\frac{\sum_{j=1}^{d}\sigma_{j,r_{j}}^{-2}}{N^{2\alpha/(2\alpha+1)}}+\frac{\sum_{j=1}^{d}\sigma_{j,r_{j}}^{-(d-1)/\alpha-2}}{N}+\frac{\prod_{j=1}^{d}r_{j}}{N}\bigg). \tag{27}
\]

The proof of Theorem 2 can be found in the appendix. To interpret Theorem 2, consider the simplified scenario where the ranks $r_{j}$ and the minimal spectral values $\sigma_{j,r_{j}}$ are positive constants for $j=1,\ldots,d$. Then (27) implies that

\[
\|\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}-A^{*}\|^{2}_{{\bf L}_{2}(\mathcal{O}^{d})}=\operatorname{O}_{\mathbb{P}}\bigg(\frac{1}{N^{2\alpha/(2\alpha+1)}}\bigg),
\]

which matches the minimax optimal rate of estimating nonparametric functions in $W^{\alpha}_{2}(\mathbb{R})$. Note that the error rate of estimating a nonparametric density in $W^{\alpha}_{2}(\mathbb{R}^{d})$ using classical kernel methods is of order $N^{-2\alpha/(2\alpha+d)}$. Therefore, as long as

\[
\max\bigg\{\sigma_{\min}^{-2},\ \prod_{j=1}^{d}r_{j}\bigg\}=\operatorname{o}(N^{d/(2\alpha+d)}),
\]

then by (27), with high probability,

\[
\|\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}-A^{*}\|^{2}_{{\bf L}_{2}(\mathcal{O}^{d})}=\operatorname{o}_{\mathbb{P}}(N^{-2\alpha/(2\alpha+d)}),
\]

and the error bound we obtain in Theorem 2 is strictly better than that of classical kernel methods.

4 Extension: image PCA denoising

In this section, we demonstrate that our nonparametric sketching methodology, introduced in Section 2, can be used to estimate principal component functions in the continuum limit. The most representative application of PCA is image denoising, which has a wide range of applications in machine learning and data science, such as image clustering, classification, and segmentation. We refer the readers to Mika et al. (1998) and Bakır et al. (2004) for a detailed introduction to image processing.

Let $\kappa\in\mathbb{N}$. We define $[\kappa]=\{1,\ldots,\kappa\}$ and $[\kappa]^{2}=\{1,\ldots,\kappa\}\times\{1,\ldots,\kappa\}$. Motivated by the image denoising model, in our approach, data are treated as discrete functions in ${\bf L}_{2}([\kappa]^{2})$, and therefore the resolution of the image data is $\kappa^{2}$. In such a setting, for $U,V\in{\bf L}_{2}([\kappa]^{2})$ and $x=(x_{1},x_{2})\in[\kappa]^{2}$, we have

\[
\|U\|_{{\bf L}_{2}([\kappa]^{2})}^{2}=\frac{1}{\kappa^{2}}\sum_{(x_{1},x_{2})\in[\kappa]^{2}}\big\{U(x_{1},x_{2})\big\}^{2}\quad\text{and}\quad\langle U,V\rangle=\frac{1}{\kappa^{2}}\sum_{(x_{1},x_{2})\in[\kappa]^{2}}U(x_{1},x_{2})V(x_{1},x_{2}),
\]

where $U(x_{1},x_{2})$ denotes the $(x_{1},x_{2})$ pixel of $U$. Note that the norm in ${\bf L}_{2}([\kappa]^{2})$ differs from the Euclidean norm in $\mathbb{R}^{\kappa^{2}}$ by a factor of $\kappa^{2}$. Let $\Gamma\in{\bf L}_{2}([\kappa]^{2})\otimes{\bf L}_{2}([\kappa]^{2})$ and define

\[
\Gamma[U,V]=\frac{1}{\kappa^{4}}\sum_{x\in[\kappa]^{2},\,y\in[\kappa]^{2}}\Gamma(x,y)U(x)V(y). \tag{28}
\]

The operator norm of $\Gamma(x,y)\in{\bf L}_{2}([\kappa]^{2})\otimes{\bf L}_{2}([\kappa]^{2})$ is defined as

\[
\|\Gamma(x,y)\|_{\operatorname{op}({\bf L}_{2}([\kappa]^{2})\otimes{\bf L}_{2}([\kappa]^{2}))}=\sup_{\|U\|_{{\bf L}_{2}([\kappa]^{2})}=\|V\|_{{\bf L}_{2}([\kappa]^{2})}=1}\Gamma[U,V]. \tag{29}
\]

Motivated by the tremendous success of discrete wavelet basis functions in the image denoising literature (e.g., see Mohideen et al. (2008)), we study PCA in reproducing kernel Hilbert spaces (RKHS) generated by wavelet functions. Specifically, let $\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\kappa}$ be a collection of orthonormal discrete wavelet functions in ${\bf L}_{2}([\kappa])$. The RKHS generated by $\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\kappa}$ is

\[
{\mathcal{H}}([\kappa])=\bigg\{f\in{\bf L}_{2}([\kappa]):\|f\|_{{\mathcal{H}}([\kappa])}^{2}=\sum_{k=1}^{\kappa}\gamma_{k}^{-1}\langle f,\phi^{\mathbb{S}}_{k}\rangle^{2}<\infty\bigg\}. \tag{30}
\]

For $d\in\mathbb{N}$, define ${\mathcal{H}}([\kappa]^{d})={\mathcal{H}}([\kappa])\otimes\cdots\otimes{\mathcal{H}}([\kappa])$. For any $F\in{\mathcal{H}}([\kappa]^{d})$, we have

\[
\|F\|_{{\mathcal{H}}([\kappa]^{d})}^{2}=\sum_{k_{1},\ldots,k_{d}=1}^{\kappa}\gamma_{k_{1}}^{-1}\cdots\gamma_{k_{d}}^{-1}\big(F[\phi^{\mathbb{S}}_{k_{1}},\ldots,\phi^{\mathbb{S}}_{k_{d}}]\big)^{2}.
\]

Let $\mathcal{M}\subset{\bf L}_{2}([\kappa]^{2})$ be the estimation space and $\mathcal{L}\subset{\bf L}_{2}([\kappa]^{2})$ be the sketching space such that

\[
\mathcal{M}=\operatorname{span}\big\{\phi^{\mathbb{S}}_{\mu_{1}}(x_{1})\phi^{\mathbb{S}}_{\mu_{2}}(x_{2})\big\}_{\mu_{1},\mu_{2}=1}^{m}\quad\text{and}\quad\mathcal{L}=\operatorname{span}\big\{\phi^{\mathbb{S}}_{\eta_{1}}(y_{1})\phi^{\mathbb{S}}_{\eta_{2}}(y_{2})\big\}_{\eta_{1},\eta_{2}=1}^{\ell}, \tag{31}
\]

where $x_{1},x_{2},y_{1},y_{2}\in[\kappa]$.

Suppose we observe a collection of noisy images $\{W_{i}\}_{i=1}^{N}\subset{\bf L}_{2}([\kappa]^{2})$, where for each $x\in[\kappa]^{2}$,

\[
W_{i}(x)={\mathnormal{I}}_{i}(x)+\epsilon_{i}(x). \tag{32}
\]

Here $\{\epsilon_{i}(x)\}_{i=1,\ldots,N,\,x\in[\kappa]^{2}}\subset\mathbb{R}$ are i.i.d. sub-Gaussian random variables and $\{{\mathnormal{I}}_{i}\}_{i=1}^{N}$ are i.i.d. sub-Gaussian stochastic functions in ${\bf L}_{2}([\kappa]^{2})$ such that for every $x,y\in[\kappa]^{2}$ and $i=1,\ldots,N$,

\[
{\mathbb{E}}({\mathnormal{I}}_{i}(x))={\mathnormal{I}}^{*}(x)\quad\text{and}\quad\operatorname{Cov}\{{\mathnormal{I}}_{i}(x),{\mathnormal{I}}_{i}(y)\}=\Sigma^{*}(x,y). \tag{33}
\]

Our objective is to estimate the principal components of $\Sigma^{*}$. Denote $\overline{W}(x)=\frac{1}{N}\sum_{i=1}^{N}W_{i}(x)$, and define the covariance operator estimator as

\[
\widehat{\Sigma}(x,y)=\frac{1}{N-1}\sum_{i=1}^{N}\big\{W_{i}(x)-\overline{W}(x)\big\}\big\{W_{i}(y)-\overline{W}(y)\big\}. \tag{34}
\]

The following corollary shows that the principal components of $\Sigma^{*}$ can be consistently estimated by the singular value decomposition of $\widehat{\Sigma}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}$ with suitably chosen subspaces $\mathcal{M}$ and $\mathcal{L}$.
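A minimal Python sketch of this sketched covariance PCA is given below. It is illustrative only: in place of the discrete wavelet basis used in the paper, it uses an orthonormal 2D DCT as a stand-in for $\{\phi^{\mathbb{S}}_{k}\}$, and the synthetic images as well as the values of $m$, $\ell$, and $r$ are our own assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def low_freq_coeffs(img, k):
    """Top-left k x k block of the orthonormal 2D DCT coefficients of an image."""
    return dctn(img, norm='ortho')[:k, :k].ravel()

def sketched_pca(images, m, ell, r):
    """images: (N, kappa, kappa) array; returns r principal-component images."""
    N, kappa, _ = images.shape
    centered = images - images.mean(axis=0)
    A_M = np.stack([low_freq_coeffs(w, m) for w in centered])    # (N, m^2)
    A_L = np.stack([low_freq_coeffs(w, ell) for w in centered])  # (N, ell^2)
    # Coefficient matrix of Sigma_hat x_x P_M x_y P_L in the chosen bases.
    C = A_M.T @ A_L / (N - 1)
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    pcs = []
    for rho in range(r):                 # map coefficient vectors back to images
        coef = np.zeros((kappa, kappa))
        coef[:m, :m] = U[:, rho].reshape(m, m)
        pcs.append(idctn(coef, norm='ortho'))
    return np.stack(pcs)                 # (r, kappa, kappa)

# Toy usage with synthetic noisy images (assumed data, not from the paper).
rng = np.random.default_rng(2)
kappa, N = 64, 500
xx, yy = np.meshgrid(np.linspace(0, 1, kappa), np.linspace(0, 1, kappa))
modes = np.stack([np.sin(np.pi * xx) * np.sin(np.pi * yy),
                  np.sin(2 * np.pi * xx) * np.sin(np.pi * yy)])
scores = rng.normal(size=(N, 2))
images = np.tensordot(scores, modes, axes=1) + 0.1 * rng.normal(size=(N, kappa, kappa))
print(sketched_pca(images, m=12, ell=8, r=2).shape)              # (2, 64, 64)
```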

Corollary 1.

Suppose the data $\{W_{i}\}_{i=1}^{N}\subset{\bf L}_{2}([\kappa]^{2})$ satisfy (32) and (33), and that $\|\Sigma^{*}\|_{{\mathcal{H}}([\kappa]^{4})}<\infty$. Suppose in addition that for $x,y\in[\kappa]^{2}$, $\Sigma^{*}(x,y)=\sum_{\rho=1}^{r}\sigma_{\rho}\Phi^{*}_{\rho}(x)\Phi^{*}_{\rho}(y)$, where $\sigma_{1}\geq\sigma_{2}\geq\ldots\geq\sigma_{r}>0$ and $\{\Phi_{\rho}^{*}\}_{\rho=1}^{r}$ are orthonormal discrete functions in ${\bf L}_{2}([\kappa]^{2})$.

Suppose that $\gamma_{k}\asymp k^{-2\alpha}$ in (30). Let $\mathcal{M}$ and $\mathcal{L}$ be defined as in (31). For a sufficiently large constant $C$, suppose that

\[
|\sigma_{r}|>C\max\bigg\{\ell^{-\alpha},\ m^{-\alpha},\ \frac{(m+\ell)}{\sqrt{N}},\ \frac{1}{\kappa^{2}}\bigg\}. \tag{35}
\]

Denote by $\mathcal{P}_{x}^{*}$ the projection operator onto $\operatorname{Span}\{\Phi_{\rho}^{*}\}_{\rho=1}^{r}$, and by $\widehat{\mathcal{P}}_{x}$ the projection operator onto the space spanned by the leading $r$ singular functions in the variable $x$ of $\widehat{\Sigma}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}$. Then

\[
\|\widehat{\mathcal{P}}_{x}-\mathcal{P}^{*}_{x}\|_{\operatorname{op}({\bf L}_{2}([\kappa]^{2})\otimes{\bf L}_{2}([\kappa]^{2}))}^{2}=\operatorname{O}_{\mathbb{P}}\bigg(\frac{\sigma_{r}^{-2}}{N^{2\alpha/(2\alpha+2)}}+\frac{\sigma_{r}^{-2}}{\kappa^{4}}\bigg). \tag{36}
\]

The proof of Corollary 1 can be found in Appendix E. To interpret the result in Corollary 1, consider the scenario where the minimal spectral value $\sigma_{r}$ is a positive constant. Then (36) simplifies to

\[
\|\widehat{\mathcal{P}}_{x}-\mathcal{P}^{*}_{x}\|_{\operatorname{op}({\bf L}_{2}([\kappa]^{2})\otimes{\bf L}_{2}([\kappa]^{2}))}^{2}=\operatorname{O}_{\mathbb{P}}\bigg(\frac{1}{N^{2\alpha/(2\alpha+2)}}+\frac{1}{\kappa^{4}}\bigg).
\]

The term $\frac{1}{N^{2\alpha/(2\alpha+2)}}$ aligns with the optimal rate for estimating a function in an RKHS with degree of smoothness $\alpha$ in a two-dimensional space. The additional term $\frac{1}{\kappa^{4}}$ accounts for the measurement errors $\{\epsilon_{i}\}_{i=1}^{N}$. This term is typically negligible in applications, as $\kappa$, the resolution of the images, is usually much larger than the sample size $N$ for high-resolution images.

5 Simulations and real data examples

In this section, we compare the numerical performance of the proposed estimator VRS with classical kernel methods and neural network estimators through density estimation models and image denoising models.

5.1 Implementations

As detailed in Algorithm 2, our approach involves three groups of tuning parameters: $m$, $\{\ell_{j}\}_{j=1}^{d}$, and $\{r_{j}\}_{j=1}^{d}$. In all our numerical experiments, the optimal choices for $m$ and $\{\ell_{j}\}_{j=1}^{d}$ are determined through cross-validation. To select $\{r_{j}\}_{j=1}^{d}$, we apply a popular method in low-rank matrix estimation known as adaptive thresholding. Specifically, for each $j=1,\ldots,d$, we compute $\{\widehat{\sigma}_{j,k}\}_{k=1}^{\infty}$, the set of singular values of $\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}_{j}}\times_{y}\mathcal{P}_{\mathcal{L}_{j}}$, and set

\[
r_{j}=\arg\max_{k\geq 1}\frac{\widehat{\sigma}_{j,k}}{\widehat{\sigma}_{j,k+1}}.
\]

Adaptive thresholding is a very popular strategy in the matrix completion literature (Candes and Plan (2010)), and it has been proven to be empirically robust in many scientific and engineering applications. We use built-in functions provided by the popular Python package scikit-learn to train kernel estimators, and scikit-learn also utilizes cross-validation for tuning parameter selection. For neural networks, we use PyTorch to train various models and make predictions. The implementations of our numerical studies can be found at this link.
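As a small illustration (ours, not part of the released implementation), the adaptive-thresholding rule above can be coded as follows.

```python
import numpy as np

def select_rank(singular_values, max_rank=None):
    """Pick the rank maximizing the ratio of consecutive singular values."""
    s = np.asarray(singular_values, dtype=float)
    if max_rank is not None:
        s = s[:max_rank + 1]
    ratios = s[:-1] / s[1:]           # sigma_k / sigma_{k+1}, k = 1, 2, ...
    return int(np.argmax(ratios)) + 1

print(select_rank([5.1, 4.8, 0.03, 0.02, 0.01]))   # -> 2
```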

5.2 Density estimation

We study the numerical performance of Variance-Reduced Sketching (VRS), kernel density estimators (KDE), and neural networks (NN) in various density estimation problems. We use Legendre polynomials to span the linear subspaces $\{\mathcal{M}_{j}\}_{j=1}^{d}$ as in (24) and $\{\mathcal{L}_{j}\}_{j=1}^{d}$ as in (25) in Algorithm 2, and the inputs of VRS in the density estimation setting are detailed in Theorem 2. Once we compute the density function via VRS, we use Gibbs sampling to efficiently generate samples from this multivariate probability distribution. The details are listed in Appendix I. For neural network estimators, we use two popular density estimation architectures for comparison: Masked Autoregressive Flow (MAF) (Papamakarios et al. (2017)) and Neural Autoregressive Flows (NAF) (Huang et al. (2018)). The details of implementing the neural network density estimators are provided in Appendix I. We measure the estimation accuracy by the relative ${\bf L}_{2}$-error defined as

\[
\frac{\|p^{*}-\widetilde{p}\|_{{\bf L}_{2}(\Omega)}}{\|p^{*}\|_{{\bf L}_{2}(\Omega)}},
\]

where $\widetilde{p}$ is the density estimate produced by a given method. We also compute the standard Kullback–Leibler (KL) divergence to measure the discrepancy between two probability density functions:

\[
D_{KL}=\mathbb{E}_{p^{*}}[\log(p^{*}/\widetilde{p})].
\]
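In practice, when samples from $p^{*}$ are available and both densities can be evaluated, this KL divergence can be approximated by Monte Carlo. The helper below is a generic sketch (not the exact evaluation code used for the tables), with a closed-form Gaussian example as a sanity check.

```python
import numpy as np

def kl_monte_carlo(log_p_star, log_p_tilde, samples):
    """samples: (M, d) array drawn from p*; both arguments are vectorized log-densities."""
    return float(np.mean(log_p_star(samples) - log_p_tilde(samples)))

# Example with two 1D Gaussians, where the exact value is
# log(1.2) + (1 + 0.3**2) / (2 * 1.2**2) - 0.5, approximately 0.061.
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=(100000, 1))
lp = lambda z: -0.5 * z[:, 0] ** 2 - 0.5 * np.log(2 * np.pi)
lq = lambda z: -0.5 * ((z[:, 0] - 0.3) / 1.2) ** 2 - np.log(1.2) - 0.5 * np.log(2 * np.pi)
print(kl_monte_carlo(lp, lq, x))
```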

• Simulation I. In the first simulation, we use a two-dimensional, two-mode Gaussian mixture model to numerically compare VRS, KDE, and neural network estimators. We generate 1,000 samples from the density function

\[
p^{*}(\bm{x})=0.4\,\frac{\exp\left(-\frac{1}{2}(\bm{x}-\bm{\mu}_{1})^{T}\bm{\Sigma}_{1}^{-1}(\bm{x}-\bm{\mu}_{1})\right)}{\sqrt{(2\pi)^{2}|\bm{\Sigma}_{1}|}}+0.6\,\frac{\exp\left(-\frac{1}{2}(\bm{x}-\bm{\mu}_{2})^{T}\bm{\Sigma}_{2}^{-1}(\bm{x}-\bm{\mu}_{2})\right)}{\sqrt{(2\pi)^{2}|\bm{\Sigma}_{2}|}},
\]

where $\bm{\mu}_{1}=\begin{pmatrix}-0.35\\ -0.35\end{pmatrix}$, $\bm{\Sigma}_{1}=\begin{pmatrix}0.25^{2}&-0.03^{2}\\ -0.03^{2}&0.25^{2}\end{pmatrix}$, $\bm{\mu}_{2}=\begin{pmatrix}0.35\\ 0.35\end{pmatrix}$, $\bm{\Sigma}_{2}=\begin{pmatrix}0.35^{2}&0.1^{2}\\ 0.1^{2}&0.35^{2}\end{pmatrix}$. The relative ${\bf L}_{2}$ error and KL divergence for each method are reported in Table 1. As demonstrated in this example, VRS achieves decent accuracy in the classical low-dimensional setting.
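For reproducibility of the setup, a short sketch mirroring the description above (the random seed is arbitrary) for drawing the 1,000 samples is given below.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
mu1 = np.array([-0.35, -0.35])
Sigma1 = np.array([[0.25 ** 2, -0.03 ** 2], [-0.03 ** 2, 0.25 ** 2]])
mu2 = np.array([0.35, 0.35])
Sigma2 = np.array([[0.35 ** 2, 0.10 ** 2], [0.10 ** 2, 0.35 ** 2]])

comp = rng.random(N) < 0.4                      # mixture weights 0.4 / 0.6
samples = np.where(comp[:, None],
                   rng.multivariate_normal(mu1, Sigma1, N),
                   rng.multivariate_normal(mu2, Sigma2, N))
print(samples.shape)                            # (1000, 2)
```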

                             VRS              KDE              NN-MAF           NN-NAF
relative ${\bf L}_{2}$ error  0.1270 (0.0054)  0.1636 (0.0068)  0.2225 (0.0135)  0.3265 (0.0103)
KL divergence                0.0092 (0.0033)  0.0488 (0.0029)  0.0785 (0.0134)  0.0983 (0.0098)

Table 1: Relative ${\bf L}_{2}$ errors and KL divergences for four different methods on the two-dimensional Gaussian mixture model in Simulation I. The experiments are repeated 50 times. The average errors are reported, with standard deviations shown in parentheses.

To further visualize the performance of the four different methods, the true density and the estimated density from each method are plotted in Figure 3. Direct comparison in Figure 3 demonstrates that VRS provides a comparatively better estimate of the true density.

Figure 3: Density functions from the two-dimensional Gaussian mixture model in Simulation I. From left to right: the ground truth density, and estimates from VRS, KDE, MAF, and NAF. The values in the colorbar on the right represent function values.

\bullet Simulation 𝐈𝐈\mathbf{II}. We consider a moderate high-dimensional density model in this simulation. Specifically, we generate 10510^{5} data from the 30-dimensional Gaussian mixture density function

\displaystyle\begin{pmatrix}x_{1}\\ x_{2}\\ x_{3}\end{pmatrix}\sim\frac{1}{2}\mathcal{N}\left(\begin{pmatrix}-0.5\\ -0.5\\ -0.5\end{pmatrix},\begin{pmatrix}0.1^{2}&0.06^{2}&0\\ 0.06^{2}&0.1^{2}&0\\ 0&0&0.1^{2}\end{pmatrix}\right)+\frac{1}{2}\mathcal{N}\left(\begin{pmatrix}0.5\\ 0.5\\ 0.5\end{pmatrix},0.1^{2}I_{3\times 3}\right),
x_{4},x_{5}\overset{\text{ i.i.d. }}{\sim}\mathcal{N}(0,0.04),
x_{6},\dots,x_{30}\overset{\text{ i.i.d. }}{\sim}\frac{1}{2}\mathcal{N}(-0.4,0.3^{2})+\frac{1}{2}\mathcal{N}(0.4,0.3^{2}).

The experiments are repeated 50 times. Since accurately evaluating {{\bf L}_{2}} errors in 30 dimensions by numerical integration is computationally infeasible, we only report the averaged KL divergence for performance evaluation. Table 2 shows that VRS outperforms the kernel and neural network estimators by a remarkable margin.

VRS KDE MAF NAF
KL divergence 0.0195(0.0056) 4.3823(0.0047) 0.9260(0.0523) 0.1613(0.0823)
Table 2: KL divergences of the four methods for the 30-dimensional density model in Simulation \mathbf{II}. The experiments are repeated 50 times; average errors are reported with standard deviations in parentheses.

To further visualize the performance of the four methods, we plot a few estimated marginal densities and compare them with the ground truth marginals. Figure 4 shows the two-dimensional marginal densities corresponding to (x_{1},x_{2}), (x_{4},x_{8}), and (x_{10},x_{20}). The direct comparison in Figure 4 demonstrates that VRS provides a noticeably better fit to the true density.

Figure 4: Marginal densities from the 30-dimensional Gaussian mixture model in Simulation II. From left to right: the ground truth density, and estimations from VRS, KDE, MAF, and NAF. From top to bottom: two-dimensional marginal densities corresponding to (x1,x2)(x_{1},x_{2}), (x4,x8)(x_{4},x_{8}), and (x10,x20)(x_{10},x_{20}). The values in the colorbar on the right represent function values.

\bullet Simulation \mathbf{III}. Ginzburg-Landau theory is widely used to model the microscopic behavior of superconductors. The Ginzburg-Landau density has the following expression

\displaystyle p^{*}(x_{1},\ldots,x_{d})\propto\exp\left(-\beta\bigg{\{}\sum_{j=0}^{d}\frac{\lambda}{2}(\frac{x_{j}-x_{j+1}}{h})^{2}+\sum_{j=1}^{d}\frac{1}{4\lambda}(x_{j}^{2}-1)^{2}\bigg{\}}\right), (37)

where x_{0}=x_{d+1}=0. We sample data from the Ginzburg-Landau density with coefficients \beta=1/8, \lambda=0.02, and h=1/(d+1) using the Metropolis-Hastings sampling algorithm. This density concentrates around the two centers (+1,+1,\cdots,+1) and (-1,-1,\cdots,-1), and all the coordinates (x_{1},\ldots,x_{d}) are correlated in a non-trivial way due to the interaction term \exp\big{(}-\beta\sum_{j=0}^{d}\frac{\lambda}{2}(\frac{x_{j}-x_{j+1}}{h})^{2}\big{)} in the density function. We consider two sets of experiments for the Ginzburg-Landau density model. In the first set, we fix d=10 and vary the sample size N from 1\times 10^{5} to 5\times 10^{5}. In the second set, we keep the sample size N at 1\times 10^{5} and vary d from 2 to 10. We summarize the averaged relative {{\bf L}_{2}}-error of each method in Figure 5. Furthermore, in Figure 6 we visualize several two-dimensional marginal densities estimated by the four methods with sample size 1\times 10^{5}. The direct comparison shows that VRS recovers these marginal densities with decent accuracy.
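For readers who wish to reproduce the data generation, a minimal random-walk Metropolis-Hastings sampler for the unnormalized density (37) can be written as follows; the step size, burn-in, and thinning below are our own illustrative choices rather than the settings used in the paper.

```python
import numpy as np

def log_gl_density(x, beta=1/8, lam=0.02, d=10):
    """Unnormalized log of the Ginzburg-Landau density (37), with x_0 = x_{d+1} = 0."""
    h = 1.0 / (d + 1)
    z = np.concatenate(([0.0], x, [0.0]))          # pad the boundary values
    grad_term = 0.5 * lam * np.sum(((z[1:] - z[:-1]) / h) ** 2)
    pot_term = np.sum((x**2 - 1.0) ** 2) / (4.0 * lam)
    return -beta * (grad_term + pot_term)

def metropolis_hastings(n_samples, d=10, step=0.1, burn_in=5_000, thin=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=d)
    log_p = log_gl_density(x, d=d)
    out = []
    for it in range(burn_in + n_samples * thin):
        prop = x + step * rng.standard_normal(d)   # random-walk proposal
        log_p_prop = log_gl_density(prop, d=d)
        if np.log(rng.random()) < log_p_prop - log_p:
            x, log_p = prop, log_p_prop            # accept the proposal
        if it >= burn_in and (it - burn_in) % thin == 0:
            out.append(x.copy())
    return np.array(out)

samples = metropolis_hastings(1_000)
```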

Figure 5: The plot on the left corresponds to Simulation \mathbf{III} with d=10 and N varying from 1\times 10^{5} to 5\times 10^{5}; the plot on the right corresponds to Simulation \mathbf{III} with N fixed at 1\times 10^{5} and d varying from 2 to 10.
Figure 6: Marginal densities from the 10-dimensional Ginzburg-Landau model in Simulation III. From left to right: the ground truth density and the estimates from VRS, KDE, MAF, and NAF. From top to bottom: two-dimensional marginal densities corresponding to (x_{4},x_{8}) and (x_{9},x_{10}). The values in the colorbar on the right represent the density function values.

\bullet Simulation \mathbf{IV}. We consider a four-mode Gaussian mixture model in two dimensions. We generate 20,000 samples from the density

\displaystyle p^{*}(\bm{x})=\sum_{i=1}^{4}\frac{1}{4}\frac{\exp\left(-\frac{1}{2}(\bm{x}-\bm{\mu}_{i})^{T}\bm{\Sigma}_{i}^{-1}(\bm{x}-\bm{\mu}_{i})\right)}{\sqrt{(2\pi)^{2}|\bm{\Sigma}_{i}|}},

where \bm{\mu}_{1}=\begin{pmatrix}-0.5\\ -0.5\end{pmatrix},\ \bm{\mu}_{2}=\begin{pmatrix}0.5\\ 0.5\end{pmatrix},\ \bm{\mu}_{3}=\begin{pmatrix}-0.5\\ 0.5\end{pmatrix},\ \bm{\mu}_{4}=\begin{pmatrix}0.5\\ -0.5\end{pmatrix}, and \bm{\Sigma}_{1}=\bm{\Sigma}_{2}=\begin{pmatrix}0.25^{2}&0.03^{2}\\ 0.03^{2}&0.25^{2}\end{pmatrix}, \bm{\Sigma}_{3}=\bm{\Sigma}_{4}=\begin{pmatrix}0.1^{2}&-0.05^{2}\\ -0.05^{2}&0.1^{2}\end{pmatrix}. This setting is more challenging than Simulation \mathbf{I} due to the larger number of modes and the more sharply peaked structure of the true density. We report the relative {{\bf L}_{2}} error and KL divergence for each method in Table 3.

VRS KDE MAF NAF
Relative {{\bf L}_{2}} Error 0.0721(0.0029) 0.3987(0.0039) 0.2441(0.0411) 0.4617(0.0621)
KL Divergence 0.0142(0.0015) 0.1223(0.0014) 0.0819(0.0161) 0.2356(0.0417)
Table 3: Relative {{\bf L}_{2}} errors and KL divergences of the four methods for the two-dimensional Gaussian mixture model in Simulation IV. The experiments are repeated 50 times; average errors are reported with standard deviations in parentheses.

To further visualize the performance of the four methods, the true density and the estimated density from each method are plotted in Figure 7. The direct comparison in Figure 7 demonstrates that VRS provides a noticeably better estimate of the true density.

Figure 7: Density functions from the two-dimensional Gaussian mixture model in Simulation 𝐈𝐕\mathbf{IV}. From left to right are the ground truth density, estimates from VRS, KDE, MAF, and NAF, respectively. The values in the colorbar on the right represent function values.

\bullet Real data \mathbf{I}. We analyze density estimation for the Portugal wine quality dataset from the UCI Machine Learning Repository. This dataset contains 6,497 samples of red and white wines, along with 8 continuous variables: volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, density, sulphates, and alcohol. To provide a comprehensive comparison between the different methods, we estimate the joint density of the first d variables in this dataset, allowing d to vary from 2 to 8. For instance, d=2 corresponds to the joint density of volatile acidity and citric acid. Since the true density is unknown, we randomly split the dataset into 90% training and 10% test data and evaluate the performance of the various approaches using the averaged log-likelihood of the test data. The averaged log-likelihood is defined as follows: let \widetilde{p} be the density estimator fitted on the training data. The averaged log-likelihood of the test data \{Z_{i}\}_{i=1}^{N_{\text{test}}} is

\frac{1}{N_{\text{test}}}\sum_{i=1}^{N_{\text{test}}}\log\{\widetilde{p}(Z_{i})\}.
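A minimal sketch of this train/test evaluation protocol is given below; fit_density is a hypothetical placeholder for any of the estimators being compared, and the toy kernel density fit is included only so that the snippet runs end to end.

```python
import numpy as np

def evaluate_avg_loglik(data, fit_density, test_frac=0.1, seed=0):
    """Random 90/10 split; return the averaged held-out log-likelihood of a fitted estimator."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_test = int(test_frac * len(data))
    test, train = data[idx[:n_test]], data[idx[n_test:]]
    p_hat = fit_density(train)                       # returns a callable density estimate
    return np.mean(np.log(p_hat(test) + 1e-300))     # small guard against log(0)

# Toy Gaussian kernel density estimator used as the placeholder fit.
def fit_density(train, bw=0.2):
    def p_hat(x):
        d2 = ((x[:, None, :] - train[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / bw**2).mean(1) / (2 * np.pi * bw**2) ** (train.shape[1] / 2)
    return p_hat

data = np.random.default_rng(1).normal(size=(2_000, 2))
print(evaluate_avg_loglik(data, fit_density))
```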

The numerical performance of VRS, NN, and KDE is summarized in Figure 8. Notably, VRS achieves the highest averaged log-likelihood values, indicating its superior numerical performance.

Figure 8: Density estimation for the Portugal wine quality dataset with VRS, KDE, and neural network estimators.

5.3 Image PCA denoising

Principal Component Analysis (PCA) is a popular technique for reducing noise in images. In this subsection, we examine the numerical performance of VRS in image denoising problems. A widely used benchmark method in the image denoising literature is kernel PCA. We direct interested readers to Mika et al. (1998) and Bakır et al. (2004) for a comprehensive introduction to the kernel PCA method.

The main advantage of VRS lies in its computational complexity. Consider NN image data with resolution κ2{\kappa}^{2}, where N,κN,{\kappa}\in\mathbb{N}. The time complexity of kernel PCA is O(N2κ2+N3)\operatorname*{O}(N^{2}{\kappa}^{2}+N^{3}), where O(N2κ2)\operatorname*{O}(N^{2}{\kappa}^{2}) corresponds to the cost of generating the kernel matrix in N×N\mathbb{R}^{N\times N}, and O(N3)\operatorname*{O}(N^{3}) reflects the cost of computing the principal components of this matrix. In contrast, the time complexity of VRS is analyzed in (26) with d=2d=2. Empirical evidence (see e.g., Pope et al. (2021)) suggests that image data possesses low intrinsic dimensions, making practical choices of j\ell_{j} and rjr_{j} in (26) significantly smaller than NN and κ{\kappa}. Even in the worst case scenario where MM takes the upper bound κ{\kappa} in (26), the practical time complexity of VRS is O(Nκ2+κ4)\operatorname*{O}(N{\kappa}^{2}+{\kappa}^{4}) which is considerably more efficient than the kernel PCA approach.

In the numerical experiments, we work with real datasets and treat the images from these datasets as the ground truth. To evaluate the numerical performance of a given approach, we add i.i.d. Gaussian noise to each pixel of the images and randomly split the dataset into 90% training and 10% test data. We then use the training data to compute the principal components based on the given approach and project the test data onto these estimated principal components. Denote the noiseless ground truth images as \{{\mathnormal{I}}^{*}_{i}\}_{i=1}^{N_{\text{test}}}, the corresponding projected noisy test data as \{\widetilde{{\mathnormal{I}}}_{i}\}_{i=1}^{N_{\text{test}}}, and the corresponding Gaussian noise added to the images as \{\epsilon_{i}\}_{i=1}^{N_{\text{test}}} in (32). The numerical performance of the given approach is evaluated through the relative denoising error:

\sqrt{\frac{1}{N_{\text{test}}}\sum_{i=1}^{N_{\text{test}}}\frac{\|\widetilde{{\mathnormal{I}}}_{i}-{\mathnormal{I}}^{*}_{i}\|^{2}_{2}}{\|{\mathnormal{I}}^{*}_{i}\|^{2}_{2}}},

where \|{{\mathnormal{I}}}^{*}_{i}\|_{2} denotes the Euclidean norm of {{\mathnormal{I}}}^{*}_{i}. In addition, we use the relative noise variance \sqrt{\frac{1}{N_{\text{test}}}\sum_{i=1}^{N_{\text{test}}}\|\epsilon_{i}\|^{2}_{2}/\|{\mathnormal{I}}^{*}_{i}\|^{2}_{2}} to measure the noise level. For the time complexity comparison, all methods are executed on Google Colab's CPU with high RAM, and the execution time of each method is recorded.
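For concreteness, the two evaluation quantities above can be computed as in the following sketch; the arrays and the trivial identity "denoiser" in the usage example are our own toy stand-ins for the projected test images.

```python
import numpy as np

def relative_denoising_error(denoised, clean):
    """Root-mean of per-image relative squared errors, matching the display above."""
    num = np.sum((denoised - clean) ** 2, axis=tuple(range(1, clean.ndim)))
    den = np.sum(clean ** 2, axis=tuple(range(1, clean.ndim)))
    return np.sqrt(np.mean(num / den))

def relative_noise_variance(noise, clean):
    num = np.sum(noise ** 2, axis=tuple(range(1, clean.ndim)))
    den = np.sum(clean ** 2, axis=tuple(range(1, clean.ndim)))
    return np.sqrt(np.mean(num / den))

# Toy usage: 16x16 "images" with additive Gaussian noise and no actual denoising step.
rng = np.random.default_rng(0)
clean = rng.random((100, 16, 16))
noise = 0.5 * rng.standard_normal(clean.shape)
noisy = clean + noise
print(relative_noise_variance(noise, clean), relative_denoising_error(noisy, clean))
```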

\bullet Real data \mathbf{III}. Our first study focuses on the USPS digits dataset. This dataset comprises images of handwritten digits (0 through 9) that were originally scanned from envelopes by the USPS. It contains a total of 9,298 images, each with a resolution of 16\times 16. After adding the Gaussian noise, the relative noise variance of the noisy data is 0.7191. The relative denoising errors for VRS and kernel PCA are 0.2951 and 0.2959, respectively, which reflects the excellent denoising performance of both methods. Although the errors show minimal difference, the computational cost of VRS is significantly lower than that of kernel PCA: the execution time for VRS is 0.40 seconds, compared to 36.91 seconds for kernel PCA. In addition to this numerical comparison, Figure 9(a) shows five images randomly selected from the test set to illustrate the denoised results using VRS and kernel PCA.

(a) USPS digits dataset
(b) MNIST dataset
Figure 9: Denoising images from (a) USPS digits dataset and (b) MNIST dataset. In both (a) and (b), the first column shows the ground truth images from the test data, the second column shows the images polluted by Gaussian noise, the third column shows the images denoised using VRS, and the last column shows the images denoised using kernel PCA. Additional numerical results are shown in Appendix I.

\bullet Real data \mathbf{IV}. We analyze the MNIST dataset, which comprises 70,000 images of handwritten digits (0 through 9), each labeled with the true digit. The size of each image is 28\times 28. After adding the Gaussian noise, the relative noise variance of the noisy data is 0.9171. The relative denoising errors for VRS and kernel PCA are 0.4044 and 0.4170, respectively. Although the numerical accuracy of the two methods is quite similar, the computational cost of VRS is significantly lower than that of kernel PCA: the execution time for VRS is only 4.33 seconds, in contrast to 3218.35 seconds for kernel PCA. In addition to this numerical comparison, Figure 9(b) shows five images randomly selected from the test set to demonstrate the denoised images using VRS and kernel PCA.

6 Conclusion

In this paper, we develop a comprehensive framework, Variance-Reduced Sketching (VRS), for nonparametric density estimation in higher dimensions. Our approach leverages the concept of sketching from numerical linear algebra to address the curse of dimensionality in function spaces. The method treats multivariable functions as infinite-dimensional matrices or tensors, and the sketching operators are specifically tailored to the regularity of the function being estimated. This design takes the variance of the random samples in nonparametric problems into account and is intended to reduce the curse of dimensionality in density estimation problems. Extensive simulated experiments and real data examples demonstrate that our sketching-based method substantially outperforms both neural network estimators and classical kernel density methods in terms of numerical performance.

References

  • Al Daas et al. (2023) Hussam Al Daas, Grey Ballard, Paul Cazeaux, Eric Hallman, Agnieszka Miedlar, Mirjeta Pasha, Tim W Reid, and Arvind K Saibaba. Randomized algorithms for rounding in the tensor-train format. SIAM Journal on Scientific Computing, 45(1):A74–A95, 2023.
  • Alaoui and Mahoney (2015) Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with statistical guarantees. Advances in neural information processing systems, 28, 2015.
  • Bakır et al. (2004) Gökhan H Bakır, Jason Weston, and Bernhard Schölkopf. Learning to find pre-images. Advances in neural information processing systems, 16:449–456, 2004.
  • Bell (2014) Jordan Bell. The singular value decomposition of compact operators on hilbert spaces, 2014.
  • Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
  • Candes and Plan (2010) Emmanuel J Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
  • Chaudhuri et al. (2014) Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike Von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory, 60(12):7900–7912, 2014.
  • Che and Wei (2019) Maolin Che and Yimin Wei. Randomized algorithms for the approximations of tucker and the tensor train decompositions. Advances in Computational Mathematics, 45(1):395–428, 2019.
  • Chen (2017) Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017.
  • Chen and Khoo (2023) Yian Chen and Yuehaw Khoo. Combining particle and tensor-network methods for partial differential equations via sketching. arXiv preprint arXiv:2305.17884, 2023.
  • Chen et al. (2021) Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, et al. Spectral methods for data science: A statistical perspective. Foundations and Trends® in Machine Learning, 14(5):566–806, 2021.
  • Davis et al. (2011) Richard A Davis, Keh-Shin Lii, and Dimitris N Politis. Remarks on some nonparametric estimates of a density function. Selected Works of Murray Rosenblatt, pages 95–100, 2011.
  • Dempster (1977) Arthur Dempster. Maximum likelihood estimation from incomplete data via the em algorithm. Journal of the Royal Statistical Society, 39:1–38, 1977.
  • Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
  • Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
  • Drineas and Mahoney (2016) Petros Drineas and Michael W Mahoney. Randnla: randomized numerical linear algebra. Communications of the ACM, 59(6):80–90, 2016.
  • Escobar and West (1995) Michael D Escobar and Mike West. Bayesian density estimation and inference using mixtures. Journal of the american statistical association, 90(430):577–588, 1995.
  • Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International conference on machine learning, pages 881–889. PMLR, 2015.
  • Goodfellow et al. (2020) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Halko et al. (2011) Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
  • Han et al. (2018) Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling using matrix product states. Physical Review X, 8(3):031012, 2018.
  • Horn and Johnson (1994) Roger A Horn and Charles R Johnson. Topics in matrix analysis. Cambridge university press, 1994.
  • Huang et al. (2018) Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2078–2087. PMLR, 2018.
  • Hur et al. (2023) Yoonhaeng Hur, Jeremy G Hoskins, Michael Lindsey, E Miles Stoudenmire, and Yuehaw Khoo. Generative modeling via tensor train sketching. Applied and Computational Harmonic Analysis, 67:101575, 2023.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kressner et al. (2023) Daniel Kressner, Bart Vandereycken, and Rik Voorhaar. Streaming tensor train approximation. SIAM Journal on Scientific Computing, 45(5):A2610–A2631, 2023.
  • Kumar et al. (2012) Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the nyström method. The Journal of Machine Learning Research, 13(1):981–1006, 2012.
  • Liberty et al. (2007) Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.
  • Liu et al. (2007) Han Liu, John Lafferty, and Larry Wasserman. Sparse nonparametric density estimation in high dimensions using the rodeo. In Artificial Intelligence and Statistics, pages 283–290. PMLR, 2007.
  • Liu et al. (2023) Linxi Liu, Dangna Li, and Wing Hung Wong. Convergence rates of a class of multivariate density estimation methods based on adaptive partitioning. Journal of machine learning research, 24(50):1–64, 2023.
  • Liu et al. (2021) Qiao Liu, Jiaze Xu, Rui Jiang, and Wing Hung Wong. Density estimation using deep generative neural networks. Proceedings of the National Academy of Sciences, 118(15):e2101344118, 2021.
  • Mahoney et al. (2011) Michael W Mahoney et al. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.
  • Martinsson (2019) Per-Gunnar Martinsson. Randomized methods for matrix computations. The Mathematics of Data, 25(4):187–231, 2019.
  • Martinsson and Tropp (2020) Per-Gunnar Martinsson and Joel A Tropp. Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 29:403–572, 2020.
  • Mika et al. (1998) Sebastian Mika, Bernhard Schölkopf, Alex Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel pca and de-noising in feature spaces. Advances in neural information processing systems, 11, 1998.
  • Minster et al. (2020) Rachel Minster, Arvind K Saibaba, and Misha E Kilmer. Randomized algorithms for low-rank tensor decompositions in the tucker format. SIAM Journal on Mathematics of Data Science, 2(1):189–215, 2020.
  • Mohideen et al. (2008) S Kother Mohideen, S Arumuga Perumal, and M Mohamed Sathik. Image de-noising using discrete wavelet transform. International Journal of Computer Science and Network Security, 8(1):213–216, 2008.
  • Nakatsukasa and Tropp (2021) Yuji Nakatsukasa and Joel A Tropp. Fast & accurate randomized algorithms for linear systems and eigenvalue problems. arXiv preprint arXiv:2111.00113, 2021.
  • Papamakarios et al. (2017) George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in neural information processing systems, 30, 2017.
  • Parzen (1962) Emanuel Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
  • Peng et al. (2023) Yifan Peng, Yian Chen, E Miles Stoudenmire, and Yuehaw Khoo. Generative modeling via hierarchical tensor sketching. arXiv preprint arXiv:2304.05305, 2023.
  • Pope et al. (2021) Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894, 2021.
  • Rahimi and Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007.
  • Raskutti and Mahoney (2016) Garvesh Raskutti and Michael W Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508–7538, 2016.
  • Ren et al. (2023) Yinuo Ren, Hongli Zhao, Yuehaw Khoo, and Lexing Ying. High-dimensional density estimation with tensorizing flow. Research in the Mathematical Sciences, 10(3):30, 2023.
  • Rezende and Mohamed (2015) Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015.
  • Sande et al. (2020) Espen Sande, Carla Manni, and Hendrik Speleers. Explicit error estimates for spline approximation of arbitrary smoothness in isogeometric analysis. Numerische Mathematik, 144(4):889–929, 2020.
  • Scott (1979) David W Scott. On optimal and data-based histograms. Biometrika, 66(3):605–610, 1979.
  • Scott (1985) David W Scott. Averaged shifted histograms: effective nonparametric density estimators in several dimensions. The Annals of Statistics, pages 1024–1040, 1985.
  • Shi et al. (2023) Tianyi Shi, Maximilian Ruth, and Alex Townsend. Parallel algorithms for computing the tensor-train decomposition. SIAM Journal on Scientific Computing, 45(3):C101–C130, 2023.
  • Sriperumbudur and Steinwart (2012) Bharath Sriperumbudur and Ingo Steinwart. Consistency and rates for clustering with dbscan. In Artificial Intelligence and Statistics, pages 1090–1098. PMLR, 2012.
  • Sun et al. (2020) Yiming Sun, Yang Guo, Charlene Luo, Joel Tropp, and Madeleine Udell. Low-rank tucker approximation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science, 2(4):1123–1150, 2020.
  • Tang and Ying (2023) Xun Tang and Lexing Ying. Solving high-dimensional fokker-planck equation with functional hierarchical tensor. arXiv preprint arXiv:2312.07455, 2023.
  • Tang et al. (2022) Xun Tang, Yoonhaeng Hur, Yuehaw Khoo, and Lexing Ying. Generative modeling via tree tensor network states. arXiv preprint arXiv:2209.01341, 2022.
  • Tropp et al. (2017a) Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Fixed-rank approximation of a positive-semidefinite matrix from streaming data. Advances in Neural Information Processing Systems, 30, 2017a.
  • Tropp et al. (2017b) Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):1454–1485, 2017b.
  • Uria et al. (2016) Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
  • Vershynin (2018) Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • Wainwright (2019) Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wang et al. (2017) Shusen Wang, Alex Gittens, and Michael W Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. In International Conference on Machine Learning, pages 3608–3616. PMLR, 2017.
  • Wasserman (2006) Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
  • Williams and Seeger (2000) Christopher Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. Advances in neural information processing systems, 13, 2000.
  • Woodruff et al. (2014) David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
  • Xu (2018) Yuan Xu. Approximation by polynomials in sobolev spaces with jacobi weight. Journal of Fourier Analysis and Applications, 24:1438–1459, 2018.
  • Yang et al. (2017) Yun Yang, Mert Pilanci, and Martin J Wainwright. Randomized sketches for kernels: Fast and optimal nonparametric regression. Annals of Statistics, pages 991–1023, 2017.
  • Zambom and Dias (2013) Adriano Z Zambom and Ronaldo Dias. A review of kernel density estimation with applications to econometrics. International Econometric Review, 5(1):20–42, 2013.

Appendix A Examples of finite-rank functions

In this section, we present three examples commonly encountered in nonparametric statistical literature that satisfy 1. Note that the range of A(x,y):Ω1×Ω2A(x,y):\Omega_{1}\times\Omega_{2}\to\mathbb{R} is defined as

\displaystyle{\operatorname*{Range}}_{x}(A)=\bigg{\{}f(x):f(x)=\int A(x,y)g(y)\mathrm{d}y\text{ for some }g(y)\in{{\bf L}_{2}}(\Omega_{2})\bigg{\}}. (38)

In addition, we provide a classical result in function spaces that allows us to conceptualize multivariable functions as infinite-dimensional matrices. Let d_{1} and d_{2} be arbitrary positive integers, and let \Omega_{1}\subset\mathbb{R}^{d_{1}} and \Omega_{2}\subset\mathbb{R}^{d_{2}} be two measurable sets.

Theorem 3.

[Singular value decomposition in function space] Let A(x,y):Ω1×Ω2A(x,y):\Omega_{1}\times\Omega_{2}\to\mathbb{R} be any function such that A𝐋2(Ω1×Ω2)<\|A\|_{{{\bf L}_{2}}(\Omega_{1}\times\Omega_{2})}<\infty. There exists a collection of strictly positive singular values {σρ(A)}ρ=1r+\{\sigma_{\rho}(A)\}_{{\rho}=1}^{r}\in\mathbb{R}^{+}, and two collections of orthonormal basis functions {Φρ(x)}ρ=1r𝐋2(Ω1)\{\Phi_{\rho}(x)\}_{{\rho}=1}^{r}\subset{{\bf L}_{2}}(\Omega_{1}) and {Ψρ(y)}ρ=1r𝐋2(Ω2)\{\Psi_{\rho}(y)\}_{{\rho}=1}^{r}\subset{{\bf L}_{2}}(\Omega_{2}) where r{+}r\in\mathbb{N}\cup\{+\infty\} such that

\displaystyle A(x,y)=\sum_{{\rho}=1}^{r}\sigma_{{\rho}}(A)\Phi_{\rho}(x)\Psi_{\rho}(y). (39)

By viewing A(x,y)A(x,y) as an infinite-dimensional matrix, it follows that the rank of A(x,y)A(x,y) is rr and that an equivalent definition of Rangex(A){\operatorname*{Range}}_{x}(A) is that

\displaystyle{\operatorname*{Range}}_{x}(A)=\operatorname*{Span}\{\Phi_{\rho}(x)\}_{{\rho}=1}^{r}, (40)

where the definition of Span can be found in (3). Consequently, the rank of A(x,y)A(x,y) is the same as the dimensionality of Rangex(A)\operatorname*{Range}_{x}(A).

Example 1 (Additive models in regression).

In the multivariate nonparametric statistics literature, it is commonly assumed that the underlying unknown nonparametric function f^{*}:[0,1]^{d}\to\mathbb{R} possesses an additive structure, meaning that there exists a collection of univariate functions \{f_{j}^{*}(z_{j}):[0,1]\to\mathbb{R}\}_{j=1}^{d} such that

f^{*}(z_{1},\ldots,z_{d})=f_{1}^{*}(z_{1})+\cdots+f_{d}^{*}(z_{d})\text{ for all }(z_{1},\ldots,z_{d})\in[0,1]^{d}.

To connect this with 1, let x=z1x=z_{1} and y=(z2,,zd).y=(z_{2},\ldots,z_{d}). Then by (38), Rangex(f)=Span{1,f1(z1)}\operatorname*{Range}_{x}(f^{*})=\operatorname*{Span}\{1,f^{*}_{1}(z_{1})\} and Rangey(f)=Span{1,g(y)}\operatorname*{Range}_{y}(f^{*})=\operatorname*{Span}\{1,g^{*}(y)\}, where

g^{*}(y)=g^{*}(z_{2},\ldots,z_{d})=f_{2}^{*}(z_{2})+\cdots+f_{d}^{*}(z_{d}).

The dimensionality of Rangex(f)\operatorname*{Range}_{x}(f^{*}) is at most 22, and consequently the rank of f(x,y)𝐋2([0,1])𝐋2([0,1]d1)f^{*}(x,y)\in{{\bf L}_{2}}([0,1])\otimes{{\bf L}_{2}}([0,1]^{d-1}) is at most 22.
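This rank bound is easy to verify numerically: discretizing an additive function on a tensor grid and inspecting the singular values of the resulting matrix recovers a numerical rank of 2. The sketch below uses our own toy choices of f_{1}^{*} and f_{2}^{*}.

```python
import numpy as np

# Toy additive function f*(z1, z2) = f1(z1) + f2(z2) on [0, 1]^2.
f1 = lambda z: np.sin(2 * np.pi * z)
f2 = lambda z: z**3

grid = np.linspace(0.0, 1.0, 200)
F = f1(grid)[:, None] + f2(grid)[None, :]      # discretized "infinite-size matrix"

s = np.linalg.svd(F, compute_uv=False)
numerical_rank = np.sum(s > 1e-10 * s[0])
print(numerical_rank)                          # prints 2, consistent with the rank-2 bound
```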

Example 2 (Mean-field models in density estimation).

Mean-field theory is a popular framework in computational physics and Bayesian probability as it studies the behavior of high-dimensional stochastic models. The main idea of the mean-field theory is to replace all interactions to any one body with an effective interaction in a physical system. Specifically, the mean-field model assumes that the density function p(z1,,zd):[0,1]d+p^{*}(z_{1},\ldots,z_{d}):[0,1]^{d}\to\mathbb{R}^{+} can be well-approximated by p1(z1)pd(zd)p_{1}^{*}(z_{1})\cdots p_{d}^{*}(z_{d}), where for j=1,,dj=1,\ldots,d, pj(zj):[0,1]+p^{*}_{j}(z_{j}):[0,1]\to\mathbb{R}^{+} are univariate marginal density functions. The readers are referred to Blei et al. (2017) for further discussion.

In a large physical system with multiple interacting sub-systems, the underlying density can be well-approximated by a mixture of mean-field densities. Specifically, let {τρ}ρ=1r+\{\tau_{\rho}\}_{{\rho}=1}^{r}\subset\mathbb{R}^{+} be a collection of positive probabilities summing to 11. In the mean-field mixture model, with probability τρ\tau_{\rho}, data are sampled from a mean-field density pρ(z1,,zd)=pρ,1(z1)pρ,d(zd)p_{{\rho}}^{*}(z_{1},\ldots,z_{d})=p_{{\rho},1}^{*}(z_{1})\cdots p_{{\rho},d}^{*}(z_{d}). Therefore

p^{*}(z_{1},\ldots,z_{d})=\sum_{{\rho}=1}^{r}\tau_{\rho}p_{{\rho},1}^{*}(z_{1})\cdots p_{{\rho},d}^{*}(z_{d}).

To connect the mean-field mixture model to 1, let x=z_{1} and y=(z_{2},\ldots,z_{d}). Then according to (38), \operatorname*{Range}_{x}(p^{*})=\operatorname*{Span}\{p_{{\rho},1}^{*}(z_{1})\}_{{\rho}=1}^{r} and \operatorname*{Range}_{y}(p^{*})=\operatorname*{Span}\{g^{*}_{\rho}(y)\}_{{\rho}=1}^{r}, where

g^{*}_{\rho}(y)=g^{*}_{\rho}(z_{2},\ldots,z_{d})=p_{{\rho},2}^{*}(z_{2})\cdots p_{{\rho},d}^{*}(z_{d}).

The dimensionality of Rangex(p)\operatorname*{Range}_{x}(p^{*}) is at most rr, and therefore the rank of p(x,y)p^{*}(x,y) is at most rr.
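As with the additive model, this rank bound can be checked numerically by discretizing a mixture of r mean-field densities on a tensor grid, unfolding along the first coordinate, and inspecting the singular values; the component marginals below are our own toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d, n_grid = 3, 4, 30
grid = np.linspace(0.0, 1.0, n_grid)

# r mean-field components: each is a product of d univariate (unnormalized) Gaussian bumps.
centers = rng.uniform(0.2, 0.8, size=(r, d))
marginals = np.exp(-0.5 * (grid[None, None, :] - centers[:, :, None]) ** 2 / 0.1**2)

# Build p*(z1,...,zd) = sum_rho tau_rho * prod_j p_{rho,j}(z_j) on the tensor grid.
tau = np.full(r, 1.0 / r)
P = np.zeros((n_grid,) * d)
for rho in range(r):
    comp = tau[rho]
    for j in range(d):
        comp = np.multiply.outer(comp, marginals[rho, j])
    P += comp

# Unfold along the first coordinate (x = z1 versus y = (z2,...,zd)) and inspect singular values.
s = np.linalg.svd(P.reshape(n_grid, -1), compute_uv=False)
print(np.sum(s > 1e-8 * s[0]))                 # numerical rank is at most r (here: 3)
```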

Example 3 (Multivariate Taylor expansion).

Suppose G:[0,1]dG:[0,1]^{d}\to\mathbb{R} is an α\alpha-times continuously differentiable function. Then Taylor’s theorem in the multivariate setting states that for z=(z1,,zd)dz=(z_{1},\ldots,z_{d})\in\mathbb{R}^{d} and t=(t1,,td)dt=(t_{1},\ldots,t_{d})\in\mathbb{R}^{d}, G(z)Tt(z)G(z)\approx T_{t}(z), where

\displaystyle T_{t}(z)=G(t)+\sum_{k=1}^{\alpha}\frac{1}{k!}\mathcal{D}^{k}G(t,z-t), (41)

and 𝒟kG(x,h)=i1,,ik=1di1ikG(x)hi1hik.\mathcal{D}^{k}G(x,h)=\sum_{i_{1},\cdots,i_{k}=1}^{d}\partial_{i_{1}}\cdots\partial_{i_{k}}G(x)h_{i_{1}}\cdots h_{i_{k}}. For example, 𝒟G(x,h)=i=1diG(x)hi\mathcal{D}G(x,h)=\sum_{i=1}^{d}\partial_{i}G(x)h_{i}, 𝒟2G(x,h)=i=1dj=1dijG(x)hihj,\mathcal{D}^{2}G(x,h)=\sum_{i=1}^{d}\sum_{j=1}^{d}\partial_{i}\partial_{j}G(x)h_{i}h_{j}, and so on. To simplify our discussion, let t=0dt=0\in\mathbb{R}^{d}. Then (41) becomes

T_{0}(z)=G(0)+\sum_{i=1}^{d}\partial_{i}G(0)z_{i}+\frac{1}{2!}\sum_{i=1}^{d}\sum_{j=1}^{d}\partial_{i}\partial_{j}G(0)z_{i}z_{j}+\ldots+\frac{1}{\alpha!}\sum_{i_{1},\ldots,i_{\alpha}=1}^{d}\partial_{i_{1}}\cdots\partial_{i_{\alpha}}G(0)z_{i_{1}}\cdots z_{i_{\alpha}}.

Let x=z1x=z_{1} and y=(z2,,zd).y=(z_{2},\ldots,z_{d}). Then by (38), Rangex(T0)=Span{1,z1,z12,,z1α}\operatorname*{Range}_{x}(T_{0})=\operatorname*{Span}\{1,z_{1},z_{1}^{2},\ldots,z_{1}^{\alpha}\}. The dimensionality of Rangex(T0)\operatorname*{Range}_{x}(T_{0}) is at most α+1\alpha+1, and therefore GG can be well-approximated by finite rank functions.

Appendix B Examples of \mathcal{M} and \mathcal{L} satisfying 2

In this section, we provide three examples of the subspaces \mathcal{M} and \mathcal{L} such that 2 holds.

B.1 Reproducing kernel Hilbert space basis

Let 𝒪\mathcal{O} be a measurable set in \mathbb{R}. The two most used examples are 𝒪=[0,1]\mathcal{O}=[0,1] for non-parametric estimation and 𝒪={1,,κ}\mathcal{O}=\{1,\ldots,{\kappa}\} for image PCA.

For x,y𝒪x,y\in\mathcal{O}, let 𝕂:𝒪×𝒪\mathbb{K}:\mathcal{O}\times\mathcal{O}\to\mathbb{R} be a kernel function such that

𝕂(x,y)=k=1λk𝕂ϕk𝕂(x)ϕk𝕂(y),\displaystyle\mathbb{K}(x,y)=\sum_{k=1}^{\infty}\lambda^{\mathbb{K}}_{k}\phi^{\mathbb{K}}_{k}(x)\phi^{\mathbb{K}}_{k}(y), (42)

where {λk𝕂}k=1+{0}\{\lambda^{\mathbb{K}}_{k}\}_{k=1}^{\infty}\subset\mathbb{R}^{+}\cup\{0\} , and {ϕk𝕂}k=1\{\phi^{\mathbb{K}}_{k}\}_{k=1}^{\infty} is a collection of basis functions in 𝐋2(𝒪){{\bf L}_{2}}(\mathcal{O}). If 𝒪=[0,1]\mathcal{O}=[0,1], {ϕk𝕂}k=1\{\phi^{\mathbb{K}}_{k}\}_{k=1}^{\infty} are orthonormal 𝐋2([0,1]){{\bf L}_{2}}([0,1]) functions. If 𝒪={1,,κ}\mathcal{O}=\{1,\ldots,{\kappa}\}, then {ϕk𝕂}k=1\{\phi^{\mathbb{K}}_{k}\}_{k=1}^{\infty} can be identified as orthogonal vectors in κ\mathbb{R}^{{\kappa}}. In this case, λk𝕂=0\lambda^{\mathbb{K}}_{k}=0 for all k>κk>{\kappa}.

The reproducing kernel Hilbert space generated by 𝕂\mathbb{K} is

(𝕂)={f𝐋2([0,1]):f(𝕂)2=k=1(λk𝕂)1f,ϕk𝕂2<}.\displaystyle{\mathcal{H}}(\mathbb{K})=\bigg{\{}f\in{{\bf L}_{2}}([0,1]):\|f\|_{{\mathcal{H}}(\mathbb{K})}^{2}=\sum_{k=1}^{\infty}(\lambda_{k}^{\mathbb{K}})^{-1}\langle f,\phi^{\mathbb{K}}_{k}\rangle^{2}<\infty\bigg{\}}. (43)

For any functions f,g(𝕂)f,g\in{\mathcal{H}}(\mathbb{K}), the inner product in (𝕂){\mathcal{H}}(\mathbb{K}) is given by

f,g(𝕂)=k=1(λk𝕂)1f,ϕk𝕂g,ϕk𝕂.\langle f,g\rangle_{{\mathcal{H}}(\mathbb{K})}=\sum_{k=1}^{\infty}(\lambda_{k}^{\mathbb{K}})^{-1}\langle f,\phi^{\mathbb{K}}_{k}\rangle\langle g,\phi^{\mathbb{K}}_{k}\rangle.

Denote Θk𝕂=(λk𝕂)1/2ϕk𝕂.\Theta^{\mathbb{K}}_{k}=(\lambda^{\mathbb{K}}_{k})^{-1/2}\phi^{\mathbb{K}}_{k}. Then {Θk𝕂}k=1\{\Theta^{\mathbb{K}}_{k}\}_{k=1}^{\infty} are the orthonormal basis functions in (𝕂){\mathcal{H}}(\mathbb{K}) as we have that

\langle\Theta_{k_{1}}^{\mathbb{K}},\Theta_{k_{2}}^{\mathbb{K}}\rangle_{{\mathcal{H}}(\mathbb{K})}=\begin{cases}1,&\text{if }{k_{1}}={k_{2}};\\ 0,&\text{if }{k_{1}}\not={k_{2}},\end{cases}

and that

\|f\|_{{\mathcal{H}}(\mathbb{K})}^{2}=\sum_{k=1}^{\infty}(\lambda_{k}^{\mathbb{K}})^{-1}\langle f,\phi^{\mathbb{K}}_{k}\rangle^{2}=\sum_{k=1}^{\infty}\langle f,\Theta_{k}^{\mathbb{K}}\rangle^{2}.
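As a small illustration of this norm, the sketch below computes \|f\|_{{\mathcal{H}}(\mathbb{K})}^{2} from the coefficients \langle f,\phi^{\mathbb{K}}_{k}\rangle for eigenvalues \lambda^{\mathbb{K}}_{k}=k^{-2\alpha}, using a cosine basis as one convenient choice of \{\phi^{\mathbb{K}}_{k}\}; the truncation level, the value of \alpha, and the test function are illustrative assumptions of ours.

```python
import numpy as np

alpha, K = 2, 50                          # smoothness index and truncation level (illustrative)
lam = np.arange(1, K + 1, dtype=float) ** (-2 * alpha)   # lambda_k = k^{-2 alpha}

# Gauss-Legendre quadrature nodes/weights mapped to [0, 1].
t, w = np.polynomial.legendre.leggauss(400)
x, w = 0.5 * (t + 1.0), 0.5 * w

# One orthonormal basis of L2([0,1]): phi_1 = 1, phi_k = sqrt(2) cos((k-1) pi x) for k >= 2.
Phi = np.vstack([np.ones_like(x)] + [np.sqrt(2.0) * np.cos(k * np.pi * x) for k in range(1, K)])

f = (x * (1.0 - x)) ** 2                  # a smooth test function with f'(0) = f'(1) = 0
coef = Phi @ (w * f)                      # <f, phi_k> computed by quadrature

rkhs_norm_sq = np.sum(coef**2 / lam)      # ||f||_{H(K)}^2 = sum_k lambda_k^{-1} <f, phi_k>^2
l2_norm_sq = np.sum(coef**2)              # truncated ||f||_{L2([0,1])}^2, for comparison
print(rkhs_norm_sq, l2_norm_sq)
```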

Define the tensor product space

\underbrace{{\mathcal{H}}(\mathbb{K})\otimes\cdots\otimes{\mathcal{H}}(\mathbb{K})}_{d\text{ copies}}=\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}.

The induced Frobenius norm in {(𝕂)}d\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d} is

A{(𝕂)}d2=k1,,kd=1(A[Θk1𝕂,,Θkd𝕂])2=k1,,kd=1(λk1𝕂λkd𝕂)1(A[ϕk1𝕂,,ϕkd𝕂])2,\displaystyle\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}^{2}=\sum_{k_{1},\ldots,k_{d}=1}^{\infty}\big{(}A[\Theta^{\mathbb{K}}_{k_{1}},\ldots,\Theta_{k_{d}}^{\mathbb{K}}]\big{)}^{2}=\sum_{k_{1},\ldots,k_{d}=1}^{\infty}(\lambda_{k_{1}}^{\mathbb{K}}\ldots\lambda_{k_{d}}^{\mathbb{K}})^{-1}\big{(}A[\phi^{\mathbb{K}}_{k_{1}},\ldots,\phi^{\mathbb{K}}_{k_{d}}]\big{)}^{2}, (44)

where A[ϕk1𝕂,,ϕkd𝕂]A[\phi^{\mathbb{K}}_{k_{1}},\ldots,\phi_{k_{d}}^{\mathbb{K}}] is defined by (4). The following lemma shows that the space {(𝕂)}d\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d} is naturally motivated by multidimensional Sobolev spaces.

Lemma 1.

Let 𝒪=[0,1]\mathcal{O}=[0,1]. With λk𝕂k2α\lambda^{\mathbb{K}}_{k}\asymp k^{-2\alpha} and suitable choices of {ϕk𝕂}k=1\{\phi^{\mathbb{K}}_{k}\}_{k=1}^{\infty}, it holds that

{(𝕂)}d=W2α([0,1]d).\{\mathcal{H}(\mathbb{K})\}^{\otimes d}=W^{\alpha}_{2}([0,1]^{d}).
Proof.

Let 𝒪=[0,1]\mathcal{O}=[0,1]. When d=1d=1, it is a classical Sobolev space result that with λk𝕂k2α\lambda^{\mathbb{K}}_{k}\asymp k^{-2\alpha} and suitable choices of {ϕk𝕂}k=1\{\phi^{\mathbb{K}}_{k}\}_{k=1}^{\infty},

(𝕂)=W2α([0,1]).\mathcal{H}(\mathbb{K})=W^{\alpha}_{2}([0,1]).

We refer interested readers to Chapter 12 of Wainwright (2019) for more details. In general, it is well known in functional analysis that for \Omega_{1}\subset\mathbb{R}^{d_{1}} and \Omega_{2}\subset\mathbb{R}^{d_{2}},

W2α(Ω1)W2α(Ω2)=W2α(Ω1×Ω2).W^{\alpha}_{2}(\Omega_{1})\otimes W^{\alpha}_{2}(\Omega_{2})=W^{\alpha}_{2}(\Omega_{1}\times\Omega_{2}).

Therefore by induction

{(𝕂)}d={(𝕂)}d1{(𝕂)}=W2α([0,1]d1)W2α([0,1])=W2α([0,1]d).\displaystyle\{\mathcal{H}(\mathbb{K})\}^{\otimes d}=\{\mathcal{H}(\mathbb{K})\}^{\otimes d-1}\otimes\{\mathcal{H}(\mathbb{K})\}=W^{\alpha}_{2}([0,1]^{d-1})\otimes W^{\alpha}_{2}([0,1])=W^{\alpha}_{2}([0,1]^{d}). (45)

Let (z1,,zd1,zd1+1,,zd)𝒪d.(z_{1},\ldots,z_{d_{1}},z_{d_{1}+1},\ldots,z_{d})\in\mathcal{O}^{d}. In what follows, we show 2 holds when

=span{ϕμ1𝕂(z1)ϕμd1𝕂(zd1)}μ1,,μd1=1mand=span{ϕη1𝕂(zd1+1)ϕηd2𝕂(zd)}η1,,ηd2=1.\displaystyle\mathcal{M}=\text{span}\bigg{\{}\phi^{\mathbb{K}}_{{\mu}_{1}}(z_{1})\cdots\phi^{\mathbb{K}}_{{\mu}_{d_{1}}}(z_{d_{1}})\bigg{\}}_{{\mu}_{1},\ldots,{\mu}_{d_{1}}=1}^{m}\quad\text{and}\quad\mathcal{L}=\text{span}\bigg{\{}\phi^{\mathbb{K}}_{{\eta}_{1}}(z_{d_{1}+1})\cdots\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}(z_{d})\bigg{\}}_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\ell}.
Lemma 2.

Let 𝕂\mathbb{K} be a kernel in the form of (42). Suppose that λk𝕂k2α\lambda^{\mathbb{K}}_{k}\asymp k^{-2\alpha}, and that A:𝒪dA:\mathcal{O}^{d}\to\mathbb{R} is such that A{(𝕂)}d<\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}<\infty. Then for any two positive integers m,+m,\ell\in\mathbb{Z}^{+}, it holds that

AA×x𝒫×y𝒫𝐋2(𝒪d)2C(d1m2α+d22α)A{(𝕂)}d2,\displaystyle\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{{\bf L}_{2}}(\mathcal{O}^{d})}\leq C(d_{1}m^{-2\alpha}+d_{2}\ell^{-2\alpha})\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}^{2}, (46)

where CC is some absolute constant. Consequently

AA×y𝒫𝐋2(𝒪d)2Cd22αA{(𝕂)}d2and\displaystyle\|A-A\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{{\bf L}_{2}}(\mathcal{O}^{d})}\leq Cd_{2}\ell^{-2\alpha}\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}^{2}\quad\text{and} (47)
A×y𝒫A×x𝒫×y𝒫𝐋2(𝒪d)2Cd1m2αA{(𝕂)}d2.\displaystyle\|A\times_{y}\mathcal{P}_{\mathcal{L}}-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{{\bf L}_{2}}(\mathcal{O}^{d})}\leq Cd_{1}m^{-2\alpha}\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}^{2}. (48)
Proof.

Since λk𝕂k2α\lambda^{\mathbb{K}}_{k}\asymp k^{-2\alpha}, without loss of generality, throughout the proof we assume that

λk𝕂=k2α,\lambda^{\mathbb{K}}_{k}=k^{-2\alpha},

as otherwise all of our analysis still holds up to an absolute constant. Observe that

\left(A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\right)[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]
=\begin{cases}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}],&\text{if }1\leq{\mu}_{1},\ldots,{\mu}_{d_{1}}\leq m\text{ and }1\leq{\eta}_{1},\ldots,{\eta}_{d_{2}}\leq\ell,\\ 0,&\text{otherwise.}\end{cases}

Then

AA×x𝒫×y𝒫𝐋2(𝒪d)2=\displaystyle\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{{\bf L}_{2}}(\mathcal{O}^{d})}= μ1=m+1μ2,,μd1=1mη1,,ηd2=1(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2\displaystyle\sum_{{\mu}_{1}=m+1}^{\infty}\sum_{{\mu}_{2},\ldots,{\mu}_{d_{1}}=1}^{m}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\ell}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}
+\displaystyle+ \displaystyle\ldots
+\displaystyle+ μ1,,μd1=1mη1,,ηd21=1ηd2=+1(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2.\displaystyle\sum_{{\mu}_{1},\ldots,{\mu}_{d_{1}}=1}^{m}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}-1}=1}^{\ell}\sum_{{\eta}_{d_{2}}=\ell+1}^{\infty}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}.

Observe that

μ1=m+1μ2,,μd1=1mη1,,ηd2=1(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2\displaystyle\sum_{{\mu}_{1}=m+1}^{\infty}\sum_{{\mu}_{2},\ldots,{\mu}_{d_{1}}=1}^{m}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\ell}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}
\displaystyle\leq μ1=m+1m2αμ12αμ2,,μd1=1mη1,,ηd2=1(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2\displaystyle\sum_{{\mu}_{1}=m+1}^{\infty}m^{-2\alpha}{\mu}_{1}^{2\alpha}\sum_{{\mu}_{2},\ldots,{\mu}_{d_{1}}=1}^{m}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\ell}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}
\displaystyle\leq m2αμ1=m+1μ2,,μd1=1mη1,,ηd2=1μ12αμd12αη12αηd22α(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2\displaystyle m^{-2\alpha}\sum_{{\mu}_{1}=m+1}^{\infty}\sum_{{\mu}_{2},\ldots,{\mu}_{d_{1}}=1}^{m}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\ell}{\mu}_{1}^{2\alpha}\cdots{\mu}_{d_{1}}^{2\alpha}{\eta}_{1}^{2\alpha}\cdots{\eta}_{d_{2}}^{2\alpha}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}
\displaystyle\leq m2αμ1=1μ2,,μd1=1η1,,ηd2=1μ12αμd12αη12αηd22α(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2\displaystyle m^{-2\alpha}\sum_{{\mu}_{1}=1}^{\infty}\sum_{{\mu}_{2},\ldots,{\mu}_{d_{1}}=1}^{\infty}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\infty}{\mu}_{1}^{2\alpha}\cdots{\mu}_{d_{1}}^{2\alpha}{\eta}_{1}^{2\alpha}\cdots{\eta}_{d_{2}}^{2\alpha}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}
=\displaystyle= m2αA{(𝕂)}d2\displaystyle m^{-2\alpha}\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}^{2}

where the first inequality holds because μ1m+1m{\mu}_{1}\geq m+1\geq m and the last equality follows from (44). Similarly

μ1,,μd1=1mη1,,ηd21=1ηd2=+1(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2\displaystyle\sum_{{\mu}_{1},\ldots,{\mu}_{d_{1}}=1}^{m}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}-1}=1}^{\ell}\sum_{{\eta}_{d_{2}}=\ell+1}^{\infty}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}
\displaystyle\leq μ1,,μd1=1mη1,,ηd21=1ηd2=+12αηd22α(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2\displaystyle\sum_{{\mu}_{1},\ldots,{\mu}_{d_{1}}=1}^{m}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}-1}=1}^{\ell}\sum_{{\eta}_{d_{2}}=\ell+1}^{\infty}\ell^{-2\alpha}{\eta}_{d_{2}}^{2\alpha}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}
\displaystyle\leq 2αμ1,,μd1=1mη1,,ηd21=1ηd2=+1μ12αμd12αη12αηd22α(A[ϕμ1𝕂,,ϕμd1𝕂,ϕη1𝕂,,ϕηd2𝕂])2\displaystyle\ell^{-2\alpha}\sum_{{\mu}_{1},\ldots,{\mu}_{d_{1}}=1}^{m}\sum_{{\eta}_{1},\ldots,{\eta}_{d_{2}-1}=1}^{\ell}\sum_{{\eta}_{d_{2}}=\ell+1}^{\infty}{\mu}_{1}^{2\alpha}\cdots{\mu}_{d_{1}}^{2\alpha}{\eta}_{1}^{2\alpha}\cdots{\eta}_{d_{2}}^{2\alpha}\big{(}A[\phi^{\mathbb{K}}_{{\mu}_{1}},\ldots,\phi^{\mathbb{K}}_{{\mu}_{d_{1}}},\phi^{\mathbb{K}}_{{\eta}_{1}},\ldots,\phi^{\mathbb{K}}_{{\eta}_{d_{2}}}]\big{)}^{2}
\displaystyle\leq 2αA{(𝕂)}d2,\displaystyle\ell^{-2\alpha}\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}^{2},

where the first inequality holds because ηd2+1{\eta}_{d_{2}}\geq\ell+1\geq\ell and the last inequality follows from (44). Thus (46) follows immediately.

For (47), note that when m=m=\infty, =𝐋2(𝒪d1)\mathcal{M}={{\bf L}_{2}}(\mathcal{O}^{d_{1}}). In this case 𝒫\mathcal{P}_{\mathcal{M}} becomes the identity operator and

A×x𝒫×y𝒫=A×y𝒫.A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}=A\times_{y}\mathcal{P}_{\mathcal{L}}.

Therefore (47) follows from (46) by taking m=m=\infty.

For (48), similar to (47), we have that

AA×x𝒫𝐋2(𝒪d)2Cd1m2αA{(𝕂)}d2.\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\|^{2}_{{{\bf L}_{2}}(\mathcal{O}^{d})}\leq Cd_{1}m^{-2\alpha}\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}^{2}.

It follows that

A×y𝒫A×x𝒫×y𝒫𝐋2(𝒪d)2AA×x𝒫𝐋2(𝒪d)2𝒫op2Cd1m2αA{(𝕂)}d2,\|A\times_{y}\mathcal{P}_{\mathcal{L}}-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{{\bf L}_{2}}(\mathcal{O}^{d})}\leq\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\|^{2}_{{{\bf L}_{2}}(\mathcal{O}^{d})}\|\mathcal{P}_{\mathcal{L}}\|^{2}_{\operatorname*{op}}\leq Cd_{1}m^{-2\alpha}\|A\|_{\{{\mathcal{H}}(\mathbb{K})\}^{\otimes d}}^{2},

where the last inequality follows from the fact that \|\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}\leq 1. ∎

B.2 Legendre polynomial basis

The Legendre polynomials form a well-known classical orthonormal polynomial system in {{\bf L}_{2}}([-1,1]). We can define them inductively as follows. Let p_{0}=1 and suppose \{p_{k}\}_{k=1}^{n-1} are defined. Let p_{n}:[-1,1]\to\mathbb{R} be a polynomial of degree n such that
\bullet pn𝐋2([1,1])=1\|p_{n}\|_{{{\bf L}_{2}}([-1,1])}=1, and
\bullet 11pn(x)pk(x)dx=0\int_{-1}^{1}p_{n}(x)p_{k}(x)\mathrm{d}x=0 for all 0kn10\leq k\leq n-1.
As a quick example, we have that

p0(x)=1,p1(x)=32x,andp2(x)=533x212.p_{0}(x)=1,\quad p_{1}(x)=\sqrt{\frac{3}{2}}x,\quad\text{and}\quad p_{2}(x)=\sqrt{\frac{5}{3}}\frac{3x^{2}-1}{2}.

Let qk(x)=2pk(2x1)q_{k}(x)=\sqrt{2}p_{k}(2x-1). Then {qk}k=0\{q_{k}\}_{k=0}^{\infty} are the orthonormal polynomial system in 𝐋2([0,1]){{\bf L}_{2}}([0,1]). In this subsection, we show that 2 holds when {ϕk𝕊}k=1\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty} in (3) is chosen to be {qk}k=0\{q_{k}\}_{k=0}^{\infty}. More precisely, let

𝕊n=Span{qk}k=0n\mathbb{S}_{n}=\operatorname*{Span}\{q_{k}\}_{k=0}^{n}

and 𝒫𝕊n\mathcal{P}_{\mathbb{S}_{n}} denote the projection operator from 𝐋2([0,1]){{\bf L}_{2}}([0,1]) to 𝕊n\mathbb{S}_{n}. Then 𝕊n\mathbb{S}_{n} is the subspace of polynomials of degree at most nn. In addition, for any f𝐋2([0,1])f\in{{\bf L}_{2}}([0,1]), 𝒫𝕊n(f)\mathcal{P}_{\mathbb{S}_{n}}(f) is the best nn-degree polynomial approximation of ff in the sense that

𝒫𝕊n(z)(f)f𝐋2([0,1])=ming𝕊ngf𝐋2([0,1]).\displaystyle\|\mathcal{P}_{\mathbb{S}_{n}(z)}(f)-f\|_{{{\bf L}_{2}}([0,1])}=\min_{g\in\mathbb{S}_{n}}\|g-f\|_{{{\bf L}_{2}}([0,1])}. (49)
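The projection \mathcal{P}_{\mathbb{S}_{n}} can be computed explicitly by expanding f in the shifted, normalized Legendre basis; the sketch below does this with the standard Legendre polynomials available in NumPy (rather than the inductive construction above) and illustrates the rapid error decay for a smooth test function of our own choosing.

```python
import numpy as np
from numpy.polynomial import legendre

def shifted_legendre(k, x):
    """Orthonormal Legendre polynomial of degree k on [0, 1]: sqrt(2k+1) * P_k(2x - 1)."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    return np.sqrt(2 * k + 1) * legendre.legval(2.0 * x - 1.0, coef)

# Gauss-Legendre quadrature mapped to [0, 1].
t, w = legendre.leggauss(200)
x, w = 0.5 * (t + 1.0), 0.5 * w

f = np.exp(np.sin(2 * np.pi * x))          # a smooth test function

for n in (2, 4, 8, 16, 32):
    Q = np.vstack([shifted_legendre(k, x) for k in range(n + 1)])
    coef = Q @ (w * f)                     # <f, q_k>
    proj = coef @ Q                        # P_{S_n}(f) evaluated at the quadrature nodes
    err = np.sqrt(np.sum(w * (f - proj) ** 2))
    print(n, err)                          # L2 errors decay rapidly for smooth f
```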

We begin with a well-known polynomial approximation result. For α+\alpha\in\mathbb{Z}^{+}, denote Cα([0,1])C^{\alpha}([0,1]) to be the class of functions that are α\alpha times continuously differentiable.

Theorem 4.

Suppose fCα([0,1])f\in C^{\alpha}([0,1]). Then for any n+n\in\mathbb{Z}^{+}, there exists a polynomial p2n(f)p_{2n}(f) of degree 2n2n such that

fp2n(f)𝐋2([0,1])2Cn2αf(α)𝐋2([0,1])2,\|f-p_{2n}(f)\|_{{{\bf L}_{2}}([0,1])}^{2}\leq Cn^{-2\alpha}\|f^{(\alpha)}\|_{{{\bf L}_{2}}([0,1])}^{2},

where CC is an absolute constant.

Proof.

This is Theorem 1.2 of Xu (2018). ∎

Therefore by (49) and Theorem 4,

𝒫𝕊n(z)(f)f𝐋2([0,1])2fpn/2(f)𝐋2([0,1])2Cn2αf(α)𝐋2([0,1])2.\displaystyle\|\mathcal{P}_{\mathbb{S}_{n}(z)}(f)-f\|^{2}_{{{\bf L}_{2}}([0,1])}\leq\|f-p_{\lfloor n/2\rfloor}(f)\|_{{{\bf L}_{2}}([0,1])}^{2}\leq C^{\prime}n^{-2\alpha}\|f^{(\alpha)}\|_{{{\bf L}_{2}}([0,1])}^{2}. (50)

Let z=(z1,,zd)dz=(z_{1},\ldots,z_{d})\in\mathbb{R}^{d} and let 𝕊n(zj)\mathbb{S}_{n}(z_{j}) denote the linear space spanned by polynomials of zjz_{j} of degree at most nn.

Corollary 2.

Suppose B(z1,,zd)Cα([0,1]d)B(z_{1},\ldots,z_{d})\in C^{\alpha}([0,1]^{d}). Then for any 1pd1\leq p\leq d,

BB×p𝒫𝕊n(zp)𝐋2([0,1]d)CnαBW2α([0,1]d).\|B-B\times_{p}\mathcal{P}_{\mathbb{S}_{n}(z_{p})}\|_{{{\bf L}_{2}}([0,1]^{d})}\leq Cn^{-\alpha}\|B\|_{W^{\alpha}_{2}([0,1]^{d})}.
Proof.

It suffices to consider p=1p=1. For any fixed (z2,,zd)(z_{2},\ldots,z_{d}), B(,z2,,zd)Cα([0,1])B(\cdot,z_{2},\ldots,z_{d})\in C^{\alpha}([0,1]). Therefore by (50),

{B(z1,z2,,zd)𝒫𝕊n(z1)(B(z1,z2,,zd))}2dz1Cn2α{ααz1B(z1,z2,,zd)}2dz1.\int\bigg{\{}B(z_{1},z_{2},\ldots,z_{d})-\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(B(z_{1},z_{2},\ldots,z_{d}))\bigg{\}}^{2}\mathrm{d}z_{1}\leq Cn^{-2\alpha}\int\bigg{\{}\frac{\partial^{\alpha}}{\partial^{\alpha}z_{1}}B(z_{1},z_{2},\ldots,z_{d})\bigg{\}}^{2}\mathrm{d}z_{1}.

Therefore

{B(z1,z2,,zd)𝒫𝕊n(z1)(B(z1,z2,,zd))}2dz1dz2dzd\displaystyle\int\cdots\int\bigg{\{}B(z_{1},z_{2},\ldots,z_{d})-\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(B(z_{1},z_{2},\ldots,z_{d}))\bigg{\}}^{2}\mathrm{d}z_{1}\mathrm{d}z_{2}\cdots\mathrm{d}z_{d}
\displaystyle\leq Cn2α{ααz1B(z1,z2,,zd)}2dz1dz2dzd.\displaystyle\int\cdots\int Cn^{-2\alpha}\int\bigg{\{}\frac{\partial^{\alpha}}{\partial^{\alpha}z_{1}}B(z_{1},z_{2},\ldots,z_{d})\bigg{\}}^{2}\mathrm{d}z_{1}\mathrm{d}z_{2}\cdots\mathrm{d}z_{d}.

The desired result follows from the observation that

𝒫𝕊n(z1)(B(z1,z2,,zd))=(B×1𝒫𝕊n(z1))(z1,z2,,zd)\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(B(z_{1},z_{2},\ldots,z_{d}))=\big{(}B\times_{1}\mathcal{P}_{\mathbb{S}_{n}(z_{1})}\big{)}(z_{1},z_{2},\ldots,z_{d})

and that {ααz1B(z1,z2,,zd)}2dz1dzdBW2α([0,1]d)2\int\cdots\int\big{\{}\frac{\partial^{\alpha}}{\partial^{\alpha}z_{1}}B(z_{1},z_{2},\ldots,z_{d})\big{\}}^{2}\mathrm{d}z_{1}\cdots\mathrm{d}z_{d}\leq\|B\|^{2}_{W^{\alpha}_{2}([0,1]^{d})}. ∎

Lemma 3.

Under the same conditions as in Corollary 2, it holds that

𝒫𝕊n(z1)(B(z1,z2,,zd))=(B×1𝒫𝕊n(z1))(z1,z2,,zd).\displaystyle\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(B(z_{1},z_{2},\ldots,z_{d}))=\big{(}B\times_{1}\mathcal{P}_{\mathbb{S}_{n}(z_{1})}\big{)}(z_{1},z_{2},\ldots,z_{d}). (51)
Proof.

Note that 𝒫𝕊n(z1)\mathcal{P}_{\mathbb{S}_{n}(z_{1})} is a projection operator. So for any f,g𝐋2([0,1])f,g\in{{\bf L}_{2}}([0,1]),

01𝒫𝕊n(z1)(f(z1))g(z1)dz1=𝒫𝕊n(z1)(f),g=f,𝒫𝕊n(z1)(g)=01f(z1)𝒫𝕊n(z1)(g(z1))dz1\displaystyle\int_{0}^{1}\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(f(z_{1}))g(z_{1})\mathrm{d}z_{1}=\langle\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(f),g\rangle=\langle f,\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(g)\rangle=\int_{0}^{1}f(z_{1})\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(g(z_{1}))\mathrm{d}z_{1} (52)

Given (z2,,zd)(z_{2},\ldots,z_{d}), B(,z2,,zd)Cα([0,1])B(\cdot,z_{2},\ldots,z_{d})\in C^{\alpha}([0,1]) and therefore 𝒫𝕊n(z1)(B(z1,z2,,zd))\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(B(z_{1},z_{2},\ldots,z_{d})) is well-defined and is a function mapping from [0,1]d[0,1]^{d} to \mathbb{R}. To show that (51), it suffices to observe that for any test functions {uj(zj)}j=1d𝐋2([0,1])\{u_{j}(z_{j})\}_{j=1}^{d}\in{{\bf L}_{2}}([0,1]),

𝒫𝕊n(z1)(B(z1,z2,,zd))[u1(z1),,ud(zd)]\displaystyle\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(B(z_{1},z_{2},\ldots,z_{d}))[u_{1}(z_{1}),\ldots,u_{d}(z_{d})]
=\displaystyle= 0101𝒫𝕊n(z1)(B(z1,z2,,zd))u1(z1)ud(zd)dz1dzd\displaystyle\int_{0}^{1}\cdots\int_{0}^{1}\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(B(z_{1},z_{2},\ldots,z_{d}))u_{1}(z_{1})\cdots u_{d}(z_{d})\mathrm{d}z_{1}\cdots\mathrm{d}z_{d}
=\displaystyle= 0101(01𝒫𝕊n(z1)(B(z1,z2,,zd))u1(z1)dz1)u2(z2)ud(zd)dz2dzd\displaystyle\int_{0}^{1}\cdots\int_{0}^{1}\bigg{(}\int_{0}^{1}\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(B(z_{1},z_{2},\ldots,z_{d}))u_{1}(z_{1})\mathrm{d}z_{1}\bigg{)}u_{2}(z_{2})\cdots u_{d}(z_{d})\mathrm{d}z_{2}\cdots\mathrm{d}z_{d}
=\displaystyle= 0101(01B(z1,z2,,zd)𝒫𝕊n(z1)(u1(z1))dz1)u2(z2)ud(zd)dz2dzd\displaystyle\int_{0}^{1}\cdots\int_{0}^{1}\bigg{(}\int_{0}^{1}B(z_{1},z_{2},\ldots,z_{d})\mathcal{P}_{\mathbb{S}_{n}(z_{1})}(u_{1}(z_{1}))\mathrm{d}z_{1}\bigg{)}u_{2}(z_{2})\cdots u_{d}(z_{d})\mathrm{d}z_{2}\cdots\mathrm{d}z_{d}
=\displaystyle= (B×1𝒫𝕊n(z1))[u1(z1),,ud(zd)].\displaystyle\big{(}B\times_{1}\mathcal{P}_{\mathbb{S}_{n}(z_{1})}\big{)}[u_{1}(z_{1}),\ldots,u_{d}(z_{d})].

In what follows, we present a polynomial approximation theory in multidimensions.

Lemma 4.

For j[1,,d]j\in[1,\ldots,d], let 𝕊nj(zj)\mathbb{S}_{n_{j}}(z_{j}) denote the linear space spanned by polynomials of zjz_{j} of degree njn_{j} and let 𝒫𝕊nj(zj)\mathcal{P}_{\mathbb{S}_{n_{j}}(z_{j})} be the corresponding projection operator. Then for any BW2α([0,1]d)B\in W^{\alpha}_{2}([0,1]^{d}), it holds that

BB×1𝒫𝕊n1(z1)×d𝒫𝕊nd(zd)𝐋2([0,1]d)Cj=1dnjαBW2α([0,1]d).\displaystyle\|B-B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\cdots\times_{d}\mathcal{P}_{\mathbb{S}_{n_{d}}(z_{d})}\|_{{{\bf L}_{2}}([0,1]^{d})}\leq C\sum_{j=1}^{d}n_{j}^{-\alpha}\|B\|_{W^{\alpha}_{2}([0,1]^{d})}. (53)
Proof.

Since Cα([0,1]d)C^{\alpha}([0,1]^{d}) is dense in W2α([0,1]d)W^{\alpha}_{2}([0,1]^{d}), it suffices to show (53) for all fCα([0,1]d)f\in C^{\alpha}([0,1]^{d}). We proceed by induction. The base case

BB×1𝒫𝕊n1(z1)𝐋2([0,1]d)Cn1αBW2α([0,1]d)\displaystyle\|B-B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\|_{{{\bf L}_{2}}([0,1]^{d})}\leq Cn_{1}^{-\alpha}\|B\|_{W^{\alpha}_{2}([0,1]^{d})}

is a direct consequence of Corollary 2. Suppose by induction, the following inequality holds for pp,

BB×1𝒫𝕊n1(z1)×p𝒫𝕊np(zp)𝐋2([0,1]d)Cj=1pnjαBW2α([0,1]d).\displaystyle\|B-B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\cdots\times_{p}\mathcal{P}_{\mathbb{S}_{n_{p}}(z_{p})}\|_{{{\bf L}_{2}}([0,1]^{d})}\leq C\sum_{j=1}^{p}n_{j}^{-\alpha}\|B\|_{W^{\alpha}_{2}([0,1]^{d})}. (54)

Then

BB×1𝒫𝕊n1(z1)×p𝒫𝕊np(zp)×p+1𝒫𝕊np+1(zp+1)𝐋2([0,1]d)\displaystyle\|B-B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\cdots\times_{p}\mathcal{P}_{\mathbb{S}_{n_{p}}(z_{p})}\times_{p+1}\mathcal{P}_{\mathbb{S}_{n_{p+1}}(z_{p+1})}\|_{{{\bf L}_{2}}([0,1]^{d})}
\displaystyle\leq B×1𝒫𝕊n1(z1)×p𝒫𝕊np(zp)×p+1𝒫𝕊np+1(zp+1)B×1𝒫𝕊n1(z1)×p𝒫𝕊np(zp)𝐋2([0,1]d)\displaystyle\|B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\cdots\times_{p}\mathcal{P}_{\mathbb{S}_{n_{p}}(z_{p})}\times_{p+1}\mathcal{P}_{\mathbb{S}_{n_{p+1}}(z_{p+1})}-B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\cdots\times_{p}\mathcal{P}_{\mathbb{S}_{n_{p}}(z_{p})}\|_{{{\bf L}_{2}}([0,1]^{d})}
+\displaystyle+ BB×1𝒫𝕊n1(z1)×p𝒫𝕊np(zp)𝐋2([0,1]d).\displaystyle\|B-B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\cdots\times_{p}\mathcal{P}_{\mathbb{S}_{n_{p}}(z_{p})}\|_{{{\bf L}_{2}}([0,1]^{d})}.

The desired result follows from (54) and the observation that 𝒫𝕊nj(zj)op1\|\mathcal{P}_{\mathbb{S}_{n_{j}}(z_{j})}\|_{\operatorname*{op}}\leq 1 for all jj, and therefore

B×1𝒫𝕊n1(z1)×p𝒫𝕊np(zp)×p+1𝒫𝕊np+1(zp+1)B×1𝒫𝕊n1(z1)×p𝒫𝕊np(zp)𝐋2([0,1]d)\displaystyle\|B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\cdots\times_{p}\mathcal{P}_{\mathbb{S}_{n_{p}}(z_{p})}\times_{p+1}\mathcal{P}_{\mathbb{S}_{n_{p+1}}(z_{p+1})}-B\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\cdots\times_{p}\mathcal{P}_{\mathbb{S}_{n_{p}}(z_{p})}\|_{{{\bf L}_{2}}([0,1]^{d})}
\displaystyle\leq B×p+1𝒫𝕊np+1(zp+1)B𝐋2([0,1]d)𝒫𝕊n1(z1)op𝒫𝕊np(zp)op\displaystyle\|B\times_{p+1}\mathcal{P}_{\mathbb{S}_{n_{p+1}}(z_{p+1})}-B\|_{{{\bf L}_{2}}([0,1]^{d})}\|\mathcal{P}_{\mathbb{S}_{n_{1}}(z_{1})}\|_{\operatorname*{op}}\cdots\|\mathcal{P}_{\mathbb{S}_{n_{p}}(z_{p})}\|_{\operatorname*{op}}
\displaystyle\leq Cnp+1αBW2α([0,1]d),\displaystyle Cn^{-\alpha}_{p+1}\|B\|_{W^{\alpha}_{2}([0,1]^{d})},

where the last inequality follows from Corollary 2. ∎

Note that Lemma 4 directly implies that 2 holds when

=𝕊m(z1)𝕊m(zd1)and=𝕊(zd1+1)𝕊(zd).\displaystyle\mathcal{M}=\mathbb{S}_{m}(z_{1})\otimes\cdots\otimes\mathbb{S}_{m}(z_{d_{1}})\quad\text{and}\quad\mathcal{L}=\mathbb{S}_{\ell}(z_{d_{1}+1})\otimes\cdots\otimes\mathbb{S}_{\ell}(z_{d}).

This is summarized in the following lemma.

Lemma 5.

Suppose AW2α([0,1]d)<\|A\|_{W^{\alpha}_{2}([0,1]^{d})}<\infty. Then for 1m1\leq m\leq\infty and 11\leq\ell\leq\infty,

AA×x𝒫×y𝒫2([0,1]d)2=O(m2α+2α)\displaystyle\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{\mathcal{L}_{2}([0,1]^{d})}=\operatorname*{O}(m^{-2\alpha}+\ell^{-2\alpha}) (55)
AA×y𝒫2([0,1]d)2=O(2α)and\displaystyle\|A-A\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{\mathcal{L}_{2}([0,1]^{d})}=\operatorname*{O}(\ell^{-2\alpha})\quad\text{and} (56)
A×y𝒫A×x𝒫×y𝒫2([0,1]d)2=O(m2α).\displaystyle\|A\times_{y}\mathcal{P}_{\mathcal{L}}-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{\mathcal{L}_{2}([0,1]^{d})}=\operatorname*{O}(m^{-2\alpha}). (57)
Proof.

For (55), by Lemma 4,

\displaystyle\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\mathcal{L}_{2}([0,1]^{d})}\leq C\big{(}d_{1}m^{-\alpha}+d_{2}\ell^{-\alpha}\big{)}\|A\|_{W^{\alpha}_{2}([0,1]^{d})}
\displaystyle=\operatorname*{O}(m^{-\alpha}+\ell^{-\alpha}),

where the equality follows from the fact that α\alpha, d1d_{1} and d2d_{2} are constants.

For (56), note that when m=m=\infty, =𝐋2([0,1]d1)\mathcal{M}={{\bf L}_{2}}([0,1]^{d_{1}}). In this case 𝒫\mathcal{P}_{\mathcal{M}} becomes the identity operator and

A×x𝒫×y𝒫=A×y𝒫.A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}=A\times_{y}\mathcal{P}_{\mathcal{L}}.

Therefore (56) follows from (55) by taking m=m=\infty.

For (57), similar to (56), we have that

AA×x𝒫𝐋2([0,1]d)2=O(m2α).\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}=\operatorname*{O}(m^{-2\alpha}).

It follows that

A×y𝒫A×x𝒫×y𝒫𝐋2([0,1]d)2AA×x𝒫𝐋2([0,1]d)2𝒫op2=O(m2α),\|A\times_{y}\mathcal{P}_{\mathcal{L}}-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}\\ \leq\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}\|\mathcal{P}_{\mathcal{L}}\|^{2}_{\operatorname*{op}}=\operatorname*{O}(m^{-2\alpha}),

where the last inequality follows from the fact that \|\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}\leq 1. ∎

B.3 Spline basis

Let α+\alpha\in\mathbb{Z}^{+} be given and {ξk}k=1n[0,1]\{\xi_{k}\}_{k=1}^{n}\subset[0,1] be a collection of grid points such that

ξk=kn.\xi_{k}=\frac{k}{n}.

Denote by \mathbb{S}_{n,\alpha}(x) the subspace of {{\bf L}_{2}}([0,1]) spanned by splines, i.e., piecewise polynomials of degree \alpha defined with respect to the knots \{\xi_{k}\}_{k=1}^{n}. Specifically,

𝕊n,α(x)={β0+β1b1(x)++βα+nbα+n(x):{βk}k=0α+n},\mathbb{S}_{n,\alpha}(x)=\{\beta_{0}+\beta_{1}b_{1}(x)+\ldots+\beta_{\alpha+n}b_{\alpha+n}(x):\{\beta_{k}\}_{k=0}^{\alpha+n}\subset\mathbb{R}\},

where

b1(x)=x1,,bα(x)=xα,bk+α(x)=(xξk)+α for k=1,,n,b_{1}(x)=x^{1},\ \ldots,\ b_{\alpha}(x)=x^{\alpha},\ b_{k+\alpha}(x)=(x-\xi_{k})_{+}^{\alpha}\text{ for }k=1,\ldots,n,

and

(xξk)+α={(xξk)α,if xξk;0,if x<ξk.(x-\xi_{k})_{+}^{\alpha}=\begin{cases}(x-\xi_{k})^{\alpha},&\text{if }x\geq\xi_{k};\\ 0,&\text{if }x<\xi_{k}.\end{cases}

Let \{\phi^{n,\alpha}_{k}(x)\}_{k=1}^{n+\alpha+1} be a set of {{\bf L}_{2}}([0,1]) basis functions spanning \mathbb{S}_{n,\alpha}(x). In this section, we show that Assumption 2 holds when

\displaystyle\mathcal{M} =𝕊mα1,α(z1)𝕊mα1,α(zd1)=span{ϕμ1mα1,α(z1)ϕμd1mα1,α(zd1)}μ1,,μd1=1m,and\displaystyle=\mathbb{S}_{m-\alpha-1,\alpha}(z_{1})\otimes\cdots\otimes\mathbb{S}_{m-\alpha-1,\alpha}(z_{d_{1}})=\text{span}\bigg{\{}\phi^{m-\alpha-1,\alpha}_{{\mu}_{1}}(z_{1})\cdots\phi^{m-\alpha-1,\alpha}_{{\mu}_{d_{1}}}(z_{d_{1}})\bigg{\}}_{{\mu}_{1},\ldots,{\mu}_{d_{1}}=1}^{m},\quad\text{and}
\displaystyle\mathcal{L} =𝕊α1,α(zd1+1)𝕊α1,α(zd)=span{ϕη1α1,α(zd1+1)ϕηd2α1,α(zd)}η1,,ηd2=1.\displaystyle=\mathbb{S}_{\ell-\alpha-1,\alpha}(z_{d_{1}+1})\otimes\cdots\otimes\mathbb{S}_{\ell-\alpha-1,\alpha}(z_{d})=\text{span}\bigg{\{}\phi^{\ell-\alpha-1,\alpha}_{{\eta}_{1}}(z_{d_{1}+1})\cdots\phi^{\ell-\alpha-1,\alpha}_{{\eta}_{d_{2}}}(z_{d})\bigg{\}}_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\ell}.

where mm and \ell are positive integers such that m>α+1m>\alpha+1 and >α+1\ell>\alpha+1. We begin with a spline space approximation theorem for multivariate functions.

Lemma 6.

Suppose \|A\|_{W^{\alpha}_{2}([0,1]^{d})}<\infty. Suppose in addition that \{n_{j}\}_{j=1}^{d} is a collection of positive integers, each strictly greater than \alpha+1. Then

\|A-A\times_{1}\mathcal{P}_{\mathbb{S}_{n_{1}-\alpha-1,\alpha}(z_{1})}\cdots\times_{d}\mathcal{P}_{\mathbb{S}_{n_{d}-\alpha-1,\alpha}(z_{d})}\|_{{{\bf L}_{2}}([0,1]^{d})}\leq C\sum_{j=1}^{d}(n_{j}-\alpha-1)^{-\alpha}\|A\|_{W^{\alpha}_{2}([0,1]^{d})}
Proof.

This is Example 13 on page 26 of Sande et al. (2020). ∎

In the following lemma, we justify Assumption 2 when \mathcal{M}=\mathbb{S}_{m-\alpha-1,\alpha}(z_{1})\otimes\cdots\otimes\mathbb{S}_{m-\alpha-1,\alpha}(z_{d_{1}}) and \mathcal{L}=\mathbb{S}_{\ell-\alpha-1,\alpha}(z_{d_{1}+1})\otimes\cdots\otimes\mathbb{S}_{\ell-\alpha-1,\alpha}(z_{d}).

Lemma 7.

Suppose AW2α([0,1]d)<\|A\|_{W^{\alpha}_{2}([0,1]^{d})}<\infty where α+\alpha\in\mathbb{Z}^{+} is a fixed constant. Then for 1+αm1+\alpha\leq m\leq\infty and 1+α1+\alpha\leq\ell\leq\infty,

AA×x𝒫×y𝒫2([0,1]d)2=O(m2α+2α)\displaystyle\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{\mathcal{L}_{2}([0,1]^{d})}=\operatorname*{O}(m^{-2\alpha}+\ell^{-2\alpha}) (58)
AA×y𝒫2([0,1]d)2=O(2α)and\displaystyle\|A-A\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{\mathcal{L}_{2}([0,1]^{d})}=\operatorname*{O}(\ell^{-2\alpha})\quad\text{and} (59)
A×y𝒫A×x𝒫×y𝒫2([0,1]d)2=O(m2α).\displaystyle\|A\times_{y}\mathcal{P}_{\mathcal{L}}-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{\mathcal{L}_{2}([0,1]^{d})}=\operatorname*{O}(m^{-2\alpha}). (60)
Proof.

For (58), by Lemma 6,

AA×x𝒫×y𝒫2([0,1]d)\displaystyle\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\mathcal{L}_{2}([0,1]^{d})}\leq C(d1(mα1)α+d2(α1)α)AW2α([0,1]d)\displaystyle C\big{(}d_{1}(m-\alpha-1)^{-\alpha}+d_{2}(\ell-\alpha-1)^{-\alpha}\big{)}\|A\|_{W^{\alpha}_{2}([0,1]^{d})}
=\displaystyle= O(mα+α),\displaystyle\operatorname*{O}(m^{-\alpha}+\ell^{-\alpha}),

where the equality follows from the fact that α\alpha, d1d_{1} and d2d_{2} are constants.

For (59), note that when m=m=\infty, =𝐋2([0,1]d1)\mathcal{M}={{\bf L}_{2}}([0,1]^{d_{1}}). In this case 𝒫\mathcal{P}_{\mathcal{M}} becomes the identity operator and

A×x𝒫×y𝒫=A×y𝒫.A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}=A\times_{y}\mathcal{P}_{\mathcal{L}}.

Therefore (59) follows from (58) by taking m=m=\infty.

For (60), similar to (59), we have that

AA×x𝒫𝐋2([0,1]d)2=O(m2α).\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}=\operatorname*{O}(m^{-2\alpha}).

It follows that

A×y𝒫A×x𝒫×y𝒫𝐋2([0,1]d)2AA×x𝒫𝐋2([0,1]d)2𝒫op2=O(m2α),\|A\times_{y}\mathcal{P}_{\mathcal{L}}-A\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}\\ \leq\|A-A\times_{x}\mathcal{P}_{\mathcal{M}}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}\|\mathcal{P}_{\mathcal{L}}\|^{2}_{\operatorname*{op}}=\operatorname*{O}(m^{-2\alpha}),

where the last inequality follows from the fact that \|\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}\leq 1. ∎

Appendix C Proofs of the main results

C.1 Proofs related to Theorem 1

Proof of Theorem 1.

By Lemma 8, \mathcal{P}^{*}_{x} is the projection operator onto \operatorname*{Range}_{x}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}). By Corollary 3,

(A^A)×x𝒫×y𝒫op=O(md1+d2N).\displaystyle\|(\widehat{A}-A^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\sqrt{\frac{m^{d_{1}}+\ell^{d_{2}}}{N}}\bigg{)}. (61)

Suppose that this good event holds. Observe that

A^×x𝒫×y𝒫A×y𝒫op\displaystyle\|\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}
\displaystyle\leq A×x𝒫×y𝒫A×y𝒫op+A^×x𝒫×y𝒫A×x𝒫×y𝒫op\displaystyle\|A^{*}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}+\|\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}-A^{*}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}
\displaystyle\leq O(mα+md1+d2N),\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}m^{-\alpha}+\sqrt{\frac{m^{d_{1}}+\ell^{d_{2}}}{N}}\bigg{)}, (62)

where the last inequality follows from Assumption 2 and (61). In addition, by (64) in Lemma 8, the r-th largest singular value of A^{*}\times_{y}\mathcal{P}_{\mathcal{L}} is bounded below by \sigma_{r}/2.

The rank of A^{*}\times_{y}\mathcal{P}_{\mathcal{L}} is bounded by \dim(\mathcal{L})=\ell^{d_{2}} and is therefore finite. Similarly, \widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}} has finite rank. Corollary 4 in Section F.2 implies that

𝒫^x𝒫xop2A^×x𝒫×y𝒫A×y𝒫opσr/2A^×x𝒫×y𝒫A×y𝒫op.\|\widehat{\mathcal{P}}_{x}-\mathcal{P}^{*}_{x}\|_{\operatorname*{op}}\leq\frac{\sqrt{2}\|\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}}{\sigma_{r}/2-\|\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}}.

By (62), we have that A^×x𝒫×y𝒫A×y𝒫op=O(mα+md1+d2N)\|\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}=\operatorname*{O_{\mathbb{P}}}\bigg{(}m^{-\alpha}+\sqrt{\frac{m^{d_{1}}+\ell^{d_{2}}}{N}}\bigg{)}, and by condition (21) in Theorem 1, we have that σr/2A^×x𝒫×y𝒫A×y𝒫opσr/4\sigma_{r}/2-\|\widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}\geq\sigma_{r}/4. The desired result follows immediately:

𝒫^x𝒫xop2=O{σr2(md1+d2N+m2α)}.\displaystyle\|\widehat{\mathcal{P}}_{x}-\mathcal{P}^{*}_{x}\|_{\operatorname*{op}}^{2}=\operatorname*{O_{\mathbb{P}}}\bigg{\{}\sigma_{r}^{-2}\bigg{(}\frac{m^{d_{1}}+\ell^{d_{2}}}{N}+m^{-2\alpha}\bigg{)}\bigg{\}}. (63)

Lemma 8.

Suppose Assumptions 1 and 2 hold. If \sigma_{r}>C_{L}\ell^{-\alpha} for a sufficiently large constant C_{L}, then

Rangex(A×y𝒫)=Rangex(A).{\operatorname*{Range}}_{x}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}})={\operatorname*{Range}}_{x}(A^{*}).
Proof of Lemma 8.

By Lemma 15 in Appendix F and Assumption 2, the singular values \{\sigma_{\rho}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}})\}_{{\rho}=1}^{\infty} of the operator A^{*}\times_{y}\mathcal{P}_{\mathcal{L}} satisfy

|σρσρ(A×y𝒫)|AA×y𝒫F=O(α) for all 1ρ<.|\sigma_{\rho}-\sigma_{\rho}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}})|\leq\|A^{*}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{F}=\operatorname*{O}(\ell^{-\alpha})\text{ for all $1\leq{\rho}<\infty$}.

As a result, if \sigma_{r}>C_{L}\ell^{-\alpha} for a sufficiently large constant C_{L}, then

σ1(A×y𝒫)σr(A×y𝒫)σrAA×y𝒫F>σr/2.\displaystyle\sigma_{1}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}})\geq\ldots\geq\sigma_{r}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}})\geq\sigma_{r}-\|A^{*}-A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{F}>\sigma_{r}/2. (64)

Since \operatorname*{Range}_{x}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}})\subset\operatorname*{Range}_{x}(A^{*}) by construction, and the leading r singular values of A^{*}\times_{y}\mathcal{P}_{\mathcal{L}} are positive, the dimension of \operatorname*{Range}_{x}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}}) is at least r. Since \operatorname*{Range}_{x}(A^{*}) has dimension r, it follows that \operatorname*{Range}_{x}(A^{*}\times_{y}\mathcal{P}_{\mathcal{L}})=\operatorname*{Range}_{x}(A^{*}). ∎

Lemma 9.

Suppose \widehat{A} and A^{*} are defined as in Theorem 2. Let d_{1},d_{2}\in\mathbb{Z}^{+} be such that d_{1}+d_{2}=d. Let \{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty} be a collection of {{\bf L}_{2}}([0,1]) basis functions such that \|\phi^{\mathbb{S}}_{k}\|_{\infty}\leq C_{\mathbb{S}}. For positive integers m and \ell, denote

=span{ϕμ1𝕊(z1)ϕμd1𝕊(zd1)}μ1,,μd1=1mand=span{ϕη1𝕊(zd1+1)ϕηd2𝕊(zd)}η1,,ηd2=1.\displaystyle\mathcal{M}=\text{span}\bigg{\{}\phi^{\mathbb{S}}_{{\mu}_{1}}(z_{1})\cdots\phi^{\mathbb{S}}_{{\mu}_{d_{1}}}(z_{d_{1}})\bigg{\}}_{{\mu}_{1},\ldots,{\mu}_{d_{1}}=1}^{m}\quad\text{and}\quad\mathcal{L}=\text{span}\bigg{\{}\phi^{\mathbb{S}}_{{\eta}_{1}}(z_{d_{1}+1})\cdots\phi^{\mathbb{S}}_{{\eta}_{d_{2}}}(z_{d})\bigg{\}}_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\ell}. (65)

Then it holds that

(A^A)×x𝒫×y𝒫op=O(md1+d2N+m3d1/2d2/2+md1/23d2/2N).\|(\widehat{A}-A^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}={\operatorname*{O}}_{\mathbb{P}}\bigg{(}\sqrt{\frac{m^{d_{1}}+\ell^{d_{2}}}{N}}+\frac{m^{3d_{1}/2}\ell^{d_{2}/2}+m^{d_{1}/2}\ell^{3d_{2}/2}}{N}\bigg{)}.
Proof.

Denote

x=(z1,,zd1)andy=(zd1+1,,zd).x=(z_{1},\ldots,z_{d_{1}})\quad\text{and}\quad y=(z_{d_{1}+1},\ldots,z_{d}).

For positive integers mm and \ell, by ordering the indexes (μ1,,μd1)({\mu}_{1},\ldots,{\mu}_{d_{1}}) and (η1,,ηd2)({\eta}_{1},\ldots,{\eta}_{d_{2}}) in (65), we can also write

=span{Φμ(x)}μ=1md1and=span{Ψη(y)}η=1d2.\displaystyle\mathcal{M}=\text{span}\{\Phi_{\mu}(x)\}_{{\mu}=1}^{m^{d_{1}}}\quad\text{and}\quad\mathcal{L}=\text{span}\{\Psi_{{\eta}}(y)\}_{{\eta}=1}^{\ell^{d_{2}}}. (66)

Note that \widehat{A}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}} and A^{*}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}} are both zero outside the subspace \mathcal{M}\otimes\mathcal{L}. Recall that \{Z_{i}\}_{i=1}^{N}\subset[0,1]^{d} are the independent samples used to form \widehat{A}. Let \widehat{B} and B^{*} be two matrices in \mathbb{R}^{m^{d_{1}}\times\ell^{d_{2}}} such that

B^μ,η=A^[Φμ,Ψη]=1Ni=1NΦμ(Xi)Ψη(Yi)andBμ,η=A[Φμ,Ψη]=𝔼(A^[Φμ,Ψη]),\displaystyle\widehat{B}_{{\mu},{\eta}}=\widehat{A}[\Phi_{\mu},\Psi_{\eta}]=\frac{1}{N}\sum_{i=1}^{N}\Phi_{\mu}(X_{i})\Psi_{\eta}(Y_{i})\quad\text{and}\quad B_{{\mu},{\eta}}^{*}=A^{*}[\Phi_{\mu},\Psi_{\eta}]={\mathbb{E}}(\widehat{A}[\Phi_{\mu},\Psi_{\eta}]),

where Xi=(Zi,1,,Zi,d1)d1X_{i}=(Z_{i,1},\ldots,Z_{i,d_{1}})\in\mathbb{R}^{d_{1}} and Yi=(Zi,d1+1,,Zi,d)d2Y_{i}=(Z_{i,d_{1}+1},\ldots,Z_{i,d})\in\mathbb{R}^{d_{2}}. Note that

B^Bop=(A^A)×x𝒫×y𝒫op.\displaystyle\|\widehat{B}-B^{*}\|_{\operatorname*{op}}=\|(\widehat{A}-A^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}. (67)

Step 1. Let v=(v1,,vmd1)md1v=(v_{1},\ldots,v_{m^{d_{1}}})\in\mathbb{R}^{m^{d_{1}}} and suppose that v2=1\|v\|_{2}=1. Then by orthonormality of {Φμ(x)}μ=1md1\{\Phi_{{\mu}}(x)\}_{{\mu}=1}^{m^{d_{1}}} in 𝐋2([0,1]d1){{\bf L}_{2}}([0,1]^{d_{1}}) it follows that

μ=1md1vμΦμ𝐋2([0,1]d1)2=1.\bigg{\|}\sum_{{\mu}=1}^{m^{d_{1}}}v_{\mu}\Phi_{\mu}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d_{1}})}^{2}=1.

In addition, since

ϕμ1𝕊(z1)ϕμd1𝕊(zd1)j=1d1ϕμj𝕊C𝕊d1,\|\phi^{\mathbb{S}}_{{\mu}_{1}}(z_{1})\cdots\phi^{\mathbb{S}}_{{\mu}_{d_{1}}}(z_{d_{1}})\|_{\infty}\leq\prod_{j=1}^{d_{1}}\|\phi^{\mathbb{S}}_{{\mu}_{j}}\|_{\infty}\leq C_{\mathbb{S}}^{d_{1}},

it follows that \|\Phi_{\mu}\|_{\infty}\leq C_{\mathbb{S}}^{d_{1}} for all 1\leq{\mu}\leq m^{d_{1}} and

μ=1md1vμΦμμ=1md1vμ2μ=1md1Φμ2=O(md1).\bigg{\|}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}\bigg{\|}_{\infty}\leq\sqrt{\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}^{2}}\sqrt{\sum_{{\mu}=1}^{m^{d_{1}}}\|\Phi_{{\mu}}\|_{\infty}^{2}}=O(\sqrt{m^{d_{1}}}).


Step 2. Let w=(w1,,wd2)d2w=(w_{1},\ldots,w_{\ell^{d_{2}}})\in\mathbb{R}^{\ell^{d_{2}}} and suppose that w2=1\|w\|_{2}=1. Then by orthonormality of {Ψη(y)}η=1d2\{\Psi_{{\eta}}(y)\}_{{\eta}=1}^{\ell^{d_{2}}} in 𝐋2([0,1]d2){{\bf L}_{2}}([0,1]^{d_{2}}),

η=1d2wηΨη𝐋2([0,1]d)2=1.\bigg{\|}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}\bigg{\|}^{2}_{{{\bf L}_{2}}([0,1]^{d})}=1.

In addition, since

ϕη1𝕊(zd1+1)ϕηd2𝕊(zd)j=1d2ϕηj𝕊C𝕊d2,\|\phi^{\mathbb{S}}_{{\eta}_{1}}(z_{d_{1}+1})\cdots\phi^{\mathbb{S}}_{{\eta}_{d_{2}}}(z_{d})\|_{\infty}\leq\prod_{j=1}^{d_{2}}\|\phi^{\mathbb{S}}_{{\eta}_{j}}\|_{\infty}\leq C_{\mathbb{S}}^{d_{2}},

it follows that ΨηC𝕊d2\|\Psi_{{\eta}}\|_{\infty}\leq C_{\mathbb{S}}^{d_{2}}. Therefore

η=1d2wηΨηη=1d2wη2η=1d2Ψη2=O(d2).\bigg{\|}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}\bigg{\|}_{\infty}\leq\sqrt{\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}^{2}}\sqrt{\sum_{{\eta}=1}^{\ell^{d_{2}}}\|\Psi_{{\eta}}\|_{\infty}^{2}}=O(\sqrt{\ell^{d_{2}}}).

Step 3. For fixed v=(v1,,vmd1)v=(v_{1},\ldots,v_{m^{d_{1}}}) and w=(w1,,wd2)w=(w_{1},\ldots,w_{\ell^{d_{2}}}), we bound v(BB^)w.v^{\top}(B^{*}-\widehat{B})w. Let Δi=μ=1md1vμΦμ(Xi)η=1d2wηΨη(Yi)\Delta_{i}=\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(X_{i})\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(Y_{i}). Then

Var(Δi)𝔼(Δi2)={μ=1md1vμΦμ(x)}2{η=1d2wηΨη(y)}2A(x,y)dxdy\displaystyle Var(\Delta_{i})\leq{\mathbb{E}}(\Delta_{i}^{2})=\iint\bigg{\{}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(x)\bigg{\}}^{2}\bigg{\{}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(y)\bigg{\}}^{2}A^{*}(x,y)\mathrm{d}x\mathrm{d}y
\displaystyle\leq A{μ=1md1vμΦμ(x)}2dx{η=1d2wηΨη(y)}2dy\displaystyle\|A^{*}\|_{\infty}\int\bigg{\{}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(x)\bigg{\}}^{2}\mathrm{d}x\int\bigg{\{}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(y)\bigg{\}}^{2}\mathrm{d}y
=\displaystyle= Aμ=1md1vμΦμ𝐋2([0,1]d1)2η=1d2wηΨη𝐋2([0,1]d2)2=A.\displaystyle\|A^{*}\|_{\infty}\bigg{\|}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d_{1}})}^{2}\bigg{\|}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d_{2}})}^{2}=\|A^{*}\|_{\infty}.

where the last equality follows from Step 1 and Step 2. In addition,

|Δi|μ=1md1vμΦμ(Xi)η=1d2wηΨη(Yi)=O(md1d2).|\Delta_{i}|\leq\|\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(X_{i})\|_{\infty}\|\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(Y_{i})\|_{\infty}=O(\sqrt{m^{d_{1}}\ell^{d_{2}}}).

So for given v,wv,w, by Bernstein’s inequality

(|v(BB^)w|t)=(|1Ni=1NΔi𝔼(Δi)|t)2exp(cNt2A+tmd1d2).\displaystyle\mathbb{P}\bigg{(}\bigg{|}v^{\top}(B^{*}-\widehat{B})w\bigg{|}\geq t\bigg{)}=\mathbb{P}\bigg{(}\bigg{|}\frac{1}{N}\sum_{i=1}^{N}\Delta_{i}-{\mathbb{E}}(\Delta_{i})\bigg{|}\geq t\bigg{)}\leq 2\exp\bigg{(}\frac{-cNt^{2}}{\|A^{*}\|_{\infty}+t\sqrt{m^{d_{1}}\ell^{d_{2}}}}\bigg{)}.

Step 4. Let 𝒩(14,md1)\mathcal{N}(\frac{1}{4},m^{d_{1}}) be a 1/41/4 covering net of unit ball in md1\mathbb{R}^{m^{d_{1}}} and 𝒩(14,d2)\mathcal{N}(\frac{1}{4},\ell^{d_{2}}) be a 1/41/4 covering net of unit ball in d2\mathbb{R}^{\ell^{d_{2}}}, then by 4.4.3 on page 90 of Vershynin (2018)

BB^op2supv𝒩(14,md1),w𝒩(14,d2)v(BB^)w.\displaystyle\|B^{*}-\widehat{B}\|_{\operatorname*{op}}\leq 2\sup_{v\in\mathcal{N}(\frac{1}{4},m^{d_{1}}),w\in\mathcal{N}(\frac{1}{4},\ell^{d_{2}})}v^{\top}(B^{*}-\widehat{B})w.

So by union bound and the fact that the size of 𝒩(14,md1)\mathcal{N}(\frac{1}{4},m^{d_{1}}) is bounded by 9md19^{m^{d_{1}}} and the size of 𝒩(14,d2)\mathcal{N}(\frac{1}{4},\ell^{d_{2}}) is bounded by 9d29^{\ell^{d_{2}}},

(BB^opt)\displaystyle\mathbb{P}\bigg{(}\|B^{*}-\widehat{B}\|_{\operatorname*{op}}\geq t\bigg{)}\leq (supv𝒩(14,md1),w𝒩(14,d2)v(BB^)wt/2)\displaystyle\mathbb{P}\bigg{(}\sup_{v\in\mathcal{N}(\frac{1}{4},m^{d_{1}}),w\in\mathcal{N}(\frac{1}{4},\ell^{d_{2}})}v^{\top}(B^{*}-\widehat{B})w\geq t/2\bigg{)}
\displaystyle\leq 29md1+d2exp(cNt2A+tmd1d2).\displaystyle 2\cdot 9^{m^{d_{1}}+\ell^{d_{2}}}\exp\bigg{(}\frac{-cNt^{2}}{\|A^{*}\|_{\infty}+t\sqrt{m^{d_{1}}\ell^{d_{2}}}}\bigg{)}.

This implies that

BB^op=O(md1+d2N+m3d1/2d2/2+md1/23d2/2N).\|B^{*}-\widehat{B}\|_{\operatorname*{op}}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\sqrt{\frac{m^{d_{1}}+\ell^{d_{2}}}{N}}+\frac{m^{3d_{1}/2}\ell^{d_{2}/2}+m^{d_{1}/2}\ell^{3d_{2}/2}}{N}\bigg{)}.

Corollary 3.

Suppose A^\widehat{A} and AA^{*} are defined as in Theorem 2. Let {ϕk𝕊}k=1\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty} be a collection of 𝐋2([0,1]){{\bf L}_{2}}([0,1]) basis such that ϕk𝕊C𝕊\|\phi^{\mathbb{S}}_{k}\|_{\infty}\leq C_{\mathbb{S}}. Let

=Span{ϕμ1𝕊(z1)}μ1=1mand=Span{ϕη1𝕊(z2)ϕηd1𝕊(zd)}η1,,ηd1=1.\displaystyle\mathcal{M}=\operatorname*{Span}\bigg{\{}\phi^{\mathbb{S}}_{{\mu}_{1}}(z_{1})\bigg{\}}_{{\mu}_{1}=1}^{m}\quad\text{and}\quad\mathcal{L}=\operatorname*{Span}\bigg{\{}\phi^{\mathbb{S}}_{{\eta}_{1}}(z_{2})\cdots\phi^{\mathbb{S}}_{{\eta}_{d-1}}(z_{d})\bigg{\}}_{{\eta}_{1},\ldots,{\eta}_{d-1}=1}^{\ell}.

If in addition m\asymp N^{1/(2\alpha+1)} and \ell^{d-1}=o(N^{\frac{2\alpha-1}{2(2\alpha+1)}}), then

(A^A)×x𝒫×y𝒫op=O(dim()+dim()N).\displaystyle\|(\widehat{A}-A^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\sqrt{\frac{\dim(\mathcal{M})+\dim(\mathcal{L})}{N}}\bigg{)}. (68)
Proof.

Since m\ell^{d-1}(m+\ell^{d-1})\leq(m\ell^{d-1})^{2}\leq N under the above choice of m and \ell, Corollary 3 is a direct consequence of Lemma 9. ∎

C.2 Proof of Theorem 2

Proof of Theorem 2.

It suffices to confirm that all the conditions in Theorem 5 in Section C.3 are met.

In particular, Assumption 2 is verified in Section B.1 for reproducing kernel Hilbert spaces, in Section B.2 for Legendre polynomials, and in Section B.3 for spline bases. Assumption 4 is shown in (69) and (70). Therefore, Theorem 2 immediately follows. ∎

C.3 Proofs related to Theorem 5

Assumption 4.

Suppose 𝔼(A^,G)=A,G{\mathbb{E}}(\langle\widehat{A},G\rangle)=\langle A^{*},G\rangle for any non-random function G𝐋2(𝒪d)G\in{{\bf L}_{2}}({\mathcal{O}}^{d}). In addition, suppose that

supG𝐋2(𝒪d)1Var{A^,G}=O(1N).\sup_{\|G\|_{{{\bf L}_{2}}({\mathcal{O}}^{d})}\leq 1}Var\{\langle\widehat{A},G\rangle\}=\operatorname*{O}\bigg{(}\frac{1}{N}\bigg{)}.

Assumption 4 requires that \langle\widehat{A},G\rangle is a consistent estimator of \langle A^{*},G\rangle for any generic non-random function G\in{{\bf L}_{2}}({\mathcal{O}}^{d}). It is straightforward to check that \widehat{A}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}_{\{Z_{i}\}} in the density estimation model satisfies Assumption 4: for any G\in{{\bf L}_{2}}(\mathcal{O}^{d}),

𝔼(A^,G)=𝔼(1Ni=1NG(Zi))=𝔼(G(Z1))=𝒪dA(z)G(z)dz=A,Gand\displaystyle{\mathbb{E}}(\langle\widehat{A},G\rangle)={\mathbb{E}}\bigg{(}\frac{1}{N}\sum_{i=1}^{N}G(Z_{i})\bigg{)}={\mathbb{E}}(G(Z_{1}))=\int_{\mathcal{O}^{d}}A^{*}(z)G(z)\mathrm{d}z=\langle A^{*},G\rangle\quad\text{and} (69)
Var(A^,G)=1NVar(G(Z1))1N𝔼(G2(Z1))=1N𝒪dA(z)G2(z)dzAG𝐋2(𝒪d)2N.\displaystyle Var(\langle\widehat{A},G\rangle)=\frac{1}{N}Var(G(Z_{1}))\leq\frac{1}{N}{\mathbb{E}}(G^{2}(Z_{1}))=\frac{1}{N}\int_{\mathcal{O}^{d}}A^{*}(z)G^{2}(z)\mathrm{d}z\leq\frac{\|A^{*}\|_{\infty}\|G\|^{2}_{{{\bf L}_{2}}(\mathcal{O}^{d})}}{N}. (70)
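For concreteness, the following minimal sketch (our own illustrative code, with an assumed Beta product density and test function) evaluates \langle\widehat{A},G\rangle=\frac{1}{N}\sum_{i=1}^{N}G(Z_{i}) over repeated draws and confirms numerically that its bias is negligible and that its variance scales like 1/N, in line with (69) and (70).

```python
# A minimal sketch checking (69)-(70) for the empirical measure
# A_hat = (1/N) * sum_i delta_{Z_i}: the functional <A_hat, G> = (1/N) sum_i G(Z_i)
# is unbiased for <A*, G> and its variance decays like 1/N.
# The Beta product density, the test function G, and all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d = 2

def G(z):                          # a fixed test function in L2([0, 1]^d)
    return np.cos(2 * np.pi * z[:, 0]) * z[:, 1]

def inner_product_hat(Z):          # <A_hat, G> = (1/N) sum_i G(Z_i)
    return G(Z).mean()

# True value <A*, G> for independent Beta(2, 2) coordinates, by 1-d quadrature.
t = (np.arange(2000) + 0.5) / 2000
w = 6 * t * (1 - t)                # Beta(2, 2) density on [0, 1]
true_val = (np.cos(2 * np.pi * t) * w).mean() * (t * w).mean()

for N in [1_000, 4_000, 16_000]:
    reps = np.array(
        [inner_product_hat(rng.beta(2, 2, size=(N, d))) for _ in range(200)]
    )
    print(N, abs(reps.mean() - true_val), N * reps.var())  # bias ~ 0, N * Var ~ const
```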
Theorem 5.

Suppose Assumptions 3 and 4 hold with \alpha\geq 1. Let \{\mathcal{M}_{j},\mathcal{L}_{j}\}_{j=1}^{d} be defined in (24) and (25), and suppose Assumption 2 holds with (\mathcal{M}_{j},\mathcal{L}_{j}) for j\in\{1,\ldots,d\}. Let \sigma_{\min}=\min_{j=1,\ldots,d}\{\sigma_{j,{r}_{j}}\} and suppose that, for a sufficiently large constant C_{snr},

N2α/(2α+1)>Csnrmax{j=1drj,1σmin(d1)/α+2,1σmin2α/(α1/2)}.\displaystyle N^{2\alpha/(2\alpha+1)}>C_{snr}\max\bigg{\{}\prod_{j=1}^{d}{r}_{j},\frac{1}{\sigma^{(d-1)/\alpha+2}_{\min}},\frac{1}{\sigma^{2\alpha/(\alpha-1/2)}_{\min}}\bigg{\}}. (71)

Let \widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d} be the output of Algorithm 2. If \ell_{j}=C_{L}\sigma_{j,{r}_{j}}^{-1/\alpha} for a sufficiently large constant C_{L} and m\asymp N^{1/(2\alpha+1)}, then it holds that

A^×1𝒫^1×d𝒫^dA𝐋2(𝒪d)2=O(j=1dσj,rj2N2α/(2α+1)+j=1dσj,rj(d1)/α2N+j=1drjN).\displaystyle\|\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}-A^{*}\|^{2}_{{{\bf L}_{2}}({\mathcal{O}}^{d})}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{\sum_{j=1}^{d}\sigma_{j,{r}_{j}}^{-2}}{N^{2\alpha/(2\alpha+1)}}+\frac{\sum_{j=1}^{d}\sigma_{j,{r}_{j}}^{-(d-1)/\alpha-2}}{N}+\frac{\prod_{j=1}^{d}{r}_{j}}{N}\bigg{)}. (72)
Proof of Theorem 5.

Observe that A^{*}=A^{*}\times_{1}{\mathcal{P}}^{*}_{1}\cdots\times_{d}{\mathcal{P}}^{*}_{d}, where \mathcal{P}^{*}_{j} is the projection operator onto {\operatorname*{Range}}_{j}(A^{*}). As a result,

A^×1𝒫^1×d𝒫^dA=\displaystyle\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}-A^{*}= A^×1𝒫^1×d𝒫^dA×1𝒫1×d𝒫d\displaystyle\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}-A^{*}\times_{1}{\mathcal{P}}^{*}_{1}\cdots\times_{d}{\mathcal{P}}^{*}_{d}
=\displaystyle= A^×1(𝒫^1𝒫1)×2𝒫^2×d𝒫^d\displaystyle\widehat{A}\times_{1}(\widehat{\mathcal{P}}_{1}-{\mathcal{P}}_{1}^{*})\times_{2}\widehat{\mathcal{P}}_{2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}
+\displaystyle+ \displaystyle\ldots
+\displaystyle+ A^×1𝒫1×2𝒫2×d1𝒫d1×d(𝒫^d𝒫d)\displaystyle\widehat{A}\times_{1}{\mathcal{P}}_{1}^{*}\times_{2}{\mathcal{P}}^{*}_{2}\cdots\times_{d-1}{\mathcal{P}}^{*}_{d-1}\times_{d}(\widehat{\mathcal{P}}_{d}-{\mathcal{P}}_{d}^{*})
+\displaystyle+ (A^A)×1𝒫1×d𝒫d.\displaystyle(\widehat{A}-A^{*})\times_{1}{\mathcal{P}}^{*}_{1}\cdots\times_{d}{\mathcal{P}}^{*}_{d}.

In the following, we bound each above term individually.

Let \mathcal{T}_{1} denote the linear subspace spanned by the basis \{\Phi^{*}_{1,{\rho}}\}_{{\rho}=1}^{r_{1}} together with \mathcal{M}_{1}, so that \mathcal{P}_{\mathcal{T}_{1}} is a non-random projection with rank at most m+r_{1}. Since the column space of \mathcal{P}^{*}_{1} is spanned by the basis \{\Phi^{*}_{1,{\rho}}\}_{{\rho}=1}^{r_{1}} and the column space of \widehat{\mathcal{P}}_{1} is contained in \mathcal{M}_{1}, it follows that \mathcal{P}_{\mathcal{T}_{1}}(\mathcal{P}^{*}_{1}-\widehat{\mathcal{P}}_{1})=\mathcal{P}^{*}_{1}-\widehat{\mathcal{P}}_{1}. Besides, by condition (71) in Theorem 5, both (21) in Theorem 1 and (74) in Lemma 11 hold. Therefore

A^×1(𝒫^1𝒫1)×2𝒫^2×d𝒫^dF2\displaystyle\|\widehat{A}\times_{1}(\widehat{\mathcal{P}}_{1}-{\mathcal{P}}_{1}^{*})\times_{2}\widehat{\mathcal{P}}_{2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|_{F}^{2}
=\displaystyle= A^×1𝒫𝒯1(𝒫^1𝒫1)×2𝒫^2×d𝒫^dF2\displaystyle\|\widehat{A}\times_{1}\mathcal{P}_{\mathcal{T}_{1}}(\widehat{\mathcal{P}}_{1}-{\mathcal{P}}_{1}^{*})\times_{2}\widehat{\mathcal{P}}_{2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|_{F}^{2}
\displaystyle\leq 𝒫^1𝒫1op2A^×1𝒫𝒯1×2𝒫^2×d𝒫^dF2\displaystyle\|\widehat{\mathcal{P}}_{1}-{\mathcal{P}}_{1}^{*}\|_{\operatorname*{op}}^{2}\|\widehat{A}\times_{1}\mathcal{P}_{\mathcal{T}_{1}}\times_{2}\widehat{\mathcal{P}}_{2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|_{F}^{2}
\displaystyle\leq O(1σ1,r12{m+1d1N+m2α}{(m+r1)j=2drjN+AF2})\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{1}{\sigma^{2}_{1,r_{1}}}\bigg{\{}\frac{m+\ell_{1}^{d-1}}{N}+m^{-2\alpha}\bigg{\}}\bigg{\{}\frac{(m+r_{1})\prod_{j=2}^{d}r_{j}}{N}+\|A^{*}\|_{F}^{2}\bigg{\}}\bigg{)}
=\displaystyle= O(1σ1,r12{m+1d1N+m2α})\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{1}{\sigma^{2}_{1,r_{1}}}\bigg{\{}\frac{m+\ell_{1}^{d-1}}{N}+m^{-2\alpha}\bigg{\}}\bigg{)}

where the second inequality follows from Theorem 1 and Lemma 11, and the last equality follows from the fact that m\asymp N^{1/(2\alpha+1)}, so that \frac{m\prod_{j=1}^{d}r_{j}}{N}=O(1) by condition (71) in Theorem 5, together with \|A^{*}\|_{F}^{2}=O(1).

Similarly, let \mathcal{T}_{d} denote the linear subspace spanned by the basis \{\Phi^{*}_{d,{\rho}}\}_{{\rho}=1}^{r_{d}} together with \mathcal{M}_{d}, so that \mathcal{P}_{\mathcal{T}_{d}} is non-random with rank at most m+r_{d}. Since the column space of \mathcal{P}^{*}_{d} is spanned by the basis \{\Phi^{*}_{d,{\rho}}\}_{{\rho}=1}^{r_{d}} and the column space of \widehat{\mathcal{P}}_{d} is contained in \mathcal{M}_{d}, it follows that \mathcal{P}_{\mathcal{T}_{d}}(\mathcal{P}^{*}_{d}-\widehat{\mathcal{P}}_{d})=\mathcal{P}^{*}_{d}-\widehat{\mathcal{P}}_{d}.

A^×1𝒫1×2𝒫2×d1𝒫d1×d(𝒫^d𝒫d)F2\displaystyle\|\widehat{A}\times_{1}{\mathcal{P}}_{1}^{*}\times_{2}{\mathcal{P}}^{*}_{2}\cdots\times_{d-1}{\mathcal{P}}^{*}_{d-1}\times_{d}(\widehat{\mathcal{P}}_{d}-{\mathcal{P}}_{d}^{*})\|_{F}^{2}
=\displaystyle= A^×1𝒫1×2𝒫2×d1𝒫d1×d𝒫𝒯d(𝒫^d𝒫d)F2\displaystyle\|\widehat{A}\times_{1}{\mathcal{P}}_{1}^{*}\times_{2}{\mathcal{P}}^{*}_{2}\cdots\times_{d-1}{\mathcal{P}}^{*}_{d-1}\times_{d}\mathcal{P}_{\mathcal{T}_{d}}(\widehat{\mathcal{P}}_{d}-{\mathcal{P}}_{d}^{*})\|_{F}^{2}
\displaystyle\leq 𝒫^d𝒫dop2A^×1𝒫1×2𝒫2×d1𝒫d1×d𝒫𝒯dF2\displaystyle\|\widehat{\mathcal{P}}_{d}-{\mathcal{P}}_{d}^{*}\|_{\operatorname*{op}}^{2}\|\widehat{A}\times_{1}{\mathcal{P}}_{1}^{*}\times_{2}{\mathcal{P}}^{*}_{2}\cdots\times_{d-1}{\mathcal{P}}^{*}_{d-1}\times_{d}\mathcal{P}_{\mathcal{T}_{d}}\|_{F}^{2}
\displaystyle\leq O(1σd,rd2{m+dd1N+m2α}{(m+rd)j=1d1rjN+AF2})\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{1}{\sigma_{d,r_{d}}^{2}}\bigg{\{}\frac{m+\ell_{d}^{d-1}}{N}+m^{-2\alpha}\bigg{\}}\bigg{\{}\frac{(m+r_{d})\prod_{j=1}^{d-1}r_{j}}{N}+\|A^{*}\|_{F}^{2}\bigg{\}}\bigg{)}
=\displaystyle= O(1σd,rd2{m+dd1N+m2α})\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{1}{\sigma_{d,r_{d}}^{2}}\bigg{\{}\frac{m+\ell_{d}^{d-1}}{N}+m^{-2\alpha}\bigg{\}}\bigg{)}

where the inequalities follow from the same argument as above. Similarly, by Lemma 11, for any 2\leq p\leq d-1 it holds that

A^×1𝒫1×p1𝒫p1×p(𝒫^p𝒫p)×p+1𝒫^p+1×d𝒫^dF2=O(1σp,rp2{m+pd1N+m2α}).\|\widehat{A}\times_{1}{\mathcal{P}}_{1}^{*}\cdots\times_{p-1}{\mathcal{P}}_{p-1}^{*}\times_{p}(\widehat{\mathcal{P}}_{p}-{\mathcal{P}}_{p}^{*})\times_{p+1}\widehat{\mathcal{P}}_{p+1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|_{F}^{2}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{1}{\sigma_{p,r_{p}}^{2}}\bigg{\{}\frac{m+\ell_{p}^{d-1}}{N}+m^{-2\alpha}\bigg{\}}\bigg{)}.

In addition, by Lemma 10,

𝔼(A^A)×1𝒫1×d𝒫dF2=O(j=1drjN).\displaystyle{\mathbb{E}}\|(\widehat{A}-A_{*})\times_{1}{\mathcal{P}}^{*}_{1}\cdots\times_{d}{\mathcal{P}}^{*}_{d}\|_{F}^{2}=\operatorname*{O}\bigg{(}\frac{\prod_{j=1}^{d}r_{j}}{N}\bigg{)}.

Therefore,

A^×1𝒫^1×d𝒫^dAF2=O(j=1d1σj,rj2{m+jd1N+m2α}+j=1drjN).\displaystyle\|\widehat{A}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}-A^{*}\|^{2}_{F}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\sum_{j=1}^{d}\frac{1}{\sigma^{2}_{j,r_{j}}}\bigg{\{}\frac{m+\ell_{j}^{d-1}}{N}+m^{-2\alpha}\bigg{\}}+\frac{\prod_{j=1}^{d}r_{j}}{N}\bigg{)}.

The desired result follows from the condition mN1/(2α+1)m\asymp N^{1/(2\alpha+1)} and jσj,rj1/α\ell_{j}\asymp\sigma_{j,r_{j}}^{-1/\alpha} in Theorem 5. ∎

Here we provide two lemmas required in the above proof.

Lemma 10.

Suppose that for each j=1,\ldots,d, \mathcal{Q}_{j} is a non-random linear operator on {{\bf L}_{2}}(\mathcal{O})\otimes{{\bf L}_{2}}(\mathcal{O}) and that the rank of \mathcal{Q}_{j} is q_{j}. Then under Assumption 4, it holds that

𝔼((A^A)×1𝒬1×d𝒬dF2)\displaystyle{\mathbb{E}}(\|(\widehat{A}-A^{*})\times_{1}\mathcal{Q}_{1}\cdots\times_{d}\mathcal{Q}_{d}\|_{F}^{2}) =O(j=1dqj𝒬jop2N).\displaystyle=\operatorname*{O}\bigg{(}\frac{\prod_{j=1}^{d}q_{j}\|\mathcal{Q}_{j}\|^{2}_{\operatorname*{op}}}{N}\bigg{)}. (73)

Consequently

A^×1𝒬1×2𝒬2×d𝒬dF2=O(j=1d𝒬jop2{j=1dqjN+AF2}).\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\times_{2}{\mathcal{Q}}_{2}\cdots\times_{d}{\mathcal{Q}}_{d}\|_{F}^{2}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\prod_{j=1}^{d}\|{\mathcal{Q}}_{j}\|_{\operatorname*{op}}^{2}\bigg{\{}\frac{\prod_{j=1}^{d}q_{j}}{N}+\|A^{*}\|^{2}_{F}\bigg{\}}\bigg{)}.
Proof.

Since the rank of 𝒬j\mathcal{Q}_{j} is qjq_{j}, we can write

𝒬j=μj=1qjνj,μjϕj,μjψj,μj,\mathcal{Q}_{j}=\sum_{{\mu}_{j}=1}^{q_{j}}\nu_{j,{\mu}_{j}}\phi_{j,{\mu}_{j}}\otimes\psi_{j,{\mu}_{j}},

where {ψj,μj}μj=1qj\{\psi_{j,{\mu}_{j}}\}_{{\mu}_{j}=1}^{q_{j}} and {ϕj,μj}μj=1qj\{\phi_{j,{\mu}_{j}}\}_{{\mu}_{j}=1}^{q_{j}} are both orthonormal in 𝐋2(𝒪){{\bf L}_{2}}(\mathcal{O}). Note that |νj,μj|𝒬jop|\nu_{j,{\mu}_{j}}|\leq\|\mathcal{Q}_{j}\|_{\operatorname*{op}} for any μj{\mu}_{j}. Denote

𝒮j=Span{ψj,μj}μj=1qj.\mathcal{S}_{j}=\operatorname*{Span}\{\psi_{j,{\mu}_{j}}\}_{{\mu}_{j}=1}^{q_{j}}.

Note that (A^A)×1𝒬1×d𝒬d(\widehat{A}-A^{*})\times_{1}\mathcal{Q}_{1}\cdots\times_{d}\mathcal{Q}_{d} is zero in the orthogonal complement of the subspace 𝒮1𝒮d\mathcal{S}_{1}\otimes\cdots\otimes\mathcal{S}_{d}. Therefore,

(A^A)×1𝒬1×d𝒬dF2\displaystyle\|(\widehat{A}-A^{*})\times_{1}\mathcal{Q}_{1}\cdots\times_{d}\mathcal{Q}_{d}\|_{F}^{2}
=\displaystyle= μ1=1q1μd=1qd{(A^A)×1𝒬1×d𝒬d[ψ1,μ1,,ψd,μd]}2\displaystyle\sum_{{\mu}_{1}=1}^{q_{1}}\ldots\sum_{{\mu}_{d}=1}^{q_{d}}\bigg{\{}(\widehat{A}-A^{*})\times_{1}\mathcal{Q}_{1}\cdots\times_{d}\mathcal{Q}_{d}[\psi_{1,{\mu}_{1}},\ldots,\psi_{d,{\mu}_{d}}]\bigg{\}}^{2}
=\displaystyle= μ1=1q1μd=1qd{(A^A)[𝒬1(ψ1,μ1),,𝒬d(ψd,μd)]}2\displaystyle\sum_{{\mu}_{1}=1}^{q_{1}}\ldots\sum_{{\mu}_{d}=1}^{q_{d}}\bigg{\{}(\widehat{A}-A^{*})[\mathcal{Q}_{1}(\psi_{1,{\mu}_{1}}),\ldots,\mathcal{Q}_{d}(\psi_{d,{\mu}_{d}})]\bigg{\}}^{2}
=\displaystyle= μ1=1q1μd=1qd{(A^A)[ν1,μ1ϕ1,μ1,,νd,μdϕd,μd]}2\displaystyle\sum_{{\mu}_{1}=1}^{q_{1}}\ldots\sum_{{\mu}_{d}=1}^{q_{d}}\bigg{\{}(\widehat{A}-A^{*})[\nu_{1,{\mu}_{1}}\phi_{1,{\mu}_{1}},\ldots,\nu_{{d},{\mu}_{d}}\phi_{d,{\mu}_{d}}]\bigg{\}}^{2}
\displaystyle\leq j=1d𝒬jop2μ1=1q1μd=1qd{(A^A)[ϕ1,μ1,,ϕd,μd]}2\displaystyle\prod_{j=1}^{d}\|\mathcal{Q}_{j}\|_{\operatorname*{op}}^{2}\sum_{{\mu}_{1}=1}^{q_{1}}\ldots\sum_{{\mu}_{d}=1}^{q_{d}}\bigg{\{}(\widehat{A}-A^{*})[\phi_{1,{\mu}_{1}},\ldots,\phi_{d,{\mu}_{d}}]\bigg{\}}^{2}

and so

𝔼(A^A)×1𝒬1×d𝒬dF2\displaystyle{\mathbb{E}}\|(\widehat{A}-A^{*})\times_{1}\mathcal{Q}_{1}\cdots\times_{d}\mathcal{Q}_{d}\|_{F}^{2}
\displaystyle\leq j=1d𝒬jop2μ1=1q1μd=1qd𝔼{(A^A)[ϕ1,μ1,,ϕd,μd]}2\displaystyle\prod_{j=1}^{d}\|\mathcal{Q}_{j}\|_{\operatorname*{op}}^{2}\sum_{{\mu}_{1}=1}^{q_{1}}\ldots\sum_{{\mu}_{d}=1}^{q_{d}}{\mathbb{E}}\bigg{\{}(\widehat{A}-A^{*})[\phi_{1,{\mu}_{1}},\ldots,\phi_{d,{\mu}_{d}}]\bigg{\}}^{2}
=\displaystyle= j=1d𝒬jop2μ1=1q1μd=1qdVar{A^[ϕ1,μ1,,ϕd,μd]}=O(j=1dqj𝒬jop2N),\displaystyle\prod_{j=1}^{d}\|\mathcal{Q}_{j}\|_{\operatorname*{op}}^{2}\sum_{{\mu}_{1}=1}^{q_{1}}\ldots\sum_{{\mu}_{d}=1}^{q_{d}}Var\bigg{\{}\widehat{A}[\phi_{1,{\mu}_{1}},\ldots,\phi_{d,{\mu}_{d}}]\bigg{\}}=\operatorname*{O}\bigg{(}\frac{\prod_{j=1}^{d}q_{j}\|\mathcal{Q}_{j}\|_{\operatorname*{op}}^{2}}{N}\bigg{)},

where the first equality follows from the unbiasedness condition in Assumption 4, i.e. {\mathbb{E}}(\langle\widehat{A},G\rangle)=\langle A^{*},G\rangle for any G\in{{\bf L}_{2}}(\mathcal{O}^{d}), and the last equality follows from the variance bound in Assumption 4. Consequently,

A^×1𝒬1×2𝒬2×d𝒬dF22(A^A)×1𝒬1×2𝒬2×d𝒬dF2+2A×1𝒬1×2𝒬2×d𝒬dF2\displaystyle\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\times_{2}{\mathcal{Q}}_{2}\cdots\times_{d}{\mathcal{Q}}_{d}\|_{F}^{2}\leq 2\|(\widehat{A}-A^{*})\times_{1}{\mathcal{Q}}_{1}\times_{2}{\mathcal{Q}}_{2}\cdots\times_{d}{\mathcal{Q}}_{d}\|_{F}^{2}+2\|A^{*}\times_{1}{\mathcal{Q}}_{1}\times_{2}{\mathcal{Q}}_{2}\cdots\times_{d}{\mathcal{Q}}_{d}\|_{F}^{2}
2(A^A)×1𝒬1×2𝒬2×d𝒬dF2+2j=1d𝒬jop2AF2\displaystyle\leq 2\|(\widehat{A}-A^{*})\times_{1}{\mathcal{Q}}_{1}\times_{2}{\mathcal{Q}}_{2}\cdots\times_{d}{\mathcal{Q}}_{d}\|_{F}^{2}+2\prod_{j=1}^{d}\|{\mathcal{Q}}_{j}\|_{\operatorname*{op}}^{2}\|A^{*}\|_{F}^{2}
=O(j=1d𝒬jop2{j=1dqjN+AF2}).\displaystyle=\operatorname*{O_{\mathbb{P}}}\bigg{(}\prod_{j=1}^{d}\|{\mathcal{Q}}_{j}\|_{\operatorname*{op}}^{2}\bigg{\{}\frac{\prod_{j=1}^{d}q_{j}}{N}+\|A^{*}\|^{2}_{F}\bigg{\}}\bigg{)}.

Lemma 11.

Let \widehat{A} be any estimator satisfying Assumption 4. Suppose \{{\mathcal{Q}}_{j}\}_{j=1}^{d} is a collection of non-random operators on {{\bf L}_{2}}\otimes{{\bf L}_{2}} such that {\mathcal{Q}}_{j} has rank q_{j} and \|\mathcal{Q}_{j}\|_{\operatorname*{op}}\leq 1. Let \sigma_{\min}=\min_{j=1,\ldots,d}\{\sigma_{j,r_{j}}\} and suppose in addition that

mσmin2(m+d1N+m2α)=O(1).\displaystyle\frac{m}{\sigma_{\min}^{2}}\bigg{(}\frac{m+\ell^{d-1}}{N}+m^{-2\alpha}\bigg{)}=O(1). (74)

Then for any 0pd10\leq p\leq d-1, it holds that

A^×1𝒬1×p+1𝒬p+1×p+2𝒫^p+2×d𝒫^dF2=O((j=1p+1qj)(j=p+2drj)N+AF2).\displaystyle\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\cdots\times_{p+1}{\mathcal{Q}}_{p+1}\times_{p+2}\widehat{\mathcal{P}}_{p+2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|^{2}_{F}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{\big{(}\prod_{j=1}^{p+1}q_{j}\big{)}(\prod_{j=p+2}^{d}r_{j})}{N}+\|A^{*}\|^{2}_{F}\bigg{)}. (75)
Proof.

We prove (75) by backward induction on p. The base case p=d-1, in which no \widehat{\mathcal{P}}_{j} factor remains, is exactly Lemma 10. Suppose (75) holds for p; we show that it also holds with p replaced by p-1. Then

A^×1𝒬1×p𝒬p×p+1𝒫^p+1×d𝒫^dF2\displaystyle\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\cdots\times_{p}{\mathcal{Q}}_{p}\times_{p+1}\widehat{\mathcal{P}}_{p+1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|^{2}_{F}
\displaystyle\leq 2A^×1𝒬1×p𝒬p×p+1𝒫p+1×p+2𝒫^p+2×d𝒫^dF2\displaystyle 2\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\cdots\times_{p}{\mathcal{Q}}_{p}\times_{p+1}\mathcal{P}^{*}_{p+1}\times_{p+2}\widehat{\mathcal{P}}_{p+2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|^{2}_{F} (76)
+\displaystyle+ 2A^×1𝒬1×p𝒬p×p+1(𝒫p+1𝒫^p+1)×p+2𝒫^p+2×d𝒫^dF2.\displaystyle 2\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\cdots\times_{p}{\mathcal{Q}}_{p}\times_{p+1}(\mathcal{P}^{*}_{p+1}-\widehat{\mathcal{P}}_{p+1})\times_{p+2}\widehat{\mathcal{P}}_{p+2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|^{2}_{F}. (77)

By induction,

(76)\displaystyle\eqref{eq:induction projected norm term 1}\leq O((j=1pqj)Rank(𝒫p+1)(j=p+2drj)N+AF2)\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{\big{(}\prod_{j=1}^{p}q_{j}\big{)}\operatorname*{Rank}(\mathcal{P}^{*}_{p+1})\big{(}\prod_{j=p+2}^{d}r_{j}\big{)}}{N}+\|A^{*}\|^{2}_{F}\bigg{)}
=\displaystyle= O((j=1pqj)(j=p+1drj)N+AF2).\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{\big{(}\prod_{j=1}^{p}q_{j}\big{)}\big{(}\prod_{j=p+1}^{d}r_{j}\big{)}}{N}+\|A^{*}\|^{2}_{F}\bigg{)}.

Let \mathcal{T}_{p+1} denote the space spanned by the basis \{\Phi^{*}_{p+1,{\rho}}\}_{{\rho}=1}^{r_{p+1}} defined in Assumption 3 together with \mathcal{M}_{p+1}, so that \mathcal{P}_{\mathcal{T}_{p+1}} is non-random with rank at most m+r_{p+1}. Since the column space of \mathcal{P}^{*}_{p+1} is spanned by \{\Phi^{*}_{p+1,{\rho}}\}_{{\rho}=1}^{r_{p+1}} and the column space of \widehat{\mathcal{P}}_{p+1} is contained in \mathcal{M}_{p+1}, it follows that \mathcal{P}_{\mathcal{T}_{p+1}}(\mathcal{P}^{*}_{p+1}-\widehat{\mathcal{P}}_{p+1})=\mathcal{P}^{*}_{p+1}-\widehat{\mathcal{P}}_{p+1}. Consequently,

(77)=\displaystyle\eqref{eq:induction projected norm term 2}= A^×1𝒬1×p𝒬p×p+1𝒫𝒯p+1(𝒫p+1𝒫^p+1)×p+2𝒫^p+2×d𝒫^dF2\displaystyle\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\cdots\times_{p}{\mathcal{Q}}_{p}\times_{p+1}\mathcal{P}_{\mathcal{T}_{p+1}}(\mathcal{P}^{*}_{p+1}-\widehat{\mathcal{P}}_{p+1})\times_{p+2}\widehat{\mathcal{P}}_{p+2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|^{2}_{F}
\displaystyle\leq A^×1𝒬1×p𝒬p×p+1𝒫𝒯p+1×p+2𝒫^p+2×d𝒫^dF2𝒫p+1𝒫^p+1op2\displaystyle\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\cdots\times_{p}{\mathcal{Q}}_{p}\times_{p+1}\mathcal{P}_{\mathcal{T}_{p+1}}\times_{p+2}\widehat{\mathcal{P}}_{p+2}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|^{2}_{F}\|\mathcal{P}^{*}_{p+1}-\widehat{\mathcal{P}}_{p+1}\|_{\operatorname*{op}}^{2}
\displaystyle\leq O({(j=1pqj)(m+rp+1)(j=p+2drj)N+AF2}1σp+1,rp+12{m+d1N+m2α})\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\bigg{\{}\frac{\big{(}\prod_{j=1}^{p}q_{j}\big{)}(m+r_{p+1})\big{(}\prod_{j=p+2}^{d}r_{j}\big{)}}{N}+\|A^{*}\|^{2}_{F}\bigg{\}}\frac{1}{\sigma^{2}_{p+1,r_{p+1}}}\bigg{\{}\frac{m+\ell^{d-1}}{N}+m^{-2\alpha}\bigg{\}}\bigg{)}
=\displaystyle= O({(j=1pqj)j=p+1drjN+AF2}1σp+1,rp+12{m+d1N+m2α})\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\bigg{\{}\frac{\big{(}\prod_{j=1}^{p}q_{j}\big{)}\prod_{j=p+1}^{d}r_{j}}{N}+\|A^{*}\|^{2}_{F}\bigg{\}}\frac{1}{\sigma^{2}_{p+1,r_{p+1}}}\bigg{\{}\frac{m+\ell^{d-1}}{N}+m^{-2\alpha}\bigg{\}}\bigg{)}
+\displaystyle+ O((j=1pqj)j=p+2drjNmσp+1,rp+12{m+d1N+m2α})\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{\big{(}\prod_{j=1}^{p}q_{j}\big{)}\prod_{j=p+2}^{d}r_{j}}{N}\frac{m}{\sigma^{2}_{p+1,r_{p+1}}}\bigg{\{}\frac{m+\ell^{d-1}}{N}+m^{-2\alpha}\bigg{\}}\bigg{)}
=\displaystyle= O((j=1pqj)(j=p+1drj)N+AF2)\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{\big{(}\prod_{j=1}^{p}q_{j}\big{)}\big{(}\prod_{j=p+1}^{d}r_{j}\big{)}}{N}+\|A^{*}\|^{2}_{F}\bigg{)}

where the second inequality follows from the induction hypothesis and Theorem 1, and the last equality follows from the assumption that \frac{m}{\sigma^{2}_{p+1,r_{p+1}}}\bigg{(}\frac{m+\ell^{d-1}}{N}+m^{-2\alpha}\bigg{)}\leq\frac{m}{\sigma_{\min}^{2}}\bigg{(}\frac{m+\ell^{d-1}}{N}+m^{-2\alpha}\bigg{)}=\operatorname*{O}(1). Consequently,

A^×1𝒬1×p𝒬p×p+1𝒫^p+1×d𝒫^dF2=O((j=1pqj)(j=p+1drj)N+AF2)\displaystyle\|\widehat{A}\times_{1}{\mathcal{Q}}_{1}\cdots\times_{p}{\mathcal{Q}}_{p}\times_{p+1}\widehat{\mathcal{P}}_{p+1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}\|^{2}_{F}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{\big{(}\prod_{j=1}^{p}q_{j}\big{)}\big{(}\prod_{j=p+1}^{d}r_{j}\big{)}}{N}+\|A^{*}\|^{2}_{F}\bigg{)}

Therefore, (75) holds for any 0\leq p\leq d-1. ∎

Appendix D Additional discussions related to Section 2

D.1 Implementation details for Algorithm 1

Let \mathcal{M}\subset{{\bf L}_{2}}(\Omega_{1}) and \mathcal{L}\subset{{\bf L}_{2}}(\Omega_{2}) be two subspaces and let \widehat{A}(x,y):\Omega_{1}\times\Omega_{2}\to\mathbb{R} be any (random) function. Suppose that \{v_{\mu}(x)\}_{{\mu}=1}^{\dim(\mathcal{M})} and \{w_{\eta}(y)\}_{{\eta}=1}^{\dim(\mathcal{L})} are orthonormal basis functions of \mathcal{M} and \mathcal{L} respectively, with \dim(\mathcal{M}),\dim(\mathcal{L})\in\mathbb{Z}^{+}. Our general assumption is that \widehat{A}[f,g] can be computed efficiently for any f\in{{\bf L}_{2}}(\Omega_{1}) and g\in{{\bf L}_{2}}(\Omega_{2}); this assumption is easily verified for the density estimation model. The following algorithm provides additional implementation details for Algorithm 1, and a code sketch illustrating its steps is given after the listing.

1:Estimator A^(x,y)\widehat{A}(x,y), parameter r+r\in\mathbb{Z}^{+}, linear subspaces =Span{vμ(x)}μ=1dim()\mathcal{M}=\operatorname*{Span}\{v_{\mu}(x)\}_{{\mu}=1}^{\dim(\mathcal{M})} and =Span{wη(y)}η=1dim()\mathcal{L}=\operatorname*{Span}\{w_{\eta}(y)\}_{{\eta}=1}^{\dim(\mathcal{L})}.
2:Compute Bdim()×dim()B\in\mathbb{R}^{\dim(\mathcal{M})\times\dim(\mathcal{L})}, where for 1μdim()1\leq{\mu}\leq dim(\mathcal{M}) and 1ηdim()1\leq{\eta}\leq\dim(\mathcal{L}),
Bμ,η=A^[vμ,wη].B_{{\mu},{\eta}}=\widehat{A}[v_{{\mu}},w_{\eta}].
3:Compute {U^ρ}ρ=1rdim()\{\widehat{U}_{{\rho}}\}_{{\rho}=1}^{r}\subset\mathbb{R}^{\dim(\mathcal{M})}, the leading rr left singular vectors of BB using matrix singular value decomposition.
4:Compute Φ^ρ(x)=μ=1dim()U^ρ,μvμ(x)\widehat{\Phi}_{\rho}(x)=\sum_{{\mu}=1}^{\dim(\mathcal{M})}\widehat{U}_{{\rho},{\mu}}v_{\mu}(x) for ρ=1,,r{\rho}=1,\ldots,r.
5:Functions {Φ^ρ(x)}ρ=1r𝐋2(Ω1)\{\widehat{\Phi}_{\rho}(x)\}_{{\rho}=1}^{r}\subset{{\bf L}_{2}}(\Omega_{1}).
Algorithm 3 Range Estimation via Variance-Reduced Sketching
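The following Python sketch illustrates Algorithm 3 for the density estimation model, where \widehat{A}[v_{\mu},w_{\eta}]=\frac{1}{N}\sum_{i=1}^{N}v_{\mu}(X_{i})w_{\eta}(Y_{i}). The choice of tensorized shifted Legendre polynomials for \mathcal{M} and \mathcal{L}, as well as all function names, are our own illustrative assumptions rather than the paper's implementation.

```python
# A minimal Python sketch of Algorithm 3 for the density estimation model,
# where A_hat[Phi_mu, Psi_eta] = (1/N) sum_i Phi_mu(X_i) Psi_eta(Y_i).
# The Legendre bases and all names below are illustrative assumptions.
import numpy as np
from numpy.polynomial import legendre

def legendre_basis(t, dim):
    # Orthonormal shifted Legendre polynomials on [0, 1], evaluated at points t.
    u = 2.0 * np.asarray(t) - 1.0            # map [0, 1] -> [-1, 1]
    return np.stack(
        [legendre.legval(u, [0] * k + [1]) * np.sqrt(2 * k + 1) for k in range(dim)],
        axis=-1,
    )                                        # shape (len(t), dim)

def estimate_range(X, Y, m, ell, r):
    """Return the coefficients of the r estimated range functions in the M-basis."""
    N, d1 = X.shape
    d2 = Y.shape[1]
    # Tensor-product basis evaluations Phi_mu(X_i) and Psi_eta(Y_i).
    Phi = np.ones((N, 1))
    for j in range(d1):
        Phi = np.einsum("na,nb->nab", Phi, legendre_basis(X[:, j], m)).reshape(N, -1)
    Psi = np.ones((N, 1))
    for j in range(d2):
        Psi = np.einsum("na,nb->nab", Psi, legendre_basis(Y[:, j], ell)).reshape(N, -1)
    # Step 2: B_{mu, eta} = A_hat[Phi_mu, Psi_eta].
    B = Phi.T @ Psi / N
    # Step 3: leading r left singular vectors of B.
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    U_hat = U[:, :r]
    # Step 4: Phi_hat_rho(x) = sum_mu U_hat[mu, rho] * Phi_mu(x).
    return U_hat

# Toy usage: a product density in d = 3, so Range_x(A*) has rank 1 with d1 = 1.
rng = np.random.default_rng(0)
Z = rng.beta(2.0, 3.0, size=(2000, 3))
U_hat = estimate_range(Z[:, :1], Z[:, 1:], m=8, ell=4, r=1)
print(U_hat.shape)  # (m^{d1}, r) = (8, 1)
```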

D.2 Sketching in finite-dimensional vector spaces

In this subsection, we illustrate the intuition of sketching in a finite-dimensional matrix example. Suppose Bn1×n2B\in\mathbb{R}^{n_{1}\times n_{2}} is a finite-dimensional matrix with rank rr. Let Range(B)n1\operatorname*{Range}(B)\subset\mathbb{R}^{n_{1}} denote the linear subspace spanned by the columns of BB, and Row(B)n2\text{Row}(B)\subset\mathbb{R}^{n_{2}} the linear subspace spanned by the rows of BB. Our goal is to illustrate how to estimate  Range(B)\operatorname*{Range}(B) with reduced variance and reduced computational complexity when n1n2n_{1}\ll n_{2}.

By singular value decomposition, we can write

B=ρ=1rσρ(B)UρVρ,B=\sum_{{\rho}=1}^{r}\sigma_{{\rho}}(B)U_{{\rho}}V_{{\rho}}^{\top},

where σ1(B)σr(B)>0\sigma_{1}(B)\geq\cdots\geq\sigma_{r}(B)>0 are singular values, {Uρ}ρ=1r\{U_{\rho}\}_{{\rho}=1}^{r} are orthonormal vectors in n1\mathbb{R}^{n_{1}}, and {Vρ}ρ=1r\{V_{\rho}\}_{{\rho}=1}^{r} are orthonormal vectors in n2\mathbb{R}^{n_{2}}. Therefore Range(B)\operatorname*{Range}(B) is spanned by {Uρ}ρ=1r\{U_{{\rho}}\}_{{\rho}=1}^{r}, and Row(B)\text{Row}(B) is spanned by {Vρ}ρ=1r\{V_{\rho}\}_{{\rho}=1}^{r}.

The sketch-based estimation procedure for \operatorname*{Range}(B) is as follows. First, we choose a linear subspace \mathcal{S}\subset\mathbb{R}^{n_{2}} such that \dim(\mathcal{S})\ll n_{2} and \mathcal{S} forms a \delta-cover of \text{Row}(B). Let P_{\mathcal{S}} be the projection matrix from \mathbb{R}^{n_{2}} onto \mathcal{S}, and form the sketched matrix BP_{\mathcal{S}}^{\top}\in\mathbb{R}^{n_{1}\times\dim(\mathcal{S})}. In the second stage, we compute \operatorname*{Range}(BP_{\mathcal{S}}^{\top}) via the singular value decomposition and return it as the estimator of \operatorname*{Range}(B). A minimal numerical sketch of this procedure is given below.

With the sketching technique, we only need to work with the reduced-size matrix BP_{\mathcal{S}}^{\top}\in\mathbb{R}^{n_{1}\times\dim(\mathcal{S})} instead of B\in\mathbb{R}^{n_{1}\times n_{2}}. Therefore, the effective variance of the sketching procedure is reduced to \operatorname*{O}(n_{1}\dim(\mathcal{S})), which is significantly smaller than the \operatorname*{O}(n_{1}n_{2}) cost incurred when B is used directly to estimate the range.
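Below is a minimal numerical sketch of this finite-dimensional procedure (all names, sizes, and the choice of \mathcal{S} are illustrative assumptions): we take a rank-r matrix B, form a sketch subspace \mathcal{S} that contains \text{Row}(B) (so \delta=0), and verify that the leading left singular vectors of BP_{\mathcal{S}}^{\top} recover \operatorname*{Range}(B).

```python
# A minimal numerical sketch of range estimation via sketching in finite dimensions.
# We build a rank-r matrix B, choose a sketch subspace S containing Row(B)
# (a delta-cover with delta = 0), and check that Range(B P_S^T) recovers Range(B).
# All names and sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r, dim_S = 50, 5000, 3, 10

# Rank-r matrix B = sum_rho sigma_rho(B) U_rho V_rho^T.
U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
B = U @ np.diag([3.0, 2.0, 1.0]) @ V.T

# Sketch subspace S: orthonormalize [V, random directions], so span(S) contains
# Row(B) and dim(S) = dim_S << n2.  P_S maps R^{n2} onto S, identified with R^{dim_S}.
Q, _ = np.linalg.qr(np.concatenate([V, rng.standard_normal((n2, dim_S - r))], axis=1))
P_S = Q.T                                                  # (dim_S, n2)

# The sketched matrix B P_S^T is only n1 x dim_S instead of n1 x n2.
B_sketch = B @ P_S.T
U_sketch, _, _ = np.linalg.svd(B_sketch, full_matrices=False)

# The leading r left singular vectors of the sketch span Range(B).
P_range_true = U @ U.T
P_range_sketch = U_sketch[:, :r] @ U_sketch[:, :r].T
print(np.linalg.norm(P_range_true - P_range_sketch))       # ~ 0 up to rounding error
```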

We also provide an intuitive argument to support the above sketching procedure. Since BP𝒮=ρ=1rσρ(B)Uρ(P𝒮Vρ)BP_{\mathcal{S}}^{\top}=\sum_{{\rho}=1}^{r}\sigma_{\rho}(B)U_{\rho}(P_{\mathcal{S}}V_{\rho})^{\top}, it holds that

Range(BP𝒮)Range(B).\displaystyle\operatorname*{Range}(BP_{\mathcal{S}}^{\top})\subset\operatorname*{Range}(B). (78)

Let \|\cdot\|_{2} denote the spectral norm for matrices and the l_{2} norm for vectors. Since the subspace \mathcal{S} is a \delta-cover of \text{Row}(B), it follows that \|P_{\mathcal{S}}V_{\rho}-V_{\rho}\|_{2}\leq\delta for {\rho}=1,\ldots,r. Therefore

BBP𝒮2=ρ=1rσρ(B)Uρ(VρP𝒮Vρ)2ρ=1r|σρ(B)|Uρ2VρP𝒮Vρ2=O(δ),\displaystyle\|B-BP_{\mathcal{S}}^{\top}\|_{2}=\bigg{\|}\sum_{{\rho}=1}^{r}\sigma_{\rho}(B)U_{\rho}(V_{\rho}-P_{\mathcal{S}}V_{\rho})^{\top}\bigg{\|}_{2}\leq\sum_{{\rho}=1}^{r}|\sigma_{\rho}(B)|\|U_{\rho}\|_{2}\|V_{\rho}-P_{\mathcal{S}}V_{\rho}\|_{2}=\operatorname*{O}(\delta),

where the last equality follows from the fact that |σρ(B)|<|\sigma_{\rho}(B)|<\infty and Uρ2=1\|U_{\rho}\|_{2}=1 for ρ=1,,r{\rho}=1,\ldots,r. Let {σρ(BP𝒮)}ρ=1r\{\sigma_{\rho}(BP_{\mathcal{S}}^{\top})\}_{{\rho}=1}^{r} be the leading rr singular values of BP𝒮BP_{\mathcal{S}}^{\top}. By matrix singular value perturbation theory,

|σρ(B)σρ(BP𝒮)|BBP𝒮2=O(δ).\displaystyle|\sigma_{\rho}(B)-\sigma_{\rho}(BP_{\mathcal{S}}^{\top})|\leq\|B-BP_{\mathcal{S}}^{\top}\|_{2}=\operatorname*{O}(\delta). (79)

for {\rho}=1,\ldots,r. Suppose \mathcal{S} is chosen so that \delta\ll\sigma_{r}(B), where \sigma_{r}(B) is the smallest nonzero singular value of B. Then (79) implies that \sigma_{\rho}(BP_{\mathcal{S}}^{\top})\geq\sigma_{\rho}(B)-\operatorname*{O}(\delta)>0 for {\rho}\in\{1,\ldots,r\}. Therefore, BP_{\mathcal{S}}^{\top} has at least r positive singular values, and so the rank of BP_{\mathcal{S}}^{\top} is at least r. This observation, together with (78) and the fact that \operatorname*{Rank}(B)=r, implies that

Range(BP𝒮)=Range(B).\operatorname*{Range}(BP_{\mathcal{S}}^{\top})=\operatorname*{Range}(B).

This justifies the sketching procedure in finite dimensions.

Appendix E Proofs in Section 4

Lemma 12.

Let Γ(x,y)\Gamma(x,y) be a generic element in 𝐋2([κ]2)𝐋2([κ]2){{{\bf L}_{2}}([{\kappa}]^{2})\otimes{{\bf L}_{2}}([{\kappa}]^{2})}. Then

Γ(x,y)op(𝐋2([κ]2)𝐋2([κ]2))=Γ(x,y)opκ2.\|\Gamma(x,y)\|_{\operatorname*{op}({{{\bf L}_{2}}([{\kappa}]^{2})\otimes{{\bf L}_{2}}([{\kappa}]^{2})})}=\frac{\|\Gamma(x,y)\|_{\operatorname*{op}}}{{\kappa}^{2}}.
Proof.

Let f_{x}=I(x)/{\kappa} and g_{y}=J(y)/{\kappa}. Then

x[κ]2fx2=1κ2x[κ]2I2(x)=I𝐋2([κ]2)2.\sum_{x\in[{\kappa}]^{2}}f_{x}^{2}=\frac{1}{{\kappa}^{2}}\sum_{x\in[{\kappa}]^{2}}I^{2}(x)=\|I\|^{2}_{{{\bf L}_{2}}([{\kappa}]^{2})}.

It suffices to observe that

Γop(𝐋2([κ]2)𝐋2([κ]2))=\displaystyle\|\Gamma\|_{\operatorname*{op}({{{\bf L}_{2}}([{\kappa}]^{2})\otimes{{\bf L}_{2}}([{\kappa}]^{2})})}= supI𝐋2([κ]2)=J𝐋2([κ]2)=1Γ[I,J]\displaystyle\sup_{\|I\|_{{{\bf L}_{2}}([{\kappa}]^{2})}=\|J\|_{{{\bf L}_{2}}([{\kappa}]^{2})}=1}\Gamma[I,J]
=\displaystyle= supI𝐋2([κ]2)=J𝐋2([κ]2)=11κ4x,y[κ]2Γ(x,y)I(x)J(y)\displaystyle\sup_{\|I\|_{{{\bf L}_{2}}([{\kappa}]^{2})}=\|J\|_{{{\bf L}_{2}}([{\kappa}]^{2})}=1}\frac{1}{{\kappa}^{4}}\sum_{x,y\in[{\kappa}]^{2}}\Gamma(x,y)I(x)J(y)
=\displaystyle= 1κ2x[κ]2fx2=y[κ]2gy2=1Γ(x,y)fxgy=Γ(x,y)opκ2.\displaystyle\frac{1}{{\kappa}^{2}}\sum_{\sum_{x\in[{\kappa}]^{2}}f_{x}^{2}=\sum_{y\in[{\kappa}]^{2}}g_{y}^{2}=1}\Gamma(x,y)f_{x}g_{y}=\frac{\|\Gamma(x,y)\|_{\operatorname*{op}}}{{\kappa}^{2}}.

Proof of Corollary 1.

From the proof of Theorem 1 and Lemma 13, it follows that

𝒫^x𝒫xop(𝐋2([κ]2)𝐋2([κ]2))2=O{σr2(m2α+m2+2N+1κ4)}.\displaystyle\|\widehat{\mathcal{P}}_{x}-\mathcal{P}^{*}_{x}\|_{\operatorname*{op}({{{\bf L}_{2}}([{\kappa}]^{2})\otimes{{\bf L}_{2}}([{\kappa}]^{2})})}^{2}=\operatorname*{O_{\mathbb{P}}}\bigg{\{}\sigma_{r}^{-2}\bigg{(}m^{-2\alpha}+\frac{m^{2}+\ell^{2}}{N}+\frac{1}{{\kappa}^{4}}\bigg{)}\bigg{\}}. (80)

The desired result follows by setting

m\asymp N^{1/(2\alpha+2)}\quad\text{and}\quad\ell=C_{L}\sigma_{r}^{-1/\alpha}. ∎

Lemma 13.

Let \mathcal{M} and \mathcal{L} be subspaces in the form of (31). Suppose in addition that m+N=O(1).\frac{m+\ell}{\sqrt{N}}=O(1). Then under the same conditions as in Corollary 1, it holds that

(Σ^Σ)×x𝒫×y𝒫op(𝐋2([κ]2)𝐋2([κ]2))=O(m+N+m2+2N+1κ2).\|(\widehat{\Sigma}-\Sigma^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}({{{\bf L}_{2}}([{\kappa}]^{2})\otimes{{\bf L}_{2}}([{\kappa}]^{2})})}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{m+\ell}{\sqrt{N}}+\frac{m^{2}+\ell^{2}}{N}+\frac{1}{{\kappa}^{2}}\bigg{)}.
Proof.

Let x=(x1,x2)[κ]2x=(x_{1},x_{2})\in[{\kappa}]^{2} and y=(y1,y2)[κ]2y=(y_{1},y_{2})\in[{\kappa}]^{2}. By reordering (μ1,μ2)({\mu}_{1},{\mu}_{2}) and (η1,η2)({\eta}_{1},{\eta}_{2}) in (31), we can assume that

=span{Φμ(x)}μ=1m2and=span{Ψη(y)}η=12,\displaystyle\mathcal{M}=\text{span}\{\Phi_{{\mu}}(x)\}_{{\mu}=1}^{m^{2}}\quad\text{and}\quad\mathcal{L}=\text{span}\{\Psi_{{\eta}}(y)\}_{{\eta}=1}^{\ell^{2}}, (81)

where {Φμ(x)}μ=1m2\{\Phi_{{\mu}}(x)\}_{{\mu}=1}^{m^{2}} and {Ψη(y)}η=12\{\Psi_{{\eta}}(y)\}_{{\eta}=1}^{\ell^{2}} are orthonormal basis functions of 𝐋2([κ]2){{\bf L}_{2}}([{\kappa}]^{2}). Note that Σ^×x𝒫×y𝒫\widehat{\Sigma}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}} and Σ×x𝒫×y𝒫\Sigma^{*}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}} are both zero on the orthogonal complement of the subspace \mathcal{M}\otimes\mathcal{L}. Let

Wi(x)=I(x)+δi(x)+ϵi(x)\displaystyle W_{i}(x)=I^{*}(x)+\delta_{i}(x)+\epsilon_{i}(x) (82)

where I^{*}(x)={\mathbb{E}}(I_{i}(x)) and \delta_{i}(x)=I_{i}(x)-I^{*}(x). Therefore {\mathbb{E}}(\delta_{i}(x))=0 and Cov(\delta_{i}(x),\delta_{i}(y))=\Sigma^{*}(x,y). Let \widehat{B},B^{*}\in\mathbb{R}^{m^{2}\times\ell^{2}} be such that for {\mu}=1,\ldots,m^{2} and {\eta}=1,\ldots,\ell^{2},

B^μ,η=Σ^[Φμ,Ψη]andBμ,η=Σ[Φμ,Ψη],\displaystyle\widehat{B}_{{\mu},{\eta}}=\widehat{\Sigma}[\Phi_{{\mu}},\Psi_{{\eta}}]\quad\text{and}\quad B_{{\mu},{\eta}}^{*}=\Sigma^{*}[\Phi_{{\mu}},\Psi_{{\eta}}],

where Σ^[Φμ,Ψη]\widehat{\Sigma}[\Phi_{{\mu}},\Psi_{{\eta}}] and Σ[Φμ,Ψη]\Sigma^{*}[\Phi_{{\mu}},\Psi_{{\eta}}] are defined according to (28). By the definition of B^\widehat{B} and BB^{*},

B^Bop=(Σ^Σ)×x𝒫×y𝒫op(𝐋2([κ]2)𝐋2([κ]2)).\displaystyle\|\widehat{B}-B^{*}\|_{\operatorname*{op}}=\|(\widehat{\Sigma}-\Sigma^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}({{{\bf L}_{2}}([{\kappa}]^{2})\otimes{{\bf L}_{2}}([{\kappa}]^{2})})}. (83)

Note that

B^BopB^𝔼(B^)op+𝔼(B^)Bop.\|\widehat{B}-B^{*}\|_{\operatorname*{op}}\leq\|\widehat{B}-{\mathbb{E}}(\widehat{B})\|_{\operatorname*{op}}+\|{\mathbb{E}}(\widehat{B})-B^{*}\|_{\operatorname*{op}}.

We estimate the above two terms separately.
Step 1. In this step, we control 𝔼(B^)Bop\|{\mathbb{E}}(\widehat{B})-B^{*}\|_{\operatorname*{op}}. Denote Cϵ=Var(ϵi(x))C_{\epsilon}=Var(\epsilon_{i}(x)). Since {ϵi(x)}i=1,,N,x[κ]2\{\epsilon_{i}(x)\}_{i=1,...,N,x\in[{\kappa}]^{2}} and {δi}i=1N\{\delta_{i}\}_{i=1}^{N} are independent,

𝔼(Wi(x)Wi(y))={I(x)I(y)+Σ(x,y)if xy,I(x)I(x)+Σ(x,x)+Cϵif x=y.{\mathbb{E}}(W_{i}(x)W_{i}(y))=\begin{cases}I^{*}(x)I^{*}(y)+\Sigma^{*}(x,y)&\text{if }x\not=y,\\ I^{*}(x)I^{*}(x)+\Sigma^{*}(x,x)+C_{\epsilon}&\text{if }x=y.\end{cases}

Therefore for any i=1,,Ni=1,\ldots,N,

𝔼(Wi(x)W¯(y))={I(x)I(y)+1NΣ(x,y) if xy,I(x)I(x)+1NΣ(x,x)+1NCϵ if x=y\displaystyle{\mathbb{E}}(W_{i}(x)\overline{W}(y))=\begin{cases}I^{*}(x)I^{*}(y)+\frac{1}{N}\Sigma^{*}(x,y)&\text{ if }x\not=y,\\ I^{*}(x)I^{*}(x)+\frac{1}{N}\Sigma^{*}(x,x)+\frac{1}{N}C_{\epsilon}&\text{ if }x=y\end{cases}

and

𝔼(W¯(x)W¯(y))={I(x)I(y)+1NΣ(x,y) if xy,I(x)I(x)+1NΣ(x,x)+1NCϵ if x=y.\displaystyle{\mathbb{E}}(\overline{W}(x)\overline{W}(y))=\begin{cases}I^{*}(x)I^{*}(y)+\frac{1}{N}\Sigma^{*}(x,y)&\text{ if }x\not=y,\\ I^{*}(x)I^{*}(x)+\frac{1}{N}\Sigma^{*}(x,x)+\frac{1}{N}C_{\epsilon}&\text{ if }x=y.\end{cases}

So

𝔼(Σ^(x,y))={Σ(x,y) if xy,Σ(x,x)+Cϵ if x=y,\displaystyle{\mathbb{E}}(\widehat{\Sigma}(x,y))=\begin{cases}\Sigma^{*}(x,y)&\text{ if }x\not=y,\\ \Sigma^{*}(x,x)+C_{\epsilon}&\text{ if }x=y,\end{cases}

and

𝔼(Σ^(x,y))Σ(x,y)={0 if xy,Cϵ if x=y.\displaystyle{\mathbb{E}}(\widehat{\Sigma}(x,y))-\Sigma^{*}(x,y)=\begin{cases}0&\text{ if }x\not=y,\\ C_{\epsilon}&\text{ if }x=y.\end{cases} (84)

By Lemma 12, it follows that

𝔼(Σ^)Σop(𝐋2([κ]2)𝐋2([κ]2))=Cϵκ2.\|{\mathbb{E}}(\widehat{\Sigma})-\Sigma^{*}\|_{\operatorname*{op}({{{\bf L}_{2}}([{\kappa}]^{2})\otimes{{\bf L}_{2}}([{\kappa}]^{2})})}=\frac{C_{\epsilon}}{{\kappa}^{2}}.

Therefore

𝔼(B^)Bop=(𝔼(Σ^)Σ)×x𝒫×y𝒫op(𝐋2([κ]2)𝐋2([κ]2))=Cϵκ2.\|{\mathbb{E}}(\widehat{B})-B^{*}\|_{\operatorname*{op}}=\|({\mathbb{E}}(\widehat{\Sigma})-\Sigma^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}({{{\bf L}_{2}}([{\kappa}]^{2})\otimes{{\bf L}_{2}}([{\kappa}]^{2})})}=\frac{C_{\epsilon}}{{\kappa}^{2}}.

Step 2. In this step, we bound \|{\mathbb{E}}(\widehat{B})-\widehat{B}\|_{\operatorname*{op}}. The procedure is similar to that in Lemma 9 and Lemma 22. Let v=(v_{1},\ldots,v_{m^{2}})\in\mathbb{R}^{m^{2}} and w=(w_{1},\ldots,w_{\ell^{2}})\in\mathbb{R}^{\ell^{2}} be such that \sum_{{\mu}=1}^{m^{2}}v_{{\mu}}^{2}=1 and \sum_{{\eta}=1}^{\ell^{2}}w_{{\eta}}^{2}=1. Denote

Φv(x)=μ=1m2vμΦμ(x)andΨw(y)=η=12wηΨη(y).\Phi_{v}(x)=\sum_{{\mu}=1}^{m^{2}}v_{{\mu}}\Phi_{{\mu}}(x)\quad\text{and}\quad\Psi_{w}(y)=\sum_{{\eta}=1}^{\ell^{2}}w_{{\eta}}\Psi_{{\eta}}(y).

Since {Φμ(x)}μ=1m2\{\Phi_{{\mu}}(x)\}_{{\mu}=1}^{m^{2}} and {Ψη(y)}η=12\{\Psi_{{\eta}}(y)\}_{{\eta}=1}^{\ell^{2}} are orthonormal basis functions of 𝐋2([κ]2){{\bf L}_{2}}([{\kappa}]^{2}), it follows that

Φv𝐋2([κ]2)=1andΨw𝐋2([κ]2)=1.\displaystyle\|\Phi_{v}\|_{{{\bf L}_{2}}([{\kappa}]^{2})}=1\quad\text{and}\quad\|\Psi_{w}\|_{{{\bf L}_{2}}([{\kappa}]^{2})}=1. (85)

Therefore,

v(𝔼(B^)B^)w=μ=1m2η=12vμΣ^[Φμ,Ψη]wη=μ=1m2η=12Σ^[vμΦμ,wηΨη]\displaystyle v^{\top}({\mathbb{E}}(\widehat{B})-\widehat{B})w=\sum_{{\mu}=1}^{m^{2}}\sum_{{\eta}=1}^{\ell^{2}}v_{{\mu}}\widehat{\Sigma}[\Phi_{{\mu}},\Psi_{{\eta}}]w_{{\eta}}=\sum_{{\mu}=1}^{m^{2}}\sum_{{\eta}=1}^{\ell^{2}}\widehat{\Sigma}[v_{{\mu}}\Phi_{{\mu}},w_{{\eta}}\Psi_{{\eta}}]
=\displaystyle= 1Ni=1N1κ4{x[κ]2{Wi(x)W¯(x)}Φv(x)y[κ]2{Wi(y)W¯(y)}Ψw(y)}\displaystyle\frac{1}{N}\sum_{i=1}^{N}\frac{1}{{\kappa}^{4}}\bigg{\{}\sum_{x\in[{\kappa}]^{2}}\big{\{}W_{i}(x)-\overline{W}(x)\big{\}}\Phi_{v}(x)\sum_{y\in[{\kappa}]^{2}}\big{\{}W_{i}(y)-\overline{W}(y)\big{\}}\Psi_{w}(y)\bigg{\}}
=\displaystyle= 1Nκ4i=1Nx[κ]2{Wi(x)𝔼(W(x))}Φv(x)y[κ]2{Wi(y)𝔼(W(y))}Ψw(y)\displaystyle\frac{1}{N{\kappa}^{4}}\sum_{i=1}^{N}\sum_{x\in[{\kappa}]^{2}}\big{\{}W_{i}(x)-{\mathbb{E}}(W(x))\big{\}}\Phi_{v}(x)\sum_{y\in[{\kappa}]^{2}}\big{\{}W_{i}(y)-{\mathbb{E}}(W(y))\big{\}}\Psi_{w}(y) (86)
+\displaystyle+ 1Nκ4i=1Nx[κ]2{𝔼(W(x))W¯(x)}Φv(x)y[κ]2{Wi(y)𝔼(W(y))}Ψw(y)\displaystyle\frac{1}{N{\kappa}^{4}}\sum_{i=1}^{N}\sum_{x\in[{\kappa}]^{2}}\big{\{}{\mathbb{E}}(W(x))-\overline{W}(x)\big{\}}\Phi_{v}(x)\sum_{y\in[{\kappa}]^{2}}\big{\{}W_{i}(y)-{\mathbb{E}}(W(y))\big{\}}\Psi_{w}(y) (87)
+\displaystyle+ 1Nκ4i=1Nx[κ]2{Wi(x)𝔼(W(x))}Φv(x)y[κ]2{𝔼(W(y))W¯(y)}Ψw(y)\displaystyle\frac{1}{N{\kappa}^{4}}\sum_{i=1}^{N}\sum_{x\in[{\kappa}]^{2}}\big{\{}W_{i}(x)-{\mathbb{E}}(W(x))\big{\}}\Phi_{v}(x)\sum_{y\in[{\kappa}]^{2}}\big{\{}{\mathbb{E}}(W(y))-\overline{W}(y)\big{\}}\Psi_{w}(y) (88)
+\displaystyle+ 1Nκ4i=1Nx[κ]2{𝔼(W(x))W¯(x)}Φv(x)y[κ]2{𝔼(W(y))W¯(y)}Ψw(y),\displaystyle\frac{1}{N{\kappa}^{4}}\sum_{i=1}^{N}\sum_{x\in[{\kappa}]^{2}}\big{\{}{\mathbb{E}}(W(x))-\overline{W}(x)\big{\}}\Phi_{v}(x)\sum_{y\in[{\kappa}]^{2}}\big{\{}{\mathbb{E}}(W(y))-\overline{W}(y)\big{\}}\Psi_{w}(y), (89)

where the third equality follows from (34) and (28).

Step 3. Here we bound the above four terms separately. Observe that

(86)=1Ni=1N{1κ2x[κ]2(δi(x)+ϵi(x))Φv(x)}{1κ2y[κ]2(δi(y)+ϵi(y))Ψw(y)}.\eqref{eq:image pca deviation bound term 1}=\frac{1}{N}\sum_{i=1}^{N}\bigg{\{}\frac{1}{{\kappa}^{2}}\sum_{x\in[{\kappa}]^{2}}(\delta_{i}(x)+\epsilon_{i}(x))\Phi_{v}(x)\bigg{\}}\bigg{\{}\frac{1}{{\kappa}^{2}}\sum_{y\in[{\kappa}]^{2}}(\delta_{i}(y)+\epsilon_{i}(y))\Psi_{w}(y)\bigg{\}}.

Since δi𝐋2([κ]2)\delta_{i}\in{{\bf L}_{2}}([{\kappa}]^{2}) is a subGaussian process with parameter CδC_{\delta}, and by (85), Φv𝐋2([κ]2)=1\|\Phi_{v}\|_{{{\bf L}_{2}}([{\kappa}]^{2})}=1, it follows that 1κ2x[κ]2(δi(x)+ϵi(x))Φv(x)\frac{1}{{\kappa}^{2}}\sum_{x\in[{\kappa}]^{2}}(\delta_{i}(x)+\epsilon_{i}(x))\Phi_{v}(x) is a subGaussian random variable with parameter (Cδ+Cϵ)2.\big{(}C_{\delta}+C_{\epsilon}\big{)}^{2}. Similarly, {1κ2y[κ]2(δi(y)+ϵi(y))Ψw(y)}\bigg{\{}\frac{1}{{\kappa}^{2}}\sum_{y\in[{\kappa}]^{2}}(\delta_{i}(y)+\epsilon_{i}(y))\Psi_{w}(y)\bigg{\}} is subGaussian with parameter (Cδ+Cϵ)2\big{(}C_{\delta}+C_{\epsilon}\big{)}^{2}.

Therefore {1κ2x[κ]2(δi(x)+ϵi(x))Φv(x)}{1κ2y[κ]2(δi(y)+ϵi(y))Ψw(y)}\bigg{\{}\frac{1}{{\kappa}^{2}}\sum_{x\in[{\kappa}]^{2}}(\delta_{i}(x)+\epsilon_{i}(x))\Phi_{v}(x)\bigg{\}}\bigg{\{}\frac{1}{{\kappa}^{2}}\sum_{y\in[{\kappa}]^{2}}(\delta_{i}(y)+\epsilon_{i}(y))\Psi_{w}(y)\bigg{\}} is sub-exponential with parameter (Cδ+Cϵ)4\big{(}C_{\delta}+C_{\epsilon}\big{)}^{4}. It follows that

(|(86)|t)2exp(cNt2(Cδ+Cϵ)4+t(Cδ+Cϵ)2).\mathbb{P}\bigg{(}\big{|}\eqref{eq:image pca deviation bound term 1}\big{|}\geq t\bigg{)}\leq 2\exp\bigg{(}-\frac{cNt^{2}}{(C_{\delta}+C_{\epsilon})^{4}+t(C_{\delta}+C_{\epsilon})^{2}}\bigg{)}.

For (87), note that 1κ2x[κ]2{1Ni=1N(δi(x)+ϵi(x))Φv(x)}\frac{1}{{\kappa}^{2}}\sum_{x\in[{\kappa}]^{2}}\bigg{\{}\frac{1}{N}\sum_{i=1}^{N}\big{(}\delta_{i}(x)+\epsilon_{i}(x)\big{)}\Phi_{v}(x)\bigg{\}} is subGaussian with parameter (Cδ+Cϵ)2N\frac{(C_{\delta}+C_{\epsilon})^{2}}{N} and 1κ2y[κ]2{(δi(y)+ϵi(y))Ψw(y)}\frac{1}{{\kappa}^{2}}\sum_{y\in[{\kappa}]^{2}}\bigg{\{}\big{(}\delta_{i}(y)+\epsilon_{i}(y)\big{)}\Psi_{w}(y)\bigg{\}} is subGaussian with parameter (Cδ+Cϵ)2\big{(}C_{\delta}+C_{\epsilon}\big{)}^{2}.
Therefore 1κ2x[κ]2{1Ni=1N(δi(x)+ϵi(x))Φv(x)}1κ2y[κ]2{(δi(y)+ϵi(y))Ψw(y)}\frac{1}{{\kappa}^{2}}\sum_{x\in[{\kappa}]^{2}}\bigg{\{}\frac{1}{N}\sum_{i=1}^{N}\big{(}\delta_{i}(x)+\epsilon_{i}(x)\big{)}\Phi_{v}(x)\bigg{\}}\frac{1}{{\kappa}^{2}}\sum_{y\in[{\kappa}]^{2}}\bigg{\{}\big{(}\delta_{i}(y)+\epsilon_{i}(y)\big{)}\Psi_{w}(y)\bigg{\}} is sub-exponential with parameter (Cδ+Cϵ)4N\frac{\big{(}C_{\delta}+C_{\epsilon}\big{)}^{4}}{N}. Consequently

(|(87)|t)2exp(cNt2(Cδ+Cϵ)4/N+t(Cδ+Cϵ)2/N).\mathbb{P}\bigg{(}\big{|}\eqref{eq:image pca deviation bound term 2}\big{|}\geq t\bigg{)}\leq 2\exp\bigg{(}-\frac{cNt^{2}}{(C_{\delta}+C_{\epsilon})^{4}/N+t(C_{\delta}+C_{\epsilon})^{2}/\sqrt{N}}\bigg{)}.

Similarly, it holds that

(|(88)|t)2exp(cNt2(Cδ+Cϵ)4/N+t(Cδ+Cϵ)2/N),and that\displaystyle\mathbb{P}\bigg{(}\big{|}\eqref{eq:image pca deviation bound term 3}\big{|}\geq t\bigg{)}\leq 2\exp\bigg{(}-\frac{cNt^{2}}{(C_{\delta}+C_{\epsilon})^{4}/N+t(C_{\delta}+C_{\epsilon})^{2}/\sqrt{N}}\bigg{)},\quad\text{and that}
(|(89)|t)2exp(cNt2(Cδ+Cϵ)4/N2+t(Cδ+Cϵ)2/N).\displaystyle\mathbb{P}\bigg{(}\big{|}\eqref{eq:image pca deviation bound term 4}\big{|}\geq t\bigg{)}\leq 2\exp\bigg{(}-\frac{cNt^{2}}{(C_{\delta}+C_{\epsilon})^{4}/N^{2}+t(C_{\delta}+C_{\epsilon})^{2}/N}\bigg{)}.

Step 4. Combining the bounds on the above four terms, the first term is dominant. Therefore,

(|v(𝔼(B^)B^)w|t)8exp(cNt2(Cδ+Cϵ)4+t(Cδ+Cϵ)2).\mathbb{P}(\big{|}v^{\top}({\mathbb{E}}(\widehat{B})-\widehat{B})w\big{|}\geq t)\leq 8\exp\bigg{(}-\frac{cNt^{2}}{(C_{\delta}+C_{\epsilon})^{4}+t(C_{\delta}+C_{\epsilon})^{2}}\bigg{)}.

Let 𝒩(14,m2)\mathcal{N}(\frac{1}{4},m^{2}) be a 1/41/4 covering net of the unit ball in m2\mathbb{R}^{m^{2}} and 𝒩(14,2)\mathcal{N}(\frac{1}{4},\ell^{2}) be a 1/41/4 covering net of the unit ball in 2\mathbb{R}^{\ell^{2}}. Then by 4.4.3 on page 90 of Vershynin (2018),

𝔼(B^)B^op2supv𝒩(14,m2),w𝒩(14,2)|v(𝔼(B^)B^)w|.\displaystyle\|{\mathbb{E}}(\widehat{B})-\widehat{B}\|_{\operatorname*{op}}\leq 2\sup_{v\in\mathcal{N}(\frac{1}{4},m^{2}),w\in\mathcal{N}(\frac{1}{4},\ell^{2})}|v^{\top}({\mathbb{E}}(\widehat{B})-\widehat{B})w|.

So by a union bound and the fact that the size of 𝒩(14,m2)\mathcal{N}(\frac{1}{4},m^{2}) is bounded by 9m29^{m^{2}} and the size of 𝒩(14,2)\mathcal{N}(\frac{1}{4},\ell^{2}) is bounded by 929^{\ell^{2}},

(𝔼(B^)B^opt)(supv𝒩(14,m2),w𝒩(14,2)|v(𝔼(B^)B^)w|t2)\displaystyle\mathbb{P}\bigg{(}\|{\mathbb{E}}(\widehat{B})-\widehat{B}\|_{\operatorname*{op}}\geq t\bigg{)}\leq\mathbb{P}\bigg{(}\sup_{v\in\mathcal{N}(\frac{1}{4},m^{2}),w\in\mathcal{N}(\frac{1}{4},\ell^{2})}|v^{\top}({\mathbb{E}}(\widehat{B})-\widehat{B})w|\geq\frac{t}{2}\bigg{)}
\displaystyle\leq 9m2+216exp(cNt2(Cδ+Cϵ)4+t(Cδ+Cϵ)2).\displaystyle 9^{m^{2}+\ell^{2}}*16\exp\bigg{(}-\frac{cNt^{2}}{(C_{\delta}+C_{\epsilon})^{4}+t(C_{\delta}+C_{\epsilon})^{2}}\bigg{)}. (90)

This implies that

𝔼(B^)B^op=O(m+N+m2+2N).\|{\mathbb{E}}(\widehat{B})-\widehat{B}\|_{\operatorname*{op}}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{m+\ell}{\sqrt{N}}+\frac{m^{2}+\ell^{2}}{N}\bigg{)}.

Therefore by Step 1 and Step 2

B^Bop𝔼(B^)Bop+B^𝔼(B^)op=\displaystyle\|\widehat{B}-B^{*}\|_{\operatorname*{op}}\leq\|{\mathbb{E}}(\widehat{B})-B^{*}\|_{\operatorname*{op}}+\|\widehat{B}-{\mathbb{E}}(\widehat{B})\|_{\operatorname*{op}}= O(m+N+m2+2N+1κ2)\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{m+\ell}{\sqrt{N}}+\frac{m^{2}+\ell^{2}}{N}+\frac{1}{{\kappa}^{2}}\bigg{)}
=\displaystyle= O(m+N+1κ2).\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{m+\ell}{\sqrt{N}}+\frac{1}{{\kappa}^{2}}\bigg{)}.

where the last equality follows from the fact that m+N=o(1).\frac{m+\ell}{\sqrt{N}}=o(1). The desired result follows from (83). ∎

Appendix F Perturbation bounds

F.1 Compact operators on Hilbert spaces

Lemma 14.

Let AA and BB be two compact self-adjoint operators on a Hilbert space 𝒲\mathcal{W}. Denote λk(A)\lambda_{k}(A) and λk(A+B)\lambda_{k}(A+B) to be the k-th eigenvalue of AA and A+BA+B respectively. Then

|λk(A+B)λk(A)|Bop.|\lambda_{k}(A+B)-\lambda_{k}(A)|\leq\|B\|_{\operatorname*{op}}.
Proof.

By the min-max principle, for any compact self-adjoint operator HH and any kk-dimensional subspace Sk𝒲S_{k}\subset\mathcal{W},

maxSkminxSk,x𝒲=1H[x,x]=λk(H).\max_{S_{k}}\min_{x\in S_{k},\|x\|_{\mathcal{W}}=1}H[x,x]=\lambda_{k}(H).

It follows that

λk(A+B)=\displaystyle\lambda_{k}(A+B)= maxSkminxSk,x𝒲=1(A+B)[x,x]\displaystyle\max_{S_{k}}\min_{x\in S_{k},\|x\|_{\mathcal{W}}=1}(A+B)[x,x]
\displaystyle\leq maxSkminxSk,x𝒲=1A[x,x]+Bopx𝒲2\displaystyle\max_{S_{k}}\min_{x\in S_{k},\|x\|_{\mathcal{W}}=1}A[x,x]+\|B\|_{\operatorname*{op}}\|x\|_{\mathcal{W}}^{2}
=\displaystyle= λk(A)+Bop.\displaystyle\lambda_{k}(A)+\|B\|_{\operatorname*{op}}.

The other direction follows from symmetry. ∎
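To make the statement of Lemma 14 concrete, the following small NumPy sketch (an illustration added here, not part of the original argument) checks the bound in the finite-dimensional case, where compact self-adjoint operators are symmetric matrices; the matrix size and the scale of the perturbation are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Random finite-dimensional self-adjoint operators A and B.
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
B = rng.standard_normal((n, n)); B = 0.05 * (B + B.T) / 2

# Eigenvalues sorted in decreasing order, matching the indexing of lambda_k.
eig_A = np.sort(np.linalg.eigvalsh(A))[::-1]
eig_AB = np.sort(np.linalg.eigvalsh(A + B))[::-1]

max_shift = np.max(np.abs(eig_AB - eig_A))
op_norm_B = np.linalg.norm(B, 2)

# Lemma 14 (Weyl-type inequality) predicts max_shift <= ||B||_op.
print(f"max_k |lambda_k(A+B) - lambda_k(A)| = {max_shift:.4f}")
print(f"||B||_op                           = {op_norm_B:.4f}")
assert max_shift <= op_norm_B + 1e-10
```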

For any compact operator H𝒲𝒲H\in\mathcal{W}\otimes\mathcal{W}^{\prime}, by Theorem 13 of Bell (2014), there exist orthonormal families {uk}k=1\{u_{k}\}_{k=1}^{\infty} and {vk}k=1\{v_{k}\}_{k=1}^{\infty} such that

H=k=1σk(H)ukvk,H=\sum_{k=1}^{\infty}\sigma_{k}(H)u_{k}\otimes v_{k},

where σ1(H)σ2(H)0\sigma_{1}(H)\geq\sigma_{2}(H)\geq\ldots\geq 0 are the singular values of HH. So

λk(HH)=σk(H)2.\displaystyle\lambda_{k}(HH^{\top})=\sigma_{k}(H)^{2}. (91)
Lemma 15.

Let 𝒲\mathcal{W} and 𝒲\mathcal{W}^{\prime} be two separable Hilbert spaces. Suppose AA and BB are two compact operators from 𝒲𝒲\mathcal{W}\otimes\mathcal{W}^{\prime}\to\mathbb{R}. Then

|σk(A+B)σk(A)|Bop.\big{|}\sigma_{k}(A+B)-\sigma_{k}(A)\big{|}\leq\|B\|_{\operatorname*{op}}.
Proof.

Let {ϕi}i=1\{\phi_{i}\}_{i=1}^{\infty} and {ϕi}i=1\{\phi_{i}^{\prime}\}_{i=1}^{\infty} be orthonormal bases of 𝒲\mathcal{W} and 𝒲\mathcal{W}^{\prime}, respectively. Let

𝒲j=Span({ϕi}i=1j)and𝒲j=Span({ϕi}i=1j).\mathcal{W}_{j}=\operatorname*{Span}(\{\phi_{i}\}_{i=1}^{j})\quad\text{and}\quad\mathcal{W}_{j}^{\prime}=\operatorname*{Span}(\{\phi_{i}^{\prime}\}_{i=1}^{j}).

Denote

Aj=A×𝒫𝒲j×𝒫𝒲jand(A+B)j=(A+B)×𝒫𝒲j×𝒫𝒲j.A_{j}=A\times\mathcal{P}_{\mathcal{W}_{j}}\times\mathcal{P}_{\mathcal{W}_{j}^{\prime}}\quad\text{and}\quad(A+B)_{j}=(A+B)\times\mathcal{P}_{\mathcal{W}_{j}}\times\mathcal{P}_{\mathcal{W}_{j}^{\prime}}.

Note that (A+B)j=Aj+Bj(A+B)_{j}=A_{j}+B_{j} due to linearity. Since AA and A+BA+B are compact,

limjAAjF=0andlimj(A+B)(A+B)jF=0.\lim_{j\to\infty}\|A-A_{j}\|_{F}=0\quad\text{and}\quad\lim_{j\to\infty}\|(A+B)-(A+B)_{j}\|_{F}=0.

Then AAAA^{\top} and AjAjA_{j}A_{j}^{\top} are two compact self-adjoint operators on 𝒲\mathcal{W}, and

limjAAAjAjF=0.\lim_{j\to\infty}\|AA^{\top}-A_{j}A_{j}^{\top}\|_{F}=0.

By Lemma 14, limjλk(AjAj)=λk(AA).\lim_{j\to\infty}\lambda_{k}(A_{j}A_{j}^{\top})=\lambda_{k}(AA^{\top}). Since σk(Aj)\sigma_{k}(A_{j}) and σk(A)\sigma_{k}(A) are both positive, by (91)

limjσk(Aj)=σk(A).\lim_{j\to\infty}\sigma_{k}(A_{j})=\sigma_{k}(A).

Similarly

limjσk((A+B)j)=σk(A+B).\lim_{j\to\infty}\sigma_{k}((A+B)_{j})=\sigma_{k}(A+B).

By the finite dimensional SVD perturbation theory (see Theorem 3.3.16 on page 178 of Horn and Johnson (1994)), it follows that

|σk((A+B)j)σk(Aj)|BjopBop.\big{|}\sigma_{k}((A+B)_{j})-\sigma_{k}(A_{j})\big{|}\leq\|B_{j}\|_{\operatorname*{op}}\leq\|B\|_{\operatorname*{op}}.

The desired result follows by taking the limit as jj\to\infty.

F.2 Subspace perturbation bounds

Theorem 6 (Wedin).

Suppose without loss of generality that n1n2n_{1}\leq n_{2}. Let M=M+EM=M^{*}+E and MM^{*} be two matrices in n1×n2\mathbb{R}^{n_{1}\times n_{2}} whose SVDs are given respectively by

M=i=1n1σiuiviandM=i=1n1σiuivi\displaystyle M^{*}=\sum_{i=1}^{n_{1}}\sigma^{*}_{i}u_{i}^{*}{v_{i}^{*}}^{\top}\quad\text{and}\quad M=\sum_{i=1}^{n_{1}}\sigma_{i}u_{i}{v_{i}}^{\top}

where σ1σn1\sigma_{1}^{*}\geq\cdots\geq\sigma_{n_{1}}^{*} and σ1σn1\sigma_{1}\geq\cdots\geq\sigma_{n_{1}}. For any rn1r\leq n_{1}, let

\Sigma^{*}=\text{diag}([\sigma_{1}^{*},\cdots,\sigma_{r}^{*}])\in\mathbb{R}^{r\times r},\quad U^{*}=[u_{1}^{*},\cdots,u_{r}^{*}]\in\mathbb{R}^{n_{1}\times r},\quad V^{*}=[v_{1}^{*},\cdots,v_{r}^{*}]\in\mathbb{R}^{n_{2}\times r},
\Sigma=\text{diag}([\sigma_{1},\cdots,\sigma_{r}])\in\mathbb{R}^{r\times r},\quad U=[u_{1},\cdots,u_{r}]\in\mathbb{R}^{n_{1}\times r},\quad V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{n_{2}\times r}.

Denote 𝒫U=i=1ruiui\mathcal{P}_{U^{*}}=\sum_{i=1}^{r}u^{*}_{i}\otimes u^{*}_{i} and 𝒫U=i=1ruiui\mathcal{P}_{U}=\sum_{i=1}^{r}u_{i}\otimes u_{i}. If Eop<σrσr+1\|E\|_{\operatorname*{op}}<\sigma_{r}^{*}-\sigma_{r+1}^{*}, then

𝒫U𝒫Uop2max{UEop,EVop}σrσr+1Eop.\|\mathcal{P}_{U^{*}}-\mathcal{P}_{U}\|_{op}\leq\frac{\sqrt{2}\max\{\|U^{*}E\|_{\operatorname*{op}},\|EV^{*}\|_{\operatorname*{op}}\}}{\sigma_{r}^{*}-\sigma_{r+1}^{*}-\|E\|_{\operatorname*{op}}}.
Proof.

This follows from Lemma 2.6 and Theorem 2.9 of Chen et al. (2021). ∎
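As an illustration of Theorem 6, the following NumPy sketch compares the projection distance with the weaker bound \sqrt{2}\|E\|_{\operatorname*{op}}/(\sigma_{r}^{*}-\sigma_{r+1}^{*}-\|E\|_{\operatorname*{op}}) that also appears in Corollary 4 below; the matrix sizes, the singular value profile, and the noise scale are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r = 30, 40, 3

# Ground-truth matrix with a clear gap between sigma_r^* and sigma_{r+1}^*.
U0, _ = np.linalg.qr(rng.standard_normal((n1, n1)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, n2)))
sing = np.concatenate([np.array([5.0, 4.5, 4.0]), 0.5 * rng.random(n1 - r)])
M_star = U0 @ np.diag(sing) @ V0[:, :n1].T

E = 0.1 * rng.standard_normal((n1, n2))
M = M_star + E

def top_r_left_projection(A, r):
    # Orthogonal projection onto the span of the top-r left singular vectors.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ U[:, :r].T

P_star = top_r_left_projection(M_star, r)
P_hat = top_r_left_projection(M, r)
lhs = np.linalg.norm(P_star - P_hat, 2)

s = np.sort(np.linalg.svd(M_star, compute_uv=False))[::-1]
gap = s[r - 1] - s[r]
rhs = np.sqrt(2) * np.linalg.norm(E, 2) / (gap - np.linalg.norm(E, 2))

print(f"||P_U* - P_U||_op = {lhs:.4f},  Wedin-type bound = {rhs:.4f}")
```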

Corollary 4.

Let 𝒲\mathcal{W} and 𝒲\mathcal{W}^{\prime} be two Hilbert spaces. Let MM^{*} and EE be two finite rank operators on 𝒲𝒲\mathcal{W}\otimes\mathcal{W}^{\prime} and denote M=M+EM=M^{*}+E. Let the SVDs of MM^{*} and MM be given respectively by

M=i=1r1σiuiviandM=i=1r2σiuivi\displaystyle M^{*}=\sum_{i=1}^{r_{1}}\sigma^{*}_{i}u_{i}^{*}\otimes v_{i}^{*}\quad\text{and}\quad M=\sum_{i=1}^{r_{2}}\sigma_{i}u_{i}\otimes v_{i}

where σ1σr1\sigma_{1}^{*}\geq\cdots\geq\sigma_{r_{1}}^{*} and σ1σr2\sigma_{1}\geq\cdots\geq\sigma_{r_{2}}. For rmin{r1,r2}r\leq min\{r_{1},r_{2}\}, denote

U=Span({ui}i=1r)andU=Span({ui}i=1r)\displaystyle U^{*}=\operatorname*{Span}(\{u_{i}^{*}\}_{i=1}^{r})\quad\text{and}\quad U=\operatorname*{Span}(\{u_{i}\}_{i=1}^{r})

Let 𝒫U\mathcal{P}_{U^{*}} be the projection operator from 𝒲\mathcal{W} onto UU^{*}, and 𝒫U\mathcal{P}_{U} be the projection operator from 𝒲\mathcal{W} onto UU. If Eop<σrσr+1\|E\|_{\operatorname*{op}}<\sigma_{r}^{*}-\sigma_{r+1}^{*}, then

𝒫U𝒫Uop2max{E×1Uop,E×2Vop}σrσr+1Eop2Eopσrσr+1Eop.\|\mathcal{P}_{U^{*}}-\mathcal{P}_{U}\|_{\operatorname*{op}}\leq\frac{\sqrt{2}\max\{\|E\times_{1}U^{*}\|_{\operatorname*{op}},\|E\times_{2}V^{*}\|_{\operatorname*{op}}\}}{\sigma_{r}^{*}-\sigma_{r+1}^{*}-\|E\|_{\operatorname*{op}}}\leq\frac{\sqrt{2}\|E\|_{\operatorname*{op}}}{\sigma_{r}^{*}-\sigma_{r+1}^{*}-\|E\|_{\operatorname*{op}}}.
Proof.

Let

𝒮=Span({ui}i=1r1,{ui}i=1r2)and𝒮=Span({vi}r=1r1,{vi}i=1r2).\mathcal{S}=\operatorname*{Span}(\{u_{i}^{*}\}_{i=1}^{r_{1}},\{u_{i}\}_{i=1}^{r_{2}})\quad\text{and}\quad\mathcal{S}^{\prime}=\operatorname*{Span}(\{v_{i}^{*}\}_{r=1}^{r_{1}},\{v_{i}\}_{i=1}^{r_{2}}).

Then MM^{*} and MM can be viewed as finite-dimensional matrices on 𝒮𝒮\mathcal{S}\otimes\mathcal{S}^{\prime}. Since 𝒫U=i=1ruiui\mathcal{P}_{U^{*}}=\sum_{i=1}^{r}u^{*}_{i}\otimes u^{*}_{i} and 𝒫U=i=1ruiui\mathcal{P}_{U}=\sum_{i=1}^{r}u_{i}\otimes u_{i}, the desired result follows from Theorem 6.

Appendix G Additional technical results

Lemma 16.

For positive integers {sj}j=1d\{s_{j}\}_{j=1}^{d}\in\mathbb{N}, let Ωjsj\Omega_{j}\subset\mathbb{R}^{s_{j}}. Let B𝐋2(Ω1)𝐋2(Ω2)𝐋2(Ωd)B\in{{\bf L}_{2}}(\Omega_{1})\otimes{{\bf L}_{2}}(\Omega_{2})\cdots\otimes{{\bf L}_{2}}(\Omega_{d}) and for 1jd1\leq j\leq d, let 𝒬j𝐋2(Ωj)𝐋2(Ωj)\mathcal{Q}_{j}\in{{\bf L}_{2}}(\Omega_{j})\otimes{{\bf L}_{2}}(\Omega_{j}) be a collection of operators such that 𝒬jF<\|\mathcal{Q}_{j}\|_{F}<\infty. Then

B×1𝒬1×d𝒬dFBF𝒬1op𝒬dop\displaystyle\|B\times_{1}\mathcal{Q}_{1}\cdots\times_{d}\mathcal{Q}_{d}\|_{F}\leq\|B\|_{F}\|\mathcal{Q}_{1}\|_{\operatorname*{op}}\cdots\|\mathcal{Q}_{d}\|_{\operatorname*{op}}
Proof.

By Theorem 3, we can write

𝒬j=μj=1νj,μjϕj,μjψj,μj,\mathcal{Q}_{j}=\sum_{{\mu}_{j}=1}^{\infty}\nu_{j,{\mu}_{j}}\phi_{j,{\mu}_{j}}\otimes\psi_{j,{\mu}_{j}},

where {ψj,μj}μj=1\{\psi_{j,{\mu}_{j}}\}_{{\mu}_{j}=1}^{\infty} and {ϕj,μj}μj=1\{\phi_{j,{\mu}_{j}}\}_{{\mu}_{j}=1}^{\infty} are both orthonormal in 𝐋2(Ωj){{\bf L}_{2}}(\Omega_{j}). Note that |νj,μj|𝒬jop.|\nu_{j,{\mu}_{j}}|\leq\|\mathcal{Q}_{j}\|_{\operatorname*{op}}. Therefore

B×1𝒬1×d𝒬dF2=\displaystyle\|B\times_{1}\mathcal{Q}_{1}\cdots\times_{d}\mathcal{Q}_{d}\|_{F}^{2}= μ1,,μd=1{B×1𝒬1×d𝒬d[ψ1,μ1,,ψd,μd]}2\displaystyle\sum_{{\mu}_{1},\ldots,{\mu}_{d}=1}^{\infty}\big{\{}B\times_{1}\mathcal{Q}_{1}\cdots\times_{d}\mathcal{Q}_{d}[\psi_{1,{\mu}_{1}},\ldots,\psi_{d,{\mu}_{d}}]\big{\}}^{2}
=\displaystyle= μ1,,μd=1{B[𝒬1(ψ1,μ1),,𝒬d(ψd,μd)]}2\displaystyle\sum_{{\mu}_{1},\ldots,{\mu}_{d}=1}^{\infty}\big{\{}B[\mathcal{Q}_{1}(\psi_{1,{\mu}_{1}}),\ldots,\mathcal{Q}_{d}(\psi_{d,{\mu}_{d}})]\big{\}}^{2}
=\displaystyle= μ1,,μd=1{B[ν1,μ1ϕ1,μ1,,νd,μdϕd,μd]}2\displaystyle\sum_{{\mu}_{1},\ldots,{\mu}_{d}=1}^{\infty}\big{\{}B[\nu_{1,{\mu}_{1}}\phi_{1,{\mu}_{1}},\ldots,\nu_{d,{\mu}_{d}}\phi_{d,{\mu}_{d}}]\big{\}}^{2}
\displaystyle\leq 𝒬1op2𝒬dop2μ1,,μd=1{B[ϕ1,μ1,,ϕd,μd]}2\displaystyle\|\mathcal{Q}_{1}\|_{\operatorname*{op}}^{2}\cdots\|\mathcal{Q}_{d}\|_{\operatorname*{op}}^{2}\sum_{{\mu}_{1},\ldots,{\mu}_{d}=1}^{\infty}\big{\{}B[\phi_{1,{\mu}_{1}},\ldots,\phi_{d,{\mu}_{d}}]\big{\}}^{2}
=\displaystyle= 𝒬1op2𝒬dop2BF2.\displaystyle\|\mathcal{Q}_{1}\|_{\operatorname*{op}}^{2}\cdots\|\mathcal{Q}_{d}\|_{\operatorname*{op}}^{2}\|B\|_{F}^{2}.
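A finite-dimensional analogue of Lemma 16 can be checked directly. In the NumPy sketch below (an illustration added here), BB is a 3-way array and each 𝒬j\mathcal{Q}_{j} is a matrix acting on mode jj; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite-dimensional analogue: a 3-way tensor B and one operator (matrix) Q_j
# applied along each mode via the mode-j product.
B = rng.standard_normal((4, 5, 6))
Q1 = rng.standard_normal((4, 4))
Q2 = rng.standard_normal((5, 5))
Q3 = rng.standard_normal((6, 6))

# B x_1 Q1 x_2 Q2 x_3 Q3 written as a single einsum contraction.
BQ = np.einsum("abc,ia,jb,kc->ijk", B, Q1, Q2, Q3)

lhs = np.linalg.norm(BQ)                       # Frobenius norm of the product
rhs = np.linalg.norm(B) * np.prod([np.linalg.norm(Q, 2) for Q in (Q1, Q2, Q3)])
print(f"||B x_1 Q1 x_2 Q2 x_3 Q3||_F = {lhs:.3f} <= {rhs:.3f}")
assert lhs <= rhs + 1e-8
```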

Lemma 17.

For j=1,,dj=1,\ldots,d, let {ϕj,i}i=1\{\phi_{j,i}\}_{i=1}^{\infty} be a collection of orthonormal basis functions of 𝐋2(Ωj){{\bf L}_{2}}(\Omega_{j}). Suppose A𝐋2(Ω1)𝐋2(Ωd)A\in{{\bf L}_{2}}(\Omega_{1})\otimes\cdots\otimes{{\bf L}_{2}}(\Omega_{d}) is such that AF<\|A\|_{F}<\infty. Then AA is a function in 𝐋2(Ω1××Ωd){{\bf L}_{2}}(\Omega_{1}\times\cdots\times\Omega_{d}) and

A(z1,,zd)=i1,,id=1A[ϕ1,i1,,ϕd,id]ϕ1,i1(z1)ϕd,id(zd).\displaystyle A(z_{1},\ldots,z_{d})=\sum_{i_{1},\ldots,i_{d}=1}^{\infty}A[\phi_{1,i_{1}},\ldots,\phi_{d,i_{d}}]\phi_{1,i_{1}}(z_{1})\cdots\phi_{d,i_{d}}(z_{d}). (92)

Note that (92) is independent of the choice of basis functions.

Proof.

This is a classical functional analysis result. ∎

Lemma 18.

For kdk\leq d, let x=(z1,,zk)Ω1××Ωkx=(z_{1},\ldots,z_{k})\in\Omega_{1}\times\cdots\times\Omega_{k} and y=(zk+1,,zd)Ωk+1××Ωdy=(z_{k+1},\ldots,z_{d})\in\Omega_{k+1}\times\cdots\times\Omega_{d}. Let A=A(x,y)𝐋2(Ω1××Ωk)𝐋2(Ωk+1××Ωd)A=A(x,y)\in{{\bf L}_{2}}(\Omega_{1}\times\cdots\times\Omega_{k})\otimes{{\bf L}_{2}}(\Omega_{k+1}\times\cdots\times\Omega_{d}). For 1jk1\leq j\leq k, let 𝒰j𝐋2(Ωj)\mathcal{U}_{j}\subset{{\bf L}_{2}}(\Omega_{j}) be subspaces, and let 𝒰x𝐋2(Ω1××Ωk)\mathcal{U}_{x}\subset{{\bf L}_{2}}(\Omega_{1}\times\cdots\times\Omega_{k}) be such that 𝒰x=𝒰1𝒰k\mathcal{U}_{x}=\mathcal{U}_{1}\otimes\cdots\otimes\mathcal{U}_{k}. Then

A×x𝒫𝒰x=A×1𝒫𝒰1×k𝒫𝒰k.\displaystyle A\times_{x}\mathcal{P}_{\mathcal{U}_{x}}=A\times_{1}\mathcal{P}_{\mathcal{U}_{1}}\cdots\times_{k}\mathcal{P}_{\mathcal{U}_{k}}. (93)
Proof.

For generic functions fj𝐋2(Ωj)f_{j}\in{{\bf L}_{2}}(\Omega_{j}), it holds that

𝒫𝒰x[f1,,fk](z1,,zd)=𝒫𝒰1(f1)(z1)𝒫𝒰k(fk)(zk).\mathcal{P}_{\mathcal{U}_{x}}[f_{1},\ldots,f_{k}](z_{1},\ldots,z_{d})=\mathcal{P}_{\mathcal{U}_{1}}(f_{1})(z_{1})\cdots\mathcal{P}_{\mathcal{U}_{k}}(f_{k})(z_{k}).

Therefore (93) follows from the observation that

A×x𝒫𝒰x[f1,,fk,fk+1,,fd)]\displaystyle A\times_{x}\mathcal{P}_{\mathcal{U}_{x}}[f_{1},\ldots,f_{k},f_{k+1},\ldots,f_{d})]
=\displaystyle= A[𝒫𝒰x(f1,,fk),fk+1,,fd)]\displaystyle A[\mathcal{P}_{\mathcal{U}_{x}}(f_{1},\ldots,f_{k}),f_{k+1},\ldots,f_{d})]
=\displaystyle= A[𝒫𝒰1(f1),,𝒫𝒰k(fk),fk+1,,fd)]\displaystyle A[\mathcal{P}_{\mathcal{U}_{1}}(f_{1}),\ldots,\mathcal{P}_{\mathcal{U}_{k}}(f_{k}),f_{k+1},\ldots,f_{d})]
=\displaystyle= A×1𝒫𝒰1×k𝒫𝒰k[f1,,fk,fk+1,,fd].\displaystyle A\times_{1}\mathcal{P}_{\mathcal{U}_{1}}\cdots\times_{k}\mathcal{P}_{\mathcal{U}_{k}}[f_{1},\ldots,f_{k},f_{k+1},\ldots,f_{d}].

Lemma 19.

Let A𝐋2(Ω1)𝐋2(Ωd)A\in{{\bf L}_{2}}(\Omega_{1})\otimes\cdots\otimes{{\bf L}_{2}}(\Omega_{d}) be any tensor. For j=1,,dj=1,\ldots,d, suppose

Rangej(A)=Span{uj,ρj}ρj=1rj,{\operatorname*{Range}}_{j}(A)=\operatorname*{Span}\{u_{j,{\rho}_{j}}\}_{{\rho}_{j}=1}^{r_{j}},

where {uj,ρj}ρj=1rj\{u_{j,{\rho}_{j}}\}_{{\rho}_{j}=1}^{r_{j}} are orthonormal functions in 𝐋2(Ωj){{\bf L}_{2}}(\Omega_{j}). Then

A(z1,,zd)=ρ1=1r1ρd=1rdA[u1,ρ1,,ud,ρd]u1,ρ1(z1)ud,ρd(zd).A(z_{1},\ldots,z_{d})=\sum_{{\rho}_{1}=1}^{r_{1}}\cdots\sum_{{\rho}_{d}=1}^{r_{d}}A[u_{1,{\rho}_{1}},\ldots,u_{d,{\rho}_{d}}]u_{1,{\rho}_{1}}(z_{1})\cdots u_{d,{\rho}_{d}}(z_{d}).

Therefore the core size of the tensor AA is j=1drj\prod_{j=1}^{d}r_{j}.

Proof.

It suffices to observe that, as a linear map, AA is 0 on the orthogonal complement of the subspace Range1(A)Ranged(A)\operatorname*{Range}_{1}(A)\otimes\cdots\otimes\operatorname*{Range}_{d}(A), and that {u1,ρ1(z1)ud,ρd(zd)}ρ1=1,,ρd=1r1,,rd\{u_{1,{\rho}_{1}}(z_{1})\cdots u_{d,{\rho}_{d}}(z_{d})\}_{{\rho}_{1}=1,\ldots,{\rho}_{d}=1}^{r_{1},\ldots,r_{d}} form an orthonormal basis of Range1(A)Ranged(A)\operatorname*{Range}_{1}(A)\otimes\cdots\otimes\operatorname*{Range}_{d}(A). ∎

Lemma 20.

Let \mathcal{M} be a linear subspace of 𝐋2(Ω1){{\bf L}_{2}}(\Omega_{1}) spanned by orthonormal basis functions, \mathcal{M}=\operatorname*{Span}\{v_{{\mu}}(x)\}_{{\mu}=1}^{dim(\mathcal{M})}. Suppose g:Ω1g:\Omega_{1}\to\mathbb{R} is a generic function in 𝐋2(Ω1){{\bf L}_{2}}(\Omega_{1}). If

g~(x)=argminfg(x)f(x)𝐋2(Ω1)2,\widetilde{g}(x)=\operatorname*{arg\,min}_{f\in\mathcal{M}}\big{\|}g(x)-f(x)\big{\|}^{2}_{{{\bf L}_{2}}(\Omega_{1})},

Then g~(x)=μ=1dim()aμvμ(x)\widetilde{g}(x)=\sum_{{\mu}=1}^{\dim(\mathcal{M})}a_{\mu}v_{{\mu}}(x), where

aμ=g,vμ=Ω1g(x)vμ(x)dx.a_{\mu}=\langle g,v_{{\mu}}\rangle=\int_{\Omega_{1}}g(x)v_{{\mu}}(x)\mathrm{d}x.
Proof.

This is a well-known projection property in Hilbert space. ∎
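For concreteness, the following sketch carries out the projection of Lemma 20 numerically on Ω1=[0,1]\Omega_{1}=[0,1], using shifted, normalized Legendre polynomials as the orthonormal basis and Gauss–Legendre quadrature to approximate the inner products; the test function gg, the basis size K, and the quadrature order are illustrative choices, not fixed by the paper.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legval

# Minimal sketch of Lemma 20 on Omega_1 = [0,1]: project a generic g onto the
# span of the first K shifted, normalized Legendre polynomials.
K = 8
nodes, weights = leggauss(60)            # Gauss-Legendre rule on [-1, 1]
x = (nodes + 1) / 2                      # map nodes to [0, 1]
w = weights / 2                          # rescale weights accordingly

def v(mu, t):
    # Orthonormal shifted Legendre polynomial of degree mu on [0, 1].
    c = np.zeros(mu + 1); c[mu] = 1.0
    return np.sqrt(2 * mu + 1) * legval(2 * t - 1, c)

g = lambda t: np.exp(t) * np.sin(3 * t)  # a generic L2([0,1]) function

# a_mu = <g, v_mu>, then g_tilde = sum_mu a_mu v_mu  (the L2 projection).
a = np.array([np.sum(w * g(x) * v(mu, x)) for mu in range(K)])
g_tilde = lambda t: sum(a[mu] * v(mu, t) for mu in range(K))

err = np.sqrt(np.sum(w * (g(x) - g_tilde(x)) ** 2))
print(f"L2([0,1]) projection error with K={K}: {err:.2e}")
```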

Appendix H Extension: multivariable nonparametric regression

In this section, we apply our sketching estimator to nonparametric regression models. To begin, suppose the observed data {Wi,Zi}i=1N×d\{W_{i},Z_{i}\}_{i=1}^{N}\subset\mathbb{R}\times\mathbb{R}^{d} satisfy

Wi=f(Zi)+ϵi,\displaystyle W_{i}=f^{*}(Z_{i})+\epsilon_{i}, (94)

where {ϵi}i=1N\{\epsilon_{i}\}_{i=1}^{N} are measurement errors and f:[0,1]df^{*}:[0,1]^{d}\to\mathbb{R} is the unknown regression function. We first present our theory in Corollary 5, assuming that the random designs {Zi}i=1N\{Z_{i}\}_{i=1}^{N} are independently sampled from the uniform density on the domain [0,1]d[0,1]^{d}. The general setting, where {Zi}i=1N\{Z_{i}\}_{i=1}^{N} are sampled from an unknown generic density function, will be discussed in Corollary 6.

Let f^\widehat{f} be the estimator such that for any non-random function G𝐋2([0,1]d)G\in{{\bf L}_{2}}([0,1]^{d}),

f^,G=1Ni=1NWiG(Zi).\displaystyle\langle\widehat{f},G\rangle=\frac{1}{N}\sum_{i=1}^{N}W_{i}G(Z_{i}). (95)

where G(Zi)G(Z_{i}) is the value of function GG evaluated at the sample point Zi[0,1]dZ_{i}\in[0,1]^{d}. In the following corollary, we formally summarize the statistical guarantee of the regression function estimator detailed in Algorithm 2.
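In words, (95) says that the action of f^\widehat{f} on any test function GG is a plain sample average of WiG(Zi)W_{i}G(Z_{i}). The following sketch (with an illustrative ff^{*}, noise level, and test function GG that are our own choices, not from the paper) shows that this sample average is close to the inner product of ff^{*} with GG under the uniform design.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 200_000, 3

# Simulated data from model (94) with uniform design on [0,1]^d.
f_star = lambda Z: np.sin(Z.sum(axis=1))
Z = rng.random((N, d))
W = f_star(Z) + 0.1 * rng.standard_normal(N)

G = lambda Z: np.cos(Z[:, 0]) * Z[:, 1]      # a fixed non-random test function

# <f_hat, G> as defined in (95): a sample average of W_i G(Z_i).
inner_hat = np.mean(W * G(Z))

# Monte Carlo reference for <f_star, G> under the uniform design.
Z_mc = rng.random((2_000_000, d))
inner_true = np.mean(f_star(Z_mc) * G(Z_mc))

print(f"<f_hat, G> = {inner_hat:.4f},  <f*, G> ~ {inner_true:.4f}")
```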

Corollary 5.

Suppose the observed data {Wi,Zi}i=1N×[0,1]d\{W_{i},Z_{i}\}_{i=1}^{N}\in\mathbb{R}\times[0,1]^{d} satisfy (94), where {ϵi}i=1N\{\epsilon_{i}\}_{i=1}^{N} are i.i.d. centered subGaussian noise with subGaussian parameter CϵC_{\epsilon}, {Zi}i=1N\{Z_{i}\}_{i=1}^{N} are independently sampled from the uniform density distribution on [0,1]d[0,1]^{d}, and that fW2α([0,1]d)<\|f^{*}\|_{W^{\alpha}_{2}([0,1]^{d})}<\infty with α1\alpha\geq 1 and f<\|f^{*}\|_{\infty}<\infty.

Suppose in addition that ff^{*} satisfies 3, and that {j,j}j=1d\{\mathcal{M}_{j},\mathcal{L}_{j}\}_{j=1}^{d} are in the form of (24) and (25), where {ϕk𝕊}k=1𝐋2([0,1])\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty}\subset{{\bf L}_{2}}([0,1]) are derived from reproducing kernel Hilbert spaces, the Legendre polynomial basis, or spline basis functions.

Let f^\widehat{f} in (95), {rj}j=1d\{{{r}}_{j}\}_{j=1}^{d}, and {j,j}j=1d\{\mathcal{M}_{j},\mathcal{L}_{j}\}_{j=1}^{d} be the input of Algorithm 2, and f^×1𝒫^1×d𝒫^d\widehat{f}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d} be the corresponding output. Denote σmin=minj=1,,d{σj,rj}\sigma_{\min}=\min_{j=1,\ldots,d}\{\sigma_{j,{{r}}_{j}}\} and suppose for a sufficiently large constant CsnrC_{snr},

N2α/(2α+1)>Csnrmax{j=1drj,1σmin(d1)/α+2,1σmin2α/(α1/2)}.\displaystyle N^{2\alpha/(2\alpha+1)}>C_{snr}\max\bigg{\{}\prod_{j=1}^{d}{{r}}_{j},\frac{1}{\sigma^{(d-1)/\alpha+2}_{\min}},\frac{1}{\sigma^{2\alpha/(\alpha-1/2)}_{\min}}\bigg{\}}.

If j=CLσj,rj1/α\ell_{j}=C_{L}\sigma_{j,{{r}}_{j}}^{-1/\alpha} for some sufficiently large constant CLC_{L} and mN1/(2α+1)m\asymp N^{1/(2\alpha+1)}, then it holds that

f^×1𝒫^1×d𝒫^df𝐋2([0,1]d)2=O(j=1dσj,rj2N2α/(2α+1)+j=1dσj,rj(d1)/α2N+j=1drjN).\displaystyle\|\widehat{f}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d}-f^{*}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\frac{\sum_{j=1}^{d}\sigma_{j,{{r}}_{j}}^{-2}}{N^{2\alpha/(2\alpha+1)}}+\frac{\sum_{j=1}^{d}\sigma_{j,{{r}}_{j}}^{-(d-1)/\alpha-2}}{N}+\frac{\prod_{j=1}^{d}{{r}}_{j}}{N}\bigg{)}. (96)
Proof.

2 is verified in Section B.1 for reproducing kernel Hilbert space, Section B.2 for Legendre polynomials, and Section B.3 for spline basis. 4 is shown in Lemma 21. Therefore all the conditions in Theorem 5 are met and Corollary 5 immediately follows. ∎

In the following result, we extend our approach to the general setting where the random designs {Zi}i=1N\{Z_{i}\}_{i=1}^{N} are sampled from a generic density function p:[0,1]d+p^{*}:[0,1]^{d}\to\mathbb{R}^{+}. To achieve consistent regression estimation in this context, we propose adjusting our estimator to incorporate an additional density estimator. This modification aligns with techniques commonly used in classical nonparametric methods, such as the Nadaraya–Watson kernel regression estimator.

Corollary 6.

Suppose {Zi}i=1N\{Z_{i}\}_{i=1}^{N} are random designs independently sampled from a common density function p:[0,1]d+p^{*}:[0,1]^{d}\to\mathbb{R}^{+} such that p(z1,,zd)c for all (z1,,zd)[0,1]dp^{*}(z_{1},\ldots,z_{d})\geq c^{*}\text{ for all }(z_{1},\ldots,z_{d})\in[0,1]^{d} where c>0c^{*}>0 is a universal positive constant. Let

f~=f^×1𝒫^1×d𝒫^d,\widetilde{f}=\widehat{f}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d},

where f^×1𝒫^1×d𝒫^d\widehat{f}\times_{1}\widehat{\mathcal{P}}_{1}\cdots\times_{d}\widehat{\mathcal{P}}_{d} is defined in Corollary 5, and p~\widetilde{p} is any generic density estimator of pp^{*}. Denote p~=max{1log(N),p~}.{\widetilde{p}}^{\prime}=\max\big{\{}\frac{1}{\sqrt{\log(N)}},\widetilde{p}\big{\}}. Suppose in addition that all of the other conditions in Corollary 5 hold. Then

f~p~f𝐋2([0,1]d)2\displaystyle\bigg{\|}\frac{\widetilde{f}}{{\widetilde{p}}^{\prime}}-f^{*}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d})}^{2}
=\displaystyle= O(log(N){j=1dσj,rj2N2α/(2α+1)+j=1dσj,rj(d1)/α2N+j=1drjN}+log(N)pp~𝐋2([0,1]d)2).\displaystyle\operatorname*{O_{\mathbb{P}}}\bigg{(}\log(N)\bigg{\{}\frac{\sum_{j=1}^{d}\sigma_{j,{{r}}_{j}}^{-2}}{N^{2\alpha/(2\alpha+1)}}+\frac{\sum_{j=1}^{d}\sigma_{j,{{r}}_{j}}^{-(d-1)/\alpha-2}}{N}+\frac{\prod_{j=1}^{d}{{r}}_{j}}{N}\bigg{\}}+\log(N)\|p^{*}-\widetilde{p}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}\bigg{)}.

Note that if pp^{*} is also a low-rank density function, then we can estimate pp^{*} via Algorithm 2 with a reduced curse of dimensionality. Consequently, the regression function ff^{*} can be estimated with a reduced curse of dimensionality even when the random designs are sampled from a non-uniform density.
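The adjustment in Corollary 6 only requires evaluating the projected estimator and a density estimate at common points and clipping the latter from below at 1/log(N)1/\sqrt{\log(N)}. A minimal sketch of this pointwise adjustment is given below; the function and variable names are ours, and the input values are placeholders.

```python
import numpy as np

def adjusted_regression_estimate(f_tilde_vals, p_tilde_vals, N):
    """Form f_tilde / p_tilde' pointwise, where p_tilde' clips the density
    estimate from below at 1/sqrt(log N), as in Corollary 6.

    f_tilde_vals : values of the projected estimator f_hat x_1 P_1 ... x_d P_d
    p_tilde_vals : values of any generic density estimator p_tilde
    Both arrays are evaluated at the same points of [0,1]^d.
    """
    floor = 1.0 / np.sqrt(np.log(N))
    p_prime = np.maximum(p_tilde_vals, floor)   # p_tilde' = max{1/sqrt(log N), p_tilde}
    return f_tilde_vals / p_prime

# Toy usage on an evaluation grid (purely illustrative numbers).
rng = np.random.default_rng(4)
f_vals = rng.standard_normal(1000)
p_vals = np.abs(rng.standard_normal(1000))      # stand-in density estimates
out = adjusted_regression_estimate(f_vals, p_vals, N=100_000)
```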

Proof of Corollary 6.

Suppose NN is sufficiently large so that 1log(N)c\frac{1}{\sqrt{\log(N)}}\leq c^{*}. Let ZZ be a generic element in [0,1]d[0,1]^{d}. By definition, p~=max{1log(N),p~}.{\widetilde{p}}^{\prime}=\max\big{\{}\frac{1}{\sqrt{\log(N)}},\widetilde{p}\big{\}}. Thus, when p~(Z)1log(N)\widetilde{p}(Z)\geq\frac{1}{\sqrt{\log(N)}}, p~(Z)p(Z)=p~(Z)p(Z)\widetilde{p}(Z)-p^{*}(Z)=\widetilde{p}^{\prime}(Z)-p^{*}(Z). When p~(Z)<1log(N)\widetilde{p}(Z)<\frac{1}{\sqrt{\log(N)}}, note that

|p(Z)p~(Z)|=|p(Z)1log(N)|=p(Z)1log(N)p(Z)p~(Z)=|p(Z)p~(Z)|\displaystyle|p^{*}(Z)-\widetilde{p}^{\prime}(Z)|=\bigg{|}p^{*}(Z)-\frac{1}{\sqrt{\log(N)}}\bigg{|}=p^{*}(Z)-\frac{1}{\sqrt{\log(N)}}\leq p^{*}(Z)-\widetilde{p}(Z)=|p^{*}(Z)-\widetilde{p}(Z)|

where the first equality follows from p~(Z)=1log(N)\widetilde{p}^{\prime}(Z)=\frac{1}{\sqrt{\log(N)}}, the second equality follows from p(Z)c1log(N)p^{*}(Z)\geq c^{*}\geq\frac{1}{\sqrt{\log(N)}}, the inequality follows from p~(Z)<1log(N)\widetilde{p}(Z)<\frac{1}{\sqrt{\log(N)}} and the last equality follows from p(Z)c1log(N)p~(Z)p^{*}(Z)\geq c^{*}\geq\frac{1}{\sqrt{\log(N)}}\geq\widetilde{p}(Z). Therefore |p~(Z)p(Z)||p~(Z)p(Z)||\widetilde{p}^{\prime}(Z)-p^{*}(Z)|\leq|\widetilde{p}(Z)-p^{*}(Z)| for all Z[0,1]dZ\in[0,1]^{d} and it follows that

p~p𝐋2([0,1]d)p~p𝐋2([0,1]d).\displaystyle\|\widetilde{p}^{\prime}-p^{*}\|_{{{\bf L}_{2}}([0,1]^{d})}\leq\|\widetilde{p}-p^{*}\|_{{{\bf L}_{2}}([0,1]^{d})}. (97)

By Corollary 5,

f~fp𝐋2([0,1]d)2=O(j=1dσj,rj2N2α/(2α+1)+j=1dσj,rj(d1)/α2N+j=1drjN).\displaystyle\|\widetilde{f}-f^{*}p^{*}\|^{2}_{{{\bf L}_{2}}([0,1]^{d})}=O_{\mathbb{P}}\bigg{(}\frac{\sum_{j=1}^{d}\sigma_{j,r_{j}}^{-2}}{N^{2\alpha/(2\alpha+1)}}+\frac{\sum_{j=1}^{d}\sigma_{j,r_{j}}^{-(d-1)/\alpha-2}}{N}+\frac{\prod_{j=1}^{d}r_{j}}{N}\bigg{)}. (98)

Therefore

f~p~f𝐋2([0,1]d)f~p~fpp~𝐋2([0,1]d)+fpp~f𝐋2([0,1]d).\displaystyle\bigg{\|}\frac{\widetilde{f}}{\widetilde{p}^{\prime}}-f^{*}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d})}\leq\bigg{\|}\frac{\widetilde{f}}{\widetilde{p}^{\prime}}-\frac{f^{*}p^{*}}{\widetilde{p}^{\prime}}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d})}+\bigg{\|}\frac{f^{*}p^{*}}{\widetilde{p}^{\prime}}-f^{*}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d})}. (99)

The desired result follows from the observation that

f~p~fpp~𝐋2([0,1]d)21p~2f~fp𝐋2([0,1]d)2\displaystyle\bigg{\|}\frac{\widetilde{f}}{\widetilde{p}^{\prime}}-\frac{f^{*}p^{*}}{\widetilde{p}^{\prime}}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d})}^{2}\leq\bigg{\|}\frac{1}{\widetilde{p}^{\prime}}\bigg{\|}^{2}_{\infty}\|\widetilde{f}-f^{*}p^{*}\|_{{{\bf L}_{2}}([0,1]^{d})}^{2}
=O(\displaystyle=O_{\mathbb{P}}\bigg{(} log(N){j=1dσj,rj2N2α/(2α+1)+j=1dσj,rj(d1)/α2N+j=1drjN})\displaystyle\log(N)\bigg{\{}\frac{\sum_{j=1}^{d}\sigma_{j,r_{j}}^{-2}}{N^{2\alpha/(2\alpha+1)}}+\frac{\sum_{j=1}^{d}\sigma_{j,r_{j}}^{-(d-1)/\alpha-2}}{N}+\frac{\prod_{j=1}^{d}r_{j}}{N}\bigg{\}}\bigg{)}

and that

fpp~f𝐋2([0,1]d)=f(pp~1)𝐋2([0,1]d)=f(pp~p~)𝐋2([0,1]d)\displaystyle\bigg{\|}\frac{f^{*}p^{*}}{\widetilde{p}^{\prime}}-f^{*}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d})}=\bigg{\|}f^{*}\bigg{(}\frac{p^{*}}{\widetilde{p}^{\prime}}-1\bigg{)}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d})}=\bigg{\|}f^{*}\bigg{(}\frac{p^{*}-\widetilde{p}^{\prime}}{\widetilde{p}^{\prime}}\bigg{)}\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d})}
\displaystyle\leq fp~pp~𝐋2([0,1]d)flog(N)pp~𝐋2([0,1]d)flog(N)pp~𝐋2([0,1]d),\displaystyle\bigg{\|}\frac{f^{*}}{\widetilde{p}^{\prime}}\bigg{\|}_{\infty}\|p^{*}-\widetilde{p}^{\prime}\|_{{{\bf L}_{2}}([0,1]^{d})}\leq\|f^{*}\|_{\infty}\sqrt{\log(N)}\|p^{*}-\widetilde{p}^{\prime}\|_{{{\bf L}_{2}}([0,1]^{d})}\leq\|f^{*}\|_{\infty}\sqrt{\log(N)}\|p^{*}-\widetilde{p}\|_{{{\bf L}_{2}}([0,1]^{d})},

where the last inequality follows from (97). ∎

H.1 Additional technical results for regression

Lemma 21.

Let f^\widehat{f} be defined as in (95). Suppose all the conditions in Corollary 5 hold. Then f^\widehat{f} satisfies 4.

Proof.

Note that {Zi}i=1N\{Z_{i}\}_{i=1}^{N} are sampled from the uniform density and that 𝔼(ϵ1)=0,{\mathbb{E}}(\epsilon_{1})=0, Var(ϵ1)Cϵ.Var(\epsilon_{1})\leq C_{\epsilon}. Therefore

{\mathbb{E}}(\langle\widehat{f},G\rangle)={\mathbb{E}}(\epsilon_{1}G(Z_{1})+f^{*}(Z_{1})G(Z_{1}))=\int f^{*}(z)G(z)\mathrm{d}z=\langle f^{*},G\rangle,

where the second equality holds since ϵ\epsilon and ZZ are independent. Suppose G𝐋2=1\|G\|_{{\bf L}_{2}}=1. Then

Var(f^,G)=1NVar(ϵ1G(Z1)+f(Z1)G(Z1))=1NVar(ϵ1G(Z1))+1NVar(f(Z1)G(Z1))\displaystyle Var(\langle\widehat{f},G\rangle)=\frac{1}{N}Var(\epsilon_{1}G(Z_{1})+f^{*}(Z_{1})G(Z_{1}))=\frac{1}{N}Var(\epsilon_{1}G(Z_{1}))+\frac{1}{N}Var(f^{*}(Z_{1})G(Z_{1}))
\displaystyle\leq 1N(𝔼(ϵ12G(Z1))+𝔼{f(Z1)G(Z1)}2)=1N𝔼(ϵ12)[0,1]dG2(z)dz+1N{f(z)G(z)}2dz\displaystyle\frac{1}{N}\bigg{(}{\mathbb{E}}(\epsilon_{1}^{2}G(Z_{1}))+{\mathbb{E}}\{f^{*}(Z_{1})G(Z_{1})\}^{2}\bigg{)}=\frac{1}{N}{\mathbb{E}}(\epsilon_{1}^{2})\int_{[0,1]^{d}}G^{2}(z)\mathrm{d}z+\frac{1}{N}\int\big{\{}f^{*}(z)G(z)\big{\}}^{2}\mathrm{d}z
\displaystyle\leq 1NCϵG𝐋22+1Nf2G𝐋22=O(1N).\displaystyle\frac{1}{N}C_{\epsilon}\|G\|_{{\bf L}_{2}}^{2}+\frac{1}{N}\|f^{*}\|_{\infty}^{2}\|G\|_{{\bf L}_{2}}^{2}=O\bigg{(}\frac{1}{N}\bigg{)}.
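The 1/N1/N variance scaling established above is easy to check by simulation. The sketch below (with an illustrative ff^{*}, a test function GG of unit 𝐋2{{\bf L}_{2}} norm, and standard normal noise, all chosen for this example) estimates the variance of the sample-average inner product over repeated draws and reports NVarN\cdot Var, which should stabilize as NN grows.

```python
import numpy as np

rng = np.random.default_rng(5)
d, reps = 2, 400
f_star = lambda Z: np.sin(Z.sum(axis=1))
G = lambda Z: np.sqrt(2) * np.cos(np.pi * Z[:, 0])   # ||G||_{L2([0,1]^d)} = 1

def inner_product_estimate(N):
    Z = rng.random((N, d))                           # uniform design on [0,1]^d
    W = f_star(Z) + rng.standard_normal(N)           # subGaussian noise
    return np.mean(W * G(Z))                         # <f_hat, G> as in (95)

for N in (1_000, 4_000, 16_000):
    vals = np.array([inner_product_estimate(N) for _ in range(reps)])
    print(f"N = {N:6d}:  N * Var(<f_hat, G>) = {N * vals.var():.3f}")
```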

Lemma 22.

Let d1,d2+d_{1},d_{2}\in\mathbb{Z}^{+} be such that d1+d2=dd_{1}+d_{2}=d. Let {ϕk𝕊}k=1\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty} be a collection of 𝐋2{{\bf L}_{2}} basis functions such that ϕk𝕊C𝕊\|\phi^{\mathbb{S}}_{k}\|_{\infty}\leq C_{\mathbb{S}}. For positive integers mm and \ell, denote

=span{ϕμ1𝕊(z1)ϕμd1𝕊(zd1)}μ1,,μd1=1mand=span{ϕη1𝕊(zd1+1)ϕηd2𝕊(zd)}η1,,ηd2=1.\displaystyle\mathcal{M}=\text{span}\bigg{\{}\phi^{\mathbb{S}}_{{\mu}_{1}}(z_{1})\cdots\phi^{\mathbb{S}}_{{\mu}_{d_{1}}}(z_{d_{1}})\bigg{\}}_{{\mu}_{1},\ldots,{\mu}_{d_{1}}=1}^{m}\quad\text{and}\quad\mathcal{L}=\text{span}\bigg{\{}\phi^{\mathbb{S}}_{{\eta}_{1}}(z_{d_{1}+1})\cdots\phi^{\mathbb{S}}_{{\eta}_{d_{2}}}(z_{d})\bigg{\}}_{{\eta}_{1},\ldots,{\eta}_{d_{2}}=1}^{\ell}. (100)

Then it holds that

(f^f)×x𝒫×y𝒫op=O(md1+d2N+m3d1d2log(N)N+md13d2log(N)N).\|(\widehat{f}-f^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}=O_{\mathbb{P}}\bigg{(}\sqrt{\frac{m^{d_{1}}+\ell^{d_{2}}}{N}}+\frac{\sqrt{m^{3d_{1}}\ell^{d_{2}}\log(N)}}{N}+\frac{\sqrt{m^{d_{1}}\ell^{3d_{2}}\log(N)}}{N}\bigg{)}.
Proof.

Similar to Lemma 9, by ordering the indexes (μ1,,μd1)({\mu}_{1},\ldots,{\mu}_{d_{1}}) and (η1,,ηd2)({\eta}_{1},\ldots,{\eta}_{d_{2}}) in (100), we can also write

=span{Φμ(x)}μ=1md1and=span{Ψη(y)}η=1d2.\displaystyle\mathcal{M}=\text{span}\{\Phi_{{\mu}}(x)\}_{{\mu}=1}^{m^{d_{1}}}\quad\text{and}\quad\mathcal{L}=\text{span}\{\Psi_{{\eta}}(y)\}_{{\eta}=1}^{\ell^{d_{2}}}. (101)

Note that f^×x𝒫×y𝒫\widehat{f}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}} and f×x𝒫×y𝒫f^{*}\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}} are both zero in the orthogonal complement of the subspace \mathcal{M}\otimes\mathcal{L}. Let B^\widehat{B} and BB^{*} be two matrices in md1×d2\mathbb{R}^{m^{d_{1}}\times\ell^{d_{2}}} such that

B^μ,η=f^[Φμ,Ψη]=1Ni=1NWiΦμ(Xi)Ψη(Yi)andBμ,η=f[Φμ,Ψη]=𝔼(f^[Φμ,Ψη]),\displaystyle\widehat{B}_{{\mu},{\eta}}=\widehat{f}[\Phi_{{\mu}},\Psi_{{\eta}}]=\frac{1}{N}\sum_{i=1}^{N}W_{i}\Phi_{{\mu}}(X_{i})\Psi_{{\eta}}(Y_{i})\quad\text{and}\quad B_{{\mu},{\eta}}^{*}=f^{*}[\Phi_{{\mu}},\Psi_{{\eta}}]={\mathbb{E}}(\widehat{f}[\Phi_{{\mu}},\Psi_{{\eta}}]),

where Xi=(Zi,1,,Zi,d1)d1X_{i}=(Z_{i,1},\ldots,Z_{i,d_{1}})\in\mathbb{R}^{d_{1}} and Yi=(Zi,d1+1,,Zi,d)d2Y_{i}=(Z_{i,d_{1}+1},\ldots,Z_{i,d})\in\mathbb{R}^{d_{2}}. Therefore

B^Bop=(f^f)×x𝒫×y𝒫op.\displaystyle\|\widehat{B}-B^{*}\|_{\operatorname*{op}}=\|(\widehat{f}-f^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}. (102)

Since {ϵi}i=1N\{\epsilon_{i}\}_{i=1}^{N} are subGaussian, it follows from a union bound argument that there exists a sufficiently large constant C1C_{1} such that

(max1iN|ϵi|C1log(N))1N1.\displaystyle\mathbb{P}\bigg{(}\max_{1\leq i\leq N}|\epsilon_{i}|\leq C_{1}\sqrt{\log(N)}\bigg{)}\geq 1-N^{-1}. (103)

The following steps are similar to those in Lemma 9, but here we also need to control the additional variance introduced by the random variables ϵi\epsilon_{i}.
Step 1. Let v=(v1,,vmd1)md1v=(v_{1},\ldots,v_{m^{d_{1}}})\in\mathbb{R}^{m^{d_{1}}} and suppose that v2=1\|v\|_{2}=1. Then by orthonormality of {Φμ(x)}μ=1md1\{\Phi_{{\mu}}(x)\}_{{\mu}=1}^{m^{d_{1}}} in 𝐋2{{\bf L}_{2}} it follows that

μ=1md1vμΦμ𝐋2([0,1]d1)2=1.\bigg{\|}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}\bigg{\|}^{2}_{{{\bf L}_{2}}([0,1]^{d_{1}})}=1.

In addition, since

ϕμ1𝕊(z1)ϕμd1𝕊(zd1)j=1d1ϕμj𝕊C𝕊d1,\|\phi^{\mathbb{S}}_{{\mu}_{1}}(z_{1})\cdots\phi^{\mathbb{S}}_{{\mu}_{d_{1}}}(z_{d_{1}})\|_{\infty}\leq\prod_{j=1}^{d_{1}}\|\phi^{\mathbb{S}}_{{\mu}_{j}}\|_{\infty}\leq C_{\mathbb{S}}^{d_{1}},

it follows that ΦμC𝕊d1\|\Phi_{{\mu}}\|_{\infty}\leq C_{\mathbb{S}}^{d_{1}} for all 1μmd11\leq{\mu}\leq m^{d_{1}} and

μ=1md1vμΦμμ=1md1vμ2μ=1md1Φμ2=O(md1).\bigg{\|}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}\bigg{\|}_{\infty}\leq\sqrt{\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}^{2}}\sqrt{\sum_{{\mu}=1}^{m^{d_{1}}}\|\Phi_{{\mu}}\|_{\infty}^{2}}=O(\sqrt{m^{d_{1}}}).

Step 2. Let w=(w1,,wd2)d2w=(w_{1},\ldots,w_{\ell^{d_{2}}})\in\mathbb{R}^{\ell^{d_{2}}} and suppose that w2=1\|w\|_{2}=1. Then by orthonormality of {Ψη(y)}η=1d2\{\Psi_{{\eta}}(y)\}_{{\eta}=1}^{\ell^{d_{2}}} in 𝐋2([0,1]d2){{\bf L}_{2}}([0,1]^{d_{2}}),

η=1d2wηΨη𝐋2([0,1]d2)2=1.\bigg{\|}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}\bigg{\|}^{2}_{{{\bf L}_{2}}([0,1]^{d_{2}})}=1.

In addition, since

ϕη1𝕊(zd1+1)ϕηd2𝕊(zd)j=1d2ϕηj𝕊C𝕊d2,\|\phi^{\mathbb{S}}_{{\eta}_{1}}(z_{d_{1}+1})\cdots\phi^{\mathbb{S}}_{{\eta}_{d_{2}}}(z_{d})\|_{\infty}\leq\prod_{j=1}^{d_{2}}\|\phi^{\mathbb{S}}_{{\eta}_{j}}\|_{\infty}\leq C_{\mathbb{S}}^{d_{2}},

it follows that ΨηC𝕊d2\|\Psi_{{\eta}}\|_{\infty}\leq C_{\mathbb{S}}^{d_{2}}. Therefore

η=1d2wηΨηη=1d2wη2η=1d2Ψη2=O(d2).\bigg{\|}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}\bigg{\|}_{\infty}\leq\sqrt{\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}^{2}}\sqrt{\sum_{{\eta}=1}^{\ell^{d_{2}}}\|\Psi_{{\eta}}\|_{\infty}^{2}}=O(\sqrt{\ell^{d_{2}}}).

Step 3. For fixed v=(v1,,vmd1)v=(v_{1},\ldots,v_{m^{d_{1}}}) and w=(w1,,wd2)w=(w_{1},\ldots,w_{\ell^{d_{2}}}), we bound v(BB^)w.v^{\top}(B^{*}-\widehat{B})w. Let Δi=μ=1md1vμΦμ(Xi)η=1d2wηΨη(Yi)(f(Xi,Yi)+ϵi)\Delta_{i}=\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(X_{i})\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(Y_{i})(f^{*}(X_{i},Y_{i})+\epsilon_{i}). Since the measurement errors {ϵi}i=1N\{\epsilon_{i}\}_{i=1}^{N} and the random designs {Xi,Yi}i=1N\{X_{i},Y_{i}\}_{i=1}^{N} are independent, it follows that

Var(Δi)𝔼(Δi2)={μ=1md1vμΦμ(x)}2{η=1d2wηΨη(y)}2(f(x,y)+𝔼(ϵi)2)dxdy\displaystyle Var(\Delta_{i})\leq{\mathbb{E}}(\Delta_{i}^{2})=\iint\bigg{\{}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(x)\bigg{\}}^{2}\bigg{\{}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(y)\bigg{\}}^{2}(f^{*}(x,y)+{\mathbb{E}}(\epsilon_{i})^{2})\mathrm{d}x\mathrm{d}y
\displaystyle\leq 2(f2+Cϵ){μ=1md1vμΦμ(x)}2dx{η=1d2wηΨη(y)}2dy\displaystyle 2\bigg{(}\|f^{*}\|_{\infty}^{2}+C_{\epsilon}\bigg{)}\int\bigg{\{}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(x)\bigg{\}}^{2}\mathrm{d}x\int\bigg{\{}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(y)\bigg{\}}^{2}\mathrm{d}y
=\displaystyle= 2(f2+Cϵ)μ=1md1vμΦμ(x)𝐋2([0,1]d1)2η=1d2wηΨη(y)𝐋2([0,1]d2)2=2(f2+Cϵ).\displaystyle 2\bigg{(}\|f^{*}\|_{\infty}^{2}+C_{\epsilon}\bigg{)}\bigg{\|}\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(x)\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d_{1}})}^{2}\bigg{\|}\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(y)\bigg{\|}_{{{\bf L}_{2}}([0,1]^{d_{2}})}^{2}=2\bigg{(}\|f^{*}\|_{\infty}^{2}+C_{\epsilon}\bigg{)}.

where the last equality follows from Step 1 and Step 2. In addition, suppose the good event in (103) holds. Then uniformly for all 1iN1\leq i\leq N,

|Δi|μ=1md1vμΦμ(Xi)η=1d2wηΨη(Yi)(f+|ϵi|)=O(md1d2|ϵi|)=O(md1d2log(N)).|\Delta_{i}|\leq\|\sum_{{\mu}=1}^{m^{d_{1}}}v_{{\mu}}\Phi_{{\mu}}(X_{i})\|_{\infty}\|\sum_{{\eta}=1}^{\ell^{d_{2}}}w_{{\eta}}\Psi_{{\eta}}(Y_{i})\|_{\infty}(\|f^{*}\|_{\infty}+|\epsilon_{i}|)=O(\sqrt{m^{d_{1}}\ell^{d_{2}}}|\epsilon_{i}|)=O(\sqrt{m^{d_{1}}\ell^{d_{2}}}\sqrt{\log(N)}).

So for given v,wv,w, by Bernstein’s inequality

Z|ϵ(|v(BB^)w|t)=Z|ϵ(|1Ni=1NΔi𝔼(Δi)|t)\displaystyle\mathbb{P}_{Z|\epsilon}\bigg{(}\bigg{|}v^{\top}(B^{*}-\widehat{B})w\bigg{|}\geq t\bigg{)}=\mathbb{P}_{Z|\epsilon}\bigg{(}\bigg{|}\frac{1}{N}\sum_{i=1}^{N}\Delta_{i}-{\mathbb{E}}(\Delta_{i})\bigg{|}\geq t\bigg{)}
\displaystyle\leq 2exp(cNt2f2+Cϵ+tmd1d2log(N)).\displaystyle 2\exp\bigg{(}\frac{-cNt^{2}}{\|f^{*}\|_{\infty}^{2}+C_{\epsilon}+t\sqrt{m^{d_{1}}\ell^{d_{2}}\log(N)}}\bigg{)}.

Step 4. Let 𝒩(14,md1)\mathcal{N}(\frac{1}{4},m^{d_{1}}) be a 1/41/4 covering net of the unit ball in md1\mathbb{R}^{m^{d_{1}}} and 𝒩(14,d2)\mathcal{N}(\frac{1}{4},\ell^{d_{2}}) be a 1/41/4 covering net of the unit ball in d2\mathbb{R}^{\ell^{d_{2}}}. Then by 4.4.3 on page 90 of Vershynin (2018),

BB^op2supv𝒩(14,md1),w𝒩(14,d2)v(BB^)w.\displaystyle\|B^{*}-\widehat{B}\|_{\operatorname*{op}}\leq 2\sup_{v\in\mathcal{N}(\frac{1}{4},m^{d_{1}}),w\in\mathcal{N}(\frac{1}{4},\ell^{d_{2}})}v^{\top}(B^{*}-\widehat{B})w.

So by a union bound and the fact that the size of 𝒩(14,md1)\mathcal{N}(\frac{1}{4},m^{d_{1}}) is bounded by 9md19^{m^{d_{1}}} and the size of 𝒩(14,d2)\mathcal{N}(\frac{1}{4},\ell^{d_{2}}) is bounded by 9d29^{\ell^{d_{2}}},

(BB^opt)(supv𝒩(14,md1),w𝒩(14,d2)v(BB^)wt2)\displaystyle\mathbb{P}\bigg{(}\|B^{*}-\widehat{B}\|_{\operatorname*{op}}\geq t\bigg{)}\leq\mathbb{P}\bigg{(}\sup_{v\in\mathcal{N}(\frac{1}{4},m^{d_{1}}),w\in\mathcal{N}(\frac{1}{4},\ell^{d_{2}})}v^{\top}(B^{*}-\widehat{B})w\geq\frac{t}{2}\bigg{)}
\displaystyle\leq 9md1+d22exp(cNt2f2+Cϵ+tmd1d2log(N)).\displaystyle 9^{m^{d_{1}}+\ell^{d_{2}}}2\exp\bigg{(}\frac{-cNt^{2}}{\|f^{*}\|_{\infty}^{2}+C_{\epsilon}+t\sqrt{m^{d_{1}}\ell^{d_{2}}\log(N)}}\bigg{)}. (104)

Therefore

BB^op=O(md1+d2N+m3d1d2log(N)N+md13d2log(N)N)\|B^{*}-\widehat{B}\|_{\operatorname*{op}}=O_{\mathbb{P}}\bigg{(}\sqrt{\frac{m^{d_{1}}+\ell^{d_{2}}}{N}}+\frac{\sqrt{m^{3d_{1}}\ell^{d_{2}}\log(N)}}{N}+\frac{\sqrt{m^{d_{1}}\ell^{3d_{2}}\log(N)}}{N}\bigg{)}

as desired. ∎

Corollary 7.

Let {ϕk𝕊}k=1\{\phi^{\mathbb{S}}_{k}\}_{k=1}^{\infty} be a collection of 𝐋2{{\bf L}_{2}} basis functions such that ϕk𝕊C𝕊\|\phi^{\mathbb{S}}_{k}\|_{\infty}\leq C_{\mathbb{S}}. Let

=span{ϕμ1𝕊(z1)}μ1=1mand=span{ϕη1𝕊(z2)ϕηd1𝕊(zd)}η1,,ηd1=1.\displaystyle\mathcal{M}=\text{span}\bigg{\{}\phi^{\mathbb{S}}_{{\mu}_{1}}(z_{1})\bigg{\}}_{{\mu}_{1}=1}^{m}\quad\text{and}\quad\mathcal{L}=\text{span}\bigg{\{}\phi^{\mathbb{S}}_{{\eta}_{1}}(z_{2})\cdots\phi^{\mathbb{S}}_{{\eta}_{d-1}}(z_{d})\bigg{\}}_{{\eta}_{1},\ldots,{\eta}_{d-1}=1}^{\ell}.

If in addition mN1/(2α+1)m\asymp N^{1/(2\alpha+1)} and d1=O(N2α12(2α+1)/log(N)),\ell^{d-1}=O(N^{\frac{2\alpha-1}{2(2\alpha+1)}}/\log(N)), then

(f^f)×x𝒫×y𝒫op=O(dim()+dim()N).\displaystyle\|(\widehat{f}-f^{*})\times_{x}\mathcal{P}_{\mathcal{M}}\times_{y}\mathcal{P}_{\mathcal{L}}\|_{\operatorname*{op}}=\operatorname*{O_{\mathbb{P}}}\bigg{(}\sqrt{\frac{\dim(\mathcal{M})+\dim(\mathcal{L})}{N}}\bigg{)}. (105)
Proof.

The proof is almost the same as that of Corollary 3 with the above choice of mm and \ell. ∎

H.2 Simulations and real data

We analyze the numerical performance of our VRS, Nadaraya–Watson kernel regression (NWKR) estimators, and neural networks (NN) in various nonparametric regression problems. The implementation of VRS is provided in Algorithm 2 and the inputs of VRS in the regression setting are detailed in Corollary 6. For neural network estimators, we use feedforward architectures that are either wide or deep. We measure the estimation accuracy by the relative 𝐋2{{\bf L}_{2}}-error defined as

ff~𝐋2(Ω)f𝐋2(Ω),\frac{\|f^{*}-\widetilde{f}\|_{{{\bf L}_{2}}(\Omega)}}{\|f^{*}\|_{{{\bf L}_{2}}(\Omega)}},

where f~\widetilde{f} is the regression function estimator of a given method. The subsequent simulations and real data examples consistently demonstrate that VRS outperforms both NN and NWKR in a range of nonparametric regression problems.

\bullet Simulation 𝐕\mathbf{V}. We sample data {Wi,Zi}i=1N×[1,1]d\{W_{i},Z_{i}\}_{i=1}^{N}\subset\mathbb{R}\times[-1,1]^{d} from the regression model

Wi=f(Zi)+ϵi,\displaystyle W_{i}=f^{*}(Z_{i})+\epsilon_{i},

where {ϵi}i=1N\{\epsilon_{i}\}_{i=1}^{N} are independently sampled from standard normal distribution, {Zi}i=1N\{Z_{i}\}_{i=1}^{N} are sampled from the uniform distribution on [1,1]d[-1,1]^{d}, and

f(x1,,xd)=sin(j=1dxj)for (x1,,xd)[1,1]d.f^{*}(x_{1},\ldots,x_{d})=\sin\big{(}\sum_{j=1}^{d}x_{j}\big{)}\quad\text{for }(x_{1},\ldots,x_{d})\in[-1,1]^{d}.

In the first set of experiments, we set d=5d=5 and vary the sample size NN from 0.10.1 million to 11 million. In the second set of experiments, the sample size NN is fixed at 11 million, while the dimensionality dd varies from 22 to 1010. Each experimental setup is replicated 100 times to ensure robustness, and we present the average relative 𝐋2{{\bf L}_{2}}-error for each method in Figure 10.

Figure 10: The plot on the left corresponds to Simulation 𝐕\mathbf{V} with d=5d=5 and NN varying from 0.10.1 million to 11 million; the plot on the right corresponds to Simulation 𝐕\mathbf{V} with NN being 1 million and dd varying from 22 to 1010.
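For reference, the following sketch reproduces the data-generating mechanism of Simulation 𝐕\mathbf{V} and the Monte Carlo computation of the relative 𝐋2{{\bf L}_{2}}-error. The VRS, NWKR, and NN fits themselves are not reproduced here, so a deliberately crude constant predictor is scored as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(6)
d, N = 5, 100_000

f_star = lambda X: np.sin(X.sum(axis=1))

# Data from Simulation V: uniform design on [-1,1]^d and standard normal noise.
Z = rng.uniform(-1.0, 1.0, size=(N, d))
W = f_star(Z) + rng.standard_normal(N)

def relative_L2_error(f_hat, n_mc=500_000):
    # Monte Carlo approximation of ||f* - f_hat|| / ||f*|| over [-1,1]^d.
    X = rng.uniform(-1.0, 1.0, size=(n_mc, d))
    num = np.mean((f_star(X) - f_hat(X)) ** 2)
    den = np.mean(f_star(X) ** 2)
    return np.sqrt(num / den)

# Placeholder predictor standing in for VRS / NWKR / NN fits.
baseline = lambda X: np.full(X.shape[0], W.mean())
print(f"relative L2 error of the mean predictor: {relative_L2_error(baseline):.3f}")
```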

\bullet Simulation 𝐕𝐈\mathbf{VI}. We sample data {Wi,Zi}i=1N×[1,1]d\{W_{i},Z_{i}\}_{i=1}^{N}\subset\mathbb{R}\times[-1,1]^{d} from the regression model

Wi=f(Zi)+ϵi,\displaystyle W_{i}=f^{*}(Z_{i})+\epsilon_{i},

where {ϵi}i=1N\{\epsilon_{i}\}_{i=1}^{N} are independently sampled from standard normal distribution, and {Zi}i=1N\{Z_{i}\}_{i=1}^{N} are independently sampled in [1,1]d[-1,1]^{d} from a dd-dimensional truncated Gaussian distribution with mean vector 0 and covariance matrix 4Id4I_{d}. Here IdI_{d} is the identity matrix in dd-dimensions. In addition,

f(x1,,xd)=12exp(1di=1d3xi2+14)+12exp(1di=1d7xi2+18)f^{*}(x_{1},\ldots,x_{d})=\frac{1}{2}\exp\left(-\frac{1}{d}\sum_{i=1}^{d}\frac{3x_{i}^{2}+1}{4}\right)+\frac{1}{2}\exp\left(-\frac{1}{d}\sum_{i=1}^{d}\frac{7x_{i}^{2}+1}{8}\right)

for (x1,,xd)[1,1]d(x_{1},\ldots,x_{d})\in[-1,1]^{d}. In the first set of experiments, we fix d=5d=5 and vary NN from 0.10.1 million to 11 million. In the second set of experiments, we fix the sample size NN at 11 million and let dd vary from 22 to 1010. We repeat each experimental setting 100100 times and report the average relative 𝐋2{{\bf L}_{2}}-error for each method in Figure 11.

Figure 11: The plot on the left corresponds to Simulation 𝐕𝐈\mathbf{VI} with d=5d=5 and NN varying from 0.10.1 million to 11 million; the plot on the right corresponds to Simulation 𝐕𝐈\mathbf{VI} with NN being 1 million and dd varying from 22 to 1010.

\bullet Real data 𝐈𝐈\mathbf{II}. We study the problem of predicting house prices in California using the California housing dataset. This dataset contains 20640 house price records from the 1990 California census, along with 8 continuous features such as location, median house age, and total number of bedrooms for house price prediction. Since the true regression function is unknown, we randomly split the dataset into 90% training and 10% test data and evaluate the performance of various approaches by the relative test error. Let f~\widetilde{f} be any regression estimator computed based on the training data. The relative test error of this estimator is defined as

1Ntesti=1Ntest(f~(Zi)Wi)2Wi2,\sqrt{\frac{1}{N_{\text{test}}}\sum_{i=1}^{N_{\text{test}}}\frac{(\widetilde{f}(Z_{i})-W_{i})^{2}}{W_{i}^{2}}},

where {Zi,Wi}i=1Ntest\{Z_{i},W_{i}\}_{i=1}^{N_{\text{test}}} are the test data. The relative test errors for VRS, NWKR, and NN are 0.0275, 0.0367, and 0.0285, respectively, showing that VRS numerically surpasses the other methods in this real data example.
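The relative test error above is straightforward to compute from held-out predictions; a minimal helper (the names are ours) is sketched below.

```python
import numpy as np

def relative_test_error(predictions, targets):
    """Relative test error used for the California housing example:
    sqrt( mean( (f_tilde(Z_i) - W_i)^2 / W_i^2 ) ) over the held-out test set."""
    predictions, targets = np.asarray(predictions), np.asarray(targets)
    return np.sqrt(np.mean((predictions - targets) ** 2 / targets ** 2))

# Hypothetical usage with a 90%/10% split (model and test arrays are placeholders):
# err = relative_test_error(model.predict(Z_test), W_test)
```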

Appendix I Additional numerical results and details

I.1 Kernel methods

In our simulated experiments and real data examples, we choose the Gaussian kernel for the kernel density estimators and Nadaraya–Watson kernel regression (NWKR) estimators. The bandwidths in all the numerical examples are chosen using cross-validation. We refer interested readers to Wasserman (2006) for an introduction to nonparametric statistics.

I.2 Sample generation in VRS

Following Algorithm 5, we obtain the tensor-represented density p~\tilde{p}. To generate samples, we proceed in the spirit of a Gibbs sampler and rewrite the joint density as a product of conditional densities:

p~(x1,x2,,xd)=p~(x1)p~(x2|x1)p~(xd|x1,,xd1)\tilde{p}(x_{1},x_{2},\cdots,x_{d})=\tilde{p}(x_{1})\tilde{p}(x_{2}|x_{1})\cdots\tilde{p}(x_{d}|x_{1},\cdots,x_{d-1})

where p~(x1)\tilde{p}(x_{1}) is the one-dimensional marginal density, which requires integration over the other d1d-1 variables. In our setting, we use orthogonal basis functions whose first basis function is commonly a constant. By orthogonality, we only need to take the first slice of the coefficient tensor to compute this marginal density, which requires only O(1)O(1) complexity. We then use Markov Chain Monte Carlo (MCMC) to generate a sample x~1\tilde{x}_{1} from this marginal density. Next, we compute the conditional density p~(x2|x1)\tilde{p}(x_{2}|x_{1}): we fix x1x_{1} at x~1\tilde{x}_{1} and again take the first slices of the coefficient tensor to represent the integration over the remaining d2d-2 variables, after which we apply MCMC to generate a sample x~2\tilde{x}_{2} from p~(x2|x1)\tilde{p}(x_{2}|x_{1}). Repeating this procedure for all conditional densities, we can generate a sample (x~1,x~2,,x~d)(\tilde{x}_{1},\tilde{x}_{2},\cdots,\tilde{x}_{d}) with O(d)O(d) complexity.

The density produced by the tensor-sketching-based method may not always be positive. There are postprocessing techniques to deal with this issue. For example, Han et al. (2018) proposes to minimize the discrepancy between the output density and the square of a test density, where the test density shares the same tensor architecture; the square of the test density can then be used for MCMC sampling. Moreover, in our experiments the negative part of the output density is quite small, and we can simply apply a small threshold such as 10710^{-7} to avoid negative components.
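A simplified sketch of the sequential sampling scheme described above is given below. For clarity, it works with a density tabulated on a tensor grid, replaces the MCMC step for each one-dimensional conditional with categorical (inverse-CDF) sampling on the grid, and applies the small negative-value threshold just mentioned. The function names and the toy density are ours.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_from_grid_density(P, grids, rng):
    """Sequentially sample (x_1, ..., x_d) from a density tabulated on a tensor
    grid, following p(x_1) p(x_2 | x_1) ... p(x_d | x_1, ..., x_{d-1}).

    P     : d-way array of density values on the grid
    grids : list of d 1-D arrays of grid points
    """
    P = np.maximum(P, 1e-7)                  # guard against small negative parts
    d = P.ndim
    sample, cond = [], P
    for j in range(d):
        # Marginalize the remaining coordinates to get p(x_j | x_1, ..., x_{j-1}).
        marginal = cond.sum(axis=tuple(range(1, cond.ndim)))
        marginal = marginal / marginal.sum()
        idx = rng.choice(len(grids[j]), p=marginal)
        sample.append(grids[j][idx])
        cond = cond[idx]                      # condition on the drawn value
    return np.array(sample)

# Toy usage: a separable 3-d density on a uniform grid (illustrative only).
grids = [np.linspace(0, 1, 50)] * 3
g1 = np.exp(-((grids[0] - 0.3) ** 2) / 0.02)
P = g1[:, None, None] * g1[None, :, None] * g1[None, None, :]
print(sample_from_grid_density(P, grids, rng))
```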

I.3 Neural network density estimator

For neural network estimators, we use two popular density estimation architectures: Masked Autoregressive Flow (MAF) (Papamakarios et al. (2017)) and Neural Autoregressive Flows (NAF) (Huang et al. (2018)) for comparisons. Both neural networks are trained using the Adam optimizer (Kingma and Ba (2014)). For MAF, we use 5 transforms and each transform is a 3-hidden layer neural network with width 128. For NAF, we choose 3 transforms and each transform is a 3-hidden layer neural network with width 128.

I.4 Additional image denoising result

We provide additional image denoising results in this subsection. In Figure 12, we have randomly selected another five images from the test set of the USPS digits dataset and MNIST dataset to illustrate the denoised results using VRS and kernel PCA.

Figure 12: Denoising images from (a) USPS digits dataset and (b) MNIST dataset. In both (a) and (b), the first column shows the ground truth images from the test data, the second column shows the images polluted by Gaussian noise, the third column shows the images denoised using VRS, and the last column shows the images denoised using kernel PCA.