
A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data

Saptarshi Chakraborty email: [email protected] Department of Statistics, UC Berkeley Peter L. Bartlett email: [email protected] Department of Statistics, UC Berkeley Department of Electrical Engineering and Computer Sciences, UC Berkeley Google DeepMind
Abstract

Recent advances have revealed that the rate of convergence of the expected test error in deep supervised learning decays as a function of the intrinsic dimension, and not the dimension $d$ of the input space. Existing literature defines this intrinsic dimension as the Minkowski dimension or the manifold dimension of the support of the underlying probability measures, which often results in sub-optimal rates and unrealistic assumptions. In this paper, we consider supervised deep learning when the response given the explanatory variable is distributed according to an exponential family with a $\beta$-Hölder smooth mean function. We consider an entropic notion of the intrinsic data-dimension and demonstrate that with $n$ independent and identically distributed samples, the test error scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2\beta}{2\beta+\bar{d}_{2\beta}(\lambda)}}\right)$, where $\bar{d}_{2\beta}(\lambda)$ is the $2\beta$-entropic dimension of $\lambda$, the distribution of the explanatory variables. This improves on the best-known rates. Furthermore, under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $\tilde{\mathcal{O}}\left(d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d}}\right)$, establishing that the dependence on $d$ is not exponential but at most polynomial. We also demonstrate that when the explanatory variable has a lower-bounded density, this rate, in terms of the number of data samples, is nearly optimal for learning the dependence structure for exponential families.

1 Introduction

One hypothesis for the extraordinary performance of deep learning is that most natural data have an intrinsically low-dimensional structure despite lying in a high-dimensional feature space (Pope et al., 2020). Under this so-called “manifold hypothesis,” recent theoretical developments on the generalization aspects of deep learning have revealed that the excess risk for different deep learning models, especially regression (Schmidt-Hieber, 2020; Suzuki, 2018) and generative models (Huang et al., 2022; Chakraborty and Bartlett, 2024a,b), exhibits a decay pattern that depends only on the intrinsic dimension of the data. Notably, Nakada and Imaizumi (2020), Huang et al. (2022) and Chakraborty and Bartlett (2024a) showed that the excess risk decays as $\mathcal{O}(n^{-1/\mathcal{O}(d_{\mu})})$, where $d_{\mu}$ denotes the Minkowski dimension of the underlying distribution. For a supervised learning setting, this phenomenon has been proved for various deep regression models with additive Gaussian noise (Schmidt-Hieber, 2020; Nakada and Imaizumi, 2020; Suzuki, 2018; Suzuki and Nitanda, 2021).

Most of the aforementioned studies aim to describe this inherent dimensionality through the (upper) Minkowski dimension of the data's underlying support. However, it is worth noting that the Minkowski dimension primarily quantifies the rate of growth of the covering number of the support, while neglecting situations where the distribution may concentrate most of its mass within specific sub-regions of this support. Thus, the Minkowski dimension often overestimates the intrinsic dimension of the data distribution, resulting in slower rates of statistical convergence. On the other hand, some works (Chen et al., 2022, 2019; Jiao et al., 2021) impose a smooth Riemannian manifold structure on this support and characterize the rate through the dimension of this manifold. However, this assumption is not only very strong and difficult to verify, but also ignores the fact that the data can be concentrated on some sub-regions and thinly spread over the remainder, again resulting in an over-estimate. In contrast, recent insights from the optimal transport literature have introduced the concept of the Wasserstein dimension (Weed and Bach, 2019), which overcomes these limitations and offers a more accurate characterization of convergence rates when estimating a distribution through the empirical measure. Furthermore, recent advancements in this field have led to the introduction of the entropic dimension (Chakraborty and Bartlett, 2024b), which builds upon the seminal work of Dudley (1969) and can be employed to describe the convergence rates for Bidirectional Generative Adversarial Networks (BiGANs) (Donahue et al., 2017). Remarkably, the entropic dimension is no larger than the Wasserstein and Minkowski dimensions, resulting in faster convergence rates. However, it is not known whether this faster rate of convergence extends beyond Generative Adversarial Networks (GANs) and their variants to supervised learning problems.

In this paper, we provide a statistical framework aimed at understanding deep supervised learning. Our approach involves modeling the conditional distribution of the response variable given the explanatory variable as a member of an exponential family with a smooth mean function. This framework accommodates a wide spectrum of scenarios, including standard regression and classification tasks, while also providing a statistical foundation for handling complex dependencies in real data settings. In this framework, the maximum likelihood estimates can be viewed as minimizing the canonical Bregman divergence loss between the predicted values and the actual responses. When the explanatory variable has a bounded density with respect to the $d$-dimensional Lebesgue measure, our analysis reveals that deep networks employing ReLU activation functions can achieve a test error on the order of $\tilde{\mathcal{O}}(n^{-2\beta/(2\beta+d)})$, provided that appropriately sized networks are selected. Here $\beta$ denotes the Hölder smoothness of the true mean response function. This generalizes the known results in the literature for additive regression with Gaussian noise.

Another aspect overlooked by the current literature is that even when the explanatory variable is absolutely continuous, the constants in the convergence rate of the sample estimator often grow exponentially with the ambient feature dimension, making the upper bound on the estimation error weak for high-dimensional data. In this paper, we prove that if the explanatory variable has a bounded density, the dependence on the ambient feature dimension is not exponential but at most polynomial. Furthermore, we show that the derived rates for the test error are roughly minimax optimal, meaning that no estimator can achieve a better rate of convergence, except possibly for an improved logarithmic dependence on $n$. Lastly, when the data has an intrinsically low-dimensional structure, we show that the test error can be improved to roughly $\tilde{\mathcal{O}}(n^{-2\beta/(2\beta+\bar{d}_{2\beta}(\lambda))})$, where $\bar{d}_{2\beta}(\lambda)$ denotes the $2\beta$-entropic dimension (see Definition 11) of $\lambda$, the distribution of the explanatory variables, thus improving upon the rates in the current literature. This result not only extends the framework beyond additive Gaussian-noise regression models but also improves upon the existing rates (Nakada and Imaizumi, 2020; Schmidt-Hieber, 2020; Chen et al., 2022). The main results of this paper are summarized as follows:

  • In Theorem 8, we demonstrate that when the explanatory variable has a bounded density, the test error for the learning problem scales as $\tilde{\mathcal{O}}\left(d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d}}\right)$, showing explicit dependence on the problem dimension ($d$) and the number of samples ($n$).

  • Theorem 10 establishes that the minimax rate scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2\beta}{2\beta+d}}\right)$, ensuring that deep learners can attain the minimax optimal rate when network sizes are appropriately chosen. Notably, this theorem recovers the seminal results of Yang and Barron (1999) as a special case.

  • When the explanatory variable has an intrinsically low-dimensional structure, we show in Theorem 12 that deep supervised learners can effectively achieve an error rate of $\tilde{\mathcal{O}}\left(n^{-\frac{2\beta}{2\beta+\bar{d}_{2\beta}(\lambda)}}\right)$. This result provides the fastest known rates for deep supervised learners and encompasses many recent findings for additive regression models as special cases (Nakada and Imaizumi, 2020; Chen et al., 2022).

  • In the process, in Lemma 21 we improve upon recent $\mathbb{L}_{p}$-approximation results for ReLU networks, which might be of independent interest.

The remainder of the paper is organized as follows. After discussing the necessary background in Section 2, we discuss the exponential family learning framework in Section 3. Under this framework, we derive the learning rates (Theorem 8) when the explanatory variable is absolutely continuous in Section 4 and show that it is minimax optimal (Theorem 10). Next, we analyze the error rate (Theorem 12) when the explanatory variable has an intrinsically low dimensional structure in Section 5. The proofs of the main results are sketched in Section 6, followed by concluding remarks in Section 7.

2 Background

This section recalls some of the notation and background that we need. We say $A_{n,d}\precsim B_{n,d}$ (for $A,B\geq 0$) if there exists an absolute constant $C>0$ (independent of $n$ and $d$) such that $A_{n,d}\leq CB_{n,d}$ for all $n,d$. Similarly, for non-negative functions $f$ and $g$, we say $f(x)\precsim_{x}g(x)$ if there exists a constant $C$, independent of $x$, such that $f(x)\leq Cg(x)$ for all $x$. For any function $f:\mathcal{S}\to\mathbb{R}$ and any measure $\gamma$ on $\mathcal{S}$, let $\|f\|_{\mathbb{L}_{p}(\gamma)}:=\left(\int_{\mathcal{S}}|f(x)|^{p}\,d\gamma(x)\right)^{1/p}$ if $0<p<\infty$, and let $\|f\|_{\mathbb{L}_{\infty}(\gamma)}:=\operatorname{ess\,sup}_{x\in\operatorname{supp}(\gamma)}|f(x)|$. We say $A_{n}=\tilde{\mathcal{O}}(B_{n})$ if $A_{n}\leq B_{n}\log^{C}(en)$ for some constant $C>0$. Moreover, $x\vee y=\max\{x,y\}$ and $x\wedge y=\min\{x,y\}$.

Definition 1 (Covering and packing numbers).

For a metric space $(S,\varrho)$, the $\epsilon$-covering number with respect to (w.r.t.) $\varrho$ is defined as
$$\mathcal{N}(\epsilon;S,\varrho)=\inf\{n\in\mathbb{N}:\exists\,x_{1},\dots,x_{n}\text{ such that }\cup_{i=1}^{n}B_{\varrho}(x_{i},\epsilon)\supseteq S\}.$$
An $\epsilon$-cover of $S$ is denoted as $\mathcal{C}(\epsilon;S,\varrho)$. Similarly, the $\epsilon$-packing number is defined as
$$\mathcal{M}(\epsilon;S,\varrho)=\sup\{m\in\mathbb{N}:\exists\,x_{1},\dots,x_{m}\in S\text{ such that }\varrho(x_{i},x_{j})\geq\epsilon\text{ for all }i\neq j\}.$$
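As a concrete illustration of these quantities (a standard calculation, not part of the paper's development), covering the unit cube under the $\ell_{\infty}$-metric by balls of radius $\epsilon$, i.e. sub-cubes of side $2\epsilon$, gives
$$\mathcal{N}(\epsilon;[0,1]^{d},\ell_{\infty})=\left\lceil\tfrac{1}{2\epsilon}\right\rceil^{d},\qquad\text{so that}\qquad\lim_{\epsilon\downarrow 0}\frac{\log\mathcal{N}(\epsilon;[0,1]^{d},\ell_{\infty})}{\log(1/\epsilon)}=d,$$
which is the full-dimensional scaling that the intrinsic-dimension notions of Sections 4 and 5 refine.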

Definition 2 (Neural networks).

Let $L\in\mathbb{N}$ and $\{N_{i}\}_{i\in[L]}\subset\mathbb{N}$. Then an $L$-layer neural network $f:\mathbb{R}^{d}\to\mathbb{R}^{N_{L}}$ is defined as
$$f(x)=A_{L}\circ\sigma_{L-1}\circ A_{L-1}\circ\dots\circ\sigma_{1}\circ A_{1}(x). \tag{1}$$
Here, $A_{i}(y)=W_{i}y+b_{i}$, with $W_{i}\in\mathbb{R}^{N_{i}\times N_{i-1}}$ and $b_{i}\in\mathbb{R}^{N_{i}}$, where $N_{0}=d$. Note that $\sigma_{j}$ is applied component-wise. Here, $\{W_{i}\}_{1\leq i\leq L}$ are known as weights and $\{b_{i}\}_{1\leq i\leq L}$ are known as biases; $\{\sigma_{i}\}_{1\leq i\leq L-1}$ are known as the activation functions. Without loss of generality, one can take $\sigma_{\ell}(0)=0$ for all $\ell\in[L-1]$. We define the following quantities: (Depth) $\mathcal{L}(f):=L$ is known as the depth of the network; (Number of weights) the number of weights of the network $f$ is denoted as $\mathcal{W}(f)$; (Maximum weight) $\mathcal{B}(f)=\max_{1\leq j\leq L}\|b_{j}\|_{\infty}\vee\|W_{j}\|_{\infty}$ denotes the maximum absolute value of the weights and biases. We let
$$\mathcal{N}\mathcal{N}_{\{\sigma_{i}\}_{1\leq i\leq L-1}}(L,W,R)=\left\{f\text{ of the form }(1):\mathcal{L}(f)\leq L,\,\mathcal{W}(f)\leq W,\,\sup_{x\in[0,1]^{d}}\|f(x)\|_{\infty}\leq R\right\}.$$
If $\sigma_{j}(x)=x\vee 0$ for all $j=1,\dots,L-1$, we denote $\mathcal{N}\mathcal{N}_{\{\sigma_{i}\}_{1\leq i\leq L-1}}(L,W,R)$ as $\mathcal{R}\mathcal{N}(L,W,R)$.
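For readers who prefer code, the following minimal sketch (our own illustration, using PyTorch and hypothetical layer widths, not code from the paper) constructs a member of $\mathcal{R}\mathcal{N}(L,W,R)$ and reads off the quantities $\mathcal{L}(f)$ and $\mathcal{W}(f)$ defined above; the output clamp is simply one way to guarantee the sup-norm bound $R$.

```python
import torch
import torch.nn as nn

class ReLUNet(nn.Module):
    """A sketch of a network in RN(L, W, R): ReLU activations in all hidden
    layers and outputs clipped to [-R, R]."""

    def __init__(self, widths, R):
        # widths = [d, N_1, ..., N_L]; the network has L = len(widths) - 1 affine layers.
        super().__init__()
        self.affine = nn.ModuleList(
            nn.Linear(widths[i], widths[i + 1]) for i in range(len(widths) - 1)
        )
        self.R = R

    def forward(self, x):
        for layer in self.affine[:-1]:
            x = torch.relu(layer(x))          # sigma_j(x) = x v 0, applied component-wise
        x = self.affine[-1](x)                # final affine map A_L, no activation
        return x.clamp(-self.R, self.R)       # enforce the sup-norm bound R

    def depth(self):
        return len(self.affine)               # L(f)

    def num_weights(self):
        return sum(p.numel() for p in self.parameters())   # W(f), counting weights and biases

f = ReLUNet(widths=[8, 32, 32, 1], R=2.0)     # hypothetical sizes: d = 8, L = 3
print(f.depth(), f.num_weights())
```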

Definition 3 (Hölder functions).

Let $f:\mathcal{S}\to\mathbb{R}$ be a function, where $\mathcal{S}\subseteq\mathbb{R}^{d}$. For a multi-index $\boldsymbol{s}=(s_{1},\dots,s_{d})$, let $\partial^{\boldsymbol{s}}f=\frac{\partial^{|\boldsymbol{s}|}f}{\partial x_{1}^{s_{1}}\dots\partial x_{d}^{s_{d}}}$, where $|\boldsymbol{s}|=\sum_{\ell=1}^{d}s_{\ell}$. We say that a function $f:\mathcal{S}\to\mathbb{R}$ is $\beta$-Hölder (for $\beta>0$) if
$$\|f\|_{\mathscr{H}^{\beta}}:=\sum_{\boldsymbol{s}:0\leq|\boldsymbol{s}|\leq\lfloor\beta\rfloor}\|\partial^{\boldsymbol{s}}f\|_{\infty}+\sum_{\boldsymbol{s}:|\boldsymbol{s}|=\lfloor\beta\rfloor}\sup_{x\neq y}\frac{\|\partial^{\boldsymbol{s}}f(x)-\partial^{\boldsymbol{s}}f(y)\|}{\|x-y\|^{\beta-\lfloor\beta\rfloor}}<\infty.$$
If $f:\mathbb{R}^{d_{1}}\to\mathbb{R}^{d_{2}}$, then we define $\|f\|_{\mathscr{H}^{\beta}}=\sum_{j=1}^{d_{2}}\|f_{j}\|_{\mathscr{H}^{\beta}}$. For notational simplicity, let $\mathscr{H}^{\beta}(\mathcal{S}_{1},\mathcal{S}_{2},C)=\{f:\mathcal{S}_{1}\to\mathcal{S}_{2}:\|f\|_{\mathscr{H}^{\beta}}\leq C\}$, where $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ are subsets of real vector spaces.

Definition 4 (Smoothness and strong convexity).

We say a differentiable function $f:\mathbb{R}^{d}\to\mathbb{R}$ is $H$-smooth if $\|\nabla f(x)-\nabla f(y)\|_{2}\leq H\|x-y\|_{2}$. Similarly, we say that $f$ is $\alpha$-strongly convex if $f(tx+(1-t)y)\leq tf(x)+(1-t)f(y)-\frac{\alpha t(1-t)}{2}\|x-y\|_{2}^{2}$, for all $x,y\in\mathbb{R}^{d}$ and $t\in[0,1]$.

Definition 5 (Bregman divergences).

A differentiable, convex function $\phi:\mathbb{R}^{p}\to\mathbb{R}$ generates the Bregman divergence $d_{\phi}:\mathbb{R}^{p}\times\mathbb{R}^{p}\to\mathbb{R}_{\geq 0}$ defined by $d_{\phi}(x\|y)=\phi(x)-\phi(y)-\langle\nabla\phi(y),x-y\rangle$.

It is evident that $d_{\phi}(x\|y)\geq 0$ holds for all $x,y\in\mathbb{R}^{p}$, since the inequality $\phi(x)\geq\phi(y)+\langle\nabla\phi(y),x-y\rangle$ is equivalent to the convexity of $\phi$. From a geometric standpoint, one can think of $d_{\phi}(x\|y)$ as the gap between $\phi(x)$ and the linear approximation of $\phi$ around $y$, evaluated at $x$; put simply, it is the distance between $\phi(x)$ and the value of the tangent to $\phi$ at $y$, evaluated at the point $x$. Some prominent examples of Bregman divergences include the squared Euclidean distance, the Kullback-Leibler (KL) divergence, and the Mahalanobis distance. We refer the reader to Banerjee et al. (2005, Table 1) for more examples of the Bregman family. Bregman divergences have a direct association with standard exponential families, as elaborated in the upcoming section, rendering them particularly suitable for modeling and learning from various common data types that originate from exponential family distributions.
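For instance (a standard calculation, not specific to this paper), taking $\phi(x)=\frac{1}{2}\|x\|_{2}^{2}$ gives
$$d_{\phi}(x\|y)=\tfrac{1}{2}\|x\|_{2}^{2}-\tfrac{1}{2}\|y\|_{2}^{2}-\langle y,x-y\rangle=\tfrac{1}{2}\|x-y\|_{2}^{2},$$
the squared Euclidean distance, while taking $\phi(x)=\sum_{j}x_{j}\log x_{j}$ on the probability simplex gives $d_{\phi}(x\|y)=\sum_{j}x_{j}\log(x_{j}/y_{j})$, the KL divergence.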

3 Learning Framework

To discuss our framework, let us first recall the definition of exponential families (Lehmann and Casella, 2006, Chapter 1, Section 5). We suppose that $\boldsymbol{\theta}\in\Theta$ is the natural parameter. We say that $\boldsymbol{X}$ is distributed according to an exponential family $\mathscr{F}_{\Psi}$ if the density of $\boldsymbol{X}$ w.r.t. some dominating measure $\nu$ is given by
$$p_{\Psi,\boldsymbol{\theta}}(d\boldsymbol{x})=h(\boldsymbol{x})\exp\left(\langle\boldsymbol{\theta},T(\boldsymbol{x})\rangle-\Psi(\boldsymbol{\theta})\right)\nu(d\boldsymbol{x}).$$
Here, $T(\cdot)$ is called the natural statistic. Often, it is assumed that the exponential family is expressed in its regular form, which means that the components of $T(\cdot)$ are affinely independent, i.e. there exists no $\boldsymbol{\upsilon}$ such that $\langle\boldsymbol{\upsilon},T(\boldsymbol{x})\rangle=c$ (a constant) for all $\boldsymbol{x}$. Popular examples of exponential families include the Gaussian, binomial, Poisson, and exponential distributions. Given an exponential family, one can express it in its natural form. Formally,

Definition 6 (Natural form of Exponential families).

A multivariate parametric family $\mathscr{F}_{\Psi}$ of distributions $\{p_{\Psi,\boldsymbol{\theta}}\,|\,\boldsymbol{\theta}\in\Theta=\operatorname{int}(\Theta)=\operatorname{dom}(\Psi)\subseteq\mathbb{R}^{d_{\theta}}\}$ is called a regular exponential family provided that each probability density, w.r.t. some dominating measure $\nu$, is of the form
$$p_{\Psi,\boldsymbol{\theta}}(d\boldsymbol{x})=\exp\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle-\Psi(\boldsymbol{\theta})\right)h(\boldsymbol{x})\nu(d\boldsymbol{x})$$
for all $\boldsymbol{x}\in\mathbb{R}^{d_{\theta}}$, where $\boldsymbol{x}$ represents a minimal sufficient statistic for the family.

It is well known that $\boldsymbol{\mu}(\boldsymbol{\theta}):=\mathbb{E}_{\boldsymbol{X}\sim p_{\Psi,\boldsymbol{\theta}}}\boldsymbol{X}=\nabla\Psi(\boldsymbol{\theta})$. For simplicity, we assume that $\Psi$ is proper, closed, convex, and differentiable. The conjugate of $\Psi$, denoted $\phi$, is defined as $\phi(\boldsymbol{\mu})=\sup_{\boldsymbol{\theta}\in\Theta}\left\{\langle\boldsymbol{\theta},\boldsymbol{\mu}\rangle-\Psi(\boldsymbol{\theta})\right\}$. It is well known (Banerjee et al., 2005, Theorem 4) that $p_{\Psi,\boldsymbol{\theta}}$ can be expressed as
$$p_{\Psi,\boldsymbol{\theta}}(d\boldsymbol{x})=\exp\left(-d_{\phi}(\boldsymbol{x}\|\boldsymbol{\mu}(\boldsymbol{\theta}))\right)b_{\phi}(\boldsymbol{x})\nu(d\boldsymbol{x}), \tag{2}$$
where $d_{\phi}(\cdot\|\cdot)$ denotes the Bregman divergence corresponding to $\phi$ and $b_{\phi}(\boldsymbol{x})=\exp(\phi(\boldsymbol{x}))h(\boldsymbol{x})$. In this paper, we are interested in the supervised learning problem when the response, given the explanatory variable, is distributed according to an exponential family. For simplicity, we assume that the responses are real-valued. We assume that there exists a “true” predictor function $f_{0}:\mathbb{R}^{d}\to\mathbb{R}$ such that

$$y|\boldsymbol{x}\sim p_{\Psi,f_{0}(\boldsymbol{x})}\quad\text{and}\quad\boldsymbol{x}\sim\lambda.$$
Thus, the joint distribution of $(\boldsymbol{X},Y)$ is given by
$$\mathscr{P}(d\boldsymbol{x},dy)\propto\exp\left(-d_{\phi}(y\|\mu(f_{0}(\boldsymbol{x})))\right)b_{\phi}(y)\lambda(d\boldsymbol{x})\nu(dy). \tag{3}$$
By definition, we observe that $\mathbb{E}(y|\boldsymbol{x})=\mu(f_{0}(\boldsymbol{x}))$. We will assume that the data are independent and identically distributed according to the distribution $\mathscr{P}$. We also assume that the distribution of $\boldsymbol{x}$ is supported on the compact set $[0,1]^{d}$. Formally,

A 1.

We assume that the data $\{(\boldsymbol{x}_{i},y_{i})\}_{i\in[n]}$ are independent and identically distributed according to the distribution $\mathscr{P}$ defined in (3). Furthermore, $\lambda\left([0,1]^{d}\right)=1$.

In the classical statistics literature, one estimates $f_{0}$ by computing its maximum likelihood estimate (m.l.e.),
$$\mathop{\rm argmax}_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\log\mathscr{P}(\boldsymbol{x}_{i},y_{i}).$$
Plugging in the form of $\mathscr{P}$ in (3), it is easy to see that the above optimization problem is equivalent to
$$\mathop{\rm argmin}_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}d_{\phi}\left(y_{i}\|\mu(f(\boldsymbol{x}_{i}))\right). \tag{4}$$
Here, $\mu:\mathbb{R}\to\mathbb{R}$ is known as the link function. In practice, we take $\mathcal{F}$ to be a class of neural networks, with the final output passed through the activation function $\mu$. The empirical minimizer of (4) is denoted as $\hat{f}$. To show that this framework covers a wide range of supervised learning problems, we first consider the classical example where $y|\boldsymbol{x}\sim\text{Normal}(f_{0}(\boldsymbol{x}),\sigma^{2})$, for some unknown $\sigma^{2}$. In this case, it is well known that $\mu(\cdot)$ is the identity map and $d_{\phi}(\cdot\|\cdot)$ becomes the squared Euclidean distance. Thus, the m.l.e. problem (4) becomes the classical regression problem, i.e. $\mathop{\rm argmin}_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-f(\boldsymbol{x}_{i}))^{2}$.

Another example is logistic regression. We assume that $y|\boldsymbol{x}$ is a Bernoulli random variable. This makes $\mu(z)=\frac{1}{1+e^{-z}}$, i.e. the sigmoid activation function. Furthermore, an easy calculation (Banerjee et al., 2005, Table 2) shows that $d_{\phi}(x\|y)=x\log\left(\frac{x}{y}\right)+(1-x)\log\left(\frac{1-x}{1-y}\right)$. Plugging the values of $\mu$ and $d_{\phi}$ into (4), we note that the estimation problem becomes
$$\mathop{\rm argmin}_{f\in\mathcal{F}}-\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}\log\circ\,\text{sigmoid}(f(\boldsymbol{x}_{i}))+(1-y_{i})\log\circ(1-\text{sigmoid}(f(\boldsymbol{x}_{i})))\right).$$
Here, $\operatorname{sigmoid}(t)=1/(1+e^{-t})$ denotes the sigmoid activation function. Thus, the problem reduces to the classical two-class learning problem with the binary cross-entropy loss and sigmoid activation at the final layer. For simplicity, we assume that all activations, excluding that of the final layer of $f$, are realized by the ReLU activation function. ReLU is a natural choice for practitioners and enables us to harness the approximation theory of ReLU networks developed in the recent literature (Yarotsky, 2017; Uppal et al., 2019). Using a leaky ReLU network instead would result in a similar analysis, changing only the constants in the main theorems.
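As a small numerical illustration of the objective in (4) (our own sketch, not code from the paper), the following NumPy snippet evaluates the empirical Bregman loss for the two examples above and checks that it coincides with one half of the squared-error loss in the Gaussian case and with the binary cross-entropy loss in the Bernoulli case.

```python
import numpy as np

def bregman_loss(y, f_x, phi, grad_phi, mu):
    """Empirical Bregman loss (1/n) * sum_i d_phi(y_i || mu(f(x_i))), as in (4)."""
    m = mu(f_x)
    return np.mean(phi(y) - phi(m) - grad_phi(m) * (y - m))

rng = np.random.default_rng(0)
f_x = rng.normal(size=200)                      # hypothetical network outputs f(x_i)

# Gaussian case (unit variance): phi(u) = u^2 / 2, mu = identity  ->  0.5 * squared error.
y_gauss = f_x + rng.normal(size=200)
l1 = bregman_loss(y_gauss, f_x, lambda u: u**2 / 2, lambda u: u, lambda z: z)
print(np.allclose(l1, 0.5 * np.mean((y_gauss - f_x) ** 2)))    # True

# Bernoulli case: phi(u) = u log u + (1-u) log(1-u), mu = sigmoid  ->  binary cross-entropy.
def phi_bern(u):
    with np.errstate(all="ignore"):
        return np.nan_to_num(u * np.log(u)) + np.nan_to_num((1 - u) * np.log(1 - u))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y_bern = (rng.random(200) < sigmoid(f_x)).astype(float)
l2 = bregman_loss(y_bern, f_x, phi_bern,
                  lambda u: np.log(u) - np.log(1 - u), sigmoid)
bce = -np.mean(y_bern * np.log(sigmoid(f_x)) + (1 - y_bern) * np.log(1 - sigmoid(f_x)))
print(np.allclose(l2, bce))                                     # True
```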

To facilitate the theoretical analysis, we will assume that the problem is smooth in terms of the learning function $f_{0}$. As a notion of smoothness, we use Hölder smoothness. This has been a popular choice in the recent literature (Nakada and Imaizumi, 2020; Schmidt-Hieber, 2020; Chen et al., 2022) and covers a vast array of functions commonly modeled in the literature.

A 2.

$f_{0}$ is $\beta$-Hölder continuous, i.e. $f_{0}\in\mathscr{H}^{\beta}(\mathbb{R}^{d},\mathbb{R},C)$.

We make the additional assumption that the function $\Psi$ is well-behaved. In particular, we assume that $\Psi$ possesses both smoothness and strong convexity properties. It is important to note that these assumptions are widely employed in the existing literature (Telgarsky and Dasgupta, 2013; Paul et al., 2021). A3 is not directly applicable to the classification problem, since in that case $\Psi(x)=x\ln x+(1-x)\ln(1-x)$, which does not satisfy A3. However, if one clips the network output (as is often done in practice to ensure smooth training), i.e. ensures that $\epsilon\leq f,f_{0}\leq 1-\epsilon$ for some positive $\epsilon$, then A3 is satisfied for all practical purposes. The assumption is formally stated as follows.

A 3.

We assume that $\Psi$ is $\sigma_{1}$-smooth and $\sigma_{2}$-strongly convex.

A direct implication of A3, by Kakade et al. (2009, Theorem 6), is that $\phi$ is $\tau_{2}$-smooth and $\tau_{1}$-strongly convex, where $\tau_{i}=1/\sigma_{i}$. Also, since $\Psi$ is $\sigma_{1}$-smooth, $\mu(\cdot)=\nabla\Psi(\cdot)$ is $\sigma_{1}$-Lipschitz. These facts will be useful in the proofs of the main results. In the subsequent sections, under the above assumptions, we derive probabilistic error bounds for the excess risk of $\hat{f}$.
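As a quick sanity check of A3 (our own illustration), consider the unit-variance Gaussian family, for which
$$\Psi(\theta)=\tfrac{\theta^{2}}{2},\qquad\mu(\theta)=\nabla\Psi(\theta)=\theta,\qquad\phi(\mu)=\sup_{\theta}\left\{\theta\mu-\tfrac{\theta^{2}}{2}\right\}=\tfrac{\mu^{2}}{2},$$
so that $\sigma_{1}=\sigma_{2}=1$, A3 holds, $\phi$ is $1$-smooth and $1$-strongly convex (i.e. $\tau_{1}=\tau_{2}=1$), and $d_{\phi}(y\|\mu)=\frac{1}{2}(y-\mu)^{2}$ recovers the squared-error loss from the regression example above.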

4 Optimal Rates for Distributions with Bounded Densities

We begin the analysis of the test error for the problem (4) when $\lambda$, the distribution of the explanatory variable, has a bounded density on $[0,1]^{d}$. First note that the excess risk is upper bounded by the estimation error for $f_{0}$ in the $\mathbb{L}_{2}(\lambda)$-norm. The excess risk for the problem is given by
$$\mathfrak{R}(\hat{f})=\mathbb{E}_{(y,\boldsymbol{x})\sim\mathscr{P}}d_{\phi}(y\|\mu(\hat{f}(\boldsymbol{x})))-\mathbb{E}_{(y,\boldsymbol{x})\sim\mathscr{P}}d_{\phi}(y\|\mu(f_{0}(\boldsymbol{x}))).$$
The following lemma ensures that $\mathfrak{R}(\hat{f})\asymp\|\hat{f}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}$ and hence it is enough to bound $\|\hat{f}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}$ to derive upper and lower bounds on the excess risk.

Lemma 7.

For any $\hat{f}\in\mathcal{F}$, $\frac{\sigma_{2}}{\sigma_{1}}\|\hat{f}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}\leq\mathfrak{R}(\hat{f})\leq\frac{\sigma_{1}}{\sigma_{2}}\|\hat{f}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}$.

As already mentioned, we assume that $\lambda$ admits a density w.r.t. the Lebesgue measure, and this density is upper bounded. Formally,

A 4.

Suppose that $\lambda$ admits an upper-bounded density $p_{\lambda}$ w.r.t. the Lebesgue measure on $[0,1]^{d}$, i.e. $p_{\lambda}(\boldsymbol{x})\leq\bar{b}_{\lambda}$, almost everywhere (under the Lebesgue measure).

Under Assumptions A1–4, we observe that with high probability, $\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}$, and thereby $\mathfrak{R}(\hat{f})$, scales at most as $\tilde{\mathcal{O}}\left(d^{2\lfloor\beta\rfloor}n^{-\frac{2\beta}{2\beta+d}}\right)$, ignoring $\log$-terms in $n$, for large $n$, provided the network sizes are chosen properly. This rate of decrease aligns with prior findings reported by Nakada and Imaizumi (2020) for additive Gaussian-noise regression. It is important to underscore that the existing literature predominantly investigates the rate of decrease in $\mathfrak{R}(\hat{f})$ solely with regard to the sample size $n$, overlooking terms dependent upon the data dimensionality. These dimension-dependent terms can grow exponentially with the dimensionality of the explanatory variables and may therefore attain substantial magnitudes, making such bounds inefficient in high-dimensional statistical learning contexts. Our analysis shows that, under the mere assumption that the explanatory variable has a bounded density, the dependence on $d$ is not exponential but only polynomial. The main upper bound for this case is stated in Theorem 8, with a proof outline appearing in Section 6.1.

Theorem 8.

Suppose Assumptions A1–4 hold. Then we can choose $\mathcal{F}=\mathcal{R}\mathcal{N}(L,W,2C)$ with $L\precsim\log n$ and $W\precsim n^{\frac{d}{2\beta+d}}\log n$, such that, with probability at least $1-3\exp\left(-n^{\frac{d}{2\beta+d}}\right)$,
$$\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d}}(\log n)^{5},$$
for all $n\geq n_{0}$, where $n_{0}$ might depend on $d$.

From the bound on the network size $W$ in Theorem 8, it is clear that when $f_{0}$ is smooth, i.e. for large $\beta$, one requires a smaller network than when $f_{0}$ is less smooth, i.e. when $\beta$ is small. Similarly, when the dimensionality $d$ of the explanatory variables is substantial, a larger network is required than when $d$ is relatively small. This observation aligns with the intuitive expectation that solving more intricate and complex problems in higher dimensions demands larger networks.
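To make the prescribed scaling concrete, here is a small sketch (our own, with arbitrary proportionality constants, since Theorem 8 only fixes $L$ and $W$ up to constants) that computes admissible depth and width budgets $L\asymp\log n$ and $W\asymp n^{d/(2\beta+d)}\log n$, together with the corresponding rate, for a few values of $n$, $d$ and $\beta$.

```python
import math

def network_size(n, d, beta, c_depth=1.0, c_width=1.0):
    """Depth/width scaling prescribed by Theorem 8 (constants are placeholders)."""
    L = math.ceil(c_depth * math.log(n))
    W = math.ceil(c_width * n ** (d / (2 * beta + d)) * math.log(n))
    rate = d ** (2 * math.floor(beta) * (beta + d) / (2 * beta + d)) * n ** (-2 * beta / (2 * beta + d))
    return L, W, rate

for n in (10**4, 10**6):
    for d, beta in ((8, 1.0), (8, 3.0), (64, 3.0)):
        L, W, rate = network_size(n, d, beta)
        print(f"n={n:>8}, d={d:>3}, beta={beta}: depth ~ {L}, width budget ~ {W}, rate ~ {rate:.3e}")
```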

The high-probability bound in Theorem 8 ensures that the expected test error also scales at the same rate. This result is shown in Corollary 9.

Corollary 9.

Under the assumptions and choices of Theorem 8, $\mathbb{E}\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d}}(\log n)^{5}$.

To understand whether deep supervised learning can achieve the optimal rates for this learning problem, we derive the minimax rates for estimating $f_{0}$. The minimax expected risk for this problem, when one has access to $n$ i.i.d. samples $\{(\boldsymbol{x}_{i},y_{i})\}_{i\in[n]}$ from (3) (with $f_{0}$ replaced by $f$), is given by
$$\mathfrak{M}_{n}=\inf_{\hat{f}}\sup_{f\in\mathscr{H}^{\beta}(\mathbb{R}^{d},\mathbb{R},C)}\mathbb{E}_{f}\|\hat{f}-f\|_{\mathbb{L}_{2}(\lambda)}^{2},$$
where $\mathbb{E}_{f}(\cdot)$ denotes the expectation w.r.t. the measure (3), with $f_{0}$ replaced by $f$, and the infimum is taken over all measurable estimators $\hat{f}$ based on the data. Minimax lower bounds are used to understand the theoretical limits of any statistical estimation problem. The aim of this analysis is to show that deep learning with ReLU networks for the exponential family dependence is (almost) minimax optimal. To facilitate the theoretical analysis, we assume that the density of $\lambda$ is lower bounded by a positive constant. Formally,

A 5.

$\lambda$ admits a lower-bounded density $p_{\lambda}$ w.r.t. the Lebesgue measure on $[0,1]^{d}$, i.e. $p_{\lambda}(\boldsymbol{x})\geq\underline{b}_{\lambda}>0$, almost everywhere (under the Lebesgue measure).

Theorem 10 provides a characterization of this minimax lower bound for estimating $f$. It is important to note that the seminal result of Yang and Barron (1999) for the normal-noise regression problem is a special case of Theorem 10.

Theorem 10 (Minimax lower bound).

Suppose that Assumptions A1–3 and A5 hold. Then, we can find an $n_{0}\in\mathbb{N}$ such that, if $n\geq n_{0}$,
$$\inf_{\hat{f}}\sup_{f\in\mathscr{H}^{\beta}(\mathbb{R}^{d},\mathbb{R},C)}\mathbb{E}_{f}\|\hat{f}-f\|_{\mathbb{L}_{2}(\lambda)}^{2}\succsim_{n}n^{-\frac{2\beta}{2\beta+d}}.$$

Thus, from Theorems 8 and 10 it is clear that deep supervised estimators for the exponential family dependence can achieve the minimax optimal rate with high probability, barring an excess poly-log factor in $n$.

5 Rates for Low Intrinsic Dimension

Frequently, it is posited that real-world data, such as vision data, reside within a lower-dimensional structure embedded in a high-dimensional feature space (Pope et al., 2020). To quantify this intrinsic dimensionality of the data, researchers have introduced various measures of the effective dimension of the underlying probability distribution assumed to generate the data. Among these approaches, the most commonly used ones assess the rate of growth, on a logarithmic scale, of the covering number of most of the support of the data distribution. Let us consider a compact Polish space $(\mathcal{S},\varrho)$, with $\gamma$ a probability measure defined on it. For the remainder of this paper, we will assume that $\varrho$ corresponds to the $\ell_{\infty}$-norm. The simplest measure of the dimension of a probability distribution is the upper Minkowski dimension of its support, defined as
$$\overline{\operatorname{dim}}_{M}(\gamma)=\limsup_{\epsilon\downarrow 0}\frac{\log\mathcal{N}(\epsilon;\operatorname{supp}(\gamma),\ell_{\infty})}{\log(1/\epsilon)}.$$

This dimensionality concept relies solely on the covering number of the support and does not assume the existence of a smooth mapping to a lower-dimensional Euclidean space. Consequently, it covers not only smooth Riemannian manifolds but also highly non-smooth sets such as fractals. The statistical convergence properties of various estimators with respect to the upper Minkowski dimension have been extensively explored in the literature. Kolmogorov and Tikhomirov (1961) conducted a comprehensive study of how the covering number of different function classes depends on the upper Minkowski dimension of the support. Recently, Nakada and Imaizumi (2020) demonstrated how deep learning models can leverage this inherent low-dimensionality in data, which is also reflected in their convergence rates. Nevertheless, a notable limitation of the upper Minkowski dimension is that when a probability measure covers the entire sample space but is concentrated predominantly in specific regions, it may yield a high dimensionality estimate that does not accurately reflect the underlying intrinsic dimension.

To overcome the aforementioned difficulty, Chakraborty and Bartlett (2024b) introduced the $\alpha$-entropic dimension of a measure $\gamma$ as a notion of its intrinsic dimension. Before we proceed, we recall the $(\epsilon,\tau)$-cover of a measure (Posner et al., 1967): $\mathscr{N}_{\epsilon}(\gamma,\tau)=\inf\{\mathcal{N}(\epsilon;S,\varrho):\gamma(S)\geq 1-\tau\}$, i.e. $\mathscr{N}_{\epsilon}(\gamma,\tau)$ counts the minimum number of $\epsilon$-balls required to cover a set $S$ of probability at least $1-\tau$.

Definition 11 (Entropic dimension; Chakraborty and Bartlett, 2024b).

For any $\alpha>0$, we define the $\alpha$-entropic dimension of $\gamma$ as
$$\bar{d}_{\alpha}(\gamma)=\limsup_{\epsilon\downarrow 0}\frac{\log\mathscr{N}_{\epsilon}(\gamma,\epsilon^{\alpha})}{\log(1/\epsilon)}.$$

This notion extends Dudley's entropic dimension (Dudley, 1969) and was used to characterize the convergence rate for the BiGAN problem (Donahue et al., 2017). Chakraborty and Bartlett (2024b) showed that the entropic dimension is no larger than the upper Minkowski dimension and the upper Wasserstein dimension (Weed and Bach, 2019), i.e. $\bar{d}_{\alpha}(\gamma)\leq\overline{\operatorname{dim}}_{M}(\gamma)$, with strict inequality possible even for simple measures on the unit hypercube (see Section 3, and Examples 10 and 11, of Chakraborty and Bartlett, 2024b). Consequently, the entropic dimension characterizes the intrinsic dimension of a data distribution more efficiently than these popular measures, enabling faster rates of convergence of the estimates.
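The quantity $\mathscr{N}_{\epsilon}(\gamma,\tau)$ can also be explored empirically; the sketch below (our own heuristic illustration, a greedy $\ell_{\infty}$-covering of the empirical measure, not an algorithm from the paper) covers sample points until an empirical mass of $1-\tau$ is reached, giving a rough proxy for $\mathscr{N}_{\epsilon}(\gamma,\tau)$ on data that concentrate near a low-dimensional set.

```python
import numpy as np

def empirical_cover_count(samples, eps, tau):
    """Greedy proxy for the (eps, tau)-cover: place l_inf balls of radius eps on the
    densest uncovered sample until a fraction 1 - tau of the points is covered."""
    n = len(samples)
    uncovered = np.ones(n, dtype=bool)
    count = 0
    while uncovered.mean() > tau:
        idx = np.where(uncovered)[0]
        # pairwise l_inf distances among the still-uncovered points
        dists = np.max(np.abs(samples[idx, None, :] - samples[None, idx, :]), axis=-1)
        ball_mass = (dists <= eps).sum(axis=1)
        center = samples[idx[np.argmax(ball_mass)]]      # greedy: densest ball first
        uncovered &= np.max(np.abs(samples - center), axis=-1) > eps
        count += 1
    return count

# Example: points that mostly vary along one coordinate inside [0,1]^10.
rng = np.random.default_rng(0)
t = rng.random((1000, 1))
x = np.hstack([t, 0.01 * rng.random((1000, 9))])
for eps in (0.2, 0.1, 0.05):
    print(eps, empirical_cover_count(x, eps, tau=eps))   # grows roughly like (1/eps)^1
```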

As an intrinsically low-dimensional probability measure need not be dominated by the Lebesgue measure, we drop Assumption A4. Under Assumptions A1–3 alone, if the network sizes are properly chosen, $\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}$ decays at a rate of roughly $\tilde{\mathcal{O}}\left(n^{-2\beta/(2\beta+\bar{d}_{2\beta}(\lambda))}\right)$, as shown in Theorem 12.

Theorem 12.

Suppose Assumptions A1–3 hold and let $d^{\star}>\bar{d}_{2\beta}(\lambda)$. Then we can choose $\mathcal{F}=\mathcal{R}\mathcal{N}(L,W,2C)$ with $L\precsim\log n$ and $W\precsim n^{\frac{d^{\star}}{2\beta+d^{\star}}}\log n$, such that, with probability at least $1-3\exp\left(-n^{\frac{d^{\star}}{2\beta+d^{\star}}}\right)$,
$$\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim_{n}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5},$$
for all $n\geq n_{0}$, where $n_{0}$ depends on $d$ and $\mathscr{P}$.

Since the normal-noise regression model with $\beta$-Hölder $f_{0}$ is a special case of our model in (3), Theorem 12 yields a faster rate than Nakada and Imaizumi (2020, Theorem 7), who show a rate of $\tilde{\mathcal{O}}\left(n^{-\frac{2\beta}{2\beta+\overline{\operatorname{dim}}_{M}(\lambda)}}\right)$. This is because the upper Minkowski dimension is at least the $2\beta$-entropic dimension by Chakraborty and Bartlett (2024b, Proposition 8(c)), i.e. $\bar{d}_{2\beta}(\lambda)\leq\overline{\operatorname{dim}}_{M}(\lambda)$.

An immediate corollary of Theorem 12 is that the expected test error decays at the same rate. The proof follows that of Corollary 9.

Corollary 13.

Under the assumptions and choices of Theorem 12, $\mathbb{E}\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}$.

A rate similar to that in Theorem 12 also holds when the support of $\lambda$ is regular. We recall the definition of regular sets in $[0,1]^{d}$ (Weed and Bach, 2019, Definition 6) as follows.

Definition 14 (Regular sets).

We say a set $\mathbb{M}$ is $\tilde{d}$-regular w.r.t. the $\tilde{d}$-dimensional Hausdorff measure $\mathbb{H}^{\tilde{d}}$ if $\mathbb{H}^{\tilde{d}}(B_{\varrho}(x,r))\asymp r^{\tilde{d}}$ for all $x\in\mathbb{M}$. Recall that the $d$-dimensional Hausdorff measure of a set $S$ is defined as $\mathbb{H}^{d}(S):=\liminf_{\epsilon\downarrow 0}\left\{\sum_{k=1}^{\infty}r_{k}^{d}:S\subseteq\bigcup_{k=1}^{\infty}B_{\varrho}(x_{k},r_{k}),\,r_{k}\leq\epsilon,\,\forall k\right\}$.

Examples of regular sets include compact $\tilde{d}$-dimensional differentiable manifolds; nonempty, compact convex sets spanned by an affine space of dimension $\tilde{d}$; the relative boundary of a nonempty, compact convex set of dimension $\tilde{d}+1$; and self-similar sets with similarity dimension $\tilde{d}$. When the support of $\lambda$ is $\tilde{d}$-regular, it can be shown that $\bar{d}_{\alpha}(\lambda)\leq\tilde{d}$. Formally,

Lemma 15.

Suppose that the support of $\gamma$ is $\tilde{d}$-regular. Then, $\bar{d}_{\alpha}(\gamma)\leq\tilde{d}$ for any $\alpha>0$. Furthermore, if $\gamma\ll\mathbb{H}^{\tilde{d}}$, then $\bar{d}_{\alpha}(\gamma)=\tilde{d}$.

Thus, applying Theorem 12 and Lemma 15, we note that $\|\hat{f}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}$ decays at most at a rate of $\tilde{\mathcal{O}}\left(n^{-2\beta/(2\beta+\tilde{d})}\right)$, resulting in the following corollary.

Corollary 16.

Suppose Assumptions A1–3 hold and the support of $\lambda$ is $\tilde{d}$-regular. Let $d^{\star}>\tilde{d}$. Then we can choose $\mathcal{F}=\mathcal{R}\mathcal{N}(L,W,2C)$ with $L\precsim\log n$ and $W\precsim n^{\frac{d^{\star}}{2\beta+d^{\star}}}\log n$, such that, with probability at least $1-3\exp\left(-n^{\frac{d^{\star}}{2\beta+d^{\star}}}\right)$, $\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim_{n}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}$, for all $n\geq n_{0}$, where $n_{0}$ depends on $d$ and $\mathscr{P}$.

Since compact $\tilde{d}$-dimensional differentiable manifolds are a special case of $\tilde{d}$-regular sets, Corollary 16 recovers the results of Chen et al. (2022) for the additive Gaussian-noise regression model as a special case. Importantly, this recovery is achieved without imposing the uniform sharpness assumptions on the manifold made by Chen et al. (2022, Assumption 2).

6 Proof of the Main Results

This section discusses the proofs of the main results of this paper, i.e., Theorems 8, 10 and 12, with proofs of auxiliary supporting lemmata appearing in the appendix. The proofs of the main upper bounds (Theorems 8 and 12) are presented in Section 6.1, while the minimax lower bound is proved in Section 6.2.

6.1 Proof of the Upper Bounds

In order to prove Theorem 8, we first decompose the error through an oracle inequality. For any vector $v\in\mathbb{R}^{q}$, we denote ${|\!|\!|v|\!|\!|}_{p,q}=\left(\frac{1}{q}\sum_{i=1}^{q}|v_{i}|^{p}\right)^{1/p}$.

Lemma 17 (Oracle inequality).

Let $f^{\ast}\in\mathcal{F}$. Suppose that $\xi_{i}=y_{i}-\mu(f_{0}(\boldsymbol{x}_{i}))$, $\hat{\Delta}_{i}=\mu(\hat{f}(\boldsymbol{x}_{i}))-\mu(f_{0}(\boldsymbol{x}_{i}))$ and $\tilde{\Delta}_{i}=\nabla\phi(\mu(f^{\ast}(\boldsymbol{x}_{i})))-\nabla\phi(\mu(f_{0}(\boldsymbol{x}_{i})))$. Then,
$$\tau_{1}{|\!|\!|\hat{\Delta}|\!|\!|}_{2,n}^{2}\leq\tau_{2}\|\mu(f^{\ast})-\mu(f_{0})\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}+\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\tilde{\Delta}_{i}. \tag{5}$$

The first term in the right-hand side (RHS) of (5) is analogous to an approximation error while the second term is akin to a generalization gap. It is worth noting that while taking a large network reduces the approximation error, it can potentially give rise to a large generalization gap and vice versa. The key idea is to select a network of appropriate size that ensures that both these errors are small enough. In the following two sections, we control these terms individually.

6.1.1 Generalization Error

To effectively control the generalization error, we employ standard localization techniques; see, for example, Wainwright (2019, Chapter 14). These techniques are instrumental in achieving rapid convergence of the sample estimator to the population estimator in the $\mathbb{L}_{2}(\lambda)$ norm. It is important to note that in some cases, the true function $f_{0}$ may not be precisely representable by a ReLU network. We establish a high-probability bound for the squared $\mathbb{L}_{2}(\lambda)$-norm difference between our estimated function $\hat{f}$ and $f^{\ast}$, where $f^{\ast}$ is taken to be a member of the neural network class that is close enough to $f_{0}$. Our strategy revolves around a two-step process: first, we derive a local complexity bound, as outlined in Lemma 18, and subsequently we leverage this bound to derive an estimate for $\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}$, as elucidated in Lemma 19. Here $\lambda_{n}$ denotes the empirical distribution of the explanatory variables. We then use this result to control $\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda)}$ in Lemma 22 for large $n$. We state these results subsequently, with proofs appearing in Appendix A.

Lemma 18.

Suppose that $\mathscr{G}_{\delta}=\left\{\nabla\phi(\mu(f))-\nabla\phi(\mu(f^{\prime})):\|f-f^{\prime}\|_{\mathbb{L}_{\infty}(\lambda_{n})}\leq\delta\text{ and }f,f^{\prime}\in\mathcal{F}\right\}$, with $\delta\leq 1/e$. Also let $n\geq\operatorname{Pdim}(\mathcal{F})$. Then, for any $t>0$, with probability (conditioned on $x_{1:n}$) at least $1-e^{-nt^{2}/\delta^{2}}$,
$$\sup_{g\in\mathscr{G}_{\delta}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}g(x_{i})\precsim t+\delta\sqrt{\frac{\operatorname{Pdim}(\mathcal{F})\log(n/\delta)}{n}}. \tag{6}$$

Here, $\operatorname{Pdim}(\mathcal{F})$ denotes the pseudo-dimension of the function class $\mathcal{F}$ (Anthony and Bartlett, 1999).

Lemma 19.

Suppose $\alpha\in(0,1/2)$ and $n\geq\max\left\{e^{1/\alpha},\operatorname{Pdim}(\mathcal{F})\right\}$. Then, for any $f^{\ast}\in\mathcal{F}$, with probability at least $1-\exp\left(-n^{1-2\alpha}\right)$,
$$\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}\precsim n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log n. \tag{7}$$
Lemma 20.

For $\alpha\in(0,1/2)$, if $n\geq\max\left\{e^{1/\alpha},\operatorname{Pdim}(\mathcal{F})\right\}$, then with probability at least $1-3\exp\left(-n^{1-2\alpha}\right)$,
$$\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{1}{n}\log\log n, \tag{8}$$
for any $f^{\ast}\in\mathcal{F}$.

In Lemmata 19 and 20, one can think of $f^{\ast}$ as the closest member of $\mathcal{F}$ to $f_{0}$, making the term $\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}$ akin to a misspecification error. The intuition is to choose $\mathcal{F}$ appropriately, so that the misspecification and generalization errors in Lemma 20 are both small.

6.1.2 Approximation Error

To effectively bound the overall error in Lemma 17, one needs to control the approximation error, i.e. the first term of (5). Exploring the approximation power of neural networks has attracted substantial interest in the research community over the past decade. Pioneering studies such as those by Cybenko (1989) and Hornik (1991) extensively examined the universal approximation properties of networks with sigmoid-like activations. These foundational works demonstrated that wide, single-hidden-layer neural networks can approximate any continuous function on a bounded domain. In light of recent advancements in deep learning, there has been a notable surge in research dedicated to the approximation capabilities of deep neural networks. Important results in this direction include those by Yarotsky (2017); Lu et al. (2021); Petersen and Voigtlaender (2018); Shen et al. (2019); Schmidt-Hieber (2020), among many others. All of these results indicate that when $\epsilon$-approximating a $\beta$-Hölder function in the $\ell_{\infty}$-norm, it suffices to have a network of depth $\mathcal{O}(\log(1/\epsilon))$ with at most $\mathcal{O}(\epsilon^{-d/\beta}\log(1/\epsilon))$-many weights. However, the constants in the upper bounds on the number of weights and the depth of the network can potentially grow exponentially with $d$. Shen et al. (2022) showed that if one approximates in the $\mathbb{L}_{2}(\operatorname{Leb})$-norm, this exponential dependence can be mitigated when $\beta\leq 1$. Here $\operatorname{Leb}(\cdot)$ denotes the Lebesgue measure on $[0,1]^{d}$. Lemma 21 generalizes this result to all $\beta>0$, with a precise dependence on $d$. The proof is provided in Appendix B.

Lemma 21.

Suppose that $f\in\mathscr{H}^{\beta}(\mathbb{R}^{d},\mathbb{R},C)$. Then, we can find a ReLU network $\hat{f}$ with $\mathcal{L}(\hat{f})\leq\vartheta\lceil\log_{2}(8/\eta)\rceil+4$ and $\mathcal{W}(\hat{f})\leq\left\lceil\frac{1}{2(\eta/20)^{1/\beta}}\right\rceil^{d}\left(\frac{3}{\beta}\right)^{\beta}(d+\lfloor\beta\rfloor)^{\lfloor\beta\rfloor}\left(\vartheta\left\lceil\log_{2}\left(\frac{8}{\eta d^{\lfloor\beta\rfloor}}\right)\right\rceil+8d+4\lfloor\beta\rfloor\right)$, and a constant $\eta_{0}\in(0,1)$ (that might depend on $\beta$ and $d$), such that $\|f-\hat{f}\|_{\mathbb{L}_{p}(\operatorname{Leb})}\leq Cd^{\lfloor\beta\rfloor}\eta$ for all $\eta\in(0,\eta_{0}]$. Here, $\vartheta$ is an absolute constant.

We now provide formal proofs of Theorems 8 and 12 by combining the results in Sections 6.1.1 and 6.1.2.

6.1.3 Proof of Theorem 8

Proof.

We take $\mathcal{F}=\mathcal{R}\mathcal{N}(L_{\epsilon},W_{\epsilon},2C)$, with $L_{\epsilon}\precsim\log(1/\epsilon)$ and $W_{\epsilon}\precsim d^{\lfloor\beta\rfloor}\epsilon^{-d/\beta}\log(1/\epsilon)$. Then, by Lemma 21, we can find $f^{\ast}\in\mathcal{F}$ such that $\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}\precsim d^{\lfloor\beta\rfloor}\epsilon$. Furthermore, by Lemma 20, we observe that with probability at least $1-3\exp\left(-n^{1-2\alpha}\right)$,
$$\begin{aligned}
\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim\; & n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{\log\log n}{n}\\
\leq\; & n^{-2\alpha}+d^{2\lfloor\beta\rfloor}\epsilon^{2}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{\log\log n}{n}\\
\precsim\; & n^{-2\alpha}+d^{2\lfloor\beta\rfloor}\epsilon^{2}+\frac{\log^{2}n}{n}W_{\epsilon}L_{\epsilon}\log(W_{\epsilon})+\frac{\log\log n}{n}\\
\precsim\; & n^{-2\alpha}+d^{2\lfloor\beta\rfloor}\epsilon^{2}+\frac{d^{\lfloor\beta\rfloor}\log^{2}n}{n}\epsilon^{-d/\beta}\log^{3}(1/\epsilon)+\frac{\log\log n}{n}. \qquad (9)
\end{aligned}$$

Here (9) follows from the following calculation. Suppose $\alpha_{2}$ is the constant that honors $W_{\epsilon}\precsim d^{\lfloor\beta\rfloor}\epsilon^{-d/\beta}\log(1/\epsilon)$, i.e. $W_{\epsilon}\leq\alpha_{2}d^{\lfloor\beta\rfloor}\epsilon^{-d/\beta}\log(1/\epsilon)$. Then,
$$\log W_{\epsilon}\leq\log\alpha_{2}+\lfloor\beta\rfloor\log d+\frac{d}{\beta}\log(1/\epsilon)+\log\log(1/\epsilon)\leq\frac{3d}{\beta}\log(1/\epsilon),$$
when $\epsilon$ is small enough. Taking $\epsilon\asymp(nd^{\lfloor\beta\rfloor})^{-\frac{\beta}{2\beta+d}}$ and $\alpha=\frac{\beta}{2\beta+d}$, we note that with probability at least $1-3\exp\left(-n^{\frac{d}{2\beta+d}}\right)$,
$$\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d}}(\log n)^{5}.$$
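To see why this choice of $\epsilon$ balances the dominant terms in (9) (a routine check we spell out here), note that with $\epsilon\asymp(nd^{\lfloor\beta\rfloor})^{-\beta/(2\beta+d)}$,
$$d^{2\lfloor\beta\rfloor}\epsilon^{2}\asymp d^{2\lfloor\beta\rfloor-\frac{2\beta\lfloor\beta\rfloor}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d}}=d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d}},\qquad\frac{d^{\lfloor\beta\rfloor}}{n}\epsilon^{-d/\beta}\asymp d^{\lfloor\beta\rfloor+\frac{\lfloor\beta\rfloor d}{2\beta+d}}n^{-1+\frac{d}{2\beta+d}}=d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d}},$$
so both terms, as well as $n^{-2\alpha}=n^{-\frac{2\beta}{2\beta+d}}$, are of the order stated in Theorem 8, up to the poly-logarithmic factors.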

Note that for the above bounds to hold, one requires $n\geq\operatorname{Pdim}(\mathcal{F})$ and $\epsilon\leq\epsilon_{0}$, both of which hold when $n$ is large enough. ∎

6.1.4 Proof of Theorem 12

Proof.

We take $\mathcal{F}=\mathcal{R}\mathcal{N}(L_{\epsilon},W_{\epsilon},2C)$, with $L_{\epsilon}\precsim_{\epsilon}\log(1/\epsilon)$ and $W_{\epsilon}\precsim_{\epsilon}\epsilon^{-d^{\star}/\beta}\log(1/\epsilon)$. Then, by Chakraborty and Bartlett (2024b, Theorem 18), we can find $f^{\ast}\in\mathcal{F}$ such that $\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}\leq\epsilon$. Furthermore, by Lemma 20, we observe that with probability at least $1-3\exp\left(-n^{1-2\alpha}\right)$,
$$\begin{aligned}
\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim\; & n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{\log\log n}{n}\\
\leq\; & n^{-2\alpha}+\epsilon^{2}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{\log\log n}{n}\\
\precsim\; & n^{-2\alpha}+\epsilon^{2}+\frac{1}{n}W_{\epsilon}L_{\epsilon}\log(W_{\epsilon})\log^{2}n+\frac{\log\log n}{n} \qquad (10)\\
\precsim_{\epsilon}\; & n^{-2\alpha}+\epsilon^{2}+\frac{\log^{2}n}{n}\epsilon^{-d^{\star}/\beta}\log^{3}(1/\epsilon)+\frac{\log\log n}{n}.
\end{aligned}$$
Here, (10) follows from Bartlett et al. (2019, Theorem 6). Taking $\epsilon\asymp n^{-\frac{\beta}{2\beta+d^{\star}}}$ and $\alpha=\frac{\beta}{2\beta+d^{\star}}$, both $\epsilon^{2}$ and $n^{-1}\epsilon^{-d^{\star}/\beta}$ are of order $n^{-\frac{2\beta}{2\beta+d^{\star}}}$, and we note that, with probability at least $1-3\exp\left(-n^{\frac{d^{\star}}{2\beta+d^{\star}}}\right)$, $\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim_{n}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}$. Note that for the above bounds to hold, one requires $n\geq\operatorname{Pdim}(\mathcal{F})$ and $\epsilon\leq\epsilon_{0}$, both of which hold when $n$ is large enough. ∎

6.2 Proof of the Minimax Rates

In this section, we give a formal proof of Theorem 10. We use the standard technique of Fano's method (see, for example, Wainwright, 2019, Chapter 15) to construct hypotheses that are well separated in the $\mathbb{L}_{2}(\lambda)$ sense but difficult to distinguish in KL-divergence.

Proof of Theorem 10: Let $b(x)=\exp\left(\frac{1}{x^{2}-1}\right)\mathbbm{1}\{|x|\leq 1\}$ be the standard bump function on $\mathbb{R}$. For any $x\in\mathbb{R}^{d}$ and $\delta\in(0,1]$, we let $h_{\delta}(x)=a\delta^{\beta}\prod_{j=1}^{d}b(x_{j}/\delta)$. Here $a$ is chosen such that $ab(x)\in\mathscr{H}^{\beta}(\mathbb{R},\mathbb{R},C)$. It is easy to observe that $h_{\delta}\in\mathscr{H}^{\beta}(\mathbb{R}^{d},\mathbb{R},C)$. In what follows, we take $\delta=1/m$. Let
$$\mathscr{F}_{\delta}=\left\{f_{\boldsymbol{\omega}}(x)=\sum_{\boldsymbol{\xi}\in[m]^{d}}\omega_{\boldsymbol{\xi}}h_{\delta}\left(x-\tfrac{1}{m}(\boldsymbol{\xi}-1/2)\right):\boldsymbol{\omega}\in\{0,1\}^{m^{d}}\right\}.$$

Since each element of $\mathscr{F}_{\delta}$ is a sum of members of $\mathscr{H}^{\beta}(\mathbb{R}^{d},\mathbb{R},C)$ with disjoint supports, $\mathscr{F}_{\delta}\subseteq\mathscr{H}^{\beta}(\mathbb{R}^{d},\mathbb{R},C)$. By the Varshamov-Gilbert bound (Tsybakov, 2009, Lemma 2.9), we can construct a subset $\Omega=\{\boldsymbol{\omega}_{1},\dots,\boldsymbol{\omega}_{M}\}$ of $\{0,1\}^{m^{d}}$ with $\|\boldsymbol{\omega}_{i}-\boldsymbol{\omega}_{j}\|_{1}\geq\frac{m^{d}}{8}$ for all $i\neq j$ and $M\geq 2^{m^{d}/8}$. We note that for any $\boldsymbol{\omega},\boldsymbol{\omega}^{\prime}\in\Omega$,

$$\|f_{\boldsymbol{\omega}}-f_{\boldsymbol{\omega}^{\prime}}\|_{\mathbb{L}_{2}(\lambda)}^{2}\geq\underline{b}_{\lambda}\|f_{\boldsymbol{\omega}}-f_{\boldsymbol{\omega}^{\prime}}\|_{\mathbb{L}_{2}(\operatorname{Leb})}^{2}=\underline{b}_{\lambda}\|\boldsymbol{\omega}-\boldsymbol{\omega}^{\prime}\|_{1}\int h^{2}_{\delta}(x)\,dx=\underline{b}_{\lambda}\|\boldsymbol{\omega}-\boldsymbol{\omega}^{\prime}\|_{1}\,a^{2}\delta^{2\beta+d}\int b^{2}(x)\,dx\succsim m^{d}\delta^{2\beta+d}=\delta^{2\beta}.$$

Let $P_{\boldsymbol{\omega}}$ denote the distribution of the form (3) with $f_{0}$ replaced by $f_{\boldsymbol{\omega}}$. Thus,

$$\begin{aligned}
\operatorname{KL}(P_{\boldsymbol{\omega}}^{\otimes n}\|P_{\boldsymbol{\omega}^{\prime}}^{\otimes n})=n\operatorname{KL}(P_{\boldsymbol{\omega}}\|P_{\boldsymbol{\omega}^{\prime}})=\; & n\,\mathbb{E}_{\boldsymbol{x}}\,d_{\phi}\left(f_{\boldsymbol{\omega}}(\boldsymbol{x})\|f_{\boldsymbol{\omega}^{\prime}}(\boldsymbol{x})\right) \qquad (11)\\
\leq\; & n\tau_{2}\bar{b}_{\lambda}\|\boldsymbol{\omega}-\boldsymbol{\omega}^{\prime}\|_{1}\,a^{2}\delta^{2\beta+d}\int b^{2}(x)\,dx\\
\precsim\; & nm^{d}\delta^{2\beta+d}.
\end{aligned}$$

Here, (11) follows from Lemma 24. Choosing m\asymp n^{1/(2\beta+d)}, we can ensure that \operatorname{KL}(P_{\boldsymbol{\omega}}^{\otimes_{n}}\|P_{\boldsymbol{\omega}^{\prime}}^{\otimes_{n}})\leq\frac{m^{d}}{1000}. Thus, from Wainwright, (2019, equation 15.34), I(Z;J)\leq\frac{1}{M^{2}}\sum_{\boldsymbol{\omega},\boldsymbol{\omega}^{\prime}\in\Omega}\operatorname{KL}(P_{\boldsymbol{\omega}}^{\otimes_{n}}\|P_{\boldsymbol{\omega}^{\prime}}^{\otimes_{n}})\leq\frac{m^{d}}{1000}. Here, I(Z_{1};Z_{2}) denotes the mutual information between the random variables Z_{1} and Z_{2} (Cover and Thomas, 2005, Section 2.3); in our case, J is uniform on \{1,\dots,M\} and Z\sim P_{\boldsymbol{\omega}_{J}}^{\otimes_{n}} given J. Hence, \frac{I(Z;J)+\log 2}{\log M}\leq 8\,\frac{m^{d}/1000+\log 2}{m^{d}\log 2}\leq 1/2, if n is large enough. Thus, applying Proposition 15.2 of Wainwright, (2019), we note that \inf_{\hat{f}}\sup_{f\in\mathscr{H}^{\beta}(\mathbb{R}^{d},\mathbb{R},C)}\mathbb{E}_{f}\|\hat{f}-f\|_{\mathbb{L}_{2}(\lambda)}^{2}\succsim\delta^{2\beta}\asymp n^{-\frac{2\beta}{2\beta+d}}. ∎

7 Conclusion

In this paper, we discussed a statistical framework for understanding the finite-sample properties of supervised deep learning in both regression and classification settings. In particular, we modeled the dependence of the response on the explanatory variable through an exponential family and showed that the maximum likelihood estimate can be obtained by minimizing the corresponding Bregman loss, with the mean function used as the activation of the final layer. Under the assumption of a bounded density for the explanatory variable, we showed that deep ReLU networks achieve the minimax optimal rate when the network size is chosen properly. Furthermore, when the explanatory variable has an intrinsically low-dimensional structure, the convergence rate of the estimator, in terms of the sample size, depends only on the entropic dimension of the underlying distribution of the explanatory variables, yielding faster convergence rates than those in the existing literature for both classification and regression problems.

While our findings offer insights into the theoretical aspects of deep supervised learning, assessing the complete error of such models in practice also requires accounting for an optimization error component. Accurately estimating this component remains a formidable challenge in the non-overparametrized regime due to the non-convex and intricate nature of the optimization problem. Nevertheless, our error analyses operate independently of the optimization process and can be readily integrated with optimization analyses.

Appendix A Proofs of Main Lemmata

A.1 Proof of Lemma 7


Proof.

To prove Lemma 7, we first make the following observation.

𝔼dϕ(yμ(f^(𝒙)))𝔼dϕ(yμ(f0(𝒙)))\displaystyle\mathbb{E}d_{\phi}(y\|\mu(\hat{f}(\boldsymbol{x})))-\mathbb{E}d_{\phi}(y\|\mu(f_{0}(\boldsymbol{x})))
=\displaystyle= 𝔼𝒙𝔼y|𝒙(ϕ(μ(f0(𝒙)))ϕ(μ(f^(𝒙)))ϕ(μ(f^(𝒙))),yμ(f^(𝒙))+ϕ(μ(f0(𝒙))),yμ(f0(𝒙)))\displaystyle\mathbb{E}_{\boldsymbol{x}}\mathbb{E}_{y|\boldsymbol{x}}\bigg{(}\phi(\mu(f_{0}(\boldsymbol{x})))-\phi(\mu(\hat{f}(\boldsymbol{x})))-\left\langle\nabla\phi(\mu(\hat{f}(\boldsymbol{x}))),y-\mu(\hat{f}(\boldsymbol{x}))\right\rangle+\left\langle\nabla\phi(\mu(f_{0}(\boldsymbol{x}))),y-\mu(f_{0}(\boldsymbol{x}))\right\rangle\bigg{)}
=\displaystyle= 𝔼𝒙dϕ(μ(f0(𝒙))μ(f^(𝒙)))\displaystyle\mathbb{E}_{\boldsymbol{x}}d_{\phi}\left(\mu(f_{0}(\boldsymbol{x}))\|\mu(\hat{f}(\boldsymbol{x}))\right)
\displaystyle\leq τ2𝔼𝒙μ(f0(𝒙))μ(f^(𝒙))22\displaystyle\tau_{2}\mathbb{E}_{\boldsymbol{x}}\|\mu(f_{0}(\boldsymbol{x}))-\mu(\hat{f}(\boldsymbol{x}))\|_{2}^{2} (12)
\displaystyle\leq τ2σ1𝔼𝒙f0(𝒙)f^(𝒙)22\displaystyle\tau_{2}\sigma_{1}\mathbb{E}_{\boldsymbol{x}}\|f_{0}(\boldsymbol{x})-\hat{f}(\boldsymbol{x})\|_{2}^{2} (13)
=\displaystyle= σ1σ2f0f^𝕃2(λ)2.\displaystyle\frac{\sigma_{1}}{\sigma_{2}}\|f_{0}-\hat{f}\|_{\mathbb{L}_{2}(\lambda)}^{2}.

Here (12) follows from Lemma 28. Inequality (13) follows from the fact that μ()\mu(\cdot) is σ1\sigma_{1}-Lipschitz. We also note that,

\mathbb{E}d_{\phi}(y\|\mu(\hat{f}(\boldsymbol{x})))-\mathbb{E}d_{\phi}(y\|\mu(f_{0}(\boldsymbol{x})))=\mathbb{E}_{\boldsymbol{x}}d_{\phi}\left(\mu(f_{0}(\boldsymbol{x}))\|\mu(\hat{f}(\boldsymbol{x}))\right)
\displaystyle\geq τ1𝔼𝒙μ(f0(𝒙))μ(f^(𝒙))22\displaystyle\tau_{1}\mathbb{E}_{\boldsymbol{x}}\|\mu(f_{0}(\boldsymbol{x}))-\mu(\hat{f}(\boldsymbol{x}))\|_{2}^{2} (14)
\displaystyle\geq τ1σ2𝔼𝒙f0(𝒙)f^(𝒙)22\displaystyle\tau_{1}\sigma_{2}\mathbb{E}_{\boldsymbol{x}}\|f_{0}(\boldsymbol{x})-\hat{f}(\boldsymbol{x})\|_{2}^{2} (15)
=\displaystyle= σ2σ1f0f^𝕃2(λ)2.\displaystyle\frac{\sigma_{2}}{\sigma_{1}}\|f_{0}-\hat{f}\|_{\mathbb{L}_{2}(\lambda)}^{2}.

As before, (14) follows from the fact that ϕ\phi is τ1\tau_{1}-strongly convex and applying Lemma 28. Inequality (15) follows from the fact that μ()=Ψ′′()σ2\mu^{\prime}(\cdot)=\Psi^{\prime\prime}(\cdot)\geq\sigma_{2}, due to the strong convexity of Ψ\Psi and a simple application of the mean value theorem. ∎
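As a concrete instance (not needed for the argument), consider the Gaussian family with unit variance, where \Psi(\theta)=\theta^{2}/2, \mu is the identity map and \phi(t)=t^{2}/2, so that d_{\phi}(a\|b)=(a-b)^{2}/2 and one may take \sigma_{1}=\sigma_{2}=1. In this case the middle quantity above satisfies

\mathbb{E}_{\boldsymbol{x}}d_{\phi}\left(\mu(f_{0}(\boldsymbol{x}))\|\mu(\hat{f}(\boldsymbol{x}))\right)=\frac{1}{2}\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)},

so the excess Bregman risk is exactly half the squared \mathbb{L}_{2}(\lambda) error, recovering the usual regression setting.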

A.2 Proof of Lemma 15


Proof.

We note that \bar{d}_{\alpha}(\gamma)\leq\overline{\text{dim}}_{M}(\gamma)=\tilde{d}, by Weed and Bach, (2019, Proposition 7). When \gamma\ll\mathbb{H}^{\tilde{d}}, again by Weed and Bach, (2019, Proposition 8), it is known that d_{\ast}(\gamma)=\tilde{d}, where d_{\ast}(\gamma) denotes the lower Wasserstein dimension of \gamma (Weed and Bach, 2019, Definition 4). The result now follows from Proposition 8 of Chakraborty and Bartlett, 2024b. ∎

A.3 Proof of Lemma 17


Proof.

Since f^\hat{f} is the global minimizer of i=1ndϕ(yiμ(f(𝒙i)))\sum_{i=1}^{n}d_{\phi}(y_{i}\|\mu(f(\boldsymbol{x}_{i}))), we note that

i=1ndϕ(yiμ(f^(𝒙i)))i=1ndϕ(yiμ(f(𝒙i))),\displaystyle\sum_{i=1}^{n}d_{\phi}(y_{i}\|\mu(\hat{f}(\boldsymbol{x}_{i})))\leq\sum_{i=1}^{n}d_{\phi}(y_{i}\|\mu(f(\boldsymbol{x}_{i}))), (16)

for any ff\in\mathcal{F}. A little algebra shows that (16) is equivalent to

1ni=1ndϕ(μ(f0(𝒙i))μ(f^(𝒙i)))\displaystyle\frac{1}{n}\sum_{i=1}^{n}d_{\phi}(\mu(f_{0}(\boldsymbol{x}_{i}))\|\mu(\hat{f}(\boldsymbol{x}_{i})))
\displaystyle\leq 1ni=1ndϕ(μ(f0(𝒙i))μ(f(𝒙i)))+1ni=1nϕ(μ(f^(𝒙i)))ϕ(μ(f(𝒙i))),ξi.\displaystyle\frac{1}{n}\sum_{i=1}^{n}d_{\phi}(\mu(f_{0}(\boldsymbol{x}_{i}))\|\mu(f(\boldsymbol{x}_{i})))+\frac{1}{n}\sum_{i=1}^{n}\left\langle\nabla\phi(\mu(\hat{f}(\boldsymbol{x}_{i})))-\nabla\phi(\mu(f(\boldsymbol{x}_{i}))),\xi_{i}\right\rangle. (17)

From (17), applying Lemma 28, we observe that

τ1ni=1n(μ(f0(𝒙i))μ(f^(𝒙i)))2\displaystyle\frac{\tau_{1}}{n}\sum_{i=1}^{n}(\mu(f_{0}(\boldsymbol{x}_{i}))-\mu(\hat{f}(\boldsymbol{x}_{i})))^{2}
\displaystyle\leq τ2ni=1n(μ(f0(𝒙i))μ(f(𝒙i)))2+1ni=1nϕ(μ(f^(𝒙i)))ϕ(μ(f(𝒙i))),ξi.\displaystyle\frac{\tau_{2}}{n}\sum_{i=1}^{n}(\mu(f_{0}(\boldsymbol{x}_{i}))-\mu(f(\boldsymbol{x}_{i})))^{2}+\frac{1}{n}\sum_{i=1}^{n}\left\langle\nabla\phi(\mu(\hat{f}(\boldsymbol{x}_{i})))-\nabla\phi(\mu(f(\boldsymbol{x}_{i}))),\xi_{i}\right\rangle. (18)

Plugging in fff\leftarrow f^{\ast}, we get the desired result. ∎
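For orientation (again, a special case rather than part of the proof), in the Gaussian case \Psi(\theta)=\theta^{2}/2 one has \mu(f)=f, \nabla\phi(t)=t and d_{\phi}(a\|b)=(a-b)^{2}/2, so (17) with f=f^{\ast} reduces to the familiar basic inequality of least squares,

\frac{1}{n}\sum_{i=1}^{n}(f_{0}(\boldsymbol{x}_{i})-\hat{f}(\boldsymbol{x}_{i}))^{2}\leq\frac{1}{n}\sum_{i=1}^{n}(f_{0}(\boldsymbol{x}_{i})-f^{\ast}(\boldsymbol{x}_{i}))^{2}+\frac{2}{n}\sum_{i=1}^{n}\xi_{i}(\hat{f}(\boldsymbol{x}_{i})-f^{\ast}(\boldsymbol{x}_{i})).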

A.4 Proof of Lemma 18


Proof.

From the definition of 𝒢δ\mathscr{G}_{\delta}, it is clear that log𝒩(ϵ;𝒢δ,,n)2log𝒩(ϵ/2;,,n).\log\mathcal{N}(\epsilon;\mathscr{G}_{\delta},\|\cdot\|_{\infty,n})\leq 2\log\mathcal{N}(\epsilon/2;\mathcal{F},\|\cdot\|_{\infty,n}). Let Zf=1ni=1nξif(xi)Z_{f}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}f(x_{i}). Clearly, 𝔼ξZf=0\mathbb{E}_{\xi}Z_{f}=0. Furthermore, applying Lemma 26, we observe that

\mathbb{E}_{\xi}\exp(\lambda(Z_{f}-Z_{g}))=\mathbb{E}_{\xi}\exp\left(\frac{\lambda}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}(f(x_{i})-g(x_{i}))\right)\leq\exp\left(\frac{\lambda^{2}\sigma_{1}}{2}\|f-g\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}\right).

Thus, Z_{f}-Z_{g} is sub-Gaussian with variance proxy \sigma_{1}\|f-g\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}. Furthermore,

supf,g𝒢δfg𝕃2(λn)=\displaystyle\sup_{f,g\in\mathscr{G}_{\delta}}\|f-g\|_{\mathbb{L}_{2}(\lambda_{n})}= supf,f:ff𝕃(λn)δϕ(μ(f))ϕ(μ(f))𝕃2(λn)\displaystyle\sup_{f,f^{\prime}\in\mathcal{F}:\|f-f^{\prime}\|_{\mathbb{L}_{\infty}(\lambda_{n})}\leq\delta}\|\nabla\phi(\mu(f))-\nabla\phi(\mu(f^{\prime}))\|_{\mathbb{L}_{2}(\lambda_{n})}
\displaystyle\leq τ1σ1supf,f:ff𝕃(λn)δff𝕃2(λn)\displaystyle\tau_{1}\sigma_{1}\sup_{f,f^{\prime}\in\mathcal{F}:\|f-f^{\prime}\|_{\mathbb{L}_{\infty}(\lambda_{n})}\leq\delta}\|f-f^{\prime}\|_{\mathbb{L}_{2}(\lambda_{n})}
\displaystyle\leq supf,f:ff𝕃(λn)δff𝕃(λn)\displaystyle\sup_{f,f^{\prime}\in\mathcal{F}:\|f-f^{\prime}\|_{\mathbb{L}_{\infty}(\lambda_{n})}\leq\delta}\|f-f^{\prime}\|_{\mathbb{L}_{\infty}(\lambda_{n})}
\displaystyle\leq δ.\displaystyle\delta.

From Wainwright, (2019, Proposition 5.22),

𝔼ξsupg𝒢δ1ni=1nξig(xi)=𝔼ξsupg𝒢δZg=\displaystyle\mathbb{E}_{\xi}\sup_{g\in\mathscr{G}_{\delta}}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}g(x_{i})=\mathbb{E}_{\xi}\sup_{g\in\mathscr{G}_{\delta}}Z_{g}= 𝔼ξsupg𝒢δ(ZgZg)\displaystyle\mathbb{E}_{\xi}\sup_{g\in\mathscr{G}_{\delta}}(Z_{g}-Z_{g^{\prime}})
\displaystyle\leq 𝔼ξsupg,g𝒢δ(ZgZg)\displaystyle\mathbb{E}_{\xi}\sup_{g,g^{\prime}\in\mathscr{G}_{\delta}}(Z_{g}-Z_{g^{\prime}})
\displaystyle\leq 320δlog𝒩(ϵ;𝒢δ,𝕃2(λn))𝑑ϵ\displaystyle 32\int_{0}^{\delta}\sqrt{\log\mathcal{N}(\epsilon;\mathscr{G}_{\delta},\mathbb{L}_{2}(\lambda_{n}))}d\epsilon
\displaystyle\precsim 0δlog𝒩(ϵ/(2σ1);,𝕃(λn))𝑑ϵ\displaystyle\int_{0}^{\delta}\sqrt{\log\mathcal{N}(\epsilon/(2\sigma_{1});\mathcal{F},\mathbb{L}_{\infty}(\lambda_{n}))}d\epsilon
\displaystyle\precsim 0δPdim()log(n/ϵ)𝑑ϵ\displaystyle\int_{0}^{\delta}\sqrt{\operatorname{Pdim}(\mathcal{F})\log(n/\epsilon)}d\epsilon
\displaystyle\leq δPdim()logn+Pdim()0δlog(1/ϵ)𝑑ϵ\displaystyle\delta\sqrt{\operatorname{Pdim}(\mathcal{F})\log n}+\sqrt{\operatorname{Pdim}(\mathcal{F})}\int_{0}^{\delta}\sqrt{\log(1/\epsilon)}d\epsilon
\displaystyle\leq δPdim()logn+2Pdim()δlog(1/δ)\displaystyle\delta\sqrt{\operatorname{Pdim}(\mathcal{F})\log n}+2\sqrt{\operatorname{Pdim}(\mathcal{F})}\delta\sqrt{\log(1/\delta)} (19)
\displaystyle\precsim δPdim()log(n/δ).\displaystyle\delta\sqrt{\operatorname{Pdim}(\mathcal{F})\log(n/\delta)}. (20)

Here, (19) follows from Lemma 25. Thus, 𝔼supg𝒢δ1ni=1nξig(xi)δPdim()log(n/δ)n.\mathbb{E}\sup_{g\in\mathscr{G}_{\delta}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}g(x_{i})\precsim\delta\sqrt{\frac{\operatorname{Pdim}(\mathcal{F})\log(n/\delta)}{n}}. Applying Lemma 27, we note that for t>0t>0, with probability at least 1ent2/δ21-e^{-nt^{2}/\delta^{2}},

supg𝒢δ1ni=1nξig(xi)\displaystyle\sup_{g\in\mathscr{G}_{\delta}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}g(x_{i})\precsim t+δPdim()log(n/δ)n.\displaystyle t+\delta\sqrt{\frac{\operatorname{Pdim}(\mathcal{F})\log(n/\delta)}{n}}. (21)

A.5 Proof of Lemma 19


Proof.

We take δ=max{nα,2f^f0𝕃2(λn)}\delta=\max\left\{n^{-\alpha},2\|\hat{f}-f_{0}\|_{\mathbb{L}_{2}(\lambda_{n})}\right\} and let t=n2αt=n^{-2\alpha}. We consider two cases as follows.

Case 1: f^f𝕃2(λn)δ\|\hat{f}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}\leq\delta.

Then, by Lemma 18, with probability at least 1exp(n12α)1-\exp\left(-n^{1-2\alpha}\right),

f^f𝕃2(λn)2\displaystyle\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}\leq 2f^f0𝕃2(λn)2+2f0f𝕃2(λn)2\displaystyle 2\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+2\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}
\displaystyle\precsim f0f𝕃2(λn)2+μ(f^)μ(f0)𝕃2(λn)2\displaystyle\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\|\mu(\hat{f})-\mu(f_{0})\|^{2}_{\mathbb{L}_{2}(\lambda_{n})} (22)
\displaystyle\precsim f0f𝕃2(λn)2+supg𝒢δ1ni=1nξig(xi)\displaystyle\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\sup_{g\in\mathscr{G}_{\delta}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}g(x_{i}) (23)
\displaystyle\precsim f0f𝕃2(λn)2+t+δPdim()log(n/δ)n\displaystyle\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+t+\delta\sqrt{\frac{\operatorname{Pdim}(\mathcal{F})\log(n/\delta)}{n}} (24)

In the above calculations, (22) follows from the fact that \mu^{\prime}(\cdot)=\Psi^{\prime\prime}(\cdot)\geq\sigma_{2}, by the strong convexity of \Psi, and (23) follows from Lemma 17. Inequality (24) follows from Lemma 18. Let \alpha_{1}\geq 1 be the corresponding constant that honors the inequality in (24). Then, using the upper bound on \delta, we observe that

f^f𝕃2(λn)2\displaystyle\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}
\displaystyle\leq α1f0f𝕃2(λn)2+α1δPdim()log(n/δ)n+n2α\displaystyle\alpha_{1}\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\alpha_{1}\delta\sqrt{\frac{\operatorname{Pdim}(\mathcal{F})\log(n/\delta)}{n}}+n^{-2\alpha}
\displaystyle\leq α1f0f𝕃2(λn)2+δ216+4α12nPdim()log(n/δ)+n2α\displaystyle\alpha_{1}\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{\delta^{2}}{16}+\frac{4\alpha_{1}^{2}}{n}\operatorname{Pdim}(\mathcal{F})\log(n/\delta)+n^{-2\alpha} (25)
\displaystyle\leq α1f0f𝕃2(λn)2+9n2α8+14f^f0𝕃2(λn)2+4(1+α)α12nPdim()log(n)\displaystyle\alpha_{1}\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{9n^{-2\alpha}}{8}+\frac{1}{4}\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{4(1+\alpha)\alpha_{1}^{2}}{n}\operatorname{Pdim}(\mathcal{F})\log(n)
\displaystyle\leq α1f0f𝕃2(λn)2+2n2α+12f^f𝕃2(λn)2+12ff0𝕃2(λn)2+4(1+α)α12nPdim()log(n).\displaystyle\alpha_{1}\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+2n^{-2\alpha}+\frac{1}{2}\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{2}\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{4(1+\alpha)\alpha_{1}^{2}}{n}\operatorname{Pdim}(\mathcal{F})\log(n).

Here, (25) follows from the fact that xyx16α1+4α1y\sqrt{xy}\leq\frac{x}{16\alpha_{1}}+4\alpha_{1}y, from the AM-GM inequality and taking x=δ2x=\delta^{2} and y=Pdim()log(n/δ)ny=\frac{\operatorname{Pdim}(\mathcal{F})\log(n/\delta)}{n}. Thus,

f^f𝕃2(λn)2\displaystyle\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}\precsim n2α+ff0𝕃2(λn)2+1nPdim()log(n).\displaystyle n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log(n). (26)

Case 2: f^f𝕃2(λn)δ\|\hat{f}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}\geq\delta.

In this case, we note that \|\hat{f}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}\geq 2\|\hat{f}-f_{0}\|_{\mathbb{L}_{2}(\lambda_{n})}. Thus,

f^f𝕃2(λn)2\displaystyle\|\hat{f}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}\leq 2f^f0𝕃2(λn)2+2f0f𝕃2(λn)212f^f𝕃2(λn)2+2f0f𝕃2(λn)2\displaystyle 2\|\hat{f}-f_{0}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}+2\|f_{0}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}\leq\frac{1}{2}\|\hat{f}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}+2\|f_{0}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}
\displaystyle\implies f^f𝕃2(λn)2f0f𝕃2(λn)2.\displaystyle\|\hat{f}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}\precsim\|f_{0}-f^{\ast}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}.

Thus, combining (26) from Case 1 with the bound obtained in Case 2, we note that with probability at least 1-\exp\left(-n^{1-2\alpha}\right),

f^f𝕃2(λn)2n2α+ff0𝕃2(λn)2+1nPdim()log(n).\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}\precsim n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log(n). (27)

From equation (27), we note that, for some constant B4B_{4},

(f^f𝕃2(λn)2B4(n2α+ff0𝕃2(λn)2+1nPdim()log(n))|x1:n)1exp(n12α).\displaystyle\mathbb{P}\left(\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}\leq B_{4}\left(n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log(n)\right)\bigg{|}x_{1:n}\right)\geq 1-\exp\left(-n^{1-2\alpha}\right).

Integrating both sides w.r.t. \lambda^{\otimes_{n}}, i.e., the joint distribution of \boldsymbol{x}_{1:n}, we observe that, unconditionally, with probability at least 1-\exp\left(-n^{1-2\alpha}\right),

f^f𝕃2(λn)2n2α+ff0𝕃2(λn)2+1nPdim()log(n).\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}\precsim n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log(n). (28)

A.6 Proof of Lemma 20

To prove Lemma 20, we first state and prove the following result.

Lemma 22.

For α(0,1/2)\alpha\in(0,1/2), if nmax{e1/α,Pdim()}n\geq\max\left\{e^{1/\alpha},\operatorname{Pdim}(\mathcal{F})\right\}, with probability at least 12exp(n12α)1-2\exp\left(-n^{1-2\alpha}\right),

f^f𝕃2(λ)2\displaystyle\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim n2α+ff0𝕃2(λn)2+1nPdim()log2n+loglognn.\displaystyle n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{\log\log n}{n}.
Proof.

From Lemma 23, we note that if nPdim()n\geq\operatorname{Pdim}(\mathcal{F}), then,

\mathbb{E}_{\epsilon}\sup_{h\in\mathcal{H}_{r}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}h(\boldsymbol{x}_{i})\precsim\sqrt{\frac{r\log(1/r)\operatorname{Pdim}(\mathcal{F})\log n}{n}} (29)
\leq\sqrt{\frac{(\operatorname{Pdim}(\mathcal{F}))^{2}\log n}{n^{2}}+r\,\frac{\operatorname{Pdim}(\mathcal{F})\log(n/(e\operatorname{Pdim}(\mathcal{F})))\log n}{n}}. (30)

Here, (29) follows from Lemma 23 and (30) follows from the fact that for all x,y>0, x\log(1/x)\leq y+x\log(1/(ye)). It is easy to see that the RHS of (30), viewed as a function of r, admits a fixed point r^{\ast} with r^{\ast}\precsim\frac{\operatorname{Pdim}(\mathcal{F})\log^{2}n}{n}. Then, by Theorem 6.1 of Bousquet, (2002), we note that with probability at least 1-e^{-x},

h𝑑λB3(h𝑑λn+Pdim()log2nn+xn+loglognn),h,\int hd\lambda\leq B_{3}\left(\int hd\lambda_{n}+\frac{\operatorname{Pdim}(\mathcal{F})\log^{2}n}{n}+\frac{x}{n}+\frac{\log\log n}{n}\right),\forall h\in\mathcal{H}, (31)

for some absolute constant B3B_{3}. Now, taking x=n12αx=n^{1-2\alpha} in (31), we note that, with probability at least 1exp(n12α)1-\exp\left(-n^{1-2\alpha}\right),

f^f𝕃2(λ)2n2α+f^f𝕃2(λn)2+1nPdim()log2n+loglognn.\displaystyle\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim n^{-2\alpha}+\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{\log\log n}{n}. (32)

Combining (32) with Lemma 19, we observe that with probability at least 12exp(n12α)1-2\exp\left(-n^{1-2\alpha}\right),

f^f𝕃2(λ)2n2α+ff0𝕃2(λn)2+1nPdim()log2n+loglognn.\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda_{n})}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{\log\log n}{n}. (33)

A.6.1 Proof of Lemma 20


Proof.

Let Zi=(f(𝒙i)f0(𝒙i))2𝔼(f(𝒙i)f0(𝒙i))2Z_{i}=(f^{\ast}(\boldsymbol{x}_{i})-f_{0}(\boldsymbol{x}_{i}))^{2}-\mathbb{E}(f^{\ast}(\boldsymbol{x}_{i})-f_{0}(\boldsymbol{x}_{i}))^{2}. Since f0,fB\|f_{0}\|_{\infty},\|f^{\ast}\|_{\infty}\leq B, for some constant BB, it is easy to see that

σ2=i=1n𝔼Zi2=i=1nVar((f(𝒙i)f0(𝒙i))2)i=1n𝔼(f(𝒙i)f0(𝒙i))4\displaystyle\sigma^{2}=\sum_{i=1}^{n}\mathbb{E}Z_{i}^{2}=\sum_{i=1}^{n}\operatorname{Var}\left((f^{\ast}(\boldsymbol{x}_{i})-f_{0}(\boldsymbol{x}_{i}))^{2}\right)\leq\sum_{i=1}^{n}\mathbb{E}\left(f^{\ast}(\boldsymbol{x}_{i})-f_{0}(\boldsymbol{x}_{i})\right)^{4}\leq 4nB2ff0𝕃2(λ)2.\displaystyle 4nB^{2}\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}.

Taking u=vff0𝕃2(λ)2u=v\vee\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}, we note that, σ24nB2u\sigma^{2}\leq 4nB^{2}u. Applying Bernstein’s inequality, e.g., (Vershynin,, 2018, Theorem 2.8.4), with t=nut=nu,

\mathbb{P}\left(\left|\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}-\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}\right|\geq u\right)\leq\exp\left(-\frac{n^{2}u^{2}/2}{4nB^{2}u+Bnu/3}\right)\leq\exp\left(-\frac{nu}{8B^{2}+2B/3}\right)=\exp\left(-\frac{nu}{B_{5}}\right)\leq\exp\left(-\frac{nv}{B_{5}}\right),

with B_{5}=8B^{2}+2B/3. Thus, with probability at least 1-\exp\left(-\frac{nv}{B_{5}}\right),

\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}\leq\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}+u\leq 2\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}+v.

Taking v=B5n2αv=B_{5}n^{-2\alpha}, we observe that, with probability at least 1exp(n12α)1-\exp\left(-n^{1-2\alpha}\right),

\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda_{n})}^{2}\precsim\|f^{\ast}-f_{0}\|_{\mathbb{L}_{2}(\lambda)}^{2}+n^{-2\alpha}. (34)

Combining (34) with Lemma 22, we observe that, with probability at least 13exp(n12α)1-3\exp\left(-n^{1-2\alpha}\right)

f^f𝕃2(λ)2n2α+ff0𝕃2(λ)2+1nPdim()log2n+loglognn.\displaystyle\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda)}\precsim n^{-2\alpha}+\|f^{\ast}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}+\frac{1}{n}\operatorname{Pdim}(\mathcal{F})\log^{2}n+\frac{\log\log n}{n}.

The result now follows from observing that \|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\leq 2\|\hat{f}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda)}+2\|f_{0}-f^{\ast}\|^{2}_{\mathbb{L}_{2}(\lambda)}. ∎
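The Bernstein step leading to (34) is easy to visualize with a small simulation (illustrative only; the function g below, which plays the role of f^{\ast}-f_{0}, and all other names are ours): the empirical squared \mathbb{L}_{2}(\lambda_{n}) distance concentrates around its population counterpart as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: 0.5 * np.cos(2 * np.pi * x)   # stands in for f* - f0; bounded by B = 0.5
pop = 0.125                                  # E[g(X)^2] = 0.25 * 0.5 for X ~ Uniform[0, 1]
for n in [100, 1000, 10000]:
    x = rng.uniform(size=(200, n))           # 200 independent replications of an n-sample
    emp = (g(x) ** 2).mean(axis=1)           # empirical squared L2(lambda_n) norms
    print(n, float(np.max(np.abs(emp - pop))))   # worst deviation over replications shrinks with n
```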

Appendix B Proof of Approximation Results (Lemma 21)


Proof.

We first fix any ϵ(0,1)\epsilon\in(0,1) and let, K=12ϵK=\lceil\frac{1}{2\epsilon}\rceil. For any 𝒊[K]d\boldsymbol{i}\in[K]^{d}, let 𝜽𝒊=(ϵ+2(i11)ϵ,,ϵ+2(id1)ϵ)\boldsymbol{\theta}^{\boldsymbol{i}}=(\epsilon+2(i_{1}-1)\epsilon,\dots,\epsilon+2(i_{d}-1)\epsilon). Clearly, {𝜽𝒊:𝒊[K]d}\{\boldsymbol{\theta}^{\boldsymbol{i}}:\boldsymbol{i}\in[K]^{d}\} constitutes an ϵ\epsilon-net of [0,1]d[0,1]^{d}, w.r.t. the \ell_{\infty}-norm. We let

ξa,b(x)=ReLU(x+aab)ReLU(x+bab)ReLU(xbab)+ReLU(xaab),\xi_{a,b}(x)=\text{ReLU}\left(\frac{x+a}{a-b}\right)-\text{ReLU}\left(\frac{x+b}{a-b}\right)-\text{ReLU}\left(\frac{x-b}{a-b}\right)+\text{ReLU}\left(\frac{x-a}{a-b}\right),

for any 0<ba0<b\leq a. For 0<δϵ/30<\delta\leq\epsilon/3, we define

ζϵ,δ(𝒙)=j=1dξϵ,δ(xj).\zeta_{\epsilon,\delta}(\boldsymbol{x})=\prod_{j=1}^{d}\xi_{\epsilon,\delta}(x_{j}).

We define the region \mathscr{Q}_{\epsilon,\delta}=\cup_{\boldsymbol{i}\in[K]^{d}}B_{\ell_{\infty}}(\boldsymbol{\theta}^{\boldsymbol{i}},\delta). It is easy to observe that \operatorname{Leb}([0,1]^{d}\setminus\mathscr{Q}_{\epsilon,\delta})\leq 2d\delta. Here, \operatorname{Leb}(\cdot) denotes the Lebesgue measure on \mathbb{R}^{d}. Clearly, \zeta_{\epsilon,\delta}\left(\cdot-\boldsymbol{\theta}^{\boldsymbol{i}}\right)=1 on B_{\ell_{\infty}}(\boldsymbol{\theta}^{\boldsymbol{i}},\delta), for each \boldsymbol{i}\in[K]^{d}.
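The partition-of-unity construction above is elementary enough to check numerically; the following small sketch (an illustration only, not part of the proof; the function names are ours) evaluates \xi_{a,b} and \zeta_{\epsilon,\delta} with plain ReLU operations and verifies the plateau and support behaviour used below.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def xi(x, a, b):
    # Trapezoidal bump: equals 1 on [-b, b], vanishes outside [-a, a], linear in between.
    return (relu((x + a) / (a - b)) - relu((x + b) / (a - b))
            - relu((x - b) / (a - b)) + relu((x - a) / (a - b)))

def zeta(x, eps, delta):
    # Product bump on R^d: equals 1 on the sup-norm ball of radius delta around 0
    # and vanishes outside the sup-norm ball of radius eps.
    return np.prod([xi(xj, eps, delta) for xj in x])

eps, delta = 0.25, 0.05
assert np.isclose(zeta(np.array([0.03, -0.04]), eps, delta), 1.0)  # inside the delta-ball
assert np.isclose(zeta(np.array([0.30, 0.00]), eps, delta), 0.0)   # outside the eps-ball
```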

Consider the Taylor expansion of ff around 𝜽\boldsymbol{\theta} as, P𝜽(𝒙)=|𝒔|β𝒔f(𝜽)𝒔!(𝒙𝜽)𝒔.P_{\boldsymbol{\theta}}(\boldsymbol{x})=\sum_{|\boldsymbol{s}|\leq\lfloor\beta\rfloor}\frac{\partial^{\boldsymbol{s}}f(\boldsymbol{\theta})}{\boldsymbol{s}!}\left(\boldsymbol{x}-\boldsymbol{\theta}\right)^{\boldsymbol{s}}.

Clearly,f(𝒙)P𝜽(𝒙)=|𝒔|=β(𝒙𝜽)𝒔𝒔!(𝒔f(𝒚)𝒔f(𝜽))\displaystyle\text{Clearly},\,f(\boldsymbol{x})-P_{\boldsymbol{\theta}}(\boldsymbol{x})=\sum_{|\boldsymbol{s}|=\lfloor\beta\rfloor}\frac{\left(\boldsymbol{x}-\boldsymbol{\theta}\right)^{\boldsymbol{s}}}{\boldsymbol{s}!}\left(\partial^{\boldsymbol{s}}f(\boldsymbol{y})-\partial^{\boldsymbol{s}}f(\boldsymbol{\theta})\right)\leq 𝒙𝜽β|𝒔|=β1𝒔!|𝒔f(𝒚)𝒔f(𝜽)|\displaystyle\|\boldsymbol{x}-\boldsymbol{\theta}\|_{\infty}^{\lfloor\beta\rfloor}\sum_{|\boldsymbol{s}|=\lfloor\beta\rfloor}\frac{1}{\boldsymbol{s}!}\left|\partial^{\boldsymbol{s}}f(\boldsymbol{y})-\partial^{\boldsymbol{s}}f(\boldsymbol{\theta})\right|
\displaystyle\leq Cdββ!𝒙𝜽β.\displaystyle\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\|\boldsymbol{x}-\boldsymbol{\theta}\|_{\infty}^{\beta}. (35)

In the above calculations, 𝒚\boldsymbol{y} lies on the line segment joining 𝒙\boldsymbol{x} and 𝜽\boldsymbol{\theta}. Inequality (35) follows from the fact that |𝒔f(𝒚)𝒔f(𝜽)|C𝒚𝜽ββC𝒙𝜽ββ\left|\partial^{\boldsymbol{s}}f(\boldsymbol{y})-\partial^{\boldsymbol{s}}f(\boldsymbol{\theta})\right|\leq C\|\boldsymbol{y}-\boldsymbol{\theta}\|_{\infty}^{\beta-\lfloor\beta\rfloor}\leq C\|\boldsymbol{x}-\boldsymbol{\theta}\|_{\infty}^{\beta-\lfloor\beta\rfloor} and the identity dkk!=|𝒔|=k1𝒔!\frac{d^{k}}{k!}=\sum_{|\boldsymbol{s}|=k}\frac{1}{\boldsymbol{s}!}. Next, we suppose that f~(𝒙)=𝒊[K]dζϵ,δ(𝒙𝜽𝒊)P𝜽𝒊(𝒙)\tilde{f}(\boldsymbol{x})=\sum_{\boldsymbol{i}\in[K]^{d}}\zeta_{\epsilon,\delta}(\boldsymbol{x}-\boldsymbol{\theta}^{\boldsymbol{i}})P_{\boldsymbol{\theta}^{\boldsymbol{i}}}(\boldsymbol{x}). Thus, if 𝒙𝒬ϵ,δ\boldsymbol{x}\in\mathscr{Q}_{\epsilon,\delta},

|f(𝒙)f~(𝒙)|max𝒊[K]dsup𝒙B(𝜽𝒊,δ)|f(𝒙)P𝜽𝒊(𝒙)|\displaystyle|f(\boldsymbol{x})-\tilde{f}(\boldsymbol{x})|\leq\max_{\boldsymbol{i}\in[K]^{d}}\sup_{\boldsymbol{x}\in B_{\ell_{\infty}}(\boldsymbol{\theta}^{\boldsymbol{i}},\delta)}|f(\boldsymbol{x})-P_{\boldsymbol{\theta}^{\boldsymbol{i}}}(\boldsymbol{x})|\leq Cdββ!δβ.\displaystyle\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\delta^{\beta}. (36)

Here, (36) follows from (35). Thus, ff~𝕃(𝒬ϵ,δ)Cdββ!δβ\|f-\tilde{f}\|_{\mathbb{L}_{\infty}(\mathscr{Q}_{\epsilon,\delta})}\leq\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\delta^{\beta}. Furthermore, by definition, f~C+Cdββ!ϵβ\|\tilde{f}\|_{\infty}\leq C+\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\epsilon^{\beta}. Let a𝒊,𝒔=𝒔f(θ𝒊)𝒔!a_{\boldsymbol{i},\boldsymbol{s}}=\frac{\partial^{\boldsymbol{s}}f(\theta^{\boldsymbol{i}})}{\boldsymbol{s}!} and

\hat{f}_{\boldsymbol{i},\boldsymbol{s}}(\boldsymbol{x})=\text{prod}_{m}^{(d+|\boldsymbol{s}|)}\Big(\xi_{\epsilon,\delta}(x_{1}-\theta_{1}^{\boldsymbol{i}}),\dots,\xi_{\epsilon,\delta}(x_{d}-\theta_{d}^{\boldsymbol{i}}),\underbrace{(x_{1}-\theta_{1}^{\boldsymbol{i}}),\dots,(x_{1}-\theta_{1}^{\boldsymbol{i}})}_{s_{1}\text{ times}},\dots,\underbrace{(x_{d}-\theta_{d}^{\boldsymbol{i}}),\dots,(x_{d}-\theta_{d}^{\boldsymbol{i}})}_{s_{d}\text{ times}}\Big).

Here, \text{prod}_{m}^{(d+|\boldsymbol{s}|)} is an approximation of the product function developed by Chakraborty and Bartlett, 2024b (see Lemma 29) and has at most d+|\boldsymbol{s}|\leq d+\lfloor\beta\rfloor many inputs. By Chakraborty and Bartlett, 2024b (Lemma 40), \text{prod}_{m}^{(d+|\boldsymbol{s}|)} can be implemented by a ReLU network with \mathcal{L}(\text{prod}_{m}^{(d+|\boldsymbol{s}|)}),\mathcal{W}(\text{prod}_{m}^{(d+|\boldsymbol{s}|)})\leq c_{3}m, where c_{3} is an absolute constant. Thus, \mathcal{L}(\hat{f}_{\boldsymbol{i},\boldsymbol{s}})\leq c_{3}m+2 and \mathcal{W}(\hat{f}_{\boldsymbol{i},\boldsymbol{s}})\leq c_{3}m+8d+4|\boldsymbol{s}|\leq c_{3}m+8d+4\lfloor\beta\rfloor. From Chakraborty and Bartlett, 2024b (Lemma 40), we observe that,

|f^𝒊,𝒔(𝒙)ζ(xθ𝒊)(xθ𝒊)𝒔|12m,xS.\displaystyle\left|\hat{f}_{\boldsymbol{i},\boldsymbol{s}}(\boldsymbol{x})-\zeta(x-\theta^{\boldsymbol{i}})\left(x-\theta^{\boldsymbol{i}}\right)^{\boldsymbol{s}}\right|\leq\frac{1}{2^{m}},\,\forall x\in S. (37)

Here, m\geq\frac{1}{2}\left(\log_{2}(4(d+\lfloor\beta\rfloor))-1\right), as required by Lemma 29 for a product with at most d+\lfloor\beta\rfloor inputs. Finally, let \hat{f}(\boldsymbol{x})=\sum_{\boldsymbol{i}\in[K]^{d}}\sum_{|\boldsymbol{s}|\leq\lfloor\beta\rfloor}a_{\boldsymbol{i},\boldsymbol{s}}\hat{f}_{\boldsymbol{i},\boldsymbol{s}}(\boldsymbol{x}). Clearly, for each \boldsymbol{i}\in[K]^{d}, the subnetwork \sum_{|\boldsymbol{s}|\leq\lfloor\beta\rfloor}a_{\boldsymbol{i},\boldsymbol{s}}\hat{f}_{\boldsymbol{i},\boldsymbol{s}} has depth at most c_{3}m+3 and at most \binom{d+\lfloor\beta\rfloor}{\lfloor\beta\rfloor}\left(c_{3}m+8d+4\lfloor\beta\rfloor\right) weights. Then, from (37), we observe that,

\sup_{\boldsymbol{x}\in\mathscr{Q}_{\epsilon,\delta}}|\hat{f}(\boldsymbol{x})-\tilde{f}(\boldsymbol{x})|\leq\max_{\boldsymbol{i}\in[K]^{d}}\sup_{\boldsymbol{x}\in B_{\ell_{\infty}}(\boldsymbol{\theta}^{\boldsymbol{i}},\delta)}\sum_{|\boldsymbol{s}|\leq\lfloor\beta\rfloor}|a_{\boldsymbol{i},\boldsymbol{s}}|\left|\hat{f}_{\boldsymbol{i},\boldsymbol{s}}(\boldsymbol{x})-\zeta_{\epsilon,\delta}(\boldsymbol{x}-\boldsymbol{\theta}^{\boldsymbol{i}})\left(\boldsymbol{x}-\boldsymbol{\theta}^{\boldsymbol{i}}\right)^{\boldsymbol{s}}\right|
\leq\frac{C}{2^{m}}. (38)

From (36) and (38), we thus get that if 𝒙𝒬ϵ,δ\boldsymbol{x}\in\mathscr{Q}_{\epsilon,\delta},

|f(𝒙)f^(𝒙)|\displaystyle|f(\boldsymbol{x})-\hat{f}(\boldsymbol{x})|\leq |f(𝒙)f~(𝒙)|+|f^(𝒙)f~(𝒙)|Cdββ!δβ+C2m.\displaystyle|f(\boldsymbol{x})-\tilde{f}(\boldsymbol{x})|+|\hat{f}(\boldsymbol{x})-\tilde{f}(\boldsymbol{x})|\leq\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\delta^{\beta}+\frac{C}{2^{m}}. (39)

Furthermore, it is easy to observe that, f^𝕃([0,1]d)C+Cdββ!ϵβ+C2m\|\hat{f}\|_{\mathbb{L}_{\infty}([0,1]^{d})}\leq C+\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\epsilon^{\beta}+\frac{C}{2^{m}}. Hence,

ff^𝕃p(Leb)p=\displaystyle\|f-\hat{f}\|_{\mathbb{L}_{p}(\text{Leb})}^{p}= 𝒬ϵ,δ|f(𝒙)f^(𝒙)|p𝑑Leb(𝒙)+𝒬ϵ,δ|f(𝒙)f^(𝒙)|p𝑑Leb(𝒙)\displaystyle\int_{\mathscr{Q}_{\epsilon,\delta}}|f(\boldsymbol{x})-\hat{f}(\boldsymbol{x})|^{p}d\text{Leb}(\boldsymbol{x})+\int_{\mathscr{Q}_{\epsilon,\delta}^{\complement}}|f(\boldsymbol{x})-\hat{f}(\boldsymbol{x})|^{p}d\text{Leb}(\boldsymbol{x})
\displaystyle\leq (Cdββ!δβ+C2m)pLeb(𝒬ϵ,δ)+(2C+Cdββ!ϵβ+C2m)pLeb(𝒬ϵ,δ)\displaystyle\left(\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\delta^{\beta}+\frac{C}{2^{m}}\right)^{p}\operatorname{Leb}\left(\mathscr{Q}_{\epsilon,\delta}\right)+\left(2C+\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\epsilon^{\beta}+\frac{C}{2^{m}}\right)^{p}\operatorname{Leb}(\mathscr{Q}_{\epsilon,\delta}^{\complement})
\displaystyle\leq (Cdββ!δβ+C2m)p+2(2C+Cdββ!ϵβ+C2m)pdδ\displaystyle\left(\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\delta^{\beta}+\frac{C}{2^{m}}\right)^{p}+2\left(2C+\frac{Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\epsilon^{\beta}+\frac{C}{2^{m}}\right)^{p}d\delta
ff^𝕃p(Leb)\displaystyle\implies\|f-\hat{f}\|_{\mathbb{L}_{p}(\text{Leb})}\leq 2Cdββ!δβ+2C2m+4C(dδ)1/p2Cdββ!ϵβ+2C2m+4Cϵβ10Cdβϵβ+2C2m,\displaystyle\frac{2Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\delta^{\beta}+\frac{2C}{2^{m}}+4C(d\delta)^{1/p}\leq\frac{2Cd^{\lfloor\beta\rfloor}}{\lfloor\beta\rfloor!}\epsilon^{\beta}+\frac{2C}{2^{m}}+4C\epsilon^{\beta}\leq 10Cd^{\lfloor\beta\rfloor}\epsilon^{\beta}+\frac{2C}{2^{m}},

taking δ=1dϵpβ(ϵ/3)\delta=\frac{1}{d}\epsilon^{p\beta}\wedge(\epsilon/3). We take m=log2(8ηdβ)m=\left\lceil\log_{2}\left(\frac{8}{\eta d^{\lfloor\beta\rfloor}}\right)\right\rceil and ϵ=(η/20)1/β\epsilon=(\eta/20)^{1/\beta}. Thus, ff^𝕃p(Leb)Cdβη.\|f-\hat{f}\|_{\mathbb{L}_{p}(\text{Leb})}\leq Cd^{\lfloor\beta\rfloor}\eta. We note that f^\hat{f} has at most KdK^{d}-many networks of depth c3m+3c_{3}m+3 and number of weights (d+ββ)(c3m+8d+4β){{d+\lfloor\beta\rfloor}\choose\lfloor\beta\rfloor}\left(c_{3}m+8d+4\lfloor\beta\rfloor\right). Thus, (f^)c3m+4\mathcal{L}(\hat{f})\leq c_{3}m+4 and 𝒲(f^)Kd(d+ββ)(c3m+8d+4β)\mathcal{W}(\hat{f})\leq K^{d}{{d+\lfloor\beta\rfloor}\choose\lfloor\beta\rfloor}\left(c_{3}m+8d+4\lfloor\beta\rfloor\right). We thus get,

(f^)c3m+4c3log2(8ηdβ)+4.\mathcal{L}(\hat{f})\leq c_{3}m+4\leq c_{3}\left\lceil\log_{2}\left(\frac{8}{\eta d^{\lfloor\beta\rfloor}}\right)\right\rceil+4.

Similarly,

𝒲(f^)\displaystyle\mathcal{W}(\hat{f})\leq Kd(d+ββ)(c3m+8d+4β)\displaystyle K^{d}{{d+\lfloor\beta\rfloor}\choose\lfloor\beta\rfloor}\left(c_{3}m+8d+4\lfloor\beta\rfloor\right)
\displaystyle\leq 12(η/20)1/βd(d+ββ)(c3log2(8ηdβ)+8d+4β)\displaystyle\left\lceil\frac{1}{2(\eta/20)^{1/\beta}}\right\rceil^{d}{{d+\lfloor\beta\rfloor}\choose\lfloor\beta\rfloor}\left(c_{3}\left\lceil\log_{2}\left(\frac{8}{\eta d^{\lfloor\beta\rfloor}}\right)\right\rceil+8d+4\lfloor\beta\rfloor\right)
\displaystyle\leq 12(η/20)1/βd(3β)β(d+β)β(c3log2(8ηdβ)+8d+4β).\displaystyle\left\lceil\frac{1}{2(\eta/20)^{1/\beta}}\right\rceil^{d}\left(\frac{3}{\beta}\right)^{\beta}(d+\lfloor\beta\rfloor)^{\lfloor\beta\rfloor}\left(c_{3}\left\lceil\log_{2}\left(\frac{8}{\eta d^{\lfloor\beta\rfloor}}\right)\right\rceil+8d+4\lfloor\beta\rfloor\right).

The proof is now complete by replacing c3c_{3} with ϑ\vartheta. ∎
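Since the final depth and width bounds are fully explicit, they can be evaluated numerically. The small helper below simply transcribes the last two displays (an illustration only; the absolute constant c_{3}=\vartheta is left as a user-supplied parameter, and the formulas are meaningful for \eta small enough that 8/(\eta d^{\lfloor\beta\rfloor})>1).

```python
from math import ceil, comb, floor, log2

def network_size_bounds(eta, d, beta, c3=1.0):
    # Evaluates the depth/width bounds from the end of the proof of Lemma 21.
    k = floor(beta)
    m = ceil(log2(8.0 / (eta * d ** k)))                  # the chosen value of m
    K = ceil(1.0 / (2.0 * (eta / 20.0) ** (1.0 / beta)))  # grid points per coordinate
    depth = c3 * m + 4
    width = (K ** d) * comb(d + k, k) * (c3 * m + 8 * d + 4 * k)
    return depth, width

print(network_size_bounds(eta=0.05, d=3, beta=2.0))  # e.g., roughly depth 9 and width 3.7e5
```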

Appendix C Proof of Corollary 9

Proof.

Let \kappa be the constant that honors the inequality. Then

\mathbb{E}\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}=\mathbb{E}\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\mathbbm{1}\left\{\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\leq\kappa d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}\right\}+\mathbb{E}\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}\mathbbm{1}\left\{\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}>\kappa d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}\right\}
\leq\kappa d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}+9C^{2}\mathbb{P}\left(\|\hat{f}-f_{0}\|^{2}_{\mathbb{L}_{2}(\lambda)}>\kappa d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}\right)
\leq\kappa d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}+27C^{2}\exp\left(-n^{\frac{d^{\star}}{2\beta+d^{\star}}}\right)
\precsim d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}}n^{-\frac{2\beta}{2\beta+d^{\star}}}(\log n)^{5}. ∎

Appendix D Supporting Lemmas

Lemma 23.

Let r={h=(ff)2:f,f and λnhr}\mathcal{H}_{r}=\{h=(f-f^{\prime})^{2}:f,f^{\prime}\in\mathcal{F}\text{ and }\lambda_{n}h\leq r\} with supff𝕃(λn)<\sup_{f\in\mathcal{F}}\|f\|_{\mathbb{L}_{\infty}(\lambda_{n})}<\infty. Then, we can find r0>0r_{0}>0, such that if 0<rr00<r\leq r_{0} and nPdim()n\geq\operatorname{Pdim}(\mathcal{F}),

𝔼ϵsuphr1ni=1nϵih(𝒙i)rlog(1/r)Pdim()lognn.\mathbb{E}_{\epsilon}\sup_{h\in\mathcal{H}_{r}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}h(\boldsymbol{x}_{i})\precsim\sqrt{\frac{r\log(1/r)\operatorname{Pdim}(\mathcal{F})\log n}{n}}.
Proof.

Let B=4\sup_{f\in\mathcal{F}}\|f\|_{\mathbb{L}_{\infty}(\lambda_{n})}^{2}. We first fix \epsilon\leq\sqrt{2Br} and let h=(f-f^{\prime})^{2} be a member of \mathcal{H}_{r}, with f,f^{\prime}\in\mathcal{F}. We use the notation \mathcal{F}_{|_{\boldsymbol{x}_{1:n}}}=\{(f(\boldsymbol{x}_{1}),\dots,f(\boldsymbol{x}_{n}))^{\top}:f\in\mathcal{F}\}. Let \boldsymbol{v}^{f},\boldsymbol{v}^{f^{\prime}}\in\mathcal{C}(\epsilon;\mathcal{F}_{|_{\boldsymbol{x}_{1:n}}},\|\cdot\|_{\infty}) be such that |\boldsymbol{v}^{f}_{i}-f(\boldsymbol{x}_{i})|,|\boldsymbol{v}^{f^{\prime}}_{i}-f^{\prime}(\boldsymbol{x}_{i})|\leq\epsilon, for all i. Here, \mathcal{C}(\epsilon;\mathcal{F}_{|_{\boldsymbol{x}_{1:n}}},\|\cdot\|_{\infty}) denotes an \epsilon-cover of \mathcal{F}_{|_{\boldsymbol{x}_{1:n}}} w.r.t. the \ell_{\infty}-norm. Let \boldsymbol{v}=\boldsymbol{v}^{f}-\boldsymbol{v}^{f^{\prime}}. Then

1ni=1n(h(𝒙i)𝒗i2)2=\displaystyle\frac{1}{n}\sum_{i=1}^{n}(h(\boldsymbol{x}_{i})-\boldsymbol{v}_{i}^{2})^{2}= 1ni=1n((f(𝒙i)f(𝒙i))2(vifvif)2)2\displaystyle\frac{1}{n}\sum_{i=1}^{n}((f(\boldsymbol{x}_{i})-f^{\prime}(\boldsymbol{x}_{i}))^{2}-(v_{i}^{f}-v_{i}^{f^{\prime}})^{2})^{2}
\displaystyle\leq 2ni=1n((f(𝒙i)f(𝒙i))2+(vifvif)2)((f(𝒙i)f(𝒙i))(vifvif))2\displaystyle\frac{2}{n}\sum_{i=1}^{n}((f(\boldsymbol{x}_{i})-f^{\prime}(\boldsymbol{x}_{i}))^{2}+(v_{i}^{f}-v_{i}^{f^{\prime}})^{2})((f(\boldsymbol{x}_{i})-f^{\prime}(\boldsymbol{x}_{i}))-(v_{i}^{f}-v_{i}^{f^{\prime}}))^{2} (40)
\displaystyle\precsim ϵ2.\displaystyle\epsilon^{2}. (41)

Here, (40) follows from the fact that (t^{2}-s^{2})^{2}=(t+s)^{2}(t-s)^{2}\leq 2(t^{2}+s^{2})(t-s)^{2}, for any t,s\in\mathbb{R}. Hence, from the above calculations, \mathcal{N}(\epsilon;\mathcal{H}_{r},\mathbb{L}_{2}(\lambda_{n}))\leq\left(\mathcal{N}\left(a_{1}\epsilon;\mathcal{F},\mathbb{L}_{\infty}(\lambda_{n})\right)\right)^{2}, for some absolute constant a_{1}.

diam2(r,𝕃2(λn))=suph,hr1ni=1n(h(𝒙i)h(𝒙i))22suphr1ni=1nh2(𝒙i)2Bsuphr1ni=1nh(𝒙i)2Br.\displaystyle\operatorname{diam}^{2}(\mathcal{H}_{r},\mathbb{L}_{2}(\lambda_{n}))=\sup_{h,h^{\prime}\in\mathcal{H}_{r}}\frac{1}{n}\sum_{i=1}^{n}(h(\boldsymbol{x}_{i})-h^{\prime}(\boldsymbol{x}_{i}))^{2}\leq 2\sup_{h\in\mathcal{H}_{r}}\frac{1}{n}\sum_{i=1}^{n}h^{2}(\boldsymbol{x}_{i})\leq 2B\sup_{h\in\mathcal{H}_{r}}\frac{1}{n}\sum_{i=1}^{n}h(\boldsymbol{x}_{i})\leq 2Br.

Hence, diam(r,𝕃2(λn))2Br\operatorname{diam}(\mathcal{H}_{r},\mathbb{L}_{2}(\lambda_{n}))\leq\sqrt{2Br}. Thus from Wainwright, (2019, Theorem 5.22),

𝔼ϵsuphr1ni=1nϵih(𝒙i)\displaystyle\mathbb{E}_{\epsilon}\sup_{h\in\mathcal{H}_{r}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}h(\boldsymbol{x}_{i})\precsim 02Br1nlog𝒩(ϵ;r,𝕃2(λn))𝑑ϵ\displaystyle\int_{0}^{\sqrt{2Br}}\sqrt{\frac{1}{n}\log\mathcal{N}(\epsilon;\mathcal{H}_{r},\mathbb{L}_{2}(\lambda_{n}))}d\epsilon
\displaystyle\leq 02Br2Pdim()nlog(a2nϵ)𝑑ϵ\displaystyle\int_{0}^{\sqrt{2Br}}\sqrt{\frac{2\operatorname{Pdim}(\mathcal{F})}{n}\log\left(\frac{a_{2}n}{\epsilon}\right)}d\epsilon
\displaystyle\precsim 2BrPdim()lognn+02BrPdim()nlog(a2/ϵ)𝑑ϵ\displaystyle\sqrt{2Br}\sqrt{\frac{\operatorname{Pdim}(\mathcal{F})\log n}{n}}+\int_{0}^{\sqrt{2Br}}\sqrt{\frac{\operatorname{Pdim}(\mathcal{F})}{n}\log(a_{2}/\epsilon)}d\epsilon
\displaystyle\precsim rlog(1/r)Pdim()lognn.\displaystyle\sqrt{\frac{r\log(1/r)\operatorname{Pdim}(\mathcal{F})\log n}{n}}. (42)

Here, (42) follows from Lemma 25. ∎

Lemma 24.

KL(pΨ,𝜽pΨ,𝜽)=dϕ(𝝁(𝜽)𝝁(𝜽)).\operatorname{KL}(p_{\Psi,\boldsymbol{\theta}}\|p_{\Psi,\boldsymbol{\theta}^{\prime}})=d_{\phi}\left(\boldsymbol{\mu}(\boldsymbol{\theta})\|\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime})\right).

Proof.

To prove this result, we make the following observations.

\operatorname{KL}(p_{\Psi,\boldsymbol{\theta}}\|p_{\Psi,\boldsymbol{\theta}^{\prime}})
=\mathbb{E}_{\boldsymbol{x}\sim p_{\Psi,\boldsymbol{\theta}}}\left(d_{\phi}(\boldsymbol{x}\|\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime}))-d_{\phi}(\boldsymbol{x}\|\boldsymbol{\mu}(\boldsymbol{\theta}))\right)
=\mathbb{E}_{\boldsymbol{x}\sim p_{\Psi,\boldsymbol{\theta}}}\left(\phi(\boldsymbol{x})-\phi(\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime}))-\langle\nabla\phi(\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime})),\boldsymbol{x}-\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime})\rangle-\phi(\boldsymbol{x})+\phi(\boldsymbol{\mu}(\boldsymbol{\theta}))+\langle\nabla\phi(\boldsymbol{\mu}(\boldsymbol{\theta})),\boldsymbol{x}-\boldsymbol{\mu}(\boldsymbol{\theta})\rangle\right)
=\mathbb{E}_{\boldsymbol{x}\sim p_{\Psi,\boldsymbol{\theta}}}\left(\phi(\boldsymbol{\mu}(\boldsymbol{\theta}))-\phi(\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime}))-\langle\nabla\phi(\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime})),\boldsymbol{x}-\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime})\rangle+\langle\nabla\phi(\boldsymbol{\mu}(\boldsymbol{\theta})),\boldsymbol{x}-\boldsymbol{\mu}(\boldsymbol{\theta})\rangle\right)
=d_{\phi}\left(\boldsymbol{\mu}(\boldsymbol{\theta})\|\boldsymbol{\mu}(\boldsymbol{\theta}^{\prime})\right). (43)

Here, (43) follows from noting that \mathbb{E}_{\boldsymbol{x}\sim p_{\Psi,\boldsymbol{\theta}}}\boldsymbol{x}=\boldsymbol{\mu}(\boldsymbol{\theta}). ∎
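As a quick sanity check of Lemma 24 (not part of the proof), the identity can be verified numerically for the Poisson family, for which \Psi(\theta)=e^{\theta}, \boldsymbol{\mu}(\theta)=e^{\theta} and \phi(t)=t\log t-t; the helper names below are ours.

```python
import numpy as np

def kl_poisson(m1, m2):
    # Closed-form KL divergence between Poisson distributions with means m1 and m2.
    return m1 * np.log(m1 / m2) - m1 + m2

def bregman_phi(m1, m2):
    # Bregman divergence generated by phi(t) = t*log(t) - t, whose gradient is log(t).
    phi = lambda t: t * np.log(t) - t
    return phi(m1) - phi(m2) - np.log(m2) * (m1 - m2)

for m1, m2 in [(0.5, 2.0), (3.0, 3.0), (10.0, 1.0)]:
    assert np.isclose(kl_poisson(m1, m2), bregman_phi(m1, m2))
```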

Appendix E Supporting Lemmata

Lemma 25.

For any δ1/e\delta\leq 1/e, 0δlog(1/ϵ)𝑑ϵ2δlog(1/δ)\int_{0}^{\delta}\sqrt{\log(1/\epsilon)}d\epsilon\leq 2\delta\sqrt{\log(1/\delta)}.

Proof.

We start by making a transformation x=log(1/ϵ)x=\log(1/\epsilon) and observe that,

\int_{0}^{\delta}\sqrt{\log(1/\epsilon)}d\epsilon=\int_{\log(1/\delta)}^{\infty}\sqrt{x}e^{-x}dx=\int_{\log(1/\delta)}^{\infty}\sqrt{x}e^{-x/2}e^{-x/2}dx
\leq\sqrt{\log(1/\delta)}e^{-\frac{1}{2}\log(1/\delta)}\int_{\log(1/\delta)}^{\infty}e^{-x/2}dx (44)
=2\sqrt{\log(1/\delta)}e^{-\frac{1}{2}\log(1/\delta)}e^{-\frac{1}{2}\log(1/\delta)}
=2\sqrt{\log(1/\delta)}e^{-\log(1/\delta)}
=2\delta\sqrt{\log(1/\delta)}.

In the above calculations, (44) follows from the fact that the function xex/2\sqrt{x}e^{-x/2} is decreasing when x1x\geq 1. ∎
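A quick numerical check of Lemma 25 (illustrative only), using the same substitution x=\log(1/\epsilon) as in the proof, so that the left-hand side becomes \int_{\log(1/\delta)}^{\infty}\sqrt{x}e^{-x}dx:

```python
import numpy as np
from scipy.integrate import quad

for delta in [1 / np.e, 0.1, 0.01]:
    lhs, _ = quad(lambda x: np.sqrt(x) * np.exp(-x), np.log(1.0 / delta), np.inf)
    rhs = 2.0 * delta * np.sqrt(np.log(1.0 / delta))
    assert lhs <= rhs   # the inequality of Lemma 25 holds comfortably
```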

Lemma 26.

Let ZpΨ,θZ\sim p_{\Psi,\theta}, with θΘ=\theta\in\Theta=\mathop{\mathbb{R}}\nolimits. Then, Z𝔼ZZ-\mathbb{E}Z is σ1\sigma_{1}-SubGaussian.

Proof.

We observe the following,

𝔼eλ(Z𝔼Z)=eλΨ(θ)eλzeθzΨ(θ)h(z)𝑑τ(z)\displaystyle\mathbb{E}e^{\lambda(Z-\mathbb{E}Z)}=e^{-\lambda\nabla\Psi(\theta)}\int e^{\lambda z}e^{\theta z-\Psi(\theta)}h(z)d\tau(z) =eλΨ(θ)Ψ(θ)e(θ+λ)zh(z)𝑑τ(z)\displaystyle=e^{-\lambda\nabla\Psi(\theta)-\Psi(\theta)}\int e^{(\theta+\lambda)z}h(z)d\tau(z)
\displaystyle=e^{\Psi(\theta+\lambda)-\Psi(\theta)-\lambda\nabla\Psi(\theta)}\leq e^{\frac{\sigma_{1}\lambda^{2}}{2}}.

The last inequality follows since \Psi^{\prime\prime}(\cdot)\leq\sigma_{1} (equivalently, \mu(\cdot) is \sigma_{1}-Lipschitz), so that \Psi(\theta+\lambda)-\Psi(\theta)-\lambda\nabla\Psi(\theta)\leq\sigma_{1}\lambda^{2}/2 by a second-order Taylor expansion. ∎
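For instance, for the Bernoulli family one has \Psi(\theta)=\log(1+e^{\theta}) and \Psi^{\prime\prime}(\theta)=e^{\theta}/(1+e^{\theta})^{2}\leq 1/4, so Lemma 26 recovers the classical fact that a centered Bernoulli random variable is sub-Gaussian with variance proxy 1/4.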

Lemma 27.

Suppose that Z1,,ZnZ_{1},\dots,Z_{n} are independent and identically distributed sub-Gaussian random variables with variance proxy σ2\sigma^{2} and suppose that fb\|f\|_{\infty}\leq b for all ff\in\mathcal{F}. Then with probability at least 1δ1-\delta,

1nsupfi=1nZif(𝒙i)1n𝔼supfi=1nZif(𝒙i)bσlog(1/δ)n.\frac{1}{n}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}Z_{i}f(\boldsymbol{x}_{i})-\frac{1}{n}\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}Z_{i}f(\boldsymbol{x}_{i})\precsim b\sigma\sqrt{\frac{\log(1/\delta)}{n}}.
Proof.

Recall that for a random variable, ZZ, Zψ2=supp1(𝔼|Z|p)1/pp\|Z\|_{\psi_{2}}=\sup_{p\geq 1}\frac{(\mathbb{E}|Z|^{p})^{1/p}}{\sqrt{p}}. Let g(Z)=1nsupfi=1nZif(𝒙i)g(Z)=\frac{1}{n}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}Z_{i}f(\boldsymbol{x}_{i}). Using the notations of Maurer and Pontil, (2021), we note that

gk(Z)ψ2=\displaystyle\|g_{k}(Z)\|_{\psi_{2}}= 1nsupf(ikzif(𝒙i)+Zkf(𝒙k))𝔼Zksupf(ikzif(𝒙i)+Zkf(𝒙k))ψ2\displaystyle\frac{1}{n}\left\|\sup_{f\in\mathcal{F}}\left(\sum_{i\neq k}z_{i}f(\boldsymbol{x}_{i})+Z_{k}f(\boldsymbol{x}_{k})\right)-\mathbb{E}_{Z_{k}^{\prime}}\sup_{f\in\mathcal{F}}\left(\sum_{i\neq k}z_{i}f(\boldsymbol{x}_{i})+Z_{k}^{\prime}f(\boldsymbol{x}_{k})\right)\right\|_{\psi_{2}}
\displaystyle\leq\frac{1}{n}\left\|\mathbb{E}_{Z_{k}^{\prime}}\left|(Z_{k}-Z_{k}^{\prime})f(\boldsymbol{x}_{k})\right|\right\|_{\psi_{2}}
\displaystyle\leq bn𝔼Zk|ZkZk|ψ2\displaystyle\frac{b}{n}\left\|\mathbb{E}_{Z_{k}^{\prime}}|Z_{k}-Z_{k}^{\prime}|\right\|_{\psi_{2}} (45)
\displaystyle\leq bnZkZkψ2\displaystyle\frac{b}{n}\left\|Z_{k}-Z_{k}^{\prime}\right\|_{\psi_{2}}
\displaystyle\leq 2bnZkψ2\displaystyle\frac{2b}{n}\left\|Z_{k}\right\|_{\psi_{2}}
\displaystyle\precsim bσn.\displaystyle\frac{b\sigma}{n}.

Here, (45) follows from (Maurer and Pontil,, 2021, Lemma 6). Thus, k=1ngk(Z)ψ22b2σ2/n\left\|\sum_{k=1}^{n}\|g_{k}(Z)\|_{\psi_{2}}^{2}\right\|_{\infty}\precsim b^{2}\sigma^{2}/n. Hence applying (Maurer and Pontil,, 2021, Theorem 3), we note that with probability at least 1δ1-\delta,

1nsupfi=1nZif(𝒙i)1n𝔼supfi=1nZif(𝒙i)bσlog(1/δ)n.\frac{1}{n}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}Z_{i}f(\boldsymbol{x}_{i})-\frac{1}{n}\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}Z_{i}f(\boldsymbol{x}_{i})\precsim b\sigma\sqrt{\frac{\log(1/\delta)}{n}}.

Appendix F Supporting Results from the Literature

Lemma 28 (Lemma B.1 of Telgarsky and Dasgupta, (2013)).

If differentiable ff is r1r_{1} strongly convex, then Bf(xy)r1xy22B_{f}(x\|y)\geq r_{1}\|x-y\|_{2}^{2}. Furthermore, if differentiable ff has Lipschitz gradients with parameter r2r_{2} with respect to 2\|\cdot\|_{2}, then Bf(xy)r2xy22B_{f}(x\|y)\leq r_{2}\|x-y\|_{2}^{2}.

Lemma 29 (Lemma 40 of Chakraborty and Bartlett, 2024b ).

For any m12(log2(4d)1)m\geq\frac{1}{2}(\log_{2}(4d)-1), we can construct a ReLU network prodm(d):d\text{prod}^{(d)}_{m}:\mathop{\mathbb{R}}\nolimits^{d}\to\mathop{\mathbb{R}}\nolimits, such that for any x1,,xd[1,1]x_{1},\dots,x_{d}\in[-1,1], prodm(d)(x1,,xd)x1xd([1,1]d)12m\|\text{prod}_{m}^{(d)}(x_{1},\dots,x_{d})-x_{1}\dots x_{d}\|_{\mathcal{L}_{\infty}([-1,1]^{d})}\leq\frac{1}{2^{m}}. Furthermore, for some absolute constant c3c_{3},

  1. \mathcal{L}(\text{prod}^{(d)}_{m})\leq c_{3}m, \mathcal{W}(\text{prod}^{(d)}_{m})\leq c_{3}m.

  2. \mathcal{B}(\text{prod}^{(d)}_{m})\leq 4.

Acknowledgment

We gratefully acknowledge the support of the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and #814639, the NSF’s support of FODSI through grant DMS-2023505, and the support of the ONR through MURI award N000142112431.

References

  • Anthony and Bartlett, (1999) Anthony, M. and Bartlett, P. (1999). Neural network learning: Theoretical foundations. Cambridge University Press.
  • Banerjee et al., (2005) Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., and Lafferty, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10).
  • Bartlett et al., (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301.
  • Bousquet, (2002) Bousquet, O. (2002). Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Biologische Kybernetik.
  • (5) Chakraborty, S. and Bartlett, P. (2024a). A statistical analysis of wasserstein autoencoders for intrinsically low-dimensional data. In The Twelfth International Conference on Learning Representations.
  • (6) Chakraborty, S. and Bartlett, P. L. (2024b). On the statistical properties of generative adversarial models for low intrinsic data dimension. arXiv preprint arXiv:2401.15801.
  • Chen et al., (2019) Chen, M., Jiang, H., Liao, W., and Zhao, T. (2019). Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. Advances in Neural Information Processing Systems, 32.
  • Chen et al., (2022) Chen, M., Jiang, H., Liao, W., and Zhao, T. (2022). Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253.
  • Cover and Thomas, (2005) Cover, T. M. and Thomas, J. A. (2005). Elements of Information Theory. John Wiley & Sons, Hoboken, NJ.
  • Cybenko, (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314.
  • Donahue et al., (2017) Donahue, J., Krähenbühl, P., and Darrell, T. (2017). Adversarial feature learning. In International Conference on Learning Representations.
  • Dudley, (1969) Dudley, R. M. (1969). The speed of mean glivenko-cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50.
  • Hornik, (1991) Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257.
  • Huang et al., (2022) Huang, J., Jiao, Y., Li, Z., Liu, S., Wang, Y., and Yang, Y. (2022). An error analysis of generative adversarial networks for learning distributions. Journal of Machine Learning Research, 23(116):1–43.
  • Jiao et al., (2021) Jiao, Y., Shen, G., Lin, Y., and Huang, J. (2021). Deep nonparametric regression on approximately low-dimensional manifolds. arXiv preprint arXiv:2104.06708.
  • Kakade et al., (2009) Kakade, S., Shalev-Shwartz, S., Tewari, A., et al. (2009). On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2(1):35.
  • Kolmogorov and Tikhomirov, (1961) Kolmogorov, A. N. and Tikhomirov, V. M. (1961). ϵ\epsilon-entropy and ϵ\epsilon-capacity of sets in function spaces. Translations of the American Mathematical Society, 17:277–364.
  • Lehmann and Casella, (2006) Lehmann, E. L. and Casella, G. (2006). Theory of Point Estimation. Springer Science & Business Media.
  • Lu et al., (2021) Lu, J., Shen, Z., Yang, H., and Zhang, S. (2021). Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506.
  • Maurer and Pontil, (2021) Maurer, A. and Pontil, M. (2021). Concentration inequalities under sub-gaussian and sub-exponential conditions. Advances in Neural Information Processing Systems, 34:7588–7597.
  • Nakada and Imaizumi, (2020) Nakada, R. and Imaizumi, M. (2020). Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38.
  • Paul et al., (2021) Paul, D., Chakraborty, S., Das, S., and Xu, J. (2021). Uniform concentration bounds toward a unified framework for robust clustering. Advances in Neural Information Processing Systems, 34:8307–8319.
  • Petersen and Voigtlaender, (2018) Petersen, P. and Voigtlaender, F. (2018). Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330.
  • Pope et al., (2020) Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., and Goldstein, T. (2020). The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations.
  • Posner et al., (1967) Posner, E. C., Rodemich, E. R., and Rumsey Jr, H. (1967). Epsilon entropy of stochastic processes. The Annals of Mathematical Statistics, pages 1000–1020.
  • Schmidt-Hieber, (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875 – 1897.
  • Shen et al., (2019) Shen, Z., Yang, H., and Zhang, S. (2019). Nonlinear approximation via compositions. Neural Networks, 119:74–84.
  • Shen et al., (2022) Shen, Z., Yang, H., and Zhang, S. (2022). Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157:101–135.
  • Suzuki, (2018) Suzuki, T. (2018). Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033.
  • Suzuki and Nitanda, (2021) Suzuki, T. and Nitanda, A. (2021). Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space. Advances in Neural Information Processing Systems, 34:3609–3621.
  • Telgarsky and Dasgupta, (2013) Telgarsky, M. J. and Dasgupta, S. (2013). Moment-based uniform deviation bounds for k-means and friends. Advances in Neural Information Processing Systems, 26.
  • Tsybakov, (2009) Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, NY, 1st edition.
  • Uppal et al., (2019) Uppal, A., Singh, S., and Póczos, B. (2019). Nonparametric density estimation & convergence rates for GANs under Besov IPM losses. Advances in Neural Information Processing Systems, 32.
  • Vershynin, (2018) Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
  • Weed and Bach, (2019) Weed, J. and Bach, F. (2019). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648.
  • Yang and Barron, (1999) Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pages 1564–1599.
  • Yarotsky, (2017) Yarotsky, D. (2017). Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114.