A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data
Abstract
Recent advances have revealed that the rate of convergence of the expected test error in deep supervised learning decays as a function of the intrinsic dimension of the data and not the dimension of the input space. Existing literature defines this intrinsic dimension as the Minkowski dimension or the manifold dimension of the support of the underlying probability measures, which often results in sub-optimal rates and unrealistic assumptions. In this paper, we consider supervised deep learning when the response given the explanatory variable is distributed according to an exponential family with a $\beta$-Hölder smooth mean function. We consider an entropic notion of the intrinsic data-dimension and demonstrate that with $n$ independent and identically distributed samples, the test error scales as $\tilde{\mathcal{O}}\big(n^{-\frac{2\beta}{2\beta+\bar{d}_{2\beta}}}\big)$, where $\bar{d}_{2\beta}$ is the $2\beta$-entropic dimension of the distribution of the explanatory variables. This improves on the best-known rates. Furthermore, under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $n^{-\frac{2\beta}{2\beta+d}}$, up to a factor that is polynomial in the ambient dimension $d$, establishing that the dependence on $d$ is not exponential but at most polynomial. We also demonstrate that when the explanatory variable has a lower bounded density, this rate, in terms of the number of data samples $n$, is nearly optimal for learning the dependence structure for exponential families.
1 Introduction
One hypothesis for the extraordinary performance of deep learning is that most natural data have an intrinsically low-dimensional structure despite lying in a high-dimensional feature space (Pope et al., 2020). Under this so-called “manifold hypothesis,” recent theoretical developments in the generalization literature for deep learning have revealed that the excess risk for different deep learning models, especially regression (Schmidt-Hieber, 2020; Suzuki, 2018) and generative models (Huang et al., 2022; Chakraborty and Bartlett, 2024a; Chakraborty and Bartlett, 2024b), exhibits a decay pattern that depends only on the intrinsic dimension of the data. Notably, Nakada and Imaizumi (2020), Huang et al. (2022) and Chakraborty and Bartlett (2024a) showed that the excess risk decays at a rate governed by the Minkowski dimension of the underlying distribution rather than the ambient dimension. For a supervised learning setting, this phenomenon has been proved for various deep regression models with additive Gaussian noise (Schmidt-Hieber, 2020; Nakada and Imaizumi, 2020; Suzuki, 2018; Suzuki and Nitanda, 2021).
Most of the aforementioned studies aim to describe this inherent dimensionality by utilizing the concept of the (upper) Minkowski dimension of the data’s underlying support. However, it is worth noting that the Minkowski dimension primarily focuses on quantifying the rate of growth in the covering number of the support while neglecting to account for situations where the distribution may exhibit a higher concentration of mass within specific sub-regions of this support. Thus, the Minkowski dimension often overestimates the intrinsic dimension of the data distribution, resulting in slower rates of statistical convergence. On the other hand, some works (Chen et al.,, 2022, 2019; Jiao et al.,, 2021) try to impose a smooth Riemannian manifold structure for this support and characterize the rate through the dimension of this manifold. However, this assumption is not only very strong and difficult to verify, but also ignores the fact that the data can be concentrated only on some sub-regions and can be thinly spread over the remainder, again resulting in an over-estimate. In contrast, recent insights from the optimal transport literature have introduced the concept of the Wasserstein dimension (Weed and Bach,, 2019), which overcomes these limitations and offers a more accurate characterization of convergence rates when estimating a distribution through the empirical measure. Furthermore, recent advancements in this field have led to the introduction of the entropic dimension (Chakraborty and Bartlett, 2024b, ), which builds upon seminal work by Dudley, (1969) and can be employed to describe the convergence rates for Bidirectional Generative Adversarial Networks (BiGANs) (Donahue et al.,, 2017). Remarkably, the entropic dimension is no larger than the Wasserstein and Minkowski dimensions, resulting in faster convergence rates. However, it is not known whether this faster rate of convergence extends beyond Generative Adversarial Networks (GANs) and their variants to supervised learning problems.
In this paper, we provide a statistical framework aimed at understanding deep supervised learning. Our approach involves modeling the conditional distribution of the response variable given the explanatory variable as a member of an exponential family with a smooth mean function. This framework accommodates a wide spectrum of scenarios, including standard regression and classification tasks, while also providing a statistical foundation for handling complex dependencies in real data settings. In this framework, maximum likelihood estimation can be viewed as minimizing the canonical Bregman divergence loss between the predicted values and the actual responses. When the explanatory variable has a bounded density with respect to the $d$-dimensional Lebesgue measure, our analysis reveals that deep networks employing ReLU activation functions can achieve a test error on the order of $n^{-\frac{2\beta}{2\beta+d}}$, up to factors polynomial in $d$ and logarithmic in $n$, provided that appropriately sized networks are selected. Here $\beta$ denotes the Hölder smoothness of the true mean response function. This generalizes the known results in the literature for additive regression with Gaussian noise.
Another aspect overlooked by the current literature is that, even when the explanatory variable is absolutely continuous, the constant in the rate of convergence of the sample estimator often grows exponentially with the ambient feature dimension, making the upper bound on the estimation error weak for high-dimensional data. In this paper, we prove that if the explanatory variable has a bounded density, this dependence on the ambient feature dimension is not exponential but at most polynomial. Furthermore, we show that the derived rates for the test error are roughly minimax optimal, meaning that no estimator can achieve a better rate of convergence beyond potentially improving a logarithmic dependence on $n$. Lastly, when the data has an intrinsically low dimensional structure, we show that the test error can be improved to a rate of roughly $n^{-\frac{2\beta}{2\beta+\bar{d}_{2\beta}}}$, where $\bar{d}_{2\beta}$ denotes the $2\beta$-entropic dimension (see Definition 11) of the distribution of the explanatory variables, thus improving upon the rates in the current literature. This result not only extends the framework beyond additive Gaussian-noise regression models but also improves upon the existing rates (Nakada and Imaizumi, 2020; Schmidt-Hieber, 2020; Chen et al., 2022). The main results of this paper are summarized as follows:
• In Theorem 8, we demonstrate that when the explanatory variable has a bounded density, the test error for the learning problem scales as $n^{-\frac{2\beta}{2\beta+d}}$, up to a factor that is polynomial in $d$ and logarithmic in $n$, showing explicit dependence on both the problem dimension ($d$) and the number of samples ($n$).
• When the explanatory variable has an intrinsically low dimensional structure, we show in Theorem 12 that deep supervised learners can effectively achieve an error rate of roughly $n^{-\frac{2\beta}{2\beta+\bar{d}_{2\beta}}}$. This result provides the fastest known rates for deep supervised learners and encompasses many recent findings for additive regression models as special cases (Nakada and Imaizumi, 2020; Chen et al., 2022).
• In the process, in Lemma 21 we improve upon recent $L^p$-approximation results for ReLU networks, which might be of independent interest.
The remainder of the paper is organized as follows. After reviewing the necessary background in Section 2, we describe the exponential family learning framework in Section 3. Under this framework, we derive the learning rates (Theorem 8) when the explanatory variable is absolutely continuous in Section 4 and show that this rate is minimax optimal (Theorem 10). Next, we analyze the error rate (Theorem 12) when the explanatory variable has an intrinsically low dimensional structure in Section 5. The proofs of the main results are sketched in Section 6, followed by concluding remarks in Section 7.
2 Background
This section recalls some of the notation and background that we need. For non-negative sequences $a_n$ and $b_n$, we say $a_n \lesssim b_n$ if there exists an absolute constant $C > 0$ (independent of $n$) such that $a_n \le C b_n$, for all $n$. Similarly, for non-negative functions $f$ and $g$, we say $f \lesssim g$ if there exists a constant $C > 0$, independent of the argument $x$, such that $f(x) \le C g(x)$, for all $x$. For any function $f: \mathcal{X} \to \mathbb{R}$ and any measure $\gamma$ on $\mathcal{X}$, let $\|f\|_{\mathscr{L}^p(\gamma)} = \big(\int_{\mathcal{X}} |f|^p \, d\gamma\big)^{1/p}$, for $1 \le p < \infty$. Also let $\|f\|_{\infty} = \sup_{x \in \mathcal{X}} |f(x)|$. We write $a \asymp b$ if both $a \lesssim b$ and $b \lesssim a$ hold, i.e. if the two quantities agree up to constant factors. Moreover, $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$.
Definition 1 (Covering and packing numbers).
For a metric space $(S, \rho)$, the $\epsilon$-covering number of $S$ with respect to (w.r.t.) $\rho$ is defined as
\[ N(\epsilon; S, \rho) = \inf\Big\{ m \in \mathbb{N} \,:\, \exists\, x_1, \dots, x_m \in S \text{ such that } S \subseteq \bigcup_{i=1}^{m} B_{\rho}(x_i, \epsilon) \Big\}, \]
and a set of centers attaining this infimum is called an $\epsilon$-cover of $S$. Similarly, the $\epsilon$-packing number is defined as
\[ M(\epsilon; S, \rho) = \sup\Big\{ m \in \mathbb{N} \,:\, \exists\, x_1, \dots, x_m \in S \text{ such that } \rho(x_i, x_j) > \epsilon \text{ for all } i \neq j \Big\}. \]
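For instance (a standard illustration, not part of the definition above), covering the unit cube with $\ell_\infty$-balls of radius $\epsilon$ gives
\[ N\big(\epsilon; [0,1]^d, \ell_\infty\big) \le \Big\lceil \tfrac{1}{2\epsilon} \Big\rceil^{d}, \qquad \text{and in fact} \qquad \log N\big(\epsilon; [0,1]^d, \ell_\infty\big) \asymp d \,\log(1/\epsilon), \]
a growth rate that reappears when intrinsic dimensions are defined through covering numbers in Section 5.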
Definition 2 (Neural networks).
Let $L \in \mathbb{N}$ and $W_0, W_1, \dots, W_L \in \mathbb{N}$. Then an $L$-layer neural network $f: \mathbb{R}^{W_0} \to \mathbb{R}^{W_L}$ is defined as,
$f(x) = A_L \circ \sigma_{L-1} \circ A_{L-1} \circ \cdots \circ \sigma_1 \circ A_1(x).$   (1)
Here, $A_\ell(y) = M_\ell y + b_\ell$, with $M_\ell \in \mathbb{R}^{W_\ell \times W_{\ell-1}}$ and $b_\ell \in \mathbb{R}^{W_\ell}$, and each $\sigma_\ell: \mathbb{R} \to \mathbb{R}$ is applied component-wise. Here, $\{M_\ell\}_{\ell=1}^{L}$ are known as weights, and $\{b_\ell\}_{\ell=1}^{L}$ are known as biases; $\{\sigma_\ell\}_{\ell=1}^{L-1}$ are known as the activation functions. Without loss of generality, one can take the final layer to be purely affine. We define the following quantities: (Depth) $L$ is known as the depth of the network; (Number of weights) the total count of weights and biases is known as the number of weights of the network; (Maximum weight) the maximum absolute value of the weights and biases is known as the maximum weight.
If $\sigma_\ell$ is the ReLU activation, i.e. $\sigma_\ell(x) = \max\{x, 0\}$, for all $\ell \le L-1$, we refer to $f$ as a ReLU network.
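To make Definition 2 concrete, the following is a minimal PyTorch sketch (our own illustration; the helper `relu_network`, its arguments, and the size bookkeeping are not notation from the paper) that builds an $L$-layer ReLU network and reports its depth, number of weights, and maximum weight:

```python
import torch
import torch.nn as nn

def relu_network(widths, max_weight=None):
    """Build f = A_L o ReLU o ... o ReLU o A_1 for widths = [W_0, ..., W_L].

    If max_weight is given, all weights and biases are clamped to
    [-max_weight, max_weight], mirroring the maximum-weight bound in Definition 2."""
    layers = []
    for i in range(len(widths) - 1):
        layers.append(nn.Linear(widths[i], widths[i + 1]))   # affine map A_{i+1}
        if i < len(widths) - 2:                              # ReLU after every layer except the last
            layers.append(nn.ReLU())
    net = nn.Sequential(*layers)
    if max_weight is not None:
        with torch.no_grad():
            for p in net.parameters():
                p.clamp_(-max_weight, max_weight)
    return net

net = relu_network([10, 64, 64, 1], max_weight=1.0)
depth = sum(isinstance(m, nn.Linear) for m in net)            # number of affine layers L
num_weights = sum(p.numel() for p in net.parameters())        # total number of weights and biases
max_abs_weight = max(p.abs().max().item() for p in net.parameters())
print(depth, num_weights, max_abs_weight)
```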
Definition 3 (Hölder functions).
Let $f: \mathcal{X} \to \mathbb{R}$ be a function, where $\mathcal{X} \subseteq \mathbb{R}^d$. For a multi-index $\mathbf{s} = (s_1, \dots, s_d) \in \mathbb{N}_0^d$, let $\partial^{\mathbf{s}} f = \frac{\partial^{|\mathbf{s}|} f}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}$, where $|\mathbf{s}| = \sum_{j=1}^{d} s_j$. We say that a function $f$ is $\beta$-Hölder (for $\beta > 0$) if
\[ \|f\|_{\mathcal{H}^{\beta}} := \sum_{\mathbf{s}: 0 \le |\mathbf{s}| \le \lfloor \beta \rfloor} \|\partial^{\mathbf{s}} f\|_{\infty} + \sum_{\mathbf{s}: |\mathbf{s}| = \lfloor \beta \rfloor} \sup_{x \ne y} \frac{|\partial^{\mathbf{s}} f(x) - \partial^{\mathbf{s}} f(y)|}{\|x - y\|^{\beta - \lfloor \beta \rfloor}} < \infty. \]
If $f: \mathcal{X} \to \mathbb{R}^{k}$, then we define $\|f\|_{\mathcal{H}^{\beta}}$ as the sum of the corresponding norms of its coordinate maps. For notational simplicity, let $\mathcal{H}^{\beta}(\mathcal{X}, \mathcal{Y}, C) = \{ f: \mathcal{X} \to \mathcal{Y} \,:\, \|f\|_{\mathcal{H}^{\beta}} \le C \}$. Here, both $\mathcal{X}$ and $\mathcal{Y}$ are subsets of real vector spaces.
Definition 4 (Smoothness and strong convexity).
We say a differentiable function $g$ (defined on a convex subset of a Euclidean space) is $\alpha$-smooth if $g(x) \le g(y) + \langle \nabla g(y), x - y \rangle + \frac{\alpha}{2}\|x - y\|_2^2$, for all $x, y$ in its domain. Similarly, we say that $g$ is $\tau$-strongly convex if $g(x) \ge g(y) + \langle \nabla g(y), x - y \rangle + \frac{\tau}{2}\|x - y\|_2^2$, for all $x, y$ in its domain.
Definition 5 (Bregman divergences).
A differentiable, convex function $\phi$ (defined on a convex subset of a Euclidean space) generates the Bregman divergence $D_\phi$ defined by
\[ D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle. \]
It is evident that $D_\phi(x, y) \ge 0$ holds for all $x, y$ due to the fact that this inequality is equivalent to the convexity of $\phi$. From a geometric standpoint, one can conceptualize $D_\phi(x, y)$ as the gap between $\phi(x)$ and the linear approximation of $\phi$ centered around $y$, evaluated at $x$. Put simply, this is the distance between $\phi(x)$ and the value obtained by evaluating the tangent plane to $\phi$ at $y$. Some prominent examples of Bregman divergences include the squared Euclidean distance, the Kullback-Leibler (KL) divergence, and the Mahalanobis distance. We refer the reader to Banerjee et al. (2005, Table 1) for more examples of the Bregman family. Bregman divergences have a direct association with standard exponential families, as elaborated in the upcoming section, rendering them particularly suitable for modeling and learning from various common data types that originate from exponential family distributions.
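As a concrete sanity check (our own illustration), the following short script verifies numerically that $\phi(x) = \|x\|_2^2$ generates the squared Euclidean distance and that the negative entropy generates the KL divergence between probability vectors:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

x = np.array([0.2, 0.5, 0.3])   # probability vectors (each sums to 1)
y = np.array([0.3, 0.4, 0.3])

# phi(v) = ||v||_2^2 generates the squared Euclidean distance.
sq_norm = lambda v: np.dot(v, v)
grad_sq_norm = lambda v: 2.0 * v
print(bregman(sq_norm, grad_sq_norm, x, y), np.sum((x - y) ** 2))

# phi(v) = sum_i v_i log v_i (negative entropy) generates the KL divergence
# between probability vectors.
neg_entropy = lambda v: np.sum(v * np.log(v))
grad_neg_entropy = lambda v: np.log(v) + 1.0
print(bregman(neg_entropy, grad_neg_entropy, x, y), np.sum(x * np.log(x / y)))
```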
3 Learning Framework
To discuss our framework, let us first recall the definition of exponential families (Lehmann and Casella, 2006, Chapter 1, Section 5). We suppose that $\theta$ is the natural parameter. We say that $Y$ is distributed according to an exponential family if its density w.r.t. some dominating measure is given by,
Here, $T(Y)$ is called the natural statistic. Often, it is assumed that the exponential family is expressed in its regular form, which means that the components of $T$ are affinely independent, i.e. there exists no nonzero vector $a$ such that $\langle a, T(y) \rangle$ is constant for all $y$. Popular examples of exponential families include the Gaussian, binomial, Poisson, and exponential distributions. Given an exponential family, one can express it in its natural form. Formally,
Definition 6 (Natural form of Exponential families).
A multivariate parametric family of distributions is called a regular exponential family provided that each probability density, w.r.t. some dominating measure , is of the form,
for all , where represents a minimal sufficient statistic for the family.
It is well known that the mean of the natural statistic satisfies $\mathbb{E}_{\theta}[T(Y)] = \nabla\psi(\theta)$, where $\psi$ denotes the log-partition function of the family. For simplicity, we assume that $\psi$ is proper, closed, convex, and differentiable. The conjugate of $\psi$, denoted as $\psi^{\ast}$, is defined as $\psi^{\ast}(t) = \sup_{\theta}\{\langle t, \theta \rangle - \psi(\theta)\}$. It is well known (Banerjee et al., 2005, Theorem 4) that the density can be expressed as,
$p_{\psi, \theta}(y) = \exp\!\big(-D_{\psi^{\ast}}\big(y, \nabla\psi(\theta)\big)\big)\, b_{\psi^{\ast}}(y),$   (2)
where $D_{\psi^{\ast}}$ denotes the Bregman divergence corresponding to $\psi^{\ast}$, and $b_{\psi^{\ast}}$ is a function of $y$ alone, not depending on $\theta$. In this paper, we are interested in the supervised learning problem when the response, given the explanatory variable, is distributed according to an exponential family. For simplicity, we assume that the responses are real-valued. We assume that there exists a “true” predictor function, $f_0$, such that the conditional distribution of $Y$ given $X = x$ is the member of the family with natural parameter $f_0(x)$.
Thus, the joint distribution of $(X, Y)$ is given by,
(3) |
By definition, we observe that $\mathbb{E}(Y \mid X = x) = \nabla\psi(f_0(x))$. We will assume that the data are independent and identically distributed according to the distribution in (3). We also assume that the distribution of $X$ is supported on the compact set $[0,1]^d$. Formally,
A 1.
We assume that the data are independent and identically distributed according to the distribution defined in (3). Furthermore, $X \in [0,1]^d$, almost surely.
In the classical statistics literature, one estimates $f_0$ by maximizing the empirical log-likelihood over a class of candidate functions, i.e. by computing the maximum likelihood estimate (m.l.e.). Plugging in the form of the density as in (3), it is easy to see that this optimization problem is equivalent to
$\hat{f}_n \in \operatorname*{arg\,min}_{f \in \mathcal{F}_n} \ \frac{1}{n}\sum_{i=1}^{n} D_{\psi^{\ast}}\big(y_i, \nabla\psi(f(x_i))\big).$   (4)
Here, $\nabla\psi$ is known as the link function. In practice, we take $\mathcal{F}_n$ to be a class of neural networks, with the final output passing through the activation function $\nabla\psi$. The empirical minimizer of (4) is denoted as $\hat{f}_n$. To show that this framework covers a wide range of supervised learning problems, we first consider the classical example of additive Gaussian noise, i.e. $Y = f_0(X) + \xi$ with Gaussian $\xi$, for some unknown $f_0$. In this case, it is well known that $\nabla\psi$ is the identity map and $D_{\psi^{\ast}}$ becomes the squared Euclidean distance. Thus, the m.l.e. problem (4) becomes the classical least-squares regression problem, i.e. $\hat{f}_n \in \operatorname*{arg\,min}_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$.
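For concreteness, in the unit-variance Gaussian case these quantities take the familiar explicit form (a standard computation):
\[ \psi(\theta) = \tfrac{\theta^{2}}{2}, \qquad \nabla\psi(\theta) = \theta, \qquad \psi^{\ast}(y) = \tfrac{y^{2}}{2}, \qquad D_{\psi^{\ast}}(y, a) = \tfrac{1}{2}(y - a)^{2}, \]
so the Bregman loss in (4) is, up to the factor $\tfrac{1}{2}$, exactly the squared error loss.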
Another example is the case of logistic regression. We assume that $Y \in \{0, 1\}$ is a Bernoulli random variable. This makes $\nabla\psi(t) = (1 + e^{-t})^{-1}$, i.e. the sigmoid activation function. Furthermore, an easy calculation (Banerjee et al., 2005, Table 2) shows that the corresponding Bregman divergence is the binary Kullback-Leibler divergence. Plugging the values of $\nabla\psi$ and $D_{\psi^{\ast}}$ into (4), we note that the estimation problem becomes the minimization of the binary cross-entropy loss,
\[ \hat{f}_n \in \operatorname*{arg\,min}_{f \in \mathcal{F}_n} \ -\frac{1}{n}\sum_{i=1}^{n} \Big( y_i \log \sigma(f(x_i)) + (1 - y_i)\log\big(1 - \sigma(f(x_i))\big) \Big). \]
Here, $\sigma(t) = (1 + e^{-t})^{-1}$ denotes the sigmoid activation function. Thus, the problem reduces to the classical two-class learning problem with the binary cross-entropy loss and with sigmoid activation at the final layer. For simplicity, we assume that all activations, excluding that of the final layer, are realized by the ReLU activation function. ReLU is a natural choice for practitioners and enables us to harness the approximation theory of ReLU networks developed throughout the recent literature (Yarotsky, 2017; Uppal et al., 2019). However, using a leaky ReLU activation instead would result in a similar analysis, changing only the constants in the main theorems.
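The following is a minimal PyTorch sketch of the estimation problem (4) in the Bernoulli case; the architecture, sample sizes, and optimizer settings are illustrative choices of ours rather than prescriptions from the paper. Since the Bregman loss composed with the sigmoid link is exactly the binary cross-entropy in the logits, `nn.BCEWithLogitsLoss` implements the objective; replacing it with `nn.MSELoss` and dropping the sigmoid recovers the Gaussian example above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 10, 512
X = torch.rand(n, d)                                   # explanatory variables in [0, 1]^d
true_logits = X.sum(dim=1, keepdim=True) - d / 2       # a simple stand-in for f_0(X)
Y = torch.bernoulli(torch.sigmoid(true_logits))        # Bernoulli responses, as in model (3)

net = nn.Sequential(                                   # ReLU network outputting the natural parameter f(x)
    nn.Linear(d, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
loss_fn = nn.BCEWithLogitsLoss()                       # Bregman loss composed with the sigmoid link
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(200):                                   # empirical risk minimization, cf. (4)
    opt.zero_grad()
    loss = loss_fn(net(X), Y)
    loss.backward()
    opt.step()

with torch.no_grad():
    mean_pred = torch.sigmoid(net(X))                  # predicted mean E[Y | X] = sigmoid(f_hat(X))
```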
To facilitate the theoretical analysis, we will assume that the problem is smooth in terms of the learning function . As a notion of smoothness, we will use Hölder smoothness. This has been a popular choice in the recent literature (Nakada and Imaizumi,, 2020; Schmidt-Hieber,, 2020; Chen et al.,, 2022) and covers a vast array of functions commonly modeled in the literature.
A 2.
The true predictor $f_0$ is $\beta$-Hölder continuous, i.e. $f_0 \in \mathcal{H}^{\beta}([0,1]^d, \mathbb{R}, C)$ for some constant $C > 0$.
We make the additional assumption that the function $\psi$ is well-behaved. In particular, we assume that $\psi$ possesses both smoothness and strong convexity properties. It is important to note that these assumptions are widely employed in the existing literature (Telgarsky and Dasgupta, 2013; Paul et al., 2021). A3 is not directly applicable to the classification problem, since in that case $\psi(\theta) = \log(1 + e^{\theta})$, which is not strongly convex on the whole real line. However, if for all practical purposes one clips the output of the network (which is often done in practice to ensure smooth training), i.e. ensures that the predicted natural parameter lies in $[-B, B]$ for some positive $B$, then A3 is satisfied on the clipped range. The assumption is formally stated as follows.
A 3.
We assume that $\psi$ is $\alpha$-smooth and $\tau$-strongly convex.
A direct implication of A3 is that, by Kakade et al. (2009, Theorem 6), the conjugate $\psi^{\ast}$ is $(1/\tau)$-smooth and $(1/\alpha)$-strongly convex. Also, since $\psi$ is $\alpha$-smooth, $\nabla\psi$ is $\alpha$-Lipschitz. This fact will be useful for the proofs of the main results. In the subsequent sections, under the above assumptions, we derive probabilistic error bounds for the excess risk of $\hat{f}_n$.
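To illustrate how clipping restores A3 in the classification example (a standard calculation; the clipping level $B$ is our own notation):
\[ \psi(\theta) = \log\!\big(1 + e^{\theta}\big), \qquad \psi''(\theta) = \sigma(\theta)\big(1 - \sigma(\theta)\big) \in \Big[\sigma(B)\big(1 - \sigma(B)\big),\ \tfrac{1}{4}\Big] \quad \text{for } |\theta| \le B, \]
so on the clipped range $[-B, B]$ the log-partition function is $\tfrac{1}{4}$-smooth and $\sigma(B)(1 - \sigma(B))$-strongly convex, and A3 holds with $\alpha = \tfrac{1}{4}$ and $\tau = \sigma(B)(1 - \sigma(B))$.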
4 Optimal Rates for Distributions with Bounded Densities
We begin the analysis of the test error for the problem (4) when the distribution of the explanatory variable has a bounded density on $[0,1]^d$. First note that the excess risk is upper bounded by the estimation error for $f_0$ in the $\mathscr{L}^2$-norm; here the excess risk is the expected Bregman loss of $\hat{f}_n$ minus that of the optimal predictor $f_0$.
The following lemma ensures that the excess risk is equivalent, up to constant factors, to $\|\hat{f}_n - f_0\|_{\mathscr{L}^2}^2$, and hence it is enough to prove bounds on this squared error to derive upper and lower bounds on the excess risk.
Lemma 7.
For any ,
As already mentioned, we assume that the distribution of $X$ admits a density w.r.t. the Lebesgue measure, and this density is upper bounded. Formally,
A 4.
Suppose that the distribution of $X$ admits a density w.r.t. the Lebesgue measure on $[0,1]^d$, and that this density is upper bounded by a finite constant, almost surely (under the Lebesgue measure).
Under Assumptions A1–4, we observe that, with high probability, $\|\hat{f}_n - f_0\|_{\mathscr{L}^2}^2$, and thereby the excess risk, scales at most as $n^{-\frac{2\beta}{2\beta+d}}$, up to a factor polynomial in $d$ and ignoring $\log$-terms in $n$, for large $n$, if the network sizes are chosen properly. This rate of decrease aligns with prior findings reported by Nakada and Imaizumi (2020) for additive Gaussian-noise regression. It is important to underscore that the existing literature predominantly investigates the rate of decrease solely with regard to the sample size $n$, overlooking terms dependent upon the data dimensionality. These dimension-dependent terms harbor the potential for exponential growth with respect to the dimensionality of the explanatory variables, and may therefore attain substantial magnitudes, making such bounds inefficient in high-dimensional statistical learning contexts. Our analysis shows that, under the assumption of a bounded density for the explanatory variable, the dependence on $d$ is not exponential and grows at most at a polynomial rate. The main upper bound for this case is stated in Theorem 8, with a proof outline appearing in Section 6.1.
Theorem 8.
From the bound on the network size in Theorem 8, it is clear that when $f_0$ is smooth, i.e. when $\beta$ is large, one requires a smaller network than when $f_0$ is less smooth, i.e. when $\beta$ is small. Similarly, in cases where the dimensionality of the explanatory variables, $d$, is substantial, a larger network is required compared to situations where $d$ is relatively small. This observation aligns with the intuitive expectation that solving more intricate and complex problems in higher dimensions demands the utilization of larger networks.
The high probability bound in Theorem 8 ensures that the expected test error also scales with the same rate of convergence. This result is shown in Corollary 9.
Corollary 9.
Under the assumptions and choices of Theorem 8, .
To understand whether deep supervised learning can achieve the optimal rates for the learning problem, we derive the minimax rates for estimating $f_0$. The minimax expected risk for this problem, when one has access to $n$ i.i.d. samples from (3) (with $f_0$ replaced by a candidate $f$), is defined as the infimum, over all estimators, of the worst-case expected $\mathscr{L}^2$-error over all $\beta$-Hölder candidates $f$.
Here the expectation is taken w.r.t. the measure in (3), with $f_0$ replaced with $f$, and the infimum is taken over all measurable estimates of $f$ based on the data. Minimax lower bounds are used to understand the theoretical limits of any statistical estimation problem. The aim of this analysis is to show that deep learning with ReLU networks for the exponential family dependence structure is (almost) minimax optimal. To facilitate the theoretical analysis, we assume that the density of $X$ is lower bounded by a positive constant. Formally,
A 5.
The distribution of $X$ admits a density w.r.t. the Lebesgue measure on $[0,1]^d$ that is lower bounded by a positive constant, almost surely (under the Lebesgue measure).
Theorem 10 provides a characterization of this minimax lower bound for estimating $f_0$. It is important to note that the seminal result of Yang and Barron (1999) for the normal-noise regression problem is a special case of Theorem 10.
Theorem 10 (Minimax lower bound).
5 Rates for Low Intrinsic Dimension
Frequently, it is posited that real-world data, such as vision data, resides within a lower-dimensional structure embedded in a high-dimensional feature space (Pope et al., 2020). To quantify this intrinsic dimensionality of the data, researchers have introduced various measures of the effective dimension of the underlying probability distribution assumed to generate the data. Among these approaches, the most commonly used ones involve assessing the rate of growth of the covering number, on a logarithmic scale, of most of the support of this data distribution. Let us consider a compact Polish space denoted as $(S, \rho)$, with $\mu$ representing a probability measure defined on it. For the remainder of this paper, we will assume that $\rho$ corresponds to the $\ell_\infty$-norm. The simplest measure of the dimension of a probability distribution is the upper Minkowski dimension of its support, defined as follows:
\[ \overline{\dim}_M(\mu) = \limsup_{\epsilon \downarrow 0} \frac{\log N\big(\epsilon; \operatorname{supp}(\mu), \rho\big)}{\log(1/\epsilon)}. \]
This dimensionality concept relies solely on the covering number of the support and does not assume the existence of a smooth mapping to a lower-dimensional Euclidean space. Consequently, it encompasses not only smooth Riemannian manifolds but also encompasses highly non-smooth sets like fractals. The statistical convergence properties of various estimators concerning the upper Minkowski dimension have been extensively explored in the literature. Kolmogorov and Tikhomirov, (1961) conducted a comprehensive study on how the covering number of different function classes depends on the upper Minkowski dimension of the support. Recently, Nakada and Imaizumi, (2020) demonstrated how deep learning models can leverage this inherent low-dimensionality in data, which is also reflected in their convergence rates. Nevertheless, a notable limitation associated with utilizing the upper Minkowski dimension is that when a probability measure covers the entire sample space but is concentrated predominantly in specific regions, it may yield a high dimensionality estimate, which might not accurately reflect the underlying dimension.
To overcome the aforementioned difficulty, Chakraborty and Bartlett (2024b) introduced the $\epsilon$-entropic dimension of a measure $\mu$ as a notion of its intrinsic dimension. Before we proceed, we recall the $(\epsilon, \delta)$-cover of a measure (Posner et al., 1967): $N_{\epsilon}(\mu, \delta)$ counts the minimum number of $\epsilon$-balls required to cover a set of $\mu$-probability at least $1 - \delta$.
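To see why covering most of the mass can be much cheaper than covering the support (an illustrative example of ours), consider a measure on $[0,1]^2$ that places weight $1 - \eta$ uniformly on a segment and weight $\eta$ uniformly on the whole square:
\[ \mu = (1 - \eta)\,\mathrm{Unif}\big([0,1] \times \{0\}\big) + \eta\,\mathrm{Unif}\big([0,1]^2\big), \qquad N_{\epsilon}(\mu, \delta) \lesssim \epsilon^{-1} \ \text{ for } \delta \ge \eta, \quad \text{while} \quad N\big(\epsilon; \operatorname{supp}(\mu), \ell_\infty\big) \asymp \epsilon^{-2}. \]
Dimension notions built from $N_{\epsilon}(\mu, \delta)$ can therefore report an effective dimension close to one here, whereas the Minkowski dimension of the support is two.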
Definition 11 (Entropic dimension, Chakraborty and Bartlett, 2024b, ).
For any $\epsilon > 0$, the $\epsilon$-entropic dimension of $\mu$, denoted $\bar{d}_{\epsilon}(\mu)$, is defined as:
This notion extends Dudley's entropic dimension (Dudley, 1969) and was used to characterize the convergence rate for the Bidirectional Generative Adversarial Network (BiGAN) problem (Donahue et al., 2017). Chakraborty and Bartlett (2024b) showed that the entropic dimension is never larger than the upper Minkowski dimension or the upper Wasserstein dimension (Weed and Bach, 2019), with strict inequality holding even for simple examples of measures on the unit hypercube (see Examples 10 and 11 and Section 3 of Chakraborty and Bartlett, 2024b). Consequently, the entropic dimension characterizes the intrinsic dimension of the data distribution more efficiently than these popular alternatives, in the sense that it enables faster rates of convergence for the resulting estimates.
As an intrinsically low-dimensional probability measure is not guaranteed to be dominated by the Lebesgue measure, we remove Assumption A4. Under only Assumptions A1–3, if the network sizes are properly chosen, the rate of convergence of $\hat{f}_n$ to $f_0$ in the $\mathscr{L}^2$-norm decays at a rough rate of $n^{-\frac{2\beta}{2\beta+\bar{d}_{2\beta}}}$, where $\bar{d}_{2\beta}$ denotes the $2\beta$-entropic dimension of the distribution of $X$, as shown in Theorem 12.
Theorem 12.
Since the normal-noise regression model with a $\beta$-Hölder regression function is a special case of our model in (3), Theorem 12 derives a faster rate compared to Nakada and Imaizumi (2020, Theorem 7), who show a rate of roughly $n^{-\frac{2\beta}{2\beta+\overline{\dim}_M}}$. This is because the upper Minkowski dimension is at least the $2\beta$-entropic dimension by Chakraborty and Bartlett (2024b, Proposition 8(c)), i.e. $\bar{d}_{2\beta} \le \overline{\dim}_M$.
An immediate corollary of Theorem 12 is that the expected test-error rate follows the same rate of decay. The proof of this result can be done following the proof of Corollary 9.
Corollary 13.
Under the assumptions and choices of Theorem 12, .
One can state that a rate similar to that observed in Theorem 12 holds when the support of the distribution of $X$ is regular. We recall the definition (Weed and Bach, 2019, Definition 6) of regular sets as follows.
Definition 14 (Regular sets).
We say a set $S$ is $d^{\ast}$-regular w.r.t. the $d^{\ast}$-dimensional Hausdorff measure $\mathcal{H}^{d^{\ast}}$ if there exist constants $0 < c_1 \le c_2 < \infty$ such that $c_1 r^{d^{\ast}} \le \mathcal{H}^{d^{\ast}}\big(B(x, r) \cap S\big) \le c_2 r^{d^{\ast}}$, for all $x \in S$ and all sufficiently small $r > 0$. Recall that the $d^{\ast}$-Hausdorff measure of a set $S$ is defined as,
\[ \mathcal{H}^{d^{\ast}}(S) = \lim_{\delta \downarrow 0}\ \inf\Big\{ \sum_{i} (\operatorname{diam} A_i)^{d^{\ast}} \,:\, S \subseteq \bigcup_{i} A_i,\ \operatorname{diam} A_i \le \delta \Big\}. \]
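As a quick check of the definition (our own example, with Euclidean balls for concreteness), the unit circle $S^1 \subset \mathbb{R}^2$ is $1$-regular:
\[ \mathcal{H}^{1}\big(B(x, r) \cap S^{1}\big) = 4\arcsin(r/2) \in [2r, \pi r] \qquad \text{for all } x \in S^{1},\ 0 < r \le 2, \]
so the two-sided bound in Definition 14 holds with $d^{\ast} = 1$.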
Examples of regular sets include compact $d^{\ast}$-dimensional differentiable manifolds; nonempty, compact convex sets spanned by an affine space of dimension $d^{\ast}$; the relative boundary of a nonempty, compact convex set of dimension $d^{\ast} + 1$; and self-similar sets with similarity dimension $d^{\ast}$. When the support of the distribution of $X$ is $d^{\ast}$-regular, it can be shown that the entropic dimension is at most $d^{\ast}$. Formally,
Lemma 15.
Suppose that the support of the distribution of $X$ is $d^{\ast}$-regular. Then, $\bar{d}_{\epsilon} \le d^{\ast}$, for any $\epsilon > 0$. Furthermore, if $d^{\ast} \ge 1$, then $\bar{d}_{\epsilon} = d^{\ast}$.
Thus, applying Theorem 12 and Lemma 15, we note that $\|\hat{f}_n - f_0\|_{\mathscr{L}^2}^2$ decays at most at a rate of roughly $n^{-\frac{2\beta}{2\beta+d^{\ast}}}$, resulting in the following corollary.
Corollary 16.
Since compact $d^{\ast}$-dimensional differentiable manifolds are a special case of $d^{\ast}$-regular sets, Corollary 16 recovers the results by Chen et al. (2022) for the additive Gaussian-noise regression model as a special case. Importantly, this recovery is achieved without imposing assumptions of uniform sharpness on the manifold, as done by Chen et al. (2022, Assumption 2).
6 Proof of the Main Results
This section discusses the proofs of the main results of this paper, i.e., Theorems 8, 10 and 12, with proofs of auxiliary supporting lemmas appearing in the appendix. The proofs of the main upper bounds (Theorems 8 and 12) are presented in Section 6.1, while the minimax lower bound is proved in Section 6.2.
6.1 Proof of the Upper Bounds
In order to prove Theorem 8, we first decompose the error through an oracle inequality. For any vector , we denote .
Lemma 17 (Oracle inequality).
Let . Suppose that , and . Then,
(5) |
The first term in the right-hand side (RHS) of (5) is analogous to an approximation error while the second term is akin to a generalization gap. It is worth noting that while taking a large network reduces the approximation error, it can potentially give rise to a large generalization gap and vice versa. The key idea is to select a network of appropriate size that ensures that both these errors are small enough. In the following two sections, we control these terms individually.
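The rate in Theorem 8 can be anticipated from (5) by a standard balancing heuristic (informal, with constants and logarithmic factors suppressed): approximating an element of $\mathcal{H}^{\beta}$ to accuracy $\epsilon$ requires a network with roughly $\epsilon^{-d/\beta}$ weights, whose generalization gap scales like the number of weights divided by $n$, so the two terms balance when
\[ \epsilon^{2} \asymp \frac{\epsilon^{-d/\beta}}{n} \quad \Longleftrightarrow \quad \epsilon \asymp n^{-\frac{\beta}{2\beta+d}}, \]
giving an overall squared error of order $n^{-\frac{2\beta}{2\beta+d}}$, which matches Theorem 8 up to the polynomial-in-$d$ and logarithmic factors.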
6.1.1 Generalization Error
To effectively control the generalization error, we employ standard localization techniques; see, for example, Wainwright (2019, Chapter 14). These techniques are instrumental in achieving rapid convergence of the sample estimator to the population estimator in the $\mathscr{L}^2$-norm. It is important to note that the true function $f_0$ may not be precisely representable by a ReLU network. We therefore establish a high-probability bound for the squared norm difference between our estimated function $\hat{f}_n$ and a surrogate $\bar{f}$, where $\bar{f}$ is taken to belong to the neural network class and to be close enough to $f_0$. Our strategy revolves around a two-step process: first, we derive a local complexity bound, as outlined in Lemma 18, and subsequently we leverage this bound to derive an estimate for $\|\hat{f}_n - \bar{f}\|_{\mathscr{L}^2(\hat{P}_n)}$, as elucidated in Lemma 19. Here $\hat{P}_n$ denotes the empirical distribution of the explanatory variables. We then use this result to control $\|\hat{f}_n - \bar{f}\|_{\mathscr{L}^2}$ in Lemma 22 for large $n$. We state these results subsequently, with proofs appearing in Appendix A.
Lemma 18.
Suppose that , with . Also let, . Then, for any , with probability (conditioned on ) at least ,
(6) |
Here, $\operatorname{Pdim}(\mathcal{F})$ denotes the pseudo-dimension of the function class $\mathcal{F}$ (Anthony and Bartlett, 1999).
Lemma 19.
Suppose and . Then, for any , with probability at least, ,
(7) |
Lemma 20.
For , if , with probability at least
(8) |
for any .
6.1.2 Approximation Error
To effectively bound the overall error in Lemma 17, one needs to control the approximation error, denoted by the first term of (5). Exploring the approximating potential of neural networks has attracted substantial interest in the research community over the past decade or so. Pioneering studies such as those by Cybenko (1989) and Hornik (1991) extensively examined the universal approximation properties of networks utilizing sigmoid-like activations. These foundational works demonstrated that wide, single-hidden-layer neural networks possess the capacity to approximate any continuous function on a bounded domain. In light of recent advancements in deep learning, there has been a notable surge in research dedicated to exploring the approximation capabilities of deep neural networks. Some important results in this direction include those by Yarotsky (2017); Lu et al. (2021); Petersen and Voigtlaender (2018); Shen et al. (2019); Schmidt-Hieber (2020), among many others. All of the aforementioned results indicate that, when $\epsilon$-approximating a $\beta$-Hölder function in the sup-norm, it suffices to have a network of depth $O(\log(1/\epsilon))$ with at most $O(\epsilon^{-d/\beta}\log(1/\epsilon))$-many weights. However, the constants in these upper bounds on the number of weights and the depth of the network can potentially grow exponentially with $d$. Shen et al. (2022) showed that if one approximates in the $\mathscr{L}^p$-norm w.r.t. the Lebesgue measure on $[0,1]^d$, this exponential dependence can be mitigated for the case $\beta \le 1$. Lemma 21 generalizes this result to all $\beta > 0$ and achieves a precise dependence on $d$. The proof is provided in Appendix B.
Lemma 21.
Suppose that . Then, we can find a ReLU network, , with and , and a constant (that might depend on and ) such that for all . Here, is an absolute constant.
6.1.3 Proof of Theorem 8
Proof.
We take , with and . Then, by Lemma 21, we can find , such that, . Furthermore, by Lemma 20, we observe that with probability at least ,
(9) |
Here (9) follows from the following calculations. Suppose is the constant that honors , i.e. . Then,
when is small enough. Taking and , we note that with probability at least ,
Note that for the above bounds to hold, one requires and , which holds when is large enough. ∎
6.1.4 Proof of Theorem 12
Proof.
We take , with and . Then, by Chakraborty and Bartlett, 2024b (, Theorem 18), we can find , such that, . Furthermore, by Lemma 20, we observe that with probability at least ,
(10) | ||||
Here, (10) follows from (Bartlett et al.,, 2019, Theorem 6). Taking and , we note that, with probability at least , Note that for the above bounds to hold, one requires and , which holds when is large enough. ∎
6.2 Proof of the Minimax Rates
In this section, we give a formal proof of Theorem 10. We use the standard technique of Fano's method (see, for example, Wainwright, 2019, Chapter 15) to construct hypotheses that are well separated in the $\mathscr{L}^2$ sense but difficult to distinguish in KL-divergence.
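The particular form of Fano's inequality we rely on (see Wainwright, 2019, Proposition 15.2; stated here in our notation) reads: for any family $f_1, \dots, f_M$ that is $2\delta$-separated in the $\mathscr{L}^2$-norm,
\[ \inf_{\hat{f}}\ \max_{1 \le j \le M}\ P_{f_j}\Big( \|\hat{f} - f_j\|_{\mathscr{L}^2} \ge \delta \Big) \ \ge\ 1 - \frac{I(J; Z_{1:n}) + \log 2}{\log M}, \]
where $J$ is uniform on $\{1, \dots, M\}$, $Z_{1:n}$ denotes the sample drawn from $P_{f_J}$, and $I(J; Z_{1:n})$ is their mutual information.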
Proof of Theorem 10: Let be the standard bump function on . For any and , we let, . Here is such that . It is easy to observe that . In what follows, we take, . Let
Since each element of is a sum of members in with disjoint support, . By the Varshamov-Gilbert bound (Tsybakov,, 2009, Lemma 2.9), we can construct a subset of of with , for all and . We note that for any ,
Let $P_{f}$ denote the distribution of the form (3) with $f_0$ replaced with $f$. Thus,
(11) | ||||
Here, (11) follows from Lemma 24. Choosing , we can make, . Thus, from Wainwright, (2019, equation 15.34), Here denotes the mutual information between the random variables and (Cover and Thomas,, 2005, Section 2.3). Hence, if is large enough. Thus, applying Proposition 15.2 of Wainwright, (2019), we note that,
7 Conclusion
In this paper, we discussed a statistical framework to understand the finite sample properties of supervised deep learning for both regression and classification settings. In particular, we modeled the dependence of the response on the explanatory variable through an exponential family and showed that the maximum likelihood estimate can be computed by minimizing the corresponding Bregman loss, with the mean function incorporated as the activation of the final layer. Under the assumption of the existence of a bounded density for the explanatory variable, we showed that deep ReLU networks can achieve the minimax optimal rate when the network size is chosen properly. Furthermore, when the explanatory variable has an intrinsically low dimensional structure, the convergence rate of the sample estimator, in terms of the sample size, depends only on the entropic dimension of the underlying distribution of the explanatory variable, resulting in better convergence rates compared to the existing literature for both classification and regression problems.
While our findings offer insights into the theoretical aspects of deep supervised learning, it is crucial to recognize that assessing the complete error of models in practical applications necessitates the consideration of an optimization error component. Regrettably, the accurate estimation of this component remains a formidable challenge in the non-overparametrized regime due to the non-convex and intricate nature of the optimization problem. Nevertheless, it is worth emphasizing that our error analyses operate independently of the optimization process and can be readily integrated with optimization analyses.
Appendix A Proofs of Main Lemmata
A.1 Proof of Lemma 7
See 7
Proof.
To prove Lemma 7, we first make the following observation.
(12) | ||||
(13) | ||||
Here (12) follows from Lemma 28. Inequality (13) follows from the fact that is -Lipschitz. We also note that,
(14) | ||||
(15) | ||||
As before, (14) follows from the fact that is -strongly convex and applying Lemma 28. Inequality (15) follows from the fact that , due to the strong convexity of and a simple application of the mean value theorem. ∎
A.2 Proof of Lemma 15
See 15
Proof.
We note that , by Weed and Bach, (2019, Proposition 7). When , again by Weed and Bach, (2019, Proposition 8), it is known that , where denotes the lower Wasserstein dimension of (Weed and Bach,, 2019, Definition 4). The result now follows from Proposition 8 of Chakraborty and Bartlett, 2024b . ∎
A.3 Proof of Lemma 17
See 17
A.4 Proof of Lemma 18
See 18
Proof.
From the definition of , it is clear that Let . Clearly, . Furthermore, applying Lemma 26, we observe that
Thus, is -subGaussian. Furthermore,
A.5 Proof of Lemma 19
See 19
Proof.
We take and let . We consider two cases as follows.
Case 1: .
Then, by Lemma 18, with probability at least ,
(22) | ||||
(23) | ||||
(24) |
In the above calculations, (22) follows from the fact that is strongly convex and (23) follows from Lemma 17. Inequality (24) follows from Lemma 18. Let be the corresponding constant that honors the inequality in (24). Then using the upper bound on , we observe that
(25) | ||||
Here, (25) follows from the fact that , from the AM-GM inequality and taking and . Thus,
(26) |
Case 2: .
In this case, we note that . Thus,
Thus, from the above two cases, combining equations (24) and (26), with probability at least, ,
(27) |
From equation (27), we note that, for some constant ,
Integrating both sides w.r.t. the measure , i.e. the joint distribution of , we observe that, unconditionally, with probability at least ,
(28) |
∎
A.6 Proof of Lemma 20
To prove Lemma 20, we first state and prove the following result.
Lemma 22.
For , if , with probability at least ,
Proof.
From Lemma 23, we note that if , then,
(29) | ||||
(30) |
Here, (29) follows from Lemma 23 and (30) follows from the fact that for all , . It is easy to see that the RHS of (30) has a fixed point of and . Then, by Theorem 6.1 of Bousquet, (2002), we note that with probability at least ,
(31) |
for some absolute constant . Now, taking in (31), we note that, with probability at least ,
(32) |
Combining (32) with Lemma 19, we observe that with probability at least ,
(33) |
∎
A.6.1 Proof of Lemma 20
See 20
Proof.
Let . Since , for some constant , it is easy to see that
Taking , we note that, . Applying Bernstein’s inequality, e.g., (Vershynin,, 2018, Theorem 2.8.4), with ,
with . Thus, with probability, at least ,
Taking , we observe that, with probability at least ,
(34) |
Combining (34) with Lemma 22, we observe that, with probability at least
The theorem now follows from observing that . ∎
Appendix B Proof of Approximation Results (Lemma 21)
See 21
Proof.
We first fix any and let, . For any , let . Clearly, constitutes an -net of , w.r.t. the -norm. We let
for any . For , we define
We define the region . It is easy to observe that . Here denotes the Lebesgue measure on . Clearly, on .
Consider the Taylor expansion of around as,
(35) |
In the above calculations, lies on the line segment joining and . Inequality (35) follows from the fact that and the identity . Next, we suppose that . Thus, if ,
(36) |
Here, (36) follows from (35). Thus, . Furthermore, by definition, . Let and
Here, is an approximation of the product function developed by Chakraborty and Bartlett, 2024b (see Lemma 29) and has at most many inputs. By Chakraborty and Bartlett, 2024b (, Lemma 40), can be implemented by a ReLU network with , , where is an absolute constant. Thus, and . From Chakraborty and Bartlett, 2024b (, Lemma 40), we observe that,
(37) |
Here, . Finally, let . Clearly, and . This implies that,
(38) |
From (36) and (38), we thus get that if ,
(39) |
Furthermore, it is easy to observe that, . Hence,
taking . We take and . Thus, We note that has at most -many networks of depth and number of weights . Thus, and . We thus get,
Similarly,
The proof is now complete by replacing with . ∎
Appendix C Proof of Corollary 9
Proof.
Suppose that $C_0$ is the constant that honors the inequality. Then
∎
Appendix D Supporting Lemmas
Lemma 23.
Let with . Then, we can find , such that if and ,
Proof.
Let . We first fix and let be a member of with . We use the notation . Let be such that , for all . Here denotes the cover of w.r.t. the -norm. Let . Then
(40) | ||||
(41) |
Here (40) follows from the fact that , for any . Hence, from the above calculations, , for some absolute constant .
Hence, . Thus from Wainwright, (2019, Theorem 5.22),
(42) |
Lemma 24.
Proof.
To prove this result, we make the following observations.
(43) |
Here, (43) follows from noting that . ∎
Appendix E Supporting Lemmata
Lemma 25.
For any , .
Proof.
We start by making a transformation and observe that,
(44) | ||||
In the above calculations, (44) follows from the fact that the function is decreasing when . ∎
Lemma 26.
Let , with . Then, is -SubGaussian.
Proof.
We observe the following,
∎
Lemma 27.
Suppose that are independent and identically distributed sub-Gaussian random variables with variance proxy and suppose that for all . Then with probability at least ,
Appendix F Supporting Results from the Literature
Lemma 28 (Lemma B.1 of Telgarsky and Dasgupta, (2013)).
If differentiable is strongly convex, then . Furthermore, if differentiable has Lipschitz gradients with parameter with respect to , then .
Lemma 29 (Lemma 40 of Chakraborty and Bartlett, 2024b ).
For any , we can construct a ReLU network , such that for any , . Furthermore, for some absolute constant ,
-
1.
, .
-
2.
.
Acknowledgment
We gratefully acknowledge the support of the NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and #814639, the NSF’s support of FODSI through grant DMS-2023505, and the support of the ONR through MURI award N000142112431.
References
- Anthony and Bartlett, (1999) Anthony, M. and Bartlett, P. (1999). Neural network learning: Theoretical foundations. Cambridge University Press.
- Banerjee et al., (2005) Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., and Lafferty, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10).
- Bartlett et al., (2019) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301.
- Bousquet, (2002) Bousquet, O. (2002). Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Biologische Kybernetik.
- (5) Chakraborty, S. and Bartlett, P. (2024a). A statistical analysis of Wasserstein autoencoders for intrinsically low-dimensional data. In The Twelfth International Conference on Learning Representations.
- (6) Chakraborty, S. and Bartlett, P. L. (2024b). On the statistical properties of generative adversarial models for low intrinsic data dimension. arXiv preprint arXiv:2401.15801.
- Chen et al., (2019) Chen, M., Jiang, H., Liao, W., and Zhao, T. (2019). Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. Advances in Neural Information Processing Systems, 32.
- Chen et al., (2022) Chen, M., Jiang, H., Liao, W., and Zhao, T. (2022). Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253.
- Cover and Thomas, (2005) Cover, T. M. and Thomas, J. A. (2005). Elements of Information Theory. John Wiley & Sons, Hoboken, NJ.
- Cybenko, (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314.
- Donahue et al., (2017) Donahue, J., Krähenbühl, P., and Darrell, T. (2017). Adversarial feature learning. In International Conference on Learning Representations.
- Dudley, (1969) Dudley, R. M. (1969). The speed of mean glivenko-cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50.
- Hornik, (1991) Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257.
- Huang et al., (2022) Huang, J., Jiao, Y., Li, Z., Liu, S., Wang, Y., and Yang, Y. (2022). An error analysis of generative adversarial networks for learning distributions. Journal of Machine Learning Research, 23(116):1–43.
- Jiao et al., (2021) Jiao, Y., Shen, G., Lin, Y., and Huang, J. (2021). Deep nonparametric regression on approximately low-dimensional manifolds. arXiv preprint arXiv:2104.06708.
- Kakade et al., (2009) Kakade, S., Shalev-Shwartz, S., Tewari, A., et al. (2009). On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2(1):35.
- Kolmogorov and Tikhomirov, (1961) Kolmogorov, A. N. and Tikhomirov, V. M. (1961). $\epsilon$-entropy and $\epsilon$-capacity of sets in function spaces. Translations of the American Mathematical Society, 17:277–364.
- Lehmann and Casella, (2006) Lehmann, E. L. and Casella, G. (2006). Theory of Point Estimation. Springer Science & Business Media.
- Lu et al., (2021) Lu, J., Shen, Z., Yang, H., and Zhang, S. (2021). Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506.
- Maurer and Pontil, (2021) Maurer, A. and Pontil, M. (2021). Concentration inequalities under sub-gaussian and sub-exponential conditions. Advances in Neural Information Processing Systems, 34:7588–7597.
- Nakada and Imaizumi, (2020) Nakada, R. and Imaizumi, M. (2020). Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38.
- Paul et al., (2021) Paul, D., Chakraborty, S., Das, S., and Xu, J. (2021). Uniform concentration bounds toward a unified framework for robust clustering. Advances in Neural Information Processing Systems, 34:8307–8319.
- Petersen and Voigtlaender, (2018) Petersen, P. and Voigtlaender, F. (2018). Optimal approximation of piecewise smooth functions using deep relu neural networks. Neural Networks, 108:296–330.
- Pope et al., (2020) Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., and Goldstein, T. (2020). The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations.
- Posner et al., (1967) Posner, E. C., Rodemich, E. R., and Rumsey Jr, H. (1967). Epsilon entropy of stochastic processes. The Annals of Mathematical Statistics, pages 1000–1020.
- Schmidt-Hieber, (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875 – 1897.
- Shen et al., (2019) Shen, Z., Yang, H., and Zhang, S. (2019). Nonlinear approximation via compositions. Neural Networks, 119:74–84.
- Shen et al., (2022) Shen, Z., Yang, H., and Zhang, S. (2022). Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157:101–135.
- Suzuki, (2018) Suzuki, T. (2018). Adaptivity of deep ReLU networks for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033.
- Suzuki and Nitanda, (2021) Suzuki, T. and Nitanda, A. (2021). Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic besov space. Advances in Neural Information Processing Systems, 34:3609–3621.
- Telgarsky and Dasgupta, (2013) Telgarsky, M. J. and Dasgupta, S. (2013). Moment-based uniform deviation bounds for -means and friends. Advances in Neural Information Processing Systems, 26.
- Tsybakov, (2009) Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, Springer New York, NY, 1 edition. Published: 26 November 2008.
- Uppal et al., (2019) Uppal, A., Singh, S., and Póczos, B. (2019). Nonparametric density estimation & convergence rates for GANs under Besov IPM losses. Advances in Neural Information Processing Systems, 32.
- Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
- Wainwright, (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
- Weed and Bach, (2019) Weed, J. and Bach, F. (2019). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648.
- Yang and Barron, (1999) Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pages 1564–1599.
- Yarotsky, (2017) Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114.