Generalization Performance of Empirical Risk Minimization on Over-parameterized Deep ReLU Nets
Abstract
In this paper, we study the generalization performance of global minima for implementing empirical risk minimization (ERM) on over-parameterized deep ReLU nets. Using a novel deepening scheme for deep ReLU nets, we rigorously prove that there exist perfect global minima achieving almost optimal generalization error rates for numerous types of data under mild conditions. Since over-parameterization is crucial to guarantee that the global minima of ERM on deep ReLU nets can be realized by the widely used stochastic gradient descent (SGD) algorithm, our results indeed fill a gap between optimization and generalization of deep learning.
Index Terms:
Deep learning, Empirical risk minimization, Global minima, Over-parameterization.

I Introduction
Deep learning [27], which conducts feature extraction and statistical modelling within a unified deep neural network (deep net) framework, has attracted enormous research activity in the past decade. It has made significant breakthroughs in numerous applications including computer vision [32], speech recognition [33] and the game of Go [57]. Simultaneously, it also brings several challenges in understanding the running mechanism and magic behind deep learning, among which the generalization issue in statistics [59] and the convergence issue in optimization [2] are crucial.
The generalization issue pursues theoretical advantages of deep learning by demonstrating its better generalization capability compared with shallow learning such as kernel methods and shallow networks. Along with the rapid development of deep nets in approximation theory [16, 43, 58, 48, 55, 54, 18], essential progress on the excellent generalization performance of deep learning has been made in [29, 8, 37, 53, 17, 25]. In detail, [53] proved that implementing empirical risk minimization (ERM) on deep ReLU nets is better than shallow learning in embodying the composite structure of the regression function [24]; [25] showed that implementing ERM on deep ReLU nets succeeds in improving the learning performance of shallow learning by capturing the group structure of inputs; and [17] derived that, with the help of massive data, implementing ERM on deep ReLU nets is capable of reflecting spatially sparse properties of the regression function while shallow learning fails. In a word, with an appropriately selected number of free parameters, the generalization issue of deep learning seems to be successfully settled, in the sense that deep learning can achieve optimal learning rates for numerous types of data, which is beyond the capability of shallow learning.
The convergence issue concerns the convergence of popular algorithms such as stochastic gradient descent (SGD) and adaptive moment estimation (Adam) used to solve ERM on deep nets. Due to the highly nonconvex nature of such ERM problems, it is generally difficult to find their global minima, since there are numerous local minima, saddle points, plateaus and even flat regions [22]. It is thus highly desired to clarify when and where the corresponding SGD or Adam converges. Unfortunately, in the under-parameterized setting exhibited in [29, 8, 37, 53, 17, 25], the convergence issue remains open. Alternatively, studies in optimization theory show that over-parameterizing deep ReLU nets enhances the convergence of SGD [21, 41, 2, 1]. In fact, it was proved in [2, 1] that SGD converges to either a global minimum or a local minimum near some global minimum of ERM, provided there are sufficiently many free parameters in deep ReLU nets. Noting further that implementing ERM on over-parameterized deep ReLU nets commonly yields infinitely many global minima and usually leads to over-fitting, it is difficult to verify the good generalization performance of the obtained global minima in theory.
As mentioned above, there is an inconsistency between the generalization and convergence issues of deep learning, in the sense that good generalization requires under-parameterization of deep ReLU nets while provable convergence needs over-parameterization, which keeps the running mechanism of deep learning a mystery. Such an inconsistency stimulates a rethinking of the classical bias-variance trade-off in modern machine-learning practice [12], since numerical evidence in [59] illustrated that there are over-parameterized deep nets that generalize well despite achieving extremely small training error. In particular, [9] constructed an exact interpolation of training data that achieves optimal generalization error rates based on Nadaraya-Watson kernel estimates; [7] derived a sufficient condition on the data under which the global minimum of over-parameterized linear regression possesses excellent generalization performance; [36] studied the generalization performance of kernel-based least norm interpolation and presented generalization error estimates under some restrictions on the data distribution. Similar results can also be found in [11, 26, 45, 38, 6] and the references therein. All these exciting results provide a springboard for understanding benign over-fitting in modern machine-learning practice and present novel insights for developing learning algorithms that avoid the traditional bias-variance dilemma. However, it should be pointed out that these existing results are incapable of settling the generalization and convergence challenges of deep learning, mainly due to the following three aspects:
Difference in model: the existing theoretical analysis for benign over-fitting is available only for convex linear models, whereas ERM on deep ReLU nets involves highly nonconvex nonlinear models.
Difference in theory: the existing results for benign over-fitting focus on pursuing the restrictions on data distributions so that all global minima for over-parameterized linear models are good learners, which does not hold for deep learning, since it is easy to provide a counterexample for benign over-fitting of deep ReLU nets, even for noise-less data (see Proposition 1 below).
Difference in requirements on dimension: theoretical analysis in the existing work frequently requires a high-dimensionality assumption on the input space, while the numerical experiments in [59] showed that deep learning can lead to benign over-fitting in both high- and low-dimensional input spaces.
Based on these three interesting observations, there naturally arises the following problem:
Problem 1
Without strict restrictions on data distributions and dimensions of input spaces, are there global minima of ERM on over-parameterized deep ReLU nets achieving the optimal generalization error rates obtained by under-parameterized deep ReLU nets?
There are roughly two schemes to settle the inconsistency between optimization and generalization for deep learning. One is to pursue convergence guarantees for SGD on under-parameterized deep ReLU nets, and the other is to study the generalization performance of implementing ERM on over-parameterized deep ReLU nets. To the best of our knowledge, there is no theoretical analysis for the former, even when strict restrictions are imposed on the data distributions. An answer to Problem 1, a stepping-stone to demonstrating the feasibility of the latter, not only provides solid theoretical evidence for the benign over-fitting phenomenon of deep learning in practice [59], but also presents theoretical guidance on setting network structures to balance the generalization-optimization inconsistency.
The aim of the present paper is to present an affirmative answer to Problem 1. Our main tool for analysis is a novel network deepening approach based on the localized approximation property [16, 17] and the product-gate property [58, 48] of deep nets. The network deepening approach succeeds in constructing an over-parameterized deep net (student network) by deepening and widening an arbitrary under-parameterized network (teacher network), so that the obtained student network exactly interpolates the training data and possesses almost the same generalization capability as the teacher network. In this way, setting the teacher network to be the one in [29, 8, 37, 53, 17, 25], we actually prove that there are global minima of ERM on over-parameterized deep ReLU nets that possess optimal generalization error rates, provided the networks are deeper and wider than the corresponding student network.
The main contributions of the paper are twofold. Since the presence of noise is a crucial factor leading to over-fitting, and it is not difficult to design learning algorithms with good generalization performance that produce a perfect fit for noiseless data [34, 14], our first result studies the generalization capability of deep ReLU nets that exactly interpolate noiseless data. In particular, we construct a deep ReLU net that exactly interpolates the noiseless data but performs extremely badly in generalization, and we also prove that for deep ReLU nets with more than two hidden layers, there always exist global minima of ERM that generalize extremely well, provided the number of free parameters achieves a certain level and the data are noiseless. This is different from the linear models studied in [9, 11, 12, 26, 36, 45, 38, 6] and shows the difficulty of analyzing over-parameterized deep ReLU nets. Our second result, more importantly, focuses on the existence of perfect global minima of ERM on over-parameterized deep ReLU nets that achieve almost optimal generalization error rates for numerous types of data. Using the network deepening approach, we rigorously prove that, if the depth and width of a deep net are larger than specific values, then there always exists such a perfect global minimum. This finding partly explains the benign over-fitting phenomenon of deep learning and shows that implementing ERM on over-parameterized deep nets can yield an estimator of high quality. Different from the existing results, our analysis requires neither high dimensionality of the input space nor strong restrictions on the covariance matrix of the input data.
The rest of this paper is organized as follows. In the next section, after introducing deep ReLU nets, we provide a theoretical guarantee for the existence of good deep-net interpolants for noiseless data. In Section III, we present our main results by rigorously proving the existence of perfect global minima. In Section IV, we compare our results with related work and present some discussions. In Section V, we conduct numerical experiments to verify our theoretical assertions. We prove our results in the last section.
II Global Minima of ERM on Over-Parameterized Deep Nets for Noiseless Data
Let $L \in \mathbb{N}$ be the depth of a deep net, and let $d_\ell$ be the width of the $\ell$-th hidden layer for $\ell = 1, \dots, L$. Denote by $\mathcal{J}_\ell$ the affine operator with $\mathcal{J}_\ell(x) = W_\ell x + b_\ell$ for a weight matrix $W_\ell$ and a bias vector $b_\ell$. For the ReLU function $\sigma(t) = \max\{t, 0\}$, write $\sigma(x) = (\sigma(x^{(1)}), \dots, \sigma(x^{(d)}))^T$ for $x = (x^{(1)}, \dots, x^{(d)})^T$. Define an $L$-layer deep ReLU net by
$$h_{d_1,\dots,d_L}(x) = a \cdot \sigma\big(\mathcal{J}_L\big(\sigma\big(\mathcal{J}_{L-1}\big(\cdots \sigma(\mathcal{J}_1(x)) \cdots\big)\big)\big)\big), \qquad (1)$$
where $a \in \mathbb{R}^{d_L}$. The structure of $h_{d_1,\dots,d_L}$ is determined by the weight matrices $W_\ell$ and bias vectors $b_\ell$, $\ell = 1, \dots, L$. In particular, full weight matrices correspond to deep fully connected nets (DFCN) [58]; sparse weight matrices are associated with deep sparsely connected nets (DSCN) [48]; and Toeplitz-type weight matrices are related to deep convolutional neural networks (DCNN) [60]. In particular, for DFCN, the number of trainable parameters is
$$\mathcal{W} = \sum_{\ell=1}^{L} d_\ell (d_{\ell-1} + 1) + d_L, \qquad d_0 := d, \qquad (2)$$
and for DSCN, the number of free parameters is much smaller than that in (2). In this paper, we mainly focus on analyzing the benign over-fitting phenomenon for DFCN. Denote by $\mathcal{H}_{d_1,\dots,d_L}$ the set of all DFCNs of the form (1). Define the width of a DFCN to be $d_{\max} := \max_{1 \le \ell \le L} d_\ell$.
Given a sample set $D = \{(x_i, y_i)\}_{i=1}^{n}$, we are interested in global minima of the following empirical risk minimization (ERM) problem:
$$\min_{h \in \mathcal{H}_{d_1,\dots,d_L}} \frac{1}{n} \sum_{i=1}^{n} \big(h(x_i) - y_i\big)^2. \qquad (3)$$
Denote by $\mathcal{H}_{D}^{\min}$ the set of global minima of the optimization problem (3), i.e.,
$$\mathcal{H}_{D}^{\min} := \Big\{ h \in \mathcal{H}_{d_1,\dots,d_L} : \frac{1}{n} \sum_{i=1}^{n} \big(h(x_i) - y_i\big)^2 = \min_{g \in \mathcal{H}_{d_1,\dots,d_L}} \frac{1}{n} \sum_{i=1}^{n} \big(g(x_i) - y_i\big)^2 \Big\}. \qquad (4)$$
Before studying the quality of the global minima of (3), we present some properties of the set $\mathcal{H}_{D}^{\min}$ in the following lemma.
Lemma 1
If , then for any and , there are infinitely many functions in and for any , there holds .
Lemma 1 is a direct extension of [49, Theorem 5.1], where a similar conclusion is drawn for with . Based on [49, Theorem 5.1], the proof of Lemma 1 is straightforward by noting and . Lemma 1 implies that if is sufficiently large and the network structure satisfies , then there are always infinitely many solutions to (3), and any global minimum exactly interpolates the given data. It should be mentioned that similar results also hold for any DSCN that contains a shallow net with neurons. Since we place particular emphasis on DFCN, we leave the corresponding assertions for DSCN to interested readers.
Denote by $L^p$, $1 \le p < \infty$, the space of $p$-th power Lebesgue integrable functions on the input domain, endowed with the norm $\|\cdot\|_{L^p}$. In this section, we are concerned with noiseless data; that is, there is some target function $f^*$ such that
$$y_i = f^*(x_i), \qquad i = 1, \dots, n. \qquad (5)$$
First, we provide a negative result for running ERM on over-parameterized deep ReLU nets, showing that there is some global minimum of (3) that performs extremely badly in generalization.
Proposition 1
Let . If , , and , then there are infinitely many such that for any satisfying (5) and there holds
(6)
where is an absolute constant.
Proposition 1 shows that in the over-parameterized setting, there are infinitely many global minima of (3) behaving extremely badly, even for noiseless data. It should be highlighted that the construction of in our proof is independent of . In fact, due to the localized approximation of deep ReLU nets, we can construct deep ReLU nets which exactly interpolate the training sample but satisfy for arbitrarily small . Different from the results in [12, 26, 45, 36], it is difficult to quantify conditions on distributions and dimensions of input spaces such that all global minima of over-parameterized deep ReLU nets possess good generalization performance. This is mainly due to the nonlinear nature of the hypothesis space, even though the capacities, measured by covering numbers [23, 5], of deep ReLU nets and linear models are comparable, provided the hypothesis spaces involve similar numbers of free parameters and the depth is not too large.
Proposition 1 provides extremely bad examples for global minima of (3) in generalization, even for noiseless data. It seems that over-parameterized deep ReLU nets are always worse than under-parameterized networks, which is standard from a viewpoint of classical learning theory [19]. However, in our following theorem, we will show that if the number of free parameters increases to a certain extent, then there are also infinitely many global minima of (3) possessing excellent generalization performance. To this end, we need two concepts concerning the data distribution and the smoothness of the target function .
Denote $D_x := \{x_1, \dots, x_n\}$ as the input set. The separation radius [46] of $D_x$ is defined by
$$q_{D_x} := \frac{1}{2} \min_{i \ne j} \|x_i - x_j\|_2, \qquad (7)$$
where $\|\cdot\|_2$ denotes the Euclidean norm on $\mathbb{R}^d$. The separation radius is half of the smallest distance between any two distinct points in $D_x$ and naturally satisfies $q_{D_x} > 0$ whenever the inputs are distinct. Let us also introduce the standard smoothness assumption [24, 58, 48, 37, 62]. Let $r = s + \beta$ with $s \in \mathbb{N}_0$ and $0 < \beta \le 1$. We say a function $f$ is $r$-smooth if $f$ is $s$-times differentiable and, for every multi-index $\alpha = (\alpha_1, \dots, \alpha_d)$ with $\alpha_1 + \dots + \alpha_d = s$, its $\alpha$-th partial derivative satisfies the Lipschitz condition
$$\big|\partial^\alpha f(x) - \partial^\alpha f(x')\big| \le c_0 \|x - x'\|_2^{\beta}. \qquad (8)$$
Denote by $\mathrm{Lip}^{(r, c_0)}$ the set of all $r$-smooth functions (with constant $c_0$ in (8)) defined on the input domain.
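As a concrete illustration (a sketch of our own, not code from the paper), the separation radius in (7) of a finite input set can be computed as follows.

```python
import numpy as np

def separation_radius(X):
    """Half of the smallest pairwise Euclidean distance among the rows of X,
    i.e., the quantity defined in (7). X has shape (n, d) with distinct rows."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)           # exclude the zero self-distances
    return 0.5 * np.sqrt(max(float(d2.min()), 0.0))

# Toy usage: 20 random points in [0, 1]^3.
X = np.random.default_rng(0).random((20, 3))
print(separation_radius(X))
```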
We are in a position to state our first main result to show the existence of good global minima of (3) for noiseless data.
Theorem 2
Let and . If satisfies (5), , , and for , then there are infinitely many , such that
(9)
where is a constant depending only on and , and for means that there is some constant depending only on and such that .
A consensus on deep net approximation is that it can break the “curse of dimensionality”, which was verified in the interesting work [43, 30, 55, 31, 18, 53] in terms of deriving dimension-independent approximation rates. However, it should be pointed out that to achieve such dimension-independent approximation rates, strict restrictions have been imposed on the target functions, which become stronger as the dimension grows, just as [4, P.68] observed. In this way, though the approximation error of deep nets is independent of the dimension, the class of applicable target functions becomes more and more stringent as the dimension grows. Our result presented in Theorem 2 takes a different direction by showing that even for well-studied smooth functions, there is a deep net that interpolates the training data without degrading the approximation rate. We highlight that the approximation rate depends on the a-priori knowledge of the target functions. In particular, if we impose a strict restriction such as with , which is standard in kernel learning [15], then the approximation rate can be at least of order , which is also dimension-independent.
It is well known that deep learning has achieved great success in applications whose input spaces are of high dimensionality, such as image processing [32] and game playing [57], showing the excellent performance of deep learning for large input dimensions. However, recent progress in inventory management [52], finance prediction [28] and earthquake intensity analysis [25] demonstrated that deep learning is also efficient for applications with low-dimensional input spaces. Therefore, numerous research activities have been triggered to verify the advantage of deep nets from approximation theory [58, 48, 61] and learning theory [29, 31, 17] that regard the dimension as a constant. This paper follows this direction and assumes the dimension to be a constant that is much smaller than the size of the data. As we mentioned above, if the dimension is extremely large, more restrictions on the target functions should be imposed, just as Theorem 4 below purports to show.
Noting that there are in total free parameters in the net in (9), the derived error rate is almost optimal in the sense that, up to a logarithmic factor, the derived upper bound is of the same order as the lower bound [58, 23], i.e., for some independent of . This means that for and , we can get almost optimal deep nets by finding suitable global minima of (3). Theorem 2 also shows that in the over-parameterized setting, where all global minima exactly interpolate the data, the interpolation restriction does not always affect the approximation performance of deep nets. Theorem 2 actually presents a sufficient condition on the number of free parameters of deep ReLU nets to guarantee the existence of perfect global minima when we are faced with noiseless data. It should be highlighted that can be numerically determined, provided the data set is given. If the samples are drawn independently and identically (i.i.d.) according to some distribution, a lower bound on is easy to derive in theory [24, 36, 38].
III Global Minima of ERM on Over-Parameterized Deep Nets for Noisy Data
In this section, we conduct our study in the standard least-square regression framework [24, 19] and assume
$$y_i = f_\rho(x_i) + \varepsilon_i, \qquad i = 1, \dots, n, \qquad (10)$$
where $x_1, \dots, x_n$ are drawn i.i.d. according to an unknown distribution on an input space, and $\varepsilon_1, \dots, \varepsilon_n$ are independent random variables that are independent of $x_1, \dots, x_n$ and satisfy , . Denote by and a set of Borel probability measures on the input space and a set of functions defined on it, respectively. Write
Let be the class of all functions that are derived based on the data set . Define
(11)
It is easy to see that the quantity in (11) measures the theoretically optimal generalization error bound that a learning scheme, based on a data set satisfying (10), can achieve when and [24]. Our purpose is to compare the generalization errors of the global minima defined by (3) with this quantity, to illustrate whether the global minima can achieve the theoretically optimal generalization error bounds.
Before presenting our main results, we introduce a negative result derived in [31, Theorem 2], which shows that without any restriction imposed on the distribution , over-parameterized deep nets do not generalize well.
Lemma 2
If , then for any and and any , there exists a probability measure on such that
(12)
Lemma 2 shows that without any restrictions on the distribution, even for the widely studied smooth functions, all of the global minima of (3) in the over-parameterized setting are bad estimators. This is totally different from the under-parameterized setting, in which distribution-free optimal generalization error rates were established [24, 40]. Lemma 2 seems to contradict the numerical results in [59] at first glance. However, we highlight that the distribution in (12) is a very special one that is not even absolutely continuous with respect to the Lebesgue measure. If we impose some mild conditions to exclude such distributions, then the result will be totally different.
The restriction we study in this paper is the well-known distortion assumption of [56, 17], which is slightly stricter than the standard assumption that the marginal distribution is absolutely continuous with respect to the Lebesgue measure. Let and be the identity mapping
Define , where denotes the operator norm. Then is called the distortion of (with respect to the Lebesgue measure), which measures how much distorts the Lebesgue measure. In our analysis, we assume , which obviously holds for the uniform distribution for all . Denote by the set of satisfying .
We then provide optimal generalization error rates for global minima of (3) for different a-priori information. Let us begin with the widely used class of smooth regression functions. The classical results in [24] showed that
(13)
where , are constants independent of and is defined by (11). It demonstrates the optimal generalization error rates for and that a good learning algorithm should achieve. Furthermore, it can be found in [53, 25] that there is some network structure such that all global minima of ERM on under-parameterized deep ReLU nets can reach almost optimal error rates in the sense that if , and , then
(14)
where is a constant depending only on and is an arbitrary global minimum of (3) with depth and width specified as above.
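For orientation, the benchmark behind (13) and (14) is the classical minimax rate for $r$-smooth regression in $d$ dimensions [24, 53]; in standard notation (a reconstruction offered as a reading aid, with $\|\cdot\|_\rho$ denoting the $L^2$ norm with respect to the marginal distribution of the inputs),
\[
\inf_{f_D}\ \sup_{f_\rho \in \mathrm{Lip}^{(r, c_0)}}\ \mathbf{E}\,\big\|f_D - f_\rho\big\|_\rho^2 \;\asymp\; n^{-\frac{2r}{2r+d}},
\]
so a learner is called almost optimal when it attains the rate $n^{-2r/(2r+d)}$ up to a logarithmic factor; (14) states that every global minimum of ERM on suitably sized under-parameterized deep ReLU nets does exactly this.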
It follows from (14) and (13) that all global minima of ERM on under-parameterized deep ReLU nets are almost optimal learners to tackle data i.i.d. drawn according to . In the following theorem, we show that, if the network is deepened and widened, there also exist global minima of (3) for over-parameterized deep ReLU nets possessing similar generalization performance.
Theorem 3
Let , . If , and , then there exist infinitely many such that
(15)
Due to Lemma 1, any such global minimum is an exact interpolant of the data, i.e., it fits every sample pair exactly. Therefore, Theorem 3 presents a theoretical verification of benign over-fitting for deep ReLU nets, which has been intensively discussed recently [9, 12, 36, 38, 7]. It should be mentioned that our main novelties are twofold. On the one hand, we are concerned with deep ReLU nets, while the existing results focus on linear hypothesis spaces. On the other hand, we provide evidence of the existence of good global minima without high-dimensionality or strong distribution assumptions, while the existing results focus on searching for strong conditions on the dimension of the input space and the data distribution to guarantee that all global minima have excellent generalization performance.
Theorem 3 shows that implementing ERM on over-parameterized deep ReLU nets can achieve the almost optimal generalization error rates, but it does not demonstrate the power of depth since shallow learning also reaches these bounds [15, 40]. To show the power of depth, we should impose more restrictions on regression functions. For this purpose, we introduce the generalized additive models that are widely used in statistics and machine learning [24, 53]. For , we say that admits a generalized additive model if where and . Write as the set of all functions admitting a generalized additive model. If , and , it can be found in [53, 25] that for any , there holds
(16)
where , are constants independent of and is an arbitrary global minimum of (3). Notice that it is difficult for shallow learning to achieve the above generalization error rates: even for a special case of the generalized additive model with , it has been proved in [18] that shallow nets with any activation function cannot achieve them. The following theorem establishes the existence of perfect global minima of (3), showing the power of depth in the over-parameterized setting.
Theorem 4
Let and . If , , and , then there are infinitely many such that
(17)
Theorem 4 shows that there are infinitely many global minima of ERM on over-parameterized deep ReLU nets that theoretically break through the bottleneck of shallow learning. This illustrates an advantage of adopting over-parameterized deep ReLU nets to build up hypothesis spaces in practice.
Finally, we show the power of depth in capturing the widely used spatially sparse features with the help of massive data. It has been discussed in [17, 39] that spatial sparseness is an important data feature for image and signal processing and deep ReLU nets perform excellently in reflecting the spatial sparseness. Partition by sub-cubes of side length and with centers . For with , define
If the support of is contained in for a subset of of cardinality at most , we then say that is -sparse in partitions. Denote by the set of all which are -sparse in partitions. Let , , . If is large enough to satisfy then it can be easily deduced from [17] that there exists a DSCN structure contained in such that
(18)
where are constants independent of and is an arbitrary global minimum of ERM on the DSCN. In (18), reflects the smoothness of and embodies the spatial sparseness of . As discussed above, given a sparsity level and the number of partitions , the size of data should satisfy to embody the spatially sparse feature of . In particular, if the number of samples is smaller than the sparsity level , it is impossible to develop learning schemes to realize the support of . Recalling the localized approximation property of deep ReLU nets [16, 17], with the help of massive data, (18) shows that deep ReLU nets are capable of capturing the spatial sparseness, which is beyond the capability of shallow nets due to its lack of localized approximation [16]. We refer the readers to [17] for more details about the above assertions. If , it can be found in (18) that ERM on deep ReLU nets in the under-parameterized setting can achieve almost optimal generalization error rates. The following theorem shows the existence of perfect global minima of (3) in the over-parameterized setting.
Theorem 5
Let , , and satisfy . If , and , then there are infinitely many such that
(19)
(20)
Theorem 5 shows that for spatially sparse regression functions, there are infinitely many global minima of (3) in the over-parameterized setting achieving the almost optimal generalization error rates. Besides the given three types of regression functions, we can provide similar results for numerous regression functions, including general composite functions [53], hierarchical interaction models [30] and piecewise smooth functions [29], by using the same approach as in this paper. In particular, to derive assertions similar to our theorems, it is sufficient to apply the deepening scheme developed in Theorem 6 below to the corresponding generalization error estimates in [30, 29, 53]. We omit the details for the sake of brevity.
Based on the above theorems, we can derive the following corollary, which shows the versatility of over-parameterized deep ReLU nets in regression.
Corollary 1
A crucial problem of deep learning is how to specify the structure of deep nets for a given learning task. It should be highlighted that in the under-parameterized setting, both the width and depth should be carefully tailored to avoid the bias-variance trade-off phenomenon, making the structures of deep nets for different learning tasks quite different [30, 29, 53, 17, 25]. However, as shown in Corollary 1, over-parameterization succeeds in avoiding the structure selection problem of deep learning, in the sense that there exists a unified over-parameterized DFCN structure that contains perfect global minima for different learning tasks.
IV Related Work and Discussions
In this section, we review some related work on the generalization performance of deep ReLU nets and make some comparisons to highlight our novelty. According to the classical bias-variance trade-off principle, the over-parameterized setting makes the deep net model so flexible that its global minima suffer from the well-known over-fitting phenomenon [19], in the sense that they fit the training data perfectly but fail to predict new query points. Surprisingly, numerical evidence [59] showed that such over-fitting may not occur. This interesting phenomenon leads to a rethinking of modern machine-learning practice and the bias-variance trade-off.
The interesting result in [9] is, to the best of our knowledge, the first work to theoretically study the generalization performance of interpolation methods. In [9], multivariate triangulations are constructed to interpolate data, and the generalization error bounds are exhibited as a function of the dimension, showing that the interpolation method possesses good generalization performance when the dimension is large. After imposing certain structural constraints on the covariance matrices, [26, 7] derived tight generalization error bounds for over-parameterized linear regression. In [45], the authors revealed several quantitative relations between linear interpolants and the structures of covariance matrices, and then provided a hybrid interpolating scheme whose generalization error is rate-optimal for sparse linear models with noise. Motivated by these results, [36] proved that kernel ridgeless least squares possesses good generalization performance for high-dimensional data, provided the distribution satisfies certain restrictions. In [38], the generalization error of kernel ridgeless least squares was bounded by means of differences of kernel-based integral operators.
It should be mentioned that, in the existing literature, strict restrictions concerning the dimensionality, the structures of covariance matrices and the marginal distributions are imposed to guarantee the good generalization performance of interpolation methods. Indeed, it was proved in [31] that some restriction on the marginal distribution is necessary, without which there is a distribution such that all interpolation methods may perform extremely badly (see Lemma 2). However, the high-dimensionality assumption and the structural constraints on covariance matrices are removable, since [9] has already constructed a piecewise interpolation based on the well-known Nadaraya-Watson estimator and derived optimal generalization error rates without any restrictions on the dimension and covariance matrices.
Compared with the above-mentioned existing work, there are mainly three novelties in our results. First, we aim to explain the over-fitting resistance phenomenon for deep learning rather than for linear algorithms such as linear regression and kernel regression. Due to the nonlinear nature of (3), we rigorously prove the existence of both bad global minima and perfect ones. Therefore, it is almost impossible to derive, for deep ReLU nets, results like those for linear models [12, 45, 36] that determine which conditions are sufficient to guarantee the perfect generalization performance of all global minima. Furthermore, our theoretical results coincide with the experimental phenomenon that global minima (with zero training error) with different parameters frequently behave totally differently in generalization. Second, our results present essential advantages of running ERM on over-parameterized deep ReLU nets by proving the existence of deep ReLU nets possessing almost optimal generalization error rates, which is beyond the capability of shallow learning. Finally, our results are established under mild conditions on distributions and without any restrictions on the dimension of the input space.
Another related work is [44], where the authors discussed the generalization performance of interpolation methods based on histograms and also established the existence of bad and good interpolating neural networks. The main arguments of [44] and our paper are similar: there are global minima of ERM on deep ReLU nets that can avoid over-fitting. The main differences are as follows: 1) It is well known that an approximant or learner based on histograms suffers from the well-known saturation problem, in the sense that the approximation or learning rate cannot be improved further once the regularity (or smoothness) of the regression function reaches a certain level [58]. In particular, as shown in [58, 17], deep ReLU nets with two hidden layers can provide localized approximation but have difficulty approximating very smooth functions. In our paper, we avoid this saturation phenomenon by deepening the networks and thus break through the bottleneck of the analysis in [44]. 2) We provide detailed structures of deep ReLU nets and derive the quantitative requirement on the number of free parameters to guarantee the existence of global minima of ERM on deep ReLU nets, which is different from [44]. 3) More importantly, we are devoted to answering Problem 1 by establishing the optimal generalization error rates and the power of depth of some global minima of ERM on deep ReLU networks. In particular, using the deepening approach, we prove that there exist infinitely many global minima of ERM on over-parameterized deep ReLU nets that perform almost the same as under-parameterized deep ReLU nets.
In summary, we provide an affirmative answer to Problem 1 by presenting several examples of perfect global minima of (3). It would be interesting to study the distribution of these perfect global minima and to design feasible schemes to find them. We will keep studying this topic and consider these two more challenging problems in future work.
V Numerical Examples
In this section, we conduct numerical simulations to support our theoretical assertions on the existence of benign over-fitting when running ERM on over-parameterized deep ReLU nets. There are mainly four purposes of our simulations. In the first simulation, we aim to show the relation between the generalization performance of global minima of (3) and the number of parameters (or width) of deep ReLU nets. In the second one, we verify the over-fitting resistance of (3) by showing the relation between the generalization error and the number of training iterations (epochs). In the third one, we show the existence of good and bad global minima of (3). Finally, we compare our learned global minima with some widely used learning schemes to show the learning performance of (3) on over-parameterized deep ReLU nets.
[Figure 1: Training (left) and testing (right) RMSE versus network width for deep ReLU nets of different depths.]
For these purposes, we adopt fully connected ReLU neural networks with hidden layers and neurons in each layer. In all simulations, we set and . We use the well-known Adam optimization algorithm on deep ReLU nets with a constant step-size of and with the default PyTorch initialization. Unless otherwise stated, the training is stopped after iterations.
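To make this setup concrete, the following PyTorch sketch builds such a DFCN and minimizes the empirical risk (3) with Adam. The code is an illustrative sketch of our own: the depth, width, step-size and iteration count used below (4, 2000, 1e-3, 10000) are placeholders consistent with the simulations described later, not values prescribed by the text.

```python
import torch
import torch.nn as nn

def make_dfcn(in_dim, depth=4, width=2000):
    """Fully connected ReLU net: `depth` hidden layers of equal width, scalar output."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

def train_erm(X, y, depth=4, width=2000, lr=1e-3, iters=10000):
    """Minimize the empirical risk (3) with Adam until (near-)interpolation."""
    net = make_dfcn(X.shape[1], depth, width)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(iters):
        opt.zero_grad()
        loss = loss_fn(net(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return net

# Usage: X is an (n, d) float tensor, y an (n,) float tensor.
# net = train_erm(X, y)
```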
We report the results of our numerical simulations on two real-world datasets. The first one is the Wine Quality dataset from the UCI database. The Wine Quality dataset is related to red and white variants of the Portuguese “Vinho Verde” wine with red and white examples. We select the white wine for our experiments. There are attributes in the data set: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality (score between and ). Therefore, it can be viewed as a regression task on input data of dimension . Regarding the preferences, each sample was evaluated by a minimum of three sensory assessors (using blind tastes), who graded the wine on a scale ranging from (very bad) to (excellent). We sample data points as our training set and for testing.
The second dataset is MNIST. MNIST dataset is widely used in classification tasks. Here we follow [36] to create a regression task using MNIST. MNIST inlcudes 70000 samples in total. Each sample includes a 28*28 dimensional feature and a target representing a digit ranging from 0 to 9. We randomly pick 291 samples with targeting digits equaling to 0 or 1, and separate 221 samples for training and 70 for testing. We label digit 0 as -1 and digit 1 as 1. Then we get a dataset with a dimensional feature and a label -1 or 1. The regression task is then built.
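A possible realization of this construction is sketched below; the loading code is our own (it uses torchvision and, for simplicity, draws from the 60000-image training split), and only the selection logic (digits 0/1, 291 samples, 221/70 split, labels ±1) follows the text.

```python
import torch
from torchvision import datasets

def mnist_binary_regression(n_total=291, n_train=221, seed=0):
    """Pick n_total MNIST samples whose digit is 0 or 1, relabel them as -1/+1,
    and split into training and test parts (221/70 in the setting above)."""
    ds = datasets.MNIST(root="./data", train=True, download=True)
    X = ds.data.reshape(-1, 28 * 28).float() / 255.0    # flatten to 784-dim features
    y = ds.targets
    keep = (y == 0) | (y == 1)
    X, y = X[keep], y[keep]
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(X.shape[0], generator=g)[:n_total]
    X, y = X[idx], y[idx].float() * 2.0 - 1.0            # digit 0 -> -1, digit 1 -> +1
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])
```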
Simulation 1. In this simulation, we study the relation between the test RMSE (root mean squared error) and the widths of deep ReLU nets with . Our results are recorded over 5 independent trials. The solid line is the mean value from these trials, and the shaded part indicates the deviation. The numerical results are reported in Figure 1.
Figure 1 demonstrates the existence of perfect global minima by exhibiting the relation between training and testing RMSE and the number of free parameters. From Figure 1, we obtain the following four observations: 1) The left figure shows that neural networks with more hidden layers find it easier to produce exact interpolations of the training data. This coincides with the common consensus, since more hidden layers with the same width involve many more free parameters; 2) For each depth, it can be found in the right figure that the testing curves exhibit an approximate double descent phenomenon, as declared in [12] for linear models. It should be highlighted that such a phenomenon does not always occur in deep ReLU net training, and we only choose a good trend from several trials to demonstrate the existence of perfect global minima; 3) As the width (or capacity of the hypothesis space) increases, it can be found in the right figure that the testing error does not increase, exhibiting a phenomenon totally different from the classical bias-variance trade-off. This shows that for over-parameterized deep ReLU nets, there exist good global minima of (3), provided the depth is appropriately selected; 4) It can be found that deeper ReLU nets perform better in generalization, which demonstrates the power of depth in tackling the Wine Quality data. All these observations verify our theoretical assertions in Section III and show that there exist perfect global minima of (3) that realize the power of deep ReLU nets.
Simulation 2. In this simulation, we numerically study the role of the number of iterations (epochs) in solving (3) in both the under-parameterized and over-parameterized settings. We run ERM on DFCNs with depth 4 and width on the Wine dataset and the MNIST dataset. Since the number of training data is 3265 (221 for the MNIST dataset, resp.), it is easy to check that deep ReLU nets with depth 4 and widths 2 and 40 are under-parameterized, while those with depth 4 and width 2000 are over-parameterized. The numerical results are reported in Figure 2.
[Figure 2: Training (left) and testing (right) RMSE versus the number of epochs for under- and over-parameterized DFCNs on the Wine Quality and MNIST datasets.]
There are also three interesting findings exhibited in Figure 2: 1) For under-parameterized ReLU nets, it is almost impossible to produce a global minimum acting as an exact interpolation of the data. However, for over-parameterized deep ReLU nets, running Adam with sufficiently many epochs attains zero training error. Furthermore, after a specific value, the number of iterations does not affect the training error. This means that Adam converges to a global minimum of (3) on over-parameterized deep ReLU nets; 2) The testing error for under-parameterized ReLU nets, exhibited in the right figure, behaves according to the classical bias-variance trade-off principle, in the sense that the error first decreases with respect to the epoch and then increases after a certain number of epochs. Therefore, early stopping is necessary to guarantee good performance in this setting; 3) The testing error for over-parameterized ReLU nets is always non-increasing with respect to the epoch. This shows the over-fitting resistance of deep ReLU net training and also verifies the existence of perfect global minima of (3) on over-parameterized deep ReLU nets. It should be highlighted that the numerical result presented in Figure 2 is also a single trial selected from numerous runs, since we are concerned with the existence of perfect global minima. In fact, there are also numerous examples of bad global minima of (3).
Simulation 3. In this simulation, we show that although there exist perfect global minima in the over-parameterized setting, bad global minima can also be found sometimes. We test the performance of deep ReLU nets with depth 4 and width 2000 on the Wine dataset and the MNIST dataset. It takes different numbers of steps to converge to a good training performance on these two datasets. We trigger several runs with different learning rates and network parameter initializations, and pick good and bad global minima from two trials respectively. We report the numerical results in Figure 3.
[Figure 3: Training and testing curves of good and bad global minima obtained from different runs on the Wine Quality and MNIST datasets.]
From Figure 3, we find that different global minima perform totally differently in generalization, even though the training losses both reach 0. In particular, the testing errors of bad global minima can be much larger than those of good global minima. It should be mentioned that the bad interpolants in the above simulations are also derived from Adam. Therefore, the orders of their testing errors are comparable with those of good interpolants. We highlight that this is due to the implementation of the Adam algorithm rather than the model (3). As far as the model is concerned, our next simulation shows that the testing errors of bad interpolants can also be of much larger order than those of good ones.
Simulation 4. In this simulation, we compare (3) on over-parameterized deep ReLU nets with some standard learning algorithms, including ridge regression (Ridge), support vector regression (SVR), kernel interpolation (KIR) and kernel ridge regression (KRR), to show that the numerical phenomenon exhibited in the previous figures is not built upon sacrificing generalization performance. For the sake of fairness, we test the various models under the same conditions to the best of our efforts by tuning hyper-parameters. In particular, we implement the reference methods using the standard scikit-learn package. In the experiment on the Wine dataset, specifically, we use ridge regression with the regularization parameter being . In KRR, we use the Gaussian kernel with width being and regularization parameter being . KIR uses the Gaussian kernel with width being and regularization parameter being 0. SVR keeps the default sklearn hyper-parameters. In the experiment on the MNIST dataset, we change the regularization parameter in ridge regression to 10 and keep other parameters fixed. In this simulation, we also use deep ReLU nets with depth and width and conduct 5 trials to record the average training and testing RMSE.
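For reference, the baselines can be set up with the standard scikit-learn estimators as sketched below. The kernel widths and regularization parameters shown are placeholders (the exact values used above are not all specified), and kernel interpolation is realized by a kernel ridge regressor whose regularization parameter is set to a numerically negligible value.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge

def rmse(model, X, y):
    return float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))

def run_baselines(X_tr, y_tr, X_te, y_te, alpha_ridge=1.0, gamma=0.1, alpha_krr=0.1):
    models = {
        "Ridge": Ridge(alpha=alpha_ridge),
        "SVR": SVR(),  # default scikit-learn hyper-parameters, as in the text
        "KRR": KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha_krr),
        # "regularization parameter 0": use a numerically negligible alpha instead
        "Kernel interpolation": KernelRidge(kernel="rbf", gamma=gamma, alpha=1e-12),
    }
    for name, m in models.items():
        m.fit(X_tr, y_tr)
        print(f"{name}: train RMSE {rmse(m, X_tr, y_tr):.3f}, "
              f"test RMSE {rmse(m, X_te, y_te):.3f}")
```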
In addition, we construct a ReLU net that achieves 0 training error but performs extremely badly on the test set, to show that there really exist extremely bad interpolants, just as Proposition 1 illustrated. We introduce some notation first. Denote . Define , where is the ReLU activation function and is a parameter which we set to be small enough ( in this simulation). A feature input is expressed as . We call the corresponding output of the constructed net (CN) . Note that we drop duplicated samples when implementing CN. CN is constructed from localized bumps around the training inputs, as sketched below.
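A minimal sketch of the idea behind CN is given here, using our own notation: each training input is surrounded by a narrow trapezoid-shaped bump built from ReLU units, the bumps are combined across coordinates by a product (which the actual construction realizes with the product-gate of deep ReLU nets), and the bump heights are the labels, so the net interpolates the data yet is essentially zero elsewhere. The bump half-width delta below is a placeholder; the precise construction and parameter choices are those in the proof of Proposition 1.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def bump(t, delta):
    """Trapezoid-shaped 1-D bump built from four ReLU units: equals 1 at t = 0
    and vanishes for |t| >= 2 * delta."""
    return (relu(t + 2 * delta) - relu(t + delta)
            - relu(t - delta) + relu(t - 2 * delta)) / delta

def cn_predict(x, X_train, y_train, delta=1e-3):
    """Bad interpolant: one localized bump per (deduplicated) training input,
    scaled by its label. Assumes distinct training inputs differ by more than
    2*delta in at least one coordinate, so cn_predict(x_i) = y_i while the
    prediction is ~0 away from the training data."""
    bumps = np.prod(bump(x[None, :] - X_train, delta), axis=1)
    return float(np.sum(y_train * bumps))
```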
More details of the construction of CN can be found in the proof of Proposition 1. We introduce CN in this simulation to show that, as an interpolant, it performs quite poorly, confirming that there are extremely bad global minima of (3). The numerical results are reported in Table I.
Wine Quality Data

| Methods | Train RMSE | Test RMSE |
|---|---|---|
| Ridge | 0.534 | 0.735 |
| Kernel interpolation | 0.000 | 13.031 |
| KRR | 0.668 | 0.706 |
| SVR | 0.628 | 0.696 |
| 4-hidden-layer DFCN (good case) | 0.000 | 0.668 |
| CN (bad case) | 0.000 | 5.931 |

MNIST Data

| Methods | Train RMSE | Test RMSE |
|---|---|---|
| Ridge | 0.056 | 0.304 |
| Kernel interpolation | 0.000 | 0.135 |
| KRR | 0.031 | 0.140 |
| SVR | 0.073 | 0.154 |
| 4-hidden-layer DFCN (good case) | 0.000 | 0.097 |
| CN (bad case) | 0.000 | 1.000 |
There are four interesting observations in Table I: 1) Learning schemes such as SVR, KRR and Ridge perform stably in both the high-dimensional application and the low-dimensional simulation. The main reason is that a regularization term is introduced to balance the bias and variance in these schemes. As a result, the training errors of these schemes are always non-zero; 2) Kernel interpolation performs well in the high-dimensional application but fails to generalize well in the low-dimensional simulation. The main reason is that if is large, then the separation radius is large [36, 38], which in turn implies that the condition number of the kernel matrix is relatively small, making kernel interpolation perform well. However, if is small, the condition number of the kernel matrix is usually extremely large, making the prediction unstable; 3) There exist deep ReLU nets exactly interpolating the training data, leading to zero training error, but possessing excellent generalization capability and yielding a small testing error, implying that the obtained estimator is a benign over-fitter for the data. Furthermore, it is shown in the table that the testing error of over-parameterized deep ReLU nets is the smallest, demonstrating the power of depth as declared in our theoretical assertions in Section III; 4) There also exist deep ReLU nets interpolating the data but performing extremely badly in generalization, in both the high-dimensional application and the low-dimensional simulation. All these findings verify our theoretical assertions that there are good global minima for ERM on over-parameterized deep ReLU nets but not all global minima are good.
VI Proofs
In this section, we aim at proving our results stated in Section II and Section III. The main novelty of our proof is a deepening scheme that produces an over-parameterized deep ReLU net (student network) based on a specific under-parameterized one (teacher network) so that the student network exactly interpolates the training data and possesses almost the same generalization performance as the teacher network.
VI-A Deepening scheme for ReLU nets
Given a teacher network , the deepening scheme deepens and widens it to produce a student network that exactly interpolates the given data and possesses almost the same generalization performance as the teacher. The following theorem presents the deepening scheme used in our analysis.
Theorem 6
Let be any deep ReLU net with layers, free parameters and width not larger than , satisfying for some . If with , then for any , there exist infinitely many DFCNs of depth and width such that
(21)
and
(22)
where is a constant depending only on .
The deepening scheme developed in Theorem 6 implies that all deep ReLU nets that have been verified to possess good generalization performances in the under-parameterized setting [29, 53, 38, 25] can be deepened to corresponding deep ReLU nets in the over-parameterized setting such that the deepened networks exactly interpolate the given data and possess good generalization error bounds.
The main tools for the proof of Theorem 6 are the localized approximation property of deep ReLU nets developed in [17] and the product gate property of deep ReLU nets proved in [58]. Let us introduce the first tool as follows. For with , define a trapezoid-shaped function with a parameter as
(23)
We consider
(24)
The following lemma proved in [17] presents the localized approximation property of .
Lemma 3
Let , and be defined by (24). Then we have for all and
(25)
The second tool, as shown in the following lemma, presents the product-gate property of deep ReLU nets [58].
Lemma 4
For any and , there exists a DFCN with ReLU activation functions with depth, width, and free parameters bounded by for some such that
and
With the above tools, we can prove Theorem 6 as follows.
Proof:
Let be given in Lemma 3 and in Lemma 4 with . Then it follows from (25) that
(26)
Since , we can define a function on by
(27)
If , then for any , we have from (26) that Noting further , we have for any that and
(28)
Moreover, for and any ,
implies Hence . Therefore, Lemma 4 yields . This implies
(29)
Define further a function on by
It follows from Lemma 4 that
(30)
If for all , then it follows from (26) that , which implies . Hence,
This implies
The above estimate together with (30) yields
Set and . We obtain
(31)
where . Denote by . Recalling (28), we can define
with and as above so that the two items on the righthand side of have the same depth. Then is a DFCN of depth and width . Noting further , we then have . This together with (31) yields
Recalling that there are infinitely many satisfying , there are infinitely many such . Theorem 6 is then proved by scaling. ∎
VI-B Proofs
In this part, we prove our main results by using the proposed deepening scheme (for results concerning noisy data) and a functional analysis approach developed in [46] (for results concerning noiseless data).
Proof:
If , , we can set . Then our conclusion naturally holds. Otherwise, we define
(32)
If , then it follows from (28) that . Since , a direct computation yields
But (26) together with (5) and for any yields
Therefore, for
(33)
we have , which yields
Note further that is a DFCN with 2 hidden layers with and . Since , we can define iteratively by satisfying , and
Then . Recalling the construction in (32), different corresponds to different neural networks, and the above results hold for all satisfying (33). Therefore, there are infinitely many deep ReLU nets of the form (32). This completes the proof of Proposition 1. ∎
In the following, we construct some real-valued functions to feed and derive a deep-net-based linear space possessing good approximation properties. For , define
(34)
Then
(35)
For , , and define
(36)
where
(37)
It is easy to see that for arbitrarily fixed , is a linear space of dimension at most . Each element in is a DFCN with depth , and . The approximation capability of the constructed linear space was deduced in [25, Theorem 2] or [58].
Lemma 5
Let and . If and with , then there holds
(38)
where is a constant depending only on and .
To prove Theorem 2, we need the following lemma proved in [46]. It should be mentioned that the above assertions obviously hold for DFCNs with larger depth and width. We use the following functional analysis tool, which presents a close relation between interpolation and approximation, to minimize the depth.
Lemma 6
Let be a (possibly complex) Banach space, a subspace of , and a finite-dimensional subspace of , the dual of . If for every and some , independent of ,
then for any there exists such that interpolates on ; that is, for all . In addition, approximates in the sense that .
To use the above lemma, we need to construct a special function to facilitate the proof. Our construction is motivated by [46]. For any , define
(39)
where is the point evaluation operator and is the sign function satisfying for and for . Then it is easy to see that is a continuous function. In the following, we present three important properties of .
Lemma 7
Let . Then for any , there holds (i) (ii) (iii) .
Proof:
Denote , where is the ball with center and radius . Then it follows from the definition of that , where and denotes the boundary of . Without loss of generality, we assume . From (39), we have for . If there exist some such that , then
So
Thus, for all . Since , , we get , which verifies (i). For , we have
Thus (ii) holds. The remainder is to prove that satisfies (iii). We divide the proof into four cases.
If for some , then it follows from (39) that
If , then the definition of yields , which implies
If for , then it is easy to see that for any , , there holds . Let be the intersections of the line segment and , and the line segment and , respectively. Then, we have . Since , and , we have
If for some and , then we take to be the intersection of and the line segment . Then and
Combining all the above cases verifies . This completes the proof of Lemma 7. ∎
With the above tools, we are in a position to prove Theorem 2.
Proof:
Let , the space of continuous functions defined on , and in Lemma 6. For every , we have for some . Without loss of generality, we assume . Let be defined by (39). Then, it follows from Lemma 5 and Lemma 7 that there is some such that
Let for some . We have
This together with (i) in Lemma 7 yields
Since is a linear operator and (ii) in Lemma 7 holds, there holds
Hence, from
we have
Consequently,
Setting , for any , it follows from Lemma 6 and Lemma 5 that there exists some such that and
Setting and recalling (37) and , defined in (37) can be regarded as the set of DFCNs with depth , and . Noting that there are infinitely many , there are infinitely many DFCNs satisfying the above assertions. This completes the proof of Theorem 2. ∎
The proofs of the other main results are simple by combining Theorem 6 with existing results.
Proof:
Proof:
Acknowledgement
The work of S. B. Lin is supported partially by the National Key R&D Program of China (No.2020YFA0713900) and the National Natural Science Foundation of China (No.62276209). The work of Y. Wang is supported partially by the National Natural Science Foundation of China (No.11971374). The work of D. X. Zhou is supported partially by the NSFC/RGC Joint Research Scheme [RGC Project No. N-CityU102/20 and NSFC Project No. 12061160462], Germany/Hong Kong Joint Research Scheme [Project No. G-CityU101/20], and the Laboratory for AI-Powered Financial Technologies.
References
- [1] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. ICML, 2019.
- [2] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparametrized neural networks, going beyond two layers. arXiv:1811.04918, 2018.
- [3] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
- [4] A. R. Barron, A. Cohen, W. Dahmen and R. Devore. Approximation and learning by greedy algorithms. Ann. Statist., 36: 64-94, 2008.
- [5] P. L. Bartlett, N. Harvey, C. Liaw and A. Mehrabian. Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. J. Mach. Learn. Res., 20(63): 1-17, 2019.
- [6] P. L. Bartlett and P. M. Long. Failures of model-dependent generalization bounds for least-norm interpolation. J. Mach. Learn. Res., 22 (204): 1-15, 2021.
- [7] P. L. Bartlett, P. M. Long, G. Lugosi and A. Tsigler. Benign overfitting in linear regression. Proc. Nat. Acad. Sci. USA, 117 (48): 30063-30070, 2020.
- [8] B. Bauer and M. Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4): 2261-2285, 2019.
- [9] M. Belkin. Approximation beats concentration? An approximation view on inference with smooth radial kernels. COLT 2018: 1348-1361.
- [10] M. Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. arXiv:2105.14368, 2021.
- [11] M. Belkin, D. Hsu and P. Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. NIPS 2018: 2300-2311.
- [12] M. Belkin, D. Hsu, S. Ma and S. Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Nat. Acad. Sci. USA, 116 (32): 15849-15854, 2019.
- [13] Y. Bengio, A. Courville and P. Vincent. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intel., 35: 1798–1828, 2013.
- [14] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. NIPS 2019: 10835-10845.
- [15] A. Caponnetto and E. DeVito. Optimal rates for the regularized least squares algorithm. Found. Comput. Math., 7: 331-368, 2007.
- [16] C. K. Chui, X. Li and H. N. Mhaskar. Neural networks for localized approximation. Math. Comput., 63: 607-623, 1994.
- [17] C. K. Chui, S. B. Lin, B. Zhang and D. X. Zhou. Realization of spatial sparseness by deep ReLU nets with massive data. IEEE Trans. Neural Netw. Learn. Syst., In Press, 2020.
- [18] C. K. Chui, S. B. Lin and D. X. Zhou, Deep neural networks for rotation-invariance approximation and learning. Anal. Appl., 17(3): 737-772, 2019.
- [19] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.
- [20] S. Du, J. Lee, Y. Tian, B. Poczós and A. Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. ICML 2018: 1339–1348.
- [21] S. Du, X. Zhai, B. Poczós and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. ICLR 2019.
- [22] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. The MIT Press, 2016.
- [23] Z. C. Guo, L. Shi and S. B. Lin. Realizing data features by deep nets. IEEE Trans. Neural Netw. Learn. Syst., In Press, 2019.
- [24] L. Györfi, M. Kohler, A. Krzyżak and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, Berlin, 2002.
- [25] Z. Han, S. Yu, S. B. Lin, and D. X. Zhou. Depth selection for deep ReLU nets in feature extraction and generalization. IEEE Trans. on Pattern Anal. Mach. Intel., In Press.
- [26] T. Hastie, A. Montanari, S. Rosset and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560.
- [27] G. E. Hinton, S. Osindero and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7): 1527–1554, 2006.
- [28] J. Hong and P. R. Hoban. Writing more compelling creative appeals: a deep learning-based approach. Market. Sci., In Press.
- [29] M. Imaizumi and K. Fukumizu. Deep Neural Networks Learn Non-Smooth Functions Effectively. arXiv preprint arXiv:1802.04474, 2018.
- [30] M. Kohler and A. Krzyżak. Nonparametric regression based on hierarchical interaction models. IEEE Trans. Inf. Theory, 63: 1620–1630, 2017.
- [31] M. Kohler and A. Krzyżak. Over-parametrized deep neural networks do not generalize well. arXiv:1912.03925, 2019.
- [32] A. Krizhevsky, I. Sutskever and G. E. Hinton. Imagenet classification with deep convolutional neural networks. NIPS 2012: 1097–1105.
- [33] H. Lee, P. T. Pham, L. Yan and A. Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. NIPS 2009.
- [34] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. NIPS 2018: 8157-8166.
- [35] W. Li. Generalization error of minimum weighted norm and kernel interpolation. SIAM J. Math. Data Sci., 3(1): 414-438, 2021.
- [36] T. Liang and A. Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. Ann. Statist., 48: 1329–1347, 2020.
- [37] S. B. Lin. Generalization and expressivity for deep nets. IEEE Trans. Neural Netw. Learn. Syst. 30: 1392–1406, 2019.
- [38] S. B. Lin, X. Chang and X. Sun. Kernel interpolation of high dimensional scattered data. arXiv:2009.01514v2, 2020.
- [39] X. Liu, D. Wang and S. B. Lin. Construction of Deep ReLU Nets for Spatially Sparse Learning. IEEE Transactions on Neural Networks and Learning Systems, In Press.
- [40] V. Maiorov. Approximation by neural networks and learning theory. J. Complex., 22: 102-117, 2006.
- [41] S. Mei, A. Montanari and P. M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proc. Nat. Acad. Sci. USA, 115 (33): E7665-E7671.
- [42] S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Invent. Math., 125: 37-55, 2003.
- [43] H. N. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Anal. Appl., 14: 829–848, 2016.
- [44] N. Mücke and I. Steinwart. Empirical Risk Minimization in the Interpolating Regime with Application to Neural Network Learning. arXiv preprint arXiv:1905.10686.
- [45] V. Muthukumar, K. Vodrahalli, V. Subramanian and A. Saha. Harmless interpolation of noisy data in regression. IEEE J. Selected Areas Inf. Theory, 1 (1): 67-83.
- [46] F. J. Narcowich and J. D. Ward. Scattered-Data interpolation on : Error estimates for radial basis and band-limited functions. SIAM J. Math. Anal., 36: 284-300, 2004.
- [47] B. Neyshabur, S. Bhojanapalli and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv1707.09564v2, 2018.
- [48] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108: 296-330, 2018.
- [49] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8: 143-195, 1999.
- [50] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli and J. Sohl-Dickstein, On the expressive power of deep neural networks. arXiv: 1606.05336, 2016.
- [51] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Intern. J. Auto. Comput., DOI: 10.1007/s11633-017-1054-2, 2017.
- [52] M. Qi, Y. Shi, Y. Qi, C. Ma, R. Yuan, D. Wu and Z. J. M. Shen. A practical end-to-end inventory management model with deep learning. Management Sci., In Press.
- [53] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Statist., 48(4): 1875-1897, 2020.
- [54] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in uq. Anal. Appl., 17: 19–55, 2019.
- [55] U. Shaham, A. Cloninger and R. R. Coifman. Provable approximation properties for deep neural networks. Appl. Comput. Harmonic Anal., 44: 537–557, 2018.
- [56] L. Shi. Learning theory estimates for coefficient-based regularized regression. Appl. Comput. Harmonic Anal., 34: 252–265, 2013.
- [57] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. V. D. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam and M. Lanctot. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587): 484–489, 2016.
- [58] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94: 103-114, 2017.
- [59] C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv: 1611.03530, 2016.
- [60] D. X. Zhou. Deep distributed convolutional neural networks: Universality. Anal. Appl., 16: 895-919, 2018.
- [61] D. X. Zhou. Universality of deep convolutional neural networks. Appl. Comput. Harmonic. Anal., 48: 784-794, 2020
- [62] D. X. Zhou. Theory of deep convolutional neural networks: Downsampling. Neural Netw., 124: 319-327, 2020.