
Generalization Performance of Empirical Risk Minimization on Over-parameterized Deep ReLU Nets

Shao-Bo Lin, Yao Wang, and Ding-Xuan Zhou S. B. Lin and Y. Wang are with the Center for Intelligent Decision-Making and Machine Learning, School of Management, Xi’an Jiaotong University, Xi’an 710049, P R China. D. X. Zhou is with School of Mathematics and Statistics, University of Sydney, Sydney NSW 2006, Australia. The corresponding author is Y. Wang (email: [email protected]).
Abstract

In this paper, we study the generalization performance of global minima for implementing empirical risk minimization (ERM) on over-parameterized deep ReLU nets. Using a novel deepening scheme for deep ReLU nets, we rigorously prove that there exist perfect global minima achieving almost optimal generalization error rates for numerous types of data under mild conditions. Since over-parameterization is crucial to guarantee that the global minima of ERM on deep ReLU nets can be realized by the widely used stochastic gradient descent (SGD) algorithm, our results indeed fill a gap between optimization and generalization of deep learning.

Index Terms:
Deep learning, Empirical risk minimization, Global minima, Over-parameterization.

I Introduction

Deep learning [27], which conducts feature extraction and statistical modelling within a unified deep neural network (deep net) framework, has attracted enormous research activity in the past decade. It has made significant breakthroughs in numerous applications including computer vision [32], speech recognition [33] and the game of Go [57]. Simultaneously, it brings several challenges in understanding the running mechanism and magic behind deep learning, among which the generalization issue in statistics [59] and the convergence issue in optimization [2] are crucial.

The generalization issue pursues theoretical advantages of deep learning by demonstrating its better generalization capability compared with shallow learning such as kernel methods and shallow networks. Along with the rapid development of deep nets in approximation theory [16, 43, 58, 48, 55, 54, 18], essential progress on the excellent generalization performance of deep learning has been made in [29, 8, 37, 53, 17, 25]. Specifically, [53] proved that implementing empirical risk minimization (ERM) on deep ReLU nets is better than shallow learning in embodying the composite structure of the regression function [24]; [25] showed that implementing ERM on deep ReLU nets succeeds in improving the learning performance of shallow learning by capturing the group structure of inputs; and [17] derived that, with the help of massive data, implementing ERM on deep ReLU nets is capable of reflecting spatially sparse properties of the regression function while shallow learning fails. In a word, with an appropriately selected number of free parameters, the generalization issue of deep learning seems to be successfully settled in the sense that deep learning can achieve optimal learning rates for numerous types of data, which is beyond the capability of shallow learning.

The convergence issue concerns the convergence of popular algorithms such as stochastic gradient descent (SGD) and adaptive moment estimation (Adam) used to solve ERM on deep nets. Due to the highly nonconvex nature of the problem, it is generally difficult to find global minima of such ERM problems, since there are numerous local minima, saddle points, plateaus and even flat regions [22]. It is thus highly desirable to determine when and where the corresponding SGD or Adam converges. Unfortunately, in the under-parameterized setting exhibited in [29, 8, 37, 53, 17, 25], the convergence issue remains open. Alternatively, studies in optimization theory show that over-parameterizing deep ReLU nets enhances the convergence of SGD [21, 41, 2, 1]. In fact, it was proved in [2, 1] that SGD converges to either a global minimum or a local minimum near some global minimum of ERM, provided there are sufficiently many free parameters in the deep ReLU nets. Noting further that implementing ERM on over-parameterized deep ReLU nets commonly yields infinitely many global minima and usually leads to over-fitting, it is difficult to verify the good generalization performance of the obtained global minima in theory.

As mentioned above, there is an inconsistency between the generalization and convergence issues of deep learning in the sense that good generalization requires under-parameterization of deep ReLU nets while provable convergence needs over-parameterization, which keeps the running mechanism of deep learning a mystery. Such an inconsistency stimulates a rethinking of the classical bias-variance trade-off in modern machine-learning practice [12], since the numerical evidence in [59] illustrated that there are over-parameterized deep nets that generalize well despite achieving extremely small training error. In particular, [9] constructed an exact interpolation of the training data that achieves optimal generalization error rates based on Nadaraya-Watson kernel estimates; [7] derived a sufficient condition on the data under which the global minimum of over-parameterized linear regression possesses excellent generalization performance; [36] studied the generalization performance of kernel-based least norm interpolation and presented generalization error estimates under some restrictions on the data distribution. Similar results can also be found in [11, 26, 45, 38, 6] and references therein. All these exciting results provide a springboard to understand benign over-fitting in modern machine-learning practice and present novel insights into developing learning algorithms that avoid the traditional bias-variance dilemma. However, it should be pointed out that these existing results are incapable of settling the generalization and convergence challenges of deep learning, mainly due to the following three aspects:

\bullet Difference in model: the existing theoretical analysis for benign over-fitting is only available to convex linear models, but ERM on deep ReLU nets involves highly nonconvex nonlinear models.

\bullet Difference in theory: the existing results for benign over-fitting focus on pursuing restrictions on data distributions under which all global minima of over-parameterized linear models are good learners, which does not hold for deep learning, since it is easy to provide a counterexample for benign over-fitting of deep ReLU nets, even for noiseless data (see Proposition 1 below).

\bullet Difference in requirements of dimension: Theoretical analysis in the existing work frequently requires the high dimensionality assumption of the input space, while the numerical experiments in [59] showed that deep learning can lead to benign over-fitting in both high and low dimensional input spaces.

Based on these three interesting observations, there naturally arises the following problem:

Problem 1

Without strict restrictions on data distributions and dimensions of input spaces, are there global minima of ERM on over-parameterized deep ReLU nets achieving the optimal generalization error rates obtained by under-parameterized deep ReLU nets?

There are roughly two schemes to settle the inconsistency between optimization and generalization for deep learning. One is to pursue convergence guarantees for SGD on under-parameterized deep ReLU nets and the other is to study the generalization performance of implementing ERM on over-parameterized deep ReLU nets. To the best of our knowledge, there is no theoretical analysis for the former, even when strict restrictions are imposed on the data distributions. An answer to Problem 1, a stepping-stone to demonstrate the feasibility of the latter, not only provides solid theoretical evidence for the benign over-fitting phenomenon of deep learning in practice [59], but also presents theoretical guidance on setting network structures to reconcile generalization and optimization.

The aim of the present paper is to present an affirmative answer to Problem 1. Our main tool for analysis is a novel network deepening approach based on the localized approximation property [16, 17] and product-gate property [58, 48] of deep nets. The network deepening approach succeeds in constructing over-parameterized deep nets (student network) via deepening and widening an arbitrary under-parameterized network (teacher network) so that the obtained student network exactly interpolates the training data and possesses almost the same generalization capability as the teacher network. In this way, setting the teacher network to be the one in [29, 8, 37, 53, 17, 25], we actually prove that there are global minima for ERM on over-parameterized deep ReLU nets that possess optimal generalization error rates, provided that the networks are deeper and wider than the corresponding student network.

The main contributions of the paper are twofold. Since the presence of noise is a crucial factor in causing over-fitting, and it is not difficult to design learning algorithms with good generalization performance that produce a perfect fit of noiseless data [34, 14], our first result studies the generalization capability of deep ReLU nets that exactly interpolate noiseless data. In particular, we construct a deep ReLU net that exactly interpolates the noiseless data but performs extremely badly in generalization, and we also prove that for deep ReLU nets with more than two hidden layers, there always exist global minima of ERM that generalize extremely well, provided the number of free parameters reaches a certain level and the data are noiseless. This is different from the linear models studied in [9, 11, 12, 26, 36, 45, 38, 6] and shows the difficulty in analyzing over-parameterized deep ReLU nets. Our second result, more importantly, focuses on the existence of perfect global minima of ERM on over-parameterized deep ReLU nets that achieve almost optimal generalization error rates for numerous types of data. Using the network deepening approach, we rigorously prove that, if the depth and width of a deep net are larger than specific values, then there always exists such a perfect global minimum. This finding partly explains the benign over-fitting phenomenon of deep learning and shows that implementing ERM on over-parameterized deep nets can yield an estimator of high quality. Different from the existing results, our analysis requires neither high dimensionality of the input space nor strong restrictions on the covariance matrix of the input data.

The rest of this paper is organized as follows. In the next section, after introducing the deep ReLU nets, we provide theoretical guarantee for the existence of good deep nets interpolant for noiseless data. In Section III, we present our main results via rigorously proving the existence of perfect global minima. In Section IV, we compare our results with related work and present some discussions. In Section V, we conduct numerical experiments to verify our theoretical assertions. We prove our results in the last section.

II Global Minima of ERM on Over-Parameterized Deep Nets for Noiseless Data

Let $L\in\mathbb{N}$ be the depth of a deep net, $d_0=d$, and let $d_\ell\in\mathbb{N}$ be the width of the $\ell$-th hidden layer for $\ell=1,\dots,L$. Denote by $\mathcal{J}_\ell:\mathbb{R}^{d_{\ell-1}}\rightarrow\mathbb{R}^{d_\ell}$ the affine operator $\mathcal{J}_\ell(x):=W_\ell x+b_\ell$ with a $d_\ell\times d_{\ell-1}$ weight matrix $W_\ell$ and a bias vector $b_\ell\in\mathbb{R}^{d_\ell}$. For the ReLU function $\sigma(t)=t_+:=\max\{t,0\}$, write $\sigma(x)=(\sigma(x^{(1)}),\dots,\sigma(x^{(d)}))^T$ for $x=(x^{(1)},\dots,x^{(d)})^T$. Define an $L$-layer deep ReLU net by

\mathcal{N}_{d_1,\dots,d_L}(x)=a\cdot\sigma\circ\mathcal{J}_L\circ\sigma\circ\mathcal{J}_{L-1}\circ\dots\circ\sigma\circ\mathcal{J}_1(x), (1)

where $a\in\mathbb{R}^{d_L}$. The structure of $\mathcal{N}_{d_1,\dots,d_L}$ is determined by the weight matrices $W_\ell$ and bias vectors $b_\ell$, $\ell=1,\dots,L$. In particular, full weight matrices correspond to deep fully connected nets (DFCN) [58]; sparse weight matrices are associated with deep sparsely connected nets (DSCN) [48]; and Toeplitz-type weight matrices are related to deep convolutional neural networks (DCNN) [60]. For DFCN, the number of trainable parameters is

n=d_L+\sum_{\ell=1}^{L}(d_{\ell-1}d_\ell+d_\ell) (2)

and for DSCN, $n$ is much smaller than the number in (2). In this paper, we mainly focus on analyzing the benign over-fitting phenomenon for DFCN. Denote by $\mathcal{N}_{d_1,\dots,d_L}^{DFCN}$ the set of all DFCNs of the form (1). Define the width of a DFCN to be $U:=\max\{d,d_1,\dots,d_L\}$.
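As a concrete sanity check of (2), the following minimal PyTorch sketch builds a DFCN of the form (1) and compares its trainable-parameter count with the formula; the helper name dfcn is ours, and the depth-4, width-2000, 11-dimensional configuration merely mirrors the Wine experiments of Section V.

```python
import torch.nn as nn

def dfcn(d, widths):
    """DFCN of the form (1): input dimension d, hidden widths d_1, ..., d_L, scalar output."""
    layers, prev = [], d
    for w in widths:
        layers += [nn.Linear(prev, w), nn.ReLU()]
        prev = w
    layers.append(nn.Linear(prev, 1, bias=False))   # outer weights a in R^{d_L}
    return nn.Sequential(*layers)

d, widths = 11, [2000, 2000, 2000, 2000]
net = dfcn(d, widths)
n_counted = sum(p.numel() for p in net.parameters())
# formula (2) with d_0 = d: n = d_L + sum_{l=1}^{L} (d_{l-1} d_l + d_l)
dims = [d] + widths
n_formula = widths[-1] + sum(dims[l - 1] * dims[l] + dims[l] for l in range(1, len(dims)))
assert n_counted == n_formula
print(n_formula)
```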

Given a sample set $D=\{(x_i,y_i)\}_{i=1}^m$, we are interested in global minima of the following empirical risk minimization (ERM) problem:

{\arg\min}_{f\in\mathcal{N}_{d_1,\dots,d_L}^{DFCN}}\frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2. (3)

Denote by $\Psi_{d_1,\dots,d_L,m}$ the set of global minima of the optimization problem (3), i.e.,

\Psi_{d_1,\dots,d_L,m}:=\{f: f\ \mbox{is a solution to}\ (3)\}. (4)

Before studying the quality of the global minima of (3), we present some properties of $\Psi_{d_1,\dots,d_L,m}$ in the following lemma.

Lemma 1

If $d_1\geq m$, then for any $L\geq 1$ and $d_2,\dots,d_L\geq 2$, there are infinitely many functions in $\Psi_{d_1,\dots,d_L,m}$, and for any $f\in\Psi_{d_1,\dots,d_L,m}$ there holds $f(x_i)=y_i$, $i=1,\dots,m$.

Lemma 1 is a direct extension of [49, Theorem 5.1], where a similar conclusion is drawn for $\mathcal{N}_{d_1}^{DFCN}$ with $d_1\geq m$. Based on [49, Theorem 5.1], the proof of Lemma 1 is straightforward by noting $t=\sigma(t)-\sigma(-t)$ and $d_2,\dots,d_L\geq 2$. Lemma 1 implies that if $m$ is sufficiently large and the network structure satisfies $d_1\geq m$, then there are always infinitely many solutions to (3) and any global minimum $f$ exactly interpolates the given data. It should be mentioned that similar results also hold for any DSCN that contains a shallow net with $m$ neurons. Since we place particular emphasis on DFCN, we leave the corresponding assertions for DSCN to interested readers.
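The interpolation property behind Lemma 1 can be checked constructively: with $d_1\geq m$ neurons in the first hidden layer, one can project the inputs onto a generic direction, place one ReLU unit between consecutive projected values, and solve a triangular linear system for the outer weights. The sketch below is a standard construction in the spirit of [49]; the direction w, the bias placement and all helper names are our own choices. It exactly interpolates m arbitrary targets with a width-m shallow ReLU net.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 11
X = rng.uniform(-1, 1, size=(m, d))             # distinct inputs x_1, ..., x_m in I^d
y = rng.normal(size=m)                          # arbitrary targets y_1, ..., y_m

w = rng.normal(size=d)                          # generic direction: projections are a.s. distinct
t = X @ w
order = np.argsort(t)
X, y, t = X[order], y[order], t[order]

# the j-th ReLU unit sigma(w . x - b_j) switches on just before the j-th smallest projection
b = np.concatenate(([t[0] - 1.0], (t[:-1] + t[1:]) / 2.0))
Phi = np.maximum(t[:, None] - b[None, :], 0.0)  # m x m, lower-triangular, positive diagonal
a = np.linalg.solve(Phi, y)                     # outer weights of the shallow interpolant

f = lambda Z: np.maximum(Z @ w[:, None] - b[None, :], 0.0) @ a
print(np.max(np.abs(f(X) - y)))                 # ~0 up to round-off: exact interpolation with d_1 = m
```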

Let $\mathbb{I}^d:=[-1,1]^d$. Denote by $L^p(\mathbb{I}^d)$ the space of $p$-th power Lebesgue integrable functions on $\mathbb{I}^d$, endowed with the norm $\|\cdot\|_{L^p(\mathbb{I}^d)}$. In this section, we are concerned with noiseless data, that is, there is some $f^*\in L^p(\mathbb{I}^d)$ such that

f^*(x_i)=y_i,\qquad i=1,\dots,m. (5)

First, we provide a negative result for running ERM on over-parameterized deep ReLU nets, showing that there is some $f\in\Psi_{d_1,\dots,d_L,m}$ that performs extremely badly in generalization.

Proposition 1

Let $1\leq p<\infty$. If $L\geq 2$, $d_1\geq 4dm$, $d_2\geq m$ and $d_3,\dots,d_L\geq 2$, then there are infinitely many $f\in\Psi_{d_1,d_2,\dots,d_L,m}$ such that for any $f^*$ satisfying (5) and $\|f^*\|_{L^p(\mathbb{I}^d)}\geq c$, there holds

\|f^*-f\|_{L^p(\mathbb{I}^d)}\geq c/2, (6)

where $c$ is an absolute constant.

Proposition 1 shows that in the over-parameterized setting, there are infinitely many global minima of (3) behaving extremely badly, even for noiseless data. It should be highlighted that the construction of $f$ in our proof is independent of $f^*$. In fact, due to the localized approximation property of deep ReLU nets, we can construct deep ReLU nets $f$ that exactly interpolate the training sample but satisfy $\|f\|_{L^p(\mathbb{I}^d)}\leq\varepsilon$ for arbitrarily small $\varepsilon>0$. Different from the results in [12, 26, 45, 36], it is difficult to quantify conditions on distributions and dimensions of input spaces such that all global minima of over-parameterized deep ReLU nets possess good generalization performance. This is mainly due to the nonlinear nature of the hypothesis space, even though the capacities, measured by covering numbers [23, 5], of deep ReLU nets and linear models are comparable, provided similar numbers of free parameters are involved in the hypothesis spaces and the depth is not too large.

Proposition 1 provides extremely bad examples of global minima of (3) in terms of generalization, even for noiseless data. It seems that over-parameterized deep ReLU nets are always worse than under-parameterized networks, which is standard from the viewpoint of classical learning theory [19]. However, in the following theorem, we will show that if the number of free parameters increases to a certain extent, then there are also infinitely many global minima of (3) possessing excellent generalization performance. To this end, we need two concepts concerning the data distribution and the smoothness of the target function $f^*$.

Denote by $\Lambda:=\{x_i\}_{i=1}^m$ the input set. The separation radius [46] of $\Lambda$ is defined by

q_\Lambda=\frac{1}{2}\min_{i\neq j}\|x_i-x_j\|_2, (7)

where $\|x\|_2$ denotes the Euclidean norm of $x\in\mathbb{R}^d$. The separation radius is half of the smallest distance between any two distinct points in $\Lambda$ and naturally satisfies $q_\Lambda\leq m^{-1/d}$. Let us also introduce the standard smoothness assumption [24, 58, 48, 37, 62]. Let $c_0>0$ and $r=s+\mu$ with $s\in\mathbb{N}_0:=\{0\}\cup\mathbb{N}$ and $0<\mu\leq 1$. We say a function $f:\mathcal{A}\subseteq\mathbb{R}^d\rightarrow\mathbb{R}$ is $(r,c_0)$-smooth if $f$ is $s$-times differentiable and, for every $\alpha_j\in\mathbb{N}_0$, $j=1,\dots,d$, with $\alpha_1+\dots+\alpha_d=s$, its $s$-th partial derivative satisfies the Lipschitz condition

\left|\frac{\partial^{s}f}{\partial x_1^{\alpha_1}\dots\partial x_d^{\alpha_d}}(x)-\frac{\partial^{s}f}{\partial x_1^{\alpha_1}\dots\partial x_d^{\alpha_d}}(x^{\prime})\right|\leq c_0\|x-x^{\prime}\|_2^{\mu},\quad\forall\ x,x^{\prime}\in\mathcal{A}. (8)

Denote by $Lip_{\mathcal{A}}^{(r,c_0)}$ the set of all $(r,c_0)$-smooth functions defined on $\mathcal{A}$.
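Given a concrete sample, the separation radius in (7) is straightforward to evaluate; a minimal sketch (our own helper, based on SciPy's pairwise-distance routine) is:

```python
import numpy as np
from scipy.spatial.distance import pdist

def separation_radius(X):
    """q_Lambda in (7): half of the smallest pairwise Euclidean distance among the inputs."""
    return 0.5 * pdist(X, metric="euclidean").min()

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))   # m = 500 points drawn uniformly in I^3
print(separation_radius(X))             # compare with m^{-1/d} = 500^{-1/3} from the text
```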

We are in a position to state our first main result to show the existence of good global minima of (3) for noiseless data.

Theorem 2

Let $r,c_0>0$ and $N\in\mathbb{N}$. If $f^*\in Lip_{\mathbb{I}^d}^{(r,c_0)}$ satisfies (5), $N\succeq q_\Lambda^{-d}$, $L\succeq\log N$, $d_1\succeq N$ and $d_\ell\succeq\log N$ for $\ell=2,\dots,L$, then there are infinitely many $h^*\in\Psi_{d_1,\dots,d_L,m}$ such that

\|h^*-f^*\|_{L^\infty(\mathbb{I}^d)}\leq CN^{-r/d}, (9)

where $C>0$ is a constant depending only on $r$, $d$ and $c_0$, and $a\succeq b$ for $a,b>0$ means that there is some constant $C^\prime$ depending only on $r$, $d$ and $c_0$ such that $a\geq C^\prime b$.

A consensus on deep net approximation is that it can break the “curse of dimensionality”, which was verified in interesting work [43, 30, 55, 31, 18, 53] in terms of deriving dimension-independent approximation rates. However, it should be pointed out that, to achieve such dimension-independent approximation rates, strict restrictions have been imposed on the target functions, and these restrictions become stronger as the dimension $d$ grows, just as [4, P.68] observed. In this way, though the approximation error of deep nets is independent of the dimension, the class of applicable target functions becomes more and more restrictive as $d$ grows. Our result presented in Theorem 2 takes a different direction and shows that even for well-studied smooth functions, there is a deep net that interpolates the training data without degrading the approximation rate. We highlight that the approximation rate depends on the a-priori knowledge of the target functions. In particular, if we impose a strict restriction such as $f^*\in Lip_{\mathbb{I}^d}^{(r,c_0)}$ with $r\geq d/2$, which is standard in kernel learning [15], then the approximation rate can be at least of order $n^{-1/2}$, which is also dimension-independent.

It is well known that deep learning has achieved great success in applications whose input spaces are of high dimensionality, such as image processing [32] and game theory [57], showing the excellent performance of deep learning with large $d$. However, recent progress in inventory management [52], finance prediction [28] and earthquake intensity analysis [25] demonstrated that deep learning is also efficient for applications with low-dimensional input spaces. Therefore, numerous research activities have been triggered to verify the advantage of deep nets in approximation theory [58, 48, 61] and learning theory [29, 31, 17] that regard $d$ as a constant. This paper follows this direction and assumes that $d$ is a constant much smaller than the size of the data. As we mentioned above, if $d$ is extremely large, more restrictions on the target functions should be imposed, just as Theorem 4 below purports to show.

Noting that there are in total $\mathcal{O}(N\log N)$ parameters in $h^*$ in (9), the derived error rate is almost optimal in the sense that, up to a logarithmic factor, the derived upper bound is of the same order as the lower bound [58, 23], i.e., $C_1^\prime(N\log N)^{-r/d}$ for some $C_1^\prime>0$ independent of $N$. This means that for $N\succeq q_\Lambda^{-d}$ and $L\succeq\log N$, we can get almost optimal deep nets via finding suitable global minima of (3). Theorem 2 also shows that in the over-parameterized setting, where all global minima exactly interpolate the data, the interpolation restriction does not always affect the approximation performance of deep nets. Theorem 2 actually presents a sufficient condition on the number of free parameters of deep ReLU nets, namely $N\succeq q_\Lambda^{-d}$, to guarantee the existence of perfect global minima when we are faced with noiseless data. It should be highlighted that $q_\Lambda$ can be numerically determined once the data set is given. If $\{x_i\}_{i=1}^m$ are drawn independently and identically (i.i.d.) according to some distribution, a lower bound on $q_\Lambda$ is easy to derive in theory [24, 36, 38].

III Global Minima of ERM on Over-Parameterized Deep Nets for Noisy Data

In this section, we conduct our study in the standard least-squares regression framework [24, 19] and assume

y_i=f^*(x_i)+\varepsilon_i,\qquad i=1,\dots,m, (10)

where $\{x_i\}_{i=1}^m$ are i.i.d. drawn according to an unknown distribution $\rho_X$ on an input space $\mathcal{X}\subseteq[0,1]^d$, and $\{\varepsilon_i\}_{i=1}^m$ are independent random variables that are independent of $\{x_i\}_{i=1}^m$ and satisfy $|\varepsilon_i|\leq 1$, $\mathbf{E}[\varepsilon_i]=0$, $i=1,\dots,m$. Denote by $\Xi$ a set of Borel probability measures on $\mathcal{X}$ and by $\Theta$ a set of functions defined on $\mathcal{X}$. Write

\mathcal{M}(\Theta,\Xi):=\{(\rho_X,f^*):\rho_X\in\Xi,f^*\in\Theta\}.

Let $\Gamma_D$ be the class of all estimators derived from the data set $D$. Define

e(\Theta,\Xi):=\sup_{(\rho_X,f^*)\in\mathcal{M}(\Theta,\Xi)}\inf_{f_D\in\Gamma_D}\mathbf{E}\big(\|f^*-f_D\|^2_{L^2_{\rho_X}}\big). (11)

It is easy to see that $e(\Theta,\Xi)$ measures the theoretically optimal generalization bound that a learning scheme based on a data set $D$ satisfying (10) can achieve when $f^*\in\Theta$ and $\rho_X\in\Xi$ [24]. Our purpose is to compare the generalization errors of the global minima defined by (3) with $e(\Theta,\Xi)$ to illustrate whether the global minima can achieve the theoretically optimal generalization error bounds.

Before presenting our main results, we introduce a negative result derived in [31, Theorem 2], which shows that without any restriction imposed on the distribution $\rho_X$, over-parameterized deep nets do not generalize well.

Lemma 2

If $d_1\geq m$, then for any $L\geq 1$, $d_2,\dots,d_L\geq 2$ and any $h\in\Psi_{d_1,\dots,d_L,m}$, there exists a probability measure $\rho_X^*$ on $\mathcal{X}$ such that

\sup_{f^*\in Lip_{\mathbb{I}^d}^{(r,c_0)},\rho_X=\rho_X^*}\mathbf{E}\big[\|f^*-h\|^2_{L^2_{\rho_X^*}}\big]\geq 1/6. (12)

Lemma 2 shows that without any restrictions on the distribution, even for the widely studied smooth functions, all of the global minima of (3) in the over-parameterized setting are bad estimators. This is totally different from the under-parameterized setting, in which distribution-free optimal generalization error rates were established [24, 40]. Lemma 2 seems to contradict the numerical results in [59] at first glance. However, we highlight that $\rho_X^*$ in (12) is a very special distribution that is not even absolutely continuous with respect to the Lebesgue measure. If we impose some mild conditions to exclude such distributions, then the result will be totally different.

The restriction we study in this paper is the following well known distortion assumption of $\rho_X$ [56, 17], which is slightly stricter than the standard assumption that $\rho_X$ is absolutely continuous with respect to the Lebesgue measure. Let $p\geq 2$ and $J_p$ be the identity mapping

L^p(\mathcal{X})\ \stackrel{J_p}{\longrightarrow}\ L^2_{\rho_X}.

Define $D_{\rho_X,p}:=\|J_p\|$, where $\|\cdot\|$ denotes the operator norm. Then $D_{\rho_X,p}$ is called the distortion of $\rho_X$ (with respect to the Lebesgue measure), which measures how much $\rho_X$ distorts the Lebesgue measure. In our analysis, we assume $D_{\rho_X,p}<\infty$, which obviously holds for the uniform distribution for all $p\geq 2$. Denote by $\Xi_p$ the set of $\rho_X$ satisfying $D_{\rho_X,p}<\infty$.

We then provide optimal generalization error rates for global minima of (3) under different types of a-priori information. Let us begin with the widely used class of smooth regression functions. The classical results in [24] show that

C_1m^{-\frac{2r}{2r+d}}\leq e(Lip^{(r,c_0)}_{\mathbb{I}^d},\Xi_p)\leq C_2m^{-\frac{2r}{2r+d}},\qquad p\geq 2, (13)

where $C_1$, $C_2$ are constants independent of $m$ and $e(Lip^{(r,c_0)}_{\mathbb{I}^d},\Xi_p)$ is defined by (11). It demonstrates the optimal generalization error rate that a good learning algorithm should achieve for $f^*\in Lip^{(r,c_0)}_{\mathbb{I}^d}$ and $\rho_X\in\Xi_p$. Furthermore, it can be found in [53, 25] that there is some network structure $\Phi_{n,L}$ such that all global minima of ERM on under-parameterized deep ReLU nets reach almost optimal error rates, in the sense that if $L\sim\log m$, $d_1\sim m^{\frac{d}{2r+d}}$ and $d_2,\dots,d_L\sim\log m$, then

C_1m^{-\frac{2r}{2r+d}}\leq\sup_{f^*\in Lip^{(r,c_0)}_{\mathbb{I}^d},\rho_X\in\Xi_p}\mathbf{E}\big[\|f^{under}_{global}-f^*\|_{L^2_{\rho_X}}^2\big]\leq C_3\left(\frac{m}{\log m}\right)^{-\frac{2r}{2r+d}}, (14)

where $C_3$ is a constant depending only on $r,c_0,d,p$, and $f^{under}_{global}$ is an arbitrary global minimum of (3) with depth and width specified as above.

It follows from (14) and (13) that all global minima of ERM on under-parameterized deep ReLU nets are almost optimal learners for tackling data drawn i.i.d. according to $\rho\in\mathcal{M}(Lip^{(r,c_0)},\Xi_p)$. In the following theorem, we show that, if the network is deepened and widened, there also exist global minima of (3) for over-parameterized deep ReLU nets possessing similar generalization performance.

Theorem 3

Let $p\geq 2$ and $r,c_0>0$. If $L\succeq\log m$, $d_1\succeq m$ and $d_2,\dots,d_L\succeq\log m$, then there exist infinitely many $h\in\Psi_{d_1,\dots,d_L,m}$ such that

C_1m^{-\frac{2r}{2r+d}}\leq\sup_{f^*\in Lip^{(r,c_0)}_{\mathbb{I}^d},\rho_X\in\Xi_p}\mathbf{E}\big[\|h-f^*\|_{L^2_{\rho_X}}^2\big]\leq 2C_3\left(\frac{m}{\log m}\right)^{-\frac{2r}{2r+d}}. (15)

Due to Lemma 1, any $h\in\Psi_{d_1,\dots,d_L,m}$ is an exact interpolant of $D$, i.e., $h(x_i)=y_i$, $i=1,\dots,m$. Therefore, Theorem 3 presents theoretical verification of benign over-fitting for deep ReLU nets, which was intensively discussed recently [9, 12, 36, 38, 7]. It should be mentioned that our main novelties are twofold. On the one hand, we are concerned with deep ReLU nets, while the existing results focused on linear hypothesis spaces. On the other hand, we provide evidence of the existence of good global minima without high-dimensionality and strong distribution assumptions, while the existing results focused on searching for strong conditions on the dimension of input spaces and data distributions to guarantee that all global minima have excellent generalization performance.

Theorem 3 shows that implementing ERM on over-parameterized deep ReLU nets can achieve almost optimal generalization error rates, but it does not demonstrate the power of depth, since shallow learning also reaches these bounds [15, 40]. To show the power of depth, we should impose more restrictions on the regression functions. For this purpose, we introduce the generalized additive models that are widely used in statistics and machine learning [24, 53]. For $r,\gamma,c_0,c_0^\prime>0$, we say that $f$ admits a generalized additive model if $f=h\left(\sum_{i=1}^d f_i(x^{(i)})\right)$, where $h\in Lip^{(r,c_0)}_{\mathbb{R}}$ and $f_i\in Lip^{(\gamma,c_0^\prime)}_{\mathbb{I}}$. Write $\mathcal{W}_{r,\gamma,c_0,c_0^\prime}$ for the set of all functions admitting a generalized additive model. If $L\sim\log m$, $d_1\sim m^{\frac{1}{2r+1}}$ and $d_2,\dots,d_L\sim\log m$, it can be found in [53, 25] that for any $p\geq 2$, there holds

C_4\left(m^{-\frac{2r}{2r+1}}+m^{-\frac{2\gamma(r\wedge 1)}{2\gamma(r\wedge 1)+1}}\right)\leq e(\mathcal{W}_{r,\gamma,c_0,c_0^\prime},\Xi_p)\leq\sup_{f^*\in\mathcal{W}_{r,\gamma,c_0,c_0^\prime},\rho_X\in\Xi_p}\mathbf{E}\big[\|f_{global}^{under}-f^*\|_{L^2_{\rho_X}}^2\big]\leq C_5\left(m^{-\frac{2r}{2r+1}}+m^{-\frac{2\gamma(r\wedge 1)}{2\gamma(r\wedge 1)+1}}\right)\log^3 m, (16)

where $C_4$, $C_5$ are constants independent of $m$ and $f_{global}^{under}$ is an arbitrary global minimum of (3). Note that shallow learning can hardly achieve the above generalization error rates: even for a special case of the generalized additive model with $f_i(x^{(i)})=(x^{(i)})^2$, it has been proved in [18] that shallow nets with any activation function cannot achieve them. The following theorem establishes the existence of perfect global minima of (3) to show the power of depth in the over-parameterized setting.

Theorem 4

Let $p\geq 2$ and $r,\gamma,c_0,c_0^\prime>0$. If $L\succeq\log m$, $d_1\succeq m$, and $d_2,\dots,d_L\succeq\log m$, then there are infinitely many $h\in\Psi_{d_1,\dots,d_L,m}$ such that

C_4\left(m^{-\frac{2r}{2r+1}}+m^{-\frac{2\gamma(r\wedge 1)}{2\gamma(r\wedge 1)+1}}\right)\leq\sup_{f^*\in\mathcal{W}_{r,\gamma,c_0,c_0^\prime},\rho_X\in\Xi_p}\mathbf{E}\big[\|h-f^*\|_{L^2_{\rho_X}}^2\big]\leq 2C_5\left(m^{-\frac{2r}{2r+1}}+m^{-\frac{2\gamma(r\wedge 1)}{2\gamma(r\wedge 1)+1}}\right)\log^3 m. (17)

Theorem 4 shows that there are infinitely many global minima of ERM on over-parameterized deep ReLU nets that theoretically break through the bottleneck of shallow learning. This illustrates an advantage of adopting over-parameterized deep ReLU nets to build up hypothesis spaces in practice.

Finally, we show the power of depth in capturing widely used spatially sparse features with the help of massive data. It has been discussed in [17, 39] that spatial sparseness is an important data feature for image and signal processing and that deep ReLU nets perform excellently in reflecting it. Partition $\mathbb{I}^d$ into $(N^*)^d$ sub-cubes $\{A_j\}_{j=1}^{(N^*)^d}$ of side length $(N^*)^{-1}$ with centers $\{\zeta_j\}_{j=1}^{(N^*)^d}$. For $u\in\mathbb{N}$ with $u\leq(N^*)^d$, define

\Upsilon_u:=\left\{j_\ell:j_\ell\in\{1,2,\dots,(N^*)^d\},1\leq\ell\leq u\right\}.

If the support of $f\in L^p(\mathbb{I}^d)$ is contained in $S:=\cup_{j\in\Upsilon_u}A_j$ for a subset $\Upsilon_u$ of $\{1,2,\dots,(N^*)^d\}$ of cardinality at most $u$, we then say that $f$ is $u$-sparse in $(N^*)^d$ partitions. Denote by $Lip^{(N^*,u,r,c_0)}_{\mathbb{I}^d}$ the set of all $f\in Lip^{(r,c_0)}_{\mathbb{I}^d}$ which are $u$-sparse in $(N^*)^d$ partitions. Let $L\sim\log m$, $d_1,d_2\sim\left(m\left(\frac{u}{(N^*)^d}\right)/\log m\right)^{1/(2r+d)}$ and $d_3,\dots,d_L\sim\log m$. If $m$ is large enough to satisfy $\frac{m}{\log m}\geq\tilde{C}_4\frac{(N^*)^{\frac{2d+4r+2d}{(2r+d)p}}}{u^{\frac{1}{2r+d}}}$, then it can be easily deduced from [17] that there exists a DSCN structure contained in $\mathcal{N}_{d_1,\dots,d_L}^{DFCN}$ such that

C_6m^{-\frac{2r}{2r+d}}\left(\frac{u}{(N^*)^d}\right)^{\frac{d}{2r+d}}\leq e(Lip^{(N^*,u,r,c_0)}_{\mathbb{I}^d},\Xi_p)\leq\sup_{\rho\in\mathcal{M}(Lip^{(N^*,u,r,c_0)},\Xi_p)}\mathbf{E}\big[\|f_{global}^{under}-f^*\|_{L^2_{\rho_X}}^2\big]\leq C_7\left(\frac{m}{\log m}\right)^{-\frac{2r}{2r+d}}\left(\frac{u}{(N^*)^d}\right)^{\frac{2}{p}-\frac{2r}{2r+d}}, (18)

where $\tilde{C}_4,C_6,C_7$ are constants independent of $m$ and $f_{global}^{under}$ is an arbitrary global minimum of ERM on the DSCN. In (18), $\left(\frac{m}{\log m}\right)^{-\frac{2r}{2r+d}}$ reflects the smoothness of $f^*$ and $\left(\frac{u}{(N^*)^d}\right)^{\frac{2}{p}-\frac{2r}{2r+d}}$ embodies the spatial sparseness of $f^*$. As discussed above, given a sparsity level $u$ and a number of partitions $(N^*)^d$, the size of the data should satisfy $\frac{m}{\log m}\geq\tilde{C}_4\frac{(N^*)^{\frac{2d+4r+2d}{(2r+d)p}}}{u^{\frac{1}{2r+d}}}$ to embody the spatially sparse feature of $f^*$. In particular, if the number of samples is smaller than the sparsity level $u$, it is impossible to develop learning schemes that recover the support of $f^*$. Recalling the localized approximation property of deep ReLU nets [16, 17], (18) shows that, with the help of massive data, deep ReLU nets are capable of capturing spatial sparseness, which is beyond the capability of shallow nets due to their lack of localized approximation [16]. We refer the reader to [17] for more details on the above assertions. If $p=2$, it can be seen from (18) that ERM on deep ReLU nets in the under-parameterized setting can achieve almost optimal generalization error rates. The following theorem shows the existence of perfect global minima of (3) in the over-parameterized setting.

Theorem 5

Let $r,c_0>0$, $N^*\in\mathbb{N}$, $u\leq(N^*)^d$ and let $m$ satisfy $\frac{m}{\log m}\succeq\frac{(N^*)^{\frac{2d+4r+2d}{(2r+d)p}}}{u^{\frac{1}{2r+d}}}$. If $L\succeq\log m$, $d_1,d_2\succeq m$ and $d_3,\dots,d_L\succeq\log m$, then there are infinitely many $h\in\Psi_{d_1,\dots,d_L,m}$ such that

C_6m^{-\frac{2r}{2r+d}}\left(\frac{u}{(N^*)^d}\right)^{\frac{d}{2r+d}}\leq\sup_{\rho\in\mathcal{M}(Lip^{(N^*,u,r,c_0)},\Xi_2)}\mathbf{E}\big[\|h-f^*\|_{L^2_{\rho_X}}^2\big]\leq 2C_7\left(\frac{m}{\log m}\right)^{-\frac{2r}{2r+d}}\left(\frac{u}{(N^*)^d}\right)^{\frac{d}{2r+d}}. (19)

Theorem 5 shows that for spatially sparse regression functions, there are infinitely many global minima of (3) in the over-parameterized setting achieving almost optimal generalization error rates. Besides the given three types of regression functions, we can provide similar results for numerous other regression functions, including general composite functions [53], hierarchical interaction models [30] and piecewise smooth functions [29], by using the same approach as in this paper. In particular, to derive assertions similar to our theorems, it is sufficient to apply the deepening scheme developed in Theorem 6 below to the corresponding generalization error estimates in [30, 29, 53]. We omit the details for the sake of brevity.

Based on the above theorems, we can derive the following corollary, which shows the versatility of over-parameterized deep ReLU nets in regression.

Corollary 1

Let $p\geq 2$, $r,\gamma,c_0,c_0^\prime>0$, $N\in\mathbb{N}$, $u\leq N^d$ and let $m$ satisfy $\frac{m}{\log m}\succeq\frac{N^{\frac{2d+4r+2d}{(2r+d)p}}}{u^{\frac{1}{2r+d}}}$. If $L\succeq\log m$, $d_1,d_2\succeq m$ and $d_3,\dots,d_L\succeq\log m$, then there are infinitely many $h\in\Psi_{d_1,\dots,d_L,m}$ such that (15), (17), and (19) hold simultaneously.

A crucial problem of deep learning is how to specify the structure of deep nets for a given learning task. It should be highlighted that in the under-parameterized setting, both the width and depth should be carefully tailored to avoid the bias-variance trade-off phenomenon, making the structures of deep nets for different learning tasks quite different [30, 29, 53, 17, 25]. However, as shown in Corollary 1, over-parameterizing succeeds in avoiding the structure selection problem of deep learning in the sense that there exists a unified over-parameterized DFCN structure that contains perfect global minima for different learning tasks.

IV Related Work and Discussions

In this section, we review some related work on the generalization performance of deep ReLU nets and make some comparisons to highlight our novelty. According to the classical bias-variance trade-off principle, the over-parameterized setting makes the deep net model so flexible that its global minima suffer from the well known over-fitting phenomenon [19], in the sense that they fit the training data perfectly but fail to predict new query points. Surprisingly, numerical evidence [59] showed that such over-fitting may not occur. This interesting phenomenon leads to a rethinking of modern machine-learning practice and the bias-variance trade-off.

The interesting result in [9] is, to the best of our knowledge, the first work to theoretically study the generalization performance of interpolation methods. In [9], multivariate triangulations are constructed to interpolate the data and the generalization error bounds are exhibited as a function of the dimension $d$, which shows that the interpolation method possesses good generalization performance when $d$ is large. After imposing certain structural constraints on the covariance matrices, [26, 7] derived tight generalization error bounds for over-parameterized linear regression. In [45], the authors revealed several quantitative relations between linear interpolants and the structures of covariance matrices and then provided a hybrid interpolating scheme whose generalization error is rate-optimal for sparse linear models with noise. Motivated by these results, [36] proved that kernel ridgeless least squares possesses good generalization performance for high dimensional data, provided the distribution $\rho_X$ satisfies certain restrictions. In [38], the generalization error of kernel ridgeless least squares was bounded by means of differences of kernel-based integral operators.

It should be mentioned that the existing literature imposes some strict restrictions concerning the dimensionality, the structures of covariance matrices and the marginal distribution $\rho_X$ to guarantee the good generalization performance of interpolation methods. Indeed, it was proved in [31] that some restriction on the marginal distribution $\rho_X$ is necessary, without which there is a $\rho_X^*$ such that all interpolation methods may perform extremely badly (see Lemma 2). However, the high-dimensionality assumption and the structural constraints on covariance matrices are removable, since [9] has already constructed a piecewise interpolation based on the well known Nadaraya-Watson estimator and derived optimal generalization error rates, without any restrictions on the dimension and covariance matrices.

Compared with the above-mentioned existing work, our results have three main novelties. First, we aim to explain the over-fitting resistance phenomenon for deep learning rather than for linear algorithms such as linear regression and kernel regression. Due to the nonlinear nature of (3), we rigorously prove the existence of both bad global minima and perfect ones. Therefore, it is almost impossible to derive results analogous to those for linear models [12, 45, 36] for deep ReLU nets, i.e., to determine which conditions are sufficient to guarantee the perfect generalization performance of all global minima. Furthermore, our theoretical results coincide with the experimental phenomenon that global minima (with training error equal to 0) with different parameters frequently behave totally differently in generalization. Second, our results present essential advantages of running ERM on over-parameterized deep ReLU nets by proving the existence of deep ReLU nets possessing almost optimal generalization error rates, which is beyond the capability of shallow learning. Finally, our results are established under mild conditions on distributions and without any restrictions on the dimension of the input space.

Another related work is [44], where the authors discussed the generalization performance of interpolation methods based on histograms and also established the existence of bad and good interpolating neural networks. The main arguments of [44] and our paper are similar: there are global minima of ERM on deep ReLU nets that can avoid over-fitting. The main differences are as follows: 1) It is well known that an approximant or learner based on histograms suffers from the well known saturation problem, in the sense that the approximation or learning rate cannot be improved further once the regularity (or smoothness) of the regression function reaches a certain level [58]. In particular, as shown in [58, 17], deep ReLU nets with two hidden layers can provide localized approximation but have difficulty approximating extremely smooth functions. In our paper, we avoid this saturation phenomenon by deepening the networks and thus break through the bottleneck of the analysis in [44]. 2) We provide detailed structures of deep ReLU nets and derive quantitative requirements on the number of free parameters to guarantee the existence of good global minima of ERM on deep ReLU nets, which is different from [44]. 3) More importantly, we are devoted to answering Problem 1 by showing the optimal generalization error rates and the power of depth of some global minima of ERM on deep ReLU networks. In particular, using the deepening approach, we prove that there exist infinitely many global minima of ERM on over-parameterized deep ReLU nets that perform almost the same as the under-parameterized deep ReLU nets.

In summary, we provide an affirmative answer to Problem 1 by providing several examples of perfect global minima of (3). It would be interesting to study the distribution of these perfect global minima and to design feasible schemes to find them. We will keep studying this topic and consider these two more challenging problems in future work.

V Numerical Examples

In this section, we conduct numerical simulations to support our theoretical assertions on the existence of benign over-fitting when running ERM on over-parameterized deep ReLU nets. There are mainly four purposes of our simulations. In the first simulation, we aim to show the relation between the generalization performance of global minima of (3) and the number of parameters (or width) of deep ReLU nets. In the second one, we verify the over-fitting resistance of (3) by showing the relation between the generalization error and the number of algorithmic iterations (epochs). In the third one, we show the existence of good and bad global minima of (3). Finally, we compare our learned global minima with some widely used learning schemes to show the learning performance of (3) on over-parameterized deep ReLU nets.

Figure 1: Relation between training and testing errors and the width

For these purposes, we adopt fully connected ReLU neural networks with $L$ hidden layers and $k$ neurons in each layer. In all simulations, we set $L\in\{1,2,4\}$ and $k\in\{1,\dots,2000\}$. We use the well known Adam optimization algorithm on deep ReLU nets with a constant step-size of $0.001$ and the default PyTorch initialization. Unless stated otherwise, training is stopped after $50{,}000$ iterations.
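A hedged PyTorch sketch of this setup is given below. The paper does not publish code, so the helper names and the use of full-batch Adam updates are our assumptions; the step-size, iteration budget and default initialization follow the description above. Here X is expected to be an (m, d) float tensor and y a length-m float tensor of targets.

```python
import torch
import torch.nn as nn

def relu_net(d_in, width, depth):
    """Fully connected ReLU net with `depth` hidden layers of `width` neurons and scalar output."""
    layers, prev = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, 1))
    return nn.Sequential(*layers)

def train_erm(X, y, width, depth, iters=50_000, lr=1e-3):
    """Run Adam on the empirical risk (3); full-batch updates are our assumption."""
    net = relu_net(X.shape[1], width, depth)            # default PyTorch initialization
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(iters):
        opt.zero_grad()
        loss = mse(net(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return net, loss.item()                             # trained net and final training MSE
```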

We report numerical results on two real-world datasets. The first one is the Wine Quality dataset from the UCI database. The Wine Quality dataset is related to red and white variants of the Portuguese “Vinho Verde” wine, with 1599 red and 4898 white examples. We select the white wine for our experiments. There are 12 attributes in the data set: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality (score between 0 and 10). Therefore, it can be viewed as a regression task on input data of dimension 11. Regarding the preferences, each sample was evaluated by a minimum of three sensory assessors (using blind tastes), who graded the wine on a scale ranging from 0 (very bad) to 10 (excellent). We sample 2/3 of the data points as our training set and use the remaining 1/3 for testing.
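For reference, the white-wine data can be fetched and split roughly as follows; the file location and the use of a random split are our assumptions, since the text only states the 2/3 training and 1/3 testing ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-white.csv")              # assumed location of the UCI file
wine = pd.read_csv(URL, sep=";")                           # 4898 white-wine samples
X = wine.drop(columns=["quality"]).values                  # 11 physico-chemical attributes
y = wine["quality"].values.astype(float)                   # quality score between 0 and 10
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
```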

The second dataset is MNIST. The MNIST dataset is widely used in classification tasks; here we follow [36] to create a regression task from it. MNIST includes 70000 samples in total. Each sample consists of a $28\times 28$-dimensional feature and a target representing a digit ranging from 0 to 9. We randomly pick 291 samples whose digits equal 0 or 1, and split them into 221 samples for training and 70 for testing. We label digit 0 as $-1$ and digit 1 as $1$. We thus obtain a dataset with $28\times 28=784$-dimensional features and labels in $\{-1,1\}$, on which the regression task is built.
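A sketch of this construction is given below; the torchvision download path, the random seed, and the decision to draw the 291 samples from the MNIST training split are our choices.

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
idx = torch.where((mnist.targets == 0) | (mnist.targets == 1))[0]    # digits 0 and 1 only
g = torch.Generator().manual_seed(0)
pick = idx[torch.randperm(len(idx), generator=g)[:291]]              # 291 samples as in the text
X = torch.stack([mnist[int(i)][0].reshape(-1) for i in pick])        # 784-dimensional features
y = torch.tensor([2.0 * mnist[int(i)][1] - 1.0 for i in pick])       # digit 0 -> -1, digit 1 -> +1
X_tr, y_tr, X_te, y_te = X[:221], y[:221], X[221:], y[221:]          # 221 train / 70 test
```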

Simulation 1. In this simulation, we study the relation between the RMSE (root mean squared error) on the test set and the width of deep ReLU nets with $L=1,2,4$. Our results are recorded over 5 independent trials. The solid line is the mean value from these trials, and the shaded part indicates the deviation. The numerical results are reported in Figure 1.

Figure 1 illustrates the existence of perfect global minima by exhibiting the relation between the training and testing RMSE and the number of free parameters. From Figure 1, we obtain the following four observations: 1) The left figure shows that neural networks with more hidden layers produce exact interpolations of the training data more easily. This coincides with the common consensus, since more hidden layers with the same width involve many more free parameters; 2) For each depth, it can be found in the right figure that the testing curves exhibit an approximate double descent phenomenon, as declared in [12] for linear models. It should be highlighted that such a phenomenon does not always occur in deep ReLU net training, and we only choose a good trend from several trials to demonstrate the existence of perfect global minima; 3) As the width (or capacity of the hypothesis space) increases, it can be found in the right figure that the testing error does not increase, exhibiting a totally different phenomenon from the classical bias-variance trade-off. This shows that for over-parameterized deep ReLU nets, there exist good global minima of (3), provided the depth is appropriately selected; 4) It can be found that deeper ReLU nets perform better in generalization, which demonstrates the power of depth in tackling the Wine Quality data. All these observations verify our theoretical assertions in Section III and show that there exist perfect global minima of (3) that realize the power of deep ReLU nets.

Simulation 2. In this simulation, we numerically study the role of the number of iterations (epochs) in (3) in both the under-parameterized and over-parameterized settings. We run ERM on DFCNs with depth 4 and widths 2, 40, 2000 on the Wine and MNIST datasets. Since the number of training samples is 3265 (221 for the MNIST dataset, resp.), it is easy to check that deep ReLU nets with depth 4 and widths 2 and 40 are under-parameterized, while those with depth 4 and width 2000 are over-parameterized. The numerical results are reported in Figure 2.

Figure 2: Relation between the training and testing errors and the number of iterations

There are also three interesting findings exhibited in Figure 2: 1) For under-parameterized ReLU nets, it is almost impossible to produce a global minimum acting as an exact interpolation of the data. However, for over-parameterized deep ReLU nets, running Adam with sufficiently many epochs attains zero training error. Furthermore, after a specific value, the number of iterations does not affect the training error. This means that Adam converges to a global minimum of (3) on over-parameterized deep ReLU nets; 2) The testing error for under-parameterized ReLU nets, exhibited in the right figure, behaves according to the classical bias-variance trade-off principle, in the sense that the error first decreases with respect to the epoch and then increases after a certain number of epochs. Therefore, early stopping is necessary to guarantee good performance in this setting; 3) The testing error for over-parameterized ReLU nets is always non-increasing with respect to the epoch. This shows the over-fitting resistance of deep ReLU net training and also verifies the existence of perfect global minima of (3) on over-parameterized deep ReLU nets. It should be highlighted that the numerical result presented in Figure 2 is also a single trial selected from numerous runs, since we are concerned with the existence of perfect global minima. In fact, there are also numerous examples of bad global minima of (3).

Simulation 3. In this simulation, we show that although there exist perfect global minima in the over-parameterized setting, bad global minima can also be found sometimes. We test the performance of deep ReLU nets with depth 4 and width 2000 on the Wine and MNIST datasets. It takes different numbers of steps to converge to a good training performance on these two datasets. We launch several runs with different learning rates and network parameter initializations, and pick good and bad global minima from two trials respectively. We report the numerical results in Figure 3.

Figure 3: Comparison of good and bad global minima

From Figure 3, we find that different global minima perform totally differently in generalization, though their training losses both reach 0. In particular, the testing errors of bad global minima can be much larger than those of good global minima. It should be mentioned that the bad interpolants in the above simulations are also derived from Adam; therefore, the orders of magnitude of their testing errors are comparable with those of good interpolants. We highlight that this is due to the implementation of the Adam algorithm rather than the model (3). As far as the model is concerned, our next simulation shows that the testing errors of bad interpolants can also be orders of magnitude larger than those of good ones.

Simulation 4. In this simulation, we compare (3) on over-parameterized deep ReLU nets with some standard learning algorithms, including ridge regression (Ridge), support vector regression (SVR), kernel interpolation (KIR) and kernel ridge regression (KRR), to show that the numerical phenomenon exhibited in the previous figures is not built upon sacrificing generalization performance. For the sake of fairness, we test the various models under the same conditions to the best of our efforts by tuning hyper-parameters. In particular, we implement the reference methods using the standard scikit-learn package. Specifically, in the experiment with the wine dataset, we use ridge regression with regularization parameter $1$. In KRR, we use the Gaussian kernel with width $20$ and regularization parameter $0.0002$. KIR uses the Gaussian kernel with width $5$ and regularization parameter $0$. SVR keeps the default sklearn hyper-parameters. In the experiment with the MNIST dataset, we change the regularization parameter in ridge regression to 10 and keep the other parameters fixed. In this simulation, we also use deep ReLU nets with depth $4$ and width $2000$ and conduct 5 trials to record the average training and testing RMSE.
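Using the training/testing split constructed earlier for the wine data (X_tr, y_tr, X_te, y_te), the reference methods can be reproduced roughly as in the sketch below. scikit-learn's KernelRidge with an RBF kernel stands in for both KRR and KIR, and the translation of the quoted Gaussian kernel "width" s into the gamma parameter via gamma = 1/(2 s^2) is our assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

to_gamma = lambda s: 1.0 / (2.0 * s ** 2)      # assumed reading of the Gaussian kernel "width" s

models = {
    "Ridge": Ridge(alpha=1.0),                                         # regularization parameter 1
    "KRR": KernelRidge(kernel="rbf", gamma=to_gamma(20.0), alpha=0.0002),
    "KIR": KernelRidge(kernel="rbf", gamma=to_gamma(5.0), alpha=0.0),  # ridgeless interpolation
    "SVR": SVR(),                                                      # default hyper-parameters
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))         # test RMSE as in Table I
    print(name, rmse)
```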

In addition, we construct a ReLU net that achieves 0 training error but performs extremely badly on the testing set, to show that there really exist extremely bad interpolants, just as Proposition 1 illustrates. We first introduce some notation. Denote $x_i=\left(x_i^{(1)},\ldots,x_i^{(d)}\right)$. Define $T_{\tau,a,b}(t)=\frac{1}{\tau}\{\sigma(t-a+\tau)-\sigma(t-a)-\sigma(t-b)+\sigma(t-b-\tau)\}$, where $\sigma$ is the ReLU activation function and $\tau$ is a parameter which we set small enough ($e^{-10}$ in this simulation). A feature input is expressed as $x=\left(x^{(1)},\ldots,x^{(d)}\right)$. We denote the output of the constructed net (CN) by $N_{1,m,\tau}(x)$. Note that we drop duplicated samples when implementing CN. CN is constructed as below:

N_{1,m,\tau}(x)=\sum_{i=1}^{m}y_i\sigma\left(\sum_{l=1}^{d}T_{\tau,x_i^{(l)}-\frac{1}{m^5},x_i^{(l)}+\frac{1}{m^5}}\left(x^{(l)}\right)-(d-1)\right).
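For concreteness, a direct NumPy transcription of this construction is given below; the vectorization details and function names are ours.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def T(t, a, b, tau):
    """T_{tau,a,b}(t): a trapezoid equal to 1 on [a, b] and 0 outside [a - tau, b + tau]."""
    return (relu(t - a + tau) - relu(t - a) - relu(t - b) + relu(t - b - tau)) / tau

def CN_predict(x, X_train, y_train, tau=np.exp(-10.0)):
    """N_{1,m,tau}(x): interpolates the (deduplicated) training data and is ~0 elsewhere."""
    m, d = X_train.shape
    h = 1.0 / m ** 5                                       # half-width of the bump around each x_i
    bumps = T(x[None, :], X_train - h, X_train + h, tau)   # bumps[i, l] lies in [0, 1]
    return np.sum(y_train * relu(bumps.sum(axis=1) - (d - 1)))
```

At a training input each coordinate bump equals one, so the inner sum is d and the corresponding y_i is reproduced, while a generic test point falls outside every bump and receives a prediction near zero; this is exactly why the test RMSE of CN in Table I is so poor.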

More details of the construction of $N_{1,m,\tau}$ can be found in the proof of Proposition 1. We introduce $N_{1,m,\tau}$ in this simulation to show that, as an interpolant, it performs quite poorly, illustrating that there are extremely bad global minima of (3). The numerical results are reported in Table I.

Wine Quality Data
Methods Train RMSE Test RMSE
Ridge 0.534 0.735
kernel interpolation 0.000 13.031
KRR 0.668 0.706
SVR 0.628 0.696
4-hidden layer DFCN (good case) 0.000 0.668
CN (bad case) 0.000 5.931
MNIST Data
Methods Train RMSE Test RMSE
Ridge 0.056 0.304
kernel interpolation 0.000 0.135
KRR 0.031 0.140
SVR 0.073 0.154
4-hidden layer DFCN (good case) 0.000 0.097
CN (bad case) 0.000 1.000
TABLE I: Comparison with other regression methods

There are four interesting observations in Table I: 1) Learning schemes such as SVR, KRR and Ridge perform stably for both the high-dimensional application and the low-dimensional simulation. The main reason is that a regularization term is introduced to balance bias and variance in these schemes. As a result, the training errors of these schemes are always non-zero; 2) Kernel interpolation performs well in the high-dimensional application but fails to generalize well in the low-dimensional simulation. The main reason is that if d is large, then the separation radius q_{\Lambda} is large [36, 38], which in turn implies that the condition number of the kernel matrix is relatively small, making kernel interpolation perform well. However, if d is small, the condition number of the kernel matrix is usually extremely large, making the prediction unstable; 3) There exist deep ReLU nets that exactly interpolate the training data, leading to zero training error, but still possess an excellent generalization capability and yield small testing errors, implying that the obtained estimator is a benign over-fitter for the data. Furthermore, the table shows that the testing error of over-parameterized deep ReLU nets is the smallest, demonstrating the power of depth declared in our theoretical assertions in Section III; 4) There also exist deep ReLU nets interpolating the data but performing extremely badly in generalization, for both the high-dimensional application and the low-dimensional simulation. All these findings verify our theoretical assertions that there are good global minima of ERM on over-parameterized deep ReLU nets, but not all global minima are good.
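Observation 2 is easy to reproduce numerically: for a fixed Gaussian width, random inputs in a low-dimensional cube are far less separated than in a high-dimensional one, and the condition number of the kernel matrix explodes accordingly. The sketch below is our own illustration; the width, sample size and dimensions are arbitrary choices.

import numpy as np

def gaussian_kernel(X, width):
    # K_{ij} = exp(-||x_i - x_j||^2 / (2 * width^2))
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

rng = np.random.default_rng(0)
m, width = 200, 1.0
for d in (2, 50):
    X = rng.uniform(size=(m, d))
    K = gaussian_kernel(X, width)
    print(f"d = {d:2d}, condition number of the kernel matrix = {np.linalg.cond(K):.3e}")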

VI Proofs

In this section, we aim at proving our results stated in Section II and Section III. The main novelty of our proof is a deepening scheme that produces an over-parameterized deep ReLU net (student network) based on a specific under-parameterized one (teacher network) so that the student network exactly interpolates the training data and possesses almost the same generalization performance as the teacher network.

VI-A Deepening scheme for ReLU nets

Given a teacher network g, the deepening scheme deepens and widens it to produce a student network f that exactly interpolates the given data D and possesses almost the same generalization performance as g. The following theorem presents the deepening scheme in our analysis.

Theorem 6

Let g_{n,L,U} be any deep ReLU net with L layers, n free parameters and width not larger than U\in\mathbb{N} satisfying \|g_{n,L,U}\|_{L^{\infty}(\mathbb{I}^{d})}\leq C^{*} for some C^{*}>0. If \rho_{X}\in\Xi_{p} with p\in[2,\infty), then for any \varepsilon>0, there exist infinitely many DFCNs f_{D,n,L,U,g} of depth \mathcal{O}(L+\log\varepsilon^{-1}) and width \mathcal{O}(m+U+\log\varepsilon^{-1}) such that

f_{D,n,L,U,g}(x_{i})=y_{i},\qquad\forall i=1,\dots,m, (21)

and

\|f_{D,n,L,U,g}-g_{n,L,U}\|_{L_{\rho_{X}}^{2}}\leq\varepsilon, (22)

where the constants implicit in the depth and width bounds depend only on d.

The deepening scheme developed in Theorem 6 implies that all deep ReLU nets that have been verified to possess good generalization performances in the under-parameterized setting [29, 53, 38, 25] can be deepened to corresponding deep ReLU nets in the over-parameterized setting such that the deepened networks exactly interpolate the given data and possess good generalization error bounds.

The main tools for the proof of Theorem 6 are the localized approximation property of deep ReLU nets developed in [17] and the product-gate property of deep ReLU nets proved in [58]. Let us introduce the first tool as follows. For a,b\in\mathbb{R} with a<b, define a trapezoid-shaped function T_{\tau,a,b} with a parameter 0<\tau\leq 1 as

T_{\tau,a,b}(t):=\frac{1}{\tau}\big\{\sigma(t-a+\tau)-\sigma(t-a)-\sigma(t-b)+\sigma(t-b-\tau)\big\}. (23)

We consider

{\mathcal{N}}_{a,b,\tau}(x):=\sigma\left(\sum_{j=1}^{d}T_{\tau,a,b}(x^{(j)})-(d-1)\right). (24)

The following lemma, proved in [17], presents the localized approximation property of {\mathcal{N}}_{a,b,\tau}.

Lemma 3

Let a<b, 0<\tau\leq 1 and {\mathcal{N}}_{a,b,\tau} be defined by (24). Then we have 0\leq{\mathcal{N}}_{a,b,\tau}(x)\leq 1 for all x\in\mathbb{I}^{d} and

{\mathcal{N}}_{a,b,\tau}(x)=\left\{\begin{array}{cc}0,&\mbox{if}\ x\notin[a-\tau,b+\tau]^{d},\\ 1,&\mbox{if}\ x\in[a,b]^{d}.\end{array}\right. (25)
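The two cases in (25) follow directly from (24): if x\in[a,b]^{d}, then T_{\tau,a,b}(x^{(j)})=1 for every j, so the argument of \sigma equals d-(d-1)=1 and {\mathcal{N}}_{a,b,\tau}(x)=1; if x^{(j)}\notin[a-\tau,b+\tau] for some j, then T_{\tau,a,b}(x^{(j)})=0 while the remaining d-1 summands are at most 1, so the argument of \sigma is at most 0 and {\mathcal{N}}_{a,b,\tau}(x)=0.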

The second tool, as shown in the following lemma, presents the product-gate property of deep ReLU nets [58].

Lemma 4

For any \ell\in\{2,3,\dots\} and \nu\in(0,1), there exists a DFCN with ReLU activation functions \tilde{\times}_{\ell,\nu}:\mathbb{R}^{\ell}\rightarrow\mathbb{R} with \mathcal{O}\left(\ell\log\frac{1}{\nu}\right) depth, \mathcal{O}\left(\ell\log\frac{1}{\nu}\right) width, and free parameters bounded by \mathcal{O}(\ell^{\beta}\nu^{-\beta}) for some \beta>0 such that

|u_{1}u_{2}\cdots u_{\ell}-\tilde{\times}_{\ell,\nu}(u_{1},\dots,u_{\ell})|\leq\nu,\qquad\forall u_{1},\dots,u_{\ell}\in[-1,1]

and

\tilde{\times}_{\ell,\nu}(u_{1},\dots,u_{\ell})=0,\qquad\mbox{if}\quad u_{j}=0\quad\mbox{for some}\quad j=1,\dots,\ell.
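For intuition, the construction behind Lemma 4 in [58] (stated here in slightly simplified form) reduces multiplication to squaring via the polarization identity and approximates the square function on [0,1] by compositions of a piecewise-linear hat function, each of which is a two-layer ReLU net; the \ell-variate gate is then obtained by iterating the bivariate one:

u_{1}u_{2}=\frac{1}{2}\left((u_{1}+u_{2})^{2}-u_{1}^{2}-u_{2}^{2}\right),\qquad t^{2}\approx t-\sum_{k=1}^{K}\frac{g^{\circ k}(t)}{4^{k}},\qquad t\in[0,1],

where g(t)=2\sigma(t)-4\sigma(t-1/2)+2\sigma(t-1) and g^{\circ k} denotes its k-fold composition. The approximation error decays geometrically in K, which is the source of the \mathcal{O}\left(\ell\log\frac{1}{\nu}\right) depth and width in Lemma 4.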

With the above tools, we can prove Theorem 6 as follows.

Proof:

Let {\mathcal{N}}_{\tau}={\mathcal{N}}_{-\tau,\tau,\tau/2} be given in Lemma 3 and \tilde{\times}_{2,\nu}:\mathbb{R}^{2}\rightarrow\mathbb{R} be given in Lemma 4 with \ell=2. Then it follows from (25) that

{\mathcal{N}}_{\tau}(x-x_{i})=\left\{\begin{array}{cc}0,&\mbox{if}\ x\notin x_{i}+[-3\tau/2,3\tau/2]^{d},\\ 1,&\mbox{if}\ x\in x_{i}+[-\tau,\tau]^{d}.\end{array}\right. (26)

Since \|g_{n,L,U}\|_{L^{\infty}(\mathbb{I}^{d})}\leq C^{*}, we can define a function \mathcal{N}_{\tau,\nu,D,g} on \mathbb{R}^{d} by

\mathcal{N}_{\tau,\nu,D,g}(x):=\sum_{i=1}^{m}y_{i}{\mathcal{N}}_{\tau}(x-x_{i})+C^{*}\tilde{\times}_{2,\nu}\left(\frac{g_{n,L,U}(x)}{C^{*}},1-\sum_{i=1}^{m}{\mathcal{N}}_{\tau}(x-x_{i})\right). (27)

If \tau<\frac{2q_{\Lambda}}{3\sqrt{d}}, then for any j\neq i, we have from (26) that {\mathcal{N}}_{\tau}(x_{j}-x_{i})=0. Noting further {\mathcal{N}}_{\tau}(x_{i}-x_{i})=1, we have for any j\in\{1,\dots,m\} that \sum_{i=1}^{m}{\mathcal{N}}_{\tau}(x_{j}-x_{i})=1; since the second argument of \tilde{\times}_{2,\nu} in (27) then equals 0 at x_{j}, Lemma 4 gives

\mathcal{N}_{\tau,\nu,D,g}(x_{j})=\sum_{i=1}^{m}y_{i}{\mathcal{N}}_{\tau}(x_{j}-x_{i})=y_{j}. (28)

Moreover, for any x\in\mathbb{R}^{d}, at most one of the terms {\mathcal{N}}_{\tau}(x-x_{i}), i=1,\dots,m, can be non-zero. Indeed, if {\mathcal{N}}_{\tau}(x-x_{i})\neq 0, then (26) gives x\in x_{i}+[-3\tau/2,3\tau/2]^{d}, and for any j\neq i, the bound \|x_{i}-x_{j}\|_{\infty}\geq\|x_{i}-x_{j}\|_{2}/\sqrt{d}\geq 2q_{\Lambda}/\sqrt{d}>3\tau together with the triangle inequality yields

\|x-x_{j}\|_{\infty}\geq\|x_{i}-x_{j}\|_{\infty}-\|x-x_{i}\|_{\infty}>3\tau-\frac{3\tau}{2}=\frac{3\tau}{2},

which implies {\mathcal{N}}_{\tau}(x-x_{j})=0. Hence 1-\sum_{i=1}^{m}\mathcal{N}_{\tau}(x-x_{i})\in[0,1], so that for every x\in\mathbb{I}^{d} both arguments of \tilde{\times}_{2,\nu} in (27) lie in [-1,1]. In particular,

\mathcal{N}_{\tau,\nu,D,g}(x_{j})=y_{j},\qquad j=1,\dots,m. (29)

Define further a function h_{D} on \mathbb{R}^{d} by

h_{D}(x):=\sum_{i=1}^{m}y_{i}{\mathcal{N}}_{\tau}(x-x_{i})+g_{n,L,U}(x)\left(1-\sum_{i=1}^{m}{\mathcal{N}}_{\tau}(x-x_{i})\right).

It follows from Lemma 4 that

|h_{D}(x)-\mathcal{N}_{\tau,\nu,D,g}(x)|\leq\nu,\qquad\forall x\in\mathbb{I}^{d}. (30)

If x-x_{i}\notin[-3\tau/2,3\tau/2]^{d} for all i=1,\dots,m, then it follows from (26) that \sum_{i=1}^{m}{\mathcal{N}}_{\tau}(x-x_{i})=0, which implies h_{D}(x)=g_{n,L,U}(x). Hence,

\|g_{n,L,U}-h_{D}\|_{L^{p}(\mathbb{I}^{d})}^{p}=\int_{\mathbb{I}^{d}}|g_{n,L,U}(x)-h_{D}(x)|^{p}dx\leq\sum_{i=1}^{m}\int_{[x_{i}-3\tau/2,x_{i}+3\tau/2]^{d}}|g_{n,L,U}(x)-h_{D}(x)|^{p}dx\leq m(3\tau)^{d}2^{p}(C^{*})^{p}.

This implies

\|g_{n,L,U}-h_{D}\|_{L^{p}(\mathbb{I}^{d})}\leq 2C^{*}3^{d/p}m^{1/p}\tau^{d/p}.

The above estimate together with (30) yields

\|g_{n,L,U}-\mathcal{N}_{\tau,\nu,D,g}\|_{L^{p}(\mathbb{I}^{d})}\leq\|h_{D}-\mathcal{N}_{\tau,\nu,D,g}\|_{L^{p}(\mathbb{I}^{d})}+\|g_{n,L,U}-h_{D}\|_{L^{p}(\mathbb{I}^{d})}\leq 2^{d/p}\nu+2C^{*}3^{d/p}m^{1/p}\tau^{d/p}.

Set \nu=\varepsilon and \tau\leq\min\{2q_{\Lambda}/(3\sqrt{d}),m^{-1/d}\varepsilon^{p/d}\}. We obtain

\|g_{n,L,U}-\mathcal{N}_{\tau,\nu,D,g}\|_{L^{p}(\mathbb{I}^{d})}\leq C^{\prime}\varepsilon, (31)

where C^{\prime}:=2^{d/p}+2C^{*}3^{d/p}. Denote \mathcal{N}^{*}(t)=\sigma(t)-\sigma(-t)=t. Recalling (29), we can define

f_{D,n,L,U,g}(x):=\sum_{i=1}^{m}y_{i}\overbrace{\mathcal{N}^{*}(\cdots\mathcal{N}^{*}}^{\mathcal{O}(L+\log\varepsilon^{-1})}({\mathcal{N}}_{\tau}(x-x_{i})))+C^{*}\tilde{\times}_{2,\nu}\left(\mathcal{N}^{*}\left(\frac{g_{n,L,U}(x)}{C^{*}}\right),1-\sum_{i=1}^{m}{\mathcal{N}}_{\tau}(x-x_{i})\right)

with \tau and \nu as above, so that the two terms on the right-hand side of f_{D,n,L,U,g} have the same depth. Then f_{D,n,L,U,g} is a DFCN of depth \mathcal{O}(L+\log\varepsilon^{-1}) and width \mathcal{O}(m+U+\log\varepsilon^{-1}), and it coincides with \mathcal{N}_{\tau,\nu,D,g} as a function of x, since \mathcal{N}^{*} realizes the identity map; in particular, (21) follows from (29). Noting further \rho_{X}\in\Xi_{p}, we have \|f\|_{L^{2}_{\rho_{X}}(\mathbb{I}^{d})}\leq D_{\rho_{X}}\|f\|_{L^{p}(\mathbb{I}^{d})} for every f\in L^{p}(\mathbb{I}^{d}). This together with (31) yields

\|g_{n,L,U}-f_{D,n,L,U,g}\|_{L^{2}_{\rho_{X}}(\mathbb{I}^{d})}\leq D_{\rho_{X}}C^{\prime}\varepsilon.

Recalling that there are infinitely many \tau satisfying \tau\leq\min\{2q_{\Lambda}/(3\sqrt{d}),m^{-1/d}\varepsilon^{p/d}\}, there are infinitely many such f_{D,n,L,U,g}. Theorem 6 is then proved by rescaling \varepsilon. ∎
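As a numerical sanity check of the blending function h_{D} used above (independent of the formal argument), the following sketch evaluates h_{D} for a toy teacher g and placeholder data of our own choosing, and verifies that it interpolates the samples while coinciding with the teacher away from them.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def bump(z, tau):
    # N_tau(z): 1 on [-tau, tau]^d, 0 outside [-3*tau/2, 3*tau/2]^d (trapezoid margin tau/2).
    T = (relu(z + 1.5 * tau) - relu(z + tau) - relu(z - tau) + relu(z - 1.5 * tau)) * (2.0 / tau)
    return relu(np.sum(T) - (len(z) - 1))

def h_D(x, X, y, teacher, tau):
    # Blend local bumps with the teacher: interpolation at the data, teacher elsewhere.
    bumps = np.array([bump(x - xi, tau) for xi in X])
    return float(np.dot(y, bumps) + teacher(x) * (1.0 - np.sum(bumps)))

teacher = lambda x: np.sin(np.sum(x))   # toy stand-in for the teacher network g
rng = np.random.default_rng(1)
X, y = rng.uniform(size=(20, 2)), rng.normal(size=20)
tau = 1e-3                              # small enough relative to the separation of this sample
print(all(abs(h_D(X[i], X, y, teacher, tau) - y[i]) < 1e-8 for i in range(20)))  # interpolation
x_far = np.array([10.0, 10.0])
print(abs(h_D(x_far, X, y, teacher, tau) - teacher(x_far)) < 1e-12)              # h_D = g away from the data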

VI-B Proofs

In this part, we prove our main results by using the proposed deepening scheme (for results concerning noisy data) and a functional analysis approach developed in [46] (for results concerning noiseless data).

Firstly, we prove Proposition 1 based on Lemma 3.

Proof:

If y_{i}=0 for all i=1,\dots,m, we can set f_{D,L}(x)=0 and the conclusion naturally holds. Otherwise, we define

\mathcal{N}_{\tau,D}(x):=\sum_{i=1}^{m}y_{i}{\mathcal{N}}_{-\tau,\tau,\tau/2}(x-x_{i}). (32)

If \tau<\frac{2q_{\Lambda}}{3\sqrt{d}}, then it follows from (28) that \mathcal{N}_{\tau,D}(x_{i})=y_{i}. Since \|f^{*}\|_{L^{p}(\mathbb{I}^{d})}\geq c, a direct computation yields

\|f^{*}-\mathcal{N}_{\tau,D}\|_{L^{p}(\mathbb{I}^{d})}\geq\|f^{*}\|_{L^{p}(\mathbb{I}^{d})}-\|\mathcal{N}_{\tau,D}\|_{L^{p}(\mathbb{I}^{d})}\geq c-\|\mathcal{N}_{\tau,D}\|_{L^{p}(\mathbb{I}^{d})}.

But (26) together with (5) and {\mathcal{N}}_{\tau}(x-x_{i})\leq 1 for any x\in\mathbb{I}^{d} yields

\|\mathcal{N}_{\tau,D}\|_{L^{p}(\mathbb{I}^{d})}\leq\sum_{i=1}^{m}\left(\int_{\mathbb{I}^{d}}\left|f^{*}(x_{i}){\mathcal{N}}_{-\tau,\tau,\tau/2}(x-x_{i})\right|^{p}dx\right)^{1/p}\leq\sum_{i=1}^{m}|f^{*}(x_{i})|\left(\int_{\mathbb{I}^{d}}|{\mathcal{N}}_{-\tau,\tau,\tau/2}(x-x_{i})|dx\right)^{1/p}\leq\sum_{i=1}^{m}|f^{*}(x_{i})|\left(\int_{x:\|x-x_{i}\|_{2}\leq\frac{2\tau}{3}}dx\right)^{1/p}=\sum_{i=1}^{m}|f^{*}(x_{i})|\left(\frac{3\tau}{2}\right)^{d/p}.

Therefore, for

\tau<\min\left\{\frac{2q_{\Lambda}}{3\sqrt{d}},\frac{2}{3}\left(\frac{c}{2}\right)^{p/d}\left(\sum_{i=1}^{m}|y_{i}|\right)^{-p/d}\right\}, (33)

we have \|\mathcal{N}_{\tau,D}\|_{L^{p}(\mathbb{I}^{d})}\leq c/2, which yields

\|f^{*}-\mathcal{N}_{\tau,D}\|_{L^{p}(\mathbb{I}^{d})}\geq c-c/2=c/2.

Note further that \mathcal{N}_{\tau,D} is a DFCN with 2 hidden layers of widths d_{1}=4dm and d_{2}=m. Since t=\sigma(t)-\sigma(-t), we can define f_{D,d_{1},d_{2},\dots,d_{L}} iteratively, starting from f_{D,d_{1},d_{2}}=\mathcal{N}_{\tau,D} with d_{1}\geq 4dm and d_{2}\geq m, by

f_{D,d_{1},d_{2},\dots,d_{\ell+1}}=\sigma(f_{D,d_{1},d_{2},\dots,d_{\ell}})-\sigma(-f_{D,d_{1},d_{2},\dots,d_{\ell}}).

Then f_{D,d_{1},d_{2},\dots,d_{L}}\in\Psi_{d_{1},d_{2},\dots,d_{L},m}. Recalling the construction in (32), different values of \tau correspond to different neural networks, and the above results hold for all \tau satisfying (33). Therefore, there are infinitely many deep ReLU nets of the form (32). This completes the proof of Proposition 1. ∎

In the following, we construct some real-valued functions to feed into \tilde{\times}_{\ell,\nu} and derive a deep-net-based linear space possessing good approximation properties. For t\in\mathbb{R}, define

\psi(t)=\sigma(t+2)-\sigma(t+1)-\sigma(t-1)+\sigma(t-2). (34)

Then

\psi(t)=\left\{\begin{array}{cc}1,&\mbox{if}\ |t|\leq 1,\\ 0,&\mbox{if}\ |t|\geq 2,\\ 2-|t|,&\mbox{if}\ 1<|t|<2.\end{array}\right. (35)
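For instance, if |t|\leq 1, then \sigma(t+2)=t+2, \sigma(t+1)=t+1 and \sigma(t-1)=\sigma(t-2)=0, so that \psi(t)=(t+2)-(t+1)=1; for 1<t<2 one similarly gets \psi(t)=(t+2)-(t+1)-(t-1)=2-t=2-|t|, and the remaining cases of (35) are verified in the same way.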

For N\in\mathbb{N}, \alpha=(\alpha^{(1)},\dots,\alpha^{(d)})\in\mathbb{N}_{0}^{d}, |\alpha|=\alpha^{(1)}+\dots+\alpha^{(d)}\leq s and \mathbf{j}=(j_{1},\dots,j_{d})\in\{0,1,\dots,N\}^{d}, define

\Phi_{N,\nu,s}:=\mbox{span}\left\{\tilde{\times}_{d+s,\nu}\left(\psi_{1,{\bf j}},\dots,\psi_{d,{\bf j}},\overbrace{x^{(1)},\dots,x^{(1)}}^{\alpha^{(1)}},\dots,\overbrace{x^{(d)},\dots,x^{(d)}}^{\alpha^{(d)}},\overbrace{1,\dots,1}^{s-|\alpha|}\right)\right\}, (36)

where

\psi_{k,{\bf j}}(x)=\psi\left(3N\left(x^{(k)}-\frac{j_{k}}{N}\right)\right). (37)

It is easy to see that for arbitrarily fixed N,\nu,s, \Phi_{N,\nu,s} is a linear space of dimension at most d(N+1)^{d}\binom{s+d}{d}. Each element in \Phi_{N,\nu,s} is a DFCN with depth \mathcal{O}((s+d)\log\nu^{-1}), d_{1}=\mathcal{O}\left(d(N+1)^{d}\binom{s+d}{d}\right) and d_{\ell}=\mathcal{O}(\log\nu^{-1}). The approximation capability of the constructed linear space was deduced in [25, Theorem 2] and [58].

Lemma 5

Let \nu\in(0,1) and s,N\in\mathbb{N}_{0}. If \nu=N^{-r-d} and f\in Lip^{(r,c_{0})}_{\mathbb{I}^{d}} with 0<r\leq s+1, then there holds

\min_{h\in\Phi_{N,\nu,s}}\|f-h\|_{L^{\infty}(\mathbb{I}^{d})}\leq C_{1}^{\prime}c_{0}N^{-r/d}, (38)

where C_{1}^{\prime} is a constant depending only on d and r.

To prove Theorem 2, we need the following lemma proved in [46]. It should be mentioned that the above assertions obviously remain valid for DFCNs with larger depth and width. To minimize the depth, we use the following functional-analysis tool, which presents a close relation between interpolation and approximation.

Lemma 6

Let \mathcal{U} be a (possibly complex) Banach space, \mathcal{V} a subspace of \mathcal{U}, and W^{*} a finite-dimensional subspace of \mathcal{U}^{*}, the dual of \mathcal{U}. If for every w^{*}\in W^{*} and some \gamma>1, \gamma independent of w^{*},

\|w^{*}\|_{\mathcal{U}^{*}}\leq\gamma\|w^{*}|_{\mathcal{V}}\|_{\mathcal{V}^{*}},

then for any u\in\mathcal{U} there exists v\in\mathcal{V} such that v interpolates u on W^{*}; that is, w^{*}(u)=w^{*}(v) for all w^{*}\in W^{*}. In addition, v approximates u in the sense that \|u-v\|_{\mathcal{U}}\leq(1+2\gamma)\,\mbox{dist}_{\mathcal{U}}(u,\mathcal{V}).

To use the above lemma, we need to construct a special function to facilitate the proof. Our construction is motivated by [46]. For any w^{*}=\sum_{j=1}^{m}c_{j}\delta_{x_{j}}\in W^{*}, define

g_{w}(x)=\sum_{j=1}^{m}\mbox{sgn}(c_{j})\left(1-\frac{\|x-x_{j}\|_{2}}{q_{\Lambda}}\right)_{+}, (39)

where \delta_{x_{i}} is the point evaluation operator and \mbox{sgn}(t) is the sign function satisfying \mbox{sgn}(t)=1 for t\geq 0 and \mbox{sgn}(t)=-1 for t<0. Then it is easy to see that g_{w} is a continuous function. In the following, we present three important properties of g_{w}.

Lemma 7

Let W^{*}=\mbox{span}\{\delta_{x_{i}}:i=1,\dots,m\}. Then for any w^{*}\in W^{*}, there holds (i) \|g_{w}\|_{L^{\infty}(\mathbb{I}^{d})}=1, (ii) w^{*}(g_{w})=\|w^{*}\|, and (iii) g_{w}\in Lip_{\mathbb{I}^{d}}^{(1,q_{\Lambda}^{-1})}.

Proof:

Denote A_{j}=B(x_{j},q_{\Lambda})\cap\mathbb{I}^{d}, where B(x_{j},q_{\Lambda}) is the ball with center x_{j} and radius q_{\Lambda}. Then it follows from the definition of q_{\Lambda} that \dot{A}_{j}\cap\dot{A}_{k}=\varnothing for j\neq k, where \dot{A}_{j}=A_{j}\backslash\partial A_{j} and \partial A_{j} denotes the boundary of A_{j}. Without loss of generality, we assume \mathbb{I}^{d}\backslash\bigcup_{j=1}^{m}A_{j}\neq\varnothing. From (39), we have g_{w}(x)=0 for x\in\mathbb{I}^{d}\backslash\bigcup_{j=1}^{m}A_{j}. If there exists some j\in\{1,\dots,m\} such that x\in A_{j}, then

g_{w}(x)=\mbox{sgn}(c_{j})\left(1-\frac{\|x-x_{j}\|_{2}}{q_{\Lambda}}\right).

So

|g_{w}(x)|=1-\frac{\|x-x_{j}\|_{2}}{q_{\Lambda}}\leq|g_{w}(x_{j})|=1.

Thus, |g_{w}(x)|\leq 1 for all x\in\mathbb{I}^{d}. Since |g_{w}(x_{j})|=1, j=1,\dots,m, we get \|g_{w}\|_{L^{\infty}(\mathbb{I}^{d})}=1, which verifies (i). For w^{*}\in W^{*}, we have

w^{*}(g_{w})=\sum_{j=1}^{m}c_{j}\delta_{x_{j}}(g_{w})=\sum_{j=1}^{m}c_{j}g_{w}(x_{j})=\sum_{j=1}^{m}c_{j}\mbox{sgn}(c_{j})=\sum_{j=1}^{m}|c_{j}|=\|w^{*}\|.

Thus (ii) holds. It remains to prove that g_{w} satisfies (iii). We divide the proof into four cases.

If x,x^{\prime}\in A_{j} for some j\in\{1,\dots,m\}, then it follows from (39) that

|g_{w}(x)-g_{w}(x^{\prime})|=\left|\mbox{sgn}(c_{j})\left(1-\frac{\|x-x_{j}\|_{2}}{q_{\Lambda}}\right)-\mbox{sgn}(c_{j})\left(1-\frac{\|x^{\prime}-x_{j}\|_{2}}{q_{\Lambda}}\right)\right|\leq\frac{|\|x-x_{j}\|_{2}-\|x^{\prime}-x_{j}\|_{2}|}{q_{\Lambda}}\leq\frac{\|x-x^{\prime}\|_{2}}{q_{\Lambda}}.

If x,x^{\prime}\in\mathbb{I}^{d}\backslash\bigcup_{j=1}^{m}A_{j}, then the definition of g_{w} yields g_{w}(x)=g_{w}(x^{\prime})=0, which implies |g_{w}(x)-g_{w}(x^{\prime})|\leq\frac{\|x-x^{\prime}\|_{2}}{q_{\Lambda}}.

If x\in A_{j} and x^{\prime}\in A_{k} for k\neq j, then it is easy to see that g_{w}(z)=0 for any z\in\partial B(x_{j},q_{\Lambda}), j=1,\dots,m. Let z_{j},z_{k} be the intersections of the line segment xx^{\prime} with \partial B(x_{j},q_{\Lambda}) and with \partial B(x_{k},q_{\Lambda}), respectively. Then we have \|x-x^{\prime}\|_{2}\geq\|x-z_{j}\|_{2}+\|x^{\prime}-z_{k}\|_{2}. Since x,z_{j}\in A_{j}, x^{\prime},z_{k}\in A_{k} and g_{w}(z_{j})=g_{w}(z_{k})=0, we have

|g_{w}(x)-g_{w}(x^{\prime})|\leq|g_{w}(x)-g_{w}(z_{j})|+|g_{w}(x^{\prime})-g_{w}(z_{k})|\leq\frac{\|x-z_{j}\|_{2}}{q_{\Lambda}}+\frac{\|x^{\prime}-z_{k}\|_{2}}{q_{\Lambda}}\leq\frac{\|x-x^{\prime}\|_{2}}{q_{\Lambda}}.

If x\in A_{j} for some j\in\{1,\dots,m\} and x^{\prime}\in\mathbb{I}^{d}\backslash\bigcup_{j=1}^{m}A_{j}, then we take z_{j} to be the intersection of \partial B(x_{j},q_{\Lambda}) and the line segment xx^{\prime}. Then \|x-z_{j}\|_{2}\leq\|x-x^{\prime}\|_{2} and

|g_{w}(x)-g_{w}(x^{\prime})|=|g_{w}(x)|=|g_{w}(x)-g_{w}(z_{j})|\leq\frac{\|x-z_{j}\|_{2}}{q_{\Lambda}}\leq\frac{\|x-x^{\prime}\|_{2}}{q_{\Lambda}}.

Combining all the above cases verifies g_{w}\in Lip_{\mathbb{I}^{d}}^{(1,q_{\Lambda}^{-1})}. This completes the proof of Lemma 7. ∎
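Properties (i) and (ii) can also be checked numerically. The sketch below is our own illustration with randomly generated centers and coefficients; it evaluates g_{w} on a random grid and at the data points.

import numpy as np

def g_w(x, centers, c, q):
    # g_w(x) = sum_j sgn(c_j) * (1 - ||x - x_j||_2 / q)_+, cf. (39).
    signs = np.where(c >= 0, 1.0, -1.0)
    dist = np.linalg.norm(centers - x, axis=1)
    return float(np.dot(signs, np.maximum(1.0 - dist / q, 0.0)))

rng = np.random.default_rng(0)
centers, c = rng.uniform(size=(10, 2)), rng.normal(size=10)
q = 0.5 * min(np.linalg.norm(centers[i] - centers[j])
              for i in range(10) for j in range(10) if i != j)   # separation radius q_Lambda

grid = rng.uniform(size=(5000, 2))
print(max(abs(g_w(x, centers, c, q)) for x in grid) <= 1.0 + 1e-12)        # (i): |g_w| <= 1
w_star_of_g = sum(c[j] * g_w(centers[j], centers, c, q) for j in range(10))
print(abs(w_star_of_g - np.sum(np.abs(c))) < 1e-10)                        # (ii): w*(g_w) = sum_j |c_j|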

With the above tools, we are in a position to prove Theorem 2.

Proof:

Let \mathcal{U}=C(\mathbb{I}^{d}), the space of continuous functions defined on \mathbb{I}^{d}, W^{*}=\mbox{span}\{\delta_{x_{i}}\}_{i=1}^{m} and \mathcal{V}=\Phi_{N,\nu,s} in Lemma 6. For every w^{*}\in W^{*}, we have w^{*}=\sum_{i=1}^{m}c_{i}\delta_{x_{i}} for some \vec{c}=(c_{1},\dots,c_{m})^{T}\in\mathbb{R}^{m}. Without loss of generality, we assume \|w^{*}\|=\sum_{i=1}^{m}|c_{i}|=1. Let g_{w} be defined by (39). Then it follows from Lemma 5 and Lemma 7 that there is some h_{g}\in\mathcal{V} such that

\|g_{w}-h_{g}\|_{L^{\infty}(\mathbb{I}^{d})}\leq C_{1}^{\prime}q_{\Lambda}^{-1}N^{-1/d}.

Let N\geq\left\lceil\left(\frac{(\gamma+1)C_{1}^{\prime}}{(\gamma-1)q_{\Lambda}}\right)^{d}\right\rceil for some \gamma>1. We have

\|g_{w}-h_{g}\|_{L^{\infty}(\mathbb{I}^{d})}\leq\frac{\gamma-1}{\gamma+1}.

This together with (i) in Lemma 7 yields

\|h_{g}\|_{L^{\infty}(\mathbb{I}^{d})}\leq\frac{\gamma-1}{\gamma+1}+1=\frac{2\gamma}{\gamma+1}.

Since w^{*} is a linear functional and (ii) in Lemma 7 holds, there holds

1=\|w^{*}\|=w^{*}(g_{w})=w^{*}(g_{w}-h_{g})+w^{*}(h_{g}).

Hence, from

|w^{*}(g_{w}-h_{g})|\leq\|w^{*}\|\,\|g_{w}-h_{g}\|_{L^{\infty}(\mathbb{I}^{d})}\leq\frac{\gamma-1}{\gamma+1},

we have

w^{*}(h_{g})\geq 1-|w^{*}(g_{w}-h_{g})|\geq 1-\frac{\gamma-1}{\gamma+1}=\frac{2}{\gamma+1}.

Consequently,

\|w^{*}\|=1\leq\frac{\gamma+1}{2}w^{*}(h_{g})\leq\frac{\gamma+1}{2}\|w^{*}|_{\Phi_{N,\nu,s}}\|\,\|h_{g}\|_{L^{\infty}(\mathbb{I}^{d})}\leq\frac{\gamma+1}{2}\cdot\frac{2\gamma}{\gamma+1}\|w^{*}|_{\Phi_{N,\nu,s}}\|=\gamma\|w^{*}|_{\Phi_{N,\nu,s}}\|.

Setting \gamma=2, for any f^{*}\in Lip_{\mathbb{I}^{d}}^{(r,c_{0})}, it follows from Lemma 6 and Lemma 5 that there exists some h^{*}\in\mathcal{V}=\Phi_{N,\nu,s} such that h^{*}(x_{i})=f^{*}(x_{i}) and

\|h^{*}-f^{*}\|_{L^{\infty}(\mathbb{I}^{d})}\leq 5\min_{h\in\mathcal{V}}\|h-f^{*}\|_{L^{\infty}(\mathbb{I}^{d})}\leq 5C_{1}^{\prime}c_{0}N^{-r/d}.

Setting \nu\sim N^{-r-d} and recalling (37) and t=\sigma(t)-\sigma(-t), \Phi_{N,\nu,s} defined in (36) can be regarded as a set of DFCNs with depth \mathcal{O}(\log N), d_{1}=\mathcal{O}(N^{d}) and d_{\ell}=\mathcal{O}(\log N). Noting that there are infinitely many \nu\sim N^{-r-d}, there are infinitely many DFCNs satisfying the above assertions. This completes the proof of Theorem 2. ∎

The proofs of the other main results follow by combining Theorem 6 with existing results.

Proof:

Setting the teacher network g=f_{global}^{under} in (14), we obtain a student net h based on Theorem 6 with \varepsilon=m^{-2r/(2r+d)}. Then, Theorem 3 follows from Theorem 6 and (14) directly. ∎

Proof:

Based on Theorem 6, we can set the teacher network to be the under-parameterized deep ReLU net f_{global}^{under} in (16). Then, Theorem 4 follows from Theorem 6 and (16) directly. ∎

Proof:

According to Theorem 6, it is easy to obtain a student network h based on the teacher network g=f_{global}^{under} in (18). Then, Theorem 5 follows directly from Theorem 6 and (18). ∎

Acknowledgement

The work of S. B. Lin is supported partially by the National Key R&D Program of China (No.2020YFA0713900) and the National Natural Science Foundation of China (No.62276209). The work of Y. Wang is supported partially by the National Natural Science Foundation of China (No.11971374). The work of D. X. Zhou is supported partially by the NSFC/RGC Joint Research Scheme [RGC Project No. N-CityU102/20 and NSFC Project No. 12061160462], Germany/Hong Kong Joint Research Scheme [Project No. G-CityU101/20], and the Laboratory for AI-Powered Financial Technologies.

References

  • [1] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. ICML, 2019.
  • [2] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv:1811.04918, 2018.
  • [3] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
  • [4] A. R. Barron, A. Cohen, W. Dahmen and R. Devore. Approximation and learning by greedy algorithms. Ann. Statist., 36: 64-94, 2008.
  • [5] P. L. Bartlett, N. Harvey, C. Liaw and A. Mehrabian. Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks. J. Mach. Learn. Res., 20(63): 1-17, 2019.
  • [6] P. L. Bartlett and P. M. Long. Failures of model-dependent generalization bounds for least-norm interpolation. J. Mach. Learn. Res., 22 (204): 1-15, 2021.
  • [7] P. L. Bartlett, P. M. Long, G. Lugosi and A. Tsigler. Benign overfitting in linear regression. Proc. Nat. Acad. Sci. USA, 117 (48): 30063-30070, 2020.
  • [8] B. Bauer and M. Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist., 47(4): 2261-2285, 2019.
  • [9] M. Belkin. Approximation beats concentration? An approximation view on inference with smooth radial kernels. COLT 2018: 1348-1361.
  • [10] M. Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. arXiv:2105.14368, 2021.
  • [11] M. Belkin, D. Hsu and P. Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. NIPS 2018: 2300-2311.
  • [12] M. Belkin, D. Hsu, S. Ma and S. Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Nat. Acad. Sci. USA, 116 (32): 15849-15854, 2019.
  • [13] Y. Bengio, A. Courville and P. Vincent. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intel., 35: 1798–1828, 2013.
  • [14] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. NIPS 2019: 10835-10845.
  • [15] A. Caponnetto and E. DeVito. Optimal rates for the regularized least squares algorithm. Found. Comput. Math., 7: 331-368, 2007.
  • [16] C. K. Chui, X. Li and H. N. Mhaskar. Neural networks for localized approximation. Math. Comput., 63: 607-623, 1994.
  • [17] C. K. Chui, S. B. Lin, B. Zhang and D. X. Zhou. Realization of spatial sparseness by deep ReLU nets with massive data. IEEE Trans. Neural Netw. Learn. Syst., In Press, 2020.
  • [18] C. K. Chui, S. B. Lin and D. X. Zhou, Deep neural networks for rotation-invariance approximation and learning. Anal. Appl., 17(3): 737-772, 2019.
  • [19] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.
  • [20] S. Du, J. Lee, Y. Tian, B. Poczós and A. Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. ICML 2018: 1339–1348.
  • [21] S. Du, X. Zhai, B. Poczós and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. ICLR 2019.
  • [22] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. The MIT Press, 2016.
  • [23] Z. C. Guo, L. Shi and S. B. Lin. Realizing data features by deep nets. IEEE Trans. Neural Netw. Learn. Syst., In Press, 2019.
  • [24] L. Györfi, M. Kohler, A. Krzyzak and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, Berlin, 2002.
  • [25] Z. Han, S. Yu, S. B. Lin, and D. X. Zhou. Depth selection for deep ReLU nets in feature extraction and generalization. IEEE Trans. on Pattern Anal. Mach. Intel., In Press.
  • [26] T. Hastie, A. Montanari, S. Rosset and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560.
  • [27] G. E. Hinton, S. Osindero and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7): 1527–1554, 2006.
  • [28] J. Hong and P. R. Hoban. Writing more compelling creative appeals: a deep learning-based approach. Market. Sci., In Press.
  • [29] M. Imaizumi and K. Fukumizu. Deep Neural Networks Learn Non-Smooth Functions Effectively. arXiv preprint arXiv:1802.04474, 2018.
  • [30] M. Kohler and A. Krzyżak. Nonparametric regression based on hierarchical interaction models. IEEE Trans. Inf. Theory, 63: 1620–1630, 2017.
  • [31] M. Kohler and A. Krzyżak. Over-parametrized deep neural networks do not generalize well. arXiv:1912.03925, 2019.
  • [32] A. Krizhevsky, I. Sutskever and G. E. Hinton. Imagenet classification with deep convolutional neural networks. NIPS 2012: 1097–1105.
  • [33] H. Lee, P. T. Pham, L. Yan and A. Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. NIPS 2009.
  • [34] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. NIPS 2018: 8157-8166.
  • [35] W. Li. Generalization error of minimum weighted norm and kernel interpolation. SIAM J. Math. Data Sci., 3(1): 414-438, 2021.
  • [36] T. Liang and A. Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. Ann. Statist., 48: 1329–1347, 2020.
  • [37] S. B. Lin. Generalization and expressivity for deep nets. IEEE Trans. Neural Netw. Learn. Syst. 30: 1392–1406, 2019.
  • [38] S. B. Lin, X. Chang and X. Sun. Kernel interpolation of high dimensional scattered data. arXiv:2009.01514v2, 2020.
  • [39] X. Liu, D. Wang and S. B. Lin. Construction of Deep ReLU Nets for Spatially Sparse Learning. IEEE Transactions on Neural Networks and Learning Systems, In Press.
  • [40] V. Maiorov. Approximation by neural networks and learning theory. J. Complex., 22: 102-117, 2006.
  • [41] S. Mei, A. Montanari and P. M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proc. Nat. Acad. Sci. USA, 115 (33): E7665-E7671.
  • [42] S. Mendelson and R. Vershinin. Entropy and the combinatorial dimension. Invent. Math., 125: 37-55, 2003.
  • [43] H. N. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Anal. Appl., 14: 829–848, 2016.
  • [44] N. Mücke and I. Steinwart. Empirical Risk Minimization in the Interpolating Regime with Application to Neural Network Learning. arXiv preprint arXiv:1905.10686.
  • [45] V. Muthukumar, K. Vodrahalli, V. Subramanian and A. Saha. Harmless interpolation of noisy data in regression. IEEE J. Selected Areas Inf. Theory, 1 (1): 67-83.
  • [46] F. J. Narcowich and J. D. Ward. Scattered-Data interpolation on RnR^{n}: Error estimates for radial basis and band-limited functions. SIAM J. Math. Anal., 36: 284-300, 2004.
  • [47] B. Neyshabur, S. Bhojanapalli and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv:1707.09564v2, 2018.
  • [48] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108: 296-330, 2018.
  • [49] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8: 143-195, 1999.
  • [50] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli and J. Sohl-Dickstein, On the expressive power of deep neural networks. arXiv: 1606.05336, 2016.
  • [51] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Intern. J. Auto. Comput., DOI: 10.1007/s11633-017-1054-2, 2017.
  • [52] M. Qi, Y. Shi, Y. Qi, C. Ma, R. Yuan, D. Wu and Z. J. M. Shen. A practical end-to-end inventory management model with deep learning. Management Sci., In Press.
  • [53] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Statist., 48(4): 1875-1897, 2020.
  • [54] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in uq. Anal. Appl., 17: 19–55, 2019.
  • [55] U. Shaham, A. Cloninger and R. R. Coifman. Provable approximation properties for deep neural networks. Appl. Comput. Harmonic Anal., 44: 537–557, 2018.
  • [56] L. Shi. Learning theory estimates for coefficient-based regularized regression. Appl. Comput. Harmonic Anal., 34: 252–265, 2013.
  • [57] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. V. D. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam and M. Lanctot. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587): 484–489, 2016.
  • [58] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94: 103-114, 2017.
  • [59] C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv: 1611.03530, 2016.
  • [60] D. X. Zhou. Deep distributed convolutional neural networks: Universality. Anal. Appl., 16: 895-919, 2018.
  • [61] D. X. Zhou. Universality of deep convolutional neural networks. Appl. Comput. Harmonic Anal., 48: 784-794, 2020.
  • [62] D. X. Zhou. Theory of deep convolutional neural networks: Downsampling. Neural Netw., 124: 319-327, 2020.