
Simultaneous Neural Network Approximation for Smooth Functions

Sean Hon, Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong SAR, [email protected]; Haizhao Yang, Department of Mathematics, Purdue University, IN 47907, USA, [email protected]
Abstract

We establish in this work approximation results of deep neural networks for smooth functions measured in Sobolev norms, motivated by the recent development of numerical solvers for partial differential equations using deep neural networks. Our approximation results are nonasymptotic in the sense that the error bounds are explicitly characterized in terms of both the width and depth of the networks simultaneously, with all involved constants explicitly determined. Namely, for $f\in C^{s}([0,1]^{d})$, we show that deep ReLU networks of width $\mathcal{O}(N\log N)$ and of depth $\mathcal{O}(L\log L)$ can achieve a nonasymptotic approximation rate of $\mathcal{O}(N^{-2(s-1)/d}L^{-2(s-1)/d})$ with respect to the $\mathcal{W}^{1,p}([0,1]^{d})$ norm for $p\in[1,\infty)$. If either the ReLU function or its square is applied as the activation function to construct deep neural networks of width $\mathcal{O}(N\log N)$ and of depth $\mathcal{O}(L\log L)$ to approximate $f\in C^{s}([0,1]^{d})$, the approximation rate is $\mathcal{O}(N^{-2(s-n)/d}L^{-2(s-n)/d})$ with respect to the $\mathcal{W}^{n,p}([0,1]^{d})$ norm for $p\in[1,\infty)$.

keywords:
Deep neural networks, Sobolev norm, ReLU$^{k}$ activation functions, approximation theory

1 Introduction

Over the past decades, deep neural networks have made remarkable impacts in various areas of science and engineering. With the aid of high-performance computing equipment and an abundance of high-quality data, neural network based methods outperform traditional machine learning methods in a wide range of applications, including image classification, object detection, and speech recognition, to name just a few. The success of neural networks also motivates their application in scientific computing, including the recovery of governing equations for mathematical modeling and prediction [52, 12, 40, 28, 22, 33] and solving partial differential equations (PDEs) [41, 13, 21, 26, 15, 16, 25, 18, 6, 27].

Neural network based methods have evoked many open problems in mathematical theory, notwithstanding their success in practice. In a typical supervised learning algorithm, a potentially high-dimensional target function $f(\bm{x})$ defined on a domain $\Omega$ is to be learned from a finite set of data samples $\{(\bm{x}_{i},f(\bm{x}_{i}))\}_{i=1}^{n}$. When a deep network is integrated in the learning process, one needs to identify a deep network $\phi(\bm{x};\mathbf{\theta_{S}})$, with parameters $\mathbf{\theta_{S}}$, to determine $f(\bm{x})$ for unseen data samples $\bm{x}$; namely, the following optimization problem arises

\[
\mathbf{\theta_{S}}=\text{arg}\min_{\mathbf{\theta}}R_{S}(\mathbf{\theta}):=\text{arg}\min_{\mathbf{\theta}}\tfrac{1}{n}\sum_{i=1}^{n}l(\phi(x_{i};\theta),f(x_{i})), \tag{1}
\]

where $l$ denotes a loss function.

Now, let us inspect the overall inference error, which is estimated by $R_{D}(\mathbf{\theta_{S}})$, where $R_{D}(\mathbf{\theta}):=E_{\bm{x}\sim U(\Omega)}[l(\phi(\bm{x};\mathbf{\theta}),f(\bm{x}))]$. In reality, $U(\Omega)$ is unknown and only finitely many samples from this distribution are available. Hence, the empirical loss $R_{S}(\mathbf{\theta})$ is minimized in the hope of obtaining $\phi(\bm{x};\mathbf{\theta_{S}})$, instead of minimizing the population loss $R_{D}(\mathbf{\theta})$ to obtain $\phi(\bm{x};\mathbf{\theta_{D}})$, where $\mathbf{\theta_{D}}=\text{arg}\min_{\mathbf{\theta}}R_{D}(\mathbf{\theta})$. In practice, a numerical optimization method for solving (1) may result in a numerical solution (denoted by $\mathbf{\theta_{\mathcal{N}}}$) that is not necessarily a global minimizer $\mathbf{\theta_{S}}$. Therefore, the neural network actually learned to infer $f(\bm{x})$ is $\phi(\bm{x};\mathbf{\theta_{\mathcal{N}}})$, and the corresponding inference error is measured by $R_{D}(\mathbf{\theta_{\mathcal{N}}})$. By the discussion just above, it is crucial to quantify $R_{D}(\mathbf{\theta_{\mathcal{N}}})$ to see how good the learned neural network $\phi(\bm{x};\mathbf{\theta_{\mathcal{N}}})$ is, since $R_{D}(\mathbf{\theta_{\mathcal{N}}})$ is the expected inference error over all possible data samples. Note that

\[
R_{D}(\mathbf{\theta_{\mathcal{N}}})\leq\underbrace{R_{D}(\mathbf{\theta_{D}})}_{\rm{approximation}}+\underbrace{[R_{S}(\mathbf{\theta_{\mathcal{N}}})-R_{S}(\mathbf{\theta_{S}})]}_{\rm{optimization}}+\underbrace{[R_{D}(\mathbf{\theta_{\mathcal{N}}})-R_{S}(\mathbf{\theta_{\mathcal{N}}})]+[R_{S}(\mathbf{\theta_{D}})-R_{D}(\mathbf{\theta_{D}})]}_{\rm{generalization}}, \tag{2}
\]

where the inequality comes from the fact that $[R_{S}(\mathbf{\theta_{S}})-R_{S}(\mathbf{\theta_{D}})]\leq 0$ since $\mathbf{\theta_{S}}$ is a global minimizer of $R_{S}(\mathbf{\theta})$.
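Indeed, (2) can be seen from the elementary identity
\[
R_{D}(\mathbf{\theta_{\mathcal{N}}})=R_{D}(\mathbf{\theta_{D}})+[R_{S}(\mathbf{\theta_{\mathcal{N}}})-R_{S}(\mathbf{\theta_{S}})]+[R_{S}(\mathbf{\theta_{S}})-R_{S}(\mathbf{\theta_{D}})]+[R_{D}(\mathbf{\theta_{\mathcal{N}}})-R_{S}(\mathbf{\theta_{\mathcal{N}}})]+[R_{S}(\mathbf{\theta_{D}})-R_{D}(\mathbf{\theta_{D}})]
\]
after dropping the nonpositive bracket $[R_{S}(\mathbf{\theta_{S}})-R_{S}(\mathbf{\theta_{D}})]$.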

Synergies from approximation theory [51, 42, 4, 43, 37, 36, 31, 17, 39, 50, 49, 14, 47, 23, 8, 7], optimization theory [24, 35, 34, 2, 11, 53, 1, 10, 48], and generalization theory [24, 5, 49, 3, 32, 30, 29] have led to many recent advances in the mathematical investigation for deep learning, regarding the error bound (2). All of these areas, having different emphases and angles, have fostered many separate research directions.

Providing an estimate of the first error term of (2), $R_{D}(\mathbf{\theta_{D}})$, belongs to the regime of approximation theory and is one of the main concerns of this paper. The results established in this work provide an upper bound of $R_{D}(\mathbf{\theta_{D}})$, estimated explicitly in terms of the size of the network, e.g. its width and depth, with a nonasymptotic approximation rate for smooth functions measured in Sobolev norms. Our approximation results are nonasymptotic in the sense that all involved constants are explicitly determined. We summarize our first main result below for networks with the rectified linear unit (ReLU) $\sigma_{1}$ as the activation function:

Theorem 1.1

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$, $s>1$, satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|\bm{\alpha}|\leq s$. For any $N,L\in\mathbb{N}^{+}$ and $p\in[1,\infty)$, there exists a $\sigma_{1}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $27s^{2}(L+2)\log_{2}(4L)$ such that

\[
\|f-\phi\|_{\mathcal{W}^{1,p}([0,1]^{d})}\leq 85(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.
\]

For smooth functions, we show in Theorem 1.1 that deep ReLU networks of width $\mathcal{O}(N\log N)$ and of depth $\mathcal{O}(L\log L)$ can achieve an approximation rate of $\mathcal{O}(N^{-2(s-1)/d}L^{-2(s-1)/d})$ with respect to the $\mathcal{W}^{1,p}([0,1]^{d})$ norm. Note that ReLU networks are in fact piecewise linear functions, so only their first (weak) derivatives can be nonzero. The above approximation rate in the $L^{p}$ norm estimated in terms of $N$ and $L$ was already covered in [31]. When the $\mathcal{W}^{n,p}$ norm is considered, where $0<n<1$ is not an integer, the interpolation technique used in [19] can be combined with our method here to develop new approximation rates, which is left as future work.

To achieve rates truly in terms of width and depth without the logarithmic terms, we also have the following corollary by setting $\widetilde{N}=\mathcal{O}(N\log N)$ and $\widetilde{L}=\mathcal{O}(L\log L)$ and making use of

\[
(N\ln N)^{-2(s-1)/d}(L\ln L)^{-2(s-1)/d}\leq\mathcal{O}(N^{-2(s-\rho)/d}L^{-2(s-\rho)/d})
\]

for $\rho\in(1,s)$.

Corollary 1.2

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$, $s>1$, satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|\bm{\alpha}|\leq s$. For any $N,L\in\mathbb{N}^{+}$, $\rho\in(1,s)$, and $p\in[1,\infty)$, there exist $C_{1}(s,d)$, $C_{2}(s,d)$, $C_{3}(s,d,\rho)$, and a $\sigma_{1}$-NN $\phi$ with width $C_{1}N$ and depth $C_{2}L$ such that

\[
\|f-\phi\|_{\mathcal{W}^{1,p}([0,1]^{d})}\leq C_{3}N^{-2(s-\rho)/d}L^{-2(s-\rho)/d}.
\]

Note that the constants $C_{1}$, $C_{2}$, and $C_{3}$ in Corollary 1.2 can be explicitly determined; we leave the details to the reader.

To obtain approximation results in terms of the number of parameters in neural networks, the following corollary follows from Theorem 1.1 by setting $N=\mathcal{O}(1)$ and $\varepsilon=\mathcal{O}(L^{-2(s-1)/d})$.

Corollary 1.3

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$, $s>1$, satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|\bm{\alpha}|\leq s$. Given any $\varepsilon>0$ and $p\in[1,\infty)$, there exists a $\sigma_{1}$-NN $\phi$ with $\mathcal{O}(\varepsilon^{-d/(2(s-1))}\ln\tfrac{1}{\varepsilon})$ parameters such that

\[
\|f-\phi\|_{\mathcal{W}^{1,p}([0,1]^{d})}\leq\varepsilon.
\]

Corollary 1.3 partially recovers [19, Corollary 4.1], in which the number of parameters is of order $\mathcal{O}(\varepsilon^{-d/(s-1)}\ln^{2}(\varepsilon^{-s/(s-1)}))$.
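For the reader's convenience, the parameter count behind Corollary 1.3 can be traced as follows (a brief sketch, with all constants absorbed into $\mathcal{O}(\cdot)$): when $N=\mathcal{O}(1)$, the network of Theorem 1.1 has width $\mathcal{O}(1)$ and depth $\mathcal{O}(L\log L)$, hence
\[
\#\text{parameters}=\mathcal{O}(\text{width}^{2}\cdot\text{depth})=\mathcal{O}(L\log L),
\]
and requiring $L^{-2(s-1)/d}=\mathcal{O}(\varepsilon)$ gives $L=\mathcal{O}(\varepsilon^{-d/(2(s-1))})$ and $\log L=\mathcal{O}(\ln\tfrac{1}{\varepsilon})$, which yields the stated $\mathcal{O}(\varepsilon^{-d/(2(s-1))}\ln\tfrac{1}{\varepsilon})$ parameters.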

Considering smoother neural networks in which either the ReLU function $\sigma_{1}$ or its square $\sigma_{1}^{2}$ is applied, we provide similar results in the following Theorem 1.4. Namely, these networks of width $\mathcal{O}(N\log N)$ and of depth $\mathcal{O}(L\log L)$ can achieve the approximation rate of $\mathcal{O}(N^{-2(s-n)/d}L^{-2(s-n)/d})$ with respect to the $\mathcal{W}^{n,p}([0,1]^{d})$ norm. It is worth noting that the restriction on the norm is now relaxed from $\mathcal{W}^{1,p}([0,1]^{d})$ to $\mathcal{W}^{n,p}([0,1]^{d})$ with such smoother networks.

Theorem 1.4

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$ satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|\bm{\alpha}|\leq s$. For any $N,L\in\mathbb{N}^{+}$ satisfying $(L-2-\log_{2}N)N\geq s$, and $p\in[1,\infty)$, there exists a $\sigma_{2}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $10(L+2)\log_{2}(4L)$ such that

\[
\|f-\phi\|_{\mathcal{W}^{n,p}([0,1]^{d})}\leq 3(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d},
\]

where $n<s$ is a positive integer.

Our work focuses particularly on Sobolev norms, which are suitable for studying partial differential equations. The above results concern ReLU$^{k}$ network approximation of functions with respect to Sobolev norms. Along this research direction, the work in [19] provides asymptotic upper bounds, with unspecified prefactors, with respect to the $\mathcal{W}^{n,p}$ norm, where $n$ is a fraction between zero and one, by means of localized polynomials. Making use of the same polynomials, asymptotic approximation results measured with respect to higher order $\mathcal{W}^{n,p}$ norms were provided in [20]. These results were developed for neural networks with smoother activation functions, in addition to ReLU networks. Measured in the $\mathcal{W}^{1,p}$ norm, it was shown in [38] that ReLU networks can achieve essentially the same approximation rates as free-knot spline approximations. In contrast with these existing results, our approximation results are nonasymptotic with known pre-factors, established in the spirit of the explicit error characterization methodology developed in [31, 44].

Recall from [9, Theorem 4.2] by DeVore et al. the following ostensibly negative result on the continuity of the weight selection. Suppose that there is a continuous map $\Sigma$ from the unit ball of the Sobolev space with smoothness $n$, i.e. $\mathcal{W}^{n,p}$, to $\mathbb{R}^{W}$ such that $\|f-g(\Sigma(f))\|_{L^{p}}\leq\varepsilon$ for all $f$ in the unit ball of $\mathcal{W}^{n,p}$, where $W$ denotes a fixed number of parameters and $g$ is a map realizing a deep neural network from a given set of parameters in $\mathbb{R}^{W}$; then $W\geq C\varepsilon^{-d/n}$ with some constant $C$ depending only on $n$. This in a way means that any such constructive approximation by ReLU networks cannot have a continuous weight selection property if the approximation rate is better than $CW^{-n/d}$ in the number of parameters, and hence a stable numerical implementation with such an error rate does not exist. It must, however, be noted that [9, Theorem 4.2] is basically a min-max criterion for evaluating continuous weight selection maps, describing the worst case. That is, the approximation result is obtained by minimizing over all continuous maps $\Sigma$ and network realizations $g$ and maximizing over all target functions. For most smooth functions practically encountered in applications, the theorem does not eliminate the possible cases in which they might still enjoy a continuous weight selection. Thus, there could be a stable numerical algorithm that can achieve our derived approximation rate. In other words, there is a special subclass of functions arising in practice for which a continuous assignment of the weights exists. It is interesting future work to characterize such a subclass. Finally, to the best of our knowledge, there is not any efficient numerical algorithm achieving the approximation rate with continuous weight selection, especially when the dimension is large. Therefore, approximation results with or without continuous weight selection both face the difficulty of numerical implementation. Designing numerical algorithms to achieve these rates has been an active research field recently.

It is remarked that providing error estimates in the context of high-dimensional PDE problems using the newly introduced Floor-ReLU networks [44], which are fully connected neural networks with either the floor function $\lfloor x\rfloor$ or ReLU as the activation function in each neuron, will be an intriguing option for future work, since these networks conquer the curse of dimensionality in terms of approximation theory. Other neural networks can also be considered. Along this line of work, the novel Floor-Exponential-Step networks with merely three hidden layers and width of order $\mathcal{O}(N)$ constructed in [46] can approximate $d$-dimensional Lipschitz continuous functions with an exponentially small error of order $\mathcal{O}(\sqrt{d}2^{-N})$. In [45], neural networks with a simple and computable continuous activation function and a fixed finite number of neurons were developed, which achieve the universal approximation property for all high-dimensional continuous functions.

This paper is organized as follows. In Section 2, Sobolev spaces are briefly introduced and a number of useful existing results on ReLU network approximation are given. We provide our main approximation results for ReLU networks in Section 3. Several auxiliary results concerning the approximation of high order polynomials by ReLU networks are first provided in Section 3.1, and our main results on ReLU networks are given in Section 3.2. In addition to ReLU networks, we show similar results for the smoother neural networks with ReLU square in Section 3.3.

2 Preliminaries

We provide in this section some useful preliminaries on notation and basic approximation results.

2.1 Deep Neural Networks

Let us summarize all basic notations used in deep neural networks as follows.

  1. 1.

    Matrices are denoted by bold uppercase letters. For instance, $\bm{A}\in\mathbb{R}^{m\times n}$ is a real matrix of size $m\times n$, and $\bm{A}^{T}$ denotes the transpose of $\bm{A}$.

  2. 2.

    Vectors are denoted as bold lowercase letters. For example, $\bm{v}\in\mathbb{R}^{n}$ is a column vector of size $n$, and $\bm{v}(i)$ is the $i$-th element of $\bm{v}$, i.e., $\bm{v}=[v_{1},\cdots,v_{n}]^{T}$ with $\bm{v}(i)=v_{i}$.

  3. 3.

    A $d$-dimensional multi-index is a $d$-tuple ${\bm{\alpha}}=[\alpha_{1},\alpha_{2},\cdots,\alpha_{d}]^{T}\in\mathbb{N}^{d}$. Several related notations are listed below.

    1. (a)

      $|{\bm{\alpha}}|=|\alpha_{1}|+|\alpha_{2}|+\cdots+|\alpha_{d}|$;

    2. (b)

      ${\bm{x}}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}$, where ${\bm{x}}=[x_{1},x_{2},\cdots,x_{d}]^{T}$;

    3. (c)

      ${\bm{\alpha}}!=\alpha_{1}!\alpha_{2}!\cdots\alpha_{d}!$;

  4. 4.

    Let $B_{r,|\cdot|}({\bm{x}})\subseteq\mathbb{R}^{d}$ be the closed ball with center ${\bm{x}}\in\mathbb{R}^{d}$ and radius $r$ measured by the Euclidean distance. Similarly, $B_{r,\|\cdot\|_{\ell^{\infty}}}({\bm{x}})\subseteq\mathbb{R}^{d}$ is a ball measured by the discrete $\ell^{\infty}$-norm of a vector.

  5. 5.

    Assume $\bm{n}\in\mathbb{N}^{n}$; then $f(\bm{n})=\mathcal{O}(g(\bm{n}))$ means that there exists a positive constant $C$ independent of $\bm{n}$, $f$, and $g$ such that $f(\bm{n})\leq Cg(\bm{n})$ when all entries of $\bm{n}$ go to $+\infty$.

  6. 6.

    We use $\sigma$ to denote an activation function. Let $\sigma_{1}:\mathbb{R}\to\mathbb{R}$ denote the rectified linear unit (ReLU), i.e. $\sigma_{1}(x)=\max\{0,x\}$. With the abuse of notation, we define $\sigma_{1}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ as $\sigma_{1}({\bm{x}})=[\max\{0,x_{1}\},\cdots,\max\{0,x_{d}\}]^{T}$ for any ${\bm{x}}=[x_{1},\cdots,x_{d}]^{T}\in\mathbb{R}^{d}$.

  7. 7.

    Furthermore, we let the activation function $\sigma_{2}:\mathbb{R}\to\mathbb{R}$ be either $\sigma_{1}$ or $\sigma_{1}^{2}$. Similar to $\sigma_{1}$, we define the action of $\sigma_{2}$ on a vector ${\bm{x}}$.

  8. 8.

    We will use NN as shorthand for neural network and $\sigma_{r}$-NN to specify an NN with activation functions $\sigma_{t}$ with $t\leq r$. We will also use Python-type notations to specify a class of NNs, e.g., $\sigma_{1}$-$\textnormal{NN}(\textnormal{c}_{1};\ \textnormal{c}_{2};\ \cdots;\ \textnormal{c}_{m})$ is a set of ReLU FNNs satisfying $m$ conditions given by $\{\textnormal{c}_{i}\}_{1\leq i\leq m}$, each of which may specify the number of inputs (#input), the total number of nodes in all hidden layers (#node), the number of hidden layers (#layer), the number of total parameters (#parameter), the width in each hidden layer (widthvec), the maximum width of all hidden layers (maxwidth), etc. For example, if $\phi\in\sigma_{1}$-$\textnormal{NN}(\textnormal{\#input}=2;\,\textnormal{widthvec}=[100,100])$, then $\phi$ satisfies

    1. (a)

      $\phi$ maps from $\mathbb{R}^{2}$ to $\mathbb{R}$.

    2. (b)

      $\phi$ has two hidden layers and the number of nodes in each hidden layer is $100$.

  9. 9.

    $[n]^{L}$ is short for $[n,n,\cdots,n]\in\mathbb{N}^{L}$. For example,

    \[
    \textnormal{NN}(\textnormal{\#input}=d;\,\textnormal{widthvec}=[100,100])=\textnormal{NN}(\textnormal{\#input}=d;\,\textnormal{widthvec}=[100]^{2}).
    \]
  10. 10.

    For $\phi\in\sigma$-$\textnormal{NN}(\textnormal{\#input}=d;\,\textnormal{widthvec}=[N_{1},N_{2},\cdots,N_{L}])$, if we define $N_{0}=d$ and $N_{L+1}=1$, then the architecture of $\phi$ can be briefly described as follows:

    \[
    \bm{x}=\tilde{\bm{h}}_{0}\ \mathop{\longrightarrow}^{\bm{W}_{1},\,\bm{b}_{1}}\ \bm{h}_{1}\ \mathop{\longrightarrow}^{\sigma}\ \tilde{\bm{h}}_{1}\ \cdots\ \mathop{\longrightarrow}^{\bm{W}_{L},\,\bm{b}_{L}}\ \bm{h}_{L}\ \mathop{\longrightarrow}^{\sigma}\ \tilde{\bm{h}}_{L}\ \mathop{\longrightarrow}^{\bm{W}_{L+1},\,\bm{b}_{L+1}}\ \phi(\bm{x})=\bm{h}_{L+1},
    \]

    where $\bm{W}_{i}\in\mathbb{R}^{N_{i}\times N_{i-1}}$ and $\bm{b}_{i}\in\mathbb{R}^{N_{i}}$ are the weight matrix and the bias vector in the $i$-th linear transform in $\phi$, respectively, i.e.,

    \[
    \bm{h}_{i}:=\bm{W}_{i}\tilde{\bm{h}}_{i-1}+\bm{b}_{i},\quad\textnormal{for }i=1,\dots,L+1,
    \]

    and

    \[
    \tilde{\bm{h}}_{i}=\sigma(\bm{h}_{i}),\quad\textnormal{for }i=1,\dots,L.
    \]

    $L$ in this paper is also called the number of hidden layers in the literature. (A minimal numerical sketch of this forward pass, for illustration only, is given after this list.)

  11. 11.

    The expression, an FNN with width $N$ and depth $L$, means

    1. (a)

      the maximum width of this FNN over all hidden layers is less than or equal to $N$;

    2. (b)

      the number of hidden layers of this FNN is less than or equal to $L$.
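To make the architecture notation in item 10 concrete, the following minimal Python/NumPy sketch implements the forward pass $\bm{h}_{i}=\bm{W}_{i}\tilde{\bm{h}}_{i-1}+\bm{b}_{i}$, $\tilde{\bm{h}}_{i}=\sigma_{1}(\bm{h}_{i})$ for a given widthvec. It is only an illustration of the notation, not one of the constructions of this paper, and the random weights are placeholders.

\begin{verbatim}
import numpy as np

def relu(z):
    # sigma_1 applied entrywise
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    # weights[i] has shape (N_{i+1}, N_i); biases[i] has shape (N_{i+1},)
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)          # hidden layers: affine map followed by sigma_1
    W, b = weights[-1], biases[-1]
    return W @ h + b                 # output layer: affine map only

# Example: phi in sigma_1-NN(#input=2; widthvec=[100, 100]), i.e. width 100 and depth 2.
rng = np.random.default_rng(0)
widthvec = [2, 100, 100, 1]          # [N_0, N_1, N_2, N_3] with N_0 = d and N_3 = 1
weights = [rng.standard_normal((m, n)) for n, m in zip(widthvec[:-1], widthvec[1:])]
biases = [rng.standard_normal(m) for m in widthvec[1:]]
print(forward([0.3, 0.7], weights, biases))
\end{verbatim}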

2.2 Sobolev Spaces

We will use $D$ to denote the weak derivative of a single variable function and $D^{\bm{\alpha}}$ to denote the partial derivative $D^{\alpha_{1}}_{1}D^{\alpha_{2}}_{2}\dots D^{\alpha_{d}}_{d}$ of a $d$-dimensional function, with $\alpha_{i}$ the order of the derivative $D_{i}$ in the $i$-th variable and ${\bm{\alpha}}=[\alpha_{1},\dots,\alpha_{d}]^{T}$. Let $\Omega$ denote an open subset of $\mathbb{R}^{d}$ and $L^{p}(\Omega)$ be the standard Lebesgue space on $\Omega$ for $p\in[1,\infty]$. We write $\nabla f:=[D_{1}f,\dots,D_{d}f]^{T}$. $\partial\Omega$ is the boundary of $\Omega$. Let $\mu(\cdot)$ be the Lebesgue measure. For $f(x)\in\mathcal{W}^{n,p}(\Omega)$, we use the notation

\[
\|f\|_{\mathcal{W}^{n,p}(\Omega)}=\|f\|_{\mathcal{W}^{n,p}}=\|f(x)\|_{\mathcal{W}^{n,p}(\Omega,\mu)},
\]

if the domain is clear from the context and we use the Lebesgue measure.

Definition 2.1

(Sobolev Space) Let $n\in\mathbb{N}_{0}$ and $1\leq p\leq\infty$. Then we define the Sobolev space

\[
\mathcal{W}^{n,p}(\Omega):=\{f\in L^{p}(\Omega):D^{\bm{\alpha}}f\in L^{p}(\Omega)\text{ for all }{\bm{\alpha}}\in\mathbb{N}_{0}^{d}\text{ with }|{\bm{\alpha}}|\leq n\}
\]

with a norm

\[
\|f\|_{\mathcal{W}^{n,p}(\Omega)}:=\Bigg(\sum_{0\leq|{\bm{\alpha}}|\leq n}\|D^{\bm{\alpha}}f\|^{p}_{L^{p}(\Omega)}\Bigg)^{1/p},
\]

if $p<\infty$, and

\[
\|f\|_{\mathcal{W}^{n,\infty}(\Omega)}:=\max_{0\leq|{\bm{\alpha}}|\leq n}\|D^{\bm{\alpha}}f\|_{L^{\infty}(\Omega)}.
\]
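As a quick illustration of Definition 2.1 (our own example): for $f(x)=x^{2}$ on $\Omega=(0,1)$,
\[
\|f\|_{\mathcal{W}^{1,\infty}((0,1))}=\max\big\{\|x^{2}\|_{L^{\infty}((0,1))},\|2x\|_{L^{\infty}((0,1))}\big\}=2,
\]
which is precisely the type of quantity controlled for the approximants constructed in Section 3 (cf. Lemma 3.2).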

2.3 Auxiliary Neural Network Approximation Results

We first give the following useful lemma collecting several ReLU network approximation results for polynomials, with explicit error characterization measured in the $L^{\infty}$ norm.

Lemma 2.1

The following statements hold for $\sigma_{1}$-NNs.

  1. (i)

    Any one-dimensional continuous piecewise linear function with $N$ breakpoints can be exactly realized by a one-dimensional $\sigma_{1}$-NN with one hidden layer and $N$ neurons.

  2. (ii)

    Any identity map in $\mathbb{R}^{d}$ can be exactly realized by a $d$-dimensional $\sigma_{1}$-NN with one hidden layer and $2d$ neurons.

  3. (iii)

    ([31, Lemma 5.1]) For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $3N$ and depth $L$ such that

    \[
    \|\phi(x)-x^{2}\|_{L^{\infty}([0,1])}\leq N^{-L}.
    \]
  4. (iv)

    ([31, Lemma 4.2]) For any $N,L\in\mathbb{N}^{+}$ and $a,b\in\mathbb{R}$ with $a<b$, there exists a $\sigma_{1}$-NN $\phi$ with width $9N+1$ and depth $L$ such that

    \[
    \|\phi(x,y)-xy\|_{L^{\infty}([a,b]^{2})}\leq 6(b-a)^{2}N^{-L}.
    \]
  5. (v)

    ([31, Theorem 4.1]) Assume $P({\bm{x}})={\bm{x}}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}$ for ${\bm{\alpha}}\in\mathbb{N}^{d}$ with $\|{\bm{\alpha}}\|_{1}\leq k\in\mathbb{N}^{+}$. For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $9(N+1)+k-1$ and depth $7k^{2}L$ such that

    \[
    \|\phi({\bm{x}})-P({\bm{x}})\|_{L^{\infty}([0,1]^{d})}\leq 9k(N+1)^{-7kL}.
    \]

The following two propositions will also be used in proving our main results.

Proposition 2.2

(Step function approximations [31, Proposition 4.3]) For any $N,L,d\in\mathbb{N}^{+}$ and $\delta\in(0,\tfrac{1}{3K}]$ with $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor$, there exists a one-dimensional $\sigma_{1}$-NN $\phi$ with width $4\lfloor N^{1/d}\rfloor+3$ and depth $4L+5$ such that

\[
\phi(x)=k,\quad\text{if}~x\in\big[\tfrac{k}{K},\tfrac{k+1}{K}-\delta\cdot 1_{\{k<K-1\}}\big]
\]

for $k=0,1,\dots,K-1$.

Proposition 2.3

(Point matching [31, Proposition 4.4]) Given any $N,L,s\in\mathbb{N}^{+}$ and $\xi_{i}\in[0,1]$ for $i=0,1,\dots,N^{2}L^{2}-1$, there exists a $\sigma_{1}$-NN $\phi$ with width $16s(N+1)\log_{2}(8N)$ and depth $(5L+2)\log_{2}(4L)$ such that

  1. 1.

    $|\phi(i)-\xi_{i}|\leq N^{-2s}L^{-2s}$, for $i=0,1,\cdots,N^{2}L^{2}-1$;

  2. 2.

    $0\leq\phi(x)\leq 1$, $x\in\mathbb{R}$.

3 Proof of Our Main Results

In this section, the proofs of our main results are given. We first define the following notion of a subset of $[0,1]^{d}$ before presenting our main approximation results. Given any $K\in\mathbb{N}^{+}$ and $\delta\in(0,\tfrac{1}{K})$, define a trifling region $\Omega([0,1]^{d},K,\delta)$ of $[0,1]^{d}$ as

\[
\Omega([0,1]^{d},K,\delta):=\cup_{i=1}^{d}\Big\{{\bm{x}}=[x_{1},x_{2},\cdots,x_{d}]^{T}\in[0,1]^{d}:x_{i}\in\cup_{k=1}^{K-1}(\tfrac{k}{K}-\delta,\tfrac{k}{K})\Big\}. \tag{3}
\]

3.1 Approximations Using ReLU Neural Networks

We first present the following Theorem 3.1, which concerns the approximation result by ReLU networks outside the trifling region $\Omega([0,1]^{d},K,\delta)$.

Theorem 3.1

Suppose that $f\in C^{s}([0,1]^{d})$ with an integer $s>1$ satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s$. For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $27s^{2}(L+2)\log_{2}(4L)$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 432s^{d}$ and

\[
\|f-\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d}\backslash\Omega([0,1]^{d},K,\delta))}\leq 84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d},
\]

where $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor$ and $0<\delta\leq\tfrac{1}{3K}$.

We will now show our main result, Theorem 1.1, provided Theorem 3.1 holds true. The proof of Theorem 3.1 will be given in Section 3.2.

  • Proof of Theorem 1.1

    When $f$ is a constant function, the statement is trivial. By Theorem 3.1, there exists a $\sigma_{1}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $27s^{2}(L+2)\log_{2}(4L)$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 432s^{d}$ and

    \[
    \|f-\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d}\backslash\Omega([0,1]^{d},K,\delta))}\leq 84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.
    \]

    Now, we set $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor$ and choose a small $\delta$ such that

    \[
    Kd\delta\,432^{p}\leq(N^{-2(s-1)/d}L^{-2(s-1)/d})^{p}.
    \]

    Then, we have

    \[
    \begin{aligned}
    \|f-\phi\|^{p}_{\mathcal{W}^{1,p}([0,1]^{d})}&=\|f-\phi\|^{p}_{\mathcal{W}^{1,p}(\Omega([0,1]^{d},K,\delta))}+\|f-\phi\|^{p}_{\mathcal{W}^{1,p}([0,1]^{d}\backslash\Omega([0,1]^{d},K,\delta))}\\
    &\leq Kd\delta(432s^{d})^{p}+(84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d})^{p}\\
    &\leq\big(s^{d}N^{-2(s-1)/d}L^{-2(s-1)/d}\big)^{p}+\big(84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}\big)^{p}\\
    &\leq(85(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d})^{p}.
    \end{aligned}
    \]

    Hence, we have

    \[
    \|f-\phi\|_{\mathcal{W}^{1,p}([0,1]^{d})}\leq 85(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.
    \]
    \qed

Before proving Theorem 3.1 directly, we present the following lemmas, which will be helpful for understanding the different steps in the proof. Similar to the proof of [31, Theorem 2.2], where only the $L^{\infty}$ norm is considered, in order to prove Theorem 3.1 we will first construct ReLU networks to approximate multivariate polynomials and provide an error bound measured in the $\mathcal{W}^{1,\infty}$ norm that is explicitly characterized in terms of the width and depth of the underlying ReLU networks, via the following steps:

  1. 1.

    We approximate $f(x)=x^{2}$ by compositions of the “sawtooth functions”, which were first proposed in [51].

  2. 2.

    We approximate $f(x,y)=xy$ using the ReLU network constructed in the previous step, based on the identity $xy=2\big(\big(\tfrac{|x+y|}{2}\big)^{2}-\big(\tfrac{|x|}{2}\big)^{2}-\big(\tfrac{|y|}{2}\big)^{2}\big)$.

  3. 3.

    We approximate $f(x_{1},x_{2},\cdots,x_{k})=x_{1}x_{2}\cdots x_{k}$ for $k\geq 2$ by repeatedly using the ReLU networks constructed in the previous step.

  4. 4.

    We approximate a general monomial $\bm{x}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}$ for $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq k\in\mathbb{N}^{+}$. Since any monomial of degree less than or equal to $k$ can be written as $Cz_{1}z_{2}\cdots z_{k}$, where $C$ is a constant, we can obtain an error bound based on the previous approximation result.

We begin with giving an approximation result for the simple function $x^{2}$.

Lemma 3.2

For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $3N$ and depth $2L$ such that $\|\phi(x)\|_{\mathcal{W}^{1,\infty}([0,1])}\leq 2$ and

\[
\|\phi(x)-x^{2}\|_{\mathcal{W}^{1,\infty}((0,1))}\leq N^{-L}.
\]
  • Proof

    As in the proof of Lemma 2.1 (iii), we begin by defining the piecewise linear function $f_{s}:[0,1]\to[0,1]$ for $s\in\mathbb{N}^{+}$, $s=1,2,\dots$, satisfying the following properties:

    1. 1.

      $f_{s}(x)=x^{2}$ at the set of breakpoints $\big\{\tfrac{j}{2^{s}}:j=0,1,2,\dots,2^{s}\big\}$.

    2. 2.

      $f_{s}(x)$ is linear between any two adjacent breakpoints.

    For any integer $N$, let $k\in\mathbb{N}^{+}$ be the unique number such that $(k-1)2^{k-1}+1\leq N\leq k2^{k}$. For such $N$ and $k$ and any $L^{\prime}$, let $s=L^{\prime}k$. It is shown in the proof of [31, Lemma 5.1] that there exists a $\sigma_{1}$-NN $\phi(x)=f_{s}(x)=f_{L^{\prime}k}(x)$ with width $3N$ and depth $L^{\prime}$ such that

    \[
    \|\phi(x)-x^{2}\|_{L^{\infty}([0,1])}=\|f_{L^{\prime}k}-x^{2}\|_{L^{\infty}([0,1])}\leq 2^{-2(L^{\prime}k+1)}\leq 2^{-2L^{\prime}k}\leq N^{-L^{\prime}}. \tag{4}
    \]

We now show that the approximation error of the first order (weak) derivative can be measured in a similar fashion.

Note that for all $x\in\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)$,

\[
\phi(x)=\Big(\tfrac{(j+1)^{2}}{2^{s}}-\tfrac{j^{2}}{2^{s}}\Big)\Big(x-\tfrac{j}{2^{s}}\Big)+\Big(\tfrac{j}{2^{s}}\Big)^{2}. \tag{5}
\]

From (5), we have for each $j=0,1,\dots,2^{s}-1$,

\[
\begin{aligned}
|\phi(x)-x^{2}|_{\mathcal{W}^{1,\infty}\big(\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)\big)}&=\Big\|\tfrac{(j+1)^{2}}{2^{s}}-\tfrac{j^{2}}{2^{s}}-2x\Big\|_{L^{\infty}\big(\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)\big)}\\
&=\Big\|\tfrac{2j+1}{2^{s}}-2x\Big\|_{L^{\infty}\big(\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)\big)}\\
&=\max\Big\{\big|\tfrac{2j+1}{2^{s}}-2\big(\tfrac{j}{2^{s}}\big)\big|,\big|\tfrac{2j+1}{2^{s}}-2\big(\tfrac{j+1}{2^{s}}\big)\big|\Big\}\\
&=2^{-s}=2^{-L^{\prime}k}.
\end{aligned}
\]

Clearly, from (4) and (5) we have

\[
\|\phi(x)-x^{2}\|_{\mathcal{W}^{1,\infty}((0,1))}=\|f_{L^{\prime}k}-x^{2}\|_{\mathcal{W}^{1,\infty}((0,1))}\leq 2^{-L^{\prime}k}\leq N^{-L^{\prime}/2}.
\]

Finally, we have

\[
\|\phi(x)\|_{\mathcal{W}^{1,\infty}([0,1])}\leq\Big\|\tfrac{(j+1)^{2}-j^{2}}{2^{s}}\Big\|_{L^{\infty}\big(\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)\big)}=\tfrac{2j+1}{2^{s}}\leq\tfrac{2(2^{s}-1)+1}{2^{s}}=2-\tfrac{1}{2^{L^{\prime}k}}\leq 2.
\]

Setting $L^{\prime}=2L$, we have the desired $\phi(x)$ and hence the proof is finished. \qed
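The dyadic construction above is easy to sanity-check numerically. The following short Python sketch (our own illustration; it builds the target function $f_{s}$ directly rather than its $\sigma_{1}$-NN realization) verifies the two bounds used in the proof, namely $\|f_{s}-x^{2}\|_{L^{\infty}([0,1])}\leq 2^{-2(s+1)}$ and $\sup_{x}|f_{s}^{\prime}(x)-2x|=2^{-s}$:

\begin{verbatim}
import numpy as np

def f_s(x, s):
    # piecewise linear interpolant of x^2 at the dyadic breakpoints j / 2^s
    nodes = np.arange(2**s + 1) / 2**s
    return np.interp(x, nodes, nodes**2)

s = 6
x = np.linspace(0.0, 1.0, 200001)
err_val = np.max(np.abs(f_s(x, s) - x**2))        # about 3.05e-05 here
print(err_val, 2.0**(-2*(s+1)))                   # bound 2^{-2(s+1)} = 6.10e-05

# on the j-th subinterval the interpolant has slope (2j+1)/2^s
j = np.minimum((x * 2**s).astype(int), 2**s - 1)
err_der = np.max(np.abs((2*j + 1) / 2**s - 2*x))
print(err_der, 2.0**(-s))                         # both equal 2^{-s} = 0.015625
\end{verbatim}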

Lemma 3.3

For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $9N$ and depth $2L$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 12$ and

\[
\|\phi(x,y)-xy\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 6N^{-L}.
\]
  • Proof

    By Lemma 3.2, there exists a $\sigma_{1}$-NN $\psi$ with width $3N$ and depth $2L$ such that $\|\psi\|_{\mathcal{W}^{1,\infty}((0,1))}\leq 2$ and

    \[
    \|z^{2}-\psi(z)\|_{\mathcal{W}^{1,\infty}((0,1))}\leq N^{-L}.
    \]

    Combining the above inequality and the fact that for any $x,y\in\mathbb{R}$

    \[
    xy=2\Big(\big(\tfrac{|x+y|}{2}\big)^{2}-\big(\tfrac{|x|}{2}\big)^{2}-\big(\tfrac{|y|}{2}\big)^{2}\Big),
    \]

    we construct the following network $\phi$:

    \[
    \phi(x,y)=2\Big(\psi\big(\tfrac{|x+y|}{2}\big)-\psi\big(\tfrac{|x|}{2}\big)-\psi\big(\tfrac{|y|}{2}\big)\Big).
    \]

    We have

    \[
    \begin{aligned}
    \|\phi(x,y)-xy\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}
    &\leq 2\big\|\psi\big(\tfrac{|x+y|}{2}\big)-\big(\tfrac{|x+y|}{2}\big)^{2}\big\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}+2\big\|\psi\big(\tfrac{|x|}{2}\big)-\big(\tfrac{|x|}{2}\big)^{2}\big\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\\
    &\quad+2\big\|\psi\big(\tfrac{|y|}{2}\big)-\big(\tfrac{|y|}{2}\big)^{2}\big\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\\
    &\leq 2N^{-L}+2N^{-L}+2N^{-L}=6N^{-L}.
    \end{aligned}
    \]

    Finally, we have

    \[
    \|\phi(x,y)\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}=\Big\|2\Big(\psi\big(\tfrac{|x+y|}{2}\big)-\psi\big(\tfrac{|x|}{2}\big)-\psi\big(\tfrac{|y|}{2}\big)\Big)\Big\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 12.
    \]
    \qed
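As a companion to the sketch after Lemma 3.2, the following lines (again only an illustration, with the interpolant $f_{s}$ standing in for the network $\psi$ of Lemma 3.2) check numerically that the construction $\phi(x,y)=2\big(\psi(\tfrac{|x+y|}{2})-\psi(\tfrac{|x|}{2})-\psi(\tfrac{|y|}{2})\big)$ indeed approximates the product $xy$ on $(0,1)^{2}$:

\begin{verbatim}
import numpy as np

def f_s(x, s):
    # piecewise linear interpolant of x^2 at the breakpoints j / 2^s (as before)
    nodes = np.arange(2**s + 1) / 2**s
    return np.interp(x, nodes, nodes**2)

def phi(x, y, s):
    # product approximant built from three copies of the squaring approximant
    return 2.0 * (f_s(np.abs(x + y) / 2, s)
                  - f_s(np.abs(x) / 2, s)
                  - f_s(np.abs(y) / 2, s))

s = 8
g = np.linspace(0.0, 1.0, 401)
X, Y = np.meshgrid(g, g)
print(np.max(np.abs(phi(X, Y, s) - X * Y)))   # small, of order 2^{-2s}
\end{verbatim}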

By rescaling, we have the following modification of Lemma 3.3.

Lemma 3.4

For any $N,L\in\mathbb{N}^{+}$ and $a,b\in\mathbb{R}$ with $a<b$, there exists a $\sigma_{1}$-NN $\phi$ with width $9N+1$ and depth $2L$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}\leq 12(b-a)^{2}$ and

\[
\|\phi(x,y)-xy\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}\leq 6(b-a)^{2}N^{-L}.
\]
  • Proof

    By Lemma 3.3, there exists a $\sigma_{1}$-NN $\psi$ with width $9N$ and depth $2L$ such that $\|\psi\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 12$ and

    \[
    \|\psi(\widetilde{x},\widetilde{y})-\widetilde{x}\widetilde{y}\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 6N^{-L}.
    \]

    By setting $x=a+(b-a)\widetilde{x}$ and $y=a+(b-a)\widetilde{y}$ for any $\widetilde{x},\widetilde{y}\in(0,1)$, we define the following network $\phi$:

    \[
    \phi(x,y)=(b-a)^{2}\psi\big(\tfrac{x-a}{b-a},\tfrac{y-a}{b-a}\big)+a(x-a)+a(y-a)+a^{2}.
    \]

    Note that $a(x-a)+a(y-a)$ is positive. Hence, the width of $\phi$ can be as small as $9N+1$. Thus, by $xy=(b-a)^{2}\big(\tfrac{x-a}{b-a}\cdot\tfrac{y-a}{b-a}\big)+a(x-a)+a(y-a)+a^{2}$, we have

    \[
    \|\phi(x,y)-xy\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}=(b-a)^{2}\big\|\psi\big(\tfrac{x-a}{b-a},\tfrac{y-a}{b-a}\big)-\big(\tfrac{x-a}{b-a}\cdot\tfrac{y-a}{b-a}\big)\big\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}\leq 6(b-a)^{2}N^{-L}.
    \]

    Finally, we have

    \[
    \|\phi(x,y)\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}=\Big\|(b-a)^{2}\psi\big(\tfrac{x-a}{b-a},\tfrac{y-a}{b-a}\big)+a(x-a)+a(y-a)+a^{2}\Big\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}\leq 12(b-a)^{2}.
    \]
    \qed
Lemma 3.5

For any $N,L,k\in\mathbb{N}^{+}$ with $k\geq 2$, there exists a $\sigma_{1}$-NN $\phi$ with width $9(N+1)+k-1$ and depth $14k(k-1)L$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}((0,1)^{k})}\leq 18$ and

\[
\|\phi(\bm{x})-x_{1}x_{2}\cdots x_{k}\|_{\mathcal{W}^{1,\infty}((0,1)^{k})}\leq 10(k-1)(N+1)^{-7kL}.
\]
  • Proof

    By Lemma 3.4, there exists a $\sigma_{1}$-NN $\phi_{1}$ with width $9(N+1)+1$ and depth $14kL$ such that $\|\phi_{1}\|_{\mathcal{W}^{1,\infty}((-0.1,1.1)^{2})}\leq 18$ and

    \[
    \|\phi_{1}(x,y)-xy\|_{\mathcal{W}^{1,\infty}((-0.1,1.1)^{2})}\leq 6(1.2)^{2}(N+1)^{-7kL}\leq 9(N+1)^{-7kL}.
    \]

    Now, our goal is to construct via induction, for $i=1,2,\dots,k-1$, a $\sigma_{1}$-NN $\phi_{i}$ with width $9(N+1)+i$ and depth $14kiL$ such that

    \[
    \|\phi_{i}(x_{1},\cdots,x_{i+1})-x_{1}x_{2}\cdots x_{i+1}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+1})}\leq 10i(N+1)^{-7kL}
    \]

    for any $[x_{1},x_{2},\cdots,x_{i+1}]^{T}\in(0,1)^{i+1}$.

    When $i=1$, $\phi_{1}$ satisfies both conditions.

    Assuming now that for some $i\in\{1,2,\dots,k-2\}$ there exists a $\sigma_{1}$-NN $\phi_{i}$ such that both conditions hold, we define $\phi_{i+1}$ as follows:

    \[
    \phi_{i+1}(x_{1},\cdots,x_{i+2})=\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),\sigma(x_{i+2}))
    \]

    for any $x_{1},\cdots,x_{i+2}\in\mathbb{R}$. We can shift $x_{i+2}$ to obtain a nonnegative number $x_{i+2}+0.1$, which can be copied via a ReLU network of width $1$ to the input of $\phi_{1}$. Before inputting $x_{i+2}+0.1$ to $\phi_{1}$, we shift it back to $x_{i+2}$. Therefore, $\phi_{i+1}$ can be implemented via a $\sigma_{1}$-NN with width $9(N+1)+i+1$ and depth $14kiL+14kL=14k(i+1)L$.

    By the induction assumption, we have

    \[
    \|\phi_{i}(x_{1},x_{2},\cdots,x_{i+1})-x_{1}x_{2}\cdots x_{i+1}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+1})}\leq 10i(N+1)^{-7kL}.
    \]

    Note that $10i(N+1)^{-7kL}\leq 10k2^{-7k}<10k\tfrac{1}{100k}=0.1$ for any $N,L\in\mathbb{N}^{+}$, $k\geq 2$, and $i\in\{1,2,\cdots,k-1\}$. Thus, we have

    \[
    \phi_{i}(x_{1},x_{2},\cdots,x_{i+1})\in(-0.1,1.1)
    \]

    and

    \[
    \tfrac{\partial\phi_{i}}{\partial x_{1}}\in(-0.1,1.1)
    \]

    for any $x_{1},x_{2},\cdots,x_{i+1}\in(0,1)$.

Hence, for any $x_{1},x_{2},\cdots,x_{i+2}\in(0,1)$, we have

\[
\begin{aligned}
&\|\phi_{i+1}(x_{1},\cdots,x_{i+2})-x_{1}\cdots x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\\
&\quad=\|\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),\sigma(x_{i+2}))-x_{1}\cdots x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\\
&\quad\leq\|\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})-\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\\
&\quad\quad+\|\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}-x_{1}\cdots x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\\
&\quad\leq 10(N+1)^{-7kL}+10i(N+1)^{-7kL}=10(i+1)(N+1)^{-7kL},
\end{aligned} \tag{6}
\]

where the second inequality is obtained since we have

\[
\begin{aligned}
\Big|\tfrac{\partial(\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})-\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2})}{\partial x_{1}}\Big|
&=\Big|\tfrac{\partial\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})}{\partial\phi_{i}}\cdot\tfrac{\partial\phi_{i}}{\partial x_{1}}-x_{i+2}\cdot\tfrac{\partial\phi_{i}}{\partial x_{1}}\Big|\\
&=\underbrace{\Big|\tfrac{\partial\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})}{\partial\phi_{i}}-x_{i+2}\Big|}_{\leq 9(N+1)^{-7kL}}\,\underbrace{\Big|\tfrac{\partial\phi_{i}}{\partial x_{1}}\Big|}_{\leq 1.1}\\
&\leq 10(N+1)^{-7kL}
\end{aligned}
\]

and

\[
\begin{aligned}
\Big|\tfrac{\partial(\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})-\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2})}{\partial x_{i+2}}\Big|
&=\Big|\tfrac{\partial\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})}{\partial x_{i+2}}-\phi_{i}\Big|\\
&\leq 9(N+1)^{-7kL}\leq 10(N+1)^{-7kL}.
\end{aligned}
\]

Thus,

\[
|\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}-x_{1}\cdots x_{i+2}|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\leq 10(N+1)^{-7kL}.
\]

Also, using similar arguments (or see the proof of [31, Lemma 5.1]), it can be shown that

\[
|\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}-x_{1}\cdots x_{i+2}|_{L^{\infty}((0,1)^{i+2})}\leq 10(N+1)^{-14kL}.
\]

Hence, we have shown

\[
\|\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})-\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\leq 10(N+1)^{-7kL},
\]

which was used in showing (6).

Also, for any $x_{1},x_{2},\cdots,x_{i+2}\in(0,1)$, we have

\[
\|\phi_{i+1}(x_{1},\cdots,x_{i+2})\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}=\|\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),\sigma(x_{i+2}))\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\leq 18.
\]

At last, setting $\phi=\phi_{k-1}$ in (6), we have finished the proof by induction. \qed

Proposition 3.6

Suppose $\bm{x}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}$ for $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq k\in\mathbb{N}^{+}$. For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $9(N+1)+k-1$ and depth $14k^{2}L$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 18$ and

\[
\|\phi(\bm{x})-\bm{x}^{\bm{\alpha}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 10k(N+1)^{-7kL}.
\]
  • Proof

    This proof is similar to that of [31, Proposition 4.1], but for readability we provide a detailed proof as follows.

    The case when $k=1$ is trivial. When $k\geq 2$, we set $\widetilde{k}=|{\bm{\alpha}}|\leq k$, $\bm{\alpha}=[\alpha_{1},\alpha_{2},\dots,\alpha_{d}]^{T}\in\mathbb{N}^{d}$, and let $[z_{1},z_{2},\cdots,z_{\widetilde{k}}]^{T}\in\mathbb{R}^{\widetilde{k}}$ be the vector such that

    \[
    z_{l}=x_{j},\quad\text{if}~\sum_{i=1}^{j-1}\alpha_{i}<l\leq\sum_{i=1}^{j}\alpha_{i},\quad\text{for}~j=1,2,\cdots,d.
    \]

    In other words, we have

    \[
    [z_{1},z_{2},\cdots,z_{\widetilde{k}}]^{T}=[\underbrace{x_{1},\cdots,x_{1}}_{\alpha_{1}~\text{times}},\underbrace{x_{2},\cdots,x_{2}}_{\alpha_{2}~\text{times}},\cdots,\underbrace{x_{d},\cdots,x_{d}}_{\alpha_{d}~\text{times}}]^{T}\in\mathbb{R}^{\widetilde{k}}.
    \]

    We now construct the target deep ReLU neural network. First, there exists a linear map $\mathcal{L}:\mathbb{R}^{d}\to\mathbb{R}^{k}$, $\mathcal{L}({\bm{x}})=[z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1]$, which copies ${\bm{x}}$ to form a new vector $[z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1]\in\mathbb{R}^{k}$. Second, by Lemma 3.5 there exists a function $\psi:\mathbb{R}^{k}\rightarrow\mathbb{R}$ implemented by a ReLU network with width $9(N+1)+k-1$ and depth $14k(k-1)L$ such that $\|\psi\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 18$ and $\psi$ maps $[z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1]$ to $z_{1}z_{2}\cdots z_{\widetilde{k}}$ within an error of $10(k-1)(N+1)^{-7kL}$. Thus, we can construct the network $\phi=\psi\circ\mathcal{L}$. Then, $\phi$ can be implemented by a ReLU network with width $9(N+1)+k-1$ and depth $14k(k-1)L\leq 14k^{2}L$, and

    \[
    \begin{aligned}
    \|\phi(\bm{x})-\bm{x}^{\bm{\alpha}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}&=\|\psi\circ\mathcal{L}(\bm{x})-x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\\
    &=\|\psi(z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1)-z_{1}z_{2}\cdots z_{\widetilde{k}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\\
    &\leq 10(k-1)(N+1)^{-7kL}\leq 10k(N+1)^{-7kL}.
    \end{aligned}
    \]

    Also, we have

    \[
    \|\phi(\bm{x})\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}=\|\psi\circ\mathcal{L}(\bm{x})\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}=\|\psi(z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1)\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 18.
    \]

    The proof is then finished. \qed

We are now ready to show our main approximation result, Theorem 3.1. The main idea of the proof is the recurring use of Taylor polynomial approximations to smooth functions. We first construct ReLU subnetworks to realize these polynomials and their derivatives, and then implement the target smooth functions by composing the subnetworks. The overall approximation error is estimated by the remainder terms in Taylor's theorem and the error resulting from the ReLU subnetworks approximating the polynomials.

3.2 Proof of Theorem 3.1

  • Proof

We set $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor$ and let $\Omega([0,1]^{d},K,\delta)$ defined by (3) partition $[0,1]^{d}$ into $K^{d}$ cubes $Q_{\bm{\beta}}$ for $\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}$ such that

\[
[0,1]^{d}=\Omega([0,1]^{d},K,\delta)\bigcup\big(\cup_{\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}}Q_{\bm{\beta}}\big).
\]

For each $\bm{\beta}=[\beta_{1},\beta_{2},\cdots,\beta_{d}]^{T}\in\{0,1,\cdots,K-1\}^{d}$, we define

\[
Q_{\bm{\beta}}=\Big\{\bm{x}=[x_{1},x_{2},\cdots,x_{d}]^{T}:x_{i}\in\Big[\tfrac{\beta_{i}}{K},\tfrac{\beta_{i}+1}{K}-\delta\cdot 1_{\{\beta_{i}\leq K-2\}}(\beta_{i})\Big]~\text{for}~i=1,2,\dots,d\Big\},
\]

where $1_{\{\beta_{i}\leq K-2\}}(\beta_{i})$ is the indicator function of the set $\{\beta_{i}\leq K-2\}$.

By Proposition 2.2, there exists a $\sigma_{1}$-NN $\psi$ with width $4N+3$ and depth $4L+5$ such that

\[
\psi(x)=k,\quad\text{if}~x\in\big[\tfrac{k}{K},\tfrac{k+1}{K}-\delta\cdot 1_{\{k\leq K-2\}}\big]~\text{for}~k=0,1,\dots,K-1.
\]

Then, for each $\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}$, $\psi(x_{i})=\beta_{i}$ if $\bm{x}\in Q_{\bm{\beta}}$, for $i=1,2,\cdots,d$.

Define

\[
\bm{\psi}(\bm{x}):=[\psi(x_{1}),\psi(x_{2}),\cdots,\psi(x_{d})]^{T}/K\quad\text{for any}~\bm{x}\in[0,1]^{d};
\]

then

\[
\bm{\psi}(\bm{x})=\bm{\beta}/K,\quad\text{if}~\bm{x}\in Q_{\bm{\beta}},\quad\text{for}~\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}.
\]
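As a small aid to intuition (our own illustration, not the $\sigma_{1}$-NN realization $\psi$ constructed above), the target piecewise-constant behavior of $\psi$ on the partition can be emulated directly; points of the trifling region are simply flagged:

\begin{verbatim}
import numpy as np

def target_step(x, K):
    # the index k with x in [k/K, (k+1)/K - delta*1{k<K-1}] (outside the trifling region)
    return min(int(np.floor(K * x)), K - 1)

def in_trifling_region(x, K, delta):
    # x lies in (k/K - delta, k/K) for some k = 1, ..., K-1
    return any(k / K - delta < x < k / K for k in range(1, K))

K, delta = 4, 0.05
for x in [0.10, 0.24, 0.26, 0.70, 1.00]:
    print(x, in_trifling_region(x, K, delta), target_step(x, K))
\end{verbatim}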

Now, we fix a $\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}$ throughout the proof. For any $\bm{x}\in Q_{\bm{\beta}}$, by Taylor's expansion there exists a $\xi_{\bm{x}}\in(0,1)$ such that

\[
f(\bm{x})=\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}+\sum_{|{\bm{\alpha}}|=s}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x})+\xi_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}, \tag{7}
\]

where $\bm{h}=\bm{x}-\bm{\psi}(\bm{x})$.

By Lemma 3.4, there exists a $\sigma_{1}$-NN $\widetilde{\phi}$ with width $9(N+1)+1$ and depth $4s(L+1)$ such that $\|\widetilde{\phi}\|_{\mathcal{W}^{1,\infty}((-3,3)^{2})}\leq 432$ and

\[
\|\widetilde{\phi}(x,y)-xy\|_{\mathcal{W}^{1,\infty}((-3,3)^{2})}\leq 6(6)^{2}(N+1)^{-2s(L+1)}=216(N+1)^{-2s(L+1)}=:\mathcal{E}_{1}.
\]

For each ${\bm{\alpha}}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s$, by Proposition 3.6 there exist $\sigma_{1}$-NNs $P_{\bm{\alpha}}(\bm{x})$ with width $9(N+1)+s-1$ and depth $14s^{2}L$ such that

\[
\|P_{\bm{\alpha}}(\bm{x})-\bm{x}^{\bm{\alpha}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 10s(N+1)^{-7sL}=:\mathcal{E}_{2}. \tag{8}
\]

For each $i=0,1,\cdots,K^{d}-1$, we define the bijection

\[
\bm{\eta}(i)=[\eta_{1},\eta_{2},\cdots,\eta_{d}]^{T}\in\{0,1,\dots,K-1\}^{d}
\]

such that $\sum_{j=1}^{d}\eta_{j}K^{j-1}=i$. We will drop the input $i$ in $\bm{\eta}(i)$ later for simplicity. For each $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s-1$, define

\[
\xi_{\bm{\alpha},i}=\big(\partial^{\bm{\alpha}}f\big(\tfrac{\bm{\eta}}{K}\big)+1\big)/2.
\]

Note that $K^{d}=(\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor)^{d}\leq N^{2}L^{2}$ and $\xi_{\bm{\alpha},i}\in[0,1]$ for $i=0,1,\cdots,K^{d}-1$. By Proposition 2.3, there exists a $\sigma_{1}$-NN $\widetilde{\phi}_{\bm{\alpha}}$ of width $16s(N+1)\log_{2}(8N)$ and depth $5(L+2)\log_{2}(4L)$ such that

\[
|\widetilde{\phi}_{\bm{\alpha}}(i)-\xi_{\bm{\alpha},i}|\leq N^{-2s}L^{-2s},\quad\text{for}~i=0,1,\cdots,K^{d}-1~\text{and}~|{\bm{\alpha}}|\leq s-1.
\]
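The bijection $\bm{\eta}(i)$ above is simply the base-$K$ digit expansion of $i$; a short illustration (ours) of the map and of the inverse relation $i=\sum_{j=1}^{d}\eta_{j}K^{j-1}$:

\begin{verbatim}
def eta(i, K, d):
    # base-K digits of i, least significant first, so that i = sum_j eta_j * K**(j-1)
    return [(i // K**j) % K for j in range(d)]

K, d = 4, 3
i = 27
digits = eta(i, K, d)                                       # [3, 2, 1]
print(digits, sum(e * K**j for j, e in enumerate(digits)))  # [3, 2, 1] 27
\end{verbatim}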

For each ${\bm{\alpha}}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s-1$, we define

\[
\phi_{\bm{\alpha}}(\bm{x}):=2\widetilde{\phi}_{\bm{\alpha}}\Bigg(\sum_{j=1}^{d}x_{j}K^{j-1}\Bigg)-1,\quad\text{for any}~\bm{x}=[x_{1},x_{2},\cdots,x_{d}]^{T}\in\mathbb{R}^{d}.
\]

It can be seen that $\phi_{\bm{\alpha}}$ is also of width $16s(N+1)\log_{2}(8N)$ and depth $5(L+2)\log_{2}(4L)$.

Then, for each $\bm{\eta}=[\eta_{1},\eta_{2},\cdots,\eta_{d}]^{T}\in\{0,1,\dots,K-1\}^{d}$ corresponding to $i=\sum_{j=1}^{d}\eta_{j}K^{j-1}$, and each $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s-1$, we have

\[
\big|\phi_{\bm{\alpha}}(\tfrac{\bm{\eta}}{K})-\partial^{\bm{\alpha}}f(\tfrac{\bm{\eta}}{K})\big|=\Bigg|2\widetilde{\phi}_{\bm{\alpha}}\Bigg(\sum_{j=1}^{d}x_{j}K^{j-1}\Bigg)-1-(2\xi_{\bm{\alpha},i}-1)\Bigg|=2|\widetilde{\phi}_{\bm{\alpha}}(i)-\xi_{\bm{\alpha},i}|\leq 2N^{-2s}L^{-2s}.
\]

From $\bm{\psi}(\bm{x})=\tfrac{\bm{\beta}}{K}$ for $\bm{x}\in Q_{\bm{\beta}}$, it follows that

\[
\begin{aligned}
\|\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))-\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))\|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}&=\|\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))-\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))\|_{L^{\infty}(Q_{\bm{\beta}})}\\
&=\big|\phi_{\bm{\alpha}}(\tfrac{\bm{\beta}}{K})-\partial^{\bm{\alpha}}f(\tfrac{\bm{\beta}}{K})\big|\leq 2N^{-2s}L^{-2s}=:\mathcal{E}_{3}.
\end{aligned} \tag{9}
\]

Note that since $\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))-\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))$ is constant for $\bm{x}\in Q_{\bm{\beta}}$, its weak derivative is zero, which gives the first equality of (9).

Define

\[
\phi(\bm{x})=\sum_{|{\bm{\alpha}}|\leq s-1}\widetilde{\phi}\Big(\tfrac{\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big) \tag{10}
\]

for any $\bm{x}\in\mathbb{R}^{d}$.

Let us now first estimate the overall approximation error in the $\mathcal{W}^{1,\infty}$ semi-norm. We denote the first order derivatives of $f$ by $\partial^{\bm{\gamma}}f$ with $|\bm{\gamma}|=1$. For any $\bm{x}\in Q_{\bm{\beta}}$, by Taylor's expansion there exists a $\xi_{\bm{x}}^{\bm{\gamma}}\in(0,1)$ such that

\[
\begin{aligned}
\partial^{\bm{\gamma}}f(\bm{x})&=\sum_{|{\bm{\alpha}}|\leq s-2}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}+\sum_{|{\bm{\alpha}}|=s-1}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi}(\bm{x})+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\\
&=\partial^{\bm{\gamma}}\Big(\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big)+\sum_{|{\bm{\alpha}}|=s-1}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi}(\bm{x})+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}},
\end{aligned}
\]

where $\bm{h}=\bm{x}-\bm{\psi}(\bm{x})$.
Therefore,

\[
\begin{aligned}
&|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}\\
&\quad\leq\Big|\phi(\bm{x})-\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}+\Big|\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}-f(\bm{x})\Big|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}\\
&\quad\leq\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big\|\widetilde{\phi}\Big(\tfrac{\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big)-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big\|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{=:E_{1}}\\
&\quad\quad+\underbrace{\max_{|\bm{\gamma}|=1}\sum_{|{\bm{\alpha}}|=s-1}\Big\|\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi}(\bm{x})+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big\|_{L^{\infty}(Q_{\bm{\beta}})}}_{=:E_{2}},
\end{aligned}
\]

where $E_{2}$ with $\xi_{\bm{x}}^{\bm{\gamma}}\in(0,1)$ is the remainder term resulting from Taylor's expansion of $\partial^{\bm{\gamma}}f$ in (3.2).

The error term E1E_{1} is further decomposed into two parts via

E1\displaystyle E_{1} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲1,(Q𝜷)=:E1,1\displaystyle\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{=:E_{1,1}}
+|𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)=:E1,2.\displaystyle+\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{=:E_{1,2}}.

For each $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s-1$ and $\bm{x}\in Q_{\bm{\beta}}$, since $\mathcal{E}_{3}\in[0,2]$ and $\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})\in(-1,1)$ according to (9), we have $\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})\in(-3,3)$ and hence $\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\in(-3,3)$.

Also, we have P𝜶(𝒙)(2,3)(3,3)P_{\bm{\alpha}}(\bm{x})\in(-2,3)\subseteq(-3,3) and P𝜶(𝒙)𝒲([0,1]d)1,3\|P_{\bm{\alpha}}(\bm{x})\|_{\mathcal{W}{{}^{1,\infty}([0,1]^{d})}}\leq 3 for any 𝒙[0,1]d\bm{x}\in[0,1]^{d} and |𝜶|s1|{\bm{\alpha}}|\leq s-1, since 2<2\mathcal{E}_{2}<2 from (8).

Hence, we can now measure E1,1E_{1,1}, E1,2E_{1,2} and E2E_{2}:

E1,1\displaystyle E_{1,1} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲1,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1(ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ𝜶(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲1,(Q𝜷)1\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Bigg{(}\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{\leq\mathcal{E}_{1}}
+ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲1,(Q𝜷)1\displaystyle+\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{\leq\mathcal{E}_{1}}
+ϕ𝜶(𝝍(𝒙))𝜶!P𝜶(𝒉)𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲1,(Q𝜷)33)\displaystyle+\underbrace{\Big{\|}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{\leq 3\mathcal{E}_{3}}\Bigg{)}
\displaystyle\leq |𝜶|s1(21+33)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}(2\mathcal{E}_{1}+3\mathcal{E}_{3})
\displaystyle\leq sd(21+33).\displaystyle s^{d}(2\mathcal{E}_{1}+3\mathcal{E}_{3}).

Note that the last inequality follows from the fact that $\sum_{|{\bm{\alpha}}|\leq s-1}1\leq\sum_{i=0}^{s-1}(i+1)^{d-1}\leq s^{d}$.

Similarly,

E1,2\displaystyle E_{1,2} =\displaystyle= |𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲1,(Q𝜷)1\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{\leq\mathcal{E}_{1}}
+|𝜶|s1𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)2\displaystyle+\sum_{|{\bm{\alpha}}|\leq s-1}{\underbrace{\Big{\|}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{{}_{\leq\mathcal{E}_{2}}}}
\displaystyle\leq |𝜶|s1(1+2)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}(\mathcal{E}_{1}+\mathcal{E}_{2})
\displaystyle\leq sd(1+2).\displaystyle s^{d}(\mathcal{E}_{1}+\mathcal{E}_{2}).

We now provide a measure on E2E_{2}. First, by the assumption that 𝜶fL([0,1]d)<1\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1 for any 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s|{\bm{\alpha}}|\leq s, we have

E2\displaystyle E_{2} =\displaystyle= max|𝜸|=1|𝜶|=s1𝜶𝜸f(𝝍(𝒙)+ξ𝒙𝜸𝒉)𝜶!𝒉𝜶(Q𝜷)\displaystyle\max_{|\bm{\gamma}|=1}\sum_{|{\bm{\alpha}}|=s-1}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|=s11𝜶!𝒉𝜶L(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|=s-1}\Big{\|}\tfrac{1}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq sd1K(s1),\displaystyle s^{d-1}K^{-(s-1)},

where the last inequality follows from $\sum_{|{\bm{\alpha}}|=s-1}1\leq s^{d-1}$.
Thus, we have shown that

|ϕ(𝒙)f(𝒙)|𝒲1,(Q𝜷)\displaystyle|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq E1,1+E1,2+E2\displaystyle E_{1,1}+E_{1,2}+E_{2}
\displaystyle\leq E1,1+E1,2+sd1K(s1).\displaystyle E_{1,1}+E_{1,2}+s^{d-1}K^{-(s-1)}.

We now estimate ϕ(𝒙)f(𝒙)L(Q𝜷)\|\phi(\bm{x})-f(\bm{x})\|_{{L}^{\infty}(Q_{\bm{\beta}})}. Using similar arguments, from the Taylor expansion defined in (7) with ξ𝒙\xi_{\bm{x}} in the remainder term, we can show that

ϕ(𝒙)f(𝒙)L(Q𝜷)\displaystyle\|\phi(\bm{x})-f(\bm{x})\|_{{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶L(Q𝜷)+|𝜶|=s𝜶f(𝝍(𝒙)+ξ𝒙𝒉)𝜶!𝒉𝜶L(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{{L}^{\infty}(Q_{\bm{\beta}})}+\sum_{|{\bm{\alpha}}|=s}\Big{\|}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)+(s+1)d1Ks\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}+(s+1)^{d-1}K^{-s}
\displaystyle\leq E1,1+E1,2+(s+1)d1Ks,\displaystyle E_{1,1}+E_{1,2}+(s+1)^{d-1}K^{-s},

where the second last inequality is due to |𝜶|=s1(s+1)d1\sum_{|{\bm{\alpha}}|=s}1\leq(s+1)^{d-1}.

Using $(N+1)^{-7s(L+1)}\leq(N+1)^{-2s(L+1)}\leq(N+1)^{-2s}2^{-2sL}\leq N^{-2s}L^{-2s}$ and $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor\geq\tfrac{N^{2/d}L^{2/d}}{8}$ (the latter since $\lfloor x\rfloor\geq x/2$ for $x\geq 1$), we have

ϕ(𝒙)f(𝒙)𝒲1,(Q𝜷)\displaystyle\|\phi(\bm{x})-f(\bm{x})\|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
=\displaystyle= max{ϕ(𝒙)f(𝒙)L(Q𝜷),|ϕ(𝒙)f(𝒙)|𝒲1,(Q𝜷)}\displaystyle\max{\big{\{}\|\phi(\bm{x})-f(\bm{x})\|_{{L}^{\infty}(Q_{\bm{\beta}})},|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}\big{\}}}
\displaystyle\leq E1,1+E1,2+(s+1)d1K(s1)\displaystyle E_{1,1}+E_{1,2}+(s+1)^{d-1}K^{-(s-1)}
\displaystyle\leq sd(21+33)+sd(1+2)+(s+1)d1K(s1)\displaystyle s^{d}(2\mathcal{E}_{1}+3\mathcal{E}_{3})+s^{d}(\mathcal{E}_{1}+\mathcal{E}_{2})+(s+1)^{d-1}K^{-(s-1)}
\displaystyle\leq (s+1)d(K(s1)+31+2+33)\displaystyle(s+1)^{d}(K^{-(s-1)}+3\mathcal{E}_{1}+\mathcal{E}_{2}+3\mathcal{E}_{3})
\displaystyle\leq (s+1)d(K(s1)+648(N+1)2s(L+1)+10s(N+1)7s(L+1)+6N2sL2s)\displaystyle(s+1)^{d}\big{(}K^{-(s-1)}+648(N+1)^{-2s(L+1)}+10s(N+1)^{-7s(L+1)}+6N^{-2s}L^{-2s}\big{)}
\displaystyle\leq (s+1)d(8s1N2(s1)/dL2(s1)/d+(654+10s)N2sL2s)\displaystyle(s+1)^{d}\big{(}8^{s-1}N^{-2(s-1)/d}L^{-2(s-1)/d}+(654+10s)N^{-2s}L^{-2s}\big{)}
\displaystyle\leq (s+1)d(8s1+654+10s)N2(s1)/dL2(s1)/d\displaystyle(s+1)^{d}(8^{s-1}+654+10s)N^{-2(s-1)/d}L^{-2(s-1)/d}
\displaystyle\leq 84(s+1)d8sN2(s1)/dL2(s1)/d.\displaystyle 84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.

Since $\bm{\beta}\in\{0,1,2,\cdots,K-1\}^{d}$ is arbitrary and $[0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)}\subseteq\cup_{\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}}Q_{\bm{\beta}}$, we have

ϕ(𝒙)f(𝒙)𝒲1,([0,1]d\Ω([0,1]d,K,δ))84(s+1)d8sN2(s1)/dL2(s1)/d.\|\phi(\bm{x})-f(\bm{x})\|_{\mathcal{W}^{1,\infty}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}\leq 84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.

Furthermore, we have

ϕ(𝒙)𝒲1,([0,1]d)\displaystyle\|\phi(\bm{x})\|_{\mathcal{W}^{1,\infty}([0,1]^{d})} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲1,([0,1]d)\displaystyle\Bigg{\|}\sum_{|{\bm{\alpha}}|\leq s-1}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Bigg{\|}_{\mathcal{W}^{1,\infty}([0,1]^{d})}
\displaystyle\leq |𝜶|s1ϕ~𝒲1,([0,1]d)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\|\widetilde{\phi}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}
\displaystyle\leq 432sd.\displaystyle 432s^{d}.

Lastly, we finish the proof by estimating the width and depth of the network implementing $\phi(\bm{x})$. From (10), we know that $\phi(\bm{x})$ consists of the following subnetworks:

  1. $\bm{\psi}\in\textnormal{NN}(\textrm{width}\leq d(4N+3);\ \textrm{depth}\leq 4L+5)$;

  2. $\phi_{\bm{\alpha}}\in\textnormal{NN}(\textrm{width}\leq 16s(N+1)\log_{2}(8N);\ \textrm{depth}\leq 5(L+2)\log_{2}(4L))$;

  3. $P_{\bm{\alpha}}\in\textnormal{NN}(\textrm{width}\leq 9(N+1)+s-1;\ \textrm{depth}\leq 14s^{2}L)$;

  4. $\widetilde{\phi}\in\textnormal{NN}(\textrm{width}\leq 9(N+1)+1;\ \textrm{depth}\leq 4s(L+1))$.

Thus, we can infer that $\phi$ can be implemented by a ReLU network with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth

(4L+5)+4s(L+1)+14s2L+5(L+2)log2(4L)+327s2(L+2)log2(4L).\begin{split}(4L+5)+4s(L+1)+14s^{2}L+5(L+2)\log_{2}(4L)+3\leq 27s^{2}(L+2)\log_{2}(4L)\end{split}.

Hence, we have finished the proof. \qed
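The composite depth estimate above is pure bookkeeping. The following small script is an optional numerical sanity check of that arithmetic on a finite grid of parameters (our own illustration, not a substitute for the inequality):

    from math import log2

    # Check (4L+5) + 4s(L+1) + 14 s^2 L + 5(L+2)log2(4L) + 3 <= 27 s^2 (L+2) log2(4L),
    # i.e. the stacked depths of psi, phi_tilde, P_alpha and phi_alpha plus 3 extra layers.
    for s in range(1, 8):
        for L in range(1, 200):
            lhs = (4*L + 5) + 4*s*(L + 1) + 14*s**2*L + 5*(L + 2)*log2(4*L) + 3
            rhs = 27 * s**2 * (L + 2) * log2(4*L)
            assert lhs <= rhs, (s, L)
    print("depth bound holds on the sampled grid")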

3.3 Approximation Using σ2\sigma_{2}-Neural Networks

In this subsection, we provide approximation results for smooth functions measured in the $\mathcal{W}^{n,\infty}$ norm, where $n$ is a positive integer, using the smoother $\sigma_{2}$-NNs. First, we list a few basic lemmas on $\sigma_{2}$-NNs that are repeatedly applied in our main analysis.

Lemma 3.7

The following basic lemmas of σ2\sigma_{2}-NNs hold:

  1. (i)

    σ1\sigma_{1}-NNs are σ2\sigma_{2}-NNs.

  2. (ii)

    Any identity map in d\mathbb{R}^{d} can be realized exactly by a σ2\sigma_{2}-NN with one hidden layer and 2d2d neurons.

  3. (iii)

    $f(x)=x^{2}$ can be realized exactly by a $\sigma_{2}$-NN with one hidden layer and two neurons.

  4. (iv)

    $f(x,y)=xy=\tfrac{(x+y)^{2}-(x-y)^{2}}{4}$ can be realized exactly by a $\sigma_{2}$-NN with one hidden layer and four neurons.

  5. (v)

    Assume 𝒙𝜶=x1α1x2α2xdαd{\bm{x}}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}} for 𝜶d{\bm{\alpha}}\in\mathbb{N}^{d}. For any N,L+N,L\in\mathbb{N}^{+} such that NL+2log2N|𝜶|NL+2^{\lfloor\log_{2}N\rfloor}\geq|{\bm{\alpha}}|, there exists a σ2\sigma_{2}-NN ϕ\phi with width 4N+2d4N+2d and depth L+log2NL+\lceil\log_{2}N\rceil such that

    ϕ(𝒙)=𝒙𝜶for any 𝒙d.\phi({\bm{x}})={\bm{x}}^{\bm{\alpha}}\quad\textnormal{for any ${\bm{x}}\in\mathbb{R}^{d}$.}
  6. (vi)

    Assume P(𝒙)=j=1Jcj𝒙𝜶jP({\bm{x}})=\sum_{j=1}^{J}c_{j}{\bm{x}}^{{\bm{\alpha}}_{j}} for 𝜶jd{\bm{\alpha}}_{j}\in\mathbb{N}^{d}. For any N,L,a,b+N,L,a,b\in\mathbb{N}^{+} such that abJab\geq J and (L2bblog2N)Nbmaxj|𝜶j|(L-2b-b\log_{2}N)N\geq b\max_{j}|{\bm{\alpha}}_{j}|, there exists a σ2\sigma_{2}-NN ϕ\phi with width 4Na+2d+24Na+2d+2 and depth LL such that

    ϕ(𝒙)=P(𝒙)for any 𝒙d.\phi({\bm{x}})=P({\bm{x}})\quad\textnormal{for any ${\bm{x}}\in\mathbb{R}^{d}$.}
  • Proof

    Showing (i) to (iv) is trivial. We will only prove (v) and (vi) in the following.

    Part (v): In the case of $|{\bm{\alpha}}|=k\leq 1$, the proof is simple and left for the reader. When $|{\bm{\alpha}}|=k\geq 2$, the main idea of the proof of (v) is summarized in Figure 1. We apply $\sigma_{1}$-NNs to implement a $d$-dimensional identity map as in Lemma 2.1 (iii). These identity maps maintain the necessary entries of ${\bm{x}}$ to be multiplied together. We apply $\sigma_{2}$-NNs to implement the multiplication function in Lemma 3.7 (iv) and carry out $N$ multiplications per layer. After $L$ layers, there are $k-NL\leq N$ multiplications left to be implemented. Finally, these at most $N$ multiplications can be carried out by a small $\sigma_{2}$-NN with a dyadic tree structure (see the illustrative sketch after this proof).

    Part (vi): The main idea of the proof is to apply Part (v) JJ times to construct JJ σ2\sigma_{2}-NNs, {ϕj(𝒙)}j=1J\{\phi_{j}({\bm{x}})\}_{j=1}^{J}, to represent 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} and arrange these σ2\sigma_{2}-NNs as sub-NN blocks to form a larger σ2\sigma_{2}-NN ϕ~(𝒙)\tilde{\phi}({\bm{x}}) with abab blocks as shown in Figure 2, where each red rectangle represents one σ2\sigma_{2}-NN ϕj(𝒙)\phi_{j}({\bm{x}}) and each blue rectangle represents one σ1\sigma_{1}-NN of width 22 as an identity map of \mathbb{R}. There are abab red blocks with aa rows and bb columns. When abJab\geq J, these sub-NN blocks can carry out all monomials 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}}. In each column, the results of the multiplications of 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} are added up to the input of the narrow σ1\sigma_{1}-NN, which can carry the sum over to the next column. After the calculation of bb columns, JJ additions of the monomials 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} have been implemented, resulting in the output P(𝒙)P({\bm{x}}).

    By Part (v), for any N+N\in\mathbb{N}^{+}, there exists a σ2\sigma_{2}-NN ϕj(𝒙)\phi_{j}({\bm{x}}) of width 2d+4N2d+4N and depth Lj=|𝜶j|N+log2NL_{j}=\lceil\tfrac{|{\bm{\alpha}}_{j}|}{N}\rceil+\lceil\log_{2}N\rceil to implement 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}}. Note that bmaxjLjb(maxj|𝜶j|N+2+log2N)b\max_{j}L_{j}\leq b\left(\tfrac{\max_{j}|{\bm{\alpha}}_{j}|}{N}+2+\log_{2}N\right). Hence, there exists a σ2\sigma_{2}-NN ϕ~(𝒙)\tilde{\phi}({\bm{x}}) of width 2da+4Na+22da+4Na+2 and depth b(maxj|𝜶j|N+2+log2N)b\left(\tfrac{\max_{j}|{\bm{\alpha}}_{j}|}{N}+2+\log_{2}N\right) to implement P(𝒙)P({\bm{x}}) as in Figure 2. Note that the total width of each column of blocks is 2ad+4Na+22ad+4Na+2 but in fact this width can be reduced to 2d+4Na+22d+4Na+2, since the red blocks in each column can share the same identity map of d\mathbb{R}^{d} (the blue part of Figure 1).

    Note that $b\left(\tfrac{\max_{j}|{\bm{\alpha}}_{j}|}{N}+2+\log_{2}N\right)\leq L$ is equivalent to $(L-2b-b\log_{2}N)N\geq b\max_{j}|{\bm{\alpha}}_{j}|$. Hence, for any $N,L,a,b\in\mathbb{N}^{+}$ such that $ab\geq J$ and $(L-2b-b\log_{2}N)N\geq b\max_{j}|{\bm{\alpha}}_{j}|$, there exists a $\sigma_{2}$-NN $\phi({\bm{x}})$ with width $4Na+2d+2$ and depth $L$ such that $\tilde{\phi}({\bm{x}})$ is a sub-NN of $\phi({\bm{x}})$ in the sense of $\phi({\bm{x}})=\textnormal{Id}\circ\tilde{\phi}({\bm{x}})$ with $\textnormal{Id}$ an identity map of $\mathbb{R}$, which means that $\phi({\bm{x}})=\tilde{\phi}({\bm{x}})=P({\bm{x}})$. The proof of Part (vi) is completed. \qed
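To make parts (iii)–(v) of Lemma 3.7 concrete, the following minimal NumPy sketch (our illustration only; the function names are ours) realizes the square and product gadgets with a single hidden layer of $\sigma_{2}=(\textnormal{ReLU})^{2}$ neurons and then evaluates a monomial by pairwise products in a dyadic tree, which is the role played by the red triangle in Figure 1.

    import numpy as np

    def sigma2(z):
        # ReLU-squared activation
        return np.maximum(z, 0.0) ** 2

    def square(x):
        # Lemma 3.7 (iii): x^2 = sigma2(x) + sigma2(-x), one hidden layer, two neurons
        return sigma2(x) + sigma2(-x)

    def product(x, y):
        # Lemma 3.7 (iv): xy = ((x+y)^2 - (x-y)^2)/4, one hidden layer, four neurons
        return (sigma2(x + y) + sigma2(-x - y) - sigma2(x - y) - sigma2(y - x)) / 4.0

    def monomial(x, alpha):
        # Dyadic-tree evaluation of x^alpha (the idea behind part (v)): list all factors,
        # then multiply them pairwise with the exact product gadget until one remains.
        factors = [xj for xj, aj in zip(x, alpha) for _ in range(aj)]
        if not factors:
            return 1.0
        while len(factors) > 1:
            paired = [product(factors[i], factors[i + 1]) for i in range(0, len(factors) - 1, 2)]
            if len(factors) % 2 == 1:
                paired.append(factors[-1])
            factors = paired
        return factors[0]

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=3)
    assert np.allclose(square(x[0]), x[0] ** 2)
    assert np.allclose(product(x[0], x[1]), x[0] * x[1])
    assert np.allclose(monomial(x, (2, 1, 3)), x[0] ** 2 * x[1] * x[2] ** 3)

In the actual network, the $L$ layers of $N$ parallel products and the final tree are wired together through ReLU identity channels, which is precisely what the width and depth counts in parts (v) and (vi) account for.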

Figure 1: Left: An illustration of the proof of Lemma 3.7 (v). Green vectors represent the input and output of the $\sigma_{2}$-NN carrying out the monomial ${\bm{x}}^{\bm{\alpha}}$. Blue vectors represent the $\sigma_{1}$-NN that implements a $d$-dimensional identity map in Lemma 2.1 (iii), which is applied repeatedly $L$ times. Black arrows represent the data flow for carrying out the identity maps. Red vectors represent the $\sigma_{2}$-NNs implementing the multiplication function in Lemma 3.7 (iv), and there are $NL$ such red vectors. Red arrows represent the data flow for carrying out the multiplications. Finally, a red triangle represents a $\sigma_{2}$-NN of width at most $4N$ and depth at most $\lceil\log_{2}N\rceil$ carrying out the rest of the multiplications. Right: An example of the red triangle, consisting of $15$ red vectors carrying out $15$ multiplications.
Figure 2: An illustration of the proof of Lemma 3.7 (vi). Green vectors represent the input and output of the σ2\sigma_{2}-NN ϕ~(𝒙)\tilde{\phi}({\bm{x}}) carrying out P(𝒙)P({\bm{x}}). Each red rectangle represents one σ2\sigma_{2}-NN ϕj(𝒙)\phi_{j}({\bm{x}}) and each blue rectangle represents one σ1\sigma_{1}-NN of width 22 as an identity map of \mathbb{R}. There are abJab\geq J red blocks with aa rows and bb columns. When abJab\geq J, these sub-NN blocks can carry out all monomials 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}}. In each column, the results of the multiplications of 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} are added up to (indicated by black arrows) the input of the narrow σ1\sigma_{1}-NN, which can carry the sum over to the next column. Each red arrow passes 𝒙{\bm{x}} to the next red block. After the calculation of bb columns, JJ additions of the monomials 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} have been implemented, resulting in the output P(𝒙)P({\bm{x}}).

Similar to the ReLU network case, we first present the following approximation result before proving our main theorem, Theorem 1.4.

Theorem 3.8

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$ satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s$. For any $N,L\in\mathbb{N}^{+}$ satisfying $(L-2-\log_{2}N)N\geq s$, there exists a $\sigma_{2}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $10(L+2)\log_{2}(4L)$ such that $\|\phi\|_{\mathcal{W}^{n,\infty}([0,1]^{d})}\leq s^{d}$ and

fϕ𝒲n,([0,1]d\Ω([0,1]d,K,δ))2(s+1)d8snN2(sn)/dL2(sn)/d\|f-\phi\|_{\mathcal{W}^{n,\infty}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}\leq 2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}

where n<sn<s is a positive integer, K=N1/d2L2/dK=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor and 0δ13K0\leq\delta\leq\tfrac{1}{3K}.

  • Proof

Similar to the proof of Theorem 3.1, we set K=N1/d2L2/dK=\lfloor{N^{1/d}}\rfloor^{2}\lfloor L^{2/d}\rfloor and let Ω([0,1]d,K,δ)\Omega{([0,1]^{d},K,\delta)} partition [0,1]d[0,1]^{d} into KdK^{d} cubes Q𝜷Q_{\bm{\beta}} for 𝜷{0,1,,K1}d\bm{\beta}\in\{0,1,\cdots,K-1\}^{d} such that

[0,1]d=Ω([0,1]d,K,δ)(𝜷{0,1,,K1}dQ𝜷).[0,1]^{d}=\Omega{([0,1]^{d},K,\delta)}\bigcup\big{(}\cup_{\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}}Q_{\bm{\beta}}\big{)}.

For each 𝜷=[β1,β2,,βd]T{0,1,,K1}d\bm{\beta}=[\beta_{1},\beta_{2},\cdots,\beta_{d}]^{T}\in\{0,1,\cdots,K-1\}^{d}, we define

Q𝜷={𝒙=[x1,x2,,xd]T:xi[βiK,βi+1Kδ1{βiK2}]fori=1,2,,d}.Q_{\bm{\beta}}=\Bigg{\{}\bm{x}=[x_{1},x_{2},\cdots,x_{d}]^{T}:x_{i}\in\Big{[}\tfrac{\beta_{i}}{K},\tfrac{\beta_{i}+1}{K}-\delta\cdot 1_{\{\beta_{i}\leq K-2\}}\Big{]}~{}\text{for}~{}i=1,2,\dots,d\Bigg{\}}.

By Proposition 2.2, there exists a σ2\sigma_{2}-NN ψ\psi with width 4N+34N+3 and depth 4L+54L+5 such that

ψ(x)=k,ifx[kK,k+1Kδ1{kK2}]fork=0,1,,K1.\psi(x)=k,\quad\text{if}~{}x\in\big{[}\tfrac{k}{K},\tfrac{k+1}{K}-\delta\cdot 1_{\{k\leq K-2\}}\big{]}~{}\text{for}~{}k=0,1,\dots,K-1.

Then, for each $\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}$, $\psi(x_{i})=\beta_{i}$ if $\bm{x}\in Q_{\bm{\beta}}$ for $i=1,2,\cdots,d$.

Define

𝝍(𝒙):=[ψ(x1),ψ(x2),,ψ(xd)]T/Kfor any𝒙[0,1]d,\bm{\psi{(x)}}:=[\psi(x_{1}),\psi(x_{2}),\cdots,\psi(x_{d})]^{T}/K\quad\text{for any}~{}\bm{x}\in[0,1]^{d},

then

𝝍(𝒙)=𝜷/Kif𝒙Q𝜷for𝜷{0,1,,K1}d.\bm{\psi{(x)}}=\bm{\beta}/K\quad\text{if}~{}\bm{x}\in Q_{\bm{\beta}}\quad\text{for}~{}\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}.
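As an illustration of this partition (a sketch under the definitions above, not the network provided by Proposition 2.2), the following Python snippet computes, for a point $\bm{x}$ outside the trifling region, the index $\bm{\beta}$ with $\bm{x}\in Q_{\bm{\beta}}$, the value $\bm{\psi}(\bm{x})=\bm{\beta}/K$ that the step network is required to output, and the linearized index $i=\sum_{j}\beta_{j}K^{j-1}$ used for point fitting below.

    import numpy as np

    def cube_index(x, K, delta):
        # Return beta with x in Q_beta, or None if x lies in one of the delta-strips
        # (the trifling region); Q_beta is the cube defined above.
        x = np.asarray(x, dtype=float)
        beta = np.minimum(np.floor(x * K).astype(int), K - 1)
        upper = (beta + 1) / K - delta * (beta <= K - 2)
        return None if np.any(x > upper) else beta

    K, delta, d = 4, 0.01, 2
    x = [0.30, 0.55]                              # a sample point in [0,1]^2
    beta = cube_index(x, K, delta)
    psi_x = beta / K                              # target output of the step network psi
    i = int(np.sum(beta * K ** np.arange(d)))     # i = sum_j beta_j K^(j-1)
    print(beta.tolist(), psi_x.tolist(), i)       # [1, 2] [0.25, 0.5] 9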

Now, we fix a 𝜷{0,1,,K1}d\bm{\beta}\in\{0,1,\cdots,K-1\}^{d} throughout the proof. For any 𝒙Q𝜷\bm{x}\in Q_{\bm{\beta}}, by Taylor’s expansion there exists a ξ𝒙(0,1)\xi_{\bm{x}}\in(0,1) such that

f(𝒙)=|𝜶|s1𝜶f(𝝍(𝒙))𝜶!𝒉𝜶+|𝜶|=s𝜶f(𝝍(𝒙)+ξ𝒙𝒉)𝜶!𝒉𝜶,{f(\bm{x})}=\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}+\sum_{|{\bm{\alpha}}|=s}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}},

where 𝒉=𝒙𝝍(𝒙)\bm{h}=\bm{x}-\bm{\psi{(\bm{x})}}.

By Lemma 3.7 (iv), there exists a σ2\sigma_{2}-NN ϕ~\widetilde{\phi} with width 44 and depth 11 such that

ϕ~(x,y)=xy\widetilde{\phi}(x,y)=xy

for any x,y(3,3).x,y\in(-3,3).

Note that it is trivial to construct σ2\sigma_{2}-NNs\textnormal{NN}s P𝜶(𝒙)P_{\bm{\alpha}}(\bm{x}) for 𝒙𝜶\bm{x}^{\bm{\alpha}} when |𝜶|1|{\bm{\alpha}}|\leq 1. Thus, for each 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s1|{\bm{\alpha}}|\leq s-1 and for any N,L+N,L\in\mathbb{N}^{+} satisfying (L2log2N)Ns(L-2-\log_{2}N)N\geq s, by Lemma 3.7 (vi) with a=b=J=1a=b=J=1, there exists a σ2\sigma_{2}-NN P𝜶P_{\bm{\alpha}} with width 4N+2d+24N+2d+2 and depth LL such that

P𝜶(𝒙)=𝒙𝜶P_{\bm{\alpha}}(\bm{x})=\bm{x}^{\bm{\alpha}}

for any 𝒙d\bm{x}\in\mathbb{R}^{d}.

For each i=0,1,,Kd1i=0,1,\cdots,K^{d}-1, we define

𝜼(i)=[η1,η2,,ηd]T{0,1,,K1}d\bm{\eta}(i)=[\eta_{1},\eta_{2},\cdots,\eta_{d}]^{T}\in\{0,1,\dots,K-1\}^{d}

such that j=1dηjKj1=i\sum_{j=1}^{d}\eta_{j}K^{j-1}=i. We will drop the input ii in 𝜼(i)\bm{\eta}(i) later for simplicity. For each 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s1|{\bm{\alpha}}|\leq s-1, define

ξ𝜶,i=(𝜶f(ηK)+1)/2.\xi_{\bm{\alpha},i}=\big{(}\partial^{\bm{\alpha}}f\big{(}\tfrac{\eta}{K}\big{)}+1\big{)}/2.

Note that Kd=(N1/d2L2/d)dN2L2K^{d}=(\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor)^{d}\leq N^{2}L^{2} and ξ𝜶,i[0,1]\xi_{\bm{\alpha},i}\in[0,1] for i=0,1,,Kd1i=0,1,\cdots,K^{d}-1. By Proposition 2.3, there exists a σ1\sigma_{1}-NN ϕ~𝜶\widetilde{\phi}_{\bm{\alpha}}, which is also a σ2\sigma_{2}-NN, of width 16s(N+1)log2(8N)16s(N+1)\log_{2}(8N) and depth 5(L+2)log2(4L)5(L+2)\log_{2}(4L) such that

|ϕ~𝜶(i)ξ𝜶,i|N2sL2s,fori=0,1,,Kd1and|𝜶|s1.|\widetilde{\phi}_{\bm{\alpha}}(i)-\xi_{\bm{\alpha},i}|\leq N^{-2s}L^{-2s},\quad\text{for}~{}i=0,1,\cdots,K^{d}-1~{}\text{and}~{}|{\bm{\alpha}}|\leq s-1.

Define

\phi_{\bm{\alpha}}(\bm{x}):=2\widetilde{\phi}_{\bm{\alpha}}\Bigg{(}\sum_{j=1}^{d}x_{j}K^{j}\Bigg{)}-1,\quad\text{for any}~\bm{x}=[x_{1},x_{2},\cdots,x_{d}]^{T}\in\mathbb{R}^{d},\quad\text{so that}~\phi_{\bm{\alpha}}\big(\tfrac{\bm{\eta}}{K}\big)=2\widetilde{\phi}_{\bm{\alpha}}\big(\textstyle\sum_{j=1}^{d}\eta_{j}K^{j-1}\big)-1=2\widetilde{\phi}_{\bm{\alpha}}(i)-1.

For each |𝜶|s1|{\bm{\alpha}}|\leq s-1, we know that ϕ𝜶\phi_{\bm{\alpha}} is also of width 16s(N+1)log2(8N)16s(N+1)\log_{2}(8N) and depth 5(L+2)log2(4L)5(L+2)\log_{2}(4L).

Then for each 𝜼=[η1,η2,,ηd]T{0,1,,K1}d\bm{\eta}=[\eta_{1},\eta_{2},\cdots,\eta_{d}]^{T}\in\{0,1,\dots,K-1\}^{d} corresponding to i=j=1dηjKj1i=\sum_{j=1}^{d}\eta_{j}K^{j-1}, each 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s1|{\bm{\alpha}}|\leq s-1, we have

\big{|}\phi_{\bm{\alpha}}(\tfrac{\bm{\eta}}{K})-\partial^{\bm{\alpha}}f(\tfrac{\bm{\eta}}{K})\big{|}=\big{|}2\widetilde{\phi}_{\bm{\alpha}}(i)-1-(2\xi_{\bm{\alpha},i}-1)\big{|}=2|\widetilde{\phi}_{\bm{\alpha}}(i)-\xi_{\bm{\alpha},i}|\leq 2N^{-2s}L^{-2s}.

From 𝝍(𝒙)=𝜷K\bm{\psi{(x)}}=\tfrac{\bm{\beta}}{K} for 𝒙Q𝜷\bm{x}\in Q_{\bm{\beta}}, it follows that

ϕ𝜶(𝝍(𝒙))𝜶f(𝝍(𝒙))𝒲n,(Q𝜷)\displaystyle\|\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})-\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})\|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})} =\displaystyle= ϕ𝜶(𝝍(𝒙))𝜶f(𝝍(𝒙))L(Q𝜷)\displaystyle\|\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})-\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})\|_{{L}^{\infty}(Q_{\bm{\beta}})}
=\displaystyle= |ϕ𝜶(𝜷K)𝜶f(𝜷K)|\displaystyle\big{|}\phi_{\bm{\alpha}}(\tfrac{\bm{\beta}}{K})-\partial^{\bm{\alpha}}f(\tfrac{\bm{\beta}}{K})\big{|}
\displaystyle\leq 2N2sL2s=:3.\displaystyle 2N^{-2s}L^{-2s}=:\mathcal{E}_{3}.

Note that $\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})-\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})$ is constant for $\bm{x}\in Q_{\bm{\beta}}$, so its weak derivatives vanish, which gives the first equality above.

Define

ϕ(𝒙)=|𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))\phi(\bm{x})=\sum_{|{\bm{\alpha}}|\leq s-1}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)} (12)

for any 𝒙d\bm{x}\in\mathbb{R}^{d}.

Let us now first estimate the overall approximation error in semi-norm. We denote the nn-th order derivatives of ff by 𝜸f\partial^{\bm{\gamma}}f with |𝜸|=n|\bm{\gamma}|=n. For any 𝒙Q𝜷\bm{x}\in Q_{\bm{\beta}}, by Taylor’s expansion there exists a ξ𝒙𝜸(0,1)\xi_{{\bm{x}}}^{\bm{\gamma}}\in(0,1) such that

𝜸f\displaystyle\partial^{\bm{\gamma}}f =\displaystyle= |𝜶|sn1𝜶𝜸f(𝝍(𝒙))𝜶!𝒉𝜶+|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙𝜸𝒉)𝜶!𝒉𝜶\displaystyle\sum_{|{\bm{\alpha}}|\leq s-n-1}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}+\sum_{|{\bm{\alpha}}|=s-n}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+\xi_{{\bm{x}}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}
=\displaystyle= 𝜸(|𝜶|s1𝜶f(𝝍(𝒙))𝜶!𝒉𝜶)+|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙𝜸𝒉)𝜶!𝒉𝜶,\displaystyle\partial^{\bm{\gamma}}\Big{(}\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{)}+\sum_{|{\bm{\alpha}}|=s-n}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}},

where 𝒉=𝒙𝝍(𝒙)\bm{h}=\bm{x}-\bm{\psi{(\bm{x})}}.
Therefore, for any 𝒙Q𝜷\bm{x}\in Q_{\bm{\beta}} we have

\displaystyle|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq\Big{|}\phi(\bm{x})-\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}+\Big{|}\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}-f(\bm{x})\Big{|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1|ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶|𝒲n,(Q𝜷)\displaystyle{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}
+max|𝜸|=n|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙𝜸𝒉)𝜶!𝒉𝜶(Q𝜷)\displaystyle+\max_{|\bm{\gamma}|=n}\sum_{|{\bm{\alpha}}|=s-n}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)=:E1\displaystyle\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=:E_{1}}
+max|𝜸|=n|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙(𝜸)𝒉)𝜶!𝒉𝜶(Q𝜷)=:E2,\displaystyle+\underbrace{\max_{|\bm{\gamma}|=n}\sum_{|{\bm{\alpha}}|=s-n}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+{\xi}^{(\bm{\gamma})}_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}}_{=:E_{2}},

where $E_{2}$ with ${\xi}^{(\bm{\gamma})}_{\bm{x}}\in(0,1)$ is the remainder term resulting from Taylor's expansion of $\partial^{\bm{\gamma}}f$. As in the ReLU case, the error term $E_{1}$ is further decomposed into two parts via

E1\displaystyle E_{1} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲n,(Q𝜷)=:E1,1\displaystyle\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=:E_{1,1}}
+|𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)=:E1,2.\displaystyle+\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=:E_{1,2}}.

Hence, we can now measure E1,1E_{1,1}, E1,2E_{1,2}, and E2E_{2}:

E1,1\displaystyle E_{1,1} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲n,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1(ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ𝜶(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲n,(Q𝜷)=0\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Bigg{(}\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=0}
+ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲n,(Q𝜷)=0\displaystyle+\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=0}
+ϕ𝜶(𝝍(𝒙))𝜶!P𝜶(𝒉)𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲n,(Q𝜷)3)\displaystyle+\underbrace{\Big{\|}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{\leq\mathcal{E}_{3}}\Bigg{)}
\displaystyle\leq |𝜶|s13\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\mathcal{E}_{3}
\displaystyle\leq sd3.\displaystyle s^{d}\mathcal{E}_{3}.

Note that the last inequality follows from the fact that $\sum_{|{\bm{\alpha}}|\leq s-1}1\leq\sum_{i=0}^{s-1}(i+1)^{d-1}\leq s^{d}$. Similarly,

E1,2\displaystyle E_{1,2} =\displaystyle= |𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲n,(Q𝜷)=0\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=0}
+|𝜶|s1𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)=0\displaystyle+\sum_{|{\bm{\alpha}}|\leq s-1}{\underbrace{\Big{\|}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{{}_{=0}}}
\displaystyle\leq 0.\displaystyle 0.

We now provide a measure on E2E_{2}. First, by the assumption that 𝜶fL([0,1]d)<1\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1 for any 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s|{\bm{\alpha}}|\leq s, we have

E2\displaystyle E_{2} =\displaystyle= max|𝜸|=n|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙(𝜸)𝒉)𝜶!𝒉𝜶(Q𝜷)\displaystyle\max_{|\bm{\gamma}|=n}\sum_{|{\bm{\alpha}}|=s-n}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+{\xi}^{(\bm{\gamma})}_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|=sn1𝜶!𝒉𝜶L(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|=s-n}\Big{\|}\tfrac{1}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq (sn+1)d1K(sn),\displaystyle(s-n+1)^{d-1}K^{-(s-n)},

where the last inequality follows from $\sum_{|{\bm{\alpha}}|=s-n}1\leq(s-n+1)^{d-1}$.
Thus, we have shown that

|ϕ(𝒙)f(𝒙)|𝒲n,(Q𝜷)\displaystyle|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq E1,1+E1,2+E2\displaystyle E_{1,1}+E_{1,2}+E_{2}
\displaystyle\leq E1,1+E1,2+(sn+1)d1K(sn).\displaystyle E_{1,1}+E_{1,2}+(s-n+1)^{d-1}K^{-(s-n)}.

We now estimate |ϕ(𝒙)f(𝒙)|𝒲j,(Q𝜷),j=0,,n1|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})},j=0,\dots,n-1. Using similar arguments and considering Taylor’s expansion of 𝜸f{\partial^{\bm{\gamma}}}f, |𝜸|n1|\bm{\gamma}|\leq n-1, with ξ𝒙(𝜸)(0,1){\xi}^{(\bm{\gamma})}_{\bm{x}}\in(0,1) in the remainder term, we have

|ϕ(𝒙)f(𝒙)|𝒲j,(Q𝜷)\displaystyle|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq\Big{|}\phi(\bm{x})-\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{|}_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})}+\Big{|}\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}-f(\bm{x})\Big{|}_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1|ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶|𝒲j,(Q𝜷)\displaystyle{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{|}_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})}}
+max|𝜸|=j|𝜶|=sj𝜶𝜸f(𝝍(𝒙)+ξ𝒙(𝜸)𝒉)𝜶!𝒉𝜶(Q𝜷)\displaystyle+\max_{|\bm{\gamma}|=j}\sum_{|{\bm{\alpha}}|=s-j}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+{\xi}^{(\bm{\gamma})}_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)\displaystyle{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}
+(sj+1)d1K(sj)\displaystyle+(s-j+1)^{d-1}K^{-(s-j)}
\displaystyle\leq E1,1+E1,2+(sj+1)d1K(sj),\displaystyle E_{1,1}+E_{1,2}+(s-j+1)^{d-1}K^{-(s-j)},

where the second last inequality is due to |𝜶|=sj1(sj+1)d1\sum_{|{\bm{\alpha}}|=s-j}1\leq(s-j+1)^{d-1}.

Using K=N1/d2L2/dN2/dL2/d8K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor\geq\tfrac{N^{2/d}L^{2/d}}{8}, we have

ϕ(𝒙)f(𝒙)𝒲n,(Q𝜷)\displaystyle\|\phi(\bm{x})-f(\bm{x})\|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
=\displaystyle= max{ϕ(𝒙)f(𝒙)L(Q𝜷),,|ϕ(𝒙)f(𝒙)|𝒲n1,(Q𝜷),|ϕ(𝒙)f(𝒙)|𝒲n,(Q𝜷)}\displaystyle\max{\big{\{}\|\phi(\bm{x})-f(\bm{x})\|_{{L}^{\infty}(Q_{\bm{\beta}})},\dots,|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{n-1,\infty}(Q_{\bm{\beta}})},|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}\big{\}}}
\displaystyle\leq E1,1+E1,2+(s+1)d1K(sn)\displaystyle E_{1,1}+E_{1,2}+(s+1)^{d-1}K^{-(s-n)}
\displaystyle\leq sd3+(s+1)d1K(sn)\displaystyle s^{d}\mathcal{E}_{3}+(s+1)^{d-1}K^{-(s-n)}
\displaystyle\leq (s+1)d(K(sn)+3)\displaystyle(s+1)^{d}(K^{-(s-n)}+\mathcal{E}_{3})
\displaystyle\leq (s+1)d(K(sn)+2N2sL2s)\displaystyle(s+1)^{d}\big{(}K^{-(s-n)}+2N^{-2s}L^{-2s}\big{)}
\displaystyle\leq (s+1)d(8snN2(sn)/dL2(sn)/d+2N2sL2s)\displaystyle(s+1)^{d}\big{(}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}+2N^{-2s}L^{-2s}\big{)}
\displaystyle\leq (s+1)d(8sn+2)N2(sn)/dL2(sn)/d\displaystyle(s+1)^{d}(8^{s-n}+2)N^{-2(s-n)/d}L^{-2(s-n)/d}
\displaystyle\leq 2(s+1)d8snN2(sn)/dL2(sn)/d.\displaystyle 2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}.

Since $\bm{\beta}\in\{0,1,2,\cdots,K-1\}^{d}$ is arbitrary and $[0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)}\subseteq\cup_{\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}}Q_{\bm{\beta}}$, we have

ϕ(𝒙)f(𝒙)𝒲n,([0,1]d\Ω([0,1]d,K,δ))2(s+1)d8snN2(sn)/dL2(sn)/d.\|\phi(\bm{x})-f(\bm{x})\|_{\mathcal{W}^{n,\infty}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}\leq 2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}.

Furthermore, we have

\displaystyle\|\phi(\bm{x})\|_{\mathcal{W}^{n,\infty}([0,1]^{d})}=\Bigg{\|}\sum_{|{\bm{\alpha}}|\leq s-1}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Bigg{\|}_{\mathcal{W}^{n,\infty}([0,1]^{d})}
\displaystyle\leq\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{n,\infty}([0,1]^{d})}
\displaystyle\leq s^{d}.

Lastly, we finish the proof by estimating the width and depth of the network implementing $\phi(\bm{x})$. From (12), for any $N,L\in\mathbb{N}^{+}$ satisfying $(L-2-\log_{2}N)N\geq s$, we know that $\phi(\bm{x})$ consists of the following subnetworks:

  1. $\bm{\psi}\in\textnormal{NN}(\textrm{width}\leq d(4N+3);\ \textrm{depth}\leq 4L+5)$;

  2. $\phi_{\bm{\alpha}}\in\textnormal{NN}(\textrm{width}\leq 16s(N+1)\log_{2}(8N);\ \textrm{depth}\leq 5(L+2)\log_{2}(4L))$;

  3. $P_{\bm{\alpha}}\in\textnormal{NN}(\textrm{width}\leq 4N+2d+2;\ \textrm{depth}\leq L)$;

  4. $\widetilde{\phi}\in\textnormal{NN}(\textrm{width}\leq 4;\ \textrm{depth}\leq 1)$.

Thus, we can infer that $\phi$ can be implemented by a $\sigma_{2}$-NN with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth

(4L+5)+1+L+5(L+2)log2(4L)+310(L+2)log2(4L).\begin{split}(4L+5)+1+L+5(L+2)\log_{2}(4L)+3\leq 10(L+2)\log_{2}(4L)\end{split}.

Hence, we have finished the proof. \qed

We can now prove Theorem 1.4 using Theorem 3.8.

  • Proof of Theorem 1.4

    When ff is a constant function, the statement is trivial. By Theorem 3.8, there exists a σ2\sigma_{2}-NN ϕ\phi with width 16sd+1d(N+2)log2(8N)16s^{d+1}d(N+2)\log_{2}{({8}N)} and depth 10(L+2)log2(4L)10(L+2)\log_{2}(4L) such that ϕ𝒲n,([0,1]d)sd\|{\phi}\|_{\mathcal{W}^{n,\infty}([0,1]^{d})}\leq s^{d} and

    \|f-\phi\|_{\mathcal{W}^{n,\infty}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}\leq 2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d},\quad\text{where}~K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor~\text{and}~0\leq\delta\leq\tfrac{1}{3K}.

    Now, we choose a sufficiently small $\delta$ within this range such that

    Kdδ(N2(sn)/dL2(sn)/d)p.Kd\delta\leq\big{(}N^{-2(s-n)/d}L^{-2(s-n)/d}\big{)}^{p}.

    Then, we have

    fϕ𝒲n,p([0,1]d)p\displaystyle\|f-\phi\|^{p}_{\mathcal{W}^{n,p}([0,1]^{d})} =fϕ𝒲n,p(Ω([0,1]d,K,δ))p+fϕ𝒲n,p([0,1]d\Ω([0,1]d,K,δ))p\displaystyle=\|f-\phi\|^{p}_{\mathcal{W}^{n,p}(\Omega{([0,1]^{d},K,\delta)})}+\|f-\phi\|^{p}_{\mathcal{W}^{n,p}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}
    Kdδ(sd)p+(2(s+1)d8snN2(sn)/dL2(sn)/d)p\displaystyle\leq Kd\delta(s^{d})^{p}+\big{(}2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}\big{)}^{p}
    \displaystyle\leq\big{(}s^{d}N^{-2(s-n)/d}L^{-2(s-n)/d}\big{)}^{p}+\big{(}2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}\big{)}^{p}
    (3(s+1)d8snN2(sn)/dL2(sn)/d)p.\displaystyle\leq(3(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d})^{p}.

    where the last inequality uses $a^{p}+b^{p}\leq(a+b)^{p}$ for $a,b\geq 0$ and $p\geq 1$, together with $s^{d}+2(s+1)^{d}8^{s-n}\leq 3(s+1)^{d}8^{s-n}$. Hence, we have

    \|f-\phi\|_{\mathcal{W}^{n,p}([0,1]^{d})}\leq 3(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}.
    \qed
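As a simple illustration of how the final bound can be used (a back-of-the-envelope computation with hypothetical parameter values, not a statement from the theorem itself), the following snippet evaluates $3(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}$ and reports the first dyadic choice $N=L$ that pushes it below a prescribed tolerance.

    def w_np_bound(N, L, s, d, n):
        # Error bound of Theorem 1.4: 3 (s+1)^d 8^(s-n) N^(-2(s-n)/d) L^(-2(s-n)/d)
        return 3 * (s + 1)**d * 8**(s - n) * (N * L) ** (-2 * (s - n) / d)

    s, d, n, tol = 4, 2, 1, 1e-3     # hypothetical sample parameters
    N = 1
    while w_np_bound(N, N, s, d, n) > tol:
        N *= 2
    print(N, w_np_bound(N, N, s, d, n))   # first dyadic N = L meeting the tolerance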

4 Conclusions

We have given a number of theoretical results on explicit error characterization for approximating smooth functions using deep ReLU networks and their smoother variants. Our results, measured in Sobolev norms, are well suited for studying neural network based solvers for high-dimensional PDEs. Further generalizing our analysis to other function spaces, including Hölder spaces, and to other neural networks, such as Floor-ReLU networks, will be an interesting direction for future research. Numerical investigation of our findings in the setting of solving parametric PDEs is also left as future work.

Acknowledgements

The authors would like to thank the editor and the two anonymous reviewers for their valuable comments and constructive suggestions, which have greatly improved this paper. S. Hon was partially supported by the Hong Kong RGC under Grant 22300921, a start-up grant from the Croucher Foundation, and a Tier 2 Start-up Grant from the Hong Kong Baptist University. H. Yang was partially supported by the US National Science Foundation under award DMS-1945029.

References

  • [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 242–252. PMLR, 2019.
  • [2] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 322–332. PMLR, 2019.
  • [3] Julius Berner, Philipp Grohs, and Arnulf Jentzen. Analysis of the generalization error: empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–657, 2020.
  • [4] Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019.
  • [5] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • [6] Fan Chen, Jianguo Huang, Chunmei Wang, and Haizhao Yang. Friedrichs learning: weak solutions of partial differential equations via deep learning. arXiv e-prints, page arXiv:2012.08023, 2020.
  • [7] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear approximation and (deep) ReLU networks. Constructive Approximation, 2021.
  • [8] Ronald DeVore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica, 30:327–444, 2021.
  • [9] Ronald A. DeVore, Ralph Howard, and Charles Micchelli. Optimal nonlinear approximation. manuscripta mathematica, 63:469–478, 1989.
  • [10] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1675–1685. PMLR, 2019.
  • [11] Simon S. Du, Barnabás Poczós, Xiyu Zhai, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. 7th International Conference on Learning Representations, ICLR 2019, pages 1–19, 2019.
  • [12] Weinan E. A proposal on machine Learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
  • [13] Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
  • [14] Weinan E, Chao Ma, and Lei Wu. The Barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 2021.
  • [15] Dennis Elbrächter, Philipp Grohs, Arnulf Jentzen, and Christoph Schwab. DNN expression rate analysis of high-dimensional PDEs: application to option pricing. Constructive Approximation, 2021.
  • [16] Moritz Geist, Philipp Petersen, Mones Raslan, Reinhold Schneider, and Gitta Kutyniok. Numerical solution of the parametric diffusion equation by deep neural networks. Journal of Scientific Computing, 88(1):22, 2021.
  • [17] Rémi Gribonval, Gitta Kutyniok, Morten Nielsen, and Felix Voigtlaender. Approximation spaces of deep neural networks. Constructive Approximation, 2021.
  • [18] Yiqi Gu, Haizhao Yang, and Chao Zhou. SelectNet: self-paced learning for high-dimensional partial differential equations. Journal of Computational Physics, 441:110444, 2021.
  • [19] Ingo Gühring, Gitta Kutyniok, and Philipp Petersen. Error bounds for approximations with deep ReLU neural networks in Ws,pW^{s,p} norms. Analysis and Applications, 18(5):803–859, 2020.
  • [20] Ingo Gühring and Mones Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. Neural Networks, 134:107–130, 2021.
  • [21] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • [22] John Harlim, Shixiao W Jiang, Senwei Liang, and Haizhao Yang. Machine learning for prediction with missing dynamics. Journal of Computational Physics, 428:109922, 2021.
  • [23] Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. ReLU deep neural networks and linear finite elements. Journal of Computational Mathematics, 38(3):502–527, 2020.
  • [24] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 2018-Decem, pages 8571–8580, 2018.
  • [25] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric PDE problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
  • [26] Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical analysis of deep neural networks and parametric PDEs. Constructive Approximation, 2021.
  • [27] Senwei Liang, Shixiao W Jiang, John Harlim, and Haizhao Yang. Solving PDEs on unknown manifolds with machine learning. arXiv e-prints, page arXiv:2106.06682, 2021.
  • [28] Zichao Long, Yiping Lu, and Bin Dong. PDE-Net 2.0: learning PDEs from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, 399:108925, 2019.
  • [29] Jianfeng Lu and Yulong Lu. A priori generalization error analysis of two-Layer neural networks for solving high dimensional Schrödinger eigenvalue problems. arXiv e-prints, page arXiv:2105.01228, 2021.
  • [30] Jianfeng Lu, Yulong Lu, and Min Wang. A priori generalization analysis of the deep Ritz method for solving high dimensional elliptic equations. Proceedings of Thirty Fourth Conference on Learning Theory, PMLR, 134:3196-3241, 2021.
  • [31] Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5), 5465–5506, 2021.
  • [32] Tao Luo and Haizhao Yang. Two-layer neural networks for partial differential equations: optimization and generalization Theory, 2020.
  • [33] Chao Ma, Jianchun Wang, and Weinan E. Model reduction with memory and the machine learning of dynamical systems. Communications in Computational Physics, 25(4):947–962, 2019.
  • [34] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2388–2464. PMLR, 2019.
  • [35] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665—-E7671, 2018.
  • [36] Hadrien Montanelli and Qiang Du. New error bounds for deep ReLU networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1):78–92, 2019.
  • [37] Hadrien Montanelli and Haizhao Yang. Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem. Neural Networks, 129:1–6, 2020.
  • [38] Joost A A Opschoor, Philipp C Petersen, and Christoph Schwab. Deep ReLU networks and high-order finite element methods. Analysis and Applications, 18(05):715–770, 2020.
  • [39] Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
  • [40] Tong Qin, Kailiang Wu, and Dongbin Xiu. Data driven governing equations approximation using deep neural networks. Journal of Computational Physics, 395:620–635, 2019.
  • [41] M Raissi, P Perdikaris, and G E Karniadakis. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
  • [42] Christoph Schwab and Jakob Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(1), 2019.
  • [43] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, 2019.
  • [44] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5):1768–1811, 2020.
  • [45] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation: achieving arbitrary accuracy with fixed number of neurons. arXiv e-prints, page arXiv:2107.02397, 2021.
  • [46] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Neural network approximation: three hidden layers are enough. Neural Networks, 141:160–173, 2021.
  • [47] Jonathan W Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313–321, 2020.
  • [48] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019.
  • [49] E. Weinan, Chao Ma, and Lei Wu. A priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences, 17(5):1407–1425, 2019.
  • [50] E. Weinan and Qingcan Wang. Exponential convergence of the deep neural network approximation for analytic functions. Science China Mathematics, 61(10):1733–1740, 2018.
  • [51] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • [52] Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, and Weinan E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett., 120(14):143001, 2018.
  • [53] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4140–4149. PMLR, 2017.