
Simultaneous Neural Network Approximation for Smooth Functions

Sean Hon, Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong SAR, [email protected]; Haizhao Yang, Department of Mathematics, Purdue University, IN 47907, USA, [email protected]
Abstract

We establish in this work approximation results of deep neural networks for smooth functions measured in Sobolev norms, motivated by the recent development of numerical solvers for partial differential equations using deep neural networks. Our approximation results are nonasymptotic in the sense that the error bounds are explicitly characterized in terms of both the width and depth of the networks simultaneously, with all involved constants explicitly determined. Namely, for $f\in C^{s}([0,1]^{d})$, we show that deep ReLU networks of width $\mathcal{O}(N\log N)$ and of depth $\mathcal{O}(L\log L)$ can achieve a nonasymptotic approximation rate of $\mathcal{O}(N^{-2(s-1)/d}L^{-2(s-1)/d})$ with respect to the $\mathcal{W}^{1,p}([0,1]^{d})$ norm for $p\in[1,\infty)$. If either the ReLU function or its square is applied as the activation function to construct deep neural networks of width $\mathcal{O}(N\log N)$ and of depth $\mathcal{O}(L\log L)$ to approximate $f\in C^{s}([0,1]^{d})$, the approximation rate is $\mathcal{O}(N^{-2(s-n)/d}L^{-2(s-n)/d})$ with respect to the $\mathcal{W}^{n,p}([0,1]^{d})$ norm for $p\in[1,\infty)$.

keywords:
Deep neural networks, Sobolev norm, ReLU$^{k}$ activation functions, approximation theory

1 Introduction

Over the past decades, deep neural networks have made remarkable impacts in various areas of science and engineering. With the aid of high-performance computing equipment and an abundance of high-quality data, neural network based methods outperform traditional machine learning methods in a wide range of applications, including image classification, object detection, and speech recognition, to name just a few. The success of neural networks also motivates their application in scientific computing, including the recovery of governing equations for mathematical modeling and prediction [52, 12, 40, 28, 22, 33] and solving partial differential equations (PDEs) [41, 13, 21, 26, 15, 16, 25, 18, 6, 27].

Neural network based methods have evoked many open problems in mathematical theory, notwithstanding their success in practice. In a typical supervised learning algorithm, a potentially high-dimensional target function $f(\bm{x})$ defined on a domain $\Omega$ is to be learned from a finite set of data samples $\{(\bm{x}_{i},f(\bm{x}_{i}))\}_{i=1}^{n}$. When a deep network is integrated in the learning process, one needs to identify a deep network $\phi(\bm{x};\mathbf{\theta_{S}})$, with parameters $\mathbf{\theta_{S}}$, to determine $f(\bm{x})$ for unseen data samples $\bm{x}$; namely, the following optimization problem arises

\[
\mathbf{\theta_{S}}=\text{arg}\min_{\mathbf{\theta}}R_{S}(\mathbf{\theta}):=\text{arg}\min_{\mathbf{\theta}}\tfrac{1}{n}\sum_{i=1}^{n}l(\phi(x_{i};\theta),f(x_{i})), \tag{1}
\]

where $l$ denotes a loss function.

Now, let us inspect the overall inference error, which is estimated by $R_{D}(\mathbf{\theta_{S}})$, where $R_{D}(\mathbf{\theta}):=E_{\bm{x}\sim U(\Omega)}[l(\phi(\bm{x};\mathbf{\theta}),f(\bm{x}))]$. In reality, $U(\Omega)$ is unknown and only finitely many samples from this distribution are available. Hence, the empirical loss $R_{S}(\mathbf{\theta})$ is minimized in the hope of obtaining $\phi(\bm{x};\mathbf{\theta_{S}})$, instead of minimizing the population loss $R_{D}(\mathbf{\theta})$ to obtain $\phi(\bm{x};\mathbf{\theta_{D}})$, where $\mathbf{\theta_{D}}=\text{arg}\min_{\mathbf{\theta}}R_{D}(\mathbf{\theta})$. In practice, a numerical optimization method for solving (1) may result in a numerical solution (denoted by $\mathbf{\theta_{\mathcal{N}}}$) that is not necessarily a global minimizer $\mathbf{\theta_{S}}$. Therefore, the neural network actually learned to infer $f(\bm{x})$ is $\phi(\bm{x};\mathbf{\theta_{\mathcal{N}}})$, and the corresponding inference error is measured by $R_{D}(\mathbf{\theta_{\mathcal{N}}})$. By the discussion just above, it is crucial to quantify $R_{D}(\mathbf{\theta_{\mathcal{N}}})$ to see how good the learned neural network $\phi(\bm{x};\mathbf{\theta_{\mathcal{N}}})$ is, since $R_{D}(\mathbf{\theta_{\mathcal{N}}})$ is the expected inference error over all possible data samples. Note that

\[
R_{D}(\mathbf{\theta_{\mathcal{N}}})\leq\underbrace{R_{D}(\mathbf{\theta_{D}})}_{\rm{approximation}}+\underbrace{[R_{S}(\mathbf{\theta_{\mathcal{N}}})-R_{S}(\mathbf{\theta_{S}})]}_{\rm{optimization}}+\underbrace{[R_{D}(\mathbf{\theta_{\mathcal{N}}})-R_{S}(\mathbf{\theta_{\mathcal{N}}})]+[R_{S}(\mathbf{\theta_{D}})-R_{D}(\mathbf{\theta_{D}})]}_{\rm{generalization}}, \tag{2}
\]

where the inequality comes from the fact that $[R_{S}(\mathbf{\theta_{S}})-R_{S}(\mathbf{\theta_{D}})]\leq 0$ since $\mathbf{\theta_{S}}$ is a global minimizer of $R_{S}(\mathbf{\theta})$.
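Indeed, (2) can be seen from the elementary identity
\[
R_{D}(\mathbf{\theta_{\mathcal{N}}})=R_{D}(\mathbf{\theta_{D}})+[R_{S}(\mathbf{\theta_{\mathcal{N}}})-R_{S}(\mathbf{\theta_{S}})]+[R_{S}(\mathbf{\theta_{S}})-R_{S}(\mathbf{\theta_{D}})]+[R_{D}(\mathbf{\theta_{\mathcal{N}}})-R_{S}(\mathbf{\theta_{\mathcal{N}}})]+[R_{S}(\mathbf{\theta_{D}})-R_{D}(\mathbf{\theta_{D}})]
\]
after dropping the nonpositive bracket $[R_{S}(\mathbf{\theta_{S}})-R_{S}(\mathbf{\theta_{D}})]$.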

Synergies from approximation theory [51, 42, 4, 43, 37, 36, 31, 17, 39, 50, 49, 14, 47, 23, 8, 7], optimization theory [24, 35, 34, 2, 11, 53, 1, 10, 48], and generalization theory [24, 5, 49, 3, 32, 30, 29] have led to many recent advances in the mathematical investigation for deep learning, regarding the error bound (2). All of these areas, having different emphases and angles, have fostered many separate research directions.

Providing an estimate of the first error term of (2), $R_{D}(\mathbf{\theta_{D}})$, belongs to the regime of approximation theory and is one of the main concerns of this paper. The results established in this work provide an upper bound of $R_{D}(\mathbf{\theta_{D}})$, estimated explicitly in terms of the size of the network, e.g. its width and depth, with a nonasymptotic approximation rate for smooth functions measured in Sobolev norms. Our approximation results are nonasymptotic in the sense that all involved constants are explicitly determined. We summarize our first main result below for networks with the rectified linear unit (ReLU) $\sigma_{1}$ as the activation function:

Theorem 1.1

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$, $s>1$, satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|\bm{\alpha}|\leq s$. For any $N,L\in\mathbb{N}^{+}$ and $p\in[1,\infty)$, there exists a $\sigma_{1}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $27s^{2}(L+2)\log_{2}(4L)$ such that

\[
\|f-\phi\|_{\mathcal{W}^{1,p}([0,1]^{d})}\leq 85(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.
\]

For smooth functions, we show in Theorem 1.1 that deep ReLU networks of width $\mathcal{O}(N\log N)$ and of depth $\mathcal{O}(L\log L)$ can achieve an approximation rate of $\mathcal{O}(N^{-2(s-1)/d}L^{-2(s-1)/d})$ with respect to the $\mathcal{W}^{1,p}([0,1]^{d})$ norm. Note that ReLU networks are in fact piecewise linear functions, so only their first (weak) derivatives can be nonzero. The above approximation rate in the $L^{p}$ norm estimated in terms of $N$ and $L$ was already covered in [31]. When the $\mathcal{W}^{n,p}$ norm is considered, where $0<n<1$ is not an integer, the interpolation technique used in [19] can be combined with our method here to develop new approximation rates, which is left as future work.

To achieve rates truly in terms of width and depth without the logarithmic terms, we also have the following corollary by setting $\widetilde{N}=\mathcal{O}(N\log N)$ and $\widetilde{L}=\mathcal{O}(L\log L)$ and making use of

\[
(N\ln N)^{-2(s-1)/d}(L\ln L)^{-2(s-1)/d}\leq\mathcal{O}(N^{-2(s-\rho)/d}L^{-2(s-\rho)/d})
\]

for $\rho\in(1,s)$.

Corollary 1.2

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$, $s>1$, satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|\bm{\alpha}|\leq s$. For any $N,L\in\mathbb{N}^{+}$, $\rho\in(1,s)$, and $p\in[1,\infty)$, there exist $C_{1}(s,d)$, $C_{2}(s,d)$, $C_{3}(s,d,\rho)$, and a $\sigma_{1}$-NN $\phi$ with width $C_{1}N$ and depth $C_{2}L$ such that

\[
\|f-\phi\|_{\mathcal{W}^{1,p}([0,1]^{d})}\leq C_{3}N^{-2(s-\rho)/d}L^{-2(s-\rho)/d}.
\]

Note that the constants $C_{1}$, $C_{2}$, and $C_{3}$ in Corollary 1.2 can be explicitly determined; we leave the details to the reader.

To obtain approximation results in terms of the number of parameters in neural networks, the following corollary follows from Theorem 1.1 by setting $N=\mathcal{O}(1)$ and $\varepsilon=\mathcal{O}(L^{-2(s-1)/d})$.

Corollary 1.3

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$, $s>1$, satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|\bm{\alpha}|\leq s$. Given any $\varepsilon>0$ and $p\in[1,\infty)$, there exists a $\sigma_{1}$-NN $\phi$ with $\mathcal{O}(\varepsilon^{-d/(2(s-1))}\ln\tfrac{1}{\varepsilon})$ parameters such that

\[
\|f-\phi\|_{\mathcal{W}^{1,p}([0,1]^{d})}\leq\varepsilon.
\]

Corollary 1.3 partially recovers [19, Corollary 4.1], in which the number of parameters is of order $\mathcal{O}(\varepsilon^{-d/(s-1)}\ln^{2}(\varepsilon^{-s/(s-1)}))$.
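For the reader's convenience, the parameter count behind Corollary 1.3 can be traced as follows (a brief sketch, with all constants absorbed into $\mathcal{O}(\cdot)$): when $N=\mathcal{O}(1)$, the network of Theorem 1.1 has width $\mathcal{O}(1)$ and depth $\mathcal{O}(L\log L)$, hence
\[
\#\text{parameters}=\mathcal{O}(\text{width}^{2}\cdot\text{depth})=\mathcal{O}(L\log L),
\]
and requiring $L^{-2(s-1)/d}=\mathcal{O}(\varepsilon)$ gives $L=\mathcal{O}(\varepsilon^{-d/(2(s-1))})$ and $\log L=\mathcal{O}(\ln\tfrac{1}{\varepsilon})$, which yields the stated $\mathcal{O}(\varepsilon^{-d/(2(s-1))}\ln\tfrac{1}{\varepsilon})$ parameters.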

Considering smoother neural networks in which either the ReLU function $\sigma_{1}$ or its square $\sigma_{1}^{2}$ is applied, we provide similar results in the following Theorem 1.4. Namely, these networks of width $\mathcal{O}(N\log N)$ and of depth $\mathcal{O}(L\log L)$ can achieve the approximation rate of $\mathcal{O}(N^{-2(s-n)/d}L^{-2(s-n)/d})$ with respect to the $\mathcal{W}^{n,p}([0,1]^{d})$ norm. It is worth noting that the restriction on the norm is now relaxed from $\mathcal{W}^{1,p}([0,1]^{d})$ to $\mathcal{W}^{n,p}([0,1]^{d})$ with such smoother networks.

Theorem 1.4

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$ satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|\bm{\alpha}|\leq s$. For any $N,L\in\mathbb{N}^{+}$ satisfying $(L-2-\log_{2}N)N\geq s$, and $p\in[1,\infty)$, there exists a $\sigma_{2}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $10(L+2)\log_{2}(4L)$ such that

\[
\|f-\phi\|_{\mathcal{W}^{n,p}([0,1]^{d})}\leq 3(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d},
\]

where $n<s$ is a positive integer.

Our work focuses particularly on Sobolev norms, which are suitable for studying partial differential equations. The above results concern ReLU$^{k}$ network approximation of functions with respect to Sobolev norms. Along this research direction, the work in [19] provides asymptotic upper bounds, with unspecified prefactors, with respect to the $\mathcal{W}^{n,p}$ norm, where $n$ is a fraction between zero and one, by means of localized polynomials. Making use of the same polynomials, asymptotic approximation results measured with respect to higher order $\mathcal{W}^{n,p}$ norms were provided in [20]. These results were developed for neural networks with smoother activation functions, in addition to ReLU networks. Measured in the $\mathcal{W}^{1,p}$ norm, it was shown in [38] that ReLU networks can achieve essentially the same approximation rates as free-knot spline approximations. In contrast with these existing results, our approximation results are nonasymptotic with known pre-factors, established in the spirit of the explicit error characterization methodology developed in [31, 44].

Recall from [9, Theorem 4.2] by DeVore et al. the following ostensibly negative result on the continuity of the weight selection. Suppose that there is a continuous map $\Sigma$ from the unit ball of the Sobolev space with smoothness $n$, i.e. $\mathcal{W}^{n,p}$, to $\mathbb{R}^{W}$ such that $\|f-g(\Sigma(f))\|_{L^{p}}\leq\varepsilon$ for all $f$ in the unit ball of $\mathcal{W}^{n,p}$, where $W$ denotes a fixed number of parameters and $g$ is a map realizing a deep neural network from a given set of parameters in $\mathbb{R}^{W}$; then $W\geq C\varepsilon^{-d/n}$ with some constant $C$ depending only on $n$. This in a way means that any such constructive approximation by ReLU networks cannot have a continuous weight selection property if the approximation rate is better than $CW^{-n/d}$ in the number of parameters, and hence a stable numerical implementation with such an error rate does not exist. It must, however, be noted that [9, Theorem 4.2] is basically a min-max criterion for evaluating continuous weight selection maps, describing the worst case. That is, the approximation result is obtained by minimizing over all continuous maps $\Sigma$ and network realizations $g$ and maximizing over all target functions. For most smooth functions practically encountered in applications, the theorem does not eliminate the possible cases in which they might still enjoy a continuous weight selection. Thus, there could be a stable numerical algorithm that can achieve our derived approximation rate. In other words, there is a special subclass of functions arising in practice for which a continuous assignment of the weights exists. It is interesting future work to characterize such a subclass. Finally, to the best of our knowledge, there is not any efficient numerical algorithm achieving the approximation rate with continuous weight selection, especially when the dimension is large. Therefore, approximation results with or without continuous weight selection both face the difficulty of numerical implementation. Designing numerical algorithms to achieve these rates has been an active research field recently.

It is remarked that providing error estimates in the context of high-dimensional PDE problems using the newly introduced Floor-ReLU networks [44], which are fully connected neural networks with either the floor function $\lfloor x\rfloor$ or ReLU as the activation function in each neuron, will be an intriguing option for future work, since these networks conquer the curse of dimensionality in terms of approximation theory. Other neural networks can also be considered. Along this line of work, the novel Floor-Exponential-Step networks with merely three hidden layers and width of order $\mathcal{O}(N)$ constructed in [46] can approximate $d$-dimensional Lipschitz continuous functions with an exponentially small error of order $\mathcal{O}(\sqrt{d}2^{-N})$. In [45], neural networks with a simple and computable continuous activation function and a fixed finite number of neurons were developed, which achieve the universal approximation property for all high-dimensional continuous functions.

This paper is organized as follows. In Section 2, Sobolev spaces are briefly introduced and a number of useful existing results on ReLU network approximation are given. We provide our main approximation results for ReLU networks in Section 3. Several auxiliary results concerning the approximation of high order polynomials by ReLU networks are first provided in Section 3.1, and our main results on ReLU networks are given in Section 3.2. In addition to ReLU networks, we show similar results for the smoother neural networks with ReLU square in Section 3.3.

2 Preliminaries

We provide in this section some useful preliminaries on notation and basic approximation results.

2.1 Deep Neural Networks

Let us summarize all basic notations used in deep neural networks as follows.

  1. 1.

    Matrices are denoted by bold uppercase letters. For instance, $\bm{A}\in\mathbb{R}^{m\times n}$ is a real matrix of size $m\times n$, and $\bm{A}^{T}$ denotes the transpose of $\bm{A}$.

  2. 2.

    Vectors are denoted as bold lowercase letters. For example, $\bm{v}\in\mathbb{R}^{n}$ is a column vector of size $n$, and $\bm{v}(i)$ is the $i$-th element of $\bm{v}$, i.e., $\bm{v}=[v_{1},\cdots,v_{n}]^{T}$ with $\bm{v}(i)=v_{i}$.

  3. 3.

    A $d$-dimensional multi-index is a $d$-tuple ${\bm{\alpha}}=[\alpha_{1},\alpha_{2},\cdots,\alpha_{d}]^{T}\in\mathbb{N}^{d}$. Several related notations are listed below.

    1. (a)

      $|{\bm{\alpha}}|=|\alpha_{1}|+|\alpha_{2}|+\cdots+|\alpha_{d}|$;

    2. (b)

      ${\bm{x}}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}$, where ${\bm{x}}=[x_{1},x_{2},\cdots,x_{d}]^{T}$;

    3. (c)

      ${\bm{\alpha}}!=\alpha_{1}!\alpha_{2}!\cdots\alpha_{d}!$;

  4. 4.

    Let $B_{r,|\cdot|}({\bm{x}})\subseteq\mathbb{R}^{d}$ be the closed ball with center ${\bm{x}}\in\mathbb{R}^{d}$ and radius $r$ measured by the Euclidean distance. Similarly, $B_{r,\|\cdot\|_{\ell^{\infty}}}({\bm{x}})\subseteq\mathbb{R}^{d}$ is a ball measured by the discrete $\ell^{\infty}$-norm of a vector.

  5. 5.

    Assume $\bm{n}\in\mathbb{N}^{n}$; then $f(\bm{n})=\mathcal{O}(g(\bm{n}))$ means that there exists a positive constant $C$ independent of $\bm{n}$, $f$, and $g$ such that $f(\bm{n})\leq Cg(\bm{n})$ when all entries of $\bm{n}$ go to $+\infty$.

  6. 6.

    We use $\sigma$ to denote an activation function. Let $\sigma_{1}:\mathbb{R}\to\mathbb{R}$ denote the rectified linear unit (ReLU), i.e. $\sigma_{1}(x)=\max\{0,x\}$. With the abuse of notation, we define $\sigma_{1}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ as $\sigma_{1}({\bm{x}})=[\max\{0,x_{1}\},\cdots,\max\{0,x_{d}\}]^{T}$ for any ${\bm{x}}=[x_{1},\cdots,x_{d}]^{T}\in\mathbb{R}^{d}$.

  7. 7.

    Furthermore, we let the activation function $\sigma_{2}:\mathbb{R}\to\mathbb{R}$ be either $\sigma_{1}$ or $\sigma_{1}^{2}$. Similar to $\sigma_{1}$, we define the action of $\sigma_{2}$ on a vector ${\bm{x}}$.

  8. 8.

    We will use NN as shorthand for neural network and $\sigma_{r}$-NN to specify an NN with activation functions $\sigma_{t}$ with $t\leq r$. We will also use Python-type notations to specify a class of NNs, e.g., $\sigma_{1}$-$\textnormal{NN}(\textnormal{c}_{1};\ \textnormal{c}_{2};\ \cdots;\ \textnormal{c}_{m})$ is a set of ReLU FNNs satisfying $m$ conditions given by $\{\textnormal{c}_{i}\}_{1\leq i\leq m}$, each of which may specify the number of inputs (#input), the total number of nodes in all hidden layers (#node), the number of hidden layers (#layer), the number of total parameters (#parameter), the width in each hidden layer (widthvec), the maximum width of all hidden layers (maxwidth), etc. For example, if $\phi\in\sigma_{1}$-$\textnormal{NN}(\textnormal{\#input}=2;\,\textnormal{widthvec}=[100,100])$, then $\phi$ satisfies

    1. (a)

      $\phi$ maps from $\mathbb{R}^{2}$ to $\mathbb{R}$.

    2. (b)

      $\phi$ has two hidden layers and the number of nodes in each hidden layer is $100$.

  9. 9.

    $[n]^{L}$ is short for $[n,n,\cdots,n]\in\mathbb{N}^{L}$. For example,

    \[
    \textnormal{NN}(\textnormal{\#input}=d;\,\textnormal{widthvec}=[100,100])=\textnormal{NN}(\textnormal{\#input}=d;\,\textnormal{widthvec}=[100]^{2}).
    \]
  10. 10.

    For $\phi\in\sigma$-$\textnormal{NN}(\textnormal{\#input}=d;\,\textnormal{widthvec}=[N_{1},N_{2},\cdots,N_{L}])$, if we define $N_{0}=d$ and $N_{L+1}=1$, then the architecture of $\phi$ can be briefly described as follows:

    \[
    \bm{x}=\tilde{\bm{h}}_{0}\ \mathop{\longrightarrow}^{\bm{W}_{1},\,\bm{b}_{1}}\ \bm{h}_{1}\ \mathop{\longrightarrow}^{\sigma}\ \tilde{\bm{h}}_{1}\ \cdots\ \mathop{\longrightarrow}^{\bm{W}_{L},\,\bm{b}_{L}}\ \bm{h}_{L}\ \mathop{\longrightarrow}^{\sigma}\ \tilde{\bm{h}}_{L}\ \mathop{\longrightarrow}^{\bm{W}_{L+1},\,\bm{b}_{L+1}}\ \phi(\bm{x})=\bm{h}_{L+1},
    \]

    where $\bm{W}_{i}\in\mathbb{R}^{N_{i}\times N_{i-1}}$ and $\bm{b}_{i}\in\mathbb{R}^{N_{i}}$ are the weight matrix and the bias vector in the $i$-th linear transform in $\phi$, respectively, i.e.,

    \[
    \bm{h}_{i}:=\bm{W}_{i}\tilde{\bm{h}}_{i-1}+\bm{b}_{i},\quad\textnormal{for }i=1,\dots,L+1,
    \]

    and

    \[
    \tilde{\bm{h}}_{i}=\sigma(\bm{h}_{i}),\quad\textnormal{for }i=1,\dots,L.
    \]

    $L$ in this paper is also called the number of hidden layers in the literature. (A minimal numerical sketch of this forward pass, for illustration only, is given after this list.)

  11. 11.

    The expression, an FNN with width $N$ and depth $L$, means

    1. (a)

      the maximum width of this FNN over all hidden layers is less than or equal to $N$;

    2. (b)

      the number of hidden layers of this FNN is less than or equal to $L$.
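To make the architecture notation in item 10 concrete, the following minimal Python/NumPy sketch implements the forward pass $\bm{h}_{i}=\bm{W}_{i}\tilde{\bm{h}}_{i-1}+\bm{b}_{i}$, $\tilde{\bm{h}}_{i}=\sigma_{1}(\bm{h}_{i})$ for a given widthvec. It is only an illustration of the notation, not one of the constructions of this paper, and the random weights are placeholders.

\begin{verbatim}
import numpy as np

def relu(z):
    # sigma_1 applied entrywise
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    # weights[i] has shape (N_{i+1}, N_i); biases[i] has shape (N_{i+1},)
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)          # hidden layers: affine map followed by sigma_1
    W, b = weights[-1], biases[-1]
    return W @ h + b                 # output layer: affine map only

# Example: phi in sigma_1-NN(#input=2; widthvec=[100, 100]), i.e. width 100 and depth 2.
rng = np.random.default_rng(0)
widthvec = [2, 100, 100, 1]          # [N_0, N_1, N_2, N_3] with N_0 = d and N_3 = 1
weights = [rng.standard_normal((m, n)) for n, m in zip(widthvec[:-1], widthvec[1:])]
biases = [rng.standard_normal(m) for m in widthvec[1:]]
print(forward([0.3, 0.7], weights, biases))
\end{verbatim}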

2.2 Sobolev Spaces

We will use $D$ to denote the weak derivative of a single variable function and $D^{\bm{\alpha}}$ to denote the partial derivative $D^{\alpha_{1}}_{1}D^{\alpha_{2}}_{2}\dots D^{\alpha_{d}}_{d}$ of a $d$-dimensional function, with $\alpha_{i}$ the order of the derivative $D_{i}$ in the $i$-th variable and ${\bm{\alpha}}=[\alpha_{1},\dots,\alpha_{d}]^{T}$. Let $\Omega$ denote an open subset of $\mathbb{R}^{d}$ and $L^{p}(\Omega)$ be the standard Lebesgue space on $\Omega$ for $p\in[1,\infty]$. We write $\nabla f:=[D_{1}f,\dots,D_{d}f]^{T}$. $\partial\Omega$ is the boundary of $\Omega$. Let $\mu(\cdot)$ be the Lebesgue measure. For $f(x)\in\mathcal{W}^{n,p}(\Omega)$, we use the notation

\[
\|f\|_{\mathcal{W}^{n,p}(\Omega)}=\|f\|_{\mathcal{W}^{n,p}}=\|f(x)\|_{\mathcal{W}^{n,p}(\Omega,\mu)},
\]

if the domain is clear from the context and we use the Lebesgue measure.

Definition 2.1

(Sobolev Space) Let $n\in\mathbb{N}_{0}$ and $1\leq p\leq\infty$. Then we define the Sobolev space

\[
\mathcal{W}^{n,p}(\Omega):=\{f\in L^{p}(\Omega):D^{\bm{\alpha}}f\in L^{p}(\Omega)\text{ for all }{\bm{\alpha}}\in\mathbb{N}_{0}^{d}\text{ with }|{\bm{\alpha}}|\leq n\}
\]

with a norm

\[
\|f\|_{\mathcal{W}^{n,p}(\Omega)}:=\Bigg(\sum_{0\leq|{\bm{\alpha}}|\leq n}\|D^{\bm{\alpha}}f\|^{p}_{L^{p}(\Omega)}\Bigg)^{1/p},
\]

if $p<\infty$, and

\[
\|f\|_{\mathcal{W}^{n,\infty}(\Omega)}:=\max_{0\leq|{\bm{\alpha}}|\leq n}\|D^{\bm{\alpha}}f\|_{L^{\infty}(\Omega)}.
\]
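As a quick illustration of Definition 2.1 (our own example): for $f(x)=x^{2}$ on $\Omega=(0,1)$,
\[
\|f\|_{\mathcal{W}^{1,\infty}((0,1))}=\max\big\{\|x^{2}\|_{L^{\infty}((0,1))},\|2x\|_{L^{\infty}((0,1))}\big\}=2,
\]
which is precisely the type of quantity controlled for the approximants constructed in Section 3 (cf. Lemma 3.2).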

2.3 Auxiliary Neural Network Approximation Results

We first give the following useful lemma collecting several ReLU network approximation results for polynomials, with explicit error characterization measured in the $L^{\infty}$ norm.

Lemma 2.1

The following statements hold for $\sigma_{1}$-NNs.

  1. (i)

    Any one-dimensional continuous piecewise linear function with $N$ breakpoints can be exactly realized by a one-dimensional $\sigma_{1}$-NN with one hidden layer and $N$ neurons.

  2. (ii)

    Any identity map in $\mathbb{R}^{d}$ can be exactly realized by a $d$-dimensional $\sigma_{1}$-NN with one hidden layer and $2d$ neurons.

  3. (iii)

    ([31, Lemma 5.1]) For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $3N$ and depth $L$ such that

    \[
    \|\phi(x)-x^{2}\|_{L^{\infty}([0,1])}\leq N^{-L}.
    \]
  4. (iv)

    ([31, Lemma 4.2]) For any $N,L\in\mathbb{N}^{+}$ and $a,b\in\mathbb{R}$ with $a<b$, there exists a $\sigma_{1}$-NN $\phi$ with width $9N+1$ and depth $L$ such that

    \[
    \|\phi(x,y)-xy\|_{L^{\infty}([a,b]^{2})}\leq 6(b-a)^{2}N^{-L}.
    \]
  5. (v)

    ([31, Theorem 4.1]) Assume $P({\bm{x}})={\bm{x}}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}$ for ${\bm{\alpha}}\in\mathbb{N}^{d}$ with $\|{\bm{\alpha}}\|_{1}\leq k\in\mathbb{N}^{+}$. For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $9(N+1)+k-1$ and depth $7k^{2}L$ such that

    \[
    \|\phi({\bm{x}})-P({\bm{x}})\|_{L^{\infty}([0,1]^{d})}\leq 9k(N+1)^{-7kL}.
    \]

The following two propositions will also be used in proving our main results.

Proposition 2.2

(Step function approximations [31, Proposition 4.3]) For any $N,L,d\in\mathbb{N}^{+}$ and $\delta\in(0,\tfrac{1}{3K}]$ with $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor$, there exists a one-dimensional $\sigma_{1}$-NN $\phi$ with width $4\lfloor N^{1/d}\rfloor+3$ and depth $4L+5$ such that

\[
\phi(x)=k,\quad\text{if}~x\in\big[\tfrac{k}{K},\tfrac{k+1}{K}-\delta\cdot 1_{\{k<K-1\}}\big]
\]

for $k=0,1,\dots,K-1$.

Proposition 2.3

(Point matching [31, Proposition 4.4]) Given any $N,L,s\in\mathbb{N}^{+}$ and $\xi_{i}\in[0,1]$ for $i=0,1,\dots,N^{2}L^{2}-1$, there exists a $\sigma_{1}$-NN $\phi$ with width $16s(N+1)\log_{2}(8N)$ and depth $(5L+2)\log_{2}(4L)$ such that

  1. 1.

    $|\phi(i)-\xi_{i}|\leq N^{-2s}L^{-2s}$, for $i=0,1,\cdots,N^{2}L^{2}-1$;

  2. 2.

    $0\leq\phi(x)\leq 1$, $x\in\mathbb{R}$.

3 Proof of Our Main Results

In this section, the proofs of our main results are given. We first define the following notion of a subset of $[0,1]^{d}$ before presenting our main approximation results. Given any $K\in\mathbb{N}^{+}$ and $\delta\in(0,\tfrac{1}{K})$, define a trifling region $\Omega([0,1]^{d},K,\delta)$ of $[0,1]^{d}$ as

\[
\Omega([0,1]^{d},K,\delta):=\cup_{i=1}^{d}\Big\{{\bm{x}}=[x_{1},x_{2},\cdots,x_{d}]^{T}\in[0,1]^{d}:x_{i}\in\cup_{k=1}^{K-1}(\tfrac{k}{K}-\delta,\tfrac{k}{K})\Big\}. \tag{3}
\]

3.1 Approximations Using ReLU Neural Networks

We first present the following Theorem 3.1, which concerns the approximation result by ReLU networks outside the trifling region $\Omega([0,1]^{d},K,\delta)$.

Theorem 3.1

Suppose that $f\in C^{s}([0,1]^{d})$ with an integer $s>1$ satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s$. For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $27s^{2}(L+2)\log_{2}(4L)$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 432s^{d}$ and

\[
\|f-\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d}\backslash\Omega([0,1]^{d},K,\delta))}\leq 84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d},
\]

where $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor$ and $0<\delta\leq\tfrac{1}{3K}$.

We will now show our main result, Theorem 1.1, provided Theorem 3.1 holds true. The proof of Theorem 3.1 will be given in Section 3.2.

  • Proof of Theorem 1.1

    When $f$ is a constant function, the statement is trivial. By Theorem 3.1, there exists a $\sigma_{1}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $27s^{2}(L+2)\log_{2}(4L)$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 432s^{d}$ and

    \[
    \|f-\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d}\backslash\Omega([0,1]^{d},K,\delta))}\leq 84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.
    \]

    Now, we set $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor$ and choose a small $\delta$ such that

    \[
    Kd\delta\,432^{p}\leq(N^{-2(s-1)/d}L^{-2(s-1)/d})^{p}.
    \]

    Then, we have

    \[
    \begin{aligned}
    \|f-\phi\|^{p}_{\mathcal{W}^{1,p}([0,1]^{d})}&=\|f-\phi\|^{p}_{\mathcal{W}^{1,p}(\Omega([0,1]^{d},K,\delta))}+\|f-\phi\|^{p}_{\mathcal{W}^{1,p}([0,1]^{d}\backslash\Omega([0,1]^{d},K,\delta))}\\
    &\leq Kd\delta(432s^{d})^{p}+(84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d})^{p}\\
    &\leq\big(s^{d}N^{-2(s-1)/d}L^{-2(s-1)/d}\big)^{p}+\big(84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}\big)^{p}\\
    &\leq(85(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d})^{p}.
    \end{aligned}
    \]

    Hence, we have

    \[
    \|f-\phi\|_{\mathcal{W}^{1,p}([0,1]^{d})}\leq 85(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.
    \]
    \qed

Before proving Theorem 3.1 directly, we present the following lemmas, which will be helpful for understanding the different steps in the proof. Similar to the proof of [31, Theorem 2.2], where only the $L^{\infty}$ norm is considered, in order to prove Theorem 3.1 we will first construct ReLU networks to approximate multivariate polynomials and provide an error bound measured in the $\mathcal{W}^{1,\infty}$ norm that is explicitly characterized in terms of the width and depth of the underlying ReLU networks, via the following steps:

  1. 1.

    We approximate $f(x)=x^{2}$ by compositions of the “sawtooth functions”, which were first proposed in [51].

  2. 2.

    We approximate $f(x,y)=xy$ using the ReLU network constructed in the previous step, based on the identity $xy=2\big(\big(\tfrac{|x+y|}{2}\big)^{2}-\big(\tfrac{|x|}{2}\big)^{2}-\big(\tfrac{|y|}{2}\big)^{2}\big)$.

  3. 3.

    We approximate $f(x_{1},x_{2},\cdots,x_{k})=x_{1}x_{2}\cdots x_{k}$ for $k\geq 2$ by repeatedly using the ReLU networks constructed in the previous step.

  4. 4.

    We approximate a general monomial $\bm{x}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}$ for $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq k\in\mathbb{N}^{+}$. Since any monomial of degree less than or equal to $k$ can be written as $Cz_{1}z_{2}\cdots z_{k}$, where $C$ is a constant, we can obtain an error bound based on the previous approximation result.

We begin with giving an approximation result for the simple function $x^{2}$.

Lemma 3.2

For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $3N$ and depth $2L$ such that $\|\phi(x)\|_{\mathcal{W}^{1,\infty}([0,1])}\leq 2$ and

\[
\|\phi(x)-x^{2}\|_{\mathcal{W}^{1,\infty}((0,1))}\leq N^{-L}.
\]
  • Proof

    As in the proof of Lemma 2.1 (iii), we begin by defining the piecewise linear function $f_{s}:[0,1]\to[0,1]$ for $s\in\mathbb{N}^{+}$, $s=1,2,\dots$, satisfying the following properties:

    1. 1.

      $f_{s}(x)=x^{2}$ at the set of breakpoints $\big\{\tfrac{j}{2^{s}}:j=0,1,2,\dots,2^{s}\big\}$.

    2. 2.

      $f_{s}(x)$ is linear between any two adjacent breakpoints.

    For any integer $N$, let $k\in\mathbb{N}^{+}$ be the unique number such that $(k-1)2^{k-1}+1\leq N\leq k2^{k}$. For such $N$ and $k$ and any $L^{\prime}$, let $s=L^{\prime}k$. It is shown in the proof of [31, Lemma 5.1] that there exists a $\sigma_{1}$-NN $\phi(x)=f_{s}(x)=f_{L^{\prime}k}(x)$ with width $3N$ and depth $L^{\prime}$ such that

    \[
    \|\phi(x)-x^{2}\|_{L^{\infty}([0,1])}=\|f_{L^{\prime}k}-x^{2}\|_{L^{\infty}([0,1])}\leq 2^{-2(L^{\prime}k+1)}\leq 2^{-2L^{\prime}k}\leq N^{-L^{\prime}}. \tag{4}
    \]

We now show that the approximation error of the first order (weak) derivative can be measured in a similar fashion.

Note that for all $x\in\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)$,

\[
\phi(x)=\Big(\tfrac{(j+1)^{2}}{2^{s}}-\tfrac{j^{2}}{2^{s}}\Big)\Big(x-\tfrac{j}{2^{s}}\Big)+\Big(\tfrac{j}{2^{s}}\Big)^{2}. \tag{5}
\]

From (5), we have for each $j=0,1,\dots,2^{s}-1$,

\[
\begin{aligned}
|\phi(x)-x^{2}|_{\mathcal{W}^{1,\infty}\big(\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)\big)}&=\Big\|\tfrac{(j+1)^{2}}{2^{s}}-\tfrac{j^{2}}{2^{s}}-2x\Big\|_{L^{\infty}\big(\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)\big)}\\
&=\Big\|\tfrac{2j+1}{2^{s}}-2x\Big\|_{L^{\infty}\big(\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)\big)}\\
&=\max\Big\{\big|\tfrac{2j+1}{2^{s}}-2\big(\tfrac{j}{2^{s}}\big)\big|,\big|\tfrac{2j+1}{2^{s}}-2\big(\tfrac{j+1}{2^{s}}\big)\big|\Big\}\\
&=2^{-s}=2^{-L^{\prime}k}.
\end{aligned}
\]

Clearly, from (4) and (5) we have

\[
\|\phi(x)-x^{2}\|_{\mathcal{W}^{1,\infty}((0,1))}=\|f_{L^{\prime}k}-x^{2}\|_{\mathcal{W}^{1,\infty}((0,1))}\leq 2^{-L^{\prime}k}\leq N^{-L^{\prime}/2}.
\]

Finally, we have

\[
\|\phi(x)\|_{\mathcal{W}^{1,\infty}([0,1])}\leq\Big\|\tfrac{(j+1)^{2}-j^{2}}{2^{s}}\Big\|_{L^{\infty}\big(\big(\tfrac{j}{2^{s}},\tfrac{j+1}{2^{s}}\big)\big)}=\tfrac{2j+1}{2^{s}}\leq\tfrac{2(2^{s}-1)+1}{2^{s}}=2-\tfrac{1}{2^{L^{\prime}k}}\leq 2.
\]

Setting $L^{\prime}=2L$, we have the desired $\phi(x)$ and hence the proof is finished. \qed
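The dyadic construction above is easy to sanity-check numerically. The following short Python sketch (our own illustration; it builds the target function $f_{s}$ directly rather than its $\sigma_{1}$-NN realization) verifies the two bounds used in the proof, namely $\|f_{s}-x^{2}\|_{L^{\infty}([0,1])}\leq 2^{-2(s+1)}$ and $\sup_{x}|f_{s}^{\prime}(x)-2x|=2^{-s}$:

\begin{verbatim}
import numpy as np

def f_s(x, s):
    # piecewise linear interpolant of x^2 at the dyadic breakpoints j / 2^s
    nodes = np.arange(2**s + 1) / 2**s
    return np.interp(x, nodes, nodes**2)

s = 6
x = np.linspace(0.0, 1.0, 200001)
err_val = np.max(np.abs(f_s(x, s) - x**2))        # about 3.05e-05 here
print(err_val, 2.0**(-2*(s+1)))                   # bound 2^{-2(s+1)} = 6.10e-05

# on the j-th subinterval the interpolant has slope (2j+1)/2^s
j = np.minimum((x * 2**s).astype(int), 2**s - 1)
err_der = np.max(np.abs((2*j + 1) / 2**s - 2*x))
print(err_der, 2.0**(-s))                         # both equal 2^{-s} = 0.015625
\end{verbatim}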

Lemma 3.3

For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $9N$ and depth $2L$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 12$ and

\[
\|\phi(x,y)-xy\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 6N^{-L}.
\]
  • Proof

    By Lemma 3.2, there exists a $\sigma_{1}$-NN $\psi$ with width $3N$ and depth $2L$ such that $\|\psi\|_{\mathcal{W}^{1,\infty}((0,1))}\leq 2$ and

    \[
    \|z^{2}-\psi(z)\|_{\mathcal{W}^{1,\infty}((0,1))}\leq N^{-L}.
    \]

    Combining the above inequality and the fact that for any $x,y\in\mathbb{R}$

    \[
    xy=2\Big(\big(\tfrac{|x+y|}{2}\big)^{2}-\big(\tfrac{|x|}{2}\big)^{2}-\big(\tfrac{|y|}{2}\big)^{2}\Big),
    \]

    we construct the following network $\phi$:

    \[
    \phi(x,y)=2\Big(\psi\big(\tfrac{|x+y|}{2}\big)-\psi\big(\tfrac{|x|}{2}\big)-\psi\big(\tfrac{|y|}{2}\big)\Big).
    \]

    We have

    \[
    \begin{aligned}
    \|\phi(x,y)-xy\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}
    &\leq 2\big\|\psi\big(\tfrac{|x+y|}{2}\big)-\big(\tfrac{|x+y|}{2}\big)^{2}\big\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}+2\big\|\psi\big(\tfrac{|x|}{2}\big)-\big(\tfrac{|x|}{2}\big)^{2}\big\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\\
    &\quad+2\big\|\psi\big(\tfrac{|y|}{2}\big)-\big(\tfrac{|y|}{2}\big)^{2}\big\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\\
    &\leq 2N^{-L}+2N^{-L}+2N^{-L}=6N^{-L}.
    \end{aligned}
    \]

    Finally, we have

    \[
    \|\phi(x,y)\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}=\Big\|2\Big(\psi\big(\tfrac{|x+y|}{2}\big)-\psi\big(\tfrac{|x|}{2}\big)-\psi\big(\tfrac{|y|}{2}\big)\Big)\Big\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 12.
    \]
    \qed
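As a companion to the sketch after Lemma 3.2, the following lines (again only an illustration, with the interpolant $f_{s}$ standing in for the network $\psi$ of Lemma 3.2) check numerically that the construction $\phi(x,y)=2\big(\psi(\tfrac{|x+y|}{2})-\psi(\tfrac{|x|}{2})-\psi(\tfrac{|y|}{2})\big)$ indeed approximates the product $xy$ on $(0,1)^{2}$:

\begin{verbatim}
import numpy as np

def f_s(x, s):
    # piecewise linear interpolant of x^2 at the breakpoints j / 2^s (as before)
    nodes = np.arange(2**s + 1) / 2**s
    return np.interp(x, nodes, nodes**2)

def phi(x, y, s):
    # product approximant built from three copies of the squaring approximant
    return 2.0 * (f_s(np.abs(x + y) / 2, s)
                  - f_s(np.abs(x) / 2, s)
                  - f_s(np.abs(y) / 2, s))

s = 8
g = np.linspace(0.0, 1.0, 401)
X, Y = np.meshgrid(g, g)
print(np.max(np.abs(phi(X, Y, s) - X * Y)))   # small, of order 2^{-2s}
\end{verbatim}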

By rescaling, we have the following modification of Lemma 3.3.

Lemma 3.4

For any $N,L\in\mathbb{N}^{+}$ and $a,b\in\mathbb{R}$ with $a<b$, there exists a $\sigma_{1}$-NN $\phi$ with width $9N+1$ and depth $2L$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}\leq 12(b-a)^{2}$ and

\[
\|\phi(x,y)-xy\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}\leq 6(b-a)^{2}N^{-L}.
\]
  • Proof

    By Lemma 3.3, there exists a $\sigma_{1}$-NN $\psi$ with width $9N$ and depth $2L$ such that $\|\psi\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 12$ and

    \[
    \|\psi(\widetilde{x},\widetilde{y})-\widetilde{x}\widetilde{y}\|_{\mathcal{W}^{1,\infty}((0,1)^{2})}\leq 6N^{-L}.
    \]

    By setting $x=a+(b-a)\widetilde{x}$ and $y=a+(b-a)\widetilde{y}$ for any $\widetilde{x},\widetilde{y}\in(0,1)$, we define the following network $\phi$:

    \[
    \phi(x,y)=(b-a)^{2}\psi\big(\tfrac{x-a}{b-a},\tfrac{y-a}{b-a}\big)+a(x-a)+a(y-a)+a^{2}.
    \]

    Note that $a(x-a)+a(y-a)$ is positive. Hence, the width of $\phi$ can be as small as $9N+1$. Thus, by $xy=(b-a)^{2}\big(\tfrac{x-a}{b-a}\cdot\tfrac{y-a}{b-a}\big)+a(x-a)+a(y-a)+a^{2}$, we have

    \[
    \|\phi(x,y)-xy\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}=(b-a)^{2}\big\|\psi\big(\tfrac{x-a}{b-a},\tfrac{y-a}{b-a}\big)-\big(\tfrac{x-a}{b-a}\cdot\tfrac{y-a}{b-a}\big)\big\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}\leq 6(b-a)^{2}N^{-L}.
    \]

    Finally, we have

    \[
    \|\phi(x,y)\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}=\Big\|(b-a)^{2}\psi\big(\tfrac{x-a}{b-a},\tfrac{y-a}{b-a}\big)+a(x-a)+a(y-a)+a^{2}\Big\|_{\mathcal{W}^{1,\infty}((a,b)^{2})}\leq 12(b-a)^{2}.
    \]
    \qed
Lemma 3.5

For any $N,L,k\in\mathbb{N}^{+}$ with $k\geq 2$, there exists a $\sigma_{1}$-NN $\phi$ with width $9(N+1)+k-1$ and depth $14k(k-1)L$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}((0,1)^{k})}\leq 18$ and

\[
\|\phi(\bm{x})-x_{1}x_{2}\cdots x_{k}\|_{\mathcal{W}^{1,\infty}((0,1)^{k})}\leq 10(k-1)(N+1)^{-7kL}.
\]
  • Proof

    By Lemma 3.4, there exists a $\sigma_{1}$-NN $\phi_{1}$ with width $9(N+1)+1$ and depth $14kL$ such that $\|\phi_{1}\|_{\mathcal{W}^{1,\infty}((-0.1,1.1)^{2})}\leq 18$ and

    \[
    \|\phi_{1}(x,y)-xy\|_{\mathcal{W}^{1,\infty}((-0.1,1.1)^{2})}\leq 6(1.2)^{2}(N+1)^{-7kL}\leq 9(N+1)^{-7kL}.
    \]

    Now, our goal is to construct via induction, for $i=1,2,\dots,k-1$, a $\sigma_{1}$-NN $\phi_{i}$ with width $9(N+1)+i$ and depth $14kiL$ such that

    \[
    \|\phi_{i}(x_{1},\cdots,x_{i+1})-x_{1}x_{2}\cdots x_{i+1}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+1})}\leq 10i(N+1)^{-7kL}
    \]

    for any $[x_{1},x_{2},\cdots,x_{i+1}]^{T}\in(0,1)^{i+1}$.

    When $i=1$, $\phi_{1}$ satisfies both conditions.

    Assuming now that for some $i\in\{1,2,\dots,k-2\}$ there exists a $\sigma_{1}$-NN $\phi_{i}$ such that both conditions hold, we define $\phi_{i+1}$ as follows:

    \[
    \phi_{i+1}(x_{1},\cdots,x_{i+2})=\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),\sigma(x_{i+2}))
    \]

    for any $x_{1},\cdots,x_{i+2}\in\mathbb{R}$. We can shift $x_{i+2}$ to obtain a nonnegative number $x_{i+2}+0.1$, which can be copied via a ReLU network of width $1$ to the input of $\phi_{1}$. Before inputting $x_{i+2}+0.1$ to $\phi_{1}$, we shift it back to $x_{i+2}$. Therefore, $\phi_{i+1}$ can be implemented via a $\sigma_{1}$-NN with width $9(N+1)+i+1$ and depth $14kiL+14kL=14k(i+1)L$.

    By the induction assumption, we have

    \[
    \|\phi_{i}(x_{1},x_{2},\cdots,x_{i+1})-x_{1}x_{2}\cdots x_{i+1}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+1})}\leq 10i(N+1)^{-7kL}.
    \]

    Note that $10i(N+1)^{-7kL}\leq 10k2^{-7k}<10k\tfrac{1}{100k}=0.1$ for any $N,L\in\mathbb{N}^{+}$, $k\geq 2$, and $i\in\{1,2,\cdots,k-1\}$. Thus, we have

    \[
    \phi_{i}(x_{1},x_{2},\cdots,x_{i+1})\in(-0.1,1.1)
    \]

    and

    \[
    \tfrac{\partial\phi_{i}}{\partial x_{1}}\in(-0.1,1.1)
    \]

    for any $x_{1},x_{2},\cdots,x_{i+1}\in(0,1)$.

Hence, for any $x_{1},x_{2},\cdots,x_{i+2}\in(0,1)$, we have

\[
\begin{aligned}
&\|\phi_{i+1}(x_{1},\cdots,x_{i+2})-x_{1}\cdots x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\\
&\quad=\|\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),\sigma(x_{i+2}))-x_{1}\cdots x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\\
&\quad\leq\|\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})-\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\\
&\quad\quad+\|\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}-x_{1}\cdots x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\\
&\quad\leq 10(N+1)^{-7kL}+10i(N+1)^{-7kL}=10(i+1)(N+1)^{-7kL},
\end{aligned} \tag{6}
\]

where the second inequality is obtained since we have

\[
\begin{aligned}
\Big|\tfrac{\partial(\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})-\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2})}{\partial x_{1}}\Big|
&=\Big|\tfrac{\partial\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})}{\partial\phi_{i}}\cdot\tfrac{\partial\phi_{i}}{\partial x_{1}}-x_{i+2}\cdot\tfrac{\partial\phi_{i}}{\partial x_{1}}\Big|\\
&=\underbrace{\Big|\tfrac{\partial\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})}{\partial\phi_{i}}-x_{i+2}\Big|}_{\leq 9(N+1)^{-7kL}}\,\underbrace{\Big|\tfrac{\partial\phi_{i}}{\partial x_{1}}\Big|}_{\leq 1.1}\\
&\leq 10(N+1)^{-7kL}
\end{aligned}
\]

and

\[
\begin{aligned}
\Big|\tfrac{\partial(\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})-\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2})}{\partial x_{i+2}}\Big|
&=\Big|\tfrac{\partial\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})}{\partial x_{i+2}}-\phi_{i}\Big|\\
&\leq 9(N+1)^{-7kL}\leq 10(N+1)^{-7kL}.
\end{aligned}
\]

Thus,

\[
|\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}-x_{1}\cdots x_{i+2}|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\leq 10(N+1)^{-7kL}.
\]

Also, using similar arguments (or see the proof of [31, Lemma 5.1]), it can be shown that

\[
|\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}-x_{1}\cdots x_{i+2}|_{L^{\infty}((0,1)^{i+2})}\leq 10(N+1)^{-14kL}.
\]

Hence, we have shown

\[
\|\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),x_{i+2})-\phi_{i}(x_{1},\cdots,x_{i+1})x_{i+2}\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\leq 10(N+1)^{-7kL},
\]

which was used in showing (6).

Also, for any $x_{1},x_{2},\cdots,x_{i+2}\in(0,1)$, we have

\[
\|\phi_{i+1}(x_{1},\cdots,x_{i+2})\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}=\|\phi_{1}(\phi_{i}(x_{1},\cdots,x_{i+1}),\sigma(x_{i+2}))\|_{\mathcal{W}^{1,\infty}((0,1)^{i+2})}\leq 18.
\]

At last, setting $\phi=\phi_{k-1}$ in (6), we have finished the proof by induction. \qed

Proposition 3.6

Suppose $\bm{x}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}$ for $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq k\in\mathbb{N}^{+}$. For any $N,L\in\mathbb{N}^{+}$, there exists a $\sigma_{1}$-NN $\phi$ with width $9(N+1)+k-1$ and depth $14k^{2}L$ such that $\|\phi\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 18$ and

\[
\|\phi(\bm{x})-\bm{x}^{\bm{\alpha}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 10k(N+1)^{-7kL}.
\]
  • Proof

    This proof is similar to that of [31, Proposition 4.1], but for readability we provide a detailed proof as follows.

    The case when $k=1$ is trivial. When $k\geq 2$, we set $\widetilde{k}=|{\bm{\alpha}}|\leq k$, $\bm{\alpha}=[\alpha_{1},\alpha_{2},\dots,\alpha_{d}]^{T}\in\mathbb{N}^{d}$, and let $[z_{1},z_{2},\cdots,z_{\widetilde{k}}]^{T}\in\mathbb{R}^{\widetilde{k}}$ be the vector such that

    \[
    z_{l}=x_{j},\quad\text{if}~\sum_{i=1}^{j-1}\alpha_{i}<l\leq\sum_{i=1}^{j}\alpha_{i},\quad\text{for}~j=1,2,\cdots,d.
    \]

    In other words, we have

    \[
    [z_{1},z_{2},\cdots,z_{\widetilde{k}}]^{T}=[\underbrace{x_{1},\cdots,x_{1}}_{\alpha_{1}~\text{times}},\underbrace{x_{2},\cdots,x_{2}}_{\alpha_{2}~\text{times}},\cdots,\underbrace{x_{d},\cdots,x_{d}}_{\alpha_{d}~\text{times}}]^{T}\in\mathbb{R}^{\widetilde{k}}.
    \]

    We now construct the target deep ReLU neural network. First, there exists a linear map $\mathcal{L}:\mathbb{R}^{d}\to\mathbb{R}^{k}$, $\mathcal{L}({\bm{x}})=[z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1]$, which copies ${\bm{x}}$ to form a new vector $[z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1]\in\mathbb{R}^{k}$. Second, by Lemma 3.5 there exists a function $\psi:\mathbb{R}^{k}\rightarrow\mathbb{R}$ implemented by a ReLU network with width $9(N+1)+k-1$ and depth $14k(k-1)L$ such that $\|\psi\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 18$ and $\psi$ maps $[z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1]$ to $z_{1}z_{2}\cdots z_{\widetilde{k}}$ within an error of $10(k-1)(N+1)^{-7kL}$. Thus, we can construct the network $\phi=\psi\circ\mathcal{L}$. Then, $\phi$ can be implemented by a ReLU network with width $9(N+1)+k-1$ and depth $14k(k-1)L\leq 14k^{2}L$, and

    \[
    \begin{aligned}
    \|\phi(\bm{x})-\bm{x}^{\bm{\alpha}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}&=\|\psi\circ\mathcal{L}(\bm{x})-x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\\
    &=\|\psi(z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1)-z_{1}z_{2}\cdots z_{\widetilde{k}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\\
    &\leq 10(k-1)(N+1)^{-7kL}\leq 10k(N+1)^{-7kL}.
    \end{aligned}
    \]

    Also, we have

    \[
    \|\phi(\bm{x})\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}=\|\psi\circ\mathcal{L}(\bm{x})\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}=\|\psi(z_{1},z_{2},\cdots,z_{\widetilde{k}},1,\cdots,1)\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 18.
    \]

    The proof is then finished. \qed

We are now ready to show our main approximation result, Theorem 3.1. The main idea of the proof is the recurring use of Taylor polynomial approximations to smooth functions. We first construct ReLU subnetworks to realize these polynomials and their derivatives, and then implement the target smooth functions by composing the subnetworks. The overall approximation error is estimated by the remainder terms in Taylor's theorem and the error resulting from the ReLU subnetworks approximating the polynomials.

3.2 Proof of Theorem 3.1

  • Proof

We set $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor$ and let $\Omega([0,1]^{d},K,\delta)$ defined by (3) partition $[0,1]^{d}$ into $K^{d}$ cubes $Q_{\bm{\beta}}$ for $\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}$ such that

\[
[0,1]^{d}=\Omega([0,1]^{d},K,\delta)\bigcup\big(\cup_{\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}}Q_{\bm{\beta}}\big).
\]

For each $\bm{\beta}=[\beta_{1},\beta_{2},\cdots,\beta_{d}]^{T}\in\{0,1,\cdots,K-1\}^{d}$, we define

\[
Q_{\bm{\beta}}=\Big\{\bm{x}=[x_{1},x_{2},\cdots,x_{d}]^{T}:x_{i}\in\Big[\tfrac{\beta_{i}}{K},\tfrac{\beta_{i}+1}{K}-\delta\cdot 1_{\{\beta_{i}\leq K-2\}}(\beta_{i})\Big]~\text{for}~i=1,2,\dots,d\Big\},
\]

where $1_{\{\beta_{i}\leq K-2\}}(\beta_{i})$ is the indicator function of the set $\{\beta_{i}\leq K-2\}$.

By Proposition 2.2, there exists a $\sigma_{1}$-NN $\psi$ with width $4N+3$ and depth $4L+5$ such that

\[
\psi(x)=k,\quad\text{if}~x\in\big[\tfrac{k}{K},\tfrac{k+1}{K}-\delta\cdot 1_{\{k\leq K-2\}}\big]~\text{for}~k=0,1,\dots,K-1.
\]

Then, for each $\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}$, $\psi(x_{i})=\beta_{i}$ if $\bm{x}\in Q_{\bm{\beta}}$, for $i=1,2,\cdots,d$.

Define

\[
\bm{\psi}(\bm{x}):=[\psi(x_{1}),\psi(x_{2}),\cdots,\psi(x_{d})]^{T}/K\quad\text{for any}~\bm{x}\in[0,1]^{d};
\]

then

\[
\bm{\psi}(\bm{x})=\bm{\beta}/K,\quad\text{if}~\bm{x}\in Q_{\bm{\beta}},\quad\text{for}~\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}.
\]
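As a small aid to intuition (our own illustration, not the $\sigma_{1}$-NN realization $\psi$ constructed above), the target piecewise-constant behavior of $\psi$ on the partition can be emulated directly; points of the trifling region are simply flagged:

\begin{verbatim}
import numpy as np

def target_step(x, K):
    # the index k with x in [k/K, (k+1)/K - delta*1{k<K-1}] (outside the trifling region)
    return min(int(np.floor(K * x)), K - 1)

def in_trifling_region(x, K, delta):
    # x lies in (k/K - delta, k/K) for some k = 1, ..., K-1
    return any(k / K - delta < x < k / K for k in range(1, K))

K, delta = 4, 0.05
for x in [0.10, 0.24, 0.26, 0.70, 1.00]:
    print(x, in_trifling_region(x, K, delta), target_step(x, K))
\end{verbatim}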

Now, we fix a $\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}$ throughout the proof. For any $\bm{x}\in Q_{\bm{\beta}}$, by Taylor's expansion there exists a $\xi_{\bm{x}}\in(0,1)$ such that

\[
f(\bm{x})=\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}+\sum_{|{\bm{\alpha}}|=s}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x})+\xi_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}, \tag{7}
\]

where $\bm{h}=\bm{x}-\bm{\psi}(\bm{x})$.

By Lemma 3.4, there exists a $\sigma_{1}$-NN $\widetilde{\phi}$ with width $9(N+1)+1$ and depth $4s(L+1)$ such that $\|\widetilde{\phi}\|_{\mathcal{W}^{1,\infty}((-3,3)^{2})}\leq 432$ and

\[
\|\widetilde{\phi}(x,y)-xy\|_{\mathcal{W}^{1,\infty}((-3,3)^{2})}\leq 6(6)^{2}(N+1)^{-2s(L+1)}=216(N+1)^{-2s(L+1)}=:\mathcal{E}_{1}.
\]

For each ${\bm{\alpha}}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s$, by Proposition 3.6 there exist $\sigma_{1}$-NNs $P_{\bm{\alpha}}(\bm{x})$ with width $9(N+1)+s-1$ and depth $14s^{2}L$ such that

\[
\|P_{\bm{\alpha}}(\bm{x})-\bm{x}^{\bm{\alpha}}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}\leq 10s(N+1)^{-7sL}=:\mathcal{E}_{2}. \tag{8}
\]

For each $i=0,1,\cdots,K^{d}-1$, we define the bijection

\[
\bm{\eta}(i)=[\eta_{1},\eta_{2},\cdots,\eta_{d}]^{T}\in\{0,1,\dots,K-1\}^{d}
\]

such that $\sum_{j=1}^{d}\eta_{j}K^{j-1}=i$. We will drop the input $i$ in $\bm{\eta}(i)$ later for simplicity. For each $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s-1$, define

\[
\xi_{\bm{\alpha},i}=\big(\partial^{\bm{\alpha}}f\big(\tfrac{\bm{\eta}}{K}\big)+1\big)/2.
\]

Note that $K^{d}=(\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor)^{d}\leq N^{2}L^{2}$ and $\xi_{\bm{\alpha},i}\in[0,1]$ for $i=0,1,\cdots,K^{d}-1$. By Proposition 2.3, there exists a $\sigma_{1}$-NN $\widetilde{\phi}_{\bm{\alpha}}$ of width $16s(N+1)\log_{2}(8N)$ and depth $5(L+2)\log_{2}(4L)$ such that

\[
|\widetilde{\phi}_{\bm{\alpha}}(i)-\xi_{\bm{\alpha},i}|\leq N^{-2s}L^{-2s},\quad\text{for}~i=0,1,\cdots,K^{d}-1~\text{and}~|{\bm{\alpha}}|\leq s-1.
\]
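The bijection $\bm{\eta}(i)$ above is simply the base-$K$ digit expansion of $i$; a short illustration (ours) of the map and of the inverse relation $i=\sum_{j=1}^{d}\eta_{j}K^{j-1}$:

\begin{verbatim}
def eta(i, K, d):
    # base-K digits of i, least significant first, so that i = sum_j eta_j * K**(j-1)
    return [(i // K**j) % K for j in range(d)]

K, d = 4, 3
i = 27
digits = eta(i, K, d)                                       # [3, 2, 1]
print(digits, sum(e * K**j for j, e in enumerate(digits)))  # [3, 2, 1] 27
\end{verbatim}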

For each ${\bm{\alpha}}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s-1$, we define

\[
\phi_{\bm{\alpha}}(\bm{x}):=2\widetilde{\phi}_{\bm{\alpha}}\Bigg(\sum_{j=1}^{d}x_{j}K^{j-1}\Bigg)-1,\quad\text{for any}~\bm{x}=[x_{1},x_{2},\cdots,x_{d}]^{T}\in\mathbb{R}^{d}.
\]

It can be seen that $\phi_{\bm{\alpha}}$ is also of width $16s(N+1)\log_{2}(8N)$ and depth $5(L+2)\log_{2}(4L)$.

Then, for each $\bm{\eta}=[\eta_{1},\eta_{2},\cdots,\eta_{d}]^{T}\in\{0,1,\dots,K-1\}^{d}$ corresponding to $i=\sum_{j=1}^{d}\eta_{j}K^{j-1}$, and each $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s-1$, we have

\[
\big|\phi_{\bm{\alpha}}(\tfrac{\bm{\eta}}{K})-\partial^{\bm{\alpha}}f(\tfrac{\bm{\eta}}{K})\big|=\Bigg|2\widetilde{\phi}_{\bm{\alpha}}\Bigg(\sum_{j=1}^{d}x_{j}K^{j-1}\Bigg)-1-(2\xi_{\bm{\alpha},i}-1)\Bigg|=2|\widetilde{\phi}_{\bm{\alpha}}(i)-\xi_{\bm{\alpha},i}|\leq 2N^{-2s}L^{-2s}.
\]

From $\bm{\psi}(\bm{x})=\tfrac{\bm{\beta}}{K}$ for $\bm{x}\in Q_{\bm{\beta}}$, it follows that

\[
\begin{aligned}
\|\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))-\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))\|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}&=\|\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))-\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))\|_{L^{\infty}(Q_{\bm{\beta}})}\\
&=\big|\phi_{\bm{\alpha}}(\tfrac{\bm{\beta}}{K})-\partial^{\bm{\alpha}}f(\tfrac{\bm{\beta}}{K})\big|\leq 2N^{-2s}L^{-2s}=:\mathcal{E}_{3}.
\end{aligned} \tag{9}
\]

Note that since $\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))-\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))$ is constant for $\bm{x}\in Q_{\bm{\beta}}$, its weak derivative is zero, which gives the first equality of (9).

Define

\[
\phi(\bm{x})=\sum_{|{\bm{\alpha}}|\leq s-1}\widetilde{\phi}\Big(\tfrac{\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big) \tag{10}
\]

for any $\bm{x}\in\mathbb{R}^{d}$.

Let us now first estimate the overall approximation error in the $\mathcal{W}^{1,\infty}$ semi-norm. We denote the first order derivatives of $f$ by $\partial^{\bm{\gamma}}f$ with $|\bm{\gamma}|=1$. For any $\bm{x}\in Q_{\bm{\beta}}$, by Taylor's expansion there exists a $\xi_{\bm{x}}^{\bm{\gamma}}\in(0,1)$ such that

\[
\begin{aligned}
\partial^{\bm{\gamma}}f(\bm{x})&=\sum_{|{\bm{\alpha}}|\leq s-2}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}+\sum_{|{\bm{\alpha}}|=s-1}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi}(\bm{x})+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\\
&=\partial^{\bm{\gamma}}\Big(\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big)+\sum_{|{\bm{\alpha}}|=s-1}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi}(\bm{x})+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}},
\end{aligned}
\]

where $\bm{h}=\bm{x}-\bm{\psi}(\bm{x})$.
Therefore,

\[
\begin{aligned}
&|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}\\
&\quad\leq\Big|\phi(\bm{x})-\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}+\Big|\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}-f(\bm{x})\Big|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}\\
&\quad\leq\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big\|\widetilde{\phi}\Big(\tfrac{\phi_{\bm{\alpha}}(\bm{\psi}(\bm{x}))}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big)-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi}(\bm{x}))}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big\|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{=:E_{1}}\\
&\quad\quad+\underbrace{\max_{|\bm{\gamma}|=1}\sum_{|{\bm{\alpha}}|=s-1}\Big\|\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi}(\bm{x})+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big\|_{L^{\infty}(Q_{\bm{\beta}})}}_{=:E_{2}},
\end{aligned}
\]

where $E_{2}$ with $\xi_{\bm{x}}^{\bm{\gamma}}\in(0,1)$ is the remainder term resulting from Taylor's expansion of $\partial^{\bm{\gamma}}f$ in (3.2).

The error term E1E_{1} is further decomposed into two parts via

E1\displaystyle E_{1} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲1,(Q𝜷)=:E1,1\displaystyle\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{=:E_{1,1}}
+|𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)=:E1,2.\displaystyle+\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{=:E_{1,2}}.

For each $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s-1$ and $\bm{x}\in Q_{\bm{\beta}}$, since $\mathcal{E}_{3}\in[0,2]$ and $\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})\in(-1,1)$ according to (9), we have $\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})\in(-3,3)$ and hence $\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\in(-3,3)$.

Also, we have P𝜶(𝒙)(2,3)(3,3)P_{\bm{\alpha}}(\bm{x})\in(-2,3)\subseteq(-3,3) and P𝜶(𝒙)𝒲([0,1]d)1,3\|P_{\bm{\alpha}}(\bm{x})\|_{\mathcal{W}{{}^{1,\infty}([0,1]^{d})}}\leq 3 for any 𝒙[0,1]d\bm{x}\in[0,1]^{d} and |𝜶|s1|{\bm{\alpha}}|\leq s-1, since 2<2\mathcal{E}_{2}<2 from (8).

Hence, we can now measure E1,1E_{1,1}, E1,2E_{1,2} and E2E_{2}:

E1,1\displaystyle E_{1,1} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲1,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1(ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ𝜶(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲1,(Q𝜷)1\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Bigg{(}\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{\leq\mathcal{E}_{1}}
+ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲1,(Q𝜷)1\displaystyle+\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{\leq\mathcal{E}_{1}}
+ϕ𝜶(𝝍(𝒙))𝜶!P𝜶(𝒉)𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲1,(Q𝜷)33)\displaystyle+\underbrace{\Big{\|}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{\leq 3\mathcal{E}_{3}}\Bigg{)}
\displaystyle\leq |𝜶|s1(21+33)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}(2\mathcal{E}_{1}+3\mathcal{E}_{3})
\displaystyle\leq sd(21+33).\displaystyle s^{d}(2\mathcal{E}_{1}+3\mathcal{E}_{3}).

Note that the last inequality follows from the fact that $\sum_{|{\bm{\alpha}}|\leq s-1}1\leq\sum_{i=0}^{s-1}(i+1)^{d-1}\leq s^{d}$.

Similarly,

E1,2\displaystyle E_{1,2} =\displaystyle= |𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲1,(Q𝜷)1\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{\leq\mathcal{E}_{1}}
+|𝜶|s1𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)2\displaystyle+\sum_{|{\bm{\alpha}}|\leq s-1}{\underbrace{\Big{\|}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}}_{{}_{\leq\mathcal{E}_{2}}}}
\displaystyle\leq |𝜶|s1(1+2)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}(\mathcal{E}_{1}+\mathcal{E}_{2})
\displaystyle\leq sd(1+2).\displaystyle s^{d}(\mathcal{E}_{1}+\mathcal{E}_{2}).

We now provide a measure on E2E_{2}. First, by the assumption that 𝜶fL([0,1]d)<1\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1 for any 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s|{\bm{\alpha}}|\leq s, we have

E2\displaystyle E_{2} =\displaystyle= max|𝜸|=1|𝜶|=s1𝜶𝜸f(𝝍(𝒙)+ξ𝒙𝜸𝒉)𝜶!𝒉𝜶(Q𝜷)\displaystyle\max_{|\bm{\gamma}|=1}\sum_{|{\bm{\alpha}}|=s-1}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|=s11𝜶!𝒉𝜶L(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|=s-1}\Big{\|}\tfrac{1}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq sd1K(s1),\displaystyle s^{d-1}K^{-(s-1)},

where the last inequality follows from $\sum_{|{\bm{\alpha}}|=s-1}1\leq s^{d-1}$.
Thus, we have shown that

|ϕ(𝒙)f(𝒙)|𝒲1,(Q𝜷)\displaystyle|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq E1,1+E1,2+E2\displaystyle E_{1,1}+E_{1,2}+E_{2}
\displaystyle\leq E1,1+E1,2+sd1K(s1).\displaystyle E_{1,1}+E_{1,2}+s^{d-1}K^{-(s-1)}.

We now estimate ϕ(𝒙)f(𝒙)L(Q𝜷)\|\phi(\bm{x})-f(\bm{x})\|_{{L}^{\infty}(Q_{\bm{\beta}})}. Using similar arguments, from the Taylor expansion defined in (7) with ξ𝒙\xi_{\bm{x}} in the remainder term, we can show that

ϕ(𝒙)f(𝒙)L(Q𝜷)\displaystyle\|\phi(\bm{x})-f(\bm{x})\|_{{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶L(Q𝜷)+|𝜶|=s𝜶f(𝝍(𝒙)+ξ𝒙𝒉)𝜶!𝒉𝜶L(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{{L}^{\infty}(Q_{\bm{\beta}})}+\sum_{|{\bm{\alpha}}|=s}\Big{\|}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲1,(Q𝜷)+(s+1)d1Ks\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}+(s+1)^{d-1}K^{-s}
\displaystyle\leq E1,1+E1,2+(s+1)d1Ks,\displaystyle E_{1,1}+E_{1,2}+(s+1)^{d-1}K^{-s},

where the second last inequality is due to |𝜶|=s1(s+1)d1\sum_{|{\bm{\alpha}}|=s}1\leq(s+1)^{d-1}.

Using $(N+1)^{-7s(L+1)}\leq(N+1)^{-2s(L+1)}\leq(N+1)^{-2s}2^{-2sL}\leq N^{-2s}L^{-2s}$ and $K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor\geq\tfrac{N^{2/d}L^{2/d}}{8}$ (the latter since $\lfloor x\rfloor\geq x/2$ for $x\geq 1$), we have

ϕ(𝒙)f(𝒙)𝒲1,(Q𝜷)\displaystyle\|\phi(\bm{x})-f(\bm{x})\|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}
=\displaystyle= max{ϕ(𝒙)f(𝒙)L(Q𝜷),|ϕ(𝒙)f(𝒙)|𝒲1,(Q𝜷)}\displaystyle\max{\big{\{}\|\phi(\bm{x})-f(\bm{x})\|_{{L}^{\infty}(Q_{\bm{\beta}})},|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{1,\infty}(Q_{\bm{\beta}})}\big{\}}}
\displaystyle\leq E1,1+E1,2+(s+1)d1K(s1)\displaystyle E_{1,1}+E_{1,2}+(s+1)^{d-1}K^{-(s-1)}
\displaystyle\leq sd(21+33)+sd(1+2)+(s+1)d1K(s1)\displaystyle s^{d}(2\mathcal{E}_{1}+3\mathcal{E}_{3})+s^{d}(\mathcal{E}_{1}+\mathcal{E}_{2})+(s+1)^{d-1}K^{-(s-1)}
\displaystyle\leq (s+1)d(K(s1)+31+2+33)\displaystyle(s+1)^{d}(K^{-(s-1)}+3\mathcal{E}_{1}+\mathcal{E}_{2}+3\mathcal{E}_{3})
\displaystyle\leq (s+1)d(K(s1)+648(N+1)2s(L+1)+10s(N+1)7s(L+1)+6N2sL2s)\displaystyle(s+1)^{d}\big{(}K^{-(s-1)}+648(N+1)^{-2s(L+1)}+10s(N+1)^{-7s(L+1)}+6N^{-2s}L^{-2s}\big{)}
\displaystyle\leq (s+1)d(8s1N2(s1)/dL2(s1)/d+(654+10s)N2sL2s)\displaystyle(s+1)^{d}\big{(}8^{s-1}N^{-2(s-1)/d}L^{-2(s-1)/d}+(654+10s)N^{-2s}L^{-2s}\big{)}
\displaystyle\leq (s+1)d(8s1+654+10s)N2(s1)/dL2(s1)/d\displaystyle(s+1)^{d}(8^{s-1}+654+10s)N^{-2(s-1)/d}L^{-2(s-1)/d}
\displaystyle\leq 84(s+1)d8sN2(s1)/dL2(s1)/d.\displaystyle 84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.

Since $\bm{\beta}\in\{0,1,2,\cdots,K-1\}^{d}$ is arbitrary and $[0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)}\subseteq\cup_{\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}}Q_{\bm{\beta}}$, we have

ϕ(𝒙)f(𝒙)𝒲1,([0,1]d\Ω([0,1]d,K,δ))84(s+1)d8sN2(s1)/dL2(s1)/d.\|\phi(\bm{x})-f(\bm{x})\|_{\mathcal{W}^{1,\infty}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}\leq 84(s+1)^{d}8^{s}N^{-2(s-1)/d}L^{-2(s-1)/d}.

Furthermore, we have

ϕ(𝒙)𝒲1,([0,1]d)\displaystyle\|\phi(\bm{x})\|_{\mathcal{W}^{1,\infty}([0,1]^{d})} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲1,([0,1]d)\displaystyle\Bigg{\|}\sum_{|{\bm{\alpha}}|\leq s-1}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Bigg{\|}_{\mathcal{W}^{1,\infty}([0,1]^{d})}
\displaystyle\leq |𝜶|s1ϕ~𝒲1,([0,1]d)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\|\widetilde{\phi}\|_{\mathcal{W}^{1,\infty}([0,1]^{d})}
\displaystyle\leq 432sd.\displaystyle 432s^{d}.

Lastly, we finish the proof by estimating the width and depth of the network implementing $\phi(\bm{x})$. From (10), we know that $\phi(\bm{x})$ consists of the following subnetworks:

  1. $\bm{\psi}\in\textnormal{NN}(\textrm{width}\leq d(4N+3);\ \textrm{depth}\leq 4L+5)$;

  2. $\phi_{\bm{\alpha}}\in\textnormal{NN}(\textrm{width}\leq 16s(N+1)\log_{2}(8N);\ \textrm{depth}\leq 5(L+2)\log_{2}(4L))$;

  3. $P_{\bm{\alpha}}\in\textnormal{NN}(\textrm{width}\leq 9(N+1)+s-1;\ \textrm{depth}\leq 14s^{2}L)$;

  4. $\widetilde{\phi}\in\textnormal{NN}(\textrm{width}\leq 9(N+1)+1;\ \textrm{depth}\leq 4s(L+1))$.

Thus, we can infer that $\phi$ can be implemented by a ReLU network with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth

(4L+5)+4s(L+1)+14s2L+5(L+2)log2(4L)+327s2(L+2)log2(4L).\begin{split}(4L+5)+4s(L+1)+14s^{2}L+5(L+2)\log_{2}(4L)+3\leq 27s^{2}(L+2)\log_{2}(4L)\end{split}.

Hence, we have finished the proof. \qed
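The composite depth estimate above is pure bookkeeping. The following small script is an optional numerical sanity check of that arithmetic on a finite grid of parameters (our own illustration, not a substitute for the inequality):

    from math import log2

    # Check (4L+5) + 4s(L+1) + 14 s^2 L + 5(L+2)log2(4L) + 3 <= 27 s^2 (L+2) log2(4L),
    # i.e. the stacked depths of psi, phi_tilde, P_alpha and phi_alpha plus 3 extra layers.
    for s in range(1, 8):
        for L in range(1, 200):
            lhs = (4*L + 5) + 4*s*(L + 1) + 14*s**2*L + 5*(L + 2)*log2(4*L) + 3
            rhs = 27 * s**2 * (L + 2) * log2(4*L)
            assert lhs <= rhs, (s, L)
    print("depth bound holds on the sampled grid")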

3.3 Approximation Using σ2\sigma_{2}-Neural Networks

In this subsection, we provide approximation results for smooth functions measured in the $\mathcal{W}^{n,\infty}$ norm, where $n$ is a positive integer, using the smoother $\sigma_{2}$-NNs. First, we list a few basic lemmas on $\sigma_{2}$-NNs that are repeatedly applied in our main analysis.

Lemma 3.7

The following basic lemmas of σ2\sigma_{2}-NNs hold:

  1. (i)

    σ1\sigma_{1}-NNs are σ2\sigma_{2}-NNs.

  2. (ii)

    Any identity map in d\mathbb{R}^{d} can be realized exactly by a σ2\sigma_{2}-NN with one hidden layer and 2d2d neurons.

  3. (iii)

    $f(x)=x^{2}$ can be realized exactly by a $\sigma_{2}$-NN with one hidden layer and two neurons.

  4. (iv)

    $f(x,y)=xy=\tfrac{(x+y)^{2}-(x-y)^{2}}{4}$ can be realized exactly by a $\sigma_{2}$-NN with one hidden layer and four neurons.

  5. (v)

    Assume 𝒙𝜶=x1α1x2α2xdαd{\bm{x}}^{\bm{\alpha}}=x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}\cdots x_{d}^{\alpha_{d}} for 𝜶d{\bm{\alpha}}\in\mathbb{N}^{d}. For any N,L+N,L\in\mathbb{N}^{+} such that NL+2log2N|𝜶|NL+2^{\lfloor\log_{2}N\rfloor}\geq|{\bm{\alpha}}|, there exists a σ2\sigma_{2}-NN ϕ\phi with width 4N+2d4N+2d and depth L+log2NL+\lceil\log_{2}N\rceil such that

    ϕ(𝒙)=𝒙𝜶for any 𝒙d.\phi({\bm{x}})={\bm{x}}^{\bm{\alpha}}\quad\textnormal{for any ${\bm{x}}\in\mathbb{R}^{d}$.}
  6. (vi)

    Assume P(𝒙)=j=1Jcj𝒙𝜶jP({\bm{x}})=\sum_{j=1}^{J}c_{j}{\bm{x}}^{{\bm{\alpha}}_{j}} for 𝜶jd{\bm{\alpha}}_{j}\in\mathbb{N}^{d}. For any N,L,a,b+N,L,a,b\in\mathbb{N}^{+} such that abJab\geq J and (L2bblog2N)Nbmaxj|𝜶j|(L-2b-b\log_{2}N)N\geq b\max_{j}|{\bm{\alpha}}_{j}|, there exists a σ2\sigma_{2}-NN ϕ\phi with width 4Na+2d+24Na+2d+2 and depth LL such that

    ϕ(𝒙)=P(𝒙)for any 𝒙d.\phi({\bm{x}})=P({\bm{x}})\quad\textnormal{for any ${\bm{x}}\in\mathbb{R}^{d}$.}
  • Proof

    Showing (i) to (iv) is trivial. We will only prove (v) and (vi) in the following.

    Part (v): In the case of $|{\bm{\alpha}}|=k\leq 1$, the proof is simple and left for the reader. When $|{\bm{\alpha}}|=k\geq 2$, the main idea of the proof of (v) is summarized in Figure 1. We apply $\sigma_{1}$-NNs to implement a $d$-dimensional identity map as in Lemma 2.1 (iii). These identity maps maintain the necessary entries of ${\bm{x}}$ to be multiplied together. We apply $\sigma_{2}$-NNs to implement the multiplication function in Lemma 3.7 (iv) and carry out $N$ multiplications per layer. After $L$ layers, there are $k-NL\leq N$ multiplications left to be implemented. Finally, these at most $N$ multiplications can be carried out by a small $\sigma_{2}$-NN with a dyadic tree structure (see the illustrative sketch after this proof).

    Part (vi): The main idea of the proof is to apply Part (v) JJ times to construct JJ σ2\sigma_{2}-NNs, {ϕj(𝒙)}j=1J\{\phi_{j}({\bm{x}})\}_{j=1}^{J}, to represent 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} and arrange these σ2\sigma_{2}-NNs as sub-NN blocks to form a larger σ2\sigma_{2}-NN ϕ~(𝒙)\tilde{\phi}({\bm{x}}) with abab blocks as shown in Figure 2, where each red rectangle represents one σ2\sigma_{2}-NN ϕj(𝒙)\phi_{j}({\bm{x}}) and each blue rectangle represents one σ1\sigma_{1}-NN of width 22 as an identity map of \mathbb{R}. There are abab red blocks with aa rows and bb columns. When abJab\geq J, these sub-NN blocks can carry out all monomials 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}}. In each column, the results of the multiplications of 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} are added up to the input of the narrow σ1\sigma_{1}-NN, which can carry the sum over to the next column. After the calculation of bb columns, JJ additions of the monomials 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} have been implemented, resulting in the output P(𝒙)P({\bm{x}}).

    By Part (v), for any N+N\in\mathbb{N}^{+}, there exists a σ2\sigma_{2}-NN ϕj(𝒙)\phi_{j}({\bm{x}}) of width 2d+4N2d+4N and depth Lj=|𝜶j|N+log2NL_{j}=\lceil\tfrac{|{\bm{\alpha}}_{j}|}{N}\rceil+\lceil\log_{2}N\rceil to implement 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}}. Note that bmaxjLjb(maxj|𝜶j|N+2+log2N)b\max_{j}L_{j}\leq b\left(\tfrac{\max_{j}|{\bm{\alpha}}_{j}|}{N}+2+\log_{2}N\right). Hence, there exists a σ2\sigma_{2}-NN ϕ~(𝒙)\tilde{\phi}({\bm{x}}) of width 2da+4Na+22da+4Na+2 and depth b(maxj|𝜶j|N+2+log2N)b\left(\tfrac{\max_{j}|{\bm{\alpha}}_{j}|}{N}+2+\log_{2}N\right) to implement P(𝒙)P({\bm{x}}) as in Figure 2. Note that the total width of each column of blocks is 2ad+4Na+22ad+4Na+2 but in fact this width can be reduced to 2d+4Na+22d+4Na+2, since the red blocks in each column can share the same identity map of d\mathbb{R}^{d} (the blue part of Figure 1).

    Note that $b\left(\tfrac{\max_{j}|{\bm{\alpha}}_{j}|}{N}+2+\log_{2}N\right)\leq L$ is equivalent to $(L-2b-b\log_{2}N)N\geq b\max_{j}|{\bm{\alpha}}_{j}|$. Hence, for any $N,L,a,b\in\mathbb{N}^{+}$ such that $ab\geq J$ and $(L-2b-b\log_{2}N)N\geq b\max_{j}|{\bm{\alpha}}_{j}|$, there exists a $\sigma_{2}$-NN $\phi({\bm{x}})$ with width $4Na+2d+2$ and depth $L$ such that $\tilde{\phi}({\bm{x}})$ is a sub-NN of $\phi({\bm{x}})$ in the sense of $\phi({\bm{x}})=\textnormal{Id}\circ\tilde{\phi}({\bm{x}})$ with $\textnormal{Id}$ an identity map of $\mathbb{R}$, which means that $\phi({\bm{x}})=\tilde{\phi}({\bm{x}})=P({\bm{x}})$. The proof of Part (vi) is completed. \qed
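To make parts (iii)–(v) of Lemma 3.7 concrete, the following minimal NumPy sketch (our illustration only; the function names are ours) realizes the square and product gadgets with a single hidden layer of $\sigma_{2}=(\textnormal{ReLU})^{2}$ neurons and then evaluates a monomial by pairwise products in a dyadic tree, which is the role played by the red triangle in Figure 1.

    import numpy as np

    def sigma2(z):
        # ReLU-squared activation
        return np.maximum(z, 0.0) ** 2

    def square(x):
        # Lemma 3.7 (iii): x^2 = sigma2(x) + sigma2(-x), one hidden layer, two neurons
        return sigma2(x) + sigma2(-x)

    def product(x, y):
        # Lemma 3.7 (iv): xy = ((x+y)^2 - (x-y)^2)/4, one hidden layer, four neurons
        return (sigma2(x + y) + sigma2(-x - y) - sigma2(x - y) - sigma2(y - x)) / 4.0

    def monomial(x, alpha):
        # Dyadic-tree evaluation of x^alpha (the idea behind part (v)): list all factors,
        # then multiply them pairwise with the exact product gadget until one remains.
        factors = [xj for xj, aj in zip(x, alpha) for _ in range(aj)]
        if not factors:
            return 1.0
        while len(factors) > 1:
            paired = [product(factors[i], factors[i + 1]) for i in range(0, len(factors) - 1, 2)]
            if len(factors) % 2 == 1:
                paired.append(factors[-1])
            factors = paired
        return factors[0]

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=3)
    assert np.allclose(square(x[0]), x[0] ** 2)
    assert np.allclose(product(x[0], x[1]), x[0] * x[1])
    assert np.allclose(monomial(x, (2, 1, 3)), x[0] ** 2 * x[1] * x[2] ** 3)

In the actual network, the $L$ layers of $N$ parallel products and the final tree are wired together through ReLU identity channels, which is precisely what the width and depth counts in parts (v) and (vi) account for.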

Figure 1: Left: An illustration of the proof of Lemma 3.7 (v). Green vectors represent the input and output of the $\sigma_{2}$-NN carrying out the monomial ${\bm{x}}^{\bm{\alpha}}$. Blue vectors represent the $\sigma_{1}$-NN that implements a $d$-dimensional identity map in Lemma 2.1 (iii), which is applied repeatedly $L$ times. Black arrows represent the data flow for carrying out the identity maps. Red vectors represent the $\sigma_{2}$-NNs implementing the multiplication function in Lemma 3.7 (iv), and there are $NL$ such red vectors. Red arrows represent the data flow for carrying out the multiplications. Finally, a red triangle represents a $\sigma_{2}$-NN of width at most $4N$ and depth at most $\lceil\log_{2}N\rceil$ carrying out the rest of the multiplications. Right: An example of the red triangle, consisting of $15$ red vectors carrying out $15$ multiplications.
Figure 2: An illustration of the proof of Lemma 3.7 (vi). Green vectors represent the input and output of the σ2\sigma_{2}-NN ϕ~(𝒙)\tilde{\phi}({\bm{x}}) carrying out P(𝒙)P({\bm{x}}). Each red rectangle represents one σ2\sigma_{2}-NN ϕj(𝒙)\phi_{j}({\bm{x}}) and each blue rectangle represents one σ1\sigma_{1}-NN of width 22 as an identity map of \mathbb{R}. There are abJab\geq J red blocks with aa rows and bb columns. When abJab\geq J, these sub-NN blocks can carry out all monomials 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}}. In each column, the results of the multiplications of 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} are added up to (indicated by black arrows) the input of the narrow σ1\sigma_{1}-NN, which can carry the sum over to the next column. Each red arrow passes 𝒙{\bm{x}} to the next red block. After the calculation of bb columns, JJ additions of the monomials 𝒙𝜶j{\bm{x}}^{{\bm{\alpha}}_{j}} have been implemented, resulting in the output P(𝒙)P({\bm{x}}).

Similar to the ReLU network case, we first present the following approximation result before proving our main theorem, Theorem 1.4.

Theorem 3.8

Suppose that $f\in C^{s}([0,1]^{d})$ with $s\in\mathbb{N}^{+}$ satisfies $\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1$ for any $\bm{\alpha}\in\mathbb{N}^{d}$ with $|{\bm{\alpha}}|\leq s$. For any $N,L\in\mathbb{N}^{+}$ satisfying $(L-2-\log_{2}N)N\geq s$, there exists a $\sigma_{2}$-NN $\phi$ with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth $10(L+2)\log_{2}(4L)$ such that $\|\phi\|_{\mathcal{W}^{n,\infty}([0,1]^{d})}\leq s^{d}$ and

fϕ𝒲n,([0,1]d\Ω([0,1]d,K,δ))2(s+1)d8snN2(sn)/dL2(sn)/d\|f-\phi\|_{\mathcal{W}^{n,\infty}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}\leq 2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}

where n<sn<s is a positive integer, K=N1/d2L2/dK=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor and 0δ13K0\leq\delta\leq\tfrac{1}{3K}.

  • Proof

Similar to the proof of Theorem 3.1, we set K=N1/d2L2/dK=\lfloor{N^{1/d}}\rfloor^{2}\lfloor L^{2/d}\rfloor and let Ω([0,1]d,K,δ)\Omega{([0,1]^{d},K,\delta)} partition [0,1]d[0,1]^{d} into KdK^{d} cubes Q𝜷Q_{\bm{\beta}} for 𝜷{0,1,,K1}d\bm{\beta}\in\{0,1,\cdots,K-1\}^{d} such that

[0,1]d=Ω([0,1]d,K,δ)(𝜷{0,1,,K1}dQ𝜷).[0,1]^{d}=\Omega{([0,1]^{d},K,\delta)}\bigcup\big{(}\cup_{\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}}Q_{\bm{\beta}}\big{)}.

For each 𝜷=[β1,β2,,βd]T{0,1,,K1}d\bm{\beta}=[\beta_{1},\beta_{2},\cdots,\beta_{d}]^{T}\in\{0,1,\cdots,K-1\}^{d}, we define

Q𝜷={𝒙=[x1,x2,,xd]T:xi[βiK,βi+1Kδ1{βiK2}]fori=1,2,,d}.Q_{\bm{\beta}}=\Bigg{\{}\bm{x}=[x_{1},x_{2},\cdots,x_{d}]^{T}:x_{i}\in\Big{[}\tfrac{\beta_{i}}{K},\tfrac{\beta_{i}+1}{K}-\delta\cdot 1_{\{\beta_{i}\leq K-2\}}\Big{]}~{}\text{for}~{}i=1,2,\dots,d\Bigg{\}}.

By Proposition 2.2, there exists a σ2\sigma_{2}-NN ψ\psi with width 4N+34N+3 and depth 4L+54L+5 such that

ψ(x)=k,ifx[kK,k+1Kδ1{kK2}]fork=0,1,,K1.\psi(x)=k,\quad\text{if}~{}x\in\big{[}\tfrac{k}{K},\tfrac{k+1}{K}-\delta\cdot 1_{\{k\leq K-2\}}\big{]}~{}\text{for}~{}k=0,1,\dots,K-1.

Then, for each $\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}$, $\psi(x_{i})=\beta_{i}$ if $\bm{x}\in Q_{\bm{\beta}}$ for $i=1,2,\cdots,d$.

Define

𝝍(𝒙):=[ψ(x1),ψ(x2),,ψ(xd)]T/Kfor any𝒙[0,1]d,\bm{\psi{(x)}}:=[\psi(x_{1}),\psi(x_{2}),\cdots,\psi(x_{d})]^{T}/K\quad\text{for any}~{}\bm{x}\in[0,1]^{d},

then

𝝍(𝒙)=𝜷/Kif𝒙Q𝜷for𝜷{0,1,,K1}d.\bm{\psi{(x)}}=\bm{\beta}/K\quad\text{if}~{}\bm{x}\in Q_{\bm{\beta}}\quad\text{for}~{}\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}.
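As an illustration of this partition (a sketch under the definitions above, not the network provided by Proposition 2.2), the following Python snippet computes, for a point $\bm{x}$ outside the trifling region, the index $\bm{\beta}$ with $\bm{x}\in Q_{\bm{\beta}}$, the value $\bm{\psi}(\bm{x})=\bm{\beta}/K$ that the step network is required to output, and the linearized index $i=\sum_{j}\beta_{j}K^{j-1}$ used for point fitting below.

    import numpy as np

    def cube_index(x, K, delta):
        # Return beta with x in Q_beta, or None if x lies in one of the delta-strips
        # (the trifling region); Q_beta is the cube defined above.
        x = np.asarray(x, dtype=float)
        beta = np.minimum(np.floor(x * K).astype(int), K - 1)
        upper = (beta + 1) / K - delta * (beta <= K - 2)
        return None if np.any(x > upper) else beta

    K, delta, d = 4, 0.01, 2
    x = [0.30, 0.55]                              # a sample point in [0,1]^2
    beta = cube_index(x, K, delta)
    psi_x = beta / K                              # target output of the step network psi
    i = int(np.sum(beta * K ** np.arange(d)))     # i = sum_j beta_j K^(j-1)
    print(beta.tolist(), psi_x.tolist(), i)       # [1, 2] [0.25, 0.5] 9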

Now, we fix a 𝜷{0,1,,K1}d\bm{\beta}\in\{0,1,\cdots,K-1\}^{d} throughout the proof. For any 𝒙Q𝜷\bm{x}\in Q_{\bm{\beta}}, by Taylor’s expansion there exists a ξ𝒙(0,1)\xi_{\bm{x}}\in(0,1) such that

f(𝒙)=|𝜶|s1𝜶f(𝝍(𝒙))𝜶!𝒉𝜶+|𝜶|=s𝜶f(𝝍(𝒙)+ξ𝒙𝒉)𝜶!𝒉𝜶,{f(\bm{x})}=\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}+\sum_{|{\bm{\alpha}}|=s}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}},

where 𝒉=𝒙𝝍(𝒙)\bm{h}=\bm{x}-\bm{\psi{(\bm{x})}}.

By Lemma 3.7 (iv), there exists a σ2\sigma_{2}-NN ϕ~\widetilde{\phi} with width 44 and depth 11 such that

ϕ~(x,y)=xy\widetilde{\phi}(x,y)=xy

for any x,y(3,3).x,y\in(-3,3).

Note that it is trivial to construct σ2\sigma_{2}-NNs\textnormal{NN}s P𝜶(𝒙)P_{\bm{\alpha}}(\bm{x}) for 𝒙𝜶\bm{x}^{\bm{\alpha}} when |𝜶|1|{\bm{\alpha}}|\leq 1. Thus, for each 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s1|{\bm{\alpha}}|\leq s-1 and for any N,L+N,L\in\mathbb{N}^{+} satisfying (L2log2N)Ns(L-2-\log_{2}N)N\geq s, by Lemma 3.7 (vi) with a=b=J=1a=b=J=1, there exists a σ2\sigma_{2}-NN P𝜶P_{\bm{\alpha}} with width 4N+2d+24N+2d+2 and depth LL such that

P𝜶(𝒙)=𝒙𝜶P_{\bm{\alpha}}(\bm{x})=\bm{x}^{\bm{\alpha}}

for any 𝒙d\bm{x}\in\mathbb{R}^{d}.

For each i=0,1,,Kd1i=0,1,\cdots,K^{d}-1, we define

𝜼(i)=[η1,η2,,ηd]T{0,1,,K1}d\bm{\eta}(i)=[\eta_{1},\eta_{2},\cdots,\eta_{d}]^{T}\in\{0,1,\dots,K-1\}^{d}

such that j=1dηjKj1=i\sum_{j=1}^{d}\eta_{j}K^{j-1}=i. We will drop the input ii in 𝜼(i)\bm{\eta}(i) later for simplicity. For each 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s1|{\bm{\alpha}}|\leq s-1, define

ξ𝜶,i=(𝜶f(ηK)+1)/2.\xi_{\bm{\alpha},i}=\big{(}\partial^{\bm{\alpha}}f\big{(}\tfrac{\eta}{K}\big{)}+1\big{)}/2.

Note that Kd=(N1/d2L2/d)dN2L2K^{d}=(\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor)^{d}\leq N^{2}L^{2} and ξ𝜶,i[0,1]\xi_{\bm{\alpha},i}\in[0,1] for i=0,1,,Kd1i=0,1,\cdots,K^{d}-1. By Proposition 2.3, there exists a σ1\sigma_{1}-NN ϕ~𝜶\widetilde{\phi}_{\bm{\alpha}}, which is also a σ2\sigma_{2}-NN, of width 16s(N+1)log2(8N)16s(N+1)\log_{2}(8N) and depth 5(L+2)log2(4L)5(L+2)\log_{2}(4L) such that

|ϕ~𝜶(i)ξ𝜶,i|N2sL2s,fori=0,1,,Kd1and|𝜶|s1.|\widetilde{\phi}_{\bm{\alpha}}(i)-\xi_{\bm{\alpha},i}|\leq N^{-2s}L^{-2s},\quad\text{for}~{}i=0,1,\cdots,K^{d}-1~{}\text{and}~{}|{\bm{\alpha}}|\leq s-1.

Define

\phi_{\bm{\alpha}}(\bm{x}):=2\widetilde{\phi}_{\bm{\alpha}}\Bigg{(}\sum_{j=1}^{d}x_{j}K^{j}\Bigg{)}-1,\quad\text{for any}~\bm{x}=[x_{1},x_{2},\cdots,x_{d}]^{T}\in\mathbb{R}^{d},\quad\text{so that}~\phi_{\bm{\alpha}}\big(\tfrac{\bm{\eta}}{K}\big)=2\widetilde{\phi}_{\bm{\alpha}}\big(\textstyle\sum_{j=1}^{d}\eta_{j}K^{j-1}\big)-1=2\widetilde{\phi}_{\bm{\alpha}}(i)-1.

For each |𝜶|s1|{\bm{\alpha}}|\leq s-1, we know that ϕ𝜶\phi_{\bm{\alpha}} is also of width 16s(N+1)log2(8N)16s(N+1)\log_{2}(8N) and depth 5(L+2)log2(4L)5(L+2)\log_{2}(4L).

Then for each 𝜼=[η1,η2,,ηd]T{0,1,,K1}d\bm{\eta}=[\eta_{1},\eta_{2},\cdots,\eta_{d}]^{T}\in\{0,1,\dots,K-1\}^{d} corresponding to i=j=1dηjKj1i=\sum_{j=1}^{d}\eta_{j}K^{j-1}, each 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s1|{\bm{\alpha}}|\leq s-1, we have

\big{|}\phi_{\bm{\alpha}}(\tfrac{\bm{\eta}}{K})-\partial^{\bm{\alpha}}f(\tfrac{\bm{\eta}}{K})\big{|}=\big{|}2\widetilde{\phi}_{\bm{\alpha}}(i)-1-(2\xi_{\bm{\alpha},i}-1)\big{|}=2|\widetilde{\phi}_{\bm{\alpha}}(i)-\xi_{\bm{\alpha},i}|\leq 2N^{-2s}L^{-2s}.

From 𝝍(𝒙)=𝜷K\bm{\psi{(x)}}=\tfrac{\bm{\beta}}{K} for 𝒙Q𝜷\bm{x}\in Q_{\bm{\beta}}, it follows that

ϕ𝜶(𝝍(𝒙))𝜶f(𝝍(𝒙))𝒲n,(Q𝜷)\displaystyle\|\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})-\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})\|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})} =\displaystyle= ϕ𝜶(𝝍(𝒙))𝜶f(𝝍(𝒙))L(Q𝜷)\displaystyle\|\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})-\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})\|_{{L}^{\infty}(Q_{\bm{\beta}})}
=\displaystyle= |ϕ𝜶(𝜷K)𝜶f(𝜷K)|\displaystyle\big{|}\phi_{\bm{\alpha}}(\tfrac{\bm{\beta}}{K})-\partial^{\bm{\alpha}}f(\tfrac{\bm{\beta}}{K})\big{|}
\displaystyle\leq 2N2sL2s=:3.\displaystyle 2N^{-2s}L^{-2s}=:\mathcal{E}_{3}.

Note that $\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})-\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})$ is constant for $\bm{x}\in Q_{\bm{\beta}}$, so its weak derivatives vanish, which gives the first equality above.

Define

ϕ(𝒙)=|𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))\phi(\bm{x})=\sum_{|{\bm{\alpha}}|\leq s-1}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)} (12)

for any 𝒙d\bm{x}\in\mathbb{R}^{d}.

Let us now first estimate the overall approximation error in semi-norm. We denote the nn-th order derivatives of ff by 𝜸f\partial^{\bm{\gamma}}f with |𝜸|=n|\bm{\gamma}|=n. For any 𝒙Q𝜷\bm{x}\in Q_{\bm{\beta}}, by Taylor’s expansion there exists a ξ𝒙𝜸(0,1)\xi_{{\bm{x}}}^{\bm{\gamma}}\in(0,1) such that

𝜸f\displaystyle\partial^{\bm{\gamma}}f =\displaystyle= |𝜶|sn1𝜶𝜸f(𝝍(𝒙))𝜶!𝒉𝜶+|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙𝜸𝒉)𝜶!𝒉𝜶\displaystyle\sum_{|{\bm{\alpha}}|\leq s-n-1}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}+\sum_{|{\bm{\alpha}}|=s-n}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+\xi_{{\bm{x}}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}
=\displaystyle= 𝜸(|𝜶|s1𝜶f(𝝍(𝒙))𝜶!𝒉𝜶)+|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙𝜸𝒉)𝜶!𝒉𝜶,\displaystyle\partial^{\bm{\gamma}}\Big{(}\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{)}+\sum_{|{\bm{\alpha}}|=s-n}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}},

where 𝒉=𝒙𝝍(𝒙)\bm{h}=\bm{x}-\bm{\psi{(\bm{x})}}.
Therefore, for any 𝒙Q𝜷\bm{x}\in Q_{\bm{\beta}} we have

\displaystyle|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq\Big{|}\phi(\bm{x})-\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}+\Big{|}\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}-f(\bm{x})\Big{|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1|ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶|𝒲n,(Q𝜷)\displaystyle{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}
+max|𝜸|=n|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙𝜸𝒉)𝜶!𝒉𝜶(Q𝜷)\displaystyle+\max_{|\bm{\gamma}|=n}\sum_{|{\bm{\alpha}}|=s-n}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+\xi_{\bm{x}}^{\bm{\gamma}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)=:E1\displaystyle\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=:E_{1}}
+max|𝜸|=n|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙(𝜸)𝒉)𝜶!𝒉𝜶(Q𝜷)=:E2,\displaystyle+\underbrace{\max_{|\bm{\gamma}|=n}\sum_{|{\bm{\alpha}}|=s-n}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+{\xi}^{(\bm{\gamma})}_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}}_{=:E_{2}},

where $E_{2}$ with ${\xi}^{(\bm{\gamma})}_{\bm{x}}\in(0,1)$ is the remainder term resulting from Taylor's expansion of $\partial^{\bm{\gamma}}f$. As in the ReLU case, the error term $E_{1}$ is further decomposed into two parts via

E1\displaystyle E_{1} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲n,(Q𝜷)=:E1,1\displaystyle\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=:E_{1,1}}
+|𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)=:E1,2.\displaystyle+\underbrace{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=:E_{1,2}}.

Hence, we can now measure E1,1E_{1,1}, E1,2E_{1,2}, and E2E_{2}:

E1,1\displaystyle E_{1,1} =\displaystyle= |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝒲n,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1(ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))ϕ𝜶(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲n,(Q𝜷)=0\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Bigg{(}\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=0}
+ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲n,(Q𝜷)=0\displaystyle+\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=0}
+ϕ𝜶(𝝍(𝒙))𝜶!P𝜶(𝒉)𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲n,(Q𝜷)3)\displaystyle+\underbrace{\Big{\|}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{\leq\mathcal{E}_{3}}\Bigg{)}
\displaystyle\leq |𝜶|s13\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\mathcal{E}_{3}
\displaystyle\leq sd3.\displaystyle s^{d}\mathcal{E}_{3}.

Note that the last inequality follows from the fact that $\sum_{|{\bm{\alpha}}|\leq s-1}1\leq\sum_{i=0}^{s-1}(i+1)^{d-1}\leq s^{d}$. Similarly,

E1,2\displaystyle E_{1,2} =\displaystyle= |𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(𝜶f(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝒲n,(Q𝜷)=0\displaystyle\sum_{|{\bm{\alpha}}|\leq s-1}\underbrace{\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{=0}
+|𝜶|s1𝜶f(𝝍(𝒙))𝜶!P𝜶(𝒉)𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)=0\displaystyle+\sum_{|{\bm{\alpha}}|\leq s-1}{\underbrace{\Big{\|}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}P_{\bm{\alpha}}(\bm{h})-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}_{{}_{=0}}}
\displaystyle\leq 0.\displaystyle 0.

We now provide a measure on E2E_{2}. First, by the assumption that 𝜶fL([0,1]d)<1\|\partial^{\bm{\alpha}}f\|_{L^{\infty}([0,1]^{d})}<1 for any 𝜶d\bm{\alpha}\in\mathbb{N}^{d} with |𝜶|s|{\bm{\alpha}}|\leq s, we have

E2\displaystyle E_{2} =\displaystyle= max|𝜸|=n|𝜶|=sn𝜶𝜸f(𝝍(𝒙)+ξ𝒙(𝜸)𝒉)𝜶!𝒉𝜶(Q𝜷)\displaystyle\max_{|\bm{\gamma}|=n}\sum_{|{\bm{\alpha}}|=s-n}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+{\xi}^{(\bm{\gamma})}_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|=sn1𝜶!𝒉𝜶L(Q𝜷)\displaystyle\sum_{|{\bm{\alpha}}|=s-n}\Big{\|}\tfrac{1}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq (sn+1)d1K(sn),\displaystyle(s-n+1)^{d-1}K^{-(s-n)},

where the last inequality follows from $\sum_{|{\bm{\alpha}}|=s-n}1\leq(s-n+1)^{d-1}$.
Thus, we have shown that

|ϕ(𝒙)f(𝒙)|𝒲n,(Q𝜷)\displaystyle|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq E1,1+E1,2+E2\displaystyle E_{1,1}+E_{1,2}+E_{2}
\displaystyle\leq E1,1+E1,2+(sn+1)d1K(sn).\displaystyle E_{1,1}+E_{1,2}+(s-n+1)^{d-1}K^{-(s-n)}.

We now estimate |ϕ(𝒙)f(𝒙)|𝒲j,(Q𝜷),j=0,,n1|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})},j=0,\dots,n-1. Using similar arguments and considering Taylor’s expansion of 𝜸f{\partial^{\bm{\gamma}}}f, |𝜸|n1|\bm{\gamma}|\leq n-1, with ξ𝒙(𝜸)(0,1){\xi}^{(\bm{\gamma})}_{\bm{x}}\in(0,1) in the remainder term, we have

|ϕ(𝒙)f(𝒙)|𝒲j,(Q𝜷)\displaystyle|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq\Big{|}\phi(\bm{x})-\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{|}_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})}+\Big{|}\sum_{|{\bm{\alpha}}|\leq s-1}\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}-f(\bm{x})\Big{|}_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1|ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶|𝒲j,(Q𝜷)\displaystyle{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{|}_{\mathcal{W}^{j,\infty}(Q_{\bm{\beta}})}}
+max|𝜸|=j|𝜶|=sj𝜶𝜸f(𝝍(𝒙)+ξ𝒙(𝜸)𝒉)𝜶!𝒉𝜶(Q𝜷)\displaystyle+\max_{|\bm{\gamma}|=j}\sum_{|{\bm{\alpha}}|=s-j}\Big{\|}\tfrac{\partial^{\bm{\alpha}}\partial^{\bm{\gamma}}f(\bm{\psi(\bm{x})}+{\xi}^{(\bm{\gamma})}_{\bm{x}}\bm{h})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{L}^{\infty}(Q_{\bm{\beta}})}
\displaystyle\leq |𝜶|s1ϕ~(ϕ𝜶(𝝍(𝒙))𝜶!,P𝜶(𝒉))𝜶f(𝝍(𝒙))𝜶!𝒉𝜶𝒲n,(Q𝜷)\displaystyle{\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}-\tfrac{\partial^{\bm{\alpha}}f(\bm{\psi(\bm{x})})}{\bm{\alpha}!}\bm{h}^{\bm{\alpha}}\Big{\|}_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}}
+(sj+1)d1K(sj)\displaystyle+(s-j+1)^{d-1}K^{-(s-j)}
\displaystyle\leq E1,1+E1,2+(sj+1)d1K(sj),\displaystyle E_{1,1}+E_{1,2}+(s-j+1)^{d-1}K^{-(s-j)},

where the second last inequality is due to |𝜶|=sj1(sj+1)d1\sum_{|{\bm{\alpha}}|=s-j}1\leq(s-j+1)^{d-1}.

Using K=N1/d2L2/dN2/dL2/d8K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor\geq\tfrac{N^{2/d}L^{2/d}}{8}, we have

ϕ(𝒙)f(𝒙)𝒲n,(Q𝜷)\displaystyle\|\phi(\bm{x})-f(\bm{x})\|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}
=\displaystyle= max{ϕ(𝒙)f(𝒙)L(Q𝜷),,|ϕ(𝒙)f(𝒙)|𝒲n1,(Q𝜷),|ϕ(𝒙)f(𝒙)|𝒲n,(Q𝜷)}\displaystyle\max{\big{\{}\|\phi(\bm{x})-f(\bm{x})\|_{{L}^{\infty}(Q_{\bm{\beta}})},\dots,|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{n-1,\infty}(Q_{\bm{\beta}})},|\phi(\bm{x})-f(\bm{x})|_{\mathcal{W}^{n,\infty}(Q_{\bm{\beta}})}\big{\}}}
\displaystyle\leq E1,1+E1,2+(s+1)d1K(sn)\displaystyle E_{1,1}+E_{1,2}+(s+1)^{d-1}K^{-(s-n)}
\displaystyle\leq sd3+(s+1)d1K(sn)\displaystyle s^{d}\mathcal{E}_{3}+(s+1)^{d-1}K^{-(s-n)}
\displaystyle\leq (s+1)d(K(sn)+3)\displaystyle(s+1)^{d}(K^{-(s-n)}+\mathcal{E}_{3})
\displaystyle\leq (s+1)d(K(sn)+2N2sL2s)\displaystyle(s+1)^{d}\big{(}K^{-(s-n)}+2N^{-2s}L^{-2s}\big{)}
\displaystyle\leq (s+1)d(8snN2(sn)/dL2(sn)/d+2N2sL2s)\displaystyle(s+1)^{d}\big{(}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}+2N^{-2s}L^{-2s}\big{)}
\displaystyle\leq (s+1)d(8sn+2)N2(sn)/dL2(sn)/d\displaystyle(s+1)^{d}(8^{s-n}+2)N^{-2(s-n)/d}L^{-2(s-n)/d}
\displaystyle\leq 2(s+1)d8snN2(sn)/dL2(sn)/d.\displaystyle 2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}.

Since $\bm{\beta}\in\{0,1,2,\cdots,K-1\}^{d}$ is arbitrary and $[0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)}\subseteq\cup_{\bm{\beta}\in\{0,1,\cdots,K-1\}^{d}}Q_{\bm{\beta}}$, we have

ϕ(𝒙)f(𝒙)𝒲n,([0,1]d\Ω([0,1]d,K,δ))2(s+1)d8snN2(sn)/dL2(sn)/d.\|\phi(\bm{x})-f(\bm{x})\|_{\mathcal{W}^{n,\infty}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}\leq 2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}.

Furthermore, we have

\displaystyle\|\phi(\bm{x})\|_{\mathcal{W}^{n,\infty}([0,1]^{d})}=\Bigg{\|}\sum_{|{\bm{\alpha}}|\leq s-1}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Bigg{\|}_{\mathcal{W}^{n,\infty}([0,1]^{d})}
\displaystyle\leq\sum_{|{\bm{\alpha}}|\leq s-1}\Big{\|}\widetilde{\phi}\Big{(}\tfrac{\phi_{\bm{\alpha}}(\bm{\psi(\bm{x})})}{\bm{\alpha}!},P_{\bm{\alpha}}(\bm{h})\Big{)}\Big{\|}_{\mathcal{W}^{n,\infty}([0,1]^{d})}
\displaystyle\leq s^{d}.

Lastly, we finish the proof by estimating the width and depth of the network implementing $\phi(\bm{x})$. From (12), for any $N,L\in\mathbb{N}^{+}$ satisfying $(L-2-\log_{2}N)N\geq s$, we know that $\phi(\bm{x})$ consists of the following subnetworks:

  1. $\bm{\psi}\in\textnormal{NN}(\textrm{width}\leq d(4N+3);\ \textrm{depth}\leq 4L+5)$;

  2. $\phi_{\bm{\alpha}}\in\textnormal{NN}(\textrm{width}\leq 16s(N+1)\log_{2}(8N);\ \textrm{depth}\leq 5(L+2)\log_{2}(4L))$;

  3. $P_{\bm{\alpha}}\in\textnormal{NN}(\textrm{width}\leq 4N+2d+2;\ \textrm{depth}\leq L)$;

  4. $\widetilde{\phi}\in\textnormal{NN}(\textrm{width}\leq 4;\ \textrm{depth}\leq 1)$.

Thus, we can infer that $\phi$ can be implemented by a $\sigma_{2}$-NN with width $16s^{d+1}d(N+2)\log_{2}(8N)$ and depth

(4L+5)+1+L+5(L+2)log2(4L)+310(L+2)log2(4L).\begin{split}(4L+5)+1+L+5(L+2)\log_{2}(4L)+3\leq 10(L+2)\log_{2}(4L)\end{split}.

Hence, we have finished the proof. \qed

We can now prove Theorem 1.4 using Theorem 3.8.

  • Proof of Theorem 1.4

    When ff is a constant function, the statement is trivial. By Theorem 3.8, there exists a σ2\sigma_{2}-NN ϕ\phi with width 16sd+1d(N+2)log2(8N)16s^{d+1}d(N+2)\log_{2}{({8}N)} and depth 10(L+2)log2(4L)10(L+2)\log_{2}(4L) such that ϕ𝒲n,([0,1]d)sd\|{\phi}\|_{\mathcal{W}^{n,\infty}([0,1]^{d})}\leq s^{d} and

    \|f-\phi\|_{\mathcal{W}^{n,\infty}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}\leq 2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d},\quad\text{where}~K=\lfloor N^{1/d}\rfloor^{2}\lfloor L^{2/d}\rfloor~\text{and}~0\leq\delta\leq\tfrac{1}{3K}.

    Now, we choose a sufficiently small $\delta$ within this range such that

    Kdδ(N2(sn)/dL2(sn)/d)p.Kd\delta\leq\big{(}N^{-2(s-n)/d}L^{-2(s-n)/d}\big{)}^{p}.

    Then, we have

    fϕ𝒲n,p([0,1]d)p\displaystyle\|f-\phi\|^{p}_{\mathcal{W}^{n,p}([0,1]^{d})} =fϕ𝒲n,p(Ω([0,1]d,K,δ))p+fϕ𝒲n,p([0,1]d\Ω([0,1]d,K,δ))p\displaystyle=\|f-\phi\|^{p}_{\mathcal{W}^{n,p}(\Omega{([0,1]^{d},K,\delta)})}+\|f-\phi\|^{p}_{\mathcal{W}^{n,p}([0,1]^{d}\backslash\Omega{([0,1]^{d},K,\delta)})}
    Kdδ(sd)p+(2(s+1)d8snN2(sn)/dL2(sn)/d)p\displaystyle\leq Kd\delta(s^{d})^{p}+\big{(}2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}\big{)}^{p}
    \displaystyle\leq\big{(}s^{d}N^{-2(s-n)/d}L^{-2(s-n)/d}\big{)}^{p}+\big{(}2(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}\big{)}^{p}
    (3(s+1)d8snN2(sn)/dL2(sn)/d)p.\displaystyle\leq(3(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d})^{p}.

    where the last inequality uses $a^{p}+b^{p}\leq(a+b)^{p}$ for $a,b\geq 0$ and $p\geq 1$, together with $s^{d}+2(s+1)^{d}8^{s-n}\leq 3(s+1)^{d}8^{s-n}$. Hence, we have

    \|f-\phi\|_{\mathcal{W}^{n,p}([0,1]^{d})}\leq 3(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}.
    \qed
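As a simple illustration of how the final bound can be used (a back-of-the-envelope computation with hypothetical parameter values, not a statement from the theorem itself), the following snippet evaluates $3(s+1)^{d}8^{s-n}N^{-2(s-n)/d}L^{-2(s-n)/d}$ and reports the first dyadic choice $N=L$ that pushes it below a prescribed tolerance.

    def w_np_bound(N, L, s, d, n):
        # Error bound of Theorem 1.4: 3 (s+1)^d 8^(s-n) N^(-2(s-n)/d) L^(-2(s-n)/d)
        return 3 * (s + 1)**d * 8**(s - n) * (N * L) ** (-2 * (s - n) / d)

    s, d, n, tol = 4, 2, 1, 1e-3     # hypothetical sample parameters
    N = 1
    while w_np_bound(N, N, s, d, n) > tol:
        N *= 2
    print(N, w_np_bound(N, N, s, d, n))   # first dyadic N = L meeting the tolerance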

4 Conclusions

We have given a number of theoretical results on explicit error characterization for approximating smooth functions using deep ReLU networks and their smoother variants. Our results, measured in Sobolev norms, are well suited for studying neural network based solvers for high-dimensional PDEs. Further generalizing our analysis to other function spaces, including Hölder spaces, and to other neural networks, such as Floor-ReLU networks, will be an interesting direction for future research. Numerical investigation of our findings in the setting of solving parametric PDEs is also left as future work.

Acknowledgements

The authors would like to thank the editor and the two anonymous reviewers for their valuable comments and constructive suggestions, which have greatly improved this paper. S. Hon was partially supported by the Hong Kong RGC under Grant 22300921, a start-up grant from the Croucher Foundation, and a Tier 2 Start-up Grant from the Hong Kong Baptist University. H. Yang was partially supported by the US National Science Foundation under award DMS-1945029.

References

  • [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 242–252. PMLR, 2019.
  • [2] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 322–332. PMLR, 2019.
  • [3] Julius Berner, Philipp Grohs, and Arnulf Jentzen. Analysis of the generalization error: empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–657, 2020.
  • [4] Helmut Bölcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019.
  • [5] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • [6] Fan Chen, Jianguo Huang, Chunmei Wang, and Haizhao Yang. Friedrichs learning: weak solutions of partial differential equations via deep learning. arXiv e-prints, page arXiv:2012.08023, 2020.
  • [7] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear approximation and (deep) ReLU networks. Constructive Approximation, 2021.
  • [8] Ronald DeVore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica, 30:327–444, 2021.
  • [9] Ronald A. DeVore, Ralph Howard, and Charles Micchelli. Optimal nonlinear approximation. manuscripta mathematica, 63:469–478, 1989.
  • [10] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1675–1685. PMLR, 2019.
  • [11] Simon S. Du, Barnabás Poczós, Xiyu Zhai, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. 7th International Conference on Learning Representations, ICLR 2019, pages 1–19, 2019.
  • [12] Weinan E. A proposal on machine Learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
  • [13] Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
  • [14] Weinan E, Chao Ma, and Lei Wu. The Barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 2021.
  • [15] Dennis Elbrächter, Philipp Grohs, Arnulf Jentzen, and Christoph Schwab. DNN expression rate analysis of high-dimensional PDEs: application to option pricing. Constructive Approximation, 2021.
  • [16] Moritz Geist, Philipp Petersen, Mones Raslan, Reinhold Schneider, and Gitta Kutyniok. Numerical solution of the parametric diffusion equation by deep neural networks. Journal of Scientific Computing, 88(1):22, 2021.
  • [17] Rémi Gribonval, Gitta Kutyniok, Morten Nielsen, and Felix Voigtlaender. Approximation spaces of deep neural networks. Constructive Approximation, 2021.
  • [18] Yiqi Gu, Haizhao Yang, and Chao Zhou. SelectNet: self-paced learning for high-dimensional partial differential equations. Journal of Computational Physics, 441:110444, 2021.
  • [19] Ingo Gühring, Gitta Kutyniok, and Philipp Petersen. Error bounds for approximations with deep ReLU neural networks in Ws,pW^{s,p} norms. Analysis and Applications, 18(5):803–859, 2020.
  • [20] Ingo Gühring and Mones Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. Neural Networks, 134:107–130, 2021.
  • [21] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • [22] John Harlim, Shixiao W Jiang, Senwei Liang, and Haizhao Yang. Machine learning for prediction with missing dynamics. Journal of Computational Physics, 428:109922, 2021.
  • [23] Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. ReLU deep neural networks and linear finite elements. Journal of Computational Mathematics, 38(3):502–527, 2020.
  • [24] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 2018-Decem, pages 8571–8580, 2018.
  • [25] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric PDE problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
  • [26] Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical analysis of deep neural networks and parametric PDEs. Constructive Approximation, 2021.
  • [27] Senwei Liang, Shixiao W Jiang, John Harlim, and Haizhao Yang. Solving PDEs on unknown manifolds with machine learning. arXiv e-prints, page arXiv:2106.06682, 2021.
  • [28] Zichao Long, Yiping Lu, and Bin Dong. PDE-Net 2.0: learning PDEs from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, 399:108925, 2019.
  • [29] Jianfeng Lu and Yulong Lu. A priori generalization error analysis of two-Layer neural networks for solving high dimensional Schrödinger eigenvalue problems. arXiv e-prints, page arXiv:2105.01228, 2021.
  • [30] Jianfeng Lu, Yulong Lu, and Min Wang. A priori generalization analysis of the deep Ritz method for solving high dimensional elliptic equations. Proceedings of Thirty Fourth Conference on Learning Theory, PMLR, 134:3196-3241, 2021.
  • [31] Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5), 5465–5506, 2021.
  • [32] Tao Luo and Haizhao Yang. Two-layer neural networks for partial differential equations: optimization and generalization Theory, 2020.
  • [33] Chao Ma, Jianchun Wang, and Weinan E. Model reduction with memory and the machine learning of dynamical systems. Communications in Computational Physics, 25(4):947–962, 2019.
  • [34] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2388–2464. PMLR, 2019.
  • [35] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665—-E7671, 2018.
  • [36] Hadrien Montanelli and Qiang Du. New error bounds for deep ReLU networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1):78–92, 2019.
  • [37] Hadrien Montanelli and Haizhao Yang. Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem. Neural Networks, 129:1–6, 2020.
  • [38] Joost A A Opschoor, Philipp C Petersen, and Christoph Schwab. Deep ReLU networks and high-order finite element methods. Analysis and Applications, 18(05):715–770, 2020.
  • [39] Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
  • [40] Tong Qin, Kailiang Wu, and Dongbin Xiu. Data driven governing equations approximation using deep neural networks. Journal of Computational Physics, 395:620–635, 2019.
  • [41] M Raissi, P Perdikaris, and G E Karniadakis. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
  • [42] Christoph Schwab and Jakob Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(1), 2019.
  • [43] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, 2019.
  • [44] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5):1768–1811, 2020.
  • [45] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation: achieving arbitrary accuracy with fixed number of neurons. arXiv e-prints, page arXiv:2107.02397, 2021.
  • [46] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Neural network approximation: three hidden layers are enough. Neural Networks, 141:160–173, 2021.
  • [47] Jonathan W Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313–321, 2020.
  • [48] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019.
  • [49] E. Weinan, Chao Ma, and Lei Wu. A priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences, 17(5):1407–1425, 2019.
  • [50] E. Weinan and Qingcan Wang. Exponential convergence of the deep neural network approximation for analytic functions. Science China Mathematics, 61(10):1733–1740, 2018.
  • [51] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • [52] Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, and Weinan E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett., 120(14):143001, 2018.
  • [53] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4140–4149. PMLR, 2017.