
Simultaneous approximation of a smooth function
and its derivatives by deep neural networks
with piecewise-polynomial activations

Denis Belomestny [email protected]
Alexey Naumov [email protected]
Nikita Puchkin [email protected]
Sergey Samsonov [email protected]
Duisburg-Essen University, Germany; HSE University, Russia; Institute for Information Transmission Problems RAS, Russia
Abstract

This paper investigates the approximation properties of deep neural networks with piecewise-polynomial activation functions. We derive the required depth, width, and sparsity of a deep neural network to approximate any Hölder smooth function up to a given approximation error in Hölder norms in such a way that all weights of this neural network are bounded by $1$. The latter feature is essential to control generalization errors in many statistical and machine learning applications.

keywords:
deep neural networks, approximation complexity, ReQU activations, $\mathrm{ReLU}^{k}$ activations, Hölder class.
MSC:
[2020] 41A25 , 68T07

1 Introduction

Neural networks have recently gained much attention due to their impressive performance in many complicated practical tasks, including image processing [1], generative modelling [2], reinforcement learning [3], numerical solution of PDEs, e.g., [4, 5], and optimal control [6, 7]. This makes them extremely useful in the design of self-driving vehicles [8] and robot control systems, e.g., [9, 10, 11]. One of the reasons for such a success of neural networks is their expressiveness, that is, the ability to approximate functions with any desired accuracy. The question of expressiveness of neural networks has a long history and goes back to the papers [12, 13, 14]. In particular, in [13], the author showed that one hidden layer is enough to approximate any continuous function $f$ with any prescribed accuracy $\varepsilon>0$. However, further analysis revealed that deep neural networks may require much fewer parameters than shallow ones to approximate $f$ with the same accuracy. In recent years, much effort has been put into understanding the fidelity of deep neural networks. In a pioneering work [15], the author showed that for any target function $f$ from the Sobolev space $\mathcal{W}^{n,\infty}([0,1]^{d})$ there is a neural network with $O(\varepsilon^{-d/n})$ parameters and ReLU activation function that approximates $f$ within the accuracy $\varepsilon$ with respect to the $L_{\infty}$-norm on the unit cube $[0,1]^{d}$. Further works in this direction considered various smoothness classes of the target functions [16, 17, 18, 19], neural networks with diverse activations [17, 20, 21, 22], domains of more complicated shape [23], and measured the approximation errors with respect to different norms [15, 17, 20, 24]. Several authors also considered the expressiveness of neural networks with different architectures. This includes wide neural networks of logarithmic [15, 17, 24] or even constant depth [25, 16, 20, 19], as well as deep and narrow networks [26, 27, 28]. Most of the existing results on the expressiveness of neural networks measure the quality of approximation with respect to either the $L_{\infty}$- or the $L_{p}$-norm, $p\geqslant 1$. Much fewer papers study the approximation of derivatives of smooth functions. These rare exceptions include [29, 17, 20].

In the present paper, we focus on feed-forward neural networks with piecewise-polynomial activation functions of the form $\sigma^{\mathsf{ReQU}}(x)=(x\vee 0)^{2}$. Neural networks with such activations are known to successfully approximate smooth functions from the Sobolev and Besov spaces with respect to the $L_{\infty}$- and $L_{p}$-norms (see, for instance, [30, 25, 16, 31, 32, 33, 34, 35]). We continue this line of research and study the ability of such neural networks to approximate not only smooth functions themselves but also their derivatives. We derive non-asymptotic upper bounds on the Hölder norm of the difference between the target function and its approximation from a class of sparsely connected neural networks with ReQU activations. In particular, we show that, for any $f$ from a Hölder ball $\mathcal{H}^{\beta}([0,1]^{d},H)$, $H>0$, $\beta>2$ (see Section 2 for the definition), and any $\varepsilon>0$, there exists a neural network with ReQU activation functions that uniformly approximates the target function in the norms of the Hölder spaces $\mathcal{H}^{\ell}([0,1]^{d})$ for all $\ell\in\{0,\ldots,\lfloor\beta\rfloor\}$. Here and further in the paper, $\lfloor\beta\rfloor$ stands for the largest integer which is strictly smaller than $\beta$. A simplified statement of our main result is given below.

Theorem 1 (simplified version of Theorem 2).

Fix $\beta>2$ and $p,d\in\mathbb{N}$. Then, for any $H>0$, for any $f:[0,1]^{d}\rightarrow\mathbb{R}^{p}$, $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, and any integer $K\geqslant 2$, there exists a neural network $h_{f}:[0,1]^{d}\rightarrow\mathbb{R}^{p}$ with ReQU activation functions such that it has $O(\log d+\lfloor\beta\rfloor+\log\log H)$ layers, at most $O(p\vee d(K+\lfloor\beta\rfloor)^{d})$ neurons in each layer, and $O(p(d\beta+d^{2}+\log\log H)(K+\lfloor\beta\rfloor)^{d})$ non-zero weights taking their values in $[-1,1]$. Moreover, it holds that

\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{C^{\beta d}H\beta^{\ell}}{K^{\beta-\ell}}\quad\text{for all }\ell\in\{0,\ldots,\lfloor\beta\rfloor\},

where $C$ is a universal constant.

We provide explicit expressions for the hidden constants in Theorem 2. The main contributions of our work can be summarized as follows.

  • 1.

    Given a smooth target function $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, we construct a neural network that simultaneously approximates all the derivatives of $f$ up to order $\lfloor\beta\rfloor$ with optimal dependence of the precision on the number of non-zero weights. That is, if we denote the number of non-zero weights in the network by $N$, then it holds that $\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}=O(N^{-(\beta-\ell)/d})$ simultaneously for all $\ell\in\{0,\ldots,\lfloor\beta\rfloor\}$.

  • 2.

    The constructed neural network has almost the same smoothness as the target function itself while approximating it with the optimal accuracy. This property turns out to be very useful in many applications including the approximation of PDEs and density transformations where we need to use derivatives of the approximation.

  • 3.

    The weights of the approximating neural network are bounded in absolute value by $1$. The latter property plays a crucial role in deriving bounds on the generalization error of empirical risk minimizers in terms of the covering number of the corresponding parametric class of neural networks. Note that the upper bounds on the weights provided in [29, 17, 20] blow up as the approximation error decreases.

The rest of the paper is organized as follows. In Section 2, we introduce necessary definitions and notations. Section 3 contains the statement of our main result, Theorem 2, followed by a detailed comparison with the existing literature. We then present numerical experiments in Section 4. The proofs are collected in Section 5. Some auxiliary facts are deferred to Appendix A.

2 Preliminaries and notations

Norms

For a matrix $A$ and a vector $v$, we denote by $\|A\|_{\infty}$ and $\|v\|_{\infty}$ the maximal absolute value of the entries of $A$ and $v$, respectively. $\|A\|_{0}$ and $\|v\|_{0}$ shall stand for the number of non-zero entries of $A$ and $v$, respectively. Finally, the Frobenius norm and the operator norm of $A$ are denoted by $\|A\|_{F}$ and $\|A\|$, respectively, and the Euclidean norm of $v$ is denoted by $\|v\|$. For a function $f:\Omega\to\mathbb{R}^{d}$, we set

\|f\|_{L_{\infty}(\Omega)}=\operatornamewithlimits{esssup}_{x\in\Omega}\|f(x)\|,
\|f\|_{L_{p}(\Omega)}=\left\{\int_{\Omega}\|f(x)\|^{p}\,\mathrm{d}x\right\}^{1/p},\quad p\geqslant 1.

If the domain $\Omega$ is clear from the context, we simply write $L_{\infty}$ or $L_{p}$ instead of $L_{\infty}(\Omega)$ or $L_{p}(\Omega)$, respectively.

Smoothness classes

Let $\Omega\subseteq\mathbb{R}^{d}$ and $f:\Omega\to\mathbb{R}^{m}$. For a multi-index ${\bm{\gamma}}=(\gamma_{1},\dots,\gamma_{d})\in\mathbb{N}_{0}^{d}$, we write $|{\bm{\gamma}}|=\sum_{i=1}^{d}\gamma_{i}$ and define the corresponding partial differential operator $D^{{\bm{\gamma}}}$ as

D^{{\bm{\gamma}}}f_{i}=\frac{\partial^{|{\bm{\gamma}}|}f_{i}}{\partial x_{1}^{\gamma_{1}}\cdots\partial x_{d}^{\gamma_{d}}},\quad i\in\{1,\dots,m\}\,,\quad\text{and}\quad\|D^{{\bm{\gamma}}}f\|_{L_{\infty}(\Omega)}=\max\limits_{1\leqslant i\leqslant m}\|D^{{\bm{\gamma}}}f_{i}\|_{L_{\infty}(\Omega)}\,.

For $s\in\mathbb{N}$, the function space $C^{s}(\Omega)$ consists of those functions over the domain $\Omega$ which have bounded and continuous derivatives up to order $s$ in $\Omega$. Formally,

C^{s}(\Omega)=\big\{f:\Omega\to\mathbb{R}^{m}:\quad\|f\|_{C^{s}}:=\max\limits_{|{\bm{\gamma}}|\leqslant s}\|D^{{\bm{\gamma}}}f\|_{L_{\infty}(\Omega)}<\infty\big\}.

For a function $f:\Omega\to\mathbb{R}^{m}$ and any positive number $0<\delta\leqslant 1$, the Hölder constant of order $\delta$ is given by

[f]_{\delta}:=\max_{i\in\{1,\ldots,m\}}\sup_{x\not=y\in\Omega}\frac{|f_{i}(x)-f_{i}(y)|}{\min\{1,\|x-y\|\}^{\delta}}\;. (1)

Now, for any $\beta>0$, we can define the Hölder ball $\mathcal{H}^{\beta}(\Omega,H)$. If we let $s=\lfloor\beta\rfloor$ be the largest integer strictly less than $\beta$, $\mathcal{H}^{\beta}(\Omega,H)$ contains all functions in $C^{s}(\Omega)$ with $\delta$-Hölder-continuous, $\delta=\beta-s>0$, partial derivatives of order $s$. Formally,

\mathcal{H}^{\beta}(\Omega,H)=\big\{f\in C^{s}(\Omega):\quad\|f\|_{\mathcal{H}^{\beta}}:=\max\{\|f\|_{C^{s}},\ \max\limits_{|{\bm{\gamma}}|=s}[D^{{\bm{\gamma}}}f]_{\delta}\}\leqslant H\big\}.

We also write $f\in\mathcal{H}^{\beta}(\Omega)$ if $f\in\mathcal{H}^{\beta}(\Omega,H)$ for some $H<\infty$.
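As a quick illustration of definition (1) (a numerical sketch of ours, not used anywhere in the paper), the Hölder constant of order $\delta$ can be estimated from below by maximizing the ratio in (1) over a finite set of points; the test function $\sqrt{x}$ below is an arbitrary choice.

```python
import numpy as np

def holder_constant_estimate(f, delta, n_points=200, d=1, seed=0):
    """Lower estimate of the Holder constant (1) of order delta for a scalar
    function f on [0, 1]^d, obtained by maximizing over random point pairs."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_points, d))
    fx = np.array([f(p) for p in x])
    diff_f = np.abs(fx[:, None] - fx[None, :])            # |f(x) - f(y)|
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                         # exclude x == y
    return float(np.max(diff_f / np.minimum(1.0, dist) ** delta))

# sqrt is 1/2-Holder on [0, 1] but not Lipschitz: the delta = 1 estimate
# keeps growing as more points are sampled near the origin.
print(holder_constant_estimate(lambda p: np.sqrt(p[0]), delta=0.5))
print(holder_constant_estimate(lambda p: np.sqrt(p[0]), delta=1.0))
```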

Neural networks

Fix an activation function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$. For a vector $v=(v_{1},\dots,v_{p})\in\mathbb{R}^{p}$, we define the shifted activation function $\sigma_{v}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}$ as

\sigma_{v}(x)=\bigl(\sigma(x_{1}-v_{1}),\dots,\sigma(x_{p}-v_{p})\bigr),\quad x=(x_{1},\dots,x_{p})\in\mathbb{R}^{p}.

Given a positive integer $L$ and a vector $\mathcal{A}=(p_{0},p_{1},\dots,p_{L+1})\in\mathbb{N}^{L+2}$, a neural network of depth $L+1$ (with $L$ hidden layers) and architecture $\mathcal{A}$ is a function of the form

f:\mathbb{R}^{p_{0}}\rightarrow\mathbb{R}^{p_{L+1}}\,,\quad f(x)=W_{L}\circ\sigma_{v_{L}}\circ W_{L-1}\circ\sigma_{v_{L-1}}\circ\dots\circ W_{1}\circ\sigma_{v_{1}}\circ W_{0}\circ x, (2)

where $W_{i}\in\mathbb{R}^{p_{i+1}\times p_{i}}$ are weight matrices and $v_{i}\in\mathbb{R}^{p_{i}}$ are shift (bias) vectors. The maximum number of neurons in one layer, $\|\mathcal{A}\|_{\infty}$, is called the width of the neural network. Next, we introduce a subclass $\mathsf{NN}(L,\mathcal{A},s)$ of neural networks of depth $L+1$ with architecture $\mathcal{A}$ and at most $s$ non-zero weights. That is, $\mathsf{NN}(L,\mathcal{A},s)$ consists of functions of the form (2) such that

\begin{cases}\|W_{0}\|_{\infty}\vee\max\limits_{1\leqslant\ell\leqslant L}\left\{\|W_{\ell}\|_{\infty}\vee\|v_{\ell}\|_{\infty}\right\}\leqslant 1\,,\\ \|W_{0}\|_{0}+\sum\limits_{\ell=1}^{L}\left(\|W_{\ell}\|_{0}+\|v_{\ell}\|_{0}\right)\leqslant s\,.\end{cases}

We also use the notation $\mathsf{NN}(L,\mathcal{A})$, standing for $\mathsf{NN}(L,\mathcal{A},\infty)$. Throughout the paper, we use the ReQU (rectified quadratic unit) activation function, defined as

\sigma(x)=(x\vee 0)^{2}.
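For concreteness, the following sketch (ours, purely illustrative) evaluates a network of the form (2) with the ReQU activation and computes the sparsity counted in the definition of $\mathsf{NN}(L,\mathcal{A},s)$; the random weights and the toy architecture are placeholders.

```python
import numpy as np

def requ(x):
    """ReQU activation sigma(x) = (x v 0)^2, applied elementwise."""
    return np.maximum(x, 0.0) ** 2

def evaluate(x, weights, shifts):
    """Evaluate f(x) = W_L o sigma_{v_L} o ... o W_1 o sigma_{v_1} o W_0 x, cf. (2).

    weights = [W_0, ..., W_L], shifts = [v_1, ..., v_L]."""
    z = weights[0] @ x
    for W, v in zip(weights[1:], shifts):
        z = W @ requ(z - v)            # shifted activation sigma_v, then linear map
    return z

def sparsity(weights, shifts):
    """Number of non-zero weights, as in the definition of NN(L, A, s)."""
    return sum(np.count_nonzero(W) for W in weights) + \
           sum(np.count_nonzero(v) for v in shifts)

# toy network with architecture A = (2, 4, 1): one hidden layer of width 4
rng = np.random.default_rng(0)
W0, W1, v1 = rng.uniform(-1, 1, (4, 2)), rng.uniform(-1, 1, (1, 4)), rng.uniform(-1, 1, 4)
print(evaluate(np.array([0.3, 0.7]), [W0, W1], [v1]), sparsity([W0, W1], [v1]))
```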

Concatenation and parallelization of neural networks

During the construction of the approximating network in Theorem 2, we use the operations of concatenation (consecutive connection) and parallel connection of neural networks. Given neural networks $g$ and $h$ of architectures $(p_{0},p_{1},\dots,p_{L},p_{out})$ and $(p_{in},p_{L+1},p_{L+2},\dots,p_{L+M+1})$, respectively, such that $p_{in}=p_{out}$, the concatenation of $g$ and $h$ is their composition $h\circ g$, that is, a neural network of the architecture $(p_{0},p_{1},\dots,p_{L},p_{L+1},p_{L+2},\dots,p_{L+M+1})$. The parallel connection of neural networks is defined as follows. Let

f(x)=W_{L}\circ\sigma_{v_{L}}\circ W_{L-1}\circ\sigma_{v_{L-1}}\circ\dots\circ W_{1}\circ\sigma_{v_{1}}\circ W_{0}\circ x

and

\widetilde{f}(x)=\widetilde{W}_{L}\circ\sigma_{\widetilde{v}_{L}}\circ\widetilde{W}_{L-1}\circ\sigma_{\widetilde{v}_{L-1}}\circ\dots\circ\widetilde{W}_{1}\circ\sigma_{\widetilde{v}_{1}}\circ\widetilde{W}_{0}\circ x

be two neural networks of the same depth with architectures $(p_{0},p_{1},\dots,p_{L},p_{L+1})$ and $(p_{0},\widetilde{p}_{1},\widetilde{p}_{2},\dots,\widetilde{p}_{L},\widetilde{p}_{L+1})$, respectively. Define

\overline{W}_{0}=\begin{pmatrix}W_{0}\\ \widetilde{W}_{0}\end{pmatrix}\,,\quad\overline{W}_{\ell}=\begin{pmatrix}W_{\ell}&O\\ O&\widetilde{W}_{\ell}\end{pmatrix}\,,\quad\text{and}\quad\overline{v}_{\ell}=\begin{pmatrix}v_{\ell}\\ \widetilde{v}_{\ell}\end{pmatrix}\quad\text{for all }\ell\in\{1,\dots,L\}.

Then the parallel connection of $f$ and $\widetilde{f}$ is a neural network $\overline{f}$ of architecture $(p_{0},p_{1}+\widetilde{p}_{1},p_{2}+\widetilde{p}_{2},\dots,p_{L}+\widetilde{p}_{L},p_{L+1}+\widetilde{p}_{L+1})$, given by

\overline{f}(x)=\overline{W}_{L}\circ\sigma_{\overline{v}_{L}}\circ\overline{W}_{L-1}\circ\sigma_{\overline{v}_{L-1}}\circ\dots\circ\overline{W}_{1}\circ\sigma_{\overline{v}_{1}}\circ\overline{W}_{0}\circ x.

Note that the number of non-zero weights of $\overline{f}$ equals the sum of the numbers of non-zero weights of $f$ and $\widetilde{f}$.
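In code, the parallel connection is just this block bookkeeping; the sketch below (ours, NumPy-based) builds $\overline{W}_{0}$ by stacking the first-layer matrices and the remaining $\overline{W}_{\ell}$, $\overline{v}_{\ell}$ as block-diagonal matrices and concatenated shift vectors.

```python
import numpy as np

def block_diag(A, B):
    """Block-diagonal matrix diag(A, B) used for the hidden layers of the parallel network."""
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

def parallel(weights_f, shifts_f, weights_g, shifts_g):
    """Parallel connection of two networks of equal depth with common input width p_0."""
    W0 = np.vstack([weights_f[0], weights_g[0]])             # stacked first layers
    Ws = [W0] + [block_diag(A, B) for A, B in zip(weights_f[1:], weights_g[1:])]
    vs = [np.concatenate([u, w]) for u, w in zip(shifts_f, shifts_g)]
    return Ws, vs
```

Because the off-diagonal blocks are zero, the non-zero weight count of the combined network is exactly the sum of the two individual counts, as noted above.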

3 Main results

3.1 Approximation of functions from Hölder classes

Our main result states that any function from $\mathcal{H}^{\beta}([0,1]^{d},H)$, $H>0$, $\beta>2$, can be approximated by a feed-forward deep neural network with ReQU activation functions in $\mathcal{H}^{\ell}([0,1]^{d})$, $\ell\in\{0,\dots,\lfloor\beta\rfloor\}$.

Theorem 2.

Let $\beta>2$ and let $p,d\in\mathbb{N}$. Then, for any $H>0$, for any $f:[0,1]^{d}\rightarrow\mathbb{R}^{p}$, $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, and any integer $K\geqslant 2$, there exists a neural network $h_{f}:[0,1]^{d}\rightarrow\mathbb{R}^{p}$ of the width

\bigl(4d(K+\lfloor\beta\rfloor)^{d}\bigr)\vee 12\left((K+2\lfloor\beta\rfloor)+1\right)\vee p

with

6+2(\lfloor\beta\rfloor-2)+\lceil\log_{2}d\rceil+2\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)

hidden layers and at most $p(K+\lfloor\beta\rfloor)^{d}C(\beta,d,H)$ non-zero weights taking their values in $[-1,1]$, such that, for any $\ell\in\{0,\dots,\lfloor\beta\rfloor\}$,

\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}+\frac{9^{d(\lfloor\beta\rfloor-1)}(2\lfloor\beta\rfloor+1)^{2d+\ell}(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}. (3)

The above constant $C(\beta,d,H)$ is given by

C(\beta,d,H)=\left(60\bigl(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\bigr)+38\right)+20d^{2}+144d\lfloor\beta\rfloor+8d.

An illustrative comparison of the result of Theorem 2 with the literature is provided in Table 1 below. Theorem 2 improves over the results of [25, Theorem 3.3] and [16, Theorem 7] as far as the approximation properties of ReQU neural networks in terms of the $L_{\infty}$-norm are concerned. Namely, Theorem 2 claims that, for any $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, $\beta>2$, and any $\varepsilon>0$, there is a ReQU neural network of width $O(\varepsilon^{-d/\beta})$ and depth $O(1)$ with $O(\varepsilon^{-d/\beta})$ non-zero weights, taking their values in $[-1,1]$, which approximates $f$ within the accuracy $\varepsilon$ with respect to the $L_{\infty}$-norm on $[0,1]^{d}$. In [25, 16], the authors considered a target function from a general weighted Sobolev class but measured the quality of approximation in terms of a weighted $L_{2}$-norm. Nevertheless, the width and the number of non-zero weights of the constructed neural networks in Theorem 2, [25, Theorem 3.3], and [16, Theorem 7] coincide. The difference is that, while in [25, 16] the depth of the neural networks is of order $O(\log(1/\varepsilon))$, we need only $O(1)$ layers. More importantly, the authors of [25, 16] do not provide any guarantees on the absolute values of the weights of the approximating neural networks. At the same time, all the weights of our neural network take their values in $[-1,1]$. We will explain the importance of the latter property a bit later in this section. It is also worth mentioning that the same number of non-zero weights $O(\varepsilon^{-d/\beta})$ is needed to approximate $f$ within the tolerance $\varepsilon$ (with respect to the $L_{\infty}$-norm) via neural networks with other activation functions, such as ReLU [36], sigmoid [22, 17], and $\arctan$ [17].

The approximation properties of deep neural networks with respect to norms involving derivatives are much less studied. In the rest of this section, we elaborate on the comparison with the state-of-the-art results in this direction. The fact that shallow neural networks can simultaneously approximate a smooth function and its derivatives has been known for a long time from the paper [37], where the authors derived an upper bound on the approximation accuracy in terms of the modulus of smoothness. To be more precise, they showed that if a function $f$ is continuous on a compact set $\mathcal{K}\subset\mathbb{R}^{d}$ and its derivatives up to an order $s\in\mathbb{Z}_{+}$ are in $L_{p}(\mathcal{K})$, then there is a neural network $g_{f}$ with

(K+1)\sum\limits_{j=0}^{K}\binom{j+d-1}{d-1}=O(K^{d+1})

hidden units, such that

\left\|D^{\bm{\gamma}}g_{f}-D^{\bm{\gamma}}f\right\|_{L_{p}(\mathcal{K})}\lesssim\omega\left(D^{\bm{\gamma}}f,\frac{1}{\sqrt{K}}\right)_{p}+\frac{\|D^{\bm{\gamma}}f\|_{L_{p}(\mathcal{K})}}{K}\quad\text{for all }{\bm{\gamma}}\in\mathbb{Z}_{+}^{d}\text{ such that }|{\bm{\gamma}}|\leqslant s.

Here

\omega(\phi,\delta)_{p}=\sup\limits_{0<t\leqslant\delta}\|\phi(\cdot+t)-\phi(\cdot)\|_{L_{p}(\mathcal{K})},\quad\delta>0,

is the modulus of smoothness of a function $\phi$. Moreover, if, in addition, the derivatives of $f$ of order $s$ are $\alpha$-Hölder, $\alpha\in(0,1]$, then it holds that

\|f-g_{f}\|_{\mathcal{H}^{s}(\mathcal{K})}\lesssim K^{-\alpha/2}.

Taking into account that $\omega(\phi,\delta)_{p}\lesssim\delta$ for a Lipschitz function $\phi$, we conclude that one has to use a shallow neural network with at least $\Omega(\varepsilon^{-2(d+1)})$ hidden units to approximate a function of interest or its derivatives within the accuracy $\varepsilon$ with respect to the $L_{p}(\mathcal{K})$-norm, even if $f$ is sufficiently smooth. This bound becomes prohibitive in the case of large dimension. The situation is much better in the case of deep neural networks. To our knowledge, Gühring and Raslan [17, Proposition 4.8] were the first to prove that, for any $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, $\varepsilon>0$, and $\ell\in\{0,\dots,\lfloor\beta\rfloor\}$, there is a ReQU neural network with $O(\varepsilon^{-d/(\beta-\ell)})$ non-zero weights which approximates $f$ within the accuracy $\varepsilon$ with respect to the $\mathcal{H}^{\ell}$-norm on $[0,1]^{d}$. A drawback of the result in [17] is that the architecture of the suggested neural network heavily depends on $\ell$. Hence, it is hard to control higher-order derivatives of the approximating neural network itself, which is of interest in such applications as numerical solution of PDEs and density transformations. This question was addressed in [20, 38], where the authors considered the problem of simultaneous approximation of a target function with respect to Hölder norms of different orders. In [20, Theorem 5.1], the authors showed that, for any $f\in\mathcal{H}^{s}([0,1]^{d},H)$, $s\in\mathbb{N}$, and any sufficiently large integer $K$, there is a three-layer neural network $g_{f}$ of width $O(K^{d})$ with $\tanh$ activations such that

\|f-g_{f}\|_{\mathcal{H}^{\ell}([0,1]^{d})}=O\left(\frac{(\log K)^{\ell}}{K^{s-\ell}}\right)

simultaneously for all $\ell\in\{0,1,\dots,s-1\}$. Note that Theorem 2 yields a sharper bound, removing the extra logarithmic factors. Regarding the results of the recent work [38], we found some critical flaws in the proofs of the main results. We explain our concerns in Remark 1 below, as the main results of [38] are close to ours. Theorem 2 also improves over [17, 20, 38] in another aspect. Namely, Theorem 2 guarantees that the weights of the neural network take their values in $[-1,1]$, while in [17, 20, 38] they may grow polynomially as the approximation error decreases. This boundedness property is extremely important when studying the generalization ability of neural networks in various statistical problems, since the metric entropy of the corresponding parametric class of neural networks involves a uniform upper bound on the weights (see, for instance, [24, Theorem 2 and Lemma 5]). Besides that, a polynomial upper bound on the absolute values of the weights becomes prohibitive when approximating analytic functions (see Remark 2 in the next section).

Remark 1.

The proofs of Theorem 3.1 and Theorem 3.8 in [38] have critical flaws. In particular, on page 18, the authors mistakenly bound the Sobolev norm $\mathcal{W}^{1,\infty}$ of the composite function $\phi(x)=\widetilde{\phi}(\phi_{\alpha}(\psi(x))/\alpha!,P_{\alpha}(h))$ by the $\mathcal{W}^{1,\infty}$-norm of $\widetilde{\phi}$. A similar error appears on page 24. In this way, the authors obtain that the $\mathcal{W}^{1,\infty}$-norm of $\phi(x)$ is bounded by $432s^{d}$, where $s$ is the smoothness of the target function. In fact, the latter norm should scale as $1/\delta$, where $\delta>0$ is a small auxiliary parameter describing the width of the boundary strips. This flaw completely ruins the proofs of the main results in [38], Theorem 1.1 (see the 9th line of the proof on p. 8) and Theorem 1.4 (the 9th line of the proof on p. 25).

Table 1: Comparison of the state-of-the-art results on approximation of a function $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, $\beta>2$, $H>0$, within the accuracy $\varepsilon$ via neural networks. The papers marked with * consider the case of integer $\beta$ only.

| Paper | Norm | Depth | Non-zero weights | Simultaneous approximation | Weights in $[-1,1]$ |
| [16]* | $L_{2}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/\beta})$ | N/A | |
| [17]* | $\mathcal{H}^{\ell}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/(\beta-\ell)})$ | no | no |
| [20] | $\mathcal{H}^{\ell}$ | $3$ | $O(\varepsilon^{-d/(\beta-\ell)}(\log(1/\varepsilon))^{\ell})$ | yes | no |
| [22] | $L_{\infty}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/\beta})$ | N/A | |
| [24] | $L_{\infty}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/\beta})$ | N/A | |
| [25]* | $L_{2}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/\beta})$ | N/A | |
| [37]* | $\mathcal{H}^{\ell}$ | $2$ | $O(\varepsilon^{-2(d+1)/(1\wedge(\beta-\ell))})$ | yes | |
| Ours | $\mathcal{H}^{\ell}$ | $O(1)$ | $O(\varepsilon^{-d/(\beta-\ell)})$ | yes | yes |

3.2 Approximation of analytic functions

The bound of Theorem 2 can be transformed into exponential (in $K$) rates of approximation for analytic functions. In the rest of this section, we consider $(Q,R)$-analytic functions, defined below.

Definition 1.

A function $f\in C^{\infty}(\mathbb{R}^{d})$ is called $(Q,R)$-analytic with $Q,R>0$ if it satisfies the inequality

\left\|f\right\|_{\mathcal{H}^{s}([0,1]^{d})}\leqslant QR^{-s}s!\quad\text{for all }s\in\mathbb{N}_{0}. (4)

Similar concepts were also considered in [39, 40, 20]. Applying Theorem 2 to $(Q,R)$-analytic functions, we get the following corollary.

Corollary 1.

Let $f$ be a $(Q,R)$-analytic function and let $\ell\in\mathbb{N}_{0}$. Then, for any integer $K>2ed\,9^{d}/R$, there exists a neural network $h_{f}:[0,1]^{d}\to\mathbb{R}^{p}$ of width $O((K+\ell)^{d}\vee p)$ with $O(K+\ell)$ hidden layers and at most $O(p(K+\ell)^{d}\log K)$ non-zero weights, taking values in $[-1,1]$, such that

\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{Qe(\sqrt{2}\,ed\,9^{d})^{\ell}(2s-1)^{2d+2\ell}}{R^{\ell}}\exp\left\{-\left\lfloor\frac{KR}{2ed\,9^{d}}\right\rfloor\right\}. (5)

Here $s$ is the integer specified in (29) in the proof below.
Remark 2.

In [20, Corollary 5.6], the authors claim that $(Q,R)$-analytic functions can be approximated with exponential (in the number of non-zero weights $N$) rates with respect to the Sobolev norm $\mathcal{W}^{k,\infty}([0,1]^{d})$, where $k$ is a fixed integer (see eq. (112) on p. 743). However, a closer look at Corollary 5.6 reveals that the right-hand side of eq. (112) includes a constant $c_{d,k,\alpha,f}$ with $\alpha$ depending on $N$ too, as follows from eq. (54). Thus, the dependence of the final bound on $N$ in Corollary 5.6 remains unclear. Besides that, in contrast to [20, Theorem 5.2], the authors do not specify the upper bound on the absolute values of the weights of the constructed neural network in Corollary 5.6. A thorough inspection of the proof shows that the weights can be as large as $O(N^{N^{2}})$.

4 Numerical experiments

In this section, we provide numerical experiments to illustrate the approximation properties of neural networks with ReQU activations. We considered a scalar function $f(x)=\sin(x_{1}^{2}x_{2})$ of two variables and approximated it on the unit square $[0,1]^{2}$ via neural networks with two types of activations: ReLU and ReQU. All the neural networks were fully connected, and all their hidden layers had width $16$. The first layer had width $2$. The depth of the neural networks took its values in $\{1,2,3,4,5\}$. In the training phase, we sampled $N=10000$ points $X_{1},\dots,X_{N}$ independently from the uniform distribution on $[0,1]^{2}$ and tuned the weights of the neural networks by minimizing the mean empirical squared error

\mathrm{ERR}(h)=\frac{1}{N}\sum\limits_{i=1}^{N}(h(X_{i})-f(X_{i}))^{2}.

After that, we computed the approximation errors of $f$ and of its gradient on the two-dimensional grid $G=\{0,1/M,\dots,(M-1)/M,1\}^{2}$ with $M=500$:

\frac{1}{M^{2}}\sum\limits_{x\in G}(\widehat{h}(x)-f(x))^{2}\quad\text{and}\quad\frac{1}{M^{2}}\sum\limits_{x\in G}\left\|\nabla\widehat{h}(x)-\nabla f(x)\right\|^{2},

where $\widehat{h}$ denotes the neural network with the weights tuned during the training phase. We repeated the experiment $10$ times and computed the average approximation errors. The results are displayed in Figure 1. The quality of approximation by neural networks with ReQU activations turns out to be better than by the ones with ReLU activations. Note that in such a scenario the stochastic error becomes small, and the quality of learning is mostly determined by the approximation error. This claim is supported by the fact that the error values were similar across the repeated experiments.

Figure 1: The mean squared error of approximation of the target function $f=\sin(x_{1}^{2}x_{2})$ (left) and its gradient (right) by neural networks with different activations.
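For completeness, a minimal sketch of this experiment is given below. The paper does not state the software framework, optimizer, learning rate, or number of training steps, so those choices (PyTorch, Adam, learning rate $10^{-3}$, full-batch training) are ours and purely illustrative.

```python
import torch
import torch.nn as nn

class ReQU(nn.Module):
    def forward(self, x):
        return torch.clamp(x, min=0.0) ** 2                 # sigma(x) = (x v 0)^2

def make_net(depth, width=16, activation=ReQU):
    layers, d_in = [], 2                                     # input dimension 2
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), activation()]
        d_in = width
    return nn.Sequential(*layers, nn.Linear(d_in, 1))

def target(x):                                               # f(x) = sin(x_1^2 x_2)
    return torch.sin(x[:, :1] ** 2 * x[:, 1:2])

torch.manual_seed(0)
X = torch.rand(10_000, 2)                                    # uniform sample on [0, 1]^2
net = make_net(depth=3)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)            # optimizer and lr: our choice
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(X) - target(X)) ** 2).mean()                # ERR(h)
    loss.backward()
    opt.step()

# approximation errors of f and of its gradient on the grid G with M = 500
M = 500
g = torch.linspace(0.0, 1.0, M + 1)
grid = torch.cartesian_prod(g, g).requires_grad_(True)
out = net(grid)
grad_net = torch.autograd.grad(out.sum(), grid)[0]           # each row depends only on its own input
grad_f = torch.stack([2 * grid[:, 0] * grid[:, 1] * torch.cos(grid[:, 0] ** 2 * grid[:, 1]),
                      grid[:, 0] ** 2 * torch.cos(grid[:, 0] ** 2 * grid[:, 1])], dim=1)
err_f = ((out - target(grid)) ** 2).mean().item()
err_grad = ((grad_net - grad_f) ** 2).sum(dim=1).mean().item()
print(err_f, err_grad)
```

Passing `activation=nn.ReLU` to `make_net` reproduces the ReLU baseline used for comparison.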

5 Proofs

5.1 Proof of Theorem 2

Step 1. Let $f=(f_{1},\dots,f_{p})$. Consider a vector $a=(a_{1},\dots,a_{2\lfloor\beta\rfloor+K+1})$ such that

a_{1}=\ldots=a_{\lfloor\beta\rfloor+1}=0\,;
a_{\lfloor\beta\rfloor+1+j}=j/K,\quad 1\leqslant j\leqslant K-1\,; (6)
a_{\lfloor\beta\rfloor+K+1}=\ldots=a_{2\lfloor\beta\rfloor+K+1}=1\,.

By 3, there exist tensor-product splines $S_{f}^{\lfloor\beta\rfloor,K}=(S_{f,1}^{\lfloor\beta\rfloor,K},\dots,S_{f,p}^{\lfloor\beta\rfloor,K})$ of order $\lfloor\beta\rfloor\geqslant 2$, associated with knots at $\big\{(a_{j_{1}},a_{j_{2}},\dots,a_{j_{d}}):j_{1},\dots,j_{d}\in\{1,\dots,2\lfloor\beta\rfloor+K+1\}\big\}$, such that

\left\|f-S_{f}^{\lfloor\beta\rfloor,K}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\max_{m\in\{1,\dots,p\}}\left\|f_{m}-S_{f,m}^{\lfloor\beta\rfloor,K}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}
\leqslant\frac{(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}+\frac{9^{d(\lfloor\beta\rfloor-1)}(2\lfloor\beta\rfloor+1)^{2d+\ell}(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}.

Our goal is to show that $S_{f}^{\lfloor\beta\rfloor,K}$ can be represented by a neural network $h_{f}$ with ReQU activation functions. For this purpose, for each $m\in\{1,\dots,p\}$, we use an expansion of $S_{f,m}^{\lfloor\beta\rfloor,K}$ with respect to the basis

\big\{N_{j_{1}}^{\lfloor\beta\rfloor,K}(x_{1})\cdot N_{j_{2}}^{\lfloor\beta\rfloor,K}(x_{2})\cdot\ldots\cdot N_{j_{d}}^{\lfloor\beta\rfloor,K}(x_{d}):j_{1},\dots,j_{d}\in\{1,\dots,\lfloor\beta\rfloor+K\}\big\},

where $N^{\lfloor\beta\rfloor,K}_{1},\dots,N^{\lfloor\beta\rfloor,K}_{\lfloor\beta\rfloor+K}$ are normalized $B$-splines defined in (31). There exist coefficients $\{w^{(f)}_{m,j_{1},j_{2},\dots,j_{d}}:1\leqslant m\leqslant p,\ 1\leqslant j_{1},\dots,j_{d}\leqslant\lfloor\beta\rfloor+K\}$ such that, for any $m\in\{1,\dots,p\}$,

\begin{split}S_{f,m}^{\lfloor\beta\rfloor,K}&=\sum\limits_{j_{1},\dots,j_{d}=1}^{\lfloor\beta\rfloor+K}w^{(f)}_{m,j_{1},\dots,j_{d}}\prod\limits_{\ell=1}^{d}N_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})\\ &=\sum\limits_{j_{1},\dots,j_{d}=1}^{\lfloor\beta\rfloor+K}w^{(f)}_{m,j_{1},\dots,j_{d}}\prod\limits_{\ell=1}^{d}(a_{j_{\ell}+\lfloor\beta\rfloor+1}-a_{j_{\ell}})B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})\,,\end{split} (7)

where $B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})$ are (unnormalized) $B$-splines defined in (30). Hence, in order to represent $S_{f}^{\lfloor\beta\rfloor,K}=(S_{f,1}^{\lfloor\beta\rfloor,K},\dots,S_{f,p}^{\lfloor\beta\rfloor,K})$ by a neural network with ReQU activations, we first perform this task for the products of basis functions $\prod_{\ell=1}^{d}B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})$, $1\leqslant j_{1},\dots,j_{d}\leqslant\lfloor\beta\rfloor+K$.
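Before turning to the network construction, the tensor-product structure of (7) can be illustrated numerically (a sketch of ours, independent of the ReQU construction; the coefficient array and the univariate basis values below are placeholders, not the actual $B$-splines):

```python
import numpy as np

def tensor_product_expansion(w, basis_values):
    """Evaluate sum_{j_1,...,j_d} w[j_1,...,j_d] * prod_l N_{j_l}(x_l), cf. (7).

    w            : coefficient array with one axis per coordinate
    basis_values : list of d vectors, basis_values[l][j] = value of the j-th
                   univariate basis function at x_l (placeholders here)."""
    d = w.ndim
    letters = "abcdefghijklmnopqrstuvwxyz"[:d]
    spec = letters + "," + ",".join(letters) + "->"   # e.g. 'ab,a,b->' for d = 2
    return np.einsum(spec, w, *basis_values)

# toy check for d = 2 with 4 basis functions per coordinate
rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, (4, 4))
N1, N2 = rng.uniform(0, 1, 4), rng.uniform(0, 1, 4)
direct = sum(w[i, j] * N1[i] * N2[j] for i in range(4) for j in range(4))
assert np.isclose(tensor_product_expansion(w, [N1, N2]), direct)
```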

Step 2. Applying Lemma 3 with $q=\lfloor\beta\rfloor$ component-wise for each $x_{i}$, $i\in\{1,\ldots,d\}$, we obtain that the mapping

x=(x_{1},\dots,x_{d})\mapsto\Big(x_{1},K,B_{1}^{\lfloor\beta\rfloor,K}(x_{1}),\dots,B_{\lfloor\beta\rfloor+K}^{\lfloor\beta\rfloor,K}(x_{1}),
x_{2},K,B_{1}^{\lfloor\beta\rfloor,K}(x_{2}),\ldots,B_{\lfloor\beta\rfloor+K}^{\lfloor\beta\rfloor,K}(x_{2}),\dots, (8)
x_{d},K,B_{1}^{\lfloor\beta\rfloor,K}(x_{d}),\ldots,B_{\lfloor\beta\rfloor+K}^{\lfloor\beta\rfloor,K}(x_{d})\Big)

can be represented by a neural network from the class $\mathsf{NN}\big(4+2(\lfloor\beta\rfloor-2),(d,d\mathcal{A}_{1})\big)$, where

\mathcal{A}_{1}=\big(\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{\lfloor\beta\rfloor},K+\lfloor\beta\rfloor+2\big), (9)

and $\mathcal{B}_{2},\dots,\mathcal{B}_{\lfloor\beta\rfloor}$ are defined in (15) and (21). Here $d\mathcal{A}_{1}$ should be understood as element-wise multiplication of the entries of $\mathcal{A}_{1}$ by $d$. Since each coordinate transformation in (8) can be computed independently, the number of parameters in this network does not exceed

72d\lfloor\beta\rfloor(K+2\lfloor\beta\rfloor).

Using Lemma 1 with $k=d$, for any fixed $j_{1},\dots,j_{d}\in\{1,\dots,\lfloor\beta\rfloor+K\}$, we calculate all products $\prod_{\ell=1}^{d}B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})$ by a neural network from

\mathsf{NN}\left(\lceil\log_{2}d\rceil,\big((K+\lfloor\beta\rfloor+2)d,(K+\lfloor\beta\rfloor)^{d}\mathcal{A}_{2},(K+\lfloor\beta\rfloor)^{d}\big)\right)

with

\mathcal{A}_{2}=\bigl(\underbrace{2^{\lceil\log_{2}d\rceil+1},2^{\lceil\log_{2}d\rceil},\ldots,4}_{\lceil\log_{2}d\rceil}\bigr)\,. (10)

Note that this network contains at most

5(K+\lfloor\beta\rfloor)^{d}2^{2\lceil\log_{2}d\rceil}\leqslant 20d^{2}(K+\lfloor\beta\rfloor)^{d}

non-zero parameters.

Step 3. Let us recall that, due to (7), we have

S_{f,m}^{\lfloor\beta\rfloor,K}=\sum\limits_{j_{1},\dots,j_{d}=1}^{\lfloor\beta\rfloor+K}w^{(f)}_{m,j_{1},\dots,j_{d}}\prod\limits_{\ell=1}^{d}(a_{j_{\ell}+\lfloor\beta\rfloor+1}-a_{j_{\ell}})\prod\limits_{\ell=1}^{d}B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell}) (11)

for all $m\in\{1,\dots,p\}$. At Step 2, we constructed a network which calculates the products $\prod_{\ell=1}^{d}B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})$ for all $(j_{1},\dots,j_{d})\in\{1,\dots,\lfloor\beta\rfloor+K\}^{d}$. Now we implement the multiplications by

\tilde{w}^{(f)}_{m,j_{1},\dots,j_{d}}:=w^{(f)}_{m,j_{1},\dots,j_{d}}\prod\limits_{\ell=1}^{d}(a_{j_{\ell}+\lfloor\beta\rfloor+1}-a_{j_{\ell}}).

2 implies that

\left|w^{(f)}_{m,j_{1},\dots,j_{d}}\right|\leqslant(2\lfloor\beta\rfloor+1)^{d}9^{d(\lfloor\beta\rfloor-1)}\|f_{m}\|_{L_{\infty}([0,1]^{d})}\leqslant(2\lfloor\beta\rfloor+1)^{d}9^{d(\lfloor\beta\rfloor-1)}H.

Since $\beta>2$ by the conditions of the theorem, we can write

\left|w^{(f)}_{m,j_{1},\dots,j_{d}}\right|\leqslant(2\lfloor\beta\rfloor+1)^{d\lfloor\beta\rfloor}H.

Equation (6) yields that $0\leqslant a_{j+\lfloor\beta\rfloor+1}-a_{j}\leqslant 1$ for all $j\in\{1,\ldots,\lfloor\beta\rfloor+K\}$. In view of Lemma 4, we can implement the multiplication by $\tilde{w}^{(f)}_{m,j_{1},\dots,j_{d}}$ by a network from the class

\mathsf{NN}\left(2\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)+2,\big(1,5,5,\dots,5,4,1\big)\right).

Here we used the fact that, since $\lfloor\beta\rfloor\geqslant 2$,

\log_{4}\left(\log_{4}\big((2\lfloor\beta\rfloor+1)^{d\lfloor\beta\rfloor}H\big)\right)\leqslant\log_{4}\left(2\log_{4}\big((2\lfloor\beta\rfloor+1)^{d\lfloor\beta\rfloor}\big)\vee 2\log_{4}H\right)
=\log_{4}\left(\log_{2}\big((2\lfloor\beta\rfloor+1)^{d\lfloor\beta\rfloor}\big)\vee\log_{2}H\right)
=\log_{4}\left(d\lfloor\beta\rfloor\log_{2}(2\lfloor\beta\rfloor+1)\right)\vee\log_{4}\left(\log_{2}H\right)
\leqslant\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H.

Hence, the representation (11) implies that the function $S_{f}^{\lfloor\beta\rfloor,K}$ can be represented by a neural network with

4+2(\lfloor\beta\rfloor-2)+\lceil\log_{2}d\rceil+2\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)+2

hidden layers and the architecture

\left(d,d\mathcal{A}_{1},(K+\lfloor\beta\rfloor)^{d}\mathcal{A}_{2},\underbrace{5(K+\lfloor\beta\rfloor)^{d},5(K+\lfloor\beta\rfloor)^{d},\dots,5(K+\lfloor\beta\rfloor)^{d}}_{2\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)+1\text{ times}},4(K+\lfloor\beta\rfloor)^{d},p\right),

where $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ are given by (9) and (10), respectively. As before, the weights of the constructed neural network are bounded by $1$ in absolute value. Note that this network contains at most

p(K+\lfloor\beta\rfloor)^{d}\left(60\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)+38\right)
+20d^{2}(K+\lfloor\beta\rfloor)^{d}+72d\lfloor\beta\rfloor(K+2\lfloor\beta\rfloor)+8d(K+\lfloor\beta\rfloor)^{d}

non-zero weights. The last summand appears because each of the

(K+\lfloor\beta\rfloor)^{d}2^{\lceil\log_{2}d\rceil+1}\leqslant 4d(K+\lfloor\beta\rfloor)^{d}

neurons in the first layer of $(K+\lfloor\beta\rfloor)^{d}\mathcal{A}_{2}$ receives information from two neurons of the previous layer. This completes the proof.

5.2 Auxiliary lemmas for Theorem 2

Lemma 1.

Let $k\in\mathbb{N}$, $k\geqslant 2$. Then, for any $x=(x_{1},\dots,x_{k})\in\mathbb{R}^{k}$, there exists a neural network from the class

\mathsf{NN}\left(\lceil\log_{2}k\rceil,\left(k,2^{\lceil\log_{2}k\rceil+1},2^{\lceil\log_{2}k\rceil},\dots,4,1\right)\right)\,, (12)

which implements the map $x\mapsto x_{1}x_{2}\cdots x_{k}$. Moreover, this network contains at most $5\cdot 2^{2\lceil\log_{2}k\rceil}$ non-zero weights.

Proof of Lemma 1.

Let $x=(x_{1},x_{2})\in\mathbb{R}^{2}$. We first prove that the map $(x_{1},x_{2})\mapsto x_{1}x_{2}$ belongs to the class $\mathsf{NN}(1,(2,4,1))$. Let us define $W_{1}=\frac{1}{4}(1,-1,-1,1)^{\top}$ and

W_{0}=\begin{pmatrix}1&1\\ 1&-1\\ -1&1\\ -1&-1\end{pmatrix}\,.

Then it is easy to see that

\begin{split}W_{1}\circ\sigma\circ W_{0}x&\equiv\frac{1}{4}\left(\sigma(x_{1}+x_{2})+\sigma(-x_{1}-x_{2})-\sigma(x_{1}-x_{2})-\sigma(x_{2}-x_{1})\right)\\ &=\frac{1}{4}\left((x_{1}+x_{2})^{2}-(x_{1}-x_{2})^{2}\right)=x_{1}x_{2}.\end{split} (13)

Now, for any $k\geqslant 2$, we use the following representation:

x_{1}x_{2}\dots x_{k}=x_{1}x_{2}\dots x_{k}\cdot\underbrace{1\cdot\ldots\cdot 1}_{2^{\lceil\log_{2}k\rceil}-k}\,. (14)

Put $v=\lceil\log_{2}k\rceil$. Then it remains to note that, based on the equality (13), for any $v\in\mathbb{N}$, we can implement a sequence of mappings

\left(a_{1},a_{2},\ldots,a_{2^{v}}\right)\mapsto\left(a_{1}a_{2},a_{3}a_{4},\ldots,a_{2^{v}-1}a_{2^{v}}\right)\mapsto\ldots\mapsto\prod_{i=1}^{2^{v}}a_{i}

by a neural network from the class

\mathsf{NN}\left(v,\left(2^{v},2^{v+1},2^{v},2^{v-1},2^{v-2},\dots,4,1\right)\right)\,.

The number of parameters in such a network does not exceed $2^{2v+1}+2^{2v+1}+2^{2v}+\ldots+4\leqslant 5\cdot 2^{2v}$. We complete the proof by combining this bound with (14).
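The two building blocks of this proof are easy to check numerically. The sketch below (ours, for illustration only) evaluates the one-hidden-layer ReQU product gadget (13) with the weights $W_{0}$, $W_{1}$ from above and chains it over $\lceil\log_{2}k\rceil$ rounds of pairwise products, padding with ones as in (14).

```python
import numpy as np

def requ(x):
    return np.maximum(x, 0.0) ** 2

def product_gadget(x1, x2):
    """One ReQU hidden layer computing x1 * x2, cf. (13)."""
    W0 = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    W1 = 0.25 * np.array([1.0, -1.0, -1.0, 1.0])
    return W1 @ requ(W0 @ np.array([x1, x2], dtype=float))

def product_tree(xs):
    """Multiply k numbers with ceil(log2 k) rounds of pairwise products, padding with ones (14)."""
    k = len(xs)
    size = 1 << (k - 1).bit_length()              # 2^ceil(log2 k)
    vals = list(xs) + [1.0] * (size - k)
    while len(vals) > 1:
        vals = [product_gadget(vals[i], vals[i + 1]) for i in range(0, len(vals), 2)]
    return vals[0]

xs = [0.3, -1.2, 0.5, 2.0, 0.7]
assert np.isclose(product_gadget(3.0, -4.0), -12.0)
assert np.isclose(product_tree(xs), np.prod(xs))
```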

Lemma 2.

Let $q,K$ be integers not smaller than $2$, and let $x\in[0,1]$. Then the mapping

x\mapsto\left(x,K,B_{1}^{2,K}(x),B_{2}^{2,K}(x),\ldots,B_{K+2q-2}^{2,K}(x)\right),

can be represented by a network from the class $\mathsf{NN}\Big(4,\big(1,\mathcal{B}_{2},K+2q\big)\Big)$, where

\mathcal{B}_{2}=\left(4(K+2q-1)+K,4K+8q,4K+8q,4K+8q\right), (15)

containing at most $72(K+2q)$ non-zero weights.

Proof.

Note first that, due to [41, Theorem 4.32],

B_{j}^{2,K}(x)=\frac{K^{3}}{6}\left(\left(x-\frac{j-q-1}{K}\right)_{+}^{2}-3\left(x-\frac{j-q}{K}\right)_{+}^{2}+3\left(x-\frac{j-q+1}{K}\right)_{+}^{2}-\left(x-\frac{j-q+2}{K}\right)_{+}^{2}\right),\quad q+1\leqslant j\leqslant q+K-2.

The functions $B_{q-1}^{2,K}$, $B_{q}^{2,K}$, $B_{q+K-1}^{2,K}$, and $B_{q+K}^{2,K}$ can be computed directly using (30). Indeed, for any $x\in[0,1]$,

B_{q-1}^{2,K}(x)=K^{3}\left(\frac{1}{K}-x\right)_{+}^{2},
B_{q}^{2,K}(x)=\frac{K^{3}}{4}\left(\left(\frac{2}{K}-x\right)_{+}^{2}-4\left(\frac{1}{K}-x\right)_{+}^{2}+3(-x)_{+}^{2}\right),
B_{q+K-1}^{2,K}(x)=\frac{K^{3}}{4}\left(\left(x-\frac{K-2}{K}\right)_{+}^{2}-2\left(x-\frac{K-1}{K}\right)_{+}^{2}-3(x-1)_{+}^{2}\right),
B_{q+K}^{2,K}(x)=K^{3}\left(x-\frac{K-1}{K}\right)_{+}^{2}\,.

Moreover, (30) implies that $B_{1}^{2,K}(x)=\dots=B_{q-2}^{2,K}(x)=0$ and $B_{q+K+1}^{2,K}(x)=\ldots=B_{2q+K-2}^{2,K}(x)=0$. Hence, each of the functions $B_{j}^{2,K}/K^{3}$, $j\in\{1,\dots,2q+K-2\}$, can be exactly represented by a neural network from the class $\mathsf{NN}(1,(1,4,1))$. The final mapping (b) is implemented as $K^{3}=\frac{1}{4}(K^{2}+K)^{2}-\frac{1}{4}(K^{2}-K)^{2}$. Note that the identity map $x\mapsto x$ belongs to $\mathsf{NN}(1,(1,4,1))$ too:

x=\left((x+1)^{2}-(x-1)^{2}\right)/4=\left(\sigma_{-1}(x)+\sigma_{1}(-x)-\sigma_{1}(x)-\sigma_{-1}(-x)\right)/4. (16)

Combining the arguments above, we conclude that the mapping

x\mapsto\left(x,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right) (17)

can be exactly represented by a network from the class $\mathsf{NN}(1,(1,4(K+2q-1),K+2q-1))$. Besides, one can implement the sequence of transformations

x\mapsto\underbrace{(1,\dots,1)}_{K\text{ times}}\mapsto K, (18)

using a neural network from $\mathsf{NN}\big(1,(1,K,1)\big)$. Connecting in parallel the neural networks realizing the maps (17) and (18), we obtain that one can implement the transformation

x\mapsto\left(x,K,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right) (19)

via a neural network from the class $\mathsf{NN}\big(1,(1,4(K+2q-1)+K,K+2q),13(K+2q-1)+3K+1\big)$.

It remains to implement the multiplication of the splines by $K^{3}$ to complete the proof. Recall that, according to Lemma 1, the function mapping a pair $(x,y)$ to the product $xy$ belongs to the class $\mathsf{NN}(1,(2,4,1))$. For any $x\in\mathbb{R}$, consider the vector

\left(x,K,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right).

Let us implement the multiplications of the splines $B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}$ by $K$ in parallel. Then we obtain that the mapping

\left(x,K,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right)
\mapsto\left(x,K,B_{1}^{2,K}(x)/K^{2},B_{2}^{2,K}(x)/K^{2},\ldots,B_{K+2q-2}^{2,K}(x)/K^{2}\right)

is in the class $\mathsf{NN}\big(1,(K+2q,4K+8q,K+2q),17(K+2q)-8\big)$. Here we took into account that we need $17$ non-zero weights to implement each of the $(K+2q-2)$ multiplications and $13$ non-zero weights to implement the identity maps $x\mapsto x$ and $K\mapsto K$. Repeating the same trick two more times, we get that there is a neural network in $\mathsf{NN}\big(3,(K+2q,4K+8q,4K+8q,4K+8q,K+2q),57(K+2q)-8\big)$, implementing

\begin{split}&\left(x,K,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right)\\ &\mapsto\left(x,K,B_{1}^{2,K}(x),B_{2}^{2,K}(x),\ldots,B_{K+2q-2}^{2,K}(x)\right).\end{split} (20)

This follows from the fact that each of the $K+2q-2$ multiplications by $K^{3}$, running in parallel, can be implemented by a neural network from $\mathsf{NN}(3,(2,4,4,4,1))$, and the two identity maps $x\mapsto x$ and $K\mapsto K$ can be represented by a network from $\mathsf{NN}(3,(1,4,4,4,1))$. Concatenating the neural networks performing (19) and (20), we finally get that the mapping of interest,

x\mapsto\left(x,K,B_{1}^{2,K}(x),B_{2}^{2,K}(x),\ldots,B_{K+2q-2}^{2,K}(x)\right),

is in the class of neural networks $\mathsf{NN}\big(4,(1,4(K+2q-1)+K,4K+8q,4K+8q,4K+8q,K+2q)\big)$, containing at most

69(K+2q)-20+3K<72(K+2q)

non-zero weights. ∎

Lemma 3.

Let $x\in\mathbb{R}$ and $K,q\in\mathbb{N}$, $K,q\geqslant 2$. Then the mapping $x\mapsto\left(x,K,B_{1}^{q,K}(x),B_{2}^{q,K}(x),\ldots,B_{q+K}^{q,K}(x)\right)$ belongs to the class

\mathsf{NN}\Big(4+2(q-2),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{q},K+q+2\big)\Big)\,,

with at most $72q(K+2q)$ non-zero weights. The vector $\mathcal{B}_{2}$ is defined in (15), and for $m\in\{3,\ldots,q\}$, $\mathcal{B}_{m}$ is equal to

\mathcal{B}_{m}=\bigl(12(K+2q-m)+12,8(K+2q-m)+8\bigr)\,. (21)
Proof of Lemma 3.

Fix an integer $q\geqslant 2$. We consider the family of $B$-splines $\bigl\{B_{j}^{m,K}(x):m\in\{2,\dots,q\},\ j\in\{1,\dots,K+2q-m\}\bigr\}$ and construct all elements of this family sequentially, starting from $m=2$. By Lemma 2, the map

x\mapsto\left(x,K,B_{1}^{2,K}(x),B_{2}^{2,K}(x),\ldots,B_{K+2q-2}^{2,K}(x)\right)

can be represented by a network in the class (15), containing at most $72(K+2q)$ non-zero weights. Assume that for some $m\geqslant 3$ the mapping

x\mapsto\left(x,K,B_{1}^{m-1,K}(x),\ldots,B_{K+2q-(m-1)}^{m-1,K}(x)\right)

belongs to the class

\mathsf{NN}\Big(4+2(m-3),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{m-1},K+2q-m+3\big)\Big). (22)

We use the recursion formula (30) to perform the induction step. Note that we can implement each of the mappings

(x,K)\mapsto\frac{x-a_{j}}{a_{j+m+1}-a_{j}}\,,\quad(x,K)\mapsto\frac{a_{j+m+1}-x}{a_{j+m+1}-a_{j}}\,,\quad j\in\{1,\dots,2q+K-m\}\,, (23)

by networks from $\mathsf{NN}(1,(2,4,1))$. This is possible since, for $j$ and $m$ satisfying $a_{j+m+1}-a_{j}>0$, it holds that $a_{j+m+1}-a_{j}\geqslant 1/K$. We remove the last linear layer of (22) and concatenate the remaining part with the first layer of the networks implementing (23). We obtain a network with $5+2(m-3)$ hidden layers and the architecture

\mathsf{NN}\Big(5+2(m-3),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{m-1},12(K+2q-m)+12,3(K+2q-m)+3\big)\Big)\,, (24)

which implements the mapping

x\mapsto\left(x,K,B_{1}^{m-1,K}(x),\ldots,B_{K+2q-(m-1)}^{m-1,K}(x),\underbrace{\frac{x-a_{j}}{a_{j+m+1}-a_{j}},\frac{a_{j+m+1}-x}{a_{j+m+1}-a_{j}}}_{j\in\{1,\dots,2q+K-m\}}\right)\,. (25)

Note that we have added at most

\underbrace{4(K+2q-m+3)}_{\text{to implement identity maps}}+\underbrace{16\cdot 2(K+2q-m)}_{\text{to implement (23)}}=36(K+2q-m)+12

non-zero weights, since each linear function of the form (23) can be implemented via a network from $\mathsf{NN}\big(1,(2,4,1)\big)$ with at most $16$ non-zero weights. According to (30), for any $j\in\{1,\dots,2q+K-m\}$, we construct

B_{j}^{m,K}(x)=\begin{cases}\dfrac{(x-a_{j})B_{j}^{m-1,K}(x)+(a_{j+m+1}-x)B_{j+1}^{m-1,K}(x)}{a_{j+m+1}-a_{j}},&\text{if }a_{j}<a_{j+m+1}\text{ and }a_{j}\leqslant x<a_{j+m+1},\\ 0,&\text{otherwise}.\end{cases} (26)

Now we add one more hidden layer with $8(K+2q-m)+8$ parameters to the network (25), yielding a network with the architecture

\mathsf{NN}\Big(6+2(m-3),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{m-1},12(K+2q-m)+12,8(K+2q-m)+8,2(K+2q-m)+2\big)\Big)\,, (27)

which implements

x\mapsto\left(x,K,\underbrace{\frac{x-a_{j}}{a_{j+m+1}-a_{j}}B_{j}^{m-1,K}(x),\frac{a_{j+m+1}-x}{a_{j+m+1}-a_{j}}B_{j+1}^{m-1,K}(x)}_{j\in\{1,\dots,2q+K-m\}}\right).

Note that adding the new hidden layer adds at most

\underbrace{8}_{\text{to implement identity maps for }x\text{ and }K}+\underbrace{16\cdot 2(K+2q-m)}_{\text{to implement (30)}}

additional non-zero weights. Combining the above representations and (26), we obtain a network from the class

\mathsf{NN}\Big(4+2(m-2),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\dots,\mathcal{B}_{m},K+2q-m+2\big)\Big)

with at most

72(K+2q)+\sum\limits_{s=3}^{m}\Big[68(K+2q-s)+20\Big]
\leqslant 72(K+2q)+(m-2)\Big[68(K+2q)+20\Big]
\leqslant 72(K+2q)+(m-2)\cdot 72(K+2q)
<72m(K+2q)

non-zero weights. ∎

Lemma 4.

Let $x\in\mathbb{R}$ and let $L\in\mathbb{N}$. Then, for any $M$ such that $|M|\leqslant 4^{4^{L}}$, the mapping $x\mapsto Mx$ can be represented by a neural network belonging to the class

\mathsf{NN}\Big(2L+2,(1,5,5,\dots,5,4,1)\Big)

and containing at most $60L+38$ non-zero weights.

Proof.

Using the representation $4=(1+1)^{2}$, we obtain that there is a neural network from $\mathsf{NN}\big(2L+1,(1,1,\dots,1)\big)$ implementing the sequence of maps

x\mapsto 1\mapsto 4\mapsto 4^{2}\mapsto 4^{4}\mapsto\dots\mapsto 4^{4^{L}}.

At the same time, due to

x=\frac{1}{4}\left((x+1)^{2}-(x-1)^{2}\right)=\frac{1}{4}\left((x+1)_{+}^{2}+(-x-1)_{+}^{2}-(x-1)_{+}^{2}-(1-x)_{+}^{2}\right),

the identity map $x\mapsto x$ belongs to

\mathsf{NN}\big(2L+1,(1,\underbrace{4,4,\dots,4}_{2L+1\text{ times}},1)\big).

Running the two neural networks in parallel, we can implement the mapping

x\mapsto\left(4^{4^{L}},x\right)

via a neural network from $\mathsf{NN}\big(2L+1,(1,5,5,\dots,5,2)\big)$. Finally, applying Lemma 1, we conclude that there is a neural network from the class

\mathsf{NN}\big(2L+2,(1,5,5,\dots,5,4,1)\big),

performing the map $x\mapsto 4^{4^{L}}x$. The number of non-zero weights in this network does not exceed

\left(1\cdot 5+5\cdot 5\cdot 2L+5\cdot 4+4\cdot 1\right)+5\cdot(2L+1)+4=60L+38.

It remains to note that, by the definition of $L$, we have $|M|/4^{4^{L}}\leqslant 1$.
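A small numeric check of the two ingredients used here (our sketch, illustration only): $2L$ successive squarings starting from $4$ reach $4^{2^{2L}}=4^{4^{L}}$, and the multiplication $x\mapsto Mx$ is then recovered by scaling $4^{4^{L}}x$ with the admissible weight $M/4^{4^{L}}\in[-1,1]$.

```python
def power_tower(L):
    """2L successive squarings of 4 give 4^(2^(2L)) = 4^(4^L)."""
    y = 4
    for _ in range(2 * L):
        y = y * y            # sigma(y) = y^2 for y >= 0
    return y

for L in range(1, 4):
    assert power_tower(L) == 4 ** (4 ** L)

# multiplication x -> M x with a final weight bounded by 1 in absolute value
L, M, x = 2, 12345.0, 0.7
big = power_tower(L)
assert abs(M) <= big and abs(M / big) <= 1.0
assert abs((M / big) * (big * x) - M * x) < 1e-9
```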

5.3 Proof of Corollary 1

Let $s>\ell$ be an integer to be specified later. Plugging the bound $\left\|f\right\|_{\mathcal{H}^{s}([0,1]^{d})}\leqslant QR^{-s}s!$, $s\in\mathbb{N}_{0}$, into (3), we get that there is a neural network $h_{f}$ of width

\left(4d(K+s-1)^{d}\right)\vee\left(12(K+2s-2)+12\right)\vee p

with

6+2(s-3)+\left\lceil\log_{2}d\right\rceil+2\left(\left\lceil\log_{2}(2d(s-1)+d)\vee\log_{2}\log_{2}(QR^{-s}s!)\right\rceil\vee 1\right)

hidden layers and at most

p(K+s-1)^{d}\left(60\left[\left\lceil\log_{2}(2d(s-1)+d)\vee\log_{2}\log_{2}(QR^{-s}s!)\right\rceil\vee 1\right]+38\right)
+p(K+s-1)^{d}\left(20d^{2}+144d(s-1)+8d\right)

non-zero weights, taking their values in $[-1,1]$, such that, for any $s>\ell$,

\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{Q(\sqrt{2}ed)^{s}s!}{R^{s}K^{s-\ell}}+\frac{Q9^{d(s-2)}(2s-1)^{2d+\ell}(\sqrt{2}ed)^{s}s!}{R^{s}K^{s-\ell}}
\leqslant\frac{Q9^{ds}(2s-1)^{2d+\ell}(\sqrt{2}ed)^{s}s!}{R^{s}K^{s-\ell}}.

Recall that, by definition, $\lfloor s\rfloor=s-1$ for any integer $s$. Using the inequalities

s!\leqslant(s-\ell)!\,s^{\ell},\quad(s-\ell)!\leqslant e\sqrt{s-\ell}\left(\frac{s-\ell}{e}\right)^{s-\ell},\quad\text{and}\quad\sqrt{s-\ell}\leqslant 2^{(s-\ell)/2},

which are valid for all $s>\ell$, we obtain that

\begin{split}\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}&\leqslant\frac{Qe(\sqrt{2}\,ed\,9^{d})^{\ell}(2s-1)^{2d+\ell}s^{\ell}}{R^{\ell}}\cdot\left(\frac{2d\,9^{d}(s-\ell)}{KR}\right)^{s-\ell}\\ &\leqslant\frac{Qe(\sqrt{2}\,ed\,9^{d})^{\ell}(2s-1)^{2d+2\ell}}{R^{\ell}}\cdot\left(\frac{2d\,9^{d}(s-\ell)}{KR}\right)^{s-\ell}.\end{split} (28)

Set

s=\ell+\left\lfloor\frac{KR}{2ed\,9^{d}}\right\rfloor. (29)

It is easy to observe that, with such a choice of $s$, the neural network $h_{f}$ has width $O((K+\ell)^{d})$, $O(K+\ell)$ hidden layers, and at most $O(p(K+\ell)^{d}\log K)$ non-zero weights. The bound (5) on the approximation error of $h_{f}$ follows directly from (28) and (29).

References

  • [1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444. doi:10.1038/nature14539.
  • [2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014, pp. 2672–2680.
    URL https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
  • [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533. doi:10.1038/nature14236.
  • [4] J. Han, A. Jentzen, W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences 115 (34) (2018) 8505–8510. doi:10.1073/pnas.1718942115.
  • [5] M. Geist, P. Petersen, M. Raslan, R. Schneider, G. Kutyniok, Numerical solution of the parametric diffusion equation by deep neural networks, Journal of Scientific Computing 88 (1) (2021) 1–37. doi:10.1007/s10915-021-01532-w.
  • [6] Y. Chen, Y. Shi, B. Zhang, Optimal control via neural networks: A convex approach, in: International Conference on Learning Representations, 2019.
    URL https://openreview.net/pdf?id=H1MW72AcK7
  • [7] D. Onken, L. Nurbekyan, X. Li, S. W. Fung, S. Osher, L. Ruthotto, A neural network approach for high-dimensional optimal control applied to multiagent path finding, IEEE Transactions on Control Systems Technology (2022) 1–17. doi:10.1109/TCST.2022.3172872.
  • [8] L. Li, K. Ota, M. Dong, Humanlike driving: Empirical decision-making system for autonomous vehicles, IEEE Transactions on Vehicular Technology 67 (8) (2018) 6814–6823. doi:10.1109/TVT.2018.2822762.
  • [9] G. Cembrano, C. Torras, G. Wells, Neural networks for robot control, Annual Review in Automatic Programming 19 (1994) 159–166. doi:10.1016/0066-4138(94)90059-0.
    URL https://www.sciencedirect.com/science/article/pii/0066413894900590
  • [10] P. Bozek, Y. L. Karavaev, A. A. Ardentov, K. S. Yefremov, Neural network control of a wheeled mobile robot based on optimal trajectories, International Journal of Advanced Robotic Systems 17 (2) (2020) 1729881420916077. doi:10.1177/1729881420916077.
  • [11] M. González-Álvarez, J. Dupeyroux, F. Corradi, G. C. de Croon, Evolved neuromorphic radar-based altitude controller for an autonomous open-source blimp, 2022 International Conference on Robotics and Automation (ICRA) (2022) 85–90.
  • [12] K.-I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural networks 2 (3) (1989) 183–192. doi:10.1016/0893-6080(89)90003-8.
  • [13] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems 2 (4) (1989) 303–314. doi:10.1007/BF02551274.
  • [14] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural networks 4 (2) (1991) 251–257. doi:10.1016/0893-6080(91)90009-T.
  • [15] D. Yarotsky, Error bounds for approximations with deep ReLU networks, Neural Networks 94 (2017) 103–114. doi:10.1016/j.neunet.2017.07.002.
  • [16] B. Li, S. Tang, H. Yu, PowerNet: efficient representations of polynomials and smooth functions by deep neural networks with rectified power units, Journal of Mathematical Study 53 (2) (2020) 159–191. doi:10.4208/jms.v53n2.20.03.
  • [17] I. Gühring, M. Raslan, Approximation rates for neural networks with encodable weights in smoothness spaces, Neural Networks 134 (2021) 107–130. doi:10.1016/j.neunet.2020.11.010.
  • [18] J. Lu, Z. Shen, H. Yang, S. Zhang, Deep network approximation for smooth functions, SIAM Journal on Mathematical Analysis 53 (5) (2021) 5465–5506. doi:10.1137/20M134695X.
  • [19] Z. Shen, H. Yang, S. Zhang, Neural network approximation: Three hidden layers are enough, Neural Networks 141 (2021) 160–173. doi:10.1016/j.neunet.2021.04.011.
  • [20] T. De Ryck, S. Lanthaler, S. Mishra, On the approximation of functions by tanh neural networks, Neural Networks 143 (2021) 732–750. doi:10.1016/j.neunet.2021.08.015.
  • [21] Y. Jiao, Y. Lai, X. Lu, F. Wang, J. Z. Yang, Y. Yang, Deep neural networks with relu-sine-exponential activations break curse of dimensionality on Hölder class, preprint, arXiv:2103.00542 (2021). doi:10.48550/ARXIV.2103.00542.
  • [22] S. Langer, Approximating smooth functions by deep neural networks with sigmoid activation function, Journal of Multivariate Analysis 182 (2021) Paper No. 104696, 21. doi:10.1016/j.jmva.2020.104696.
  • [23] Z. Shen, H. Yang, S. Zhang, Deep network approximation characterized by number of neurons, Communications in Computational Physics 28 (5) (2020) 1768–1811. doi:10.4208/cicp.oa-2020-0149.
  • [24] J. Schmidt-Hieber, Nonparametric regression using deep neural networks with ReLU activation function, The Annals of Statistics 48 (4) (2020) 1875–1897. doi:10.1214/19-AOS1875.
  • [25] B. Li, S. Tang, H. Yu, Better approximations of high dimensional smooth functions by deep neural networks with rectified power units, Communications in Computational Physics 27 (2) (2020) 379–411. doi:10.4208/cicp.oa-2019-0168.
  • [26] B. Hanin, Universal function approximation by deep neural nets with bounded width and ReLU activations, Mathematics 7 (10) (2019) 992. doi:10.3390/math7100992.
  • [27] P. Kidger, T. Lyons, Universal approximation with deep narrow networks, in: Proceedings of Thirty Third Conference on Learning Theory, Vol. 125 of Proceedings of Machine Learning Research, 2020, pp. 2306–2327.
    URL https://proceedings.mlr.press/v125/kidger20a.html
  • [28] S. Park, C. Yun, J. Lee, J. Shin, Minimum width for universal approximation, in: International Conference on Learning Representations, 2021.
    URL https://openreview.net/forum?id=O-XJwyoIF-k
  • [29] I. Gühring, G. Kutyniok, P. Petersen, Error bounds for approximations with deep ReLU neural networks in Ws,pW^{s,p} norms, Analysis and Applications 18 (5) (2020) 803–859. doi:10.1142/S0219530519410021.
  • [30] J. M. Klusowski, A. R. Barron, Approximation by combinations of ReLU and squared ReLU ridge functions with 1\ell^{1} and 0\ell^{0} controls, IEEE Transactions on Information Theory 64 (12) (2018) 7649–7656. doi:10.1109/tit.2018.2874447.
  • [31] M. Ali, A. Nouy, Approximation of smoothness classes by deep rectifier networks, SIAM Journal on Numerical Analysis 59 (6) (2021) 3032–3051. doi:10.1137/20M1360657.
  • [32] A. Abdeljawad, P. Grohs, Approximations with deep neural networks in Sobolev time-space, Analysis and Applications 20 (3) (2022) 499–541. doi:10.1142/S0219530522500014.
  • [33] Q. Chen, W. Hao, J. He, Power series expansion neural network, Journal of Computational Science 59 (2022) 101552. doi:10.1016/j.jocs.2021.101552.
  • [34] R. Gribonval, G. Kutyniok, M. Nielsen, F. Voigtlaender, Approximation spaces of deep neural networks, Constructive Approximation. An International Journal for Approximations and Expansions 55 (1) (2022) 259–367. doi:10.1007/s00365-021-09543-4.
  • [35] J. W. Siegel, J. Xu, High-order approximation rates for shallow neural networks with cosine and ReLUk{\rm ReLU}^{k} activation functions, Applied and Computational Harmonic Analysis 58 (2022) 1–26. doi:10.1016/j.acha.2021.12.005.
  • [36] D. Yarotsky, A. Zhevnerchuk, The phase diagram of approximation rates for deep neural networks, in: Advances in Neural Information Processing Systems, Vol. 33, 2020, pp. 13005–13015.
    URL https://proceedings.neurips.cc/paper/2020/file/979a3f14bae523dc5101c52120c535e9-Paper.pdf
  • [37] Z.-B. Xu, F.-L. Cao, Simultaneous Lp{L}_{p}-approximation order for neural networks, Neural Networks 18 (7) (2005) 914–923. doi:10.1016/j.neunet.2005.03.013.
  • [38] S. Hon, H. Yang, Simultaneous neural network approximation for smooth functions, preprint, arXiv:2109.00161 (2021). doi:10.48550/ARXIV.2109.00161.
  • [39] E. Candès, L. Demanet, L. Ying, Fast computation of fourier integral operators, SIAM Journal on Scientific Computing 29 (6) (2007) 2464–2493. doi:10.1137/060671139.
  • [40] E. Candès, L. Demanet, L. Ying, A fast butterfly algorithm for the computation of Fourier integral operators, Multiscale Modeling & Simulation. A SIAM Interdisciplinary Journal 7 (4) (2009) 1727–1750. doi:10.1137/080734339.
  • [41] L. Schumaker, Spline Functions: Basic theory, 3rd Edition, Cambridge University Press, Cambridge, 2007. doi:10.1017/CBO9780511618994.

Appendix A Some properties of multivariate splines

In this section we provide some properties of multivariate splines. For more details we refer the reader to the book [41].

A.1 Univariate splines

We start with the one-dimensional case. Fix K,qK,q\in\mathbb{N}. We call a function S:[0,1]S:[0,1]\mapsto\mathbb{R} a univariate spline of degree qq with knots at {1/K,2/K,,(K1)/K}\{1/K,2/K,\dots,(K-1)/K\}, if

  • 1.

    for any i{0,1,,K1}i\in\{0,1,\dots,K-1\}, the restriction of SS on [i/K,(i+1)/K][i/K,(i+1)/K] is a polynomial of degree qq;

  • 2.

    SS has continuous derivatives up to order q1q-1, that is, for any m{0,,q1}m\in\{0,\dots,q-1\} and any i{1,2,,K1}i\in\{1,2,\dots,K-1\}, DmS(i/K+0)=DmS(i/K0)D^{m}S(i/K+0)=D^{m}S(i/K-0).

It is easy to observe that univariate splines of degree qq with knots at {1/K,2/K,,(K1)/K}\{1/K,2/K,\dots,(K-1)/K\} form a linear space. We denote this space by 𝒮q,K\mathcal{S}_{q,K}. Note that 𝒮q,K\mathcal{S}_{q,K} has finite dimension q+Kq+K. There are several ways to choose a basis in 𝒮q,K\mathcal{S}_{q,K}; below we construct the basis of normalized BB-splines {Njq,K(x):1jK+q}\{N_{j}^{q,K}(x):1\leqslant j\leqslant K+q\}. Let a=(a1,,a2q+K+1)a=(a_{1},\dots,a_{2q+K+1}) be a vector of knots defined in (5.1). First, we successively construct (unnormalized) B-splines {Bjq,K(x):1jK+q}\{B_{j}^{q,K}(x):1\leqslant j\leqslant K+q\} associated with knots a1,,a2q+K+1a_{1},\dots,a_{2q+K+1}. For any j{1,,2q+K}j\in\{1,\dots,2q+K\}, let

Bj0,K(x)={1/(aj+1aj),if aj<aj+1 and ajx<aj+1,0,otherwise.B_{j}^{0,K}(x)=\begin{cases}1/(a_{j+1}-a_{j}),\quad\text{if $a_{j}<a_{j+1}$ and $a_{j}\leqslant x<a_{j+1}$},\\ 0,\quad\text{otherwise}.\end{cases}

Then for any m{1,,q}m\in\{1,\dots,q\} and j{1,,2q+Km}j\in\{1,\dots,2q+K-m\}, we put

Bjm,K(x)={(xaj)Bjm1,K(x)+(aj+m+1x)Bj+1m1,K(x)aj+m+1aj,if aj<aj+m+1,ajx<aj+m+1,0,otherwise.B_{j}^{m,K}(x)=\begin{cases}\frac{(x-a_{j})B_{j}^{m-1,K}(x)+(a_{j+m+1}-x)B_{j+1}^{m-1,K}(x)}{a_{j+m+1}-a_{j}},\\ \qquad\qquad\qquad\qquad\qquad\text{if $a_{j}<a_{j+m+1},a_{j}\leqslant x<a_{j+m+1}$},\\ 0,\quad\text{otherwise}.\end{cases} (30)

Now, for m{1,,q}m\in\{1,\dots,q\}, we define the normalized BB-spline Njm,KN_{j}^{m,K} as

Njm,K(x)=(aj+m+1aj)Bjm,K(x),1j2q+Km.N_{j}^{m,K}(x)=(a_{j+m+1}-a_{j})B_{j}^{m,K}(x),\quad 1\leqslant j\leqslant 2q+K-m. (31)
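For readers who prefer an algorithmic view, the recursion (30) and the normalization (31) can be evaluated directly. The Python sketch below assumes that the knot vector from (5.1) is the standard clamped one (q+1 copies of 0, the interior knots 1/K,\dots,(K-1)/K, and q+1 copies of 1), which matches the stated length 2q+K+1 but should be checked against (5.1); the final printout anticipates the bound \|N_{j}^{q,K}\|_{L_{\infty}([0,1])}\leqslant 1 from Proposition 1 below.

```python
import numpy as np

def clamped_knots(q, K):
    # Assumed form of the knot vector a_1, ..., a_{2q+K+1} from (5.1):
    # q+1 copies of 0, the interior knots 1/K, ..., (K-1)/K, and q+1 copies of 1.
    return np.concatenate([np.zeros(q + 1), np.arange(1, K) / K, np.ones(q + 1)])

def B(j, m, a, x):
    """Unnormalized B-spline B_j^{m,K}(x) via the recursion (30); j is 1-based."""
    lo, hi = a[j - 1], a[j + m]                  # a_j and a_{j+m+1}
    if lo >= hi:                                 # empty support: the spline vanishes
        return np.zeros_like(x, dtype=float)
    if m == 0:
        return np.where((lo <= x) & (x < hi), 1.0 / (hi - lo), 0.0)
    return ((x - lo) * B(j, m - 1, a, x) + (hi - x) * B(j + 1, m - 1, a, x)) / (hi - lo)

def N(j, m, a, x):
    """Normalized B-spline N_j^{m,K}(x) = (a_{j+m+1} - a_j) B_j^{m,K}(x), cf. (31)."""
    return (a[j + m] - a[j - 1]) * B(j, m, a, x)

q, K = 3, 5
a = clamped_knots(q, K)
x = np.linspace(0.0, 1.0 - 1e-9, 2001)           # half-open intervals in (30)
vals = np.array([N(j, q, a, x) for j in range(1, q + K + 1)])
print("max_j ||N_j^{q,K}||_inf =", vals.max())   # does not exceed 1
print("sum_j N_j^{q,K}(x) = 1 on the grid:", bool(np.allclose(vals.sum(axis=0), 1.0)))
```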

We use the following properties of normalized BB-splines (see [41, Section 4.3] for the details).

Proposition 1 ([41], Section 4.3).

Let Njm,KN_{j}^{m,K} be a normalized B-spline defined in (31). Then the following holds.

  • 1.

    For any q0,j{1,,q+K}q\in\mathbb{N}_{0},j\in\{1,\dots,q+K\},

    supp(Njq,K)=[aj,aj+q+1]andNjq,KL([0,1])1.\text{\rm supp}(N_{j}^{q,K})=[a_{j},a_{j+q+1}]\quad\text{and}\quad\|N_{j}^{q,K}\|_{L_{\infty}([0,1])}\leqslant 1.
  • 2.

    For any qq\in\mathbb{N} and any j{1,,q+K}j\in\{1,\dots,q+K\}, Njq,KN_{j}^{q,K} is a spline of degree qq; it has continuous derivatives up to order q1q-1, and the (q1)(q-1)-th derivative is Lipschitz. Moreover, for any m{1,,q}m\in\{1,\dots,q\},

    dNjm,K(x+0)dx=maj+maj(Njm1,K(x)Nj+1m1,K(x)),1j2q+Km.\frac{\mathrm{d}N^{m,K}_{j}(x+0)}{\mathrm{d}x}=\frac{m}{a_{j+m}-a_{j}}\left(N_{j}^{m-1,K}(x)-N_{j+1}^{m-1,K}(x)\right),\quad 1\leqslant j\leqslant 2q+K-m.
Corollary 2.

For any mm\in\mathbb{N} and any {1,,m}\ell\in\{1,\dots,m\},

max1j2q+KmDNjm,KL([0,1])(2K)m!(m)!.\max\limits_{1\leqslant j\leqslant 2q+K-m}\left\|D^{\ell}N^{m,K}_{j}\right\|_{L_{\infty}([0,1])}\leqslant(2K)^{\ell}\frac{m!}{(m-\ell)!}.
Proof.

Proposition 1 implies that, for any {1,,m}\ell\in\{1,\dots,m\},

max1j2q+KmDNjm,KL([0,1])\displaystyle\max\limits_{1\leqslant j\leqslant 2q+K-m}\left\|D^{\ell}N^{m,K}_{j}\right\|_{L_{\infty}([0,1])}
=max1j2q+Kmmaj+majD1Njm1,KD1Nj+1m1,KL([0,1])\displaystyle=\max\limits_{1\leqslant j\leqslant 2q+K-m}\frac{m}{a_{j+m}-a_{j}}\left\|D^{\ell-1}N_{j}^{m-1,K}-D^{\ell-1}N_{j+1}^{m-1,K}\right\|_{L_{\infty}([0,1])}
2mKmax1j2q+Km+1D1Njm1,KL([0,1]).\displaystyle\leqslant 2mK\max\limits_{1\leqslant j\leqslant 2q+K-m+1}\left\|D^{\ell-1}N_{j}^{m-1,K}\right\|_{L_{\infty}([0,1])}.

Now the statement follows by induction on \ell together with the bound Njq,KL([0,1])1\|N_{j}^{q,K}\|_{L_{\infty}([0,1])}\leqslant 1. ∎

To study approximation properties of 𝒮q,K\mathcal{S}_{q,K} with respect to the Hölder norm k,k0\|\cdot\|_{\mathcal{H}^{k}},k\in\mathbb{N}_{0}, we follow the technique from [41, Sections 4.6 and 12.3] and introduce a dual basis for normalized B-splines. A set of linear functionals {λ1q,K,,λq+Kq,K}\{\lambda^{q,K}_{1},\dots,\lambda^{q,K}_{q+K}\} on L([0,1])L_{\infty}([0,1]) is called a dual basis if, for all i,j{1,,q+K}i,j\in\{1,\dots,q+K\},

λiq,KNjq,K={1,if i=j,0,otherwise.\lambda_{i}^{q,K}N_{j}^{q,K}=\begin{cases}1,\quad\text{if $i=j$,}\\ 0,\quad\text{otherwise}.\end{cases}

An explicit expression for λiq,K\lambda_{i}^{q,K} can be found in [41, Eq.(4.89)] but it is not necessary for our purposes. Define a linear operator 𝒬q,K:L([0,1])𝒮q,K\mathcal{Q}^{q,K}:L_{\infty}([0,1])\rightarrow\mathcal{S}_{q,K} by

𝒬q,Kf=j=1q+K(λjq,Kf)Njq,K.\mathcal{Q}^{q,K}f=\sum\limits_{j=1}^{q+K}(\lambda_{j}^{q,K}f)N_{j}^{q,K}.

The function 𝒬q,Kf\mathcal{Q}^{q,K}f is called a quasi-interpolant, and it approximates ff well provided that ff is smooth enough. The approximation properties of 𝒬q,Kf\mathcal{Q}^{q,K}f are studied below for the case of multivariate tensor-product splines.
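For illustration only (this special case is not used in the proofs), take q=1 and assume that the knot vector from (5.1) is the clamped one with a_{1}=a_{2}=0 and a_{K+2}=a_{K+3}=1. Then the normalized B-splines are the usual hat functions, satisfying N_{j}^{1,K}(a_{j+1})=1 and N_{i}^{1,K}(a_{j+1})=0 for i\neq j (with the last knot interval taken closed), so the point evaluations \lambda_{j}^{1,K}f=f(a_{j+1}), 1\leqslant j\leqslant K+1, form a dual basis and

\mathcal{Q}^{1,K}f=\sum\limits_{j=1}^{K+1}f(a_{j+1})N_{j}^{1,K},

that is, \mathcal{Q}^{1,K}f is the piecewise-linear interpolant of f at the knots 0,1/K,\dots,1.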

A.2 Tensor-product splines

The main result of this section is Theorem 3, concerning the approximation properties of multivariate splines. We start with auxiliary definitions. The tensor product

𝒮dq,K=(𝒮q,K)dspan{Nj1q,K(x1)Njdq,K(xd):j1,,jd{1,,q+K}}\mathcal{S}^{q,K}_{d}=\left(\mathcal{S}^{q,K}\right)^{\otimes d}\equiv\text{span}\left\{N_{j_{1}}^{q,K}(x_{1})\cdot\dots\cdot N_{j_{d}}^{q,K}(x_{d}):j_{1},\dots,j_{d}\in\{1,\dots,q+K\}\right\}

is called the space of tensor-product splines of degree qq. Since qq and KK are fixed, we omit the upper indices in the notations Bjq,K,Njq,KB_{j}^{q,K},N_{j}^{q,K}, 1jq+K1\leqslant j\leqslant q+K, if there is no ambiguity. For any i{1,,d}i\in\{1,\dots,d\}, we denote by {λi,1,,λi,q+K}\{\lambda_{i,1},\dots,\lambda_{i,q+K}\} the dual basis for {N1(xi),,Nq+K(xi)}\{N_{1}(x_{i}),\dots,N_{q+K}(x_{i})\} constructed in A.1. For any j1,,jd{1,,q+K}j_{1},\dots,j_{d}\in\{1,\dots,q+K\} and fL([0,1]d)f\in L_{\infty}([0,1]^{d}), define

j1,,jdf=λ1,j1λ2,j2λd,jdf.\mathcal{L}_{j_{1},\dots,j_{d}}f=\lambda_{1,j_{1}}\lambda_{2,j_{2}}\dots\lambda_{d,j_{d}}f.

It is easy to see that the functionals {j1,,jd:j1,,jd{1,,q+K}}\left\{\mathcal{L}_{j_{1},\dots,j_{d}}:j_{1},\dots,j_{d}\in\{1,\dots,q+K\}\right\} form a dual basis for the tensor-product system {Nj1(x1)Njd(xd):j1,,jd{1,,q+K}}\left\{N_{j_{1}}(x_{1})\cdots N_{j_{d}}(x_{d}):j_{1},\dots,j_{d}\in\{1,\dots,q+K\}\right\}. Now we formulate the approximation result from [41, Theorem 12.5].

Proposition 2 ([41], Theorem 12.5).

Let fL([0,1]d)f\in L_{\infty}([0,1]^{d}). Fix any j1,,jd{1,,q+K}j_{1},\dots,j_{d}\in\{1,\dots,q+K\} and let j1,,jd\mathcal{L}_{j_{1},\dots,j_{d}} be as defined above. Introduce

Aj1,,jd=i=1d[aji,aji+q+1).A_{j_{1},\dots,j_{d}}=\bigotimes\limits_{i=1}^{d}[a_{j_{i}},a_{j_{i}+q+1})\,.

Then

|j1,,jdf|(2q+1)d9d(q1)fL(Aj1,,jd).|\mathcal{L}_{j_{1},\dots,j_{d}}f|\leqslant(2q+1)^{d}9^{d(q-1)}\|f\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\,.

Similarly to 𝒬q,K\mathcal{Q}^{q,K}, for any fL([0,1]d)f\in L_{\infty}([0,1]^{d}), define a multivariate quasi-interpolant

𝒬dq,Kf(x)=j1,,jd=1q+K(j1,,jdf)Nj1(x1)Njd(xd).\mathcal{Q}_{d}^{q,K}f(x)=\sum\limits_{j_{1},\dots,j_{d}=1}^{q+K}(\mathcal{L}_{j_{1},\dots,j_{d}}f)N_{j_{1}}(x_{1})\dots N_{j_{d}}(x_{d}).

Similarly to NjN_{j}’s, we omit the upper indices q,Kq,K in the notation of 𝒬d\mathcal{Q}_{d} when they are clear from context. We are ready to formulate the main result of this section.

Theorem 3.

Let fβ([0,1]d,H)f\in\mathcal{H}^{\beta}([0,1]^{d},H) and fix a non-negative integer β\ell\leqslant\lfloor\beta\rfloor. Let the linear operator 𝒬dβ,K:L([0,1]d)𝒮dβ,K\mathcal{Q}_{d}^{\lfloor\beta\rfloor,K}:L_{\infty}([0,1]^{d})\rightarrow\mathcal{S}_{d}^{\lfloor\beta\rfloor,K} be as defined above. Then

f𝒬dβ,Kf([0,1]d)(2ed)βHKβ+9d(β1)(2β+1)2d+(2ed)βHKβ.\left\|f-\mathcal{Q}_{d}^{\lfloor\beta\rfloor,K}f\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}+\frac{9^{d(\lfloor\beta\rfloor-1)}(2\lfloor\beta\rfloor+1)^{2d+\ell}(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}\,.
Proof.

For simplicity, we adopt the notation q=βq=\lfloor\beta\rfloor. Fix a multi-index 𝜸0d{\bm{\gamma}}\in\mathbb{N}_{0}^{d}, |𝜸||{\bm{\gamma}}|\leqslant\ell. Then

D𝜸fD𝜸𝒬dfL([0,1]d)=maxj1,,jd{1,,q+K}D𝜸fD𝜸𝒬dfL(Aj1,,jd).\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}\mathcal{Q}_{d}f\|_{L_{\infty}([0,1]^{d})}=\max\limits_{j_{1},\dots,j_{d}\in\{1,\dots,q+K\}}\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}\mathcal{Q}_{d}f\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\,.

Consider a polynomial of degree qq

Pj1,,jdf(x)=𝜶:|𝜶|qD𝜶f(aj1,,jd)𝜶!(xaj1,,jd)𝜶,P^{f}_{j_{1},\dots,j_{d}}(x)=\sum\limits_{{\bm{\alpha}}:|{\bm{\alpha}}|\leqslant q}\frac{D^{\bm{\alpha}}f(a_{j_{1},\dots,j_{d}})}{{\bm{\alpha}}!}(x-a_{j_{1},\dots,j_{d}})^{\bm{\alpha}}\,,

where aj1,,jd=(aj1,,ajd)a_{j_{1},\dots,j_{d}}=(a_{j_{1}},\dots,a_{j_{d}}). Here we used the conventional notations 𝜶!=α1!α2!αd!{\bm{\alpha}}!=\alpha_{1}!\alpha_{2}!\dots\alpha_{d}! and

(xaj1,,jd)𝜶=(x1aj1)α1(x2aj2)α2(xdajd)αd.(x-a_{j_{1},\dots,j_{d}})^{\bm{\alpha}}=(x_{1}-a_{j_{1}})^{\alpha_{1}}(x_{2}-a_{j_{2}})^{\alpha_{2}}\dots(x_{d}-a_{j_{d}})^{\alpha_{d}}.

Since Pj1,,jdfP^{f}_{j_{1},\dots,j_{d}} belongs to 𝒮d\mathcal{S}_{d}, we have 𝒬dPj1,,jdf=Pj1,,jdf\mathcal{Q}_{d}P^{f}_{j_{1},\dots,j_{d}}=P^{f}_{j_{1},\dots,j_{d}}. By the triangle inequality,

D𝜸fD𝜸𝒬dfL(Aj1,,jd)D𝜸(fPj1,,jdf)L(Aj1,,jd)+D𝜸𝒬d(fPj1,,jdf)L(Aj1,,jd).\begin{split}\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}\mathcal{Q}_{d}f\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}&\leqslant\|D^{\bm{\gamma}}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\\ &\quad+\|D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}.\end{split} (32)

By Taylor’s theorem and the definition of the Hölder class, the first term in the right-hand side does not exceed

D𝜸fD𝜸Pj1,,jdfL(Aj1,,jd)\displaystyle\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
=𝜶:|𝜶|=q|𝜸|q(xaj1,,jd)𝜶𝜶!01(1s)q1\displaystyle=\Bigg{\|}\sum\limits_{{\bm{\alpha}}:|{\bm{\alpha}}|=q-|{\bm{\gamma}}|}\frac{q(x-a_{j_{1},\dots,j_{d}})^{\bm{\alpha}}}{{\bm{\alpha}}!}\int\limits_{0}^{1}(1-s)^{q-1}
(D𝜶+𝜸f(sx+(1s)aj1,,jd)D𝜶+𝜸f(aj1,,jd))dsL(Aj1,,jd)\displaystyle\quad\cdot\left(D^{{\bm{\alpha}}+{\bm{\gamma}}}f(sx+(1-s)a_{j_{1},\dots,j_{d}})-D^{{\bm{\alpha}}+{\bm{\gamma}}}f(a_{j_{1},\dots,j_{d}})\right)\mathrm{d}s\Bigg{\|}_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
α:|𝜶|=q|𝜸|q(q/K)q|𝜸|H𝜶!01(1s)q1xaj1,,jdβqdsL(Aj1,,jd)\displaystyle\leqslant\left\|\sum\limits_{\alpha:|{\bm{\alpha}}|=q-|{\bm{\gamma}}|}\frac{q(q/K)^{q-|{\bm{\gamma}}|}H}{{\bm{\alpha}}!}\int\limits_{0}^{1}(1-s)^{q-1}\|x-a_{j_{1},\dots,j_{d}}\|^{\beta-q}\mathrm{d}s\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
𝜶:|𝜶|=q|𝜸|(q/K)q|𝜸|H𝜶!(qd/K)βq(qd/K)β|𝜸|H(q|𝜸|)!.\displaystyle\leqslant\sum\limits_{{\bm{\alpha}}:|{\bm{\alpha}}|=q-|{\bm{\gamma}}|}\frac{(q/K)^{q-|{\bm{\gamma}}|}H}{{\bm{\alpha}}!}(q\sqrt{d}/K)^{\beta-q}\leqslant\frac{(qd/K)^{\beta-|{\bm{\gamma}}|}H}{(q-|{\bm{\gamma}}|)!}.

Here we used the fact that

𝜶0d:|𝜶|=mm!𝜶!=dmfor all m0.\sum\limits_{{\bm{\alpha}}\in\mathbb{N}_{0}^{d}:|{\bm{\alpha}}|=m}\frac{m!}{{\bm{\alpha}}!}=d^{m}\quad\text{for all $m\in\mathbb{N}_{0}$.}
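This identity is the multinomial theorem evaluated at (1,\dots,1). A brute-force check for small d and m (Python; illustrative only):

```python
import itertools
import math

def multinomial_sum(d, m):
    # Sum of m!/alpha! over all multi-indices alpha in N_0^d with |alpha| = m.
    return sum(
        math.factorial(m) // math.prod(math.factorial(a) for a in alpha)
        for alpha in itertools.product(range(m + 1), repeat=d)
        if sum(alpha) == m
    )

# Agrees with d**m over the tested range.
assert all(multinomial_sum(d, m) == d ** m for d in range(1, 5) for m in range(0, 6))
```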

Moreover, note that, for any r{0,1,,q1}r\in\{0,1,\dots,q-1\}, it holds that

qβr(qr)!=qqrqβr1(qr1)!qβr1(qr1)!.\frac{q^{\beta-r}}{(q-r)!}=\frac{q}{q-r}\cdot\frac{q^{\beta-r-1}}{(q-r-1)!}\geqslant\frac{q^{\beta-r-1}}{(q-r-1)!}.

This yields that qβr/(qr)!q^{\beta-r}/(q-r)! achieves its maximum over r{0,1,,q}r\in\{0,1,\dots,q\} at r=0r=0. Thus,

D𝜸fD𝜸Pj1,,jdfL(Aj1,,jd)(qd/K)β|𝜸|H(q|𝜸|)!qβ(d/K)β|𝜸|Hq!.\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\leqslant\frac{(qd/K)^{\beta-|{\bm{\gamma}}|}H}{(q-|{\bm{\gamma}}|)!}\leqslant\frac{q^{\beta}(d/K)^{\beta-|{\bm{\gamma}}|}H}{q!}.

Since

q!2πeqqq+1/2q!\geqslant\sqrt{2\pi}e^{-q}q^{q+1/2} (33)

for all qq\in\mathbb{N}, we obtain that

D𝜸fD𝜸Pj1,,jdfL(Aj1,,jd)(2e)q(d/K)β|𝜸|H.\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\leqslant(\sqrt{2}e)^{q}(d/K)^{\beta-|{\bm{\gamma}}|}H.

Now, we consider the second term in (32), that is, D𝜸𝒬d(fPj1,,jdf)D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}}). By the definition of 𝒬d\mathcal{Q}_{d}, we have

D𝜸𝒬d(fPj1,,jdf)=i1,,id=1q+Ki1,,id(fPj1,,jdf)D𝜸(Ni1(x1)Nid(xd)).D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})=\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\mathcal{L}_{i_{1},\dots,i_{d}}(f-P^{f}_{j_{1},\dots,j_{d}})D^{\bm{\gamma}}\left(N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right).

The triangle inequality and Proposition 2 imply that

D𝜸𝒬d(fPj1,,jdf)L(Aj1,,jd)\displaystyle\|D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
i1,,id=1q+K|i1,,id(fPj1,,jdf)|D𝜸Ni1(x1)Nid(xd)L(Aj1,,jd)\displaystyle\leqslant\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}|\mathcal{L}_{i_{1},\dots,i_{d}}(f-P^{f}_{j_{1},\dots,j_{d}})|\left\|D^{\bm{\gamma}}N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
(2q+1)d9d(q1)fPj1,,jdfL(Aj1,,jd)\displaystyle\leqslant(2q+1)^{d}9^{d(q-1)}\|f-P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
i1,,id=1q+KD𝜸Ni1(x1)Nid(xd)L(Aj1,,jd).\displaystyle\quad\cdot\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\left\|D^{\bm{\gamma}}N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}.

Again, by Taylor’s theorem,

fPj1,,jdfL(Aj1,,jd)(qd/K)βHq!.\|f-P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\leqslant\frac{(qd/K)^{\beta}H}{q!}.

By Proposition 1, for any m{1,,d}m\in\{1,\dots,d\}, supp(Nim)=[aim,aim+q+1]\text{supp}(N_{i_{m}})=[a_{i_{m}},a_{i_{m}+q+1}]. Hence, the intersection of supp(Nim)\text{supp}(N_{i_{m}}) with [ajm,ajm+q+1)[a_{j_{m}},a_{j_{m}+q+1}) has zero volume if |imjm|>q|i_{m}-j_{m}|>q. This yields

i1,,id=1q+KD𝜸Ni1(x1)Nid(xd)L(Aj1,,jd)\displaystyle\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\left\|D^{\bm{\gamma}}N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
=i1,,id=1q+KD𝜸Ni1(x1)Nid(xd)L(Aj1,,jd)m=1d𝟙(|imjm|q).\displaystyle=\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\left\|D^{\bm{\gamma}}N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\prod_{m=1}^{d}\mathbbm{1}\left(|i_{m}-j_{m}|\leqslant q\right).

Applying Corollary 2, we obtain that

D𝜸𝒬d(fPj1,,jdf)L(Aj1,,jd)\displaystyle\|D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
(2q+1)d9d(q1)(qd/K)βHq!i1,,id=1q+Km=1dq!(2K)γm(qγm)!𝟙(|imjm|q)\displaystyle\leqslant\frac{(2q+1)^{d}9^{d(q-1)}(qd/K)^{\beta}H}{q!}\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\prod_{m=1}^{d}\frac{q!(2K)^{\gamma_{m}}}{(q-\gamma_{m})!}\mathbbm{1}\left(|i_{m}-j_{m}|\leqslant q\right)
(2q+1)d9d(q1)(qd/K)βHq!i1,,id=1q+K(2Kq)|𝜸|m=1d𝟙(|imjm|q).\displaystyle\leqslant\frac{(2q+1)^{d}9^{d(q-1)}(qd/K)^{\beta}H}{q!}\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}(2Kq)^{|{\bm{\gamma}}|}\prod_{m=1}^{d}\mathbbm{1}\left(|i_{m}-j_{m}|\leqslant q\right).

The sum on the right-hand side contains at most (2q+1)d(2q+1)^{d} non-zero terms. Hence,

D𝜸𝒬d(fPj1,,jdf)L(Aj1,,jd)\displaystyle\|D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})} (2q+1)2d9d(q1)(qd/K)β(2Kq)|𝜸|Hq!\displaystyle\leqslant\frac{(2q+1)^{2d}9^{d(q-1)}(qd/K)^{\beta}(2Kq)^{|{\bm{\gamma}}|}H}{q!}
9d(q1)(2q+1)2d+|𝜸|(2ed)βHKβ|𝜸|,\displaystyle\leqslant\frac{9^{d(q-1)}(2q+1)^{2d+|{\bm{\gamma}}|}(\sqrt{2}ed)^{\beta}H}{K^{\beta-|{\bm{\gamma}}|}}\,,

where we used (33).