
Simultaneous approximation of a smooth function
and its derivatives by deep neural networks
with piecewise-polynomial activations

Denis Belomestny [email protected]
Alexey Naumov [email protected]
Nikita Puchkin [email protected]
Sergey Samsonov [email protected]
Duisburg-Essen University, Germany; HSE University, Russia; Institute for Information Transmission Problems RAS, Russia
Abstract

This paper investigates the approximation properties of deep neural networks with piecewise-polynomial activation functions. We derive the required depth, width, and sparsity of a deep neural network to approximate any Hölder smooth function up to a given approximation error in Hölder norms in such a way that all weights of this neural network are bounded by $1$. The latter feature is essential to control generalization errors in many statistical and machine learning applications.

keywords:
deep neural networks, approximation complexity, ReQU activations, $\mathrm{ReLU}^{k}$ activations, Hölder class.
MSC:
[2020] 41A25 , 68T07

1 Introduction

Neural networks have recently gained much attention due to their impressive performance in many complicated practical tasks, including image processing [1], generative modelling [2], reinforcement learning [3], numerical solution of PDEs, e.g., [4, 5], and optimal control [6, 7]. This makes them extremely useful in the design of self-driving vehicles [8] and robot control systems, e.g., [9, 10, 11]. One of the reasons for such a success of neural networks is their expressiveness, that is, the ability to approximate functions with any desired accuracy. The question of expressiveness of neural networks has a long history and goes back to the papers [12, 13, 14]. In particular, in [13], the author showed that one hidden layer is enough to approximate any continuous function $f$ with any prescribed accuracy $\varepsilon>0$. However, further analysis revealed that deep neural networks may require much fewer parameters than shallow ones to approximate $f$ with the same accuracy. In recent years, much effort has been put into understanding the fidelity of deep neural networks. In a pioneering work [15], the author showed that for any target function $f$ from the Sobolev space $\mathcal{W}^{n,\infty}([0,1]^{d})$ there is a neural network with $O(\varepsilon^{-d/n})$ parameters and ReLU activation function that approximates $f$ within the accuracy $\varepsilon$ with respect to the $L_{\infty}$-norm on the unit cube $[0,1]^{d}$. Further works in this direction considered various smoothness classes of the target functions [16, 17, 18, 19], neural networks with diverse activations [17, 20, 21, 22], domains of more complicated shape [23], and measured the approximation errors with respect to different norms [15, 17, 20, 24]. Several authors also considered the expressiveness of neural networks with different architectures. This includes wide neural networks of logarithmic [15, 17, 24] or even constant depth [25, 16, 20, 19], as well as deep and narrow networks [26, 27, 28]. Most of the existing results on the expressiveness of neural networks measure the quality of approximation with respect to either the $L_{\infty}$- or the $L_{p}$-norm, $p\geqslant 1$. Much fewer papers study the approximation of derivatives of smooth functions. These rare exceptions include [29, 17, 20].

In the present paper, we focus on feed-forward neural networks with piecewise-polynomial activation functions of the form $\sigma^{\mathsf{ReQU}}(x)=(x\vee 0)^{2}$. Neural networks with such activations are known to successfully approximate smooth functions from the Sobolev and Besov spaces with respect to the $L_{\infty}$- and $L_{p}$-norms (see, for instance, [30, 25, 16, 31, 32, 33, 34, 35]). We continue this line of research and study the ability of such neural networks to approximate not only smooth functions themselves but also their derivatives. We derive non-asymptotic upper bounds on the Hölder norm of the difference between the target function and its approximation from a class of sparsely connected neural networks with ReQU activations. In particular, we show that, for any $f$ from a Hölder ball $\mathcal{H}^{\beta}([0,1]^{d},H)$, $H>0$, $\beta>2$ (see Section 2 for the definition), and any $\varepsilon>0$, there exists a neural network with ReQU activation functions that uniformly approximates the target function in the norms of the Hölder spaces $\mathcal{H}^{\ell}([0,1]^{d})$ for all $\ell\in\{0,\ldots,\lfloor\beta\rfloor\}$. Here and further in the paper, $\lfloor\beta\rfloor$ stands for the largest integer which is strictly smaller than $\beta$. A simplified statement of our main result is given below.

Theorem 1 (simplified version of Theorem 2).

Fix $\beta>2$ and $p,d\in\mathbb{N}$. Then, for any $H>0$, for any $f:[0,1]^{d}\rightarrow\mathbb{R}^{p}$, $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, and any integer $K\geqslant 2$, there exists a neural network $h_{f}:[0,1]^{d}\rightarrow\mathbb{R}^{p}$ with ReQU activation functions such that it has $O(\log d+\lfloor\beta\rfloor+\log\log H)$ layers, at most $O(p\vee d(K+\lfloor\beta\rfloor)^{d})$ neurons in each layer, and $O(p(d\beta+d^{2}+\log\log H)(K+\lfloor\beta\rfloor)^{d})$ non-zero weights taking their values in $[-1,1]$. Moreover, it holds that

\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{C^{\beta d}H\beta^{\ell}}{K^{\beta-\ell}}\quad\text{for all }\ell\in\{0,\ldots,\lfloor\beta\rfloor\},

where $C$ is a universal constant.

We provide explicit expressions for the hidden constants in Theorem 2. The main contributions of our work can be summarized as follows.

  • 1.

    Given a smooth target function $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, we construct a neural network that simultaneously approximates all the derivatives of $f$ up to order $\lfloor\beta\rfloor$ with optimal dependence of the precision on the number of non-zero weights. That is, if we denote the number of non-zero weights in the network by $N$, then it holds that $\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}=O(N^{-(\beta-\ell)/d})$ simultaneously for all $\ell\in\{0,\ldots,\lfloor\beta\rfloor\}$.

  • 2.

    The constructed neural network has almost the same smoothness as the target function itself while approximating it with the optimal accuracy. This property turns out to be very useful in many applications including the approximation of PDEs and density transformations where we need to use derivatives of the approximation.

  • 3.

    The weights of the approximating neural network are bounded in absolute value by $1$. The latter property plays a crucial role in deriving bounds on the generalization error of empirical risk minimizers in terms of the covering number of the corresponding parametric class of neural networks. Note that the upper bounds on the weights provided in [29, 17, 20] blow up as the approximation error decreases.

The rest of the paper is organized as follows. In Section 2, we introduce necessary definitions and notations. Section 3 contains the statement of our main result, Theorem 2, followed by a detailed comparison with the existing literature. We then present numerical experiments in Section 4. The proofs are collected in Section 5. Some auxiliary facts are deferred to Appendix A.

2 Preliminaries and notations

Norms

For a matrix $A$ and a vector $v$, we denote by $\|A\|_{\infty}$ and $\|v\|_{\infty}$ the maximal absolute value of the entries of $A$ and $v$, respectively. $\|A\|_{0}$ and $\|v\|_{0}$ shall stand for the number of non-zero entries of $A$ and $v$, respectively. Finally, the Frobenius norm and the operator norm of $A$ are denoted by $\|A\|_{F}$ and $\|A\|$, respectively, and the Euclidean norm of $v$ is denoted by $\|v\|$. For a function $f:\Omega\to\mathbb{R}^{d}$, we set

\|f\|_{L_{\infty}(\Omega)}=\operatornamewithlimits{esssup}_{x\in\Omega}\|f(x)\|,
\|f\|_{L_{p}(\Omega)}=\left\{\int_{\Omega}\|f(x)\|^{p}\,\mathrm{d}x\right\}^{1/p},\quad p\geqslant 1.

If the domain $\Omega$ is clear from the context, we simply write $L_{\infty}$ or $L_{p}$ instead of $L_{\infty}(\Omega)$ or $L_{p}(\Omega)$, respectively.

Smoothness classes

Let $\Omega\subseteq\mathbb{R}^{d}$ and $f:\Omega\to\mathbb{R}^{m}$. For a multi-index ${\bm{\gamma}}=(\gamma_{1},\dots,\gamma_{d})\in\mathbb{N}_{0}^{d}$, we write $|{\bm{\gamma}}|=\sum_{i=1}^{d}\gamma_{i}$ and define the corresponding partial differential operator $D^{{\bm{\gamma}}}$ as

D^{{\bm{\gamma}}}f_{i}=\frac{\partial^{|{\bm{\gamma}}|}f_{i}}{\partial x_{1}^{\gamma_{1}}\cdots\partial x_{d}^{\gamma_{d}}},\quad i\in\{1,\dots,m\}\,,\quad\text{and}\quad\|D^{{\bm{\gamma}}}f\|_{L_{\infty}(\Omega)}=\max\limits_{1\leqslant i\leqslant m}\|D^{{\bm{\gamma}}}f_{i}\|_{L_{\infty}(\Omega)}\,.

For $s\in\mathbb{N}$, the function space $C^{s}(\Omega)$ consists of those functions over the domain $\Omega$ which have bounded and continuous derivatives up to order $s$ in $\Omega$. Formally,

C^{s}(\Omega)=\big\{f:\Omega\to\mathbb{R}^{m}:\quad\|f\|_{C^{s}}:=\max\limits_{|{\bm{\gamma}}|\leqslant s}\|D^{{\bm{\gamma}}}f\|_{L_{\infty}(\Omega)}<\infty\big\}.

For a function $f:\Omega\to\mathbb{R}^{m}$ and any positive number $0<\delta\leqslant 1$, the Hölder constant of order $\delta$ is given by

[f]_{\delta}:=\max_{i\in\{1,\ldots,m\}}\sup_{x\not=y\in\Omega}\frac{|f_{i}(x)-f_{i}(y)|}{\min\{1,\|x-y\|\}^{\delta}}\;. (1)

Now, for any $\beta>0$, we can define the Hölder ball $\mathcal{H}^{\beta}(\Omega,H)$. If we let $s=\lfloor\beta\rfloor$ be the largest integer strictly less than $\beta$, $\mathcal{H}^{\beta}(\Omega,H)$ contains all functions in $C^{s}(\Omega)$ with $\delta$-Hölder-continuous, $\delta=\beta-s>0$, partial derivatives of order $s$. Formally,

\mathcal{H}^{\beta}(\Omega,H)=\big\{f\in C^{s}(\Omega):\quad\|f\|_{\mathcal{H}^{\beta}}:=\max\{\|f\|_{C^{s}},\ \max\limits_{|{\bm{\gamma}}|=s}[D^{{\bm{\gamma}}}f]_{\delta}\}\leqslant H\big\}.

We also write $f\in\mathcal{H}^{\beta}(\Omega)$ if $f\in\mathcal{H}^{\beta}(\Omega,H)$ for some $H<\infty$.
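As a quick illustration of definition (1) (a numerical sketch of ours, not used anywhere in the paper), the Hölder constant of order $\delta$ can be estimated from below by maximizing the ratio in (1) over a finite set of points; the test function $\sqrt{x}$ below is an arbitrary choice.

```python
import numpy as np

def holder_constant_estimate(f, delta, n_points=200, d=1, seed=0):
    """Lower estimate of the Holder constant (1) of order delta for a scalar
    function f on [0, 1]^d, obtained by maximizing over random point pairs."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_points, d))
    fx = np.array([f(p) for p in x])
    diff_f = np.abs(fx[:, None] - fx[None, :])            # |f(x) - f(y)|
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                         # exclude x == y
    return float(np.max(diff_f / np.minimum(1.0, dist) ** delta))

# sqrt is 1/2-Holder on [0, 1] but not Lipschitz: the delta = 1 estimate
# keeps growing as more points are sampled near the origin.
print(holder_constant_estimate(lambda p: np.sqrt(p[0]), delta=0.5))
print(holder_constant_estimate(lambda p: np.sqrt(p[0]), delta=1.0))
```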

Neural networks

Fix an activation function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$. For a vector $v=(v_{1},\dots,v_{p})\in\mathbb{R}^{p}$, we define the shifted activation function $\sigma_{v}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}$ as

\sigma_{v}(x)=\bigl(\sigma(x_{1}-v_{1}),\dots,\sigma(x_{p}-v_{p})\bigr),\quad x=(x_{1},\dots,x_{p})\in\mathbb{R}^{p}.

Given a positive integer $L$ and a vector $\mathcal{A}=(p_{0},p_{1},\dots,p_{L+1})\in\mathbb{N}^{L+2}$, a neural network of depth $L+1$ (with $L$ hidden layers) and architecture $\mathcal{A}$ is a function of the form

f:\mathbb{R}^{p_{0}}\rightarrow\mathbb{R}^{p_{L+1}}\,,\quad f(x)=W_{L}\circ\sigma_{v_{L}}\circ W_{L-1}\circ\sigma_{v_{L-1}}\circ\dots\circ W_{1}\circ\sigma_{v_{1}}\circ W_{0}\circ x, (2)

where $W_{i}\in\mathbb{R}^{p_{i+1}\times p_{i}}$ are weight matrices and $v_{i}\in\mathbb{R}^{p_{i}}$ are shift (bias) vectors. The maximum number of neurons in one layer, $\|\mathcal{A}\|_{\infty}$, is called the width of the neural network. Next, we introduce a subclass $\mathsf{NN}(L,\mathcal{A},s)$ of neural networks of depth $L+1$ with architecture $\mathcal{A}$ and at most $s$ non-zero weights. That is, $\mathsf{NN}(L,\mathcal{A},s)$ consists of functions of the form (2) such that

\begin{cases}\|W_{0}\|_{\infty}\vee\max\limits_{1\leqslant\ell\leqslant L}\left\{\|W_{\ell}\|_{\infty}\vee\|v_{\ell}\|_{\infty}\right\}\leqslant 1\,,\\ \|W_{0}\|_{0}+\sum\limits_{\ell=1}^{L}\left(\|W_{\ell}\|_{0}+\|v_{\ell}\|_{0}\right)\leqslant s\,.\end{cases}

We also use the notation $\mathsf{NN}(L,\mathcal{A})$, standing for $\mathsf{NN}(L,\mathcal{A},\infty)$. Throughout the paper, we use the ReQU (rectified quadratic unit) activation function, defined as

\sigma(x)=(x\vee 0)^{2}.
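For concreteness, the following sketch (ours, purely illustrative) evaluates a network of the form (2) with the ReQU activation and computes the sparsity counted in the definition of $\mathsf{NN}(L,\mathcal{A},s)$; the random weights and the toy architecture are placeholders.

```python
import numpy as np

def requ(x):
    """ReQU activation sigma(x) = (x v 0)^2, applied elementwise."""
    return np.maximum(x, 0.0) ** 2

def evaluate(x, weights, shifts):
    """Evaluate f(x) = W_L o sigma_{v_L} o ... o W_1 o sigma_{v_1} o W_0 x, cf. (2).

    weights = [W_0, ..., W_L], shifts = [v_1, ..., v_L]."""
    z = weights[0] @ x
    for W, v in zip(weights[1:], shifts):
        z = W @ requ(z - v)            # shifted activation sigma_v, then linear map
    return z

def sparsity(weights, shifts):
    """Number of non-zero weights, as in the definition of NN(L, A, s)."""
    return sum(np.count_nonzero(W) for W in weights) + \
           sum(np.count_nonzero(v) for v in shifts)

# toy network with architecture A = (2, 4, 1): one hidden layer of width 4
rng = np.random.default_rng(0)
W0, W1, v1 = rng.uniform(-1, 1, (4, 2)), rng.uniform(-1, 1, (1, 4)), rng.uniform(-1, 1, 4)
print(evaluate(np.array([0.3, 0.7]), [W0, W1], [v1]), sparsity([W0, W1], [v1]))
```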

Concatenation and parallelization of neural networks

During the construction of the approximating network in Theorem 2, we use the operations of concatenation (consecutive connection) and parallel connection of neural networks. Given neural networks $g$ and $h$ of architectures $(p_{0},p_{1},\dots,p_{L},p_{out})$ and $(p_{in},p_{L+1},p_{L+2},\dots,p_{L+M+1})$, respectively, such that $p_{in}=p_{out}$, the concatenation of $g$ and $h$ is their composition $h\circ g$, that is, a neural network of the architecture $(p_{0},p_{1},\dots,p_{L},p_{L+1},p_{L+2},\dots,p_{L+M+1})$. The parallel connection of neural networks is defined as follows. Let

f(x)=W_{L}\circ\sigma_{v_{L}}\circ W_{L-1}\circ\sigma_{v_{L-1}}\circ\dots\circ W_{1}\circ\sigma_{v_{1}}\circ W_{0}\circ x

and

\widetilde{f}(x)=\widetilde{W}_{L}\circ\sigma_{\widetilde{v}_{L}}\circ\widetilde{W}_{L-1}\circ\sigma_{\widetilde{v}_{L-1}}\circ\dots\circ\widetilde{W}_{1}\circ\sigma_{\widetilde{v}_{1}}\circ\widetilde{W}_{0}\circ x

be two neural networks of the same depth with architectures $(p_{0},p_{1},\dots,p_{L},p_{L+1})$ and $(p_{0},\widetilde{p}_{1},\widetilde{p}_{2},\dots,\widetilde{p}_{L},\widetilde{p}_{L+1})$, respectively. Define

\overline{W}_{0}=\begin{pmatrix}W_{0}\\ \widetilde{W}_{0}\end{pmatrix}\,,\quad\overline{W}_{\ell}=\begin{pmatrix}W_{\ell}&O\\ O&\widetilde{W}_{\ell}\end{pmatrix}\,,\quad\text{and}\quad\overline{v}_{\ell}=\begin{pmatrix}v_{\ell}\\ \widetilde{v}_{\ell}\end{pmatrix}\quad\text{for all }\ell\in\{1,\dots,L\}.

Then the parallel connection of $f$ and $\widetilde{f}$ is a neural network $\overline{f}$ of architecture $(p_{0},p_{1}+\widetilde{p}_{1},p_{2}+\widetilde{p}_{2},\dots,p_{L}+\widetilde{p}_{L},p_{L+1}+\widetilde{p}_{L+1})$, given by

\overline{f}(x)=\overline{W}_{L}\circ\sigma_{\overline{v}_{L}}\circ\overline{W}_{L-1}\circ\sigma_{\overline{v}_{L-1}}\circ\dots\circ\overline{W}_{1}\circ\sigma_{\overline{v}_{1}}\circ\overline{W}_{0}\circ x.

Note that the number of non-zero weights of $\overline{f}$ equals the sum of the numbers of non-zero weights of $f$ and $\widetilde{f}$.
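In code, the parallel connection is just this block bookkeeping; the sketch below (ours, NumPy-based) builds $\overline{W}_{0}$ by stacking the first-layer matrices and the remaining $\overline{W}_{\ell}$, $\overline{v}_{\ell}$ as block-diagonal matrices and concatenated shift vectors.

```python
import numpy as np

def block_diag(A, B):
    """Block-diagonal matrix diag(A, B) used for the hidden layers of the parallel network."""
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

def parallel(weights_f, shifts_f, weights_g, shifts_g):
    """Parallel connection of two networks of equal depth with common input width p_0."""
    W0 = np.vstack([weights_f[0], weights_g[0]])             # stacked first layers
    Ws = [W0] + [block_diag(A, B) for A, B in zip(weights_f[1:], weights_g[1:])]
    vs = [np.concatenate([u, w]) for u, w in zip(shifts_f, shifts_g)]
    return Ws, vs
```

Because the off-diagonal blocks are zero, the non-zero weight count of the combined network is exactly the sum of the two individual counts, as noted above.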

3 Main results

3.1 Approximation of functions from Hölder classes

Our main result states that any function from $\mathcal{H}^{\beta}([0,1]^{d},H)$, $H>0$, $\beta>2$, can be approximated by a feed-forward deep neural network with ReQU activation functions in $\mathcal{H}^{\ell}([0,1]^{d})$, $\ell\in\{0,\dots,\lfloor\beta\rfloor\}$.

Theorem 2.

Let $\beta>2$ and let $p,d\in\mathbb{N}$. Then, for any $H>0$, for any $f:[0,1]^{d}\rightarrow\mathbb{R}^{p}$, $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, and any integer $K\geqslant 2$, there exists a neural network $h_{f}:[0,1]^{d}\rightarrow\mathbb{R}^{p}$ of the width

\bigl(4d(K+\lfloor\beta\rfloor)^{d}\bigr)\vee 12\left((K+2\lfloor\beta\rfloor)+1\right)\vee p

with

6+2(\lfloor\beta\rfloor-2)+\lceil\log_{2}d\rceil+2\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)

hidden layers and at most $p(K+\lfloor\beta\rfloor)^{d}C(\beta,d,H)$ non-zero weights taking their values in $[-1,1]$, such that, for any $\ell\in\{0,\dots,\lfloor\beta\rfloor\}$,

\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}+\frac{9^{d(\lfloor\beta\rfloor-1)}(2\lfloor\beta\rfloor+1)^{2d+\ell}(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}. (3)

The above constant $C(\beta,d,H)$ is given by

C(\beta,d,H)=\left(60\bigl(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\bigr)+38\right)+20d^{2}+144d\lfloor\beta\rfloor+8d.

An illustrative comparison of the result of Theorem 2 with the literature is provided in Table 1 below. Theorem 2 improves over the results of [25, Theorem 3.3] and [16, Theorem 7] as far as the approximation properties of ReQU neural networks in terms of the $L_{\infty}$-norm are concerned. Namely, Theorem 2 claims that, for any $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, $\beta>2$, and any $\varepsilon>0$, there is a ReQU neural network of width $O(\varepsilon^{-d/\beta})$ and depth $O(1)$ with $O(\varepsilon^{-d/\beta})$ non-zero weights, taking their values in $[-1,1]$, which approximates $f$ within the accuracy $\varepsilon$ with respect to the $L_{\infty}$-norm on $[0,1]^{d}$. In [25, 16], the authors considered a target function from a general weighted Sobolev class but measured the quality of approximation in terms of a weighted $L_{2}$-norm. Nevertheless, the width and the number of non-zero weights of the constructed neural networks in Theorem 2, [25, Theorem 3.3], and [16, Theorem 7] coincide. The difference is that, while in [25, 16] the depth of the neural networks is of order $O(\log(1/\varepsilon))$, we need only $O(1)$ layers. More importantly, the authors of [25, 16] do not provide any guarantees on the absolute values of the weights of the approximating neural networks. At the same time, all the weights of our neural network take their values in $[-1,1]$. We will explain the importance of the latter property a bit later in this section. It is also worth mentioning that the same number of non-zero weights $O(\varepsilon^{-d/\beta})$ is needed to approximate $f$ within the tolerance $\varepsilon$ (with respect to the $L_{\infty}$-norm) via neural networks with other activation functions, such as ReLU [36], sigmoid [22, 17], and $\arctan$ [17].

The approximation properties of deep neural networks with respect to norms involving derivatives are much less studied. In the rest of this section, we elaborate on the comparison with the state-of-the-art results in this direction. The fact that shallow neural networks can simultaneously approximate a smooth function and its derivatives has been known for a long time from the paper [37], where the authors derived an upper bound on the approximation accuracy in terms of the modulus of smoothness. To be more precise, they showed that if a function $f$ is continuous on a compact set $\mathcal{K}\subset\mathbb{R}^{d}$ and its derivatives up to an order $s\in\mathbb{Z}_{+}$ are in $L_{p}(\mathcal{K})$, then there is a neural network $g_{f}$ with

(K+1)\sum\limits_{j=0}^{K}\binom{j+d-1}{d-1}=O(K^{d+1})

hidden units, such that

\left\|D^{\bm{\gamma}}g_{f}-D^{\bm{\gamma}}f\right\|_{L_{p}(\mathcal{K})}\lesssim\omega\left(D^{\bm{\gamma}}f,\frac{1}{\sqrt{K}}\right)_{p}+\frac{\|D^{\bm{\gamma}}f\|_{L_{p}(\mathcal{K})}}{K}\quad\text{for all }{\bm{\gamma}}\in\mathbb{Z}_{+}^{d}\text{ such that }|{\bm{\gamma}}|\leqslant s.

Here

\omega(\phi,\delta)_{p}=\sup\limits_{0<t\leqslant\delta}\|\phi(\cdot+t)-\phi(\cdot)\|_{L_{p}(\mathcal{K})},\quad\delta>0,

is the modulus of smoothness of a function $\phi$. Moreover, if, in addition, the derivatives of $f$ of order $s$ are $\alpha$-Hölder, $\alpha\in(0,1]$, then it holds that

\|f-g_{f}\|_{\mathcal{H}^{s}(\mathcal{K})}\lesssim K^{-\alpha/2}.

Taking into account that $\omega(\phi,\delta)_{p}\lesssim\delta$ for a Lipschitz function $\phi$, we conclude that one has to use a shallow neural network with at least $\Omega(\varepsilon^{-2(d+1)})$ hidden units to approximate a function of interest or its derivatives within the accuracy $\varepsilon$ with respect to the $L_{p}(\mathcal{K})$-norm, even if $f$ is sufficiently smooth. This bound becomes prohibitive in the case of large dimension. The situation is much better in the case of deep neural networks. To our knowledge, Gühring and Raslan [17, Proposition 4.8] were the first to prove that, for any $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, $\varepsilon>0$, and $\ell\in\{0,\dots,\lfloor\beta\rfloor\}$, there is a ReQU neural network with $O(\varepsilon^{-d/(\beta-\ell)})$ non-zero weights which approximates $f$ within the accuracy $\varepsilon$ with respect to the $\mathcal{H}^{\ell}$-norm on $[0,1]^{d}$. A drawback of the result in [17] is that the architecture of the suggested neural network heavily depends on $\ell$. Hence, it is hard to control higher-order derivatives of the approximating neural network itself, which is of interest in such applications as numerical solution of PDEs and density transformations. This question was addressed in [20, 38], where the authors considered the problem of simultaneous approximation of a target function with respect to Hölder norms of different orders. In [20, Theorem 5.1], the authors showed that, for any $f\in\mathcal{H}^{s}([0,1]^{d},H)$, $s\in\mathbb{N}$, and any sufficiently large integer $K$, there is a three-layer neural network $g_{f}$ of width $O(K^{d})$ with $\tanh$ activations such that

\|f-g_{f}\|_{\mathcal{H}^{\ell}([0,1]^{d})}=O\left(\frac{(\log K)^{\ell}}{K^{s-\ell}}\right)

simultaneously for all $\ell\in\{0,1,\dots,s-1\}$. Note that Theorem 2 yields a sharper bound, removing the extra logarithmic factors. Regarding the results of the recent work [38], we found some critical flaws in the proofs of the main results. We explain our concerns in Remark 1 below, as the main results of [38] are close to ours. Theorem 2 also improves over [17, 20, 38] in another aspect. Namely, Theorem 2 guarantees that the weights of the neural network take their values in $[-1,1]$, while in [17, 20, 38] they may grow polynomially as the approximation error decreases. This boundedness property is extremely important when studying the generalization ability of neural networks in various statistical problems, since the metric entropy of the corresponding parametric class of neural networks involves a uniform upper bound on the weights (see, for instance, [24, Theorem 2 and Lemma 5]). Besides that, a polynomial upper bound on the absolute values of the weights becomes prohibitive when approximating analytic functions (see Remark 2 in the next section).

Remark 1.

The proofs of Theorem 3.1 and Theorem 3.8 in [38] have critical flaws. In particular, on page 18, the authors mistakenly bound the Sobolev norm $\mathcal{W}^{1,\infty}$ of the composite function $\phi(x)=\widetilde{\phi}(\phi_{\alpha}(\psi(x))/\alpha!,P_{\alpha}(h))$ by the $\mathcal{W}^{1,\infty}$-norm of $\widetilde{\phi}$. A similar error appears on page 24. In this way, the authors obtain that the $\mathcal{W}^{1,\infty}$-norm of $\phi(x)$ is bounded by $432s^{d}$, where $s$ is the smoothness of the target function. In fact, the latter norm should scale as $1/\delta$, where $\delta>0$ is a small auxiliary parameter describing the width of the boundary strips. This flaw completely ruins the proofs of the main results in [38], Theorem 1.1 (see the 9th line of the proof on p. 8) and Theorem 1.4 (the 9th line of the proof on p. 25).

Table 1: Comparison of the state-of-the-art results on approximation of a function $f\in\mathcal{H}^{\beta}([0,1]^{d},H)$, $\beta>2$, $H>0$, within the accuracy $\varepsilon$ via neural networks. The papers marked with * consider the case of integer $\beta$ only.

| Paper | Norm | Depth | Non-zero weights | Simultaneous approximation | Weights in $[-1,1]$ |
| [16]* | $L_{2}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/\beta})$ | N/A | |
| [17]* | $\mathcal{H}^{\ell}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/(\beta-\ell)})$ | no | no |
| [20] | $\mathcal{H}^{\ell}$ | $3$ | $O(\varepsilon^{-d/(\beta-\ell)}(\log(1/\varepsilon))^{\ell})$ | yes | no |
| [22] | $L_{\infty}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/\beta})$ | N/A | |
| [24] | $L_{\infty}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/\beta})$ | N/A | |
| [25]* | $L_{2}$ | $O(\log(1/\varepsilon))$ | $O(\varepsilon^{-d/\beta})$ | N/A | |
| [37]* | $\mathcal{H}^{\ell}$ | $2$ | $O(\varepsilon^{-2(d+1)/(1\wedge(\beta-\ell))})$ | yes | |
| Ours | $\mathcal{H}^{\ell}$ | $O(1)$ | $O(\varepsilon^{-d/(\beta-\ell)})$ | yes | yes |

3.2 Approximation of analytic functions

The bound of Theorem 2 can be transformed into exponential (in $K$) rates of approximation for analytic functions. In the rest of this section, we consider $(Q,R)$-analytic functions, defined below.

Definition 1.

A function $f\in C^{\infty}(\mathbb{R}^{d})$ is called $(Q,R)$-analytic with $Q,R>0$ if it satisfies the inequality

\left\|f\right\|_{\mathcal{H}^{s}([0,1]^{d})}\leqslant QR^{-s}s!\quad\text{for all }s\in\mathbb{N}_{0}. (4)

Similar concepts were also considered in [39, 40, 20]. Applying Theorem 2 to $(Q,R)$-analytic functions, we get the following corollary.

Corollary 1.

Let $f$ be a $(Q,R)$-analytic function and let $\ell\in\mathbb{N}_{0}$. Then, for any integer $K>2ed\,9^{d}/R$, there exists a neural network $h_{f}:[0,1]^{d}\to\mathbb{R}^{p}$ of width $O((K+\ell)^{d}\vee p)$ with $O(K+\ell)$ hidden layers and at most $O(p(K+\ell)^{d}\log K)$ non-zero weights, taking values in $[-1,1]$, such that

\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{Qe(\sqrt{2}\,ed\,9^{d})^{\ell}(2s-1)^{2d+2\ell}}{R^{\ell}}\exp\left\{-\left\lfloor\frac{KR}{2ed\,9^{d}}\right\rfloor\right\}. (5)

Here $s$ is the integer specified in (29) in the proof below.
Remark 2.

In [20, Corollary 5.6], the authors claim that $(Q,R)$-analytic functions can be approximated with exponential (in the number of non-zero weights $N$) rates with respect to the Sobolev norm $\mathcal{W}^{k,\infty}([0,1]^{d})$, where $k$ is a fixed integer (see eq. (112) on p. 743). However, a closer look at Corollary 5.6 reveals that the right-hand side of eq. (112) includes a constant $c_{d,k,\alpha,f}$ with $\alpha$ depending on $N$ too, as follows from eq. (54). Thus, the dependence of the final bound on $N$ in Corollary 5.6 remains unclear. Besides that, in contrast to [20, Theorem 5.2], the authors do not specify the upper bound on the absolute values of the weights of the constructed neural network in Corollary 5.6. A thorough inspection of the proof shows that the weights can be as large as $O(N^{N^{2}})$.

4 Numerical experiments

In this section, we provide numerical experiments to illustrate the approximation properties of neural networks with ReQU activations. We considered a scalar function $f(x)=\sin(x_{1}^{2}x_{2})$ of two variables and approximated it on the unit square $[0,1]^{2}$ via neural networks with two types of activations: ReLU and ReQU. All the neural networks were fully connected, and all their hidden layers had width $16$. The first layer had width $2$. The depth of the neural networks took its values in $\{1,2,3,4,5\}$. In the training phase, we sampled $N=10000$ points $X_{1},\dots,X_{N}$ independently from the uniform distribution on $[0,1]^{2}$ and tuned the weights of the neural networks by minimizing the mean empirical squared error

\mathrm{ERR}(h)=\frac{1}{N}\sum\limits_{i=1}^{N}(h(X_{i})-f(X_{i}))^{2}.

After that, we computed the approximation errors of $f$ and of its gradient on the two-dimensional grid $G=\{0,1/M,\dots,(M-1)/M,1\}^{2}$ with $M=500$:

\frac{1}{M^{2}}\sum\limits_{x\in G}(\widehat{h}(x)-f(x))^{2}\quad\text{and}\quad\frac{1}{M^{2}}\sum\limits_{x\in G}\left\|\nabla\widehat{h}(x)-\nabla f(x)\right\|^{2},

where $\widehat{h}$ denotes the neural network with the weights tuned during the training phase. We repeated the experiment $10$ times and computed the average approximation errors. The results are displayed in Figure 1. The quality of approximation by neural networks with ReQU activations turns out to be better than by the ones with ReLU activations. Note that in such a scenario the stochastic error becomes small, and the quality of learning is mostly determined by the approximation error. This claim is supported by the fact that the error values were similar across the repeated experiments.

Figure 1: The mean squared error of approximation of the target function $f=\sin(x_{1}^{2}x_{2})$ (left) and its gradient (right) by neural networks with different activations.
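For completeness, a minimal sketch of this experiment is given below. The paper does not state the software framework, optimizer, learning rate, or number of training steps, so those choices (PyTorch, Adam, learning rate $10^{-3}$, full-batch training) are ours and purely illustrative.

```python
import torch
import torch.nn as nn

class ReQU(nn.Module):
    def forward(self, x):
        return torch.clamp(x, min=0.0) ** 2                 # sigma(x) = (x v 0)^2

def make_net(depth, width=16, activation=ReQU):
    layers, d_in = [], 2                                     # input dimension 2
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), activation()]
        d_in = width
    return nn.Sequential(*layers, nn.Linear(d_in, 1))

def target(x):                                               # f(x) = sin(x_1^2 x_2)
    return torch.sin(x[:, :1] ** 2 * x[:, 1:2])

torch.manual_seed(0)
X = torch.rand(10_000, 2)                                    # uniform sample on [0, 1]^2
net = make_net(depth=3)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)            # optimizer and lr: our choice
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(X) - target(X)) ** 2).mean()                # ERR(h)
    loss.backward()
    opt.step()

# approximation errors of f and of its gradient on the grid G with M = 500
M = 500
g = torch.linspace(0.0, 1.0, M + 1)
grid = torch.cartesian_prod(g, g).requires_grad_(True)
out = net(grid)
grad_net = torch.autograd.grad(out.sum(), grid)[0]           # each row depends only on its own input
grad_f = torch.stack([2 * grid[:, 0] * grid[:, 1] * torch.cos(grid[:, 0] ** 2 * grid[:, 1]),
                      grid[:, 0] ** 2 * torch.cos(grid[:, 0] ** 2 * grid[:, 1])], dim=1)
err_f = ((out - target(grid)) ** 2).mean().item()
err_grad = ((grad_net - grad_f) ** 2).sum(dim=1).mean().item()
print(err_f, err_grad)
```

Passing `activation=nn.ReLU` to `make_net` reproduces the ReLU baseline used for comparison.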

5 Proofs

5.1 Proof of Theorem 2

Step 1. Let $f=(f_{1},\dots,f_{p})$. Consider a vector $a=(a_{1},\dots,a_{2\lfloor\beta\rfloor+K+1})$ such that

a_{1}=\ldots=a_{\lfloor\beta\rfloor+1}=0\,;
a_{\lfloor\beta\rfloor+1+j}=j/K,\quad 1\leqslant j\leqslant K-1\,; (6)
a_{\lfloor\beta\rfloor+K+1}=\ldots=a_{2\lfloor\beta\rfloor+K+1}=1\,.

By 3, there exist tensor-product splines $S_{f}^{\lfloor\beta\rfloor,K}=(S_{f,1}^{\lfloor\beta\rfloor,K},\dots,S_{f,p}^{\lfloor\beta\rfloor,K})$ of order $\lfloor\beta\rfloor\geqslant 2$, associated with knots at $\big\{(a_{j_{1}},a_{j_{2}},\dots,a_{j_{d}}):j_{1},\dots,j_{d}\in\{1,\dots,2\lfloor\beta\rfloor+K+1\}\big\}$, such that

\left\|f-S_{f}^{\lfloor\beta\rfloor,K}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\max_{m\in\{1,\dots,p\}}\left\|f_{m}-S_{f,m}^{\lfloor\beta\rfloor,K}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}
\leqslant\frac{(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}+\frac{9^{d(\lfloor\beta\rfloor-1)}(2\lfloor\beta\rfloor+1)^{2d+\ell}(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}.

Our goal is to show that $S_{f}^{\lfloor\beta\rfloor,K}$ can be represented by a neural network $h_{f}$ with ReQU activation functions. For this purpose, for each $m\in\{1,\dots,p\}$, we use an expansion of $S_{f,m}^{\lfloor\beta\rfloor,K}$ with respect to the basis

\big\{N_{j_{1}}^{\lfloor\beta\rfloor,K}(x_{1})\cdot N_{j_{2}}^{\lfloor\beta\rfloor,K}(x_{2})\cdot\ldots\cdot N_{j_{d}}^{\lfloor\beta\rfloor,K}(x_{d}):j_{1},\dots,j_{d}\in\{1,\dots,\lfloor\beta\rfloor+K\}\big\},

where $N^{\lfloor\beta\rfloor,K}_{1},\dots,N^{\lfloor\beta\rfloor,K}_{\lfloor\beta\rfloor+K}$ are normalized $B$-splines defined in (31). There exist coefficients $\{w^{(f)}_{m,j_{1},j_{2},\dots,j_{d}}:1\leqslant m\leqslant p,\ 1\leqslant j_{1},\dots,j_{d}\leqslant\lfloor\beta\rfloor+K\}$ such that, for any $m\in\{1,\dots,p\}$,

\begin{split}S_{f,m}^{\lfloor\beta\rfloor,K}&=\sum\limits_{j_{1},\dots,j_{d}=1}^{\lfloor\beta\rfloor+K}w^{(f)}_{m,j_{1},\dots,j_{d}}\prod\limits_{\ell=1}^{d}N_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})\\ &=\sum\limits_{j_{1},\dots,j_{d}=1}^{\lfloor\beta\rfloor+K}w^{(f)}_{m,j_{1},\dots,j_{d}}\prod\limits_{\ell=1}^{d}(a_{j_{\ell}+\lfloor\beta\rfloor+1}-a_{j_{\ell}})B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})\,,\end{split} (7)

where $B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})$ are (unnormalized) $B$-splines defined in (30). Hence, in order to represent $S_{f}^{\lfloor\beta\rfloor,K}=(S_{f,1}^{\lfloor\beta\rfloor,K},\dots,S_{f,p}^{\lfloor\beta\rfloor,K})$ by a neural network with ReQU activations, we first perform this task for the products of basis functions $\prod_{\ell=1}^{d}B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})$, $1\leqslant j_{1},\dots,j_{d}\leqslant\lfloor\beta\rfloor+K$.
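Before turning to the network construction, the tensor-product structure of (7) can be illustrated numerically (a sketch of ours, independent of the ReQU construction; the coefficient array and the univariate basis values below are placeholders, not the actual $B$-splines):

```python
import numpy as np

def tensor_product_expansion(w, basis_values):
    """Evaluate sum_{j_1,...,j_d} w[j_1,...,j_d] * prod_l N_{j_l}(x_l), cf. (7).

    w            : coefficient array with one axis per coordinate
    basis_values : list of d vectors, basis_values[l][j] = value of the j-th
                   univariate basis function at x_l (placeholders here)."""
    d = w.ndim
    letters = "abcdefghijklmnopqrstuvwxyz"[:d]
    spec = letters + "," + ",".join(letters) + "->"   # e.g. 'ab,a,b->' for d = 2
    return np.einsum(spec, w, *basis_values)

# toy check for d = 2 with 4 basis functions per coordinate
rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, (4, 4))
N1, N2 = rng.uniform(0, 1, 4), rng.uniform(0, 1, 4)
direct = sum(w[i, j] * N1[i] * N2[j] for i in range(4) for j in range(4))
assert np.isclose(tensor_product_expansion(w, [N1, N2]), direct)
```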

Step 2. Applying Lemma 3 with $q=\lfloor\beta\rfloor$ component-wise for each $x_{i}$, $i\in\{1,\ldots,d\}$, we obtain that the mapping

x=(x_{1},\dots,x_{d})\mapsto\Big(x_{1},K,B_{1}^{\lfloor\beta\rfloor,K}(x_{1}),\dots,B_{\lfloor\beta\rfloor+K}^{\lfloor\beta\rfloor,K}(x_{1}),
x_{2},K,B_{1}^{\lfloor\beta\rfloor,K}(x_{2}),\ldots,B_{\lfloor\beta\rfloor+K}^{\lfloor\beta\rfloor,K}(x_{2}),\dots, (8)
x_{d},K,B_{1}^{\lfloor\beta\rfloor,K}(x_{d}),\ldots,B_{\lfloor\beta\rfloor+K}^{\lfloor\beta\rfloor,K}(x_{d})\Big)

can be represented by a neural network from the class $\mathsf{NN}\big(4+2(\lfloor\beta\rfloor-2),(d,d\mathcal{A}_{1})\big)$, where

\mathcal{A}_{1}=\big(\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{\lfloor\beta\rfloor},K+\lfloor\beta\rfloor+2\big), (9)

and $\mathcal{B}_{2},\dots,\mathcal{B}_{\lfloor\beta\rfloor}$ are defined in (15) and (21). Here $d\mathcal{A}_{1}$ should be understood as element-wise multiplication of the entries of $\mathcal{A}_{1}$ by $d$. Since each coordinate transformation in (8) can be computed independently, the number of parameters in this network does not exceed

72d\lfloor\beta\rfloor(K+2\lfloor\beta\rfloor).

Using Lemma 1 with $k=d$, for any fixed $j_{1},\dots,j_{d}\in\{1,\dots,\lfloor\beta\rfloor+K\}$, we calculate all products $\prod_{\ell=1}^{d}B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})$ by a neural network from

\mathsf{NN}\left(\lceil\log_{2}d\rceil,\big((K+\lfloor\beta\rfloor+2)d,(K+\lfloor\beta\rfloor)^{d}\mathcal{A}_{2},(K+\lfloor\beta\rfloor)^{d}\big)\right)

with

\mathcal{A}_{2}=\bigl(\underbrace{2^{\lceil\log_{2}d\rceil+1},2^{\lceil\log_{2}d\rceil},\ldots,4}_{\lceil\log_{2}d\rceil}\bigr)\,. (10)

Note that this network contains at most

5(K+\lfloor\beta\rfloor)^{d}2^{2\lceil\log_{2}d\rceil}\leqslant 20d^{2}(K+\lfloor\beta\rfloor)^{d}

non-zero parameters.

Step 3. Let us recall that, due to (7), we have

S_{f,m}^{\lfloor\beta\rfloor,K}=\sum\limits_{j_{1},\dots,j_{d}=1}^{\lfloor\beta\rfloor+K}w^{(f)}_{m,j_{1},\dots,j_{d}}\prod\limits_{\ell=1}^{d}(a_{j_{\ell}+\lfloor\beta\rfloor+1}-a_{j_{\ell}})\prod\limits_{\ell=1}^{d}B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell}) (11)

for all $m\in\{1,\dots,p\}$. At Step 2, we constructed a network which calculates the products $\prod_{\ell=1}^{d}B_{j_{\ell}}^{\lfloor\beta\rfloor,K}(x_{\ell})$ for all $(j_{1},\dots,j_{d})\in\{1,\dots,\lfloor\beta\rfloor+K\}^{d}$. Now we implement the multiplications by

\tilde{w}^{(f)}_{m,j_{1},\dots,j_{d}}:=w^{(f)}_{m,j_{1},\dots,j_{d}}\prod\limits_{\ell=1}^{d}(a_{j_{\ell}+\lfloor\beta\rfloor+1}-a_{j_{\ell}}).

2 implies that

\left|w^{(f)}_{m,j_{1},\dots,j_{d}}\right|\leqslant(2\lfloor\beta\rfloor+1)^{d}9^{d(\lfloor\beta\rfloor-1)}\|f_{m}\|_{L_{\infty}([0,1]^{d})}\leqslant(2\lfloor\beta\rfloor+1)^{d}9^{d(\lfloor\beta\rfloor-1)}H.

Since $\beta>2$ by the conditions of the theorem, we can write

\left|w^{(f)}_{m,j_{1},\dots,j_{d}}\right|\leqslant(2\lfloor\beta\rfloor+1)^{d\lfloor\beta\rfloor}H.

Equation (6) yields that $0\leqslant a_{j+\lfloor\beta\rfloor+1}-a_{j}\leqslant 1$ for all $j\in\{1,\ldots,\lfloor\beta\rfloor+K\}$. In view of Lemma 4, we can implement the multiplication by $\tilde{w}^{(f)}_{m,j_{1},\dots,j_{d}}$ by a network from the class

\mathsf{NN}\left(2\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)+2,\big(1,5,5,\dots,5,4,1\big)\right).

Here we used the fact that, since $\lfloor\beta\rfloor\geqslant 2$,

\log_{4}\left(\log_{4}\big((2\lfloor\beta\rfloor+1)^{d\lfloor\beta\rfloor}H\big)\right)\leqslant\log_{4}\left(2\log_{4}\big((2\lfloor\beta\rfloor+1)^{d\lfloor\beta\rfloor}\big)\vee 2\log_{4}H\right)
=\log_{4}\left(\log_{2}\big((2\lfloor\beta\rfloor+1)^{d\lfloor\beta\rfloor}\big)\vee\log_{2}H\right)
=\log_{4}\left(d\lfloor\beta\rfloor\log_{2}(2\lfloor\beta\rfloor+1)\right)\vee\log_{4}\left(\log_{2}H\right)
\leqslant\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H.

Hence, the representation (11) implies that the function $S_{f}^{\lfloor\beta\rfloor,K}$ can be represented by a neural network with

4+2(\lfloor\beta\rfloor-2)+\lceil\log_{2}d\rceil+2\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)+2

hidden layers and the architecture

\left(d,d\mathcal{A}_{1},(K+\lfloor\beta\rfloor)^{d}\mathcal{A}_{2},\underbrace{5(K+\lfloor\beta\rfloor)^{d},5(K+\lfloor\beta\rfloor)^{d},\dots,5(K+\lfloor\beta\rfloor)^{d}}_{2\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)+1\text{ times}},4(K+\lfloor\beta\rfloor)^{d},p\right),

where $\mathcal{A}_{1}$ and $\mathcal{A}_{2}$ are given by (9) and (10), respectively. As before, the weights of the constructed neural network are bounded by $1$ in absolute value. Note that this network contains at most

p(K+\lfloor\beta\rfloor)^{d}\left(60\left(\left\lceil\log_{2}(2d\lfloor\beta\rfloor+d)\vee\log_{2}\log_{2}H\right\rceil\vee 1\right)+38\right)
+20d^{2}(K+\lfloor\beta\rfloor)^{d}+72d\lfloor\beta\rfloor(K+2\lfloor\beta\rfloor)+8d(K+\lfloor\beta\rfloor)^{d}

non-zero weights. The last summand appears because each of the

(K+\lfloor\beta\rfloor)^{d}2^{\lceil\log_{2}d\rceil+1}\leqslant 4d(K+\lfloor\beta\rfloor)^{d}

neurons in the first layer of $(K+\lfloor\beta\rfloor)^{d}\mathcal{A}_{2}$ receives information from two neurons of the previous layer. This completes the proof.

5.2 Auxiliary lemmas for Theorem 2

Lemma 1.

Let $k\in\mathbb{N}$, $k\geqslant 2$. Then, for any $x=(x_{1},\dots,x_{k})\in\mathbb{R}^{k}$, there exists a neural network from the class

\mathsf{NN}\left(\lceil\log_{2}k\rceil,\left(k,2^{\lceil\log_{2}k\rceil+1},2^{\lceil\log_{2}k\rceil},\dots,4,1\right)\right)\,, (12)

which implements the map $x\mapsto x_{1}x_{2}\cdots x_{k}$. Moreover, this network contains at most $5\cdot 2^{2\lceil\log_{2}k\rceil}$ non-zero weights.

Proof of Lemma 1.

Let $x=(x_{1},x_{2})\in\mathbb{R}^{2}$. We first prove that the map $(x_{1},x_{2})\mapsto x_{1}x_{2}$ belongs to the class $\mathsf{NN}(1,(2,4,1))$. Let us define $W_{1}=\frac{1}{4}(1,-1,-1,1)^{\top}$ and

W_{0}=\begin{pmatrix}1&1\\ 1&-1\\ -1&1\\ -1&-1\end{pmatrix}\,.

Then it is easy to see that

\begin{split}W_{1}\circ\sigma\circ W_{0}x&\equiv\frac{1}{4}\left(\sigma(x_{1}+x_{2})+\sigma(-x_{1}-x_{2})-\sigma(x_{1}-x_{2})-\sigma(x_{2}-x_{1})\right)\\ &=\frac{1}{4}\left((x_{1}+x_{2})^{2}-(x_{1}-x_{2})^{2}\right)=x_{1}x_{2}.\end{split} (13)

Now, for any $k\geqslant 2$, we use the following representation:

x_{1}x_{2}\dots x_{k}=x_{1}x_{2}\dots x_{k}\cdot\underbrace{1\cdot\ldots\cdot 1}_{2^{\lceil\log_{2}k\rceil}-k}\,. (14)

Put $v=\lceil\log_{2}k\rceil$. Then it remains to note that, based on the equality (13), for any $v\in\mathbb{N}$, we can implement a sequence of mappings

\left(a_{1},a_{2},\ldots,a_{2^{v}}\right)\mapsto\left(a_{1}a_{2},a_{3}a_{4},\ldots,a_{2^{v}-1}a_{2^{v}}\right)\mapsto\ldots\mapsto\prod_{i=1}^{2^{v}}a_{i}

by a neural network from the class

\mathsf{NN}\left(v,\left(2^{v},2^{v+1},2^{v},2^{v-1},2^{v-2},\dots,4,1\right)\right)\,.

The number of parameters in such a network does not exceed $2^{2v+1}+2^{2v+1}+2^{2v}+\ldots+4\leqslant 5\cdot 2^{2v}$. We complete the proof by combining this bound with (14).
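The two building blocks of this proof are easy to check numerically. The sketch below (ours, for illustration only) evaluates the one-hidden-layer ReQU product gadget (13) with the weights $W_{0}$, $W_{1}$ from above and chains it over $\lceil\log_{2}k\rceil$ rounds of pairwise products, padding with ones as in (14).

```python
import numpy as np

def requ(x):
    return np.maximum(x, 0.0) ** 2

def product_gadget(x1, x2):
    """One ReQU hidden layer computing x1 * x2, cf. (13)."""
    W0 = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    W1 = 0.25 * np.array([1.0, -1.0, -1.0, 1.0])
    return W1 @ requ(W0 @ np.array([x1, x2], dtype=float))

def product_tree(xs):
    """Multiply k numbers with ceil(log2 k) rounds of pairwise products, padding with ones (14)."""
    k = len(xs)
    size = 1 << (k - 1).bit_length()              # 2^ceil(log2 k)
    vals = list(xs) + [1.0] * (size - k)
    while len(vals) > 1:
        vals = [product_gadget(vals[i], vals[i + 1]) for i in range(0, len(vals), 2)]
    return vals[0]

xs = [0.3, -1.2, 0.5, 2.0, 0.7]
assert np.isclose(product_gadget(3.0, -4.0), -12.0)
assert np.isclose(product_tree(xs), np.prod(xs))
```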

Lemma 2.

Let $q,K$ be integers not smaller than $2$, and let $x\in[0,1]$. Then the mapping

x\mapsto\left(x,K,B_{1}^{2,K}(x),B_{2}^{2,K}(x),\ldots,B_{K+2q-2}^{2,K}(x)\right),

can be represented by a network from the class $\mathsf{NN}\Big(4,\big(1,\mathcal{B}_{2},K+2q\big)\Big)$, where

\mathcal{B}_{2}=\left(4(K+2q-1)+K,4K+8q,4K+8q,4K+8q\right), (15)

containing at most $72(K+2q)$ non-zero weights.

Proof.

Note first that, due to [41, Theorem 4.32],

B_{j}^{2,K}(x)=\frac{K^{3}}{6}\left(\left(x-\frac{j-q-1}{K}\right)_{+}^{2}-3\left(x-\frac{j-q}{K}\right)_{+}^{2}+3\left(x-\frac{j-q+1}{K}\right)_{+}^{2}-\left(x-\frac{j-q+2}{K}\right)_{+}^{2}\right),\quad q+1\leqslant j\leqslant q+K-2.

The functions $B_{q-1}^{2,K}$, $B_{q}^{2,K}$, $B_{q+K-1}^{2,K}$, and $B_{q+K}^{2,K}$ can be computed directly using (30). Indeed, for any $x\in[0,1]$,

B_{q-1}^{2,K}(x)=K^{3}\left(\frac{1}{K}-x\right)_{+}^{2},
B_{q}^{2,K}(x)=\frac{K^{3}}{4}\left(\left(\frac{2}{K}-x\right)_{+}^{2}-4\left(\frac{1}{K}-x\right)_{+}^{2}+3(-x)_{+}^{2}\right),
B_{q+K-1}^{2,K}(x)=\frac{K^{3}}{4}\left(\left(x-\frac{K-2}{K}\right)_{+}^{2}-2\left(x-\frac{K-1}{K}\right)_{+}^{2}-3(x-1)_{+}^{2}\right),
B_{q+K}^{2,K}(x)=K^{3}\left(x-\frac{K-1}{K}\right)_{+}^{2}\,.

Moreover, (30) implies that $B_{1}^{2,K}(x)=\dots=B_{q-2}^{2,K}(x)=0$ and $B_{q+K+1}^{2,K}(x)=\ldots=B_{2q+K-2}^{2,K}(x)=0$. Hence, each of the functions $B_{j}^{2,K}/K^{3}$, $j\in\{1,\dots,2q+K-2\}$, can be exactly represented by a neural network from the class $\mathsf{NN}(1,(1,4,1))$. The final mapping (b) is implemented as $K^{3}=\frac{1}{4}(K^{2}+K)^{2}-\frac{1}{4}(K^{2}-K)^{2}$. Note that the identity map $x\mapsto x$ belongs to $\mathsf{NN}(1,(1,4,1))$ too:

x=\left((x+1)^{2}-(x-1)^{2}\right)/4=\left(\sigma_{-1}(x)+\sigma_{1}(-x)-\sigma_{1}(x)-\sigma_{-1}(-x)\right)/4. (16)

Combining the arguments above, we conclude that the mapping

x\mapsto\left(x,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right) (17)

can be exactly represented by a network from the class $\mathsf{NN}(1,(1,4(K+2q-1),K+2q-1))$. Besides, one can implement the sequence of transformations

x\mapsto\underbrace{(1,\dots,1)}_{K\text{ times}}\mapsto K, (18)

using a neural network from $\mathsf{NN}\big(1,(1,K,1)\big)$. Connecting in parallel the neural networks realizing the maps (17) and (18), we obtain that one can implement the transformation

x\mapsto\left(x,K,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right) (19)

via a neural network from the class $\mathsf{NN}\big(1,(1,4(K+2q-1)+K,K+2q),13(K+2q-1)+3K+1\big)$.

It remains to implement the multiplication of the splines by $K^{3}$ to complete the proof. Recall that, according to Lemma 1, the function mapping a pair $(x,y)$ to the product $xy$ belongs to the class $\mathsf{NN}(1,(2,4,1))$. For any $x\in\mathbb{R}$, consider the vector

\left(x,K,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right).

Let us implement the multiplications of the splines $B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}$ by $K$ in parallel. Then we obtain that the mapping

\left(x,K,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right)
\mapsto\left(x,K,B_{1}^{2,K}(x)/K^{2},B_{2}^{2,K}(x)/K^{2},\ldots,B_{K+2q-2}^{2,K}(x)/K^{2}\right)

is in the class $\mathsf{NN}\big(1,(K+2q,4K+8q,K+2q),17(K+2q)-8\big)$. Here we took into account that we need $17$ non-zero weights to implement each of the $(K+2q-2)$ multiplications and $13$ non-zero weights to implement the identity maps $x\mapsto x$ and $K\mapsto K$. Repeating the same trick two more times, we get that there is a neural network in $\mathsf{NN}\big(3,(K+2q,4K+8q,4K+8q,4K+8q,K+2q),57(K+2q)-8\big)$, implementing

\begin{split}&\left(x,K,B_{1}^{2,K}(x)/K^{3},B_{2}^{2,K}(x)/K^{3},\ldots,B_{K+2q-2}^{2,K}(x)/K^{3}\right)\\ &\mapsto\left(x,K,B_{1}^{2,K}(x),B_{2}^{2,K}(x),\ldots,B_{K+2q-2}^{2,K}(x)\right).\end{split} (20)

This follows from the fact that each of the $K+2q-2$ multiplications by $K^{3}$, running in parallel, can be implemented by a neural network from $\mathsf{NN}(3,(2,4,4,4,1))$, and the two identity maps $x\mapsto x$ and $K\mapsto K$ can be represented by a network from $\mathsf{NN}(3,(1,4,4,4,1))$. Concatenating the neural networks performing (19) and (20), we finally get that the mapping of interest,

x\mapsto\left(x,K,B_{1}^{2,K}(x),B_{2}^{2,K}(x),\ldots,B_{K+2q-2}^{2,K}(x)\right),

is in the class of neural networks $\mathsf{NN}\big(4,(1,4(K+2q-1)+K,4K+8q,4K+8q,4K+8q,K+2q)\big)$, containing at most

69(K+2q)-20+3K<72(K+2q)

non-zero weights. ∎

Lemma 3.

Let $x\in\mathbb{R}$ and $K,q\in\mathbb{N}$, $K,q\geqslant 2$. Then the mapping $x\mapsto\left(x,K,B_{1}^{q,K}(x),B_{2}^{q,K}(x),\ldots,B_{q+K}^{q,K}(x)\right)$ belongs to the class

\mathsf{NN}\Big(4+2(q-2),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{q},K+q+2\big)\Big)\,,

with at most $72q(K+2q)$ non-zero weights. The vector $\mathcal{B}_{2}$ is defined in (15), and for $m\in\{3,\ldots,q\}$, $\mathcal{B}_{m}$ is equal to

\mathcal{B}_{m}=\bigl(12(K+2q-m)+12,8(K+2q-m)+8\bigr)\,. (21)
Proof of Lemma 3.

Fix an integer $q\geqslant 2$. We consider the family of $B$-splines $\bigl\{B_{j}^{m,K}(x):m\in\{2,\dots,q\},\ j\in\{1,\dots,K+2q-m\}\bigr\}$ and construct all elements of this family sequentially, starting from $m=2$. By Lemma 2, the map

x\mapsto\left(x,K,B_{1}^{2,K}(x),B_{2}^{2,K}(x),\ldots,B_{K+2q-2}^{2,K}(x)\right)

can be represented by a network in the class (15), containing at most $72(K+2q)$ non-zero weights. Assume that for some $m\geqslant 3$ the mapping

x\mapsto\left(x,K,B_{1}^{m-1,K}(x),\ldots,B_{K+2q-(m-1)}^{m-1,K}(x)\right)

belongs to the class

\mathsf{NN}\Big(4+2(m-3),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{m-1},K+2q-m+3\big)\Big). (22)

We use the recursion formula (30) to perform the induction step. Note that we can implement each of the mappings

(x,K)\mapsto\frac{x-a_{j}}{a_{j+m+1}-a_{j}}\,,\quad(x,K)\mapsto\frac{a_{j+m+1}-x}{a_{j+m+1}-a_{j}}\,,\quad j\in\{1,\dots,2q+K-m\}\,, (23)

by networks from $\mathsf{NN}(1,(2,4,1))$. This is possible since, for $j$ and $m$ satisfying $a_{j+m+1}-a_{j}>0$, it holds that $a_{j+m+1}-a_{j}\geqslant 1/K$. We remove the last linear layer of (22) and concatenate the remaining part with the first layer of the networks implementing (23). We obtain a network with $5+2(m-3)$ hidden layers and the architecture

\mathsf{NN}\Big(5+2(m-3),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{m-1},12(K+2q-m)+12,3(K+2q-m)+3\big)\Big)\,, (24)

which implements the mapping

x\mapsto\left(x,K,B_{1}^{m-1,K}(x),\ldots,B_{K+2q-(m-1)}^{m-1,K}(x),\underbrace{\frac{x-a_{j}}{a_{j+m+1}-a_{j}},\frac{a_{j+m+1}-x}{a_{j+m+1}-a_{j}}}_{j\in\{1,\dots,2q+K-m\}}\right)\,. (25)

Note that we have added at most

\underbrace{4(K+2q-m+3)}_{\text{to implement identity maps}}+\underbrace{16\cdot 2(K+2q-m)}_{\text{to implement (23)}}=36(K+2q-m)+12

non-zero weights, since each linear function of the form (23) can be implemented via a network from $\mathsf{NN}\big(1,(2,4,1)\big)$ with at most $16$ non-zero weights. According to (30), for any $j\in\{1,\dots,2q+K-m\}$, we construct

B_{j}^{m,K}(x)=\begin{cases}\dfrac{(x-a_{j})B_{j}^{m-1,K}(x)+(a_{j+m+1}-x)B_{j+1}^{m-1,K}(x)}{a_{j+m+1}-a_{j}},&\text{if }a_{j}<a_{j+m+1}\text{ and }a_{j}\leqslant x<a_{j+m+1},\\ 0,&\text{otherwise}.\end{cases} (26)

Now we add one more hidden layer with $8(K+2q-m)+8$ parameters to the network (25), yielding a network with the architecture

\mathsf{NN}\Big(6+2(m-3),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\ldots,\mathcal{B}_{m-1},12(K+2q-m)+12,8(K+2q-m)+8,2(K+2q-m)+2\big)\Big)\,, (27)

which implements

x\mapsto\left(x,K,\underbrace{\frac{x-a_{j}}{a_{j+m+1}-a_{j}}B_{j}^{m-1,K}(x),\frac{a_{j+m+1}-x}{a_{j+m+1}-a_{j}}B_{j+1}^{m-1,K}(x)}_{j\in\{1,\dots,2q+K-m\}}\right).

Note that adding the new hidden layer adds at most

\underbrace{8}_{\text{to implement identity maps for }x\text{ and }K}+\underbrace{16\cdot 2(K+2q-m)}_{\text{to implement (30)}}

additional non-zero weights. Combining the above representations and (26), we obtain a network from the class

\mathsf{NN}\Big(4+2(m-2),\big(1,\mathcal{B}_{2},\mathcal{B}_{3},\dots,\mathcal{B}_{m},K+2q-m+2\big)\Big)

with at most

72(K+2q)+\sum\limits_{s=3}^{m}\Big[68(K+2q-s)+20\Big]
\leqslant 72(K+2q)+(m-2)\Big[68(K+2q)+20\Big]
\leqslant 72(K+2q)+(m-2)\cdot 72(K+2q)
<72m(K+2q)

non-zero weights. ∎

Lemma 4.

Let $x\in\mathbb{R}$ and let $L\in\mathbb{N}$. Then, for any $M$ such that $|M|\leqslant 4^{4^{L}}$, the mapping $x\mapsto Mx$ can be represented by a neural network belonging to the class

\mathsf{NN}\Big(2L+2,(1,5,5,\dots,5,4,1)\Big)

and containing at most $60L+38$ non-zero weights.

Proof.

Using the representation $4=(1+1)^{2}$, we obtain that there is a neural network from $\mathsf{NN}\big(2L+1,(1,1,\dots,1)\big)$ implementing the sequence of maps

x\mapsto 1\mapsto 4\mapsto 4^{2}\mapsto 4^{4}\mapsto\dots\mapsto 4^{4^{L}}.

At the same time, due to

x=\frac{1}{4}\left((x+1)^{2}-(x-1)^{2}\right)=\frac{1}{4}\left((x+1)_{+}^{2}+(-x-1)_{+}^{2}-(x-1)_{+}^{2}-(1-x)_{+}^{2}\right),

the identity map $x\mapsto x$ belongs to

\mathsf{NN}\big(2L+1,(1,\underbrace{4,4,\dots,4}_{2L+1\text{ times}},1)\big).

Running the two neural networks in parallel, we can implement the mapping

x\mapsto\left(4^{4^{L}},x\right)

via a neural network from $\mathsf{NN}\big(2L+1,(1,5,5,\dots,5,2)\big)$. Finally, applying Lemma 1, we conclude that there is a neural network from the class

\mathsf{NN}\big(2L+2,(1,5,5,\dots,5,4,1)\big),

performing the map $x\mapsto 4^{4^{L}}x$. The number of non-zero weights in this network does not exceed

\left(1\cdot 5+5\cdot 5\cdot 2L+5\cdot 4+4\cdot 1\right)+5\cdot(2L+1)+4=60L+38.

It remains to note that, by the definition of $L$, we have $|M|/4^{4^{L}}\leqslant 1$.
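A small numeric check of the two ingredients used here (our sketch, illustration only): $2L$ successive squarings starting from $4$ reach $4^{2^{2L}}=4^{4^{L}}$, and the multiplication $x\mapsto Mx$ is then recovered by scaling $4^{4^{L}}x$ with the admissible weight $M/4^{4^{L}}\in[-1,1]$.

```python
def power_tower(L):
    """2L successive squarings of 4 give 4^(2^(2L)) = 4^(4^L)."""
    y = 4
    for _ in range(2 * L):
        y = y * y            # sigma(y) = y^2 for y >= 0
    return y

for L in range(1, 4):
    assert power_tower(L) == 4 ** (4 ** L)

# multiplication x -> M x with a final weight bounded by 1 in absolute value
L, M, x = 2, 12345.0, 0.7
big = power_tower(L)
assert abs(M) <= big and abs(M / big) <= 1.0
assert abs((M / big) * (big * x) - M * x) < 1e-9
```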

5.3 Proof of Corollary 1

Let $s>\ell$ be an integer to be specified later. Plugging the bound $\left\|f\right\|_{\mathcal{H}^{s}([0,1]^{d})}\leqslant QR^{-s}s!$, $s\in\mathbb{N}_{0}$, into (3), we get that there is a neural network $h_{f}$ of width

\left(4d(K+s-1)^{d}\right)\vee\left(12(K+2s-2)+12\right)\vee p

with

6+2(s-3)+\left\lceil\log_{2}d\right\rceil+2\left(\left\lceil\log_{2}(2d(s-1)+d)\vee\log_{2}\log_{2}(QR^{-s}s!)\right\rceil\vee 1\right)

hidden layers and at most

p(K+s-1)^{d}\left(60\left[\left\lceil\log_{2}(2d(s-1)+d)\vee\log_{2}\log_{2}(QR^{-s}s!)\right\rceil\vee 1\right]+38\right)
+p(K+s-1)^{d}\left(20d^{2}+144d(s-1)+8d\right)

non-zero weights, taking their values in $[-1,1]$, such that, for any $s>\ell$,

\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{Q(\sqrt{2}ed)^{s}s!}{R^{s}K^{s-\ell}}+\frac{Q9^{d(s-2)}(2s-1)^{2d+\ell}(\sqrt{2}ed)^{s}s!}{R^{s}K^{s-\ell}}
\leqslant\frac{Q9^{ds}(2s-1)^{2d+\ell}(\sqrt{2}ed)^{s}s!}{R^{s}K^{s-\ell}}.

Recall that, by definition, $\lfloor s\rfloor=s-1$ for any integer $s$. Using the inequalities

s!\leqslant(s-\ell)!\,s^{\ell},\quad(s-\ell)!\leqslant e\sqrt{s-\ell}\left(\frac{s-\ell}{e}\right)^{s-\ell},\quad\text{and}\quad\sqrt{s-\ell}\leqslant 2^{(s-\ell)/2},

which are valid for all $s>\ell$, we obtain that

\begin{split}\left\|f-h_{f}\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}&\leqslant\frac{Qe(\sqrt{2}\,ed\,9^{d})^{\ell}(2s-1)^{2d+\ell}s^{\ell}}{R^{\ell}}\cdot\left(\frac{2d\,9^{d}(s-\ell)}{KR}\right)^{s-\ell}\\ &\leqslant\frac{Qe(\sqrt{2}\,ed\,9^{d})^{\ell}(2s-1)^{2d+2\ell}}{R^{\ell}}\cdot\left(\frac{2d\,9^{d}(s-\ell)}{KR}\right)^{s-\ell}.\end{split} (28)

Set

s=\ell+\left\lfloor\frac{KR}{2ed\,9^{d}}\right\rfloor. (29)

It is easy to observe that, with such a choice of $s$, the neural network $h_{f}$ has width $O((K+\ell)^{d})$, $O(K+\ell)$ hidden layers, and at most $O(p(K+\ell)^{d}\log K)$ non-zero weights. The bound (5) on the approximation error of $h_{f}$ follows directly from (28) and (29).

References

  • [1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444. doi:10.1038/nature14539.
  • [2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014, pp. 2672–2680.
    URL https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
  • [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533. doi:10.1038/nature14236.
  • [4] J. Han, A. Jentzen, W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences 115 (34) (2018) 8505–8510. doi:10.1073/pnas.1718942115.
  • [5] M. Geist, P. Petersen, M. Raslan, R. Schneider, G. Kutyniok, Numerical solution of the parametric diffusion equation by deep neural networks, Journal of Scientific Computing 88 (1) (2021) 1–37. doi:10.1007/s10915-021-01532-w.
  • [6] Y. Chen, Y. Shi, B. Zhang, Optimal control via neural networks: A convex approach, in: International Conference on Learning Representations, 2019.
    URL https://openreview.net/pdf?id=H1MW72AcK7
  • [7] D. Onken, L. Nurbekyan, X. Li, S. W. Fung, S. Osher, L. Ruthotto, A neural network approach for high-dimensional optimal control applied to multiagent path finding, IEEE Transactions on Control Systems Technology (2022) 1–17. doi:10.1109/TCST.2022.3172872.
  • [8] L. Li, K. Ota, M. Dong, Humanlike driving: Empirical decision-making system for autonomous vehicles, IEEE Transactions on Vehicular Technology 67 (8) (2018) 6814–6823. doi:10.1109/TVT.2018.2822762.
  • [9] G. Cembrano, C. Torras, G. Wells, Neural networks for robot control, Annual Review in Automatic Programming 19 (1994) 159–166. doi:10.1016/0066-4138(94)90059-0.
    URL https://www.sciencedirect.com/science/article/pii/0066413894900590
  • [10] P. Bozek, Y. L. Karavaev, A. A. Ardentov, K. S. Yefremov, Neural network control of a wheeled mobile robot based on optimal trajectories, International Journal of Advanced Robotic Systems 17 (2) (2020) 1729881420916077. doi:10.1177/1729881420916077.
  • [11] M. González-Álvarez, J. Dupeyroux, F. Corradi, G. C. de Croon, Evolved neuromorphic radar-based altitude controller for an autonomous open-source blimp, 2022 International Conference on Robotics and Automation (ICRA) (2022) 85–90.
  • [12] K.-I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural networks 2 (3) (1989) 183–192. doi:10.1016/0893-6080(89)90003-8.
  • [13] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems 2 (4) (1989) 303–314. doi:10.1007/BF02551274.
  • [14] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural networks 4 (2) (1991) 251–257. doi:10.1016/0893-6080(91)90009-T.
  • [15] D. Yarotsky, Error bounds for approximations with deep ReLU networks, Neural Networks 94 (2017) 103–114. doi:10.1016/j.neunet.2017.07.002.
  • [16] B. Li, S. Tang, H. Yu, PowerNet: efficient representations of polynomials and smooth functions by deep neural networks with rectified power units, Journal of Mathematical Study 53 (2) (2020) 159–191. doi:10.4208/jms.v53n2.20.03.
  • [17] I. Gühring, M. Raslan, Approximation rates for neural networks with encodable weights in smoothness spaces, Neural Networks 134 (2021) 107–130. doi:10.1016/j.neunet.2020.11.010.
  • [18] J. Lu, Z. Shen, H. Yang, S. Zhang, Deep network approximation for smooth functions, SIAM Journal on Mathematical Analysis 53 (5) (2021) 5465–5506. doi:10.1137/20M134695X.
  • [19] Z. Shen, H. Yang, S. Zhang, Neural network approximation: Three hidden layers are enough, Neural Networks 141 (2021) 160–173. doi:10.1016/j.neunet.2021.04.011.
  • [20] T. De Ryck, S. Lanthaler, S. Mishra, On the approximation of functions by tanh neural networks, Neural Networks 143 (2021) 732–750. doi:10.1016/j.neunet.2021.08.015.
  • [21] Y. Jiao, Y. Lai, X. Lu, F. Wang, J. Z. Yang, Y. Yang, Deep neural networks with relu-sine-exponential activations break curse of dimensionality on Hölder class, preprint, arXiv:2103.00542 (2021). doi:10.48550/ARXIV.2103.00542.
  • [22] S. Langer, Approximating smooth functions by deep neural networks with sigmoid activation function, Journal of Multivariate Analysis 182 (2021) Paper No. 104696, 21. doi:10.1016/j.jmva.2020.104696.
  • [23] Z. Shen, H. Yang, S. Zhang, Deep network approximation characterized by number of neurons, Communications in Computational Physics 28 (5) (2020) 1768–1811. doi:10.4208/cicp.oa-2020-0149.
  • [24] J. Schmidt-Hieber, Nonparametric regression using deep neural networks with ReLU activation function, The Annals of Statistics 48 (4) (2020) 1875–1897. doi:10.1214/19-AOS1875.
  • [25] B. Li, S. Tang, H. Yu, Better approximations of high dimensional smooth functions by deep neural networks with rectified power units, Communications in Computational Physics 27 (2) (2020) 379–411. doi:10.4208/cicp.oa-2019-0168.
  • [26] B. Hanin, Universal function approximation by deep neural nets with bounded width and ReLU activations, Mathematics 7 (10) (2019) 992. doi:10.3390/math7100992.
  • [27] P. Kidger, T. Lyons, Universal approximation with deep narrow networks, in: Proceedings of Thirty Third Conference on Learning Theory, Vol. 125 of Proceedings of Machine Learning Research, 2020, pp. 2306–2327.
    URL https://proceedings.mlr.press/v125/kidger20a.html
  • [28] S. Park, C. Yun, J. Lee, J. Shin, Minimum width for universal approximation, in: International Conference on Learning Representations, 2021.
    URL https://openreview.net/forum?id=O-XJwyoIF-k
  • [29] I. Gühring, G. Kutyniok, P. Petersen, Error bounds for approximations with deep ReLU neural networks in Ws,pW^{s,p} norms, Analysis and Applications 18 (5) (2020) 803–859. doi:10.1142/S0219530519410021.
  • [30] J. M. Klusowski, A. R. Barron, Approximation by combinations of ReLU and squared ReLU ridge functions with 1\ell^{1} and 0\ell^{0} controls, IEEE Transactions on Information Theory 64 (12) (2018) 7649–7656. doi:10.1109/tit.2018.2874447.
  • [31] M. Ali, A. Nouy, Approximation of smoothness classes by deep rectifier networks, SIAM Journal on Numerical Analysis 59 (6) (2021) 3032–3051. doi:10.1137/20M1360657.
  • [32] A. Abdeljawad, P. Grohs, Approximations with deep neural networks in Sobolev time-space, Analysis and Applications 20 (3) (2022) 499–541. doi:10.1142/S0219530522500014.
  • [33] Q. Chen, W. Hao, J. He, Power series expansion neural network, Journal of Computational Science 59 (2022) 101552. doi:10.1016/j.jocs.2021.101552.
  • [34] R. Gribonval, G. Kutyniok, M. Nielsen, F. Voigtlaender, Approximation spaces of deep neural networks, Constructive Approximation. An International Journal for Approximations and Expansions 55 (1) (2022) 259–367. doi:10.1007/s00365-021-09543-4.
  • [35] J. W. Siegel, J. Xu, High-order approximation rates for shallow neural networks with cosine and ReLUk{\rm ReLU}^{k} activation functions, Applied and Computational Harmonic Analysis 58 (2022) 1–26. doi:10.1016/j.acha.2021.12.005.
  • [36] D. Yarotsky, A. Zhevnerchuk, The phase diagram of approximation rates for deep neural networks, in: Advances in Neural Information Processing Systems, Vol. 33, 2020, pp. 13005–13015.
    URL https://proceedings.neurips.cc/paper/2020/file/979a3f14bae523dc5101c52120c535e9-Paper.pdf
  • [37] Z.-B. Xu, F.-L. Cao, Simultaneous Lp{L}_{p}-approximation order for neural networks, Neural Networks 18 (7) (2005) 914–923. doi:10.1016/j.neunet.2005.03.013.
  • [38] S. Hon, H. Yang, Simultaneous neural network approximation for smooth functions, preprint, arXiv:2109.00161 (2021). doi:10.48550/ARXIV.2109.00161.
  • [39] E. Candès, L. Demanet, L. Ying, Fast computation of fourier integral operators, SIAM Journal on Scientific Computing 29 (6) (2007) 2464–2493. doi:10.1137/060671139.
  • [40] E. Candès, L. Demanet, L. Ying, A fast butterfly algorithm for the computation of Fourier integral operators, Multiscale Modeling & Simulation. A SIAM Interdisciplinary Journal 7 (4) (2009) 1727–1750. doi:10.1137/080734339.
  • [41] L. Schumaker, Spline Functions: Basic theory, 3rd Edition, Cambridge University Press, Cambridge, 2007. doi:10.1017/CBO9780511618994.

Appendix A Some properties of multivariate splines

In this section we provide some properties of multivariate splines. For more details we refer the reader to the book [41].

A.1 Univariate splines

We start with the one-dimensional case. Fix K,qK,q\in\mathbb{N}. We call a function S:[0,1]S:[0,1]\mapsto\mathbb{R} a univariate spline of degree qq with knots at {1/K,2/K,,(K1)/K}\{1/K,2/K,\dots,(K-1)/K\}, if

  • 1.

    for any i{0,1,,K1}i\in\{0,1,\dots,K-1\}, the restriction of SS on [i/K,(i+1)/K][i/K,(i+1)/K] is a polynomial of degree qq;

  • 2.

    SS has continuous derivatives up to order q1q-1, that is, for any m{0,,q1}m\in\{0,\dots,q-1\} and any i{1,2,,K1}i\in\{1,2,\dots,K-1\}, DmS(i/K+0)=DmS(i/K0)D^{m}S(i/K+0)=D^{m}S(i/K-0).

It is easy to observe that univariate splines of degree qq with knots at {1/K,2/K,,(K1)/K}\{1/K,2/K,\dots,(K-1)/K\} form a linear space. We denote this space by 𝒮q,K\mathcal{S}_{q,K}. Note that 𝒮q,K\mathcal{S}_{q,K} has finite dimension q+Kq+K. There are several ways to choose a basis in 𝒮q,K\mathcal{S}_{q,K}; below we construct the basis of normalized BB-splines {Njq,K(x):1jK+q}\{N_{j}^{q,K}(x):1\leqslant j\leqslant K+q\}. Let a=(a1,,a2q+K+1)a=(a_{1},\dots,a_{2q+K+1}) be a vector of knots defined in (5.1). First, we successively construct (unnormalized) B-splines {Bjq,K(x):1jK+q}\{B_{j}^{q,K}(x):1\leqslant j\leqslant K+q\} associated with knots a1,,a2q+K+1a_{1},\dots,a_{2q+K+1}. For any j{1,,2q+K}j\in\{1,\dots,2q+K\}, let

Bj0,K(x)={1/(aj+1aj),if aj<aj+1 and ajx<aj+1,0,otherwise.B_{j}^{0,K}(x)=\begin{cases}1/(a_{j+1}-a_{j}),\quad\text{if $a_{j}<a_{j+1}$ and $a_{j}\leqslant x<a_{j+1}$},\\ 0,\quad\text{otherwise}.\end{cases}

Then for any m{1,,q}m\in\{1,\dots,q\} and j{1,,2q+Km}j\in\{1,\dots,2q+K-m\}, we put

Bjm,K(x)={(xaj)Bjm1,K(x)+(aj+m+1x)Bj+1m1,K(x)aj+m+1aj,if aj<aj+m+1,ajx<aj+m+1,0,otherwise.B_{j}^{m,K}(x)=\begin{cases}\frac{(x-a_{j})B_{j}^{m-1,K}(x)+(a_{j+m+1}-x)B_{j+1}^{m-1,K}(x)}{a_{j+m+1}-a_{j}},\\ \qquad\qquad\qquad\qquad\qquad\text{if $a_{j}<a_{j+m+1},a_{j}\leqslant x<a_{j+m+1}$},\\ 0,\quad\text{otherwise}.\end{cases} (30)

Now, for m{1,,q}m\in\{1,\dots,q\}, we define the normalized BB-spline Njm,KN_{j}^{m,K} as

Njm,K(x)=(aj+m+1aj)Bjm,K(x),1j2q+Km.N_{j}^{m,K}(x)=(a_{j+m+1}-a_{j})B_{j}^{m,K}(x),\quad 1\leqslant j\leqslant 2q+K-m. (31)
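For readers who prefer an algorithmic view, the recursion (30) and the normalization (31) can be evaluated directly. The Python sketch below assumes that the knot vector from (5.1) is the standard clamped one (q+1 copies of 0, the interior knots 1/K,\dots,(K-1)/K, and q+1 copies of 1), which matches the stated length 2q+K+1 but should be checked against (5.1); the final printout anticipates the bound \|N_{j}^{q,K}\|_{L_{\infty}([0,1])}\leqslant 1 from Proposition 1 below.

```python
import numpy as np

def clamped_knots(q, K):
    # Assumed form of the knot vector a_1, ..., a_{2q+K+1} from (5.1):
    # q+1 copies of 0, the interior knots 1/K, ..., (K-1)/K, and q+1 copies of 1.
    return np.concatenate([np.zeros(q + 1), np.arange(1, K) / K, np.ones(q + 1)])

def B(j, m, a, x):
    """Unnormalized B-spline B_j^{m,K}(x) via the recursion (30); j is 1-based."""
    lo, hi = a[j - 1], a[j + m]                  # a_j and a_{j+m+1}
    if lo >= hi:                                 # empty support: the spline vanishes
        return np.zeros_like(x, dtype=float)
    if m == 0:
        return np.where((lo <= x) & (x < hi), 1.0 / (hi - lo), 0.0)
    return ((x - lo) * B(j, m - 1, a, x) + (hi - x) * B(j + 1, m - 1, a, x)) / (hi - lo)

def N(j, m, a, x):
    """Normalized B-spline N_j^{m,K}(x) = (a_{j+m+1} - a_j) B_j^{m,K}(x), cf. (31)."""
    return (a[j + m] - a[j - 1]) * B(j, m, a, x)

q, K = 3, 5
a = clamped_knots(q, K)
x = np.linspace(0.0, 1.0 - 1e-9, 2001)           # half-open intervals in (30)
vals = np.array([N(j, q, a, x) for j in range(1, q + K + 1)])
print("max_j ||N_j^{q,K}||_inf =", vals.max())   # does not exceed 1
print("sum_j N_j^{q,K}(x) = 1 on the grid:", bool(np.allclose(vals.sum(axis=0), 1.0)))
```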

We use the following properties of normalized BB-splines (see [41, Section 4.3] for the details).

Proposition 1 ([41], Section 4.3).

Let Njm,KN_{j}^{m,K} be a normalized B-spline defined in (31). Then the following holds.

  • 1.

    For any q0,j{1,,q+K}q\in\mathbb{N}_{0},j\in\{1,\dots,q+K\},

    supp(Njq,K)=[aj,aj+q+1]andNjq,KL([0,1])1.\text{\rm supp}(N_{j}^{q,K})=[a_{j},a_{j+q+1}]\quad\text{and}\quad\|N_{j}^{q,K}\|_{L_{\infty}([0,1])}\leqslant 1.
  • 2.

    For any qq\in\mathbb{N} and any j{1,,q+K}j\in\{1,\dots,q+K\}, Njq,KN_{j}^{q,K} is a spline of degree qq; it has continuous derivatives up to order q1q-1, and the (q1)(q-1)-th derivative is Lipschitz. Moreover, for any m{1,,q}m\in\{1,\dots,q\},

    dNjm,K(x+0)dx=maj+maj(Njm1,K(x)Nj+1m1,K(x)),1j2q+Km.\frac{\mathrm{d}N^{m,K}_{j}(x+0)}{\mathrm{d}x}=\frac{m}{a_{j+m}-a_{j}}\left(N_{j}^{m-1,K}(x)-N_{j+1}^{m-1,K}(x)\right),\quad 1\leqslant j\leqslant 2q+K-m.
Corollary 2.

For any mm\in\mathbb{N} and any {1,,m}\ell\in\{1,\dots,m\},

max1j2q+KmDNjm,KL([0,1])(2K)m!(m)!.\max\limits_{1\leqslant j\leqslant 2q+K-m}\left\|D^{\ell}N^{m,K}_{j}\right\|_{L_{\infty}([0,1])}\leqslant(2K)^{\ell}\frac{m!}{(m-\ell)!}.
Proof.

Proposition 1 implies that, for any {1,,m}\ell\in\{1,\dots,m\},

max1j2q+KmDNjm,KL([0,1])\displaystyle\max\limits_{1\leqslant j\leqslant 2q+K-m}\left\|D^{\ell}N^{m,K}_{j}\right\|_{L_{\infty}([0,1])}
=max1j2q+Kmmaj+majD1Njm1,KD1Nj+1m1,KL([0,1])\displaystyle=\max\limits_{1\leqslant j\leqslant 2q+K-m}\frac{m}{a_{j+m}-a_{j}}\left\|D^{\ell-1}N_{j}^{m-1,K}-D^{\ell-1}N_{j+1}^{m-1,K}\right\|_{L_{\infty}([0,1])}
2mKmax1j2q+Km+1D1Njm1,KL([0,1]).\displaystyle\leqslant 2mK\max\limits_{1\leqslant j\leqslant 2q+K-m+1}\left\|D^{\ell-1}N_{j}^{m-1,K}\right\|_{L_{\infty}([0,1])}.

Now the statement follows by induction on \ell together with the bound Njq,KL([0,1])1\|N_{j}^{q,K}\|_{L_{\infty}([0,1])}\leqslant 1. ∎

To study approximation properties of 𝒮q,K\mathcal{S}_{q,K} with respect to the Hölder norm k,k0\|\cdot\|_{\mathcal{H}^{k}},k\in\mathbb{N}_{0}, we follow the technique from [41, Sections 4.6 and 12.3] and introduce a dual basis for normalized B-splines. A set of linear functionals {λ1q,K,,λq+Kq,K}\{\lambda^{q,K}_{1},\dots,\lambda^{q,K}_{q+K}\} on L([0,1])L_{\infty}([0,1]) is called a dual basis if, for all i,j{1,,q+K}i,j\in\{1,\dots,q+K\},

λiq,KNjq,K={1,if i=j,0,otherwise.\lambda_{i}^{q,K}N_{j}^{q,K}=\begin{cases}1,\quad\text{if $i=j$,}\\ 0,\quad\text{otherwise}.\end{cases}

An explicit expression for λiq,K\lambda_{i}^{q,K} can be found in [41, Eq.(4.89)] but it is not necessary for our purposes. Define a linear operator 𝒬q,K:L([0,1])𝒮q,K\mathcal{Q}^{q,K}:L_{\infty}([0,1])\rightarrow\mathcal{S}_{q,K} by

𝒬q,Kf=j=1q+K(λjq,Kf)Njq,K.\mathcal{Q}^{q,K}f=\sum\limits_{j=1}^{q+K}(\lambda_{j}^{q,K}f)N_{j}^{q,K}.

The function 𝒬q,Kf\mathcal{Q}^{q,K}f is called a quasi-interpolant, and it approximates ff well provided that ff is smooth enough. The approximation properties of 𝒬q,Kf\mathcal{Q}^{q,K}f are studied below for the case of multivariate tensor-product splines.
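For illustration only (this special case is not used in the proofs), take q=1 and assume that the knot vector from (5.1) is the clamped one with a_{1}=a_{2}=0 and a_{K+2}=a_{K+3}=1. Then the normalized B-splines are the usual hat functions, satisfying N_{j}^{1,K}(a_{j+1})=1 and N_{i}^{1,K}(a_{j+1})=0 for i\neq j (with the last knot interval taken closed), so the point evaluations \lambda_{j}^{1,K}f=f(a_{j+1}), 1\leqslant j\leqslant K+1, form a dual basis and

\mathcal{Q}^{1,K}f=\sum\limits_{j=1}^{K+1}f(a_{j+1})N_{j}^{1,K},

that is, \mathcal{Q}^{1,K}f is the piecewise-linear interpolant of f at the knots 0,1/K,\dots,1.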

A.2 Tensor-product splines

The main result of this section is Theorem 3, concerning the approximation properties of multivariate splines. We start with auxiliary definitions. The tensor product

𝒮dq,K=(𝒮q,K)dspan{Nj1q,K(x1)Njdq,K(xd):j1,,jd{1,,q+K}}\mathcal{S}^{q,K}_{d}=\left(\mathcal{S}^{q,K}\right)^{\otimes d}\equiv\text{span}\left\{N_{j_{1}}^{q,K}(x_{1})\cdot\dots\cdot N_{j_{d}}^{q,K}(x_{d}):j_{1},\dots,j_{d}\in\{1,\dots,q+K\}\right\}

is called the space of tensor-product splines of degree qq. Since qq and KK are fixed, we omit the upper indices in the notations Bjq,K,Njq,KB_{j}^{q,K},N_{j}^{q,K}, 1jq+K1\leqslant j\leqslant q+K, if there is no ambiguity. For any i{1,,d}i\in\{1,\dots,d\}, we denote by {λi,1,,λi,q+K}\{\lambda_{i,1},\dots,\lambda_{i,q+K}\} the dual basis for {N1(xi),,Nq+K(xi)}\{N_{1}(x_{i}),\dots,N_{q+K}(x_{i})\} constructed in A.1. For any j1,,jd{1,,q+K}j_{1},\dots,j_{d}\in\{1,\dots,q+K\} and fL([0,1]d)f\in L_{\infty}([0,1]^{d}), define

j1,,jdf=λ1,j1λ2,j2λd,jdf.\mathcal{L}_{j_{1},\dots,j_{d}}f=\lambda_{1,j_{1}}\lambda_{2,j_{2}}\dots\lambda_{d,j_{d}}f.

It is easy to see that the functionals {j1,,jd:j1,,jd{1,,q+K}}\left\{\mathcal{L}_{j_{1},\dots,j_{d}}:j_{1},\dots,j_{d}\in\{1,\dots,q+K\}\right\} form a dual basis for the tensor-product system {Nj1(x1)Njd(xd):j1,,jd{1,,q+K}}\left\{N_{j_{1}}(x_{1})\cdots N_{j_{d}}(x_{d}):j_{1},\dots,j_{d}\in\{1,\dots,q+K\}\right\}. Now we formulate the approximation result from [41, Theorem 12.5].

Proposition 2 ([41], Theorem 12.5).

Let fL([0,1]d)f\in L_{\infty}([0,1]^{d}). Fix any j1,,jd{1,,q+K}j_{1},\dots,j_{d}\in\{1,\dots,q+K\} and let j1,,jd\mathcal{L}_{j_{1},\dots,j_{d}} be as defined above. Introduce

Aj1,,jd=i=1d[aji,aji+q+1).A_{j_{1},\dots,j_{d}}=\bigotimes\limits_{i=1}^{d}[a_{j_{i}},a_{j_{i}+q+1})\,.

Then

|j1,,jdf|(2q+1)d9d(q1)fL(Aj1,,jd).|\mathcal{L}_{j_{1},\dots,j_{d}}f|\leqslant(2q+1)^{d}9^{d(q-1)}\|f\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\,.

Similarly to 𝒬q,K\mathcal{Q}^{q,K}, for any fL([0,1]d)f\in L_{\infty}([0,1]^{d}), define a multivariate quasi-interpolant

𝒬dq,Kf(x)=j1,,jd=1q+K(j1,,jdf)Nj1(x1)Njd(xd).\mathcal{Q}_{d}^{q,K}f(x)=\sum\limits_{j_{1},\dots,j_{d}=1}^{q+K}(\mathcal{L}_{j_{1},\dots,j_{d}}f)N_{j_{1}}(x_{1})\dots N_{j_{d}}(x_{d}).

Similarly to NjN_{j}’s, we omit the upper indices q,Kq,K in the notation of 𝒬d\mathcal{Q}_{d} when they are clear from context. We are ready to formulate the main result of this section.

Theorem 3.

Let fβ([0,1]d,H)f\in\mathcal{H}^{\beta}([0,1]^{d},H) and fix a non-negative integer β\ell\leqslant\lfloor\beta\rfloor. Let the linear operator 𝒬dβ,K:L([0,1]d)𝒮dβ,K\mathcal{Q}_{d}^{\lfloor\beta\rfloor,K}:L_{\infty}([0,1]^{d})\rightarrow\mathcal{S}_{d}^{\lfloor\beta\rfloor,K} be as defined above. Then

f𝒬dβ,Kf([0,1]d)(2ed)βHKβ+9d(β1)(2β+1)2d+(2ed)βHKβ.\left\|f-\mathcal{Q}_{d}^{\lfloor\beta\rfloor,K}f\right\|_{\mathcal{H}^{\ell}([0,1]^{d})}\leqslant\frac{(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}+\frac{9^{d(\lfloor\beta\rfloor-1)}(2\lfloor\beta\rfloor+1)^{2d+\ell}(\sqrt{2}ed)^{\beta}H}{K^{\beta-\ell}}\,.
Proof.

For simplicity, we adopt the notation q=βq=\lfloor\beta\rfloor. Fix a multi-index 𝜸0d{\bm{\gamma}}\in\mathbb{N}_{0}^{d}, |𝜸||{\bm{\gamma}}|\leqslant\ell. Then

D𝜸fD𝜸𝒬dfL([0,1]d)=maxj1,,jd{1,,q+K}D𝜸fD𝜸𝒬dfL(Aj1,,jd).\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}\mathcal{Q}_{d}f\|_{L_{\infty}([0,1]^{d})}=\max\limits_{j_{1},\dots,j_{d}\in\{1,\dots,q+K\}}\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}\mathcal{Q}_{d}f\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\,.

Consider a polynomial of degree qq

Pj1,,jdf(x)=𝜶:|𝜶|qD𝜶f(aj1,,jd)𝜶!(xaj1,,jd)𝜶,P^{f}_{j_{1},\dots,j_{d}}(x)=\sum\limits_{{\bm{\alpha}}:|{\bm{\alpha}}|\leqslant q}\frac{D^{\bm{\alpha}}f(a_{j_{1},\dots,j_{d}})}{{\bm{\alpha}}!}(x-a_{j_{1},\dots,j_{d}})^{\bm{\alpha}}\,,

where aj1,,jd=(aj1,,ajd)a_{j_{1},\dots,j_{d}}=(a_{j_{1}},\dots,a_{j_{d}}). Here we used the conventional notations 𝜶!=α1!α2!αd!{\bm{\alpha}}!=\alpha_{1}!\alpha_{2}!\dots\alpha_{d}! and

(xaj1,,jd)𝜶=(x1aj1)α1(x2aj2)α2(xdajd)αd.(x-a_{j_{1},\dots,j_{d}})^{\bm{\alpha}}=(x_{1}-a_{j_{1}})^{\alpha_{1}}(x_{2}-a_{j_{2}})^{\alpha_{2}}\dots(x_{d}-a_{j_{d}})^{\alpha_{d}}.

Since Pj1,,jdfP^{f}_{j_{1},\dots,j_{d}} belongs to 𝒮d\mathcal{S}_{d}, we have 𝒬dPj1,,jdf=Pj1,,jdf\mathcal{Q}_{d}P^{f}_{j_{1},\dots,j_{d}}=P^{f}_{j_{1},\dots,j_{d}}. By the triangle inequality,

D𝜸fD𝜸𝒬dfL(Aj1,,jd)D𝜸(fPj1,,jdf)L(Aj1,,jd)+D𝜸𝒬d(fPj1,,jdf)L(Aj1,,jd).\begin{split}\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}\mathcal{Q}_{d}f\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}&\leqslant\|D^{\bm{\gamma}}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\\ &\quad+\|D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}.\end{split} (32)

By Taylor’s theorem and the definition of the Hölder class, the first term in the right-hand side does not exceed

D𝜸fD𝜸Pj1,,jdfL(Aj1,,jd)\displaystyle\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
=𝜶:|𝜶|=q|𝜸|q(xaj1,,jd)𝜶𝜶!01(1s)q1\displaystyle=\Bigg{\|}\sum\limits_{{\bm{\alpha}}:|{\bm{\alpha}}|=q-|{\bm{\gamma}}|}\frac{q(x-a_{j_{1},\dots,j_{d}})^{\bm{\alpha}}}{{\bm{\alpha}}!}\int\limits_{0}^{1}(1-s)^{q-1}
(D𝜶+𝜸f(sx+(1s)aj1,,jd)D𝜶+𝜸f(aj1,,jd))dsL(Aj1,,jd)\displaystyle\quad\cdot\left(D^{{\bm{\alpha}}+{\bm{\gamma}}}f(sx+(1-s)a_{j_{1},\dots,j_{d}})-D^{{\bm{\alpha}}+{\bm{\gamma}}}f(a_{j_{1},\dots,j_{d}})\right)\mathrm{d}s\Bigg{\|}_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
α:|𝜶|=q|𝜸|q(q/K)q|𝜸|H𝜶!01(1s)q1xaj1,,jdβqdsL(Aj1,,jd)\displaystyle\leqslant\left\|\sum\limits_{\alpha:|{\bm{\alpha}}|=q-|{\bm{\gamma}}|}\frac{q(q/K)^{q-|{\bm{\gamma}}|}H}{{\bm{\alpha}}!}\int\limits_{0}^{1}(1-s)^{q-1}\|x-a_{j_{1},\dots,j_{d}}\|^{\beta-q}\mathrm{d}s\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
𝜶:|𝜶|=q|𝜸|(q/K)q|𝜸|H𝜶!(qd/K)βq(qd/K)β|𝜸|H(q|𝜸|)!.\displaystyle\leqslant\sum\limits_{{\bm{\alpha}}:|{\bm{\alpha}}|=q-|{\bm{\gamma}}|}\frac{(q/K)^{q-|{\bm{\gamma}}|}H}{{\bm{\alpha}}!}(q\sqrt{d}/K)^{\beta-q}\leqslant\frac{(qd/K)^{\beta-|{\bm{\gamma}}|}H}{(q-|{\bm{\gamma}}|)!}.

Here we used the fact that

𝜶0d:|𝜶|=mm!𝜶!=dmfor all m0.\sum\limits_{{\bm{\alpha}}\in\mathbb{N}_{0}^{d}:|{\bm{\alpha}}|=m}\frac{m!}{{\bm{\alpha}}!}=d^{m}\quad\text{for all $m\in\mathbb{N}_{0}$.}
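This identity is the multinomial theorem evaluated at (1,\dots,1). A brute-force check for small d and m (Python; illustrative only):

```python
import itertools
import math

def multinomial_sum(d, m):
    # Sum of m!/alpha! over all multi-indices alpha in N_0^d with |alpha| = m.
    return sum(
        math.factorial(m) // math.prod(math.factorial(a) for a in alpha)
        for alpha in itertools.product(range(m + 1), repeat=d)
        if sum(alpha) == m
    )

# Agrees with d**m over the tested range.
assert all(multinomial_sum(d, m) == d ** m for d in range(1, 5) for m in range(0, 6))
```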

Moreover, note that, for any r{0,1,,q1}r\in\{0,1,\dots,q-1\}, it holds that

qβr(qr)!=qqrqβr1(qr1)!qβr1(qr1)!.\frac{q^{\beta-r}}{(q-r)!}=\frac{q}{q-r}\cdot\frac{q^{\beta-r-1}}{(q-r-1)!}\geqslant\frac{q^{\beta-r-1}}{(q-r-1)!}.

This yields that qβr/(qr)!q^{\beta-r}/(q-r)! achieves its maximum over r{0,1,,q}r\in\{0,1,\dots,q\} at r=0r=0. Thus,

D𝜸fD𝜸Pj1,,jdfL(Aj1,,jd)(qd/K)β|𝜸|H(q|𝜸|)!qβ(d/K)β|𝜸|Hq!.\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\leqslant\frac{(qd/K)^{\beta-|{\bm{\gamma}}|}H}{(q-|{\bm{\gamma}}|)!}\leqslant\frac{q^{\beta}(d/K)^{\beta-|{\bm{\gamma}}|}H}{q!}.

Since

q!2πeqqq+1/2q!\geqslant\sqrt{2\pi}e^{-q}q^{q+1/2} (33)

for all qq\in\mathbb{N}, we obtain that

D𝜸fD𝜸Pj1,,jdfL(Aj1,,jd)(2e)q(d/K)β|𝜸|H.\|D^{\bm{\gamma}}f-D^{\bm{\gamma}}P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\leqslant(\sqrt{2}e)^{q}(d/K)^{\beta-|{\bm{\gamma}}|}H.

Now, we consider the second term in (32), that is, D𝜸𝒬d(fPj1,,jdf)D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}}). By the definition of 𝒬d\mathcal{Q}_{d}, we have

D𝜸𝒬d(fPj1,,jdf)=i1,,id=1q+Ki1,,id(fPj1,,jdf)D𝜸(Ni1(x1)Nid(xd)).D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})=\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\mathcal{L}_{i_{1},\dots,i_{d}}(f-P^{f}_{j_{1},\dots,j_{d}})D^{\bm{\gamma}}\left(N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right).

The triangle inequality and Proposition 2 imply that

D𝜸𝒬d(fPj1,,jdf)L(Aj1,,jd)\displaystyle\|D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
i1,,id=1q+K|i1,,id(fPj1,,jdf)|D𝜸Ni1(x1)Nid(xd)L(Aj1,,jd)\displaystyle\leqslant\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}|\mathcal{L}_{i_{1},\dots,i_{d}}(f-P^{f}_{j_{1},\dots,j_{d}})|\left\|D^{\bm{\gamma}}N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
(2q+1)d9d(q1)fPj1,,jdfL(Aj1,,jd)\displaystyle\leqslant(2q+1)^{d}9^{d(q-1)}\|f-P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
i1,,id=1q+KD𝜸Ni1(x1)Nid(xd)L(Aj1,,jd).\displaystyle\quad\cdot\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\left\|D^{\bm{\gamma}}N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}.

Again, by Taylor’s theorem,

fPj1,,jdfL(Aj1,,jd)(qd/K)βHq!.\|f-P^{f}_{j_{1},\dots,j_{d}}\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\leqslant\frac{(qd/K)^{\beta}H}{q!}.

By Proposition 1, for any m{1,,d}m\in\{1,\dots,d\}, supp(Nim)=[aim,aim+q+1]\text{supp}(N_{i_{m}})=[a_{i_{m}},a_{i_{m}+q+1}]. Hence, the intersection of supp(Nim)\text{supp}(N_{i_{m}}) with [ajm,ajm+q+1)[a_{j_{m}},a_{j_{m}+q+1}) has zero volume if |imjm|>q|i_{m}-j_{m}|>q. This yields

i1,,id=1q+KD𝜸Ni1(x1)Nid(xd)L(Aj1,,jd)\displaystyle\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\left\|D^{\bm{\gamma}}N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
=i1,,id=1q+KD𝜸Ni1(x1)Nid(xd)L(Aj1,,jd)m=1d𝟙(|imjm|q).\displaystyle=\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\left\|D^{\bm{\gamma}}N_{i_{1}}(x_{1})\dots N_{i_{d}}(x_{d})\right\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}\prod_{m=1}^{d}\mathbbm{1}\left(|i_{m}-j_{m}|\leqslant q\right).

Applying Corollary 2, we obtain that

D𝜸𝒬d(fPj1,,jdf)L(Aj1,,jd)\displaystyle\|D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})}
(2q+1)d9d(q1)(qd/K)βHq!i1,,id=1q+Km=1dq!(2K)γm(qγm)!𝟙(|imjm|q)\displaystyle\leqslant\frac{(2q+1)^{d}9^{d(q-1)}(qd/K)^{\beta}H}{q!}\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}\prod_{m=1}^{d}\frac{q!(2K)^{\gamma_{m}}}{(q-\gamma_{m})!}\mathbbm{1}\left(|i_{m}-j_{m}|\leqslant q\right)
(2q+1)d9d(q1)(qd/K)βHq!i1,,id=1q+K(2Kq)|𝜸|m=1d𝟙(|imjm|q).\displaystyle\leqslant\frac{(2q+1)^{d}9^{d(q-1)}(qd/K)^{\beta}H}{q!}\sum\limits_{i_{1},\dots,i_{d}=1}^{q+K}(2Kq)^{|{\bm{\gamma}}|}\prod_{m=1}^{d}\mathbbm{1}\left(|i_{m}-j_{m}|\leqslant q\right).

The sum on the right-hand side contains at most (2q+1)d(2q+1)^{d} non-zero terms. Hence,

D𝜸𝒬d(fPj1,,jdf)L(Aj1,,jd)\displaystyle\|D^{\bm{\gamma}}\mathcal{Q}_{d}(f-P^{f}_{j_{1},\dots,j_{d}})\|_{L_{\infty}(A_{j_{1},\dots,j_{d}})} (2q+1)2d9d(q1)(qd/K)β(2Kq)|𝜸|Hq!\displaystyle\leqslant\frac{(2q+1)^{2d}9^{d(q-1)}(qd/K)^{\beta}(2Kq)^{|{\bm{\gamma}}|}H}{q!}
9d(q1)(2q+1)2d+|𝜸|(2ed)βHKβ|𝜸|,\displaystyle\leqslant\frac{9^{d(q-1)}(2q+1)^{2d+|{\bm{\gamma}}|}(\sqrt{2}ed)^{\beta}H}{K^{\beta-|{\bm{\gamma}}|}}\,,

where we used (33).