Simultaneous approximation of a smooth function
and its derivatives by deep neural networks
with piecewise-polynomial activations
Abstract
This paper investigates the approximation properties of deep neural networks with piecewise-polynomial activation functions. We derive the required depth, width, and sparsity of a deep neural network to approximate any Hölder smooth function up to a given approximation error in Hölder norms in such a way that all weights of this neural network are bounded by . The latter feature is essential to control generalization errors in many statistical and machine learning applications.
Keywords: deep neural networks, approximation complexity, ReQU activations, activations, Hölder class
MSC [2020]: 41A25, 68T07
1 Introduction
Neural networks have recently gained much attention due to their impressive performance in many complicated practical tasks, including image processing [1], generative modelling [2], reinforcement learning [3], numerical solution of PDEs, e.g., [4, 5], and optimal control [6, 7]. This makes them extremely useful in the design of self-driving vehicles [8] and robot control systems, e.g., [9, 10, 11]. One of the reasons for this success of neural networks is their expressiveness, that is, the ability to approximate functions with any desired accuracy. The question of expressiveness of neural networks has a long history and goes back to the papers [12, 13, 14]. In particular, in [13], the author showed that one hidden layer is enough to approximate any continuous function with any prescribed accuracy . However, further analysis revealed that deep neural networks may require far fewer parameters than shallow ones to achieve the same accuracy. Much effort has been devoted in recent years to understanding the fidelity of deep neural networks. In a pioneering work [15], the author showed that for any target function from the Sobolev space there is a neural network with parameters and ReLU activation function that approximates within the accuracy with respect to the -norm on the unit cube . Further works in this direction considered various smoothness classes of target functions [16, 17, 18, 19], neural networks with diverse activations [17, 20, 21, 22], domains of more complicated shape [23], and measured the approximation errors with respect to different norms [15, 17, 20, 24]. Several authors also considered the expressiveness of neural networks with different architectures, including wide neural networks of logarithmic [15, 17, 24] or even constant depth [25, 16, 20, 19], and deep and narrow networks [26, 27, 28]. Most of the existing results on the expressiveness of neural networks measure the quality of approximation with respect to either the - or -norm, . Far fewer papers study the approximation of derivatives of smooth functions; the rare exceptions include [29, 17, 20].
In the present paper, we focus on feed-forward neural networks with piecewise-polynomial activation functions of the form . Neural networks with such activations are known to successfully approximate smooth functions from the Sobolev and Besov spaces with respect to the - and -norms (see, for instance, [30, 25, 16, 31, 32, 33, 34, 35]). We continue this line of research and study the ability of such neural networks to approximate not only smooth functions themselves but also their derivatives. We derive non-asymptotic upper bounds on the Hölder norm between the target function and its approximation from a class of sparsely connected neural networks with ReQU activations. In particular, we show that, for any from a Hölder ball , , , (see Section 2 for the definition) and any there exists a neural network with ReQU activation functions that uniformly approximates the target function in the norms of the Hölder spaces for all . Here and throughout the paper, stands for the largest integer strictly smaller than . A simplified statement of our main result is given below.
Theorem 1 (simplified version of Theorem 2).
Fix and . Then, for any , for any , and any integer , there exists a neural network with ReQU activation functions such that it has layers, at most neurons in each layer and non-zero weights taking their values in . Moreover, it holds that
where is a universal constant.
We provide explicit expressions for the hidden constants in Theorem 2. The main contributions of our work can be summarized as follows.
-
1.
Given a smooth target function , we construct a neural network that simultaneously approximates all the derivatives of up to order with optimal dependence of the precision on the number of non-zero weights. That is, if we denote the number of non-zero weights in the network by , then it holds that simultaneously for all
-
2.
The constructed neural network has almost the same smoothness as the target function itself while approximating it with the optimal accuracy. This property turns out to be very useful in many applications, including the approximation of PDEs and density transformations, where we need to use derivatives of the approximation.
-
3.
The weights of the approximating neural network are bounded in absolute values by . The latter property plays a crucial role in deriving bounds on the generalization error of empirical risk minimizers in terms of the covering number of the corresponding parametric class of neural networks. Note that the upper bounds on the weights provided in [29, 17, 20] blow up as the approximation error decreases.
The rest of the paper is organized as follows. In Section 2, we introduce the necessary definitions and notation. Section 3 contains the statement of our main result, Theorem 2, followed by a detailed comparison with the existing literature. We then present numerical experiments in Section 4. The proofs are collected in Section 5. Some auxiliary facts are deferred to Appendix A.
2 Preliminaries and notations
Norms
For a matrix and a vector , we denote by and the maximal absolute value of entries of and , respectively. and shall stand for the number of non-zero entries of and , respectively. Finally, the Frobenius norm and operator norm of are denoted by and , respectively, and the Euclidean norm of is denoted by . For a function , we set
If the domain is clear from the context, we simply write or , instead of or , respectively.
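For reference, the conventions described above can be written out as follows; the symbols below are the standard choices and are meant only as an indicative rendering, not a verbatim quote of the paper's notation.

```latex
% Standard notation (assumed): entrywise max, counting "norm", Frobenius norm, sup-norm.
\|A\|_\infty = \max_{i,j}|A_{ij}|, \quad \|b\|_\infty = \max_i |b_i|, \quad
\|A\|_0 = \#\{(i,j): A_{ij} \neq 0\}, \quad \|b\|_0 = \#\{i: b_i \neq 0\},
\quad \|A\|_F = \Bigl(\textstyle\sum_{i,j} A_{ij}^2\Bigr)^{1/2}, \quad
\|f\|_{L_\infty(D)} = \sup_{x \in D}|f(x)|.
```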
Smoothness classes
Let , . For a multi-index , we write and define the corresponding partial differential operator as
For , the function space consists of those functions over the domain which have bounded and continuous derivatives up to order in . Formally,
For a function and any positive number , the Hölder constant of order is given by
(1) |
Now, for any , we can define the Hölder ball . If we let be the largest integer strictly less than , contains all functions in with -Hölder-continuous, , partial derivatives of order . Formally,
We also write if for some
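For concreteness, a standard way to write the Hölder constant and the Hölder ball consistent with the description above is the following; the notation is the usual one and is only indicative of the paper's exact definitions.

```latex
% Hölder constant of order beta and Hölder ball (standard conventions, assumed).
H^{\beta}(f) = \max_{|\gamma| = \lfloor\beta\rfloor}\,
  \sup_{x \neq y}\frac{|D^{\gamma}f(x) - D^{\gamma}f(y)|}{\|x-y\|^{\beta-\lfloor\beta\rfloor}},
\qquad
\mathcal{C}^{\beta}(D, H) = \Bigl\{ f \in C^{\lfloor\beta\rfloor}(D) :
  \max_{|\gamma| \le \lfloor\beta\rfloor}\|D^{\gamma}f\|_{L_\infty(D)} + H^{\beta}(f) \le H \Bigr\}.
```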
Neural networks
Fix an activation function . For a vector , we define the shifted activation function as
Given a positive integer and a vector , a neural network of depth (with hidden layers) and architecture is a function of the form
(2) |
where are weight matrices and are shift (bias) vectors. The maximum number of neurons in one layer is called the width of the neural network. Next, we introduce a subclass of neural networks of depth with architecture and at most non-zero weights. That is, consist of functions of the form (2), such that
We also use the notation , standing for . Throughout the paper, we use the ReQU (rectified quadratic unit) activation function, defined as
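For concreteness, here is a minimal numerical sketch of the ReQU activation (the standard definition $\sigma(x) = \max(x,0)^2$) together with the exact two-unit representation of the square function that ReQU networks rely on; the helper names are ours.

```python
import numpy as np

def requ(x):
    # ReQU (rectified quadratic unit): sigma(x) = max(x, 0)^2
    return np.maximum(x, 0.0) ** 2

# Two ReQU units represent the square function exactly: x^2 = sigma(x) + sigma(-x).
x = np.linspace(-2.0, 2.0, 9)
assert np.allclose(requ(x) + requ(-x), x ** 2)
```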
Concatenation and parallelization of neural networks
During the construction of the approximating network in Theorem 2, we use the operations of concatenation (consecutive connection) and parallel connection of neural networks. Given neural networks and of architectures and , respectively, such that , the concatenation of and is their composition , that is, a neural network of the architecture . The parallel connection of neural networks is defined as follows. Let
and
be two neural networks of the same depth with architectures and , respectively. Define
Then the parallel connection of and is a neural network of architecture , given by
Note that the number of non-zero weights of is equal to the sum of the ones in and .
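The parallel connection described above amounts to stacking the two networks' weight matrices block-diagonally layer by layer, so that neither network sees the other's neurons. Below is a minimal sketch of one such layer; the helper name is ours, and we assume equal depth, as in the text.

```python
import numpy as np

def parallel_layer(A1, b1, A2, b2):
    # Block-diagonal weights and stacked biases: the two sub-networks do not interact.
    A = np.block([
        [A1, np.zeros((A1.shape[0], A2.shape[1]))],
        [np.zeros((A2.shape[0], A1.shape[1])), A2],
    ])
    b = np.concatenate([b1, b2])
    return A, b

A1, b1 = np.array([[1.0, 0.0]]), np.array([0.5])
A2, b2 = np.array([[2.0], [0.0]]), np.array([0.0, 1.0])
A, b = parallel_layer(A1, b1, A2, b2)
# The number of non-zero weights adds up, as claimed in the text.
assert np.count_nonzero(A) == np.count_nonzero(A1) + np.count_nonzero(A2)
```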
3 Main results
3.1 Approximation of functions from Hölder classes
Our main result states that any function from , , , can be approximated by a feed-forward deep neural network with ReQU activation functions in
Theorem 2.
Let and let . Then, for any , for any , and any integer , there exists a neural network of the width
with
hidden layers and at most non-zero weights taking their values in , such that, for any ,
(3) |
The above constant is given by
An illustrative comparison of the result of Theorem 2 with the literature is provided in Table 1 below. Theorem 2 improves over the results of [25, Theorem 3.3] and [16, Theorem 7] as far as the approximation properties of ReQU neural networks in terms of the -norm are concerned. Namely, Theorem 2 claims that, for any , , and any , there is a ReQU neural network of width and depth with non-zero weights, taking their values in , such that it approximates within the accuracy with respect to the -norm on . In [25, 16], the authors considered a target function from a general weighted Sobolev class but measured the quality of approximation in terms of a weighted -norm. Nevertheless, the width and the number of non-zero weights of the neural networks constructed in Theorem 2, [25, Theorem 3.3], and [16, Theorem 7] coincide. The difference is that, while in [25, 16] the depth of the neural networks is of order , we need only layers. More importantly, the authors of [25, 16] do not provide any guarantees on the absolute values of the weights of the approximating neural networks. At the same time, all the weights of our neural network take their values in . We will explain the importance of the latter property a bit later in this section. It is also worth mentioning that the same number of non-zero weights is needed to approximate within the tolerance (with respect to the -norm) via neural networks with other activation functions, such as ReLU [36], sigmoid [22, 17], and [17].
The approximation properties of deep neural networks with respect to norms involving derivatives are much less studied. In the rest of this section, we elaborate on the comparison with the state-of-the-art results in this direction. The fact that shallow neural networks can simultaneously approximate a smooth function and its derivatives has been known for a long time from the paper [37], where the authors derived an upper bound on the approximation accuracy in terms of the modulus of smoothness. To be more precise, they showed that if a function is continuous on a compact set and its derivatives up to an order are in , then there is a neural network with
hidden units, such that
Here
is the modulus of smoothness of a function . Moreover, if, in addition, the derivatives of of order are -Hölder, , then it holds that
Taking into account that for a Lipschitz function , we conclude that one has to use a shallow neural network with at least hidden units to approximate a function of interest or its derivatives within the accuracy with respect to the -norm even if is sufficiently smooth. This bound becomes prohibitive in the case of large dimension. The situation is much better in the case of deep neural networks. To our knowledge, Gühring and Raslan [17, Proposition 4.8] were the first to prove that, for any , , and , there is a ReQU neural network with non-zero weights, which approximates within the accuracy with respect to the -norm on . A drawback of the result in [17] is that the architecture of the suggested neural network heavily depends on . Hence, it is hard to control higher-order derivatives of the approximating neural network itself, which is of interest in such applications as numerical solutions of PDEs and density transformations. This question was addressed in [20, 38], where the authors considered the problem of simultaneous approximation of a target function with respect to the Hölder norms of different orders. In [20, Theorem 5.1], the authors showed that, for any , , and any sufficiently large integer , there is a three-layer neural network of width with activations, such that
simultaneously for all . Note that Theorem 2 yields a sharper bound, removing the extra logarithmic factors. Regarding the results of the recent work [38], we found some critical flaws in the proofs of the main results. We explain our concerns in Remark 1 below, as the main results of [38] are close to ours. Theorem 2 also improves over [17, 20, 38] in another aspect. Namely, Theorem 2 guarantees that the weights of the neural network take their values in , while in [17, 20, 38] they may grow polynomially as the approximation error decreases. This boundedness property is extremely important when studying the generalization ability of neural networks in various statistical problems, since the metric entropy of the corresponding parametric class of neural networks involves a uniform upper bound on the weights (see, for instance, [24, Theorem 2 and Lemma 5]). Besides that, a polynomial upper bound on the absolute values of the weights becomes prohibitive when approximating analytic functions (see Remark 2 in the next section).
Remark 1.
The proofs of Theorem 3.1 and Theorem 3.8 in [38] have critical flaws. In particular, on page 18, the authors mistakenly bound the Sobolev norm of the composite function by the -norm of . A similar error appears on page 24. In this way, the authors obtain that the -norm of is bounded by , where is the smoothness of the target function. In fact, the latter norm should scale as , where is a small auxiliary parameter describing the width of the boundary strips. This flaw completely ruins the proofs of the main results in [38], Theorem 1.1 (see the 9th line of the proof on p.8) and Theorem 1.4 (the 9th line of the proof on p.25).
Paper | Norm | Depth | Non-zero weights | Simultaneous approximation | Weights in |
[16]* | N/A | ✗ | |||
[17]* | ✗ | ✗ | |||
[20] | ✓ | ✗ | |||
[22] | N/A | ✗ | |||
[24] | N/A | ✓ | |||
[25]* | N/A | ✗ | |||
[37]* | 2 | ✓ | ✗ | ||
Ours | ✓ | ✓ |
3.2 Approximation of analytic functions
The bound of Theorem 2 can be transformed into exponential (in ) rates of approximation for analytic functions. In the rest of this section, we consider -analytic functions defined below.
Definition 1.
A function is called -analytic with , if it satisfies the inequality
(4) |
Similar concepts were also considered in [39, 40, 20]. Applying Theorem 2 to -analytic functions, we get the following corollary.
Corollary 1.
Let be a -analytic function and let . Then, for any integer , there exists a neural network of width with hidden layers and at most non-zero weights, taking values in , such that
(5) |
Remark 2.
In [20, Corollary 5.6], the authors claim that -analytic functions can be approximated with exponential (in the number of non-zero weights ) rates with respect to the Sobolev norm , where is a fixed integer (see eq.(112) on p.743). However, a closer look at Corollary 5.6 reveals the fact that the right-hand side of eq.(112) includes a constant with depending on too, as follows from eq.(54). Thus, the dependence of the final bound on in Corollary 5.6 remains unclear. Besides that, in contrast to [20, Theorem 5.2], the authors do not specify the upper bound on the absolute values of the weights of the constructed neural network in Corollary 5.6. A thorough inspection of the proof shows that the weights can be as large as .
4 Numerical experiments
In this section, we provide numerical experiments to illustrate the approximation properties of neural networks with ReQU activations. We considered a scalar function of two variables and approximated it on the unit square via neural networks with two types of activations: ReLU and ReQU. All the neural networks were fully connected, and all their hidden layers had a width . The first layer had a width . The depth of the neural networks took values in . In the training phase, we sampled points independently from the uniform distribution on and tuned the weights of the neural networks by minimizing the mean empirical squared error
After that, we computed the approximation errors of and of its gradient on the two-dimensional grid with :
where denotes the neural network with the weights tuned during the training phase. We repeated the experiment times and computed the average approximation errors. The results are displayed in Figure 1. The quality of approximation by neural networks with ReQU activations turns out to be better than that of the networks with ReLU activations. Note that in such a scenario the stochastic error becomes small, and the quality of learning is mostly determined by the approximation error. This claim is supported by the fact that the error values were similar across different runs of the experiment.
[Figure 1: average approximation errors of the function and its gradient for ReLU and ReQU networks of varying depth.]
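A minimal sketch of the experimental setup described above is given below; the widths, depth, sample size, optimizer, and the test function are illustrative assumptions rather than the exact values used in the paper.

```python
import torch
import torch.nn as nn

class ReQU(nn.Module):
    def forward(self, x):
        return torch.clamp(x, min=0.0) ** 2  # sigma(x) = max(x, 0)^2

def make_mlp(depth, width=32, activation=ReQU):
    # Fully connected network for a scalar function of two variables.
    layers, d_in = [], 2
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), activation()]
        d_in = width
    layers.append(nn.Linear(d_in, 1))
    return nn.Sequential(*layers)

def train(net, f, n_train=1000, epochs=500, lr=1e-3):
    x = torch.rand(n_train, 2)                         # uniform sample on the unit square
    y = f(x)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((net(x).squeeze(-1) - y) ** 2).mean()  # mean empirical squared error
        loss.backward()
        opt.step()
    return net

# Hypothetical smooth target; swap activation=nn.ReLU to reproduce the ReLU baseline.
f = lambda x: torch.sin(x[:, 0]) * torch.cos(x[:, 1])
net = train(make_mlp(depth=3), f)
```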
5 Proofs
5.1 Proof of Theorem 2
Step 1. Let . Consider a vector , such that
(6) | ||||
By Theorem 3, there exist tensor-product splines of order associated with knots at such that
Our goal is to show that can be represented by a neural network with ReQU activation functions. For this purpose, for each , we use an expansion of with respect to the basis
where are normalized -splines defined in (31). There exist coefficients such that, for any ,
(7) |
where are (unnormalized) -splines defined in (30). Hence, in order to represent by a neural network with ReQU activations, we first perform this task for the products of basis functions , .
Step 2. Applying Lemma 3 with component-wise for each , we obtain that the mapping
(8) | ||||
can be represented by a neural network from the class , where
(9) |
and are defined in (15) and (21). Here should be understood as element-wise multiplication of entries by . Since each coordinate transformation in (5.1) can be computed independently, the number of parameters in this network does not exceed
Using Lemma 1 with , for any fixed , we calculate all products by a neural network from
with
(10) |
Note that this network contains at most
non-zero parameters.
Step 3. Let us recall that, due to (7), we have
(11) |
for all . In Step , we constructed a network that calculates the products for all . Now we implement the multiplications by
Proposition 2 implies that
Since by the conditions of the theorem, we can write
Equation (5.1) yields that for all . In view of Lemma 4, we can implement the multiplication by by a network from the class
Here we used the fact that, since ,
Hence, the representation (11) implies that the function can be represented by a neural network with
hidden layers and the architecture
where and are given by (9) and (10), respectively. As before, weights of the constructed neural network are bounded by . Note that this network contains at most
non-zero weights. The last summand appears, because each of
neurons in the first layer of receives information from two neurons from the previous layer. This completes the proof.
5.2 Auxiliary lemmas for Theorem 2
Lemma 1.
Let . Then, for any , there exists a neural network from the class
(12) |
which implements the map . Moreover, this network contains at most non-zero weights.
Proof of Lemma 1.
Let . We first prove that the map belongs to the class . Let us define and
Then it is easy to see that
(13) |
Now, for any , we use the following representation
(14) |
Put . Then it remains to note that, based on the equality (13), for any , we can implement a sequence of mappings
by a neural network from the class
The number of parameters in such a network does not exceed . We complete the proof by combining this bound with (14).
∎
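The elementary identity behind Lemma 1, namely that a product is recovered from two squares and each square from two ReQU units, can be checked numerically; this is only a sanity check of the identity, not the full network construction of the proof.

```python
import numpy as np

requ = lambda u: np.maximum(u, 0.0) ** 2
square = lambda u: requ(u) + requ(-u)              # u^2 = sigma(u) + sigma(-u)

def product_via_requ(x, y):
    # xy = ((x + y)^2 - (x - y)^2) / 4, each square computed by two ReQU units
    return (square(x + y) - square(x - y)) / 4.0

x, y = np.random.default_rng(1).uniform(-1.0, 1.0, size=(2, 10))
assert np.allclose(product_via_requ(x, y), x * y)
```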
Lemma 2.
Let be integers not smaller than , and . Then the mapping
can be represented by a network from the class , where
(15) |
containing at most non-zero weights.
Proof.
Note first that, due to [41, Theorem 4.32],
The functions , , , and can be computed directly using (30). Indeed, for any ,
Moreover, (30) implies that and . Hence, each of the functions can be exactly represented by a neural network from the class . The final mapping (b) is implemented as . Note that the identity map belongs to too:
(16) |
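One explicit four-unit ReQU representation of the identity map, in the spirit of (16) (this particular formula is a standard one and may differ from the exact expression used in the paper), is $x = \tfrac14\bigl(\sigma(x+1) + \sigma(-x-1) - \sigma(x-1) - \sigma(-x+1)\bigr)$; a quick numerical check:

```python
import numpy as np

requ = lambda u: np.maximum(u, 0.0) ** 2

def identity_via_requ(x):
    # ((x+1)^2 - (x-1)^2) / 4 = x, with each square written as requ(u) + requ(-u)
    return (requ(x + 1) + requ(-x - 1) - requ(x - 1) - requ(-x + 1)) / 4.0

x = np.linspace(-3.0, 3.0, 13)
assert np.allclose(identity_via_requ(x), x)
```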
Combining the arguments above, we conclude that the mapping
(17) |
can be exactly represented by a network from the class . Besides, one can implement the sequence of transformations
(18) |
using a neural network from . Connecting in parallel the neural networks realizing the maps (17) and (18), we obtain that one can implement the transformation
(19) |
via a neural network from the class .
It remains to implement the multiplication of the splines by to complete the proof. Recall that, according to Lemma 1, the function mapping a pair to their product belongs to the class . For any , consider a vector
Let us implement the multiplications of the splines by in parallel. Then we obtain that the mapping
is in the class . Here we took into account that we need non-zero weights to implement each of multiplications and non-zero weights to implement the identity maps and . Repeating the same trick two more times, we get that there is a neural network in , implementing
(20) |
This follows from the fact that each of the multiplications by , running in parallel, can be implemented by a neural network from , and the two identity maps and can be represented by a network from . Concatenating the neural networks performing (19) and (20), we finally obtain that the mapping of interest,
is in the class of neural networks from , containing at most
non-zero weights. ∎
Lemma 3.
Let and . Then for any , the mapping belongs to the class
with at most non-zero weights. The vector is defined in (15), and for , is equal to
(21) |
Proof of Lemma 3.
Fix an integer . We consider the family of -splines and construct all elements of this family sequentially, starting from . By Lemma 2, the map
can be represented by a network in the class (15), containing at most non-zero weights. Assume that for some the mapping
belongs to the class
(22) |
We use the recursion formula (30) to perform the induction step. Note that we can implement each of the mappings
(23) |
by networks from . This is possible since, for and satisfying , it holds that . We remove the last linear layer of (22) and concatenate the remaining part with the first layer of the networks implementing (23). We obtain a network with hidden layers and the architecture
(24) |
which implements the mapping
(25) |
Note that we have added at most
non-zero weights, since each linear function of the form (23) can be implemented via a network from with at most non-zero weights. According to (30), for any , we construct
(26) |
Now we add one more hidden layer with parameters to the network (25), yielding a network with an architecture
(27) |
which implements
Note that adding the new hidden layer adds at most
additional non-zero weights. Combining the above representations and (26), we obtain a network from the class
with at most
non-zero weights. ∎
Lemma 4.
Let and let . Then, for any such that , the mapping can be represented by a neural network belonging to the class
and containing at most non-zero weights.
Proof.
Using the representation , we obtain that there is a neural network from , implementing the sequence of maps
At the same time, due to
the identity map belongs to
Running the two neural networks in parallel, we can implement the mapping
via a neural network from . Finally, applying Lemma 1, we conclude that there is a neural network from the class
performing the map . The number of non-zero weights in this network does not exceed
It remains to note that, by the definition of , we have .
∎
5.3 Proof of Corollary 1
Let be an integer to be specified later. Plugging the bound , , into (3), we get that there is a neural network of width
with
hidden layers and at most
non-zero weights, taking their values in , such that, for any ,
Recall that, by the definition, for any integer . Using the inequalities
which are valid for all , we obtain that
(28) |
Set
(29) |
It is easy to observe that, with such a choice of , the neural network is of the width and has hidden layers and at most non-zero weights. The bound (5) on the approximation error of follows directly from (28) and (29).
References
- [1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444. doi:10.1038/nature14539.
- [2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680. URL https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
- [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533. doi:10.1038/nature14236.
- [4] J. Han, A. Jentzen, W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences 115 (34) (2018) 8505–8510. doi:10.1073/pnas.1718942115.
- [5] M. Geist, P. Petersen, M. Raslan, R. Schneider, G. Kutyniok, Numerical solution of the parametric diffusion equation by deep neural networks, Journal of Scientific Computing 88 (1) (2021) 1–37. doi:10.1007/s10915-021-01532-w.
- [6] Y. Chen, Y. Shi, B. Zhang, Optimal control via neural networks: A convex approach, in: International Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=H1MW72AcK7
- [7] D. Onken, L. Nurbekyan, X. Li, S. W. Fung, S. Osher, L. Ruthotto, A neural network approach for high-dimensional optimal control applied to multiagent path finding, IEEE Transactions on Control Systems Technology (2022) 1–17. doi:10.1109/TCST.2022.3172872.
- [8] L. Li, K. Ota, M. Dong, Humanlike driving: Empirical decision-making system for autonomous vehicles, IEEE Transactions on Vehicular Technology 67 (8) (2018) 6814–6823. doi:10.1109/TVT.2018.2822762.
- [9] G. Cembrano, C. Torras, G. Wells, Neural networks for robot control, Annual Review in Automatic Programming 19 (1994) 159–166. doi:10.1016/0066-4138(94)90059-0. URL https://www.sciencedirect.com/science/article/pii/0066413894900590
- [10] P. Bozek, Y. L. Karavaev, A. A. Ardentov, K. S. Yefremov, Neural network control of a wheeled mobile robot based on optimal trajectories, International Journal of Advanced Robotic Systems 17 (2) (2020) 1729881420916077. doi:10.1177/1729881420916077.
- [11] M. González-Álvarez, J. Dupeyroux, F. Corradi, G. C. de Croon, Evolved neuromorphic radar-based altitude controller for an autonomous open-source blimp, 2022 International Conference on Robotics and Automation (ICRA) (2022) 85–90.
- [12] K.-I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural networks 2 (3) (1989) 183–192. doi:10.1016/0893-6080(89)90003-8.
- [13] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems 2 (4) (1989) 303–314. doi:10.1007/BF02551274.
- [14] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural networks 4 (2) (1991) 251–257. doi:10.1016/0893-6080(91)90009-T.
- [15] D. Yarotsky, Error bounds for approximations with deep ReLU networks, Neural Networks 94 (2017) 103–114. doi:10.1016/j.neunet.2017.07.002.
- [16] B. Li, S. Tang, H. Yu, PowerNet: efficient representations of polynomials and smooth functions by deep neural networks with rectified power units, Journal of Mathematical Study 53 (2) (2020) 159–191. doi:10.4208/jms.v53n2.20.03.
- [17] I. Gühring, M. Raslan, Approximation rates for neural networks with encodable weights in smoothness spaces, Neural Networks 134 (2021) 107–130. doi:10.1016/j.neunet.2020.11.010.
- [18] J. Lu, Z. Shen, H. Yang, S. Zhang, Deep network approximation for smooth functions, SIAM Journal on Mathematical Analysis 53 (5) (2021) 5465–5506. doi:10.1137/20M134695X.
- [19] Z. Shen, H. Yang, S. Zhang, Neural network approximation: Three hidden layers are enough, Neural Networks 141 (2021) 160–173. doi:10.1016/j.neunet.2021.04.011.
- [20] T. De Ryck, S. Lanthaler, S. Mishra, On the approximation of functions by tanh neural networks, Neural Networks 143 (2021) 732–750. doi:10.1016/j.neunet.2021.08.015.
- [21] Y. Jiao, Y. Lai, X. Lu, F. Wang, J. Z. Yang, Y. Yang, Deep neural networks with relu-sine-exponential activations break curse of dimensionality on Hölder class, preprint, arXiv:2103.00542 (2021). doi:10.48550/ARXIV.2103.00542.
- [22] S. Langer, Approximating smooth functions by deep neural networks with sigmoid activation function, Journal of Multivariate Analysis 182 (2021) Paper No. 104696, 21. doi:10.1016/j.jmva.2020.104696.
- [23] Z. Shen, H. Yang, S. Zhang, Deep network approximation characterized by number of neurons, Communications in Computational Physics 28 (5) (2020) 1768–1811. doi:10.4208/cicp.oa-2020-0149.
- [24] J. Schmidt-Hieber, Nonparametric regression using deep neural networks with ReLU activation function, The Annals of Statistics 48 (4) (2020) 1875–1897. doi:10.1214/19-AOS1875.
- [25] B. Li, S. Tang, H. Yu, Better approximations of high dimensional smooth functions by deep neural networks with rectified power units, Communications in Computational Physics 27 (2) (2020) 379–411. doi:10.4208/cicp.oa-2019-0168.
- [26] B. Hanin, Universal function approximation by deep neural nets with bounded width and ReLU activations, Mathematics 7 (10). doi:10.3390/math7100992.
- [27] P. Kidger, T. Lyons, Universal approximation with deep narrow networks, in: Proceedings of Thirty Third Conference on Learning Theory, Vol. 125 of Proceedings of Machine Learning Research, 2020, pp. 2306–2327. URL https://proceedings.mlr.press/v125/kidger20a.html
- [28] S. Park, C. Yun, J. Lee, J. Shin, Minimum width for universal approximation, in: International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=O-XJwyoIF-k
- [29] I. Gühring, G. Kutyniok, P. Petersen, Error bounds for approximations with deep ReLU neural networks in norms, Analysis and Applications 18 (5) (2020) 803–859. doi:10.1142/S0219530519410021.
- [30] J. M. Klusowski, A. R. Barron, Approximation by combinations of ReLU and squared ReLU ridge functions with and controls, IEEE Transactions on Information Theory 64 (12) (2018) 7649–7656. doi:10.1109/tit.2018.2874447.
- [31] M. Ali, A. Nouy, Approximation of smoothness classes by deep rectifier networks, SIAM Journal on Numerical Analysis 59 (6) (2021) 3032–3051. doi:10.1137/20M1360657.
- [32] A. Abdeljawad, P. Grohs, Approximations with deep neural networks in Sobolev time-space, Analysis and Applications 20 (3) (2022) 499–541. doi:10.1142/S0219530522500014.
- [33] Q. Chen, W. Hao, J. He, Power series expansion neural network, Journal of Computational Science 59 (2022) 101552. doi:10.1016/j.jocs.2021.101552.
- [34] R. Gribonval, G. Kutyniok, M. Nielsen, F. Voigtlaender, Approximation spaces of deep neural networks, Constructive Approximation. An International Journal for Approximations and Expansions 55 (1) (2022) 259–367. doi:10.1007/s00365-021-09543-4.
- [35] J. W. Siegel, J. Xu, High-order approximation rates for shallow neural networks with cosine and activation functions, Applied and Computational Harmonic Analysis 58 (2022) 1–26. doi:10.1016/j.acha.2021.12.005.
- [36] D. Yarotsky, A. Zhevnerchuk, The phase diagram of approximation rates for deep neural networks, in: Advances in Neural Information Processing Systems, Vol. 33, 2020, pp. 13005–13015. URL https://proceedings.neurips.cc/paper/2020/file/979a3f14bae523dc5101c52120c535e9-Paper.pdf
- [37] Z.-B. Xu, F.-L. Cao, Simultaneous -approximation order for neural networks, Neural Networks 18 (7) (2005) 914–923. doi:10.1016/j.neunet.2005.03.013.
- [38] S. Hon, H. Yang, Simultaneous neural network approximation for smooth functions, preprint, arXiv:2109.00161 (2021). doi:10.48550/ARXIV.2109.00161.
- [39] E. Candès, L. Demanet, L. Ying, Fast computation of fourier integral operators, SIAM Journal on Scientific Computing 29 (6) (2007) 2464–2493. doi:10.1137/060671139.
- [40] E. Candès, L. Demanet, L. Ying, A fast butterfly algorithm for the computation of Fourier integral operators, Multiscale Modeling & Simulation. A SIAM Interdisciplinary Journal 7 (4) (2009) 1727–1750. doi:10.1137/080734339.
- [41] L. Schumaker, Spline Functions: Basic theory, 3rd Edition, Cambridge University Press, Cambridge, 2007. doi:10.1017/CBO9780511618994.
Appendix A Some properties of multivariate splines
In this section, we provide some properties of multivariate splines. For more details, we refer the reader to the book [41].
A.1 Univariate splines
We start with the one-dimensional case. Fix . We call a function a univariate spline of degree with knots at , if
-
1.
for any , the restriction of on is a polynomial of degree ;
-
2.
has continuous derivatives up to order , that is, for any and any , .
It is easy to observe that univariate splines of degree with knots at form a linear space. We denote this space by . Note that has finite dimension . There are several ways to choose a basis in ; below, we construct the basis of normalized -splines . Let be a vector of knots defined in (5.1). First, we successively construct (unnormalized) B-splines associated with knots . For any , let
Then for any , , we put
(30) |
Now, for , we define the normalized -spline as
(31) |
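The recursion (30) is the classical Cox-de Boor scheme. A generic sketch for an arbitrary non-decreasing knot vector is given below; it uses the standard partition-of-unity convention, which may differ from the paper's unnormalized/normalized pair (30)–(31) by a constant factor, and the equally spaced knots are only an illustrative assumption.

```python
def b_spline(t, i, q, x):
    # B-spline of degree q on the knot vector t via the Cox-de Boor recursion.
    if q == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + q] > t[i]:
        left = (x - t[i]) / (t[i + q] - t[i]) * b_spline(t, i, q - 1, x)
    if t[i + q + 1] > t[i + 1]:
        right = (t[i + q + 1] - x) / (t[i + q + 1] - t[i + 1]) * b_spline(t, i + 1, q - 1, x)
    return left + right

knots = [j / 4 for j in range(-3, 8)]                       # illustrative equally spaced knots
vals = [b_spline(knots, i, 2, 0.37) for i in range(len(knots) - 3)]
assert abs(sum(vals) - 1.0) < 1e-12                         # partition of unity in this convention
```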
We use the following properties of normalized -splines (see [41, Section 4.3] for the details).
Proposition 1 ([41], Section 4.3).
Let be a normalized B-spline defined in (31). Then the following holds.
-
1.
For any ,
-
2.
For any and any , is a spline of degree ; it has continuous derivatives up to order , and the -th derivative is Lipschitz. Moreover, for any ,
Corollary 2.
For any and any ,
Proof.
To study approximation properties of with respect to the Hölder norm , we follow the technique from [41, Sections 4.6 and 12.3] and introduce a dual basis for normalized B-splines. A set of linear functionals on is called a dual basis if, for all ,
An explicit expression for can be found in [41, Eq.(4.89)] but it is not necessary for our purposes. Define a linear operator by
The function is called a quasi-interpolant, and it nicely approximates provided that is smooth enough. The approximation properties of are studied below for the case of multivariate tensor-product splines.
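Written out under the usual dual-basis convention of [41] (the exact indexing used in the paper is omitted here), the quasi-interpolant takes the form

```latex
Q f \;=\; \sum_{i} \lambda_i(f)\, N_i ,
```

and the duality relation $\lambda_i(N_j) = \delta_{ij}$ immediately gives $Qs = s$ for every spline $s$ in the space, which is what makes $Qf$ a good local approximation of a smooth target.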
A.2 Tensor-product splines
The main result of this section is 3 concerning approximation properties of multivariate splines. We start with auxiliary definitions. The tensor product
is called a space of tensor-product splines of degree . Since and are fixed, we omit the upper indices in the notations , , if there is no ambiguity. For any , we denote the dual basis for constructed in A.1. For any and , define
It is easy to see that form a dual basis for
. Now we formulate the approximation result from [41, Theorem 12.5].
Proposition 2 ([41], Theorem 12.5).
Let . Fix any and let be as defined above. Introduce
Then
Similarly to , for any , define a multivariate quasi-interpolant
Similarly to ’s, we omit the upper indices in the notation of when they are clear from context. We are ready to formulate the main result of this section.
Theorem 3.
Let and fix a non-negative integer . Let the linear operator be as defined above. Then
Proof.
For simplicity, we adopt the notation . Fix a multi-index , . Then
Consider a polynomial of degree
where . Here we used conventional notations and
Since belongs to , we have . By the triangle inequality,
(32) |
By Taylor’s theorem and the definition of the Hölder class, the first term in the right-hand side does not exceed
Here we used the fact that
Moreover, note that, for any , it holds that
This yields that achieves its maximum over at . Thus,
Since
(33) |
for all , we obtain that
Now, we consider the second term in (32), that is, . By the definition of , we have
The triangle inequality and Proposition 2 imply that
Again, by Taylor’s theorem,
By Proposition 1, for any , . Hence, the intersection of with has zero volume if . This yields
Applying Corollary 2, we obtain that
The sum on the right-hand side contains at most non-zero terms. Hence,
where we used (33).
∎