The authors thank Professor Renjin Jiang at Capital Normal University for helpful discussions. The work of Liao and Ming was supported by the National Natural Science Foundation of China through Grants No. 11971467 and No. 12371438.
1. Introduction
A series of works has been devoted to studying the neural network approximation error and generalization error with the Barron class [Barron:1992, Barron:1993, Barron:1994, Barron:2018] as the target space. For a complex-valued function and , the spectral norm is defined as
|
|
|
where is the Fourier transform of in the distribution sense. A function is said to belong to the Barron class if its spectral norm is finite and the Fourier inversion holds pointwise. However, this definition lacks rigor, as it does not specify the conditions under which the pointwise Fourier inversion is valid; addressing this issue is nontrivial, as discussed in [Pinsky:1997]. For this reason, the authors in [Ma:2017, Xu:2020, Siegel:2022, Siegel:2023] assume , and define
(1.1) |
|
|
|
For functions in , the Fourier transform and the pointwise Fourier inversion are valid. Unfortunately, we shall prove in Lemma 2.3 that equipped with the norm is not complete; therefore, it does not qualify as a Banach space.
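For orientation, one common convention in the literature cited above writes the spectral norm of order s as the weighted integral below (placeholder symbol \(v_{f,s}\); the weight used in the displays above may differ in lower-order terms):
\[
v_{f,s}=\int_{\mathbb{R}^{d}}\bigl(1+|\omega|\bigr)^{s}\,\bigl|\hat f(\omega)\bigr|\,\mathrm{d}\omega ,\qquad s\ge 0 .
\]
With the weight replaced by \(|\omega|\), this reduces to Barron's original quantity \(\int_{\mathbb{R}^{d}}|\omega|\,|\hat f(\omega)|\,\mathrm{d}\omega\) from [Barron:1993].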
To address this issue, an alternative class of function spaces has been proposed, which can be traced back to the work of Hörmander [Hormander:1963]. It is defined as follows.
|
|
|
for and . This space has been studied extensively and may be referred to by different names. Some call it the Hörmander space, as in [Hormander:1963, Messina:2001, DGSM60:2014, Ivec:2021]; others refer to it as the Fourier-Lebesgue space, as in [Grochenig:2002, Pilipovic:2010, BenyiOh:2013, Kato:2020]. We are interested in , and call it the spectral Barron space:
|
|
|
which is equipped with the norm
|
|
|
We show in Lemma 2.1 that the pointwise Fourier inversion is valid for functions in with a nonnegative . Some authors also refer to as the Fourier algebra or the Wiener algebra, whose algebraic properties, such as the Wiener-Levy theorem [Wiener:1932, Levy:1935, Helson:1959], have been extensively studied in [ReitherStegeman:2000, Liflyand:2012].
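For reference, the Fourier-Lebesgue scale is often written as follows (placeholder notation; the exact weight and symbols in the displays above may differ):
\[
\mathcal{F}L^{p}_{s}(\mathbb{R}^{d})=\Bigl\{f\in\mathscr{S}'(\mathbb{R}^{d}):\ \hat f\in L^{1}_{\mathrm{loc}}(\mathbb{R}^{d}),\ \ \bigl\|(1+|\xi|)^{s}\hat f\bigr\|_{L^{p}(\mathbb{R}^{d})}<\infty\Bigr\},
\]
and the spectral Barron space of order s corresponds to the case \(p=1\), with norm \(\int_{\mathbb{R}^{d}}(1+|\xi|)^{s}|\hat f(\xi)|\,\mathrm{d}\xi\).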
Another popular space for analyzing shallow neural networks is the Barron space [E:2019, EMW:2022], which can be viewed as the class of shallow neural networks with infinite width. The authors in [Wojtowytsch:2022, E:2022] claimed that the spectral Barron space is much smaller than the Barron space. As observed in [Caragea:2023], this statement is not accurate because it does not discriminate the smoothness index in . In addition, the variation space, introduced in [Barron:2008], has been studied in relation to the spectral Barron space and the Barron space in [SiegelXu:2022, Siegel:2023]. These spaces have been exploited to study the regularity of partial differential equations [ChenLuLu:2021, Lu:2021, E:2022, Chen:2023]. Recently, a new space [ParhiNowak:2022] originating from variational spline theory, which is closely related to the variation space, has also been exploited as the target function space for neural network approximation [ParhiNowak:2023].
The first objective of the present work is to study the analytical properties of . In Lemma 2.2, we show that is complete, while Lemma 2.3 shows that is not complete. This distinction highlights a key difference between these two spaces. Furthermore, Lemma 2.5 provides an example illustrating that functions in may decay arbitrarily slowly. This example, constructed using the generalized hypergeometric function, reveals interesting relationships between the Fourier transform and the decay rate of the function. Additionally, we study the relations among and some classical function spaces. In Theorem 2.9, we establish the connection between and the Besov space, and in Corollary 2.12 the connection between and the Sobolev spaces. Notably, we prove the embedding relation
|
|
|
which is an optimal result that appears to be missing in the existing literature. This embedding may serve as a bridge for studying how the Barron space, the variation space, and the space in [ParhiNowak:2022] are related to classical function spaces such as the Besov space; cf. [ParhiNowak:2022]*§ 5.
The second objective of the present work is to explore neural network approximation on a bounded domain. Building upon Barron's seminal works on approximating functions in with the -norm, recent studies extended the approximation to functions in with the -norm, as demonstrated in [Siegel:2020, Xu:2020]. Furthermore, improved approximation rates have been achieved for functions in with large in works such as [Bresler:2020, MaSiegelXu:2022, Siegel:2022]. These advancements contribute to a deeper understanding of the approximation capabilities of neural networks.
The distinction between deep ReLU networks and shallow networks has been highlighted in the separation theorems presented in [Eldan:2016, Telgarsky:2016, Shamir:2022]. These theorems provide examples that can be well approximated by three-layer ReLU neural networks but not by two-layer ReLU neural networks whose width grows polynomially with the dimension, which sheds light on the differences in expressive power between shallow and deep networks. Moreover, approximation rates for neural networks targeting mixed-derivative Besov/Sobolev spaces, spectral Barron spaces, and Hölder spaces have also been investigated in [Du:2019, BreslerNagaraj:2020, Bolcskei:2021, LuShenYangZhang:2021, Suzuki:2021]. These studies contribute to a broader understanding of the approximation capabilities of neural networks in various function spaces.
We focus on the -approximation properties for functions in when is small.
In Theorem 3.9, we establish that a neural network with hidden layers and units in each layer can approximate functions in with a convergence rate of when . This bound is sharp, as demonstrated in Theorem 3.11.
Importantly, our results provide optimal convergence rates compared to the existing literature. For deep neural networks, a similar result has been presented in [BreslerNagaraj:2020] with a convergence rate of . For shallow neural networks, i.e., , convergence rates of have been established in [MengMing:2022, Siegel:2022] when . However, the constants in their estimates depend on the dimension at least polynomially, or even exponentially, and require bounds on norms other than . Our results achieve optimal convergence rates without this additional dependence on the dimension or on other norms.
The remaining part of the paper is structured as follows. In Section 2, we demonstrate that the spectral Barron space is a Banach space and examine its relationship with other function spaces. This analysis provides a foundation for understanding the properties of the spectral Barron space. In Section 3, we delve into the error estimation for approximating functions in the spectral Barron space using deep neural networks with finite depth and infinite width. By investigating the convergence properties of these networks, we gain insights into their approximation capabilities and provide error bounds for their performance. Finally, in Section 4, we conclude our work by summarizing the key findings and contributions of this study. We also discuss potential avenues for future research and highlight the significance of our results in the broader context of function approximation using neural networks. Certain technical results are postponed to the Appendix.
2. Completeness of and its relation to other function spaces
This part discusses the completeness of the spectral Barron space and its embedding relations to other classical function spaces. Firstly, we fix some notation. Let be the Schwartz space and let be its topological dual space, i.e., the space of tempered distributions. The Gamma function is defined as
\[
\Gamma(z)=\int_{0}^{\infty}t^{z-1}e^{-t}\,\mathrm{d}t,\qquad \Re z>0 .
\]
We denote the surface area of the unit sphere by ; the volume of the unit ball is then . The Beta function is defined as
\[
B(p,q)=\int_{0}^{1}t^{p-1}(1-t)^{q-1}\,\mathrm{d}t=\frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)},\qquad p,q>0 .
\]
The Bessel function of the first kind is defined by the series
\[
J_{\nu}(z)=\sum_{k=0}^{\infty}\frac{(-1)^{k}}{k!\,\Gamma(\nu+k+1)}\Bigl(\frac{z}{2}\Bigr)^{\nu+2k}.
\]
This definition may be found in [Luke:1962]*§ 1.4.1, Eq. (1).
For , its Fourier transform is defined as
|
|
|
and the inverse Fourier transform is defined as
|
|
|
If , then the Fourier transform in the sense of distribution means
|
|
|
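For concreteness, one standard normalization (which may differ from the convention adopted in the displays above by constant factors) reads
\[
\hat f(\xi)=\int_{\mathbb{R}^{d}}f(x)\,e^{-2\pi i x\cdot\xi}\,\mathrm{d}x,
\qquad
f(x)=\int_{\mathbb{R}^{d}}\hat f(\xi)\,e^{2\pi i x\cdot\xi}\,\mathrm{d}\xi,
\]
and for a tempered distribution the transform is defined by duality, i.e., \(\langle\hat u,\varphi\rangle=\langle u,\hat\varphi\rangle\) for all Schwartz functions \(\varphi\).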
We shall frequently use the following Hausdorff-Young inequality. Let and , then
(2.1) |
|
|
|
where is the conjugate exponent of ; i.e. .
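As a hedged reminder, under the normalization with the factor \(2\pi\) in the exponent the Hausdorff-Young inequality takes the form
\[
\|\hat f\|_{L^{p'}(\mathbb{R}^{d})}\le\|f\|_{L^{p}(\mathbb{R}^{d})},
\qquad 1\le p\le 2,\quad \frac1p+\frac1{p'}=1;
\]
with other normalizations a dimensional constant appears on the right-hand side.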
We shall use the following pointwise Fourier inversion theorem.
Lemma 2.1.
Let , then in . Furthermore, let and , then , a.e. on .
Proof.
By definition, there holds
|
|
|
Therefore, in . Note that ,
|
|
|
By the Hausdorff-Young inequality (2.1),
|
|
|
Therefore, is a bounded linear operator on ; i.e., , since is dense in and
|
|
|
Hence, , a.e. on , by [Brezis:2011]*Corollary 4.24.
∎
A direct consequence of Lemma 2.1 is that the pointwise Fourier inversion is valid for functions in . We shall frequently use this fact later on.
2.1. Completeness of the spectral Barron space
Lemma 2.2.
-
(1)
is a Banach space.
-
(2)
When , is not a Banach space if the norm is replaced by .
Proof.
We give a brief proof of the first claim, which has been stated in [Hormander:1963]*Theorem 2.2.1, for the reader's convenience.
It is sufficient to check the completeness of . For any Cauchy sequence , there exists such that in . Therefore there exists a sub-sequence of (still denoted by ) such that a.e. on .
Define the measure by setting, for any measurable set ,
|
|
|
Then is a Cauchy sequence in and there exists such that in . Therefore there exists a subsequence of (still denoted by ) such that -a.e. on . Note that for any measurable set , is equivalent to . Therefore a.e. on . By the uniqueness of limits, , a.e. on .
Define . Lemma 2.1 shows that in . Therefore and in . Hence is complete and it is a Banach space.
The proof for (2) is a reductio ad absurdum.
Suppose the claim does not hold, then there exists depending only on and such that for any ,
(2.2) |
|
|
|
We shall show this is false by the following example.
For some , let
|
|
|
To bound and , we introduce the Bochner-Riesz multipliers
|
|
|
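In their standard form (the indexing and normalization used in this paper may differ), the Bochner-Riesz multipliers read
\[
m_{\delta}(\xi)=\bigl(1-|\xi|^{2}\bigr)_{+}^{\delta},\qquad \delta>0,
\]
where \(t_{+}=\max(t,0)\); their kernels are classically expressed through the Bessel function \(J_{d/2+\delta}\), which is presumably how the explicit formulas in (2.3) and (2.4) arise.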
We claim
(2.3) |
|
|
|
and
(2.4) |
|
|
|
The proof is postponed to Appendix A.1. It follows from (2.3) that
(2.5) |
|
|
|
and with
|
|
|
and
|
|
|
where we have used (2.4).
It is clear that
|
|
|
Hence while . This shows that (2.2) is invalid for sufficiently large . This proves the second claim.
∎
Similar to , the space defined in (1.1) has been exploited as the target space for neural network approximation by several authors [Ma:2017, Xu:2020, Siegel:2022, Siegel:2023]. The advantage of this space is that the Fourier transform is well-defined and the pointwise Fourier inversion is true for functions belonging to . Unfortunately, is not a Banach space, as we shall show below.
Lemma 2.3.
The space defined in (1.1) equipped with the norm is not a Banach space.
To prove Lemma 2.3, we recall the Barron spectrum space defined by Meng and Ming in [MengMing:2022]: for and ,
(2.6) |
|
|
|
equipped with the norm . A useful interpolation inequality that compares spectral norms of different orders has been proved in [MengMing:2022]*Lemma 2.1: for and , there exists depending on and such that
(2.7) |
|
|
|
where . For any , using the fact
|
|
|
we observe that the inequality (2.7) is dilation invariant, i.e., it is unchanged if we replace by .
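To illustrate the dilation argument, assume for concreteness the homogeneous weight \(|\xi|^{s}\) and the scaling \(\widehat{f(\lambda\,\cdot\,)}(\xi)=\lambda^{-d}\hat f(\xi/\lambda)\) (the exact weight and normalization in (2.6)-(2.7) may differ); then
\[
\int_{\mathbb{R}^{d}}|\xi|^{s}\,\bigl|\widehat{f(\lambda\,\cdot\,)}(\xi)\bigr|\,\mathrm{d}\xi
=\lambda^{s}\int_{\mathbb{R}^{d}}|\eta|^{s}\,|\hat f(\eta)|\,\mathrm{d}\eta ,
\]
so both sides of (2.7) scale by the same power of \(\lambda\) when the intermediate order is the stated convex combination of the two outer orders.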
Proof of Lemma 2.3.
The authors in [MengMing:2022] have proved that equipped with the norm is a Banach space. For any , taking and in (2.7), we obtain that there exists depending only on and such that
|
|
|
where .
We prove the assertion by reductio ad absurdum. Suppose that equipped with the norm is also a Banach space. Then, by the bounded inverse theorem and the above interpolation inequality, there exists depending only on and such that for any ,
(2.8) |
|
|
|
We shall show this is not the case by the following example.
For some , we define
|
|
|
Using (2.3) and noting , we have the explicit form of as
(2.9) |
|
|
|
Using (2.4), we get
|
|
|
and
|
|
|
Proceeding along the same line, we obtain
|
|
|
and
|
|
|
Hence,
(2.10) |
|
|
|
By (2.3), a direct calculation gives
|
|
|
Invoking [Grafakos:2014]*Appendix B.6, B.7, there exists that depends on such that
|
|
|
Hence, there exists depending only on and such that
|
|
|
|
|
|
|
|
|
|
|
|
where we have used the fact in the last step. Therefore, is bounded by a constant that depends only on and but is independent of . Moreover,
|
|
|
and by the Hausdorff-Young inequality (2.1),
|
|
|
This means that , which together with (2.10) immediately shows that the inequality (2.8) cannot be true for sufficiently large . Hence, we conclude that is not a Banach space.
∎
2.2. Embedding relations of the spectral Barron spaces
In this part we discuss the embedding of the spectral Barron spaces.
Lemma 2.4.
-
(1)
Interpolation inequality: For any satisfying with , and , there holds
(2.11) |
|
|
|
and
(2.12) |
|
|
|
-
(2)
Let , there holds with
(2.13) |
|
|
|
The embedding (2.13) has been stated in [Hormander:1963]*Theorem 2.2.2 without tracing the embedding constant.
Proof.
We start with the interpolation inequality (2.11) for the spectral norm. For any with , using Hölder's inequality, we obtain
|
|
|
This gives (2.11).
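In more detail, writing the intermediate order as \(\theta s_{1}+(1-\theta)s_{2}\) with \(\theta\in(0,1)\) and using the weight \((1+|\xi|)^{s}\) for illustration (placeholder notation; the weight in the text may differ), the Hölder step reads
\[
\int_{\mathbb{R}^{d}}(1+|\xi|)^{\theta s_{1}+(1-\theta)s_{2}}|\hat f(\xi)|\,\mathrm{d}\xi
\le
\Bigl(\int_{\mathbb{R}^{d}}(1+|\xi|)^{s_{1}}|\hat f(\xi)|\,\mathrm{d}\xi\Bigr)^{\theta}
\Bigl(\int_{\mathbb{R}^{d}}(1+|\xi|)^{s_{2}}|\hat f(\xi)|\,\mathrm{d}\xi\Bigr)^{1-\theta},
\]
obtained by splitting \(|\hat f|=|\hat f|^{\theta}|\hat f|^{1-\theta}\) and applying Hölder's inequality with exponents \(1/\theta\) and \(1/(1-\theta)\).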
Next, for , by Young's inequality, we have
|
|
|
|
|
|
|
|
|
|
|
|
This yields
|
|
|
Letting and , we obtain
|
|
|
This implies (2.12).
Next, if we take in (2.11) and with , then
|
|
|
This leads to (2.13) and completes the proof.
∎
The next lemma shows that is a proper subspace of .
Lemma 2.5.
For and , there holds , and the inclusion is proper in the sense that for any , there exists and .
Proof.
It follows from the interpolation inequality (2.7) that . Hence
|
|
|
This implies for any and .
We shall show below that the inclusion is proper. Let
|
|
|
where is the characteristic function on that equals one if and zero otherwise. It is straightforward to verify .
We shall show below that , based on the following explicit formula for derived in Appendix A.2:
(2.14) |
|
|
|
where the generalized hypergeometric function is defined as follows: for nonnegative integers and , provided none of the parameters is zero or a negative integer,
\[
{}_{p}F_{q}(a_{1},\dots,a_{p};b_{1},\dots,b_{q};z)
=\sum_{k=0}^{\infty}\frac{(a_{1})_{k}\cdots(a_{p})_{k}}{(b_{1})_{k}\cdots(b_{q})_{k}}\,\frac{z^{k}}{k!},
\qquad (a)_{k}=a(a+1)\cdots(a+k-1).
\]
The generalized hypergeometric series converges for all finite if . In particular, . Hence is finite for any . Using [MathaiSaxena:1973]*Appendix, we obtain
|
|
|
Therefore,
|
|
|
This immediately implies .
∎
2.3. Relations to some classical function spaces
In this part, we establish the embedding between the spectral Barron space and the Besov space, and hence bridge and the Sobolev space as in [MengMing:2022]. We first recall the definition of the Besov space.
Definition 2.7 (Besov space).
Let satisfy and
|
|
|
For every multi-index , there exists a positive number such that
|
|
|
and
|
|
|
Let and . Define the Besov space
|
|
|
equipped with the norm
|
|
|
and
|
|
|
We first recall the following embedding for the Besov space, which was first proved in the series of works by Taibleson [Taibleson:1964, Taibleson:1965, Taibleson:1966]. We include the proof in Appendix A.3 for the reader's convenience.
Lemma 2.8.
There holds if and only if and one of the following conditions holds:
-
(1)
and are arbitrary;
-
(2)
and .
The main result of the embedding is:
Theorem 2.9.
-
(1)
There holds
(2.15) |
|
|
|
-
(2)
The above embedding is optimal in the sense that is the largest among all satisfying , and is the smallest among all satisfying .
Proof.
To prove (1), first, for any ,
|
|
|
|
|
|
|
|
A direct calculation gives: for ,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Using Plancherel's theorem, we get
|
|
|
|
|
|
|
|
|
|
|
|
Next, for any , by Lemma 2.1 we have ; using the Hausdorff-Young inequality (2.1), we obtain
|
|
|
|
|
|
|
|
|
|
|
|
Therefore, . This proves (2.15) with
|
|
|
It remains to show that the embedding (2.15) is optimal. Suppose that there exists such that
|
|
|
using Lemma 2.8, we would have and . In what follows, we shall exploit an example adapted from [Lieb:2001]*Ch. 5, Ex. 9 to show that when and . Therefore, for any .
Let
|
|
|
A direct calculation gives . Hence
and
|
|
|
which is independent of . We shall prove in Appendix A.4 that when and ,
(2.16) |
|
|
|
Therefore when and .
On the other hand, we cannot expect that there exists some such that . Otherwise, we would have
|
|
|
because of Lemma 2.8 and [Triebel:1983]*§ 2.5.7, Proposition. This contradicts the fact that , which has been proved in Lemma 2.5.
∎
As a consequence of Theorem 2.9 and Lemma 2.5, we establish the embedding between the spectral Barron space and the Sobolev spaces.
Definition 2.10 (Fractional Sobolev space).
Let and non-integer , then the fractional Sobolev space
|
|
|
equipped with the norm
|
|
|
We first recall the relation between the Sobolev space and , which has been proved in [MengMing:2022]*Theorem 4.3.
Lemma 2.11 ([MengMing:2022]*Theorem 4.3).
-
(1)
If and , then
|
|
|
-
(2)
If is not an integer or is an integer and , then
|
|
|
It follows from the above lemma and Lemma 2.5 that
Corollary 2.12.
-
(1)
If and , there holds
(2.17) |
|
|
|
-
(2)
If is not an integer or is an integer and , then
|
|
|
The first embedding with and was implicit in [Barron:1993]*§ II, Para. 7; § IX, 15.
Proof.
By Lemma 2.11 and Lemma 2.5, when and , we have
|
|
|
When is not an integer or is an integer and , there holds
|
|
|
It remains to prove the right-hand side of (2.17). Using Theorem 2.9,
|
|
|
due to Theorem 2.9, Lemma 2.8 and [Triebel:1983]*§ 2.3.5, Eq. (1); § 2.5.7, Eq. (2), (9), (11).
∎
3. Application to deep neural network approximation
The embedding results proved in Theorem 2.9 and Corollary 2.12 indicate that is a smoothness index. Consequently, we are interested in exploring the approximation rate when is small with as the target function space. To facilitate our analysis, we shall focus on the hypercube , and the spectral norm for a function defined on is
|
|
|
where the infimum is taken over all extension operators . To simplify notation, we write for in what follows. We replace by in the definition of ; the latter seems more natural for studying the approximation over the hypercube, as suggested by [Barron:1993]*§ V.
Definition 3.1.
A sigmoidal function is a bounded function such that
\[
\lim_{t\to-\infty}\sigma(t)=0
\qquad\text{and}\qquad
\lim_{t\to+\infty}\sigma(t)=1 .
\]
For example, the Heaviside function is a sigmoidal function.
A classical idea for bounding the approximation error of neural networks with sigmoidal activation functions is to use the Heaviside function as an intermediate step. Caragea et al. [Caragea:2023] pointed out that the gap between a sigmoidal function and the Heaviside function cannot be dismissed in , while this gap does not exist in .
Lemma 3.2.
For fixed and ,
|
|
|
Proof.
Note that
|
|
|
We divide the cube into and . With a proper choice of and large enough, the -distance between and can be made arbitrarily small.
∎
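Alternatively, a standard argument gives the same conclusion (a hedged sketch, with illustrative symbols \(\sigma\) for the sigmoidal function, \(H\) for the Heaviside function, \(a,b\) for the direction and offset, and \(Q\) for the cube): for every \(x\) with \(a\cdot x\neq b\) one has \(\sigma(k(a\cdot x-b))\to H(a\cdot x-b)\) as \(k\to\infty\), and since \(\sigma\) is bounded and the hyperplane \(\{a\cdot x=b\}\) has measure zero, dominated convergence yields
\[
\bigl\|\sigma\bigl(k(a\cdot x-b)\bigr)-H(a\cdot x-b)\bigr\|_{L^{p}(Q)}\longrightarrow 0
\qquad\text{as } k\to\infty,\quad 1\le p<\infty .
\]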
For a shallow neural network, the following lemma in [Barron:1993] is proved for real-valued functions; it is straightforward to extend the proof to complex-valued functions.
Lemma 3.3 ([Barron:1993]*Theorem 1).
Let , there exists
(3.1) |
|
|
|
with and such that
|
|
|
In this part, we show the approximation error for deep neural networks. We use the term -network to describe a neural network with hidden layers and at most units per layer, where denotes the number of hidden layers; e.g., the shallow neural network expressed as (3.1) is an -network.
Definition 3.4 (-network).
An -network is a neural network with hidden layers and at most units per layer. The activation functions of the first layers are all ReLU, and the activation function of the last layer is a sigmoidal function. The connection weights between the input layer and the first hidden layer, and between consecutive hidden layers, are all real numbers, while the connection weights between the last hidden layer and the output layer are complex numbers.
Here we make some preparations for the remaining work. The analysis in this part owes much to [BreslerNagaraj:2020], with certain improvements that will be detailed later on. For any function defined on that is symmetric about , we use the notation to denote the function obtained by repeating periodically times over the interval , i.e.,
(3.2) |
|
|
|
Define
|
|
|
By definition (3.2), represents a triangle function with peaks and can be represented by ReLUs:
|
|
|
Lemma 3.5.
Let be a function defined on and symmetric about , then on .
The above lemma is a rigorous statement of [Telgarsky:2016]*Proposition 5.1. A key example is when . A geometrical explanation may be found in [Bolcskei:2021]*Figure 3. We postpone the rigorous proof to Appendix A.5.
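To make the peak counting concrete, the following minimal numerical sketch (not taken from the paper; the hat function and its three-ReLU representation below are one standard choice, and all names are illustrative) checks that the k-fold composition has 2^(k-1) peaks on the unit interval:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # Triangle (hat) function on [0, 1] written with three ReLUs:
    # hat(x) = 2x on [0, 1/2], 2(1 - x) on [1/2, 1], and 0 outside [0, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def sawtooth(x, k):
    # The k-fold composition hat o hat o ... o hat is a periodic triangle
    # wave with 2**(k - 1) peaks of height 1 on [0, 1].
    y = np.asarray(x, dtype=float)
    for _ in range(k):
        y = hat(y)
    return y

x = np.linspace(0.0, 1.0, 10001)
for k in range(1, 5):
    y = sawtooth(x, k)
    # Each peak value 1 is attained at exactly one grid point here.
    print(k, int(np.count_nonzero(np.isclose(y, 1.0))))  # prints 1, 2, 4, 8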
For , we define
|
|
|
then supp and is symmetric about . Define
|
|
|
Then is symmetric about because
|
|
|
|
|
|
|
|
By definition (3.2), is well defined on and
|
|
|
|
|
|
|
|
|
|
Moreover, on can be represented by Heaviside functions, since each only needs one Heaviside function:
|
|
|
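As a hedged illustration of this counting, using the convention \(H(t)=\mathbf{1}_{\{t\ge 0\}}\) (which may differ from the convention adopted here),
\[
\mathbf{1}_{[a,\infty)}(x)=H(x-a),
\qquad
\mathbf{1}_{[a,b)}(x)=H(x-a)-H(x-b),
\]
so a half-line indicator costs a single Heaviside unit and an interval indicator costs two.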
A direct consequence of the above construction is
Lemma 3.6.
For , there holds
(3.3) |
|
|
|
Proof.
For any , a direct calculation gives
|
|
|
Fix a . If there exists an integer satisfying and , then and
|
|
|
Otherwise there exists an integer satisfying and . Then and
|
|
|
This completes the proof of (3.3).
∎
Now we are ready to give an approximation result for deep neural networks, which follows the framework of [BreslerNagaraj:2020], while we achieve a higher-order convergence rate with a dimension-free prefactor.
Lemma 3.7.
Let the positive integer and with and . For any positive integer there exists an -network such that
(3.4) |
|
|
|
Proof.
By Lemma 2.1, for , assuming that is real-valued, we have
|
|
|
with a proper choice such that . For fixed , choose and ; then , and by Lemma 3.6,
|
|
|
Define the probability measure
(3.5) |
|
|
|
where is the normalization factor such that
|
|
|
Therefore with
|
|
|
If is an i.i.d. sequence of random samples from , and
|
|
|
then using Fubini's theorem, we obtain
|
|
|
|
|
|
|
|
|
|
|
|
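The computation above is an instance of the standard Monte Carlo sampling bound; in a generic hedged form, with \(X_{1},\dots,X_{n}\) i.i.d. \(L^{2}\)-valued random elements with mean \(g\) (placeholder notation), it reads
\[
\mathbb{E}\,\Bigl\|\frac{1}{n}\sum_{i=1}^{n}X_{i}-g\Bigr\|_{L^{2}}^{2}
=\frac{1}{n}\Bigl(\mathbb{E}\,\|X_{1}\|_{L^{2}}^{2}-\|g\|_{L^{2}}^{2}\Bigr)
\le\frac{1}{n}\,\mathbb{E}\,\|X_{1}\|_{L^{2}}^{2}.
\]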
Note that
|
|
|
we obtain
|
|
|
By Markov's inequality, with probability at least , for some to be chosen later on, we obtain
(3.6) |
|
|
|
It remains to calculate the number of units in each layer. For each , choose ; then , and by Lemma 3.5, on . Lemma 3.2 shows that the Heaviside function can be approximated by , and we need at most
|
|
|
units in each layer to represent . Denote by the total number of units in each layer; then and
|
|
|
Again, by Markov's inequality, with probability at least , we obtain
(3.7) |
|
|
|
Combining (3.6) and (3.7), with probability at least , there exists an -network such that
|
|
|
with a proper choice of in the last step. Finally, if is complex-valued, we approximate the real and imaginary parts of the function separately to obtain (3.4).
∎
There is relatively little work on the approximation rate of deep neural networks that employ the spectral Barron space as the target space. For deep ReLU networks, [BreslerNagaraj:2020] has proven approximation results of -order. We shall show below that this may be improved to -order at the cost of appearing in the estimate.
Theorem 3.9.
Let the positive integer and with . For any positive integer there exists an -network such that
(3.8) |
|
|
|
Moreover, if is a real-valued function, then the connection weights in are all real.
Proof.
We may write with
|
|
|
Then and because
|
|
|
We approximate with an -network with and obtain the error estimate. Applying Lemma 3.3, we obtain that there exists an -network such that
|
|
|
We emphasize that an -network can be represented by an -network: we just need to fill the remaining hidden layers with
|
|
|
Meanwhile, we approximate with an -network with and obtain the error estimate. Applying Lemma 3.7, we obtain that there exists an -network such that
|
|
|
These together with the triangle inequality give the estimate (3.8), and the total number of units in each layer is
|
|
|
If is a real-valued function, then we let , and the upper bound (3.8) still holds.
∎
As far as we know, the above theorem is the best available in the literature so far. For shallow neural networks , the authors in [MengMing:2022] have proven the -convergence rate with as the target function space, which is smaller than , and their estimate depends on the dimension as . The upper bound in [Siegel:2022] depends on , while (2.5) shows that may be much larger than for some functions in , and that estimate depends on the dimension exponentially. In contrast to these two results, the upper bound in Theorem 3.9 depends only on and is independent of the dimension.
For deep neural networks, a similar result for ReLU has been proven in [BreslerNagaraj:2020] with -order, which is not optimal compared with our estimate. At first glance, our result may seem to contradict [BreslerNagaraj:2020]*Theorem 2. This is not the case, because the upper bound therein is , which requires , but is usually smaller than for oscillatory functions; cf. Lemma A.1.
In what follows, we shall show that Theorem 3.9 is sharp if the activation function of the last hidden layer is the Heaviside function. This example is adapted from [BreslerNagaraj:2020]. We state it briefly to ensure the completeness of our work and postpone the proof to Appendix A.6.
Theorem 3.11.
For any fixed positive integers and real numbers with , there exists satisfying such that for any -network whose activation function in the last layer is the Heaviside function , there holds
(3.9) |
|
|
|