The authors thank Professor Renjin Jiang at Capital Normal University for helpful discussions. The work of Liao and Ming was supported by the National Natural Science Foundation of China through Grants No. 11971467 and No. 12371438.
1. Introduction
A series of works has been devoted to studying the neural network approximation error and generalization error with the Barron class [Barron:1992, Barron:1993, Barron:1994, Barron:2018] as the target space. For a complex-valued function and , the spectral norm is defined as
|
|
|
where is the Fourier transform of in the distribution sense. A function is said to belong to the Barron class if its spectral norm is finite and the Fourier inversion holds pointwise. However, this definition lacks rigor, as it does not specify the conditions under which the pointwise Fourier inversion is valid; addressing this issue is nontrivial, as discussed in [Pinsky:1997]. For this reason, the authors in [Ma:2017, Xu:2020, Siegel:2022, Siegel:2023] assume , and define
(1.1) |
|
|
|
For functions in , the Fourier transform and the pointwise Fourier inversion are valid. Unfortunately, we shall prove in Lemma 2.3 that equipped with the norm is not complete; therefore, it does not qualify as a Banach space.
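For orientation, one common convention in the literature cited above writes the spectral norm of order s as the weighted integral below (placeholder symbol \(v_{f,s}\); the weight used in the displays above may differ in lower-order terms):
\[
v_{f,s}=\int_{\mathbb{R}^{d}}\bigl(1+|\omega|\bigr)^{s}\,\bigl|\hat f(\omega)\bigr|\,\mathrm{d}\omega ,\qquad s\ge 0 .
\]
With the weight replaced by \(|\omega|\), this reduces to Barron's original quantity \(\int_{\mathbb{R}^{d}}|\omega|\,|\hat f(\omega)|\,\mathrm{d}\omega\) from [Barron:1993].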
To address this issue, an alternative class of function spaces has been proposed, which can be traced back to the work of Hörmander [Hormander:1963]. It is defined as follows.
|
|
|
for and . This space has been studied extensively and may be referred to by different names. Some call it the Hörmander space, as in [Hormander:1963, Messina:2001, DGSM60:2014, Ivec:2021]; others refer to it as the Fourier-Lebesgue space, as in [Grochenig:2002, Pilipovic:2010, BenyiOh:2013, Kato:2020]. We are interested in , and call it the spectral Barron space:
|
|
|
which is equipped with the norm
|
|
|
We show in Lemma 2.1 that the pointwise Fourier inversion is valid for functions in with a nonnegative . Some authors also refer to as the Fourier algebra or the Wiener algebra, whose algebraic properties, such as the Wiener-Levy theorem [Wiener:1932, Levy:1935, Helson:1959], have been extensively studied in [ReitherStegeman:2000, Liflyand:2012].
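For reference, the Fourier-Lebesgue scale is often written as follows (placeholder notation; the exact weight and symbols in the displays above may differ):
\[
\mathcal{F}L^{p}_{s}(\mathbb{R}^{d})=\Bigl\{f\in\mathscr{S}'(\mathbb{R}^{d}):\ \hat f\in L^{1}_{\mathrm{loc}}(\mathbb{R}^{d}),\ \ \bigl\|(1+|\xi|)^{s}\hat f\bigr\|_{L^{p}(\mathbb{R}^{d})}<\infty\Bigr\},
\]
and the spectral Barron space of order s corresponds to the case \(p=1\), with norm \(\int_{\mathbb{R}^{d}}(1+|\xi|)^{s}|\hat f(\xi)|\,\mathrm{d}\xi\).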
Another popular space for analyzing shallow neural networks is the Barron space [E:2019, EMW:2022], which can be viewed as the class of shallow neural networks with infinite width. The authors in [Wojtowytsch:2022, E:2022] claimed that the spectral Barron space is much smaller than the Barron space. As observed in [Caragea:2023], this statement is not accurate because it does not discriminate the smoothness index in . In addition, the variation space, introduced in [Barron:2008], has been studied in relation to the spectral Barron space and the Barron space in [SiegelXu:2022, Siegel:2023]. These spaces have been exploited to study the regularity of partial differential equations [ChenLuLu:2021, Lu:2021, E:2022, Chen:2023]. Recently, a new space [ParhiNowak:2022] originating from variational spline theory, which is closely related to the variation space, has also been exploited as the target function space for neural network approximation [ParhiNowak:2023].
The first objective of the present work is to study the analytical properties of . In Lemma 2.2, we show that is complete, while Lemma 2.3 shows that is not complete. This distinction highlights a key difference between these two spaces. Furthermore, Lemma 2.5 provides an example illustrating that functions in may decay arbitrarily slowly. This example, constructed using the generalized hypergeometric function, reveals interesting relationships between the Fourier transform and the decay rate of the function. Additionally, we study the relations among and some classical function spaces. In Theorem 2.9, we establish the connection between and the Besov space, and in Corollary 2.12 the connection between and the Sobolev spaces. Notably, we prove the embedding relation
|
|
|
which is an optimal result that appears to be missing in the existing literature. This embedding may serve as a bridge for studying how the Barron space, the variation space, and the space in [ParhiNowak:2022] are related to classical function spaces such as the Besov space; cf. [ParhiNowak:2022]*§ 5.
The second objective of the present work is to explore neural network approximation on a bounded domain. Building upon Barron's seminal works on approximating functions in with the -norm, recent studies extended the approximation to functions in with the -norm, as demonstrated in [Siegel:2020, Xu:2020]. Furthermore, improved approximation rates have been achieved for functions in with large in works such as [Bresler:2020, MaSiegelXu:2022, Siegel:2022]. These advancements contribute to a deeper understanding of the approximation capabilities of neural networks.
The distinction between deep ReLU networks and shallow networks has been highlighted in the separation theorems presented in [Eldan:2016, Telgarsky:2016, Shamir:2022]. These theorems provide examples that can be well approximated by three-layer ReLU neural networks but not by two-layer ReLU neural networks whose width grows polynomially with the dimension, which sheds light on the differences in expressive power between shallow and deep networks. Moreover, approximation rates for neural networks targeting mixed-derivative Besov/Sobolev spaces, spectral Barron spaces, and Hölder spaces have also been investigated in [Du:2019, BreslerNagaraj:2020, Bolcskei:2021, LuShenYangZhang:2021, Suzuki:2021]. These studies contribute to a broader understanding of the approximation capabilities of neural networks in various function spaces.
We focus on the -approximation properties for functions in when is small.
In Theorem 3.9, we establish that a neural network with hidden layers and units in each layer can approximate functions in with a convergence rate of when . This bound is sharp, as demonstrated in Theorem 3.11.
Importantly, our results provide optimal convergence rates compared to the existing literature. For deep neural networks, a similar result has been presented in [BreslerNagaraj:2020] with a convergence rate of . For shallow neural networks, i.e., , convergence rates of have been established in [MengMing:2022, Siegel:2022] when . However, the constants in their estimates depend on the dimension at least polynomially, or even exponentially, and require bounds on norms other than . Our results achieve optimal convergence rates without this additional dependence on the dimension or on other norms.
The remaining part of the paper is structured as follows. In Section 2, we demonstrate that the spectral Barron space is a Banach space and examine its relationship with other function spaces. This analysis provides a foundation for understanding the properties of the spectral Barron space. In Section 3, we delve into the error estimation for approximating functions in the spectral Barron space using deep neural networks with finite depth and infinite width. By investigating the convergence properties of these networks, we gain insights into their approximation capabilities and provide error bounds for their performance. Finally, in Section 4, we conclude our work by summarizing the key findings and contributions of this study. We also discuss potential avenues for future research and highlight the significance of our results in the broader context of function approximation using neural networks. Certain technical results are postponed to the Appendix.
2. Completeness of and its relation to other function spaces
This part discusses the completeness of the spectral Barron space and its embedding relations to other classical function spaces. Firstly, we fix some notation. Let be the Schwartz space and let be its topological dual space, i.e., the space of tempered distributions. The Gamma function is defined as
\[
\Gamma(z)=\int_{0}^{\infty}t^{z-1}e^{-t}\,\mathrm{d}t,\qquad \Re z>0 .
\]
We denote the surface area of the unit sphere by ; the volume of the unit ball is then . The Beta function is defined as
\[
B(p,q)=\int_{0}^{1}t^{p-1}(1-t)^{q-1}\,\mathrm{d}t=\frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)},\qquad p,q>0 .
\]
The Bessel function of the first kind is defined by the series
\[
J_{\nu}(z)=\sum_{k=0}^{\infty}\frac{(-1)^{k}}{k!\,\Gamma(\nu+k+1)}\Bigl(\frac{z}{2}\Bigr)^{\nu+2k}.
\]
This definition may be found in [Luke:1962]*§ 1.4.1, Eq. (1).
For , its Fourier transform is defined as
|
|
|
and the inverse Fourier transform is defined as
|
|
|
If , then the Fourier transform in the sense of distribution means
|
|
|
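For concreteness, one standard normalization (which may differ from the convention adopted in the displays above by constant factors) reads
\[
\hat f(\xi)=\int_{\mathbb{R}^{d}}f(x)\,e^{-2\pi i x\cdot\xi}\,\mathrm{d}x,
\qquad
f(x)=\int_{\mathbb{R}^{d}}\hat f(\xi)\,e^{2\pi i x\cdot\xi}\,\mathrm{d}\xi,
\]
and for a tempered distribution the transform is defined by duality, i.e., \(\langle\hat u,\varphi\rangle=\langle u,\hat\varphi\rangle\) for all Schwartz functions \(\varphi\).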
We shall frequently use the following Hausdorff-Young inequality. Let and , then
(2.1) |
|
|
|
where is the conjugate exponent of ; i.e. .
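As a hedged reminder, under the normalization with the factor \(2\pi\) in the exponent the Hausdorff-Young inequality takes the form
\[
\|\hat f\|_{L^{p'}(\mathbb{R}^{d})}\le\|f\|_{L^{p}(\mathbb{R}^{d})},
\qquad 1\le p\le 2,\quad \frac1p+\frac1{p'}=1;
\]
with other normalizations a dimensional constant appears on the right-hand side.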
We shall use the following pointwise Fourier inversion theorem.
Lemma 2.1.
Let , then in . Furthermore, let and , then , a.e. on .
Proof.
By definition, there holds
|
|
|
Therefore, in . Note that ,
|
|
|
By the Hausdorff-Young inequality (2.1),
|
|
|
Therefore, is a bounded linear operator on ; i.e., , since is dense in and
|
|
|
Hence, , a.e. on , by [Brezis:2011]*Corollary 4.24.
∎
A direct consequence of Lemma 2.1 is that the pointwise Fourier inversion is valid for functions in . We shall frequently use this fact later on.
2.1. Completeness of the spectral Barron space
Lemma 2.2.
-
(1)
is a Banach space.
-
(2)
When , is not a Banach space if the norm is replaced by .
Proof.
We give a brief proof of the first claim, which has been stated in [Hormander:1963]*Theorem 2.2.1, for the reader's convenience.
It is sufficient to check the completeness of . For any Cauchy sequence , there exists such that in . Therefore there exists a sub-sequence of (still denoted by ) such that a.e. on .
Define the measure by setting, for any measurable set ,
|
|
|
Then is a Cauchy sequence in and there exists such that in . Therefore there exists a subsequence of (still denoted by ) such that -a.e. on . Note that for any measurable set , is equivalent to . Therefore a.e. on . By the uniqueness of limits, , a.e. on .
Define . Lemma 2.1 shows that in . Therefore and in . Hence is complete and it is a Banach space.
The proof for (2) is a reductio ad absurdum.
Suppose the claim does not hold, then there exists depending only on and such that for any ,
(2.2) |
|
|
|
We shall show this is false by the following example.
For some , let
|
|
|
To bound and , we introduce the Bochner-Riesz multipliers
|
|
|
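In their standard form (the indexing and normalization used in this paper may differ), the Bochner-Riesz multipliers read
\[
m_{\delta}(\xi)=\bigl(1-|\xi|^{2}\bigr)_{+}^{\delta},\qquad \delta>0,
\]
where \(t_{+}=\max(t,0)\); their kernels are classically expressed through the Bessel function \(J_{d/2+\delta}\), which is presumably how the explicit formulas in (2.3) and (2.4) arise.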
We claim
(2.3) |
|
|
|
and
(2.4) |
|
|
|
The proof is postponed to Appendix A.1. It follows from (2.3) that
(2.5) |
|
|
|
and with
|
|
|
and
|
|
|
where we have used (2.4).
It is clear that
|
|
|
Hence while . This shows that (2.2) is invalid for sufficiently large . This proves the second claim.
∎
Similar to , the space defined in (1.1) has been exploited as the target space for neural network approximation by several authors [Ma:2017, Xu:2020, Siegel:2022, Siegel:2023]. The advantage of this space is that the Fourier transform is well-defined and the pointwise Fourier inversion is true for functions belonging to . Unfortunately, is not a Banach space, as we shall show below.
Lemma 2.3.
The space defined in (1.1) equipped with the norm is not a Banach space.
To prove Lemma 2.3, we recall the Barron spectrum space defined by Meng and Ming in [MengMing:2022]: for and ,
(2.6) |
|
|
|
equipped with the norm . A useful interpolation inequality that compares spectral norms of different orders has been proved in [MengMing:2022]*Lemma 2.1: for and , there exists depending on and such that
(2.7) |
|
|
|
where . For any , using the fact
|
|
|
we observe that the inequality (2.7) is dilation invariant, i.e., it is unchanged if we replace by .
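To illustrate the dilation argument, assume for concreteness the homogeneous weight \(|\xi|^{s}\) and the scaling \(\widehat{f(\lambda\,\cdot\,)}(\xi)=\lambda^{-d}\hat f(\xi/\lambda)\) (the exact weight and normalization in (2.6)-(2.7) may differ); then
\[
\int_{\mathbb{R}^{d}}|\xi|^{s}\,\bigl|\widehat{f(\lambda\,\cdot\,)}(\xi)\bigr|\,\mathrm{d}\xi
=\lambda^{s}\int_{\mathbb{R}^{d}}|\eta|^{s}\,|\hat f(\eta)|\,\mathrm{d}\eta ,
\]
so both sides of (2.7) scale by the same power of \(\lambda\) when the intermediate order is the stated convex combination of the two outer orders.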
Proof of Lemma 2.3.
The authors in [MengMing:2022] have proved that equipped with the norm is a Banach space. For any , taking and in (2.7), we obtain that there exists depending only on and such that
|
|
|
where .
We prove the assertion by reductio ad absurdum. Suppose that equipped with the norm is also a Banach space. Then, by the bounded inverse theorem and the above interpolation inequality, there exists depending only on and such that for any ,
(2.8) |
|
|
|
We shall show this is not the case by the following example.
For some , we define
|
|
|
Using (2.3) and noting , we have the explicit form of as
(2.9) |
|
|
|
Using (2.4), we get
|
|
|
and
|
|
|
Proceeding along the same line, we obtain
|
|
|
and
|
|
|
Hence,
(2.10) |
|
|
|
By (2.3), a direct calculation gives
|
|
|
Invoking [Grafakos:2014]*Appendix B.6, B.7, there exists that depends on such that
|
|
|
Hence, there exists depending only on and such that
|
|
|
|
|
|
|
|
|
|
|
|
where we have used the fact in the last step. Therefore, is bounded by a constant that depends only on and but is independent of . Moreover,
|
|
|
and by the Hausdorff-Young inequality (2.1),
|
|
|
This means that , which together with (2.10) immediately shows that the inequality (2.8) cannot be true for sufficiently large . Hence, we conclude that is not a Banach space.
∎
2.2. Embedding relations of the spectral Barron spaces
In this part we discuss the embedding of the spectral Barron spaces.
Lemma 2.4.
-
(1)
Interpolation inequality: For any satisfying with , and , there holds
(2.11) |
|
|
|
and
(2.12) |
|
|
|
-
(2)
Let , there holds with
(2.13) |
|
|
|
The embedding (2.13) has been stated in [Hormander:1963]*Theorem 2.2.2 without tracing the embedding constant.
Proof.
We start with the interpolation inequality (2.11) for the spectral norm. For any with , using Hölder's inequality, we obtain
|
|
|
This gives (2.11).
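In more detail, writing the intermediate order as \(\theta s_{1}+(1-\theta)s_{2}\) with \(\theta\in(0,1)\) and using the weight \((1+|\xi|)^{s}\) for illustration (placeholder notation; the weight in the text may differ), the Hölder step reads
\[
\int_{\mathbb{R}^{d}}(1+|\xi|)^{\theta s_{1}+(1-\theta)s_{2}}|\hat f(\xi)|\,\mathrm{d}\xi
\le
\Bigl(\int_{\mathbb{R}^{d}}(1+|\xi|)^{s_{1}}|\hat f(\xi)|\,\mathrm{d}\xi\Bigr)^{\theta}
\Bigl(\int_{\mathbb{R}^{d}}(1+|\xi|)^{s_{2}}|\hat f(\xi)|\,\mathrm{d}\xi\Bigr)^{1-\theta},
\]
obtained by splitting \(|\hat f|=|\hat f|^{\theta}|\hat f|^{1-\theta}\) and applying Hölder's inequality with exponents \(1/\theta\) and \(1/(1-\theta)\).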
Next, for , by Young's inequality, we have
|
|
|
|
|
|
|
|
|
|
|
|
This yields
|
|
|
Letting and , we obtain
|
|
|
This implies (2.12).
Next, if we take in (2.11) and with , then
|
|
|
This leads to (2.13) and completes the proof.
∎
The next lemma shows that is a proper subspace of .
Lemma 2.5.
For and , there holds , and the inclusion is proper in the sense that for any , there exists and .
Proof.
It follows from the interpolation inequality (2.7) that . Hence
|
|
|
This implies for any and .
We shall show below that the inclusion is proper. Let
|
|
|
where is the characteristic function on that equals one if and zero otherwise. It is straightforward to verify .
We shall show below that , based on the following explicit formula for derived in Appendix A.2:
(2.14) |
|
|
|
where the generalized hypergeometric function is defined as follows: for nonnegative integers and , provided none of the parameters is zero or a negative integer,
\[
{}_{p}F_{q}(a_{1},\dots,a_{p};b_{1},\dots,b_{q};z)
=\sum_{k=0}^{\infty}\frac{(a_{1})_{k}\cdots(a_{p})_{k}}{(b_{1})_{k}\cdots(b_{q})_{k}}\,\frac{z^{k}}{k!},
\qquad (a)_{k}=a(a+1)\cdots(a+k-1).
\]
The generalized hypergeometric series converges for all finite if . In particular, . Hence is finite for any . Using [MathaiSaxena:1973]*Appendix, we obtain
|
|
|
Therefore,
|
|
|
This immediately implies .
∎
2.3. Relations to some classical function spaces
In this part, we establish the embedding between the spectral Barron space and the Besov space, and hence bridge and the Sobolev space as in [MengMing:2022]. We first recall the definition of the Besov space.
Definition 2.7 (Besov space).
Let satisfy and
|
|
|
For every multi-index , there exists a positive number such that
|
|
|
and
|
|
|
Let and . Define the Besov space
|
|
|
equipped with the norm
|
|
|
and
|
|
|
We first recall the following embedding for the Besov space, which was first proved in the series of works by Taibleson [Taibleson:1964, Taibleson:1965, Taibleson:1966]. We include the proof in Appendix A.3 for the reader's convenience.
Lemma 2.8.
There holds if and only if and one of the following conditions holds:
-
(1)
and are arbitrary;
-
(2)
and .
The main result of the embedding is:
Theorem 2.9.
-
(1)
There holds
(2.15) |
|
|
|
-
(2)
The above embedding is optimal in the sense that is the largest among all satisfying , and is the smallest among all satisfying .
Proof.
To prove (1), first, for any ,
|
|
|
|
|
|
|
|
A direct calculation gives: for ,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Using Plancherel's theorem, we get
|
|
|
|
|
|
|
|
|
|
|
|
Next, for any , by Lemma 2.1 we have ; using the Hausdorff-Young inequality (2.1), we obtain
|
|
|
|
|
|
|
|
|
|
|
|
Therefore, . This proves (2.15) with
|
|
|
It remains to show that the embedding (2.15) is optimal. Suppose that there exists such that
|
|
|
using Lemma 2.8, we would have and . In what follows, we shall exploit an example adapted from [Lieb:2001]*Ch. 5, Ex. 9 to show that when and . Therefore, for any .
Let
|
|
|
A direct calculation gives . Hence
and
|
|
|
which is independent of . We shall prove in Appendix A.4 that when and ,
(2.16) |
|
|
|
Therefore when and .
On the other hand, we cannot expect that there exists some such that . Otherwise, we would have
|
|
|
because of Lemma 2.8 and [Triebel:1983]*§ 2.5.7, Proposition. This contradicts the fact that , which has been proved in Lemma 2.5.
∎
As a consequence of Theorem 2.9 and Lemma 2.5, we establish the embedding between the spectral Barron space and the Sobolev spaces.
Definition 2.10 (Fractional Sobolev space).
Let and non-integer , then the fractional Sobolev space
|
|
|
equipped with the norm
|
|
|
We first recall the relation between the Sobolev space and , which has been proved in [MengMing:2022]*Theorem 4.3.
Lemma 2.11 ([MengMing:2022]*Theorem 4.3).
-
(1)
If and , then
|
|
|
-
(2)
If is not an integer or is an integer and , then
|
|
|
It follows from the above lemma and Lemma 2.5 that
Corollary 2.12.
-
(1)
If and , there holds
(2.17) |
|
|
|
-
(2)
If is not an integer or is an integer and , then
|
|
|
The first embedding with and was implicit in [Barron:1993]*§ II, Para. 7; § IX, 15.
Proof.
By Lemma 2.11 and Lemma 2.5, when and , we have
|
|
|
When is not an integer or is an integer and , there holds
|
|
|
It remains to prove the right-hand side of (2.17). Using Theorem 2.9,
|
|
|
due to Theorem 2.9, Lemma 2.8 and [Triebel:1983]*§ 2.3.5, Eq. (1); § 2.5.7, Eq. (2), (9), (11).
∎
3. Application to deep neural network approximation
The embedding results proved in Theorem 2.9 and Corollary 2.12 indicate that is a smoothness index. Consequently, we are interested in exploring the approximation rate when is small with as the target function space. To facilitate our analysis, we shall focus on the hypercube , and the spectral norm for a function defined on is
|
|
|
where the infimum is taken over all extension operators . To simplify notation, we write for in what follows. We replace by in the definition of ; the latter seems more natural for studying the approximation over the hypercube, as suggested by [Barron:1993]*§ V.
Definition 3.1.
A sigmoidal function is a bounded function such that
\[
\lim_{t\to-\infty}\sigma(t)=0
\qquad\text{and}\qquad
\lim_{t\to+\infty}\sigma(t)=1 .
\]
For example, the Heaviside function is a sigmoidal function.
A classical idea for bounding the approximation error of neural networks with sigmoidal activation functions is to use the Heaviside function as an intermediate step. Caragea et al. [Caragea:2023] pointed out that the gap between a sigmoidal function and the Heaviside function cannot be dismissed in , while this gap does not exist in .
Lemma 3.2.
For fixed and ,
|
|
|
Proof.
Note that
|
|
|
We divide the cube into and . With a proper choice of and large enough, the -distance between and can be made arbitrarily small.
∎
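Alternatively, a standard argument gives the same conclusion (a hedged sketch, with illustrative symbols \(\sigma\) for the sigmoidal function, \(H\) for the Heaviside function, \(a,b\) for the direction and offset, and \(Q\) for the cube): for every \(x\) with \(a\cdot x\neq b\) one has \(\sigma(k(a\cdot x-b))\to H(a\cdot x-b)\) as \(k\to\infty\), and since \(\sigma\) is bounded and the hyperplane \(\{a\cdot x=b\}\) has measure zero, dominated convergence yields
\[
\bigl\|\sigma\bigl(k(a\cdot x-b)\bigr)-H(a\cdot x-b)\bigr\|_{L^{p}(Q)}\longrightarrow 0
\qquad\text{as } k\to\infty,\quad 1\le p<\infty .
\]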
For a shallow neural network, the following lemma in [Barron:1993] is proved for real-valued functions; it is straightforward to extend the proof to complex-valued functions.
Lemma 3.3 ([Barron:1993]*Theorem 1).
Let , there exists
(3.1) |
|
|
|
with and such that
|
|
|
In this part, we show the approximation error for deep neural networks. We use the term -network to describe a neural network with hidden layers and at most units per layer, where denotes the number of hidden layers; e.g., the shallow neural network expressed as (3.1) is an -network.
Definition 3.4 (-network).
An -network is a neural network with hidden layers and at most units per layer. The activation functions of the first layers are all ReLU, and the activation function of the last layer is a sigmoidal function. The connection weights between the input layer and the first hidden layer, and between consecutive hidden layers, are all real numbers, while the connection weights between the last hidden layer and the output layer are complex numbers.
Here we make some preparations for the remaining work. The analysis in this part owes much to [BreslerNagaraj:2020], with certain improvements that will be detailed later on. For any function defined on that is symmetric about , we use the notation to denote the function obtained by repeating periodically times over the interval , i.e.,
(3.2) |
|
|
|
Define
|
|
|
By definition (3.2), represents a triangle function with peaks and can be represented by ReLUs:
|
|
|
Lemma 3.5.
Let be a function defined on and symmetric about , then on .
The above lemma is a rigorous statement of [Telgarsky:2016]*Proposition 5.1. A key example is when . A geometrical explanation may be found in [Bolcskei:2021]*Figure 3. We postpone the rigorous proof to Appendix A.5.
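To make the peak counting concrete, the following minimal numerical sketch (not taken from the paper; the hat function and its three-ReLU representation below are one standard choice, and all names are illustrative) checks that the k-fold composition has 2^(k-1) peaks on the unit interval:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # Triangle (hat) function on [0, 1] written with three ReLUs:
    # hat(x) = 2x on [0, 1/2], 2(1 - x) on [1/2, 1], and 0 outside [0, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)

def sawtooth(x, k):
    # The k-fold composition hat o hat o ... o hat is a periodic triangle
    # wave with 2**(k - 1) peaks of height 1 on [0, 1].
    y = np.asarray(x, dtype=float)
    for _ in range(k):
        y = hat(y)
    return y

x = np.linspace(0.0, 1.0, 10001)
for k in range(1, 5):
    y = sawtooth(x, k)
    # Each peak value 1 is attained at exactly one grid point here.
    print(k, int(np.count_nonzero(np.isclose(y, 1.0))))  # prints 1, 2, 4, 8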
For , we define
|
|
|
then supp and is symmetric about . Define
|
|
|
Then is symmetric about because
|
|
|
|
|
|
|
|
By definition (3.2), is well defined on and
|
|
|
|
|
|
|
|
|
|
Moreover, on can be represented by Heaviside functions, since each only needs one Heaviside function:
|
|
|
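As a hedged illustration of this counting, using the convention \(H(t)=\mathbf{1}_{\{t\ge 0\}}\) (which may differ from the convention adopted here),
\[
\mathbf{1}_{[a,\infty)}(x)=H(x-a),
\qquad
\mathbf{1}_{[a,b)}(x)=H(x-a)-H(x-b),
\]
so a half-line indicator costs a single Heaviside unit and an interval indicator costs two.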
A direct consequence of the above construction is
Lemma 3.6.
For , there holds
(3.3) |
|
|
|
Proof.
For any , a direct calculation gives
|
|
|
Fix a . If there exists an integer satisfying and , then and
|
|
|
Otherwise there exists an integer satisfying and . Then and
|
|
|
This completes the proof of (3.3).
∎
Now we are ready to give an approximation result for deep neural networks, which follows the framework of [BreslerNagaraj:2020], while we achieve a higher-order convergence rate with a dimension-free prefactor.
Lemma 3.7.
Let the positive integer and with and . For any positive integer there exists an -network such that
(3.4) |
|
|
|
Proof.
By Lemma 2.1, for , assuming that is real-valued, we have
|
|
|
with a proper choice such that . For fixed , choose and ; then , and by Lemma 3.6,
|
|
|
Define the probability measure
(3.5) |
|
|
|
where is the normalization factor such that
|
|
|
Therefore with
|
|
|
If is an i.i.d. sequence of random samples from , and
|
|
|
then using Fubini's theorem, we obtain
|
|
|
|
|
|
|
|
|
|
|
|
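The computation above is an instance of the standard Monte Carlo sampling bound; in a generic hedged form, with \(X_{1},\dots,X_{n}\) i.i.d. \(L^{2}\)-valued random elements with mean \(g\) (placeholder notation), it reads
\[
\mathbb{E}\,\Bigl\|\frac{1}{n}\sum_{i=1}^{n}X_{i}-g\Bigr\|_{L^{2}}^{2}
=\frac{1}{n}\Bigl(\mathbb{E}\,\|X_{1}\|_{L^{2}}^{2}-\|g\|_{L^{2}}^{2}\Bigr)
\le\frac{1}{n}\,\mathbb{E}\,\|X_{1}\|_{L^{2}}^{2}.
\]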
Note that
|
|
|
we obtain
|
|
|
By Markov's inequality, with probability at least , for some to be chosen later on, we obtain
(3.6) |
|
|
|
It remains to calculate the number of units in each layer. For each , choose ; then , and by Lemma 3.5, on . Lemma 3.2 shows that the Heaviside function can be approximated by , and we need at most
|
|
|
units in each layer to represent . Denote by the total number of units in each layer; then and
|
|
|
Again, by Markov's inequality, with probability at least , we obtain
(3.7) |
|
|
|
Combining (3.6) and (3.7), with probability at least , there exists an -network such that
|
|
|
with a proper choice of in the last step. Finally, if is complex-valued, we approximate the real and imaginary parts of the function separately to obtain (3.4).
∎
There is relatively little work on the approximation rate of deep neural networks that employ the spectral Barron space as the target space. For deep ReLU networks, [BreslerNagaraj:2020] has proven approximation results of -order. We shall show below that this may be improved to -order at the cost of appearing in the estimate.
Theorem 3.9.
Let the positive integer and with . For any positive integer there exists an -network such that
(3.8) |
|
|
|
Moreover, if is a real-valued function, then the connection weights in are all real.
Proof.
We may write with
|
|
|
Then and because
|
|
|
We approximate with an -network with and obtain the error estimate. Applying Lemma 3.3, we obtain that there exists an -network such that
|
|
|
We emphasize that an -network can be represented by an -network: we just need to fill the remaining hidden layers with
|
|
|
Meanwhile, we approximate with an -network with and obtain the error estimate. Applying Lemma 3.7, we obtain that there exists an -network such that
|
|
|
These together with the triangle inequality give the estimate (3.8), and the total number of units in each layer is
|
|
|
If is a real-valued function, then we let , and the upper bound (3.8) still holds.
∎
As far as we know, the above theorem is the best available in the literature so far. For shallow neural networks , the authors in [MengMing:2022] have proven the -convergence rate with as the target function space, which is smaller than , and their estimate depends on the dimension as . The upper bound in [Siegel:2022] depends on , while (2.5) shows that may be much larger than for some functions in , and that estimate depends on the dimension exponentially. In contrast to these two results, the upper bound in Theorem 3.9 depends only on and is independent of the dimension.
For deep neural networks, a similar result for ReLU has been proven in [BreslerNagaraj:2020] with -order, which is not optimal compared with our estimate. At first glance, our result may seem to contradict [BreslerNagaraj:2020]*Theorem 2. This is not the case, because the upper bound therein is , which requires , but is usually smaller than for oscillatory functions; cf. Lemma A.1.
In what follows, we shall show that Theorem 3.9 is sharp if the activation function of the last hidden layer is the Heaviside function. This example is adapted from [BreslerNagaraj:2020]. We state it briefly to ensure the completeness of our work and postpone the proof to Appendix A.6.
Theorem 3.11.
For any fixed positive integers and real numbers with , there exists satisfying such that for any -network whose activation function in the last layer is the Heaviside function , there holds
(3.9) |
|
|
|