A Unified and Constructive Framework for the Universality of Neural Networks

Tan Bui-Thanh [email protected]
Abstract

One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal approximation property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: any nAI activation function is universal. It turns out that most existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework offers several advantages over contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most existing activation functions. Third, as a by-product, the framework provides the first universality proof for some of the existing activation functions, including Mish, SiLU, ELU, and GELU. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons and the values of the weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with a favorable non-asymptotic rate.

keywords:
Universal approximation , neural networks , activation functions , non-asymptotic analysis
MSC:
62M45 , 82C32 , 62-02 , 60-02 , 65-02.
\affiliation

[UT-mech]organization=Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin, city=Austin, state=Texas, country=USA

\affiliation

[Oden]organization=Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, city=Austin, state=Texas, country=USA

Highlights

Unlike existing universality approaches, our framework is constructive using elementary means from functional analysis, probability theory, non-asymptotic analysis, and numerical analysis.

While existing approaches are either technical or specialized for particular activation functions, our framework is the first unified attempt that is valid for most of the existing activation functions and beyond.

The framework provides the first universality proof for some of the existing activation functions, including Mish, SiLU, ELU, and GELU;

The framework provides new proofs for all activation functions

The framework discovers and facilitates the discovery of new activation functions with guaranteed universality property. Indeed, any activation whose $k$th derivative, with $k$ being an integer, is integrable and essentially bounded is universal. In that case, the activation function and all of its $j$th derivatives, $j=1,\ldots,k$, are not only valid activations but also universal.

For a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons, and the values of the weights/biases;

The framework is the first that allows for abstractly presenting the universal approximation with the favorable non-asymptotic rate of $N^{-1/2}$, where $N$ is the number of neurons. This provides not only theoretical insights into the required number of neurons but also guidance on how to choose the number of neurons to achieve a certain accuracy with controllable success probability. Perhaps more importantly, it shows that a neural network may fail, with non-zero probability, to provide the desired testing error in practice when an architecture is selected;

Our framework also provides insights into the development, and hence constructive derivations, of some of the existing approaches.

1 Introduction

The human brain consists of networks of billions of neurons, each of which, roughly speaking, receives information (electrical pulses) from other neurons via dendrites, processes the information in the soma, is activated by a difference in electrical potential, and passes the output along its axon to other neurons through synapses. Attempts to understand the extraordinary ability of the brain to categorize, classify, regress, and process information have inspired numerous scientists to develop computational models that mimic brain functionalities. Perhaps the most well-known is the McCulloch-Pitts model [1], also called the perceptron by Rosenblatt, who extended the McCulloch-Pitts model to networks of artificial neurons capable of learning from data [2]. The question is whether such a network can mimic some of the brain's capabilities, such as learning to classify. The answer lies in the fact that perceptron networks can represent Boolean logic functions exactly. From a mathematical point of view, perceptron networks with Heaviside activation functions compute step functions, linear combinations of which form the space of simple functions, which in turn is dense in the space of measurable functions [3, 4]. That is, linear combinations of perceptrons can approximate a measurable function to any desired accuracy [5, 6]: the first universal approximation result for neural networks. This universal approximation capability partially "explains" why human brains can be trained to learn virtually any task.

For training one-hidden-layer networks, the Widrow-Hoff rule [7] can be used for supervised learning. For many-layer networks, the most popular approach is back-propagation [8], which requires the derivative of the activation function. The Heaviside function is, however, not differentiable in the classical sense. A popular smooth approximation is the standard logistic (also called sigmoidal) activation function, which is infinitely differentiable. The question is now: are neural networks with sigmoidal activation universal? An answer to this question was first presented by Cybenko [9] and Funahashi [10]. The former, though non-constructive (a more constructive proof was then provided in [11] and revisited in [12]), elegantly used the Hahn-Banach theorem to show that sigmoidal neural networks (NNs) with one hidden layer are dense in the space of continuous functions on compacta. The latter used the Fourier transform and an integral formula for integrable functions with bounded variation [13] to show the same result for NNs with continuous, bounded, monotonically increasing sigmoidal activation functions. Recognizing the NN output as an approximate back-projection operator, [14] employed the inverse Radon transform to show the universal approximation property in the space of square integrable functions. Making use of the classical Stone-Weierstrass approximation theorem, [15] successfully proved that NNs with non-decreasing sigmoidal functions are dense in the space of continuous functions over compacta and in the space of measurable functions. Using the separability of the space of continuous functions over compacta, [16] constructed a strictly increasing analytic sigmoidal function that is universal. The work in [17], based on a Kolmogorov theorem, showed that any continuous function on hypercubes can be approximated well with two-hidden-layer NNs of sigmoidal functions.

Though sigmoidal functions are a natural continuous approximation to the Heaviside function, and hence mimic the activation mechanism of neurons, they are not the only ones with the universal approximation property. Indeed, [18] showed, using distribution theory, that NNs with $k$th degree sigmoidal functions are dense in the space of continuous functions over compacta. Meanwhile, [19] designed a cosine squasher activation function so that the output of a one-hidden-layer neural network is a truncated Fourier series and thus can approximate square integrable functions to any desired accuracy. Following and extending [9], [20] showed that NNs are dense in $\mathcal{L}^{p}$ with bounded non-constant activation functions and in the space of continuous functions over compacta with continuous bounded non-constant activation functions. Using the Stone-Weierstrass theorem, [6] showed that any network with activations (e.g., exponential, cosine squasher, modified sigma-pi, modified logistic, and step functions) that can transform products of functions into sums of functions is universal in the space of bounded measurable functions over compacta. The work in [21] provided universal approximation for bounded weights (and biases) with piecewise polynomial and superanalytic activation functions (e.g., sine, cosine, logistic functions, and piecewise continuous functions that are superanalytic at some point with positive radius of convergence).

A one-hidden-layer NN is in fact dense in the space of continuous functions and in $\mathcal{L}^{p}$ if the activation function is not a polynomial almost everywhere [22, 23]. Using Taylor expansion and the Vandermonde determinant, [24] provided an elementary proof of universal approximation for NNs with $\mathcal{C}^{\infty}$ activation functions. Recently, [25] provided a universal approximation theorem for multilayer NNs with ReLU activation functions using a partition of unity and Taylor expansion for functions in Sobolev spaces. Universality of multilayer ReLU NNs can also be obtained using approximate identities and a rectangular quadrature rule [26]. Restricted to Barron spaces [27, 28], the universality of ReLU NNs is a straightforward application of the law of large numbers [29]. The universality of ReLU NNs can also be achieved by emulating finite element approximations [30, 31].

Unlike others, [32] introduced the $\mathcal{B}$ell-shape function (for example, the derivative of a squash-type activation function) as a means to explicitly construct one-hidden-layer NNs approximating continuous functions over compacta in multiple dimensions. More detailed analyses were then carried out in [33, 34]. The idea was revisited and extended to establish universal approximation in the uniform norm for tensor product sigmoidal and hyperbolic tangent activations [35, 36, 37, 38, 39]. A similar approach was also taken in [40] using cardinal B-splines and in [41] using the distribution function as the sigmoid, but for a family of sigmoids with certain decaying-tail conditions.

While it is sufficient for most universal approximation results that each weight (and bias) vary over the whole real line $\mathbb{R}$, this is not necessary. In fact, universal approximation results for continuous functions on compacta can also be obtained using a finite set of weights [42, 43, 44]. It is quite striking that one hidden layer with only one neuron is enough for universality: [45] constructs a smooth, sigmoidal, almost monotone activation function so that a one-hidden-layer network with one neuron can approximate any continuous function over any compact subset of $\mathbb{R}$ to any desired accuracy.

Universal theorems with convergence rates for sigmoidal and other NNs have also been established. Modifying the proofs in [11, 18], [46] showed that, in one dimension, the error incurred by sigmoidal NNs with $N$ hidden units scales as $\mathcal{O}\left(N^{-1}\right)$ in the uniform norm over compacta. The result was then extended to $\mathbb{R}$ for bounded continuous functions [47]. Similar results were obtained for functions with bounded variation [48] and with bounded $\phi$-variation [49]. For multiple dimensions, [27] provided universal approximation for sigmoidal NNs in the space of functions with bounded first moment of the magnitude distribution of their Fourier transform with rate $\mathcal{O}\left(N^{-1/2}\right)$ in the $\mathcal{L}^{2}$-norm, independent of dimension. This result can be generalized to Lipschitz functions (with additional assumptions) [50]. Using the explicit NN construction in [32], [35, 36, 37, 38, 39, 33, 34, 41] provided a convergence rate of $\mathcal{O}\left(N^{-\alpha}\right)$, $0<\alpha<1$, in the uniform norm for Hölder continuous functions with exponent $\alpha$. Recently, [51] has revisited universal approximation theory with rate $\mathcal{O}\left(N^{-1/2}\right)$ in Sobolev norms for smooth activation functions with a polynomial decay condition on all derivatives. This improves/extends the previous similar work for sigmoidal functions in [27] and exponentially decaying activation functions in [52]. The setting in [51] is valid for sigmoidal, arctan, hyperbolic tangent, softplus, ReLU, leaky ReLU, and $k$th power of ReLU activations, as their central differences satisfy the polynomial decay condition. However, the result is only valid for smooth functions in high-order Barron spaces. For activation functions without decay but essentially bounded and having bounded Fourier transform on some interval (or having bounded variation), universal approximation results with the slower rate $\mathcal{O}\left(N^{-1/4}\right)$ in the $\mathcal{L}^{2}$-norm can be obtained for the first-order Barron space. These rates can be further improved using stratified sampling [53, 28]. Convergence rates for ReLU NNs in Sobolev spaces have recently been established using finite elements [30, 31].

The main objective of this paper is to provide the first constructive and unified framework for a large class of activation functions including most existing ones. At the heart of this new framework is the introduction of the neural network approximate identity (nAI) concept. The important consequence is: any nAI activation function is universal. The following are the main contributions: i) unlike existing works, our framework is constructive, using elementary means from functional analysis, probability theory, non-asymptotic analysis, and numerical analysis; ii) while existing approaches are either technical or specialized for particular activation functions, our framework is the first unified attempt that is valid for most of the existing activation functions and beyond; iii) the framework provides the first universality proof for some of the existing activation functions, including Mish, SiLU, ELU, and GELU; iv) the framework provides new proofs for all activation functions; v) the framework discovers and facilitates the discovery of new activation functions with guaranteed universality property: indeed, any activation whose $k$th derivative, with $k$ being an integer, is integrable and essentially bounded is universal, and in that case the activation function and all of its $j$th derivatives, $j=1,\ldots,k$, are not only valid activations but also universal; vi) for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons, and the values of the weights/biases; vii) the framework is the first that allows for abstractly presenting the universal approximation with the favorable non-asymptotic rate of $N^{-1/2}$, where $N$ is the number of neurons; this provides not only theoretical insights into the required number of neurons but also guidance on how to choose the number of neurons to achieve a certain accuracy with controllable success probability, and, perhaps more importantly, it shows that a neural network may fail, with non-zero probability, to provide the desired testing error in practice when an architecture is selected; viii) our framework also provides insights into the development, and hence constructive derivations, of some of the existing approaches.

The paper is organized as follows. Section 2 introduces the conventions and notations used in the paper. Elementary facts about convolution and approximate identities that are useful for our purposes are presented in Section 3. Section 4 recalls quadrature rules in terms of Riemann sums for continuous functions on compacta and their error analysis using moduli of continuity. Section 5 combines these ingredients to estimate the error of approximating a function by quadrature applied to its convolution with an approximate identity. This is followed by a unified abstract framework for universality in Section 6. The key to achieving this is to introduce the concept of neural network approximate identity (nAI), which immediately provides an abstract universality result in Lemma 3. This abstract framework reduces a universality proof to an nAI proof. Section 7 shows that most existing activation functions are nAI. This includes the family of rectified polynomial units (RePU), of which the parametric and leaky ReLUs are members; a family of generalized sigmoidal functions, of which the standard sigmoid, hyperbolic tangent, and softplus are members; the exponential linear unit (ELU); the Gaussian error linear unit (GELU); the sigmoid linear unit (SiLU); and the Mish. It is the nAI proof for the Mish activation function that guides us to devise a general framework for a large class of nAI functions (including all activations in this paper and beyond) in Section 8. An abstract universal approximation result with non-asymptotic rates is presented in Section 9. Section 10 concludes the paper.

2 Notations

This section describes the notations used in the paper. We reserve lower case roman letters for scalars or scalar-valued functions. Boldface lower case roman letters are for vectors, with components denoted by subscripts. We denote by $\mathbb{R}$ the set of real numbers and by $*$ the convolution operator. For $\boldsymbol{x}\in\mathbb{R}^{n}$, where $n\in\mathbb{N}$ is the ambient dimension, $\left|\boldsymbol{x}\right|_{p}:=\left(\sum_{i=1}^{n}\left|\boldsymbol{x}_{i}\right|^{p}\right)^{1/p}$ denotes the standard $\ell^{p}$ norm in $\mathbb{R}^{n}$. We conventionally write $f\left(\boldsymbol{x}\right):=f\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)$ and $\left\|f\right\|_{p}:=\left(\int_{\mathbb{R}^{n}}\left|f\left(\boldsymbol{x}\right)\right|^{p}\,d\mu\left(\boldsymbol{x}\right)\right)^{1/p}$ for $1\leq p<\infty$, where $\mu$ is the Lebesgue measure in $\mathbb{R}^{n}$. For $p=\infty$, $\left\|f\right\|_{\infty}:=\operatorname{ess\,sup}_{\mathbb{R}^{n}}\left|f\right|:=\inf\left\{M>0:\mu\left\{\boldsymbol{x}:\left|f(\boldsymbol{x})\right|>M\right\}=0\right\}$. For simplicity, we use $d\boldsymbol{x}$ in place of $d\mu\left(\boldsymbol{x}\right)$. Note that we also use $\left\|f\right\|_{\infty}$ to denote the uniform norm of a continuous function $f$. We define $\mathcal{L}^{p}:=\mathcal{L}^{p}\left(\mathbb{R}^{n}\right):=\left\{f:\left\|f\right\|_{p}<\infty\right\}$ for $1\leq p\leq\infty$, $\mathcal{C}\left(\mathcal{K}\right)$ as the space of continuous functions on $\mathcal{K}\subseteq\mathbb{R}^{n}$, $\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$ as the space of continuous functions vanishing at infinity, $\mathcal{C}_{b}\left(\mathbb{R}^{n}\right)$ as the space of bounded functions in $\mathcal{C}\left(\mathbb{R}^{n}\right)$, and $\mathcal{C}_{c}\left(\mathcal{K}\right)$ as the space of functions in $\mathcal{C}\left(\mathcal{K}\right)$ with compact support.

3 Convolution and Approximate Identity

The mathematical foundation for our framework is the approximate identity, which relies on convolution. Convolution has been used by many authors, including [23, 40, 22], to assist in proving universal approximation theorems. The approximate identity generated by the ReLU function has been used in [26] for showing ReLU universality in $\mathcal{L}^{p}$. In this section we collect some important results on convolution and approximate identities that are useful for our developments (see, e.g., [54] for a comprehensive treatment). Let $\tau_{\boldsymbol{y}}f\left(\boldsymbol{x}\right):=f\left(\boldsymbol{x}-\boldsymbol{y}\right)$ be the translation operator. The following is standard.

Proposition 1.

$f$ is uniformly continuous iff $\lim_{\boldsymbol{y}\to 0}\left\|\tau_{\boldsymbol{y}}f-f\right\|_{\infty}=0$.

Let $f,g:\mathbb{R}^{n}\to\mathbb{R}$ be two measurable functions in $\mathbb{R}^{n}$; their convolution is defined as

f*g:=\int_{\mathbb{R}^{n}}f\left(\boldsymbol{x}-\boldsymbol{y}\right)g\left(\boldsymbol{y}\right)\,d\boldsymbol{y},

when the integral exists. We are interested in the conditions under which $f\left(\boldsymbol{x}-\boldsymbol{y}\right)g\left(\boldsymbol{y}\right)$ (or $f\left(\boldsymbol{y}\right)g\left(\boldsymbol{x}-\boldsymbol{y}\right)$) is integrable for almost every (a.e.) $\boldsymbol{x}$, as our approach requires a discretization of the convolution integral. Below is a relevant case.

Lemma 1.

If $f\in\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$ and $g\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$, then $f*g\in\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$.

We are interested in $g$ being an approximate identity in the following sense.

Definition 1.

A family of functions $\mathcal{B}_{\theta}\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$, where $\theta>0$, is called an approximate identity (AI) if: i) the family is bounded in the $\mathcal{L}^{1}$ norm, i.e., $\left\|\mathcal{B}_{\theta}\right\|_{1}\leq C$ for some $C>0$; ii) $\int_{\mathbb{R}^{n}}\mathcal{B}_{\theta}\left(\boldsymbol{x}\right)\,d\boldsymbol{x}=1$ for all $\theta>0$; and iii) $\int_{\left\|\boldsymbol{x}\right\|>\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{x}\right)\right|\,d\boldsymbol{x}\longrightarrow 0$ as $\theta\longrightarrow 0$, for any $\delta>0$.

An important class of approximate identities is obtained by rescaling $\mathcal{L}^{1}$ functions.

Lemma 2.

If $g\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$ and $\int_{\mathbb{R}^{n}}g\left(\boldsymbol{x}\right)\,d\boldsymbol{x}=1$, then $\mathcal{B}_{\theta}\left(\boldsymbol{x}\right):=\frac{1}{\theta^{n}}g\left(\frac{\boldsymbol{x}}{\theta}\right)$ is an AI.

Lemma 3.

For any $f\in\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$, $\lim_{\theta\to 0}\left\|f*\mathcal{B}_{\theta}-f\right\|_{\infty}=0$, where $\mathcal{B}_{\theta}$ is an AI. Consequently, for any compact subset $\mathcal{K}$ of $\mathbb{R}^{n}$, $\lim_{\theta\to 0}\left\|\mathds{1}_{\mathcal{K}}\left(f*\mathcal{B}_{\theta}-f\right)\right\|_{\infty}=0$, where $\mathds{1}_{\mathcal{K}}$ is the indicator/characteristic function of $\mathcal{K}$.

Proof.

We briefly present the proof here as we will use a similar approach to estimate the approximate identity error later. Since $f\in\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$, it resides in $\mathcal{L}^{\infty}\left(\mathbb{R}^{n}\right)$ and is uniformly continuous. As a consequence of Proposition 1, for any $\varepsilon>0$, there exists $\delta>0$ such that $\left\|\tau_{\boldsymbol{y}}f-f\right\|_{\infty}\leq\varepsilon$ for $\left\|\boldsymbol{y}\right\|\leq\delta$. We have

\left\|f*\mathcal{B}_{\theta}-f\right\|_{\infty}\leq\int_{\mathbb{R}^{n}}\left\|\tau_{\boldsymbol{y}}f-f\right\|_{\infty}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}\leq\varepsilon\int_{\left\|\boldsymbol{y}\right\|\leq\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}+2\left\|f\right\|_{\infty}\int_{\left\|\boldsymbol{y}\right\|>\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y},

where we have used the second property in Definition 1. The assertion now follows: by the first property the first term is at most $C\varepsilon$ with $\varepsilon$ arbitrarily small, and by the third property the second term vanishes as $\theta\to 0$. ∎

4 Quadrature rules for continuous functions on bounded domain

Recall that a bounded function on a compact set is Riemann integrable if and only if it is continuous almost everywhere. In that case, Riemann sums converge to the Riemann integral as the corresponding partition size approaches 0. It is well-known (see, e.g., [55, 56], and references therein) that most common numerical quadrature rules (including the trapezoidal rule, some Newton-Cotes formulas, and Gauss-Legendre quadrature) are Riemann sums. In this section, we use the Riemann sum interpretation of quadrature rules to approximate integrals of bounded functions. The goal is to characterize the quadrature error in terms of the modulus of continuity. We first discuss the quadrature error in one dimension and then extend the result to $n$ dimensions.

We assume the domain of interest is $\left[-1,1\right]^{n}$. Let $f\in\mathcal{C}\left[-1,1\right]^{n}:=\mathcal{C}\left(\left[-1,1\right]^{n}\right)$, let $\mathcal{P}^{m}:=\left\{\boldsymbol{z}^{1},\ldots,\boldsymbol{z}^{m+1}\right\}$ be a partition of $\left[-1,1\right]$, and let $\mathcal{Q}^{m}:=\left\{\xi^{1},\ldots,\xi^{m}\right\}$ be the collection of all “quadrature points” such that $-1\leq\boldsymbol{z}^{j}\leq\xi^{j}\leq\boldsymbol{z}^{j+1}\leq 1$ for $j=1,\ldots,m$. We assume that

\sum_{j=1}^{m}g\left(\xi^{j}\right)\left(\boldsymbol{z}^{j+1}-\boldsymbol{z}^{j}\right)

is a valid Riemann sum (e.g., the trapezoidal rule), and thus converges to $\int_{-1}^{1}g\left(\boldsymbol{z}\right)\,d\boldsymbol{z}$ as $m$ approaches $\infty$. We define a quadrature rule for $\left[-1,1\right]^{n}$ as the tensor product of the aforementioned one-dimensional quadrature rule, and thus

\mathcal{S}\left(m,f\right):=\sum_{j^{1},\ldots,j^{n}=1}^{m}f\left(\xi^{j^{1}},\ldots,\xi^{j^{n}}\right)\Pi_{i=1}^{n}\left(\boldsymbol{z}^{j^{i}+1}-\boldsymbol{z}^{j^{i}}\right) (1)

is a valid Riemann sum for $\int_{\left[-1,1\right]^{n}}f\left(\boldsymbol{x}\right)\,d\boldsymbol{x}$.
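For concreteness, the tensor-product Riemann sum (1) can be realized in a few lines of code. The sketch below is ours and only illustrative: it assumes a uniform partition of $\left[-1,1\right]$ with left-endpoint quadrature points, and the function and variable names are not part of the paper.

import numpy as np

def riemann_sum_nd(f, m, n):
    # Tensor-product Riemann sum S(m, f) over [-1, 1]^n, eq. (1), with a uniform
    # partition of m subintervals per axis and left-endpoint quadrature points xi^j = z^j.
    z = np.linspace(-1.0, 1.0, m + 1)          # partition points z^1, ..., z^{m+1}
    xi, dz = z[:-1], np.diff(z)                # quadrature points and subinterval lengths
    grids = np.meshgrid(*([xi] * n), indexing="ij")
    pts = np.stack([g.ravel() for g in grids], axis=-1)                    # all m^n points
    wts = np.prod(np.stack(np.meshgrid(*([dz] * n), indexing="ij"), -1), -1).ravel()
    return float(np.sum(f(pts) * wts))

# example: f(x) = prod_i cos(pi x_i / 2) over [-1, 1]^2; the exact integral is (4/pi)^2
f = lambda x: np.prod(np.cos(np.pi * x / 2.0), axis=-1)
print(riemann_sum_nd(f, m=200, n=2), (4.0 / np.pi) ** 2)

Any other choice of quadrature points $\xi^{j}\in\left[\boldsymbol{z}^{j},\boldsymbol{z}^{j+1}\right]$ yields another valid Riemann sum.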

Recall the modulus of continuity $\omega\left(f,h\right)$ of a function $f$:

\omega\left(f,h\right):=\sup_{\left|\boldsymbol{z}\right|_{2}\leq h}\left\|\tau_{\boldsymbol{z}}f-f\right\|_{\infty}, (2)

$\omega\left(f,0\right)=0$, and $\omega\left(f,h\right)$ is continuous with respect to $h$ at $0$ (due to the uniform continuity of $f$ on $\left[-1,1\right]^{n}$). The following error bound for the tensor product quadrature rule is an extension of the standard one-dimensional case [55].

Lemma 4.

Let $f\in\mathcal{C}\left[-1,1\right]^{n}$. Then (the bound below can be marginally improved using the common refinement of $\mathcal{P}^{m}$ and $\mathcal{Q}^{m}$, but that is not important for our purpose)

\left|\mathcal{S}\left(m,f\right)-\int_{\left[-1,1\right]^{n}}f\left(\boldsymbol{x}\right)\,d\boldsymbol{x}\right|\leq 2^{n}\omega\left(f,\left|\mathcal{P}^{m}\right|\right),

where $\left|\mathcal{P}^{m}\right|$ is the norm of the partition $\mathcal{P}^{m}$.

5 Error Estimation for Approximate Identity with quadrature

We are interested in approximating functions in $\mathcal{C}_{c}\left[-1,1\right]^{n}$ using neural networks. How to extend the results to $\mathcal{C}\left[-1,1\right]^{n}$ is given in A. At the heart of our framework is to first approximate any function $f$ in $\mathcal{C}_{c}\left[-1,1\right]^{n}$ with an approximate identity $\mathcal{B}_{\theta}$ and then numerically integrate the convolution $f*\mathcal{B}_{\theta}$ with quadrature (later with Monte Carlo sampling in Section 9). This extends the similar approach in [26] for ReLU activation functions in several directions: 1) we rigorously account for the approximate identity and quadrature errors; 2) our unified framework holds for most activation functions (see Section 7) using the neural network approximate identity (nAI) concept introduced in Section 6; 3) we identify sufficient conditions under which an activation is nAI (see Section 8); and 4) we provide a non-asymptotic rate of convergence (see Section 9). Moreover, this procedure is the key to our unification of neural network universality.

Let $f\in\mathcal{C}_{c}\left[-1,1\right]^{n}$ and let $\mathcal{B}_{\theta}$ be an approximate identity. From Lemma 3 we know that $f*\mathcal{B}_{\theta}$ converges to $f$ uniformly as $\theta\to 0$. We are, however, interested in estimating the error $\left\|f*\mathcal{B}_{\theta}-f\right\|_{\infty}$ for a given $\theta$. From the proof of Lemma 3, for any $\delta>0$, we have

\left\|f*\mathcal{B}_{\theta}-f\right\|_{\infty}\leq\int_{\mathbb{R}^{n}}\left\|\tau_{\boldsymbol{y}}f-f\right\|_{\infty}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}\leq\omega\left(f,\delta\right)\int_{\left\|\boldsymbol{y}\right\|\leq\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}+2\left\|f\right\|_{\infty}\int_{\left\|\boldsymbol{y}\right\|>\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}\leq\omega\left(f,\delta\right)+2\left\|f\right\|_{\infty}\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right),

where $\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right):=\int_{\left\|\boldsymbol{y}\right\|>\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}$ is the tail mass of $\mathcal{B}_{\theta}$.
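As an illustration of the tail mass (ours, not part of the paper's argument), consider the one-dimensional scaled Gaussian approximate identity of Lemma 2, $\mathcal{B}_{\theta}\left(y\right)=\frac{1}{\theta}g\left(\frac{y}{\theta}\right)$ with $g$ the standard normal density; then $\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)$ has a closed form and vanishes rapidly as $\theta\to 0$:

import math

def gaussian_tail_mass(theta, delta):
    # T(B_theta, delta) = integral over |y| > delta of B_theta(y) dy, with
    # B_theta(y) = (1/theta) g(y/theta) and g the standard normal density
    return math.erfc(delta / (theta * math.sqrt(2.0)))

delta = 0.1
for theta in (1.0, 0.1, 0.05, 0.01):
    print(f"theta = {theta:5.2f},  T(B_theta, {delta}) = {gaussian_tail_mass(theta, delta):.3e}")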

Next, for any $\boldsymbol{x}\in\left[-1,1\right]^{n}$, we approximate

f*\mathcal{B}_{\theta}\left(\boldsymbol{x}\right)=\int_{\mathbb{R}^{n}}\mathcal{B}_{\theta}\left(\boldsymbol{x}-\boldsymbol{y}\right)f\left(\boldsymbol{y}\right)\,d\boldsymbol{y}=\int_{\left[-1,1\right]^{n}}\mathcal{B}_{\theta}\left(\boldsymbol{x}-\boldsymbol{y}\right)f\left(\boldsymbol{y}\right)\,d\boldsymbol{y}

with the Riemann sum defined in (1). The following result is obvious by the triangle inequality, the continuity of $\omega\left(f,h\right)$ at $h=0$, the third condition in the definition of $\mathcal{B}_{\theta}$, and Lemma 4.

Lemma 5.

Let $f\in\mathcal{C}_{c}\left[-1,1\right]^{n}$, let $\mathcal{B}_{\theta}$ be an approximate identity, and let $\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(\boldsymbol{x}\right)$ be the quadrature rule (1) for $f*\mathcal{B}_{\theta}$ with the partition $\mathcal{P}^{m}$ and quadrature point set $\mathcal{Q}^{m}$. Then, for any $\delta>0$, there holds

f\left(\boldsymbol{x}\right)-\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(\boldsymbol{x}\right)=\mathcal{O}\left(\omega\left(f,\delta\right)+\omega\left(f,\left|\mathcal{P}^{m}\right|\right)+\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)\right). (3)

In particular, for any $\varepsilon>0$, there exist a sufficiently small $\left|\mathcal{P}^{m}\right|$ (and hence large $m$), and sufficiently small $\theta$ and $\delta$, such that

\left\|f\left(\boldsymbol{x}\right)-\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(\boldsymbol{x}\right)\right\|_{\infty}\leq\varepsilon.
Remark 1.

Aiming at using only elementary means, we use a Riemann sum to approximate the integral, which does not give the best possible convergence rates. To improve the rates, Section 9 resorts to Monte Carlo sampling to not only reduce the number of “quadrature” points but also to obtain a total number of points independent of the ambient dimension $n$.

6 An abstract unified framework for universality

Definition 2 (Network Approximate Identity function).

A univariate function $\sigma\left(x\right):\mathbb{R}\to\mathbb{R}$ admits a network approximate identity (nAI) if there exist $1\leq k\in\mathbb{N}$, $\left\{\alpha_{i}\right\}_{i=1}^{k}\subset\mathbb{R}$, $\left\{w^{i}\right\}_{i=1}^{k}\subset\mathbb{R}$, and $\left\{b_{i}\right\}_{i=1}^{k}\subset\mathbb{R}$ such that

g\left(x\right):=\sum_{i=1}^{k}\alpha_{i}\sigma\left(w^{i}x+b_{i}\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right). (4)

Thus, up to a scaling, $g\left(x\right)$ is an approximate identity.
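As a quick numerical illustration of Definition 2 (ours, not the paper's), one can probe the integrability of a candidate combination (4) by estimating its $\mathcal{L}^{1}$ norm on growing intervals; the stabilizing values below for $g\left(x\right)=\tanh\left(x+1\right)-\tanh\left(x-1\right)$ (i.e., $k=2$, $\alpha=(1,-1)$, $w=(1,1)$, $b=(1,-1)$) indicate that $\tanh$ is an nAI.

import numpy as np

def l1_estimate(g, R, m=2_000_001):
    # Riemann-sum estimate of the integral of |g| over [-R, R]
    x = np.linspace(-R, R, m)
    return float(np.sum(np.abs(g(x))) * (x[1] - x[0]))

# candidate combination (4) for sigma = tanh: k = 2, alpha = (1, -1), w = (1, 1), b = (1, -1)
g = lambda x: np.tanh(x + 1.0) - np.tanh(x - 1.0)

# stabilizing values over growing intervals indicate g in L^1(R); the exact value here is 4
for R in (10.0, 100.0, 1000.0):
    print(R, l1_estimate(g, R))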

Lemma 6.

If a univariate function $\sigma\left(x\right):\mathbb{R}\to\mathbb{R}$ is an nAI, then it is universal in $\mathcal{C}_{c}\left[-1,1\right]$.

Proof.

From the nAI definition, there exist $k\in\mathbb{N}$ and $\alpha_{i},w^{i},b_{i}$, $i=1,\ldots,k$, such that (4) holds. After rescaling, i.e., replacing $g\left(x\right)$ by $g\left(x\right)/\left\|g\right\|_{\mathcal{L}^{1}\left(\mathbb{R}\right)}$, we infer from Lemma 2 that $\mathcal{B}_{\theta}\left(x\right):=\frac{1}{\theta}g\left(\frac{x}{\theta}\right)$ is an approximate identity. Lemma 5 then asserts that for any desired accuracy $\varepsilon>0$, there exist a sufficiently small partition norm $\left|\mathcal{P}^{m}\right|$, and sufficiently small $\theta$ and $\delta$, such that

\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(x\right)=\frac{1}{\theta}\sum_{j=1}^{m}\left(z^{j+1}-z^{j}\right)f\left(\xi^{j}\right)\sum_{\ell=1}^{k}\alpha_{\ell}\sigma\left[\frac{w^{\ell}}{\theta}\left(x-\xi^{j}\right)+b_{\ell}\right] (5)

is within $\varepsilon$ of any $f\left(x\right)\in\mathcal{C}_{c}\left[-1,1\right]$ in the uniform norm. ∎

We have explicitly constructed a one-hidden-layer neural network $\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(x\right)$ with an arbitrary nAI activation $\sigma$ in (5) to approximate well any continuous function $f\left(x\right)$ with compact support in $\left[-1,1\right]$. A few observations are in order: i) if we know the modulus of continuity of $f\left(x\right)$ and the tail behavior of $\mathcal{B}_{\theta}$ (from the properties of $\sigma$), we can precisely determine the total number of quadrature points $m$, the scaling $\theta$, and the cut-off radius $\delta$ in terms of $\varepsilon$ (see Lemma 5); that is, the topology of the network is completely determined; ii) the weights and biases of the network are also readily available from the nAI property of $\sigma$, the quadrature points, and $\theta$; iii) the coefficients of the output layer are also pre-determined by the nAI (i.e., the $\alpha_{\ell}$) and the values of the unknown function $f$ at the quadrature points; and iv) any nAI activation function is universal in $\mathcal{C}_{c}\left[-1,1\right]$.
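The following is a minimal one-dimensional sketch (ours, with illustrative and unoptimized choices) of the construction (5): it assumes $\sigma=\tanh$ with the normalized nAI combination $g\left(x\right)=\left[\tanh\left(x+1\right)-\tanh\left(x-1\right)\right]/4$, a uniform partition with left-endpoint quadrature points, and fixed $m$ and $\theta$. It makes the weights, biases, and output coefficients described in i)-iii) explicit.

import numpy as np

# assumed nAI data for sigma = tanh: g(x) = [tanh(x+1) - tanh(x-1)] / 4 integrates to 1
sigma = np.tanh
alpha = np.array([0.25, -0.25])     # alpha_l (already divided by ||g||_1 = 4)
w_nai = np.array([1.0, 1.0])        # w^l
b_nai = np.array([1.0, -1.0])       # b_l

def build_network(f, m, theta):
    # Assemble the one-hidden-layer network S(m, f * B_theta) of (5):
    # one neuron per pair (l, j) with weight w^l/theta, bias b_l - (w^l/theta) xi^j,
    # and output coefficient (1/theta) (z^{j+1} - z^j) f(xi^j) alpha_l.
    z = np.linspace(-1.0, 1.0, m + 1)
    xi, dz = z[:-1], np.diff(z)                       # left-endpoint quadrature points
    k = len(alpha)
    W = np.repeat(w_nai / theta, m)
    B = np.repeat(b_nai, m) - np.repeat(w_nai / theta, m) * np.tile(xi, k)
    C = np.repeat(alpha, m) * np.tile(dz * f(xi), k) / theta
    return W, B, C

def evaluate(W, B, C, x):
    return np.sum(C * sigma(np.outer(x, W) + B), axis=1)

f = lambda x: np.where(np.abs(x) < 1.0, np.cos(np.pi * x / 2.0) ** 2, 0.0)   # compact support
W, B, C = build_network(f, m=400, theta=0.05)
x = np.linspace(-1.0, 1.0, 1001)
print("uniform error:", np.max(np.abs(evaluate(W, B, C, x) - f(x))))         # small for these choices

Refining the partition and shrinking $\theta$ and $\delta$ as dictated by Lemma 5 (using the modulus of continuity of $f$ and the tail of $\mathcal{B}_{\theta}$) drives this uniform error below any prescribed $\varepsilon$.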

Clearly the Gaussian activation function $e^{-x^{2}}$ is an nAI with $k=1$, $\alpha_{1}=1$, $w^{1}=1$, and $b_{1}=0$. The interesting fact is that, as we shall prove in Section 7, most existing activation functions, though not integrable by themselves, are nAI.

Remark 2.

It is important to point out that the universality in Lemma 6 for any nAI activation is in one dimension. In order to extend the universality result to $n$ dimensions for an nAI activation function, we shall deploy a special $n$-fold composition of its one-dimensional approximate identity (4). Section 7 provides explicit constructions of such $n$-fold compositions for many nAI activation functions, and Section 8 extends the result to abstract nAI functions under appropriate conditions.

Remark 3.

Our framework also helps us understand some of the existing universal approximation approaches in a more constructive manner. For example, the constructive proof in [11] is not entirely constructive, as its key step, where the authors introduce the neural network approximation function $g\left(x\right)$, is not constructive. Our framework can provide a constructive derivation of this step. Indeed, by applying summation by parts, one can see that $g\left(x\right)$ resembles a quadrature approximation of the convolution of $f\left(x\right)$ with the derivative (approximated by a forward finite difference) of a scaled sigmoidal function. Since the derivative of the sigmoidal function, up to a scaling factor, is an nAI (see Section 8), the convolution of $f\left(x\right)$ with a scaled sigmoidal derivative can approximate $f\left(x\right)$ to any desired accuracy.

Another important example is the pioneering work in [32] that has been extended in many directions in [35, 36, 37, 38, 39, 33, 34, 41, 12]. Though the rest of the approach is constructive, the introduction of the neural network operator is rather mysterious. From our framework's point of view, this is no longer the case, as the neural network operator is nothing more than a quadrature approximation of the convolution of $f\left(x\right)$ with $\mathcal{B}$ell-shape functions constructed from the activation functions under consideration. Since these $\mathcal{B}$ell-shape functions fall under the nAI umbrella, the introduction of the neural network operator is now completely justified and easily understood. We would like to point out that our work does not generalize/extend the work in [32]. Instead, we aim at developing a new analytical framework that achieves unified and broad results that are not possible with the contemporary counterparts, while being more constructive and simpler. The $\mathcal{B}$ell-shape function idea alone is not sufficient to construct our framework. Indeed, the work in [32] and its extensions [35, 36, 37, 38, 39, 33, 34, 41, 12] have been limited to a few special and standard squash-type activation functions such as the sigmoid and arctangent functions. Our work, thanks to other new ingredients such as convolution, numerical quadrature, Monte Carlo sampling, and finite differences together with the new nAI concept, breaks this barrier. Our framework thus provides not only a new constructive method but also new insights into the work in [32] as a special case.

7 Many existing activation functions are nAI

We now exploit properties of each activation (and its family, if any) under consideration to show that they are nAI for $n=1$. That is, these activations generate one-dimensional approximate identities, which in turn shows that they are universal by Lemma 6. To show that they are also universal in $n$ dimensions (note that, by Proposition 3.3 of [23], with an additional assumption on the denseness of the set of ridge functions in $\mathcal{C}\left(\mathbb{R}^{n}\right)$, we could also conclude universality in $n$ dimensions; this approach, however, does not provide an explicit construction of the corresponding neural networks), we extend a constructive and explicit approach from [26] that was developed for the ReLU activation function. We start with the family of rectified polynomial units (RePU), which also covers the parametric and leaky ReLUs, with many properties; then the family of sigmoidal functions; then the exponential linear unit (ELU); then the Gaussian error linear unit (GELU); then the sigmoid linear unit (SiLU); and we conclude with the Mish activation, which has the fewest properties. The proofs of all results in this section, together with figures demonstrating the $\mathcal{B}$-functions for each activation, can be found in Appendix B and Appendix C.

7.1 Rectified Polynomial Units (RePU) is an nAI

Following [57, 18] we define the rectified polynomial unit (RePU) as

\text{RePU}\left(q;x\right)=\begin{cases}x^{q}&\text{if }x\geq 0,\\ 0&\text{otherwise},\end{cases}\quad q\in\mathbb{N},

which reduces to the standard rectified linear unit (ReLU) [58] when $q=1$, the Heaviside unit when $q=0$, and the rectified cubic unit [59] when $q=3$.

The goal of this section is to construct an integrable linear combination of the RePU activation function for a given $q$. We begin by constructing compactly supported $\mathcal{B}$ell-shape functions (the term $\mathcal{B}$ell-shape function seems to have been first coined in [32]) from RePU in one dimension. Recall the binomial coefficient notation $\binom{r}{k}=\frac{r!}{\left(r-k\right)!k!}$, and the central finite difference operator with stepsize $h$, $\delta_{h}\left[f\right]\left(x\right):=f\left(x+\frac{h}{2}\right)-f\left(x-\frac{h}{2}\right)$ (see Remark 8 for forward and backward finite differences).

Lemma 7 (\mathcal{B}-function for RePU  in one dimension).

Let $r:=q+1$. For any $h\geq 0$, define

\mathcal{B}\left(x,h\right):=\frac{1}{q!}\delta_{h}^{r}\left[\text{RePU}\right]\left(x\right)=\frac{1}{q!}\sum_{i=0}^{r}\left(-1\right)^{i}\binom{r}{i}\text{RePU}\left(q;x+\left(\frac{r}{2}-i\right)h\right).

Then: i) $\mathcal{B}\left(x,h\right)$ is a piecewise polynomial of order at most $q$ on each interval $\left[kh-\frac{rh}{2},(k+1)h-\frac{rh}{2}\right]$ for $k=0,\ldots,r-1$; ii) $\mathcal{B}\left(x,h\right)$ is $\left(q-1\right)$-time differentiable for $q\geq 2$, continuous for $q=1$, and discontinuous for $q=0$; furthermore, $\text{supp}\left(\mathcal{B}\left(x,h\right)\right)$ is a subset of $\left[-\frac{rh}{2},\frac{rh}{2}\right]$; and iii) $\mathcal{B}\left(x,h\right)$ is even, non-negative, unimodal, and $\int_{\mathbb{R}}\mathcal{B}\left(x,1\right)\,dx=1$.
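A short numerical sanity check of Lemma 7 (ours; the helper names are not from the paper) for, e.g., $q=2$:

import numpy as np
from math import comb, factorial

def repu(q, x):
    return np.where(x >= 0.0, x ** q, 0.0)

def B_repu(q, x, h):
    # B(x, h) = (1/q!) * r-th central difference of RePU(q; .) with step h, r = q + 1
    r = q + 1
    s = sum((-1) ** i * comb(r, i) * repu(q, x + (r / 2.0 - i) * h) for i in range(r + 1))
    return s / factorial(q)

q, h = 2, 1.0
x = np.linspace(-3.0, 3.0, 600001)
vals = B_repu(q, x, h)
print("min value       :", vals.min())                           # non-negative up to round-off
print("integral (h = 1):", np.sum(vals) * (x[1] - x[0]))          # close to 1, as in iii)
print("support within  :", x[np.abs(vals) > 1e-12][[0, -1]])      # inside [-r h/2, r h/2]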

Though there are many approaches to construct $\mathcal{B}$-functions in $n$ dimensions from a $\mathcal{B}$-function in one dimension (see, e.g., [32, 35, 36, 37, 38, 39, 40, 41]) that are realizable by neural networks, inspired by [26] we construct the $\mathcal{B}$-function in $n$ dimensions by an $n$-fold composition of the one-dimensional $\mathcal{B}$-function in Lemma 7. By Lemma 5, it then follows that the activation under consideration is universal in $n$ dimensions. We shall carry out this $n$-fold composition not only for RePU but also for the other activation functions. The same procedure for an abstract activation function will be presented in Section 8.

Theorem 1 (\mathcal{B}-function for RePU in $n$ dimensions).

Let $r:=q+1$, and $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)$. Define $\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right)$. The following hold:

  1. i)

    $\mathfrak{B}\left(\boldsymbol{x}\right)$ is non-negative with compact support. In particular, $\text{supp}\left(\mathfrak{B}\left(\boldsymbol{x}\right)\right)\subseteq\times_{i=1}^{n}\left[-b_{i},b_{i}\right]$, where $b_{i}=\underbrace{\mathcal{B}\left(0,\ldots,\mathcal{B}\left(0,1\right)\right)}_{(i-1)\text{-time composition}}$ for $2\leq i\leq n$, and $b_{1}=\frac{n}{2}$.

  2. ii)

    $\mathfrak{B}\left(\boldsymbol{x}\right)$ is even with respect to each component $\boldsymbol{x}_{i}$, and unimodal with $\mathfrak{B}\left(0\right)=\max_{\boldsymbol{x}\in\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)$.

  3. iii)

    $\mathfrak{B}\left(\boldsymbol{x}\right)$ is a piecewise polynomial of order at most $\left(n-i\right)q$ in $\boldsymbol{x}_{i}$, $i=1,\ldots,n$. Furthermore, $\mathfrak{B}\left(\boldsymbol{x}\right)$ is $(q-1)$-time differentiable in each $\boldsymbol{x}_{i}$, $\int_{\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)\,d\boldsymbol{x}\leq 1$, and $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.
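The $n$-fold composition in Theorem 1 is computationally just a repeated evaluation in which the previous value plays the role of the step size $h$. The sketch below (ours) uses the ReLU case $q=1$, whose one-dimensional $\mathcal{B}$-function is the hat function, for brevity.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def B1(x, h):
    # one-dimensional B-function for ReLU (q = 1): second central difference of ReLU,
    # i.e., the hat function of height h supported on [-h, h] (for h >= 0)
    return relu(x + h) - 2.0 * relu(x) + relu(x - h)

def B_nfold(B, x):
    # n-fold composition frak_B(x) = B(x_n, B(x_{n-1}, ..., B(x_1, 1)))
    h = 1.0
    for xi in x:
        h = B(xi, h)      # the previous value enters as the step size of the next B
    return h

print(B_nfold(B1, np.zeros(4)))                       # the maximum, attained at the origin
print(B_nfold(B1, np.array([0.3, -0.2, 0.1, 0.1])))   # smaller away from the origin
print(B_nfold(B1, np.array([2.5, 0.0, 0.0, 0.0])))    # zero outside the compact support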

Remark 4.

Note that Lemma 7 and Theorem 1 also hold for the parametric ReLU [60] and the leaky ReLU [61] (a special case of the parametric ReLU) with $r=2$. In other words, the parametric ReLU is an nAI, and thus universal.

7.2 Sigmoidal and related activation functions are nAI

7.2.1 Sigmoidal, hyperbolic tangent, and softplus activation functions

Recall the sigmoidal and softplus functions [62] are, respectively, given by

\sigma_{s}\left(x\right):=\frac{1}{1+e^{-x}},\quad\text{ and }\quad\sigma_{p}\left(x\right):=\ln\left(1+e^{x}\right).

It is the relationship between the sigmoidal function and the hyperbolic tangent function, denoted by $\sigma_{t}\left(x\right)$,

\sigma_{s}\left(x\right)=\frac{1}{2}+\frac{1}{2}\sigma_{t}\left(\frac{x}{2}\right)

that allows us to construct their $\mathcal{B}$-functions in a similar manner. In particular, based on the bell-shape geometry of the derivatives of the sigmoidal and hyperbolic tangent functions, we apply central finite differencing with $h>0$ to obtain the corresponding $\mathcal{B}$-functions

\mathcal{B}_{s}\left(x,h\right):=\frac{1}{2}\delta_{h}\left[\sigma_{s}\right]\left(x\right),\quad\text{and }\quad\mathcal{B}_{t}\left(x,h\right):=\frac{1}{4}\delta_{h}\left[\sigma_{t}\right]\left(x\right).

Since $\sigma_{s}\left(x\right)$ is the derivative of $\sigma_{p}\left(x\right)$, we apply the central difference twice to obtain a $\mathcal{B}$-function for $\sigma_{p}\left(x\right)$: $\mathcal{B}_{p}\left(x,h\right)=\delta_{h}^{2}\left[\sigma_{p}\right]\left(x\right)$. The following Lemma 8 and Theorem 2 hold for $\mathcal{B}\left(x,h\right)$ being either $\mathcal{B}_{s}\left(x,h\right)$, $\mathcal{B}_{t}\left(x,h\right)$, or $\mathcal{B}_{p}\left(x,h\right)$.

Lemma 8 (\mathcal{B}-function for sigmoid, hyperbolic tangent, and softplus in one dimension).

For $0<h<\infty$, the following hold: i) $\mathcal{B}\left(x,h\right)\geq 0$ for all $x\in\mathbb{R}$, $\lim_{\left|x\right|\to\infty}\mathcal{B}\left(x,h\right)=0$, and $\mathcal{B}\left(x,h\right)$ is even; and ii) $\mathcal{B}\left(x,h\right)$ is unimodal, $\int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx<\infty$, and $\mathcal{B}\left(x,h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right)$.

Similar to Theorem 1, we construct the $\mathcal{B}$-function in $n$ dimensions by an $n$-fold composition of the one-dimensional $\mathcal{B}$-function in Lemma 8.

Theorem 2 (\mathcal{B}-function for sigmoid, hyperbolic tangent, and softplus in $n$ dimensions).

Let $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)\in\mathbb{R}^{n}$. Define

\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right).

Then $\mathfrak{B}\left(\boldsymbol{x}\right)$ is even with respect to each component $\boldsymbol{x}_{i}$, $i=1,\ldots,n$, and unimodal with $\mathfrak{B}\left(0\right)=\max_{\boldsymbol{x}\in\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)$. Furthermore, $\int_{\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)\,d\boldsymbol{x}\leq 1$ and $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

7.2.2 Arctangent function

We next discuss the arctangent activation function $\sigma_{a}\left(x\right):=\arctan\left(x\right)$, whose shape is similar to that of the sigmoid. Since its derivative has the bell-shape geometry, we define the $\mathcal{B}$-function for the arctangent function by approximating its derivative with central finite differencing, i.e.,

\mathcal{B}_{a}\left(x,h\right)=\frac{1}{2\pi}\delta_{h}\left[\sigma_{a}\right]\left(x\right)=\frac{1}{2\pi}\arctan\left(\frac{2h}{1+x^{2}-h^{2}}\right),

for $0<h\leq 1$. Then simple algebraic manipulations show that Lemma 8 holds for $\mathcal{B}_{a}\left(x,h\right)$ with $\int_{\mathbb{R}}\mathcal{B}_{a}\left(x,h\right)\,dx=h$. Thus, Theorem 2 also holds for the $n$-dimensional arctangent $\mathcal{B}$-function

\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}_{a}\left(\boldsymbol{x}_{n},\mathcal{B}_{a}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}_{a}\left(\boldsymbol{x}_{1},1\right)\right)\right),

as we have shown for the sigmoidal function.
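A quick numerical check (ours) that the closed form above agrees with the difference $\frac{1}{2\pi}\left[\arctan\left(x+h\right)-\arctan\left(x-h\right)\right]$ for $0<h<1$ and that it integrates to $h$:

import numpy as np

def B_arctan_closed(x, h):
    # closed form of the arctangent B-function used above, valid for 0 < h < 1
    return np.arctan(2.0 * h / (1.0 + x ** 2 - h ** 2)) / (2.0 * np.pi)

def B_arctan_diff(x, h):
    # (1/(2 pi)) * [arctan(x + h) - arctan(x - h)]
    return (np.arctan(x + h) - np.arctan(x - h)) / (2.0 * np.pi)

h = 0.5
x = np.linspace(-1000.0, 1000.0, 2_000_001)
print("max |closed - difference|:", np.max(np.abs(B_arctan_closed(x, h) - B_arctan_diff(x, h))))
print("integral (close to h)    :", np.sum(B_arctan_closed(x, h)) * (x[1] - x[0]))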

7.2.3 Generalized sigmoidal functions

We extend the class of sigmoidal functions in [63] to generalized sigmoidal (or “squash”) functions.

Definition 3 (Generalized sigmoidal functions).

We say $\sigma\left(x\right)$ is a generalized sigmoidal function if it satisfies the following conditions:

  1. i)

    $\sigma\left(x\right)$ is bounded, i.e., there exists a constant $\sigma_{\text{max}}$ such that $\left|\sigma\left(x\right)\right|\leq\sigma_{\text{max}}$ for all $x\in\mathbb{R}$.

  2. ii)

    $\lim_{x\to\infty}\sigma\left(x\right)=L$ and $\lim_{x\to-\infty}\sigma\left(x\right)=\ell$.

  3. iii)

    There exist $x^{-}<0$ and $\alpha>0$ such that $\sigma\left(x\right)-\ell=\mathcal{O}\left(\left|x\right|^{-1-\alpha}\right)$ for $x<x^{-}$.

  4. iv)

    There exist $x^{+}>0$ and $\alpha>0$ such that $L-\sigma\left(x\right)=\mathcal{O}\left(\left|x\right|^{-1-\alpha}\right)$ for $x>x^{+}$.

Clearly the standard sigmoidal, hyperbolic tangent, and arctangent activations are members of the class of generalized sigmoidal functions.

Lemma 9 (\mathcal{B}-function for generalized sigmoids in one dimension).

Let $\sigma\left(x\right)$ be a generalized sigmoidal function, and let $\mathcal{B}\left(x,h\right)$ be an associated $\mathcal{B}$-function defined as

\mathcal{B}\left(x,h\right):=\frac{1}{2\left(L-\ell\right)}\left[\sigma\left(x+h\right)-\sigma\left(x-h\right)\right].

Then the following hold for any $h\in\mathbb{R}$: i) there exists a constant $C$ such that $\left|\mathcal{B}\left(x,h\right)\right|\leq C\left(x+\left|h\right|\right)^{-1-\alpha}$ for $x\geq x^{+}+\left|h\right|$, and $\left|\mathcal{B}\left(x,h\right)\right|\leq C\left(-x+\left|h\right|\right)^{-1-\alpha}$ for $x\leq x^{-}-\left|h\right|$; and ii) for $x\in\left[m,M\right]$, $m,M\in\mathbb{R}$, $\sum_{k=-\infty}^{\infty}\mathcal{B}\left(x+kh,h\right)$ converges uniformly to $\text{sign}\left(h\right)$. Furthermore, $\int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx=h$, and $\mathcal{B}\left(x,h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right)$.

Thus any generalized sigmoidal function is an nAI in one dimension. Note that the setting in [63], in which $\sigma\left(x\right)$ is non-decreasing, $L=1$, and $\ell=0$, is a special case of our general setting. In this less general setting, it is clear that $\mathcal{B}\left(x,h\right)\geq 0$, and thus $\int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx=h$ is sufficient to conclude $\mathcal{B}\left(x,h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right)$ for any $h>0$. We will explore this in the next theorem, as it is not clear, at the moment of writing this paper, how to show that the $\mathcal{B}$-functions for generalized sigmoids, as constructed below, reside in $\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

Theorem 3 (\mathcal{B}-function for generalized sigmoids in $n$ dimensions).

Suppose that $\sigma\left(x\right)$ is a non-decreasing generalized sigmoidal function. Let $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)\in\mathbb{R}^{n}$. Define $\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right)$. Then $\int_{\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)\,d\boldsymbol{x}=1$ and $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

7.3 The Exponential Linear Unit (ELU) is nAI

Following [64, 65] we define the Exponential Linear Unit (ELU) as

\sigma\left(x\right)=\begin{cases}\alpha\left(e^{x}-1\right)&x\leq 0,\\ x&x>0,\end{cases}

for some $\alpha\in\mathbb{R}$. The goal of this section is to show that ELU is an nAI. Since the unbounded part of ELU is linear, Section 7.1 suggests defining, for $h\in\mathbb{R}$,

\mathcal{B}\left(x,h\right):=\frac{\delta_{h}^{2}\left[\sigma\right]\left(x\right)}{\gamma},\text{ where }\gamma=\max\left\{1+4\left|\alpha\right|,2+\left|\alpha\right|\right\}.
Lemma 10 (\mathcal{B}-function for ELU in one dimension).

Let $\alpha,h\in\mathbb{R}$. Then $\int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx=\frac{h^{2}}{\gamma}$ and $\mathcal{B}\left(x,h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right)$. Furthermore, $\mathcal{B}\left(x,h\right)\leq 1$ for $\left|h\right|\leq 1$.
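A numerical verification of the integral identity in Lemma 10 (ours), for the illustrative values $\alpha=1$ and $h=1/2$:

import numpy as np

def elu(x, a):
    return np.where(x > 0.0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))

def B_elu(x, h, a):
    gamma = max(1.0 + 4.0 * abs(a), 2.0 + abs(a))
    return (elu(x + h, a) - 2.0 * elu(x, a) + elu(x - h, a)) / gamma

a, h = 1.0, 0.5                                   # illustrative choices
gamma = max(1.0 + 4.0 * abs(a), 2.0 + abs(a))
x = np.linspace(-60.0, 10.0, 700001)              # the integrand vanishes for x > h
print("integral :", np.sum(B_elu(x, h, a)) * (x[1] - x[0]))
print("predicted:", h ** 2 / gamma)               # Lemma 10: h^2 / gamma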

Theorem 4 (\mathcal{B}-function for ELU in $n$ dimensions).

Let $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)\in\mathbb{R}^{n}$. Define $\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right)$. Then $\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left(\boldsymbol{x}\right)\right|\,d\boldsymbol{x}\leq 1$ and thus $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

7.4 The Gaussian Error Linear Unit (GELU) is nAI

GELU, introduced in [66], is defined as

\sigma\left(x\right):=x\Phi\left(x\right),

where $\Phi\left(x\right)$ is the cumulative distribution function of the standard normal distribution. Since the unbounded part of GELU is essentially linear, Section 7.1 suggests defining

\mathcal{B}\left(x,h\right):=\delta_{h}^{2}\left[\sigma\right]\left(x\right),\quad h\in\mathbb{R}.
Lemma 11 (\mathcal{B}-function for GELU in one dimension).
  1. i)

    $\mathcal{B}\left(x,h\right)$ is an even function in both $x$ and $h$.

  2. ii)

    $\mathcal{B}\left(x,h\right)$ has two symmetric roots $\pm x^{*}$ with $h<x^{*}<\max\left\{2,2h\right\}$, and $\left|\mathcal{B}\left(x,h\right)\right|\leq\frac{1}{\sqrt{2\pi}}h^{2}$,

    \int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx=h^{2},\text{ and }\int_{\mathbb{R}}\left|\mathcal{B}\left(x,h\right)\right|\,dx\leq\frac{37}{10}h^{2}.
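Both integrals in Lemma 11 ii) can be checked numerically; the sketch below (ours) realizes $\Phi$ via the error function and uses a simple Riemann sum:

import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))   # standard normal CDF

def gelu(x):
    return x * Phi(x)

def B_gelu(x, h):
    return gelu(x + h) - 2.0 * gelu(x) + gelu(x - h)

h = 0.7
x = np.linspace(-15.0, 15.0, 300001)
dx = x[1] - x[0]
print("integral of B  :", np.sum(B_gelu(x, h)) * dx)             # Lemma 11: h^2
print("integral of |B|:", np.sum(np.abs(B_gelu(x, h))) * dx)     # at most 3.7 h^2 by Lemma 11
print("h^2, 3.7 h^2   :", h ** 2, 3.7 * h ** 2)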
Theorem 5 (\mathcal{B}-function for GELU in $n$ dimensions).

Let $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)\in\mathbb{R}^{n}$. Define $\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right)$; then $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

7.5 The Sigmoid Linear Unit (SiLU) is an nAI

SiLU, also known as sigmoid shrinkage or swish [66, 67, 68, 69], is defined as $\sigma\left(x\right):=\frac{x}{1+e^{-x}}$. By inspection, the second derivative of $\sigma\left(x\right)$ is bounded and its graph is quite close to a bell shape. This suggests defining

\mathcal{B}\left(x,h\right):=\delta_{h}^{2}\left[\sigma\right]\left(x\right),\quad h\in\mathbb{R}.

The proofs of the following results are similar to those of Lemma 11 and Theorem 5.

Theorem 6 (\mathcal{B}-function for SiLU in $n$ dimensions).
  1. i)

    $\mathcal{B}\left(x,h\right)$ is an even function in both $x$ and $h$.

  2. ii)

    $\mathcal{B}\left(x,h\right)$ has two symmetric roots $\pm x^{*}$ with $h<x^{*}<\max\left\{3,2h\right\}$, and $\left|\mathcal{B}\left(x,h\right)\right|\leq\frac{1}{2}h^{2}$.

  3. iii)

    _(x,h)𝑑x=h2, and _|(x,h)|𝑑x265h2.\int_{\_}{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h^{2},\text{ and }\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\frac{26}{5}h^{2}.

  4. iv)

    Let 𝒙=(𝒙_1,,𝒙_n){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}1,\ldots,{\boldsymbol{{x}}}_{\_}n\right). Define 𝔅(𝒙)=(𝒙_n,(𝒙_n1,,(𝒙_1,1)))\mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({\boldsymbol{{x}}}_{\_}n,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right)\right), then 𝔅(𝒙)1(n)\mathfrak{B}\left({\boldsymbol{{x}}}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right).
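The n-dimensional \mathfrak{B}-function in item iv) (and likewise in Theorems 4, 5, and 7) is built by feeding the output of one one-dimensional \mathcal{B}-function into the next as the step size. The following sketch (ours; plain Python, instantiated with the SiLU-based \mathcal{B}-function, and with an arbitrary sample size and integration box) illustrates the construction and gives a crude check that the \mathcal{L}^{1} norm is finite.

import math, random

def silu(x):
    return x / (1.0 + math.exp(-x))

def B(x, h):
    # second-order central difference of SiLU
    return silu(x + h) - 2.0 * silu(x) + silu(x - h)

def frakB(xs):
    # frakB(x) = B(x_n, B(x_{n-1}, ..., B(x_1, 1))) for xs = (x_1, ..., x_n)
    h = 1.0
    for x in xs:
        h = B(x, h)
    return h

print(frakB((0.3, -1.2, 0.5)))     # a pointwise evaluation in three dimensions

random.seed(0)
n, N, box = 3, 100_000, 10.0       # integrate over [-box, box]^n; the tails are negligible there
vol = (2.0 * box) ** n
est = vol * sum(abs(frakB([random.uniform(-box, box) for _ in range(n)]))
                for _ in range(N)) / N
print(est)                         # a crude, noisy Monte Carlo estimate of the (finite) L^1 norm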

7.6 The Mish unit is an nAI

The Mish unit, introduced in [70], is defined as

σ(x):=xtanh(ln(1+ex)).\sigma\left({x}\right):={x}\tanh\left(\ln\left(1+e^{x}\right)\right).

Due to its similarity with SiLU, we define

(x,h):=δh2[σ](x).\mathcal{B}\left({x},h\right):=\delta_{h}^{2}\left[\sigma\right]\left({x}\right).

Unlike the preceding activation functions, Mish is not straightforward to manipulate analytically. This motivates us to devise a new approach to showing that Mish is an nAI. As will be shown in Section 8, this approach allows us to unify the nAI property for all activation functions. We begin with the following result on the second derivative of Mish.

Lemma 12.

The second derivative of Mish, σ(2)(x)\sigma^{(2)}\left({x}\right), is continuous, bounded, and integrable. That is, |σ(2)(x)|M\left|\sigma^{(2)}\left({x}\right)\right|\leq M for some positive constant MM and σ(2)(x)1()\sigma^{(2)}\left({x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right).

Theorem 7.

Let hh\in\mathbb{R}, then the following hold:

  1. i)

    There exists 0<M<0<M<\infty such that |(x,h)|Mh2\left|\mathcal{B}\left({x},h\right)\right|\leq Mh^{2}, and _|(x,h)|𝑑xσ(2)_1()h2.\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\left\|\sigma^{(2)}\right\|_{\_}{\mathcal{L}^{1}\left(\mathbb{R}\right)}h^{2}.

  2. ii)

    Let 𝒙=(𝒙_1,,𝒙_n){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}1,\ldots,{\boldsymbol{{x}}}_{\_}n\right). Define 𝔅(𝒙)=(𝒙_n,(𝒙_n1,,(𝒙_1,1)))\mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({\boldsymbol{{x}}}_{\_}n,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right)\right), then 𝔅(𝒙)1(n)\mathfrak{B}\left({\boldsymbol{{x}}}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right).
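Lemma 12 is also easy to probe numerically. The sketch below (ours; the step size and the test points are arbitrary) approximates \sigma^{(2)} for Mish by a second central difference of \sigma itself and illustrates that it stays bounded near the origin and decays quickly in both tails.

import math

def mish(x):
    # x * tanh(softplus(x)); log1p keeps the softplus accurate for negative x
    sp = math.log1p(math.exp(x)) if x < 30.0 else x   # softplus, with a large-x shortcut
    return x * math.tanh(sp)

def d2(x, eps=1e-2):
    # finite-difference approximation of the second derivative of Mish
    return (mish(x + eps) - 2.0 * mish(x) + mish(x - eps)) / eps**2

for x in (-20.0, -8.0, -2.0, 0.0, 2.0, 8.0, 20.0):
    print(x, d2(x))   # modest values near the origin, essentially zero in the tails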

8 A general framework for nAI

Inspired by the nAI proof of the Mish activation function in Section 7.6, we develop a general framework for nAI that requires only two conditions on the kkth derivative of any activation σ(x)\sigma\left({x}\right). The beauty of the general framework is that it provides a single nAI proof that is valid for a large class of functions including all activation functions that we have considered. The trade-off is that less can be said about the corresponding \mathcal{B}-functions. To begin, suppose that there exists kk\in\mathbb{N} such that

  1. C1

    Integrability: σ(k)(x)\sigma^{(k)}\left({x}\right) is integrable, i.e., σ(k)(x)1()\sigma^{(k)}\left({x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right), and

  2. C2

    Essential boundedness: there exists M<M<\infty such that σ(k)_M\left\|\sigma^{(k)}\right\|_{\_}\infty\leq M.

Note that if the two conditions C1 and C2 hold for k=0 (e.g., Gaussian activation functions), then the activation is obviously an nAI. Thus we only consider k\in\mathbb{N}, and the \mathcal{B}-function for \sigma\left({x}\right) can be defined via the kth-order central finite difference:

(x,h):=δhk[σ](x)=_i=0k(1)i(ki)σ(x+(k2i)h).\mathcal{B}\left({x},h\right):=\delta_{h}^{k}\left[\sigma\right]\left({x}\right)=\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}\sigma\left({x}+\left(\frac{k}{2}-i\right)h\right).
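In code, the \mathcal{B}-function above is just a kth-order central difference and can be formed for any candidate activation. The snippet below is a generic sketch (ours; the function names are not from the paper), shown with \tanh and k=2 purely as an example.

import math

def central_difference(sigma, x, h, k):
    # delta_h^k[sigma](x) = sum_i (-1)^i C(k, i) sigma(x + (k/2 - i) h)
    return sum((-1) ** i * math.comb(k, i) * sigma(x + (k / 2.0 - i) * h)
               for i in range(k + 1))

# k = 2 recovers the B-function used for ELU, GELU, SiLU, and Mish
print(central_difference(math.tanh, 0.0, 0.5, 2))   # tanh is odd, so this vanishes at x = 0
print(central_difference(math.tanh, 1.0, 0.5, 2))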
Lemma 13.

For any hh\in\mathbb{R}, there holds:

(x,h)=hk(k1)!_i=0k(1)i(ki)(k2i)k_01σ(k)(x+s(k2i)h)(1s)k1𝑑s.\mathcal{B}\left({x},h\right)=\frac{h^{k}}{\left(k-1\right)!}\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}\left(\frac{k}{2}-i\right)^{k}\int_{\_}0^{1}\sigma^{(k)}\left({x}+s\left(\frac{k}{2}-i\right)h\right)\left(1-s\right)^{k-1}\,ds.
Proof.

Applying the Taylor theorem gives

σ(x+(k2i)h)=_j=0k1σ(j)(x)(k2i)jhjj!+hk(k1)!×(k2i)k_01σ(k)(x+s(k2i)h)(1s)k1𝑑s.\sigma\left({x}+\left(\frac{k}{2}-i\right)h\right)=\sum_{\_}{j=0}^{k-1}\sigma^{(j)}\left({x}\right)\left(\frac{k}{2}-i\right)^{j}\frac{h^{j}}{j!}+\frac{h^{k}}{\left(k-1\right)!}\\ \times\left(\frac{k}{2}-i\right)^{k}\int_{\_}0^{1}\sigma^{(k)}\left({x}+s\left(\frac{k}{2}-i\right)h\right)\left(1-s\right)^{k-1}\,ds.

The proof is concluded if we can show that

_i=0k(1)i(ki)_j=0k1σ(j)(x)(k2i)jhjj!=_j=0k1σ(j)(x)hjj!_i=0k(1)i(ki)(k2i)j=0,\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}\sum_{\_}{j=0}^{k-1}\sigma^{(j)}\left({x}\right)\left(\frac{k}{2}-i\right)^{j}\frac{h^{j}}{j!}=\sum_{\_}{j=0}^{k-1}\sigma^{(j)}\left({x}\right)\frac{h^{j}}{j!}\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}\left(\frac{k}{2}-i\right)^{j}=0,

but this is clear by the alternating sum identity

_i=0k(1)i(ki)ij=0, for j=0,,k1.\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}i^{j}=0,\quad\text{ for }j=0,\ldots,k-1.
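The alternating sum identity invoked above can be spot-checked directly; the few lines below (ours) verify it for small k.

import math

# sum_i (-1)^i C(k, i) i^j = 0 for j = 0, ..., k-1
for k in range(1, 7):
    for j in range(k):
        s = sum((-1) ** i * math.comb(k, i) * i ** j for i in range(k + 1))
        assert s == 0, (k, j)
print("identity verified for k = 1, ..., 6")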

Theorem 8.

Let h\in\mathbb{R}. Then: i) there exists N<\infty such that \left|\mathcal{B}\left({x},h\right)\right|\leq N\left|h\right|^{k}; ii) there exists C<\infty such that \int_{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq C\left|h\right|^{k}; and iii) let {\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{1},\ldots,{\boldsymbol{{x}}}_{n}\right)\in\mathbb{R}^{n} and define \mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({\boldsymbol{{x}}}_{n},\mathcal{B}\left({\boldsymbol{{x}}}_{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{1},1\right)\right)\right); then \mathfrak{B}\left({\boldsymbol{{x}}}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right).

Proof.

The first assertion is straightforward by invoking assumption C2, Lemma 13, and defining N=\frac{M}{k!}\sum_{i=0}^{k}\begin{pmatrix}k\\ i\end{pmatrix}\left|\frac{k}{2}-i\right|^{k}. For the second assertion, using the triangle inequality and the Fubini theorem yields

\int_{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\frac{\left|h\right|^{k}}{\left(k-1\right)!}\sum_{i=0}^{k}\begin{pmatrix}k\\ i\end{pmatrix}\left|\frac{k}{2}-i\right|^{k}\times\int_{0}^{1}\left(1-s\right)^{k-1}\int_{\mathbb{R}}\left|\sigma^{(k)}\left({x}+s\left(\frac{k}{2}-i\right)h\right)\right|\,d{x}\,ds=\frac{N}{M}\left\|\sigma^{(k)}\right\|_{\mathcal{L}^{1}\left(\mathbb{R}\right)}\left|h\right|^{k},

and, by defining C=NMσ(k)_1()C=\frac{N}{M}\left\|\sigma^{(k)}\right\|_{\_}{\mathcal{L}^{1}\left(\mathbb{R}\right)}, the result follows owing to assumption C1. The proof of the last assertion is the same as the proof of Theorem 5. In particular, we have

_n|𝔅(𝒙)|𝑑𝒙CkMnk1n1k<.\int_{\_}{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\,d{\boldsymbol{{x}}}\leq C^{k}M^{\frac{n^{k}-1}{n-1}-k}<\infty.

Remark 5.

Note that Theorem 8 is valid for all activation functions considered in Section 7 with an appropriate k: for example, k=q+1 for RePU of order q, k=1 for generalized sigmoidal functions, k=2 for ELU, GELU, SiLU, and Mish, and k=0 for the Gaussian.

Remark 6.

Suppose a function \sigma\left({\boldsymbol{{x}}}\right) satisfies both conditions C1 and C2. Theorem 8 implies that \sigma\left({\boldsymbol{{x}}}\right) and all of its jth derivatives, j=1,\ldots,k, are not only valid activation functions but also universal. For example, any differentiable and non-decreasing sigmoidal function satisfies the conditions with k=1, and therefore both the function itself and its derivative are universal.

Remark 7.

In one dimension, if σ\sigma is of bounded variation, i.e. its total variation TV(σ)TV\left(\sigma\right) is finite, then (x,h)\mathcal{B}\left({x},h\right) resides in 1()()\mathcal{L}^{1}\left(\mathbb{R}\right)\cap\mathcal{L}^{\infty}\left(\mathbb{R}\right) by taking k=1k=1. A simple proof of this fact can be found in [51, Corollary 3]. Thus, σ\sigma is an nAI.

Remark 8.

We have used central finite differences for convenience, but Lemma 13, and hence Theorem 8, also holds for kkth-order forward and backward finite differences. The proofs are indeed almost identical.

9 Universality with non-asymptotic rates

This section exploits Lemma 5 and Lemma 6 to study the convergence of the neural network (5) to the ground truth function f\left({\boldsymbol{{x}}}\right)\in\mathcal{C}_{c}\left[-1,1\right]^{n} as a function of the number of neurons. From (3), we need to estimate the modulus of continuity of f\left({\boldsymbol{{x}}}\right) (the first and the second terms) and the decaying property of the tail (the third term) of the approximate identity \mathcal{B}_{\theta} (and hence the tail of the activation \sigma\left({\boldsymbol{{x}}}\right)). For the former, we further assume that f\left({\boldsymbol{{x}}}\right) is Lipschitz so that the first and the second terms can be estimated as \omega\left(f,\delta\right)=\mathcal{O}\left(\delta\right) and \omega\left(f,\left|\mathcal{P}^{m}\right|\right)=\mathcal{O}\left(\left|\mathcal{P}^{m}\right|\right). To balance these two terms, we need to pick the partition \mathcal{P}^{m} such that \left|\mathcal{P}^{m}\right|\approx\delta. From (1) and Lemma 4, we conclude m=\mathcal{O}\left(\delta^{-1}\right), and the total number of quadrature points scales as m^{n}=\mathcal{O}\left(\delta^{-n}\right). It follows from (5) that the total number of neurons is N=\mathcal{O}\left(k\delta^{-n}\right). Conversely, for a given number of neurons N, the error from the first two terms scales as \delta=\mathcal{O}\left(N^{-1/n}\right), which is quite pessimistic. This is due to the tensor-product quadrature rule. The result can be improved using Monte Carlo estimation of integrals at the expense of losing deterministic estimates. Indeed, let {\boldsymbol{\xi}}^{i}, i=1,\ldots,N, be independent and identically distributed (i.i.d.) according to the uniform distribution on \left[-1,1\right]^{n}. A Monte Carlo counterpart of (5) is given as

𝒮~(m,fθ)(𝒙)=1N_j=1Nf(𝝃j)_=1kα_σ[𝒘θ(𝒙𝝃j)+b_],\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)=\frac{1}{N}\sum_{\_}{j=1}^{N}f\left({\boldsymbol{\xi}}^{j}\right)\sum_{\_}{\ell=1}^{k}\alpha_{\_}\ell\sigma\left[\frac{{\boldsymbol{{w}}}^{\ell}}{\theta}\cdot\left({\boldsymbol{{x}}}-{{\boldsymbol{\xi}}^{j}}\right)+{b}_{\_}\ell\right], (6)

which, by the law of large numbers, converges almost surely (a.s.) to fθ(𝒙)f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right) for every 𝒙{\boldsymbol{{x}}}. However, the Monte Carlo mean square error estimate

𝔼_𝝃1,,𝝃N[|𝒮~(m,fθ)(𝒙)fθ(𝒙)|2]=𝒪(N1)\mathbb{E}_{\_}{{\boldsymbol{\xi}}^{1},\ldots,{\boldsymbol{\xi}}^{N}}\left[\left|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right)\right|^{2}\right]=\mathcal{O}\left(N^{-1}\right)

is not useful for our \infty-norm estimate without further technicalities in exchanging expectation and \infty-norm (see, e.g., [RockafellarWetts98, Theorem 14.60] and [71]).

While most, if not all, of the literature concerns deterministic and asymptotic rates, we pursue a non-asymptotic direction as it precisely captures the behavior of the Monte Carlo estimate (6) of f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right), and hence of f\left({\boldsymbol{{x}}}\right). Within the non-asymptotic setting, we now show that, with high probability, any neural network with the nAI property and kN neurons converges with rate N^{-\frac{1}{2}} to any Lipschitz continuous function f\left({\boldsymbol{{x}}}\right) with compact support in \left[-1,1\right]^{n}.
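Before stating the result, the following one-dimensional sketch (ours, not the paper's construction) shows what the Monte Carlo network (6) looks like for the standard sigmoid. Consistent with the telescoping identity in the proof of Lemma 9 and with \int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h, we take \mathcal{B}\left({x}\right)=\frac{1}{2}\left[\sigma\left({x}+1\right)-\sigma\left({x}-1\right)\right] (so k=2 neurons per sample), dilate it as \mathcal{B}_{\theta}\left({x}\right)=\theta^{-1}\mathcal{B}\left({x}/\theta\right), and include explicitly the sampling-volume factor 2 of the uniform density on \left[-1,1\right], which the analysis absorbs into the \mathcal{O}\left(\cdot\right) notation. The target f, the choices of N and \theta, and the helper names are ours.

import math, random

def s(x):
    # standard sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def f(x):
    # a Lipschitz target with compact support in [-1, 1]
    return max(0.0, 1.0 - x * x)

def network(x, xi, theta):
    # k = 2 sigmoid neurons per sample: weights 1/theta, biases +1 and -1
    out = 0.0
    for xj in xi:
        z = (x - xj) / theta
        out += f(xj) * (s(z + 1.0) - s(z - 1.0)) / (2.0 * theta)
    return 2.0 * out / len(xi)   # 2 = |[-1, 1]|, the sampling volume

random.seed(1)
N, theta = 20_000, 0.05
xi = [random.uniform(-1.0, 1.0) for _ in range(N)]
for x in (-0.5, 0.0, 0.5):
    print(x, f(x), network(x, xi, theta))   # the two columns agree up to Monte Carlo noise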

Theorem 9.

Let {\boldsymbol{\xi}}^{i}, i=1,\ldots,N, be i.i.d. samples from the uniform distribution on \left[-1,1\right]^{n}. Suppose that f\left({\boldsymbol{{x}}}\right) is Lipschitz and \mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right) is continuous. Furthermore, let \delta=\mathcal{O}\left(\frac{C}{\sqrt{N}}\right) and \theta=\mathcal{O}\left(\delta^{2}\right) for some C>0. Then there exist three absolute constants \alpha, \ell, and L such that

𝒮~(m,fθ)(𝒙)f(𝒙)_CN\left\|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f\left({\boldsymbol{{x}}}\right)\right\|_{\_}\infty\leq\frac{C}{\sqrt{N}} (7)

holds with probability at least 1-2e^{-2\frac{\left(C-\alpha\sqrt{N}\right)^{2}}{\left(L-\ell\right)^{2}}}.

Proof.

If we define

gj(𝒙):=f(𝝃j)θ(𝒙𝝃j)g^{j}\left({\boldsymbol{{x}}}\right):=f\left({\boldsymbol{\xi}}^{j}\right)\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}-{\boldsymbol{\xi}}^{j}\right)

then gj(𝒙)g^{j}\left({\boldsymbol{{x}}}\right), j=1,,Nj=1,\ldots,N, are independent random variables and 𝒮~(m,fθ)(𝒙)=1N_j=1Ngj(𝒙)\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)=\frac{1}{N}\sum_{\_}{j=1}^{N}g^{j}\left({\boldsymbol{{x}}}\right). Owing to the continuity of ff and θ\mathcal{B}_{\theta}, we know that there exist two absolute constants \ell and LL such that gj(𝒙)L\ell\leq g^{j}\left({\boldsymbol{{x}}}\right)\leq L. Let h(𝒙):=|𝒮~(m,fθ)(𝒙)fθ(𝒙)|h\left({\boldsymbol{{x}}}\right):=\left|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right)\right|, then by Hoeffding inequality [72, 73] we have

[h(𝒙)>ε]2e2Nε2(L)2\mathbb{P}\left[h\left({\boldsymbol{{x}}}\right)>\varepsilon\right]\leq 2e^{-2N\frac{\varepsilon^{2}}{\left(L-\ell\right)^{2}}} (8)

for each {\boldsymbol{{x}}}, where \mathbb{P}\left[\cdot\right] stands for probability. An application of the triangle inequality gives

|h(𝒙)h(𝒚)|2(L+sup_𝒙[1,1]nfθ(𝒙))=:α,\left|h\left({\boldsymbol{{x}}}\right)-h\left({\boldsymbol{{y}}}\right)\right|\leq 2\left(L+\sup_{\_}{{\boldsymbol{{x}}}\in\left[-1,1\right]^{n}}f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right)\right)=:\alpha,

where \alpha is finite since f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right) is continuous on the compact set \left[-1,1\right]^{n}. Thus, for a given {\boldsymbol{{y}}}, there holds

h(𝒚)h(𝒙)α𝒙[1,1]n,h\left({\boldsymbol{{y}}}\right)\geq h\left({\boldsymbol{{x}}}\right)-\alpha\quad\forall{\boldsymbol{{x}}}\in\left[-1,1\right]^{n},

that is,

h(𝒚)h(𝒙)_α.h\left({\boldsymbol{{y}}}\right)\geq\left\|h\left({\boldsymbol{{x}}}\right)\right\|_{\_}\infty-\alpha.

Consequently, using the tail bound (8) yields

\mathbb{P}\left[\left\|h\left({\boldsymbol{{x}}}\right)\right\|_{\infty}>\frac{C}{\sqrt{N}}\right]\leq\mathbb{P}\left[h\left({\boldsymbol{{y}}}\right)>\frac{C}{\sqrt{N}}-\alpha\right]\leq 2e^{-2\frac{\left(C-\alpha\sqrt{N}\right)^{2}}{\left(L-\ell\right)^{2}}}.

It follows that

𝒮~(m,fθ)(𝒙)fθ(𝒙)_CN\left\|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right)\right\|_{\_}\infty\leq\frac{C}{\sqrt{N}} (9)

with probability at least 1-2e^{-2\frac{\left(C-\alpha\sqrt{N}\right)^{2}}{\left(L-\ell\right)^{2}}} for any C. Clearly, we need to pick either C\ll\alpha\sqrt{N} or C\gg\alpha\sqrt{N}. The former is more favorable as it makes the error in (9) small with high probability, whereas the latter could lead to a large error with high probability. Now, choosing \delta=\mathcal{O}\left(\frac{C}{\sqrt{N}}\right) and following the proof of Lemma 5 we arrive at

𝒮~(m,fθ)(𝒙)f(𝒙)_=𝒪(CN+𝒯(θ,CN)).\left\|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f\left({\boldsymbol{{x}}}\right)\right\|_{\_}\infty=\mathcal{O}\left(\frac{C}{\sqrt{N}}+\mathcal{T}\left(\mathcal{B}_{\theta},\frac{C}{\sqrt{N}}\right)\right).

We now estimate 𝒯(θ,CN)=_𝒚>CN|θ(𝒚)|d𝒚\mathcal{T}\left(\mathcal{B}_{\theta},\frac{C}{\sqrt{N}}\right)=\int_{\_}{\left\|{\boldsymbol{{y}}}\right\|>\frac{C}{\sqrt{N}}}\left|\mathcal{B}_{\theta}\left({\boldsymbol{{y}}}\right)\right|\,d{\boldsymbol{{y}}}. By Markov inequality we have

𝒯(θ,CN)θNC,\mathcal{T}\left(\mathcal{B}_{\theta},\frac{C}{\sqrt{N}}\right)\leq\frac{\theta\sqrt{N}}{C},

where we have used the fact that (𝒙)_1(n)=1\left\|\mathcal{B}\left({\boldsymbol{{x}}}\right)\right\|_{\_}{\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)}=1. Now taking θ=𝒪(δ2)=𝒪(C2N)\theta=\mathcal{O}\left(\delta^{2}\right)=\mathcal{O}\left(\frac{C^{2}}{N}\right) yields

𝒯(θ,CN)=𝒪(CN),\mathcal{T}\left(\mathcal{B}_{\theta},\frac{C}{\sqrt{N}}\right)=\mathcal{O}\left(\frac{C}{\sqrt{N}}\right),

and this concludes the proof. ∎

Remark 9.

The continuity of \mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right) in Theorem 9 is only sufficient; all we need is the boundedness of \mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right). Theorem 9 is thus valid for all activation functions that we have discussed, including those in Section 8. Theorem 9 provides not only theoretical insights into the required number of neurons but also a guide on how to choose the number of neurons to achieve a certain accuracy with a controllable success probability. Perhaps more importantly, it shows that, with non-zero probability, a neural network may fail to provide a desired testing error in practice once an architecture, and hence a number of neurons, is selected.
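As a simple illustration of this remark, the sketch below (ours) evaluates the failure-probability bound of Theorem 9 for a fixed target accuracy \varepsilon=C/\sqrt{N} and several N; the constants \alpha, L, and \ell are made up, since in practice they depend on f, the activation, and \theta.

import math

alpha, L, ell = 2.0, 1.0, -1.0        # hypothetical absolute constants
eps = 0.05                            # target sup-norm accuracy, so C = eps * sqrt(N)
for N in (10**2, 10**3, 10**4, 10**5):
    C = eps * math.sqrt(N)
    fail = 2.0 * math.exp(-2.0 * (C - alpha * math.sqrt(N)) ** 2 / (L - ell) ** 2)
    print(N, C, min(1.0, fail))       # the failure-probability bound decays rapidly with N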

10 Conclusions

We have presented a constructive framework for neural network universality. At the heart of the framework is the neural network approximate identity (nAI) concept that allows us to unify most activations under the same umbrella. Indeed, we have shown that most existing activations are nAI, and thus universal in the space of continuous functions on compacta. We have also shown that for an activation to be an nAI, it is sufficient to verify that its kth derivative, k\geq 0, is essentially bounded and integrable. The framework induces several advantages over contemporary approaches. First, our approach is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first attempt to unify the universality for most existing activation functions. Third, as a by-product, the framework provides the first universality proof for some of the existing activation functions, including Mish, SiLU, ELU, and GELU. Fourth, it provides a new universality proof for most activation functions. Fifth, it discovers new activation functions with a guaranteed universality property. Sixth, for each activation, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons, and the values of the weights/biases. Seventh, the framework enables us to develop the first abstract universality result with a favorable non-asymptotic rate of N^{-\frac{1}{2}}, where N is the number of neurons. Our framework also provides insights into the derivations of some of the existing approaches. Ongoing work is to build upon our framework to study the universal approximation properties of convolutional neural networks and deep neural networks. Part of the future work is also to exploit the unified nature of the framework to study which activation is better, in which sense, for a wide range of classification and regression tasks.

Funding

This work is partially supported by National Science Foundation awards NSF-OAC-2212442, NSF-2108320, NSF-1808576, and NSF-CAREER-1845799; by Department of Energy awards DE-SC0018147 and DE-SC0022211; and by a 2021 UT-Portugal CoLab award, and we are grateful for the support. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. National Science Foundation or the U.S. Department of Energy or the United States Government. We would like to thank Professor Christoph Schwab and Professor Ian Sloan for pointing out a technical mistake in the preprint.

Appendix A Extension to 𝒞[1,1]n\mathcal{C}\left[-1,1\right]^{n}

In this section, we present an approach to extend the results in Sections 5 and 9 to continuous functions. To build directly upon the results in those sections, without loss of generality we consider \mathcal{C}\left[-b,b\right]^{n} with 0<\sqrt{2}b<1. Consider g\in\mathcal{C}\left[-b,b\right]^{n} and let \hat{g}\in\mathcal{C}\left(\mathbb{R}^{n}\right) be an extension of g, that is, \left.\hat{g}\right\rvert_{\left[-b,b\right]^{n}}=g and \left\|\hat{g}\right\|_{\infty}=\left\|g\right\|_{\infty} [74]. Next, let us take \varphi\in\mathcal{C}_{b}\left[-1,1\right]^{n} such that: i) \left.\varphi\right\rvert_{\left[-b,b\right]^{n}}=1, and ii) \omega\left(\varphi,h\right)=\mathcal{O}\left(\omega\left(g,h\right)\right) for any h>0 (a concrete choice of \varphi is sketched after the list below). Define f:=\hat{g}\varphi; then it is easy to see that:

  • f|_[b,b]n=g\left.f\right\rvert_{\_}{\left[-b,b\right]^{n}}=g,

  • f𝒞_c[1,1]nf\in\mathcal{C}_{\_}c\left[-1,1\right]^{n},

  • f_=𝒪(g_)\left\|f\right\|_{\_}\infty=\mathcal{O}\left(\left\|g\right\|_{\_}\infty\right), and

  • ω(f,h)=𝒪(ω(g,h))\omega\left(f,h\right)=\mathcal{O}\left(\omega\left(g,h\right)\right).
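For concreteness, one admissible choice of \varphi (a sketch of ours, not specified in the original argument, and assuming g is not constant) is the tensor-product piecewise-linear bump

\varphi\left({\boldsymbol{{x}}}\right)=\prod_{i=1}^{n}\psi\left({\boldsymbol{{x}}}_{i}\right),\qquad\psi\left(t\right)=\begin{cases}1,&\left|t\right|\leq b,\\ \frac{1-\left|t\right|}{1-b},&b<\left|t\right|\leq 1,\\ 0,&\left|t\right|>1.\end{cases}

Indeed, \varphi=1 on \left[-b,b\right]^{n}, \varphi vanishes outside \left[-1,1\right]^{n}, and \varphi is Lipschitz, so \omega\left(\varphi,h\right)=\mathcal{O}\left(h\right); since \omega\left(g,\cdot\right) is subadditive, a non-constant g satisfies h=\mathcal{O}\left(\omega\left(g,h\right)\right) for small h, and hence \omega\left(\varphi,h\right)=\mathcal{O}\left(\omega\left(g,h\right)\right).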

Now, applying Lemma 5 we obtain

f(𝒙)𝒮(m,fθ)(𝒙)=𝒪(ω(f,δ)+ω(f,|𝒫m|)+𝒯(θ,δ))=𝒪(ω(g,δ)+ω(g,|𝒫m|)+𝒯(θ,δ))f\left({\boldsymbol{{x}}}\right)-\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)=\mathcal{O}\left(\omega\left(f,\delta\right)+\omega\left(f,\left|\mathcal{P}^{m}\right|\right)+\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)\right)\\ =\mathcal{O}\left(\omega\left(g,\delta\right)+\omega\left(g,\left|\mathcal{P}^{m}\right|\right)+\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)\right)

for all 𝒙n{\boldsymbol{{x}}}\in\mathbb{R}^{n}. Thus, by restricting 𝒙[b,b]n{\boldsymbol{{x}}}\in\left[-b,b\right]^{n} we arrive at

g(𝒙)𝒮(m,fθ)(𝒙)=𝒪(ω(g,δ)+ω(g,|𝒫m|)+𝒯(θ,δ)).g\left({\boldsymbol{{x}}}\right)-\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)=\mathcal{O}\left(\omega\left(g,\delta\right)+\omega\left(g,\left|\mathcal{P}^{m}\right|\right)+\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)\right).

This, together with Lemma 6, ensures the universality of any nAI activation in 𝒞[b,b]n\mathcal{C}\left[-b,b\right]^{n}. The extension for Lipschitz continuous functions on [b,b]n\left[-b,b\right]^{n} for Section 9 follows similarly, again using the key extension results in [74].

Appendix B Proofs of results in Section 7

This section presents the detailed proofs of results in Section 7.

Proof of Lemma 7.

We start by defining X_{r}:=\sum_{i=1}^{r}U_{i}, where the U_{i} are independent and identically distributed uniform random variables on \left[-\frac{1}{2},\frac{1}{2}\right]. Following [75, 76, 77], X_{r} is distributed according to the Irwin-Hall distribution with the probability density function

f_{r}\left({x}\right):=\frac{1}{q!}\sum_{i=0}^{\lfloor{x}+\frac{r}{2}\rfloor}(-1)^{i}\begin{pmatrix}r\\ i\end{pmatrix}\left({x}+\frac{r}{2}-i\right)^{q},

where \lfloor{x}\rfloor denotes the largest integer not exceeding {x}. Using the definition of RePU, it is easy to see that f_{r}\left({x}\right) can also be written in terms of RePU as follows

f_{r}\left({x}\right):=\frac{1}{q!}\sum_{i=0}^{r}(-1)^{i}\begin{pmatrix}r\\ i\end{pmatrix}\text{RePU}\left(q;{x}+\frac{r}{2}-i\right),

which in turns implies

(x,h)=hqf_r(xh).\mathcal{B}\left({x},h\right)=h^{q}f_{\_}r\left(\frac{{x}}{h}\right). (10)

In other words, \mathcal{B}\left({x},h\right) is a dilated version of f_{r}\left({x}\right). Thus, all the properties of f_{r}\left({x}\right) hold for \mathcal{B}\left({x},h\right). In particular, all the assertions of Lemma 7 hold. Note that the compact support can be alternatively shown using a property of central finite differences. Indeed, it is easy to see that for {x}\geq\frac{rh}{2} we have

δhr[RePU](x)=δhr[xq]=0.\delta_{h}^{r}\left[\text{RePU}\right]\left({x}\right)=\delta_{h}^{r}\left[{x}^{q}\right]=0.
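The RePU representation of the Irwin-Hall density above is straightforward to cross-check numerically. The sketch below (ours; plain Python, with arbitrary sample size and bin width) compares the closed form f_{r} against a Monte Carlo histogram of X_{r}=\sum_{i=1}^{r}U_{i}.

import math, random

def repu(q, x):
    return x ** q if x > 0.0 else 0.0

def f_r(r, x):
    # Irwin-Hall density of X_r via RePU, with q = r - 1
    q = r - 1
    return sum((-1) ** i * math.comb(r, i) * repu(q, x + r / 2.0 - i)
               for i in range(r + 1)) / math.factorial(q)

random.seed(0)
r, samples, width = 4, 200_000, 0.2
for x in (-1.5, -0.5, 0.0, 0.5, 1.5):
    hits = sum(1 for _ in range(samples)
               if abs(sum(random.uniform(-0.5, 0.5) for _ in range(r)) - x) < width / 2)
    print(x, f_r(r, x), hits / (samples * width))   # the two columns should agree closely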

Proof of Theorem 1.

The first three assertions are direct consequences of Lemma 7. For the fourth assertion, since 𝔅(𝒙)0\mathfrak{B}\left({\boldsymbol{{x}}}\right)\geq 0 it is sufficient to show n𝔅(𝒙)𝑑𝒙1\int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}\leq 1 and we do so in three steps. Let r=q+1r=q+1 and define 𝒙=(𝒙_n+1,𝒙_n,,𝒙_1)=(𝒙_n+1,𝒚){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{x}}}_{\_}n,\ldots,{\boldsymbol{{x}}}_{\_}1\right)=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{y}}}\right). We first show by induction that (x,1)1\mathcal{B}\left({x},1\right)\leq 1 for qq\in\mathbb{N} and x{x}\in\mathbb{R}. The claim is clearly true for q={0,1}q=\left\{0,1\right\}. Suppose the claim holds for qq, then (10) implies

f_r(x)1,x.f_{\_}{r}\left({x}\right)\leq 1,\quad\forall{x}\in\mathbb{R}.

For q+1q+1 we have

(0)=f_r+1(0)=f_rf_1(0)=_f_r(y)f_1(y)𝑑y_f_1(y)𝑑y=1.\mathcal{B}\left(0\right)=f_{\_}{r+1}\left(0\right)=f_{\_}{r}*f_{\_}1\left(0\right)=\int_{\_}\mathbb{R}f_{\_}{r}\left(-{y}\right)f_{\_}1\left({y}\right)\,d{y}\leq\int_{\_}\mathbb{R}f_{\_}1\left({y}\right)\,d{y}=1.

By the second assertion, we conclude that (x,1)1\mathcal{B}\left({x},1\right)\leq 1 for any qq\in\mathbb{N} and x{x}\in\mathbb{R}.

In the second step, we show 𝔅(𝒙)1\mathfrak{B}\left({\boldsymbol{{x}}}\right)\leq 1 by induction on nn for any 𝒙n{\boldsymbol{{x}}}\in\mathbb{R}^{n} and any qq\in\mathbb{N}. The result holds for n=1n=1 due to the first step. Suppose the claim is true for nn. For n+1n+1, we have

𝔅(𝒙)=(𝒙_n+1,𝔅(𝒚))=[𝔅(𝒚)]qf_r(𝒙_n+1𝔅(𝒚))1,\mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)=\left[\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right]^{q}f_{\_}r\left(\frac{{\boldsymbol{{x}}}_{\_}{n+1}}{\mathfrak{B}\left({\boldsymbol{{y}}}\right)}\right)\leq 1,

where we have used (10) in the last equality, and the first step together with the induction hypothesis in the last inequality.

In the last step, we show \int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}\leq 1 by induction on n. For n=1, \int_{\mathbb{R}}\mathfrak{B}\left({x}\right)\,d{x}=1 is clear by the Irwin-Hall probability density function. Suppose the result is true for n. For n+1, applying the Fubini theorem gives

_n+1𝔅(𝒙)𝑑𝒙=n(_(𝒙_n+1,𝔅(𝒚))𝑑𝒙_n+1)𝑑𝒚=n[𝔅(𝒚)]q(_f_r(𝒙_n+1𝔅(𝒚))𝑑𝒙_n+1)𝑑𝒚=n[𝔅(𝒚)]r(_f_r(t)𝑑t)𝑑𝒚=n[𝔅(𝒚)]r𝑑𝒚n𝔅(𝒚)𝑑𝒚1,\int_{\_}{\mathbb{R}^{n+1}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}=\int_{\mathbb{R}^{n}}\left(\int_{\_}{\mathbb{R}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}\\ =\int_{\mathbb{R}^{n}}\left[\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right]^{q}\left(\int_{\_}{\mathbb{R}}f_{\_}{r}\left(\frac{{\boldsymbol{{x}}}_{\_}{n+1}}{\mathfrak{B}\left({\boldsymbol{{y}}}\right)}\right)\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}=\int_{\mathbb{R}^{n}}\left[\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right]^{r}\left(\int_{\_}{\mathbb{R}}f_{\_}{r}\left(t\right)\,dt\right)\,d{\boldsymbol{{y}}}\\ =\int_{\mathbb{R}^{n}}\left[\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right]^{r}\,d{\boldsymbol{{y}}}\leq\int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{y}}}\right)\,d{\boldsymbol{{y}}}\leq 1,

where we have used (10) in the second equality, the result of the second step in the second last inequality, and the induction hypothesis in the last inequality. ∎

Proof of Lemma 8.

The proof for \mathcal{B}_{s}\left({x},h\right) is a simple extension of those in [63, 33], and the proof for the hyperbolic tangent follows similarly. Note that \int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h for the sigmoid and the hyperbolic tangent. For \mathcal{B}_{p}\left({x},h\right), due to the global convexity of \sigma_{t}\left({x}\right) we have

ln(1+ex)ln(eh+ex)+ln(eh+ex)2,x,\ln\left(1+e^{{x}}\right)\leq\frac{\ln\left(e^{h}+e^{{x}}\right)+\ln\left(e^{-h}+e^{{x}}\right)}{2},\quad\forall{x}\in\mathbb{R},

which is equivalent to \mathcal{B}_{p}\left({x},h\right)\geq 0 for {x}\in\mathbb{R}. The facts that \mathcal{B}_{p}\left({x},h\right) is even and that \lim_{\left|{x}\right|\to\infty}\mathcal{B}_{p}\left({x},h\right)=0 are obvious by inspection. Since the derivative of \mathcal{B}_{p}\left({x},h\right) is negative for {x}\in(0,\infty) and \mathcal{B}_{p}\left({x},h\right) is even, \mathcal{B}_{p}\left({x},h\right) is unimodal. It follows that \mathcal{B}_{p}\left(0,h\right)=\max_{{x}\in\mathbb{R}}\mathcal{B}_{p}\left({x},h\right).

Next integrating by parts gives

__p(x,h)dx=2_0(eh1)2(1ex)(eh+ex)(eh+ex)(1+ex)exx𝑑x(1eh1+eh)2_0exx𝑑x=(1eh1+eh)2min{1,h2}.\int_{\_}{-\infty}^{\infty}\mathcal{B}_{\_}p\left({x},h\right)\,d{x}=2\int_{\_}{0}^{\infty}\frac{\left(e^{h}-1\right)^{2}\left(1-e^{-{x}}\right)}{\left(e^{h}+e^{-x}\right)\left(e^{-h}+e^{-x}\right)\left(1+e^{-x}\right)}e^{-{x}}{x}\,d{x}\\ \leq\left(\frac{1-e^{-h}}{1+e^{-h}}\right)^{2}\int_{\_}{0}^{\infty}e^{-{x}}{x}\,d{x}=\left(\frac{1-e^{-h}}{1+e^{-h}}\right)^{2}\leq\min\left\{1,h^{2}\right\}. (11)

Thus all the assertions for \mathcal{B}_{p}\left({x},h\right) hold. ∎

Proof of Theorem 2.

We only need to show n𝔅(𝒙)𝑑𝒙1\int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}\leq 1 as the proof for other assertions is similar to that of Theorem 1, and thus is omitted. For sigmoid and hyperbolic tangent the result is clear as

_n𝔅(𝒙)𝑑𝒙=_n(𝒙_n,(𝒙_n1,,(𝒙_1,1)))𝑑𝒙=_n1(𝒙_n1,,(𝒙_1,1))𝑑𝒙_n1d𝒙_1==_(𝒙_1,1)𝑑𝒙_1=1.\int_{\_}{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}=\int_{\_}{\mathbb{R}^{n}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n},\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right)\right)\,d{\boldsymbol{{x}}}\\ =\int_{\_}{\mathbb{R}^{n-1}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right)\,d{\boldsymbol{{x}}}_{\_}{n-1}\ldots d{\boldsymbol{{x}}}_{\_}1=\ldots=\int_{\_}{\mathbb{R}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)d{\boldsymbol{{x}}}_{\_}1=1.

For the softplus function, by inspection, \mathcal{B}\left(\boldsymbol{{z}},h\right)\leq\mathcal{B}\left(0,h\right)\leq 1 for all \boldsymbol{{z}}\in\mathbb{R} and 0<h\leq 1. Lemma 8 gives \int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}\leq 1 for n=1. Define {\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{n+1},{\boldsymbol{{x}}}_{n},\ldots,{\boldsymbol{{x}}}_{1}\right)=\left({\boldsymbol{{x}}}_{n+1},{\boldsymbol{{y}}}\right) and suppose the claim holds for n, i.e., \int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{y}}}\right)\,d{\boldsymbol{{y}}}\leq 1. Now, applying (11), we have

_n+1𝔅(𝒙)𝑑𝒙=n(_(𝒙_n+1,𝔅(𝒚))𝑑𝒙_n+1)𝑑𝒚n𝔅(𝒚)𝑑𝒚1,\int_{\_}{\mathbb{R}^{n+1}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}=\int_{\mathbb{R}^{n}}\left(\int_{\_}{\mathbb{R}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}\leq\int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{y}}}\right)\,d{\boldsymbol{{y}}}\leq 1,

which ends the proof by induction. ∎

Proof of Lemma 9.

The first assertion is clear by Definition 3. For the second assertion, with a telescoping trick similar to that in [63] we have

s_N(x)=_k=NN(x+kh,h)=12(L)[σ(x+(N+1)h)+σ(x+Nh)σ(xNh)σ(x(N+1)h)]Nsign(h).s_{\_}N\left({x}\right)=\sum_{\_}{k=-N}^{N}\mathcal{B}\left({x}+kh,h\right)=\frac{1}{2\left(L-\ell\right)}\left[\sigma\left({x}+\left(N+1\right)h\right)+\sigma\left({x}+Nh\right)\right.\\ -\left.\sigma\left({x}-Nh\right)-\sigma\left({x}-\left(N+1\right)h\right)\right]\xrightarrow[]{N\to\infty}\text{sign}\left(h\right).

To show the convergence is uniform, we consider only h>0h>0 as the proof for h<0h<0 is similar and for h=0h=0 is obvious. We first consider the right tail. For sufficiently large NN, there exists a constant C>0C>0 such that

\sum_{k=N}^{\infty}\mathcal{B}\left({x}+kh,h\right)\leq C\sum_{k=N}^{\infty}\left({x}+kh\right)^{-1-\alpha}=C\sum_{k=N}^{\infty}\int_{k-1}^{k}\left({x}+kh\right)^{-1-\alpha}\,dy\leq C\int_{N-1}^{\infty}\left({x}+yh\right)^{-1-\alpha}\,dy=\frac{C}{h\alpha}\left[{x}+\left(N-1\right)h\right]^{-\alpha}\leq\frac{C}{h\alpha}\left[m+\left(N-1\right)h\right]^{-\alpha}\xrightarrow[]{N\to\infty}0\text{ independent of }{x}.

Similarly, the left tail converges to 0 uniformly as

\sum_{k=N}^{\infty}\mathcal{B}\left({x}-kh,h\right)\leq\frac{C}{h\alpha}\left[\left(N-1\right)h-M\right]^{-\alpha}\xrightarrow[]{N\to\infty}0\text{ independent of }{x}.

As a consequence, we have

_(x,h)𝑑x=_k=_k|h|(k+1)|h|(x,h)𝑑x=_k=_0|h|(y+k|h|,h)𝑑y=_0|h|_k=(y+k|h|,h)dy=h,\int_{\_}{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=\sum_{\_}{k=-\infty}^{\infty}\int_{\_}{k\left|h\right|}^{\left(k+1\right)\left|h\right|}\mathcal{B}\left({x},h\right)\,d{x}=\sum_{\_}{k=-\infty}^{\infty}\int_{\_}{0}^{\left|h\right|}\mathcal{B}\left(y+k\left|h\right|,h\right)\,dy\\ =\int_{\_}{0}^{\left|h\right|}\sum_{\_}{k=-\infty}^{\infty}\mathcal{B}\left(y+k\left|h\right|,h\right)\,dy=h,

where we have used the uniform convergence in the third equality.

Using the first assertion we have

_|(x,h)|𝑑x=_x|h|x++|h||(x,h)|dx+_x++|h||(x,h)|dx+_x|h||(x,h)|dx(x_px_n+2|h|)+C_x++|h|(x+|h|)1αdx+C_x|h|(x+|h|)1αdx=(x_px_n+2|h|)+Cα[(x++2|h|)α+(x+2|h|)α]<.\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}=\int_{\_}{{x}^{-}-\left|h\right|}^{{x}^{+}+\left|h\right|}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}+\int_{\_}{{x}^{+}+\left|h\right|}^{\infty}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}+\int_{\_}{-\infty}^{{x}^{-}-\left|h\right|}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\\ \leq\left({x}_{\_}p-{x}_{\_}n+2\left|h\right|\right)+C\int_{\_}{{x}^{+}+\left|h\right|}^{\infty}\left({x}+\left|h\right|\right)^{-1-\alpha}\,d{x}+C\int_{\_}{-\infty}^{{x}^{-}-\left|h\right|}\left(-{x}+\left|h\right|\right)^{-1-\alpha}\,d{x}\\ =\left({x}_{\_}p-{x}_{\_}n+2\left|h\right|\right)+\frac{C}{\alpha}\left[\left({x}^{+}+2\left|h\right|\right)^{-\alpha}+\left(-{x}^{-}+2\left|h\right|\right)^{-\alpha}\right]<\infty.

Proof of Theorem 3.

The proof is the same as the proof of Theorem 2 for the standard sigmoidal unit, and thus is omitted. ∎

Proof of Lemma 10.

The expression of \mathcal{B}\left({x},h\right) and direct integration give \int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=\frac{h^{2}}{\gamma}. Similarly, simple algebraic manipulations yield

γ_|(x,h)|𝑑xh2+2|α||h|+2|α|(e|h|1)2<,\gamma\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq h^{2}+2\left|\alpha\right|\left|h\right|+2\left|\alpha\right|\left(e^{-\left|h\right|}-1\right)^{2}<\infty,

and thus (x,h)1()\mathcal{B}\left({x},h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right). The fact that (x,h)1\mathcal{B}\left({x},h\right)\leq 1 for |h|1\left|h\right|\leq 1 holds is straightforward by inspecting the extrema of (x,h)\mathcal{B}\left({x},h\right). ∎

Proof of Theorem 4.

From the proof of Lemma 10 we infer that _|(x,h)|𝑑x|h|\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\left|h\right| for all |h|1\left|h\right|\leq 1: in particular, _|(𝒙_1,1)|𝑑x1\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right|\,d{x}\leq 1. Define 𝒙=(𝒙_n+1,𝒙_n,,𝒙_1)=(𝒙_n+1,𝒚){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{x}}}_{\_}n,\ldots,{\boldsymbol{{x}}}_{\_}1\right)=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{y}}}\right) and suppose the claim holds for nn, i.e., n|𝔅(𝒚)|𝑑𝒚1\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right|\,d{\boldsymbol{{y}}}\leq 1. We have

_n+1|𝔅(𝒙)|𝑑𝒙=n(_|(𝒙_n+1,𝔅(𝒚))|𝑑𝒙_n+1)𝑑𝒚n|𝔅(𝒚)|𝑑𝒚1,\int_{\_}{\mathbb{R}^{n+1}}\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\,d{\boldsymbol{{x}}}=\int_{\mathbb{R}^{n}}\left(\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)\right|\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}\leq\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right|\,d{\boldsymbol{{y}}}\leq 1,

which, by induction, concludes the proof. ∎

Proof of Lemma 11.

The first assertion is straightforward. For the second assertion, it is sufficient to consider {x}\geq 0 and h>0. Any root of \mathcal{B}\left({x},h\right) satisfies the following equation

f(x):=Φ(x+h)Φ(x)Φ(x)Φ(xh)=xhx+h=:g(x).f\left({x}\right):=\frac{\Phi\left({x}+h\right)-\Phi\left({x}\right)}{\Phi\left({x}\right)-\Phi\left({x}-h\right)}=\frac{{x}-h}{{x}+h}=:g\left({x}\right).

Since, given h, f\left({x}\right) is a positive decreasing function and g\left({x}\right) is an increasing function (starting from -1), they can have at most one intersection. Now, for sufficiently large {x}, using the mean value theorem it can be shown that

f(x)exhx0, and g(x)x1,f\left({x}\right)\approx e^{-{x}h}\xrightarrow[]{{x}\to\infty}0,\text{ and }g\left({x}\right)\xrightarrow[]{{x}\to\infty}1,

from which we deduce that there is only one intersection, and hence only one positive root {x}^{*} of \mathcal{B}\left({x},h\right). Next, we notice g\left({x}\right)\leq 0<f\left({x}\right) for 0\leq{x}\leq h. For h\geq 1 it is easy to see f\left(2h\right)<g\left(2h\right)=1/3. Thus, h<{x}^{*}<2h for h\geq 1. For 0<h<1, by inspection we have f\left(2\right)<g\left(2\right), and hence h<{x}^{*}<2. In conclusion, h<{x}^{*}<\max\left\{2,2h\right\}.

The preceding argument also shows that \mathcal{B}\left({x},h\right)\geq 0 for 0\leq{x}\leq{x}^{*}, \mathcal{B}\left({x},h\right)<0 for {x}>{x}^{*}, and \mathcal{B}\left({x},h\right)\xrightarrow[]{{x}\to\infty}0^{-}. Using the Taylor formula together with the bound \left|\sigma^{(2)}\left({x}\right)\right|\leq\frac{2}{\sqrt{2\pi}} for GELU, we have

\left|\mathcal{B}\left({x},h\right)\right|\leq\frac{2}{\sqrt{2\pi}}h^{2}.

For the third assertion, it is easy to verify the following indefinite integral (ignoring the integration constant as it will be canceled out)

xΦ(x)𝑑x=12(x21)Φ(x)+xex2222π,\int{x}\Phi\left({x}\right)\,d{x}={\frac{1}{2}}\left({x}^{2}-1\right)\Phi\left({x}\right)+\frac{xe^{{-\frac{{x}^{2}}{2}}}}{2\sqrt{2\pi}},

which, together with simple calculations, yields

_(x,h)𝑑x=h2.\int_{\_}{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h^{2}.

From the proof of the second assertion and the Taylor formula we can show

_|(x,h)|𝑑x=2_0x(x,h)𝑑x2_x(x,h)𝑑x=3h2+2δh2[((x)21)Φ(x)]+δh2[xe(x)222π]3710h2.\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}=2\int_{\_}{0}^{{x}^{*}}\mathcal{B}\left({x},h\right)\,d{x}-2\int_{\_}{{x}^{*}}^{\infty}\mathcal{B}\left({x},h\right)\,d{x}\\ =-3h^{2}+2\delta_{h}^{2}\left[\left(\left({x}^{*}\right)^{2}-1\right)\Phi({x}^{*})\right]+\delta_{h}^{2}\left[\frac{{x}^{*}e^{{-\frac{\left({x}^{*}\right)^{2}}{2}}}}{\sqrt{2\pi}}\right]\leq\frac{37}{10}h^{2}.

Thus, (x,h)1()\mathcal{B}\left({x},h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right). ∎

Proof of Theorem 5.

From the proof of Lemma 11 we have _|(x,h)|𝑑xCh2\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq Ch^{2} and |(x,h)|Mh2\left|\mathcal{B}\left({x},h\right)\right|\leq Mh^{2} for all h{h}\in\mathbb{R} where M=22πM=\frac{2}{\sqrt{2\pi}}, C=3710C=\frac{37}{10}: in particular, _|(𝒙_1,1)|𝑑xC\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right|\,d{x}\leq C. It is easy to see by induction that

|𝔅(𝒙)|M2n1.\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\leq M^{2^{n}-1}.

We claim that

_n|𝔅(𝒙)|𝑑𝒙CnM2nn1,\int_{\_}{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\,d{\boldsymbol{{x}}}\leq C^{n}M^{2^{n}-n-1},

and thus 𝔅(𝒙)1(n)\mathfrak{B}\left({\boldsymbol{{x}}}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right). We prove the claim by induction. Clearly it holds for n=1n=1. Suppose it is valid for nn. Define 𝒙=(𝒙_n+1,𝒙_n,,𝒙_1)=(𝒙_n+1,𝒚){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{x}}}_{\_}n,\ldots,{\boldsymbol{{x}}}_{\_}1\right)=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{y}}}\right), we have

_n+1|𝔅(𝒙)|𝑑𝒙=n(_|(𝒙_n+1,𝔅(𝒚))|𝑑𝒙_n+1)𝑑𝒚Cn|𝔅(𝒚)|2𝑑𝒚CM2n1n|𝔅(𝒚)|𝑑𝒚Cn+1M2n+1n2,\int_{\_}{\mathbb{R}^{n+1}}\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\,d{\boldsymbol{{x}}}=\int_{\mathbb{R}^{n}}\left(\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)\right|\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}\leq C\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right|^{2}\,d{\boldsymbol{{y}}}\\ \leq CM^{2^{n}-1}\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right|d{\boldsymbol{{y}}}\leq C^{n+1}M^{2^{n+1}-n-2},

where we have used the induction hypothesis in the last inequality, and this ends the proof. ∎

Proof of Lemma 12.

It is easy to see that for x2{x}\geq 2

|σ(2)(x)|12e3x(x2)+8e2x(x1)+8e5x(x+2)+8e4x(x+4),\left|\sigma^{(2)}\left({x}\right)\right|\leq 12e^{-3{x}}\left({x}-2\right)+8e^{-2{x}}\left({x}-1\right)+8e^{-5{x}}\left({x}+2\right)+8e^{-4{x}}\left({x}+4\right),

and

|σ(2)(x)|12e3x(2x)+8e4x(1x)8ex(x+2)8e2x(x+4),\left|\sigma^{(2)}\left({x}\right)\right|\leq 12e^{3{x}}\left(2-{x}\right)+8e^{4{x}}\left(1-{x}\right)-8e^{{x}}\left({x}+2\right)-8e^{2{x}}\left({x}+4\right),

for {x}\leq-4. That is, both the right and the left tails of \sigma^{(2)}\left({x}\right) decay exponentially; together with the continuity of \sigma^{(2)}\left({x}\right) on the compact interval \left[-4,2\right], this concludes the proof. ∎

Proof of Theorem 7.

For the first assertion, we note that σ(2)(x)\sigma^{(2)}\left({x}\right) is continuous and thus the following Taylor theorem with integral remainder for σ(x)\sigma\left({x}\right) holds

σ(x+h)=σ(x)+σ(x)h+h2_01σ(2)(x+sh)(1s)𝑑s.\sigma\left({x}+h\right)=\sigma\left({x}\right)+\sigma^{\prime}\left({x}\right)h+h^{2}\int_{\_}{0}^{1}\sigma^{(2)}\left({x}+sh\right)\left(1-s\right)\,ds.

As a result,

|(x,h)|h2[_01|σ(2)(x+sh)|(1s)𝑑s+_01|σ(2)(xsh)|(1s)𝑑s]Mh2,\left|\mathcal{B}\left({x},h\right)\right|\leq h^{2}\left[\int_{\_}{0}^{1}\left|\sigma^{(2)}\left({x}+sh\right)\right|\left(1-s\right)\,ds+\int_{\_}{0}^{1}\left|\sigma^{(2)}\left({x}-sh\right)\right|\left(1-s\right)\,ds\right]\leq Mh^{2},

where we have used the boundedness of σ(2)(x)\sigma^{(2)}\left({x}\right) from Lemma 12 in the last inequality.

For the second assertion, we have

\int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h^{2}\int_{\mathbb{R}}\int_{0}^{1}\sigma^{(2)}\left({x}+sh\right)\left(1-s\right)\,ds\,d{x}+h^{2}\int_{\mathbb{R}}\int_{0}^{1}\sigma^{(2)}\left({x}-sh\right)\left(1-s\right)\,ds\,d{x},

whose right hand side is well-defined owing to σ(2)(x)1()\sigma^{(2)}\left({x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right) (see Lemma 12) and the Fubini theorem. In particular,

\int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}\leq\left\|\sigma^{(2)}\right\|_{\mathcal{L}^{1}\left(\mathbb{R}\right)}h^{2}.

The proof for \int_{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\left\|\sigma^{(2)}\right\|_{\mathcal{L}^{1}\left(\mathbb{R}\right)}h^{2} follows similarly.

The proof of the last assertion is the same as the proof of Theorem 5 and hence is omitted. ∎

Appendix C Figures

This section provides the plots of the \mathcal{B} functions for various activations in 1, 2, and 3 dimensions.

Figure 1 (left) plots \mathcal{B}\left({x},1\right) for the RePU unit with q=\left\{0,1,3,6,9\right\} in one dimension. The non-negativity, compact support, \mathcal{B}ell shape, smoothness, and unimodality of \mathcal{B}\left({x},1\right) can be clearly seen. For n=2 we plot in Figure 1 (the three rightmost subfigures) the surfaces of RePU for q=\left\{0,1,5\right\} together with 15 contours to again verify the non-negativity, compact support, \mathcal{B}ell shape, smoothness, and unimodality of \mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({x}_{2},\mathcal{B}\left({x}_{1},1\right)\right). To further confirm these features for n=3, in Figure 2 we plot an isosurface for the RePU unit with q=\left\{0,1,3,5\right\}, and 4 isosurfaces for the case q=4 in Figure 3. Note the supports of the \mathcal{B} functions around the origin: the further away from the origin, the smaller the isosurface values.
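A sketch along the following lines (ours; NumPy and Matplotlib assumed, with arbitrary grid resolutions, contour levels, and output file names) reproduces the flavor of the one- and two-dimensional RePU panels in Figure 1; the three-dimensional isosurfaces require a 3-D plotting tool and are omitted.

import math
import numpy as np
import matplotlib.pyplot as plt

def repu(q, x):
    return x ** q if x > 0.0 else 0.0

def f_r(q, x):
    # Irwin-Hall density via RePU, r = q + 1
    r = q + 1
    return sum((-1) ** i * math.comb(r, i) * repu(q, x + r / 2.0 - i)
               for i in range(r + 1)) / math.factorial(q)

def B(q, x, h):
    # B(x, h) = h^q f_r(x / h), with the convention B(x, 0) = 0
    return 0.0 if h == 0.0 else h ** q * f_r(q, x / h)

xs = np.linspace(-6.0, 6.0, 601)
for q in (0, 1, 3, 6, 9):
    plt.plot(xs, [B(q, x, 1.0) for x in xs], label=f"q = {q}")
plt.legend()
plt.title("One-dimensional RePU B-functions")
plt.savefig("repu_B_1d.png")

q = 1                                 # a two-dimensional example: frakB(x1, x2) = B(x2, B(x1, 1))
grid = np.linspace(-3.0, 3.0, 201)
Z = np.array([[B(q, x2, B(q, x1, 1.0)) for x1 in grid] for x2 in grid])
plt.figure()
plt.contourf(grid, grid, Z, 15)
plt.title("Two-dimensional RePU B-function, q = 1")
plt.savefig("repu_B_2d.png")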

Figure 1: From left to right: One dimensional \mathcal{B}\left({x}\right) functions for RePU with q=\left\{0,1,3,6,9\right\}, and two dimensional \mathfrak{B}\left({\boldsymbol{{x}}}\right) functions for RePU with q=\left\{0,1,5\right\}. The surfaces of \mathfrak{B}\left({\boldsymbol{{x}}}\right) are plotted with 15 contours at 15 values equally spaced from 10^{-6} to \mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right).
Figure 2: From left to right: Three dimensional 𝔅(𝒙)\mathfrak{B}\left({\boldsymbol{{x}}}\right) functions for RePU with q={0,1,3,5}q=\left\{0,1,3,5\right\}. The isosurfaces are plotted for 𝔅(𝒙)=(0,(0,1))×102\mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right)\times 10^{-2}.
Figure 3: Three dimensional \mathfrak{B}\left({\boldsymbol{{x}}}\right) functions for RePU with q=4. The first three isosurfaces (from left to right) are plotted at \mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right)\times\left\{10^{-1},10^{-2},10^{-3}\right\}, and the rightmost subfigure shows half of the isosurfaces at \mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right)\times\left\{10^{-1},10^{-2},10^{-3},10^{-4}\right\}, where red is the isosurface at \mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right)\times 10^{-4}.

To verify that activation functions in the generalized sigmoidal class behave similarly, as suggested by our unified framework, we plot the \mathcal{B} functions of the standard sigmoid, softplus, and arctangent activation functions in Figure 4 for n=\left\{1,2,3\right\}. As can be seen, the \mathcal{B} functions, though they have different values, share similar shapes. Note that for n=3 we plot the isosurfaces at 5\times 10^{-2} times the largest value of the corresponding \mathcal{B} functions.

Figure 4: From left to right: One dimensional plot, two dimensional surface together with 1515 contours, and a three dimensional isosurface of \mathcal{B} functions. From top row to bottom row: the standard sigmoid, softplus, and arctangent activation functions.

In Figure 5 we present a snapshot of the \mathcal{B} function of the ELU activation function in one, two, and three dimensions. While it shares common features with the other \mathcal{B} functions, such as decaying to zero at infinity, it possesses distinct features, including an asymmetric shape with both positive and negative values.

Figure 5: From left to right: One dimensional plot, two dimensional surface together with 1515 contours, and a three dimensional isosurface of the \mathcal{B} function of ELU activation function.

We have shown that the nAI proof in n dimensions is the same for GELU, SiLU, and Mish. It turns out that their \mathcal{B} functions are also geometrically very similar, as can be seen in Figure 6. Note that for n=3 we plot the isosurfaces at 5\times 10^{-2} times the largest value of the corresponding \mathcal{B} functions.

Figure 6: From left to right: One dimensional plot, two dimensional surface together with 1515 contours, and a three dimensional isosurface of \mathcal{B} functions. From top row to bottom row: GELU, SiLU, and Mish activation functions.

References