A Unified and Constructive Framework for the Universality of Neural Networks

Tan Bui-Thanh [email protected]
Abstract

One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal approximation property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: any nAI activation function is universal. It turns out that most existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework offers several advantages over contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most existing activation functions. Third, as a by-product, the framework provides the first universality proof for some of the existing activation functions, including Mish, SiLU, ELU, and GELU. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons and the values of the weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with a favorable non-asymptotic rate.

keywords:
Universal approximation , neural networks , activation functions , non-asymptotic analysis
MSC:
62M45 , 82C32 , 62-02 , 60-02 , 65-02.
\affiliation

[UT-mech]organization=Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin, city=Austin, state=Texas, country=USA

\affiliation

[Oden]organization=Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, city=Austin, state=Texas, country=USA

Highlights

Unlike existing universality approaches, our framework is constructive using elementary means from functional analysis, probability theory, non-asymptotic analysis, and numerical analysis.

While existing approaches are either technical or specialized for particular activation functions, our framework is the first unified attempt that is valid for most of the existing activation functions and beyond.

The framework provides the first universality proof for some of the existing activation functions, including Mish, SiLU, ELU, and GELU;

The framework provides new proofs for all activation functions

The framework discovers and facilitates the discovery of new activation functions with guaranteed universality property. Indeed, any activation whose $k$th derivative, with $k$ being an integer, is integrable and essentially bounded is universal. In that case, the activation function and all of its $j$th derivatives, $j=1,\ldots,k$, are not only valid activations but also universal.

For a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons, and the values of the weights/biases;

The framework is the first that allows for abstractly presenting the universal approximation with the favorable non-asymptotic rate of $N^{-1/2}$, where $N$ is the number of neurons. This provides not only theoretical insights into the required number of neurons but also guidance on how to choose the number of neurons to achieve a certain accuracy with controllable success probability. Perhaps more importantly, it shows that a neural network may fail, with non-zero probability, to provide the desired testing error in practice when an architecture is selected;

Our framework also provides insights into the development, and hence constructive derivations, of some of the existing approaches.

1 Introduction

The human brain consists of networks of billions of neurons, each of which, roughly speaking, receives information (electrical pulses) from other neurons via dendrites, processes the information in the soma, is activated by a difference in electrical potential, and passes the output along its axon to other neurons through synapses. Attempts to understand the extraordinary ability of the brain to categorize, classify, regress, and process information have inspired numerous scientists to develop computational models that mimic brain functionalities. Perhaps the most well-known is the McCulloch-Pitts model [1], also called the perceptron by Rosenblatt, who extended the McCulloch-Pitts model to networks of artificial neurons capable of learning from data [2]. The question is whether such a network can mimic some of the brain's capabilities, such as learning to classify. The answer lies in the fact that perceptron networks can represent Boolean logic functions exactly. From a mathematical point of view, perceptron networks with Heaviside activation functions compute step functions, linear combinations of which form the space of simple functions, which in turn is dense in the space of measurable functions [3, 4]. That is, linear combinations of perceptrons can approximate a measurable function to any desired accuracy [5, 6]: the first universal approximation result for neural networks. This universal approximation capability partially "explains" why human brains can be trained to learn virtually any task.

For training one-hidden-layer networks, the Widrow-Hoff rule [7] can be used for supervised learning. For many-layer networks, the most popular approach is back-propagation [8], which requires the derivative of the activation function. The Heaviside function is, however, not differentiable in the classical sense. A popular smooth approximation is the standard logistic (also called sigmoidal) activation function, which is infinitely differentiable. The question is now: are neural networks with sigmoidal activation universal? An answer to this question was first presented by Cybenko [9] and Funahashi [10]. The former, though non-constructive (a more constructive proof was then provided in [11] and revisited in [12]), elegantly used the Hahn-Banach theorem to show that sigmoidal neural networks (NNs) with one hidden layer are dense in the space of continuous functions on compacta. The latter used the Fourier transform and an integral formula for integrable functions with bounded variation [13] to show the same result for NNs with continuous, bounded, monotonically increasing sigmoidal activation functions. Recognizing the NN output as an approximate back-projection operator, [14] employed the inverse Radon transform to show the universal approximation property in the space of square integrable functions. Making use of the classical Stone-Weierstrass approximation theorem, [15] successfully proved that NNs with non-decreasing sigmoidal functions are dense in the space of continuous functions over compacta and in the space of measurable functions. Using the separability of the space of continuous functions over compacta, [16] constructed a strictly increasing analytic sigmoidal function that is universal. The work in [17], based on a Kolmogorov theorem, showed that any continuous function on hypercubes can be approximated well with two-hidden-layer NNs of sigmoidal functions.

Though sigmoidal functions are a natural continuous approximation to the Heaviside function, and hence mimic the activation mechanism of neurons, they are not the only ones with the universal approximation property. Indeed, [18] showed, using distribution theory, that NNs with $k$th degree sigmoidal functions are dense in the space of continuous functions over compacta. Meanwhile, [19] designed a cosine squasher activation function so that the output of a one-hidden-layer neural network is a truncated Fourier series and thus can approximate square integrable functions to any desired accuracy. Following and extending [9], [20] showed that NNs are dense in $\mathcal{L}^{p}$ with bounded non-constant activation functions and in the space of continuous functions over compacta with continuous bounded non-constant activation functions. Using the Stone-Weierstrass theorem, [6] showed that any network with activations (e.g., exponential, cosine squasher, modified sigma-pi, modified logistic, and step functions) that can transform products of functions into sums of functions is universal in the space of bounded measurable functions over compacta. The work in [21] provided universal approximation for bounded weights (and biases) with piecewise polynomial and superanalytic activation functions (e.g., sine, cosine, logistic functions, and piecewise continuous functions that are superanalytic at some point with positive radius of convergence).

A one-hidden-layer NN is in fact dense in the space of continuous functions and in $\mathcal{L}^{p}$ if the activation function is not a polynomial almost everywhere [22, 23]. Using Taylor expansion and the Vandermonde determinant, [24] provided an elementary proof of universal approximation for NNs with $\mathcal{C}^{\infty}$ activation functions. Recently, [25] provided a universal approximation theorem for multilayer NNs with ReLU activation functions using a partition of unity and Taylor expansion for functions in Sobolev spaces. Universality of multilayer ReLU NNs can also be obtained using approximate identities and a rectangular quadrature rule [26]. Restricted to Barron spaces [27, 28], the universality of ReLU NNs is a straightforward application of the law of large numbers [29]. The universality of ReLU NNs can also be achieved by emulating finite element approximations [30, 31].

Unlike others, [32] introduced the $\mathcal{B}$ell-shape function (for example, the derivative of a squash-type activation function) as a means to explicitly construct one-hidden-layer NNs approximating continuous functions over compacta in multiple dimensions. More detailed analyses were then carried out in [33, 34]. The idea was revisited and extended to establish universal approximation in the uniform norm for tensor product sigmoidal and hyperbolic tangent activations [35, 36, 37, 38, 39]. A similar approach was also taken in [40] using cardinal B-splines and in [41] using the distribution function as the sigmoid, but for a family of sigmoids with certain decaying-tail conditions.

While it is sufficient for most universal approximation results that each weight (and bias) vary over the whole real line $\mathbb{R}$, this is not necessary. In fact, universal approximation results for continuous functions on compacta can also be obtained using a finite set of weights [42, 43, 44]. It is quite striking that one hidden layer with only one neuron is enough for universality: [45] constructs a smooth, sigmoidal, almost monotone activation function so that a one-hidden-layer network with one neuron can approximate any continuous function over any compact subset of $\mathbb{R}$ to any desired accuracy.

Universal theorems with convergence rates for sigmoidal and other NNs have also been established. Modifying the proofs in [11, 18], [46] showed that, in one dimension, the error incurred by sigmoidal NNs with $N$ hidden units scales as $\mathcal{O}\left(N^{-1}\right)$ in the uniform norm over compacta. The result was then extended to $\mathbb{R}$ for bounded continuous functions [47]. Similar results were obtained for functions with bounded variation [48] and with bounded $\phi$-variation [49]. For multiple dimensions, [27] provided universal approximation for sigmoidal NNs in the space of functions with bounded first moment of the magnitude distribution of their Fourier transform with rate $\mathcal{O}\left(N^{-1/2}\right)$ in the $\mathcal{L}^{2}$-norm, independent of dimension. This result can be generalized to Lipschitz functions (with additional assumptions) [50]. Using the explicit NN construction in [32], [35, 36, 37, 38, 39, 33, 34, 41] provided a convergence rate of $\mathcal{O}\left(N^{-\alpha}\right)$, $0<\alpha<1$, in the uniform norm for Hölder continuous functions with exponent $\alpha$. Recently, [51] has revisited universal approximation theory with rate $\mathcal{O}\left(N^{-1/2}\right)$ in Sobolev norms for smooth activation functions with a polynomial decay condition on all derivatives. This improves/extends the previous similar work for sigmoidal functions in [27] and exponentially decaying activation functions in [52]. The setting in [51] is valid for sigmoidal, arctan, hyperbolic tangent, softplus, ReLU, leaky ReLU, and $k$th power of ReLU activations, as their central differences satisfy the polynomial decay condition. However, the result is only valid for smooth functions in high-order Barron spaces. For activation functions without decay but essentially bounded and having bounded Fourier transform on some interval (or having bounded variation), universal approximation results with the slower rate $\mathcal{O}\left(N^{-1/4}\right)$ in the $\mathcal{L}^{2}$-norm can be obtained for the first-order Barron space. These rates can be further improved using stratified sampling [53, 28]. Convergence rates for ReLU NNs in Sobolev spaces have recently been established using finite elements [30, 31].

The main objective of this paper is to provide the first constructive and unified framework for a large class of activation functions including most existing ones. At the heart of this new framework is the introduction of the neural network approximate identity (nAI) concept. The important consequence is: any nAI activation function is universal. The following are the main contributions: i) unlike existing works, our framework is constructive, using elementary means from functional analysis, probability theory, non-asymptotic analysis, and numerical analysis; ii) while existing approaches are either technical or specialized for particular activation functions, our framework is the first unified attempt that is valid for most of the existing activation functions and beyond; iii) the framework provides the first universality proof for some of the existing activation functions, including Mish, SiLU, ELU, and GELU; iv) the framework provides new proofs for all activation functions; v) the framework discovers and facilitates the discovery of new activation functions with guaranteed universality property: indeed, any activation whose $k$th derivative, with $k$ being an integer, is integrable and essentially bounded is universal, and in that case the activation function and all of its $j$th derivatives, $j=1,\ldots,k$, are not only valid activations but also universal; vi) for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons, and the values of the weights/biases; vii) the framework is the first that allows for abstractly presenting the universal approximation with the favorable non-asymptotic rate of $N^{-1/2}$, where $N$ is the number of neurons; this provides not only theoretical insights into the required number of neurons but also guidance on how to choose the number of neurons to achieve a certain accuracy with controllable success probability, and, perhaps more importantly, it shows that a neural network may fail, with non-zero probability, to provide the desired testing error in practice when an architecture is selected; viii) our framework also provides insights into the development, and hence constructive derivations, of some of the existing approaches.

The paper is organized as follows. Section 2 introduces the conventions and notations used in the paper. Elementary facts about convolution and approximate identities that are useful for our purposes are presented in Section 3. Section 4 recalls quadrature rules in terms of Riemann sums for continuous functions on compacta and their error analysis using moduli of continuity. Section 5 combines these ingredients to estimate the error of approximating a function by quadrature applied to its convolution with an approximate identity. This is followed by a unified abstract framework for universality in Section 6. The key to achieving this is to introduce the concept of neural network approximate identity (nAI), which immediately provides an abstract universality result in Lemma 3. This abstract framework reduces a universality proof to an nAI proof. Section 7 shows that most existing activation functions are nAI. This includes the family of rectified polynomial units (RePU), of which the parametric and leaky ReLUs are members; a family of generalized sigmoidal functions, of which the standard sigmoid, hyperbolic tangent, and softplus are members; the exponential linear unit (ELU); the Gaussian error linear unit (GELU); the sigmoid linear unit (SiLU); and the Mish. It is the nAI proof for the Mish activation function that guides us to devise a general framework for a large class of nAI functions (including all activations in this paper and beyond) in Section 8. An abstract universal approximation result with non-asymptotic rates is presented in Section 9. Section 10 concludes the paper.

2 Notations

This section describes the notations used in the paper. We reserve lower case roman letters for scalars or scalar-valued functions. Boldface lower case roman letters are for vectors, with components denoted by subscripts. We denote by $\mathbb{R}$ the set of real numbers and by $*$ the convolution operator. For $\boldsymbol{x}\in\mathbb{R}^{n}$, where $n\in\mathbb{N}$ is the ambient dimension, $\left|\boldsymbol{x}\right|_{p}:=\left(\sum_{i=1}^{n}\left|\boldsymbol{x}_{i}\right|^{p}\right)^{1/p}$ denotes the standard $\ell^{p}$ norm in $\mathbb{R}^{n}$. We conventionally write $f\left(\boldsymbol{x}\right):=f\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)$ and $\left\|f\right\|_{p}:=\left(\int_{\mathbb{R}^{n}}\left|f\left(\boldsymbol{x}\right)\right|^{p}\,d\mu\left(\boldsymbol{x}\right)\right)^{1/p}$ for $1\leq p<\infty$, where $\mu$ is the Lebesgue measure in $\mathbb{R}^{n}$. For $p=\infty$, $\left\|f\right\|_{\infty}:=\operatorname{ess\,sup}_{\mathbb{R}^{n}}\left|f\right|:=\inf\left\{M>0:\mu\left\{\boldsymbol{x}:\left|f(\boldsymbol{x})\right|>M\right\}=0\right\}$. For simplicity, we use $d\boldsymbol{x}$ in place of $d\mu\left(\boldsymbol{x}\right)$. Note that we also use $\left\|f\right\|_{\infty}$ to denote the uniform norm of a continuous function $f$. We define $\mathcal{L}^{p}:=\mathcal{L}^{p}\left(\mathbb{R}^{n}\right):=\left\{f:\left\|f\right\|_{p}<\infty\right\}$ for $1\leq p\leq\infty$, $\mathcal{C}\left(\mathcal{K}\right)$ as the space of continuous functions on $\mathcal{K}\subseteq\mathbb{R}^{n}$, $\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$ as the space of continuous functions vanishing at infinity, $\mathcal{C}_{b}\left(\mathbb{R}^{n}\right)$ as the space of bounded functions in $\mathcal{C}\left(\mathbb{R}^{n}\right)$, and $\mathcal{C}_{c}\left(\mathcal{K}\right)$ as the space of functions in $\mathcal{C}\left(\mathcal{K}\right)$ with compact support.

3 Convolution and Approximate Identity

The mathematical foundation for our framework is the approximate identity, which relies on convolution. Convolution has been used by many authors, including [23, 40, 22], to assist in proving universal approximation theorems. The approximate identity generated by the ReLU function has been used in [26] for showing ReLU universality in $\mathcal{L}^{p}$. In this section we collect some important results on convolution and approximate identities that are useful for our developments (see, e.g., [54] for a comprehensive treatment). Let $\tau_{\boldsymbol{y}}f\left(\boldsymbol{x}\right):=f\left(\boldsymbol{x}-\boldsymbol{y}\right)$ be the translation operator. The following is standard.

Proposition 1.

$f$ is uniformly continuous iff $\lim_{\boldsymbol{y}\to 0}\left\|\tau_{\boldsymbol{y}}f-f\right\|_{\infty}=0$.

Let $f,g:\mathbb{R}^{n}\to\mathbb{R}$ be two measurable functions in $\mathbb{R}^{n}$; their convolution is defined as

f*g:=\int_{\mathbb{R}^{n}}f\left(\boldsymbol{x}-\boldsymbol{y}\right)g\left(\boldsymbol{y}\right)\,d\boldsymbol{y},

when the integral exists. We are interested in the conditions under which $f\left(\boldsymbol{x}-\boldsymbol{y}\right)g\left(\boldsymbol{y}\right)$ (or $f\left(\boldsymbol{y}\right)g\left(\boldsymbol{x}-\boldsymbol{y}\right)$) is integrable for almost every (a.e.) $\boldsymbol{x}$, as our approach requires a discretization of the convolution integral. Below is a relevant case.

Lemma 1.

If $f\in\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$ and $g\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$, then $f*g\in\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$.

We are interested in $g$ being an approximate identity in the following sense.

Definition 1.

A family of functions $\mathcal{B}_{\theta}\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$, where $\theta>0$, is called an approximate identity (AI) if: i) the family is bounded in the $\mathcal{L}^{1}$ norm, i.e., $\left\|\mathcal{B}_{\theta}\right\|_{1}\leq C$ for some $C>0$; ii) $\int_{\mathbb{R}^{n}}\mathcal{B}_{\theta}\left(\boldsymbol{x}\right)\,d\boldsymbol{x}=1$ for all $\theta>0$; and iii) $\int_{\left\|\boldsymbol{x}\right\|>\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{x}\right)\right|\,d\boldsymbol{x}\longrightarrow 0$ as $\theta\longrightarrow 0$, for any $\delta>0$.

An important class of approximate identities is obtained by rescaling $\mathcal{L}^{1}$ functions.

Lemma 2.

If $g\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$ and $\int_{\mathbb{R}^{n}}g\left(\boldsymbol{x}\right)\,d\boldsymbol{x}=1$, then $\mathcal{B}_{\theta}\left(\boldsymbol{x}\right):=\frac{1}{\theta^{n}}g\left(\frac{\boldsymbol{x}}{\theta}\right)$ is an AI.

Lemma 3.

For any $f\in\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$, $\lim_{\theta\to 0}\left\|f*\mathcal{B}_{\theta}-f\right\|_{\infty}=0$, where $\mathcal{B}_{\theta}$ is an AI. Consequently, for any compact subset $\mathcal{K}$ of $\mathbb{R}^{n}$, $\lim_{\theta\to 0}\left\|\mathds{1}_{\mathcal{K}}\left(f*\mathcal{B}_{\theta}-f\right)\right\|_{\infty}=0$, where $\mathds{1}_{\mathcal{K}}$ is the indicator/characteristic function of $\mathcal{K}$.

Proof.

We briefly present the proof here as we will use a similar approach to estimate the approximate identity error later. Since $f\in\mathcal{C}_{0}\left(\mathbb{R}^{n}\right)$, it resides in $\mathcal{L}^{\infty}\left(\mathbb{R}^{n}\right)$ and is uniformly continuous. As a consequence of Proposition 1, for any $\varepsilon>0$, there exists $\delta>0$ such that $\left\|\tau_{\boldsymbol{y}}f-f\right\|_{\infty}\leq\varepsilon$ for $\left\|\boldsymbol{y}\right\|\leq\delta$. We have

\left\|f*\mathcal{B}_{\theta}-f\right\|_{\infty}\leq\int_{\mathbb{R}^{n}}\left\|\tau_{\boldsymbol{y}}f-f\right\|_{\infty}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}\leq\varepsilon\int_{\left\|\boldsymbol{y}\right\|\leq\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}+2\left\|f\right\|_{\infty}\int_{\left\|\boldsymbol{y}\right\|>\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y},

where we have used the second property in Definition 1. The assertion now follows: by the first property the first term is at most $C\varepsilon$ with $\varepsilon$ arbitrarily small, and by the third property the second term vanishes as $\theta\to 0$. ∎

4 Quadrature rules for continuous functions on bounded domain

Recall that a bounded function on a compact set is Riemann integrable if and only if it is continuous almost everywhere. In that case, Riemann sums converge to the Riemann integral as the corresponding partition size approaches 0. It is well-known (see, e.g., [55, 56], and references therein) that most common numerical quadrature rules (including the trapezoidal rule, some Newton-Cotes formulas, and Gauss-Legendre quadrature) are Riemann sums. In this section, we use the Riemann sum interpretation of quadrature rules to approximate integrals of bounded functions. The goal is to characterize the quadrature error in terms of the modulus of continuity. We first discuss the quadrature error in one dimension and then extend the result to $n$ dimensions.

We assume the domain of interest is $\left[-1,1\right]^{n}$. Let $f\in\mathcal{C}\left[-1,1\right]^{n}:=\mathcal{C}\left(\left[-1,1\right]^{n}\right)$, let $\mathcal{P}^{m}:=\left\{\boldsymbol{z}^{1},\ldots,\boldsymbol{z}^{m+1}\right\}$ be a partition of $\left[-1,1\right]$, and let $\mathcal{Q}^{m}:=\left\{\xi^{1},\ldots,\xi^{m}\right\}$ be the collection of all “quadrature points” such that $-1\leq\boldsymbol{z}^{j}\leq\xi^{j}\leq\boldsymbol{z}^{j+1}\leq 1$ for $j=1,\ldots,m$. We assume that

\sum_{j=1}^{m}g\left(\xi^{j}\right)\left(\boldsymbol{z}^{j+1}-\boldsymbol{z}^{j}\right)

is a valid Riemann sum (e.g., the trapezoidal rule), and thus converges to $\int_{-1}^{1}g\left(\boldsymbol{z}\right)\,d\boldsymbol{z}$ as $m$ approaches $\infty$. We define a quadrature rule for $\left[-1,1\right]^{n}$ as the tensor product of the aforementioned one-dimensional quadrature rule, and thus

\mathcal{S}\left(m,f\right):=\sum_{j^{1},\ldots,j^{n}=1}^{m}f\left(\xi^{j^{1}},\ldots,\xi^{j^{n}}\right)\Pi_{i=1}^{n}\left(\boldsymbol{z}^{j^{i}+1}-\boldsymbol{z}^{j^{i}}\right) (1)

is a valid Riemann sum for $\int_{\left[-1,1\right]^{n}}f\left(\boldsymbol{x}\right)\,d\boldsymbol{x}$.
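For concreteness, the tensor-product Riemann sum (1) can be realized in a few lines of code. The sketch below is ours and only illustrative: it assumes a uniform partition of $\left[-1,1\right]$ with left-endpoint quadrature points, and the function and variable names are not part of the paper.

import numpy as np

def riemann_sum_nd(f, m, n):
    # Tensor-product Riemann sum S(m, f) over [-1, 1]^n, eq. (1), with a uniform
    # partition of m subintervals per axis and left-endpoint quadrature points xi^j = z^j.
    z = np.linspace(-1.0, 1.0, m + 1)          # partition points z^1, ..., z^{m+1}
    xi, dz = z[:-1], np.diff(z)                # quadrature points and subinterval lengths
    grids = np.meshgrid(*([xi] * n), indexing="ij")
    pts = np.stack([g.ravel() for g in grids], axis=-1)                    # all m^n points
    wts = np.prod(np.stack(np.meshgrid(*([dz] * n), indexing="ij"), -1), -1).ravel()
    return float(np.sum(f(pts) * wts))

# example: f(x) = prod_i cos(pi x_i / 2) over [-1, 1]^2; the exact integral is (4/pi)^2
f = lambda x: np.prod(np.cos(np.pi * x / 2.0), axis=-1)
print(riemann_sum_nd(f, m=200, n=2), (4.0 / np.pi) ** 2)

Any other choice of quadrature points $\xi^{j}\in\left[\boldsymbol{z}^{j},\boldsymbol{z}^{j+1}\right]$ yields another valid Riemann sum.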

Recall the modulus of continuity $\omega\left(f,h\right)$ of a function $f$:

\omega\left(f,h\right):=\sup_{\left|\boldsymbol{z}\right|_{2}\leq h}\left\|\tau_{\boldsymbol{z}}f-f\right\|_{\infty}, (2)

$\omega\left(f,0\right)=0$, and $\omega\left(f,h\right)$ is continuous with respect to $h$ at $0$ (due to the uniform continuity of $f$ on $\left[-1,1\right]^{n}$). The following error bound for the tensor product quadrature rule is an extension of the standard one-dimensional case [55].

Lemma 4.

Let $f\in\mathcal{C}\left[-1,1\right]^{n}$. Then (the bound below can be marginally improved using the common refinement of $\mathcal{P}^{m}$ and $\mathcal{Q}^{m}$, but that is not important for our purpose)

\left|\mathcal{S}\left(m,f\right)-\int_{\left[-1,1\right]^{n}}f\left(\boldsymbol{x}\right)\,d\boldsymbol{x}\right|\leq 2^{n}\omega\left(f,\left|\mathcal{P}^{m}\right|\right),

where $\left|\mathcal{P}^{m}\right|$ is the norm of the partition $\mathcal{P}^{m}$.

5 Error Estimation for Approximate Identity with quadrature

We are interested in approximating functions in $\mathcal{C}_{c}\left[-1,1\right]^{n}$ using neural networks. How to extend the results to $\mathcal{C}\left[-1,1\right]^{n}$ is given in A. At the heart of our framework is to first approximate any function $f$ in $\mathcal{C}_{c}\left[-1,1\right]^{n}$ with an approximate identity $\mathcal{B}_{\theta}$ and then numerically integrate the convolution $f*\mathcal{B}_{\theta}$ with quadrature (later with Monte Carlo sampling in Section 9). This extends the similar approach in [26] for ReLU activation functions in several directions: 1) we rigorously account for the approximate identity and quadrature errors; 2) our unified framework holds for most activation functions (see Section 7) using the neural network approximate identity (nAI) concept introduced in Section 6; 3) we identify sufficient conditions under which an activation is nAI (see Section 8); and 4) we provide a non-asymptotic rate of convergence (see Section 9). Moreover, this procedure is the key to our unification of neural network universality.

Let $f\in\mathcal{C}_{c}\left[-1,1\right]^{n}$ and let $\mathcal{B}_{\theta}$ be an approximate identity. From Lemma 3 we know that $f*\mathcal{B}_{\theta}$ converges to $f$ uniformly as $\theta\to 0$. We are, however, interested in estimating the error $\left\|f*\mathcal{B}_{\theta}-f\right\|_{\infty}$ for a given $\theta$. From the proof of Lemma 3, for any $\delta>0$, we have

\left\|f*\mathcal{B}_{\theta}-f\right\|_{\infty}\leq\int_{\mathbb{R}^{n}}\left\|\tau_{\boldsymbol{y}}f-f\right\|_{\infty}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}\leq\omega\left(f,\delta\right)\int_{\left\|\boldsymbol{y}\right\|\leq\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}+2\left\|f\right\|_{\infty}\int_{\left\|\boldsymbol{y}\right\|>\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}\leq\omega\left(f,\delta\right)+2\left\|f\right\|_{\infty}\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right),

where $\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right):=\int_{\left\|\boldsymbol{y}\right\|>\delta}\left|\mathcal{B}_{\theta}\left(\boldsymbol{y}\right)\right|\,d\boldsymbol{y}$ is the tail mass of $\mathcal{B}_{\theta}$.
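As an illustration of the tail mass (ours, not part of the paper's argument), consider the one-dimensional scaled Gaussian approximate identity of Lemma 2, $\mathcal{B}_{\theta}\left(y\right)=\frac{1}{\theta}g\left(\frac{y}{\theta}\right)$ with $g$ the standard normal density; then $\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)$ has a closed form and vanishes rapidly as $\theta\to 0$:

import math

def gaussian_tail_mass(theta, delta):
    # T(B_theta, delta) = integral over |y| > delta of B_theta(y) dy, with
    # B_theta(y) = (1/theta) g(y/theta) and g the standard normal density
    return math.erfc(delta / (theta * math.sqrt(2.0)))

delta = 0.1
for theta in (1.0, 0.1, 0.05, 0.01):
    print(f"theta = {theta:5.2f},  T(B_theta, {delta}) = {gaussian_tail_mass(theta, delta):.3e}")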

Next, for any $\boldsymbol{x}\in\left[-1,1\right]^{n}$, we approximate

f*\mathcal{B}_{\theta}\left(\boldsymbol{x}\right)=\int_{\mathbb{R}^{n}}\mathcal{B}_{\theta}\left(\boldsymbol{x}-\boldsymbol{y}\right)f\left(\boldsymbol{y}\right)\,d\boldsymbol{y}=\int_{\left[-1,1\right]^{n}}\mathcal{B}_{\theta}\left(\boldsymbol{x}-\boldsymbol{y}\right)f\left(\boldsymbol{y}\right)\,d\boldsymbol{y}

with the Riemann sum defined in (1). The following result is obvious by the triangle inequality, the continuity of $\omega\left(f,h\right)$ at $h=0$, the third condition in the definition of $\mathcal{B}_{\theta}$, and Lemma 4.

Lemma 5.

Let $f\in\mathcal{C}_{c}\left[-1,1\right]^{n}$, let $\mathcal{B}_{\theta}$ be an approximate identity, and let $\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(\boldsymbol{x}\right)$ be the quadrature rule (1) for $f*\mathcal{B}_{\theta}$ with the partition $\mathcal{P}^{m}$ and quadrature point set $\mathcal{Q}^{m}$. Then, for any $\delta>0$, there holds

f\left(\boldsymbol{x}\right)-\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(\boldsymbol{x}\right)=\mathcal{O}\left(\omega\left(f,\delta\right)+\omega\left(f,\left|\mathcal{P}^{m}\right|\right)+\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)\right). (3)

In particular, for any $\varepsilon>0$, there exist a sufficiently small $\left|\mathcal{P}^{m}\right|$ (and hence large $m$), and sufficiently small $\theta$ and $\delta$, such that

\left\|f\left(\boldsymbol{x}\right)-\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(\boldsymbol{x}\right)\right\|_{\infty}\leq\varepsilon.
Remark 1.

Aiming at using only elementary means, we use a Riemann sum to approximate the integral, which does not give the best possible convergence rates. To improve the rates, Section 9 resorts to Monte Carlo sampling to not only reduce the number of “quadrature” points but also to obtain a total number of points independent of the ambient dimension $n$.

6 An abstract unified framework for universality

Definition 2 (Network Approximate Identity function).

A univariate function $\sigma\left(x\right):\mathbb{R}\to\mathbb{R}$ admits a network approximate identity (nAI) if there exist $1\leq k\in\mathbb{N}$, $\left\{\alpha_{i}\right\}_{i=1}^{k}\subset\mathbb{R}$, $\left\{w^{i}\right\}_{i=1}^{k}\subset\mathbb{R}$, and $\left\{b_{i}\right\}_{i=1}^{k}\subset\mathbb{R}$ such that

g\left(x\right):=\sum_{i=1}^{k}\alpha_{i}\sigma\left(w^{i}x+b_{i}\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right). (4)

Thus, up to a scaling, $g\left(x\right)$ is an approximate identity.
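As a quick numerical illustration of Definition 2 (ours, not the paper's), one can probe the integrability of a candidate combination (4) by estimating its $\mathcal{L}^{1}$ norm on growing intervals; the stabilizing values below for $g\left(x\right)=\tanh\left(x+1\right)-\tanh\left(x-1\right)$ (i.e., $k=2$, $\alpha=(1,-1)$, $w=(1,1)$, $b=(1,-1)$) indicate that $\tanh$ is an nAI.

import numpy as np

def l1_estimate(g, R, m=2_000_001):
    # Riemann-sum estimate of the integral of |g| over [-R, R]
    x = np.linspace(-R, R, m)
    return float(np.sum(np.abs(g(x))) * (x[1] - x[0]))

# candidate combination (4) for sigma = tanh: k = 2, alpha = (1, -1), w = (1, 1), b = (1, -1)
g = lambda x: np.tanh(x + 1.0) - np.tanh(x - 1.0)

# stabilizing values over growing intervals indicate g in L^1(R); the exact value here is 4
for R in (10.0, 100.0, 1000.0):
    print(R, l1_estimate(g, R))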

Lemma 6.

If a univariate function $\sigma\left(x\right):\mathbb{R}\to\mathbb{R}$ is an nAI, then it is universal in $\mathcal{C}_{c}\left[-1,1\right]$.

Proof.

From the nAI definition, there exist $k\in\mathbb{N}$ and $\alpha_{i},w^{i},b_{i}$, $i=1,\ldots,k$, such that (4) holds. After rescaling, i.e., replacing $g\left(x\right)$ by $g\left(x\right)/\left\|g\right\|_{\mathcal{L}^{1}\left(\mathbb{R}\right)}$, we infer from Lemma 2 that $\mathcal{B}_{\theta}\left(x\right):=\frac{1}{\theta}g\left(\frac{x}{\theta}\right)$ is an approximate identity. Lemma 5 then asserts that for any desired accuracy $\varepsilon>0$, there exist a sufficiently small partition norm $\left|\mathcal{P}^{m}\right|$, and sufficiently small $\theta$ and $\delta$, such that

\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(x\right)=\frac{1}{\theta}\sum_{j=1}^{m}\left(z^{j+1}-z^{j}\right)f\left(\xi^{j}\right)\sum_{\ell=1}^{k}\alpha_{\ell}\sigma\left[\frac{w^{\ell}}{\theta}\left(x-\xi^{j}\right)+b_{\ell}\right] (5)

is within $\varepsilon$ of any $f\left(x\right)\in\mathcal{C}_{c}\left[-1,1\right]$ in the uniform norm. ∎

We have explicitly constructed a one-hidden-layer neural network $\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left(x\right)$ with an arbitrary nAI activation $\sigma$ in (5) to approximate well any continuous function $f\left(x\right)$ with compact support in $\left[-1,1\right]$. A few observations are in order: i) if we know the modulus of continuity of $f\left(x\right)$ and the tail behavior of $\mathcal{B}_{\theta}$ (from the properties of $\sigma$), we can precisely determine the total number of quadrature points $m$, the scaling $\theta$, and the cut-off radius $\delta$ in terms of $\varepsilon$ (see Lemma 5); that is, the topology of the network is completely determined; ii) the weights and biases of the network are also readily available from the nAI property of $\sigma$, the quadrature points, and $\theta$; iii) the coefficients of the output layer are also pre-determined by the nAI (i.e., the $\alpha_{\ell}$) and the values of the unknown function $f$ at the quadrature points; and iv) any nAI activation function is universal in $\mathcal{C}_{c}\left[-1,1\right]$.
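The following is a minimal one-dimensional sketch (ours, with illustrative and unoptimized choices) of the construction (5): it assumes $\sigma=\tanh$ with the normalized nAI combination $g\left(x\right)=\left[\tanh\left(x+1\right)-\tanh\left(x-1\right)\right]/4$, a uniform partition with left-endpoint quadrature points, and fixed $m$ and $\theta$. It makes the weights, biases, and output coefficients described in i)-iii) explicit.

import numpy as np

# assumed nAI data for sigma = tanh: g(x) = [tanh(x+1) - tanh(x-1)] / 4 integrates to 1
sigma = np.tanh
alpha = np.array([0.25, -0.25])     # alpha_l (already divided by ||g||_1 = 4)
w_nai = np.array([1.0, 1.0])        # w^l
b_nai = np.array([1.0, -1.0])       # b_l

def build_network(f, m, theta):
    # Assemble the one-hidden-layer network S(m, f * B_theta) of (5):
    # one neuron per pair (l, j) with weight w^l/theta, bias b_l - (w^l/theta) xi^j,
    # and output coefficient (1/theta) (z^{j+1} - z^j) f(xi^j) alpha_l.
    z = np.linspace(-1.0, 1.0, m + 1)
    xi, dz = z[:-1], np.diff(z)                       # left-endpoint quadrature points
    k = len(alpha)
    W = np.repeat(w_nai / theta, m)
    B = np.repeat(b_nai, m) - np.repeat(w_nai / theta, m) * np.tile(xi, k)
    C = np.repeat(alpha, m) * np.tile(dz * f(xi), k) / theta
    return W, B, C

def evaluate(W, B, C, x):
    return np.sum(C * sigma(np.outer(x, W) + B), axis=1)

f = lambda x: np.where(np.abs(x) < 1.0, np.cos(np.pi * x / 2.0) ** 2, 0.0)   # compact support
W, B, C = build_network(f, m=400, theta=0.05)
x = np.linspace(-1.0, 1.0, 1001)
print("uniform error:", np.max(np.abs(evaluate(W, B, C, x) - f(x))))         # small for these choices

Refining the partition and shrinking $\theta$ and $\delta$ as dictated by Lemma 5 (using the modulus of continuity of $f$ and the tail of $\mathcal{B}_{\theta}$) drives this uniform error below any prescribed $\varepsilon$.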

Clearly the Gaussian activation function $e^{-x^{2}}$ is an nAI with $k=1$, $\alpha_{1}=1$, $w^{1}=1$, and $b_{1}=0$. The interesting fact is that, as we shall prove in Section 7, most existing activation functions, though not integrable by themselves, are nAI.

Remark 2.

It is important to point out that the universality in Lemma 6 for any nAI activation is in one dimension. In order to extend the universality result to $n$ dimensions for an nAI activation function, we shall deploy a special $n$-fold composition of its one-dimensional approximate identity (4). Section 7 provides explicit constructions of such $n$-fold compositions for many nAI activation functions, and Section 8 extends the result to abstract nAI functions under appropriate conditions.

Remark 3.

Our framework also helps us understand some of the existing universal approximation approaches in a more constructive manner. For example, the constructive proof in [11] is not entirely constructive, as its key step, where the authors introduce the neural network approximation function $g\left(x\right)$, is not constructive. Our framework can provide a constructive derivation of this step. Indeed, by applying summation by parts, one can see that $g\left(x\right)$ resembles a quadrature approximation of the convolution of $f\left(x\right)$ with the derivative (approximated by a forward finite difference) of a scaled sigmoidal function. Since the derivative of the sigmoidal function, up to a scaling factor, is an nAI (see Section 8), the convolution of $f\left(x\right)$ with a scaled sigmoidal derivative can approximate $f\left(x\right)$ to any desired accuracy.

Another important example is the pioneering work in [32] that has been extended in many directions in [35, 36, 37, 38, 39, 33, 34, 41, 12]. Though the rest of the approach is constructive, the introduction of the neural network operator is rather mysterious. From our framework's point of view, this is no longer the case, as the neural network operator is nothing more than a quadrature approximation of the convolution of $f\left(x\right)$ with $\mathcal{B}$ell-shape functions constructed from the activation functions under consideration. Since these $\mathcal{B}$ell-shape functions fall under the nAI umbrella, the introduction of the neural network operator is now completely justified and easily understood. We would like to point out that our work does not generalize/extend the work in [32]. Instead, we aim at developing a new analytical framework that achieves unified and broad results that are not possible with the contemporary counterparts, while being more constructive and simpler. The $\mathcal{B}$ell-shape function idea alone is not sufficient to construct our framework. Indeed, the work in [32] and its extensions [35, 36, 37, 38, 39, 33, 34, 41, 12] have been limited to a few special and standard squash-type activation functions such as the sigmoid and arctangent functions. Our work, thanks to other new ingredients such as convolution, numerical quadrature, Monte Carlo sampling, and finite differences together with the new nAI concept, breaks this barrier. Our framework thus provides not only a new constructive method but also new insights into the work in [32] as a special case.

7 Many existing activation functions are nAI

We now exploit properties of each activation (and its family, if any) under consideration to show that they are nAI for $n=1$. That is, these activations generate one-dimensional approximate identities, which in turn shows that they are universal by Lemma 6. To show that they are also universal in $n$ dimensions (note that, by Proposition 3.3 of [23], with an additional assumption on the denseness of the set of ridge functions in $\mathcal{C}\left(\mathbb{R}^{n}\right)$, we could also conclude universality in $n$ dimensions; this approach, however, does not provide an explicit construction of the corresponding neural networks), we extend a constructive and explicit approach from [26] that was developed for the ReLU activation function. We start with the family of rectified polynomial units (RePU), which also covers the parametric and leaky ReLUs, with many properties; then the family of sigmoidal functions; then the exponential linear unit (ELU); then the Gaussian error linear unit (GELU); then the sigmoid linear unit (SiLU); and we conclude with the Mish activation, which has the fewest properties. The proofs of all results in this section, together with figures demonstrating the $\mathcal{B}$-functions for each activation, can be found in Appendix B and Appendix C.

7.1 Rectified Polynomial Units (RePU) is an nAI

Following [57, 18] we define the rectified polynomial unit (RePU) as

\text{RePU}\left(q;x\right)=\begin{cases}x^{q}&\text{if }x\geq 0,\\ 0&\text{otherwise},\end{cases}\quad q\in\mathbb{N},

which reduces to the standard rectified linear unit (ReLU) [58] when $q=1$, the Heaviside unit when $q=0$, and the rectified cubic unit [59] when $q=3$.

The goal of this section is to construct an integrable linear combination of the RePU activation function for a given $q$. We begin by constructing compactly supported $\mathcal{B}$ell-shape functions (the term $\mathcal{B}$ell-shape function seems to have been first coined in [32]) from RePU in one dimension. Recall the binomial coefficient notation $\binom{r}{k}=\frac{r!}{\left(r-k\right)!k!}$, and the central finite difference operator with stepsize $h$, $\delta_{h}\left[f\right]\left(x\right):=f\left(x+\frac{h}{2}\right)-f\left(x-\frac{h}{2}\right)$ (see Remark 8 for forward and backward finite differences).

Lemma 7 (\mathcal{B}-function for RePU  in one dimension).

Let $r:=q+1$. For any $h\geq 0$, define

\mathcal{B}\left(x,h\right):=\frac{1}{q!}\delta_{h}^{r}\left[\text{RePU}\right]\left(x\right)=\frac{1}{q!}\sum_{i=0}^{r}\left(-1\right)^{i}\binom{r}{i}\text{RePU}\left(q;x+\left(\frac{r}{2}-i\right)h\right).

Then: i) $\mathcal{B}\left(x,h\right)$ is a piecewise polynomial of order at most $q$ on each interval $\left[kh-\frac{rh}{2},(k+1)h-\frac{rh}{2}\right]$ for $k=0,\ldots,r-1$; ii) $\mathcal{B}\left(x,h\right)$ is $\left(q-1\right)$-time differentiable for $q\geq 2$, continuous for $q=1$, and discontinuous for $q=0$; furthermore, $\text{supp}\left(\mathcal{B}\left(x,h\right)\right)$ is a subset of $\left[-\frac{rh}{2},\frac{rh}{2}\right]$; and iii) $\mathcal{B}\left(x,h\right)$ is even, non-negative, unimodal, and $\int_{\mathbb{R}}\mathcal{B}\left(x,1\right)\,dx=1$.
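A short numerical sanity check of Lemma 7 (ours; the helper names are not from the paper) for, e.g., $q=2$:

import numpy as np
from math import comb, factorial

def repu(q, x):
    return np.where(x >= 0.0, x ** q, 0.0)

def B_repu(q, x, h):
    # B(x, h) = (1/q!) * r-th central difference of RePU(q; .) with step h, r = q + 1
    r = q + 1
    s = sum((-1) ** i * comb(r, i) * repu(q, x + (r / 2.0 - i) * h) for i in range(r + 1))
    return s / factorial(q)

q, h = 2, 1.0
x = np.linspace(-3.0, 3.0, 600001)
vals = B_repu(q, x, h)
print("min value       :", vals.min())                           # non-negative up to round-off
print("integral (h = 1):", np.sum(vals) * (x[1] - x[0]))          # close to 1, as in iii)
print("support within  :", x[np.abs(vals) > 1e-12][[0, -1]])      # inside [-r h/2, r h/2]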

Though there are many approaches to construct $\mathcal{B}$-functions in $n$ dimensions from a $\mathcal{B}$-function in one dimension (see, e.g., [32, 35, 36, 37, 38, 39, 40, 41]) that are realizable by neural networks, inspired by [26] we construct the $\mathcal{B}$-function in $n$ dimensions by an $n$-fold composition of the one-dimensional $\mathcal{B}$-function in Lemma 7. By Lemma 5, it then follows that the activation under consideration is universal in $n$ dimensions. We shall carry out this $n$-fold composition not only for RePU but also for the other activation functions. The same procedure for an abstract activation function will be presented in Section 8.

Theorem 1 (\mathcal{B}-function for RePU in $n$ dimensions).

Let $r:=q+1$, and $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)$. Define $\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right)$. The following hold:

  1. i)

    $\mathfrak{B}\left(\boldsymbol{x}\right)$ is non-negative with compact support. In particular, $\text{supp}\left(\mathfrak{B}\left(\boldsymbol{x}\right)\right)\subseteq\times_{i=1}^{n}\left[-b_{i},b_{i}\right]$, where $b_{i}=\underbrace{\mathcal{B}\left(0,\ldots,\mathcal{B}\left(0,1\right)\right)}_{(i-1)\text{-time composition}}$ for $2\leq i\leq n$, and $b_{1}=\frac{n}{2}$.

  2. ii)

    $\mathfrak{B}\left(\boldsymbol{x}\right)$ is even with respect to each component $\boldsymbol{x}_{i}$, and unimodal with $\mathfrak{B}\left(0\right)=\max_{\boldsymbol{x}\in\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)$.

  3. iii)

    $\mathfrak{B}\left(\boldsymbol{x}\right)$ is a piecewise polynomial of order at most $\left(n-i\right)q$ in $\boldsymbol{x}_{i}$, $i=1,\ldots,n$. Furthermore, $\mathfrak{B}\left(\boldsymbol{x}\right)$ is $(q-1)$-time differentiable in each $\boldsymbol{x}_{i}$, $\int_{\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)\,d\boldsymbol{x}\leq 1$, and $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.
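The $n$-fold composition in Theorem 1 is computationally just a repeated evaluation in which the previous value plays the role of the step size $h$. The sketch below (ours) uses the ReLU case $q=1$, whose one-dimensional $\mathcal{B}$-function is the hat function, for brevity.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def B1(x, h):
    # one-dimensional B-function for ReLU (q = 1): second central difference of ReLU,
    # i.e., the hat function of height h supported on [-h, h] (for h >= 0)
    return relu(x + h) - 2.0 * relu(x) + relu(x - h)

def B_nfold(B, x):
    # n-fold composition frak_B(x) = B(x_n, B(x_{n-1}, ..., B(x_1, 1)))
    h = 1.0
    for xi in x:
        h = B(xi, h)      # the previous value enters as the step size of the next B
    return h

print(B_nfold(B1, np.zeros(4)))                       # the maximum, attained at the origin
print(B_nfold(B1, np.array([0.3, -0.2, 0.1, 0.1])))   # smaller away from the origin
print(B_nfold(B1, np.array([2.5, 0.0, 0.0, 0.0])))    # zero outside the compact support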

Remark 4.

Note that Lemma 7 and Theorem 1 also hold for the parametric ReLU [60] and the leaky ReLU [61] (a special case of the parametric ReLU) with $r=2$. In other words, the parametric ReLU is an nAI, and thus universal.

7.2 Sigmoidal and related activation functions are nAI

7.2.1 Sigmoidal, hyperbolic tangent, and softplus activation functions

Recall the sigmoidal and softplus functions [62] are, respectively, given by

\sigma_{s}\left(x\right):=\frac{1}{1+e^{-x}},\quad\text{ and }\quad\sigma_{p}\left(x\right):=\ln\left(1+e^{x}\right).

It is the relationship between the sigmoidal function and the hyperbolic tangent function, denoted by $\sigma_{t}\left(x\right)$,

\sigma_{s}\left(x\right)=\frac{1}{2}+\frac{1}{2}\sigma_{t}\left(\frac{x}{2}\right)

that allows us to construct their $\mathcal{B}$-functions in a similar manner. In particular, based on the bell-shape geometry of the derivatives of the sigmoidal and hyperbolic tangent functions, we apply central finite differencing with $h>0$ to obtain the corresponding $\mathcal{B}$-functions

\mathcal{B}_{s}\left(x,h\right):=\frac{1}{2}\delta_{h}\left[\sigma_{s}\right]\left(x\right),\quad\text{and }\quad\mathcal{B}_{t}\left(x,h\right):=\frac{1}{4}\delta_{h}\left[\sigma_{t}\right]\left(x\right).

Since $\sigma_{s}\left(x\right)$ is the derivative of $\sigma_{p}\left(x\right)$, we apply the central difference twice to obtain a $\mathcal{B}$-function for $\sigma_{p}\left(x\right)$: $\mathcal{B}_{p}\left(x,h\right)=\delta_{h}^{2}\left[\sigma_{p}\right]\left(x\right)$. The following Lemma 8 and Theorem 2 hold for $\mathcal{B}\left(x,h\right)$ being either $\mathcal{B}_{s}\left(x,h\right)$, $\mathcal{B}_{t}\left(x,h\right)$, or $\mathcal{B}_{p}\left(x,h\right)$.

Lemma 8 (\mathcal{B}-function for sigmoid, hyperbolic tangent, and softplus in one dimension).

For $0<h<\infty$, the following hold: i) $\mathcal{B}\left(x,h\right)\geq 0$ for all $x\in\mathbb{R}$, $\lim_{\left|x\right|\to\infty}\mathcal{B}\left(x,h\right)=0$, and $\mathcal{B}\left(x,h\right)$ is even; and ii) $\mathcal{B}\left(x,h\right)$ is unimodal, $\int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx<\infty$, and $\mathcal{B}\left(x,h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right)$.

Similar to Theorem 1, we construct the $\mathcal{B}$-function in $n$ dimensions by an $n$-fold composition of the one-dimensional $\mathcal{B}$-function in Lemma 8.

Theorem 2 (\mathcal{B}-function for sigmoid, hyperbolic tangent, and softplus in $n$ dimensions).

Let $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)\in\mathbb{R}^{n}$. Define

\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right).

Then $\mathfrak{B}\left(\boldsymbol{x}\right)$ is even with respect to each component $\boldsymbol{x}_{i}$, $i=1,\ldots,n$, and unimodal with $\mathfrak{B}\left(0\right)=\max_{\boldsymbol{x}\in\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)$. Furthermore, $\int_{\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)\,d\boldsymbol{x}\leq 1$ and $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

7.2.2 Arctangent function

We next discuss the arctangent activation function $\sigma_{a}\left(x\right):=\arctan\left(x\right)$, whose shape is similar to that of the sigmoid. Since its derivative has the bell-shape geometry, we define the $\mathcal{B}$-function for the arctangent function by approximating its derivative with central finite differencing, i.e.,

\mathcal{B}_{a}\left(x,h\right)=\frac{1}{2\pi}\delta_{h}\left[\sigma_{a}\right]\left(x\right)=\frac{1}{2\pi}\arctan\left(\frac{2h}{1+x^{2}-h^{2}}\right),

for $0<h\leq 1$. Then simple algebraic manipulations show that Lemma 8 holds for $\mathcal{B}_{a}\left(x,h\right)$ with $\int_{\mathbb{R}}\mathcal{B}_{a}\left(x,h\right)\,dx=h$. Thus, Theorem 2 also holds for the $n$-dimensional arctangent $\mathcal{B}$-function

\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}_{a}\left(\boldsymbol{x}_{n},\mathcal{B}_{a}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}_{a}\left(\boldsymbol{x}_{1},1\right)\right)\right),

as we have shown for the sigmoidal function.
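A quick numerical check (ours) that the closed form above agrees with the difference $\frac{1}{2\pi}\left[\arctan\left(x+h\right)-\arctan\left(x-h\right)\right]$ for $0<h<1$ and that it integrates to $h$:

import numpy as np

def B_arctan_closed(x, h):
    # closed form of the arctangent B-function used above, valid for 0 < h < 1
    return np.arctan(2.0 * h / (1.0 + x ** 2 - h ** 2)) / (2.0 * np.pi)

def B_arctan_diff(x, h):
    # (1/(2 pi)) * [arctan(x + h) - arctan(x - h)]
    return (np.arctan(x + h) - np.arctan(x - h)) / (2.0 * np.pi)

h = 0.5
x = np.linspace(-1000.0, 1000.0, 2_000_001)
print("max |closed - difference|:", np.max(np.abs(B_arctan_closed(x, h) - B_arctan_diff(x, h))))
print("integral (close to h)    :", np.sum(B_arctan_closed(x, h)) * (x[1] - x[0]))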

7.2.3 Generalized sigmoidal functions

We extend the class of sigmoidal functions in [63] to generalized sigmoidal (or “squash”) functions.

Definition 3 (Generalized sigmoidal functions).

We say $\sigma\left(x\right)$ is a generalized sigmoidal function if it satisfies the following conditions:

  1. i)

    $\sigma\left(x\right)$ is bounded, i.e., there exists a constant $\sigma_{\text{max}}$ such that $\left|\sigma\left(x\right)\right|\leq\sigma_{\text{max}}$ for all $x\in\mathbb{R}$.

  2. ii)

    $\lim_{x\to\infty}\sigma\left(x\right)=L$ and $\lim_{x\to-\infty}\sigma\left(x\right)=\ell$.

  3. iii)

    There exist $x^{-}<0$ and $\alpha>0$ such that $\sigma\left(x\right)-\ell=\mathcal{O}\left(\left|x\right|^{-1-\alpha}\right)$ for $x<x^{-}$.

  4. iv)

    There exist $x^{+}>0$ and $\alpha>0$ such that $L-\sigma\left(x\right)=\mathcal{O}\left(\left|x\right|^{-1-\alpha}\right)$ for $x>x^{+}$.

Clearly the standard sigmoidal, hyperbolic tangent, and arctangent activations are members of the class of generalized sigmoidal functions.

Lemma 9 (\mathcal{B}-function for generalized sigmoids in one dimension).

Let $\sigma\left(x\right)$ be a generalized sigmoidal function, and let $\mathcal{B}\left(x,h\right)$ be an associated $\mathcal{B}$-function defined as

\mathcal{B}\left(x,h\right):=\frac{1}{2\left(L-\ell\right)}\left[\sigma\left(x+h\right)-\sigma\left(x-h\right)\right].

Then the following hold for any $h\in\mathbb{R}$: i) there exists a constant $C$ such that $\left|\mathcal{B}\left(x,h\right)\right|\leq C\left(x+\left|h\right|\right)^{-1-\alpha}$ for $x\geq x^{+}+\left|h\right|$, and $\left|\mathcal{B}\left(x,h\right)\right|\leq C\left(-x+\left|h\right|\right)^{-1-\alpha}$ for $x\leq x^{-}-\left|h\right|$; and ii) for $x\in\left[m,M\right]$, $m,M\in\mathbb{R}$, $\sum_{k=-\infty}^{\infty}\mathcal{B}\left(x+kh,h\right)$ converges uniformly to $\text{sign}\left(h\right)$. Furthermore, $\int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx=h$, and $\mathcal{B}\left(x,h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right)$.

Thus any generalized sigmoidal function is an nAI in one dimension. Note that the setting in [63], in which $\sigma\left(x\right)$ is non-decreasing, $L=1$, and $\ell=0$, is a special case of our general setting. In this less general setting, it is clear that $\mathcal{B}\left(x,h\right)\geq 0$, and thus $\int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx=h$ is sufficient to conclude $\mathcal{B}\left(x,h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right)$ for any $h>0$. We will explore this in the next theorem, as it is not clear, at the moment of writing this paper, how to show that the $\mathcal{B}$-functions for generalized sigmoids, as constructed below, reside in $\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

Theorem 3 (\mathcal{B}-function for generalized sigmoids in $n$ dimensions).

Suppose that $\sigma\left(x\right)$ is a non-decreasing generalized sigmoidal function. Let $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)\in\mathbb{R}^{n}$. Define $\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right)$. Then $\int_{\mathbb{R}^{n}}\mathfrak{B}\left(\boldsymbol{x}\right)\,d\boldsymbol{x}=1$ and $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

7.3 The Exponential Linear Unit (ELU) is nAI

Following [64, 65] we define the Exponential Linear Unit (ELU) as

\sigma\left(x\right)=\begin{cases}\alpha\left(e^{x}-1\right)&x\leq 0,\\ x&x>0,\end{cases}

for some $\alpha\in\mathbb{R}$. The goal of this section is to show that ELU is an nAI. Since the unbounded part of ELU is linear, Section 7.1 suggests defining, for $h\in\mathbb{R}$,

\mathcal{B}\left(x,h\right):=\frac{\delta_{h}^{2}\left[\sigma\right]\left(x\right)}{\gamma},\text{ where }\gamma=\max\left\{1+4\left|\alpha\right|,2+\left|\alpha\right|\right\}.
Lemma 10 (\mathcal{B}-function for ELU in one dimension).

Let $\alpha,h\in\mathbb{R}$. Then $\int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx=\frac{h^{2}}{\gamma}$ and $\mathcal{B}\left(x,h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right)$. Furthermore, $\mathcal{B}\left(x,h\right)\leq 1$ for $\left|h\right|\leq 1$.
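A numerical verification of the integral identity in Lemma 10 (ours), for the illustrative values $\alpha=1$ and $h=1/2$:

import numpy as np

def elu(x, a):
    return np.where(x > 0.0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))

def B_elu(x, h, a):
    gamma = max(1.0 + 4.0 * abs(a), 2.0 + abs(a))
    return (elu(x + h, a) - 2.0 * elu(x, a) + elu(x - h, a)) / gamma

a, h = 1.0, 0.5                                   # illustrative choices
gamma = max(1.0 + 4.0 * abs(a), 2.0 + abs(a))
x = np.linspace(-60.0, 10.0, 700001)              # the integrand vanishes for x > h
print("integral :", np.sum(B_elu(x, h, a)) * (x[1] - x[0]))
print("predicted:", h ** 2 / gamma)               # Lemma 10: h^2 / gamma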

Theorem 4 (\mathcal{B}-function for ELU in $n$ dimensions).

Let $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)\in\mathbb{R}^{n}$. Define $\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right)$. Then $\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left(\boldsymbol{x}\right)\right|\,d\boldsymbol{x}\leq 1$ and thus $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

7.4 The Gaussian Error Linear Unit (GELU) is nAI

GELU, introduced in [66], is defined as

\sigma\left(x\right):=x\Phi\left(x\right),

where $\Phi\left(x\right)$ is the cumulative distribution function of the standard normal distribution. Since the unbounded part of GELU is essentially linear, Section 7.1 suggests defining

\mathcal{B}\left(x,h\right):=\delta_{h}^{2}\left[\sigma\right]\left(x\right),\quad h\in\mathbb{R}.
Lemma 11 (\mathcal{B}-function for GELU in one dimension).
  1. i)

    $\mathcal{B}\left(x,h\right)$ is an even function in both $x$ and $h$.

  2. ii)

    $\mathcal{B}\left(x,h\right)$ has two symmetric roots $\pm x^{*}$ with $h<x^{*}<\max\left\{2,2h\right\}$, and $\left|\mathcal{B}\left(x,h\right)\right|\leq\frac{1}{\sqrt{2\pi}}h^{2}$,

    \int_{\mathbb{R}}\mathcal{B}\left(x,h\right)\,dx=h^{2},\text{ and }\int_{\mathbb{R}}\left|\mathcal{B}\left(x,h\right)\right|\,dx\leq\frac{37}{10}h^{2}.
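Both integrals in Lemma 11 ii) can be checked numerically; the sketch below (ours) realizes $\Phi$ via the error function and uses a simple Riemann sum:

import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))   # standard normal CDF

def gelu(x):
    return x * Phi(x)

def B_gelu(x, h):
    return gelu(x + h) - 2.0 * gelu(x) + gelu(x - h)

h = 0.7
x = np.linspace(-15.0, 15.0, 300001)
dx = x[1] - x[0]
print("integral of B  :", np.sum(B_gelu(x, h)) * dx)             # Lemma 11: h^2
print("integral of |B|:", np.sum(np.abs(B_gelu(x, h))) * dx)     # at most 3.7 h^2 by Lemma 11
print("h^2, 3.7 h^2   :", h ** 2, 3.7 * h ** 2)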
Theorem 5 (\mathcal{B}-function for GELU in $n$ dimensions).

Let $\boldsymbol{x}=\left(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\right)\in\mathbb{R}^{n}$. Define $\mathfrak{B}\left(\boldsymbol{x}\right)=\mathcal{B}\left(\boldsymbol{x}_{n},\mathcal{B}\left(\boldsymbol{x}_{n-1},\ldots,\mathcal{B}\left(\boldsymbol{x}_{1},1\right)\right)\right)$; then $\mathfrak{B}\left(\boldsymbol{x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)$.

7.5 The Sigmoid Linear Unit (SiLU) is an nAI

SiLU, also known as sigmoid shrinkage or swish [66, 67, 68, 69], is defined as $\sigma\left(x\right):=\frac{x}{1+e^{-x}}$. By inspection, the second derivative of $\sigma\left(x\right)$ is bounded and its graph is quite close to a bell shape. This suggests defining

\mathcal{B}\left(x,h\right):=\delta_{h}^{2}\left[\sigma\right]\left(x\right),\quad h\in\mathbb{R}.

The proofs of the following results are similar to those of Lemma 11 and Theorem 5.

Theorem 6 (\mathcal{B}-function for SiLU in $n$ dimensions).
  1. i)

    $\mathcal{B}\left(x,h\right)$ is an even function in both $x$ and $h$.

  2. ii)

    $\mathcal{B}\left(x,h\right)$ has two symmetric roots $\pm x^{*}$ with $h<x^{*}<\max\left\{3,2h\right\}$, and $\left|\mathcal{B}\left(x,h\right)\right|\leq\frac{1}{2}h^{2}$.

  3. iii)

    _(x,h)𝑑x=h2, and _|(x,h)|𝑑x265h2.\int_{\_}{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h^{2},\text{ and }\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\frac{26}{5}h^{2}.

  4. iv)

    Let 𝒙=(𝒙_1,,𝒙_n){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}1,\ldots,{\boldsymbol{{x}}}_{\_}n\right). Define 𝔅(𝒙)=(𝒙_n,(𝒙_n1,,(𝒙_1,1)))\mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({\boldsymbol{{x}}}_{\_}n,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right)\right), then 𝔅(𝒙)1(n)\mathfrak{B}\left({\boldsymbol{{x}}}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right).
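The n-dimensional \mathfrak{B}-function in item iv) (and likewise in Theorems 4, 5, and 7) is built by feeding the output of one one-dimensional \mathcal{B}-function into the next as the step size. The following sketch (ours; plain Python, instantiated with the SiLU-based \mathcal{B}-function, and with an arbitrary sample size and integration box) illustrates the construction and gives a crude check that the \mathcal{L}^{1} norm is finite.

import math, random

def silu(x):
    return x / (1.0 + math.exp(-x))

def B(x, h):
    # second-order central difference of SiLU
    return silu(x + h) - 2.0 * silu(x) + silu(x - h)

def frakB(xs):
    # frakB(x) = B(x_n, B(x_{n-1}, ..., B(x_1, 1))) for xs = (x_1, ..., x_n)
    h = 1.0
    for x in xs:
        h = B(x, h)
    return h

print(frakB((0.3, -1.2, 0.5)))     # a pointwise evaluation in three dimensions

random.seed(0)
n, N, box = 3, 100_000, 10.0       # integrate over [-box, box]^n; the tails are negligible there
vol = (2.0 * box) ** n
est = vol * sum(abs(frakB([random.uniform(-box, box) for _ in range(n)]))
                for _ in range(N)) / N
print(est)                         # a crude, noisy Monte Carlo estimate of the (finite) L^1 norm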

7.6 The Mish unit is an nAI

The Mish unit, introduced in [70], is defined as

σ(x):=xtanh(ln(1+ex)).\sigma\left({x}\right):={x}\tanh\left(\ln\left(1+e^{x}\right)\right).

Due to its similarity with SiLU, we define

(x,h):=δh2[σ](x).\mathcal{B}\left({x},h\right):=\delta_{h}^{2}\left[\sigma\right]\left({x}\right).

Unlike the preceding activation functions, Mish is not straightforward to manipulate analytically. This motivates us to devise a new approach to showing that Mish is an nAI. As will be shown in Section 8, this approach allows us to unify the nAI property for all activation functions. We begin with the following result on the second derivative of Mish.

Lemma 12.

The second derivative of Mish, σ(2)(x)\sigma^{(2)}\left({x}\right), is continuous, bounded, and integrable. That is, |σ(2)(x)|M\left|\sigma^{(2)}\left({x}\right)\right|\leq M for some positive constant MM and σ(2)(x)1()\sigma^{(2)}\left({x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right).

Theorem 7.

Let hh\in\mathbb{R}, then the following hold:

  1. i)

    There exists 0<M<0<M<\infty such that |(x,h)|Mh2\left|\mathcal{B}\left({x},h\right)\right|\leq Mh^{2}, and _|(x,h)|𝑑xσ(2)_1()h2.\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\left\|\sigma^{(2)}\right\|_{\_}{\mathcal{L}^{1}\left(\mathbb{R}\right)}h^{2}.

  2. ii)

    Let 𝒙=(𝒙_1,,𝒙_n){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}1,\ldots,{\boldsymbol{{x}}}_{\_}n\right). Define 𝔅(𝒙)=(𝒙_n,(𝒙_n1,,(𝒙_1,1)))\mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({\boldsymbol{{x}}}_{\_}n,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right)\right), then 𝔅(𝒙)1(n)\mathfrak{B}\left({\boldsymbol{{x}}}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right).
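Lemma 12 is also easy to probe numerically. The sketch below (ours; the step size and the test points are arbitrary) approximates \sigma^{(2)} for Mish by a second central difference of \sigma itself and illustrates that it stays bounded near the origin and decays quickly in both tails.

import math

def mish(x):
    # x * tanh(softplus(x)); log1p keeps the softplus accurate for negative x
    sp = math.log1p(math.exp(x)) if x < 30.0 else x   # softplus, with a large-x shortcut
    return x * math.tanh(sp)

def d2(x, eps=1e-2):
    # finite-difference approximation of the second derivative of Mish
    return (mish(x + eps) - 2.0 * mish(x) + mish(x - eps)) / eps**2

for x in (-20.0, -8.0, -2.0, 0.0, 2.0, 8.0, 20.0):
    print(x, d2(x))   # modest values near the origin, essentially zero in the tails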

8 A general framework for nAI

Inspired by the nAI proof of the Mish activation function in Section 7.6, we develop a general framework for nAI that requires only two conditions on the kkth derivative of any activation σ(x)\sigma\left({x}\right). The beauty of the general framework is that it provides a single nAI proof that is valid for a large class of functions including all activation functions that we have considered. The trade-off is that less can be said about the corresponding \mathcal{B}-functions. To begin, suppose that there exists kk\in\mathbb{N} such that

  1. C1

    Integrability: σ(k)(x)\sigma^{(k)}\left({x}\right) is integrable, i.e., σ(k)(x)1()\sigma^{(k)}\left({x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right), and

  2. C2

    Essential boundedness: there exists M<M<\infty such that σ(k)_M\left\|\sigma^{(k)}\right\|_{\_}\infty\leq M.

Note that if the two conditions C1 and C2 hold for k=0 (e.g., Gaussian activation functions), then the activation is obviously an nAI. Thus we only consider k\in\mathbb{N}, and the \mathcal{B}-function for \sigma\left({x}\right) can be defined via the kth-order central finite difference:

(x,h):=δhk[σ](x)=_i=0k(1)i(ki)σ(x+(k2i)h).\mathcal{B}\left({x},h\right):=\delta_{h}^{k}\left[\sigma\right]\left({x}\right)=\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}\sigma\left({x}+\left(\frac{k}{2}-i\right)h\right).
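In code, the \mathcal{B}-function above is just a kth-order central difference and can be formed for any candidate activation. The snippet below is a generic sketch (ours; the function names are not from the paper), shown with \tanh and k=2 purely as an example.

import math

def central_difference(sigma, x, h, k):
    # delta_h^k[sigma](x) = sum_i (-1)^i C(k, i) sigma(x + (k/2 - i) h)
    return sum((-1) ** i * math.comb(k, i) * sigma(x + (k / 2.0 - i) * h)
               for i in range(k + 1))

# k = 2 recovers the B-function used for ELU, GELU, SiLU, and Mish
print(central_difference(math.tanh, 0.0, 0.5, 2))   # tanh is odd, so this vanishes at x = 0
print(central_difference(math.tanh, 1.0, 0.5, 2))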
Lemma 13.

For any hh\in\mathbb{R}, there holds:

(x,h)=hk(k1)!_i=0k(1)i(ki)(k2i)k_01σ(k)(x+s(k2i)h)(1s)k1𝑑s.\mathcal{B}\left({x},h\right)=\frac{h^{k}}{\left(k-1\right)!}\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}\left(\frac{k}{2}-i\right)^{k}\int_{\_}0^{1}\sigma^{(k)}\left({x}+s\left(\frac{k}{2}-i\right)h\right)\left(1-s\right)^{k-1}\,ds.
Proof.

Applying the Taylor theorem gives

σ(x+(k2i)h)=_j=0k1σ(j)(x)(k2i)jhjj!+hk(k1)!×(k2i)k_01σ(k)(x+s(k2i)h)(1s)k1𝑑s.\sigma\left({x}+\left(\frac{k}{2}-i\right)h\right)=\sum_{\_}{j=0}^{k-1}\sigma^{(j)}\left({x}\right)\left(\frac{k}{2}-i\right)^{j}\frac{h^{j}}{j!}+\frac{h^{k}}{\left(k-1\right)!}\\ \times\left(\frac{k}{2}-i\right)^{k}\int_{\_}0^{1}\sigma^{(k)}\left({x}+s\left(\frac{k}{2}-i\right)h\right)\left(1-s\right)^{k-1}\,ds.

The proof is concluded if we can show that

_i=0k(1)i(ki)_j=0k1σ(j)(x)(k2i)jhjj!=_j=0k1σ(j)(x)hjj!_i=0k(1)i(ki)(k2i)j=0,\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}\sum_{\_}{j=0}^{k-1}\sigma^{(j)}\left({x}\right)\left(\frac{k}{2}-i\right)^{j}\frac{h^{j}}{j!}=\sum_{\_}{j=0}^{k-1}\sigma^{(j)}\left({x}\right)\frac{h^{j}}{j!}\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}\left(\frac{k}{2}-i\right)^{j}=0,

but this is clear by the alternating sum identity

_i=0k(1)i(ki)ij=0, for j=0,,k1.\sum_{\_}{i=0}^{k}\left(-1\right)^{i}\begin{pmatrix}k\\ i\end{pmatrix}i^{j}=0,\quad\text{ for }j=0,\ldots,k-1.
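The alternating sum identity invoked above can be spot-checked directly; the few lines below (ours) verify it for small k.

import math

# sum_i (-1)^i C(k, i) i^j = 0 for j = 0, ..., k-1
for k in range(1, 7):
    for j in range(k):
        s = sum((-1) ** i * math.comb(k, i) * i ** j for i in range(k + 1))
        assert s == 0, (k, j)
print("identity verified for k = 1, ..., 6")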

Theorem 8.

Let h\in\mathbb{R}. Then: i) there exists N<\infty such that \left|\mathcal{B}\left({x},h\right)\right|\leq N\left|h\right|^{k}; ii) there exists C<\infty such that \int_{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq C\left|h\right|^{k}; and iii) let {\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{1},\ldots,{\boldsymbol{{x}}}_{n}\right)\in\mathbb{R}^{n} and define \mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({\boldsymbol{{x}}}_{n},\mathcal{B}\left({\boldsymbol{{x}}}_{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{1},1\right)\right)\right); then \mathfrak{B}\left({\boldsymbol{{x}}}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right).

Proof.

The first assertion is straightforward by invoking assumption C2, Lemma 13, and defining N=\frac{M}{k!}\sum_{i=0}^{k}\begin{pmatrix}k\\ i\end{pmatrix}\left|\frac{k}{2}-i\right|^{k}. For the second assertion, using the triangle inequality and the Fubini theorem yields

\int_{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\frac{\left|h\right|^{k}}{\left(k-1\right)!}\sum_{i=0}^{k}\begin{pmatrix}k\\ i\end{pmatrix}\left|\frac{k}{2}-i\right|^{k}\times\int_{0}^{1}\left(1-s\right)^{k-1}\int_{\mathbb{R}}\left|\sigma^{(k)}\left({x}+s\left(\frac{k}{2}-i\right)h\right)\right|\,d{x}\,ds=\frac{N}{M}\left\|\sigma^{(k)}\right\|_{\mathcal{L}^{1}\left(\mathbb{R}\right)}\left|h\right|^{k},

and, by defining C=NMσ(k)_1()C=\frac{N}{M}\left\|\sigma^{(k)}\right\|_{\_}{\mathcal{L}^{1}\left(\mathbb{R}\right)}, the result follows owing to assumption C1. The proof of the last assertion is the same as the proof of Theorem 5. In particular, we have

_n|𝔅(𝒙)|𝑑𝒙CkMnk1n1k<.\int_{\_}{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\,d{\boldsymbol{{x}}}\leq C^{k}M^{\frac{n^{k}-1}{n-1}-k}<\infty.

Remark 5.

Note that Theorem 8 is valid for all activation functions considered in Section 7 with an appropriate k: for example, k=q+1 for RePU of order q, k=1 for generalized sigmoidal functions, k=2 for ELU, GELU, SiLU, and Mish, and k=0 for the Gaussian.

Remark 6.

Suppose a function \sigma\left({\boldsymbol{{x}}}\right) satisfies both conditions C1 and C2. Theorem 8 implies that \sigma\left({\boldsymbol{{x}}}\right) and all of its jth derivatives, j=1,\ldots,k, are not only valid activation functions but also universal. For example, any differentiable and non-decreasing sigmoidal function satisfies the conditions with k=1, and therefore both the function itself and its derivative are universal.

Remark 7.

In one dimension, if σ\sigma is of bounded variation, i.e. its total variation TV(σ)TV\left(\sigma\right) is finite, then (x,h)\mathcal{B}\left({x},h\right) resides in 1()()\mathcal{L}^{1}\left(\mathbb{R}\right)\cap\mathcal{L}^{\infty}\left(\mathbb{R}\right) by taking k=1k=1. A simple proof of this fact can be found in [51, Corollary 3]. Thus, σ\sigma is an nAI.

Remark 8.

We have used central finite differences for convenience, but Lemma 13, and hence Theorem 8, also holds for kkth-order forward and backward finite differences. The proofs are indeed almost identical.

9 Universality with non-asymptotic rates

This section exploits Lemma 5 and Lemma 6 to study the convergence of the neural network (5) to the ground truth function f\left({\boldsymbol{{x}}}\right)\in\mathcal{C}_{c}\left[-1,1\right]^{n} as a function of the number of neurons. From (3), we need to estimate the modulus of continuity of f\left({\boldsymbol{{x}}}\right) (the first and the second terms) and the decaying property of the tail (the third term) of the approximate identity \mathcal{B}_{\theta} (and hence the tail of the activation \sigma\left({\boldsymbol{{x}}}\right)). For the former, we further assume that f\left({\boldsymbol{{x}}}\right) is Lipschitz so that the first and the second terms can be estimated as \omega\left(f,\delta\right)=\mathcal{O}\left(\delta\right) and \omega\left(f,\left|\mathcal{P}^{m}\right|\right)=\mathcal{O}\left(\left|\mathcal{P}^{m}\right|\right). To balance these two terms, we need to pick the partition \mathcal{P}^{m} such that \left|\mathcal{P}^{m}\right|\approx\delta. From (1) and Lemma 4, we conclude m=\mathcal{O}\left(\delta^{-1}\right), and the total number of quadrature points scales as m^{n}=\mathcal{O}\left(\delta^{-n}\right). It follows from (5) that the total number of neurons is N=\mathcal{O}\left(k\delta^{-n}\right). Conversely, for a given number of neurons N, the error from the first two terms scales as \delta=\mathcal{O}\left(N^{-1/n}\right), which is quite pessimistic. This is due to the tensor-product quadrature rule. The result can be improved using Monte Carlo estimation of integrals at the expense of losing deterministic estimates. Indeed, let {\boldsymbol{\xi}}^{i}, i=1,\ldots,N, be independent and identically distributed (i.i.d.) according to the uniform distribution on \left[-1,1\right]^{n}. A Monte Carlo counterpart of (5) is given as

𝒮~(m,fθ)(𝒙)=1N_j=1Nf(𝝃j)_=1kα_σ[𝒘θ(𝒙𝝃j)+b_],\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)=\frac{1}{N}\sum_{\_}{j=1}^{N}f\left({\boldsymbol{\xi}}^{j}\right)\sum_{\_}{\ell=1}^{k}\alpha_{\_}\ell\sigma\left[\frac{{\boldsymbol{{w}}}^{\ell}}{\theta}\cdot\left({\boldsymbol{{x}}}-{{\boldsymbol{\xi}}^{j}}\right)+{b}_{\_}\ell\right], (6)

which, by the law of large numbers, converges almost surely (a.s.) to fθ(𝒙)f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right) for every 𝒙{\boldsymbol{{x}}}. However, the Monte Carlo mean square error estimate

𝔼_𝝃1,,𝝃N[|𝒮~(m,fθ)(𝒙)fθ(𝒙)|2]=𝒪(N1)\mathbb{E}_{\_}{{\boldsymbol{\xi}}^{1},\ldots,{\boldsymbol{\xi}}^{N}}\left[\left|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right)\right|^{2}\right]=\mathcal{O}\left(N^{-1}\right)

is not useful for our \infty-norm estimate without further technicalities in exchanging expectation and \infty-norm (see, e.g., [RockafellarWetts98, Theorem 14.60] and [71]).

While most, if not all, of the literature concerns deterministic and asymptotic rates, we pursue a non-asymptotic direction as it precisely captures the behavior of the Monte Carlo estimate (6) of f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right), and hence of f\left({\boldsymbol{{x}}}\right). Within the non-asymptotic setting, we now show that, with high probability, any neural network with the nAI property and kN neurons converges with rate N^{-\frac{1}{2}} to any Lipschitz continuous function f\left({\boldsymbol{{x}}}\right) with compact support in \left[-1,1\right]^{n}.
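Before stating the result, the following one-dimensional sketch (ours, not the paper's construction) shows what the Monte Carlo network (6) looks like for the standard sigmoid. Consistent with the telescoping identity in the proof of Lemma 9 and with \int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h, we take \mathcal{B}\left({x}\right)=\frac{1}{2}\left[\sigma\left({x}+1\right)-\sigma\left({x}-1\right)\right] (so k=2 neurons per sample), dilate it as \mathcal{B}_{\theta}\left({x}\right)=\theta^{-1}\mathcal{B}\left({x}/\theta\right), and include explicitly the sampling-volume factor 2 of the uniform density on \left[-1,1\right], which the analysis absorbs into the \mathcal{O}\left(\cdot\right) notation. The target f, the choices of N and \theta, and the helper names are ours.

import math, random

def s(x):
    # standard sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def f(x):
    # a Lipschitz target with compact support in [-1, 1]
    return max(0.0, 1.0 - x * x)

def network(x, xi, theta):
    # k = 2 sigmoid neurons per sample: weights 1/theta, biases +1 and -1
    out = 0.0
    for xj in xi:
        z = (x - xj) / theta
        out += f(xj) * (s(z + 1.0) - s(z - 1.0)) / (2.0 * theta)
    return 2.0 * out / len(xi)   # 2 = |[-1, 1]|, the sampling volume

random.seed(1)
N, theta = 20_000, 0.05
xi = [random.uniform(-1.0, 1.0) for _ in range(N)]
for x in (-0.5, 0.0, 0.5):
    print(x, f(x), network(x, xi, theta))   # the two columns agree up to Monte Carlo noise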

Theorem 9.

Let {\boldsymbol{\xi}}^{i}, i=1,\ldots,N, be i.i.d. samples from the uniform distribution on \left[-1,1\right]^{n}. Suppose that f\left({\boldsymbol{{x}}}\right) is Lipschitz and \mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right) is continuous. Furthermore, let \delta=\mathcal{O}\left(\frac{C}{\sqrt{N}}\right) and \theta=\mathcal{O}\left(\delta^{2}\right) for some C>0. Then there exist three absolute constants \alpha, \ell, and L such that

𝒮~(m,fθ)(𝒙)f(𝒙)_CN\left\|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f\left({\boldsymbol{{x}}}\right)\right\|_{\_}\infty\leq\frac{C}{\sqrt{N}} (7)

holds with probability at least 1-2e^{-2\frac{\left(C-\alpha\sqrt{N}\right)^{2}}{\left(L-\ell\right)^{2}}}.

Proof.

If we define

gj(𝒙):=f(𝝃j)θ(𝒙𝝃j)g^{j}\left({\boldsymbol{{x}}}\right):=f\left({\boldsymbol{\xi}}^{j}\right)\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}-{\boldsymbol{\xi}}^{j}\right)

then gj(𝒙)g^{j}\left({\boldsymbol{{x}}}\right), j=1,,Nj=1,\ldots,N, are independent random variables and 𝒮~(m,fθ)(𝒙)=1N_j=1Ngj(𝒙)\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)=\frac{1}{N}\sum_{\_}{j=1}^{N}g^{j}\left({\boldsymbol{{x}}}\right). Owing to the continuity of ff and θ\mathcal{B}_{\theta}, we know that there exist two absolute constants \ell and LL such that gj(𝒙)L\ell\leq g^{j}\left({\boldsymbol{{x}}}\right)\leq L. Let h(𝒙):=|𝒮~(m,fθ)(𝒙)fθ(𝒙)|h\left({\boldsymbol{{x}}}\right):=\left|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right)\right|, then by Hoeffding inequality [72, 73] we have

[h(𝒙)>ε]2e2Nε2(L)2\mathbb{P}\left[h\left({\boldsymbol{{x}}}\right)>\varepsilon\right]\leq 2e^{-2N\frac{\varepsilon^{2}}{\left(L-\ell\right)^{2}}} (8)

for each {\boldsymbol{{x}}}, where \mathbb{P}\left[\cdot\right] stands for probability. An application of the triangle inequality gives

|h(𝒙)h(𝒚)|2(L+sup_𝒙[1,1]nfθ(𝒙))=:α,\left|h\left({\boldsymbol{{x}}}\right)-h\left({\boldsymbol{{y}}}\right)\right|\leq 2\left(L+\sup_{\_}{{\boldsymbol{{x}}}\in\left[-1,1\right]^{n}}f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right)\right)=:\alpha,

where \alpha is finite since f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right) is continuous on the compact set \left[-1,1\right]^{n}. Thus, for a given {\boldsymbol{{y}}}, there holds

h(𝒚)h(𝒙)α𝒙[1,1]n,h\left({\boldsymbol{{y}}}\right)\geq h\left({\boldsymbol{{x}}}\right)-\alpha\quad\forall{\boldsymbol{{x}}}\in\left[-1,1\right]^{n},

that is,

h(𝒚)h(𝒙)_α.h\left({\boldsymbol{{y}}}\right)\geq\left\|h\left({\boldsymbol{{x}}}\right)\right\|_{\_}\infty-\alpha.

Consequently, using the tail bound (8) yields

\mathbb{P}\left[\left\|h\left({\boldsymbol{{x}}}\right)\right\|_{\infty}>\frac{C}{\sqrt{N}}\right]\leq\mathbb{P}\left[h\left({\boldsymbol{{y}}}\right)>\frac{C}{\sqrt{N}}-\alpha\right]\leq 2e^{-2\frac{\left(C-\alpha\sqrt{N}\right)^{2}}{\left(L-\ell\right)^{2}}}.

It follows that

𝒮~(m,fθ)(𝒙)fθ(𝒙)_CN\left\|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f*\mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right)\right\|_{\_}\infty\leq\frac{C}{\sqrt{N}} (9)

with probability at least 1-2e^{-2\frac{\left(C-\alpha\sqrt{N}\right)^{2}}{\left(L-\ell\right)^{2}}} for any C. Clearly, we need to pick either C\ll\alpha\sqrt{N} or C\gg\alpha\sqrt{N}. The former is more favorable as it makes the error in (9) small with high probability, whereas the latter could lead to a large error with high probability. Now, choosing \delta=\mathcal{O}\left(\frac{C}{\sqrt{N}}\right) and following the proof of Lemma 5 we arrive at

𝒮~(m,fθ)(𝒙)f(𝒙)_=𝒪(CN+𝒯(θ,CN)).\left\|\tilde{\mathcal{S}}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)-f\left({\boldsymbol{{x}}}\right)\right\|_{\_}\infty=\mathcal{O}\left(\frac{C}{\sqrt{N}}+\mathcal{T}\left(\mathcal{B}_{\theta},\frac{C}{\sqrt{N}}\right)\right).

We now estimate 𝒯(θ,CN)=_𝒚>CN|θ(𝒚)|d𝒚\mathcal{T}\left(\mathcal{B}_{\theta},\frac{C}{\sqrt{N}}\right)=\int_{\_}{\left\|{\boldsymbol{{y}}}\right\|>\frac{C}{\sqrt{N}}}\left|\mathcal{B}_{\theta}\left({\boldsymbol{{y}}}\right)\right|\,d{\boldsymbol{{y}}}. By Markov inequality we have

𝒯(θ,CN)θNC,\mathcal{T}\left(\mathcal{B}_{\theta},\frac{C}{\sqrt{N}}\right)\leq\frac{\theta\sqrt{N}}{C},

where we have used the fact that (𝒙)_1(n)=1\left\|\mathcal{B}\left({\boldsymbol{{x}}}\right)\right\|_{\_}{\mathcal{L}^{1}\left(\mathbb{R}^{n}\right)}=1. Now taking θ=𝒪(δ2)=𝒪(C2N)\theta=\mathcal{O}\left(\delta^{2}\right)=\mathcal{O}\left(\frac{C^{2}}{N}\right) yields

𝒯(θ,CN)=𝒪(CN),\mathcal{T}\left(\mathcal{B}_{\theta},\frac{C}{\sqrt{N}}\right)=\mathcal{O}\left(\frac{C}{\sqrt{N}}\right),

and this concludes the proof. ∎

Remark 9.

The continuity of \mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right) in Theorem 9 is only sufficient; all we need is the boundedness of \mathcal{B}_{\theta}\left({\boldsymbol{{x}}}\right). Theorem 9 is thus valid for all activation functions that we have discussed, including those in Section 8. Theorem 9 provides not only theoretical insights into the required number of neurons but also a guide on how to choose the number of neurons to achieve a certain accuracy with a controllable success probability. Perhaps more importantly, it shows that, with non-zero probability, a neural network may fail to provide a desired testing error in practice once an architecture, and hence a number of neurons, is selected.
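As a simple illustration of this remark, the sketch below (ours) evaluates the failure-probability bound of Theorem 9 for a fixed target accuracy \varepsilon=C/\sqrt{N} and several N; the constants \alpha, L, and \ell are made up, since in practice they depend on f, the activation, and \theta.

import math

alpha, L, ell = 2.0, 1.0, -1.0        # hypothetical absolute constants
eps = 0.05                            # target sup-norm accuracy, so C = eps * sqrt(N)
for N in (10**2, 10**3, 10**4, 10**5):
    C = eps * math.sqrt(N)
    fail = 2.0 * math.exp(-2.0 * (C - alpha * math.sqrt(N)) ** 2 / (L - ell) ** 2)
    print(N, C, min(1.0, fail))       # the failure-probability bound decays rapidly with N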

10 Conclusions

We have presented a constructive framework for neural network universality. At the heart of the framework is the neural network approximate identity (nAI) concept that allows us to unify most activations under the same umbrella. Indeed, we have shown that most existing activations are nAI, and thus universal in the space of continuous functions on compacta. We have also shown that for an activation to be an nAI, it is sufficient to verify that its kth derivative, k\geq 0, is essentially bounded and integrable. The framework induces several advantages over contemporary approaches. First, our approach is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first attempt to unify the universality for most existing activation functions. Third, as a by-product, the framework provides the first universality proof for some of the existing activation functions, including Mish, SiLU, ELU, and GELU. Fourth, it provides a new universality proof for most activation functions. Fifth, it discovers new activation functions with a guaranteed universality property. Sixth, for each activation, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons, and the values of the weights/biases. Seventh, the framework enables us to develop the first abstract universality result with a favorable non-asymptotic rate of N^{-\frac{1}{2}}, where N is the number of neurons. Our framework also provides insights into the derivations of some of the existing approaches. Ongoing work is to build upon our framework to study the universal approximation properties of convolutional neural networks and deep neural networks. Part of the future work is also to exploit the unified nature of the framework to study which activation is better, in which sense, for a wide range of classification and regression tasks.

Funding

This work is partially supported by National Science Foundation awards NSF-OAC-2212442, NSF-2108320, NSF-1808576, and NSF-CAREER-1845799; by Department of Energy awards DE-SC0018147 and DE-SC0022211; and by a 2021 UT-Portugal CoLab award, and we are grateful for the support. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. National Science Foundation or the U.S. Department of Energy or the United States Government. We would like to thank Professor Christoph Schwab and Professor Ian Sloan for pointing out a technical mistake in the preprint.

Appendix A Extension to 𝒞[1,1]n\mathcal{C}\left[-1,1\right]^{n}

In this section, we present an approach to extend the results in Sections 5 and 9 to continuous functions. To build directly upon the results in those sections, without loss of generality we consider \mathcal{C}\left[-b,b\right]^{n} with 0<\sqrt{2}b<1. Consider g\in\mathcal{C}\left[-b,b\right]^{n} and let \hat{g}\in\mathcal{C}\left(\mathbb{R}^{n}\right) be an extension of g, that is, \left.\hat{g}\right\rvert_{\left[-b,b\right]^{n}}=g and \left\|\hat{g}\right\|_{\infty}=\left\|g\right\|_{\infty} [74]. Next, let us take \varphi\in\mathcal{C}_{b}\left[-1,1\right]^{n} such that: i) \left.\varphi\right\rvert_{\left[-b,b\right]^{n}}=1, and ii) \omega\left(\varphi,h\right)=\mathcal{O}\left(\omega\left(g,h\right)\right) for any h>0 (a concrete choice of \varphi is sketched after the list below). Define f:=\hat{g}\varphi; then it is easy to see that:

  • f|_[b,b]n=g\left.f\right\rvert_{\_}{\left[-b,b\right]^{n}}=g,

  • f𝒞_c[1,1]nf\in\mathcal{C}_{\_}c\left[-1,1\right]^{n},

  • f_=𝒪(g_)\left\|f\right\|_{\_}\infty=\mathcal{O}\left(\left\|g\right\|_{\_}\infty\right), and

  • ω(f,h)=𝒪(ω(g,h))\omega\left(f,h\right)=\mathcal{O}\left(\omega\left(g,h\right)\right).
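For concreteness, one admissible choice of \varphi (a sketch of ours, not specified in the original argument, and assuming g is not constant) is the tensor-product piecewise-linear bump

\varphi\left({\boldsymbol{{x}}}\right)=\prod_{i=1}^{n}\psi\left({\boldsymbol{{x}}}_{i}\right),\qquad\psi\left(t\right)=\begin{cases}1,&\left|t\right|\leq b,\\ \frac{1-\left|t\right|}{1-b},&b<\left|t\right|\leq 1,\\ 0,&\left|t\right|>1.\end{cases}

Indeed, \varphi=1 on \left[-b,b\right]^{n}, \varphi vanishes outside \left[-1,1\right]^{n}, and \varphi is Lipschitz, so \omega\left(\varphi,h\right)=\mathcal{O}\left(h\right); since \omega\left(g,\cdot\right) is subadditive, a non-constant g satisfies h=\mathcal{O}\left(\omega\left(g,h\right)\right) for small h, and hence \omega\left(\varphi,h\right)=\mathcal{O}\left(\omega\left(g,h\right)\right).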

Now, applying Lemma 5 we obtain

f(𝒙)𝒮(m,fθ)(𝒙)=𝒪(ω(f,δ)+ω(f,|𝒫m|)+𝒯(θ,δ))=𝒪(ω(g,δ)+ω(g,|𝒫m|)+𝒯(θ,δ))f\left({\boldsymbol{{x}}}\right)-\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)=\mathcal{O}\left(\omega\left(f,\delta\right)+\omega\left(f,\left|\mathcal{P}^{m}\right|\right)+\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)\right)\\ =\mathcal{O}\left(\omega\left(g,\delta\right)+\omega\left(g,\left|\mathcal{P}^{m}\right|\right)+\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)\right)

for all 𝒙n{\boldsymbol{{x}}}\in\mathbb{R}^{n}. Thus, by restricting 𝒙[b,b]n{\boldsymbol{{x}}}\in\left[-b,b\right]^{n} we arrive at

g(𝒙)𝒮(m,fθ)(𝒙)=𝒪(ω(g,δ)+ω(g,|𝒫m|)+𝒯(θ,δ)).g\left({\boldsymbol{{x}}}\right)-\mathcal{S}\left(m,f*\mathcal{B}_{\theta}\right)\left({\boldsymbol{{x}}}\right)=\mathcal{O}\left(\omega\left(g,\delta\right)+\omega\left(g,\left|\mathcal{P}^{m}\right|\right)+\mathcal{T}\left(\mathcal{B}_{\theta},\delta\right)\right).

This, together with Lemma 6, ensures the universality of any nAI activation in 𝒞[b,b]n\mathcal{C}\left[-b,b\right]^{n}. The extension for Lipschitz continuous functions on [b,b]n\left[-b,b\right]^{n} for Section 9 follows similarly, again using the key extension results in [74].

Appendix B Proofs of results in Section 7

This section presents the detailed proofs of results in Section 7.

Proof of Lemma 7.

We start by defining X_{r}:=\sum_{i=1}^{r}U_{i}, where the U_{i} are independent and identically distributed uniform random variables on \left[-\frac{1}{2},\frac{1}{2}\right]. Following [75, 76, 77], X_{r} is distributed according to the Irwin-Hall distribution with the probability density function

f_{r}\left({x}\right):=\frac{1}{q!}\sum_{i=0}^{\lfloor{x}+\frac{r}{2}\rfloor}(-1)^{i}\begin{pmatrix}r\\ i\end{pmatrix}\left({x}+\frac{r}{2}-i\right)^{q},

where \lfloor{x}\rfloor denotes the largest integer not exceeding {x}. Using the definition of RePU, it is easy to see that f_{r}\left({x}\right) can also be written in terms of RePU as follows

f_{r}\left({x}\right):=\frac{1}{q!}\sum_{i=0}^{r}(-1)^{i}\begin{pmatrix}r\\ i\end{pmatrix}\text{RePU}\left(q;{x}+\frac{r}{2}-i\right),

which in turns implies

(x,h)=hqf_r(xh).\mathcal{B}\left({x},h\right)=h^{q}f_{\_}r\left(\frac{{x}}{h}\right). (10)

In other words, \mathcal{B}\left({x},h\right) is a dilated version of f_{r}\left({x}\right). Thus, all the properties of f_{r}\left({x}\right) hold for \mathcal{B}\left({x},h\right). In particular, all the assertions of Lemma 7 hold. Note that the compact support can be alternatively shown using a property of central finite differences. Indeed, it is easy to see that for {x}\geq\frac{rh}{2} we have

δhr[RePU](x)=δhr[xq]=0.\delta_{h}^{r}\left[\text{RePU}\right]\left({x}\right)=\delta_{h}^{r}\left[{x}^{q}\right]=0.
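The RePU representation of the Irwin-Hall density above is straightforward to cross-check numerically. The sketch below (ours; plain Python, with arbitrary sample size and bin width) compares the closed form f_{r} against a Monte Carlo histogram of X_{r}=\sum_{i=1}^{r}U_{i}.

import math, random

def repu(q, x):
    return x ** q if x > 0.0 else 0.0

def f_r(r, x):
    # Irwin-Hall density of X_r via RePU, with q = r - 1
    q = r - 1
    return sum((-1) ** i * math.comb(r, i) * repu(q, x + r / 2.0 - i)
               for i in range(r + 1)) / math.factorial(q)

random.seed(0)
r, samples, width = 4, 200_000, 0.2
for x in (-1.5, -0.5, 0.0, 0.5, 1.5):
    hits = sum(1 for _ in range(samples)
               if abs(sum(random.uniform(-0.5, 0.5) for _ in range(r)) - x) < width / 2)
    print(x, f_r(r, x), hits / (samples * width))   # the two columns should agree closely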

Proof of Theorem 1.

The first three assertions are direct consequences of Lemma 7. For the fourth assertion, since 𝔅(𝒙)0\mathfrak{B}\left({\boldsymbol{{x}}}\right)\geq 0 it is sufficient to show n𝔅(𝒙)𝑑𝒙1\int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}\leq 1 and we do so in three steps. Let r=q+1r=q+1 and define 𝒙=(𝒙_n+1,𝒙_n,,𝒙_1)=(𝒙_n+1,𝒚){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{x}}}_{\_}n,\ldots,{\boldsymbol{{x}}}_{\_}1\right)=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{y}}}\right). We first show by induction that (x,1)1\mathcal{B}\left({x},1\right)\leq 1 for qq\in\mathbb{N} and x{x}\in\mathbb{R}. The claim is clearly true for q={0,1}q=\left\{0,1\right\}. Suppose the claim holds for qq, then (10) implies

f_r(x)1,x.f_{\_}{r}\left({x}\right)\leq 1,\quad\forall{x}\in\mathbb{R}.

For q+1q+1 we have

(0)=f_r+1(0)=f_rf_1(0)=_f_r(y)f_1(y)𝑑y_f_1(y)𝑑y=1.\mathcal{B}\left(0\right)=f_{\_}{r+1}\left(0\right)=f_{\_}{r}*f_{\_}1\left(0\right)=\int_{\_}\mathbb{R}f_{\_}{r}\left(-{y}\right)f_{\_}1\left({y}\right)\,d{y}\leq\int_{\_}\mathbb{R}f_{\_}1\left({y}\right)\,d{y}=1.

By the second assertion, we conclude that (x,1)1\mathcal{B}\left({x},1\right)\leq 1 for any qq\in\mathbb{N} and x{x}\in\mathbb{R}.

In the second step, we show 𝔅(𝒙)1\mathfrak{B}\left({\boldsymbol{{x}}}\right)\leq 1 by induction on nn for any 𝒙n{\boldsymbol{{x}}}\in\mathbb{R}^{n} and any qq\in\mathbb{N}. The result holds for n=1n=1 due to the first step. Suppose the claim is true for nn. For n+1n+1, we have

𝔅(𝒙)=(𝒙_n+1,𝔅(𝒚))=[𝔅(𝒚)]qf_r(𝒙_n+1𝔅(𝒚))1,\mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)=\left[\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right]^{q}f_{\_}r\left(\frac{{\boldsymbol{{x}}}_{\_}{n+1}}{\mathfrak{B}\left({\boldsymbol{{y}}}\right)}\right)\leq 1,

where we have used (10) in the last equality, and the first step together with the induction hypothesis in the last inequality.

In the last step, we show \int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}\leq 1 by induction on n. For n=1, \int_{\mathbb{R}}\mathfrak{B}\left({x}\right)\,d{x}=1 is clear by the Irwin-Hall probability density function. Suppose the result is true for n. For n+1, applying the Fubini theorem gives

_n+1𝔅(𝒙)𝑑𝒙=n(_(𝒙_n+1,𝔅(𝒚))𝑑𝒙_n+1)𝑑𝒚=n[𝔅(𝒚)]q(_f_r(𝒙_n+1𝔅(𝒚))𝑑𝒙_n+1)𝑑𝒚=n[𝔅(𝒚)]r(_f_r(t)𝑑t)𝑑𝒚=n[𝔅(𝒚)]r𝑑𝒚n𝔅(𝒚)𝑑𝒚1,\int_{\_}{\mathbb{R}^{n+1}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}=\int_{\mathbb{R}^{n}}\left(\int_{\_}{\mathbb{R}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}\\ =\int_{\mathbb{R}^{n}}\left[\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right]^{q}\left(\int_{\_}{\mathbb{R}}f_{\_}{r}\left(\frac{{\boldsymbol{{x}}}_{\_}{n+1}}{\mathfrak{B}\left({\boldsymbol{{y}}}\right)}\right)\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}=\int_{\mathbb{R}^{n}}\left[\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right]^{r}\left(\int_{\_}{\mathbb{R}}f_{\_}{r}\left(t\right)\,dt\right)\,d{\boldsymbol{{y}}}\\ =\int_{\mathbb{R}^{n}}\left[\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right]^{r}\,d{\boldsymbol{{y}}}\leq\int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{y}}}\right)\,d{\boldsymbol{{y}}}\leq 1,

where we have used (10) in the second equality, the result of the second step in the second last inequality, and the induction hypothesis in the last inequality. ∎

Proof of Lemma 8.

The proof for \mathcal{B}_{s}\left({x},h\right) is a simple extension of those in [63, 33], and the proof for the hyperbolic tangent follows similarly. Note that \int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h for the sigmoid and the hyperbolic tangent. For \mathcal{B}_{p}\left({x},h\right), due to the global convexity of \sigma_{t}\left({x}\right) we have

ln(1+ex)ln(eh+ex)+ln(eh+ex)2,x,\ln\left(1+e^{{x}}\right)\leq\frac{\ln\left(e^{h}+e^{{x}}\right)+\ln\left(e^{-h}+e^{{x}}\right)}{2},\quad\forall{x}\in\mathbb{R},

which is equivalent to \mathcal{B}_{p}\left({x},h\right)\geq 0 for {x}\in\mathbb{R}. The facts that \mathcal{B}_{p}\left({x},h\right) is even and that \lim_{\left|{x}\right|\to\infty}\mathcal{B}_{p}\left({x},h\right)=0 are obvious by inspection. Since the derivative of \mathcal{B}_{p}\left({x},h\right) is negative for {x}\in(0,\infty) and \mathcal{B}_{p}\left({x},h\right) is even, \mathcal{B}_{p}\left({x},h\right) is unimodal. It follows that \mathcal{B}_{p}\left(0,h\right)=\max_{{x}\in\mathbb{R}}\mathcal{B}_{p}\left({x},h\right).

Next integrating by parts gives

__p(x,h)dx=2_0(eh1)2(1ex)(eh+ex)(eh+ex)(1+ex)exx𝑑x(1eh1+eh)2_0exx𝑑x=(1eh1+eh)2min{1,h2}.\int_{\_}{-\infty}^{\infty}\mathcal{B}_{\_}p\left({x},h\right)\,d{x}=2\int_{\_}{0}^{\infty}\frac{\left(e^{h}-1\right)^{2}\left(1-e^{-{x}}\right)}{\left(e^{h}+e^{-x}\right)\left(e^{-h}+e^{-x}\right)\left(1+e^{-x}\right)}e^{-{x}}{x}\,d{x}\\ \leq\left(\frac{1-e^{-h}}{1+e^{-h}}\right)^{2}\int_{\_}{0}^{\infty}e^{-{x}}{x}\,d{x}=\left(\frac{1-e^{-h}}{1+e^{-h}}\right)^{2}\leq\min\left\{1,h^{2}\right\}. (11)

Thus all the assertions for \mathcal{B}_{p}\left({x},h\right) hold. ∎

Proof of Theorem 2.

We only need to show n𝔅(𝒙)𝑑𝒙1\int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}\leq 1 as the proof for other assertions is similar to that of Theorem 1, and thus is omitted. For sigmoid and hyperbolic tangent the result is clear as

_n𝔅(𝒙)𝑑𝒙=_n(𝒙_n,(𝒙_n1,,(𝒙_1,1)))𝑑𝒙=_n1(𝒙_n1,,(𝒙_1,1))𝑑𝒙_n1d𝒙_1==_(𝒙_1,1)𝑑𝒙_1=1.\int_{\_}{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}=\int_{\_}{\mathbb{R}^{n}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n},\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right)\right)\,d{\boldsymbol{{x}}}\\ =\int_{\_}{\mathbb{R}^{n-1}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n-1},\ldots,\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right)\,d{\boldsymbol{{x}}}_{\_}{n-1}\ldots d{\boldsymbol{{x}}}_{\_}1=\ldots=\int_{\_}{\mathbb{R}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)d{\boldsymbol{{x}}}_{\_}1=1.

For the softplus function, by inspection, \mathcal{B}\left(\boldsymbol{{z}},h\right)\leq\mathcal{B}\left(0,h\right)\leq 1 for all \boldsymbol{{z}}\in\mathbb{R} and 0<h\leq 1. Lemma 8 gives \int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}\leq 1 for n=1. Define {\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{n+1},{\boldsymbol{{x}}}_{n},\ldots,{\boldsymbol{{x}}}_{1}\right)=\left({\boldsymbol{{x}}}_{n+1},{\boldsymbol{{y}}}\right) and suppose the claim holds for n, i.e., \int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{y}}}\right)\,d{\boldsymbol{{y}}}\leq 1. Now, applying (11), we have

_n+1𝔅(𝒙)𝑑𝒙=n(_(𝒙_n+1,𝔅(𝒚))𝑑𝒙_n+1)𝑑𝒚n𝔅(𝒚)𝑑𝒚1,\int_{\_}{\mathbb{R}^{n+1}}\mathfrak{B}\left({\boldsymbol{{x}}}\right)\,d{\boldsymbol{{x}}}=\int_{\mathbb{R}^{n}}\left(\int_{\_}{\mathbb{R}}\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}\leq\int_{\mathbb{R}^{n}}\mathfrak{B}\left({\boldsymbol{{y}}}\right)\,d{\boldsymbol{{y}}}\leq 1,

which ends the proof by induction. ∎

Proof of Lemma 9.

The first assertion is clear by Definition 3. For the second assertion, with a telescoping trick similar to that in [63] we have

s_N(x)=_k=NN(x+kh,h)=12(L)[σ(x+(N+1)h)+σ(x+Nh)σ(xNh)σ(x(N+1)h)]Nsign(h).s_{\_}N\left({x}\right)=\sum_{\_}{k=-N}^{N}\mathcal{B}\left({x}+kh,h\right)=\frac{1}{2\left(L-\ell\right)}\left[\sigma\left({x}+\left(N+1\right)h\right)+\sigma\left({x}+Nh\right)\right.\\ -\left.\sigma\left({x}-Nh\right)-\sigma\left({x}-\left(N+1\right)h\right)\right]\xrightarrow[]{N\to\infty}\text{sign}\left(h\right).

To show the convergence is uniform, we consider only h>0h>0 as the proof for h<0h<0 is similar and for h=0h=0 is obvious. We first consider the right tail. For sufficiently large NN, there exists a constant C>0C>0 such that

\sum_{k=N}^{\infty}\mathcal{B}\left({x}+kh,h\right)\leq C\sum_{k=N}^{\infty}\left({x}+kh\right)^{-1-\alpha}=C\sum_{k=N}^{\infty}\int_{k-1}^{k}\left({x}+kh\right)^{-1-\alpha}\,dy\leq C\int_{N-1}^{\infty}\left({x}+yh\right)^{-1-\alpha}\,dy=\frac{C}{h\alpha}\left[{x}+\left(N-1\right)h\right]^{-\alpha}\leq\frac{C}{h\alpha}\left[m+\left(N-1\right)h\right]^{-\alpha}\xrightarrow[]{N\to\infty}0\text{ independent of }{x}.

Similarly, the left tail converges to 0 uniformly as

\sum_{k=N}^{\infty}\mathcal{B}\left({x}-kh,h\right)\leq\frac{C}{h\alpha}\left[\left(N-1\right)h-M\right]^{-\alpha}\xrightarrow[]{N\to\infty}0\text{ independent of }{x}.

As a consequence, we have

_(x,h)𝑑x=_k=_k|h|(k+1)|h|(x,h)𝑑x=_k=_0|h|(y+k|h|,h)𝑑y=_0|h|_k=(y+k|h|,h)dy=h,\int_{\_}{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=\sum_{\_}{k=-\infty}^{\infty}\int_{\_}{k\left|h\right|}^{\left(k+1\right)\left|h\right|}\mathcal{B}\left({x},h\right)\,d{x}=\sum_{\_}{k=-\infty}^{\infty}\int_{\_}{0}^{\left|h\right|}\mathcal{B}\left(y+k\left|h\right|,h\right)\,dy\\ =\int_{\_}{0}^{\left|h\right|}\sum_{\_}{k=-\infty}^{\infty}\mathcal{B}\left(y+k\left|h\right|,h\right)\,dy=h,

where we have used the uniform convergence in the third equality.

Using the first assertion we have

_|(x,h)|𝑑x=_x|h|x++|h||(x,h)|dx+_x++|h||(x,h)|dx+_x|h||(x,h)|dx(x_px_n+2|h|)+C_x++|h|(x+|h|)1αdx+C_x|h|(x+|h|)1αdx=(x_px_n+2|h|)+Cα[(x++2|h|)α+(x+2|h|)α]<.\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}=\int_{\_}{{x}^{-}-\left|h\right|}^{{x}^{+}+\left|h\right|}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}+\int_{\_}{{x}^{+}+\left|h\right|}^{\infty}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}+\int_{\_}{-\infty}^{{x}^{-}-\left|h\right|}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\\ \leq\left({x}_{\_}p-{x}_{\_}n+2\left|h\right|\right)+C\int_{\_}{{x}^{+}+\left|h\right|}^{\infty}\left({x}+\left|h\right|\right)^{-1-\alpha}\,d{x}+C\int_{\_}{-\infty}^{{x}^{-}-\left|h\right|}\left(-{x}+\left|h\right|\right)^{-1-\alpha}\,d{x}\\ =\left({x}_{\_}p-{x}_{\_}n+2\left|h\right|\right)+\frac{C}{\alpha}\left[\left({x}^{+}+2\left|h\right|\right)^{-\alpha}+\left(-{x}^{-}+2\left|h\right|\right)^{-\alpha}\right]<\infty.

Proof of Theorem 3.

The proof is the same as the proof of Theorem 2 for the standard sigmoidal unit, and thus is omitted. ∎

Proof of Lemma 10.

The expression of \mathcal{B}\left({x},h\right) and direct integration give \int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=\frac{h^{2}}{\gamma}. Similarly, simple algebraic manipulations yield

γ_|(x,h)|𝑑xh2+2|α||h|+2|α|(e|h|1)2<,\gamma\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq h^{2}+2\left|\alpha\right|\left|h\right|+2\left|\alpha\right|\left(e^{-\left|h\right|}-1\right)^{2}<\infty,

and thus (x,h)1()\mathcal{B}\left({x},h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right). The fact that (x,h)1\mathcal{B}\left({x},h\right)\leq 1 for |h|1\left|h\right|\leq 1 holds is straightforward by inspecting the extrema of (x,h)\mathcal{B}\left({x},h\right). ∎

Proof of Theorem 4.

From the proof of Lemma 10 we infer that _|(x,h)|𝑑x|h|\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\left|h\right| for all |h|1\left|h\right|\leq 1: in particular, _|(𝒙_1,1)|𝑑x1\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right|\,d{x}\leq 1. Define 𝒙=(𝒙_n+1,𝒙_n,,𝒙_1)=(𝒙_n+1,𝒚){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{x}}}_{\_}n,\ldots,{\boldsymbol{{x}}}_{\_}1\right)=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{y}}}\right) and suppose the claim holds for nn, i.e., n|𝔅(𝒚)|𝑑𝒚1\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right|\,d{\boldsymbol{{y}}}\leq 1. We have

_n+1|𝔅(𝒙)|𝑑𝒙=n(_|(𝒙_n+1,𝔅(𝒚))|𝑑𝒙_n+1)𝑑𝒚n|𝔅(𝒚)|𝑑𝒚1,\int_{\_}{\mathbb{R}^{n+1}}\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\,d{\boldsymbol{{x}}}=\int_{\mathbb{R}^{n}}\left(\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)\right|\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}\leq\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right|\,d{\boldsymbol{{y}}}\leq 1,

which, by induction, concludes the proof. ∎

Proof of Lemma 11.

The first assertion is straightforward. For the second assertion, it is sufficient to consider {x}\geq 0 and h>0. Any root of \mathcal{B}\left({x},h\right) satisfies the following equation

f(x):=Φ(x+h)Φ(x)Φ(x)Φ(xh)=xhx+h=:g(x).f\left({x}\right):=\frac{\Phi\left({x}+h\right)-\Phi\left({x}\right)}{\Phi\left({x}\right)-\Phi\left({x}-h\right)}=\frac{{x}-h}{{x}+h}=:g\left({x}\right).

Since, given h, f\left({x}\right) is a positive decreasing function and g\left({x}\right) is an increasing function (starting from -1), they can have at most one intersection. Now, for sufficiently large {x}, using the mean value theorem it can be shown that

f(x)exhx0, and g(x)x1,f\left({x}\right)\approx e^{-{x}h}\xrightarrow[]{{x}\to\infty}0,\text{ and }g\left({x}\right)\xrightarrow[]{{x}\to\infty}1,

from which we deduce that there is only one intersection, and hence only one positive root {x}^{*} of \mathcal{B}\left({x},h\right). Next, we notice g\left({x}\right)\leq 0<f\left({x}\right) for 0\leq{x}\leq h. For h\geq 1 it is easy to see f\left(2h\right)<g\left(2h\right)=1/3. Thus, h<{x}^{*}<2h for h\geq 1. For 0<h<1, by inspection we have f\left(2\right)<g\left(2\right), and hence h<{x}^{*}<2. In conclusion, h<{x}^{*}<\max\left\{2,2h\right\}.

The preceding argument also shows that \mathcal{B}\left({x},h\right)\geq 0 for 0\leq{x}\leq{x}^{*}, \mathcal{B}\left({x},h\right)<0 for {x}>{x}^{*}, and \mathcal{B}\left({x},h\right)\xrightarrow[]{{x}\to\infty}0^{-}. Using the Taylor formula together with the bound \left|\sigma^{(2)}\left({x}\right)\right|\leq\frac{2}{\sqrt{2\pi}} for GELU, we have

\left|\mathcal{B}\left({x},h\right)\right|\leq\frac{2}{\sqrt{2\pi}}h^{2}.

For the third assertion, it is easy to verify the following indefinite integral (ignoring the integration constant as it will be canceled out)

xΦ(x)𝑑x=12(x21)Φ(x)+xex2222π,\int{x}\Phi\left({x}\right)\,d{x}={\frac{1}{2}}\left({x}^{2}-1\right)\Phi\left({x}\right)+\frac{xe^{{-\frac{{x}^{2}}{2}}}}{2\sqrt{2\pi}},

which, together with simple calculations, yields

_(x,h)𝑑x=h2.\int_{\_}{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h^{2}.

From the proof of the second assertion and the Taylor formula we can show

_|(x,h)|𝑑x=2_0x(x,h)𝑑x2_x(x,h)𝑑x=3h2+2δh2[((x)21)Φ(x)]+δh2[xe(x)222π]3710h2.\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}=2\int_{\_}{0}^{{x}^{*}}\mathcal{B}\left({x},h\right)\,d{x}-2\int_{\_}{{x}^{*}}^{\infty}\mathcal{B}\left({x},h\right)\,d{x}\\ =-3h^{2}+2\delta_{h}^{2}\left[\left(\left({x}^{*}\right)^{2}-1\right)\Phi({x}^{*})\right]+\delta_{h}^{2}\left[\frac{{x}^{*}e^{{-\frac{\left({x}^{*}\right)^{2}}{2}}}}{\sqrt{2\pi}}\right]\leq\frac{37}{10}h^{2}.

Thus, (x,h)1()\mathcal{B}\left({x},h\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right). ∎

Proof of Theorem 5.

From the proof of Lemma 11 we have _|(x,h)|𝑑xCh2\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq Ch^{2} and |(x,h)|Mh2\left|\mathcal{B}\left({x},h\right)\right|\leq Mh^{2} for all h{h}\in\mathbb{R} where M=22πM=\frac{2}{\sqrt{2\pi}}, C=3710C=\frac{37}{10}: in particular, _|(𝒙_1,1)|𝑑xC\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({\boldsymbol{{x}}}_{\_}1,1\right)\right|\,d{x}\leq C. It is easy to see by induction that

|𝔅(𝒙)|M2n1.\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\leq M^{2^{n}-1}.

We claim that

_n|𝔅(𝒙)|𝑑𝒙CnM2nn1,\int_{\_}{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\,d{\boldsymbol{{x}}}\leq C^{n}M^{2^{n}-n-1},

and thus 𝔅(𝒙)1(n)\mathfrak{B}\left({\boldsymbol{{x}}}\right)\in\mathcal{L}^{1}\left(\mathbb{R}^{n}\right). We prove the claim by induction. Clearly it holds for n=1n=1. Suppose it is valid for nn. Define 𝒙=(𝒙_n+1,𝒙_n,,𝒙_1)=(𝒙_n+1,𝒚){\boldsymbol{{x}}}=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{x}}}_{\_}n,\ldots,{\boldsymbol{{x}}}_{\_}1\right)=\left({\boldsymbol{{x}}}_{\_}{n+1},{\boldsymbol{{y}}}\right), we have

_n+1|𝔅(𝒙)|𝑑𝒙=n(_|(𝒙_n+1,𝔅(𝒚))|𝑑𝒙_n+1)𝑑𝒚Cn|𝔅(𝒚)|2𝑑𝒚CM2n1n|𝔅(𝒚)|𝑑𝒚Cn+1M2n+1n2,\int_{\_}{\mathbb{R}^{n+1}}\left|\mathfrak{B}\left({\boldsymbol{{x}}}\right)\right|\,d{\boldsymbol{{x}}}=\int_{\mathbb{R}^{n}}\left(\int_{\_}{\mathbb{R}}\left|\mathcal{B}\left({\boldsymbol{{x}}}_{\_}{n+1},\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right)\right|\,d{\boldsymbol{{x}}}_{\_}{n+1}\right)\,d{\boldsymbol{{y}}}\leq C\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right|^{2}\,d{\boldsymbol{{y}}}\\ \leq CM^{2^{n}-1}\int_{\mathbb{R}^{n}}\left|\mathfrak{B}\left({\boldsymbol{{y}}}\right)\right|d{\boldsymbol{{y}}}\leq C^{n+1}M^{2^{n+1}-n-2},

where we have used the induction hypothesis in the last inequality, and this ends the proof. ∎

Proof of Lemma 12.

It is easy to see that for x2{x}\geq 2

|σ(2)(x)|12e3x(x2)+8e2x(x1)+8e5x(x+2)+8e4x(x+4),\left|\sigma^{(2)}\left({x}\right)\right|\leq 12e^{-3{x}}\left({x}-2\right)+8e^{-2{x}}\left({x}-1\right)+8e^{-5{x}}\left({x}+2\right)+8e^{-4{x}}\left({x}+4\right),

and

|σ(2)(x)|12e3x(2x)+8e4x(1x)8ex(x+2)8e2x(x+4),\left|\sigma^{(2)}\left({x}\right)\right|\leq 12e^{3{x}}\left(2-{x}\right)+8e^{4{x}}\left(1-{x}\right)-8e^{{x}}\left({x}+2\right)-8e^{2{x}}\left({x}+4\right),

for {x}\leq-4. That is, both the right and the left tails of \sigma^{(2)}\left({x}\right) decay exponentially; together with the continuity of \sigma^{(2)}\left({x}\right) on the compact interval \left[-4,2\right], this concludes the proof. ∎

Proof of Theorem 7.

For the first assertion, we note that σ(2)(x)\sigma^{(2)}\left({x}\right) is continuous and thus the following Taylor theorem with integral remainder for σ(x)\sigma\left({x}\right) holds

σ(x+h)=σ(x)+σ(x)h+h2_01σ(2)(x+sh)(1s)𝑑s.\sigma\left({x}+h\right)=\sigma\left({x}\right)+\sigma^{\prime}\left({x}\right)h+h^{2}\int_{\_}{0}^{1}\sigma^{(2)}\left({x}+sh\right)\left(1-s\right)\,ds.

As a result,

|(x,h)|h2[_01|σ(2)(x+sh)|(1s)𝑑s+_01|σ(2)(xsh)|(1s)𝑑s]Mh2,\left|\mathcal{B}\left({x},h\right)\right|\leq h^{2}\left[\int_{\_}{0}^{1}\left|\sigma^{(2)}\left({x}+sh\right)\right|\left(1-s\right)\,ds+\int_{\_}{0}^{1}\left|\sigma^{(2)}\left({x}-sh\right)\right|\left(1-s\right)\,ds\right]\leq Mh^{2},

where we have used the boundedness of σ(2)(x)\sigma^{(2)}\left({x}\right) from Lemma 12 in the last inequality.

For the second assertion, we have

\int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}=h^{2}\int_{\mathbb{R}}\int_{0}^{1}\sigma^{(2)}\left({x}+sh\right)\left(1-s\right)\,ds\,d{x}+h^{2}\int_{\mathbb{R}}\int_{0}^{1}\sigma^{(2)}\left({x}-sh\right)\left(1-s\right)\,ds\,d{x},

whose right hand side is well-defined owing to σ(2)(x)1()\sigma^{(2)}\left({x}\right)\in\mathcal{L}^{1}\left(\mathbb{R}\right) (see Lemma 12) and the Fubini theorem. In particular,

\int_{\mathbb{R}}\mathcal{B}\left({x},h\right)\,d{x}\leq\left\|\sigma^{(2)}\right\|_{\mathcal{L}^{1}\left(\mathbb{R}\right)}h^{2}.

The proof for \int_{\mathbb{R}}\left|\mathcal{B}\left({x},h\right)\right|\,d{x}\leq\left\|\sigma^{(2)}\right\|_{\mathcal{L}^{1}\left(\mathbb{R}\right)}h^{2} follows similarly.

The proof of the last assertion is the same as the proof of Theorem 5 and hence is omitted. ∎

Appendix C Figures

This section provides the plots of the \mathcal{B} functions for various activations in 1, 2, and 3 dimensions.

Figure 1 (left) plots \mathcal{B}\left({x},1\right) for the RePU unit with q=\left\{0,1,3,6,9\right\} in one dimension. The non-negativity, compact support, \mathcal{B}ell shape, smoothness, and unimodality of \mathcal{B}\left({x},1\right) can be clearly seen. For n=2 we plot in Figure 1 (the three rightmost subfigures) the surfaces of RePU for q=\left\{0,1,5\right\} together with 15 contours to again verify the non-negativity, compact support, \mathcal{B}ell shape, smoothness, and unimodality of \mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left({x}_{2},\mathcal{B}\left({x}_{1},1\right)\right). To further confirm these features for n=3, in Figure 2 we plot an isosurface for the RePU unit with q=\left\{0,1,3,5\right\}, and 4 isosurfaces for the case q=4 in Figure 3. Note the supports of the \mathcal{B} functions around the origin: the further away from the origin, the smaller the isosurface values.
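A sketch along the following lines (ours; NumPy and Matplotlib assumed, with arbitrary grid resolutions, contour levels, and output file names) reproduces the flavor of the one- and two-dimensional RePU panels in Figure 1; the three-dimensional isosurfaces require a 3-D plotting tool and are omitted.

import math
import numpy as np
import matplotlib.pyplot as plt

def repu(q, x):
    return x ** q if x > 0.0 else 0.0

def f_r(q, x):
    # Irwin-Hall density via RePU, r = q + 1
    r = q + 1
    return sum((-1) ** i * math.comb(r, i) * repu(q, x + r / 2.0 - i)
               for i in range(r + 1)) / math.factorial(q)

def B(q, x, h):
    # B(x, h) = h^q f_r(x / h), with the convention B(x, 0) = 0
    return 0.0 if h == 0.0 else h ** q * f_r(q, x / h)

xs = np.linspace(-6.0, 6.0, 601)
for q in (0, 1, 3, 6, 9):
    plt.plot(xs, [B(q, x, 1.0) for x in xs], label=f"q = {q}")
plt.legend()
plt.title("One-dimensional RePU B-functions")
plt.savefig("repu_B_1d.png")

q = 1                                 # a two-dimensional example: frakB(x1, x2) = B(x2, B(x1, 1))
grid = np.linspace(-3.0, 3.0, 201)
Z = np.array([[B(q, x2, B(q, x1, 1.0)) for x1 in grid] for x2 in grid])
plt.figure()
plt.contourf(grid, grid, Z, 15)
plt.title("Two-dimensional RePU B-function, q = 1")
plt.savefig("repu_B_2d.png")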

Figure 1: From left to right: One dimensional \mathcal{B}\left({x}\right) functions for RePU with q=\left\{0,1,3,6,9\right\}, and two dimensional \mathfrak{B}\left({\boldsymbol{{x}}}\right) functions for RePU with q=\left\{0,1,5\right\}. The surfaces of \mathfrak{B}\left({\boldsymbol{{x}}}\right) are plotted with 15 contours at 15 values equally spaced from 10^{-6} to \mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right).
Figure 2: From left to right: Three dimensional 𝔅(𝒙)\mathfrak{B}\left({\boldsymbol{{x}}}\right) functions for RePU with q={0,1,3,5}q=\left\{0,1,3,5\right\}. The isosurfaces are plotted for 𝔅(𝒙)=(0,(0,1))×102\mathfrak{B}\left({\boldsymbol{{x}}}\right)=\mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right)\times 10^{-2}.
Figure 3: Three dimensional \mathfrak{B}\left({\boldsymbol{{x}}}\right) functions for RePU with q=4. The first three isosurfaces (from left to right) are plotted at \mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right)\times\left\{10^{-1},10^{-2},10^{-3}\right\}, and the rightmost subfigure shows half of the isosurfaces at \mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right)\times\left\{10^{-1},10^{-2},10^{-3},10^{-4}\right\}, where red is the isosurface at \mathcal{B}\left(0,\mathcal{B}\left(0,1\right)\right)\times 10^{-4}.

To verify that activation functions in the generalized sigmoidal class behave similarly, as suggested by our unified framework, we plot the \mathcal{B} functions of the standard sigmoid, softplus, and arctangent activation functions in Figure 4 for n=\left\{1,2,3\right\}. As can be seen, the \mathcal{B} functions, though they have different values, share similar shapes. Note that for n=3 we plot the isosurfaces at 5\times 10^{-2} times the largest value of the corresponding \mathcal{B} functions.

Figure 4: From left to right: One dimensional plot, two dimensional surface together with 1515 contours, and a three dimensional isosurface of \mathcal{B} functions. From top row to bottom row: the standard sigmoid, softplus, and arctangent activation functions.

In Figure 5 we present a snapshot of the \mathcal{B} function of the ELU activation function in one, two, and three dimensions. While it shares common features with the other \mathcal{B} functions, such as decaying to zero at infinity, it possesses distinct features, including an asymmetric shape with both positive and negative values.

Figure 5: From left to right: One dimensional plot, two dimensional surface together with 1515 contours, and a three dimensional isosurface of the \mathcal{B} function of ELU activation function.

We have shown that the nAI proof in n dimensions is the same for GELU, SiLU, and Mish. It turns out that their \mathcal{B} functions are also geometrically very similar, as can be seen in Figure 6. Note that for n=3 we plot the isosurfaces at 5\times 10^{-2} times the largest value of the corresponding \mathcal{B} functions.

Figure 6: From left to right: One dimensional plot, two dimensional surface together with 1515 contours, and a three dimensional isosurface of \mathcal{B} functions. From top row to bottom row: GELU, SiLU, and Mish activation functions.

References