
Universality and approximation bounds for echo state networks with random weights

Zhen Li and Yunfei Yang. Z. Li is with Theory Lab, Huawei Technologies Co., Ltd., Shenzhen, China (e-mail: [email protected]). Y. Yang is with the Department of Mathematics, City University of Hong Kong, Hong Kong, China; the first version of this paper was written when he was with The Hong Kong University of Science and Technology (corresponding author, e-mail: [email protected]).
Abstract

We study the uniform approximation of echo state networks with randomly generated internal weights. These models, in which only the readout weights are optimized during training, have achieved empirical success in learning dynamical systems. Recent results showed that echo state networks with ReLU activation are universal. In this paper, we give an alternative construction and prove that universality holds for general activation functions. Specifically, our main result shows that, under certain conditions on the activation function, there exists a sampling procedure for the internal weights so that the echo state network can approximate any continuous causal time-invariant operator with high probability. In particular, for ReLU activation, we give an explicit construction for these sampling procedures. We also quantify the approximation error of the constructed ReLU echo state networks for sufficiently regular operators.

Index Terms:
Universality, approximation error, echo state networks, reservoir computing

I Introduction

Reservoir computing [1], such as echo state networks [2] and liquid state machines [3], is a paradigm for supervised learning of dynamical systems: it transforms input data into a high-dimensional space by a state-space nonlinear system called the reservoir, and performs the learning task only on the readout. Due to the simplicity of this computational framework, it has been applied in many fields and has achieved remarkable success in tasks such as temporal pattern prediction, classification and generation [4]. Motivated by these empirical successes, researchers have devoted considerable effort to the theoretical understanding of the properties and performance of reservoir computing models (see, for instance, [5, 6, 1, 7, 8, 9, 10] and references therein).

In particular, recent studies [11, 12, 13, 14] showed that echo state networks are universal in the sense that they can approximate sufficiently regular input/output systems (i.e. operators) in various settings. However, many of these universality results do not respect an essential feature of echo state networks (and of reservoir computing in general), namely that the state-space system is randomly generated; this is the major difference between reservoir computing and recurrent neural networks. To be concrete, consider the echo state network

$$\begin{cases}\boldsymbol{s}_{t}=\phi(W_{1}\boldsymbol{s}_{t-1}+W_{2}\boldsymbol{u}_{t}+\boldsymbol{b}),\\ y_{t}=\boldsymbol{a}\cdot\boldsymbol{s}_{t},\end{cases} \tag{1}$$

where $\boldsymbol{u}_t, y_t, \boldsymbol{s}_t$ are the input, output and hidden state at time step $t$, and $\phi$ is a prescribed activation function. In general, the weights $W_1, W_2, \boldsymbol{b}$ are randomly generated and only the readout vector $\boldsymbol{a}$ is trained in a supervised learning manner. But the universality results in many papers, such as [11] and [14], require that all the weights depend on the target system that we want to approximate. Hence, these results cannot completely explain the approximation capacity of echo state networks. To overcome this drawback of current theories, the recent work [9] studied the approximation of random neural networks in a Hilbert space setting and proposed a sampling procedure for the internal weights $W_1, W_2, \boldsymbol{b}$ so that the echo state network (1) with ReLU activation is universal in the $L^2$ sense.
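To make this training paradigm concrete, here is a minimal runnable sketch (our illustration, not a construction from this paper; the tanh activation and the spectral rescaling of $W_1$ are common heuristic choices we assume here): the internal weights are drawn at random and frozen, and only the readout $\boldsymbol{a}$ is fit by linear least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 200, 1, 1000             # state dimension, input dimension, length

# Random internal weights (frozen after sampling).
W1 = rng.normal(size=(n, n))
W1 *= 0.9 / np.max(np.abs(np.linalg.eigvals(W1)))  # common spectral rescaling
W2 = rng.normal(size=(n, d))
b = rng.normal(size=n)

# Input sequence and a fading-memory target y_t = sum_k 0.5^k u_{t-k}.
u = rng.uniform(-1, 1, size=(T, d))
y = np.array([sum(0.5 ** k * u[t - k, 0] for k in range(min(t + 1, 10)))
              for t in range(T)])

# State recursion (1), then train the readout a only.
s = np.zeros(n)
S = np.empty((T, n))
for t in range(T):
    s = np.tanh(W1 @ s + W2 @ u[t] + b)
    S[t] = s

a, *_ = np.linalg.lstsq(S[100:], y[100:], rcond=None)  # discard a warm-up
print("train RMSE:", np.sqrt(np.mean((S[100:] @ a - y[100:]) ** 2)))
```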

In this paper, we generalize these results and study the universality of echo state networks with randomly generated internal weights. Our main results show that, under weak assumptions on the activation function $\phi$, there exists a sampling procedure for $W_1, W_2, \boldsymbol{b}$ so that the system (1) can approximate any continuous time-invariant operator within a prescribed $L^\infty$ error with high probability. In particular, for ReLU activation, we show that the echo state network (1) is universal when $W_1 = c_\mu WPW^\intercal$ and $W_2 = WQ$, where $(W,\boldsymbol{b})$ is a random matrix whose entries are sampled independently from a general symmetric distribution $\mu$, $c_\mu$ is a constant depending on $\mu$, and $P,Q$ are fixed matrices defined by (10).

Let us compare our results with the most closely related work [9]. Given a series of random samples from a uniform distribution, the authors of [9] provided a procedure that uses these samples to construct the internal weights so that the ReLU echo state network is universal with high probability. We prove that this kind of sampling procedure also exists for general activation functions (it is sufficient that the activation does not grow too fast; see Corollary III.6 and Assumption III.8 for details). Different from the construction in [9], which takes advantage of the piecewise linearity of the ReLU function, our construction makes use of the concentration of probability measures. Hence, it does not rely on piecewise linearity and can be applied to general activation functions. But our sampling procedure is given implicitly, and it is difficult to compute it explicitly for general activation functions; we can only provide explicit examples for ReLU networks. One shortcoming of our construction is that the constructed neural networks may not have the echo state property. But we show that these networks satisfy a property slightly weaker than the echo state property (see Remark III.10), which, we believe, is good enough for many applications.

I-A Notations

We denote by $\mathbb{N}:=\{1,2,\dots\}$ the set of positive integers. Let $\mathbb{Z}$ (respectively, $\mathbb{Z}_+$ and $\mathbb{Z}_-$) be the set of integers (respectively, the non-negative and non-positive integers). The ReLU function is denoted by $\sigma(x):=\max\{x,0\}$. Throughout the paper, $\mathbb{R}^d$ is equipped with the sup-norm $\|\cdot\|_\infty$, unless explicitly stated otherwise. For any $U\subseteq\mathbb{R}^d$, we use $U^{\mathbb{Z}}$ to denote the set of sequences of the form $\boldsymbol{u}=(\dots,\boldsymbol{u}_{-1},\boldsymbol{u}_0,\boldsymbol{u}_1,\dots)$, $\boldsymbol{u}_t\in U$, $t\in\mathbb{Z}$. The sets $U^{\mathbb{Z}_+}$ and $U^{\mathbb{Z}_-}$ consist of right and left infinite sequences of the form $(\boldsymbol{u}_0,\boldsymbol{u}_1,\dots)$ and $(\dots,\boldsymbol{u}_{-1},\boldsymbol{u}_0)$, respectively. For any $i\le j$ and $\boldsymbol{u}\in U^{\mathbb{Z}}$, we denote $\boldsymbol{u}_{i:j}:=(\boldsymbol{u}_i,\dots,\boldsymbol{u}_j)\in U^{j-i+1}$, with the convention that, when $i=-\infty$ or $j=\infty$, it denotes the left or right infinite sequence. We will often regard $\boldsymbol{u}_{-\infty:j}$ as an element of $U^{\mathbb{Z}_-}$. The supremum norm of a sequence is denoted by $\|\boldsymbol{u}\|_\infty:=\sup_{t\in\mathbb{Z}}\|\boldsymbol{u}_t\|_\infty$. A weighting sequence $\eta:\mathbb{Z}_+\to(0,1]$ is a decreasing sequence with zero limit. The weighted norm on $(\mathbb{R}^d)^{\mathbb{Z}_-}$ associated to $\eta$ is denoted by $\|\boldsymbol{u}\|_\eta:=\sup_{t\in\mathbb{Z}_-}\eta_{-t}\|\boldsymbol{u}_t\|_\infty$.

II Continuous causal time-invariant operators

We study the uniform approximation problem for input/output systems, or operators, on signals in the discrete-time setting. We mainly consider operators $F:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$ that satisfy the following three properties:

(1) Causality: For any $\boldsymbol{u},\boldsymbol{v}\in([-1,1]^d)^{\mathbb{Z}}$ and $t\in\mathbb{Z}$, $\boldsymbol{u}_{-\infty:t}=\boldsymbol{v}_{-\infty:t}$ implies $F(\boldsymbol{u})_t=F(\boldsymbol{v})_t$.

(2) Time-invariance: $F(T_\tau(\boldsymbol{u}))_t = F(\boldsymbol{u})_{t-\tau}$ for any $\tau\in\mathbb{Z}$, where $T_\tau:(\mathbb{R}^d)^{\mathbb{Z}}\to(\mathbb{R}^d)^{\mathbb{Z}}$ is the time delay operator defined by $T_\tau(\boldsymbol{u})_t=\boldsymbol{u}_{t-\tau}$.

(3) Continuity (fading memory property): $F:([-1,1]^d)^{\mathbb{Z}}\to[-B,B]^{\mathbb{Z}}$ is uniformly bounded for some $B>0$ and is continuous with respect to the product topology.

The paper [11] gave a comprehensive study of these operators. We recall that causal time-invariant operators $F:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$ are in one-to-one correspondence with functionals $F^*:([-1,1]^d)^{\mathbb{Z}_-}\to\mathbb{R}$ defined by $F^*(\boldsymbol{u}_-):=F(\boldsymbol{u})_0$, where $\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}$ is any extension of $\boldsymbol{u}_-\in([-1,1]^d)^{\mathbb{Z}_-}$ such that $\boldsymbol{u}_{-\infty:0}=\boldsymbol{u}_-$ ($F^*$ is well-defined by causality). We can reconstruct $F$ from $F^*$ by $F(\boldsymbol{u})_t=F^*(P_{\mathbb{Z}_-}\circ T_{-t}(\boldsymbol{u}))$, where $P_{\mathbb{Z}_-}:(\mathbb{R}^d)^{\mathbb{Z}}\to(\mathbb{R}^d)^{\mathbb{Z}_-}$ is the natural projection. In particular, for any causal time-invariant operators $F,G:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$,

$$\sup_{\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}}\|F(\boldsymbol{u})-G(\boldsymbol{u})\|_\infty = \sup_{\boldsymbol{u}_-\in([-1,1]^d)^{\mathbb{Z}_-}}|F^*(\boldsymbol{u}_-)-G^*(\boldsymbol{u}_-)|.$$

Hence, the problem of approximating a causal time-invariant operator $F$ can be reduced to the approximation of the functional $F^*$ on $([-1,1]^d)^{\mathbb{Z}_-}$.

We remark that the product topology on $(\mathbb{R}^d)^{\mathbb{Z}}$ is different from the uniform topology induced by the sup-norm. However, on the set $([-B,B]^d)^{\mathbb{Z}_-}$ of uniformly bounded sequences, the product topology coincides with the topology induced by the weighted norm $\|\cdot\|_\eta$ for any weighting sequence $\eta$. It is shown in [11, Section 2.3] that a causal time-invariant operator $F$ is continuous (i.e. has the fading memory property) if and only if the corresponding functional $F^*$ is continuous with respect to the product topology, which is equivalent to $F^*$ having the fading memory property with respect to some (and hence any) weighting sequence $\eta$, i.e. $F^*$ being a continuous function on the compact metric space $(([-1,1]^d)^{\mathbb{Z}_-},\|\cdot\|_\eta)$. In other words, for any $\epsilon>0$, there exists $\delta(\epsilon)>0$ such that for any $\boldsymbol{u},\boldsymbol{v}\in([-1,1]^d)^{\mathbb{Z}_-}$,

$$\|\boldsymbol{u}-\boldsymbol{v}\|_\eta = \sup_{t\in\mathbb{Z}_-}\eta_{-t}\|\boldsymbol{u}_t-\boldsymbol{v}_t\|_\infty < \delta(\epsilon) \ \Longrightarrow\ |F^*(\boldsymbol{u})-F^*(\boldsymbol{v})| < \epsilon.$$

Next, we introduce several notions to quantify the regularity of causal time-invariant operators. For any $m\in\mathbb{N}$, we can associate a function $F^*_m:([-1,1]^d)^m\to\mathbb{R}$ to any causal time-invariant operator $F:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$ by

$$F^*_m(\boldsymbol{u}_{-m+1},\dots,\boldsymbol{u}_0) := F^*(\dots,\boldsymbol{0},\boldsymbol{u}_{-m+1},\dots,\boldsymbol{u}_0). \tag{2}$$

If the functional $F^*$ can be approximated by $F^*_m$ arbitrarily well, we say $F$ has approximately finite memory. The following definition quantifies this approximation.

Definition II.1 (Approximately finite memory).

For any causal time-invariant operator $F:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$ and any $\epsilon\ge 0$ and $m\in\mathbb{N}$, we denote

$$\mathcal{E}_F(m) := \sup_{\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}_-}} |F^*(\boldsymbol{u}) - F^*_m(\boldsymbol{u}_{-m+1:0})|,$$
$$m_F(\epsilon) := \min\{m\in\mathbb{N} : \mathcal{E}_F(m)\le\epsilon\}.$$

If $m_F(\epsilon)<\infty$ for all $\epsilon>0$, we say $F$ has approximately finite memory. If $m_F(0)<\infty$, we say $F$ has finite memory.
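For example, the exponentially weighted moving average $F(\boldsymbol{u})_t = \sum_{k=0}^\infty 2^{-k-1}u_{t-k}$ (with $d=1$) is causal, time-invariant and continuous, and since $F^*_m$ replaces all inputs before time $-m+1$ by zero,
$$\mathcal{E}_F(m) \le \sum_{k=m}^\infty 2^{-k-1} = 2^{-m}, \qquad m_F(\epsilon) \le \lceil\log_2(1/\epsilon)\rceil,$$
so $F$ has approximately finite memory; it does not have finite memory, since $\mathcal{E}_F(m)=2^{-m}>0$ for every $m$.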

Note that $m_F(\epsilon)$ is a non-increasing function of $\epsilon$. By the time-invariance of $F$, for any $t\in\mathbb{Z}$,

$$\sup_{\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}} |F(\boldsymbol{u})_t - F^*_m(\boldsymbol{u}_{t-m+1:t})| = \sup_{\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}} |F^*(\boldsymbol{u}_{-\infty:t}) - F^*_m(\boldsymbol{u}_{t-m+1:t})| = \mathcal{E}_F(m). \tag{3}$$

Hence, $\mathcal{E}_F(m)$ quantifies how well $F$ can be approximated by functionals with finite memory.

Definition II.2 (Modulus of continuity).

Suppose the functional $F^*:([-1,1]^d)^{\mathbb{Z}_-}\to\mathbb{R}$ has the fading memory property with respect to some weighting sequence $\eta$. We denote the modulus of continuity with respect to $\|\cdot\|_\eta$ by

$$\omega_{F^*}(\delta;\eta) := \sup\{|F^*(\boldsymbol{u})-F^*(\boldsymbol{u}')| : \boldsymbol{u},\boldsymbol{u}'\in([-1,1]^d)^{\mathbb{Z}_-},\ \|\boldsymbol{u}-\boldsymbol{u}'\|_\eta\le\delta\},$$

and the inverse modulus of continuity

$$\omega_{F^*}^{-1}(\epsilon;\eta) := \sup\{\delta>0 : \omega_{F^*}(\delta;\eta)\le\epsilon\}.$$

Similarly, the modulus of continuity (with respect to $\|\cdot\|_\infty$) of a continuous function $f:[-1,1]^m\to\mathbb{R}$ and its inverse modulus of continuity are defined by

$$\omega_f(\delta) := \sup\{|f(\boldsymbol{x})-f(\boldsymbol{x}')| : \boldsymbol{x},\boldsymbol{x}'\in[-1,1]^m,\ \|\boldsymbol{x}-\boldsymbol{x}'\|_\infty\le\delta\},$$
$$\omega_f^{-1}(\epsilon) := \sup\{\delta>0 : \omega_f(\delta)\le\epsilon\}.$$

The next proposition characterizes the continuity of causal time-invariant operators in terms of approximately finite memory and the modulus of continuity. It is a modification of a similar result in [15] to our setting, and shows that a causal time-invariant operator $F$ is continuous if and only if it has approximately finite memory and each $F^*_m$ is continuous.

Proposition II.3.

Let $F:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$ be a causal time-invariant operator.

(1) If $F$ has approximately finite memory and $\omega_{F^*_m}(\delta)<\infty$ for any $m\in\mathbb{N}$ and $\delta>0$, then $F^*$ has the fading memory property with respect to any weighting sequence $\eta$: for any $\epsilon>0$, $\boldsymbol{u},\boldsymbol{v}\in([-1,1]^d)^{\mathbb{Z}_-}$ and $m=m_F(\epsilon/3)$,

$$\|\boldsymbol{u}-\boldsymbol{v}\|_\eta \le \eta_{m-1}\,\omega_{F^*_m}^{-1}(\epsilon/3) \ \Longrightarrow\ |F^*(\boldsymbol{u})-F^*(\boldsymbol{v})| \le \epsilon.$$

In other words, $\omega_{F^*}(\eta_{m-1}\,\omega_{F^*_m}^{-1}(\epsilon/3);\eta)\le\epsilon$ for $m=m_F(\epsilon/3)$.

(2) If $F^*$ has the fading memory property with respect to some weighting sequence $\eta$, then $F$ has approximately finite memory with $\mathcal{E}_F(m)\le\omega_{F^*}(\eta_m;\eta)$,

$$m_F(\epsilon) \le \min\left\{m\in\mathbb{N} : \eta_m \le \omega_{F^*}^{-1}(\epsilon;\eta)\right\},$$

and $\omega_{F^*_m}(\delta)\le\omega_{F^*}(\delta;\eta)$ for any $m\in\mathbb{N}$.

In order to approximate a continuous causal time-invariant operator $F$, we thus only need to approximate the functional $F^*$, which in turn can be approximated by the continuous function $F^*_m$ if $m$ is chosen sufficiently large. Hence, any approximation theory for continuous functions can be translated into an approximation result for continuous causal time-invariant operators. For instance, if we approximate $F^*_m$ by some function $h$, then we can approximate $F$ by

$$y_t = h(\boldsymbol{u}_{t-m+1},\dots,\boldsymbol{u}_t), \quad t\in\mathbb{Z}.$$

The function $h$ uniquely determines a causal time-invariant operator $H$ such that $H(\boldsymbol{u})_t = h(\boldsymbol{u}_{t-m+1:t})$. Since $H$ has finite memory, $H$ is continuous if and only if $h$ is continuous, by Proposition II.3. When we approximate $F^*_m$ by polynomials, $H$ is a Volterra series [16]. When $h$ is a neural network, $H$ is a temporal convolutional neural network, studied in [15].
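As a small illustration (our sketch; the window functional `h` below is a hypothetical example), the following code applies the finite-memory operator $H(\boldsymbol{u})_t = h(\boldsymbol{u}_{t-m+1:t})$ to a finite input sequence, padding the unseen past with zeros as in (2):

```python
import numpy as np

def apply_window_operator(h, u, m):
    """H(u)_t = h(u_{t-m+1}, ..., u_t); the unseen past is padded with zeros."""
    T, d = u.shape
    u_pad = np.vstack([np.zeros((m - 1, d)), u])
    return np.array([h(u_pad[t:t + m]) for t in range(T)])

# A hypothetical Volterra-type window functional of memory m = 3.
h = lambda window: window[-1, 0] + 0.5 * window[-2, 0] ** 2

u = np.random.default_rng(1).uniform(-1, 1, size=(8, 1))
print(apply_window_operator(h, u, m=3))
```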

In this paper, we focus on the approximation by echo state networks (ESN), which are special state space models of the form

$$\begin{cases}\boldsymbol{s}_{t}=\phi(W_{1}\boldsymbol{s}_{t-1}+W_{2}\boldsymbol{u}_{t}+\boldsymbol{b}),\\ y_{t}=\boldsymbol{a}\cdot\boldsymbol{s}_{t},\end{cases} \tag{4}$$

where $\boldsymbol{s}_t,\boldsymbol{a},\boldsymbol{b}\in\mathbb{R}^n$, $W_1\in\mathbb{R}^{n\times n}$, $W_2\in\mathbb{R}^{n\times d}$, and the activation function $\phi:\mathbb{R}\to\mathbb{R}$ is applied element-wise.

Definition II.4 (Existence of solutions and echo state property).

We say the system (4) has the existence of solutions property if for any $\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}_-}$ there exists $\boldsymbol{s}\in(\mathbb{R}^n)^{\mathbb{Z}_-}$ such that $\boldsymbol{s}_t=\phi(W_1\boldsymbol{s}_{t-1}+W_2\boldsymbol{u}_t+\boldsymbol{b})$ holds for each $t\in\mathbb{Z}_-$. If the solution $\boldsymbol{s}$ is always unique, we say the system has the echo state property.

The article [11, Theorem 3.1] gave sufficient conditions for the system (4) to have the existence of solutions property and the echo state property. In particular, it showed that if $\phi$ is a bounded continuous function, then the existence of solutions property holds; and if $\phi$ is a bounded Lipschitz continuous function with Lipschitz constant $L_\phi<\infty$ and $\|W_1\|L_\phi<1$, then the system (4) has the echo state property, where $\|W_1\|$ is the operator norm of the matrix $W_1$. As a sufficient condition for the echo state property, the hypothesis $\|W_1\|L_\phi<1$ has been extensively studied in the ESN literature [5, 2, 6, 7, 17].

If the system (4) has the existence of solutions property, the axiom of choice allows us to assign some solution $\boldsymbol{s}(\boldsymbol{u})\in(\mathbb{R}^n)^{\mathbb{Z}_-}$ to each $\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}_-}$, and hence define a functional $h:([-1,1]^d)^{\mathbb{Z}_-}\to\mathbb{R}$ by $h(\boldsymbol{u}) := \boldsymbol{a}\cdot\boldsymbol{s}(\boldsymbol{u})_0 = y_0$. Thus, we can assign to the system a causal time-invariant operator $H:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$ such that $H^*=h$. When the echo state property holds, this operator is unique. The operator $H$ is continuous if and only if the mapping $\boldsymbol{u}\mapsto\boldsymbol{s}(\boldsymbol{u})_0$ is continuous. In the next section, we study the universality of these operators.

III Universal approximation

As mentioned in the introduction, the recent works [11, 14] showed that echo state networks (4) are universal: assume $\phi$ is a bounded Lipschitz continuous function and let $F:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$ be a continuous causal time-invariant operator; then for any $\epsilon>0$ and sufficiently large $n$, there exists an ESN (4) such that the corresponding causal time-invariant operator $F_{ESN}$ satisfies

$$\sup_{\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}}\|F(\boldsymbol{u})-F_{ESN}(\boldsymbol{u})\|_\infty \le \epsilon. \tag{5}$$

In this universal approximation theorem, the weights $W_1,W_2,\boldsymbol{a},\boldsymbol{b}$ in the network (4) depend on the target operator $F$. However, in practice, the parameters $W_1,W_2,\boldsymbol{b}$ are drawn at random from a given distribution and only the readout vector $\boldsymbol{a}$ is trained by linear regression using observed data related to the target operator $F$. Hence, this universal approximation theorem cannot completely explain the empirical performance of echo state networks.

In this section, our goal is to show that, with randomly generated weights, echo state networks are universal: for any $\epsilon>0$ and sufficiently large $n$, with high probability over $W_1,W_2,\boldsymbol{b}$ drawn from a certain distribution, there exists $\boldsymbol{a}$ such that we can associate a causal time-invariant operator $F_{ESN}$ to the ESN (4) and the approximation bound (5) holds. In the context of standard feed-forward neural networks, one can show that a similar universal approximation theorem holds for random neural networks; this will be the building block of our main theorem for echo state networks.

III-A Universality of random neural networks

It is well-known that feed-forward neural networks with one hidden layer are universal [18, 19, 20, 21]. We recall the universal approximation theorem proved by [20], which has minimal requirements on the activation function.

Theorem III.1.

If $\phi:\mathbb{R}\to\mathbb{R}$ is continuous and is not a polynomial, then for any compact set $\mathcal{X}\subseteq\mathbb{R}^d$, any function $f\in C(\mathcal{X})$ and any $\epsilon>0$, there exist $a_i,b_i\in\mathbb{R}$ and $\boldsymbol{w}_i\in\mathbb{R}^d$, $i=1,\dots,n$, such that

$$\sup_{\boldsymbol{x}\in\mathcal{X}}\left|f(\boldsymbol{x}) - \sum_{i=1}^n a_i\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i)\right| \le \epsilon. \tag{6}$$

In Theorem III.1, the parameters $a_i,b_i,\boldsymbol{w}_i$ depend on the target function $f$. In order to take into account the fact that for ESNs the inner weights $\boldsymbol{w}_i,b_i$ are randomly chosen and only the $a_i$ are trained from data, we consider random neural networks whose weights $(\boldsymbol{w}_i,b_i)$ are drawn from some probability distribution $\mu$. This motivates the following definition.

Definition III.2.

Suppose $(\boldsymbol{w}_i^\intercal,b_i)_{i\in\mathbb{N}}$ is a sequence of i.i.d. random vectors drawn from a probability distribution $\mu$ on $\mathbb{R}^{d+1}$. If for any compact set $\mathcal{X}\subseteq\mathbb{R}^d$, any function $f\in C(\mathcal{X})$ and any $\epsilon,\delta\in(0,1)$, there exists $n\in\mathbb{N}$ such that, with probability at least $1-\delta$, the inequality (6) holds for some $(a_i)_{1\le i\le n}$, then we say the pair $(\phi,\mu)$ is universal.

Remark III.3.

We note that the readout weights $(a_i)_{1\le i\le n}$ depend on the randomly generated weights $(\boldsymbol{w}_i,b_i)_{1\le i\le n}$. In the definition, we only require the existence of neural networks that converge to the target function in probability. This requirement is actually equivalent to almost sure convergence. To see this, note that convergence in probability implies that there exist a sub-sequence $n_k$ and functions $f_{n_k}$, each a linear combination of $\{\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i)\}_{i=1}^{n_k}$, converging to the target function $f$ almost surely as $k\to\infty$. Notice that, for any $n\ge n_k$, $f_{n_k}$ is also a linear combination of $\{\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i)\}_{i=1}^n$. Hence, there exist $f_n$, each a linear combination of $\{\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i)\}_{i=1}^n$, such that $f_n$ converges to $f$ almost surely.

The universality of random neural networks has been widely studied in the context of extreme learning machines [22, 23, 24]. In particular, [22] used an incremental construction to establish a random universal approximation theorem in $L^2$-norm for bounded non-constant piecewise continuous activation functions. The paper [9] studied the approximation of random neural networks in a Hilbert space setting and established the universal approximation of ReLU neural networks. The recent work [25] considered approximation in $C^1$-norm and assumed that the activation function $\phi\in C^1(\mathbb{R})$ satisfies $0<\int_{\mathbb{R}}|\phi'(x)|dx<\infty$. They argued that, since there exists a neural network $g$ that approximates the target function $f$ by the universality of neural networks, there will eventually be some randomly generated samples $(\boldsymbol{w}_i^\intercal,b_i)$ that are close to the weights of $g$, and we can discard the other samples by setting the corresponding $a_i=0$; hence the random universal approximation holds. It is possible to generalize their argument to the approximation of continuous functions. Nevertheless, we will give an alternative approach based on the law of large numbers and show that $(\phi,\mu)$ is universal under very weak conditions. Our analysis will need the following uniform law of large numbers.

Lemma III.4 ([26], Theorem 2).

Let $\boldsymbol{w}_1,\dots,\boldsymbol{w}_n$ be i.i.d. samples from some probability distribution $\mu$ on $\Omega$. Suppose

(1) $\mathcal{X}$ is compact;

(2) $f(\boldsymbol{x},\boldsymbol{w})$ is continuous in $\boldsymbol{x}$ for each $\boldsymbol{w}\in\Omega$, and measurable in $\boldsymbol{w}$ for each $\boldsymbol{x}\in\mathcal{X}$;

(3) there exists $g(\boldsymbol{w})$ with $|f(\boldsymbol{x},\boldsymbol{w})|\le g(\boldsymbol{w})$ for all $\boldsymbol{x}\in\mathcal{X}$ and $\mathbb{E}_{\boldsymbol{w}}[g(\boldsymbol{w})]<\infty$.

Then $\mathbb{E}_{\boldsymbol{w}}[f(\boldsymbol{x},\boldsymbol{w})]$ is continuous in $\boldsymbol{x}$ and

$$\sup_{\boldsymbol{x}\in\mathcal{X}}\left|\frac{1}{n}\sum_{i=1}^n f(\boldsymbol{x},\boldsymbol{w}_i) - \mathbb{E}_{\boldsymbol{w}}[f(\boldsymbol{x},\boldsymbol{w})]\right| \to 0$$

$\mu$-almost surely as $n\to\infty$.
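As a quick numerical illustration of Lemma III.4 (our example, not from the original): for $f(x,w)=\cos(wx)$ with $w\sim N(0,1)$, we have $\mathbb{E}_w[f(x,w)]=e^{-x^2/2}$, and the empirical average converges to it uniformly on the compact set $[-2,2]$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 401)          # a grid over the compact set X
target = np.exp(-x ** 2 / 2)             # E_w[cos(w x)] for w ~ N(0, 1)

for n in [10, 100, 1000, 10000]:
    w = rng.normal(size=n)
    emp = np.mean(np.cos(np.outer(w, x)), axis=0)   # (1/n) sum_i f(x, w_i)
    print(n, np.max(np.abs(emp - target)))          # sup-norm error shrinks
```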

Now we give a sufficient condition for the pair $(\phi,\mu)$ to be universal. Our proof is a combination of the uniform law of large numbers (Lemma III.4) and the universality of neural networks (Theorem III.1).

Theorem III.5.

Suppose the continuous function $\phi:\mathbb{R}\to\mathbb{R}$ is not a polynomial and $|\phi(x)|\le C_1|x|^\alpha+C_2$ for some $\alpha,C_1,C_2\ge 0$. If $\mu$ is a probability distribution with full support on $\mathbb{R}^{d+1}$ such that $\int\|(\boldsymbol{w}^\intercal,b)\|_\infty^\alpha\,d\mu(\boldsymbol{w}^\intercal,b)<\infty$, then $(\phi,\mu)$ is universal.

Proof.

By the Hahn–Banach theorem, for any compact set $\mathcal{X}\subseteq\mathbb{R}^d$, the linear span of a function class $\mathcal{H}\subseteq C(\mathcal{X})$ is dense in $C(\mathcal{X})$ if and only if

$$\left\{\nu\in M(\mathcal{X}) : \int_{\mathcal{X}}h(\boldsymbol{x})\,d\nu(\boldsymbol{x})=0,\ \forall h\in\mathcal{H}\right\} = \{0\},$$

where $M(\mathcal{X})$ is the dual space of $C(\mathcal{X})$, that is, the space of all signed Radon measures with finite total variation [27, Chapter 7.3].

We consider the linear space

$$\mathcal{H}_\mu := \left\{h(\boldsymbol{x}) = \int_{\mathbb{R}^{d+1}} g(\boldsymbol{w},b)\,\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)\,d\mu(\boldsymbol{w}^\intercal,b) : g\in L^\infty(\mu)\right\}. \tag{7}$$

Observe that $|g(\boldsymbol{w},b)\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)| \le \|g\|_\infty(C_1|\boldsymbol{w}\cdot\boldsymbol{x}+b|^\alpha + C_2) \le C_{\mathcal{X}}\|g\|_\infty\|(\boldsymbol{w}^\intercal,b)\|_\infty^\alpha + C_2\|g\|_\infty$, where $C_{\mathcal{X}}$ is a constant depending on the compact set $\mathcal{X}$. By assumption and Lemma III.4, any $h\in\mathcal{H}_\mu$ is continuous and hence $\mathcal{H}_\mu\subseteq C(\mathcal{X})$. Suppose $\nu\in M(\mathcal{X})$ satisfies $\int_{\mathcal{X}}h(\boldsymbol{x})\,d\nu(\boldsymbol{x})=0$ for all $h\in\mathcal{H}_\mu$. Then, by Fubini's theorem,

$$0 = \int_{\mathcal{X}}\int_{\mathbb{R}^{d+1}} g(\boldsymbol{w},b)\,\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)\,d\mu(\boldsymbol{w}^\intercal,b)\,d\nu(\boldsymbol{x}) = \int_{\mathbb{R}^{d+1}} g(\boldsymbol{w},b)\int_{\mathcal{X}}\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)\,d\nu(\boldsymbol{x})\,d\mu(\boldsymbol{w}^\intercal,b)$$

for all functions $g\in L^\infty(\mu)$. Therefore,

$$\int_{\mathcal{X}}\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)\,d\nu(\boldsymbol{x}) = 0, \quad \mu\text{-almost surely}.$$

Since this function is continuous in $(\boldsymbol{w}^\intercal,b)$ and $\mu$ has full support, the equality holds for all $(\boldsymbol{w}^\intercal,b)\in\mathbb{R}^{d+1}$. By Theorem III.1, the linear span of $\{\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b) : (\boldsymbol{w}^\intercal,b)\in\mathbb{R}^{d+1}\}$ is dense in $C(\mathcal{X})$, which implies $\nu=0$. We conclude that $\mathcal{H}_\mu$ is dense in $C(\mathcal{X})$. In other words, for any $f\in C(\mathcal{X})$ and $\epsilon>0$, there exists $h\in\mathcal{H}_\mu$ such that $|f(\boldsymbol{x})-h(\boldsymbol{x})|\le\epsilon/2$ for all $\boldsymbol{x}\in\mathcal{X}$.

By Lemma III.4, for any $\delta>0$, there exists $n\in\mathbb{N}$ such that, with probability at least $1-\delta$ over the samples $(\boldsymbol{w}_i^\intercal,b_i)$ from $\mu$,

$$\sup_{\boldsymbol{x}\in\mathcal{X}}\left|\frac{1}{n}\sum_{i=1}^n g(\boldsymbol{w}_i,b_i)\,\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i) - h(\boldsymbol{x})\right| \le \frac{\epsilon}{2}.$$

By the triangle inequality, we conclude that (6) holds with $a_i = \frac{1}{n}g(\boldsymbol{w}_i,b_i)$, and the pair $(\phi,\mu)$ is universal. ∎

In practice, it is more convenient to sample each weight independently from a given distribution $\mu$ on $\mathbb{R}$, so that $(\boldsymbol{w}^\intercal,b)$ is sampled from $\mu^{d+1}$, the $(d+1)$-fold product of $\mu$. The next corollary is a direct application of Theorem III.5 to this situation.

Corollary III.6.

Suppose the continuous function $\phi:\mathbb{R}\to\mathbb{R}$ is not a polynomial and $|\phi(x)|\le C_1|x|^\alpha+C_2$ for some $\alpha,C_1,C_2\ge 0$. If $\mu$ is a probability distribution with full support on $\mathbb{R}$ such that $\int|x|^\alpha d\mu(x)<\infty$, then for any $d\in\mathbb{N}$, the pair $(\phi,\mu^{d+1})$ is universal.

Proof.

When $0\le\alpha\le 1$,

$$\int\|(\boldsymbol{w}^\intercal,b)\|_\infty^\alpha\,d\mu^{d+1}(\boldsymbol{w}^\intercal,b) \le \int\|(\boldsymbol{w}^\intercal,b)\|_1^\alpha\,d\mu^{d+1}(\boldsymbol{w}^\intercal,b) \le \int\Big(|b|^\alpha+\sum_{i=1}^d|w_i|^\alpha\Big)d\mu^{d+1}(\boldsymbol{w}^\intercal,b) < \infty.$$

When $\alpha>1$, since all norms on $\mathbb{R}^{d+1}$ are equivalent,

$$\int\|(\boldsymbol{w}^\intercal,b)\|_\infty^\alpha\,d\mu^{d+1}(\boldsymbol{w}^\intercal,b) \le C\int\|(\boldsymbol{w}^\intercal,b)\|_\alpha^\alpha\,d\mu^{d+1}(\boldsymbol{w}^\intercal,b) = C\int\Big(|b|^\alpha+\sum_{i=1}^d|w_i|^\alpha\Big)d\mu^{d+1}(\boldsymbol{w}^\intercal,b) < \infty.$$

In either case, the pair $(\phi,\mu^{d+1})$ satisfies the condition of Theorem III.5, hence it is universal. ∎

So far, we have assumed that $\mu$ has full support. When the activation $\phi(x)=\sigma(x)=\max\{x,0\}$ is the ReLU function, this assumption can be weakened thanks to the positive homogeneity of ReLU.

Corollary III.7.

For the ReLU function $\sigma$, if $\mu$ is a probability distribution on $\mathbb{R}$ whose support contains the interval $[-r,r]$ for some $r>0$, then $(\sigma,\mu^{d+1})$ is universal for any $d\in\mathbb{N}$.

Proof.

We consider the continuous mapping $T:\mathbb{R}^{d+1}\to[-r,r]^{d+1}$ defined by

$$T(\boldsymbol{w}^\intercal,b) = \begin{cases}(\boldsymbol{w}^\intercal,b), & \|(\boldsymbol{w}^\intercal,b)\|_\infty\le r,\\ \frac{r}{\|(\boldsymbol{w}^\intercal,b)\|_\infty}(\boldsymbol{w}^\intercal,b), & \|(\boldsymbol{w}^\intercal,b)\|_\infty>r.\end{cases}$$

Let $\gamma = T_{\#}\mu^{d+1}$ be the push-forward measure of $\mu^{d+1}$ under $T$, defined by $\gamma(S)=\mu^{d+1}(T^{-1}(S))$ for any measurable set $S\subseteq[-r,r]^{d+1}$. Then the support of $\gamma$ is $[-r,r]^{d+1}$ by assumption.

We first show that $(\sigma,\gamma)$ is universal. As in the proof of Theorem III.5, if $\nu\in M(\mathcal{X})$ satisfies $\int_{\mathcal{X}}h(\boldsymbol{x})\,d\nu(\boldsymbol{x})=0$ for all $h\in\mathcal{H}_\gamma$, then by Fubini's theorem,

$$\int_{\mathcal{X}}\sigma(\tilde{\boldsymbol{w}}\cdot\boldsymbol{x}+\tilde{b})\,d\nu(\boldsymbol{x}) = 0$$

holds for all $(\tilde{\boldsymbol{w}}^\intercal,\tilde{b})\in[-r,r]^{d+1}$. By the positive homogeneity of $\sigma$, this equation actually holds for all $(\tilde{\boldsymbol{w}}^\intercal,\tilde{b})\in\mathbb{R}^{d+1}$. The argument in the proof of Theorem III.5 then implies that $(\sigma,\gamma)$ is universal.

Observe that any sample $(\boldsymbol{w}_i^\intercal,b_i)$ from $\mu^{d+1}$ corresponds to a sample $(\tilde{\boldsymbol{w}}_i^\intercal,\tilde{b}_i) = T(\boldsymbol{w}_i^\intercal,b_i)$ from $\gamma = T_{\#}\mu^{d+1}$, and, by the positive homogeneity of $\sigma$,

$$\sum_{i=1}^n a_i\,\sigma(\tilde{\boldsymbol{w}}_i\cdot\boldsymbol{x}+\tilde{b}_i) = \sum_{i=1}^n \frac{a_i}{c_i}\,\sigma(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i),$$

where $c_i=1$ if $\|(\boldsymbol{w}_i^\intercal,b_i)\|_\infty\le r$ and $c_i=\|(\boldsymbol{w}_i^\intercal,b_i)\|_\infty/r$ otherwise. Hence any random network built on the samples from $\gamma$ can be realized, after rescaling the readout coefficients, by a network built on the samples from $\mu^{d+1}$. We conclude that $(\sigma,\mu^{d+1})$ is universal. ∎

III-B Universality of echo state networks

In this section, we state and prove the random universal approximation theorem for echo state networks. Our analysis is based on the uniform law of large numbers and the universality of random feed-forward neural networks. For simplicity, from now on we assume that all internal weights in the network have the same distribution $\mu$. We make the following assumption on the activation function $\phi:\mathbb{R}\to\mathbb{R}$ and the distribution $\mu$, where we also assume that the input satisfies $\boldsymbol{x}\in[-2,2]^m$ (any compact set slightly larger than $[-1,1]^m$ is enough for our purpose).

Assumption III.8.

For any $m\in\mathbb{N}$, the pair $(\phi,\mu^{m+1})$ is universal and, for any $\epsilon_0>0$, there exists a measurable mapping $\varphi_m:\mathbb{R}^{m+1}\to\mathbb{R}^m$ such that

$$\sup_{\boldsymbol{x}\in[-2,2]^m}\left\|\boldsymbol{x} - \mathbb{E}_{(\boldsymbol{w}^\intercal,b)\sim\mu^{m+1}}[\varphi_m(\boldsymbol{w},b)\,\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)]\right\|_\infty \le \epsilon_0 \tag{8}$$

and $\|\varphi_m(\boldsymbol{w},b)\,\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)\|_\infty \le g_m(\boldsymbol{w},b)$ for some $g_m$ satisfying $\mathbb{E}[g_m(\boldsymbol{w},b)]<\infty$.

Corollary III.6 gives a sufficient condition for the pair $(\phi,\mu^{m+1})$ to be universal. Besides, in the proof of Theorem III.5, we have shown that the function class $\mathcal{H}_\mu$ defined by (7) is dense in $C(\mathcal{X})$ for any compact set $\mathcal{X}$. Therefore, if $\phi$ and $\mu$ satisfy the hypotheses of Corollary III.6 (so the pair $(\phi,\mu^{m+1})$ satisfies Theorem III.5), then there exists a bounded mapping $\varphi_m$ that satisfies (8), by the denseness of $\mathcal{H}_\mu$. In summary, any activation $\phi$ and distribution $\mu$ that satisfy the conditions in Corollary III.6 also satisfy Assumption III.8. However, the function $\varphi_m$ in (8) may be difficult to compute for a general activation function. We will give an explicit construction for the ReLU function in Corollary III.12.

By Lemma III.4, Assumption III.8 ensures that if we have $n$ i.i.d. samples $(\boldsymbol{w}_1^\intercal,b_1),\dots,(\boldsymbol{w}_n^\intercal,b_n)$ from $\mu^{m+1}$, we can approximately reconstruct the input by

$$\boldsymbol{x} \approx \frac{1}{n}\sum_{i=1}^n \varphi_m(\boldsymbol{w}_i,b_i)\,\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i).$$

In other words, the features $\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i)$ contain enough information about the input, and we can approximately recover it using $\varphi_m(\boldsymbol{w}_i,b_i)$ as coefficients. For echo state networks, this assumption guarantees that the hidden state does not lose too much information about the history of the input; hence we can view it as a "reservoir" and approximate the desired function at all time steps. We make this idea precise in the next theorem.

Theorem III.9.

Suppose the activation $\phi:\mathbb{R}\to\mathbb{R}$ and the distribution $\mu$ satisfy Assumption III.8. For $n,m,d\in\mathbb{N}$, let $(\boldsymbol{w}_1^\intercal,b_1),\dots,(\boldsymbol{w}_n^\intercal,b_n)$ be $n$ i.i.d. samples from $\mu^{md+1}$, and define the ESN

$$\begin{cases}\boldsymbol{s}_t = \phi\big(W(\tfrac{1}{n}P\varphi_{md}(W,\boldsymbol{b})\boldsymbol{s}_{t-1} + Q\boldsymbol{u}_t) + \boldsymbol{b}\big),\\ y_t = \boldsymbol{a}\cdot\boldsymbol{s}_t,\end{cases} \tag{9}$$

where $\boldsymbol{u}_t\in\mathbb{R}^d$, $\boldsymbol{a},\boldsymbol{s}_t\in\mathbb{R}^n$, $W=(\boldsymbol{w}_1,\dots,\boldsymbol{w}_n)^\intercal\in\mathbb{R}^{n\times md}$, $\boldsymbol{b}=(b_1,\dots,b_n)^\intercal\in\mathbb{R}^n$, we denote $\varphi_{md}(W,\boldsymbol{b}) := (\varphi_{md}(\boldsymbol{w}_1,b_1),\dots,\varphi_{md}(\boldsymbol{w}_n,b_n))\in\mathbb{R}^{md\times n}$ for a measurable function $\varphi_{md}:\mathbb{R}^{md+1}\to\mathbb{R}^{md}$, and

$$P = \begin{pmatrix}0_d & I_d & & \\ & \ddots & \ddots & \\ & & 0_d & I_d \\ & & & 0_d\end{pmatrix}\in\mathbb{R}^{md\times md}, \qquad Q = \begin{pmatrix}0_d\\ \vdots\\ 0_d\\ I_d\end{pmatrix}\in\mathbb{R}^{md\times d}, \tag{10}$$

where $0_d$ and $I_d$ are the $d\times d$ zero matrix and identity matrix, respectively. Then, for any continuous causal time-invariant operator $F:([-1,1]^d)^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}}$ and any $\epsilon,\delta\in(0,1)$, there exist $n,m\in\mathbb{N}$ and a measurable function $\varphi_{md}$ such that, with probability at least $1-\delta$, the system (9) has the existence of solutions property and there exists $\boldsymbol{a}\in\mathbb{R}^n$ so that the corresponding operator $F_{ESN}$ satisfies

$$\sup_{\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}}\|F(\boldsymbol{u})-F_{ESN}(\boldsymbol{u})\|_\infty \le \epsilon. \tag{11}$$
Proof.

The proof is divided into three steps.

Step 1: Choose the parameters $(W,\boldsymbol{b})$.

We choose $m=m_F(\epsilon/3)$; then by the definitions of $m_F(\epsilon/3)$ and $F^*_m$, for any $\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}$ and $t\in\mathbb{Z}$,

$$|F(\boldsymbol{u})_t - F^*_m(\boldsymbol{u}_{t-m+1:t})| \le \epsilon/3.$$

Since $F^*_m:[-1,1]^{md}\to\mathbb{R}$ is continuous by Proposition II.3, we can continuously extend it to $[-2,2]^{md}$ by setting $F^*_m(x_1,\dots,x_{md}) = F^*_m(\tilde{x}_1,\dots,\tilde{x}_{md})$, where $\tilde{x}_i = \max\{\min\{x_i,1\},-1\}$. By Assumption III.8, the pair $(\phi,\mu^{md+1})$ is universal. Hence, there exists $N_1\in\mathbb{N}$ such that for any $n\ge N_1$, with probability at least $1-\delta/2$,

$$\sup_{\boldsymbol{x}\in[-2,2]^{md}}\left|F^*_m(\boldsymbol{x}) - \sum_{i=1}^n a_i\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i)\right| \le \epsilon/3, \tag{12}$$

for some $\boldsymbol{a}=(a_1,\dots,a_n)^\intercal\in\mathbb{R}^n$.

By Assumption III.8, there exists $\varphi_{md}$ such that

$$\sup_{\boldsymbol{x}\in[-2,2]^{md}}\left\|\boldsymbol{x} - \mathbb{E}[\varphi_{md}(\boldsymbol{w},b)\,\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)]\right\|_\infty \le \frac{1}{2m}\omega^{-1}_{F^*_m}(\epsilon/3).$$

Lemma III.4 implies that
$$\sup_{\boldsymbol{x}\in[-2,2]^{md}}\left\|\mathbb{E}[\varphi_{md}(\boldsymbol{w},b)\,\phi(\boldsymbol{w}\cdot\boldsymbol{x}+b)] - \frac{1}{n}\sum_{i=1}^n\varphi_{md}(\boldsymbol{w}_i,b_i)\,\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i)\right\|_\infty \to 0$$

almost surely as $n\to\infty$. Thus, there exists $N_2\in\mathbb{N}$ such that for any $n\ge N_2$, with probability at least $1-\delta/2$,

$$\sup_{\boldsymbol{x}\in[-2,2]^{md}}\left\|\boldsymbol{x} - \frac{1}{n}\sum_{i=1}^n\varphi_{md}(\boldsymbol{w}_i,b_i)\,\phi(\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i)\right\|_\infty \le \frac{1}{m}\omega^{-1}_{F^*_m}(\epsilon/3),$$

which can be written as,

$$\sup_{\boldsymbol{x}\in[-2,2]^{md}}\left\|\boldsymbol{x} - \frac{1}{n}\varphi_{md}(W,\boldsymbol{b})\,\phi(W\boldsymbol{x}+\boldsymbol{b})\right\|_\infty \le \frac{1}{m}\omega^{-1}_{F^*_m}(\epsilon/3). \tag{13}$$

Without loss of generality, we may assume that $\omega^{-1}_{F^*_m}(\epsilon/3)\le 1/2$ by taking $\epsilon$ sufficiently small. Now we choose $n=\max\{N_1,N_2\}$; then with probability at least $1-\delta$, (12) and (13) hold simultaneously. We fix parameters $(W,\boldsymbol{b})$ that satisfy (12) and (13).

Step 2: Establish the existence of solutions property.

For any fixed $\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}$, we consider the operator $\psi:(\mathbb{R}^{md})^{\mathbb{Z}}\to(\mathbb{R}^{md})^{\mathbb{Z}}$ defined by

$$\psi(\boldsymbol{r})_t := \tfrac{1}{n}P\varphi_{md}(W,\boldsymbol{b})\,\phi(W\boldsymbol{r}_{t-1}+\boldsymbol{b}) + Q\boldsymbol{u}_t \in\mathbb{R}^{md}.$$

For convenience, we regard any vector in $\mathbb{R}^{md}$ as $m$ vectors in $\mathbb{R}^d$: for $i=1,\dots,m$, we denote the $((i-1)d+1)$-th to $(id)$-th entries of $\psi(\boldsymbol{r})_t$ by $\psi(\boldsymbol{r})_t[i]\in\mathbb{R}^d$. Then, by the definition of the matrices $P$ and $Q$, $\psi(\boldsymbol{r})_t[m]=\boldsymbol{u}_t$ and $\psi(\boldsymbol{r})_t[i] = (\tfrac{1}{n}\varphi_{md}(W,\boldsymbol{b})\,\phi(W\boldsymbol{r}_{t-1}+\boldsymbol{b}))[i+1]$ for $1\le i\le m-1$.

For $j=1,\dots,m$, we define the $j$-fold composition of $\psi$ inductively by $\psi^{(j)} = \psi\circ\psi^{(j-1)}$, where $\psi^{(0)}$ is the identity map. We are going to show that $\psi^{(m)}$ maps the compact set $([-\tfrac{3}{2},\tfrac{3}{2}]^{md})^{\mathbb{Z}}$ into itself and then use a fixed point theorem to establish the existence of solutions property. To do this, we assert that, if $\boldsymbol{r}\in([-\tfrac{3}{2},\tfrac{3}{2}]^{md})^{\mathbb{Z}}$, then, for $1\le i\le m-j$,

$$\left\|\psi^{(j)}(\boldsymbol{r})_t[i]\right\|_\infty \le \tfrac{3}{2} + \tfrac{j}{m}\omega^{-1}_{F^*_m}(\epsilon/3),$$

and for $m-j<k\le m$,

$$\left\|\psi^{(j)}(\boldsymbol{r})_t[k] - \boldsymbol{u}_{t-m+k}\right\|_\infty \le \tfrac{m-k}{m}\omega^{-1}_{F^*_m}(\epsilon/3).$$

We prove this by induction. When $j=1$, $\psi(\boldsymbol{r})_t[m]=\boldsymbol{u}_t\in[-1,1]^d$ and, by inequality (13), for $1\le i\le m-1$,

$$\left\|\psi(\boldsymbol{r})_t[i]\right\|_\infty \le \left\|\boldsymbol{r}_{t-1}[i+1]\right\|_\infty + \frac{1}{m}\omega^{-1}_{F^*_m}(\epsilon/3) \le \frac{3}{2} + \frac{1}{m}\omega^{-1}_{F^*_m}(\epsilon/3).$$

Hence, the assertion is true for $j=1$. Assume the assertion is true for some $1\le j\le m-1$; we show that it is true for $j+1$. Since $\psi^{(j)}(\boldsymbol{r})_t\in[-2,2]^{md}$, by inequality (13), for $1\le i\le m-1$,

$$\left\|\psi^{(j+1)}(\boldsymbol{r})_t[i] - \psi^{(j)}(\boldsymbol{r})_{t-1}[i+1]\right\|_\infty \le \frac{1}{m}\omega^{-1}_{F^*_m}(\epsilon/3),$$

and $\psi^{(j+1)}(\boldsymbol{r})_t[m]=\boldsymbol{u}_t$. Hence, by the induction hypothesis, for $1\le i\le m-j-1$,

$$\left\|\psi^{(j+1)}(\boldsymbol{r})_t[i]\right\|_\infty \le \left\|\psi^{(j)}(\boldsymbol{r})_{t-1}[i+1]\right\|_\infty + \frac{1}{m}\omega^{-1}_{F^*_m}(\epsilon/3) \le \frac{3}{2} + \frac{j+1}{m}\omega^{-1}_{F^*_m}(\epsilon/3),$$

and for $m-j-1<k\le m-1$,

$$\left\|\psi^{(j+1)}(\boldsymbol{r})_t[k] - \boldsymbol{u}_{t-m+k}\right\|_\infty \le \left\|\psi^{(j)}(\boldsymbol{r})_{t-1}[k+1] - \boldsymbol{u}_{t-m+k}\right\|_\infty + \frac{1}{m}\omega^{-1}_{F^*_m}(\epsilon/3) \le \frac{m-(k+1)}{m}\omega^{-1}_{F^*_m}(\epsilon/3) + \frac{1}{m}\omega^{-1}_{F^*_m}(\epsilon/3) = \frac{m-k}{m}\omega^{-1}_{F^*_m}(\epsilon/3).$$

Hence, the assertion is true and we conclude that

$$\left\|\psi^{(m)}(\boldsymbol{r})_t - \boldsymbol{u}_{t-m+1:t}\right\|_\infty \le \omega^{-1}_{F^*_m}(\epsilon/3) \le \frac{1}{2}$$

holds for $\boldsymbol{r}\in([-\tfrac{3}{2},\tfrac{3}{2}]^{md})^{\mathbb{Z}}$, where we regard $\boldsymbol{u}_{t-m+1:t}\in([-1,1]^d)^m$ as the vector $(\boldsymbol{u}_{t-m+1}^\intercal,\dots,\boldsymbol{u}_t^\intercal)^\intercal\in\mathbb{R}^{md}$ to simplify the notation. This inequality implies that $\psi^{(m)}$ maps the compact convex set $([-\tfrac{3}{2},\tfrac{3}{2}]^{md})^{\mathbb{Z}}$ into itself. By the continuity of $\psi^{(m)}$ and Schauder's fixed point theorem (see [28, Theorem 7.1]), $\psi^{(m)}$ has at least one fixed point, that is, a point $\boldsymbol{r}^*\in([-\tfrac{3}{2},\tfrac{3}{2}]^{md})^{\mathbb{Z}}$ such that $\psi^{(m)}(\boldsymbol{r}^*)=\boldsymbol{r}^*$. Note that the fixed point $\boldsymbol{r}^*$ depends on $\boldsymbol{u}$, because the operator $\psi$ is defined through $\boldsymbol{u}$.

Now we are ready to establish the existence of solutions property for the system (9). For any $\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}$, we define $\widetilde{\boldsymbol{r}}\in([-2,2]^{md})^{\mathbb{Z}}$ by $\widetilde{\boldsymbol{r}}_t = \psi^{(j_t)}(\boldsymbol{r}^*)_t$, where $j_t = t - \lfloor t/m\rfloor m\in\{0,1,\dots,m-1\}$. If $j_t=0$, then $\widetilde{\boldsymbol{r}}_t=\boldsymbol{r}^*_t$, $\widetilde{\boldsymbol{r}}_{t-1}=\psi^{(m-1)}(\boldsymbol{r}^*)_{t-1}$ and

$$\psi(\widetilde{\boldsymbol{r}})_t = \tfrac{1}{n}P\varphi_{md}(W,\boldsymbol{b})\,\phi(W\widetilde{\boldsymbol{r}}_{t-1}+\boldsymbol{b}) + Q\boldsymbol{u}_t = \tfrac{1}{n}P\varphi_{md}(W,\boldsymbol{b})\,\phi(W\psi^{(m-1)}(\boldsymbol{r}^*)_{t-1}+\boldsymbol{b}) + Q\boldsymbol{u}_t = \psi(\psi^{(m-1)}(\boldsymbol{r}^*))_t = \psi^{(m)}(\boldsymbol{r}^*)_t = \boldsymbol{r}^*_t = \widetilde{\boldsymbol{r}}_t.$$

If $j_t=j\in\{1,2,\dots,m-1\}$, then $\widetilde{\boldsymbol{r}}_t=\psi^{(j)}(\boldsymbol{r}^*)_t$, $\widetilde{\boldsymbol{r}}_{t-1}=\psi^{(j-1)}(\boldsymbol{r}^*)_{t-1}$ and

$$\psi(\widetilde{\boldsymbol{r}})_t = \tfrac{1}{n}P\varphi_{md}(W,\boldsymbol{b})\,\phi(W\widetilde{\boldsymbol{r}}_{t-1}+\boldsymbol{b}) + Q\boldsymbol{u}_t = \tfrac{1}{n}P\varphi_{md}(W,\boldsymbol{b})\,\phi(W\psi^{(j-1)}(\boldsymbol{r}^*)_{t-1}+\boldsymbol{b}) + Q\boldsymbol{u}_t = \psi(\psi^{(j-1)}(\boldsymbol{r}^*))_t = \psi^{(j)}(\boldsymbol{r}^*)_t = \widetilde{\boldsymbol{r}}_t.$$

In either case, we have $\psi(\widetilde{\boldsymbol{r}})_t = \widetilde{\boldsymbol{r}}_t$, or equivalently,

$$\widetilde{\boldsymbol{r}}_t = \tfrac{1}{n}P\varphi_{md}(W,\boldsymbol{b})\,\phi(W\widetilde{\boldsymbol{r}}_{t-1}+\boldsymbol{b}) + Q\boldsymbol{u}_t.$$

Since $\widetilde{\boldsymbol{r}}$ is a fixed point of $\psi$, it is also a fixed point of $\psi^{(m)}$, and hence

$$\left\|\widetilde{\boldsymbol{r}}_t - \boldsymbol{u}_{t-m+1:t}\right\|_\infty = \left\|\psi^{(m)}(\widetilde{\boldsymbol{r}})_t - \boldsymbol{u}_{t-m+1:t}\right\|_\infty \le \omega^{-1}_{F^*_m}(\epsilon/3),$$

where we regard $\boldsymbol{u}_{t-m+1:t}$ as a vector in $\mathbb{R}^{md}$ as before.

Let $\boldsymbol{s}_t := \phi(W\widetilde{\boldsymbol{r}}_t+\boldsymbol{b})$; then $\boldsymbol{s}$ satisfies the equation

$$\boldsymbol{s}_t = \phi\big(W(\tfrac{1}{n}P\varphi_{md}(W,\boldsymbol{b})\boldsymbol{s}_{t-1} + Q\boldsymbol{u}_t) + \boldsymbol{b}\big).$$

Hence, the system (9) has the existence of solutions property with probability at least $1-\delta$.

Step 3: Bound the approximation error.

For the chosen parameters $(W,\boldsymbol{b})$, we let $\boldsymbol{a}=(a_1,\dots,a_n)^\intercal\in\mathbb{R}^n$ be the vector that satisfies inequality (12); then

$$\sup_{\boldsymbol{x}\in[-2,2]^{md}}\left|F^*_m(\boldsymbol{x}) - \boldsymbol{a}\cdot\phi(W\boldsymbol{x}+\boldsymbol{b})\right| \le \epsilon/3.$$

Since the system (9) has the existence of solutions property, we can associate a causal time-invariant operator $F_{ESN}$ to the system (9) such that

$$F_{ESN}(\boldsymbol{u})_t = \boldsymbol{a}\cdot\boldsymbol{s}_t = \boldsymbol{a}\cdot\phi(W\widetilde{\boldsymbol{r}}_t+\boldsymbol{b}), \quad \boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}.$$

Therefore, for any $\boldsymbol{u}\in([-1,1]^d)^{\mathbb{Z}}$, we have $\|\widetilde{\boldsymbol{r}}_t - \boldsymbol{u}_{t-m+1:t}\|_\infty \le \omega^{-1}_{F^*_m}(\epsilon/3) \le 1/2$ and

$$|F_{ESN}(\boldsymbol{u})_t - F(\boldsymbol{u})_t| = |\boldsymbol{a}\cdot\phi(W\widetilde{\boldsymbol{r}}_t+\boldsymbol{b}) - F(\boldsymbol{u})_t| \le |\boldsymbol{a}\cdot\phi(W\widetilde{\boldsymbol{r}}_t+\boldsymbol{b}) - F^*_m(\widetilde{\boldsymbol{r}}_t)| + |F^*_m(\widetilde{\boldsymbol{r}}_t) - F^*_m(\boldsymbol{u}_{t-m+1:t})| + |F^*_m(\boldsymbol{u}_{t-m+1:t}) - F(\boldsymbol{u})_t| \le \epsilon/3 + \epsilon/3 + \epsilon/3 = \epsilon,$$

which completes the proof. ∎
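To make the role of the matrices $P,Q$ from (10) concrete, here is a small sketch (ours; `make_PQ` is a hypothetical helper): $P$ drops the oldest of the $m$ stacked input blocks and shifts the rest, while $Q$ writes the newest input into the last block, so $P\boldsymbol{x}+Q\boldsymbol{u}_t$ updates the stacked $m$-step window exactly as used in Step 2 of the proof.

```python
import numpy as np

def make_PQ(m, d):
    P = np.zeros((m * d, m * d))
    for i in range(m - 1):          # I_d blocks on the superdiagonal of P
        P[i * d:(i + 1) * d, (i + 1) * d:(i + 2) * d] = np.eye(d)
    Q = np.zeros((m * d, d))
    Q[-d:, :] = np.eye(d)           # I_d in the last block row of Q
    return P, Q

m, d = 4, 2
P, Q = make_PQ(m, d)
window = np.arange(m * d, dtype=float)   # stacks (u_{t-m}, ..., u_{t-1})
u_t = np.full(d, -1.0)
print(P @ window + Q @ u_t)              # oldest block dropped, u_t appended
```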

Theorem III.9 establishes random universal approximation for a special form of echo state networks, but we are only able to prove the existence of solutions property, not the echo state property, for the constructed network. We note that the echo state property is established for certain ReLU neural networks constructed in [9, 14]. It would be interesting to see whether this property also holds for the construction in Theorem III.9 (it seems that some regularity assumption on the activation function is needed). In the following remark, we show that the networks constructed in the proof of Theorem III.9 satisfy a property slightly weaker than the echo state property.

Remark III.10.

Recall that the echo state property can be formulated in a forward version [7, Theorem 2.1]: there exists a sequence $\delta_{0:\infty}$ with $\delta_t\to 0$ as $t\to\infty$ such that, for any state sequences $\boldsymbol{s}_{0:\infty}$ and $\boldsymbol{s}'_{0:\infty}$ generated by the system (9) with the same input $\boldsymbol{u}_{1:\infty}$ but different initial states $\boldsymbol{s}_0$ and $\boldsymbol{s}'_0$, it holds that $\|\boldsymbol{s}_t-\boldsymbol{s}'_t\|_\infty\le\delta_t$. Thus, the effect of the initial state vanishes as $t\to\infty$. In the proof of Theorem III.9, we have shown that $\boldsymbol{s}_t = \phi(W\widetilde{\boldsymbol{r}}_t+\boldsymbol{b})$ with $\|\widetilde{\boldsymbol{r}}_t - \boldsymbol{u}_{t-m+1:t}\|_\infty \le \omega^{-1}_{F^*_m}(\epsilon/3)$, which can be made smaller than any prescribed accuracy $\epsilon_0$ by taking $\epsilon$ small enough. Hence, for any fixed sample $(W,\boldsymbol{b})$, the difference of the states $\|\boldsymbol{s}_t-\boldsymbol{s}'_t\|_\infty$ is bounded for large $t$, by the continuity of the activation $\phi$. For example, if $\phi$ is Lipschitz continuous with Lipschitz constant $L$, then $\|\boldsymbol{s}_t-\boldsymbol{s}'_t\|_\infty \le \|\phi(W\widetilde{\boldsymbol{r}}_t+\boldsymbol{b}) - \phi(W\widetilde{\boldsymbol{r}}'_t+\boldsymbol{b})\|_\infty \le 2L\|W\|\epsilon_0$ by the triangle inequality, where $\|W\|$ denotes the operator norm induced by $\|\cdot\|_\infty$. Furthermore, the output sequences $y_t=\boldsymbol{a}\cdot\boldsymbol{s}_t$ and $y'_t=\boldsymbol{a}'\cdot\boldsymbol{s}'_t$ computed by the corresponding operators constructed in Theorem III.9 satisfy the same approximation accuracy (11) for large $t$, which means the outputs are close to each other. So, in our construction, the effect of the initial state is small.

Let us take a closer look at the result of Theorem III.9. For a given pair of activation and distribution (\phi,\mu), it shows that if the internal weights are sampled as in (9), then the echo state network is a universal approximator with high probability. Notice that the sampling procedure in (9) depends on the function \varphi_{md}, which is given implicitly in Assumption III.8. As mentioned above, if the pair (\phi,\mu) satisfies the conditions of Corollary III.6, the existence of \varphi_{md} is guaranteed. However, \varphi_{md} may be difficult to compute explicitly. In the following, we show how to compute \varphi_{md} for the widely used ReLU activation \phi=\sigma, when the distribution \mu is symmetric.

Proposition III.11.

Let μ\mu be a symmetric distribution on \mathbb{R} with variance x2𝑑μ(x)=M2<\int x^{2}d\mu(x)=M_{2}<\infty. Then, for the ReLU function σ\sigma and any mm\in\mathbb{N},

𝔼(𝒘,b)μm+1[𝒘σ(𝒘𝒙+b)]=M22𝒙.\mathbb{E}_{({\boldsymbol{w}}^{\intercal},b)\sim\mu^{m+1}}[{\boldsymbol{w}}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)]=\tfrac{M_{2}}{2}{\boldsymbol{x}}.
Proof.

Let us denote 𝒘=(w1,,wm){\boldsymbol{w}}=(w_{1},\dots,w_{m})^{\intercal} and 𝒙=(x1,,xm){\boldsymbol{x}}=(x_{1},\dots,x_{m})^{\intercal}. We need to show that 2𝔼[wjσ(𝒘𝒙+b)]=xjM22\mathbb{E}[w_{j}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)]=x_{j}M_{2} for any j=1,,mj=1,\dots,m. By symmetry,

𝔼[wjσ(𝒘𝒙+b)]=𝔼[wjσ(𝒘𝒙b)].\mathbb{E}[w_{j}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)]=-\mathbb{E}[w_{j}\sigma(-{\boldsymbol{w}}\cdot{\boldsymbol{x}}-b)].

Using the equality σ(x)σ(x)=x\sigma(x)-\sigma(-x)=x for the ReLU function, we have

2𝔼[wjσ(𝒘𝒙+b)]\displaystyle 2\mathbb{E}[w_{j}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)]
=\displaystyle= 𝔼[wjσ(𝒘𝒙+b)]𝔼[wjσ(𝒘𝒙b)]\displaystyle\mathbb{E}[w_{j}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)]-\mathbb{E}[w_{j}\sigma(-{\boldsymbol{w}}\cdot{\boldsymbol{x}}-b)]
=\displaystyle= 𝔼[wj(w1x1++wmxm+b)]\displaystyle\mathbb{E}[w_{j}(w_{1}x_{1}+\dots+w_{m}x_{m}+b)]
=\displaystyle= xj𝔼[wj2]=xjM2,\displaystyle x_{j}\mathbb{E}[w_{j}^{2}]=x_{j}M_{2},

where we use the facts that \mathbb{E}w_{j}w_{i}x_{i}=\mathbb{E}w_{j}\mathbb{E}w_{i}x_{i}=0 for i\neq j and \mathbb{E}w_{j}b=\mathbb{E}w_{j}\mathbb{E}b=0, because the random variables w_{1},\dots,w_{m},b are independent and symmetric, hence have zero mean. ∎
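Proposition III.11 is also easy to check by simulation. The sketch below uses the illustrative choice \mu= uniform on [-1/2,1/2] (a symmetric distribution with M_{2}=1/12) and compares a Monte Carlo estimate of \mathbb{E}[{\boldsymbol{w}}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)] with \frac{M_{2}}{2}{\boldsymbol{x}}; the specific x and sample sizes are arbitrary.

```python
import numpy as np

# Monte Carlo check of Proposition III.11 (illustrative): with mu uniform on
# [-1/2, 1/2] (a symmetric choice with M2 = 1/12), the estimate of
# E[w sigma(w.x + b)] should be close to (M2/2) x.
rng = np.random.default_rng(1)
m, N = 3, 2_000_000
M2 = 1.0 / 12.0
x = np.array([0.3, -0.7, 0.5])

w = rng.uniform(-0.5, 0.5, (N, m))
b = rng.uniform(-0.5, 0.5, N)
relu = np.maximum(w @ x + b, 0.0)
print((w * relu[:, None]).mean(axis=0))   # Monte Carlo estimate
print((M2 / 2) * x)                       # target: (M2/2) x = x / 24
```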

By Proposition III.11, for the ReLU function \sigma and a symmetric distribution \mu, we can choose \varphi_{m}({\boldsymbol{w}},b)=\frac{2}{M_{2}}{\boldsymbol{w}} in Assumption III.8 with error \epsilon_{0}=0. As a corollary, we obtain a random universal approximation theorem for ReLU echo state networks under very weak assumptions on the distribution \mu.

Corollary III.12.

Let \mu be a symmetric probability distribution on \mathbb{R} whose support contains the interval [-r,r] for some r>0 and whose variance \int x^{2}d\mu(x)=M_{2} is finite. For n,m,d\in\mathbb{N}, let ({\boldsymbol{w}}_{1}^{\intercal},b_{1}),\dots,({\boldsymbol{w}}_{n}^{\intercal},b_{n}) be n i.i.d. samples from \mu^{md+1}, and define the ReLU ESN

{𝒔t=σ(W(2nM2PW𝒔t1+Q𝒖t)+𝒃),yt=𝒂𝒔t,\begin{cases}{\boldsymbol{s}}_{t}=\sigma(W(\tfrac{2}{nM_{2}}PW^{\intercal}{\boldsymbol{s}}_{t-1}+Q{\boldsymbol{u}}_{t})+{\boldsymbol{b}}),\\ y_{t}={\boldsymbol{a}}\cdot{\boldsymbol{s}}_{t},\end{cases} (14)

where {\boldsymbol{u}}_{t}\in\mathbb{R}^{d}, {\boldsymbol{a}},{\boldsymbol{s}}_{t}\in\mathbb{R}^{n}, W=({\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{n})^{\intercal}\in\mathbb{R}^{n\times md}, {\boldsymbol{b}}=(b_{1},\dots,b_{n})^{\intercal}\in\mathbb{R}^{n} and P,Q are defined by (10). Then, for any continuous causal time-invariant operator F:([-1,1]^{d})^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}} and any \epsilon,\delta\in(0,1), there exist n,m\in\mathbb{N} such that, with probability at least 1-\delta, the system (14) has the existence of solutions property and there exists {\boldsymbol{a}}\in\mathbb{R}^{n} so that the corresponding operator F_{ESN} satisfies

sup𝒖([1,1]d)F(𝒖)FESN(𝒖)ϵ.\sup_{{\boldsymbol{u}}\in([-1,1]^{d})^{\mathbb{Z}}}\|F({\boldsymbol{u}})-F_{ESN}({\boldsymbol{u}})\|_{\infty}\leq\epsilon.
Proof.

For any mm\in\mathbb{N}, the pair (σ,μm+1)(\sigma,\mu^{m+1}) is universal by Corollary III.7 and satisfies Assumption III.8 with φm(𝒘,b)=2M2𝒘\varphi_{m}({\boldsymbol{w}},b)=\frac{2}{M_{2}}{\boldsymbol{w}} by Proposition III.11. The result then follows from Theorem III.9 by observing that φmd(W,𝒃)=(2M2𝒘1,,2M2𝒘n)=2M2W\varphi_{md}(W,{\boldsymbol{b}})=(\frac{2}{M_{2}}{\boldsymbol{w}}_{1},\dots,\frac{2}{M_{2}}{\boldsymbol{w}}_{n})=\frac{2}{M_{2}}W^{\intercal}. ∎
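The construction of Corollary III.12 is straightforward to implement. The sketch below instantiates the state recursion (14) for \mu= uniform on [-1/2,1/2] (so 2/M_{2}=24); since (10) is not reproduced here, the matrices P (a shift discarding the oldest d coordinates) and Q (insertion of {\boldsymbol{u}}_{t} in the freed slot) are our assumption about its form, made for illustration only, and the readout {\boldsymbol{a}} would be fitted on data in practice.

```python
import numpy as np

# Sketch of the ReLU ESN (14) with mu uniform on [-1/2, 1/2], so 2/M2 = 24.
# The shift matrix P (drop the oldest d coordinates) and the insertion
# matrix Q (place u_t in the freed slot) are assumptions about (10), made
# for illustration; the readout a would be trained by least squares.
rng = np.random.default_rng(2)
n, m, d = 200, 4, 2
M2 = 1.0 / 12.0
W = rng.uniform(-0.5, 0.5, (n, m * d))
b = rng.uniform(-0.5, 0.5, n)

P = np.zeros((m * d, m * d)); P[: (m - 1) * d, d:] = np.eye((m - 1) * d)
Q = np.zeros((m * d, d));     Q[(m - 1) * d:, :] = np.eye(d)

def step(s, u):
    # s_t = sigma(W((2/(n M2)) P W^T s_{t-1} + Q u_t) + b)
    r = (2.0 / (n * M2)) * (P @ (W.T @ s)) + Q @ u
    return np.maximum(W @ r + b, 0.0)

s = np.zeros(n)
for _ in range(100):
    s = step(s, rng.uniform(-1.0, 1.0, d))
print(s.shape)   # the trained output would be y_t = a . s_t
```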

IV Approximation bounds for ReLU networks

We have shown that the ReLU echo state network (14) with randomly generated weights can approximate any continuous causal time-invariant operator with high probability. In this section, we derive approximation bounds for sufficiently regular operators. To simplify the analysis, we will assume that each random weight in (14) is sampled from the uniform distribution on [-1/2,1/2]. Similar results can be derived for any distribution \mu with bounded support: one modifies Assumption IV.1 by replacing the integral (15) with respect to the uniform distribution by an expectation with respect to \mu, and the proofs of Lemma IV.3 and Theorem IV.4 then carry over.

The key ingredients of our approximation results are a certain integral representation of sufficiently regular functions and the random neural network approximation of functions admitting such representations. We make the following regularity assumption on the target continuous causal time-invariant operator F.

Assumption IV.1.

The function Fm:[1,1]mdF^{*}_{m}:[-1,1]^{md}\to\mathbb{R} defined by (2) can be represented as

Fm(𝒙)=1212[12,12]mdgm(𝒘,b)σ(𝒘𝒙+b)𝑑𝒘𝑑b,F^{*}_{m}({\boldsymbol{x}})=\int_{-\frac{1}{2}}^{\frac{1}{2}}\int_{[-\frac{1}{2},\frac{1}{2}]^{md}}g_{m}({\boldsymbol{w}},b)\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)d{\boldsymbol{w}}db, (15)

for some function gmg_{m} with uniform bound |gm(𝐰,b)|Bm|g_{m}({\boldsymbol{w}},b)|\leq B_{m}.

Neural network approximation of functions with integral representations was initially studied in the classical work of Barron [29]. This result has been generalized and widely discussed in recent articles [30, 31, 32, 33, 9]. In particular, [30, 9] showed that sufficiently regular functions (whose Fourier transforms decay sufficiently fast) can be represented by integral forms similar to (15). The following proposition is a modification of these results. The proof is deferred to the appendix.

Proposition IV.2.

Suppose f:df:\mathbb{R}^{d}\to\mathbb{R} admits a Fourier representation

f(𝒙)=df^(𝒘)ei𝒘𝒙𝑑𝒘,𝒙[1,1]d,f({\boldsymbol{x}})=\int_{\mathbb{R}^{d}}\hat{f}({\boldsymbol{w}})e^{i{\boldsymbol{w}}\cdot{\boldsymbol{x}}}d{\boldsymbol{w}},\quad{\boldsymbol{x}}\in[-1,1]^{d},

and supr>0sup𝐰1=1/2max(1,r2d+4)|f^(r𝐰)|B\sup_{r>0}\sup_{\|{\boldsymbol{w}}\|_{1}=1/2}\max(1,r^{2d+4})|\hat{f}(r{\boldsymbol{w}})|\leq B. Then there exists g:[1/2,1/2]d+1g:[-1/2,1/2]^{d+1}\to\mathbb{R} with |g(𝐰,b)|40B|g({\boldsymbol{w}},b)|\leq 40B such that

f(𝒙)=1212[12,12]dg(𝒘,b)σ(𝒘𝒙+b)𝑑𝒘𝑑b,𝒙[1,1]d.f({\boldsymbol{x}})=\int_{-\frac{1}{2}}^{\frac{1}{2}}\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}g({\boldsymbol{w}},b)\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)d{\boldsymbol{w}}db,\ {\boldsymbol{x}}\in[-1,1]^{d}.

The next lemma gives an error bound for the random approximation of the integral form (15). This result can be seen as a quantitative version of Lemma III.4. Intuitively, the integral can be regarded as an expectation with respect to the uniform distribution, so the law of large numbers implies that the empirical average converges to this expectation. We bound the approximation error using statistical learning theory [34] and concentration inequalities [35]. Specifically, we view the uniform approximation error as the supremum of the difference between the expectation f({\boldsymbol{x}}) and its empirical average; a concentration inequality then bounds the deviation of this supremum from its mean, and the mean of the supremum is controlled by the Rademacher complexity [36]. The proof is given in the appendix.

Lemma IV.3.

Let

f(𝒙)=[12,12]dg(𝒘)σ(𝒘𝒙)𝑑𝒘f({\boldsymbol{x}})=\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}g({\boldsymbol{w}})\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}})d{\boldsymbol{w}}

with |g(𝐰)|B|g({\boldsymbol{w}})|\leq B. Then, with probability at least 1δ1-\delta,

sup𝒙[r,r]d|f(𝒙)1ni=1ng(𝒘i)σ(𝒘i𝒙)|Brd(22dlog(n+1)n+log(2/δ)2n),\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\left|f({\boldsymbol{x}})-\frac{1}{n}\sum_{i=1}^{n}g({\boldsymbol{w}}_{i})\sigma({\boldsymbol{w}}_{i}\cdot{\boldsymbol{x}})\right|\\ \leq Brd\left(2\sqrt{\frac{2d\log(n+1)}{n}}+\sqrt{\frac{\log(2/\delta)}{2n}}\right),

where 𝐰1:n:=(𝐰i)1in{\boldsymbol{w}}_{1:n}:=({\boldsymbol{w}}_{i})_{1\leq i\leq n} are i.i.d. samples from the uniform distribution on [1/2,1/2]d[-1/2,1/2]^{d}.
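The rate in Lemma IV.3 is visible in simulation. In the sketch below, g is an arbitrary bounded choice with B=1 (ours, not from the paper), f is replaced by a high-accuracy Monte Carlo reference, and the sup-error of the n-sample average over a grid of [-r,r]^{d} is printed for increasing n; it decays roughly like n^{-1/2}.

```python
import numpy as np

# Empirical illustration of Lemma IV.3 (a sketch; g is an arbitrary bounded
# choice with B = 1). The n-sample average approximates f uniformly on a
# grid of [-r, r]^d at roughly the sqrt(log(n)/n) rate of the lemma.
rng = np.random.default_rng(3)
d, r = 2, 2.0
g = lambda w: np.cos(2 * np.pi * w[..., 0])        # |g(w)| <= 1

axes = np.linspace(-r, r, 9)
grid = np.stack(np.meshgrid(axes, axes), -1).reshape(-1, d)

def f_hat(ws):
    # (1/n) sum_i g(w_i) sigma(w_i . x), evaluated at every grid point x
    return np.array([np.mean(g(ws) * np.maximum(ws @ x, 0.0)) for x in grid])

f_ref = f_hat(rng.uniform(-0.5, 0.5, (1_000_000, d)))  # high-accuracy proxy
for n in (100, 1_000, 10_000, 100_000):
    err = np.max(np.abs(f_hat(rng.uniform(-0.5, 0.5, (n, d))) - f_ref))
    print(n, err)                                  # roughly halves per 4x n
```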

Now, we are ready to prove our approximation bound for ReLU echo state networks. The proof is similar to that of Theorem III.9: we reduce the approximation of the operator F to the approximation of the function F_{m}^{*} and construct an ESN to approximate F_{m}^{*}. Here, we quantify the approximation error by Lemma IV.3.

Theorem IV.4.

For n,m,dn,m,d\in\mathbb{N}, let (𝐰1,b1),,(𝐰n,bn)({\boldsymbol{w}}_{1}^{\intercal},b_{1}),\dots,({\boldsymbol{w}}_{n}^{\intercal},b_{n}) be nn i.i.d. samples from the uniform distribution on [1/2,1/2]md+1[-1/2,1/2]^{md+1}, and define the ReLU ESN

{𝒔t=σ(W(24nPW𝒔t1+Q𝒖t)+𝒃),yt=𝒂𝒔t,\begin{cases}{\boldsymbol{s}}_{t}=\sigma(W(\tfrac{24}{n}PW^{\intercal}{\boldsymbol{s}}_{t-1}+Q{\boldsymbol{u}}_{t})+{\boldsymbol{b}}),\\ y_{t}={\boldsymbol{a}}\cdot{\boldsymbol{s}}_{t},\end{cases}

where {\boldsymbol{u}}_{t}\in\mathbb{R}^{d}, {\boldsymbol{a}},{\boldsymbol{s}}_{t}\in\mathbb{R}^{n}, W=({\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{n})^{\intercal}\in\mathbb{R}^{n\times md}, {\boldsymbol{b}}=(b_{1},\dots,b_{n})^{\intercal}\in\mathbb{R}^{n} and P,Q are defined by (10). Then, for any n satisfying n/\log(n+1)\geq c(m,\delta,d) and any continuous causal time-invariant operator F:([-1,1]^{d})^{\mathbb{Z}}\to\mathbb{R}^{\mathbb{Z}} satisfying Assumption IV.1, with probability at least 1-\delta, the system has the existence of solutions property and there exists {\boldsymbol{a}}\in\mathbb{R}^{n} so that the corresponding operator F_{ESN} satisfies

sup𝒖([1,1]d)F(𝒖)FESN(𝒖)C(m,δ,d)log(n+1)n+F(m),\sup_{{\boldsymbol{u}}\in([-1,1]^{d})^{\mathbb{Z}}}\|F({\boldsymbol{u}})-F_{ESN}({\boldsymbol{u}})\|_{\infty}\\ \leq C(m,\delta,d)\sqrt{\frac{\log(n+1)}{n}}+\mathcal{E}_{F}(m),

where we can choose C(m,δ,d)=2Bm(md+1)(3m2d+1)(4md+1+log(4md/δ))C(m,\delta,d)=\sqrt{2}B_{m}(md+1)(3m^{2}d+1)(4\sqrt{md+1}+\sqrt{\log(4md/\delta)}) and c(m,δ,d)=1152m2(md+1)2(4md+1+log(4md/δ))2c(m,\delta,d)=1152m^{2}(md+1)^{2}(4\sqrt{md+1}+\sqrt{\log(4md/\delta)})^{2}.

Proof.

For any tt\in\mathbb{Z}, by equation (3),

sup𝒖([1,1]d)|F(𝒖)tFm(𝒖tm+1:t)|F(m).\sup_{{\boldsymbol{u}}\in([-1,1]^{d})^{\mathbb{Z}}}|F({\boldsymbol{u}})_{t}-F_{m}^{*}({\boldsymbol{u}}_{t-m+1:t})|\leq\mathcal{E}_{F}(m).

By Assumption IV.1, we can extend the domain of FmF_{m}^{*} to [2,2]md[-2,2]^{md} by the integral representation (15). Using Lemma IV.3, it holds with probability at least 1δ/21-\delta/2 that

sup𝒙[2,2]md|Fm(𝒙)1ni=1ng(𝒘i,bi)σ(𝒘i𝒙+bi)|\displaystyle\sup_{{\boldsymbol{x}}\in[-2,2]^{md}}\left|F_{m}^{*}({\boldsymbol{x}})-\frac{1}{n}\sum_{i=1}^{n}g({\boldsymbol{w}}_{i},b_{i})\sigma({\boldsymbol{w}}_{i}\cdot{\boldsymbol{x}}+b_{i})\right| (16)
\displaystyle\leq 2Bm(md+1)(4(md+1)log(n+1)n+log(4/δ)n)\displaystyle\sqrt{2}B_{m}(md+1)\left(4\sqrt{\frac{(md+1)\log(n+1)}{n}}+\sqrt{\frac{\log(4/\delta)}{n}}\right)
=\displaystyle= :1.\displaystyle:\mathcal{E}_{1}.

On the other hand, Proposition III.11 shows that 𝒙=24𝔼𝒘σ(𝒘𝒙+b){\boldsymbol{x}}=24\mathbb{E}{\boldsymbol{w}}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b). Hence, applying Lemma IV.3 to each coordinate of 𝒙{\boldsymbol{x}}, it holds with probability at least 1δ/21-\delta/2 that

sup𝒙[2,2]md𝒙24ni=1n𝒘iσ(𝒘i𝒙+bi)\displaystyle\sup_{{\boldsymbol{x}}\in[-2,2]^{md}}\left\|{\boldsymbol{x}}-\frac{24}{n}\sum_{i=1}^{n}{\boldsymbol{w}}_{i}\sigma({\boldsymbol{w}}_{i}\cdot{\boldsymbol{x}}+b_{i})\right\|_{\infty} (17)
\displaystyle\leq 122(md+1)\displaystyle 12\sqrt{2}(md+1)
×(4(md+1)log(n+1)n+log(4md/δ)n)\displaystyle\quad\times\left(4\sqrt{\frac{(md+1)\log(n+1)}{n}}+\sqrt{\frac{\log(4md/\delta)}{n}}\right)
=\displaystyle= :212m,\displaystyle:\mathcal{E}_{2}\leq\frac{1}{2m},

where the last inequality holds if n/log(n+1)1152m2(md+1)2(4md+1+log(4md/δ))2n/\log(n+1)\geq 1152m^{2}(md+1)^{2}(4\sqrt{md+1}+\sqrt{\log(4md/\delta)})^{2}. We note that, in matrix form,

𝒙24ni=1n𝒘iσ(𝒘i𝒙+bi)=𝒙24nWσ(W𝒙+𝒃).{\boldsymbol{x}}-\frac{24}{n}\sum_{i=1}^{n}{\boldsymbol{w}}_{i}\sigma({\boldsymbol{w}}_{i}\cdot{\boldsymbol{x}}+b_{i})={\boldsymbol{x}}-\frac{24}{n}W^{\intercal}\sigma(W{\boldsymbol{x}}+{\boldsymbol{b}}).

Therefore, with probability at least 1δ1-\delta over parameters (W,𝒃)(W,{\boldsymbol{b}}), inequalities (16) and (17) hold. For such parameters (W,𝒃)(W,{\boldsymbol{b}}), Step 2 of the proof of Theorem III.9 shows that, for any 𝒖([1,1]d){\boldsymbol{u}}\in([-1,1]^{d})^{\mathbb{Z}}, there exists 𝒓~([2,2]md)\widetilde{{\boldsymbol{r}}}\in([-2,2]^{md})^{\mathbb{Z}} such that

𝒓~t=24nPWσ(W𝒓~t1+𝒃)+Q𝒖t,\widetilde{{\boldsymbol{r}}}_{t}=\tfrac{24}{n}PW^{\intercal}\sigma(W\widetilde{{\boldsymbol{r}}}_{t-1}+{\boldsymbol{b}})+Q{\boldsymbol{u}}_{t},

and

𝒓~t𝒖tm+1:tm2,\left\|\widetilde{{\boldsymbol{r}}}_{t}-{\boldsymbol{u}}_{t-m+1:t}\right\|_{\infty}\leq m\mathcal{E}_{2},

where we regard 𝒖tm+1:t{\boldsymbol{u}}_{t-m+1:t} as a vector of md\mathbb{R}^{md} in the natural way. By the integral representation (15), we have

|Fm(𝒓~t)Fm(𝒖tm+1:t)|Bm[12,12]md𝒘1𝒓~t𝒖tm+1:t𝑑𝒘Bmm2d42.|F^{*}_{m}(\widetilde{{\boldsymbol{r}}}_{t})-F^{*}_{m}({\boldsymbol{u}}_{t-m+1:t})|\\ \leq B_{m}\int_{[-\frac{1}{2},\frac{1}{2}]^{md}}\|{\boldsymbol{w}}\|_{1}\left\|\widetilde{{\boldsymbol{r}}}_{t}-{\boldsymbol{u}}_{t-m+1:t}\right\|_{\infty}d{\boldsymbol{w}}\leq\frac{B_{m}m^{2}d}{4}\mathcal{E}_{2}.

Let 𝒔t:=σ(W𝒓~t+𝒃){\boldsymbol{s}}_{t}:=\sigma(W\widetilde{{\boldsymbol{r}}}_{t}+{\boldsymbol{b}}), then 𝒔{\boldsymbol{s}} satisfies the equation

𝒔t=σ(W(24nPW𝒔t1+Q𝒖t)+𝒃).{\boldsymbol{s}}_{t}=\sigma(W(\tfrac{24}{n}PW^{\intercal}{\boldsymbol{s}}_{t-1}+Q{\boldsymbol{u}}_{t})+{\boldsymbol{b}}).

Hence, the system has the existence of solutions property with probability at least 1-\delta, and we can associate a causal time-invariant operator F_{ESN} with the system such that

FESN(𝒖)t=𝒂𝒔t=𝒂σ(W𝒓~t+𝒃),𝒖([1,1]d).F_{ESN}({\boldsymbol{u}})_{t}={\boldsymbol{a}}\cdot{\boldsymbol{s}}_{t}={\boldsymbol{a}}\cdot\sigma(W\widetilde{{\boldsymbol{r}}}_{t}+{\boldsymbol{b}}),\quad{\boldsymbol{u}}\in([-1,1]^{d})^{\mathbb{Z}}.

Let 𝒂=1n(g(𝒘1,b1),,g(𝒘n,bn)){\boldsymbol{a}}=\frac{1}{n}(g({\boldsymbol{w}}_{1},b_{1}),\dots,g({\boldsymbol{w}}_{n},b_{n}))^{\intercal}, then by inequality (16),

|FESN(𝒖)tF(𝒖)t|=|𝒂σ(W𝒓~t+𝒃)F(𝒖)t|\displaystyle|F_{ESN}({\boldsymbol{u}})_{t}-F({\boldsymbol{u}})_{t}|=|{\boldsymbol{a}}\cdot\sigma(W\widetilde{{\boldsymbol{r}}}_{t}+{\boldsymbol{b}})-F({\boldsymbol{u}})_{t}|
\displaystyle\leq |𝒂σ(W𝒓~t+𝒃)Fm(𝒓~t)|+|Fm(𝒓~t)Fm(𝒖tm+1:t)|\displaystyle|{\boldsymbol{a}}\cdot\sigma(W\widetilde{{\boldsymbol{r}}}_{t}+{\boldsymbol{b}})-F^{*}_{m}(\widetilde{{\boldsymbol{r}}}_{t})|+|F^{*}_{m}(\widetilde{{\boldsymbol{r}}}_{t})-F^{*}_{m}({\boldsymbol{u}}_{t-m+1:t})|
+|Fm(𝒖tm+1:t)F(𝒖)t|\displaystyle\qquad+|F^{*}_{m}({\boldsymbol{u}}_{t-m+1:t})-F({\boldsymbol{u}})_{t}|
\displaystyle\leq 1+Bmm2d42+F(m)\displaystyle\mathcal{E}_{1}+\frac{B_{m}m^{2}d}{4}\mathcal{E}_{2}+\mathcal{E}_{F}(m)
\displaystyle\leq 2Bm(md+1)((12m2d+4)(md+1)log(n+1)n\displaystyle\sqrt{2}B_{m}(md+1)\left((12m^{2}d+4)\sqrt{\frac{(md+1)\log(n+1)}{n}}\right.
+log(4/δ)n+3m2dlog(4md/δ)n)+F(m)\displaystyle\qquad\left.+\sqrt{\frac{\log(4/\delta)}{n}}+3m^{2}d\sqrt{\frac{\log(4md/\delta)}{n}}\right)+\mathcal{E}_{F}(m)
\displaystyle\leq C(m,δ,d)log(n+1)n+F(m),\displaystyle C(m,\delta,d)\sqrt{\frac{\log(n+1)}{n}}+\mathcal{E}_{F}(m),

where C(m,δ,d)=2Bm(md+1)(3m2d+1)(4md+1+log(4md/δ))C(m,\delta,d)=\sqrt{2}B_{m}(md+1)(3m^{2}d+1)(4\sqrt{md+1}+\sqrt{\log(4md/\delta)}). ∎
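The heart of the proof is the averaged identity {\boldsymbol{x}}=24\,\mathbb{E}[{\boldsymbol{w}}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)] behind (17), i.e. that (24/n)W^{\intercal}\sigma(W{\boldsymbol{x}}+{\boldsymbol{b}}) reconstructs {\boldsymbol{x}} from random features. A quick numerical sanity check of this step (illustrative; the choices of md and n are arbitrary):

```python
import numpy as np

# Numerical check of the averaged identity behind (17): for rows of (W, b)
# uniform on [-1/2, 1/2]^{md+1}, (24/n) W^T sigma(W x + b) should be close
# to x for any x in [-2, 2]^{md}. Illustrative sketch only.
rng = np.random.default_rng(4)
md, n = 8, 200_000
W = rng.uniform(-0.5, 0.5, (n, md))
b = rng.uniform(-0.5, 0.5, n)
x = rng.uniform(-2.0, 2.0, md)
approx = (24.0 / n) * W.T @ np.maximum(W @ x + b, 0.0)
print(np.max(np.abs(x - approx)))   # small, shrinking like n^{-1/2}
```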

Remark IV.5.

A similar approximation bound has been obtained in [9], under the assumptions that the input sequences are sampled from some probability distribution and that the target operator is Lipschitz continuous; the bound there is on the expected L^{2} approximation error. In contrast, we generalize the Lipschitz continuity to a general modulus of continuity and obtain a high-probability bound for the uniform approximation error in Theorem IV.4. Note that, if B_{m} grows at most polynomially in m and \mathcal{E}_{F}(m) decays exponentially, say \mathcal{E}_{F}(m)\leq c_{0}e^{-c_{1}m} for some c_{0},c_{1}>0, then we can choose m\approx(2c_{1})^{-1}\log n and the approximation error is \widetilde{\mathcal{O}}(\sqrt{\log(1/\delta)/n}), where the notation \widetilde{\mathcal{O}} hides poly-logarithmic factors of n. In this case, the approximation rate is the same as [9, Theorem 2].
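For concreteness, here is the short calculation behind the last claim (a sketch under the remark's growth assumptions). Taking m=(2c_{1})^{-1}\log n gives

\mathcal{E}_{F}(m)\leq c_{0}e^{-c_{1}m}=\frac{c_{0}}{\sqrt{n}},\qquad C(m,\delta,d)\leq\mathrm{poly}(\log n)\big(1+\sqrt{\log(1/\delta)}\big),

since B_{m}, md+1, 3m^{2}d+1 and \sqrt{md+1} are all polynomial in m\approx\log n and \sqrt{\log(4md/\delta)}\leq\sqrt{\log(4md)}+\sqrt{\log(1/\delta)}. Hence the bound of Theorem IV.4 becomes C(m,\delta,d)\sqrt{\log(n+1)/n}+\mathcal{E}_{F}(m)=\widetilde{\mathcal{O}}(\sqrt{\log(1/\delta)/n}).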

V Conclusions

We have analyzed the approximation capacity of echo state networks with randomly generated internal weights. Under weak assumptions on the activation function, we constructed a sampling procedure for the internal weights so that the echo state networks can approximate any continuous causal time-invariant operator with prescribed accuracy. For the ReLU activation, we explicitly computed the random weights in our construction and derived approximation bounds for sufficiently regular operators.

Our results also raise some questions for future study. For example, our construction of the sampling procedure relies on the function \varphi_{m} in Assumption III.8. In other words, we only establish the universality of echo state networks when the parameters are sampled in a special way. In practical applications, the parameters of echo state networks are directly sampled from some simple distributions and do not have the particular form constructed in this paper. It would be interesting to extend the universality result to these “practical” sampling procedures. Another interesting question concerns the echo state property of the constructed networks. In Theorem III.9 and Remark III.10, we only prove that the constructed networks have the existence of solutions property and satisfy a property weaker than the echo state property. We believe that establishing the echo state property requires additional regularity conditions on the activation function.

Proof of Proposition II.3

For the first part, by the definition of m=mF(ϵ/3)m=m_{F}(\epsilon/3),

sup𝒖([1,1]d)|F(𝒖)Fm(𝒖m+1:0)|ϵ/3.\sup_{{\boldsymbol{u}}\in([-1,1]^{d})^{\mathbb{Z}_{-}}}|F^{*}({\boldsymbol{u}})-F^{*}_{m}({\boldsymbol{u}}_{-m+1:0})|\leq\epsilon/3.

Since the weighting sequence is decreasing,

𝒖m+1:0𝒗m+1:0ηm11𝒖𝒗ηωFm1(ϵ/3),\|{\boldsymbol{u}}_{-m+1:0}-{\boldsymbol{v}}_{-m+1:0}\|_{\infty}\leq\eta_{m-1}^{-1}\|{\boldsymbol{u}}-{\boldsymbol{v}}\|_{\eta}\leq\omega_{F^{*}_{m}}^{-1}(\epsilon/3),

which implies |Fm(𝒖m+1:0)Fm(𝒗m+1:0)|ϵ/3|F^{*}_{m}({\boldsymbol{u}}_{-m+1:0})-F^{*}_{m}({\boldsymbol{v}}_{-m+1:0})|\leq\epsilon/3. Therefore,

|F(𝒖)F(𝒗)|\displaystyle|F^{*}({\boldsymbol{u}})-F^{*}({\boldsymbol{v}})|
\displaystyle\leq |F(𝒖)Fm(𝒖m+1:0)|+|Fm(𝒖m+1:0)\displaystyle|F^{*}({\boldsymbol{u}})-F^{*}_{m}({\boldsymbol{u}}_{-m+1:0})|+|F^{*}_{m}({\boldsymbol{u}}_{-m+1:0})
Fm(𝒗m+1:0)|+|Fm(𝒗m+1:0)F(𝒗)|\displaystyle\qquad-F^{*}_{m}({\boldsymbol{v}}_{-m+1:0})|+|F^{*}_{m}({\boldsymbol{v}}_{-m+1:0})-F^{*}({\boldsymbol{v}})|
\displaystyle\leq ϵ/3+ϵ/3+ϵ/3=ϵ.\displaystyle\epsilon/3+\epsilon/3+\epsilon/3=\epsilon.

For the second part, we observe that, for any mm\in\mathbb{N} and 𝒖([1,1]d){\boldsymbol{u}}\in([-1,1]^{d})^{\mathbb{Z}_{-}},

𝒖(,𝟎,𝒖m+1:0)η=suptmηt𝒖tηm.\|{\boldsymbol{u}}-(\dots,\boldsymbol{0},{\boldsymbol{u}}_{-m+1:0})\|_{\eta}=\sup_{t\leq-m}\eta_{-t}\|{\boldsymbol{u}}_{t}\|_{\infty}\leq\eta_{m}.

Then, by the definition of ωF(δ;η)\omega_{F^{*}}(\delta;\eta),

F(m)\displaystyle\mathcal{E}_{F}(m) =sup𝒖([1,1]d)|F(𝒖)F(,𝟎,𝒖m+1:0)|\displaystyle=\sup_{{\boldsymbol{u}}\in([-1,1]^{d})^{\mathbb{Z}_{-}}}|F^{*}({\boldsymbol{u}})-F^{*}(\dots,\boldsymbol{0},{\boldsymbol{u}}_{-m+1:0})|
ωF(ηm,η).\displaystyle\leq\omega_{F^{*}}(\eta_{m},\eta).

If mm satisfies ηmωF1(ϵ;η)\eta_{m}\leq\omega_{F^{*}}^{-1}(\epsilon;\eta), then F(m)ωF(ηm,η)ϵ\mathcal{E}_{F}(m)\leq\omega_{F^{*}}(\eta_{m},\eta)\leq\epsilon. Hence,

mF(ϵ)\displaystyle m_{F}(\epsilon) =min{m:F(m)ϵ}\displaystyle=\min\{m\in\mathbb{N}:\mathcal{E}_{F}(m)\leq\epsilon\}
min{m:ηmωF1(ϵ;η)}.\displaystyle\leq\min\left\{m\in\mathbb{N}:\eta_{m}\leq\omega_{F^{*}}^{-1}(\epsilon;\eta)\right\}.

Finally, by the definition of FmF^{*}_{m} and (,𝟎,𝒖m+1:0)(,𝟎,𝒖m+1:0)η𝒖m+1:0𝒖m+1:0\|(\cdots,\boldsymbol{0},{\boldsymbol{u}}_{-m+1:0})-(\cdots,\boldsymbol{0},{\boldsymbol{u}}_{-m+1:0}^{\prime})\|_{\eta}\leq\|{\boldsymbol{u}}_{-m+1:0}-{\boldsymbol{u}}_{-m+1:0}^{\prime}\|_{\infty}, we have

ωFm(δ)\displaystyle\omega_{F^{*}_{m}}(\delta)
=\displaystyle= sup{|F(,𝟎,𝒖m+1:0)F(,𝟎,𝒖m+1:0)|\displaystyle\sup\{|F^{*}(\cdots,\boldsymbol{0},{\boldsymbol{u}}_{-m+1:0})-F^{*}(\cdots,\boldsymbol{0},{\boldsymbol{u}}_{-m+1:0}^{\prime})|
:𝒖m+1:0𝒖m+1:0δ}\displaystyle\qquad\qquad:\|{\boldsymbol{u}}_{-m+1:0}-{\boldsymbol{u}}_{-m+1:0}^{\prime}\|_{\infty}\leq\delta\}
\displaystyle\leq ωF(δ;η).\displaystyle\omega_{F^{*}}(\delta;\eta).
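As a concrete illustration of these bounds (our choice of weighting sequence, not from the paper): if \eta_{m}=2^{-m} and F^{*} is Lipschitz with respect to \|\cdot\|_{\eta} with constant L, so that \omega_{F^{*}}(\delta;\eta)=L\delta, then \mathcal{E}_{F}(m)\leq\omega_{F^{*}}(\eta_{m};\eta)=L2^{-m} and

m_{F}(\epsilon)\leq\min\{m\in\mathbb{N}:2^{-m}\leq\epsilon/L\}=\lceil\log_{2}(L/\epsilon)\rceil,

that is, the memory required for accuracy \epsilon grows only logarithmically in 1/\epsilon.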

Proof of Proposition IV.2

The proof is a modification of the argument in [30, 9]. Firstly, observe that, for any tt\in\mathbb{R},

0σ(tb)eib+σ(tb)eibdb=eitit1.-\int_{0}^{\infty}\sigma(t-b)e^{ib}+\sigma(-t-b)e^{-ib}db=e^{it}-it-1.
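For t>0 only the first summand contributes (\sigma(-t-b)=0 for b\geq 0), so the identity reduces to -\int_{0}^{t}(t-b)e^{ib}db=e^{it}-it-1; a quick symbolic verification of this case (the case t<0 follows by conjugation and the substitution t\mapsto-t, and both sides vanish at t=0):

```python
import sympy as sp

# Symbolic check of the identity for t > 0, where only the sigma(t - b)
# term survives, on b in [0, t].
t, b = sp.symbols('t b', positive=True)
lhs = -sp.integrate((t - b) * sp.exp(sp.I * b), (b, 0, t))
rhs = sp.exp(sp.I * t) - sp.I * t - 1
print(sp.simplify(lhs - rhs))   # prints 0
```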

For any 𝒙[1,1]d{\boldsymbol{x}}\in[-1,1]^{d}, letting t=𝒘𝒙t={\boldsymbol{w}}\cdot{\boldsymbol{x}}, we have

f(𝒙)f(0)𝒙f(0)\displaystyle f({\boldsymbol{x}})-\nabla f(0)\cdot{\boldsymbol{x}}-f(0)
=\displaystyle= d(ei𝒘𝒙i𝒘𝒙1)f^(𝒘)𝑑𝒘\displaystyle\int_{\mathbb{R}^{d}}(e^{i{\boldsymbol{w}}\cdot{\boldsymbol{x}}}-i{\boldsymbol{w}}\cdot{\boldsymbol{x}}-1)\hat{f}({\boldsymbol{w}})d{\boldsymbol{w}}
=\displaystyle= d0(σ(𝒘𝒙b)eib+σ(𝒘𝒙b)eib)\displaystyle-\int_{\mathbb{R}^{d}}\int_{0}^{\infty}(\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}-b)e^{ib}+\sigma(-{\boldsymbol{w}}\cdot{\boldsymbol{x}}-b)e^{-ib})
×f^(𝒘)dbd𝒘\displaystyle\qquad\times\hat{f}({\boldsymbol{w}})dbd{\boldsymbol{w}}
=\displaystyle= d0σ(𝒘𝒙+b)eibf^(𝒘)𝑑b𝑑𝒘\displaystyle-\int_{\mathbb{R}^{d}}\int_{-\infty}^{0}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)e^{-ib}\hat{f}({\boldsymbol{w}})dbd{\boldsymbol{w}}
d0σ(𝒘𝒙+b)eibf^(𝒘)𝑑b𝑑𝒘\displaystyle\qquad-\int_{\mathbb{R}^{d}}\int_{-\infty}^{0}\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)e^{ib}\hat{f}(-{\boldsymbol{w}})dbd{\boldsymbol{w}}
=\displaystyle= d𝒘10Re[eibf^(𝒘)eibf^(𝒘)]\displaystyle\int_{\mathbb{R}^{d}}\int_{-\|{\boldsymbol{w}}\|_{1}}^{0}\,{\rm Re}\,[-e^{-ib}\hat{f}({\boldsymbol{w}})-e^{ib}\hat{f}(-{\boldsymbol{w}})]
×σ(𝒘𝒙+b)dbd𝒘,\displaystyle\qquad\times\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)dbd{\boldsymbol{w}},

where, in the last equality, we use the fact that σ(𝒘𝒙+b)=0\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)=0 for b<𝒘1b<-\|{\boldsymbol{w}}\|_{1}. Since f(0),f(0)𝒙f(0),\nabla f(0)\cdot{\boldsymbol{x}}\in\mathbb{R}, we have dIm[f^(𝒘)]𝑑𝒘=0\int_{\mathbb{R}^{d}}\,{\rm Im}\,[\hat{f}({\boldsymbol{w}})]d{\boldsymbol{w}}=0 and d(𝒘𝒙)Re[f^(𝒘)]𝑑𝒘=0\int_{\mathbb{R}^{d}}({\boldsymbol{w}}\cdot{\boldsymbol{x}})\,{\rm Re}\,[\hat{f}({\boldsymbol{w}})]d{\boldsymbol{w}}=0, which implies

\nabla f(0)\cdot{\boldsymbol{x}}+f(0)
= \int_{\mathbb{R}^{d}}-({\boldsymbol{w}}\cdot{\boldsymbol{x}})\,{\rm Im}\,[\hat{f}({\boldsymbol{w}})]+\,{\rm Re}\,[\hat{f}({\boldsymbol{w}})]\,d{\boldsymbol{w}}
= \int_{\mathbb{R}^{d}}\int_{0}^{1/2}({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)\,(8\,{\rm Re}\,[\hat{f}({\boldsymbol{w}})]-2\,{\rm Im}\,[\hat{f}({\boldsymbol{w}})])\,db\,d{\boldsymbol{w}}
= \int_{\mathbb{R}^{d}}\int_{0}^{1/2}(\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)-\sigma(-{\boldsymbol{w}}\cdot{\boldsymbol{x}}-b))\,{\rm Re}\,[(8+2i)\hat{f}({\boldsymbol{w}})]\,db\,d{\boldsymbol{w}}
= \int_{\mathbb{R}^{d}}\int_{0}^{1/2}\,{\rm Re}\,[(8+2i)\hat{f}({\boldsymbol{w}})]\,\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)\,db\,d{\boldsymbol{w}}
\qquad-\int_{\mathbb{R}^{d}}\int_{-1/2}^{0}\,{\rm Re}\,[(8+2i)\hat{f}(-{\boldsymbol{w}})]\,\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)\,db\,d{\boldsymbol{w}},

where the second equality uses \int_{0}^{1/2}({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)\,db=\frac{1}{2}{\boldsymbol{w}}\cdot{\boldsymbol{x}}+\frac{1}{8} together with the two vanishing integrals above, the third uses \sigma(t)-\sigma(-t)=t with t={\boldsymbol{w}}\cdot{\boldsymbol{x}}+b and {\rm Re}\,[(8+2i)\hat{f}({\boldsymbol{w}})]=8\,{\rm Re}\,[\hat{f}({\boldsymbol{w}})]-2\,{\rm Im}\,[\hat{f}({\boldsymbol{w}})], and the last follows from the change of variables ({\boldsymbol{w}},b)\mapsto(-{\boldsymbol{w}},-b) in the second summand.

By Fubini’s theorem, we conclude that

f(𝒙)=d+1h(𝒘,b)σ(𝒘𝒙+b)𝑑𝒘𝑑b,f({\boldsymbol{x}})=\int_{\mathbb{R}^{d+1}}h({\boldsymbol{w}},b)\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)d{\boldsymbol{w}}db,

with

h({\boldsymbol{w}},b):= 1_{[-\|{\boldsymbol{w}}\|_{1},0]}(b)\,{\rm Re}\,[-e^{-ib}\hat{f}({\boldsymbol{w}})-e^{ib}\hat{f}(-{\boldsymbol{w}})]
\quad+1_{[0,1/2]}(b)\,{\rm Re}\,[(8+2i)\hat{f}({\boldsymbol{w}})]
\quad-1_{[-1/2,0]}(b)\,{\rm Re}\,[(8+2i)\hat{f}(-{\boldsymbol{w}})],

where 1_{S} denotes the indicator function of the set S. Note that, if |b|>\|{\boldsymbol{w}}\|_{1} and |b|>1/2, then h({\boldsymbol{w}},b)=0.

Next, we denote 𝒜={𝒘d:𝒘1>1/2}\mathcal{A}=\{{\boldsymbol{w}}\in\mathbb{R}^{d}:\|{\boldsymbol{w}}\|_{1}>1/2\}, ={𝒘d:0<𝒘1<1/2}\mathcal{B}=\{{\boldsymbol{w}}\in\mathbb{R}^{d}:0<\|{\boldsymbol{w}}\|_{1}<1/2\} and consider the mapping

T:𝒜,T(𝒘)=𝒘4𝒘12.T:\mathcal{B}\to\mathcal{A},\quad T({\boldsymbol{w}})=\frac{{\boldsymbol{w}}}{4\|{\boldsymbol{w}}\|_{1}^{2}}.

Then, the Jacobian of TT is

T(𝒘)=14𝒘12(Id2𝒘𝒘1sgn(𝒘))\nabla T({\boldsymbol{w}})=\frac{1}{4\|{\boldsymbol{w}}\|_{1}^{2}}\left(I_{d}-2\frac{{\boldsymbol{w}}}{\|{\boldsymbol{w}}\|_{1}}\,{\rm sgn}\,({\boldsymbol{w}})^{\intercal}\right)

with determinant

|det(T(𝒘))|=\displaystyle|\det(\nabla T({\boldsymbol{w}}))|= 1(2𝒘1)2d|12sgn(𝒘)𝒘𝒘1|\displaystyle\frac{1}{(2\|{\boldsymbol{w}}\|_{1})^{2d}}\left|1-2\,{\rm sgn}\,({\boldsymbol{w}})^{\intercal}\frac{{\boldsymbol{w}}}{\|{\boldsymbol{w}}\|_{1}}\right|
=\displaystyle= 1(2𝒘1)2d,\displaystyle\frac{1}{(2\|{\boldsymbol{w}}\|_{1})^{2d}},

where we use Sylvester’s determinant theorem. The change of variables formula implies

𝒜h(𝒘,b)σ(𝒘𝒙+b)𝑑𝒘𝑑b\displaystyle\int_{\mathbb{R}}\int_{\mathcal{A}}h({\boldsymbol{w}},b)\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)d{\boldsymbol{w}}db
=\displaystyle= h(𝒘4𝒘12,b)σ(𝒘𝒙4𝒘12+b)1(2𝒘1)2d𝑑𝒘𝑑b\displaystyle\int_{\mathbb{R}}\int_{\mathcal{B}}h\left(\frac{{\boldsymbol{w}}}{4\|{\boldsymbol{w}}\|_{1}^{2}},b\right)\sigma\left(\frac{{\boldsymbol{w}}\cdot{\boldsymbol{x}}}{4\|{\boldsymbol{w}}\|_{1}^{2}}+b\right)\frac{1}{(2\|{\boldsymbol{w}}\|_{1})^{2d}}d{\boldsymbol{w}}db
=\displaystyle= h(𝒘4𝒘12,b4𝒘12)σ(𝒘𝒙+b)\displaystyle\int_{\mathbb{R}}\int_{\mathcal{B}}h\left(\frac{{\boldsymbol{w}}}{4\|{\boldsymbol{w}}\|_{1}^{2}},\frac{b}{4\|{\boldsymbol{w}}\|_{1}^{2}}\right)\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)
×1(2𝒘1)2d+4d𝒘db,\displaystyle\qquad\times\frac{1}{(2\|{\boldsymbol{w}}\|_{1})^{2d+4}}d{\boldsymbol{w}}db,

where we use the homogeneity of ReLU in the last equality. Hence,

f(𝒙)\displaystyle f({\boldsymbol{x}})
=\displaystyle= h(𝒘,b)σ(𝒘𝒙+b)𝑑𝒘𝑑b\displaystyle\int_{\mathbb{R}}\int_{\mathcal{B}}h({\boldsymbol{w}},b)\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)d{\boldsymbol{w}}db
+𝒜h(𝒘,b)σ(𝒘𝒙+b)𝑑𝒘𝑑b\displaystyle\qquad+\int_{\mathbb{R}}\int_{\mathcal{A}}h({\boldsymbol{w}},b)\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)d{\boldsymbol{w}}db
=\displaystyle= (h(𝒘,b)+1(2𝒘1)2d+4h(𝒘4𝒘12,b4𝒘12))\displaystyle\int_{\mathbb{R}}\int_{\mathcal{B}}\left(h({\boldsymbol{w}},b)+\frac{1}{(2\|{\boldsymbol{w}}\|_{1})^{2d+4}}h\left(\frac{{\boldsymbol{w}}}{4\|{\boldsymbol{w}}\|_{1}^{2}},\frac{b}{4\|{\boldsymbol{w}}\|_{1}^{2}}\right)\right)
×σ(𝒘𝒙+b)d𝒘db.\displaystyle\qquad\times\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)d{\boldsymbol{w}}db.

We define

g(𝒘,b):=1(𝒘)(h(𝒘,b)+1(2𝒘1)2d+4h(𝒘4𝒘12,b4𝒘12)).g({\boldsymbol{w}},b)\\ :=1_{\mathcal{B}}({\boldsymbol{w}})\left(h({\boldsymbol{w}},b)+\frac{1}{(2\|{\boldsymbol{w}}\|_{1})^{2d+4}}h\left(\frac{{\boldsymbol{w}}}{4\|{\boldsymbol{w}}\|_{1}^{2}},\frac{b}{4\|{\boldsymbol{w}}\|_{1}^{2}}\right)\right).

Then, g({\boldsymbol{w}},b)=0 for \|{\boldsymbol{w}}\|_{1}>1/2. If \|{\boldsymbol{w}}\|_{1}\leq 1/2 and |b|>1/2, then |b|>\|{\boldsymbol{w}}\|_{1} and, since 4\|{\boldsymbol{w}}\|_{1}^{2}\leq 1, also |b|/(4\|{\boldsymbol{w}}\|_{1}^{2})\geq|b|>1/2 and |b|/(4\|{\boldsymbol{w}}\|_{1}^{2})>\|{\boldsymbol{w}}\|_{1}/(4\|{\boldsymbol{w}}\|_{1}^{2})=\|{\boldsymbol{w}}/(4\|{\boldsymbol{w}}\|_{1}^{2})\|_{1}; hence h({\boldsymbol{w}},b)=h(\frac{{\boldsymbol{w}}}{4\|{\boldsymbol{w}}\|_{1}^{2}},\frac{b}{4\|{\boldsymbol{w}}\|_{1}^{2}})=0, which implies g({\boldsymbol{w}},b)=0. This shows that

f(𝒙)=1212[12,12]dg(𝒘,b)σ(𝒘𝒙+b)𝑑𝒘𝑑b,𝒙[1,1]d.f({\boldsymbol{x}})=\int_{-\frac{1}{2}}^{\frac{1}{2}}\int_{[-\frac{1}{2},\frac{1}{2}]^{d}}g({\boldsymbol{w}},b)\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}+b)d{\boldsymbol{w}}db,\ {\boldsymbol{x}}\in[-1,1]^{d}.

Finally, for 0<𝒘11/20<\|{\boldsymbol{w}}\|_{1}\leq 1/2, by the definition of h(𝒘,b)h({\boldsymbol{w}},b) and the assumption that

supr>0sup𝒘1=1/2max(1,r2d+4)|f^(r𝒘)|B,\sup_{r>0}\sup_{\|{\boldsymbol{w}}\|_{1}=1/2}\max(1,r^{2d+4})|\hat{f}(r{\boldsymbol{w}})|\leq B,

we have |h({\boldsymbol{w}},b)|\leq(1+|8+2i|)(|\hat{f}({\boldsymbol{w}})|+|\hat{f}(-{\boldsymbol{w}})|)\leq 10(|\hat{f}({\boldsymbol{w}})|+|\hat{f}(-{\boldsymbol{w}})|)\leq 20B and

|1(2𝒘1)2d+4h(𝒘4𝒘12,b4𝒘12)|\displaystyle\left|\frac{1}{(2\|{\boldsymbol{w}}\|_{1})^{2d+4}}h\left(\frac{{\boldsymbol{w}}}{4\|{\boldsymbol{w}}\|_{1}^{2}},\frac{b}{4\|{\boldsymbol{w}}\|_{1}^{2}}\right)\right|
\displaystyle\leq 10r2d+4(|f^(r𝒘~)|+|f^(r𝒘~)|)20B,\displaystyle 10r^{2d+4}\left(\left|\hat{f}(r\widetilde{{\boldsymbol{w}}})\right|+\left|\hat{f}(-r\widetilde{{\boldsymbol{w}}})\right|\right)\leq 20B,

where 𝒘~=𝒘/(2𝒘1)\widetilde{{\boldsymbol{w}}}={\boldsymbol{w}}/(2\|{\boldsymbol{w}}\|_{1}) and r=1/(2𝒘1)r=1/(2\|{\boldsymbol{w}}\|_{1}). Therefore, |g(𝒘,b)|40B|g({\boldsymbol{w}},b)|\leq 40B.

Proof of Lemma IV.3

We denote the function h(𝒙,𝒘):=g(𝒘)σ(𝒘𝒙)h({\boldsymbol{x}},{\boldsymbol{w}}):=g({\boldsymbol{w}})\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}}) and the random variables

Z:=\displaystyle Z:= sup𝒙[r,r]d|f(𝒙)1ni=1nh(𝒙,𝒘i)|,\displaystyle\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\left|f({\boldsymbol{x}})-\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i})\right|,
Z+:=\displaystyle Z_{+}:= sup𝒙[r,r]df(𝒙)1ni=1nh(𝒙,𝒘i),\displaystyle\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}f({\boldsymbol{x}})-\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i}),
Z:=\displaystyle Z_{-}:= sup𝒙[r,r]d1ni=1nh(𝒙,𝒘i)f(𝒙).\displaystyle\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i})-f({\boldsymbol{x}}).

We are going to bound the expectation of these random variables by the Rademacher complexity [36]. Let 𝒘1:n=(𝒘i)1in{\boldsymbol{w}}_{1:n}^{\prime}=({\boldsymbol{w}}_{i}^{\prime})_{1\leq i\leq n} be i.i.d. samples from the uniform distribution on [1/2,1/2]d[-1/2,1/2]^{d}, independent of 𝒘1:n{\boldsymbol{w}}_{1:n}. Since 𝔼𝒘ih(𝒙,𝒘i)=f(𝒙)\mathbb{E}_{{\boldsymbol{w}}_{i}^{\prime}}h({\boldsymbol{x}},{\boldsymbol{w}}_{i}^{\prime})=f({\boldsymbol{x}}),

𝔼𝒘1:nZ+\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n}}Z_{+}
=\displaystyle= 𝔼𝒘1:n[sup𝒙[r,r]d𝔼𝒘1:n1ni=1nh(𝒙,𝒘i)1ni=1nh(𝒙,𝒘i)]\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n}}\left[\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\mathbb{E}_{{\boldsymbol{w}}_{1:n}^{\prime}}\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i}^{\prime})-\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i})\right]
\displaystyle\leq 𝔼𝒘1:n,𝒘1:n[sup𝒙[r,r]d1ni=1nh(𝒙,𝒘i)h(𝒙,𝒘i)].\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n},{\boldsymbol{w}}_{1:n}^{\prime}}\left[\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i}^{\prime})-h({\boldsymbol{x}},{\boldsymbol{w}}_{i})\right].

Let \xi_{1:n}=(\xi_{i})_{1\leq i\leq n} be a sequence of i.i.d. Rademacher random variables, independent of {\boldsymbol{w}}_{1:n} and {\boldsymbol{w}}_{1:n}^{\prime}. Then, by a symmetrization argument, we have

𝔼𝒘1:nZ+\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n}}Z_{+}
\displaystyle\leq 𝔼𝒘1:n,𝒘1:n[sup𝒙[r,r]d1ni=1nh(𝒙,𝒘i)h(𝒙,𝒘i)]\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n},{\boldsymbol{w}}_{1:n}^{\prime}}\left[\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i}^{\prime})-h({\boldsymbol{x}},{\boldsymbol{w}}_{i})\right]
=\displaystyle= 𝔼𝒘1:n,𝒘1:n,ξ1:n[sup𝒙[r,r]d1ni=1nξi(h(𝒙,𝒘i)h(𝒙,𝒘i))]\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n},{\boldsymbol{w}}_{1:n}^{\prime},\xi_{1:n}}\left[\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}(h({\boldsymbol{x}},{\boldsymbol{w}}_{i}^{\prime})-h({\boldsymbol{x}},{\boldsymbol{w}}_{i}))\right]
\displaystyle\leq 𝔼𝒘1:n,𝒘1:n,ξ1:n[sup𝒙[r,r]d1ni=1nξih(𝒙,𝒘i)\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n},{\boldsymbol{w}}_{1:n}^{\prime},\xi_{1:n}}\left[\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}h({\boldsymbol{x}},{\boldsymbol{w}}_{i}^{\prime})\right.
+sup𝒙[r,r]d1ni=1nξih(𝒙,𝒘i)]\displaystyle\qquad\left.+\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\frac{1}{n}\sum_{i=1}^{n}-\xi_{i}h({\boldsymbol{x}},{\boldsymbol{w}}_{i})\right]
=\displaystyle= 2𝔼𝒘1:n,ξ1:n[sup𝒙[r,r]d1ni=1nξih(𝒙,𝒘i)],\displaystyle 2\mathbb{E}_{{\boldsymbol{w}}_{1:n},\xi_{1:n}}\left[\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}h({\boldsymbol{x}},{\boldsymbol{w}}_{i})\right],

where the last equality holds because {\boldsymbol{w}}_{i} and {\boldsymbol{w}}_{i}^{\prime} have the same distribution and \xi_{i} and -\xi_{i} have the same distribution.

Observe that, for any 𝒘[1/2,1/2]d{\boldsymbol{w}}\in[-1/2,1/2]^{d} and 𝒙,𝒚[r,r]d{\boldsymbol{x}},{\boldsymbol{y}}\in[-r,r]^{d}, we have |h(𝒙,𝒘)|Brd/2|h({\boldsymbol{x}},{\boldsymbol{w}})|\leq Brd/2 and

|h(𝒙,𝒘)h(𝒚,𝒘)|\displaystyle|h({\boldsymbol{x}},{\boldsymbol{w}})-h({\boldsymbol{y}},{\boldsymbol{w}})| =|g(𝒘)σ(𝒘𝒙)g(𝒘)σ(𝒘𝒚)|\displaystyle=|g({\boldsymbol{w}})\sigma({\boldsymbol{w}}\cdot{\boldsymbol{x}})-g({\boldsymbol{w}})\sigma({\boldsymbol{w}}\cdot{\boldsymbol{y}})|
B|𝒘(𝒙𝒚)|12Bd𝒙𝒚.\displaystyle\leq B|{\boldsymbol{w}}\cdot({\boldsymbol{x}}-{\boldsymbol{y}})|\leq\tfrac{1}{2}Bd\|{\boldsymbol{x}}-{\boldsymbol{y}}\|_{\infty}.

For any \epsilon>0, there exists a set S_{\epsilon}=\{{\boldsymbol{x}}_{j}\}_{j=1}^{N}\subseteq[-r,r]^{d} of N\leq(1+r/\epsilon)^{d} points such that S_{\epsilon} is an \epsilon-covering of [-r,r]^{d}: for any {\boldsymbol{x}}\in[-r,r]^{d}, there exists {\boldsymbol{x}}_{j}\in S_{\epsilon} such that \|{\boldsymbol{x}}-{\boldsymbol{x}}_{j}\|_{\infty}\leq\epsilon. Hence,

𝔼𝒘1:nZ+\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n}}Z_{+}
\displaystyle\leq 2𝔼𝒘1:n,ξ1:n[sup𝒙jSϵ1ni=1nξih(𝒙j,𝒘i)\displaystyle 2\mathbb{E}_{{\boldsymbol{w}}_{1:n},\xi_{1:n}}\left[\sup_{{\boldsymbol{x}}_{j}\in S_{\epsilon}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}h({\boldsymbol{x}}_{j},{\boldsymbol{w}}_{i})\right.
+sup𝒙𝒙jϵ1ni=1nξi(h(𝒙,𝒘i)h(𝒙j,𝒘i))]\displaystyle\qquad\left.+\sup_{\|{\boldsymbol{x}}-{\boldsymbol{x}}_{j}\|_{\infty}\leq\epsilon}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}(h({\boldsymbol{x}},{\boldsymbol{w}}_{i})-h({\boldsymbol{x}}_{j},{\boldsymbol{w}}_{i}))\right]
\displaystyle\leq 2𝔼𝒘1:n,ξ1:n[sup𝒙jSϵ1ni=1nξih(𝒙j,𝒘i)]+Bdϵ.\displaystyle 2\mathbb{E}_{{\boldsymbol{w}}_{1:n},\xi_{1:n}}\left[\sup_{{\boldsymbol{x}}_{j}\in S_{\epsilon}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}h({\boldsymbol{x}}_{j},{\boldsymbol{w}}_{i})\right]+Bd\epsilon.

By Massart’s lemma [34, Lemma 26.8],

𝔼ξ1:n[sup𝒙jSϵ1ni=1nξih(𝒙j,𝒘i)]Brd2logN2nBrd22dlog(1+r/ϵ)n.\mathbb{E}_{\xi_{1:n}}\left[\sup_{{\boldsymbol{x}}_{j}\in S_{\epsilon}}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}h({\boldsymbol{x}}_{j},{\boldsymbol{w}}_{i})\right]\\ \leq\frac{Brd\sqrt{2\log N}}{2\sqrt{n}}\leq\frac{Brd}{2}\sqrt{\frac{2d\log(1+r/\epsilon)}{n}}.

If we choose ϵ=r/n\epsilon=r/\sqrt{n}, then

𝔼𝒘1:nZ+\displaystyle\mathbb{E}_{{\boldsymbol{w}}_{1:n}}Z_{+}\leq Brd2dlog(1+n)n+Brdn\displaystyle Brd\sqrt{\frac{2d\log(1+\sqrt{n})}{n}}+\frac{Brd}{\sqrt{n}}
\displaystyle\leq 2Brd2dlog(n+1)n.\displaystyle 2Brd\sqrt{\frac{2d\log(n+1)}{n}}.

Next, we apply McDiarmid’s inequality [35, Theorem 6.2] to the random variable Z+Z_{+}. Observe that, if one of the 𝒘j{\boldsymbol{w}}_{j} is replaced by 𝒘j{\boldsymbol{w}}_{j}^{\prime}, then the difference is

|sup𝒙[r,r]d[f(𝒙)1ni=1nh(𝒙,𝒘i)]sup𝒙[r,r]d\displaystyle\left|\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\left[f({\boldsymbol{x}})-\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i})\right]-\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\right.
[f(𝒙)1ni=1nh(𝒙,𝒘i)+1nh(𝒙,𝒘j)1nh(𝒙,𝒘j)]|\displaystyle\quad\left.\left[f({\boldsymbol{x}})-\frac{1}{n}\sum_{i=1}^{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{i})+\frac{1}{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{j})-\frac{1}{n}h({\boldsymbol{x}},{\boldsymbol{w}}_{j}^{\prime})\right]\right|
\displaystyle\leq 1nsup𝒙[r,r]d|h(𝒙,𝒘j)h(𝒙,𝒘j)|Brdn,\displaystyle\frac{1}{n}\sup_{{\boldsymbol{x}}\in[-r,r]^{d}}\left|h({\boldsymbol{x}},{\boldsymbol{w}}_{j})-h({\boldsymbol{x}},{\boldsymbol{w}}_{j}^{\prime})\right|\leq\frac{Brd}{n},

because the inequalities hold for each fixed {\boldsymbol{x}}\in[-r,r]^{d} and taking the supremum cannot increase the difference. Hence, McDiarmid’s inequality implies

(Z+𝔼Z+>t)exp(2nt2B2r2d2).\mathbb{P}(Z_{+}-\mathbb{E}Z_{+}>t)\leq\exp\left(-\frac{2nt^{2}}{B^{2}r^{2}d^{2}}\right).

For any δ(0,1)\delta\in(0,1), if we choose t=Brdlog(2/δ)/(2n)t=Brd\sqrt{\log(2/\delta)/(2n)}, then with probability at least 1δ/21-\delta/2,

Z+\displaystyle Z_{+}\leq 𝔼Z++t\displaystyle\mathbb{E}Z_{+}+t
\displaystyle\leq Brd(22dlog(n+1)n+log(2/δ)2n)=:.\displaystyle Brd\left(2\sqrt{\frac{2d\log(n+1)}{n}}+\sqrt{\frac{\log(2/\delta)}{2n}}\right)=:\mathcal{E}.

Applying a similar argument to Z_{-} shows that Z_{-}\leq\mathcal{E} with probability at least 1-\delta/2. Hence,

(Z>)(Z+>)+(Z>)δ.\mathbb{P}(Z>\mathcal{E})\leq\mathbb{P}(Z_{+}>\mathcal{E})+\mathbb{P}(Z_{-}>\mathcal{E})\leq\delta.

References

  • [1] M. Lukoševičius and H. Jaeger, “Reservoir computing approaches to recurrent neural network training,” Computer Science Review, vol. 3, no. 3, pp. 127–149, Aug. 2009.
  • [2] H. Jaeger and H. Haas, “Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication,” Science, vol. 304, no. 5667, pp. 78–80, Apr. 2004.
  • [3] W. Maass, T. Natschläger, and H. Markram, “Real-time computing without stable states: A new framework for neural computation based on perturbations,” Neural Computation, vol. 14, no. 11, pp. 2531–2560, Nov. 2002.
  • [4] G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose, “Recent advances in physical reservoir computing: A review,” Neural Networks, vol. 115, pp. 100–123, Jul. 2019.
  • [5] H. Jaeger, “The ’echo state’ approach to analysing and training recurrent neural networks-with an erratum note,” Bonn, Germany: German National Research Center for Information Technology, vol. 148, no. 34, p. 13, 2001.
  • [6] M. R. Buehner and P. M. Young, “A tighter bound for the echo state property,” IEEE Transactions on Neural Networks, vol. 17, no. 3, pp. 820–824, 2006.
  • [7] I. B. Yildiz, H. Jaeger, and S. J. Kiebel, “Re-visiting the echo state property,” Neural Networks, vol. 35, pp. 1–9, Nov. 2012.
  • [8] L. Grigoryeva and J.-P. Ortega, “Differentiable reservoir computing,” Journal of Machine Learning Research, vol. 20, no. 179, pp. 1–62, 2019.
  • [9] L. Gonon, L. Grigoryeva, and J.-P. Ortega, “Approximation bounds for random neural networks and reservoir systems,” The Annals of Applied Probability, vol. 33, no. 1, Feb. 2023.
  • [10] ——, “Risk bounds for reservoir computing,” Journal of Machine Learning Research, vol. 21, no. 240, pp. 1–61, 2020.
  • [11] L. Grigoryeva and J.-P. Ortega, “Echo state networks are universal,” Neural Networks, vol. 108, pp. 495–508, Dec. 2018.
  • [12] ——, “Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems,” Journal of Machine Learning Research, vol. 19, no. 24, pp. 1–40, 2018.
  • [13] L. Gonon and J.-P. Ortega, “Reservoir computing universality with stochastic inputs,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 1, pp. 100–112, 2020.
  • [14] ——, “Fading memory echo state networks are universal,” Neural Networks, vol. 138, pp. 10–13, Jun. 2021.
  • [15] J. Hanson and M. Raginsky, “Universal approximation of input-output maps by temporal convolutional nets,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019, pp. 14071–14081.
  • [16] S. Boyd and L. O. Chua, “Fading memory and the problem of approximating nonlinear operators with Volterra series,” IEEE Transactions on Circuits and Systems, vol. 32, no. 11, pp. 1150–1161, Nov. 1985.
  • [17] M. Gandhi and H. Jaeger, “Echo state property linked to an input: Exploring a fundamental characteristic of recurrent neural networks,” Neural Computation, vol. 25, no. 3, pp. 671–696, Mar. 2013.
  • [18] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems, vol. 2, no. 4, pp. 303–314, 1989.
  • [19] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
  • [20] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function,” Neural Networks, vol. 6, no. 6, pp. 861–867, Jan. 1993.
  • [21] A. Pinkus, “Approximation theory of the MLP model in neural networks,” Acta Numerica, vol. 8, pp. 143–195, Jan. 1999.
  • [22] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879–892, Jul. 2006.
  • [23] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
  • [24] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 42, no. 2, pp. 513–529, 2012.
  • [25] A. Hart, J. Hook, and J. Dawes, “Embedding and approximation theorems for echo state networks,” Neural Networks, vol. 128, pp. 234–247, Aug. 2020.
  • [26] R. I. Jennrich, “Asymptotic properties of non-linear least squares estimators,” The Annals of Mathematical Statistics, vol. 40, no. 2, pp. 633–643, Apr. 1969.
  • [27] G. B. Folland, Real analysis: modern techniques and their applications, 2nd ed.   John Wiley & Sons, 1999.
  • [28] J. H. Shapiro, A fixed-point farrago.   Springer International Publishing, 2016.
  • [29] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, May 1993.
  • [30] J. M. Klusowski and A. R. Barron, “Approximation by combinations of ReLU and squared ReLU ridge functions with l1l^{1} and l0l^{0} controls,” IEEE Transactions on Information Theory, vol. 64, no. 12, pp. 7649–7656, Dec. 2018.
  • [31] W. E, C. Ma, and L. Wu, “A priori estimates of the population risk for two-layer neural networks,” Communications in Mathematical Sciences, vol. 17, no. 5, pp. 1407–1425, 2019.
  • [32] ——, “The Barron space and the flow-induced function spaces for neural network models,” Constructive Approximation, vol. 55, no. 1, pp. 369–406, 2022.
  • [33] W. E and S. Wojtowytsch, “Representation formulas and pointwise properties for Barron functions,” Calculus of Variations and Partial Differential Equations, vol. 61, no. 2, Feb. 2022.
  • [34] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms.   Cambridge University Press, 2014.
  • [35] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence.   Oxford University Press, 2013.
  • [36] P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, pp. 463–482, 2002.