
[1] Wyant College of Optical Sciences, University of Arizona, Tucson, Arizona, USA
[2] School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia

Improving the Performance of Echo State Networks Through Feedback

Peter J. Ehlers [email protected] Hendra I. Nurdin [email protected] Daniel Soh [email protected]
Abstract

Reservoir computing, using nonlinear dynamical systems, offers a cost-effective alternative to neural networks for complex tasks involving processing of sequential data, time series modeling, and system identification. Echo state networks (ESNs), a type of reservoir computer, mirror neural networks but simplify training. They apply fixed, random linear transformations to the internal state, followed by nonlinear changes. This process, guided by input signals and linear regression, adapts the system to match target characteristics, reducing computational demands. A potential drawback of ESNs is that the fixed reservoir may not offer the complexity needed for specific problems. While directly altering (training) the internal ESN would reintroduce the computational burden, an indirect modification can be achieved by redirecting some output as input. This feedback can influence the internal reservoir state, yielding ESNs with enhanced complexity suitable for broader challenges. In this paper, we demonstrate that by feeding some component of the reservoir state back into the network through the input, we can drastically improve upon the performance of a given ESN. We rigorously prove that, for any given ESN, feedback will almost always improve the accuracy of the output. For a set of three tasks, each representing different problem classes, we find that with feedback the average error measures are reduced by 30%-60%. Remarkably, feedback provides at least an equivalent performance boost to doubling the initial number of computational nodes, a computationally expensive and technologically challenging alternative. These results demonstrate the broad applicability and substantial usefulness of this feedback scheme.

keywords:
Reservoir Computing, Echo State Network, Feedback Improvement

1 Introduction

Compared to recurrent neural networks where excellent performance could only be obtained with very computationally expensive system adjustment procedures, the premise of reservoir computing is to use a fixed nonlinear dynamical system of the form (1)-(2) to perform signal processing tasks:

x_{k+1} = f(x_k, u_k),    (1)
\hat{y}_k = W^{\top} x_k + C,    (2)

where uku_{k} is the input signal at time kk [1, 2, 3]. The output y^k\hat{y}_{k} of the dynamical system is then typically taken as a simple linear combination of the states (nodes) xx of the dynamical system as given in (2) plus some constant CC (as the output bias), where WW is a weight matrix and WW^{\top} is its transpose. The nodes correspond to a basis map that is used to approximate an unknown map which maps input (discrete-time) sequences to output sequences that are to be learned by the dynamical system. Using the linear combination of states makes the training extremely straightforward and efficient as the weight matrix WW can be determined by a simple linear regression.

Reservoir computers (RCs) have been extensively used to predict deterministic sequences, in particular chaotic sequences, and for data-based chaotic system modelling; see, e.g., [4, 5, 6]. In the deterministic setting they have found applications in channel equalization [4], chaos synchronisation and encryption [7], and model-free observers for chaotic systems [8]. RCs have also been studied for the modelling of stochastic signals and systems, with applications including time series modelling, forecasting, filtering and system identification [9, 10, 11, 12].

Physical reservoir computing employs a device with complex temporal evolution, tapping into the computational power of a nonlinear dynamical system without the extensive parameter optimization needed in typical neural networks. Inputs are fed into a reservoir, a natural system with complex dynamics, influencing its state based on current and past inputs due to its (limited) memory. The reservoir, running automatically, is altered only by these inputs. This approach models complex, nonlinear functions with minimal requirements: problem-related inputs and a linear fitting algorithm. Various physical platforms have been experimentally demonstrated for reservoir computing, including, for instance, photonics [13, 6], spintronics [14] and even quantum systems [15, 16, 17]. For a review of physical reservoir computing and quantum reservoir computing, see, e.g., [3, 18, 19, 20].

An echo state network (ESN) [4, 21, 22, 23] is a type of RC using an iterative structure for adding nonlinearity to inputs. It is similar to a recurrent neural network, except that the neural weights are fixed and optimization occurs only at the output layer. In an ESN, the reservoir state at time step kk is represented by vector xkx_{k}, equivalent to the neural outputs at step kk. Each step involves applying a fixed linear transformation given by a matrix AA to xkx_{k}, then adding to it a vector BB times the input value uku_{k}, forming a new vector zkz_{k}. The matrix AA represents the fixed neural weights, while the vector BB represents the biases. A nonlinear transformation on each element of zkz_{k} generates xk+1x_{k+1}, akin to neuron outputs. The affine function y^k\hat{y}_{k} of xkx_{k} given in (2) is then fit to a target value sequence yk{y_{k}}, giving an output y^kyk\hat{y}_{k}\approx y_{k} that approximates the target system. We describe the ESN framework with further detail in Section 2.1.
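To make the update rule just described concrete, the following is a minimal NumPy sketch of one ESN iteration and of collecting reservoir states over an input sequence; the helper names (esn_step, run_esn) are illustrative and not from the paper.

```python
import numpy as np

def sigmoid(z):
    # Element-wise nonlinearity g(z); the paper's numerical results use the sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def esn_step(x, u, A, B):
    # One ESN update: z_k = A x_k + B u_k, followed by the element-wise nonlinearity.
    return sigmoid(A @ x + B * u)

def run_esn(A, B, inputs, x0=None):
    # Drive the reservoir with a scalar input sequence and collect the states x_k.
    x = np.zeros(A.shape[0]) if x0 is None else x0
    states = []
    for u in inputs:
        x = esn_step(x, u, A, B)
        states.append(x)
    return np.array(states)   # shape (N, n)
```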

The main drawback of this approach is that any specific ESN is only going to be effective for a certain subset of problems because the transformations that the reservoir applies are fixed, so a specific reservoir will tend to modify the inputs in the same way, leading to a limited range of potential outputs. It has been shown in [10] that ESNs as a whole are universal, meaning that for any target sequence \{y_k\} and a given input sequence \{u_k\}, there will be an ESN with specific choices of A and B that can approximate the target to any desired accuracy. However, it may not be practically feasible to find a sufficiently accurate ESN for a particular problem of interest, as doing so may require an impractically large ESN, and one may have to settle for an ESN with weaker performance instead.

There have been previous efforts related to the above issue. In the context of an autonomous ESN with no external driving input, the work [24] introduces a number of architectures. One architecture includes adding a second auxiliary ESN network besides the principal “generator” ESN. The auxiliary is fed by a tunable linear combination of some nodes of the generator ESN, while a fixed (non-tunable) linear combination of the nodes of the auxiliary ESN is fed back to the generator network. The same error signal at the output of the generator ESN is used to train both the output weight of the principal and the weights that connect the generator to the auxiliary. The weight update is done recursively through an algorithm called the First-Order, Reduced and Controlled Error (FORCE) learning algorithm, which is in turn based on the recursive least squares algorithm. A second architecture does not use feedback but allows modification of some of the internal weights of the ESN besides the output weight, as in a conventional recurrent neural network. The internal and output weights are also updated using the FORCE algorithm. In [25], multiple ESNs with tunable output weights that are interconnected in a fixed feedforward architecture (with no feedback loops) are considered. A set of completely known but randomly generated “surrogate” ESNs are coupled according to some architecture and trained by simulation (“in silico”) using the backpropagation algorithm for artificial neural networks. The “intermediate” signals generated at the output of each component ESN are then used to train the output weights of another set of random ESNs, representing the “true” ESNs that will be deployed, in the same architecture. The output weights of the individual true ESNs can be trained by linear regression. In [26], in the continuous time setting, it was shown that an affine nonlinear system given by the nonlinear ODE:

\dot{x}_i = f(x_1, \ldots, x_n) + g(x_1, \ldots, x_n)\,v, \quad i = 1, 2, \ldots, n,

for scalar real functions x1,,xnx_{1},\ldots,x_{n} and vv is universal in the sense that it can exactly emulate any other nn-th order ODE of the form

z^{(n)} = G(z, z^{(1)}, \ldots, z^{(n-1)}) + u,

that is driven by a signal uu, where zz is a scalar signal and z(j)z^{(j)} denotes the jj-th derivative of zz with respect to time. The emulation is achieved by appropriately choosing scalar-valued real functions KK and hh and setting v=K(x,u)v=K(x,u) and z=h(x)z=h(x), where xx is the column vector x=(x1,,xn)x=(x_{1},\ldots,x_{n})^{\top}, where \top denotes the transpose of a matrix. Also, any system of kk higher order ODEs in zz of the form above can be emulated by using kk different feedback terms.

In this work, unlike [24], we are interested in ESNs that are driven by external input and are required to be convergent (forget their initial condition). Notably, the ESN in [24] could not be convergent because a limit cycle was used to generate a periodic output without any inputs in the ESN. When driven by an external input such networks may produce an output diverging in time. Also, while the study in [24] is motivated by biological networks, our work is motivated by the applications of physical reservoir computing. In particular, we are interested in enhancing the performance of a fixed physical RC by adding a simple but tunable structure external to the computer. Also, in contrast to [25], in this paper we do not consider multiple interconnected ESNs in a feedforward architecture and do not use surrogates, but train a single ESN augmented with a feedback loop directly with the data. While our work is related to [26] as discussed above, we do not seek universal emulation, but consider a linear feedback K(x)=VxK(x)=V^{\top}x that only depends on the state xx, but not the input uu, to enhance the approximation ability of ESNs.

The goal of this paper is to study the use of feedback in the context of ESNs and show that it will improve the performance of ESNs in the overwhelming majority of cases. Our proposal is to feed a linear function of the reservoir state back into the network as input. That is, for some vector VV, we change the input from uku_{k} to uk+Vxku_{k}+V^{\top}x_{k}. We then optimize VV with respect to the cost function to achieve a better fit to the target sequence {yk}\{y_{k}\}. This has the effect of changing the linear transformation AA that the reservoir performs on xkx_{k} at each time step, allowing us to partially control how the reservoir state evolves without modifying the reservoir itself. This will in essence provide us with a wider range of possible outputs for any given ESN, and can provide smaller ESNs an accuracy boost that makes them comparable to larger ones. Thus, our new paradigm of ESNs with feedback generates a significant performance boost with minimal perturbation of the system. We offer a thorough proof of a broad theorem, confidently ensuring that almost any ESN will experience a performance enhancement when a feedback mechanism is implemented, making this new scheme of ESNs with feedback universally applicable.

The structure of the paper is as follows. In Section 2, we provide some background on reservoir computers and echo state networks and introduce our feedback procedure. In Section 3, we provide a proof of the superiority of ESNs with feedback. In Section 4, we describe how the new parameters introduced by feedback are optimized. In Section 5, we provide numerical results that demonstrate the effectiveness of feedback for several different representative tasks. Finally, in Section 6 we give our concluding remarks.

In this paper, we denote the transpose of a matrix M as M^{\top}, with the same notation used for vectors. The n\times n identity matrix is written as \mathbb{I}_{n}, while the zero matrix of any size (including any zero vector) is written as \mathbf{0}. We treat an n-dimensional vector as an n\times 1 rectangular matrix in terms of notation, and in particular the outer product of two vectors v_{1} and v_{2} is written as v_{1}v_{2}^{\top}. The vector norm ||v|| denotes the standard 2-norm ||v||_{2}=\sqrt{v^{\top}v}. For a sequence whose kth element is given by a_{k}, we denote the entire sequence as \{a_{k}\}. When finite, a sum of the elements of such a sequence is often written notationally as if the elements were a sample of some stochastic process. We write weighted sums of these sequences as deterministic “expectation” values (averages), so that for a sequence \{a_{k}\} with N entries starting from k=0 we may write \langle a\rangle=\frac{1}{N}\sum_{k=0}^{N-1}a_{k}. We also define the mean of such a sequence as \overline{a}=\langle a\rangle and its variance as \sigma_{a}^{2}=\langle(a-\overline{a})^{2}\rangle. The expectation operator is denoted by \mathbb{E}[\cdot], the expectation of a random variable X is denoted by \mathbb{E}[X], and the conditional expectation of a random variable X given random variables Y_{1},\ldots,Y_{m} is denoted by \mathbb{E}[X|Y_{1},\ldots,Y_{m}]. We will denote the input and output sequences of the training data as \{u_{k}\}=\{u_{k}\}_{k=1,\ldots,N} and \{y_{k}\}=\{y_{k}\}_{k=1,\ldots,N}, respectively.

2 Theory of Reservoir Computing with Feedback

2.1 Reservoir Computing and Echo State Networks

A general RC is described by the following two equations:

x_{k+1} = f(x_k, u_k)    (3)
\hat{y}_k = h(x_k),    (4)

where xkx_{k} is a vector representing the reservoir state at time step kk, uku_{k} is the kkth member of some input sequence, and y^k\hat{y}_{k} is the predicted output. The function f(x,u)f(x,u) is defined by the reservoir and is fixed, but the output function h(x)h(x) is fit to the target sequence yky_{k} by minimizing a cost function SS. In practice we usually choose (for NN training data points)

h(x) = W^{\top}x + C    (5)
S = \frac{1}{2N}\sum_{k=0}^{N-1}(y_k - \hat{y}_k)^2,    (6)

where the scalar CC and vector WW are chosen to minimize SS, so that the problem of fitting the output function to data is just a linear regression problem. This setup is what enables the simulation and prediction of complex phenomena with a low computational overhead, because the reservoir dynamics encoded in f(x,u)f(x,u) are complex enough to get a nonlinear function of the inputs {uk}\{u_{k}\} that can then be made to approximate {yk}\{y_{k}\} using linear regression.

In order for a RC to work, it must obey what is known as the (uniform) convergence property, or echo state property [12, 27]. It states that, for a reservoir defined by the function f(x,u)f(x,u) and a given input sequence {uk}\{u_{k}\} defined for all kk\in\mathbb{Z}, there exists a unique sequence of reservoir states {xk}\{x_{k}\} that satisfy xk+1=f(xk,uk)x_{k+1}=f(x_{k},u_{k}) for all kk\in\mathbb{Z}. The consequence of this property is that the initial state of the reservoir in the infinite past does not have any bearing on what the current reservoir state is. This consequence combined with the continuity of f(x,u)f(x,u) leads to the fading memory property [28], which tells us that the dependence of xkx_{k} on an input uk0u_{k_{0}} for k>k0k>k_{0} must dwindle continuously to zero as kk0k-k_{0} tends to infinity. This means that any initial state dependence should become negligible after the RC runs for a certain amount of time, so that the RC is reusable and produces repeatable, deterministic results while also retaining some memory capacity for past inputs.

It has been shown [10, 29] that a given RC will have the uniform convergence property if the reservoir dynamics f(x,u)f(x,u) are contracting, or in other words if it satisfies

||f(x_1, u) - f(x_2, u)|| \leq \epsilon\,||x_1 - x_2||,    (7)

where \epsilon is some real number with 0<\epsilon<1. The norm in this inequality is arbitrary (as all norms on finite-dimensional vector spaces are equivalent), but it is usually chosen to be the standard vector norm ||v||_{2}=\sqrt{v^{\top}v}. This ensures that all reservoir states x will be driven toward the same sequence of states defined by the inputs \{u_k\}.

An ESN is a specific type of RC described above, with

f(x, u) = g(Ax + Bu),    (8)

where AA and BB are a random but fixed matrix and vector, respectively, while g(z)g(z) is a nonlinear function that acts on each component of its input zz. Throughout the paper we will take the dimension of the state xx to be nn, and the dimensions of AA and BB to be n×nn\times n and n×1n\times 1, respectively. For the output of the ESN we have that CC is a real scalar and WW is a real column vector of dimension nn. This design gives the ESN resemblance to a typical neural network, where the linear transformation zk=Axk+Bukz_{k}=Ax_{k}+Bu_{k} defines the input into the array of neurons, with AA providing the weights and BukBu_{k} providing a bias. The element-wise nonlinear function g(z)g(z) gives the array of outputs of the neurons as a function of the weighted inputs. The choices of g,A,g,A, and BB define a specific ESN, though in practice gg is often chosen to be one of a specific set of preferred functions such as the sigmoid σ(z)=(1+ez)1\sigma(z)=(1+e^{-z})^{-1} or tanh(z)\tanh(z) functions. In this work, we choose the sigmoid function for our numerical results.

The convergence of the ESN can be guaranteed by subjecting the matrix AA to the constraint that AA<a2𝕀nA^{\top}A<a^{2}\mathbb{I}_{n} for a constant a>0a>0. In other words, the singular values of AA must all be strictly less than some number aa which is determined by g(z)g(z). For the sigmoid function, we can use a=4a=4, while for the tanh\tanh function we use a=1a=1. This originates from proving that

||g(z_1) - g(z_2)|| \leq a^{-1}||z_1 - z_2||    (9)

for all z_{1},z_{2}\in\mathbb{R}^{n}, so that the convergence inequality will always be satisfied as long as

a^{-1}||(Ax_1 + Bu) - (Ax_2 + Bu)|| = a^{-1}||A(x_1 - x_2)|| \leq \epsilon\,||x_1 - x_2||,    (10)

for some 0<ϵ<10<\epsilon<1. Note that this is a sufficient but not necessary condition, as there could be combinations of A,B,A,B, and {uk}\{u_{k}\} such that

||g(Ax_1 + Bu_k) - g(Ax_2 + Bu_k)|| \leq \epsilon\,||x_1 - x_2||    (11)

for all x1,x2,x_{1},x_{2}, and kk, but this singular value criterion is much easier to test and design for while still providing a large space of possible reservoirs to choose from.
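As a practical illustration of this sufficient condition, the sketch below (illustrative, not from the paper) draws a random A and rescales it so that its largest singular value is strictly below a, with a = 4 for the sigmoid.

```python
import numpy as np

def make_contractive(A, a=4.0, margin=0.99):
    # Rescale A so its largest singular value is strictly below a,
    # which suffices for the convergence (echo state) property.
    s_max = np.linalg.norm(A, 2)        # spectral norm = largest singular value
    return A * (margin * a / s_max)

rng = np.random.default_rng(0)
n = 50
A = make_contractive(rng.normal(size=(n, n)), a=4.0)   # reservoir weights
B = rng.normal(size=n)                                 # input weights
assert np.linalg.norm(A, 2) < 4.0
```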

We parameterize how well the output of the ESN matches the target data using the normalized mean-square error (NMSE). In a linear regression problem we can show that the mean-squared error is

\langle(y - \hat{y})^2\rangle = \frac{1}{N}\sum_{k=0}^{N-1}(y_k - \hat{y}_k)^2 = 2S,    (12)

where we are averaging over the NN time steps corresponding to the training interval. With y^k=Wxk+C\hat{y}_{k}=W^{\top}x_{k}+C, we can show that

\langle(y - \hat{y})^2\rangle = \langle(C + W^{\top}x - y)^2\rangle    (13)
= (C + W^{\top}\langle x\rangle - \langle y\rangle)^2 + \langle(W^{\top}(x - \langle x\rangle) - (y - \langle y\rangle))^2\rangle    (14)
= (C + W^{\top}\langle x\rangle - \langle y\rangle)^2 + W^{\top}K_{xx}W - 2W^{\top}K_{xy} + \sigma_y^2,    (15)

where

K_{xx} = \langle(x - \langle x\rangle)(x - \langle x\rangle)^{\top}\rangle = \langle xx^{\top}\rangle - \langle x\rangle\langle x\rangle^{\top}    (16)
K_{xy} = \langle(x - \langle x\rangle)(y - \langle y\rangle)\rangle = \langle xy\rangle - \langle x\rangle\langle y\rangle.    (17)

The values of CC and WW that minimize the mean-squared error are

C = \langle y\rangle - W^{\top}\langle x\rangle    (18)
W = K_{xx}^{-1}K_{xy}.    (19)

Note that since KxxK_{xx} is a covariance matrix, it must be positive semi-definite, but by inverting it to find the optimal value of WW we have further assumed that it is positive definite. This assumption is equivalent to saying that all of the vectors xkx_{k} span the entire vector space nc\mathbb{R}^{n_{c}}, where ncn_{c} is the dimension of WW and all xkx_{k}’s. This is reasonable because in practice we usually take the number of training steps N>>ncN>>n_{c}, and since each xkx_{k} is a nonlinear transformation of the previous one, it is unlikely that any vector vv will satisfy vxk=0v^{\top}x_{k}=0 for all k0,,N1k\in{0,\dots,N-1}. Nevertheless, in the event that KxxK_{xx} is not invertible, we can take the pseudoinverse of KxxK_{xx} instead. This is because the components of WW parallel to the zero eigenvectors of KxxK_{xx} are not fixed by the optimization (which is why the inversion fails in the first place), so we are free to choose those components to be zero, which makes Eq. (19) correct when using the pseudoinverse of KxxK_{xx} as well.
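A short sketch of the readout training implied by Eqs. (16)-(19), using the pseudoinverse of K_{xx} as discussed above; the function and variable names are illustrative.

```python
import numpy as np

def train_readout(states, y):
    # states: (N, n_c) reservoir states over the training window; y: (N,) targets.
    N = len(y)
    x_mean = states.mean(axis=0)
    y_mean = y.mean()
    Xc = states - x_mean                       # mean-adjusted states
    Kxx = Xc.T @ Xc / N                        # Eq. (16)
    Kxy = Xc.T @ (y - y_mean) / N              # Eq. (17)
    W = np.linalg.pinv(Kxx) @ Kxy              # Eq. (19), pseudoinverse in case Kxx is singular
    C = y_mean - W @ x_mean                    # Eq. (18)
    return W, C
```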

Plugging the optimized values of CC and WW into the mean-squared error gives

\left(\langle(y - \hat{y})^2\rangle\right)_{\min} = \sigma_y^2 - K_{xy}^{\top}K_{xx}^{-1}K_{xy}.    (20)

From the original expression for the mean-squared error in Eq. (12), we can see that it is non-negative. Since KxxK_{xx} is a covariance matrix, the quantity KxyKxx1Kxy=WKxxW0K_{xy}^{\top}K_{xx}^{-1}K_{xy}=W^{\top}K_{xx}W\geq 0. Thus ((yky^k)2)min\left(\langle(y_{k}-\hat{y}_{k})^{2}\rangle\right)_{\min} is bounded above by σy2\sigma_{y}^{2}, so we may define a normalized mean-squared error, or NMSE, by

\mathrm{NMSE} = \frac{\langle(y - \hat{y})^2\rangle}{\sigma_y^2}.    (21)

This quantity is guaranteed to be between 0 and 1 for the training data, though it may exceed 1 for an arbitrary test data set. We can also see from Eqs. (12) and (20) that the task of minimizing S as a function of C, W, and V is equivalent to maximizing K_{xy}^{\top}K_{xx}^{-1}K_{xy} as a function of V.
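For completeness, a one-function sketch of the NMSE of Eq. (21) (illustrative, assuming NumPy arrays):

```python
import numpy as np

def nmse(y, y_hat):
    # Normalized mean-squared error: MSE divided by the variance of the target.
    return np.mean((y - y_hat) ** 2) / np.var(y)
```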

2.2 ESNs with Feedback

The main result of this work is the introduction of a feedback procedure to improve the performance of ESNs. We add an additional step to the process in which the input at each time step is taken to be u_k+V^{\top}x_k instead of just u_k. The reservoir of the ESN is then described by

x_{k+1} = g(Ax_k + B(u_k + V^{\top}x_k)) = g((A + BV^{\top})x_k + Bu_k).    (22)

From this equation, we see that the feedback causes this ESN to behave like a different network that uses A¯=A+BV\overline{A}=A+BV^{\top} as a transformation matrix instead of AA. We achieve this without modifying the RC itself, using only the pre-existing input channel and the reservoir states that we are already measuring. This provides a practical way of changing the reservoir dynamics without any internal hardware modification. We then optimize for VV using batch gradient descent to further reduce the cost function SS.

Note, however, that in attempting to modify A we run the risk of eliminating the uniform convergence of the ESN. Thus, a constraint must be placed on V in order to keep the network convergent. In accordance with the constraint in Eq. (10), we require that \overline{A}^{\top}\overline{A}<a^{2}\mathbb{I}_{n} in addition to A^{\top}A<a^{2}\mathbb{I}_{n}, which places some limitations on the value of V. This constraint is generally difficult to solve explicitly beyond this inequality, but it is possible to formulate it as a linear matrix inequality in \overline{A}; see, e.g., [12, §IV]. In addition, this condition can easily be enforced during the process of optimizing V.
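The following sketch (with illustrative helper names, not from the paper) shows the feedback update of Eq. (22) and the corresponding sufficient convergence test on \overline{A}=A+BV^{\top}:

```python
import numpy as np

def esn_step_feedback(x, u, A, B, V):
    # One update of the ESN with feedback: the effective input is u_k + V^T x_k, Eq. (22).
    z = (A + np.outer(B, V)) @ x + B * u
    return 1.0 / (1.0 + np.exp(-z))            # element-wise sigmoid

def feedback_is_convergent(A, B, V, a=4.0):
    # Sufficient condition: largest singular value of A_bar = A + B V^T strictly below a.
    return np.linalg.norm(A + np.outer(B, V), 2) < a
```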

3 Universal Superiority of ESN with Feedback over ESN without Feedback

In this section, we prove our central theorem stating that the ESN with feedback accomplishes smaller overall errors than the ESN without feedback. For this, we start with a theorem for an individual ESN:

Theorem 1 (Superiority of feedback for a given ESN and training data).

For any given matrix A and vector B in Eq. (8), and given sets of training inputs \{u_k\}=\{u_k\}_{k=1,\ldots,N} and outputs \{y_k\}=\{y_k\}_{k=1,\ldots,N} of finite length, define an optimized cost function S_{\mathrm{min}}(A,B,\{u_k\},\{y_k\}) with the corresponding optimal W and C. Then, for almost any given (A,B,\{u_k\},\{y_k\}), except for a vanishingly small set of (A,B,\{u_k\},\{y_k\}), feedback always reduces the cost function further:

\min_V S_{\mathrm{min}}(A + BV^{\top}, B, \{u_k\}, \{y_k\}) < S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\}).    (23)

Moreover, if A is such that A^{\top}A<a^{2}\mathbb{I}_{n}, where a is a constant that guarantees that the ESN is convergent, then the feedback gain V can always be chosen such that the ESN with feedback is also convergent and satisfies the above.

3.1 Preliminary Definitions and Relations

To prove Theorem 1, we will set up a number of lemmas and definitions prior to starting the main proof. This preliminary work will primarily concern the cases in which Eq. (23) does not hold, and the lemmas will rigorously show that the set of (A,B,\{u_k\},\{y_k\}) for which this happens is vanishingly small. The main proof of Theorem 1 will then establish the strict inequality in all other cases.

We will need a number of new symbols and definitions to facilitate the proofs of Theorem 1 and the following lemmas. For a given RC (A,B) and training data set (\{u_k\},\{y_k\}), there are several cases in which the gradient of the minimized cost function with respect to the feedback vector V may be zero. Consider an ESN with a specific choice of the matrix A, vector B, and nonlinear function \sigma(z). Also consider a fixed input sequence \{u_k\}, and to train our network we will use the N time steps ranging from 0 to N-1. To see how the derivative of the minimized cost function S_{\mathrm{min}} with respect to the feedback parameters V can be zero, define the matrix X_{ik}=\frac{1}{\sqrt{N}}(x_{k,i}-\overline{x}_{i}) for time steps k in the training data set. This is similar to the procedure used in [30] to optimize for W. In other words, with n_c+1 computational nodes (n_c coming from the vector W and 1 from C) and N>n_c training data points, the matrix X is an n_c\times N rectangular matrix whose columns are proportional to the mean-adjusted reservoir state x_k-\overline{x} at each time step k in the training set. Further, define the vector Y_k=\frac{1}{\sqrt{N}}y_k. With these definitions, we can rewrite the quantities previously defined in the context of Eq. (16) as

K_{xy} = \frac{1}{N}\sum_{k=0}^{N-1}(x_k - \overline{x})(y_k - \overline{y}) = \frac{1}{N}\sum_{k=0}^{N-1}(x_k - \overline{x})y_k - \mathbf{0} = XY    (24)
K_{xx} = \frac{1}{N}\sum_{k=0}^{N-1}(x_k - \overline{x})(x_k - \overline{x})^{\top} = XX^{\top}.    (25)

Here, the second equality of Eq. (24) used the fact that 1Nk=0N1(xkx¯)y¯=x¯y¯x¯y¯=𝟎\frac{1}{N}\sum_{k=0}^{N-1}(x_{k}-\bar{x})\bar{y}=\bar{x}\bar{y}-\bar{x}\bar{y}=\mathbf{0}.

Denote the pseudoinverse of XX as X1X^{-1}. Note that while XX1XX^{-1} is the nc×ncn_{c}\times n_{c} identity matrix, X1XX^{-1}X is not the N×NN\times N identity matrix in the vector space of time steps denoted by kk. Instead, it is a projection operator we will call Πx\Pi_{x}. The singular value decomposition of XX is given by X=UncΣUNX=U_{n_{c}}\Sigma U_{N}^{\top}, where UncU_{n_{c}} is an nc×ncn_{c}\times n_{c} orthogonal matrix, Σ\Sigma is taken to be an nc×Nn_{c}\times N rectangular diagonal matrix with non-negative values, and UNU_{N} is an N×NN\times N orthogonal matrix. The pseudoinverse of XX is defined to be X1UNΣ1UncX^{-1}\equiv U_{N}\Sigma^{-1}U_{n_{c}}^{\top}, where the pseudoinverse of Σ\Sigma is defined so that with Σjk=σjδjk\Sigma_{jk}=\sigma_{j}\delta_{jk} we have Σkj1=σj1δjk\Sigma^{-1}_{kj}=\sigma_{j}^{-1}\delta_{jk}. This also implies that (X1)=Unc(Σ1)UN=(X)1(X^{-1})^{\top}=U_{n_{c}}(\Sigma^{-1})^{\top}U_{N}^{\top}=(X^{\top})^{-1} since (Σ1)=(Σ)1(\Sigma^{-1})^{\top}=(\Sigma^{\top})^{-1}.

The product of Σ1\Sigma^{-1} and Σ\Sigma is given by Σ1Σ=Πnc\Sigma^{-1}\Sigma=\Pi_{n_{c}}, where the elements of Πnc\Pi_{n_{c}} are defined by

(\Pi_{n_c})_{kl} \equiv \theta_{-}(n_c - k)\,\delta_{kl},    (26)

where θ(x)\theta_{-}(x) is the step function with θ(0)=0\theta_{-}(0)=0. Note that this is a projection operator since it satisfies ΠncΠnc=Πnc\Pi_{n_{c}}\Pi_{n_{c}}=\Pi_{n_{c}}. Thus the product of X1X^{-1} and XX is given by

X^{-1}X = (U_N\Sigma^{-1}U_{n_c}^{\top})(U_{n_c}\Sigma U_N^{\top}) = U_N\Pi_{n_c}U_N^{\top} \equiv \Pi_x.    (27)

\Pi_x must also be a projection operator since \Pi_x\Pi_x=U_N\Pi_{n_c}\Pi_{n_c}U_N^{\top}=U_N\Pi_{n_c}U_N^{\top}=\Pi_x. We also see that \Pi_x is symmetric since \Pi_{n_c} is symmetric, so (X^{-1}X)^{\top}=X^{\top}(X^{\top})^{-1}=\Pi_x as well. This method of defining a singular value decomposition of the X matrix and obtaining the corresponding projection matrix \Pi_x is similar to the methods used to obtain theoretical results in [31, 32]. Note that the inversion of \Sigma assumes that all singular values are nonzero, but we already make this assumption when optimizing for W. By the expression for W in Eq. (19) and the discussion following that equation, this assumption is reasonable.
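Numerically, \Pi_x can be formed directly from the mean-adjusted state matrix with a pseudoinverse; the sketch below is illustrative and uses NumPy's pinv rather than an explicit SVD.

```python
import numpy as np

def projector_pi_x(states):
    # Build X (n_c x N) from the mean-adjusted reservoir states and return Pi_x = X^{-1} X, Eq. (27).
    N = states.shape[0]
    X = (states - states.mean(axis=0)).T / np.sqrt(N)
    return np.linalg.pinv(X) @ X

# Pi_x should be a symmetric projector of rank n_c, i.e. (up to numerical precision)
#   np.allclose(Pi_x @ Pi_x, Pi_x)  and  np.allclose(Pi_x, Pi_x.T).
```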

To get an expression for SminS_{\mathrm{min}} (short for Smin(A+BV,B,{uk},{yk})S_{\mathrm{min}}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\})) in this formalism, define the NN-dimensional vector e^\hat{e} such that its elements are given by e^k=1N\hat{e}_{k}=\frac{1}{\sqrt{N}}. It can then be shown that e^Y=1Nk=0N1yk=y¯\hat{e}^{\top}Y=\frac{1}{N}\sum_{k=0}^{N-1}y_{k}=\overline{y}. Then the variance of yky_{k} can be written as σy2=1Nk=0N1yk2y¯2=Y(𝕀Ne^e^)Y\sigma_{y}^{2}=\frac{1}{N}\sum_{k=0}^{N-1}y_{k}^{2}-\overline{y}^{2}=Y^{\top}(\mathbb{I}_{N}-\hat{e}\hat{e}^{\top})Y, where 𝕀n\mathbb{I}_{n} is the identity matrix of dimension n×nn\times n. From this expression and the expression for twice the optimal cost given in Eq. (20), SminS_{\mathrm{min}} is then

S_{\mathrm{min}} = \frac{1}{2}\left(\sigma_y^2 - K_{xy}^{\top}K_{xx}^{-1}K_{xy}\right)    (28)
= \frac{1}{2}Y^{\top}\left(\mathbb{I}_N - \hat{e}\hat{e}^{\top} - X^{\top}(X^{\top})^{-1}X^{-1}X\right)Y.    (29)

With the projection operator Πx\Pi_{x}, we can rewrite this expression for optimal cost function as

S_{\mathrm{min}} = \frac{1}{2}Y^{\top}\left(\mathbb{I}_N - \hat{e}\hat{e}^{\top} - \Pi_x\Pi_x\right)Y = \frac{1}{2}Y^{\top}\left(\mathbb{I}_N - \hat{e}\hat{e}^{\top} - \Pi_x\right)Y.    (30)

Thus the effect of indirectly modifying the RC with feedback is to shift the basis of the projection operator Πx\Pi_{x} to have as large of an overlap with the target sequence YY as possible. The derivative of SminS_{\mathrm{min}} with respect to some general parameter θ\theta of the RC is then given simply by

\frac{dS_{\mathrm{min}}}{d\theta} = -\frac{1}{2}Y^{\top}\frac{d\Pi_x}{d\theta}Y.    (31)

The target YY is independent of the RC and thus independent of θ\theta, so any changes to SminS_{\mathrm{min}} as a result of changing θ\theta must come from a change in Πx\Pi_{x}. The fact that Πx\Pi_{x} is a projection operator of rank ncn_{c} tells us some properties of any of its derivatives. First, from the property ΠxΠx=Πx\Pi_{x}\Pi_{x}=\Pi_{x} we get

\frac{d\Pi_x}{d\theta} = \frac{d}{d\theta}(\Pi_x\Pi_x) = \frac{d\Pi_x}{d\theta}\Pi_x + \Pi_x\frac{d\Pi_x}{d\theta}.    (32)

Note that this implies ΠxdΠxdθΠx=2(ΠxdΠxdθΠx)\Pi_{x}\frac{d\Pi_{x}}{d\theta}\Pi_{x}=2(\Pi_{x}\frac{d\Pi_{x}}{d\theta}\Pi_{x}). The only matrix that is equal to 2 times itself is the zero matrix, so ΠxdΠxdθΠx\Pi_{x}\frac{d\Pi_{x}}{d\theta}\Pi_{x} must be the zero matrix.
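These two identities are easy to verify numerically for any smooth, full-row-rank family X(\theta); the following self-contained check (illustrative, not part of the derivation) uses a finite-difference derivative of \Pi_x:

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, N, eps = 4, 30, 1e-6
X0 = rng.normal(size=(n_c, N))
dX = rng.normal(size=(n_c, N))

def pi(theta):
    # Projector Pi_x(theta) = X(theta)^{-1} X(theta) for a smooth family X(theta).
    X = X0 + theta * dX
    return np.linalg.pinv(X) @ X

Pi = pi(0.0)
dPi = (pi(eps) - Pi) / eps                                 # finite-difference derivative of Pi_x
print(np.allclose(dPi, dPi @ Pi + Pi @ dPi, atol=1e-4))    # Eq. (32)
print(np.allclose(Pi @ dPi @ Pi, 0.0, atol=1e-4))          # Pi_x (dPi_x/dtheta) Pi_x = 0
```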

3.2 Lemmas for Proving the Lower Dimensionality of Cases where VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0}

Now that we have established that the dependence of the cost function on the reservoir is entirely determined by a projection matrix Πx\Pi_{x}, we are ready to begin discussing cases where dSmindθ=0\frac{dS_{\mathrm{min}}}{d\theta}=0.

Lemma 1 (Categorization of cases where a derivative of SminS_{\mathrm{min}} w.r.t. a general reservoir parameter θ\theta vanishes).

Given Smin(A,B,{uk},{yk})S_{\mathrm{min}}(A,B,\{u_{k}\},\{y_{k}\}) and any parameter θ\theta that the reservoir is dependent on, the cases where dSmindθ=0\frac{dS_{\mathrm{min}}}{d\theta}=0 fall into one of two categories, one where dSmindθ=0\frac{dS_{\mathrm{min}}}{d\theta}=0 only for specific target sequences {yk}\{y_{k}\} and one where dΠxdθ=0\frac{d\Pi_{x}}{d\theta}=0. Furthermore, the former category is divided into 3 more categories in which ΠxY=𝟎\Pi_{x}Y=\mathbf{0}, ΠxY=Y\Pi_{x}Y=Y, or neither.

Proof.

For the discussion that follows, define the vector subspace 𝒴\mathcal{Y}_{\parallel} to be the space of NN-dimensional vectors with real coefficients such that 𝒴={Y|YN,ΠxY=Y}\mathcal{Y}_{\parallel}=\{Y|Y\in\mathbb{R}^{N},\Pi_{x}Y=Y\}. Define also the vector subspace 𝒴\mathcal{Y}_{\perp} such that 𝒴={Y|YN,ΠxY=𝟎}\mathcal{Y}_{\perp}=\{Y|Y\in\mathbb{R}^{N},\Pi_{x}Y=\mathbf{0}\}. Note that it is always possible to construct an orthonormal basis of vectors Y^k\hat{Y}_{k} in N\mathbb{R}^{N} where the first ncn_{c} basis vectors are in 𝒴\mathcal{Y}_{\parallel} and the remaining NncN-n_{c} basis vectors are in 𝒴\mathcal{Y}_{\perp}. This makes it useful to define

T_\theta \equiv \Pi_x\frac{d\Pi_x}{d\theta}(\mathbb{I}_N - \Pi_x) = \Pi_x\frac{d}{d\theta}\left(X^{-1}X\right)(\mathbb{I}_N - \Pi_x)    (33)
= \Pi_x X^{-1}\frac{dX}{d\theta}(\mathbb{I}_N - \Pi_x) + \Pi_x\frac{dX^{-1}}{d\theta}X(\mathbb{I}_N - \Pi_x)    (34)
= X^{-1}\frac{dX}{d\theta}(\mathbb{I}_N - \Pi_x),    (35)

where in the last line we used ΠxX1=X1XX1=X1𝕀nc=X1\Pi_{x}X^{-1}=X^{-1}XX^{-1}=X^{-1}\mathbb{I}_{n_{c}}=X^{-1} to simplify the first term and XΠx=XX1X=𝕀ncX=XX\Pi_{x}=XX^{-1}X=\mathbb{I}_{n_{c}}X=X to eliminate the second term. This definition is useful because ΠxdΠxdθΠx=𝟎\Pi_{x}\frac{d\Pi_{x}}{d\theta}\Pi_{x}=\mathbf{0}, so from the definition above Tθ=ΠxdΠxdθT_{\theta}=\Pi_{x}\frac{d\Pi_{x}}{d\theta}, and therefore from Eq. (32) we have

\frac{d\Pi_x}{d\theta} = T_\theta + T_\theta^{\top}.    (36)

This also implies that

\frac{dS_{\mathrm{min}}}{d\theta} = -\frac{1}{2}Y^{\top}(T_\theta + T_\theta^{\top})Y = -Y^{\top}T_\theta Y,    (37)

so the change in S_{\mathrm{min}} depends entirely upon the quadratic form of T_\theta with respect to Y.

From the definition of TθT_{\theta} in Eq. (33), because there is a Πx\Pi_{x} on the left side of the matrix, we then have that YTθ=𝟎Y^{\top}T_{\theta}=\mathbf{0} for all Y𝒴Y\in\mathcal{Y}_{\perp}. Since 𝒴\mathcal{Y}_{\perp} has dimension NncN-n_{c}, there must be at least NncN-n_{c} zero singular values of TθT_{\theta}. Let mθm_{\theta} be the matrix rank of TθT_{\theta} (number of nonzero singular values), which by the previous argument cannot be larger than ncn_{c}. Then the singular value decomposition of TθT_{\theta} can be written as

T_\theta = \sum_{j=0}^{m_\theta-1}\sigma_{\theta,j}\,\hat{y}_{\theta,j}^{\parallel}\left(\hat{y}_{\theta,j}^{\perp}\right)^{\top},    (38)

where \sigma_{\theta,j} is a strictly positive singular value of T_\theta, \hat{y}_{\theta,j}^{\parallel} is one of m_\theta orthonormal basis vectors in \mathcal{Y}_{\parallel}, and \hat{y}_{\theta,j}^{\perp} is one of m_\theta orthonormal basis vectors in \mathcal{Y}_{\perp}. The reason the right basis vectors are in \mathcal{Y}_{\perp} is the factor (\mathbb{I}_N-\Pi_x) on the right side of Eq. (33), which makes it so that T_\theta Y=\mathbf{0} for all Y\in\mathcal{Y}_{\parallel}.

With this decomposition of TθT_{\theta}, we can use Eq. (37) to rewrite dSmindθ\frac{dS_{\mathrm{min}}}{d\theta} as

\frac{dS_{\mathrm{min}}}{d\theta} = -\sum_{j=0}^{m_\theta-1}\sigma_{\theta,j}\,c_{\theta,j}^{\parallel}\,c_{\theta,j}^{\perp},    (39)

where cθ,j=Yy^θ,jc_{\theta,j}^{\parallel}=Y^{\top}\hat{y}_{\theta,j}^{\parallel} and cθ,j=Yy^θ,jc_{\theta,j}^{\perp}=Y^{\top}\hat{y}_{\theta,j}^{\perp}. There are 3 broad categories of YY for which dSmindθ\frac{dS_{\mathrm{min}}}{d\theta} vanishes for a given TθT_{\theta}:

  • 1. Y is orthogonal to every \hat{y}_{\theta,j}^{\parallel}, or equivalently \Pi_x Y=0.

  • 2. Y is orthogonal to every \hat{y}_{\theta,j}^{\perp}, or equivalently \Pi_x Y=Y.

  • 3. Neither of the above statements is true, but the coefficients c_{\theta,j}^{\parallel} and c_{\theta,j}^{\perp} are such that the sum \sum_{j=0}^{m_\theta-1}\sigma_{\theta,j}c_{\theta,j}^{\parallel}c_{\theta,j}^{\perp} vanishes.

There is also the possibility that mθm_{\theta} is zero, meaning that every singular value of TθT_{\theta} is zero, so that dSmindθ=0\frac{dS_{\mathrm{min}}}{d\theta}=0 for any YY. ∎

In what follows, while we cannot rule out any of these possibilities for the feedback vector V, we can show that the space of Y's that fit the above three criteria is of lower dimension than the general space of N-dimensional vectors that encompasses all Y's, and give criteria for numerically testing whether any of these cases hold for a given reservoir computation. Also, in the event that the gradient of S_{\mathrm{min}} with respect to V vanishes for any Y, we will show that the set of solutions in the space of possible matrices A, vectors B, and input sequences \{u_k\} is of lower dimension as well, with testable criteria for a given ESN.

Lemma 2 (Lower dimensionality of cases where VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} while VΠx𝟎\nabla_{V}\Pi_{x}\neq\mathbf{0}).

Given a specific ESN defined by a matrix A, vector B, and input sequence \{u_k\} such that the projection operator \Pi_x satisfies \nabla_V\Pi_x\neq\mathbf{0}, the space of training vectors Y that then lead to \nabla_V S_{\mathrm{min}}=\mathbf{0} is of lower dimension than the space of all training vectors, whose dimension is N.

Proof.

Define TiTViT_{i}\equiv T_{V_{i}} to be the same as in Eq. (33) with θ\theta replaced with ViV_{i} for each i=0,,nc1i=0,\dots,n_{c}-1. In order for the gradient VSmin\nabla_{V}S_{\mathrm{min}} to be zero, we require that dSmindVi=YTiY\frac{dS_{\mathrm{min}}}{dV_{i}}=-Y^{\top}T_{i}Y be zero for all ii’s. This means that for a given set of TiT_{i}’s, there are three cases in which VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} because of the particular form of YY:

  • Case 1:

    Let 𝒴V\mathcal{Y}_{V}^{\parallel} be the span of the set of vectors that contains y^i,j\hat{y}_{i,j}^{\parallel} for every i{0,,nc1}i\in\{0,\dots,n_{c}-1\} and j{0,,rank(Ti)1}j\in\{0,\dots,\mathrm{rank}(T_{i})-1\}, and let its dimension be mm_{\parallel}. Then if YY is such that Yy=0Y^{\top}y^{\parallel}=0 for all y𝒴Vy^{\parallel}\in\mathcal{Y}_{V}^{\parallel}, then VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} because YTi=𝟎Y^{\top}T_{i}=\mathbf{0} for all ii. This includes the case where SminS_{\mathrm{min}} is at its maximum possible value of 12σy2\frac{1}{2}\sigma_{y}^{2}, in which the RC has utterly failed to capture any properties of YY. The dimension of the space of YY’s that fall under this case is NmN-m_{\parallel}. This is because 𝒴V\mathcal{Y}_{V}^{\parallel} is spanned by mm_{\parallel} basis vectors, so the space of YY’s that are orthogonal to all of them is spanned by the remaining NmN-m_{\parallel} basis vectors. We can calculate mm_{\parallel} from the matrix defined as

    M_{\parallel} = \sum_{i=0}^{n_c-1}T_i T_i^{\top}.    (40)

    mm_{\parallel} is given by the rank of MM_{\parallel} because each TiTiT_{i}T_{i}^{\top} is a positive semi-definite matrix, so the only way that YM=𝟎Y^{\top}M_{\parallel}=\mathbf{0} is if YTi=𝟎Y^{\top}T_{i}=\mathbf{0} for all ii. The dimension of the space of vectors that satisfy this relation is NmN-m_{\parallel} as mentioned previously, so if MM_{\parallel} has NmN-m_{\parallel} zero eigenvalues, then that leaves mm_{\parallel} nonzero eigenvalues.

    We see from Eq. (35) that computing MM_{\parallel} will involve calculating X1X^{-1}, but since we are only interested in the rank of the matrix we can find an alternative. Recall that XX is an nc×Nn_{c}\times N matrix whose singular values are strictly positive, so therefore the rank of XX is ncn_{c}. Furthermore, XΠx=XX\Pi_{x}=X, so the span of the right eigenvectors of XX must be 𝒴\mathcal{Y}_{\parallel}. Since the span of the left eigenvectors of TiT_{i} is a subspace of 𝒴\mathcal{Y}_{\parallel} for all ii, the span of MM_{\parallel} is also a subspace of 𝒴\mathcal{Y}_{\parallel}, and we can multiply MM_{\parallel} by XX on both sides without changing the rank and use Eq. (35) to get a simpler nc×ncn_{c}\times n_{c} matrix

    \widetilde{M}_{\parallel} = XM_{\parallel}X^{\top} = \sum_{i=0}^{n_c-1}\frac{dX}{dV_i}(\mathbb{I}_N - \Pi_x)\frac{dX^{\top}}{dV_i}.    (41)

    Since this has the same number of nonzero eigenvalues as M_{\parallel}, m_{\parallel} is also given by the number of nonzero eigenvalues of \widetilde{M}_{\parallel}. This allows us to compute m_{\parallel} without needing to calculate X^{-1} directly, as we would have to if we calculated M_{\parallel} using Eq. (35) for T_i.

  • Case 2:

    Let 𝒴V\mathcal{Y}_{V}^{\perp} be the span of the set of vectors that contains y^i,j\hat{y}_{i,j}^{\perp} for every i{0,,nc1}i\in\{0,\dots,n_{c}-1\} and j{0,,rank(Ti)1}j\in\{0,\dots,\mathrm{rank}(T_{i})-1\}, and let its dimension be mm_{\perp}. Then if YY is such that Yy=0Y^{\top}y^{\perp}=0 for all y𝒴Vy^{\perp}\in\mathcal{Y}_{V}^{\perp}, then VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} because TiY=𝟎T_{i}Y=\mathbf{0} for all ii. This includes the minimal case where Smin=0S_{\mathrm{min}}=0, in which the RC perfectly describes YY and no further improvement is possible. The dimension of the space of YY’s that fall under this case is NmN-m_{\perp}. This is because 𝒴V\mathcal{Y}_{V}^{\perp} is spanned by mm_{\perp} basis vectors, so the space of YY’s that are orthogonal to all of them is spanned by the remaining NmN-m_{\perp} basis vectors. We can calculate mm_{\perp} from the matrix defined as

    M_{\perp} = \sum_{i=0}^{n_c-1}T_i^{\top}T_i.    (42)

    mm_{\perp} is given by the rank of MM_{\perp} because each TiTiT_{i}^{\top}T_{i} is a positive semi-definite matrix, so the only way that MY=𝟎M_{\perp}Y=\mathbf{0} is if TiY=𝟎T_{i}Y=\mathbf{0} for all ii. The dimension of the space of vectors that satisfy this relation is NmN-m_{\perp} as mentioned previously, so if MM_{\perp} has NmN-m_{\perp} zero eigenvalues, then that leaves mm_{\perp} nonzero eigenvalues.

  • Case 3:

    Neither of the above cases holds, but Y^{\top}T_i Y=0 for every i nonetheless. Since there are n_c different T_i's, we have n_c constraint equations determining which Y's of this type set \nabla_V S_{\mathrm{min}} to zero. However, it may be possible that some of these equations are not independent, which would imply that there is at least one linear combination of T_i's such that \sum_{i=0}^{n_c-1}\gamma_i T_i=\mathbf{0}. Define m_I to be the number of independent constraints of this form. Then the dimension of the space of Y's for which Y^{\top}T_i Y=0 is given by N-m_I, the number of free parameters left after applying the constraints. We can calculate m_I using the n_c\times n_c matrix M_I whose components are defined to be

    (M_I)_{ij} = \mathrm{Tr}(T_i T_j^{\top}).    (43)

    mIm_{I} is given by the rank of MIM_{I}, since if the linear combination i=0nc1γiTi=𝟎\sum_{i=0}^{n_{c}-1}\gamma_{i}T_{i}=\mathbf{0} then the ncn_{c}-dimensional vector vv defined by vi=γiv_{i}=\gamma_{i} is an eigenvector of MIM_{I} with eigenvalue 0.

As long as the dimensions of the spaces of Y's that satisfy the three cases above are all smaller than the total dimension of the space of all Y's, it is very unlikely that any given Y will fall into any of these categories. The total dimension of the space of Y vectors is N, and the dimensions of the spaces for the three cases are N-m_{\parallel}, N-m_{\perp}, and N-m_I, respectively, so as long as m_{\parallel}, m_{\perp}, and m_I are all positive, then this argument holds. We can check whether or not they are zero by taking the traces of their respective matrices M_{\parallel}, M_{\perp}, and M_I; since they are all positive semi-definite matrices, the only way that the sums of their eigenvalues can be zero is if every eigenvalue is zero. It turns out that all three traces are equal, since

\mathrm{Tr}(M_{\parallel}) = \mathrm{Tr}(M_{\perp}) = \mathrm{Tr}(M_I) = \sum_{i=0}^{n_c-1}\mathrm{Tr}(T_i T_i^{\top}),    (44)

so if any one of m_{\parallel}, m_{\perp}, or m_I is found to be zero, then all three are guaranteed to be zero. This is because a positive semi-definite matrix whose eigenvalues are all zero must be the zero matrix, which in turn implies that T_i=\mathbf{0} for all i, the subject of our second proof below. Conversely, Eq. (44) also implies that if any one of them is positive, then they all must be positive as well, so we need only check that one of them is nonzero for this proof to hold. The easiest one to check is likely m_{\parallel}, as we can use \widetilde{M}_{\parallel} in place of M_{\parallel} and avoid directly calculating X^{-1}. ∎
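As a concrete illustration of this trace test, the sketch below estimates \mathrm{Tr}(\widetilde{M}_{\parallel}) from Eq. (41) using finite differences of the reservoir states with respect to the feedback components (the exact derivatives of Eq. (45) below could be used instead); run_states is a hypothetical helper that returns the (N, n_c) state matrix for a given effective reservoir matrix.

```python
import numpy as np

def trace_M_parallel(run_states, A, B, V, inputs, eps=1e-6):
    # run_states(A_bar, B, inputs) -> (N, n_c) array of reservoir states (hypothetical helper).
    states = run_states(A + np.outer(B, V), B, inputs)
    N, n_c = states.shape
    X = (states - states.mean(axis=0)).T / np.sqrt(N)        # n_c x N
    P_perp = np.eye(N) - np.linalg.pinv(X) @ X               # I_N - Pi_x
    total = 0.0
    for i in range(n_c):
        dV = np.zeros(n_c); dV[i] = eps
        states_i = run_states(A + np.outer(B, V + dV), B, inputs)
        Xi = (states_i - states_i.mean(axis=0)).T / np.sqrt(N)
        dX_dVi = (Xi - X) / eps                              # finite-difference dX/dV_i
        total += np.trace(dX_dVi @ P_perp @ dX_dVi.T)        # summand of Eq. (41)
    return total   # Tr of M_parallel_tilde; a strictly positive value rules out the degenerate cases
```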

Lemma 3 (Lower dimensionality of cases where \nabla_V\Pi_x=\mathbf{0}).

The space of ESNs defined by matrices, vectors, and input sequences (A,B,\{u_k\}) for which \nabla_V\Pi_x=\mathbf{0} is of lower dimension than the space of all possible ESNs.

Proof.

This part of the proof will show that, in the space of all matrices A, vectors B, and input sequences \{u_k\} that lead to a convergent RC, the subspace where \nabla_V S_{\mathrm{min}}=\mathbf{0} for all Y (or equivalently, where T_i=\mathbf{0} for all i) must be of lower dimension. If it were not of lower dimension, then \nabla_V S_{\mathrm{min}}=\mathbf{0} would hold over a finite region of the possible values of A, B, and \{u_k\}. If the function f(x_k,u_k) representing the reservoir dynamics is analytic, then all reservoir states must be analytic functions of A, B, and \{u_k\} as well. In this paper, we are considering ESNs with f(x_k,u_k)=\sigma(Ax_k+Bu_k), where \sigma(z) is the element-wise sigmoid function, which is analytic. So if all reservoir states are analytic in A, B, and \{u_k\}, then S_{\mathrm{min}} is an analytic function of these variables as well; hence, if \nabla_V S_{\mathrm{min}}=\mathbf{0} held over a finite region of the possible values of these variables, then S_{\mathrm{min}} would have to be constant with respect to V for all A, B, \{u_k\} and \{y_k\}, as a direct consequence of the identity theorem for analytic functions [33]. However, the ESN will become unstable for a V with a sufficiently large norm, in which case the fit to \{y_k\} will be poor and S_{\mathrm{min}} would have to be larger than with no feedback. Thus S_{\mathrm{min}} cannot be constant with respect to V for all A, B, \{u_k\} and \{y_k\}, and therefore the space of matrices A, vectors B and training inputs \{u_k\} that satisfy \nabla_V\Pi_x=\mathbf{0} is of lower dimension than the space of all possible (A,B,\{u_k\}). ∎

Lemma 4 (Lower dimensionality of the subdomain of Smin(A+BV,B,{uk},{yk})S_{\mathrm{min}}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\}) for which VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0}).

The dimension of the space of matrices, vectors, input sequences, and target sequences (A,B,{uk},{yk})(A,B,\{u_{k}\},\{y_{k}\}) that satisfy VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} is strictly less than the dimension of the space of all possible (A,B,{uk},{yk})(A,B,\{u_{k}\},\{y_{k}\}), which implies that the number of cases in which SminS_{\mathrm{min}} has a null gradient w.r.t. VV is vanishingly small compared to all cases.

Proof.

Lemma 2 proves that when \nabla_V\Pi_x\neq\mathbf{0}, the number of cases where \nabla_V S_{\mathrm{min}}=\mathbf{0} is vanishingly small in the space of all training sequences \{y_k\}, and therefore also in the space of all possible (A,B,\{u_k\},\{y_k\}). Lemma 3 proves that the cases in which \nabla_V\Pi_x=\mathbf{0} (and hence \nabla_V S_{\mathrm{min}}=\mathbf{0} for every Y) occupy a vanishingly small subset of the space of all matrices A, vectors B, and training inputs \{u_k\}, and therefore of all (A,B,\{u_k\},\{y_k\}) as well. Therefore, the number of cases of (A,B,\{u_k\},\{y_k\}) where \nabla_V S_{\mathrm{min}}=\mathbf{0} is vanishingly small over all possible (A,B,\{u_k\},\{y_k\}), regardless of \nabla_V\Pi_x. ∎

Finally, we note that \nabla_V S_{\mathrm{min}}=\mathbf{0} is a necessary but not sufficient condition for feedback to fail to improve the result. That is, the points where \nabla_V S_{\mathrm{min}}=\mathbf{0} correspond to the extrema of S_{\mathrm{min}} with respect to V, but these extrema could be minima, maxima, or saddle points. Only minima prevent feedback from improving the output of an ESN, and if we use a non-local method to find the global minimum of S_{\mathrm{min}} with respect to V, then local minima do not prevent improvement either.

It is possible to compute VSmin\nabla_{V}S_{\mathrm{min}} without much extra overhead for any given run of a RC to see if it is zero. The derivatives dxkdVi\frac{dx_{k}}{dV_{i}} can be calculated iteratively using the relation

\frac{dx_k}{dV_i} = \Sigma_k\left(B\,x_{k-1,i} + \overline{A}\frac{dx_{k-1}}{dV_i}\right)    (45)
\Sigma_{k,ij} = \delta_{ij}\,\sigma'(z_{k-1,i}) = \delta_{ij}\,x_{k,i}(1 - x_{k,i}),    (46)

where \overline{A}=A+BV^{\top}. This uses A, B, and the reservoir states x_k that have already been obtained from running the RC. We can also avoid computing N\times N matrices like \Pi_x directly by noting that from previous results we have Y^{\top}\Pi_x Y=K_{xy}^{\top}K_{xx}^{-1}K_{xy}. K_{xx}^{-1} was already computed when optimizing for W, so there is no additional matrix inversion needed to find \nabla_V S_{\mathrm{min}}. It is also feasible to check whether a vanishing gradient is due to the specific Y or a symptom of the RC by checking whether m_{\parallel}=0 using \mathrm{Tr}(\widetilde{M}_{\parallel}) defined in Eq. (41). \widetilde{M}_{\parallel} has the same form as S_{\mathrm{min}}, but with Y replaced with the matrix \frac{dX}{dV_i}, so by replacing y_k with \frac{dx_{k,j}}{dV_i} for every i and j in the definition of K_{xy} we can check whether \mathrm{Tr}(\widetilde{M}_{\parallel}) is zero without much extra work.
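A sketch of this gradient computation, following the recursion of Eqs. (45)-(46) for the sigmoid nonlinearity and reusing the covariance quantities of Section 2.1 (variable names are illustrative; the derivative is initialised at zero, consistent with a washed-out initial state):

```python
import numpy as np

def grad_V_smin(states, y, A, B, V):
    # states[k]: reservoir state x_k obtained with feedback vector V; y[k]: target.
    N, n = states.shape
    A_bar = A + np.outer(B, V)
    # Accumulate D_k with columns D_k[:, i] = d x_k / d V_i via Eqs. (45)-(46).
    D = np.zeros((n, n))
    dX = np.zeros((N, n, n))
    x_prev = np.zeros(n)
    for k in range(N):
        Sigma = np.diag(states[k] * (1.0 - states[k]))       # sigma'(z_{k-1}) for the sigmoid
        D = Sigma @ (np.outer(B, x_prev) + A_bar @ D)
        dX[k] = D
        x_prev = states[k]
    # dS_min/dV_i = -Y^T T_i Y, written with the quantities already used to fit W.
    Xc = states - states.mean(axis=0)
    yc = y - y.mean()
    W = np.linalg.pinv(Xc.T @ Xc / N) @ (Xc.T @ yc / N)      # optimal readout, Eq. (19)
    resid = yc - Xc @ W                                      # target component outside the span of X
    grad = np.empty(n)
    for i in range(n):
        dXc_i = dX[:, :, i] - dX[:, :, i].mean(axis=0)       # mean-adjusted d x_k / d V_i
        grad[i] = -(W @ (dXc_i.T @ resid)) / N
    return grad
```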

3.3 Proving the Universal Superiority of ESNs with Feedback

We are now ready to finally present the proof of Theorem 1 since we know that VSmin\nabla_{V}S_{\mathrm{min}} is nonzero except on a lower dimensional subspace.

Proof of Theorem 1.

Note that SminS_{\mathrm{min}} is a real analytic functional with respect to matrix AA (see the definition of SS in Eq. (6)), having the Taylor series

S_{\mathrm{min}}(A + BV^{\top}, B, \{u_k\}, \{y_k\})
= S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\}) + \mathrm{Tr}\left[\left(\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})\right)(BV^{\top})^{\top}\right] + \mathcal{O}[\delta A^2],    (47)

where the last term consolidates the second- and higher-order terms in \delta A=(A+BV^{\top})-A=BV^{\top}. A reasonable ansatz for V that most reduces the second term is

V = -\alpha\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})^{\top}B,    (48)

where α>0\alpha>0 is a constant to be determined. We now calculate the second term as

\mathrm{Tr}\left[\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})(BV^{\top})^{\top}\right] = -\alpha\beta,    (49)

where

\beta = \mathrm{Tr}\left[\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})^{\top}(BB^{\top})\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})\right]
= ||\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})^{\top}B||^2 \geq 0.    (50)

If β>0\beta>0, one can always choose an arbitrarily small α(>0)\alpha(>0) such that

\alpha > \frac{\left|\mathcal{O}(\alpha^2)\right|}{\beta}.    (51)

This is because the left side is linear in \alpha whereas the right side is a higher-order polynomial in \alpha, and therefore such an (arbitrarily small) \alpha satisfying the above always exists.

The only case where such α\alpha cannot be found is the case where β=0\beta=0:

\nabla_V S_{\mathrm{min}}(A + BV^{\top}, B, \{u_k\}, \{y_k\}) = \nabla_A S_{\mathrm{min}}(A + BV^{\top}, B, \{u_k\}, \{y_k\})^{\top}B = \mathbf{0}.    (52)

However, we have proved in Lemma 4 that the number of cases in which this occurs is vanishingly small. Then, the strict inequality in Eq. (23) is proved in almost every case.

Now, we prove that \overline{A}=A+BV^{\top} with V=-\alpha\nabla_A S_{\mathrm{min}}(A,B,\{u_k\},\{y_k\})^{\top}B will keep the ESN convergent. The set \mathcal{A}_a=\{A\in\mathbb{R}^{n\times n}\mid A^{\top}A<a^{2}\mathbb{I}_n\} is an open convex set in \mathbb{R}^{n\times n} for any a>0. As has been shown earlier, for the overwhelming majority of (B,\{u_k\},\{y_k\}) there is always a choice of V that decreases S_{\mathrm{min}}(A,B,\{u_k\},\{y_k\}). Since A\in\mathcal{A}_a, by the continuity of the maximum singular value of A with respect to A (for any choice of matrix norm) and by the particular choice of V, there always exists a small number \delta>0 such that A+\delta BV^{\top}\in\mathcal{A}_a (guaranteeing that the ESN with feedback remains convergent) while still decreasing the cost, S_{\mathrm{min}}(A+B\delta V^{\top},B,\{u_k\},\{y_k\})<S_{\mathrm{min}}(A,B,\{u_k\},\{y_k\}). This concludes the proof of Theorem 1. ∎

3.4 Superiority of ESNs with Feedback for the Whole Class of ESNs

According to Theorem 1, the cost of the ESN with feedback is guaranteed to be smaller than the cost of the ESN without feedback for almost all fixed $(A,B,\{u_{k}\},\{y_{k}\})$. The following corollary states that the ESN with feedback also exceeds the performance of the ESN without feedback over the whole class of ESNs.

Corollary 1 (Universal superiority of ESN with feedback over the whole class).

For given and fixed finite input and output sequences $\{u_{k}\}=\{u_{k}\}_{k=1,\ldots,N}$ and $\{y_{k}\}=\{y_{k}\}_{k=1,\ldots,N}$, let $A$ and $B$ be drawn randomly according to some probability measure $\mathbb{P}$ on $\mathbb{R}^{n\times n}\times\mathbb{R}^{n}$. Let $\mathcal{X}=\{(A,B)\in\mathbb{R}^{n\times n}\times\mathbb{R}^{n}\mid A^{\top}A<a^{2}\mathbb{I}\}$ and let $\mathbb{P}$ be such that $\mathbb{P}(\mathcal{X})=1$. Let $\mathcal{Y}=\mathcal{X}\cap\{(A,B)\mid\nabla_{V}S_{\rm min}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\})\neq 0\}$ and choose $\mathbb{P}$ such that $\mathbb{P}(\mathcal{Y})>0$. Let $\langle S_{\mathrm{min}}\rangle_{A,B}=\mathbb{E}\left[S_{\mathrm{min}}(A,B)\right]$, where the expectation (average) is taken with respect to the probability measure $\mathbb{P}$ for the fixed training dataset $\{u_{k}\},\{y_{k}\}$. Then, for the given training dataset, the ESN with feedback has on average a smaller cost function value than the ESN without feedback. That is, the following holds when averaging over all possible $(A,B)$:

Smin(A,B,{uk},{yk})A,B>minVSmin(A+BV,B,{uk},{yk})A,B.\langle S_{\mathrm{min}}(A,B,\{u_{k}\},\{y_{k}\})\rangle_{A,B}>\langle\min_{V}S_{\mathrm{min}}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\})\rangle_{A,B}. (53)
Proof.

To prove the corollary over the whole class, we adopt a slightly different approach. The cost function after minimizing over $C$ and $W$ can be written as a function of the ESN parameters, $S_{\mathrm{min}}(A,B)$ (short for $S_{\mathrm{min}}(A,B,\{u_{k}\},\{y_{k}\})$). With feedback using a vector $V$, the new minimum is given by $S_{\mathrm{min}}(\overline{A},B)=S_{\mathrm{min}}(A+BV^{\top},B)$. Let us separate the matrix $A$ into two components given by

\displaystyle A=A_{||}+A_{\perp}, (54)
\displaystyle A_{||}=BZ_{AB}^{\top}\quad\mathrm{where}\quad Z_{AB}=\frac{A^{\top}B}{||B||^{2}}, (55)
\displaystyle A_{\perp}=\left(\mathbb{I}_{\mathrm{dim}(A)}-\frac{BB^{\top}}{||B||^{2}}\right)A=A-BZ_{AB}^{\top}. (56)

This decomposition separates the degrees of freedom of $A$ along the $B$ direction from all other degrees of freedom, so that $A_{||}$ and $A_{\perp}$ are independent. This is verified by calculating the inner product:

Tr[A||A]\displaystyle\mathrm{Tr}\left[A_{||}^{\top}A_{\perp}\right] =Tr[ABBB2(𝕀dim(A)BBB2)A]=Tr[A(BBB2BBB2)A]=0.\displaystyle=\mathrm{Tr}\left[A^{\top}\frac{BB^{\top}}{||B||^{2}}\left(\mathbb{I}_{\mathrm{dim}(A)}-\frac{BB^{\top}}{||B||^{2}}\right)A\right]=\mathrm{Tr}\left[A^{\top}\left(\frac{BB^{\top}}{||B||^{2}}-\frac{BB^{\top}}{||B||^{2}}\right)A\right]=0. (57)
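As a quick numerical sanity check, the decomposition in Eqs. (54)-(56) and the vanishing inner product of Eq. (57) can be verified in a few lines. The sketch below (Python/NumPy; the dimension $n=5$ and the random draws are arbitrary illustrative choices) is not part of the proof, only an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                    # arbitrary illustrative dimension
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, 1))

Z_AB = A.T @ B / np.linalg.norm(B)**2    # Eq. (55)
A_par = B @ Z_AB.T                       # component of A along B
A_perp = A - A_par                       # Eq. (56)

# The inner product of Eq. (57) vanishes, so the two components are independent.
print(np.trace(A_par.T @ A_perp))        # numerically zero (~1e-16)
print(np.allclose(A_par + A_perp, A))    # True: the decomposition reconstructs A
```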

Thus we can write the minimized cost function as a function of these independent variables and define $\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)=S_{\mathrm{min}}(A,B)$. With feedback, these quantities become $\overline{Z}_{AB}=Z_{AB}+V$ and $\overline{A}_{\perp}=\overline{A}-B\overline{Z}_{AB}^{\top}=A-BZ_{AB}^{\top}=A_{\perp}$, and therefore $S_{\mathrm{min}}(\overline{A},B)=\tilde{S}_{\mathrm{min}}(Z_{AB}+V,A_{\perp},B)$. Consequently, when we use feedback and optimize with respect to $V$, we are equivalently computing $\min_{V}(\tilde{S}_{\mathrm{min}}(V,A_{\perp},B))$, the minimum value of $\tilde{S}_{\mathrm{min}}$ as a function of $Z_{AB}$, subject to the convergence constraint. We note that, therefore,

minV(S~min(ZAB+V,A,B))S~min(ZAB,A,B),\min_{V}(\tilde{S}_{\mathrm{min}}(Z_{AB}+V,A_{\perp},B))\leq\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B), (58)

since the cost function can be further reduced by optimizing the additional degree of freedom $A_{||}$ (equivalently, $V$). Equality occurs when $V=\mathbf{0}$ is already the optimal solution, making the feedback unnecessary. One can verify that the derivative of the cost function with respect to $V$ is nonzero except for a vanishingly small set of $(A,B)$. To show this, let us take the derivative of $S_{\mathrm{min}}$ at $V=\mathbf{0}$:

VSmin(A+BV,B)|V=𝟎=BASmin(A,B).\nabla_{V}S_{\mathrm{min}}(A+BV^{\top},B)|_{V=\mathbf{0}}=B^{\top}\nabla_{A}S_{\mathrm{min}}(A,B). (59)

The condition for this to vanish is exactly the condition appearing in Eq. (52), for which we proved that only a vanishingly small number of $(A,B)$ satisfy it. Therefore, the strict inequality holds for almost all $(A,B)$.

The average cost without feedback is given by $\langle\tilde{S}_{\mathrm{min}}\rangle_{Z_{AB},A_{\perp},B}=\mathbb{E}\left[\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\right]$, averaging over all three variables. With feedback, as stated above, optimizing over $V$ is equivalent to minimizing the cost with respect to $Z_{AB}$, so the average cost with feedback is given by $\langle\min_{V}(\tilde{S}_{\mathrm{min}}(V,A_{\perp},B))\rangle_{A_{\perp},B}$. The average over $Z_{AB}$ does nothing in this case because every initial $Z_{AB}$ is shifted to the minimizing value. Because $\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\geq\min_{V}(\tilde{S}_{\mathrm{min}}(V,A_{\perp},B))$ for each individual choice of $Z_{AB}$, $A_{\perp}$, and $B$, and $Z_{AB}$ and $A_{\perp}$ are measurable functions of $A$ and $B$ by construction, we must have $\langle\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\rangle_{Z_{AB}}\triangleq\mathbb{E}[\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\mid A_{\perp},B]\geq\min_{V}(\tilde{S}_{\mathrm{min}}(V,A_{\perp},B))$ for all $(A,B)\in\mathcal{X}$. Furthermore, equality holds only if every choice of $Z_{AB}$ yields the same value of $\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)$. By the definition of $\mathcal{X}$ and $\mathcal{Y}$, the hypothesis that $\mathbb{P}(\mathcal{X})=1$, and the fact that the set of $(A_{\perp},B)$ for which $\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)$ is completely independent of $Z_{AB}$ is vanishingly small (cf. the proof of Theorem 1), one can always choose the measure $\mathbb{P}$ under which $A$ and $B$ are sampled such that $\mathbb{P}(\mathcal{Y})>0$. Therefore, we have the strict inequality when averaging over $(A_{\perp},B)$:

S~min(ZAB,A,B)ZAB,A,B\displaystyle\langle\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\rangle_{Z_{AB},A_{\perp},B}
=𝒴S~min(ZAB,A,B)ZAB(A,B)(dA,dB)\displaystyle=\int_{\mathcal{Y}}\langle\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\rangle_{Z_{AB}}(A,B)\mathbb{P}(dA,dB)
+𝒳\𝒴S~min(ZAB,A,B)ZAB(A,B)(dA,dB)\displaystyle\quad+\int_{\mathcal{X}\backslash\mathcal{Y}}\langle\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\rangle_{Z_{AB}}(A,B)\mathbb{P}(dA,dB)
>𝒴minVS~min(V,A,B)(dA,dB)+𝒳\𝒴minVS~min(V,A,B)(dA,dB)\displaystyle>\int_{\mathcal{Y}}\min_{V}\tilde{S}_{\mathrm{min}}(V,A_{\perp},B)\mathbb{P}(dA,dB)+\int_{\mathcal{X}\backslash\mathcal{Y}}\min_{V}\tilde{S}_{\mathrm{min}}(V,A_{\perp},B)\mathbb{P}(dA,dB)
=minVSmin(A+BV,B,{uk},{yk})A,B.\displaystyle=\langle\min_{V}S_{\mathrm{min}}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\})\rangle_{A,B}. (60)

Thus an ESN with feedback will always do better than an ESN without feedback over the whole ESN class on average, given the same number of computational nodes. ∎

We note that this corollary could be proven more succinctly using Theorem 1 under the same hypothesis on the probability measure \mathbb{P} under which AA and BB are sampled by arguing that the average over ESNs will always include cases where the strict inequality (23) holds, and therefore the average also obeys a strict inequality. However, the proof given here adopted a different path from the proof of Theorem 1, providing an alternative explanation of the corollary.

4 Optimization of ESN with Feedback

One of the main advantages of using an ESN is that the training procedure is a linear regression problem that can be solved exactly without much computational effort. To use the ESN to match a target sequence $\{y_{k}\}$ for a given input sequence $\{u_{k}\}$, we first run the ESN driven by the input for a number of steps until the initial state of the network is forgotten. This ensures that the states of the network are close to the unique sequence solely determined by the input, which is guaranteed to exist by the uniform convergence property. In most of our simulations, we let the ESN run for 500 steps before beginning training, which appears to be significantly more than necessary for our examples; we were able to use as few as 19 steps of startup for some of our tests without any issue.

After this initial set of steps that ensures the system has converged to the input-dependent state sequence, we record the values of the state for the entire range of training steps, which we define to be a total of $N$ steps starting from $k=0$. We then define the network's output to be $\hat{y}_{k}=W^{\top}x_{k}+C$, where the parameters $C$ and $W$ are optimized using the cost function given in Eq. (6), and whose exact solutions are given in Eqs. (18,19). Then, any future step in $\{y_{k}\}$ is estimated using $\hat{y}_{k}=W^{\top}x_{k}+C$ for some $k\geq N$.
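To make this training procedure concrete, the following Python/NumPy sketch runs a small ESN of the form $x_{k+1}=g(Ax_{k}+Bu_{k})$ with a sigmoid nonlinearity, discards a washout period, and fits $(W,C)$ by ordinary least squares. The parameter values, the random reservoir draw, and the toy one-step-memory target are illustrative assumptions; the plain least-squares fit stands in for the exact expressions of Eqs. (18)-(19), which are not reproduced here. The helpers `run_esn` and `train_readout` are reused in the sketches below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_esn(A, B, u_seq, x0=None):
    """Drive the ESN x_{k+1} = g(A x_k + B u_k) and return the state sequence."""
    n = A.shape[0]
    x = np.zeros(n) if x0 is None else x0
    states = []
    for u in u_seq:
        x = sigmoid(A @ x + B * u)
        states.append(x)
    return np.array(states)                          # shape (len(u_seq), n)

def train_readout(X, y):
    """Least-squares fit of y_k ~ W^T x_k + C (a stand-in for Eqs. (18)-(19))."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])    # append a bias column
    sol, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    W, C = sol[:-1], sol[-1]
    return W, C

# Illustrative usage with random data (not one of the benchmark tasks).
rng = np.random.default_rng(1)
n, n_washout, n_train = 10, 500, 1000
A = rng.normal(scale=0.3, size=(n, n))               # small scale keeps the ESN convergent
B = rng.normal(size=n)
u = rng.normal(size=n_washout + n_train)
y = np.roll(u, 1)[n_washout:]                        # toy target: one-step memory of the input

X = run_esn(A, B, u)[n_washout:]                     # discard the washout states
W, C = train_readout(X, y)
y_hat = X @ W + C
print("training NMSE:", np.mean((y_hat - y)**2) / np.var(y))
```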

With feedback, we must also optimize with respect to the feedback vector $V$ to determine the modified input sequence $\{u_{k}+V^{\top}x_{k}\}$. Since the network states $\{x_{k}\}$ have a highly complex and nonlinear dependence on $V$, we cannot solve for it exactly as we do for $C$ and $W$. It also turns out that, unfortunately, the cost function is not a convex function of $V$ (see Fig. 1). Therefore, a simple gradient descent or a linear regression for optimizing (training) $V$ is not guaranteed to converge to the global minimum of the cost function, which makes optimizing $V$ nontrivial.

Refer to caption
Figure 1: 3D plot of the non-convex dependence of the NMSE on $V$ for an ESN with 2 computational nodes. Here we optimize $C$ and $W$ for the Mackey-Glass task and use 1000 training data points after 500 initial steps in our ESN. This plot shows only a portion of the full space of convergent feedback vectors to better illustrate the non-convexity.

In fact, a good candidate for $V$ (which is at least locally optimal) is obtained by choosing $\alpha=\overline{\alpha}$ such that the difference between the left- and right-hand sides of the inequality (51) is maximized:

α¯=argmaxα(α|𝒪(α2)|β).\overline{\alpha}=\operatorname*{argmax}_{\alpha}\left(\alpha-\frac{\left|\mathcal{O}(\alpha^{2})\right|}{\beta}\right). (61)

Then, a good $V$ should be given by $V=-\overline{\alpha}\nabla_{A}S_{\mathrm{min}}(A,B)^{\top}B$. Therefore, a strategy to optimize $V$ is to first perform the optimization for $W$ assuming there is no feedback, which results in $S_{\mathrm{min}}(A,B)$. Then, one calculates $\nabla_{A}S_{\mathrm{min}}(A,B)$, which requires an additional optimization of $W$ for each perturbed $A+\Delta A$, i.e., $n\times n$ separate optimizations of $W$ for different $A$'s. This yields $\nabla_{A}S_{\mathrm{min}}(A,B)$ and hence a good $V=-\overline{\alpha}\nabla_{A}S_{\mathrm{min}}(A,B)^{\top}B$. We note, however, that obtaining $\overline{\alpha}$ requires the calculation of $\nabla_{A}S_{\mathrm{min}}(A,B)$, $\nabla_{A}^{2}S_{\mathrm{min}}(A,B)$, $\nabla_{A}^{3}S_{\mathrm{min}}(A,B)$, etc., which is computationally demanding. Thus, in practice, we use a different method that is more practical.

In our numerical examples, we used a standard batch gradient descent method to optimize $V$, with an enforced condition that ensures the ESN remains convergent at every step. We cannot use stochastic gradient descent because of the causal nature of the ESN: the order and size of the training set influence the optimal value of $V$. The proofs in Section 3 guarantee that gradient descent will almost always provide an improvement to the fit. The mathematical details of the gradient descent method that we used are explained in A.

To begin our gradient descent routine, we start at $V_{0}=\mathbf{0}$. First, we run the ESN without feedback and optimize for $C$ and $W$ as usual. Then, we calculate $\nabla_{V}S_{\mathrm{min}}$, which, as we will demonstrate below, can be done using only quantities already obtained from running the ESN. We then choose $V_{1}=V_{0}-\eta\nabla_{V}S_{\mathrm{min}}$ to be our new feedback vector for the next step, where $\eta$ is a learning rate that must be chosen beforehand. We repeat this process many times, running the ESN with $\{u_{k}+V_{i}^{\top}x_{k}\}$ as the input sequence, optimizing for $C$ and $W$ under the new inputs, and then recalculating $\nabla_{V}S_{\mathrm{min}}$ to update the feedback vector for the next step using $V_{i+1}=V_{i}-\eta\nabla_{V}S_{\mathrm{min}}$. This is performed for a set number of iterations. In case a gradient descent step would render the ESN unstable, we also detail below a procedure that keeps every $V_{i}$ within a convex region for which the ESN is guaranteed to be stable.
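The loop just described can be sketched as follows, reusing `run_esn` and `train_readout` from the earlier sketch. For brevity, this version estimates $\nabla_{V}S_{\mathrm{min}}$ by finite differences rather than the analytic recursion of Eqs. (74)-(76), and it omits the stability projection described next; the learning rate, step count, and finite-difference step are placeholder assumptions.

```python
import numpy as np

def cost_with_feedback(V, A, B, u, y, n_washout):
    """NMSE of the readout after re-optimizing (C, W) for feedback vector V."""
    # Feedback u_k + V^T x_k is equivalent to using the reservoir matrix A + B V^T.
    X = run_esn(A + np.outer(B, V), B, u)[n_washout:]
    W, C = train_readout(X, y)
    y_hat = X @ W + C
    return np.mean((y_hat - y) ** 2) / np.var(y)

def optimize_feedback(A, B, u, y, n_washout, eta=1.0, n_iter=100, fd_eps=1e-5):
    """Batch gradient descent on V starting from V = 0 (finite-difference sketch)."""
    n = A.shape[0]
    V = np.zeros(n)
    for _ in range(n_iter):
        S0 = cost_with_feedback(V, A, B, u, y, n_washout)
        grad = np.zeros(n)
        for j in range(n):                        # finite-difference gradient estimate
            dV = np.zeros(n)
            dV[j] = fd_eps
            grad[j] = (cost_with_feedback(V + dV, A, B, u, y, n_washout) - S0) / fd_eps
        V = V - eta * grad                        # gradient descent update
    return V
```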

Here, we present the method to enforce the ESN’s stability while updating VV. For this, we need to make sure that the ESN remains convergent at every step of gradient descent. The constraint on VV is given by the constraint for convergence on the ESN following from Eq. (10) with A¯\overline{A} in place of AA. Formally, the constraint is A¯A¯<a2𝕀dim(A)\overline{A}^{\top}\overline{A}<a^{2}\mathbb{I}_{\mathrm{dim}(A)}, where aa is a constant value that depends on the nonlinear function g(z)g(z) of the ESN. We use the sigmoid function, so we take a=4a=4. We ensure that the gradient descent algorithm obeys this constraint by applying a correction to any gradient descent step that causes the new value of VV to violate the constraint inequality. This correction is designed to only change the component of VV perpendicular to the surface defined by A¯A¯=a2𝕀dim(A)\overline{A}^{\top}\overline{A}=a^{2}\mathbb{I}_{\mathrm{dim}(A)} as a function of VV, so that gradient descent can still freely adjust VV in any direction parallel to this surface.

For some small shift in VV given by δV\delta V, the change in A¯A¯\overline{A}^{\top}\overline{A} is given by

A¯A¯\displaystyle\overline{A}^{\top}\overline{A} =AA+V(BA)+(AB)V+B2VV\displaystyle=A^{\top}A+V(B^{\top}A)+(A^{\top}B)V^{\top}+||B||^{2}\leavevmode\nobreak\ VV^{\top} (62)
δ(A¯TA¯)\displaystyle\delta(\overline{A}^{T}\overline{A}) δV(BA)+(AB)δV+B2δVV+B2VδV\displaystyle\approx\delta V(B^{\top}A)+(A^{\top}B)\delta V^{\top}+||B||^{2}\leavevmode\nobreak\ \delta VV^{\top}+||B||^{2}\leavevmode\nobreak\ V\delta V^{\top} (63)
=δV(BA¯)+(A¯B)δV.\displaystyle=\delta V(B^{\top}\overline{A})+(\overline{A}^{\top}B)\delta V^{\top}. (64)

If a gradient descent step causes the largest singular value $\lambda_{\max}$ of $\overline{A}$ to reach or exceed $a$, the ESN would cease to be uniformly convergent. We can use the normalized eigenvector $u_{\max}$ of $\overline{A}^{\top}\overline{A}$ associated with $\lambda_{\max}^{2}$ to get

δ(λmax2)\displaystyle\delta(\lambda_{\max}^{2}) =δ(umaxA¯A¯umax)\displaystyle=\delta(u_{\max}^{\top}\overline{A}^{\top}\overline{A}u_{\max}) (65)
=umaxδ(A¯A¯)umax+2δ(umax)A¯A¯umax\displaystyle=u_{\max}^{\top}\delta(\overline{A}^{\top}\overline{A})u_{\max}+2\delta(u_{\max})^{\top}\overline{A}^{\top}\overline{A}u_{\max} (66)
=umaxδ(A¯A¯)umax+2λmax2δV(dumaxdV)umax\displaystyle=u_{\max}^{\top}\delta(\overline{A}^{\top}\overline{A})u_{\max}+2\lambda_{\max}^{2}\delta V^{\top}\left(\frac{du_{\max}}{dV}\right)^{\top}u_{\max} (67)
2(umaxδV)(BA¯umax)+0.\displaystyle\approx 2(u_{\max}^{\top}\delta V)(B^{\top}\overline{A}u_{\max})+0. (68)

We can ignore the dependence of $u_{\max}$ on $V$ in this equation because the derivative of a normalized vector is always orthogonal to the original vector, so the first-order shift in each $u_{\max}$ above is eliminated by the other $u_{\max}$. Solving for $\delta V$ gives

umaxδV\displaystyle u_{\max}^{\top}\delta V δ(λmax2)2(BA¯umax).\displaystyle\approx\frac{\delta(\lambda_{\max}^{2})}{2(B^{\top}\overline{A}u_{\max})}. (69)

This formula tells us that we can adjust the singular values of $\overline{A}$ by using $\delta V^{\prime}=\delta V-u_{\max}\frac{\Delta}{2(B^{\top}\overline{A}u_{\max})}$ for our gradient descent step instead of $\delta V$, for some small positive value $\Delta$. We take $\Delta$ to be $\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})-a^{2}+\epsilon_{a}$ for some small positive number $\epsilon_{a}$. This ensures that the new step $\delta V^{\prime}$ keeps the singular values of $\overline{A}$ strictly less than $a$ with a minimal change to the original step $\delta V$, so that the convergence of the gradient descent procedure is minimally impacted. The quantity $\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})-a^{2}$ is guaranteed to be of the same order of magnitude as the norm of $\delta V$: we assume that $V$ leads to convergent dynamics but $V+\delta V$ does not, so $\lambda_{\max}<a$ while $\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})\geq a^{2}$; since $\delta(\lambda_{\max}^{2})$ is of the same order as $||\delta V||$, the difference satisfies $0\leq\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})-a^{2}<\delta(\lambda_{\max}^{2})$ (the last inequality because $\lambda_{\max}<a$) and must therefore be of the same order as well. If multiple singular values exceed $a$ due to a single step, we apply this procedure to each singular value independently; since the eigenvectors associated with these singular values are orthogonal, each adjustment to $\delta V$ has no overlap with any of the other adjustments. We calculate $\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})$ and $u_{\max}$ directly from $(A+B(V+\delta V)^{\top})^{\top}(A+B(V+\delta V)^{\top})$, while $\epsilon_{a}$ is chosen to be $10^{-5}$. To the order of approximation used in Eq. (69), we can take $A+B(V+\delta V)^{\top}\approx\overline{A}$ and use their eigenvectors interchangeably in calculating the adjustment to $\delta V$, aside from the value of $\delta(\lambda_{\max}^{2})$.
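A minimal sketch of this correction, assuming the sigmoid bound $a=4$ and $\epsilon_{a}=10^{-5}$ used in this work, is given below; the eigenpairs of $(A+B(V+\delta V)^{\top})^{\top}(A+B(V+\delta V)^{\top})$ are computed directly, as described above.

```python
import numpy as np

def project_step(A, B, V, dV, a=4.0, eps_a=1e-5):
    """Adjust a gradient-descent step dV so that A_bar = A + B V^T stays
    uniformly convergent (largest singular value of A_bar strictly below a).

    Applies the correction dV' = dV - u_max * Delta / (2 B^T A_bar u_max) of
    Eq. (69) to every singular value that would reach or exceed a.
    """
    M_new = A + np.outer(B, V + dV)                     # proposed A + B (V + dV)^T
    lam2, U = np.linalg.eigh(M_new.T @ M_new)           # squared singular values and eigenvectors
    A_bar = A + np.outer(B, V)                          # current A_bar (approx. interchangeable)
    dV = dV.copy()
    for lam2_i, u in zip(lam2, U.T):
        if lam2_i >= a**2:                              # constraint would be violated
            Delta = lam2_i - a**2 + eps_a
            dV -= u * Delta / (2.0 * (B @ (A_bar @ u)))
    return dV
```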

5 Benchmark Test Results

We conducted numerical demonstrations of ESNs with feedback by focusing on three distinct tasks: the Mackey-Glass task, the Nonlinear Channel Equalization task, and the Coupled Electric Drives task. The first two tasks are elaborated in the supplementary material of Ref. [34] and the last in Ref. [35]. Each task represents a unique class of problems. The Mackey-Glass task exemplifies a highly nonlinear chaotic system, challenging the ESN's ability to handle complex dynamics. The Nonlinear Channel Equalization task involves the recovery of a discrete signal from a nonlinear channel, testing the ESN's proficiency in signal processing. Finally, the Coupled Electric Drives task is focused on system identification for a nonlinear stochastic system, evaluating the ESN's performance in modeling and memory retention. Together, these three tasks provide a comprehensive evaluation of ESN capabilities, covering nonlinear modeling, system memory, and advanced signal processing across various complex systems.

The Mackey-Glass task requires the ESN to approximate a chaotic dynamical system described by y(t)y(t) in the Mackey-Glass equation:

dydt(t)=βy(tτ)1+yn(tτ)γy(t),\displaystyle\frac{dy}{dt}(t)=\beta\frac{y(t-\tau)}{1+y^{n}(t-\tau)}-\gamma y(t), (70)

where we choose the standard values β=0.2,τ=17,n=10,\beta=0.2,\tau=17,n=10, and γ=0.1\gamma=0.1. We numerically approximate the solution to this equation using yk+1=y((k+1)δt)=yk+δtdykdty_{k+1}=y((k+1)\delta t)=y_{k}+\delta t\frac{dy_{k}}{dt} with δt=1.0\delta t=1.0 and y(0)=1.0y(0)=1.0. We also run the solution for 10001000 steps before using it for the task, or in other words the target sequence we use actually starts with y1000y_{1000}. The task for the ESN is to predict what the sequence will be 10 time steps into the future. In other words, using the input sequence {uk=yk10}\{u_{k}=y_{k-10}\}, we want the ESN to successfully predict {yk}\{y_{k}\}.
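For reference, the Mackey-Glass input/target pairs can be generated with the Euler scheme described above. The constant pre-history $y(t)=1.0$ for $t<0$ is an assumption on our part (the text only specifies $y(0)=1.0$); since the first 1000 steps are discarded, this choice has little effect.

```python
import numpy as np

def mackey_glass(n_steps, beta=0.2, gamma=0.1, n=10, tau=17, dt=1.0,
                 y0=1.0, discard=1000):
    """Euler scheme y_{k+1} = y_k + dt*(beta*y_{k-tau}/(1 + y_{k-tau}^n) - gamma*y_k).

    The constant pre-history y = y0 for k < tau is an assumed choice; the first
    `discard` steps are dropped as in the text, so it has little influence.
    """
    total = discard + n_steps
    y = np.empty(total + 1)
    y[0] = y0
    for k in range(total):
        y_delay = y[k - tau] if k >= tau else y0        # assumed constant history
        y[k + 1] = y[k] + dt * (beta * y_delay / (1.0 + y_delay**n) - gamma * y[k])
    return y[discard:]                                  # drop the first 1000 steps

series = mackey_glass(2100)      # enough for 500 washout + 1000 training + 500 test steps
u = series[:-10]                 # input u_k = y_{k-10}
target = series[10:]             # target: the value 10 steps ahead of the input
```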

The second task is the Nonlinear Channel Equalization task. In this task, a sequence of digits $\{d_{k}\}$, each taking one of the 4 values $d_{k}\in\{-3,-1,1,3\}$, is put through a nonlinear propagation channel. The channel output $u_{k}$ is a polynomial in an intermediate linear channel $q_{k}$, which is in turn a linear combination of 10 values of $d_{k}$. The linear channel is given by

qk=\displaystyle q_{k}= 0.08dk+20.12dk+1+dk+0.18dk10.1dk2\displaystyle\leavevmode\nobreak\ 0.08d_{k+2}-0.12d_{k+1}+d_{k}+0.18d_{k-1}-0.1d_{k-2}
+0.091dk30.05dk4+0.04dk5+0.03dk6+0.01dk7.\displaystyle+0.091d_{k-3}-0.05d_{k-4}+0.04d_{k-5}+0.03d_{k-6}+0.01d_{k-7}. (71)

The nonlinear transformation uku_{k} of this channel is given by

uk=qk+0.036qk20.011qk3+vk,\displaystyle u_{k}=q_{k}+0.036q_{k}^{2}-0.011q_{k}^{3}+v_{k}, (72)

where vkv_{k} is a Gaussian white noise term with a signal-to-noise ratio of 32dB32\,\mathrm{dB}. That is, each noise term vkv_{k} is a random number generated from a Gaussian distribution with a mean of 0 and a standard deviation given by σk=abs(uk)/39.81\sigma_{k}=\mathrm{abs}(u_{k})/39.81, so that the signal-to-noise ratio is 10log10(uk2σk2)=20log10(39.81)3210\log_{10}\left(\frac{u_{k}^{2}}{\sigma_{k}^{2}}\right)=20\log_{10}\left(39.81\right)\approx 32. The task for the ESN is to recover the original digit sequence {dk}\{d_{k}\} from the nonlinear channel {uk}\{u_{k}\}, which is used as input. In other words, using the input sequence {uk}\{u_{k}\} described by Eq. (72), we want the ESN to produce the digit sequence {yk=dk}\{y_{k}=d_{k}\} as output. Since the ESN produces a continuous output while the target sequence takes discrete values, we round the output of the ESN to the nearest value in {3,1,1,3}\{-3,-1,1,3\} for the final error analysis. However, we still use the continuous outputs during training using the standard cost function described in Eq. (6).
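A sketch of the data generation for this task is given below. One assumption is worth flagging: the noise standard deviation is computed from the noiseless channel output $q_{k}+0.036q_{k}^{2}-0.011q_{k}^{3}$, since computing $\sigma_{k}$ from the already-noisy $u_{k}$ would be circular.

```python
import numpy as np

def channel_equalization_data(n_steps, snr_db=32.0, seed=0):
    """Generate (u_k, d_k) pairs following Eqs. (71)-(72).

    Assumption: sigma_k is taken relative to the noiseless channel output.
    """
    rng = np.random.default_rng(seed)
    pad = 10                                           # room for the taps d_{k+2}..d_{k-7}
    d = rng.choice([-3, -1, 1, 3], size=n_steps + pad)
    taps = {2: 0.08, 1: -0.12, 0: 1.0, -1: 0.18, -2: -0.1,
            -3: 0.091, -4: -0.05, -5: 0.04, -6: 0.03, -7: 0.01}
    amp = 10 ** (snr_db / 20.0)                        # = 39.81 for 32 dB
    u = np.empty(n_steps)
    for k in range(n_steps):
        i = k + 7                                      # shift so that d_{k-7} is a valid index
        q = sum(c * d[i + j] for j, c in taps.items())          # Eq. (71)
        clean = q + 0.036 * q**2 - 0.011 * q**3                 # Eq. (72), noiseless part
        u[k] = clean + rng.normal(0.0, abs(clean) / amp)        # add the Gaussian noise v_k
    return u, d[7:7 + n_steps]                         # channel inputs and target digits
```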

The third test is fitting the Coupled Electric Drives data set, which is derived from a real physical process and is intended as a benchmark data set for nonlinear system identification [36, 37]. In system identification, the ESN is used to approximately model an unknown stochastic dynamical system. This is achieved by tuning the free parameters of the ESN so that it approximates the nonlinear input/output (I/O) map generated by the unknown system through I/O data generated by the latter. This I/O map sends input sequences {,uk1,uk}\{\ldots,u_{k-1},u_{k}\} to output sequences {,yk1,yk}\{\ldots,y_{k-1},y_{k}\} for all kk. To this end, the ESN is configured as a nonlinear stochastic autoregressive model following [12], which is briefly explained in B.

We fit to the output signal labeled $z2$ in [35], which uses a PRBS input $u2$ with amplitude $1.5$. We model the data using a combination of the input $u2$ and the previous state of the system such that for each time step $k$, the next time step is obtained using an input given by $s\cdot u2_{k}+(1-s)\cdot y_{k}$ for some parameter $s$. This parameter is chosen by optimizing the cost function of the training data using gradient descent, in much the same way that we optimize the feedback vector $V$. However, this procedure is used to provide information about both $u2$ and the past values of $y$ to the ESN through a single input channel, and is in no way related to our feedback procedure. In the context of the discussion in B, the function $\nu$ that is defined in that appendix is in this case given by $\nu(x_{k},u2_{k},y_{k})=s\cdot u2_{k}+(1-s)\cdot y_{k}+V^{\top}x_{k}$, where $V=0$ if no state feedback is used; otherwise, with feedback, the value of $V$ is determined through the I/O data. Through our empirical analysis, we find that the global minimum of the cost function with respect to $s$ is always located near $s=0$, such that the cost function is locally convex in an interval that always contains $s=0$. Thus, using gradient descent starting from $s=0$ to find the optimal value of $s$ will always converge to the global minimum for this parameter. In our numerical work below, we choose a learning rate of $0.0012$ without feedback and a learning rate of $0.001$ when using feedback.
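For clarity, the input actually fed to the ESN at step $k$ for this task (the function $\nu$ discussed in B) can be written as a one-line helper; the argument names are ours.

```python
import numpy as np

def mixed_input(u2_k, y_k, x_k, s, V=None):
    """Input fed to the ESN at step k for the Coupled Electric Drives task:
    nu(x_k, u2_k, y_k) = s*u2_k + (1 - s)*y_k  (+ V^T x_k when feedback is used)."""
    feedback = 0.0 if V is None else float(np.dot(V, x_k))
    return s * u2_k + (1.0 - s) * y_k + feedback
```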

Refer to caption
Refer to caption
Figure 2: Histograms for the NMSE values for the Mackey-Glass task. The left plot shows the NMSE values for ESNs with 10 computational nodes, with 48000 randomly chosen ESNs without feedback (the base ESN) and 9600 choices with feedback. For the feedback optimization, we used 100 steps of batch gradient descent with a learning rate of 25.025.0. The right plot uses 48000 randomly chosen ESNs with 100 nodes. In all cases we used 1000 training data points taken after 500 steps of startup, and we show the NMSE values for 500 test data steps after training ends.

Fig. 2 shows histograms of the NMSE values obtained during the Mackey-Glass task for many different ESNs. We see that for 10 computational nodes without feedback, the distribution of NMSE values roughly takes the shape of a skewed Gaussian, with an average of about 0.2520.252 and a standard deviation of about 0.0560.056, and a longer tail on the right side than the left. With feedback, the distribution shifts significantly toward lower NMSE values, with an average of about 0.1770.177 and a standard deviation of about 0.0690.069. This is a roughly 30%30\% reduction of the average NMSE. For 100 computational nodes without feedback, the average is about 0.1200.120 with a standard deviation of about 0.0290.029, with a slight skew toward larger NMSE values this time. Note that the 10-node histogram with feedback appears to have two primary peaks, one centered around the lower edge of the 10-node distribution without feedback and one centered much closer to the 100-node average. With a better method of optimizing VV, it may be possible to get more cases toward the left peak in this distribution and demonstrate results comparable to a 100-node calculation using only 10 nodes with feedback. Such an analysis is beyond the scope of this article, however.

Refer to caption
Refer to caption
Figure 3: Histograms for the errors in the Channel Equalization task. The left plot shows the number of errors for ESNs with 10 computational nodes, with 48000 randomly chosen ESNs without feedback (the base ESN) and 9600 choices with feedback. For the feedback optimization, we used 100 steps of batch gradient descent with a learning rate of 10.010.0. The right plot uses 48000 randomly chosen ESNs with 100 nodes. In all cases we used 1000 training data points taken after 500 steps of startup, and we show the number of errors for 500 test data steps after training ends. The errors are counted using abs(dky^k)/2\mathrm{abs}(d_{k}-\hat{y}_{k})/2, where dkd_{k} is the actual value of the signal at time step kk, while y^k\hat{y}_{k} is the prediction by the ESN.

In Fig. 3, we show histograms of the number of errors obtained in the Channel Equalization task for many different ESNs. We count the number of errors based on how far off the ESN prediction is from the actual signal, using the expression |dky^k|/2|d_{k}-\hat{y}_{k}|/2. For example, if the true signal value was dk=1d_{k}=-1 but the ESN gave us y^k=3\hat{y}_{k}=-3, this counts as 1 error, but if dk=1d_{k}=-1 and y^k=+3\hat{y}_{k}=+3 we count it as 2 errors. We see that for 10 computational nodes without feedback, the distribution of total error values also takes the shape of a skewed Gaussian, with an average of about 11.2911.29 errors and a standard deviation of about 5.205.20, and a longer tail on the right side than the left. With feedback, the distribution again shifts significantly toward a lower number of errors, with an average of about 4.894.89 errors and a standard deviation of about 2.082.08, a nearly 57%57\% reduction to the average error count. For 100 computational nodes without feedback, the average is about 3.223.22 errors with a standard deviation of about 1.111.11.

In this task, all of the distributions roughly adhere to the shape of a Gaussian with a long tail on the right. This is in contrast to the Mackey-Glass task, where each distribution had a slightly different skew. The feedback procedure has clearly improved the average performance of 10-node ESNs, but unlike the Mackey-Glass task there is no secondary peak, so it may be that in most cases the gradient descent algorithm has settled near a minimum and will not improve the results further. Still, the 10-node average with feedback is much closer to the 100-node result than to the 10-node result without feedback.

Refer to caption
Refer to caption
Figure 4: Plots of the average NMSE and error values for the Mackey-Glass and Channel Equalization tasks, respectively, as a function of computational nodes. The average is taken over 9600 randomly chosen ESNs, and the error bars represent the standard deviation of the distribution for each number of nodes. In all cases we used 1000 training data points taken after 500 steps of startup, and we show the NMSE and total error values for 500 test data steps after training ends. We also include a line showing the average NMSE and total errors for 10 nodes with feedback for reference.

Fig. 4 shows the average dependence of the error of the ESN without feedback as a function of the number of nodes, using the NMSE for the Mackey-Glass task and the total number of errors in the Channel Equalization task. We also include the average value of the 10-node results with feedback for comparison. The error bars in each plot represent the standard deviation associated with each specific number of nodes. We see that, on average, using feedback on a 10-node ESN with 100 steps of gradient descent for optimizing VV is roughly equivalent to a little more than a 2020 node calculation for the Mackey-Glass task, while it is closer to a 2525 or 3030 node calculation for Channel Equalization.

The main reason for this discrepancy is that the Mackey-Glass task is significantly more difficult for the ESN than the Channel Equalization task. In Mackey-Glass, we are asking the network to predict 10 time steps into the future, but not all ESNs have a memory capacity going back 10 steps, especially with only 10 computational nodes. In contrast, the Channel Equalization task is much easier because the ESN does not have to reproduce the exact digits $\{-3,-1,1,3\}$; it only has to achieve a difference of less than 1 to be considered correct, so there is more room for error with the continuous output of the ESN. This is evidenced by the fact that the nodal dependence in the Mackey-Glass task seems to continue decreasing almost linearly near 100 nodes, while for Channel Equalization the dependence is close to zero, as the ESN is almost perfectly reproducing the signal with 100 nodes. This also suggests that the results for the Mackey-Glass task could see further improvement with feedback using a better method for optimizing $V$, since even the addition of computational nodes converges slowly.

Refer to caption
Refer to caption
Figure 5: Plots of the average NMSE and error values for the Mackey-Glass and Channel Equalization tasks, respectively, as a function of batch gradient descent steps for optimizing $V$, with a learning rate of $25.0$ for Mackey-Glass and $10.0$ for Channel Equalization. The average is taken over 9600 randomly chosen ESNs, and the error bars represent the standard deviation of the distribution for each number of gradient descent steps. In all cases we used 1000 training data points taken after 500 steps of startup. These plots show the error measures for the training data. The asterisk in the second plot indicates that we have rescaled the average number of errors by a factor of $1/2$ to be directly comparable with the other figures in this work.

In Fig. 5, we show the average dependence of the error of 10-node ESNs with feedback as a function of the number of gradient descent steps for optimizing $V$, using the NMSE for the Mackey-Glass task and the total number of errors in the Channel Equalization task. Note that these plots show the NMSE and total errors for the training data set of 1000 steps, as opposed to all of the previous plots, which use the 500 time steps immediately after training. This is why we have rescaled the average number of errors by a factor of $1/2$ in the plot for the Channel Equalization task: we are checking the errors over 1000 steps for each ESN in these plots instead of 500 like all the others, so the total number of errors is roughly doubled, hence the rescaling. The error bars in each plot represent the standard deviation associated with each specific number of gradient descent steps.

Here, we observe that for the Mackey-Glass task the gradient descent algorithm still has not fully converged even after 100 steps, and the variance in the performance is very large. In contrast, the gradient descent algorithm for the Channel Equalization task appears to have converged on average after about 25 to 30 steps to a value of about 3, with a moderately large standard deviation. This corroborates the discussion of Fig. 4, where we see that the ESN has trouble with the Mackey-Glass task, and so the convergence is slow and highly dependent on the specific ESN. Meanwhile, the Channel Equalization task is easier, and so the convergence occurs faster and more consistently. This further motivates using a better optimization method for $V$ to get the most we can out of an ESN for tasks like Mackey-Glass. For easier tasks like Channel Equalization we have likely already done the best we can with batch gradient descent, which indicates that feedback can be more useful than adding an equal number of parameters as new computational nodes, at least for small ESNs.

Refer to caption
Refer to caption
Figure 6: Plots of the performance of a specific ESN with and without feedback for the Coupled Electric Drives task. In both cases we used an ESN with 2 computational nodes, with 280 training data points taken after 19 steps of startup, and we show the results for 200 test data steps after training ends. For the feedback, we used 100 steps of batch gradient descent with a learning rate of 27.027.0. The left plot shows the ESN outputs with and without feedback (the base ESN) for the target data set along with the target data. In the right plot, we show the average correlation between time steps of the residuals ek=yky^ke_{k}=y_{k}-\hat{y}_{k}, where the time difference denotes the difference between the time steps of the residuals involved in the average. The 95%95\% confidence interval denotes that there is a 95%95\% chance that an i.i.d. normal distribution would produce a correlation within this interval, indicating that data found largely within the interval are consistent with a normal distribution.

In Fig. 6, we show plots of one ESN's fit to the Coupled Electric Drives data given in [35]. We specifically use $z2$, the PRBS signal with an amplitude of $1.5$. Note that since this data set contains only 500 data points, we use significantly less training data and test data than before, with only 280 and 200 steps, respectively. We also use only 2 computational nodes here, as opposed to 10 or more in the previous tasks. This is in accordance with [12], where it was shown that ESNs with 2 computational nodes perform well on the $z3$ Coupled Electric Drives data set. The first plot shows the fit to test data for a specific choice of ESN with and without feedback. We see that in this specific instance the original ESN has some trouble fitting the data, but with feedback the fit becomes much better, nearly matching the target data. The NMSE value without feedback was about $0.43$, but with feedback it is reduced by a full order of magnitude to about $0.032$.

The right plot shows the average correlations between the residuals ek=yky^ke_{k}=y_{k}-\hat{y}_{k} of the test data for a fixed time difference. This is calculated using the formula

Rk\displaystyle R_{k} =1N1j=0Nk1(eje¯)(ej+ke¯)σe2,\displaystyle=\frac{1}{N-1}\sum_{j=0}^{N-k-1}\frac{(e_{j}-\overline{e})(e_{j+k}-\overline{e})}{\sigma_{e}^{2}}, (73)

where $\overline{e}$ and $\sigma_{e}^{2}$ are the sample mean and variance of the residuals. This measures the correlation between the residuals at different time steps. If the residuals correspond to white noise, then $95\%$ of the correlations would fall within the confidence interval shown in the plot. We see that in the original ESN, there is a strong correlation between residuals that are one and two time steps apart, along with anti-correlations when they are 5 to 10 steps apart as well as 25 to 30 steps apart. With feedback, the one-step correlation is significantly reduced, and all but one of the other correlations are found within the confidence interval. This suggests that the ESN without feedback was not completely capturing part of the correlations in the test data, but with feedback the accuracy is improved to the point that most of the statistically significant correlations have been eliminated. We also checked that the residuals are consistent with Gaussian noise using the Lilliefors test [38, 39] and by checking a Q-Q plot [40] against the CDF of a normal distribution. We found that both with and without feedback, the residuals pass the $n=50$ Lilliefors test and follow a roughly linear trend on the Q-Q plot. Finally, we checked the correlation between the residuals and the input $u2$ to see whether the residual noise is uncorrelated with the input and, therefore, unrelated to the system dynamics. We found that for this ESN, the residuals without feedback show a statistically significant anti-correlation of about $-0.22$, below the $95\%$ confidence threshold of $-0.14$. With feedback, the correlation becomes about $-0.11$, a significant reduction in magnitude that now puts it within the confidence interval.
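The residual autocorrelation of Eq. (73) is straightforward to compute; a short sketch follows. The $\pm 1.96/\sqrt{N}$ band in the closing comment is the standard white-noise confidence band and is our assumption about how the interval in Fig. 6 is constructed.

```python
import numpy as np

def residual_autocorrelation(e, max_lag):
    """Sample autocorrelation R_k of the residuals e_k, following Eq. (73)."""
    e = np.asarray(e, dtype=float)
    N = len(e)
    e_bar = e.mean()
    var = e.var()                       # sample variance (1/N convention assumed)
    return np.array([np.sum((e[:N - k] - e_bar) * (e[k:] - e_bar)) / ((N - 1) * var)
                     for k in range(1, max_lag + 1)])

# For residuals consistent with i.i.d. noise, roughly 95% of the R_k should lie
# within +/- 1.96/sqrt(N) (our assumption for the confidence band in Fig. 6).
```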

The average behavior of feedback for this task also shows good improvement. Taking an average over 9600 different ESNs (including the one in Fig. 6), we find that the average NMSE without feedback using 2 computational nodes is about 0.1780.178 with a standard deviation of about 0.07790.0779, but with feedback this average goes down to 0.07230.0723 with a standard deviation of about 0.04480.0448. This is a roughly 59%59\% reduction of the average NMSE. We also find that the average correlation with the input u2u2 is about 0.183-0.183 without feedback, below the 95%95\% confidence threshold of 0.139-0.139, but with feedback it is on average above the threshold with a value of about 0.132-0.132. Additionally, the average one time step difference correlation between residuals is about 0.690.69 without feedback, but with feedback it reduces to about 0.480.48. However, this does not reduce it into the 95%95\% confidence interval for Gaussian white noise, suggesting that there is still room for improvement with feedback. We note that batch gradient descent may not be the most effective choice for optimizing VV for this task, especially considering the use of a large learning rate of 27.027.0 to get these results. If we were able to reach the true global minimum of VV, we may be able to much more consistently reach a scenario like the one shown in Fig. 6.

6 Discussions and Conclusion

In this work, we have introduced a new method for improving the performance of an ESN using a feedback scheme. This scheme uses a combination of the existing input channel of the ESN and the previously measured values of the network state as a new input, so that no direct modification of the ESN is necessary. We proved rigorously that using feedback is almost always guaranteed to provide an improvement to the performance, and that the number of cases in which such an improvement is not possible using batch gradient descent is vanishingly small compared to all possible ESNs. In addition, we proved rigorously that such a feedback scheme provides a superior performance over the whole class of ESN on average. We laid out the procedure for optimizing the ESN as a function of the fitting parameters C,W,C,W, and VV, exactly solving for CC and WW while using batch gradient descent on VV for a fixed number of steps. We then demonstrated the performance improvements for the Mackey-Glass, Channel Equalization, and Coupled Electric Drives tasks, and commented on how the relative difficulty of the tasks affected the results. The ESNs with feedback exhibited outstanding performance improvement in the Channel Equalization and Coupled Electric Drives tasks, and we observed a roughly 57%57\% and 59%59\% improvement in their respective error measures. For the more difficult Mackey-Glass task, we still saw a roughly 30%30\% improvement in the averaged NMSE, showing that feedback produces a significant boost in performance for a variety of tasks. These ESNs with feedback were shown to perform just as well on average, if not better than, the ESNs that have double the number of computational nodes without feedback.

Although there will be an additional hardware modification required to implement feedback, such a modification will only be external to the reservoir computer and, therefore, the burden will be minimal. This feedback scheme is designed to avoid any direct modification of the ESN’s main body (i.e., the computing bulk) since we only need to take the readout of the network and send some component of that readout back into the network with the usual input. Thus, we will only need an apparatus that connects to the readout and the input of the ESN, but does not require modifying the internal reservoir. Given that our results suggest that ESNs with feedback will perform just as well as, if not better than, ESNs of double the number of nodes without feedback, the cost-benefit analysis of adding feedback hardware is very likely to be more favorable than increasing the size of the ESN to achieve similar performance.

Because of the highly complex and nonlinear dependence of the network states {xk}\{x_{k}\} on VV, we used batch gradient descent for the optimization of VV, but better methods may very well exist. The question of how to best optimize VV is indeed closely related to how to choose the best (A,B)(A,B) for a given training data set {uk}\{u_{k}\} and {yk}\{y_{k}\}. This is an open question for which any progress would be monumental in the general theoretical development of reservoir computing and neural networks. Even providing just a measure of the computing power of a given (A,B)(A,B) or (A,B,{uk})(A,B,\{u_{k}\}) outside of the cost function itself could provide some classification of tasks that will save us significant computational time and resources in the future.

Appendix A Batch Gradient Descent Method for Optimizing VV

At the beginning of each step, we train the ESN using {uk+ViTxk}\{u_{k}+V_{i}^{T}x_{k}\} as the input sequence for the current feedback vector ViV_{i}, with V0=𝟎V_{0}=\mathbf{0} as the initial step. To get the change in VV, we use the gradient of SminS_{\mathrm{min}} given by

dSdVj\displaystyle\frac{dS}{dV_{j}} =1Nk=0N1(WdxkdVj)(Wxkyk)\displaystyle=\frac{1}{N}\sum_{k=0}^{N-1}\left(W^{\top}\frac{dx_{k}}{dV_{j}}\right)(W^{\top}x_{k}-y_{k}) (74)
dxkdVj\displaystyle\frac{dx_{k}}{dV_{j}} =Σk(Bxk1,j+A¯dxk1dVj)\displaystyle=\Sigma_{k}\left(B\leavevmode\nobreak\ x_{k-1,j}+\overline{A}\frac{dx_{k-1}}{dV_{j}}\right) (75)
Σk,ij\displaystyle\Sigma_{k,ij} =δijσ(zk1,i)=δijxk,i(1xk,i),\displaystyle=\delta_{ij}\sigma^{\prime}(z_{k-1,i})=\delta_{ij}x_{k,i}(1-x_{k,i}), (76)

where A¯=A+BV\overline{A}=A+BV^{\top}. The derivatives dxkdVj\frac{dx_{k}}{dV_{j}} can be calculated by iteration starting from the initial condition dx0dVj=𝟎\frac{dx_{0}}{dV_{j}}=\mathbf{0}. Then we simply shift VVηVSminV\rightarrow V-\eta\nabla_{V}S_{\mathrm{min}} for some learning rate η\eta at each step of the descent, using the optimal solutions for CC and WW for the current value of VV.
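For completeness, a direct implementation of Eqs. (74)-(76) is sketched below. The indexing convention pairing targets with the post-washout states, and the inclusion of the readout bias $C$ in the error term, are our assumptions.

```python
import numpy as np

def grad_V_S(A, B, V, u, y, W, C, n_washout):
    """Analytic gradient of S_min with respect to V via Eqs. (74)-(76).

    Assumes the sigmoid ESN x_{k+1} = g(A_bar x_k + B u_k) with A_bar = A + B V^T,
    that (W, C) have already been optimized for this V, and that y[0] is the
    target paired with the first post-washout state (assumed indexing).
    """
    n = A.shape[0]
    A_bar = A + np.outer(B, V)
    x = np.zeros(n)                       # current state x_k
    dx = np.zeros((n, n))                 # dx[:, j] = d x_k / d V_j, initialized to 0
    grad = np.zeros(n)
    N = len(u) - n_washout                # number of training steps
    for k, u_k in enumerate(u):
        z = A_bar @ x + B * u_k
        x_next = 1.0 / (1.0 + np.exp(-z))
        Sigma = np.diag(x_next * (1.0 - x_next))        # Eq. (76)
        dx = Sigma @ (np.outer(B, x) + A_bar @ dx)      # Eq. (75), iterated forward
        x = x_next
        if k >= n_washout:                              # accumulate Eq. (74)
            err = W @ x + C - y[k - n_washout]
            grad += (W @ dx) * err
    return grad / N
```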

Appendix B Nonlinear Stochastic Autoregressive Model

The basic model is

\displaystyle x_{k+1}=g(Ax_{k}+B\nu(x_{k},u_{k},y_{k})),
\displaystyle\hat{y}_{k}=W^{\top}x_{k}+C,

where ν\nu is some function of xkx_{k}, uku_{k} and yky_{k} taking values in n\mathbb{R}^{n} and {(uk,yk)}\{(u_{k},y_{k})\} are I/O pairs generated at time kk by the stochastic dynamical system of interest. The quantity y^k\hat{y}_{k} is the output of the ESN and functions as an approximation of yky_{k}. Assuming convergence, after washout y^k\hat{y}_{k} can be expressed as [12]:

y^k=F(yk1,yk2,,uk1,uk2,),\hat{y}_{k}=F(y_{k-1},y_{k-2},\ldots,u_{k-1},u_{k-2},\ldots),

for some nonlinear functional FF of past input and output values, uk1,uk2,u_{k-1},u_{k-2},\ldots and yk1,yk2,y_{k-1},y_{k-2},\ldots, respectively. Letting eke_{k} be the approximation error of yky_{k} by y^k\hat{y}_{k}, ek=ykyk^e_{k}=y_{k}-\hat{y_{k}}, the ESN then generates a nonlinear infinite-order autoregressive model with exogenous input:

yk=F(yk1,yk2,,uk1,uk2,)+ek.y_{k}=F(y_{k-1},y_{k-2},\ldots,u_{k-1},u_{k-2},\ldots)+e_{k}.

To complete the model, it is stipulated that $\{e_{k}\}$ is a Gaussian white noise sequence that is uncorrelated with the input sequence $\{u_{k}\}$. By making the substitution $y_{k}=\hat{y}_{k}+e_{k}=W^{\top}x_{k}+C+e_{k}$, the system identification procedure identifies an approximate state-space model of the unknown stochastic dynamical system given by:

xk+1\displaystyle x_{k+1} =f(xk,uk,ek)\displaystyle=f(x_{k},u_{k},e_{k})
yk\displaystyle y_{k} =Wxk+ek,\displaystyle=W^{\top}x_{k}+e_{k},

where

f(xk,uk,ek)=g(Axk+Bν(xk,uk,Wxk+ek)),f(x_{k},u_{k},e_{k})=g(Ax_{k}+B\nu(x_{k},u_{k},W^{\top}x_{k}+e_{k})),

where $e_{k}$ and $u_{k}$ are free inputs to the model. The stochasticity in the model comes from the free noise sequence $\{e_{k}\}$ and possibly also the input $\{u_{k}\}$ (typically the case in system identification). This model can then be used, for instance, to design stochastic control laws $u_{k}$ for the unknown stochastic dynamical system or to simulate it on a digital computer. The hypothesis of the model, that $\{e_{k}\}$ is a Gaussian white noise sequence uncorrelated with the input sequence $\{u_{k}\}$, is tested after the model is fitted using a separate validation data set (different from the fitting data set) by residual analysis; see, e.g., [36, 37].

References

  • [1] M. Lukoševičius, H. Jaeger, Reservoir computing approaches to recurrent neural network training, Computer science review 3 (3) (2009) 127–149.
  • [2] B. Schrauwen, D. Verstraeten, J. Van Campenhout, An overview of reservoir computing: theory, applications and implementations, in: Proceedings of the 15th European Symposium on Artificial Neural Networks, 2007, pp. 471–482.
  • [3] G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, A. Hirose, Recent advances in physical reservoir computing: A review, Neural Networks 115 (2019) 100–123.
  • [4] H. Jaeger, H. Haas, Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communications, Science 304 (5667) (2004) 78–80.
  • [5] J. Pathak, et al., Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach, Physical review letters 120 (2) (2018) 024102.
  • [6] M. Rafayelyan, et al., Large-scale optical reservoir computing for spatiotemporal chaotic systems prediction, Phys. Rev. X 10 (2020) 041037.
  • [7] P. Antonik, M. Gulina, J. Pauwels, S. Massar, Using a reservoir computer to learn chaotic attractors, with applications to chaos synchronization and cryptography, Physical Review E 98 (1) (2018) 012215.
  • [8] Z. Lu, et al., Reservoir observers: Model-free inference of unmeasured variables in chaotic systems, Chaos 27 (2017) 041102.
  • [9] L. Grigoryeva, J. Henriques, J.-P. Ortega, Reservoir computing: information processing of stationary signals, in: Joint 2016 CSE, EUC and DCABES, IEEE, 2016, pp. 496–503.
  • [10] L. Grigoryeva, J.-P. Ortega, Echo state networks are universal, Neural Networks 108 (2018) 495–508. doi:10.1016/j.neunet.2018.08.025.
  • [11] L. Gonon, J.-P. Ortega, Reservoir computing universality with stochastic inputs, IEEE transactions on neural networks and learning systems 31 (1) (2019) 100–112.
  • [12] J. Chen, H. I. Nurdin, Nonlinear autoregression with convergent dynamics on novel computational platforms, IEEE Transactions on Control Systems Technology 30 (5) (2022) 2228–2234. doi:10.1109/TCST.2021.3136227.
  • [13] L. Larger, et al., High-speed photonic reservoir computing using a time-delay-based architecture: Million words per second classification, Physical Review X 7 (1) (2017) 011015.
  • [14] J. Torrejon, et al., Neuromorphic computing with nanoscale spintronic oscillators, Nature 547 (7664) (2017) 428–431.
  • [15] J. Chen, H. I. Nurdin, N. Yamamoto, Temporal information processing on noisy quantum computers, Phys. Rev. Applied 14 (2020) 024065. doi:10.1103/PhysRevApplied.14.024065.
  • [16] Y. Suzuki, et al., Natural quantum reservoir computing for temporal information processing, Sci. Reports 12 (1) (2022) 1353.
  • [17] T. Yasuda, et al., Quantum reservoir computing with repeated measurements on superconducting devices, arXiv preprint arXiv:2310.06706 (October 2023).
  • [18] K. Nakajima, I. Fischer, Reservoir Computing: Theory, Physical Implementations, and Applications, Springer Singapore, 2021.
  • [19] P. Mujal, et al., Opportunities in quantum reservoir computing and extreme learning machines, Adv. Quantum Technol. 4 (2021) 2100027.
  • [20] D. Marković, J. Grollier, Quantum neuromorphic computing, Applied Physics Letters 117 (15) (2020) 150501.
  • [21] H. Jaeger, The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report 148 (34) (2001) 13.
  • [22] H. Jaeger, Echo state network, scholarpedia 2 (9) (2007) 2330.
  • [23] M. Lukoševičius, A practical guide to applying echo state networks, in: Neural Networks: Tricks of the Trade: Second Edition, Springer, 2012, pp. 659–686.
  • [24] D. Sussillo, L. F. Abbott, Generating coherent patterns of activity from chaotic neural networks, Neuron 63 (4) (2009) 544–557.
  • [25] M. Freiberger, P. Bienstman, J. Dambre, A training algorithm for networks of high-variability reservoirs, Scientific Reports 10 (2020) 14451.
  • [26] W. Maass, P. Joshi, E. D. Sontag, Computational aspects of feedback in neural circuits, PLoS Comp. Bio. 3 (2007) article no. e165.
  • [27] G. Manjunath, H. Jaeger, Echo state property linked to an input: Exploring a fundamental characteristic of recurrent neural networks, Neural Computation 25 (3) (2013) 671–696. doi:10.1162/NECO_a_00411.
  • [28] S. Boyd, L. Chua, Fading memory and the problem of approximating nonlinear operators with volterra series, IEEE Transactions on Circuits and Systems 32 (11) (1985) 1150–1161. doi:10.1109/TCS.1985.1085649.
  • [29] D. N. Tran, B. S. Rüffer, C. M. Kellett, Convergence properties for discrete-time nonlinear systems, IEEE Transactions on Automatic Control 64 (8) (2019) 3415–3422. doi:10.1109/TAC.2018.2879951.
  • [30] K. Fujii, K. Nakajima, Harnessing disordered-ensemble quantum dynamics for machine learning, Phys. Rev. Appl. 8 (2017) 024030. doi:10.1103/PhysRevApplied.8.024030.
  • [31] T. Kubota, H. Takahashi, K. Nakajima, Unifying framework for information processing in stochastically driven dynamical systems, Phys. Rev. Res. 3 (2021) 043135. doi:10.1103/PhysRevResearch.3.043135.
  • [32] K. Nakajima, K. Fujii, M. Negoro, K. Mitarai, M. Kitagawa, Boosting computational power through spatial multiplexing in quantum reservoir computing, Phys. Rev. Appl. 11 (2019) 034021. doi:10.1103/PhysRevApplied.11.034021.
  • [33] W. Rudin, Principles of Mathematical Analysis, New York: McGraw-Hill, 1976.
  • [34] T. Hülser, F. Köster, K. Lüdge, L. Jaurigue, Deriving task specific performance from the information processing capacity of a reservoir computer, Nanophotonics 12 (5) (2023) 937–947. doi:10.1515/nanoph-2022-0415.
  • [35] T. Wigren, M. Schoukens, Coupled electric drives data set and reference models, Tech. rep., Department of Information Technology, Uppsala University, Uppsala, Sweden (2017).
  • [36] L. Ljung, System Identification: Theory for the User, 2nd Edition, Prentice-Hall, 1999.
  • [37] S. A. Billings, Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains, Wiley, 2013.
  • [38] H. W. Lilliefors, On the Kolmogorov-Smirnov test for normality with mean and variance unknown, Journal of the American Statistical Association 62 (318) (1967) 399–402.
  • [39] H. Abdi, P. Molin, Lilliefors/Van Soest’s test of normality, Encyclopedia Meas. Stat. (01 2007).
  • [40] M. B. Wilk, R. Gnanadesikan, Probability plotting methods for the analysis of data, Biometrika 55 (1) (1968) 1–17. doi:10.1093/biomet/55.1.1.