
[1] Wyant College of Optical Sciences, University of Arizona, Tucson, Arizona, USA
[2] School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia

Improving the Performance of Echo State Networks Through Feedback

Peter J. Ehlers [email protected] Hendra I. Nurdin [email protected] Daniel Soh [email protected]
Abstract

Reservoir computing, using nonlinear dynamical systems, offers a cost-effective alternative to neural networks for complex tasks involving processing of sequential data, time series modeling, and system identification. Echo state networks (ESNs), a type of reservoir computer, mirror neural networks but simplify training. They apply fixed, random linear transformations to the internal state, followed by nonlinear changes. This process, guided by input signals and linear regression, adapts the system to match target characteristics, reducing computational demands. A potential drawback of ESNs is that the fixed reservoir may not offer the complexity needed for specific problems. While directly altering (training) the internal ESN would reintroduce the computational burden, an indirect modification can be achieved by redirecting some output as input. This feedback can influence the internal reservoir state, yielding ESNs with enhanced complexity suitable for broader challenges. In this paper, we demonstrate that by feeding some component of the reservoir state back into the network through the input, we can drastically improve upon the performance of a given ESN. We rigorously prove that, for any given ESN, feedback will almost always improve the accuracy of the output. For a set of three tasks, each representing different problem classes, we find that with feedback the average error measures are reduced by 30%-60%. Remarkably, feedback provides at least an equivalent performance boost to doubling the initial number of computational nodes, a computationally expensive and technologically challenging alternative. These results demonstrate the broad applicability and substantial usefulness of this feedback scheme.

keywords:
Reservoir Computing, Echo State Network, Feedback Improvement

1 Introduction

Compared to recurrent neural networks where excellent performance could only be obtained with very computationally expensive system adjustment procedures, the premise of reservoir computing is to use a fixed nonlinear dynamical system of the form (1)-(2) to perform signal processing tasks:

x_{k+1} = f(x_k, u_k),    (1)
\hat{y}_k = W^{\top} x_k + C,    (2)

where uku_{k} is the input signal at time kk [1, 2, 3]. The output y^k\hat{y}_{k} of the dynamical system is then typically taken as a simple linear combination of the states (nodes) xx of the dynamical system as given in (2) plus some constant CC (as the output bias), where WW is a weight matrix and WW^{\top} is its transpose. The nodes correspond to a basis map that is used to approximate an unknown map which maps input (discrete-time) sequences to output sequences that are to be learned by the dynamical system. Using the linear combination of states makes the training extremely straightforward and efficient as the weight matrix WW can be determined by a simple linear regression.

Reservoir computers (RCs) have been extensively used to predict deterministic sequences, in particular chaotic sequences, and for data-based chaotic system modelling; see, e.g., [4, 5, 6]. In the deterministic setting they have found applications in channel equalization [4], chaos synchronisation and encryption [7], and model-free observers for chaotic systems [8]. RCs have also been studied for the modelling of stochastic signals and systems, with applications including time series modelling, forecasting, filtering and system identification [9, 10, 11, 12].

Physical reservoir computing employs a device with complex temporal evolution, tapping into the computational power of a nonlinear dynamical system without the extensive parameter optimization needed in typical neural networks. Inputs are fed into a reservoir, a natural system with complex dynamics, influencing its state based on current and past inputs due to its (limited) memory. The reservoir, running automatically, is altered only by these inputs. This approach models complex, nonlinear functions with minimal requirements: problem-related inputs and a linear fitting algorithm. Various physical platforms have been experimentally demonstrated for reservoir computing, including, for instance, photonics [13, 6], spintronics [14] and even quantum systems [15, 16, 17]. For a review of physical reservoir computing and quantum reservoir computing, see, e.g., [3, 18, 19, 20].

An echo state network (ESN) [4, 21, 22, 23] is a type of RC using an iterative structure for adding nonlinearity to inputs. It is similar to a recurrent neural network, except that the neural weights are fixed and optimization occurs only at the output layer. In an ESN, the reservoir state at time step kk is represented by vector xkx_{k}, equivalent to the neural outputs at step kk. Each step involves applying a fixed linear transformation given by a matrix AA to xkx_{k}, then adding to it a vector BB times the input value uku_{k}, forming a new vector zkz_{k}. The matrix AA represents the fixed neural weights, while the vector BB represents the biases. A nonlinear transformation on each element of zkz_{k} generates xk+1x_{k+1}, akin to neuron outputs. The affine function y^k\hat{y}_{k} of xkx_{k} given in (2) is then fit to a target value sequence yk{y_{k}}, giving an output y^kyk\hat{y}_{k}\approx y_{k} that approximates the target system. We describe the ESN framework with further detail in Section 2.1.
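To make the update rule just described concrete, the following is a minimal NumPy sketch of one ESN iteration and of collecting reservoir states over an input sequence; the helper names (esn_step, run_esn) are illustrative and not from the paper.

```python
import numpy as np

def sigmoid(z):
    # Element-wise nonlinearity g(z); the paper's numerical results use the sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def esn_step(x, u, A, B):
    # One ESN update: z_k = A x_k + B u_k, followed by the element-wise nonlinearity.
    return sigmoid(A @ x + B * u)

def run_esn(A, B, inputs, x0=None):
    # Drive the reservoir with a scalar input sequence and collect the states x_k.
    x = np.zeros(A.shape[0]) if x0 is None else x0
    states = []
    for u in inputs:
        x = esn_step(x, u, A, B)
        states.append(x)
    return np.array(states)   # shape (N, n)
```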

The main drawback of this approach is that any specific ESN is only going to be effective for a certain subset of problems because the transformations that the reservoir applies are fixed, so a specific reservoir will tend to modify the inputs in the same way, leading to a limited range of potential outputs. It has been shown in [10] that ESNs as a whole are universal, meaning that for any target sequence \{y_k\} and a given input sequence \{u_k\}, there will be an ESN with specific choices of A and B that can approximate the target to any desired accuracy. However, it may not be practically feasible to find a sufficiently accurate ESN for a particular problem of interest, as doing so may require an impractically large ESN, and one may have to settle for an ESN with weaker performance instead.

There have been previous efforts related to the above issue. In the context of an autonomous ESN with no external driving input, the work [24] introduces a number of architectures. One architecture includes adding a second auxiliary ESN network besides the principal “generator” ESN. The auxiliary is fed by a tunable linear combination of some nodes of the generator ESN, while a fixed (non-tunable) linear combination of the nodes of the auxiliary ESN is fed back to the generator network. The same error signal at the output of the generator ESN is used to train both the output weight of the principal and the weights that connect the generator to the auxiliary. The weight update is done recursively through an algorithm called the First-Order, Reduced and Controlled Error (FORCE) learning algorithm, which is in turn based on the recursive least squares algorithm. A second architecture does not use feedback but allows modification of some of the internal weights of the ESN besides the output weight, as in a conventional recurrent neural network. The internal and output weights are also updated using the FORCE algorithm. In [25], multiple ESNs with tunable output weights that are interconnected in a fixed feedforward architecture (with no feedback loops) are considered. A set of completely known but randomly generated “surrogate” ESNs are coupled according to some architecture and trained by simulation (“in silico”) using the backpropagation algorithm for artificial neural networks. The “intermediate” signals generated at the output of each component ESN are then used to train the output weights of another set of random ESNs, representing the “true” ESNs that will be deployed, in the same architecture. The output weights of the individual true ESNs can be trained by linear regression. In [26], in the continuous time setting, it was shown that an affine nonlinear system given by the nonlinear ODE:

\dot{x}_i = f(x_1, \ldots, x_n) + g(x_1, \ldots, x_n)\,v, \quad i = 1, 2, \ldots, n,

for scalar real functions x1,,xnx_{1},\ldots,x_{n} and vv is universal in the sense that it can exactly emulate any other nn-th order ODE of the form

z^{(n)} = G(z, z^{(1)}, \ldots, z^{(n-1)}) + u,

that is driven by a signal uu, where zz is a scalar signal and z(j)z^{(j)} denotes the jj-th derivative of zz with respect to time. The emulation is achieved by appropriately choosing scalar-valued real functions KK and hh and setting v=K(x,u)v=K(x,u) and z=h(x)z=h(x), where xx is the column vector x=(x1,,xn)x=(x_{1},\ldots,x_{n})^{\top}, where \top denotes the transpose of a matrix. Also, any system of kk higher order ODEs in zz of the form above can be emulated by using kk different feedback terms.

In this work, unlike [24], we are interested in ESNs that are driven by external input and are required to be convergent (forget their initial condition). Notably, the ESN in [24] could not be convergent because a limit cycle was used to generate a periodic output without any inputs in the ESN. When driven by an external input such networks may produce an output diverging in time. Also, while the study in [24] is motivated by biological networks, our work is motivated by the applications of physical reservoir computing. In particular, we are interested in enhancing the performance of a fixed physical RC by adding a simple but tunable structure external to the computer. Also, in contrast to [25], in this paper we do not consider multiple interconnected ESNs in a feedforward architecture and do not use surrogates, but train a single ESN augmented with a feedback loop directly with the data. While our work is related to [26] as discussed above, we do not seek universal emulation, but consider a linear feedback K(x)=VxK(x)=V^{\top}x that only depends on the state xx, but not the input uu, to enhance the approximation ability of ESNs.

The goal of this paper is to study the use of feedback in the context of ESNs and show that it will improve the performance of ESNs in the overwhelming majority of cases. Our proposal is to feed a linear function of the reservoir state back into the network as input. That is, for some vector VV, we change the input from uku_{k} to uk+Vxku_{k}+V^{\top}x_{k}. We then optimize VV with respect to the cost function to achieve a better fit to the target sequence {yk}\{y_{k}\}. This has the effect of changing the linear transformation AA that the reservoir performs on xkx_{k} at each time step, allowing us to partially control how the reservoir state evolves without modifying the reservoir itself. This will in essence provide us with a wider range of possible outputs for any given ESN, and can provide smaller ESNs an accuracy boost that makes them comparable to larger ones. Thus, our new paradigm of ESNs with feedback generates a significant performance boost with minimal perturbation of the system. We offer a thorough proof of a broad theorem, confidently ensuring that almost any ESN will experience a performance enhancement when a feedback mechanism is implemented, making this new scheme of ESNs with feedback universally applicable.

The structure of the paper is as follows. In Section 2, we provide some background on reservoir computers and echo state networks and introduce our feedback procedure. In Section 3, we provide a proof of the superiority of ESNs with feedback. In Section 4, we describe how the new parameters introduced by feedback are optimized. In Section 5, we provide numerical results that demonstrate the effectiveness of feedback for several different representative tasks. Finally, in Section 6 we give our concluding remarks.

In this paper, we denote the transpose of a matrix M as M^{\top}, with the same notation used for vectors. The n\times n identity matrix is written as \mathbb{I}_{n}, while the zero matrix of any size (including any zero vector) is written as \mathbf{0}. We treat an n-dimensional vector as an n\times 1 rectangular matrix in terms of notation, and in particular the outer product of two vectors v_{1} and v_{2} is written as v_{1}v_{2}^{\top}. The vector norm ||v|| denotes the standard 2-norm ||v||_{2}=\sqrt{v^{\top}v}. For a sequence whose kth element is given by a_{k}, we denote the entire sequence as \{a_{k}\}. When finite, a sum of the elements of such a sequence is often written notationally as if the elements were a sample of some stochastic process. We write weighted sums of these sequences as deterministic “expectation” values (averages), so that for a sequence \{a_{k}\} with N entries starting from k=0 we may write \langle a\rangle=\frac{1}{N}\sum_{k=0}^{N-1}a_{k}. We also define the mean of such a sequence as \overline{a}=\langle a\rangle and its variance as \sigma_{a}^{2}=\langle(a-\overline{a})^{2}\rangle. The expectation operator is denoted by \mathbb{E}[\cdot], the expectation of a random variable X is denoted by \mathbb{E}[X], and the conditional expectation of a random variable X given random variables Y_{1},\ldots,Y_{m} is denoted by \mathbb{E}[X|Y_{1},\ldots,Y_{m}]. We will denote the input and output sequences of the training data as \{u_{k}\}=\{u_{k}\}_{k=1,\ldots,N} and \{y_{k}\}=\{y_{k}\}_{k=1,\ldots,N}, respectively.

2 Theory of Reservoir Computing with Feedback

2.1 Reservoir Computing and Echo State Networks

A general RC is described by the following two equations:

x_{k+1} = f(x_k, u_k)    (3)
\hat{y}_k = h(x_k),    (4)

where xkx_{k} is a vector representing the reservoir state at time step kk, uku_{k} is the kkth member of some input sequence, and y^k\hat{y}_{k} is the predicted output. The function f(x,u)f(x,u) is defined by the reservoir and is fixed, but the output function h(x)h(x) is fit to the target sequence yky_{k} by minimizing a cost function SS. In practice we usually choose (for NN training data points)

h(x) = W^{\top}x + C    (5)
S = \frac{1}{2N}\sum_{k=0}^{N-1}(y_k - \hat{y}_k)^2,    (6)

where the scalar CC and vector WW are chosen to minimize SS, so that the problem of fitting the output function to data is just a linear regression problem. This setup is what enables the simulation and prediction of complex phenomena with a low computational overhead, because the reservoir dynamics encoded in f(x,u)f(x,u) are complex enough to get a nonlinear function of the inputs {uk}\{u_{k}\} that can then be made to approximate {yk}\{y_{k}\} using linear regression.

In order for a RC to work, it must obey what is known as the (uniform) convergence property, or echo state property [12, 27]. It states that, for a reservoir defined by the function f(x,u)f(x,u) and a given input sequence {uk}\{u_{k}\} defined for all kk\in\mathbb{Z}, there exists a unique sequence of reservoir states {xk}\{x_{k}\} that satisfy xk+1=f(xk,uk)x_{k+1}=f(x_{k},u_{k}) for all kk\in\mathbb{Z}. The consequence of this property is that the initial state of the reservoir in the infinite past does not have any bearing on what the current reservoir state is. This consequence combined with the continuity of f(x,u)f(x,u) leads to the fading memory property [28], which tells us that the dependence of xkx_{k} on an input uk0u_{k_{0}} for k>k0k>k_{0} must dwindle continuously to zero as kk0k-k_{0} tends to infinity. This means that any initial state dependence should become negligible after the RC runs for a certain amount of time, so that the RC is reusable and produces repeatable, deterministic results while also retaining some memory capacity for past inputs.

It has been shown [10, 29] that a given RC will have the uniform convergence property if the reservoir dynamics f(x,u)f(x,u) are contracting, or in other words if it satisfies

||f(x_1, u) - f(x_2, u)|| \leq \epsilon\,||x_1 - x_2||,    (7)

where \epsilon is some real number with 0<\epsilon<1. The norm in this inequality is arbitrary (as all norms on finite-dimensional vector spaces are equivalent), but it is usually chosen to be the standard vector norm ||v||_{2}=\sqrt{v^{\top}v}. This ensures that all reservoir states x will be driven toward the same sequence of states defined by the inputs \{u_k\}.

An ESN is a specific type of RC described above, with

f(x, u) = g(Ax + Bu),    (8)

where AA and BB are a random but fixed matrix and vector, respectively, while g(z)g(z) is a nonlinear function that acts on each component of its input zz. Throughout the paper we will take the dimension of the state xx to be nn, and the dimensions of AA and BB to be n×nn\times n and n×1n\times 1, respectively. For the output of the ESN we have that CC is a real scalar and WW is a real column vector of dimension nn. This design gives the ESN resemblance to a typical neural network, where the linear transformation zk=Axk+Bukz_{k}=Ax_{k}+Bu_{k} defines the input into the array of neurons, with AA providing the weights and BukBu_{k} providing a bias. The element-wise nonlinear function g(z)g(z) gives the array of outputs of the neurons as a function of the weighted inputs. The choices of g,A,g,A, and BB define a specific ESN, though in practice gg is often chosen to be one of a specific set of preferred functions such as the sigmoid σ(z)=(1+ez)1\sigma(z)=(1+e^{-z})^{-1} or tanh(z)\tanh(z) functions. In this work, we choose the sigmoid function for our numerical results.

The convergence of the ESN can be guaranteed by subjecting the matrix AA to the constraint that AA<a2𝕀nA^{\top}A<a^{2}\mathbb{I}_{n} for a constant a>0a>0. In other words, the singular values of AA must all be strictly less than some number aa which is determined by g(z)g(z). For the sigmoid function, we can use a=4a=4, while for the tanh\tanh function we use a=1a=1. This originates from proving that

||g(z_1) - g(z_2)|| \leq a^{-1}||z_1 - z_2||    (9)

for all z_{1},z_{2}\in\mathbb{R}^{n}, so that the convergence inequality will always be satisfied as long as

a^{-1}||(Ax_1 + Bu) - (Ax_2 + Bu)|| = a^{-1}||A(x_1 - x_2)|| \leq \epsilon\,||x_1 - x_2||,    (10)

for some 0<ϵ<10<\epsilon<1. Note that this is a sufficient but not necessary condition, as there could be combinations of A,B,A,B, and {uk}\{u_{k}\} such that

||g(Ax_1 + Bu_k) - g(Ax_2 + Bu_k)|| \leq \epsilon\,||x_1 - x_2||    (11)

for all x1,x2,x_{1},x_{2}, and kk, but this singular value criterion is much easier to test and design for while still providing a large space of possible reservoirs to choose from.
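As a practical illustration of this sufficient condition, the sketch below (illustrative, not from the paper) draws a random A and rescales it so that its largest singular value is strictly below a, with a = 4 for the sigmoid.

```python
import numpy as np

def make_contractive(A, a=4.0, margin=0.99):
    # Rescale A so its largest singular value is strictly below a,
    # which suffices for the convergence (echo state) property.
    s_max = np.linalg.norm(A, 2)        # spectral norm = largest singular value
    return A * (margin * a / s_max)

rng = np.random.default_rng(0)
n = 50
A = make_contractive(rng.normal(size=(n, n)), a=4.0)   # reservoir weights
B = rng.normal(size=n)                                 # input weights
assert np.linalg.norm(A, 2) < 4.0
```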

We parameterize how well the output of the ESN matches the target data using the normalized mean-square error (NMSE). In a linear regression problem we can show that the mean-squared error is

\langle(y - \hat{y})^2\rangle = \frac{1}{N}\sum_{k=0}^{N-1}(y_k - \hat{y}_k)^2 = 2S,    (12)

where we are averaging over the NN time steps corresponding to the training interval. With y^k=Wxk+C\hat{y}_{k}=W^{\top}x_{k}+C, we can show that

\langle(y - \hat{y})^2\rangle = \langle(C + W^{\top}x - y)^2\rangle    (13)
= (C + W^{\top}\langle x\rangle - \langle y\rangle)^2 + \langle(W^{\top}(x - \langle x\rangle) - (y - \langle y\rangle))^2\rangle    (14)
= (C + W^{\top}\langle x\rangle - \langle y\rangle)^2 + W^{\top}K_{xx}W - 2W^{\top}K_{xy} + \sigma_y^2,    (15)

where

K_{xx} = \langle(x - \langle x\rangle)(x - \langle x\rangle)^{\top}\rangle = \langle xx^{\top}\rangle - \langle x\rangle\langle x\rangle^{\top}    (16)
K_{xy} = \langle(x - \langle x\rangle)(y - \langle y\rangle)\rangle = \langle xy\rangle - \langle x\rangle\langle y\rangle.    (17)

The values of CC and WW that minimize the mean-squared error are

C = \langle y\rangle - W^{\top}\langle x\rangle    (18)
W = K_{xx}^{-1}K_{xy}.    (19)

Note that since KxxK_{xx} is a covariance matrix, it must be positive semi-definite, but by inverting it to find the optimal value of WW we have further assumed that it is positive definite. This assumption is equivalent to saying that all of the vectors xkx_{k} span the entire vector space nc\mathbb{R}^{n_{c}}, where ncn_{c} is the dimension of WW and all xkx_{k}’s. This is reasonable because in practice we usually take the number of training steps N>>ncN>>n_{c}, and since each xkx_{k} is a nonlinear transformation of the previous one, it is unlikely that any vector vv will satisfy vxk=0v^{\top}x_{k}=0 for all k0,,N1k\in{0,\dots,N-1}. Nevertheless, in the event that KxxK_{xx} is not invertible, we can take the pseudoinverse of KxxK_{xx} instead. This is because the components of WW parallel to the zero eigenvectors of KxxK_{xx} are not fixed by the optimization (which is why the inversion fails in the first place), so we are free to choose those components to be zero, which makes Eq. (19) correct when using the pseudoinverse of KxxK_{xx} as well.
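A short sketch of the readout training implied by Eqs. (16)-(19), using the pseudoinverse of K_{xx} as discussed above; the function and variable names are illustrative.

```python
import numpy as np

def train_readout(states, y):
    # states: (N, n_c) reservoir states over the training window; y: (N,) targets.
    N = len(y)
    x_mean = states.mean(axis=0)
    y_mean = y.mean()
    Xc = states - x_mean                       # mean-adjusted states
    Kxx = Xc.T @ Xc / N                        # Eq. (16)
    Kxy = Xc.T @ (y - y_mean) / N              # Eq. (17)
    W = np.linalg.pinv(Kxx) @ Kxy              # Eq. (19), pseudoinverse in case Kxx is singular
    C = y_mean - W @ x_mean                    # Eq. (18)
    return W, C
```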

Plugging the optimized values of CC and WW into the mean-squared error gives

\left(\langle(y - \hat{y})^2\rangle\right)_{\min} = \sigma_y^2 - K_{xy}^{\top}K_{xx}^{-1}K_{xy}.    (20)

From the original expression for the mean-squared error in Eq. (12), we can see that it is non-negative. Since KxxK_{xx} is a covariance matrix, the quantity KxyKxx1Kxy=WKxxW0K_{xy}^{\top}K_{xx}^{-1}K_{xy}=W^{\top}K_{xx}W\geq 0. Thus ((yky^k)2)min\left(\langle(y_{k}-\hat{y}_{k})^{2}\rangle\right)_{\min} is bounded above by σy2\sigma_{y}^{2}, so we may define a normalized mean-squared error, or NMSE, by

\mathrm{NMSE} = \frac{\langle(y - \hat{y})^2\rangle}{\sigma_y^2}.    (21)

This quantity is guaranteed to be between 0 and 1 for the training data, though it may exceed 1 for an arbitrary test data set. We can also see from Eqs. (12) and (20) that the task of minimizing S as a function of C, W, and V is equivalent to maximizing K_{xy}^{\top}K_{xx}^{-1}K_{xy} as a function of V.
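For completeness, a one-function sketch of the NMSE of Eq. (21) (illustrative, assuming NumPy arrays):

```python
import numpy as np

def nmse(y, y_hat):
    # Normalized mean-squared error: MSE divided by the variance of the target.
    return np.mean((y - y_hat) ** 2) / np.var(y)
```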

2.2 ESNs with Feedback

The main result of this work is the introduction of a feedback procedure to improve the performance of ESNs. We add an additional step to the process in which the input at each time step is taken to be u_k+V^{\top}x_k instead of just u_k. The reservoir of the ESN is then described by

x_{k+1} = g(Ax_k + B(u_k + V^{\top}x_k)) = g((A + BV^{\top})x_k + Bu_k).    (22)

From this equation, we see that the feedback causes this ESN to behave like a different network that uses A¯=A+BV\overline{A}=A+BV^{\top} as a transformation matrix instead of AA. We achieve this without modifying the RC itself, using only the pre-existing input channel and the reservoir states that we are already measuring. This provides a practical way of changing the reservoir dynamics without any internal hardware modification. We then optimize for VV using batch gradient descent to further reduce the cost function SS.

Note, however, that in attempting to modify A we run the risk of eliminating the uniform convergence of the ESN. Thus, a constraint must be placed on V in order to keep the network convergent. In accordance with the constraint in Eq. (10), we require that \overline{A}^{\top}\overline{A}<a^{2}\mathbb{I}_{n} in addition to A^{\top}A<a^{2}\mathbb{I}_{n}, which places some limitations on the value of V. This constraint is generally difficult to solve explicitly beyond this inequality, but it is possible to formulate it as a linear matrix inequality in \overline{A}; see, e.g., [12, §IV]. In addition, this condition can easily be enforced during the process of optimizing V.
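The following sketch (with illustrative helper names, not from the paper) shows the feedback update of Eq. (22) and the corresponding sufficient convergence test on \overline{A}=A+BV^{\top}:

```python
import numpy as np

def esn_step_feedback(x, u, A, B, V):
    # One update of the ESN with feedback: the effective input is u_k + V^T x_k, Eq. (22).
    z = (A + np.outer(B, V)) @ x + B * u
    return 1.0 / (1.0 + np.exp(-z))            # element-wise sigmoid

def feedback_is_convergent(A, B, V, a=4.0):
    # Sufficient condition: largest singular value of A_bar = A + B V^T strictly below a.
    return np.linalg.norm(A + np.outer(B, V), 2) < a
```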

3 Universal Superiority of ESN with Feedback over ESN without Feedback

In this section, we prove our central theorem stating that the ESN with feedback accomplishes smaller overall errors than the ESN without feedback. For this, we start with a theorem for an individual ESN:

Theorem 1 (Superiority of feedback for a given ESN and training data).

For any given matrix A and vector B in Eq. (8), and given sets of training inputs \{u_k\}=\{u_k\}_{k=1,\ldots,N} and outputs \{y_k\}=\{y_k\}_{k=1,\ldots,N} of finite length, define an optimized cost function S_{\mathrm{min}}(A,B,\{u_k\},\{y_k\}) with the corresponding optimal W and C. Then, for almost any given (A,B,\{u_k\},\{y_k\}), except for a vanishingly small set of (A,B,\{u_k\},\{y_k\}), feedback always reduces the cost function further:

\min_V S_{\mathrm{min}}(A + BV^{\top}, B, \{u_k\}, \{y_k\}) < S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\}).    (23)

Moreover, if A is such that A^{\top}A<a^{2}\mathbb{I}_{n}, where a is a constant that guarantees that the ESN is convergent, then the feedback gain V can always be chosen such that the ESN with feedback is also convergent and satisfies the above.

3.1 Preliminary Definitions and Relations

To prove Theorem 1, we will set up a number of lemmas and definitions prior to starting the main proof. This preliminary work will primarily concern the cases in which Eq. (23) does not hold, and the lemmas will rigorously show that the set of (A,B,\{u_k\},\{y_k\}) for which this happens is vanishingly small. The main proof of Theorem 1 will then establish the strict inequality in all other cases.

We will need a number of new symbols and definitions to facilitate the proofs of Theorem 1 and the following lemmas. For a given RC (A,B) and training data set (\{u_k\},\{y_k\}), there are several cases in which the gradient of the minimized cost function with respect to the feedback vector V may be zero. Consider an ESN with a specific choice of the matrix A, vector B, and nonlinear function \sigma(z). Also consider a fixed input sequence \{u_k\}, and to train our network we will use the N time steps ranging from 0 to N-1. To see how the derivative of the minimized cost function S_{\mathrm{min}} with respect to the feedback parameters V can be zero, define the matrix X_{ik}=\frac{1}{\sqrt{N}}(x_{k,i}-\overline{x}_{i}) for time steps k in the training data set. This is similar to the procedure used in [30] to optimize for W. In other words, with n_c+1 computational nodes (n_c coming from the vector W and 1 from C) and N>n_c training data points, the matrix X is an n_c\times N rectangular matrix whose columns are proportional to the mean-adjusted reservoir state x_k-\overline{x} at each time step k in the training set. Further, define the vector Y_k=\frac{1}{\sqrt{N}}y_k. With these definitions, we can rewrite the quantities previously defined in the context of Eq. (16) as

K_{xy} = \frac{1}{N}\sum_{k=0}^{N-1}(x_k - \overline{x})(y_k - \overline{y}) = \frac{1}{N}\sum_{k=0}^{N-1}(x_k - \overline{x})y_k - \mathbf{0} = XY    (24)
K_{xx} = \frac{1}{N}\sum_{k=0}^{N-1}(x_k - \overline{x})(x_k - \overline{x})^{\top} = XX^{\top}.    (25)

Here, the second equality of Eq. (24) used the fact that 1Nk=0N1(xkx¯)y¯=x¯y¯x¯y¯=𝟎\frac{1}{N}\sum_{k=0}^{N-1}(x_{k}-\bar{x})\bar{y}=\bar{x}\bar{y}-\bar{x}\bar{y}=\mathbf{0}.

Denote the pseudoinverse of XX as X1X^{-1}. Note that while XX1XX^{-1} is the nc×ncn_{c}\times n_{c} identity matrix, X1XX^{-1}X is not the N×NN\times N identity matrix in the vector space of time steps denoted by kk. Instead, it is a projection operator we will call Πx\Pi_{x}. The singular value decomposition of XX is given by X=UncΣUNX=U_{n_{c}}\Sigma U_{N}^{\top}, where UncU_{n_{c}} is an nc×ncn_{c}\times n_{c} orthogonal matrix, Σ\Sigma is taken to be an nc×Nn_{c}\times N rectangular diagonal matrix with non-negative values, and UNU_{N} is an N×NN\times N orthogonal matrix. The pseudoinverse of XX is defined to be X1UNΣ1UncX^{-1}\equiv U_{N}\Sigma^{-1}U_{n_{c}}^{\top}, where the pseudoinverse of Σ\Sigma is defined so that with Σjk=σjδjk\Sigma_{jk}=\sigma_{j}\delta_{jk} we have Σkj1=σj1δjk\Sigma^{-1}_{kj}=\sigma_{j}^{-1}\delta_{jk}. This also implies that (X1)=Unc(Σ1)UN=(X)1(X^{-1})^{\top}=U_{n_{c}}(\Sigma^{-1})^{\top}U_{N}^{\top}=(X^{\top})^{-1} since (Σ1)=(Σ)1(\Sigma^{-1})^{\top}=(\Sigma^{\top})^{-1}.

The product of Σ1\Sigma^{-1} and Σ\Sigma is given by Σ1Σ=Πnc\Sigma^{-1}\Sigma=\Pi_{n_{c}}, where the elements of Πnc\Pi_{n_{c}} are defined by

(\Pi_{n_c})_{kl} \equiv \theta_{-}(n_c - k)\,\delta_{kl},    (26)

where θ(x)\theta_{-}(x) is the step function with θ(0)=0\theta_{-}(0)=0. Note that this is a projection operator since it satisfies ΠncΠnc=Πnc\Pi_{n_{c}}\Pi_{n_{c}}=\Pi_{n_{c}}. Thus the product of X1X^{-1} and XX is given by

X^{-1}X = (U_N\Sigma^{-1}U_{n_c}^{\top})(U_{n_c}\Sigma U_N^{\top}) = U_N\Pi_{n_c}U_N^{\top} \equiv \Pi_x.    (27)

\Pi_x must also be a projection operator since \Pi_x\Pi_x=U_N\Pi_{n_c}\Pi_{n_c}U_N^{\top}=U_N\Pi_{n_c}U_N^{\top}=\Pi_x. We also see that \Pi_x is symmetric since \Pi_{n_c} is symmetric, so (X^{-1}X)^{\top}=X^{\top}(X^{\top})^{-1}=\Pi_x as well. This method of defining a singular value decomposition of the X matrix and obtaining the corresponding projection matrix \Pi_x is similar to the methods used to obtain theoretical results in [31, 32]. Note that the inversion of \Sigma assumes that all singular values are nonzero, but we already make this assumption when optimizing for W. By the expression for W in Eq. (19) and the discussion following that equation, this assumption is reasonable.
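Numerically, \Pi_x can be formed directly from the mean-adjusted state matrix with a pseudoinverse; the sketch below is illustrative and uses NumPy's pinv rather than an explicit SVD.

```python
import numpy as np

def projector_pi_x(states):
    # Build X (n_c x N) from the mean-adjusted reservoir states and return Pi_x = X^{-1} X, Eq. (27).
    N = states.shape[0]
    X = (states - states.mean(axis=0)).T / np.sqrt(N)
    return np.linalg.pinv(X) @ X

# Pi_x should be a symmetric projector of rank n_c, i.e. (up to numerical precision)
#   np.allclose(Pi_x @ Pi_x, Pi_x)  and  np.allclose(Pi_x, Pi_x.T).
```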

To get an expression for SminS_{\mathrm{min}} (short for Smin(A+BV,B,{uk},{yk})S_{\mathrm{min}}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\})) in this formalism, define the NN-dimensional vector e^\hat{e} such that its elements are given by e^k=1N\hat{e}_{k}=\frac{1}{\sqrt{N}}. It can then be shown that e^Y=1Nk=0N1yk=y¯\hat{e}^{\top}Y=\frac{1}{N}\sum_{k=0}^{N-1}y_{k}=\overline{y}. Then the variance of yky_{k} can be written as σy2=1Nk=0N1yk2y¯2=Y(𝕀Ne^e^)Y\sigma_{y}^{2}=\frac{1}{N}\sum_{k=0}^{N-1}y_{k}^{2}-\overline{y}^{2}=Y^{\top}(\mathbb{I}_{N}-\hat{e}\hat{e}^{\top})Y, where 𝕀n\mathbb{I}_{n} is the identity matrix of dimension n×nn\times n. From this expression and the expression for twice the optimal cost given in Eq. (20), SminS_{\mathrm{min}} is then

S_{\mathrm{min}} = \frac{1}{2}\left(\sigma_y^2 - K_{xy}^{\top}K_{xx}^{-1}K_{xy}\right)    (28)
= \frac{1}{2}Y^{\top}\left(\mathbb{I}_N - \hat{e}\hat{e}^{\top} - X^{\top}(X^{\top})^{-1}X^{-1}X\right)Y.    (29)

With the projection operator Πx\Pi_{x}, we can rewrite this expression for optimal cost function as

S_{\mathrm{min}} = \frac{1}{2}Y^{\top}\left(\mathbb{I}_N - \hat{e}\hat{e}^{\top} - \Pi_x\Pi_x\right)Y = \frac{1}{2}Y^{\top}\left(\mathbb{I}_N - \hat{e}\hat{e}^{\top} - \Pi_x\right)Y.    (30)

Thus the effect of indirectly modifying the RC with feedback is to shift the basis of the projection operator Πx\Pi_{x} to have as large of an overlap with the target sequence YY as possible. The derivative of SminS_{\mathrm{min}} with respect to some general parameter θ\theta of the RC is then given simply by

\frac{dS_{\mathrm{min}}}{d\theta} = -\frac{1}{2}Y^{\top}\frac{d\Pi_x}{d\theta}Y.    (31)

The target YY is independent of the RC and thus independent of θ\theta, so any changes to SminS_{\mathrm{min}} as a result of changing θ\theta must come from a change in Πx\Pi_{x}. The fact that Πx\Pi_{x} is a projection operator of rank ncn_{c} tells us some properties of any of its derivatives. First, from the property ΠxΠx=Πx\Pi_{x}\Pi_{x}=\Pi_{x} we get

\frac{d\Pi_x}{d\theta} = \frac{d}{d\theta}(\Pi_x\Pi_x) = \frac{d\Pi_x}{d\theta}\Pi_x + \Pi_x\frac{d\Pi_x}{d\theta}.    (32)

Note that this implies ΠxdΠxdθΠx=2(ΠxdΠxdθΠx)\Pi_{x}\frac{d\Pi_{x}}{d\theta}\Pi_{x}=2(\Pi_{x}\frac{d\Pi_{x}}{d\theta}\Pi_{x}). The only matrix that is equal to 2 times itself is the zero matrix, so ΠxdΠxdθΠx\Pi_{x}\frac{d\Pi_{x}}{d\theta}\Pi_{x} must be the zero matrix.
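These two identities are easy to verify numerically for any smooth, full-row-rank family X(\theta); the following self-contained check (illustrative, not part of the derivation) uses a finite-difference derivative of \Pi_x:

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, N, eps = 4, 30, 1e-6
X0 = rng.normal(size=(n_c, N))
dX = rng.normal(size=(n_c, N))

def pi(theta):
    # Projector Pi_x(theta) = X(theta)^{-1} X(theta) for a smooth family X(theta).
    X = X0 + theta * dX
    return np.linalg.pinv(X) @ X

Pi = pi(0.0)
dPi = (pi(eps) - Pi) / eps                                 # finite-difference derivative of Pi_x
print(np.allclose(dPi, dPi @ Pi + Pi @ dPi, atol=1e-4))    # Eq. (32)
print(np.allclose(Pi @ dPi @ Pi, 0.0, atol=1e-4))          # Pi_x (dPi_x/dtheta) Pi_x = 0
```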

3.2 Lemmas for Proving the Lower Dimensionality of Cases where VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0}

Now that we have established that the dependence of the cost function on the reservoir is entirely determined by a projection matrix Πx\Pi_{x}, we are ready to begin discussing cases where dSmindθ=0\frac{dS_{\mathrm{min}}}{d\theta}=0.

Lemma 1 (Categorization of cases where a derivative of SminS_{\mathrm{min}} w.r.t. a general reservoir parameter θ\theta vanishes).

Given Smin(A,B,{uk},{yk})S_{\mathrm{min}}(A,B,\{u_{k}\},\{y_{k}\}) and any parameter θ\theta that the reservoir is dependent on, the cases where dSmindθ=0\frac{dS_{\mathrm{min}}}{d\theta}=0 fall into one of two categories, one where dSmindθ=0\frac{dS_{\mathrm{min}}}{d\theta}=0 only for specific target sequences {yk}\{y_{k}\} and one where dΠxdθ=0\frac{d\Pi_{x}}{d\theta}=0. Furthermore, the former category is divided into 3 more categories in which ΠxY=𝟎\Pi_{x}Y=\mathbf{0}, ΠxY=Y\Pi_{x}Y=Y, or neither.

Proof.

For the discussion that follows, define the vector subspace 𝒴\mathcal{Y}_{\parallel} to be the space of NN-dimensional vectors with real coefficients such that 𝒴={Y|YN,ΠxY=Y}\mathcal{Y}_{\parallel}=\{Y|Y\in\mathbb{R}^{N},\Pi_{x}Y=Y\}. Define also the vector subspace 𝒴\mathcal{Y}_{\perp} such that 𝒴={Y|YN,ΠxY=𝟎}\mathcal{Y}_{\perp}=\{Y|Y\in\mathbb{R}^{N},\Pi_{x}Y=\mathbf{0}\}. Note that it is always possible to construct an orthonormal basis of vectors Y^k\hat{Y}_{k} in N\mathbb{R}^{N} where the first ncn_{c} basis vectors are in 𝒴\mathcal{Y}_{\parallel} and the remaining NncN-n_{c} basis vectors are in 𝒴\mathcal{Y}_{\perp}. This makes it useful to define

T_\theta \equiv \Pi_x\frac{d\Pi_x}{d\theta}(\mathbb{I}_N - \Pi_x) = \Pi_x\frac{d}{d\theta}\left(X^{-1}X\right)(\mathbb{I}_N - \Pi_x)    (33)
= \Pi_x X^{-1}\frac{dX}{d\theta}(\mathbb{I}_N - \Pi_x) + \Pi_x\frac{dX^{-1}}{d\theta}X(\mathbb{I}_N - \Pi_x)    (34)
= X^{-1}\frac{dX}{d\theta}(\mathbb{I}_N - \Pi_x),    (35)

where in the last line we used ΠxX1=X1XX1=X1𝕀nc=X1\Pi_{x}X^{-1}=X^{-1}XX^{-1}=X^{-1}\mathbb{I}_{n_{c}}=X^{-1} to simplify the first term and XΠx=XX1X=𝕀ncX=XX\Pi_{x}=XX^{-1}X=\mathbb{I}_{n_{c}}X=X to eliminate the second term. This definition is useful because ΠxdΠxdθΠx=𝟎\Pi_{x}\frac{d\Pi_{x}}{d\theta}\Pi_{x}=\mathbf{0}, so from the definition above Tθ=ΠxdΠxdθT_{\theta}=\Pi_{x}\frac{d\Pi_{x}}{d\theta}, and therefore from Eq. (32) we have

\frac{d\Pi_x}{d\theta} = T_\theta + T_\theta^{\top}.    (36)

This also implies that

\frac{dS_{\mathrm{min}}}{d\theta} = -\frac{1}{2}Y^{\top}(T_\theta + T_\theta^{\top})Y = -Y^{\top}T_\theta Y,    (37)

so the change in S_{\mathrm{min}} depends entirely upon the quadratic form of T_\theta with respect to Y.

From the definition of TθT_{\theta} in Eq. (33), because there is a Πx\Pi_{x} on the left side of the matrix, we then have that YTθ=𝟎Y^{\top}T_{\theta}=\mathbf{0} for all Y𝒴Y\in\mathcal{Y}_{\perp}. Since 𝒴\mathcal{Y}_{\perp} has dimension NncN-n_{c}, there must be at least NncN-n_{c} zero singular values of TθT_{\theta}. Let mθm_{\theta} be the matrix rank of TθT_{\theta} (number of nonzero singular values), which by the previous argument cannot be larger than ncn_{c}. Then the singular value decomposition of TθT_{\theta} can be written as

T_\theta = \sum_{j=0}^{m_\theta-1}\sigma_{\theta,j}\,\hat{y}_{\theta,j}^{\parallel}\left(\hat{y}_{\theta,j}^{\perp}\right)^{\top},    (38)

where \sigma_{\theta,j} is a strictly positive singular value of T_\theta, \hat{y}_{\theta,j}^{\parallel} is one of m_\theta orthonormal basis vectors in \mathcal{Y}_{\parallel}, and \hat{y}_{\theta,j}^{\perp} is one of m_\theta orthonormal basis vectors in \mathcal{Y}_{\perp}. The reason the right basis vectors are in \mathcal{Y}_{\perp} is the factor (\mathbb{I}_N-\Pi_x) on the right side of Eq. (33), which makes it so that T_\theta Y=\mathbf{0} for all Y\in\mathcal{Y}_{\parallel}.

With this decomposition of TθT_{\theta}, we can use Eq. (37) to rewrite dSmindθ\frac{dS_{\mathrm{min}}}{d\theta} as

\frac{dS_{\mathrm{min}}}{d\theta} = -\sum_{j=0}^{m_\theta-1}\sigma_{\theta,j}\,c_{\theta,j}^{\parallel}\,c_{\theta,j}^{\perp},    (39)

where cθ,j=Yy^θ,jc_{\theta,j}^{\parallel}=Y^{\top}\hat{y}_{\theta,j}^{\parallel} and cθ,j=Yy^θ,jc_{\theta,j}^{\perp}=Y^{\top}\hat{y}_{\theta,j}^{\perp}. There are 3 broad categories of YY for which dSmindθ\frac{dS_{\mathrm{min}}}{d\theta} vanishes for a given TθT_{\theta}:

  • 1. Y is orthogonal to every \hat{y}_{\theta,j}^{\parallel}, or equivalently \Pi_x Y=0.

  • 2. Y is orthogonal to every \hat{y}_{\theta,j}^{\perp}, or equivalently \Pi_x Y=Y.

  • 3. Neither of the above statements is true, but the coefficients c_{\theta,j}^{\parallel} and c_{\theta,j}^{\perp} are such that the sum \sum_{j=0}^{m_\theta-1}\sigma_{\theta,j}c_{\theta,j}^{\parallel}c_{\theta,j}^{\perp} vanishes.

There is also the possibility that mθm_{\theta} is zero, meaning that every singular value of TθT_{\theta} is zero, so that dSmindθ=0\frac{dS_{\mathrm{min}}}{d\theta}=0 for any YY. ∎

In what follows, while we cannot rule out any of these possibilities for the feedback vector V, we can show that the space of Y's that fit the above three criteria is of lower dimension than the general space of N-dimensional vectors that encompasses all Y's, and give criteria for numerically testing whether any of these cases hold for a given reservoir computation. Also, in the event that the gradient of S_{\mathrm{min}} with respect to V vanishes for any Y, we will show that the set of solutions in the space of possible matrices A, vectors B, and input sequences \{u_k\} is of lower dimension as well, with testable criteria for a given ESN.

Lemma 2 (Lower dimensionality of cases where VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} while VΠx𝟎\nabla_{V}\Pi_{x}\neq\mathbf{0}).

Given a specific ESN defined by a matrix A, vector B, and input sequence \{u_k\} such that the projection operator \Pi_x satisfies \nabla_V\Pi_x\neq\mathbf{0}, the space of training vectors Y that then lead to \nabla_V S_{\mathrm{min}}=\mathbf{0} is of lower dimension than the space of all training vectors, whose dimension is N.

Proof.

Define TiTViT_{i}\equiv T_{V_{i}} to be the same as in Eq. (33) with θ\theta replaced with ViV_{i} for each i=0,,nc1i=0,\dots,n_{c}-1. In order for the gradient VSmin\nabla_{V}S_{\mathrm{min}} to be zero, we require that dSmindVi=YTiY\frac{dS_{\mathrm{min}}}{dV_{i}}=-Y^{\top}T_{i}Y be zero for all ii’s. This means that for a given set of TiT_{i}’s, there are three cases in which VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} because of the particular form of YY:

  • Case 1:

    Let 𝒴V\mathcal{Y}_{V}^{\parallel} be the span of the set of vectors that contains y^i,j\hat{y}_{i,j}^{\parallel} for every i{0,,nc1}i\in\{0,\dots,n_{c}-1\} and j{0,,rank(Ti)1}j\in\{0,\dots,\mathrm{rank}(T_{i})-1\}, and let its dimension be mm_{\parallel}. Then if YY is such that Yy=0Y^{\top}y^{\parallel}=0 for all y𝒴Vy^{\parallel}\in\mathcal{Y}_{V}^{\parallel}, then VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} because YTi=𝟎Y^{\top}T_{i}=\mathbf{0} for all ii. This includes the case where SminS_{\mathrm{min}} is at its maximum possible value of 12σy2\frac{1}{2}\sigma_{y}^{2}, in which the RC has utterly failed to capture any properties of YY. The dimension of the space of YY’s that fall under this case is NmN-m_{\parallel}. This is because 𝒴V\mathcal{Y}_{V}^{\parallel} is spanned by mm_{\parallel} basis vectors, so the space of YY’s that are orthogonal to all of them is spanned by the remaining NmN-m_{\parallel} basis vectors. We can calculate mm_{\parallel} from the matrix defined as

    M_{\parallel} = \sum_{i=0}^{n_c-1}T_i T_i^{\top}.    (40)

    mm_{\parallel} is given by the rank of MM_{\parallel} because each TiTiT_{i}T_{i}^{\top} is a positive semi-definite matrix, so the only way that YM=𝟎Y^{\top}M_{\parallel}=\mathbf{0} is if YTi=𝟎Y^{\top}T_{i}=\mathbf{0} for all ii. The dimension of the space of vectors that satisfy this relation is NmN-m_{\parallel} as mentioned previously, so if MM_{\parallel} has NmN-m_{\parallel} zero eigenvalues, then that leaves mm_{\parallel} nonzero eigenvalues.

    We see from Eq. (35) that computing MM_{\parallel} will involve calculating X1X^{-1}, but since we are only interested in the rank of the matrix we can find an alternative. Recall that XX is an nc×Nn_{c}\times N matrix whose singular values are strictly positive, so therefore the rank of XX is ncn_{c}. Furthermore, XΠx=XX\Pi_{x}=X, so the span of the right eigenvectors of XX must be 𝒴\mathcal{Y}_{\parallel}. Since the span of the left eigenvectors of TiT_{i} is a subspace of 𝒴\mathcal{Y}_{\parallel} for all ii, the span of MM_{\parallel} is also a subspace of 𝒴\mathcal{Y}_{\parallel}, and we can multiply MM_{\parallel} by XX on both sides without changing the rank and use Eq. (35) to get a simpler nc×ncn_{c}\times n_{c} matrix

    \widetilde{M}_{\parallel} = XM_{\parallel}X^{\top} = \sum_{i=0}^{n_c-1}\frac{dX}{dV_i}(\mathbb{I}_N - \Pi_x)\frac{dX^{\top}}{dV_i}.    (41)

    Since this has the same number of nonzero eigenvalues as M_{\parallel}, m_{\parallel} is also given by the number of nonzero eigenvalues of \widetilde{M}_{\parallel}. This allows us to compute m_{\parallel} without needing to calculate X^{-1} directly, as we would have to if we calculated M_{\parallel} using Eq. (35) for T_i.

  • Case 2:

    Let 𝒴V\mathcal{Y}_{V}^{\perp} be the span of the set of vectors that contains y^i,j\hat{y}_{i,j}^{\perp} for every i{0,,nc1}i\in\{0,\dots,n_{c}-1\} and j{0,,rank(Ti)1}j\in\{0,\dots,\mathrm{rank}(T_{i})-1\}, and let its dimension be mm_{\perp}. Then if YY is such that Yy=0Y^{\top}y^{\perp}=0 for all y𝒴Vy^{\perp}\in\mathcal{Y}_{V}^{\perp}, then VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} because TiY=𝟎T_{i}Y=\mathbf{0} for all ii. This includes the minimal case where Smin=0S_{\mathrm{min}}=0, in which the RC perfectly describes YY and no further improvement is possible. The dimension of the space of YY’s that fall under this case is NmN-m_{\perp}. This is because 𝒴V\mathcal{Y}_{V}^{\perp} is spanned by mm_{\perp} basis vectors, so the space of YY’s that are orthogonal to all of them is spanned by the remaining NmN-m_{\perp} basis vectors. We can calculate mm_{\perp} from the matrix defined as

    M_{\perp} = \sum_{i=0}^{n_c-1}T_i^{\top}T_i.    (42)

    mm_{\perp} is given by the rank of MM_{\perp} because each TiTiT_{i}^{\top}T_{i} is a positive semi-definite matrix, so the only way that MY=𝟎M_{\perp}Y=\mathbf{0} is if TiY=𝟎T_{i}Y=\mathbf{0} for all ii. The dimension of the space of vectors that satisfy this relation is NmN-m_{\perp} as mentioned previously, so if MM_{\perp} has NmN-m_{\perp} zero eigenvalues, then that leaves mm_{\perp} nonzero eigenvalues.

  • Case 3:

    Neither of the above cases holds, but Y^{\top}T_i Y=0 for every i nonetheless. Since there are n_c different T_i's, we have n_c constraint equations determining which Y's of this type set \nabla_V S_{\mathrm{min}} to zero. However, it may be possible that some of these equations are not independent, which would imply that there is at least one linear combination of T_i's such that \sum_{i=0}^{n_c-1}\gamma_i T_i=\mathbf{0}. Define m_I to be the number of independent constraints of this form. Then the dimension of the space of Y's for which Y^{\top}T_i Y=0 is given by N-m_I, the number of free parameters left after applying the constraints. We can calculate m_I using the n_c\times n_c matrix M_I whose components are defined to be

    (M_I)_{ij} = \mathrm{Tr}(T_i T_j^{\top}).    (43)

    mIm_{I} is given by the rank of MIM_{I}, since if the linear combination i=0nc1γiTi=𝟎\sum_{i=0}^{n_{c}-1}\gamma_{i}T_{i}=\mathbf{0} then the ncn_{c}-dimensional vector vv defined by vi=γiv_{i}=\gamma_{i} is an eigenvector of MIM_{I} with eigenvalue 0.

As long as the dimensions of the spaces of Y's that satisfy the three cases above are all smaller than the total dimension of the space of all Y's, it is very unlikely that any given Y will fall into any of these categories. The total dimension of the space of Y vectors is N, and the dimensions of the spaces for the three cases are N-m_{\parallel}, N-m_{\perp}, and N-m_I, respectively, so as long as m_{\parallel}, m_{\perp}, and m_I are all positive, then this argument holds. We can check whether or not they are zero by taking the traces of their respective matrices M_{\parallel}, M_{\perp}, and M_I; since they are all positive semi-definite matrices, the only way that the sums of their eigenvalues can be zero is if every eigenvalue is zero. It turns out that all three traces are equal, since

\mathrm{Tr}(M_{\parallel}) = \mathrm{Tr}(M_{\perp}) = \mathrm{Tr}(M_I) = \sum_{i=0}^{n_c-1}\mathrm{Tr}(T_i T_i^{\top}),    (44)

so if any one of m_{\parallel}, m_{\perp}, or m_I is found to be zero, then all three are guaranteed to be zero. This is because a positive semi-definite matrix whose eigenvalues are all zero must be the zero matrix, which in turn implies that T_i=\mathbf{0} for all i, the subject of our second proof below. Conversely, Eq. (44) also implies that if any one of them is positive, then they all must be positive as well, so we need only check that one of them is nonzero for this proof to hold. The easiest one to check is likely m_{\parallel}, as we can use \widetilde{M}_{\parallel} in place of M_{\parallel} and avoid directly calculating X^{-1}. ∎
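As a concrete illustration of this trace test, the sketch below estimates \mathrm{Tr}(\widetilde{M}_{\parallel}) from Eq. (41) using finite differences of the reservoir states with respect to the feedback components (the exact derivatives of Eq. (45) below could be used instead); run_states is a hypothetical helper that returns the (N, n_c) state matrix for a given effective reservoir matrix.

```python
import numpy as np

def trace_M_parallel(run_states, A, B, V, inputs, eps=1e-6):
    # run_states(A_bar, B, inputs) -> (N, n_c) array of reservoir states (hypothetical helper).
    states = run_states(A + np.outer(B, V), B, inputs)
    N, n_c = states.shape
    X = (states - states.mean(axis=0)).T / np.sqrt(N)        # n_c x N
    P_perp = np.eye(N) - np.linalg.pinv(X) @ X               # I_N - Pi_x
    total = 0.0
    for i in range(n_c):
        dV = np.zeros(n_c); dV[i] = eps
        states_i = run_states(A + np.outer(B, V + dV), B, inputs)
        Xi = (states_i - states_i.mean(axis=0)).T / np.sqrt(N)
        dX_dVi = (Xi - X) / eps                              # finite-difference dX/dV_i
        total += np.trace(dX_dVi @ P_perp @ dX_dVi.T)        # summand of Eq. (41)
    return total   # Tr of M_parallel_tilde; a strictly positive value rules out the degenerate cases
```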

Lemma 3 (Lower dimensionality of cases where \nabla_V\Pi_x=\mathbf{0}).

The space of ESNs defined by matrices, vectors, and input sequences (A,B,\{u_k\}) for which \nabla_V\Pi_x=\mathbf{0} is of lower dimension than the space of all possible ESNs.

Proof.

This part of the proof will show that, in the space of all matrices A, vectors B, and input sequences \{u_k\} that lead to a convergent RC, the subspace where \nabla_V S_{\mathrm{min}}=\mathbf{0} for all Y (or equivalently, where T_i=\mathbf{0} for all i) must be of lower dimension. If it were not of lower dimension, then \nabla_V S_{\mathrm{min}}=\mathbf{0} would hold over a finite region of the possible values of A, B, and \{u_k\}. If the function f(x_k,u_k) representing the reservoir dynamics is analytic, then all reservoir states must be analytic functions of A, B, and \{u_k\} as well. In this paper, we are considering ESNs with f(x_k,u_k)=\sigma(Ax_k+Bu_k), where \sigma(z) is the element-wise sigmoid function, which is analytic. So if all reservoir states are analytic in A, B, and \{u_k\}, then S_{\mathrm{min}} is an analytic function of these variables as well; hence, if \nabla_V S_{\mathrm{min}}=\mathbf{0} held over a finite region of the possible values of these variables, then S_{\mathrm{min}} would have to be constant with respect to V for all A, B, \{u_k\} and \{y_k\}, as a direct consequence of the identity theorem for analytic functions [33]. However, the ESN will become unstable for a V with a sufficiently large norm, in which case the fit to \{y_k\} will be poor and S_{\mathrm{min}} would have to be larger than with no feedback. Thus S_{\mathrm{min}} cannot be constant with respect to V for all A, B, \{u_k\} and \{y_k\}, and therefore the space of matrices A, vectors B and training inputs \{u_k\} that satisfy \nabla_V\Pi_x=\mathbf{0} is of lower dimension than the space of all possible (A,B,\{u_k\}). ∎

Lemma 4 (Lower dimensionality of the subdomain of Smin(A+BV,B,{uk},{yk})S_{\mathrm{min}}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\}) for which VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0}).

The dimension of the space of matrices, vectors, input sequences, and target sequences (A,B,{uk},{yk})(A,B,\{u_{k}\},\{y_{k}\}) that satisfy VSmin=𝟎\nabla_{V}S_{\mathrm{min}}=\mathbf{0} is strictly less than the dimension of the space of all possible (A,B,{uk},{yk})(A,B,\{u_{k}\},\{y_{k}\}), which implies that the number of cases in which SminS_{\mathrm{min}} has a null gradient w.r.t. VV is vanishingly small compared to all cases.

Proof.

Lemma 2 proves that when \nabla_V\Pi_x\neq\mathbf{0}, the number of cases where \nabla_V S_{\mathrm{min}}=\mathbf{0} is vanishingly small in the space of all training sequences \{y_k\}, and therefore also in the space of all possible (A,B,\{u_k\},\{y_k\}). Lemma 3 proves that the cases in which \nabla_V\Pi_x=\mathbf{0} (and hence \nabla_V S_{\mathrm{min}}=\mathbf{0} for every Y) occupy a vanishingly small subset of the space of all matrices A, vectors B, and training inputs \{u_k\}, and therefore of all (A,B,\{u_k\},\{y_k\}) as well. Therefore, the number of cases of (A,B,\{u_k\},\{y_k\}) where \nabla_V S_{\mathrm{min}}=\mathbf{0} is vanishingly small over all possible (A,B,\{u_k\},\{y_k\}), regardless of \nabla_V\Pi_x. ∎

Finally, we note that \nabla_V S_{\mathrm{min}}=\mathbf{0} is a necessary but not sufficient condition for feedback to fail to improve the result. That is, the points where \nabla_V S_{\mathrm{min}}=\mathbf{0} correspond to the extrema of S_{\mathrm{min}} with respect to V, but these extrema could be minima, maxima, or saddle points. Only minima prevent feedback from improving the output of an ESN, and if we use a non-local method to find the global minimum of S_{\mathrm{min}} with respect to V, then local minima do not prevent improvement either.

It is possible to compute VSmin\nabla_{V}S_{\mathrm{min}} without much extra overhead for any given run of a RC to see if it is zero. The derivatives dxkdVi\frac{dx_{k}}{dV_{i}} can be calculated iteratively using the relation

\frac{dx_k}{dV_i} = \Sigma_k\left(B\,x_{k-1,i} + \overline{A}\frac{dx_{k-1}}{dV_i}\right)    (45)
\Sigma_{k,ij} = \delta_{ij}\,\sigma'(z_{k-1,i}) = \delta_{ij}\,x_{k,i}(1 - x_{k,i}),    (46)

where \overline{A}=A+BV^{\top}. This uses A, B, and the reservoir states x_k that have already been obtained from running the RC. We can also avoid computing N\times N matrices like \Pi_x directly by noting that from previous results we have Y^{\top}\Pi_x Y=K_{xy}^{\top}K_{xx}^{-1}K_{xy}. K_{xx}^{-1} was already computed when optimizing for W, so there is no additional matrix inversion needed to find \nabla_V S_{\mathrm{min}}. It is also feasible to check whether a vanishing gradient is due to the specific Y or a symptom of the RC by checking whether m_{\parallel}=0 using \mathrm{Tr}(\widetilde{M}_{\parallel}) defined in Eq. (41). \widetilde{M}_{\parallel} has the same form as S_{\mathrm{min}}, but with Y replaced with the matrix \frac{dX}{dV_i}, so by replacing y_k with \frac{dx_{k,j}}{dV_i} for every i and j in the definition of K_{xy} we can check whether \mathrm{Tr}(\widetilde{M}_{\parallel}) is zero without much extra work.
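A sketch of this gradient computation, following the recursion of Eqs. (45)-(46) for the sigmoid nonlinearity and reusing the covariance quantities of Section 2.1 (variable names are illustrative; the derivative is initialised at zero, consistent with a washed-out initial state):

```python
import numpy as np

def grad_V_smin(states, y, A, B, V):
    # states[k]: reservoir state x_k obtained with feedback vector V; y[k]: target.
    N, n = states.shape
    A_bar = A + np.outer(B, V)
    # Accumulate D_k with columns D_k[:, i] = d x_k / d V_i via Eqs. (45)-(46).
    D = np.zeros((n, n))
    dX = np.zeros((N, n, n))
    x_prev = np.zeros(n)
    for k in range(N):
        Sigma = np.diag(states[k] * (1.0 - states[k]))       # sigma'(z_{k-1}) for the sigmoid
        D = Sigma @ (np.outer(B, x_prev) + A_bar @ D)
        dX[k] = D
        x_prev = states[k]
    # dS_min/dV_i = -Y^T T_i Y, written with the quantities already used to fit W.
    Xc = states - states.mean(axis=0)
    yc = y - y.mean()
    W = np.linalg.pinv(Xc.T @ Xc / N) @ (Xc.T @ yc / N)      # optimal readout, Eq. (19)
    resid = yc - Xc @ W                                      # target component outside the span of X
    grad = np.empty(n)
    for i in range(n):
        dXc_i = dX[:, :, i] - dX[:, :, i].mean(axis=0)       # mean-adjusted d x_k / d V_i
        grad[i] = -(W @ (dXc_i.T @ resid)) / N
    return grad
```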

3.3 Proving the Universal Superiority of ESNs with Feedback

We are now ready to finally present the proof of Theorem 1 since we know that VSmin\nabla_{V}S_{\mathrm{min}} is nonzero except on a lower dimensional subspace.

Proof of Theorem 1.

Note that SminS_{\mathrm{min}} is a real analytic functional with respect to matrix AA (see the definition of SS in Eq. (6)), having the Taylor series

S_{\mathrm{min}}(A + BV^{\top}, B, \{u_k\}, \{y_k\})
= S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\}) + \mathrm{Tr}\left[\left(\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})\right)(BV^{\top})^{\top}\right] + \mathcal{O}[\delta A^2],    (47)

where the last term consolidates the second- and higher-order terms in \delta A=(A+BV^{\top})-A=BV^{\top}. A reasonable ansatz for V that most reduces the second term is

V = -\alpha\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})^{\top}B,    (48)

where α>0\alpha>0 is a constant to be determined. We now calculate the second term as

\mathrm{Tr}\left[\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})(BV^{\top})^{\top}\right] = -\alpha\beta,    (49)

where

\beta = \mathrm{Tr}\left[\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})^{\top}(BB^{\top})\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})\right]
= ||\nabla_A S_{\mathrm{min}}(A, B, \{u_k\}, \{y_k\})^{\top}B||^2 \geq 0.    (50)

If β>0\beta>0, one can always choose an arbitrarily small α(>0)\alpha(>0) such that

\alpha > \frac{\left|\mathcal{O}(\alpha^2)\right|}{\beta}.    (51)

This is because the left side is linear in \alpha whereas the right side is a higher-order polynomial in \alpha, and therefore such an (arbitrarily small) \alpha satisfying the above always exists.

The only case where such α\alpha cannot be found is the case where β=0\beta=0:

\nabla_V S_{\mathrm{min}}(A + BV^{\top}, B, \{u_k\}, \{y_k\}) = \nabla_A S_{\mathrm{min}}(A + BV^{\top}, B, \{u_k\}, \{y_k\})^{\top}B = \mathbf{0}.    (52)

However, we have proved in Lemma 4 that the number of cases in which this occurs is vanishingly small. Then, the strict inequality in Eq. (23) is proved in almost every case.

Now, we prove that \overline{A}=A+BV^{\top} with V=-\alpha\nabla_A S_{\mathrm{min}}(A,B,\{u_k\},\{y_k\})^{\top}B will keep the ESN convergent. The set \mathcal{A}_a=\{A\in\mathbb{R}^{n\times n}\mid A^{\top}A<a^{2}\mathbb{I}_n\} is an open convex set in \mathbb{R}^{n\times n} for any a>0. As has been shown earlier, for the overwhelming majority of (B,\{u_k\},\{y_k\}) there is always a choice of V that decreases S_{\mathrm{min}}(A,B,\{u_k\},\{y_k\}). Since A\in\mathcal{A}_a, by the continuity of the maximum singular value of A with respect to A (for any choice of matrix norm) and by the particular choice of V, there always exists a small number \delta>0 such that A+\delta BV^{\top}\in\mathcal{A}_a (guaranteeing that the ESN with feedback remains convergent) while still decreasing the cost, S_{\mathrm{min}}(A+B\delta V^{\top},B,\{u_k\},\{y_k\})<S_{\mathrm{min}}(A,B,\{u_k\},\{y_k\}). This concludes the proof of Theorem 1. ∎

3.4 Superiority of ESNs with Feedback for the Whole Class of ESNs

According to Theorem 1, the cost of the ESN with feedback is guaranteed to be smaller than the cost of the ESN without feedback for almost all fixed $(A,B,\{u_{k}\},\{y_{k}\})$. The following corollary states that the ESN with feedback also exceeds the performance of the ESN without feedback over the whole class of ESNs.

Corollary 1 (Universal superiority of ESN with feedback over the whole class).

For given and fixed finite input and output sequences $\{u_{k}\}=\{u_{k}\}_{k=1,\ldots,N}$ and $\{y_{k}\}=\{y_{k}\}_{k=1,\ldots,N}$, let $A$ and $B$ be drawn randomly according to some probability measure $\mathbb{P}$ on $\mathbb{R}^{n\times n}\times\mathbb{R}^{n}$. Let $\mathcal{X}=\{(A,B)\in\mathbb{R}^{n\times n}\times\mathbb{R}^{n}\mid A^{\top}A<a^{2}\mathbb{I}\}$ and let $\mathbb{P}$ be such that $\mathbb{P}(\mathcal{X})=1$. Let $\mathcal{Y}=\mathcal{X}\cap\{(A,B)\mid\nabla_{V}S_{\rm min}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\})\neq 0\}$ and choose $\mathbb{P}$ such that $\mathbb{P}(\mathcal{Y})>0$. Let $\langle S_{\mathrm{min}}\rangle_{A,B}=\mathbb{E}\left[S_{\mathrm{min}}(A,B)\right]$, where the expectation (average) is taken with respect to the probability measure $\mathbb{P}$ for the fixed training dataset $\{u_{k}\},\{y_{k}\}$. Then, for the given training dataset, the ESN with feedback has on average a smaller cost function value than the ESN without feedback. That is, the following holds when averaging over all possible $(A,B)$:

Smin(A,B,{uk},{yk})A,B>minVSmin(A+BV,B,{uk},{yk})A,B.\langle S_{\mathrm{min}}(A,B,\{u_{k}\},\{y_{k}\})\rangle_{A,B}>\langle\min_{V}S_{\mathrm{min}}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\})\rangle_{A,B}. (53)
Proof.

To prove the corollary over the whole class, we adopt a slightly different approach. The cost function after minimizing over $C$ and $W$ can be written as a function of the ESN parameters, $S_{\mathrm{min}}(A,B)$ (short for $S_{\mathrm{min}}(A,B,\{u_{k}\},\{y_{k}\})$). With feedback using a vector $V$, the new minimum is given by $S_{\mathrm{min}}(\overline{A},B)=S_{\mathrm{min}}(A+BV^{\top},B)$. Let us separate the matrix $A$ into two components given by

\displaystyle A=A_{||}+A_{\perp}, (54)
\displaystyle A_{||}=BZ_{AB}^{\top}\quad\mathrm{where}\quad Z_{AB}=\frac{A^{\top}B}{||B||^{2}}, (55)
\displaystyle A_{\perp}=\left(\mathbb{I}_{\mathrm{dim}(A)}-\frac{BB^{\top}}{||B||^{2}}\right)A=A-BZ_{AB}^{\top}. (56)

This decomposition separates the degrees of freedom of $A$ along the $B$ direction from all other degrees of freedom, so that $A_{||}$ and $A_{\perp}$ are independent. This is verified by calculating the inner product:

Tr[A||A]\displaystyle\mathrm{Tr}\left[A_{||}^{\top}A_{\perp}\right] =Tr[ABBB2(𝕀dim(A)BBB2)A]=Tr[A(BBB2BBB2)A]=0.\displaystyle=\mathrm{Tr}\left[A^{\top}\frac{BB^{\top}}{||B||^{2}}\left(\mathbb{I}_{\mathrm{dim}(A)}-\frac{BB^{\top}}{||B||^{2}}\right)A\right]=\mathrm{Tr}\left[A^{\top}\left(\frac{BB^{\top}}{||B||^{2}}-\frac{BB^{\top}}{||B||^{2}}\right)A\right]=0. (57)
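As a quick numerical sanity check, the decomposition in Eqs. (54)-(56) and the vanishing inner product of Eq. (57) can be verified in a few lines. The sketch below (Python/NumPy; the dimension $n=5$ and the random draws are arbitrary illustrative choices) is not part of the proof, only an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                    # arbitrary illustrative dimension
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, 1))

Z_AB = A.T @ B / np.linalg.norm(B)**2    # Eq. (55)
A_par = B @ Z_AB.T                       # component of A along B
A_perp = A - A_par                       # Eq. (56)

# The inner product of Eq. (57) vanishes, so the two components are independent.
print(np.trace(A_par.T @ A_perp))        # numerically zero (~1e-16)
print(np.allclose(A_par + A_perp, A))    # True: the decomposition reconstructs A
```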

Thus we can write the minimized cost function as a function of these independent variables and define $\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)=S_{\mathrm{min}}(A,B)$. With feedback, these quantities become $\overline{Z}_{AB}=Z_{AB}+V$ and $\overline{A}_{\perp}=\overline{A}-B\overline{Z}_{AB}^{\top}=A-BZ_{AB}^{\top}=A_{\perp}$, and therefore $S_{\mathrm{min}}(\overline{A},B)=\tilde{S}_{\mathrm{min}}(Z_{AB}+V,A_{\perp},B)$. Consequently, when we use feedback and optimize with respect to $V$, we are equivalently computing $\min_{V}(\tilde{S}_{\mathrm{min}}(V,A_{\perp},B))$, the minimum value of $\tilde{S}_{\mathrm{min}}$ as a function of $Z_{AB}$, subject to the convergence constraint. We note that, therefore,

minV(S~min(ZAB+V,A,B))S~min(ZAB,A,B),\min_{V}(\tilde{S}_{\mathrm{min}}(Z_{AB}+V,A_{\perp},B))\leq\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B), (58)

since the cost function can be further reduced by optimizing the additional degree of freedom $A_{||}$ (equivalently, $V$). Equality occurs when $V=\mathbf{0}$ is already the optimal solution, making the feedback unnecessary. One can verify that the derivative of the cost function with respect to $V$ is nonzero except for a vanishingly small set of $(A,B)$. To show this, let us take the derivative of $S_{\mathrm{min}}$ at $V=\mathbf{0}$:

VSmin(A+BV,B)|V=𝟎=BASmin(A,B).\nabla_{V}S_{\mathrm{min}}(A+BV^{\top},B)|_{V=\mathbf{0}}=B^{\top}\nabla_{A}S_{\mathrm{min}}(A,B). (59)

The condition for this to vanish is exactly the condition appearing in Eq. (52), for which we proved that only a vanishingly small number of $(A,B)$ satisfy it. Therefore, the strict inequality holds for almost all $(A,B)$.

The average cost without feedback is given by $\langle\tilde{S}_{\mathrm{min}}\rangle_{Z_{AB},A_{\perp},B}=\mathbb{E}\left[\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\right]$, averaging over all three variables. With feedback, as stated above, optimizing over $V$ is equivalent to minimizing the cost with respect to $Z_{AB}$, so the average cost with feedback is given by $\langle\min_{V}(\tilde{S}_{\mathrm{min}}(V,A_{\perp},B))\rangle_{A_{\perp},B}$. The average over $Z_{AB}$ does nothing in this case because every initial $Z_{AB}$ is shifted to the minimizing value. Because $\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\geq\min_{V}(\tilde{S}_{\mathrm{min}}(V,A_{\perp},B))$ for each individual choice of $Z_{AB}$, $A_{\perp}$, and $B$, and $Z_{AB}$ and $A_{\perp}$ are measurable functions of $A$ and $B$ by construction, we must have $\langle\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\rangle_{Z_{AB}}\triangleq\mathbb{E}[\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\mid A_{\perp},B]\geq\min_{V}(\tilde{S}_{\mathrm{min}}(V,A_{\perp},B))$ for all $(A,B)\in\mathcal{X}$. Furthermore, equality holds only if every choice of $Z_{AB}$ yields the same value of $\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)$. By the definition of $\mathcal{X}$ and $\mathcal{Y}$, the hypothesis that $\mathbb{P}(\mathcal{X})=1$, and the fact that the set of $(A_{\perp},B)$ for which $\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)$ is completely independent of $Z_{AB}$ is vanishingly small (cf. the proof of Theorem 1), one can always choose the measure $\mathbb{P}$ under which $A$ and $B$ are sampled such that $\mathbb{P}(\mathcal{Y})>0$. Therefore, we have the strict inequality when averaging over $(A_{\perp},B)$:

S~min(ZAB,A,B)ZAB,A,B\displaystyle\langle\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\rangle_{Z_{AB},A_{\perp},B}
=𝒴S~min(ZAB,A,B)ZAB(A,B)(dA,dB)\displaystyle=\int_{\mathcal{Y}}\langle\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\rangle_{Z_{AB}}(A,B)\mathbb{P}(dA,dB)
+𝒳\𝒴S~min(ZAB,A,B)ZAB(A,B)(dA,dB)\displaystyle\quad+\int_{\mathcal{X}\backslash\mathcal{Y}}\langle\tilde{S}_{\mathrm{min}}(Z_{AB},A_{\perp},B)\rangle_{Z_{AB}}(A,B)\mathbb{P}(dA,dB)
>𝒴minVS~min(V,A,B)(dA,dB)+𝒳\𝒴minVS~min(V,A,B)(dA,dB)\displaystyle>\int_{\mathcal{Y}}\min_{V}\tilde{S}_{\mathrm{min}}(V,A_{\perp},B)\mathbb{P}(dA,dB)+\int_{\mathcal{X}\backslash\mathcal{Y}}\min_{V}\tilde{S}_{\mathrm{min}}(V,A_{\perp},B)\mathbb{P}(dA,dB)
=minVSmin(A+BV,B,{uk},{yk})A,B.\displaystyle=\langle\min_{V}S_{\mathrm{min}}(A+BV^{\top},B,\{u_{k}\},\{y_{k}\})\rangle_{A,B}. (60)

Thus an ESN with feedback will always do better than an ESN without feedback over the whole ESN class on average, given the same number of computational nodes. ∎

We note that this corollary could be proven more succinctly using Theorem 1 under the same hypothesis on the probability measure \mathbb{P} under which AA and BB are sampled by arguing that the average over ESNs will always include cases where the strict inequality (23) holds, and therefore the average also obeys a strict inequality. However, the proof given here adopted a different path from the proof of Theorem 1, providing an alternative explanation of the corollary.

4 Optimization of ESN with Feedback

One of the main advantages of using an ESN is that the training procedure is a linear regression problem that can be solved exactly without much computational effort. To use the ESN to match a target sequence $\{y_{k}\}$ for a given input sequence $\{u_{k}\}$, we first run the ESN driven by the input for a number of steps until the initial state of the network is forgotten. This ensures that the states of the network are close to the unique sequence solely determined by the input, which is guaranteed to exist by the uniform convergence property. In most of our simulations, we let the ESN run for 500 steps before beginning training, which appears to be significantly more than necessary for our examples; we were able to use as few as 19 steps of startup for some of our tests without any issue.

After this initial set of steps that ensures the system has converged to the input-dependent state sequence, we record the values of the state for the entire range of training steps, which we define to be a total of $N$ steps starting from $k=0$. We then define the network's output to be $\hat{y}_{k}=W^{\top}x_{k}+C$, where the parameters $C$ and $W$ are optimized using the cost function given in Eq. (6), and whose exact solutions are given in Eqs. (18,19). Then, any future step in $\{y_{k}\}$ is estimated using $\hat{y}_{k}=W^{\top}x_{k}+C$ for some $k\geq N$.
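To make this training procedure concrete, the following Python/NumPy sketch runs a small ESN of the form $x_{k+1}=g(Ax_{k}+Bu_{k})$ with a sigmoid nonlinearity, discards a washout period, and fits $(W,C)$ by ordinary least squares. The parameter values, the random reservoir draw, and the toy one-step-memory target are illustrative assumptions; the plain least-squares fit stands in for the exact expressions of Eqs. (18)-(19), which are not reproduced here. The helpers `run_esn` and `train_readout` are reused in the sketches below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_esn(A, B, u_seq, x0=None):
    """Drive the ESN x_{k+1} = g(A x_k + B u_k) and return the state sequence."""
    n = A.shape[0]
    x = np.zeros(n) if x0 is None else x0
    states = []
    for u in u_seq:
        x = sigmoid(A @ x + B * u)
        states.append(x)
    return np.array(states)                          # shape (len(u_seq), n)

def train_readout(X, y):
    """Least-squares fit of y_k ~ W^T x_k + C (a stand-in for Eqs. (18)-(19))."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])    # append a bias column
    sol, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    W, C = sol[:-1], sol[-1]
    return W, C

# Illustrative usage with random data (not one of the benchmark tasks).
rng = np.random.default_rng(1)
n, n_washout, n_train = 10, 500, 1000
A = rng.normal(scale=0.3, size=(n, n))               # small scale keeps the ESN convergent
B = rng.normal(size=n)
u = rng.normal(size=n_washout + n_train)
y = np.roll(u, 1)[n_washout:]                        # toy target: one-step memory of the input

X = run_esn(A, B, u)[n_washout:]                     # discard the washout states
W, C = train_readout(X, y)
y_hat = X @ W + C
print("training NMSE:", np.mean((y_hat - y)**2) / np.var(y))
```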

With feedback, we must also optimize with respect to the feedback vector $V$ to determine the modified input sequence $\{u_{k}+V^{\top}x_{k}\}$. Since the network states $\{x_{k}\}$ have a highly complex and nonlinear dependence on $V$, we cannot solve for it exactly as we do for $C$ and $W$. It also turns out that, unfortunately, the cost function is not a convex function of $V$ (see Fig. 1). Therefore, a simple gradient descent or a linear regression for optimizing (training) $V$ is not guaranteed to converge to the global minimum of the cost function, which makes optimizing $V$ nontrivial.

Refer to caption
Figure 1: 3D plot of the non-convex dependence of the NMSE on $V$ for an ESN with 2 computational nodes. Here we optimize $C$ and $W$ for the Mackey-Glass task and use 1000 training data points after 500 initial steps in our ESN. This plot shows only a portion of the full space of convergent feedback vectors to better illustrate the non-convexity.

In fact, a good candidate for $V$ (which is at least locally optimal) is obtained by choosing $\alpha=\overline{\alpha}$ such that the difference between the left- and right-hand sides of the inequality (51) is maximized:

α¯=argmaxα(α|𝒪(α2)|β).\overline{\alpha}=\operatorname*{argmax}_{\alpha}\left(\alpha-\frac{\left|\mathcal{O}(\alpha^{2})\right|}{\beta}\right). (61)

Then, a good $V$ should be given by $V=-\overline{\alpha}\nabla_{A}S_{\mathrm{min}}(A,B)^{\top}B$. Therefore, a strategy to optimize $V$ is to first perform the optimization for $W$ assuming there is no feedback, which results in $S_{\mathrm{min}}(A,B)$. Then, one calculates $\nabla_{A}S_{\mathrm{min}}(A,B)$, which requires an additional optimization of $W$ for each perturbed $A+\Delta A$, i.e., $n\times n$ separate optimizations of $W$ for different $A$'s. This yields $\nabla_{A}S_{\mathrm{min}}(A,B)$ and hence a good $V=-\overline{\alpha}\nabla_{A}S_{\mathrm{min}}(A,B)^{\top}B$. We note, however, that obtaining $\overline{\alpha}$ requires the calculation of $\nabla_{A}S_{\mathrm{min}}(A,B)$, $\nabla_{A}^{2}S_{\mathrm{min}}(A,B)$, $\nabla_{A}^{3}S_{\mathrm{min}}(A,B)$, etc., which is computationally demanding. Thus, in practice, we use a different method that is more practical.

In our numerical examples, we used a standard batch gradient descent method to optimize $V$, with an enforced condition that ensures the ESN remains convergent at every step. We cannot use stochastic gradient descent because of the causal nature of the ESN: the order and size of the training set influence the optimal value of $V$. The proofs in Section 3 guarantee that gradient descent will almost always provide an improvement to the fit. The mathematical details of the gradient descent method that we used are explained in A.

To begin our gradient descent routine, we start at $V_{0}=\mathbf{0}$. First, we run the ESN without feedback and optimize for $C$ and $W$ as usual. Then, we calculate $\nabla_{V}S_{\mathrm{min}}$, which, as we will demonstrate below, can be done using only quantities already obtained from running the ESN. We then choose $V_{1}=V_{0}-\eta\nabla_{V}S_{\mathrm{min}}$ to be our new feedback vector for the next step, where $\eta$ is a learning rate that must be chosen beforehand. We repeat this process many times, running the ESN with $\{u_{k}+V_{i}^{\top}x_{k}\}$ as the input sequence, optimizing for $C$ and $W$ under the new inputs, and then recalculating $\nabla_{V}S_{\mathrm{min}}$ to update the feedback vector for the next step using $V_{i+1}=V_{i}-\eta\nabla_{V}S_{\mathrm{min}}$. This is performed for a set number of iterations. In case a gradient descent step would render the ESN unstable, we also detail below a procedure that keeps every $V_{i}$ within a convex region for which the ESN is guaranteed to be stable.
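The loop just described can be sketched as follows, reusing `run_esn` and `train_readout` from the earlier sketch. For brevity, this version estimates $\nabla_{V}S_{\mathrm{min}}$ by finite differences rather than the analytic recursion of Eqs. (74)-(76), and it omits the stability projection described next; the learning rate, step count, and finite-difference step are placeholder assumptions.

```python
import numpy as np

def cost_with_feedback(V, A, B, u, y, n_washout):
    """NMSE of the readout after re-optimizing (C, W) for feedback vector V."""
    # Feedback u_k + V^T x_k is equivalent to using the reservoir matrix A + B V^T.
    X = run_esn(A + np.outer(B, V), B, u)[n_washout:]
    W, C = train_readout(X, y)
    y_hat = X @ W + C
    return np.mean((y_hat - y) ** 2) / np.var(y)

def optimize_feedback(A, B, u, y, n_washout, eta=1.0, n_iter=100, fd_eps=1e-5):
    """Batch gradient descent on V starting from V = 0 (finite-difference sketch)."""
    n = A.shape[0]
    V = np.zeros(n)
    for _ in range(n_iter):
        S0 = cost_with_feedback(V, A, B, u, y, n_washout)
        grad = np.zeros(n)
        for j in range(n):                        # finite-difference gradient estimate
            dV = np.zeros(n)
            dV[j] = fd_eps
            grad[j] = (cost_with_feedback(V + dV, A, B, u, y, n_washout) - S0) / fd_eps
        V = V - eta * grad                        # gradient descent update
    return V
```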

Here, we present the method to enforce the ESN’s stability while updating VV. For this, we need to make sure that the ESN remains convergent at every step of gradient descent. The constraint on VV is given by the constraint for convergence on the ESN following from Eq. (10) with A¯\overline{A} in place of AA. Formally, the constraint is A¯A¯<a2𝕀dim(A)\overline{A}^{\top}\overline{A}<a^{2}\mathbb{I}_{\mathrm{dim}(A)}, where aa is a constant value that depends on the nonlinear function g(z)g(z) of the ESN. We use the sigmoid function, so we take a=4a=4. We ensure that the gradient descent algorithm obeys this constraint by applying a correction to any gradient descent step that causes the new value of VV to violate the constraint inequality. This correction is designed to only change the component of VV perpendicular to the surface defined by A¯A¯=a2𝕀dim(A)\overline{A}^{\top}\overline{A}=a^{2}\mathbb{I}_{\mathrm{dim}(A)} as a function of VV, so that gradient descent can still freely adjust VV in any direction parallel to this surface.

For some small shift in VV given by δV\delta V, the change in A¯A¯\overline{A}^{\top}\overline{A} is given by

A¯A¯\displaystyle\overline{A}^{\top}\overline{A} =AA+V(BA)+(AB)V+B2VV\displaystyle=A^{\top}A+V(B^{\top}A)+(A^{\top}B)V^{\top}+||B||^{2}\leavevmode\nobreak\ VV^{\top} (62)
δ(A¯TA¯)\displaystyle\delta(\overline{A}^{T}\overline{A}) δV(BA)+(AB)δV+B2δVV+B2VδV\displaystyle\approx\delta V(B^{\top}A)+(A^{\top}B)\delta V^{\top}+||B||^{2}\leavevmode\nobreak\ \delta VV^{\top}+||B||^{2}\leavevmode\nobreak\ V\delta V^{\top} (63)
=δV(BA¯)+(A¯B)δV.\displaystyle=\delta V(B^{\top}\overline{A})+(\overline{A}^{\top}B)\delta V^{\top}. (64)

If a gradient descent step causes the largest singular value $\lambda_{\max}$ of $\overline{A}$ to reach or exceed $a$, the ESN would cease to be uniformly convergent. We can use the normalized eigenvector $u_{\max}$ of $\overline{A}^{\top}\overline{A}$ associated with $\lambda_{\max}^{2}$ to get

δ(λmax2)\displaystyle\delta(\lambda_{\max}^{2}) =δ(umaxA¯A¯umax)\displaystyle=\delta(u_{\max}^{\top}\overline{A}^{\top}\overline{A}u_{\max}) (65)
=umaxδ(A¯A¯)umax+2δ(umax)A¯A¯umax\displaystyle=u_{\max}^{\top}\delta(\overline{A}^{\top}\overline{A})u_{\max}+2\delta(u_{\max})^{\top}\overline{A}^{\top}\overline{A}u_{\max} (66)
=umaxδ(A¯A¯)umax+2λmax2δV(dumaxdV)umax\displaystyle=u_{\max}^{\top}\delta(\overline{A}^{\top}\overline{A})u_{\max}+2\lambda_{\max}^{2}\delta V^{\top}\left(\frac{du_{\max}}{dV}\right)^{\top}u_{\max} (67)
2(umaxδV)(BA¯umax)+0.\displaystyle\approx 2(u_{\max}^{\top}\delta V)(B^{\top}\overline{A}u_{\max})+0. (68)

We can ignore the dependence of $u_{\max}$ on $V$ in this equation because the derivative of a normalized vector is always orthogonal to the original vector, so the first-order shift in each $u_{\max}$ above is eliminated by the other $u_{\max}$. Solving for $\delta V$ gives

umaxδV\displaystyle u_{\max}^{\top}\delta V δ(λmax2)2(BA¯umax).\displaystyle\approx\frac{\delta(\lambda_{\max}^{2})}{2(B^{\top}\overline{A}u_{\max})}. (69)

This formula tells us that we can adjust the singular values of $\overline{A}$ by using $\delta V^{\prime}=\delta V-u_{\max}\frac{\Delta}{2(B^{\top}\overline{A}u_{\max})}$ for our gradient descent step instead of $\delta V$, for some small positive value $\Delta$. We take $\Delta$ to be $\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})-a^{2}+\epsilon_{a}$ for some small positive number $\epsilon_{a}$. This ensures that the new step $\delta V^{\prime}$ keeps the singular values of $\overline{A}$ strictly less than $a$ with a minimal change to the original step $\delta V$, so that the convergence of the gradient descent procedure is minimally impacted. The quantity $\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})-a^{2}$ is guaranteed to be of the same order of magnitude as the norm of $\delta V$: we assume that $V$ leads to convergent dynamics but $V+\delta V$ does not, so $\lambda_{\max}<a$ while $\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})\geq a^{2}$; since $\delta(\lambda_{\max}^{2})$ is of the same order as $||\delta V||$, the difference satisfies $0\leq\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})-a^{2}<\delta(\lambda_{\max}^{2})$ (the last inequality because $\lambda_{\max}<a$) and must therefore be of the same order as well. If multiple singular values exceed $a$ due to a single step, we apply this procedure to each singular value independently; since the eigenvectors associated with these singular values are orthogonal, each adjustment to $\delta V$ has no overlap with any of the other adjustments. We calculate $\lambda_{\max}^{2}+\delta(\lambda_{\max}^{2})$ and $u_{\max}$ directly from $(A+B(V+\delta V)^{\top})^{\top}(A+B(V+\delta V)^{\top})$, while $\epsilon_{a}$ is chosen to be $10^{-5}$. To the order of approximation used in Eq. (69), we can take $A+B(V+\delta V)^{\top}\approx\overline{A}$ and use their eigenvectors interchangeably in calculating the adjustment to $\delta V$, aside from the value of $\delta(\lambda_{\max}^{2})$.
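A minimal sketch of this correction, assuming the sigmoid bound $a=4$ and $\epsilon_{a}=10^{-5}$ used in this work, is given below; the eigenpairs of $(A+B(V+\delta V)^{\top})^{\top}(A+B(V+\delta V)^{\top})$ are computed directly, as described above.

```python
import numpy as np

def project_step(A, B, V, dV, a=4.0, eps_a=1e-5):
    """Adjust a gradient-descent step dV so that A_bar = A + B V^T stays
    uniformly convergent (largest singular value of A_bar strictly below a).

    Applies the correction dV' = dV - u_max * Delta / (2 B^T A_bar u_max) of
    Eq. (69) to every singular value that would reach or exceed a.
    """
    M_new = A + np.outer(B, V + dV)                     # proposed A + B (V + dV)^T
    lam2, U = np.linalg.eigh(M_new.T @ M_new)           # squared singular values and eigenvectors
    A_bar = A + np.outer(B, V)                          # current A_bar (approx. interchangeable)
    dV = dV.copy()
    for lam2_i, u in zip(lam2, U.T):
        if lam2_i >= a**2:                              # constraint would be violated
            Delta = lam2_i - a**2 + eps_a
            dV -= u * Delta / (2.0 * (B @ (A_bar @ u)))
    return dV
```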

5 Benchmark Test Results

We conducted numerical demonstrations of ESNs with feedback by focusing on three distinct tasks: the Mackey-Glass task, the Nonlinear Channel Equalization task, and the Coupled Electric Drives task. The first two tasks are elaborated in the supplementary material of Ref. [34] and the last in Ref. [35]. Each task represents a unique class of problems. The Mackey-Glass task exemplifies a highly nonlinear chaotic system, challenging the ESN's ability to handle complex dynamics. The Nonlinear Channel Equalization task involves the recovery of a discrete signal from a nonlinear channel, testing the ESN's proficiency in signal processing. Finally, the Coupled Electric Drives task is focused on system identification for a nonlinear stochastic system, evaluating the ESN's performance in modeling and memory retention. Together, these three tasks provide a comprehensive evaluation of ESN capabilities, covering nonlinear modeling, system memory, and advanced signal processing across various complex systems.

The Mackey-Glass task requires the ESN to approximate a chaotic dynamical system described by y(t)y(t) in the Mackey-Glass equation:

dydt(t)=βy(tτ)1+yn(tτ)γy(t),\displaystyle\frac{dy}{dt}(t)=\beta\frac{y(t-\tau)}{1+y^{n}(t-\tau)}-\gamma y(t), (70)

where we choose the standard values β=0.2,τ=17,n=10,\beta=0.2,\tau=17,n=10, and γ=0.1\gamma=0.1. We numerically approximate the solution to this equation using yk+1=y((k+1)δt)=yk+δtdykdty_{k+1}=y((k+1)\delta t)=y_{k}+\delta t\frac{dy_{k}}{dt} with δt=1.0\delta t=1.0 and y(0)=1.0y(0)=1.0. We also run the solution for 10001000 steps before using it for the task, or in other words the target sequence we use actually starts with y1000y_{1000}. The task for the ESN is to predict what the sequence will be 10 time steps into the future. In other words, using the input sequence {uk=yk10}\{u_{k}=y_{k-10}\}, we want the ESN to successfully predict {yk}\{y_{k}\}.
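For reference, the Mackey-Glass input/target pairs can be generated with the Euler scheme described above. The constant pre-history $y(t)=1.0$ for $t<0$ is an assumption on our part (the text only specifies $y(0)=1.0$); since the first 1000 steps are discarded, this choice has little effect.

```python
import numpy as np

def mackey_glass(n_steps, beta=0.2, gamma=0.1, n=10, tau=17, dt=1.0,
                 y0=1.0, discard=1000):
    """Euler scheme y_{k+1} = y_k + dt*(beta*y_{k-tau}/(1 + y_{k-tau}^n) - gamma*y_k).

    The constant pre-history y = y0 for k < tau is an assumed choice; the first
    `discard` steps are dropped as in the text, so it has little influence.
    """
    total = discard + n_steps
    y = np.empty(total + 1)
    y[0] = y0
    for k in range(total):
        y_delay = y[k - tau] if k >= tau else y0        # assumed constant history
        y[k + 1] = y[k] + dt * (beta * y_delay / (1.0 + y_delay**n) - gamma * y[k])
    return y[discard:]                                  # drop the first 1000 steps

series = mackey_glass(2100)      # enough for 500 washout + 1000 training + 500 test steps
u = series[:-10]                 # input u_k = y_{k-10}
target = series[10:]             # target: the value 10 steps ahead of the input
```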

The second task is the Nonlinear Channel Equalization task. In this task, a sequence of digits $\{d_{k}\}$, each taking one of the 4 values $d_{k}\in\{-3,-1,1,3\}$, is put through a nonlinear propagation channel. The channel output $u_{k}$ is a polynomial in an intermediate linear channel $q_{k}$, which is in turn a linear combination of 10 values of $d_{k}$. The linear channel is given by

qk=\displaystyle q_{k}= 0.08dk+20.12dk+1+dk+0.18dk10.1dk2\displaystyle\leavevmode\nobreak\ 0.08d_{k+2}-0.12d_{k+1}+d_{k}+0.18d_{k-1}-0.1d_{k-2}
+0.091dk30.05dk4+0.04dk5+0.03dk6+0.01dk7.\displaystyle+0.091d_{k-3}-0.05d_{k-4}+0.04d_{k-5}+0.03d_{k-6}+0.01d_{k-7}. (71)

The nonlinear transformation uku_{k} of this channel is given by

uk=qk+0.036qk20.011qk3+vk,\displaystyle u_{k}=q_{k}+0.036q_{k}^{2}-0.011q_{k}^{3}+v_{k}, (72)

where vkv_{k} is a Gaussian white noise term with a signal-to-noise ratio of 32dB32\,\mathrm{dB}. That is, each noise term vkv_{k} is a random number generated from a Gaussian distribution with a mean of 0 and a standard deviation given by σk=abs(uk)/39.81\sigma_{k}=\mathrm{abs}(u_{k})/39.81, so that the signal-to-noise ratio is 10log10(uk2σk2)=20log10(39.81)3210\log_{10}\left(\frac{u_{k}^{2}}{\sigma_{k}^{2}}\right)=20\log_{10}\left(39.81\right)\approx 32. The task for the ESN is to recover the original digit sequence {dk}\{d_{k}\} from the nonlinear channel {uk}\{u_{k}\}, which is used as input. In other words, using the input sequence {uk}\{u_{k}\} described by Eq. (72), we want the ESN to produce the digit sequence {yk=dk}\{y_{k}=d_{k}\} as output. Since the ESN produces a continuous output while the target sequence takes discrete values, we round the output of the ESN to the nearest value in {3,1,1,3}\{-3,-1,1,3\} for the final error analysis. However, we still use the continuous outputs during training using the standard cost function described in Eq. (6).
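A sketch of the data generation for this task is given below. One assumption is worth flagging: the noise standard deviation is computed from the noiseless channel output $q_{k}+0.036q_{k}^{2}-0.011q_{k}^{3}$, since computing $\sigma_{k}$ from the already-noisy $u_{k}$ would be circular.

```python
import numpy as np

def channel_equalization_data(n_steps, snr_db=32.0, seed=0):
    """Generate (u_k, d_k) pairs following Eqs. (71)-(72).

    Assumption: sigma_k is taken relative to the noiseless channel output.
    """
    rng = np.random.default_rng(seed)
    pad = 10                                           # room for the taps d_{k+2}..d_{k-7}
    d = rng.choice([-3, -1, 1, 3], size=n_steps + pad)
    taps = {2: 0.08, 1: -0.12, 0: 1.0, -1: 0.18, -2: -0.1,
            -3: 0.091, -4: -0.05, -5: 0.04, -6: 0.03, -7: 0.01}
    amp = 10 ** (snr_db / 20.0)                        # = 39.81 for 32 dB
    u = np.empty(n_steps)
    for k in range(n_steps):
        i = k + 7                                      # shift so that d_{k-7} is a valid index
        q = sum(c * d[i + j] for j, c in taps.items())          # Eq. (71)
        clean = q + 0.036 * q**2 - 0.011 * q**3                 # Eq. (72), noiseless part
        u[k] = clean + rng.normal(0.0, abs(clean) / amp)        # add the Gaussian noise v_k
    return u, d[7:7 + n_steps]                         # channel inputs and target digits
```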

The third test is fitting the Coupled Electric Drives data set, which is derived from a real physical process and is intended as a benchmark data set for nonlinear system identification [36, 37]. In system identification, the ESN is used to approximately model an unknown stochastic dynamical system. This is achieved by tuning the free parameters of the ESN so that it approximates the nonlinear input/output (I/O) map generated by the unknown system through I/O data generated by the latter. This I/O map sends input sequences {,uk1,uk}\{\ldots,u_{k-1},u_{k}\} to output sequences {,yk1,yk}\{\ldots,y_{k-1},y_{k}\} for all kk. To this end, the ESN is configured as a nonlinear stochastic autoregressive model following [12], which is briefly explained in B.

We fit to the output signal labeled $z2$ in [35], which uses a PRBS input $u2$ with amplitude $1.5$. We model the data using a combination of the input $u2$ and the previous state of the system such that for each time step $k$, the next time step is obtained using an input given by $s\cdot u2_{k}+(1-s)\cdot y_{k}$ for some parameter $s$. This parameter is chosen by optimizing the cost function of the training data using gradient descent, in much the same way that we optimize the feedback vector $V$. However, this procedure is used to provide information about both $u2$ and the past values of $y$ to the ESN through a single input channel, and is in no way related to our feedback procedure. In the context of the discussion in B, the function $\nu$ that is defined in that appendix is in this case given by $\nu(x_{k},u2_{k},y_{k})=s\cdot u2_{k}+(1-s)\cdot y_{k}+V^{\top}x_{k}$, where $V=0$ if no state feedback is used; otherwise, with feedback, the value of $V$ is determined through the I/O data. Through our empirical analysis, we find that the global minimum of the cost function with respect to $s$ is always located near $s=0$, such that the cost function is locally convex in an interval that always contains $s=0$. Thus, using gradient descent starting from $s=0$ to find the optimal value of $s$ will always converge to the global minimum for this parameter. In our numerical work below, we choose a learning rate of $0.0012$ without feedback and a learning rate of $0.001$ when using feedback.
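For clarity, the input actually fed to the ESN at step $k$ for this task (the function $\nu$ discussed in B) can be written as a one-line helper; the argument names are ours.

```python
import numpy as np

def mixed_input(u2_k, y_k, x_k, s, V=None):
    """Input fed to the ESN at step k for the Coupled Electric Drives task:
    nu(x_k, u2_k, y_k) = s*u2_k + (1 - s)*y_k  (+ V^T x_k when feedback is used)."""
    feedback = 0.0 if V is None else float(np.dot(V, x_k))
    return s * u2_k + (1.0 - s) * y_k + feedback
```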

Refer to caption
Refer to caption
Figure 2: Histograms for the NMSE values for the Mackey-Glass task. The left plot shows the NMSE values for ESNs with 10 computational nodes, with 48000 randomly chosen ESNs without feedback (the base ESN) and 9600 choices with feedback. For the feedback optimization, we used 100 steps of batch gradient descent with a learning rate of 25.025.0. The right plot uses 48000 randomly chosen ESNs with 100 nodes. In all cases we used 1000 training data points taken after 500 steps of startup, and we show the NMSE values for 500 test data steps after training ends.

Fig. 2 shows histograms of the NMSE values obtained during the Mackey-Glass task for many different ESNs. We see that for 10 computational nodes without feedback, the distribution of NMSE values roughly takes the shape of a skewed Gaussian, with an average of about 0.2520.252 and a standard deviation of about 0.0560.056, and a longer tail on the right side than the left. With feedback, the distribution shifts significantly toward lower NMSE values, with an average of about 0.1770.177 and a standard deviation of about 0.0690.069. This is a roughly 30%30\% reduction of the average NMSE. For 100 computational nodes without feedback, the average is about 0.1200.120 with a standard deviation of about 0.0290.029, with a slight skew toward larger NMSE values this time. Note that the 10-node histogram with feedback appears to have two primary peaks, one centered around the lower edge of the 10-node distribution without feedback and one centered much closer to the 100-node average. With a better method of optimizing VV, it may be possible to get more cases toward the left peak in this distribution and demonstrate results comparable to a 100-node calculation using only 10 nodes with feedback. Such an analysis is beyond the scope of this article, however.

Refer to caption
Refer to caption
Figure 3: Histograms for the errors in the Channel Equalization task. The left plot shows the number of errors for ESNs with 10 computational nodes, with 48000 randomly chosen ESNs without feedback (the base ESN) and 9600 choices with feedback. For the feedback optimization, we used 100 steps of batch gradient descent with a learning rate of 10.010.0. The right plot uses 48000 randomly chosen ESNs with 100 nodes. In all cases we used 1000 training data points taken after 500 steps of startup, and we show the number of errors for 500 test data steps after training ends. The errors are counted using abs(dky^k)/2\mathrm{abs}(d_{k}-\hat{y}_{k})/2, where dkd_{k} is the actual value of the signal at time step kk, while y^k\hat{y}_{k} is the prediction by the ESN.

In Fig. 3, we show histograms of the number of errors obtained in the Channel Equalization task for many different ESNs. We count the number of errors based on how far off the ESN prediction is from the actual signal, using the expression |dky^k|/2|d_{k}-\hat{y}_{k}|/2. For example, if the true signal value was dk=1d_{k}=-1 but the ESN gave us y^k=3\hat{y}_{k}=-3, this counts as 1 error, but if dk=1d_{k}=-1 and y^k=+3\hat{y}_{k}=+3 we count it as 2 errors. We see that for 10 computational nodes without feedback, the distribution of total error values also takes the shape of a skewed Gaussian, with an average of about 11.2911.29 errors and a standard deviation of about 5.205.20, and a longer tail on the right side than the left. With feedback, the distribution again shifts significantly toward a lower number of errors, with an average of about 4.894.89 errors and a standard deviation of about 2.082.08, a nearly 57%57\% reduction to the average error count. For 100 computational nodes without feedback, the average is about 3.223.22 errors with a standard deviation of about 1.111.11.

In this task, all of the distributions roughly adhere to the shape of a Gaussian with a long tail on the right. This is in contrast to the Mackey-Glass task, where each distribution had a slightly different skew. The feedback procedure has clearly improved the average performance of 10-node ESNs, but unlike the Mackey-Glass task there is no secondary peak, so it may be that in most cases the gradient descent algorithm has settled near a minimum and will not improve the results further. Still, the 10-node average with feedback is much closer to the 100-node result than to the 10-node result without feedback.

Refer to caption
Refer to caption
Figure 4: Plots of the average NMSE and error values for the Mackey-Glass and Channel Equalization tasks, respectively, as a function of computational nodes. The average is taken over 9600 randomly chosen ESNs, and the error bars represent the standard deviation of the distribution for each number of nodes. In all cases we used 1000 training data points taken after 500 steps of startup, and we show the NMSE and total error values for 500 test data steps after training ends. We also include a line showing the average NMSE and total errors for 10 nodes with feedback for reference.

Fig. 4 shows the average dependence of the error of the ESN without feedback as a function of the number of nodes, using the NMSE for the Mackey-Glass task and the total number of errors in the Channel Equalization task. We also include the average value of the 10-node results with feedback for comparison. The error bars in each plot represent the standard deviation associated with each specific number of nodes. We see that, on average, using feedback on a 10-node ESN with 100 steps of gradient descent for optimizing VV is roughly equivalent to a little more than a 2020 node calculation for the Mackey-Glass task, while it is closer to a 2525 or 3030 node calculation for Channel Equalization.

The main reason for this discrepancy is that the Mackey-Glass task is significantly more difficult for the ESN than the Channel Equalization task. In Mackey-Glass, we are asking the network to predict 10 time steps into the future, but not all ESNs have a memory capacity going back 10 steps, especially with only 10 computational nodes. In contrast, the Channel Equalization task is much easier because the ESN does not have to reproduce the exact digits $\{-3,-1,1,3\}$; it only has to achieve a difference of less than 1 to be considered correct, so there is more room for error with the continuous output of the ESN. This is evidenced by the fact that the nodal dependence in the Mackey-Glass task seems to continue decreasing almost linearly near 100 nodes, while for Channel Equalization the dependence is close to zero, as the ESN is almost perfectly reproducing the signal with 100 nodes. This also suggests that the results for the Mackey-Glass task could see further improvement with feedback using a better method for optimizing $V$, since even the addition of computational nodes converges slowly.

Refer to caption
Refer to caption
Figure 5: Plots of the average NMSE and error values for the Mackey-Glass and Channel Equalization tasks, respectively, as a function of batch gradient descent steps for optimizing $V$, with a learning rate of $25.0$ for Mackey-Glass and $10.0$ for Channel Equalization. The average is taken over 9600 randomly chosen ESNs, and the error bars represent the standard deviation of the distribution for each number of gradient descent steps. In all cases we used 1000 training data points taken after 500 steps of startup. These plots show the error measures for the training data. The asterisk in the second plot indicates that we have rescaled the average number of errors by a factor of $1/2$ to be directly comparable with the other figures in this work.

In Fig. 5, we show the average dependence of the error of 10-node ESNs with feedback as a function of the number of gradient descent steps for optimizing $V$, using the NMSE for the Mackey-Glass task and the total number of errors in the Channel Equalization task. Note that these plots show the NMSE and total errors for the training data set of 1000 steps, as opposed to all of the previous plots, which use the 500 time steps immediately after training. This is why we have rescaled the average number of errors by a factor of $1/2$ in the plot for the Channel Equalization task: we are checking the errors over 1000 steps for each ESN in these plots instead of 500 like all the others, so the total number of errors is roughly doubled, hence the rescaling. The error bars in each plot represent the standard deviation associated with each specific number of gradient descent steps.

Here, we observe that for the Mackey-Glass task the gradient descent algorithm still has not fully converged even after 100 steps, and the variance in the performance is very large. In contrast, the gradient descent algorithm for the Channel Equalization task appears to have converged on average after about 25 to 30 steps to a value of about 3, with a moderately large standard deviation. This corroborates the discussion of Fig. 4, where we see that the ESN has trouble with the Mackey-Glass task, and so the convergence is slow and highly dependent on the specific ESN. Meanwhile, the Channel Equalization task is easier, and so the convergence occurs faster and more consistently. This further motivates using a better optimization method for $V$ to get the most we can out of an ESN for tasks like Mackey-Glass. For easier tasks like Channel Equalization we have likely already done the best we can with batch gradient descent, which indicates that feedback can be more useful than adding an equal number of parameters as new computational nodes, at least for small ESNs.

Refer to caption
Refer to caption
Figure 6: Plots of the performance of a specific ESN with and without feedback for the Coupled Electric Drives task. In both cases we used an ESN with 2 computational nodes, with 280 training data points taken after 19 steps of startup, and we show the results for 200 test data steps after training ends. For the feedback, we used 100 steps of batch gradient descent with a learning rate of 27.027.0. The left plot shows the ESN outputs with and without feedback (the base ESN) for the target data set along with the target data. In the right plot, we show the average correlation between time steps of the residuals ek=yky^ke_{k}=y_{k}-\hat{y}_{k}, where the time difference denotes the difference between the time steps of the residuals involved in the average. The 95%95\% confidence interval denotes that there is a 95%95\% chance that an i.i.d. normal distribution would produce a correlation within this interval, indicating that data found largely within the interval are consistent with a normal distribution.

In Fig. 6, we show plots of one ESN's fit to the Coupled Electric Drives data given in [35]. We specifically use $z2$, the PRBS signal with an amplitude of $1.5$. Note that since this data set contains only 500 data points, we use significantly less training data and test data than before, with only 280 and 200 steps, respectively. We also use only 2 computational nodes here, as opposed to 10 or more in the previous tasks. This is in accordance with [12], where it was shown that ESNs with 2 computational nodes perform well on the $z3$ Coupled Electric Drives data set. The first plot shows the fit to test data for a specific choice of ESN with and without feedback. We see that in this specific instance the original ESN has some trouble fitting the data, but with feedback the fit becomes much better, nearly matching the target data. The NMSE value without feedback was about $0.43$, but with feedback it is reduced by a full order of magnitude to about $0.032$.

The right plot shows the average correlations between the residuals ek=yky^ke_{k}=y_{k}-\hat{y}_{k} of the test data for a fixed time difference. This is calculated using the formula

Rk\displaystyle R_{k} =1N1j=0Nk1(eje¯)(ej+ke¯)σe2,\displaystyle=\frac{1}{N-1}\sum_{j=0}^{N-k-1}\frac{(e_{j}-\overline{e})(e_{j+k}-\overline{e})}{\sigma_{e}^{2}}, (73)

where $\overline{e}$ and $\sigma_{e}^{2}$ are the sample mean and variance of the residuals. This measures the correlation between the residuals at different time steps. If the residuals correspond to white noise, then $95\%$ of the correlations would fall within the confidence interval shown in the plot. We see that in the original ESN, there is a strong correlation between residuals that are one and two time steps apart, along with anti-correlations when they are 5 to 10 steps apart as well as 25 to 30 steps apart. With feedback, the one-step correlation is significantly reduced, and all but one of the other correlations are found within the confidence interval. This suggests that the ESN without feedback was not completely capturing part of the correlations in the test data, but with feedback the accuracy is improved to the point that most of the statistically significant correlations have been eliminated. We also checked that the residuals are consistent with Gaussian noise using the Lilliefors test [38, 39] and by checking a Q-Q plot [40] against the CDF of a normal distribution. We found that both with and without feedback, the residuals pass the $n=50$ Lilliefors test and follow a roughly linear trend on the Q-Q plot. Finally, we checked the correlation between the residuals and the input $u2$ to see whether the residual noise is uncorrelated with the input and, therefore, unrelated to the system dynamics. We found that for this ESN, the residuals without feedback show a statistically significant anti-correlation of about $-0.22$, below the $95\%$ confidence threshold of $-0.14$. With feedback, the correlation becomes about $-0.11$, a significant reduction in magnitude that now puts it within the confidence interval.
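The residual autocorrelation of Eq. (73) is straightforward to compute; a short sketch follows. The $\pm 1.96/\sqrt{N}$ band in the closing comment is the standard white-noise confidence band and is our assumption about how the interval in Fig. 6 is constructed.

```python
import numpy as np

def residual_autocorrelation(e, max_lag):
    """Sample autocorrelation R_k of the residuals e_k, following Eq. (73)."""
    e = np.asarray(e, dtype=float)
    N = len(e)
    e_bar = e.mean()
    var = e.var()                       # sample variance (1/N convention assumed)
    return np.array([np.sum((e[:N - k] - e_bar) * (e[k:] - e_bar)) / ((N - 1) * var)
                     for k in range(1, max_lag + 1)])

# For residuals consistent with i.i.d. noise, roughly 95% of the R_k should lie
# within +/- 1.96/sqrt(N) (our assumption for the confidence band in Fig. 6).
```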

The average behavior of feedback for this task also shows good improvement. Taking an average over 9600 different ESNs (including the one in Fig. 6), we find that the average NMSE without feedback using 2 computational nodes is about 0.1780.178 with a standard deviation of about 0.07790.0779, but with feedback this average goes down to 0.07230.0723 with a standard deviation of about 0.04480.0448. This is a roughly 59%59\% reduction of the average NMSE. We also find that the average correlation with the input u2u2 is about 0.183-0.183 without feedback, below the 95%95\% confidence threshold of 0.139-0.139, but with feedback it is on average above the threshold with a value of about 0.132-0.132. Additionally, the average one time step difference correlation between residuals is about 0.690.69 without feedback, but with feedback it reduces to about 0.480.48. However, this does not reduce it into the 95%95\% confidence interval for Gaussian white noise, suggesting that there is still room for improvement with feedback. We note that batch gradient descent may not be the most effective choice for optimizing VV for this task, especially considering the use of a large learning rate of 27.027.0 to get these results. If we were able to reach the true global minimum of VV, we may be able to much more consistently reach a scenario like the one shown in Fig. 6.

6 Discussions and Conclusion

In this work, we have introduced a new method for improving the performance of an ESN using a feedback scheme. This scheme uses a combination of the existing input channel of the ESN and the previously measured values of the network state as a new input, so that no direct modification of the ESN is necessary. We proved rigorously that using feedback is almost always guaranteed to provide an improvement to the performance, and that the number of cases in which such an improvement is not possible using batch gradient descent is vanishingly small compared to all possible ESNs. In addition, we proved rigorously that such a feedback scheme provides a superior performance over the whole class of ESN on average. We laid out the procedure for optimizing the ESN as a function of the fitting parameters C,W,C,W, and VV, exactly solving for CC and WW while using batch gradient descent on VV for a fixed number of steps. We then demonstrated the performance improvements for the Mackey-Glass, Channel Equalization, and Coupled Electric Drives tasks, and commented on how the relative difficulty of the tasks affected the results. The ESNs with feedback exhibited outstanding performance improvement in the Channel Equalization and Coupled Electric Drives tasks, and we observed a roughly 57%57\% and 59%59\% improvement in their respective error measures. For the more difficult Mackey-Glass task, we still saw a roughly 30%30\% improvement in the averaged NMSE, showing that feedback produces a significant boost in performance for a variety of tasks. These ESNs with feedback were shown to perform just as well on average, if not better than, the ESNs that have double the number of computational nodes without feedback.

Although there will be an additional hardware modification required to implement feedback, such a modification will only be external to the reservoir computer and, therefore, the burden will be minimal. This feedback scheme is designed to avoid any direct modification of the ESN’s main body (i.e., the computing bulk) since we only need to take the readout of the network and send some component of that readout back into the network with the usual input. Thus, we will only need an apparatus that connects to the readout and the input of the ESN, but does not require modifying the internal reservoir. Given that our results suggest that ESNs with feedback will perform just as well as, if not better than, ESNs of double the number of nodes without feedback, the cost-benefit analysis of adding feedback hardware is very likely to be more favorable than increasing the size of the ESN to achieve similar performance.

Because of the highly complex and nonlinear dependence of the network states {xk}\{x_{k}\} on VV, we used batch gradient descent for the optimization of VV, but better methods may very well exist. The question of how to best optimize VV is indeed closely related to how to choose the best (A,B)(A,B) for a given training data set {uk}\{u_{k}\} and {yk}\{y_{k}\}. This is an open question for which any progress would be monumental in the general theoretical development of reservoir computing and neural networks. Even providing just a measure of the computing power of a given (A,B)(A,B) or (A,B,{uk})(A,B,\{u_{k}\}) outside of the cost function itself could provide some classification of tasks that will save us significant computational time and resources in the future.

Appendix A Batch Gradient Descent Method for Optimizing VV

At the beginning of each step, we train the ESN using {uk+ViTxk}\{u_{k}+V_{i}^{T}x_{k}\} as the input sequence for the current feedback vector ViV_{i}, with V0=𝟎V_{0}=\mathbf{0} as the initial step. To get the change in VV, we use the gradient of SminS_{\mathrm{min}} given by

dSdVj\displaystyle\frac{dS}{dV_{j}} =1Nk=0N1(WdxkdVj)(Wxkyk)\displaystyle=\frac{1}{N}\sum_{k=0}^{N-1}\left(W^{\top}\frac{dx_{k}}{dV_{j}}\right)(W^{\top}x_{k}-y_{k}) (74)
dxkdVj\displaystyle\frac{dx_{k}}{dV_{j}} =Σk(Bxk1,j+A¯dxk1dVj)\displaystyle=\Sigma_{k}\left(B\leavevmode\nobreak\ x_{k-1,j}+\overline{A}\frac{dx_{k-1}}{dV_{j}}\right) (75)
Σk,ij\displaystyle\Sigma_{k,ij} =δijσ(zk1,i)=δijxk,i(1xk,i),\displaystyle=\delta_{ij}\sigma^{\prime}(z_{k-1,i})=\delta_{ij}x_{k,i}(1-x_{k,i}), (76)

where A¯=A+BV\overline{A}=A+BV^{\top}. The derivatives dxkdVj\frac{dx_{k}}{dV_{j}} can be calculated by iteration starting from the initial condition dx0dVj=𝟎\frac{dx_{0}}{dV_{j}}=\mathbf{0}. Then we simply shift VVηVSminV\rightarrow V-\eta\nabla_{V}S_{\mathrm{min}} for some learning rate η\eta at each step of the descent, using the optimal solutions for CC and WW for the current value of VV.
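For completeness, a direct implementation of Eqs. (74)-(76) is sketched below. The indexing convention pairing targets with the post-washout states, and the inclusion of the readout bias $C$ in the error term, are our assumptions.

```python
import numpy as np

def grad_V_S(A, B, V, u, y, W, C, n_washout):
    """Analytic gradient of S_min with respect to V via Eqs. (74)-(76).

    Assumes the sigmoid ESN x_{k+1} = g(A_bar x_k + B u_k) with A_bar = A + B V^T,
    that (W, C) have already been optimized for this V, and that y[0] is the
    target paired with the first post-washout state (assumed indexing).
    """
    n = A.shape[0]
    A_bar = A + np.outer(B, V)
    x = np.zeros(n)                       # current state x_k
    dx = np.zeros((n, n))                 # dx[:, j] = d x_k / d V_j, initialized to 0
    grad = np.zeros(n)
    N = len(u) - n_washout                # number of training steps
    for k, u_k in enumerate(u):
        z = A_bar @ x + B * u_k
        x_next = 1.0 / (1.0 + np.exp(-z))
        Sigma = np.diag(x_next * (1.0 - x_next))        # Eq. (76)
        dx = Sigma @ (np.outer(B, x) + A_bar @ dx)      # Eq. (75), iterated forward
        x = x_next
        if k >= n_washout:                              # accumulate Eq. (74)
            err = W @ x + C - y[k - n_washout]
            grad += (W @ dx) * err
    return grad / N
```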

Appendix B Nonlinear Stochastic Autoregressive Model

The basic model is

\displaystyle x_{k+1}=g(Ax_{k}+B\nu(x_{k},u_{k},y_{k})),
\displaystyle\hat{y}_{k}=W^{\top}x_{k}+C,

where ν\nu is some function of xkx_{k}, uku_{k} and yky_{k} taking values in n\mathbb{R}^{n} and {(uk,yk)}\{(u_{k},y_{k})\} are I/O pairs generated at time kk by the stochastic dynamical system of interest. The quantity y^k\hat{y}_{k} is the output of the ESN and functions as an approximation of yky_{k}. Assuming convergence, after washout y^k\hat{y}_{k} can be expressed as [12]:

y^k=F(yk1,yk2,,uk1,uk2,),\hat{y}_{k}=F(y_{k-1},y_{k-2},\ldots,u_{k-1},u_{k-2},\ldots),

for some nonlinear functional FF of past input and output values, uk1,uk2,u_{k-1},u_{k-2},\ldots and yk1,yk2,y_{k-1},y_{k-2},\ldots, respectively. Letting eke_{k} be the approximation error of yky_{k} by y^k\hat{y}_{k}, ek=ykyk^e_{k}=y_{k}-\hat{y_{k}}, the ESN then generates a nonlinear infinite-order autoregressive model with exogenous input:

yk=F(yk1,yk2,,uk1,uk2,)+ek.y_{k}=F(y_{k-1},y_{k-2},\ldots,u_{k-1},u_{k-2},\ldots)+e_{k}.

To complete the model, it is stipulated that $\{e_{k}\}$ is a Gaussian white noise sequence that is uncorrelated with the input sequence $\{u_{k}\}$. By making the substitution $y_{k}=\hat{y}_{k}+e_{k}=W^{\top}x_{k}+C+e_{k}$, the system identification procedure identifies an approximate state-space model of the unknown stochastic dynamical system given by:

xk+1\displaystyle x_{k+1} =f(xk,uk,ek)\displaystyle=f(x_{k},u_{k},e_{k})
yk\displaystyle y_{k} =Wxk+ek,\displaystyle=W^{\top}x_{k}+e_{k},

where

f(xk,uk,ek)=g(Axk+Bν(xk,uk,Wxk+ek)),f(x_{k},u_{k},e_{k})=g(Ax_{k}+B\nu(x_{k},u_{k},W^{\top}x_{k}+e_{k})),

where $e_{k}$ and $u_{k}$ are free inputs to the model. The stochasticity in the model comes from the free noise sequence $\{e_{k}\}$ and possibly also the input $\{u_{k}\}$ (typically the case in system identification). This model can then be used, for instance, to design stochastic control laws $u_{k}$ for the unknown stochastic dynamical system or to simulate it on a digital computer. The hypothesis of the model, that $\{e_{k}\}$ is a Gaussian white noise sequence uncorrelated with the input sequence $\{u_{k}\}$, is tested after the model is fitted using a separate validation data set (different from the fitting data set) by residual analysis; see, e.g., [36, 37].

References

  • [1] M. Lukoševičius, H. Jaeger, Reservoir computing approaches to recurrent neural network training, Computer science review 3 (3) (2009) 127–149.
  • [2] B. Schrauwen, D. Verstraeten, J. Van Campenhout, An overview of reservoir computing: theory, applications and implementations, in: Proceedings of the 15th European Symposium on Artificial Neural Networks, 2007, pp. 471–482.
  • [3] G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, A. Hirose, Recent advances in physical reservoir computing: A review, Neural Networks 115 (2019) 100–123.
  • [4] H. Jaeger, H. Haas, Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communications, Science 304 (5667) (2004) 78–80.
  • [5] J. Pathak, et al., Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach, Physical review letters 120 (2) (2018) 024102.
  • [6] M. Rafayelyan, et al., Large-scale optical reservoir computing for spatiotemporal chaotic systems prediction, Phys. Rev. X 10 (2020) 041037.
  • [7] P. Antonik, M. Gulina, J. Pauwels, S. Massar, Using a reservoir computer to learn chaotic attractors, with applications to chaos synchronization and cryptography, Physical Review E 98 (1) (2018) 012215.
  • [8] Z. Lu, et al., Reservoir observers: Model-free inference of unmeasured variables in chaotic systems, Chaos 27 (2017) 041102.
  • [9] L. Grigoryeva, J. Henriques, J.-P. Ortega, Reservoir computing: information processing of stationary signals, in: Joint 2016 CSE, EUC and DCABES, IEEE, 2016, pp. 496–503.
  • [10] L. Grigoryeva, J.-P. Ortega, Echo state networks are universal, Neural Networks 108 (2018) 495–508. doi:10.1016/j.neunet.2018.08.025.
  • [11] L. Gonon, J.-P. Ortega, Reservoir computing universality with stochastic inputs, IEEE transactions on neural networks and learning systems 31 (1) (2019) 100–112.
  • [12] J. Chen, H. I. Nurdin, Nonlinear autoregression with convergent dynamics on novel computational platforms, IEEE Transactions on Control Systems Technology 30 (5) (2022) 2228–2234. doi:10.1109/TCST.2021.3136227.
  • [13] L. Larger, et al., High-speed photonic reservoir computing using a time-delay-based architecture: Million words per second classification, Physical Review X 7 (1) (2017) 011015.
  • [14] J. Torrejon, et al., Neuromorphic computing with nanoscale spintronic oscillators, Nature 547 (7664) (2017) 428–431.
  • [15] J. Chen, H. I. Nurdin, N. Yamamoto, Temporal information processing on noisy quantum computers, Phys. Rev. Applied 14 (2020) 024065. doi:10.1103/PhysRevApplied.14.024065.
  • [16] Y. Suzuki, et al., Natural quantum reservoir computing for temporal information processing, Sci. Reports 12 (1) (2022) 1353.
  • [17] T. Yasuda, et al., Quantum reservoir computing with repeated measurements on superconducting devices, arXiv preprint arXiv:2310.06706 (October 2023).
  • [18] K. Nakajima, I. Fischer, Reservoir Computing: Theory, Physical Implementations, and Applications, Springer Singapore, 2021.
  • [19] P. Mujal, et al., Opportunities in quantum reservoir computing and extreme learning machines, Adv. Quantum Technol. 4 (2021) 2100027.
  • [20] D. Marković, J. Grollier, Quantum neuromorphic computing, Applied Physics Letters 117 (15) (2020) 150501.
  • [21] H. Jaeger, The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report 148 (34) (2001) 13.
  • [22] H. Jaeger, Echo state network, scholarpedia 2 (9) (2007) 2330.
  • [23] M. Lukoševičius, A practical guide to applying echo state networks, in: Neural Networks: Tricks of the Trade: Second Edition, Springer, 2012, pp. 659–686.
  • [24] D. Sussillo, L. F. Abbott, Generating coherent patterns of activity from chaotic neural networks, Neuron 63 (4) (2009) 544–557.
  • [25] M. Freiberger, P. Bienstman, J. Dambre, A training algorithm for networks of high-variability reservoirs, Scientific Reports 10 (2020) 14451.
  • [26] W. Maass, P. Joshi, E. D. Sontag, Computational aspects of feedback in neural circuits, PLoS Comp. Bio. 3 (2007) article no. e165.
  • [27] G. Manjunath, H. Jaeger, Echo state property linked to an input: Exploring a fundamental characteristic of recurrent neural networks, Neural Computation 25 (3) (2013) 671–696. doi:10.1162/NECO_a_00411.
  • [28] S. Boyd, L. Chua, Fading memory and the problem of approximating nonlinear operators with volterra series, IEEE Transactions on Circuits and Systems 32 (11) (1985) 1150–1161. doi:10.1109/TCS.1985.1085649.
  • [29] D. N. Tran, B. S. Rüffer, C. M. Kellett, Convergence properties for discrete-time nonlinear systems, IEEE Transactions on Automatic Control 64 (8) (2019) 3415–3422. doi:10.1109/TAC.2018.2879951.
  • [30] K. Fujii, K. Nakajima, Harnessing disordered-ensemble quantum dynamics for machine learning, Phys. Rev. Appl. 8 (2017) 024030. doi:10.1103/PhysRevApplied.8.024030.
  • [31] T. Kubota, H. Takahashi, K. Nakajima, Unifying framework for information processing in stochastically driven dynamical systems, Phys. Rev. Res. 3 (2021) 043135. doi:10.1103/PhysRevResearch.3.043135.
  • [32] K. Nakajima, K. Fujii, M. Negoro, K. Mitarai, M. Kitagawa, Boosting computational power through spatial multiplexing in quantum reservoir computing, Phys. Rev. Appl. 11 (2019) 034021. doi:10.1103/PhysRevApplied.11.034021.
  • [33] W. Rudin, Principles of Mathematical Analysis, New York: McGraw-Hill, 1976.
  • [34] T. Hülser, F. Köster, K. Lüdge, L. Jaurigue, Deriving task specific performance from the information processing capacity of a reservoir computer, Nanophotonics 12 (5) (2023) 937–947. doi:10.1515/nanoph-2022-0415.
  • [35] T. Wigren, M. Schoukens, Coupled electric drives data set and reference models, Tech. rep., Department of Information Technology, Uppsala University, Uppsala, Sweden (2017).
  • [36] L. Ljung, System Identification: Theory for the User, 2nd Edition, Prentice-Hall, 1999.
  • [37] S. A. Billings, Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains, Wiley, 2013.
  • [38] H. W. Lilliefors, On the Kolmogorov-Smirnov test for normality with mean and variance unknown, Journal of the American Statistical Association 62 (318) (1967) 399–402.
  • [39] H. Abdi, P. Molin, Lilliefors/Van Soest’s test of normality, Encyclopedia Meas. Stat. (01 2007).
  • [40] M. B. Wilk, R. Gnanadesikan, Probability plotting methods for the analysis of data, Biometrika 55 (1) (1968) 1–17. doi:10.1093/biomet/55.1.1.