
Rethinking the Relationship between Recurrent and Non-Recurrent Neural Networks: A Study in Sparsity

Quincy Hershey ([email protected])
Data Science, Worcester Polytechnic Institute, Worcester, MA 01609, USA

Randy Paffenroth ([email protected])
Mathematical Sciences, Computer Science, and Data Science, Worcester Polytechnic Institute, Worcester, MA 01609, USA

Harsh Pathak ([email protected])
Data Science, Worcester Polytechnic Institute, Worcester, MA 01609, USA

Simon Tavener ([email protected])
Mathematics, Colorado State University, Fort Collins, CO 80523
Abstract

Neural networks (NN) can be divided into two broad categories, recurrent and non-recurrent. Both types of neural networks are popular and extensively studied, but they are often treated as distinct families of machine learning algorithms. In this position paper, we argue that there is a closer relationship between these two types of neural networks than is normally appreciated. We show that many common neural network models, such as Recurrent Neural Networks (RNN), Multi-Layer Perceptrons (MLP), and even deep multi-layer transformers, can all be represented as iterative maps.

The close relationship between RNNs and other types of NNs should not be surprising. In particular, RNNs are known to be Turing complete, and therefore capable of representing any computable function (such as any other type of NN), but herein we argue that the relationship runs deeper and is more practical than this. For example, RNNs are often thought to be more difficult to train than other types of NNs, with RNNs being plagued by issues such as vanishing or exploding gradients. However, as we demonstrate in this paper, MLPs, RNNs, and many other NNs lie on a continuum, and this perspective leads to several insights that illuminate both theoretical and practical aspects of NNs.

Keywords: recurrent neural networks, sparsity, iterative maps, deep learning

1 Introduction

This paper recasts the concept of Neural Networks (NN) as iterative maps in order to study the underlying principles and assumptions embodied in traditional NN architectures. Traditional feed-forward multi-layer perceptron (MLP) LeCun et al. (2015) architectures are typically presented as distinct from recurrent neural network (RNN) Rumelhart and McClelland (1987) architectures, which in turn have a reputation of being difficult to train Pascanu et al. (2013); Sherstinsky (2020); Schäfer and Zimmermann (2006). Contrary to this perspective, our research shows that every MLP can be recast as an RNN and much can be learned about RNN architectures by reference to MLP models. This result is consequential since the universality of RNNs has been previously demonstrated Schäfer and Zimmermann (2006) and the Turing Completeness of RNN architectures has been understood since the 1990s Siegelmann and Sontag (1991). Beyond such theoretical considerations, we show how many common NNs can be represented as iterative maps which we define in Definition 3. Taking an iterative map perspective for MLPs and many other NN architectures leads to several insights that illuminate both theoretical and practical aspects of NNs.

The iterative map perspective proposed here both allows us to prove the equivalence described above and inspires us to study NNs from a discrete dynamical systems perspective. As we will demonstrate in what follows, such a perspective can elucidate many aspects of NN architectures. In particular, such ideas are not only of theoretical interest, but they also lead to improvements in NN implementation and training.

In this paper we develop notation that provides a context for describing many NN architectures as iterative maps. This notation is introduced in §2 and iterated maps, in our context, are defined in Algorithm 1. Of course, many RNNs are inherently iterative maps, and we remind the reader of the iterative structure of RNNs in §3. It is less obvious that MLPs can also be described as iterative maps, and we derive such equations in §4. The results in §3 and §4 allow both RNNs and MLPs to be concisely formulated as dynamical systems in §5. We explore extensions of these simple dynamical systems in §6. Beyond a theoretical nicety, we observe that MLPs and many other NNs can efficiently be implemented as iterative maps, and we provide a description of an implementation we call Sequential2D in §7. In §8, we demonstrate that the iterative map perspective for MLPs can lead to surprising numerical results. In particular, we show that random NN architectures are equivalent, in the appropriate sense, to more standard layer-wise approaches. Further exploration of new classes of methods that are natural extensions to existing methods will be the subject of future publications.

Deep networks have also been considered as discretizations of continuous dynamical systems and training viewed as an optimal control problem. This perspective has created an enormous literature and an exhaustive description is challenging. As an introduction to this area, we point to the early work of Chen et al. (2018); Weinan (2017); Chang et al. (2017); Han et al. (2019); Gunther et al. (2020). An early review appears in Liu and Theodorou (2019). Also, dynamical systems already play several important roles in state-of-the-art NNs as can be seen in reviews such as Liu and Theodorou (2019). However, several specific parts of the current literature bear special attention. For example, Koopman operators have an important role to play in analyzing NNs Yeung et al. (2019). Also, several recent state-of-the-art models are seemingly inspired by ideas similar to those that inspire our work, such as diffusion models Croitoru et al. (2023); Yang et al. (2023) and structured state space models Gu et al. (2021) such as MAMBA Gu and Dao (2023). Finally, iterated auto-encoders and their interpretation in terms of associative memory Radhakrishnan et al. (2020) is quite close in spirit to the ideas that we present here. It is our hope that our contributions in this paper serve to further underscore the important connections between NNs and dynamical systems.

2 Notation

As function composition is a fundamental building block of NNs and iterative maps in general, it is important to establish a notation for composing functions. We begin with the following notational conventions for scalars, vectors and matrices as shown in Table 1.

Notation        Description                              Example
$x$             Scalar                                   $3.14$
$\mathbf{x}$    Column vector                            $\begin{bmatrix}1\\ 2\\ 3\end{bmatrix}$
$|\mathbf{x}|$  The number of entries in $\mathbf{x}$    $\left|\begin{bmatrix}1\\ 2\\ 3\end{bmatrix}\right| = 3$
$X$             Matrix                                   $\begin{bmatrix}1&2\\ 3&4\end{bmatrix}$

Table 1: Notation for scalars, vectors, and matrices.

We will use the function composition operator $\circ$ to write

f(g(\mathbf{x})) = f \circ g \circ \mathbf{x}    (1)

for vector valued functions $f$ and $g$. To be clear, when we write $f\circ g\circ\mathbf{x}$ we intend that the argument $\mathbf{x}$ will only appear as the right-most entry and $f$ and $g$ are functions. We then slightly abuse notation by defining a $2\times 2$ block non-linear function as

\begin{bmatrix}f_{0,0}&f_{0,1}\\ f_{1,0}&f_{1,1}\end{bmatrix}\circ\begin{bmatrix}\mathbf{x}\\ \mathbf{h}\end{bmatrix}=\begin{bmatrix}f_{0,0}(\mathbf{x})+f_{0,1}(\mathbf{h})\\ f_{1,0}(\mathbf{x})+f_{1,1}(\mathbf{h})\end{bmatrix}.    (2)

It’s important to emphasize that the block non-linear function defined in equation (2) is not a matrix; instead, it’s a separable function. This function operates on a vector comprising two input vectors and produces a vector consisting of two output vectors. Also, it’s worth noting that the choice of combining the outputs of $f_{i,0}$ and $f_{i,1}$ isn’t restricted to addition alone. However, for the sake of simplicity and to maintain resemblance to standard block matrix equations, addition is employed in this paper.

Moreover, if the functions $f_{i,j}$ are linear, the block non-linear function in equation (2) simplifies to a block matrix equation. In essence, for linear maps, the function composition operator and the dot product are equivalent.
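To make the notation concrete, the following is a minimal sketch (ours, in NumPy) of the $2\times 2$ block non-linear function in (2); the particular sub-functions, shapes, and the helper name block_apply_2x2 are illustrative assumptions rather than anything defined in this paper.

import numpy as np

def block_apply_2x2(F, x, h):
    """Apply a 2x2 block non-linear function [[f00, f01], [f10, f11]]
    to the stacked input (x, h), combining row entries by addition as in (2)."""
    (f00, f01), (f10, f11) = F
    return f00(x) + f01(h), f10(x) + f11(h)

# Example sub-functions (chosen only for illustration).
x = np.array([1.0, 2.0])            # |x| = 2
h = np.array([0.5, -0.5, 1.0])      # |h| = 3
A = np.ones((2, 3))                 # linear map R^3 -> R^2
B = np.ones((3, 2))                 # linear map R^2 -> R^3

F = ((lambda x: x,     lambda h: A @ h),       # row 0: outputs live in R^|x|
     (lambda x: B @ x, lambda h: np.tanh(h)))  # row 1: outputs live in R^|h|

out_x, out_h = block_apply_2x2(F, x, h)
print(out_x.shape, out_h.shape)     # (2,) (3,)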

Definition 1

Given functions $f_{i,j}$, where each $f_{i,j}$ may be linear or non-linear, we define a block non-linear function as

F =
\begin{bmatrix}
f_{0,0}:\mathbb{R}^{|\mathbf{x}|}\to\mathbb{R}^{|\mathbf{x}|} & \ldots & f_{0,n-1}:\mathbb{R}^{|\mathbf{h}_k|}\to\mathbb{R}^{|\mathbf{x}|} & f_{0,n}:\mathbb{R}^{|\mathbf{y}|}\to\mathbb{R}^{|\mathbf{x}|}\\
f_{1,0}:\mathbb{R}^{|\mathbf{x}|}\to\mathbb{R}^{|\mathbf{h}_1|} & \ldots & f_{1,n-1}:\mathbb{R}^{|\mathbf{h}_k|}\to\mathbb{R}^{|\mathbf{h}_1|} & f_{1,n}:\mathbb{R}^{|\mathbf{y}|}\to\mathbb{R}^{|\mathbf{h}_1|}\\
\vdots & \ddots & \vdots & \vdots\\
f_{m-1,0}:\mathbb{R}^{|\mathbf{x}|}\to\mathbb{R}^{|\mathbf{h}_k|} & \ldots & f_{m-1,n-1}:\mathbb{R}^{|\mathbf{h}_k|}\to\mathbb{R}^{|\mathbf{h}_k|} & f_{m-1,n}:\mathbb{R}^{|\mathbf{y}|}\to\mathbb{R}^{|\mathbf{h}_k|}\\
f_{m,0}:\mathbb{R}^{|\mathbf{x}|}\to\mathbb{R}^{|\mathbf{y}|} & \ldots & f_{m,n-1}:\mathbb{R}^{|\mathbf{h}_k|}\to\mathbb{R}^{|\mathbf{y}|} & f_{m,n}:\mathbb{R}^{|\mathbf{y}|}\to\mathbb{R}^{|\mathbf{y}|}
\end{bmatrix}.    (3)

The action of $F$ on a vector is then defined as

F\circ\begin{bmatrix}\mathbf{x}\in\mathbb{R}^{|\mathbf{x}|}\\ \mathbf{h}_1\in\mathbb{R}^{|\mathbf{h}_1|}\\ \vdots\\ \mathbf{h}_k\in\mathbb{R}^{|\mathbf{h}_k|}\\ \mathbf{y}\in\mathbb{R}^{|\mathbf{y}|}\end{bmatrix}
=\begin{bmatrix}f_{0,0}(\mathbf{x})+f_{0,1}(\mathbf{h}_1)+\ldots+f_{0,n-1}(\mathbf{h}_k)+f_{0,n}(\mathbf{y})\\ f_{1,0}(\mathbf{x})+f_{1,1}(\mathbf{h}_1)+\ldots+f_{1,n-1}(\mathbf{h}_k)+f_{1,n}(\mathbf{y})\\ \vdots\\ f_{m-1,0}(\mathbf{x})+f_{m-1,1}(\mathbf{h}_1)+\ldots+f_{m-1,n-1}(\mathbf{h}_k)+f_{m-1,n}(\mathbf{y})\\ f_{m,0}(\mathbf{x})+f_{m,1}(\mathbf{h}_1)+\ldots+f_{m,n-1}(\mathbf{h}_k)+f_{m,n}(\mathbf{y})\end{bmatrix}    (4)
\in\mathbb{R}^{|\mathbf{x}|+|\mathbf{h}_1|+\ldots+|\mathbf{h}_k|+|\mathbf{y}|},

thus in many ways, (3) is a straightforward generalization of a block matrix.

When working with NNs, training algorithms are often considered in terms of mini-batches, where a set of $n$ training examples are processed together.[1]

[1] In NN literature, parameter matrices are often applied to a mini-batch using a left dot-product, such as $X\cdot W$, where $X$ is a mini-batch of training examples and $W$ is a parameter matrix. In this paper, we use the right dot-product, such as $W\cdot X$, which is more common in the linear algebra literature and matches the function composition notation $f\circ g(x)$.

Definition 2

Given matrices $X\in\mathbb{R}^{|\mathbf{x}|\times n}, H_1\in\mathbb{R}^{|\mathbf{h}_1|\times n},\cdots,H_k\in\mathbb{R}^{|\mathbf{h}_k|\times n}, Y\in\mathbb{R}^{|\mathbf{y}|\times n}$ then the block non-linear function $F$ acts on the mini-batch as

F\circ\begin{bmatrix}X\in\mathbb{R}^{|\mathbf{x}|\times n}\\ H_1\in\mathbb{R}^{|\mathbf{h}_1|\times n}\\ \vdots\\ H_k\in\mathbb{R}^{|\mathbf{h}_k|\times n}\\ Y\in\mathbb{R}^{|\mathbf{y}|\times n}\end{bmatrix}=\begin{bmatrix}f_{0,0}(X)+f_{0,1}(H_1)+\ldots+f_{0,n-1}(H_k)+f_{0,n}(Y)\\ f_{1,0}(X)+f_{1,1}(H_1)+\ldots+f_{1,n-1}(H_k)+f_{1,n}(Y)\\ \vdots\\ f_{m-1,0}(X)+f_{m-1,1}(H_1)+\ldots+f_{m-1,n-1}(H_k)+f_{m-1,n}(Y)\\ f_{m,0}(X)+f_{m,1}(H_1)+\ldots+f_{m,n-1}(H_k)+f_{m,n}(Y)\end{bmatrix}    (5)

where each function $f_{i,j}$ acts columnwise on the given input matrix. This gives rise to

F\circ\begin{bmatrix}X\in\mathbb{R}^{|\mathbf{x}|\times n}\\ H_1\in\mathbb{R}^{|\mathbf{h}_1|\times n}\\ \vdots\\ H_k\in\mathbb{R}^{|\mathbf{h}_k|\times n}\\ Y\in\mathbb{R}^{|\mathbf{y}|\times n}\end{bmatrix}\in\mathbb{R}^{(|\mathbf{x}|+|\mathbf{h}_1|+\ldots+|\mathbf{h}_k|+|\mathbf{y}|)\times n}.    (6)
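As a small illustration of the columnwise action in Definition 2, the sketch below (ours, in NumPy, with an arbitrarily chosen $f_{i,j}$ built from a matrix product and a tanh) checks that applying such a function to a mini-batch matrix agrees with applying it column by column; this is a toy check, not the paper's implementation.

import numpy as np

n = 5                                  # mini-batch size
X = np.random.randn(2, n)              # |x| x n block of inputs
W = np.random.randn(3, 2)

f = lambda X: np.tanh(W @ X)           # an illustrative f_{i,j}; acts on each column independently

# Columnwise action: applying f to the whole mini-batch equals applying it per column.
batched = f(X)
per_col = np.stack([f(X[:, [j]])[:, 0] for j in range(n)], axis=1)
assert np.allclose(batched, per_col)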

In what follows we will often wish to iterate a fixed block non-linear function $F$. In particular, a defining characteristic of a discrete iterated map is that the map of interest is fixed and applied a number of times to a given input $x$ as in $f(f(\cdots(f(x))\cdots))$. Accordingly, we define an iterated block nonlinear function as below.

Definition 3

An iterative map is defined below.

Algorithm 1 INN
  $t \leftarrow 0$
  $\mathbf{x}_0 \leftarrow \mathbf{x}$
  while $t < T$ do
      $\mathbf{x}_{t+1} \leftarrow F \circ \mathbf{x}_t$
      $t \leftarrow t + 1$
  end while
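Algorithm 1 amounts to repeatedly applying a fixed map $F$; a minimal Python sketch of the loop, with a placeholder $F$ chosen only for illustration, might look as follows.

import numpy as np

def inn(F, x, T):
    """Algorithm 1: iterate the fixed map F, starting from x, T times."""
    x_t = x
    for _ in range(T):
        x_t = F(x_t)
    return x_t

# Placeholder map, chosen only so the loop has something to iterate.
W = 0.5 * np.eye(3)
F = lambda v: np.tanh(W @ v)
print(inn(F, np.ones(3), T=10))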

3 Recurrent Neural Networks (RNNs)

3.1 Traditional development of RNNs

A recurrent neural network (RNN) is a type of neural network that is designed to process sequential data, e.g., data that is indexed by time. Let $\mathbf{h}_k\in\mathbb{R}^n$ be the state of the system, and $\mathbf{x}_k\in\mathbb{R}^m$ be the input data or forcing function. As in Goodfellow et al. (2016); Pascanu et al. (2013), an RNN can be written as

Given $\mathbf{h}_0=\mathbf{x}_0$    (7)
$\mathbf{h}_{t+1}=f_\theta(\mathbf{x}_{t+1},\mathbf{h}_t),\quad t=0,1,\dots,T-1$

where $f_\theta$ is a family of functions parameterized by $\theta$. In other words, $\mathbf{x}_0$ is the initialization of the hidden state $\mathbf{h}_0$. Equivalently, equation (7) defines a discrete time dynamical system for $\mathbf{h}_k$ where the function $f$ is the dynamics of the system and $\mathbf{x}_k$ is a forcing term. The study of RNNs revolves around the choice of the function family $f_\theta(\cdot,\cdot)$, with many interesting sub-classes of RNNs having different properties.

A specific family of functions that is widely used in the RNN literature Goodfellow et al. (2016); Pascanu et al. (2013); Salehinejad et al. (2017); Lipton et al. (2015) is a family of affine functions with non-linear activations, e.g.,

\mathbf{h}_{t+1}=\sigma(W_x\cdot\mathbf{x}_{t+1}+W_h\cdot\mathbf{h}_t+\mathbf{b})    (8)

where we have the weight matrices $W_x\in\mathbb{R}^{n\times m}$ and $W_h\in\mathbb{R}^{n\times n}$, the bias vector $\mathbf{b}\in\mathbb{R}^n$, and $\sigma$ is a non-linear activation function, most commonly a tanh.
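For concreteness, a single application of (8) can be sketched as below; the tanh activation, the random parameters, and the helper name rnn_step are illustrative assumptions.

import numpy as np

m, n = 4, 3                         # |x| and |h|
rng = np.random.default_rng(0)
W_x = rng.standard_normal((n, m))   # W_x in R^{n x m}
W_h = rng.standard_normal((n, n))   # W_h in R^{n x n}
b   = rng.standard_normal(n)

def rnn_step(x_next, h_t):
    """One application of (8): h_{t+1} = sigma(W_x x_{t+1} + W_h h_t + b)."""
    return np.tanh(W_x @ x_next + W_h @ h_t + b)

h = np.zeros(n)
for x_t in rng.standard_normal((5, m)):   # a length-5 input sequence
    h = rnn_step(x_t, h)
print(h)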

3.2 RNNs as iterated maps

The RNN in equation (8) can be written as a block non-linear function

\begin{bmatrix}0&0\\ 0&\sigma\end{bmatrix}\circ\begin{bmatrix}0&0\\ F_x&F_h\end{bmatrix}\circ\begin{bmatrix}\mathbf{x}_{t+1}\\ \mathbf{h}_t\end{bmatrix}=\begin{bmatrix}0\\ \sigma(F_x(\mathbf{x}_{t+1})+F_h(\mathbf{h}_t))\end{bmatrix}    (9)

where $F_x(\mathbf{x}_{t+1})=W_x\cdot\mathbf{x}_{t+1}$ is a linear function, $F_h(\mathbf{h}_t)=W_h\cdot\mathbf{h}_t+\mathbf{b}$ is an affine function and where $\sigma:\mathbb{R}^n\mapsto\mathbb{R}^n$. In general however, $F_x$ and $F_h$ may be nonlinear functions such that $F_x:\mathbb{R}^m\mapsto\mathbb{R}^n$ and $F_h:\mathbb{R}^n\mapsto\mathbb{R}^n$, but the choice in (8) is common in the RNN literature Goodfellow et al. (2016); Pascanu et al. (2013); Salehinejad et al. (2017); Lipton et al. (2015).

The trainable parameter set is

\theta_{RNN}=\{W_x,W_h,\mathbf{b}\}.    (10)

RNNs can therefore be viewed as a forced discrete dynamical system with state space $\mathbf{h}_k$ and the function

M_{RNN}=\begin{bmatrix}0&0\\ 0&\sigma\end{bmatrix}\circ\begin{bmatrix}0&0\\ F_x&F_h\end{bmatrix}    (11)

is the dynamics of the system. Given $\mathbf{h}_0$, the sequence provided by (8), or equivalently (11), is

\begin{aligned}
\mathbf{h}_1 &= \sigma(F_x(\mathbf{x}_1)+F_h(\mathbf{h}_0))    (12)\\
\mathbf{h}_2 &= \sigma\big(F_x(\mathbf{x}_2)+F_h(\mathbf{h}_1)\big)\\
    &= \sigma\Big(F_x(\mathbf{x}_2)+F_h\big(\sigma(F_x(\mathbf{x}_1)+F_h(\mathbf{h}_0))\big)\Big)\\
\mathbf{h}_3 &= \sigma\Big(F_x(\mathbf{x}_3)+F_h(\mathbf{h}_2)\Big)\\
    &= \sigma\Bigg(F_x(\mathbf{x}_3)+F_h\bigg(\sigma\Big(F_x(\mathbf{x}_2)+F_h\big(\sigma(F_x(\mathbf{x}_1)+F_h(\mathbf{h}_0))\big)\Big)\bigg)\Bigg).
\end{aligned}

To emphasize the iterative nature of the RNN, we extend the definition of the RNN in (8) by defining

f_\theta(\mathbf{x},\mathbf{h})=\sigma(F_x(\mathbf{x})+F_h(\mathbf{h}))    (13)

and using this notation and appealing to the standard $\circ$ notation for function composition we see that

\begin{aligned}
\mathbf{h}_3 &= f_\theta(\mathbf{x}_3,\cdot)\circ f_\theta(\mathbf{x}_2,\cdot)\circ f_\theta(\mathbf{x}_1,\cdot)\circ\mathbf{h}_0    (14)\\
    &= f_\theta\Big(\mathbf{x}_3,f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{h}_0)\big)\Big)\\
    &= f_\theta\Big(\mathbf{x}_3,f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{x}_0)\big)\Big).
\end{aligned}

Equivalently, we can iterate the following matrix three times and recover the output of (14) in the final component of the vector. While this notation appears superfluous, it demonstrates that we create a fixed map composed of a fixed function $f_\theta$. This approach will prove to be particularly useful looking ahead to §4 when we consider MLPs. We define

M_{RNN3}=\begin{bmatrix}0&0&0&0\\ f_\theta(\mathbf{x}_1,\cdot)&0&0&0\\ 0&f_\theta(\mathbf{x}_2,\cdot)&0&0\\ 0&0&f_\theta(\mathbf{x}_3,\cdot)&0\end{bmatrix}.    (15)

We show below that we reproduce the action of (11) by expanding the dimension of the space such that all copies of (11) lie “below the diagonal”. This is an example of an unrolling operation utilizing an increase in dimension. Why do we combine these two operations? As we will see later, combining them uncovers a unifying representation of RNNs and MLPs. In the general RNN case, the particular dimension we choose for the lifting is $\big(|\mathbf{h}|+T(|\mathbf{x}|+|\mathbf{h}|)\big)$ where $T$ is the number of times the map is applied to produce $\mathbf{h}_T$, and $|\mathbf{x}|=|\mathbf{x}_i|\;\forall\;i=1,\dots,T-1$ and $|\mathbf{h}|=|\mathbf{h}_i|\;\forall\;i=0,\dots,T$.[2]

[2] In the context of MLPs we will use a similar lifting approach to create a fixed map, but in this case the fixed map will be composed of different functions, $f_{\theta_1},f_{\theta_2},\dots,f_{\theta_T}$. In general, these functions map from and to vector spaces having different dimensions as indicated in (3). The dimension of the fixed map in the MLP case is therefore $|\mathbf{h}_0|+|\mathbf{h}_1|+|\mathbf{h}_2|+\dots+|\mathbf{h}_T|$.

Given the initial state

\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}    (16)

the first iteration of the map $M_{RNN3}$ (abbreviated to $M$ for convenience) gives

M\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}=\begin{bmatrix}0&0&0&0\\ f_\theta(\mathbf{x}_1,\cdot)&0&0&0\\ 0&f_\theta(\mathbf{x}_2,\cdot)&0&0\\ 0&0&f_\theta(\mathbf{x}_3,\cdot)&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}=\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,\mathbf{h}_0)\\ f_\theta(\mathbf{x}_2,0)\\ f_\theta(\mathbf{x}_3,0)\end{bmatrix}.    (17)

The second iteration of the map $M$ gives

\begin{aligned}
M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_\theta(\mathbf{x}_1,\cdot)&0&0&0\\ 0&f_\theta(\mathbf{x}_2,\cdot)&0&0\\ 0&0&f_\theta(\mathbf{x}_3,\cdot)&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,\mathbf{h}_0)\\ f_\theta(\mathbf{x}_2,0)\\ f_\theta(\mathbf{x}_3,0)\end{bmatrix}    (18)\\
&=\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,0)\\ f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{h}_0)\big)\\ f_\theta\big(\mathbf{x}_3,f_\theta(\mathbf{x}_2,0)\big)\end{bmatrix}
\end{aligned}

and the third iteration of the map $M$ results in

\begin{aligned}
M\circ M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_\theta(\mathbf{x}_1,\cdot)&0&0&0\\ 0&f_\theta(\mathbf{x}_2,\cdot)&0&0\\ 0&0&f_\theta(\mathbf{x}_3,\cdot)&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,0)\\ f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{h}_0)\big)\\ f_\theta\big(\mathbf{x}_3,f_\theta(\mathbf{x}_2,0)\big)\end{bmatrix}    (19)\\
&=\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,0)\\ f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,0)\big)\\ f_\theta\Big(\mathbf{x}_3,f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{h}_0)\big)\Big)\end{bmatrix}\\
&=\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,0)\\ f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,0)\big)\\ f_\theta\Big(\mathbf{x}_3,f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{x}_0)\big)\Big)\end{bmatrix}.
\end{aligned}

If we now interpret the last entry in the vector as the output of the map, we observe that it is identical to the output of a three layer RNN with function/dynamics $f_\theta(\cdot,\cdot)$, forcing terms $\mathbf{x}_1,\mathbf{x}_2$, and $\mathbf{x}_3$, and input $\mathbf{x}_0$ as shown in (14).
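The unrolling argument of (15)-(19) can also be verified numerically: iterating the block map three times from $(\mathbf{h}_0,0,0,0)$ leaves the ordinary three-step RNN output (14) in the last block. The following is our own toy check, in NumPy, under the same tanh/affine assumptions used above.

import numpy as np

rng = np.random.default_rng(1)
m, n = 2, 3
W_x = rng.standard_normal((n, m))
W_h = rng.standard_normal((n, n))
b = rng.standard_normal(n)
f = lambda x, h: np.tanh(W_x @ x + W_h @ h + b)      # f_theta of (13)
h0 = rng.standard_normal(n)                          # initial hidden state (h_0 = x_0 in (7))
xs = [rng.standard_normal(m) for _ in range(3)]      # forcing terms x_1, x_2, x_3

h_rnn = f(xs[2], f(xs[1], f(xs[0], h0)))             # three RNN steps, as in (14)

def M(state):
    # One application of M_RNN3 from (15): block j+1 becomes f_theta(x_{j+1}, block j).
    s0, s1, s2, _ = state
    return [np.zeros(n), f(xs[0], s0), f(xs[1], s1), f(xs[2], s2)]

state = [h0, np.zeros(n), np.zeros(n), np.zeros(n)]
for _ in range(3):
    state = M(state)
assert np.allclose(state[-1], h_rnn)                 # last block reproduces (19)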

3.3 Choice of the initial vector

Given the initial state

\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}    (20)

the first iteration of the map $M_{RNN3}$ (abbreviated to $M$ for convenience) gives

M\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}=\begin{bmatrix}0&0&0&0\\ f_\theta(\mathbf{x}_1,\cdot)&0&0&0\\ 0&f_\theta(\mathbf{x}_2,\cdot)&0&0\\ 0&0&f_\theta(\mathbf{x}_3,\cdot)&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}=\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,\mathbf{h}_0)\\ f_\theta(\mathbf{x}_2,q_1)\\ f_\theta(\mathbf{x}_3,q_2)\end{bmatrix}.    (21)

The second iteration of the map $M$ gives

\begin{aligned}
M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_\theta(\mathbf{x}_1,\cdot)&0&0&0\\ 0&f_\theta(\mathbf{x}_2,\cdot)&0&0\\ 0&0&f_\theta(\mathbf{x}_3,\cdot)&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,\mathbf{h}_0)\\ f_\theta(\mathbf{x}_2,q_1)\\ f_\theta(\mathbf{x}_3,q_2)\end{bmatrix}    (22)\\
&=\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,0)\\ f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{h}_0)\big)\\ f_\theta\big(\mathbf{x}_3,f_\theta(\mathbf{x}_2,q_1)\big)\end{bmatrix}
\end{aligned}

and the third iteration of the map $M$ results in

\begin{aligned}
M\circ M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_\theta(\mathbf{x}_1,\cdot)&0&0&0\\ 0&f_\theta(\mathbf{x}_2,\cdot)&0&0\\ 0&0&f_\theta(\mathbf{x}_3,\cdot)&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,0)\\ f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{h}_0)\big)\\ f_\theta\big(\mathbf{x}_3,f_\theta(\mathbf{x}_2,q_1)\big)\end{bmatrix}    (23)\\
&=\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,0)\\ f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,0)\big)\\ f_\theta\Big(\mathbf{x}_3,f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{h}_0)\big)\Big)\end{bmatrix}\\
&=\begin{bmatrix}0\\ f_\theta(\mathbf{x}_1,0)\\ f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,0)\big)\\ f_\theta\Big(\mathbf{x}_3,f_\theta\big(\mathbf{x}_2,f_\theta(\mathbf{x}_1,\mathbf{x}_0)\big)\Big)\end{bmatrix}.
\end{aligned}

The final component after three iterations is equal to (19) and independent of $q_1,q_2,q_3$.

4 Multilayer perceptrons (MLPs)

4.1 Traditional development of MLPs

Neural networks such as the traditional feed-forward multilayer perceptron Murtagh (1991) architecture can be defined as a set of nested functions. In this representation, each layer $i$ of the network and accompanying (fixed) nonlinear activation function $\sigma(\cdot)$ Rasamoelina et al. (2020) is represented as a function. Let $\theta_i$ represent the parameters that define $W_i$ and $\mathbf{b}_i$. To make the notation consistent we make $\mathbf{h}_0$ equal to the input data $\mathbf{x}$.[3] Let

[3] We note that for an MLP the input data appears only once, yet the forcing function in an RNN changes every iteration and $\mathbf{h}_0=\mathbf{x}_0$.

\mathbf{h}_{i+1}=f_{\theta_{i+1}}(\mathbf{h}_i)=\sigma(W_{i+1}\cdot\mathbf{h}_i+\mathbf{b}_{i+1}),\quad i=0,\dots,T-1    (24)

with $\mathbf{h}_0\in\mathbb{R}^m$, $\mathbf{h}_i\in\mathbb{R}^{|\mathbf{h}_i|}$, and weight matrix $W_i\in\mathbb{R}^{|\mathbf{h}_i|\times|\mathbf{h}_{i-1}|}$ and bias vector $\mathbf{b}_i\in\mathbb{R}^{|\mathbf{h}_i|}$ forming a parameter set

\theta_{MLP}=\{W_i,\mathbf{b}_i\},\;i=1,\dots,T.    (25)

Each $\mathbf{h}_i,\;i=1,\dots,T$ is thought of as a hidden layer of the neural network. Such a layer is called a dense layer based on the density of the non-zero entries in the matrix $W_i$. Note that the affine function $W_i\cdot\mathbf{h}_{i-1}+\mathbf{b}_i$ of each layer is fed through the nonlinear activation function $\sigma$. The final layer provides model output $\tilde{\mathbf{y}}=\mathbf{h}_T$. As a concrete example, consider the following three-layer network, where we nest three functions of the form (24) to obtain

\tilde{\mathbf{y}}=\mathbf{h}_3=\sigma\Big(W_3\cdot\sigma\big(W_2\cdot\sigma(W_1\cdot\mathbf{h}_0+\mathbf{b}_1)+\mathbf{b}_2\big)+\mathbf{b}_3\Big)    (26)

or, equivalently and more compactly,

\tilde{\mathbf{y}}=\mathbf{h}_3=f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))    (27)

where

f_{\theta_j}(\mathbf{z})=\sigma(W_j\cdot\mathbf{z}+\mathbf{b}_j).    (28)

In general

\begin{aligned}
\tilde{\mathbf{y}}=\mathbf{h}_T &= f_{\theta_T}\circ f_{\theta_{T-1}}\circ\dots\circ f_{\theta_1}\circ\mathbf{h}_0    (29)\\
&= f_{\theta_T}\Big(f_{\theta_{T-1}}\big(\dots(f_{\theta_1}(\mathbf{h}_0))\dots\big)\Big).
\end{aligned}
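A minimal sketch of (24)-(29), i.e. a three-layer MLP evaluated as nested functions; the widths, parameters, and tanh activation below are arbitrary illustrative choices, not values from the paper.

import numpy as np

rng = np.random.default_rng(2)
widths = [4, 8, 8, 2]                        # |h_0|, |h_1|, |h_2|, |h_3|
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(3)]
bs = [rng.standard_normal(widths[i + 1]) for i in range(3)]
layers = [lambda z, W=W, b=b: np.tanh(W @ z + b) for W, b in zip(Ws, bs)]   # f_{theta_j} of (28)

h = rng.standard_normal(widths[0])           # h_0 = x
for f in layers:                             # y_tilde = f_3(f_2(f_1(h_0))), i.e. (27)/(29)
    h = f(h)
print(h)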

4.2 MLPs as iterated maps

Given $\mathbf{h}_0=\mathbf{x}$ the iteration (29) gives

\begin{aligned}
\mathbf{h}_1 &= \sigma(W_1\cdot\mathbf{h}_0+\mathbf{b}_1)    (30)\\
\mathbf{h}_2 &= \sigma\big(W_2\cdot\mathbf{h}_1+\mathbf{b}_2\big)\\
    &= \sigma\big(W_2\cdot\sigma(W_1\cdot\mathbf{h}_0+\mathbf{b}_1)+\mathbf{b}_2\big)\\
\mathbf{h}_3 &= \sigma\Big(W_3\cdot\mathbf{h}_2+\mathbf{b}_3\Big)\\
    &= \sigma\Big(W_3\cdot\sigma\big(W_2\cdot\sigma(W_1\cdot\mathbf{h}_0+\mathbf{b}_1)+\mathbf{b}_2\big)+\mathbf{b}_3\Big).
\end{aligned}

Let

f_{\theta_j}(\mathbf{z})=\sigma(W_j\cdot\mathbf{z}+\mathbf{b}_j)    (31)

and define

M_{MLP}=\begin{bmatrix}0&0&0&\dots&0&0\\ f_{\theta_1}&0&0&\dots&0&0\\ 0&f_{\theta_2}&0&\dots&0&0\\ 0&0&f_{\theta_3}&\dots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&0&f_{\theta_{T-1}}&0\end{bmatrix}    (32)

where $f_{\theta_1}:\mathbb{R}^{|\mathbf{x}|}\mapsto\mathbb{R}^{|\mathbf{h}_1|}$ and $f_{\theta_j}:\mathbb{R}^{|\mathbf{h}_{j-1}|}\mapsto\mathbb{R}^{|\mathbf{h}_j|},\,j=2,\dots,T$, and initial conditions

\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ \vdots\\ 0\end{bmatrix}.    (33)

This is an example of a sparse block non-linear function, where the majority of the entries are the zero function and only a few entries are non-zero. In particular, the non-zero entries are the functions $f_{\theta_1},f_{\theta_2},\dots,f_{\theta_{T-1}}$, which are assumed to be nonlinear. The function $M_{MLP}$ is an autonomous, nonlinear map. Now define

M_{MLP3}=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}.    (34)

We show that three iterations of the map $M_{MLP3}$ are equivalent to a three layer MLP.

We apply the map $M_{MLP3}$ (written here as $M$ to simplify notation) to the initial vector

\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}    (35)

which has length $(|\mathbf{h}_0|+|\mathbf{h}_1|+|\mathbf{h}_2|+|\mathbf{h}_3|)$. After the first iteration of the map $M$

M\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}=\begin{bmatrix}0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(0)\\ f_{\theta_3}(0)\end{bmatrix}.    (36)

After the second iteration of map $M$

M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(0)\\ f_{\theta_3}(0)\end{bmatrix}=\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(0))\end{bmatrix}    (37)

and after the third iteration of map $M$

\begin{aligned}
M\circ M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(0))\end{bmatrix}    (38)\\
&=\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))\end{bmatrix}.
\end{aligned}

If we now interpret the last entry in the vector as the output of the map we observe that it is identical to the output of a three layer MLP with weights $f_{\theta_1}$, $f_{\theta_2}$, and $f_{\theta_3}$ and input $\mathbf{h}_0$ as shown in (27). Note, this is not a special property of this particular MLP, but instead a universal property of MLPs.
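The equivalence traced out in (36)-(38) can likewise be checked numerically: three iterations of the sparse block map $M_{MLP3}$ started from $(\mathbf{h}_0,0,0,0)$ leave $f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))$ in the last block. The sketch below is a toy verification under the same assumptions (tanh layers, random parameters) as the previous snippet.

import numpy as np

rng = np.random.default_rng(3)
widths = [4, 8, 8, 2]
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(3)]
bs = [rng.standard_normal(widths[i + 1]) for i in range(3)]
fs = [lambda z, W=W, b=b: np.tanh(W @ z + b) for W, b in zip(Ws, bs)]

def M_MLP3(state):
    """One application of (34): f_{theta_{j+1}} sits just below the diagonal."""
    h0, h1, h2, _ = state
    return [np.zeros(widths[0]), fs[0](h0), fs[1](h1), fs[2](h2)]

h0 = rng.standard_normal(widths[0])
state = [h0, np.zeros(widths[1]), np.zeros(widths[2]), np.zeros(widths[3])]
for _ in range(3):
    state = M_MLP3(state)

mlp_out = fs[2](fs[1](fs[0](h0)))           # the nested form (27)
assert np.allclose(state[-1], mlp_out)      # last block after 3 iterations equals the MLP output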

4.3 MLPs with additional iterations

4.3.1 Finite impulse response

In §4.2 it is important to recognize that (32) starting with initial conditions (33) must be iterated $T$ times to achieve the equivalent outcome as (29). Further iterations lead to interesting results as shown below.

Applying the map $M$ one additional time reveals

\begin{aligned}
M\circ M\circ M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))\end{bmatrix}    (39)\\
&=\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(0)))\end{bmatrix}
\end{aligned}

which is independent of the input $\mathbf{h}_0$. Accordingly, it is important to note that $M\circ M\circ M$ is identical to the original three-layer MLP, but $M\circ M\circ M\circ M$ is a totally different operator. In fact, $M\circ M\circ M\circ M$ is finite impulse in that it does not depend on the input data $\mathbf{h}_0$.

4.3.2 Fixed point

An additional application of $M$, while seemingly inane from the point of view of NNs, does have interesting implications from the point of view of dynamical systems. In particular, one may observe that

\begin{aligned}
M\circ M\circ M\circ M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(0)))\end{bmatrix}    (40)\\
&=\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(0)))\end{bmatrix}
\end{aligned}

which, in addition, shows that

\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(0)))\end{bmatrix}    (41)

is a fixed point of the map $M$. Such examples underscore the tight, and perhaps surprising, relationship between NNs and dynamical systems, and show how the dynamical systems point of view provides insights into NN architectures.
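Both observations, the finite impulse response of §4.3.1 and the fixed point (41), can be confirmed with the same toy setup used earlier; the sketch below repeats that setup (tanh layers, random parameters of our own choosing) so that it is self-contained.

import numpy as np

rng = np.random.default_rng(4)
widths = [4, 8, 8, 2]
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(3)]
bs = [rng.standard_normal(widths[i + 1]) for i in range(3)]
fs = [lambda z, W=W, b=b: np.tanh(W @ z + b) for W, b in zip(Ws, bs)]

def M(state):                                    # the map (34)
    h0, h1, h2, _ = state
    return [np.zeros(widths[0]), fs[0](h0), fs[1](h1), fs[2](h2)]

def iterate(h0, k):
    state = [h0] + [np.zeros(w) for w in widths[1:]]
    for _ in range(k):
        state = M(state)
    return state

a, b = rng.standard_normal(widths[0]), rng.standard_normal(widths[0])
# After three iterations the last block still depends on the input ...
print(np.abs(iterate(a, 3)[-1] - iterate(b, 3)[-1]).max())   # generally nonzero
# ... but after four iterations it does not (finite impulse response):
assert all(np.allclose(u, v) for u, v in zip(iterate(a, 4), iterate(b, 4)))
# The state after four iterations is a fixed point of M, as in (40)-(41).
s4 = iterate(a, 4)
assert all(np.allclose(u, v) for u, v in zip(s4, M(s4)))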

4.4 Choice of the initial vector

We apply the map $M_{MLP3}$ (written here as $M$ to simplify notation) to the initial vector

\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}    (42)

where $q_1,q_2,q_3\not=0$, which has length $(|\mathbf{h}_0|+|\mathbf{h}_1|+|\mathbf{h}_2|+|\mathbf{h}_3|)$. After the first iteration of the map $M$

M\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}=\begin{bmatrix}0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(q_1)\\ f_{\theta_3}(q_2)\end{bmatrix}.    (43)

After the second iteration of map $M$

M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(q_1)\\ f_{\theta_3}(q_2)\end{bmatrix}=\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(q_1))\end{bmatrix}    (44)

and after the third iteration of map $M$

\begin{aligned}
M\circ M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(q_1))\end{bmatrix}    (45)\\
&=\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))\end{bmatrix}.
\end{aligned}

The solution after three iterations is identical to that in (38). Clearly then, further iterations result in

\begin{aligned}
M\circ M\circ M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))\end{bmatrix}    (46)\\
&=\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(0)))\end{bmatrix}
\end{aligned}

and

\begin{aligned}
M\circ M\circ M\circ M\circ M\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}
&=\begin{bmatrix}0&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(0)))\end{bmatrix}    (47)\\
&=\begin{bmatrix}0\\ f_{\theta_1}(0)\\ f_{\theta_2}(f_{\theta_1}(0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(0)))\end{bmatrix}.
\end{aligned}

Note, again, the independence of the final result from the initial input vector.

5 RNNs and MLPs as dynamical systems

When viewed as iterative maps (dynamical systems), RNNs and MLPs have identical structures as shown in (19) and (38), namely

\begin{bmatrix}0&0&\dots&0&0\\ f_1&0&\dots&0&0\\ 0&f_2&\dots&0&0\\ \vdots&&\ddots&&\vdots\\ 0&0&\dots&f_T&0\end{bmatrix}\circ\begin{bmatrix}0&0&\dots&0&0\\ f_1&0&\dots&0&0\\ 0&f_2&\dots&0&0\\ \vdots&&\ddots&&\vdots\\ 0&0&\dots&f_T&0\end{bmatrix}\circ\ldots\circ\begin{bmatrix}0&0&\dots&0&0\\ f_1&0&\dots&0&0\\ 0&f_2&\dots&0&0\\ \vdots&&\ddots&&\vdots\\ 0&0&\dots&f_T&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ \vdots\\ 0\end{bmatrix}.    (48)

The difference lies in the parameter sets $\theta_{RNN}$ and $\theta_{MLP}$ and in the definitions of the functions $f_j$, specifically

\begin{aligned}
&\hbox{RNN:}\qquad \theta_{RNN}=\{W_x,W_h,\mathbf{b}\}    (49)\\
&\hbox{MLP:}\qquad \theta_{MLP}=\{W_j,\mathbf{b}_j\},\;j=1,\dots,T\\
&\hbox{and}\\
&\hbox{RNN:}\qquad f=f_\theta(\mathbf{x}_j,\mathbf{h}_{j-1})=\sigma(W_x\cdot\mathbf{x}_j+W_h\cdot\mathbf{h}_{j-1}+\mathbf{b}),\quad j=1,\dots,T\\
&\hbox{MLP:}\qquad f_j=f_{\theta_j}(\mathbf{h}_{j-1})=\sigma(W_j\cdot\mathbf{h}_{j-1}+\mathbf{b}_j),\quad j=1,\dots,T
\end{aligned}

where $\mathbf{x}_j$ is the forcing function in the case of the RNN and, similarly, $\mathbf{h}_0$ is the initialization data of both the MLP and RNN. Thus, we observe both RNNs and MLPs may be written in the form of (48) by merely noting that for RNNs we have $f_i=f_j=f$ for all $i,j\in 1,\cdots,T$.


The iterative map in (48) plays two key roles in our work:

  • It provides a unifying perspective for understanding the relationship between MLPs and RNNs. In fact, they are both dynamical systems and can be explored from that point of view.

  • Equation (48) also underscores the fact that a dynamical systems perspective inspires one to define and explore families of functions which are more general than those normally considered as MLPs and RNNs. In fact, each of the 0 blocks in (48), when trained and allowed to be non-zero, leads to a new family of functions with interesting theoretical and practical properties.

Exploring these two aspects of (48) will comprise the remainder of this text.
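The unifying form (48) can be made concrete with a single driver that iterates a list of below-diagonal blocks: with $T$ distinct blocks it realizes an MLP, and with one shared block (the forcing $\mathbf{x}_j$ baked into block $j$) it realizes an RNN. The sketch below is ours, with toy dimensions and parameters, and is not the Sequential2D implementation discussed in §7.

import numpy as np

def iterate_blocks(fs, widths, h0, T):
    """Iterate the block map (48): slot j+1 receives f_{j+1}(slot j); return the last slot."""
    state = [h0] + [np.zeros(w) for w in widths[1:]]
    for _ in range(T):
        state = [np.zeros(widths[0])] + [f(s) for f, s in zip(fs, state[:-1])]
    return state[-1]

rng = np.random.default_rng(5)
n, T = 3, 3

# MLP: T distinct layer functions f_{theta_j}.
mlp_fs = [lambda z, W=rng.standard_normal((n, n)), b=rng.standard_normal(n): np.tanh(W @ z + b)
          for _ in range(T)]

# RNN: one shared function f_theta(x_j, .), with the forcing x_j fixed inside block j.
W_x, W_h, b = rng.standard_normal((n, n)), rng.standard_normal((n, n)), rng.standard_normal(n)
xs = rng.standard_normal((T, n))
rnn_fs = [lambda h, x=x: np.tanh(W_x @ x + W_h @ h + b) for x in xs]

h0 = rng.standard_normal(n)
print(iterate_blocks(mlp_fs, [n] * (T + 1), h0, T))   # = f_3(f_2(f_1(h_0)))
print(iterate_blocks(rnn_fs, [n] * (T + 1), h0, T))   # = f(x_3, f(x_2, f(x_1, h_0)))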

6 Modified MLP as an infinite impulse response map

6.1 Map with a non-zero diagonal entry

It is interesting to note, and will form the foundation of much future work, that seemingly minor modifications to systems such as (38) can lead to important differences in the resulting dynamical systems. For example, adding an identity matrix to the top left corner of the map $M$ results in an infinite impulse response map.[4]

[4] Because of the $I$ on the diagonal of $M_\infty$ it is not a sequential operator. However, we maintain the name Sequential2D for its implementation to engender the intuition that Sequential2D is a container for (non-linear) functions just like the more common Sequential container.

Define

M_\infty=\begin{bmatrix}I&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}.    (50)

When started at the state

\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}    (51)

after the first iteration of $M_\infty$

M_\infty\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}=\begin{bmatrix}I&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}=\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(0)\\ f_{\theta_3}(0)\end{bmatrix}.    (52)

After the second iteration of map $M_\infty$

M_\infty\circ M_\infty\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}=\begin{bmatrix}I&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(0)\\ f_{\theta_3}(0)\end{bmatrix}=\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(0))\end{bmatrix}    (53)

and after the third iteration of map $M_\infty$

\begin{aligned}
M_\infty\circ M_\infty\circ M_\infty\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}
&=\begin{bmatrix}I&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(0))\end{bmatrix}    (54)\\
&=\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))\end{bmatrix}.
\end{aligned}

We observe that the last entry is exactly the same (vector) value as the corresponding MLP with weights $f_{\theta_3}$, $f_{\theta_2}$, and $f_{\theta_1}$ and input $\mathbf{h}_0$. Further iterations give

\begin{aligned}
M_\infty\circ M_\infty\circ M_\infty\circ M_\infty\circ\begin{bmatrix}\mathbf{h}_0\\ 0\\ 0\\ 0\end{bmatrix}
&=\begin{bmatrix}I&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))\end{bmatrix}    (55)\\
&=\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0)))\end{bmatrix}
\end{aligned}

which is a fixed point that is dependent on the input $\mathbf{h}_0$, hence $M_\infty$ is a dynamical system with infinite impulse response.

Summarizing, the maps defined by (34) and (50) both converge to a fixed point. The fixed point defined by (34) is independent of the input and therefore defines a finite impulse response map. The fixed point defined by (50) is dependent on the input and therefore defines an infinite impulse response map.
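A toy numerical version of the contrast drawn here, under the same illustrative assumptions as the earlier MLP sketches: with the identity block of (50) in place, three iterations reproduce the MLP output as in (54), further iterations leave the state unchanged, and the resulting fixed point still carries $\mathbf{h}_0$.

import numpy as np

rng = np.random.default_rng(6)
widths = [4, 8, 8, 2]
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(3)]
bs = [rng.standard_normal(widths[i + 1]) for i in range(3)]
fs = [lambda z, W=W, b=b: np.tanh(W @ z + b) for W, b in zip(Ws, bs)]

def M_inf(state):
    """The map (50): identity in the top-left block, f_{theta_j} below the diagonal."""
    h0, h1, h2, _ = state
    return [h0, fs[0](h0), fs[1](h1), fs[2](h2)]

def iterate(h0, k):
    state = [h0] + [np.zeros(w) for w in widths[1:]]
    for _ in range(k):
        state = M_inf(state)
    return state

h0 = rng.standard_normal(widths[0])
s = iterate(h0, 3)
assert np.allclose(s[-1], fs[2](fs[1](fs[0](h0))))           # last block reproduces the MLP, as in (54)
# Further iterations leave the state unchanged, and the fixed point still carries h_0.
assert all(np.allclose(u, v) for u, v in zip(iterate(h0, 10), s))
assert np.allclose(iterate(h0, 10)[0], h0)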

6.2 Choice of the initial vector

In §6.1 we observed that the result of the iterative map only depends on $\mathbf{h}_0$ and is invariant to the other entries of the initial vector, as long as those entries are 0. However, if, as we claim, the iterative map is equivalent to an MLP, then the iterative map should be invariant to the entries of the initial vector other than $\mathbf{h}_0$, regardless of their values. Accordingly, in this section we consider the same iterative map as in §6.1, but with a different initial vector. In particular, we consider the initial vector

\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}    (56)

after the first iteration of map $M_\infty$

M_\infty\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}=\begin{bmatrix}I&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}=\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(q_1)\\ f_{\theta_3}(q_2)\end{bmatrix}.    (57)

After the second iteration of map $M_\infty$

M_\infty\circ M_\infty\circ\begin{bmatrix}\mathbf{h}_0\\ q_1\\ q_2\\ q_3\end{bmatrix}=\begin{bmatrix}I&0&0&0\\ f_{\theta_1}&0&0&0\\ 0&f_{\theta_2}&0&0\\ 0&0&f_{\theta_3}&0\end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(q_1)\\ f_{\theta_3}(q_2)\end{bmatrix}=\begin{bmatrix}\mathbf{h}_0\\ f_{\theta_1}(\mathbf{h}_0)\\ f_{\theta_2}(f_{\theta_1}(\mathbf{h}_0))\\ f_{\theta_3}(f_{\theta_2}(q_1))\end{bmatrix}    (58)

and after the third iteration of map MM

MMM[𝐡0q1q2q3]\displaystyle M_{\infty}\circ M_{\infty}\circ M_{\infty}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} =[I000fθ10000fθ20000fθ30][𝐡0fθ1(𝐡𝟎)fθ2(fθ1(𝐡0))fθ3(fθ2(q1))]\displaystyle=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0}))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))\\ \end{bmatrix} (59)
=[𝐡𝟎fθ1(𝐡𝟎)fθ2(fθ1(𝐡𝟎))fθ3(fθ2(fθ1(𝐡0)))]\displaystyle=\begin{bmatrix}\mathbf{h_{0}}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}}))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})))\\ \end{bmatrix}

identical to (54) and independent of q1,q2,q3q_{1},q_{2},q_{3}. Clearly then

MMMM[𝐡0q1q2q3]\displaystyle M_{\infty}\circ M_{\infty}\circ M_{\infty}\circ M_{\infty}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} =[I000fθ10000fθ20000fθ30][𝐡𝟎fθ1(𝐡𝟎)fθ2(fθ1(𝐡𝟎))fθ3(fθ2(fθ1(𝐡0)))]\displaystyle=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h_{0}}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}}))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})))\\ \end{bmatrix} (60)
=[𝐡0fθ1(𝐡0)fθ2(fθ1(𝐡0))fθ3(fθ2(fθ1(𝐡0)))]\displaystyle=\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0}))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})))\\ \end{bmatrix}

which is a fixed point that depends on the input \mathbf{h}_{0} but is independent of q_{1}, q_{2}, q_{3}.
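To make this fixed-point behaviour concrete, the following minimal sketch iterates the block map in (57) in PyTorch. The block widths, the tanh layers standing in for f_{\theta_{1}}, f_{\theta_{2}}, f_{\theta_{3}}, and the random initial q_{i} are illustrative assumptions only, not the networks used later in the paper.

```python
# A minimal sketch (not the authors' code) of the map in (57): after three
# iterations the state is a fixed point that depends only on h0, regardless
# of the initial values of q1, q2, q3. Block widths are assumptions.
import torch

torch.manual_seed(0)
f1 = torch.nn.Sequential(torch.nn.Linear(8, 6), torch.nn.Tanh())
f2 = torch.nn.Sequential(torch.nn.Linear(6, 4), torch.nn.Tanh())
f3 = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Tanh())

def M_inf(state):
    """One application of the block non-linear map in (57)."""
    h0, q1, q2, q3 = state
    return [h0, f1(h0), f2(q1), f3(q2)]

h0 = torch.randn(8)
for trial in range(2):                       # two different random q's ...
    state = [h0, torch.randn(6), torch.randn(4), torch.randn(2)]
    for _ in range(3):                       # ... three iterations of M_inf
        state = M_inf(state)
    print([s.detach().numpy().round(3) for s in state])
# Both trials print the same state, illustrating the independence from q1, q2, q3.
```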

6.3 Choice of the initial vector with below diagonal blocks

In this section we stray from standard MLP models and consider a more complicated block non-linear function. The dynamical systems framing enables us to consider structures that are not so evident from the traditional perspective. In particular, we begin with the same initial vector

[𝐡0q1q2q3]\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} (61)

but consider the block non-linear function

Mskip=[I000fθ10000fθ200S0fθ30].M_{\text{skip}}=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ \hbox{\pagecolor{cyan}$S$}&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}. (62)

Such a block non-linear function can be thought of as an MLP with a skip connection Huang et al. (2017); He et al. (2016) from the input to the last layer.

In this case, after the first iteration of the map M_{\text{skip}}

Mskip[𝐡0q1q2q3]=[I000fθ10000fθ200S0fθ30][𝐡0q1q2q3]=[𝐡0fθ1(𝐡0)fθ2(q1)S(𝐡0)+fθ3(q2)].M_{\text{skip}}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix}=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ \hbox{\pagecolor{cyan}$S$}&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix}=\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1})\\ \hbox{\pagecolor{cyan}$S$}(\mathbf{h}_{0})+\hbox{\pagecolor{green}$f_{\theta_{3}}$}(q_{2})\\ \end{bmatrix}\,. (63)

After the second iteration of the map M_{\text{skip}}

MskipMskip[𝐡0q1q2q3]\displaystyle M_{\text{skip}}\circ M_{\text{skip}}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} =[I000fθ10000fθ200S0fθ30][𝐡0fθ1(𝐡0)fθ2(q1)S(𝐡0)+fθ3(q2)]\displaystyle=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ \hbox{\pagecolor{cyan}$S$}&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1})\\ \hbox{\pagecolor{cyan}$S$}(\mathbf{h}_{0})+\hbox{\pagecolor{green}$f_{\theta_{3}}$}(q_{2})\\ \end{bmatrix} (64)
=[𝐡0fθ1(𝐡𝟎)fθ2(fθ1(𝐡0))S(𝐡0)+fθ3(fθ2(q1))]\displaystyle=\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0}))\\ \hbox{\pagecolor{cyan}$S$}(\mathbf{h}_{0})+\hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))\\ \end{bmatrix}

and after the third iteration of the map M_{\text{skip}}

MskipMskipMskip[𝐡0q1q2q3]\displaystyle M_{\text{skip}}\circ M_{\text{skip}}\circ M_{\text{skip}}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} =[I000fθ10000fθ200S0fθ30][𝐡0fθ1(𝐡𝟎)fθ2(fθ1(𝐡0))S(𝐡0)+fθ3(fθ2(q1))]\displaystyle=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ \hbox{\pagecolor{cyan}$S$}&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0}))\\ \hbox{\pagecolor{cyan}$S$}(\mathbf{h}_{0})+\hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))\\ \end{bmatrix} (65)
=[𝐡𝟎fθ1(𝐡𝟎)fθ2(fθ1(𝐡𝟎))S(𝐡0)+fθ3(fθ2(fθ1(𝐡0)))]\displaystyle=\begin{bmatrix}\mathbf{h_{0}}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}}))\\ \hbox{\pagecolor{cyan}$S$}(\mathbf{h}_{0})+\hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})))\\ \end{bmatrix}

which is independent of q_{1}, q_{2}, q_{3}. Clearly then

MskipMskip\displaystyle M_{\text{skip}}\circ M_{\text{skip}}\circ MskipMskip[𝐡0q1q2q3]\displaystyle M_{\text{skip}}\circ M_{\text{skip}}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} (66)
=[I000fθ10000fθ200S0fθ30][𝐡𝟎fθ1(𝐡𝟎)fθ2(fθ1(𝐡𝟎))S(𝐡0)+fθ3(fθ2(fθ1(𝐡0)))]\displaystyle=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ \hbox{\pagecolor{cyan}$S$}&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h_{0}}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}}))\\ \hbox{\pagecolor{cyan}$S$}(\mathbf{h}_{0})+\hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})))\\ \end{bmatrix}
=[𝐡0fθ1(𝐡0)fθ2(fθ1(𝐡0))S(𝐡0)+fθ3(fθ2(fθ1(𝐡0)))].\displaystyle=\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0}))\\ \hbox{\pagecolor{cyan}$S$}(\mathbf{h}_{0})+\hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})))\\ \end{bmatrix}\,.

Again, we see that the fixed point depends on the input \mathbf{h}_{0} but is independent of q_{1}, q_{2}, q_{3}. Of course, this fixed point differs from the fixed points in §6.1 and §6.2 because of the skip connection. However, we have preserved the property that the fixed point is independent of the values of q_{1}, q_{2}, q_{3}, which is required of standard MLPs with skip connections Huang et al. (2017); He et al. (2016).
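A small modification of the previous sketch illustrates M_{\text{skip}} from (62). Here S is taken to be a linear map from the input block to the last block; its dimensions, like the rest of the setup, are assumptions made only for illustration.

```python
# A minimal sketch of the skip-connection map in (62), reusing the conventions
# of the previous sketch. The extra below-diagonal block S sends h0 directly
# to the last block, i.e. a skip connection from the input to the final layer.
import torch

torch.manual_seed(0)
f1 = torch.nn.Sequential(torch.nn.Linear(8, 6), torch.nn.Tanh())
f2 = torch.nn.Sequential(torch.nn.Linear(6, 4), torch.nn.Tanh())
f3 = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Tanh())
S  = torch.nn.Linear(8, 2)                  # hypothetical skip map

def M_skip(state):
    h0, q1, q2, q3 = state
    return [h0, f1(h0), f2(q1), S(h0) + f3(q2)]

h0 = torch.randn(8)
state = [h0, torch.randn(6), torch.randn(4), torch.randn(2)]
for _ in range(3):
    state = M_skip(state)
# The last block now equals S(h0) + f3(f2(f1(h0))): a fixed point that again
# depends only on h0 and not on the initial q1, q2, q3.
```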

6.4 Choice of the initial vector with above diagonal blocks

Now we consider a more complicated block non-linear function. We know that RNNs are recurrent Goodfellow et al. (2016) and, in many interesting cases, infinite-impulse Miljanovic (2012). Accordingly, their final output normally depends on the value of their initial state. Exploiting the dynamical systems perspective again, it is now easy to represent systems that are no longer simple feed-forward networks. In this section we demonstrate such a dependence in a block non-linear function.

As in Section 6.2, we consider the same initial vector

[𝐡0q1q2q3]\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} (67)

but consider the block non-linear function

Mabove=[I000fθ10S00fθ20000fθ30].M_{\text{above}}=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&\hbox{\pagecolor{magenta}$S$}&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\,. (68)

As the function S is above the diagonal, this block non-linear function is infinite-impulse Miljanovic (2012) unless S is specifically chosen to be zero or some similarly degenerate function.

To demonstrate the dependence of (68) on the initial vector (67), we perform the same calculation as before. In particular, after the first iteration of the map M_{\text{above}}

Mabove[𝐡0q1q2q3]=[I000fθ10S00fθ20000fθ30][𝐡0q1q2q3]=[𝐡0fθ1(𝐡0)+S(q2)fθ2(q1)fθ3(q2)].M_{\text{above}}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix}=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&\hbox{\pagecolor{magenta}$S$}&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix}=\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1})\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(q_{2})\\ \end{bmatrix}\,. (69)

After the second iteration of the map M_{\text{above}}

MaboveMabove[𝐡0q1q2q3]\displaystyle M_{\text{above}}\circ M_{\text{above}}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} =[I000fθ10S00fθ20000fθ30][𝐡0fθ1(𝐡0)+S(q2)fθ2(q1)fθ3(q2)]\displaystyle=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&\hbox{\pagecolor{magenta}$S$}&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2})\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1})\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(q_{2})\\ \end{bmatrix} (70)
=[𝐡0fθ1(𝐡𝟎)+S(fθ2(q1))fθ2(fθ1(𝐡0)+S(q2))fθ3(fθ2(q1))]\displaystyle=\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2}))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))\\ \end{bmatrix}

and after the third iteration of the map M_{\text{above}}

MaboveMabove\displaystyle M_{\text{above}}\circ M_{\text{above}}\circ Mabove[𝐡0q1q2q3]\displaystyle M_{\text{above}}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} (71)
=[I000fθ10S00fθ20000fθ30][𝐡0fθ1(𝐡𝟎)+S(fθ2(q1))fθ2(fθ1(𝐡0)+S(q2))fθ3(fθ2(q1))]\displaystyle=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&\hbox{\pagecolor{magenta}$S$}&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2}))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))\\ \end{bmatrix}
=[𝐡𝟎fθ1(𝐡𝟎)+S(fθ2(fθ1(𝐡0)+S(q2)))fθ2(fθ1(𝐡0)+S(fθ2(q1)))fθ3(fθ2(fθ1(𝐡0)+S(q2)))].\displaystyle=\begin{bmatrix}\mathbf{h_{0}}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2})))\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1})))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2})))\\ \end{bmatrix}\,.

Now, the fourth iteration of the map reveals that

Mabove\displaystyle M_{\text{above}}\circ MaboveMaboveMabove[𝐡0q1q2q3]\displaystyle M_{\text{above}}\circ M_{\text{above}}\circ M_{\text{above}}\circ\begin{bmatrix}\mathbf{h}_{0}\\ q_{1}\\ q_{2}\\ q_{3}\\ \end{bmatrix} (72)
=[I000fθ10S00fθ20000fθ30][𝐡𝟎fθ1(𝐡𝟎)+S(fθ2(fθ1(𝐡0)+S(q2)))fθ2(fθ1(𝐡0)+S(fθ2(q1)))fθ3(fθ2(fθ1(𝐡0)+S(q2)))]\displaystyle=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&\hbox{\pagecolor{magenta}$S$}&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0\\ \end{bmatrix}\circ\begin{bmatrix}\mathbf{h_{0}}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2})))\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1})))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2})))\\ \end{bmatrix}
=[𝐡0fθ1(𝐡0)+S(fθ2(fθ1(𝐡0)+S(fθ2(q1))))fθ2(fθ1(𝐡𝟎)+S(fθ2(fθ1(𝐡0)+S(q2))))fθ3(fθ2(fθ1(𝐡0)+S(fθ2(q1))))],\displaystyle=\begin{bmatrix}\mathbf{h}_{0}\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))))\\ \hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h_{0}})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(q_{2}))))\\ \hbox{\pagecolor{green}$f_{\theta_{3}}$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(\hbox{\pagecolor{green}$f_{\theta_{1}}$}(\mathbf{h}_{0})+\hbox{\pagecolor{magenta}$S$}(\hbox{\pagecolor{green}$f_{\theta_{2}}$}(q_{1}))))\\ \end{bmatrix},

which is not a fixed point and depends on the input \mathbf{h}_{0} as well as on q_{1} and q_{2}. Accordingly, this block non-linear function is not equivalent to an MLP, but instead displays the infinite impulse response of an RNN. Of course, if we set S to be zero, then the above calculation reduces to the calculation in §6.2 and the fixed point is independent of q_{1}, q_{2}, q_{3}; this demonstrates that MLPs and RNNs lie on a continuum of block non-linear functions.
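The contrast with the feed-forward cases can also be seen numerically. The following sketch mirrors (68) under the same illustrative assumptions as before: two runs that differ only in their initial q_{i} produce different iterates, exhibiting the infinite-impulse behaviour described above.

```python
# A minimal sketch of the map in (68): with S above the diagonal, the iterates
# depend on the initial q1 and q2 and do not settle to a q-independent fixed
# point. Block widths and the form of S are illustrative assumptions.
import torch

torch.manual_seed(0)
f1 = torch.nn.Sequential(torch.nn.Linear(8, 6), torch.nn.Tanh())
f2 = torch.nn.Sequential(torch.nn.Linear(6, 4), torch.nn.Tanh())
f3 = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Tanh())
S  = torch.nn.Linear(4, 6)                  # hypothetical above-diagonal block

def M_above(state):
    h0, q1, q2, q3 = state
    return [h0, f1(h0) + S(q2), f2(q1), f3(q2)]

h0 = torch.randn(8)
for trial in range(2):                      # two different random initial q's
    state = [h0, torch.randn(6), torch.randn(4), torch.randn(2)]
    for _ in range(4):
        state = M_above(state)
    print(state[3].detach().numpy().round(3))
# The two trials print different values for the last block, in contrast to the
# MLP-like maps of Sections 6.2 and 6.3.
```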

7 Implementation

As discussed in Pathak et al. (2023), there are many references in the literature that propose a block matrix structure for NNs. Examples include Narang et al. (2017), where the focus is on reducing the computational complexity and memory requirements of RNNs while maintaining accuracy. Narang et al. (2017) also note that block sparsity can lead to better performance on hardware, such as Graphics Processing Units, that supports block-sparse matrix multiplication.

Having derived various theoretical connections between MLPs and RNNs, we now take a brief detour into the implementation of such methods. In particular, the block non-linear function from (3) is closely related to the commonly used Sequential container that appears in both of the deep learning libraries PyTorch, Paszke et al. (2017), and Keras, Chollet et al. (2015). In fact, the function (3) can be thought of as a two-dimensional generalization of the Sequential container, and we call this generalization Sequential2D, which is described in more detail in Pathak et al. (2023). A brief review of github.com (data gathered on 5-16-2023) reveals the popularity of the Sequential architecture. As discussed in Pathak et al. (2023), we noted 380,000 files leveraging PyTorch's or Keras's Sequential implementation. Accordingly, the theoretical ideas described in this paper, and the practical implementation discussed in Pathak et al. (2023), promise to impact many Neural Network (NN) applications.

For the purposes of this paper, we merely observe that Sequential2D, Pathak et al. (2023), refers to a 2D matrix of functions, as opposed to the 1D array of functions in a PyTorch or Keras Sequential container. Specifically, closely following Pathak et al. (2023), we observe that Sequential2D is a PyTorch module that allows one to define a neural network by specifying a two-dimensional array of modules. Each individual module can be any type of layer or operation supported by PyTorch, Paszke et al. (2017). Practically, Sequential2D can be used to implement a variety of neural network models, including MLPs, RNNs, and many other NNs.
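As a rough illustration of the idea, the sketch below implements a Sequential2D-style container as a PyTorch module holding a two-dimensional grid of blocks, each of which is either None (a zero block) or a module mapping one block of the state to another. This is a conceptual sketch only; the actual interface of Sequential2D in Pathak et al. (2023) may differ, and the block widths are assumptions.

```python
# A conceptual sketch of a Sequential2D-style block map; not the interface of
# the actual Sequential2D of Pathak et al. (2023). Entry blocks[i][j] maps
# block j of the state to block i, and None denotes a zero block.
import torch

class BlockMap(torch.nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = blocks
        # Register the non-None modules so that their parameters are trainable.
        self._mods = torch.nn.ModuleList(
            [m for row in blocks for m in row if m is not None])

    def forward(self, state):
        # out_i = sum_j blocks[i][j](state[j]), skipping the zero blocks.
        out = []
        for row in self.blocks:
            acc = None
            for block, s in zip(row, state):
                if block is None:
                    continue
                acc = block(s) if acc is None else acc + block(s)
            out.append(acc)
        return out

# The MLP-like map of (57): identity on h0, f_theta_i on the sub-diagonal.
f1 = torch.nn.Sequential(torch.nn.Linear(8, 6), torch.nn.Tanh())
f2 = torch.nn.Sequential(torch.nn.Linear(6, 4), torch.nn.Tanh())
f3 = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Tanh())
M = BlockMap([[torch.nn.Identity(), None, None, None],
              [f1,   None, None, None],
              [None, f2,   None, None],
              [None, None, f3,   None]])

state = [torch.randn(8), torch.zeros(6), torch.zeros(4), torch.zeros(2)]
for _ in range(3):
    state = M(state)   # three iterations recover the three-layer MLP output
```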

8 Numerical experiments

Using the Sequential2D module as an implementation of the block non-linear function from Definition 1, numerous experiments have been conducted to examine the performance of various models across a range of scenarios and conditions. In particular, the interested reader can look to Pathak et al. (2023); Hershey et al. (2023); Hershey (2023) for examples of the ideas in this paper applied to practical machine learning problems. We will not attempt to repeat the results from those works here, but merely endeavor to provide a sample of the surprising results that arise from treating MLPs and RNNs as dynamical systems.

8.1 MNIST classification

In the following section we closely follow the presentation and results from Hershey (2023). We study the classic MNIST (Modified National Institute of Standards and Technology) dataset, LeCun et al. (2010), a common choice for machine learning exercises that features 70,000 labeled images and is often used for image recognition and classification problems. Each image depicts a handwritten single-digit integer ranging from 0 to 9, together with a ground-truth label reflecting the integer depicted. Unmodified MNIST images can be seen in Fig. 1. In all experiments, the MNIST data set was divided into a training subset of 10,000 images, a validation subset of 1,000 images, and a testing subset of 1,000 images, with a batch size of 64 and training for 100 epochs.

Refer to caption
Figure 1: Examples of unmodified handwritten 28\times 28 pixel images from the MNIST database.

Again following Hershey (2023), several image transformations were applied with the objective of increasing the difficulty and creating more disparate comparisons among model structures. Since model implementations were performed through PyTorch using the PyTorch Lightning wrapper (Falcon and The PyTorch Lightning team (2019)), Torchvision (maintainers and contributors (2016)) was a natural choice for transforms. As a first step, images were resized (torchvision.transforms.Resize) from 28\times 28 to 50\times 50, increasing the total number of input features in \mathbf{h}_{0} from 784 to 2,500. Additionally, the features of each image are normalized (torchvision.transforms.Normalize) with mean 0.1307 and standard deviation 0.3081, in adherence with guidelines from the Torchvision documentation for the MNIST data set, as shown in Figure 1.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: The cumulative effects of transformations applied to samples from the MNIST data set. All images were first resized and then normalized. (row 1) Resized and normalized only; (row 2) with random erase as an intermediary step; (row 3) with random perspective as an intermediary step; (row 4) with both random erase and random perspective as intermediary steps.

Once the images have been resized, each image in row 2 of Figure 2 is transformed by randomly erasing (torchvision.transforms.RandomErasing) a portion of the image prior to normalization, creating a more challenging problem. The erased fraction of each image is in the range (0.02, 0.05). In row 3, each image is transformed after normalization with a random perspective transformation (torchvision.transforms.RandomPerspective) using a distortion scale of 0.5. When applied in combination, the transforms from rows 2 and 3 yield the examples shown in row 4 of Figure 2, an even more challenging problem.
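For reference, a pipeline along these lines can be assembled from standard Torchvision components as sketched below. The ordering and parameters mirror the description above; any details not stated there (for example the erasing probability) are assumptions.

```python
# A sketch of the transform pipeline described above: resize to 50x50, random
# erasing before normalization, random perspective after it. Parameters not
# stated in the text (e.g. p=1.0 for erasing) are assumptions.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((50, 50)),                          # 784 -> 2500 input features
    transforms.ToTensor(),
    transforms.RandomErasing(p=1.0, scale=(0.02, 0.05)),  # erase 2%-5% of the image
    transforms.Normalize((0.1307,), (0.3081,)),           # MNIST mean / std
    transforms.RandomPerspective(distortion_scale=0.5),   # random perspective warp
])

mnist = datasets.MNIST("./data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(mnist, batch_size=64, shuffle=True)
x, y = next(iter(loader))
print(x.shape)   # torch.Size([64, 1, 50, 50]); flattening gives h0 of length 2500
```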

8.2 An MLP versus a dynamical system

For each run of the MNIST classification experiment, a base network was constructed to match the characteristics of a four-layer MLP with an input dimension of 2,500 and layer output dimensions of 500, 200, 100, and 10. Following the notation of §4.2, this gives rise to a block non-linear function of the form

MMLP4=[I0000fθ100000fθ200000fθ300000fθ40].M_{MLP4}=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0&0\\ 0&0&0&\hbox{\pagecolor{green}$f_{\theta_{4}}$}&0\\ \end{bmatrix}. (73)

Here f_{\theta_{i}}(\mathbf{z})=\sigma(W_{i}\mathbf{z}+\mathbf{b}_{i}), with I\in\mathbb{R}^{2500\times 2500} being the identity function, \sigma being the standard ReLU activation function, \theta_{1}=\{W_{1}\in\mathbb{R}^{500\times 2500},\mathbf{b}_{1}\in\mathbb{R}^{500\times 1}\}, \theta_{2}=\{W_{2}\in\mathbb{R}^{200\times 500},\mathbf{b}_{2}\in\mathbb{R}^{200\times 1}\}, \theta_{3}=\{W_{3}\in\mathbb{R}^{100\times 200},\mathbf{b}_{3}\in\mathbb{R}^{100\times 1}\}, and \theta_{4}=\{W_{4}\in\mathbb{R}^{10\times 100},\mathbf{b}_{4}\in\mathbb{R}^{10\times 1}\}. As demonstrated in §6, such a block non-linear function is identical to a standard MLP with the same layer sizes.
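Since, as noted above, the block map (73) iterated four times reproduces a standard four-layer MLP, the baseline model can equivalently be written as an ordinary Sequential stack. The sketch below assumes, following the definition of f_{\theta_{i}} above, that the ReLU activation is applied after every layer, including the last.

```python
# A minimal sketch of the MLP that is equivalent to four iterations of (73),
# with ReLU after every layer as in the definition of f_theta_i above.
import torch

mlp4 = torch.nn.Sequential(
    torch.nn.Linear(2500, 500), torch.nn.ReLU(),
    torch.nn.Linear(500, 200),  torch.nn.ReLU(),
    torch.nn.Linear(200, 100),  torch.nn.ReLU(),
    torch.nn.Linear(100, 10),   torch.nn.ReLU(),
)

h0 = torch.randn(64, 2500)   # a batch of 64 flattened 50x50 images
print(mlp4(h0).shape)        # torch.Size([64, 10])
```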

We emphasize the block structure inherent to the MLP in (73) with a visualization in Figure 3, in which the block non-linear function is depicted “to scale”. The iteration matrix in (73) is decomposed into smaller component matrices of size 100\times 100, except for the final row and column. The last row is decomposed into component matrices of size 10\times 100, and the last column into component matrices of size 100\times 10, leaving a square matrix of dimension 10\times 10 in the bottom right corner. The identity matrix in the top left corner is of size 2500\times 2500, that is, 25\times 25 blocks of size 100\times 100. The blue blocks are trainable; the grey blocks are untrainable.

MMLP4=[I0000fθ100000fθ200000fθ300000fθ40]M_{MLP4}=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0&0\\ \hbox{\pagecolor{green}$f_{\theta_{1}}$}&0&0&0&0\\ 0&\hbox{\pagecolor{green}$f_{\theta_{2}}$}&0&0&0\\ 0&0&\hbox{\pagecolor{green}$f_{\theta_{3}}$}&0&0\\ 0&0&0&\hbox{\pagecolor{green}$f_{\theta_{4}}$}&0\\ \end{bmatrix}
Refer to caption
Figure 3: Side-by-side comparison of an MLP represented as a dynamical system using a block non-linear function. The blue blocks represent trainable weights, the grey blocks represent non-trainable 0 blocks, and the green blocks represent a 2500\times 2500 untrainable identity matrix. Note that this block non-linear function is of the same form as that discussed in §6.2, with all of the properties discussed there.

A question immediately arises when looking at Figure 3. Is the placement of the blue trainable and grey untrainable blocks essential to the operation of the MLP? For example, one can consider the “randomized” dynamical system shown in Figure 4.

MMLP4=[I0000SSSSSSSSSSSSSSSSSSSS]M_{MLP4}=\begin{bmatrix}\hbox{\pagecolor{yellow}$I$}&0&0&0&0\\ \hbox{\pagecolor{green}$S$}&\hbox{\pagecolor{yellow}$S$}&\hbox{\pagecolor{magenta}$S$}&\hbox{\pagecolor{magenta}$S$}&\hbox{\pagecolor{magenta}$S$}\\ \hbox{\pagecolor{cyan}$S$}&\hbox{\pagecolor{green}$S$}&\hbox{\pagecolor{yellow}$S$}&\hbox{\pagecolor{magenta}$S$}&\hbox{\pagecolor{magenta}$S$}\\ \hbox{\pagecolor{cyan}$S$}&\hbox{\pagecolor{cyan}$S$}&\hbox{\pagecolor{green}$S$}&\hbox{\pagecolor{yellow}$S$}&\hbox{\pagecolor{magenta}$S$}\\ \hbox{\pagecolor{cyan}$S$}&\hbox{\pagecolor{cyan}$S$}&\hbox{\pagecolor{cyan}$S$}&\hbox{\pagecolor{green}$S$}&\hbox{\pagecolor{yellow}$S$}\\ \end{bmatrix}
Refer to caption
Figure 4: A randomized dynamical system. While the positions of the trainable parameters are quite different when compared to Figure 3, many other properties are similar. For example, the number of trainable parameters and the dimension of the input, hidden, and output spaces are all identical to (73). This block non-linear function combines the features of all the block non-linear functions in §6.2, 6.3, and, most importantly, 6.4. Despite these differences, we will demonstrate that the performance of this dynamical system is surprisingly indistinguishable from that of the MLP in Figure 3.

Using the dynamical system architectures in Figure 3 and Figure 4, a traditional MNIST classification experiment was conducted, with the former dynamical system corresponding exactly to a standard MLP. In contrast, the dynamical system in Figure 4 shares the same number of trainable parameters and the same dimensions of the input, hidden, and output spaces, but allocates those trainable parameters in a very different pattern. Once the block non-linear functions had been created, 10 training runs were performed on the data set using the traditional four-layer MLP representation. We then followed these 10 runs of the traditional MLP with 20 runs using randomized dynamical systems to explore the space of random configurations of trainable blocks and compare their performance. For both architectures in Figures 3 and 4 we use the same form of the initial vector containing \mathbf{h}_{0}, as in (51) and shown below

[𝐡00000].\displaystyle\begin{bmatrix}\mathbf{h}_{0}\\ 0\\ 0\\ 0\\ 0\\ \end{bmatrix}. (74)

We emphasize the following points:

  • For experiments that use the model in Figure 3, the values contained in the blue blocks are randomly initialized.

  • For experiments that use the model in Figure 4, both the placement of the blue values and the initial values contained in the blue blocks are randomized.

  • For all experiments, the initial vector in (74) is not random; it is fixed by the training data \mathbf{h}_{0} and the placement of the zero blocks (a sketch of forming this initial state follows the list).
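The following is a minimal sketch of forming the initial state in (74) for a batch of images: the flattened image fills the first block, and the remaining blocks are zeros whose widths are taken from the layer dimensions in (73).

```python
# A sketch of the initial state (74): the flattened 50x50 image occupies the
# first block and the remaining blocks are zero, with widths taken from (73).
import torch

def initial_state(images):                    # images: (batch, 1, 50, 50)
    h0 = images.flatten(start_dim=1)          # (batch, 2500)
    batch = h0.shape[0]
    return [h0,
            torch.zeros(batch, 500),
            torch.zeros(batch, 200),
            torch.zeros(batch, 100),
            torch.zeros(batch, 10)]
```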

Following the process outlined above, experiments were conducted with results that are somewhat counterintuitive to prevailing theory regarding the principles and importance of network architecture. As an illustration of these results, the two very different networks shown in Figure 3 and Figure 4 were trained using the Adam optimizer from the PyTorch library with a learning rate of 1e-3. The resulting training profiles are presented in Figure 5. Despite a stark difference in the distribution of the trainable parameters, both networks followed the same training profile, with the final testing accuracy of the two models being virtually indistinguishable at 92.7%. These results, which demonstrate an unexpected robustness, indicate that factors other than the layer-based structure govern overall model performance.
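For completeness, a training loop along the lines described above is sketched below, reusing mlp4 and loader from the earlier sketches. The use of cross-entropy loss, and reading the final 10-dimensional block as class logits, are assumptions consistent with standard MNIST classification rather than details taken from the text.

```python
# A sketch of the training setup described above: Adam with learning rate 1e-3
# for 100 epochs. `mlp4` and `loader` come from the earlier sketches; the loss
# function is an assumption.
import torch

optimizer = torch.optim.Adam(mlp4.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        logits = mlp4(x.flatten(start_dim=1))  # (batch, 2500) -> (batch, 10)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
```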

Refer to caption
Figure 5: A comparison of training loss over 100 epochs between the layered architecture of Figure 3 and the randomized architecture of Figure 4. “Model 0” is the original MLP and “Model 1” is the randomized architecture. The performance curves are indistinguishable despite the dissimilar architectures.

Figure 6 shows 10 training curves for a four-layer MLP. Each of the training curves begins with a different random initialization. We see that all initializations result in similar training profiles.

Refer to caption
Figure 6: Performance curves demonstrating training loss over 100 epochs for 10 different random initializations of models with the layered architecture (Figure 3). As might be expected, the difference between the training of randomly initialized layered architectures is small.

In Figure 7 we plot the layered training loss curves from Figure 6 in blue before overlaying the training loss curves of models with randomized architectures (Figure 4) in gold. The results are compelling: when the 20 additional randomized training curves are overlaid, they lie on top of one another. With no similarity in structure, the models train with indistinguishable performance. These results provide evidence that the concepts of layers and neatly defined weight matrices, as in Figure 3, provide no apparent advantage over the randomized architectures in Figure 4. Before seeing these results, one might have assumed that current MLPs represent an optimal network structure, but this now appears unlikely.

Refer to caption
Figure 7: The training curves over 100 epochs with randomly initialized layered architectures are repeated from Figure 6 (blue). Training curves for models with randomly initialized randomized architectures (Figure 4) over 100 epochs are overlaid in gold. The difference between the training of randomly initialized layered architectures is similar to the difference between the layered and randomized model architectures. This is quite surprising since the randomized architectures appear to be so different from the layered architectures.

9 Summary

The initial results in this paper demonstrate that, despite major structural changes, randomized and layered architectures show little variation in training. This observation raises the question of whether the layered structure is ultimately irrelevant. Specifically, if the arrangement of neatly demarcated layers is not a central driver of performance once the total number of trainable parameters is held fixed, then the question becomes which factor or factors govern the performance of trained dynamical systems. This question is explored further in Hershey et al. (2023); Hershey (2023).

A dynamical systems perspective offers a generalized context for exploring RNNs, MLPs, and a vast array of other neural network architectures. The results suggest that layer-wise architectural constraints have little impact on training performance, at least for some problems. Other parameters, such as the number of iterations (in the context of a dynamical system), sparsity, and other less commonly studied parameters, may have a profound influence on model behavior. We have only begun exploring this large space of machine learning models Hershey et al. (2023); Hershey (2023); Pathak et al. (2023), and we hope this paper will encourage others to study this domain.

One important aspect of this work is that the number of iterations in the context of a dynamical system and the number of layers in a neural network are related but distinct concepts. In particular, a layered architecture dictates the number of iterations of the dynamical system (see our discussion of fixed points). However, a broader view of training dynamical systems decouples the architecture from the number of iterations. The randomized dynamical system in Figure 4 was trained and tested exclusively over four iterations, but that is not required. Exploring the relationship between different training paradigms, given the freedom provided by the dynamical systems approach, provides fertile ground for exploration.

The effect of noise in the data for both types of architectures remains an open issue. A sensitivity analysis could provide insight into the effect of architecture on stability. This is a natural question in a dynamical systems context with its concept of stable and unstable manifolds, but has no obvious analogue in standard approaches to the construction and analysis of neural networks.

Continuation and bifurcation analysis would seem to be another rich field of study for practical methods of training dynamical systems to solve real-world problems. For instance, we have already demonstrated that GPT2's performance can be substantially improved through a parameter continuation methodology, a common technique in dynamical systems, inspired by the results of this paper Pathak et al. (2023). Since the simple Sequential construction shown in (32) is the basis of all MLPs, as well as many other neural networks, and can be embedded within the Sequential2D framework, it is reasonable to expect this new approach to provide improvements.

A disadvantage of a dynamical systems approach, where a fixed map is iterated many times to achieve some goal, lies in the training of such a map. For example, the gradients that one computes using automatic differentiation are more complicated in this case. The existence of advanced optimization libraries such as PyTorch and TensorFlow ameliorates many of the technical issues in computing such gradients efficiently. However, this does not directly address problems such as gradient collapse and gradient explosion. Yet, within the context of dynamical systems, natural analogues exist for these phenomena: the notions of stable and unstable manifolds correspond to gradient collapse and gradient explosion, respectively. Further, many techniques in dynamical systems exist for addressing exactly these types of problems.

As soon as one thinks about training dynamical systems to solve practical problems, many interesting avenues for theoretical and performance analysis arise. In particular, Koopman operators are an active area of mathematical research precisely concerned with data-oriented dynamical systems. Our proposed approach allows us to immediately leverage this body of work.

The stepwise nature of trained dynamical systems leads to an interesting trade-off between complicated dynamical systems that achieve some practical output in a fewer number of iterations versus simple dynamical systems that achieve the same output but only after a greater number of iterations. While seemingly a bad trade-off, having a simple dynamical system that achieves a given result does provide advantages in analysis, explainability and safety. In particular, the use of simple dynamical systems over many iterations would seem to address some of the goals of mechanistic interpretability in AI Zhang et al. (2021); Kästner and Crook (2023); Shen et al. (2023); Michaud et al. (2024).


Acknowledgments and Disclosure of Funding

This research was carried out using computational resources supported by the Academic and Research Computing group at Worcester Polytechnic Institute.

References

  • Chang et al. (2017) B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
  • Chen et al. (2018) R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
  • Chollet et al. (2015) F. Chollet et al. Keras. https://keras.io, 2015.
  • Croitoru et al. (2023) F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Falcon and The PyTorch Lightning team (2019) W. Falcon and The PyTorch Lightning team. PyTorch Lightning, 3 2019. URL https://github.com/Lightning-AI/lightning.
  • Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • Gu and Dao (2023) A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • Gu et al. (2021) A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • Gunther et al. (2020) S. Gunther, L. Ruthotto, J. B. Schroder, E. C. Cyr, and N. R. Gauger. Layer-parallel training of deep residual neural networks. SIAM Journal on Mathematics of Data Science, 2(1):1–23, 2020.
  • Han et al. (2019) J. Han, Q. Li, et al. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):1–41, 2019.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hershey (2023) Q. Hershey. Exploring Neural Network Structure through Iterative Neural Networks: Connections to Dynamical Systems. PhD thesis, Worcester Polytechnic Institute, 2023.
  • Hershey et al. (2023) Q. Hershey, R. C. Paffenroth, and H. Pathak. Exploring neural network structure through sparse recurrent neural networks: A recasting and distillation of neural network hyperparameters. In 2023 22nd IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, 2023.
  • Huang et al. (2017) G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Kästner and Crook (2023) L. Kästner and B. Crook. Explaining AI through mechanistic interpretability. 2023.
  • LeCun et al. (2010) Y. LeCun, C. Cortes, C. Burges, et al. MNIST handwritten digit database, 2010.
  • LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Lipton et al. (2015) Z. C. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
  • Liu and Theodorou (2019) G.-H. Liu and E. A. Theodorou. Deep learning theory review: An optimal control and dynamical systems perspective. arXiv preprint arXiv:1908.10920, 2019.
  • maintainers and contributors (2016) TorchVision maintainers and contributors. TorchVision: PyTorch's Computer Vision library, 11 2016. URL https://github.com/pytorch/vision.
  • Michaud et al. (2024) E. J. Michaud, I. Liao, V. Lad, Z. Liu, A. Mudide, C. Loughridge, Z. C. Guo, T. R. Kheirkhah, M. Vukelić, and M. Tegmark. Opening the AI black box: program synthesis via mechanistic interpretability. arXiv preprint arXiv:2402.05110, 2024.
  • Miljanovic (2012) M. Miljanovic. Comparative analysis of recurrent and finite impulse response neural networks in time series prediction. Indian Journal of Computer Science and Engineering, 3(1):180–191, 2012.
  • Murtagh (1991) F. Murtagh. Multilayer perceptrons for classification and regression. Neurocomputing, 2(5):183–197, 1991. ISSN 0925-2312. doi: https://doi.org/10.1016/0925-2312(91)90023-5. URL https://www.sciencedirect.com/science/article/pii/0925231291900235.
  • Narang et al. (2017) S. Narang, E. Undersander, and G. Diamos. Block-sparse recurrent neural networks, 2017.
  • Pascanu et al. (2013) R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA, 6 2013. PMLR. URL https://proceedings.mlr.press/v28/pascanu13.html.
  • Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • Pathak et al. (2023) H. Pathak, R. C. Paffenroth, and Q. Hershey. Sequential2d: Organizing center of skip connections for transformers. In 2023 22nd IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, 2023.
  • Radhakrishnan et al. (2020) A. Radhakrishnan, M. Belkin, and C. Uhler. Overparameterized neural networks implement associative memory. Proceedings of the National Academy of Sciences, 117(44):27162–27170, 2020.
  • Rasamoelina et al. (2020) A. D. Rasamoelina, F. Adjailia, and P. Sinčák. A review of activation function for artificial neural network. In 2020 IEEE 18th World Symposium on Applied Machine Intelligence and Informatics (SAMI), pages 281–286, 2020. doi: 10.1109/SAMI48414.2020.9108717.
  • Rumelhart and McClelland (1987) D. E. Rumelhart and J. L. McClelland. Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, 1987.
  • Salehinejad et al. (2017) H. Salehinejad, S. Sankar, J. Barfett, E. Colak, and S. Valaee. Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078, 2017.
  • Schäfer and Zimmermann (2006) A. M. Schäfer and H. G. Zimmermann. Recurrent neural networks are universal approximators. In Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16, pages 632–640. Springer, 2006.
  • Shen et al. (2023) T. Shen, R. Jin, Y. Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y. Liu, and D. Xiong. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023.
  • Sherstinsky (2020) A. Sherstinsky. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena, 404:132306, mar 2020. doi: 10.1016/j.physd.2019.132306.
  • Siegelmann and Sontag (1991) H. T. Siegelmann and E. D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991.
  • Weinan (2017) E. Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 1(5):1–11, 2017.
  • Yang et al. (2023) L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
  • Yeung et al. (2019) E. Yeung, S. Kundu, and N. Hodas. Learning deep neural network representations for koopman operators of nonlinear dynamical systems. In 2019 American Control Conference (ACC), pages 4832–4839. IEEE, 2019.
  • Zhang et al. (2021) Y. Zhang, P. Tiňo, A. Leonardis, and K. Tang. A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5):726–742, 2021.