Rethinking the Relationship between Recurrent and Non-Recurrent Neural Networks: A Study in Sparsity
Abstract
Neural networks (NN) can be divided into two broad categories, recurrent and non-recurrent. Both types of neural networks are popular and extensively studied, but they are often treated as distinct families of machine learning algorithms. In this position paper, we argue that there is a closer relationship between these two types of neural networks than is normally appreciated. We show that many common neural network models, such as Recurrent Neural Networks (RNN), Multi-Layer Perceptrons (MLP), and even deep multi-layer transformers, can all be represented as iterative maps.
The close relationship between RNNs and other types of NNs should not be surprising. In particular, RNNs are known to be Turing complete, and therefore capable of representing any computable function (such as any other types of NNs), but herein we argue that the relationship runs deeper and is more practical than this. For example, RNNs are often thought to be more difficult to train than other types of NNs, with RNNs being plagued by issues such as vanishing or exploding gradients. However, as we demonstrate in this paper, MLPs, RNNs, and many other NNs lie on a continuum, and this perspective leads to several insights that illuminate both theoretical and practical aspects of NNs.
Keywords: recurrent neural networks, sparsity, iterative maps, deep learning
1 Introduction
This paper recasts the concept of Neural Networks (NN) as iterative maps in order to study the underlying principles and assumptions embodied in traditional NN architectures. Traditional feed-forward multi-layer perceptron (MLP) LeCun et al. (2015) architectures are typically presented as distinct from recurrent neural network (RNN) Rumelhart and McClelland (1987) architectures, which in turn have a reputation of being difficult to train Pascanu et al. (2013); Sherstinsky (2020); Schäfer and Zimmermann (2006). Contrary to this perspective, our research shows that every MLP can be recast as an RNN and much can be learned about RNN architectures by reference to MLP models. This result is consequential since the universality of RNNs has been previously demonstrated Schäfer and Zimmermann (2006) and the Turing Completeness of RNN architectures has been understood since the 1990s Siegelmann and Sontag (1991). Beyond such theoretical considerations, we show how many common NNs can be represented as iterative maps, which we define in Definition 3. Taking an iterative map perspective for MLPs and many other NN architectures leads to several insights that illuminate both theoretical and practical aspects of NNs.
The iterative map perspective proposed here both allows us to prove the equivalence described above and inspires us to study NNs from a discrete dynamical systems perspective. As we will demonstrate in what follows, such a perspective can elucidate many aspects of NN architectures. In particular, such ideas are not only of theoretical interest, but they also lead to improvements in NN implementation and training.
In this paper we develop notation that provides a context for describing many NN architectures as iterative maps. This notation is introduced in §2 and iterated maps, in our context, are defined in Algorithm 1. Of course, many RNNs are inherently iterative maps, and we remind the reader of the iterative structure of RNNs in §3. It is less obvious that MLPs can also be described as iterative maps, and we derive such equations in §4. The results in §3 and §4 allow both RNNs and MLPs to be concisely formulated as dynamical systems in §5. We explore extensions of these simple dynamical systems in §6. Beyond a theoretical nicety, we observe that MLPs and many other NNs can efficiently be implemented as iterative maps, and we provide a description of an implementation we call Sequential2D in §7. In §8, we demonstrate that the iterative map perspective for MLPs can lead to surprising numerical results. In particular, we show that random NN architectures are equivalent, in the appropriate sense, to more standard layer-wise approaches. Further exploration of new classes of methods that are natural extensions to existing methods will be the subject of future publications.
Deep networks have also been considered as discretizations of continuous dynamical systems, with training viewed as an optimal control problem. This perspective has created an enormous literature, and an exhaustive description is challenging. As an introduction to this area, we point to the early work of Chen et al. (2018); Weinan (2017); Chang et al. (2017); Han et al. (2019); Gunther et al. (2020). An early review appears in Liu and Theodorou (2019). Also, dynamical systems already play several important roles in state-of-the-art NNs, as can be seen in reviews such as Liu and Theodorou (2019). However, several specific parts of the current literature bear special attention. For example, Koopman operators have an important role to play in analyzing NNs Yeung et al. (2019). Also, several recent state-of-the-art models are seemingly inspired by ideas similar to those that inspire our work, such as diffusion models Croitoru et al. (2023); Yang et al. (2023) and structured state space models Gu et al. (2021) such as MAMBA Gu and Dao (2023). Finally, iterated auto-encoders and their interpretation in terms of associative memory Radhakrishnan et al. (2020) are quite close in spirit to the ideas that we present here. It is our hope that our contributions in this paper serve to further underscore the important connections between NNs and dynamical systems.
2 Notation
As function composition is a fundamental building block of NNs and iterative maps in general, it is important to establish a notation for composing functions. We begin with the following notational conventions for scalars, vectors and matrices as shown in Table 1.
| Notation | Description | Example |
|---|---|---|
| $a$ | Scalar | $a \in \mathbb{R}$ |
| $x$ | Column vector | $x \in \mathbb{R}^{n_x}$ |
| $n_x$ | The number of entries in $x$ | $n_x = 3$ |
| $W$ | Matrix | $W \in \mathbb{R}^{m \times n}$ |
We will use the function composition operator $\circ$ to write
$$g \circ f \circ x = g(f(x)) \tag{1}$$
for vector-valued functions $f$ and $g$. To be clear, when we write $g \circ f \circ x$ we intend that the argument $x$ will only appear as the right-most entry, while $f$ and $g$ are functions. We then slightly abuse notation by defining a block non-linear function as
$$\begin{bmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{bmatrix} \circ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} f_{11}(x_1) + f_{12}(x_2) \\ f_{21}(x_1) + f_{22}(x_2) \end{bmatrix} \tag{2}$$
It’s important to emphasize that the block non-linear function defined in equation (2) is not a matrix; instead, it’s a separable function. This function operates on a vector comprising two input vectors and produces a vector consisting of two output vectors. Also, it’s worth noting that the choice of combining the block outputs within each row isn’t restricted to addition alone. However, for the sake of simplicity and to maintain resemblance to standard block matrix equations, addition is employed in this paper.
Moreover, if the functions $f_{ij}$ are linear, the block non-linear function in equation (2) simplifies to a block matrix equation. In essence, for linear maps, the function composition operator and the dot product are equivalent.
Definition 1
Given functions $f_{ij} : \mathbb{R}^{n_j} \to \mathbb{R}^{m_i}$ for $i = 1, \ldots, p$ and $j = 1, \ldots, q$, where each $f_{ij}$ may be linear or non-linear, we define a block non-linear function as
$$F := \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1q} \\ \vdots & & & \vdots \\ f_{p1} & f_{p2} & \cdots & f_{pq} \end{bmatrix} \tag{3}$$
The action of $F$ on a vector $x$ partitioned into blocks $x_j \in \mathbb{R}^{n_j}$ is then defined as
$$F \circ \begin{bmatrix} x_1 \\ \vdots \\ x_q \end{bmatrix} := \begin{bmatrix} f_{11}(x_1) + f_{12}(x_2) + \cdots + f_{1q}(x_q) \\ \vdots \\ f_{p1}(x_1) + f_{p2}(x_2) + \cdots + f_{pq}(x_q) \end{bmatrix} \tag{4}$$
thus, in many ways, (3) is a straightforward generalization of a block matrix.
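To make Definition 1 concrete, the following is a minimal NumPy sketch of a block non-linear function acting on a stacked vector. The helper `block_apply` and the example functions are our own illustrative choices, not code from the paper; only the row-wise summation convention of (4) is taken from the text.

```python
import numpy as np

def block_apply(F, blocks_in, sizes_out):
    """Apply a 2D grid of functions F[i][j] to a list of input blocks.

    F[i][j] maps a vector of size len(blocks_in[j]) to a vector of size
    sizes_out[i], or is None for the zero-function.  Outputs within a row
    are combined by addition, mirroring equation (4).
    """
    out = []
    for i, row in enumerate(F):
        acc = np.zeros(sizes_out[i])
        for j, f in enumerate(row):
            if f is not None:
                acc = acc + f(blocks_in[j])
        out.append(acc)
    return out

# Example: a 2x2 block non-linear function [[f11, 0], [f21, f22]]
F = [[np.tanh,                 None],
     [lambda u: 2.0 * u[:2],   lambda v: v ** 2]]
x = [np.ones(3), np.arange(2.0)]          # two stacked input blocks
y = block_apply(F, x, sizes_out=[3, 2])   # two stacked output blocks
```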
When working with NNs, training algorithms are often considered in terms of mini-batches, where a set of training examples are processed together.¹
¹ In the NN literature, parameter matrices are often applied to a mini-batch using a left dot-product, such as $XW$, where $X$ is a mini-batch of training examples and $W$ is a parameter matrix. In this paper, we use the right dot-product, such as $WX$, which is more common in the linear algebra literature and matches the function composition notation $f \circ x$.
Definition 2
Given matrices $X_1, \ldots, X_q$, whose columns are the training examples of a mini-batch, the block non-linear function $F$ acts on the mini-batch as
(5) |
where each function acts columnwise on the given input matrix. This gives rise to
(6) |
In what follows we will often wish to iterate a fixed block non-linear function $F$. In particular, a defining characteristic of a discrete iterated map is that the map of interest is fixed and applied a number of times to a given input, as in $F \circ F \circ F \circ x_0$. Accordingly, we define an iterated block non-linear function as below.
Definition 3
An iterative map is a fixed block non-linear function $F$ applied repeatedly to an initial state $x_0$, that is, the sequence $x_{k+1} = F \circ x_k$ for $k = 0, 1, 2, \ldots$
3 Recurrent Neural Networks (RNNs)
3.1 Traditional development of RNNs
A recurrent neural network (RNN) is a type of neural network that is designed to process sequential data, e.g., data that is indexed by time. Let $h_t$ be the state of the system, and $x_t$ be the input data or forcing function. As in Goodfellow et al. (2016); Pascanu et al. (2013), an RNN can be written as
$$\text{Given } h_0, \qquad h_t = f(h_{t-1}, x_t; \theta), \quad t = 1, 2, \ldots \tag{7}$$
where $f(\,\cdot\,,\,\cdot\,;\theta)$ is a family of functions parameterized by $\theta$. In other words, $h_0$ is the initialization of the hidden state. Equivalently, equation (7) defines a discrete-time dynamical system for $h_t$, where the function $f$ is the dynamics of the system and $x_t$ is a forcing term. The study of RNNs revolves around the choice of the function family $f$, with many interesting sub-classes of RNNs having different properties.
A specific family of functions that is widely used in the RNN literature Goodfellow et al. (2016); Pascanu et al. (2013); Salehinejad et al. (2017); Lipton et al. (2015) is a family of affine functions with non-linear activations, e.g.,
$$h_t = \sigma\!\left(W h_{t-1} + U x_t + b\right) \tag{8}$$
where $W$ and $U$ are weight matrices, $b$ is a bias vector, and $\sigma$ is a non-linear activation function, most commonly a $\tanh$.
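For concreteness, here is a minimal NumPy sketch of the update (8); the dimensions, random weights, and zero initialization are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x = 4, 3                       # hidden and input dimensions (illustrative)
W = rng.normal(size=(n_h, n_h))       # hidden-to-hidden weight matrix
U = rng.normal(size=(n_h, n_x))       # input-to-hidden weight matrix
b = np.zeros(n_h)                     # bias vector

def rnn_step(h, x):
    """One application of the affine-plus-tanh RNN update in (8)."""
    return np.tanh(W @ h + U @ x + b)

h = np.zeros(n_h)                     # h_0
xs = [rng.normal(size=n_x) for _ in range(3)]
for x in xs:                          # produces h_1, h_2, h_3
    h = rnn_step(h, x)
```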
3.2 RNNs as iterated maps
The RNN in equation (8) can be written as a block non-linear function
(9) |
where the hidden-state block is acted on by a linear function, the input block is acted on by an affine function, and the non-linear activation $\sigma$ is applied to their sum. In general, however, these block entries may themselves be nonlinear functions, but the choice in (8) is common in the RNN literature Goodfellow et al. (2016); Pascanu et al. (2013); Salehinejad et al. (2017); Lipton et al. (2015).
The trainable parameter set is
$$\theta = \{W, U, b\} \tag{10}$$
RNNs can therefore be viewed as a forced discrete dynamical system whose state space is that of the hidden state, and the function
$$f(h, x; \theta) = \sigma\!\left(W h + U x + b\right) \tag{11}$$
is the dynamics of the system. Given $h_0$, the sequence provided by (8), or equivalently (11), is
$$h_1 = f(h_0, x_1; \theta), \quad h_2 = f(h_1, x_2; \theta), \quad h_3 = f(h_2, x_3; \theta), \;\ldots \tag{12}$$
To emphasize the iterative nature of the RNN, we extend the definition of the RNN in (8) by defining
$$f_t(h) := f(h, x_t; \theta) \tag{13}$$
and, using this notation and appealing to the standard notation for function composition, we see that
$$h_3 = f_3(f_2(f_1(h_0))) = f_3 \circ f_2 \circ f_1 \circ h_0 \tag{14}$$
Equivalently, we can iterate the following block non-linear function three times and recover the output of (14) in the final component of the vector. While this notation appears superfluous, it demonstrates that we create a fixed map built from the fixed function $f$. This approach will prove to be particularly useful looking ahead to §4 when we consider MLPs. We define
(15) |
We show below that we reproduce the action of (11) by expanding the dimension of the space such that all copies of (11) lie “below the diagonal”. This is an example of an unrolling operation utilizing an increase in dimension. Why do we combine these two operations? As we will see later, by combining these two operations we uncover a unifying representation of RNNs and MLPs. In the general RNN case, the dimension we choose for the lifting grows with the number of times $T$ that the map is applied to produce $h_T$.²
² In the context of MLPs we will use a similar lifting approach to create a fixed map, but in that case the fixed map will be composed of different functions, $f_1, \ldots, f_N$. In general, these functions map from and to vector spaces having different dimensions, as indicated in (3); the dimension of the fixed map in the MLP case is therefore the sum of the layer dimensions.
Given the initial state
(16) |
the first iteration of the map gives
(17) |
The second iteration of the map gives
(18) | ||||
and the third iteration of the map results in
(19) | ||||
If we now interpret the last entry in the vector as the output of the map, we observe that it is identical to the output of a three-layer RNN with function/dynamics $f$, forcing terms $x_1$, $x_2$, and $x_3$, and input $h_0$, as shown in (14).
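The sketch below is one possible NumPy realization of this lifting, under the assumption (taken from the text) that the only non-zero blocks of the fixed map are $f_1$, $f_2$, $f_3$ placed below the diagonal; the names, dimensions, and zero initialization are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x = 4, 3
W = rng.normal(size=(n_h, n_h))
U = rng.normal(size=(n_h, n_x))
b = np.zeros(n_h)
xs = [rng.normal(size=n_x) for _ in range(3)]               # forcing terms x_1, x_2, x_3

f_t = [lambda h, x=x: np.tanh(W @ h + U @ x + b) for x in xs]  # f_t(h) = f(h, x_t; theta)

def lifted_map(state):
    """Fixed block map with f_1, f_2, f_3 below the diagonal and zeros elsewhere."""
    s0, s1, s2, _ = state
    return [np.zeros(n_h), f_t[0](s0), f_t[1](s1), f_t[2](s2)]

# Iterate three times starting from [h_0, 0, 0, 0] with h_0 = 0.
state = [np.zeros(n_h) for _ in range(4)]
for _ in range(3):
    state = lifted_map(state)

# Compare with the sequential RNN computation h_3 = f_3(f_2(f_1(h_0))).
h = np.zeros(n_h)
for t in range(3):
    h = f_t[t](h)
assert np.allclose(state[-1], h)   # last block of the lifted state reproduces h_3
```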
3.3 Choice of the initial vector
Given the initial state
(20) |
the first iteration of the map gives
(21) |
The second iteration of the map gives
(22) | ||||
and the third iteration of the map results in
(23) | ||||
The final component after three iterations is equal to (19) and is independent of the remaining entries of the initial state (20).
4 Multilayer perceptrons (MLPs)
4.1 Traditional development of MLPs
Neural networks such as the traditional feed-forward multilayer perceptron Murtagh (1991) architecture can be defined as a set of nested functions. In this representation, each layer of the network, together with its (fixed) nonlinear activation function Rasamoelina et al. (2020), is represented as a function $f_i$. Let $\theta_i$ represent the parameters that define the $i$-th layer. To make the notation consistent we set $h_0$ equal to the input data $x$.³
³ We note that for an MLP the input data appears only once, whereas the forcing function in an RNN changes at every iteration.
Let
$$h_i = f_i(h_{i-1}) = \sigma_i\!\left(W_i h_{i-1} + b_i\right) \tag{24}$$
with $i = 1, \ldots, N$, $h_i \in \mathbb{R}^{n_i}$, and weight matrix $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ and bias vector $b_i \in \mathbb{R}^{n_i}$ forming a parameter set
$$\theta_i = \{W_i, b_i\} \tag{25}$$
Each $f_i$ is thought of as a hidden layer of the neural network. Such a layer is called a dense layer based on the density of the non-zero entries in the matrix $W_i$. Note that the affine function of each layer is fed through the nonlinear activation function $\sigma_i$. The final layer provides the model output $\hat{y} = h_N$. As a concrete example, consider the following three-layer network, where we nest three functions of the form (24) to obtain
$$\hat{y} = \sigma_3\!\left(W_3\,\sigma_2\!\left(W_2\,\sigma_1\!\left(W_1 x + b_1\right) + b_2\right) + b_3\right) \tag{26}$$
or, equivalently and more compactly,
$$\hat{y} = f_3(f_2(f_1(x))) = f_3 \circ f_2 \circ f_1 \circ x \tag{27}$$
where
$$f_i(z) := \sigma_i\!\left(W_i z + b_i\right) \tag{28}$$
In general,
$$\hat{y} = f_N(f_{N-1}(\cdots f_1(x) \cdots)) = f_N \circ f_{N-1} \circ \cdots \circ f_1 \circ x \tag{29}$$
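As a concrete illustration of the nested-layer view in (24)–(29), here is a minimal NumPy sketch of a three-layer MLP built as a composition of dense layers; the layer dimensions, ReLU activation, and zero biases are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer(W, b):
    """One dense layer z -> sigma(W z + b), cf. (24)."""
    return lambda z: relu(W @ z + b)

rng = np.random.default_rng(0)
dims = [8, 6, 4, 2]                                   # n_0, n_1, n_2, n_3 (illustrative)
params = [(rng.normal(size=(m, n)), np.zeros(m)) for n, m in zip(dims[:-1], dims[1:])]
f1, f2, f3 = [layer(W, b) for W, b in params]

x = rng.normal(size=dims[0])
y = f3(f2(f1(x)))                                     # nested composition, as in (26)/(27)
```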
4.2 MLPs as iterated maps
Given the input $x$, the iteration (29) gives
(30) | ||||
Let
(31) |
and define
(32) |
with the initial conditions
(33) |
This is an example of a sparse block non-linear function, where the majority of the entries are the zero-function and only a few entries are non-zero. In particular, the non-zero entries are the functions $f_1$, $f_2$, and $f_3$, which are assumed to be nonlinear. The resulting block non-linear function is an autonomous, nonlinear map. Now define
(34) |
We show that three iterations of the map are equivalent to a three-layer MLP.
We apply the map to the initial vector
(35) |
which has length equal to the sum of the layer dimensions. After the first iteration of the map
(36) |
After the second iteration of the map
(37)
and after the third iteration of the map
(38) | ||||
If we now interpret the last entry in the vector as the output of the map, we observe that it is identical to the output of a three-layer MLP with weights $W_1$, $W_2$, and $W_3$ and input $x$, as shown in (27). Note, this is not a special property of this particular MLP, but instead a universal property of MLPs.
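The following NumPy sketch checks this equivalence numerically under the assumption, stated in the text, that the only non-zero blocks are $f_1$, $f_2$, $f_3$ placed below the diagonal; the dimensions and the omission of biases are illustrative simplifications.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
dims = [8, 6, 4, 2]
Ws = [rng.normal(size=(m, n)) for n, m in zip(dims[:-1], dims[1:])]
fs = [lambda z, W=W: relu(W @ z) for W in Ws]          # f_1, f_2, f_3 (biases omitted)

def F(state):
    """Fixed block map with f_1, f_2, f_3 below the diagonal and zeros elsewhere."""
    s0, s1, s2, _ = state
    return [np.zeros(dims[0]), fs[0](s0), fs[1](s1), fs[2](s2)]

x = rng.normal(size=dims[0])
state = [x] + [np.zeros(d) for d in dims[1:]]          # initial vector [x, 0, 0, 0]
for _ in range(3):                                     # three iterations of the map
    state = F(state)

assert np.allclose(state[-1], fs[2](fs[1](fs[0](x))))  # equals the nested MLP output
# A fourth iteration no longer depends on x: the map is finite impulse (cf. §4.3.1).
```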
4.3 MLPs with additional iterations
4.3.1 Finite impulse response
In §4.2 it is important to recognize that (32), starting with the initial conditions (33), must be iterated once per layer (three times in our example) to achieve an outcome equivalent to (29). Further iterations lead to interesting results, as shown below.
Applying the map one additional time reveals
(39) | ||||
which is independent of the input $x$. Accordingly, it is important to note that three iterations of the map are identical to the original three-layer MLP, but four iterations yield a totally different operator. In fact, the four-times-iterated map is finite impulse in that its output does not depend on the input data $x$.
4.3.2 Fixed point
An additional application of the map, while seemingly inane from the point of view of NNs, does have interesting implications from the point of view of dynamical systems. In particular, one may observe that
(40) | ||||
which in addition, shows that
(41) |
is a fixed point of the map. Such examples underscore the tight, and perhaps surprising, relationship between NNs and dynamical systems, and the dynamical systems point of view in turn provides further insight into this connection.
4.4 Choice of the initial vector
We apply the map to the initial vector
(42) |
where the entries other than $x$ are arbitrary, and which again has length equal to the sum of the layer dimensions. After the first iteration of the map
(43) |
After the second iteration of the map
(44)
and after the third iteration of the map
(45) | ||||
The solution after three iterations is identical to that in (38). Clearly then, further iterations result in
(46) | ||||
and
(47) | |||
Note, again, that the final result is independent of the choice of the initial vector.
5 RNNs and MLPs as dynamical systems
When viewed as iterative maps (dynamical systems), RNNs and MLPs have identical structures as shown in (19) and (38), namely
(48) |
The difference lies in the respective parameter sets and in the definitions of the component functions, specifically
(49) | ||||
and | ||||
where the forcing terms $x_t$ appear only in the case of the RNN and, similarly, the initialization data plays the same role for both the MLP and the RNN. Thus, we observe that both RNNs and MLPs may be written in the form of (48); the only difference lies in how the component functions are defined.
The iterative map in (48) plays two key roles in our work:
- It provides a unifying perspective for understanding the relationship between MLPs and RNNs. In fact, they are both dynamical systems and can be explored from that point of view.
- Equation (48) also underscores the fact that a dynamical systems perspective inspires one to define and explore families of functions which are more general than those normally considered as MLPs and RNNs. In fact, each of the blocks in (48), when trained and allowed to be non-zero, leads to a new family of functions with interesting theoretical and practical properties.
Exploring these two aspects of (48) will comprise the remainder of this text.
6 Modified MLP as an infinite impulse response map
6.1 Map with a non-zero diagonal entry
It is interesting to note, and will form the foundation of much future work, that seemingly minor modifications to systems such as (38) can lead to important differences in the resulting dynamical systems. For example, adding an identity matrix to the top-left corner of the map results in an infinite impulse response map. Define⁴
⁴ Because of the identity block on the diagonal of the map, it is not a sequential operator. However, we maintain the name Sequential2D for its implementation to engender the intuition that Sequential2D is a container for (non-linear) functions just like the more common Sequential container.
(50) |
When started at the state
(51) |
after the first iteration of the map
(52) |
After the second iteration of the map
(53)
and after the third iteration of the map
(54) | ||||
We observe that the last entry is exactly the same (vector) value as the output of the corresponding MLP with weights $W_1$, $W_2$, and $W_3$ and input $x$. Further iterations give
(55) | ||||
which is a fixed point that depends on the input $x$; hence the map is a dynamical system with infinite impulse response.
6.2 Choice of the initial vector
In §6.1 we observed that the result of the iterative map depends only on $x$ and is invariant to the other entries of the initial vector, as long as those entries are zero. However, if, as we claim, the iterative map is equivalent to an MLP, then the result should be invariant to the other entries of the initial vector regardless of their value. Accordingly, in this section we consider the same iterative map as in §6.1, but with a different initial vector. In particular, we consider the initial vector
(56) |
after the first iteration of the map
(57)
After the second iteration of the map
(58)
and after the third iteration of the map
(59) | ||||
identical to (54) and independent of the other entries of the initial vector. Clearly then
(60) | ||||
which is a fixed point that depends on the input $x$ but is independent of the other entries of the initial vector.
6.3 Choice of the initial vector with below diagonal blocks
In this section we stray from the standard MLP models and consider a more complicated block non-linear function. The dynamical systems perspective enables us to consider structures that are not so evident from the traditional viewpoint. In particular, we begin with the same initial vector
(61) |
but consider the block non-linear function
(62) |
Such a block non-linear function can be thought of as an MLP with a skip connection Huang et al. (2017); He et al. (2016) from the input to the last layer.
In this case, after the first iteration of the map
(63)
After the second iteration of the map
(64)
and after the third iteration of the map
(65) | ||||
which is independent of the other entries of the initial vector. Clearly then
(66) | ||||
Again, we see that the fixed point depends on the input $x$ but is independent of the other entries of the initial vector. Of course, this fixed point differs from the fixed point in §6.1 and §6.2 because of the added skip connection. However, we have preserved the property that the fixed point is independent of the arbitrary entries of the initial vector, which is required of standard MLPs with skip connections Huang et al. (2017); He et al. (2016).
6.4 Choice of the initial vector with above diagonal blocks
Now, we consider a more complicated block non-linear function. We know that RNNs are recurrent Goodfellow et al. (2016) and, in many interesting cases, infinite-impulse Miljanovic (2012). Accordingly, their final output normally depends on the value of their initial state. Exploiting the dynamical system perspective again, it is now easy to represent systems that are no longer simple feed forward networks. In this section we demonstrate such a dependence in a block non-linear function.
Similar to Section 6.2 in this section we consider the same initial vector
(67) |
but consider the block non-linear function
(68) |
As there is now a function above the diagonal, this block non-linear function is infinite-impulse Miljanovic (2012) unless that function is specifically chosen to be the zero-function, or some similarly degenerate choice.
To demonstrate the dependence of (68) on the initial vector (67) we do the same calculation as before. In particular, after the first iteration of the map
(69)
After the second iteration of the map
(70)
and after the third iteration of the map
(71) | ||||
Now, the fourth iteration of the map reveals that
(72) | ||||
which is not a fixed point and depends on the input $x$ as well as the other entries of the initial vector. Accordingly, this block non-linear function is not equivalent to an MLP, but instead displays the infinite impulse response of an RNN. Of course, if we set the above-diagonal function to be zero, then the above calculation reduces to the calculation in §6.2 and the fixed point is again independent of the other initial entries; this demonstrates that MLPs and RNNs lie on a continuum of block non-linear functions.
7 Implementation
As discussed in Pathak et al. (2023), there are many references in the literature that propose a block matrix structure for NNs. Examples include Narang et al. (2017), where the focus is on reducing the computational complexity and memory requirements of RNNs while simultaneously maintaining accuracy. Narang et al. (2017) also note that block sparsity can lead to better performance on hardware architectures, such as Graphics Processing Units, that support block-sparse matrix multiplication.
Having derived various theoretical connections between MLPs and RNNs, we now take a brief detour into the implementation of such methods. In particular, the block non-linear function from (3) is closely related to the commonly used Sequential container that appears in both of the deep learning libraries PyTorch Paszke et al. (2017) and Keras Chollet et al. (2015). In fact, the function (3) can be thought of as a two-dimensional generalization of the Sequential container, and we call this generalization Sequential2D, which is described in more detail in Pathak et al. (2023). A brief review of github.com⁵ reveals the popularity of the Sequential architecture. As discussed in Pathak et al. (2023), we noted 380,000 files leveraging PyTorch’s or Keras’s Sequential implementation. Accordingly, the theoretical ideas described in this paper, and the practical implementation discussed in Pathak et al. (2023), promise to be able to impact many Neural Network (NN) applications.
⁵ Data gathered on 5-16-2023.
For the purposes of this paper, we merely observe that Sequential2D Pathak et al. (2023) refers to a 2D matrix of functions, as opposed to the 1D array of functions in a PyTorch or Keras Sequential container. Specifically, closely following Pathak et al. (2023), we observe that Sequential2D is a PyTorch module that allows one to define a neural network by specifying a two-dimensional array of modules. Each individual module can be any type of layer or operation supported by PyTorch Paszke et al. (2017). Practically, Sequential2D can be used to implement a variety of neural network models, including MLPs, RNNs, and many other NNs.
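To make the idea of a two-dimensional container concrete, the following is a minimal, illustrative PyTorch sketch of what such a module could look like. It is not the Sequential2D implementation of Pathak et al. (2023); the class name, arguments, and the addition-of-row-outputs convention (taken from Definition 1) are our own simplifications.

```python
import torch
import torch.nn as nn

class Sequential2DSketch(nn.Module):
    """Illustrative 2D generalization of nn.Sequential.

    blocks[i][j] is a module mapping input block j to output block i, or
    None for the zero-function; row outputs are summed, as in Definition 1.
    """

    def __init__(self, blocks, out_sizes):
        super().__init__()
        self.out_sizes = out_sizes
        self.grid = [[None] * len(row) for row in blocks]
        self.mods = nn.ModuleDict()
        for i, row in enumerate(blocks):
            for j, m in enumerate(row):
                if m is not None:
                    key = f"{i}_{j}"
                    self.mods[key] = m
                    self.grid[i][j] = key

    def forward(self, xs):
        # xs: list of input block tensors; returns list of output block tensors
        out = []
        for i, row in enumerate(self.grid):
            acc = xs[0].new_zeros(self.out_sizes[i])
            for j, key in enumerate(row):
                if key is not None:
                    acc = acc + self.mods[key](xs[j])
            out.append(acc)
        return out

# e.g., an MLP-like block map with a single trainable block below the diagonal
f = Sequential2DSketch(
    [[None, None],
     [nn.Sequential(nn.Linear(8, 4), nn.ReLU()), None]],
    out_sizes=[8, 4],
)
ys = f([torch.randn(8), torch.zeros(4)])
```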
8 Numerical experiments
Using the Sequential2D module as an implementation for the block non-linear function from Definition 1, numerous experiments have been conducted seeking to examine performance of various models across a range of scenarios and conditions. In particular, the interested reader can look to Pathak et al. (2023); Hershey et al. (2023); Hershey (2023) for examples of the ideas in this paper being applied for practical machine learning problems. We will not attempt to repeat the results from those works here, but merely endeavor to provide a sample of the surprising results that arise from treating MLPs and RNNs as dynamical systems.
8.1 MNIST classification
In the following section we closely follow the presentation and results from Hershey (2023). We study the classic MNIST (Modified National Institute of Standards and Technology) dataset LeCun et al. (2010), which is a common choice for machine learning exercises and features 70,000 labeled images often used for image recognition and classification problems. Each image contains a handwritten single-digit integer ranging from 0 to 9, together with a ground-truth label reflecting the depicted integer. Unmodified MNIST images can be seen in Fig. 1. In all experiments, the MNIST data set was divided into a training subset of 10,000 images, a validation subset of 1,000 images, and a testing subset of 1,000 images, with a batch size of 64, trained for 100 epochs.

Again, following Hershey (2023), several image transformations were applied with the objective of increasing the difficulty and creating more disparate comparisons among model structures. Since model implementations were performed through PyTorch using the PyTorch Lightning wrapper (Falcon and The PyTorch Lightning team (2019)), Torchvision (maintainers and contributors (2016)) was a natural choice for transforms. As a first step, images were resized (torchvision.transforms.Resize) from 28×28 to 50×50, increasing the total number of input features from 784 to 2,500. Additionally, the features of each image are normalized (torchvision.transforms.Normalize) using the mean and standard deviation recommended in the Torchvision documentation for the MNIST data set, as shown in Figure 1.




Once the image has been resized, each image in row 2 of Figure 2 is transformed by randomly erasing (torchvision.transforms.RandomErasing) portions of the images prior to being normalized, creating a more challenging problem. The amount of erasure is in the range (0.02, 0.05). In row 3, each image is transformed after the normalization with a random perspective transformation (torchvision.transforms.RandomPerspective), using a distortion scale value of 0.5. When applied in combination, both transforms from rows 2 and 3 result in the examples shown in row 4 of Figure 2, clearly an even more challenging problem.
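A sketch of such a pipeline using the named Torchvision transforms is shown below. The erasing and perspective probabilities, and the MNIST normalization constants (0.1307, 0.3081), are common defaults we assume here for illustration; they are not values taken from the excerpt above.

```python
from torchvision import transforms

# Illustrative preprocessing pipeline combining the transforms named in the text.
transform = transforms.Compose([
    transforms.ToTensor(),                                 # PIL image -> tensor
    transforms.Resize((50, 50)),                           # 28x28 -> 50x50 (784 -> 2,500 features)
    transforms.RandomErasing(p=1.0, scale=(0.02, 0.05)),   # row 2: random erasure before normalization
    transforms.Normalize((0.1307,), (0.3081,)),            # commonly used MNIST mean/std (assumed)
    transforms.RandomPerspective(distortion_scale=0.5, p=1.0),  # row 3: perspective after normalization
])
```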
8.2 An MLP versus a dynamical system
For each run of the classification experiment for the MNIST data set, a base network was constructed to match the characteristics of a four-layer MLP with layers starting from an input dimension of 2,500 and with layer output dimensions of 500, 200, 100 and 10. Following the notation of §4.2 this gives rise to a block non-linear function of the form
(73) |
with the top-left block being the identity function, $\sigma$ being the standard ReLU activation function, and weight matrices of size $500 \times 2{,}500$, $200 \times 500$, $100 \times 200$, and $10 \times 100$. As demonstrated in §6, such a block non-linear function is identical to a standard MLP with the same layer sizes.
We emphasize the block structure inherent to the MLP in (73) with a visualization in Figure 3, in which the block non-linear function is depicted “to scale”. The iteration matrix in (73) is decomposed into smaller square component matrices of a fixed size, except for the final row and column: the component matrices of the last row have 10 rows, those of the last column have 10 columns, and the matrix in the bottom-right corner remains a square matrix of dimension 10×10. The identity matrix in the top-left-hand corner is of size 2,500×2,500. The blue blocks are trainable, the grey blocks are untrainable.

A question immediately arises when looking at Figure 3. Is the placement of the blue trainable and grey untrainable blocks essential to the operation of the MLP? For example, one can consider the “randomized” dynamical system shown in Figure 4.

Using the dynamical system architectures in Figure 3 and Figure 4, a traditional MNIST classification experiment was conducted, with the former dynamical system corresponding exactly to a standard MLP. In contrast, the dynamical system in Figure 4 shares the same number of trainable parameters and the same dimensions of the input, hidden, and output spaces, but allocates those trainable parameters using a very different distribution. Once the block non-linear functions had been created, 10 training runs were performed on the data set using the traditional four-layer MLP representation. We then followed these 10 runs of the traditional MLP with 20 runs using randomized dynamical systems to explore the space of random configurations of trainable blocks and compare their performance. For both architectures in Figures 3 and 4 we use the same form of the initial vector as in (51), shown below
(74) |
We emphasize the following points:
- For experiments that use the model in Figure 3, the values contained in the blue blocks are randomly initialized.
- For experiments that use the model in Figure 4, both the placement of the blue blocks and the initial values contained in the blue blocks are randomized (a sketch of one way to generate such a random placement follows this list).
- For all experiments, the initial vector in (74) is not random; it is fixed by the training data and the placement of the zeros.
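As a rough illustration of how randomized placements of trainable blocks might be generated for Figure 4, the sketch below samples a fixed budget of trainable blocks uniformly at random over a block grid. The grid shape and block budget used in the example are illustrative placeholders, not the exact values used in the experiments.

```python
import numpy as np

def random_block_mask(n_row_blocks, n_col_blocks, n_trainable, seed=0):
    """Place n_trainable trainable blocks uniformly at random on a block grid.

    A stand-in for the randomized architectures of Figure 4: the grid shape
    and the number of trainable blocks are kept fixed, and only the placement
    of the trainable (blue) blocks changes between runs.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_row_blocks, n_col_blocks), dtype=bool)
    idx = rng.choice(mask.size, size=n_trainable, replace=False)
    mask[np.unravel_index(idx, mask.shape)] = True
    return mask

# e.g., 20 random placements with the same (illustrative) block budget as a layered model
masks = [random_block_mask(34, 34, 138, seed=s) for s in range(20)]
```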
Following the process outlined above, experiments were conducted with results that are somewhat counterintuitive to prevailing theory regarding the principles and importance of network architecture. As an illustration of these results, the two very different networks shown in Figure 3 and Figure 4 were trained using the Adam optimizer from the PyTorch library with a fixed learning rate. The resulting training profiles are presented in Figure 5. Despite a stark difference in the distribution of the trainable parameters, both networks followed the same training profile, with the final testing accuracy of both models being virtually indistinguishable at 92.7%. These results, which clearly demonstrate an unexpected robustness, indicate that factors other than the layer-based structure govern overall model performance.

Figure 6 shows 10 training curves for a four-layer MLP. Each of the training curves begins with a different, random initialization. We see that all initializations result in similar training profiles.

In Figure 7 we plot the layered training loss curves from Figure 6 in blue before overlaying the training loss curves of models with randomized architectures (Figure 4) in gold. The results are compelling. When the 20 additional randomized training curves are overlaid they lie on top of one another. With no similarity in structure, the models train with indistinguishable performance. These results provide evidence that the concepts of layers and neatly defined weight matrices, as exist in Figure 3, seemingly provide no advantage over the randomized architectures in Figure 4. Before seeing these results, one might have assumed that current MLPs represent the optimal network structure, but now this appears unlikely.

9 Summary
The initial results in this paper demonstrate that despite major structural changes, randomized and layered architectures have little training variation. This observation raises the valid question of whether the layered structure was ultimately irrelevant. Specifically, if the arrangement of neatly demarcated layers is not a central feature of performance when normalized for the total number of trainable parameters, then the question becomes what factor or factors may be contributing to the performance of trained dynamical systems. This question is explored further in Hershey et al. (2023); Hershey (2023).
A dynamical systems perspective offers a generalized context for exploring RNNs, MLPs, and a vast array of other neural network architectures. The results suggest that layer-wise architectural constraints have little impact on training performance for at least some problems. Other parameters, such as the number of iterations (in the context of a dynamical system), sparsity, and other less commonly studied parameters, may have a profound influence on model behavior. We have only begun exploring this large space of machine learning models Hershey et al. (2023); Hershey (2023); Pathak et al. (2023), and we hope this paper will encourage others to study this domain.
One important aspect of this work is that the number of iterations in the context of a dynamical system and the number of layers in a neural network are related but distinct concepts. In particular, a layered architecture forces a particular number of iterations of the dynamical system (see our discussion of fixed points). However, a broader view of training dynamical systems decouples the architecture from the number of iterations. The randomized dynamical system in Figure 4 was trained and tested exclusively over four iterations, but that is not required. Exploring the relationship between different training paradigms, given the freedom provided by the dynamical systems approach, provides fertile ground for exploration.
The effect of noise in the data for both types of architectures remains an open issue. A sensitivity analysis could provide insight into the effect of architecture on stability. This is a natural question in a dynamical systems context with its concept of stable and unstable manifolds, but has no obvious analogue in standard approaches to the construction and analysis of neural networks.
Continuation and bifurcation analysis would seem to be another rich field of study for practical methods of training dynamical systems to solve real world problems. For instance, we have already demonstrated that GPT2’s performance can be substantially improved through a parameter continuation methodology inspired by the results of this paper and a common technique in dynamical systems Pathak et al. (2023). Since a simple Sequential construction shown in (32) is the basis of all MLPs, as well as many other neural networks, and can be embedded within the Sequential2D framework, it is reasonable to expect this new approach will provide improvements.
A disadvantage of a dynamical systems approach, where a fixed map is iterated many times to achieve some goal, lies in the training of such a map. For example, the gradients that one computes using automatic differentiation are more complicated in this case. The existence of advanced optimization libraries such as PyTorch and TensorFlow ameliorates many of the technical issues in computing such gradients efficiently. However, this does not directly address problems such as gradient collapse and gradient explosion. Yet, within the context of dynamical systems, natural analogues exist for these characteristics: the notions of stable and unstable manifolds correspond to gradient collapse and gradient explosion, respectively. Further, there exist many techniques in dynamical systems for addressing exactly these types of problems.
As soon as one thinks about training dynamical systems to solve practical problems, there are many interesting avenues for theoretical and performance analysis. In particular, Koopman operators are an active area of mathematical research precisely concerned with data-oriented dynamical systems. Our proposed approach allows us to immediately leverage this body of work.
The stepwise nature of trained dynamical systems leads to an interesting trade-off between complicated dynamical systems that achieve some practical output in fewer iterations and simple dynamical systems that achieve the same output only after more iterations. While seemingly a bad trade-off, having a simple dynamical system that achieves a given result does provide advantages in analysis, explainability, and safety. In particular, the use of simple dynamical systems over many iterations would seem to address some of the goals of mechanistic interpretability in AI Zhang et al. (2021); Kästner and Crook (2023); Shen et al. (2023); Michaud et al. (2024).
Acknowledgments and Disclosure of Funding
This research was carried out using computational resources supported by the Academic and Research Computing group at Worcester Polytechnic Institute.
References
- Chang et al. (2017) B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
- Chen et al. (2018) R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
- Chollet et al. (2015) F. Chollet et al. Keras. https://keras.io, 2015.
- Croitoru et al. (2023) F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- Falcon and The PyTorch Lightning team (2019) W. Falcon and The PyTorch Lightning team. PyTorch Lightning, 3 2019. URL https://github.com/Lightning-AI/lightning.
- Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- Gu and Dao (2023) A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Gu et al. (2021) A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- Gunther et al. (2020) S. Gunther, L. Ruthotto, J. B. Schroder, E. C. Cyr, and N. R. Gauger. Layer-parallel training of deep residual neural networks. SIAM Journal on Mathematics of Data Science, 2(1):1–23, 2020.
- Han et al. (2019) J. Han, Q. Li, et al. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):1–41, 2019.
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Hershey (2023) Q. Hershey. Exploring Neural Network Structure through Iterative Neural Networks: Connections to Dynamical Systems. PhD thesis, Worcester Polytechnic Institute, 2023.
- Hershey et al. (2023) Q. Hershey, R. C. Paffenroth, and H. Pathak. Exploring neural network structure through sparse recurrent neural networks: A recasting and distillation of neural network hyperparameters. In 2023 22nd IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, 2023.
- Huang et al. (2017) G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
- Kästner and Crook (2023) L. Kästner and B. Crook. Explaining AI through mechanistic interpretability. 2023.
- LeCun et al. (2010) Y. LeCun, C. Cortes, C. Burges, et al. MNIST handwritten digit database, 2010.
- LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436–444, 2015.
- Lipton et al. (2015) Z. C. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
- Liu and Theodorou (2019) G.-H. Liu and E. A. Theodorou. Deep learning theory review: An optimal control and dynamical systems perspective. arXiv preprint arXiv:1908.10920, 2019.
- maintainers and contributors (2016) T. maintainers and contributors. TorchVision: PyTorch’s Computer Vision library, 11 2016. URL https://github.com/pytorch/vision.
- Michaud et al. (2024) E. J. Michaud, I. Liao, V. Lad, Z. Liu, A. Mudide, C. Loughridge, Z. C. Guo, T. R. Kheirkhah, M. Vukelić, and M. Tegmark. Opening the AI black box: Program synthesis via mechanistic interpretability. arXiv preprint arXiv:2402.05110, 2024.
- Miljanovic (2012) M. Miljanovic. Comparative analysis of recurrent and finite impulse response neural networks in time series prediction. Indian Journal of Computer Science and Engineering, 3(1):180–191, 2012.
- Murtagh (1991) F. Murtagh. Multilayer perceptrons for classification and regression. Neurocomputing, 2(5):183–197, 1991. ISSN 0925-2312. doi: https://doi.org/10.1016/0925-2312(91)90023-5. URL https://www.sciencedirect.com/science/article/pii/0925231291900235.
- Narang et al. (2017) S. Narang, E. Undersander, and G. Diamos. Block-sparse recurrent neural networks, 2017.
- Pascanu et al. (2013) R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA, 6 2013. PMLR. URL https://proceedings.mlr.press/v28/pascanu13.html.
- Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
- Pathak et al. (2023) H. Pathak, R. C. Paffenroth, and Q. Hershey. Sequential2D: Organizing center of skip connections for transformers. In 2023 22nd IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, 2023.
- Radhakrishnan et al. (2020) A. Radhakrishnan, M. Belkin, and C. Uhler. Overparameterized neural networks implement associative memory. Proceedings of the National Academy of Sciences, 117(44):27162–27170, 2020.
- Rasamoelina et al. (2020) A. D. Rasamoelina, F. Adjailia, and P. Sinčák. A review of activation function for artificial neural network. In 2020 IEEE 18th World Symposium on Applied Machine Intelligence and Informatics (SAMI), pages 281–286, 2020. doi: 10.1109/SAMI48414.2020.9108717.
- Rumelhart and McClelland (1987) D. E. Rumelhart and J. L. McClelland. Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, 1987.
- Salehinejad et al. (2017) H. Salehinejad, S. Sankar, J. Barfett, E. Colak, and S. Valaee. Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078, 2017.
- Schäfer and Zimmermann (2006) A. M. Schäfer and H. G. Zimmermann. Recurrent neural networks are universal approximators. In Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16, pages 632–640. Springer, 2006.
- Shen et al. (2023) T. Shen, R. Jin, Y. Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y. Liu, and D. Xiong. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025, 2023.
- Sherstinsky (2020) A. Sherstinsky. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena, 404:132306, mar 2020. doi: 10.1016/j.physd.2019.132306.
- Siegelmann and Sontag (1991) H. T. Siegelmann and E. D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991.
- Weinan (2017) E. Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 1(5):1–11, 2017.
- Yang et al. (2023) L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
- Yeung et al. (2019) E. Yeung, S. Kundu, and N. Hodas. Learning deep neural network representations for Koopman operators of nonlinear dynamical systems. In 2019 American Control Conference (ACC), pages 4832–4839. IEEE, 2019.
- Zhang et al. (2021) Y. Zhang, P. Tiňo, A. Leonardis, and K. Tang. A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(5):726–742, 2021.