Minimal Width for Universal Property of Deep RNN
Abstract
A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system on a compact domain. In practice, deep narrow networks with bounded width and arbitrary depth are generally more effective than wide shallow networks with arbitrary width and bounded depth; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show that a deep RNN with ReLU activation can approximate any continuous function or $L^p$ function with the widths and , respectively, where the target function maps a finite sequence of vectors in to a finite sequence of vectors in . We also compute the additional width required if the activation function is the sigmoid or a more general non-degenerate function. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging the multi-layer perceptron and the RNN, our theory and technique can shed light on further research on deep RNNs.
Keywords: recurrent neural network, universal approximation, deep narrow network, sequential data, minimum width
1 Introduction
Deep learning, a type of machine learning that approximates a target function using a numerical model called an artificial neural network, has shown tremendous success in diverse fields, such as regression (Garnelo et al., 2018), image processing (Dai et al., 2021; Zhai et al., 2022), speech recognition (Baevski et al., 2020), and natural language processing (LeCun et al., 2015; Lavanya and Sasikala, 2021).
While the excellent performance of deep learning is attributed to a combination of various factors, it is reasonable to speculate that its notable success is partially based on the universal approximation theorem, which states that neural networks are capable of arbitrarily accurate approximations of the target function.
Formally, for any given function in a target class and , there exists a network such that . In topological language, the theorem states that the set of networks is dense in the class of target functions. In this sense, the closure of the set of networks determines the range of functions the network can represent.
As universality provides the expressive power of a network structure, studies on the universal approximation theorem have attracted increasing attention. Examples include the universality of multi-layer perceptrons (MLPs), the most basic neural networks (Cybenko, 1989; Hornik et al., 1989; Leshno et al., 1993), and the universality of recurrent neural networks (RNNs) with the target class of open dynamical systems (Schäfer and Zimmermann, 2007). Recently, Zhou (2020) has demonstrated the universality of convolutional neural networks.
Classical universal approximation theorems specialize in the representation power of shallow wide networks with bounded depth and unbounded width. Based on mounting empirical evidence that deep networks perform better than wide networks, the construction of deep networks instead of shallow networks has gained considerable attention in recent literature. Consequently, researchers have started to analyze the universal approximation property of deep networks (Cohen et al., 2016; Lin and Jegelka, 2018; Rolnick and Tegmark, 2018; Kidger and Lyons, 2020). Studies on MLPs have shown that wide shallow networks require a depth of only two for universality, whereas deep narrow networks require a width at least as large as their input dimension.
A wide network obtains universality by increasing its width even if the depth is only two (Cybenko, 1989; Hornik et al., 1989; Leshno et al., 1993). In the case of a deep network, however, there are functions that a narrow network cannot approximate, regardless of its depth (Lu et al., 2017; Park et al., 2021). Therefore, clarifying the minimum width that guarantees universality is crucial, and studies are underway to investigate its lower and upper bounds and to narrow the gap between them.
Table 1: Summary of the main results.

| Network | Function class | Activation | Result |
|---|---|---|---|
| RNN | | ReLU | |
| | | conti. nonpoly¹ | |
| | | conti. nonpoly² | |
| | | conti. nonpoly¹,² | |
| | | ReLU | |
| LSTM | | | |
| GRU | | | |
| BRNN | | ReLU | |
| | | conti. nonpoly¹ | |
| | | conti. nonpoly² | |
| | | conti. nonpoly¹,² | |
| | | ReLU | |

- requires the target class to consist of past-dependent functions.
- ¹ requires the activation to be continuously differentiable at some point where its derivative is nonzero and its value is zero; tanh belongs here.
- ² requires the activation to be continuously differentiable at some point where its derivative is nonzero. A logistic sigmoid function belongs here.
Recurrent neural networks (RNNs) (Rumelhart et al., 1986; Elman, 1990) have been crucial for modeling complex temporal dependencies in sequential data. They have various applications in diverse fields, such as language modeling (Mikolov et al., 2010; Jozefowicz et al., 2016), speech recognition (Graves and Jaitly, 2014; Bahdanau et al., 2016), recommendation systems (Hidasi et al., 2015; Wu et al., 2017), and machine translation (Bahdanau et al., 2014). Deep RNNs are widely used and have been successfully applied in practical applications. However, their theoretical understanding remains elusive despite their intensive use. This deficiency in existing studies motivated our work.
In this study, we prove the universal approximation theorem of deep narrow RNNs and discover the upper bound of their minimum width. The target class consists of sequence-to-sequence functions that depend solely on past information. We refer to such functions as past-dependent functions. We provide the upper bound of the minimum width of the RNN for universality in the space of past-dependent functions. Surprisingly, the upper bound is independent of the length of the sequence. This theoretical result highlights the suitability of the recurrent structure for sequential data compared with other network structures. Furthermore, our results are not restricted to RNNs; they can be generalized to variants of RNNs, including long short-term memory (LSTM), gated recurrent units (GRU), and bidirectional RNNs (BRNN). As corollaries of our main theorem, LSTM and GRU are shown to have the same universality and target class as an RNN. We also prove that the BRNN can approximate any sequence-to-sequence function in a continuous function space or an $L^p$ space under the respective norms. We also present the upper bound of the minimum width for these variants. Table 1 outlines our main results.
1.1 Summary of Contributions
With a target class of functions that map a finite sequence to a finite sequence , we prove the following:
- A deep RNN can approximate any past-dependent sequence-to-sequence continuous function with width for the ReLU or tanh activation (generally, a non-degenerate activation that vanishes at its non-degenerating point requires the same minimal width as tanh), and for non-degenerating activations.
- A deep RNN can approximate any past-dependent $L^p$ function ($1 \le p < \infty$) with width for the ReLU activation and for non-degenerating activations.
- A deep BRNN can approximate any sequence-to-sequence continuous function with width for the ReLU or tanh activation (generally, a non-degenerate activation that vanishes at its non-degenerating point requires the same minimal width as tanh), and for non-degenerating activations.
- A deep BRNN can approximate any sequence-to-sequence $L^p$ function ($1 \le p < \infty$) with width for the ReLU activation and for non-degenerating activations.
- A deep LSTM or GRU can approximate any past-dependent sequence-to-sequence continuous function with width and any past-dependent $L^p$ function with width .
1.2 Organization
In Section 2, we briefly review previous studies on the universality of neural networks. Section 3 provides the notation, the network configuration, a description of the target function class, and the condition on the activation function. Section 4 addresses the approximation property in a continuous function space. First, we fit only the last component of an output sequence and then extend the result to all components of the sequence. In Section 5, we present the upper bound of the minimum width in $L^p$ space. As an extension of our theorems, the universal approximation theorems for the LSTM, GRU, and BRNN are deduced in Section 6. Sections 7 and 8 present a discussion and conclusions, respectively.
2 Related Works
We briefly review some of the results of studies on the universal approximation property. Studies have been conducted to determine whether a neural network can learn a sufficiently wide range of functions, that is, whether it has universal properties. Cybenko (1989) and Hornik et al. (1989) first proved that the most basic network, a simple two-layered MLP, can approximate arbitrary continuous functions defined on a compact set. Some follow-up studies have investigated the universal properties of other structures for a specific task, such as a convolutional neural network for image processing (Zhou, 2020), an RNN for open dynamical systems (Schäfer and Zimmermann, 2007; Hanson and Raginsky, 2020), and transformer networks for translation and speech recognition (Yun et al., 2020). Particularly for RNNs, Schäfer and Zimmermann (2007) showed that an open dynamical system with continuous state transition and output function can be approximated by a network with a wide RNN cell and subsequent linear layer in finite time. Hanson and Raginsky (2020) showed that the trajectory of the dynamical system can be reproduced with arbitrarily small errors up to infinite time, assuming a stability condition on long-term behavior.
While such prior studies mainly focused on wide and shallow networks, several studies have investigated whether a deep narrow network with bounded width can approximate arbitrary functions in certain target function classes, such as a space of continuous functions or an $L^p$ space (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Kidger and Lyons, 2020; Park et al., 2021). Unlike the case of a wide network, which requires only one hidden layer for universality, there exists a function that cannot be approximated by any network whose width is less than a certain threshold. Specifically, in a continuous function space with the supremum norm on a compact subset , Hanin and Sellke (2017) gave both negative and positive results: MLPs do not attain universality with width , but width is sufficient for MLPs to achieve universality with the ReLU activation function. Johnson (2019) and Kidger and Lyons (2020) generalized the condition on the activation and proved that is too narrow, but is sufficiently wide, for universality in . On the other hand, in an $L^p$ space with the -norm for , Lu et al. (2017) showed that the width is insufficient for MLPs to have universality, but is enough, with the ReLU activation when and . Kidger and Lyons (2020) found that MLPs whose width is achieve universality in with the ReLU activation, and Park et al. (2021) determined the exact minimum width required for universality in the $L^p$ space with ReLU.
As described earlier, studies on deep narrow MLP have been actively conducted, but the approximation ability of deep narrow RNNs remains unclear. This is because the process by which the input affects the result is complicated compared with that of an MLP. The RNN cell transfers information to both the next time step in the same layer and the same time step in the next layer, which makes it difficult to investigate the minimal width. In this regard, we examined the structure of the RNN to apply the methodology and results from the study of MLPs to deep narrow RNNs.
3 Terminologies and Notations
This section introduces the definition of network architecture and the notation used throughout this paper. and denote the dimension of input and output space, respectively. is an activation function unless otherwise stated. Sometimes, indicates a vector with suitable dimensions.
First, we use square brackets, subscripts, and colon symbols to index a sequence of vectors. More precisely, for a given sequence of -dimensional vectors , or denotes the -th component of the -th vector. The colon symbol denotes a contiguous range of indices, such as or . We call the sequential index time and each vector a token.
Second, we define the token-wise linear maps and to connect the input, hidden state, and output space. As the dimension of the hidden state space on which the RNN cells act is different from those of the input domain and output domain , we need maps adjusting the dimensions of the spaces. For a given matrix , a lifting map lifts the input vector to the hidden state space. Similarly, for a given matrix , a projection map projects a hidden state onto the output vector. As the first token defines a token-wise map, we sometimes represent token-wise maps without a time length, such as instead of .
Subsequently, an RNN is constructed using a composition of basic recurrent cells between the lifting and projection maps. We considered four basic cells: RNN, LSTM, GRU, and BRNN.
-
•
RNN Cell A recurrent cell, recurrent layer, or RNN cell maps an input sequence to an output sequence using
(1) where is an activation function, are the weight matrices, and is the bias vector. The initial state can be an arbitrary constant vector, which is the zero vector in this setting.
-
•
LSTM cell An LSTM cell is an RNN cell variant with an internal cell state and gate structures. In the original version proposed in Hochreiter and Schmidhuber (1997), an LSTM cell includes only two gates: an input gate and an output gate. However, the basis of almost all networks used in modern deep learning is the LSTM cell with an additional gate, called the forget gate, introduced in Gers et al. (2000). Hence, we present the LSTM cell with three gates instead of the original one.
Mathematically, an LSTM cell is a process that computes an output and cell state , defined by the following relation:
(2) where and are weight matrices; is the bias vector for each ; and is the sigmoid activation function. denotes component-wise multiplication, and we set the initial state to zero in this study.
Although some variants of the LSTM have additional connections, such as the peephole connection, our result for the LSTM extends to them naturally.
-
•
GRU cell Another important variant of an RNN is the GRU, proposed by Cho et al. (2014). Mathematically, a GRU cell is a process that computes defined by
(3) where and are weight matrices, is the bias vector for each , and is the sigmoid activation function. denotes component-wise multiplication, and we set the initial state to zero in this study.
-
•
BRNN cell A BRNN cell consists of a pair of RNN cells and a token-wise linear map that follows the cells (Schuster and Paliwal, 1997). One RNN cell in the BRNN cell receives the input from to , and the other receives the input from to in reverse order. The linear map then combines the two outputs from the RNN cells. Specifically, a BRNN cell is defined as follows:
(4) where , , , , , and are weight matrices; and are bias vectors.
-
•
Network architecture An RNN comprises a lifting map , projection map , and recurrent cells ;
(5) We denote the network as a stack RNN or deep RNN when , and each output of the cell as the -th hidden state. indicates the width of the network. As a special case, a recurrent cell in (1) reduces to a composition of the activation and a token-wise affine map when ; a token-wise MLP is thus a specific example of an RNN. If LSTM, GRU, or BRNN cells replace the recurrent cells, the network is called an LSTM, a GRU, or a BRNN, respectively. A minimal computational sketch of this architecture is given below.
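As a concrete illustration of the architecture in (5), the following NumPy sketch stacks recurrent cells of the form in (1) between a token-wise lifting map and a token-wise projection. The zero initial state follows the setting above; the specific shapes, random weights, and the ReLU activation are illustrative assumptions, not the constructions used in the proofs.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rnn_cell(x_seq, W_h, W_x, b, act=relu):
    """One recurrent layer: h_t = act(W_h h_{t-1} + W_x x_t + b), with h_0 = 0."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in x_seq:                          # x_seq has shape (N, d_in)
        h = act(W_h @ h + W_x @ x_t + b)
        out.append(h)
    return np.stack(out)                       # shape (N, d_h)

def deep_rnn(x_seq, lift, cells, proj):
    """Token-wise lifting -> stacked recurrent cells -> token-wise projection, as in (5)."""
    h_seq = x_seq @ lift.T                     # lift each token to the hidden state space
    for W_h, W_x, b in cells:
        h_seq = rnn_cell(h_seq, W_h, W_x, b)
    return h_seq @ proj.T                      # project each hidden state to the output space

# Tiny example: d_x = 3, width d_h = 5, d_y = 2, sequence length N = 4.
rng = np.random.default_rng(0)
d_x, d_h, d_y, N = 3, 5, 2, 4
x = rng.normal(size=(N, d_x))
lift = rng.normal(size=(d_h, d_x))
cells = [(rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h)), rng.normal(size=d_h))
         for _ in range(3)]
proj = rng.normal(size=(d_y, d_h))
print(deep_rnn(x, lift, cells, proj).shape)    # (4, 2): one output token per time step
```

Because information flows only forward in time within each cell, the output token at each time step depends only on the input tokens up to that time step, which is precisely the past dependency of the target class defined below.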
In addition to the type of cell, the activation function affects universality. We focus on the case of ReLU or while also considering the general activation function satisfying the condition proposed by (Kidger and Lyons, 2020). is a continuous non-polynomial function that is continuously differentiable at some with . We refer to the condition as a non-degenerate condition and as a non-degenerating point.
Finally, the target class must be set as a subset of the sequence-to-sequence function space, from to . Given an RNN , each token of the output sequence depends only on for the input sequence . We define this property as past dependency and a function with this property as a past-dependent function. More precisely, if all the output tokens of a sequence-to-sequence function are given by for functions , we say that the function is past-dependent. Meanwhile, we must fix the finite length or terminal time of the input and output sequence. Without additional assumptions such as in (Hanson and Raginsky, 2020), errors generally accumulate over time, making it impossible to approximate implicit dynamics up to infinite time regardless of past dependency. Therefore we set the target function class as a class of past-dependent sequence-to-sequence functions with sequence length .
Remark 1
On a compact domain and under bounded length, the continuity of implies that of each and vice versa. In the case of the norm with , is integrable if and only if is integrable for each . In both cases, the sequence of functions converges to if and only if converges to for each . Thus, we focus on approximating for each under the given conditions.
Remark 2
There are typical tasks where the length of the output sequence differs from that of the input sequence. In translation tasks, for instance, inputs and outputs are sentences represented as sequences of tokens of words in different languages, and the length of each sequence differs in general. Nevertheless, even in such a case, a sequence can be naturally embedded into for an integer as an extended sequence or . By the natural embedding, a map between sequences is a special case of a map between for an integer .
Sometimes, only the last output token is required, in which case we consider an RNN as a sequence-to-vector function . We freely use the terminology RNN for both sequence-to-sequence and sequence-to-vector functions because there is no confusion when the output domain is evident.
We have described all the concepts necessary to set a problem, but we end this section with an introduction to the concepts used in the proof of the main theorem. For the convenience of the proof, we slightly modify the activation to act only on some components, instead of all components. With activation and index set , the modified activation is defined as
(6) |
Using the modified activation function , the basic cells of the network are modified in (1). For example, a modified recurrent cell can be defined as
(7) | ||||
Similarly, modified RNN, LSTM, GRU, or BRNN is defined using modified cells in (1). This concept is similar to the enhanced neuron of Kidger and Lyons (2020) in that activation can be selectively applied, but is different in that activation can be applied to partial components.
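For concreteness, here is a minimal sketch of the modified activation: it applies the activation only to the components listed in the index set and passes the remaining components through unchanged. The names and the ReLU choice below are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def modified_activation(h, act, idx):
    """Apply the activation only to the components listed in idx; pass the rest through."""
    out = h.copy()
    out[idx] = act(out[idx])
    return out

h = np.array([-1.0, 2.0, -3.0, 4.0])
print(modified_activation(h, relu, [0, 2]))    # [0. 2. 0. 4.]: components 1 and 3 untouched
```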
As activation leads to the non-linearity of a network, modifying the activation can affect the minimum width of the network. Fortunately, the following lemma shows that the minimum width increases by at most one owing to the modification. We briefly introduce the ideas here, with a detailed proof provided in the appendix.
Lemma 3
Let be a modified RNN cell, , and be a token-wise linear projection and lifting map. Suppose that an activation function of is non-degenerate with a non-degenerating point . Then for any compact subset and , there exists RNN cells , , and a token-wise linear map , such that
(8) |
where
Proof [Sketch of proof]
The detailed proof is available in Appendix B.
We use the Taylor expansion of at to recover the value before activation.
For the -th component with , choose a small and linearly approximate as . An affine transform after the Taylor expansion recovers .
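The following numerical sketch illustrates this Taylor-expansion step with tanh as a stand-in non-degenerate activation and 0.5 as its non-degenerating point (both choices are assumptions for illustration): scaling an input from a compact set by a small factor, shifting it to the non-degenerating point, applying the activation, and undoing the resulting affine map recovers the input up to an error that vanishes with the scaling factor.

```python
import numpy as np

sigma = np.tanh                              # stand-in non-degenerate activation
alpha = 0.5                                  # assumed non-degenerating point
dsigma = 1.0 - np.tanh(alpha) ** 2           # sigma'(alpha), nonzero

def recover(x, delta):
    """sigma(alpha + delta*x) ~ sigma(alpha) + delta*sigma'(alpha)*x for small delta,
    so an affine map after the activation approximately recovers x."""
    return (sigma(alpha + delta * x) - sigma(alpha)) / (delta * dsigma)

x = np.linspace(-1.0, 1.0, 5)                # inputs from a compact set
for delta in (1e-1, 1e-2, 1e-3):
    print(delta, np.max(np.abs(recover(x, delta) - x)))   # error shrinks as delta -> 0
```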
Remark 4
Since the additional width is only used to move the input of each RNN cell to near , it seems, at first glance, easily removable using a bias term, provided the projection map is affine instead of linear. However, a bias term in (1) is fixed across all time steps, while we need different translations: one for the first token and another for the tokens from the second onward.
For example, for an RNN cell defined by
(9) |
we need to use the Taylor expansion at to the first token . When , we cannot represent the second token
(10) | ||||
(11) |
as the Taylor expansion at unless .
In other words, the need for the additional width originates from the fact that the previous state appears only from the second token onward. Nonetheless, we can make by setting a proper initial state for each RNN cell and using an affine projection map.
The lemma implies that a modified RNN can be approximated by an RNN with at most one additional width. For a given modified RNN of width and , we can find RNN and linear maps , such that
(12) |
The composition of an RNN cell and token-wise linear map can be substituted by another RNN cell . More concretely, for and defined by
(13)
(14)
defines an RNN cell
(15)
Thus, becomes a recurrent cell, and defines a network of form (5).
4 Universal Approximation for Deep RNN in Continuous Function Space
This section introduces the universal approximation theorem of deep RNNs in continuous function space.
Theorem 5 (Universal approximation theorem of deep RNN 1)
Let be a continuous past-dependent sequence-to-sequence function and be a non-degenerate activation function. Then, for any and compact subset , there exists a deep RNN of width such that
(16) |
where
To prove the above theorem, we deal with the case of the sequence-to-vector function first, in subsection 4.1. Then, we extend our idea to a sequence-to-sequence function using bias terms to separate the input vectors at different times in subsection 4.2, and the proof of Theorem 5 is presented at the end of this section. As the number of additional components required in each step in the subsections depends on the activation function, we use to state the theorem briefly.
4.1 Approximation of sequence-to-vector function
The motivation for approximating a sequence-to-vector function is to copy a two-layered MLP via a modified RNN. Note that the output of a two-layered MLP is a linear sum of outputs of its hidden nodes, and each output of the hidden nodes is a linear sum of inputs with the following activation function. In Lemma 6, we construct a modified RNN that simulates a hidden node of an MLP, which can be represented as the sum of the inner product of some matrices and input vectors in with activation. After that, we use an additional buffer component in Lemma 7 to copy another hidden node in the two-layered MLP and the following modified RNN cell accumulates two results from the nodes. The buffer component of the modified RNN cell is then reset to zero to copy another hidden node. Repeating the procedure, the modified RNN with bounded width copies the two-layered MLP.
Now, we present the statements and sketches of the proof corresponding to each step. The following lemma implies that a modified RNN can compute the linear sum of all the input components, which copies the hidden node of a two-layered MLP.
Lemma 6
Suppose are the given matrices. Then there exists a modified RNN of width such that (the symbol indicates that there exists some value irrelevant to the proof)
(17) | |||||
Proof
[Sketch of the proof]
The detailed proof is available in Appendix C.
Define the -th modified RNN cell , of the form of (1) without activation, with , where .
Then, the th component of the final output after layers becomes a linear combination of with some constant coefficients and .
Thus the coefficient of is represented by , which we wish to be for each .
In matrix formulation, we intend to find satisfying , where , , and .
As is invertible there exist that solve .
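The following sketch illustrates the mechanism behind Lemma 6 in a simplified scalar setting; the per-layer weights and the cell form are stand-ins, not the exact construction of the proof. A stack of activation-free recurrent cells makes the final output a linear functional of the entire input sequence, and its coefficients can be read off by probing with one-hot sequences. The lemma itself shows that the weights can be chosen so these coefficients match any prescribed matrices, using the invertible binomial-coefficient matrix of Appendix C.

```python
import numpy as np

def linear_recurrent_stack(x, layer_weights):
    """Stack of activation-free scalar recurrent cells of the form
    h_t = h_{t-1} + s * u_t (h_0 = 0), where u is the previous layer's output and u = x
    for the first layer.  A simplified scalar stand-in for the cells in the proof of Lemma 6."""
    u = np.asarray(x, dtype=float)
    for s in layer_weights:
        h = np.zeros_like(u)
        running = 0.0
        for t in range(len(u)):
            running = running + s * u[t]
            h[t] = running
        u = h
    return u

N = 4
s = [0.7, -1.3, 0.4]                        # per-layer weights (arbitrary here)
# Probing with one-hot sequences reads off the coefficient of each token x_t in the final output.
coeffs = np.array([linear_recurrent_stack(np.eye(N)[t], s)[-1] for t in range(N)])
x = np.array([1.0, -2.0, 0.5, 3.0])
print(coeffs)
print(np.allclose(linear_recurrent_stack(x, s)[-1], coeffs @ x))   # True: a linear sum of all tokens
```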
After copying a hidden node using the above lemma, we add one more component, the -th, to copy another hidden node. The results are then accumulated in the -th component, and the final component is reset so that it can copy yet another node. As the process is repeated, a modified RNN replicates the output node of a two-layered MLP.
Lemma 7
Suppose , are given for and . Then, there exists a modified RNN of width such that
(18) |
Proof First construct a modified RNN of width such that
(19) | |||||
(20) |
as in Lemma 6. Note that the final component does not affect the first linear summation and remains zero. Next, using all the components except the -th one, construct , which satisfies
(21) | |||||
(22) |
and use one modified RNN cell after to add the results and reset the last component:
(23) | ||||
(24) |
As the th component is reset to zero, we use it to compute the third sum and repeat until we obtain the final network such that
(25) |
Remark 8
The above lemma implies that a modified RNN of width can copy the output node of a two-layered MLP. We can extend this result to an arbitrary -dimensional case. Note that the first components remain fixed, the -th component computes a part of the linear sum approximating the target function, and the -th component computes another part and is reset. When we need to copy another output node for another component of the output of the target function , only one additional width is sufficient: the new component accumulates the corresponding sum, while the final component again acts as a buffer to be reset. By repeating this process, we obtain a -dimensional output from the modified RNN, which includes all outputs of the MLP in the components from the -th to the -th.
Theorem 9 (Universal approximation theorem of deep RNN 2)
Suppose a target is a continuous sequence-to-vector function, is a compact subset, is a non-degenerating activation function, and is the non-degenerating point. Then, for any , there exists a deep RNN of width such that
(26) |
where
Proof We present the proof for here, but adding width for each output component works for the case . By the universal approximation theorem of the MLP, Theorem 1 in Leshno et al. (1993), there exist and for such that
(27) |
Note that there exists a modified RNN of width ,
(28) |
By Lemma 3, there exists an RNN of width such that
(29) |
Hence we have .
4.2 Approximation of sequence-to-sequence function
Now, we consider an RNN as a sequence-to-sequence function defined by (1). Although the above results are remarkable in that the minimal width has an upper bound independent of the length of the sequence, they only approximate a part of the output sequence. Meanwhile, as the hidden states calculated in each RNN cell at different times are closely coupled, fitting all the output functions, which can be independent of one another, becomes a more challenging problem. For example, the coefficient of in equals the coefficient of in if is an RNN defined as in the proof of Lemma 6. This correlation originates from the fact that and arrive at via the same intermediate process: one time step and layers.
We sever the correlation between the coefficients of and by defining the time-enhanced recurrent cell in Definition 10 and then proceed as in the previous subsection up to Lemma 12.
Definition 10
A time-enhanced recurrent cell, or layer, is a process that maps a sequence to a sequence via
(30) |
where is an activation function, , are weight matrices and is the bias given for each time step .
Like RNN, time-enhanced RNN indicates a composition of the form (1) with time-enhanced recurrent cells instead of RNN cells, and we denote it as TRNN. The modified TRNN indicates a TRNN whose activation functions in some cell act on only part of the components. Time-enhanced BRNN, denoted as TBRNN, indicates a BRNN whose recurrent layers in each direction are replaced by time-enhanced layers. A modified TBRNN indicates a TBRNN whose activation function is modified to act on only part of the components. With the proof of Lemma 3 using , instead of , , a TRNN can approximate a modified TRNN.
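The sketch below shows how a time-enhanced recurrent layer differs from an ordinary one: the parameters applied to the current token may depend on the time index. The particular parametrization, a per-step bias together with a factor of t on the input weight as suggested by the proof sketch of Lemma 11, is an assumption of this illustration; equation (30) fixes the exact form used in the paper.

```python
import numpy as np

def trnn_cell(x_seq, W_h, W_x_per_t, b_per_t, act=np.tanh):
    """A time-enhanced recurrent layer: unlike an ordinary RNN cell, the parameters applied
    to the current token are allowed to depend on the time index t."""
    h = np.zeros(W_h.shape[0])
    out = []
    for t, x_t in enumerate(x_seq):
        h = act(W_h @ h + W_x_per_t[t] @ x_t + b_per_t[t])
        out.append(h)
    return np.stack(out)

# Illustrative choice mirroring the proof sketch of Lemma 11: the time dependence enters
# as a factor of t on the current token, i.e. W_x_per_t[t] = (t + 1) * W_x.
rng = np.random.default_rng(0)
N, d = 4, 3
x_seq = rng.normal(size=(N, d))
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_x_per_t = [(t + 1) * W_x for t in range(N)]
b_per_t = [np.zeros(d) for _ in range(N)]
print(trnn_cell(x_seq, W_h, W_x_per_t, b_per_t).shape)   # (4, 3)
```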
The following lemma shows that the modified TRNN successfully eliminates the correlation between outputs. See the appendix for the complete proof.
Lemma 11
Suppose are the given matrices for , . Then there exists a modified TRNN of width such that
(31) |
for all .
Proof [Sketch of proof] The detailed proof is available in Appendix D. The main idea is to use instead of in the proof of Lemma 6 with a modified TRNN of the form
(32) |
The time-index coefficient separates the calculation along the same intermediate process mentioned at the beginning of this subsection.
For instance, a process from time to results differently from to for , even in the same layer.
As the coefficient matrices at each time after layers are full rank, we can find implementing the required linear combination for each time.
Recall the proof of Lemma 7. An additional width serves as a buffer to implement and accumulate the linear sums appearing in a node of an MLP. Similarly, we proceed with Lemma 11 instead of Lemma 6 to conclude that there exists a modified TRNN of width such that each reproduces an MLP approximating .
Lemma 12
Suppose , are the given matrices for , , . Then, there exists a modified TRNN of width such that
(33) |
Proof
We omit the detailed proof because it is almost the same as the proof of Lemma 7.
The only difference is to use instead of to construct and as in the proof of Lemma 11.
This implies that the modified TRNN can approximate any past-dependent sequence-to-sequence function.
Finally, we connect the TRNN and RNN. Although it is unclear whether a modified RNN can approximate an arbitrary modified TRNN, there exists a modified RNN that approximates the specific one described in Lemma 11.
Lemma 13
Proof [Sketch of proof] Detailed proof is available in Appendix E. Note that a function in (31) is constructed by modified TRNN cells of the form in (32). Since is the only time-dependent coefficient in (32), it is enough to show that an RNN cell can approximate a function for . In particular, approximating a map is what we need.
While an MLP can approximate the map on a compact set , it is difficult to use a token-wise MLP because we need different outputs for the same vector at different time steps. For instance, for a and a sequence , a token-wise MLP cannot simultaneously approximate the two outputs and . Still, this is not the case when the domain of and the domain of are disjoint compact sets, and then there is an MLP that approximates for and for . From this point of view, we use an RNN and a token-wise MLP to embed each vector of the input sequence into disjoint compact sets and to approximate the output on each compact set, respectively.
Now we present how to construct disjoint sets and the token-wise MLP.
Without loss of generality, we can assume , and we construct the first cell so that the output at time is .
As compact sets , are disjoint, and are uniquely determined for each .
Thus, for any given , there exists an MLP of width approximating on as a function from (Hanin and Sellke, 2017; Kidger and Lyons, 2020).
Indeed, we need to approximate as a function from to .
Fortunately, the first components preserve the original input data in the proofs of Proposition 2 in Hanin and Sellke (2017), and Proposition 4.2(Register Model) in Kidger and Lyons (2020).
Thus, an MLP of width approximates while keeping the original input in the first components, and the MLP is shared across all time steps, as inputs at different times and are already embedded in the disjoint sets and .
Note that a token-wise MLP is a special case of an RNN of the same width.
Nonetheless, we need an additional width to keep the th component approximating .
With the token-wise MLP implemented by an RNN and additional buffer width, we construct a modified RNN of width approximating the modified TRNN cell used in the proof of Lemma 11.
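A toy numerical sketch of this embedding idea, with hypothetical constants and one-dimensional tokens in [0, 1]: shifting the token at time t by a time-proportional offset places tokens from different time steps into disjoint compact sets, after which a single shared token-wise map, here chosen to realize x ↦ t·x, is well defined and continuous on the union and can therefore be approximated by a token-wise MLP of bounded width.

```python
import numpy as np

GAP = 3.0   # hypothetical separation constant, large enough for tokens in K = [0, 1]

def embed(x_seq):
    """An RNN-like pass that shifts the token at time t by GAP * t, so tokens from
    different time steps land in the disjoint compact sets K + GAP, K + 2*GAP, ..."""
    return np.array([x + GAP * (t + 1) for t, x in enumerate(x_seq)])

def shared_tokenwise_map(z):
    """One map applied identically to every embedded token: recover t from the shift,
    undo it, and output t * x.  On the disjoint sets this is well defined and continuous,
    so a token-wise MLP of bounded width can approximate it."""
    t = np.round(z / GAP)          # which set the token landed in
    x = z - GAP * t                # undo the shift
    return t * x

x = np.array([0.2, 0.9, 0.5, 0.1])                 # 1-d tokens in K = [0, 1]
print(shared_tokenwise_map(embed(x)))               # [0.2 1.8 1.5 0.4] = t * x_t
```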
4.3 Proof of Theorem 5
Summarizing all the results, we have the universality of a deep RNN in a continuous function space.
Proof [Proof of Theorem 5] As mentioned in Remark 8, we can set for notational convenience. By Lemma 12, there exists a modified TRNN of width such that
(35) |
As is a composition of modified TRNN cells of width satisfying (31), there exists a modified RNN of width such that
(36) |
Then, by Lemma 3, there exists an RNN of width such that
(37) |
The triangle inequality yields
(38) |
Remark 14
The number of additional widths depends on the condition of the activation function . Here, is required to find the token-wise MLP that approximates embedding from to . If further studies determine a tighter upper bound of the minimum width of an MLP to have the universal property in a continuous function space, we can reduce or even remove according to the result.
There is still a wide gap between the lower bound and upper bound of the minimum width, and hence, we expect to be able to achieve universality with a narrower width. For example, if , an RNN is simply an MLP, and the RNN has universality without a node required to compute the effect of . Therefore, apart from the result of the minimum width of an MLP, further studies are required to determine whether is essential for the case of .
5 Universal Approximation for Stack RNN in $L^p$ Space
This section introduces the universal approximation theorem of a deep RNN in function space for .
Theorem 15 (Universal approximation theorem of deep RNN 3)
Let be a past-dependent sequence-to-sequence function in for , and be a non-degenerate activation function with the non-degenerating point . Then, for any and compact subset , there exists a deep RNN of width satisfying
(39) |
Moreover, if the activation is ReLU, there exists a deep RNN of width satisfying
(40) |
Before beginning the proof of the theorem, we assume without loss of generality and summarize the scheme that will be used in the proof. Park et al. (2021) constructed an MLP of width approximating a given target function , using the “encoding scheme.” More concretely, the MLP is separated into three parts: encoder, memorizer, and decoder.
First, the encoder part quantizes each component of the input and output into a finite set. The authors use the quantization function
(41) |
where . Then, each quantized vector is encoded into a real number by concatenating its components through the encoder
(42) |
For small , the authors construct an MLP of width satisfying
(43) |
Although the quantization causes a loss of input information, the norm is insensitive to this loss on a sufficiently small domain.
After encoding the input to with large , the authors use the information of in to obtain the information of the target output . More precisely, they define the memorizer to map the encoded input to the encoded output as
(44) |
assuming the quantized map acts on component-wise in the above equation. Then, an MLP of width approximates ; that is, for any , there exists satisfying
(45) |
Finally, the decoder reconstructs the original output vector from the encoded output vector by cutting off the concatenated components. Owing to the preceding encoder and memorizer, it is enough to define only the value of the decoder on . Hence the decoder is determined by
(46) |
Indeed, for small , Park et al. (2021) construct an MLP of width so that
(47) |
Although (43) and (47) are not equations but approximations when the activation is just non-degenerate, the composition approximates a target with sufficiently large and sufficiently small .
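To make the encoder–memorizer–decoder scheme of Park et al. (2021) concrete, here is a plain-Python sketch of the quantization, the concatenation encoder, and the decoder acting on vectors in [0, 1)^d. The number of digits and the binary base are illustrative choices; the proof constructs MLPs that realize these maps, and the memorizer is a further map between the encoded real numbers.

```python
import numpy as np

n = 8                                   # number of binary digits kept per component

def quantize(x):
    """Component-wise quantization: floor(2^n x) / 2^n for x in [0, 1)^d."""
    return np.floor(x * 2**n) / 2**n

def encode(x):
    """Concatenate the n-digit expansions of the d quantized components into one number."""
    digits = np.floor(quantize(x) * 2**n).astype(int)   # an integer in [0, 2^n) per component
    code = 0
    for dig in digits:
        code = code * 2**n + int(dig)
    return code / 2**(n * len(x))                       # a single number in [0, 1)

def decode(c, d):
    """Cut the concatenated code back into d quantized components."""
    total = int(np.rint(c * 2**(n * d)))
    out = []
    for _ in range(d):
        out.append((total % 2**n) / 2**n)
        total //= 2**n
    return np.array(out[::-1])

x = np.array([0.3141, 0.2718, 0.5772])
print(quantize(x))                      # each component kept to 2^-8 resolution
print(decode(encode(x), len(x)))        # recovers the quantized vector exactly
```

The time-dependent translation used in the proof below keeps the encoded inputs of different time steps in disjoint intervals, so a single memorizer defined on their union suffices.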
Let us return to the proof of Theorem 15. We construct the encoder, memorizer, and decoder similarly. As the encoder and decoder are independent of time , we use token-wise MLPs and realize them by modified RNNs. On the other hand, the memorizer must work differently according to the time owing to the multiple output functions. Instead of implementing various memorizers, we separate their input and output domains at each time by translation. Then, it is enough to define one memorizer on the disjoint union of the domains.
Proof [Proof of Theorem 15] We first combine the token-wise encoder and translation for the separation of the domains. Consider the token-wise encoder , and the following recurrent cell
(48) |
Then the composition defines an encoder of sequence from to :
(49) |
where is a sequence in . Note that the range of is a disjoint union of compact sets;
(50) |
Hence there exists a memorizer satisfying
(51) |
for each . The token-wise decoder is the last part of the proof.
To complete the proof, we need an approximation of the token-wise encoder , a modified recurrent cell , token-wise memorizer , and token-wise decoder . Following Park et al. (2021), there exist MLPs of width , , and that approximate , , and respectively. Lemma 3 shows that is approximated by an RNN of width . Hence, an RNN of width approximates the target function .
In the case of ReLU activation, we can extend the domain to as stated in Park et al. (2020). Nonetheless, we briefly present how to deal with the subtle problem that the support of a network is generally not compact.
We project each input by defined by
(52) |
Note that an MLP with width computes . We will choose large to cover enough domain for and small to reduce the projection error.
First, choose a large so that for all . Second, construct an RNN as above with , such that . We impose a condition to bound error outside , which is possible because we may set for all under the norm. Finally, set so small that and . After inserting the projection before the encoder and defining new RNN , we have
(53)
(54)
(55)
(56)
(57)
(58)
6 Variants of RNN
This section describes the universal property of some variants of the RNN, particularly the LSTM, GRU, and BRNN. The LSTM and GRU were proposed to solve the long-term dependency problem. As an RNN has difficulty calculating and updating its parameters for long sequential data, the LSTM and GRU take advantage of additional structures in their cells. We prove that they have the same universal property as the original RNN. On the other hand, a BRNN was proposed to overcome the past dependency of an RNN. A BRNN consists of two RNN cells, one of which works in reverse order. We prove the universal approximation theorem of a BRNN with the target class of arbitrary sequence-to-sequence functions.
The universal property of an LSTM originates from the universality of an RNN. Mathematically, an LSTM indicates a process that computes two outputs, and , defined by (2). As an LSTM can reproduce an RNN with the same width, we have the following corollary:
Corollary 16 (Universal approximation theorem of deep LSTM)
Let be a continuous past-dependent sequence-to-sequence function. Then, for any and compact subset , there exists a deep LSTM , of width , such that
(59) |
If , there exists a deep LSTM , of width , such that
(60) |
Proof We set all parameters but , , , and as zeros, and then (2) is simplified as
(61) | ||||
For any , with sufficiently large negative components yields
(62) |
Thus, an LSTM reproduces an RNN whose activation function is without any additional width in its hidden states.
In other words, an LSTM of width approximates an RNN of width equipped with the activation function .
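The reduction can be checked numerically. The sketch below writes out a standard three-gate LSTM step explicitly, sets all gate weights to zero, and saturates the gate biases (input and output gates toward one, forget gate toward zero); this is one convenient choice that mimics the idea of the proof, not necessarily the exact parameter choice used there. The resulting hidden state matches an RNN whose activation is the composition of two tanh functions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b, bias_i, bias_f, bias_o):
    """A three-gate LSTM step written out explicitly; all gate weights are zero, so the
    gates are constants, and only the candidate parameters (W, U, b) act on the data."""
    i = sigmoid(bias_i)                      # input gate
    f = sigmoid(bias_f)                      # forget gate
    o = sigmoid(bias_o)                      # output gate
    c_tilde = np.tanh(W @ x + U @ h + b)     # candidate cell state
    c_new = f * c + i * c_tilde
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
d = 3
W, U, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
x_seq = rng.normal(size=(5, d))

h = c = np.zeros(d)
h_ref = np.zeros(d)
for x in x_seq:
    # Saturate the gates: i ~ 1, f ~ 0, o ~ 1.
    h, c = lstm_step(x, h, c, W, U, b, bias_i=30.0, bias_f=-30.0, bias_o=30.0)
    # Reference RNN with activation tanh(tanh(.)).
    h_ref = np.tanh(np.tanh(W @ x + U @ h_ref + b))
print(np.max(np.abs(h - h_ref)))             # ~1e-13: the LSTM mimics the RNN
```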
The universality of GRU is proved similarly.
Corollary 17 (Universal approximation theorem of deep GRU)
Let be a continuous past-dependent sequence-to-sequence function. Then, for any and compact subset , there exists a deep GRU , of width , such that
(63) |
If , there exists a deep GRU , of width , such that
(64) |
Proof Setting only , , , and as non-zero, the GRU is simplified as
(65) |
For any , a sufficiently large yields
(66) |
Hence, we attain the corollary.
Remark 18
We refer to the width as the maximal dimension of the hidden states. However, this definition is somewhat inappropriate, as LSTM and GRU cells have multiple hidden states; hence, they contain several times more components than an RNN of the same width. Thus, we expect them to have better approximation power, or a smaller minimum width for universality, than an RNN. Nevertheless, we leave the theoretical analysis for future work: to identify whether they have different approximation abilities, or to examine why they exhibit different performances in practical applications.
Now, let us focus on the universality of a BRNN. Recall that a stack of modified recurrent cells constructs a linear combination of the previous input components at each time,
(67)
Therefore, if we reverse the order of the sequence and the flow of the recurrent structure, a stack of reversed modified recurrent cells constructs a linear combination of the subsequent input components at each time,
(68)
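A minimal activation-free sketch of this contrast (scalar cells with an arbitrary weight, for illustration only): the forward recurrence accumulates information from past tokens, the reversed recurrence from current and future tokens, and a BRNN cell combines the two token-wise.

```python
import numpy as np

def forward_sums(x, w):
    """h_t = h_{t-1} + w * x_t with h_0 = 0: at time t this depends only on x_1, ..., x_t."""
    return w * np.cumsum(x)

def backward_sums(x, w):
    """The same recurrence run on the reversed sequence: at time t it depends on x_t, ..., x_N."""
    return (w * np.cumsum(x[::-1]))[::-1]

x = np.array([1.0, 2.0, 3.0, 4.0])
print(forward_sums(x, 0.5))    # [0.5 1.5 3.  5. ]  past-dependent
print(backward_sums(x, 0.5))   # [5.  4.5 3.5 2. ]  depends on current and future tokens
# A BRNN cell applies a token-wise linear map to the pair of outputs, so each output token
# can depend on the whole input sequence.
```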
From this point of view, we expect that a stacked BRNN successfully approximates an arbitrary sequence-to-sequence function beyond the past dependency. As previously mentioned, we prove it in the following lemma.
Lemma 19
Suppose are the given matrices for , . Then there exists a modified TBRNN of width such that
(69) |
for all .
Proof
[Sketch of proof]
The detailed proof is available in Appendix F.
We use modified TBRNN cells with either only a forward modified TRNN or a backward modified TRNN.
The stacked forward modified TRNN cells compute , and the stacked backward modified TRNN cells compute .
As in the previous cases, we have the following theorem for a TBRNN. The proof is almost the same as those of Lemmas 12 and 7.
Lemma 20
Suppose , are the given matrices for , , . Then there exists a modified TBRNN of width such that
(70) |
Proof First, construct a modified deep TBRNN of width such that
(71) |
as in Lemma 19. The final component does not affect the first linear summation and remains zero. After , use the -th component to obtain a stack of cells , which satisfies
(72) |
and use a modified RNN cell to sum up the results and reset the last component:
(73) |
As the th component resets to zero, we use it to compute the third sum and repeat until we obtain the final network such that
(74) |
The following lemma fills the gap between a modified TBRNN and a modified BRNN.
Lemma 21
Let be a modified TBRNN that computes (69) and be a compact set. Then for any there exists a modified BRNN of width such that
(75) |
where , for non-degenerating activation .
Moreover, there exists a BRNN of width such that
(76) |
where
Proof
We omit these details because we only need to construct a modified RNN that approximates (67) and (68) using Lemma 13.
As only the forward or backward modified RNN cell is used in the proof of Lemma 19, it is enough for the modified BRNN to approximate either the forward or backward modified TRNN.
Thus, it follows from Lemma 13.
Lemma 3 provides the second part of this theorem.
Finally, we obtain the universal approximation theorem of the BRNN from the previous results.
Theorem 22 (Universal approximation theorem of deep BRNN 1)
Let be a continuous sequence to sequence function and be a non-degenerate activation function. Then for any and compact subset , there exists a deep BRNN of width , such that
(77) |
where
Proof As in the proof of Theorem 5, we set for notational convenience. According to Lemma 20, there exists a modified TBRNN of width such that
(78) |
Lemma 21 implies that there exists a BRNN of width such that
(79) |
The triangle inequality leads to
(80) |
In the $L^p$ space, we have similar results; they follow from the RNN case much more straightforwardly than in continuous function space, as only the encoder needs a bidirectional flow.
Theorem 23 (Universal approximation theorem of deep BRNN 2)
Let be a sequence-to-sequence function in for , and be a non-degenerate activation function with the non-degenerating point . Then, for any and compact subset , there exists a deep BRNN of width satisfying
(81) |
Moreover, if the activation is ReLU, there exists a deep BRNN of width satisfying
(82) |
Proof
Recall that in the RNN case, is encoded as a concatenation of a decimal representation of each , cut off at a certain number of digits. The backward pass carries out the same work, and the two encoded results are then combined. Since the cells constructing the encoders are the same as in equation (48), we omit the remaining steps.
7 Discussion
We proved the universal approximation theorem and calculated the upper bound of the minimum width of an RNN, an LSTM, a GRU, and a BRNN. In this section, we illustrate how our results support the performance of a recurrent network.
We show that an RNN needs a width of at most to approximate a function from a sequence of -dimensional vectors to a sequence of -dimensional vectors. The upper bound of the minimum width of the network depends only on the input and output dimensions, regardless of the length of the sequence. The independence of the sequence length indicates that the recurrent structure is much more effective in learning a function on sequential inputs. To approximate a function defined on a long sequence, a network with a feed-forward structure requires a wide width proportional to the length of the sequence. For example, an MLP should have a wider width than if it approximates a function defined on a sequence (Johnson, 2019). However, with the recurrent structure, it is possible to approximate via a narrow network of width regardless of the length, because the minimum width is independent of the length . This suggests that the recurrent structure, which transfers information between different time steps in the same layer, is crucial for success with sequential data.
From a practical point of view, this fact further implies that there is no need to limit the length of the time steps that affect dynamics to learn the internal dynamics between sequential data. For instance, suppose that a pair of long sequential data and have an unknown relation . Even without prior knowledge of and , a deep RNN learns the relation if we train the network with inputs and outputs . The MLP cannot reproduce the result because the required width increases proportionally to , which is an unknown factor. The difference between these networks theoretically supports that recurrent networks are appropriate when dealing with sequential data whose underlying dynamics are unknown in the real world.
8 Conclusion
In this study, we investigated the universality and upper bound of the minimum width of deep RNNs. The upper bound of the minimum width serves as a theoretical basis for the effectiveness of deep RNNs, especially when underlying dynamics of the data are unknown.
Our methodology enables various follow-up studies, as it connects an MLP and a deep RNN. For example, the framework disentangles the time dependency of output sequence of an RNN. This makes it feasible to investigate a trade-off between width and depth in the representation ability or error bounds of the deep RNN, which has not been studied because of the entangled flow with time and depth. In addition, we separated the required width into three parts: one maintains inputs and results, another resolves the time dependency, and the third modifies the activation. Assuming some underlying dynamics in the output sequence, such as an open dynamical system, we expect to reduce the required minimum width on each part because there is a natural dependency between the outputs, and the inputs are embedded in a specific way by the dynamics.
However, as LSTMs and GRUs have multiple hidden states in their cell processes, they may have a smaller minimum width than the RNN. By constructing an LSTM and a GRU that use these hidden states to store data and resolve the time dependency, we hope that the techniques demonstrated in our proofs help analyze why these networks achieve better results in practice and suffer less from long-term dependency.
Acknowledgement
This work was supported by the Challengeable Future Defense Technology Research and Development Program through ADD[No. 915020201], the NRF grant[2012R1A2C3010887] and the MSIT/IITP[No. 2021-0-01343, Artificial Intelligence Graduate School Program(SNU)].
Appendix A Notations
| Symbol | Description |
|---|---|
| | Activation function |
| | Modified activation function |
| | Input sequence of vectors |
| | Output sequence of vectors |
| | Order of an element in a sequence |
| | -th component of the -th element of a sequence |
| | Dimension of the input vector |
| | Dimension of the hidden state in the RNN, or width |
| | Dimension of the output vector |
| | Length of the input and output sequences |
| | A compact subset of |
| | Zero matrix of size |
| | Zero vector of size |
| | One vector of size |
| | Identity matrix of size |
| RNN | Recurrent Neural Network, defined in Section 3 |
| TRNN | Time-enhanced Recurrent Neural Network, defined by replacing the recurrent cells of an RNN with time-enhanced recurrent cells; the time-enhanced cell and the TRNN are defined in Section 4.2 |
| BRNN | Bidirectional Recurrent Neural Network, defined in Section 3 |
| TBRNN | Time-enhanced Bidirectional Recurrent Neural Network, defined in Section 4.2 |
Appendix B Proof of the Lemma 3
Without loss of generality, we may assume is an identity map and . Let be a given modified RNN cell, and be a given token-wise linear projection map. We use notations and to denote zero matrix in and one vector in respectively. Sometimes we omit symbol in some block-diagonal matrices if the size of the zero matrix is clear.
Case 1:
Let be the identity map. For define as
(83) |
Since is non-degenerating at and is continuous at , we have
(84) |
Then construct a second cell to compute transition as
(85) |
where .
After that, the first output of becomes
(86) | ||||
(87) | ||||
(88) | ||||
(89) |
Now use mathematical induction on time to compute assuming
(90) |
From a direct calculation, we attain
(91) | |||
(92) | |||
(93) |
and
(94) | |||
(95) | |||
(96) | |||
(97) |
With the sum of above two results, we obtain the induction hypothesis (90) for ,
(98) | |||
(99) | |||
(100) | |||
(101) |
Setting and choosing small enough complete the proof:
(102) |
Case 2:
When , there is a term independent of in the Taylor expansion of . An additional width removes the term in this case; hence, we need a lifting map :
(103) |
Now for define as
(104) |
As in the previous case, we have
(105) |
and construct a second cell to compute
(106) |
where .
After that, the first output of becomes
(107) | ||||
(108) | ||||
(109) | ||||
(110) |
Assume and use mathematical induction on time .
(111) |
Direct calculation yields
(112) | |||
(113) |
and
(114) | |||
(115) | |||
(116) | |||
(117) |
Adding two terms in (106), we obtain the induction hypothesis (111) for ,
(118) |
Setting and choosing small enough complete the proof:
(119) |
Appendix C Proof of the Lemma 6
It suffices to show that there exists a modified RNN that computes
(120) |
for given matrices .
An RNN should have multiple layers to implement an arbitrary linear combination. To overcome the complex time dependency arising from the deep structure and to explicitly formulate the results of a deep modified RNN, we force and to use the information of the previous time step in a limited way. Define the modified RNN cell at the -th layer as
(121) |
where , for .
Construct a modified RNN for each as
(122) |
and denote the output of at each time for an input sequence of embedding of :
(123) |
Then we have the following lemma.
Lemma 24
Let be the matrix defined by (123). Then we have
(124) |
where means binomial coefficient for . We define for the case of or for notational convenience.
Proof Since there is no activation in modified RNN (121), has the form of
(125) |
From the definition of the modified RNN cell and , we first show that satisfies the recurrence relation
(126) |
using mathematical induction on in turn. Initially, , by definition, and (125) holds when . Now assume (125) holds for , any . To show that (125) holds for and any , use mathematical induction on . By definition, we have for any . Thus (125) holds when and . Assume it holds for and . Then
(127) | ||||
Hence the relation holds for and any .
It now remains to show
(128) |
From the initial condition of , we know for all . After some direct calculation with the recurrence relation (125) of , we have
-
i)
If or , as .
-
ii)
.
-
iii)
implies for .
-
iv)
Similarly, implies for .
Now use mathematical induction on starting from to show for , .
-
i)
holds only if , for , . In the case, .
-
ii)
Assume that (128) holds for any , with as induction hypothesis. Now suppose for given , . If or we already know . Otherwise , , and we have
(129) which completes the proof.
We have computed the output of modified RNN such that
(130) |
If the square matrix has inverse , satisfies
where is the Kronecker delta function.
The following lemma completes the proof.
Lemma 25
Matrix is invertible.
Proof Use mathematical induction on . is a trivial case. Assume is invertible.
(131) |
Applying elementary row operation to by multiplying the matrix on the left and elementary column operation to by multiplying the matrix on the right where
(132) |
we obtain the following relation:
(133) |
Hence is invertible by the induction hypothesis.
Corollary 26
The following matrix is full-rank.
(134) |
We will use the matrix in the proof of Lemma 11 to approximate a sequence-to-sequence function.
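As a quick numerical sanity check of the kind of binomial-coefficient matrix that Lemma 25 and Corollary 26 rely on, the lower-triangular Pascal matrix with entries C(i, j) has unit diagonal and hence determinant one for every size. The exact matrix in the lemma may be arranged differently; this is only an illustration of the invertibility argument.

```python
import numpy as np
from math import comb

# The lower-triangular Pascal matrix P[i, j] = C(i, j) has unit diagonal, so det(P) = 1
# and P is invertible for every size.
for N in (3, 5, 8, 12):
    P = np.array([[comb(i, j) for j in range(N)] for i in range(N)])
    print(N, round(np.linalg.det(P)))      # always 1
```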
Appendix D Proof of Lemma 11
Define token-wise lifting map and modified TRNN cell as in the proof of Lemma 6:
(135) | ||||
(136) |
where , for . Then we have
(137) | ||||
where and .
Since for each , the matrix
(138) |
is full-rank, there exist satisfying
(139) |
or
(140) |
for each . Then we obtain
(141) | ||||
Appendix E Proof of Lemma 13
As one of the modified TRNNs that computes (31), we use the modified TRNN defined in Appendix D. Specifically, we show that for a given , there exists a modified RNN of width that approximates the modified TRNN cell defined by (135). Suppose , are compact sets and . Then the output of the TRNN cell is
(142) |
Without loss of generality, assume and let . Let be a token-wise linear map defined by . Construct the modified recurrent cells as for ,
(143) | ||||
(144) |
Then, by definition for ,
(145) |
Note that are disjoint from each other for all .
By the universal approximation theorem of deep MLP from Hanin and Sellke (2017); Kidger and Lyons (2020), for any , there exists an MLP of width such that for ,
(146) | |||
(147) |
Since a token-wise MLP can be implemented by an RNN of the same width, there exists an RNN of width , all of whose components except the -th construct , so that for all ,
(148) |
Then for we have
(149) |
Finally, define a recurrent cell of width as
(150) |
and attain
(151) |
With the token-wise projection map defined by , an RNN of width maps to
(152) |
Since , we have
(153) |
as . Approximating all in Appendix D finishes the proof.
Appendix F Proof of Lemma 19
The main idea of the proof is to separate the linear sum into the past-dependent part and the remainder part . Then, we construct modified TBRNN with cells; the former cells have only a forward recurrent cell to compute the past-dependent part, and the latter cells have only a backward recurrent cell to compute the remainder.
Let the first modified TRNN cells for be defined as in the proof of Lemma 11:
(154) |
where , for . Then, with token-wise lifting map defined by , we construct modified TRNN . We know that if are given for and , there exist for , such that
(155) |
Therefore, we will determine after constructing the latter cells. Let for brief notation.
After , construct modified TRNN cells for in reverse order:
(156) |
where , for . Define , and we obtain the following result after a similar calculation with input sequence :
(157) | |||
(158) |
We want to find and so that
(159) |
for each .
Note that does not contain terms, so should contain .
(160) | ||||
(161) | ||||
(162) | ||||
(163) |
Since matrix is a lower triangular matrix with unit diagonal components, there exist such that
(164) |
for each .
We now have
(165) | |||
(166) | |||
(167) | |||
(168) |
We switch and for the last equation. By Corollary 26, there exist satisfying
(169) | ||||
(170) | ||||
for , and | ||||
(171) | ||||
(172) |
for .
With the above and , equation (159) holds for each .
It remains to construct modified TRNN cells to implement , which comes directly from the proof of Lemma 11.
References
- Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bahdanau et al. (2016) Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4945–4949. IEEE, 2016.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
- Cohen et al. (2016) Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728. PMLR, 2016.
- Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
- Dai et al. (2021) Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021.
- Elman (1990) Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
- Garnelo et al. (2018) Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
- Gers et al. (2000) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. Neural computation, 12(10):2451–2471, 2000.
- Graves and Jaitly (2014) Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772. PMLR, 2014.
- Hanin and Sellke (2017) Boris Hanin and Mark Sellke. Approximating continuous functions by relu nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.
- Hanson and Raginsky (2020) Joshua Hanson and Maxim Raginsky. Universal simulation of stable dynamical systems by recurrent neural nets. In Learning for Dynamics and Control, pages 384–392. PMLR, 2020.
- Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- Johnson (2019) Jesse Johnson. Deep, skinny neural networks are not universal approximators. In International Conference on Learning Representations, 2019.
- Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
- Kidger and Lyons (2020) Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In Conference on learning theory, pages 2306–2327. PMLR, 2020.
- Lavanya and Sasikala (2021) PM Lavanya and E Sasikala. Deep learning techniques on text classification using natural language processing (nlp) in social healthcare network: A comprehensive survey. In 2021 3rd International Conference on Signal Processing and Communication (ICPSC), pages 603–609. IEEE, 2021.
- LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- Leshno et al. (1993) Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861–867, 1993.
- Lin and Jegelka (2018) Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a universal approximator. Advances in Neural Information Processing Systems, 31, 2018.
- Lu et al. (2017) Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, pages 1045–1048. Makuhari, 2010.
- Park et al. (2020) Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. arXiv preprint arXiv:2006.08859, 2020.
- Park et al. (2021) Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. In International Conference on Learning Representations, 2021.
- Rolnick and Tegmark (2018) David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In the International Conference on Learning Representations, 2018.
- Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
- Schäfer and Zimmermann (2007) Anton Maximilian Schäfer and Hans-Georg Zimmermann. Recurrent neural networks are universal approximators. International journal of neural systems, 17(04):253–263, 2007.
- Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
- Wu et al. (2017) Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. Recurrent recommender networks. In Proceedings of the tenth ACM international conference on web search and data mining, pages 495–503, 2017.
- Yun et al. (2020) Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.
- Zhai et al. (2022) Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
- Zhou (2020) Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and computational harmonic analysis, 48(2):787–794, 2020.