
Minimal Width for Universal Property of Deep RNN

Chang hoon Song ([email protected])
Department of Mathematical Science, Seoul National University

Geonho Hwang ([email protected])
Department of Mathematical Science, Seoul National University

Jun ho Lee ([email protected])
Kongju National University

Myungjoo Kang ([email protected])
Department of Mathematical Science, Seoul National University
Abstract

A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system in a compact domain. In practice, deep narrow networks with bounded width and arbitrary depth are often more effective than wide shallow networks with arbitrary width and bounded depth; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show that a deep RNN with ReLU activation can approximate any continuous function or L^{p} function with the widths d_{x}+d_{y}+3 and \max\{d_{x}+1,d_{y}\}, respectively, where the target function maps a finite sequence of vectors in \mathbb{R}^{d_{x}} to a finite sequence of vectors in \mathbb{R}^{d_{y}}. We also compute the additional width required if the activation function is the tanh, the sigmoid, or another non-degenerate function. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and technique can shed light on further research on deep RNNs.

Keywords: recurrent neural network, universal approximation, deep narrow network, sequential data, minimum width

1 Introduction

Deep learning, a type of machine learning that approximates a target function using a numerical model called an artificial neural network, has shown tremendous success in diverse fields, such as regression (Garnelo et al., 2018), image processing (Dai et al., 2021; Zhai et al., 2022), speech recognition (Baevski et al., 2020), and natural language processing (LeCun et al., 2015; Lavanya and Sasikala, 2021).

While the excellent performance of deep learning is attributed to a combination of various factors, it is reasonable to speculate that its notable success is partially based on the universal approximation theorem, which states that neural networks are capable of arbitrarily accurate approximations of the target function.

Formally, for any given function f in a target class and \epsilon>0, there exists a network \mathcal{N} such that \|f-\mathcal{N}\|<\epsilon. In topological language, the theorem states that a set of networks is dense in the class of target functions. In this sense, the closure of the set of networks defines the range of functions that the network class can represent.

As universality provides the expressive power of a network structure, studies on the universal approximation theorem have attracted increasing attention. Examples include the universality of multi-layer perceptrons (MLPs), the most basic neural networks (Cybenko, 1989; Hornik et al., 1989; Leshno et al., 1993), and the universality of recurrent neural networks (RNNs) with the target class of open dynamical systems (Schäfer and Zimmermann, 2007). Recently, Zhou (2020) has demonstrated the universality of convolutional neural networks.

Classical universal approximation theorems specialize in the representation power of shallow wide networks with bounded depth and unbounded width. Based on mounting empirical evidence that deep networks perform better than wide networks, the construction of deep networks instead of shallow networks has gained considerable attention in recent literature. Consequently, researchers have started to analyze the universal approximation property of deep networks (Cohen et al., 2016; Lin and Jegelka, 2018; Rolnick and Tegmark, 2018; Kidger and Lyons, 2020). Studies on MLPs have shown that wide shallow networks require a depth of only two to attain universality, whereas deep narrow networks require a width at least as large as their input dimension.

A wide network attains universality by increasing its width even if the depth is only two (Cybenko, 1989; Hornik et al., 1989; Leshno et al., 1993). In the case of a deep network, however, there are functions that a narrow network cannot approximate regardless of its depth (Lu et al., 2017; Park et al., 2021). Therefore, clarifying the minimum width that guarantees universality is crucial, and studies are underway to narrow the gap between its lower and upper bounds.

Network | Function class | Activation | Result
RNN | C\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | ReLU | w_{\text{min}}\leq d_{x}+d_{y}+3
 | | conti. nonpoly1 | w_{\text{min}}\leq d_{x}+d_{y}+3
 | | conti. nonpoly2 | w_{\text{min}}\leq d_{x}+d_{y}+4
 | L^{p}\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | conti. nonpoly1,2 | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}+1
 | L^{p}\left(\mathbb{R}^{d_{x}},\mathbb{R}^{d_{y}}\right)\dagger | ReLU | w_{\text{min}}=\max\left\{d_{x}+1,d_{y}\right\}
LSTM | C\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | | w_{\text{min}}\leq d_{x}+d_{y}+3
 | L^{p}\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}+1
GRU | C\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | | w_{\text{min}}\leq d_{x}+d_{y}+3
 | L^{p}\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}+1
BRNN | C\left(\mathcal{K},\mathbb{R}^{d_{y}}\right) | ReLU | w_{\text{min}}\leq d_{x}+d_{y}+3
 | | conti. nonpoly1 | w_{\text{min}}\leq d_{x}+d_{y}+3
 | | conti. nonpoly2 | w_{\text{min}}\leq d_{x}+d_{y}+4
 | L^{p}\left(\mathcal{K},\mathbb{R}^{d_{y}}\right) | conti. nonpoly1,2 | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}+1
 | L^{p}\left(\mathbb{R}^{d_{x}},\mathbb{R}^{d_{y}}\right) | ReLU | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}
  • \dagger: requires the class to consist of past-dependent functions.

  • 1: requires an activation \sigma to be continuously differentiable at some point z_{0} with \sigma(z_{0})=0 and \sigma^{\prime}(z_{0})\neq 0; \tanh belongs here.

  • 2: requires an activation \sigma to be continuously differentiable at some point z_{0} with \sigma^{\prime}(z_{0})\neq 0; the logistic sigmoid function belongs here.

Table 1: Summary of our results on the upper bound of the minimal width w_{\text{min}} for the universality of RNNs. In the table, \mathcal{K} indicates a compact subset of \mathbb{R}^{d_{x}} and 1\leq p<\infty. We abbreviate continuous to “conti”.

Recurrent neural networks (RNNs) (Rumelhart et al., 1986; Elman, 1990) have been crucial for modeling complex temporal dependencies in sequential data. They have various applications in diverse fields, such as language modeling (Mikolov et al., 2010; Jozefowicz et al., 2016), speech recognition (Graves and Jaitly, 2014; Bahdanau et al., 2016), recommendation systems (Hidasi et al., 2015; Wu et al., 2017), and machine translation (Bahdanau et al., 2014). Deep RNNs are widely used and have been successfully applied in practical applications. However, their theoretical understanding remains elusive despite their intensive use. This deficiency in existing studies motivated our work.

In this study, we prove the universal approximation theorem of deep narrow RNNs and discover the upper bound of their minimum width. The target class consists of sequence-to-sequence functions that depend solely on past information. We refer to such functions as past-dependent functions. We provide the upper bound of the minimum width of the RNN for universality in the space of past-dependent functions. Surprisingly, the upper bound is independent of the length of the sequence. This theoretical result highlights the suitability of the recurrent structure for sequential data compared with other network structures. Furthermore, our results are not restricted to RNNs; they can be generalized to variants of RNNs, including long short-term memory (LSTM), gated recurrent units (GRU), and bidirectional RNNs (BRNN). As corollaries of our main theorem, LSTM and GRU are shown to have the same universality and target class as an RNN. We also prove that the BRNN can approximate any sequence-to-sequence function in a continuous or L^{p} space under the respective norms. We also present the upper bound of the minimum width for these variants. Table 1 outlines our main results.

1.1 Summary of Contributions

With a target class of functions that map a finite sequence x\in\mathbb{R}^{d_{x}\times N} to a finite sequence y\in\mathbb{R}^{d_{y}\times N}, we prove the following:

  • A deep RNN can approximate any past-dependent sequence-to-sequence continuous function with width d_{x}+d_{y}+3 for ReLU or \tanh (generally, a non-degenerate \sigma with \sigma(z_{0})=0 requires the same minimal width as \tanh), and d_{x}+d_{y}+4 for non-degenerating activations.

  • A deep RNN can approximate any past-dependent L^{p} function (1\leq p<\infty) with width \max\left\{d_{x}+1,d_{y}\right\} for the ReLU activation and \max\left\{d_{x}+1,d_{y}\right\}+1 for non-degenerating activations.

  • A deep BRNN can approximate any sequence-to-sequence continuous function with width d_{x}+d_{y}+3 for ReLU or \tanh (generally, a non-degenerate \sigma with \sigma(z_{0})=0 requires the same minimal width as \tanh), and d_{x}+d_{y}+4 for non-degenerating activations.

  • A deep BRNN can approximate any sequence-to-sequence L^{p} function (1\leq p<\infty) with width \max\left\{d_{x}+1,d_{y}\right\} for the ReLU activation and \max\left\{d_{x}+1,d_{y}\right\}+1 for non-degenerating activations.

  • A deep LSTM or GRU can approximate any past-dependent sequence-to-sequence continuous function with width d_{x}+d_{y}+3 and L^{p} function with width \max\left\{d_{x}+1,d_{y}\right\}+1.

1.2 Organization

In Section 2, we briefly review previous studies on the universality of neural networks. Section 3 provides some notations, network configuration, a description of the target function class, and the condition on the activation function. Section 4 addresses the approximation property in a continuous function space. First, we fit only the last component of an output sequence and extend it to all components of the sequence. In Section 5, we present the upper bound of the minimum width in L^{p} space. As an expansion of our theorems, the universal approximation theorems for the LSTM, GRU, and BRNN are deduced in Section 6. Sections 7 and 8 present a discussion and conclusions, respectively.

2 Related Works

We briefly review some of the results of studies on the universal approximation property. Studies have been conducted to determine whether a neural network can learn a sufficiently wide range of functions, that is, whether it has universal properties. Cybenko (1989) and Hornik et al. (1989) first proved that the most basic network, a simple two-layered MLP, can approximate arbitrary continuous functions defined on a compact set. Some follow-up studies have investigated the universal properties of other structures for a specific task, such as a convolutional neural network for image processing (Zhou, 2020), an RNN for open dynamical systems (Schäfer and Zimmermann, 2007; Hanson and Raginsky, 2020), and transformer networks for translation and speech recognition (Yun et al., 2020). Particularly for RNNs, Schäfer and Zimmermann (2007) showed that an open dynamical system with continuous state transition and output function can be approximated by a network with a wide RNN cell and subsequent linear layer in finite time. Hanson and Raginsky (2020) showed that the trajectory of the dynamical system can be reproduced with arbitrarily small errors up to infinite time, assuming a stability condition on long-term behavior.

While such prior studies mainly focused on wide and shallow networks, several studies have investigated whether a deep narrow network with bounded width can approximate arbitrary functions in certain target function classes, such as a space of continuous functions or an L^{p} space (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Kidger and Lyons, 2020; Park et al., 2021). Unlike the case of a wide network, which requires only one hidden layer for universality, there exists a function that cannot be approximated by any network whose width is less than a certain threshold. Specifically, in a continuous function space C\left(K,\mathbb{R}^{d_{y}}\right) with the supremum norm \left\|f\right\|_{\infty}\coloneqq\sup_{x\in K}\left\|f(x)\right\| on a compact subset K\subset\mathbb{R}^{d_{x}}, Hanin and Sellke (2017) showed both negative and positive results: MLPs do not attain universality with width d_{x}, but width d_{x}+d_{y} is sufficient for MLPs to achieve universality with the ReLU activation function. Johnson (2019) and Kidger and Lyons (2020) generalized the condition on the activation and proved that d_{x} is too narrow, but d_{x}+d_{y}+2 is sufficiently wide for universality in C\left(K,\mathbb{R}^{d_{y}}\right). On the other hand, in an L^{p} space L^{p}\left(\mathbb{R}^{d_{x}},\mathbb{R}^{d_{y}}\right) with the L^{p}-norm \left\|f\right\|_{p}\coloneqq\left(\int_{\mathbb{R}^{d_{x}}}\left\|f(x)\right\|^{p}\,dx\right)^{1/p} for p\in[1,\infty), Lu et al. (2017) showed that width d_{x} is insufficient for MLPs to have universality, but d_{x}+4 is enough, with the ReLU activation when p=1 and d_{y}=1. Kidger and Lyons (2020) found that MLPs whose width is d_{x}+d_{y}+1 achieve universality in L^{p}\left(\mathbb{R}^{d_{x}},\mathbb{R}^{d_{y}}\right) with the ReLU activation, and Park et al. (2021) determined the exact value \max\left\{d_{x}+1,d_{y}\right\} required for universality in the L^{p} space with ReLU.

As described earlier, studies on deep narrow MLP have been actively conducted, but the approximation ability of deep narrow RNNs remains unclear. This is because the process by which the input affects the result is complicated compared with that of an MLP. The RNN cell transfers information to both the next time step in the same layer and the same time step in the next layer, which makes it difficult to investigate the minimal width. In this regard, we examined the structure of the RNN to apply the methodology and results from the study of MLPs to deep narrow RNNs.

3 Terminologies and Notations

This section introduces the definition of the network architecture and the notation used throughout this paper. d_{x} and d_{y} denote the dimensions of the input and output spaces, respectively. \sigma is an activation function unless otherwise stated. Occasionally, v indicates a vector of suitable dimension.

First, we use square brackets, subscripts, and colon symbols to index a sequence of vectors. More precisely, for a given sequence of d_{x}-dimensional vectors x:\mathbb{N}\rightarrow\mathbb{R}^{d_{x}}, x[t]_{j} or x_{j}[t] denotes the j-th component of the t-th vector. The colon symbol : denotes a contiguous range of indices, such as x[a:b]=\left(x[i]\right)_{a\leq i\leq b} or x[t]_{a:b}=\left(x[t]_{a},x[t]_{a+1},\ldots,x[t]_{b}\right)^{T}\in\mathbb{R}^{b-a+1}. We call the sequential index t the time and each x[t] a token.

Second, we define token-wise linear maps \mathcal{P}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{s}\times N} and \mathcal{Q}:\mathbb{R}^{d_{s}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} to connect the input, hidden state, and output spaces. As the dimension of the hidden state space \mathbb{R}^{d_{s}} on which the RNN cells act differs from those of the input domain \mathbb{R}^{d_{x}} and output domain \mathbb{R}^{d_{y}}, we need maps adjusting the dimensions of the spaces. For a given matrix P\in\mathbb{R}^{d_{s}\times d_{x}}, a lifting map \mathcal{P}(x)[t]\coloneqq Px[t] lifts each input vector to the hidden state space. Similarly, for a given matrix Q\in\mathbb{R}^{d_{y}\times d_{s}}, a projection map \mathcal{Q}(s)[t]\coloneqq Qs[t] projects each hidden state onto the output space. As a token-wise map is determined by its action on a single token, we sometimes write token-wise maps without the time length, such as \mathcal{P}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{s}} instead of \mathcal{P}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{s}\times N}.

Subsequently, an RNN is constructed as a composition of basic recurrent cells between the lifting and projection maps. We consider four basic cells: RNN, LSTM, GRU, and BRNN; a minimal code sketch of these cells and of the composed network is given after the list below.

  • RNN Cell A recurrent cell, recurrent layer, or RNN cell \mathcal{R} maps an input sequence x=\left(x[1],x[2],\ldots\right)=\left(x[t]\right)_{t\in\mathbb{N}}\in\mathbb{R}^{d_{s}\times\mathbb{N}} to an output sequence y=\left(y[t]\right)_{t\in\mathbb{N}}\in\mathbb{R}^{d_{s}\times\mathbb{N}} using

    y[t+1]=\mathcal{R}(x)[t+1]=\sigma\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right), (1)

    where \sigma is an activation function, A,B\in\mathbb{R}^{d_{s}\times d_{s}} are the weight matrices, and \theta\in\mathbb{R}^{d_{s}} is the bias vector. The initial state \mathcal{R}(x)[0] can be an arbitrary constant vector, which is the zero vector \mathbf{0} in this setting.

  • LSTM cell An LSTM cell is an RNN cell variant with an internal cell state and gate structures. In the original version proposed by Hochreiter and Schmidhuber (1997), an LSTM cell includes only two gates: an input gate and an output gate. However, almost all networks used in modern deep learning are based on LSTM cells with an additional gate, called the forget gate, introduced by Gers et al. (2000). Hence, we present the LSTM cell with three gates instead of the original one.

    Mathematically, an LSTM cell \mathcal{R}_{LSTM} is a process that computes an output h\in\mathbb{R}^{d_{s}} and a cell state c\in\mathbb{R}^{d_{s}}, defined by the following relation:

    i[t+1] = \sigma_{\operatorname{sig}}\left(W_{i}x[t+1]+U_{i}h[t]+b_{i}\right) (2)
    f[t+1] = \sigma_{\operatorname{sig}}\left(W_{f}x[t+1]+U_{f}h[t]+b_{f}\right)
    g[t+1] = \tanh\left(W_{g}x[t+1]+U_{g}h[t]+b_{g}\right)
    o[t+1] = \sigma_{\operatorname{sig}}\left(W_{o}x[t+1]+U_{o}h[t]+b_{o}\right)
    c[t+1] = f[t+1]\odot c[t]+i[t+1]\odot g[t+1]
    h[t+1] = o[t+1]\odot\tanh\left(c[t+1]\right)

    where W_{*}\in\mathbb{R}^{d_{s}\times d_{s}} and U_{*}\in\mathbb{R}^{d_{s}\times d_{s}} are weight matrices; b_{*}\in\mathbb{R}^{d_{s}} is the bias vector for each *=i,f,g,o; and \sigma_{\operatorname{sig}} is the sigmoid activation function. \odot denotes component-wise multiplication, and we set the initial state to zero in this study.

    Although some variants of the LSTM have additional connections, such as peephole connections, our result for the LSTM extends naturally to them.

  • GRU cell Another important variant of an RNN is the GRU, proposed by Cho et al. (2014). Mathematically, a GRU cell \mathcal{R}_{GRU} is a process that computes h\in\mathbb{R}^{d_{s}} defined by

    r[t+1] = \sigma_{\operatorname{sig}}\left(W_{r}x[t+1]+U_{r}h[t]+b_{r}\right), (3)
    \tilde{h}[t+1] = \tanh\left(W_{h}x[t+1]+U_{h}\left(r[t+1]\odot h[t]\right)+b_{h}\right),
    z[t+1] = \sigma_{\operatorname{sig}}\left(W_{z}x[t+1]+U_{z}h[t]+b_{z}\right),
    h[t+1] = \left(1-z[t+1]\right)\odot h[t]+z[t+1]\odot\tilde{h}[t+1],

    where W_{*}\in\mathbb{R}^{d_{s}\times d_{s}} and U_{*}\in\mathbb{R}^{d_{s}\times d_{s}} are weight matrices, b_{*}\in\mathbb{R}^{d_{s}} is the bias vector for each *=r,z,h, and \sigma_{\operatorname{sig}} is the sigmoid activation function. \odot denotes component-wise multiplication, and we set the initial state to zero in this study.

  • BRNN cell A BRNN cell \mathcal{B}\mathcal{R} consists of a pair of RNN cells and a token-wise linear map that follows the cells (Schuster and Paliwal, 1997). One RNN cell \mathcal{R} in the BRNN cell \mathcal{B}\mathcal{R} receives the input from x[1] to x[N], and the other \bar{\mathcal{R}} receives the input from x[N] to x[1] in reverse order. Then, the linear map \mathcal{L} in \mathcal{B}\mathcal{R} combines the two outputs from the RNN cells. Specifically, a BRNN cell \mathcal{B}\mathcal{R} is defined as follows:

    \mathcal{R}(x)[t+1] \coloneqq \sigma\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right), (4)
    \bar{\mathcal{R}}(x)[t-1] \coloneqq \sigma\left(\bar{A}\bar{\mathcal{R}}(x)[t]+\bar{B}x[t-1]+\bar{\theta}\right),
    \mathcal{B}\mathcal{R}(x)[t] \coloneqq \mathcal{L}\left(\mathcal{R}(x)[t],\bar{\mathcal{R}}(x)[t]\right)
                 \coloneqq W\mathcal{R}(x)[t]+\bar{W}\bar{\mathcal{R}}(x)[t],

    where A, B, \bar{A}, \bar{B}, W, and \bar{W} are weight matrices, and \theta and \bar{\theta} are bias vectors.

  • Network architecture An RNN \mathcal{N} comprises a lifting map \mathcal{P}, a projection map \mathcal{Q}, and L recurrent cells \mathcal{R}_{1},\ldots,\mathcal{R}_{L};

    \mathcal{N}\coloneqq\mathcal{Q}\circ\mathcal{R}_{L}\circ\cdots\circ\mathcal{R}_{1}\circ\mathcal{P}. (5)

    We denote the network as a stack RNN or deep RNN when L\geq 2, and each output of the cell \mathcal{R}_{i} as the i-th hidden state. d_{s} indicates the width of the network. As a special case, a recurrent cell \mathcal{R} in (1) with A=0 is a composition of a token-wise affine map and the activation; hence, a token-wise MLP is a specific example of an RNN. If LSTM, GRU, or BRNN cells replace the recurrent cells, the network is called an LSTM, a GRU, or a BRNN.
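To make the cell definitions concrete, the following is a minimal numerical sketch of the four cells and the composed network in (1)-(5), assuming numpy; the function names, dimensions, and random weights are illustrative placeholders, not the constructions used in our proofs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_cell(x, A, B, theta, act=np.tanh):
    """Eq. (1): s[t+1] = act(A s[t] + B x[t+1] + theta), with s[0] = 0."""
    d, N = B.shape[0], x.shape[1]
    s, out = np.zeros(d), np.zeros((d, N))
    for t in range(N):
        s = act(A @ s + B @ x[:, t] + theta)
        out[:, t] = s
    return out

def lstm_cell(x, W, U, b):
    """Eq. (2): W, U, b are dicts keyed by the gates 'i', 'f', 'g', 'o'."""
    d, N = W["i"].shape[0], x.shape[1]
    h, c, out = np.zeros(d), np.zeros(d), np.zeros((d, N))
    for t in range(N):
        i = sigmoid(W["i"] @ x[:, t] + U["i"] @ h + b["i"])
        f = sigmoid(W["f"] @ x[:, t] + U["f"] @ h + b["f"])
        g = np.tanh(W["g"] @ x[:, t] + U["g"] @ h + b["g"])
        o = sigmoid(W["o"] @ x[:, t] + U["o"] @ h + b["o"])
        c = f * c + i * g
        h = o * np.tanh(c)
        out[:, t] = h
    return out

def gru_cell(x, W, U, b):
    """Eq. (3): W, U, b are dicts keyed by 'r', 'z', 'h'."""
    d, N = W["r"].shape[0], x.shape[1]
    h, out = np.zeros(d), np.zeros((d, N))
    for t in range(N):
        r = sigmoid(W["r"] @ x[:, t] + U["r"] @ h + b["r"])
        z = sigmoid(W["z"] @ x[:, t] + U["z"] @ h + b["z"])
        h_tilde = np.tanh(W["h"] @ x[:, t] + U["h"] @ (r * h) + b["h"])
        h = (1 - z) * h + z * h_tilde
        out[:, t] = h
    return out

def brnn_cell(x, fwd, bwd, W, W_bar):
    """Eq. (4): a forward and a backward RNN cell combined token-wise."""
    forward = rnn_cell(x, *fwd)                     # reads x[1], ..., x[N]
    backward = rnn_cell(x[:, ::-1], *bwd)[:, ::-1]  # reads x[N], ..., x[1]
    return W @ forward + W_bar @ backward

def deep_rnn(x, P, cells, Q):
    """Eq. (5): N = Q o R_L o ... o R_1 o P with token-wise linear P and Q."""
    s = P @ x                                       # lifting, shape (d_s, N)
    for (A, B, theta) in cells:
        s = rnn_cell(s, A, B, theta)
    return Q @ s                                    # projection, shape (d_y, N)

# Example usage with arbitrary sizes and random weights.
rng = np.random.default_rng(0)
d_x, d_s, d_y, N, L = 3, 7, 2, 5, 4
x = rng.standard_normal((d_x, N))
P = rng.standard_normal((d_s, d_x))
Q = rng.standard_normal((d_y, d_s))
cells = [(0.1 * rng.standard_normal((d_s, d_s)),
          0.1 * rng.standard_normal((d_s, d_s)),
          0.1 * rng.standard_normal(d_s)) for _ in range(L)]
print(deep_rnn(x, P, cells, Q).shape)               # (d_y, N): one output token per input token
```

The sigmoid and \tanh inside the LSTM and GRU cells are fixed by their definitions, whereas the activation of rnn_cell plays the role of the network activation \sigma.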

In addition to the type of cell, the activation function \sigma affects universality. We focus on the case of ReLU or \tanh while also considering a general activation function satisfying the condition proposed by Kidger and Lyons (2020): \sigma is a continuous non-polynomial function that is continuously differentiable at some point z_{0} with \sigma^{\prime}(z_{0})\neq 0. We refer to this condition as the non-degenerate condition and to z_{0} as a non-degenerating point.

Finally, the target class must be set as a subset of the sequence-to-sequence function space, from \mathbb{R}^{d_{x}} to \mathbb{R}^{d_{y}}. Given an RNN \mathcal{N}, each token y[t] of the output sequence y=\mathcal{N}(x) depends only on x[1\mathbin{:}t]\coloneqq\left(x[1],x[2],\ldots,x[t]\right) for the input sequence x. We define this property as past dependency and a function with this property as a past-dependent function. More precisely, if all the output tokens of a sequence-to-sequence function are given by f[t]\left(x[1\mathbin{:}t]\right) for functions f[t]:\mathbb{R}^{d_{x}\times t}\rightarrow\mathbb{R}^{d_{y}}, we say that the function is past-dependent. Meanwhile, we must fix a finite length or terminal time N<\infty of the input and output sequences. Without additional assumptions such as in Hanson and Raginsky (2020), errors generally accumulate over time, making it impossible to approximate implicit dynamics up to infinite time regardless of past dependency. Therefore, we set the target function class as a class of past-dependent sequence-to-sequence functions with sequence length N.

Remark 1

On a compact domain and under a bounded length, the continuity of f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} implies that of each f[t]:\mathbb{R}^{d_{x}\times t}\rightarrow\mathbb{R}^{d_{y}} and vice versa. In the case of the L^{p} norm with 1\leq p<\infty, f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} is L^{p} integrable if and only if f[t] is L^{p} integrable for each t. In both cases, a sequence of functions \left(f_{n}\right)_{n\in\mathbb{N}} converges to g if and only if \left(f_{n}[t]\right)_{n\in\mathbb{N}} converges to g[t] for each t. Thus, we focus on approximating f[t] for each t under the given conditions.

Remark 2

There are typical tasks where the length of the output sequence differs from that of the input sequence. In translation tasks, for instance, inputs and outputs are sentences represented as sequences of word tokens in different languages, and the lengths of the two sequences generally differ. Nevertheless, even in such a case, a sequence x=\left(x[1],x[2],\ldots,x[N]\right)\in\mathbb{R}^{d_{x}\times N} can be naturally embedded into \mathbb{R}^{d_{x}\times M} for an integer M>N as an extended sequence \left(x[1],x[2],\ldots,x[N],\mathbf{0},\ldots,\mathbf{0}\right) or \left(\mathbf{0},\ldots,\mathbf{0},x[1],x[2],\ldots,x[N]\right). By this natural embedding, a map between sequences f:\mathbb{R}^{d_{x}\times N_{1}}\rightarrow\mathbb{R}^{d_{y}\times N_{2}} is a special case of a map f:\mathbb{R}^{d_{x}\times M}\rightarrow\mathbb{R}^{d_{y}\times M} for an integer M\geq N_{1},N_{2}.

Sometimes, only the last value \mathcal{N}(x)[N] is required, in which case an RNN \mathcal{N} is regarded as a sequence-to-vector function \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}}. We freely use the terminology RNN for both sequence-to-sequence and sequence-to-vector functions because there is no confusion when the output domain is evident.

We have described all the concepts necessary to set up the problem, but we end this section with an introduction to the concepts used in the proof of the main theorem. For the convenience of the proof, we slightly modify the activation \sigma to act only on some components instead of all components. With an activation \sigma and an index set I\subseteq\mathbb{N}, the modified activation \sigma_{I} is defined as

\sigma_{I}(s)_{i}=\begin{cases}\sigma(s_{i})&\text{if }i\in I\\ s_{i}&\text{otherwise}\end{cases} (6)

Using the modified activation function \sigma_{I}, the basic cells of the network in (1) are modified accordingly. For example, a modified recurrent cell can be defined as

\mathcal{R}(x)[t+1]_{i} = \sigma_{I}\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right)_{i} (7)
 = \begin{cases}\sigma\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right)_{i}&\text{if }i\in I\\ \left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right)_{i}&\text{otherwise.}\end{cases}

Similarly, a modified RNN, LSTM, GRU, or BRNN is defined by using modified cells in (5). This concept is similar to the enhanced neuron of Kidger and Lyons (2020) in that the activation can be selectively applied, but differs in that the activation can be applied to only part of the components.
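A minimal sketch of the modified activation (6) and the modified recurrent cell (7), assuming numpy; the names sigma_I and modified_rnn_cell are placeholders.

```python
import numpy as np

def sigma_I(s, I, sigma=np.tanh):
    """Eq. (6): apply sigma only to the components whose indices lie in I."""
    out = s.copy()
    idx = list(I)
    out[idx] = sigma(s[idx])
    return out

def modified_rnn_cell(x, A, B, theta, I, sigma=np.tanh):
    """Eq. (7): the recurrence of (1) with sigma replaced by sigma_I."""
    d, N = B.shape[0], x.shape[1]
    s, out = np.zeros(d), np.zeros((d, N))
    for t in range(N):
        s = sigma_I(A @ s + B @ x[:, t] + theta, I, sigma)
        out[:, t] = s
    return out
```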

As activation leads to the non-linearity of a network, modifying the activation can affect the minimum width of the network. Fortunately, the following lemma shows that the minimum width increases by at most one owing to the modification. We briefly introduce the ideas here, with a detailed proof provided in the appendix.

Lemma 3

Let \bar{\mathcal{R}}:\mathbb{R}^{d\times N}\rightarrow\mathbb{R}^{d\times N} be a modified RNN cell, and let \bar{\mathcal{Q}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} and \bar{\mathcal{P}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be a token-wise linear projection and lifting map. Suppose that the activation function \sigma of \bar{\mathcal{R}} is non-degenerate with a non-degenerating point z_{0}. Then, for any compact subset K\subset\mathbb{R}^{d} and \epsilon>0, there exist RNN cells \mathcal{R}_{1},\mathcal{R}_{2}:\mathbb{R}^{(d+\beta(\sigma))\times N}\rightarrow\mathbb{R}^{(d+\beta(\sigma))\times N} and token-wise linear maps \mathcal{P}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d+\beta(\sigma)}, \mathcal{Q}:\mathbb{R}^{d+\beta(\sigma)}\rightarrow\mathbb{R}^{d} such that

\sup_{x\in K^{N}}\left\|\bar{\mathcal{Q}}\circ\bar{\mathcal{R}}\circ\bar{\mathcal{P}}(x)-\mathcal{Q}\circ\mathcal{R}_{2}\circ\mathcal{R}_{1}\circ\mathcal{P}(x)\right\|<\epsilon, (8)

where \beta(\sigma)=\begin{cases}0&\text{if }\sigma\left(z_{0}\right)=0\\ 1&\text{otherwise.}\end{cases}

Proof [Sketch of proof] The detailed proof is available in Appendix B. We use the Taylor expansion of \sigma at z_{0} to recover the value before activation. For the i-th component with i\not\in I, choose a small \delta>0 and linearly approximate \sigma\left(z_{0}+\delta z\right) as \sigma(z_{0})+\delta\sigma^{\prime}(z_{0})z. An affine transform after the Taylor expansion recovers z.
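A numerical check of this Taylor-expansion trick, assuming numpy; the logistic sigmoid and the point z_{0}=0 are chosen only for illustration.

```python
import numpy as np

# sigma(z0 + delta*z) ~ sigma(z0) + delta*sigma'(z0)*z for small delta, so the
# affine map u -> (u - sigma(z0)) / (delta * sigma'(z0)) recovers z approximately.
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid
z0, s0, ds0 = 0.0, 0.5, 0.25                 # sigma(0) = 1/2, sigma'(0) = 1/4

z = np.linspace(-1.0, 1.0, 9)
for delta in (1e-1, 1e-2, 1e-3):
    recovered = (sigma(z0 + delta * z) - s0) / (delta * ds0)
    print(delta, np.max(np.abs(recovered - z)))  # recovery error shrinks with delta
```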

Remark 4

Since the additional width is only used to move the input of each RNN cell to near z_{0}, it seems at first glance to be easily removable using a bias term, provided the projection map is affine instead of linear. However, the bias term \theta in (1) is fixed across all time steps, while we need different translations: one for the first token and another for the tokens from the second onward.

For example, for an RNN cell \mathcal{R}^{\delta}:\mathbb{R}^{1\times\mathbb{N}}\rightarrow\mathbb{R}^{1\times\mathbb{N}} defined by

\mathcal{R}^{\delta}(x)[t+1]=\sigma\left(a\mathcal{R}^{\delta}(x)[t]+\delta x[t+1]+\theta\right), (9)

we need \theta=z_{0} to use the Taylor expansion at z_{0} for the first token \mathcal{R}^{\delta}(x)[1]=\sigma\left(\delta x[1]+\theta\right). When \theta=z_{0}, we cannot represent the second token

\mathcal{R}^{\delta}(x)[2] = \sigma\left(a\mathcal{R}^{\delta}(x)[1]+\delta x[2]+z_{0}\right) (10)
 = \sigma\left(a\left(\sigma\left(z_{0}\right)+\delta\sigma^{\prime}\left(z_{0}\right)x[1]+o\left(\delta\right)\right)+\delta x[2]+z_{0}\right), (11)

as a Taylor expansion at z_{0} unless a\sigma\left(z_{0}\right)=0.

In other words, the need for the additional width originates from the fact that the previous state \mathcal{R}(x)[t-1] appears only from the second token onward. Nonetheless, we can make \beta\equiv 0 by setting a proper initial state for each RNN cell and using an affine projection map.

The lemma implies that a modified RNN can be approximated by an RNN with at most one additional width. For a given modified RNN \bar{\mathcal{Q}}\circ\bar{\mathcal{R}}_{L}\circ\cdots\circ\bar{\mathcal{R}}_{1}\circ\bar{\mathcal{P}} of width d and \epsilon>0, we can find RNN cells \mathcal{R}_{1},\ldots,\mathcal{R}_{2L} and linear maps \mathcal{P}_{1},\ldots,\mathcal{P}_{L}, \mathcal{Q}_{1},\ldots,\mathcal{Q}_{L} such that

\sup_{x\in K^{N}}\left\|\bar{\mathcal{Q}}\circ\bar{\mathcal{R}}_{L}\circ\cdots\circ\bar{\mathcal{R}}_{1}\circ\bar{\mathcal{P}}(x)-\left(\mathcal{Q}_{L}\mathcal{R}_{2L}\mathcal{R}_{2L-1}\mathcal{P}_{L}\right)\circ\cdots\circ\left(\mathcal{Q}_{1}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}_{1}\right)(x)\right\|<\epsilon. (12)

The composition \mathcal{R}\circ\mathcal{P} of an RNN cell \mathcal{R} and a token-wise linear map \mathcal{P} can be substituted by another RNN cell \mathcal{R}^{\prime}. More concretely, for \mathcal{R} and \mathcal{P} defined by

\mathcal{R}(x)[t+1] = \sigma\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right), (13)
\mathcal{P}(x)[t] = P\left(x[t]\right), (14)

\mathcal{R}\circ\mathcal{P} defines an RNN cell \mathcal{R}^{\prime} by

\mathcal{R}^{\prime}(x)[t+1] = \sigma\left(A\mathcal{R}^{\prime}(x)[t]+BPx[t+1]+\theta\right). (15)

Thus, \mathcal{R}_{2l+1}\circ\left(\mathcal{P}_{l+1}\mathcal{Q}_{l}\right) becomes a recurrent cell, and \left(\mathcal{Q}_{L}\mathcal{R}_{2L}\mathcal{R}_{2L-1}\mathcal{P}_{L}\right)\circ\cdots\circ\left(\mathcal{Q}_{1}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}_{1}\right)(x) defines a network of the form (5).

4 Universal Approximation for Deep RNN in Continuous Function Space

This section introduces the universal approximation theorem of deep RNNs in continuous function space.

Theorem 5 (Universal approximation theorem of deep RNN 1)

Let f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a continuous past-dependent sequence-to-sequence function and \sigma be a non-degenerate activation function. Then, for any \epsilon>0 and compact subset K\subset\mathbb{R}^{d_{x}}, there exists a deep RNN \mathcal{N} of width d_{x}+d_{y}+3+\alpha(\sigma) such that

\sup_{x\in K^{N}}\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|<\epsilon, (16)

where \alpha\left(\sigma\right)=\begin{cases}0&\text{if $\sigma$ is ReLU or a non-degenerating function with $\sigma\left(z_{0}\right)=0$,}\\ 1&\text{if $\sigma$ is a non-degenerating function with $\sigma\left(z_{0}\right)\neq 0$.}\end{cases}

To prove the above theorem, we deal with the case of the sequence-to-vector function \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}} first, in subsection 4.1. Then, we extend our idea to a sequence-to-sequence function using bias terms to separate the input vectors at different times in subsection 4.2, and the proof of Theorem 5 is presented at the end of this section. As the number of additional components required in each step in the subsections depends on the activation function, we use \alpha(\sigma) to state the theorem briefly.

4.1 Approximation of sequence-to-vector function

The motivation for approximating a sequence-to-vector function f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}} is to copy a two-layered MLP via a modified RNN. Note that the output of a two-layered MLP is a linear sum of the outputs of its hidden nodes, and each output of the hidden nodes is a linear sum of the inputs followed by the activation function. In Lemma 6, we construct a modified RNN that simulates a hidden node of an MLP, which can be represented as the activation of a sum of inner products between given matrices and the N input vectors in \mathbb{R}^{d_{x}}. After that, we use an additional buffer component in Lemma 7 to copy another hidden node in the two-layered MLP, and the following modified RNN cell accumulates the two results from the nodes. The buffer component of the modified RNN cell is then reset to zero to copy another hidden node. Repeating the procedure, the modified RNN with bounded width copies the two-layered MLP.

Now, we present the statements and sketches of the proof corresponding to each step. The following lemma implies that a modified RNN can compute the linear sum of all the input components, which copies the hidden node of a two-layered MLP.

Lemma 6

Suppose A[1],A[2],\cdots,A[N]\in\mathbb{R}^{1\times d_{x}} are the given matrices. Then there exists a modified RNN \mathcal{N}=\mathcal{R}_{N}\circ\mathcal{R}_{N-1}\circ\cdots\circ\mathcal{R}_{1}\circ\mathcal{P}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} of width d_{x}+1 such that (the symbol \ast indicates that there exists some value irrelevant to the proof)

\mathcal{N}(x)[t] = \begin{bmatrix}x[t]\\ \ast\end{bmatrix} \text{ for }t<N, (17)
\mathcal{N}(x)[N] = \begin{bmatrix}x[N]\\ \sigma\left(\sum_{t=1}^{N}A[t]x[t]\right)\end{bmatrix}.

Proof  [Sketch of the proof] The detailed proof is available in Appendix C. Define the m-th modified RNN cell \mathcal{R}_{m}, of the form of (1) without activation, with A_{m}=\begin{bmatrix}O_{d_{x}\times d_{x}}&O_{d_{x}\times 1}\\ O_{1\times d_{x}}&1\end{bmatrix}, B_{m}=\begin{bmatrix}I_{d_{x}}&O_{d_{x}\times 1}\\ b_{m}&1\end{bmatrix}, where b_{m}\in\mathbb{R}^{1\times d_{x}}. Then, the (d_{x}+1)th component y[N]_{d_{x}+1} of the final output y[N] after N layers becomes a linear combination \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i,j}b_{i}x[j] of the b_{i}x[j] with constant coefficients \alpha_{i,j}. Thus the coefficient of x[j] is represented by \sum_{i=1}^{N}\alpha_{i,j}b_{i}, which we wish to be A[j] for each j=1,2,\ldots,N. In matrix formulation, we intend to find b satisfying \Lambda^{T}b=A, where \Lambda=\left\{\alpha_{i,j}\right\}_{1\leq i,j\leq N}\in\mathbb{R}^{N\times N}, b=\begin{bmatrix}b_{1}\\ \vdots\\ b_{N}\end{bmatrix}\in\mathbb{R}^{N\times d_{x}}, and A=\begin{bmatrix}A[1]\\ \vdots\\ A[N]\end{bmatrix}. As \Lambda is invertible, there exist b_{i} that solve \left(\Lambda^{T}b\right)_{j}=A[j].
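The recursion behind this construction can be checked numerically. A sketch assuming numpy and d_{x}=1: with the stated A_{m} and B_{m}, layer m updates the (d_{x}+1)th component by a_{m}[t]=a_{m}[t-1]+b_{m}x[t]+a_{m-1}[t]; the coefficients \alpha_{i,j} are recovered by probing with basis vectors, and solving \Lambda^{T}b=A makes the final accumulator equal \sum_{t}A[t]x[t]. The function names are placeholders.

```python
import numpy as np

N = 4
rng = np.random.default_rng(0)

def accumulator(x, b):
    """a_N[N] for the activation-free layers of Lemma 6 with d_x = 1."""
    a_prev = np.zeros(N + 1)                  # a_0[t] = 0 for all t
    for m in range(N):                        # layers m = 1, ..., N
        a = np.zeros(N + 1)
        for t in range(1, N + 1):
            a[t] = a[t - 1] + b[m] * x[t - 1] + a_prev[t]
        a_prev = a
    return a_prev[N]

# alpha_{i,j} = coefficient of b_i x[j]: probe with basis vectors e_i, e_j.
Lam = np.array([[accumulator(np.eye(N)[j], np.eye(N)[i]) for j in range(N)]
                for i in range(N)])

A = rng.standard_normal(N)                    # target coefficients A[1], ..., A[N]
b = np.linalg.solve(Lam.T, A)                 # Lambda^T b = A (Lambda is invertible)

x = rng.standard_normal(N)
print(np.allclose(accumulator(x, b), A @ x))  # True: a_N[N] = sum_t A[t] x[t]
```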

After copying a hidden node using the above lemma, we add a component, the (d_{x}+2)th, to copy another hidden node. The results are then accumulated in the (d_{x}+1)th component, and the final component is reset to copy another node. As the process is repeated, a modified RNN replicates the output node of a two-layered MLP.

Lemma 7

Suppose w_{i}\in\mathbb{R} and A_{i}[t]\in\mathbb{R}^{1\times d_{x}} are given for t=1,2,\ldots,N and i=1,2,\ldots,M. Then, there exists a modified RNN \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R} of width d_{x}+2 such that

\mathcal{N}(x)=\sum_{i=1}^{M}w_{i}\sigma\left(\sum_{t=1}^{N}A_{i}[t]x[t]\right). (18)

Proof  First construct a modified RNN \mathcal{N}_{1}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+2)\times N} of width d_{x}+2 such that

\mathcal{N}_{1}(x)[t] = \begin{bmatrix}x[t]\\ \ast\\ 0\end{bmatrix} \text{ for }t<N, (19)
\mathcal{N}_{1}(x)[N] = \begin{bmatrix}x[N]\\ \sigma\left(\sum_{t=1}^{N}A_{1}[t]x[t]\right)\\ 0\end{bmatrix}, (20)

as in Lemma 6. Note that the final component does not affect the first linear summation and remains zero. Next, using the components except for the (d_{x}+1)th one, construct \mathcal{N}_{2}:\mathbb{R}^{(d_{x}+2)\times N}\rightarrow\mathbb{R}^{(d_{x}+2)\times N}, which satisfies

\mathcal{N}_{2}\mathcal{N}_{1}(x)[t] = \begin{bmatrix}x[t]\\ \ast\\ \ast\end{bmatrix} \text{ for }t<N, (21)
\mathcal{N}_{2}\mathcal{N}_{1}(x)[N] = \begin{bmatrix}x[N]\\ \sigma\left(\sum_{t=1}^{N}A_{1}[t]x[t]\right)\\ \sigma\left(\sum_{t=1}^{N}A_{2}[t]x[t]\right)\end{bmatrix}, (22)

and use one modified RNN cell \mathcal{R} after \mathcal{N}_{2} to add the results and reset the last component:

\mathcal{R}\mathcal{N}_{2}\mathcal{N}_{1}(x)[t] = \begin{bmatrix}x[t]\\ \ast\\ 0\end{bmatrix}, (23)
\mathcal{R}\mathcal{N}_{2}\mathcal{N}_{1}(x)[N] = \begin{bmatrix}x[N]\\ w_{1}\sigma\left(\sum A_{1}[t]x[t]\right)+w_{2}\sigma\left(\sum A_{2}[t]x[t]\right)\\ 0\end{bmatrix}. (24)

As the (d_{x}+2)th component is reset to zero, we use it to compute the third sum w_{3}\sigma\left(\sum A_{3}[t]x[t]\right) and repeat until we obtain the final network \mathcal{N} such that

\mathcal{N}(x)[N]=\begin{bmatrix}x[N]\\ \sum_{i=1}^{M}w_{i}\sigma\left(\sum_{t=1}^{N}A_{i}[t]x[t]\right)\\ 0\end{bmatrix}. (25)

 

Remark 8

The above lemma implies that a modified RNN of width d_{x}+2 can copy the output node of a two-layered MLP. We can extend this result to an arbitrary d_{y}-dimensional case. Note that the first d_{x} components remain fixed, the (d_{x}+1)th component accumulates a part of the linear sum approximating the target function, and the (d_{x}+2)th component computes another part and is reset. When we need to copy another output node for another component of the output of the target function f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N}, only one additional width is sufficient. Indeed, in that case the (d_{x}+2)th component accumulates the sum for the next output component, and the (d_{x}+3)th component acts as the buffer to be reset. By repeating this process, we obtain a (d_{x}+d_{y}+1)-dimensional output from the modified RNN, which contains all d_{y} outputs of the MLP in the components from the (d_{x}+1)th to the (d_{x}+d_{y})th.

Theorem 9 (Universal approximation theorem of deep RNN 2)

Suppose a target f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}} is a continuous sequence-to-vector function, K\subset\mathbb{R}^{d_{x}} is a compact subset, \sigma is a non-degenerating activation function, and z_{0} is the non-degenerating point. Then, for any \epsilon>0, there exists a deep RNN \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}} of width d_{x}+d_{y}+1+\beta(\sigma) such that

\sup_{x\in K^{N}}\left\|f(x)-\mathcal{N}(x)\right\|<\epsilon, (26)

where \beta(\sigma)=\begin{cases}0&\text{if }\sigma\left(z_{0}\right)=0\\ 1&\text{otherwise.}\end{cases}

Proof  We present the proof for d_{y}=1 here; adding d_{y}-1 widths, one for each additional output component, handles the case d_{y}>1. By the universal approximation theorem for MLPs, Theorem 1 in Leshno et al. (1993), there exist w_{i} and A_{i}[t] for i=1,\ldots,M such that

\sup_{x\in K^{N}}\left\|f(x)-\sum_{i=1}^{M}w_{i}\sigma\left(\sum_{t=1}^{N}A_{i}[t]x[t]\right)\right\|<\frac{\epsilon}{2}. (27)

Note that there exists a modified RNN \bar{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R} of width d_{x}+2,

\bar{\mathcal{N}}(x)=\sum_{i=1}^{M}w_{i}\sigma\left(\sum_{t=1}^{N}A_{i}[t]x[t]\right). (28)

By Lemma 3, there exists an RNN \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R} of width d_{x}+2+\beta(\sigma) such that

\sup_{x\in K^{N}}\left\|\bar{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\frac{\epsilon}{2}. (29)

Hence we have \left\|f(x)-\mathcal{N}(x)\right\|<\epsilon.

4.2 Approximation of sequence-to-sequence function

Now, we consider an RNN \mathcal{R} as a function from a sequence x to a sequence y=\mathcal{R}(x) defined by (1). Although the above results are remarkable in that the minimal width has an upper bound independent of the length of the sequence, they only approximate a part of the output sequence. Meanwhile, as the hidden states calculated in each RNN cell are closely connected across different times, fitting all the output functions, which can be independent of each other, becomes a more challenging problem. For example, the coefficient of x[t-1] in \mathcal{N}(x)[t] equals the coefficient of x[t] in \mathcal{N}(x)[t+1] if \mathcal{N} is an RNN defined as in the proof of Lemma 6. This correlation originates from the fact that x[t-1] and x[t] arrive at \mathcal{N}(x)[t] and \mathcal{N}(x)[t+1] via the same intermediate process: one time step and N layers.

We sever the correlation between the coefficients of x[t-1] and x[t] by defining the time-enhanced recurrent cell in Definition 10 and proceed as in the previous subsection up to Lemma 12.

Definition 10

A time-enhanced recurrent cell, or layer, is a process that maps a sequence x=\left(x[t]\right)_{t\in\mathbb{N}}\in\mathbb{R}^{d_{s}\times\mathbb{N}} to a sequence y=\left(y[t]\right)_{t\in\mathbb{N}}\in\mathbb{R}^{d_{s}\times\mathbb{N}} via

y[t+1]\coloneqq\mathcal{R}(x)[t+1]=\sigma\left(A[t+1]\mathcal{R}(x)[t]+B[t+1]x[t+1]+\theta[t+1]\right) (30)

where \sigma is an activation function, A[t], B[t]\in\mathbb{R}^{d_{s}\times d_{s}} are weight matrices, and \theta[t]\in\mathbb{R}^{d_{s}} is the bias given for each time step t.

As with an RNN, a time-enhanced RNN indicates a network of the form (5) with time-enhanced recurrent cells instead of RNN cells, and we denote it as a TRNN. A modified TRNN indicates a TRNN whose activation functions in some cells act on only part of the components. A time-enhanced BRNN, denoted as TBRNN, indicates a BRNN whose recurrent layers in each direction are replaced by time-enhanced layers. A modified TBRNN indicates a TBRNN whose activation function is modified to act on only part of the components. With the proof of Lemma 3 using \bar{A}[t] and \bar{B}[t] instead of \bar{A} and \bar{B}, a TRNN can approximate a modified TRNN.
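A minimal sketch of the time-enhanced recurrent cell in Definition 10, assuming numpy; the only change from (1) is that the weights and bias may depend on the time step, and the name trnn_cell is a placeholder.

```python
import numpy as np

def trnn_cell(x, A_seq, B_seq, theta_seq, act=np.tanh):
    """Eq. (30): s[t+1] = act(A[t+1] s[t] + B[t+1] x[t+1] + theta[t+1])."""
    d, N = B_seq[0].shape[0], x.shape[1]
    s, out = np.zeros(d), np.zeros((d, N))
    for t in range(N):
        s = act(A_seq[t] @ s + B_seq[t] @ x[:, t] + theta_seq[t])
        out[:, t] = s
    return out
```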

The following lemma shows that the modified TRNN successfully eliminates the correlation between outputs. See the appendix for the complete proof.

Lemma 11

Suppose A_{j}[t]\in\mathbb{R}^{1\times d_{x}} are the given matrices for 1\leq t\leq N, 1\leq j\leq t. Then there exists a modified TRNN \tilde{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} of width d_{x}+1 such that

\tilde{\mathcal{N}}(x)[t]=\begin{bmatrix}x[t]\\ \sigma\left(\sum_{j=1}^{t}A_{j}[t]x[j]\right)\end{bmatrix}, (31)

for all t=1,2,\ldots,N.

Proof  [Sketch of proof] The detailed proof is available in Appendix D. The main idea is to use b_{m}[t]\in\mathbb{R}^{1\times d_{x}} instead of b_{m}\in\mathbb{R}^{1\times d_{x}} in the proof of Lemma 6, with a modified TRNN cell \tilde{\mathcal{R}} of the form

\tilde{\mathcal{R}}(x)[t+1]\coloneqq\sigma_{I}\left(\begin{bmatrix}O_{d_{x}\times d_{x}}&\\ &1\end{bmatrix}\tilde{\mathcal{R}}(x)[t]+\begin{bmatrix}I_{d_{x}\times d_{x}}&0\\ b_{m}[t+1]&1\end{bmatrix}x[t+1]\right). (32)

The time-indexed coefficient b_{m}[t] separates the calculation along the same intermediate process mentioned at the beginning of this subsection. For instance, a process from time t_{1} to t_{1}+1 gives a different result from that from t_{2} to t_{2}+1 when b_{m}[t_{1}]\neq b_{m}[t_{2}], even in the same layer. As the coefficient matrices at each time [t] after N layers are full rank, we can find b_{m}[t] implementing the required linear combination for each time.

Recall the proof of Lemma 7. An additional width serves as a buffer to implement and accumulate the linear sums of the nodes in an MLP. Similarly, we proceed with Lemma 11 instead of Lemma 6 to conclude that there exists a modified TRNN \mathcal{N} of width d_{x}+2 such that each \mathcal{N}[t] reproduces an MLP approximating f[t].

Lemma 12

Suppose w_{i}[t]\in\mathbb{R} and A_{i,j}[t]\in\mathbb{R}^{1\times d_{x}} are the given matrices for 1\leq t\leq N, 1\leq j\leq t, 1\leq i\leq M. Then, there exists a modified TRNN \tilde{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{1\times N} of width d_{x}+2 such that

\tilde{\mathcal{N}}(x)[t]=\sum_{i=1}^{M}w_{i}[t]\sigma\left(\sum_{j=1}^{t}A_{i,j}[t]x[j]\right). (33)

Proof  We omit the detailed proof because it is almost the same as the proof of Lemma 7. The only difference is to use b_{m}[t] instead of b_{m} to construct A_{i,j}[t] and w_{i}[t], as in the proof of Lemma 11.
This implies that the modified TRNN can approximate any past-dependent sequence-to-sequence function.

Finally, we connect the TRNN and RNN. Although it is unclear whether a modified RNN can approximate an arbitrary modified TRNN, there exists a modified RNN that approximates the specific one described in Lemma 11.

Lemma 13

Let \tilde{\mathcal{N}} be a given modified TRNN that computes (31) with width d_{x}+1 and K\subset\mathbb{R}^{d_{x}} be a compact set. Then, for any \epsilon>0 there exists a modified RNN \mathcal{N} of width d_{x}+2+\gamma(\sigma) such that

\sup_{x\in K^{N}}\left\|\tilde{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\epsilon, (34)

where \gamma(\operatorname{ReLU})=0 and \gamma(\sigma)=1 for a non-degenerating activation \sigma. As a corollary, there exists a modified RNN \mathcal{N} of width d_{x}+3+\gamma\left(\sigma\right) approximating \tilde{\mathcal{N}} in Lemma 12.

Proof  [Sketch of proof] The detailed proof is available in Appendix E. Note that a function \tilde{\mathcal{N}} in (31) is constructed by modified TRNN cells of the form in (32). Since b_{m}[t] is the only time-dependent coefficient in (32), it is enough to show that an RNN cell can approximate the map X\rightarrow\begin{bmatrix}I_{d_{x}}&O_{d_{x}\times 1}\\ b_{m}[t]&1\end{bmatrix}X for X=\begin{bmatrix}x[t]\\ 0\end{bmatrix}\in\mathbb{R}^{d_{x}+1}. In particular, approximating the map x[t]\mapsto b_{m}[t]x[t] is what we need.

While an MLP can approximate the map x[t]\mapsto b_{m}[t]x[t] on a compact set K, it is hard to use a token-wise MLP, since we need different outputs for the same vector at different time steps. For instance, for v\in K and a sequence x=\left(x[1],x[2]\right)=\left(v,v\right)\in K^{2}, a token-wise MLP cannot simultaneously approximate the two outputs b_{m}[1]v and b_{m}[2]v. Still, this is not the case when the domain K_{1} of x[1] and the domain K_{2} of x[2] are disjoint compact sets, and there is an MLP that approximates v\mapsto b_{m}[1]v for v\in K_{1} and v\mapsto b_{m}[2]v for v\in K_{2}. From this point of view, we use an RNN and a token-wise MLP to embed each vector of the input sequence into disjoint compact sets and to approximate the output b_{m}[t]x[t] on each compact set, respectively.

Now we present how to construct disjoint sets and the token-wise MLP.

Without loss of generality, we can assume K\subset\left[0,\frac{1}{2}\right]^{d_{x}} and construct the first cell so that its output at time t is x[t]+t\mathbf{1}_{d_{x}}. As the N compact sets K+\mathbf{1}_{d_{x}}, K+2\mathbf{1}_{d_{x}},\ldots,K+N\mathbf{1}_{d_{x}} are disjoint, t and x[t]=y-t\mathbf{1}_{d_{x}} are uniquely determined for each y\in K+t\mathbf{1}_{d_{x}}. Thus, for any given b[t]\in\mathbb{R}^{1\times d_{x}}, there exists an MLP of width d_{x}+1+\gamma\left(\sigma\right) approximating y=x[t]+t\mathbf{1}_{d_{x}}\mapsto b[t]x[t] on \bigsqcup_{t}\left(K+t\mathbf{1}_{d_{x}}\right) as a function from \mathbb{R}^{d_{x}} to \mathbb{R} (Hanin and Sellke, 2017; Kidger and Lyons, 2020). Indeed, we need to approximate x[t]+t\mathbf{1}_{d_{x}}\mapsto\begin{bmatrix}x[t]+t\mathbf{1}_{d_{x}}\\ b[t]x[t]\end{bmatrix} as a function from \mathbb{R}^{d_{x}} to \mathbb{R}^{d_{x}+1}. Fortunately, the first d_{x} components preserve the original input data in the proofs of Proposition 2 in Hanin and Sellke (2017) and Proposition 4.2 (Register Model) in Kidger and Lyons (2020). Thus, an MLP of width d_{x}+1+\gamma\left(\sigma\right) approximates b[t]x[t] while keeping x[t]+t\mathbf{1}_{d_{x}} in the first d_{x} components, and the MLP is common across all t, as inputs at different times t_{1} and t_{2} are already embedded in the disjoint sets K+t_{1}\mathbf{1}_{d_{x}} and K+t_{2}\mathbf{1}_{d_{x}}. Note that a token-wise MLP is a special case of an RNN of the same width. Nonetheless, we need an additional width to keep the (d_{x}+1)th component approximating b[t]x[t]. With the token-wise MLP implemented by an RNN and the additional buffer width, we construct a modified RNN of width d_{x}+2+\gamma(\sigma) approximating the modified TRNN cell used in the proof of Lemma 11.
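A small sketch of this disjoint-embedding trick, assuming numpy and d_{x}=1; the explicit floor-based map below stands in for the MLP that approximates it, and all names and values are placeholders.

```python
import numpy as np

N = 3
b = np.array([2.0, -1.0, 0.5])        # time-dependent coefficients b[1], b[2], b[3]

def embed(x):                          # first cell: x[t] -> x[t] + t for t = 1, ..., N
    return x + np.arange(1, N + 1)

def token_wise_map(y):                 # defined on the disjoint union of the sets K + t
    t = np.floor(y).astype(int)        # the shift reveals the time step t
    return b[t - 1] * (y - t)          # output b[t] * x[t]

x = np.array([0.10, 0.25, 0.40])       # tokens in K = [0, 1/2]
print(token_wise_map(embed(x)))        # approximately [0.2, -0.25, 0.2] = b[t] * x[t]
```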

4.3 Proof of Theorem 5

Summarizing all the results, we have the universality of a deep RNN in a continuous function space.

Proof  [Proof of Theorem 5] As mentioned in Remark 8, we can set d_{y}=1 for notational convenience. By Lemma 12, there exists a modified TRNN \tilde{\mathcal{N}} of width d_{x}+2 such that

\sup_{x\in K^{N}}\left\|f(x)-\tilde{\mathcal{N}}(x)\right\|<\frac{\epsilon}{3}. (35)

As \tilde{\mathcal{N}} is a composition of modified TRNN cells of width d_{x}+2 satisfying (31), there exists a modified RNN \bar{\mathcal{N}} of width d_{x}+3+\gamma(\sigma) such that

\sup_{x\in K^{N}}\left\|\tilde{\mathcal{N}}(x)-\bar{\mathcal{N}}(x)\right\|<\frac{\epsilon}{3}. (36)

Then, by Lemma 3, there exists an RNN \mathcal{N} of width d_{x}+3+\gamma(\sigma)+\beta(\sigma)=d_{x}+4+\alpha(\sigma) such that

\sup_{x\in K^{N}}\left\|\bar{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\frac{\epsilon}{3}. (37)

The triangle inequality yields

\sup_{x\in K^{N}}\left\|f(x)-\mathcal{N}(x)\right\|<\epsilon. (38)

 

Remark 14

The number of additional widths, \alpha(\sigma), is determined by \beta(\sigma) and \gamma(\sigma) and depends on the condition of the activation function \sigma. Here, \gamma(\sigma) is required to find the token-wise MLP that approximates the embedding from \mathbb{R}^{d_{x}} to \mathbb{R}^{d_{x}+1}. If further studies determine a tighter upper bound of the minimum width of an MLP to have the universal property in a continuous function space, we can reduce or even remove \alpha(\sigma) according to the result.

There is still a wide gap between the lower bound d_{x} and the upper bound d_{x}+d_{y}+4+\alpha(\sigma) of the minimum width, and hence, we expect to be able to achieve universality with a narrower width. For example, if N=1, an RNN is simply an MLP, and the RNN has universality without the node required to compute the effect of t. Therefore, apart from the result on the minimum width of an MLP, further studies are required to determine whether \gamma is essential for the case N\geq 2.

5 Universal Approximation for Stack RNN in LpL^{p} Space

This section introduces the universal approximation theorem of a deep RNN in LpL^{p} function space for 1p<1\leq p<\infty.

Theorem 15 (Universal approximation theorem of deep RNN 3)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a past-dependent sequence-to-sequence function in Lp(dx×N,dy×N)L^{p}\left(\mathbb{R}^{d_{x}\times N},\mathbb{R}^{d_{y}\times N}\right) for 1p<1\leq p<\infty, and σ\sigma be a non-degenerate activation function with the non-degenerating point z0z_{0}. Then, for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep RNN 𝒩\mathcal{N} of width max{dx+1,dy}+1\max\left\{d_{x}+1,d_{y}\right\}+1 satisfying

sup1tNf(x)[t]𝒩(x)[t]Lp(K)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(K\right)}<\epsilon. (39)

Moreover, if the activation σ\sigma is ReLU, there exists a deep RNN 𝒩\mathcal{N} of width max{dx+1,dy}\max\left\{d_{x}+1,d_{y}\right\} satisfying

sup1tNf(x)[t]𝒩(x)[t]Lp(dx)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\right)}<\epsilon. (40)

Before beginning the proof of the theorem, we assume K[0,1]dxK\subseteq\left[0,1\right]^{d_{x}} without loss of generality and summarize the scheme that will be used in the proof. Park et al. (2021) constructed an MLP of width max{dx+1,dy}+γ(σ)\max\{d_{x}+1,d_{y}\}+\gamma(\sigma) approximating a given target function ff, using the “encoding scheme.” More concretely, the MLP is separated into three parts: encoder, memorizer, and decoder.

First, the encoder part quantizes each component of the input and output into a finite set. The authors use the quantization function qn:[0,1]𝒞nq_{n}:[0,1]\rightarrow\mathcal{C}_{n}

qn(v)max{c𝒞n|cv},q_{n}(v)\coloneqq\max\left\{c\in\mathcal{C}_{n}\bigm{|}c\leq v\right\}, (41)

where 𝒞n{0,2n,2×2n,,12n}\mathcal{C}_{n}\coloneqq\left\{0,2^{-n},2\times 2^{-n},\ldots,1-2^{-n}\right\}. Then, each quantized vector is encoded into a real number by concatenating its components through the encoder EncM:[0,1]dx𝒞dxM\operatorname{Enc}_{M}:[0,1]^{d_{x}}\rightarrow\mathcal{C}_{d_{x}M}

EncM(x)i=1dxqM(xi)2(i1)M.\operatorname{Enc}_{M}(x)\coloneqq\sum_{i=1}^{d_{x}}q_{M}(x_{i})2^{-(i-1)M}. (42)

For small δ1>0\delta_{1}>0, the authors construct an MLP 𝒩enc:[0,1]dx𝒞dxM\mathcal{N}_{enc}:[0,1]^{d_{x}}\rightarrow\mathcal{C}_{d_{x}M} of width dx+1+γ(σ)d_{x}+1+\gamma(\sigma) satisfying

EncM(x)𝒩enc(x)<δ1.\left\|\operatorname{Enc}_{M}(x)-\mathcal{N}_{enc}(x)\right\|<\delta_{1}. (43)

Although the quantization causes a loss of input information, the loss is confined to regions of small measure and is therefore negligible under the L^{p} norm.

After encoding the input x to \operatorname{Enc}_{M}(x) with a large M, the authors use the information of x contained in \operatorname{Enc}_{M}(x) to obtain the information of the target output f(x). More precisely, they define the memorizer \operatorname{Mem}_{M,M^{\prime}}:\mathcal{C}_{d_{x}M}\rightarrow\mathcal{C}_{d_{y}M^{\prime}}, which maps the encoded input \operatorname{Enc}_{M}(x) to the encoded output \operatorname{Enc}_{M^{\prime}}(f(x)), as

Mem(EncM(x))(EncMfqM)(x),\operatorname{Mem}\left(\operatorname{Enc}_{M}(x)\right)\coloneqq\left(\operatorname{Enc}_{M^{\prime}}\circ f\circ q_{M}\right)(x), (44)

where the quantization map q_{M} acts on x component-wise in the above equation. Then, an MLP \mathcal{N}_{mem} of width 2+\gamma(\sigma) approximates \operatorname{Mem}; that is, for any \delta_{2}>0, there exists \mathcal{N}_{mem} satisfying

supx[0,1]dxMem(Enc(x))𝒩mem(Enc(x))<δ2.\sup_{x\in[0,1]^{d_{x}}}\left\|\operatorname{Mem}(\operatorname{Enc}(x))-\mathcal{N}_{mem}(\operatorname{Enc}(x))\right\|<\delta_{2}. (45)

Finally, the decoder reconstructs the original output vector from the encoded output vector by cutting off the concatenated components. Owing to the preceding encoder and memorizer, it is enough to define only the value of the decoder on 𝒞dyM\mathcal{C}_{d_{y}M^{\prime}}. Hence the decoder Dec:𝒞dyM𝒞Mdy(𝒞M)dy\operatorname{Dec}:\mathcal{C}_{d_{y}M^{\prime}}\rightarrow\mathcal{C}_{M^{\prime}}^{d_{y}}\coloneqq\left(\mathcal{C}_{M^{\prime}}\right)^{d_{y}} is determined by

DecM(v)v^where{v^}EncM1(v)𝒞Mdy.\operatorname{Dec}_{M^{\prime}}(v)\coloneqq\hat{v}\qquad\text{where}\qquad\left\{\hat{v}\right\}\coloneqq\operatorname{Enc}_{M^{\prime}}^{-1}(v)\cap\mathcal{C}_{M^{\prime}}^{d_{y}}. (46)

Indeed, for small δ3>0\delta_{3}>0, Park et al. (2021) construct an MLP 𝒩dec:𝒞dyM𝒞Mdy\mathcal{N}_{dec}:\mathcal{C}_{d_{y}M^{\prime}}\rightarrow\mathcal{C}_{M^{\prime}}^{d_{y}} of width dy+γ(σ)d_{y}+\gamma(\sigma) so that

DecM(v)𝒩dec(v)<δ3\left\|\operatorname{Dec}_{M^{\prime}}(v)-\mathcal{N}_{dec}(v)\right\|<\delta_{3} (47)

Although (43) and (47) hold only approximately rather than exactly when the activation is merely non-degenerate, the composition \mathcal{N}=\mathcal{N}_{dec}\circ\mathcal{N}_{mem}\circ\mathcal{N}_{enc} approximates the target f for sufficiently large M,M^{\prime} and sufficiently small \delta_{1},\delta_{2},\delta_{3}.
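The following sketch implements the encoding scheme with the exact maps (41), (42), (44), and (46) rather than their MLP approximations; the toy target f, the dimensions, and the precisions M, M' are arbitrary choices, used only to illustrate that \operatorname{Dec}\circ\operatorname{Mem}\circ\operatorname{Enc} reproduces the target up to the quantization error.

import numpy as np

dx, dy, M, Mp = 2, 2, 6, 8          # arbitrary dimensions and precisions

def q(v, n):                        # quantization (41), component-wise
    return np.floor(v * 2**n) / 2**n

def enc(v, d, n):                   # encoder (42): concatenate truncated binary expansions
    vq = q(np.clip(v, 0.0, 1.0 - 2.0**-n), n)
    return float(sum(vq[i] * 2.0**(-i * n) for i in range(d)))

def dec(c, d, n):                   # decoder (46): split the concatenated expansion
    total = int(round(c * 2**(d * n)))
    out = []
    for i in range(d):
        base = 2**((d - 1 - i) * n)
        out.append(total // base)
        total %= base
    return np.array(out) / 2**n

def f(x):                           # a toy continuous target into [0, 1]^dy
    return np.array([np.sin(x[0] + x[1]), x[0] * x[1]]) * 0.5 + 0.25

def mem(c):                         # memorizer (44): acts on the encoded input alone
    return enc(f(dec(c, dx, M)), dy, Mp)

x = np.array([0.37, 0.81])
y_hat = dec(mem(enc(x, dx, M)), dy, Mp)
print(np.abs(y_hat - f(x)))         # error is at the quantization scale 2^-M, 2^-M'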

Let us return to the proof of Theorem 15. We construct the encoder, memorizer, and decoder similarly. As the encoder and decoder are independent of the time t, we implement them as token-wise MLPs, which modified RNNs can realize. On the other hand, the memorizer must act differently at each time t because there are multiple output functions. Instead of implementing a separate memorizer for each time, we separate the input and output domains at each time by translation; it is then enough to define a single memorizer on the disjoint union of the domains.

Proof  [Proof of Theorem 15] We first combine the token-wise encoder and translation for the separation of the domains. Consider the token-wise encoder EncM:dx×N1×N\operatorname{Enc}_{M}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{1\times N}, and the following recurrent cell :1×N1×N\mathcal{R}:\mathbb{R}^{1\times N}\rightarrow\mathbb{R}^{1\times N}

(v)[t+1]=2(dxM+1)(v)[t]+v[t+1]+1.\mathcal{R}(v)[t+1]=2^{-\left(d_{x}M+1\right)}\mathcal{R}(v)[t]+v[t+1]+1. (48)

Then the composition \mathcal{R}_{enc}=\mathcal{R}\circ\operatorname{Enc}_{M} defines an encoder of sequences, mapping K^{N} to \mathbb{R}^{1\times N}:

\mathcal{R}_{enc}(x)[t]=\sum_{j=1}^{t}2^{-(t-j)\left(d_{x}M+1\right)}+\sum_{j=1}^{t}\operatorname{Enc}_{M}\left(x[j]\right)2^{-\left(t-j\right)\left(d_{x}M+1\right)}, (49)

where x=(x[t])t=1,,Nx=\left(x[t]\right)_{t=1,\ldots,N} is a sequence in KK. Note that the range DD of enc\mathcal{R}_{enc} is a disjoint union of compact sets;

D=t=1N{enc(x)[t]:xKN}.D=\bigsqcup_{t=1}^{N}\left\{\mathcal{R}_{enc}(x)[t]\mathbin{:}x\in K^{N}\right\}. (50)

Hence there exists a memorizer Mem:\operatorname{Mem}:\mathbb{R}\rightarrow\mathbb{R} satisfying

Mem(enc(x)[t])=EncM(f(qM(x))[t])\operatorname{Mem}(\mathcal{R}_{enc}(x){\color[rgb]{0,0,0}[t]})=\operatorname{Enc}_{M^{\prime}}\left(f\left(q_{M}(x)\right)[t]\right) (51)

for each t=1,2,,Nt=1,2,\ldots,N. The token-wise decoder DecM\operatorname{Dec}_{M^{\prime}} is the last part of the proof.
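Before assembling the pieces, the following small exhaustive check (with arbitrary small d_{x}, M, N, so that \operatorname{Enc}_{M} reduces to the quantization itself) verifies that the recurrence (48) agrees with the closed form (49) and that the value sets at different time steps are disjoint, as asserted in (50).

import numpy as np
from itertools import product

dx, M, N = 1, 3, 3
step = 2.0 ** -(dx * M + 1)
codes = np.arange(2**M) / 2**M                    # all values Enc_M takes on [0, 1)

values = {t: set() for t in range(1, N + 1)}
for seq in product(codes, repeat=N):              # every quantized input sequence
    r = 0.0
    for t in range(1, N + 1):
        r = step * r + seq[t - 1] + 1.0           # recurrence (48)
        closed = sum((1.0 + seq[j - 1]) * step ** (t - j) for j in range(1, t + 1))
        assert abs(r - closed) < 1e-12            # closed form (49)
        values[t].add(r)

for t1 in range(1, N + 1):
    for t2 in range(t1 + 1, N + 1):
        assert values[t1].isdisjoint(values[t2])  # the union in (50) is disjoint
print("recurrence matches the closed form; time-step ranges are disjoint")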

To complete the proof, we need an approximation of the token-wise encoder EncM:dx\operatorname{Enc}_{M}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}, a modified recurrent cell :1×N1×N\mathcal{R}:\mathbb{R}^{1\times N}\rightarrow\mathbb{R}^{1\times N}, token-wise memorizer Mem:\operatorname{Mem}:\mathbb{R}\rightarrow\mathbb{R}, and token-wise decoder DecM:dy\operatorname{Dec}_{M^{\prime}}:\mathbb{R}\rightarrow\mathbb{R}^{d_{y}}. Following Park et al. (2021), there exist MLPs of width dx+1+γ(σ)d_{x}+1+\gamma(\sigma), 2+γ(σ)2+\gamma(\sigma), and dy+γ(σ)d_{y}+\gamma(\sigma) that approximate EncM\operatorname{Enc}_{M}, Mem\operatorname{Mem}, and DecM\operatorname{Dec}_{M^{\prime}} respectively. Lemma 3 shows that \mathcal{R} is approximated by an RNN of width 1+β(σ)1+\beta(\sigma). Hence, an RNN of width max{dx+1+γ(σ),1+β(σ),2+γ(σ),dy+γ(σ)}=max{dx+1,dy}+γ(σ)\max\left\{d_{x}+1+\gamma(\sigma),{\color[rgb]{0,0,0}1+\beta(\sigma)},2+\gamma(\sigma),d_{y}+\gamma(\sigma)\right\}=\max\left\{d_{x}+1,d_{y}\right\}+\gamma(\sigma) approximates the target function ff.

In the case of ReLU activation, we can extend the domain K^{N} to \mathbb{R}^{d_{x}\times N} as stated in Park et al. (2020). Nonetheless, we briefly present how to handle the subtle problem that the support of a network is generally not compact.

We project each input x[t]dxx[t]\in\mathbb{R}^{d_{x}} by PL,δ:dxdxP_{L,\delta}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{x}} defined by

P(x)={xx[L,L],Lδ(x+L+δ)x[Lδ,L],Lδ(xLδ)x[L,L+δ],0otherwise.P(x)=\begin{cases}x&x\in[-L,L],\\ -\frac{L}{\delta}\left(x+L+\delta\right)&x\in[-L-\delta,-L],\\ -\frac{L}{\delta}\left(x-L-\delta\right)&x\in[L,L+\delta],\\ 0&\text{otherwise}.\end{cases} (52)

Note that an MLP of width d_{x}+1 computes P_{L,\delta}. We choose L large enough to cover a sufficiently large domain for x\in\mathbb{R}^{d_{x}\times N}, and \delta small enough to make the projection error negligible.
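As a one-dimensional sanity check of this claim, the clamp (52) is piecewise linear and can therefore be written with four ReLU units whose coefficients are the slope changes at the breakpoints. The sketch below only illustrates why a small ReLU network computes P_{L,\delta}; it is not the width-(d_{x}+1) construction itself, and L, \delta are arbitrary test values.

import numpy as np

L, delta = 2.0, 0.25
relu = lambda z: np.maximum(z, 0.0)

def P(x):                                            # definition (52), one dimension
    return np.where(np.abs(x) <= L, x,
           np.where((x >= -L - delta) & (x < -L), -L / delta * (x + L + delta),
           np.where((x > L) & (x <= L + delta), -L / delta * (x - L - delta), 0.0)))

def P_relu(x):                                       # slope-change representation
    return (-L / delta) * relu(x + L + delta) + (1 + L / delta) * relu(x + L) \
         + (-1 - L / delta) * relu(x - L) + (L / delta) * relu(x - L - delta)

xs = np.linspace(-4, 4, 2001)
assert np.allclose(P(xs), P_relu(xs))
print("max deviation:", float(np.max(np.abs(P(xs) - P_relu(xs)))))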

First, choose a large LL so that f(x)[t]Lp(dx\[L,L]dx)<ϵ3\left\|f(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\backslash[-L,L]^{d_{x}}\right)}<\frac{\epsilon}{3} for all t=1,2,,Nt=1,2,\ldots,N. Second, construct an RNN 𝒩=𝒩dec𝒩mem𝒩enc\mathcal{N}=\mathcal{N}_{dec}\circ\mathcal{N}_{mem}\circ\mathcal{N}_{enc} as above with 𝒩(0)=0\mathcal{N}(0)=0, such that f(x)[t]𝒩(x)[t]Lp([L,L]dx)<ϵ3\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left([-L,L]^{d_{x}}\right)}<\frac{\epsilon}{3}. We impose a condition 𝒩(0)=0\mathcal{N}(0)=0 to bound error outside [L,L]dx[-L,L]^{d_{x}}, which is possible because we may set f(0)[t]=0f(0)[t]=0 for all tt under the LpL^{p} norm. Finally, set δ\delta so small that f(x)[t]Lp([Lδ,L+δ]dx\[L,L]dx)<ϵ6\left\|f(x)[t]\right\|_{L^{p}\left([-L-\delta,L+\delta]^{d_{x}}\backslash[-L,L]^{d_{x}}\right)}<\frac{\epsilon}{6} and 𝒩(x)[t]Lp([Lδ,L+δ]dx\[L,L]dx)<ϵ6\left\|\mathcal{N}(x)[t]\right\|_{L^{p}\left([-L-\delta,L+\delta]^{d_{x}}\backslash[-L,L]^{d_{x}}\right)}<\frac{\epsilon}{6}. After inserting the projection PL,δP_{L,\delta} before the encoder 𝒩enc\mathcal{N}_{enc} and defining new RNN 𝒩=𝒩PL,δ\mathcal{N}^{\prime}=\mathcal{N}\circ P_{L,\delta}, we have

f(x)[t]𝒩(x)[t]Lp(dx)\displaystyle\left\|f(x)[t]-\mathcal{N}^{\prime}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\right)} (53)
f(x)[t]𝒩(x)[t]Lp([L,L]dx)+f(x)[t]𝒩(x)[t]Lp(dx\[L,L]dx)\displaystyle\leq\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left([-L,L]^{d_{x}}\right)}+\left\|f(x)[t]-\mathcal{N}^{\prime}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\backslash[-L,L]^{d_{x}}\right)} (54)
ϵ3+f(x)[t]𝒩(x)[t]Lp([Lδ,L+δ]dx\[L,L]dx)\displaystyle\leq\frac{\epsilon}{3}+\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left([-L-\delta,L+\delta]^{d_{x}}\backslash[-L,L]^{d_{x}}\right)} (55)
+f(x)[t]𝒩(x)[t]Lp(dx\[Lδ,L+δ]dx)\displaystyle\qquad+\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\backslash[-L-\delta,L+\delta]^{d_{x}}\right)} (56)
ϵ3+ϵ3+f(x)[t]Lp(dx\[L,L]dx)\displaystyle\leq\frac{\epsilon}{3}+\frac{\epsilon}{3}+\left\|f(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\backslash[-L,L]^{d_{x}}\right)} (57)
ϵ\displaystyle\leq\epsilon (58)

 

6 Variants of RNN

This section describes the universal property of several variants of the RNN, namely the LSTM, GRU, and BRNN. The LSTM and GRU were proposed to solve the long-term dependency problem: as an RNN has difficulty computing and updating its parameters on long sequential data, the LSTM and GRU take advantage of additional structures in their cells. We prove that they have the same universal property as the original RNN. On the other hand, a BRNN was proposed to overcome the past dependency of an RNN; it consists of two RNN cells, one of which works in reverse order. We prove the universal approximation theorem of a BRNN with the target class of arbitrary sequence-to-sequence functions.

The universal property of an LSTM originates from the universality of an RNN. Mathematically, an LSTM \mathcal{R}_{LSTM} denotes a process that computes two outputs, h and c, defined by (2). As an LSTM can reproduce an RNN of the same width, we have the following corollary:

Corollary 16 (Universal approximation theorem of deep LSTM)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a continuous past-dependent sequence-to-sequence function. Then, for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep LSTM 𝒩LSTM\mathcal{N}_{LSTM}, of width dx+dy+3d_{x}+d_{y}+3, such that

supxKNsup1tNf(x)[t]𝒩LSTM(x)[t]<ϵ.\sup_{x\in K^{N}}\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}_{LSTM}(x)[t]\right\|<\epsilon. (59)

If fLp(dx×N,dy×N)f\in L^{p}\left(\mathbb{R}^{d_{x}\times N},\mathbb{R}^{d_{y}\times N}\right), there exists a deep LSTM 𝒩LSTM\mathcal{N}_{LSTM}, of width max{dx+1,dy}+1\max\left\{d_{x}+1,d_{y}\right\}+1, such that

sup1tNf(x)[t]𝒩LSTM(x)[t]Lp(KN)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}_{LSTM}(x)[t]\right\|_{L^{p}\left(K^{N}\right)}<\epsilon. (60)

Proof  We set all parameters but WgW_{g}, UgU_{g}, bgb_{g}, and bfb_{f} as zeros, and then (2) is simplified as

c[t+1]\displaystyle c[t+1] =σsig(bf)c[t]+12tanh(Ugh[t]+Wgx[t+1]+bg),\displaystyle=\sigma_{\operatorname{sig}}(b_{f})\odot c[t]+\frac{1}{2}\tanh\left(U_{g}h[t]+W_{g}x[t+1]+b_{g}\right), (61)
h[t+1]\displaystyle h[t+1] =12tanh(c[t+1]).\displaystyle=\frac{1}{2}\tanh\left(c[t+1]\right).

For any ϵ>0\epsilon>0, bfb_{f} with sufficiently large negative components yields

h[t+1]12tanh(12tanh(Ugh[t]+Wgx[t+1]+bg))<ϵ.\left\|h[t+1]-\frac{1}{2}\tanh\left(\frac{1}{2}\tanh\left(U_{g}h[t]+W_{g}x[t+1]+b_{g}\right)\right)\right\|<\epsilon. (62)

Thus, an LSTM approximates an RNN whose activation function is \left(\frac{1}{2}\tanh\right)\circ\left(\frac{1}{2}\tanh\right) without any additional width in its hidden states; in other words, an LSTM of width d approximates an RNN of width d equipped with this activation function.  
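The following numeric sketch checks (61)-(62) directly: it iterates the simplified cell (61) with arbitrary test weights and shows that, as b_{f} becomes more negative, its hidden state approaches that of an RNN with activation \left(\frac{1}{2}\tanh\right)\circ\left(\frac{1}{2}\tanh\right). It uses only the simplified equations of the proof, not a full LSTM implementation; the dimensions, inputs, and weights are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
d, dx, N = 4, 3, 20
Wg, Ug, bg = rng.normal(size=(d, dx)), rng.normal(size=(d, d)), rng.normal(size=d)
x = rng.normal(size=(N, dx))

def simplified_lstm(bf):
    h, c, hs = np.zeros(d), np.zeros(d), []
    for t in range(N):
        c = 1 / (1 + np.exp(-bf)) * c + 0.5 * np.tanh(Ug @ h + Wg @ x[t] + bg)  # (61)
        h = 0.5 * np.tanh(c)
        hs.append(h)
    return np.array(hs)

def target_rnn():
    h, hs = np.zeros(d), []
    for t in range(N):
        h = 0.5 * np.tanh(0.5 * np.tanh(Ug @ h + Wg @ x[t] + bg))               # (62)
        hs.append(h)
    return np.array(hs)

for bf in [0.0, -5.0, -20.0]:
    err = float(np.max(np.abs(simplified_lstm(bf) - target_rnn())))
    print("b_f =", bf, "  max deviation =", err)    # shrinks as b_f becomes more negative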

The universality of GRU is proved similarly.

Corollary 17 (Universal approximation theorem of deep GRU)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a continuous past-dependent sequence-to-sequence function. Then, for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep GRU 𝒩GRU\mathcal{N}_{GRU}, of width dx+dy+3d_{x}+d_{y}+3, such that

supxKNsup1tNf(x)[t]𝒩GRU(x)[t]<ϵ.\sup_{x\in K^{N}}\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}_{GRU}(x)[t]\right\|<\epsilon. (63)

If fLp(dx×N,dy×N)f\in L^{p}\left(\mathbb{R}^{d_{x}\times N},\mathbb{R}^{d_{y}\times N}\right), there exists a deep GRU 𝒩GRU\mathcal{N}_{GRU}, of width max{dx+1,dy}+1\max\left\{d_{x}+1,d_{y}\right\}+1, such that

sup1tNf(x)[t]𝒩GRU(x)[t]Lp(KN)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}_{GRU}(x)[t]\right\|_{L^{p}\left(K^{N}\right)}<\epsilon. (64)

Proof  Setting only WhW_{h}, UhU_{h}, bhb_{h}, and bzb_{z} as non-zero, the GRU is simplified as

h[t+1]=(1σsig(bz))h[t]+σsig(bz)tanh(Whx[t+1]+12Uhh[t]+bh).h[t+1]=\left(1-\sigma_{\operatorname{sig}}\left(b_{z}\right)\right)h[t]+\sigma_{\operatorname{sig}}\left(b_{z}\right)\tanh\left(W_{h}x[t+1]+\frac{1}{2}U_{h}h[t]+b_{h}\right). (65)

For any ϵ>0\epsilon>0, a sufficiently large bzb_{z} yields

h[t+1]tanh(Whx[t+1]+12Uhh[t]+bh)<ϵ.\left\|h[t+1]-\tanh\left(W_{h}x[t+1]+\frac{1}{2}U_{h}h[t]+b_{h}\right)\right\|<\epsilon. (66)

Hence, we attain the corollary.  

Remark 18

We refer to the width as the maximal dimension of the hidden states. However, this definition is somewhat unfavorable for comparison, as LSTM and GRU cells carry multiple hidden states and hence have several times more components than an RNN of the same width. Thus, we expect that they have better approximation power, or a smaller minimum width for universality, than an RNN. Nevertheless, we leave the theoretical analysis as future work, both to identify whether they differ in approximation ability and to examine why they exhibit different performances in practical applications.

Now, let us focus on the universality of a BRNN. Recall that a stack of modified recurrent cells \mathcal{N} constructs a linear combination of the previous input components x[1:t] at each time,

𝒩(x)[t]=[x[t]j=1tAj[t]x[j]].\mathcal{N}(x)[t]=\begin{bmatrix}x[t]\\ \sum_{j=1}^{t}A_{j}[t]x[j]\end{bmatrix}. (67)

Therefore, if we reverse the order of sequence and flow of the recurrent structure, a stack of reverse modified recurrent cells 𝒩¯\bar{\mathcal{N}} constructs a linear combination of the subsequent input components x[t:N]x[t:N] at each time,

𝒩¯(x)[t]=[x[t]j=tNBj[t]x[j]].\bar{\mathcal{N}}(x)[t]=\begin{bmatrix}x[t]\\ \sum_{j=t}^{N}B_{j}[t]x[j]\end{bmatrix}. (68)

From this point of view, we expect that a stacked BRNN successfully approximates an arbitrary sequence-to-sequence function beyond the past dependency. As previously mentioned, we prove it in the following lemma.

Lemma 19

Suppose Aj[t]1×dxA_{j}[t]\in\mathbb{R}^{1\times d_{x}} are the given matrices for 1tN1\leq t\leq N, 1jN1\leq j\leq N. Then there exists a modified TBRNN 𝒩~:dx×N(dx+1)×N\tilde{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} of width dx+1d_{x}+1 such that

𝒩~(x)[t]=[x[t]σ(j=1NAj[t]x[j])],\tilde{\mathcal{N}}(x)[t]=\begin{bmatrix}x[t]\\ \sigma\left(\sum_{j=1}^{N}A_{j}[t]x[j]\right)\end{bmatrix}, (69)

for all t=1,2,,Nt=1,2,\ldots,N.

Proof  [Sketch of proof] The detailed proof is available in Appendix F. We use modified TBRNN cells with either only a forward modified TRNN or a backward modified TRNN. The stacked forward modified TRNN cells compute j=1tAj[t]x[j]\sum_{j=1}^{t}A_{j}[t]x[j], and the stacked backward modified TRNN cells compute j=t+1NAj[t]x[j]\sum_{j=t+1}^{N}A_{j}[t]x[j].  

As in the previous cases, we have the following result for a TBRNN. The proof is almost the same as those of Lemmas 12 and 7.

Lemma 20

Suppose wi[t]w_{i}[t]\in\mathbb{R}, Ai,j[t]1×dxA_{i,j}[t]\in\mathbb{R}^{1\times d_{x}} are the given matrices for 1tN1\leq t\leq N, 1jN1\leq j\leq N, 1iM1\leq i\leq M. Then there exists a modified TBRNN 𝒩~:dx×N1×N\tilde{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{1\times N} of width dx+2d_{x}+2 such that

𝒩~(x)[t]=i=1Mwi[t]σ(j=1NAi,j[t]x[j]).\tilde{\mathcal{N}}(x)[t]=\sum_{i=1}^{M}{\color[rgb]{0,0,0}w_{i}[t]}\sigma\left(\sum_{j=1}^{N}A_{i,j}[t]x[j]\right). (70)

Proof  First, construct a modified deep TBRNN 𝒩1:dx×N(dx+2)×N\mathcal{N}_{1}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+2)\times N} of width dx+2d_{x}+2 such that

𝒩1(x)[t]=[x[t]σ(j=1NA1,j[t]x[j])0],\mathcal{N}_{1}(x)[t]=\begin{bmatrix}x[t]\\ \sigma\left(\sum_{j=1}^{N}A_{1,j}[t]x[j]\right)\\ 0\end{bmatrix}, (71)

as in Lemma 19. The final component does not affect the first linear summation and remains zero. After \mathcal{N}_{1}, use the (d_{x}+2)th component to obtain a stack of cells \mathcal{N}_{2}:\mathbb{R}^{(d_{x}+2)\times N}\rightarrow\mathbb{R}^{(d_{x}+2)\times N}, which satisfies

𝒩2𝒩1(x)[t]=[x[t]σ(j=1NA1,j[t]x[j])σ(j=1NA2,j[t]x[j])],\mathcal{N}_{2}\mathcal{N}_{1}(x)[t]=\begin{bmatrix}x[t]\\ \sigma\left(\sum_{j=1}^{N}A_{1,j}[t]x[j]\right)\\ \sigma\left(\sum_{j=1}^{N}A_{2,j}[t]x[j]\right)\end{bmatrix}, (72)

and use a modified RNN cell \mathcal{R} to sum up the results and reset the last component:

𝒩2𝒩1(x)[t]=[x[t]w1[t]σ(j=1NA1,j[t]x[j])+w2[t]σ(j=1NA2,j[t]x[j])0].\mathcal{R}\mathcal{N}_{2}\mathcal{N}_{1}(x)[t]=\begin{bmatrix}x[t]\\ w_{1}[t]\sigma\left(\sum_{j=1}^{N}A_{1,j}[t]x[j]\right)+w_{2}[t]\sigma\left(\sum_{j=1}^{N}A_{2,j}[t]x[j]\right)\\ 0\end{bmatrix}. (73)

As the (dx+2)(d_{x}+2)th component resets to zero, we use it to compute the third sum w3[t]σ(A3,j[t]x[j])w_{3}[t]\sigma\left(\sum A_{3,j}[t]x[j]\right) and repeat until we obtain the final network 𝒩\mathcal{N} such that

\mathcal{N}(x)[t]=\begin{bmatrix}x[t]\\ \sum_{i=1}^{M}w_{i}[t]\sigma\left(\sum_{j=1}^{N}A_{i,j}[t]x[j]\right)\\ 0\end{bmatrix}. (74)

 

The following lemma fills the gap between a modified TBRNN and a modified BRNN.

Lemma 21

Let 𝒩~\tilde{\mathcal{N}} be a modified TBRNN that computes (69) and KdxK\subset\mathbb{R}^{d_{x}} be a compact set. Then for any ϵ>0\epsilon>0 there exists a modified BRNN 𝒩¯\bar{\mathcal{N}} of width dx+2+γ(σ)d_{x}+2+\gamma(\sigma) such that

supxKN𝒩~(x)𝒩¯(x)<ϵ,\sup_{x\in K^{N}}\left\|\tilde{\mathcal{N}}(x)-\bar{\mathcal{N}}(x)\right\|<\epsilon, (75)

where γ(ReLU)=0\gamma(\operatorname{ReLU})=0, γ(σ)=1\gamma(\sigma)=1 for non-degenerating activation σ\sigma.
Moreover, there exists a BRNN 𝒩\mathcal{N} of width dx+3+α(σ)d_{x}+{\color[rgb]{0,0,0}3}+\alpha(\sigma) such that

supxKN𝒩~(x)𝒩(x)<ϵ,\sup_{x\in K^{N}}\left\|\tilde{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\epsilon, (76)

where α(σ)={0σ is ReLU or a non-degenerating function with σ(z0)=0.1σ is a non-degenerating function with σ(z0)0.\alpha\left(\sigma\right)=\begin{cases}0&\text{$\sigma$ is ReLU or a non-degenerating function with $\sigma\left(z_{0}\right)=0$.}\\ 1&\text{$\sigma$ is a non-degenerating function with $\sigma\left(z_{0}\right)\neq 0$}.\end{cases}

Proof  We omit the details because we only need to construct a modified RNN that approximates (67) and (68) using Lemma 13. As only the forward or the backward modified RNN cell is used in the proof of Lemma 19, it is enough for the modified BRNN to approximate either the forward or the backward modified TRNN, which follows from Lemma 13. Lemma 3 provides the second part of the lemma.  

Finally, we obtain the universal approximation theorem of the BRNN from the previous results.

Theorem 22 (Universal approximation theorem of deep BRNN 1)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a continuous sequence to sequence function and σ\sigma be a non-degenerate activation function. Then for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep BRNN 𝒩\mathcal{N} of width dx+dy+3+α(σ)d_{x}+d_{y}+{\color[rgb]{0,0,0}3}+\alpha(\sigma), such that

supxKNsup1tNf(x)[t]𝒩(x)[t]<ϵ,\sup_{x\in K^{N}}\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|<\epsilon, (77)

where α(σ)={0σ is ReLU or a non-degenerating function with σ(z0)=0.1σ is a non-degenerating function with σ(z0)0.\alpha\left(\sigma\right)=\begin{cases}0&\text{$\sigma$ is ReLU or a non-degenerating function with $\sigma\left(z_{0}\right)=0$.}\\ 1&\text{$\sigma$ is a non-degenerating function with $\sigma\left(z_{0}\right)\neq 0$}.\end{cases}

Proof  As in the proof of Theorem 5, we set d_{y}=1 for notational convenience. According to Lemma 20, there exists a modified TBRNN \tilde{\mathcal{N}} of width d_{x}+2 such that

supxKnf(x)𝒩~(x)<ϵ2.\sup_{x\in K^{n}}\left\|f(x)-\tilde{\mathcal{N}}(x)\right\|<\frac{\epsilon}{2}. (78)

Lemma 21 implies that there exists a BRNN of width dx+4+α(σ)d_{x}+{\color[rgb]{0,0,0}4}+\alpha(\sigma) such that

supxKn𝒩~(x)𝒩(x)<ϵ2.\sup_{x\in K^{n}}\left\|\tilde{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\frac{\epsilon}{2}. (79)

The triangle inequality leads to

supxKnf(x)𝒩(x)<ϵ.\sup_{x\in K^{n}}\left\|f(x)-\mathcal{N}(x)\right\|<\epsilon. (80)

 

In L^{p} space, the result for BRNNs follows from the RNN case much more directly than in the continuous function space, since only the encoder requires a bidirectional flow.

Theorem 23 (Universal approximation theorem of deep BRNN 2)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a sequence-to-sequence function in Lp(dx×N,dy×N)L^{p}\left(\mathbb{R}^{d_{x}\times N},\mathbb{R}^{d_{y}\times N}\right) for 1p<1\leq p<\infty, and σ\sigma be a non-degenerate activation function with the non-degenerating point z0z_{0}. Then, for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep BRNN 𝒩\mathcal{N} of width max{dx+1,dy}+1\max\left\{d_{x}+1,d_{y}\right\}+1 satisfying

sup1tNf(x)[t]𝒩(x)[t]Lp(K)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(K\right)}<\epsilon. (81)

Moreover, if the activation σ\sigma is ReLU, there exists a deep BRNN 𝒩\mathcal{N} of width max{dx+1,dy}\max\left\{d_{x}+1,d_{y}\right\} satisfying

sup1tNf(x)[t]𝒩(x)[t]Lp(dx)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\right)}<\epsilon. (82)

Proof  Recall that in the RNN case, x[1:t] is encoded as a concatenation of truncated binary representations of x[1],x[2],\ldots,x[t], each cut off at a fixed number of digits. The backward direction performs the same encoding on x[t:N], and the two encoded results are then combined. Since the cells constructing the encoders are the same as in equation (48), we omit the remaining steps.  

7 Discussion

We proved the universal approximation theorem and calculated the upper bound of the minimum width of an RNN, an LSTM, a GRU, and a BRNN. In this section, we illustrate how our results support the performance of a recurrent network.

We show that an RNN needs a width of at most dx+dy+4d_{x}+d_{y}+4 to approximate a function from a sequence of dxd_{x}-dimensional vectors to a sequence of dyd_{y}-dimensional vectors. The upper bound of the minimum width of the network depends only on the input and output dimensions, regardless of the length of the sequence. The independence of the sequence length indicates that the recurrent structure is much more effective in learning a function on sequential inputs. To approximate a function defined on a long sequence, a network with a feed-forward structure requires a wide width proportional to the length of the sequence. For example, an MLP should have a wider width than NdxNd_{x} if it approximates a function f:dx×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R} defined on a sequence (Johnson, 2019). However, with the recurrent structure, it is possible to approximate via a narrow network of width dx+1d_{x}+1 regardless of the length, because the minimum width is independent of the length NN. This suggests that the recurrent structure, which transfers information between different time steps in the same layer, is crucial for success with sequential data.

From a practical point of view, this fact further implies that, to learn the internal dynamics of sequential data, there is no need to limit the number of past time steps that affect the dynamics. For instance, suppose that a pair of long sequential data (x[t]) and (y[t]) has an unknown relation y[t]=f\left(x[t-p],x[t-p+1],\ldots,x[t]\right). Even without prior knowledge of f and p, a deep RNN can learn the relation if we train the network with inputs x[1:t] and outputs y[t]. An MLP cannot reproduce this result because its required width increases in proportion to p, which is unknown. This difference theoretically supports that recurrent networks are appropriate for real-world sequential data whose underlying dynamics are unknown.

8 Conclusion

In this study, we investigated the universality and upper bound of the minimum width of deep RNNs. The upper bound of the minimum width serves as a theoretical basis for the effectiveness of deep RNNs, especially when underlying dynamics of the data are unknown.

Our methodology enables various follow-up studies, as it connects an MLP and a deep RNN. For example, the framework disentangles the time dependency of output sequence of an RNN. This makes it feasible to investigate a trade-off between width and depth in the representation ability or error bounds of the deep RNN, which has not been studied because of the entangled flow with time and depth. In addition, we separated the required width into three parts: one maintains inputs and results, another resolves the time dependency, and the third modifies the activation. Assuming some underlying dynamics in the output sequence, such as an open dynamical system, we expect to reduce the required minimum width on each part because there is a natural dependency between the outputs, and the inputs are embedded in a specific way by the dynamics.

In addition, as LSTMs and GRUs have multiple hidden states in their cells, they may have a smaller minimum width than the RNN. By constructing an LSTM and a GRU that use these hidden states to store data and resolve the time dependency, we hope that the techniques demonstrated in our proofs help analyze why these networks achieve better results in practice and suffer less from long-term dependency.

Acknowledgement

This work was supported by the Challengeable Future Defense Technology Research and Development Program through ADD[No. 915020201], the NRF grant[2012R1A2C3010887] and the MSIT/IITP[No. 2021-0-01343, Artificial Intelligence Graduate School Program(SNU)].

Appendix A Notations

Symbol Description
σ\sigma Activation function
σI\sigma_{I} Modified activation function
xx Input sequence of vectors
yy Output sequence of vectors
tt Order of element in a sequence
a[t]ia[t]_{i} i-th component of the t-th element of a sequence a
dxd_{x} Dimension of input vector
dsd_{s} Dimension of hidden state in RNN, or width
dyd_{y} Dimension of output vector
NN Length of the input and output sequence
KK A compact subset of dx\mathbb{R}^{d_{x}}
Om,nO_{m,n} Zero matrix of size m×nm\times n
𝟎k\mathbf{0}_{k} Zero vector of size kk, 𝟎k=[000]kT\mathbf{0}_{k}={\underbrace{\begin{bmatrix}0&0\cdots&0\end{bmatrix}}_{k}}^{T}
𝟏k\mathbf{1}_{k} One vector of size kk, 𝟏k=[111]kT\mathbf{1}_{k}={\underbrace{\begin{bmatrix}1&1\cdots&1\end{bmatrix}}_{k}}^{T}
IkI_{k} Identity matrix of size k×kk\times k
RNN Recurrent Neural Network, defined on page 1
TRNN Time-enhanced Recurrent Neural Network, defined by replacing the
recurrent cell with the time-enhanced recurrent cell in an RNN.
The time-enhanced cell and the TRNN are defined on page 10
BRNN Bidirectional Recurrent Neural Network, defined on page 4
TBRNN Time-enhanced Bidirectional Recurrent Neural Network,
defined on page 10

Appendix B Proof of the Lemma 3

Without loss of generality, we may assume \bar{\mathcal{P}} is an identity map and I=\{1,2,\ldots,k\}. Let \bar{\mathcal{R}}(x)[t+1]=\sigma_{I}\left(\bar{A}\bar{\mathcal{R}}(x)[t]+\bar{B}x[t+1]+\bar{\theta}\right) be a given modified RNN cell, and \mathcal{Q}(x)[t]=\bar{Q}x[t] be a given token-wise linear projection map. We use the notations O_{m,n} and \mathbf{1}_{m} to denote the zero matrix in \mathbb{R}^{m\times n} and the all-ones vector in \mathbb{R}^{m}, respectively. We sometimes omit the O_{m,n} blocks in block matrices when the size of the zero block is clear from the context.

Case 1: σ(z0)=0\sigma(z_{0})=0

Let 𝒫\mathcal{P} be the identity map. For δ>0\delta>0 define 1δ\mathcal{R}_{1}^{\delta} as

1δ(x[t+1])σ(δB¯x[t+1]+δθ¯+z0𝟏d).\mathcal{R}_{1}^{\delta}\left(x[t+1]\right)\coloneqq\sigma\left(\delta\bar{B}x[t+1]+\delta\bar{\theta}+z_{0}\mathbf{1}_{d}\right). (83)

Since σ\sigma is non-degenerating at z0z_{0} and σ\sigma^{\prime} is continuous at z0z_{0}, we have

1δ𝒫(x)[t+1]=δσ(z0)(B¯x[t+1]+θ¯)+o(δ).\mathcal{R}_{1}^{\delta}\circ\mathcal{P}(x)[t+1]=\delta\sigma^{\prime}(z_{0})\left(\bar{B}x[t+1]+\bar{\theta}\right)+o(\delta). (84)

Then construct a second cell to compute transition as

2δ(x)[t+1]=σ(A~2δ(x)[t]+1σ(z0)[δ1IkIdk]x[t+1]+[𝟎kz0𝟏dk]),\mathcal{R}_{2}^{\delta}(x)[t+1]=\sigma\left(\tilde{A}\mathcal{R}_{2}^{\delta}(x)[t]+\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}x[t+1]+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix}\right), (85)

where A~=[IkδIdk]A¯[Ik1δσ(z0)Idk]\tilde{A}=\begin{bmatrix}I_{k}&\\ &\delta I_{d-k}\end{bmatrix}\bar{A}\begin{bmatrix}I_{k}&\\ &\frac{1}{\delta\sigma^{\prime}(z_{0})}I_{d-k}\end{bmatrix}.
After that, the first output of 2δ1δ𝒫(x)\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x) becomes

2δ1δ𝒫(x)[1]\displaystyle\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[1] =σ(1σ(z0)[δ1IkIdk]1δ(x)[1]+[𝟎kz0𝟏dk])\displaystyle=\sigma\left(\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}(x)[1]+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix}\right) (86)
=σ([(B¯x[1]+θ¯)1:k+δ1o(δ)(z0𝟏dk+δ(B¯x[1]+θ¯)k+1:d+o(δ)])\displaystyle=\sigma\left(\begin{bmatrix}\left(\bar{B}x[1]+\bar{\theta}\right)_{1:k}+\delta^{-1}o(\delta)\\ \left(z_{0}\mathbf{1}_{d-k}+\delta(\bar{B}x[1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\end{bmatrix}\right) (87)
=[σ(B¯x[1]+θ¯)1:k+o(1)σ(z0)δ(B¯x[1]+θ¯)k+1:d+o(δ)]\displaystyle=\begin{bmatrix}\sigma\left(\bar{B}x[1]+\bar{\theta}\right)_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\left(\bar{B}x[1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\end{bmatrix} (88)
=[¯(x)[1]1:k+o(1)σ(z0)δ¯(x)[1]k+1:d+o(δ)].\displaystyle=\begin{bmatrix}\bar{\mathcal{R}}(x)[1]_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[1]_{k+1:d}+o(\delta)\end{bmatrix}. (89)

Now use mathematical induction on time tt to compute 2δ1δ𝒫(x)\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x) assuming

2δ1δ𝒫(x)[t]=[¯(x)[t]1:k+o(1)σ(z0)δ¯(x)[t]k+1:d+o(δ)].\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]=\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t]_{k+1:d}+o(\delta)\end{bmatrix}. (90)

From a direct calculation, we attain

1σ(z0)[δ1IkIdk]1δ𝒫(x)[t+1]+[𝟎kz0𝟏dk]\displaystyle\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1]+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix} (91)
=1σ(z0)[δ1IkIdk](δσ(z0)(B¯x[t+1]+θ¯)+o(δ))+[𝟎kz0𝟏dk]\displaystyle=\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}\left(\delta\sigma^{\prime}(z_{0})\left(\bar{B}x[t+1]+\bar{\theta}\right)+o(\delta)\right)+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix} (92)
=[B¯x[t+1]1:k+θ¯1:k+δ1o(δ)z0𝟏dk+δ(B¯x[t+1]+θ¯)k+1:d+o(δ)],\displaystyle=\begin{bmatrix}\bar{B}x[t+1]_{1:k}+\bar{\theta}_{1:k}+\delta^{-1}o(\delta)\\ z_{0}\mathbf{1}_{d-k}+\delta\left(\bar{B}x[t+1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\end{bmatrix}, (93)

and

A~2δ1δ𝒫(x)[t]\displaystyle\tilde{A}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t] (94)
=[IkδIdk]A¯[Ik1δσ(z0)Idk][¯(x)[t]1:k+o(1)σ(z0)δ¯(x)[t]k+1:d+o(δ)]\displaystyle=\begin{bmatrix}I_{k}&\\ &\delta I_{d-k}\end{bmatrix}\bar{A}\begin{bmatrix}I_{k}&\\ &\frac{1}{\delta\sigma^{\prime}(z_{0})}I_{d-k}\end{bmatrix}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t]_{k+1:d}+o(\delta)\end{bmatrix} (95)
=[IkδIdk]A¯[¯(x)[t]1:k+o(1)¯(x)[t]k+1:d+o(1)]\displaystyle=\begin{bmatrix}I_{k}&\\ &\delta I_{d-k}\end{bmatrix}\bar{A}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \bar{\mathcal{R}}(x)[t]_{k+1:d}+o(1)\end{bmatrix} (96)
=[(A¯¯(x)[t])1:k+o(1)δ(A¯¯(x)[t])k+1:d+o(δ)].\displaystyle=\begin{bmatrix}\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{1:k}+o(1)\\ \delta\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{k+1:d}+o(\delta)\end{bmatrix}. (97)

With the sum of above two results, we obtain the induction hypothesis (90) for t+1t+1,

2δ1δ𝒫(x)[t+1]\displaystyle\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1] (98)
=σ(A~2δ1δ𝒫(x)[t]+1σ(z0)[δ1IkIdk]1δ𝒫(x)[t+1]+[𝟎kz0𝟏dk])\displaystyle=\sigma\left(\tilde{A}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]+\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1]+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix}\right) (99)
=σ[(A¯¯(x)[t])1:k+B¯x[t+1]1:k+θ¯1:k+o(1)z0𝟏dk+δ(A¯¯(x)[t])k+1:d+δ(B¯x[t+1]+θ¯)k+1:d+o(δ)]\displaystyle=\sigma\begin{bmatrix}\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{1:k}+\bar{B}x[t+1]_{1:k}+\bar{\theta}_{1:k}+o(1)\\ z_{0}\mathbf{1}_{d-k}+\delta\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{k+1:d}+\delta\left(\bar{B}x[t+1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\end{bmatrix} (100)
=[¯(x)[t+1]1:k+o(1)σ(z0)δ¯(x)[t+1]k+1:d+o(δ)].\displaystyle=\begin{bmatrix}\bar{\mathcal{R}}(x)[t+1]_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t+1]_{k+1:d}+o(\delta)\end{bmatrix}. (101)

Setting 𝒬δ=Q¯[Ik1σ(z0)δIdk]\mathcal{Q}^{\delta}=\bar{Q}\begin{bmatrix}I_{k}&&\\ &\frac{1}{\sigma^{\prime}(z_{0})\delta}I_{d-k}\end{bmatrix} and choosing δ\delta small enough complete the proof:

𝒬δ2δ1δ𝒫(x)[t]=Q¯[¯(x)[t]1:k+o(1)¯(x)[t]k+1:d+o(1)]=𝒬¯¯(x)[t]+o(1)𝒬¯¯(x)[t].\mathcal{Q}^{\delta}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]=\bar{Q}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \bar{\mathcal{R}}(x)[t]_{k+1:d}+o(1)\end{bmatrix}=\bar{\mathcal{Q}}\bar{\mathcal{R}}(x)[t]+o(1)\rightarrow\bar{\mathcal{Q}}\bar{\mathcal{R}}(x)[t]. (102)
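Before turning to the second case, the following numeric sketch checks the Case 1 construction for \sigma=\tanh and z_{0}=0 (so \sigma(z_{0})=0 and \sigma^{\prime}(z_{0})=1), with \bar{\mathcal{P}} the identity, a random projection \bar{Q}, and arbitrary test weights: the output of \mathcal{Q}^{\delta}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta} approaches that of the modified cell as \delta\rightarrow 0.

import numpy as np

rng = np.random.default_rng(0)
d, k, N = 3, 2, 6                       # tanh acts on the first k components (I = {1, 2})
A_bar = 0.3 * rng.normal(size=(d, d))
B_bar = rng.normal(size=(d, d))
theta = rng.normal(size=d)
Q_bar = rng.normal(size=(1, d))
x = rng.normal(size=(N, d))

def sigma_I(z):                         # modified activation: tanh on 1..k, identity after
    out = z.copy(); out[:k] = np.tanh(z[:k]); return out

def target():                           # Q_bar applied to the modified cell
    h, ys = np.zeros(d), []
    for t in range(N):
        h = sigma_I(A_bar @ h + B_bar @ x[t] + theta)
        ys.append(Q_bar @ h)
    return np.array(ys)

def construction(delta):                # R_1^delta (83), R_2^delta (85), Q^delta from (102)
    D_in = np.diag(np.r_[np.full(k, 1 / delta), np.ones(d - k)])
    S = np.diag(np.r_[np.ones(k), np.full(d - k, delta)])
    S_inv = np.diag(np.r_[np.ones(k), np.full(d - k, 1 / delta)])
    A_tilde = S @ A_bar @ S_inv
    h, ys = np.zeros(d), []
    for t in range(N):
        u = np.tanh(delta * (B_bar @ x[t] + theta))   # first cell, token-wise; z0 = 0
        h = np.tanh(A_tilde @ h + D_in @ u)           # second cell; bias vanishes since z0 = 0
        ys.append(Q_bar @ S_inv @ h)                  # rescaled projection
    return np.array(ys)

for delta in [1e-1, 1e-2, 1e-3]:
    print(delta, float(np.max(np.abs(construction(delta) - target()))))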

Case 2: σ(z0)0\sigma(z_{0})\neq 0

When \sigma(z_{0})\neq 0, there is a \sigma(z_{0}) term, independent of \delta, in the Taylor expansion \sigma(z_{0}+\delta x)=\sigma(z_{0})+\delta\sigma^{\prime}(z_{0})x+o(\delta). An additional width removes this term; hence we need a lifting map \mathcal{P}:\mathbb{R}^{d\times N}\rightarrow\mathbb{R}^{(d+1)\times N}:

𝒫(x)[t]=[x[t]0].\mathcal{P}(x)[t]=\begin{bmatrix}x[t]\\ 0\end{bmatrix}. (103)

Now for δ>0\delta>0 define 1δ\mathcal{R}_{1}^{\delta} as

1δ(X)σ(δ[B¯0]X+δ[θ¯0]+z0𝟏d+1).\mathcal{R}_{1}^{\delta}\left(X\right)\coloneqq\sigma\left(\delta\begin{bmatrix}\bar{B}&\\ &0\end{bmatrix}X+\delta\begin{bmatrix}\bar{\theta}\\ 0\end{bmatrix}+z_{0}\mathbf{1}_{d+1}\right). (104)

As in the previous case, we have

1δ𝒫(x)[t+1]=[σ(z0)𝟏d+δσ(z0)(B¯x[t+1]+θ¯)+o(δ)σ(z0)],\mathcal{R}_{1}^{\delta}\circ\mathcal{P}(x)[t+1]=\begin{bmatrix}\sigma(z_{0})\mathbf{1}_{d}+\delta\sigma^{\prime}(z_{0})\left(\bar{B}x[t+1]+\bar{\theta}\right)+o\left(\delta\right)\\ \sigma(z_{0})\end{bmatrix}, (105)

and construct a second cell 2δ\mathcal{R}_{2}^{\delta} to compute

2δ(x)[t+1]=σ(A~2δ(x)[t]+1σ(z0)[δ1IkId+1k]x[t+1]+[σ(z0)δσ(z0)𝟏kz0𝟏d+1kσ(z0)σ(z0)𝟏d+1k]),\mathcal{R}_{2}^{\delta}(x)[t+1]=\sigma\left(\tilde{A}\mathcal{R}_{2}^{\delta}(x)[t]+\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d+1-k}\end{bmatrix}x[t+1]+\begin{bmatrix}-\frac{\sigma(z_{0})}{\delta\sigma^{\prime}(z_{0})}\mathbf{1}_{k}\\ z_{0}\mathbf{1}_{d+1-k}-\frac{\sigma(z_{0})}{\sigma^{\prime}(z_{0})}\mathbf{1}_{d+1-k}\end{bmatrix}\right), (106)

where A~=[IkδId+1k][A¯0][Ik1δσ(z0)Idk1δσ(z0)𝟏dk0]\tilde{A}=\begin{bmatrix}I_{k}&\\ &\delta I_{d+1-k}\end{bmatrix}\begin{bmatrix}\bar{A}&\\ &0\end{bmatrix}\begin{bmatrix}I_{k}&&\\ &\frac{1}{\delta\sigma^{\prime}(z_{0})}I_{d-k}&-\frac{1}{\delta\sigma^{\prime}(z_{0})}\mathbf{1}_{d-k}\\ &&0\end{bmatrix}.
After that, the first output of 2δ1δ𝒫(x)[t]\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t] becomes

2δ1δ𝒫(x)[1]\displaystyle\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[1] =σ(1σ(z0)[δ1IkId+1k]1δ(x)[1]+[σ(z0)δσ(z0)𝟏kz0𝟏d+1kσ(z0)σ(z0)𝟏d+1k])\displaystyle=\sigma\left(\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d+1-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}(x)[1]+\begin{bmatrix}-\frac{\sigma(z_{0})}{\delta\sigma^{\prime}(z_{0})}\mathbf{1}_{k}\\ z_{0}\mathbf{1}_{d+1-k}-\frac{\sigma(z_{0})}{\sigma^{\prime}(z_{0})}\mathbf{1}_{d+1-k}\end{bmatrix}\right) (107)
=σ([(B¯x[1]+θ¯)1:k+o(δ)(z0𝟏dk+δ(B¯x[1]+θ¯)k+1:d+o(δ)z0])\displaystyle=\sigma\left(\begin{bmatrix}\left(\bar{B}x[1]+\bar{\theta}\right)_{1:k}+o(\delta)\\ \left(z_{0}\mathbf{1}_{d-k}+\delta(\bar{B}x[1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\\ z_{0}\end{bmatrix}\right) (108)
=[σ(B¯x[1]+θ¯)1:k+o(1)σ(z0)𝟏dk+σ(z0)δ(B¯x[1]+θ¯)k+1:d+o(δ)σ(z0)]\displaystyle=\begin{bmatrix}\sigma\left(\bar{B}x[1]+\bar{\theta}\right)_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\left(\bar{B}x[1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix} (109)
=[¯(x)[1]1:k+o(1)σ(z0)𝟏dk+σ(z0)δ¯(x)[1]k+1:d+o(δ)σ(z0)].\displaystyle=\begin{bmatrix}\bar{\mathcal{R}}(x)[1]_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[1]_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix}. (110)

We use mathematical induction on time t, assuming that

2δ1δ𝒫(x)[t]=[¯(x)[t]1:k+o(1)σ(z0)𝟏dk+σ(z0)δ¯(x)[t]k+1:d+o(δ)σ(z0)].\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]=\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t]_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix}. (111)

Direct calculation yields

1σ(z0)[δ1IkId+1k]1δ𝒫(x)[t+1]+[σ(z0)δσ(z0)𝟏kz0𝟏d+1kσ(z0)σ(z0)𝟏d+1k]\displaystyle\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d+1-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1]+\begin{bmatrix}-\frac{\sigma(z_{0})}{\delta\sigma^{\prime}(z_{0})}\mathbf{1}_{k}\\ z_{0}\mathbf{1}_{d+1-k}-\frac{\sigma(z_{0})}{\sigma^{\prime}(z_{0})}\mathbf{1}_{d+1-k}\end{bmatrix} (112)
=[B¯x[t+1]1:k+θ¯1:k+o(1)z0𝟏dk+δ(B¯x[t+1]+θ¯)k+1:d+o(δ)z0],\displaystyle=\begin{bmatrix}\bar{B}x[t+1]_{1:k}+\bar{\theta}_{1:k}+o(1)\\ z_{0}\mathbf{1}_{d-k}+\delta\left(\bar{B}x[t+1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\\ z_{0}\end{bmatrix}, (113)

and

A~2δ1δ𝒫(x)[t]\displaystyle\tilde{A}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t] (114)
=A~[¯(x)[t]1:k+o(1)σ(z0)𝟏dk+σ(z0)δ¯(x)[t]k+1:d+o(δ)σ(z0)]\displaystyle=\tilde{A}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t]_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix} (115)
=[IkδId+1k][A¯0][¯(x)[t]1:k+o(1)¯(x)[t]k+1:d+o(1)0]\displaystyle=\begin{bmatrix}I_{k}&\\ &\delta I_{d+1-k}\end{bmatrix}\begin{bmatrix}\bar{A}&\\ &0\end{bmatrix}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \bar{\mathcal{R}}(x)[t]_{k+1:d}+o(1)\\ 0\end{bmatrix} (116)
=[(A¯¯(x)[t])1:k+o(1)δ(A¯¯(x)[t])k+1:d+o(δ)0].\displaystyle=\begin{bmatrix}\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{1:k}+o(1)\\ \delta\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{k+1:d}+o(\delta)\\ 0\end{bmatrix}. (117)

Adding two terms in (106), we obtain the induction hypothesis (111) for t+1t+1,

2δ1δ𝒫(x)[t+1]=[¯(x)[t+1]1:k+o(1)σ(z0)𝟏dk+σ(z0)δ¯(x)[t+1]k+1:d+o(δ)σ(z0)].\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1]=\begin{bmatrix}\bar{\mathcal{R}}(x)[t+1]_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t+1]_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix}. (118)

Setting 𝒬δ=[Q¯0][Ik1σ(z0)δIdk1σ(z0)δ𝟏dk0]\mathcal{Q}^{\delta}=\begin{bmatrix}\bar{Q}&0\end{bmatrix}\begin{bmatrix}I_{k}&&\\ &\frac{1}{\sigma^{\prime}(z_{0})\delta}I_{d-k}&-\frac{1}{\sigma^{\prime}(z_{0})\delta}\mathbf{1}_{d-k}\\ &&0\end{bmatrix} and choosing δ\delta small enough complete the proof:

𝒬δ2δ1δ𝒫(x)[t]=[Q¯0][¯(x)[t]1:k+o(1)¯(x)[t]k+1:d+o(1)0]=𝒬¯¯(x)[t]+o(1)𝒬¯¯(x)[t].\mathcal{Q}^{\delta}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]=\begin{bmatrix}\bar{Q}&0\end{bmatrix}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \bar{\mathcal{R}}(x)[t]_{k+1:d}+o(1)\\ 0\end{bmatrix}=\bar{\mathcal{Q}}\bar{\mathcal{R}}(x)[t]+o(1)\rightarrow\bar{\mathcal{Q}}\bar{\mathcal{R}}(x)[t]. (119)

Appendix C Proof of the Lemma 6

It suffices to show that there exists a modified RNN 𝒩\mathcal{N} that computes

𝒩(x)[N]=[x[N]t=1NA[t]x[t]],\mathcal{N}(x)[N]=\begin{bmatrix}x[N]\\ \sum_{t=1}^{N}A[t]x[t]\end{bmatrix}, (120)

for given matrices A[1],,A[N]1×dxA[1],\ldots,A[N]\in\mathbb{R}^{1\times d_{x}}.

An RNN should have multiple layers to implement an arbitrary linear combination. To overcome the complex time dependency arising from the deep structure and to explicitly formulate the result of a deep modified RNN, we force A_{l} and B_{l} to use the information of the previous time step in a limited way. Define the modified RNN cell at the l-th layer, \mathcal{R}_{l}, as

l(x)[t+1]=All(x)[t]+Blx[t+1]\mathcal{R}_{l}(x)[t+1]=A_{l}\mathcal{R}_{l}(x)[t]+B_{l}x[t+1] (121)

where Al=[Odx,dxOdx,1O1,dx1]A_{l}=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}, Bl=[IdxOdx,1bl1]B_{l}=\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ b_{l}&1\end{bmatrix} for bl1×dxb_{l}\in\mathbb{R}^{1\times d_{x}}.

Construct a modified RNN 𝒩L\mathcal{N}_{L} for each LL\in\mathbb{N} as

𝒩LLL11,\mathcal{N}_{L}\coloneqq\mathcal{R}_{L}\circ\mathcal{R}_{L-1}\circ\cdots\circ\mathcal{R}_{1}, (122)

and, for the embedded input sequence x^{\prime}[t]=\begin{bmatrix}x[t]\\ 0\end{bmatrix}\in\mathbb{R}^{d_{x}+1} of x, denote the output of \mathcal{N}_{n} at each time m by

T(n,m)𝒩n(x)[m].T(n,m)\coloneqq\mathcal{N}_{n}\left(x^{\prime}\right)[m]. (123)

Then we have the following lemma.

Lemma 24

Let T(n,m)T(n,m) be the matrix defined by (123). Then we have

T(n,m)=[x[m]i=1j=1(n+mijni)bix[j]],T(n,m)=\begin{bmatrix}x[m]\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\binom{n+m-i-j}{n-i}b_{i}x[j]\end{bmatrix}, (124)

where (nk)\binom{n}{k} means binomial coefficient n!k!(nk)!\frac{n!}{k!(n-k)!} for nkn\geq k. We define (nk)=0\binom{n}{k}=0 for the case of k>nk>n or n<0n<0 for notational convenience.

Proof  Since there is no activation in modified RNN (121), T(n,m)T(n,m) has the form of

T(n,m)=[xmi=1j=1αi,jn,mbix[j]].T(n,m)=\begin{bmatrix}x_{m}\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\alpha_{i,j}^{n,m}b_{i}x[j]\end{bmatrix}. (125)

From the definition of the modified RNN cell and TT, we first show that α\alpha satisfies the recurrence relation

αi,jn,m={αi,jn1,m+αi,jn,m1+1,if n=i and m=j,αi,jn1,m+αi,jn,m1,otherwise\alpha^{n,m}_{i,j}=\left\{\begin{array}[]{ll}\alpha_{i,j}^{n-1,m}+\alpha_{i,j}^{n,m-1}+1,&\text{if $n=i$ and $m=j$},\\ \alpha_{i,j}^{n-1,m}+\alpha_{i,j}^{n,m-1},&\text{otherwise}\end{array}\right. (126)

using mathematical induction on n,mn,m in turn. Initially, T(0,m)=[xm0]T(0,m)=\begin{bmatrix}x_{m}\\ 0\end{bmatrix}, T(n,0)=[Odx,10]T(n,0)=\begin{bmatrix}O_{d_{x},1}\\ 0\end{bmatrix} by definition, and (125) holds when n=0n=0. Now assume (125) holds for nNn\leq N, any mm. To show that (125) holds for n=N+1n=N+1 and any mm, use mathematical induction on mm. By definition, we have αi,jn,0=0\alpha_{i,j}^{n,0}=0 for any nn. Thus (125) holds when n=N+1n=N+1 and m=0m=0. Assume it holds for n=N+1n=N+1 and mMm\leq M. Then

T(N+1,M+1) (127)
=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}\begin{bmatrix}x_{M}\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\alpha_{i,j}^{N+1,M}b_{i}x_{j}\end{bmatrix}+\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ b_{N+1}&1\end{bmatrix}\begin{bmatrix}x_{M+1}\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\alpha_{i,j}^{N,M+1}b_{i}x_{j}\end{bmatrix}
=\begin{bmatrix}x_{M+1}\\ b_{N+1}x_{M+1}+\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\left(\alpha_{i,j}^{N+1,M}+\alpha_{i,j}^{N,M+1}\right)b_{i}x_{j}\end{bmatrix}.

Hence the relation holds for n=N+1n=N+1 and any m>0m>0.

It now remains to show that

αi,jn,m={(m+nijni)if 1in1jm0otherwise\alpha_{i,j}^{n,m}=\left\{\begin{array}[]{ll}\binom{m+n-i-j}{n-i}&\text{if $1\leq i\leq n$, $1\leq j\leq m$}\\ 0&\text{otherwise}\end{array}\right. (128)

From the initial condition of \alpha, we know \alpha^{0,m}_{i,j}=\alpha^{n,0}_{i,j}=0 for all n,m\in\mathbb{N}. After some direct calculation with the recurrence relation (126) of \alpha, we have

  i)   If n<i or m<j, then \alpha_{i,j}^{n,m}=0, since \alpha^{n,m}_{i,j}=\alpha^{n-1,m}_{i,j}+\alpha^{n,m-1}_{i,j}.

  ii)  \alpha_{i,j}^{i,j}=\alpha_{i,j}^{i-1,j}+\alpha_{i,j}^{i,j-1}+1=1.

  iii) \alpha_{i,j}^{i,m}=\alpha_{i,j}^{i-1,m}+\alpha_{i,j}^{i,m-1}=\alpha_{i,j}^{i,m-1} implies \alpha_{i,j}^{i,m}=1 for m>j.

  iv)  Similarly, \alpha^{n,j}_{i,j}=\alpha^{n-1,j}_{i,j}+\alpha^{n,j-1}_{i,j}=\alpha^{n-1,j}_{i,j} implies \alpha_{i,j}^{n,j}=1 for n>i.

Now use mathematical induction on n+mn+m starting from n+m=i+jn+m=i+j to show αi,jn,m=(m+nijni)\alpha^{n,m}_{i,j}=\binom{m+n-i-j}{n-i} for nin\geq i, mjm\geq j.

  i)  n+m=i+j holds only if n=i and m=j for n\geq i, m\geq j. In this case, \alpha^{i,j}_{i,j}=1=\binom{m+n-i-j}{n-i}.

  ii) Assume that (128) holds for any n, m with n+m=k as the induction hypothesis. Now suppose n+m=k+1 for given n, m. If n=i or m=j, we already know \alpha_{i,j}^{n,m}=1=\binom{m+n-i-j}{n-i}. Otherwise n-1\geq i, m-1\geq j, and we have

    αi,jn,m\displaystyle\alpha^{n,m}_{i,j} =αi,jn1,m+αi,jn,m1\displaystyle=\alpha^{n-1,m}_{i,j}+\alpha^{n,m-1}_{i,j} (129)
    =(m+n1ijn1i)+(m+n1ijni)\displaystyle=\binom{m+n-1-i-j}{n-1-i}+\binom{m+n-1-i-j}{n-i}
    =(m+nijni),\displaystyle=\binom{m+n-i-j}{n-i},

    which completes the proof.

 

We have computed the output of modified RNN 𝒩N\mathcal{N}_{N} such that

𝒩N(x)[N]=[x[N]i=1Nj=1N(2nijni)bix[j]].\mathcal{N}_{N}\left(x^{\prime}\right)[N]=\begin{bmatrix}x[N]\\ \sum_{i=1}^{N}\sum_{j=1}^{N}\binom{2n-i-j}{n-i}b_{i}x[j]\end{bmatrix}. (130)

If the square matrix ΛN={(2nijni)}1i,jN\Lambda_{N}=\left\{\binom{2n-i-j}{n-i}\right\}_{1\leq i,j\leq N} has inverse ΛN1={λi,j}1i,jN\Lambda_{N}^{-1}=\left\{\lambda_{i,j}\right\}_{1\leq i,j\leq N}, bi=t=1Nλt,iA[t]b_{i}=\sum_{t=1}^{N}\lambda_{t,i}A[t] satisfies

i=1Nj=1N(2nijni)bix[j]\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}\binom{2n-i-j}{n-i}b_{i}x[j] =i=1nj=1nt=1n(2nijni)λt,iA[t]x[j]\displaystyle=\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{t=1}^{n}\binom{2n-i-j}{n-i}\lambda_{t,i}A[t]x[j]
=j=1nt=1n[i=1n(2nijni)λt,i]A[t]x[j]\displaystyle=\sum_{j=1}^{n}\sum_{t=1}^{n}\left[\sum_{i=1}^{n}\binom{2n-i-j}{n-i}\lambda_{t,i}\right]A[t]x[j]
=j=1nt=1nδj,tA[t]x[j]\displaystyle=\sum_{j=1}^{n}\sum_{t=1}^{n}\delta_{j,t}A[t]x[j]
=j=1nA[j]x[j],\displaystyle=\sum_{j=1}^{n}A[j]x[j],

where δ\delta is the Kronecker delta function.

The following lemma completes the proof.

Lemma 25

Matrix Λn={(2nijni)}1i,jnn×n\Lambda_{n}=\left\{\binom{2n-i-j}{n-i}\right\}_{1\leq i,j\leq n}\in\mathbb{R}^{n\times n} is invertible.

Proof  Use mathematical induction on nn. Λ1\Lambda_{1} is a trivial case. Assume Λn\Lambda_{n} is invertible.

Λn+1=[(2nn)(2n1n)(2n2n)(n+1n)(nn)(2n1n1)(2n2n1)(2n3n1)(nn1)(n1n1)(n+11)(n1)(n11)(21)(11)(n0)(n10)(n20)(10)(00)].\displaystyle\Lambda_{n+1}=\begin{bmatrix}\binom{2n}{n}&\binom{2n-1}{n}&\binom{2n-2}{n}&\dots&\binom{n+1}{n}&\binom{n}{n}\\ \binom{2n-1}{n-1}&\binom{2n-2}{n-1}&\binom{2n-3}{n-1}&\dots&\binom{n}{n-1}&\binom{n-1}{n-1}\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ \binom{n+1}{1}&\binom{n}{1}&\binom{n-1}{1}&\dots&\binom{2}{1}&\binom{1}{1}\\ \binom{n}{0}&\binom{n-1}{0}&\binom{n-2}{0}&\dots&\binom{1}{0}&\binom{0}{0}\end{bmatrix}. (131)

Applying an elementary row operation to \Lambda_{n+1} by multiplying by the matrix E on the left, and an elementary column operation to E\Lambda_{n+1} by multiplying by E^{T} on the right, where

E=[1100001100001000001100001],E=\begin{bmatrix}1&-1&0&\dots&0&0\\ 0&1&-1&\dots&0&0\\ 0&0&1&\dots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\dots&1&-1\\ 0&0&0&\dots&0&1\end{bmatrix}, (132)

we obtain the following relation:

EΛn+1ET=[(2n2n1)(2n3n1)(2n4n1)(n1n1)0(2n3n2)(2n4n2)(2n5n2)(n2n2)0(n10)(n20)(n30)(00)000001]=[ΛnOn,1O1,n1].E\Lambda_{n+1}E^{T}=\begin{bmatrix}\binom{2n-2}{n-1}&\binom{2n-3}{n-1}&\binom{2n-4}{n-1}&\dots&\binom{n-1}{n-1}&0\\ \binom{2n-3}{n-2}&\binom{2n-4}{n-2}&\binom{2n-5}{n-2}&\dots&\binom{n-2}{n-2}&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ \binom{n-1}{0}&\binom{n-2}{0}&\binom{n-3}{0}&\dots&\binom{0}{0}&0\\ 0&0&0&\dots&0&1\end{bmatrix}=\begin{bmatrix}\Lambda_{n}&O_{n,1}\\ O_{1,n}&1\end{bmatrix}. (133)

Hence Λn+1\Lambda_{n+1} is invertible by the induction hypothesis.  

Corollary 26

The following matrix Λn,kk×n\Lambda_{n,k}\in\mathbb{R}^{k\times n} is full-rank.

Λn,k={(2nijni)}nk+1in,1jn.\Lambda_{n,k}=\left\{\binom{2n-i-j}{n-i}\right\}_{n-k+1\leq i\leq n,1\leq j\leq n}. (134)

We will use the matrix Λn,k\Lambda_{n,k} in the proof of Lemma 11 to approximate a sequence-to-sequence function.
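The following short numeric sketch (not part of the proof) checks the construction of this appendix for an arbitrary small N: \Lambda_{N} is invertible, and with b_{i}=\sum_{t}\lambda_{t,i}A[t] the stacked linear cells (121)-(122) output \sum_{t}A[t]x[t] at time N. The sizes and the target rows A[t] are arbitrary test values.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, dx = 5, 3
A_target = rng.normal(size=(N, dx))                 # given row vectors A[1], ..., A[N]

Lam = np.array([[comb(2 * N - i - j, N - i) for j in range(1, N + 1)]
                for i in range(1, N + 1)], dtype=float)
print("det Lambda_N =", np.linalg.det(Lam))         # nonzero, so Lambda_N is invertible

b = np.linalg.solve(Lam.T, A_target)                # row i is b_i = sum_t lambda_{t,i} A[t]

x = rng.normal(size=(N, dx))
X = np.hstack([x, np.zeros((N, 1))])                # embedded input x' = [x; 0]
for l in range(N):                                  # layer l+1 with the cell (121)
    out, h = [], np.zeros(dx + 1)
    for t in range(N):
        h = np.r_[X[t, :dx], h[dx] + b[l] @ X[t, :dx] + X[t, dx]]
        out.append(h)
    X = np.array(out)

print(np.isclose(X[N - 1, dx], np.sum(A_target * x)))   # equals sum_t A[t] x[t]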

Appendix D Proof of Lemma 11

Define token-wise lifting map 𝒫:dxdx+1\mathcal{P}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{x}+1} and modified TRNN cell 𝒯l:(dx+1)×N(dx+1)×N\mathcal{T}\mathcal{R}_{l}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} as in the proof of Lemma 6:

\mathcal{P}(x)[t]=\begin{bmatrix}x[t]\\ 0\end{bmatrix}, (135)
\mathcal{T}\mathcal{R}_{l}(X)[t+1]=A_{l}\mathcal{T}\mathcal{R}_{l}(X)[t]+B_{l}[t+1]X[t+1], (136)

where Al=[Odx,dxOdx,1O1,dx1]A_{l}=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}, Bl[t]=[IdxOdx,1bl[t]1]B_{l}[t]=\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ b_{l}[t]&1\end{bmatrix} for bl[t]1×dxb_{l}[t]\in\mathbb{R}^{1\times d_{x}}. Then we have

T(n,m)\displaystyle T(n,m) 𝒩n(x)[m]\displaystyle\coloneqq\mathcal{N}_{n}\left(x\right)[m] (137)
=[x[m]i=1j=1(n+mijni)bi[j]x[j]],\displaystyle=\begin{bmatrix}x[m]\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\binom{n+m-i-j}{n-i}b_{i}[j]x[j]\end{bmatrix},

where xdx×Nx\in\mathbb{R}^{d_{x}\times N} and 𝒩L=𝒯L𝒯L1𝒯1𝒫\mathcal{N}_{L}=\mathcal{T}\mathcal{R}_{L}\circ\mathcal{T}\mathcal{R}_{L-1}\circ\cdots\circ\mathcal{T}\mathcal{R}_{1}\circ\mathcal{P}.

Since for each tt, the matrix

ΛN,Nt+1={(2NijNi)}tiN,1jN={(2Nt+1ijNj)}1iNt+1,1jN\Lambda_{N,N-t+1}=\left\{\binom{2N-i-j}{N-i}\right\}_{t\leq i\leq N,1\leq j\leq N}=\left\{\binom{2N-t+1-i-j}{N-j}\right\}_{1\leq i\leq N-t+1,1\leq j\leq N} (138)

is full-rank, there exist b1[t],b2[t],,bN[t]b_{1}[t],b_{2}[t],\ldots,b_{N}[t] satisfying

ΛN,Nt+1[b1[t]bN[t]]=[At[N]At[t]],\Lambda_{N,N-t+1}\begin{bmatrix}b_{1}[t]\\ \vdots\\ b_{N}[t]\end{bmatrix}=\begin{bmatrix}A_{t}[N]\\ \vdots\\ A_{t}[t]\end{bmatrix}, (139)

or

j=1N(N+kjtNj)bj[t]=At[k],\sum_{j=1}^{N}\binom{N+k-j-t}{N-j}b_{j}[t]=A_{t}[k], (140)

for each k=1,2,,Nk=1,2,\ldots,N. Then we obtain

T(N,t)\displaystyle T(N,t) =i=1j=1(N+tijNi)bi[j]x[j]\displaystyle=\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\binom{N+t-i-j}{N-i}b_{i}[j]x[j] (141)
=j=1ti=1N(N+tijNi)bi[j]x[j]\displaystyle=\sum_{j=1}^{t}\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}b_{i}[j]x[j]
=j=1tAj[t]x[j].\displaystyle=\sum_{j=1}^{t}A_{j}[t]x[j].
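As with Appendix C, the construction above admits a direct numeric check (not part of the proof): for each time s we solve the full-rank system (139)-(140) for b_{1}[s],\ldots,b_{N}[s], run the stacked time-enhanced cells (135)-(136), and verify that the last component of T(N,t) equals \sum_{j=1}^{t}A_{j}[t]x[j] for every t. The sizes and the target coefficients A_{j}[t] below are arbitrary test values, and index 0 is unused padding.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, dx = 4, 2
A = rng.normal(size=(N + 1, N + 1, dx))            # A[j][t] is the row vector A_j[t]

b = np.zeros((N + 1, N + 1, dx))                   # b[l][s] is the row vector b_l[s]
for s in range(1, N + 1):
    M = np.array([[comb(N + k - i - s, N - i) for i in range(1, N + 1)]
                  for k in range(s, N + 1)], dtype=float)
    rhs = np.array([A[s][k] for k in range(s, N + 1)])
    sol, *_ = np.linalg.lstsq(M, rhs, rcond=None)  # full row rank, so this solves exactly
    b[1:, s] = sol

x = rng.normal(size=(N + 1, dx))
X = np.hstack([x, np.zeros((N + 1, 1))])           # lifted input [x[t]; 0]
for l in range(1, N + 1):                          # layer l with the cell (136)
    out, h = [np.zeros(dx + 1)], np.zeros(dx + 1)
    for t in range(1, N + 1):
        h = np.r_[X[t, :dx], h[dx] + b[l][t] @ X[t, :dx] + X[t, dx]]
        out.append(h)
    X = np.array(out)

for t in range(1, N + 1):
    assert np.isclose(X[t, dx], sum(A[j][t] @ x[j] for j in range(1, t + 1)))
print("T(N, t) matches the prescribed past-dependent sums for every t")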

Appendix E Proof of Lemma 13

As one of the modified TRNNs that compute (31), we use the modified TRNN defined in Appendix D. Specifically, we show that for a given l, there exists a modified RNN of width d_{x}+2+\gamma(\sigma) that approximates the modified TRNN cell \mathcal{T}\mathcal{R}_{l}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} defined by (136). Suppose K\subset\mathbb{R}^{d_{x}} and K^{\prime}\subset\mathbb{R} are compact sets and X\in\left(K\times K^{\prime}\right)^{N}\subset\mathbb{R}^{(d_{x}+1)\times N}. Then the output of the TRNN cell \mathcal{T}\mathcal{R}_{l} is

𝒯l(X)[t]=[X[t]1:dxj=1tbl[j]X[j]1:dx+j=1tX[j]dx+1].\mathcal{T}\mathcal{R}_{l}(X)[t]=\begin{bmatrix}X[t]_{1:d_{x}}\\ \sum_{j=1}^{t}b_{l}[j]X[j]_{1:d_{x}}+\sum_{j=1}^{t}X[j]_{d_{x}+1}\end{bmatrix}. (142)

Without loss of generality, assume K\subset\left[0,\frac{1}{2}\right]^{d_{x}} and let \gamma=\gamma(\sigma). Let \mathcal{P}:\mathbb{R}^{d_{x}+1}\rightarrow\mathbb{R}^{d_{x}+2+\gamma} be the token-wise linear map defined by \mathcal{P}(X)=\begin{bmatrix}X_{1:d_{x}}\\ 0\\ X_{d_{x}+1}\\ \mathbf{0}_{\gamma}\end{bmatrix}. Construct the modified recurrent cells \mathcal{R}_{1},\mathcal{R}_{2}:\mathbb{R}^{(d_{x}+2+\gamma)\times N}\rightarrow\mathbb{R}^{(d_{x}+2+\gamma)\times N} as follows: for X^{\prime}\in\mathbb{R}^{(d_{x}+2+\gamma)\times N},

1(X)[t+1]\displaystyle\mathcal{R}_{1}(X^{\prime})[t+1] =[Odx,dx1O1+γ,1+γ]1(X)[t]+X[t+1]+[𝟎dx1𝟎1+γ],\displaystyle=\begin{bmatrix}O_{d_{x},d_{x}}&&\\ &1&&\\ &&O_{1+\gamma,1+\gamma}\end{bmatrix}\mathcal{R}_{1}(X^{\prime})[t]+X^{\prime}[t+1]+\begin{bmatrix}\mathbf{0}_{d_{x}}\\ 1\\ \mathbf{0}_{1+\gamma}\end{bmatrix}, (143)
\mathcal{R}_{2}(X^{\prime})[t+1]=\begin{bmatrix}I_{d_{x}}&\mathbf{1}_{d_{x}}&&\\ &1&&\\ &&1&\\ &&&O_{\gamma,\gamma}\end{bmatrix}X^{\prime}[t+1]. (144)

Then, by definition for X(K×K)NX\in\left(K\times K^{\prime}\right)^{N},

21𝒫(X)[t]=[X[t]1:dx+t𝟏dxtX[t]dx+1𝟎γ].\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[t]=\begin{bmatrix}X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\\ t\\ X[t]_{d_{x}+1}\\ \mathbf{0}_{\gamma}\end{bmatrix}. (145)

Note that the sets $D_{i}=\{\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[i]_{1:d_{x}}\mid X\in\left(K\times K^{\prime}\right)^{N}\}=\{X[i]_{1:d_{x}}+i\mathbf{1}_{d_{x}}\mid X\in\left(K\times K^{\prime}\right)^{N}\}$ are pairwise disjoint; that is, $D_{i}\cap D_{j}=\emptyset$ for all $i\neq j$.
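
To illustrate the bookkeeping in (143)-(145), the following minimal sketch (ours; toy sizes and helper names are assumptions) applies $\mathcal{P}$, $\mathcal{R}_{1}$, and $\mathcal{R}_{2}$ as the linear maps displayed above, ignoring the activation $\sigma$ and the extra $\gamma(\sigma)$ coordinates that handle it, and checks (145) together with the disjointness of the boxes $D_{i}$.

# Minimal numerical sketch (ours, not the paper's code): apply the token-wise lift P
# and the cells R_1, R_2 of (143)-(144) as linear maps and check (145) and the
# disjointness of the sets D_t. Toy sizes N, dx, gamma are assumptions.
import numpy as np

rng = np.random.default_rng(1)
N, dx, gamma = 4, 2, 1                             # gamma stands for gamma(sigma)
X = np.vstack([rng.uniform(0.0, 0.5, (dx, N)),     # tokens in K, a subset of [0, 1/2]^dx
               rng.standard_normal((1, N))])       # the (dx+1)-th channel, in K'

def lift(Z):                                  # P: insert a zero row and gamma zero rows
    return np.vstack([Z[:dx], np.zeros((1, N)), Z[dx:dx + 1], np.zeros((gamma, N))])

def R1(Z):                                    # (143): row dx+1 counts the time step
    H = np.zeros_like(Z)
    prev = np.zeros(Z.shape[0])
    for t in range(N):
        H[:, t] = Z[:, t]
        H[dx, t] += prev[dx] + 1.0            # only the (dx+1)-th row is recurrent
        prev = H[:, t]
    return H

def R2(Z):                                    # (144): add t * 1_dx to the first dx rows
    W = np.zeros((Z.shape[0], Z.shape[0]))
    W[:dx, :dx] = np.eye(dx)
    W[:dx, dx] = 1.0
    W[dx, dx] = 1.0
    W[dx + 1, dx + 1] = 1.0
    return W @ Z                              # token-wise linear map

Y = R2(R1(lift(X)))
for t in range(1, N + 1):                     # check equation (145)
    expected = np.concatenate([X[:dx, t - 1] + t, [t], [X[dx, t - 1]], np.zeros(gamma)])
    assert np.allclose(Y[:, t - 1], expected)
# Each D_t = {X[t]_{1:dx} + t 1_dx} lies in [t, t + 1/2]^dx, so the D_t are pairwise disjoint.
print("equation (145) verified; the shifted tokens occupy disjoint boxes D_t")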

By the universal approximation theorem for deep narrow MLPs (Hanin and Sellke, 2017; Kidger and Lyons, 2020), for any $\delta_{l}>0$, there exists an MLP $\mathcal{N}_{l,MLP}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{x}+1}$ of width $d_{x}+1+\gamma$ such that for $v\in\mathbb{R}^{d_{x}}$,

𝒩l,MLP(v)1:dx=v\displaystyle\mathcal{N}_{l,MLP}(v)_{1:d_{x}}=v (146)
\sup_{t=1,\ldots,N}\sup_{v\in D_{t}}\left\|b_{l}[t]\left(v-t\mathbf{1}_{d_{x}}\right)-\mathcal{N}_{l,MLP}(v)_{d_{x}+1}\right\|<\delta_{l}. (147)

Since a token-wise MLP can be implemented by an RNN of the same width, there exists an RNN $\mathcal{N}_{l}:\mathbb{R}^{(d_{x}+2+\gamma)\times N}\rightarrow\mathbb{R}^{(d_{x}+2+\gamma)\times N}$ of width $d_{x}+2+\gamma$ whose components, all but the $(d_{x}+2)$-th, implement $\mathcal{N}_{l,MLP}$, so that for all $X^{\prime}\in\mathbb{R}^{(d_{x}+2+\gamma)\times N}$,

𝒩l(X)[t]=[𝒩l,MLP(X[t]1:dx)X[t]dx+2𝟎γ].\mathcal{N}_{l}(X^{\prime})[t]=\begin{bmatrix}\mathcal{N}_{l,MLP}\left(X^{\prime}[t]_{1:d_{x}}\right)\\ X^{\prime}[t]_{d_{x}+2}\\ \mathbf{0}_{\gamma}\end{bmatrix}. (148)

Then for X(K×K)NX\in\left(K\times K^{\prime}\right)^{N} we have

𝒩l21𝒫(X)[t]=[𝒩l,MLP(X[t]1:dx+t𝟏dx)X[t]dx+1𝟎γ].\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[t]=\begin{bmatrix}\mathcal{N}_{l,MLP}\left(X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\right)\\ X[t]_{d_{x}+1}\\ \mathbf{0}_{\gamma}\end{bmatrix}. (149)

Finally, define a recurrent cell $\mathcal{R}_{3}:\mathbb{R}^{(d_{x}+2+\gamma)\times N}\rightarrow\mathbb{R}^{(d_{x}+2+\gamma)\times N}$ of width $d_{x}+2+\gamma$ by

3(X)[t+1]=[Odx+1,dx+11Oγ,γ]3(X)[t]+[Idx111Oγ,γ]X[t+1],\mathcal{R}_{3}(X^{\prime})[t+1]=\begin{bmatrix}O_{d_{x}+1,d_{x}+1}&&\\ &1&&\\ &&O_{\gamma,\gamma}\end{bmatrix}\mathcal{R}_{3}(X^{\prime})[t]+\begin{bmatrix}I_{d_{x}}&&&\\ &1&&\\ &1&1&\\ &&&O_{\gamma,\gamma}\end{bmatrix}X^{\prime}[t+1], (150)

and obtain

\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[t]=\begin{bmatrix}X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\\ \mathcal{N}_{l,MLP}\left(X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\right)_{d_{x}+1}\\ \sum_{j=1}^{t}\mathcal{N}_{l,MLP}\left(X[j]_{1:d_{x}}+j\mathbf{1}_{d_{x}}\right)_{d_{x}+1}+\sum_{j=1}^{t}X[j]_{d_{x}+1}\\ \mathbf{0}_{\gamma}\end{bmatrix}. (151)

With the token-wise projection map $\mathcal{Q}:\mathbb{R}^{d_{x}+2+\gamma}\rightarrow\mathbb{R}^{d_{x}+1}$ defined by $\mathcal{Q}(X^{\prime})=\begin{bmatrix}X^{\prime}_{1:d_{x}}\\ X^{\prime}_{d_{x}+2}\end{bmatrix}$, the RNN $\mathcal{Q}\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N}$ of width $d_{x}+2+\gamma$ maps $X\in\mathbb{R}^{(d_{x}+1)\times N}$ to

𝒬3𝒩l21𝒫(X)[t]=[X[t]1:dx+t𝟏dxj=1t𝒩l,MLP(X[j]1:dx+j𝟏dx)dx+1+j=1tX[j]dx+1].\mathcal{Q}\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[t]=\begin{bmatrix}X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\\ \sum_{j=1}^{t}\mathcal{N}_{l,MLP}\left(X[j]_{1:d_{x}}+j\mathbf{1}_{d_{x}}\right)_{d_{x}+1}+\sum_{j=1}^{t}X[j]_{d_{x}+1}\end{bmatrix}. (152)
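
As a sanity check of the composite construction, the following sketch (ours) replaces the approximating MLP $\mathcal{N}_{l,MLP}$ by the exact token-wise map $v\mapsto(v,\,b_{l}[t](v-t\mathbf{1}_{d_{x}}))$ that it approximates on $D_{t}$, composes $\mathcal{Q}\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}$ as linear maps, and verifies that the last output coordinate coincides with the last coordinate of $\mathcal{T}\mathcal{R}_{l}(X)[t]$ in (142); the toy sizes and the random $b_{l}[t]$ are assumptions.

# Minimal numerical sketch (ours, not the paper's code): compose Q R_3 N_l R_2 R_1 P
# as in (143)-(152), with the approximating MLP replaced by the exact token-wise map
# it approximates on D_t, and check the last output coordinate against (142).
import numpy as np

rng = np.random.default_rng(2)
N, dx, gamma = 4, 2, 1
b = rng.standard_normal((N, dx))                       # rows b_l[1], ..., b_l[N]
X = np.vstack([rng.uniform(0.0, 0.5, (dx, N)),         # tokens in K x K'
               rng.standard_normal((1, N))])

Y = np.zeros((dx + 2 + gamma, N))                      # the time-stamped tokens of (145)
for t in range(1, N + 1):
    Y[:dx, t - 1] = X[:dx, t - 1] + t
    Y[dx, t - 1] = t
    Y[dx + 1, t - 1] = X[dx, t - 1]

def N_l(Z):                                            # idealized token-wise map, cf. (148)
    H = np.zeros_like(Z)
    for t in range(1, N + 1):
        v = Z[:dx, t - 1]                              # lies in the box D_t
        H[:dx, t - 1] = v                              # cf. (146)
        H[dx, t - 1] = b[t - 1] @ (v - t)              # exact value approximated in (147)
        H[dx + 1, t - 1] = Z[dx + 1, t - 1]
    return H

def R3(Z):                                             # (150): running sum in row dx+2
    H = np.zeros_like(Z)
    prev = np.zeros(Z.shape[0])
    for t in range(N):
        H[:dx + 1, t] = Z[:dx + 1, t]
        H[dx + 1, t] = prev[dx + 1] + Z[dx, t] + Z[dx + 1, t]
        prev = H[:, t]
    return H

out = R3(N_l(Y))[[*range(dx), dx + 1], :]              # Q keeps rows 1..dx and dx+2
for t in range(1, N + 1):                              # compare with the last row of (142)
    target = sum(b[j - 1] @ X[:dx, j - 1] + X[dx, j - 1] for j in range(1, t + 1))
    assert np.isclose(out[dx, t - 1], target)
print("last coordinate of (152) equals that of TR_l(X)[t] in (142)")

Replacing the exact map by an MLP satisfying (146)-(147) changes each summand by less than $\delta_{l}$, so the accumulated sum changes by at most $N\delta_{l}$, consistent with (153).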

Since 𝒩l,MLP(X[j]1:dx+j𝟏dx)dx+1bl[j]X[j]1:dx\mathcal{N}_{l,MLP}\left(X[j]_{1:d_{x}}+j\mathbf{1}_{d_{x}}\right)_{d_{x}+1}\rightarrow b_{l}[j]X[j]_{1:d_{x}}, we have

supX(K×K)N𝒯l(X)𝒬3𝒩l21𝒫(X)0,\sup_{X\in\left(K\times K^{\prime}\right)^{N}}\left\|\mathcal{T}\mathcal{R}_{l}(X)-\mathcal{Q}\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)\right\|\rightarrow 0, (153)

as $\delta_{l}\rightarrow 0$. Approximating every cell $\mathcal{T}\mathcal{R}_{l}$ in the construction of Appendix D completes the proof.

Appendix F Proof of Lemma 19

The main idea of the proof is to separate the linear sum $\sum_{j=1}^{N}A_{j}[t]x[j]$ into the past-dependent part $\sum_{j=1}^{t-1}A_{j}[t]x[j]$ and the remainder $\sum_{j=t}^{N}A_{j}[t]x[j]$. We then construct a modified TBRNN with $2N$ cells; the first $N$ cells contain only a forward recurrent cell and compute the past-dependent part, while the last $N$ cells contain only a backward recurrent cell and compute the remainder.

Let the first NN modified TRNN cells l:(dx+1)×N(dx+1)×N\mathcal{R}_{l}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} for 1lN1\leq l\leq N be defined as in the proof of Lemma 11:

l(X)[t+1]=All(X)[t]+Bl[t]X[t+1],\mathcal{R}_{l}\left(X\right)[t+1]=A_{l}\mathcal{R}_{l}\left(X\right)[t]+B_{l}[t]X[t+1], (154)

where $A_{l}=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}$, $B_{l}[t]=\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ b_{l}[t]&1\end{bmatrix}$ for $b_{l}[t]\in\mathbb{R}^{1\times d_{x}}$. Then, with the token-wise lifting map $\mathcal{P}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{x}+1}$ defined by $\mathcal{P}(x)=\begin{bmatrix}x\\ 0\end{bmatrix}$, we construct the modified TRNN $\mathcal{N}_{N}=\mathcal{R}_{N}\circ\cdots\circ\mathcal{R}_{1}\circ\mathcal{P}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N}$. We know that if $C_{i}[m]\in\mathbb{R}^{1\times d_{x}}$ are given for $1\leq m\leq N$ and $1\leq i\leq m$, there exist $b_{l}[t]$ for $1\leq l,t\leq N$ such that

𝒩N(x)[m]=[x[m]i=1mCi[m]x[i]].\mathcal{N}_{N}(x)[m]=\begin{bmatrix}x[m]\\ \sum_{i=1}^{m}C_{i}[m]x[i]\end{bmatrix}. (155)

Therefore, we determine $C_{i}[m]$ after constructing the latter $N$ cells. For brevity, write $f_{m}=\sum_{i=1}^{m}C_{i}[m]x[i]$.

After 𝒩N\mathcal{N}_{N}, construct NN modified TRNN cells ¯l:(dx+1)×N(dx+1)×N\bar{\mathcal{R}}_{l}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} for 1lN1\leq l\leq N in reverse order:

¯l(X¯)[t1]=A¯l¯l(X¯)[t]+B¯l[t]X¯[t1],\bar{\mathcal{R}}_{l}\left(\bar{X}\right)[t-1]=\bar{A}_{l}\bar{\mathcal{R}}_{l}\left(\bar{X}\right)[t]+\bar{B}_{l}[t]\bar{X}[t-1], (156)

where A¯l=[Odx,dxOdx,1O1,dx1]\bar{A}_{l}=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}, B¯l[t]=[IdxOdx,1b¯l[t]1]\bar{B}_{l}[t]=\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ \bar{b}_{l}[t]&1\end{bmatrix} for b¯l[t]1×dx\bar{b}_{l}[t]\in\mathbb{R}^{1\times d_{x}}. Define 𝒩¯N=¯N¯1\bar{\mathcal{N}}_{N}=\bar{\mathcal{R}}_{N}\circ\cdots\circ\bar{\mathcal{R}}_{1}, and we obtain the following result after a similar calculation with input sequence X¯[t]=𝒩N(x)[t]=[x[t]ft]\bar{X}[t]=\mathcal{N}_{N}(x)[t]=\begin{bmatrix}x[t]\\ f_{t}\end{bmatrix}:

𝒩¯N(X¯)[N+1t]\displaystyle\bar{\mathcal{N}}_{N}\left(\bar{X}\right)[N+1-t] (157)
=[x[N+1t]j=1t[i=1N(N+tijNi)b¯i[N+1j]x[N+1j]+(N+t1jN1)fN+1j]].\displaystyle=\begin{bmatrix}x[N+1-t]\\ \sum_{j=1}^{t}\left[\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}\bar{b}_{i}[N+1-j]x[N+1-j]+\binom{N+t-1-j}{N-1}f_{N+1-j}\right]\end{bmatrix}. (158)
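
The closed form (157)-(158) can also be checked by direct simulation. The sketch below (ours, not code from the paper) runs $N$ backward modified TRNN cells as in (156), treated as linear maps and with the weight index following the token index as in (158), and compares the last coordinate of the output at position $N+1-t$ with (158); the toy sizes, the random $\bar{b}_{i}[m]$, and the random stand-ins for $f_{t}$ are assumptions.

# Minimal numerical sketch (ours): run N backward modified TRNN cells on an input
# sequence Xbar[t] = (x[t], f_t) and compare the last coordinate of the output at
# position N+1-t with the closed form (158).
import numpy as np
from math import comb

rng = np.random.default_rng(3)
N, dx = 5, 3
x = rng.standard_normal((dx, N))
f = rng.standard_normal(N)                      # stand-ins for f_1, ..., f_N
bb = rng.standard_normal((N, N, dx))            # bb[i-1, m-1] = b-bar_i[m]

def binom0(n, k):                               # binomial coefficient, zero convention
    return comb(n, k) if 0 <= k <= n else 0

def backward_cell(Z, b):                        # reverse-time analogue of (154), cf. (156)
    H = np.zeros_like(Z)
    s = 0.0
    for t in range(N, 0, -1):                   # accumulate from the end of the sequence
        s = s + b[t - 1] @ Z[:dx, t - 1] + Z[dx, t - 1]
        H[:dx, t - 1] = Z[:dx, t - 1]
        H[dx, t - 1] = s
    return H

Xb = np.vstack([x, f[None, :]])                 # Xbar[t] = (x[t], f_t)
for i in range(N):
    Xb = backward_cell(Xb, bb[i])

for t in range(1, N + 1):                       # compare with (158) at position N+1-t
    rhs = sum(
        sum(binom0(N + t - i - j, N - i) * (bb[i - 1, N - j] @ x[:, N - j])
            for i in range(1, N + 1))
        + binom0(N + t - 1 - j, N - 1) * f[N - j]
        for j in range(1, t + 1))
    assert np.isclose(Xb[dx, N - t], rhs)
print("backward stack reproduces the closed form (158)")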

We want to find fmf_{m} and b¯i[m]\bar{b}_{i}[m] so that

𝒩¯N(X¯)[N+1t]dx+1=i=1NAi[N+1t]x[i],\bar{\mathcal{N}}_{N}\left(\bar{X}\right)[N+1-t]_{d_{x}+1}=\sum_{i=1}^{N}A_{i}[N+1-t]x[i], (159)

for each t=1,2,,Nt=1,2,\ldots,N.

Note that $\sum_{j=1}^{t}\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}\bar{b}_{i}[N+1-j]x[N+1-j]$ does not involve the terms $x[1],x[2],\ldots,x[N-t]$, so $\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}f_{N+1-j}$ must supply $\sum_{i=1}^{N-t}A_{i}[N+1-t]x[i]$.

j=1t(N+t1jN1)fN+1j\displaystyle\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}f_{N+1-j} =j=1t(N+t1jN1)i=1N+1jCi[N+1j]x[i]\displaystyle=\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}\sum_{i=1}^{N+1-j}C_{i}[N+1-j]x[i] (160)
=j=1ti=1N+1j(N+t1jN1)Ci[N+1j]x[i]\displaystyle=\sum_{j=1}^{t}\sum_{i=1}^{N+1-j}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]x[i] (161)
=i=N+2tNj=1N+1i(N+t1jN1)Ci[N+1j]x[i]\displaystyle=\sum_{i=N+2-t}^{N}\sum_{j=1}^{N+1-i}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]x[i] (162)
+i=1N+1tj=1t(N+t1jN1)Ci[N+1j]x[i].\displaystyle\qquad+\sum_{i=1}^{N+1-t}\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]x[i]. (163)

Since the matrix $\Lambda_{i}=\left\{\binom{N+t-1-j}{N-1}\right\}_{1\leq t\leq N+1-i,1\leq j\leq N+1-i}$ is a lower triangular $(N+1-i)\times(N+1-i)$ matrix with unit diagonal entries, it is invertible, and hence there exist $C_{i}[i],C_{i}[i+1],\ldots,C_{i}[N]$ such that

j=1t(N+t1jN1)Ci[N+1j]=Ai[N+1t],\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]=A_{i}[N+1-t], (164)

for each t=1,2,,N+1it=1,2,\ldots,N+1-i.
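
Since $\Lambda_{i}$ is unit lower triangular, the system (164) can be solved by forward substitution. The sketch below (ours; the toy sizes and the random targets $A_{i}[m]$ are assumptions) verifies the triangular structure and recovers $C_{i}[i],\ldots,C_{i}[N]$ numerically.

# Minimal numerical sketch (ours): Lambda_i is unit lower triangular, so (164) has a
# unique solution C_i[i], ..., C_i[N]; verify this for toy sizes and random targets.
import numpy as np
from math import comb

rng = np.random.default_rng(4)
N, dx, i = 6, 2, 3
n = N + 1 - i                                    # size of Lambda_i

def binom0(a, k):                                # binomial coefficient, zero convention
    return comb(a, k) if 0 <= k <= a else 0

Lam = np.array([[binom0(N + t - 1 - j, N - 1) for j in range(1, n + 1)]
                for t in range(1, n + 1)], dtype=float)
assert np.allclose(np.triu(Lam, 1), 0) and np.allclose(np.diag(Lam), 1)

A_target = rng.standard_normal((n, dx))          # stands in for A_i[N+1-t], t = 1, ..., N+1-i
Csol = np.linalg.solve(Lam, A_target)            # Csol[j-1] = C_i[N+1-j]
for t in range(1, n + 1):                        # re-check (164)
    lhs = sum(binom0(N + t - 1 - j, N - 1) * Csol[j - 1] for j in range(1, t + 1))
    assert np.allclose(lhs, A_target[t - 1])
print("Lambda_i is unit lower triangular; (164) solved for the C_i's")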

We now have

j=1t(N+t1jN1)fN+1j\displaystyle\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}f_{N+1-j} (165)
=i=N+2tNj=1N+1i(N+t1jN1)Ci[N+1j]x[i]+i=1N+1tAi[N+1t]x[i]\displaystyle=\sum_{i=N+2-t}^{N}\sum_{j=1}^{N+1-i}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]x[i]+\sum_{i=1}^{N+1-t}A_{i}[N+1-t]x[i] (166)
=i=1t1j=1i(N+t1jN1)CN+1i[N+1j]x[N+1i]+i=1N+1tAi[N+1t]x[i]\displaystyle=\sum_{i=1}^{t-1}\sum_{j=1}^{i}\binom{N+t-1-j}{N-1}C_{N+1-i}[N+1-j]x[N+1-i]+\sum_{i=1}^{N+1-t}A_{i}[N+1-t]x[i] (167)
=j=1t1i=1j(N+t1iN1)CN+1j[N+1i]x[N+1j]+i=1N+1tAi[N+1t]x[i].\displaystyle=\sum_{j=1}^{t-1}\sum_{i=1}^{j}\binom{N+t-1-i}{N-1}C_{N+1-j}[N+1-i]x[N+1-j]+\sum_{i=1}^{N+1-t}A_{i}[N+1-t]x[i]. (168)

We switched $i$ and $j$ in the last equality. By Corollary 26, there exist $\bar{b}_{i}[N+1-j]$ satisfying

i=1N(N+tijNi)b¯i[N+1j]\displaystyle\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}\bar{b}_{i}[N+1-j] (169)
=AN+1j[N+1t]i=1j(N+t1iN1)CN+1j[N+1i],\displaystyle=A_{N+1-j}[N+1-t]-\sum_{i=1}^{j}\binom{N+t-1-i}{N-1}C_{N+1-j}[N+1-i], (170)
for j=1,2,,t1j=1,2,\ldots,t-1, and
i=1N(N+tijNi)b¯i[N+1j]\displaystyle\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}\bar{b}_{i}[N+1-j] (171)
=AN+1j[N+1t],\displaystyle=A_{N+1-j}[N+1-t], (172)

for j=tj=t.
With the above $C_{i}[m]$ and $\bar{b}_{i}[m]$, equation (159) holds for each $t=1,2,\ldots,N$. It remains to construct the modified TRNN cells that implement $f_{m}$, which follows directly from the proof of Lemma 11.

References

  • Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Bahdanau et al. (2016) Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4945–4949. IEEE, 2016.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
  • Cohen et al. (2016) Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728. PMLR, 2016.
  • Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • Dai et al. (2021) Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021.
  • Elman (1990) Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • Garnelo et al. (2018) Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
  • Gers et al. (2000) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural computation, 12(10):2451–2471, 2000.
  • Graves and Jaitly (2014) Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772. PMLR, 2014.
  • Hanin and Sellke (2017) Boris Hanin and Mark Sellke. Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.
  • Hanson and Raginsky (2020) Joshua Hanson and Maxim Raginsky. Universal simulation of stable dynamical systems by recurrent neural nets. In Learning for Dynamics and Control, pages 384–392. PMLR, 2020.
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • Johnson (2019) Jesse Johnson. Deep, skinny neural networks are not universal approximators. In International Conference on Learning Representations, 2019.
  • Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
  • Kidger and Lyons (2020) Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In Conference on learning theory, pages 2306–2327. PMLR, 2020.
  • Lavanya and Sasikala (2021) PM Lavanya and E Sasikala. Deep learning techniques on text classification using natural language processing (NLP) in social healthcare network: A comprehensive survey. In 2021 3rd International Conference on Signal Processing and Communication (ICPSC), pages 603–609. IEEE, 2021.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Leshno et al. (1993) Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861–867, 1993.
  • Lin and Jegelka (2018) Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a universal approximator. Advances in Neural Information Processing Systems, 31, 2018.
  • Lu et al. (2017) Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, pages 1045–1048. Makuhari, 2010.
  • Park et al. (2020) Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. arXiv preprint arXiv:2006.08859, 2020.
  • Park et al. (2021) Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. In International Conference on Learning Representations, 2021.
  • Rolnick and Tegmark (2018) David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In the International Conference on Learning Representations, 2018.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • Schäfer and Zimmermann (2007) Anton Maximilian Schäfer and Hans-Georg Zimmermann. Recurrent neural networks are universal approximators. International journal of neural systems, 17(04):253–263, 2007.
  • Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
  • Wu et al. (2017) Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. Recurrent recommender networks. In Proceedings of the tenth ACM international conference on web search and data mining, pages 495–503, 2017.
  • Yun et al. (2020) Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.
  • Zhai et al. (2022) Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
  • Zhou (2020) Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and computational harmonic analysis, 48(2):787–794, 2020.