
Minimal Width for Universal Property of Deep RNN

Chang hoon Song ([email protected])
Department of Mathematical Science, Seoul National University

Geonho Hwang ([email protected])
Department of Mathematical Science, Seoul National University

Jun ho Lee ([email protected])
Kongju National University

Myungjoo Kang ([email protected])
Department of Mathematical Science, Seoul National University
Abstract

A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system in a compact domain. In practice, deep narrow networks with bounded width and arbitrary depth are often more effective than wide shallow networks with arbitrary width and bounded depth; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show that a deep RNN with ReLU activation can approximate any continuous function or L^{p} function with the widths d_{x}+d_{y}+3 and \max\{d_{x}+1,d_{y}\}, respectively, where the target function maps a finite sequence of vectors in \mathbb{R}^{d_{x}} to a finite sequence of vectors in \mathbb{R}^{d_{y}}. We also compute the additional width required if the activation function is the tanh, the sigmoid, or another non-degenerate function. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and technique can shed light on further research on deep RNNs.

Keywords: recurrent neural network, universal approximation, deep narrow network, sequential data, minimum width

1 Introduction

Deep learning, a type of machine learning that approximates a target function using a numerical model called an artificial neural network, has shown tremendous success in diverse fields, such as regression (Garnelo et al., 2018), image processing (Dai et al., 2021; Zhai et al., 2022), speech recognition (Baevski et al., 2020), and natural language processing (LeCun et al., 2015; Lavanya and Sasikala, 2021).

While the excellent performance of deep learning is attributed to a combination of various factors, it is reasonable to speculate that its notable success is partially based on the universal approximation theorem, which states that neural networks are capable of arbitrarily accurate approximations of the target function.

Formally, for any given function f in a target class and \epsilon>0, there exists a network \mathcal{N} such that \|f-\mathcal{N}\|<\epsilon. In topological language, the theorem states that a set of networks is dense in the class of target functions. In this sense, the closure of the set of networks defines the range of functions that the network class can represent.

As universality provides the expressive power of a network structure, studies on the universal approximation theorem have attracted increasing attention. Examples include the universality of multi-layer perceptrons (MLPs), the most basic neural networks (Cybenko, 1989; Hornik et al., 1989; Leshno et al., 1993), and the universality of recurrent neural networks (RNNs) with the target class of open dynamical systems (Schäfer and Zimmermann, 2007). Recently, Zhou (2020) has demonstrated the universality of convolutional neural networks.

Classical universal approximation theorems specialize in the representation power of shallow wide networks with bounded depth and unbounded width. Based on mounting empirical evidence that deep networks perform better than wide networks, the construction of deep networks instead of shallow networks has gained considerable attention in recent literature. Consequently, researchers have started to analyze the universal approximation property of deep networks (Cohen et al., 2016; Lin and Jegelka, 2018; Rolnick and Tegmark, 2018; Kidger and Lyons, 2020). Studies on MLPs have shown that wide shallow networks require a depth of only two to attain universality, whereas deep narrow networks require a width at least as large as their input dimension.

A wide network attains universality by increasing its width even if the depth is only two (Cybenko, 1989; Hornik et al., 1989; Leshno et al., 1993). In the case of a deep network, however, there are functions that a narrow network cannot approximate regardless of its depth (Lu et al., 2017; Park et al., 2021). Therefore, clarifying the minimum width that guarantees universality is crucial, and studies are underway to narrow the gap between its lower and upper bounds.

Network | Function class | Activation | Result
RNN | C\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | ReLU | w_{\text{min}}\leq d_{x}+d_{y}+3
 | | conti. nonpoly1 | w_{\text{min}}\leq d_{x}+d_{y}+3
 | | conti. nonpoly2 | w_{\text{min}}\leq d_{x}+d_{y}+4
 | L^{p}\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | conti. nonpoly1,2 | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}+1
 | L^{p}\left(\mathbb{R}^{d_{x}},\mathbb{R}^{d_{y}}\right)\dagger | ReLU | w_{\text{min}}=\max\left\{d_{x}+1,d_{y}\right\}
LSTM | C\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | | w_{\text{min}}\leq d_{x}+d_{y}+3
 | L^{p}\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}+1
GRU | C\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | | w_{\text{min}}\leq d_{x}+d_{y}+3
 | L^{p}\left(\mathcal{K},\mathbb{R}^{d_{y}}\right)\dagger | | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}+1
BRNN | C\left(\mathcal{K},\mathbb{R}^{d_{y}}\right) | ReLU | w_{\text{min}}\leq d_{x}+d_{y}+3
 | | conti. nonpoly1 | w_{\text{min}}\leq d_{x}+d_{y}+3
 | | conti. nonpoly2 | w_{\text{min}}\leq d_{x}+d_{y}+4
 | L^{p}\left(\mathcal{K},\mathbb{R}^{d_{y}}\right) | conti. nonpoly1,2 | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}+1
 | L^{p}\left(\mathbb{R}^{d_{x}},\mathbb{R}^{d_{y}}\right) | ReLU | w_{\text{min}}\leq\max\left\{d_{x}+1,d_{y}\right\}
  • \dagger: requires the class to consist of past-dependent functions.

  • 1: requires an activation \sigma to be continuously differentiable at some point z_{0} with \sigma(z_{0})=0 and \sigma^{\prime}(z_{0})\neq 0; \tanh belongs here.

  • 2: requires an activation \sigma to be continuously differentiable at some point z_{0} with \sigma^{\prime}(z_{0})\neq 0; the logistic sigmoid function belongs here.

Table 1: Summary of our results on the upper bound of the minimal width w_{\text{min}} for the universality of RNNs. In the table, \mathcal{K} indicates a compact subset of \mathbb{R}^{d_{x}} and 1\leq p<\infty. We abbreviate continuous to “conti”.

Recurrent neural networks (RNNs) (Rumelhart et al., 1986; Elman, 1990) have been crucial for modeling complex temporal dependencies in sequential data. They have various applications in diverse fields, such as language modeling (Mikolov et al., 2010; Jozefowicz et al., 2016), speech recognition (Graves and Jaitly, 2014; Bahdanau et al., 2016), recommendation systems (Hidasi et al., 2015; Wu et al., 2017), and machine translation (Bahdanau et al., 2014). Deep RNNs are widely used and have been successfully applied in practical applications. However, their theoretical understanding remains elusive despite their intensive use. This deficiency in existing studies motivated our work.

In this study, we prove the universal approximation theorem of deep narrow RNNs and discover the upper bound of their minimum width. The target class consists of sequence-to-sequence functions that depend solely on past information. We refer to such functions as past-dependent functions. We provide the upper bound of the minimum width of the RNN for universality in the space of past-dependent functions. Surprisingly, the upper bound is independent of the length of the sequence. This theoretical result highlights the suitability of the recurrent structure for sequential data compared with other network structures. Furthermore, our results are not restricted to RNNs; they can be generalized to variants of RNNs, including long short-term memory (LSTM), gated recurrent units (GRU), and bidirectional RNNs (BRNN). As corollaries of our main theorem, LSTM and GRU are shown to have the same universality and target class as an RNN. We also prove that the BRNN can approximate any sequence-to-sequence function in a continuous or L^{p} space under the respective norms. We also present the upper bound of the minimum width for these variants. Table 1 outlines our main results.

1.1 Summary of Contributions

With a target class of functions that map a finite sequence x\in\mathbb{R}^{d_{x}\times N} to a finite sequence y\in\mathbb{R}^{d_{y}\times N}, we prove the following:

  • A deep RNN can approximate any past-dependent sequence-to-sequence continuous function with width d_{x}+d_{y}+3 for ReLU or \tanh (generally, a non-degenerate \sigma with \sigma(z_{0})=0 requires the same minimal width as \tanh), and d_{x}+d_{y}+4 for non-degenerating activations.

  • A deep RNN can approximate any past-dependent L^{p} function (1\leq p<\infty) with width \max\left\{d_{x}+1,d_{y}\right\} for the ReLU activation and \max\left\{d_{x}+1,d_{y}\right\}+1 for non-degenerating activations.

  • A deep BRNN can approximate any sequence-to-sequence continuous function with width d_{x}+d_{y}+3 for ReLU or \tanh (generally, a non-degenerate \sigma with \sigma(z_{0})=0 requires the same minimal width as \tanh), and d_{x}+d_{y}+4 for non-degenerating activations.

  • A deep BRNN can approximate any sequence-to-sequence L^{p} function (1\leq p<\infty) with width \max\left\{d_{x}+1,d_{y}\right\} for the ReLU activation and \max\left\{d_{x}+1,d_{y}\right\}+1 for non-degenerating activations.

  • A deep LSTM or GRU can approximate any past-dependent sequence-to-sequence continuous function with width d_{x}+d_{y}+3 and L^{p} function with width \max\left\{d_{x}+1,d_{y}\right\}+1.

1.2 Organization

In Section 2, we briefly review previous studies on the universality of neural networks. Section 3 provides some notations, network configuration, a description of the target function class, and the condition on the activation function. Section 4 addresses the approximation property in a continuous function space. First, we fit only the last component of an output sequence and extend it to all components of the sequence. In Section 5, we present the upper bound of the minimum width in L^{p} space. As an expansion of our theorems, the universal approximation theorems for the LSTM, GRU, and BRNN are deduced in Section 6. Sections 7 and 8 present a discussion and conclusions, respectively.

2 Related Works

We briefly review some of the results of studies on the universal approximation property. Studies have been conducted to determine whether a neural network can learn a sufficiently wide range of functions, that is, whether it has universal properties. Cybenko (1989) and Hornik et al. (1989) first proved that the most basic network, a simple two-layered MLP, can approximate arbitrary continuous functions defined on a compact set. Some follow-up studies have investigated the universal properties of other structures for a specific task, such as a convolutional neural network for image processing (Zhou, 2020), an RNN for open dynamical systems (Schäfer and Zimmermann, 2007; Hanson and Raginsky, 2020), and transformer networks for translation and speech recognition (Yun et al., 2020). Particularly for RNNs, Schäfer and Zimmermann (2007) showed that an open dynamical system with continuous state transition and output function can be approximated by a network with a wide RNN cell and subsequent linear layer in finite time. Hanson and Raginsky (2020) showed that the trajectory of the dynamical system can be reproduced with arbitrarily small errors up to infinite time, assuming a stability condition on long-term behavior.

While such prior studies mainly focused on wide and shallow networks, several studies have investigated whether a deep narrow network with bounded width can approximate arbitrary functions in certain target function classes, such as a space of continuous functions or an L^{p} space (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Kidger and Lyons, 2020; Park et al., 2021). Unlike the case of a wide network, which requires only one hidden layer for universality, there exists a function that cannot be approximated by any network whose width is less than a certain threshold. Specifically, in a continuous function space C\left(K,\mathbb{R}^{d_{y}}\right) with the supremum norm \left\|f\right\|_{\infty}\coloneqq\sup_{x\in K}\left\|f(x)\right\| on a compact subset K\subset\mathbb{R}^{d_{x}}, Hanin and Sellke (2017) showed both negative and positive results: MLPs do not attain universality with width d_{x}, but width d_{x}+d_{y} is sufficient for MLPs to achieve universality with the ReLU activation function. Johnson (2019) and Kidger and Lyons (2020) generalized the condition on the activation and proved that d_{x} is too narrow, but d_{x}+d_{y}+2 is sufficiently wide for universality in C\left(K,\mathbb{R}^{d_{y}}\right). On the other hand, in an L^{p} space L^{p}\left(\mathbb{R}^{d_{x}},\mathbb{R}^{d_{y}}\right) with the L^{p}-norm \left\|f\right\|_{p}\coloneqq\left(\int_{\mathbb{R}^{d_{x}}}\left\|f(x)\right\|^{p}\,dx\right)^{1/p} for p\in[1,\infty), Lu et al. (2017) showed that width d_{x} is insufficient for MLPs to have universality, but d_{x}+4 is enough, with the ReLU activation when p=1 and d_{y}=1. Kidger and Lyons (2020) found that MLPs whose width is d_{x}+d_{y}+1 achieve universality in L^{p}\left(\mathbb{R}^{d_{x}},\mathbb{R}^{d_{y}}\right) with the ReLU activation, and Park et al. (2021) determined the exact value \max\left\{d_{x}+1,d_{y}\right\} required for universality in the L^{p} space with ReLU.

As described earlier, studies on deep narrow MLP have been actively conducted, but the approximation ability of deep narrow RNNs remains unclear. This is because the process by which the input affects the result is complicated compared with that of an MLP. The RNN cell transfers information to both the next time step in the same layer and the same time step in the next layer, which makes it difficult to investigate the minimal width. In this regard, we examined the structure of the RNN to apply the methodology and results from the study of MLPs to deep narrow RNNs.

3 Terminologies and Notations

This section introduces the definition of the network architecture and the notation used throughout this paper. d_{x} and d_{y} denote the dimensions of the input and output spaces, respectively. \sigma is an activation function unless otherwise stated. Occasionally, v indicates a vector of suitable dimension.

First, we use square brackets, subscripts, and colon symbols to index a sequence of vectors. More precisely, for a given sequence of d_{x}-dimensional vectors x:\mathbb{N}\rightarrow\mathbb{R}^{d_{x}}, x[t]_{j} or x_{j}[t] denotes the j-th component of the t-th vector. The colon symbol : denotes a contiguous range of indices, such as x[a:b]=\left(x[i]\right)_{a\leq i\leq b} or x[t]_{a:b}=\left(x[t]_{a},x[t]_{a+1},\ldots,x[t]_{b}\right)^{T}\in\mathbb{R}^{b-a+1}. We call the sequential index t the time and each x[t] a token.

Second, we define token-wise linear maps \mathcal{P}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{s}\times N} and \mathcal{Q}:\mathbb{R}^{d_{s}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} to connect the input, hidden state, and output spaces. As the dimension of the hidden state space \mathbb{R}^{d_{s}} on which the RNN cells act differs from those of the input domain \mathbb{R}^{d_{x}} and output domain \mathbb{R}^{d_{y}}, we need maps adjusting the dimensions of the spaces. For a given matrix P\in\mathbb{R}^{d_{s}\times d_{x}}, a lifting map \mathcal{P}(x)[t]\coloneqq Px[t] lifts each input vector to the hidden state space. Similarly, for a given matrix Q\in\mathbb{R}^{d_{y}\times d_{s}}, a projection map \mathcal{Q}(s)[t]\coloneqq Qs[t] projects each hidden state onto the output space. As a token-wise map is determined by its action on a single token, we sometimes write token-wise maps without the time length, such as \mathcal{P}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{s}} instead of \mathcal{P}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{s}\times N}.

Subsequently, an RNN is constructed as a composition of basic recurrent cells between the lifting and projection maps. We consider four basic cells: RNN, LSTM, GRU, and BRNN; a minimal code sketch of these cells and of the composed network is given after the list below.

  • RNN Cell A recurrent cell, recurrent layer, or RNN cell \mathcal{R} maps an input sequence x=\left(x[1],x[2],\ldots\right)=\left(x[t]\right)_{t\in\mathbb{N}}\in\mathbb{R}^{d_{s}\times\mathbb{N}} to an output sequence y=\left(y[t]\right)_{t\in\mathbb{N}}\in\mathbb{R}^{d_{s}\times\mathbb{N}} using

    y[t+1]=\mathcal{R}(x)[t+1]=\sigma\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right), (1)

    where \sigma is an activation function, A,B\in\mathbb{R}^{d_{s}\times d_{s}} are the weight matrices, and \theta\in\mathbb{R}^{d_{s}} is the bias vector. The initial state \mathcal{R}(x)[0] can be an arbitrary constant vector, which is the zero vector \mathbf{0} in this setting.

  • LSTM cell An LSTM cell is an RNN cell variant with an internal cell state and gate structures. In the original version proposed by Hochreiter and Schmidhuber (1997), an LSTM cell includes only two gates: an input gate and an output gate. However, almost all networks used in modern deep learning are based on LSTM cells with an additional gate, called the forget gate, introduced by Gers et al. (2000). Hence, we present the LSTM cell with three gates instead of the original one.

    Mathematically, an LSTM cell \mathcal{R}_{LSTM} is a process that computes an output h\in\mathbb{R}^{d_{s}} and a cell state c\in\mathbb{R}^{d_{s}}, defined by the following relation:

    i[t+1] = \sigma_{\operatorname{sig}}\left(W_{i}x[t+1]+U_{i}h[t]+b_{i}\right) (2)
    f[t+1] = \sigma_{\operatorname{sig}}\left(W_{f}x[t+1]+U_{f}h[t]+b_{f}\right)
    g[t+1] = \tanh\left(W_{g}x[t+1]+U_{g}h[t]+b_{g}\right)
    o[t+1] = \sigma_{\operatorname{sig}}\left(W_{o}x[t+1]+U_{o}h[t]+b_{o}\right)
    c[t+1] = f[t+1]\odot c[t]+i[t+1]\odot g[t+1]
    h[t+1] = o[t+1]\odot\tanh\left(c[t+1]\right)

    where W_{*}\in\mathbb{R}^{d_{s}\times d_{s}} and U_{*}\in\mathbb{R}^{d_{s}\times d_{s}} are weight matrices; b_{*}\in\mathbb{R}^{d_{s}} is the bias vector for each *=i,f,g,o; and \sigma_{\operatorname{sig}} is the sigmoid activation function. \odot denotes component-wise multiplication, and we set the initial state to zero in this study.

    Although some variants of the LSTM have additional connections, such as peephole connections, our result for the LSTM extends naturally to them.

  • GRU cell Another important variant of an RNN is the GRU, proposed by Cho et al. (2014). Mathematically, a GRU cell \mathcal{R}_{GRU} is a process that computes h\in\mathbb{R}^{d_{s}} defined by

    r[t+1] = \sigma_{\operatorname{sig}}\left(W_{r}x[t+1]+U_{r}h[t]+b_{r}\right), (3)
    \tilde{h}[t+1] = \tanh\left(W_{h}x[t+1]+U_{h}\left(r[t+1]\odot h[t]\right)+b_{h}\right),
    z[t+1] = \sigma_{\operatorname{sig}}\left(W_{z}x[t+1]+U_{z}h[t]+b_{z}\right),
    h[t+1] = \left(1-z[t+1]\right)\odot h[t]+z[t+1]\odot\tilde{h}[t+1],

    where W_{*}\in\mathbb{R}^{d_{s}\times d_{s}} and U_{*}\in\mathbb{R}^{d_{s}\times d_{s}} are weight matrices, b_{*}\in\mathbb{R}^{d_{s}} is the bias vector for each *=r,z,h, and \sigma_{\operatorname{sig}} is the sigmoid activation function. \odot denotes component-wise multiplication, and we set the initial state to zero in this study.

  • BRNN cell A BRNN cell \mathcal{B}\mathcal{R} consists of a pair of RNN cells and a token-wise linear map that follows the cells (Schuster and Paliwal, 1997). One RNN cell \mathcal{R} in the BRNN cell \mathcal{B}\mathcal{R} receives the input from x[1] to x[N], and the other \bar{\mathcal{R}} receives the input from x[N] to x[1] in reverse order. Then, the linear map \mathcal{L} in \mathcal{B}\mathcal{R} combines the two outputs from the RNN cells. Specifically, a BRNN cell \mathcal{B}\mathcal{R} is defined as follows:

    \mathcal{R}(x)[t+1] \coloneqq \sigma\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right), (4)
    \bar{\mathcal{R}}(x)[t-1] \coloneqq \sigma\left(\bar{A}\bar{\mathcal{R}}(x)[t]+\bar{B}x[t-1]+\bar{\theta}\right),
    \mathcal{B}\mathcal{R}(x)[t] \coloneqq \mathcal{L}\left(\mathcal{R}(x)[t],\bar{\mathcal{R}}(x)[t]\right)
                 \coloneqq W\mathcal{R}(x)[t]+\bar{W}\bar{\mathcal{R}}(x)[t],

    where A, B, \bar{A}, \bar{B}, W, and \bar{W} are weight matrices, and \theta and \bar{\theta} are bias vectors.

  • Network architecture An RNN \mathcal{N} comprises a lifting map \mathcal{P}, a projection map \mathcal{Q}, and L recurrent cells \mathcal{R}_{1},\ldots,\mathcal{R}_{L};

    \mathcal{N}\coloneqq\mathcal{Q}\circ\mathcal{R}_{L}\circ\cdots\circ\mathcal{R}_{1}\circ\mathcal{P}. (5)

    We denote the network as a stack RNN or deep RNN when L\geq 2, and each output of the cell \mathcal{R}_{i} as the i-th hidden state. d_{s} indicates the width of the network. As a special case, a recurrent cell \mathcal{R} in (1) with A=0 is a composition of a token-wise affine map and the activation; hence, a token-wise MLP is a specific example of an RNN. If LSTM, GRU, or BRNN cells replace the recurrent cells, the network is called an LSTM, a GRU, or a BRNN.
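To make the cell definitions concrete, the following is a minimal numerical sketch of the four cells and the composed network in (1)-(5), assuming numpy; the function names, dimensions, and random weights are illustrative placeholders, not the constructions used in our proofs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_cell(x, A, B, theta, act=np.tanh):
    """Eq. (1): s[t+1] = act(A s[t] + B x[t+1] + theta), with s[0] = 0."""
    d, N = B.shape[0], x.shape[1]
    s, out = np.zeros(d), np.zeros((d, N))
    for t in range(N):
        s = act(A @ s + B @ x[:, t] + theta)
        out[:, t] = s
    return out

def lstm_cell(x, W, U, b):
    """Eq. (2): W, U, b are dicts keyed by the gates 'i', 'f', 'g', 'o'."""
    d, N = W["i"].shape[0], x.shape[1]
    h, c, out = np.zeros(d), np.zeros(d), np.zeros((d, N))
    for t in range(N):
        i = sigmoid(W["i"] @ x[:, t] + U["i"] @ h + b["i"])
        f = sigmoid(W["f"] @ x[:, t] + U["f"] @ h + b["f"])
        g = np.tanh(W["g"] @ x[:, t] + U["g"] @ h + b["g"])
        o = sigmoid(W["o"] @ x[:, t] + U["o"] @ h + b["o"])
        c = f * c + i * g
        h = o * np.tanh(c)
        out[:, t] = h
    return out

def gru_cell(x, W, U, b):
    """Eq. (3): W, U, b are dicts keyed by 'r', 'z', 'h'."""
    d, N = W["r"].shape[0], x.shape[1]
    h, out = np.zeros(d), np.zeros((d, N))
    for t in range(N):
        r = sigmoid(W["r"] @ x[:, t] + U["r"] @ h + b["r"])
        z = sigmoid(W["z"] @ x[:, t] + U["z"] @ h + b["z"])
        h_tilde = np.tanh(W["h"] @ x[:, t] + U["h"] @ (r * h) + b["h"])
        h = (1 - z) * h + z * h_tilde
        out[:, t] = h
    return out

def brnn_cell(x, fwd, bwd, W, W_bar):
    """Eq. (4): a forward and a backward RNN cell combined token-wise."""
    forward = rnn_cell(x, *fwd)                     # reads x[1], ..., x[N]
    backward = rnn_cell(x[:, ::-1], *bwd)[:, ::-1]  # reads x[N], ..., x[1]
    return W @ forward + W_bar @ backward

def deep_rnn(x, P, cells, Q):
    """Eq. (5): N = Q o R_L o ... o R_1 o P with token-wise linear P and Q."""
    s = P @ x                                       # lifting, shape (d_s, N)
    for (A, B, theta) in cells:
        s = rnn_cell(s, A, B, theta)
    return Q @ s                                    # projection, shape (d_y, N)

# Example usage with arbitrary sizes and random weights.
rng = np.random.default_rng(0)
d_x, d_s, d_y, N, L = 3, 7, 2, 5, 4
x = rng.standard_normal((d_x, N))
P = rng.standard_normal((d_s, d_x))
Q = rng.standard_normal((d_y, d_s))
cells = [(0.1 * rng.standard_normal((d_s, d_s)),
          0.1 * rng.standard_normal((d_s, d_s)),
          0.1 * rng.standard_normal(d_s)) for _ in range(L)]
print(deep_rnn(x, P, cells, Q).shape)               # (d_y, N): one output token per input token
```

The sigmoid and \tanh inside the LSTM and GRU cells are fixed by their definitions, whereas the activation of rnn_cell plays the role of the network activation \sigma.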

In addition to the type of cell, the activation function \sigma affects universality. We focus on the case of ReLU or \tanh while also considering a general activation function satisfying the condition proposed by Kidger and Lyons (2020): \sigma is a continuous non-polynomial function that is continuously differentiable at some point z_{0} with \sigma^{\prime}(z_{0})\neq 0. We refer to this condition as the non-degenerate condition and to z_{0} as a non-degenerating point.

Finally, the target class must be set as a subset of the sequence-to-sequence function space, from \mathbb{R}^{d_{x}} to \mathbb{R}^{d_{y}}. Given an RNN \mathcal{N}, each token y[t] of the output sequence y=\mathcal{N}(x) depends only on x[1\mathbin{:}t]\coloneqq\left(x[1],x[2],\ldots,x[t]\right) for the input sequence x. We define this property as past dependency and a function with this property as a past-dependent function. More precisely, if all the output tokens of a sequence-to-sequence function are given by f[t]\left(x[1\mathbin{:}t]\right) for functions f[t]:\mathbb{R}^{d_{x}\times t}\rightarrow\mathbb{R}^{d_{y}}, we say that the function is past-dependent. Meanwhile, we must fix a finite length or terminal time N<\infty of the input and output sequences. Without additional assumptions such as in Hanson and Raginsky (2020), errors generally accumulate over time, making it impossible to approximate implicit dynamics up to infinite time regardless of past dependency. Therefore, we set the target function class as a class of past-dependent sequence-to-sequence functions with sequence length N.

Remark 1

On a compact domain and under a bounded length, the continuity of f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} implies that of each f[t]:\mathbb{R}^{d_{x}\times t}\rightarrow\mathbb{R}^{d_{y}} and vice versa. In the case of the L^{p} norm with 1\leq p<\infty, f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} is L^{p} integrable if and only if f[t] is L^{p} integrable for each t. In both cases, a sequence of functions \left(f_{n}\right)_{n\in\mathbb{N}} converges to g if and only if \left(f_{n}[t]\right)_{n\in\mathbb{N}} converges to g[t] for each t. Thus, we focus on approximating f[t] for each t under the given conditions.

Remark 2

There are typical tasks where the length of the output sequence differs from that of the input sequence. In translation tasks, for instance, inputs and outputs are sentences represented as sequences of word tokens in different languages, and the lengths of the two sequences generally differ. Nevertheless, even in such a case, a sequence x=\left(x[1],x[2],\ldots,x[N]\right)\in\mathbb{R}^{d_{x}\times N} can be naturally embedded into \mathbb{R}^{d_{x}\times M} for an integer M>N as an extended sequence \left(x[1],x[2],\ldots,x[N],\mathbf{0},\ldots,\mathbf{0}\right) or \left(\mathbf{0},\ldots,\mathbf{0},x[1],x[2],\ldots,x[N]\right). By this natural embedding, a map between sequences f:\mathbb{R}^{d_{x}\times N_{1}}\rightarrow\mathbb{R}^{d_{y}\times N_{2}} is a special case of a map f:\mathbb{R}^{d_{x}\times M}\rightarrow\mathbb{R}^{d_{y}\times M} for an integer M\geq N_{1},N_{2}.

Sometimes, only the last value \mathcal{N}(x)[N] is required, in which case an RNN \mathcal{N} is regarded as a sequence-to-vector function \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}}. We freely use the terminology RNN for both sequence-to-sequence and sequence-to-vector functions because there is no confusion when the output domain is evident.

We have described all the concepts necessary to set up the problem, but we end this section with an introduction to the concepts used in the proof of the main theorem. For the convenience of the proof, we slightly modify the activation \sigma to act only on some components instead of all components. With an activation \sigma and an index set I\subseteq\mathbb{N}, the modified activation \sigma_{I} is defined as

\sigma_{I}(s)_{i}=\begin{cases}\sigma(s_{i})&\text{if }i\in I\\ s_{i}&\text{otherwise}\end{cases} (6)

Using the modified activation function \sigma_{I}, the basic cells of the network in (1) are modified accordingly. For example, a modified recurrent cell can be defined as

\mathcal{R}(x)[t+1]_{i} = \sigma_{I}\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right)_{i} (7)
 = \begin{cases}\sigma\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right)_{i}&\text{if }i\in I\\ \left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right)_{i}&\text{otherwise.}\end{cases}

Similarly, a modified RNN, LSTM, GRU, or BRNN is defined by using modified cells in (5). This concept is similar to the enhanced neuron of Kidger and Lyons (2020) in that the activation can be selectively applied, but differs in that the activation can be applied to only part of the components.
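A minimal sketch of the modified activation (6) and the modified recurrent cell (7), assuming numpy; the names sigma_I and modified_rnn_cell are placeholders.

```python
import numpy as np

def sigma_I(s, I, sigma=np.tanh):
    """Eq. (6): apply sigma only to the components whose indices lie in I."""
    out = s.copy()
    idx = list(I)
    out[idx] = sigma(s[idx])
    return out

def modified_rnn_cell(x, A, B, theta, I, sigma=np.tanh):
    """Eq. (7): the recurrence of (1) with sigma replaced by sigma_I."""
    d, N = B.shape[0], x.shape[1]
    s, out = np.zeros(d), np.zeros((d, N))
    for t in range(N):
        s = sigma_I(A @ s + B @ x[:, t] + theta, I, sigma)
        out[:, t] = s
    return out
```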

As activation leads to the non-linearity of a network, modifying the activation can affect the minimum width of the network. Fortunately, the following lemma shows that the minimum width increases by at most one owing to the modification. We briefly introduce the ideas here, with a detailed proof provided in the appendix.

Lemma 3

Let \bar{\mathcal{R}}:\mathbb{R}^{d\times N}\rightarrow\mathbb{R}^{d\times N} be a modified RNN cell, and let \bar{\mathcal{Q}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} and \bar{\mathcal{P}}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be a token-wise linear projection and lifting map. Suppose that the activation function \sigma of \bar{\mathcal{R}} is non-degenerate with a non-degenerating point z_{0}. Then, for any compact subset K\subset\mathbb{R}^{d} and \epsilon>0, there exist RNN cells \mathcal{R}_{1},\mathcal{R}_{2}:\mathbb{R}^{(d+\beta(\sigma))\times N}\rightarrow\mathbb{R}^{(d+\beta(\sigma))\times N} and token-wise linear maps \mathcal{P}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d+\beta(\sigma)}, \mathcal{Q}:\mathbb{R}^{d+\beta(\sigma)}\rightarrow\mathbb{R}^{d} such that

\sup_{x\in K^{N}}\left\|\bar{\mathcal{Q}}\circ\bar{\mathcal{R}}\circ\bar{\mathcal{P}}(x)-\mathcal{Q}\circ\mathcal{R}_{2}\circ\mathcal{R}_{1}\circ\mathcal{P}(x)\right\|<\epsilon, (8)

where \beta(\sigma)=\begin{cases}0&\text{if }\sigma\left(z_{0}\right)=0\\ 1&\text{otherwise.}\end{cases}

Proof [Sketch of proof] The detailed proof is available in Appendix B. We use the Taylor expansion of \sigma at z_{0} to recover the value before activation. For the i-th component with i\not\in I, choose a small \delta>0 and linearly approximate \sigma\left(z_{0}+\delta z\right) as \sigma(z_{0})+\delta\sigma^{\prime}(z_{0})z. An affine transform after the Taylor expansion recovers z.
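A numerical check of this Taylor-expansion trick, assuming numpy; the logistic sigmoid and the point z_{0}=0 are chosen only for illustration.

```python
import numpy as np

# sigma(z0 + delta*z) ~ sigma(z0) + delta*sigma'(z0)*z for small delta, so the
# affine map u -> (u - sigma(z0)) / (delta * sigma'(z0)) recovers z approximately.
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid
z0, s0, ds0 = 0.0, 0.5, 0.25                 # sigma(0) = 1/2, sigma'(0) = 1/4

z = np.linspace(-1.0, 1.0, 9)
for delta in (1e-1, 1e-2, 1e-3):
    recovered = (sigma(z0 + delta * z) - s0) / (delta * ds0)
    print(delta, np.max(np.abs(recovered - z)))  # recovery error shrinks with delta
```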

Remark 4

Since the additional width is only used to move the input of each RNN cell to near z_{0}, it seems at first glance to be easily removable using a bias term, provided the projection map is affine instead of linear. However, the bias term \theta in (1) is fixed across all time steps, while we need different translations: one for the first token and another for the tokens from the second onward.

For example, for an RNN cell \mathcal{R}^{\delta}:\mathbb{R}^{1\times\mathbb{N}}\rightarrow\mathbb{R}^{1\times\mathbb{N}} defined by

\mathcal{R}^{\delta}(x)[t+1]=\sigma\left(a\mathcal{R}^{\delta}(x)[t]+\delta x[t+1]+\theta\right), (9)

we need \theta=z_{0} to use the Taylor expansion at z_{0} for the first token \mathcal{R}^{\delta}(x)[1]=\sigma\left(\delta x[1]+\theta\right). When \theta=z_{0}, we cannot represent the second token

\mathcal{R}^{\delta}(x)[2] = \sigma\left(a\mathcal{R}^{\delta}(x)[1]+\delta x[2]+z_{0}\right) (10)
 = \sigma\left(a\left(\sigma\left(z_{0}\right)+\delta\sigma^{\prime}\left(z_{0}\right)x[1]+o\left(\delta\right)\right)+\delta x[2]+z_{0}\right), (11)

as a Taylor expansion at z_{0} unless a\sigma\left(z_{0}\right)=0.

In other words, the need for the additional width originates from the fact that the previous state \mathcal{R}(x)[t-1] appears only from the second token onward. Nonetheless, we can make \beta\equiv 0 by setting a proper initial state for each RNN cell and using an affine projection map.

The lemma implies that a modified RNN can be approximated by an RNN with at most one additional width. For a given modified RNN \bar{\mathcal{Q}}\circ\bar{\mathcal{R}}_{L}\circ\cdots\circ\bar{\mathcal{R}}_{1}\circ\bar{\mathcal{P}} of width d and \epsilon>0, we can find RNN cells \mathcal{R}_{1},\ldots,\mathcal{R}_{2L} and linear maps \mathcal{P}_{1},\ldots,\mathcal{P}_{L}, \mathcal{Q}_{1},\ldots,\mathcal{Q}_{L} such that

\sup_{x\in K^{N}}\left\|\bar{\mathcal{Q}}\circ\bar{\mathcal{R}}_{L}\circ\cdots\circ\bar{\mathcal{R}}_{1}\circ\bar{\mathcal{P}}(x)-\left(\mathcal{Q}_{L}\mathcal{R}_{2L}\mathcal{R}_{2L-1}\mathcal{P}_{L}\right)\circ\cdots\circ\left(\mathcal{Q}_{1}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}_{1}\right)(x)\right\|<\epsilon. (12)

The composition \mathcal{R}\circ\mathcal{P} of an RNN cell \mathcal{R} and a token-wise linear map \mathcal{P} can be substituted by another RNN cell \mathcal{R}^{\prime}. More concretely, for \mathcal{R} and \mathcal{P} defined by

\mathcal{R}(x)[t+1] = \sigma\left(A\mathcal{R}(x)[t]+Bx[t+1]+\theta\right), (13)
\mathcal{P}(x)[t] = P\left(x[t]\right), (14)

\mathcal{R}\circ\mathcal{P} defines an RNN cell \mathcal{R}^{\prime} by

\mathcal{R}^{\prime}(x)[t+1] = \sigma\left(A\mathcal{R}^{\prime}(x)[t]+BPx[t+1]+\theta\right). (15)

Thus, \mathcal{R}_{2l+1}\circ\left(\mathcal{P}_{l+1}\mathcal{Q}_{l}\right) becomes a recurrent cell, and \left(\mathcal{Q}_{L}\mathcal{R}_{2L}\mathcal{R}_{2L-1}\mathcal{P}_{L}\right)\circ\cdots\circ\left(\mathcal{Q}_{1}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}_{1}\right)(x) defines a network of the form (5).

4 Universal Approximation for Deep RNN in Continuous Function Space

This section introduces the universal approximation theorem of deep RNNs in continuous function space.

Theorem 5 (Universal approximation theorem of deep RNN 1)

Let f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a continuous past-dependent sequence-to-sequence function and \sigma be a non-degenerate activation function. Then, for any \epsilon>0 and compact subset K\subset\mathbb{R}^{d_{x}}, there exists a deep RNN \mathcal{N} of width d_{x}+d_{y}+3+\alpha(\sigma) such that

\sup_{x\in K^{N}}\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|<\epsilon, (16)

where \alpha\left(\sigma\right)=\begin{cases}0&\text{if $\sigma$ is ReLU or a non-degenerating function with $\sigma\left(z_{0}\right)=0$,}\\ 1&\text{if $\sigma$ is a non-degenerating function with $\sigma\left(z_{0}\right)\neq 0$.}\end{cases}

To prove the above theorem, we deal with the case of the sequence-to-vector function \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}} first, in subsection 4.1. Then, we extend our idea to a sequence-to-sequence function using bias terms to separate the input vectors at different times in subsection 4.2, and the proof of Theorem 5 is presented at the end of this section. As the number of additional components required in each step in the subsections depends on the activation function, we use \alpha(\sigma) to state the theorem briefly.

4.1 Approximation of sequence-to-vector function

The motivation for approximating a sequence-to-vector function f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}} is to copy a two-layered MLP via a modified RNN. Note that the output of a two-layered MLP is a linear sum of the outputs of its hidden nodes, and each output of the hidden nodes is a linear sum of the inputs followed by the activation function. In Lemma 6, we construct a modified RNN that simulates a hidden node of an MLP, which can be represented as the activation of a sum of inner products between given matrices and the N input vectors in \mathbb{R}^{d_{x}}. After that, we use an additional buffer component in Lemma 7 to copy another hidden node in the two-layered MLP, and the following modified RNN cell accumulates the two results from the nodes. The buffer component of the modified RNN cell is then reset to zero to copy another hidden node. Repeating the procedure, the modified RNN with bounded width copies the two-layered MLP.

Now, we present the statements and sketches of the proof corresponding to each step. The following lemma implies that a modified RNN can compute the linear sum of all the input components, which copies the hidden node of a two-layered MLP.

Lemma 6

Suppose A[1],A[2],\cdots,A[N]\in\mathbb{R}^{1\times d_{x}} are the given matrices. Then there exists a modified RNN \mathcal{N}=\mathcal{R}_{N}\circ\mathcal{R}_{N-1}\circ\cdots\circ\mathcal{R}_{1}\circ\mathcal{P}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} of width d_{x}+1 such that (the symbol \ast indicates that there exists some value irrelevant to the proof)

\mathcal{N}(x)[t] = \begin{bmatrix}x[t]\\ \ast\end{bmatrix} \text{ for }t<N, (17)
\mathcal{N}(x)[N] = \begin{bmatrix}x[N]\\ \sigma\left(\sum_{t=1}^{N}A[t]x[t]\right)\end{bmatrix}.

Proof  [Sketch of the proof] The detailed proof is available in Appendix C. Define the m-th modified RNN cell \mathcal{R}_{m}, of the form of (1) without activation, with A_{m}=\begin{bmatrix}O_{d_{x}\times d_{x}}&O_{d_{x}\times 1}\\ O_{1\times d_{x}}&1\end{bmatrix}, B_{m}=\begin{bmatrix}I_{d_{x}}&O_{d_{x}\times 1}\\ b_{m}&1\end{bmatrix}, where b_{m}\in\mathbb{R}^{1\times d_{x}}. Then, the (d_{x}+1)th component y[N]_{d_{x}+1} of the final output y[N] after N layers becomes a linear combination \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i,j}b_{i}x[j] of the b_{i}x[j] with constant coefficients \alpha_{i,j}. Thus the coefficient of x[j] is represented by \sum_{i=1}^{N}\alpha_{i,j}b_{i}, which we wish to be A[j] for each j=1,2,\ldots,N. In matrix formulation, we intend to find b satisfying \Lambda^{T}b=A, where \Lambda=\left\{\alpha_{i,j}\right\}_{1\leq i,j\leq N}\in\mathbb{R}^{N\times N}, b=\begin{bmatrix}b_{1}\\ \vdots\\ b_{N}\end{bmatrix}\in\mathbb{R}^{N\times d_{x}}, and A=\begin{bmatrix}A[1]\\ \vdots\\ A[N]\end{bmatrix}. As \Lambda is invertible, there exist b_{i} that solve \left(\Lambda^{T}b\right)_{j}=A[j].
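The recursion behind this construction can be checked numerically. A sketch assuming numpy and d_{x}=1: with the stated A_{m} and B_{m}, layer m updates the (d_{x}+1)th component by a_{m}[t]=a_{m}[t-1]+b_{m}x[t]+a_{m-1}[t]; the coefficients \alpha_{i,j} are recovered by probing with basis vectors, and solving \Lambda^{T}b=A makes the final accumulator equal \sum_{t}A[t]x[t]. The function names are placeholders.

```python
import numpy as np

N = 4
rng = np.random.default_rng(0)

def accumulator(x, b):
    """a_N[N] for the activation-free layers of Lemma 6 with d_x = 1."""
    a_prev = np.zeros(N + 1)                  # a_0[t] = 0 for all t
    for m in range(N):                        # layers m = 1, ..., N
        a = np.zeros(N + 1)
        for t in range(1, N + 1):
            a[t] = a[t - 1] + b[m] * x[t - 1] + a_prev[t]
        a_prev = a
    return a_prev[N]

# alpha_{i,j} = coefficient of b_i x[j]: probe with basis vectors e_i, e_j.
Lam = np.array([[accumulator(np.eye(N)[j], np.eye(N)[i]) for j in range(N)]
                for i in range(N)])

A = rng.standard_normal(N)                    # target coefficients A[1], ..., A[N]
b = np.linalg.solve(Lam.T, A)                 # Lambda^T b = A (Lambda is invertible)

x = rng.standard_normal(N)
print(np.allclose(accumulator(x, b), A @ x))  # True: a_N[N] = sum_t A[t] x[t]
```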

After copying a hidden node using the above lemma, we add a component, the (d_{x}+2)th, to copy another hidden node. The results are then accumulated in the (d_{x}+1)th component, and the final component is reset to copy another node. As the process is repeated, a modified RNN replicates the output node of a two-layered MLP.

Lemma 7

Suppose w_{i}\in\mathbb{R} and A_{i}[t]\in\mathbb{R}^{1\times d_{x}} are given for t=1,2,\ldots,N and i=1,2,\ldots,M. Then, there exists a modified RNN \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R} of width d_{x}+2 such that

\mathcal{N}(x)=\sum_{i=1}^{M}w_{i}\sigma\left(\sum_{t=1}^{N}A_{i}[t]x[t]\right). (18)

Proof  First construct a modified RNN \mathcal{N}_{1}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+2)\times N} of width d_{x}+2 such that

\mathcal{N}_{1}(x)[t] = \begin{bmatrix}x[t]\\ \ast\\ 0\end{bmatrix} \text{ for }t<N, (19)
\mathcal{N}_{1}(x)[N] = \begin{bmatrix}x[N]\\ \sigma\left(\sum_{t=1}^{N}A_{1}[t]x[t]\right)\\ 0\end{bmatrix}, (20)

as in Lemma 6. Note that the final component does not affect the first linear summation and remains zero. Next, using the components except for the (d_{x}+1)th one, construct \mathcal{N}_{2}:\mathbb{R}^{(d_{x}+2)\times N}\rightarrow\mathbb{R}^{(d_{x}+2)\times N}, which satisfies

\mathcal{N}_{2}\mathcal{N}_{1}(x)[t] = \begin{bmatrix}x[t]\\ \ast\\ \ast\end{bmatrix} \text{ for }t<N, (21)
\mathcal{N}_{2}\mathcal{N}_{1}(x)[N] = \begin{bmatrix}x[N]\\ \sigma\left(\sum_{t=1}^{N}A_{1}[t]x[t]\right)\\ \sigma\left(\sum_{t=1}^{N}A_{2}[t]x[t]\right)\end{bmatrix}, (22)

and use one modified RNN cell \mathcal{R} after \mathcal{N}_{2} to add the results and reset the last component:

\mathcal{R}\mathcal{N}_{2}\mathcal{N}_{1}(x)[t] = \begin{bmatrix}x[t]\\ \ast\\ 0\end{bmatrix}, (23)
\mathcal{R}\mathcal{N}_{2}\mathcal{N}_{1}(x)[N] = \begin{bmatrix}x[N]\\ w_{1}\sigma\left(\sum A_{1}[t]x[t]\right)+w_{2}\sigma\left(\sum A_{2}[t]x[t]\right)\\ 0\end{bmatrix}. (24)

As the (d_{x}+2)th component is reset to zero, we use it to compute the third sum w_{3}\sigma\left(\sum A_{3}[t]x[t]\right) and repeat until we obtain the final network \mathcal{N} such that

\mathcal{N}(x)[N]=\begin{bmatrix}x[N]\\ \sum_{i=1}^{M}w_{i}\sigma\left(\sum_{t=1}^{N}A_{i}[t]x[t]\right)\\ 0\end{bmatrix}. (25)

 

Remark 8

The above lemma implies that a modified RNN of width d_{x}+2 can copy the output node of a two-layered MLP. We can extend this result to an arbitrary d_{y}-dimensional case. Note that the first d_{x} components remain fixed, the (d_{x}+1)th component accumulates a part of the linear sum approximating the target function, and the (d_{x}+2)th component computes another part and is reset. When we need to copy another output node for another component of the output of the target function f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N}, only one additional width is sufficient. Indeed, in that case the (d_{x}+2)th component accumulates the sum for the next output component, and the (d_{x}+3)th component acts as the buffer to be reset. By repeating this process, we obtain a (d_{x}+d_{y}+1)-dimensional output from the modified RNN, which contains all d_{y} outputs of the MLP in the components from the (d_{x}+1)th to the (d_{x}+d_{y})th.

Theorem 9 (Universal approximation theorem of deep RNN 2)

Suppose a target f:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}} is a continuous sequence-to-vector function, K\subset\mathbb{R}^{d_{x}} is a compact subset, \sigma is a non-degenerating activation function, and z_{0} is the non-degenerating point. Then, for any \epsilon>0, there exists a deep RNN \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}} of width d_{x}+d_{y}+1+\beta(\sigma) such that

\sup_{x\in K^{N}}\left\|f(x)-\mathcal{N}(x)\right\|<\epsilon, (26)

where \beta(\sigma)=\begin{cases}0&\text{if }\sigma\left(z_{0}\right)=0\\ 1&\text{otherwise.}\end{cases}

Proof  We present the proof for d_{y}=1 here; adding d_{y}-1 widths, one for each additional output component, handles the case d_{y}>1. By the universal approximation theorem for MLPs, Theorem 1 in Leshno et al. (1993), there exist w_{i} and A_{i}[t] for i=1,\ldots,M such that

\sup_{x\in K^{N}}\left\|f(x)-\sum_{i=1}^{M}w_{i}\sigma\left(\sum_{t=1}^{N}A_{i}[t]x[t]\right)\right\|<\frac{\epsilon}{2}. (27)

Note that there exists a modified RNN \bar{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R} of width d_{x}+2,

\bar{\mathcal{N}}(x)=\sum_{i=1}^{M}w_{i}\sigma\left(\sum_{t=1}^{N}A_{i}[t]x[t]\right). (28)

By Lemma 3, there exists an RNN \mathcal{N}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R} of width d_{x}+2+\beta(\sigma) such that

\sup_{x\in K^{N}}\left\|\bar{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\frac{\epsilon}{2}. (29)

Hence we have \left\|f(x)-\mathcal{N}(x)\right\|<\epsilon.

4.2 Approximation of sequence-to-sequence function

Now, we consider an RNN \mathcal{R} as a function from a sequence x to a sequence y=\mathcal{R}(x) defined by (1). Although the above results are remarkable in that the minimal width has an upper bound independent of the length of the sequence, they only approximate a part of the output sequence. Meanwhile, as the hidden states calculated in each RNN cell are closely connected across different times, fitting all the output functions, which can be independent of each other, becomes a more challenging problem. For example, the coefficient of x[t-1] in \mathcal{N}(x)[t] equals the coefficient of x[t] in \mathcal{N}(x)[t+1] if \mathcal{N} is an RNN defined as in the proof of Lemma 6. This correlation originates from the fact that x[t-1] and x[t] arrive at \mathcal{N}(x)[t] and \mathcal{N}(x)[t+1] via the same intermediate process: one time step and N layers.

We sever the correlation between the coefficients of x[t-1] and x[t] by defining the time-enhanced recurrent cell in Definition 10 and proceed as in the previous subsection up to Lemma 12.

Definition 10

A time-enhanced recurrent cell, or layer, is a process that maps a sequence x=\left(x[t]\right)_{t\in\mathbb{N}}\in\mathbb{R}^{d_{s}\times\mathbb{N}} to a sequence y=\left(y[t]\right)_{t\in\mathbb{N}}\in\mathbb{R}^{d_{s}\times\mathbb{N}} via

y[t+1]\coloneqq\mathcal{R}(x)[t+1]=\sigma\left(A[t+1]\mathcal{R}(x)[t]+B[t+1]x[t+1]+\theta[t+1]\right) (30)

where \sigma is an activation function, A[t], B[t]\in\mathbb{R}^{d_{s}\times d_{s}} are weight matrices, and \theta[t]\in\mathbb{R}^{d_{s}} is the bias given for each time step t.

As with an RNN, a time-enhanced RNN indicates a network of the form (5) with time-enhanced recurrent cells instead of RNN cells, and we denote it as a TRNN. A modified TRNN indicates a TRNN whose activation functions in some cells act on only part of the components. A time-enhanced BRNN, denoted as TBRNN, indicates a BRNN whose recurrent layers in each direction are replaced by time-enhanced layers. A modified TBRNN indicates a TBRNN whose activation function is modified to act on only part of the components. With the proof of Lemma 3 using \bar{A}[t] and \bar{B}[t] instead of \bar{A} and \bar{B}, a TRNN can approximate a modified TRNN.
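A minimal sketch of the time-enhanced recurrent cell in Definition 10, assuming numpy; the only change from (1) is that the weights and bias may depend on the time step, and the name trnn_cell is a placeholder.

```python
import numpy as np

def trnn_cell(x, A_seq, B_seq, theta_seq, act=np.tanh):
    """Eq. (30): s[t+1] = act(A[t+1] s[t] + B[t+1] x[t+1] + theta[t+1])."""
    d, N = B_seq[0].shape[0], x.shape[1]
    s, out = np.zeros(d), np.zeros((d, N))
    for t in range(N):
        s = act(A_seq[t] @ s + B_seq[t] @ x[:, t] + theta_seq[t])
        out[:, t] = s
    return out
```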

The following lemma shows that the modified TRNN successfully eliminates the correlation between outputs. See the appendix for the complete proof.

Lemma 11

Suppose A_{j}[t]\in\mathbb{R}^{1\times d_{x}} are the given matrices for 1\leq t\leq N, 1\leq j\leq t. Then there exists a modified TRNN \tilde{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} of width d_{x}+1 such that

\tilde{\mathcal{N}}(x)[t]=\begin{bmatrix}x[t]\\ \sigma\left(\sum_{j=1}^{t}A_{j}[t]x[j]\right)\end{bmatrix}, (31)

for all t=1,2,\ldots,N.

Proof  [Sketch of proof] The detailed proof is available in Appendix D. The main idea is to use b_{m}[t]\in\mathbb{R}^{1\times d_{x}} instead of b_{m}\in\mathbb{R}^{1\times d_{x}} in the proof of Lemma 6, with a modified TRNN cell \tilde{\mathcal{R}} of the form

\tilde{\mathcal{R}}(x)[t+1]\coloneqq\sigma_{I}\left(\begin{bmatrix}O_{d_{x}\times d_{x}}&\\ &1\end{bmatrix}\tilde{\mathcal{R}}(x)[t]+\begin{bmatrix}I_{d_{x}\times d_{x}}&0\\ b_{m}[t+1]&1\end{bmatrix}x[t+1]\right). (32)

The time-indexed coefficient b_{m}[t] separates the calculation along the same intermediate process mentioned at the beginning of this subsection. For instance, a process from time t_{1} to t_{1}+1 gives a different result from that from t_{2} to t_{2}+1 when b_{m}[t_{1}]\neq b_{m}[t_{2}], even in the same layer. As the coefficient matrices at each time [t] after N layers are full rank, we can find b_{m}[t] implementing the required linear combination for each time.

Recall the proof of Lemma 7. An additional width serves as a buffer to implement and accumulate the linear sums of the nodes in an MLP. Similarly, we proceed with Lemma 11 instead of Lemma 6 to conclude that there exists a modified TRNN \mathcal{N} of width d_{x}+2 such that each \mathcal{N}[t] reproduces an MLP approximating f[t].

Lemma 12

Suppose w_{i}[t]\in\mathbb{R} and A_{i,j}[t]\in\mathbb{R}^{1\times d_{x}} are the given matrices for 1\leq t\leq N, 1\leq j\leq t, 1\leq i\leq M. Then, there exists a modified TRNN \tilde{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{1\times N} of width d_{x}+2 such that

\tilde{\mathcal{N}}(x)[t]=\sum_{i=1}^{M}w_{i}[t]\sigma\left(\sum_{j=1}^{t}A_{i,j}[t]x[j]\right). (33)

Proof  We omit the detailed proof because it is almost the same as the proof of Lemma 7. The only difference is to use b_{m}[t] instead of b_{m} to construct A_{i,j}[t] and w_{i}[t], as in the proof of Lemma 11.
This implies that the modified TRNN can approximate any past-dependent sequence-to-sequence function.

Finally, we connect the TRNN and RNN. Although it is unclear whether a modified RNN can approximate an arbitrary modified TRNN, there exists a modified RNN that approximates the specific one described in Lemma 11.

Lemma 13

Let \tilde{\mathcal{N}} be a given modified TRNN that computes (31) with width d_{x}+1 and K\subset\mathbb{R}^{d_{x}} be a compact set. Then, for any \epsilon>0 there exists a modified RNN \mathcal{N} of width d_{x}+2+\gamma(\sigma) such that

\sup_{x\in K^{N}}\left\|\tilde{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\epsilon, (34)

where \gamma(\operatorname{ReLU})=0 and \gamma(\sigma)=1 for a non-degenerating activation \sigma. As a corollary, there exists a modified RNN \mathcal{N} of width d_{x}+3+\gamma\left(\sigma\right) approximating \tilde{\mathcal{N}} in Lemma 12.

Proof  [Sketch of proof] The detailed proof is available in Appendix E. Note that a function \tilde{\mathcal{N}} in (31) is constructed by modified TRNN cells of the form in (32). Since b_{m}[t] is the only time-dependent coefficient in (32), it is enough to show that an RNN cell can approximate the map X\rightarrow\begin{bmatrix}I_{d_{x}}&O_{d_{x}\times 1}\\ b_{m}[t]&1\end{bmatrix}X for X=\begin{bmatrix}x[t]\\ 0\end{bmatrix}\in\mathbb{R}^{d_{x}+1}. In particular, approximating the map x[t]\mapsto b_{m}[t]x[t] is what we need.

While an MLP can approximate the map x[t]\mapsto b_{m}[t]x[t] on a compact set K, it is hard to use a token-wise MLP, since we need different outputs for the same vector at different time steps. For instance, for v\in K and a sequence x=\left(x[1],x[2]\right)=\left(v,v\right)\in K^{2}, a token-wise MLP cannot simultaneously approximate the two outputs b_{m}[1]v and b_{m}[2]v. Still, this is not the case when the domain K_{1} of x[1] and the domain K_{2} of x[2] are disjoint compact sets, and there is an MLP that approximates v\mapsto b_{m}[1]v for v\in K_{1} and v\mapsto b_{m}[2]v for v\in K_{2}. From this point of view, we use an RNN and a token-wise MLP to embed each vector of the input sequence into disjoint compact sets and to approximate the output b_{m}[t]x[t] on each compact set, respectively.

Now we present how to construct disjoint sets and the token-wise MLP.

Without loss of generality, we can assume K\subset\left[0,\frac{1}{2}\right]^{d_{x}} and construct the first cell so that its output at time t is x[t]+t\mathbf{1}_{d_{x}}. As the N compact sets K+\mathbf{1}_{d_{x}}, K+2\mathbf{1}_{d_{x}},\ldots,K+N\mathbf{1}_{d_{x}} are disjoint, t and x[t]=y-t\mathbf{1}_{d_{x}} are uniquely determined for each y\in K+t\mathbf{1}_{d_{x}}. Thus, for any given b[t]\in\mathbb{R}^{1\times d_{x}}, there exists an MLP of width d_{x}+1+\gamma\left(\sigma\right) approximating y=x[t]+t\mathbf{1}_{d_{x}}\mapsto b[t]x[t] on \bigsqcup_{t}\left(K+t\mathbf{1}_{d_{x}}\right) as a function from \mathbb{R}^{d_{x}} to \mathbb{R} (Hanin and Sellke, 2017; Kidger and Lyons, 2020). Indeed, we need to approximate x[t]+t\mathbf{1}_{d_{x}}\mapsto\begin{bmatrix}x[t]+t\mathbf{1}_{d_{x}}\\ b[t]x[t]\end{bmatrix} as a function from \mathbb{R}^{d_{x}} to \mathbb{R}^{d_{x}+1}. Fortunately, the first d_{x} components preserve the original input data in the proofs of Proposition 2 in Hanin and Sellke (2017) and Proposition 4.2 (Register Model) in Kidger and Lyons (2020). Thus, an MLP of width d_{x}+1+\gamma\left(\sigma\right) approximates b[t]x[t] while keeping x[t]+t\mathbf{1}_{d_{x}} in the first d_{x} components, and the MLP is common across all t, as inputs at different times t_{1} and t_{2} are already embedded in the disjoint sets K+t_{1}\mathbf{1}_{d_{x}} and K+t_{2}\mathbf{1}_{d_{x}}. Note that a token-wise MLP is a special case of an RNN of the same width. Nonetheless, we need an additional width to keep the (d_{x}+1)th component approximating b[t]x[t]. With the token-wise MLP implemented by an RNN and the additional buffer width, we construct a modified RNN of width d_{x}+2+\gamma(\sigma) approximating the modified TRNN cell used in the proof of Lemma 11.
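A small sketch of this disjoint-embedding trick, assuming numpy and d_{x}=1; the explicit floor-based map below stands in for the MLP that approximates it, and all names and values are placeholders.

```python
import numpy as np

N = 3
b = np.array([2.0, -1.0, 0.5])        # time-dependent coefficients b[1], b[2], b[3]

def embed(x):                          # first cell: x[t] -> x[t] + t for t = 1, ..., N
    return x + np.arange(1, N + 1)

def token_wise_map(y):                 # defined on the disjoint union of the sets K + t
    t = np.floor(y).astype(int)        # the shift reveals the time step t
    return b[t - 1] * (y - t)          # output b[t] * x[t]

x = np.array([0.10, 0.25, 0.40])       # tokens in K = [0, 1/2]
print(token_wise_map(embed(x)))        # approximately [0.2, -0.25, 0.2] = b[t] * x[t]
```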

4.3 Proof of Theorem 5

Summarizing all the results, we have the universality of a deep RNN in a continuous function space.

Proof  [Proof of Theorem 5] As mentioned in Remark 8, we can set d_{y}=1 for notational convenience. By Lemma 12, there exists a modified TRNN \tilde{\mathcal{N}} of width d_{x}+2 such that

\sup_{x\in K^{N}}\left\|f(x)-\tilde{\mathcal{N}}(x)\right\|<\frac{\epsilon}{3}. (35)

As \tilde{\mathcal{N}} is a composition of modified TRNN cells of width d_{x}+2 satisfying (31), there exists a modified RNN \bar{\mathcal{N}} of width d_{x}+3+\gamma(\sigma) such that

\sup_{x\in K^{N}}\left\|\tilde{\mathcal{N}}(x)-\bar{\mathcal{N}}(x)\right\|<\frac{\epsilon}{3}. (36)

Then, by Lemma 3, there exists an RNN \mathcal{N} of width d_{x}+3+\gamma(\sigma)+\beta(\sigma)=d_{x}+4+\alpha(\sigma) such that

\sup_{x\in K^{N}}\left\|\bar{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\frac{\epsilon}{3}. (37)

The triangle inequality yields

\sup_{x\in K^{N}}\left\|f(x)-\mathcal{N}(x)\right\|<\epsilon. (38)

 

Remark 14

The number of additional widths, \alpha(\sigma), is determined by \beta(\sigma) and \gamma(\sigma) and depends on the condition of the activation function \sigma. Here, \gamma(\sigma) is required to find the token-wise MLP that approximates the embedding from \mathbb{R}^{d_{x}} to \mathbb{R}^{d_{x}+1}. If further studies determine a tighter upper bound of the minimum width of an MLP to have the universal property in a continuous function space, we can reduce or even remove \alpha(\sigma) according to the result.

There is still a wide gap between the lower bound d_{x} and the upper bound d_{x}+d_{y}+4+\alpha(\sigma) of the minimum width, and hence, we expect to be able to achieve universality with a narrower width. For example, if N=1, an RNN is simply an MLP, and the RNN has universality without the node required to compute the effect of t. Therefore, apart from the result on the minimum width of an MLP, further studies are required to determine whether \gamma is essential for the case N\geq 2.

5 Universal Approximation for Stack RNN in LpL^{p} Space

This section introduces the universal approximation theorem of a deep RNN in LpL^{p} function space for 1p<1\leq p<\infty.

Theorem 15 (Universal approximation theorem of deep RNN 3)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a past-dependent sequence-to-sequence function in Lp(dx×N,dy×N)L^{p}\left(\mathbb{R}^{d_{x}\times N},\mathbb{R}^{d_{y}\times N}\right) for 1p<1\leq p<\infty, and σ\sigma be a non-degenerate activation function with the non-degenerating point z0z_{0}. Then, for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep RNN 𝒩\mathcal{N} of width max{dx+1,dy}+1\max\left\{d_{x}+1,d_{y}\right\}+1 satisfying

sup1tNf(x)[t]𝒩(x)[t]Lp(K)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(K\right)}<\epsilon. (39)

Moreover, if the activation σ\sigma is ReLU, there exists a deep RNN 𝒩\mathcal{N} of width max{dx+1,dy}\max\left\{d_{x}+1,d_{y}\right\} satisfying

sup1tNf(x)[t]𝒩(x)[t]Lp(dx)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\right)}<\epsilon. (40)

Before beginning the proof of the theorem, we assume K[0,1]dxK\subseteq\left[0,1\right]^{d_{x}} without loss of generality and summarize the scheme that will be used in the proof. Park et al. (2021) constructed an MLP of width max{dx+1,dy}+γ(σ)\max\{d_{x}+1,d_{y}\}+\gamma(\sigma) approximating a given target function ff, using the “encoding scheme.” More concretely, the MLP is separated into three parts: encoder, memorizer, and decoder.

First, the encoder part quantizes each component of the input and output into a finite set. The authors use the quantization function qn:[0,1]𝒞nq_{n}:[0,1]\rightarrow\mathcal{C}_{n}

qn(v)max{c𝒞n|cv},q_{n}(v)\coloneqq\max\left\{c\in\mathcal{C}_{n}\bigm{|}c\leq v\right\}, (41)

where 𝒞n{0,2n,2×2n,,12n}\mathcal{C}_{n}\coloneqq\left\{0,2^{-n},2\times 2^{-n},\ldots,1-2^{-n}\right\}. Then, each quantized vector is encoded into a real number by concatenating its components through the encoder EncM:[0,1]dx𝒞dxM\operatorname{Enc}_{M}:[0,1]^{d_{x}}\rightarrow\mathcal{C}_{d_{x}M}

EncM(x)i=1dxqM(xi)2(i1)M.\operatorname{Enc}_{M}(x)\coloneqq\sum_{i=1}^{d_{x}}q_{M}(x_{i})2^{-(i-1)M}. (42)

For small δ1>0\delta_{1}>0, the authors construct an MLP 𝒩enc:[0,1]dx𝒞dxM\mathcal{N}_{enc}:[0,1]^{d_{x}}\rightarrow\mathcal{C}_{d_{x}M} of width dx+1+γ(σ)d_{x}+1+\gamma(\sigma) satisfying

EncM(x)𝒩enc(x)<δ1.\left\|\operatorname{Enc}_{M}(x)-\mathcal{N}_{enc}(x)\right\|<\delta_{1}. (43)

Although the quantization causes a loss of input information, the loss is confined to regions of small measure and is therefore negligible under the L^{p} norm.

After encoding the input x to \operatorname{Enc}_{M}(x) with a large M, the authors use the information of x contained in \operatorname{Enc}_{M}(x) to obtain the information of the target output f(x). More precisely, they define the memorizer \operatorname{Mem}_{M,M^{\prime}}:\mathcal{C}_{d_{x}M}\rightarrow\mathcal{C}_{d_{y}M^{\prime}}, which maps the encoded input \operatorname{Enc}_{M}(x) to the encoded output \operatorname{Enc}_{M^{\prime}}(f(x)), as

Mem(EncM(x))(EncMfqM)(x),\operatorname{Mem}\left(\operatorname{Enc}_{M}(x)\right)\coloneqq\left(\operatorname{Enc}_{M^{\prime}}\circ f\circ q_{M}\right)(x), (44)

where the quantization map q_{M} acts on x component-wise in the above equation. Then, an MLP \mathcal{N}_{mem} of width 2+\gamma(\sigma) approximates \operatorname{Mem}; that is, for any \delta_{2}>0, there exists \mathcal{N}_{mem} satisfying

supx[0,1]dxMem(Enc(x))𝒩mem(Enc(x))<δ2.\sup_{x\in[0,1]^{d_{x}}}\left\|\operatorname{Mem}(\operatorname{Enc}(x))-\mathcal{N}_{mem}(\operatorname{Enc}(x))\right\|<\delta_{2}. (45)

Finally, the decoder reconstructs the original output vector from the encoded output vector by cutting off the concatenated components. Owing to the preceding encoder and memorizer, it is enough to define only the value of the decoder on 𝒞dyM\mathcal{C}_{d_{y}M^{\prime}}. Hence the decoder Dec:𝒞dyM𝒞Mdy(𝒞M)dy\operatorname{Dec}:\mathcal{C}_{d_{y}M^{\prime}}\rightarrow\mathcal{C}_{M^{\prime}}^{d_{y}}\coloneqq\left(\mathcal{C}_{M^{\prime}}\right)^{d_{y}} is determined by

DecM(v)v^where{v^}EncM1(v)𝒞Mdy.\operatorname{Dec}_{M^{\prime}}(v)\coloneqq\hat{v}\qquad\text{where}\qquad\left\{\hat{v}\right\}\coloneqq\operatorname{Enc}_{M^{\prime}}^{-1}(v)\cap\mathcal{C}_{M^{\prime}}^{d_{y}}. (46)

Indeed, for small δ3>0\delta_{3}>0, Park et al. (2021) construct an MLP 𝒩dec:𝒞dyM𝒞Mdy\mathcal{N}_{dec}:\mathcal{C}_{d_{y}M^{\prime}}\rightarrow\mathcal{C}_{M^{\prime}}^{d_{y}} of width dy+γ(σ)d_{y}+\gamma(\sigma) so that

DecM(v)𝒩dec(v)<δ3\left\|\operatorname{Dec}_{M^{\prime}}(v)-\mathcal{N}_{dec}(v)\right\|<\delta_{3} (47)

Although (43) and (47) hold only approximately rather than exactly when the activation is merely non-degenerate, the composition \mathcal{N}=\mathcal{N}_{dec}\circ\mathcal{N}_{mem}\circ\mathcal{N}_{enc} approximates the target f for sufficiently large M,M^{\prime} and sufficiently small \delta_{1},\delta_{2},\delta_{3}.
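The following sketch implements the encoding scheme with the exact maps (41), (42), (44), and (46) rather than their MLP approximations; the toy target f, the dimensions, and the precisions M, M' are arbitrary choices, used only to illustrate that \operatorname{Dec}\circ\operatorname{Mem}\circ\operatorname{Enc} reproduces the target up to the quantization error.

import numpy as np

dx, dy, M, Mp = 2, 2, 6, 8          # arbitrary dimensions and precisions

def q(v, n):                        # quantization (41), component-wise
    return np.floor(v * 2**n) / 2**n

def enc(v, d, n):                   # encoder (42): concatenate truncated binary expansions
    vq = q(np.clip(v, 0.0, 1.0 - 2.0**-n), n)
    return float(sum(vq[i] * 2.0**(-i * n) for i in range(d)))

def dec(c, d, n):                   # decoder (46): split the concatenated expansion
    total = int(round(c * 2**(d * n)))
    out = []
    for i in range(d):
        base = 2**((d - 1 - i) * n)
        out.append(total // base)
        total %= base
    return np.array(out) / 2**n

def f(x):                           # a toy continuous target into [0, 1]^dy
    return np.array([np.sin(x[0] + x[1]), x[0] * x[1]]) * 0.5 + 0.25

def mem(c):                         # memorizer (44): acts on the encoded input alone
    return enc(f(dec(c, dx, M)), dy, Mp)

x = np.array([0.37, 0.81])
y_hat = dec(mem(enc(x, dx, M)), dy, Mp)
print(np.abs(y_hat - f(x)))         # error is at the quantization scale 2^-M, 2^-M'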

Let us return to the proof of Theorem 15. We construct the encoder, memorizer, and decoder similarly. As the encoder and decoder are independent of the time t, we implement them as token-wise MLPs, which modified RNNs can realize. On the other hand, the memorizer must act differently at each time t because there are multiple output functions. Instead of implementing a separate memorizer for each time, we separate the input and output domains at each time by translation; it is then enough to define a single memorizer on the disjoint union of the domains.

Proof  [Proof of Theorem 15] We first combine the token-wise encoder and translation for the separation of the domains. Consider the token-wise encoder EncM:dx×N1×N\operatorname{Enc}_{M}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{1\times N}, and the following recurrent cell :1×N1×N\mathcal{R}:\mathbb{R}^{1\times N}\rightarrow\mathbb{R}^{1\times N}

(v)[t+1]=2(dxM+1)(v)[t]+v[t+1]+1.\mathcal{R}(v)[t+1]=2^{-\left(d_{x}M+1\right)}\mathcal{R}(v)[t]+v[t+1]+1. (48)

Then the composition \mathcal{R}_{enc}=\mathcal{R}\circ\operatorname{Enc}_{M} defines an encoder of sequences, mapping K^{N} to \mathbb{R}^{1\times N}:

\mathcal{R}_{enc}(x)[t]=\sum_{j=1}^{t}2^{-(t-j)\left(d_{x}M+1\right)}+\sum_{j=1}^{t}\operatorname{Enc}_{M}\left(x[j]\right)2^{-\left(t-j\right)\left(d_{x}M+1\right)}, (49)

where x=(x[t])t=1,,Nx=\left(x[t]\right)_{t=1,\ldots,N} is a sequence in KK. Note that the range DD of enc\mathcal{R}_{enc} is a disjoint union of compact sets;

D=t=1N{enc(x)[t]:xKN}.D=\bigsqcup_{t=1}^{N}\left\{\mathcal{R}_{enc}(x)[t]\mathbin{:}x\in K^{N}\right\}. (50)

Hence there exists a memorizer Mem:\operatorname{Mem}:\mathbb{R}\rightarrow\mathbb{R} satisfying

Mem(enc(x)[t])=EncM(f(qM(x))[t])\operatorname{Mem}(\mathcal{R}_{enc}(x){\color[rgb]{0,0,0}[t]})=\operatorname{Enc}_{M^{\prime}}\left(f\left(q_{M}(x)\right)[t]\right) (51)

for each t=1,2,,Nt=1,2,\ldots,N. The token-wise decoder DecM\operatorname{Dec}_{M^{\prime}} is the last part of the proof.
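Before assembling the pieces, the following small exhaustive check (with arbitrary small d_{x}, M, N, so that \operatorname{Enc}_{M} reduces to the quantization itself) verifies that the recurrence (48) agrees with the closed form (49) and that the value sets at different time steps are disjoint, as asserted in (50).

import numpy as np
from itertools import product

dx, M, N = 1, 3, 3
step = 2.0 ** -(dx * M + 1)
codes = np.arange(2**M) / 2**M                    # all values Enc_M takes on [0, 1)

values = {t: set() for t in range(1, N + 1)}
for seq in product(codes, repeat=N):              # every quantized input sequence
    r = 0.0
    for t in range(1, N + 1):
        r = step * r + seq[t - 1] + 1.0           # recurrence (48)
        closed = sum((1.0 + seq[j - 1]) * step ** (t - j) for j in range(1, t + 1))
        assert abs(r - closed) < 1e-12            # closed form (49)
        values[t].add(r)

for t1 in range(1, N + 1):
    for t2 in range(t1 + 1, N + 1):
        assert values[t1].isdisjoint(values[t2])  # the union in (50) is disjoint
print("recurrence matches the closed form; time-step ranges are disjoint")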

To complete the proof, we need an approximation of the token-wise encoder EncM:dx\operatorname{Enc}_{M}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}, a modified recurrent cell :1×N1×N\mathcal{R}:\mathbb{R}^{1\times N}\rightarrow\mathbb{R}^{1\times N}, token-wise memorizer Mem:\operatorname{Mem}:\mathbb{R}\rightarrow\mathbb{R}, and token-wise decoder DecM:dy\operatorname{Dec}_{M^{\prime}}:\mathbb{R}\rightarrow\mathbb{R}^{d_{y}}. Following Park et al. (2021), there exist MLPs of width dx+1+γ(σ)d_{x}+1+\gamma(\sigma), 2+γ(σ)2+\gamma(\sigma), and dy+γ(σ)d_{y}+\gamma(\sigma) that approximate EncM\operatorname{Enc}_{M}, Mem\operatorname{Mem}, and DecM\operatorname{Dec}_{M^{\prime}} respectively. Lemma 3 shows that \mathcal{R} is approximated by an RNN of width 1+β(σ)1+\beta(\sigma). Hence, an RNN of width max{dx+1+γ(σ),1+β(σ),2+γ(σ),dy+γ(σ)}=max{dx+1,dy}+γ(σ)\max\left\{d_{x}+1+\gamma(\sigma),{\color[rgb]{0,0,0}1+\beta(\sigma)},2+\gamma(\sigma),d_{y}+\gamma(\sigma)\right\}=\max\left\{d_{x}+1,d_{y}\right\}+\gamma(\sigma) approximates the target function ff.

In the case of ReLU activation, we can extend the domain K^{N} to \mathbb{R}^{d_{x}\times N} as stated in Park et al. (2020). Nonetheless, we briefly present how to handle the subtle problem that the support of a network is generally not compact.

We project each input x[t]dxx[t]\in\mathbb{R}^{d_{x}} by PL,δ:dxdxP_{L,\delta}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{x}} defined by

P(x)={xx[L,L],Lδ(x+L+δ)x[Lδ,L],Lδ(xLδ)x[L,L+δ],0otherwise.P(x)=\begin{cases}x&x\in[-L,L],\\ -\frac{L}{\delta}\left(x+L+\delta\right)&x\in[-L-\delta,-L],\\ -\frac{L}{\delta}\left(x-L-\delta\right)&x\in[L,L+\delta],\\ 0&\text{otherwise}.\end{cases} (52)

Note that an MLP of width d_{x}+1 computes P_{L,\delta}. We choose L large enough to cover a sufficiently large domain for x\in\mathbb{R}^{d_{x}\times N}, and \delta small enough to make the projection error negligible.
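As a one-dimensional sanity check of this claim, the clamp (52) is piecewise linear and can therefore be written with four ReLU units whose coefficients are the slope changes at the breakpoints. The sketch below only illustrates why a small ReLU network computes P_{L,\delta}; it is not the width-(d_{x}+1) construction itself, and L, \delta are arbitrary test values.

import numpy as np

L, delta = 2.0, 0.25
relu = lambda z: np.maximum(z, 0.0)

def P(x):                                            # definition (52), one dimension
    return np.where(np.abs(x) <= L, x,
           np.where((x >= -L - delta) & (x < -L), -L / delta * (x + L + delta),
           np.where((x > L) & (x <= L + delta), -L / delta * (x - L - delta), 0.0)))

def P_relu(x):                                       # slope-change representation
    return (-L / delta) * relu(x + L + delta) + (1 + L / delta) * relu(x + L) \
         + (-1 - L / delta) * relu(x - L) + (L / delta) * relu(x - L - delta)

xs = np.linspace(-4, 4, 2001)
assert np.allclose(P(xs), P_relu(xs))
print("max deviation:", float(np.max(np.abs(P(xs) - P_relu(xs)))))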

First, choose a large LL so that f(x)[t]Lp(dx\[L,L]dx)<ϵ3\left\|f(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\backslash[-L,L]^{d_{x}}\right)}<\frac{\epsilon}{3} for all t=1,2,,Nt=1,2,\ldots,N. Second, construct an RNN 𝒩=𝒩dec𝒩mem𝒩enc\mathcal{N}=\mathcal{N}_{dec}\circ\mathcal{N}_{mem}\circ\mathcal{N}_{enc} as above with 𝒩(0)=0\mathcal{N}(0)=0, such that f(x)[t]𝒩(x)[t]Lp([L,L]dx)<ϵ3\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left([-L,L]^{d_{x}}\right)}<\frac{\epsilon}{3}. We impose a condition 𝒩(0)=0\mathcal{N}(0)=0 to bound error outside [L,L]dx[-L,L]^{d_{x}}, which is possible because we may set f(0)[t]=0f(0)[t]=0 for all tt under the LpL^{p} norm. Finally, set δ\delta so small that f(x)[t]Lp([Lδ,L+δ]dx\[L,L]dx)<ϵ6\left\|f(x)[t]\right\|_{L^{p}\left([-L-\delta,L+\delta]^{d_{x}}\backslash[-L,L]^{d_{x}}\right)}<\frac{\epsilon}{6} and 𝒩(x)[t]Lp([Lδ,L+δ]dx\[L,L]dx)<ϵ6\left\|\mathcal{N}(x)[t]\right\|_{L^{p}\left([-L-\delta,L+\delta]^{d_{x}}\backslash[-L,L]^{d_{x}}\right)}<\frac{\epsilon}{6}. After inserting the projection PL,δP_{L,\delta} before the encoder 𝒩enc\mathcal{N}_{enc} and defining new RNN 𝒩=𝒩PL,δ\mathcal{N}^{\prime}=\mathcal{N}\circ P_{L,\delta}, we have

f(x)[t]𝒩(x)[t]Lp(dx)\displaystyle\left\|f(x)[t]-\mathcal{N}^{\prime}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\right)} (53)
f(x)[t]𝒩(x)[t]Lp([L,L]dx)+f(x)[t]𝒩(x)[t]Lp(dx\[L,L]dx)\displaystyle\leq\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left([-L,L]^{d_{x}}\right)}+\left\|f(x)[t]-\mathcal{N}^{\prime}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\backslash[-L,L]^{d_{x}}\right)} (54)
ϵ3+f(x)[t]𝒩(x)[t]Lp([Lδ,L+δ]dx\[L,L]dx)\displaystyle\leq\frac{\epsilon}{3}+\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left([-L-\delta,L+\delta]^{d_{x}}\backslash[-L,L]^{d_{x}}\right)} (55)
+f(x)[t]𝒩(x)[t]Lp(dx\[Lδ,L+δ]dx)\displaystyle\qquad+\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\backslash[-L-\delta,L+\delta]^{d_{x}}\right)} (56)
ϵ3+ϵ3+f(x)[t]Lp(dx\[L,L]dx)\displaystyle\leq\frac{\epsilon}{3}+\frac{\epsilon}{3}+\left\|f(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\backslash[-L,L]^{d_{x}}\right)} (57)
ϵ\displaystyle\leq\epsilon (58)

 

6 Variants of RNN

This section describes the universal property of several variants of the RNN, namely the LSTM, GRU, and BRNN. The LSTM and GRU were proposed to solve the long-term dependency problem: as an RNN has difficulty computing and updating its parameters on long sequential data, the LSTM and GRU take advantage of additional structures in their cells. We prove that they have the same universal property as the original RNN. On the other hand, a BRNN was proposed to overcome the past dependency of an RNN; it consists of two RNN cells, one of which works in reverse order. We prove the universal approximation theorem of a BRNN with the target class of arbitrary sequence-to-sequence functions.

The universal property of an LSTM originates from the universality of an RNN. Mathematically, an LSTM \mathcal{R}_{LSTM} denotes a process that computes two outputs, h and c, defined by (2). As an LSTM can reproduce an RNN of the same width, we have the following corollary:

Corollary 16 (Universal approximation theorem of deep LSTM)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a continuous past-dependent sequence-to-sequence function. Then, for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep LSTM 𝒩LSTM\mathcal{N}_{LSTM}, of width dx+dy+3d_{x}+d_{y}+3, such that

supxKNsup1tNf(x)[t]𝒩LSTM(x)[t]<ϵ.\sup_{x\in K^{N}}\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}_{LSTM}(x)[t]\right\|<\epsilon. (59)

If fLp(dx×N,dy×N)f\in L^{p}\left(\mathbb{R}^{d_{x}\times N},\mathbb{R}^{d_{y}\times N}\right), there exists a deep LSTM 𝒩LSTM\mathcal{N}_{LSTM}, of width max{dx+1,dy}+1\max\left\{d_{x}+1,d_{y}\right\}+1, such that

sup1tNf(x)[t]𝒩LSTM(x)[t]Lp(KN)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}_{LSTM}(x)[t]\right\|_{L^{p}\left(K^{N}\right)}<\epsilon. (60)

Proof  We set all parameters but WgW_{g}, UgU_{g}, bgb_{g}, and bfb_{f} as zeros, and then (2) is simplified as

c[t+1]\displaystyle c[t+1] =σsig(bf)c[t]+12tanh(Ugh[t]+Wgx[t+1]+bg),\displaystyle=\sigma_{\operatorname{sig}}(b_{f})\odot c[t]+\frac{1}{2}\tanh\left(U_{g}h[t]+W_{g}x[t+1]+b_{g}\right), (61)
h[t+1]\displaystyle h[t+1] =12tanh(c[t+1]).\displaystyle=\frac{1}{2}\tanh\left(c[t+1]\right).

For any ϵ>0\epsilon>0, bfb_{f} with sufficiently large negative components yields

h[t+1]12tanh(12tanh(Ugh[t]+Wgx[t+1]+bg))<ϵ.\left\|h[t+1]-\frac{1}{2}\tanh\left(\frac{1}{2}\tanh\left(U_{g}h[t]+W_{g}x[t+1]+b_{g}\right)\right)\right\|<\epsilon. (62)

Thus, an LSTM approximates an RNN whose activation function is \left(\frac{1}{2}\tanh\right)\circ\left(\frac{1}{2}\tanh\right) without any additional width in its hidden states; in other words, an LSTM of width d approximates an RNN of width d equipped with this activation function.  
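The following numeric sketch checks (61)-(62) directly: it iterates the simplified cell (61) with arbitrary test weights and shows that, as b_{f} becomes more negative, its hidden state approaches that of an RNN with activation \left(\frac{1}{2}\tanh\right)\circ\left(\frac{1}{2}\tanh\right). It uses only the simplified equations of the proof, not a full LSTM implementation; the dimensions, inputs, and weights are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
d, dx, N = 4, 3, 20
Wg, Ug, bg = rng.normal(size=(d, dx)), rng.normal(size=(d, d)), rng.normal(size=d)
x = rng.normal(size=(N, dx))

def simplified_lstm(bf):
    h, c, hs = np.zeros(d), np.zeros(d), []
    for t in range(N):
        c = 1 / (1 + np.exp(-bf)) * c + 0.5 * np.tanh(Ug @ h + Wg @ x[t] + bg)  # (61)
        h = 0.5 * np.tanh(c)
        hs.append(h)
    return np.array(hs)

def target_rnn():
    h, hs = np.zeros(d), []
    for t in range(N):
        h = 0.5 * np.tanh(0.5 * np.tanh(Ug @ h + Wg @ x[t] + bg))               # (62)
        hs.append(h)
    return np.array(hs)

for bf in [0.0, -5.0, -20.0]:
    err = float(np.max(np.abs(simplified_lstm(bf) - target_rnn())))
    print("b_f =", bf, "  max deviation =", err)    # shrinks as b_f becomes more negative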

The universality of GRU is proved similarly.

Corollary 17 (Universal approximation theorem of deep GRU)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a continuous past-dependent sequence-to-sequence function. Then, for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep GRU 𝒩GRU\mathcal{N}_{GRU}, of width dx+dy+3d_{x}+d_{y}+3, such that

supxKNsup1tNf(x)[t]𝒩GRU(x)[t]<ϵ.\sup_{x\in K^{N}}\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}_{GRU}(x)[t]\right\|<\epsilon. (63)

If fLp(dx×N,dy×N)f\in L^{p}\left(\mathbb{R}^{d_{x}\times N},\mathbb{R}^{d_{y}\times N}\right), there exists a deep GRU 𝒩GRU\mathcal{N}_{GRU}, of width max{dx+1,dy}+1\max\left\{d_{x}+1,d_{y}\right\}+1, such that

sup1tNf(x)[t]𝒩GRU(x)[t]Lp(KN)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}_{GRU}(x)[t]\right\|_{L^{p}\left(K^{N}\right)}<\epsilon. (64)

Proof  Setting only WhW_{h}, UhU_{h}, bhb_{h}, and bzb_{z} as non-zero, the GRU is simplified as

h[t+1]=(1σsig(bz))h[t]+σsig(bz)tanh(Whx[t+1]+12Uhh[t]+bh).h[t+1]=\left(1-\sigma_{\operatorname{sig}}\left(b_{z}\right)\right)h[t]+\sigma_{\operatorname{sig}}\left(b_{z}\right)\tanh\left(W_{h}x[t+1]+\frac{1}{2}U_{h}h[t]+b_{h}\right). (65)

For any ϵ>0\epsilon>0, a sufficiently large bzb_{z} yields

h[t+1]tanh(Whx[t+1]+12Uhh[t]+bh)<ϵ.\left\|h[t+1]-\tanh\left(W_{h}x[t+1]+\frac{1}{2}U_{h}h[t]+b_{h}\right)\right\|<\epsilon. (66)

Hence, we attain the corollary.  

Remark 18

We refer to the width as the maximal dimension of the hidden states. However, this definition is somewhat unfavorable for comparison, as LSTM and GRU cells carry multiple hidden states and hence have several times more components than an RNN of the same width. Thus, we expect that they have better approximation power, or a smaller minimum width for universality, than an RNN. Nevertheless, we leave the theoretical analysis as future work, both to identify whether they differ in approximation ability and to examine why they exhibit different performances in practical applications.

Now, let us focus on the universality of a BRNN. Recall that a stack of modified recurrent cells \mathcal{N} constructs a linear combination of the previous input components x[1:t] at each time,

𝒩(x)[t]=[x[t]j=1tAj[t]x[j]].\mathcal{N}(x)[t]=\begin{bmatrix}x[t]\\ \sum_{j=1}^{t}A_{j}[t]x[j]\end{bmatrix}. (67)

Therefore, if we reverse the order of sequence and flow of the recurrent structure, a stack of reverse modified recurrent cells 𝒩¯\bar{\mathcal{N}} constructs a linear combination of the subsequent input components x[t:N]x[t:N] at each time,

𝒩¯(x)[t]=[x[t]j=tNBj[t]x[j]].\bar{\mathcal{N}}(x)[t]=\begin{bmatrix}x[t]\\ \sum_{j=t}^{N}B_{j}[t]x[j]\end{bmatrix}. (68)

From this point of view, we expect that a stacked BRNN successfully approximates an arbitrary sequence-to-sequence function beyond the past dependency. As previously mentioned, we prove it in the following lemma.

Lemma 19

Suppose Aj[t]1×dxA_{j}[t]\in\mathbb{R}^{1\times d_{x}} are the given matrices for 1tN1\leq t\leq N, 1jN1\leq j\leq N. Then there exists a modified TBRNN 𝒩~:dx×N(dx+1)×N\tilde{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} of width dx+1d_{x}+1 such that

𝒩~(x)[t]=[x[t]σ(j=1NAj[t]x[j])],\tilde{\mathcal{N}}(x)[t]=\begin{bmatrix}x[t]\\ \sigma\left(\sum_{j=1}^{N}A_{j}[t]x[j]\right)\end{bmatrix}, (69)

for all t=1,2,,Nt=1,2,\ldots,N.

Proof  [Sketch of proof] The detailed proof is available in Appendix F. We use modified TBRNN cells with either only a forward modified TRNN or a backward modified TRNN. The stacked forward modified TRNN cells compute j=1tAj[t]x[j]\sum_{j=1}^{t}A_{j}[t]x[j], and the stacked backward modified TRNN cells compute j=t+1NAj[t]x[j]\sum_{j=t+1}^{N}A_{j}[t]x[j].  

As in the previous cases, we have the following result for a TBRNN. The proof is almost the same as those of Lemmas 12 and 7.

Lemma 20

Suppose wi[t]w_{i}[t]\in\mathbb{R}, Ai,j[t]1×dxA_{i,j}[t]\in\mathbb{R}^{1\times d_{x}} are the given matrices for 1tN1\leq t\leq N, 1jN1\leq j\leq N, 1iM1\leq i\leq M. Then there exists a modified TBRNN 𝒩~:dx×N1×N\tilde{\mathcal{N}}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{1\times N} of width dx+2d_{x}+2 such that

𝒩~(x)[t]=i=1Mwi[t]σ(j=1NAi,j[t]x[j]).\tilde{\mathcal{N}}(x)[t]=\sum_{i=1}^{M}{\color[rgb]{0,0,0}w_{i}[t]}\sigma\left(\sum_{j=1}^{N}A_{i,j}[t]x[j]\right). (70)

Proof  First, construct a modified deep TBRNN 𝒩1:dx×N(dx+2)×N\mathcal{N}_{1}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+2)\times N} of width dx+2d_{x}+2 such that

𝒩1(x)[t]=[x[t]σ(j=1NA1,j[t]x[j])0],\mathcal{N}_{1}(x)[t]=\begin{bmatrix}x[t]\\ \sigma\left(\sum_{j=1}^{N}A_{1,j}[t]x[j]\right)\\ 0\end{bmatrix}, (71)

as in Lemma 19. The final component does not affect the first linear summation and remains zero. After \mathcal{N}_{1}, use the (d_{x}+2)th component to obtain a stack of cells \mathcal{N}_{2}:\mathbb{R}^{(d_{x}+2)\times N}\rightarrow\mathbb{R}^{(d_{x}+2)\times N}, which satisfies

𝒩2𝒩1(x)[t]=[x[t]σ(j=1NA1,j[t]x[j])σ(j=1NA2,j[t]x[j])],\mathcal{N}_{2}\mathcal{N}_{1}(x)[t]=\begin{bmatrix}x[t]\\ \sigma\left(\sum_{j=1}^{N}A_{1,j}[t]x[j]\right)\\ \sigma\left(\sum_{j=1}^{N}A_{2,j}[t]x[j]\right)\end{bmatrix}, (72)

and use a modified RNN cell \mathcal{R} to sum up the results and reset the last component:

𝒩2𝒩1(x)[t]=[x[t]w1[t]σ(j=1NA1,j[t]x[j])+w2[t]σ(j=1NA2,j[t]x[j])0].\mathcal{R}\mathcal{N}_{2}\mathcal{N}_{1}(x)[t]=\begin{bmatrix}x[t]\\ w_{1}[t]\sigma\left(\sum_{j=1}^{N}A_{1,j}[t]x[j]\right)+w_{2}[t]\sigma\left(\sum_{j=1}^{N}A_{2,j}[t]x[j]\right)\\ 0\end{bmatrix}. (73)

As the (dx+2)(d_{x}+2)th component resets to zero, we use it to compute the third sum w3[t]σ(A3,j[t]x[j])w_{3}[t]\sigma\left(\sum A_{3,j}[t]x[j]\right) and repeat until we obtain the final network 𝒩\mathcal{N} such that

\mathcal{N}(x)[t]=\begin{bmatrix}x[t]\\ \sum_{i=1}^{M}w_{i}[t]\sigma\left(\sum_{j=1}^{N}A_{i,j}[t]x[j]\right)\\ 0\end{bmatrix}. (74)

 

The following lemma fills the gap between a modified TBRNN and a modified BRNN.

Lemma 21

Let 𝒩~\tilde{\mathcal{N}} be a modified TBRNN that computes (69) and KdxK\subset\mathbb{R}^{d_{x}} be a compact set. Then for any ϵ>0\epsilon>0 there exists a modified BRNN 𝒩¯\bar{\mathcal{N}} of width dx+2+γ(σ)d_{x}+2+\gamma(\sigma) such that

supxKN𝒩~(x)𝒩¯(x)<ϵ,\sup_{x\in K^{N}}\left\|\tilde{\mathcal{N}}(x)-\bar{\mathcal{N}}(x)\right\|<\epsilon, (75)

where γ(ReLU)=0\gamma(\operatorname{ReLU})=0, γ(σ)=1\gamma(\sigma)=1 for non-degenerating activation σ\sigma.
Moreover, there exists a BRNN 𝒩\mathcal{N} of width dx+3+α(σ)d_{x}+{\color[rgb]{0,0,0}3}+\alpha(\sigma) such that

supxKN𝒩~(x)𝒩(x)<ϵ,\sup_{x\in K^{N}}\left\|\tilde{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\epsilon, (76)

where α(σ)={0σ is ReLU or a non-degenerating function with σ(z0)=0.1σ is a non-degenerating function with σ(z0)0.\alpha\left(\sigma\right)=\begin{cases}0&\text{$\sigma$ is ReLU or a non-degenerating function with $\sigma\left(z_{0}\right)=0$.}\\ 1&\text{$\sigma$ is a non-degenerating function with $\sigma\left(z_{0}\right)\neq 0$}.\end{cases}

Proof  We omit the details because we only need to construct a modified RNN that approximates (67) and (68) using Lemma 13. As only the forward or the backward modified RNN cell is used in the proof of Lemma 19, it is enough for the modified BRNN to approximate either the forward or the backward modified TRNN, which follows from Lemma 13. Lemma 3 provides the second part of the lemma.  

Finally, we obtain the universal approximation theorem of the BRNN from the previous results.

Theorem 22 (Universal approximation theorem of deep BRNN 1)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a continuous sequence to sequence function and σ\sigma be a non-degenerate activation function. Then for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep BRNN 𝒩\mathcal{N} of width dx+dy+3+α(σ)d_{x}+d_{y}+{\color[rgb]{0,0,0}3}+\alpha(\sigma), such that

supxKNsup1tNf(x)[t]𝒩(x)[t]<ϵ,\sup_{x\in K^{N}}\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|<\epsilon, (77)

where α(σ)={0σ is ReLU or a non-degenerating function with σ(z0)=0.1σ is a non-degenerating function with σ(z0)0.\alpha\left(\sigma\right)=\begin{cases}0&\text{$\sigma$ is ReLU or a non-degenerating function with $\sigma\left(z_{0}\right)=0$.}\\ 1&\text{$\sigma$ is a non-degenerating function with $\sigma\left(z_{0}\right)\neq 0$}.\end{cases}

Proof  As in the proof of Theorem 5, we set d_{y}=1 for notational convenience. According to Lemma 20, there exists a modified TBRNN \tilde{\mathcal{N}} of width d_{x}+2 such that

supxKnf(x)𝒩~(x)<ϵ2.\sup_{x\in K^{n}}\left\|f(x)-\tilde{\mathcal{N}}(x)\right\|<\frac{\epsilon}{2}. (78)

Lemma 21 implies that there exists a BRNN of width dx+4+α(σ)d_{x}+{\color[rgb]{0,0,0}4}+\alpha(\sigma) such that

supxKn𝒩~(x)𝒩(x)<ϵ2.\sup_{x\in K^{n}}\left\|\tilde{\mathcal{N}}(x)-\mathcal{N}(x)\right\|<\frac{\epsilon}{2}. (79)

The triangle inequality leads to

supxKnf(x)𝒩(x)<ϵ.\sup_{x\in K^{n}}\left\|f(x)-\mathcal{N}(x)\right\|<\epsilon. (80)

 

In L^{p} space, the result for BRNNs follows from the RNN case much more directly than in the continuous function space, since only the encoder requires a bidirectional flow.

Theorem 23 (Universal approximation theorem of deep BRNN 2)

Let f:dx×Ndy×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{d_{y}\times N} be a sequence-to-sequence function in Lp(dx×N,dy×N)L^{p}\left(\mathbb{R}^{d_{x}\times N},\mathbb{R}^{d_{y}\times N}\right) for 1p<1\leq p<\infty, and σ\sigma be a non-degenerate activation function with the non-degenerating point z0z_{0}. Then, for any ϵ>0\epsilon>0 and compact subset KdxK\subset\mathbb{R}^{d_{x}}, there exists a deep BRNN 𝒩\mathcal{N} of width max{dx+1,dy}+1\max\left\{d_{x}+1,d_{y}\right\}+1 satisfying

sup1tNf(x)[t]𝒩(x)[t]Lp(K)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(K\right)}<\epsilon. (81)

Moreover, if the activation σ\sigma is ReLU, there exists a deep BRNN 𝒩\mathcal{N} of width max{dx+1,dy}\max\left\{d_{x}+1,d_{y}\right\} satisfying

sup1tNf(x)[t]𝒩(x)[t]Lp(dx)<ϵ.\sup_{1\leq t\leq N}\left\|f(x)[t]-\mathcal{N}(x)[t]\right\|_{L^{p}\left(\mathbb{R}^{d_{x}}\right)}<\epsilon. (82)

Proof  Recall that in the RNN case, x[1:t] is encoded as a concatenation of truncated binary representations of x[1],x[2],\ldots,x[t], each cut off at a fixed number of digits. The backward direction performs the same encoding on x[t:N], and the two encoded results are then combined. Since the cells constructing the encoders are the same as in equation (48), we omit the remaining steps.  

7 Discussion

We proved the universal approximation theorem and calculated the upper bound of the minimum width of an RNN, an LSTM, a GRU, and a BRNN. In this section, we illustrate how our results support the performance of a recurrent network.

We show that an RNN needs a width of at most dx+dy+4d_{x}+d_{y}+4 to approximate a function from a sequence of dxd_{x}-dimensional vectors to a sequence of dyd_{y}-dimensional vectors. The upper bound of the minimum width of the network depends only on the input and output dimensions, regardless of the length of the sequence. The independence of the sequence length indicates that the recurrent structure is much more effective in learning a function on sequential inputs. To approximate a function defined on a long sequence, a network with a feed-forward structure requires a wide width proportional to the length of the sequence. For example, an MLP should have a wider width than NdxNd_{x} if it approximates a function f:dx×Nf:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R} defined on a sequence (Johnson, 2019). However, with the recurrent structure, it is possible to approximate via a narrow network of width dx+1d_{x}+1 regardless of the length, because the minimum width is independent of the length NN. This suggests that the recurrent structure, which transfers information between different time steps in the same layer, is crucial for success with sequential data.

From a practical point of view, this fact further implies that, to learn the internal dynamics of sequential data, there is no need to limit the number of past time steps that affect the dynamics. For instance, suppose that a pair of long sequential data (x[t]) and (y[t]) has an unknown relation y[t]=f\left(x[t-p],x[t-p+1],\ldots,x[t]\right). Even without prior knowledge of f and p, a deep RNN can learn the relation if we train the network with inputs x[1:t] and outputs y[t]. An MLP cannot reproduce this result because its required width increases in proportion to p, which is unknown. This difference theoretically supports that recurrent networks are appropriate for real-world sequential data whose underlying dynamics are unknown.

8 Conclusion

In this study, we investigated the universality and upper bound of the minimum width of deep RNNs. The upper bound of the minimum width serves as a theoretical basis for the effectiveness of deep RNNs, especially when underlying dynamics of the data are unknown.

Our methodology enables various follow-up studies, as it connects an MLP and a deep RNN. For example, the framework disentangles the time dependency of output sequence of an RNN. This makes it feasible to investigate a trade-off between width and depth in the representation ability or error bounds of the deep RNN, which has not been studied because of the entangled flow with time and depth. In addition, we separated the required width into three parts: one maintains inputs and results, another resolves the time dependency, and the third modifies the activation. Assuming some underlying dynamics in the output sequence, such as an open dynamical system, we expect to reduce the required minimum width on each part because there is a natural dependency between the outputs, and the inputs are embedded in a specific way by the dynamics.

In addition, as LSTMs and GRUs have multiple hidden states in their cells, they may have a smaller minimum width than the RNN. By constructing an LSTM and a GRU that use these hidden states to store data and resolve the time dependency, we hope that the techniques demonstrated in our proofs help analyze why these networks achieve better results in practice and suffer less from long-term dependency.

Acknowledgement

This work was supported by the Challengeable Future Defense Technology Research and Development Program through ADD[No. 915020201], the NRF grant[2012R1A2C3010887] and the MSIT/IITP[No. 2021-0-01343, Artificial Intelligence Graduate School Program(SNU)].

Appendix A Notations

Symbol Description
σ\sigma Activation function
σI\sigma_{I} Modified activation function
xx Input sequence of vectors
yy Output sequence of vectors
tt Order of element in a sequence
a[t]ia[t]_{i} i-th component of the t-th element of a sequence a
dxd_{x} Dimension of input vector
dsd_{s} Dimension of hidden state in RNN, or width
dyd_{y} Dimension of output vector
NN Length of the input and output sequence
KK A compact subset of dx\mathbb{R}^{d_{x}}
Om,nO_{m,n} Zero matrix of size m×nm\times n
𝟎k\mathbf{0}_{k} Zero vector of size kk, 𝟎k=[000]kT\mathbf{0}_{k}={\underbrace{\begin{bmatrix}0&0\cdots&0\end{bmatrix}}_{k}}^{T}
𝟏k\mathbf{1}_{k} One vector of size kk, 𝟏k=[111]kT\mathbf{1}_{k}={\underbrace{\begin{bmatrix}1&1\cdots&1\end{bmatrix}}_{k}}^{T}
IkI_{k} Identity matrix of size k×kk\times k
RNN Recurrent Neural Network, defined on page 1
TRNN Time-enhanced Recurrent Neural Network, defined by replacing the
recurrent cell with the time-enhanced recurrent cell in an RNN.
The time-enhanced cell and the TRNN are defined on page 10
BRNN Bidirectional Recurrent Neural Network, defined on page 4
TBRNN Time-enhanced Bidirectional Recurrent Neural Network,
defined on page 10

Appendix B Proof of the Lemma 3

Without loss of generality, we may assume \bar{\mathcal{P}} is an identity map and I=\{1,2,\ldots,k\}. Let \bar{\mathcal{R}}(x)[t+1]=\sigma_{I}\left(\bar{A}\bar{\mathcal{R}}(x)[t]+\bar{B}x[t+1]+\bar{\theta}\right) be a given modified RNN cell, and \mathcal{Q}(x)[t]=\bar{Q}x[t] be a given token-wise linear projection map. We use the notations O_{m,n} and \mathbf{1}_{m} to denote the zero matrix in \mathbb{R}^{m\times n} and the all-ones vector in \mathbb{R}^{m}, respectively. We sometimes omit the O_{m,n} blocks in block matrices when the size of the zero block is clear from the context.

Case 1: σ(z0)=0\sigma(z_{0})=0

Let 𝒫\mathcal{P} be the identity map. For δ>0\delta>0 define 1δ\mathcal{R}_{1}^{\delta} as

1δ(x[t+1])σ(δB¯x[t+1]+δθ¯+z0𝟏d).\mathcal{R}_{1}^{\delta}\left(x[t+1]\right)\coloneqq\sigma\left(\delta\bar{B}x[t+1]+\delta\bar{\theta}+z_{0}\mathbf{1}_{d}\right). (83)

Since σ\sigma is non-degenerating at z0z_{0} and σ\sigma^{\prime} is continuous at z0z_{0}, we have

1δ𝒫(x)[t+1]=δσ(z0)(B¯x[t+1]+θ¯)+o(δ).\mathcal{R}_{1}^{\delta}\circ\mathcal{P}(x)[t+1]=\delta\sigma^{\prime}(z_{0})\left(\bar{B}x[t+1]+\bar{\theta}\right)+o(\delta). (84)

Then construct a second cell to compute transition as

2δ(x)[t+1]=σ(A~2δ(x)[t]+1σ(z0)[δ1IkIdk]x[t+1]+[𝟎kz0𝟏dk]),\mathcal{R}_{2}^{\delta}(x)[t+1]=\sigma\left(\tilde{A}\mathcal{R}_{2}^{\delta}(x)[t]+\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}x[t+1]+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix}\right), (85)

where A~=[IkδIdk]A¯[Ik1δσ(z0)Idk]\tilde{A}=\begin{bmatrix}I_{k}&\\ &\delta I_{d-k}\end{bmatrix}\bar{A}\begin{bmatrix}I_{k}&\\ &\frac{1}{\delta\sigma^{\prime}(z_{0})}I_{d-k}\end{bmatrix}.
After that, the first output of 2δ1δ𝒫(x)\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x) becomes

2δ1δ𝒫(x)[1]\displaystyle\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[1] =σ(1σ(z0)[δ1IkIdk]1δ(x)[1]+[𝟎kz0𝟏dk])\displaystyle=\sigma\left(\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}(x)[1]+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix}\right) (86)
=σ([(B¯x[1]+θ¯)1:k+δ1o(δ)(z0𝟏dk+δ(B¯x[1]+θ¯)k+1:d+o(δ)])\displaystyle=\sigma\left(\begin{bmatrix}\left(\bar{B}x[1]+\bar{\theta}\right)_{1:k}+\delta^{-1}o(\delta)\\ \left(z_{0}\mathbf{1}_{d-k}+\delta(\bar{B}x[1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\end{bmatrix}\right) (87)
=[σ(B¯x[1]+θ¯)1:k+o(1)σ(z0)δ(B¯x[1]+θ¯)k+1:d+o(δ)]\displaystyle=\begin{bmatrix}\sigma\left(\bar{B}x[1]+\bar{\theta}\right)_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\left(\bar{B}x[1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\end{bmatrix} (88)
=[¯(x)[1]1:k+o(1)σ(z0)δ¯(x)[1]k+1:d+o(δ)].\displaystyle=\begin{bmatrix}\bar{\mathcal{R}}(x)[1]_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[1]_{k+1:d}+o(\delta)\end{bmatrix}. (89)

Now use mathematical induction on time tt to compute 2δ1δ𝒫(x)\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x) assuming

2δ1δ𝒫(x)[t]=[¯(x)[t]1:k+o(1)σ(z0)δ¯(x)[t]k+1:d+o(δ)].\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]=\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t]_{k+1:d}+o(\delta)\end{bmatrix}. (90)

From a direct calculation, we attain

1σ(z0)[δ1IkIdk]1δ𝒫(x)[t+1]+[𝟎kz0𝟏dk]\displaystyle\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1]+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix} (91)
=1σ(z0)[δ1IkIdk](δσ(z0)(B¯x[t+1]+θ¯)+o(δ))+[𝟎kz0𝟏dk]\displaystyle=\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}\left(\delta\sigma^{\prime}(z_{0})\left(\bar{B}x[t+1]+\bar{\theta}\right)+o(\delta)\right)+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix} (92)
=[B¯x[t+1]1:k+θ¯1:k+δ1o(δ)z0𝟏dk+δ(B¯x[t+1]+θ¯)k+1:d+o(δ)],\displaystyle=\begin{bmatrix}\bar{B}x[t+1]_{1:k}+\bar{\theta}_{1:k}+\delta^{-1}o(\delta)\\ z_{0}\mathbf{1}_{d-k}+\delta\left(\bar{B}x[t+1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\end{bmatrix}, (93)

and

A~2δ1δ𝒫(x)[t]\displaystyle\tilde{A}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t] (94)
=[IkδIdk]A¯[Ik1δσ(z0)Idk][¯(x)[t]1:k+o(1)σ(z0)δ¯(x)[t]k+1:d+o(δ)]\displaystyle=\begin{bmatrix}I_{k}&\\ &\delta I_{d-k}\end{bmatrix}\bar{A}\begin{bmatrix}I_{k}&\\ &\frac{1}{\delta\sigma^{\prime}(z_{0})}I_{d-k}\end{bmatrix}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t]_{k+1:d}+o(\delta)\end{bmatrix} (95)
=[IkδIdk]A¯[¯(x)[t]1:k+o(1)¯(x)[t]k+1:d+o(1)]\displaystyle=\begin{bmatrix}I_{k}&\\ &\delta I_{d-k}\end{bmatrix}\bar{A}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \bar{\mathcal{R}}(x)[t]_{k+1:d}+o(1)\end{bmatrix} (96)
=[(A¯¯(x)[t])1:k+o(1)δ(A¯¯(x)[t])k+1:d+o(δ)].\displaystyle=\begin{bmatrix}\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{1:k}+o(1)\\ \delta\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{k+1:d}+o(\delta)\end{bmatrix}. (97)

With the sum of above two results, we obtain the induction hypothesis (90) for t+1t+1,

2δ1δ𝒫(x)[t+1]\displaystyle\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1] (98)
=σ(A~2δ1δ𝒫(x)[t]+1σ(z0)[δ1IkIdk]1δ𝒫(x)[t+1]+[𝟎kz0𝟏dk])\displaystyle=\sigma\left(\tilde{A}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]+\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1]+\begin{bmatrix}\mathbf{0}_{k}\\ z_{0}\mathbf{1}_{d-k}\end{bmatrix}\right) (99)
=σ[(A¯¯(x)[t])1:k+B¯x[t+1]1:k+θ¯1:k+o(1)z0𝟏dk+δ(A¯¯(x)[t])k+1:d+δ(B¯x[t+1]+θ¯)k+1:d+o(δ)]\displaystyle=\sigma\begin{bmatrix}\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{1:k}+\bar{B}x[t+1]_{1:k}+\bar{\theta}_{1:k}+o(1)\\ z_{0}\mathbf{1}_{d-k}+\delta\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{k+1:d}+\delta\left(\bar{B}x[t+1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\end{bmatrix} (100)
=[¯(x)[t+1]1:k+o(1)σ(z0)δ¯(x)[t+1]k+1:d+o(δ)].\displaystyle=\begin{bmatrix}\bar{\mathcal{R}}(x)[t+1]_{1:k}+o(1)\\ \sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t+1]_{k+1:d}+o(\delta)\end{bmatrix}. (101)

Setting 𝒬δ=Q¯[Ik1σ(z0)δIdk]\mathcal{Q}^{\delta}=\bar{Q}\begin{bmatrix}I_{k}&&\\ &\frac{1}{\sigma^{\prime}(z_{0})\delta}I_{d-k}\end{bmatrix} and choosing δ\delta small enough complete the proof:

𝒬δ2δ1δ𝒫(x)[t]=Q¯[¯(x)[t]1:k+o(1)¯(x)[t]k+1:d+o(1)]=𝒬¯¯(x)[t]+o(1)𝒬¯¯(x)[t].\mathcal{Q}^{\delta}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]=\bar{Q}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \bar{\mathcal{R}}(x)[t]_{k+1:d}+o(1)\end{bmatrix}=\bar{\mathcal{Q}}\bar{\mathcal{R}}(x)[t]+o(1)\rightarrow\bar{\mathcal{Q}}\bar{\mathcal{R}}(x)[t]. (102)
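Before turning to the second case, the following numeric sketch checks the Case 1 construction for \sigma=\tanh and z_{0}=0 (so \sigma(z_{0})=0 and \sigma^{\prime}(z_{0})=1), with \bar{\mathcal{P}} the identity, a random projection \bar{Q}, and arbitrary test weights: the output of \mathcal{Q}^{\delta}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta} approaches that of the modified cell as \delta\rightarrow 0.

import numpy as np

rng = np.random.default_rng(0)
d, k, N = 3, 2, 6                       # tanh acts on the first k components (I = {1, 2})
A_bar = 0.3 * rng.normal(size=(d, d))
B_bar = rng.normal(size=(d, d))
theta = rng.normal(size=d)
Q_bar = rng.normal(size=(1, d))
x = rng.normal(size=(N, d))

def sigma_I(z):                         # modified activation: tanh on 1..k, identity after
    out = z.copy(); out[:k] = np.tanh(z[:k]); return out

def target():                           # Q_bar applied to the modified cell
    h, ys = np.zeros(d), []
    for t in range(N):
        h = sigma_I(A_bar @ h + B_bar @ x[t] + theta)
        ys.append(Q_bar @ h)
    return np.array(ys)

def construction(delta):                # R_1^delta (83), R_2^delta (85), Q^delta from (102)
    D_in = np.diag(np.r_[np.full(k, 1 / delta), np.ones(d - k)])
    S = np.diag(np.r_[np.ones(k), np.full(d - k, delta)])
    S_inv = np.diag(np.r_[np.ones(k), np.full(d - k, 1 / delta)])
    A_tilde = S @ A_bar @ S_inv
    h, ys = np.zeros(d), []
    for t in range(N):
        u = np.tanh(delta * (B_bar @ x[t] + theta))   # first cell, token-wise; z0 = 0
        h = np.tanh(A_tilde @ h + D_in @ u)           # second cell; bias vanishes since z0 = 0
        ys.append(Q_bar @ S_inv @ h)                  # rescaled projection
    return np.array(ys)

for delta in [1e-1, 1e-2, 1e-3]:
    print(delta, float(np.max(np.abs(construction(delta) - target()))))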

Case 2: σ(z0)0\sigma(z_{0})\neq 0

When \sigma(z_{0})\neq 0, there is a \sigma(z_{0}) term, independent of \delta, in the Taylor expansion \sigma(z_{0}+\delta x)=\sigma(z_{0})+\delta\sigma^{\prime}(z_{0})x+o(\delta). An additional width removes this term; hence we need a lifting map \mathcal{P}:\mathbb{R}^{d\times N}\rightarrow\mathbb{R}^{(d+1)\times N}:

𝒫(x)[t]=[x[t]0].\mathcal{P}(x)[t]=\begin{bmatrix}x[t]\\ 0\end{bmatrix}. (103)

Now for δ>0\delta>0 define 1δ\mathcal{R}_{1}^{\delta} as

1δ(X)σ(δ[B¯0]X+δ[θ¯0]+z0𝟏d+1).\mathcal{R}_{1}^{\delta}\left(X\right)\coloneqq\sigma\left(\delta\begin{bmatrix}\bar{B}&\\ &0\end{bmatrix}X+\delta\begin{bmatrix}\bar{\theta}\\ 0\end{bmatrix}+z_{0}\mathbf{1}_{d+1}\right). (104)

As in the previous case, we have

1δ𝒫(x)[t+1]=[σ(z0)𝟏d+δσ(z0)(B¯x[t+1]+θ¯)+o(δ)σ(z0)],\mathcal{R}_{1}^{\delta}\circ\mathcal{P}(x)[t+1]=\begin{bmatrix}\sigma(z_{0})\mathbf{1}_{d}+\delta\sigma^{\prime}(z_{0})\left(\bar{B}x[t+1]+\bar{\theta}\right)+o\left(\delta\right)\\ \sigma(z_{0})\end{bmatrix}, (105)

and construct a second cell 2δ\mathcal{R}_{2}^{\delta} to compute

2δ(x)[t+1]=σ(A~2δ(x)[t]+1σ(z0)[δ1IkId+1k]x[t+1]+[σ(z0)δσ(z0)𝟏kz0𝟏d+1kσ(z0)σ(z0)𝟏d+1k]),\mathcal{R}_{2}^{\delta}(x)[t+1]=\sigma\left(\tilde{A}\mathcal{R}_{2}^{\delta}(x)[t]+\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d+1-k}\end{bmatrix}x[t+1]+\begin{bmatrix}-\frac{\sigma(z_{0})}{\delta\sigma^{\prime}(z_{0})}\mathbf{1}_{k}\\ z_{0}\mathbf{1}_{d+1-k}-\frac{\sigma(z_{0})}{\sigma^{\prime}(z_{0})}\mathbf{1}_{d+1-k}\end{bmatrix}\right), (106)

where A~=[IkδId+1k][A¯0][Ik1δσ(z0)Idk1δσ(z0)𝟏dk0]\tilde{A}=\begin{bmatrix}I_{k}&\\ &\delta I_{d+1-k}\end{bmatrix}\begin{bmatrix}\bar{A}&\\ &0\end{bmatrix}\begin{bmatrix}I_{k}&&\\ &\frac{1}{\delta\sigma^{\prime}(z_{0})}I_{d-k}&-\frac{1}{\delta\sigma^{\prime}(z_{0})}\mathbf{1}_{d-k}\\ &&0\end{bmatrix}.
After that, the first output of 2δ1δ𝒫(x)[t]\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t] becomes

2δ1δ𝒫(x)[1]\displaystyle\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[1] =σ(1σ(z0)[δ1IkId+1k]1δ(x)[1]+[σ(z0)δσ(z0)𝟏kz0𝟏d+1kσ(z0)σ(z0)𝟏d+1k])\displaystyle=\sigma\left(\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d+1-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}(x)[1]+\begin{bmatrix}-\frac{\sigma(z_{0})}{\delta\sigma^{\prime}(z_{0})}\mathbf{1}_{k}\\ z_{0}\mathbf{1}_{d+1-k}-\frac{\sigma(z_{0})}{\sigma^{\prime}(z_{0})}\mathbf{1}_{d+1-k}\end{bmatrix}\right) (107)
=σ([(B¯x[1]+θ¯)1:k+o(δ)(z0𝟏dk+δ(B¯x[1]+θ¯)k+1:d+o(δ)z0])\displaystyle=\sigma\left(\begin{bmatrix}\left(\bar{B}x[1]+\bar{\theta}\right)_{1:k}+o(\delta)\\ \left(z_{0}\mathbf{1}_{d-k}+\delta(\bar{B}x[1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\\ z_{0}\end{bmatrix}\right) (108)
=[σ(B¯x[1]+θ¯)1:k+o(1)σ(z0)𝟏dk+σ(z0)δ(B¯x[1]+θ¯)k+1:d+o(δ)σ(z0)]\displaystyle=\begin{bmatrix}\sigma\left(\bar{B}x[1]+\bar{\theta}\right)_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\left(\bar{B}x[1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix} (109)
=[¯(x)[1]1:k+o(1)σ(z0)𝟏dk+σ(z0)δ¯(x)[1]k+1:d+o(δ)σ(z0)].\displaystyle=\begin{bmatrix}\bar{\mathcal{R}}(x)[1]_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[1]_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix}. (110)

We use mathematical induction on time t, assuming that

2δ1δ𝒫(x)[t]=[¯(x)[t]1:k+o(1)σ(z0)𝟏dk+σ(z0)δ¯(x)[t]k+1:d+o(δ)σ(z0)].\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]=\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t]_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix}. (111)

Direct calculation yields

1σ(z0)[δ1IkId+1k]1δ𝒫(x)[t+1]+[σ(z0)δσ(z0)𝟏kz0𝟏d+1kσ(z0)σ(z0)𝟏d+1k]\displaystyle\frac{1}{\sigma^{\prime}(z_{0})}\begin{bmatrix}\delta^{-1}I_{k}&\\ &I_{d+1-k}\end{bmatrix}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1]+\begin{bmatrix}-\frac{\sigma(z_{0})}{\delta\sigma^{\prime}(z_{0})}\mathbf{1}_{k}\\ z_{0}\mathbf{1}_{d+1-k}-\frac{\sigma(z_{0})}{\sigma^{\prime}(z_{0})}\mathbf{1}_{d+1-k}\end{bmatrix} (112)
=[B¯x[t+1]1:k+θ¯1:k+o(1)z0𝟏dk+δ(B¯x[t+1]+θ¯)k+1:d+o(δ)z0],\displaystyle=\begin{bmatrix}\bar{B}x[t+1]_{1:k}+\bar{\theta}_{1:k}+o(1)\\ z_{0}\mathbf{1}_{d-k}+\delta\left(\bar{B}x[t+1]+\bar{\theta}\right)_{k+1:d}+o(\delta)\\ z_{0}\end{bmatrix}, (113)

and

A~2δ1δ𝒫(x)[t]\displaystyle\tilde{A}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t] (114)
=A~[¯(x)[t]1:k+o(1)σ(z0)𝟏dk+σ(z0)δ¯(x)[t]k+1:d+o(δ)σ(z0)]\displaystyle=\tilde{A}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t]_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix} (115)
=[IkδId+1k][A¯0][¯(x)[t]1:k+o(1)¯(x)[t]k+1:d+o(1)0]\displaystyle=\begin{bmatrix}I_{k}&\\ &\delta I_{d+1-k}\end{bmatrix}\begin{bmatrix}\bar{A}&\\ &0\end{bmatrix}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \bar{\mathcal{R}}(x)[t]_{k+1:d}+o(1)\\ 0\end{bmatrix} (116)
=[(A¯¯(x)[t])1:k+o(1)δ(A¯¯(x)[t])k+1:d+o(δ)0].\displaystyle=\begin{bmatrix}\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{1:k}+o(1)\\ \delta\left(\bar{A}\bar{\mathcal{R}}(x)[t]\right)_{k+1:d}+o(\delta)\\ 0\end{bmatrix}. (117)

Adding two terms in (106), we obtain the induction hypothesis (111) for t+1t+1,

2δ1δ𝒫(x)[t+1]=[¯(x)[t+1]1:k+o(1)σ(z0)𝟏dk+σ(z0)δ¯(x)[t+1]k+1:d+o(δ)σ(z0)].\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t+1]=\begin{bmatrix}\bar{\mathcal{R}}(x)[t+1]_{1:k}+o(1)\\ \sigma(z_{0})\mathbf{1}_{d-k}+\sigma^{\prime}(z_{0})\delta\bar{\mathcal{R}}(x)[t+1]_{k+1:d}+o(\delta)\\ \sigma(z_{0})\end{bmatrix}. (118)

Setting 𝒬δ=[Q¯0][Ik1σ(z0)δIdk1σ(z0)δ𝟏dk0]\mathcal{Q}^{\delta}=\begin{bmatrix}\bar{Q}&0\end{bmatrix}\begin{bmatrix}I_{k}&&\\ &\frac{1}{\sigma^{\prime}(z_{0})\delta}I_{d-k}&-\frac{1}{\sigma^{\prime}(z_{0})\delta}\mathbf{1}_{d-k}\\ &&0\end{bmatrix} and choosing δ\delta small enough complete the proof:

𝒬δ2δ1δ𝒫(x)[t]=[Q¯0][¯(x)[t]1:k+o(1)¯(x)[t]k+1:d+o(1)0]=𝒬¯¯(x)[t]+o(1)𝒬¯¯(x)[t].\mathcal{Q}^{\delta}\mathcal{R}_{2}^{\delta}\mathcal{R}_{1}^{\delta}\mathcal{P}(x)[t]=\begin{bmatrix}\bar{Q}&0\end{bmatrix}\begin{bmatrix}\bar{\mathcal{R}}(x)[t]_{1:k}+o(1)\\ \bar{\mathcal{R}}(x)[t]_{k+1:d}+o(1)\\ 0\end{bmatrix}=\bar{\mathcal{Q}}\bar{\mathcal{R}}(x)[t]+o(1)\rightarrow\bar{\mathcal{Q}}\bar{\mathcal{R}}(x)[t]. (119)

Appendix C Proof of the Lemma 6

It suffices to show that there exists a modified RNN 𝒩\mathcal{N} that computes

𝒩(x)[N]=[x[N]t=1NA[t]x[t]],\mathcal{N}(x)[N]=\begin{bmatrix}x[N]\\ \sum_{t=1}^{N}A[t]x[t]\end{bmatrix}, (120)

for given matrices A[1],,A[N]1×dxA[1],\ldots,A[N]\in\mathbb{R}^{1\times d_{x}}.

An RNN should have multiple layers to implement an arbitrary linear combination. To overcome the complex time dependency arising from the deep structure and to explicitly formulate the result of a deep modified RNN, we force A_{l} and B_{l} to use the information of the previous time step in a limited way. Define the modified RNN cell at the l-th layer, \mathcal{R}_{l}, as

l(x)[t+1]=All(x)[t]+Blx[t+1]\mathcal{R}_{l}(x)[t+1]=A_{l}\mathcal{R}_{l}(x)[t]+B_{l}x[t+1] (121)

where Al=[Odx,dxOdx,1O1,dx1]A_{l}=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}, Bl=[IdxOdx,1bl1]B_{l}=\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ b_{l}&1\end{bmatrix} for bl1×dxb_{l}\in\mathbb{R}^{1\times d_{x}}.

Construct a modified RNN 𝒩L\mathcal{N}_{L} for each LL\in\mathbb{N} as

𝒩LLL11,\mathcal{N}_{L}\coloneqq\mathcal{R}_{L}\circ\mathcal{R}_{L-1}\circ\cdots\circ\mathcal{R}_{1}, (122)

and, for the embedded input sequence x^{\prime}[t]=\begin{bmatrix}x[t]\\ 0\end{bmatrix}\in\mathbb{R}^{d_{x}+1} of x, denote the output of \mathcal{N}_{n} at each time m by

T(n,m)𝒩n(x)[m].T(n,m)\coloneqq\mathcal{N}_{n}\left(x^{\prime}\right)[m]. (123)

Then we have the following lemma.

Lemma 24

Let T(n,m)T(n,m) be the matrix defined by (123). Then we have

T(n,m)=[x[m]i=1j=1(n+mijni)bix[j]],T(n,m)=\begin{bmatrix}x[m]\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\binom{n+m-i-j}{n-i}b_{i}x[j]\end{bmatrix}, (124)

where (nk)\binom{n}{k} means binomial coefficient n!k!(nk)!\frac{n!}{k!(n-k)!} for nkn\geq k. We define (nk)=0\binom{n}{k}=0 for the case of k>nk>n or n<0n<0 for notational convenience.

Proof  Since there is no activation in modified RNN (121), T(n,m)T(n,m) has the form of

T(n,m)=[xmi=1j=1αi,jn,mbix[j]].T(n,m)=\begin{bmatrix}x_{m}\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\alpha_{i,j}^{n,m}b_{i}x[j]\end{bmatrix}. (125)

From the definition of the modified RNN cell and TT, we first show that α\alpha satisfies the recurrence relation

αi,jn,m={αi,jn1,m+αi,jn,m1+1,if n=i and m=j,αi,jn1,m+αi,jn,m1,otherwise\alpha^{n,m}_{i,j}=\left\{\begin{array}[]{ll}\alpha_{i,j}^{n-1,m}+\alpha_{i,j}^{n,m-1}+1,&\text{if $n=i$ and $m=j$},\\ \alpha_{i,j}^{n-1,m}+\alpha_{i,j}^{n,m-1},&\text{otherwise}\end{array}\right. (126)

using mathematical induction on n,mn,m in turn. Initially, T(0,m)=[xm0]T(0,m)=\begin{bmatrix}x_{m}\\ 0\end{bmatrix}, T(n,0)=[Odx,10]T(n,0)=\begin{bmatrix}O_{d_{x},1}\\ 0\end{bmatrix} by definition, and (125) holds when n=0n=0. Now assume (125) holds for nNn\leq N, any mm. To show that (125) holds for n=N+1n=N+1 and any mm, use mathematical induction on mm. By definition, we have αi,jn,0=0\alpha_{i,j}^{n,0}=0 for any nn. Thus (125) holds when n=N+1n=N+1 and m=0m=0. Assume it holds for n=N+1n=N+1 and mMm\leq M. Then

T(N+1,M+1) (127)
=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}\begin{bmatrix}x_{M}\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\alpha_{i,j}^{N+1,M}b_{i}x_{j}\end{bmatrix}+\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ b_{N+1}&1\end{bmatrix}\begin{bmatrix}x_{M+1}\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\alpha_{i,j}^{N,M+1}b_{i}x_{j}\end{bmatrix}
=\begin{bmatrix}x_{M+1}\\ b_{N+1}x_{M+1}+\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\left(\alpha_{i,j}^{N+1,M}+\alpha_{i,j}^{N,M+1}\right)b_{i}x_{j}\end{bmatrix}.

Hence the relation holds for n=N+1n=N+1 and any m>0m>0.

It now remains to show that

αi,jn,m={(m+nijni)if 1in1jm0otherwise\alpha_{i,j}^{n,m}=\left\{\begin{array}[]{ll}\binom{m+n-i-j}{n-i}&\text{if $1\leq i\leq n$, $1\leq j\leq m$}\\ 0&\text{otherwise}\end{array}\right. (128)

From the initial condition of \alpha, we know \alpha^{0,m}_{i,j}=\alpha^{n,0}_{i,j}=0 for all n,m\in\mathbb{N}. After some direct calculation with the recurrence relation (126) of \alpha, we have

  i)   If n<i or m<j, then \alpha_{i,j}^{n,m}=0, since \alpha^{n,m}_{i,j}=\alpha^{n-1,m}_{i,j}+\alpha^{n,m-1}_{i,j}.

  ii)  \alpha_{i,j}^{i,j}=\alpha_{i,j}^{i-1,j}+\alpha_{i,j}^{i,j-1}+1=1.

  iii) \alpha_{i,j}^{i,m}=\alpha_{i,j}^{i-1,m}+\alpha_{i,j}^{i,m-1}=\alpha_{i,j}^{i,m-1} implies \alpha_{i,j}^{i,m}=1 for m>j.

  iv)  Similarly, \alpha^{n,j}_{i,j}=\alpha^{n-1,j}_{i,j}+\alpha^{n,j-1}_{i,j}=\alpha^{n-1,j}_{i,j} implies \alpha_{i,j}^{n,j}=1 for n>i.

Now use mathematical induction on n+mn+m starting from n+m=i+jn+m=i+j to show αi,jn,m=(m+nijni)\alpha^{n,m}_{i,j}=\binom{m+n-i-j}{n-i} for nin\geq i, mjm\geq j.

  i)  n+m=i+j holds only if n=i and m=j for n\geq i, m\geq j. In this case, \alpha^{i,j}_{i,j}=1=\binom{m+n-i-j}{n-i}.

  ii) Assume that (128) holds for any n, m with n+m=k as the induction hypothesis. Now suppose n+m=k+1 for given n, m. If n=i or m=j, we already know \alpha_{i,j}^{n,m}=1=\binom{m+n-i-j}{n-i}. Otherwise n-1\geq i, m-1\geq j, and we have

    αi,jn,m\displaystyle\alpha^{n,m}_{i,j} =αi,jn1,m+αi,jn,m1\displaystyle=\alpha^{n-1,m}_{i,j}+\alpha^{n,m-1}_{i,j} (129)
    =(m+n1ijn1i)+(m+n1ijni)\displaystyle=\binom{m+n-1-i-j}{n-1-i}+\binom{m+n-1-i-j}{n-i}
    =(m+nijni),\displaystyle=\binom{m+n-i-j}{n-i},

    which completes the proof.

 

We have computed the output of modified RNN 𝒩N\mathcal{N}_{N} such that

𝒩N(x)[N]=[x[N]i=1Nj=1N(2nijni)bix[j]].\mathcal{N}_{N}\left(x^{\prime}\right)[N]=\begin{bmatrix}x[N]\\ \sum_{i=1}^{N}\sum_{j=1}^{N}\binom{2n-i-j}{n-i}b_{i}x[j]\end{bmatrix}. (130)

If the square matrix ΛN={(2nijni)}1i,jN\Lambda_{N}=\left\{\binom{2n-i-j}{n-i}\right\}_{1\leq i,j\leq N} has inverse ΛN1={λi,j}1i,jN\Lambda_{N}^{-1}=\left\{\lambda_{i,j}\right\}_{1\leq i,j\leq N}, bi=t=1Nλt,iA[t]b_{i}=\sum_{t=1}^{N}\lambda_{t,i}A[t] satisfies

i=1Nj=1N(2nijni)bix[j]\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N}\binom{2n-i-j}{n-i}b_{i}x[j] =i=1nj=1nt=1n(2nijni)λt,iA[t]x[j]\displaystyle=\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{t=1}^{n}\binom{2n-i-j}{n-i}\lambda_{t,i}A[t]x[j]
=j=1nt=1n[i=1n(2nijni)λt,i]A[t]x[j]\displaystyle=\sum_{j=1}^{n}\sum_{t=1}^{n}\left[\sum_{i=1}^{n}\binom{2n-i-j}{n-i}\lambda_{t,i}\right]A[t]x[j]
=j=1nt=1nδj,tA[t]x[j]\displaystyle=\sum_{j=1}^{n}\sum_{t=1}^{n}\delta_{j,t}A[t]x[j]
=j=1nA[j]x[j],\displaystyle=\sum_{j=1}^{n}A[j]x[j],

where δ\delta is the Kronecker delta function.

The following lemma completes the proof.

Lemma 25

Matrix Λn={(2nijni)}1i,jnn×n\Lambda_{n}=\left\{\binom{2n-i-j}{n-i}\right\}_{1\leq i,j\leq n}\in\mathbb{R}^{n\times n} is invertible.

Proof  Use mathematical induction on nn. Λ1\Lambda_{1} is a trivial case. Assume Λn\Lambda_{n} is invertible.

Λn+1=[(2nn)(2n1n)(2n2n)(n+1n)(nn)(2n1n1)(2n2n1)(2n3n1)(nn1)(n1n1)(n+11)(n1)(n11)(21)(11)(n0)(n10)(n20)(10)(00)].\displaystyle\Lambda_{n+1}=\begin{bmatrix}\binom{2n}{n}&\binom{2n-1}{n}&\binom{2n-2}{n}&\dots&\binom{n+1}{n}&\binom{n}{n}\\ \binom{2n-1}{n-1}&\binom{2n-2}{n-1}&\binom{2n-3}{n-1}&\dots&\binom{n}{n-1}&\binom{n-1}{n-1}\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ \binom{n+1}{1}&\binom{n}{1}&\binom{n-1}{1}&\dots&\binom{2}{1}&\binom{1}{1}\\ \binom{n}{0}&\binom{n-1}{0}&\binom{n-2}{0}&\dots&\binom{1}{0}&\binom{0}{0}\end{bmatrix}. (131)

Applying an elementary row operation to \Lambda_{n+1} by multiplying by the matrix E on the left, and an elementary column operation to E\Lambda_{n+1} by multiplying by E^{T} on the right, where

E=[1100001100001000001100001],E=\begin{bmatrix}1&-1&0&\dots&0&0\\ 0&1&-1&\dots&0&0\\ 0&0&1&\dots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\dots&1&-1\\ 0&0&0&\dots&0&1\end{bmatrix}, (132)

we obtain the following relation:

EΛn+1ET=[(2n2n1)(2n3n1)(2n4n1)(n1n1)0(2n3n2)(2n4n2)(2n5n2)(n2n2)0(n10)(n20)(n30)(00)000001]=[ΛnOn,1O1,n1].E\Lambda_{n+1}E^{T}=\begin{bmatrix}\binom{2n-2}{n-1}&\binom{2n-3}{n-1}&\binom{2n-4}{n-1}&\dots&\binom{n-1}{n-1}&0\\ \binom{2n-3}{n-2}&\binom{2n-4}{n-2}&\binom{2n-5}{n-2}&\dots&\binom{n-2}{n-2}&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ \binom{n-1}{0}&\binom{n-2}{0}&\binom{n-3}{0}&\dots&\binom{0}{0}&0\\ 0&0&0&\dots&0&1\end{bmatrix}=\begin{bmatrix}\Lambda_{n}&O_{n,1}\\ O_{1,n}&1\end{bmatrix}. (133)

Hence Λn+1\Lambda_{n+1} is invertible by the induction hypothesis.  

Corollary 26

The following matrix Λn,kk×n\Lambda_{n,k}\in\mathbb{R}^{k\times n} is full-rank.

Λn,k={(2nijni)}nk+1in,1jn.\Lambda_{n,k}=\left\{\binom{2n-i-j}{n-i}\right\}_{n-k+1\leq i\leq n,1\leq j\leq n}. (134)

We will use the matrix Λn,k\Lambda_{n,k} in the proof of Lemma 11 to approximate a sequence-to-sequence function.
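The following short numeric sketch (not part of the proof) checks the construction of this appendix for an arbitrary small N: \Lambda_{N} is invertible, and with b_{i}=\sum_{t}\lambda_{t,i}A[t] the stacked linear cells (121)-(122) output \sum_{t}A[t]x[t] at time N. The sizes and the target rows A[t] are arbitrary test values.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, dx = 5, 3
A_target = rng.normal(size=(N, dx))                 # given row vectors A[1], ..., A[N]

Lam = np.array([[comb(2 * N - i - j, N - i) for j in range(1, N + 1)]
                for i in range(1, N + 1)], dtype=float)
print("det Lambda_N =", np.linalg.det(Lam))         # nonzero, so Lambda_N is invertible

b = np.linalg.solve(Lam.T, A_target)                # row i is b_i = sum_t lambda_{t,i} A[t]

x = rng.normal(size=(N, dx))
X = np.hstack([x, np.zeros((N, 1))])                # embedded input x' = [x; 0]
for l in range(N):                                  # layer l+1 with the cell (121)
    out, h = [], np.zeros(dx + 1)
    for t in range(N):
        h = np.r_[X[t, :dx], h[dx] + b[l] @ X[t, :dx] + X[t, dx]]
        out.append(h)
    X = np.array(out)

print(np.isclose(X[N - 1, dx], np.sum(A_target * x)))   # equals sum_t A[t] x[t]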

Appendix D Proof of Lemma 11

Define token-wise lifting map 𝒫:dxdx+1\mathcal{P}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{x}+1} and modified TRNN cell 𝒯l:(dx+1)×N(dx+1)×N\mathcal{T}\mathcal{R}_{l}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} as in the proof of Lemma 6:

\mathcal{P}(x)[t]=\begin{bmatrix}x[t]\\ 0\end{bmatrix}, (135)
\mathcal{T}\mathcal{R}_{l}(X)[t+1]=A_{l}\mathcal{T}\mathcal{R}_{l}(X)[t]+B_{l}[t+1]X[t+1], (136)

where Al=[Odx,dxOdx,1O1,dx1]A_{l}=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}, Bl[t]=[IdxOdx,1bl[t]1]B_{l}[t]=\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ b_{l}[t]&1\end{bmatrix} for bl[t]1×dxb_{l}[t]\in\mathbb{R}^{1\times d_{x}}. Then we have

T(n,m)\displaystyle T(n,m) 𝒩n(x)[m]\displaystyle\coloneqq\mathcal{N}_{n}\left(x\right)[m] (137)
=[x[m]i=1j=1(n+mijni)bi[j]x[j]],\displaystyle=\begin{bmatrix}x[m]\\ \sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\binom{n+m-i-j}{n-i}b_{i}[j]x[j]\end{bmatrix},

where xdx×Nx\in\mathbb{R}^{d_{x}\times N} and 𝒩L=𝒯L𝒯L1𝒯1𝒫\mathcal{N}_{L}=\mathcal{T}\mathcal{R}_{L}\circ\mathcal{T}\mathcal{R}_{L-1}\circ\cdots\circ\mathcal{T}\mathcal{R}_{1}\circ\mathcal{P}.

Since for each tt, the matrix

ΛN,Nt+1={(2NijNi)}tiN,1jN={(2Nt+1ijNj)}1iNt+1,1jN\Lambda_{N,N-t+1}=\left\{\binom{2N-i-j}{N-i}\right\}_{t\leq i\leq N,1\leq j\leq N}=\left\{\binom{2N-t+1-i-j}{N-j}\right\}_{1\leq i\leq N-t+1,1\leq j\leq N} (138)

is full-rank, there exist b1[t],b2[t],,bN[t]b_{1}[t],b_{2}[t],\ldots,b_{N}[t] satisfying

ΛN,Nt+1[b1[t]bN[t]]=[At[N]At[t]],\Lambda_{N,N-t+1}\begin{bmatrix}b_{1}[t]\\ \vdots\\ b_{N}[t]\end{bmatrix}=\begin{bmatrix}A_{t}[N]\\ \vdots\\ A_{t}[t]\end{bmatrix}, (139)

or

j=1N(N+kjtNj)bj[t]=At[k],\sum_{j=1}^{N}\binom{N+k-j-t}{N-j}b_{j}[t]=A_{t}[k], (140)

for each k=1,2,,Nk=1,2,\ldots,N. Then we obtain

T(N,t)\displaystyle T(N,t) =i=1j=1(N+tijNi)bi[j]x[j]\displaystyle=\sum_{i=1}^{\infty}\sum_{j=1}^{\infty}\binom{N+t-i-j}{N-i}b_{i}[j]x[j] (141)
=j=1ti=1N(N+tijNi)bi[j]x[j]\displaystyle=\sum_{j=1}^{t}\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}b_{i}[j]x[j]
=j=1tAj[t]x[j].\displaystyle=\sum_{j=1}^{t}A_{j}[t]x[j].
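As with Appendix C, the construction above admits a direct numeric check (not part of the proof): for each time s we solve the full-rank system (139)-(140) for b_{1}[s],\ldots,b_{N}[s], run the stacked time-enhanced cells (135)-(136), and verify that the last component of T(N,t) equals \sum_{j=1}^{t}A_{j}[t]x[j] for every t. The sizes and the target coefficients A_{j}[t] below are arbitrary test values, and index 0 is unused padding.

import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, dx = 4, 2
A = rng.normal(size=(N + 1, N + 1, dx))            # A[j][t] is the row vector A_j[t]

b = np.zeros((N + 1, N + 1, dx))                   # b[l][s] is the row vector b_l[s]
for s in range(1, N + 1):
    M = np.array([[comb(N + k - i - s, N - i) for i in range(1, N + 1)]
                  for k in range(s, N + 1)], dtype=float)
    rhs = np.array([A[s][k] for k in range(s, N + 1)])
    sol, *_ = np.linalg.lstsq(M, rhs, rcond=None)  # full row rank, so this solves exactly
    b[1:, s] = sol

x = rng.normal(size=(N + 1, dx))
X = np.hstack([x, np.zeros((N + 1, 1))])           # lifted input [x[t]; 0]
for l in range(1, N + 1):                          # layer l with the cell (136)
    out, h = [np.zeros(dx + 1)], np.zeros(dx + 1)
    for t in range(1, N + 1):
        h = np.r_[X[t, :dx], h[dx] + b[l][t] @ X[t, :dx] + X[t, dx]]
        out.append(h)
    X = np.array(out)

for t in range(1, N + 1):
    assert np.isclose(X[t, dx], sum(A[j][t] @ x[j] for j in range(1, t + 1)))
print("T(N, t) matches the prescribed past-dependent sums for every t")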

Appendix E Proof of Lemma 13

As one of the modified TRNNs that compute (31), we use the modified TRNN defined in Appendix D. Specifically, we show that for a given l, there exists a modified RNN of width d_{x}+2+\gamma(\sigma) that approximates the modified TRNN cell \mathcal{T}\mathcal{R}_{l}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} defined by (136). Suppose K\subset\mathbb{R}^{d_{x}} and K^{\prime}\subset\mathbb{R} are compact sets and X\in\left(K\times K^{\prime}\right)^{N}\subset\mathbb{R}^{(d_{x}+1)\times N}. Then the output of the TRNN cell \mathcal{T}\mathcal{R}_{l} is

𝒯l(X)[t]=[X[t]1:dxj=1tbl[j]X[j]1:dx+j=1tX[j]dx+1].\mathcal{T}\mathcal{R}_{l}(X)[t]=\begin{bmatrix}X[t]_{1:d_{x}}\\ \sum_{j=1}^{t}b_{l}[j]X[j]_{1:d_{x}}+\sum_{j=1}^{t}X[j]_{d_{x}+1}\end{bmatrix}. (142)

Without loss of generality, assume K\subset\left[0,\frac{1}{2}\right]^{d_{x}} and let \gamma=\gamma(\sigma). Let \mathcal{P}:\mathbb{R}^{d_{x}+1}\rightarrow\mathbb{R}^{d_{x}+2+\gamma} be the token-wise linear map defined by \mathcal{P}(X)=\begin{bmatrix}X_{1:d_{x}}\\ 0\\ X_{d_{x}+1}\\ \mathbf{0}_{\gamma}\end{bmatrix}. Construct the modified recurrent cells \mathcal{R}_{1},\mathcal{R}_{2}:\mathbb{R}^{(d_{x}+2+\gamma)\times N}\rightarrow\mathbb{R}^{(d_{x}+2+\gamma)\times N} as follows: for X^{\prime}\in\mathbb{R}^{(d_{x}+2+\gamma)\times N},

1(X)[t+1]\displaystyle\mathcal{R}_{1}(X^{\prime})[t+1] =[Odx,dx1O1+γ,1+γ]1(X)[t]+X[t+1]+[𝟎dx1𝟎1+γ],\displaystyle=\begin{bmatrix}O_{d_{x},d_{x}}&&\\ &1&&\\ &&O_{1+\gamma,1+\gamma}\end{bmatrix}\mathcal{R}_{1}(X^{\prime})[t]+X^{\prime}[t+1]+\begin{bmatrix}\mathbf{0}_{d_{x}}\\ 1\\ \mathbf{0}_{1+\gamma}\end{bmatrix}, (143)
\mathcal{R}_{2}(X^{\prime})[t+1]=\begin{bmatrix}I_{d_{x}}&\mathbf{1}_{d_{x}}&&\\ &1&&\\ &&1&\\ &&&O_{\gamma,\gamma}\end{bmatrix}X^{\prime}[t+1]. (144)

Then, by definition for X(K×K)NX\in\left(K\times K^{\prime}\right)^{N},

21𝒫(X)[t]=[X[t]1:dx+t𝟏dxtX[t]dx+1𝟎γ].\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[t]=\begin{bmatrix}X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\\ t\\ X[t]_{d_{x}+1}\\ \mathbf{0}_{\gamma}\end{bmatrix}. (145)

Note that the sets $D_{i}=\{\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[i]_{1:d_{x}}\mid X\in\left(K\times K^{\prime}\right)^{N}\}=\{X[i]_{1:d_{x}}+i\mathbf{1}_{d_{x}}\mid X\in\left(K\times K^{\prime}\right)^{N}\}$ are pairwise disjoint; that is, $D_{i}\cap D_{j}=\emptyset$ for all $i\neq j$.
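
To illustrate the bookkeeping in (143)-(145), the following minimal sketch (ours; toy sizes and helper names are assumptions) applies $\mathcal{P}$, $\mathcal{R}_{1}$, and $\mathcal{R}_{2}$ as the linear maps displayed above, ignoring the activation $\sigma$ and the extra $\gamma(\sigma)$ coordinates that handle it, and checks (145) together with the disjointness of the boxes $D_{i}$.

# Minimal numerical sketch (ours, not the paper's code): apply the token-wise lift P
# and the cells R_1, R_2 of (143)-(144) as linear maps and check (145) and the
# disjointness of the sets D_t. Toy sizes N, dx, gamma are assumptions.
import numpy as np

rng = np.random.default_rng(1)
N, dx, gamma = 4, 2, 1                             # gamma stands for gamma(sigma)
X = np.vstack([rng.uniform(0.0, 0.5, (dx, N)),     # tokens in K, a subset of [0, 1/2]^dx
               rng.standard_normal((1, N))])       # the (dx+1)-th channel, in K'

def lift(Z):                                  # P: insert a zero row and gamma zero rows
    return np.vstack([Z[:dx], np.zeros((1, N)), Z[dx:dx + 1], np.zeros((gamma, N))])

def R1(Z):                                    # (143): row dx+1 counts the time step
    H = np.zeros_like(Z)
    prev = np.zeros(Z.shape[0])
    for t in range(N):
        H[:, t] = Z[:, t]
        H[dx, t] += prev[dx] + 1.0            # only the (dx+1)-th row is recurrent
        prev = H[:, t]
    return H

def R2(Z):                                    # (144): add t * 1_dx to the first dx rows
    W = np.zeros((Z.shape[0], Z.shape[0]))
    W[:dx, :dx] = np.eye(dx)
    W[:dx, dx] = 1.0
    W[dx, dx] = 1.0
    W[dx + 1, dx + 1] = 1.0
    return W @ Z                              # token-wise linear map

Y = R2(R1(lift(X)))
for t in range(1, N + 1):                     # check equation (145)
    expected = np.concatenate([X[:dx, t - 1] + t, [t], [X[dx, t - 1]], np.zeros(gamma)])
    assert np.allclose(Y[:, t - 1], expected)
# Each D_t = {X[t]_{1:dx} + t 1_dx} lies in [t, t + 1/2]^dx, so the D_t are pairwise disjoint.
print("equation (145) verified; the shifted tokens occupy disjoint boxes D_t")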

By the universal approximation theorem for deep narrow MLPs (Hanin and Sellke, 2017; Kidger and Lyons, 2020), for any $\delta_{l}>0$, there exists an MLP $\mathcal{N}_{l,MLP}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{x}+1}$ of width $d_{x}+1+\gamma$ such that for $v\in\mathbb{R}^{d_{x}}$,

𝒩l,MLP(v)1:dx=v\displaystyle\mathcal{N}_{l,MLP}(v)_{1:d_{x}}=v (146)
\sup_{t=1,\ldots,N}\sup_{v\in D_{t}}\left\|b_{l}[t]\left(v-t\mathbf{1}_{d_{x}}\right)-\mathcal{N}_{l,MLP}(v)_{d_{x}+1}\right\|<\delta_{l}. (147)

Since a token-wise MLP can be implemented by an RNN of the same width, there exists an RNN $\mathcal{N}_{l}:\mathbb{R}^{(d_{x}+2+\gamma)\times N}\rightarrow\mathbb{R}^{(d_{x}+2+\gamma)\times N}$ of width $d_{x}+2+\gamma$ whose components, all but the $(d_{x}+2)$-th, implement $\mathcal{N}_{l,MLP}$, so that for all $X^{\prime}\in\mathbb{R}^{(d_{x}+2+\gamma)\times N}$,

𝒩l(X)[t]=[𝒩l,MLP(X[t]1:dx)X[t]dx+2𝟎γ].\mathcal{N}_{l}(X^{\prime})[t]=\begin{bmatrix}\mathcal{N}_{l,MLP}\left(X^{\prime}[t]_{1:d_{x}}\right)\\ X^{\prime}[t]_{d_{x}+2}\\ \mathbf{0}_{\gamma}\end{bmatrix}. (148)

Then for X(K×K)NX\in\left(K\times K^{\prime}\right)^{N} we have

𝒩l21𝒫(X)[t]=[𝒩l,MLP(X[t]1:dx+t𝟏dx)X[t]dx+1𝟎γ].\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[t]=\begin{bmatrix}\mathcal{N}_{l,MLP}\left(X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\right)\\ X[t]_{d_{x}+1}\\ \mathbf{0}_{\gamma}\end{bmatrix}. (149)

Finally, define a recurrent cell $\mathcal{R}_{3}:\mathbb{R}^{(d_{x}+2+\gamma)\times N}\rightarrow\mathbb{R}^{(d_{x}+2+\gamma)\times N}$ of width $d_{x}+2+\gamma$ by

3(X)[t+1]=[Odx+1,dx+11Oγ,γ]3(X)[t]+[Idx111Oγ,γ]X[t+1],\mathcal{R}_{3}(X^{\prime})[t+1]=\begin{bmatrix}O_{d_{x}+1,d_{x}+1}&&\\ &1&&\\ &&O_{\gamma,\gamma}\end{bmatrix}\mathcal{R}_{3}(X^{\prime})[t]+\begin{bmatrix}I_{d_{x}}&&&\\ &1&&\\ &1&1&\\ &&&O_{\gamma,\gamma}\end{bmatrix}X^{\prime}[t+1], (150)

and obtain

\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[t]=\begin{bmatrix}X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\\ \mathcal{N}_{l,MLP}\left(X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\right)_{d_{x}+1}\\ \sum_{j=1}^{t}\mathcal{N}_{l,MLP}\left(X[j]_{1:d_{x}}+j\mathbf{1}_{d_{x}}\right)_{d_{x}+1}+\sum_{j=1}^{t}X[j]_{d_{x}+1}\\ \mathbf{0}_{\gamma}\end{bmatrix}. (151)

With the token-wise projection map $\mathcal{Q}:\mathbb{R}^{d_{x}+2+\gamma}\rightarrow\mathbb{R}^{d_{x}+1}$ defined by $\mathcal{Q}(X^{\prime})=\begin{bmatrix}X^{\prime}_{1:d_{x}}\\ X^{\prime}_{d_{x}+2}\end{bmatrix}$, the RNN $\mathcal{Q}\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N}$ of width $d_{x}+2+\gamma$ maps $X\in\mathbb{R}^{(d_{x}+1)\times N}$ to

𝒬3𝒩l21𝒫(X)[t]=[X[t]1:dx+t𝟏dxj=1t𝒩l,MLP(X[j]1:dx+j𝟏dx)dx+1+j=1tX[j]dx+1].\mathcal{Q}\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)[t]=\begin{bmatrix}X[t]_{1:d_{x}}+t\mathbf{1}_{d_{x}}\\ \sum_{j=1}^{t}\mathcal{N}_{l,MLP}\left(X[j]_{1:d_{x}}+j\mathbf{1}_{d_{x}}\right)_{d_{x}+1}+\sum_{j=1}^{t}X[j]_{d_{x}+1}\end{bmatrix}. (152)
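
As a sanity check of the composite construction, the following sketch (ours) replaces the approximating MLP $\mathcal{N}_{l,MLP}$ by the exact token-wise map $v\mapsto(v,\,b_{l}[t](v-t\mathbf{1}_{d_{x}}))$ that it approximates on $D_{t}$, composes $\mathcal{Q}\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}$ as linear maps, and verifies that the last output coordinate coincides with the last coordinate of $\mathcal{T}\mathcal{R}_{l}(X)[t]$ in (142); the toy sizes and the random $b_{l}[t]$ are assumptions.

# Minimal numerical sketch (ours, not the paper's code): compose Q R_3 N_l R_2 R_1 P
# as in (143)-(152), with the approximating MLP replaced by the exact token-wise map
# it approximates on D_t, and check the last output coordinate against (142).
import numpy as np

rng = np.random.default_rng(2)
N, dx, gamma = 4, 2, 1
b = rng.standard_normal((N, dx))                       # rows b_l[1], ..., b_l[N]
X = np.vstack([rng.uniform(0.0, 0.5, (dx, N)),         # tokens in K x K'
               rng.standard_normal((1, N))])

Y = np.zeros((dx + 2 + gamma, N))                      # the time-stamped tokens of (145)
for t in range(1, N + 1):
    Y[:dx, t - 1] = X[:dx, t - 1] + t
    Y[dx, t - 1] = t
    Y[dx + 1, t - 1] = X[dx, t - 1]

def N_l(Z):                                            # idealized token-wise map, cf. (148)
    H = np.zeros_like(Z)
    for t in range(1, N + 1):
        v = Z[:dx, t - 1]                              # lies in the box D_t
        H[:dx, t - 1] = v                              # cf. (146)
        H[dx, t - 1] = b[t - 1] @ (v - t)              # exact value approximated in (147)
        H[dx + 1, t - 1] = Z[dx + 1, t - 1]
    return H

def R3(Z):                                             # (150): running sum in row dx+2
    H = np.zeros_like(Z)
    prev = np.zeros(Z.shape[0])
    for t in range(N):
        H[:dx + 1, t] = Z[:dx + 1, t]
        H[dx + 1, t] = prev[dx + 1] + Z[dx, t] + Z[dx + 1, t]
        prev = H[:, t]
    return H

out = R3(N_l(Y))[[*range(dx), dx + 1], :]              # Q keeps rows 1..dx and dx+2
for t in range(1, N + 1):                              # compare with the last row of (142)
    target = sum(b[j - 1] @ X[:dx, j - 1] + X[dx, j - 1] for j in range(1, t + 1))
    assert np.isclose(out[dx, t - 1], target)
print("last coordinate of (152) equals that of TR_l(X)[t] in (142)")

Replacing the exact map by an MLP satisfying (146)-(147) changes each summand by less than $\delta_{l}$, so the accumulated sum changes by at most $N\delta_{l}$, consistent with (153).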

Since 𝒩l,MLP(X[j]1:dx+j𝟏dx)dx+1bl[j]X[j]1:dx\mathcal{N}_{l,MLP}\left(X[j]_{1:d_{x}}+j\mathbf{1}_{d_{x}}\right)_{d_{x}+1}\rightarrow b_{l}[j]X[j]_{1:d_{x}}, we have

supX(K×K)N𝒯l(X)𝒬3𝒩l21𝒫(X)0,\sup_{X\in\left(K\times K^{\prime}\right)^{N}}\left\|\mathcal{T}\mathcal{R}_{l}(X)-\mathcal{Q}\mathcal{R}_{3}\mathcal{N}_{l}\mathcal{R}_{2}\mathcal{R}_{1}\mathcal{P}(X)\right\|\rightarrow 0, (153)

as $\delta_{l}\rightarrow 0$. Approximating every cell $\mathcal{T}\mathcal{R}_{l}$ in the construction of Appendix D completes the proof.

Appendix F Proof of Lemma 19

The main idea of the proof is to separate the linear sum $\sum_{j=1}^{N}A_{j}[t]x[j]$ into the past-dependent part $\sum_{j=1}^{t-1}A_{j}[t]x[j]$ and the remainder $\sum_{j=t}^{N}A_{j}[t]x[j]$. We then construct a modified TBRNN with $2N$ cells; the first $N$ cells contain only a forward recurrent cell and compute the past-dependent part, while the last $N$ cells contain only a backward recurrent cell and compute the remainder.

Let the first NN modified TRNN cells l:(dx+1)×N(dx+1)×N\mathcal{R}_{l}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} for 1lN1\leq l\leq N be defined as in the proof of Lemma 11:

l(X)[t+1]=All(X)[t]+Bl[t]X[t+1],\mathcal{R}_{l}\left(X\right)[t+1]=A_{l}\mathcal{R}_{l}\left(X\right)[t]+B_{l}[t]X[t+1], (154)

where $A_{l}=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}$, $B_{l}[t]=\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ b_{l}[t]&1\end{bmatrix}$ for $b_{l}[t]\in\mathbb{R}^{1\times d_{x}}$. Then, with the token-wise lifting map $\mathcal{P}:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{x}+1}$ defined by $\mathcal{P}(x)=\begin{bmatrix}x\\ 0\end{bmatrix}$, we construct the modified TRNN $\mathcal{N}_{N}=\mathcal{R}_{N}\circ\cdots\circ\mathcal{R}_{1}\circ\mathcal{P}:\mathbb{R}^{d_{x}\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N}$. We know that if $C_{i}[m]\in\mathbb{R}^{1\times d_{x}}$ are given for $1\leq m\leq N$ and $1\leq i\leq m$, there exist $b_{l}[t]$ for $1\leq l,t\leq N$ such that

𝒩N(x)[m]=[x[m]i=1mCi[m]x[i]].\mathcal{N}_{N}(x)[m]=\begin{bmatrix}x[m]\\ \sum_{i=1}^{m}C_{i}[m]x[i]\end{bmatrix}. (155)

Therefore, we determine $C_{i}[m]$ after constructing the latter $N$ cells. For brevity, write $f_{m}=\sum_{i=1}^{m}C_{i}[m]x[i]$.

After 𝒩N\mathcal{N}_{N}, construct NN modified TRNN cells ¯l:(dx+1)×N(dx+1)×N\bar{\mathcal{R}}_{l}:\mathbb{R}^{(d_{x}+1)\times N}\rightarrow\mathbb{R}^{(d_{x}+1)\times N} for 1lN1\leq l\leq N in reverse order:

¯l(X¯)[t1]=A¯l¯l(X¯)[t]+B¯l[t]X¯[t1],\bar{\mathcal{R}}_{l}\left(\bar{X}\right)[t-1]=\bar{A}_{l}\bar{\mathcal{R}}_{l}\left(\bar{X}\right)[t]+\bar{B}_{l}[t]\bar{X}[t-1], (156)

where A¯l=[Odx,dxOdx,1O1,dx1]\bar{A}_{l}=\begin{bmatrix}O_{d_{x},d_{x}}&O_{d_{x},1}\\ O_{1,d_{x}}&1\end{bmatrix}, B¯l[t]=[IdxOdx,1b¯l[t]1]\bar{B}_{l}[t]=\begin{bmatrix}I_{d_{x}}&O_{d_{x},1}\\ \bar{b}_{l}[t]&1\end{bmatrix} for b¯l[t]1×dx\bar{b}_{l}[t]\in\mathbb{R}^{1\times d_{x}}. Define 𝒩¯N=¯N¯1\bar{\mathcal{N}}_{N}=\bar{\mathcal{R}}_{N}\circ\cdots\circ\bar{\mathcal{R}}_{1}, and we obtain the following result after a similar calculation with input sequence X¯[t]=𝒩N(x)[t]=[x[t]ft]\bar{X}[t]=\mathcal{N}_{N}(x)[t]=\begin{bmatrix}x[t]\\ f_{t}\end{bmatrix}:

𝒩¯N(X¯)[N+1t]\displaystyle\bar{\mathcal{N}}_{N}\left(\bar{X}\right)[N+1-t] (157)
=[x[N+1t]j=1t[i=1N(N+tijNi)b¯i[N+1j]x[N+1j]+(N+t1jN1)fN+1j]].\displaystyle=\begin{bmatrix}x[N+1-t]\\ \sum_{j=1}^{t}\left[\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}\bar{b}_{i}[N+1-j]x[N+1-j]+\binom{N+t-1-j}{N-1}f_{N+1-j}\right]\end{bmatrix}. (158)
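
The closed form (157)-(158) can also be checked by direct simulation. The sketch below (ours, not code from the paper) runs $N$ backward modified TRNN cells as in (156), treated as linear maps and with the weight index following the token index as in (158), and compares the last coordinate of the output at position $N+1-t$ with (158); the toy sizes, the random $\bar{b}_{i}[m]$, and the random stand-ins for $f_{t}$ are assumptions.

# Minimal numerical sketch (ours): run N backward modified TRNN cells on an input
# sequence Xbar[t] = (x[t], f_t) and compare the last coordinate of the output at
# position N+1-t with the closed form (158).
import numpy as np
from math import comb

rng = np.random.default_rng(3)
N, dx = 5, 3
x = rng.standard_normal((dx, N))
f = rng.standard_normal(N)                      # stand-ins for f_1, ..., f_N
bb = rng.standard_normal((N, N, dx))            # bb[i-1, m-1] = b-bar_i[m]

def binom0(n, k):                               # binomial coefficient, zero convention
    return comb(n, k) if 0 <= k <= n else 0

def backward_cell(Z, b):                        # reverse-time analogue of (154), cf. (156)
    H = np.zeros_like(Z)
    s = 0.0
    for t in range(N, 0, -1):                   # accumulate from the end of the sequence
        s = s + b[t - 1] @ Z[:dx, t - 1] + Z[dx, t - 1]
        H[:dx, t - 1] = Z[:dx, t - 1]
        H[dx, t - 1] = s
    return H

Xb = np.vstack([x, f[None, :]])                 # Xbar[t] = (x[t], f_t)
for i in range(N):
    Xb = backward_cell(Xb, bb[i])

for t in range(1, N + 1):                       # compare with (158) at position N+1-t
    rhs = sum(
        sum(binom0(N + t - i - j, N - i) * (bb[i - 1, N - j] @ x[:, N - j])
            for i in range(1, N + 1))
        + binom0(N + t - 1 - j, N - 1) * f[N - j]
        for j in range(1, t + 1))
    assert np.isclose(Xb[dx, N - t], rhs)
print("backward stack reproduces the closed form (158)")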

We want to find fmf_{m} and b¯i[m]\bar{b}_{i}[m] so that

𝒩¯N(X¯)[N+1t]dx+1=i=1NAi[N+1t]x[i],\bar{\mathcal{N}}_{N}\left(\bar{X}\right)[N+1-t]_{d_{x}+1}=\sum_{i=1}^{N}A_{i}[N+1-t]x[i], (159)

for each t=1,2,,Nt=1,2,\ldots,N.

Note that $\sum_{j=1}^{t}\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}\bar{b}_{i}[N+1-j]x[N+1-j]$ does not involve the terms $x[1],x[2],\ldots,x[N-t]$, so $\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}f_{N+1-j}$ must supply $\sum_{i=1}^{N-t}A_{i}[N+1-t]x[i]$.

j=1t(N+t1jN1)fN+1j\displaystyle\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}f_{N+1-j} =j=1t(N+t1jN1)i=1N+1jCi[N+1j]x[i]\displaystyle=\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}\sum_{i=1}^{N+1-j}C_{i}[N+1-j]x[i] (160)
=j=1ti=1N+1j(N+t1jN1)Ci[N+1j]x[i]\displaystyle=\sum_{j=1}^{t}\sum_{i=1}^{N+1-j}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]x[i] (161)
=i=N+2tNj=1N+1i(N+t1jN1)Ci[N+1j]x[i]\displaystyle=\sum_{i=N+2-t}^{N}\sum_{j=1}^{N+1-i}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]x[i] (162)
+i=1N+1tj=1t(N+t1jN1)Ci[N+1j]x[i].\displaystyle\qquad+\sum_{i=1}^{N+1-t}\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]x[i]. (163)

Since the matrix $\Lambda_{i}=\left\{\binom{N+t-1-j}{N-1}\right\}_{1\leq t\leq N+1-i,1\leq j\leq N+1-i}$ is a lower triangular $(N+1-i)\times(N+1-i)$ matrix with unit diagonal entries, it is invertible, and hence there exist $C_{i}[i],C_{i}[i+1],\ldots,C_{i}[N]$ such that

j=1t(N+t1jN1)Ci[N+1j]=Ai[N+1t],\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]=A_{i}[N+1-t], (164)

for each t=1,2,,N+1it=1,2,\ldots,N+1-i.
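
Since $\Lambda_{i}$ is unit lower triangular, the system (164) can be solved by forward substitution. The sketch below (ours; the toy sizes and the random targets $A_{i}[m]$ are assumptions) verifies the triangular structure and recovers $C_{i}[i],\ldots,C_{i}[N]$ numerically.

# Minimal numerical sketch (ours): Lambda_i is unit lower triangular, so (164) has a
# unique solution C_i[i], ..., C_i[N]; verify this for toy sizes and random targets.
import numpy as np
from math import comb

rng = np.random.default_rng(4)
N, dx, i = 6, 2, 3
n = N + 1 - i                                    # size of Lambda_i

def binom0(a, k):                                # binomial coefficient, zero convention
    return comb(a, k) if 0 <= k <= a else 0

Lam = np.array([[binom0(N + t - 1 - j, N - 1) for j in range(1, n + 1)]
                for t in range(1, n + 1)], dtype=float)
assert np.allclose(np.triu(Lam, 1), 0) and np.allclose(np.diag(Lam), 1)

A_target = rng.standard_normal((n, dx))          # stands in for A_i[N+1-t], t = 1, ..., N+1-i
Csol = np.linalg.solve(Lam, A_target)            # Csol[j-1] = C_i[N+1-j]
for t in range(1, n + 1):                        # re-check (164)
    lhs = sum(binom0(N + t - 1 - j, N - 1) * Csol[j - 1] for j in range(1, t + 1))
    assert np.allclose(lhs, A_target[t - 1])
print("Lambda_i is unit lower triangular; (164) solved for the C_i's")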

We now have

j=1t(N+t1jN1)fN+1j\displaystyle\sum_{j=1}^{t}\binom{N+t-1-j}{N-1}f_{N+1-j} (165)
=i=N+2tNj=1N+1i(N+t1jN1)Ci[N+1j]x[i]+i=1N+1tAi[N+1t]x[i]\displaystyle=\sum_{i=N+2-t}^{N}\sum_{j=1}^{N+1-i}\binom{N+t-1-j}{N-1}C_{i}[N+1-j]x[i]+\sum_{i=1}^{N+1-t}A_{i}[N+1-t]x[i] (166)
=i=1t1j=1i(N+t1jN1)CN+1i[N+1j]x[N+1i]+i=1N+1tAi[N+1t]x[i]\displaystyle=\sum_{i=1}^{t-1}\sum_{j=1}^{i}\binom{N+t-1-j}{N-1}C_{N+1-i}[N+1-j]x[N+1-i]+\sum_{i=1}^{N+1-t}A_{i}[N+1-t]x[i] (167)
=j=1t1i=1j(N+t1iN1)CN+1j[N+1i]x[N+1j]+i=1N+1tAi[N+1t]x[i].\displaystyle=\sum_{j=1}^{t-1}\sum_{i=1}^{j}\binom{N+t-1-i}{N-1}C_{N+1-j}[N+1-i]x[N+1-j]+\sum_{i=1}^{N+1-t}A_{i}[N+1-t]x[i]. (168)

We switched $i$ and $j$ in the last equality. By Corollary 26, there exist $\bar{b}_{i}[N+1-j]$ satisfying

i=1N(N+tijNi)b¯i[N+1j]\displaystyle\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}\bar{b}_{i}[N+1-j] (169)
=AN+1j[N+1t]i=1j(N+t1iN1)CN+1j[N+1i],\displaystyle=A_{N+1-j}[N+1-t]-\sum_{i=1}^{j}\binom{N+t-1-i}{N-1}C_{N+1-j}[N+1-i], (170)
for j=1,2,,t1j=1,2,\ldots,t-1, and
i=1N(N+tijNi)b¯i[N+1j]\displaystyle\sum_{i=1}^{N}\binom{N+t-i-j}{N-i}\bar{b}_{i}[N+1-j] (171)
=AN+1j[N+1t],\displaystyle=A_{N+1-j}[N+1-t], (172)

for j=tj=t.
With the above $C_{i}[m]$ and $\bar{b}_{i}[m]$, equation (159) holds for each $t=1,2,\ldots,N$. It remains to construct the modified TRNN cells that implement $f_{m}$, which follows directly from the proof of Lemma 11.

References

  • Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Bahdanau et al. (2016) Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4945–4949. IEEE, 2016.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
  • Cohen et al. (2016) Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728. PMLR, 2016.
  • Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • Dai et al. (2021) Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021.
  • Elman (1990) Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • Garnelo et al. (2018) Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
  • Gers et al. (2000) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural computation, 12(10):2451–2471, 2000.
  • Graves and Jaitly (2014) Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772. PMLR, 2014.
  • Hanin and Sellke (2017) Boris Hanin and Mark Sellke. Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.
  • Hanson and Raginsky (2020) Joshua Hanson and Maxim Raginsky. Universal simulation of stable dynamical systems by recurrent neural nets. In Learning for Dynamics and Control, pages 384–392. PMLR, 2020.
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • Johnson (2019) Jesse Johnson. Deep, skinny neural networks are not universal approximators. In International Conference on Learning Representations, 2019.
  • Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
  • Kidger and Lyons (2020) Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In Conference on learning theory, pages 2306–2327. PMLR, 2020.
  • Lavanya and Sasikala (2021) PM Lavanya and E Sasikala. Deep learning techniques on text classification using natural language processing (NLP) in social healthcare network: A comprehensive survey. In 2021 3rd International Conference on Signal Processing and Communication (ICPSC), pages 603–609. IEEE, 2021.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Leshno et al. (1993) Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861–867, 1993.
  • Lin and Jegelka (2018) Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a universal approximator. Advances in Neural Information Processing Systems, 31, 2018.
  • Lu et al. (2017) Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, pages 1045–1048. Makuhari, 2010.
  • Park et al. (2020) Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. arXiv preprint arXiv:2006.08859, 2020.
  • Park et al. (2021) Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. In International Conference on Learning Representations, 2021.
  • Rolnick and Tegmark (2018) David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In the International Conference on Learning Representations, 2018.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • Schäfer and Zimmermann (2007) Anton Maximilian Schäfer and Hans-Georg Zimmermann. Recurrent neural networks are universal approximators. International journal of neural systems, 17(04):253–263, 2007.
  • Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.
  • Wu et al. (2017) Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. Recurrent recommender networks. In Proceedings of the tenth ACM international conference on web search and data mining, pages 495–503, 2017.
  • Yun et al. (2020) Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.
  • Zhai et al. (2022) Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
  • Zhou (2020) Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and computational harmonic analysis, 48(2):787–794, 2020.