
Fast Saturating Gate for Learning Long Time Scales with
Recurrent Neural Networks

Kentaro Ohno, Sekitoshi Kanai, Yasutoshi Ida
Abstract

Gate functions in recurrent models, such as an LSTM and GRU, play a central role in learning various time scales in modeling time series data by using a bounded activation function. However, it is difficult to train gates to capture extremely long time scales due to gradient vanishing of the bounded function for large inputs, which is known as the saturation problem. We closely analyze the relation between saturation of the gate function and efficiency of the training. We prove that the gradient vanishing of the gate function can be mitigated by accelerating the convergence of the saturating function, i.e., making the output of the function converge to 0 or 1 faster. Based on the analysis results, we propose a gate function called fast gate that has a doubly exponential convergence rate with respect to inputs by simple function composition. We empirically show that our method outperforms previous methods in accuracy and computational efficiency on benchmark tasks involving extremely long time scales.

1 Introduction

Recurrent neural networks (RNNs) are models suited to processing sequential data in various applications, e.g., speech recognition (Ling et al. 2020) and video analysis (Zhu et al. 2020). The most widely used RNNs are the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and gated recurrent unit (GRU) (Cho et al. 2014), which have a gating mechanism. The gating mechanism controls the information flow in the state of RNNs via multiplication with a gate function bounded to the range $[0,1]$. For example, when the forget gate takes a value close to 1 (or 0 for the update gate in the GRU), the state preserves the previous information. On the other hand, when it gets close to the other boundary, the RNN updates the state with the current input. Thus, in order to represent long temporal dependencies of data involving hundreds or thousands of time steps, it is crucial for the forget gate to take values near the boundaries (Tallec and Ollivier 2018; Mahto et al. 2021).

However, it is difficult to train RNNs so that they have gate values near the boundaries. Previous studies hypothesized that this is due to gradient vanishing of the gate function, known as saturation (Chandar et al. 2019; Gu et al. 2020b), i.e., the gradient of the gate function near the boundary is too small to effectively update the parameters. To avoid the saturation problem, a previous study used unbounded activation functions (Chandar et al. 2019). However, this makes training unstable due to gradient explosion (Pascanu, Mikolov, and Bengio 2013). Another study introduced a residual connection for the gate function to push the output value toward the boundaries, hence mitigating the saturation problem (Gu et al. 2020b). However, it incurs additional computational cost because it increases the number of parameters by adding another gate function. For broader application of gated RNNs, a more efficient solution is necessary.

To overcome the difficulty of training, we propose a novel activation function for the forget gate based on the usual sigmoid function, which we call the fast gate. Modifying the usual sigmoid gate into the fast gate is simple and easy to implement since it requires only one additional function composition. To this end, we analyze the relation between saturation and gradient vanishing of the bounded activation function. Specifically, we focus on the convergence rate of the activation function to the boundary, which we call the order of saturation. For example, the sigmoid function $\sigma(z)=1/(1+e^{-z})$ has the exponential order of saturation, i.e., $1-\sigma(z)=O(e^{-z})$ (see Fig. 1), and the derivative also decays to 0 exponentially as $z$ goes to infinity. When a bounded activation function has a higher order of saturation, the derivative decays much faster as the input grows. Since previous studies have assumed that the decaying derivative in the saturating regime causes training to stall (Ioffe and Szegedy 2015), it seems that a higher order of saturation would lead to poor training. Contrary to this intuition, we prove that a higher order of saturation alleviates the gradient vanishing in the saturating regime, through an observation on a toy problem for learning long time scales. This result indicates that functions saturating superexponentially are more suitable for the forget gate to learn long time scales than the sigmoid function. On the basis of this observation, we explore a method of realizing such functions by composing functions that increase faster than the identity function (e.g., $\alpha(z)=z+z^{3}$) as $\sigma(\alpha(z))$. We find that the hyperbolic sine function is suitable for achieving a higher order of saturation in a simple way, and we obtain the fast gate. Since the fast gate has a doubly exponential order of saturation $O(e^{-e^{z}})$, it improves the trainability of gated RNNs for long time scales of sequential data. We evaluate the computational efficiency and accuracy of a model with the fast gate on several benchmark tasks, including synthetic tasks, pixel-by-pixel image classification, and language modeling, which involve a wide range of time scales. The model with the fast gate empirically outperforms other models, including an LSTM with the sigmoid gate and variants recently proposed for tackling the saturation problem (Chandar et al. 2019; Gu et al. 2020b), in terms of accuracy and convergence speed of training while maintaining training stability. Further visualization analysis of learning time scales shows that our theory fits the learning dynamics of actual models and that the fast gate can learn extremely long time scales of thousands of time steps.

Our major contributions are as follows:

  • We prove that gate functions which saturate faster actually accelerate learning of values near the boundaries. This result indicates that fast saturation improves the learnability of gated RNNs on data with long time scales.

  • We propose the fast gate that saturates faster than the sigmoid function. In spite of its simplicity, the fast gate achieves a doubly exponential order of saturation, and thus effectively improves learning of long time scales.

  • We evaluate the effectiveness of the fast gate against recently proposed methods such as an NRU (Chandar et al. 2019) and a refine gate (Gu et al. 2020b). The results verify that the fast gate robustly improves the learnability for long-term dependencies in both synthetic and real data.

Figure 1: Function values and derivatives of various bounded activation functions. The sigmoid function $\sigma$ (orange) converges exponentially to 1 as $z\to\infty$, since $1-\sigma(z)=\frac{1}{1+e^{z}}\approx e^{-z}$; its derivative also decays exponentially. The normalized version of the softsign function $\mathrm{softsign}(z)=\frac{z}{1+|z|}$ (blue) converges to 1 more slowly. The fast gate (red) is the proposed gate function. Although its gradient decays faster than that of the sigmoid function, it provably helps learning values near the boundaries.

2 Preliminaries

2.1 Time Scales in Gated RNNs

In this section, we review gated RNNs and their time scale interpretation (Tallec and Ollivier 2018). We begin with the LSTM (Hochreiter and Schmidhuber 1997), one of the most popular RNNs. An LSTM has a memory cell $c_{t}\in\mathbb{R}^{n}$ and hidden state $h_{t}\in\mathbb{R}^{n}$ inside, which are updated depending on the sequential input data $x_{t}$ at each time step $t=1,2,\cdots$ by

$c_{t} = f_{t}\odot c_{t-1}+i_{t}\odot\tilde{c}_{t}$   (1)
$h_{t} = o_{t}\odot\tanh(c_{t})$   (2)
$f_{t} = \sigma(W_{f}x_{t}+U_{f}h_{t-1}+b_{f})$   (3)
$i_{t} = \sigma(W_{i}x_{t}+U_{i}h_{t-1}+b_{i})$   (4)
$\tilde{c}_{t} = \tanh(W_{c}x_{t}+U_{c}h_{t-1}+b_{c})$   (5)
$o_{t} = \sigma(W_{o}x_{t}+U_{o}h_{t-1}+b_{o})$   (6)

where $W_{*}$, $U_{*}$, and $b_{*}$ are weight and bias parameters for each $*\in\{f,i,c,o\}$. The sigmoid function $\sigma$ is defined as

$\sigma(x)=\frac{1}{1+e^{-x}}.$   (7)

$f_{t},i_{t},o_{t}\in[0,1]^{n}$ are called the forget, input, and output gates, respectively. They were initially motivated as a binary mechanism, i.e., switching on and off, allowing information to pass through (Gers, Schmidhuber, and Cummins 2000). The forget gate has been reinterpreted as the representation for time scales of memory cells (Tallec and Ollivier 2018). Following that study, we simplify Eq. (1) by assuming $\tilde{c}_{t}=0$ for an interval $t\in[t_{0},t_{1}]$. Then, we obtain

$c_{t_{1}} = f_{t_{1}}\odot c_{t_{1}-1}$   (8)
$= \bar{f}^{\,t_{1}-t_{0}}\odot c_{t_{0}},$   (9)

where $\bar{f}=(\prod_{s=t_{0}+1}^{t_{1}}f_{s})^{\frac{1}{t_{1}-t_{0}}}$ is the (entry-wise) geometric mean of the values of the forget gate. Through Eq. (8), the memory cell $c_{t}$ loses its information on data up to time $t_{0}$ exponentially, and each entry of $\bar{f}$ represents its (averaged) decay rate. This indicates that, in order to capture long-term dependencies of the sequential data, the forget gate should take values near 1 on average. We refer to the associated time constant $T=-1/\log{\bar{f}}$ as the time scale of a unit, which has been shown empirically to describe the temporal behavior of LSTMs well (Mahto et al. 2021). (An exponential function $F(t)=e^{-\alpha t}$ of time $t$ decreases by a factor of $1/e$ in time $T=1/\alpha$, which is called the time constant.)
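As a rough illustration (ours, not from the paper), the time-constant reading of the forget gate can be computed in a few lines; the gate values below are made-up examples.

```python
import numpy as np

def time_scale(f_bar):
    """Time constant T = -1/log(f_bar) of a unit whose mean forget gate value is f_bar."""
    return -1.0 / np.log(f_bar)

# Gate values closer to 1 correspond to exponentially longer memory.
for f_bar in [0.9, 0.99, 0.999, 0.9999]:
    print(f"f_bar = {f_bar}: time scale ~ {time_scale(f_bar):.0f} steps")
```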

The above argument applies not only to an LSTM, but also to general gated RNNs including a GRU (Cho et al. 2014) with state update of the form

$h_{t}=f_{t}\odot h_{t-1}+i_{t}\odot\tilde{h}_{t},$   (10)

where $h_{t}$, $f_{t}$, and $i_{t}$ denote the state, forget gate, and input gate, respectively, and $\tilde{h}_{t}$ is the activation representing new information at time $t$. Here again, the forget gate $f_{t}$ plays the role of controlling the time scale of each unit of the state.
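The sketch below is our own minimal re-implementation of the LSTM update in Eqs. (1)-(6), with the forget-gate activation left as a pluggable argument, since that is the single component the rest of the paper modifies; parameter shapes and names are our choices.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, params, forget_act=torch.sigmoid):
    """One LSTM step, Eqs. (1)-(6). `params` holds (W_*, U_*, b_*) for * in f, i, c, o."""
    (Wf, Uf, bf), (Wi, Ui, bi), (Wc, Uc, bc), (Wo, Uo, bo) = params
    f_t = forget_act(x_t @ Wf + h_prev @ Uf + bf)        # forget gate, Eq. (3)
    i_t = torch.sigmoid(x_t @ Wi + h_prev @ Ui + bi)     # input gate, Eq. (4)
    c_tilde = torch.tanh(x_t @ Wc + h_prev @ Uc + bc)    # candidate cell, Eq. (5)
    o_t = torch.sigmoid(x_t @ Wo + h_prev @ Uo + bo)     # output gate, Eq. (6)
    c_t = f_t * c_prev + i_t * c_tilde                   # memory cell update, Eq. (1)
    h_t = o_t * torch.tanh(c_t)                          # hidden state, Eq. (2)
    return h_t, c_t

# Tiny usage with random parameters (input dimension 2, hidden dimension 4).
d, n = 2, 4
params = [(torch.randn(d, n), torch.randn(n, n), torch.zeros(n)) for _ in range(4)]
h, c = lstm_step(torch.randn(1, d), torch.zeros(1, n), torch.zeros(1, n), params)
```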

2.2 Saturation in Gating Activation Functions

The sigmoid function $\sigma(z)$ in the gating mechanism requires a large $z$ to output a value near 1. On the other hand, the derivative $\sigma^{\prime}(z)$ takes exponentially small values for $z\gg 0$ (Fig. 1). Thus, when a gated model needs to learn large gate values such as $0.99$ with gradient methods, the parameters in the gate cannot be effectively updated due to gradient vanishing. This is called saturation of bounded activation functions (Gulcehre et al. 2016). The behavior of gate functions in the saturating regime is important for gated RNNs because forget gate values need to be large to represent long time scales, as explained in Section 2.1. That is, gated RNNs must face saturation of the forget gate to learn long time scales. Thus, it has been hypothesized that saturation causes difficulty in training gated RNNs on data with extremely long time scales (Chandar et al. 2019; Gu et al. 2020b).
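A quick numerical illustration (ours): the pre-activation needed for a gate value of 0.99 already sits deep in the saturating regime, where the sigmoid's derivative is more than an order of magnitude smaller than at zero.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

z = np.log(0.99 / 0.01)          # pre-activation with sigma(z) = 0.99, about 4.6
print(dsigmoid(0.0))             # ~0.25   : gradient at z = 0
print(dsigmoid(z))               # ~0.0099 : gradient where the gate outputs 0.99
```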

3 Related Work

Although there is abundant literature on learning long-term dependencies with RNNs, we outline only the most related studies in this section due to the space limitation and provide additional discussion of other studies in Appendix A.

Several studies investigate the time scale representation of the forget gate function to improve learning on data involving long-term dependencies (Tallec and Ollivier 2018; Mahto et al. 2021). For example, the performance of LSTM language models can be improved by fixing the bias parameter of the forget gate in accordance with a power-law time scale distribution, which underlies natural language (Mahto et al. 2021). Such techniques require us to know the appropriate time scales of the data a priori, which is often difficult. Note that this approach can be combined with our method since it is complementary to our work.

Several modifications of the gate function have been proposed to tackle the saturation problem. A noisy gradient for a piece-wise linear gate function was proposed to prevent the gradient from taking zero values (Gulcehre et al. 2016). This training protocol includes hyperparameters controlling the noise level, which require manual tuning. Furthermore, such a stochastic approach can result in unstable training due to gradient estimation bias (Bengio, Léonard, and Courville 2013). The refine gate (Gu et al. 2020b) was proposed as another modification, introducing a residual connection to push the gate value toward the boundaries. It is rather heuristic and does not provide theoretical justification. It also requires additional parameters for an auxiliary gate, which increases the computational cost for both inference and training. In contrast, our method theoretically improves learnability and does not introduce any additional parameters. Another study suggests that omitting gates other than the forget gate makes training models for long time scales easier (Van Der Westhuizen and Lasenby 2018). However, such simplification may lose the expressive power of the model and limit its application fields. Chandar et al. (2019) proposed an RNN with a non-saturating activation function to directly avoid gradient vanishing due to saturation. Since its state and memory vector evolve in unbounded regions, the behavior of the gradient can be unstable depending on the task. Our method mitigates gradient vanishing by controlling the order of saturation while maintaining bounded state transitions.

4 Analysis on Saturation and Learnability

We discuss the learning behavior of the forget gate for long time scales. First, we formulate a problem of learning long time scales in a simplified setting. Next, we relate the efficiency of learning on this problem to the saturation of the gate functions. We conclude that faster saturation makes learning more efficient. All proofs of the mathematical results below are given in Appendix C.

4.1 Problem Setting

Recall Eq. (8), which describes the time scales of the memory cell $c_{t}$ of an LSTM via exponential decay. Let the memory cell at time $t_{1}$ be $c_{t_{1}}=\lambda c_{t_{0}}$ with $\lambda\in[0,1]$. Requiring long time scales corresponds to getting $\lambda$ close to 1. Therefore, we can consider a long-time-scale learning problem as minimizing a loss function $L$ that measures the discrepancy between $c_{t_{1}}$ and $\lambda_{*}c_{t_{0}}$, where $\lambda_{*}\in[0,1]$ is a desired value close to 1. We take $L$ to be the absolute loss, for example. Then, we obtain

$L = |c_{t_{1}}-\lambda_{*}c_{t_{0}}|$   (11)
$= |\bar{f}^{\,t_{1}-t_{0}}c_{t_{0}}-\lambda_{*}c_{t_{0}}|$   (12)
$= c_{t_{0}}|\bar{f}^{\,t_{1}-t_{0}}-\lambda_{*}|,$   (13)

using Eq. (8). Let $z_{t}=W_{f}x_{t}+U_{f}h_{t-1}+b_{f}$, so that $f_{t}=\sigma(z_{t})$. Since we are interested in the averaged value of $f_{t}$, we consider $z_{t}$ to be time-independent, that is, $z_{t}=z$, in the same way as Tallec and Ollivier (2018). The problem is then reduced to finding $z$ that minimizes

$L(z)=c_{t_{0}}|\sigma(z)^{t_{1}-t_{0}}-\lambda_{*}|.$   (14)

We consider this as the minimal problem for analyzing the learnability of the forget gate for long time scales. Note that since the product $c_{t_{1}}=\bar{f}^{\,t_{1}-t_{0}}c_{t_{0}}$ is taken element-wise, we can consider this as a one-dimensional problem. Furthermore, the global solution can be written explicitly as $z=\sigma^{-1}(\lambda_{*}^{1/(t_{1}-t_{0})})$, where $\sigma^{-1}$ is the inverse of $\sigma$.
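For concreteness, the closed-form solution $z=\sigma^{-1}(\lambda_{*}^{1/(t_{1}-t_{0})})$ can be evaluated numerically; the horizon and target values below are illustrative choices of ours, not values used in the paper.

```python
import numpy as np

def target_preactivation(lam_star, horizon):
    """Solve sigma(z)**horizon = lam_star for z, i.e. z = logit(lam_star**(1/horizon))."""
    f_star = lam_star ** (1.0 / horizon)
    return np.log(f_star / (1.0 - f_star))

print(target_preactivation(0.9, 10))      # ~4.5
print(target_preactivation(0.9, 1000))    # ~9.2: longer horizons need much larger z
```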

Next, we consider the learning dynamics of the model on the problem in Eq. (14). RNNs are usually trained with gradient methods. Learning dynamics under gradient methods can be analyzed in the limit of the learning rate going to 0, known as gradient flow (Harold and George 2003). Therefore, we consider the following gradient flow

$\frac{dz}{d\tau}=-\frac{\partial L}{\partial z},$   (15)

using the loss function introduced above. Here, $\tau$ denotes the time variable of the learning dynamics, which should not be confused with $t$ representing the state transition. Our aim is to investigate the convergence rate of a solution of the differential equation Eq. (15) when $\sigma$ in the forget gate is replaced with another function $\phi$.
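A minimal sketch (ours, with arbitrary constants) of this learning dynamics: the gradient flow in Eq. (15) can be approximated by plain gradient descent on $L(z)$, here run for the sigmoid gate and, for comparison, the fast gate introduced later in Section 5.

```python
import numpy as np

def run_gd(act, dact, horizon=10, lr=1.0, steps=2000, z0=1.0):
    """Gradient descent on L(z) = 1 - act(z)**horizon (lambda_* = 1, c_{t0} = 1)."""
    z = z0
    for _ in range(steps):
        f = act(z)
        grad = -horizon * f ** (horizon - 1) * dact(z)   # dL/dz
        z -= lr * grad
    return act(z)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
fast = lambda z: sigmoid(np.sinh(z))                     # fast gate, Eq. (20)
dfast = lambda z: dsigmoid(np.sinh(z)) * np.cosh(z)

print(1.0 - run_gd(sigmoid, dsigmoid))  # gap to 1 left by the sigmoid gate
print(1.0 - run_gd(fast, dfast))        # far smaller gap with the fast gate
```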

4.2 Order of Saturation

To investigate the effect of the choice of gate function on the convergence rate, we first define the candidate set $\mathcal{F}$ of bounded functions for the gate function.

Definition 4.1.

Let $\mathcal{F}$ be the set of differentiable, strictly increasing surjective functions $\phi:\mathbb{R}\to[0,1]$ such that the derivative $\phi^{\prime}$ is monotone on $z>z_{0}$ for some $z_{0}\geq 0$.

$\mathcal{F}$ is a natural class of gating activation functions, including $\sigma$. As explained in Section 2.2, gated RNNs suffer from gradient vanishing due to saturation when learning long time scales. To clarify the issue, we first show that saturation is inevitable regardless of the choice of $\phi\in\mathcal{F}$.

Proposition 4.2.

$\lim_{z\to\infty}\phi^{\prime}(z)=0$ holds for any $\phi\in\mathcal{F}$.

Nevertheless, the choice of $\phi$ significantly affects the efficiency of training. When the target $\lambda_{*}$ takes an extreme value near the boundaries, the efficiency of training should depend on the asymptotic behavior of $\phi(z)$ for $z\gg 0$, that is, the rate at which $\phi(z)$ converges as $z\to\infty$. We call this convergence rate the order of saturation. More precisely, we define the notion as follows. (Our definition of asymptotic order differs slightly from the usual one, which adopts $\limsup_{z\to\infty}\frac{g(z)}{1-\phi(z)}<\infty$, since ours is more suitable for analyzing training efficiency.)

Definition 4.3.

Let $g:\mathbb{R}\to\mathbb{R}$ be a decreasing function. $\phi\in\mathcal{F}$ has the order of saturation $O(g(z))$ if $\lim_{z\to\infty}\frac{g(az)}{1-\phi(z)}=0$ for some $a>0$. For $\phi,\tilde{\phi}\in\mathcal{F}$, $\phi$ has a higher order of saturation than $\tilde{\phi}$ if $\lim_{z\to\infty}\frac{1-\phi(z)}{1-\tilde{\phi}(az)}=0$ holds for any $a>0$ and ${\tilde{\phi}}^{-1}(\phi(z))$ is convex for $z\gg 0$.

Intuitively, the order of saturation $O(g(z))$ means that the convergence rate of $\phi$ to 1 is bounded by the decay rate of $g$, up to constant multiplication of $z$. For example, the sigmoid function $\sigma$ satisfies $e^{-az}/(1-\sigma(z))\to 0$ as $z\to\infty$ for any $a>1$, and thus has the exponential order of saturation $O(e^{-z})$. The convexity condition for a higher order of saturation is rather technical but is automatically satisfied for typical functions; see Appendix C.2. If $\phi$ has a higher order of saturation (or saturates faster) than another function $\tilde{\phi}$, then $\phi(z)$ converges faster than $\tilde{\phi}(z)$ as $z\to\infty$, and $\phi^{\prime}(z)$ becomes smaller than $\tilde{\phi}^{\prime}(z)$. In this sense, training with $\tilde{\phi}$ seems more efficient than with $\phi$ in the above problem. However, this is not the case, as we discuss in the next section.

4.3 Efficient Learning via Fast Saturation

To precisely analyze the learning behavior, we trace the learning dynamics of the output value $f=\phi(z)$, since our purpose is to obtain the desired output value rather than the input $z$. We transform the learning dynamics (Eq. (15)) into that of $f$ by

$\frac{df}{d\tau}=\frac{dz}{d\tau}\frac{df}{dz}=-\phi^{\prime}(z)\frac{\partial L}{\partial z}=-\phi^{\prime}(z)^{2}\frac{\partial L}{\partial f}.$   (16)

To treat Eq. (16) purely in terms of $f$, we define a function $g_{\phi}(f)$ of $f$ by $g_{\phi}(f):=\phi^{\prime}(\phi^{-1}(f))$, so that Eq. (16) becomes

$\frac{df}{d\tau}=-g_{\phi}(f)^{2}\frac{\partial L}{\partial f}.$   (17)

Our interest is in the dynamics of $f$ near the boundary, i.e., in the limit $f\to 1$. We have the following result:

Theorem 4.4.

Let $\phi,\tilde{\phi}\in\mathcal{F}$. If $\phi$ has a higher order of saturation than $\tilde{\phi}$, then $g_{\phi}(f)/g_{\tilde{\phi}}(f)\to\infty$ as $f\to 1$.

Theorem 4.4 indicates that a higher order of saturation accelerates the movement of the output $f$ near the boundaries in accordance with Eq. (17), since $g_{\phi}(f)$ takes larger values. Thus, contrary to the intuition in Section 4.2, a higher order of saturation leads to more efficient training for target values near the boundaries. We demonstrate this effect using two activation functions, the sigmoid function $\sigma(z)$ and the normalized softsign function $\sigma_{\rm ns}(z)=(\mathrm{softsign}(z/2)+1)/2$, where $\mathrm{softsign}(z)=z/(1+|z|)$. $\sigma_{\rm ns}$ is the softsign function modified so that $0\leq\sigma_{\rm ns}(z)\leq 1$ and $\sigma_{\rm ns}^{\prime}(0)=\sigma^{\prime}(0)$. $\sigma$ has a higher order of saturation than $\sigma_{\rm ns}$ since $\sigma$ has the order of saturation $O(e^{-z})$ and $\sigma_{\rm ns}$ has $O(z^{-1})$ (see Fig. 1). We plot the learning dynamics of the gradient flow for the problem in Fig. 2. Since $\sigma$ has a higher order of saturation than $\sigma_{\rm ns}$, the gate value $f$ of $\sigma_{\rm ns}$ converges to the boundary more slowly. Fig. 2 also shows the dynamics of gradient descent with learning rate 1. While gradient descent is a discrete approximation of the gradient flow, it behaves similarly to the gradient flow.
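The same comparison can be made numerically by writing $g_{\phi}(f)=\phi^{\prime}(\phi^{-1}(f))$ in closed form for the two gates (our own derivation from the definitions above); the ratio grows without bound as $f\to 1$, in line with Theorem 4.4.

```python
import numpy as np

# g_phi(f) = phi'(phi^{-1}(f)) in closed form for each gate (valid for f >= 1/2).
g_sigmoid  = lambda f: f * (1.0 - f)        # phi = sigma
g_softsign = lambda f: (1.0 - f) ** 2       # phi = sigma_ns: phi'(z) = (1 - phi(z))^2 for z > 0

for f in [0.9, 0.99, 0.999]:
    print(f, g_sigmoid(f), g_softsign(f), g_sigmoid(f) / g_softsign(f))
# The ratio grows like 1/(1 - f), so the sigmoid's output moves much faster near 1.
```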

Figure 2: Learning curves for the simplified long-time-scale learning problem with gradient descent (markers) and gradient flow (solid lines). Gradient descent is run with learning rate 1. The time difference $t_{1}-t_{0}$ is set to 10. Dashed lines are the lower bounds given in Tab. 1, fitted to each learning curve with a suitable translation. These lower bounds approximate the asymptotic convergence of the gradient flow well. Results for the refine and fast gates are explained in Section 5.3.

Explicit convergence rates. Beyond Theorem 4.4, we can explicitly calculate effective bounds on the convergence rate for the problem when the activation function is the sigmoid function $\sigma(z)$ or the normalized softsign function $\sigma_{\rm ns}(z)$.

Proposition 4.5.

Consider the problem in Section 4.1 with the absolute loss $L=|f^{t_{1}-t_{0}}-\lambda_{*}|$ and $\lambda_{*}=1$. For the sigmoid function $f=\sigma(z)$, the convergence rate for the problem is bounded as $1-f=O(\tau^{-1})$. Similarly, for the normalized softsign function $f=\sigma_{\rm ns}(z)$, the convergence rate is bounded as $1-f=O(\tau^{-1/3})$.

Proposition 4.5 shows the quantitative effect of the difference in the order of saturation on the convergence rates. We fit the bounds to the learning curves of the gradient flow in Fig. 2. The convergence rates of the learning are well approximated by the bounds. This asymptotic analysis highlights that the choice of the function $\phi$ significantly affects the efficiency of training for long time scales.

5 Proposed Method

On the basis of the analysis in Section 4, we construct the fast gate, which is suitable for learning long time scales.

5.1 Desirable Properties for Gate Functions

We consider modifying the usual sigmoid function into another function $\phi\in\mathcal{F}$ for the forget gate in a gated RNN. The function $\phi$ should satisfy the following conditions.

  (i) $\phi$ has a higher order of saturation than $\sigma$,

  (ii) $\phi(z)\approx\sigma(z)$ for $z\approx 0$,

  (iii) $\phi$ is symmetric in the sense that $\phi(-z)=1-\phi(z)$.

Condition (i) comes from the argument in the previous section that fast-saturating functions learn values near the boundaries efficiently. Conditions (ii) and (iii) indicate that the function $\phi(z)$ behaves similarly to $\sigma(z)$ around $z=0$. In order to avoid possible harmful effects of the modification, we do not want to change the behavior of the function away from the saturating regime; hence, we require these conditions. The requirements are analogous to those of Gu et al. (2020b, Section 3.4) for the gate adjustment. The first condition can be viewed as a theoretical refinement of their heuristic modification.

5.2 Fast Gate

We explore gate functions satisfying the above conditions. Recall that the sigmoid function $\sigma(z)$ has the exponential order of saturation. From condition (i) in the previous section, we explore functions saturating superexponentially. Since any superexponential order can be written as $O(e^{-\alpha(z)})$ with a function satisfying $\alpha(z)>z$ for large $z$, it is enough to consider a function of the form $\phi(z)=\sigma(\alpha(z))$ for such $\alpha$. The desirable properties in Section 5.1 are rephrased as follows in terms of $\alpha$: (i) $\alpha(z)\gg z$ for $z\gg 0$, (ii) $\alpha^{\prime}(0)=1$, and (iii) $\alpha(-z)=-\alpha(z)$ for $z\in\mathbb{R}$. Such functions can be found in the form $\alpha(z)=z+p(z)$, where $p$ is a polynomial consisting only of odd higher-degree terms, such as $\alpha(z)=z+z^{3}$. Since a higher-degree term has a larger effect on the order of saturation, it mitigates gradient vanishing of the gate function more, in accordance with Theorem 4.4. Thus, we take the limit of the degree to infinity, which leads to a simple function expression

$\alpha(z) = z+\frac{z^{3}}{3!}+\frac{z^{5}}{5!}+\cdots$   (18)
$= \sinh(z):=\frac{e^{z}-e^{-z}}{2}.$   (19)

Therefore, we adopt $\alpha(z)=\sinh(z)$ for the alternative function $\phi=\sigma\circ\alpha$ and obtain the fast gate

$\phi(z):=\sigma(\sinh(z)).$   (20)

This simple expression enables us to implement it with only one additional function. Note that the above form is one example of a gate function satisfying the desirable properties in Section 5.1, and there are infinitely many other possible choices. This particular form is one of the simplest choices and is not a limitation of our method. For a discussion of other candidates, see Appendix D.
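As a minimal sketch of how Eq. (20) can be used in practice (module and variable names are ours, not from the paper's code), the fast gate is a one-line drop-in replacement for the sigmoid forget-gate activation:

```python
import torch
import torch.nn as nn

class FastGate(nn.Module):
    """Fast gate phi(z) = sigmoid(sinh(z)), Eq. (20), with doubly exponential saturation."""
    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(torch.sinh(z))

# Usage: apply to the forget-gate pre-activation instead of torch.sigmoid.
z = torch.linspace(-3.0, 3.0, 7)
print(FastGate()(z))
print(torch.sigmoid(z))  # similar near 0, but the fast gate saturates much faster
```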

                        Softsign           Sigmoid          Refine           Fast
Saturation order        $O(z^{-1})$        $O(e^{-z})$      $O(e^{-2z})$     $O(e^{-e^{z}})$
Convergence rate        $O(\tau^{-1/3})$   $O(\tau^{-1})$   $O(\tau^{-1})$   $O(W^{2}(c\tau^{1/2}))$
Additional parameters   No                 No               Yes              No

Table 1: Comparison of the order of saturation and the convergence rate for the problem in Section 4. "Fast" denotes our method. $W^{2}(\cdot)$ is the square of the Lambert function $W$, defined as the inverse of the function $z\mapsto ze^{z}$. $c>0$ is some constant.

5.3 Comparison with Other Gate Functions

We analyze the fast gate $\phi(z)$ and compare it with other gate functions. First, the order of saturation of the fast gate is $O(e^{-e^{z}})$, since $e^{-e^{az}}/(1-\phi(z))\to 0$ as $z\to\infty$ for any $a>1$. We briefly describe the refine gate (Gu et al. 2020b), which was proposed to avoid gradient vanishing of the gate function. This method exploits an auxiliary gate $r_{t}\in[0,1]^{n}$ to modify the forget gate value $f_{t}$ to $g_{t}=r_{t}(1-(1-f_{t})^{2})+(1-r_{t})f_{t}^{2}$. When a large output value is desired for the forget gate, the auxiliary gate is expected to take $r_{t}\approx 1$ to push $f_{t}$ to $g_{t}\approx 1-(1-f_{t})^{2}$. Therefore, from the asymptotic viewpoint, this method modifies the order of saturation of the gate function from $O(e^{-z})$ to $O(e^{-2z})$. Compared with the refine gate, the fast gate has a much higher order of saturation. We also analyze the asymptotic convergence rates of solving the toy problem in Section 4. We summarize the results in Tab. 1; see Appendix C.3 for the detailed derivation. Since the fast gate has a doubly exponential order of saturation, its order of convergence rate for learning long time scales is faster than those of the sigmoid and refine gates, which have an exponential order of saturation (see also Fig. 2 for a comparison of the convergence rates). In addition, the fast gate does not require additional parameters, whereas the refine gate does. Therefore, the fast gate is computationally more efficient than the refine gate.
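A quick numerical check (ours) of the orderings in Tab. 1: the gaps $1-\phi(z)$ for the sigmoid gate, the refine gate in its best case ($r_{t}=1$, so $1-g_{t}=(1-f_{t})^{2}$), and the fast gate can be tabulated directly.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for z in [2.0, 4.0, 6.0]:
    gap_sigmoid = sigmoid(-z)            # 1 - sigma(z), computed in a numerically stable form
    gap_refine  = sigmoid(-z) ** 2       # refine gate with r_t = 1: 1 - g_t = (1 - f_t)^2
    gap_fast    = sigmoid(-np.sinh(z))   # 1 - sigma(sinh(z))
    print(z, gap_sigmoid, gap_refine, gap_fast)
# The fast gate's gap shrinks doubly exponentially, far faster than the other two.
```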

6 Experiments

6.1 Synthetic Tasks

Figure 3: MSE loss for the adding task of sequence length 5000.
Figure 4: Accuracy for the copy task of sequence length 500.
Figure 5: Growth of time scales of units for various gate functions on the adding task. Lines and shaded areas represent the mean and standard deviation of time scales over the 128 units in the state vector of the models. The standard deviation is multiplied by 0.1 for visibility. The fast gate learns exponentially larger magnitudes of time scales than the other gates.

We evaluate the learnability for long time scales across various methods on two synthetic tasks, adding and copy, following previous studies (Hochreiter and Schmidhuber 1997; Arjovsky, Shah, and Bengio 2016). While these tasks are simple and easy to solve for short sequences, they get extremely difficult for gated RNNs to solve when the sequence length grows to hundreds or thousands.

Setup. We compare the fast gate with the refine gate because Gu et al. (2020b) reported that the refine gate achieved the best performance among previous gate variants. We also include the normalized softsign function $\sigma_{\rm ns}(z)$ from Section 4 as a reference baseline to test the compatibility of our theory. We use these gate functions in a single-layer LSTM. Since the initialization of the forget gate bias $b_{f}$ is critical to model performance (Tallec and Ollivier 2018), we set it so that $\phi(b_{f})=1/(1+e^{-1})$ is satisfied for each gate function $\phi$ (with the bias for the additional gate initialized to 0 for the refine gate), which amounts to $b_{f}=1$ in the usual sigmoid case (Gers, Schmidhuber, and Cummins 2000; Greff et al. 2016). We also compare the performance of these gate variants with chrono-initialization (Tallec and Ollivier 2018), a method that initializes parameters to represent long time scales for the sigmoid gate. In addition to the LSTM, we include the JANET (Van Der Westhuizen and Lasenby 2018) and NRU (Chandar et al. 2019) as baselines. The JANET is one of the simplest gated RNNs, specialized for learning long time scales by omitting gates other than the forget gate and applying chrono-initialization. The NRU uses non-saturating activation functions to write to or erase from a memory cell. We train and evaluate each model three times, varying the random seed. See Appendix E.3 for the detailed setting.
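Concretely, the bias initialization rule $\phi(b_{f})=1/(1+e^{-1})$ amounts to $b_{f}=\phi^{-1}(\sigma(1))$ for each gate (the refine gate's auxiliary bias aside); the short calculation below (our sketch, using the $\sigma_{\rm ns}$ definition from Section 4) gives the resulting values.

```python
import numpy as np

target = 1.0 / (1.0 + np.exp(-1.0))      # desired initial gate value sigma(1) ~ 0.731

b_sigmoid  = 1.0                          # sigma(b) = sigma(1)
b_fast     = np.arcsinh(1.0)              # sigma(sinh(b)) = sigma(1)               -> b ~ 0.881
b_softsign = 1.0 / (1.0 - target) - 2.0   # sigma_ns(b) = 1 - 1/(b + 2) = sigma(1)  -> b ~ 1.718

print(b_sigmoid, b_fast, b_softsign)
```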

Results. The mean squared error on the adding task of sequence length 5000 and the accuracy on the copy task of sequence length 500 during training are shown in Fig. 3 and 4, respectively. While the NRU requires the fewest parameter updates on the adding task, its training diverges on the copy task due to gradient explosion (Pascanu, Mikolov, and Bengio 2013). This is because the state in the NRU evolves in an unbounded region; thus, a small parameter update can drastically change the behavior of the model. We could not fix this instability even by reducing the gradient clipping threshold by a factor of 10. We hypothesize that training of the NRU tends to be more unstable on the copy task because this task is of a higher dimensional nature than the adding task in terms of input dimension (10 vs. 2) and the number of tokens to memorize (10 vs. 2). Among the gate functions, the fast gate converges the fastest. This is due to its higher order of saturation: the fast gate has the order of saturation $O(e^{-e^{z}})$ whereas the refine gate has $O(e^{-2z})$, and thus it learns long time scales more efficiently. The normalized softsign gate completely fails to learn since it has a lower order of saturation, $O(z^{-1})$, than the sigmoid function. Thus, the performance of the models is well explained by the differences in the order of saturation in Tab. 1. This indicates that our theoretical analysis matches practical settings even though it builds on a simplified learning problem. The results also show that modifying the gate function is more effective for learning long-term dependencies than other methods such as chrono-initialization and the JANET.

We further observe the growth of the time-scale distribution of the memory cell in the LSTM on the adding task. The time scale of the $i$-th unit in the memory cell is measured from the bias term of the forget gate as $-1/\log\phi(b_{f,i})$ (Tallec and Ollivier 2018; Mahto et al. 2021). We show the statistics of the time scales over all 128 units at each iteration of training in Fig. 5. The fast gate represents much longer time scales than the other gate functions after training, which validates our method. While chrono-initialization sets the time scales so that they are uniformly distributed within the range $[1,5000]$, they do not change at all during training due to saturation of the usual sigmoid function $\sigma(z)$. Since the fast gate can learn to adapt to even longer time scales than such initialization, it is effective for approximating arbitrary desired time scales, which are usually unknown a priori.

6.2 Pixel-by-pixel Image Recognition

           sMNIST         psMNIST        sCIFAR         Time
Softsign   97.50 ± 0.58   91.71 ± 0.33   59.21 ± 0.39   17.7 min.
Sigmoid    98.88 ± 0.12   95.71 ± 0.02   69.14 ± 0.39   14.3 min.
Refine     98.94 ± 0.03   95.93 ± 0.16   69.55 ± 0.50   22.7 min.
Fast       99.05 ± 0.04   96.18 ± 0.14   70.06 ± 0.38   14.7 min.
chrono     98.83 ± 0.09   94.37 ± 0.69   60.36 ± 0.51   14.3 min.
JANET      98.59 ± 0.03   93.85 ± 0.23   60.99 ± 0.51   10.6 min.
NRU        98.73 ± 0.27   94.76 ± 0.35   62.32 ± 0.30   35.0 min.

Table 2: Test accuracy (%) on image classification tasks and processing time per epoch on psMNIST.
Figure 6: Validation accuracy for psMNIST.

Next, we evaluate the fast gate on the sequential image recognition task, where a pixel value is fed into the recurrent model at each time step (Le, Jaitly, and Hinton 2015).

Setup. We use the standard-order sequential MNIST (sMNIST) task and the permuted-order version (psMNIST), which introduces more complex and longer-term temporal dependencies. We also use the sequential CIFAR-10 (sCIFAR) task, which involves higher dimensional inputs and a longer sequence length (i.e., $3\times 1024$) than MNIST. We train the LSTM with the various gates, the JANET, and the NRU. See Appendix E.4 for training details. We set the hidden dimension to 512 for all models and the dimension of the memory cell in the NRU to 144, following the original paper. We report the averaged test accuracy with the standard deviation over three random seeds. We also report the computational time to train each model to compare the computational efficiency of the LSTM with the fast gate against the other models.

Results. The results are shown in Tab. 2. The fast gate performs the best among the various gates, while the normalized softsign gate performs poorly. This is because the order of saturation of the activation function in the forget gate directly affects the learnability for long time scales. The LSTM with the fast gate also outperforms all the other baselines. Note that since chrono-initialization gives a too heavy-tailed time scale distribution (Gu et al. 2020b), the chrono-LSTM and JANET perform worse than simply initializing the gate bias as $b_{f}=1$ (Sigmoid in the table). Since a larger model size tends to result in higher accuracy on these tasks (Voelker, Kajić, and Eliasmith 2019; Erichson et al. 2021), we also provide results of smaller models in Appendix F.3 for comparison. While the NRU achieves the highest accuracy on the psMNIST task in Tab. 5 in the appendix, it performs worse than the LSTMs in Tab. 2; therefore, the performance of the LSTM seems to scale better with model size than that of the NRU. The refine gate requires more than 1.5 times longer processing time for training than the fast gate, since it involves an auxiliary gate computation with additional parameters. The NRU requires longer processing time than the LSTMs due to its complicated state update rule. The learning curves for the psMNIST task in Fig. 6 show that the fast gate reaches high accuracy in as few epochs as the refine gate. Combined with the results on processing time per epoch, we conclude that the fast gate learns complex time scales the most efficiently among the gate functions.

          > 10K   1K-10K   100-1K   < 100     All tokens   Time
Sigmoid   7.25    27.72    170.25   2026.87   60.22        130 sec.
Refine    7.61    28.01    166.46   1936.02   60.50        185 sec.
Fast      7.43    27.70    166.68   1975.51   60.09        138 sec.

Table 3: Test perplexity for tokens across different frequency bins on the Penn Treebank dataset, with training time per epoch.

6.3 Language Modeling

In natural language processing, performance of language models can suffer from difficulty in learning long time scales because predicting statistically rare words involves long time scales (Mahto et al. 2021). It is expected that using a forget gate function with a higher order of saturation improves learning to predict such rare words. We validate this effect in the following experiment.

Setup. We train and evaluate three-layer LSTM language models on the Penn Treebank (PTB) dataset following a previous study (Mahto et al. 2021) (https://github.com/HuthLab/multi-timescale-LSTM-LMs), replacing every sigmoid forget gate in the baseline LSTM with the fast gate. We compare the LSTM with the fast gate against the LSTMs with the sigmoid and refine gates, and also against the NRU, by replacing all LSTM modules with NRU modules under the same experimental setting. To evaluate model performance on data involving different ranges of time scales, the test dataset is divided into four bins depending on token frequency in the training dataset: more than 10,000, 1000-10,000, 100-1000, and fewer than 100 occurrences.

Results. The results are shown in Tab. 3. The result for the NRU is not in the table because its training diverged. Both the refine and fast gates improve perplexity for less frequently observed words compared with the sigmoid gate. The model with the fast gate also achieves the lowest total perplexity. Since frequently observed words involve short-term dependencies, this result indicates that the fast gate improves model performance by learning the wide range of time scales that appear in practical tasks. Tab. 3 also shows the training time taken for one epoch. We observe that the refine gate has a larger computational overhead than the fast gate, although the LSTM with the refine gate has the same number of parameters as the other LSTMs thanks to the gate-tying trick (Appendix E.2). This is due to the two-stage computation of the gate value via the auxiliary gate, which cannot be parallelized and thus leads to slow computation. In summary, the fast gate can improve performance on data involving extremely long time scales without sacrificing performance on data involving short time scales and with less computational overhead.

7 Conclusion

We analyzed the saturation problem in the learning of gate functions in recurrent models. Against the common intuition that saturation of the activation function degrades training, we proved that strengthening the saturating behavior is effective in mitigating gradient vanishing of gate functions. We proposed the fast gate, which has a doubly exponential order of convergence with respect to its input, obtained by simply composing the hyperbolic sine function with the usual sigmoid function. We evaluated the trainability of the fast gate on data involving extremely long time scales and empirically showed that it improves accuracy on benchmark tasks with little computational overhead. Our analytical approach is applicable to other bounded activation functions that appear at the core of modules such as the attention mechanism. Thus, we expect that it can improve the learnability of neural networks beyond recurrent models.

References

  • Arjovsky, Shah, and Bengio (2016) Arjovsky, M.; Shah, A.; and Bengio, Y. 2016. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, 1120–1128. PMLR.
  • Bengio, Léonard, and Courville (2013) Bengio, Y.; Léonard, N.; and Courville, A. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  • Bengio, Simard, and Frasconi (1994) Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 157–166.
  • Chandar et al. (2019) Chandar, S.; Sankar, C.; Vorontsov, E.; Kahou, S. E.; and Bengio, Y. 2019. Towards non-saturating recurrent units for modelling long-term dependencies. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3280–3287.
  • Chang et al. (2019) Chang, B.; Chen, M.; Haber, E.; and Chi, E. H. 2019. Antisymmetricrnn: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations.
  • Chen, Pennington, and Schoenholz (2018) Chen, M.; Pennington, J.; and Schoenholz, S. 2018. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning, 873–882. PMLR.
  • Cho et al. (2014) Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
  • Collins, Sohl-Dickstein, and Sussillo (2017) Collins, J.; Sohl-Dickstein, J.; and Sussillo, D. 2017. Capacity and trainability in recurrent neural networks. In International Conference on Learning Representations.
  • Erichson et al. (2021) Erichson, N. B.; Azencot, O.; Queiruga, A.; Hodgkinson, L.; and Mahoney, M. W. 2021. Lipschitz recurrent neural networks. In International Conference on Learning Representations.
  • Gers, Schmidhuber, and Cummins (2000) Gers, F. A.; Schmidhuber, J. A.; and Cummins, F. A. 2000. Learning to Forget: Continual Prediction with LSTM. Neural computation, 12(10): 2451–2471.
  • Greff et al. (2016) Greff, K.; Srivastava, R. K.; Koutník, J.; Steunebrink, B. R.; and Schmidhuber, J. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10): 2222–2232.
  • Gu et al. (2020a) Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; and Ré, C. 2020a. Hippo: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33: 1474–1487.
  • Gu et al. (2020b) Gu, A.; Gulcehre, C.; Paine, T.; Hoffman, M.; and Pascanu, R. 2020b. Improving the Gating Mechanism of Recurrent Neural Networks. In International Conference on Machine Learning, 3800–3809. PMLR.
  • Gu et al. (2021) Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; and Ré, C. 2021. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. Advances in Neural Information Processing Systems, 34.
  • Gulcehre et al. (2016) Gulcehre, C.; Moczulski, M.; Denil, M.; and Bengio, Y. 2016. Noisy activation functions. In International Conference on Machine Learning, 3059–3068. PMLR.
  • Harold and George (2003) Harold, K.; and George, Y. 2003. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media.
  • Helfrich and Ye (2020) Helfrich, K.; and Ye, Q. 2020. Eigenvalue Normalized Recurrent Neural Networks for Short Term Memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 4115–4122.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural computation, 9(8): 1735–1780.
  • Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–456. PMLR.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Kusupati et al. (2018) Kusupati, A.; Singh, M.; Bhatia, K.; Kumar, A.; Jain, P.; and Varma, M. 2018. Fastgrnn: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems, 9031–9042.
  • Le, Jaitly, and Hinton (2015) Le, Q. V.; Jaitly, N.; and Hinton, G. E. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.
  • Lechner and Hasani (2020) Lechner, M.; and Hasani, R. 2020. Learning long-term dependencies in irregularly-sampled time series. arXiv preprint arXiv:2006.04418.
  • Lezcano-Casado and Martınez-Rubio (2019) Lezcano-Casado, M.; and Martınez-Rubio, D. 2019. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, 3794–3803. PMLR.
  • Ling et al. (2020) Ling, S.; Liu, Y.; Salazar, J.; and Kirchhoff, K. 2020. Deep contextualized acoustic representations for semi-supervised speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6429–6433. IEEE.
  • Mahto et al. (2021) Mahto, S.; Vo, V. A.; Turek, J. S.; and Huth, A. G. 2021. Multi-timescale representation learning in LSTM Language Models. International Conference on Learning Representations.
  • Pascanu, Mikolov, and Bengio (2013) Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 1310–1318. PMLR.
  • Romero et al. (2022) Romero, D. W.; Kuzina, A.; Bekkers, E. J.; Tomczak, J. M.; and Hoogendoorn, M. 2022. CKConv: Continuous kernel convolution for sequential data. In International Conference on Learning Representations.
  • Rusch and Mishra (2021) Rusch, T. K.; and Mishra, S. 2021. UnICORNN: A recurrent model for learning very long time dependencies. In International Conference on Machine Learning, 9168–9178. PMLR.
  • Tallec and Ollivier (2018) Tallec, C.; and Ollivier, Y. 2018. Can recurrent neural networks warp time? In International Conference on Learning Representations.
  • Van Der Westhuizen and Lasenby (2018) Van Der Westhuizen, J.; and Lasenby, J. 2018. The unreasonable effectiveness of the forget gate. arXiv preprint arXiv:1804.04849.
  • Voelker, Kajić, and Eliasmith (2019) Voelker, A.; Kajić, I.; and Eliasmith, C. 2019. Legendre memory units: Continuous-time representation in recurrent neural networks. Advances in neural information processing systems, 32.
  • Vorontsov et al. (2017) Vorontsov, E.; Trabelsi, C.; Kadoury, S.; and Pal, C. 2017. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning, 3570–3578. PMLR.
  • Zhang, Lei, and Dhillon (2018) Zhang, J.; Lei, Q.; and Dhillon, I. 2018. Stabilizing gradients for deep neural networks via efficient SVD parameterization. In International Conference on Machine Learning, 5806–5814. PMLR.
  • Zhu et al. (2020) Zhu, L.; Tran, D.; Sevilla-Lara, L.; Yang, Y.; Feiszli, M.; and Wang, H. 2020. FASTER recurrent networks for efficient video classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 13098–13105.

Appendix A Further discussion of related work

There is a large body of research on learning long-term dependencies of time series data with RNNs. We discuss the relations between our work and other previous studies.

Several methods that properly constrain the weight matrix of a simple RNN without a gating mechanism have been proposed to alleviate the gradient vanishing problem (Arjovsky, Shah, and Bengio 2016; Zhang, Lei, and Dhillon 2018; Lezcano-Casado and Martınez-Rubio 2019; Helfrich and Ye 2020). These methods limit the expressivity of models due to the parameter constraints (Vorontsov et al. 2017) and thus are not usually used for practical tasks. Our method takes a different approach to learning long-term dependencies, through the time scales of models, and does not limit the model expressivity.

There is another approach that improves the learnability of gated RNNs by reducing redundant parameters (Collins, Sohl-Dickstein, and Sussillo 2017; Chen, Pennington, and Schoenholz 2018; Kusupati et al. 2018). Since the forget gate is the most critical component of gated RNNs, these minimal RNNs are also equipped with a forget gate. Therefore, our method is applicable to those minimal models to further enhance model performance.

Recently, continuous-time models based on differential equations and their discrete counterparts have been attracting attention for dealing with extremely long (and possibly irregularly sampled) time series data (Chang et al. 2019; Voelker, Kajić, and Eliasmith 2019; Gu et al. 2020a; Lechner and Hasani 2020; Gu et al. 2021; Rusch and Mishra 2021). Some of these models utilize the sigmoid function to represent variable time steps for discretization. Such use of the sigmoid function essentially has the same effect as the forget gate in the LSTM and GRU in terms of the time scales of the models (Tallec and Ollivier 2018; Gu et al. 2020a). Thus, our method of replacing the gate function can be applied to the sigmoid function representing time steps to further enhance the learnability for a wide range of time scales.

Appendix B Relation to Gradient Vanishing due to Recurrence

When learning long-term dependencies of time series data, recurrent models often suffer from gradient vanishing due to the recurrent calculation (Bengio, Simard, and Frasconi 1994; Pascanu, Mikolov, and Bengio 2013). We describe the relation between this gradient vanishing and the saturation problem in this section. First, consider a general recurrent model with a state update rule of the form $h_{t}=G(x_{t},h_{t-1};\theta)$, where $h_{t}$ and $x_{t}$ are the state and input at time $t$ and $\theta$ denotes the parameter vector of the model. During training, the parameter $\theta$ is updated using the gradient

$\frac{\partial L}{\partial\theta}=\sum_{t}\frac{\partial h_{t}}{\partial\theta}\frac{\partial L}{\partial h_{t}},$   (21)

where the right hand side is calculated by back-propagation through time (BPTT)

$\frac{\partial L}{\partial h_{t}}=\frac{\partial h_{t+1}}{\partial h_{t}}\frac{\partial L}{\partial h_{t+1}}.$   (22)

If the matrix $\frac{\partial h_{t+1}}{\partial h_{t}}$ is contractive, the gradient $\frac{\partial L}{\partial h_{t}}$ decays exponentially through this backward recurrence. Then, the gradient for $\theta$ does not include information on the data at a time $t$ distant from the last time step, which leads to difficulty in learning long-term dependencies of the data.

We now consider a gated model with the state update rule $h_{t}=f_{t}\odot h_{t-1}+i_{t}\odot F(x_{t},h_{t-1})$, where $f_{t}$ and $i_{t}$ are the forget and input gates and $F$ is some state update function. Then, we have

$\frac{\partial h_{t+1}}{\partial h_{t}}=\mathrm{diag}(f_{t})+\cdots.$   (23)

Assuming that the first term $\mathrm{diag}(f_{t})$ has a relatively larger effect than the other terms, the decay rate of the gradient $\frac{\partial L}{\partial h_{t}}$ is largely dominated by the distance of $f_{t}$ from 1. This corresponds to the time scales of the gated model. Namely, if the forget gate represents only short time scales, then gradient vanishing due to recurrence tends to occur. On the other hand, if the forget gate learns long time scales, then it mitigates the gradient vanishing and helps the model learn to extract important features at distant time steps. Thus, the learnability of gated models for long time scales is critical to capturing long-term dependencies. In this work, we focus on the learnability for long time scales by considering the gradient vanishing due to saturation of bounded functions. The distinction between the two gradient vanishing problems is easy to see in Fig. 2: every gate learns slowly at the early stage (until about iteration 40) because of gradient vanishing due to recurrence. After it gets through the plateau, it then faces gradient vanishing due to saturation in order to learn long time scales. In this way, the gradient vanishing due to recurrence and the gradient vanishing of bounded functions are closely related to each other.
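A toy calculation (ours) makes this concrete: the backward factor contributed by $\mathrm{diag}(f_{t})$ over $k$ steps is roughly $\bar{f}^{k}$, so gates stuck at moderate values wipe out the gradient over long horizons.

```python
import numpy as np

def backward_factor(f_bar, k):
    """Rough gradient attenuation over k steps when the forget gate averages f_bar."""
    return f_bar ** k

for f_bar in [0.9, 0.99, 0.999]:
    print(f_bar, backward_factor(f_bar, 1000))
# roughly 1e-46, 4e-5, and 0.37: only gates close to 1 let gradients survive 1000 steps
```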

Appendix C Details on Theoretical Results

C.1 Proof of Proposition 4.2

Proof.

Let $\phi$ be an element of $\mathcal{F}$. Since $\phi^{\prime}$ is monotone for $z\gg 0$, $\lim_{z\to\infty}\phi^{\prime}(z)$ exists. Since $\phi$ is increasing, we obtain $\lim_{z\to\infty}\phi^{\prime}(z)\geq 0$. If $a:=\lim_{z\to\infty}\phi^{\prime}(z)>0$, then $\phi(z)=\phi(0)+\int_{0}^{z}\phi^{\prime}(s)ds\geq\phi(0)+az$. This contradicts the boundedness of $\phi$. Thus, we have $\lim_{z\to\infty}\phi^{\prime}(z)=0$. ∎

C.2 Proof of Theorem 4.4

First, we remark on some technical points regarding the following definition of the order of saturation.

Definition C.1.

For $\phi,\tilde{\phi}\in\mathcal{F}$, $\phi$ has a higher order of saturation than $\tilde{\phi}$ if ${\tilde{\phi}}^{-1}(\phi(z))$ is convex for $z\gg 0$ and

$\lim_{z\to\infty}\frac{1-\phi(z)}{1-\tilde{\phi}(az)}=0$   (24)

holds for any $a>0$.

The first condition, convexity, is automatically satisfied for typical functions such as logarithmic, polynomial, and exponential functions and is required to ensure that the growth rates of the functions are well behaved. (We have not found any examples of $\phi,\tilde{\phi}\in\mathcal{F}$ that do not satisfy the convexity of ${\tilde{\phi}}^{-1}(\phi(z))$ under Eq. (24).) For example, consider the fast gate $\phi(z)=1/(1+e^{-\sinh(z)})$ and the sigmoid function $\tilde{\phi}(z)=\sigma(z)=1/(1+e^{-z})$. Then ${\tilde{\phi}}^{-1}(\phi(z))=\sinh(z)$ is indeed convex for $z>0$. The second condition is slightly different from the common definition of order, which adopts $\lim_{z\to\infty}\frac{\phi(z)}{\tilde{\phi}(z)}=0$. While the two limit conditions coincide for the logarithmic and polynomial classes, there is a difference in the treatment of exponential functions. For example, $O(e^{z})$ and $O(e^{2z})$ are distinguished in the usual sense but are identified in our definition. Note that in terms of learning of neural networks, this amounts to ignoring re-parametrization of parameters by constant multiplication, $\theta\to 2\theta$. Since such naive re-parametrization should not drastically change the learning behavior, our definition of the order of saturation makes more sense for considering the learning of neural networks. This is well illustrated in Tab. 1 and Fig. 2: the refine gate has the order of saturation $O(e^{-2z})$ but results in the same order of convergence rate of learning, $O(\tau^{-1})$, as the usual sigmoid gate, since $O(e^{-2z})$ is the same as $O(e^{-z})$ in our definition. The fast gate, however, has a higher order of saturation in the sense of our definition and thus has a faster order of convergence rate.

We now prove Theorem 4.4.

Proof.

Take $\phi,\tilde{\phi}\in\mathcal{F}$ such that $\phi$ has a higher order of saturation than $\tilde{\phi}$. Define a function $\eta$ by $\eta(z):={\tilde{\phi}}^{-1}(\phi(z))$. Note that $\eta(z)$ is convex for $z\gg 0$ since $\phi$ has a higher order of saturation than $\tilde{\phi}$. Let $a>0$ be any positive number. Then, we have $\eta(z)>az$ for $z\gg 0$ since $\phi(z)>\tilde{\phi}(az)$ for $z\gg 0$ and $\tilde{\phi}$ is increasing. Thus, for any large $z_{0}>0$, we get $\eta^{\prime}(z)\geq a$ for some $z>z_{0}$. By the convexity of $\eta$ for $z\gg 0$, we obtain $\eta^{\prime}(z)\geq a$ for $z\gg 0$. On the other hand, we have

η(z)\displaystyle\eta^{\prime}(z) =ϕ(z)ϕ~(ϕ~1(ϕ(z)))\displaystyle=\frac{\phi^{\prime}(z)}{\tilde{\phi}^{\prime}(\tilde{\phi}^{-1}(\phi(z)))} (25)
=gϕ(ϕ(z))gϕ~(ϕ(z))\displaystyle=\frac{g_{\phi}(\phi(z))}{g_{\tilde{\phi}}(\phi(z))} (26)

By substituting f:=ϕ(z)f:=\phi(z), we obtain

η(z)=gϕ(f)gϕ~(f)a\displaystyle\eta^{\prime}(z)=\frac{g_{\phi}(f)}{g_{\tilde{\phi}}(f)}\geq a (27)

for f1f\approx 1. Since a>0a>0 is arbitrary, we have gϕ(f)gϕ~(f)\frac{g_{\phi}(f)}{g_{\tilde{\phi}}(f)}\to\infty as f1f\to 1. ∎

C.3 Derivation of Explicit Convergence Rate

In this section, we derive the convergence rates of learning long time scales on the toy problem in Section 4 for the sigmoid and normalized softsign gates, proving Proposition 4.5. We also derive the convergence rates for the refine and fast gates in the same manner. We consider the absolute loss $L=|\lambda_{*}-f^{t_{1}-t_{0}}|$ with the gate value $f=\phi(z)$. Since the learning dynamics does not depend on $\lambda_{*}$ under the absolute loss (as long as $f^{t_{1}-t_{0}}<\lambda_{*}$), we may take $\lambda_{*}=1$ without loss of generality.

The sigmoid gate case. We first analyze the case with the sigmoid gate f=σ(z)f=\sigma(z). We assume that the initial value of ff is small so that ft1t0<λf^{t_{1}-t_{0}}<\lambda_{*}. Then Eq. (17) becomes

dfdτ=σ(z)2dLdf=(t1t0)ft1t0+1(1f)2,\displaystyle\frac{df}{d\tau}=-\sigma^{\prime}(z)^{2}\frac{dL}{df}=(t_{1}-t_{0})f^{t_{1}-t_{0}+1}(1-f)^{2}, (28)

as long as $f^{t_{1}-t_{0}}<\lambda_{*}$ holds. Since $f$ monotonically increases over time, the right-hand side for $\tau>\tau_{0}$ is lower bounded as $(t_{1}-t_{0})f^{t_{1}-t_{0}+1}(1-f)^{2}>C(1-f)^{2}$ with $C=(t_{1}-t_{0})f(\tau_{0})^{t_{1}-t_{0}+1}$. This gives a lower bound $f_{\rm lower}$ of $f$, whose dynamics is defined by

dflowerdτ=C(1flower)2.\displaystyle\frac{df_{\rm lower}}{d\tau}=C(1-f_{\rm lower})^{2}. (29)

Solving this, we obtain $1-f_{\rm lower}=1/(C\tau)$ up to a constant shift in $\tau$. Thus, we obtain an upper bound on the difference

1f=O(τ1),\displaystyle 1-f=O(\tau^{-1}), (30)

which approximates the asymptotic convergence rate for training.

Tightness of bounds. Above, we only evaluated an upper bound on the difference $1-f$ as the convergence rate. We can easily show that this evaluation is tight, so the bound gives the exact order of the convergence rate. Consider the sigmoid case for example. While the gradient is lower bounded as $\frac{df}{d\tau}\geq C(1-f)^{2}$, it is also upper bounded as $\frac{df}{d\tau}\leq\tilde{C}(1-f)^{2}$ with another constant $\tilde{C}>0$, since $(t_{1}-t_{0})f^{t_{1}-t_{0}+1}$ is bounded from above. Integrating the upper bound yields a lower bound on $1-f$ of the same order $O(\tau^{-1})$, so the bound indeed gives the exact order of the convergence rate. The same argument applies to the other cases below, but we omit it for simplicity.

The normalized softsign gate case. For the asymptotic analysis, we may assume z>0z>0. Then the normalized softsign function is ϕ(z)=σns(z)=12(z|z|+2+1)=11z+2\phi(z)=\sigma_{\rm ns}(z)=\frac{1}{2}(\frac{z}{|z|+2}+1)=1-\frac{1}{z+2}. Thus, ϕ(z)=1(z+2)2=(1ϕ(z))2\phi^{\prime}(z)=\frac{1}{(z+2)^{2}}=(1-\phi(z))^{2}. Then for f=ϕ(z)f=\phi(z), Eq. (17) becomes

\displaystyle\frac{df}{d\tau}=-\phi^{\prime}(z)^{2}\frac{dL}{df}=(t_{1}-t_{0})f^{t_{1}-t_{0}-1}(1-f)^{4}, (31)

as long as $f^{t_{1}-t_{0}}<\lambda_{*}$ holds. The right-hand side for $\tau>\tau_{0}$ is lower bounded as $(t_{1}-t_{0})f^{t_{1}-t_{0}-1}(1-f)^{4}>C(1-f)^{4}$ with constant $C=(t_{1}-t_{0})f(\tau_{0})^{t_{1}-t_{0}-1}$, since $f$ increases during learning. Therefore, the dynamics of a lower bound $f_{\rm lower}$ of $f$ can be written as

dflowerdτ=C(1flower)4.\displaystyle\frac{df_{\rm lower}}{d\tau}=C(1-f_{\rm lower})^{4}. (32)

Solving this, we get $1-f_{\rm lower}=(3C\tau)^{-1/3}$ up to a constant shift in $\tau$. Thus, we obtain an upper bound on the difference

1f=O(τ1/3).\displaystyle 1-f=O(\tau^{-1/3}). (33)

The refine gate case. The refine gate (Gu et al. 2020b) is an auxiliary gate r=σ(z~)r=\sigma(\tilde{z}) that adjust the forget gate value f=σ(z)f=\sigma(z) to the effective gate value g=2rf+(12r)f2g=2rf+(1-2r)f^{2}. We first reformulate the problem as minimizing the loss function

L=|λgt1t0|,\displaystyle L=|\lambda_{*}-g^{t_{1}-t_{0}}|, (34)

i.e., we replace ff with gg. Then, the learning dynamics of gg is given as

dgdτ\displaystyle\frac{dg}{d\tau} =dzdτgz+dz~dτgz~\displaystyle=\frac{dz}{d\tau}\frac{\partial g}{\partial z}+\frac{d\tilde{z}}{d\tau}\frac{\partial g}{\partial\tilde{z}} (35)
=((gz)2+(gz~)2)Lg.\displaystyle=-\Bigl{(}(\frac{\partial g}{\partial z})^{2}+(\frac{\partial g}{\partial\tilde{z}})^{2}\Bigr{)}\frac{\partial L}{\partial g}. (36)

Since the refine gate rr monotonically increases during learning, we may assume r1/2r\geq 1/2. We claim that gz\frac{\partial g}{\partial z} and gz~\frac{\partial g}{\partial\tilde{z}} are bounded as follows when r1/2r\geq 1/2:

f(1g)gz2f(1g)\displaystyle f(1-g)\leq\frac{\partial g}{\partial z}\leq 2f(1-g) (37)
0gz~r(1g)\displaystyle 0\leq\frac{\partial g}{\partial\tilde{z}}\leq r(1-g) (38)

Indeed, we have

gz\displaystyle\frac{\partial g}{\partial z} =2f(1f)(r+(12r)f)\displaystyle=2f(1-f)(r+(1-2r)f) (39)
=2f(1g(1r)(1f))(2f(1g))\displaystyle=2f\bigl{(}1-g-(1-r)(1-f)\bigl{)}\ \Bigl{(}\leq 2f(1-g)\Bigr{)} (40)
\displaystyle=f\bigl(1-g+(2r-1)(1-f)^{2}\bigr)\ \Bigl(\geq f(1-g)\Bigr) (41)

and

gz~\displaystyle\frac{\partial g}{\partial\tilde{z}} =2fr(1r)2f2r(1r)\displaystyle=2fr(1-r)-2f^{2}r(1-r) (42)
=r(1g(1f)2)r(1g).\displaystyle=r\bigl{(}1-g-(1-f)^{2}\bigr{)}\leq r(1-g). (43)

Thus, we get a bound

C(1g)2dgdτC~(1g)2\displaystyle C(1-g)^{2}\leq\frac{dg}{d\tau}\leq\tilde{C}(1-g)^{2} (44)

with $C=(t_{1}-t_{0})g(\tau_{0})^{t_{1}-t_{0}-1}f(\tau_{0})^{2}$ and $\tilde{C}=5(t_{1}-t_{0})$ (using $f,r,g\leq 1$). With this bound, we can carry out exactly the same calculation as in the sigmoid case to obtain the bound

1g=O(τ1).\displaystyle 1-g=O(\tau^{-1}). (45)

The fast gate case. Let ϕ(z)=σ(sinh(z))\phi(z)=\sigma(\sinh(z)) be the fast gate. By the chain rule, the derivative of the function ϕ\phi is given by ϕ(z)=ϕ(z)(1ϕ(z))cosh(z)\phi^{\prime}(z)=\phi(z)(1-\phi(z))\cosh(z). Therefore, Eq. (17) for f=ϕ(z)f=\phi(z) can be written as

dfdτ\displaystyle\frac{df}{d\tau} =(ϕ(z))2Lf\displaystyle=-(\phi^{\prime}(z))^{2}\frac{\partial L}{\partial f} (46)
=f2(1f)2cosh2(z)(t1t0)ft1t01\displaystyle=f^{2}(1-f)^{2}\cosh^{2}(z)(t_{1}-t_{0})f^{t_{1}-t_{0}-1} (47)
=(t1t0)ft1t0+1(1f)2(1+sinh2(z))\displaystyle=(t_{1}-t_{0})f^{t_{1}-t_{0}+1}(1-f)^{2}(1+\sinh^{2}(z)) (48)
C(1f)2sinh2(z)\displaystyle\geq C(1-f)^{2}\sinh^{2}(z) (49)

with constant $C:=(t_{1}-t_{0})f(\tau_{0})^{t_{1}-t_{0}+1}$. Since we consider the asymptotic convergence rate, we may assume that $f$ is sufficiently close to 1 so that $\sinh(z)=\sigma^{-1}(f)=\log(\frac{1}{1-f}-1)\geq-\frac{1}{2}\log(1-f)$ and $\frac{\log(1-f)}{2+\log(1-f)}\leq 2$ (both hold once $1-f\leq e^{-4}$). Then, by properly replacing the constant $C$, we obtain

dfdτ\displaystyle\frac{df}{d\tau} C(1f)2(log(1f))2log(1f)2+log(1f).\displaystyle\geq C(1-f)^{2}(\log(1-f))^{2}\frac{\log(1-f)}{2+\log(1-f)}. (50)

This inequality yields an upper bound $y$ on the difference $1-f$, whose dynamics is given by

dydτ=Cy2(logy)2logy2+logy.\displaystyle\frac{dy}{d\tau}=-Cy^{2}(\log y)^{2}\frac{\log y}{2+\log y}. (51)

Rewriting this equation, we have

\displaystyle-\frac{2+\log y}{y^{2}(\log y)^{3}}\frac{dy}{d\tau}=C. (52)

By integration, noting that $\frac{d}{dy}\Bigl(\frac{1}{y(\log y)^{2}}\Bigr)=-\frac{2+\log y}{y^{2}(\log y)^{3}}$, we get

1y(logy)2=Cτ.\displaystyle\frac{1}{y(\log y)^{2}}=C\tau. (53)

Thus we obtain

\displaystyle y^{1/2}\log y^{-1/2}=(4C\tau)^{-1/2}, (54)

or equivalently,

y=W((4Cτ)1/2)2,\displaystyle y=W((4C\tau)^{-1/2})^{2}, (55)

where $W$ denotes the inverse of the function $z\mapsto z\log(1/z)$ on $(0,1/e)$. This gives an effective bound on the difference $1-f$ as

1f=O(W(cτ1/2)2)\displaystyle 1-f=O(W(c\tau^{-1/2})^{2}) (56)

with some constant $c$. Since $W(\epsilon)\sim\epsilon/\log(1/\epsilon)$ as $\epsilon\to 0$, this bound behaves like $O(\tau^{-1}(\log\tau)^{-2})$, which is asymptotically smaller than the convergence rate $O(\tau^{-1})$ of learning for the sigmoid and refine gates.
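The derived rates can also be checked numerically. The sketch below (ours, for illustration only) integrates the toy gradient flow $dz/d\tau=-dL/dz$ with explicit Euler steps for the sigmoid, normalized softsign, and fast gates, and prints how close $f$ gets to 1 after a fixed budget of updates; the horizon, step size, number of steps, and initial gate value are arbitrary choices of ours.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each entry: (phi, phi', phi^{-1}); the softsign expressions use the z > 0 branch.
gates = {
    "sigmoid":  (lambda z: sigmoid(z),
                 lambda z: sigmoid(z) * (1.0 - sigmoid(z)),
                 lambda f: np.log(f / (1.0 - f))),
    "softsign": (lambda z: 1.0 - 1.0 / (z + 2.0),
                 lambda z: 1.0 / (z + 2.0) ** 2,
                 lambda f: 1.0 / (1.0 - f) - 2.0),
    "fast":     (lambda z: sigmoid(np.sinh(z)),
                 lambda z: sigmoid(np.sinh(z)) * (1.0 - sigmoid(np.sinh(z))) * np.cosh(z),
                 lambda f: np.arcsinh(np.log(f / (1.0 - f)))),
}

T, lr, steps = 30, 1e-2, 200000  # horizon t1 - t0, Euler step size, number of updates
for name, (phi, dphi, phi_inv) in gates.items():
    z = phi_inv(0.9)             # start every gate from the same gate value f = 0.9
    for _ in range(steps):
        f = phi(z)
        # L = 1 - f^T while f < 1, so gradient descent on z gives dz = lr * T * f^(T-1) * phi'(z).
        z = z + lr * T * f ** (T - 1) * dphi(z)
    print(f"{name:8s}: 1 - f = {1.0 - phi(z):.3e} after {steps} updates")

In a run like this, the normalized softsign gate is left with $1-f$ around $10^{-2}$, the sigmoid gate around $10^{-5}$, and the fast gate several orders of magnitude lower still, in line with the orders $O(\tau^{-1/3})$, $O(\tau^{-1})$, and $O(W(c\tau^{-1/2})^{2})$ derived above.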

C.4 Further Visualization

Fig. 7 shows the gradient of gate functions with different orders of saturation plotted against the output value. This illustrates the intuition behind Theorem 4.4: the gradient $g_{\phi}(f)=\phi^{\prime}(\phi^{-1}(f))$ with respect to the output $f$ converges to 0 at different rates in the limit $f\to 1$, in accordance with the order of saturation.
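The quantity plotted in Fig. 7 has a closed form for the sigmoid and fast gates, and the short sketch below (ours, for illustration) evaluates $g_{\phi}(f)=\phi^{\prime}(\phi^{-1}(f))$ at several gate values; the ratio between the two grows without bound as $f\to 1$, in accordance with Theorem 4.4.

import numpy as np

def g_sigmoid(f):
    # sigma'(sigma^{-1}(f)) = f(1 - f)
    return f * (1.0 - f)

def g_fast(f):
    # For phi(z) = sigma(sinh(z)): phi'(z) = f(1 - f)cosh(z) with sinh(z) = log(f / (1 - f)).
    s = np.log(f / (1.0 - f))
    return f * (1.0 - f) * np.sqrt(1.0 + s ** 2)  # cosh(z) = sqrt(1 + sinh(z)^2)

for f in [0.9, 0.99, 0.999, 0.9999]:
    ratio = g_fast(f) / g_sigmoid(f)
    print(f"f = {f}:  g_sigmoid = {g_sigmoid(f):.2e},  g_fast = {g_fast(f):.2e},  ratio = {ratio:.1f}")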

In addition to simple gradient descent, we also consider solving the toy problem in Section 4 with other optimizers. RMSprop and the Adam optimizer (Kingma and Ba 2015) are widely used for training deep neural networks. We plot the learning dynamics on the toy problem with RMSprop and Adam in Fig. 8. Since RMSprop and Adam solve the problem more efficiently than plain gradient descent, we set the learning rate to $0.01$ and $t_{1}-t_{0}=30$. The other hyperparameters are set to the defaults in the pytorch library. We observe that the fast gate consistently learns faster than the other gate functions. Thus, we conclude that our method also applies to the training of gated recurrent networks with practical optimizers.
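For reference, a minimal PyTorch version of this experiment might look as follows; this is our reconstruction of the setup described above rather than the authors' script, with only the learning rate 0.01 and the horizon $t_{1}-t_{0}=30$ taken from the text.

import torch

def run(gate, steps=500, horizon=30, lr=1e-2):
    z = torch.zeros((), requires_grad=True)    # pre-activation of the forget gate
    opt = torch.optim.RMSprop([z], lr=lr)      # swap in torch.optim.Adam for the Adam curves
    for _ in range(steps):
        f = gate(z)
        loss = (1.0 - f ** horizon).abs()      # L = |lambda_* - f^{t1 - t0}| with lambda_* = 1
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gate(z).item()

print("sigmoid gate: f =", run(torch.sigmoid))
print("fast gate:    f =", run(lambda z: torch.sigmoid(torch.sinh(z))))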

Figure 7: Gradient of gate functions with respect to the output value, which is equivalent to $g_{\phi}$ in Section 4.2. Functions with a higher order of saturation have a larger gradient at both ends.
(a) RMSprop
(b) Adam
Figure 8: Learning dynamics for toy problem in Section 4 with RMSprop and Adam. The fast gate consistently learns the fastest.

Appendix D Arbitrariness of Fast Gate

In Section 5, we adopted the particular form $\phi(z)=\sigma(\sinh z)$ as the fast gate. There are infinitely many other choices of gate functions satisfying the desirable properties in Section 5.1. For example, we can construct a function that saturates even faster by composing $\sinh$ again, i.e., $\phi(z)=\sigma(\sinh(\sinh(z)))$. A natural question is whether such a faster gate function further improves the performance of gated models. We evaluated the LSTM with the iterated fast gate (iter_fast) on the synthetic and image classification tasks. Training converges even faster on both the adding and copy tasks (Fig. 9, 10), which is consistent with our theory. On the other hand, there is almost no difference in test accuracy on the image classification tasks ("Iterated Fast LSTM" in Tab. 4), possibly because the models must learn other complex features in addition to time scales. Note that the processing time increases with the application of another $\sinh$ function. Taking these results and the overhead of the additional $\sinh$ into account, a single composition of $\sinh$ (i.e., the fast gate in the main text) seems to be the simplest and most efficient practical solution for learning long time scales. Further investigation of other effective candidates is left as future work.
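For concreteness, the two variants compared here can be written in a couple of lines (a PyTorch sketch; the function names are ours):

import torch

def fast_gate(z):
    return torch.sigmoid(torch.sinh(z))               # phi(z) = sigma(sinh z)

def iterated_fast_gate(z):
    return torch.sigmoid(torch.sinh(torch.sinh(z)))   # phi(z) = sigma(sinh(sinh z)), saturates even faster

Each additional $\sinh$ raises the order of saturation further but adds one more transcendental function evaluation per gate and per time step, which is the overhead referred to above.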

Appendix E Experimental Details

E.1 Computational settings

Our computational setup is as follows: the CPU is an Intel Xeon Silver 4214R (2.40 GHz), the memory size is 512 GB, and the GPU is an NVIDIA Tesla V100S.

E.2 LSTM with Gate Tying and Parameter Count

An LSTM computes four affine transforms in its state update rule as in Eq. (1). Since each affine transform has the same number of parameters, the LSTM has $4N$ parameters, where $N$ is the number of parameters of one affine transform. In a thorough exploration of variants of gated recurrent models, Greff et al. (2016) reported that coupling the forget gate with the input gate by

ft=1nit\displaystyle f_{t}=1_{n}-i_{t} (57)

does not empirically degrade model performance. Here, $n$ is the hidden dimension and $1_{n}\in\mathbb{R}^{n}$ is an $n$-dimensional vector whose entries are all 1. This trick reduces the number of model parameters from $4N$ to $3N$. Gu et al. (2020b) proposed the refine gate on the basis of this variant of the LSTM. Their method introduces additional $N$ parameters to train an auxiliary gate; thus, the total number of model parameters matches the $4N$ of the original LSTM. In the experiments excluding language modeling, we used the gate-tied variant of the LSTM to fairly compare the learnability of the gate functions. Thus, the LSTMs with the normalized softsign, sigmoid, and fast gates have $3N$ parameters, while the LSTM with the refine gate has $4N$. In the language-modeling experiment, we applied the fast gate to the forget gate without tying gates, to compare the practical performance with an equal number of parameters. Note that the JANET has $2N$ parameters because it further removes the output gate from the gate-tied LSTM.
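To make the parameter counting concrete, the following sketch (our own illustration, not the authors' implementation) shows a gate-tied LSTM cell with a pluggable forget-gate activation; a single fused affine map produces the three pre-activations, giving $3N$ parameters, whereas a standard LSTM cell with a separate input gate would use $4N$.

import torch
import torch.nn as nn

class TiedGateLSTMCell(nn.Module):
    """Sketch of a gate-tied LSTM cell: the input gate is tied to 1 - f_t (Eq. (57))."""

    def __init__(self, input_size, hidden_size,
                 gate_fn=lambda z: torch.sigmoid(torch.sinh(z))):  # fast gate by default
        super().__init__()
        # Three affine maps (forget gate, output gate, cell candidate) fused into one Linear.
        self.linear = nn.Linear(input_size + hidden_size, 3 * hidden_size)
        self.gate_fn = gate_fn

    def forward(self, x, state):
        h, c = state
        zf, zo, zc = self.linear(torch.cat([x, h], dim=-1)).chunk(3, dim=-1)
        f = self.gate_fn(zf)                    # forget gate
        i = 1.0 - f                             # tied input gate, Eq. (57)
        c = f * c + i * torch.tanh(zc)          # cell update
        h = torch.sigmoid(zo) * torch.tanh(c)   # output gate and hidden state
        return h, (h, c)

Replacing gate_fn with torch.sigmoid recovers the sigmoid baseline with the same $3N$ parameter count.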

E.3 Synthetic Tasks

The adding and copy tasks are defined as follows.

Adding task. The input for this task consists of two sequences of prescribed length $T$. One is a sequence of random numbers $(a_{1},\cdots,a_{T})\in\mathbb{R}^{T}$ sampled from the uniform distribution on $[0,1]$. The other is an indicator sequence, which is 1 at two random indices $i_{1}\in[1,T/2)$ and $i_{2}\in[T/2,T)$ and 0 at all other entries. RNNs are required to output the target value $a_{i_{1}}+a_{i_{2}}$ after reading all tokens. We trained and evaluated models on this task with sequence length $T=5000$ using the mean squared error (MSE) loss.
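A batch generator matching this description might look as follows (our own sketch; the shapes and names are arbitrary):

import torch

def adding_batch(batch_size, T):
    values = torch.rand(batch_size, T)                 # random numbers in [0, 1]
    markers = torch.zeros(batch_size, T)
    i1 = torch.randint(0, T // 2, (batch_size,))       # marked index in the first half
    i2 = torch.randint(T // 2, T, (batch_size,))       # marked index in the second half
    rows = torch.arange(batch_size)
    markers[rows, i1] = 1.0
    markers[rows, i2] = 1.0
    inputs = torch.stack([values, markers], dim=-1)    # shape (batch, T, 2)
    targets = values[rows, i1] + values[rows, i2]      # shape (batch,)
    return inputs, targets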

Copy task. The input sequence of length $T+20$ consists of tokens from an alphabet of ten symbols. The first ten tokens are drawn randomly from the eight symbols $\{a_{1},\cdots,a_{8}\}$, and the rest are the void token $a_{0}$, except for the $(T+11)$-th token, which is the revoking token $a_{9}$. RNNs are required to output the first ten tokens after reading the revoking token. Each symbol is encoded as a one-hot vector. The models were trained on this task with void-sequence length $T=500$ using the cross-entropy loss computed over the last 10 tokens, as in a previous study (Gu et al. 2020b), and were evaluated in terms of accuracy.
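Similarly, a copy-task batch could be generated as below (again our own sketch, with token indices 0-9 following the description: 0 is the void token and 9 is the revoking token); the one-hot encoding or an embedding layer is left to the model input pipeline.

import torch

def copy_batch(batch_size, T):
    seq = torch.zeros(batch_size, T + 20, dtype=torch.long)   # all void tokens (index 0)
    to_copy = torch.randint(1, 9, (batch_size, 10))           # first ten symbols from {1, ..., 8}
    seq[:, :10] = to_copy
    seq[:, T + 10] = 9                                        # revoking token at the (T+11)-th position
    targets = to_copy                                         # expected output for the last 10 steps
    return seq, targets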

On both tasks, we randomly generated 64 training examples at each training iteration instead of creating a dataset beforehand. We set the hidden dimension to 128 for each model, as in a previous study (Gu et al. 2020b). For the NRU, the memory cell dimension was set to 64 following the original paper (Chandar et al. 2019). We used RMSprop with a learning rate of $10^{-3}$ and the other parameters set to their defaults ($\alpha=0.99,\epsilon=10^{-8}$) to train each model. We used gradient clipping (Pascanu, Mikolov, and Bengio 2013) with threshold 1 for stable training. For the observation of the growth of the time scales of the models (Fig. 5), the models were trained in the same setting except that the learning rate was set to $10^{-2}$ to see a clearer difference.
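Putting the pieces together, one training step in this setting could look like the following sketch (ours; the AddingRegressor wrapper, the plain nn.LSTM standing in for the gated variants, and the shorter sequence length are placeholders for illustration, while the RMSprop settings and the clipping threshold follow the text).

import torch
import torch.nn as nn

class AddingRegressor(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out[:, -1]).squeeze(-1)   # predict the sum from the last time step

model = AddingRegressor()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99, eps=1e-8)
loss_fn = nn.MSELoss()

inputs, targets = adding_batch(64, 500)            # generator sketched above; shorter T for illustration
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping with threshold 1
optimizer.step()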

E.4 Pixel-by-pixel Image Recognition

We generally follow Gu et al. (2020b) in training models on the image recognition tasks. In the psMNIST task, the pixel permutation is the bit-reversal permutation rather than a random permutation. In the sCIFAR task, recurrent models are given a three-dimensional pixel value at each time step, so the sequence length is 1024 time steps. The hidden dimension of each model was set to 512 for all tasks. We used Adam (Kingma and Ba 2015) for training, and the learning rate was swept over $\{0.0005,0.0002\}$. The best stable learning rate was $0.0002$ for the LSTM with the softsign gate and the NRU, and $0.0005$ for the other models, on every task. Gradient clipping (Pascanu, Mikolov, and Bengio 2013) was applied with threshold 1. We trained the models for 150 epochs with a batch size of 50. The memory size of the NRU was set to 144 as in Chandar et al. (2019).

E.5 Language Modeling

We explain the implementation details of the language modeling experiment. We used the public code of Mahto et al. (2021) to train and evaluate the language models with the same hyperparameters. However, their code causes a GPU out-of-memory error during training in our computational environment: the memory usage increases after the optimizer is switched to ASGD, and training breaks after 324 epochs. We completed training by resuming each model twice after such breaks. Furthermore, each LSTM module was manually re-implemented on the basis of the pytorch library without low-level optimization. Although this results in a longer runtime for the baseline LSTM than the original pytorch LSTM implementation, we did so to compare the runtimes fairly.

Appendix F Additional Experimental Results

Figure 9: MSE for adding task of length 5000 vs elapsed time
Figure 10: Accuracy for copy task of length 500 vs elapsed time
Figure 11: Cross entropy loss for copy task of length 500 vs elapsed time
Model sMNIST psMNIST sCIFAR Time / Epoch
Softsign LSTM 21.63 ±\pm 10.15 91.71 ±\pm 0.33 59.21 ±\pm 0.39 17.7 min.
Sigmoid LSTM 98.88 ±\pm 0.12 95.71 ±\pm 0.02 69.14 ±\pm 0.39 14.3 min.
Refine LSTM 98.94 ±\pm 0.03 95.93 ±\pm 0.16 69.55 ±\pm 0.50 22.7 min.
UR-LSTM 98.80 ±\pm 0.14 96.11 ±\pm 0.08 70.29 ±\pm 0.12 21.7 min.
Fast LSTM 99.05 ±\pm 0.04 96.18 ±\pm 0.14 70.06 ±\pm 0.38 14.7 min.
UF-LSTM 99.01 ±\pm 0.01 96.11 ±\pm 0.03 70.04 ±\pm 0.33 14.7 min.
Iterated Fast LSTM 99.02 ±\pm 0.07 96.22 ±\pm 0.08 69.66 ±\pm 0.41 15.8 min.
Softsign GRU 69.71 ±\pm 41.27 93.80 ±\pm 0.28 61.90 ±\pm 1.33 17.5 min.
Sigmoid GRU 98.99 ±\pm 0.06 95.23 ±\pm 0.07 68.92 ±\pm 0.48 14.9 min.
Refine GRU 98.95 ±\pm 0.02 95.61 ±\pm 0.02 69.54 ±\pm 0.71 22.7 min.
UR-GRU 98.86 ±\pm 0.08 95.49 ±\pm 0.13 70.29 ±\pm 0.35 22.8 min.
Fast GRU 98.95 ±\pm 0.05 95.54 ±\pm 0.34 69.65 ±\pm 0.18 15.5 min.
UF-GRU 98.79 ±\pm 0.02 95.46 ±\pm 0.30 69.59 ±\pm 0.34 15.9 min.
Table 4: Test accuracy on image classification tasks with various LSTM and GRU variants. UR- and UF- stand for the refine and fast gates combined with uniform initialization (Gu et al. 2020b) of the forget gate bias. The processing time per epoch on psMNIST is reported.

For completeness, we further demonstrate the effectiveness of the fast gate in additional experiments.

F.1 Synthetic Tasks Evaluated on Elapsed Time

To show the efficiency of learning with the fast gate in terms of processing time rather than the number of parameter updates, we plot the mean squared error on the adding task against wall-clock training time in Fig. 9. We observe that the fast gate achieves a convergence speed comparable to that of the NRU owing to its smaller computational overhead in the state update. We also show the accuracy and cross-entropy loss for the copy task against wall-clock time in Fig. 10 and Fig. 11. The learning curve of the NRU for the loss is truncated since its training diverges. Training with the fast gate converges the fastest among all baselines. As explained in Appendix D, the iterated fast gate $\phi(z)=\sigma(\sinh(\sinh z))$ converges even faster than the fast gate on both tasks.

F.2 Fast gate combined with other methods

Since our method is in principle applicable to any gated RNN, we further validate the fast gate on another popular gated RNN, the GRU (Cho et al. 2014). Additionally, we evaluate the fast gate combined with a gate initialization technique called uniform initialization (Gu et al. 2020b).

In Tab. 4, we see a similar (but slightly smaller) improvement in accuracy for the GRU as for the LSTM on the psMNIST and sCIFAR tasks, while keeping the computational cost. There is almost no difference in accuracy on the sMNIST task among the sigmoid, refine, and fast gates because sMNIST contains relatively short-term dependencies. For the fast gate, combining it with uniform initialization does not contribute to further performance improvement, unlike for the refine gate. Note that for the refine gate, uniform initialization is applied to both the original gate and the auxiliary gate (Gu et al. 2020b, Appendix A), which actually yields a time scale distribution different from that of the other gates. That is, the time scale distribution of the refine gate with uniform initialization is skewed away from $\mathbb{P}(D=x)\propto 1/x^{2}$, unlike what is explained in (Gu et al. 2020b, Proposition 1). We suspect that this skewness leads to the larger improvements of the refine gate combined with uniform initialization. More sophisticated initialization methods suited also to other gates are left for future work.
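For reference, one common way to realize uniform gate initialization, as we understand it from Gu et al. (2020b), is to draw a target activation $u$ uniformly and set the forget-gate bias to its logit. The sketch below is our reading rather than the authors' code; in particular, the eps margin and the asinh adjustment for the fast gate are our own assumptions.

import torch

def uniform_gate_init_(bias, eps=1e-3, fast=False):
    # Draw target gate activations u ~ Uniform(eps, 1 - eps) and invert the gate function
    # so that the forget gate values are (approximately) uniformly distributed at
    # initialization, corresponding to the 1/x^2 time scale distribution discussed above.
    u = torch.rand_like(bias) * (1.0 - 2.0 * eps) + eps
    logit = torch.log(u / (1.0 - u))                        # sigmoid^{-1}(u)
    with torch.no_grad():
        bias.copy_(torch.asinh(logit) if fast else logit)   # asinh(.) inverts sigma(sinh(.))
    return bias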

F.3 Image classification with limited model size

Model sMNIST psMNIST sCIFAR Time
Softsign 97.20 ±\pm 0.20 87.89 ±\pm 0.76 47.91 ±\pm 1.13 345 sec.
Sigmoid 98.84 ±\pm 0.08 92.80 ±\pm 0.37 62.73 ±\pm 0.75 284 sec.
Refine 98.88 ±\pm 0.12 93.28 ±\pm 0.27 62.10 ±\pm 0.68 486 sec.
UR- 98.69 ±\pm 0.08 92.56 ±\pm 0.46 62.54 ±\pm 0.85 484 sec.
Fast 98.81 ±\pm 0.08 93.83 ±\pm 0.15 62.98 ±\pm 0.29 314 sec.
UF- 98.81 ±\pm 0.07 93.00 ±\pm 0.54 63.00 ±\pm 0.78 315 sec.
chrono 97.96 ±\pm 0.16 90.32 ±\pm 0.62 54.71 ±\pm 0.50 287 sec.
JANET 98.07 ±\pm 0.09 87.84 ±\pm 0.50 55.82 ±\pm 0.35 214 sec.
NRU 98.19 ±\pm 0.22 94.89 ±\pm 0.15 53.34 ±\pm 1.06 802 sec.
Table 5: Test accuracy on image classification tasks with limited model size over four random seeds. Gate variants are applied to the LSTM. UR-/UF- denote the refine and fast gates combined with uniform initialization. The processing time per epoch on psMNIST is reported.

As pointed out by Voelker, Kajić, and Eliasmith (2019), improvements in learning long-term dependencies can be better compared under a limited model size. Therefore, for sMNIST, psMNIST, and sCIFAR, we also tested a smaller classifier than that in the main text in terms of the number of output layers (1 vs. 2) and the hidden dimension of the recurrent layer (128 vs. 512), a setting commonly adopted in the literature (Chandar et al. 2019; Chang et al. 2019). The memory size of the NRU is set to 64 in this setting, as in the original paper.

The results in Tab. 5 are generally consistent with those in the main text, which shows the robustness of the performance improvement of the fast gate. While the NRU shows the highest accuracy on the psMNIST task, it does not scale to larger models in terms of accuracy and processing time, as seen in the main text. For training, we used RMSprop with a learning rate of 0.0005, except as follows: on the sMNIST and sCIFAR tasks, a learning rate of 0.0002 was applied to the NRU since it suffers from training instability at higher learning rates. Gradient clipping (Pascanu, Mikolov, and Bengio 2013) was applied with threshold 1. We trained the models for 150 epochs on the sMNIST and psMNIST tasks and for 100 epochs on the sCIFAR task. The learning rate was divided by 5 after 100 epochs on the sMNIST and psMNIST tasks, and after 50 epochs on the sCIFAR task. The batch size was set to 100 on all tasks. The models were trained with four different random seeds.

F.4 Raw speech classification

Model Test Accuracy
Sigmoid LSTM 21.5*
Refine LSTM 83.9*
UR-LSTM 77.6*
Fast LSTM 92.3
Table 6: Test accuracy on the raw Speech Command (SC) dataset (Romero et al. 2022). Asterisks (*) denote divergence of training.

To see the effectiveness of the fast gate on much longer sequences, we trained LSTMs with various gates on the raw Speech Command dataset (Romero et al. 2022), which consists of sequences of length 16,000. We adopted the public code (https://github.com/HazyResearch/state-spaces) to train a 1-layer LSTM with hidden dimension 256 for classifying speech data into 10 classes. We searched for the best learning rate over $\{1\times 10^{-3},5\times 10^{-4},1\times 10^{-4},5\times 10^{-5}\}$ and the gradient clipping threshold over $\{1,0.1\}$.

The results in Tab. 6 show that gates other than the fast gate suffer from training instability. In particular, the model with the sigmoid gate diverges within a few epochs. We see that the refine gate improves learning, but its training diverges before it reaches sufficiently high accuracy. On the other hand, the LSTM with the fast gate learns stably and achieves the best accuracy. These results indicate that a gate function with a higher order of saturation enables more stable training on such extremely long sequences.