Regularizing Recurrent Neural Networks via Sequence Mixup

Armin Karamzade Amir Najafi Seyed Abolfazl Motahari

Abstract

In this paper, we extend a class of celebrated regularization techniques originally proposed for feed-forward neural networks, namely Input Mixup (Zhang et al., 2017) and Manifold Mixup (Verma et al., 2018), to the realm of Recurrent Neural Networks (RNN). Our proposed methods are easy to implement and have a low computational complexity, while leverage the performance of simple neural architectures in a variety of tasks. We have validated our claims through several experiments on real-world datasets, and also provide an asymptotic theoretical analysis to further investigate the properties and potential impacts of our proposed techniques. Applying sequence mixup to BiLSTM-CRF model (Huang et al., 2015) to Named Entity Recognition task on CoNLL-2003 data (Sang and De Meulder, 2003) has improved the F-1 score on the test stage and reduced the loss, considerably. ^†^†Emails: {karamzade,najafy}@ce.sharif.edu, [email protected] ^†^†An implementation of our method is avaiable at https://github.com/ArminKaramzade/SequenceMixup.

Data Analytics Laboratory (DAL),

Computer Engineering Department,

Sharif University of Technology, Tehran, Iran

1 Introduction

Recurrent neural networks are the basis of the state-of-the-art models in natural language processing, including language modeling (Mikolov et al., 2011), machine translation (Cho et al., 2014) and named entity recognition (Lample et al., 2016). It is needless to say that complex learning tasks require relatively large networks with millions of parameters to be accomplished. However, large neural networks need more data and/or strong regularization techniques to be trained successfully and avoid overfitting. Without the means to collect more data, which is the case in the majority of real-world problems, data augmentation and regularization methods are standard alternative practices to overcome this barrier.

Data augmentation in natural language processing is limited, and often task-specific (Kobayashi, 2018; Kafle et al., 2017). On the other hand, adopting the same regularization methods that are originally proposed for feed-forward (non-recurrent) networks needs to be done with extra care to avoid hurting the network’s information flow between consecutive time-steps. To overcome such limitations, we present Sequence Mixup: a set of training methods, regularization techniques, and data augmentation procedures for RNNs. Sequence Mixup can be considered as the RNN-generalization of input mixup (Zhang et al., 2017) and manifold mixup (Verma et al., 2018), which are already introduced for feed-forward neural networks. Generally speaking, the core idea behind mixup strategies is to mix training samples in the network’s input or hidden layers, where by mix, we simply mean to consider random convex combinations of pairs of samples as alternatives for the actual training data points. Mixup in non-recurrent networks has led to smoother decision boundaries, more robustness to adversarial examples, and better generalization compared to many rival regularization methods (Zhang et al., 2017; Verma et al., 2018). Here, we extend input mixup to RNNs and also propose two variants of manifold mixup, namely Pre-Output Mixup (POM) and Through-Time Mixup (TTM), where mixing occurs in the hidden space of the RNN. POM and TTM differ from each other in the way information flow is passed from one time-step to the next.

In order to elucidate the effect of sequence mixup during the learning stage, consider the classification of half-moons data plotted in figure 1(a) with a simple two-timestep RNN. We have also added some levels of noise to the original data points to make the classification task more challenging. Figures 1(b) and 1(c) show the learned decision boundaries from noisy data via regular training and Pre-Output Mixup, respectively. As can be seen, mixup expands the margin between the classes and increases the decision boundary levels, which in turn renders a smoother decision boundary with less certainty about nearby cross-class samples. Intuitively speaking, this type of training creates artificial samples whose labels and hidden states are obtained from intermixing those of the original samples, in a respective manner. Based on our experiments, applying sequence mixup has improved both the test F-1 score and loss of BiLSTM-CRF model (Huang et al., 2015) on CoNLL-03 data (Sang and De Meulder, 2003) (Section 4.3).

We have also provided a theoretical analysis on the impact of our regularization techniques in the asymptotic regime where network widths become increasingly large, and learning rates become infinitesimally small. In a nutshell, our analysis reveals that as long as the number of hidden state neurons, which we denote by $H$ in this work, is less than the number of distinct classes in a classification problem, both POM and TTM cannot achieve a zero training error regardless of how large the training dataset is or how deep the neural networks become. Moreover, we show that as long as $H$ is less than twice the number of classes, the hidden-state generating section of the RNN acts as a memoryless unit and produces hidden states that are almost independent of previous time-steps. On the other hand, given that $H$ is chosen sufficiently large, both POM and TTM are able to divide the hidden representation space of the RNN into a set of orthogonal affine subspaces, where each subspace is an indicator of a unique class. We refer to this property as spectral compression of sequence mixup, which is a similar behaviour to that of manifold mixup for feed-forward networks.

The rest of the paper is organized as follows: Section 2 reviews a number of related works to this problem. In Section 3, we propose Sequence Mixup, describe its challenges and specifications in detail, and also present our theoretical analysis. Section 4 is devoted to our experiments on real-world data. Finally, Section 5 concludes the paper.

2 Related Works

Data augmentation is a popular technique for training large neural networks; It implicitly regularizes the model, which ultimately leads to a lower generalization error. There are several methods for data augmentation in a variety of areas such as computer vision and speech recognition. For example, cropping, translation, rotation, resizing, and flipping are prevailing techniques for creating new artificial images in computer vision tasks (Shorten and Khoshgoftaar, 2019). Similarly, many successful methods such as SpecAugment (Park et al., 2019), vocal tract length perturbation (Jaitly and Hinton, 2013) and the stochastic feature mapping approach of Cui et al. (2017) have been proposed for data augmentation in speech recognition. However, data augmentation for natural language processing tasks is more challenging. One cannot simply employ signal transformation methods used in image and speech processing tasks, since altering the order of words in a sentence can change both its synthetic and semantic meanings. Majority of existing data augmentation techniques for text either rely on replacing a word with its close alternatives, or are only applicable to specific domains. For instance, Zhang et al. (2015) used a thesaurus to obtain synonyms of a word for replacement, Wang and Yang (2015) used $k$ -nearest neighbors and cosine similarity to find similar words, and more recently Kobayashi (2018) introduced contextual augmentation for extracting synonyms. Furthermore, Sennrich et al. (2016) employed a back translation technique for machine translation, and Kafle et al. (2017) made use of a number of generative models for producing new questions in a visual question answering task.

Popular regularization techniques used in feed-forward networks usually fail to work on recurrent neural networks. For example, weight decay prevents RNNs from learning long-term dependencies (Pascanu et al., 2013), and applying naive Dropout to RNNs causes deterioration of information flow through time-steps (Zaremba et al., 2014). Various types of Dropout have been adopted for RNNs without hurting its memorization capability, e.g., Zaremba et al. (2014) and Pham et al. (2014) applied Dropout only to non-recurrent sections of the network, Gal and Ghahramani (2016) used variational Dropout, Krueger et al. (2016) introduced Zoneout, and Merity et al. (2017) applied the same dropout mask to recurrent weights. Apart from Dropout-based strategies, Cooijmans et al. (2016) extended batch normalization (Ioffe and Szegedy, 2015) to RNNs for reducing internal covariate shift among time-steps. Also, Dieng et al. (2018) proposed Noisin for regularizing the network through injecting random noise into the hidden states of the RNN.

3 Sequence Mixup

In this section, we build upon both Input Mixup (Zhang et al., 2017) and Manifold Mixup (Verma et al., 2018) regularizations from non-recurrent networks and make them applicable on RNNs. First, a set of necessary mathematical notations and definitions need to be established: For a finite set of time indices $\mathcal{T}=\left\{1,2,\ldots,T\right\}$ , let us consider a recurrent neural network with input sequence $\boldsymbol{x}=\{\boldsymbol{x}_{t}\}$ , corresponding one-hot label sequence $\boldsymbol{y}=\{\boldsymbol{y}_{t}\}$ , set of hidden state vectors $\boldsymbol{h}_{t}=f(\boldsymbol{x}_{t},\boldsymbol{h}_{t-1})$ and class predictions $\hat{\boldsymbol{y}}_{t}=g(\boldsymbol{h}_{t})$ for time-steps $t\in\mathcal{T}$ . Here, functions $f$ and $g$ represent the state-generating and output-generating neural architectures in the RNN, respectively. Figure 2 illustrates a block diagram which corresponds to the RNN described above. By a one-hot label $\boldsymbol{y}$ , we simply mean a vector of dimension $C$ , with $C$ denoting the number of distinct classes in a given classification task, where the dimension that corresponds to a particular class is set to $1$ while others are $0$ . Also, let us assume the neural net that corresponds to function $f$ has $H$ output neurons, i.e., the hidden representation space is $H$ -dimensional. Throughout the paper, and for any two vectors $\boldsymbol{z}$ and $\boldsymbol{z}^{\prime}$ and coefficient $0\leq\lambda\leq 1$ , we denote $\lambda\boldsymbol{z}+(1-\lambda)\boldsymbol{z}^{\prime}$ by $\boldsymbol{z}^{\mathrm{mix}}(\lambda)$ . Finally, let us denote $\boldsymbol{D}=\left\{\left(\boldsymbol{x}_{i},\boldsymbol{y}_{i}\right)|~{}i=1,\ldots,n\right\}$ as the given training dataset including $n$ sample pairs.

Figure 2: Block diagram of an RNN with hidden-generating function

f

and output-generating function

g

. Functions

f

and

g

correspond to two distinct (and generally) deep neural networks. In a standard RNN, different time-steps have identical neural networks.

In the conventional mixup setting for feed-forward networks, we $\lambda$ -mix randomly chosen samples from the training set (with $\lambda\in\left[0,1\right]$ , and usually sampled from a $\mathrm{Beta}\left(\alpha,\alpha\right)$ distribution) and then expose them to the network for learning. Here, by $\lambda$ -mixing of two feature-label pairs $\left(\boldsymbol{z},\boldsymbol{y}\right)$ and $\left(\boldsymbol{z}^{\prime},\boldsymbol{y}^{\prime}\right)$ , we mean the artificial pair $\left(\lambda\boldsymbol{z}+\left(1-\lambda\right)\boldsymbol{z}^{\prime},\lambda\boldsymbol{y}+\left(1-\lambda\right)\boldsymbol{y}^{\prime}\right)$ ¹¹1In this example, both feature $\boldsymbol{z}$ and corresponding one-hot label $\boldsymbol{y}$ are vectors, which are different from the RNN training dataset in this paper where input sample-pairs are sequences of length $T$ .. For the case of an RNN, this procedure becomes complicated since one can think of mixing different timesteps of input sequence pair $\left(\boldsymbol{x},\boldsymbol{y}\right)$ with potentially different mixing coefficients $\lambda$ . In fact, the sequence $\boldsymbol{\lambda}=\left\{\lambda_{t}\right\}$ can be it naturally considered as a time-dependent stochastic process with preferably $\mathrm{Beta}\left(\alpha,\alpha\right)$ -distributed marginals.

To summarize our contribution, first, we introduce a computationally efficient way of drawing instances of $\left\{\lambda_{t}\right\}$ with an arbitrary level of temporal correlation in section 3.2. Also, in Section 4.1, we investigate the role of temporal correlation among $\lambda_{t},t\in\mathcal{T}$ in the performance of models. Moreover, we show that extending manifold mixup of Verma et al. (2018) to the realm of RNNs can be done in several ways. In this regard, we propose two possible extensions and give an asymptotic theoretical analysis for them in Section 3.3. A detailed experimental investigation is also presented in Sections 4.2 and 4.3, respectively.

3.1 Algorithms

For each step of training with a mini-batch size of $1$ (extension to larger mini-batch sizes is straightforward), given mixup coefficients $0\leq\lambda_{t}\leq 1$ for all $t\in\mathcal{T}$ , and assuming two randomly selected sample sequences $\left(\boldsymbol{x},\boldsymbol{y}\right)$ and $\left(\boldsymbol{x}^{\prime},\boldsymbol{y}^{\prime}\right)$ from the training dataset $\boldsymbol{D}$ , we propose a natural extension of input mixup as follows:

•

Sequence Input Mixup algorithm at each time step $t$ , replaces $\boldsymbol{x}_{t}$ by $\boldsymbol{x}_{t}^{\mathrm{mix}}(\lambda_{t})$ and $\boldsymbol{y}_{t}$ by $\boldsymbol{y}_{t}^{\mathrm{mix}}(\lambda_{t})$ for the two selected samples $\left(\boldsymbol{x},\boldsymbol{y}\right)$ and $\left(\boldsymbol{x}^{\prime},\boldsymbol{y}^{\prime}\right)$ , and for all $t\in\mathcal{T}$ . This procedure is illustrated in Figure 3(a).

Also, we present two possible extensions for manifold mixup as:

•

Pre-Output Mixup (POM), where mixup occurs before the output layer as shown in Figure 3(b), and
•

Through-Time Mixup (TTM), where mixup happens at the hidden layers which produce and then propagate the mixed hidden state as shown in Figure 3(c).

The main difference between POM and TTM is that the latter keeps only one shared hidden representation for both sample pairs, while the former allows each pair to have their own flow historical information. More specifically, POM replaces $g\left(\boldsymbol{h}_{t}\right)$ by $g\left(\boldsymbol{h}_{t}^{\mathrm{mix}}\left(\lambda_{t}\right)\right)$ and $\boldsymbol{y}_{t}$ by $\boldsymbol{y}_{t}^{\mathrm{mix}}(\lambda_{t})$ , while TTM replaces $\boldsymbol{h}_{t}$ by $\boldsymbol{h}^{\mathrm{mix}}_{t}(\lambda_{t})$ and $\boldsymbol{y}_{t}$ by $\boldsymbol{y}_{t}^{\mathrm{mix}}(\lambda_{t})$ . We have skipped the detailed algorithmic explanation of mixup training in this paper for the sake of readability. An interested reader can find such information in the original papers by Zhang et al. (2017) and Verma et al. (2018), respectively.

(a)

(b)

(c)

Figure 3: Block diagram of the proposed Sequence Input Mixup 3(a), Pre-Output Mixup 3(b), and Through-Time Mixup 3(c). Circle denote the mixup operation.

Remark 1

Even though we originally propose sequence mixup for sequence tagging, adopting each of its methods for other similar tasks is straightforward. For example in sequence classification, mixing the labels is equivalent to replacing sequence label $\boldsymbol{y}$ with $\boldsymbol{y}^{\mathrm{mix}}(\bar{\lambda})$ , where $\bar{\lambda}$ is the empirical mean of process $\left\{\lambda_{t}\right\}$ .

3.2 Temporal Dependency Of Mixing Coefficients

In order to choose $\lambda$ , Zhang et al. (2017) and Verma et al. (2018) employed the Beta distribution, $\mathrm{Beta}(\alpha,\alpha)$ , for selecting coefficients in non-recurrent neural networks, where hyper-parameter $\alpha$ is usually adjusted for each particular task. Following the same strategy, we have also used the Beta distribution to generate mixup coefficient sequence $\left\{\lambda_{t}\right\}$ . However, the set of coefficients $\left\{\lambda_{t}\right\}$ are time-series data and therefore can be correlated through time. In fact, since both input and output sequences have potentially important time-varying dependencies, it is natural to design $\lambda_{t}$ s with some levels of temporal correlation. In section 4, we have experimentally shown that correlation level of mixup coefficients $\left\{\lambda_{t}\right\}$ has a meaningful impact on the performance of the model. Interestingly, for each particular task, fixing the spectral bandwidth of $\left\{\lambda_{t}\right\}$ to a corresponding optimal value across the time $\mathcal{T}$ reduces the loss of the base-line model.

In order to generate the mixup coefficients, we create a non-stationary Markov process with varying correlation levels as follows:

	$\displaystyle\lambda_{1}$	$\displaystyle\sim\mathrm{Beta}\left(\alpha,\alpha\right)$
	$\displaystyle\lambda_{t}\|\lambda_{t-1}$	$\displaystyle\sim\mathrm{Beta}\left(\alpha_{t},\beta_{t}\right)\quad\forall t\geq 2,$

where $\alpha_{t}=\alpha_{t}\left(\lambda_{t-1}\right)$ and $\beta_{t}=\beta_{t}\left(\lambda_{t}\right)$ are computed by solving the following set of equations:

\displaystyle\left\{\begin{array}[]{rl}\mathbb{E}\left[\lambda_{t}|\lambda_{t-1}\right]&=\quad\quad~{}\frac{\alpha_{t}}{\alpha_{t}+\beta_{t}}\quad\quad~{}=\rho\mathbb{E}\left[\lambda_{1}\right]+\left(1-\rho\right)\lambda_{t-1}\\[2.84526pt] \mathrm{Var}\left(\lambda_{t}|\lambda_{t-1}\right)&=\frac{\alpha_{t}\beta_{t}}{\left(\alpha_{t}+\beta_{t}\right)^{2}\left(\alpha_{t}+\beta_{t}+1\right)}=\rho^{2}\mathrm{Var}\left(\lambda_{1}\right)\end{array}\right.

(2)

There are closed-form formulas for efficiently computing $\left(\alpha_{t},\beta_{t}\right)$ s for each time-step. Here, user-defined hyper-parameter $0\leq\rho\leq 1$ controls the level of dependency through time. Specifically, $\rho$ smooths the trajectory of coefficients by forcing $\lambda_{t}$ to be close to $\lambda_{t-1}$ through fixing its conditional mean near the $\lambda_{t-1}$ and reducing its conditional variance.

3.3 Asymptotic Theoretical Analysis

In this part, we present the main theoretical results regarding the asymptotic analysis of sequence mixup, and Pre-Output Mixup (POM) technique in particular. Here, we attempt to give a more intuitive insight regarding our theoretical findings while mathematical details, notations and formal definitions are explained in Appendix A. Roughly speaking, the term “asymptotic” refers to the following two assumed properties:

•

Hidden layer widths (or depths, but not necessarily both) of the neural networks corresponding to functions $f$ and $g$ in Figure 3 become asymptotically large.
•

The learning rate which governs the training stage becomes infinitesimally small, which forces the number of training iterations to be increasingly large. This property makes it safe to assume that all possible pairs of samples in the batch have been mixed several times during the training stage.

As already mentioned in Section 1, let us denote the number of output neurons for function $f$ as $H\in\mathbb{N}$ . Also, it should be reminded that the number of classes in the classification problem has been denoted by $C\geq 2$ . Then, we initially prove the following property for both POM and TTM:

Theorem 1 (Over-Regularization)

For any given RNN architecture and classification task, assume we have $H<C-1$ . Then, for an infinitesimally small learning rate $\eta$ , any training dataset size $n$ , and regardless of the vertical or horizontal sizes of neural networks corresponding to functions $f$ and $g$ , training error of both POM and TTM cannot strictly become, or asymptotically approach toward, zero.

Proof of Theorem 1 is discussed in Appendix A. Theorem 1 states that when $H$ is chosen to be smaller than $C-1$ , with $C$ being the number of classes in the problem, one just over-regularizes the RNN via POM or TTM. In other words, the RNN cannot be trained to zero-force the training error regardless of the complexity of its built-in neural nets and thus suffers from a non-zero bias error ²²2Note that due to Universal Approximation Theorem (Cybenko, 1989), non-regularized and asymptotically large neural nets always completely overfit to a finite-size training dataset, i.e., training error always become or asymptotically approach towards zero.. The following two theorems only hold for POM, however, similar arguments might hold for TTM as well.

Theorem 2

For any given RNN architecture and classification task, assume we have $H<2C-1$ . Then, for any training dataset size $n$ , an infinitesimally small learning rate $\eta$ , and asymptotically large vertical and/or horizontal sizes of the neural networks corresponding to $f$ and $g$ , the following argument holds for the asymptotic solution of POM, denoted by $f^{*}:\mathbb{R}^{H}\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{H}$ :

$f^{*}$ acts as an almost memory-less unit, i.e., there exists function $\tilde{f}:\mathbb{R}^{d}\times\mathcal{Y}\rightarrow\mathbb{R}^{H}$ such that for all $t\in\mathcal{T}\char 92\relax\left\{1\right\}$ :

f^{*}\left(\boldsymbol{x}_{t};\boldsymbol{h}_{t-1}\right)=\tilde{f}\left(\boldsymbol{x}_{t};\boldsymbol{y}_{t-1}\right).

Theorem 2 roughly states that in order to enable POM solution $f^{*}$ to use the information of previous time-steps (through $\boldsymbol{h}_{t}$ ), one has to make sure that $H$ is at least almost twice the number of distinct classes $C$ . Otherwise, $f^{*}$ only makes use of the previous label $\boldsymbol{y}_{t-1}$ and current feature $\boldsymbol{x}_{t}$ to estimate $\boldsymbol{y}_{t}$ , which means the state vector $\boldsymbol{h}_{t}$ that carries the information of all previous time-steps would be ignored. The final theorem which is stated below is the most important property of POM (and probably TTM), which describes their ability to linearly separate different classes in the representation space.

Theorem 3 (Spectral Compression Property)

For any given RNN architecture, classification task, and training dataset size $n$ , assume we have $H\geq 2C-1$ . Also, assume POM is used for regularization with an infinitesimally small learning rate $\eta$ , asymptotically large vertical and/or horizontal sizes of neural nets corresponding to $f$ and $g$ , and an arbitrary mini-batch size for training. Also, let us denote the learned RNN functions by $f^{*}$ and $g^{*}$ , respectively. Then, the output-generating function $g^{*}$ is (at least locally) linear, i.e., there exists $\boldsymbol{W}_{g}\in\mathbb{R}^{H\times C}$ and $\boldsymbol{b}_{g}\in\mathbb{R}^{C}$ such that

g^{*}\left(\boldsymbol{h}\right)=\boldsymbol{W}^{T}_{g}\boldsymbol{h}+\boldsymbol{b}_{g},\quad\forall\boldsymbol{h}\in\mathrm{Conv}\left(\left\{\boldsymbol{h}^{\left(i\right)}_{t}\right\}_{i=1,\ldots,n}^{t=1,\ldots,T}\right),

where $\mathrm{Conv}\left(\cdot\right)$ denotes the convex hull of a set of points. Also, $f^{*}$ divides the representation space $\mathbb{R}^{H}$ into a set of $C$ orthogonal affine subspaces $\left\{\mathcal{S}_{1},\ldots,\mathcal{S}_{C}\right\}$ , such that for all $i=1,\ldots,C$ , we have: $\left\{f^{*}\left(\boldsymbol{x}_{t};\boldsymbol{h}_{t-1}\right)\big{|}~{}\boldsymbol{y}_{t}=\boldsymbol{e}_{i}\right\}\subseteq\mathcal{S}_{i},$ where $\boldsymbol{e}_{i}$ represents a one-hot vector with its $i$ th component being $1$ and the rest being $0$ . Also, the following equality holds w.r.t. dimensions of affine subspaces $\mathcal{S}_{1},\ldots,\mathcal{S}_{C}$ :

\sum_{i=1}^{C}\mathrm{dim}\left(\mathcal{S}_{i}\right)=H-C+1.

Spectral compression property is the main essence of hidden-state-based mixup strategies, i.e., manifold mixup for feed-forward nets and the proposed Pre-Output Mixup for RNNs. It shows that the mixup regularizer, when applied to the hidden states of neural nets, forces the core network to map data samples with different labels into distinct orthogonal affine subspaces of the representation space. Also, it dictates that the output-generating network mimics a simple linear unit. This property, which is partially validated through our experiments in Section 4 (even for non-asymptotic neural architectures), means that the spectral power distribution of hidden-state vectors for samples with the same label becomes more compact compared to non-regularized cases.

4 Experiments

In this section, three sets of experiments have been provided: In Section 4.1, we have studied the performance of sequence mixup as the correlation between mixup coefficients varies in time. In Section 4.2, spectral compression property of POM and TMM have been studied. Finally, in Section 4.3, we have compared the performance of sequence mixup with standard training. All experiments have been conducted on the named entity recognition task over CoNLL-2003 data (Sang and De Meulder, 2003).

4.1 Correlation of Mixup Coefficients

We define the baseline model as a model consisting of an embedding layer initialized with weights of Glove embedding (Pennington et al., 2014) followed by a single layer recurrent network with hidden size of $H=256$ , and finally a layer to map hidden states to class scores.

We have trained the baseline model with LSTM cell, $50$ epochs using stochastic gradient descent with an initial learning rate of $0.1$ (while halving it after each $10$ epochs). Also, cross-entropy is chosen as the measure of loss. Sequence mixup training with mixup coefficients $\left\{\lambda_{t}\right\}$ generated from the Markov process of Section 3.2 (with $\alpha=0.5$ and varying $\rho$ ) have been separately utilized for training. Figure 4 illustrates the model’s F-1 score and loss on the test data as a function of $\rho$ for Sequence Input Mixup, Pre-Output Mixup and Through-Time Mixup, respectively. As it is evident, the choice of $\rho=0$ , which is equivalent to setting all $\lambda_{t}$ identical, has maximum test F-1 on all methods. Also, choosing a particular intermediate value for $\rho$ has the worst effect on test loss.

4.2 Spectral Compression

Based on the theoretical results in Section 3.3, POM (and possibly TTM) divide the hidden state of an RNN to orthogonal affine subspaces in the $H$ -dimensional representation space, where each subspace associates to a particular class index. This behavior will compress hidden states of samples within the same class into a lower-dimensional space. We have tried Singular Value Decomposition (SVD) to capture this effect by analyzing the spectral power compactness of the corresponding singular values.

Figure 5 plots the $20$ largest singular values obtained from hidden state vectors of training data which correspond to the same label, for two randomly selected and distinct labels. The baseline model with LSTM cell and $\alpha=1$ is employed and subsequently trained with regular training, POM, and TTM, respectively. As demonstrated in the figure, more compact singular value distributions for POM and TTM suggest that hidden states for such methods lie on a compact and lower-dimensional subspace compared to the case of standard training. This observation is in agreement with the Spectral Compression Property, which is proved for asymptotic network architectures in Section 3.3.

4.3 Evaluation

Table 1 shows the F-1 scores of the baseline model on the test data with various cell types when trained with and without Sequence Mixup for different values of $\alpha$ . We have set $\rho=0$ for generating mixup coefficients, as suggested by the previous result. Details of the experiments are similar to that of Section 4.1.

cell	Standard	alpha	Sequence Input Mixup	POM	TTM
RNN		$0.1$	$72.25\pm 0.45$	$71.51\pm 0.24$	$72.14\pm 0.38$
		$0.2$	$72.40\pm 0.23$	$71.75\pm 0.61$	$72.30\pm 0.33$
	$70.72\pm 0.21$	$0.5$	$71.53\pm 0.42$	$72.32\pm 0.55$	$72.57\pm 0.22$
		$1$	$72.58\pm 0.63$	$71.54\pm 0.43$	$72.15\pm 0.31$
		$2$	$72.48\pm 0.32$	$70.68\pm 0.35$	$71.96\pm 0.34$
GRU		$0.1$	$73.55\pm 0.42$	$73.48\pm 0.17$	$73.37\pm 0.45$
		$0.2$	$73.78\pm 0.34$	$73.36\pm 0.29$	$73.86\pm 0.45$
	$72.17\pm 0.30$	$0.5$	$73.85\pm 0.51$	$73.05\pm 0.33$	$73.74\pm 0.58$
		$1$	$74.00\pm 0.40$	$72.90\pm 0.54$	$73.33\pm 0.38$
		$2$	$73.32\pm 0.33$	$72.93\pm 0.32$	$73.33\pm 0.61$
LSTM		$0.1$	$74.35\pm 0.30$	$73.67\pm 0.38$	$73.84\pm 0.17$
		$0.2$	$74.35\pm 0.31$	$73.78\pm 0.61$	$74.34\pm 0.70$
	$73.23\pm 0.31$	$0.5$	$74.73\pm 0.51$	$73.80\pm 0.53$	$74.33\pm 0.50$
		$1$	$74.34\pm 0.54$	$73.94\pm 0.22$	$73.96\pm 0.32$
		$2$	$74.28\pm 0.33$	$73.77\pm 0.30$	$73.39\pm 0.65$

Table 1: F-1 score of the baseline model with different cells and various values of

\alpha

on CoNLL-2003 Named Entity Recognition.

In Table 2, we have trained biLSTM-CRF model (Huang et al., 2015) with combination of contextual word embeddings (Akbik et al., 2018) and Glove embeddings (Pennington et al., 2014). The model specification is the same as Huang et al. (2015) for CoNLL-2003 on named entity recognition task. Specifically, there is an embedding layer initialized with contextual word embeddings and Glove, followed by a single-layer bidirectional LSTM with hidden size of $256$ , and finally a linear layer to generate the class scores. Similarly, for training, we followed the Akbik et al. (2018) and trained the model using Vanilla stochastic gradient descent, clipping gradients at $5$ for $250$ epochs with batch size of $32$ . For scheduling the learning rate, we halve it if training loss doesn’t improve for $6$ consecutive epochs. When training the model without mixup, we use locked dropout (Merity et al., 2017) and word-level dropout as used by Akbik et al. (2018). However, while using Sequence Mixup, we only utilize locked dropout to avoid the over-regularization problem. We perform model selection over $\alpha\in\left\{0.1,0.2,0.5,0.7,1\right\}$ , choosing the model with the minimum F-1 score on the validation set. Mixup coefficients then set to a random sample from $\mathrm{Beta}\left(\alpha,\alpha\right)$ across the sentence – same as setting $\rho=0$ in the previous part, and the model trains with both training and validation set. Experiments are then repeated $5$ times to obtain the mean and standard deviation of the F-1 score and loss, respectively.

Denoting target sequence of length $T$ by $\boldsymbol{y}$ , predicted class scores by $\hat{\boldsymbol{y}}$ and transition matrix of the CRF model by $\boldsymbol{A}$ , biLSTM-CRF model of Huang et al. (2015) scores each sequence by³³3For the ease of notations, here, $y_{t}$ denotes the class index which corresponds to the already-defined one-hot vector $\boldsymbol{y}_{t}$ .

\sum_{t=1}^{T}{\hat{y}_{t,y_{t}}}+\sum_{t=1}^{T-1}{A_{y_{t},y_{t+1}}},

and computes the loss function as cross-entropy between one-hot label of the whole sequence among all possible sequences and the distribution of sequence scores. Applying Sequence Mixup methods, converts each sequence score to

\displaystyle\sum_{t=1}^{T}{\lambda_{t}\hat{y}_{t,y_{t}}+(1-\lambda_{t})\hat{y}_{t,y^{\prime}_{t}}}+\sum_{t=1}^{T-1}{\left(\frac{\lambda_{t}+\lambda_{t+1}}{2}\right)A_{y_{t},y_{t+1}}+\left(1-\frac{\lambda_{t}+\lambda_{t+1}}{2}\right)A_{y^{\prime}_{t},y^{\prime}_{t+1}}},

for two samples $\left(\boldsymbol{x},\boldsymbol{y}\right)$ and $\left(\boldsymbol{x}^{\prime},\boldsymbol{y}^{\prime}\right)$ .

biLSTM-CRF	F-1	Test NLL
Standard	$93.22\pm 0.09$	$1.07\pm 0.03$
Sequence Input Mixup ( $\alpha=0.1$ )	$93.29\pm 0.09$	$\mathbf{0.59\pm 0.01}$
Pre-Output Mixup ( $\alpha=0.7$ )	$\mathbf{93.30\pm 0.09}$	$0.72\pm 0.01$
Through-Time Mixup ( $\alpha=0.1$ )	$93.17\pm 0.06$	$0.67\pm 0.004$

Table 2: Evaluation of biLSTM-CRF model using conventional and Seuquence Mixup training methods on CoNLL-2003 NER task.

According to Table 2, TTM’s F-1 on the test data has gotten worse than standard training. This may be due to the fact that TTM is a much stronger regularizer than the other two, which them means it does not need locked dropout as an extra regularization. Interestingly, the loss of Sequence Mixup methods is much smaller than regular training, which consolidates the claim that Sequence Mixup renders a decision boundary with less certainty about difficult-to-classify instances.

5 Conclusion

We introduce Sequence Mixup, a set of regularization and data augmentation techniques for RNNs. Our work can thought as extending both input mixup (Zhang et al., 2017) and manifold mixup (Verma et al., 2018), which are originally porposed for feed-forward neural nets. For the case of manifold mixup, we propose two distinct variants called Pre-Output and Throgh-Time Mixup, respectively. An asymptotic theoretical analysis reveals that Pre-Output Mixup imposes (at least) a locally linear behavior on the network’s output generating section. In a classification task, this property leads to partitioning of the hidden representation space into a set of orthogonal affine subspaces, each of which corresponds to a unique class. Experimental results showed improvement on the loss and F-1 scores of both 1) a baseline and 2) state-of-the-art model on CoNLL-2003 NER task. We have studied the correlation of mixup coefficients through consecutive time-steps, and found out that using identical coefficients achieves better loss and F-1 on the NER task. However, at the same time, we conjecture that optimal correlation values for mixup coefficients across time may vary from task to task and thus requires experimental exploration to be adjusted. Lastly, the considerable reduction in the test loss achieved by sequence mixup methods (Section 4.3) implies that employing sequence mixup methods for language models may lead to a substantial improvement on the test perplexity.

References

Akbik et al. (2018) Akbik, A., D. Blythe, and R. Vollgraf (2018). Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649.
Cho et al. (2014) Cho, K., B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014). Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734.
Cooijmans et al. (2016) Cooijmans, T., N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
Cui et al. (2017) Cui, X., V. Goel, and B. E. D. Kingsbury (2017, February). Data augmentation method based on stochastic feature mapping for automatic speech recognition. (20170040016).
Cybenko (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2(4), 303–314.
Dieng et al. (2018) Dieng, A. B., R. Ranganath, J. Altosaar, and D. M. Blei (2018). Noisin: Unbiased regularization for recurrent neural networks. In 35th International Conference on Machine Learning, ICML 2018, pp. 2030–2039. International Machine Learning Society (IMLS).
Gal and Ghahramani (2016) Gal, Y. and Z. Ghahramani (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059.
Huang et al. (2015) Huang, Z., W. Xu, and K. Yu (2015). Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
Ioffe and Szegedy (2015) Ioffe, S. and C. Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
Jaitly and Hinton (2013) Jaitly, N. and G. E. Hinton (2013). Vocal tract length perturbation (vtlp) improves speech recognition. In Proc. ICML Workshop on Deep Learning for Audio, Speech and Language, Volume 117.
Kafle et al. (2017) Kafle, K., M. Yousefhussien, and C. Kanan (2017, September). Data augmentation for visual question answering. In Proceedings of the 10th International Conference on Natural Language Generation, Santiago de Compostela, Spain, pp. 198–202. Association for Computational Linguistics.
Kobayashi (2018) Kobayashi, S. (2018, June). Contextual augmentation: Data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 452–457. Association for Computational Linguistics.
Krueger et al. (2016) Krueger, D., T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal (2016). Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305.
Lample et al. (2016) Lample, G., M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016). Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270.
Merity et al. (2017) Merity, S., N. S. Keskar, and R. Socher (2017). Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.
Mikolov et al. (2011) Mikolov, T., S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur (2011). Extensions of recurrent neural network language model. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5528–5531. IEEE.
Park et al. (2019) Park, D. S., W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019). Specaugment: A simple augmentation method for automatic speech recognition.
Pascanu et al. (2013) Pascanu, R., T. Mikolov, and Y. Bengio (2013). On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318.
Pennington et al. (2014) Pennington, J., R. Socher, and C. D. Manning (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
Pham et al. (2014) Pham, V., T. Bluche, C. Kermorvant, and J. Louradour (2014). Dropout improves recurrent neural networks for handwriting recognition. In 2014 14th international conference on frontiers in handwriting recognition, pp. 285–290. IEEE.
Sang and De Meulder (2003) Sang, E. F. and F. De Meulder (2003). Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
Sennrich et al. (2016) Sennrich, R., B. Haddow, and A. Birch (2016, August). Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 86–96. Association for Computational Linguistics.
Shorten and Khoshgoftaar (2019) Shorten, C. and T. M. Khoshgoftaar (2019). A survey on image data augmentation for deep learning. Journal of Big Data 6(1), 60.
Verma et al. (2018) Verma, V., A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, A. Courville, D. Lopez-Paz, and Y. Bengio (2018). Manifold mixup: Better representations by interpolating hidden states. arXiv preprint arXiv:1806.05236.
Wang and Yang (2015) Wang, W. Y. and D. Yang (2015, September). That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2557–2563. Association for Computational Linguistics.
Zaremba et al. (2014) Zaremba, W., I. Sutskever, and O. Vinyals (2014). Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
Zhang et al. (2017) Zhang, H., M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
Zhang et al. (2015) Zhang, X., J. Zhao, and Y. LeCun (2015). Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657.

Appendix A Asymptotic Theoretical Analysis: Details

In this section, we theoretically analyze the asymptotic aspects of some of our regularization techniques during the training stage. The term asymptotic here implies that our analysis only considers cases where the width of the neural networks used in the architecture of RNNs, and also the number of SGD iterations in the training stage go to infinity. Non-asymptotic analyzes have also been done w.r,t. neural networks, however, their depth, scope and achievements have been considerably limited due to the inherent technical difficulty of the problem. Consideration of asymptotically wide neural nets (when the number of neurons in the hidden layer goes to infinity) allow us to take advantage of the Universal Approximation Theorem (UAT) (Cybenko, 1989) in order to facilitate the behaviour and performance analysis of neural networks.

First, let us introduce a number of necessary notations that we use in this section. We show the empirical distribution of data samples $\boldsymbol{D}=\left\{\left(\boldsymbol{x}_{1},\boldsymbol{y}_{1}\right),\ldots,\left(\boldsymbol{x}_{n},\boldsymbol{y}_{n}\right)\right\}$ by

\hat{\mathbb{P}}_{\boldsymbol{D}}\triangleq\sum_{i=1}^{n}\delta_{\boldsymbol{Z}_{i}},

(3)

where $\boldsymbol{Z}_{i}$ denotes the tuple $\left(\boldsymbol{x}_{i},\boldsymbol{y}_{i}\right)$ showing the $i$ th training sample consisting of feature vector $\boldsymbol{x}_{i}$ and one-hot label vector $\boldsymbol{y}_{i}\in\left\{0,1\right\}^{C}$ . Here, $C$ denotes the number of classes in our supervised learning problem. Let $\mathrm{NN}_{q,r}$ represent the hypothesis set corresponding to neural network classifiers with $C$ output neurons, which have at least $q$ hidden layers, and at least $r$ neurons per each hidden layer. Throughout this section, we assume $f\in\mathrm{NN}_{q,r}$ with $q\geq 1$ .

Training with Input or Manifold Mixup cannot be readily formulated into a manageable analytic format unless one assumes very small gradient descent steps and a consequently large number of iterations. Let us assume stochastic gradient descent (SGD) with a mini-batch size of $1$ is used to optimize the loss function associated to any of the Input and/or Manifold Mixup regularization techniques described in Section 3. Also, let $N_{\mathrm{iter}}$ denote the number of SGD iterations during the training stage. In this section, we are particularly interested in an asymptotic regime where both $N_{\mathrm{iter}}$ and $r$ go to infinity. In other words, we set out to investigate the properties of the solution neural networks $f^{*}$ and $g^{*}$ defined as

\displaystyle\left(f^{*},g^{*}\right)\triangleq\lim_{r\rightarrow\infty}\operatorname*{\arg\!\min}_{\left(f,g\right)\in\mathrm{NN}_{q,r}}~{}\left(\lim_{N_{\mathrm{iter}}\rightarrow\infty}\hat{\ell}_{\mathrm{emp}}\right).

(4)

We show that regardless of the size of the training dataset $n$ , or the inherent hardness of the learning task, under ceration conditions there exist $f^{*}$ and $g^{*}$ which give a zero value for the empirical loss function $\hat{\ell}_{\mathrm{emp}}$ .

A.1 Asymptotic Theory of Pre-Output Mixup (POM)

In this part, we analyze Pre-Output Mixup (POM) regularization technique. Assuming an infinitely large number of training iterations, the following equality holds almost surely according to the law of large numbers:

\displaystyle\lim_{N_{\mathrm{iter}}\rightarrow\infty}\hat{\ell}_{\mathrm{emp}}\stackrel{{\scriptstyle a.s}}{{=}}\mathbb{E}_{\left\{\lambda_{t},\boldsymbol{Z}_{t},\boldsymbol{Z}^{\prime}_{t}\right\}}\left\{\sum_{t\in\mathcal{T}}\ell\bigg{[}g\left(\lambda_{t}\boldsymbol{h_{t}}+\left(1-\lambda_{t}\right)\boldsymbol{h}^{\prime}_{t}\right),~{}\lambda_{t}y_{t}+\left(1-\lambda_{t}\right)y^{\prime}_{t}\bigg{]}\right\},

where $\boldsymbol{Z}_{t}$ and $\boldsymbol{Z}^{\prime}_{t}$ are training sequences randomly and independently chosen from the empirical data distribution $\hat{\mathbb{P}}_{\boldsymbol{D}}$ , $\boldsymbol{h}_{t}$ and $\boldsymbol{h}^{\prime}_{t}$ are temporal state vectors $f\left(\boldsymbol{x}_{t},\boldsymbol{h}_{t-1}\right)$ and $f\left(\boldsymbol{x}^{\prime}_{t},\boldsymbol{h}^{\prime}_{t-1}\right)$ , respectively, and $\left\{\lambda_{t}\right\}$ is an arbitrary Ergodic stochastic process with marginal Beta distributions. The process $\left\{\lambda_{t}\right\}$ is assumed to be independent from the random choice of training sequence pairs $\boldsymbol{Z}_{t}$ and $\boldsymbol{Z}^{\prime}_{t}$ .

Based on the independence assumption between stochastic processes $\left\{\lambda_{t}\right\}$ and $\left\{\left(\boldsymbol{Z}_{t},\boldsymbol{Z}^{\prime}_{t}\right)\right\}$ , the empirical error for asymptotic regime of $N_{\mathrm{iter}}\rightarrow\infty$ can be rewritten and subsequently simplified as follows:

		$\displaystyle\mathbb{E}_{\left\{\lambda_{t},\boldsymbol{Z}_{t},\boldsymbol{Z}^{\prime}_{t}\right\}}\left\{\sum_{t\in\mathcal{T}}\ell\bigg{[}g\left(\boldsymbol{h}^{\mathrm{mix}}_{t}\left(\lambda_{t}\right)\right),~{}y^{\mathrm{mix}}_{t}\left(\lambda_{t}\right)\bigg{]}\right\}=$		(5)
		$\displaystyle\hskip 17.07164pt\frac{1}{n^{2}}\sum_{i,j=1}^{n}\mathbb{E}_{\left\{\lambda_{t}\right\}}\left\{\sum_{t\in\mathcal{T}}\ell\bigg{[}g\left(\lambda_{t}\boldsymbol{h}_{t,i}+\left(1-\lambda_{t}\right)\boldsymbol{h}_{t,j}\right),~{}\lambda_{t}\boldsymbol{y}_{t,i}+\left(1-\lambda_{t}\right)\boldsymbol{y}_{t,j}\bigg{]}\right\},$

where $\boldsymbol{A}_{t,i}$ for a finite set of vector sequences $\left\{\boldsymbol{A}_{t}\right\}$ means the $t$ th time-instance of the $i$ th member in the set. Equation (5) implies that in order to get a zero training error, one must have the following equality for all $t\in\mathcal{T}$ and $i\in\left[n\right]$ :

\displaystyle\int\ell\bigg{[}g^{*}\left(\lambda_{t}\boldsymbol{h}_{t,i}+\left(1-\lambda_{t}\right)\boldsymbol{h}_{t,j}\right),~{}\lambda_{t}\boldsymbol{y}_{t,i}+\left(1-\lambda_{t}\right)\boldsymbol{y}_{t,j}\bigg{]}\mathrm{d}\mathbb{P}\left(\lambda_{t}|\boldsymbol{\lambda}_{\mathrm{prev}}\right)=0,

(6)

where $\boldsymbol{\lambda}_{\mathrm{prev}}$ indicates the set of all $\lambda_{t^{\prime}}$ values with $t^{\prime}<t$ and $\mathbb{P}\left(\lambda_{t}|\boldsymbol{\lambda}_{\mathrm{prev}}\right)$ denotes the conditional probability measure of stochastic process $\left\{\lambda_{t}\right\}$ . Recall that $\left\{\lambda_{t}\right\}$ has been assumed to be an Ergodic real-valued stochastic process, which makes (6) to hold only when

	$\displaystyle g^{*}\left(\boldsymbol{h}_{t,i}\right)=\boldsymbol{y}_{t,i},$	(7)
$\displaystyle\mathrm{and}\quad$	$\displaystyle g^{*}\left(\lambda\boldsymbol{h}_{t,i}+\left(1-\lambda\right)\boldsymbol{h}_{t,j}\right)=\lambda\boldsymbol{y}_{t,i}+\left(1-\lambda\right)\boldsymbol{y}_{t,j},$	(8)
$\displaystyle\mathrm{forall}\quad$	$\displaystyle t\in\mathcal{T},i\in\left[n\right]~{}\mathrm{and}~{}\forall\lambda\in\left[0,1\right].$

The condition (8) dictates a locally linear behaviour for the presumed optimal neural network $g^{*}$ . In other words, $g^{*}$ is supposed to act as a linear function between any two pairs of points $\left\{\boldsymbol{h}_{t,i}|~{}i\in\left[n\right],~{}t\in\mathcal{T}\right\}$ . Obviously, the local linear behaviour is satisfied if we just let $g^{*}$ be a linear function throughout its input space. In fact, an important implication of the above analytic statement regarding $g^{*}$ is that adding extra hidden layers for the output-generating neural net $g$ in order to make it more flexible for imitating nonlinear relations is totally redundant. This it due to the fact that the whole network still tries to imitate a linear function which can be simply attained with zero hidden layers.

By assuming the global linearity property for $g$ , we have the following relation for the optimal function $g^{*}:\mathbb{R}^{H}\rightarrow\mathbb{R}^{C}$ , with $H$ denoting the number of neurons for the hidden state vector $\boldsymbol{h}\in\mathbb{R}^{H}$ :

\displaystyle g^{*}\left(\boldsymbol{h}\right)=\boldsymbol{W}_{g}^{T}\boldsymbol{h}+b_{g},\quad\forall\boldsymbol{h}\in\mathbb{R}^{H},

(9)

where $\boldsymbol{W}_{g}$ is an $H\times C$ dimensional matrix and $b_{g}\in\mathbb{R}$ . The following statements still hold even if we only consider the proved local linearity property of $g^{*}$ . However, we omit further analytical details for the sake of simplicity and stick to the global linearity assumption. On the other hand, for all $t\in\mathcal{T},~{}i\in\left[n\right]$ we must have

\displaystyle\left\{\begin{array}[]{l}\boldsymbol{h}_{t,i}=f^{*}\left(\boldsymbol{x}_{t,i}~{},~{}\boldsymbol{h}_{t-1,i}\right)\\ \boldsymbol{y}_{t,i}=\boldsymbol{W}^{T}_{g}\boldsymbol{h}_{t,i}+b_{g}\end{array}\right.

(12)

The equalities above almost surely hold under mild conditions which are derived in the remainder of this section. The fundamental reason for this phenomenon is that we have already assumed $r\rightarrow\infty$ ; Then, based on the universal approximation theorem, the function set $\mathrm{NN}_{q,r}$ with $q\geq 1$ becomes dense in the space of functions $\left\{f:\mathbb{R}^{d+H}\rightarrow\mathbb{R}^{H}\right\}$ , with $d$ denoting the dimension of input feature vectors. In other words, $f^{*}$ is simply assumed to be able to approximate any function with an arbitrarily small error bound which solely depends on the network’s width $r$ and goes to zero as $r$ is being increased.

Note that one-hot label vectors $\boldsymbol{y}_{t,i}$ are finite. In fact, there are only $C$ possible distinct label vectors $\boldsymbol{e}_{1},\ldots,\boldsymbol{e}_{C}$ , where $\boldsymbol{e}_{i}$ represents a one-hot $C$ -dimensional vector where the $i$ th component is $1$ and the rest are $0$ . In other words, the linear function $g^{*}$ has been assumed to map the set of $H$ -dimensional real-valued vectors $\left\{\boldsymbol{h}_{t,i}\right\}$ to the set of linearly independent vectors $\left\{\boldsymbol{e}_{1},\ldots,\boldsymbol{e}_{C}\right\}$ .

Let $\boldsymbol{\mathcal{H}}$ be the $H\times\left(n\left|\mathcal{T}\right|\right)$ matrix whose columns are members of $\left\{\boldsymbol{h}_{t,i}\right\}$ . Also, let $\boldsymbol{\mathcal{Y}}$ be a $C\times\left(n\left|\mathcal{T}\right|\right)$ matrix, which is similarly attained from $\left\{\boldsymbol{y}_{t,i}\right\}$ . Then, we have the following relation:

\boldsymbol{W}^{T}_{g}\boldsymbol{\mathcal{H}}=\boldsymbol{\mathcal{Y}}-b_{g}\boldsymbol{1}_{C\times\left(n\left|\mathcal{T}\right|\right)}.

(13)

The above equality has a number of interesting implications for the optimal neural network $f^{*}$ :

•

First, it should be noted that the r.h.s. of (13) is at least rank $C-1$ (assuming we have at least one observation from each class in the training dataset $\boldsymbol{D}$ ), while the l.h.s. is at most rank $H$ . Then, in order to get a zero training error for Pre-Output Mixup regularizer, we must have $H\geq C-1$ . It should be reminded that $H$ is in fact the number of output neurons for $f^{*}$ . This result is in agreement with the analysis of the Manifold Mixup technique presented by Verma et al. (2018), where the latter only considers non-recurrent frameworks.
•

Second, the hidden state subsets $\boldsymbol{\mathcal{H}}_{j}\triangleq\left\{\boldsymbol{h}_{t,i}|\boldsymbol{y}_{t,i}=\boldsymbol{e}_{j}\right\}$ for $j\in\left[C\right]$ must lie in $C$ separate and orthogonal linear subspaces $\mathcal{S}_{j}\subseteq\mathbb{R}^{H}$ . In other words, we have $\boldsymbol{\mathcal{H}}_{j}\subseteq\mathcal{S}_{j}$ , and $\mathcal{S}_{i}\perp\mathcal{S}_{j}$ for $i\neq j$ . We also have

$\sum_{j=1}^{C}\mathrm{dim}\left(\mathcal{S}_{j}\right)=H-C+1.$ (14)

In this regard, the neural network function $f^{*}$ acts as a state-changing machine that switches among orthogonal subspaces in the hidden-state space $\mathbb{R}^{H}$ according to the input feature vector $\boldsymbol{x}$ . More specifically, assume the machine is at state $\boldsymbol{h}_{t-1}$ and has produced the label $\boldsymbol{y}_{t-1}\in\left\{\boldsymbol{e}_{1},\ldots,\boldsymbol{e}_{C}\right\}$ . Therefore, the state vector $\boldsymbol{h}_{t-1}$ must lie in the low-dimensional subspace $\mathcal{S}_{t-1}\subseteq\mathbb{R}^{H}$ which is uniquely determined by $\boldsymbol{y}_{t-1}$ through the procedure described earlier (we have $\boldsymbol{y}_{t-1}=g\left(\boldsymbol{h}_{t-1}\right)$ , where $g$ is a linear function). At time step $t$ , the input vector $\boldsymbol{x}_{t}$ is applied to the machine and its state would be changed to $\boldsymbol{h}_{t}$ which lies on the low-dimensional subspace $\mathcal{S}_{t}$ , again uniquely characterized by the label at time $t$ , shown as $\boldsymbol{y}_{t}$ .

Interestingly, we have $\boldsymbol{h}_{t}\in\mathcal{S}_{t}=\psi\left(\boldsymbol{y}_{t}\right)$ for all $t\in\mathcal{T}$ where $\psi$ is a fixed mapping from the set of all possible one-hot labels $\left\{\boldsymbol{e}_{1},\ldots,\boldsymbol{e}_{C}\right\}$ to a set of $C$ low-dimensional subspaces in $\mathbb{R}^{H}$ , i.e. $\left\{\mathcal{S}_{1},\ldots,\mathcal{S}_{C}\right\}$ . That means the subspace encompassing the hidden state $\boldsymbol{h}_{t}$ does not depend on the history of the sequence and is uniquely identified by the current label $\boldsymbol{y}_{t}$ . However, if we have $\mathrm{dim}\left(\mathcal{S}_{t}\right)>0$ , then $\boldsymbol{h}_{t}$ can still take infinitely many values and thus would be able to store information about the past. Therefore, in order to provide the architecture with capability of storing information about the history of a sequence rather than just the current labels, the following inequality must hold:

\sum_{j=1}^{C}\mathrm{dim}\left(\mathcal{S}_{j}\right)=H-C+1\geq\sum_{j=1}^{C}1=C,

(15)

which means one should have $H\geq 2C-1$ .