
Self-supervised and Supervised Joint Training for
Resource-rich Machine Translation

Yong Cheng    Wei Wang    Lu Jiang    Wolfgang Macherey
Abstract

Self-supervised pre-training of text representations has been successfully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve notable gains on resource-rich NMT. In this paper, we propose a joint training approach, $F_2$-XEnDec, to combine self-supervised and supervised learning to optimize NMT models. To exploit complementary self-supervised signals for supervised learning, NMT models are trained on examples that are interbred from monolingual and parallel sentences through a new process called crossover encoder-decoder. Experiments on two resource-rich translation benchmarks, WMT’14 English-German and WMT’14 English-French, demonstrate that our approach achieves substantial improvements over several strong baseline methods and obtains a new state of the art of 46.19 BLEU on English-French when incorporating back translation. Results also show that our approach is capable of improving model robustness to input perturbations such as code-switching noise, which frequently appears on social media.


1 Introduction

Self-supervised pre-training of text representations (Peters et al., 2018; Radford et al., 2018) has achieved tremendous success in natural language processing applications. Inspired by BERT (Devlin et al., 2019), recent works attempt to leverage sequence-to-sequence model pre-training for Neural Machine Translation (NMT) (Lewis et al., 2019; Song et al., 2019; Liu et al., 2020b). Generally, these methods comprise two stages: pre-training and finetuning. During the pre-training stage, the model is learned with a self-supervised task on abundant unlabeled data (i.e. monolingual sentences). In the second stage, the full or partial model is finetuned on a downstream translation task of labeled data (i.e. parallel sentences). Studies have demonstrated the benefit of pre-training for the low-resource translation task in which the labeled data is limited (Lewis et al., 2019; Song et al., 2019). All these successes share the same setup: pre-training on abundant unlabeled data and finetuning on limited labeled data.

In many NMT applications, we are confronted with a different setup where abundant labeled data, e.g., millions of parallel sentences, are available for finetuning. For these resource-rich translation tasks, the two-stage approach is less effective and, even worse, sometimes can undermine the performance if improperly utilized (Zhu et al., 2020), in part due to the catastrophic forgetting (French, 1999). More recently, several mitigation techniques have been proposed for the two-stage approach (Edunov et al., 2019; Yang et al., 2019; Zhu et al., 2020), such as freezing the pre-trained representations during finetuning. However, these strategies hinder uncovering the full potential of self-supervised learning since the learned representations are either held fixed or slightly tuned in the supervised learning.

In this paper, we study resource-rich machine translation from a different perspective of joint training where, in contrast to the conventional two-stage approaches, we train NMT models in a single stage using the self-supervised objective (on monolingual sentences) in addition to the supervised objective (on parallel sentences). The challenge for this single-stage paradigm is that the self-supervised objective provides a much weaker learning signal that is easily dominated by the supervised signal when the two are trained jointly. As a result, conventional approaches that simply combine self-supervised and supervised learning objectives perform only marginally better than the supervised objective by itself.

This paper aims at exploiting the complementary signals in self-supervised learning to facilitate supervised learning. Inspired by chromosomal crossovers (Rieger et al., 2012), we propose an essential new task called crossover encoder-decoder (or XEnDec) which takes two training examples as inputs (called parents), shuffles their source sentences, and produces a virtual example through a mixture decoder model. Our method applies XEnDec to “deeply” fuse the monolingual (unlabeled) and parallel (labeled) sentences, thereby producing their first and second filial generations (the $F_1$ and $F_2$ generations). As we find that the $F_2$ generation exhibits combinations of traits that differ from those found in either the monolingual or the parallel sentence, we train NMT models on the $F_2$ offspring and name our method $F_2$-XEnDec.

To the best of our knowledge, the proposed method is among the first NMT approaches to jointly perform self-supervised and supervised learning, and moreover, the first to demonstrate that such joint learning substantially benefits resource-rich machine translation. Compared to recent two-stage finetuning approaches (Zhu et al., 2020; Yang et al., 2019), our method only needs a single training stage to utilize the complementary signals in self-supervised learning. Empirically, our results show the proposed single-stage approach achieves comparable or better results than previous methods. In addition, our method improves the robustness of NMT models, a known critical deficiency of contemporary NMT systems (cf. Section 4.3). It is noteworthy that none of the two-stage training approaches have reported this behavior.

We empirically validate our approach on the WMT’14 English-German and WMT’14 English-French translation benchmarks, which yields improvements of 2.13 and 1.78 BLEU points over the vanilla Transformer model (Ott et al., 2018), respectively. It achieves a new state of the art of 46.19 BLEU on the WMT’14 English-French translation task with the back translation technique. In summary, our contributions are as follows:

  1. We propose a crossover encoder-decoder (XEnDec) which, with appropriate inputs, can reproduce several existing self-supervised and supervised learning objectives.

  2. We jointly train self-supervised and supervised objectives in a single stage, and show that our method is able to exploit the complementary signals in self-supervised learning to facilitate supervised learning.

  3. Our approach achieves significant improvements on resource-rich translation tasks and exhibits higher robustness against input perturbations such as code-switching noise.

Figure 1: (a) Illustration of the crossover encoder-decoder (XEnDec). It takes two training examples $(\mathbf{x},\mathbf{y})$ and $(\mathbf{x}^{\prime},\mathbf{y}^{\prime})$ as inputs and outputs a sentence pair $(\tilde{\mathbf{x}},\tilde{\mathbf{y}})$. (b) Our method applies XEnDec to fuse the monolingual (blue) and parallel (red) sentences. In the first generation, $F_1$-XEnDec generates $(n(\mathbf{y}^{u}),\mathbf{y}^{u})$ incurring a self-supervised loss $\mathcal{L}_{F_1}$, where $n(\mathbf{y}^{u})$ is the function discussed in Section 2.2 that corrupts the monolingual sentence $\mathbf{y}^{u}$. $F_2$-XEnDec applies another round of XEnDec to incorporate parallel data $(\mathbf{x}^{p},\mathbf{y}^{p})$ and obtain the $F_2$ output $(\tilde{\mathbf{x}},\tilde{\mathbf{y}})$. $\mathbf{y}^{u}$: a monolingual sentence. $\mathbf{y}^{u}_{noise}$: a sentence generated by adding non-masking noise to $\mathbf{y}^{u}$. $\mathbf{y}^{u}_{mask}$: a sentence of length $|\mathbf{y}^{u}|$ containing only “$\langle mask\rangle$” tokens.

2 Background

2.1 Neural Machine Translation

Under the encoder-decoder paradigm (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017), the conditional probability $P(\mathbf{y}|\mathbf{x};\bm{\theta})$ of a target-language sentence $\mathbf{y}=y_{1},\cdots,y_{J}$ given a source-language sentence $\mathbf{x}=x_{1},\cdots,x_{I}$ is modeled as follows: The encoder maps the source sentence $\mathbf{x}$ onto a sequence of $I$ word embeddings $e(\mathbf{x})=e(x_{1}),\ldots,e(x_{I})$. Then the word embeddings are encoded into their corresponding continuous hidden representations. The decoder acts as a conditional language model that reads the embeddings of a shifted copy of $\mathbf{y}$ along with the aggregated contextual representations $\mathbf{c}$. For clarity, we denote the input and output of the decoder as $\mathbf{z}$ and $\mathbf{y}$, i.e., $\mathbf{z}=\langle s\rangle,y_{1},\cdots,y_{J-1}$, where $\langle s\rangle$ is a start symbol. Conditioned on an aggregated contextual representation $\mathbf{c}_{j}$ and its partial target input $\mathbf{z}_{\leq j}$, the decoder generates $\mathbf{y}$ as:

$P(\mathbf{y}|\mathbf{x};\bm{\theta})=\prod_{j=1}^{J}P(y_{j}|\mathbf{z}_{\leq j},\mathbf{c};\bm{\theta}).$   (1)

The aggregated contextual representation $\mathbf{c}$ is often calculated by summarizing the sentence $\mathbf{x}$ with an attention mechanism (Bahdanau et al., 2015). A byproduct of the attention computation is a noisy alignment matrix $\mathbf{A}\in\mathbb{R}^{J\times I}$ which roughly captures the translation correspondence between target and source words (Garg et al., 2019).

Generally, NMT optimizes the model parameters $\bm{\theta}$ by minimizing the empirical risk over a parallel training set $(\mathbf{x},\mathbf{y})\in\mathcal{S}$:

$\mathcal{L}_{\mathcal{S}}(\bm{\theta})=\mathop{\mathbb{E}}\limits_{(\mathbf{x},\mathbf{y})\in\mathcal{S}}\big[\ell(f(\mathbf{x},\mathbf{y};\bm{\theta}),h(\mathbf{y}))\big],$   (2)

where $\ell$ is the cross entropy loss between the model prediction $f(\mathbf{x},\mathbf{y};\bm{\theta})$ and $h(\mathbf{y})$, and $h(\mathbf{y})$ denotes the sequence of one-hot label vectors with label smoothing as in the Transformer (Vaswani et al., 2017).
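For illustration, the minimal sketch below computes the per-sentence loss $\ell$ of Eq. (2), assuming the model prediction is given as a matrix of per-position output distributions; the helper names, the use of NumPy, and the smoothing value 0.1 (the Transformer default) are our assumptions, not part of the original implementation.

```python
import numpy as np

def smoothed_labels(target_ids, vocab_size, eps=0.1):
    """h(y): one-hot label vectors with label smoothing (eps=0.1 as in the Transformer)."""
    J = len(target_ids)
    h = np.full((J, vocab_size), eps / (vocab_size - 1))
    h[np.arange(J), target_ids] = 1.0 - eps
    return h

def sentence_loss(pred_probs, target_ids, eps=0.1):
    """Cross entropy between the prediction f(x, y; theta) and the smoothed labels h(y).

    pred_probs: (J, V) array of per-position output distributions.
    target_ids: length-J sequence of gold token ids.
    """
    h = smoothed_labels(target_ids, pred_probs.shape[1], eps)
    return -(h * np.log(pred_probs + 1e-9)).sum(axis=-1).mean()
```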

2.2 Pre-training for Neural Machine Translation

Pre-training sequence-to-sequence models for language generation has been shown to be effective for machine translation (Song et al., 2019; Lewis et al., 2019). These methods generally comprise two stages: pre-training and finetuning. The pre-training takes advantage of an abundant monolingual corpus $\mathcal{U}=\{\mathbf{y}\}$ to learn representations through a self-supervised objective called the denoising autoencoder (Vincent et al., 2008), which aims at reconstructing the original sentence $\mathbf{y}$ from one of its corrupted counterparts.

Let $n(\mathbf{y})$ be a corrupted copy of $\mathbf{y}$ where the function $n(\cdot)$ adds noise and/or masks words. The pair $(n(\mathbf{y}),\mathbf{y})$ constitutes pseudo parallel data and is fed into the NMT model to compute the reconstruction loss. The self-supervised reconstruction loss over the corpus $\mathcal{U}$ is defined as:

$\mathcal{L}_{\mathcal{U}}(\bm{\theta})=\mathop{\mathbb{E}}\limits_{\mathbf{y}\in\mathcal{U}}\big[\ell(f(n(\mathbf{y}),\mathbf{y};\bm{\theta}),h(\mathbf{y}))\big].$   (3)

The optimal model parameters $\bm{\theta}^{\star}$ are learned via the self-supervised loss $\mathcal{L}_{\mathcal{U}}(\bm{\theta})$ and used to initialize downstream models during finetuning on the parallel training set $\mathcal{S}$.
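As intuition for $n(\cdot)$, the sketch below shows one possible non-masking corruption, the local word shuffle with maximum displacement 3 later referenced in Section 3.2 (Lample et al., 2017); masking-based corruption would instead replace tokens with a $\langle mask\rangle$ symbol. The function name and parameters are illustrative assumptions.

```python
import random

def local_shuffle(tokens, k=3, seed=None):
    """One possible corruption n(y): locally shuffle tokens.

    Each position i gets the sort key i + U(0, k + 1); sorting by these keys
    permutes the tokens so that no token moves more than k positions from its
    original slot (Lample et al., 2017).
    """
    rng = random.Random(seed)
    keys = [i + rng.uniform(0, k + 1) for i in range(len(tokens))]
    return [tok for _, tok in sorted(zip(keys, tokens), key=lambda pair: pair[0])]

# Example: local_shuffle("this is a monolingual sentence".split(), k=3)
```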

3 Cross-breeding: $F_2$-XEnDec

For resource-rich translation tasks in which a large parallel corpus and (virtually) unlimited monolingual corpora are available, our goal is to improve translation performance by exploiting self-supervised signals to complement supervised learning. In the proposed method, we train NMT models jointly with supervised and self-supervised learning objectives in a single stage, building on an essential new task called XEnDec. In the remainder of this section, we first detail XEnDec, then introduce our approach and present the overall algorithm, and finally discuss its relationship to previous works.

3.1 Crossover Encoder-Decoder

This section introduces the crossover encoder-decoder (XEnDec). Different from a conventional encoder-decoder, XEnDec takes two training examples as inputs (called parents), shuffles the parents’ source sentences and produces a virtual example (called offspring) through a mixture decoder model. Fig. 1 illustrates this process.

Formally, let $(\mathbf{x},\mathbf{y})$ denote a training example where $\mathbf{x}=x_{1},\cdots,x_{I}$ represents a source sentence of $I$ words and $\mathbf{y}=y_{1},\cdots,y_{J}$ is the corresponding target sentence of $J$ words. In supervised training, $\mathbf{x}$ and $\mathbf{y}$ are parallel sentences. As we will see in Section 3.2, XEnDec can be carried out with and without supervision. We do not distinguish these cases for now and use generic notations to illustrate the idea.

Given a pair of examples $(\mathbf{x},\mathbf{y})$ and $(\mathbf{x}^{\prime},\mathbf{y}^{\prime})$ called parents, the crossover encoder shuffles the two source sequences into a new source sentence $\tilde{\mathbf{x}}$, calculated from:

$\tilde{x}_{i}=m_{i}x_{i}+(1-m_{i})x^{\prime}_{i},$   (4)

where $\mathbf{m}=m_{1},\cdots,m_{I}\in\{0,1\}^{I}$ stands for a series of Bernoulli random variables, each taking the value 1 with probability $p$, called the shuffling ratio. If $m_{i}=0$, the $i$-th word in $\mathbf{x}$ is substituted with the word in $\mathbf{x}^{\prime}$ at the same position. For convenience, the lengths of the two sequences are aligned by appending padding tokens to the end of the shorter sentence.
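A minimal sketch of this shuffling step under stated assumptions (tokenized inputs, a NumPy random generator, and names of our choosing) is:

```python
import numpy as np

def crossover_sources(x, x_prime, p, pad="<pad>", rng=None):
    """Eq. (4): merge two source sentences under a Bernoulli(p) mask m.

    m_i = 1 keeps the i-th word of x; m_i = 0 substitutes the word of x' at the
    same position. The shorter sentence is padded first so both have length I.
    Returns the mixed source x_tilde and the mask m (reused later by the decoder).
    """
    rng = rng or np.random.default_rng()
    I = max(len(x), len(x_prime))
    x = list(x) + [pad] * (I - len(x))
    x_prime = list(x_prime) + [pad] * (I - len(x_prime))
    m = rng.binomial(1, p, size=I)
    x_tilde = [xi if mi else xpi for xi, xpi, mi in zip(x, x_prime, m)]
    return x_tilde, m
```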

The crossover decoder employs a mixture model to generate the virtual target sentence. The embedding of the decoder’s input $\tilde{\mathbf{z}}$ is computed as:

$e(\tilde{z}_{j})=\frac{1}{Z}\Big[e(y_{j-1})\sum_{i=1}^{I}A_{(j-1)i}m_{i}+e(y^{\prime}_{j-1})\sum_{i=1}^{I}A^{\prime}_{(j-1)i}(1-m_{i})\Big],$   (5)

where $e(\cdot)$ is the embedding function and $Z=\sum_{i=1}^{I}\big[A_{(j-1)i}m_{i}+A^{\prime}_{(j-1)i}(1-m_{i})\big]$ is the normalization term, with $\mathbf{A}$ and $\mathbf{A}^{\prime}$ denoting the alignment matrices for the source sequences $\mathbf{x}$ and $\mathbf{x}^{\prime}$, respectively. Eq. (5) averages the embeddings of $\mathbf{y}$ and $\mathbf{y}^{\prime}$ through the latent weights computed from $\mathbf{m}$, $\mathbf{A}$, and $\mathbf{A}^{\prime}$. The alignment matrix measures the contribution of the source words for generating a specific target word (Och & Ney, 2004; Bahdanau et al., 2015). For example, $A_{ji}$ represents the contribution score of the $i$-th word in the source sentence for the $j$-th word in the target sentence. For simplicity, this paper uses the attention matrix learned in the NMT model as a noisy alignment matrix (Garg et al., 2019).

Likewise, the label vector for the crossover decoder is calculated from:

$h(\tilde{y}_{j})=\frac{1}{Z}\Big[h(y_{j})\sum_{i=1}^{I}A_{ji}m_{i}+h(y^{\prime}_{j})\sum_{i=1}^{I}A^{\prime}_{ji}(1-m_{i})\Big],$   (6)

where the $h(\cdot)$ function projects a word onto its label vector, e.g., a one-hot vector. The loss of XEnDec is computed over its output $(\tilde{\mathbf{x}},\tilde{\mathbf{y}})$ using the negative log-likelihood:

$\ell(f(\tilde{\mathbf{x}},\tilde{\mathbf{y}};\bm{\theta}),h(\tilde{\mathbf{y}}))=-\log P(\tilde{\mathbf{y}}|\tilde{\mathbf{x}};\bm{\theta})=\sum_{j}\mathrm{KL}\big(h(\tilde{y}_{j})\,\|\,P(y|\tilde{\mathbf{z}}_{\leq j},\mathbf{c}_{j};\bm{\theta})\big),$   (7)

where $\tilde{\mathbf{z}}$ is a shifted copy of $\tilde{\mathbf{y}}$ as discussed in Section 2.1. Notice that even though we do not directly observe the “virtual sentences” $\tilde{\mathbf{z}}$ and $\tilde{\mathbf{y}}$, we are still able to compute the loss using their embeddings and labels. In practice, the length of $\tilde{\mathbf{x}}$ is set to $\max(|\mathbf{x}|,|\mathbf{x}^{\prime}|)$ whereas $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{z}}$ share the same length of $\max(|\mathbf{y}|,|\mathbf{y}^{\prime}|)$.
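The bookkeeping behind Eqs. (5)-(7) can be sketched as follows, assuming the parents' embeddings, label vectors, and alignment matrices are available as NumPy arrays; this only illustrates the mixture weights and the soft-label loss, not the actual Transformer implementation, and all function names are ours.

```python
import numpy as np

def parent_weights(A, Ap, m, r, eps=1e-9):
    """Latent weights of the two parents at target row r, and their sum Z.

    A, Ap: (J, I) alignment matrices of (x, y) and (x', y'); m: (I,) mask of Eq. (4).
    """
    w = float((A[r] * m).sum())          # contribution of parent (x, y)
    wp = float((Ap[r] * (1 - m)).sum())  # contribution of parent (x', y')
    return w, wp, w + wp + eps

def mix(vec, vec_p, w, wp, Z):
    """Weighted average used by both Eq. (5) and Eq. (6)."""
    return (w * vec + wp * vec_p) / Z

# Eq. (5): e(z~_j) mixes e(y_{j-1}) and e(y'_{j-1}) with the weights of row j-1.
# Eq. (6): h(y~_j) mixes h(y_j) and h(y'_j) with the weights of row j.

def soft_label_nll(pred_probs, mixed_labels, eps=1e-9):
    """Eq. (7): loss over the virtual target, written as the soft-label cross
    entropy, which equals the summed KL terms up to the (constant) label entropy."""
    return -(mixed_labels * np.log(pred_probs + eps)).sum(axis=-1).sum()
```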

3.2 Training

The proposed method applies XEnDec to deeply fuse the parallel data $\mathcal{S}$ with nonparallel, monolingual data $\mathcal{U}$. As illustrated in Fig. 1, the first generation ($F_1$-XEnDec in the figure) uses XEnDec to combine two views of a monolingual sentence, thereby incurring a self-supervised loss $\mathcal{L}_{F_1}$, which we compute using Eq. (3). Afterward, the second generation ($F_2$-XEnDec in the figure) applies XEnDec to the offspring of the first generation, $(n(\mathbf{y}^{u}),\mathbf{y}^{u})$, and a sampled parallel sentence pair $(\mathbf{x}^{p},\mathbf{y}^{p})$, yielding a new loss term $\mathcal{L}_{F_2}$. The loss $\mathcal{L}_{F_2}$ is computed over the output of the $F_2$-XEnDec by:

$\mathcal{L}_{F_{2}}(\bm{\theta})=\mathop{\mathbb{E}}\limits_{\mathbf{y}^{u}\in\mathcal{U}}\mathop{\mathbb{E}}\limits_{(\mathbf{x}^{p},\mathbf{y}^{p})\in\mathcal{S}}\big[\ell(f(\tilde{\mathbf{x}},\tilde{\mathbf{y}};\bm{\theta}),h(\tilde{\mathbf{y}}))\big],$   (8)

where $(\tilde{\mathbf{x}},\tilde{\mathbf{y}})$ is the output of the $F_2$-XEnDec in Fig. 1.

The final NMT models are optimized jointly on the original translation loss and the above two auxiliary losses.

$\mathcal{L}(\bm{\theta})=\mathcal{L}_{\mathcal{S}}(\bm{\theta})+\mathcal{L}_{F_{1}}(\bm{\theta})+\mathcal{L}_{F_{2}}(\bm{\theta}).$   (9)

The term $\mathcal{L}_{F_2}$ in Eq. (9) deeply fuses monolingual and parallel sentences at the instance level rather than combining them mechanically. Section 4.4 empirically verifies the contributions of the $\mathcal{L}_{F_1}$ and $\mathcal{L}_{F_2}$ loss terms.

Algorithm 1 delineates the procedure to compute the final loss $\mathcal{L}(\bm{\theta})$. Specifically, we sample one monolingual sentence for each parallel sentence to circumvent the expensive enumeration in Eq. (8). To speed up training, we group sentences offline by length in Step 3 (cf. batching data in the supplementary document). For adding noise in Step 4, we can follow (Lample et al., 2017) and locally shuffle words while keeping the distance between the original and new positions no larger than 3, or set it to a null operation. There are two techniques to boost the final performance.

Computing $\mathbf{A}$: The alignment matrix $\mathbf{A}$ is obtained by averaging the cross-attention weights across all decoder layers and heads. We also add a temperature to control the sharpness of the attention distribution; its reciprocal is linearly increased from 0 to 2 during the first 20K steps. To avoid overfitting when computing $e(\tilde{\mathbf{z}})$ and $h(\tilde{\mathbf{y}})$, we apply dropout to $\mathbf{A}$ and stop back-propagating gradients through $\mathbf{A}$ when calculating the loss $\mathcal{L}_{F_2}(\bm{\theta})$.

Computing $h(\tilde{\mathbf{y}})$: Instead of interpolating one-hot labels in Eq. (6), we use the prediction vector $f(\mathbf{x},\mathbf{y};\hat{\bm{\theta}})$ on the sentence pair $(\mathbf{x},\mathbf{y})$ estimated by the model, where $\hat{\bm{\theta}}$ indicates that no gradients are back-propagated through it. However, the predictions made at early stages are usually unreliable. We therefore linearly combine the ground-truth one-hot label with the model prediction using a parameter $v$, computed as $v f_{j}(\mathbf{x},\mathbf{y};\hat{\bm{\theta}})+(1-v)h(y_{j})$, where $v$ is gradually annealed from 0 to 1 during the first 20K steps. (These two annealing hyperparameters, used in computing both $\mathbf{A}$ and $h(\tilde{\mathbf{y}})$, are the same for all models and are not elaborately tuned.) Notice that the prediction vectors are not used in computing the decoder input $e(\tilde{\mathbf{z}})$, which clearly distinguishes our method from scheduled sampling (Bengio et al., 2015).
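Both techniques rely on a simple linear ramp over the first 20K steps; a hedged sketch of such a schedule is given below, with names of our choosing.

```python
def linear_ramp(step, end_value, warm_steps=20_000):
    """Linearly anneal a scalar from 0 to end_value over the first warm_steps steps.

    Used twice in Section 3.2: for the reciprocal of the attention temperature
    (end_value = 2) and for the label-mixing weight v (end_value = 1).
    """
    return end_value * min(step, warm_steps) / warm_steps

def mixed_label(step, model_pred, one_hot):
    """v * f_j(x, y; theta_hat) + (1 - v) * h(y_j), with v ramped from 0 to 1."""
    v = linear_ramp(step, end_value=1.0)
    return v * model_pred + (1.0 - v) * one_hot
```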

Input: Parallel corpus $\mathcal{S}$, monolingual corpus $\mathcal{U}$, and shuffling ratios $p_1$ and $p_2$

Output: Batch loss $\mathcal{L}(\bm{\theta})$

1   Function $F_2$-XEnDec($\mathcal{S},\mathcal{U},p_1,p_2$):
2       foreach $(\mathbf{x}^{p},\mathbf{y}^{p})\in\mathcal{S}$ do
3           Sample a $\mathbf{y}^{u}\in\mathcal{U}$ with similar length as $\mathbf{x}^{p}$;   // done offline
4           $\mathbf{y}^{u}_{noise}$ $\leftarrow$ add non-masking noise to $\mathbf{y}^{u}$;
5           $(n(\mathbf{y}^{u}),\mathbf{y}^{u})$ $\leftarrow$ XEnDec over the inputs $(\mathbf{y}^{u}_{noise},\mathbf{y}^{u})$ and $(\mathbf{y}^{u}_{mask},\mathbf{y}^{u})$, with the shuffling ratio $p_1$ and arbitrary alignment matrices;
6           $\mathcal{L}_{\mathcal{S}}$ $\leftarrow$ compute $\ell$ in Eq. (2) using $(\mathbf{x}^{p},\mathbf{y}^{p})$ and obtain its attention matrix $\mathbf{A}$;
7           $\mathcal{L}_{F_1}$ $\leftarrow$ compute $\ell$ in Eq. (3) using $(n(\mathbf{y}^{u}),\mathbf{y}^{u})$ and obtain $\mathbf{A}^{\prime}$;
8           $(\tilde{\mathbf{x}},\tilde{\mathbf{y}})$ $\leftarrow$ XEnDec over the inputs $(\mathbf{x}^{p},\mathbf{y}^{p})$ and $(n(\mathbf{y}^{u}),\mathbf{y}^{u})$, with the shuffling ratio $p_2$, $\mathbf{A}$ and $\mathbf{A}^{\prime}$;
9           $\mathcal{L}_{F_2}$ $\leftarrow$ compute $\ell$ in Eq. (8);
10      end foreach
11      return $\mathcal{L}(\bm{\theta})=\mathcal{L}_{\mathcal{S}}(\bm{\theta})+\mathcal{L}_{F_1}(\bm{\theta})+\mathcal{L}_{F_2}(\bm{\theta})$;   // Eq. (9)

Algorithm 1: Proposed $F_2$-XEnDec function.
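The listing below restates Algorithm 1 as a Python-style sketch. The helpers `mono_lookup`, `add_noise`, `xendec`, `supervised_loss_and_attention`, `reconstruction_loss_and_attention`, and `xendec_loss` are hypothetical stand-ins for the operations defined in Sections 2 and 3, not a real API.

```python
def f2_xendec_batch_loss(parallel_batch, mono_lookup, p1, p2):
    """Sketch of Algorithm 1 for one batch; helper functions are hypothetical."""
    loss_s = loss_f1 = loss_f2 = 0.0
    for x_p, y_p in parallel_batch:
        y_u = mono_lookup(x_p)                      # Step 3: length-matched, done offline
        y_u_noise = add_noise(y_u)                  # Step 4: non-masking noise
        y_u_mask = ["<mask>"] * len(y_u)

        # First generation: cross the noisy and fully masked views of y_u.
        n_y_u, _ = xendec((y_u_noise, y_u), (y_u_mask, y_u), p=p1, align=None)

        # Supervised and self-supervised losses, keeping their attention matrices.
        l_s, A = supervised_loss_and_attention(x_p, y_p)                # Eq. (2)
        l_f1, A_prime = reconstruction_loss_and_attention(n_y_u, y_u)   # Eq. (3)

        # Second generation: cross the parallel pair with the F1 offspring.
        x_t, y_t = xendec((x_p, y_p), (n_y_u, y_u), p=p2, align=(A, A_prime))
        l_f2 = xendec_loss(x_t, y_t)                                    # Eq. (8)

        loss_s, loss_f1, loss_f2 = loss_s + l_s, loss_f1 + l_f1, loss_f2 + l_f2
    return loss_s + loss_f1 + loss_f2                                   # Eq. (9)
```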

3.3 Relation to Other Works

Table 1: Comparison with different objectives produced by XEnDec. Each row shows a set of inputs to XEnDec and the corresponding objective in existing work (the last column). $\mathbf{y}^{u}_{mask}$ is a sentence of length $|\mathbf{y}^{u}|$ containing only “$\langle mask\rangle$” tokens. $\mathbf{y}^{u}_{noise}$ is a sentence obtained by corrupting all the words in $\mathbf{y}^{u}$ with non-masking noises. $\mathbf{x}^{p}_{adv}$ and $\mathbf{y}^{p}_{adv}$ are adversarial sentences in which all the words are substituted with adversarial words.
($\mathbf{x}$, $\mathbf{y}$) | ($\mathbf{x}^{\prime}$, $\mathbf{y}^{\prime}$) | Objectives
($\mathbf{y}^{u}$, $\mathbf{y}^{u}$) | ($\mathbf{y}^{u}_{mask}$, $\mathbf{y}^{u}$) | MASS (Song et al., 2019)
($\mathbf{y}^{u}_{noise}$, $\mathbf{y}^{u}$) | ($\mathbf{y}^{u}_{mask}$, $\mathbf{y}^{u}$) | BART (Lewis et al., 2019)
($\mathbf{x}^{p}$, $\mathbf{y}^{p}$) | ($\mathbf{x}^{p}_{adv}$, $\mathbf{y}^{p}_{adv}$) | Adv. (Cheng et al., 2019)

This subsection shows that XEnDec, when fed with appropriate inputs, yields learning objectives identical to two recently proposed self-supervised learning approaches, MASS (Song et al., 2019) and BART (Lewis et al., 2019), as well as a supervised learning approach called Doubly Adversarial (Cheng et al., 2019). Table 1 summarizes the inputs to XEnDec that recover these approaches.

XEnDec can be used for self-supervised learning. As shown in Table 1, the inputs to XEnDec are two pairs of sentences $(\mathbf{x},\mathbf{y})$ and $(\mathbf{x}^{\prime},\mathbf{y}^{\prime})$. Given arbitrary alignment matrices, if we set $\mathbf{x}^{\prime}=\mathbf{y}^{u}$, $\mathbf{y}^{\prime}=\mathbf{y}^{u}$, and $\mathbf{x}$ to be a corrupted copy of $\mathbf{y}^{u}$, then XEnDec is equivalent to the denoising autoencoder which is commonly used to pre-train sequence-to-sequence models such as MASS (Song et al., 2019) and BART (Lewis et al., 2019). In particular, if we let $\mathbf{x}^{\prime}$ be a dummy sentence of length $|\mathbf{y}^{u}|$ containing only “$\langle mask\rangle$” tokens ($\mathbf{y}^{u}_{mask}$ in the table), Eq. (7) yields the learning objective defined in the MASS model (Song et al., 2019), except that losses over unmasked words are not counted in the training loss. Likewise, as shown in Table 1, we can recover BART’s objective by setting $\mathbf{x}=\mathbf{y}^{u}_{noise}$, where $\mathbf{y}^{u}_{noise}$ is obtained by shuffling tokens in $\mathbf{y}^{u}$ or dropping them. In both cases, XEnDec is trained with a self-supervised objective to reconstruct the original sentence from one of its corrupted versions. Conceptually, the denoising autoencoder can be regarded as a degenerate XEnDec whose inputs are two views of the source side of a monolingual sentence, e.g., $n(\mathbf{y})$ and $\mathbf{y}_{mask}$ for $\mathbf{y}$.

XEnDec can also be used in supervised learning. The translation loss proposed in (Cheng et al., 2019) is obtained by letting $\mathbf{x}^{\prime}$ and $\mathbf{y}^{\prime}$ be two “adversarial inputs”, $\mathbf{x}^{p}_{adv}$ and $\mathbf{y}^{p}_{adv}$, both of which consist of adversarial words at each position. For the construction of $\mathbf{x}^{p}_{adv}$, we refer to Algorithm 1 in (Cheng et al., 2019). In this case, the crossover encoder-decoder is trained with a supervised objective over parallel sentences.

The above connections to existing works illustrate the power of XEnDec when it is fed with different kinds of inputs. The results in Section 4.4 show that XEnDec is still able to improve the baseline with alternative inputs. However, our experiments show that the best configuration found so far is to use the $F_2$-XEnDec in Algorithm 1 to deeply fuse the monolingual and parallel sentences.
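To make Table 1 concrete, the snippet below builds, for a monolingual sentence $\mathbf{y}^{u}$, the two XEnDec input pairs that recover the MASS-style and BART-style objectives; the function names are ours, and `corrupt` can be any non-masking corruption such as the local shuffle sketched after Section 2.2.

```python
def mass_style_inputs(y_u):
    """Row 1 of Table 1: parents (y^u, y^u) and (y^u_mask, y^u); shuffling the two
    sources under Eq. (4) yields a partially masked source, as in MASS."""
    y_u_mask = ["<mask>"] * len(y_u)
    return (list(y_u), list(y_u)), (y_u_mask, list(y_u))

def bart_style_inputs(y_u, corrupt):
    """Row 2 of Table 1: parents (y^u_noise, y^u) and (y^u_mask, y^u)."""
    y_u_mask = ["<mask>"] * len(y_u)
    return (corrupt(list(y_u)), list(y_u)), (y_u_mask, list(y_u))
```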

4 Experiments

Table 2: Experiments on WMT’14 English-German and WMT’14 English-French translation.
Models | Methods | En→De | De→En | En→Fr | Fr→En
Base | Reproduced Transformer | 28.70 | 32.23 | - | -
Base | $F_2$-XEnDec | 30.46 | 34.06 | - | -
Big | Reproduced Transformer | 29.47 | 33.12 | 43.37 | 39.82
Big | (Ott et al., 2018) | 29.30 | - | 43.20 | -
Big | (Cheng et al., 2019) | 30.01 | - | - | -
Big | (Yang et al., 2019) | 30.10 | - | 42.30 | -
Big | (Nguyen et al., 2019) | 30.70 | - | 43.70 | -
Big | (Zhu et al., 2020) | 30.75 | - | 43.78 | -
Big | Joint Training with MASS | 30.63 | - | 43.00 | -
Big | Joint Training with BART | 30.88 | - | 44.18 | -
Big | $F_2$-XEnDec | 31.60 | 34.94 | 45.15 | 41.60
Table 3: Comparison with the best baseline method in Table 2 in terms of BLEU, BLEURT and YiSi.
Methods | En→De BLEU | En→De BLEURT | En→De YiSi | En→Fr BLEU | En→Fr BLEURT | En→Fr YiSi
Joint Training with BART | 30.88 | 0.225 | 0.837 | 44.18 | 0.488 | 0.864
$F_2$-XEnDec | 31.60 | 0.261 | 0.842 | 45.15 | 0.513 | 0.869

4.1 Settings

Datasets. We evaluate our approach on two representative, resource-rich translation datasets, WMT’14 English-German and WMT’14 English-French, across four translation directions: English→German (En→De), German→English (De→En), English→French (En→Fr), and French→English (Fr→En). To compare fairly with previous state-of-the-art results on these two tasks, we report case-sensitive tokenized BLEU scores calculated by the multi-bleu.perl script. The English-German and English-French datasets consist of 4.5M and 36M sentence pairs, respectively. The English, German and French monolingual corpora in our experiments come from the WMT’14 translation tasks. We concatenate all the newscrawl07-13 data for English and German, and newscrawl07-14 for French, which results in 90M English sentences, 89M German sentences, and 42M French sentences. We use a word piece model (Schuster & Nakajima, 2012) to split tokenized words into sub-word units. For English-German, we build a shared vocabulary of 32K sub-word units. The validation set is newstest2013 and the test set is newstest2014. The vocabulary for the English-French dataset is also jointly split, into 44K sub-word units. The concatenation of newstest2012 and newstest2013 is used as the validation set while newstest2014 is the test set. Refer to the supplementary document for more detailed data pre-processing.

Model and Hyperparameters. We implement our approach on top of the Transformer model (Vaswani et al., 2017) using the Lingvo toolkit (Shen et al., 2019). The Transformer models follow the original network settings (Vaswani et al., 2017). In particular, the layer normalization is applied after each residual connection rather than before each sub-layer. The dropout ratios are set to 0.1 for all Transformer models except for the Transformer-big model on English-German, where 0.3 is used. We search the hyperparameters using the Transformer-base model on English-German. In our method, the shuffling ratio $p_1$ is set to 0.50, while 0.25 is used for English-French in Table 6. $p_2$ is sampled from a Beta distribution $Beta(2,6)$. The dropout ratio of $\mathbf{A}$ is 0.2 for all the models. For decoding, we use a beam size of 4 and a length penalty of 0.6 for English-German, and a beam size of 5 and a length penalty of 1.0 for English-French. We carry out our experiments on a cluster of 128 P100 GPUs and update gradients synchronously. The model is optimized with Adam (Kingma & Ba, 2014) following the same learning rate schedule used in (Vaswani et al., 2017), except that warmup_steps is set to 4000 for both Transformer-base and Transformer-big models.

Training Efficiency. When training the vanilla Transformer model, each batch contains 4096×128 tokens of parallel sentences on a cluster of 128 P100 GPUs. As our training objective (Eq. (9)) includes three losses whose inputs differ, we evenly spread the GPU memory budget across the three types of data by letting each batch include 2048×128 tokens, so the total batch size is 2048×128×3. The training speed is on average about 60% of the standard training speed. The additional computation cost is partially due to the implementation of the noise function that corrupts the monolingual sentence $\mathbf{y}^{u}$, and can be reduced by caching noisy data in the data input pipeline, which accelerates training to about 80% of the standard training speed.

4.2 Main Results

Table 2 shows the main results on the English-German and English-French datasets. Our method is compared with the following strong baselines. (Ott et al., 2018) is the scalable Transformer model; our reproduced Transformer model performs comparably with their reported results. (Cheng et al., 2019) is an NMT model with adversarial augmentation mechanisms in supervised learning. (Nguyen et al., 2019) boosts NMT performance by adopting multiple rounds of back-translated sentences. Both (Zhu et al., 2020) and (Yang et al., 2019) incorporate the knowledge of pre-trained models into NMT models by treating them as frozen input representations for NMT. We also compare our approach with MASS (Song et al., 2019) and BART (Lewis et al., 2019). As their goals are to learn generic pre-trained representations from massive monolingual corpora, for fair comparisons we re-implement their methods using the same backbone model as ours and jointly optimize their self-supervised objectives together with the supervised objective on the same corpora.

For English-German, our approach achieves significant improvements in both translation directions over the standard Transformer model. Even compared with the strongest baseline on English→German, our approach obtains a +0.72 BLEU gain. More importantly, when we apply our approach to a significantly larger dataset, English-French with 36M sentence pairs (vs. English-German with 4.5M sentence pairs), it still yields consistent and notable improvements over the standard Transformer model.

The single-stage approaches (Joint Training with MASS and BART) perform slightly better than the two-stage approaches (Zhu et al., 2020; Yang et al., 2019), which substantiates the benefit of jointly training supervised and self-supervised objectives for resource-rich translation tasks. Between them, BART performs better, with stable improvements on English-German and English-French and faster convergence. However, both still lag behind our approach. This is mainly because the $\mathcal{L}_{F_2}$ term in our approach deeply fuses the supervised and self-supervised objectives instead of simply summing up their training losses. See Section 4.4 for more details.

Furthermore, Table 3 evaluates our approach and the best baseline method (Joint Training with BART) in terms of two additional evaluation metrics, BLEURT (Sellam et al., 2020) and YiSi (Lo, 2019), which are claimed to correlate better with human judgment. The results corroborate the superior performance of our approach compared to the best baseline method on both English→German and English→French.

4.3 Analyses

Table 4: Effect of monolingual corpora sizes.
Methods | Mono. Size | En→De
$F_2$-XEnDec | ×0 | 28.70
$F_2$-XEnDec | ×1 | 29.84
$F_2$-XEnDec | ×3 | 30.36
$F_2$-XEnDec | ×5 | 30.46
$F_2$-XEnDec | ×10 | 30.22

Effect of Monolingual Corpora Sizes. Table 4 shows the impact of the monolingual corpus size on the performance of our approach. We find that our approach already yields improvements over the baselines when using no monolingual corpus (×0) as well as when using a monolingual corpus of size comparable to the bilingual corpus (×1). As we increase the size of the monolingual corpus to ×5, we obtain the best performance of 30.46 BLEU. However, continuing to increase the data size fails to improve the performance any further. A recent study (Liu et al., 2020a) shows that increasing the model capacity has great potential to exploit extremely large training sets for the Transformer model. We leave this line of exploration as future work.

Table 5: Finetuning vs. Joint Training.
Methods | En→De
Transformer | 28.70
+ Pretrain + Finetune | 28.77
$F_2$-XEnDec (Joint Training) | 30.46
+ Pretrain + Finetune | 29.70
Table 6: Results on $F_2$-XEnDec + Back Translation. Experiments on English-German and English-French are based on the Transformer-big model.
Methods | En→De | En→Fr
Transformer | 28.70 | 43.37
Back Translation | 32.09 | 35.90
(Edunov et al., 2018) | 35.00* | 45.60
$F_2$-XEnDec | 31.60 | 45.15
+ Back Translation | 33.70 | 46.19
*Our results cannot be directly compared to the numbers in (Edunov et al., 2018) because they use WMT’18 as bilingual data (5.18M) and 10x more monolingual data (226M vs. ours 23M).

Finetuning vs. Joint Training. To further study the effect of pre-trained models on the Transformer model and our approach, we use Eq. (3) to pre-train an NMT model on the entire English and German monolingual corpora. We then finetune the pre-trained model on the parallel English-German corpus. Models finetuned from pre-trained models usually perform better than models trained from scratch at the early stage of training. However, this advantage gradually vanishes as training progresses (cf. Figure 3 in the appendix). As shown in Table 5, the Transformer with finetuning achieves virtually identical results to a Transformer trained from scratch. Initializing our approach from the pre-trained model even impairs performance. We believe this may be caused by a discrepancy between the pre-training loss and our joint training loss.

Back Translation as Noise. One widely applicable method to leverage monolingual data in NMT is back translation (Sennrich et al., 2016b). A straightforward way to incorporate back translation into our approach is to treat back-translated corpora as parallel corpora. However, back translation can also be regarded as a type of noise used for constructing $\mathbf{y}^{u}_{noise}$ in $F_2$-XEnDec (shown in Fig. 1 and Step 4 in Algorithm 1), which can increase the noise diversity. As shown in Table 6, for English→German trained on the Transformer-big model, our approach yields an additional +1.9 BLEU gain when using back translation to noise $\mathbf{y}^{u}$ and also outperforms the back-translation baseline. When applied to the English-French dataset, we achieve a new state-of-the-art result over the best baseline (Edunov et al., 2018). In contrast, standard back translation for English-French hurts the performance of the Transformer, which is consistent with findings in previous works, e.g., (Caswell et al., 2019). These results show that our approach is complementary to back translation and performs more robustly when back-translated corpora are less informative, although it is conceptually different from works related to back translation (Sennrich et al., 2016b; Cheng et al., 2016; Edunov et al., 2018).

Robustness to Noisy Inputs. Contemporary NMT systems often suffer from dramatic performance drops when exposed to input perturbations (Belinkov & Bisk, 2018; Cheng et al., 2019), even though these perturbations may not be strong enough to alter the meaning of the input sentence. In this experiment, we verify the robustness of the NMT models learned by our approach. Following (Cheng et al., 2019), we evaluate model performance against word perturbations using two types of noise. The first is code-switching noise (CS), which randomly replaces words in the source sentences with their corresponding target-language words; alignment matrices are employed to find the corresponding words in the target sentences. The other is drop-words noise (DW), which randomly discards some words in the source sentences. Figure 2 shows that our approach exhibits higher robustness than the standard Transformer model across all noise types and noise fractions. In particular, our approach performs much more stably under code-switching noise.

Figure 2: Results on artificial noisy inputs. “CS”: code-switching noise. “DW”: drop-words noise. We compare our approach to the standard Transformer on different noise types and fractions.
Table 7: Ablation study on English-German.
ID | Different Settings | BLEU
1 | Transformer | 28.70
2 | $F_2$-XEnDec | 30.46
3 | without $\mathcal{L}_{F_2}$ | 29.21
4 | without $\mathcal{L}_{F_1}$ (a prior alignment is used) | 29.55
5 | $\mathcal{L}_{\mathcal{S}}$ with XEnDec over parallel data | 29.23
6 | XEnDec is replaced by Mixup | 29.67
7 | without dropout on $\mathbf{A}$ and model predictions | 29.87
8 | without model predictions | 30.24

4.4 Ablation Study

Table 7 studies the contributions of the key components and verifies the design choices in our approach.

Contribution of $\mathcal{L}_{F_2}$. We first verify the importance of our core loss term $\mathcal{L}_{F_2}$. When $\mathcal{L}_{F_2}$ is removed from Eq. (9), the training objective is equivalent to summing up the supervised ($\mathcal{L}_{\mathcal{S}}$) and self-supervised ($\mathcal{L}_{F_1}$) losses. Comparing Rows 2 and 3 in Table 7, we observe a sharp drop (-1.25 BLEU points) caused by the absence of $\mathcal{L}_{F_2}$. This result demonstrates the crucial role of the proposed $F_2$-XEnDec in extracting complementary signals to facilitate the joint training. We believe this is because of the deep fusion of monolingual and parallel sentences at the instance level.

Inputs to XEnDec. To validate the proposed task, we apply XEnDec to different types of inputs. The first variant directly combines the parallel and monolingual sentences without using noisy monolingual sentences, which is equivalent to removing $\mathcal{L}_{F_1}$ (Row 4 in Table 7). We achieve this by setting $n(\mathbf{y}^{u})=\mathbf{y}^{u}$ in Algorithm 1. However, we then cannot obtain the $\mathbf{A}^{\prime}$ required by the algorithm (Line 7 in Algorithm 1), which makes the loss incomputable, so we design a prior alignment matrix to handle this issue (cf. Appendix B). The second experiment applies XEnDec only to parallel sentences (Row 5 in Table 7). Both cases achieve better performance than the standard Transformer model (Row 1 in the table). These results show the efficacy of the proposed XEnDec on different types of inputs. Their gap to our final method shows the rationale of using both parallel and monolingual sentences as inputs. We hypothesize this is because XEnDec implicitly regularizes the model by shuffling and reconstructing words in parallel and monolingual sentences.

Comparison to Mixup. We use Mixup (Zhang et al., 2018) to replace our second XEnDec when computing $\mathcal{L}_{F_2}$ while keeping the first XEnDec untouched. When applying Mixup to a pair of training examples $(\mathbf{x},\mathbf{y})$ and $(\mathbf{x}^{\prime},\mathbf{y}^{\prime})$, Eq. (4), Eq. (5) and Eq. (6) are replaced by $e(\tilde{x}_{i})=\lambda e(x_{i})+(1-\lambda)e(x^{\prime}_{i})$, $e(\tilde{z}_{j})=\lambda e(y_{j-1})+(1-\lambda)e(y^{\prime}_{j-1})$ and $h(\tilde{y}_{j})=\lambda h(y_{j})+(1-\lambda)h(y^{\prime}_{j})$, respectively, where $\lambda$ is sampled from a Beta distribution. The comparison between Row 6 and Row 2 in Table 7 shows that Mixup leads to a worse result. Different from Mixup, which encourages the model to behave linearly to the linear interpolation of training examples, our task combines training examples in a non-linear way on the source side and forces the model to decouple this non-linear integration on the target side while predicting the entire target sentence based on its partial source sentence.
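For reference, a minimal sketch of the Mixup interpolation used in this ablation is shown below; the function name is ours, and the Beta parameter value is an illustrative assumption (Zhang et al. (2018) draw $\lambda\sim Beta(\alpha,\alpha)$ with a small $\alpha$).

```python
import numpy as np

def mixup_pair(emb_x, emb_xp, emb_z, emb_zp, lab_y, lab_yp, alpha=0.2, rng=None):
    """Mixup variant of the second XEnDec (Row 6, Table 7).

    Linearly interpolates source embeddings, decoder-input embeddings, and labels
    with a single lambda ~ Beta(alpha, alpha); contrast with XEnDec, which mixes
    discrete source words (Eq. (4)) and uses alignment-based weights (Eqs. (5)-(6)).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mix = lambda a, b: lam * a + (1.0 - lam) * b
    return mix(emb_x, emb_xp), mix(emb_z, emb_zp), mix(lab_y, lab_yp)
```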

Computation of $\mathbf{A}$ and $h(\tilde{\mathbf{y}})$. The last two rows in Table 7 verify the impact of the two training techniques discussed in Section 3.2. Removing these components lowers the performance.

5 Related Work

The recent past has witnessed an increasing interest in the research community on leveraging pre-training models to boost NMT model performance (Ramachandran et al., 2016; Lample & Conneau, 2019; Song et al., 2019; Lewis et al., 2019; Edunov et al., 2018; Zhu et al., 2020; Yang et al., 2019; Liu et al., 2020b). Most successes come from low-resource and zero-resource translation tasks. (Zhu et al., 2020) and (Yang et al., 2019) achieve some promising results on resource-rich translations. They propose to combine NMT model representations and frozen pre-trained representations under the common two-stage framework. The bottleneck of these methods is that these two stages are decoupled and separately learned, which exacerbates the difficulty of finetuning self-supervised representations on resource-rich language pairs. Our method, on the other hand, jointly trains self-supervised and supervised NMT models to close the gap between representations learned from either of them with an essential new subtask, XEnDec. In addition, our new subtask can be applied to combine different types of inputs. Experimental results show that our method consistently outperforms previous approaches across several translation benchmarks and establishes a new state-of-the-art result on WMT’14 English-French when applying XEnDec to back-translated corpora.

Another line of research related to ours originates in computer vision, where interpolating images and their labels (Zhang et al., 2018; Yun et al., 2019) has been shown to be effective in improving the generalization (Arazo et al., 2019; Jiang et al., 2020; Xu et al., 2021; Northcutt et al., 2021) and robustness (Hendrycks et al., 2020) of convolutional neural networks. Recently, some research efforts have been devoted to introducing this idea to NLP applications (Cheng et al., 2020; Guo et al., 2020; Chen et al., 2020). Our XEnDec shares the commonality of combining example pairs. However, XEnDec focuses on sequence-to-sequence learning for NLP with the aim of using self-supervised learning to complement supervised learning in joint training.

6 Conclusion

This paper has presented a joint training approach, $F_2$-XEnDec, to combine self-supervised and supervised learning in a single stage. The key ingredient is a novel crossover encoder-decoder (XEnDec) which is used to “interbreed” monolingual and parallel sentences, and which can also be fed with different types of inputs to recover some popular self-supervised and supervised training objectives.

Experiments on two resource-rich translation tasks, WMT’14 English-German and WMT’14 English-French, show that joint training performs favorably against two-stage training approaches when an enormous amount of labeled and unlabeled data is available. When applying XEnDec to deeply fuse monolingual and parallel sentences, resulting in $F_2$-XEnDec, the joint training paradigm can better exploit the complementary signal from unlabeled data, with significantly stronger performance. Finally, $F_2$-XEnDec is capable of improving NMT robustness against input perturbations such as code-switching noise widely found on social media.

In the future, we plan to further examine the effectiveness of our approach on larger-scale corpora with high-capacity models. We also plan to design more expressive noise functions for our approach.

Acknowledgements

The authors would like to thank anonymous reviewers for insightful comments, Isaac Caswell for providing back-translated corpora, and helpful feedback for the early version of this paper.

References

  • Arazo et al. (2019) Arazo, E., Ortego, D., Albert, P., O’Connor, N., and McGuinness, K. Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning (ICML), 2019.
  • Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
  • Belinkov & Bisk (2018) Belinkov, Y. and Bisk, Y. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations (ICLR), 2018.
  • Bengio et al. (2015) Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099, 2015.
  • Caswell et al. (2019) Caswell, I., Chelba, C., and Grangier, D. Tagged back-translation. arXiv preprint arXiv:1906.06442, 2019.
  • Chen et al. (2020) Chen, J., Yang, Z., and Yang, D. Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
  • Cheng et al. (2016) Cheng, Y., Xu, W., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. Semi-supervised learning for neural machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
  • Cheng et al. (2019) Cheng, Y., Jiang, L., and Macherey, W. Robust neural machine translation with doubly adversarial inputs. In Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
  • Cheng et al. (2020) Cheng, Y., Jiang, L., Macherey, W., and Eisenstein, J. Advaug: Robust adversarial augmentation for neural machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
  • Edunov et al. (2018) Edunov, S., Ott, M., Auli, M., and Grangier, D. Understanding back-translation at scale. In Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • Edunov et al. (2019) Edunov, S., Baevski, A., and Auli, M. Pre-trained language model representations for language generation. arXiv preprint arXiv:1903.09722, 2019.
  • French (1999) French, R. M. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 1999.
  • Garg et al. (2019) Garg, S., Peitz, S., Nallasamy, U., and Paulik, M. Jointly learning to align and translate with transformer models. arXiv preprint arXiv:1909.02074, 2019.
  • Gehring et al. (2017) Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), 2017.
  • Guo et al. (2020) Guo, D., Kim, Y., and Rush, A. Sequence-level mixed sample data augmentation. In Empirical Methods in Natural Language Processing (EMNLP), 2020.
  • Hendrycks et al. (2020) Hendrycks, D., Mu, N., Cubuk, E. D., Zoph, B., Gilmer, J., and Lakshminarayanan, B. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations (ICLR), 2020.
  • Jiang et al. (2020) Jiang, L., Huang, D., Liu, M., and Yang, W. Beyond synthetic noise: Deep learning on controlled noisy labels. In International Conference on Machine Learning (ICML), 2020.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lample & Conneau (2019) Lample, G. and Conneau, A. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
  • Lample et al. (2017) Lample, G., Conneau, A., Denoyer, L., and Ranzato, M. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
  • Lewis et al. (2019) Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • Liu et al. (2020a) Liu, X., Duh, K., Liu, L., and Gao, J. Very deep transformers for neural machine translation. arXiv preprint arXiv:2008.07772, 2020a.
  • Liu et al. (2020b) Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., and Zettlemoyer, L. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 2020b.
  • Lo (2019) Lo, C.-k. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation, 2019.
  • Nguyen et al. (2019) Nguyen, X.-P., Joty, S., Kui, W., and Aw, A. T. Data diversification: An elegant strategy for neural machine translation. arXiv preprint arXiv:1911.01986, 2019.
  • Northcutt et al. (2021) Northcutt, C. G., Jiang, L., and Chuang, I. L. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 2021.
  • Och & Ney (2004) Och, F. J. and Ney, H. The alignment template approach to statistical machine translation. Computational linguistics, 2004.
  • Ott et al. (2018) Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. arXiv preprint arXiv:1806.00187, 2018.
  • Peters et al. (2018) Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
  • Radford et al. (2018) Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training, 2018.
  • Ramachandran et al. (2016) Ramachandran, P., Liu, P. J., and Le, Q. V. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683, 2016.
  • Rieger et al. (2012) Rieger, R., Michaelis, A., and Green, M. M. Glossary of genetics and cytogenetics: classical and molecular. Springer Science & Business Media, 2012.
  • Schuster & Nakajima (2012) Schuster, M. and Nakajima, K. Japanese and korean voice search. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
  • Sellam et al. (2020) Sellam, T., Das, D., and Parikh, A. Bleurt: Learning robust metrics for text generation. In Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
  • Sennrich et al. (2016a) Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Annual Meeting of the Association for Computational Linguistics (ACL), 2016a.
  • Sennrich et al. (2016b) Sennrich, R., Haddow, B., and Birch, A. Improving neural machine translation models with monolingual data. In Annual Meeting of the Association for Computational Linguistics (ACL), 2016b.
  • Shen et al. (2019) Shen, J., Nguyen, P., Wu, Y., Chen, Z., Chen, M. X., Jia, Y., Kannan, A., Sainath, T., Cao, Y., Chiu, C.-C., et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295, 2019.
  • Song et al. (2019) Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. Mass: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML), 2019.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • Vincent et al. (2008) Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML), 2008.
  • Xu et al. (2021) Xu, Y., Zhu, L., Jiang, L., and Yang, Y. Faster meta update strategy for noise-robust deep learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Yang et al. (2019) Yang, J., Wang, M., Zhou, H., Zhao, C., Yu, Y., Zhang, W., and Li, L. Towards making the most of bert in neural machine translation. arXiv preprint arXiv:1908.05672, 2019.
  • Yun et al. (2019) Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
  • Zhang et al. (2018) Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.
  • Zhu et al. (2020) Zhu, J., Xia, Y., Wu, L., He, D., Qin, T., Zhou, W., Li, H., and Liu, T.-Y. Incorporating bert into neural machine translation. arXiv preprint arXiv:2002.06823, 2020.

Appendix

Figure 3: Comparison of finetuning and training from scratch using the Transformer and $F_2$-XEnDec. In both methods, pre-training leads to faster convergence but fails to improve the final performance after convergence. The comparison between the two figures shows that our joint training approach on the left (the blue curve) significantly outperforms the two-stage training on the right. Final BLEU numbers are reported in Table 5 in the main paper.

Appendix A Training Details

Data Pre-processing. We mainly follow the pre-processing pipeline (https://github.com/pytorch/fairseq/tree/master/examples/translation), which is also adopted by (Ott et al., 2018), (Edunov et al., 2018) and (Zhu et al., 2020), except for the sub-word tool. To verify the consistency between the word piece model (Schuster & Nakajima, 2012) and the BPE model (Sennrich et al., 2016a), we train two standard Transformer models on the same data set processed by the word piece model and the BPE model, respectively. The BLEU difference between them is about ±0.2, which suggests there is no significant difference between the two sub-word tools.

Batching Data. The Transformer groups training examples of similar lengths together with a varying batch size for training efficiency (Vaswani et al., 2017). In our approach, when interpolating two source sentences, $\mathbf{x}^{p}$ and $\mathbf{y}^{\diamond}$, it is better if their lengths are similar, which reduces the number of positions wasted on padding tokens. To this end, in the first round we search for monolingual sentences with exactly the same length as the source sentence of a parallel sentence pair. After the first traversal of the entire parallel data set, we relax the allowed length difference to 1. This process is repeated, relaxing the constraint further, until all parallel sentences are paired with their own monolingual data. A sketch of this procedure is given below.
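The following is a hedged sketch of the offline pairing procedure under stated assumptions (token-list inputs, a monolingual corpus at least as large as the parallel one); the names and data structures are ours.

```python
from collections import defaultdict

def pair_by_length(parallel, monolingual):
    """Assign one monolingual sentence to each parallel pair, preferring equal length.

    parallel:    list of (src_tokens, tgt_tokens) pairs.
    monolingual: list of token lists.
    In the first pass only exact length matches are accepted; each later pass
    relaxes the allowed length difference by one.
    """
    buckets = defaultdict(list)
    for sent in monolingual:
        buckets[len(sent)].append(sent)

    assigned = [None] * len(parallel)
    pending, tol = list(range(len(parallel))), 0
    while pending and any(buckets.values()):
        still_pending = []
        for idx in pending:
            length = len(parallel[idx][0])
            # Accept any bucket whose length differs by at most `tol`.
            for cand in range(max(length - tol, 1), length + tol + 1):
                if buckets[cand]:
                    assigned[idx] = buckets[cand].pop()
                    break
            else:
                still_pending.append(idx)
        pending, tol = still_pending, tol + 1
    return assigned
```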

Appendix B A Prior Alignment Matrix

When $\mathcal{L}_{F_1}$ is removed, we cannot obtain $\mathbf{A}^{\prime}$ according to Algorithm 1 in the main paper, which makes $\mathcal{L}_{F_2}$ incomputable. Thus we propose a prior alignment to tackle this issue. For simplicity, we set $n(\cdot)$ to be a copy function in the first XEnDec, which means that we just randomly mask some words in the first round of XEnDec. In the second XEnDec, we want to combine $(\mathbf{x}^{p},\mathbf{y}^{p})$ and $(\mathbf{y}^{\diamond},\mathbf{y})$. The alignment matrix $\mathbf{A}^{\prime}$ for $(\mathbf{y}^{\diamond},\mathbf{y})$ is constructed as follows.

If a word $y_{j}$ in the target sentence $\mathbf{y}$ is picked on the source side, which indicates that $y_{j}^{\diamond}$ is picked and $m_{j}=0$, its attention value $A^{\prime}_{ji}$ is assigned to $\frac{p}{\|1-\mathbf{m}\|_{1}}$ if $m_{i}=0$, and to $\frac{1-p}{\|\mathbf{m}\|_{1}}$ if $m_{i}=1$. Conversely, if a word $y_{j}$ is not picked, which indicates $m_{j}=1$, its attention value $A^{\prime}_{ji}$ is assigned to $\frac{p}{\|\mathbf{m}\|_{1}}$ if $m_{i}=0$, and to $\frac{1-p}{\|1-\mathbf{m}\|_{1}}$ if $m_{i}=1$. A sketch of this construction is given below.
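The sketch below literally transcribes the four cases above; the function name and the guard against empty groups are our additions.

```python
import numpy as np

def prior_alignment(m, p):
    """Prior alignment A' for (y^diamond, y) when L_F1 is removed (Appendix B).

    m: 0/1 mask over the target sentence y from the first XEnDec; m_j = 0 means
       the word y_j was picked (masked) on the source side.
    p: the shuffling ratio used in the first XEnDec.
    """
    m = np.asarray(m, dtype=float)
    picked = max((1.0 - m).sum(), 1.0)   # ||1 - m||_1, guard against empty group
    kept = max(m.sum(), 1.0)             # ||m||_1, guard against empty group
    J = len(m)
    A = np.empty((J, J))
    for j in range(J):
        for i in range(J):
            if m[j] == 0:   # y_j is picked
                A[j, i] = p / picked if m[i] == 0 else (1.0 - p) / kept
            else:           # y_j is not picked
                A[j, i] = p / kept if m[i] == 0 else (1.0 - p) / picked
    return A
```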