
Improving Neural Machine Translation by Denoising Training

Liang Ding
The University of Sydney
[email protected]
Keqin Peng
Beihang University
[email protected]
Dacheng Tao
JD Explore Academy, JD.com
[email protected]
  Work done when interning at JD Explore Academy.
Abstract

We present a simple and effective pretraining strategy, Denoising Training (DoT), for neural machine translation. (DoT is one of the twin tricks for NMT that we proposed in the IWSLT21 evaluation Ding et al. (2021d); the other is Bidirectional Training, namely BiT Ding et al. (2021c).) Specifically, we update the model parameters with source- and target-side denoising tasks at the early stage of training and then tune the model normally. Notably, our approach does not add any parameters or training steps, and requires only the parallel data. Experiments show that DoT consistently improves neural machine translation performance across 12 bilingual and 16 multilingual directions (with data sizes ranging from 80K to 20M). In addition, we show that DoT complements existing data manipulation strategies, i.e., curriculum learning, knowledge distillation, data diversification, bidirectional training, and back-translation. Encouragingly, we find that DoT outperforms the costly pretrained model mBART Liu et al. (2020b) in high-resource settings. Analyses show that DoT is a novel in-domain cross-lingual pretraining strategy, and could offer further improvements with task-relevant self-supervisions.

1 Introduction

Transformer Vaswani et al. (2017) has become the de facto choice for neural machine translation (NMT) due to its state-of-the-art performance Barrault et al. (2019, 2020); Akhbardeh et al. (2021). However, an interesting study reveals that many Transformer modifications do not result in improved performance because of a lack of generalization Narang et al. (2021). This finding is consistent with the recent call for data-centric AI in the ML community Ng (2021), which urges the NMT community to pay more attention to how to effectively and efficiently exploit the supervisions in the data, rather than to complicated architectural modifications.

There has been a lot of work on NMT data manipulation to fully exploit the training data. Zhang et al. (2018); Platanios et al. (2019); Liu et al. (2020a); Zhou et al. (2021); Ding et al. (2021a) design difficulty metrics that enable models to learn from easy to hard examples. Kim and Rush (2016) propose sequence-level knowledge distillation for machine translation to acquire refined knowledge from teachers. Nguyen et al. (2020) diversify the training data by using the predictions of multiple forward and backward models. Recently, Ding et al. (2021c) initialize the translation system with a bidirectional system to obtain better performance. However, these approaches assume that the supervisions come from the correlation between the source and target sentences, i.e., src↔tgt, which is the basic property of parallel data, and ignore the self-supervisions within the source and target sentences themselves.

In this work, we aim to find more self-supervisions in parallel data, which are hopefully complementary to existing data manipulation strategies. Accordingly, we break the parallel data into two pieces of high-quality monolingual data, allowing us to design rich self-supervisions on both the source and target sides. We choose denoising as the self-supervision objective, i.e., denoising training (§2). The core idea is to use a multilingual denoising system as the initialization for a translation system. Specifically, given the parallel language pair "B: src→tgt", we construct the denoising data "M_src: noised(src)→src" and "M_tgt: noised(tgt)→tgt". We then update the parameters with the denoising data M_src + M_tgt in the early stage, and tune the model with the parallel data B.

We validated our approach on bilingual and multilingual benchmarks across different language families and sizes in §3.2. Experiments show DoT consistently improves the translation performance. We also show DoT can complement existing data manipulation strategies, i.e. back translation Caswell et al. (2019a), curriculum learning Platanios et al. (2019), knowledge distillation Kim and Rush (2016), data diversification Nguyen et al. (2020) and bidirectional training Ding et al. (2021c). Analyses in §3.3 provide some insights about where the improvements come from: DoT is a simple in-domain cross-lingual pretraining strategy and can be enhanced with task-relevant self-supervisions.

2 Denoising Training

Preliminary

Given a source sentence 𝐱, NMT models generate each target word 𝐲_t conditioned on the previously generated words 𝐲_{<t}, which can be formulated as:

p(\mathbf{y}|\mathbf{x})=\prod_{t=1}^{T}p(\mathbf{y}_{t}|\mathbf{x},\mathbf{y}_{<t};\theta)   (1)

where T is the length of the target sequence and the parameters θ are trained to maximize the likelihood of the training examples according to ℒ(θ) = argmax_θ log p(𝐲|𝐱; θ). The training examples used for this conditional estimation are defined as B = {(𝐱_i, 𝐲_i)}_{i=1}^{N}, where N is the total number of sentence pairs in the training data.
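To make the factorization in Eq. 1 concrete, the following sketch sums per-token log-probabilities under teacher forcing to obtain the sentence log-probability; the random logits, tensor shapes, and variable names are illustrative stand-ins for an actual NMT decoder, not part of our implementation.

import torch
import torch.nn.functional as F

# Toy stand-ins: T target tokens, vocabulary of size V.
T, V = 5, 1000
logits = torch.randn(T, V)            # decoder outputs for p(y_t | x, y_<t)
y = torch.randint(0, V, (T,))         # reference target tokens

# log p(y | x) = sum_t log p(y_t | x, y_<t)  (Eq. 1 in log space)
log_probs = F.log_softmax(logits, dim=-1)
sent_log_prob = log_probs[torch.arange(T), y].sum()

# Maximizing this likelihood is equivalent to minimizing cross-entropy.
loss = F.cross_entropy(logits, y, reduction='sum')
assert torch.allclose(-sent_log_prob, loss)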

Motivation

The motivation for training with denoising data is that, when humans learn languages, one of the best practices for language acquisition is to correct sentence errors Marcus (1993). Motivated by this, Lewis et al. (2020) propose several noise functions and denoise them in an end-to-end way. Liu et al. (2020b) introduce this idea to the multilingual scenario. Different from the above monolingual pretraining approaches, we propose a simpler noise function and apply it to each side of the parallel data.

Method

We want the model to understand both the source- and target-side languages well before learning lexical translation and reordering Voita et al. (2021). For the noise function noised(·), we apply the common noise-injection practices described in Appendix A.1, i.e., removing, replacing, or nearby swapping a random word (chosen under a uniform distribution) once per sentence Edunov et al. (2018). The size of the original parallel data is thus doubled as follows:

\text{M}_{\mathrm{src}}=\{(noised(\mathbf{x}_{i}),\mathbf{x}_{i})\}^{N}_{i=1}   (2)
\text{M}_{\mathrm{tgt}}=\{(noised(\mathbf{y}_{i}),\mathbf{y}_{i})\}^{N}_{i=1}   (3)

where M_src and M_tgt are combined to update the end-to-end model. In doing so, θ in Eq. 1 can be updated by denoising both the source and target data, and the denoising objective becomes:

\mathcal{L}_{\text{DoT}}(\theta)=\overbrace{\operatorname*{arg\,max}_{\theta}\log p(\mathbf{x}\,|\,noised(\mathbf{x});\theta)}^{\text{Source Denoising}:\,\mathcal{L}_{\theta}^{S}}   (4)
+\underbrace{\operatorname*{arg\,max}_{\theta}\log p(\mathbf{y}\,|\,noised(\mathbf{y});\theta)}_{\text{Target Denoising}:\,\mathcal{L}_{\theta}^{T}}   (5)

where the source denoising objective ℒ_θ^S and the target denoising objective ℒ_θ^T are optimized iteratively. This pretraining stores knowledge of the source and target languages in the shared model parameters, which may enable better and faster learning of subsequent tasks. Following Ding et al. (2021c), we stop the denoising training early at 1/3 of the total steps, and then tune the model normally for the remaining 2/3 of the training steps. This process can be formally denoted as the pipeline M_src + M_tgt → B.

There are many possible ways to implement the general idea of denoising training. The aim of this paper is not to explore the whole space but simply to show that one fairly straightforward implementation works well and the idea is reasonable.
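As one illustration of such a straightforward implementation, the sketch below builds the denoising corpora M_src and M_tgt (Eqs. 2-3) from a parallel corpus B and switches from denoising to normal training at 1/3 of the total steps. The simplified noised function, the toy corpus, and the per-step sampling are illustrative assumptions; in practice the full noise function of Appendix A.1 and the regular fairseq training loop are used.

import random

def noised(words):
    # Simplified stand-in for the noise function of Appendix A.1:
    # here we only remove one random word (the full version also
    # replaces a word with MASK or swaps nearby words).
    if len(words) > 1:
        words = words[:]
        del words[random.randrange(len(words))]
    return words

def build_dot_data(parallel):
    # parallel: list of (src_words, tgt_words) pairs, i.e. the corpus B.
    # Returns the combined denoising corpora M_src + M_tgt of Eqs. 2-3.
    m_src = [(noised(x), x) for x, _ in parallel]
    m_tgt = [(noised(y), y) for _, y in parallel]
    return m_src + m_tgt

def dot_schedule(parallel, total_steps):
    # Pipeline M_src + M_tgt -> B: denoise for the first third of the
    # training steps, then tune normally on the parallel data.
    denoise_data = build_dot_data(parallel)
    for step in range(total_steps):
        pool = denoise_data if step < total_steps // 3 else parallel
        yield random.choice(pool)  # stand-in for drawing a training batch

# Toy usage with a two-sentence "corpus".
corpus = [("ein kleines haus".split(), "a small house".split()),
          ("guten morgen".split(), "good morning".split())]
for step, (inp, out) in enumerate(dot_schedule(corpus, total_steps=9)):
    print(step, inp, "->", out)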

Data Source   IWSLT14          WMT16           IWSLT21          WMT14           WMT20           WMT17
Size          160K             0.6M            2.4M             4.5M            13M             20M
Direction     En-De   De-En    En-Ro   Ro-En   En-Sw   Sw-En    En-De   De-En   Ja-En   En-Ja   Zh-En   En-Zh
Transformer   29.2    35.1     33.9    34.1    28.8    48.5     28.6    32.1    20.4    18.2    23.7    33.2
  +DoT        29.8    36.1     35.0    35.5    29.3    49.6     29.5    32.7    20.9    19.1    24.7    33.6
Table 1: Performance on several widely-used bilingual benchmarks, including IWSLT14 En↔De, WMT16 En↔Ro, IWSLT21 En↔Sw, WMT14 En↔De, WMT20 Ja↔En and WMT17 Zh↔En. Among them, Ja-En and Zh-En are distant language pairs. "‡/†" indicates a significant difference (p<0.01/0.05) from the corresponding baselines.

3 Experiments

3.1 Setup

Bilingual Data

Main experiments in Tab. 1 are conducted on 6 translation datasets: IWSLT14 English↔German Nguyen et al. (2020), WMT16 English↔Romanian Gu et al. (2018), IWSLT21 English↔Swahili (https://iwslt.org/2021/low-resource), WMT14 English↔German Vaswani et al. (2017), WMT20 Japanese↔English (http://www.statmt.org/wmt20) and WMT17 Chinese↔English Hassan et al. (2018). The data sizes can be found in Tab. 1, ranging from 160K to 20M. Notably, Japanese↔English and Chinese↔English are two distant, high-resource language pairs. The monolingual data used for back-translation in Tab. 3 is randomly sampled from the publicly available News Crawl corpus (http://data.statmt.org/news-crawl/). We use the same validation and test sets as previous works for a fair comparison, except for IWSLT21 English↔Swahili, where we sample 5K/5K sentences from the training set as the validation/test sets. We preprocess all data except Japanese↔English via BPE Sennrich et al. (2016) with 32K merge operations. For Japanese↔English, we filter the parallel data with Bicleaner Sánchez-Cartagena et al. (2018) and apply SentencePiece Kudo and Richardson (2018) to generate 32K subwords.

Language      Fa      Pl      Ar      He      Nl      De      It      Es
Size          89K     128K    140K    144K    153K    160K    167K    169K
Transformer   17.1    16.4    21.3    28.8    31.5    28.5    29.3    34.9
  +DoT        18.2    17.5    22.8    30.7    33.3    29.6    31.2    36.5
Table 2: Performance on the IWSLT multilingual task. For simplicity, we report the average BLEU of En→X and X→En for each language. For significance, we compare the concatenation of the En→X and X→En translations against the corresponding concatenated references.

Multilingual Data

We follow Lin et al. (2021) to collect eight English-centric multilingual language pairs from IWSLT14 (https://wit3.fbk.eu/), including Farsi (Fa), Polish (Pl), Arabic (Ar), Hebrew (He), Dutch (Nl), German (De), Italian (It), and Spanish (Es). Following Tan et al. (2019), we apply BPE with 30K merge operations, and use an over-sampling strategy to balance the training data distribution with a temperature of T=2. The hyper-parameters of removing and replacing are set to ratio=0.1, and nearby swapping to span=3, due to their better performance in our preliminary studies. For evaluation, we use tokenized BLEU Papineni et al. (2002) as the metric for the bilingual tasks, except for English→Chinese and Japanese↔English, where we report SacreBLEU Post (2018). The sign-test Collins et al. (2005) is used for the statistical significance tests.
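For illustration, one common formulation of temperature-based over-sampling draws language l with probability proportional to (n_l / Σ_k n_k)^{1/T}, where n_l is the number of sentence pairs for language l; this is a sketch of the balancing idea rather than the exact recipe of Tan et al. (2019). The snippet below computes these probabilities for the corpus sizes in Tab. 2 with T=2.

def sampling_probs(sizes, T=2.0):
    # Temperature-based over-sampling: p_l ∝ (n_l / total) ** (1 / T),
    # so T > 1 up-weights low-resource languages relative to their raw share.
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# IWSLT14 corpus sizes (sentence pairs) from Tab. 2.
sizes = {'Fa': 89_000, 'Pl': 128_000, 'Ar': 140_000, 'He': 144_000,
         'Nl': 153_000, 'De': 160_000, 'It': 167_000, 'Es': 169_000}
print(sampling_probs(sizes, T=2))  # e.g., Fa gets a larger share than its raw 89K/1.15M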

Model

We validated our proposed DoT on Transformer Vaswani et al. (2017), implemented in fairseq (https://github.com/pytorch/fairseq). All bilingual tasks are trained with Transformer-Big except IWSLT14 En↔De and WMT16 En↔Ro, which are trained with Transformer-Base because of their extremely small data sizes. For the multilingual experiments, we closely follow previous work Wu et al. (2019) and adopt a smaller Transformer-Base (with d_ff=1024 and n_head=4) due to the small data volume of the IWSLT multilingual dataset. For a fair comparison, we set the beam size to 5 and the length penalty to 1.0 for all language pairs. Notably, our data-level approach neither modifies the model structure nor adds extra FLOPs; it is thus feasible to deploy in any framework (e.g., DynamicConv Wu et al. (2019) and non-autoregressive translation Gu et al. (2018); Ding et al. (2021b)) and in other sequence-to-sequence tasks (e.g., grammar error correction and summarization Liu et al. (2021b)). We will explore these in future work.

One may ask why we do not use a more powerful pretrained language model, e.g., mBART Liu et al. (2020b), as the baseline. We argue that it is not fair to directly compare DoT with mBART in the main results because: 1) DoT uses significantly fewer parameters than mBART (49M~200M vs. 610M); and 2) DoT uses only the parallel data, while mBART uses TB-level text during pretraining. Nevertheless, to show the effectiveness and efficiency of DoT, we report a comparison between mBART and DoT in §3.3 (see Tab. 4). We show that although mBART performs well in low-resource settings, it achieves worse performance in high-resource settings.

Training

For Transformer-Big models, we adopt the large-batch strategy Edunov et al. (2018) (i.e., 458K tokens/batch) to optimize the performance. The learning rate warms up to 1×10⁻⁷ over 10K steps, and then decays for 30K steps (data volumes from 2M to 10M) or 50K steps (data volumes larger than 10M) with a cosine schedule. For Transformer-Base, we empirically adopt 65K tokens/batch for the small datasets, e.g., IWSLT14 En→De and WMT16 En→Ro; the learning rate warms up to 1×10⁻⁷ over 4K steps, and then decays for 26K steps. For regularization, we tune the dropout rate over [0.1, 0.2, 0.3] based on validation performance, and apply weight decay of 0.01 and label smoothing of 0.1. We use the Adam optimizer (Kingma and Ba, 2015) to train the models. We evaluate the performance of an ensemble of the last 10 checkpoints to avoid stochasticity. All models were trained on an NVIDIA DGX A100 cluster.
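For reference, a minimal sketch of the warmup-then-cosine-decay behavior described above; here we treat the 1×10⁻⁷ value as the warmup starting point (a common fairseq convention), and the peak and minimum learning rates are illustrative placeholders rather than our exact configuration.

import math

def lr_schedule(step, warmup_steps=10_000, decay_steps=30_000,
                init_lr=1e-7, peak_lr=5e-4, min_lr=1e-9):
    # All numeric defaults besides the step counts are placeholders.
    if step < warmup_steps:
        # Linear warmup from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Cosine decay from peak_lr towards min_lr over decay_steps.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 5_000, 10_000, 25_000, 40_000):
    print(s, f"{lr_schedule(s):.2e}")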

One may also worry that DoT depends heavily on how the early-stop step is set. To dispel this doubt, we investigated whether our approach is robust to different early-stop steps. In preliminary experiments, we tried several fixed early-stop steps chosen according to the size of the training data (e.g., training En-De for 40K steps and stopping the denoising stage at 10K/15K/20K steps, respectively). We found that all these settings achieve similar performance. Thus, we choose a simple threshold (i.e., 1/3 of the total training steps) for better reproducibility.

Types         Model                                                 BLEU
              Transformer-Big / +DoT                                28.6 / 29.5
Parallel      +Curriculum Learning Platanios et al. (2019) / +DoT   29.4 / 29.8
              +Knowledge Distillation Kim and Rush (2016) / +DoT    29.3 / 29.7
              +Data Diversification Nguyen et al. (2020) / +DoT     30.1 / 30.6
              +Bidirectional Training Ding et al. (2021c) / +DoT    29.7 / 29.9
Monolingual   +Back Translation Caswell et al. (2019a) / +DoT       30.5 / 31.1
Table 3: Complementarity with other works. "/ +DoT" means combining DoT with the corresponding data manipulation method; the BLEU score of each +DoT variant follows that of its counterpart after the "/". Experiments are conducted on WMT14 En-De.
Scales    100K    1M      5M      10M     20M
Random    10.8    16.4    21.5    27.3    33.2
mBART     17.4    20.5    22.3    26.8    31.4
DoT       12.9    18.1    22.1    27.8    33.6
Table 4: Translation performance comparison between mBART-based models and our proposed DoT across different data scales. Blue cells represent improved performance over Random, while red cells mean a reduction; the shade of the cell color indicates the degree of significance.

3.2 Results

Results on Different Data Scales

We experimented on 12 language directions, including IWSLT14 En↔De, WMT16 En↔Ro, IWSLT21 En↔Sw, WMT14 En↔De, WMT20 Ja↔En and WMT17 Zh↔En. Tab. 1 reports the experimental results. DoT achieves significant improvements over the strong Transformer baseline in 8 out of 12 directions at significance level p<0.01, and the other 3 directions also show promising performance at p<0.05, demonstrating the effectiveness of our method across data scales.

Results on Distant Language Pairs

We report the results of DoT on the Ja↔En and Zh↔En language pairs, which involve different language families (i.e., Japonic, Indo-European and Sino-Tibetan). As shown in Tab. 1, DoT significantly improves the translation quality in all cases. In particular, DoT achieves an average improvement of +0.7 BLEU points over the baselines, showing the effectiveness and universality of our method across language pairs.

Results on Multilingual Translation Tasks

Cross-lingual pretrained models are expected to learn better representations when given more languages Liu et al. (2020b). To verify this hypothesis, we report the multilingual results in Tab. 2. As seen, DoT consistently and significantly outperforms the strong baseline on all language pairs, confirming the effectiveness of DoT in the multilingual MT scenario.

Complementary to Related Work

To illustrate the complementarity between DoT and related data manipulation works, we consider one monolingual data-based approach: Tagged Back-Translation (BT; Caswell et al., 2019b), which combines parallel data with synthetic data generated from target-side monolingual data, and four representative parallel data-based approaches: a) Competence-based Curriculum Learning (CL; Platanios et al., 2019) trains on samples in easy-to-hard order, where the difficulty metric is the competence; b) Knowledge Distillation (KD; Kim and Rush, 2016) trains the model with sequence-level distilled parallel data; c) Data Diversification (DD; Nguyen et al., 2020) diversifies the data by applying KD and BT to the parallel data; d) Bidirectional Training (BiT; Ding et al., 2021c) pretrains the model with bidirectional parallel data. As seen in Tab. 3, DoT is a plug-and-play strategy that yields further improvements on top of all of them.

Here we only provide empirical evidence for the complementarity between our pretraining approach and conventional data manipulation strategies; deeper insights, such as those investigated by Liu et al. (2021a), should be explored in the future.

3.3 Analysis

We conducted analyses on the WMT17 En→Zh dataset to better understand where the improvement comes from.

DoT works as an in-domain pretraining strategy

Song et al. (2019); Liu et al. (2020b) reconstruct abundant unlabeled data with sequence-to-sequence language models, which benefits low-resource translation tasks Cheng et al. (2021). However, such benefits are hard to obtain in high-resource settings, in part due to catastrophic forgetting French (1999) and the domain discrepancy between pretraining and finetuning Gururangan et al. (2020); Anonymous (2021). In fact, DoT can be viewed as a simple in-domain pretraining strategy, where the pretraining corpora exactly match the downstream machine translation task. We randomly sample different data scales to carefully compare the effects of pretrained mBART (https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz) Liu et al. (2020b) and our proposed DoT. The mBART model is used to initialize the downstream bilingual models with the officially provided scripts (https://github.com/pytorch/fairseq/blob/main/examples/mbart/README.md). To maximize the effect of mBART, we grid-searched the "--update-freq" option from 2 to 16 to simulate different batch sizes, since increasing the batch size when finetuning mBART on large-scale datasets is essential for better performance. As for our DoT, we employ Transformer-Base for the 100K and 1M data scales, and Transformer-Big for the larger datasets, i.e., 5M, 10M, and 20M. Notably, under these experimental settings, our DoT consumes ~200M parameters in the high-resource setting, while mBART contains significantly more parameters, i.e., ~610M, in all settings.

Tab. 4 shows that although mBART achieves significantly higher performance in the extremely low-resource setting, e.g., 100K, it undermines the model performance in high-resource settings, e.g., ≥10M, while our DoT strategy consistently outperforms the "Random" setting. We attribute the better performance in resource-rich settings to the in-domain pretraining, which is consistent with the findings of Gururangan et al. (2020); Anonymous (2021).

DoT works better with task-relevant signals

Recall that we simply employ several straightforward noises, e.g., word drop, word mask, and span shuffle, during sequence-to-sequence reconstruction, which then serve as the self-supervision signals. Gururangan et al. (2020) show that task-relevant signals are important when finetuning pretrained models. In machine translation, reconstructing cross-lingual knowledge is intuitively more informative than random dropping, masking, and shuffling. We therefore deliberately design a translation-relevant denoising signal to provide more insight into DoT. Concretely, during word masking, we feed code-switched words rather than the [MASK] token as the input, where we closely follow Yang et al. (2020) to perform the code-switching. We find that this task-relevant signal enables the model to achieve a further improvement of 0.3 BLEU points, confirming our claim.
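As a hypothetical illustration of this translation-relevant signal (a sketch in the spirit of, but not the exact procedure of, Yang et al. (2020)), the snippet below replaces selected source words with translations drawn from a bilingual lexicon instead of the [MASK] token; the toy lexicon and the replacement probability are illustrative assumptions.

import random

def code_switch_mask(words, lexicon, p=0.1):
    # Instead of blanking a selected word with 'MASK', substitute its
    # translation from a bilingual lexicon when one is available; the model
    # then reconstructs the original sentence from this code-switched input.
    out = []
    for w in words:
        if random.random() < p and w in lexicon:
            out.append(lexicon[w])
        else:
            out.append(w)
    return out

# Toy De->En lexicon, for illustration only.
lexicon = {'haus': 'house', 'klein': 'small', 'morgen': 'morning'}
print(code_switch_mask('ein klein haus'.split(), lexicon, p=1.0))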

4 Conclusion

In this paper, we present a simple self-supervised pretraining approach for neural machine translation that uses only parallel data. Extensive experiments on bilingual and multilingual language pairs show that our approach, DoT, consistently and significantly improves translation performance, and complements existing data manipulation methods well. In-depth analyses indicate that DoT works as an in-domain pretraining strategy, and can be a better alternative to costly large-scale pretrained models, e.g., mBART.

We hope our proposed twin tricks for NMT, i.e., DoT and BiT Ding et al. (2021c), can facilitate MT researchers who aim to develop systems under constrained data resources.

Acknowledgements

We would like to thank the anonymous reviewers of ARR November for their careful proofreading and valuable comments.

References

  • Akhbardeh et al. (2021) Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, et al. 2021. Findings of the 2021 conference on machine translation (wmt21). In WMT.
  • Anonymous (2021) ARR Nov. Anonymous. 2021. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In OpenReview.
  • Barrault et al. (2020) Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, et al. 2020. Proceedings of the fifth conference on machine translation. In WMT.
  • Barrault et al. (2019) Loïc Barrault, Ondřej Bojar, Marta R Costa-Jussa, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, et al. 2019. Findings of the 2019 conference on machine translation (wmt19). In WMT.
  • Caswell et al. (2019a) Isaac Caswell, Ciprian Chelba, and David Grangier. 2019a. Tagged back-translation. In WMT, Florence, Italy.
  • Caswell et al. (2019b) Isaac Caswell, Ciprian Chelba, and David Grangier. 2019b. Tagged back-translation. In WMT.
  • Cheng et al. (2021) Yong Cheng, Wei Wang, Lu Jiang, and Wolfgang Macherey. 2021. Self-supervised and supervised joint training for resource-rich machine translation–supplementary materials. In ICML.
  • Collins et al. (2005) Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL.
  • Ding et al. (2021a) Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, and Zhaopeng Tu. 2021a. Progressive multi-granularity training for non-autoregressive translation. In findings of ACL.
  • Ding et al. (2021b) Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, and Zhaopeng Tu. 2021b. Rejuvenating low-frequency words: Making the most of parallel data in non-autoregressive translation. In ACL.
  • Ding et al. (2021c) Liang Ding, Di Wu, and Dacheng Tao. 2021c. Improving neural machine translation by bidirectional training. In EMNLP.
  • Ding et al. (2021d) Liang Ding, Di Wu, and Dacheng Tao. 2021d. The usyd-jd speech translation system for iwslt2021. In IWSLT.
  • Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In EMNLP.
  • French (1999) Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences.
  • Gu et al. (2018) Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In ICLR.
  • Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In ACL.
  • Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic chinese to english news translation. In arXiv.
  • Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In EMNLP.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.
  • Lin et al. (2021) Zehui Lin, Liwei Wu, Mingxuan Wang, and Lei Li. 2021. Learning language specific sub-network for multilingual machine translation. In ACL.
  • Liu et al. (2020a) Xuebo Liu, Houtim Lai, Derek F Wong, and Lidia S Chao. 2020a. Norm-based curriculum learning for neural machine translation. In ACL.
  • Liu et al. (2021a) Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Shuming Shi, and Zhaopeng Tu. 2021a. On the complementarity between pre-training and back-translation for neural machine translation. In Findings of EMNLP.
  • Liu et al. (2021b) Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, and Zhaopeng Tu. 2021b. Understanding and improving encoder layer fusion in sequence-to-sequence learning. In ICLR.
  • Liu et al. (2020b) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020b. Multilingual denoising pre-training for neural machine translation. TACL.
  • Marcus (1993) Gary F Marcus. 1993. Negative evidence in language acquisition. Cognition.
  • Narang et al. (2021) Sharan Narang, Hyung Won Chung, Yi Tay, Liam Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. 2021. Do transformer modifications transfer across implementations and applications? In EMNLP.
  • Ng (2021) Andrew Ng. 2021. A chat with andrew on mlops: From model-centric to data-centric ai.
  • Nguyen et al. (2020) Xuan-Phi Nguyen, Shafiq Joty, Wu Kui, and Ai Ti Aw. 2020. Data diversification: A simple strategy for neural machine translation. In NeurIPS.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
  • Platanios et al. (2019) Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabás Poczós, and Tom Mitchell. 2019. Competence-based curriculum learning for neural machine translation. In NAACL.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In WMT.
  • Sánchez-Cartagena et al. (2018) Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, and Gema Ramírez. 2018. Prompsit’s submission to WMT 2018 parallel corpus filtering shared task. In WMT.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. In ICML.
  • Tan et al. (2019) Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. In ICLR.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
  • Voita et al. (2021) Elena Voita, Rico Sennrich, and Ivan Titov. 2021. Language modeling, lexical translation, reordering: The training process of NMT through the lens of classical SMT. In EMNLP.
  • Wu et al. (2019) Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In ICLR.
  • Yang et al. (2020) Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, and Qi Ju. 2020. Csp: Code-switching pre-training for neural machine translation. In EMNLP.
  • Zhang et al. (2018) Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. 2018. An empirical exploration of curriculum learning for neural machine translation. In arXiv.
  • Zhou et al. (2021) Lei Zhou, Liang Ding, Kevin Duh, Shinji Watanabe, Ryohei Sasano, and Koichi Takeda. 2021. Self-guided curriculum learning for neural machine translation. In IWSLT.

Appendix A Appendix

A.1 Code for Noise Function in PyTorch

The noise functions include removing (word dropout), replacing with a mask token (word blanking), and nearby swapping (word shuffling):

import argparse
import fileinput
import random

import torch


def main():
    parser = argparse.ArgumentParser(
        description='Command-line script to add noise to data')
    parser.add_argument('-wd', help='Word dropout probability', default=0.1, type=float)
    parser.add_argument('-wb', help='Word blank probability', default=0.1, type=float)
    parser.add_argument('-sk', help='Shuffle words within a window of k positions', default=3, type=int)
    args = parser.parse_args()

    # Read sentences from stdin, apply the three noise functions, and print the result.
    for line in fileinput.input('-'):
        s = line.strip().split()
        if len(s) > 0:
            s = word_shuffle(s, args.sk)
            s = word_dropout(s, args.wd)
            s = word_blank(s, args.wb)
        print(' '.join(s))


def word_shuffle(s, sk):
    # Nearby swapping: perturb each position by a random offset in [0, sk)
    # and re-sort the words by their perturbed positions.
    noise = torch.rand(len(s)).mul_(sk)
    perm = torch.arange(len(s)).float().add_(noise).sort()[1]
    return [s[i] for i in perm]


def word_dropout(s, wd):
    # Removing: drop each word with probability wd, keeping at least one word.
    keep = torch.rand(len(s))
    res = [si for i, si in enumerate(s) if keep[i] > wd]
    if len(res) == 0:
        return [s[random.randint(0, len(s) - 1)]]
    return res


def word_blank(s, wb):
    # Replacing: substitute each word with the MASK token with probability wb.
    keep = torch.rand(len(s))
    return [si if keep[i] > wb else 'MASK' for i, si in enumerate(s)]


if __name__ == '__main__':
    main()