
Improving Neural Machine Translation by Denoising Training

Liang Ding
The University of Sydney
[email protected]
Keqin Peng
Beihang University
[email protected]
Dacheng Tao
JD Explore Academy, JD.com
[email protected]
  Work done when interning at JD Explore Academy.
Abstract

We present a simple and effective pretraining strategy, Denoising Training (DoT), for neural machine translation. (DoT is one of the twin tricks for NMT that we proposed in the IWSLT21 evaluation Ding et al. (2021d); the other is Bidirectional Training, namely BiT Ding et al. (2021c).) Specifically, we update the model parameters with source- and target-side denoising tasks at the early stage of training and then tune the model normally. Notably, our approach does not add any parameters or training steps, and requires only the parallel data. Experiments show that DoT consistently improves neural machine translation performance across 12 bilingual and 16 multilingual directions (with data sizes ranging from 80K to 20M). In addition, we show that DoT complements existing data manipulation strategies, i.e., curriculum learning, knowledge distillation, data diversification, bidirectional training, and back-translation. Encouragingly, we find that DoT outperforms the costly pretrained model mBART Liu et al. (2020b) in high-resource settings. Analyses show that DoT is a novel in-domain cross-lingual pretraining strategy, and could offer further improvements with task-relevant self-supervisions.

1 Introduction

Transformer Vaswani et al. (2017) has become the de facto choice for neural machine translation (NMT) due to its state-of-the-art performance Barrault et al. (2019, 2020); Akhbardeh et al. (2021). However, an interesting study reveals that many Transformer modifications do not result in improved performance because of a lack of generalization Narang et al. (2021). This finding is consistent with the recent call for data-centric AI in the ML community Ng (2021), which urges the NMT community to pay more attention to how to effectively and efficiently exploit the supervisions in the data, rather than to complicated architectural modifications.

There has been a lot of work on NMT data manipulation to fully exploit the training data. Zhang et al. (2018); Platanios et al. (2019); Liu et al. (2020a); Zhou et al. (2021); Ding et al. (2021a) design difficulty metrics that enable models to learn from easy to hard examples. Kim and Rush (2016) propose sequence-level knowledge distillation for machine translation to acquire refined knowledge from teachers. Nguyen et al. (2020) diversify the training data by using the predictions of multiple forward and backward models. Recently, Ding et al. (2021c) initialize the translation system with a bidirectional system to obtain better performance. However, these approaches assume that the supervisions come from the correlation between the source and target sentences, i.e., src↔tgt, which is the basic property of parallel data, and ignore the self-supervisions within the source and target sentences themselves.

In this work, we aim to find more self-supervisions in parallel data, which are hopefully complementary to existing data manipulation strategies. Accordingly, we break the parallel data into two pieces of high-quality monolingual data, allowing us to design rich self-supervisions on both the source and target sides. We choose denoising as the self-supervision objective, i.e., denoising training (§2). The core idea is to use a multilingual denoising system as the initialization for a translation system. Specifically, given the parallel language pair "B: src→tgt", we construct the denoising data "M_src: noised(src)→src" and "M_tgt: noised(tgt)→tgt". We then update the parameters with the denoising data M_src + M_tgt in the early stage, and tune the model with the parallel data B.

We validated our approach on bilingual and multilingual benchmarks across different language families and sizes in §3.2. Experiments show DoT consistently improves the translation performance. We also show DoT can complement existing data manipulation strategies, i.e. back translation Caswell et al. (2019a), curriculum learning Platanios et al. (2019), knowledge distillation Kim and Rush (2016), data diversification Nguyen et al. (2020) and bidirectional training Ding et al. (2021c). Analyses in §3.3 provide some insights about where the improvements come from: DoT is a simple in-domain cross-lingual pretraining strategy and can be enhanced with task-relevant self-supervisions.

2 Denoising Training

Preliminary

Given a source sentence 𝐱, NMT models generate each target word 𝐲_t conditioned on the previously generated words 𝐲_{<t}, which can be formulated as:

p(\mathbf{y}|\mathbf{x})=\prod_{t=1}^{T}p(\mathbf{y}_{t}|\mathbf{x},\mathbf{y}_{<t};\theta)   (1)

where T is the length of the target sequence and the parameters θ are trained to maximize the likelihood of the training examples according to ℒ(θ) = argmax_θ log p(𝐲|𝐱; θ). The training examples used for this conditional estimation are defined as B = {(𝐱_i, 𝐲_i)}_{i=1}^{N}, where N is the total number of sentence pairs in the training data.
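To make the factorization in Eq. 1 concrete, the following sketch sums per-token log-probabilities under teacher forcing to obtain the sentence log-probability; the random logits, tensor shapes, and variable names are illustrative stand-ins for an actual NMT decoder, not part of our implementation.

import torch
import torch.nn.functional as F

# Toy stand-ins: T target tokens, vocabulary of size V.
T, V = 5, 1000
logits = torch.randn(T, V)            # decoder outputs for p(y_t | x, y_<t)
y = torch.randint(0, V, (T,))         # reference target tokens

# log p(y | x) = sum_t log p(y_t | x, y_<t)  (Eq. 1 in log space)
log_probs = F.log_softmax(logits, dim=-1)
sent_log_prob = log_probs[torch.arange(T), y].sum()

# Maximizing this likelihood is equivalent to minimizing cross-entropy.
loss = F.cross_entropy(logits, y, reduction='sum')
assert torch.allclose(-sent_log_prob, loss)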

Motivation

The motivation for training with denoising data is that, when humans learn languages, one of the best practices for language acquisition is to correct sentence errors Marcus (1993). Motivated by this, Lewis et al. (2020) propose several noise functions and denoise them in an end-to-end way. Liu et al. (2020b) introduce this idea to the multilingual scenario. Different from the above monolingual pretraining approaches, we propose a simpler noise function and apply it to each side of the parallel data.

Method

We want the model to understand both the source- and target-side languages well before learning lexical translation and reordering Voita et al. (2021). For the noise function noised(·), we apply the common noise-injection practices described in Appendix A.1, i.e., removing, replacing, or nearby swapping a random word (chosen under a uniform distribution) once per sentence Edunov et al. (2018). The size of the original parallel data is thus doubled as follows:

\text{M}_{\mathrm{src}}=\{(noised(\mathbf{x}_{i}),\mathbf{x}_{i})\}^{N}_{i=1}   (2)
\text{M}_{\mathrm{tgt}}=\{(noised(\mathbf{y}_{i}),\mathbf{y}_{i})\}^{N}_{i=1}   (3)

where M_src and M_tgt are combined to update the end-to-end model. In doing so, θ in Eq. 1 can be updated by denoising both the source and target data, and the denoising objective becomes:

\mathcal{L}_{\text{DoT}}(\theta)=\overbrace{\operatorname*{arg\,max}_{\theta}\log p(\mathbf{x}\,|\,noised(\mathbf{x});\theta)}^{\text{Source Denoising}:\,\mathcal{L}_{\theta}^{S}}   (4)
+\underbrace{\operatorname*{arg\,max}_{\theta}\log p(\mathbf{y}\,|\,noised(\mathbf{y});\theta)}_{\text{Target Denoising}:\,\mathcal{L}_{\theta}^{T}}   (5)

where the source denoising objective ℒ_θ^S and the target denoising objective ℒ_θ^T are optimized iteratively. This pretraining stores knowledge of the source and target languages in the shared model parameters, which may enable better and faster learning of subsequent tasks. Following Ding et al. (2021c), we stop the denoising training early at 1/3 of the total steps, and then tune the model normally for the remaining 2/3 of the training steps. This process can be formally denoted as the pipeline M_src + M_tgt → B.

There are many possible ways to implement the general idea of denoising training. The aim of this paper is not to explore the whole space but simply to show that one fairly straightforward implementation works well and the idea is reasonable.
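As one illustration of such a straightforward implementation, the sketch below builds the denoising corpora M_src and M_tgt (Eqs. 2-3) from a parallel corpus B and switches from denoising to normal training at 1/3 of the total steps. The simplified noised function, the toy corpus, and the per-step sampling are illustrative assumptions; in practice the full noise function of Appendix A.1 and the regular fairseq training loop are used.

import random

def noised(words):
    # Simplified stand-in for the noise function of Appendix A.1:
    # here we only remove one random word (the full version also
    # replaces a word with MASK or swaps nearby words).
    if len(words) > 1:
        words = words[:]
        del words[random.randrange(len(words))]
    return words

def build_dot_data(parallel):
    # parallel: list of (src_words, tgt_words) pairs, i.e. the corpus B.
    # Returns the combined denoising corpora M_src + M_tgt of Eqs. 2-3.
    m_src = [(noised(x), x) for x, _ in parallel]
    m_tgt = [(noised(y), y) for _, y in parallel]
    return m_src + m_tgt

def dot_schedule(parallel, total_steps):
    # Pipeline M_src + M_tgt -> B: denoise for the first third of the
    # training steps, then tune normally on the parallel data.
    denoise_data = build_dot_data(parallel)
    for step in range(total_steps):
        pool = denoise_data if step < total_steps // 3 else parallel
        yield random.choice(pool)  # stand-in for drawing a training batch

# Toy usage with a two-sentence "corpus".
corpus = [("ein kleines haus".split(), "a small house".split()),
          ("guten morgen".split(), "good morning".split())]
for step, (inp, out) in enumerate(dot_schedule(corpus, total_steps=9)):
    print(step, inp, "->", out)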

Data Source   IWSLT14          WMT16           IWSLT21          WMT14           WMT20           WMT17
Size          160K             0.6M            2.4M             4.5M            13M             20M
Direction     En-De   De-En    En-Ro   Ro-En   En-Sw   Sw-En    En-De   De-En   Ja-En   En-Ja   Zh-En   En-Zh
Transformer   29.2    35.1     33.9    34.1    28.8    48.5     28.6    32.1    20.4    18.2    23.7    33.2
  +DoT        29.8    36.1     35.0    35.5    29.3    49.6     29.5    32.7    20.9    19.1    24.7    33.6
Table 1: Performance on several widely-used bilingual benchmarks, including IWSLT14 En↔De, WMT16 En↔Ro, IWSLT21 En↔Sw, WMT14 En↔De, WMT20 Ja↔En and WMT17 Zh↔En. Among them, Ja-En and Zh-En are distant language pairs. "‡/†" indicates a significant difference (p<0.01/0.05) from the corresponding baselines.

3 Experiments

3.1 Setup

Bilingual Data

Main experiments in Tab. 1 are conducted on 6 translation datasets: IWSLT14 English↔German Nguyen et al. (2020), WMT16 English↔Romanian Gu et al. (2018), IWSLT21 English↔Swahili (https://iwslt.org/2021/low-resource), WMT14 English↔German Vaswani et al. (2017), WMT20 Japanese↔English (http://www.statmt.org/wmt20) and WMT17 Chinese↔English Hassan et al. (2018). The data sizes can be found in Tab. 1, ranging from 160K to 20M. Notably, Japanese↔English and Chinese↔English are two distant, high-resource language pairs. The monolingual data used for back-translation in Tab. 3 is randomly sampled from the publicly available News Crawl corpus (http://data.statmt.org/news-crawl/). We use the same validation and test sets as previous works for a fair comparison, except for IWSLT21 English↔Swahili, where we sample 5K/5K sentences from the training set as the validation/test sets. We preprocess all data except Japanese↔English via BPE Sennrich et al. (2016) with 32K merge operations. For Japanese↔English, we filter the parallel data with Bicleaner Sánchez-Cartagena et al. (2018) and apply SentencePiece Kudo and Richardson (2018) to generate 32K subwords.

Language      Fa      Pl      Ar      He      Nl      De      It      Es
Size          89K     128K    140K    144K    153K    160K    167K    169K
Transformer   17.1    16.4    21.3    28.8    31.5    28.5    29.3    34.9
  +DoT        18.2    17.5    22.8    30.7    33.3    29.6    31.2    36.5
Table 2: Performance on the IWSLT multilingual task. For simplicity, we report the average BLEU of En→X and X→En for each language. For significance, we compare the concatenation of the En→X and X→En translations against the corresponding concatenated references.

Multilingual Data

We follow Lin et al. (2021) to collect eight English-centric multilingual language pairs from IWSLT14 (https://wit3.fbk.eu/), including Farsi (Fa), Polish (Pl), Arabic (Ar), Hebrew (He), Dutch (Nl), German (De), Italian (It), and Spanish (Es). Following Tan et al. (2019), we apply BPE with 30K merge operations, and use an over-sampling strategy to balance the training data distribution with a temperature of T=2. The hyper-parameters of removing and replacing are set to ratio=0.1, and nearby swapping to span=3, due to their better performance in our preliminary studies. For evaluation, we use tokenized BLEU Papineni et al. (2002) as the metric for the bilingual tasks, except for English→Chinese and Japanese↔English, where we report SacreBLEU Post (2018). The sign-test Collins et al. (2005) is used for the statistical significance tests.
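For illustration, one common formulation of temperature-based over-sampling draws language l with probability proportional to (n_l / Σ_k n_k)^{1/T}, where n_l is the number of sentence pairs for language l; this is a sketch of the balancing idea rather than the exact recipe of Tan et al. (2019). The snippet below computes these probabilities for the corpus sizes in Tab. 2 with T=2.

def sampling_probs(sizes, T=2.0):
    # Temperature-based over-sampling: p_l ∝ (n_l / total) ** (1 / T),
    # so T > 1 up-weights low-resource languages relative to their raw share.
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# IWSLT14 corpus sizes (sentence pairs) from Tab. 2.
sizes = {'Fa': 89_000, 'Pl': 128_000, 'Ar': 140_000, 'He': 144_000,
         'Nl': 153_000, 'De': 160_000, 'It': 167_000, 'Es': 169_000}
print(sampling_probs(sizes, T=2))  # e.g., Fa gets a larger share than its raw 89K/1.15M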

Model

We validated our proposed DoT on Transformer Vaswani et al. (2017), implemented in fairseq (https://github.com/pytorch/fairseq). All bilingual tasks are trained with Transformer-Big except IWSLT14 En↔De and WMT16 En↔Ro, which are trained with Transformer-Base because of their extremely small data sizes. For the multilingual experiments, we closely follow previous work Wu et al. (2019) and adopt a smaller Transformer-Base (with d_ff=1024 and n_head=4) due to the small data volume of the IWSLT multilingual dataset. For a fair comparison, we set the beam size to 5 and the length penalty to 1.0 for all language pairs. Notably, our data-level approach neither modifies the model structure nor adds extra FLOPs; it is thus feasible to deploy in any framework (e.g., DynamicConv Wu et al. (2019) and non-autoregressive translation Gu et al. (2018); Ding et al. (2021b)) and in other sequence-to-sequence tasks (e.g., grammar error correction and summarization Liu et al. (2021b)). We will explore these in future work.

One may ask why we do not use a more powerful pretrained language model, e.g., mBART Liu et al. (2020b), as the baseline. We argue that it is not fair to directly compare DoT with mBART in the main results because: 1) DoT uses significantly fewer parameters than mBART (49M~200M vs. 610M); and 2) DoT uses only the parallel data, while mBART uses TB-level text during pretraining. Nevertheless, to show the effectiveness and efficiency of DoT, we report a comparison between mBART and DoT in §3.3 (see Tab. 4). We show that although mBART performs well in low-resource settings, it achieves worse performance in high-resource settings.

Training

For Transformer-Big models, we adopt the large-batch strategy Edunov et al. (2018) (i.e., 458K tokens/batch) to optimize the performance. The learning rate warms up to 1×10⁻⁷ over 10K steps, and then decays for 30K steps (data volumes from 2M to 10M) or 50K steps (data volumes larger than 10M) with a cosine schedule. For Transformer-Base, we empirically adopt 65K tokens/batch for the small datasets, e.g., IWSLT14 En→De and WMT16 En→Ro; the learning rate warms up to 1×10⁻⁷ over 4K steps, and then decays for 26K steps. For regularization, we tune the dropout rate over [0.1, 0.2, 0.3] based on validation performance, and apply weight decay of 0.01 and label smoothing of 0.1. We use the Adam optimizer (Kingma and Ba, 2015) to train the models. We evaluate the performance of an ensemble of the last 10 checkpoints to avoid stochasticity. All models were trained on an NVIDIA DGX A100 cluster.
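For reference, a minimal sketch of the warmup-then-cosine-decay behavior described above; here we treat the 1×10⁻⁷ value as the warmup starting point (a common fairseq convention), and the peak and minimum learning rates are illustrative placeholders rather than our exact configuration.

import math

def lr_schedule(step, warmup_steps=10_000, decay_steps=30_000,
                init_lr=1e-7, peak_lr=5e-4, min_lr=1e-9):
    # All numeric defaults besides the step counts are placeholders.
    if step < warmup_steps:
        # Linear warmup from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Cosine decay from peak_lr towards min_lr over decay_steps.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 5_000, 10_000, 25_000, 40_000):
    print(s, f"{lr_schedule(s):.2e}")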

One may also worry that DoT depends heavily on how the early-stop step is set. To dispel this doubt, we investigated whether our approach is robust to different early-stop steps. In preliminary experiments, we tried several fixed early-stop steps chosen according to the size of the training data (e.g., training En-De for 40K steps and stopping the denoising stage at 10K/15K/20K steps, respectively). We found that all these settings achieve similar performance. Thus, we choose a simple threshold (i.e., 1/3 of the total training steps) for better reproducibility.

Types         Model                                                 BLEU
              Transformer-Big / +DoT                                28.6 / 29.5
Parallel      +Curriculum Learning Platanios et al. (2019) / +DoT   29.4 / 29.8
              +Knowledge Distillation Kim and Rush (2016) / +DoT    29.3 / 29.7
              +Data Diversification Nguyen et al. (2020) / +DoT     30.1 / 30.6
              +Bidirectional Training Ding et al. (2021c) / +DoT    29.7 / 29.9
Monolingual   +Back Translation Caswell et al. (2019a) / +DoT       30.5 / 31.1
Table 3: Complementarity with other works. "/ +DoT" means combining DoT with the corresponding data manipulation method; the BLEU score of each +DoT variant follows that of its counterpart after the "/". Experiments are conducted on WMT14 En-De.
Scales    100K    1M      5M      10M     20M
Random    10.8    16.4    21.5    27.3    33.2
mBART     17.4    20.5    22.3    26.8    31.4
DoT       12.9    18.1    22.1    27.8    33.6
Table 4: Translation performance comparison between mBART-based models and our proposed DoT across different data scales. Blue cells represent improved performance over Random, while red cells mean a reduction; the shade of the cell color indicates the degree of significance.

3.2 Results

Results on Different Data Scales

We experimented on 12 language directions, including IWSLT14 En↔De, WMT16 En↔Ro, IWSLT21 En↔Sw, WMT14 En↔De, WMT20 Ja↔En and WMT17 Zh↔En. Tab. 1 reports the experimental results. DoT achieves significant improvements over the strong Transformer baseline in 8 out of 12 directions at significance level p<0.01, and the other 3 directions also show promising performance at p<0.05, demonstrating the effectiveness of our method across data scales.

Results on Distant Language Pairs

We report the results of DoT on the Ja↔En and Zh↔En language pairs, which involve different language families (i.e., Japonic, Indo-European and Sino-Tibetan). As shown in Tab. 1, DoT significantly improves the translation quality in all cases. In particular, DoT achieves an average improvement of +0.7 BLEU points over the baselines, showing the effectiveness and universality of our method across language pairs.

Results on Multilingual Translation Tasks

Cross-lingual pretrained models are expected to learn better representations when given more languages Liu et al. (2020b). To verify this hypothesis, we report the multilingual results in Tab. 2. As seen, DoT consistently and significantly outperforms the strong baseline on all language pairs, confirming the effectiveness of DoT in the multilingual MT scenario.

Complementary to Related Work

To illustrate the complementarity between DoT and related data manipulation works, we consider one monolingual data-based approach: Tagged Back-Translation (BT; Caswell et al., 2019b), which combines parallel data with synthetic data generated from target-side monolingual data, and four representative parallel data-based approaches: a) Competence-based Curriculum Learning (CL; Platanios et al., 2019) trains on samples in easy-to-hard order, where the difficulty metric is the competence; b) Knowledge Distillation (KD; Kim and Rush, 2016) trains the model with sequence-level distilled parallel data; c) Data Diversification (DD; Nguyen et al., 2020) diversifies the data by applying KD and BT to the parallel data; d) Bidirectional Training (BiT; Ding et al., 2021c) pretrains the model with bidirectional parallel data. As seen in Tab. 3, DoT is a plug-and-play strategy that yields further improvements on top of all of them.

Here we only provide empirical evidence for the complementarity between our pretraining approach and conventional data manipulation strategies; deeper insights, such as those investigated by Liu et al. (2021a), should be explored in the future.

3.3 Analysis

We conducted analyses on the WMT17 En→Zh dataset to better understand where the improvement comes from.

DoT works as an in-domain pretraining strategy

Song et al. (2019); Liu et al. (2020b) reconstruct abundant unlabeled data with sequence-to-sequence language models, which benefits low-resource translation tasks Cheng et al. (2021). However, such benefits are hard to obtain in high-resource settings, in part due to catastrophic forgetting French (1999) and the domain discrepancy between pretraining and finetuning Gururangan et al. (2020); Anonymous (2021). In fact, DoT can be viewed as a simple in-domain pretraining strategy, where the pretraining corpora exactly match the downstream machine translation task. We randomly sample different data scales to carefully compare the effects of pretrained mBART (https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz) Liu et al. (2020b) and our proposed DoT. The mBART model is used to initialize the downstream bilingual models with the officially provided scripts (https://github.com/pytorch/fairseq/blob/main/examples/mbart/README.md). To maximize the effect of mBART, we grid-searched the "--update-freq" option from 2 to 16 to simulate different batch sizes, since increasing the batch size when finetuning mBART on large-scale datasets is essential for better performance. As for our DoT, we employ Transformer-Base for the 100K and 1M data scales, and Transformer-Big for the larger datasets, i.e., 5M, 10M, and 20M. Notably, under these experimental settings, our DoT consumes ~200M parameters in the high-resource setting, while mBART contains significantly more parameters, i.e., ~610M, in all settings.

Tab. 4 shows that although mBART achieves significantly higher performance in the extremely low-resource setting, e.g., 100K, it undermines the model performance in high-resource settings, e.g., ≥10M, while our DoT strategy consistently outperforms the "Random" setting. We attribute the better performance in resource-rich settings to the in-domain pretraining, which is consistent with the findings of Gururangan et al. (2020); Anonymous (2021).

DoT works better with task-relevant signals

Recall that we simply employ several straightforward noises, e.g., word drop, word mask, and span shuffle, during sequence-to-sequence reconstruction, which then serve as the self-supervision signals. Gururangan et al. (2020) show that task-relevant signals are important when finetuning pretrained models. In machine translation, reconstructing cross-lingual knowledge is intuitively more informative than random dropping, masking, and shuffling. We therefore deliberately design a translation-relevant denoising signal to provide more insight into DoT. Concretely, during word masking, we feed code-switched words rather than the [MASK] token as the input, where we closely follow Yang et al. (2020) to perform the code-switching. We find that this task-relevant signal enables the model to achieve a further improvement of 0.3 BLEU points, confirming our claim.
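As a hypothetical illustration of this translation-relevant signal (a sketch in the spirit of, but not the exact procedure of, Yang et al. (2020)), the snippet below replaces selected source words with translations drawn from a bilingual lexicon instead of the [MASK] token; the toy lexicon and the replacement probability are illustrative assumptions.

import random

def code_switch_mask(words, lexicon, p=0.1):
    # Instead of blanking a selected word with 'MASK', substitute its
    # translation from a bilingual lexicon when one is available; the model
    # then reconstructs the original sentence from this code-switched input.
    out = []
    for w in words:
        if random.random() < p and w in lexicon:
            out.append(lexicon[w])
        else:
            out.append(w)
    return out

# Toy De->En lexicon, for illustration only.
lexicon = {'haus': 'house', 'klein': 'small', 'morgen': 'morning'}
print(code_switch_mask('ein klein haus'.split(), lexicon, p=1.0))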

4 Conclusion

In this paper, we present a simple self-supervised pretraining approach for neural machine translation that uses only parallel data. Extensive experiments on bilingual and multilingual language pairs show that our approach, DoT, consistently and significantly improves translation performance, and complements existing data manipulation methods well. In-depth analyses indicate that DoT works as an in-domain pretraining strategy, and can be a better alternative to costly large-scale pretrained models, e.g., mBART.

We hope our proposed twin tricks for NMT, i.e., DoT and BiT Ding et al. (2021c), can facilitate MT researchers who aim to develop systems under constrained data resources.

Acknowledgements

We would like to thank the anonymous reviewers of ARR November for their careful proofreading and valuable comments.

References

  • Akhbardeh et al. (2021) Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, et al. 2021. Findings of the 2021 conference on machine translation (wmt21). In WMT.
  • Anonymous (2021) ARR Nov. Anonymous. 2021. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In OpenReview.
  • Barrault et al. (2020) Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, et al. 2020. Proceedings of the fifth conference on machine translation. In WMT.
  • Barrault et al. (2019) Loïc Barrault, Ondřej Bojar, Marta R Costa-Jussa, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, et al. 2019. Findings of the 2019 conference on machine translation (wmt19). In WMT.
  • Caswell et al. (2019a) Isaac Caswell, Ciprian Chelba, and David Grangier. 2019a. Tagged back-translation. In WMT, Florence, Italy.
  • Caswell et al. (2019b) Isaac Caswell, Ciprian Chelba, and David Grangier. 2019b. Tagged back-translation. In WMT.
  • Cheng et al. (2021) Yong Cheng, Wei Wang, Lu Jiang, and Wolfgang Macherey. 2021. Self-supervised and supervised joint training for resource-rich machine translation–supplementary materials. In ICML.
  • Collins et al. (2005) Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL.
  • Ding et al. (2021a) Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, and Zhaopeng Tu. 2021a. Progressive multi-granularity training for non-autoregressive translation. In findings of ACL.
  • Ding et al. (2021b) Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, and Zhaopeng Tu. 2021b. Rejuvenating low-frequency words: Making the most of parallel data in non-autoregressive translation. In ACL.
  • Ding et al. (2021c) Liang Ding, Di Wu, and Dacheng Tao. 2021c. Improving neural machine translation by bidirectional training. In EMNLP.
  • Ding et al. (2021d) Liang Ding, Di Wu, and Dacheng Tao. 2021d. The usyd-jd speech translation system for iwslt2021. In IWSLT.
  • Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In EMNLP.
  • French (1999) Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences.
  • Gu et al. (2018) Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In ICLR.
  • Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In ACL.
  • Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic chinese to english news translation. In arXiv.
  • Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In EMNLP.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.
  • Lin et al. (2021) Zehui Lin, Liwei Wu, Mingxuan Wang, and Lei Li. 2021. Learning language specific sub-network for multilingual machine translation. In ACL.
  • Liu et al. (2020a) Xuebo Liu, Houtim Lai, Derek F Wong, and Lidia S Chao. 2020a. Norm-based curriculum learning for neural machine translation. In ACL.
  • Liu et al. (2021a) Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Shuming Shi, and Zhaopeng Tu. 2021a. On the complementarity between pre-training and back-translation for neural machine translation. In Findings of EMNLP.
  • Liu et al. (2021b) Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, and Zhaopeng Tu. 2021b. Understanding and improving encoder layer fusion in sequence-to-sequence learning. In ICLR.
  • Liu et al. (2020b) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020b. Multilingual denoising pre-training for neural machine translation. TACL.
  • Marcus (1993) Gary F Marcus. 1993. Negative evidence in language acquisition. Cognition.
  • Narang et al. (2021) Sharan Narang, Hyung Won Chung, Yi Tay, Liam Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. 2021. Do transformer modifications transfer across implementations and applications? In EMNLP.
  • Ng (2021) Andrew Ng. 2021. A chat with andrew on mlops: From model-centric to data-centric ai.
  • Nguyen et al. (2020) Xuan-Phi Nguyen, Shafiq Joty, Wu Kui, and Ai Ti Aw. 2020. Data diversification: A simple strategy for neural machine translation. In NeurIPS.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
  • Platanios et al. (2019) Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabás Poczós, and Tom Mitchell. 2019. Competence-based curriculum learning for neural machine translation. In NAACL.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In WMT.
  • Sánchez-Cartagena et al. (2018) Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, and Gema Ramírez. 2018. Prompsit’s submission to WMT 2018 parallel corpus filtering shared task. In WMT.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. In ICML.
  • Tan et al. (2019) Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. In ICLR.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
  • Voita et al. (2021) Elena Voita, Rico Sennrich, and Ivan Titov. 2021. Language modeling, lexical translation, reordering: The training process of NMT through the lens of classical SMT. In EMNLP.
  • Wu et al. (2019) Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In ICLR.
  • Yang et al. (2020) Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, and Qi Ju. 2020. Csp: Code-switching pre-training for neural machine translation. In EMNLP.
  • Zhang et al. (2018) Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. 2018. An empirical exploration of curriculum learning for neural machine translation. In arXiv.
  • Zhou et al. (2021) Lei Zhou, Liang Ding, Kevin Duh, Shinji Watanabe, Ryohei Sasano, and Koichi Takeda. 2021. Self-guided curriculum learning for neural machine translation. In IWSLT.

Appendix A Appendix

A.1 Code for Noise Function in PyTorch

The noise functions include removing (word dropout), replacing with a mask token (word blanking), and nearby swapping (word shuffling):

import argparse
import fileinput
import random

import torch


def main():
    parser = argparse.ArgumentParser(
        description='Command-line script to add noise to data')
    parser.add_argument('-wd', help='Word dropout probability', default=0.1, type=float)
    parser.add_argument('-wb', help='Word blank probability', default=0.1, type=float)
    parser.add_argument('-sk', help='Shuffle words within a window of k positions', default=3, type=int)
    args = parser.parse_args()

    # Read sentences from stdin, apply the three noise functions, and print the result.
    for line in fileinput.input('-'):
        s = line.strip().split()
        if len(s) > 0:
            s = word_shuffle(s, args.sk)
            s = word_dropout(s, args.wd)
            s = word_blank(s, args.wb)
        print(' '.join(s))


def word_shuffle(s, sk):
    # Nearby swapping: perturb each position by a random offset in [0, sk)
    # and re-sort the words by their perturbed positions.
    noise = torch.rand(len(s)).mul_(sk)
    perm = torch.arange(len(s)).float().add_(noise).sort()[1]
    return [s[i] for i in perm]


def word_dropout(s, wd):
    # Removing: drop each word with probability wd, keeping at least one word.
    keep = torch.rand(len(s))
    res = [si for i, si in enumerate(s) if keep[i] > wd]
    if len(res) == 0:
        return [s[random.randint(0, len(s) - 1)]]
    return res


def word_blank(s, wb):
    # Replacing: substitute each word with the MASK token with probability wb.
    keep = torch.rand(len(s))
    return [si if keep[i] > wb else 'MASK' for i, si in enumerate(s)]


if __name__ == '__main__':
    main()