Building Multilingual Machine Translation Systems
That Serve Arbitrary X-Y Translations
Abstract
Multilingual Neural Machine Translation (MNMT) enables one system to translate sentences from multiple source languages to multiple target languages, greatly reducing deployment costs compared with conventional bilingual systems. The training benefit of MNMT, however, is often limited to many-to-one directions; models suffer from poor performance in one-to-many directions and in many-to-many translation under zero-shot setups. To address this issue, this paper discusses how to practically build MNMT systems that serve arbitrary X-Y translation directions while leveraging multilinguality with a two-stage training strategy of pretraining and finetuning. Experimenting with the WMT’21 multilingual translation task, we demonstrate that our systems outperform the conventional baselines of direct bilingual models and pivot translation models for most directions, yielding average gains of +6.0 and +4.1 BLEU, respectively, without any architecture change or extra data collection. Moreover, we also examine our proposed approach in an extremely large-scale data setting to accommodate practical deployment scenarios.
Akiko Eriguchi*, Shufang Xie*, Tao Qin‡, and Hany Hassan Awadalla†
†Microsoft   ‡Microsoft Research Asia
{akikoe,shufxi,taoqin,hanyh}@microsoft.com
*Equal contributions.
1 Introduction
Multilingual Neural Machine Translation (MNMT), which enables one system to serve multiple translation directions, has attracted much attention in the machine translation area Zoph and Knight (2016); Firat et al. (2016). Because the multilingual capability greatly reduces deployment costs for both training and inference, MNMT has been actively employed as the backbone of machine translation systems in recent years Johnson et al. (2017); Hassan et al. (2018).
Most MNMT systems are trained on multiple English-centric datasets in both directions (e.g., English→{French, Chinese} (En-X) and {French, Chinese}→English (X-En)). Recent work Gu et al. (2019); Zhang et al. (2020); Yang et al. (2021b) pointed out that such MNMT systems severely suffer from an off-target translation issue, especially when translating from a non-English language X to another non-English language Y. Meanwhile, Freitag and Firat (2020) extended the data resources with multi-way aligned data and reported that a complete many-to-many MNMT model can be fully supervised, achieving competitive translation performance for all X-Y directions. In our preliminary experiments, we observed that complete many-to-many training remains as challenging as one-to-many training Johnson et al. (2017); Wang et al. (2020), since it introduces more one-to-many translation tasks into the training. As similarly reported for many-to-many training under the zero-shot setup Gu et al. (2019); Yang et al. (2021b), the complete MNMT model also struggles to capture correlations in the data for all X-Y directions within a single model, due to highly imbalanced data.


In this paper, we propose a two-stage training strategy for complete MNMT systems that serve arbitrary X-Y translations: 1) pretraining a complete multilingual many-to-many model and 2) finetuning the model to effectively transfer knowledge from pretraining to task-specific multilingual systems. Considering that MNMT is a multi-task learner of translation tasks over multiple languages, the complete multilingual model learns more diverse and general multilingual representations. We transfer these representations to a specific target task via many-to-one multilingual finetuning, and eventually build multiple many-to-one MNMT models that together cover all X-Y directions. The experimental results on the WMT’21 multilingual translation task show that our systems substantially improve over conventional bilingual approaches and many-to-one multilingual approaches for most directions. In addition, we discuss our proposal in light of feasible deployment scenarios and show that the proposed approach also works well in an extremely large-scale data setting.
2 Two-Stage Training for MNMT Models
To support all possible translations among N languages (including English), we first train a complete MNMT system on all available parallel data for the N × (N−1) directions. We assume that there exist data for the N−1 English-centric language pairs and for the remaining (N−1)(N−2)/2 non-English-centric language pairs, which lets the system learn multilingual representations across all languages. Usually, the volume of English-centric data is much greater than that of non-English-centric data. Then, we transfer the multilingual representations to one target language Y by finetuning the system on the subset of training data for the many-to-Y directions (i.e., multilingual many-to-one finetuning). This step steers the decoder towards the specific target language rather than multiple languages. As a result, we obtain N multilingual many-to-one systems that together serve all X-Y translation directions. We experiment with our proposed approach in the following two settings: 1) the WMT’21 large-scale multilingual translation data with 972M sentence pairs and 2) our in-house production-scale dataset with 4.1B sentence pairs.
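As a concrete illustration of which directions enter each stage, the following minimal sketch selects training directions for pretraining (all X-Y pairs) versus many-to-one finetuning (only directions into one target language). The corpus registry and helper names are illustrative, not our actual data tooling.

```python
from itertools import permutations

def build_stage_data(corpora, languages, target=None):
    """Select training directions for the two-stage recipe.

    Stage 1 (pretraining): target=None -> all available X->Y directions (complete many-to-many).
    Stage 2 (finetuning):  target=Y    -> only the many-to-Y subset of the same data.
    """
    selected = {}
    for src, tgt in permutations(languages, 2):
        if target is not None and tgt != target:
            continue  # keep only X -> target directions for finetuning
        key = f"{src}-{tgt}"
        if key in corpora:
            selected[key] = corpora[key]
    return selected

# Toy corpus registry keyed by "src-tgt" (illustrative sentence pairs only).
corpora = {
    "en-hr": [("hello", "bok")],
    "hr-en": [("bok", "hello")],
    "hu-hr": [("szia", "bok")],
}
languages = ["en", "hr", "hu", "et", "sr", "mk"]

pretrain_data = build_stage_data(corpora, languages)             # stage 1: many-to-many
finetune_hr_data = build_stage_data(corpora, languages, "hr")    # stage 2: many-to-hr
```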
Table 1: Average sacreBLEU scores for the many-to-Y directions on the WMT’21 tasks (columns are target languages Y), together with the Y-centric training data size.

| Model | en | hu | hr | sr | et | id | ms | tl | mk | jv | ta | avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bilingual | 30.4 | 19.5 | 23.9 | 16.7 | 19.4 | 20.9 | 17.6 | 13.0 | 24.1 | 1.2 | 1.8 | 17.1 |
| Pivot-based | 30.4 | 19.7 | 23.6 | 17.6 | 19.1 | 19.5 | 21.8 | 14.1 | 23.6 | 3.5 | 5.4 | 18.0 |
| Many-to-one | 32.9 | 20.8 | 24.0 | 17.4 | 19.8 | 25.3 | 20.3 | 16.6 | 25.0 | 1.5 | 5.1 | 19.0 |
| Ours: Pretrained | 32.0 | 18.9 | 23.4 | 15.6 | 19.0 | 28.6 | 26.5 | 21.1 | 26.2 | 10.8 | 8.1 | 20.9 |
| + finetuning | 33.2 | 19.8 | 23.8 | 19.5 | 19.8 | 30.0 | 27.3 | 21.9 | 26.8 | 11.2 | 8.7 | 22.0 |
| Data size (M) | 321 | 172 | 141 | 123 | 85 | 63 | 20 | 19 | 18 | 5 | 4 | – |
3 WMT’21 Multilingual Translation Task
We experiment with two small tasks of the WMT’21 large-scale multilingual translation task. The tasks provide multilingual multi-way parallel corpora based on the Flores 101 data Wenzek et al. (2021). Parallel sentences are provided among English (en) and five Central and East European languages {Croatian (hr), Hungarian (hu), Estonian (et), Serbian (sr), Macedonian (mk)} for task 1, and among English and five Southeast Asian languages {Javanese (jv), Indonesian (id), Malay (ms), Tagalog (tl), Tamil (ta)} for task 2. We removed sentence pairs in which either side is an empty line, and eventually collected (English-centric, non-English-centric) = (321M, 651M) sentence pairs in total. The data size per direction varies in a range of 0.07M-83.9M. To balance the data distribution across languages Kudugunta et al. (2019), we up-sample the low-resource languages with temperature=5. We append a language ID token at the end of each source sentence to specify the target language Johnson et al. (2017). We tokenize the data with SentencePiece Kudo and Richardson (2018) and build a shared vocabulary with 64k tokens.
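For illustration, the snippet below computes temperature-based sampling probabilities, p_l ∝ (n_l / Σ_k n_k)^(1/T) with T=5, and appends a target-language ID token to a source sentence. The `<2xx>` tag format is an assumed convention for this sketch, not necessarily the exact token used in our systems.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Up-sample low-resource languages: p_l proportional to (n_l / total)**(1/T)."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

def tag_source(source_sentence, target_lang):
    """Append a target-language ID token to the source side (tag format is illustrative)."""
    return f"{source_sentence} <2{target_lang}>"

# Toy per-language data sizes in millions of sentence pairs.
sizes = {"en": 321, "hu": 172, "jv": 5, "ta": 4}
print(temperature_sampling_probs(sizes))   # low-resource jv/ta receive larger relative weight
print(tag_source("Dobro jutro", "en"))     # "Dobro jutro <2en>"
```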
We train Transformer models Vaswani et al. (2017) consisting of an E-layer encoder and a D-layer decoder with (hidden dim., ffn dim.) = (768, 3072) in a complete multilingual many-to-many fashion. We use two settings of (E, D) = (12, 6) for “12E6D” and (24, 12) for “24E12D” to learn from the diverse multilingual dataset. The model parameters are optimized with RAdam Liu et al. (2020) using an initial learning rate of 0.025 and warm-up steps of 10k and 30k for the 12E6D and 24E12D model training, respectively. The systems are pretrained on 64 V100 GPUs with a mini-batch size of 3072 tokens and gradient accumulation of 16. After the pretraining, the models are finetuned on the subset of X-Y training data for the target language Y. We finetune the model parameters gently on 8 V100 GPUs with the same mini-batch size, gradient accumulation, and optimizer, but with a different learning rate schedule of (init_lr, warm-up steps) = ({1e-4, 1e-5, 1e-6}, 8k). The best checkpoints are selected based on the development loss. Translations are obtained by beam search decoding with a beam size of 4, unless otherwise stated.
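The following PyTorch sketch only mirrors the stated dimensions of the 12E6D setting and the RAdam optimizer; it is a simplified stand-in for our actual training code and omits the data pipeline, multi-GPU setup, gradient accumulation, and the exact learning rate schedule (the linear warm-up shown is an assumption).

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 64_000  # shared SentencePiece vocabulary size

class MNMTModel(nn.Module):
    """Generic encoder-decoder with the 12E6D dimensions (12 encoder / 6 decoder layers,
    hidden dim. 768, ffn dim. 3072); pass (24, 12) for the 24E12D setting."""
    def __init__(self, enc_layers=12, dec_layers=6, d_model=768, ffn_dim=3072):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=12,
            num_encoder_layers=enc_layers, num_decoder_layers=dec_layers,
            dim_feedforward=ffn_dim, batch_first=True)
        self.out = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, src_ids, tgt_ids):
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids))
        return self.out(hidden)  # token logits over the shared vocabulary

model = MNMTModel()  # 12E6D
# RAdam with the stated initial learning rate (requires PyTorch >= 1.10).
optimizer = torch.optim.RAdam(model.parameters(), lr=0.025)
# Simple linear warm-up over 10k steps; a simplification of the schedule actually used.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 10_000))
```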
Baselines
For system comparison, we build three different baselines: 1) direct bilingual systems, 2) pivot translation systems via English (only applicable for non-English X-Y evaluation) Utiyama and Isahara (2007), and 3) many-to-one multilingual systems with the 12E6D architecture. The bilingual and pivot-based baselines employ the Transformer base architecture. The embedding dimension is set to 256 for jv, ms, ta, and tl, because of the training data scarcity. For the X-Y pivot translation, a source sentence in language X is translated to English with a beam size of 5 by the X-En model, then the best output is translated to the final target language Y by the En-Y model.
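A minimal sketch of the pivot-based baseline at inference time, assuming the two bilingual decoders are available as callables; the function and argument names are illustrative.

```python
def pivot_translate(source_sentence, x_to_en, en_to_y,
                    beam_size_first=5, beam_size_second=4):
    """Two-hop pivot translation: X -> English -> Y.

    `x_to_en` and `en_to_y` stand in for the bilingual decoders; each is assumed
    to take (sentence, beam_size) and return its single best hypothesis.
    """
    english = x_to_en(source_sentence, beam_size_first)   # first hop with beam size 5
    return en_to_y(english, beam_size_second)             # second hop into language Y

# Toy stand-in decoders for demonstration only.
fake_de_en = lambda sentence, beam: "good morning"
fake_en_fr = lambda sentence, beam: "bonjour"
print(pivot_translate("guten Morgen", fake_de_en, fake_en_fr))
```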
Results
All results on the test sets are displayed in Figure 1 and Table 1, where we report the case-sensitive sacreBLEU score Post (2018) for translation accuracy. Overall, our best systems (“24E12D+FT”) are significantly better in sacreBLEU for 83% and 88% of directions against the bilingual baselines and the pivot translation baselines, respectively. In Table 1, we present the average sacreBLEU scores for the many-to-Y directions, showing that our proposed approach achieves the best performance for most target languages. Compared with the many-to-one multilingual baselines, the proposed approach of utilizing the complete MNMT model transfers multilingual representations more effectively to the targeted translation directions, especially as the Y-centric data size gets smaller. We also note that the winning system of the shared task achieved (task 1, task 2) = (37.6, 33.9) BLEU with a 36-layer encoder and 12-layer decoder model Yang et al. (2021a) that is pretrained on extra language data, including parallel and monolingual data, while our best system with a 24-layer encoder and 12-layer decoder obtained (task 1, task 2) = (25.7, 22.8) sacreBLEU without using such extra data.
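For reference, a minimal example of how a case-sensitive sacreBLEU score can be computed with the sacrebleu Python package (the toy sentences are illustrative):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["Der schnelle braune Fuchs sprang über den Hund."]
references = [["Der schnelle braune Fuchs sprang über den faulen Hund."]]

# corpus_bleu takes a list of hypotheses and a list of reference streams;
# case-sensitive scoring is the default behaviour.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"{bleu.score:.1f}")
```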
Table 2: Average sacreBLEU scores for the many-to-one directions on our in-house X-Y test sets.

| Model | xx-de | xx-fr | xx-es | xx-it | xx-pl |
|---|---|---|---|---|---|
| Pivot-based baselines | 35.7 | 39.8 | 38.0 | 33.0 | 26.2 |
| Pretrained model on en-xx data | 14.2 | 33.1 | — | — | — |
| + Many-to-one multilingual finetuning | 36.6 | 41.3 | — | — | — |
| Pretrained model on {en, de, fr}-xx data | 37.5 | 42.1 | 20.0 | 16.6 | 11.1 |
| + Many-to-one multilingual finetuning | 38.3 | 42.6 | 39.3 | 34.6 | 28.3 |
4 In-house Extremely Large-Scale Setting
Deploying larger and larger models is not always feasible. We often have limited computational resources at inference time, which leads to a trade-off between performance and the decoding cost incurred by the model architecture. In this section, we validate our proposed approach in an extremely large-scale data setting and discuss how we can build lighter NMT models without performance loss by distilling the proposed MNMT systems Kim and Rush (2016). We briefly touch on the following three topics: 1) multi-way multilingual data collection, 2) English-centric vs. multi-centric pretraining for X-Y translations, and 3) a lighter NMT model that addresses the trade-off between performance and latency. We then report the experimental results in the extremely large-scale setting.
Multilingual Data Collection
We build an extremely large-scale dataset using our in-house English-centric data, covering 10 European languages with 24M-192M sentences per language. It contains available parallel data and back-translated data between English and {German (de), French (fr), Spanish (es), Italian (it), Polish (pl), Greek (el), Dutch (nl), Portuguese (pt), and Romanian (ro)}. From these English-centric corpora, we extract multi-way multilingual X-Y data by aligning En-X and En-Y data through their shared English side. Specifically, we extract {de, fr, es, it, pl}-centric data and concatenate them with the existing direct X-Y data, yielding 78M-279M sentence pairs per direction. As in Section 3, we build a shared SentencePiece vocabulary, with 128k tokens to accommodate the large-scale setting.
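The alignment step can be viewed as a join of the En-X and En-Y corpora on their shared English side. The sketch below illustrates this under the simplifying assumption that exact English matches indicate multi-way alignment; deduplication and noise filtering are omitted.

```python
from collections import defaultdict

def extract_xy_pairs(en_x_pairs, en_y_pairs):
    """Join En-X and En-Y corpora on the English side to produce X-Y pairs.

    en_x_pairs / en_y_pairs: iterables of (english_sentence, foreign_sentence).
    Returns a list of (x_sentence, y_sentence) pairs aligned through English.
    """
    en_to_x = defaultdict(list)
    for en, x in en_x_pairs:
        en_to_x[en].append(x)

    xy_pairs = []
    for en, y in en_y_pairs:
        for x in en_to_x.get(en, []):
            xy_pairs.append((x, y))
    return xy_pairs

# Toy example: aligning De and Fr corpora through their English sides.
en_de = [("good morning", "guten Morgen"), ("thank you", "danke")]
en_fr = [("good morning", "bonjour"), ("see you", "à bientôt")]
print(extract_xy_pairs(en_de, en_fr))  # [('guten Morgen', 'bonjour')]
```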
En-centric vs Multi-centric Pretraining
In a large-scale data setting, a natural question arises: which pretrained model provides multilingual representations general enough to achieve better X-Y translation quality? Considering that English text data is often dominant (e.g., 70% of the tasks in the WMT’21 news translation task are English-centric), a model supervised on English-centric corpora might already learn representations that transfer well to X-Y translations. To investigate the usefulness of multi-centric pretraining data, we pretrain the deeper Transformer models with 24-12 layers described in Section 3 on the English-centric data and on the {en, de, fr}-centric data, individually. After pretraining, we apply multilingual many-to-one finetuning with a subset of the training data and evaluate each system on the fully supervised X-Y directions, i.e., xx-{en,de,fr}, and on the partially supervised X-Y directions, i.e., xx-{es,it,pl}. We follow the same training and finetuning settings as described in Section 3, unless otherwise stated.
MNMT with Light Decoder
At the practical level, one drawback of large-scale models is latency at inference time. This is mostly caused by the high computational cost of the decoder layers, due to auto-regressive decoding and the extra cross-attention network in each decoder block. Recent studies Kasai et al. (2021); Hsu et al. (2020); Li et al. (2021) have experimentally shown that models with a deep encoder and a shallow decoder can address this issue without losing much performance. Fortunately, such an architecture also matches the demands of many-to-one MNMT training, which requires the encoder to be more expressive in order to handle various source languages. To examine the light NMT model architecture, we train the Transformer base architecture modified to 9-3 layers (E9D3) in a bilingual setting and compare it with a standard Transformer base model with 6-6 layers (E6D6) as a baseline. Additionally, we report direct X-Y translation performance when distilling the best large-scale MNMT models into the light NMT architecture as student models. More specifically, following Kim and Rush (2016), we train light NMT student models (E9D3) that serve many-to-Y translations (Y ∈ {de, fr, es, it, pl}).
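A schematic sketch of the sequence-level distillation recipe of Kim and Rush (2016) as applied here: the teacher (our best MNMT model) re-decodes the training sources, and the light E9D3 student is trained on the resulting (source, teacher output) pairs. `teacher_decode` and `train_student` are placeholders for the actual decoding and training routines.

```python
def build_distilled_corpus(teacher_decode, source_sentences, target_lang):
    """Sequence-level knowledge distillation data (Kim and Rush, 2016).

    The teacher re-labels every training source with its own beam-search output;
    the student learns from these (source, teacher output) pairs.
    """
    return [(src, teacher_decode(src, target_lang)) for src in source_sentences]

def distill_light_student(teacher_decode, train_student, corpus_by_lang, target_lang):
    """Train one E9D3 student that serves many-to-Y translation for target language Y."""
    distilled = []
    for src_lang, sources in corpus_by_lang.items():
        if src_lang == target_lang:
            continue  # skip Y -> Y; only X -> Y directions are distilled
        distilled.extend(build_distilled_corpus(teacher_decode, sources, target_lang))
    return train_student(distilled)  # placeholder for fitting the 9-encoder / 3-decoder model
```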
Results
Table 2 reports average sacreBLEU scores for the many-to-one directions on our in-house X-Y test sets. For the xx-{de,fr} directions, the proposed finetuning helps both the English-centric and the multi-centric pretrained models improve accuracy. Overall, the finetuned multi-centric models achieve the best results, outperforming the English pivot-based baselines by +2.6 and +2.8 points. Comparing the multilingual systems, the multi-centric model without finetuning already surpasses the finetuned English-centric systems by a large margin of +0.9 and +0.8 points for the xx-{de,fr} directions. This suggests that, by pretraining a model on more multi-centric data, the model learns better multilinguality to transfer. For the xx-{es,it,pl} directions, most of which are zero-shot directions such as Greek-to-Spanish, the finetuned multi-centric systems show similar accuracy gains, outperforming the conventional pivot-based baselines on average.
Figure 2 shows the effectiveness of our light NMT model architecture for five bilingual En-X directions, reporting the translation performance in sacreBLEU and the latency measured on CPUs. Our light NMT model (E9D3) achieves an almost 2x speed-up without much performance drop for all directions. Employing this light model architecture as the student model, we report the distilled many-to-one model performance in Table 3, measured by sacreBLEU and COMET scores Rei et al. (2020). For a consistent comparison, we also built English bilingual baselines (E6D6) distilled from the bilingual teachers and obtained the English pivot-based translation performance from them. For all the many-to-Y directions (Y ∈ {de, fr, es, it, pl}), the light NMT models distilled from the best MNMT models show the best performance in both metrics. Besides that, our direct X-Y light NMT systems cut the decoding cost by 75% compared with pivot translation: the light NMT model halves the latency against the baseline system, as shown in Figure 2, and needs to be run only once, whereas the pivot-based baselines via English must translate twice for X-Y directions (e.g., German-to-English and then English-to-French for a German-French direction).
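The 75% figure follows from simple arithmetic under the measurements above: if one pass of the baseline E6D6 decoder costs roughly c, pivot translation runs two such passes (≈ 2c), while the direct light system runs a single pass at roughly half the cost (≈ 0.5c):

```latex
\frac{\text{direct cost}}{\text{pivot cost}} \approx \frac{0.5c}{2c} = 0.25,
\qquad \text{saving} \approx 1 - 0.25 = 75\%.
```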


Table 3: sacreBLEU and COMET scores of the distilled light many-to-one NMT models (E9D3) against the English pivot-based baselines.

| Direction | Model | BLEU | COMET |
|---|---|---|---|
| xx-de | Pivot-based baselines | 38.1 | 60.8 |
| xx-de | Ours | 39.5 | 67.2 |
| xx-fr | Pivot-based baselines | 41.5 | 69.3 |
| xx-fr | Ours | 42.9 | 73.2 |
| xx-es | Pivot-based baselines | 37.4 | 73.9 |
| xx-es | Ours | 38.0 | 74.4 |
| xx-it | Pivot-based baselines | 32.6 | 77.2 |
| xx-it | Ours | 33.7 | 80.9 |
| xx-pl | Pivot-based baselines | 26.7 | 77.8 |
| xx-pl | Ours | 28.0 | 88.5 |
5 Conclusion
This paper proposes a simple but effective two-stage training strategy for MNMT systems that serve arbitrary X-Y translations. To support translations across languages, we first pretrain a complete multilingual many-to-many model and then transfer its representations by finetuning the model in a many-to-one multilingual fashion. On the WMT’21 translation task, we experimentally showed that the proposed approach substantially improves translation accuracy for most X-Y directions against the strong conventional baselines of bilingual systems, pivot translation systems, and many-to-one multilingual systems. We also examined the proposed approach in an extremely large-scale setting, addressing practical questions such as multi-way parallel data collection, the usefulness of multilinguality during pretraining and finetuning, and how to reduce the decoding cost while achieving better X-Y quality.
References
- Firat et al. (2016) Orhan Firat, Baskaran Sankaran, Yaser Al-onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016. Zero-resource translation with multi-lingual neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas. Association for Computational Linguistics.
- Freitag and Firat (2020) Markus Freitag and Orhan Firat. 2020. Complete multilingual neural machine translation. In Proceedings of the Fifth Conference on Machine Translation, pages 550–560, Online. Association for Computational Linguistics.
- Gu et al. (2019) Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. 2019. Improved zero-shot neural machine translation via ignoring spurious correlations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1258–1268, Florence, Italy. Association for Computational Linguistics.
- Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv:1803.05567.
- Hsu et al. (2020) Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, and Ilya Chatsviorkin. 2020. Efficient inference for neural machine translation. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 48–53, Online. Association for Computational Linguistics.
- Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
- Kasai et al. (2021) Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah Smith. 2021. Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation. In International Conference on Learning Representations.
- Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
- Kudugunta et al. (2019) Sneha Kudugunta, Ankur Bapna, Isaac Caswell, and Orhan Firat. 2019. Investigating multilingual NMT representations at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1565–1575, Hong Kong, China. Association for Computational Linguistics.
- Li et al. (2021) Yanyang Li, Ye Lin, Tong Xiao, and Jingbo Zhu. 2021. An efficient transformer decoder with compressed sub-layers. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15):13315–13323.
- Liu et al. (2020) Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
- Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
- Utiyama and Isahara (2007) Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 484–491, Rochester, New York. Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Wang et al. (2020) Zirui Wang, Zachary C. Lipton, and Yulia Tsvetkov. 2020. On negative interference in multilingual models: Findings and a meta-learning treatment. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4438–4450, Online. Association for Computational Linguistics.
- Wenzek et al. (2021) Guillaume Wenzek, Vishrav Chaudhary, Angela Fan, Sahir Gomez, Naman Goyal, Somya Jain, Douwe Kiela, Tristan Thrush, and Francisco Guzmán. 2021. Findings of the WMT 2021 shared task on large-scale multilingual machine translation. In Proceedings of the Sixth Conference on Machine Translation, pages 89–99.
- Yang et al. (2021a) Jian Yang, Shuming Ma, Haoyang Huang, Dongdong Zhang, Li Dong, Shaohan Huang, Alexandre Muzio, Saksham Singhal, Hany Hassan, Xia Song, and Furu Wei. 2021a. Multilingual machine translation systems from Microsoft for WMT21 shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 446–455, Online. Association for Computational Linguistics.
- Yang et al. (2021b) Yilin Yang, Akiko Eriguchi, Alexandre Muzio, Prasad Tadepalli, Stefan Lee, and Hany Hassan. 2021b. Improving multilingual translation by representation and gradient regularization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7266–7279, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Zhang et al. (2020) Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.
- Zoph and Knight (2016) Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, California. Association for Computational Linguistics.