VBD-MT Chinese-Vietnamese Translation Systems for VLSP 2022
Abstract
We present the systems with which we participated in the VLSP 2022 machine translation shared task. This year we participated in both translation directions, i.e., Chinese-Vietnamese and Vietnamese-Chinese. We build our systems on the neural Transformer model, fine-tuned from the multilingual denoising pre-trained model mBART. The systems are enhanced by a sampling method for backtranslation, which leverages the large-scale monolingual data available. Additionally, several other methods are applied to improve translation quality, including ensembling and post-processing. We achieve 38.9 BLEU on Chinese-Vietnamese and 38.0 BLEU on Vietnamese-Chinese on the public test sets, which outperforms several strong baselines.
1 Introduction
For the VLSP machine translation shared task this year (VLSP 2022; https://vlsp.org.vn/vlsp2022/eval/mt), there are two translation tasks, i.e., Chinese-Vietnamese and Vietnamese-Chinese translation. Translation between these two languages is challenging due to limited bilingual training data, although the task has been investigated in several previous works Van Nguyen et al. (2022); Tran et al. (2016); Zhao et al. (2013). There are few powerful translation systems and few large-scale bilingual corpora available for this language pair, which leaves a gap in machine translation research.
In this work, we aim to investigate and build a strong machine translation (MT) system between Chinese and Vietnamese. To this end, we first review and compare several strong baseline systems on this language pair. We then leverage monolingual data to augment the bilingual data, since the provided bilingual training data is very limited (300K bilingual sentences). We also examine the translation output more closely and apply post-processing to quickly tackle several typical errors, such as mistranslated date-time or numeric values. We report experimental results on the validation and public test data, which show the performance of our baseline systems as well as the gains from our approaches. Several analyses are also presented, including test samples.
2 Data
2.1 Bilingual Data
For bilingual data, the shared task provides 300K parallel sentences. This can be seen as a low-resource machine translation task in comparison with large-scale settings that contain millions of parallel sentences, such as Czech-English, English-Chinese, and French-German Barrault et al. (2020); Akhbardeh et al. (2021). The limited data makes the task challenging, but it also makes it interesting to investigate appropriate approaches and models.
2.2 Monolingual Data
In addition to the bilingual data, the shared task also provided large scale monolingual datasets including 19M (millions) sentences for Chinese and 25M sentences for Vietnamese, which can be utilized as an additional resource to enhance translation models.
The monolingual datasets are used for backtranslation (Section 3.3). By analyzing the training and validation datasets, we found that more than 90% of the sentences are 10 to 60 words long. Therefore, we first apply a simple length filter that keeps only monolingual sentences of 10 to 60 words, which yields 6M sentences. Due to time limitations, we do not use all of the filtered data: we randomly selected 1.5M sentences from the 6M filtered sentences for backtranslation.
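The filtering and subsampling steps can be sketched as follows. This is a minimal illustration rather than the exact script used for the systems: the function names, the whitespace tokenizer, and the fixed seed are assumptions, and a character-level tokenizer (for example `list`) could be passed for Chinese.

```python
import random


def filter_by_length(sentences, min_len=10, max_len=60, tokenize=str.split):
    """Keep sentences whose token count lies in [min_len, max_len].

    The whitespace tokenizer is an assumption; pass another callable
    (e.g. `list` for character-level counting) if needed.
    """
    return [s for s in sentences if min_len <= len(tokenize(s)) <= max_len]


def sample_subset(sentences, k, seed=0):
    """Randomly select k sentences without replacement.

    Returns a copy of the whole list when it has k or fewer sentences.
    """
    rng = random.Random(seed)
    if k >= len(sentences):
        return list(sentences)
    return rng.sample(sentences, k)
```

In the setting described above, `filter_by_length` would be applied to the monolingual corpus and `sample_subset` called with `k=1_500_000` on the filtered result.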
Data | #Sents | #Vocab | Avg.Len |
---|---|---|---|
BiData | | | |
train (zh) | 300K | 8.3K | 23 |
train (vi) | 300K | 90.7K | 19 |
valid (zh) | 1K | 1.6K | 29 |
valid (vi) | 1K | 3.6K | 21 |
test (zh) | 1K | 1.6K | 42 |
test (vi) | 1K | 3.7K | 36 |
MonoData (zh) | 19.0M | 201K | 80 |
MonoData (vi) | 25.3M | 3,995K | 37 |
2.3 Evaluation Data
For evaluation, the validation set and the public test sets each contain 1K sentences. We used the officially provided validation set for tuning hyper-parameters and saving model checkpoints, and the systems are compared on the public test sets.
We present the statistics of these datasets in Table 1. (The vocabulary sizes are estimates: each Chinese character is counted as one token, while Vietnamese tokens are split on white space.)
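The vocabulary estimate described above can be sketched in a few lines. The function names are hypothetical, and computing the average length with the same tokenization is an assumption, since the text does not say how Avg.Len was obtained.

```python
def tokenize(sentence, lang):
    """Characters for Chinese (whitespace dropped), whitespace tokens otherwise."""
    if lang == "zh":
        return [ch for ch in sentence if not ch.isspace()]
    return sentence.split()


def corpus_stats(sentences, lang):
    """Return sentence count, estimated vocabulary size, and average length."""
    vocab = set()
    total_tokens = 0
    for sentence in sentences:
        tokens = tokenize(sentence, lang)
        vocab.update(tokens)
        total_tokens += len(tokens)
    avg_len = total_tokens / len(sentences) if sentences else 0.0
    return {"sents": len(sentences), "vocab": len(vocab), "avg_len": avg_len}
```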
3 System Overview
3.1 Baseline
We build our baseline systems on the well-known Transformer model Vaswani et al. (2017). Our systems use the Fairseq Ott et al. (2019) PyTorch implementation, which has achieved state-of-the-art (SOTA) performance in competitions Tran et al. (2021); Akhbardeh et al. (2021). This choice is confirmed by our preliminary experiments, in which we compared several systems to select a strong baseline (we present the detailed results in Section 4.1). The model is optimized using the Adam optimizer Kingma and Ba (2014). We used a learning rate of 3e-05 and a dropout of 0.3, trained with early stopping, and saved models after every 5,000 updates. We trained our models on one NVIDIA GeForce GPU with 11GB of memory. We present the detailed hyper-parameters in Appendix A.
The model is fine-tuned from mBART Liu et al. (2020), which is pre-trained on 25 languages (mBART-25), including Chinese and Vietnamese. (We also tried mBART-50 Tang et al. (2020), which is trained on 50 languages, but our preliminary experiments showed that mBART-25 performs better.)
We tokenized texts with the well-known SentencePiece Sennrich et al. (2016b). As the evaluation metric, we used the widely adopted SacreBLEU Post (2018).
Data | #Sents | #Vocab | Avg.Len |
---|---|---|---|
Mono (zh) | 1.5M | 10K | 54 |
Mono (vi) | 1.5M | 357K | 29 |
Synthetic (zh-vi) | | | |
train (zh) | 211K | 17K | 27 |
train (vi) | 211K | 22K | 29 |
Synthetic (vi-zh) | | | |
train (vi) | 403K | 21K | 45 |
train (zh) | 403K | 35K | 41 |
3.2 Pre-trained model pruning
We used mBART-25 Liu et al. (2020) to fine-tune the baseline systems. Its vocabulary contains 250K entries, which caused out-of-memory errors when we trained on our 11GB NVIDIA GeForce GPU. We therefore pruned the vocabulary to the entries that occur in the provided bilingual datasets and the 6M filtered monolingual sentences, and kept only the embeddings corresponding to this filtered vocabulary. As a result, the vocabulary shrinks to 67K entries, which is approximately four times smaller than the original, and we can train our systems without a more powerful server. This technique is a minor step, but it may be useful for training systems with limited computational resources.
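The idea behind this pruning can be illustrated with a small sketch on plain Python lists. It is not the code used for the systems: the function names are hypothetical, the embedding matrix is modeled as a list of rows, and the `always_keep` argument (for special tokens and language tags) is an assumption.

```python
def collect_used_ids(tokenized_corpora, always_keep=()):
    """Collect every token id that occurs in the given corpora.

    `tokenized_corpora` is an iterable of corpora, each a list of
    sentences, each sentence a list of integer token ids.
    `always_keep` holds ids that must survive pruning (an assumption:
    e.g. special tokens and language tags).
    """
    used = set(always_keep)
    for corpus in tokenized_corpora:
        for sentence_ids in corpus:
            used.update(sentence_ids)
    return sorted(used)


def prune_embeddings(embeddings, kept_ids):
    """Keep only the embedding rows of `kept_ids`.

    Returns the smaller matrix and the mapping from old ids to new ids,
    which must also be applied to the tokenized training data.
    """
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    pruned = [embeddings[old] for old in kept_ids]
    return pruned, old_to_new
```

In a real model the same row selection would be applied to the shared input and output embedding matrices, and the dictionary file would be rewritten in the new id order.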
3.3 Backtranslation
Leveraging large-scale monolingual data via backtranslation is a common strategy for improving MT systems and has been used in previous work Sennrich et al. (2016a); Tran et al. (2021); Akhbardeh et al. (2021). In this work, we generate synthetic data by backtranslation with the top-k sampling method Edunov et al. (2018), which samples each output token from the k highest-scoring candidates at every time step. The method has been shown to provide a richer training signal than maximum a posteriori (MAP) beam search, which reduces the diversity and richness of the generated source translations.
We used the baseline systems trained on the provided bilingual data to translate the selected 1.5M monolingual sentences. As a result, we obtained 211K (zh-vi) and 403K (vi-zh) sentence pairs as back-translated datasets. Table 2 describes the back-translated data.
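A single decoding step of top-k sampling can be sketched as below. This is a generic illustration of the technique on a list of logits, not the Fairseq implementation; the function name and the explicit random generator argument are assumptions.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring candidates.

    The k best logits are renormalized with a softmax and one index is
    drawn from that restricted distribution.
    """
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top_ids)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]
```

With `k=1` the step reduces to greedy decoding, whereas larger values of k let lower-ranked tokens appear in the synthetic source sentences.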
3.4 Ensembling
Ensembling is also a strategy for improving machine learning models Shahhosseini et al. (2022), including MT systems Tran et al. (2021). In our systems, we save checkpoints at different epochs and compute the ensembled model by averaging the weights of the last N checkpoints. We tried different values of N and selected the one that performed best for our systems.
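Weight averaging over the most recent checkpoints can be sketched as follows. Checkpoints are modeled here as dictionaries that map parameter names to flat lists of floats, which is an assumption made to keep the example self-contained; the function names are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is a dict: parameter name -> list of floats.
    All checkpoints are assumed to share the same names and shapes.
    """
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


def average_last_checkpoints(checkpoints, n):
    """Average the last n checkpoints of a chronologically ordered list."""
    if not 1 <= n <= len(checkpoints):
        raise ValueError("n must be between 1 and the number of checkpoints")
    return average_checkpoints(checkpoints[-n:])
```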
3.5 Post-Processing
Translating numbers looks straightforward but has been shown to be challenging Wang et al. (2021). When we analyzed the translation output, we found that specific data types such as date-time and numeric values can be mistranslated, which is a serious problem when the wrong values concern, for example, people or currency. The problem can be addressed by combining translation with a named entity task Mota et al. (2022), but we leave such a method for future work. In the scope of this work, we simply create a set of patterns for post-processing that edit the translations of date-time and numeric values. For instance, several patterns are illustrated below.
- XY亿 → XY00 million (亿 = 10^8)
- XY万 → XY0 thousand (万 = 10^4)
The patterns are simple yet effective. We discuss the results with corrected samples in Section 4.3.
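As an illustration of such a numeric pattern, the sketch below extracts amounts written with the units 万 (10^4) and 亿 (10^8) from a Chinese source sentence and expands them to integers, which can then be compared with the number in the translation. The regular expression and the function name are assumptions; the actual pattern set used in the systems is not reproduced here.

```python
import re
from decimal import Decimal

# Values of the two Chinese magnitude units handled by this sketch.
UNIT_VALUES = {"万": 10**4, "亿": 10**8}
AMOUNT = re.compile(r"([0-9]+(?:\.[0-9]+)?)([万亿])")


def extract_amounts(text):
    """Return (matched text, integer value) for each amount such as 400亿."""
    results = []
    for match in AMOUNT.finditer(text):
        value = Decimal(match.group(1)) * UNIT_VALUES[match.group(2)]
        results.append((match.group(0), int(value)))
    return results
```

For the first sample in Table 6, `extract_amounts` maps 400亿 to 40,000,000,000, i.e., 40 billion, which disagrees with the 4 billion (4 tỷ) in the raw translation and therefore flags it for correction.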
Model | Valid | Test |
---|---|---|
Fairseq Ott et al. (2019) | 34.8 | 38.0 |
Transformer | | |
+ RAML Norouzi et al. (2016) | 35.3 | 37.0 |
+ Point.Gen Enarvi et al. (2020) | 34.4 | N/A |
+ Bi-Simcut Gao et al. (2022) | 33.1 | N/A |
MSP Tan et al. (2022) | 29.4 | N/A |
4 Experiments and Results
4.1 Baseline selection
To investigate and build a strong baseline for this task, we first review and compare several existing and recently proposed models. In particular, we compared the following systems.
- Fairseq Ott et al. (2019): the Transformer implementation of the Fairseq toolkit (Section 3.1).
- RAML Norouzi et al. (2016): reward augmented maximum likelihood, which tries to optimize the task reward (loss) used for test evaluation and has been shown to benefit neural sequence-to-sequence models, including machine translation.
- Point.Gen Enarvi et al. (2020): uses a pointer-generator network to facilitate copying parts of a source sentence (such as person names) to the target sentence.
- Bi-Simcut Gao et al. (2022): a training strategy that uses a regularization method to force consistency between the output distributions of the original and the cutoff sentence pairs in order to boost translation performance.
- Multi-Stage Prompting (MSP) Tan et al. (2022): a method that uses different continuous prompts to better adapt pre-trained models to translation tasks.
Note that for RAML, Bi-Simcut, and the pointer-generator network, we used the Huggingface implementation (https://github.com/huggingface/transformers).
The results of the comparison are presented in Table 3. In these preliminary experiments, the Fairseq system obtained the best performance on the public test set. Therefore, we selected the Fairseq implementation for our baseline system.
4.2 Our systems’ results
Model | Valid | Test |
---|---|---|
Baseline | 34.8 | 38.0 |
Backtranslation (B) | 34.8 | 38.8 |
(B)+Ensembling (E) | 34.8 | 38.9 |
(B)+(E)+Post-Processing | 34.8 | 38.9 |
Chinese-Vietnamese translation
The results for Chinese-Vietnamese are presented in Table 4. The baseline model obtains 38.0 BLEU, which is relatively high compared with recent work on this language pair Van Nguyen et al. (2022). We achieve 38.8 BLEU when leveraging backtranslation, an improvement of +0.8 BLEU. Ensembling adds a further +0.1 BLEU, while post-processing leaves the BLEU score unchanged.
Vietnamese-Chinese translation
For the Vietnamese-Chinese task, we present the results in Table 5. We also obtain relatively high performance with the baseline (37.8 BLEU), which is roughly comparable to the Chinese-Vietnamese task. Backtranslation improves the score by +0.2 BLEU, while ensembling and post-processing leave it unchanged.
For our submissions to the shared task, we used the best-performing systems, which combine the baseline, backtranslation, ensembling, and post-processing. Although the post-processing step does not improve BLEU scores, we found when analyzing translated samples that it yields more correct translation output. We discuss the post-processing output in the next section.
4.3 Post-processing
We conducted a post-processing step to correct the translation output, focusing on numeric and date-time values. We manually checked the output on the test set and found that mistranslated values can be corrected. For instance, the number 400亿 (40 billion) and the date 2021年12月1日 (December 1st, 2021), as presented in Table 6, are correctly revised by the post-processing step.
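A date correction of this kind can be sketched as follows. The sketch assumes that dates in the Vietnamese output are written as day/month/year, as in the sample of Table 6, and it only edits a translation when it contains as many dates as the source; the regular expressions and function names are hypothetical.

```python
import re

ZH_DATE = re.compile(r"([0-9]{4})年([0-9]{1,2})月([0-9]{1,2})日")
VI_DATE = re.compile(r"\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}\b")


def zh_dates_as_vi(source_zh):
    """Render each Chinese date in the source as day/month/year."""
    return [
        f"{int(day)}/{int(month)}/{year}"
        for year, month, day in ZH_DATE.findall(source_zh)
    ]


def fix_dates(source_zh, translation_vi):
    """Overwrite dates in the translation with the dates found in the source.

    The translation is returned unchanged when the number of dates differs,
    because the alignment between dates would then be ambiguous.
    """
    expected = zh_dates_as_vi(source_zh)
    if len(expected) != len(VI_DATE.findall(translation_vi)):
        return translation_vi
    replacements = iter(expected)
    return VI_DATE.sub(lambda match: next(replacements), translation_vi)
```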
4.4 Human evaluation
Besides the systems’ BLEU performance, we would like to analyze translation quality from an expert’s viewpoint. Therefore, we conducted a human evaluation of the actual translation output. Note that this is not the human evaluation conducted by the shared task organizers; rather, we invited a native speaker who is fluent in both Chinese and Vietnamese to check and analyze a small set of translated sentences.
Model | Valid | Test |
---|---|---|
Baseline | 32.2 | 37.8 |
Backtranslation (B) | 32.9 | 38.0 |
(B)+Ensembling (E) | 32.9 | 38.0 |
(B)+(E)+Post-Processing | 32.9 | 38.0 |
In particular, for each translation task, we randomly selected a small set (20 translated samples) for the human expert to evaluate and analyze. For each sample, there are two outputs, one from the baseline and one from the submitted system, and we shuffled the outputs so that the expert does not know which system produced which output. We asked the expert to assign a score to each output and to comment on it. From this evaluation, we make the following observations.
- The submitted systems produce better translations than the baseline systems (they received higher scores) in most cases, which confirms the improvement shown in BLEU scores.
- The meaning of the current translated output is acceptable to some extent, judging by the scores assigned to most samples.
- Common issues remain in the translations, such as incorrect translations of person names, translating only part of the content or missing the main content, unclear translations, and inappropriate word choices.
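The blind setup of this evaluation can be sketched as below: samples are drawn at random, the two outputs of each sample are shuffled, and the system identities are kept in a separate answer key that the evaluator does not see. The function name, the seed, and the data layout are assumptions made for illustration.

```python
import random


def make_blind_pairs(baseline_outputs, submitted_outputs, n_samples, seed=0):
    """Build a blind evaluation sheet and the matching answer key.

    Returns two lists aligned by position:
      sheet: (sample index, output A, output B) shown to the evaluator
      key:   (sample index, system of A, system of B) kept hidden
    """
    rng = random.Random(seed)
    indices = rng.sample(range(len(baseline_outputs)), n_samples)
    sheet, key = [], []
    for idx in indices:
        pair = [
            ("baseline", baseline_outputs[idx]),
            ("submitted", submitted_outputs[idx]),
        ]
        rng.shuffle(pair)  # hide which system produced which output
        sheet.append((idx, pair[0][1], pair[1][1]))
        key.append((idx, pair[0][0], pair[1][0]))
    return sheet, key
```

After scoring, the key is joined back to the sheet by sample index to compare the scores of the two systems.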
Data | Sample |
---|---|
Input (zh) | 原因在于澳大利亚决定取消总额400亿美元的向法国采购核潜艇合同,转而与美国和英国开展联合项目。 |
Translated (vi) | Nguyên nhân là do Australia quyết định hủy hợp đồng mua tàu ngầm hạt nhân trị giá 4 tỷ USD cho Pháp, chuyển sang triển khai dự án chung với Mỹ và Anh. |
Post-processed (vi) | Nguyên nhân là do Australia quyết định hủy hợp đồng mua tàu ngầm hạt nhân trị giá 40 tỷ USD cho Pháp, chuyển sang triển khai dự án chung với Mỹ và Anh. |
Meaning (English) | The reason is that Australia decided to cancel the $ 40 billion contract to buy nuclear submarines for France, moving to implement a joint project with the US and UK. |
Input (zh) | 目前,美国和欧盟都期待在2021年12月1日前达成解决钢铁和铝贸易争端的协议。 |
Translated (vi) | Hiện cả Mỹ và EU đều trông đợi một thỏa thuận giải quyết tranh chấp thương mại thép và nhôm trước ngày 1/1/2021. |
Post-processed (vi) | Hiện cả Mỹ và EU đều trông đợi một thỏa thuận giải quyết tranh chấp thương mại thép và nhôm trước ngày 1/12/2021. |
Meaning (English) | Currently, both the US and EU expect an agreement to settle the steel and aluminum trade dispute before December 1, 2021. |
4.5 Limitations
Though we achieve promising results with the baseline systems and improvements from our approaches, several limitations remain. First, the experiments comparing systems to choose a strong baseline are not yet fully complete. Second, we only leveraged a subset of the monolingual data (about 10% or less of the provided data), so the large-scale data available has not been fully utilized. Third, the performance may be better if we use the full vocabulary of the pre-trained mBART model instead of pruning it, although pruning is beneficial when powerful servers are unavailable. Fourth, though the human evaluation yields several interesting and useful analyses, it would be better to conduct it on a larger number of samples.
5 Conclusion
In this paper, we describe the systems with which we participated in the VLSP 2022 machine translation shared task for the Chinese-Vietnamese language pair. Our neural systems are built on the Transformer model using the Fairseq framework. The model is fine-tuned from the mBART model pre-trained on 25 languages and is enhanced by leveraging monolingual data through backtranslation with a sampling method. Additionally, we create a post-processing step to correct mistranslated output, focusing on numeric data such as numbers and date-time values. Furthermore, models are ensembled by weight averaging. We conducted various experiments to select a strong baseline and to compare different training settings and approaches. We achieve a relatively high baseline performance and improvements from our approaches, which are confirmed by both empirical experiments (BLEU) and human evaluation. We also present and discuss analyses of the translation output and point out several limitations, which should be addressed to improve the systems in future research.
References
- Akhbardeh et al. (2021) Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.
- Barrault et al. (2020) Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.
- Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500.
- Enarvi et al. (2020) Seppo Enarvi, Marilisa Amoia, Miguel Del-Agua Teba, Brian Delaney, Frank Diehl, Stefan Hahn, Kristina Harris, Liam McGrath, Yue Pan, Joel Pinto, et al. 2020. Generating medical reports from patient-doctor conversations using sequence-to-sequence models. In Proceedings of the first workshop on natural language processing for medical conversations, pages 22–30.
- Gao et al. (2022) Pengzhi Gao, Zhongjun He, Hua Wu, and Haifeng Wang. 2022. Bi-SimCut: A simple strategy for boosting neural machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3938–3948, Seattle, United States. Association for Computational Linguistics.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
- Mota et al. (2022) Pedro Mota, Vera Cabarrão, and Eduardo Farah. 2022. Fast-paced improvements to named entity handling for neural machine translation. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 141–149.
- Norouzi et al. (2016) Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. 2016. Reward augmented maximum likelihood for neural structured prediction. Advances In Neural Information Processing Systems, 29.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191.
- Sennrich et al. (2016a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.
- Sennrich et al. (2016b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.
- Shahhosseini et al. (2022) Mohsen Shahhosseini, Guiping Hu, and Hieu Pham. 2022. Optimizing ensemble weights and hyperparameters of machine learning models for regression problems. Machine Learning with Applications, 7:100251.
- Tan et al. (2022) Zhixing Tan, Xiangwen Zhang, Shuo Wang, and Yang Liu. 2022. MSP: Multi-stage prompting for making pre-trained language models better translators. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6131–6142.
- Tang et al. (2020) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
- Tran et al. (2021) Chau Tran, Shruti Bhosale, James Cross, Philipp Koehn, Sergey Edunov, and Angela Fan. 2021. Facebook AI’s WMT21 news translation task submission. In Proceedings of the Sixth Conference on Machine Translation, pages 205–215.
- Tran et al. (2016) Phuoc Tran, Dien Dinh, and Hien T Nguyen. 2016. A character level based and word level based approach for Chinese-Vietnamese machine translation. Computational Intelligence and Neuroscience, 2016:21.
- Van Nguyen et al. (2022) Vinh Van Nguyen, Ha Nguyen, Huong Thanh Le, Thai Phuong Nguyen, Tan Van Bui, Luan Nghia Pham, Anh Tuan Phan, Cong Hoang-Minh Nguyen, Viet Hong Tran, and Anh Huu Tran. 2022. KC4MT: A high-quality corpus for multilingual machine translation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5494–5502.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Wang et al. (2021) Jun Wang, Chang Xu, Francisco Guzmán, Ahmed El-Kishky, Benjamin Rubinstein, and Trevor Cohn. 2021. As easy as 1, 2, 3: Behavioural testing of NMT systems for numerical translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4711–4717.
- Zhao et al. (2013) Hai Zhao, Tianjiao Yin, and Jingyi Zhang. 2013. Vietnamese to Chinese machine translation via Chinese character as pivot. In Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27), pages 250–259.
Appendix A Hyper-parameters
We present the detailed hyper-parameters used for training our systems in Table 7.
Parameter | Value |
---|---|
Max training updates | 120,000 |
Early stop patience | 10 |
Optimizer | Adam |
Adam eps | 1e-06 |
Adam betas | [0.9, 0.98] |
Warmup updates | 2,500 |
Learning rate | 3e-05 |
Dropout | 0.3 |
Attention dropout | 0.1 |
Max tokens | 1,024 |
Save interval updates | 5,000 |