Samsung R&D Institute Philippines at WMT 2023
Abstract
In this paper, we describe the constrained MT systems submitted by Samsung R&D Institute Philippines to the WMT 2023 General Translation Task for two directions: en→he and he→en. Our systems comprise Transformer-based sequence-to-sequence models that are trained with a mix of best practices: comprehensive data preprocessing pipelines, synthetic backtranslated data, and the use of noisy channel reranking during online decoding. Our models perform comparably to, and sometimes outperform, strong unconstrained baseline systems such as mBART50 M2M and NLLB 200 MoE on two public benchmarks, FLORES-200 and NTREX-128, despite having significantly fewer parameters.
1 Introduction
This paper describes Samsung R&D Institute Philippines's submission to the WMT 2023 General Translation task. We participate in two translation directions: en→he and he→en, submitting two constrained single-direction models based on the Transformer Vaswani et al. (2017) sequence-to-sequence architecture. We employ a number of best practices: we use a comprehensive data preprocessing pipeline to ensure parallel data quality, create synthetic data through carefully curated backtranslation, and use reranking methods to select the best candidate translations.
Our systems achieve strong performance on public benchmarks: 44.24 BLEU and 33.77 BLEU on FLORES-200 and NTREX-128 for en→he, respectively, and 42.42 BLEU and 36.89 BLEU on FLORES-200 and NTREX-128 for he→en, respectively. Our systems outperform mBART50 M2M and slightly underperform NLLB 200 MoE despite having significantly fewer parameters than these unconstrained baselines.
We detail our data preprocessing, model training, data augmentation, and translation methodology. Additionally, we illustrate hyperparameter sweeping setups and study the effects of hyperparameters during online decoding with reranking.
2 Methodology
2.1 Data Preprocessing
Table 1: Statistics of the original and synthetic parallel corpora before and after filtering.

| Corpus | Pairs | Words (en) | Words (he) |
|---|---|---|---|
| Original | 72,459,348 | 701,991,594 | 566,555,530 |
| Original Filtered | 48,278,395 | 385,975,984 | 312,639,617 |
| Synthetic en→he | 10,000,000 | 165,595,289 | 145,849,940 |
| Synthetic en→he Filtered | 7,143,725 | 115,239,312 | 95,954,020 |
| Synthetic he→en | 73,278,018 | 1,471,827,973 | 1,056,677,671 |
| Synthetic he→en Filtered | 47,372,416 | 659,409,236 | 541,376,459 |
Given that a significant portion of the training dataset is synthetically-aligned, we need to use a comprehensive data preprocessing pipeline to ensure good translation quality. In particular, we use a combination of heuristic-based, ratio-based, and embedding-based methods to filter our data.
Heuristic-based
The following heuristic-based filters, based on Cruz and Cheng (2021), are used before applying the others (a minimal code sketch of both filters follows the list):

- Language Filter – We use pycld3 (https://pypi.org/project/pycld2/) to filter out sentence pairs where one or both sentences have more than 30% tokens that are neither English nor Hebrew.
- Numerical Filter – If one sentence in a pair has a number (ordinal, date, etc.), we check whether a matching number is present in the other sentence. If a match is not detected, the pair is removed.
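To make these heuristics concrete, the snippet below sketches simplified versions of both filters. It is an illustration rather than our exact implementation: the real language filter relies on pycld3, whereas the `token_language` helper here is a crude script-based stand-in, and the set-equality reading of the numerical rule is an assumption.

```python
import re

FOREIGN_TOKEN_THRESHOLD = 0.30  # mirrors the 30% rule described above
NUMBER_PATTERN = re.compile(r"\d+")
HEBREW_CHARS = re.compile(r"[\u0590-\u05FF]")
LATIN_CHARS = re.compile(r"[A-Za-z]")


def token_language(token: str) -> str:
    """Crude script-based stand-in for the pycld3 detector used in our pipeline."""
    if HEBREW_CHARS.search(token):
        return "he"
    if LATIN_CHARS.search(token):
        return "en"
    return "other"


def passes_language_filter(src: str, tgt: str) -> bool:
    """Drop pairs where either sentence has more than 30% tokens that are
    neither English nor Hebrew."""
    for sentence in (src, tgt):
        tokens = sentence.split()
        if not tokens:
            return False
        foreign = sum(1 for t in tokens if token_language(t) == "other")
        if foreign / len(tokens) > FOREIGN_TOKEN_THRESHOLD:
            return False
    return True


def passes_numerical_filter(src: str, tgt: str) -> bool:
    """Keep a pair only if the numbers on one side also appear on the other
    (one simple reading of the numerical matching rule)."""
    return set(NUMBER_PATTERN.findall(src)) == set(NUMBER_PATTERN.findall(tgt))
```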
Ratio-based
We employ ratio-based filters on tokenized sentence pairs following Cruz and Sutawika (2022) and Sutawika and Cruz (2021). We first tokenize using SacreMoses (https://github.com/alvations/sacremoses), then apply the following ratio-based filters (a code sketch follows the list):
- Length Filter – We remove pairs containing sentences with more than 140 characters.
- Token Length Filter – We remove pairs that contain sentences with tokens that are more than 40 characters long.
- Character-to-Token Ratio – We remove pairs where the ratio between character count and token count in at least one sentence is greater than 12.
- Pair Token Ratio – We remove pairs where the ratio of tokens between the source and target sentences is greater than 4.
- Pair Length Ratio – We remove pairs where the ratio between the string lengths of the source and target sentences is greater than 6.
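A minimal sketch of the ratio-based filters above, using SacreMoses for tokenization as in our pipeline; the function boundaries and the handling of edge cases are illustrative rather than a reproduction of our exact code.

```python
from sacremoses import MosesTokenizer

# Language codes are illustrative; SacreMoses falls back to default rules
# for languages without dedicated nonbreaking-prefix lists.
en_tok = MosesTokenizer(lang="en")
he_tok = MosesTokenizer(lang="he")


def passes_ratio_filters(src: str, tgt: str) -> bool:
    """Apply the length and ratio thresholds described above to one pair."""
    src_tokens = en_tok.tokenize(src, escape=False)
    tgt_tokens = he_tok.tokenize(tgt, escape=False)

    # Length Filter: no sentence longer than 140 characters.
    if len(src) > 140 or len(tgt) > 140:
        return False
    # Token Length Filter: no token longer than 40 characters.
    if any(len(t) > 40 for t in src_tokens + tgt_tokens):
        return False
    # Character-to-Token Ratio: at most 12 characters per token on each side.
    if (len(src) / max(len(src_tokens), 1) > 12
            or len(tgt) / max(len(tgt_tokens), 1) > 12):
        return False
    # Pair Token Ratio: token counts may differ by at most a factor of 4.
    token_ratio = max(len(src_tokens), 1) / max(len(tgt_tokens), 1)
    if max(token_ratio, 1 / token_ratio) > 4:
        return False
    # Pair Length Ratio: character lengths may differ by at most a factor of 6.
    length_ratio = max(len(src), 1) / max(len(tgt), 1)
    if max(length_ratio, 1 / length_ratio) > 6:
        return False
    return True
```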
Embedding-based
Finally, we experiment with the use of sentence embedding models to compute embedding-based similarity between the sentences of a pair. We use LaBSE Feng et al. (2020) models to embed both the source and target sentences, then compute a cosine similarity score between the two. A pair is kept only if its similarity score exceeds a set threshold.
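A minimal sketch of this filter, assuming the sentence-transformers release of LaBSE; the checkpoint name and the similarity threshold shown are illustrative rather than the exact values in our pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint name on the Hugging Face hub; the threshold is illustrative.
model = SentenceTransformer("sentence-transformers/LaBSE")
SIMILARITY_THRESHOLD = 0.7  # hypothetical cutoff


def passes_embedding_filter(src: str, tgt: str) -> bool:
    """Keep a pair only if the cosine similarity of its LaBSE embeddings
    exceeds the threshold."""
    embeddings = model.encode(
        [src, tgt], convert_to_tensor=True, normalize_embeddings=True
    )
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= SIMILARITY_THRESHOLD
```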
Statistics on the original and filtered corpora can be found in Table 1.
2.2 Model Architecture
We experiment with two model sizes for each language pair: a Base model with 65M parameters and a Large model with 200M parameters. Both models use the standard Transformer Vaswani et al. (2017) sequence-to-sequence architecture and are trained using Fairseq Ott et al. (2019) with the hyperparameters listed in Table 2.
We parallelize with 8 NVIDIA Tesla P100 GPUs and initially train for a total of 100K steps for experimentation. For the submitted systems trained with backtranslated data, we train for a total of 1M steps.
Table 2: Training hyperparameters.

| Hyperparameter | Value |
|---|---|
| Parameters | 65M (Base) / 200M (Large) |
| Vocab size | 32,000 |
| Tied weights | Yes |
| Dropout | 0.3 |
| Attention dropout | 0.1 |
| Weight decay | 0.0 |
| Label smoothing | 0.1 |
| Optimizer | Adam |
| Adam betas | β₁ = 0.90, β₂ = 0.98 |
| Adam ε | 1e-6 |
| Learning rate | 7e-4 |
| Warmup steps | 4,000 |
| Total steps | 1,000,000 |
| Batch size | 64,000 tokens |
2.3 Backtranslation
We use backtranslation Sennrich et al. (2015a) as a form of data augmentation to improve our initial models. We generate synthetic data via combined top-k and nucleus sampling:
$$ y_t \sim \operatorname{softmax}\!\left(z_t / \tau\right)\big|_{V_k \cap V_p} \qquad (1) $$

where $z_t$ are the decoder logits at timestep $t$, $k$ is the number of top values considered for top-k sampling ($V_k$ denoting the set of the $k$ highest-probability tokens), $\tau$ is the temperature hyperparameter, and $p$ is the maximum total probability for nucleus sampling ($V_p$ denoting the smallest set of highest-probability tokens whose cumulative probability exceeds $p$).
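To make Equation 1 concrete, the sketch below implements one decoding step of temperature-scaled sampling restricted to the combined top-k / nucleus candidate set in plain PyTorch; it illustrates the sampling scheme rather than the fairseq implementation we actually used.

```python
import torch


def sample_next_token(logits: torch.Tensor, k: int = 50, p: float = 0.93,
                      tau: float = 0.7) -> int:
    """Sample one token id from temperature-scaled probabilities restricted to
    the combined top-k / nucleus candidate set (cf. Equation 1).
    `logits` is a 1-D tensor over the vocabulary; defaults mirror Table 3."""
    probs = torch.softmax(logits / tau, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)

    # Nucleus set: smallest prefix of sorted tokens whose cumulative mass reaches p.
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    nucleus_size = int((cumulative < p).sum().item()) + 1

    # Combined candidate set: intersection of the top-k and nucleus prefixes.
    cutoff = min(k, nucleus_size)

    # Renormalize the restricted distribution and sample one token from it.
    candidate_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(candidate_probs, num_samples=1)
    return int(sorted_ids[:cutoff][choice].item())
```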
Backtranslation is only performed once using the provided monolingual data. We produce a total of 10,000,000 synthetic sentences for the en→he direction and 73,278,018 synthetic sentences for the he→en direction. The same data preprocessing used on the original parallel corpus is then applied to the synthetic corpus. We produce backtranslations using the Large 100K models with the sampling hyperparameters listed in Table 3.
Statistics on the generated synthetic data before and after filtering can be found in Table 1.
Table 3: Backtranslation sampling hyperparameters.

| Hyperparameter | Value |
|---|---|
| Top-k (k) | 50 |
| Top-p (p) | 0.93 |
| Temperature (τ) | 0.7 |
| Beam | 1 |
| Length penalty | 1.0 |
2.4 Noisy Channel Reranking
We further improve translations by using Noisy Channel Reranking Yee et al. (2019), which reranks every candidate translation token using Bayes’ Rule, as follows:
$$ P(y_t^{(i)} \mid x, y_{<t}) = \frac{P(x \mid y_t^{(i)}, y_{<t}) \, P(y_t^{(i)} \mid y_{<t})}{P(x)} \qquad (2) $$

where $P(y_t^{(i)} \mid x, y_{<t})$ refers to the probability of the $i$-th candidate token at timestep $t$ given the source sentence $x$ and the current translated tokens $y_{<t}$.
All probabilities are parameterized as standard encoder-decoder Transformer neural networks: the direct model models $P(y \mid x)$, the translation from the source into the target language; the channel model models $P(x \mid y)$, the probability of the candidate translation translating back into the source sentence; and the language model models $P(y)$, the probability of the translated sentence occurring in the target language. $P(x)$ is generally not modeled since it is constant for all candidates. This allows us to leverage a strong language model to guide the outputs of the direct model, while using a channel model to constrain the preferred outputs of the language model (which may otherwise be unrelated to the source sentence).
During beam search decoding, we rescore the top $k_2$ candidates using the following linear combination of all three models:

$$ \frac{1}{|y|} \log P(y \mid x) + \lambda_{\mathrm{CM}} \cdot \frac{1}{|x|} \log P(x \mid y) + \lambda_{\mathrm{LM}} \cdot \frac{1}{|y|} \log P(y) \qquad (3) $$

where $\frac{1}{|x|}$ and $\frac{1}{|y|}$ are source / target debiasing terms, $\lambda_{\mathrm{CM}}$ refers to the weight of the channel model, and $\lambda_{\mathrm{LM}}$ refers to the weight of the language model.
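For illustration, the function below rescores a single candidate with the linear combination in Equation 3, given summed log-probabilities from the three models. The argument names are ours; in practice the log-probabilities come from forward passes of the direct, channel, and language models inside beam search, and the weights are the per-direction values tuned in Section 2.6.

```python
def noisy_channel_score(
    direct_logprob: float,   # log P(y | x) summed over the candidate prefix
    channel_logprob: float,  # log P(x | y) from the reverse-direction model
    lm_logprob: float,       # log P(y) from the target-side language model
    src_len: int,            # |x|, source debiasing term
    tgt_len: int,            # |y|, target debiasing term
    lambda_cm: float,        # channel model weight
    lambda_lm: float,        # language model weight
) -> float:
    """Linear combination used to rescore beam candidates (cf. Equation 3)."""
    return (
        direct_logprob / tgt_len
        + lambda_cm * channel_logprob / src_len
        + lambda_lm * lm_logprob / tgt_len
    )
```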
For Noisy Channel Reranking, our direct and channel models use the same size and setup at all times (i.e. if the direct model is a Large model trained for 100K steps, then the channel model is also a Large model trained for 100K steps in the opposite translation direction).
For the language model, we train one Base-sized decoder-only Transformer language model for English and one for Hebrew. We concatenate the cleaned data from the parallel corpus with the provided monolingual data for each language to train the LM. We use the same training setup as with translation models, except we use a weight decay of 0.01 and a learning rate of 5e-4.
Hyperparameters used for decoding with Noisy Channel Reranking can be found in Table 4.
Table 4: Decoding hyperparameters for noisy channel reranking.

| Hyperparameter | Value |
|---|---|
| Beam | 5 |
| Length penalty | 1.0 |
| k2 | 5 |
| CM top-k | 500 |
| λ_CM (en→he) | 0.2297 |
| λ_LM (en→he) | 0.2056 |
| λ_CM (he→en) | 0.2998 |
| λ_LM (he→en) | 0.2594 |
2.5 Evaluation
We evaluate our models using two metrics: BLEU Papineni et al. (2002) and ChrF++ Popović (2015), both scored via SacreBLEU Post (2018). SacreBLEU outputs the following signature for evaluation: nrefs:1|case:mixed|eff:no|tok:spm-flores|smooth:exp|version:2.2.1. We develop our models using both the FLORES-200 Costa-jussà et al. (2022) and NTREX-128 Federmann et al. (2022) datasets, using the validation sets during training and reporting scores on the test sets.
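As a sketch of the scoring setup, assuming SacreBLEU's Python API: the sentences below are toy placeholders, and the tokenizer name is an assumption that may need to be adjusted to match the reported tok:spm-flores signature depending on the installed SacreBLEU version.

```python
import sacrebleu

# Toy example; in practice these are the detokenized system outputs and references.
hypotheses = ["The cat sat on the mat."]
references = [["The cat sat on the mat."]]  # one reference stream

# SPM-based tokenizer (assumed to correspond to the tok:spm-flores signature).
bleu = sacrebleu.metrics.BLEU(tokenize="spm")
chrf = sacrebleu.metrics.CHRF(word_order=2)  # word_order=2 gives chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
```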
To benchmark our models’ performance, we mainly compare BLEU and ChrF++ against two (unconstrained) models: mBART 50 M2M Tang et al. (2020), a 610M-parameter finetuned version of mBART for many-to-many translation, and NLLB 200 MoE Costa-jussà et al. (2022), the full 54.5B-parameter mixture-of-experts version of NLLB 200 for many-to-many translation.
2.6 Hyperparameter Search
To find the best values for $\lambda_{\mathrm{CM}}$ and $\lambda_{\mathrm{LM}}$, as well as to understand how these parameters affect performance, we use Bayesian hyperparameter search. We use the Large 1M + BT models and run 1,000 iterations of search, keeping the length penalty static at 1.0 and sampling both $\lambda_{\mathrm{CM}}$ and $\lambda_{\mathrm{LM}}$ from a Gaussian with a minimum of 0.01 and a maximum of 0.99.
We perform this for both the en→he and he→en translation directions and use the results for the final submission models.
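As an illustration of this setup, the sketch below runs a comparable 1,000-trial search with Optuna (an assumption; not necessarily the tool we used), where the hypothetical `translate_and_score` helper stands in for decoding the validation set with noisy channel reranking and computing BLEU.

```python
import optuna


def translate_and_score(lambda_cm: float, lambda_lm: float) -> float:
    """Hypothetical helper: decode the validation set with noisy channel
    reranking using the given weights and return corpus BLEU."""
    raise NotImplementedError


def objective(trial: optuna.Trial) -> float:
    # Both weights are searched in [0.01, 0.99] with the length penalty fixed at 1.0.
    # (Optuna samples uniformly here; the paper samples from a bounded Gaussian.)
    lambda_cm = trial.suggest_float("lambda_cm", 0.01, 0.99)
    lambda_lm = trial.suggest_float("lambda_lm", 0.01, 0.99)
    return translate_and_score(lambda_cm, lambda_lm)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=1000)
print(study.best_params)
```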
3 Results
A summary of our results on benchmarks can be found in Table 5.
Table 5: BLEU and ChrF++ scores of our systems and the unconstrained baselines on the FLORES-200 and NTREX-128 test sets.

| Model | FLORES-200 EN→HE (BLEU / ChrF++) | FLORES-200 HE→EN (BLEU / ChrF++) | NTREX-128 EN→HE (BLEU / ChrF++) | NTREX-128 HE→EN (BLEU / ChrF++) |
|---|---|---|---|---|
| Base 100K | 39.88 / 56.34 | 12.06 / 29.46 | 31.47 / 48.32 | 29.85 / 52.53 |
| Base 100K + NC | 40.22 / 56.55 | 38.75 / 60.52 | 32.10 / 48.93 | 31.86 / 54.57 |
| Base 100K + BT | 41.50 / 57.46 | 38.73 / 60.80 | 31.27 / 47.90 | 34.09 / 56.10 |
| Base 100K + BT + NC | 41.66 / 57.59 | 40.43 / 62.17 | 32.05 / 48.62 | 35.76 / 57.65 |
| Large 100K | 41.26 / 57.46 | 39.07 / 60.06 | 32.49 / 48.95 | 31.08 / 53.19 |
| Large 100K + NC | 41.46 / 57.64 | 40.53 / 61.49 | 32.80 / 49.34 | 33.12 / 55.16 |
| Large 100K + BT | 43.32 / 58.62 | 40.91 / 61.58 | 32.90 / 49.11 | 35.48 / 56.04 |
| Large 100K + BT + NC | 43.26 / 58.72 | 41.92 / 62.64 | 33.18 / 49.42 | 36.79 / 57.37 |
| Large 1M + BT | 43.76 / 58.29 | 41.00 / 61.16 | 33.35 / 49.22 | 35.83 / 56.02 |
| Large 1M + BT + NC | 44.24 / 59.36 | 42.42 / 62.21 | 33.77 / 49.69 | 36.89 / 56.92 |
| mBART50 M2M (610M) | 19.49 / 46.70 | 30.50 / 55.00 | 14.80 / 42.30 | 27.02 / 51.21 |
| NLLB 200 MoE (54.5B) | 46.80 / 59.80 | 49.00 / 67.40 | - / - | - / - |
3.1 Benchmarking Results
Our submission systems (Large 1M + BT + NC) exhibit strong performance in both translation directions. On FLORES-200, we achieve 44.24 BLEU for en→he and 42.42 BLEU for he→en. The same systems score 33.77 BLEU for en→he and 36.89 BLEU for he→en on NTREX-128.
We note that these systems perform strongly when compared against much larger, unconstrained baseline models. On FLORES-200, we significantly outperform mBART 50 M2M on en→he by +24.75 BLEU and on he→en by +11.92 BLEU despite having 67% fewer parameters (200M vs. 610M). Notably, our system performs only slightly worse than NLLB 200 MoE despite having over 99% fewer parameters than the mixture-of-experts model (200M vs. 54.5B). On FLORES-200, we score 2.56 BLEU lower on en→he and 6.58 BLEU lower on he→en compared to NLLB 200 MoE.
3.2 Hyperparameter Search Results
To find optimal values for both $\lambda_{\mathrm{CM}}$ and $\lambda_{\mathrm{LM}}$, we ran Bayesian hyperparameter search over both at the same time while keeping the length penalty static. We plot the results of the hyperparameter search over 1,000 iterations in Figure 1.
We observe that performance is optimal when both hyperparameters are set to roughly 0.2 to 0.3, with performance growing increasingly worse as both hyperparameters approach 1. We hypothesize that this signifies that the model captures the original distribution closely enough that it does not need much correction or aid from the accompanying language model. Noisy channel reranking is nevertheless still empirically useful in this case, as guidance from the language model produces better candidates when the direct model is searching a space that is too constrained.
3.3 Ablations
We explored multiple configurations of our submission systems in terms of model size, presence of synthetic data during training, and the use of reranking methods during online decoding. Our results show that each addition improves performance:

- The initial Base 100K model performs at 39.88 BLEU for en→he on FLORES-200.
- Increasing the size to 200M parameters (Large 100K) improves performance by +1.38 BLEU.
- Adding backtranslated data (Large 100K + BT) is by far the most beneficial, improving performance by +2.06 BLEU.
- We then experiment with longer training (1M iterations for Large 1M + BT) to adapt to the new dataset size, increasing the score by +0.44 BLEU.
- Finally, using noisy channel reranking (Large 1M + BT + NC) improves the score by +0.48 BLEU.

Overall, our methods improve performance by a total of +4.36 BLEU for the en→he direction on FLORES-200.
We note an interesting jump in performance from Base 100K to Large 1M + BT + NC on the FLORES-200 he→en direction of +30.36 BLEU. Base 100K underperforms at 12.06 BLEU, and we hypothesize that this is due to the model not having enough capacity to embed information from Hebrew, which causes it to benefit greatly from the guidance of a language model during noisy channel reranking.
4 Conclusion
In this paper, we describe our submissions to the WMT 2023 General Translation Task. We participate in two constrained tracks: en→he and he→en.
We submit two monodirectional models based on the Transformer architecture. Both models are trained using a mix of original and synthetic backtranslated data, filtered and curated using a comprehensive data processing pipeline that combines embedding-based, heuristic-based, and ratio-based filters. Additionally, we employ noisy channel reranking to improve translation candidates using a language model and a channel model trained in the opposite direction.
On two benchmark datasets, our systems outperform mBART50 M2M and perform slightly worse than NLLB 200 MoE, both unconstrained systems with significantly more parameters.
Our results show that established best practices still perform strongly in a constrained setting, without the need for the extraneous data sources used by unconstrained systems for the same translation directions.
Limitations
We benchmark on datasets that are publicly available with permissive licenses for research.
We note that we are unable to study scale properly for translation models due to a lack of stronger compute resources. The same constraint also prevents us from training multiple iterations of the same model with differing random seeds. Our systems' true performance may thus be higher or lower depending on the random state of the machine at the start of training.
Lastly, our models are trained on Hebrew, which is a language that we do not speak. We are therefore unable to manually evaluate if the output translations are correct, natural, or semantically sound.
Ethical Considerations
Our paper replicates best practices in data preprocessing, model training, and online decoding for translation models. Within our study, we aim to create experiments that replicate prior work under comparable experimental conditions to ensure fairness in benchmarking.
Given that we do not speak the target language studied in this paper, we report performance in comparison to other existing models. We do not claim that "strong" performance in a computational setting correlates with good translations from a human perspective.
Lastly, while we do not use human annotators for this paper, the conference (WMT) itself does for human evaluations on the General Translation Task. We disclose this fact and note that annotations (and therefore scores) may be different across many speakers of Hebrew.
References
- Bareket and Tsarfaty (2021) Dan Bareket and Reut Tsarfaty. 2021. Neural Modeling for Named Entities and Morphology (NEMO2). Transactions of the Association for Computational Linguistics, 9:909–928.
- Costa-jussà et al. (2022) Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
- Cruz and Cheng (2021) Jan Christian Blaise Cruz and Charibeth Cheng. 2021. Improving large-scale language models and resources for Filipino. arXiv preprint arXiv:2111.06053.
- Cruz and Sutawika (2022) Jan Christian Blaise Cruz and Lintang Sutawika. 2022. Samsung Research Philippines-DataSaur AI's submission for the WMT22 Large Scale Multilingual Translation Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1034–1038.
- Federmann et al. (2022) Christian Federmann, Tom Kocmi, and Ying Xin. 2022. NTREX-128 – news test references for MT evaluation of 128 languages. In Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, pages 21–24, Online. Association for Computational Linguistics.
- Feng et al. (2020) Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic bert sentence embedding. arXiv preprint arXiv:2007.01852.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
- Popović (2015) Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
- Sennrich et al. (2015a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.
- Sennrich et al. (2015b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Sutawika and Cruz (2021) Lintang Sutawika and Jan Christian Blaise Cruz. 2021. Data processing matters: SRPH-Konvergen AI's machine translation system for WMT'21. arXiv preprint arXiv:2111.10513.
- Tang et al. (2020) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Yang and Zhang (2018) Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
- Yee et al. (2019) Kyra Yee, Nathan Ng, Yann N Dauphin, and Michael Auli. 2019. Simple and effective noisy channel modeling for neural machine translation. arXiv preprint arXiv:1908.05731.