Centroid-Based Efficient Minimum Bayes Risk Decoding
Abstract
Minimum Bayes risk (MBR) decoding has achieved state-of-the-art translation performance using COMET, a neural metric with a high correlation with human evaluation. However, MBR decoding requires quadratic time because it computes the expected score between a translation hypothesis and all reference translations. We propose centroid-based MBR (CBMBR) decoding to improve the speed of MBR decoding. Our method clusters the reference translations in the feature space and then calculates the score using the centroids of each cluster. The experimental results demonstrate that our CBMBR not only improved the decoding speed of the expected score calculation by 5.7 times, but also outperformed vanilla MBR decoding in translation quality by up to 0.5 COMET% in the WMT’22 En↔Ja, En↔De, En↔Zh, and WMT’23 En↔Ja translation tasks.¹

¹ https://github.com/naist-nlp/mbrs
Hiroyuki Deguchi1,2, Yusuke Sakai1, Hidetaka Kamigaito1, Taro Watanabe1, Hideki Tanaka2, Masao Utiyama2
1Nara Institute of Science and Technology 2National Institute of Information and Communications Technology
{deguchi.hiroyuki.db0, sakai.yusuke.sr9, kamigaito.h, taro}@is.naist.jp {hideki.tanaka, mutiyama}@nict.go.jp
1 Introduction
Minimum Bayes risk (MBR) decoding achieves robust and high-quality translation by selecting the output sentence that maximizes the expected metric score computed from the set of translation hypotheses (Kumar and Byrne, 2004; Eikema and Aziz, 2020; Müller and Sennrich, 2021). Recently, neural evaluation metrics that have a high correlation with human evaluation have been proposed (Rei et al., 2020, 2022a; Sellam et al., 2020; Zhang et al., 2020), and MBR decoding using such neural metrics has achieved state-of-the-art translation performance in human evaluation compared with conventional maximum-a-posteriori (MAP) decoding using beam search (Fernandes et al., 2022).
However, because of its formulation, typical MBR decoding, which regards the hypothesis set as a pseudo-reference set, requires $O(|\mathcal{H}|^2)$ computational time when a set $\mathcal{H}$ of translation hypotheses is given. In recent work, the number of hypotheses exceeded 1,000 candidates (Freitag et al., 2023), which makes the quadratic computational time a challenge for MBR decoding, particularly when expensive neural metrics are used. Several pruning methods have been proposed to improve the decoding speed (Eikema and Aziz, 2022; Cheng and Vlachos, 2023). However, these approaches either require the careful selection of a proxy metric (Eikema and Aziz, 2022) or make it difficult to take advantage of computational parallelism because hypotheses are pruned iteratively (Cheng and Vlachos, 2023).

Given that expensive neural metrics, e.g., COMET or BLEURT, are trained to output high scores when a hypothesis sentence and a reference translation are semantically similar, we hypothesize that the sentence vectors of similar sentences lie close together in the feature space of these models. We leverage this sentence similarity to improve the decoding speed of COMET-MBR by clustering the sentence vectors of the translation candidates into $k$ clusters, as shown in Figure 1. Then, we calculate the COMET scores using the centroid vectors of these clusters instead of the individual sentence vectors.
In experiments, our proposed method not only achieved a 5.7 times speed-up in the calculation of the expected score but also improved the COMET score by up to 0.5% compared with naive MBR decoding in the WMT’22 En↔Ja, En↔De, En↔Zh, and WMT’23 En↔Ja translation tasks.
2 Background
MBR decoding
MBR decoding has been demonstrated to be effective in fields such as statistical automatic speech recognition (Goel and Byrne, 2000) and statistical machine translation (Kumar and Byrne, 2004). In recent years, it has been applied to neural machine translation (Eikema and Aziz, 2020; Müller and Sennrich, 2021). Furthermore, it is better suited to combining multiple translation systems than ensemble models (Ito et al., 2023).
Let $\mathcal{X}$ and $\mathcal{Y}$ be the spaces of possible source sentences and target sentences, respectively. MAP decoding generates the target sentence using $\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} p(y \mid x; \theta)$, where $\theta$ denotes the parameters of the translation model that calculates the likelihood of an output sentence $y \in \mathcal{Y}$ given an input sentence $x \in \mathcal{X}$. Because it is infeasible to calculate probabilities for all possible $y \in \mathcal{Y}$, beam search is typically used to obtain the solution.
By contrast, MBR decoding determines the output sentence by maximizing the expected utility as follows:
$$\hat{y} = \operatorname*{argmax}_{h \in \mathcal{H}} \; \mathbb{E}_{y \sim p(y \mid x)}\big[u(h, y)\big] \qquad (1)$$
$$\phantom{\hat{y}} \approx \operatorname*{argmax}_{h \in \mathcal{H}} \; \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} u(h, r), \qquad (2)$$
where $u(h, r)$ denotes the utility function, which represents the preference of hypothesis $h$ given reference $r$, and $\mathcal{H}$ denotes the set of translation hypotheses. $p(y \mid x)$ is the true probability of $y$ being a translation of a given input sentence $x$; because the true probability is unknown, the expectation is approximated using the sampled reference translations $\mathcal{R}$, as shown in Equation 2. Typical MBR decoding considers the hypothesis set itself to be the pseudo-reference set, i.e., $\mathcal{R} = \mathcal{H}$. Note that the time complexity is $O(|\mathcal{H}||\mathcal{R}|)$, where $|\mathcal{H}| = |\mathcal{R}|$, which is time-consuming.
COMET-MBR
COMET is an evaluation metric of translation quality that achieves a high correlation with human evaluation. The COMET model consists of an XLM-RoBERTa (XLM-R)-based sentence encoder (Conneau et al., 2020) and an output layer, and is trained to predict direct assessment scores (Rei et al., 2020, 2022a). It first encodes the source sentence $x$, hypothesis sentence $h$, and reference sentence $r$ into their $d$-dimensional sentence vectors independently. Then the COMET score is computed from the triplet of sentence vectors in the output layer. Let $\operatorname{enc}(\cdot)$ be the sentence encoding function and $f(\cdot)$ be the output layer. The COMET score is computed as $f(\operatorname{enc}(h), \operatorname{enc}(r), \operatorname{enc}(x))$. MBR decoding with COMET (COMET-MBR) replaces the utility in Equation 2 with the COMET score:
$$\hat{y} = \operatorname*{argmax}_{h \in \mathcal{H}} \; \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} f\big(\operatorname{enc}(h), \operatorname{enc}(r), \operatorname{enc}(x)\big). \qquad (3)$$
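To make the cost structure of Equation 3 concrete, the sketch below evaluates the Monte Carlo estimate over cached sentence vectors; the `output_layer` callable stands in for COMET's output MLP and is an assumption for illustration, not the actual library API.

```python
import torch

def comet_mbr(src_vec, hyp_vecs, ref_vecs, output_layer):
    """Monte Carlo COMET-MBR (Eq. 3) over cached sentence vectors.

    src_vec:      (d,) sentence vector of the source x
    hyp_vecs:     (H, d) sentence vectors of the hypotheses
    ref_vecs:     (R, d) sentence vectors of the pseudo-references
    output_layer: hypothetical callable (hyp_vec, ref_vec, src_vec) -> scalar tensor,
                  standing in for COMET's output MLP
    """
    num_hyp, num_ref = hyp_vecs.size(0), ref_vecs.size(0)
    expected_utility = torch.zeros(num_hyp)
    for i in range(num_hyp):        # O(|H| * |R|) output-layer evaluations
        for j in range(num_ref):
            expected_utility[i] += output_layer(hyp_vecs[i], ref_vecs[j], src_vec)
    expected_utility /= num_ref
    return int(torch.argmax(expected_utility))  # index of the selected hypothesis
```

Because the encoder runs once per sentence, the quadratic cost sits entirely in the double loop over the output layer, which is the part CBMBR targets.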
3 Proposed Method
Our proposed centroid-based MBR (CBMBR) approximates the expected utility using the centroids of similar sentence vectors. CBMBR decodes by computing the expected utility according to the following procedures: sentence encoding, clustering, and calculating the expected utility.
Encoding
First, we compute the sentence vectors of the source $x$, the hypotheses $h \in \mathcal{H}$, and the pseudo-references $r \in \mathcal{R}$ using $\operatorname{enc}(\cdot)$.
Clustering
Next, we cluster the sentence vectors of the pseudo-references into $k$ clusters and obtain the centroid vectors $c_1, \dots, c_k$ of the clusters. Here, we employ $k$-means++ (Arthur and Vassilvitskii, 2007) to prevent the centroids from being biased. $k$-means++ selects the initial centroids so that pairs of centroids are far apart, according to weights calculated from the distances between vectors. We describe the details of the algorithm in Appendix C.1. Then, we cluster the vectors using the standard $k$-means algorithm; that is, we iterate the following two steps: 1) assign each vector to its nearest centroid, and 2) update each centroid as the mean of the vectors assigned to its cluster.
Expected utility
Finally, we calculate the expected utility by replacing pseudo-reference vectors with centroids in Equation 3:
$$\hat{y} = \operatorname*{argmax}_{h \in \mathcal{H}} \; \frac{1}{k} \sum_{j=1}^{k} f\big(\operatorname{enc}(h), c_j, \operatorname{enc}(x)\big). \qquad (4)$$
The conventional method requires $O(|\mathcal{H}||\mathcal{R}|)$ computational time to compute the expected utility for all hypotheses, whereas our CBMBR computes it in $O(k|\mathcal{H}|)$. Note that $k$ ($k \leq |\mathcal{R}|$) is a hyperparameter that balances the trade-off between decoding speed and approximation accuracy. In particular, when $k = 1$, the single centroid is the average of all pseudo-reference vectors, $\bar{r} = \frac{1}{|\mathcal{R}|}\sum_{r \in \mathcal{R}} \operatorname{enc}(r)$, and the time complexity of CBMBR is $O(|\mathcal{H}|)$, which is equivalent to DeNero et al. (2009) and Vamvas and Sennrich (2024). Our proposed method approximates the expected utility using centroid representations, which implicitly assumes that the score function is approximately linear: when $k = 1$, we approximate $\frac{1}{|\mathcal{R}|}\sum_{r \in \mathcal{R}} f(\operatorname{enc}(h), \operatorname{enc}(r), \operatorname{enc}(x))$ by $f(\operatorname{enc}(h), \bar{r}, \operatorname{enc}(x))$, which holds exactly only if $f$ is linear in its reference argument.
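A minimal sketch of the whole CBMBR procedure is shown below, under the same assumption of a hypothetical `output_layer`; scikit-learn's KMeans with $k$-means++ initialization stands in for the clustering step, and the paper's own implementation may differ in detail.

```python
import torch
from sklearn.cluster import KMeans  # k-means++ initialization is the default

def cbmbr(src_vec, hyp_vecs, ref_vecs, output_layer, k=64, n_iter=1):
    """CBMBR (Eq. 4): cluster pseudo-reference vectors, then score against centroids."""
    # Cluster the |R| pseudo-reference vectors into k clusters.
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=1, max_iter=n_iter)
    kmeans.fit(ref_vecs.cpu().numpy())
    centroids = torch.from_numpy(kmeans.cluster_centers_).to(hyp_vecs.dtype)

    num_hyp = hyp_vecs.size(0)
    expected_utility = torch.zeros(num_hyp)
    for i in range(num_hyp):        # O(k * |H|) output-layer evaluations
        for j in range(k):
            expected_utility[i] += output_layer(hyp_vecs[i], centroids[j], src_vec)
    expected_utility /= k
    return int(torch.argmax(expected_utility))
```

The defaults `k=64` and `n_iter=1` mirror the configuration reported in Appendix D.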
4 Experiments
Setup
We conducted translation experiments with two settings: one used diversified translation candidates, and the other simulated a more realistic scenario, multi-system translation. We evaluated translation quality using the COMET score, which is the same as the utility function of MBR decoding in our experiments. For comparison, we also performed translation candidate reranking using a quality estimation model, CometKiwi (Rei et al., 2022b) (QE),² and MBR decoding with confidence-based pruning (PruneMBR) (Cheng and Vlachos, 2023), and we evaluated the quality upper bound (Oracle), which selects the hypothesis with the best COMET score computed against the reference translations. We also compared CBMBR without $k$-means++, where we randomly selected the initial centroids from the sample set (w/o $k$-means++). We used COMET-22 (Rei et al., 2022a) as both the evaluation metric and the utility function. In MBR decoding, we treat the hypothesis set as the pseudo-reference set, i.e., $\mathcal{R} = \mathcal{H}$. We set the number of centroids to $k = 64$. The details of our setup are shown in Appendix D.

² Unlike MBR decoding, which requires calculating multiple pairwise scores from each sentence representation, the QE model computes a single score per hypothesis. Additionally, CometKiwi does not compute explicit sentence vectors that are independent between the source and translation sentences but directly estimates the score from the two concatenated sentences. Therefore, the QE model cannot cache the sentence vectors of the source and hypotheses as MBR decoding does.
Diverse translation candidates
Table 1: COMET scores of each decoding method in the diverse translation candidates setting (WMT’22).

Decoding | en-ja | ja-en | en-de | de-en | en-zh | zh-en | avg.
---|---|---|---|---|---|---|---
MAP | 78.7 | 69.7 | 77.3 | 79.2 | 77.4 | 70.1 | 75.4 |
QE | 86.6 | 76.2 | 82.2 | 82.1 | 82.9 | 76.9 | 81.2 |
MBR | 87.9 | 76.6 | 84.0 | 83.0 | 84.2 | 77.3 | 82.2 |
PruneMBR | 87.9 | 76.5 | 84.0 | 83.0 | 84.1 | 77.3 | 82.1 |
CBMBR | 87.9 | 76.6 | 83.9 | 83.0 | 84.1 | 77.1 | 82.1 |
w/o $k$-means++ | 87.8 | 76.4 | 83.8 | 82.9 | 84.0 | 77.2 | 82.0
Oracle | 90.6 | 81.9 | 87.0 | 86.5 | 87.7 | 81.2 | 85.8 |
Table 2: Decoding time of each step in the diverse translation candidates setting.

Step | QE | MBR | PruneMBR | CBMBR
---|---|---|---|---
Encode/hypotheses | – | 247.0 | 248.0 | 247.8 |
Encode/source | – | 51.6 | 51.1 | 51.2 |
Rerank | 450.1 | – | – | – |
Prune | – | – | 5.5 | – |
$k$-means++ | – | – | – | 36.5
Utility function | – | 322.2 | 79.6 | 20.1
E2E | 450.1 | 633.1 | 384.7 | 356.8 |
In this setting, we evaluated translation quality in six language directions: En↔Ja, En↔De, and En↔Zh from the WMT’22 translation task (Kocmi et al., 2022). We generated translation candidates using the pre-trained multilingual translation model M2M100 (Fan et al., 2021). We used beam search with a beam size of 256 for MAP decoding and generated 1,024 translations using epsilon sampling (Freitag et al., 2023) for MBR decoding.
Table 1 shows the translation quality of each decoding method. From the average scores in the table, compared with MAP decoding, both MBR decoding and the proposed CBMBR decoding improved the COMET score, by +6.8 and +6.7%, respectively, and the gap between Oracle and both MBR and CBMBR narrowed. The results also show that, with $k$-means++ initialization, the difference between CBMBR and MBR was within 0.1%.
Next, we compared the decoding time of each method, as shown in Table 2. The total COMET-MBR time (E2E) indicates that CBMBR was 1.8 times faster than vanilla MBR. Specifically, in the computation of the expected utility, which requires quadratic time, the speed increased by 5.7 times including $k$-means++, and by 16.0 times when comparing only the utility computation. We also confirmed that, compared with PruneMBR, the speed of the expected utility calculation improved by 1.4 times. One reason for this improvement is that, unlike PruneMBR, CBMBR computes the expected utility in a single batched operation, which makes it easier to leverage the parallel computation capabilities of the GPU.
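One way to see why a single batched call helps: all hypothesis–centroid pairs can be flattened into one batch and scored together, as in the sketch below. Here `batched_output_layer` is a hypothetical stand-in for an output layer that accepts batched triplets, not the actual COMET API.

```python
import torch

def batched_expected_utility(src_vec, hyp_vecs, centroids, batched_output_layer):
    """Score every (hypothesis, centroid) pair in a single batched call.

    hyp_vecs: (H, d), centroids: (k, d), src_vec: (d,)
    batched_output_layer: hypothetical callable taking three (N, d) tensors
                          and returning (N,) scores
    """
    H, k, d = hyp_vecs.size(0), centroids.size(0), hyp_vecs.size(1)
    # Broadcast to all H*k pairs and flatten them into one batch.
    hyp = hyp_vecs.unsqueeze(1).expand(H, k, d).reshape(H * k, d)
    ref = centroids.unsqueeze(0).expand(H, k, d).reshape(H * k, d)
    src = src_vec.unsqueeze(0).expand(H * k, d)
    scores = batched_output_layer(hyp, ref, src).view(H, k)
    return scores.mean(dim=1)  # expected utility of each hypothesis
```

Iterative pruning, by contrast, interleaves scoring with candidate elimination, which fragments the work into many smaller GPU calls.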
To summarize, CBMBR maintains translation quality comparable with naive MBR decoding while accelerating the computational time of the expected utility calculation by 5.7 times, including clustering.
Multi-system translation
Table 3: COMET scores in the multi-system translation setting.

Decoding | WMT’22 en-ja | WMT’22 ja-en | WMT’23 en-ja | WMT’23 ja-en | avg.
---|---|---|---|---|---
MAP | 86.4 | 80.9 | 83.5 | 80.4 | 82.8 |
QE | 89.8 | 82.6 | 87.6 | 82.3 | 85.6 |
MBR | 90.5 | 84.1 | 88.7 | 83.7 | 86.7 |
PruneMBR | 88.9 | 82.8 | 86.6 | 82.2 | 85.1 |
CBMBR | 90.9 | 84.1 | 89.2 | 83.8 | 87.0 |
w/o $k$-means++ | 90.5 | 84.1 | 88.8 | 83.7 | 86.8
Oracle | 93.4 | 89.4 | 91.9 | 88.5 | 90.8 |
We also evaluated the effectiveness of our CBMBR in the setting where translation candidates are generated from multiple translation systems. In particular, we followed the practice of Deguchi et al. (2023), in which 18 candidate sets, each comprising the 50-best translations, were generated from nine models and two decoding methods, beam search and top-$p$ sampling, with a beam size of 50. We evaluated translation quality in two language directions, En↔Ja, in the WMT’22 and WMT’23 translation tasks (Kocmi et al., 2022, 2023).
Table 3 shows the results. Unlike in the diverse translation candidates setting, CBMBR improved translation quality by up to 0.5% compared with naive MBR. Naive MBR calculates the expected utility using all samples equally, which makes it prone to bias when the candidates have a multimodal distribution. By contrast, CBMBR estimates the expected utility using only the centroids; therefore, it decodes robustly even if the distribution is multimodal. A detailed analysis is given in Appendix F.
To summarize, we found that CBMBR not only improved the decoding speed but also improved translation quality compared with vanilla MBR when the translation was determined from the candidate sets generated from multiple translation systems.
5 Discussion
5.1 Number of centroids
[Figure 2: COMET scores of CBMBR for various numbers of centroids $k$ in the multi-system translation setting.]
We evaluated the COMET scores of CBMBR with various numbers of centroids $k$ in the multi-system translation setting. Figure 2 shows the results. When $k = 1$, although the time complexity is linear, $O(|\mathcal{H}|)$, the COMET score of CBMBR degraded by 0.5% compared with vanilla MBR. The figure shows that translation quality improved as $k$ increased, and CBMBR outperformed vanilla MBR once $k$ was sufficiently large. Additionally, translation quality was better when we used $k$-means++ initialization instead of the random initialization of standard $k$-means.
5.2 Distance between similar sentence vectors
Table 4: Pearson correlation with gold similarity scores on the STS-B dev and test sets.

Model | dev | test
---|---|---
RoBERTa (Liu et al., 2019) | 53.7 | 43.0
fastText (Joulin et al., 2017) | 65.3 | 53.6
XLM-R (Conneau et al., 2020) | 39.1 | 31.6
LaBSE (Feng et al., 2022) | 72.9 | 72.7
COMET (Rei et al., 2022a) | 78.2 | 73.6
As mentioned in Section 3, CBMBR implicitly assumes that the score function is approximately linear. Because the output layer $f$ in the COMET model is a nonlinear multi-layer perceptron, it is not obvious whether it is appropriate to use the averaged vector of hypothesis representations as the representative point of a cluster. To verify this assumption, we investigated the distances between sentence vectors of similar sentences using the semantic textual similarity benchmark (STS-B) task (Cer et al., 2017), evaluating the Pearson correlation coefficient against the ground-truth similarity scores. Table 4 shows the results. Despite its sentence vectors not being trained explicitly with an objective such as contrastive learning, COMET achieved a strong correlation of 73.6 on the test set, substantially better than the pre-trained XLM-R score of 31.6, which indicates that it implicitly learned sentence similarity through the training of score prediction. Furthermore, we confirmed that COMET outperformed LaBSE (Feng et al., 2022), which was trained using contrastive learning.
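A correlation check of this kind can be sketched as follows; the cosine-similarity measure and the hypothetical `encode` function are assumptions for illustration and not necessarily the exact protocol behind Table 4.

```python
import torch
from scipy.stats import pearsonr

def sts_correlation(sentence_pairs, gold_scores, encode):
    """Pearson correlation between embedding similarity and gold STS-B scores.

    sentence_pairs: list of (sentence1, sentence2) string pairs
    gold_scores:    list of human similarity ratings
    encode:         hypothetical function mapping a sentence to a (d,) torch vector
                    (e.g., the COMET or XLM-R sentence encoder)
    """
    similarities = []
    for s1, s2 in sentence_pairs:
        v1, v2 = encode(s1), encode(s2)
        similarities.append(
            torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()
        )
    return pearsonr(similarities, gold_scores)[0]
```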
To summarize, the sentence vectors of COMET demonstrated a strong correlation with the gold scores of the STS-B task even though the sentence representations were not explicitly trained for similarity; that is, the encoder of COMET implicitly learned sentence similarity through score prediction. From these results, we conclude that approximating the expected utility by aggregating sentence vectors that are close to each other is reasonable, because the distance between sentence vectors reflects semantic similarity.
6 Conclusion
In this paper, we proposed CBMBR, which improves the speed of MBR decoding by clustering the sentence vectors of similar sentences and computing the score using the centroid representations of each cluster. Our CBMBR achieved a 5.7 times speed-up in the expected score calculation and an improvement in COMET of up to 0.5% compared with vanilla MBR decoding in the WMT’22 En↔Ja, En↔De, En↔Zh, and WMT’23 En↔Ja translation tasks. For future work, we will apply our method to other evaluation metrics, including both neural and non-neural metrics.
Limitations
This study focused only on improving the speed of MBR decoding with a particular neural evaluation metric, COMET. For non-neural metrics, it would be necessary to apply a clustering method appropriate to each metric.
In COMET-MBR, there are two computational bottlenecks: the calculation of the expected utility and sentence encoding. We only improved the speed of the expected utility computation, which takes quadratic time. Although sentence encoding can be computed in linear time, the sentences are encoded using the expensive XLM-R encoder, which remains time-consuming.
Our method can only be applied to metrics for which we can compute the representation independently for each sentence. This limitation is the same as that of the method of DeNero et al. (2009) and is also explained in their paper.
We measured the decoding times reported in this paper on a single computer and only in a single run; the amount of speed improvement may differ when different computer architectures are used.
Ethical Consideration
Both vanilla MBR decoding and CBMBR decoding select output sentences from a set of translation candidates generated by translation systems; hence, if the systems generate toxic text, it may be selected.
Acknowledgements
This work was partially supported by JSPS KAKENHI Grant Number JP21H05054.
References
- Arthur and Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. 2007. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, page 1027–1035, USA. Society for Industrial and Applied Mathematics.
- Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
- Cheng and Vlachos (2023) Julius Cheng and Andreas Vlachos. 2023. Faster minimum Bayes risk decoding with confidence-based pruning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12473–12480, Singapore. Association for Computational Linguistics.
- Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
- Deguchi et al. (2023) Hiroyuki Deguchi, Kenji Imamura, Yuto Nishida, Yusuke Sakai, Justin Vasselli, and Taro Watanabe. 2023. NAIST-NICT WMT’23 general MT task submission. In Proceedings of the Eighth Conference on Machine Translation, pages 110–118, Singapore. Association for Computational Linguistics.
- DeNero et al. (2009) John DeNero, David Chiang, and Kevin Knight. 2009. Fast consensus decoding over translation forests. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 567–575, Suntec, Singapore. Association for Computational Linguistics.
- Eikema and Aziz (2020) Bryan Eikema and Wilker Aziz. 2020. Is MAP decoding all you need? the inadequacy of the mode in neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4506–4520, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Eikema and Aziz (2022) Bryan Eikema and Wilker Aziz. 2022. Sampling-based approximations to minimum Bayes risk decoding for neural machine translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10978–10993, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Fan et al. (2021) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2021. Beyond english-centric multilingual machine translation. J. Mach. Learn. Res., 22(1).
- Feng et al. (2022) Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.
- Fernandes et al. (2022) Patrick Fernandes, António Farinhas, Ricardo Rei, José G. C. de Souza, Perez Ogayo, Graham Neubig, and Andre Martins. 2022. Quality-aware decoding for neural machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1396–1412, Seattle, United States. Association for Computational Linguistics.
- Freitag et al. (2023) Markus Freitag, Behrooz Ghorbani, and Patrick Fernandes. 2023. Epsilon sampling rocks: Investigating sampling strategies for minimum Bayes risk decoding for machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9198–9209, Singapore. Association for Computational Linguistics.
- Goel and Byrne (2000) Vaibhava Goel and William J Byrne. 2000. Minimum bayes-risk automatic speech recognition. Computer Speech & Language, 14(2):115–135.
- Ito et al. (2023) Ikumi Ito, Takumi Ito, Jun Suzuki, and Kentaro Inui. 2023. Investigating the effectiveness of multiple expert models collaboration. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14393–14404, Singapore. Association for Computational Linguistics.
- Joulin et al. (2017) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.
- Kocmi et al. (2023) Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Makoto Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popović, and Mariya Shmatova. 2023. Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet. In Proceedings of the Eighth Conference on Machine Translation, pages 1–42, Singapore. Association for Computational Linguistics.
- Kocmi et al. (2022) Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. 2022. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Kumar and Byrne (2004) Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176, Boston, Massachusetts, USA. Association for Computational Linguistics.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.
- Müller and Sennrich (2021) Mathias Müller and Rico Sennrich. 2021. Understanding the properties of minimum Bayes risk decoding in neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 259–272, Online. Association for Computational Linguistics.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Rei et al. (2022a) Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022a. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
- Rei et al. (2022b) Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. 2022b. CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
- Vamvas and Sennrich (2024) Jannis Vamvas and Rico Sennrich. 2024. Linear-time minimum bayes risk decoding with reference aggregation.
- Vijayakumar et al. (2018) Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
- Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Appendix A Licenses
In our experiments, we used the Comet-22 model licensed under the Apache-2.0 license and CometKiwi model licensed under the CC BY-NC-SA 4.0 license. We evaluated our method on the test sets of WMT’21, WMT’22, and WMT’23 translation tasks under the following policy: “The data released for the WMT General MT task can be freely used for research purposes”.
Appendix B Details of the Datasets
Table 5 shows the number of sentences for each dataset we used in our experiments.
Table 5: Number of sentences in each dataset.

Dataset | en-ja | ja-en | en-de | de-en | en-zh | zh-en
---|---|---|---|---|---|---
WMT’21 | 1,000 | 1,005 | 1,002 | 1,000 | 1,002 | 1,948 |
WMT’22 | 2,037 | 2,008 | 2,037 | 1,984 | 2,037 | 1,875 |
WMT’23 | 2,074 | 1,992 | – | – | – | – |
Appendix C Details of the Algorithms and Models
C.1 $k$-means++
We describe the algorithm for the initial centroid selection of $k$-means++ (a code sketch follows the list):

1. Pick the first centroid uniformly at random from the set of pseudo-reference vectors and add it to the centroid set $C$.
2. Calculate the squared Euclidean distance between each vector and its nearest centroid in $C$.
3. Sample a vector from the multinomial distribution whose weights are proportional to those squared distances and add it to the set $C$.
4. Repeat steps 2 and 3 until $k$ centroids have been selected.
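For illustration, the selection procedure can be written directly as follows; this is a sketch of standard $k$-means++ seeding, not the authors' exact implementation.

```python
import torch

def kmeans_pp_init(vectors: torch.Tensor, k: int) -> torch.Tensor:
    """Select k initial centroids from an (n, d) tensor via k-means++ seeding."""
    n = vectors.size(0)
    # Step 1: pick the first centroid uniformly at random.
    centroids = [vectors[torch.randint(n, (1,)).item()]]
    for _ in range(k - 1):  # step 4: repeat steps 2 and 3 until k centroids are chosen
        # Step 2: squared Euclidean distance of each vector to its nearest centroid.
        dists = torch.stack([((vectors - c) ** 2).sum(dim=1) for c in centroids])
        nearest_sq_dist = dists.min(dim=0).values
        # Step 3: sample the next centroid with probability proportional to that distance.
        weights = nearest_sq_dist / nearest_sq_dist.sum()
        centroids.append(vectors[torch.multinomial(weights, 1).item()])
    return torch.stack(centroids)
```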
C.2 COMET model
Figure 3 shows an overview of the COMET model. A triplet of sentences (source, hypothesis, and reference) is independently encoded into sentence vectors, and the COMET score is then calculated from these vectors.
[Figure 3: Overview of the COMET model.]
Appendix D Details of the Experimental Setup
Table 6 shows the details of our experimental setup. We implemented vanilla MBR, PruneMBR, and CBMBR using PyTorch.
Table 6: Details of the experimental setup.

Item | Value
---|---
COMET | Unbabel/wmt22-comet-da (https://huggingface.co/Unbabel/wmt22-comet-da)
QE | Unbabel/wmt22-cometkiwi-da (https://huggingface.co/Unbabel/wmt22-cometkiwi-da)
GPU | NVIDIA A100
Batch size (sentence encoding) | 256 sentences
**Diverse translation candidates setting** |
Translation model | M2M100 (Fan et al., 2021), 418M parameters (https://huggingface.co/facebook/m2m100_418M)
MAP decoding: generation | beam search
MAP decoding: beam size | 256
MBR decoding: # of candidates | 1,024 translations
MBR decoding: candidate generation | epsilon sampling (Freitag et al., 2023)
CBMBR: # of centroids $k$ | 64
CBMBR: # of $k$-means iterations | 1
**Multi-system translation setting** |
Translation model | 9 various Transformer models (Deguchi et al., 2023)
MAP decoding: generation | beam search using the ensemble model
MAP decoding: beam size | 50
MBR decoding: # of candidates | 900 translations (https://huggingface.co/naist-nlp/wmt23)
MBR decoding: candidate generation | beam search and top-$p$ sampling (Deguchi et al., 2023)
CBMBR: # of centroids $k$ | 64
CBMBR: # of $k$-means iterations | 1
Appendix E Other Experimental Results
E.1 Translation quality on the development set in the diverse translation candidates setting
Table 7 shows the experimental results of the diverse translation candidates setting on the development set. In the table, "niter" denotes the number of iterations of $k$-means clustering. We chose niter = 1 based on these results.
Table 7: COMET scores on the development set in the diverse translation candidates setting.

Decoding | en-ja | ja-en | en-de | de-en | en-zh | zh-en | avg.
---|---|---|---|---|---|---|---
MAP | 78.8 | 62.6 | 74.5 | 80.4 | 73.0 | 68.6 | 73.0 |
QE | 86.7 | 71.7 | 80.3 | 83.6 | 80.1 | 77.0 | 79.9 |
MBR | 88.2 | 72.6 | 82.1 | 84.3 | 81.6 | 77.7 | 81.1 |
PruneMBR | 88.2 | 72.6 | 82.0 | 84.3 | 81.6 | 77.7 | 81.1 |
CBMBR | 88.2 | 72.3 | 81.9 | 84.4 | 81.5 | 77.5 | 81.0 |
w/o $k$-means++ | 88.1 | 72.4 | 81.8 | 84.3 | 81.4 | 77.5 | 80.9
CBMBR with various numbers of $k$-means++ iterations | | | | | | |
niter=1 | 88.2 | 72.3 | 81.9 | 84.4 | 81.5 | 77.5 | 81.0 |
niter=2 | 88.2 | 72.2 | 81.9 | 84.4 | 81.6 | 77.5 | 81.0 |
niter=3 | 88.2 | 72.3 | 81.9 | 84.4 | 81.5 | 77.5 | 81.0 |
niter=4 | 88.2 | 72.3 | 81.8 | 84.4 | 81.5 | 77.5 | 81.0 |
niter=5 | 88.2 | 72.3 | 81.9 | 84.4 | 81.6 | 77.5 | 81.0 |
CBMBR with various numbers of $k$-means iterations | | | | | | |
niter=1 | 88.1 | 72.4 | 81.8 | 84.3 | 81.4 | 77.5 | 80.9 |
niter=2 | 88.1 | 72.4 | 81.8 | 84.3 | 81.4 | 77.5 | 80.9 |
niter=3 | 88.1 | 72.4 | 81.8 | 84.3 | 81.4 | 77.5 | 80.9 |
niter=4 | 88.1 | 72.4 | 81.8 | 84.3 | 81.5 | 77.5 | 80.9 |
niter=5 | 88.1 | 72.4 | 81.8 | 84.3 | 81.5 | 77.5 | 80.9 |
oracle | 89.9 | 76.3 | 84.0 | 87.3 | 84.2 | 80.2 | 83.7 |
E.2 Decoding speed in the multi-system translation setting
Table 8 shows the decoding speed in the multi-system translation setting, measured on the WMT’22 and WMT’23 En↔Ja translation tasks. As shown in the table, our CBMBR improved the speed of the expected score calculation by 5.0 times compared with vanilla MBR and by 1.5 times compared with PruneMBR in the multi-system setting.
Table 8: Decoding time of each step in the multi-system translation setting.

Step | QE | MBR | PruneMBR | CBMBR
---|---|---|---|---
Encode/hypotheses | – | 198.1 | 198.7 | 199.1
Encode/source | – | 22.0 | 22.0 | 21.9
Rerank | 313.0 | – | – | – |
Prune | – | – | 5.4 | – |
$k$-means++ | – | – | – | 36.1
Utility function | – | 281.1 | 79.5 | 20.1
E2E | 336.0 | 511.9 | 306.0 | 278.4 |
E.3 Relationship between the number of centroids and translation quality in the diverse translation setting
[Figure 4: COMET scores of CBMBR for various numbers of centroids $k$ in the diverse translation candidates setting.]
Figure 4 shows the translation quality for various numbers of centroids $k$ in the diverse translation candidates setting. We observed that the COMET score increased as $k$ increased, except for one outlier.
E.4 Evaluation of non-target metrics
Decoding | en-ja | ja-en | en-de | de-en | en-zh | zh-en | avg. |
---|---|---|---|---|---|---|---|
MAP | 14.6 | 10.9 | 25.1 | 27.3 | 28.2 | 16.1 | 20.4 |
QE | 16.3 | 11.1 | 22.6 | 24.2 | 25.8 | 15.5 | 19.2 |
MBR | 16.4 | 10.6 | 23.6 | 25.8 | 28.3 | 15.6 | 20.1 |
PruneMBR | 16.4 | 10.5 | 23.6 | 25.7 | 28.2 | 15.6 | 20.0 |
CBMBR | 16.2 | 9.8 | 23.1 | 25.4 | 27.6 | 15.5 | 19.6 |
w/o $k$-means++ | 16.4 | 10.4 | 23.5 | 25.7 | 28.3 | 15.7 | 20.0
Oracle | 22.0 | 15.8 | 30.3 | 34.5 | 35.6 | 20.4 | 26.4 |
Decoding | en-ja | ja-en | en-de | de-en | en-zh | zh-en | avg. |
---|---|---|---|---|---|---|---|
MAP | 56.6 | 54.3 | 65.2 | 66.9 | 61.2 | 56.8 | 60.2 |
QE | 63.7 | 60.1 | 70.0 | 69.6 | 65.8 | 62.2 | 65.2 |
MBR | 62.7 | 57.2 | 68.5 | 68.9 | 65.1 | 60.3 | 63.8 |
PruneMBR | 62.6 | 57.1 | 68.4 | 68.9 | 65.1 | 60.2 | 63.7 |
CBMBR | 62.3 | 56.8 | 68.1 | 68.7 | 64.7 | 59.9 | 63.4 |
w/o $k$-means++ | 62.5 | 56.9 | 68.3 | 68.7 | 64.9 | 60.2 | 63.6
Oracle | 67.9 | 63.1 | 72.6 | 73.4 | 70.3 | 64.8 | 68.7 |
Decoding | WMT’22 en-ja | WMT’22 ja-en | WMT’23 en-ja | WMT’23 ja-en | avg.
---|---|---|---|---|---
MAP | 24.2 | 23.1 | 20.7 | 21.9 | 22.5 |
QE | 23.3 | 20.7 | 20.0 | 19.9 | 21.0 |
MBR | 25.0 | 21.9 | 21.6 | 21.3 | 22.4 |
PruneMBR | 23.8 | 21.7 | 20.5 | 21.2 | 21.8 |
CBMBR | 24.3 | 20.4 | 20.7 | 19.8 | 21.3 |
w/o $k$-means++ | 24.8 | 21.8 | 21.5 | 21.1 | 22.3
Oracle | 35.2 | 36.2 | 29.7 | 33.8 | 33.7 |
Decoding | WMT’22 en-ja | WMT’22 ja-en | WMT’23 en-ja | WMT’23 ja-en | avg.
---|---|---|---|---|---
MAP | 65.1 | 67.0 | 59.6 | 67.1 | 64.7 |
QE | 68.8 | 68.6 | 63.6 | 68.9 | 67.5 |
MBR | 68.3 | 68.0 | 63.5 | 68.8 | 67.2 |
PruneMBR | 66.5 | 66.9 | 61.3 | 67.5 | 65.6 |
CBMBR | 68.3 | 67.4 | 63.5 | 68.4 | 66.9 |
w/o $k$-means++ | 68.3 | 68.0 | 63.6 | 68.7 | 67.1
Oracle | 75.5 | 77.1 | 70.6 | 76.5 | 74.9 |
Appendix F Multimodality of Translation Candidates
Table 13: COMET scores of CBMBR and weighted CBMBR (CBMBR$_w$) in the multi-system translation setting.

Decoding | WMT’22 en-ja | WMT’22 ja-en | WMT’23 en-ja | WMT’23 ja-en | avg.
---|---|---|---|---|---
MAP | 86.4 | 80.9 | 83.5 | 80.4 | 82.8 |
QE | 89.8 | 82.6 | 87.6 | 82.3 | 85.6 |
MBR | 90.5 | 84.1 | 88.7 | 83.7 | 86.7 |
PruneMBR | 88.9 | 82.8 | 86.6 | 82.2 | 85.1 |
CBMBR | 90.9 | 84.1 | 89.2 | 83.8 | 87.0
w/o $k$-means++ | 90.5 | 84.1 | 88.8 | 83.7 | 86.8
CBMBR$_w$ | 90.4 | 83.9 | 88.6 | 83.5 | 86.6
w/o $k$-means++ | 90.4 | 84.0 | 88.6 | 83.6 | 86.6
Oracle | 93.4 | 89.4 | 91.9 | 88.5 | 90.8 |
In the multi-system translation setting, CBMBR also outperformed vanilla MBR in terms of translation quality, as shown in Table 3 and Figure 2. In this section, we discuss the multimodal nature of the translation candidates, two approximation variants of MBR decoding, and why our CBMBR outperformed vanilla MBR in this setting.
The $n$-best translations generated by beam search are often similar to each other (Vijayakumar et al., 2018). To diversify the candidates while maximizing translation quality, Deguchi et al. (2023) generated 50-best translation sets from each translation system, which resulted in candidates that exhibit multimodality. Vanilla MBR decoding calculates the expected score by treating all samples equally, which means that it is prone to being affected by the number of similar translation samples when the candidates have such a multimodal distribution.
Now, there are two centroid-based approximation variants of MBR decoding. One is our proposed method, which calculates the expected score using the centroid representations:

$$\hat{y}^{\text{CBMBR}} = \operatorname*{argmax}_{h \in \mathcal{H}} \; \frac{1}{k} \sum_{j=1}^{k} f\big(\operatorname{enc}(h), c_j, \operatorname{enc}(x)\big). \qquad (5)$$
The other, weighted CBMBR (CBMBR$_w$), multiplies each centroid-based score by a weight according to the number of samples in each cluster:

$$\hat{y}^{\text{CBMBR}_w} = \operatorname*{argmax}_{h \in \mathcal{H}} \; \sum_{j=1}^{k} w(c_j)\, f\big(\operatorname{enc}(h), c_j, \operatorname{enc}(x)\big), \qquad (6)$$
where $w(\cdot)$ returns the weight of the given centroid as follows:

$$w(c_j) = \frac{n(c_j)}{|\mathcal{R}|}, \qquad (7)$$
$$n(c_j) = \big|\{\, r \in \mathcal{R} : \operatorname{NN}(\operatorname{enc}(r), C) = c_j \,\}\big|, \qquad (8)$$
$$\operatorname{NN}(v, C) = \operatorname*{argmin}_{c \in C} \lVert v - c \rVert^2, \qquad (9)$$
where $\operatorname{NN}(\cdot, \cdot)$ finds the nearest neighbor centroid of a vector from the given set of centroids $C$, and $n(\cdot)$ counts the number of samples in the cluster of the given centroid. CBMBR$_w$, which uses the number of samples in each cluster, can be regarded as a more accurate approximation of vanilla MBR than our CBMBR, which ignores the number of samples in each cluster.
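The difference between the two variants can be sketched as follows, again with a hypothetical `output_layer` standing in for COMET's output MLP: Equation (5) averages centroid scores uniformly, whereas Equation (6) weights each centroid by the fraction of pseudo-references assigned to its cluster.

```python
import torch

def cluster_weights(ref_vecs, centroids):
    """w(c_j) = n(c_j) / |R|: fraction of pseudo-references whose nearest centroid is c_j."""
    sq_dists = torch.cdist(ref_vecs, centroids) ** 2   # (|R|, k) squared distances
    assignments = sq_dists.argmin(dim=1)               # nearest-centroid assignment (Eq. 9)
    counts = torch.bincount(assignments, minlength=centroids.size(0)).float()  # Eq. 8
    return counts / ref_vecs.size(0)                   # Eq. 7

def expected_utility(src_vec, hyp_vec, centroids, output_layer, weights=None):
    """Eq. (5) when weights is None; Eq. (6) with cluster-size weights otherwise."""
    scores = torch.stack([output_layer(hyp_vec, c, src_vec) for c in centroids])
    if weights is None:
        return scores.mean()           # CBMBR: uniform average over centroids
    return (weights * scores).sum()    # CBMBR_w: weighted by cluster size
```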
We compared the translation quality of our CBMBR and CBMBR$_w$ in the multi-system translation setting. Table 13 shows the results. The difference between vanilla MBR and CBMBR$_w$ narrowed to within 0.1%, and CBMBR$_w$ degraded by up to 0.6% compared with our CBMBR; that is, CBMBR$_w$, which more accurately approximates vanilla MBR, performed worse than our proposed CBMBR. We attribute this observation to the biased distribution caused by beam search or sampling; our CBMBR decodes robustly against such bias.