Knowledge-Prompted Estimator: A Novel Approach to Explainable Machine Translation Assessment
Abstract
Cross-lingual Machine Translation (MT) quality estimation plays a crucial role in evaluating translation performance. GEMBA, the first MT quality assessment metric based on Large Language Models (LLMs), employs one-step prompting to achieve state-of-the-art (SOTA) system-level MT quality estimation; however, it lacks segment-level analysis. In contrast, Chain-of-Thought (CoT) prompting outperforms one-step prompting by offering improved reasoning and explainability. In this paper, we introduce the Knowledge-Prompted Estimator (KPE), a CoT prompting method that combines three one-step prompting techniques: perplexity, token-level similarity, and sentence-level similarity. This method attains enhanced performance for segment-level estimation compared with previous deep learning models and one-step prompting approaches. Furthermore, supplementary experiments on word-level visualized alignment demonstrate that KPE significantly improves token alignment compared with earlier models and provides better interpretability for MT quality estimation. Code will be released upon publication.
Index Terms:
Machine Translation Quality Estimation, Chain-of-Thought Prompting, Large Language Models

I Introduction
Large Language Models (LLMs), such as GPT-3 (Brown et al., 2020), ChatGPT (Kasirzadeh, 2023), GPT-4 (OpenAI, 2023), and LLaMA (Touvron et al., 2023), have been successfully validated in typical NLP scenarios, including question answering, search, summarization, and keyword extraction (Vilar et al., 2022). For multilingual NLP tasks, Jiao et al. (2023) and Hendy et al. (2023) have demonstrated that a prompt-based approach, rather than fine-tuning, can enable LLMs to perform machine translation. This method yields impressive results for high-resource language pairs; however, for low-resource language pairs, performance may be unsatisfactory due to insufficient or lower-quality training data.

Going further, GEMBA (Kocmi and Federmann, 2023) explores the application of LLMs not only for translation but also as translation quality estimators with one-step prompting. This approach targets two main scenarios: system-level quality estimation and segment-level quality estimation. Using one-step prompting, LLMs assign scores to different systems or sentence pairs. Three scoring modes have been designed: scalar (0-100 points), 5-star (0-5 points), and 5-category (five-way classification). For evaluation, system-level assessment relies on pairwise accuracy as the metric, while segment-level assessment employs Kendall's Tau. Results indicate that, with LLM prompting and the 5-category mode, GEMBA achieves state-of-the-art (SOTA) results at the system level. However, for segment-level quality assessment, LLMs still fall short compared with dedicated machine translation quality assessment models.
This study introduces two innovative considerations. (1) Transitioning from single-dimensional to multi-dimensional evaluation: machine translation Quality Estimation (QE) (Kocmi et al., 2021) now involves more than assigning a single score; it adopts a multi-dimensional evaluation approach (MQM) (Rei et al., 2022) that assesses fluency and accuracy separately. Fluency can be evaluated by having LLMs measure the perplexity of a sentence, while accuracy can be assessed by having LLMs evaluate sentence-level similarity (Yang et al., 2023) or word-level similarity (Zhang et al., 2023). (2) To implement multi-dimensional evaluation, the CoT prompting method for LLMs can be employed (Zhang et al., 2022)(Fu et al., 2022). This approach guides LLMs to consider fluency first, followed by word-level and sentence-level accuracy, before finally combining two to three feature aspects to produce the best result. On the WMT QE task, our KPE system achieves the best performance in 80% of segment-level tasks. Additionally, in terms of interpretability (Tao et al., 2022), token-level alignment exhibits better results.
In summary, this study can be characterized by the following key points:
- Introducing the Knowledge-Prompted Estimator (KPE), comprising three one-step prompting evaluations for perplexity, token-level similarity, and sentence-level similarity, and two CoT prompting evaluations: perplexity-token prompting and perplexity-token-sentence prompting.
- Experimental validation demonstrates that KPE achieves positive gains at each step of segment-level evaluation. Furthermore, one-step prompting is competitive, and CoT1 prompting achieves SOTA segment-level evaluation performance, even better than CoT2 prompting.
- In terms of interpretability analysis, we compare the outcomes of BERTScore, TeacherSim, and KPE. Our findings reveal that KPE-based token alignment exhibits substantially better accuracy and more discriminative power than previous results, thus offering enhanced interpretability.
II Related Work
II-A Machine Translation Quality Estimation
Machine translation assessment can be classified into two categories based on the presence or absence of reference translations: quality evaluation as metrics, and Quality Estimation (QE) as metrics. Quality evaluation predicts the accuracy of machine translations based on a triplet of source text (src), machine translation output (mt), and reference translation (ref). In contrast, QE predicts the accuracy of machine translations based solely on the source text (src) and machine translation output (mt). Quality evaluation as metrics tends to have higher accuracy than QE; however, because it requires a human-provided reference translation for each source text, it may be less efficient in practical applications. QE as metrics, which does not require reference translations, has a wider range of applications despite its lower accuracy.
QE as metrics can be divided into two categories based on the evaluation methodology: system-level evaluation and segment-level evaluation.
System-level evaluation employs a calculation method similar to Learning-to-Rank (LTR), in two steps: for each system, it first computes a system-level score from the source list (src list) and machine translation list (mt list); it then calculates the pairwise accuracy between the machine-ranked and human-labeled system scores. Segment-level evaluation, on the other hand, generates Relative Ranking (RR) segment-level data for all system pairs of source text (src) and machine translation (mt) using Direct Assessment (DA) or Multidimensional Quality Metrics (MQM). This data consists of triplets (src, mt1, mt2), indicating that the human evaluation score for mt1 is higher than that for mt2. Segment-level evaluation calculates correlations using Kendall's Tau.
$\text{score}_{\text{sys}} = \frac{1}{N} \sum_{i=1}^{N} \text{QE}(\text{src}_i, \text{mt}_i) \qquad (1)$
II-B Translation Quality Estimation Metrics
Pairwise accuracy is the most commonly employed metric for system-level translation QE.
$\text{Accuracy} = \frac{\left|\, \text{sign}(\text{metric}\,\Delta) = \text{sign}(\text{human}\,\Delta) \,\right|}{\left|\, \text{all system pairs} \,\right|} \qquad (2)$
Kendall's Tau (Callison-Burch et al., 2011) is the most commonly employed metric for segment-level translation QE.
$\tau = \frac{|\text{Concordant}| - |\text{Discordant}|}{|\text{Concordant}| + |\text{Discordant}|} \qquad (3)$
Metric vs. Human | Human: s(mt1) > s(mt2) | Human: s(mt1) = s(mt2) | Human: s(mt1) < s(mt2)
---|---|---|---
Metric: s(mt1) > s(mt2) | Concordant | Discordant | Discordant
Metric: s(mt1) < s(mt2) | Discordant | Discordant | Concordant
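To make Eqs. (2) and (3) concrete, the following is a minimal sketch (not code from the paper) of how the two metrics can be computed. The inputs `metric_scores`, `human_scores`, and `rr_triplets` are hypothetical and follow the conventions above; ties produced by the metric are counted as discordant, matching the table.

```python
from itertools import combinations


def pairwise_accuracy(metric_scores, human_scores):
    """Eq. (2): fraction of system pairs that the metric ranks in the
    same order as the human system-level scores."""
    pairs = list(combinations(range(len(metric_scores)), 2))
    agree = sum(
        1
        for i, j in pairs
        if (metric_scores[i] - metric_scores[j]) * (human_scores[i] - human_scores[j]) > 0
    )
    return agree / len(pairs)


def kendall_tau(rr_triplets, qe_metric):
    """Eq. (3): (Concordant - Discordant) / (Concordant + Discordant).
    Each RR triplet (src, mt1, mt2) means humans rated mt1 above mt2;
    a tie or reversal in the metric counts as discordant."""
    concordant = discordant = 0
    for src, mt1, mt2 in rr_triplets:
        if qe_metric(src, mt1) > qe_metric(src, mt2):
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```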
II-C LLM-based Quality Estimation Metrics
Prompt engineering is a key component of success when working with modern conversational AI tools and language models such as ChatGPT and GPT-4: it is the craft of formulating a statement or question so that the model returns accurate and useful results.
Different from traditional deep learning QE models, LLM QE (GEMBA) is a prompting-based method built on LLMs. It consists of three steps: (1) finding an appropriate prompt template for the task, including a predefined response format; (2) filling the template with the source sentence, translation sentence, and other parameters; and (3) parsing the result from the response.
The LLM QE score is defined as follows:

$\text{score} = \text{LLM}(\text{prompt}(\text{src}, \text{mt})) \qquad (4)$

The only difference between traditional QE and LLM QE lies in whether a prompt and an LLM are used.
The generation of prompts can begin by asking counter-questions to LLMs. After obtaining multiple candidate templates, a suitable one can be manually selected and slightly modified. An example of a GEMBA prompt is as follows:
Classify the quality of machine translation into one of following classes: “No meaning preserved”, “Some meaning preserved, but not understandable”, “Some meaning preserved and understandable”, “Most meaning preserved, minor issues”, “Perfect translation”.
source: “source_seg”
machine translation: “target_seg”
Class:
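As a minimal sketch of the three-step procedure above (template, fill, parse), the snippet below wraps the GEMBA 5-category prompt. Here `call_llm` is a hypothetical function that returns the model's text completion, and the class-to-score mapping is only an illustrative way to turn category labels into ordinal scores; neither is prescribed by GEMBA.

```python
GEMBA_TEMPLATE = (
    'Classify the quality of machine translation into one of following classes: '
    '"No meaning preserved", "Some meaning preserved, but not understandable", '
    '"Some meaning preserved and understandable", '
    '"Most meaning preserved, minor issues", "Perfect translation".\n'
    'source: "{source_seg}"\n'
    'machine translation: "{target_seg}"\n'
    'Class:'
)

CLASS_TO_SCORE = {  # illustrative mapping of the five categories to ordinal scores
    "No meaning preserved": 1,
    "Some meaning preserved, but not understandable": 2,
    "Some meaning preserved and understandable": 3,
    "Most meaning preserved, minor issues": 4,
    "Perfect translation": 5,
}


def llm_qe(src, mt, call_llm):
    """One-step LLM QE: fill the template, query the LLM, parse the class."""
    prompt = GEMBA_TEMPLATE.format(source_seg=src, target_seg=mt)
    response = call_llm(prompt).strip().strip('"')
    # Parse: match the response against the predefined class labels.
    for label, score in CLASS_TO_SCORE.items():
        if response.lower().startswith(label.lower()):
            return label, score
    return response, None  # response did not match any predefined class
```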
III Proposed Approach
Prompt engineering can be categorized into two types: one-step prompting and Chain-of-Thought (CoT) prompting. One-step prompting is the most basic and straightforward prompt type, similar to zero-shot learning. Essentially, models like ChatGPT and GPT-3 complete the given prompt, which is the primary mechanism behind their functionality. One-step prompting works like a simple question-answer process, and even an initial statement can suffice, as the AI model will always attempt to complete it.
In contrast, CoT prompting not only provides AI with context for completion, but also offers a “chain of thought” process that demonstrates how the correct answer to a question should be reached. This type of prompting encourages reasoning and can even improve arithmetical results, which AI language models sometimes struggle with. Furthermore, due to its step-by-step reasoning approach, CoT prompting is much more explainable.
Our KPE aims to improve segment-level QE performance with CoT prompting. We first design three one-step prompting metrics and then chain them into CoT metrics.
III-A KPE One-Step Prompting Metric
Drawing inspiration from SOTA segment-level QE systems such as CometKiwi and TeacherSim, we divide QE into three aspects: perplexity, token-level similarity, and sentence-level similarity, and design one one-step prompting metric for each.
The one-step perplexity QE formula is defined as follows:
$\text{score}_{\text{ppl}} = \text{LLM}(p_{\text{ppl}}(\text{mt})) \qquad (5)$

where $p_{\text{ppl}}$ is the prompt to estimate perplexity based only on mt, whereas $p_{\text{tok}}$ and $p_{\text{sent}}$ are the analogous prompts to estimate token-level similarity and sentence-level similarity.
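The sketch below illustrates the three one-step KPE metrics. The prompt wordings are illustrative placeholders rather than the paper's actual templates, which are selected manually from LLM-suggested candidates (cf. Section III-C); `call_llm` is again a hypothetical completion function.

```python
# Illustrative one-step prompt templates (placeholders, not the paper's exact wording).
P_PPL = (
    'Rate the fluency (perplexity) of the following sentence on a 1-5 scale.\n'
    'sentence: "{mt}"\nScore:'
)
P_TOK = (
    'Rate how well the words of the translation align with the words of the source '
    'on a 1-5 scale.\nsource: "{src}"\ntranslation: "{mt}"\nScore:'
)
P_SENT = (
    'Rate how well the overall meaning of the translation matches the source '
    'on a 1-5 scale.\nsource: "{src}"\ntranslation: "{mt}"\nScore:'
)


def one_step_scores(src, mt, call_llm):
    """Three independent one-step KPE metrics: perplexity (mt only),
    token-level similarity, and sentence-level similarity (src + mt)."""
    return {
        "ppl": call_llm(P_PPL.format(mt=mt)),
        "tok": call_llm(P_TOK.format(src=src, mt=mt)),
        "sent": call_llm(P_SENT.format(src=src, mt=mt)),
    }
```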
III-B KPE CoT Prompting Metric
KPE CoT prompting metrics are step-by-step prompting metrics based on three one-step prompting metrics.
$\text{score}_{\text{CoT1}} = \text{LLM}(p_{\text{ppl} \rightarrow \text{tok}}(\text{src}, \text{mt})), \quad \text{score}_{\text{CoT2}} = \text{LLM}(p_{\text{ppl} \rightarrow \text{tok} \rightarrow \text{sent}}(\text{src}, \text{mt})) \qquad (6)$

where $\text{score}_{\text{CoT1}}$ is quality estimation based on perplexity and token-level similarity, whereas $\text{score}_{\text{CoT2}}$ additionally incorporates sentence-level similarity.
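One plausible realization of these CoT metrics, sketched below under our own assumptions, packs the steps into a single prompt that walks the model through the one-step aspects before it emits a combined score; the actual templates are the manually edited ones described in Section III-C.

```python
COT1_PROMPT = (
    'Evaluate the machine translation step by step.\n'
    'Step 1: Judge the fluency (perplexity) of the translation.\n'
    'Step 2: Judge how well its words align with the source words.\n'
    'Finally, combine both aspects into one quality score from 1 to 5.\n'
    'source: "{src}"\nmachine translation: "{mt}"\n'
    'Reasoning and final score:'
)

COT2_PROMPT = (
    'Evaluate the machine translation step by step.\n'
    'Step 1: Judge the fluency (perplexity) of the translation.\n'
    'Step 2: Judge how well its words align with the source words.\n'
    'Step 3: Judge how well the overall sentence meaning is preserved.\n'
    'Finally, combine all aspects into one quality score from 1 to 5.\n'
    'source: "{src}"\nmachine translation: "{mt}"\n'
    'Reasoning and final score:'
)


def kpe_cot(src, mt, call_llm, template=COT1_PROMPT):
    """Chain-of-thought KPE: one prompt that elicits stepwise reasoning
    over the one-step aspects before a combined quality score."""
    return call_llm(template.format(src=src, mt=mt))
```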
III-C KPE Prompt Design
We follow the approach used in GEMBA for generating prompts. For the three one-step prompts and the two CoT prompts, we first have the system recommend five candidate prompt templates. We then manually select and edit these templates to create the optimal prompts. One-step prompts and CoT prompts are created like those in Figure 2.

IV Experiments
In this study, the proposed method is evaluated using the WMT18 metrics segment-level task data. For multilingual reference-free evaluation, the (source, candidate) pair is used as the metric input, in line with the QE setting.
IV-A Datasets
Our experiments employ the WMT18 news dataset as the evaluation dataset, containing 3,000 source sentences. The evaluation dataset also encompasses translated sentences from all teams that participated in WMT18. These target sentences are divided into 14 language pairs, such as DE-EN, FI-EN, and ZH-EN. Among them, DE-EN sentences are the largest in number (77,811), while CS-EN sentences are the smallest in number (5,110). Professional multilingual experts conduct pairwise comparisons on the target sentences before utilizing them as test data.
Language Pair | cs-en | de-en | et-en | fi-en | ru-en | tr-en | zh-en |
---|---|---|---|---|---|---|---|
RR_Systems | 5 | 16 | 14 | 9 | 8 | 5 | 14 |
Dev Datasize | 5110 | 77811 | 56712 | 15648 | 10404 | 5525 | 33357 |
IV-B Models
The comparison systems comprise several strong baseline systems, including:
- M-BERT (Pires et al., 2019) and LASER (Artetxe and Schwenk, 2019) as strong baselines based on multilingual pre-trained language models (PLMs).
- XMoverScore and XMoverScore-LM (Zhao et al., 2020) as strong baselines based on token-level similarity.
- TeacherSim and TeacherSim-LM (Yang et al., 2023) as strong baselines based on sentence-level similarity.
- GEMBA (Kocmi and Federmann, 2023) as the strong baseline based on one-step prompting and LLMs.
In addition, we tested our method, which includes (1) three single-step approaches for perplexity, token similarity, and sentence similarity, and (2) two CoT methods: perplexity + token similarity; perplexity + token similarity + sentence similarity.
IV-C Main Results
Using the WMT18 to-English datasets, we compare the similarity of source and translated sentences and evaluate estimation quality with the Kendall rank correlation coefficient. The results reveal the following:
- (1) One-step prompts achieve comparable performance, with Prompt1 (25.7%), Prompt2 (18.8%), and Prompt3 (17.1%) performing comparably to or better than traditional deep learning single models such as M-BERT (11.0%), LASER (15.0%), XMoverScore (15.0%), and TeacherSim (19.0%);
- (2) CoT1 (29.1%) and CoT2 (28.9%) outperform the combined method XMoverScore-LM (27.0%) and the LLM-based GEMBA (28.8%);
- (3) CoT1 (29.1%) achieves SOTA performance, surpassing CoT2 (28.9%) and TeacherSim-LM (29.0%), which indicates that adding more prompting steps does not necessarily improve performance.
model | category | de-en | cs-en | et-en | fi-en | ru-en | zh-en | tr-en | avg
---|---|---|---|---|---|---|---|---|---
M-BERT | PLMs | 23.0% | 1.0% | 18.0% | 12.0% | 10.0% | 8.0% | 4.0% | 11.0% |
LASER | PLMs | 32.0% | 7.0% | 25.0% | 16.0% | 10.0% | 6.0% | 9.0% | 15.0% |
XMoverScore | Token | 28.0% | 8.0% | 21.0% | 15.0% | 15.0% | 12.0% | 9.0% | 15.0% |
XMoverScore-LM | Token + PLMs | 46.0% | 29.0% | 23.0% | 32.0% | 16.0% | 19.0% | 16.0% | 27.0% |
TeacherSim | Sentence | 17.0% | 13.0% | 17.0% | 23.0% | 26.0% | 15.0% | 21.0% | 19.0% |
TeacherSim-LM | Sentence+PLMs | 45.0% | 31.0% | 33.0% | 24.0% | 28.0% | 17.0% | 22.0% | 29.0% |
GEMBA | One Step LLMs | 46.3% | 32.1% | 32.9% | 23.9% | 28.3% | 16.7% | 20.7% | 28.8% |
Prompt1(Perplexity) | One Step LLMs | 43.2% | 31.1% | 29.9% | 18.8% | 25.9% | 18.5% | 12.3% | 25.7% |
Prompt2(Token) | One Step LLMs | 35.5% | 13.0% | 26.3% | 16.4% | 17.2% | 6.7% | 16.6% | 18.8% |
Prompt3(Sentence) | One Step LLMs | 32.9% | 14.0% | 24.3% | 14.7% | 14.2% | 4.9% | 14.7% | 17.1% |
CoT1(ppl+token) | CoT LLMs | 47.1% | 33.4% | 33.8% | 22.6% | 28.7% | 19.1% | 19.3% | 29.1% |
CoT2(ppl + token + sent) | CoT LLMs | 46.7% | 32.1% | 33.8% | 23.9% | 28.4% | 17.3% | 20.4% | 28.9% |



IV-D Scorer Analysis
Based on GEMBA’s analysis, the scoring methods for QE can be divided into scalar scoring, 5-star scoring, and 5-category scoring. For segment-level evaluation, LLMs do not have high accuracy in scalar scoring, while 5-star and 5-category scoring methods perform better.
In order to analyze the scoring capabilities of large models, we further examined the differences in scoring quality between 3-category and 5-category methods. We found that as machine translation quality improves, the 3-category scorer tends to place more than 30% of segments in the neutral category, while the 5-category scorer is far more discriminative, as shown in Figure 4.

IV-E Explainable Analysis
We also attempted to use large models for explainable analysis of QE in a similar manner to BERTScore or TeacherSim. We conducted visualization experiments for token-level alignment, calculating the similarity score for each token in the candidate sentence with each token in the reference sentence. This paper compares token-level alignment visualization for the multilingual pre-trained model XLM-R, the TeacherSim model, and KPE.
A typical case in Fig. 3(a) shows that the similarity scores of all words under the multilingual pre-trained model XLM-R exceed 90%, making them difficult to differentiate. The token-level similarity distribution based on TeacherSim, shown in Fig. 3(b), is considerably uneven. Although it is much more accurate than the multilingual pre-trained model, it still exhibits probability leakage, aligning tokens to the last token, e.g. ("the", ".") with an alignment probability of 91.5%.
Compared with these models, KPE's token alignment avoids both the issue of alignment probabilities exceeding 90% in XLM-R and the probability-leakage phenomenon in TeacherSim. For instance, the alignment probability of ("the", ".") is 1%, while that of (".", ".") is 95%, as shown in Fig. 3(c). The accuracy and explainability of KPE's token alignment are significantly better than those of previous models.
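For reference, the snippet below is a minimal sketch of this kind of token-alignment heatmap, built from contextual embeddings and cosine similarity in the style of the XLM-R baseline in Fig. 3(a). KPE's own alignment probabilities come from LLM prompting, so this embedding-based matrix only illustrates the visualization, not the KPE scores; the model name is an assumption.

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer


def token_alignment_heatmap(src: str, mt: str, model_name: str = "xlm-roberta-base"):
    """Plot a token-token cosine-similarity matrix between a source sentence
    and its machine translation (BERTScore-style alignment visualization)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    with torch.no_grad():
        src_inputs = tok(src, return_tensors="pt")
        mt_inputs = tok(mt, return_tensors="pt")
        src_emb = model(**src_inputs).last_hidden_state[0]   # (n_src, dim)
        mt_emb = model(**mt_inputs).last_hidden_state[0]     # (n_mt, dim)
        src_emb = torch.nn.functional.normalize(src_emb, dim=-1)
        mt_emb = torch.nn.functional.normalize(mt_emb, dim=-1)
        sim = (mt_emb @ src_emb.T).numpy()                    # cosine similarities

    plt.imshow(sim, cmap="viridis", vmin=0.0, vmax=1.0)
    plt.xticks(range(sim.shape[1]),
               tok.convert_ids_to_tokens(src_inputs["input_ids"][0]), rotation=90)
    plt.yticks(range(sim.shape[0]),
               tok.convert_ids_to_tokens(mt_inputs["input_ids"][0]))
    plt.xlabel("source tokens")
    plt.ylabel("translation tokens")
    plt.colorbar()
    plt.tight_layout()
    plt.show()
```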
V Conclusion
In this paper, we address the issue of low segment-level accuracy in quality estimation with single-step prompts. We propose KPE, a QE method based on LLMs and CoT prompting, which focuses on three core dimensions of quality estimation: perplexity, token similarity, and sentence similarity. Experimental results show that the CoT-based QE system outperforms both previous deep learning systems and single-step large-model estimation methods in segment-level quality assessment. Moreover, KPE's token alignment visualization experiments demonstrate its clear superiority over multilingual pre-trained models and specialized sentence-level QE systems in terms of explainability.
As for future research directions, on the one hand, we will attempt to fine-tune LLMs of various sizes, such as 6B, 13B, or 65B LLaMA models, to explore the upper limit of large models in performing QE. On the other hand, for language pairs with lower Kendall coefficients, such as zh-en, we will try to incorporate knowledge from knowledge graphs to improve the metrics for the relevant language pairs.
References
- Artetxe and Schwenk (2019) Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610, 2019.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Callison-Burch et al. (2011) Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the sixth workshop on statistical machine translation, pages 22–64, 2011.
- Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720, 2022.
- Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint arXiv:2302.09210, 2023.
- Jiao et al. (2023) Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. Is ChatGPT a good translator? Yes with GPT-4 as the engine, 2023.
- Kasirzadeh (2023) Atoosa Kasirzadeh. ChatGPT, large language technologies, and the bumpy road of benefiting humanity, 2023.
- Kocmi and Federmann (2023) Tom Kocmi and Christian Federmann. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520, 2023.
- Kocmi et al. (2021) Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. arXiv preprint arXiv:2107.10821, 2021.
- OpenAI (2023) OpenAI. GPT-4 technical report, 2023.
- Pires et al. (2019) Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502, 2019.
- Rei et al. (2022) Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C. Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, 2022.
- Tao et al. (2022) Shimin Tao, Su Chang, Ma Miaomiao, Hao Yang, Xiang Geng, Shujian Huang, Min Zhang, Jiaxin Guo, Minghan Wang, and Yinglu Li. CrossQE: HW-TSC 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 646–652, 2022.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023.
- Vilar et al. (2022) David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. Prompting PaLM for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102, 2022.
- Yang et al. (2023) Hao Yang, Min Zhang, Shimin Tao, Miaomiao Ma, Ying Qin, and Daimeng Wei. TeacherSim: Cross-lingual machine translation evaluation with monolingual embedding as teacher. In 2023 25th International Conference on Advanced Communication Technology (ICACT), pages 283–287. IEEE, 2023.
- Zhang et al. (2023) Min Zhang, Hao Yang, Yanqing Zhao, Xiaosong Qiao, Shimin Tao, Song Peng, Ying Qin, and Yanfei Jiang. Implicit cross-lingual word embedding alignment for reference-free machine translation evaluation. IEEE Access, 2023.
- Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
- Zhao et al. (2019) Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, August 2019. Association for Computational Linguistics.
- Zhao et al. (2020) Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.151. URL https://aclanthology.org/2020.acl-main.151.