cushLEPOR: customising hLEPOR metric using Optuna for higher agreement with human judgments or pre-trained language model LaBSE
Abstract
Human evaluation has always been expensive, while researchers struggle to trust automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and the limited available human-labelled scores. We first re-introduce the hLEPOR metric factors, followed by the Python version we developed (ported), which achieves automatic tuning of the weighting parameters in the hLEPOR metric. We then present customised hLEPOR (cushLEPOR), which uses the Optuna hyper-parameter optimisation framework to fine-tune the hLEPOR weighting parameters towards better agreement with pre-trained language models (using LaBSE) for the exact MT language pairs on which cushLEPOR is deployed. We also optimise cushLEPOR towards professional human evaluation data based on the MQM and pSQM frameworks for the English-German and Chinese-English language pairs. The experimental investigations show that cushLEPOR boosts hLEPOR towards better agreement with PLMs such as LaBSE at much lower cost, and towards better agreement with human evaluations including MQM and pSQM scores, and yields much better performance than BLEU (data available at https://github.com/poethan/cushLEPOR). Official results show that our submissions win three language pairs: English-German and Chinese-English on the News domain via cushLEPOR(LM), and English-Russian on the TED domain via hLEPOR.
1 Introduction
Machine Translation (MT) is a rapidly developing research field that plays an important role in NLP. MT started in the 1950s as one of the earliest artificial intelligence (AI) research topics, and output quality for high-resource language pairs has improved greatly since the introduction of Neural MT (NMT) in recent years Kalchbrenner and Blunsom (2013); Cho et al. (2014); Bahdanau et al. (2014). However, achieving human parity in MT output remains a challenge Han et al. (2021a). Thus MT evaluation (MTE) continues to play an important role in aiding MT development, both by providing timely and high-quality evaluations and by identifying the translation errors that MT systems can learn from for further improvement Han et al. (2021b). On one hand, human evaluations have long been criticised as expensive and unrepeatable. Furthermore, the inter- and intra-annotator agreement of human raters may struggle to reach a consistent and reliable level, unless the evaluation is done rigorously with highly trained and skilled evaluators Alekseeva et al. (2021). On the other hand, even though researchers have claimed that automatic evaluation metrics achieve much better performance at system-level evaluation of MT outputs, with high correlation to human judgements, segment-level performance still falls far short of human experts' expectations Freitag et al. (2021); Barrault et al. (2019, 2020); Han et al. (2013a); Macháček and Bojar (2013).
In the meantime, many pre-trained language models have been proposed and developed in recent years, showing clear advantages on different NLP tasks, for instance BERT Devlin et al. (2019) and its further developed variants Feng et al. (2020). In this work, we take advantage of both a high-performing automatic metric and a pre-trained language model, aiming to move one step closer to a higher-quality automatic MT evaluation metric from both the system-level and segment-level perspectives.
Among the evaluation metrics developed in recent years, hLEPOR Han et al. (2013b, a) is an augmented metric that includes many evaluation factors with tunable weights, covering precision, recall, word order (via a position difference factor), and sentence length. It has also been applied by researchers in different NLP fields, including natural language generation (NLG) Novikova et al. (2017); Gehrmann et al. (2021), natural language understanding (NLU) Ruder et al. (2021), automatic text summarisation (ATS) Bhandari et al. (2020), and search Liu et al. (2021), in addition to MT evaluation Marzouk (2021).
However, the original hLEPOR has the disadvantage that its parameter weights must be tuned manually, which takes considerable human effort. We choose hLEPOR Han et al. (2013b, a) as our baseline model and use the recent language model LaBSE Feng et al. (2020) to achieve automatic tuning of its parameters, aiming at reducing the evaluation cost and further boosting performance. This system description paper is based on our earlier work, especially the training models Erofeev et al. (2021).
The rest of the paper is organised as follows: Section 2 revisits the hLEPOR metric, its factors, advantages and disadvantages; Section 3 introduces our Python-ported version of hLEPOR and the further customised hLEPOR (cushLEPOR) using language models; Section 4 presents the experimental development and evaluation we carried out on the cushLEPOR metric using WMT historical data; Section 5 describes our submission to this year's WMT21 metrics task; and Section 6 concludes the paper with a discussion of our findings and possible future work.
2 Revisiting hLEPOR
hLEPOR is a further developed variant of the LEPOR metric Han et al. (2012), first proposed in 2013, which includes all the evaluation factors from LEPOR but uses a harmonic mean to group the factors into the final score Han et al. (2013b). Its submission to the WMT2013 metrics task achieved the highest system-level average correlation with human judgement on English-to-other (French, Spanish, Russian, German, Czech) language pairs, with a Pearson correlation coefficient of 0.854 Han (2014); Macháček and Bojar (2013). Other MT researchers have also identified the LEPOR metric variant as one of the best-performing segment-level metrics, not significantly outperformed by other metrics on WMT shared task data Graham et al. (2015). hLEPOR is calculated by:
\[
hLEPOR = Harmonic(w_{LP}\,LP,\; w_{NPosPenal}\,NPosPenal,\; w_{HPR}\,HPR)
= \frac{w_{LP}+w_{NPosPenal}+w_{HPR}}{\frac{w_{LP}}{LP}+\frac{w_{NPosPenal}}{NPosPenal}+\frac{w_{HPR}}{HPR}},
\]
where LP is a sentence length penalty factor, extended from the brevity penalty utilised in the BLEU metric:
\[
LP =
\begin{cases}
\exp\!\big(1-\frac{r}{c}\big) & \text{if } c < r,\\
1 & \text{if } c = r,\\
\exp\!\big(1-\frac{c}{r}\big) & \text{if } c > r,
\end{cases}
\]
with $c$ and $r$ the lengths of the hypothesis and reference sentences. NPosPenal is the n-gram position difference penalty, which captures the word order information, as below:
\[
NPosPenal = \exp(-NPD), \quad
NPD = \frac{1}{Length_{hyp}}\sum_{i=1}^{Length_{hyp}} |PD_i|, \quad
PD_i = |MatchN_{hyp} - MatchN_{ref}|,
\]
where $MatchN_{hyp}$ and $MatchN_{ref}$ indicate the (length-normalised) position numbers of the matched words in the hypothesis and reference sentences. The factor HPR is the weighted harmonic mean of Precision and Recall:
\[
HPR = Harmonic(\alpha R, \beta P) = \frac{\alpha+\beta}{\frac{\alpha}{R}+\frac{\beta}{P}}, \quad
P = \frac{Aligned_{num}}{Length_{hyp}}, \quad
R = \frac{Aligned_{num}}{Length_{ref}}.
\]
We refer readers to Han (2014) and Han et al. (2013a, 2012) for detailed factor calculations with worked examples.
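To make the composition of the final score concrete, the following is a minimal, illustrative Python sketch of the word-level calculation. It uses naive unigram matching with nearest-position alignment rather than the n-word context-window alignment of the published metric, so it is a simplification for exposition rather than a reference implementation:

import math

def length_penalty(hyp_len, ref_len):
    # LP: sentence length penalty (extended brevity penalty).
    if hyp_len < ref_len:
        return math.exp(1 - ref_len / hyp_len)
    if hyp_len > ref_len:
        return math.exp(1 - hyp_len / ref_len)
    return 1.0

def hlepor_sketch(hyp, ref, alpha=9.0, beta=1.0,
                  weight_elp=2.0, weight_pos=1.0, weight_pr=7.0):
    """Simplified word-level hLEPOR: weighted harmonic mean of LP,
    NPosPenal and HPR.  Alignment here is naive nearest-position
    unigram matching; the published metric additionally uses an
    n-word context window when resolving ambiguous matches."""
    hyp_tokens, ref_tokens = hyp.lower().split(), ref.lower().split()
    hyp_len, ref_len = len(hyp_tokens), len(ref_tokens)
    if hyp_len == 0 or ref_len == 0:
        return 0.0

    used, matches, pos_diffs = set(), 0, []
    for i, tok in enumerate(hyp_tokens):
        candidates = [j for j, r in enumerate(ref_tokens) if r == tok and j not in used]
        if not candidates:
            continue
        # Pick the unused reference occurrence closest in (normalised) position.
        j = min(candidates, key=lambda j: abs(i / hyp_len - j / ref_len))
        used.add(j)
        matches += 1
        pos_diffs.append(abs(i / hyp_len - j / ref_len))

    # NPosPenal: n-gram position difference penalty.
    npos_penal = math.exp(-sum(pos_diffs) / hyp_len)

    # HPR: weighted harmonic mean of precision and recall.
    precision, recall = matches / hyp_len, matches / ref_len
    if precision == 0.0 or recall == 0.0:
        return 0.0
    hpr = (alpha + beta) / (alpha / recall + beta / precision)

    # Final score: weighted harmonic mean of the three factors.
    factors = [(length_penalty(hyp_len, ref_len), weight_elp),
               (npos_penal, weight_pos),
               (hpr, weight_pr)]
    return sum(w for _, w in factors) / sum(w / f for f, w in factors)

print(hlepor_sketch("this time the comet did not strike the earth",
                    "the comet did not strike the earth this time"))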
The basic version of hLEPOR carries out a similarity calculation between MT system outputs and reference translations in the same language, based on word-surface-level tokens. The hybrid hLEPOR metric additionally carries out a similarity calculation based on the POS sequences of the system output and reference text. To do this, POS tagging is needed as a first step; the hLEPOR(POS) calculation then uses the same algorithm as the word-level similarity score hLEPOR(word). Finally, hybrid hLEPOR is a combination of the word-level and POS-level scores. In this system submission, given the time limitations and to keep the customised hLEPOR easy to use, we take the basic version of hLEPOR, i.e. the word-level similarity calculation, and leave hybrid hLEPOR to future work.
The weighting parameters for the three main factors in the original hLEPOR metric, i.e. the ($w_{LP}$, $w_{NPosPenal}$, $w_{HPR}$) set, in addition to the other parameters inside each factor, were tuned manually on development data (see the Appendix for the detailed parameter value sets for each language pair for word-surface-level evaluation, EN<=>CS/FR/DE/ES/RU). This is very time-consuming, tedious, and costly. In this work, we introduce an automated tuning model for hLEPOR that customises it for the deployed language pairs, which we name cushLEPOR.
3 Proposed Model
3.1 Python port of hLEPOR
The original hLEPOR was published as Perl code (https://github.com/poethan/LEPOR), in a non-portable format that is not well suited to modern AI/NLP applications, since these are written almost exclusively in Python. Python is the programming language of choice for AI and machine learning (ML) tasks, thanks to its rich ecosystem of open-source and freely available libraries for researchers and developers. However, hLEPOR was not available in NLTK Bird et al. (2009) or any other public Python library. We therefore took the originally published Perl code and ported it to Python, carefully comparing the logic of the original paper and the Perl implementation. During this work we ran both the Perl code, to reproduce the results of the original implementation, and the new Python implementation. This helped us to spot and fix at least three minor errors which did not significantly affect the score, but which we nevertheless fixed in the Perl code.
While doing the porting we also noticed that the hLEPOR parameter values were chosen empirically and never explained in detail, except for the suggested parameter setting table in the papers Han et al. (2013b, a) for the eight language pairs tested in the WMT2013 shared task, including EN-CS/DE/FR/ES and the opposite directions. The parameters are:
• alpha: the tunable weight for recall
• beta: the tunable weight for precision
• n: the number of words considered before and after the matched word in the NPD calculation
• weight_elp: the tunable weight of the enhanced length penalty
• weight_pos: the tunable weight of the n-gram position difference penalty
• weight_pr: the tunable weight of the harmonic mean of precision and recall
The parameter values for hLEPOR as published in the publicly available Perl code were manually tuned for the English-to-Czech/Russian (EN=>CS/RU) language pair setting Han et al. (2013b, a) as below:
• alpha = 9.0
• beta = 1.0
• n = 2
• weight_elp = 2.0
• weight_pos = 1.0
• weight_pr = 7.0
We refer to our Appendix for the manually tuned parameters for the other language pairs given in Han et al. (2013a, b), including English=>French/German (EN=>FR/DE) and Czech/Spanish/French/German=>English (CS/ES/FR/DE=>EN), as well as Russian=>English (RU=>EN), which was set up using the CS=>EN values without extra manual tuning. We came to the conclusion that we needed to check whether these parameters are optimal, and to find out whether a better set of values exists that improves agreement with human judgement.
Because each language, and each language family, has different characteristics, the evaluation of MT outputs should emphasise different factors. For instance, the word order factor reflected by the n-gram position difference penalty in hLEPOR (NPosPenal) may deserve a higher weight for strict word order languages and a lower weight for languages with loose or flexible word order. We therefore assumed that optimising hLEPOR for different languages would generate a correspondingly different set of parameter values for each. We call this step language-specific optimisation; making it an automatic tuning process saves a large amount of cost and time. The Python-ported hLEPOR is available on PyPI at https://pypi.org/project/hLepor/.
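For illustration, a usage sketch of the Python port is given below; the function name and keyword arguments shown are assumptions about the package interface rather than a documented API, so the exact signature should be checked against the documentation on the PyPI page above:

# Hypothetical usage of the Python port (pip install hLepor); the function
# name and keyword arguments below are assumptions -- consult the package
# documentation on PyPI for the exact API.
from hlepor import hlepor_score

references = ["the comet did not strike the earth this time"]
hypotheses = ["this time the comet did not strike the earth"]

# EN=>DE default parameter values from the Appendix.
score = hlepor_score(references, hypotheses,
                     alpha=9.0, beta=1.0, n=2,
                     weight_elp=3.0, weight_pos=7.0, weight_pr=1.0)
print(score)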
3.2 Customised hLEPOR: cushLEPOR
With the recent development of pre-trained neural language models and their effective application to different NLP tasks, including question answering, language inference and MT, it is natural to ask why we do not apply them to MT evaluation as well.
Very recent work from a Google team verified that MQM (Multidimensional Quality Metrics) Lommel et al. (2014) and SQM (Scalar Quality Metrics) Freitag et al. (2021) agree well with each other when both are carried out by professional translators. However, these scores do not correlate well with Mechanical Turk based crowd-sourced human evaluation carried out by general researchers or untrained online workers with limited professional linguistic skills. The study also showed that crowd-sourced evaluation tends to favour very literal translations over better translations with more diverse, meaning-equivalent lexical choices.
To customise hLEPOR (cushLEPOR) towards an optimised parameter setting for the deployed language pairs, we choose the Optuna open-source hyper-parameter optimisation framework Akiba et al. (2019) to automate the hyper-parameter search for the best agreement between cushLEPOR and human expert evaluation, wherever such a data set is available.
SQM Freitag et al. (2021) borrows the WMT shared task settings to collect segment-level scalar ratings, but sets the score scale from 0 to 6 instead of 0 to 100. SQM scores labelled by professional translators are referred to as pSQM.
We aim at optimising the cushLEPOR parameters to obtain the best agreement with pSQM scores. In practice, however, human evaluations are often not feasible to obtain because of time and financial constraints.
We therefore propose an alternative optimisation model, i.e. customising the cushLEPOR parameters towards the similarity scores of a large-scale pre-trained language model, e.g. the LaBSE (Language-Agnostic BERT Sentence Embedding) model.
The LaBSE model is built on the BERT (Bidirectional Encoder Representations from Transformers) architecture and trained on filtered and processed monolingual and bilingual data. The resulting sentence embeddings achieve excellent performance on measures of sentence embedding quality such as the semantic textual similarity (STS) benchmark and on sentence-embedding-based transfer learning Feng et al. (2020).
The LaBSE similarity score identifies matching translations very well. Its disadvantages, however, are the high demand for computational resources (GPUs), the need for non-trivial application code and ML skills, and slow performance.
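As an illustrative sketch, LaBSE similarity between a hypothesis and its reference can be obtained as the cosine similarity of their sentence embeddings, for example via the sentence-transformers distribution of the model (one convenient way to run it; the original release is also available through TensorFlow Hub):

# Sketch: LaBSE similarity between MT hypotheses and their references,
# using the sentence-transformers distribution of the model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

hypotheses = ["The comet did struck the Earth this time."]
references = ["The comet did not strike the Earth this time."]

hyp_emb = model.encode(hypotheses, convert_to_tensor=True, normalize_embeddings=True)
ref_emb = model.encode(references, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity of each hypothesis with its corresponding reference.
labse_scores = util.cos_sim(hyp_emb, ref_emb).diagonal()
print(labse_scores)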
The idea of using an optimised hLEPOR (cushLEPOR) in lieu of LaBSE similarity is to develop a simple, high-performing, easy-to-run and computationally undemanding script that achieves results similar to the high-end LaBSE similarity score, and ideally closer to human judgement. The cushLEPOR parameters can be optimised for agreement with any type of score, such as pSQM, MQM, or LaBSE.
For the optimisation stage using Optuna, the task is to find the extrema of a continuous (not discrete) surface in the 6-dimensional space of the six cushLEPOR parameters. The parameter values change continuously, so there is an infinite number of possible parameter sets; at the same time the surface is not mathematically differentiable and may contain gaps, and in general we cannot presume that it is smooth. Before Optuna, computational tools typically handled such cases by discretisation, deploying a discrete mesh over the parameter space, which was less computationally intensive than a full-scale continuous search over all possible parameter values.
The Optuna framework is currently one of the best implementations of the Tree-structured Parzen Estimator (TPE); this kind of estimator typically converges to a good solution within 200-300 trials, and the method can work with continuous (real-valued) parameters Bergstra et al. (2011).
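A minimal sketch of such an optimisation loop is shown below. Here hyps, refs and target_scores are assumed to be pre-loaded, aligned lists of hypothesis segments, reference segments and target scores (LaBSE similarity or pSQM/MQM); cushlepor_segment_score is a hypothetical stand-in for the per-segment scorer (e.g. the Python port above); and the search ranges are illustrative rather than the exact ones we used:

import math
import optuna

def objective(trial):
    # Candidate values for the six cushLEPOR parameters (ranges are illustrative).
    params = {
        "alpha": trial.suggest_float("alpha", 0.1, 15.0),
        "beta": trial.suggest_float("beta", 0.1, 15.0),
        "n": trial.suggest_int("n", 1, 5),
        "weight_elp": trial.suggest_float("weight_elp", 0.1, 15.0),
        "weight_pos": trial.suggest_float("weight_pos", 0.1, 15.0),
        "weight_pr": trial.suggest_float("weight_pr", 0.1, 15.0),
    }
    # Score every development segment with the candidate parameters
    # (cushlepor_segment_score is a stand-in for the per-segment scorer).
    preds = [cushlepor_segment_score(h, r, **params) for h, r in zip(hyps, refs)]
    # Minimise RMSE between cushLEPOR and the target scores.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, target_scores)) / len(preds))

study = optuna.create_study(direction="minimize")  # TPE sampler is the default
study.optimize(objective, n_trials=300)
print(study.best_params)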
4 Experimental Evaluations
The training and development data we used for MQM scores and pSQM labels come from the recent work by the Google Research team investigating human evaluations of the WMT2020 shared task submissions Freitag et al. (2021) (data available at https://github.com/google/wmt-mqm-human-evaluation).
We first focus on the English-to-German (EN-DE) language pair, for which MQM and pSQM labels are available for 10 submissions to WMT2020, and then take the Chinese-to-English (ZH-EN) data set. We refer to Freitag et al. (2021) for the detailed MT system names and the institutions that submitted them.
Firstly, a multi-parameter optimisation against LaBSE for the EN-DE language pair gave the following values for the cushLEPOR parameters:
• alpha = 2.97
• beta = 1.97
• n = 4
• weight_elp = 1.0
• weight_pos = 14.97
• weight_pr = 2.2
This set of values reflects a very different weighting scheme in comparison to the original hLEPOR metric. For instance, 1) cushLEPOR assigned recall and precision much closer weights (2.97 vs 1.97) than hLEPOR (9.0 vs 1.0); 2) cushLEPOR chose 4-gram chunk matching instead of the bi-gram matching used in hLEPOR; 3) cushLEPOR assigned the NPosPenal (n-gram position difference penalty) factor a very heavy weight relative to the other two factors LP (length penalty) and HPR (harmonic mean of precision and recall) (14.97 vs 1.0 and 2.2), whereas hLEPOR placed the emphasis on HPR (1.0 vs 2.0 and 7.0). From this point of view, cushLEPOR trained on the EN-DE language pair indicates the importance of considering a larger window of context during word matching, as well as of the word order information reflected by the n-gram size (n value) and the novel NPosPenal factor introduced by hLEPOR.
This also indicates that LaBSE similarity is indeed a feasible optimisation target for cushLEPOR. The correlations of hLEPOR and cushLEPOR with LaBSE are shown in Figures 1 and 2.
[Figure 1: Correlation of hLEPOR scores with LaBSE similarity (EN-DE).]
[Figure 2: Correlation of cushLEPOR scores with LaBSE similarity (EN-DE).]
However, we found that we could not reduce the RMSE (Root Mean Square Error) of cushLEPOR against pSQM by much in comparison to the original hLEPOR (0.28 vs 0.29), which indicates that the original hLEPOR, with the suggested parameter settings for EN-DE Han et al. (2013a, b) listed below, already fits pSQM-type human evaluation very well empirically.
• alpha = 9.0
• beta = 1.0
• n = 2
• weight_elp = 3.0
• weight_pos = 7.0
• weight_pr = 1.0
The RMSE values of hLEPOR and cushLEPOR against pSQM are shown in Fig. 3. cushLEPOR nevertheless shows much better performance than the BLEU metric, as shown in Fig. 4 (0.28 vs 0.46).
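For reference, the RMSE reported throughout between a metric's segment scores $m_i$ and the target scores $t_i$ (here pSQM) over $N$ segments is the standard
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(m_i - t_i)^2}.
\]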
[Figure 3: RMSE of hLEPOR and cushLEPOR against pSQM (EN-DE).]
[Figure 4: RMSE of BLEU and cushLEPOR against pSQM.]
Optuna optimised cushLEPOR against LaBSE very well, halving the RMSE between LaBSE and cushLEPOR compared to the original hLEPOR, as shown in Fig. 5.
[Figure 5: RMSE of hLEPOR and cushLEPOR against LaBSE.]
The results of tuning on LaBSE and on pSQM are shown in Figures 6 and 7 respectively. The horizontal axis is the score value (0, 1) and the vertical axis is the number of sentences that fall into the corresponding score interval.
The score distribution visualisation shows that tuning on pSQM covers a wider range of error types, while LaBSE is less sensitive to some errors that human experts would spot. As shown in these charts, the pSQM human ratings have a much wider "tail" of low scores, while the LaBSE ratings are much more concentrated. The reason is that the LaBSE similarity model underestimates the severity of errors and error types, while humans analyse the meaning and assign appropriate error penalties in a more nuanced way. As an example, the sentences "The comet did not struck the Earth this time." and "The comet did struck the Earth this time." have very high lexical similarity, but their meanings are very different, in this case opposite. A LaBSE similarity score would not assign a significant penalty to such a difference, while a human will treat it as a major error. This difference plays a crucial role in reliable translation quality evaluation.
[Figure 6: Score distribution when tuning on LaBSE.]
[Figure 7: Score distribution when tuning on pSQM.]
5 Submission to WMT21
For the WMT2021 Metrics Task, we submitted our cushLEPOR system scores for the zh=>en and en=>de language pairs, for both segment-level and system-level evaluation. The training and development sets we used are exactly the ones described in the previous section (Section 4). We could not tune the cushLEPOR model parameters for the en=>ru language pair on the WMT21 official data, because the human-labelled MQM and pSQM scores that cushLEPOR requires as validation data do not exist in last year's WMT20 set. Instead, we submitted the hLEPOR metric for EN=>RU using the hLEPOR parameter settings mentioned in the previous section. We carried out evaluation on all four official data sets: newstest2021 (traditional task), florestest2021 (sentences translated as part of the WMT News translation task), tedtalks (additional sets of sentences translated by WMT21 translation systems in the TED talks domain), and challengeset (synthetic outputs generated specifically to challenge automatic metrics).
5.1 Submitted Parameter Setting
The optimised parameter values for our zh=>en submission to WMT21 are displayed below.
For cushLEPOR(LM) using LaBSE training:
• alpha = 2.85
• beta = 4.73
• n = 1
• weight_elp = 1.01
• weight_pos = 11.13
• weight_pr = 4.62
For cushLEPOR(pSQM) using professional translator labelled SQM training:
• alpha = 9.09
• beta = 3.55
• n = 3
• weight_elp = 1.01
• weight_pos = 14.98
• weight_pr = 1.57
The optimised parameter values for our en=>de submission to WMT21 are displayed below:
For cushLEPOR(LM) using LaBSE training:
• alpha = 2.95
• beta = 2.68
• n = 2
• weight_elp = 1.0
• weight_pos = 11.79
• weight_pr = 1.87
For cushLEPOR(pSQM) using professional translator labelled SQM training:
• alpha = 1.13
• beta = 1.71
• n = 2
• weight_elp = 1.06
• weight_pos = 11.90
• weight_pr = 1.01
5.2 Official Results from Metrics Task
The official results of the WMT2021 Metrics task show that cushLEPOR(LM) ranks in the first cluster on the News test data with a single reference, evaluated overall on English-to-German, Chinese-to-English and English-to-Russian, where professional human evaluation data is available (see Table 8, "Metric rankings based on pairwise accuracy", in the findings paper Freitag et al. (2021)). Furthermore, in the language-specific rankings, cushLEPOR(LM) also wins the English-to-German and Chinese-to-English language pairs, including the TED data condition. Our hLEPOR baseline metric wins the English-to-Russian TED domain language-specific ranking (see Table 12, "Summary of language-specific results", in the official findings paper Freitag et al. (2021)). The official result on "System-level Pearson correlations for English-to-German" (Table 23 of the findings) shows that cushLEPOR(LM) achieves a score of 0.938 in the News domain, ranking number 1 among the Cluster 1 metrics, out of 29 metric submissions overall.
6 Discussions and Future Work
In this work, we described cushLEPOR, a customised hLEPOR metric that can be automatically trained and optimised using both human-labelled MQM scores and the large-scale pre-trained language model (LM) LaBSE, towards better agreement with expert-level human judgement and distilled LM performance respectively, while reducing cost at the same time, e.g. avoiding the manual tuning required by hLEPOR and the high computational demand of LMs.
We also optimised cushLEPOR towards human translators' evaluation scores, i.e. pSQM, which showed much better performance than BLEU and the original hLEPOR (with default parameters). Our research is in line with the MT evaluation guidelines from the very recent work of Marie et al. (2021), which suggest that evaluation metrics with better correlation to human judgement should be tested and deployed, or that human judgements should be carried out directly wherever possible.
We have several findings from the experimental investigation: 1) cushLEPOR trained on LaBSE can replace LaBSE for the similarity calculation task in MT evaluation, and is much more lightweight and lower-cost in terms of computational power and complexity; 2) we can choose alternative pre-trained language models (LMs) in the future to boost performance; 3) the cushLEPOR optimisation framework proves functional, offering high performance towards pre-trained LMs, with much improved agreement between cushLEPOR and LaBSE scores in comparison to hLEPOR (as in Figures 1 and 2); 4) the optimised cushLEPOR achieves better agreement with professional translators' evaluations (pSQM).
Optuna, the hyper-parameter optimisation toolkit we used, can generate a different set of cushLEPOR parameter values in different runs, which could be seen as a consistency issue. However, its goal is to optimise the performance of cushLEPOR towards the highest agreement with the reference scoring (pre-trained LMs or human evaluations), not to guarantee that the same set of parameter values is generated each time, so we do not consider this a problem. We will carry out further analysis of this aspect in future work.
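As a side note, if run-to-run variation in the searched parameter values is a concern, a single Optuna study can be made repeatable by fixing the sampler seed (a sketch; this fixes the search trajectory for a given study, it does not guarantee identical parameters across different data):

import optuna

# Fixing the TPE sampler seed makes the search trajectory of a study reproducible.
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="minimize", sampler=sampler)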
The hybrid version of hLEPOR Han et al. (2013b) uses POS features as pseudo-synonyms to capture alternative correct translations. However, it relies on a POS tagger for the target language, which may not exist for some languages, may have low tagging accuracy, and adds extra processing steps. In future work, we plan to build an integrated model that includes POS tagging as a pre-processing function for hybrid cushLEPOR.
Overall, cushLEPOR achieved first-cluster performance on the News domain data for Chinese-English and English-German in the WMT2021 Metrics task, while hLEPOR wins the TED domain data on English-Russian Freitag et al. (2021). In future work, we plan to optimise cushLEPOR for more language pairs as well as more domains. We will keep the updated parameter sets for additional languages and domains available on our open-source cushLEPOR platform.
Acknowledgements
The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. We thank the following open-source teams: Google Research for sharing their human evaluation data on MQM and pSQM scores, Optuna open-source hyper-parameter optimisation framework, and LaBSE pre-trained LMs.
References
- Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. CoRR, abs/1907.10902.
- Alekseeva et al. (2021) Alexandra Alekseeva, Serge Gladkoff, Irina Sorokina, and Lifeng Han. 2021. Monte Carlo modelling of confidence intervals in translation quality evaluation (TQE) and post-editing distance (PED) measurement. In Metrics 2021: Workshop on Informetric and Scientometric Research (SIG-MET), 23-24 Oct 2021. Association for Information Science and Technology.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
- Barrault et al. (2020) Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.
- Barrault et al. (2019) Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
- Bergstra et al. (2011) James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, page 2546–2554, Red Hook, NY, USA. Curran Associates Inc.
- Bhandari et al. (2020) Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. 2020. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, Online. Association for Computational Linguistics.
- Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. O’Reilly Media Inc.
- Cho et al. (2014) KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Erofeev et al. (2021) Gleb Erofeev, Irina Sorokina, Lifeng Han, and Serge Gladkoff. 2021. cushlepor uses labse distilled knowledge to improve correlation with human translation evaluations. In Proceedings for the Machine Translation Summit - User Track (In Press), online. Association for Computational Linguistics & AMTA.
- Feng et al. (2020) Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. CoRR, abs/2007.01852.
- Freitag et al. (2021) Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. arXiv e-prints, page arXiv:2104.14478.
- Freitag et al. (2021) Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation. Association for Computational Linguistics.
- Gehrmann et al. (2021) Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics. arXiv e-prints, page arXiv:2102.01672.
- Graham et al. (2015) Yvette Graham, Timothy Baldwin, and Nitika Mathur. 2015. Accurate evaluation of segment-level machine translation metrics. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1183–1191.
- Han et al. (2012) Aaron L. F. Han, Derek F. Wong, and Lidia S. Chao. 2012. LEPOR: A robust evaluation metric for machine translation with augmented factors. In Proceedings of COLING 2012: Posters, pages 441–450, Mumbai, India. The COLING 2012 Organizing Committee.
- Han et al. (2013a) Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yi Lu, Liangye He, Yiming Wang, and Jiaji Zhou. 2013a. A description of tunable machine translation evaluation systems in WMT13 metrics task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 414–421, Sofia, Bulgaria. Association for Computational Linguistics.
- Han (2014) Lifeng Han. 2014. LEPOR: An Augmented Machine Translation Evaluation Metric. University of Macau, MSc. Thesis.
- Han et al. (2021a) Lifeng Han, Gareth Jones, Alan Smeaton, and Paolo Bolzoni. 2021a. Chinese character decomposition for neural MT with multi-word expressions. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 336–344, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.
- Han et al. (2021b) Lifeng Han, Alan Smeaton, and Gareth Jones. 2021b. Translation quality assessment: A brief survey on manual and automatic methods. In Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age, pages 15–33, online. Association for Computational Linguistics.
- Han et al. (2013b) Lifeng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, and Xiaodong Zeng. 2013b. Language-independent model for machine translation evaluation with reinforced factors. In Machine Translation Summit XIV, pages 215–222. International Association for Machine Translation.
- Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, USA. Association for Computational Linguistics.
- Liu et al. (2021) Zeyang Liu, Ke Zhou, and Max L. Wilson. 2021. Meta-evaluation of Conversational Search Evaluation Metrics. arXiv e-prints, page arXiv:2104.13453.
- Lommel et al. (2014) Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (mqm): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0:455–463.
- Macháček and Bojar (2013) Matouš Macháček and Ondřej Bojar. 2013. Results of the WMT13 metrics shared task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 45–51, Sofia, Bulgaria. Association for Computational Linguistics.
- Marie et al. (2021) Benjamin Marie, Atsushi Fujita, and Raphael Rubino. 2021. Scientific credibility of machine translation research: A meta-evaluation of 769 papers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7297–7306, Online. Association for Computational Linguistics.
- Marzouk (2021) Shaimaa Marzouk. 2021. An in-depth analysis of the individual impact of controlled language rules on machine translation output: a mixed-methods approach. Machine Translation.
- Novikova et al. (2017) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics.
- Ruder et al. (2021) Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, Junjie Hu, Graham Neubig, and Melvin Johnson. 2021. XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation. arXiv e-prints, page arXiv:2104.07412.
Appendices
Appendix A: hLEPOR parameters
The word-level hLEPOR default parameters, manually tuned for the WMT2013 MT evaluation task across language pairs Han et al. (2013a, b), are displayed below. They can be used with both the Python (https://pypi.org/project/hLepor/) and Perl (https://github.com/lHan87/aaron-project-hlepor) versions of the code:
On English-to-Czech/Russian (EN=>CS/RU):
• alpha = 9.0
• beta = 1.0
• n = 2
• weight_elp = 2.0
• weight_pos = 1.0
• weight_pr = 7.0
On English-to-German (EN=>DE):
• alpha = 9.0
• beta = 1.0
• n = 2
• weight_elp = 3.0
• weight_pos = 7.0
• weight_pr = 1.0
On Czech/Spanish/Russian-to-English (CS/ES/RU=>EN):
• alpha = 1.0
• beta = 9.0
• n = 2
• weight_elp = 2.0
• weight_pos = 1.0
• weight_pr = 7.0
On German/French-to-English (DE/FR=>EN) and English-to-Spanish/French (EN=>ES/FR):
• alpha = 9.0
• beta = 1.0
• n = 2
• weight_elp = 2.0
• weight_pos = 1.0
• weight_pr = 3.0