
DHP Benchmark: Are LLMs Good NLG Evaluators?

Yicheng Wang1*, Jiayi Yuan2*, Yu-Neng Chuang2, Zhuoer Wang1, Yingchi Liu3,
Mark Cusick3, Param Kulkarni3, Zhengping Ji3, Yasser Ibrahim3, Xia Hu2

1Texas A&M University, 2Rice University, 3Axon Enterprise, Inc.
* Equal Contribution
Abstract

Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain inadequately explored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs utilizing hierarchically perturbed text data and statistical tests to measure the NLG evaluation capabilities of LLMs systematically. We have re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM series provides critical insight into their strengths and limitations as NLG evaluators.



1 Introduction

Large Language Models (LLMs) play a crucial role in the field of Natural Language Generation (NLG), advancing a wide range of real-world applications including education Latif et al. (2023), healthcare Yuan et al. (2023), and business Teubner et al. (2023). The strong capabilities of LLMs allow them not only to serve as text generators but also, increasingly, as powerful evaluators of text quality Chiang and Lee (2023); Liu et al. (2023a); Li et al. (2024). Their role as evaluators is crucial for advancements in various applications, such as summarization, story completion, question answering, and translation Li et al. (2024); Wang et al. (2023a). LLMs are expected to serve as NLG evaluators, providing reasonable quality scores based on different quality metrics with specially designed evaluation prompts.

Despite the growing performance of LLMs in evaluation tasks, a significant gap remains in fully comprehending their capabilities in evaluating NLG quality. The question, Are LLMs good NLG evaluators? remains challenging for two main reasons illustrated in Figure 1:

(1) Lack of Clear and Unbiased Measurement: There is no clear measurement for the capability of LLM evaluators. Existing methods rely on aligning with human scores Chiang and Lee (2023); Liu et al. (2023a), but these scores are themselves subject to biased response styles Schoch et al. (2020).

(2) Multiple Evaluation Metrics: Evaluating NLG quality requires considering multiple metrics. For example, in summarization tasks, metrics such as coherence, consistency, and fluency are essential considerations Hu et al. (2024); Fabbri et al. (2021). However, LLMs might struggle with correlations between these metrics, potentially leading to misinterpretation and incorrect scoring, which makes it difficult to assess their effectiveness as evaluators.

To address these challenges, we introduce a novel DHP benchmarking framework, Discernment of Hierarchical Perturbation, for quantitatively measuring the evaluation capabilities of LLMs without the need for human annotations. We propose the concept of discernment scores, systematically derived from hierarchically perturbed text data and statistical tests. We perturb the reference data using various hierarchical methods, then compare the differences in LLM evaluation scores using the Wilcoxon Signed-Rank Test Wilcoxon (1945). To obtain overall capability results, we introduce the approach of harmonic mean $p$-values and expert weights to combine the results of multiple metrics. The final combined $p$-values are transformed into discernment scores that measure the NLG evaluation capabilities of LLMs. This method allows for a more rigorous and comprehensive evaluation of LLM performance, independent of the response styles of the tested models.

Figure 1: Challenges in Assessing LLMs as NLG Evaluators: Biased Response Styles and Multiple Evaluation Metrics. Our DHP Framework employs hierarchical perturbation and statistical tests to address these challenges, offering quantitative discernment scores for effective comparison.

Our study re-establishes six evaluation datasets that encompass four key NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Each dataset is perturbed and leveraged to challenge the LLMs’ evaluative capabilities in distinct ways, providing a robust basis for benchmarking. The datasets include a range of text perturbations, from minor character problems to significant sentence alterations, allowing us to test the potential discernment limits of the LLMs.

Our comprehensive benchmarking, using newly defined quantitative discernment scores, analyzes five major LLM series. This approach uncovers critical insights into their capabilities as NLG evaluators and provides a detailed understanding of their performance. This benchmark reveals important trends and patterns in the capabilities of LLMs, highlighting areas where they excel and where they may fall short.

The DHP benchmark aims to fill the existing gaps by providing a quantitative framework for assessing LLMs’ evaluation capabilities and emphasizing the necessity of considering multiple metrics for accurate and reliable evaluations. We summarize our contributions as follows.

(1) Design the DHP benchmarking framework: quantitative discernment scores for LLMs as NLG evaluators based on hierarchical perturbation, eliminating the need for human annotations.

(2) Re-establish six evaluation datasets for four NLG tasks to evaluate the discernment of LLMs.

(3) Benchmark five series of LLMs to analyze their capabilities in NLG evaluation.

2 Related Work

Recent advancements highlight the significant potential of utilizing LLMs as evaluators for a variety of natural language processing (NLP) tasks. Extensive empirical evidence supports this viewpoint, as demonstrated by studies Liu et al. (2023a); Chiang and Lee (2023); Hu et al. (2024); Desmond et al. (2024); Wang et al. (2023a), which assert that the evaluation behaviors of pretrained LLM-based evaluators align well with human preferences Liu et al. (2023b) and eliminate prompt bias Liusie et al. (2024). Another line of work Huang et al. (2024); Zhu et al. (2023); Wang et al. (2023b) finetunes LLMs on specific downstream task datasets to increase their capability as judges. Despite the strong assessment performance of a single LLM, more advanced studies involve multiple LLM agents Chan et al. (2023); Zhang et al. (2023); Li et al. (2023) or human experts Gao et al. (2024); Li et al. (2024) to further increase the judging capability. However, humans often prefer to remain neutral Leroi et al. (2020), which means a perfect alignment between LLM-based evaluators and human evaluators may still lead to bias and inaccurate judgments.

Figure 2: Response styles of five LLMs evaluated using the SummEval dataset Fabbri et al. (2021).

While using LLM-based evaluators offers a scalable and effective method for approximating human preferences, the significant influence of individual human biases raises concerns about their discernment in assessing LLM-based evaluations. In our work, we focus on developing comprehensive approaches to observe and address this phenomenon.

3 Biased Response Styles

Previous studies focus on the alignment between human and LLM evaluators, using correlation metrics to gauge the LLMs’ performance in NLG evaluation tasks Liu et al. (2023a); Chiang and Lee (2023). However, these studies often overlook an important variable of evaluators: Response Styles, which refer to a respondent’s consistent manner of answering survey questions, regardless of the content Van Vaerenbergh and Thomas (2013). Despite similar levels of professionalism, annotators may assign different scores to the same questionnaire due to differences in age, gender, personality, cultural background, and ethnic group Van Vaerenbergh and Thomas (2013); Hui and Triandis (1989); Kieruj and Moors (2013). Similarly, LLMs, trained on diverse datasets, may also exhibit biases in their responses Salecha et al. (2024). This discrepancy casts doubt on the previous methods used to compare human and LLM scores. Since quality-based scoring often relies heavily on a few experts’ annotations, the final alignment scores tend to favor models that share similar response styles with these specific experts.

We illustrate this with an example of the response styles of five LLMs tasked with annotating quality scores for human reference data from the SummEval dataset Fabbri et al. (2021). We averaged the scores across four metrics for each data point and plotted both the Pearson correlation coefficient ($\rho$) and the average score distributions of the five models. After perturbing the original data by replacing some named entities with fictional ones in the summaries (Fictional Named Entities in Table 1), we repeated the quality evaluation. As shown in Figure 2, all models detected the changes and adjusted their scores accordingly, though their scoring distributions varied significantly. For instance, Llama3 Meta, Mistral Jiang et al. (2023), and Qwen Bai et al. (2023) models assign higher scores to the original data and moderate scores to the perturbed data. In contrast, GPT4-Turbo OpenAI (2023) and Vicuna Chiang et al. (2023) models tend to give moderate scores to the original data and much lower scores to the perturbed data. The variance in the response distributions indicates the presence of bias that can significantly affect alignment ($\rho$), illustrating that alignment is not a direct or credible metric for assessing the ability of LLMs as NLG evaluators. It is crucial to develop a new metric and measurement for evaluation that is not influenced by the evaluators’ biased response styles, ensuring a more accurate and fair assessment of LLM capabilities.

Figure 3: The DHP framework for each NLG task. It includes three steps: (1) Hierarchical Perturbation, (2) LLM Evaluation, and (3) Statistical Analysis. This figure demonstrates the framework with four perturbation types ($P=4$) and three evaluation metrics ($M=3$).

4 DHP Benchmarking Framework

We propose our DHP framework: Discernment of Hierarchical Perturbation. Previous studies overlook the essence of NLG evaluation, i.e., the content-oriented scoring Novikova et al. (2018). In other words, content that is accurate, fluent, and consistent should receive higher scores than content that is inaccurate, disfluent, and inconsistent. Qualified annotators should be able to recognize inappropriate content without additional references and then assign scores, even though the absolute scores may still reflect their biased response styles. The fundamental principle of our assessment is that a qualified LLM evaluator should be able to independently identify issues in perturbed data (which contains some quality issues) and assign relatively lower scores compared to the original reference data during two separate evaluations. This approach does not rely on human scores, thus eliminating the influence of human response styles.

The overall framework is shown in Figure 3. First, for a specific NLG task, we employ a hierarchical perturbation pipeline to transform high-quality reference data into various forms of lower-quality data. Subsequently, an LLM evaluates the original and perturbed texts separately using predefined metrics, generating several sets of rating scores. We then conduct a statistical analysis of these scores. For each pair of score sets, original and perturbed, we apply the Wilcoxon Signed-Rank Test to determine the differences in their distributions, expressed as a confidence level in the form of a $p$-value. This test specifically assesses differences in pairwise scores without focusing on absolute values, thereby minimizing the impact of models’ response styles. Following this, we combine the $p$-values from different metrics, incorporating Expert Weights ($EW$) to tailor the aggregated $p$-values to the specific metrics of the corresponding perturbation methods. These combined $p$-values are then transformed into discernment scores, which serve as a direct measure for assessing and comparing the NLG evaluation capabilities of LLMs for this particular task.

4.1 Step 1: Hierarchical Perturbation

To generate data that have quality issues across various levels, formats, and evaluation difficulties, we propose a hierarchical perturbation approach. In contrast to plain perturbations Sai et al. (2021), our approach encompasses three levels of perturbation content: character, word, and sentence levels; two methods of perturbation: rule-based and LLM-based; and two degrees of perturbation: minor and major, as illustrated in Figure 3.

First, at the character level, we alter some characters or letters in the given $N$ original texts independently. At the word and sentence levels, we degrade the text by processing entire words or sentences, respectively. For NLG tasks involving very short texts, sentence-level perturbation is considered optional. For each level of perturbation, we choose either a rule-based or an LLM-based method, enhancing the diversity of the perturbation’s content and format. Additionally, if the text data is sufficiently long for more perturbation, we implement two degrees of perturbation – minor and major – for each method. These different degrees of perturbation will influence the difficulty that LLMs face in detecting issues within the text. The detailed perturbation methods for each task are shown in Table 1.

With this approach, we generate multiple sets of perturbed data, with each set designed to highlight a specific quality issue tied to a distinct type of perturbation method. Competent LLM evaluators should accurately detect these issues and assign correspondingly lower scores to the perturbed data.
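To make the rule-based, character-level perturbation concrete, the sketch below randomly deletes alphanumeric characters at a minor and a major degree. It is a minimal illustration under our own naming and k values, not the paper's implementation; the actual per-dataset settings are listed in Table 2.

```python
import random

def random_character_deletions(text: str, k: int, seed: int = 0) -> str:
    """Rule-based, character-level perturbation: randomly delete k
    alphanumeric characters (k controls the minor/major degree)."""
    rng = random.Random(seed)
    candidates = [i for i, ch in enumerate(text) if ch.isalnum()]
    drop = set(rng.sample(candidates, min(k, len(candidates))))
    return "".join(ch for i, ch in enumerate(text) if i not in drop)

reference = "The committee approved the budget for the new research center on Friday."
minor = random_character_deletions(reference, k=3)    # subtle quality issue
major = random_character_deletions(reference, k=15)   # much easier to detect
```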

Table 1: The quality metrics and perturbation methods for the four NLG tasks. C: Character Level. W: Word Level. S: Sentence Level. (R): Rule-based Perturbation. (L): LLM-based Perturbation. (M): Major and Minor Perturbations for each method.
Summarization
  Metrics: Coherence, Consistency, Fluency, Relevance
  Perturbations:
    C (M): Random Deletions (R), Random Typos (R)
    W (M): Fictional Named Entities (L), Grammatical Errors (L)
    S (M): Reordering (R), Rewriting and Insertion (L)

Story Completion
  Metrics: Coherence, Consistency, Fluency
  Perturbations:
    C: Random Deletions (R), Random Typos (R)
    W: Fictional Named Entities (L), Grammatical Errors (L)
    S: Random Ending Sentence (R), Wrong Ending Sentence (R)

Question Answering
  Metrics: Answer Quality
  Perturbations:
    C (M): Random Deletions (R), Random Typos (R)
    W (M): Fictional Named Entities (L), Grammatical Errors (L)
    S: Random Answer (R)

Translation
  Metrics: Accuracy, Fluency
  Perturbations:
    C (M): Random Deletions (R), Random Typos (R)
    W (M): Random Deletions (R), Fictional Named Entities (L), Grammatical Errors (L)

4.2 Step 2: LLM Evaluation

Following the evaluation method outlined in G-Eval Liu et al. (2023a), we also utilize the automatic chain-of-thought approach (Auto-CoT) Zhang et al. (2022) to design evaluation prompts for different datasets and evaluation metrics. These prompts are sent to LLMs to assess both the original data and the perturbed, low-quality data. It’s important to note that all perturbed data are evaluated independently, without their original references, to accurately test the models’ capabilities in identifying specific quality issues.

After conducting the LLM evaluation on a dataset consisting of $N$ datapoints, we obtain several sets of absolute evaluation scores shown in Figure 3:

[\{S^{0}_{m_{1}}\},\{S^{0}_{m_{2}}\},\dots,\{S^{0}_{m_{M}}\}],
[\{S^{1}_{m_{1}}\},\{S^{1}_{m_{2}}\},\dots,\{S^{1}_{m_{M}}\}],
\cdots,
[\{S^{P}_{m_{1}}\},\{S^{P}_{m_{2}}\},\dots,\{S^{P}_{m_{M}}\}],

where each $\{S\}$ is a set of $N$ evaluation scores. The superscripts $0,1,\ldots,P$ on $S$ represent the original data (0) and the $P$ types of perturbed data ($1,\dots,P$), respectively. The subscripts $m_{1},\ldots,m_{M}$ represent the $M$ different metrics used in the dataset. For instance, in the SummEval dataset Fabbri et al. (2021), there are four evaluation metrics: coherence, consistency, fluency, and relevance.
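For clarity, the sketch below shows one way to hold these score sets in code. The scoring function is a synthetic stand-in for the actual LLM calls, and all names (mock_llm_score, scores, P_TYPES) are ours, not the paper's.

```python
import random

METRICS = ["coherence", "consistency", "fluency", "relevance"]  # the M metrics
P_TYPES = 4                                                     # the P perturbation types
N = 100                                                         # datapoints per score set

def mock_llm_score(i: int, metric: str, p: int) -> float:
    """Synthetic stand-in for one LLM evaluation call returning a 1-5 score;
    perturbed data (p > 0) is scored slightly lower on average."""
    rng = random.Random(hash((i, metric, p)))
    base = 4.5 if p == 0 else 3.6
    return min(5.0, max(1.0, rng.gauss(base, 0.5)))

# scores[p][m] holds the score set {S^p_m}: p = 0 is the original data, p = 1..P the perturbed sets.
scores = {p: {m: [mock_llm_score(i, m, p) for i in range(N)] for m in METRICS}
          for p in range(P_TYPES + 1)}
```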

4.3 Step 3: Statistical Analysis

As illustrated in Figure 3, we conduct a chain of statistical analyses to derive the final discernment scores for LLM evaluators. This process includes the Wilcoxon Signed-Rank Test, the Harmonic Mean $p$-value and Expert Weights, and the final calculation of discernment scores.

4.3.1 Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test (W-Test) Wilcoxon (1945) is a non-parametric hypothesis test that compares two dependent samples to assess whether their population mean ranks differ significantly. We apply the W-Test to evaluate whether there is a significant difference in the score distributions between the original data and a given type of perturbed data:

p^{i}_{m_{j}}\sim z^{i}_{m_{j}}=\text{W-Test}(\{S^{0}_{m_{j}}\},\{S^{i}_{m_{j}}\}).

In our analysis, we adopt a one-sided alternative hypothesis. The resulting $p$-value indicates the confidence level at which we can reject the null hypothesis (that $\{S^{0}_{m_{j}}\}$ and $\{S^{i}_{m_{j}}\}$ have the same distribution) and accept the alternative hypothesis (that the scores in $\{S^{0}_{m_{j}}\}$ tend to be greater than those in $\{S^{i}_{m_{j}}\}$). We consider a difference to be statistically significant if $p^{i}_{m_{j}}<0.05$. A lower $p$-value represents a more significant score difference between the original data and perturbed data. In total, we obtain $P$ sets of $p$-values for the $M$ metrics, as shown in Figure 3:

[p^{1}_{m_{1}},p^{1}_{m_{2}},\dots,p^{1}_{m_{M}}],\cdots,[p^{P}_{m_{1}},p^{P}_{m_{2}},\dots,p^{P}_{m_{M}}].

Because the W-Test does not assume any specific distribution for the scores and does not focus on their absolute values, the resulting $p$-values solely reflect whether the LLMs are able to detect the quality issues and assign lower scores to the perturbed data compared to the original data. Consequently, this testing approach inherently avoids the influence of response styles, instead focusing on the relative quality assessment. Meanwhile, the $p$-values provide a quantitative measure of the score difference, i.e., of the capability of evaluators to discern low-quality data.
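As an illustration, the sketch below applies the one-sided W-Test to the mock score sets built earlier (it reuses scores, METRICS, and P_TYPES from that sketch). The one-sided alternative follows the paper's description; the zero_method choice for tied pairs is our assumption.

```python
from scipy.stats import wilcoxon

# p_values[i][m] corresponds to p^i_{m_j}: original scores vs. scores under perturbation i.
p_values = {
    i: {m: wilcoxon(scores[0][m], scores[i][m],
                    alternative="greater",        # H1: original scores tend to be higher
                    zero_method="zsplit").pvalue  # keep tied pairs rather than dropping them
        for m in METRICS}
    for i in range(1, P_TYPES + 1)
}
```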

4.3.2 Harmonic Mean p-value and Expert Weights

Given that an evaluation task may involve multiple ($M$) evaluation metrics, resulting in multiple $p$-values $[p^{i}_{m_{1}},p^{i}_{m_{2}},\dots,p^{i}_{m_{M}}]$ for a single perturbed set, it is crucial to derive a combined $p$-value to measure the overall confidence level. We employ the Harmonic Mean $p$-value (HMP) method Wilson (2019), with or without the Expert Weights ($EW$), as presented in Figure 3:

p^{i}=\frac{1}{\sum^{M}_{j=1}\frac{1}{p^{i}_{m_{j}}}},\quad p^{i,EW}=\frac{1}{\sum^{M}_{j=1}\frac{EW^{i}_{m_{j}}}{p^{i}_{m_{j}}}}.
Figure 4: The DHP benchmarking results across four NLG tasks. Notably, in (d) for the Question Answering task, $D$ and $D^{EW}$ are identical because this task utilizes only one evaluation metric. The red lines on the charts represent $D$ or $D^{EW}=1$, which indicates the threshold for statistical significance in discernment scores.

There are two main reasons for using the HMP method: (1) The $p$-values are dependent, as they are derived from the same dataset but differ based on potentially correlated metrics. The HMP method accommodates this dependency Wilson (2019); Vovk and Wang (2020). (2) The harmonic mean emphasizes the effect of smaller numbers, meaning that even if the LLM identifies and appropriately scores a problem in just one metric, the combined $p$-value remains small. However, a limitation of the simple HMP is that it does not indicate whether the LLM evaluators correctly identify the specific problems related to the corresponding metrics. For example, in the SummEval Fabbri et al. (2021) dataset, if a perturbation targets the “fluency” metric but the LLM evaluator incorrectly assigns lower scores to “relevance”, the Harmonic Mean $p$-value method might still produce a low combined $p$-value. This outcome may not accurately reflect the evaluator’s ability to identify the specific issue.

To address this, we introduce HMP with Expert Weights ($EW$). We conduct a survey involving 10 NLP experts who are presented with the specific NLG evaluation tasks and metric definitions. They are asked to identify which metric should be most impacted by different quality problems corresponding to the perturbation methods. These preferences are then aggregated to construct $EW$. For instance, if a particular quality issue receives 4, 1, and 5 votes for “coherence”, “consistency”, and “fluency”, respectively, the $EW$ for the corresponding perturbation would be $[0.4,0.1,0.5]$. This weighting makes the $p$-value combination more targeted, focusing on the metrics most influenced by the perturbation. Consequently, the weighted combined $p$-values offer a more precise measure of the LLM evaluators’ ability to not only detect issues but also correctly assign lower scores to the impacted metrics.
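A minimal sketch of both combinations, continuing from the p_values dictionary of the previous sketch; the expert-weight vector below is a made-up example, not the survey result.

```python
def harmonic_mean_p(p_vals, weights=None):
    """Plain HMP (all weights equal to 1, matching the unweighted formula above)
    or the EW-weighted variant when a weight vector is supplied."""
    if weights is None:
        weights = [1.0] * len(p_vals)
    return 1.0 / sum(w / p for w, p in zip(weights, p_vals))

# Hypothetical expert weights for one perturbation over the four SummEval metrics.
ew = {"coherence": 0.1, "consistency": 0.2, "fluency": 0.6, "relevance": 0.1}

combined = {i: harmonic_mean_p([p_values[i][m] for m in METRICS]) for i in p_values}
combined_ew = {i: harmonic_mean_p([p_values[i][m] for m in METRICS],
                                  [ew[m] for m in METRICS]) for i in p_values}
```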

4.3.3 Discernment Scores of LLM Evaluators

To facilitate comparisons, we transform these combined $p$-values into positive scores, which we define as discernment scores for a specific perturbation $i$ in Figure 3:

D^{i}=\log_{0.05}(p^{i}),\quad D^{i,EW}=\log_{0.05}(p^{i,EW}).

Here, $D^{i}$ and $D^{i,EW}$ are positive values, and the higher the better. A value of 1 for $D^{i}$ or $D^{i,EW}$ is a threshold corresponding to a $p$-value of 0.05, indicating statistical significance. If $D^{i}$ or $D^{i,EW}$ is less than 1, it means that the LLM evaluator does not assign significantly lower scores to the perturbed data compared to the original data, suggesting a lack of discernment for specific quality issues during the NLG evaluation.

To observe the comprehensive capability and worst-case performance of the LLMs, we calculate both the average and minimum of $D^{i}$ and $D^{i,EW}$ across all perturbation methods $i=1,\dots,P$. This results in overall LLM discernment scores $D_{\text{avg}}$, $D_{\text{min}}$, $D^{EW}_{\text{avg}}$, and $D^{EW}_{\text{min}}$. Note that the average discernment scores are calculated using a weighted average across the perturbation levels (character, word, and sentence levels) mentioned previously. We assign equal weights to perturbations within the same level and make sure that the sum of the weights is the same for each level. This weighting approach ensures that each level of perturbation contributes equally to the final scores.

These discernment scores allow us to explicitly evaluate and compare the capabilities of LLMs as evaluators on specific NLG tasks, thereby establishing comprehensive benchmarks for LLMs. Higher average discernment scores ($D_{\text{avg}}$ and $D^{EW}_{\text{avg}}$) indicate that the LLM can generally identify and assign appropriate scores for quality issues in the NLG task, regardless of the specific type of perturbation. The average discernment scores are useful for getting a broad understanding of an LLM’s overall performance as an NLG evaluator. On the other hand, the minimum discernment scores $D_{\text{min}}$ and $D^{EW}_{\text{min}}$ assess the LLM’s performance in the most challenging scenarios, where it may struggle to identify certain types of quality issues. These scores represent the lowest discernment score achieved by the LLM across all perturbation methods, indicating its weakest performance. The minimum discernment scores are crucial for understanding the limitations and potential failure modes of an LLM as an NLG evaluator, even if its overall average performance is acceptable.
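Continuing the same sketch, the combined p-values are mapped to discernment scores and then aggregated into average and minimum values; the mapping of perturbation ids to levels and the resulting weights are illustrative placeholders, not the paper's exact grouping.

```python
import math

def discernment(p_combined: float) -> float:
    """D = log_0.05(p); D >= 1 exactly when p <= 0.05."""
    return math.log(p_combined) / math.log(0.05)

d_scores = {i: discernment(combined[i]) for i in combined}

# Hypothetical level assignment: perturbations 1-2 at character level, 3 at word level, 4 at sentence level.
levels = {1: "char", 2: "char", 3: "word", 4: "sentence"}
per_level = 1.0 / len(set(levels.values()))
weights = {i: per_level / sum(1 for j in levels if levels[j] == levels[i]) for i in levels}

d_avg = sum(weights[i] * d_scores[i] for i in d_scores)  # weighted so each level counts equally
d_min = min(d_scores.values())
```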

5 Benchmarking LLM Discernment

We evaluate five series of LLMs with varying parameter sizes: the GPT series Wang et al. (2023a), which includes GPT3.5-Turbo and GPT4-Turbo; the Llama3 series Meta; the Vicuna1.5 series Chiang et al. (2023); Mistral-7B Jiang et al. (2023); and the Qwen series Bai et al. (2023).

The LLMs are evaluated across four NLG tasks using six re-established public datasets: for Summarization, we use SummEval Fabbri et al. (2021) (news articles) and SumPubMed Gupta et al. (2020) (scientific articles); for Story Completion, we select data from the Story Cloze Test dataset Mostafazadeh et al. (2017); for Question Answering, we utilize the data and modify the quality metric based on the Answer Equivalence dataset Bulian et al. (2022); and for Translation, we leverage the WMT-22 German-to-English and Chinese-to-English general (news) translation subsets Kocmi et al. (2022). To ensure comparability, we select $N=100$ datapoints from each dataset. The quality metrics and perturbation methods are detailed in Table 1.

We present our DHP benchmarking results in Figure 4. By examining the discernment scores achieved by these models, we can gain insights into their competence as NLG evaluators.

5.1 Overall Assessment

Most LLMs that we have evaluated demonstrate the ability to discern quality issues, as indicated by most $D_{\text{avg}}$ and $D^{EW}_{\text{avg}}$ scores exceeding 1. This suggests they can comprehend most evaluation metrics and detect varying quality in NLG tasks. However, an exception is noted in the WMT-22 Chinese-to-English Translation dataset in Figure 4(f), where Vicuna1.5-7B and Qwen1.5-7B fail to achieve favorable average discernment scores, possibly due to their weaker multi-lingual capabilities.

Overall, for NLG evaluation, we recommend the GPT series, especially GPT4-Turbo, which demonstrates superior stability and the highest discernment across nearly all tasks. Among open-source models, Vicuna1.5-13B and Llama3-70B are commendable, achieving good average discernment scores with most $D_{\text{min}}$ and $D^{EW}_{\text{min}}$ values above 1.

5.2 Other Observations

Trends regarding the size of LLMs: Larger models within a series generally show better discernment. However, there are notable inconsistencies. For example, Qwen1.5-4B unexpectedly outperforms Qwen1.5-7B in the translation tasks in Figure 4(e, f), and Qwen1.5-72B displays variable performance in the Question Answering task in Figure 4(d), suggesting that not all larger models uniformly perform better across all types of tasks.

Limitations of Smaller LLMs: In more challenging scenarios, represented by $D_{\text{min}}$ and $D^{EW}_{\text{min}}$, smaller-sized LLMs underperform. Models with fewer than 8B parameters show significantly lower $D_{\text{min}}$ and $D^{EW}_{\text{min}}$, particularly in summarization and translation tasks in Figure 4(a, b, e, f). Among these smaller models, Llama3-8B and Mistral-7B are relatively competitive with higher average scores but still register very low scores in the summarization tasks. This suggests that smaller models may become unstable and unreliable evaluators in some complex NLG evaluation scenarios.

Metric Misunderstanding Phenomenon: Differences between discernment scores with and without expert weights ($D$ and $D^{EW}$) are also notable. While most LLMs display consistent $D$ and $D^{EW}$ scores, Llama3-8B’s performance in translation tasks in Figure 4(e, f) shows a significant discrepancy, with $D^{EW}_{\text{min}}$ values being substantially lower than $D_{\text{min}}$ and even dropping below 1. This indicates that the model misunderstands the metrics even as it identifies quality issues.

Variations in Task Performance: Among the six datasets, LLMs perform best in the Story Cloze Test in Figure 4(c), achieving higher and more stable scores. However, the SumPubMed dataset presented in Figure 4(b) proves the most challenging; all models except GPT4-Turbo score below 1 in $D_{\text{min}}$ and $D^{EW}_{\text{min}}$ because of the dataset’s complex scientific terminology and content. Models lacking sufficient prior knowledge struggle to identify subtle quality issues in such specialized content. Therefore, we encourage the community to test LLM discernment scores for their specific NLG tasks prior to conducting evaluations, ensuring the selected models are competent evaluators.

6 Conclusion

We introduce the DHP benchmark to assess the discernment capabilities of LLMs as evaluators across various NLG tasks. Our approach not only provides benchmarking results for LLMs but also establishes a robust framework to evaluate how effectively LLMs can identify quality issues, thus serving as competent NLG evaluators. While most models generally perform well, their performance is significantly influenced by factors such as model size, task type, and dataset complexity. By pinpointing particular weaknesses of LLMs in evaluating NLG tasks, this benchmark aids researchers in improving LLM performance going forward.

7 Limitations

The limitation of our DHP benchmark is that it generates discernment scores specific to each NLG dataset. A comprehensive assessment of the general evaluation capabilities of LLMs across all NLG tasks remains an open challenge. Additionally, the benchmark’s current focus on English-language text limits its generalizability across different languages and cultural contexts, potentially affecting its reliability in general multilingual NLG tasks.

References

  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Bulian et al. (2022) Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Börschinger, and Tal Schuster. 2022. Tomayto, tomahto. beyond token-level answer equivalence for question answering evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 291–305.
  • Burchardt (2013) Aljoscha Burchardt. 2013. Multidimensional quality metrics: a flexible system for assessing translation quality. In Proceedings of Translating and the Computer 35.
  • Chan et al. (2023) Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.
  • Chiang and Lee (2023) Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  • Desmond et al. (2024) Michael Desmond, Zahra Ashktorab, Qian Pan, Casey Dugan, and James M Johnson. 2024. Evalullm: Llm assisted evaluation of generative outputs. In Companion Proceedings of the 29th International Conference on Intelligent User Interfaces, pages 30–32.
  • Fabbri et al. (2021) Alexander R Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
  • Gao et al. (2024) Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan. 2024. Llm-based nlg evaluation: Current status and challenges. arXiv preprint arXiv:2402.01383.
  • Gupta et al. (2020) Vivek Gupta, Prerna Bharti, Pegah Nokhiz, and Harish Karnick. 2020. Sumpubmed: Summarization dataset of pubmed scientific article. In Proceedings of the 2021 Conference of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics.
  • Hu et al. (2024) Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, and Xiaojun Wan. 2024. Are llm-based evaluators confusing nlg quality criteria? arXiv preprint arXiv:2402.12055.
  • Huang et al. (2024) Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers. arXiv preprint arXiv:2403.02839.
  • Hui and Triandis (1989) C Harry Hui and Harry C Triandis. 1989. Effects of culture and response format on extreme response style. Journal of cross-cultural psychology, 20(3):296–309.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Kieruj and Moors (2013) Natalia D Kieruj and Guy Moors. 2013. Response style behavior: question format dependent or personal style? Quality & Quantity, 47:193–211.
  • Kocmi et al. (2022) Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, et al. 2022. Findings of the 2022 conference on machine translation (wmt22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45.
  • Latif et al. (2023) Ehsan Latif, Luyang Fang, Ping Ma, and Xiaoming Zhai. 2023. Knowledge distillation of llm for education. arXiv preprint arXiv:2312.15842.
  • Leroi et al. (2020) Armand M Leroi, Ben Lambert, James Rosindell, Xiangyu Zhang, and Giorgos D Kokkoris. 2020. Neutral syndrome. Nature human behaviour, 4(8):780–790.
  • Li et al. (2023) Ruosen Li, Teerth Patel, and Xinya Du. 2023. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arXiv:2307.02762.
  • Li et al. (2024) Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, and Chongyang Tao. 2024. Leveraging large language models for nlg evaluation: A survey. arXiv preprint arXiv:2401.07103.
  • Liu et al. (2023a) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023a. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522.
  • Liu et al. (2023b) Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. 2023b. Calibrating llm-based evaluator. arXiv preprint arXiv:2309.13308.
  • Liusie et al. (2024) Adian Liusie, Potsawee Manakul, and Mark Gales. 2024. Llm comparative assessment: Zero-shot nlg evaluation through pairwise comparisons using large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 139–151.
  • Meta. Llama3.
  • Mostafazadeh et al. (2017) Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. 2017. Lsdsem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51.
  • Novikova et al. (2018) Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. Rankme: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78.
  • OpenAI (2023) OpenAI. 2023. Gpt4-turbo.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Sai et al. (2021) Ananya B Sai, Tanay Dixit, Dev Yashpal Sheth, Sreyas Mohan, and Mitesh M Khapra. 2021. Perturbation checklists for evaluating nlg evaluation metrics. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7219–7234.
  • Salecha et al. (2024) Aadesh Salecha, Molly E Ireland, Shashanka Subrahmanya, João Sedoc, Lyle H Ungar, and Johannes C Eichstaedt. 2024. Large language models show human-like social desirability biases in survey responses. arXiv preprint arXiv:2405.06058.
  • Schoch et al. (2020) Stephanie Schoch, Diyi Yang, and Yangfeng Ji. 2020. “this is a problem, don’t you agree?” framing and bias in human evaluation for natural language generation. In Proceedings of the 1st Workshop on Evaluating NLG Evaluation, pages 10–16.
  • Teubner et al. (2023) Timm Teubner, Christoph M Flath, Christof Weinhardt, Wil van der Aalst, and Oliver Hinz. 2023. Welcome to the era of chatgpt et al. the prospects of large language models. Business & Information Systems Engineering, 65(2):95–101.
  • Van Vaerenbergh and Thomas (2013) Yves Van Vaerenbergh and Troy D Thomas. 2013. Response styles in survey research: A literature review of antecedents, consequences, and remedies. International journal of public opinion research, 25(2):195–217.
  • Vovk and Wang (2020) Vladimir Vovk and Ruodu Wang. 2020. Combining p-values via averaging. Biometrika, 107(4):791–808.
  • Wang et al. (2023a) Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023a. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048.
  • Wang et al. (2023b) Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023b. Shepherd: A critic for language model generation. arXiv preprint arXiv:2308.04592.
  • Wilcoxon (1945) F Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.
  • Wilson (2019) Daniel J Wilson. 2019. The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences, 116(4):1195–1200.
  • Yuan et al. (2023) Jiayi Yuan, Ruixiang Tang, Xiaoqian Jiang, and Xia Hu. 2023. Large language models for healthcare data augmentation: An example on patient-trial matching. In AMIA Annual Symposium Proceedings, volume 2023, page 1324. American Medical Informatics Association.
  • Zhang et al. (2023) Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. 2023. Wider and deeper llm networks are fairer llm evaluators. arXiv preprint arXiv:2308.01862.
  • Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations.
  • Zhu et al. (2023) Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2023. Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.
Table 2: Summary of hierarchical perturbation methods applied to different NLG tasks, detailing the types of perturbations and their respective implementations based on character (C), word (W), and sentence-level (S) modification with rule-based (R) or LLM-based (L) approaches.
Summarization (average NLTK statistics: SummEval 340.4 characters, 58.3 words, 4.0 sentences; SumPubMed 803.5 characters, 114.9 words, 5.5 sentences)
- (C, R) Random Deletions: Delete k alphanumeric characters randomly. SummEval: k=10 for Minor, k=50 for Major; SumPubMed: k=20 for Minor, k=100 for Major.
- (C, R) Random Typos: Add k random typographical errors with the "typo" package. SummEval: k=10 for Minor, k=50 for Major; SumPubMed: k=20 for Minor, k=100 for Major.
- (W, L) Fictional Named Entities: Substitute one or more named entities within the summary (e.g., names, locations, specific numbers, technical terms, etc.) with fictional counterparts.
- (W, L) Grammatical Errors: Modify the summary to create two or more grammatical errors, such as subject-verb disagreement, noun-pronoun disagreement, incorrect verb tense, misuse of prepositions, and sentence fragments.
- (S, R) Reordering: Randomly shuffle k sentences in the summary. k=2 for Minor, k=all for Major.
- (S, L) Rewriting and Insertion: Select one or more sentences from the summary, then rephrase them and insert the rewritten versions immediately after the original sentences.

Story Completion (average NLTK statistics: Story Cloze Test 38.7 characters, 7.4 words, 1.0 sentences)
- (C, R) Random Deletions: Delete 5 alphanumeric characters randomly.
- (C, R) Random Typos: Add 5 random typographical errors with the "typo" package.
- (W, L) Fictional Named Entities: Substitute one critical named entity within the ending sentence (e.g., a name, a location, a specific number, etc.) with a fictional counterpart.
- (W, L) Grammatical Errors: Modify the ending to create one grammatical error, such as subject-verb disagreement, noun-pronoun disagreement, incorrect verb tense, misuse of a preposition, or a sentence fragment.
- (S, R) Random Ending Sentence: Replace the ending with a random one from another story.
- (S, R) Wrong Ending Sentence: Replace the ending with the wrong ending provided by the dataset.

Question Answering (average NLTK statistics: Answer Equivalence 156.2 characters, 23.9 words, 1.0 sentences)
- (C, R) Random Deletions: Delete k alphanumeric characters randomly. k=5 for Minor, k=25 for Major.
- (C, R) Random Typos: Add k random typographical errors with the "typo" package. k=5 for Minor, k=25 for Major.
- (W, L) Fictional Named Entities: Substitute one or more critical named entities within the answer (e.g., names, locations, specific numbers, technical terms, etc.) with fictional counterparts.
- (W, L) Grammatical Errors: Modify the answer to create one or more grammatical errors, such as subject-verb disagreement, noun-pronoun disagreement, incorrect verb tense, misuse of prepositions, and sentence fragments.
- (S, R) Random Answer: Replace the answer with a random one from another question.

Translation (average NLTK statistics: WMT-22 German-to-English 436.8 characters, 71.0 words, 3.8 sentences; WMT-22 Chinese-to-English 434.1 characters, 66.4 words, 1.1 sentences)
- (C, R) Random Deletions: Delete k alphanumeric characters randomly. k=10 for Minor, k=50 for Major.
- (C, R) Random Typos: Add k random typographical errors with the "typo" package. k=10 for Minor, k=50 for Major.
- (W, R) Random Deletions: Delete k continuous words in the translation randomly. k=5 for Minor, k=25 for Major.
- (W, L) Fictional Named Entities: Substitute one or more critical named entities within the translation (e.g., names, locations, specific numbers, technical terms, etc.) with fictional counterparts.
- (W, L) Grammatical Errors: Modify the translation to create two or more grammatical errors, such as subject-verb disagreement, noun-pronoun disagreement, incorrect verb tense, misuse of prepositions, and sentence fragments.
Figure 5: User interface of the expert weight survey conducted to determine the impact of various quality issues on NLG task metrics.

Appendix A NLG Tasks and Metrics

A.1 Summarization

We utilize the SummEval Fabbri et al. (2021) (MIT license) and SumPubMed Gupta et al. (2020) datasets (MIT license) for our summarization tasks. The SummEval dataset comprises 100 news articles, each accompanied by multiple reference and generated summaries. For our analysis, we exclusively use the reference summaries, selecting the one with the highest number of sentences from each article to facilitate perturbation. The SumPubMed dataset contains 32,000 long scientific articles along with their abstracts serving as reference summaries. We only use the "BACKGROUND" sections of these articles and summaries. From this dataset, we randomly select 100 pairs of articles and their corresponding summaries.

For the evaluation of summarization performance, we adhere to the metrics defined by SummEval Fabbri et al. (2021), specifically focusing on Coherence, Consistency, Fluency, and Relevance.

A.2 Story Completion

In this story completion task, we utilize the public Story Cloze Test dataset Mostafazadeh et al. (2017), which comprises four-sentence stories, each paired with a reference ending and a wrong ending. We select 100 datapoints at random from the validation set for our analysis.

Given the absence of explicitly defined quality metrics for the dataset, we adapt metrics from summarization tasks—Coherence, Consistency, and Fluency. Coherence evaluates the story’s overall structure and narrative flow. Consistency measures how well the ending maintains the established tone, setting, character development, and narrative style of the story. Fluency focuses on the linguistic and stylistic quality of the story’s conclusion.

A.3 Question Answering

For the question answering task, we employ the Answer Equivalence dataset Bulian et al. (2022) (Apache-2.0 license), which is a modified version of the SQuAD dataset Rajpurkar et al. (2016). We specifically select reference answers that exceed 150 characters to facilitate perturbation. From this filtered set, we randomly choose 100 question-answer pairs.

We adapt the original rating tasks of the dataset into a single metric: Answer Quality. This metric assesses whether the answer provides a comprehensive and accurate response to the question, effectively capturing the essence of the content discussed in the paragraph.

A.4 Translation

We utilize two subsets from the WMT-22 general (news) translation dataset: the German-to-English and Chinese-to-English sets, which are freely available for research purposes. For our analysis, we select the test sets with reference translations, ensuring each translation exceeds 300 characters in length. We randomly choose 100 datapoints from each subset for evaluation.

In assessing translation tasks, we adopt two principal metrics from the Multidimensional Quality Metrics (MQM) framework Burchardt (2013): Accuracy and Fluency. Accuracy measures how closely the translation mirrors the source text, focusing on the absence of additions, omissions, or mistranslations. Fluency evaluates the translation’s compliance with the linguistic norms of the target language, specifically examining spelling, grammar, and consistency.

Appendix B Hierarchical Perturbation

The specifics of the hierarchical perturbations are detailed in Table 2. We perform these perturbations based on character, word, and sentence-level statistical data of the texts, which are also presented in Table 2. Our rule-based perturbations include simple text deletions, typographical errors using existing software tools, reordering of sentences, and the incorporation of random or incorrect sentences from other data.
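As an example of the rule-based, sentence-level operations mentioned above, the sketch below shuffles k sentences of a summary (k=2 for the minor degree and all sentences for the major degree in Table 2). The naive regex splitter and the function name are ours, not the paper's code.

```python
import random
import re

def reorder_sentences(summary: str, k: int, seed: int = 0) -> str:
    """Rule-based, sentence-level perturbation: randomly shuffle k sentences,
    leaving the remaining sentences in their original positions."""
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    rng = random.Random(seed)
    chosen = rng.sample(range(len(sentences)), min(k, len(sentences)))
    permuted = chosen[:]
    rng.shuffle(permuted)
    reordered = sentences[:]
    for src, dst in zip(chosen, permuted):
        reordered[dst] = sentences[src]
    return " ".join(reordered)
```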

For LLM-based perturbations, we employ GPT4-Turbo, modifying the reference text via Auto-CoT Zhang et al. (2022) prompts to generate the detailed procedural perturbation steps. Below, we provide an example of how the “Minor Fictional Named Entities” perturbation is applied to the summarization tasks:

Minor Fictional Named Entities Perturbation Prompt:

You will be given one summary written for an article. Your task is to adjust the summary by implementing a specific change.

Please make sure you read and understand these instructions carefully.

Adjustment: Please substitute only one critical named entity within the summary (e.g., a name, a location, a specific number, a technical term, etc.) with a fictional counterpart.

Adjustment Steps:

1. Identify the critical named entity within the summary. This could be a person’s name, a location, a specific number, or any other specific detail that is crucial to the summary.

2. Create a fictional counterpart for the identified entity. This could be a fictional name, a fictional location, a fictional number, a fictional technical term etc. Make sure that the fictional counterpart is appropriate and fits within the context of the summary.

3. Replace the identified entity with its fictional counterpart in the summary. Ensure that the replacement is grammatically correct and maintains the overall meaning and flow of the summary.

4. Review the adjusted summary to ensure that it still makes sense and conveys the main points of the article, despite the change in one critical named entity.

Summary:

SUMMARY_HERE

Revised Summary:

Figure 6: Graphical representation of the expert weights for each NLG task.

Appendix C Expert Weights

We invite 10 volunteer experts with extensive backgrounds in NLP/NLG research to complete an expert weight survey. The interface of this survey is displayed in Figure 5, which includes the survey instructions, definitions of the tasks and metrics, data types, and descriptions of quality issues associated with the perturbation methods. The experts are asked to select the metric they believe is most impacted by each quality issue presented. We then utilize their responses as weights for combining the $p$-values. The results of these expert evaluations are detailed in Figure 6.

Table 3: Overview of large language models (LLMs) assessed in the DHP benchmark, specifying model versions and sources.
GPT3.5-Turbo: gpt-3.5-turbo-0125 (platform.openai.com/docs/models)
GPT4-Turbo: gpt-4-1106-preview (platform.openai.com/docs/models)
Llama3-8B: Meta-Llama-3-8B-Instruct (huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
Llama3-70B: Meta-Llama-3-70B-Instruct (huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
Vicuna1.5-7B: vicuna-7b-v1.5-16k (huggingface.co/lmsys/vicuna-7b-v1.5-16k)
Vicuna1.5-13B: vicuna-13b-v1.5-16k (huggingface.co/lmsys/vicuna-13b-v1.5-16k)
Mistral-7B: Mistral-7B-Instruct-v0.2 (huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
Qwen1.5-4B: Qwen1.5-4B-Chat (huggingface.co/Qwen/Qwen1.5-4B-Chat)
Qwen1.5-7B: Qwen1.5-7B-Chat (huggingface.co/Qwen/Qwen1.5-7B-Chat)
Qwen1.5-14B: Qwen1.5-14B-Chat (huggingface.co/Qwen/Qwen1.5-14B-Chat)
Qwen1.5-32B: Qwen1.5-32B-Chat (huggingface.co/Qwen/Qwen1.5-32B-Chat)
Qwen1.5-72B: Qwen1.5-72B-Chat (huggingface.co/Qwen/Qwen1.5-72B-Chat)

Appendix D LLM Evaluation

We evaluate five series of large language models (LLMs), details of which are provided in Table 3. Due to the extensive length of text data from the SumPubMed dataset Gupta et al. (2020), which can exceed a 4K context window, we evaluate the models capable of processing long texts (≥ 8K tokens). The GPT series is accessed through the OpenAI API, and the open-source LLMs are executed on a server with 8 Nvidia A100 GPUs. We set the temperature parameter to 0 and keep the default values for the top_p parameter. Throughout the evaluation process, each model scores each metric 5 times, and we use the average as the final score. We use scipy.stats.wilcoxon to conduct the Wilcoxon Signed-Rank Test.
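The sketch below mirrors the repeated-scoring protocol described above (five scores per metric at temperature 0, averaged). The query_llm function is a hypothetical placeholder for the actual API or local-inference call, not a real client function.

```python
import statistics

def query_llm(prompt: str, temperature: float = 0.0) -> float:
    """Hypothetical stand-in: a real implementation would send `prompt` to the
    evaluator model and parse the numeric score from its reply."""
    return 4.0  # dummy value so the sketch runs

def score_metric(prompt: str, repeats: int = 5) -> float:
    """Score one metric by querying the evaluator `repeats` times at temperature 0
    and averaging the results."""
    return statistics.mean(query_llm(prompt, temperature=0.0) for _ in range(repeats))
```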

Appendix E Evaluation Prompts

We follow the guidelines of G-Eval Liu et al. (2023a) and utilize the Auto-CoT method Zhang et al. (2022) to construct our evaluation prompts. Below is an example of the prompt used for assessing the Coherence metric in summarization tasks:

You will be given a summary written for an article. Your task is to rate the summary on one metric. Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criterion: Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic.

Evaluation Steps:

1. Read the Summary Thoroughly: Before diving into the evaluation, ensure that you have a clear understanding of the entire summary. Reading it more than once might be necessary.

2. Identify the Central Topic: A coherent summary will have a clear central topic or theme. Identify this topic and see if the subsequent information revolves around it.

3. Check for Logical Flow: Review the summary for logical sequencing. Sentences should follow one another in a way that makes sense and allows the reader to easily follow the progression of information.

4. Look for Transitional Elements: Coherent summaries often have clear transitions between sentences or ideas. This could be in the form of transitional words, phrases, or connecting ideas that tie one sentence to the next.

5. Identify Redundancies: Check if the same information is repeated in different sentences. Redundancies can disrupt the flow and coherence of a summary.

6. Note Any Gaps or Jumps: If there are sudden jumps in topics or if crucial information seems to be missing, this can harm the coherence of the summary. A well-organized summary should present a holistic view of the topic without leaving the reader with questions.

7. Assess Clarity: Even if the content is technically accurate, if it’s written in a convoluted or unclear manner, it can disrupt coherence. The sentences should be clear and easily understandable.

8. Consider the Conclusion: A coherent summary often wraps up or comes to a conclusion that ties the presented information together. It doesn’t necessarily need a formal conclusion, but the end should feel natural and not abrupt.

9. Rate the Summary: Based on the above steps, assign a score between 1-5 for coherence. - 1: Very incoherent. The summary lacks structure, has sudden jumps, and is difficult to follow. - 2: Somewhat incoherent. The summary has some semblance of structure, but has significant flaws in flow and organization. - 3: Neutral. The summary is decently organized, with minor issues in flow and structure. - 4: Mostly coherent. The summary is well-structured with very few minor coherence issues. - 5: Highly coherent. The summary is excellently organized, flows seamlessly, and builds information logically from start to end.

Source Article:

ARTICLE_HERE

Summary:

SUMMARY_HERE

Evaluation Score (please don’t give any feedback, just give a score ONLY) - Coherence: