
Charles University in Prague, Czech Republic
{cano, bojar}@ufal.mff.cuni.cz

Human or Machine: Automating Human Likeliness Evaluation of NLG Texts
Supported by the project no. 19-26934X (NEUREM3) of the Czech Science Foundation and ELITR (H2020-ICT-2018-2-825460) of the EU.

Erion Çano (ORCID 0000-0002-5496-3860) and Ondřej Bojar (ORCID 0000-0002-0606-0050)
Abstract

Automatic evaluation of various text quality criteria produced by data-driven intelligent methods is very common and useful because it is cheap, fast, and usually yields repeatable results. In this paper, we present an attempt to automate the human likeliness evaluation of the output text samples coming from natural language generation methods used to solve several tasks. We propose to use a human likeliness score that shows the percentage of a method's output samples that look as if they were written by a human. Instead of having human participants label or rate those samples, we completely automate the process by using a discrimination procedure based on large pretrained language models and their probability distributions. As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach. A validation procedure involving human participants will also check how the automatic evaluation correlates with human judgments.

Keywords:
Natural Language Generation · Automatic Evaluation · Human Likeliness · Text Naturalness · Evaluation Metrics.

1 Introduction

NLG (Natural Language Generation) is a set of techniques and practices for automatically transforming structured data into natural language. Among the typical applications of NLG, we can mention weather predictions, financial reporting, customer support, etc. Like related disciplines such as MT (Machine Translation) or TS (Text Summarization), it has surged in the last decade, greatly pushed by the significant advances in text applications of deep neural networks [27, 3] as well as the creation of large datasets [16, 8, 17]. In parallel with the work on data processing, researchers are also finding ways to automate the evaluation of the intelligent text-related systems they propose. An evaluation practice that is gaining popularity compares method output texts against human-written references of a standard corpus using automatic metrics. Some of the most popular metrics are BLEU of Papineni et al. [19], ROUGE of Lin [15], and METEOR of Banerjee and Lavie [4]. This automatic evaluation practice is tempting for NLP (and NLG) researchers since it is fast and cheap to perform, does not require domain expertise, and usually produces repeatable results.

When comparing two methods A and B, besides accuracy, we usually want to also observe other quality aspects (e.g., coherence, readability, naturalness, fluency, adequacy, or grammaticality) of the text they produce. BLEU and the automatic evaluation process have, however, been criticized by several authors [25, 5]. As pointed out by Reiter [22], BLEU is not appropriate for evaluating all quality criteria of NLG systems. Furthermore, the results of the automatic evaluation process do not always correlate well with those of human campaigns [23]. Even adding data efficiency scores to the process would not tell us anything about the text quality aspects of the outputs of A and B [6].

As pointed out by Novikova et al. [18], there is a need for new evaluation metrics to objectively assess specific text quality characteristics such as human likeliness (or naturalness), coherence, etc. In this paper, we present the proof of concept for a procedure that can be used to automatically evaluate the human likeliness of NLG outputs. We propose to use a discriminator model that can label each test set output as being human-written (natural) or machine-generated (synthetic). This model will use existing pretrained language models such as BERT or GPT-2 and assume that synthetic texts mostly contain high-rank words sampled from the model's probability distribution, in contrast to natural texts, which contain more low-rank words. Based on this, it will compute the fraction probability (the ratio between the probability of the actual word in a position and the probability of the highest-ranked word for that position) of each word and the average value over the entire text sample. The latter will be discretized to get the sample class (h or m).

We start with a survey of evaluation procedures in the last five INLG conference proceedings and notice an increasing trend of automatic evaluations relative to human-based ones. Later on, we discuss the possibility of automating the human likeliness evaluation and propose the h score as an objective metric to estimate it. In Section 4, the discrimination model is described, together with possible shortcomings and a few alternative approaches. In the end, we conclude with the follow-up steps planned for the near future.

2 NLG Evaluation Trends

To have a quantitative picture of the NLG evaluation trends, we carefully examined the papers published in the last five INLG conference proceedings (2014–2019). We focused on the evaluation practices of these papers, sorting out those that used automatic metrics from those using human studies or those using both. The results are presented in Table 1. As we can see, the number of studies performing automatic evaluation only (Auto column) has been rising, especially in the last year. Their portion goes up from 12 % in 2014 to 30.1 % in 2019. Among the evaluation metrics, BLEU was the most popular, followed by METEOR and ROUGE. The studies using both human and automatic evaluation (Both column) have also risen during the last five years. These findings are in line with earlier observations such as those of Gkatzia and Mahamood [12] who considered a wider range of NLP publications and reported an increase in automatic metric evaluations from 44 % in the period 2005–2008 to 60 % in 2012–2015.

The opposite is true for the studies that perform human evaluation only (Human column). They have been following a decreasing trend, falling from 32 % to 11 %. A similar trend is reported in other recent surveys [2, 7]. We also observed that for the human evaluation process, existing studies usually involve a few human experts or some dozens of students. Moreover, a considerable number of studies report having used crowdsourced reviewers from Amazon Mechanical Turk or similar platforms. Accuracy, coherence, readability, and human likeliness are the most considered text quality criteria. Finally, there is also a category of studies (None column) that do not perform any evaluation or that do not assess the text generation quality. These were mostly proposals for shared tasks, findings from challenges, surveys or reviews, etc.

Table 1: Evaluation types reported in the papers of the last five INLG proceedings (percentages in parentheses).
Event          | Papers | Auto      | Human     | Both      | None
12th INLG 2019 | 73     | 22 (30.1) | 8 (11.0)  | 23 (31.5) | 20 (27.4)
11th INLG 2018 | 63     | 14 (22.2) | 8 (12.7)  | 20 (31.7) | 21 (33.3)
10th INLG 2017 | 42     | 10 (23.8) | 7 (16.7)  | 8 (19.0)  | 17 (40.5)
 9th INLG 2016 | 44     | 12 (27.3) | 10 (22.7) | 7 (15.9)  | 15 (34.1)
 8th INLG 2014 | 25     | 3 (12.0)  | 8 (32.0)  | 5 (20.0)  | 9 (36.0)

3 Automating the Human Likeliness Evaluation

The results of our survey (and similar ones) suggest that it is highly desirable for researchers to benchmark the methods or techniques they propose using automatic metrics. As also pointed out by Reiter [22] and Novikova et al. [18], new evaluation metrics are required. There is thus a strong incentive for formulating new quantitative measures and methodologies. There have been attempts to develop objective measures such as the average word length, mean parse tree height, and the average number of nouns [1, 26]. These measures are combined in formulas to obtain a score for automatically assessing text readability.

In this paper, we present a similar attempt regarding the human likeliness of the generated outputs. The h score we propose reflects the capability of an NLG model to produce texts that look as if they were written by humans. This characteristic is important and highly desired in NLG applications as well as in various fast-growing domains such as MT, TS, question answering, and others that go beyond NLG. The most common approach is to assess human likeliness by means of human studies that ask participants to rate the texts using point-based schemes. The 5-point Likert scale is one of the most popular methods in the literature [13, 14].

Our idea is to automate the human likeliness evaluation of any text generation method M by considering the task as a binary discrimination problem and computing a metric that we call the h score. Let us assume we are using a test set of $n$ reference samples for the evaluation. We can expect to have $n = n_{h} + n_{m}$, where $n_{h}$ is the number of texts that are perceived as human-written and $n_{m}$ is the number of those perceived as machine-generated. We can consider the h score and m score of M as the fractions of its outputs perceived as human-written or machine-generated, respectively. They can be computed using Equation 1.

h^{M}=\frac{n_{h}}{n_{h}+n_{m}} \qquad \text{and} \qquad m^{M}=1-h^{M}=\frac{n_{m}}{n_{h}+n_{m}} \qquad (1)

Instead of having human participants sort out the texts (marking them as class h or class m) or give them scores (e.g., 1 to 5 as in the Likert scale), we propose to automate the process by using an intelligent discriminator model H. Using huge pretrained language models, it is possible today to automatically produce texts (e.g., news articles) that are almost impossible to distinguish from human-written ones. The success of this approach will thus depend on the possibility to create a smart H that is able to recognize such synthetic texts. Using the predictions of the discriminator and Equation 1, we then compute the h scores of the methods we wish to evaluate and use them as a model quality indicator. We would thus favor the method that is more capable of fooling the discriminator into thinking that its text outputs are actually human-written (thus a higher h score).
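To make the computation concrete, the following minimal sketch (in Python, with illustrative names of our own choosing rather than anything taken from an existing codebase) shows how the h score of Equation 1 could be obtained once a discriminator H has labeled each test output as class h or class m.

```python
# Minimal sketch of Equation 1: compute the h score of a method M from the
# per-sample labels ('h' or 'm') produced by a discriminator H.
# Function name and example labels are purely illustrative.

def h_score(labels):
    n_h = labels.count('h')
    n_m = labels.count('m')
    return n_h / (n_h + n_m)

# Example: 7 out of 10 outputs of method M are judged human-written.
labels_M = ['h', 'h', 'm', 'h', 'h', 'm', 'h', 'h', 'm', 'h']
print(h_score(labels_M))  # 0.7; the m score is 1 - 0.7 = 0.3
```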

4 The Human Likeliness Discriminator

4.1 GLTR and Pretrained Language Models

The recent pretrained language models such as BERT of Devlin et al. [9] or GPT-2 of Radford et al. [21] have led to significant advances in NLP research. Based on many transformer blocks and trained on huge volumes of text, these models can be fine-tuned with specific data of various domains and yield top results in the corresponding tasks. It is possible to use big language models not just for synthetic text generation but also for spreading fake or misleading news, comments, or reviews on the Web [10]. This has created strong incentives for the research and development of synthetic text detection systems [11, 24, 20].

GLTR (Giant Language model Test Room) of Gehrmann et al. [11] utilizes BERT, small GPT-2 (117 M parameters), or large GPT-2 (1.5 B parameters) models as backends for detecting synthetic texts of various sizes. Based on empirical observations, the authors assume that synthetic texts are mostly produced by sampling words from the head of a language model distribution $p(X_{i}|X_{1:i-1})$ (high-rank words). To detect whether a given word was likely sampled from such a distribution, they propose three tests: (i) checking the probability of the given word in relation to the one that was assigned the highest probability, $P_{det}(X_{i}=\hat{X}_{i}|X_{1:i-1})$; (ii) checking the rank of the given word; (iii) checking the entropy of the predicted distribution. The higher these scores are, the higher the chances that the given word, and hence the entire text, is synthetic (generated). They further build a visual tool that highlights text passages and can be used online (http://gltr.io/dist/index.html).
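As an illustration of the first two tests, the sketch below computes the probability and the rank of every observed token with GPT-2 through the Hugging Face transformers library. It is only a rough reimplementation under our own assumptions (model choice, subword tokenization, helper names), not the GLTR code itself.

```python
# A minimal, assumption-laden sketch of the per-token probability and rank
# checks, using GPT-2 small via the transformers library (not the GLTR code).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def per_token_stats(text):
    """Return (probability, rank) of each observed token given its prefix."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]       # shape: (seq_len, vocab)
    stats = []
    for i in range(1, len(ids)):                         # position i predicted from prefix 0..i-1
        probs = torch.softmax(logits[i - 1], dim=-1)
        p_obs = probs[ids[i]].item()                     # probability of the observed token
        rank = int((probs > probs[ids[i]]).sum()) + 1    # 1 = most probable token
        stats.append((p_obs, rank))
    return stats
```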

4.2 The Discrimination Scheme

Our idea is to use the approach of GLTR for building the discriminator H described above. Big pretrained models such as BERT, GPT-2 small, or GPT-2 large will provide the language model distribution $p(X_{i}|X_{1:i-1})$. Same as in [11], we assume that synthetic texts mostly contain high-rank words, whereas natural texts include more low-rank words. What remains is the design of a numeric scheme by which H can compute and assess the quantity of high-rank words used in each sample, and a discretization scheme to translate that quantity into a category (h or m class). We propose to compute $frac(p)$ for each word of the text sample (sequence $\hat{X}_{1:n}$). It is the ratio between the probability of a given word in its position and the highest probability of any word appearing in that position according to the language distribution of the pretrained model in use. From the $frac(p)$ values of each word, we can compute the average $frac(p)$ score ($Fp$) of a text sample $t$ consisting of $n$ words using Equation 2:

Fp_{t}=\frac{1}{n}\sum_{i=1}^{n}\frac{P(\hat{X}_{i})}{P(X_{i})} \qquad (2)
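A minimal sketch of Equation 2, reusing the model and tokenizer loaded in the previous sketch, is given below. Note that it operates on GPT-2 subword tokens rather than words, which is a simplification made for illustration only.

```python
# Minimal sketch of Equation 2: the average frac(p) score (Fp) of one sample,
# where frac(p) is the probability of the observed token divided by the
# highest probability at that position. Reuses model/tokenizer from above.
def frac_p_score(text):
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]
    fracs = []
    for i in range(1, len(ids)):
        probs = torch.softmax(logits[i - 1], dim=-1)
        p_obs = probs[ids[i]].item()      # probability of the observed token
        p_top = probs.max().item()        # highest probability at this position
        fracs.append(p_obs / p_top)       # frac(p) for this position
    return sum(fracs) / len(fracs)        # Fp_t averaged over the sample
```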

As for the discretization, we will need a threshold or boundary value $Fp_{b}$ for the $Fp$ score and then sort out class h samples from class m ones using the scheme of Equation 3:

class(t)=\begin{cases}h,&\text{if}\ Fp_{t}<Fp_{b}\\ m,&\text{otherwise}\end{cases} \qquad (3)

To set an optimal $Fp_{b}$ value, we will need to perform empirical examinations of many synthetic and natural texts and their respective $Fp$ scores. Based on our preliminary observations, $Fp_{b}$ should range somewhere between 0.35 and 0.45. After labeling each test set output of our NLG method M, we can now use Equation 1 to finally obtain its h score. A high h score reflects a high capability for producing samples that are perceived as class h (texts with more low-rank words). It is thus an indication that the human likeliness of texts produced by M is high.
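Putting the pieces together, the sketch below applies the discretization of Equation 3 to every output of a method M and feeds the resulting labels into the h score of Equation 1. The boundary value 0.40 is only an illustrative placeholder inside the 0.35-0.45 range mentioned above; frac_p_score and h_score refer to the earlier sketches.

```python
# Minimal sketch of Equation 3 plus Equation 1: classify each output of an
# NLG method M and compute its h score. FP_B is an illustrative placeholder.
FP_B = 0.40

def classify(text, fp_b=FP_B):
    return 'h' if frac_p_score(text) < fp_b else 'm'

def evaluate_method(outputs, fp_b=FP_B):
    labels = [classify(t, fp_b) for t in outputs]
    return h_score(labels)          # fraction of outputs judged human-like
```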

4.3 Possible Flaws and Alternative Implementations

Given that the $Fp$ values are continuous, using only $Fp_{b}$ as a boundary value may not be optimal. It may lead to a high misclassification rate of the texts. An alternative approach could be to use two threshold values for $Fp$: $Fp_{l}$ as a low boundary and $Fp_{h}$ as a high boundary. This way, we have a better separation of the two intervals for class h ($0<Fp<Fp_{l}$) and class m ($Fp_{h}<Fp<1$) by a third interval ($Fp_{l}<Fp<Fp_{h}$) that constitutes the u (unknown) class of texts. In other words, a better approach could be to use Equations 4 and 5 instead of Equations 3 and 1.

class(t)=\begin{cases}h,&\text{if}\ Fp_{t}<Fp_{l}\\ u,&\text{if}\ Fp_{l}<Fp_{t}<Fp_{h}\\ m,&\text{otherwise}\end{cases} \qquad (4)
h^{M}=\frac{n_{h}}{n_{h}+n_{m}+n_{u}} \qquad \text{and} \qquad m^{M}=\frac{n_{m}}{n_{h}+n_{m}+n_{u}} \qquad (5)

Once again, for finding the optimal $Fp_{l}$ and $Fp_{h}$ values, we will need to run several empirical tests using synthetic and natural samples. The $Fp_{l}$ and $Fp_{h}$ values will be set based on the average $Fp$ of natural texts and the average $Fp$ of synthetic texts, together with their respective variances.
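Under the same assumptions, the three-class variant of Equations 4 and 5 could look as in the sketch below, with 0.35 and 0.45 standing in for the empirically derived $Fp_{l}$ and $Fp_{h}$.

```python
# Minimal sketch of Equations 4 and 5 with illustrative boundary values.
def classify3(text, fp_l=0.35, fp_h=0.45):
    fp = frac_p_score(text)
    if fp < fp_l:
        return 'h'                 # clearly human-like
    if fp < fp_h:
        return 'u'                 # unknown / undecided
    return 'm'                     # clearly machine-like

def h_score3(labels):
    # Equation 5: 'u' samples count in the denominator but not in the numerator.
    return labels.count('h') / len(labels)
```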

Another problem could be the loss of information from the discretization schemes of Equations 3 or 4. Discretizing the continuous $Fp$ and then computing the continuous h score may result in a significant loss of precision. An alternative is to avoid the h score entirely and instead use the $Fp$ values of the test samples to compute the average $Fp$ over the entire test set. This could be a cleaner practice, leading to a better quality indicator than the h score. Moreover, this simpler approach could be adopted more easily, easing the cross-interpretation of the results. Another drawback is the fact that the approach can be easily fooled by word repetitions, grammatically incorrect words, etc. As a result, additional text checks will be required to ensure its validity. Finally, the $Fp$ scores depend on the pretrained backend model that was used. In other words, when reporting the human likeliness of certain experiments, the backend used for the discrimination process should be reported as well.
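For completeness, the discretization-free alternative described above reduces to averaging the $Fp$ values over the whole test set, again reusing the hypothetical frac_p_score from the earlier sketch:

```python
# Minimal sketch of the discretization-free alternative: report the mean Fp
# over the test set; lower values would be read as more human-like output.
def mean_fp(outputs):
    scores = [frac_p_score(t) for t in outputs]
    return sum(scores) / len(scores)
```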

5 Discussion

In this paper, we first presented the results of a survey which reveals the trend towards the automatic evaluation of text quality criteria. We then described a method for automatically evaluating the naturalness of NLG output texts by computing the human likeliness score as the fraction of test samples labeled as human-like. Instead of relying on human participants to label those test samples, we propose to use a discrimination approach based on large pretrained language models like BERT or GPT-2 and the computation of the fraction probability of each word and of the entire text. We also presented an alternative scheme that can be used to discretize the fraction probability values in case of high information loss from the discriminator. Several empirical studies using synthetic and natural samples will be conducted to find the optimal setup of our proposal. The test set of the popular CNN/Dailymail news dataset created by Nallapati et al. [16] is probably a good source of texts. Furthermore, we plan to conduct a comprehensive validation of the scheme by involving human participants who will judge the naturalness of the text samples. This will check the agreement between the automatic predictions and the human evaluations.

References

  • [1] Ambati, B.R., Reddy, S., Steedman, M.: Assessing relative sentence complexity using an incremental CCG parser. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics. pp. 1051–1057. ACL, San Diego, California (Jun 2016)
  • [2] Amidei, J., Piwek, P., Willis, A.: Evaluation methodologies in automatic question generation 2013-2018. In: Proceedings of the 11th International Conference on Natural Language Generation. pp. 307–317. Association for Computational Linguistics, Tilburg University, The Netherlands (Nov 2018)
  • [3] Balaji, K., Lavanya, K.: Recent Trends in Deep Learning with Applications, pp. 201–222. Springer International Publishing, Cham (2018)
  • [4] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72. ACL, Ann Arbor, Michigan (Jun 2005)
  • [5] Bojar, O., Kos, K., Mareček, D.: Tackling sparse data issue in machine translation evaluation. In: Proceedings of the ACL 2010 Conference Short Papers. pp. 86–91. Association for Computational Linguistics, Uppsala, Sweden (Jul 2010)
  • [6] Çano, E., Bojar, O.: Efficiency metrics for data-driven models: A text summarization case study. In: Proceedings of the 12th International Conference on Natural Language Generation. pp. 229–239. Association for Computational Linguistics, Tokyo, Japan (Oct–Nov 2019). https://doi.org/10.18653/v1/W19-8630, https://www.aclweb.org/anthology/W19-8630
  • [7] Çano, E., Bojar, O.: Keyphrase generation: A multi-aspect survey. In: 2019 25th Conference of Open Innovations Association (FRUCT). pp. 85–94. Helsinki, Finland (Nov 2019). https://doi.org/10.23919/FRUCT48121.2019.8981519, https://ieeexplore.ieee.org/document/8981519
  • [8] Çano, E., Bojar, O.: Two huge title and keyword generation corpora of research articles. In: Proceedings of The 12th Language Resources and Evaluation Conference. pp. 6663–6671. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.823
  • [9] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423
  • [10] Fornaciari, T., Poesio, M.: Identifying fake Amazon reviews as learning from crowds. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. pp. 279–287. Association for Computational Linguistics, Gothenburg, Sweden (Apr 2014). https://doi.org/10.3115/v1/E14-1030
  • [11] Gehrmann, S., Strobelt, H., Rush, A.: GLTR: Statistical detection and visualization of generated text. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. pp. 111–116. Association for Computational Linguistics, Florence, Italy (Jul 2019). https://doi.org/10.18653/v1/P19-3019
  • [12] Gkatzia, D., Mahamood, S.: A snapshot of NLG evaluation practices 2005 - 2014. In: Proceedings of the 15th European Workshop on Natural Language Generation (ENLG). pp. 57–60. Association for Computational Linguistics, Brighton, UK (Sep 2015). https://doi.org/10.18653/v1/W15-4708
  • [13] Hastie, H., Belz, A.: A comparative evaluation methodology for NLG in interactive systems. In: Proceedings of LREC’14. pp. 4004–4011. European Language Resources Association (Dec 2014)
  • [14] van der Lee, C., Gatt, A., van Miltenburg, E., Wubben, S., Krahmer, E.: Best practices for the human evaluation of automatically generated text. In: Proceedings of the 12th International Conference on Natural Language Generation. pp. 355–368. Association for Computational Linguistics, Tokyo, Japan (Oct–Nov 2019)
  • [15] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Proc. ACL Workshop on Text Summarization Branches Out. p. 10 (2004), http://aclweb.org/anthology/W04-1013
  • [16] Nallapati, R., Zhou, B., dos Santos, C., Gulcehre, C., Xiang, B.: Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. pp. 280–290. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/K16-1028, http://aclweb.org/anthology/K16-1028
  • [17] Napoles, C., Gormley, M., Van Durme, B.: Annotated gigaword. In: Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. pp. 95–100. AKBC-WEKEX ’12, Association for Computational Linguistics, Stroudsburg, PA, USA (2012), http://dl.acm.org/citation.cfm?id=2391200.2391218
  • [18] Novikova, J., Dušek, O., Cercas Curry, A., Rieser, V.: Why we need new evaluation metrics for NLG. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2241–2252. ACL, Copenhagen, Denmark (Sep 2017). https://doi.org/10.18653/v1/D17-1238
  • [19] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Meeting of the Association for Computational Linguistics (2002), http://aclweb.org/anthology/P02-1040
  • [20] Pérez-Rosas, V., Kleinberg, B., Lefevre, A., Mihalcea, R.: Automatic detection of fake news. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 3391–3401. Association for Computational Linguistics, Santa Fe, New Mexico, USA (Aug 2018), https://www.aclweb.org/anthology/C18-1287
  • [21] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2018), https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
  • [22] Reiter, E.: A structured review of the validity of BLEU. Computational Linguistics 44(3), 393–401 (Sep 2018). https://doi.org/10.1162/coli_a_00322
  • [23] Reiter, E., Belz, A.: An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics 35(4), 529–558 (2009). https://doi.org/10.1162/coli.2009.35.4.35405
  • [24] Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake news detection on social media: A data mining perspective. SIGKDD Explor. Newsl. 19(1), 22–36 (Sep 2017). https://doi.org/10.1145/3137597.3137600
  • [25] Sulem, E., Abend, O., Rappoport, A.: BLEU is not suitable for the evaluation of text simplification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 738–744. ACL, Brussels, Belgium (Oct-Nov 2018). https://doi.org/10.18653/v1/D18-1081
  • [26] Vajjala, S., Meurers, D.: Assessing the relative reading level of sentence pairs for text simplification. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL-14). Association for Computational Linguistics, Gothenburg, Sweden (2014)
  • [27] Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine 13(3), 55–75 (Aug 2018). https://doi.org/10.1109/MCI.2018.2840738