
AVA: an Automatic eValuation Approach to Question Answering Systems

Thuy Vu
Amazon Alexa
Manhattan Beach, CA, USA
[email protected]
Alessandro Moschitti
Amazon Alexa
Manhattan Beach, CA, USA
[email protected]
Abstract

We introduce AVA, an automatic evaluation approach for Question Answering, which, given a set of questions associated with Gold Standard answers, can estimate system Accuracy. AVA uses Transformer-based language models to encode question, answer, and reference text. This allows for effectively measuring the similarity between the reference and an automatic answer, biased towards the question semantics. To design, train, and test AVA, we built multiple large training, development, and test sets on both public and industrial benchmarks. Our innovative solutions achieve up to 74.7% F1 in predicting human judgement for single answers. Additionally, AVA can be used to evaluate the overall system Accuracy with an RMSE ranging from 0.02 to 0.09, depending on the availability of multiple references.

1 Introduction

Accuracy evaluation is essential both to guide system development and to estimate system quality, which is important for researchers, developers, and users. This is often conducted using benchmarking datasets, containing a data sample, possibly representative of the target data distribution, provided with Gold Standard (GS) labels (typically produced with a human annotation process). The evaluation is done by comparing the system output with the expected labels using some metrics.

This approach unfortunately falls short when dealing with generation tasks, for which the system output may span a large, possibly infinite, set of correct items. For example, in the case of Question Answering (QA) systems, the set of correct answers for the question, Where is Rome located?, is large. As it is impossible, also for cost reasons, to annotate all possible system outputs, the standard approach is to manually re-evaluate the new output of the system. This dramatically limits the experimentation velocity, while significantly increasing the development costs.

Another viable solution in specific domains consists in automatically generating an evaluation score between the system output and the reference answers, which correlates with human judgement. The BLEU score, for example, is a popular measure in Machine Translation Papineni et al. (2002). This, however, can only be applied to specific tasks, and even in those cases it typically shows limitations Way (2018). As a consequence, there is active research on learning methods to automatically evaluate MT systems Ma et al. (2019), while human evaluation remains a requirement in machine translation benchmarking Barrault et al. (2019).

QA would definitely benefit from a similar approach, but its automatic evaluation is technically more complex for several reasons. First, segment-overlap metrics such as BLEU, METEOR, or ROUGE do not work, since the correctness of an answer only loosely depends on the match between the reference and candidate answers. For example, two candidates can differ by only one word (or even one character) and yet one be correct and the other incorrect, e.g., for the question, Who was the 43rd president of USA?, a correct answer is George W. Bush, while the very similar answer, George H. W. Bush, is wrong.

Second, the matching between the answer candidates and the reference must be carried out at the semantic level and is radically affected by the question semantics. For example, $match(t,r|q_{1})$ can be true while $match(t,r|q_{2})$ is false, where $t$ and $r$ are an answer candidate and a reference, and $q_{1}$ and $q_{2}$ are two different questions. This especially happens for the so-called non-factoid questions, e.g., asking for a description, opinion, manner, etc., which are typically answered by a fairly long explanatory text. For example, Table 1 shows a non-factoid question and three different valid answers, which share similarity with respect to the question. However, if the question were, what may cause anxiety?, Answer 1 and Answer 3 would intuitively look less related to Answer 2.

Question: What does cause left arm pain ?
Reference: Arm pain can be caused by a wide variety of problems, ranging from joint injuries to compressed nerves; if it radiates into your left arm can even be a sign of a heart attack.
Answer 1: It is possible for left arm pain to be caused from straining the muscles of the arm, pending heart attack, or it can also be caused from indigestion.
Answer 2: Anxiety can cause muscles in the arm to become tense, and that tension could lead to pain.
Answer 3: In many cases, arm pain actually originates from a muscular problem in your neck or upper spine.
Table 1: Example of a non-factoid question

In this paper, we study the design of models for measuring the Accuracy of QA systems. In particular, we design several pre-trained Transformer models Devlin et al. (2018); Liu et al. (2019) that encode the triple of question $q$, candidate $t$, and reference $r$ in different ways.

Most importantly, we built (i) two datasets for training and testing the point-wise estimation of QA system output, i.e., the evaluation of whether an answer is correct or not, given a GS answer; and (ii) two datasets constituted by the outputs of several QA systems, for which AVA is supposed to estimate the Accuracy.

The results show a high Accuracy for point-wise models, up to 75%. Regarding the overall Accuracy estimation, AVA can almost always replicate the ranking of systems in terms of Accuracy performed by humans. Finally, the RMSE with respect to human evaluation depends on the datasets, ranging from 2% to 10%, with an acceptable Std. Dev. lower than 3-4%.

The structure of the paper is as follows: we discuss related work in Sec. 2 and describe the problem in Sec. 3. This is followed by the details of the model design and data construction, which are key aspects for system development, in Sections 4 and 5. We study the performance of our models in three different evaluation scenarios in Sec. 6.

2 Related Work

Automatic evaluation has been an active research topic for decades Papineni et al. (2002); Magnini et al. (2002). There are two typical strategies to design an automatic evaluator: supervised and unsupervised. In machine translation, for example, BLEU Papineni et al. (2002) has been a very popular unsupervised evaluation method. Supervised methods have also been proposed recently, most notably by Ma et al. (2019). For dialog systems, neural-based automatic evaluators have also been studied Ghazarian et al. (2019); Lowe et al. (2017); Tao et al. (2017); Kannan and Vinyals (2017).

QA has been studied in the literature since its early days Green et al. (1961), and has recently been used to evaluate summarization Eyal et al. (2019). Automatic evaluation for QA was addressed by Magnini et al. (2002) and for multiple subdomain QA systems Leidner and Callison-Burch (2003); Lin and Demner-Fushman (2006); Shah and Pomerantz (2010); Gunawardena et al. (2015). However, little progress has been made in the past two decades towards a standard method. Automating QA evaluation is still an open problem, and there is no recent work supporting it.

3 Problem Definition

We target the automatic evaluation of QA systems, for which system Accuracy (the percentage of correct answers) is the most important measure. We also consider more complex measures such as MAP and MRR in the context of Answer Sentence Reranking/Selection.

3.1 Answer Sentence Selection (AS2)

The task of reranking answer sentence candidates provided by a retrieval engine can be modeled with a classifier scoring the candidates. Let $q$ be a question and $T_{q}=\{t_{1},\dots,t_{n}\}$ be a set of answer sentence candidates for $q$; we define $\mathcal{R}$ as a ranking function, which orders the candidates in $T_{q}$ according to a score, $p(q,t_{i})$, indicating the probability that $t_{i}$ is a correct answer for $q$. Popular methods modeling $\mathcal{R}$ include Compare-Aggregate Yoon et al. (2019), inter-weighted alignment networks Shen et al. (2017), and BERT Garg et al. (2020).

$q$: What is the population of California?
$r$: With slightly more than 39 million people (according to 2016 estimates), California is the nation’s most populous state—its population is almost one and a half times that of second-place Texas (28 million).
$s$: 39 million
$t$: The resident population of California has been steadily increasing over the past few decades and has increased to 39.56 million people in 2018.
Table 2: An example of input data

3.2 Automatic Evaluation of QA Accuracy

The evaluation of system Accuracy can be approached in two ways: (i) evaluation of the single answer provided by the target system, which we call point-wise evaluation; and (ii) the aggregated evaluation of a set of questions, which we call system-wise evaluation.

We define the former as a function: $\mathcal{A}(q,r,t_{i})\rightarrow\{0,1\}$, where $r$ is a reference answer (GS answer) and the output is simply a correct/incorrect label. Table 2 shows an example question associated with a reference, a system answer, and a short answer (the latter can be very effective, but it adds an additional annotation cost; thus, we limit its use to the baseline model, i.e., we aim to have a lower-cost AVA model).

A configuration of $\mathcal{A}$ is applied to compute the final Accuracy of a system using an aggregator function. In other words, to estimate the overall system Accuracy, we simply treat the point-wise AVA predictions as if they were the GS. For example, in the case of the Accuracy measure, we simply average the AVA predictions, i.e., $\frac{1}{|Q|}\sum_{q\in Q}\mathcal{A}(q,r,t_{i}[,s])$, where $s$ is a short answer (e.g., used in machine reading). It is an optional input, which we only use for a baseline, described in Section 4.1.
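As an illustration, a minimal sketch of this aggregation is given below, assuming a point-wise evaluator that returns 0/1 labels; the function names (system_accuracy, ava_point_wise) are hypothetical and not part of the paper.

```python
from typing import Callable, List, Tuple

def system_accuracy(
    examples: List[Tuple[str, str, str]],           # (question q, reference r, system answer t)
    ava_point_wise: Callable[[str, str, str], int]  # hypothetical point-wise evaluator returning 0 or 1
) -> float:
    """Aggregate point-wise AVA predictions into overall system Accuracy,
    i.e., (1/|Q|) * sum_q A(q, r, t)."""
    if not examples:
        return 0.0
    correct = sum(ava_point_wise(q, r, t) for q, r, t in examples)
    return correct / len(examples)
```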

4 Model for AVA

The main intuition in building an automatic evaluator for QA is that the model should (i) capture the same information a standard QA system uses, while (ii) exploiting the semantic similarity between the system answer and the reference, biased by the information asked by the question. We build two types of models: (i) a linear classifier, which is more interpretable and can help us verify our design hypothesis; and (ii) Transformer-based methods, which have been successfully used in several language understanding tasks.

4.1 Linear Classifier

Given an input example, $(q,r,s,t)$, our classifier uses the following similarity features: $x_{1}=$ sim-token$(s,r)$, $x_{2}=$ sim-text$(r,t)$, $x_{3}=$ sim-text$(r,q)$, and $x_{4}=$ sim-text$(q,t)$, where sim-token between $s$ and $r$ is a binary feature testing if $r$ is included in $s$, and sim-text is a sort of Jaccard similarity:

$\emph{sim-text}(s_{i},s_{j})=2\,\frac{|\emph{tok}(s_{i})\cap\emph{tok}(s_{j})|}{|\emph{tok}(s_{i})|+|\emph{tok}(s_{j})|},$

and $\emph{tok}(s)$ is a function that splits $s$ into tokens.

Let ${\bf x}=f(q,r,s,t)=(x_{1},x_{2},x_{3},x_{4})$ be a similarity feature vector describing our evaluation tuple. We train ${\bf w}$ on a dataset $D=\{d_{i}:({\bf x}_{i},l_{i})\}$ using SVM, where $l_{i}$ is a binary label indicating whether $t$ answers $q$ or not. We compute the point-wise evaluation of $t$ as the test ${\bf x}\cdot{\bf w}>\alpha$, where $\alpha$ is a threshold trading off Precision for Recall in standard classification approaches.
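A minimal sketch of this baseline follows, assuming a whitespace tokenizer and scikit-learn's linear-kernel SVC; the exact tokenization and SVM hyperparameters are not specified in the paper, so they are assumptions here.

```python
import numpy as np
from sklearn.svm import SVC

def tok(s):
    # naive whitespace tokenizer (an assumption; the paper does not specify one)
    return set(s.lower().split())

def sim_text(a, b):
    # Jaccard-like similarity: 2 * |tok(a) & tok(b)| / (|tok(a)| + |tok(b)|)
    ta, tb = tok(a), tok(b)
    denom = len(ta) + len(tb)
    return 2 * len(ta & tb) / denom if denom else 0.0

def features(q, r, s, t):
    x1 = float(r in s)      # sim-token: binary containment test, as defined in the text
    x2 = sim_text(r, t)
    x3 = sim_text(r, q)
    x4 = sim_text(q, t)
    return [x1, x2, x3, x4]

def train_linear_ava(D, alpha=0.0):
    """D: list of (q, r, s, t, label) tuples with label in {0, 1}.
    Returns a point-wise evaluator implementing the test x . w > alpha."""
    X = np.array([features(q, r, s, t) for q, r, s, t, _ in D])
    y = np.array([l for *_, l in D])
    clf = SVC(kernel="linear").fit(X, y)
    return lambda q, r, s, t: float(
        clf.decision_function([features(q, r, s, t)])[0] > alpha
    )
```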

4.2 Transformer-based models

Transformer-based architectures have proved to be powerful language models, which can capture complex similarity patterns. Thus, they are suitable methods to improve our basic approach described in the previous section. Following the linear classifier modeling, we propose three different ways to exploit the relations among the members of the tuple $(q,r,s,t)$.

Let $\mathcal{B}$ be a pre-trained language model, e.g., the recently proposed BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), XLNet Yang et al. (2019), or ALBERT Lan et al. (2020). We use such a model to compute the embedding representation of the tuple members: $\mathcal{B}(a,a^{\prime})\rightarrow{\bf x}\in\mathbb{R}^{d}$, where $(a,a^{\prime})$ is a sentence pair, ${\bf x}$ is the output representation of the pair, and $d$ is the dimension of the output representation. The classification layer is a standard feedforward network, $\mathcal{A}({\bf x})={\bf W}^{\intercal}{\bf x}+b$, where ${\bf W}$ and $b$ are parameters we learn by fine-tuning the model on a dataset $D$.

We describe different designs for $\mathcal{A}$ as follows.

$\mathcal{A}_{0}$: Text-Pair Embedding

We build a language model representation for pairs of members of the tuple $x=(q,r,t)$ by simply inputting them to Transformer models $\mathcal{B}$ in the standard sentence-pair fashion. We consider four different configurations of $\mathcal{A}_{0}$: one for each of the pairs $(q,r)$, $(q,t)$, $(r,t)$, and one for the triplet $(q,r,t)$, modeled as the concatenation of the previous three. The representation for each pair is produced by a different and independent BERT instance, i.e., $\mathcal{B}_{p}$. More formally, we have the following three models: $\mathcal{A}_{0}(\mathcal{B}_{p}(p))$, $\forall p\in\mathcal{D}_{0}$, where $\mathcal{D}_{0}=\{(q,r),(q,t),(r,t)\}$. Additionally, we design a model over $(q,r,t)$ with $\mathcal{A}_{0}(\cup_{p\in\mathcal{D}_{0}}\mathcal{B}_{p}(p))$, where $\cup$ denotes concatenation of the representations. We do not use the short answer, $s$, as its contribution is minimal when using powerful Transformer-based models.
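A simplified sketch of the $\mathcal{A}_{0}$ configuration over the full triplet is shown below, assuming HuggingFace Transformers and PyTorch: each pair in $\mathcal{D}_{0}$ gets its own encoder instance, and the pooled [CLS] outputs are concatenated before a linear classification layer. The model name and pooling details are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class A0Triplet(nn.Module):
    """A_0 over (q,r), (q,t), (r,t): one encoder per pair, concatenated [CLS] outputs."""
    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # independent encoder instance B_p for each pair p in D_0
        self.encoders = nn.ModuleList([AutoModel.from_pretrained(model_name) for _ in range(3)])
        hidden = self.encoders[0].config.hidden_size
        self.classifier = nn.Linear(3 * hidden, 2)   # A(x) = W^T x + b

    def forward(self, q, r, t):
        pairs = [(q, r), (q, t), (r, t)]             # D_0
        reps = []
        for enc, (a, b) in zip(self.encoders, pairs):
            inputs = self.tokenizer(a, b, return_tensors="pt",
                                    truncation=True, max_length=128)
            reps.append(enc(**inputs).last_hidden_state[:, 0])  # [CLS]-position representation
        return self.classifier(torch.cat(reps, dim=-1))         # logits for correct/incorrect
```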

$\mathcal{A}_{1}$: Improved Text-Triple Embedding

The models of the previous section are limited to pair representations. We improve this by designing $\mathcal{B}$ models that can capture pattern dependencies across $q$, $r$, and $t$. To achieve this, we concatenate pairs of the three pieces of text above. We indicate this string concatenation with the $\circ$ operator. Specifically, we consider $\mathcal{D}_{1}=\{(q,r\circ t),(r,q\circ t),(t,q\circ r)\}$ and propose the following $\mathcal{A}_{1}$ models. As before, we have the individual models, $\mathcal{A}_{1}(\mathcal{B}_{p}(p))$, $\forall p\in\mathcal{D}_{1}$, as well as the combined model, $\mathcal{A}_{1}(\cup_{p\in\mathcal{D}_{1}}\mathcal{B}_{p}(p))$, where, again, we use different instances of $\mathcal{B}$ and fine-tune them together accordingly.
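Since the $\mathcal{A}_{1}$ variants differ from $\mathcal{A}_{0}$ only in how the input text pairs are built, a short sketch of the input construction suffices; the separator used for the $\circ$ concatenation is an assumption, as the paper does not specify it.

```python
def a1_inputs(q, r, t, sep=" "):
    """Build the three text pairs in D_1 = {(q, r∘t), (r, q∘t), (t, q∘r)}.
    Each pair would be fed to its own Transformer instance, as in A_0."""
    return [
        (q, r + sep + t),   # (q, r ∘ t)
        (r, q + sep + t),   # (r, q ∘ t)
        (t, q + sep + r),   # (t, q ∘ r)
    ]
```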

$\mathcal{A}_{2}$: Peer Attention for Pairs of Transformer-based Models

Our previous designs instantiate a different $\mathcal{B}$ for each pair, learning the feature representations of the target pair and the relations between its members during the fine-tuning process. This individual optimization prevents the model from capturing patterns across the representations of different pairs, as there is no strong connection between the $\mathcal{B}$ instances. Indeed, the combination of feature representations only happens in the last classification layer.

We propose peer attention to encourage feature transfer between different $\mathcal{B}$ instances. The idea, similar to the encoder-decoder setting in Transformer-based models Vaswani et al. (2017), is to introduce an additional decoding step for each pair. Figure 1 depicts our proposed setting for learning the representations of two different pairs: $a_{0}=(a,a^{\prime})$ and $g_{0}=(g,g^{\prime})$. The standard approach learns representations for these two in one pass, via $\mathcal{B}_{a_{0}}$ and $\mathcal{B}_{g_{0}}$. In the peer-attention setting, the representation output after processing one pair, captured in $H_{[CLS]}$, is input to a second pass of fine-tuning for the other pair. Thus, the representation of one pair can attend over the representation of the other pair during the decoding stage. This allows the feature representations from each $\mathcal{B}$ instance to be shared during both training and prediction.

Figure 1: Peer attention on $(a,a^{\prime})$ and $(g,g^{\prime})$.
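Figure 1 is not reproduced here. The following is one simplified interpretation of the peer-attention setting, not the authors' exact implementation: each pair is encoded by its own instance, and in a second pass each pair's [CLS] representation attends over the other pair's token-level outputs before classification. Model name and the use of a single cross-attention layer are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PeerAttentionA2(nn.Module):
    """Simplified peer-attention sketch over two pairs a0=(a,a') and g0=(g,g')."""
    def __init__(self, model_name="roberta-base", heads=8):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.enc_a = AutoModel.from_pretrained(model_name)
        self.enc_g = AutoModel.from_pretrained(model_name)
        h = self.enc_a.config.hidden_size
        # cross-attention: one pair's [CLS] attends over the other pair's token outputs
        self.attn_a = nn.MultiheadAttention(h, heads, batch_first=True)
        self.attn_g = nn.MultiheadAttention(h, heads, batch_first=True)
        self.classifier = nn.Linear(2 * h, 2)

    def encode(self, enc, pair):
        inputs = self.tokenizer(*pair, return_tensors="pt", truncation=True, max_length=128)
        return enc(**inputs).last_hidden_state            # (1, seq_len, h)

    def forward(self, pair_a, pair_g):
        ha = self.encode(self.enc_a, pair_a)              # token representations of (a, a')
        hg = self.encode(self.enc_g, pair_g)              # token representations of (g, g')
        cls_a, cls_g = ha[:, :1], hg[:, :1]               # H_[CLS] of each pair
        # second "decoding" pass: each [CLS] attends over the peer pair's representation
        peer_a, _ = self.attn_a(cls_a, hg, hg)
        peer_g, _ = self.attn_g(cls_g, ha, ha)
        return self.classifier(torch.cat([peer_a.squeeze(1), peer_g.squeeze(1)], dim=-1))
```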

5 Dataset Creation

We describe the datasets we created to develop AVA. First, we build two large-scale datasets for the standard QA task, namely AS2-NQ and AS2-GPD, derived from the Google Natural Questions dataset and our internal dataset, respectively; their construction is described in Section 5.1. Second, we describe our approach to generate labelled data for AVA from these QA datasets in Section 5.2. Finally, we build an additional dataset constituted by a set of systems and their output on target test sets, which can be used to evaluate the ability of AVA to estimate end-to-end system performance (system-wise evaluation), described in Section 5.3.

5.1 Question Answering Datasets

5.1.1 AS2-NQ: AS2 Dataset from NQ

Google Natural Questions (NQ) is a large-scale dataset for the machine reading task Kwiatkowski et al. (2019). Each question is associated with a Wikipedia page and at least one long paragraph (long_answer) that contains the answer to the question. The long_answer may contain additional annotations of short_answer, a succinct extractive answer from the long paragraph. A long_answer usually consists of multiple sentences, thus NQ is not directly applicable to our setting.

We create AS2-NQ from NQ by leveraging both long_answer and short_answer annotations. In particular, for a given question, the correct answers are the sentences in the long_answer paragraphs that contain annotated short_answers. The other sentences from the Wikipedia page are considered incorrect. The negative examples can be of the following types (see the sketch below): (i) sentences that are in the long_answer but do not contain annotated short answers (it is possible that these sentences still contain the short_answer); (ii) sentences that are not part of the long_answer but contain a short_answer as a subphrase, which is generally accidental; and (iii) all the other sentences in the document.
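A hedged sketch of this labeling scheme is shown below; treating annotations as plain strings and the specific label values are simplifying assumptions, not the NQ or AS2-NQ format.

```python
def label_sentence(sentence, in_long_answer, short_answers):
    """Assign an AS2-NQ style label to a candidate sentence from the Wikipedia page.
    One label per confusion level: 3 = correct answer, 2..0 = increasingly easy negatives."""
    contains_short = any(sa in sentence for sa in short_answers)
    if in_long_answer and contains_short:
        return 3   # positive: long_answer sentence containing an annotated short_answer
    if in_long_answer:
        return 2   # negative type (i): in the long_answer, no annotated short answer
    if contains_short:
        return 1   # negative type (ii): contains a short_answer accidentally, outside the long_answer
    return 0       # negative type (iii): any other sentence in the document
```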

The generation of negative examples impacts the robustness of the trained model when selecting the correct answer out of the incorrect ones. AS2-NQ has four labels that describe the possible confusion levels of a sentence candidate. We apply the same processing to both the training and development sets of NQ. This dataset enables an effective transfer step Garg et al. (2020). Table 3 shows the statistics of the dataset.

AS2-NQ AS2-NQ Qs with multiple As AVA-NQ
data split #Qs #As #wrong-As #Qs #As #wrong-As positives negatives total
NQ-dev 4,263 134,691 1,320,812 1,478 3,376 64,187 11,556 206,497 218,053
NQ-train 105,020 10,288 33,294,803 2,360 6,392 96,152 26,100 432,913 459,013
Table 3: AS2-NQ and AVA-NQ Statistics

5.1.2 AS2-GPD: General Purpose Dataset

A search engine using a large index can retrieve more relevant documents than those available in Wikipedia. Thus, we retrieved highly probable relevant candidates as follows: we (i) retrieved the top 500 relevant documents; (ii) automatically extracted the top 100 sentences, ranked by a BERT model over all sentences of the documents; and (iii) had all the top 100 sentences manually annotated as correct or incorrect answers. This process does not guarantee that we have all correct answers, but the probability of missing them is much lower than for other datasets. In addition, this dataset is richer than AS2-NQ, as it consists of answers from multiple sources. Furthermore, the average number of answers to a question is also higher than in AS2-NQ. Table 4 shows the statistics of the dataset.

AS2-GPD AS2-GPD Qs with multiple As AVA-GPD
data split #Qs #As #wrong-As #Qs #As #wrong-As positives negatives total
GPD-train 262 5,399 20,801 245 5,382 20,748 183,894 349,765 533,659
GPD-dev 283 8,682 19,618 276 8,674 19,502 430,230 426,246 856,476
GPD-test 294 9,412 19,988 281 9,399 19,790 479,028 449,625 928,653
Table 4: AS2-GPD and AVA-GPD Statistics

5.2 AVA Datasets

The AS2 datasets from the previous section consist of a set of questions $Q$. Each $q\in Q$ has candidates $T_{q}=\{t_{1},\dots,t_{n}\}$, comprised of both correct answers $C_{q}$ and incorrect answers $\overline{C_{q}}$, with $T_{q}=C_{q}\cup\overline{C_{q}}$. We construct the dataset for point-wise automatic evaluation (described in Section 4) as follows: we first filter the QA dataset to keep only questions that have at least two correct answers, which is critical to build positive and negative examples.

Formally, let $\langle q,r,t,l\rangle$ be an input for AVA. We build positive examples as:

$\text{AVA-Positives}=\left\langle q;(r,t)\in C_{q}\times C_{q}\text{ and }r\neq t\right\rangle$

We also build negative examples as follows:

$\text{AVA-Negatives}=\left\langle q;(r,t)\in C_{q}\times\overline{C_{q}}\right\rangle$

We create AVA-NQ and AVA-GPD from the QA datasets AS2-NQ and AS2-GPD, respectively. The statistics are presented on the right side of Tables 3 and 4.
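A minimal sketch of this construction, assuming each question's candidates are already split into correct and incorrect sets; variable and function names are illustrative.

```python
from itertools import permutations, product

def build_ava_examples(questions):
    """questions: dict mapping q -> (correct_answers C_q, incorrect_answers C_q_bar).
    Returns AVA tuples <q, r, t, label>."""
    examples = []
    for q, (correct, incorrect) in questions.items():
        if len(correct) < 2:          # keep only questions with at least two correct answers
            continue
        # positives: (r, t) in C_q x C_q with r != t
        for r, t in permutations(correct, 2):
            examples.append((q, r, t, 1))
        # negatives: (r, t) in C_q x C_q_bar
        for r, t in product(correct, incorrect):
            examples.append((q, r, t, 0))
    return examples
```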

5.3 AVA Datasets from Systems (ADS)

To test AVA at the level of overall system Accuracy, we need a sample of systems and their output on different test sets. We created a dataset of candidate answers collected from eight systems over a set of 1,340 questions. The questions were sampled from an anonymized set of user utterances; we only considered information inquiry questions. The systems differ from each other in multiple ways, including: (i) modeling: Compare-Aggregate (CNN-based) and different Transformer-based architectures with different hyper-parameter settings; (ii) training: the systems are trained on different resources; and (iii) candidates: the pools of candidates from which the answers are selected are different.

6 Experiments

We study the following performance aspects of AVA in predicting: (i) the correctness of the individual answers provided by systems to questions (point-wise estimation); and (ii) the overall system Accuracy. We evaluated QA Accuracy as well as passage reranking performance, in comparison with the human labeling.

The first aspect studies the capacity of our different machine learning models, whereas the second provides a perspective on the practical use of AVA to develop QA systems.

Model Setting Configurations
Linear Classifier using the 4 features $x_{i}$
$\mathcal{A}_{0}$ one for each pair and one for all pairs from $\mathcal{D}_{0}$
$\mathcal{A}_{1}$ all possible combinations from $\mathcal{D}_{1}$
$\mathcal{A}_{2}$ the most probable setting from $\mathcal{A}_{1}$
Table 5: The AVA configurations used in training

6.1 Datasets

We trained and tested models using the AVA-NQ and AVA-GPD datasets, described in Section 5.2. We also evaluate the point-wise performance on the WikiQA and TREC-QA datasets.

6.2 Models

Table 5 summarizes the configurations we consider for training and testing. For the linear classifier baseline, we built a vanilla SVM classifier using scikit-learn, setting the probability parameter to enable Platt scaling calibration of the SVM score.

We developed our Transformer-based evaluators on top of the HuggingFace Transformers library Wolf et al. (2019). We use RoBERTa-Base as the initial pre-trained model for each $\mathcal{B}$ instance Liu et al. (2019). We use the default hyperparameter setting of typical GLUE trainings. This includes (i) the AdamW variant Loshchilov and Hutter (2017) as optimizer, (ii) a learning rate of 1e-06 for all fine-tuning exercises, and (iii) a maximum sequence length of 128. The number of iterations is set to 2, and we use a development set to enable early stopping based on the F1 measure after the first iteration. We use the same batch size in all experiments to avoid performance discrepancies caused by different batch size settings.
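For concreteness, a hedged sketch of this fine-tuning setup is given below (RoBERTa-Base, AdamW, learning rate 1e-06, maximum sequence length 128, 2 iterations with early stopping on dev F1); the batch size value is an assumption, and the training loop itself is omitted.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW

MODEL_NAME = "roberta-base"
MAX_SEQ_LEN = 128          # maximum sequence length
LEARNING_RATE = 1e-6       # used for all fine-tuning exercises
NUM_ITERATIONS = 2         # with early stopping on dev F1 after the first iteration
BATCH_SIZE = 32            # assumption: the paper fixes one batch size but does not report it

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

def encode(first_text, second_text):
    # e.g., first_text=r and second_text=q + " " + t for the A_1 (r, q∘t) configuration
    return tokenizer(first_text, second_text, truncation=True,
                     max_length=MAX_SEQ_LEN, padding="max_length", return_tensors="pt")
```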

6.3 Metrics

We study the performance of AVA in evaluating passage reranker systems, which differ not only in methods but also in domains and application settings. We employ the following evaluation strategies to benchmark AVA.

Point-wise Evaluation

We study the performance of AVA on point-wise estimation using traditional Precision, Recall, and F1. The metrics indicate the performance of AVA in predicting if an answer candidate is correct or not.

System-wise evaluation

We measured AVA when used in a simple aggregator to compute the overall system performance over a test set. The metrics we consider are: Precision-at-1 (P@1), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR), when computing the performance on TREC-QA and WikiQA, since such datasets contain ranks of answers. In contrast, we only use P@1 on ADS dataset, as this only includes the selected answers for each system.

We use Kendall’s Tau-b (computed with scipy.stats.kendalltau) to measure the correlation between the ranking produced by AVA and the one derived from the GS: $\tau=\frac{c-d}{c+d}$, where $c$ and $d$ are the numbers of concordant and discordant pairs between the two rankings.

We additionally analyze the gap between each performance value given by AVA and the one computed with the GS, using the root mean square error: $\text{RMSE}(a,h)=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(a_{i}-h_{i})^{2}}$, where $a$ and $h$ are the measures given by AVA and by human annotation, respectively.
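Both measures can be computed directly with SciPy and NumPy, as in the sketch below; the commented example values are illustrative only.

```python
import numpy as np
from scipy.stats import kendalltau

def ranking_correlation(ava_scores, gold_scores):
    """Kendall's Tau-b between the system ranking induced by AVA and by the GS."""
    tau, p_value = kendalltau(ava_scores, gold_scores)  # tau-b is scipy's default variant
    return tau, p_value

def rmse(ava_measures, human_measures):
    """Root mean square error between AVA-estimated and human-computed measures."""
    a, h = np.asarray(ava_measures), np.asarray(human_measures)
    return float(np.sqrt(np.mean((a - h) ** 2)))

# example (illustrative numbers): P@1 of three systems measured with AVA vs. with the GS
# tau, p = ranking_correlation([0.71, 0.87, 0.89], [0.72, 0.87, 0.90])
```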

6.4 Results on Point-wise Evaluation

We evaluate the performance of AVA in predicting if an answer $t$ is correct for a question $q$, given a reference $r$. Table 6 shows the results: Column 1 reports the names of the systems described in Section 4, while Columns 2 and 3 show the F1 measured on AVA-NQ and AVA-GPD, respectively.

We note that: (i) the F1 on AVA-GPD is much higher than on AVA-NQ, since the former dataset is much larger than the latter;
(ii) $\mathcal{A}_{0}(\{(q,r)\})$ cannot predict if an answer is correct, as it does not use the answer in the representation, thus its Accuracy is lower than 7%;
(iii) $\mathcal{A}_{0}(\{(r,t)\})$ is already a reasonable model, mainly based on the paraphrasing between $r$ and $t$;
(iv) $\mathcal{A}_{0}(\{(q,t)\})$ is also a good model, as it is as powerful as a QA system;
(v) the $\mathcal{A}_{1}$ models that take the entire triplet $q$, $r$, and $t$ are the most accurate, achieving an F1 of almost 74%;
(vi) the use of combinations of triplets, e.g., $\mathcal{A}_{1}(\{(r,q\circ t),(t,q\circ r)\})$, provides an even more accurate model; and finally,
(vii) the peer-attention model, i.e., $\mathcal{A}_{2}((r,q\circ t),(t,q\circ r))$, reaches almost 75%.

Model  F1 on AVA-GPD-Test (trained/dev. on AVA-NQ)  F1 on AVA-GPD-Test (trained/dev. on AVA-GPD)
Linear Classifier 0.0000 0.3999
$\mathcal{A}_{0}(\{(q,r)\})$ 0.0004 0.0695
$\mathcal{A}_{0}(\{(r,t)\})$ 0.3778 0.6247
$\mathcal{A}_{0}(\{(q,t)\})$ 0.5801 0.6713
$\mathcal{A}_{0}(\mathcal{D}_{0})$ 0.3962 0.6807
$\mathcal{A}_{1}(\{(q,r\circ t)\})$ 0.3788 0.7014
$\mathcal{A}_{1}(\{(r,q\circ t)\})$ 0.4583 0.7383
$\mathcal{A}_{1}(\{(t,q\circ r)\})$ 0.4517 0.7236
$\mathcal{A}_{1}(\{(q,r\circ t),(t,q\circ r)\})$ 0.3546 0.7421
$\mathcal{A}_{1}(\{(r,q\circ t),(t,q\circ r)\})$ 0.4002 0.7447
$\mathcal{A}_{1}(\{(r,q\circ t),(q,r\circ t)\})$ 0.4873 0.7435
$\mathcal{A}_{1}(\mathcal{D}_{1})$ 0.4121 0.7303
$\mathcal{A}_{2}((r,q\circ t),(t,q\circ r))$ 0.4187 0.7472
Table 6: F1 on AVA-GPD-Test
Dataset Metric Kendall $\tau$ $p$ RMSE ± $\sigma$
TREC-QA-Dev P@1 1.000 0.003 0.000 ± 0.000
TREC-QA-Dev MAP 1.000 0.003 0.040 ± 0.019
TREC-QA-Dev MRR 0.866 0.017 0.015 ± 0.011
TREC-QA-Test P@1 1.000 0.003 0.034 ± 0.018
TREC-QA-Test MAP 0.867 0.017 0.041 ± 0.029
TREC-QA-Test MRR 1.000 0.003 0.020 ± 0.012
WikiQA-Dev P@1 1.000 0.009 0.000 ± 0.000
WikiQA-Dev MAP 0.733 0.056 0.050 ± 0.039
WikiQA-Dev MRR 0.690 0.056 0.063 ± 0.052
WikiQA-Test P@1 0.889 0.017 0.079 ± 0.030
WikiQA-Test MAP 0.733 0.056 0.081 ± 0.040
WikiQA-Test MRR 0.867 0.017 0.095 ± 0.035
Table 7: System-wise evaluation on TREC-QA and WikiQA using the AVA model $\mathcal{A}_{2}((r,q\circ t),(t,q\circ r))$.
Metrics M1 M2 M3 M4 M5 M6
TREC-QA-Dev Gold P@1 0.717 0.870 0.891 0.935 0.739 0.826
MAP 0.691 0.858 0.913 0.912 0.769 0.796
MRR 0.819 0.923 0.937 0.967 0.835 0.890
AVA P@1 0.717 0.870 0.891 0.935 0.739 0.826
MAP 0.688 0.831 0.864 0.857 0.717 0.772
MRR 0.809 0.920 0.940 0.967 0.803 0.876
Trec-QA-Test Gold P@1 0.596 0.885 0.904 0.962 0.712 0.788
MAP 0.661 0.873 0.894 0.904 0.771 0.801
MRR 0.763 0.933 0.945 0.976 0.820 0.869
AVA P@1 0.635 0.904 0.962 0.981 0.712 0.827
MAP 0.639 0.845 0.896 0.886 0.680 0.789
MRR 0.764 0.936 0.981 0.990 0.793 0.880
WikiQA-Dev Gold P@1 0.545 0.727 0.455 0.545 0.636 0.727
MAP 0.636 0.744 0.656 0.621 0.755 0.781
MRR 0.720 0.831 0.695 0.703 0.803 0.864
AVA P@1 0.545 0.727 0.455 0.545 0.636 0.727
MAP 0.523 0.751 0.643 0.617 0.713 0.774
MRR 0.568 0.841 0.682 0.698 0.788 0.841
WikiQA-Test Gold P@1 0.563 0.844 0.781 0.688 0.813 0.781
MAP 0.634 0.778 0.753 0.746 0.834 0.820
MRR 0.746 0.917 0.876 0.833 0.906 0.883
AVA P@1 0.625 0.781 0.719 0.656 0.719 0.656
MAP 0.660 0.750 0.687 0.683 0.705 0.704
MRR 0.732 0.820 0.783 0.741 0.791 0.762
Table 8: Details of the system-wise evaluation on TREC-QA and WikiQA, comparing the GS and the AVA model $\mathcal{A}_{2}((r,q\circ t),(t,q\circ r))$.
ADS Split Evaluator S1 S2 S3 S4 S5 S6 S7 S8 Kendall $\tau$ $p$ RMSE ± $\sigma$
Dev (20%) AVA 0.215 0.278 0.22 0.369 0.285 0.294 0.283 0.355 0.929 0.0004 0.0198 ± 0.012
Dev (20%) ADS 0.218 0.282 0.234 0.379 0.309 0.315 0.261 0.319
Test (80%) AVA 0.235 0.289 0.235 0.355 0.319 0.321 0.301 0.357 0.643 0.031 0.0350 ± 0.019
Test (80%) ADS 0.235 0.324 0.26 0.393 0.356 0.365 0.249 0.336
Table 9: Details of system-wise Evaluation on ADS benchmark dataset
Question $q$   Candidate $t$   TANDA   Reference $r$   $\mathcal{A}$
when were the nobel prize awards first given ? among them is the winner of the first prize in 1901 , sully prudhomme . 0.0001 leo tolstoy lost the first literature prize in 1901 to the forgettable rene f . a . sully prudhomme . 0.596
what branch of the service did eileen marie collins serve in ? the first woman to command a space shuttle mission , air force col . eileen collins , sees her flight next month as `` a great challenge ” in more ways than one . 0.046 shuttle commander eileen collins , a working mother and air force colonel , was set to make history as the first woman to command a space mission . 0.895
what was johnny appleseed ’s real name ? appleseed , whose real name was john chapman , planted many trees in the early 1800s . 0.026 whitmore said he was most fascinated with the story of john chapman , who is better known as johnny appleseed . 0.948
when was the challenger space shuttle disaster ? sept . 29 , 1988 _ americans return to space aboard the shuttle discovery , after a 32-month absence in the wake of the challenger accident . 0.995 challenger was lost on its 10th mission during a 1986 launch accident that killed seven crew members . 0.080
when did jack welch become chairman of general electric ? everyone knew it was coming , but now they know when : john f . welch jr . , the chairman of general electric , will retire after the company ’s annual meeting in april 2001 . 0.968 welch has turned what had been a $ 25 billion manufacturing company in 1981 into a $ 100 billion behemoth that derives huge portions of its revenues from more profitable services . 0.064
Table 10: Examples show AVA can detect the failures of the State-of-the-art model by Garg et al. (2020).

6.5 Results on system-wise evaluation

We evaluate the ability of AVA to predict the Accuracy of QA systems, as well as their performance on answer sentence reranking tasks. We conduct two evaluation studies: one with two public datasets, TREC-QA and WikiQA, and one with our internal ADS dataset.

6.5.1 Results on public datasets

For TREC-QA and WikiQA, we ran a set of different models against the development and test sets and compared their results with the performance measured by AVA, using one of the best models according to the point-wise evaluation, i.e., $\mathcal{A}_{2}((r,q\circ t),(t,q\circ r))$.

More specifically, we apply each model $m$ to select the best answer $t$ from the list of candidates for each $q$ in the dataset. We first compute the performance of model $m$ based on the provided annotations; the metrics include Accuracy, or Precision-at-1 (P@1), MAP, and MRR. We then run AVA on $(q,t)$ using the GS answers of $q$ as references $r$; the final AVA score is the average of the AVA scores obtained with the different references for $q$. Before computing the Accuracy on the test set, we tune the AVA threshold to minimize the RMSE between the Accuracy (P@1) measured by AVA and the one computed with the GS on the development set of each dataset. We use these thresholds to evaluate the results on the test sets.
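A hedged sketch of this procedure follows: the AVA score of each selected answer is averaged over the available references, P@1 is computed by thresholding, and the threshold is tuned on the development set to minimize the RMSE against the GS. The scoring function is a placeholder for the trained AVA model, and the grid search is an assumption.

```python
import numpy as np

def ava_system_p_at_1(selected, ava_score, threshold=0.5):
    """selected: list of (q, best answer t, references [r1, r2, ...]) for one system.
    ava_score(q, r, t): point-wise AVA score in [0, 1] (placeholder for the trained model)."""
    scores = [np.mean([ava_score(q, r, t) for r in refs]) for q, t, refs in selected]
    return float(np.mean([s > threshold for s in scores]))

def tune_threshold(dev_systems, gold_p_at_1, ava_score, grid=np.linspace(0, 1, 101)):
    """Pick the threshold minimizing the RMSE between AVA P@1 and GS P@1 on the dev set."""
    best = min(grid, key=lambda th: np.sqrt(np.mean(
        [(ava_system_p_at_1(sel, ava_score, th) - g) ** 2
         for sel, g in zip(dev_systems, gold_p_at_1)])))
    return float(best)
```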

We considered six different models, including one Compare-Aggregate (CNN) model and five Transformer-based models, four of which are collected from public resources (github.com/alexa/wqa_tanda) Garg et al. (2020). These models differ in their architectures and training data, thus their outputs are rather different. We removed questions that have no correct or no incorrect answers.

Table 7 reports the overall results averaged over the six models. We note that (i) by setting the right threshold on the dev. set, the error on P@1 is 0; (ii) this is not the case for MAP, which is a much harder value to predict, as it requires estimating an entire ranking; (iii) on the TREC-QA test set, AVA has an error ranging from 2 to 4.1 points on any measure; (iv) on the WikiQA test set, the error is higher, reaching 10%, probably due to the greater complexity of the questions; and (v) the std. dev. is low, suggesting that AVA can be used to estimate system performance.

Additionally, we compute the Kendall's Tau-b correlation between the rankings of the six systems sorted by performance (P@1) according to the GS and to AVA. We observe a perfect correlation on TREC-QA and a rather high correlation on WikiQA. This means that AVA can be used to determine whether a model is better than another, which is desirable when developing new systems. The low p-values indicate the reliability of our results.

Finally, Table 8 shows the comparison between the performance evaluated with GS (Human) and AVA for all six models. The predictions of AVA are close to those from human judgement.

6.5.2 Results on ADS

We use the ADS dataset in this evaluation. The task is more challenging, as AVA only receives the single best answer each system selected from its own candidate pool. There was also no control over the sources of the candidates. Table 9 shows the results. We note a lower correlation, due to the fact that the 8 evaluated systems have very close Accuracy. On the other hand, the RMSE is rather low, 3.1%, and the std. dev. is also acceptable, $<0.02$, suggesting an error of less than 7% with a probability $>95\%$.

6.6 Qualitative Analysis

Table 10 reports some example questions from the TREC-QA test set, the top candidate selected by the TANDA system Garg et al. (2020), the classification score of the latter, and the AVA score. AVA judges an answer correct if its score is larger than 0.5. We note that even when the score of the TANDA system is low, AVA can assign the answer a very high score, indicating that it is correct (see the first three examples). Conversely, a wrong answer can be classified as such by AVA even when TANDA assigned it a very high score (see the last two examples).

7 Conclusion

We presented AVA, an automatic evaluation method for QA systems. Specifically, we discussed our data collection strategy and model design to enable the development of AVA. First, we collected seven different datasets, classified into three different types, which we used to develop AVA in different stages. Second, we proposed different Transformer-based designs of AVA to exploit the feature signals relevant to the problem. Our extensive experimentation has shown the effectiveness of AVA for different types of evaluation: point-wise and system-wise, over Accuracy, MAP, and MRR.

References

  • Barrault et al. (2019) Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Eyal et al. (2019) Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In NAACL 2019, pages 3938–3948, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Garg et al. (2020) Siddhant Garg, Thuy Vu, and Alessandro Moschitti. 2020. TANDA: Transfer and adapt pre-trained transformer models for answer sentence selection.
  • Ghazarian et al. (2019) Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. CoRR, abs/1904.10635.
  • Green et al. (1961) Bert F. Green, Jr., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. 1961. Baseball: An automatic question-answerer. In Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference, IRE-AIEE-ACM ’61 (Western), pages 219–224, New York, NY, USA. ACM.
  • Gunawardena et al. (2015) Tilani Gunawardena, Nishara Pathirana, Medhavi Lokuhetti, Roshan G. Ragel, and Sampath Deegalla. 2015. Performance evaluation techniques for an automatic question answering system.
  • Kannan and Vinyals (2017) Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. CoRR, abs/1701.08198.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. TACL.
  • Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.
  • Leidner and Callison-Burch (2003) Jochen L. Leidner and Chris Callison-Burch. 2003. Evaluating question answering systems using faq answer injection. In Proceedings of the 6th Annual CLUK Research Colloquium.
  • Lin and Demner-Fushman (2006) Jimmy J. Lin and Dina Demner-Fushman. 2006. Methods for automatically evaluating answers to complex questions. Information Retrieval, 9:565–587.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in adam. CoRR, abs/1711.05101.
  • Lowe et al. (2017) Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116–1126, Vancouver, Canada. Association for Computational Linguistics.
  • Ma et al. (2019) Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In WMT 2019, pages 62–90, Florence, Italy. Association for Computational Linguistics.
  • Magnini et al. (2002) Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. 2002. Towards automatic evaluation of question/answering systems. In LREC.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL 2002, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Shah and Pomerantz (2010) Chirag Shah and Jefferey Pomerantz. 2010. Evaluating and predicting answer quality in community qa. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, pages 411–418, New York, NY, USA. ACM.
  • Shen et al. (2017) Gehui Shen, Yunlun Yang, and Zhi-Hong Deng. 2017. Inter-weighted alignment network for sentence pair modeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1179–1189, Copenhagen, Denmark. Association for Computational Linguistics.
  • Tao et al. (2017) Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2017. RUBER: an unsupervised method for automatic evaluation of open-domain dialog systems. CoRR, abs/1701.03079.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Way (2018) Andy Way. 2018. Quality expectations of machine translation. CoRR, abs/1803.08409.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
  • Yoon et al. (2019) Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. 2019. A compare-aggregate model with latent clustering for answer selection. CoRR, abs/1905.12897.