AVA: an Automatic eValuation Approach to Question Answering Systems
Abstract
We introduce AVA, an automatic evaluation approach for Question Answering, which, given a set of questions associated with Gold Standard answers, can estimate system Accuracy. AVA uses Transformer-based language models to encode question, answer, and reference text. This allows for effectively measuring the similarity between the reference and an automatic answer, biased towards the question semantics. To design, train, and test AVA, we built multiple large training, development, and test sets on both public and industrial benchmarks. Our solutions achieve up to 74.7% F1 score in predicting human judgement for single answers. Additionally, AVA can be used to evaluate the overall system Accuracy with an RMSE ranging from 0.02 to 0.09, depending on the availability of multiple references.
1 Introduction
Accuracy evaluation is essential both to guide system development and to estimate system quality, which is important for researchers, developers, and users. This is often conducted using benchmarking datasets, containing a data sample, possibly representative of the target data distribution, provided with Gold Standard (GS) labels (typically produced with a human annotation process). The evaluation is done by comparing the system output with the expected labels using some metric.
This approach unfortunately falls short when dealing with generation tasks, for which the system output may span a large, possibly infinite, set of correct items. For example, in the case of Question Answering (QA) systems, the set of correct answers for the question, Where is Rome located ?, is large. As it is impossible, not least for cost reasons, to annotate all possible system outputs, the standard approach is to manually re-evaluate each new output of the system. This dramatically limits the experimentation velocity, while significantly increasing development costs.
Another viable solution in specific domains consists in automatically computing an evaluation score between the system and the reference answers, which correlates with human judgement. The BLEU score, for example, is one popular measure in Machine Translation Papineni et al. (2002). This, however, can only be applied to specific tasks and, even in those cases, it typically shows limitations Way (2018). As a consequence, there is active research on learning methods to automatically evaluate MT systems Ma et al. (2019), while human evaluation remains a requirement in machine translation benchmarking Barrault et al. (2019).
QA would clearly benefit from a similar approach, but its automatic evaluation is technically more complex for several reasons. First, segment-overlap metrics such as BLEU, METEOR, or ROUGE do not work, since the correctness of an answer only loosely depends on the match between the reference and candidate answers. For example, two candidates can be respectively correct and incorrect even if they only differ by one word (or even one character), e.g., for the question, Who was the 43rd president of the USA ?, a correct answer is George W. Bush, while the very similar answer, George H. W. Bush, is wrong.
Second, the matching between the answer candidates and the reference must be carried out at the semantic level, and it is radically affected by the question semantics. For example, match$(q_1, c, r)$ can be true while match$(q_2, c, r)$ is false, where $(c, r)$ is a pair of answer candidate and reference, and $q_1$ and $q_2$ are two different questions. This especially happens for so-called non-factoid questions, e.g., asking for a description, an opinion, a manner, etc., which are typically answered by a fairly long explanatory text. For example, Table 1 shows a non-factoid question and three different valid answers, which are similar with respect to the question. However, if the question were, what may cause anxiety ?, Answer 1 and Answer 3 would intuitively look less related to Answer 2.
Question: What does cause left arm pain ?
Reference: Arm pain can be caused by a wide variety of problems, ranging from joint injuries to compressed nerves; if it radiates into your left arm it can even be a sign of a heart attack.
Answer 1: It is possible for left arm pain to be caused from straining the muscles of the arm, pending heart attack, or it can also be caused from indigestion.
Answer 2: Anxiety can cause muscles in the arm to become tense, and that tension could lead to pain.
Answer 3: In many cases, arm pain actually originates from a muscular problem in your neck or upper spine.
In this paper, we study the design of models for measuring the Accuracy of QA systems. In particular, we design several models based on pre-trained Transformers Devlin et al. (2018); Liu et al. (2019) that encode the triple of question $q$, candidate $c$, and reference $r$ in different ways.
Most importantly, we built (i) two datasets for training and testing the point-wise estimation of QA system output, i.e., judging whether an answer is correct or not, given a GS answer; and (ii) two datasets consisting of the outputs of several QA systems, for which AVA is supposed to estimate the Accuracy.
The results show a high Accuracy for point-wise models, up to 75%. Regarding the overall Accuracy estimation, AVA can almost always replicate the ranking of systems in terms of Accuracy performed by humans. Finally, the RMSE with respect to human evaluation depends on the datasets, ranging from 2% to 10%, with an acceptable Std. Dev. lower than 3-4%.
The structure of the paper is as follows: after discussing related work in Sec. 2, we describe the problem in Sec. 3. We then present the model design and the data construction, which are key aspects of system development, in Sections 4 and 5, respectively. Finally, we study the performance of our models in three different evaluation scenarios in Sec. 6.
2 Related Work
Automatic evaluation has been an active research topic for decades Papineni et al. (2002); Magnini et al. (2002). There are two typical strategies to design an automatic evaluator: supervised and unsupervised. In machine translation, for example, BLEU Papineni et al. (2002) has been a very popular unsupervised evaluation method, while supervised methods have also been proposed recently, most notably by Ma et al. (2019). For dialog systems, neural automatic evaluators have also been studied Ghazarian et al. (2019); Lowe et al. (2017); Tao et al. (2017); Kannan and Vinyals (2017).
QA has been studied since the early days of the literature Green et al. (1961), and it has recently been used to evaluate summarization Eyal et al. (2019). Automatic evaluation for QA was addressed by Magnini et al. (2002), and also for several subdomain QA systems Leidner and Callison-Burch (2003); Lin and Demner-Fushman (2006); Shah and Pomerantz (2010); Gunawardena et al. (2015). However, little progress has been made in the past two decades towards a standard method: automating QA evaluation is still an open problem, and there is no recent work supporting it.
3 Problem Definition
We target the automatic evaluation of QA systems, for which system Accuracy (the percentage of correct answers) is the most important measure. We also consider more complex measures such as MAP and MRR in the context of Answer Sentence Reranking/Selection.
3.1 Answer Sentence Selection (AS2)
The task of reranking answer sentence candidates provided by a retrieval engine can be modeled with a classifier scoring the candidates. Let $q$ be a question and $C_q$ be a set of answer sentence candidates for $q$; we define $r$ as a ranking function, which orders the candidates in $C_q$ according to a score, $p(q, c_i)$, indicating the probability that $c_i$ is a correct answer for $q$. Popular methods modeling $p$ include Compare-Aggregate Yoon et al. (2019), inter-weighted alignment networks Shen et al. (2017), and BERT Garg et al. (2020).
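As a minimal illustration of this selection step, the sketch below scores each candidate with a generic scoring function and returns the top-ranked sentence; the function and variable names are ours, not taken from any specific system.

```python
# Minimal sketch of AS2 answer selection: rank candidates by a scoring
# model p(q, c) and return the best one. `score_fn` stands in for any
# of the rerankers cited above (Compare-Aggregate, BERT, etc.).
def select_answer(question, candidates, score_fn):
    ranked = sorted(candidates, key=lambda c: score_fn(question, c), reverse=True)
    return ranked[0]
```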
Question ($q$): What is the population of California?
Reference ($r$): With slightly more than 39 million people (according to 2016 estimates), California is the nation’s most populous state—its population is almost one and a half times that of second-place Texas (28 million).
Short answer ($t$): 39 million
Candidate ($c$): The resident population of California has been steadily increasing over the past few decades and has increased to 39.56 million people in 2018.
3.2 Automatic Evaluation of QA Accuracy
The evaluation of system Accuracy can be approached in two ways: (i) evaluation of the single answer provided by the target system, which we call point-wise evaluation; and (ii) the aggregated evaluation of a set of questions, which we call system-wise evaluation.
We define the former as a function, $A(q, r, c, t) \rightarrow \{0, 1\}$, where $r$ is a reference answer (GS answer), $c$ is the candidate answer, $t$ is an optional short answer, and the output is simply a correct/incorrect label. Table 2 shows an example question associated with a reference, a system answer, and a short answer. (The latter can be very effective, but it adds an additional annotation cost; thus we limit its use to the baseline model, i.e., we aim to have a lower-cost AVA model.)
A configuration of $A$ is applied to compute the final Accuracy of a system using an aggregator function. In other words, to estimate the overall system Accuracy, we simply treat the point-wise AVA predictions as if they were the GS. For example, in the case of the Accuracy measure, we simply average the AVA predictions over the test questions, i.e., $\frac{1}{|Q|}\sum_{q \in Q} A(q, r_q, c_q, t_q)$, where $t$ is a short answer (e.g., as used in machine reading). It is an optional input, which we only use for a baseline, described in Section 4.1.
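The Accuracy case reduces to a one-line average over point-wise judgements. The sketch below is our illustration, with hypothetical names (ava_predict is assumed to return 1 for a correct answer and 0 otherwise).

```python
# Sketch of the system-wise aggregator: point-wise AVA predictions are
# treated as if they were GS labels and simply averaged.
def system_accuracy(ava_predict, eval_data):
    """eval_data: iterable of (question, reference, system_answer) triples."""
    judgements = [ava_predict(q, r, c) for q, r, c in eval_data]
    return sum(judgements) / len(judgements)
```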
4 Model for AVA
The main intuition in building an automatic evaluator for QA is that the model should capture (i) the same information a standard QA system uses, while (ii) exploiting the semantic similarity between the system answer and the reference, biased by the information asked by the question. We build two types of models: (i) a linear classifier, which is more interpretable and can help us verify our design hypotheses; and (ii) Transformer-based models, which have been successfully used in several language understanding tasks.
4.1 Linear Classifier
Given an input example, $(q, r, c, t)$, our classifier uses the following similarity features: $x_1 = \text{sim-token}(t, c)$, $x_2 = \text{sim-text}(q, c)$, $x_3 = \text{sim-text}(r, c)$, and $x_4 = \text{sim-text}(q, r)$, where sim-token$(t, c)$ is a binary feature testing whether $t$ is included in $c$, and sim-text is a sort of Jaccard similarity: $\text{sim-text}(s_i, s_j) = |T(s_i) \cap T(s_j)| \, / \, |T(s_i) \cup T(s_j)|$, where $T(s)$ is a function that splits $s$ into tokens.
Let $x = (x_1, x_2, x_3, x_4)$ be the similarity feature vector describing our evaluation tuple. We train a binary classifier $\rho$ on a dataset of such vectors using an SVM, where the label indicates whether $c$ correctly answers $q$ or not. We compute the point-wise evaluation of a tuple as the test $\rho(x) > th$, where $th$ is a threshold trading off Precision for Recall, as in standard classification approaches.
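A compact sketch of this baseline follows, assuming the feature pairing given above (our reading of the text) and the scikit-learn SVM with Platt scaling mentioned in Section 6.2.

```python
# Sketch of the linear baseline: four token-overlap features and an SVM.
from sklearn.svm import SVC


def tokens(s):
    return set(s.lower().split())


def sim_text(s1, s2):
    """Jaccard-style token overlap between two strings."""
    t1, t2 = tokens(s1), tokens(s2)
    return len(t1 & t2) / max(len(t1 | t2), 1)


def features(q, r, c, t):
    return [
        float(t.lower() in c.lower()),  # sim-token: short answer contained in candidate
        sim_text(q, c),                 # question vs. candidate
        sim_text(r, c),                 # reference vs. candidate
        sim_text(q, r),                 # question vs. reference
    ]

# X = [features(q, r, c, t) for q, r, c, t in train_tuples]; y = labels
clf = SVC(probability=True)  # probability=True enables Platt-scaling calibration
# clf.fit(X, y)
# is_correct = clf.predict_proba([features(q, r, c, t)])[0, 1] > th
```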
4.2 Transformer-based models
Transformer-based architectures have proved to be powerful language models, which can capture complex similarity patterns. Thus, they are suitable methods to improve the basic approach described in the previous section. Following the linear classifier modeling, we propose three different ways to exploit the relations among the members of the tuple $(q, r, c)$.
Let $B$ be a pre-trained language model, e.g., the recently proposed BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), XLNet Yang et al. (2019), or ALBERT Lan et al. (2020). We use $B$ to compute the embedding representation of the tuple members: $B(s_1, s_2) = E \in \mathbb{R}^d$, where $(s_1, s_2)$ is a sentence pair, $E$ is the output representation of the pair, and $d$ is the dimension of the output representation. The classification layer is a standard feedforward network, $N(E) = W E + \mathbf{b}$, where $W$ and $\mathbf{b}$ are parameters we learn by fine-tuning the model on a dataset $D$.
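The sketch below shows one way to realize the pair encoder $B$ and the feedforward classifier $N$ with the HuggingFace Transformers library; the pooling choice (first token) and all names are ours, not necessarily the authors' exact implementation.

```python
# Hedged sketch of B (pair encoder) and N (linear classifier) using
# RoBERTa-Base, the pre-trained model named in Section 6.2.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # N(E) = WE + b


def encode_pair(a, b):
    """B(a, b): pooled representation of a sentence pair (first-token pooling)."""
    inputs = tok(a, b, truncation=True, max_length=128, return_tensors="pt")
    return encoder(**inputs).last_hidden_state[:, 0]  # shape (1, d)


logits = classifier(encode_pair(
    "what does cause left arm pain ?",
    "arm pain can be caused by a wide variety of problems ."))
```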
We describe the different model designs in what follows.
Text-Pair Embedding
We build a language model representation for pairs of tuple members by simply feeding them to Transformer models in the standard sentence-pair fashion. We consider four different configurations: one for each of the pairs $(q, c)$, $(q, r)$, and $(c, r)$, and one for the triplet $(q, c, r)$, modeled as the concatenation of the previous three. The representation of each pair is produced by a different and independent BERT instance. More formally, we have three models, each applying the classification layer $N$ to the embedding of one pair; additionally, we design a combined model that applies $N$ to the concatenation of the three pair representations. We do not use the short answer, $t$, as its contribution is minimal when using powerful Transformer-based models.
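One plausible reading of the combined pair model is sketched below: three independent encoder instances, one per pair, whose pooled outputs are concatenated before the classification layer. Class and variable names are illustrative.

```python
# Sketch of the combined pair model: one encoder per pair, pooled
# outputs concatenated, then a single classification layer.
import torch


class PairEnsembleEvaluator(torch.nn.Module):
    def __init__(self, encoders, hidden_size):
        super().__init__()
        self.encoders = torch.nn.ModuleList(encoders)      # one instance per pair
        self.classifier = torch.nn.Linear(3 * hidden_size, 2)

    def forward(self, pair_inputs):
        # pair_inputs: tokenized batches for (q, c), (q, r), and (c, r)
        reps = [enc(**inp).last_hidden_state[:, 0]
                for enc, inp in zip(self.encoders, pair_inputs)]
        return self.classifier(torch.cat(reps, dim=-1))
```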
Improved Text-Triple Embedding
The models of the previous section are limited to pair representations. We improve this by designing models that can capture pattern dependencies across $q$, $c$, and $r$. To achieve this, we concatenate pairs of the three pieces of text above, indicating string concatenation with the $\circ$ operator, and pair the resulting text with the remaining member of the tuple. As before, we have the individual models as well as a combined model, where again we use different instances of $B$ and fine-tune them together accordingly.
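As an illustration only (the concrete pairings are not fully legible in this copy), the sketch below pairs the question with the concatenation $c \circ r$, reusing the encode_pair helper and classifier from the earlier sketch.

```python
# Illustrative triple input: one member paired with the concatenation of
# the other two, so a single encoder sees q, c, and r in one pass.
def encode_triple(q, c, r):
    # encode_pair and classifier come from the sketch in the previous subsection
    return encode_pair(q, c + " " + r)


logits = classifier(encode_triple(
    "what does cause left arm pain ?",
    "anxiety can cause muscles in the arm to become tense .",
    "arm pain can be caused by a wide variety of problems ."))
```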
Peer Attention for Pairs of Transformer-based Models
Our previous designs instantiate a different $B$ for each pair, learning the feature representation of the target pair and the relations between its members during the fine-tuning process. This individual optimization prevents capturing patterns across the representations of different pairs, as there is no strong connection between the instances. Indeed, the combination of feature representations only happens in the last classification layer.
We propose peer-attention to encourage feature transfer between different instances. The idea, similar to the encoder-decoder setting of Transformer models Vaswani et al. (2017), is to introduce an additional decoding step for each pair. Figure 1 depicts the proposed setting for learning the representation of two different pairs. The standard approach learns the two representations in one pass, with two independent instances of $B$. In the peer-attention setting, the representation produced for one pair is input to a second pass of fine-tuning for the other pair. Thus, the representation of one pair can attend over the representation of the other pair during the decoding stage. This allows the feature representations of each instance to be shared during both training and prediction.

5 Dataset Creation
We describe the datasets we created to develop AVA. First, we build two large-scale datasets for the standard QA task, namely AS2-NQ and AS2-GPD, derived from the Google Natural Questions dataset and from our internal dataset, respectively; their construction is described in Section 5.1. Second, in Section 5.2, we describe our approach to generate labelled data for AVA from these QA datasets. Finally, in Section 5.3, we build an additional dataset consisting of a set of systems and their output on target test sets, which can be used to evaluate the ability of AVA to estimate end-to-end system performance (system-wise evaluation).
5.1 Question Answering Datasets
5.1.1 AS2-NQ: AS2 Dataset from NQ
Google Natural Questions (NQ) is a large-scale dataset for the machine reading task Kwiatkowski et al. (2019). Each question is associated with a Wikipedia page and at least one long paragraph (long_answer) that contains the answer to the question. The long_answer may contain an additional short_answer annotation, a succinct extractive answer from the long paragraph. A long_answer usually consists of multiple sentences, thus NQ is not directly applicable to our setting.
We create AS2-NQ from NQ by leveraging both long_answer and short_answer annotations. In particular, the (correct) answers for a given question are the sentences in the long_answer paragraphs that contain annotated short_answers. The other sentences from the Wikipedia page are considered incorrect. The negative examples can be of the following types: (i) sentences that are in the long_answer but do not contain an annotated short answer (it is still possible that these sentences contain the short_answer text); (ii) sentences that are not part of the long_answer but contain a short_answer as a subphrase (such occurrences are generally accidental); and (iii) all the other sentences in the document.
The generation of negative examples impacts the robustness of the trained model in selecting the correct answer among the incorrect ones. AS2-NQ has four labels that describe the possible confusability levels of a sentence candidate. We apply the same processing to both the training and development sets of NQ. This dataset enables an effective transfer step Garg et al. (2020). Table 3 shows the statistics of the dataset.
AS2-NQ | AS2-NQ Qs with multiple As | AVA-NQ
data split | #Qs | #As | #wrong-As | #Qs | #As | #wrong-As | positives | negatives | total |
NQ-dev | 4,263 | 134,691 | 1,320,812 | 1,478 | 3,376 | 64,187 | 11,556 | 206,497 | 218,053 |
NQ-train | 105,020 | 10,288 | 33,294,803 | 2,360 | 6,392 | 96,152 | 26,100 | 432,913 | 459,013 |
5.1.2 AS2-GPD: General Purpose Dataset
A search engine using a large index can retrieve more relevant documents than those available in Wikipedia. Thus, we retrieved highly relevant candidates as follows: we (i) retrieved the top 500 relevant documents; (ii) automatically extracted the top 100 sentences, ranked by a BERT model over all sentences of the documents; and (iii) had all the top 100 sentences manually annotated as correct or incorrect answers. This process does not guarantee that we obtain all correct answers, but the probability of missing them is much lower than for other datasets. In addition, this dataset is richer than AS2-NQ, as it consists of answers from multiple sources. Furthermore, the average number of answers per question is also higher than in AS2-NQ. Table 4 shows the statistics of the dataset.
AS2-GPD | AS2-GPD Qs with multiple As | AVA-GPD
data split | #Qs | #As | #wrong-As | #Qs | #As | #wrong-As | positives | negatives | total |
GPD-train | 262 | 5,399 | 20,801 | 245 | 5,382 | 20,748 | 183,894 | 349,765 | 533,659 |
GPD-dev | 283 | 8,682 | 19,618 | 276 | 8,674 | 19,502 | 430,230 | 426,246 | 856,476 |
GPD-test | 294 | 9,412 | 19,988 | 281 | 9,399 | 19,790 | 479,028 | 449,625 | 928,653 |
5.2 AVA Datasets
The AS2 datasets from the previous section consist of a set of questions $Q$. Each $q \in Q$ is associated with a set of candidates, comprising both correct answers $A_q$ and incorrect answers $\bar{A}_q$. We construct the dataset for point-wise automatic evaluation (described in Section 4) as follows: (i) to have positive and negative examples for AVA, we first filter the QA dataset to keep only the questions that have at least two correct answers; this is critical to build positive and negative examples.
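The sketch below shows one natural way to complete this construction, under our assumption that positives pair two distinct correct answers (one as reference, one as candidate) and negatives pair a correct reference with an incorrect candidate; the paper's exact remaining steps are not reproduced in this copy.

```python
# Hedged sketch of AVA example generation from an AS2 question. The
# positive/negative pairing scheme below is our assumption, not a
# verbatim description of the authors' procedure.
from itertools import permutations, product


def build_ava_examples(question, correct, incorrect):
    if len(correct) < 2:                       # step (i): need >= 2 correct answers
        return []
    positives = [(question, r, c, 1) for r, c in permutations(correct, 2)]
    negatives = [(question, r, c, 0) for r, c in product(correct, incorrect)]
    return positives + negatives
```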
5.3 AVA Datasets from Systems (ADS)
To test AVA at the level of overall system Accuracy, we need a sample of systems and their output on different test sets. We create a dataset of candidate answers collected from eight systems over a set of 1,340 questions. The questions were sampled from an anonymized set of user utterances; we only considered information-inquiry questions. The systems differ from each other in multiple ways, including: (i) modeling: Compare-Aggregate (CNN-based) and different Transformer-based architectures with different hyper-parameter settings; (ii) training: the systems are trained on different resources; and (iii) candidates: the pools of candidates from which the answers are selected are different.
6 Experiments
We study the performance of AVA in predicting: (i) the correctness of the individual answers provided by systems to questions (point-wise estimation); and (ii) the overall system Accuracy. We evaluate QA Accuracy as well as passage reranking performance, in comparison with human labeling.
The first aspect studies the capacity of our different machine learning models, whereas the second provides a perspective on the practical use of AVA to develop QA systems.
Model Setting | Configurations |
Linear Classifier | using 4 features |
one for each and one for all from | |
all possible combinations from | |
the most probable setting from |
6.1 Datasets
We train and test models on the AVA-NQ and AVA-GPD datasets, described in Section 5.2. We also evaluate the point-wise performance on the WikiQA and TREC-QA datasets.
6.2 Models
Table 5 summarizes the configurations we consider for training and testing. For the linear classifier baseline, we built a vanilla SVM classifier using scikit-learn, setting the probability parameter to enable Platt-scaling calibration of the SVM scores.
We developed our Transformer-based evaluators on top of the HuggingFace Transformers library Wolf et al. (2019), using RoBERTa-Base as the initial pre-trained model for each instance Liu et al. (2019). We use the default hyperparameter setting of typical GLUE trainings. This includes (i) the AdamW variant Loshchilov and Hutter (2017) as optimizer, (ii) the same learning rate for all fine-tuning exercises, and (iii) a maximum sequence length of 128. The number of iterations is set to 2, and we use a development set to enable early stopping, based on the F1 measure, after the first iteration. We fix the same batch size in all experiments to avoid possible performance discrepancies caused by different batch-size settings.
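A sketch of this fine-tuning setup is given below; the learning rate is a placeholder (the paper's value is not legible in this copy), and the parameter groups reuse the encoder/classifier names from the earlier sketch.

```python
# Optimizer sketch for fine-tuning the evaluator (values partly placeholders).
from torch.optim import AdamW  # the AdamW variant of Loshchilov and Hutter (2017)

optimizer = AdamW(
    list(encoder.parameters()) + list(classifier.parameters()),
    lr=2e-5,            # placeholder: the paper's learning rate is not shown here
    weight_decay=0.01,
)
# Training regime from the text: max sequence length 128, 2 iterations,
# fixed batch size, early stopping on dev F1 after the first iteration.
```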
6.3 Metrics
We study the performance of AVA in evaluating passage reranker systems, which differ not only in methods but also in domains and application settings. We employ the following evaluation strategies to benchmark AVA.
Point-wise Evaluation
We study the performance of AVA on point-wise estimation using traditional Precision, Recall, and F1. The metrics indicate the performance of AVA in predicting if an answer candidate is correct or not.
System-wise evaluation
We measure AVA when used with a simple aggregator to compute the overall system performance over a test set. The metrics we consider are Precision-at-1 (P@1), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR) when computing the performance on TREC-QA and WikiQA, since these datasets contain ranked answer lists. In contrast, we only use P@1 on the ADS dataset, as it only includes the selected answer of each system.
We use Kendall’s Tau-b (computed with scipy.stats.kendalltau) to measure the correlation between the ranking produced by AVA and the one derived from the GS: $\tau_b = (P - Q)/\sqrt{(P + Q + T)(P + Q + U)}$, where $P$ and $Q$ are the numbers of concordant and discordant pairs between the two rankings, and $T$ and $U$ count the ties appearing only in the first and only in the second ranking, respectively.
We additionally analyze the gap between each performance value given by AVA and the one computed with the GS, using the root mean square error: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (m_i - \hat{m}_i)^2}$, where $m_i$ and $\hat{m}_i$ are the measures given by AVA and by human annotation, respectively.
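Both statistics can be computed with standard libraries, as the reference to scipy.stats.kendalltau suggests; the values below are toy inputs taken from Table 9.

```python
# Agreement statistics between AVA and gold system-level measures.
import numpy as np
from scipy.stats import kendalltau

ava_scores = [0.235, 0.289, 0.235, 0.355]    # per-system P@1 from AVA (from Table 9)
gold_scores = [0.235, 0.324, 0.260, 0.393]   # per-system P@1 from human labels

tau, p_value = kendalltau(ava_scores, gold_scores)  # Kendall's Tau-b by default
rmse = np.sqrt(np.mean((np.array(ava_scores) - np.array(gold_scores)) ** 2))
```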
6.4 Results on Point-wise Evaluation
We evaluate the performance of AVA in predicting whether an answer $c$ is correct for a question $q$, given a reference $r$. Table 6 shows the results: column 1 reports the names of the models described in Section 4, while columns 2 and 3 show the F1 on AVA-GPD-Test for models trained on AVA-NQ and on AVA-GPD, respectively.
We note that: (i) the F1 obtained when training on AVA-GPD is much higher than when training on AVA-NQ; this is because the former dataset is much larger than the latter;
(ii) the model that does not encode the candidate answer cannot predict whether an answer is correct; thus its Accuracy is lower than 7%;
(iii) the model based on the $(r, c)$ pair is already reasonable, as it mainly captures paraphrasing between the reference and the candidate;
(iv) the model based on the $(q, c)$ pair is also good, as it is essentially as powerful as a QA system;
(v) the models that take the entire triplet $q$, $c$, and $r$ are the most accurate, achieving an F1 of almost 74%;
(vi) the use of combinations of triplets provides an even more accurate model; and finally,
(vii) the peer-attention model reaches almost 75%.
Model | F1 on AVA-GPD-Test (train/dev from AVA-NQ) | F1 on AVA-GPD-Test (train/dev from AVA-GPD)
Linear Classifier | 0.0000 | 0.3999 |
0.0004 | 0.0695 | |
0.3778 | 0.6247 | |
0.5801 | 0.6713 | |
0.3962 | 0.6807 | |
0.3788 | 0.7014 | |
0.4583 | 0.7383 | |
) | 0.4517 | 0.7236 |
0.3546 | 0.7421 | |
0.4002 | 0.7447 | |
0.4873 | 0.7435 | |
0.4121 | 0.7303 | |
0.4187 | 0.7472 |
Dataset | Metric | Kendall τ | p-value | RMSE ± Std. Dev.
TREC-QA-Dev | P@1 | 1.000 | 0.003 | 0.000 ± 0.000
 | MAP | 1.000 | 0.003 | 0.040 ± 0.019
 | MRR | 0.866 | 0.017 | 0.015 ± 0.011
TREC-QA-Test | P@1 | 1.000 | 0.003 | 0.034 ± 0.018
 | MAP | 0.867 | 0.017 | 0.041 ± 0.029
 | MRR | 1.000 | 0.003 | 0.020 ± 0.012
WikiQA-Dev | P@1 | 1.000 | 0.009 | 0.000 ± 0.000
 | MAP | 0.733 | 0.056 | 0.050 ± 0.039
 | MRR | 0.690 | 0.056 | 0.063 ± 0.052
WikiQA-Test | P@1 | 0.889 | 0.017 | 0.079 ± 0.030
 | MAP | 0.733 | 0.056 | 0.081 ± 0.040
 | MRR | 0.867 | 0.017 | 0.095 ± 0.035
Dataset | Evaluator | Metric | M1 | M2 | M3 | M4 | M5 | M6
TREC-QA-Dev | Gold | P@1 | 0.717 | 0.870 | 0.891 | 0.935 | 0.739 | 0.826 |
MAP | 0.691 | 0.858 | 0.913 | 0.912 | 0.769 | 0.796 | ||
MRR | 0.819 | 0.923 | 0.937 | 0.967 | 0.835 | 0.890 | ||
AVA | P@1 | 0.717 | 0.870 | 0.891 | 0.935 | 0.739 | 0.826 | |
MAP | 0.688 | 0.831 | 0.864 | 0.857 | 0.717 | 0.772 | ||
MRR | 0.809 | 0.920 | 0.940 | 0.967 | 0.803 | 0.876 | ||
TREC-QA-Test | Gold | P@1 | 0.596 | 0.885 | 0.904 | 0.962 | 0.712 | 0.788 | |
MAP | 0.661 | 0.873 | 0.894 | 0.904 | 0.771 | 0.801 | ||
MRR | 0.763 | 0.933 | 0.945 | 0.976 | 0.820 | 0.869 | ||
AVA | P@1 | 0.635 | 0.904 | 0.962 | 0.981 | 0.712 | 0.827 | |
MAP | 0.639 | 0.845 | 0.896 | 0.886 | 0.680 | 0.789 | ||
MRR | 0.764 | 0.936 | 0.981 | 0.990 | 0.793 | 0.880 | ||
WikiQA-Dev | Gold | P@1 | 0.545 | 0.727 | 0.455 | 0.545 | 0.636 | 0.727 |
MAP | 0.636 | 0.744 | 0.656 | 0.621 | 0.755 | 0.781 | ||
MRR | 0.720 | 0.831 | 0.695 | 0.703 | 0.803 | 0.864 | ||
AVA | P@1 | 0.545 | 0.727 | 0.455 | 0.545 | 0.636 | 0.727 | |
MAP | 0.523 | 0.751 | 0.643 | 0.617 | 0.713 | 0.774 | ||
MRR | 0.568 | 0.841 | 0.682 | 0.698 | 0.788 | 0.841 | ||
WikiQA-Test | Gold | P@1 | 0.563 | 0.844 | 0.781 | 0.688 | 0.813 | 0.781 |
MAP | 0.634 | 0.778 | 0.753 | 0.746 | 0.834 | 0.820 | ||
MRR | 0.746 | 0.917 | 0.876 | 0.833 | 0.906 | 0.883 | ||
AVA | P@1 | 0.625 | 0.781 | 0.719 | 0.656 | 0.719 | 0.656 | |
MAP | 0.660 | 0.750 | 0.687 | 0.683 | 0.705 | 0.704 | ||
MRR | 0.732 | 0.820 | 0.783 | 0.741 | 0.791 | 0.762 |
ADS Split | Evaluator | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | Kendall τ | p-value | RMSE ± Std. Dev.
Dev (20%) | AVA | 0.215 | 0.278 | 0.220 | 0.369 | 0.285 | 0.294 | 0.283 | 0.355 | 0.929 | 0.0004 | 0.0198 ± 0.012
 | ADS | 0.218 | 0.282 | 0.234 | 0.379 | 0.309 | 0.315 | 0.261 | 0.319 | | |
Test (80%) | AVA | 0.235 | 0.289 | 0.235 | 0.355 | 0.319 | 0.321 | 0.301 | 0.357 | 0.643 | 0.031 | 0.0350 ± 0.019
 | ADS | 0.235 | 0.324 | 0.260 | 0.393 | 0.356 | 0.365 | 0.249 | 0.336 | | |
Question | Candidate | TANDA score | Reference | AVA score
when were the nobel prize awards first given ? | among them is the winner of the first prize in 1901 , sully prudhomme . | 0.0001 | leo tolstoy lost the first literature prize in 1901 to the forgettable rene f . a . sully prudhomme . | 0.596 |
what branch of the service did eileen marie collins serve in ? | the first woman to command a space shuttle mission , air force col . eileen collins , sees her flight next month as `` a great challenge ” in more ways than one . | 0.046 | shuttle commander eileen collins , a working mother and air force colonel , was set to make history as the first woman to command a space mission . | 0.895 |
what was johnny appleseed ’s real name ? | appleseed , whose real name was john chapman , planted many trees in the early 1800s . | 0.026 | whitmore said he was most fascinated with the story of john chapman , who is better known as johnny appleseed . | 0.948 |
when was the challenger space shuttle disaster ? | sept . 29 , 1988 _ americans return to space aboard the shuttle discovery , after a 32-month absence in the wake of the challenger accident . | 0.995 | challenger was lost on its 10th mission during a 1986 launch accident that killed seven crew members . | 0.080 |
when did jack welch become chairman of general electric ? | everyone knew it was coming , but now they know when : john f . welch jr . , the chairman of general electric , will retire after the company ’s annual meeting in april 2001 . | 0.968 | welch has turned what had been a $ 25 billion manufacturing company in 1981 into a $ 100 billion behemoth that derives huge portions of its revenues from more profitable services . | 0.064 |
6.5 Results on system-wise evaluation
We evaluate the ability of AVA to predict the Accuracy of QA systems as well as their performance on answer sentence reranking. We conduct two evaluation studies: one with two public datasets, TREC-QA and WikiQA, and one with the internal ADS dataset.
6.5.1 Results on public datasets
For TREC-QA and WikiQA, we ran a set of different models on the development and test sets and compared their measured performance with the one estimated by AVA, using one of the best models according to the point-wise evaluation.
More specifically, we apply each model to select the best answer from the list of candidates of each question in the dataset. We first compute the performance of the model based on the provided annotations; the metrics include Accuracy, or Precision-at-1 (P@1), MAP, and MRR. We then run AVA on the selected answers, using the GS answers of each question as references. The final AVA score is the average of the AVA scores computed with the different references available for the question. Before computing the Accuracy on the test set, we tune the AVA threshold to minimize the RMSE between the Accuracy (P@1) measured by AVA and the one computed with the GS on the development set of each dataset. We use these thresholds to evaluate the results on the test sets.
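A sketch of this procedure, under our naming assumptions (ava_score is a hypothetical function returning the point-wise probability for a question, reference, and candidate), is given below; the threshold search is a simple grid, which is one possible way to minimize the dev-set gap.

```python
# Multi-reference scoring and threshold tuning, as described above.
import numpy as np


def ava_judgement(ava_score, question, references, candidate, threshold):
    """Average the point-wise score over all GS references, then threshold."""
    mean_score = np.mean([ava_score(question, r, candidate) for r in references])
    return mean_score > threshold


def tune_threshold(ava_score, dev_items, gold_p_at_1,
                   grid=np.arange(0.05, 1.0, 0.05)):
    """Pick the threshold whose AVA P@1 is closest to the gold P@1 on dev
    (with a single measure, minimizing RMSE reduces to this absolute gap)."""
    def p_at_1(th):
        return np.mean([ava_judgement(ava_score, q, refs, best, th)
                        for q, refs, best in dev_items])
    return min(grid, key=lambda th: abs(p_at_1(th) - gold_p_at_1))
```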
We considered six different models: one Compare-Aggregate (CNN) model and five Transformer-based models, four of which are collected from public resources (github.com/alexa/wqa_tanda) Garg et al. (2020). These models differ in their architectures and training data, thus their outputs are rather different. We removed questions that have no correct or no incorrect answers.
Table 7 reports the overall results, averaged over the six models. We note that: (i) with the threshold set on the dev. set, the error on P@1 is 0; (ii) this is not the case for MAP, which is a much harder value to predict, as it requires estimating an entire ranking; (iii) on the TREC-QA test set, AVA has an error ranging from 2 to 4.1 points on any measure; (iv) on the WikiQA test set, the error is higher, reaching 10%, probably due to the larger complexity of the questions; and (v) the std. dev. is low, suggesting that AVA can be used to estimate system performance.
Additionally, we compute the Kendall's Tau-b correlation between the rankings of the six systems sorted by performance (P@1) according to the GS and to AVA. We observe a perfect correlation on TREC-QA and a rather high correlation on WikiQA. This means that AVA can be used to determine whether one model is better than another, which is desirable when developing new systems. The low p-values indicate the reliability of our results.
Finally, Table 8 compares the performance evaluated with the GS (Gold) and with AVA for all six models. The AVA estimates are close to those obtained from human judgement.
6.5.2 Results on ADS
We use the ADS dataset in this evaluation. The task is more challenging, as AVA only receives the single best answer each system selected from its own candidate pool. There was also no control over the sources of the candidates. Table 9 shows the results. We note a lower correlation, due to the fact that the eight evaluated systems have very close Accuracy. On the other hand, the RMSE is rather low, 3.1%, and the std. dev. is also acceptable, suggesting an error of less than 7% with a probability of 95%.
6.6 Qualitative Analysis
Table 10 reports some example questions from the TREC-QA test set, the top candidate selected by the TANDA system Garg et al. (2020), the classification score of the latter, the reference, and the AVA score. AVA judges an answer correct if its score is larger than 0.5. We note that even when the score of the TANDA system is low, AVA can assign the answer a very high score, indicating that it is correct (see the first three examples). Conversely, a wrong answer can be classified as such by AVA even if TANDA assigned it a very high score (see the last two examples).
7 Conclusion
We presented AVA, an automatic evaluation method for QA systems. Specifically, we discussed our data collection strategy and model design for enabling AVA development. First, we collected seven different datasets, classified into three different types, which we used to develop AVA in different stages. Second, we proposed different Transformer-based designs of AVA to exploit the feature signals relevant to the problem. Our extensive experimentation has shown the effectiveness of AVA for different types of evaluation: point-wise and system-wise, over Accuracy, MAP, and MRR.
References
- Barrault et al. (2019) Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Eyal et al. (2019) Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In NAACL 2019, pages 3938–3948, Minneapolis, Minnesota. Association for Computational Linguistics.
- Garg et al. (2020) Siddhant Garg, Thuy Vu, and Alessandro Moschitti. 2020. TANDA: Transfer and adapt pre-trained transformer models for answer sentence selection.
- Ghazarian et al. (2019) Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. CoRR, abs/1904.10635.
- Green et al. (1961) Bert F. Green, Jr., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. 1961. Baseball: An automatic question-answerer. In Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference, IRE-AIEE-ACM ’61 (Western), pages 219–224, New York, NY, USA. ACM.
- Gunawardena et al. (2015) Tilani Gunawardena, Nishara Pathirana, Medhavi Lokuhetti, Roshan G. Ragel, and Sampath Deegalla. 2015. Performance evaluation techniques for an automatic question answering system.
- Kannan and Vinyals (2017) Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. CoRR, abs/1701.08198.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. TACL.
- Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.
- Leidner and Callison-Burch (2003) Jochen L. Leidner and Chris Callison-Burch. 2003. Evaluating question answering systems using faq answer injection. In Proceedings of the 6th Annual CLUK Research Colloquium.
- Lin and Demner-Fushman (2006) Jimmy J. Lin and Dina Demner-Fushman. 2006. Methods for automatically evaluating answers to complex questions. Information Retrieval, 9:565–587.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
- Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in adam. CoRR, abs/1711.05101.
- Lowe et al. (2017) Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1116–1126, Vancouver, Canada. Association for Computational Linguistics.
- Ma et al. (2019) Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In WMT 2019, pages 62–90, Florence, Italy. Association for Computational Linguistics.
- Magnini et al. (2002) Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. 2002. Towards automatic evaluation of question/answering systems. In LREC.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL 2002, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Shah and Pomerantz (2010) Chirag Shah and Jefferey Pomerantz. 2010. Evaluating and predicting answer quality in community qa. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, pages 411–418, New York, NY, USA. ACM.
- Shen et al. (2017) Gehui Shen, Yunlun Yang, and Zhi-Hong Deng. 2017. Inter-weighted alignment network for sentence pair modeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1179–1189, Copenhagen, Denmark. Association for Computational Linguistics.
- Tao et al. (2017) Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2017. RUBER: an unsupervised method for automatic evaluation of open-domain dialog systems. CoRR, abs/1701.03079.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Way (2018) Andy Way. 2018. Quality expectations of machine translation. CoRR, abs/1803.08409.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
- Yoon et al. (2019) Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. 2019. A compare-aggregate model with latent clustering for answer selection. CoRR, abs/1905.12897.