What Can Secondary Predictions Tell Us?
An Exploration on Question-Answering with SQuAD-v2.0
Abstract
Performance in natural language processing, specifically for the question-answering task, is typically measured by comparing a model’s most confident (primary) prediction to golden answers (i.e., the ground truth). We are making the case that it is also helpful to quantify how close a model came to predicting a correct answer for examples that failed, a goal that the F1 score only partially satisfies. We derive our metrics from the probability distribution from which the model ranks its predictions. We define an example’s Golden Rank (GR) as the rank of its most confident prediction that matches the ground truth. Thus the GR quantifies how close a correct answer comes to being a model’s best prediction. We demonstrate how the GR can be used to classify questions and visualize their spectrum of difficulty, from relative successes to persistent extreme failures.
We derive a new aggregate statistic, the Golden Rank Interpolated Median (GRIM), that quantifies the proximity of correct predictions to the model’s choices over a whole dataset or any slice of examples. To develop some intuition and explore the applicability of these scores, we use the Stanford Question Answering Dataset (SQuAD-2) and a few popular transformer models from the Huggingface hub. We demonstrate that the GRIM is relatively independent of the F1 or the exact match (EM) scores. We then calculate and visualize these scores for various transformer architectures, probe their applicability in error analysis or difficulty assessment, and see how they relate to standard training diagnostics, i.e., the EM and F1 scores. We finally suggest possible follow-up areas of research.
1 Introduction
1.1 Motivation
In the course of studying the NLP extractive question-answering task with various transformer-based models on the SQuAD v2.0 dataset [RJL18], we became interested in understanding secondary prediction behavior. Specifically, we wanted to know how close correct answers came to becoming predictions and to use this information to classify failures or assess example difficulty. Our notion of proximity differs from the way F1 measures approximate success. Our idea is to employ a rank-based metric, ordering predictions by descending confidence level (the most confident at rank 0) and quantifying proximity by the rank of the lowest-ranking correct prediction. Examples successfully predicted would rank 0; all others would rank higher.
The primary validation method in SQuAD [RZLL16] is the exact match test (EM), a rigid metric that compares predicted answers to a short list of annotated correct answers after normalizing the compared strings. Normalization entails standardizing white space and eliminating punctuation and articles. The F1 score is a softer metric that rewards partial success by matching individual words rather than complete answers whenever the EM criterion fails.
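To make the normalization and word-overlap comparison concrete, here is a minimal sketch modeled on the official SQuAD evaluation logic (the function names are ours; the official metric additionally takes, for each example, the maximum score over all golden answers):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, and standardize whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, golden):
    return float(normalize(prediction) == normalize(golden))

def f1(prediction, golden):
    """Word-level F1 between one predicted and one golden answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(golden).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: both empty counts as a match, otherwise no credit.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```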
Although the F1 is appropriate as an extension of the EM, it is problematic when used for evaluating approximate success. Some of the issues are:
1. The F1 is based on the top choice (primary prediction) and disregards secondary predictions.
2. It lacks proportionality, particularly with SQuAD-v2 when unanswerable questions are involved.
3. It sometimes rewards answers that a human would find unacceptable.
The high degree of overlap between the F1 and EM results in an almost linear relationship, as shown visually in figure 3. The F1 may only differ from the EM for responses that fail the exact match test and are longer than one word. There are also cases where F1’s partial credit may reward incorrect answers or, particularly in SQuAD-2, give no credit to near-misses. We consider near-misses to be predictions matching golden answers that, although they don’t make the primary choice, still rank close to it.
For example, take this passage fragment, which contains the golden answer, highlighted in green: Victorian lines mainly use the 1,600 mm (5 ft 3 in) broad gauge. Since “1,600 mm” is equivalent to “5 ft 3 in”, there are multiple legitimate ways to answer the same question. An additional answer given in the dataset is: 1,600 mm. Further answers could be construed as correct, such as 1,600 mm (5 ft 3 in) or 5 ft 3 in or, stretching it, (5 ft 3 in) broad gauge, which the F1 would, at least partially, reward even though no such golden answer is explicitly provided. However, a prediction that returns broad gauge would be incorrect, although the F1 would give it partial credit. An answer such as mm (5 ft 3 in) is hard to justify any partial credit for, since the inclusion of “mm” is nonsensical, but the F1 would reward it.
As an example of lack of proportionality, consider a case where the most confident prediction (rank-0) is “No Answer” while rank-1 contains the correct answer. The F1 score would give no credit, but the rank in predictive probability order would be 1 (second best). In our judgment, the model came very close to choosing the correct answer, i.e., one step away from the top prediction. Furthermore, in the presence of “no-answers”, two failed predictions have no way of being compared using the F1 test. The rank-based method manifests “proximity to the correct answer” in a proportional way that parallels the mechanism used to choose the best answer. Another way of looking at this is as if the F1 score says “there are no correct words in the answer,” while the rank-based method tells us: “the model’s second choice was correct.”
We propose a metric we call the Golden Rank (GR). We order an example’s list of predictions by descending probability and find the predictions that match any of the correct answers. The lowest rank of those matched is the golden rank for the example. We also define an aggregate statistic we call the Golden Rank Interpolated Median (GRIM). The GRIM is calculated over secondary GRs only and is an estimator of the proximity between golden answers and the top prediction for a sample of examples. Two experiments with the same EM score can have significantly different GRIM scores, indicating that for the questions that failed, one model consistently assigns higher (alas, not the highest) confidence to the correct predictions than the other.
1.2 Related Work
The SQuAD datasets [RZLL16], [RJL18] support the extractive question-answering task and provide machine learning models with good-quality, large-scale data to learn from, along with standardized methods of evaluation. The history of question-answering datasets traces back to TREC-8, “the first large-scale evaluation of domain-independent question answering systems” [Voo99]. The Mean Reciprocal Rank (MRR) proposed there can assess multiple relevant results, as in a web search that returns multiple predictions in response to a question. The question-answering task defined by SQuAD evaluates only the primary (best) prediction from a model, with the EM and F1 scores as the established metrics.
Secondary predictions, i.e., predictions that fail conventional model evaluation, are used primarily in error analysis and interpretability studies. These goals have led to hybrid human-machine approaches with implementations such as Pandora [NKH18]. The premise that there is no silver bullet in analyzing wrong predictions or interpreting the reasons for failure led to an elaborate suite of tools made available to aid the researcher. These tools help in analyzing errors, clustering them, categorizing them, and deriving various visual and diagnostic instruments to tackle families of failures in a more efficient machine-assisted manner.
Radev et al. in [RQWF02] describe early rank-based metrics that can handle multiple predictions. They include, among others, First Hit Success (FHS), First Answer Reciprocal Word Rank (FARWR), Total Reciprocal Rank (TRR), and First Answer Reciprocal Rank (FARR). The discounting effect of reciprocal rank, as in the Mean Reciprocal Rank (MRR) [Voo99], is to reward answers closer to the best prediction and penalize those that are further away in a concise metric that reflects the perceived utility in the document search.
When it comes to evaluating the levels of difficulty of examples in a dataset, a systematic framework, ILDAE, was proposed by [VMB22a], which supports five discrete purposes:
1. Efficient model evaluation by reducing the size of datasets in model comparisons
2. Improving dataset quality by enhancing trivial examples and repairing erroneous ones
3. Model analysis by difficulty class, aiding in the selection of models for particular situations
4. Projection of out-of-domain (OOD) generalization potential by the use of difficulty-weighted accuracy
5. Dataset difficulty analysis informing future dataset development
The authors use a RoBERTa-large transformer model to classify each example by difficulty level. The idea is to perform the specific processing tasks involved in difficulty evaluation once, tagging instances accordingly. They then compare other models, trained on subsets drawn over prescribed difficulty distributions. These layered subsets produce evaluation results showing a high rank correlation (Kendall) with those trained on the complete datasets. The authors claim substantial improvements in accuracy scores by modifying overly trivial examples and correcting potential errors that make particular examples too tricky. Similar insights may inform the construction of altogether new datasets.
According to [VMB22b], example and dataset-level difficulty scoring can also facilitate curriculum learning strategies [XZM+20] for multi-task learning (MTL) by enabling models to form their own curriculum. The methodology also reduces the high computational cost of automated dataset curriculum-forming methods and removes the uncertainty introduced when human judgment determines the ordering. Finally, when applied to low-data regimes, the methodology is most effective for difficult examples, making it more appropriate for real-world applications.
Pan et al. in [PCA20] attempted to define a loss function based on how far away a false positive is from the ground truth. The notion of distance used was between the false positive and ground truth in the passage. It may be possible to apply the same idea, only expressing distance as rank, logit, or probability difference of the same end-points.
2 Methods
Examples of Ranked Predictions

id | Rank | Predicted Answer | Golden Answers | Probability | Match
---|---|---|---|---|---
8b | 0 | | [] | 1.000e+00 | True
8b | 1 | far off | | 4.217e-08 | False
8b | 2 | in their own city or far off | | 1.025e-08 | False
8b | 3 | Galaxy Public School in Kathmandu. Most of the… | | 9.435e-09 | False
8b | 4 | Public School | | 2.582e-09 | False
8b | 5 | Public School” appended to them, e.g., the Gal… | | 9.224e-10 | False
8b | 6 | far off, like boarding schools. The medium of … | | 9.001e-10 | False
8b | 7 | far off, | | 6.632e-10 | False
8b | 8 | un-aided’ schools | | 4.343e-10 | False
8b | 9 | ’aided’ schools. The private ’un-aided’ schools | | 3.816e-10 | False
e1 | 0 | vertebrates | [early vertebrates, vertebrates] | 9.999e-01 | True
e1 | 1 | rtebrates | | 4.125e-05 | False
e1 | 2 | brates | | 7.859e-06 | False
e1 | 3 | ve | | 2.296e-06 | False
e1 | 4 | early vertebrates | | 6.034e-07 | True
e1 | 5 | verte | | 2.536e-07 | False
e1 | 6 | adaptive immune system evolved in early verteb… | | 1.369e-08 | False
e1 | 7 | The adaptive immune system evolved in early ve… | | 4.158e-09 | False
e1 | 8 | rte | | 1.046e-11 | False
e1 | 9 | early ve | | 1.385e-12 | False
c0 | 0 | | [about twice as much, twice as much, twice, 50% more] | 9.903e-01 | False
c0 | 1 | twice as much | | 5.086e-03 | True
c0 | 2 | twice | | 4.352e-03 | True
c0 | 3 | twice as much (14.6 mg·L-1) | | 1.616e-04 | False
c0 | 4 | about twice as much | | 3.260e-05 | True
c0 | 5 | about twice | | 2.790e-05 | False
c0 | 6 | twice as much (14.6 mg·L-1 | | 6.369e-06 | False
c0 | 7 | twice as much (14.6 mg | | 5.840e-06 | False
c0 | 8 | twice as much (14.6 mg·L-1) dissolves at 0 °C … | | 2.609e-06 | False
c0 | 9 | about twice as much (14.6 mg·L-1) | | 1.035e-06 | False

A primary prediction is the model’s best choice of the span of words from the context (the passage associated with an example, providing the context for answering the question) that answers the question. All possible spans, including the empty span, comprise an ordered sequence that starts with the primary prediction followed by all secondary predictions from most likely to least likely. Because all golden answers are valid spans, each has at least one match in the sequence. A golden answer may match multiple spans since articles and punctuation tokens get removed before the comparison. For example, the same golden answer may match “the Amazon” as well as “Amazon.” Answers that don’t match any prediction span may signify dataset errors.
We refer to these matches as correct predictions. We define the Golden Rank (GR) of an example as the lowest rank of all correct predictions. If the Golden Rank is 0, i.e., the correct prediction is the primary prediction, the example scores as an exact match. So the ratio of the count of examples with GR = 0 to the total number of examples is the EM score.
Appendix B describes in detail the post-processing transformations that turn the logits into these predictions.
Let the top-K probability predictions for example $i$ be $\hat{y}^{a}_{r}(i)$, $r \in \{0, 1, \dots, K-1\}$, where $\hat{y}^{a}_{r}(i)$ is a string and $r$ stands for the prediction rank. The superscript $a$ stands for answer. For each of these top-K answers, the corresponding probabilities are $\hat{p}^{a}_{r}(i)$. The golden rank is the rank of the most likely exact match to any of the golden answers $G(i)$ for the example:

$$GR(i) = \min\left\{\, r : \hat{y}^{a}_{r}(i) \in G(i) \,\right\} \tag{1}$$
In practice, extracting all ranks of each example for each experiment would be grossly cost-prohibitive. A more reasonable compromise is to produce an output with each example’s best-K list from which we can calculate the GR. We use K = 10 for all our experiments here. So examples whose correct prediction ranks higher than 9 are assigned GR = 10.
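As an illustration, here is a minimal sketch of the GR computation from such a best-K output, assuming each example’s predictions are already sorted by descending probability and normalized the same way as the golden answers (the function name and data layout are ours, not the exact evaluation code):

```python
def golden_rank(predictions, golden_answers, k=10):
    """predictions: best-K answer strings in descending probability order
    ('' denotes the empty, no-answer span);
    golden_answers: normalized golden answers (empty for unanswerable questions).
    Returns the lowest rank whose prediction matches a golden answer,
    or the catch-all rank k when no match appears within the top K."""
    golden = set(golden_answers) or {""}
    for rank, answer in enumerate(predictions[:k]):
        if answer in golden:
            return rank
    return k  # catch-all rank for GR >= k

# The third example (c0) in the table above: the primary prediction is the no-answer span.
c0_preds = ["", "twice as much", "twice", "twice as much (14.6 mg·L-1)"]
c0_golden = ["about twice as much", "twice as much", "twice", "50% more"]
print(golden_rank(c0_preds, c0_golden))  # -> 1, a near miss
```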
Table 2 contains the top-10 predictions for three examples. It shows the rank, the predictions, the golden answers, the probability, and whether the prediction is correct. The first and second examples match at rank 0, i.e., the primary prediction. The primary prediction of the third example came out as no answer, which does not match the correct answer; however, predictions at ranks 1, 2, and 4 of the same example are correct. The Golden Rank for this last example is 1, i.e., it is a near miss, but it fails the EM criterion, and the F1 gives it no partial credit.
We also defined an aggregate score, the Golden Rank Interpolated Median (GRIM), over secondary predictions, i.e., examples with GR > 0. We considered various alternatives before settling on the GRIM:
- The Golden Rank Mean (GRM) is the mean of all golden ranks, which would be robust to different context lengths, but it requires that we know the actual golden ranks for all examples. Since we only collect the top-K predictions, only ranks below K reflect their true GRs, while the remaining outliers are all assigned the same catch-all rank K, which distorts the mean.
- The Discounted Golden Rank Mean (DGRM) offers a way to avoid the mean-outlier problem by discounting the GR contribution as ranks increase. A discount factor that decays with rank, such as $1/(1+r)^{\alpha}$ with an appropriate value of $\alpha$, can make contributions at ranks higher than K insignificant. An implicit assumption is that golden answers that match lower ranks are of more interest.
- The Golden Rank Interpolated Median (GRIM) is robust to fluctuations of K as long as the median rank stays below K, with $K \le N$, where N is the number of candidate spans. A pure median would only yield an integer since ranks are integers. The interpolated median treats example ranks as midpoints of continuous intervals and assumes linear interpolation. If the regular median is $m$, and there are $n_a$ examples that fall above the median rank, $n_m$ examples in the median rank, and $n_b$ examples below the median rank, then the interpolated median (definition from Wikipedia) is given by

  $$\mathrm{GRIM} = m + \frac{n_a - n_b}{2\, n_m} \tag{2}$$

- The First Answer Reciprocal Rank (FARR) is described by Radev et al. in [RQWF02], and it is similar to the DGRM metric defined earlier with a reciprocal discount of $1/(1+r)$. For example, for rank 3, the FARR is $1/4 = 0.25$.
- The Mean Reciprocal Rank (MRR) is the mean of the inverse of the ranks of potentially multiple correct values [RQWF02]. It is well suited to evaluate the normalized performance of searches that return many answers.
We chose to avoid discounting or normalizing ranks in the GR aggregation score implemented here, so we decided on the GRIM. We use the vector of golden ranks produced by Equation 1 to calculate the median $m$ and the counts $n_a$, $n_m$, and $n_b$ used in the interpolation Equation 2, which computes the GRIM.
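A small Python sketch of the GRIM computation, following Equations 1 and 2 (the even-count edge case where the median falls between two occupied ranks is handled naively here):

```python
import numpy as np

def interpolated_median(ranks):
    """Interpolated median (Equation 2): integer ranks are treated as
    midpoints of unit-width intervals."""
    ranks = np.asarray(ranks, dtype=float)
    m = np.median(ranks)
    n_at = np.sum(ranks == m)
    if n_at == 0:          # median falls between two occupied ranks
        return float(m)    # treated as already interpolated in this sketch
    n_below = np.sum(ranks < m)
    n_above = np.sum(ranks > m)
    return float(m + (n_above - n_below) / (2.0 * n_at))

def grim(golden_ranks):
    """GRIM: interpolated median over secondary predictions only (GR > 0)."""
    secondary = [r for r in golden_ranks if r > 0]
    return interpolated_median(secondary) if secondary else 0.0

# Example: golden ranks from a small batch of examples (10 is the catch-all rank).
print(grim([0, 0, 1, 1, 2, 2, 2, 5, 10, 10]))  # ~2.17
```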
3 Experiments
We performed experiments to answer the following questions.
1. What is the relationship among GR/GRIM, EM, and F1? How correlated are these metrics?
2. What do the GR and the GRIM tell us about the models?
3. What do the GR and the GRIM tell us about the examples, their difficulty level, and their error patterns?
4. How do the GRIM, EM, and F1 change during training?
5. Is the GRIM, the EM, or the F1 a better criterion for selecting model ensembles?
Model/Experiment Description | Previously Fine-Tuned | +FT Epochs (Sample Size) | Batch Size | Initial LR | Decayed Opt Steps
---|---|---|---|---|---
BERT-base-uncased-001 | - | 8 | 24 | 5e-5 | 43920 |
BERT-base-uncased-007 | - | 8 | 24 | 3e-5 | 43920 |
BERT-base-uncased-009 | - | 8 | 24 | 4e-5 | 43920 |
BERT-base-uncased-010 | - | 16 | 24 | 4e-5 | 87840 |
BERT-large-cased-wwm-003 | - | 2 | 4 | 3e-5 | 66040 |
BERT-large-uncased-wwm-eval-only | SQuAD2 | - | - | - | - |
Roberta-base-uncased-eval-only | SQuAD2 | - | - | - | - |
Roberta-base-uncased-001 | SQuAD2 | 8 | 24 | 4e-5 | 43944 |
Electra-base-eval-only | SQuAD2 | - | - | - | - |
Electra-base-005 | SQuAD2 | 3 (20K) | 24 | 5e-5 | 2502 |
Electra-base-006 | SQuAD2 | 8 | 24 | 4e-5 | 43920 |
DistilBERT-base-unc-eval-only | SQuAD2 | - | - | - | - |
DistilBERT-base-unc-003 | SQuAD2 | 5 | 24 | 5e-5 | 27450 |
DistilBERT-base-unc-004 | SQuAD2 | 5 | 32 | 5e-5 | 20500 |
Longformer-base-4096-eval-only | SQuAD2 | - | - | - | - |
Longformer-base-4096-009 | SQuAD2 | 4 | 8 | 4e-5 | 65276 |
The models used for this evaluation are transformer-based, selected from the Huggingface Hub. We evaluated five BERT models that we fine-tuned on SQuAD v2. Using the published weights, we also evaluated various previously fine-tuned models. We then added a few epochs of fine-tuning and re-evaluated them. Table 2 lists the resulting 16 experiments and the models involved. All experiments generated a list of the best ten predictions for analysis. For details about the models used, please refer to Appendix A.
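For illustration, a best-10 list comparable to the ones we analyze can be produced with the Hugging Face question-answering pipeline and one of the published checkpoints from Table 2; top_k and handle_impossible_answer are standard pipeline arguments, although our experiments used our own evaluation loop, outlined in Appendix B, rather than this pipeline:

```python
from transformers import pipeline

# One of the already fine-tuned checkpoints evaluated in Table 2.
qa = pipeline("question-answering",
              model="deepset/bert-large-uncased-whole-word-masking-squad2")

question = "How do structural geologists observe the fabric within the rocks?"
context = ("Structural geologists use microscopic analysis of oriented thin "
           "sections of geologic samples to observe the fabric within the rocks...")

# Return the ten most probable spans, allowing the empty (no-answer) span.
predictions = qa(question=question, context=context,
                 top_k=10, handle_impossible_answer=True)
for rank, pred in enumerate(predictions):
    print(rank, repr(pred["answer"]), pred["score"])
```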
4 Analysis
4.1 Relationships among GR, EM, and F1
One concern when introducing a new metric is that it may simply measure the same thing as existing metrics. We therefore use the results of the validation runs from the experiments to demonstrate visually how GR and GRIM differ from EM and F1. One way to show these differences is through correlations.
[Table 3: four scatter plots over the 16 experiments — F1 vs. EM, GRIM vs. EM, GRIM vs. (F1 − EM), and GRIM vs. F1.]
F1 can be viewed as a tie-breaker, being the same as EM for successful predictions and showing partial success for cases where the EM fails. However, the high correlation between F1 and EM is also because non-answers and monolectic (single-word) answers always result in identical values for both metrics. The first plot on the left of Table 3 visually demonstrates the high EM-F1 correlation over all examples and over the answerable examples separately. It shows the 16 experiments as points, the regression line, and the 95% confidence region.
We calculate the GRIM over secondary predictions only, which takes away a big (and the only) chunk of examples that would be perfectly correlated. The second plot of Table 3 shows the relationship between GRIM and EM, with the experiments as points, colored according to the transformer architecture behind each experiment.
The remaining two graphs in Table 3 repeat similar GRIM plots vs. the F1 - EM difference and F1. The difference between F1 and EM is a more distilled measure of approximate success for failed predictions derived from standard metrics. Still, it fails to capture such proximity in the cases of non-answerable and monolectic answers.
These visualizations are reasonably convincing that what we measure with GRIM differs from EM or F1. Also, the GR can be used to classify and analyze errors in secondary predictions.
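As a minimal illustration of how such correlations can be quantified across experiments, the snippet below uses the EM, F1, and GRIM values of four of the eval-only rows of Table 5; the full analysis uses all 16 experiments:

```python
import numpy as np

# EM, F1, and GRIM scores for four of the eval-only experiments (Table 5).
em   = np.array([80.91, 79.90, 77.61, 79.92])
f1   = np.array([83.88, 83.33, 81.68, 82.93])
grim = np.array([2.38, 3.13, 4.38, 2.23])

print("corr(EM, F1)   =", np.corrcoef(em, f1)[0, 1])
print("corr(EM, GRIM) =", np.corrcoef(em, grim)[0, 1])
print("corr(F1, GRIM) =", np.corrcoef(f1, grim)[0, 1])
```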
4.2 What GR and GRIM tell us about Models and Experiments
We first examine the behavior of the new metrics at the model and experiment levels using results from the 16 experiments in Table 2. For each experiment, we show the frequency distribution of examples by GR (the golden rank is the lowest-ranking exact match to a correct answer) and, with dotted vertical lines, the GRIM as the interpolated median of GRs greater than zero. To make this analysis more readable, we subdivided the experiments around a few themes, which we present below. All evaluations use the SQuAD-v2 validation dataset.
We set forth an informal hypothesis we call the “GRIM-performance hypothesis”, stating that better-performing models tend to have lower GRIMs. However, our early analysis indicated that what is happening is more nuanced. So we introduce the notion of “training maturity” (not to be confused with MLOps operational maturity) as an informal measure of how early or late the best model emerges during training. We use the step (or epoch) that maximizes the F1 score to measure training maturity. Best models sometimes get established in only a couple of epochs, but more often, this occurs later in the training cycle. Our observations show evidence that, all else being equal, experiments with comparable training maturity are more likely to uphold the GRIM-performance hypothesis, and higher training maturity makes this behavior more predictable.
[Table 4: six histograms of example counts by golden rank, one panel per group of experiments, with dotted vertical lines marking each experiment’s GRIM.]
Our rationalization is that many predictions are near-misses early in the training cycle, with their GR hovering around the low secondary ranks, pulling the GRIM lower. As the model improves, these low-hanging-fruit predictions get absorbed into rank 0, and the remaining secondary GRs push the GRIM higher. This behavior continues throughout training but is more volatile and pronounced early on, when the effective learning rates are higher, and it diminishes later towards a state of equilibrium. In subsection 4.4, we look at how GRIM, F1, and EM evolve during training, shedding light on this phenomenon. A word of caution: fine-tuning is often compounded, done in multiple runs, often extending prior fine-tuning whose details may be unknown. In this analysis, we forego any attempt to normalize the measure of training maturity over the total length of training, batch size, or other potentially relevant factors. Consequently, our definition of training maturity is somewhat naive and purely qualitative. Attempting a more robust method would require extensive study with more models and other classification tasks. It should also consider factors such as training length, batch size, overfitting, and effects such as the model’s generalization capability.
4.2.1 Pre-Fine-Tuned Models
The top-left histogram in Table 4 shows results from evaluation runs on the five already fine-tuned models we obtained from the Hugging Face Hub. These five tests used the published weights with no additional fine-tuning. Scores are shown in the top section of Table 5. A few notable observations are:
1. The best two performing models, BERT-large and RoBERTa, have low GRIMs. Their secondary GR distribution leans to the left, with fewer examples in the 10+ rank than the rest of the models. They uphold the GRIM-performance hypothesis except for Electra.
2. The worst performing model is DistilBERT, with a high GRIM and a significant spike at rank 10+. It upholds the GRIM-performance hypothesis except for Electra.
3. With performance similar to RoBERTa’s, Longformer has more examples in the 10+ rank than either RoBERTa or BERT-large, which raises its GRIM midway between RoBERTa’s and DistilBERT’s. It upholds the GRIM-performance hypothesis except for Electra.
4. Electra, although relatively well-performing, exhibits contrarian behavior relative to the GRIM-performance hypothesis, with a secondary GR distribution leaning towards higher ranks and a GRIM higher than the worse-performing DistilBERT’s.
Model/Experiment Description | Total Steps | EM Score | Best-F1 Step | F1 Score | GRIM Score
---|---|---|---|---|---
BERT-lg-uc-wwm-sq2-001 | - | 80.91 | - | 83.88 | 2.38 |
Longformer-sq2-011 | - | 79.90 | - | 83.33 | 3.13 |
Electra-sq2-007 | - | 77.61 | - | 81.68 | 4.38 |
Roberta-sq2-002 | - | 79.92 | - | 82.93 | 2.23 |
DistilBERT-uc-sq2/008 | - | 66.31 | - | 69.72 | 3.99 |
BERT-lg-cas-wwm-003 | 66K | 80.73 | 61.5K | 83.75 | 2.73 |
Longformer-sq2-009 | 65K | 78.98 | 55K | 82.82 | 3.65 |
Electra-sq2-005 | 2.5K | 77.47 | 2.5K | 81.16 | 3.84 |
Electra-sq2-006 | 44K | 79.36 | 44K | 83.22 | 3.40 |
Roberta-sq2-001 | 44K | 79.83 | 27.5K | 83.33 | 3.11 |
DistilBERT-uc-sq2-003 | 27K | 67.27 | 21K | 70.79 | 3.41 |
DistilBERT-uc-sq2-004 | 21K | 66.1 | 20.5K | 69.99 | 3.97 |
4.2.2 Same Models with Additional Fine-Tuning
The middle-top graph in Table 4 shows the same fine-tuned models after adding more fine-tuning runs (see table 2 for details). The additional fine-tuning causes a few subtle changes worth noting:
1. The “spread” of the GRIMs between the best and worst-performing models has narrowed.
2. The DistilBERT-003 experiment in yellow has substantially improved and upholds the GRIM-performance hypothesis in all cases except when compared to Longformer 009 and Electra 005.
3. Electra’s new experiments showed significantly lower GRIMs than the experiment with the published weights. Notice that the more successful Electra experiment (006) has the lowest of the three Electra GRIMs, supporting the GRIM-performance hypothesis.
4. BERT-Large (003), Roberta (001), and Longformer have slightly lower EM scores than in the earlier graph, and their GRIMs moved higher, upholding the GRIM-performance hypothesis within the same architecture.
4.2.3 BERT-Uncased
The top-right graph in Table 4 shows four fine-tuning experiments we performed on the pre-trained BERT-base-uncased model, with no prior fine-tuning. (The second version of this document replaced experiments 009 and 010 with new ones; the original experiments were interrupted and restarted, which caused minor errors in the processing of the results.) Results are shown in Table 6. There are three distinct training maturity horizons among the four experiments: 16.5K, 30.5K, and 56K. The first three experiments are consistent with the GRIM-performance hypothesis in conjunction with training maturity conditioning. The last experiment, BERT-010, with the latest maturity, has a lower GRIM than BERT-007, which has the best score. This pattern contradicts the originally stated GRIM-performance hypothesis and the maturity conditioning assumption.
Model/Experiment Description | Total Steps | EM Score | Best-F1 Step | F1 Score | GRIM Score
---|---|---|---|---|---
BERT-base-uncased-001 | 44K | 71.3 | 30.5K | 75.25 | 5.3 |
BERT-base-uncased-007 | 44K | 73.1 | 30.5K | 76.97 | 4.4 |
BERT-base-uncased-009 | 44K | 72.92 | 16.5K | 76.49 | 2.4 |
BERT-base-uncased-010 | 88K | 71.94 | 56K | 75.78 | 4.0 |
4.2.4 BERT-Large Experiments
The bottom-left graph in Table 4 shows two BERT-large experiments, the first (cased) evaluated after we applied two epochs of fine-tuning, and the second (uncased) fine-tuned by its publishers. Table 5 shows their low GRIM scores and high performance, which is consistent with the GRIM-performance hypothesis. In this case, the low GRIM resonates with the high performance, but their relatively low training maturity may also cause the low GRIM values. These larger models, being more efficient learners and costlier to train, are typically fine-tuned over fewer epochs. The step at which the best model occurs is deceivingly high, but when normalized for the smaller batch size (4 versus 24), it becomes roughly six times lower compared to the base models, i.e., barely over 10K of equivalent steps. In subsection 4.4, we show an experiment where we fine-tuned this BERT-large-cased model for six epochs to better understand the relationship between performance, training maturity, and GRIM.
4.2.5 Other Consolidated Comparisons by Model Architecture
4.3 What GR Distributions tell us about Individual Examples
Example-level GR distributions over experiments are helpful for error analysis or example difficulty assessment. The most straightforward examples have zero mean and zero standard deviation. The hardest ones, perhaps those with erroneous annotations, have a mean of 10 and zero standard deviation. For the rest, the mean and standard deviation are telling in characterizing them.
The four graphs in table 7 show each example as a point with the x-axis denoting the GR mean over the 16 experiments and the y-axis representing the standard deviation. As discussed in section 1, our ranking scale has an artificial catch-all maximum of 10 for all GRs higher than 9. So, the right half of the graph is distorted, with y-values artificially vanishing at ten and appearing symmetric to the left half, making it take a dome-like shape. If we used the actual rank values, which can be in the hundreds or thousands, the right side would extend into a semi-parabolic open-ended shape with sparser and sparser density towards the right.
To better visualize the constraints that shape the graph, consider that the only way for the mean to be 0 (or 10) is for all 16 experiments to produce GR 0 (or 10), and thus to have a standard deviation of zero. Also, the highest standard deviation an example can reach, 5, is only achieved when the GR of eight experiments is 0 and that of the other eight is 10. The higher an example appears in the plot, the more dispersed and polarized its GR scores are. Appendix D presents a few highly polarized examples in more detail.
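A sketch of how the per-example statistics behind these plots can be derived, assuming a long-format table with one row per (experiment, example) pair and a gr column capped at 10 (the column names and toy rows are illustrative):

```python
import pandas as pd

# One row per (experiment, example) with the capped golden rank (0..10).
results = pd.DataFrame({
    "experiment": ["exp01", "exp02", "exp01", "exp02"],
    "example_id": ["8b", "8b", "c0", "c0"],
    "gr":         [0,      0,      1,      3],
})

grouped = results.groupby("example_id")["gr"]
per_example = pd.DataFrame({
    "gr_mean": grouped.mean(),
    # Population std (ddof=0): eight 0s and eight 10s give exactly 5.
    "gr_std": grouped.std(ddof=0),
}).reset_index()
print(per_example)
```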
With this in mind, let us look at our analysis.
[Table 7: four scatter plots of examples by GR mean (x-axis) and GR standard deviation (y-axis) — all examples colored by answerability, examples colored by whether any experiment predicted them correctly, and cluster assignments for the non-answerable and answerable subsets.]
4.3.1 Overview of All Experiments
The top left diagram in table 7 shows an overview of all examples. They are color-coded by whether or not they are answerable according to the golden answers of the dataset (the official SQuAD-v2 dataset uses a flag named “is_impossible” to denote questions that cannot be answered from the context; we use the terms non-answerable or unanswerable in this paper). Green represents answerable examples and red non-answerable ones. There are 11,876 examples processed, of which 5945 are non-answerable and 5928 answerable (the dataset contains 5945 answerable examples, but 17 were discarded because they failed during processing). The point at (0,0) represents all 4654 (39.20%) examples that scored GR = 0 for all 16 experiments. Of those, 2538 (21.38%) are unanswerable, and 2116 (17.82%) are answerable. Point (10,0) represents 44 examples that scored GR = 10 for all experiments, 24 non-answerable and 20 answerable. They denote either challenging or perhaps wrongly annotated examples. The only other point with zero standard deviation is (1,0), which accounts for four answerable examples that consistently scored 1 in all experiments, discussed in more detail in Appendix C.
Non-answerable examples tend to have higher standard deviations and to spread more towards higher ranks. The upper center section contains very polarized GR scores. Answerable examples concentrate at lower ranks and lower standard deviations, suggesting that models predict them correctly more consistently.
The top right graph shows a different slice of the examples. Using the EM criterion, 11397 examples, shown in black, have been correctly predicted by at least one experiment, while the remaining 476 (4.0%) were never predicted correctly. Considering the combination of mean and standard deviation, we can refine this categorization and construct a more nuanced spectrum of difficulty to prioritize example inspections more efficiently and systematically. These inspections can serve the purposes of error analysis, difficulty assessment, or hunting down incorrect answers and annotations.
4.3.2 Clustering of Experiments
We use the GR mean and standard deviation to partition examples into clusters. We do this separately for the non-answerable and the answerable ones. The bottom-left figure in table 7 contains only non-answerable examples using the same dot pattern, whereas the coloring denotes the clusters. Having tried various algorithms and numbers of groups k, we show here the results of the BIRCH hierarchical clustering algorithm [ZRL96] with k = 4.
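A minimal sketch of this clustering step with scikit-learn’s Birch implementation, applied to the per-example (GR mean, GR standard deviation) pairs; the feature rows below are illustrative, and k = 4 matches the cluster tables that follow:

```python
import numpy as np
from sklearn.cluster import Birch

# Per-example (GR mean, GR std) pairs, e.g., as computed in subsection 4.3.
features = np.array([
    [0.0, 0.0], [0.4, 1.2], [5.1, 4.8], [9.6, 1.1], [10.0, 0.0],
])

birch = Birch(n_clusters=4)          # k = 4 final clusters
labels = birch.fit_predict(features)
print(labels)
```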
Cluster | Counts | from GR | to GR | from Std | to Std |
---|---|---|---|---|---|
All Correct | 2538 | 0.0000 | 0.0000 | 0.000000 | 0.000000 |
Mostly Correct | 2420 | 0.0000 | 6.4375 | 0.000000 | 4.943035 |
Polarized | 583 | 1.1875 | 8.8750 | 2.185714 | 5.000000 |
Challenges | 404 | 3.5000 | 10.0000 | 0.000000 | 4.833154 |
More than half the examples of the purple cluster are at the point (0,0), as the first line of the table above shows, capturing all correctly predicted unanswerable examples. The rest of the cluster exhibits a low average GR score, which we labeled “Mostly Correct”. The “Polarized” group includes examples with means around the middle of the range and high standard deviation. The third cluster, in red, contains examples with GR-mean scores in the higher ranks that appear to pose a challenge to most experiments.
The bottom-right graph repeats the same exercise for answerable examples with cluster details as shown below:
Cluster | Counts | from GR | to GR | from Std | to Std |
---|---|---|---|---|---|
All Correct | 2116 | 0.00 | 0.0000 | 0.00 | 0.000000 |
Mostly Correct | 2715 | 0.00 | 4.6875 | 0.00 | 4.485654 |
Polarized | 541 | 0.25 | 5.7500 | 0.75 | 4.639757 |
Challenges | 556 | 1.75 | 10.0000 | 0.00 | 4.911323 |
The first (red) cluster is, again, divided into two parts: the “All Correct” examples, correctly predicted by all experiments at coordinate (0,0), and the rest, which belong to the lower-left region labeled “Mostly Correct”. Notice that for the answerable examples this cluster covers a smaller area with a count similar to that of the corresponding cluster in the unanswerable plot. Also, the polarized group is narrower and less polarized than its unanswerable equivalent.
SQuAD-2 introduced unanswerable questions to raise the level of difficulty [RJL18], so the differences between the two graphs above are not a surprise.
4.4 GR/GRIM Movement During Training
The next aspect of our analysis is understanding how the GRIM moves during the training cycle relative to the EM and F1 scores. We compare EM, F1, and GRIM from all validations performed during two different training experiments. Their plots are in table 8. These experiments were on pre-trained but not fine-tuned models. The graph on the left uses a RoBERTa-base model trained over 16 epochs with batch size 24. The one on the right uses a BERT-large model for six epochs and batch size 5. Notice that the last evaluation is always a repetition of the evaluation using the best model encountered during the training run.
Observing the movements of the three metrics in these graphs, we note three characteristics of the GRIM:
- A continuous oscillation is combined with an upward trend from lower ranks (2 or 3) to a higher rank, around 5.
- The oscillation magnitude has a pattern of contracting and expanding with a periodicity that differs for the two models, i.e., every few validations for RoBERTa-base, but more sustained, longer intervals in BERT-large. This oscillation diminishes as training matures.
- The GRIM movement is in a direction opposite to that of the EM and F1.
Notice that the left y-scale is for the EM and F1 scores, while the right y-scale depicts the rank for the GRIM.
[Table 8: EM, F1, and GRIM traced over all validations during training — RoBERTa-base (left) and BERT-large (right); the left y-axis shows EM/F1 and the right y-axis the GRIM rank.]
The high points of the oscillations of the F1 score occasionally establish a new “best model” that defines a plateau until some other oscillation pushes the F1 value even higher, and a new best model plateau emerges. A low GRIM value often coincides with a new best model. Since the magnitude of the oscillations tends to diminish, the GRIM values corresponding to each new best model get higher and higher, following another pattern of increasing plateaus. The evolution of these plateaus correlates to our informal notion of training maturity. Comparisons within similar levels of training maturity tend to show lower GRIM for higher F1s. However, when comparing models of different maturity, the plateau differences overwhelm the more subtle GRIM-performance relationship.
4.5 Ensembles based on F1, EM or GRIM
Although the GRIM is based on secondary predictions, we were still interested to see if there is a more direct way to use it to improve performance. Here, we compare ensembles of models chosen based on GRIM, F1, and EM. The results show that the GRIM is a poor criterion.
Ensemble Choice | EM | F1 | Models Selected (names abbreviated) |
---|---|---|---|
Best EM (baseline) | 80.91 | 83.88 | bert-lg-uc |
Best 3 by EM | 83.17 | 87.06 | bert-lg-uc, bert-lg-cas, roberta/002 |
Best 3 by F1 | 83.08 | 87.30 | bert-lg-uc, bert-lg-cas, roberta/001 |
Best 3 by GRIM | 81.11 | 85.49 | roberta, bert-lg-uc, bert/010 |
Best 5 by EM | 83.34 | 87.21 | bert-lg-uc, bert-lg-cas, roberta/002, longformer/011,roberta/001 |
Best 5 by F1 | 83.99 | 87.94 | bert-lg-uc, bert-lg-cas, roberta/001, longformer/011, electra/006 |
Best 5 by GRIM | 81.76 | 85.73 | roberta, bert-lg-uc, bert/010, bert-lg-cas, bert/009 |
We used the original 16 experiment results and tried various majority-vote ensembles with members determined by the best-3 and best-5 models with respect to F1, EM, or GRIM. The 5-way ensembles performed a little better than the respective 3-way ones. The results are in table 9. We include the singleton best-EM model for comparison. The ensemble of the top-5 F1 scores performed the best, followed by the one selected based on the top-5 EM scores.
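A sketch of the majority-vote combination referred to above, assuming each member model contributes its normalized primary prediction for an example; ties are broken here in favor of the earlier-listed (better-ranked) model, which is one of several possible policies and not necessarily the exact one used in our experiments:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: normalized primary answers from the ensemble members,
    ordered by the selection criterion (best model first)."""
    counts = Counter(predictions)
    best_count = max(counts.values())
    # Break ties in favor of the earliest-listed (best-ranked) model.
    for answer in predictions:
        if counts[answer] == best_count:
            return answer

# Example: three models vote; two agree on the same normalized span.
print(majority_vote(["early vertebrates", "vertebrates", "early vertebrates"]))
```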
5 Discussion and Future Work
The Golden Rank (GR) at the individual example level, and the Golden Rank Interpolated Median (GRIM) at the aggregate level, quantify the quality of learning exhibited by a model in a way that parallels the very mechanism that produces the prediction. We have shown that these metrics can classify learning behavior through secondary predictions. We can group examples and isolate characteristics such as learning polarization, degree of confidence, and persistent failures from the outcomes of a few experiments. The resulting clusters can, in turn, expose patterns of failure and inconsistencies to aid the detection of errors and missing or unexpected golden answers.
There were two considerations in the definition of these metrics that may draw criticism and that we plan to research further:
1. Our methodology opted to exclude primary predictions from the calculation of the GRIM. So we ended up with a secondary prediction classifier that is relatively sensitive and independent of EM and F1. But attributing a direction of “goodness” became more complicated. Is it better for a model to exhibit a low GRIM, indicating that secondary examples lean towards rank 0? Or is it better when the GRIM is higher, where all low-hanging fruit has been pulled into rank 0 and the more challenging secondary examples remain in higher ranks? We introduced the informal notion of training maturity to sufficiently qualify the direction of goodness, which requires a better-informed definition. One direction is to include the primary examples in the GRIM calculation, leading to a much narrower GRIM range but with a cleaner notion of the direction of “goodness” and possibly more overlap with the EM and F1. Another is to keep the GRIM over secondary predictions and continue refining the quantification and impact of training maturity.
2. Another decision we made, at least at this phase, was to calculate the GR scores from the best-K output of the evaluations. The overhead cost of computing the GR as part of the evaluation was a significant consideration initially, but it limited our ability to calculate the actual GR value. Skipping the catch-all threshold K gimmick will allow us to use the unbiased mean, instead of the median, as the aggregate metric. In that scenario, the outliers with extreme values of GR (possibly in the hundreds or thousands) could make a more compelling case for discounting or normalization similar to other rank metrics used in web searches. It can also facilitate more direct ways of engineering the loss function to address class imbalance or easy-negative-example saturation.
We believe that the following areas need more investigation.
- Failed answer analysis can bring further insight into how these metrics correlate with actual example failure patterns, their association with linguistic patterns, or question types sensitive to failing on particular model architectures.
- To what extent can the GRIM aid in model selection? More data collection will likely provide insight into how different model architectures transition secondary predictions into primary “correct” predictions. Such understanding may lead to empirical indicators for controlling early stopping, as well as for architecture and hyperparameter selection.
- Explore double descent analysis. We want to replicate some of the scenarios associated with the double-descent phenomena observed during model training per [NKB+19] and to see if any additional insight can be derived using the GR and GRIM, particularly during training evaluations. The notion of training maturity (which could also be cyclic instead of monotonic or random) seems quite relevant to this topic.
- Extend the GR and GRIM to other tasks and models. Developing a metric class that can be included in the validation code will enable more data collection for tasks other than QA or outside the NLP realm. As a start, we intend to create a metric class for the Hugging Face QA examples.
- Use the GR score for adversarial training: We biased the dataset sampling for subsequent epochs towards low (or high) GR examples. We tracked the movement of examples that switched rank. We observed that an equilibrium was reached, with as many examples moving towards lower ranks as in the opposite direction. Our findings were preliminary and are not included in this paper. Still, additional study is needed to understand this behavior better and potentially exploit it for adversarial training strategies.
References
- [BCNM06] Cristian Bucila, R. Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD ’06, 2006.
- [Bec20] Melanie R. Beck. Evaluating qa: Metrics, predictions, and the null response, 2020.
- [BPC20] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
- [CLLM20] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. Electra: Pre-training text encoders as discriminators rather than generators, 2020.
- [DCLT19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
- [JM08] Daniel Jurafsky and James Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, volume 2. Prentice Hall, 02 2008.
- [LOG+19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
- [NKB+19] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt, 2019.
- [NKH18] Besmira Nushi, Ece Kamar, and Eric Horvitz. Towards accountable ai: Hybrid human-machine analyses for characterizing system failure. In HCOMP 2018. AAAI, July 2018.
- [PCA20] Boxiao Pan, Gael Colas, and Shervine Amidi. Question answering on squad 2.0, 2020.
- [RJL18] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad, 2018.
- [RQWF02] Dragomir R. Radev, Hong Qi, Harris Wu, and Weiguo Fan. Evaluating web-based question answering systems. In Proceedings of LREC, 2002.
- [RZLL16] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016.
- [SDCW20] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, 2020.
- [SS19] Sam Schwager and John Solitario. Question and answering on squad 2.0: Bert is all you need, 2019.
- [VMB22a] Neeraj Varshney, Swaroop Mishra, and Chitta Baral. ILDAE: Instance-level difficulty analysis of evaluation data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3412–3425, Dublin, Ireland, May 2022. Association for Computational Linguistics.
- [VMB22b] Neeraj Varshney, Swaroop Mishra, and Chitta Baral. Let the model decide its curriculum for multitask learning, 2022.
- [Voo99] Ellen M. Voorhees. The trec-8 question answering track report. In Proceedings of the 8th Text Retrieval Conference, 1999.
- [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
- [XZM+20] Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. Curriculum learning for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6095–6104, Online, July 2020. Association for Computational Linguistics.
- [ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD ’96, page 103–114, New York, NY, USA, 1996. Association for Computing Machinery.
Appendix A Models
A.1 BERT Models
We use two implementations of BERT [VSP+17] and [DCLT19] from Huggingface: bert-base-uncased with 12 layers and bert-large-cased-whole-word-masking with 24 layers. These models are pre-trained but not fine-tuned. We fine-tuned three BERT-base experiments for eight and one for 16 epochs using various learning rates. We fine-tuned a BERT-large model for two epochs.
We also fine-tuned a “BERT-large-cased-whole-word-masking” model for six epochs and collected metrics from validations performed every 2000 optimizer steps to study GRIM behavior during training.
We finally evaluated a different large BERT model, already fine-tuned on SQuAD v2, deepset/bert-large-uncased-whole-word-masking-squad2 with the published weights.
A.2 RoBERTa Models
The acronym stands for A Robustly Optimized BERT Pre-training Approach [LOG+19], also available in base and large configurations. As its name indicates, the focus is on a more robust method of pre-training BERT, performed on a broader corpus of the English language totaling 160GB of uncompressed text. The developers paid particular attention to optimizing the learning rate, momentum, and scheduling hyperparameters, along with tweaks to the training strategy. These tweaks entail dynamic masking, eliminating next sentence prediction, training with large mini-batches, and adopting a universal tokenization scheme based on byte-pair encoding.
We used the navteca/roberta-base-squad2 model, already fine-tuned on SQuAD v2. We also used the pre-trained (but not fine-tuned) roberta-base model [LOG+19]. We fine-tuned it on SQuAD-2 in a special experiment over 16 epochs, collecting metrics at each intermediate validation to understand how GRIM, EM, and F1 evolve during training.
A.3 ELECTRA Models
These models also come in small and large configurations and are discussed in [CLLM20]. BERT randomly masks about 15% of the incoming words and trains a decoder to recover them. In contrast, ELECTRA uses a simple generative model as a generator and trains a discriminator. The generator replaces the randomly selected words with plausible alternatives, and the discriminator detects the substituted words. Since the replacements have meaningful embeddings, the replaced tokens productively participate in the attention process. In contrast to the masked method, this approach results in faster and more economical training.
We ran an evaluation with the published weights and two experiments further fine-tuning the navteca/electra-base-squad2 model from the Huggingface Hub. Details are shown in table 2.
A.4 DistilBERT Models
DistilBERT uses knowledge distillation, a model compression technique proposed by [BCNM06] that uses a teacher-student model. In this case, BERT is the teacher, and DistilBERT is the student. The student uses the logit output vector generated by the teacher as input to its training and learns to replicate the teacher’s behavior with a reduced-size architecture. The resulting model is a smaller and faster version of BERT, trained on the same corpus, that can be used as a base for further task-specific fine-tuning to run on smaller platforms. It is described in [SDCW20] where the authors claim that this model, which is 40% smaller than BERT, retains 97% of the original BERT understanding capabilities, and its training is 60% shorter. We used a fine-tuned implementation on SQuAD v2, twmkn9/distilbert-base-uncased-squad2 from the Huggingface Hub, but were unable to come close to the performance of the other models as we show in our analysis.
We ran a validation with the published weights and two experiments extending the fine-tuning of this already fine-tuned Distilbert base model.
A.5 Longformer
As the name implies, Longformer [BPC20] accommodates longer input sequences than BERT. To achieve this, the model restricts attention to a fixed window extending on either side of the attending token, essentially a diagonal band matrix called windowed attention. As a result, attention cost is reduced from quadratic to linear since the window is of fixed size. In addition, some critical tokens can selectively participate in global attention, preserving their ability to exploit long-distance attention. For question-answering, it is typical to set the question tokens to be global.
We ran a validation on mrm8488/longformer-base-4096-finetuned-squadv2 with the published weights. We then extended its fine-tuning by a few epochs and reran a validation.
Appendix B QA Post-processing and Metrics
We provide some context, more formally define our terminology, i.e., primary vs. secondary predictions and ranks, and review the post-processing steps of the QA model’s outputs.
B.1 From Logits to Predictions
The rationalization of the Golden Rank draws from the way spans are prioritized and selected in classification [JM08]. We assume that our evaluation set contains question-answer examples with varying context token sequence lengths $N$. For each example $i$, the output layer of the model returns a start and an end prediction vector of values $y^{S}_{j}(i)$ and $y^{E}_{j}(i)$, where $j \in [1, N]$. We treat these two as logit vectors representing estimates of the log of the odds that the token in position $j$ marks the start or end of the answer segment, respectively. These logit vectors are converted to prediction probability distributions $\hat{p}^{S}_{j}(i)$ and $\hat{p}^{E}_{j}(i)$ by applying the softmax function.
In a perfect world, we should be able to use the highest probability start and end indexes to identify the best answer prediction:

$$\hat{S}(i) = \operatorname*{argmax}_{j \in [1,N]} \hat{p}^{S}_{j}(i), \qquad \hat{E}(i) = \operatorname*{argmax}_{k \in [1,N]} \hat{p}^{E}_{k}(i)$$

This best answer prediction is the span of the context that starts at position $\hat{S}(i)$ and ends at $\hat{E}(i)$. Before accepting this result, however, we must ensure that the answer span is valid. For example, the end position cannot precede the start position, the number of tokens in the span cannot exceed a threshold length, and neither position can lie outside the context bounds. If the highest probability span is invalid, we need a strategy for picking the next candidate. Should we try the next best end, maybe the next start, or skip to the next pair? Or should we consider the question unanswerable?
Instead of dealing with two independent probability distributions for the starts and ends, it would be better to have a composition function that generates an ordered list in descending probability order for each possible span.
Visualize a probability matrix $P(i) = \left[\, p_{jk}(i) \,\right]_{N \times N}$ of all possible combinations of start positions $j$ (as row indexes) and end positions $k$ (as column indexes). Each matrix element represents a possible span and contains the probability that the span matches a correct answer. Elements denoting invalid spans must have their probability set to zero. For example, all elements below the diagonal that denote ends before starts should be zero, as should those outside the band where the answer span length exceeds the threshold of allowable answer length. Depending on how these validity rules are applied, rows may not add up to one. Since these values are only used to “rank” predictions, the fact that they are not strict probabilities does not alter the ordering. Consequently, we don’t need to apply another softmax.
Returning to the composition function, we want it to return a single probability per start-end pair derived from the start and end probabilities. To compute the combined probability, we can take the product of the start and end probabilities while enforcing span validity rules:

$$p_{jk}(i) = \begin{cases} \hat{p}^{S}_{j}(i)\,\hat{p}^{E}_{k}(i) & \text{if } 0 \le k - j < L_{\max} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $L_{\max}$ is the maximum answer length in tokens.
To avoid multiple applications of the softmax, separately for start and end, we apply it once over the sum of the logits, which turns out to be equivalent since:

$$\hat{p}^{S}_{j}(i)\,\hat{p}^{E}_{k}(i) = \frac{e^{y^{S}_{j}(i)}}{\sum_{u} e^{y^{S}_{u}(i)}} \cdot \frac{e^{y^{E}_{k}(i)}}{\sum_{v} e^{y^{E}_{v}(i)}} \propto e^{\,y^{S}_{j}(i) + y^{E}_{k}(i)}$$

so a single softmax over the summed logits produces the same ordering of spans.
It is also possible to apply different weights to the start and end. The idea here is to emphasize the probability of the starting token more than the ending token, as discussed in [SS19]. With weights $w_S > w_E$ applied to the start and end logits, the last expression becomes $e^{\,w_S\, y^{S}_{j}(i) + w_E\, y^{E}_{k}(i)}$.
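A NumPy sketch of the composition function of Equation 3 for a single example’s start and end logits (the function name and toy logits are illustrative; the actual post-processing also maps token indices back to character spans in the context):

```python
import numpy as np

def top_k_spans(start_logits, end_logits, k=10, max_answer_len=30):
    """Rank candidate (start, end) spans by a single softmax over the sum of
    start and end logits, zeroing out invalid spans (end before start, or
    span longer than max_answer_len). The zeroed entries mean the surviving
    values are not strict probabilities, but the ordering is unaffected."""
    start_logits = np.asarray(start_logits, dtype=float)
    end_logits = np.asarray(end_logits, dtype=float)
    n = len(start_logits)
    scores = start_logits[:, None] + end_logits[None, :]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                  # one softmax over all pairs
    valid = np.triu(np.ones((n, n), dtype=bool))          # end >= start
    valid &= ~np.triu(np.ones((n, n), dtype=bool), k=max_answer_len)  # length cap
    probs[~valid] = 0.0
    order = np.argsort(probs, axis=None)[::-1][:k]
    return [(int(i // n), int(i % n), float(probs.flat[i])) for i in order]

# Toy usage with random logits for a 12-token context.
rng = np.random.default_rng(0)
print(top_k_spans(rng.normal(size=12), rng.normal(size=12), k=3))
```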
Appendix C Error Analysis - 4 Examples at (1,0)
The four examples in this appendix have GR = 1 for all experiments (and a standard deviation of 0). We consider the coincidence worth investigating. The top two ranks predicted by one of our BERT-large models are shown in table 10.
In summary, three of the four examples produce a “No-Answer” rank-0 prediction, while all four have their second choice match a correct answer.
The last question seems to be phrased wrong. It is stated as “residential construction can generate what is not carefully planned?” while it should be “residential construction can generate what if not carefully planned?”
The four questions follow:
C.1 Example ID: 57263c78ec44d21400f3dc7c
This example has a spelling error in the question (“betwen”), which may have caused the model to attribute a higher probability to the no-answer.
- title: Packet switching
- context: ARPANET and SITA HLN became operational in 1969. Before the introduction of X.25 in 1973, about twenty different network technologies had been developed. Two fundamental differences involved the division of functions and tasks between the hosts at the edge of the network and the network core. In the datagram system, the hosts have the responsibility to ensure orderly delivery of packets. The User Datagram Protocol (UDP) is an example of a datagram protocol. In the virtual call system, the network guarantees sequenced delivery of data to the host. This results in a simpler host interface with less functionality than in the datagram model. The X.25 protocol suite uses this network type.
- question: 2 differences betwen X.25 and ARPNET CITA technologies.
- answers:
  - Two fundamental differences involved the division of functions and tasks between the hosts at the edge of the network and the network core.
  - the division of functions and tasks between the hosts at the edge of the network and the network core.
  - division of functions and tasks between the hosts at the edge of the network and the network core
- answer start: 154, 191, 195
C.2 Example ID: 57267d52708984140094c7da
The rank-0 answer for this example is “microscopic analysis of oriented thin sections of geologic samples”, which reads as a correct answer, but it is not included in the list of golden answers. There is an identical golden answer preceded by the verb “uses”, and given the pattern of the other two golden answers, excluding the verb should also be accepted as correct. In this case the F1 test would have fairly evaluated a close answer.
- title: Geology
- context: Structural geologists use microscopic analysis of oriented thin sections of geologic samples to observe the fabric within the rocks which gives information about strain within the crystalline structure of the rocks. They also plot and combine measurements of geological structures in order to better understand the orientations of faults and folds in order to reconstruct the history of rock deformation in the area. In addition, they perform analog and numerical experiments of rock deformation in large and small settings.
- question: How do structural geologists observe the fabric within the rocks?
- answers:
  - microscopic analysis of oriented thin sections
  - microscopic analysis
  - use microscopic analysis of oriented thin sections of geologic samples
- answer start: 26, 26, 22
C.3 Example ID: 5728dc2d3acd2414000e0080
This appears to be a learning/generalization problem; there is no apparent reason why all models fail to recognize the correct answer, even though they come close enough to rank it as the second-best answer.
- title: Civil disobedience
- context: Some theories of civil disobedience hold that civil disobedience is only justified against governmental entities. Brownlee argues that disobedience in opposition to the decisions of non-governmental agencies such as trade unions, banks, and private universities can be justified if it reflects "a larger challenge to the legal system that permits those decisions to be taken". The same principle, she argues, applies to breaches of law in protest against international organizations and foreign governments.
- question: Who claims that public companies can also be part of civil disobedience?
- answers:
  • Brownlee
  • Brownlee
  • Brownlee
  • Brownlee
  • Brownlee
- answer start: 114, 114, 114, 114, 114
C.4 Example ID: 572742bd5951b619008f8787
This question is phrased incorrectly; in fact, it is not grammatical English. The word "is" should be replaced by "if": "Residential construction can generate what if not carefully planned?"
- title: Construction
- context: Residential construction practices, technologies, and resources must conform to local building authority regulations and codes of practice. Materials readily available in the area generally dictate the construction materials used (e.g. brick versus stone, versus timber). Cost of construction on a per square meter (or per square foot) basis for houses can vary dramatically based on site conditions, local regulations, economies of scale (custom designed homes are often more expensive to build) and the availability of skilled tradespeople. As residential construction (as well as all other types of construction) can generate a lot of waste, careful planning again is needed here.
- question: Residential construction can generate what is not carefully planned?
- answers:
  • a lot of waste
  • waste
  • waste
- answer start: 629, 638, 638
prediction | probability | id | rank | goldAns | correct
---|---|---|---|---|---
(no answer) | 9.999964e-01 | dc7c | 0 | [Two fundamental differences involved the division of functions and tasks between the hosts at the edge of the network and the network core, the division of functions and tasks between the hosts at the edge of the network and the network core., division of functions and tasks between the hosts at the edge of the network and the network core] | False
Two fundamental differences involved the division of functions and tasks between the hosts at the edge of the network and the network core | 8.562545e-07 | dc7c | 1 | [Two fundamental differences involved the division of functions and tasks between the hosts at the edge of the network and the network core, the division of functions and tasks between the hosts at the edge of the network and the network core., division of functions and tasks between the hosts at the edge of the network and the network core] | True
microscopic analysis of oriented thin sections of geologic samples | 8.835049e-01 | c7da | 0 | [microscopic analysis, microscopic analysis of oriented thin sections, use microscopic analysis of oriented thin sections of geologic samples] | False
use microscopic analysis of oriented thin sections of geologic samples | 9.161043e-02 | c7da | 1 | [microscopic analysis, microscopic analysis of oriented thin sections, use microscopic analysis of oriented thin sections of geologic samples] | True
(no answer) | 9.999518e-01 | 0080 | 0 | [Brownlee] | False
Brownlee | 4.670037e-05 | 0080 | 1 | [Brownlee] | True
(no answer) | 9.999995e-01 | 8787 | 0 | [waste, a lot of waste] | False
waste | 3.547627e-07 | 8787 | 1 | [waste, a lot of waste] | True
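The Golden Rank can be read directly off rows like those above: it is the lowest rank whose prediction matches a golden answer. A minimal sketch follows; the row representation and the "rank"/"correct" field names are our own illustration, not the paper's data schema.

```python
def golden_rank(rows):
    """Lowest rank whose prediction matches a golden answer; None if no prediction matches."""
    correct_ranks = [row["rank"] for row in rows if row["correct"]]
    return min(correct_ranks) if correct_ranks else None

# Example dc7c above: the rank-0 (no-answer) prediction is wrong, the rank-1 prediction is correct.
assert golden_rank([{"rank": 0, "correct": False},
                    {"rank": 1, "correct": True}]) == 1
```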
Appendix D Error Analysis - Highly Polarized Examples
Highly polarized examples are those that some experiments rank very low and others very high. The highest standard deviation appears with unanswerable examples where the model thought there was an answer. Longer answers often come in clusters of overlapping sub-spans, each forming a plausible answer; such a cluster pushes the next cluster of answers, or the "no answer" prediction, towards higher ranks, frequently into the 10+ bucket. So the GR is zero for models that predicted correctly, while for those that did not it may be much higher, making the example look polarized and volatile.
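A minimal sketch of how such polarization can be quantified, assuming we already have each example's GR per experiment (the dictionary layout, the function name gr_spread, and the 10+ bucket cap are our own illustration):

```python
import statistics

def gr_spread(gr_by_example, bucket_cap=10):
    """Per-example mean and population standard deviation of the Golden Rank
    across experiments, with ranks of 10 or more folded into a single 10+ bucket."""
    spread = {}
    for example_id, ranks_by_experiment in gr_by_example.items():
        capped = [min(r, bucket_cap) for r in ranks_by_experiment.values()]
        spread[example_id] = (statistics.mean(capped), statistics.pstdev(capped))
    return spread

# A polarized example: some models rank the correct answer first (GR = 0),
# others push it far down (GR = 86, folded into the 10+ bucket).
print(gr_spread({"example-1": {"model-a": 0, "model-b": 86, "model-c": 10}}))
# -> roughly {'example-1': (6.67, 4.71)}
```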
The examples below are such highly volatile cases. The following table shows results for the second example, listing each experiment's rank-0 and rank-1 predictions and, when its rank is higher than 1, the correct (no-answer) prediction. This gives a flavor of how different models answer such questions and how the correct answer ends up at high ranks. The two large BERT models, RoBERTa, Longformer, and Electra tend to answer these questions correctly (rank 0), while others, such as BERT-base and DistilBERT, do not.
Tuning the Hugging Face no-answer probability threshold argument, as shown in [Bec20], may improve the success rate on these examples.
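We do not reproduce the exact Hugging Face argument here; the sketch below only illustrates the usual SQuAD-2 decision rule that such a threshold controls (the function and parameter names are ours): a span is returned only if its score beats the no-answer score by more than a tuned margin, and raising the margin yields more no-answer predictions.

```python
def final_prediction(best_span_text, best_span_score, null_score, threshold=0.0):
    """Return the best span if it beats the no-answer score by more than the threshold,
    otherwise predict 'no answer' (empty string), as in SQuAD-2 style evaluation."""
    return best_span_text if (best_span_score - null_score) > threshold else ""

# Example: with a threshold of 0.5, a narrow win for the span is overridden.
print(final_prediction("photosynthesis", best_span_score=1.2, null_score=1.0, threshold=0.5))  # -> ""
```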
D.1 Example ID: 5ad56bcd5b96ef001a10ae62
- title: Computational complexity theory
- context: Many known complexity classes are suspected to be unequal, but this has not been proved. For instance P ⊆ NP ⊆ PP ⊆ PSPACE, but it is possible that P = PSPACE. If P is not equal to NP, then P is not equal to PSPACE either. Since there are many known complexity classes between P and PSPACE, such as RP, BPP, PP, BQP, MA, PH, etc., it is possible that all these complexity classes collapse to one class. Proving that any of these classes are unequal would be a major breakthrough in complexity theory.
- question: What is the proven assumption generally ascribed to the value of complexity classes?
- answers: No Answer
- answer start: 0
D.2 Example ID: 5ad251d6d7d075001a428ceb
- title: Oxygen
- context: The unusually high concentration of oxygen gas on Earth is the result of the oxygen cycle. This biogeochemical cycle describes the movement of oxygen within and between its three main reservoirs on Earth: the atmosphere, the biosphere, and the lithosphere. The main driving factor of the oxygen cycle is photosynthesis, which is responsible for modern Earth’s atmosphere. Photosynthesis releases oxygen into the atmosphere, while respiration and decay remove it from the atmosphere. In the present equilibrium, production and consumption occur at the same rate of roughly 1/2000th of the entire atmospheric oxygen per year.
- question: What oxygen reservoir is the driving factor of the oxygen cycle?
- answers: No Answer
- answer start: 0
text | probability | experiment | rank | goldAns | correct
---|---|---|---|---|---
photosynthesis | 9.999955e-01 | bert-uc/001 | 0 | [] | False
photosynthesis, | 2.932970e-06 | bert-uc/001 | 1 | [] | False
(no answer) | 5.623741e-14 | bert-uc/001 | 10 | [] | True
photosynthesis | 9.813934e-01 | bert-uc/009 | 0 | [] | False
The main driving factor of the oxygen cycle is photosynthesis | 5.801043e-03 | bert-uc/009 | 1 | [] | False
(no answer) | 2.939046e-07 | bert-uc/009 | 86 | [] | True
photosynthesis | 9.619037e-01 | bert-uc/010 | 0 | [] | False
The main driving factor of the oxygen cycle is photosynthesis | 2.637961e-02 | bert-uc/010 | 1 | [] | False
(no answer) | 8.520884e-06 | bert-uc/010 | 10 | [] | True
(no answer) | 7.064449e-01 | bert-lg-cas-wwm/003 | 0 | [] | True
the atmosphere, the biosphere, and the lithosphere | 1.259139e-01 | bert-lg-cas-wwm/003 | 1 | [] | False
(no answer) | 9.994391e-01 | bert-lg-uc-wwm-sq2/001 | 0 | [] | True
photosynthesis | 1.410887e-04 | bert-lg-uc-wwm-sq2/001 | 1 | [] | False
photosynthesis | 9.997396e-01 | longformer-sq2/009 | 0 | [] | False
photosynthesis, | 1.777181e-04 | longformer-sq2/009 | 1 | [] | False
(no answer) | 1.237061e-08 | longformer-sq2/009 | 10 | [] | True
(no answer) | 5.466675e-01 | longformer-sq2/011 | 0 | [] | True
photosynthesis | 4.083973e-01 | longformer-sq2/011 | 1 | [] | False
(no answer) | 9.999925e-01 | electra-sq2/005 | 0 | [] | True
photosynthesis | 1.891310e-06 | electra-sq2/005 | 1 | [] | False
(no answer) | 9.999998e-01 | electra-sq2/007 | 0 | [] | True
photosynthesis | 3.729614e-08 | electra-sq2/007 | 1 | [] | False
photosynthesis | 9.300030e-01 | roberta-sq2/001 | 0 | [] | False
photosynthesis, | 6.051588e-02 | roberta-sq2/001 | 1 | [] | False
(no answer) | 2.919881e-06 | roberta-sq2/001 | 10 | [] | True
(no answer) | 5.483240e-01 | roberta-sq2/002 | 0 | [] | True
the atmosphere, the biosphere, and the lithosphere | 8.345626e-02 | roberta-sq2/002 | 1 | [] | False
photosynthesis | 9.972602e-01 | distilbert-uc-sq2/003 | 0 | [] | False
photosynthesis, | 1.422049e-03 | distilbert-uc-sq2/003 | 1 | [] | False
(no answer) | 2.097861e-12 | distilbert-uc-sq2/003 | 10 | [] | True