
Using Sampling to Estimate and Improve Performance of
Automated Scoring Systems with Guarantees

Yaman Kumar Singla1,2,3*, Sriram Krishna1*, Rajiv Ratn Shah1, Changyou Chen3 (* equal contribution)
Abstract

Automated Scoring (AS), the natural language processing task of scoring essays and speeches in an educational testing setting, is growing in popularity and being deployed across contexts from government examinations to companies providing language proficiency services. However, existing systems either forgo human raters entirely, thus harming the reliability of the test, or score every response by both human and machine, thereby increasing costs. We target the spectrum of possible solutions in between, making use of both humans and machines to provide a higher-quality test while keeping costs reasonable enough to democratize access to AS. In this work, we propose a combination of the existing paradigms, intelligently sampling responses to be scored by humans. We propose reward sampling and observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget (30% of samples). The accuracy increases observed using standard random and importance sampling baselines are 8.6% and 12.2%, respectively. Furthermore, we demonstrate the system's model-agnostic nature by measuring its performance on a variety of models currently deployed in AS settings as well as on pseudo models. Finally, we propose an algorithm to estimate the accuracy/QWK with statistical guarantees. (Our code is available at https://git.io/J1IOy.)

1 Introduction

Figure 1: Existing Automated Scoring systems either do not involve humans at all in their scoring (Duolingo, Second Language Testing Inc (SLTI)), or utilize human raters for every single response (Educational Testing Services (ETS)). Crucially, there are no solutions that target the gulf in between, where humans are involved in scoring only some percentage of the responses. (TOEFL by ETS, Pearson PTE, SLTI, Linguaskill by Cambridge, Duolingo English Test, and TrueNorth by Emmersion are registered brand names and are shown here for illustration purposes only. The authors claim no rights over their logos or brand names. In this work, we mainly refer to the automatically scored speaking and writing proficiency measurement tests of these companies.)

Automated Scoring (AS), the task of assigning scores to unstructured responses to open-ended questions, is an NLP application typically deployed in an educational setting. Historically, its origins have been traced to the work of Ellis Page (Page 1967), who first argued for the possibility of scoring essays by computer. The factors behind the rise of Automated Scoring systems and their subtasks, Automated Essay Scoring (AES) and Automated Speech Scoring (ASS), are numerous, including, but not limited to, the costs involved in providing and scoring a test and the need to score all test takers against a uniform set of rubrics, standardizing the scoring of these unstructured responses. The promise of lower costs and uniform scoring rubrics, among other factors, has fueled the popularity of Automated Scoring systems, and various ML and DL systems are increasingly being deployed in AS contexts (Kumar et al. 2019; Liu, Xu, and Zhu 2019; Singla et al. 2021a). AS systems are behind some of the world's most popular language tests, such as ETS' Test of English as a Foreign Language (TOEFL) (Zechner et al. 2009) and Duolingo's English Test (DET) (LaFlair and Settles 2019), among others. Various governmental institutions and businesses have also instituted automated systems to augment the scoring process, such as the state schools of Utah (Incorporated 2017) and Ohio (O'Donnell 2018), and a majority of BPOs. Automatic scoring is estimated to have a market size of more than USD 110 billion globally, with a US market size of USD 17.1 billion alone (TechNavio 2020; Service 2020; Strauss 2020; Le 2020).

Figure 2: From a dataset, records are sampled and assigned to expert human raters for double scoring based on a human-machine agreement matrix. A second sample is then drawn to check predictions and metrics are estimated with statistical guarantees.

However, this popularity has not been without backlash, with criticism focusing on different aspects, such as "the overreliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies" (Yang et al. 2002). Others have given harsher criticisms, such as Perelman et al. (2014), who show that it is possible to game the system and achieve near-perfect scores on ETS and Vantage Technologies' AES systems with gibberish prose. This has led to the revoking of NAPLAN AES in Australia (ACARA 2018).

Nonetheless, the ability of AS systems to instantly provide scores, reduce costs, and make language proficiency tests more widely available to all, makes them an important research area and subsequently there is considerable interest in improving them across multiple dimensions, from leveraging advancements in NLP to achieve state-of-the-art performance (Liu, Xu, and Zhu 2019) to improving their robustness (Kumar et al. 2020; Parekh et al. 2020; Singla et al. 2021b). In this work, we tackle another facet of Automatic Scoring systems, that of improving performance by bringing humans into the loop.

Typically, in an AS task, a test taker's responses are scored on prompts of varying difficulty levels. Based on each prompt's difficulty and the quality of the candidate's answer to it, a score is assigned to the candidate. The Common European Framework of Reference for Languages (CEFR) is an international standard for measuring language proficiency and assigns scores on a six-point scale from A1 (beginner) to C2 (proficient), each with its own rubrics for evaluation (Broeder and Martyniuk 2008). Each prompt and response is assigned a score on this scale, and a global score is computed by aggregating these individual scores.

Existing AS systems are typically of two varieties (Fig 1):
Double Scoring: Examinations such as ETS' TOEFL score every response by one human and an AES system as the second rater. A second human rater resolves any disagreements between the two (Yang et al. 2002). This effectively means that at least one human rater is required for every test, driving up costs, as evidenced by the TOEFL's high price of approximately USD 230 (ETS 2021).
Machine-only Scoring: On the other end of the scale are tests like the Duolingo English Test (DET), which are scored by machines alone, without any human intervention, keeping costs low but decreasing the reliability of the test. This is one of the main reasons the DET costs USD 49, less than one-fourth of what the TOEFL costs. All tests surveyed in Fig 1 except Pearson PTE are priced at a similar price point.

Our solution (Fig 2) proposes to unify these varieties, allocating the available human budget intelligently to balance the reliability of the test with the cost to the test-taker. To the best of our knowledge, no existing systems target this continuum of utilizing both humans and AS raters. Providing this option would allow AS models to be deployed in more versatile scenarios, working in tandem with expert human raters to provide both reliability and lower-cost solutions. Increasing reliability helps to build trust in automatically scored exams, thus leading to broader adoption. Cost is a critical consideration to lower-income test-takers and those who need to take the test multiple times.

We define the problem and solution more formally as follows: given a set of responses to be scored, a target AS model, and an expert human budget (that is, the number of responses we can have scored by expert human raters), our goal is to efficiently sample responses to be scored by the expert. These expert-scored samples are then combined with automatically scored samples to maximize the overall system performance metric. We propose a novel Monte-Carlo sampling based reward sampling algorithm to efficiently sample responses to maximize the system performance.

Usually one or more of accuracy, Quadratic Weighted Kappa (QWK), and Cohen's kappa (Taghipour and Ng 2016; Zhao et al. 2017; Kumar et al. 2019; Grover et al. 2020; Singla et al. 2021a) are used in the automatic scoring literature, as they are robust measures of inter-rater reliability, a primary goal in Automated Scoring. A key point to note is that the reliability of the test (i.e., how consistently a test measures a characteristic) is measured on the global score (the aggregate of the responses) and less on the scores of individual responses. The global score determines admissions, interviews, and career growth, while per-item scores are used as indicators of particular skills. While, intuitively, there exists a monotonically increasing relationship between the reliability of the test on individual questions and on the overall score, we show that it is more efficient to consider the global context rather than the item-level context when sampling responses to be double-scored by humans.

We establish strong baselines using Uncertainty Sampling (§3.2), an importance sampling formulation that samples using the probability of being wrong output by the AS model. We propose Reward Sampling (§3.3), which samples based on the estimated reward of correcting a mistake.

We summarize our main contributions as follows:

- We propose to combine existing paradigms to integrate humans with Automated Scoring systems. Provided a budget indicating the number of responses that can be scored by human raters, we observe significant gains in accuracy and QWK using our proposed sampling method, Reward Sampling (§3.3). For instance, using a 40% human budget with an AS model of 64% accuracy, our sampling methodology achieves an accuracy gain of 23%, while random sampling leads to a 14% gain and uncertainty sampling to a 15% gain. To the best of our knowledge, this is the first time such a formulation has been considered in Automatic Scoring systems.

- We conduct experiments on various models differing in accuracy to show our algorithm's model-agnostic nature (§3). We include results from models deployed in real-world AS settings as well as from crafted pseudo models. Averaging over these models, we observe a 19.80% increase in accuracy and a 25.60% increase in QWK when using reward sampling with 30% of the dataset as the human budget. The random sampling and uncertainty sampling baselines achieve 8.6% and 12.2% gains in accuracy, respectively.

- While augmenting the system's performance is an important goal, it is equally important to quantify this improvement, especially when deployed in the real world, where there are no labeled datasets to compare against and the consequences of misgrading, for both businesses and test takers, could be catastrophic. Thus, we also propose an algorithm to estimate the accuracy and QWK achieved, with statistical guarantees (§3.4).

2 Related Work

Broadly, our paper covers two areas of research: Automatic Scoring and Sampling methods. Here we cover them briefly.

Automatic Scoring:

The goal of an automatic scorer is to assess the language competence of a candidate with an accuracy matching that of a human grader, but faster, with greater consistency, and at a fraction of the cost (Malinin 2019; Yan, Rupp, and Foltz 2020). Almost all work in the automatic scoring domain has focused on better modeling the scoring of essays and speech traits as a natural language processing task. The techniques have ranged from manually engineered natural language features (Kumar et al. 2019; Dong and Zhang 2016) to LSTMs, memory networks (Zhao et al. 2017), and transformers (Singla et al. 2021a). There has also been some recent work on other facets of AS, including adversarial testing (Ding et al. 2020; Kumar et al. 2020; Parekh et al. 2020), explainability (Kumar and Boulanger 2020), uncertainty estimation (Malinin 2019), off-topic detection (Malinin et al. 2016), and evaluation metrics (Loukina et al. 2020). To the best of our knowledge, there is no work on increasing the reliability of automatic scoring systems by bringing humans into the loop. Most white papers from second language testing firms report results on historical data as a measure of their reliability (Brenzel and Settles 2017; Pearson 2019). However, historical results are not a guarantee of performance over time: due to continuous domain shift, they cannot be trusted to predict a model's future performance. Therefore, performance guarantees of AS models are essential to establish institutional trust in them. To fill this research gap, we propose reward sampling, based on Monte Carlo sampling methods, for measuring and increasing AS systems' reliability.

Monte-Carlo Sampling For Evaluation:

There has been much work in improving automatic metrics using Monte-Carlo sampling methods in machine translation and natural language (NL) evaluation (Chaganty, Mussman, and Liang 2018; Hashimoto, Zhang, and Liang 2019; Wei and Jia 2021). They use statistical sampling methods like importance sampling and control variates to combine automatic NL evaluation with expensive human queries. To the best of our knowledge, we are the first to extend sampling techniques in the context of automatic scoring. We use them to combine relatively cheaper automatic scoring model results with expensive human expert scorers. Kang et al. (2020) use sampling for approximate selection queries. They combine cheap classifiers with expensive estimators to meet minimum precision or recall targets with guarantees. We extend their work to take the global context into account while estimating accuracy (§3.4).

3 System Overview

This section describes the components of the proposed solution, the intuition and reasoning behind the sampling mechanisms, and the algorithm for estimating the metrics with statistical guarantees. Given an Automated Scoring model, a dataset to be scored, and a human budget indicating the percentage of records we can provide to expert human raters for scoring, records are sampled making use of a pre-computed human-machine agreement matrix (to be described below). For the samples selected, we replace the predictions made by the AS model with the scores provided by the human raters and compute an estimate of the increase in accuracy and QWK with guarantees (Fig 2).

When considering sampling, the baseline approach is random sampling, i.e., sampling each record in the dataset with uniform probability. This is not a good allocation of resources: when the model is of high quality, most samples will not provide any value. For example, with a model of 75% accuracy, random sampling would only provide value for roughly 25% of samples, as the rest would have been correctly scored anyway. This motivates our search for a more efficient sampling mechanism, one that takes into account the probability of the model being wrong with respect to a human expert, and crucially, the reward that would be gained by correcting this mistake. We define the reward as the magnitude of the change in the global score that would occur when a local response score is changed as a result of human correction of the machine score (§3.3).

3.1 Human-Machine Agreement Matrix

Figure 3: A sample human-machine agreement matrix on a CEFR-aligned scoring scale. The rows indicate machine predictions, and each row is normalized to give the probability distribution of the human-labeled class given the machine prediction.

The human-machine agreement matrix is a normalized confusion matrix of the AS model’s predictions and the ground truth, precomputed on validation data or historical test data. As the matrix is normalized, each entry indicates the probability of the class predicted by the machine aligning with the class labeled by the human. Fig 3 shows a sample human-machine agreement matrix where m[Low B1][High B1] = 0.27 indicates the probability of the ground truth being High B1 when the machine has predicted Low B1.
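As a minimal sketch (not the authors' released code; the function name and the integer CEFR encoding are illustrative assumptions), the matrix can be precomputed by row-normalizing a confusion matrix of machine predictions against human scores on validation or historical data:

```python
import numpy as np

def agreement_matrix(machine_preds, human_labels, num_classes):
    """Row-normalized confusion matrix: rows index machine predictions,
    columns index human labels, so entry [m, c] approximates P(human = c | machine = m)."""
    counts = np.zeros((num_classes, num_classes), dtype=float)
    for m, h in zip(machine_preds, human_labels):
        counts[m, h] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for classes the machine never predicts
    return counts / row_sums

# Toy example on a 6-point CEFR-aligned scale (0 = A2, 1 = Low B1, ..., 5 = C1)
machine = [1, 1, 2, 3, 0, 2, 5, 4]
human   = [1, 2, 2, 3, 1, 2, 5, 3]
M = agreement_matrix(machine, human, num_classes=6)
print(M[1])  # distribution of the ground truth when the machine predicts class 1
```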

3.2 Uncertainty Sampling

The key idea behind uncertainty sampling is that the machine is not equally likely to be wrong across all prediction classes. Some scores may be assigned with much better accuracy than others. This idea is borne out by the human-machine agreement matrix as well, where the probabilities of a correct prediction lie along the principal diagonal. We can see in Fig 3 that High B1 and Low B2 are predicted accurately, whereas A2 and High B2 predictions are likely to be wrong. Since the machine is likely making a wrong judgement when predicting these classes, it is more efficient to sample more from the records where these predictions have been made and have them corrected by human labelers.

To quantify this, we formulate Uncertainty Sampling as vanilla importance sampling, where the uncertainty of each class is calculated using the cross-entropy function. Each row in the human-machine agreement matrix represents the probability distribution of the ground truth when that particular class has been predicted. The cross entropy of this distribution with the ideal distribution (one-hot encoding for that class) is calculated.

For example, the distribution associated with Low B1 in the matrix is [0.0057, 0.61, 0.27, 0.11, 0.0029, 0]. The cross entropy of this distribution with the ideal distribution for Low B1 ([0, 1, 0, 0, 0, 0]) is calculated. In this way, we can quantify the "loss" associated with Low B1. Subsequently, every record is assigned the loss associated with the prediction made for that record, and this is normalized over the entire dataset to create a probability distribution. We draw a sample $s \sim U(D)$ without replacement from the uncertainty distribution $U(D)$ over the dataset. The provided human budget indicates the number of samples to be drawn, and the likelihood of a record being drawn corresponds to the uncertainty associated with its prediction class.
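The procedure can be sketched as follows, reusing the illustrative agreement matrix M from the sketch above. Since the ideal distribution is one-hot, the cross entropy of a row with it reduces to the negative log of the row's diagonal entry; the small smoothing floor is our addition to keep every record sampleable, not something specified in the paper.

```python
import numpy as np

def uncertainty_sample(machine_preds, M, budget, seed=0):
    """Importance sampling of records based on per-class uncertainty.

    machine_preds: predicted class index for each record.
    M: human-machine agreement matrix (rows = machine prediction).
    budget: number of records to route to expert human raters.
    """
    preds = np.asarray(machine_preds)
    # Cross entropy of a row with its one-hot ideal distribution reduces to
    # -log of the diagonal entry (the probability of a correct prediction).
    diag = np.clip(np.diag(M), 1e-12, 1.0)
    class_loss = -np.log(diag)
    record_loss = class_loss[preds] + 1e-6  # tiny floor so no record has zero mass
    probs = record_loss / record_loss.sum()
    rng = np.random.default_rng(seed)
    # Draw `budget` records without replacement from the uncertainty distribution.
    return rng.choice(len(preds), size=budget, replace=False, p=probs)

# e.g. sampled = uncertainty_sample(machine, M, budget=int(0.3 * len(machine)))
```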

Table 1: Accuracy (Acc) and Quadratic Weighted Kappa (QWK) for various models across multiple sampling methods and increasing human budgets (percentage of the dataset sampled for human scoring).

| Model (global Acc / QWK) | Sampling | Acc 10% | Acc 20% | Acc 40% | Acc 60% | Acc 80% | QWK 10% | QWK 20% | QWK 40% | QWK 60% | QWK 80% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Baseline (0.66 / 0.56) | Random | 0.69 | 0.72 | 0.78 | 0.84 | 0.93 | 0.59 | 0.63 | 0.70 | 0.78 | 0.90 |
| | Uncertainty | 0.70 | 0.73 | 0.80 | 0.86 | 0.93 | 0.59 | 0.62 | 0.71 | 0.80 | 0.90 |
| | Reward | 0.74 | 0.82 | 0.88 | 0.91 | 0.95 | 0.64 | 0.76 | 0.84 | 0.88 | 0.94 |
| BERT-Two Stage (0.69 / 0.60) | Random | 0.73 | 0.75 | 0.81 | 0.87 | 0.93 | 0.65 | 0.68 | 0.75 | 0.83 | 0.91 |
| | Uncertainty | 0.72 | 0.75 | 0.82 | 0.87 | 0.94 | 0.63 | 0.66 | 0.74 | 0.82 | 0.92 |
| | Reward | 0.79 | 0.86 | 0.91 | 0.92 | 0.96 | 0.72 | 0.81 | 0.87 | 0.90 | 0.94 |
| LSTM-Attn-Baseline (0.64 / 0.54) | Random | 0.67 | 0.71 | 0.78 | 0.85 | 0.93 | 0.58 | 0.63 | 0.71 | 0.79 | 0.90 |
| | Uncertainty | 0.68 | 0.72 | 0.79 | 0.88 | 0.93 | 0.57 | 0.60 | 0.69 | 0.82 | 0.90 |
| | Reward | 0.73 | 0.78 | 0.87 | 0.92 | 0.96 | 0.62 | 0.71 | 0.83 | 0.89 | 0.95 |
| LSTM-Attn-Two Stage (0.65 / 0.57) | Random | 0.67 | 0.71 | 0.76 | 0.85 | 0.92 | 0.59 | 0.64 | 0.70 | 0.80 | 0.89 |
| | Uncertainty | 0.68 | 0.73 | 0.80 | 0.87 | 0.93 | 0.58 | 0.62 | 0.71 | 0.81 | 0.91 |
| | Reward | 0.74 | 0.82 | 0.87 | 0.90 | 0.95 | 0.66 | 0.75 | 0.83 | 0.86 | 0.93 |
| Pseudo Model-0.75 (0.72 / 0.57) | Random | 0.74 | 0.76 | 0.80 | 0.86 | 0.93 | 0.62 | 0.64 | 0.72 | 0.81 | 0.90 |
| | Uncertainty | 0.82 | 0.90 | 0.93 | 0.97 | 0.98 | 0.73 | 0.85 | 0.90 | 0.95 | 0.98 |
| | Reward | 0.81 | 0.86 | 0.92 | 0.96 | 0.98 | 0.73 | 0.78 | 0.89 | 0.94 | 0.97 |

3.3 Reward Sampling

For single-skill testing exams (e.g., speaking, writing, or reading alone), like those by SLTI (2021) and LTI (2021), test reliability and validity are measured over the complete test as opposed to individual prompts. While increasing accuracy on individual prompts (through sampling and subsequent human intervention) is a sure way of increasing the accuracy of the overall exam, it is more efficient to directly sample records that are more likely to affect the overall result, rather than simply sampling those the machine is uncertain about. In uncertainty sampling, we sample records based on the likelihood of the prediction being wrong, but we do not consider whether being right would actually change the global score. This is the motivation behind reward sampling. Here, we sample records that are more likely to generate a larger reward, i.e., a change in the score at the global level. To this end, the expected reward $E_R$ is calculated for each record in the dataset as:

E_R(d) = \sum_{c \in C} p(c \,|\, m) \cdot reward(d, c) \quad \forall d \in D    (1)

where $d$ represents one record in the dataset $D$, $c$ and $m$ represent classes in the set of all classes $C$, $p(c \,|\, m)$ indicates the probability of the ground truth being $c$ when the machine has predicted $m$, and the reward function is denoted $reward$. The expected reward encodes the reward gained by the ground truth being $c$ when the machine has predicted $m$, weighted by the probability of the same, summed over every class $c$. $p(c \,|\, m)$ is looked up from the human-machine agreement matrix, and the output of the reward function is weighted by this probability.

The reward function calculates the reward gained by swapping the predicted class with a different class. The aggregate label for the candidate associated with $d$ is calculated before and after the swap with the new class, and the reward is defined as the absolute difference between the two scores, which encodes the magnitude of the score change that would happen if the prediction class were changed from $m$ to $c$. The absolute difference is considered because it is equally important whether the new score is greater or lesser than the predicted score, thus incurring the same reward. If the prediction is an outlier compared to predictions on other responses of the same candidate, a large reward could be generated when changing predictions, making it a prime target for sampling. On the other hand, if changing the class to $c$ does not change the final score, then a reward of 0 would be generated. With a zero reward, these records would never be sampled. Thus, to ensure that every record has a nonzero reward, i.e., a nonzero probability of being sampled, the reward is additively smoothed: $E_R(d) = E_R(d) + \Delta$, where $\Delta = 0.001$. In this manner, an expected reward is calculated for each record in the dataset.

The sampling procedure proceeds similarly: the rewards are normalized to create a probability distribution, over which a sample $s \sim E_R(D)$ is drawn. In using this sampling mechanism, we directly sample records that are most likely to provide an improvement at the aggregate level, compared to indirectly improving the aggregate metrics when using uncertainty sampling.
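A sketch of the expected-reward computation and the subsequent draw is given below. The aggregation rule that turns item-level scores into a global score is not specified here, so the rounded mean in aggregate() is only an illustrative stand-in, and all names are ours rather than from the released code.

```python
import numpy as np

def aggregate(item_scores):
    """Illustrative global score: rounded mean of a candidate's item-level scores.
    The actual aggregation rule used for CEFR/SOPI scores may differ."""
    return int(round(float(np.mean(item_scores))))

def expected_rewards(preds_by_candidate, M, delta=0.001):
    """Expected reward for every (candidate, item) record, following Eq. (1).

    preds_by_candidate: dict mapping candidate id -> list of predicted class indices.
    M[m, c] = P(ground truth c | machine predicted m).
    Returns a list of ((candidate_id, item_index), expected_reward) pairs.
    """
    num_classes = M.shape[0]
    records = []
    for cand, preds in preds_by_candidate.items():
        base = aggregate(preds)
        for i, m in enumerate(preds):
            e_r = 0.0
            for c in range(num_classes):
                swapped = list(preds)
                swapped[i] = c
                # reward(d, c): absolute change in the global score if item i were rescored as c
                e_r += M[m, c] * abs(aggregate(swapped) - base)
            records.append(((cand, i), e_r + delta))  # additive smoothing, Delta = 0.001
    return records

def reward_sample(records, budget, seed=0):
    """Draw `budget` records without replacement, proportional to expected reward."""
    weights = np.array([r for _, r in records])
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(records), size=budget, replace=False, p=probs)
    return [records[i][0] for i in idx]

# Usage: rewards = expected_rewards({"cand-1": [1, 2, 2, 3, 1, 2]}, M)
#        chosen  = reward_sample(rewards, budget=2)
```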

3.4 Estimation with Guarantees

In high-stakes testing scenarios, it is critical to ensure that the system does not fail catastrophically. For this reason, it is important to provide estimates of system metrics with guarantees. Kang et al. (2020) describe an algorithm that provides statistical guarantees on the precision/recall of a subset of results returned from a dataset. More specifically, given a dataset, a precision/recall target, a sample size, and a failure probability, the algorithm returns a result (a subset of the dataset) that meets the required precision/recall target with a probabilistic guarantee.

Our task is similar, but instead provides guarantees on the accuracy/QWK of the overall score on the entire dataset, rather than on just a dataset subset and individual samples. To provide these guarantees, we form confidence intervals over accuracy/QWK and take the lower bound.

The samples selected by the reward and uncertainty sampling procedures are not a good fit for estimation, as they have been drawn with the purpose of correcting mistakes and improving reliability. This means that highly underconfident samples would be selected, leading to inaccurate performance estimates. Kang et al. (2020) show that importance sampling based on a model's confidence of prediction improves over uniform random sampling by providing a lower-variance estimate. More specifically, they show that the squared confidence of the model minimizes the variance of the estimate. As we have not considered model confidence in our work, we take the following formulation as a proxy for confidence, applied over every candidate who took the test (not over responses to individual questions):

\zeta(t) = \Big(1 - \sum_{i \in t} i[u]\Big)^{2} \quad \forall t \in T    (2)

where $\zeta$ represents the confidence associated with a test taker $t$ in the set of all test-takers $T$, $i$ represents an individual response of $t$, and $u$ represents the uncertainty. From uncertainty sampling, we have a normalized uncertainty associated with each response; this is aggregated over all responses of a candidate, subtracted from 1, and then squared to provide a confidence estimate. This confidence is normalized to create a probability distribution. A secondary, smaller sample is taken over this distribution of candidates, effectively sampling all underlying responses of each sampled candidate. Using the aggregated labels and predictions, the lower-bound estimates of accuracy and kappa (McHugh 2012) are calculated.
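The text above leaves the exact interval construction open, so the sketch below uses a plain bootstrap over the secondary sample as one plausible choice; an importance-weighted estimator in the style of Kang et al. (2020) could be substituted. The function names and the bootstrap itself are our assumptions, not the paper's specification.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def candidate_confidence(uncertainty_by_candidate):
    """Eq. (2): (1 - summed normalized item uncertainty)^2 per test taker,
    normalized into a sampling distribution over candidates."""
    conf = {t: (1.0 - sum(u)) ** 2 for t, u in uncertainty_by_candidate.items()}
    total = sum(conf.values())
    return {t: v / total for t, v in conf.items()}

def lower_bound_estimates(global_preds, global_labels, conf, n_sample=200,
                          n_boot=2000, alpha=0.05, seed=0):
    """Draw a secondary sample of candidates from the confidence distribution and
    bootstrap a one-sided (1 - alpha) lower bound on global accuracy and QWK."""
    rng = np.random.default_rng(seed)
    cands = list(conf)
    probs = np.array([conf[t] for t in cands])
    chosen = rng.choice(len(cands), size=min(n_sample, len(cands)),
                        replace=False, p=probs)
    y_pred = np.array([global_preds[cands[i]] for i in chosen])
    y_true = np.array([global_labels[cands[i]] for i in chosen])

    accs, kappas = [], []
    for _ in range(n_boot):
        b = rng.integers(0, len(y_true), size=len(y_true))  # bootstrap resample
        accs.append(float(np.mean(y_pred[b] == y_true[b])))
        kappas.append(cohen_kappa_score(y_true[b], y_pred[b], weights="quadratic"))
    return np.quantile(accs, alpha), np.nanquantile(kappas, alpha)
```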

(Panels: (a) BERT-Baseline Model; (b) BERT-Two Stage Model; (c) LSTM-Attention-Baseline Model; (d) LSTM-Attention-Two Stage Model; (e) Pseudo Model with 0.75 accuracy (local); (f) Legend.)
Figure 4: In each model, we show the change in accuracy (left) and quadratic weighted kappa (QWK) (right) after sampling with the sample size (human budget) shown on the x-axis. As can be seen, reward sampling outperforms both uncertainty sampling and random sampling baseline in each model.

4 Experiments

4.1 Dataset

To evaluate our method, we make use of data collected by Second Language Testing Inc. (SLTI) while conducting the Simulated Oral Proficiency Interview (SOPI) exam. The SOPI exam has been used since 1992 and studied extensively (Stansfield and Kenyon 1992, 1996). SOPI is used for interviews, university admissions, skill development, and as a test in several online courses (SLTI 2021). SOPI offers psychometric advantages in terms of reliability and validity, particularly in standardized testing situations. The candidates in the dataset are primarily Filipino high school graduates and above. A test-taker is presented with six prompts on their computer, and their response to each individual item is recorded. The prompts and the rubrics for evaluation follow the Common European Framework of Reference for Languages (CEFR) (Broeder and Martyniuk 2008) guidelines. Prompt difficulty varies from B1 to C1. A candidate receives both prompt-level scores and a global score calculated from the individual prompt-level scores. The SOPI dataset has eight question papers (forms) containing six prompts each, and each form was attempted by 7,200 speakers on average. Many other works have used the SLTI dataset for tasks including automated scoring and coherence modeling (Grover et al. 2020; Patil et al. 2020; Stansfield and Winke 2008; Singla et al. 2021a).

4.2 Experimental Setup

To demonstrate that the sampling methods described are model agnostic, we conduct experiments using multiple models of varying accuracy. We leverage speech scoring models from Singla et al. (2021a), making use of state-of-the-art models such as BERT and bi-directional LSTMs, both baseline versions and versions conditioned on speaker information. In addition, we also run experiments on a pseudo model, described as follows. For a given accuracy, a pseudo model's predictions are generated by randomly changing $(100 - acc)\%$ of the ground-truth labels. For example, the predictions of a pseudo model with 65% accuracy are the ground truth with 35% of labels randomly changed. Predictions, and hence accuracy, are generated at the local level, for each response, whereas we are concerned with metrics at the global level, which are typically lower. The dataset is split into train and test sets, with the additional constraint that all responses of one candidate are contained in a single set and not split between the train and test sets. For our experiments, since we do not have a precomputed human-machine agreement matrix, we compute it using the training set and hold out the test set for verifying our proposed system. In addition, the aggregate dataset must also be calculated from each candidate's individual responses, computing the candidate's global score from their response-level scores.
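A sketch of the pseudo-model construction follows, under the assumption that each changed label is replaced by a different class chosen uniformly at random (the description above only says the labels are randomly changed); names are illustrative.

```python
import numpy as np

def pseudo_model(labels, accuracy, num_classes, seed=0):
    """Generate pseudo-model predictions by randomly changing (100 - acc)% of the
    ground-truth labels at the response (local) level."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    preds = labels.copy()
    n_flip = int(round((1.0 - accuracy) * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in flip_idx:
        # Assumption: a flipped label is replaced by a different class chosen uniformly.
        alternatives = [c for c in range(num_classes) if c != labels[i]]
        preds[i] = rng.choice(alternatives)
    return preds

# e.g. preds = pseudo_model(ground_truth_labels, accuracy=0.75, num_classes=6)
```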

The experiments were conducted with sample sizes of up to 80% of the dataset to observe the effect of sample size on the improvement in accuracy with respect to each sampling method. Records are sampled from the test set according to Reward Sampling, along with our baselines, Random (uniform) Sampling and Uncertainty (importance) Sampling. We replace the predictions of records in the sample with the ground truth, following which we recompute the aggregate dataset. This dataset is used to calculate the metrics of the overall system, i.e., the combination of the model and the human in the loop, not just the model alone. In estimating metrics, a secondary sample was taken. Empirically, we observed that a sample size of 200 was sufficient for stable estimation of a 95% confidence interval. We report our estimation for Reward Sampling when considering 80% of the dataset as the human budget, i.e., the most performant configuration.
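The evaluation step itself can be sketched as below, reusing the illustrative aggregate() from the reward-sampling sketch (the actual aggregation rule may differ); the sampled record indices come from whichever sampling method is being evaluated.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def system_metrics(preds_by_candidate, labels_by_candidate, sampled_records):
    """Replace sampled item predictions with the human (ground-truth) scores,
    re-aggregate each candidate, and compute global accuracy and QWK for the
    combined human-machine system."""
    corrected = {t: list(p) for t, p in preds_by_candidate.items()}
    for cand, item in sampled_records:  # records chosen by a sampling method
        corrected[cand][item] = labels_by_candidate[cand][item]
    cands = list(corrected)
    y_pred = [aggregate(corrected[t]) for t in cands]            # aggregate() as defined
    y_true = [aggregate(labels_by_candidate[t]) for t in cands]  # in the earlier sketch
    acc = float(np.mean(np.array(y_pred) == np.array(y_true)))
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    return acc, qwk
```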

Figure 5: The metrics of two models [0 - BERT-TwoStage, 1 - LSTM-Baseline] have been presented. Results for other models are similar and are not shown for visual clarity.

5 Results

5.1 Improving Reliability

Table 1 presents the results of the experiments conducted across configurations of models, sampling methods, and human budgets. The best-performing configuration is identical across nearly all models: Reward Sampling with the maximum sample size (80% of the dataset). The models considered are BERT (baseline), BERT (two-stage speaker conditioning), BD-LSTM with Attention (baseline), BD-LSTM with Attention (two-stage speaker conditioning), and a pseudo model with an accuracy of 0.75 at the local level. For all models, the dataset is aggregated, following which accuracy and QWK are calculated, giving us the global model metrics shown in the first column of Table 1.

Fig 4 shows the change in both accuracy and QWK at the global level for various models. The changes are measured for increasing human budget, i.e., the percentage of responses available to be scored by humans, up to 80% of the dataset. For illustration, we consider the BERT-Baseline model. Firstly, we observe that random sampling shows minimal improvement (3%) over the actual accuracy when sampling 10% of the dataset to be scored by human raters. This is due to the model's accuracy (66% of all samples would have been predicted correctly anyway), and the remaining gains are further diminished by the aggregation process. Reward Sampling, on the other hand, shows an 8% gain in accuracy, more than twice the gain achieved by random sampling. Interestingly, Uncertainty Sampling shows gains similar to random sampling in all models except the pseudo model, where its performance is more in line with reward sampling. The predictions of the pseudo model are randomly generated, hence local gains translate well to global gains. This difference is likely the reason for the large gap in performance when considering uncertainty sampling on pseudo versus real models.

Reward Sampling, where the reward gained by having a record rated by a human is also a sampling factor, shows significant gains across models, as shown in Fig 4. We note that the advantage of Reward Sampling over the baseline sampling methods declines with increasing sample sizes. Initially, Reward Sampling outperforms the other sampling methods by large margins, up to a sample size of approximately 30%. Beyond this mark, the gains are no longer as significant and the other methods slowly catch up. This trend holds across all models, indicating that Reward Sampling shows maximal gains over baselines when sampling less than half of the dataset for human scoring.

5.2 Estimation with Guarantees

Fig 5 visualizes the metrics of the models when utilizing reward sampling, along with estimates of the same. The sample size used for estimation remains constant; only the sample used for reward sampling changes. After reward sampling, a sample based on the confidence distribution (§3.4) is drawn, and the 95% confidence interval for both accuracy and kappa is calculated. The lower bound is taken to provide a statistical guarantee that accuracy/QWK will fall below the estimated values in only 5% of runs.

6 Conclusion

Automatic Scoring (AS) helps assess the language competency of candidates with an accuracy matching that of a human grader, but faster, with greater consistency, and at a fraction of the cost. Existing systems either rely on double scoring, effectively scoring each sample by both a human and an AS system, or rely solely on an AS system. Although double scoring is more reliable, it is considerably more expensive. We develop novel, sample-efficient algorithms to target the spectrum of possible solutions between these two extremes. We show that by using a relatively small human budget, we can improve and estimate performance with guarantees, thus increasing the reliability and trustworthiness of the system. We implement and evaluate our algorithms on real exam data, showing that they outperform naive baselines in all settings evaluated. These results indicate the promise of probabilistic algorithms for improving and estimating automatic scoring reliability with statistical guarantees.

As part of future research, we plan to develop even more sample-efficient algorithms and to incorporate trait scoring while sampling. Another possible research avenue where we can apply our algorithms is test design. While test design currently involves linguistic validity assessment studies, it does not take into account the reliability of the final test built. Test reliability could easily be incorporated as another constraint through our modelling paradigm.

References

  • ACARA (2018) ACARA. 2018. ACARA News (January 2018). https://www.acara.edu.au/news-and-media/news-details?section=201801250300#201801250300.
  • Brenzel and Settles (2017) Brenzel, J.; and Settles, B. 2017. The Duolingo English Test—Design, Validity, and Value. DET Whitepaper (Short).
  • Broeder and Martyniuk (2008) Broeder, P.; and Martyniuk, W. 2008. Language education in Europe: The common European framework of reference. Encyclopedia of language and education, 209–226.
  • Chaganty, Mussman, and Liang (2018) Chaganty, A. T.; Mussman, S.; and Liang, P. 2018. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202.
  • Ding et al. (2020) Ding, Y.; Riordan, B.; Horbach, A.; Cahill, A.; and Zesch, T. 2020. Don’t take “nswvtnvakgxpm” for an answer–The surprising vulnerability of automatic content scoring systems to adversarial input. In Proceedings of the 28th International Conference on Computational Linguistics, 882–892.
  • Dong and Zhang (2016) Dong, F.; and Zhang, Y. 2016. Automatic features for essay scoring–an empirical study. In Proceedings of the 2016 conference on empirical methods in natural language processing, 1072–1077.
  • ETS (2021) ETS. 2021. Find Test Centers and Dates for the ETS TOEFL Exam. https://v2.ereg.ets.org/ereg/public/testcenter/availability/seats?˙p=TEL.
  • Grover et al. (2020) Grover, M. S.; Kumar, Y.; Sarin, S.; Vafaee, P.; Hama, M.; and Shah, R. R. 2020. Multi-modal automated speech scoring using attention fusion. arXiv preprint arXiv:2005.08182.
  • Hashimoto, Zhang, and Liang (2019) Hashimoto, T. B.; Zhang, H.; and Liang, P. 2019. Unifying Human and Statistical Evaluation for Natural Language Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.
  • Incorporated (2017) Incorporated, M. 2017. PEG - The engine driving Automated Essay Scoring.
  • Kang et al. (2020) Kang, D.; Gan, E.; Bailis, P.; Hashimoto, T.; and Zaharia, M. 2020. Approximate selection with guarantees using proxies. arXiv preprint arXiv:2004.00827.
  • Kumar and Boulanger (2020) Kumar, V.; and Boulanger, D. 2020. Explainable Automated Essay Scoring: Deep Learning Really Has Pedagogical Value. In Frontiers in Education, volume 5, 186. Frontiers.
  • Kumar et al. (2019) Kumar, Y.; Aggarwal, S.; Mahata, D.; Shah, R. R.; Kumaraguru, P.; and Zimmermann, R. 2019. Get IT Scored Using AutoSAS—An Automated System for Scoring Short Answers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 9662–9669.
  • Kumar et al. (2020) Kumar, Y.; Bhatia, M.; Kabra, A.; Li, J. J.; Jin, D.; and Shah, R. R. 2020. Calling out bluff: Attacking the robustness of automatic scoring systems with simple adversarial testing. arXiv preprint arXiv:2007.06796.
  • LaFlair and Settles (2019) LaFlair, G. T.; and Settles, B. 2019. Duolingo English test: Technical manual. Retrieved April, 28: 2020.
  • Le (2020) Le, T. 2020. Testing & Educational Support in the US. https://my.ibisworld.com/us/en/industry/61171/key-statistics.
  • Liu, Xu, and Zhu (2019) Liu, J.; Xu, Y.; and Zhu, Y. 2019. Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744.
  • Loukina et al. (2020) Loukina, A.; Madnani, N.; Cahill, A.; Yao, L.; Johnson, M. S.; Riordan, B.; and McCaffrey, D. F. 2020. Using PRMSE to evaluate automated scoring systems in the presence of label noise. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, 18–29.
  • LTI (2021) LTI. 2021. Language Testing Inc. https://www.languagetesting.com/lti-information/general-test-descriptions.
  • Malinin (2019) Malinin, A. 2019. Uncertainty estimation in deep learning with application to spoken language assessment. Ph.D. thesis, University of Cambridge.
  • Malinin et al. (2016) Malinin, A.; Van Dalen, R.; Knill, K.; Wang, Y.; and Gales, M. 2016. Off-topic response detection for spontaneous spoken English assessment. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1075–1084.
  • McHugh (2012) McHugh, M. L. 2012. Interrater reliability: the kappa statistic. Biochemia medica, 22(3): 276–282.
  • O’Donnell (2018) O’Donnell, P. 2018. Computers are now grading essays on Ohio’s state tests.
  • Page (1967) Page, E. B. 1967. Statistical and linguistic strategies in the computer grading of essays. In COLING 1967 Volume 1: Conference Internationale Sur Le Traitement Automatique Des Langues.
  • Parekh et al. (2020) Parekh, S.; Singla, Y. K.; Chen, C.; Li, J. J.; and Shah, R. R. 2020. My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism. arXiv preprint arXiv:2012.13872.
  • Patil et al. (2020) Patil, R.; Singla, Y. K.; Shah, R. R.; Hama, M.; and Zimmermann, R. 2020. Towards Modelling Coherence in Spoken Discourse. arXiv preprint arXiv:2101.00056.
  • Pearson (2019) Pearson. 2019. Pearson Test of English Academic: Automated Scoring. https://assets.ctfassets.net/yqwtwibiobs4/26s58z1YI9J4oRtv0qo3mo/88121f3d60b5f4bc2e5d175974d52951/Pearson-Test-of-English-Academic-Automated-Scoring-White-Paper-May-2018.pdf.
  • Perelman et al. (2014) Perelman, L.; Sobel, L.; Beckman, M.; and Jiang, D. 2014. The Basic Automatic B.S. Essay Language Generator (BABEL Generator). https://lesperelman.com/writing-assessment-robo-grading/babel-generator/.
  • Service (2020) Service, E. T. 2020. Education Testing Service EIN 21-0634479. https://www.causeiq.com/organizations/educational-testing-service,210634479/.
  • Singla et al. (2021a) Singla, Y. K.; Gupta, A.; Bagga, S.; Chen, C.; Krishnamurthy, B.; and Shah, R. R. 2021a. Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 1681–1691.
  • Singla et al. (2021b) Singla, Y. K.; Parekh, S.; Singh, S.; Li, J. J.; Shah, R. R.; and Chen, C. 2021b. AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses. arXiv preprint arXiv:2109.11728.
  • SLTI (2021) SLTI. 2021. Second Language Testing Inc. https://secondlanguagetesting.com/.
  • Stansfield and Winke (2008) Stansfield, C.; and Winke, P. 2008. Testing aptitude for second language learning. Encyclopaedia of language and education, 2nd Edition: Language Testing and assessment, 7: 81–94.
  • Stansfield and Kenyon (1992) Stansfield, C. W.; and Kenyon, D. M. 1992. The development and validation of a simulated oral proficiency interview. The Modern Language Journal, 76(2): 129–141.
  • Stansfield and Kenyon (1996) Stansfield, C. W.; and Kenyon, D. M. 1996. Test Development Handbook: Simulated Oral Proficiency Interview,(SOPI). Center for Applied Linguistics.
  • Strauss (2020) Strauss, V. 2020. How much do big education nonprofits pay their bosses? Quite a bit, it turns out. https://www.washingtonpost.com/news/answer-sheet/wp/2015/09/30/how-much-do-big-education-nonprofits-pay-their-bosses-quite-a-bit-it-turns-out/.
  • Taghipour and Ng (2016) Taghipour, K.; and Ng, H. T. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 conference on empirical methods in natural language processing, 1882–1891.
  • TechNavio (2020) TechNavio. 2020. Global Higher Education Testing and Assessment Market 2020-2024. https://www.researchandmarkets.com/reports/5136950/global-higher-education-testing-and-assessment.
  • Wei and Jia (2021) Wei, J.; and Jia, R. 2021. The statistical advantage of automatic NLG metrics at the system level. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics.
  • Yan, Rupp, and Foltz (2020) Yan, D.; Rupp, A. A.; and Foltz, P. W. 2020. Handbook of automated scoring: Theory into practice. CRC Press.
  • Yang et al. (2002) Yang, Y.; Buckendahl, C. W.; Juszkiewicz, P. J.; and Bhola, D. S. 2002. A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15(4): 391–412.
  • Zechner et al. (2009) Zechner, K.; Higgins, D.; Xi, X.; and Williamson, D. M. 2009. Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10): 883–895.
  • Zhao et al. (2017) Zhao, S.; Zhang, Y.; Xiong, X.; Botelho, A.; and Heffernan, N. 2017. A memory-augmented neural model for automated grading. In Proceedings of the fourth (2017) ACM conference on learning@Scale, 189–192.