Automated Text Scoring in the Age of Generative AI for the GPU-poor
Abstract
Current research on generative language models (GLMs) for automated text scoring (ATS) has focused almost exclusively on querying proprietary models via Application Programming Interfaces (APIs). Yet such practices raise issues around transparency and security, and these methods offer little in the way of efficiency or customizability. With the recent proliferation of smaller, open-source models, there is the option to explore GLMs with computers equipped with modest, consumer-grade hardware, that is, for the “GPU poor.” In this study, we analyze the performance and efficiency of open-source, small-scale GLMs for ATS. Results show that GLMs can be fine-tuned to achieve adequate, though not state-of-the-art, performance. In addition to ATS, we take small steps towards analyzing models’ capacity for generating feedback by prompting GLMs to explain their scores. Model-generated feedback shows promise, but requires more rigorous evaluation focused on targeted use cases.
1 Introduction
Generative language models (GLMs), such as GPT-4 [34] and Claude [2], have demonstrated powerful performance across a variety of language and reasoning tasks. In the field of education, researchers are exploring the extent to which these models can perform tasks such as automated essay scoring [56], providing feedback to students [4], individual tutoring [7], and more [15].
Although GLMs show promise in automating certain educative tasks, there are critical limitations that hinder the possibility of wider implementation. For instance, researchers have shown that GLMs can be “jail-broken” to bypass safety guardrails [58] and can disclose personally identifiable information. Such GLMs are also extremely large, requiring millions of dollars to train and deploy, which makes them highly inefficient for specialized tasks [26]. These models are constantly being updated, sometimes leading to degraded performance [6], and they are only accessible via Application Programming Interfaces (APIs), which leads to issues around replicability and leaves little room to conduct rigorous research.
It is for these reasons that we shift the focus away from large, proprietary GLMs toward smaller, open-source GLMs. In this study, we focus on two educational applications: Automated Text Scoring (ATS) and providing feedback—specifically, feedback that justifies scores based on the scoring rubric. Our study is the first to demonstrate that it is possible to efficiently fine-tune such GLMs to yield high-quality scores, and that (at least some) feedback from fine-tuned models can explain these scores. Our data is drawn from the publicly available Automated Student Assessment Prize (ASAP) competitions (ASAP Automated Essay Scoring: https://www.kaggle.com/c/asap-aes; ASAP Automated Short Answer Scoring: https://www.kaggle.com/c/asap-sas), which allows us to more easily compare our results to other approaches and to share our findings more broadly. More specifically, our research goals are as follows:
1. Fine-tune four recently-released, relatively small (8 GB or less) open-source GLMs for Automated Essay Scoring (AES) and Automated Short Answer Scoring (ASAS).
2. Compare the performance of these GLMs for AES and ASAS, relative to current state-of-the-art (SOTA) benchmarks.
3. Prompt GLMs to explain the scores that they provided based on item-specific rubrics, and characterize patterns of feedback via qualitative analysis.
The organization of this paper is as follows: In Section 2, we review the theoretical and empirical context surrounding ATS, feedback, GLM architectures, and GLM training. In Section 3, we detail the characteristics of the data, models, prompts, and training methods used in this study. We review results in Section 4, which is divided into (A) automated scoring and (B) feedback (of essays and short answers, respectively). Finally, we discuss some of the ramifications of our findings in Section 5, and suggest avenues for future research. In addition to this paper, for greater transparency, we make publicly available the scores and feedback generated by our fine-tuned GLMs.
2 Background
2.1 Automated text scoring
AES and ASAS have been active areas of research and development since as early as 1966 [38]. There is widespread acceptance that, when carefully constructed and monitored, AES and ASAS can deliver reliable scores [30]. For this reason, ATS has become common in educational assessment.
From a machine-learning perspective, both AES and ASAS are text classification problems, but from a measurement perspective, they assess different abilities and may require different approaches. For instance, rubrics for essay scoring are often designed to evaluate attributes such as organization, argumentation, grammar, and spelling in lengthier written responses. In contrast, rubrics for short answer questions focus on assessing specific knowledge and comprehension, often independent of grammatical and spelling considerations. For this reason, an approach that works well for AES may not always be suitable for ASAS and vice versa.
There have been a plethora of approaches applied to both AES and ASAS. Perhaps the oldest of these is known as the Bag of Words (BoW) approach, which generally combines rule-based linguistic features with a set of frequency-based features [38]. As Natural Language Processing (NLP) began incorporating neural network-based models, these models were applied to AES and ASAS. Early implementations of neural network-based scoring [14, 37] used layers of recurrent units, such as the long short-term memory (LSTM) unit [20] and gated recurrent units (GRU) [8], with attention [17].
The most influential change to NLP has been the rise of attention [16] and the transformer architecture [53]. The use of transformer-based Large Language Models (LLMs), such as BERT [13], to perform ATS is now well-established in both AES [43, 51, 60] and ASAS [35]. In the past few years, generative language models (GLMs) like ChatGPT [34] have garnered immense excitement from both the media and academic circles. These GLMs are pretrained on a large corpus and then instruction-tuned to perform a multitude of tasks [9].
Attempts at ATS with GLMs have focused primarily on large, proprietary models, e.g. [33], which raises several concerns in an educational setting. Firstly, given that student data can include personally identifiable information, the reliance on an externally managed API poses a security risk. Secondly, since the weights are not publicly available, there is no ability to apply tools from explainable AI (xAI) [29]. From the viewpoint of sustainability, closed-source models can require far more resources to run and can be much more expensive in the long run. Some researchers have explored AES and feedback using small, open-source models: In [46], the authors explore prompting strategies and show that machine evaluation of feedback correlates with human evaluation of feedback; it is also clear, however, that with respect to AES, in-context GLM performance remains far below that of fine-tuned classification models.
2.2 Model-generated feedback
If we limit our research into GLMs merely to improve existing scoring systems, then we will have missed out on the potential to enhance educational assessment. There is a growing call from educators, students, and other stakeholders for these models to be used to provide feedback.
Although model-generated feedback holds potential value for educators, there remain substantial hurdles to producing feedback that is useful. These limitations revolve around the quality of the feedback itself, as well as the difficult endeavor of validating that the feedback is indeed useful in a given context. With respect to feedback quality, even large GLMs produce hallucinations. In the field of text generation, hallucination refers broadly to text that, while grammatically correct, is also nonsensical, unfaithful, unreliable, inaccurate, irrelevant, etc. [22, 61]. With respect to validation, there is no methodology in the field that can be used to easily validate such feedback. There are, moreover, no easy-to-implement systems to capture feedback in an ongoing way from educators, which makes the development of process-oriented tools extremely challenging.
Beyond technological limitations, there are social implications that need to be considered in the face of novel educational technologies. The Substitution Augmentation Modification Redefinition (SAMR) model for technological innovation and adoption in educational settings, for instance, has been critiqued for justifying hierarchical approaches to product development and implementation [18]. Technological advances which are described or marketed as educational tools need to be developed in tandem with teachers, administrators, and other educational practitioners. Although much of the enthusiasm (as well as economic pressure) behind feedback generation is warranted, this cannot supersede the need for taking a rigorous and ethical approach towards researching and developing such tools.
2.3 Architecture of Generative Language Models for the GPU Poor
In contrast to the large, proprietary GLMs that have dominated public attention, there is a concomitant open-source movement that strives to make GLMs accessible to all. These relatively small, open-source models are typically released in 7B- and 70B-parameter versions by researchers who are often affiliated with the same organizations that develop proprietary GLMs. For instance, Google recently released Gemma, Meta released Llama-3, and Microsoft released Phi-3. In contrast to their large, proprietary counterparts, these GLMs can run on (and can even be trained on) consumer-grade hardware, such as a single 24GB GPU. That is, these models can be leveraged by the “GPU poor”, which includes most of us educational researchers. This open-source movement allows researchers to experiment directly with GLMs, and to explore targeted use cases in education. Researchers have just begun to explore smaller, open-source GLMs for ATS and feedback (e.g. [46]).
Although performance generally increases with scale, smaller GLMs perform surprisingly well. GPT-4 and Claude are enormous, and it is no surprise that they dominate leaderboards, yet their smaller, open-source counterparts (which require only a fraction of the memory) are not far behind. One reason that smaller GLMs are not further behind is that, aside from small variations, they generally share the same architecture. Furthermore, within the current paradigm, there is a consensus among researchers that the primary bottleneck to increasing performance is data volume and quality, not model architecture.
Current SOTA GLMs use a decoder-only architecture, sometimes combined with a Mixture of Experts. The underlying design is actually simpler than the encoder-based architecture advanced in Bidirectional Encoder Representations from Transformers (BERT, [13]). Following the advent of BERT, many researchers proposed variants of BERT that improved either the data [39], architecture [25, 47], or training schemes [10, 19] of the original model. These models were predominantly encoder-only models which were made into classifiers by replacing the linear layer that predicts masked tokens with another randomly initialized linear layer (i.e., the classification head). Encoder-based pretrained language models are typically given a classification head trained with a cross-entropy loss (e.g., see [43]); it is also possible, though less common, to use a single-target variant with a mean-squared-error loss function [60]. Many previous authors have applied transformer-based language models to AES and ASAS in this way [52, 43, 60, 35, 48]. Indeed, this is the current paradigm in most of AES and ASAS.
While this paradigm (of affixing a classification head) could also be applied to GLMs (indeed, this was done with the first GPT model [42]), it disregards the relationship learned by the model between the linear layer that predicts tokens and the transformer layers. The final output layer, however, can be left as is, and fine-tuning can focus on the intermediate layers (e.g., using QLoRA [12], described below). Because this form of fine-tuning preserves the relationship learned by the model between the token-prediction layer and the transformer layers, the models retain much of their abilities as generative models when applied to more general tasks. This allows the models to be further prompted to produce feedback, while the scores can still be validated against known human-defined targets.
The rapid growth of large language models, now reaching hundreds of billions of parameters, has introduced considerable engineering challenges for their large-scale deployment. A primary concern is training these enormous models within memory constraints. Generally, each parameter and its gradient are stored in 32-bit precision, requiring 4 bytes per trainable parameter. Advanced optimizers such as Adam with weight decay further increase memory consumption by storing additional data for each parameter. For example, fine-tuning a model with 7 billion parameters would typically need at least 28GB of video memory for the parameters alone, before accounting for gradients, optimizer states, and the activation memory that grows with context length.
To get around the typical memory requirements of GLMs, we employ a combination of two approaches: (1) quantization [12], wherein parameters are stored at lower precision, and (2) Low-Rank Adapters (LoRA) [21]. The combination of these methods is commonly referred to as QLoRA [12]. Quantization converts the model’s parameters from 32-bit floats to 4-bit NormalFloat data types [12]. Memory savings are further increased through double quantization, where the quantization constants themselves are also quantized. Despite using less memory, quantized models generally maintain robust performance. Additionally, memory can be further conserved by using 8-bit optimizers, which store the optimizer statistics (the running gradient moments) in 8-bit precision [11]. Low-rank adaptation (LoRA) [21] is an increasingly popular method of parameter-efficient fine-tuning [59]. In the following section, we describe LoRA in detail.
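To make these pieces concrete, the sketch below shows what a 4-bit NormalFloat configuration with double quantization might look like in the Hugging Face ecosystem (transformers and bitsandbytes). This is an illustrative sketch rather than the exact configuration used in this study; the model identifier is one of the models listed in Table 2, and the study’s own hyperparameters are given in Section 3.4.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style 4-bit quantization: NormalFloat (NF4) storage with double
# quantization of the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Any of the models in Table 2 could be loaded this way; Mistral is shown here.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)

# Optimizer-state memory can be reduced further with an 8-bit optimizer,
# e.g., by passing optim="paged_adamw_8bit" to the Trainer's arguments.
```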
2.4 Training Generative Language Models for the GPU Poor
LoRA is a powerful, parameter-efficient technique for fine-tuning GLMs. In combination with quantization, it makes it possible to fine-tune GLMs using less than 8 GB of memory, thereby making them more feasible for development and deployment.
The central idea behind LoRA is that we seek to update the large feed-forward layers of the model by only considering a low-rank additive component, initially set to 0. Mathematically, we suppose a linear layer is represented by

$h = W_0 x,$

where $W_0 \in \mathbb{R}^{d \times k}$ is the original pretrained weight matrix and $x$ is the input. It is known that updates to the linear transformations are sparse and, in many cases, approximated well by matrices of low rank. We seek to update the weight matrix in a single step by

$W_0 \rightarrow W_0 + \Delta W = W_0 + BA,$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$.

In this setting, it is expected that $r \ll \min(d, k)$, so that the number of trainable parameters is $r(d + k)$. Typical values of $r$ (e.g., $r = 32$) are chosen such that the number of trainable parameters is far fewer than in full-parameter fine-tuning.
The advantages of LoRA include reduced memory requirements for saving fine-tuned models, more efficient training, no impact on inference speed, and the capacity for combination with other parameter-efficient fine-tuning methods. The memory requirements for saving a fine-tuned large language model with LoRA are limited to the size of the pairs of update matrices, which are orders of magnitude smaller than the original model. Training is also more efficient and requires less GPU memory, since gradients only need to be calculated for the update matrices. The impact on inference latency can be reduced to zero if the update matrices are added to the pretrained weights and subsequently removed from the model after loading. Finally, because the update matrices can be removed, LoRA can be combined with any other adapters [40].
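As a minimal illustration of the update described above (not the implementation used in this study), the following PyTorch sketch wraps a frozen linear layer with a rank-$r$ additive component, with $B$ initialized to zero so that training begins from the pretrained behavior.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W0 x + (alpha / r) * B A x, with W0 frozen."""
    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad = False
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# For a 4096-by-4096 projection with r = 32, the adapter adds
# r * (d_in + d_out) = 32 * 8192 = 262,144 trainable parameters,
# versus ~16.8M for full fine-tuning of that layer.
layer = LoRALinear(nn.Linear(4096, 4096), r=32, alpha=32)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 262144
```

This is only the module-level view; in practice, libraries such as peft attach adapters of this form to selected projection layers of the transformer automatically.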
3 Methods
3.1 Data
The Automated Student Assessment Prize (ASAP) AES and SAS datasets were originally made available to the public via two competitions hosted by Kaggle in 2012 [44]. The AES dataset encompasses a total of 12,978 essays, spanning 8 distinct stimuli (we use the term stimuli or items instead of prompts, as the latter is easily confused with the prompts used to query GLMs). The SAS dataset consists of 17,043 total responses across 10 items that span various subjects, administered to students in grades 8 and 10 (depending on the item). Each response was scored by two human annotators. Accompanying the scored data are comprehensive scoring rubrics that include scoring guidelines and score ranges tailored to each stimulus. One of the advantages of using the AES and SAS datasets is that they are commonly used by other researchers, allowing us to compare our results with a wide range of previously established approaches.
In order to maintain comparability with the extensive literature on these datasets, test-train splits were chosen to align with previous studies [49, 14, 43, 51, 60, 36, 35, 27, 28]. For the AES dataset, we follow the five-fold cross-validation defined by [49]. For the SAS dataset, we used the same splits used in previous studies (e.g., [35, 28]). The (average) size of the training, development (or dev), and test sets for the AES and SAS datasets, in addition to some basic characteristics of the datasets, are presented in Table 1.
Set | Item | Train N | Dev N | Test N | Test QWK | Test Acc | Avg. Wrds. / Response | Score Range
---|---|---|---|---|---|---|---|---
Essay | 1 | 1070 | 357 | 357 | .721 | .653 | 366 | 2-12
Essay | 2 | 1080 | 360 | 360 | .814 | .783 | 381 | 1-6
Essay | 3 | 1036 | 345 | 345 | .769 | .749 | 109 | 0-3
Essay | 4 | 1063 | 354 | 354 | .851 | .772 | 94 | 0-3
Essay | 5 | 1083 | 361 | 361 | .753 | .580 | 122 | 0-4
Essay | 6 | 1080 | 360 | 360 | .776 | .623 | 154 | 0-4
Essay | 7 | 941 | 314 | 314 | .721 | .292 | 168 | 2-24
Essay | 8 | 434 | 145 | 145 | .624 | .278 | 605 | 10-60
Short Answer | 1 | 1002 | 335 | 335 | .938 | .904 | 52 | 0-3
Short Answer | 2 | 1278 | 256 | 256 | .911 | .848 | 65 | 0-3
Short Answer | 3 | 1084 | 362 | 362 | .762 | .785 | 53 | 0-2
Short Answer | 4 | 993 | 332 | 332 | .686 | .783 | 46 | 0-2
Short Answer | 5 | 1077 | 359 | 359 | .935 | .958 | 28 | 0-3
Short Answer | 6 | 1077 | 360 | 360 | .973 | .967 | 28 | 0-3
Short Answer | 7 | 1079 | 360 | 360 | .968 | .958 | 46 | 0-2
Short Answer | 8 | 1079 | 360 | 360 | .837 | .833 | 60 | 0-2
Short Answer | 9 | 1078 | 360 | 360 | .831 | .808 | 54 | 0-2
Short Answer | 10 | 984 | 328 | 328 | .904 | .909 | 45 | 0-2
The scoring rubric for the AES dataset emphasizes proper spelling and grammar usage, logical organization with smooth transitions between ideas, and the ability to exhibit analytical comprehension backed by supporting evidence. The rubrics for essay sets 1, 7, and 8 do this by breaking the score into several traits; the final score is the sum of the trait scores. While some of the essay topics depend on a particular stimulus, the rubric can generally be interpreted independently of any stimulus.
In contrast, the rubrics for the SAS items focus on specific pieces of information that need to be in a response in order to obtain a score. These short answer questions are designed to test knowledge and comprehension, hence grammar and spelling are not a part of the rubric.
3.2 Performance Metric
When evaluating the model performance, we compute Quadratic Weighted Kappa (QWK), which was the original metric specified in the Kaggle competitions [44, 45]. A rough interpretation of this metric is that it measures the probability above chance that two raters agree: a QWK of 1 indicates exact agreement, 0 indicates random agreement, and -1 indicates perfect disagreement. This metric is also standard in the industry for comparing machine scoring performance [54].
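For reference, QWK can be computed with scikit-learn’s cohen_kappa_score using quadratic weights; the scores below are illustrative placeholders, not values from our data.

```python
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 3, 1, 0, 2]   # illustrative resolved human scores
model_scores = [2, 3, 2, 1, 1, 2]   # illustrative model predictions

# weights="quadratic" yields Cohen's kappa with quadratic weights (QWK):
# 1.0 = exact agreement, 0.0 = chance-level agreement, negative = disagreement.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```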
3.3 Models
In selecting models for our study, we prioritized those that could operate on standard consumer hardware while still delivering performance adequate for generating useful feedback. We identified four models that met these criteria and represented the forefront of open-source model development from major contributors in the field. These include (with affiliation in parentheses): Llama-3 (Meta), Mistral v0.2 (Mistral), Gemma-1.1 (Google), and Phi-3 (Microsoft). Table 2 provides a brief overview of architectural characteristics, along with the total parameter count and references to their respective technical documentation.
Model Name | Release Date | # Params. | # Layers | Hidden Size | Intermediate Size | # Vocab. Tokens
---|---|---|---|---|---|---
Mistral v0.2-Instruct [23] | 12/11/2023 | 7.24B | 32 | 4,096 | 14,336 | 32k
Gemma 1.1-Instruct [50] | 3/26/2024 | 8.54B | 28 | 3,072 | 24,576 | 256k
Llama-3-8B-Instruct [32] | 4/17/2024 | 8.03B | 32 | 4,096 | 14,336 | 128k
Phi-3-mini-4k-Instruct [1] | 4/22/2024 | 3.82B | 32 | 3,072 | 8,192 | 32k
One model was trained for each item, resulting in a total of 40 trained models (4 model types x 10 items).
3.4 Parameter-efficient fine-tuning
Models were loaded through the Hugging Face hub, quantized into smaller, 4-bit models using bitsandbytes, and trained using low-rank adapters (LoRA). The learning rate was set to 2e-4 (except for Gemma-1.1, for which it was set to 1e-4 to ensure convergence), with a linear rate decay over 10 epochs. The LoRA rank $r$ and scaling parameter $\alpha$ were each set to 32. Table 3 lists how this choice affects the number of trainable parameters and the memory used for each of the four models.
Model | r-value | # Trainable Params. | Memory Used | Training Time (× BERT) | Inference Time (× BERT)
---|---|---|---|---|---
Mistral v0.2-Instruct | 32 | 83.9M | 4.67 GB | 10.8 | 30.7
Gemma 1.1-Instruct | 32 | 100M | 6.01 GB | 12.4 | 37.6
Llama-3-8B-Instruct | 32 | 83.9M | 5.76 GB | 10.5 | 29.8
Phi-3-mini-4k-Instruct | 32 | 59.8M | 2.40 GB | 6.6 | 18.4
To ease GPU load, training data were not batched (i.e. batch size was 1), and context length was capped at 2,048 (note that this cap was not exceeded for any response). We used an early stopping criterion, based on best QWK performance on the development set, computed at the end of each epoch, within a span of 10 epochs. Models were trained on a 24GB A10 GPU.
We calculated training and inference times of each model. Times were transformed so as to be relative to the training and inference times of a standard BERT-base classification model. Thus, for example, Mistral took 10.8 times longer to train than BERT, and 30.7 times longer to predict scores on the test set. The BERT model was trained in batches of 4, over the span of 20 epochs, and on the same hardware as the GLMs.
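A hedged sketch of this setup using the peft and transformers libraries is shown below; hyperparameter values follow the text (r = α = 32, learning rate 2e-4 with linear decay, 10 epochs, batch size 1), while the dataset preparation, the quantized model loading (Section 2.3), and the epoch-end QWK early-stopping callback are omitted. The output path is hypothetical.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments

# `model` is a 4-bit quantized causal LM loaded as sketched in Section 2.3.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=32,                  # LoRA rank (Table 3)
    lora_alpha=32,         # scaling parameter alpha
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # tens of millions of parameters, cf. Table 3

training_args = TrainingArguments(
    output_dir="asap-glm-lora",        # hypothetical output path
    learning_rate=2e-4,                # 1e-4 for Gemma-1.1, per the text
    lr_scheduler_type="linear",
    num_train_epochs=10,
    per_device_train_batch_size=1,     # unbatched training to ease GPU load
)
# Responses are truncated to 2,048 tokens, and early stopping selects the
# checkpoint with the best development-set QWK at the end of each epoch;
# both would be handled in the (omitted) data collator and evaluation callback.
```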
3.5 Prompting for Score Prediction
We used the following template to prompt the model for a score, given an item-specific max score, an item-specific rubric, and a student response (all indicated by curly brackets below). Note that “User” and “Assistant” role formats vary between models; roles were not entered into the prompt itself, but handled automatically via Huggingface’s apply_chat_template function.
User You are a grading assistant. Assign a **Score** between 0 and {max_score} using the **Rubric** provided to a **Student Response**
**Rubric**
{item_rubric}
**Student Response**
{student_response}
Assistant Score:
Using the filled-out template as input, we constrained the model to generate one additional token. If the model generated a non-integer token, then the score was given a 0.
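A sketch of this scoring procedure is given below. It assumes a fine-tuned model and tokenizer are already loaded; the function name and the exact way the assistant turn is seeded with “Score:” are illustrative rather than a verbatim description of our pipeline.

```python
def predict_score(model, tokenizer, item_rubric, student_response, max_score):
    """Prompt a fine-tuned GLM for a single score token and parse it (sketch)."""
    messages = [{
        "role": "user",
        "content": (
            "You are a grading assistant. Assign a **Score** between 0 and "
            f"{max_score} using the **Rubric** provided to a **Student Response**\n"
            f"**Rubric**\n{item_rubric}\n"
            f"**Student Response**\n{student_response}"
        ),
    }]
    # apply_chat_template inserts each model's own role formatting; the
    # assistant turn is then seeded with "Score:" so that the next token
    # generated should be the score itself.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    ) + "Score:"
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    inputs = inputs.to(model.device)

    # Generate exactly one additional token.
    output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    token = tokenizer.decode(output[0, inputs["input_ids"].shape[-1]:]).strip()
    return int(token) if token.isdigit() else 0   # non-integer token scores 0
```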
3.6 Prompting for Feedback Generation
After prompting for score predictions, we incorporated the predicted scores into another template to prompt the model for feedback generation. Although much of the feedback generation template is identical to the score prediction template, the model was prompted separately. A maximum of 256 new tokens were produced for AES feedback and 128 tokens for SAS.
User You are a grading assistant. Assign a **Score** between {min_score} and {max_score} using the **Rubric** provided to a **Student Response**
**Rubric**
{item_rubric}
**Student Response**
{student_response}
Assistant Score: {predicted_score}
User Using the rubric, specify why you gave the response a score of {predicted_score}.
Assistant (this last Assistant prompt was only included for short answer items) The response was given a score of {predicted_score} because
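A companion sketch for feedback generation follows; as above, the function and argument names are illustrative, and the seeded final assistant turn applies only to the short answer items.

```python
def generate_feedback(model, tokenizer, item_rubric, student_response,
                      predicted_score, min_score, max_score,
                      max_new_tokens=128, short_answer=True):
    """Ask the model to justify a previously predicted score (sketch)."""
    messages = [
        {"role": "user", "content": (
            f"You are a grading assistant. Assign a **Score** between {min_score} "
            f"and {max_score} using the **Rubric** provided to a **Student Response**\n"
            f"**Rubric**\n{item_rubric}\n"
            f"**Student Response**\n{student_response}")},
        {"role": "assistant", "content": f"Score: {predicted_score}"},
        {"role": "user", "content": (
            "Using the rubric, specify why you gave the response a score of "
            f"{predicted_score}.")},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    if short_answer:
        # The final assistant turn is seeded only for short answer items.
        prompt += f"The response was given a score of {predicted_score} because"
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    inputs = inputs.to(model.device)

    # Up to 256 new tokens were generated for essays and 128 for short answers.
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0, inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True)
```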
3.7 Qualitative Analysis of Feedback
To characterize the differences in feedback provided by each of the 4 models, we sampled student responses with predicted scores that matched human rater scores. For the SAS dataset, we sampled responses across all possible score points for 2 science items (Items 1 and 10) and 2 ELA items (Items 3 and 7). We analyzed 13 student responses across these 4 items (and 2-3 possible score points), for a total of 52 explanations. For the AES dataset, we sampled responses across all possible score points for 2 stimuli (Items 2 and 3). We analyzed 10 student responses across these 2 items (and 4-6 possible score points), for a total of 40 explanations.
In analyzing responses, we took a grounded approach (Creswell and Poth, 2016). The philosophy behind grounded qualitative research is to let patterns emerge from the data, rather than approach the data with pre-defined codes or hypotheses. More specifically, analyses consisted of two phases. In the first phase, we read through responses, noted salient trends, summarized notes, and revisited notes for each response. In the second phase, we summarized these notes into general patterns and trends, and identified consistent and inconsistent examples in the data.
4 Results
Results are divided into four sections: In Sections 4.1 and 4.2, we present the results of fine-tuned GLMs on AES and ASAS, respectively; in Sections 4.3 and 4.4, we characterize feedback after prompting GLMs to explain their scores based on item-specific rubrics, for AES and ASAS, respectively.
4.1 Automated Essay Scoring
Table 4 presents the results of fine-tuned GLMs performing AES on the ASAP-AES dataset. We provide comparisons to several notable benchmarks pertinent to the task. These include the original human-human agreement score [44], the BoW results reported in [49] and subsequent modifications using attention mechanisms [14], the original BERT results [43], the current SOTA performance [57], “fine-tuned” GPT-3.5 [31], and GPT-4 [55]. In addition to these important reference points, we also provide results from off-the-shelf, i.e. not fine-tuned, models (no asterisks) alongside fine-tuned models (indicated with asterisks).
Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Avg. |
---|---|---|---|---|---|---|---|---|---|
Human [44] | .721 | .812 | .769 | .850 | .753 | .775 | .720 | .620 | .752 |
EASE [49] | .781 | .621 | .630 | .749 | .782 | .771 | .727 | .534 | .699 |
LSTM+CNN+Att [14] | .822 | .682 | .672 | .814 | .803 | .811 | .801 | .705 | .764 |
BERT (base) [43] | .792 | .680 | .715 | .801 | .806 | .805 | .785 | .596 | .758 |
NPCR [57] | .856 | .750 | .756 | .851 | .847 | .858 | .838 | .779 | .817 |
GPT-3.5* [31] | .741 | .618 | .704 | .859 | .796 | .848 | .727 | .614 | .738 |
GPT-4 [55] | .280 | .338 | .331 | .784 | .623 | .728 | .257 | .454 | .474 |
Mistral-7B-Instruct-v0.2 | .595 | .359 | .583 | .740 | .497 | .460 | .320 | .060 | .452 |
Mistral-7B-Instruct-v0.2* | .831 | .702 | .695 | .833 | .822 | .818 | .830 | .728 | .782 |
Gemma-1.1-7b-it | .214 | .516 | .427 | .361 | .251 | .376 | .425 | .293 | .358 |
Gemma-1.1-7b-it* | .809 | .711 | .688 | .826 | .802 | .818 | .824 | .623 | .763 |
Meta-Llama-3-8B-Instruct | .255 | .463 | .432 | .557 | .653 | .608 | .283 | .362 | .452 |
Meta-Llama-3-8B-Instruct* | .821 | .727 | .717 | .824 | .815 | .829 | .837 | .752 | .789 |
Phi-3-mini-4k-instruct | .408 | .334 | .299 | .465 | .605 | .557 | .279 | .269 | .402 |
Phi-3-mini-4k-instruct* | .827 | .714 | .715 | .828 | .830 | .827 | .837 | .710 | .786 |
The fine-tuned generative models performed well compared to standard benchmarks. They exceeded the performance of EASE, BERT (base), fine-tuned GPT-3.5, and the combination of LSTM, CNN, and attention mechanisms. Although none of the models achieve the current SOTA performance (a distinction held by NPCR), each individual model surpasses many previous benchmarks. Fine-tuned GLMs also seem comparable to, if not above, human-level performance. (Regarding comparability to human-human QWK, it should be noted that the models were trained on the resolved scores, which have different ranges than the original human scores; according to the rubric, the resolved scores are calculated as the sum of the two human scores for items 1, 7, and 8.)
4.2 Automated Short Answer Scoring
The performance of GLMs fine-tuned for ASAS is presented in Table 5. Fine-tuned models are indicated with asterisks. As with AES, there are a number of important results in the literature to compare against our own. These include the human agreement score [45], the rule-based approach known as AutoSAS [28], the current SOTA given by an ensemble of pretrained models [35], “fine-tuned” GPT-3.5 [5], and GPT-4 [24]. Results from non-fine-tuned versions of each of the 4 models (no asterisks) are also included.
Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|
Human [45] | .938 | .911 | .758 | .686 | .935 | .973 | .968 | .837 | .831 | .904 | .874 |
AutoSAS [28] | .872 | .824 | .745 | .743 | .845 | .858 | .725 | .624 | .843 | .832 | .791 |
BERT-base | .849 | .772 | .692 | .722 | .845 | .840 | .676 | .598 | .829 | .717 | .749 |
Ensemble LLM [35] | .882 | .891 | .722 | .750 | .813 | .822 | .734 | .702 | .865 | .779 | .796 |
GPT-3.5* [5] | .610 | ||||||||||
GPT-4 [24] | .715 | .724 | .626 | .517 | .772 | .799 | .495 | .553 | .703 | .865 | .677 |
Mistral-7B-Instruct-v0.2 | .570 | .449 | .188 | .330 | .331 | .496 | .243 | .280 | .507 | .529 | .392 |
Mistral-7B-Instruct-v0.2* | .864 | .807 | .725 | .636 | .776 | .855 | .746 | .697 | .771 | .709 | .759 |
Gemma-1.1-7b-it | .315 | .341 | .042 | .087 | .194 | .307 | .168 | .195 | .164 | .522 | .233 |
Gemma-1.1-7b-it* | .859 | .814 | .602 | .659 | .824 | .798 | .757 | .716 | .751 | .712 | .749 |
Meta-Llama-3-8B-Instruct | .427 | .450 | .293 | .334 | .602 | .563 | .188 | .361 | .420 | .615 | .425 |
Meta-Llama-3-8B-Instruct* | .878 | .823 | .687 | .649 | .820 | .807 | .714 | .659 | .759 | .733 | .753 |
Phi-3-mini-4k-instruct | .452 | .360 | .157 | .281 | .341 | .449 | .360 | .126 | .395 | .397 | .332 |
Phi-3-mini-4k-instruct* | .864 | .779 | .697 | .625 | .781 | .850 | .718 | .691 | .788 | .703 | .750 |
In contrast with AES, the results for these large models offer comparable, but not superior, performance to BERT. The GLMs do seem to outperform previous benchmarks on items 7 and 8, where the results for Gemma and Mistral exceed previously reported models [35]. The performance on items 4 and 9, however, is lower than the benchmarks provided.
4.3 Automated Feedback for Essay Scoring
After GLMs predicted scores, we prompted them for feedback—in this case, an explanation for the score based on the scoring rubric. To illustrate the type of feedback generated by each of the four models, we present the feedback generated in response to an essay on item 1 (Table 6). The essay was assigned a score of 8 by all GLMs.
Model | Feedback |
---|---|
Mistral-7B-Instruct-v0.2 | I gave the response a score of 8 because it was well-written, organized, and persuasive. The author made a strong argument about the benefits of computers and provided specific details and examples to support their position. They also acknowledged potential counter-arguments and addressed them in a thoughtful and respectful manner. Overall, the response was convincing and persuasive. |
gemma-1.1-7b-it | Your response is well-written, clear, and easy to read. You have a strong understanding of the topic and you have supported your reasons well. You have also used persuasive language to convince the reader to agree with you. I would like to add that I think you could have added a little more detail to your reasons, but other than that I think you have done a good job. |
Meta-Llama-3-8B-Instruct | Here’s why I gave the response a score of 8: Strengths: • The response is well-organized and easy to follow. • The writer provides specific details and examples to support their ideas. • The writer shows a good understanding of the topic and the audience. Weaknesses: • The response is not as polished as it could be. There are some awkward phrasings and minor errors. • The writer could have provided more elaborate explanations and more specific details to support their ideas. • The response lacks a strong, attention-grabbing opening that draws the reader in. Score: 8 I gave the response a score of 8 because it is well-organized and easy to follow, but it is not as polished as it could be. The writer provides some specific details and examples to support their ideas, but could have provided more elaborate explanations and more specific details. Overall, the response shows a good understanding of the topic and the audience, but could be improved with more attention to detail and more elaborate explanations. |
Phi-3-mini-4k-instruct | I gave the response a score of 8 because it clearly takes a strong position on the effects of computers, providing well-elaborated reasons and specific details. The response exhibits strong organization, uses sophisticated language, and shows awareness of the audience. However, it could have provided more elaboration on the benefits of using computers for learning and communication, and could have included more specific details to further support the argument. Overall, it is a strong response but with a few areas for improvement. |
By examining the feedback across items, responses, and models, we found that the feedback provided by fine-tuned versions of Mistral and Gemma tended to be more repetitive, as these models seemed to settle into a loop more readily than Phi-3 and Llama-3. For stimuli where the rubric relied on external information, such as the understanding of a text, the language models struggled to produce sensible feedback and often only summarized and reiterated aspects of the response, rather than detailing why the score was assigned. (It is worth noting that the stimuli were very long, and including the stimuli in addition to the full rubric would have exceeded the context limits we imposed for practical considerations. Moreover, when the resolved score was the sum of the trait scores for each rater, the rubric described only the rater score, not the resolved score; we therefore employed a language model to summarize the differences between a high- and a low-scoring essay. Managing this better could perhaps lead to more constructive feedback.) The models seemed to provide much clearer feedback when the rubric could be interpreted independently of the stimuli (i.e., essay sets 1, 2, 7, and 8).
The most useful feedback overall seemed to come from fine-tuned versions of the Phi-3 and Llama-3 models. Even though they provided the most accurate explanations, they were not immune from repetition or errors.
4.4 Automated Feedback for Short Answer Scoring
In Table 7, we present feedback for a 1-point response to Item #10. We selected this particular response because model feedback was typical of what we observed for other items and score points.
For Item #10, to get full credit (2 points), the student had to (1) “describe how [a chosen color] might affect the inside of the doghouse” and (2) “use results from the experiment to support [their] description.” The student response for this particular example reads, “black. it might effect it,by using this color it can make the doghouse more warmer on summer days” (Id: 26865). The response does state that the color black would make the doghouse warmer (1 point), but fails to reference the experiment (0 points). Because it met 1 of the 2 criteria outlined in the rubric, it received a score of 1. Table 7 provides the explanations given by each of the 4 models.
Model | Explanation |
---|---|
Mistral-7B-Instruct-v0.2 | 1 is the lowest number on the chart about the lamp and color of the lid it had an average temperature of 42(DEG)C. |
gemma-1.1-7b-it | 1: The doghouse would not be warm enough for the dog if it was painted white. 2: The experiment was not designed to see if white would be too cold for the dog. The average temperature for the white lid was 42(DEG)C, and for the black lid it was 53(DEG)C. The white lid was 11(DEG)C cooler than the black lid, and if the doghouse was painted white, it would not be warm enough for the dog. |
Meta-Llama-3-8B-Instruct | 1) the color black was used and 2) the phrase “it might effect it” was used but the response was incomplete and didn’t provide enough information to determine the temperature of the inside of the doghouse. The response was also unclear about what the color black would do to the temperature of the inside of the doghouse. For example, would it make it warmer or colder? The response didn’t provide enough information to determine the temperature of the inside of the doghouse. Therefore, the response was given a score of 1. |
Phi-3-mini-4k-instruct | 1 point were awarded for the response because the response chose a color and described how that color might affect the inside of the doghouse BUT the results from the experiment were missing. |
Mistral did not produce an explanation for the score. Rather, it seemed to summarize part of the item stem, or perhaps it generated its own (student-like) response. It was common for Mistral to generate its own responses, which it would score, and subsequently produce another response and another score, and so on in a loop (not shown here).
In the above example, Gemma seems to have produced a (student-like) response, and provides no explicit reference to the rubric. The response is separated into two, however, which may indicate some kind of pastiche, blending a response with the form of the rubric. Although not evident in this example, Gemma tended to summarize or repeat student responses in its explanations. These summaries were sometimes accompanied by relevant aspects of the rubrics.
In contrast to Mistral and Gemma, Llama-3 referenced the student response in an evaluative way. It mentioned the color chosen by the student, and it quoted a phrase from the response (“it might effect it”) that could impact its score. At the end, Llama-3 summarized its explanation with a definitive, “Therefore, the response was given a score of 1,” as if it had produced a satisfying justification. Yet there are two serious flaws in Llama-3’s explanation. First, it included statements that contradict the student response, i.e., the response was not “unclear about what the color black would do to the temperature,” as Llama-3 claimed. And second, it omitted one of the criteria in the rubric (i.e. referencing the experiment), and entirely fabricated another in its place (i.e. the color does not have to be black, as implied). Although the explanation is appropriate in style, contains evaluative language, and references the student response, it misrepresents the rubric and the response. This was common of Llama-3 explanations, which were often odd combinations of the rubric and summaries of students’ responses.
Lastly, Phi-3 provided a succinct and accurate explanation of why the student would receive a 1 for this response. Phi-3 was not infallible, but it often evaluated student responses with some justification of the score or explicit reference to the rubric.
5 Discussion
5.1 Summary
In this paper, we have demonstrated that it is possible to fine-tune small, open-source GLMs to (1) achieve adequate performance for AES and ASAS and (2) generate appropriate rationales (at least in some cases) for predicted scores. Our method pushes beyond the paradigm of appending a classification head to a pretrained language model, yet avoids the many issues involved in querying large, proprietary GLMs via APIs. We find that parameter-efficient fine-tuning (using no more than a 24GB GPU) of relatively small, open-source GLMs exceeds the performance of proprietary GLMs that are orders of magnitude larger. Furthermore, because the training checkpoints are compact, the only parameters that need to be stored and served on top of the base model are the LoRA weights, which amount to fewer than 100 million parameters, fewer than a BERT model. Given the widespread enthusiasm and fear around GLMs, it may come as a surprise that they did not lead to SOTA results. Ensembles of smaller LMs remain more efficient and performant than GLMs for AES and ASAS.
One of the unique advantages of using GLMs is the ability to move beyond scoring alone—in this study, we prompt the fine-tuned models to provide an explanation of the score. We found that models were capable of (sometimes) generating adequate justifications, and that Phi-3 was more consistent than the other models. Yet this study does not undertake a thorough analysis of model-generated feedback. Although preliminary results are encouraging, rigorous analysis is needed. This would include carefully defined constructs of interest, collaboration with educators and trained human raters, and targeted use cases that identify whom the feedback is for, when the feedback should be provided, and what shortcomings need to be avoided. It is noteworthy, however, that fine-tuned GLMs were able to generate feedback at all, especially given that they were fine-tuned to predict scores (i.e. not feedback). It has been shown that, even with a small amount of fine-tuning, model behavior can change dramatically [41].
The performance of the GLMs explored in this study is promising, particularly since they avoid the critical issues of proprietary models. Firstly, these models can be run securely and efficiently with relatively low hardware requirements. Although security is not a concern when examining performance on a publicly available dataset, it is a concern in many educational contexts, where personally identifiable information about students may be shared with the organization hosting the GLM. Secondly, in order to interpret the output of these models, we must be able to access the weights. The lower computational requirements of smaller, open-source models allow them to be more readily used in explainable AI workflows. Thirdly, we believe that GLMs used for educative tasks should be developed by educators and educational researchers. The open-source movement in AI permits some agency in developing these tools, without relegating decisions to a few tech-focused companies. The methods prescribed in this paper can be duplicated without recourse to industrial-scale compute power.
5.2 Comparison to Proprietary GLMs
With respect to scoring, our fine-tuned results far exceed those of “fine-tuned” GPT-3.5 for both AES [31] and ASAS [5]. We put “fine-tuned” in quotation marks because the details of the fine-tuning procedure(s) available to the public are undisclosed and optimization (e.g., modulating the learning rate) is not currently possible. Given that GPT-3.5 is vastly larger in size (175B) and requires far more computation [3] compared to the models explored in our study, it is surprising that its performance is so underwhelming. Our results are also superior to (non-fine-tuned) GPT-4 with respect to both AES [55] and ASAS [24]. It should be noted that fine-tuning is not currently available for GPT-4; yet even if fine-tuning were available and results were adequate, these would be subject to the same limitations outlined above. We note that our study does not undertake a comparison of feedback between large, proprietary GLMs and smaller, open-source GLMs; it may be that large GLMs excel in this area.
5.3 Limitations
As noted previously, this study does not attempt to provide quantitative empirical evidence regarding the validity of model-generated feedback. Model-generated feedback, although promising, requires more rigorous evaluation that should be undertaken in collaboration with educational practitioners. Even for the relatively humble task of providing an explanation for a score, models were far from infallible. More research is needed to validate that the model is consistently connecting scores to the rubric. There are others who are exploring the more complicated task of producing model-generated feedback that is useful to educational practitioners (e.g. [46, 55]). Robust feedback systems likely require on-going evaluation, and may depend on human-in-the-loop frameworks.
Although there is growing pressure to develop educational tools using GLMs, there is no easy method of validating feedback. At this stage, the validation of feedback should be a primary concern for the future use of GLMs in education. This may mean the creation of datasets that are focused on feedback, or the use of existing information, such as essay trait scores, to validate existing feedback. To help facilitate such analyses, we have open-sourced the feedback provided on a single validation sample in the hopes of prompting further analyses (https://github.com/christopherormerod/kaggle_aes_asas_feedback). One thing that is fairly clear at this stage is that these models are computationally capable of being used in such a pipeline. The question remains, however, as to whether they are valid for carefully defined, targeted use cases.
References
- [1] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav, Fan Yang, Jianwei Yang, Ziyi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, and Xiren Zhou. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, May 2024. arXiv:2404.14219 [cs].
- [2] AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1, 2024.
- [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- [4] Makenna Carlson, Austin Pack, and Juan Escalante. Utilizing openai’s gpt-4 for written feedback. TESOL Journal, 759:e759, 2023.
- [5] Imran Chamieh, Torsten Zesch, and Klaus Giebermann. LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), pages 309–315, 2024.
- [6] Lingjiao Chen, Matei Zaharia, and James Zou. How is ChatGPT’s behavior changing over time?, October 2023. arXiv:2307.09009 [cs].
- [7] Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodríguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Jia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, and Danqi Chen. Language Models as Science Tutors, February 2024. arXiv:2402.11111 [cs].
- [8] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, September 2014. arXiv:1406.1078 [cs, stat].
- [9] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling Instruction-Finetuned Language Models, December 2022. arXiv:2210.11416 [cs].
- [10] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. Technical Report arXiv:2003.10555, arXiv, March 2020. arXiv:2003.10555 [cs] type: article.
- [11] Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit Optimizers via Block-wise Quantization, June 2022. arXiv:2110.02861 [cs].
- [12] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems, 36:10088–10115, December 2023.
- [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Technical Report arXiv:1810.04805, arXiv, May 2018. arXiv:1810.04805 [cs] type: article.
- [14] Fei Dong, Yue Zhang, and Jie Yang. Attention-based Recurrent Convolutional Neural Network for Automatic Essay Scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 153–162, Vancouver, Canada, August 2017. Association for Computational Linguistics.
- [15] Henner Gimpel, Kristina Hall, Stefan Decker, Torsten Eymann, Luis Lämmermann, Alexander Mädche, Maximilian Röglinger, Caroline Ruiner, Manfred Schoch, Mareike Schoop, Nils Urbach, and Steffen Vandrik. Unlocking the power of generative AI models and systems such as GPT-4 and ChatGPT for higher education: A guide for students and lecturers. Working Paper 02-2023, Hohenheim Discussion Papers in Business, Economics and Social Sciences, 2023.
- [16] Alex Graves and Navdeep Jaitly. Towards End-To-End Speech Recognition with Recurrent Neural Networks. In Proceedings of the 31st International Conference on Machine Learning, pages 1764–1772. PMLR, June 2014. ISSN: 1938-7228.
- [17] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with Deep Bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 273–278, December 2013.
- [18] Erica R. Hamilton, Joshua M. Rosenberg, and Mete Akcaoglu. The Substitution Augmentation Modification Redefinition (SAMR) Model: a Critical Review and Suggestions for its Use. TechTrends, 60(5):433–441, September 2016.
- [19] Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, December 2021. Number: arXiv:2111.09543 arXiv:2111.09543 [cs].
- [20] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997.
- [21] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, October 2021. arXiv:2106.09685 [cs].
- [22] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12):1–38, December 2023.
- [23] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, October 2023. arXiv:2310.06825 [cs].
- [24] Lan Jiang and Nigel Bosch. Short answer scoring with GPT-4. 2024.
- [25] Zi-Hang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. ConvBERT: Improving BERT with Span-based Dynamic Convolution. In Advances in Neural Information Processing Systems, volume 33, pages 12837–12848. Curran Associates, Inc., 2020.
- [26] Sunder Ali Khowaja, Parus Khuwaja, Kapal Dev, Weizheng Wang, and Lewis Nkenyereye. Chatgpt needs spade (sustainability, privacy, digital divide, and ethics) evaluation: A review. Cognitive Computation, pages 1–23, 2024.
- [27] Vivekanandan S. Kumar and David Boulanger. Automated Essay Scoring and the Deep Learning Black Box: How Are Rubric Scores Determined? International Journal of Artificial Intelligence in Education, 31(3):538–584, September 2021.
- [28] Yaman Kumar, Swati Aggarwal, Debanjan Mahata, Rajiv Ratn Shah, Ponnurangam Kumaraguru, and Roger Zimmermann. Get IT Scored Using AutoSAS — An Automated System for Scoring Short Answers. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):9662–9669, July 2019. Number: 01.
- [29] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy, 23(1):18, January 2021. Number: 1 Publisher: Multidisciplinary Digital Publishing Institute.
- [30] Susan Lottridge, Chris Ormerod, and Amir Jafari. Psychometric Considerations When Using Deep Learning for Automated Scoring. In Advancing Natural Language Processing in Educational Assessment. Routledge, 2023. Num Pages: 16.
- [31] Watheq Mansour, Salam Albatarni, Sohaila Eltanbouly, and Tamer Elsayed. Can Large Language Models Automatically Score Proficiency of Written Essays?, April 2024. arXiv:2403.06149 [cs].
- [32] AI Meta. Introducing meta llama 3: The most capable openly available llm to date. Meta AI, 2024.
- [33] Atsushi Mizumoto and Masaki Eguchi. Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2(2):100050, August 2023.
- [34] OpenAI. GPT-4 Technical Report, March 2023. arXiv:2303.08774 [cs].
- [35] Christopher Ormerod. Short-answer scoring with ensembles of pretrained language models, February 2022. arXiv:2202.11558 [cs].
- [36] Christopher Ormerod, Amy Burkhardt, Mackenzie Young, and Sue Lottridge. Argumentation Element Annotation Modeling using XLNet, November 2023. arXiv:2311.06239 [cs].
- [37] Christopher M. Ormerod, Akanksha Malhotra, and Amir Jafari. Automated essay scoring using efficient transformer-based language models, February 2021. Number: arXiv:2102.13136 arXiv:2102.13136 [cs].
- [38] Ellis Batten Page. Project Essay Grade: PEG. In Automated essay scoring: A cross-disciplinary perspective, pages 43–54. Lawrence Erlbaum Associates Publishers, Mahwah, NJ, US, 2003.
- [39] Shuai Peng, Ke Yuan, Liangcai Gao, and Zhi Tang. MathBERT: A Pre-Trained Model for Mathematical Formula Understanding, May 2021. arXiv:2105.00377 [cs].
- [40] Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, and Jonas Pfeiffer. Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 149–160, Singapore, December 2023. Association for Computational Linguistics.
- [41] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.
- [42] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-training, 2018.
- [43] Pedro Uria Rodriguez, Amir Jafari, and Christopher M. Ormerod. Language models and Automated Essay Scoring, September 2019. Number: arXiv:1909.09482 arXiv:1909.09482 [cs, stat].
- [44] Mark D. Shermis. State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20:53–76, April 2014.
- [45] Mark D. Shermis. Contrasting State-of-the-Art in the Machine Scoring of Short-Form Constructed Responses. Educational Assessment, 20(1):46–65, January 2015. Publisher: Routledge _eprint: https://doi.org/10.1080/10627197.2015.997617.
- [46] Maja Stahl, Leon Biermann, Andreas Nehring, and Henning Wachsmuth. Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation, April 2024. arXiv:2404.15845 [cs].
- [47] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, April 2020. Number: arXiv:2004.02984 arXiv:2004.02984 [cs].
- [48] Chul Sung, Tejas I. Dhamecha, Swarnadeep Saha, Tengfei Ma, V. Reddy, and R. Arora. Pre-Training BERT on Domain Resources for Short Answer Grading. In EMNLP, 2019.
- [49] Kaveh Taghipour and Hwee Tou Ng. A Neural Approach to Automated Essay Scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1882–1891, Austin, Texas, November 2016. Association for Computational Linguistics.
- [50] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L. Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu-hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open Models Based on Gemini Research and Technology, April 2024. arXiv:2403.08295 [cs].
- [51] Masaki Uto and Yuto Uchida. Automated Short-Answer Grading Using Deep Neural Networks and Item Response Theory. AIED, 2020.
- [52] Masaki Uto, Yikuan Xie, and Maomi Ueno. Neural Automated Essay Scoring Incorporating Handcrafted Features. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6077–6088, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.
- [53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- [54] David M. Williamson, Xiaoming Xi, and F. Jay Breyer. A Framework for Evaluation and Use of Automated Scoring. Educational Measurement: Issues and Practice, 31(1):2–13, 2012. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1745-3992.2011.00223.x.
- [55] Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, and Qi Fu. Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs, January 2024.
- [56] Changrong Xiao, Wenxing Ma, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, and Qi Fu. From Automation to Augmentation: Large Language Models Elevating Essay Scoring Landscape, January 2024. arXiv:2401.06431 [cs] version: 1.
- [57] Jiayi Xie, Kaiwei Cai, Li Kong, Junsheng Zhou, and Weiguang Qu. Automated Essay Scoring via Pairwise Contrastive Regression. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na, editors, Proceedings of the 29th International Conference on Computational Linguistics, pages 2724–2733, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics.
- [58] Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending ChatGPT against jailbreak attack via self-reminders. Nature Machine Intelligence, 5(12):1486–1496, December 2023. Publisher: Nature Publishing Group.
- [59] Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment, December 2023. arXiv:2312.12148 [cs].
- [60] Ruosong Yang, Jiannong Cao, Zhiyuan Wen, Youzheng Wu, and Xiaodong He. Enhancing Automated Essay Scoring Performance via Fine-tuning Pre-trained Language Models with Combination of Regression and Ranking. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1560–1569, Online, November 2020. Association for Computational Linguistics.
- [61] Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, and Weiqiang Jia. Cognitive Mirage: A Review of Hallucinations in Large Language Models, September 2023. arXiv:2309.06794 [cs].