\noautomath

GEE! Grammar Error Explanation with Large Language Models

Yixiao Song^♠ Kalpesh Krishna^♠ Rajesh Bhatt^♠
Kevin Gimpel^♡ Mohit Iyyer^♠
^♠University of Massachusetts Amherst ^♡QuillBot
{yixiaosong,bhatt}@umass.edu [email protected]
[email protected] [email protected] Work partially done as an intern at QuillBot. Currently at Google.

Abstract

Grammatical error correction tools are effective at correcting grammatical errors in users’ input sentences but do not provide users with natural language explanations about their errors. Such explanations are essential for helping users learn the language by gaining a deeper understanding of its grammatical rules (DeKeyser, 2003; Ellis et al., 2006).

To address this gap, we propose the task of grammar error explanation, where a system needs to provide one-sentence explanations for each grammatical error in a pair of erroneous and corrected sentences. We analyze the capability of GPT-4 in grammar error explanation, and find that it only produces explanations for $60.2\%$ of the errors using one-shot prompting.

To improve upon this performance, we develop a two-step pipeline that leverages fine-tuned and prompted large language models to perform structured atomic token edit extraction, followed by prompting GPT-4 to generate explanations. We evaluate our pipeline on German and Chinese grammar error correction data sampled from language learners with a wide range of proficiency levels. Human evaluation reveals that our pipeline produces $93.9\%$ and $98.0\%$ correct explanations for German and Chinese data, respectively. To encourage further research in this area, we will open-source our data and code.¹¹1https://github.com/Yixiao-Song/GEE-with-LLMs

1 Introduction

Grammatical error correction (GEC) is a practical and valuable application of natural language processing that facilitates both proofreading of text and language learning. Recent advances in large language models (LLMs) have significantly improved the capabilities of GEC systems (Wang et al., 2021; Bryant et al., 2023); however, they are unable to explain errors in natural language alongside providing correction. Error explanation is crucial to language learning and teaching (Ellis, 2010): while corrections are a form of implicit feedback, they are not as impactful as explicit feedback, which involves pointing out errors and providing meta-linguistic information to the user (e.g., rules of forming well-formed phrases or sentences) (DeKeyser, 2003; Ellis et al., 2006).

Refer to caption — Figure 1: An illustration of the two-step pipeline of grammar error explanation (GEE). Given a pair of sentences with corrected errors, the GEE system first extracts linguistically meaningful edit units as errors. The extracted errors are then paired with the sentences as the input for GEE generation. Note: The error in einen can be caused by gender or case or both. Without guessing the mental state of a language user, both are offered as the reasons in the explanation.

In this work, we propose a new task—grammar error explanation (GEE)—for which a model must generate natural language error explanations that help language learners acquire and enhance grammar knowledge. As shown in Figure 1, given a pair of sentences in which one sentence has grammar errors and the other one is corrected, a model needs to generate an explanation for each corrected grammar error. Given the capabilities of modern LLMs, can they simply solve this task via prompting? We show in Section 3 that one-shot GPT-4 prompting detects only $60.2\%$ of the true errors and correctly explains only $67.5\%$ of the errors it does detect.

Given this result, we develop a pipeline for GEE generation that features an essential intermediate step—atomic token edit extraction. As shown in Figure 1, given an erroneous sentence and its corrected counterpart (source and target), we first extract atomic edits at the token level by prompting or fine-tuning LLMs such as GPT-4, which also label the edits with one of four operation-level edit type labels: insert, delete, replace, and relocation. In the second step, we append the extracted edits to the source and target sentences and use them as the input to a GEE system. We utilize the few-shot learning ability of LLMs (Brown et al., 2020) to generate error explanations using carefully crafted language-specific prompts.

We validate our GEE pipeline on datasets in both German and Chinese, two languages with very different properties (fusional vs. analytical). We also recruit language teachers to evaluate the correctness of the explanations. For the first step in the pipeline, our atomic token edit method extracts $92.3\%$ of the true edits for German, which is $32.1\%$ higher than the naive one-shot approach in Section 3. For the final GEE outputs in German, $93.9\%$ of the generated explanations are judged as correct by two German teachers. Similar performance is observed in Chinese with a $98\%$ correctness rate, suggesting that our two-step pipeline together with carefully crafted language-specific prompts generalizes well for different languages.

In summary, our contributions are the following. First, we propose a new task on grammar error explanation to enhance the utility of current grammatical error correction systems. Second, we propose a two-step pipeline and study its performance in German and Chinese with detailed error analysis. Third, we publicly release our atomic edit extraction datasets for German and Chinese as well as all LLM-generated GEE outputs with the aim of enabling future research on GEE and facilitating the development of more effective GEE systems.

2 GEE task definition

While most GEC models provide viable grammar error corrections (Bryant and Ng, 2015; Bryant et al., 2023), they do not provide explanations in natural language alongside the corrections, which are critical for language learners in mastering grammar (Ellis et al., 2006; Ellis, 2010). In this section, we propose and define the task of grammar error explanation, which aims to fill this gap. In this work, we assume that a GEE model has access to the outputs of an existing GEC model, which produces the corrected form of an ungrammatical input sentence.

2.1 Formalizing the GEE task

The input to a GEE model is a pair of sentences²²2In principle, the inputs could also be documents, but we restrict our work to sentence-level GEE and leave longer texts to future work. in which one has (potentially multiple) grammar errors and the other is corrected. Concretely, let $X_{\mathit{error}}$ be an input sentence written by an user which contains grammatical errors. Then, $X_{\mathit{correct}}=\mathit{GEC}(X_{\mathit{error}})$ is the grammatically correct version of $X_{\mathit{error}}$ produced by an existing GEC system. Following common practice in GEC research (Bryant et al., 2017; Lee et al., 2018; Rao et al., 2020), we assume that an error can be corrected in four ways: insert, delete, replace, and relocate. Let $c_{1}^{X},c_{2}^{X},...,c_{n}^{X}$ be a list of corrections made by the GEC system to $X_{\mathit{error}}$ through one of these four types of edits. Then, the goal of GEE is to generate single-sentence explanations in natural language $s_{1}^{X}$ , $s_{2}^{X}$ , …, $s_{n}^{X}$ corresponding to each of these edits $c_{1}^{X},c_{2}^{X},...,c_{n}^{X}$ (example in Figure 1). Concretely,

Input: $X_{\mathit{error}},X_{\mathit{correct}}$

Output: $s_{1}^{X}$ , $s_{2}^{X}$ , …, $s_{n}^{X}$

2.2 Error extraction as foundation of GEE

The quality of error explanation depends on how the list of corrections $c_{1}^{X},c_{2}^{X},...,c_{n}^{X}$ is defined. Consider the corrections in (2.2), one way to define the correction list is through a string-based transformation (i.e., replace machen ein termin with einen Termin machen). However, when being asked to explain such an error, one would naturally approach it as, for example, “machen is moved”, “ein is replaced with einen”, and “termin is changed”. On the other hand, for the corrections made in (2.2), one would not explain the error word by word as “with is moved”. This leads to our consideration of what defines a good correction list.

{exe}

\ex

S: Ich möchte machen ein termin.

T: Ich möchte einen Termin machen.

\ex

S: I with my puppy go to the store.

T: I go to the store with my puppy.

When shown corrected texts, experts naturally tend towards explaining errors in an atomic way, which roughly amounts to one explanation per error. This also allows learners to follow and understand explanations better, especially when there are contiguous errors in the input. All of these require a process of atomic error extraction, such as the ones described for (2.2) and (2.2) which naturally uses the conventions of grammar, spelling, and language usage. In the GEE task, we define atomic errors based on how a human expert would explain a set of error corrections to a language user. In what follows, we treat each atomic error as an atomic edit and give a working definition of how to identify it.

Using (2.2) as an example, an edit (machen ein termin) should be divided into smaller edits (machen, ein, and termin) if an expert would explain the edit as merely the concatenation of explanations for the smaller edits. These smaller edits are then atomic edits (i.e., each of which has its own distinct explanation). Similarly, if an expert would explain an edit with multiple words using one explanation that cannot be separated into the concatenation of an explanation per token, then that multi-word operation is an atomic edit. An example is the relocation of with my puppy in (2.2).

The working definition of atomic edits is a guideline of extracting linguistically meaningful edits. However, language specific decisions might be needed for individual languages (e.g., verbs with a separable prefix in German). We discuss such details for German and Chinese in Section 4 and Appendix B.

2.3 Evaluation of GEE

We evaluate two critical aspects of GEE: error coverage and explanation quality.

Error coverage evaluation can be facilitated by forcing a model to generate position information of explained errors or to describe the edits being done. We thus conduct the evaluation by measuring (1) whether an explained error is indeed an error in the source and being corrected in the target; and (2) whether an error that is corrected in the target has an associated explanation.³³3A GEE model should be able to ignore errors in the source sentence that are not corrected in the target sentence since GEC is not its primary task. An automatic evaluation through string overlap can give a quick estimate of error coverage when gold references are available. We also do manual evaluation to better understand the behavior of models.

Explanation quality evaluation is challenging because errors can be explained in multiple ways. To reliably evaluate GEE outputs automatically, multi-reference metrics such as METEOR (Banerjee and Lavie, 2005) and benchmarks with multiple references for each error are needed. However, collecting such datasets is costly and requires expertise in second language teaching. Without such datasets being available, human experts are the only reliable evaluation. In our work, we recruit language teachers for evaluation described in Section 6.2.1. Language teachers, with their expertise in second language teaching, can reliably judge whether an explanation is correct and informative.

3 Has GPT-4 already solved GEE?

A natural question one might ask is whether state-of-the-art LLMs can solve the GEE task in an end-to-end manner. In this section we demonstrate that GPT-4 in its current form is error-prone. It has low error coverage and frequent hallucinations. Based on this observation, we experiment with an approach which provides GPT-4 a list of manually extracted gold atomic edits. Results show that the edit list improves the performance greatly, indicating substantial headroom with more structured prompting (as we describe in Section 4).

One-shot prompting of GPT-4. We run an experiment using German grammar error correction data (details in Section 5). We randomly sample 30 data points⁴⁴4Five data points from each CEFR level. Details are in Section 5. and generate explanations using the following one-shot prompt:

⬇

You are given a pair of German sentences. The first sentence contains one or more errors, which are corrected in the second one. Your task is to: (1) generate a succinct explanation for each error following the template; (2) assign the error a type.

Template: The word X is deleted/inserted/replaced by Y/relocated because ...

Example:

Ich habe zwei bananen für mein Katze gekauft.

Ich habe zwei Bananen für meine Katze gekauft.

Explanation:

The word ’bananen’ is replaced by ’Bananen’ because German nouns should be capitalized.

Error type: capitalization

The word ’mein’ is replaced by ’meine’ because it should agree with the gender and case of the word Katze, which is feminine and accusative.

Error type: gender and case agreement

Below is the sentence pair for you to work on. Start with the explanation directly.

{src}

{trg}

Explanation:

GPT-4 struggles to identify and explain errors. To get a better sense of what mistakes are made by GPT-4, we manually classify each edit as true positive, false positive, or false negative, depending on whether they are correctly mentioned in the generated explanations.⁵⁵5Correctly mentioned means whether one can arrive from the source to the target through the edits. Our evaluation shows that GPT-4 mentioned 120 edits (compared to 90 edits by manual annotation), achieving recall $0.602$ , precision $0.675$ and F1 score $0.636$ . Such outputs are not useful for language learners as they wrongly consider correct words as errors.

GPT-4 makes mistakes even in short sentences. As the example in Table 1 shows, GPT-4 hallucinates that Sie is relocated. It also groups two errors together and omit the explanation of one error. The unorganized output format makes it hard to track the errors and follow the explanations.

Source: Bitte antworten sreiben Sie?

Target: Bitte antworten und schreiben Sie.

The word ‘sreiben’ is replaced by ‘schreiben’ because there was a spelling mistake in the word.

Error type: spelling

The word ‘Sie’ is relocated after ‘antworten’ and the word ‘und’ is inserted between ‘antworten’ and ‘schreiben’ because these are separate actions and should be connected with a conjunction.

Error type: word order and conjunction

Table 1: An example of GPT-4’s GEE output with one-shot prompt. The second explanation hallucinates that ‘Sie’ is relocated and does not give an explanation why it is relocated. The explanation also groups the relocation of ‘Sie’ with the insertion of ‘und’.

What if GPT-4 was provided with gold edits in prompt? To measure the headroom for improvement, we prompt GPT-4 using the same prompt but provide manually extracted gold atomic edits in the input prompt. Here, the recall, precision, and F1 are increased to $0.824$ , $0.862$ , and $0.843$ respectively. Hence, offering a good list of atomic edits to GPT-4 is an important step of the process. This observation motivates our proposed pipeline in Section 4, where we augment GPT-4 prompts with automatically extracted atomic edits.

4 Pipeline for generating GEE

In Section 3, we observed that adding a list of edits as the input to GPT-4 can greatly improve the performance on error coverage. Motivated by the finding, we propose a two-step pipeline for GEE which uses atomic edit extraction as the intermediate step. The pipeline is illustrated in Figure 1. Given an input defined in Section 2, we first extract atomic edits from the pairs. The edits are then appended to the sentence pair to form the input for the final step. In the last step, GPT-4 is prompted to generate an explanation and an error type.

4.1 Atomic edit extraction

As discussed in Section 2, we define an atomic edit as the smallest individual modifications at the token level, whose boundaries are decided linguistically. Each edit belongs to one of the four operation-level types: replace, insert, delete, and relocate.

Previous work on edit extraction ERRANT Bryant et al. (2017) approaches the task in a rule-based linguistic manner. It extracts errors and label the errors with an edit type (e.g., deletion) and linguistic category (e.g., adverb). However, ERRANT has its limitations. For example, it does not account for relocated words.⁶⁶6It does account for local transposition (e.g., juice apple vs. apple juice). It is also only designed for English. Adapting it to other languages requires great effort for individual languages (Korre et al., 2021; Uz and Eryiğit, 2023). Further restrictions of ERRANT are discussed in Appendix A. As a result, we decide to use LLMs for atomic edit extraction.

Desired LLM output format. To facilitate the evaluation of edit extraction and the GEE generation in the second step, we restrict the outputs of atomic edit extraction to a template [operation type, original token(s), target token(s)]. An example with all four edit types is given in (4.1).

{exe}\ex

möchte machen ein Termine.?

Ich möchte einen Termine machen.

[insert, , Ich]

[relocate, machen, machen]

[replace, ein, einen]

[delete, ?, ]

While being useful, the introduction of relocation occasionally reduces the model performance because models tends to label a relocated token as deletion plus insertion. Relocation can also be challenging for human to decide because a relocated word should be a word order error but have the same dependency in a sentence before and after relocation. We discuss details in Appendix B.

Atomic edit extraction with LLMs. To build an atomic edit extractor, we choose to prompt Claude-2,⁷⁷7anthropic.com/index/introducing-claude Llama2-7B (Touvron et al., 2023), GPT-3.5-turbo-0613, and Azure GPT-4 (2023-03-15-preview), as well as fine-tune Llama2-7B and GPT-3.5-turbo. For prompting, we use the carefully designed few-shot prompts in Appendix C for German and Chinese. For fine-tuning, we use Llama2-7B and GPT-3.5-turbo as the base models. We noticed that the models have a low recall when only sentence pairs are provided. To improve on that, we split sentences into a list of tokens and then extract rough string-based edits which are the longest contiguous matching subsequences.⁸⁸8We use Spacy for German and Jieba for Chinese. These rough edits are appended to sentence pairs as inputs. For all models, prompted or fine-tuned, we set temperature to $0$ because the task does not require creativity and temperature $0$ returns better performance compared to $0.2$ for all models.

4.2 GEE generation

With the atomic edits being extracted, we are ready to generate explanations by prompting GPT-4 using the template in Figure 2. Edits are appended to the sentence pairs as the input. Given that each sentence pair may contain multiple errors, we investigated whether generating explanations for one error at a time or all explanations simultaneously would yield better results. In the prompt designing stage, we observed no significant difference in performance between the two approaches. Hence, we choose the latter strategy as it is efficient and cost-effective.

Figure 2 gives a shortened example of the German GEE prompt. The full prompts for German and Chinese are in Appendix D. The prompts consists of four parts. The first part is the task description, which is followed by a generic template of explanations. Below the template are few-shot examples. In the examples, we aim to offer both meta-linguistic and meaning-oriented explanations whenever it is possible as they help L2 users improve their language skills (i.e., using languages accurately and fluently) (Lyster and Saito, 2010). At the end of the prompt, we provide GPT-4 the sentence pair with a list of atomic edits and ask the model to generate one explanation with an error type for each edit. The generated outputs have the following format:

⬇

[edit description] because [edit reason]

Error type: [error type]

The edit description describes how a word in the source sentence is edited in the target sentence. The edit reason explains why such an edit is made.

5 German and Chinese datasets

This section introduces the datasets that are used in our experiments. Statistics of the sampled data subsets are reported in Table 2.

	German		Chinese
	# of data	# of edits	# of data	# of edits
Fine-tune	500	1598	496	790
Test	50	186	53	94
GEE	1122	–	970	–

Table 2: Number of sentence pairs and gold edits in each data subset in German and Chinese. We do not manually annotate the data for GEE, hence no gold edit count is reported.

5.1 German Merlin and Falko

For German GEE, we use the data from the German L2 learner corpora Falko EssayL1v2.3 (Ludeling et al., 2008; Reznicek et al., 2010) and Merlin (Boyd et al., 2014). Both datasets consist of essays written by German learners whose proficiency ranges from beginners to advanced users. The datasets provide corrections of errors. The datasets are preprocessed as described in Appendix E.

From the preprocessed dataset, we sample two subsets without overlaps between them. First, we sample $550$ data points and manually annotate them for gold atomic edits. The $550$ data points are split into $500$ for fine-tuning and $50$ for testing, each containing $1598$ and $186$ gold edits. Second, for GEE generation, we sample all A1 data points (146) and randomly sample 200 data points from other CEFR levels (A2–C2). We manually remove sentence pairs that are misaligned.⁹⁹9An example is vielen Dank für die Einladung. and 12.3.2012 Liebe Silke, vielen Dank für die Einladung. At the end, we have 1122 sentence pairs in German for GEE.

5.2 Chinese CGED2017

We conduct the Chinese GEE experiment on the training split of Chinese Grammatical Error Diagnosis (CGED) 2017 (Rao et al., 2020), which are from the writing task of the Hanyu Shuiping Kaoshi (Test of Chinese Level) (Cui and Zhang, 2011; Zhang and Cui, 2013). Error corrections are provided but there is no learner proficiency level information.

Data are preprocessed as in Appendix E.2. We sampled 520 and 60 data points for fine-tuning/prompting edit extraction models and testing performance respectively. We sample another 970 data points for generating error explanations. After cleaning, we have 496 data points for fine-tuning, 53 for testing, and 970 for explanation generation. Edit counts are in Table 2.

6 Experimental results

This section presents the results of the GEE pipeline in German and Chinese. We first present the results of the fine-tuned and prompted models on atomic token edit extraction in Section 6.1. We find that the fine-tuned GPT-3.5-turbo achieved the best performance on edit extraction for German but GPT-4 works the best for Chinese. Section 6.2 presents the human evaluation results of German and Chinese GEE outputs generated by GPT-4. Our human evaluation shows promising performance. Among the German GEE outputs, $93.9\%$ are judged by two German teachers as correct. For Chinese GEE outputs, $98\%$ of the outputs are correct according to a Chinese teacher.

6.1 Atomic edit extraction results

We first introduce the experiment and evaluation setup, then measure the performance of fine-tuned and prompted models with respect to recall, precision, and F1. Results are presented in Tables 3 and 4 for German and Chinese, respectively. In general, all models follow the output format well but the fine-tuned GPT-3.5-turbo performs best for German and GPT-4 is best for Chinese.

Experiment setup. We few-shot prompt Claude-2, GPT-3.5-turbo, and GPT-4 with the prompt for German in Appendix C.1.¹⁰¹⁰10We also prompted Llama2-7B in the same way but its F1 is only 0.086. The outputs contain repetition and irrelevant content which makes them unusable. For fine-tuning, we use Llama2-7B and GPT-3.5-turbo as the base models and fine-tune them on the 500 training data points in Table 2. Details of the fine-tuning process are in Appendix F. At inference time, the temperature of all models is set to $0$ . We employ simple heuristics to post-process model outputs to remove low-level false positive errors, such as replacement edits that have the same original and edited tokens.

Evaluation. While automatic evaluation is fast, we evaluate the test data manually because there can be multiple ways to get to a target sentence from a source sentence. Concretely, we compare model edits against the manually extracted gold edits one by one. When there is a discrepancy, if the model outputs are linguistically meaningful and can reach the same target, we treat them as true positives.

Results on German: fine-tuned GPT3.5 is most effective at atomic edit extraction. The results for German edit extraction in terms of precision, recall, and F1 are in Table 3. All models have reasonable performance but the fine-tuned GPT-3.5-turbo outperforms all others. It achieves $0.923$ in recall, $0.939$ in precision, and $0.931$ in F1. We use it as the atomic edit extractor in the next step in German GEE generation.

	Claude-2	Llama2-7B	GPT-3.5-turbo		GPT-4
	Prompting	Fine-Tuned	Prompting	Fine-Tuned	Prompting
Recall	0.789	0.849	0.695	0.923	0.874
Precision	0.737	0.827	0.764	0.939	0.870
F1	0.762	0.838	0.728	0.931	0.870
Edit Count	199	191	161	180	184

Table 3: Recall, precision, and F1 scores of models on the German atomic edit extraction task. Because of the variance in GPT-4 outputs, the outputs are generated three times and the best performance is reported.

Results on Chinese: prompted GPT-4 is the most effective edit extractor. The results are reported in Table 4. Unlike German, the prompted GPT-4 returns the best performance. Because of the variance in the GPT-4 outputs, we verify its performance by running the experiment three times. All three runs of GPT-4 return the highest scores. The best results of GPT-4 are recall 0.884, precision 0.933, and F1 score 0.908. We hypothesize that the reason of the prompted GPT-4 performing well on Chinese is that each Chinese sentence pair has less edits on average (see Table 2). The same reason leads to the fact that there are less edits in the training data, which might cause the fine-tuned models perform worse than the ones in German.

	Claude-2	Llama2-7B	GPT-3.5-turbo		GPT-4
	Prompting	Fine-Tuned	Prompting	Fine-Tuned	Prompting
Recall	0.872	0.840	0.763	0.830	0.884
Precision	0.820	0.908	0.651	0.918	0.933
F1	0.845	0.873	0.703	0.872	0.908
Edit Count	100	87	109	85	90

Table 4: Recall, precision, and F1 scores of models in the Chinese atomic edit extraction task. Because of the variance in GPT-4 outputs, the outputs are generated three times and the best performance is reported.

6.2 Human evaluation of GEE

To evaluate the performance of our GEE pipeline, we recruited two German teachers and one Chinese teacher.¹¹¹¹11Both German teachers give classes 15 to 20 hours per week. The Chinese teacher teaches 4 classes a week. This section provides quantitative results from the human evaluations of GPT-4 on the generated GEEs for German and Chinese. Detailed qualitative analysis is in Appendix H.

The results indicate that our GEE pipeline generates $93.9\%$ and $98\%$ correct explanations for German and Chinese, respectively. However, we find that GPT-4 occasionally produces low-level errors such as formatting issues. For Chinese, when it comes to word choice errors, GPT-4 does not always provide clear contrast between two words.

6.2.1 Human evaluation of German GEE

German GEE generation. Using the best performing edit extractor from Section 6.1, we extract atomic edits from the $1122$ sentence pairs described in Section 5. The extracted edits are paired with the source and target sentences to prompt GPT-4 using the few-shot prompt in Appendix D.1. We use the default hyperparameters offered by the OpenAI API (i.e., temperature $=1$ and top p $=1$ ) for some creativity in the explanations.

German GEE evaluation setting. The annotation interface is shown in Figure 4. We collected annotations on error explanations of 596 unique German sentence pairs. To assess the agreement between the teachers, 96 pairs are annotated by both of them. A total of 692 sentence pairs were annotated for this study.¹²¹²12There are 2082 edits extracted from 692 sentence pairs, but GPT-4 only generates explanations for 1986 of them. The two teachers’ agreement rate is $89.6\%$ . Details of the agreement assessment and evaluation instructions are in Appendix G.

Human annotation protocol for evaluating GEE. For each sentence pair, we present the explanations generated by GPT-4 to the teachers, who are asked to check for four types of mistakes in the explanations:¹³¹³13We call grammar errors in sentences as errors and errors made by GPT-4 as mistakes.

•

Hallucinated error: an error in an explanation that does not exist in the source sentence. Such a mistake can be made by considering a correct word/punctuation as an error, or it can be a word that does not exist in the sentences at all.
•

Missing error: a true error in the source sentence, which is edited in the target sentence but not explained.
•

Wrong error explanation: wrong edit description, wrong edit reason, or both.
•

Wrong error type: an error type that is not related to the explained error.

	Count	Percentage
Fully correct	1865	93.9%
Wrong edit description	65	3.3%
Wrong edit reason	29	1.5%
Wrong error type	12	0.6%
Hallucinated error	15	0.8%
Total explanation count	1986	100%
Total annotated items	692
Missing error	67

Table 5: Results of human evaluation on German GEE by two German teachers. 692 sentence pairs with 1986 explanations are annotated. GPT-4 generates fully correct edit description, edit reason, and error type

93.9\%

of the time. Low-level wrong edit descriptions count for

3.3\%

of the mistakes. The count of missing errors by the teachers is the lower bound of the actual ones.

German GEE using edit-driven GPT-4 prompts has high quality. The counts of each mistake type are reported in Table 5. The results show that GPT-4 generates correct explanations $93.9\%$ of the time. The occurrences of inappropriate error types and hallucinated errors are both below $1\%$ . Among the 94 wrong error explanations, 65 are wrong in the edit description but correct in edit reason. As many as 31 edit description mistakes are made because GPT-4 describes inserted and deleted edits as The word ‘’ is inserted/deleted because … without mentioning the word itself. Among the 15 hallucinated errors, 12 are caused by wrong atomic edit extraction and 2 are hallucinated by GPT-4 in the process of generating explanations.¹⁴¹⁴14One annotated item accidentally has the same source and target sentences. Its atomic edit list is empty but GPT-4 hallucinates that there is an error in the source sentence that is corrected in the target sentence.

Remaining issues. To gain a deeper understanding of GPT-4’s limitations, we look into its mistakes in detail and notice that GPT-4 does not consider a context that is sufficiently large for certain errors, especially when it comes to prepositions. For example, when explaining the error in mit 2 Zimmer vs. mit 2 Zimmern, GPT-4 only says that the dative case is needed here. It does not consider the close-by preposition mit which requires a dative case of its complement. We provide a detailed analysis of other errors in the GPT-4 outputs in Appendix H.

6.2.2 Human evaluation of Chinese GEE

To understand how generalizable our pipeline is to different types of languages, we evaluate its performance on Chinese using the CGED2017 data described in Section 5. One Chinese teacher evaluated Chinese GEE outputs on 200 sentence pairs with 302 explanations.¹⁵¹⁵15There are 310 edits extracted from these 200 sentence pairs. GPT-4 only generates explanations for 302 of them. The annotation task is set up in the same way as German.

	Count	Percentage
Fully correct	296	98.01%
Wrong edit description	1	0.03%
Wrong edit reason	3	0.10%
Wrong error type	2	0.07%
Hallucinated error	0	0.0%
Total explanation count	302	100%
Total annotated items	200
Missing error	0

Table 6: Results of human evaluation on Chinese GEE by one Chinese teacher.

98\%

of the generated explanations are judged as correct. 200 sentence pairs with 302 explanations are annotated. The evaluation criteria are the same as for German.

Positive findings. Among the 302 annotated explanations, $98\%$ are judged as correct by the Chinese teacher. GPT-4 has very low mistake rates in all four mistake types. This shows that the proposed pipeline is effective and adaptable for very different languages like German and Chinese.

Remaining issues. While GPT-4 achieves high correctness rate in Chinese GEE, there are two caveats. First, during the annotation of the data for gold atomic edits, we notice that most of the edits are simple and can be readily extracted by a string-based tool. One reason is that each sentence pair on average has fewer edits than in the German data (3.24 vs. 1.61, see Table 2). Second, for word choice errors, GPT-4 does not always give a clear comparison of word meanings.

{exe}

\ex

严重性的问题 $\rightarrow$ 严重的问题

The word ’严重性’ is replaced with ’严重’ because ’严重’ is the correct word for ’serious’ when describing the severity of a problem.

For example, in (6.2.2), GPT-4 explains what 严重 (serious) means but it does not explain why 严重 is good in 严重的问题 (serious problem) but 严重性 (seriousness) is not. Because word choice is a prevalent problem in Chinese grammar errors (see Table 11 for error types generated by GPT-4), such clear comparison should be enforced in an explanation so that language learners can draw inferences about other cases from the current error.

7 Related work

Our GEE task is built upon the actively studied GEC task. The task is often formulated as a neural machine translation task, with the source being a piece of text with grammar errors and the target being the grammar-error-free text (Boyd, 2018; Bryant et al., 2023; Yuan and Bryant, 2021; Zhang et al., 2022). Researchers in the GEC domain have explored various aspects of the task. We identify two of them which the GEE task can be built on and benefit from. We also compare our task to a related task, feedback comment generation (FCG), and show how GEE is different from it.

GEC with multi-reference and context. Research has been building GEC models on data which have one gold reference for each source input. However, there is an urge to use multiple references for source inputs (Bryant and Ng, 2015; Zhang et al., 2022; Xu et al., 2022). In the context of GEE, a capable model should generate well-suited explanations for any valid error corrections, which requires reasoning of word relations and recovering correction rationales, not just memorize grammar rules. Such ability of GEE models also need to go beyond the sentence level. Wang et al. (2022) has shown that even when only one sentence is added to the input as the context, a GEC model’s performance can be significantly boosted. If some errors can only be better corrected in context, they can be better explained in context as well.

GEC with auxiliary grammar information. There are works that have shown improvement of GEC models by adding edit types, dependency information, or grammatical error type into the training process (Omelianchuk et al., 2020; Ma et al., 2022; Yang et al., 2023). Fei et al. (2023) study the influence of adding evidence words for errors and error types into the pipeline of GEC. They found that such information can significantly increase model performance in English GEC. For the GEE task, it is an interesting direction to explore whether adding those extra information to a GEE system can improve its explanations’ usefulness.

Feedback alongside grammar error detection. On the side of explanation in GEC, Nagata et al. (2021) proposed a generative shared task called feedback comment generation for language learners (FCG). It is based on the dataset ICNALE Learner Essays with Feedback Comments Dataset (Nagata et al., 2020). The task differs from our GEE task in three important aspects. First, the FCG task is built on the grammar error detection task which does not correct errors. The inputs in FCG are erroneous sentences only, which have spans marked as errors. Hence, the FCG task does not need to handle the problem of extracting errors in a linguistically informed way. Second, the FCG task focuses only on preposition words, which are a closed set of function words whose occurrences and usages are limited. In our task, the involved error types have a wide range, as listed in Table 11 for German and Table 10 for Chinese. Third, while the FCG task focuses on generating comments as hints for language learners to correct errors themselves,¹⁶¹⁶16An example comment given in Nagata et al. (2021) is “Agree” requires a preposition since it is an ¡intransitive verb¿. Look up the appropriate preposition in a dictionary. our task aims to enhance learners’ knowledge by showing them the corrected sentences, the underlying grammar rules, and a comparison with errors and corrected words when necessary.

Works on feedback comment generation Coyne (2023) and Coyne et al. (2023) work on the FCG task and develop a typology for learning feedback, including abstract types (e.g., tone and idiom) and grammatical pattern types (e.g., comparative and causative). However, their work is in an early stage with no human or automatic evaluation on the comment quality. Behzad et al. (2023) present a strong baseline for the FCG task but points out that, at the current stage, many feedback comments are generic (e.g., Look up the use of the <verb> X in a dictionary and rewrite the sentence using the appropriate structure.) Stahl and Wachsmuth (2023), Jimichi et al. (2023), and Ueda and Komachi (2023) approach the FCG task via fine-tuning language models such as T5 (Raffel et al., 2020) or BART (Lewis et al., 2020). However, for the GEE task, especially when there are restricted annotated resources for fine-tuning, it is unclear whether such an approach can work. Lastly, these works evaluate model outpus with BLEU Papineni et al. (2002) and lack careful human evaluation.

8 Conclusion

We present a new task grammar error explanation, where systems provide natural language explanations to users explaining the grammatical errors they made. We find that GPT-4 cannot perform this task with high accuracy using one-shot prompting, and hence develop a pipelined approach using LLMs and atomic token edits to generate grammar error explanations. We find that our LLM-based pipeline gets a high score of $93.9\%$ in German and $98\%$ in Chinese error explanation.

While GPT-4 achieve high correctness rate in Chinese error explanation, our Chinese teacher identified $28\%$ of the data that do not fully correct errors in source sentences. Further research is encouraged to build datasets that have the following two properties for GEE. First, the dataset should include data from all proficiency levels of language learners so that we can readily evaluate GEE systems’ performance on a wide range of error types. Second, the dataset should provide high quality correction so that a GEE system can leverage context information when generating explanations.

While we assume a grammar error correction system as the foundation of our GEE system, further work are encouraged to explore GEE generation alongside GEC.

Limitations

We acknowledge two limitations of our current work. First, our grammar error explanation system only considers sentence level inputs. However, certain error types (e.g., word choice and coreference) can benefit from a larger context. Second, because the Chinese data used in our work are from the HSK test (Test of Chinese Level), the covered topics are limited. It also does not include data from learners from all proficiency levels. Hence, the error types might not be representative for all levels of Chinese learners.

References

Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
Behzad et al. (2023) Shabnam Behzad, Amir Zeldes, and Nathan Schneider. 2023. Sentence-level feedback generation for English language learners: Does data augmentation help? In Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges, pages 53–59, Prague, Czechia. Association for Computational Linguistics.
Boyd (2018) Adriane Boyd. 2018. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79–84, Brussels, Belgium. Association for Computational Linguistics.
Boyd et al. (2014) Adriane Boyd, Jirka Hana, Lionel Nicolas, Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Schöne, Barbora Štindlová, and Chiara Vettori. 2014. The MERLIN corpus: Learner language and the CEFR. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1281–1288, Reykjavik, Iceland. European Language Resources Association (ELRA).
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Bryant et al. (2017) Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793–805, Vancouver, Canada. Association for Computational Linguistics.
Bryant and Ng (2015) Christopher Bryant and Hwee Tou Ng. 2015. How far are we from fully automatic high quality grammatical error correction? In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 697–707, Beijing, China. Association for Computational Linguistics.
Bryant et al. (2023) Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, and Ted Briscoe. 2023. Grammatical Error Correction: A Survey of the State of the Art. Computational Linguistics, pages 1–59.
Coyne (2023) Steven Coyne. 2023. Template-guided grammatical error feedback comment generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 94–104, Dubrovnik, Croatia. Association for Computational Linguistics.
Coyne et al. (2023) Steven Coyne, Diana Galvan-Sosa, Keisuke Sakaguchi, and Kentaro Inui. 2023. Developing a typology for language learning feedback. In Proceedings of the 29th Annual Conference of the Association for Natural Language Processing, Okinawa, Japan.
Cui and Zhang (2011) Xiliang Cui and Bao-lin Zhang. 2011. The principles for building the “international corpus of learner chinese”. Applied Linguistics, 2:100–108.
DeKeyser (2003) Robert DeKeyser. 2003. Implicit and explicit learning. The handbook of second language acquisition, pages 312–348.
Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms.
Ellis (2010) Rod Ellis. 2010. Epilogue: A framework for investigating oral and written corrective feedback. Studies in second language acquisition, 32(2):335–349.
Ellis et al. (2006) Rod Ellis, Shawn Loewen, and Rosemary Erlam. 2006. Implicit and explicit corrective feedback and the acquisition of L2 grammar. Studies in second language acquisition, 28(2):339–368.
Fei et al. (2023) Yuejiao Fei, Leyang Cui, Sen Yang, Wai Lam, Zhenzhong Lan, and Shuming Shi. 2023. Enhancing grammatical error correction systems with explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7489–7501, Toronto, Canada. Association for Computational Linguistics.
Felice et al. (2016) Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 825–835, Osaka, Japan. The COLING 2016 Organizing Committee.
Jimichi et al. (2023) Kunitaka Jimichi, Kotaro Funakoshi, and Manabu Okumura. 2023. Feedback comment generation using predicted grammatical terms. In Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges, pages 79–83, Prague, Czechia. Association for Computational Linguistics.
Korre et al. (2021) Katerina Korre, Marita Chatzipanagiotou, and John Pavlopoulos. 2021. ELERRANT: Automatic grammatical error type classification for Greek. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 708–717, Held Online. INCOMA Ltd.
Korre and Pavlopoulos (2020) Katerina Korre and John Pavlopoulos. 2020. ERRANT: Assessing and improving grammatical error type classification. In Proceedings of the The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 85–89, Online. International Committee on Computational Linguistics.
Lee et al. (2018) Lung-Hao Lee, Yuen-Hsien Tseng, and Li-Ping Chang. 2018. Building a TOCFL learner corpus for Chinese grammatical error diagnosis. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
Ludeling et al. (2008) A Ludeling, Seanna Doolittle, Hagen Hirschmann, Karin Schmidt, and Maik Walter. 2008. Das Lernerkorpus Falko. Deutsch als Fremdsprache, 45(2):67.
Lyster and Saito (2010) Roy Lyster and Kazuya Saito. 2010. Interactional feedback as instructional input: A synthesis of classroom sla research. Language, Interaction and Acquisition, 1(2):276–297.
Ma et al. (2022) Shirong Ma, Yinghui Li, Rongyi Sun, Qingyu Zhou, Shulin Huang, Ding Zhang, Li Yangning, Ruiyang Liu, Zhongli Li, Yunbo Cao, Haitao Zheng, and Ying Shen. 2022. Linguistic rules-based corpus generation for native Chinese grammatical error correction. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 576–589, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Nagata et al. (2021) Ryo Nagata, Masato Hagiwara, Kazuaki Hanawa, Masato Mita, Artem Chernodub, and Olena Nahorna. 2021. Shared task on feedback comment generation for language learners. In Proceedings of the 14th International Conference on Natural Language Generation, pages 320–324, Aberdeen, Scotland, UK. Association for Computational Linguistics.
Nagata et al. (2020) Ryo Nagata, Kentaro Inui, and Shin’ichiro Ishikawa. 2020. Creating corpora for research in feedback comment generation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 340–345, Marseille, France. European Language Resources Association.
Omelianchuk et al. (2020) Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–170, Seattle, WA, USA → Online. Association for Computational Linguistics.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
Rao et al. (2020) Gaoqi Rao, Erhong Yang, and Baolin Zhang. 2020. Overview of NLPTEA-2020 shared task for Chinese grammatical error diagnosis. In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pages 25–35, Suzhou, China. Association for Computational Linguistics.
Reznicek et al. (2010) Marc Reznicek, Maik Walter, Karin Schmidt, Anke Lüdeling, Hagen Hirschmann, Cedric Krummes, and Torsten Andreas. 2010. Das Falko-Handbuch: Korpusaufbau und annotationen. Institut für deutsche Sprache und Linguistik, Humboldt-Universität zu Berlin, Berlin.
Stahl and Wachsmuth (2023) Maja Stahl and Henning Wachsmuth. 2023. Identifying feedback types to augment feedback comment generation. In Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges, pages 31–36, Prague, Czechia. Association for Computational Linguistics.
Thai et al. (2022) Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Bill Ray, Moira Inghilleri, John Wieting, and Mohit Iyyer. 2022. Exploring document-level literary machine translation with parallel paragraphs from world literature. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9882–9902, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
Ueda and Komachi (2023) Naoya Ueda and Mamoru Komachi. 2023. TMU feedback comment generation system using pretrained sequence-to-sequence language models. In Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges, pages 68–73, Prague, Czechia. Association for Computational Linguistics.
Uz and Eryiğit (2023) Harun Uz and Gülşen Eryiğit. 2023. Towards automatic grammatical error type classification for Turkish. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 134–142, Dubrovnik, Croatia. Association for Computational Linguistics.
Wang et al. (2022) Baoxin Wang, Xingyi Duan, Dayong Wu, Wanxiang Che, Zhigang Chen, and Guoping Hu. 2022. CCTC: A cross-sentence Chinese text correction dataset for native speakers. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3331–3341, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Wang et al. (2021) Yu Wang, Yuelin Wang, Kai Dang, Jie Liu, and Zhuo Liu. 2021. A comprehensive survey of grammatical error correction. ACM Trans. Intell. Syst. Technol., 12(5).
Xia (2000) Fei Xia. 2000. The segmentation guidelines for the penn chinese treebank 3.0. IRCS Technical Reports Series. 37.
Xu et al. (2022) Lvxiaowei Xu, Jianwang Wu, Jiawei Peng, Jiayu Fu, and Ming Cai. 2022. FCGEC: Fine-grained corpus for Chinese grammatical error correction. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1900–1918, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Yang et al. (2023) Lingyu Yang, Hongjia Li, Lei Li, Chengyin Xu, Shutao Xia, and Chun Yuan. 2023. LET: Leveraging error type information for grammatical error correction. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5986–5998, Toronto, Canada. Association for Computational Linguistics.
Yannakoudakis et al. (2011) Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189, Portland, Oregon, USA. Association for Computational Linguistics.
Yuan and Bryant (2021) Zheng Yuan and Christopher Bryant. 2021. Document-level grammatical error correction. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 75–84, Online. Association for Computational Linguistics.
Zhang and Cui (2013) Bao-lin Zhang and Xiliang Cui. 2013. Design concepts of “the construction and research of the inter-language corpus of chinese from global learners”. Language Teaching and Linguistic Study, 5:27–34.
Zhang et al. (2022) Yue Zhang, Zhenghua Li, Zuyi Bao, Jiacheng Li, Bo Zhang, Chen Li, Fei Huang, and Min Zhang. 2022. MuCGEC: a multi-reference multi-source evaluation dataset for Chinese grammatical error correction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3118–3130, Seattle, United States. Association for Computational Linguistics.

Appendix A Reasons of not using ERRANT

ERRANT (Bryant et al., 2017) is an effort to standardise datasets for GEC, reduce annotators’ burden, and offer feedback to instructors and learners. It does so by offering a tool that automatically extracts and labels edits in the format of operation:linguistic feature.

ERRANT would have been ideal for our purpose. Concretely, this would have been ideal for the edit extraction in Step 1 and error type tagging in Step 2. However, ERRANT has several shortcomings.

First, ERRANT is designed only for English and its error type tagging process is based on a English rule-based framework. Extending it to another language will take great effort Korre et al. (2021); Uz and Eryiğit (2023).

Second, there is ambiguity in ERRANT’s error type names. For example, R:ADV is a possible error type in ERRANT in which R stands for replacement and ADV stands for adverb. But it is not clear, as it stands, whether it represents only an adverb being replaced by another adverb, or it could be the case that a word of other category is replaced by an adverb.

Third, Korre and Pavlopoulos (2020) show that ERRANT can falsely or ambiguously tag errors. In their work, they use ERRANT to tag the errors in the FCE dataset (Yannakoudakis et al., 2011). They then sample 100 sentence pairs to whose errors ERRANT assigned the type Other. They examine those sentence pairs and found that up to $39\%$ of the data point could have been assigned a more precise label.

Fourth, ERRANT’s underlying edit extractor does not account for non-local token relocation (Felice et al., 2016). The extractor aligns the tokens in the erroneous and correct sentences and assign one of the following labels to spans: M(atch), I(nsertion), D(eletion), S(ubstitution), and T(ransposition). For a relatively locally relocated token, the extractor assigns the span as T as in (1). However, for a less local token relocation such as (2), the extractor treats it as being deleted then inserted.

(1)

Ich₀ möchte₁ haben₂ einen₃ Apfel₄ .₅

Ich₀ möchte₁ einen₂ Apfel₃ haben₄ .₅

(‘M’, 0, 1, 0, 1)

(‘M’, 1, 2, 1, 2)

(‘T3’, 2, 5, 2, 5)

(‘M’, 5, 6, 5, 6)
(2)

Ich₀ möchte₁ haben₂ einen₃ roten₄ Apfel₅ .₆

Ich₀ möchte₁ einen₂ roten₃ Apfel₄ haben₅ .₆

(‘M’, 0, 1, 0, 1)

(‘M’, 1, 2, 1, 2)

(‘D’, 2, 3, 2, 2)

(‘M’, 3, 4, 2, 3)

(‘M’, 4, 5, 3, 4)

(‘M’, 5, 6, 4, 5)

(‘I’, 6, 6, 5, 6)

(‘M’, 6, 7, 6, 7)

Relocation of tokens would be a useful label to have for word order errors, which are prevalent in elementary L2 German and Chinese learners. With this label, we could explain why a token is relocated rather than explaining why it is deleted first then explaining why it is inserted.

Appendix B Guidelines for manual edit extraction Annotation

To prepare the data for fine-tuning models to extract atomic edits in German and Mandarin Chinese, we manually annotated $500$ data points for each language. In this section, we discuss the challenges in extracting atomic edits and how we handle them.

The first step is to tokenize sentences. For German, it is straightforward because of white spaces. We use SpaCy for tokenizing German sentences which can single out punctuation marks. For Chinese, sentences cannot be tokenized into words by simply separating characters because many words are not monosyllabic. We choose to use Jieba, which is a fast and accurate Chinese word segmentation module implemented in Python.

The second step is to use SequenceMatcher from difflib to extract longest edited spans from sentence pairs, which is later used as part of the input for atomic edits. We found that adding rough edits into the input increases the recall of the prompted models. It also accelerates and eases the process of manual annotation.

The third and last step is to get atomic edits. There are four types of edits: replacement, deletion, insertion, and relocation. The challenge lies in how to align words in sentence pairs and extract edits.

For German, replacement mostly happens between tokens which have similar spelling (e.g., wolle and will, meaning want to) or the same categories (e.g., zu and nach, meaning to). Deletion and insertion can happen to individual tokens or a phrase. When more than one consecutive tokens, for example, X and Y, are deleted or inserted, we determine whether to count them as separate edits or one as a whole depending on whether X and Y form a linguistic constituent (for example, a prepositional phrase by train). The edit type relocation is inspired by a common error made by elementary German learners: placing finite verbs or adverbial phrases in the wrong position.¹⁷¹⁷17German is a verb second language, whose verb second constraint does not hold in embedded clauses. In main clauses, the finite verb occurs in the second position and non-finite verbs occur towards the end of a sentence. In embedded clauses, the finite verb usually appears at the end, after all the non-finite verbs. To emphasize that the usage of a word is not wrong but its position in a sentence is wrong, tagging such an edit as relocated is more intuitive than tagging it as a deletion followed by an insertion (or an insertion followed by a deletion).

The introduction of the relocation edit type is not at no cost. It reduces model performance because models tends to predict a relocated token/phrase as deletion plus insertion. It is also challenging because the relocated word should be just placed in a wrong position and have the same dependency in a sentence before and after being relocated. For example, for the sentences in B, it is illogical to say that the first sentence is corrected by relocating for to the first underline and insert to in the original place of for. This is because the verb talk requires a preposition but the language user mistakenly used for instead of to. It is not the case that the language user mistakenly put the for that should have been before me after talking. So, it should be the case that for is inserted to the position of the blank underline and the for after talking is replaced by to. The correct edits for B are given in B and the wrong edits are in B

{exe}\ex

S: This job is exciting me because I like talking for different people.

T: This job is exciting for me because I like talking to different people.

\ex

Good edit extraction

[‘insert’, ‘’, ‘for’]

[‘replace’, ‘for’, ‘to’]

\ex

Bad edit extraction

[‘relocate’, ‘for’, ‘for’]

[‘insert’, ‘’, ‘to’]

On the other hand, the word essen in (1) is more naturally a relocated token because its relation with the finite modal verb moöchte (would like to) and the direct object vierzig Bananen (forty bananas) remains unchanged. It is only the position of the word that is changed.

(1)
1. a.
  
  Ich möchte essen vierzig Bananen.
2. b.
  
  Ich möchte vierzig Bananen essen.
  
  [‘relocate’, ‘essen’, ‘essen’]

For Chinese, deletion and insertion work similarly as in German. Relocation is also useful in Chinese for cases like misplacement of an adverbial phrase or a function word (e.g., 了).¹⁸¹⁸18了 is a multi-functional function word and a heteronym. It can express the completion or ongoingness of an action (among its other functions). Its meaning changes based on the position in a sentence it occurs. However, replacement is not as straightforward in Chinese as in German. For example, verbs in Chinese often come with a resultative complement (e.g., 到, 完, or 出) or other function words to express different states of a verb (e.g., 过). If only the function word is changed but the verb is not, how should the edit be extracted? We experimented with both ways (with and without verbs) and found that, in either case, GPT-4 included the verb when explaining the meaning difference. Hence, for those cases, we always include the unchanged verbs. Similarly, for cases in which a function word is not changed but the verb that the function word is attached to is changed, the edit includes both the verb and the function word (e.g., [‘replace’, ‘看成’, ‘当成’]).

(1)
1. a.
  
  我花了一整天看过了这本书。
2. b.
  
  我花了一整天看完了这本书。
  
  [‘replace’, ‘看过’, ‘看完’]

Other situations in which we always take longer phrases as edits rather than only the parts being changed are idioms (e.g., [‘replace’, ‘心急如坟’, ‘心急如焚’]), formulaic expressions (e.g., [‘replace’, ‘总上所述’, ‘综上所述’]), and de (的) $+$ noun as in 在这紧急的情况下 (in an emergency situation).

Appendix C Prompts for atomic edit extraction

We use the prompts presented below for atomic edit extraction in German and Chinese. The prompt contains the task instruction followed by possible edit types as well as examples. Special instructions are given to the relocation edit type where the relocated tokens should be the same before and after the edit. In the examples, we demonstrate different edit types and their combinations, showing the models how to deal with a sentence pair with multiple edits.

C.1 Extraction prompt for German

⬇

This is an atomic edit extraction task. Given a pair of German sentences and the edits applied to the first sentence to get the second sentence, your task is to break down the edits to the atomic level (i.e., token level) and assign the edit a label. Be case sensitive. Pay attention to punctuation marks and relocated tokens. Pay attention to phonetic similarity when aligning tokens.

Labels:

1. [replace, original_token, edited_token]

2. [delete, original_token, ""]

3. [insert, "", edited_token]

4. [relocate, original_token, edited_token]: pay attention to tokens that are deleted then added again; the relocated token must be the same before and after the edit.

Examples:

Wie oben schon erwähnt ist die Chance erwisht zurweden zwar gering, aber sie ver handen.

Wie oben schon erwähnt ist die Chance, erwischt zu werden, zwar gering, aber sie ist vorhanden.

Edits:

(’replace’, ’erwisht zurweden’, ’, erwischt zu werden ,’)

(’replace’, ’ver handen’, ’ist vorhanden’)

Atomic edits:

["insert", "", ","]

["replace", "erwisht", "erwischt"]

["replace", "zurweden", "zu werden"]

["insert", "", ","]

["insert", "", "ist"]

["replace", "ver handen", "vorhanden"]

ich haben essen zwei Bananen.

Ich habe zwei Bananen gegessen.

Edits:

(’replace’, ’ich haben essen’, ’Ich habe’)

(’insert’, ’’, ’gegessen’)

Atomic edits:

["replace", "ich", "Ich"]

["replace", "haben", "habe"]

["delete", "essen", ""]

["insert", "", "gegessen"]

Ich habe gegessen zwei Bananen.

Ich habe zwei Bananen gegessen.

Edits:

(’delete’, ’gegessen’, ’’)

(’insert’, ’’, ’gegessen’)

Atomic edits:

["relocate", "gegessen", "gegessen"]

Below is the sentence pair for you to work on. Follow the format in the examples strictly.

{src}

{trg}

Edits:

{edits}

Atomic edits:

C.2 Extraction prompt for Chinese

You are a Mandarin Chinese teacher. Given a pair of Mandarin Chinese sentences and the edits applied to the input sentence to get the output sentence, your task is to break down the edits to the atomic level (i.e., token level) and assign the edit a label. Pay attention to punctuation marks and relocated tokens.

Labels:

1. [replace, original_token, editted_token]

2. [delete, original_token, ""]

3. [insert, "", editted_token]

4. [relocate, original_token1, editted_token1]: pay attention to tokens that are deleted then added again; the relocated token must be the same before and after the edit.

Examples:

我去市菜场水果买。

我去菜市场买水果。

Edits:

("replace", "市菜场水果买", "菜市场买水果")

Atomic edits:

["replace", "市菜场", "菜市场"]

["relocate", "水果", "水果"]

我吃了早饭今天。

我今天吃了早饭。

Edits:

("insert", "今天", "")

("delete", "", "今天")

Atomic edits:

["relocate", "今天", "今天"]

再子细的学习相关课题后，我意识到了这个问题的严重。

在仔细地学习了相关课题后，意识到了这个问题的严重性。

Edits:

("replace", "再子细的", "在仔细地")

("insert", "", "了")

("insert", "", "我")

("insert", "", "性")

Atomic edits:

["replace", "再", "在]

["replace", "子细", "仔细"]

["replace", "的", "地"]

["insert", "", "了"]

["insert", "", "我"]

["replace", "严重", "严重性"]

她打算明儿天的午前去北京。

她打算明天上午去北京。

Edits:

("replace", "明儿天的午前", "明天上午")

Atomic edits:

["replace", "明儿天", "明天"]

["delete", "的", ""]

["replace", "午前", "上午"]

Below is the sentence pair for you to work on. Follow the format in the examples strictly.

{original_sentence}

{corrected_sentence}

Edits:

{edits}

Atomic edits:

Appendix D Prompts for explanation generation

We use the following prompts for generating grammar error explanations in German and Chinese.

D.1 Explanation prompt for German

In the prompt for German grammar error explanation, we provide a wide range of error examples, including errors that can only be explained in grammatical terms (e.g., gender/case/number agreement), errors that can be assigned a meaning (e.g., accusative case for time expressions), and errors that are related to collocations (e.g., am Ende instead of im Ende).

⬇

You are given a pair of German sentences and a list of atomic edits. An edit is an error in the first sentence, which is corrected in the second one. Generate a succinct explanation for each error using the template. After each explanation, give the error a type.

Template: The word X is deleted/inserted/replaced by Y/relocated because ...

Example:

Ich habe zwei Bananen für mein Katz gekauft.

Ich habe zwei Bananen für meine Katze gekauft.

Edits:

["replace", "Katz", "Katze"]

["replace", "mein", "meine"]

Explanation:

The word ’Katz’ is replaced by ’Katze’ because ’Katze’ is the correct spelling.

Error type: spelling

The word ’mein’ is replaced by ’meine’ because it should agree with the gender and case of the word Katze, which is feminine and accusative.

Error type: gender and case

Er fliegt nächster Monat Deutschland.

Er fliegt nächsten Monat nach Deutschland.

Edits:

["insert", "", "nach"]

["replace", "nächster", "nächsten"]

Explanation:

The word ’nach’ is inserted because the verb ’fliegen’ requires a preposition when expressing a destination and ’nach’ is usually used for countries.

Error type: preposition

The word ’nächster’ is replaced by ’nächsten’ because German uses accusative case for time expressions.

Error type: case

Ich gehe in der Schule.

Ich gehe in die Schule.

Edits:

["replace", "der", "die"]

Explanation:

The word ’der’ is replaced by ’die’ because the preposition ’in’ requires the accusative case of a noun when expressing a direction or destination.

Error type: case

Ich kann heute jogge gehe.

Ich kann heute joggen gehen.

Edits:

["replace", "gehe", "gehen"]

["replace", "jogge", "joggen"]

Explanation:

The word "gehe" is replaced by "gehen" because the verb "kann" requires an infinitive form of the verb "gehen".

Error type: infinitive

The word "jogge" is replaced by "joggen" because the verb "gehen" requires an infinitive form of the verb "joggen".

Error type: infinitive

Ich muss mich zur neuen Umgebung gewöhnen.

Ich muss mich an die neue Umgebung gewöhnen.

Edits:

["replace", "zur", "an"]

["insert", "", "die"]

["replace", "neuen", "neue"]

Explanation:

The word "zur" is replaced by "an" because the verb "gewöhnen" requires the preposition "an".

Error type: preposition

The word "die" is inserted because the noun "Umgebung" requires a determiner and "gewöhnen an" requires accusative case.

Error type: determiner

The word "neuen" is replaced by "neue" because the existence of "die" indicates that the adjective need only weak inflection.

Error type: adjective inflection

Es ist im Ende des Flusses.

Es ist am Ende des Flusses.

Edits:

["replace", "im", "am"]

Explanation:

The word "im" is replaced by "am" because "am" is the correct preposition for the word "Ende".

Below is the sentence pair for you to work on. Focus on the given edit and do not add other atomic edits. Start with the explanation directly.

{src}

{trg}

Edits:

{edit}

Explanation:

D.2 Explanation generation prompt for Chinese

In the few-shot prompt for Chinese GEE, we cover the following types of errors, which are commonly seen when we manually annotate the training data for fine-tuning: Function word errors, such as 了, 们, 的/地/得, and measure words; Mis-written words/phrases,¹⁹¹⁹19We call them as mis-written words instead of misspelling because there is no letters or spelling in Chinese writing. Such mistakes can be made by a language user who confuses characters with the same/similar pronunciation, with similar meaning, with similar strokes, or simply remembers the wrong character order in a word. such as 平果 vs. 苹果 and 市菜场 vs. 菜市场; Word collocation errors, such as 做错误 vs. 犯错误; Word choice errors, such as 查找 vs. 寻找.

Mandarin Chinese does not have abundant agreement between words in sentences as German or English. Many errors made by learners are word choice errors. For example, 查找 and 寻找 both have the core meaning of looking for but the former emphasizes a systematic and methodological search for data or information while the latter suggests a more intangible search with a sense of exploration. In the example of the word choice error, we show GPT-4 that it should explain the meaning of the two words and why one is better than the other in the context. Without such an example, GPT-4 returns a generic explanation “The word X is replace by Y because Y is the correct word to use in the context." which is not helpful for language learners.

Here begins the prompt:

You are given a pair of Mandarin Chinese sentences and a list atomic edits. An edit is an error in the first sentence, which is corrected in the second one. Generate a succinct explanation for each error using the template. After each explanation, give the error a type.

Template: The word X is replaced by Y/deleted/inserted/relocated because …

Example:

昨天我买四只平果们。

昨天我买了四个苹果。

Edits:

["insert", "", "了"]

["replace", "只", "个"]

["replace", "平果", "苹果"]

["delete", "们", ""]

Explanation:

The word ‘了’ is inserted because ‘了’ indicate the completion of the action ‘买’.

Error type: usage of ‘了’

The word ‘只’ is replaced with ‘个’ because ‘个’ is the correct measure word for ‘苹果’.

Error type: measure word

The word ‘平果’ is replaced with ‘苹果’ because ‘苹果’ is the correct word for ‘apple’.

Error type: miswritten character/word

The word ‘们’ is deleted because ‘们’ is only used after pronouns or human nouns to indicate plurality.

Error type: ‘们’

间而说之，他唱地很好。

简而言之，他唱得很好。

Edits:

["replace", "间而说之", "简而言之"]

["replace", "地", "得"]

Explanation:

The word ‘间而说之’ is replaced with ‘简而言之’ because ‘简而言之’ is the correct way of writing the phrase which means ‘in short’ or ‘in brief’.

Error type: mis-written character/word

The word ‘地’ is replaced with ‘得’ because ‘得’ is the correct ‘de’ particle to use when it follows a verb and the word after ‘得’ modifies the verb.

Error type: "de" particles

许多人们做了一差误。

许多人犯了一个错误。

Edits:

["replace", "许多人们", "许多人"]

["replace", "做", "犯"]

["insert", "", "个"]

["replace", "差误", "错误"]

Explanation:

The word ‘许多人们’ is replaced with ‘许多人’ because when a noun is preceded by a numeral, the plural marker ‘们’ is not needed.

Error type: ‘们’

The word ‘做’ is replaced with ‘犯’ because ‘犯’ is the correct verb to use for the noun ‘mistake’.

Error type: verb-object collocation

The word ‘个’ is inserted because a measure word is needed between the numeral and the noun and ‘个’ is the correct measure word for ‘错误’.

Error type: measure word

The word ‘差误’ is replaced with ‘错误’ because ‘差误’ is not a word in Chinese and ‘错误’ is the correct word for ‘mistake’.

Error type: mis-written character/word

我在查找我的知音。

我在寻找我的知音。

Edits:

["replace", "查找", "寻找"]

Explanation:

The word ‘查找’ is replaced with ‘寻找’ because ‘查找’ suggests a systematic and methodological search. It usually means searching for information or data. On the other hand, ‘寻找’ suggests a more intangible search with a sense of exploration. ‘寻找’ fits the context better.

Error type: word choice

Below is the sentence pair for you to work on. Focus on the given edit and do not add other atomic edits. Start with the explanation directly.

{src}

{trg}

Edits:

{edit}

Explanation:

Appendix E Data preprocess for German and Chinese

This section describes how the datasets in German and Chinese are preprocessed.

E.1 Preprocess German data

The Falko dataset (Ludeling et al., 2008; Reznicek et al., 2010) contains essays written by German learners whose proficiency levels range from A1 to C1 according to the Common European Framework of Reference for Languages (CEFR).²⁰²⁰20The Common European Framework of Reference for Language (CEFR) is a standard for describing language ability. There are six levels: A1, A2, B1, B2, C1, and C2. C2 is the native speaker level. The Merlin dataset (Boyd et al., 2014) is a collection of essays written by advanced German speakers from different countries with both native and non-native background. We use Merlin as C2 data.

Both Falko and Merlin offer two types of grammar error corrections, target hypothesis 1 and target hypothesis 2. Target hypothesis 1 performs minimal correction at the morpho-syntactic level while target hypothesis 2 modifies semantic and pragmatic aspects (e.g., information structure or word choice) of the input text, aiming for a more advanced paraphrase-type correction. For our purpose, we use target hypothesis 1 of each corrected sentence.²¹²¹21Examples of the target hypothesis 1 and 2 of a corrected sentence can be found in https://gucorpling.org/amir/pdf/Reznicek_et_al.pdf.

To prepare the datasets, we first split the paragraphs in Falko and Merlin into sentences by adapting the paragraph alignment algorithm in Thai et al. (2022) for sentence alignment. We then screened out sentence pairs that: (1) have short sentences (less that 3 tokens); (2) contain “incomp” or “unreadable” tokens; and (3) have two sentences in the source and one sentence in the target, or vice versa.

E.2 Preprocess Chinese data

The data for Chinese GEE is the training split of CGED2017 Rao et al. (2020). Texts are split into sentences at the end of sentence punctuation (e.g., periods and question marks) and aligned.

We tokenized the sentence pairs using Jieba and show the length distribution of sentences in Figure 3. Clearly, most of the data points have $2$ to $50$ tokens. Each token has on average 1.8 characters. The overly long sentences (over 170 tokens) exist because of the abusive use of commas.²²²²22As a rough reference, Chinese Treebank 9.0 (Xia, 2000) has 132076 sentences and 2084387 tokens, which amounts to roughly 16 tokens per sentence. For the experiment, we select sentences of length between 5 and 50 tokens. We also remove pairs with the same source and target.

Appendix F Fine-tune atomic edit extraction models

For German we use Llama2-7B and GPT-3.5-turbo as the base models and fine-tune them on the 500 training data points in Table 2. The results show that fine-tuning GPT-3.5-turbo through the OpenAI fine-tuning API with 2 epochs and using temperature $=0$ at the inference time returns the best performance. It took around 30 mins for fine-tuning. For Llama2-7B, we fine-tune the model with QLoRA for 1000 steps using the parameters suggested in Dettmers et al. (2023) on one RTX8000. The fine-tuning takes about five hours. Checkpoints are saved every 250 steps. At the inference time, the checkpoint saved at 750 steps with temperature $=0.01$ performs the best.²³²³23The do_sample parameter is set to False. The temperature is set to $0.01$ instead of $0$ because the model requires the temperature to strictly be a positive float. The best performance are reported in Table 3.

For Chinese, we fine-tune Llama2-7B and GPT-3.5-turbo in the same way as for German. Llama2-7B checkpoints are saved every $100$ steps. It achieves the best performance at $400$ steps. Fine-tuning GPT-3.5-turbo for two epochs returns a better performance than one epoch. The best performance of the fine-tuned models are reported in Table 4.

Appendix G Details on human evaluation

We provide further details in addition to the ones discussed in Section 6.2.1. Figure 4 shows the annotation interface for the German and Chinese teachers. The teachers are given detailed instructions for the German (link) and Chinese (link) tasks.

In the annotation task, the teachers are asked to check for four types of mistakes. Concerning missing error mistakes, they should be marked either in the source sentence for deleted, replaced, and relocated tokens or in the target sentence for inserted ones. Other mistakes should be marked in the explanations. We asked the annotators not to mark imprecise explanation/error type as wrong but leave a comment on how they can be improved.

G.1 German annotator agreement

To evaluate the agreement, we compare the annotations of the 96 pairs and classify them into three categories. Fully agree: if the teachers agree on no mistakes or the same set of mistakes. Disagree on missing errors: if teachers agree on other mistakes but not on missing errors. Disagree on other mistakes: if teachers also disagree on mistakes other than missing errors. Counts of each category are reported in Figure 7.

Among the 96 commonly annotated items, the German teachers agree on $81.3\%$ of them for the overall quality (error coverage and explanation quality), and $89.6\%$ of the time, the teachers agree on the quality of the generated edit reasons (sum of the first and second row in Table 7).

	Count	Percentage
Fully agree	78	81.3%
Disagree on missing errors	8	8.3%
Disagree on other mistakes	10	10.4%
Sum	96	100%

Table 7: Agreement between two German teachers on 96 sentence pairs. Among the 78 annotated items on which the teachers fully agree with each other, 5 have mistakes and 73 have no mistakes at all.

Appendix H Qualitative analysis of German GEE

In this section, we look into the mistakes made by GPT-4 and provide detailed analysis of two of them: wrong error type and wrong error explanation.

H.1 Mistakes in wrong error type

Although there are only 12 wrong error type mistakes marked by the German teachers, they present cases where careful design decisions need to be made. We categorize them into six types and discuss two of them here. Examples and their categories are in Table 8.

Case vs. Plural The explanations and error types in the two cases indicate that, given the prompt we used, GPT-4 is weak at distinguishing certain nuances in German grammar because it does not leverage the larger context while generating explanations and error types.

In German, the suffix -n may occur in two cases (among others): in the plural form of certain nouns or at the end of the dative plural form of a noun if the noun’s plural form does not already end in -n. In the first case with Hauspreise, the language user used die as the definite article of Hauspreise, which shows that they did not consider the case of the determiner phrase as dative. Moreover, they used Hauspreise as part of the subject of the sentence, which further reduces the likelihood that they meant to use Hauspreise in its dative case because it is very rare to have a dative determiner phrase as a subject. Hence, the error type should be plural or number. In the second case with Menschen, it is clearly not a plural error because jede/r takes singular nouns and bei only takes dative nouns. The error type should be case because the word Mensch belongs to the n-declination which takes the -(e)n suffix in the dative case. Further work should add examples in the prompt for training data to enhance the model ability in distinguishing such nuances.

Misspelling vs. Conjugation While GPT-4 judges the errors under this type in Table 8 as conjugation errors, our German teachers judged them as misspelling. These three cases beg for an answer to the question: where is the border line between general misspelling due to an oversight and genuinely lack of knowledge of a grammar point (e.g., misspelling vs. conjugation)? While we do not have an answer to the question, we suggest that error types should always be the more specific one when an error is on the border line. For a language learner, if an error is made by oversight, they can easily ignore the explanation and error type. If an error is made by lacking of relevant knowledge, they should be reminded by an explanation. Since we do not know why a language learner made such an error, providing the more specific error type is more beneficial.

H.2 Mistakes in wrong error explanation

There are 29 explanations that provide a wrong reason of an error. They can be classified into two groups. The first group has mistakes that can be traced back to a wrongly extracted edit, as shown in the first example in Table 9. Eleven cases belong to this group.

The second group has mistakes for miscellaneous reasons. However, there are two reasons that stand out. The first reason is that GPT-4 does not consider information from the bigger context when generating explanations. There are 3 such cases and all of them involve a preposition. One example can be found in Table 8 under Case vs. Plural. Table 9 presents another one. In this example, the word Zimmer should be in dative not because German needs a dative case to indicate numbers but because the preposition mit assigns the noun in the preposition phrase a dative case. The second reason that causes GPT-4 to generate four wrong explanations is that it does not have precise knowledge of German verb position. As in the third example in Table 9, the word entwickelt is relocated not because of the reason in the explanation but because a finite verb in an embedded clause should be at the end of it (see Footnote 17).

H.3 Overall quality of German GEE

In the annotation task, the German teachers were told not to mark correct but imprecise explanations/error types as wrong and leave a comment on how they can be improved. In the annotated results, we see only one such comment. That does not mean that the teachers did not leave enough comments because there are abundant comments pointing out errors in the source sentences that are not corrected in the target sentences, comments pointing out that some corrections done in the target sentences are not correct, or comments on how to modify a wrong error explanation. The first author, as a German as second language learner with level C1, has also gone through all the annotated data and found the correct explanations informative and useful. Hence, we can say that the German error explanations generated by GPT-4 are judged as fully correct by our German teachers for $93.9\%$ of the time.

Overly general

Source: Wie es deine Frau und deine Babys?
Target: Wie geht es deiner Frau und deinem Baby?
The word ‘’ is replaced by ‘geht’ because the correct expression in German is "Wie geht es" to ask how someone is doing.
Error type: expression

Source: Brauchst du etwas hilfe?
Target: Brauchst du Hilfe?
The phrase ‘etwas hilfe’ is replaced by ‘Hilfe’ because ‘etwas’ is unnecessary when offering help in this context.
Error type: redundancy

Source: Meine Eltren und bin ich in montag fahren in Indien.
Target: Meine Eltern und ich fahren am Montag nach Indien.
The phrase ‘in montag’ is deleted and replaced by ‘am Montag’ because ‘am’ is the correct preposition for expressing a specific day.
Error type: preposition and expression

Misspelling vs. Conjugation

Source: Ich mochte Farkarte haben.
Target: Ich möchte eine Fahrkarte haben.
The word ‘mochte’ is replaced by ‘möchte’ because ‘möchte’ is the correct conjugation of the verb “mögen” in this context.
Error type: conjugation

Source: Wo möchst du schwimmen gehen?
Target: Wo möchtest du schwimmen gehen?
The word ‘möchst’ is replaced by ‘möchtest’ because ‘möchtest’ is the correct conjugation of the verb ‘möchten’ in the second person singular.
Error type: verb conjugation

Source: Man ließt es geradezu täglich in der Zeitung, hört es im Radio, oder sieht erschreckende Bilder im Fernsehn.
Target: Man liest es geradezu täglich in der Zeitung, hört es im Radio, oder sieht erschreckende Bilder im Fernsehen.
The word ‘ließt’ is replaced by ‘liest’ because ‘liest’ is the correct conjugation of the verb ‘lesen’ for the third person singular.
Error type: conjugation

Case vs. Plural -(e)n

Source: In den letzten zehn Jahren erhöhten sich die Wohnungs- und Hauspreisen und fast alle müssen einen Kredit aufnehmen, ihre Träume erföllen zu können.
Target: In den letzten zehn Jahren erhöhten sich die Wohnungs - und Hauspreise und fast alle müssen einen Kredit aufnehmen, um ihre Träume erföllen zu können.
The word ‘Hauspreisen’ is replaced by ‘Hauspreise’ because it should be in the nominative case, which is plural without the ‘-n’ ending
Error type: case

Source: Anderseits liegt die Integration jeder Mensch an sie selbst.
Target: Anderseits liegt die Integration bei jedem Menschen an ihm selbst.
The word ’Mensch’ is replaced by ‘Menschen’ because the plural form is needed to agree with the pronoun ‘jedem’.
Error type: plural

Because of wrong atomic edits

Source: Obwohl in Stadt X schwerig ist, eine freie Wohnung zu bekommen, gibt es trotzdem viele Möglichkeiten wie man es sie bekommt.
Target: Obwohl es in Stadt X schwerig ist, eine freie Wohnung zu bekommen, gibt es trotzdem viele Möglichkeiten wie man sie bekommt.
The word ‘es’ is relocated because it should be placed before ‘in Stadt X’ to form the correct sentence structure for expressing a circumstance.
Error type: word order

Source: Bis 30.04 ich muss in neue Wohnung umladen, weil am 01.05 meine neue Arbeit beginn.
Target: Bis 30.04. muss ich in die neue Wohnung umladen, weil am 01.05. meine neue Arbeit beginnt.

The word ‘ich’ is inserted because it is the pronoun that corresponds to the subject performing the action of ‘müssen’.
Error type: subject pronoun

Wrong word category

Source: und werde direk zu dir für das Abholen, wäre es dir passen wenn wir den Abends zusammen Essen?
Target: Und werde direkt zu dir kommen für das Abholen, würde es dir passen, wenn wir den Abend zusammen essen?
The word ‘wäre’ is replaced by ‘würde’ because ‘würde’ is used in conditional sentences to express politeness.
Error type: modal verb

Not understanding German phonology

Source: Hast du Seit für mich zum eine Wohnung finden?
Target: Hast du Zeit, für mich eine Wohnung zu finden?
The word ‘Seit’ is replaced by ‘Zeit’ because ‘Zeit’ is the correct word for time in this context.
Error type: homophone

Table 8: Twelve cases in which error types are judged as wrong by the German teachers. For each sentence pair, only the explanations with a wrong error type are listed. All grammar errors in the target sentences are from the corpora.

Mistakes that can be traced back to a wrongly extracted edit
Source: Dir möchte ich nächten monat nach Amsterdam fahren. Target: Ich möchte nächsten Monat nach Amsterdam fahren. Atomic edits: ["replace", "Dir", "Ich"], ["replace", "ich", "nächsten"], ["replace", "nächten", "Monat"] Explanation: The word ‘Dir’ is replaced by ‘Ich’ because the subject of the sentence should be ‘Ich’ to express one’s own desire to travel. Error type: pronoun The word ‘ich’ is replaced by ‘nächsten’ because the correct word order is ‘nächsten Monat’ for ‘next month’. Error type: word order The word ‘nächten’ is replaced by ‘Monat’ because ‘nächsten’ should be followed by the noun ‘Monat’ to form the time expression ‘nächsten Monat’. Error type: word order
Mistakes due to insufficient consideration of bigger context
Source: Ich brauchen ca. 75 m² Wohnung mit 2 Zimmer, Bad, Kuche, toilette und garage. Target: Ich brauche ca. 75 m² — eine Wohnung mit 2 Zimmern, Bad, Küche, Toilette und Garage. Explanation: The word ‘Zimmer’ is replaced by ‘Zimmern’ because in German, the dative case is needed for indicating the number of rooms in this context. Error type: case
Mistakes due to imprecise knowledge of verb position in German
Source: in dem Sinn, möchte ich dazu sagen, dass integrationsprozes entwickelt sich schneller, wenn man das Gastland nicht als „fremd“ empfindet. Target: In dem Sinn möchte ich dazu sagen, dass ein Integrationsprozess sich schneller entwickelt, wenn man das Gastland nicht als fremd" empfindet. Explanation: The word ‘entwickelt’ is relocated because the verb should be positioned after the subject in a German sentence. Error type: word order
Mistakes due to other reasons
Source: Schauen Sie zuerst ihre Auswahl, sprechen Sie direkt mit dem Verkäufer. Target: Schauen Sie zuerst ihre Auswahl an, sprechen Sie direkt mit dem Verkäufer. Atomic edit: ["insert", "", "an"] Explanation: The word ‘an’ is inserted because the verb ‘schauen’ requires the preposition ‘an’ when it means ’to look at’ something. Error type: preposition

Table 9: Examples of wrong error explanations due to different reasons. For each sentence pair, only the explanations with a wrong error explanation are listed. All grammar errors in the target sentences are from the corpora.

Appendix I Error types generated by GPT-4

Table 10 and Table 11 list the frequent error types generated by GPT-4 in the German and Chinese GEE task.

Error Type	Count	Percent	Error Type	Count	Percent
punctuation	520	16.48	abbreviation	8	0.25
spelling	470	14.89	compound noun	8	0.25
capitalization	353	11.19	noun form	7	0.22
gender and case	175	5.54	extra word	6	0.19
preposition	163	5.16	syntax	6	0.19
word order	157	4.97	adjective	6	0.19
case	119	3.77	adverb	6	0.19
determiner	100	3.17	word form	6	0.19
adjective inflection	71	2.25	verb tense	6	0.19
verb conjugation	62	1.96	noun	5	0.16
conjunction	59	1.87	spelling and capitalization	5	0.16
pronoun	39	1.24	tense	5	0.16
conjugation	33	1.05	comparative	5	0.16
verb form	30	0.95	formatting	5	0.16
word choice	30	0.95	word formation	5	0.16
redundancy	30	0.95	possessive pronoun	4	0.13
plural	29	0.92	preposition and case	4	0.13
infinitive	29	0.92	time expression	4	0.13
unnecessary word	26	0.82	possessive	4	0.13
vocabulary	26	0.82	auxiliary verb	4	0.13
subject-verb agreement	25	0.79	demonstrative pronoun	4	0.13
article	22	0.70	idiomatic expression	4	0.13
verb	20	0.63	missing subject	4	0.13
adjective agreement	20	0.63	past participle	4	0.13
reflexive pronoun	19	0.60	spacing	4	0.13
gender	16	0.51	separable verb	4	0.13
expression	13	0.41	negation	4	0.13
subject	13	0.41	modal verb	4	0.13
compound word	12	0.38	terminology	4	0.13
missing word	11	0.35	relative pronoun	4	0.13
adjective form	11	0.35	singular/plural	4	0.13
plural form	11	0.35	gender agreement	4	0.13
subject omission	10	0.32	compound verb	4	0.13
verb choice	10	0.32	verb agreement	4	0.13
missing verb	8	0.25	spelling and inflection	4	0.13
translation	8	0.25	compound separation	4	0.13

Table 10: A distribution over error types in German grammatical error explanations (3156 total points, types with 4 or more datapoints considered). Overall, we observe a wide variety of error types.

Error Type	Count	Percent	Error Type	Count	Percent
word choice	588	39.65	extraneous word	7	0.47
redundancy	120	8.09	unnecessary ‘的’	7	0.47
word order	101	6.81	preposition usage	7	0.47
missing word	55	3.71	subject omission	6	0.40
miswritten character/word	52	3.51	‘们’	5	0.34
usage of ‘了’	44	2.97	missing particle	5	0.34
"de" particles	31	2.09	redundant character	5	0.34
preposition	24	1.62	redundant ‘的’	5	0.34
redundant word	22	1.48	emphasis	5	0.34
conjunction	21	1.42	particle usage	4	0.27
omission	20	1.35	redundant phrase	4	0.27
verb-object collocation	19	1.28	auxiliary verb	4	0.27
word omission	18	1.21	modal verb	4	0.27
unnecessary word	17	1.15	missing verb	4	0.27
sentence structure	15	1.01	unnecessary particle	4	0.27
usage of ‘的’	14	0.94	conjunction/connective	3	0.20
extra word	11	0.74	missing words	3	0.20
grammar	9	0.61	idiomatic expression	3	0.20
missing information	9	0.61	aspect particle	3	0.20
conjunction usage	8	0.54	unnecessary character	3	0.20
missing subject	8	0.54	adverb usage	3	0.20
measure word	8	0.54	expression	3	0.20
negation	8	0.54	unnecessary use of ‘的’	3	0.20

Table 11: A distribution over error types in Chinese grammatical error explanations (1483 total points, types with 3 or more datapoints considered). Overall, we observe a wide variety of error types.