
Comparison of Grammatical Error Correction Using
Back-Translation Models

Aomi Koyama    Kengo Hotate    Masahiro Kaneko    Mamoru Komachi
Tokyo Metropolitan University
[email protected], [email protected]
[email protected], [email protected]
  Current affiliation: Recruit Co., Ltd.  Current affiliation: Tokyo Institute of Technology
Abstract

Grammatical error correction (GEC) suffers from a lack of sufficient parallel data. Therefore, GEC studies have developed various methods to generate pseudo data, which comprise pairs of grammatical and artificially produced ungrammatical sentences. Currently, a mainstream approach for generating pseudo data is back-translation (BT). Most previous GEC studies using BT have employed the same architecture for both the GEC and BT models. However, GEC models have different correction tendencies depending on their architectures. Thus, in this study, we compare the correction tendencies of GEC models trained on pseudo data generated by three BT models with different architectures, namely, Transformer, CNN, and LSTM. The results confirm that the correction tendencies for each error type differ for every BT model. Additionally, we examine the correction tendencies when using a combination of pseudo data generated by different BT models. As a result, we find that the combination of different BT models improves or interpolates the $\mathrm{F_{0.5}}$ scores of each error type compared with those of single BT models with different seeds.

1 Introduction

Grammatical error correction (GEC) aims to automatically correct errors in text written by language learners. It is generally treated as translation from ungrammatical sentences to grammatical sentences, and GEC studies use machine translation (MT) models as GEC models. After Yuan and Briscoe (2016) applied an encoder–decoder (EncDec) model (Sutskever et al., 2014; Bahdanau et al., 2015) to GEC, various EncDec-based GEC models have been proposed (Ji et al., 2017; Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018; Zhao et al., 2019; Kaneko et al., 2020).

GEC models exhibit different correction tendencies depending on their architecture. For example, a GEC model based on CNN (Gehring et al., 2017) tends to correct errors effectively using the local context (Chollampatt and Ng, 2018). Furthermore, some studies have combined multiple GEC models to exploit these differences in correction tendencies, thereby improving performance (Grundkiewicz and Junczys-Dowmunt, 2018; Kantor et al., 2019).

Despite their success, EncDec-based models require considerable amounts of parallel data for training (Koehn and Knowles, 2017), and GEC suffers from a lack of sufficient parallel data. Accordingly, GEC studies have developed various pseudo data generation methods (Xie et al., 2018; Ge et al., 2018a; Zhao et al., 2019; Lichtarge et al., 2019; Xu et al., 2019; Choe et al., 2019; Qiu et al., 2019; Grundkiewicz et al., 2019; Kiyono et al., 2019; Grundkiewicz and Junczys-Dowmunt, 2019; Wang et al., 2020; Takahashi et al., 2020; Wang and Zheng, 2020; Zhou et al., 2020a; Wan et al., 2020). Moreover, Wan et al. (2020) showed that the correction tendencies of the GEC model differ between (1) a method that generates pseudo data by adding noise to latent representations and (2) a rule-based pseudo data generation method, and they improved the GEC model by combining the pseudo data generated by these two methods. Therefore, combining pseudo data generated by multiple methods with different tendencies allows us to improve the GEC model further.

One of the most common methods to generate pseudo data is back-translation (BT) (Sennrich et al., 2016a). In BT, we train a BT model (i.e., the reverse of the GEC model) that generates an ungrammatical sentence from a given grammatical sentence. Subsequently, a grammatical sentence is provided as input to the BT model, which generates a sentence containing pseudo errors. Finally, the pairs of erroneous output sentences and their clean input sentences are used as pseudo data to train a GEC model.
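
To make the pipeline concrete, the following is a minimal sketch of pseudo data generation via BT; bt_model and its generate method are hypothetical stand-ins for a trained reverse model.

    # Minimal sketch of pseudo data generation via back-translation (BT).
    # `bt_model` is a hypothetical stand-in for a trained reverse model
    # (grammatical -> ungrammatical) exposing a `generate` method.
    def generate_pseudo_data(bt_model, grammatical_sentences):
        """Return (source, target) pairs for training a GEC model."""
        pseudo_pairs = []
        for sentence in grammatical_sentences:
            # The BT model injects artificial errors into the clean sentence.
            noisy = bt_model.generate(sentence)
            # The noisy output becomes the source; the clean input, the target.
            pseudo_pairs.append((noisy, sentence))
        return pseudo_pairs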

Kiyono et al. (2019) reported that a GEC model using BT achieved the best performance among several pseudo data generation methods. However, most previous GEC studies using BT have used a BT model with the same architecture as the GEC model (Xie et al., 2018; Ge et al., 2018a, b; Zhang et al., 2019; Kiyono et al., 2019, 2020). Thus, it is unclear whether the correction tendencies differ when using BT models with different architectures.

We investigated the correction tendencies of the GEC model using pseudo data generated by different BT models. Specifically, we used three BT models: Transformer (Vaswani et al., 2017), CNN (Gehring et al., 2017), and LSTM (Luong et al., 2015). The results showed that the correction tendencies for each error type differ for each BT model. In addition, we examined the correction tendencies of the GEC model when using a combination of pseudo data generated by different BT models. As a result, we found that the combination of different BT models improves or interpolates the $\mathrm{F_{0.5}}$ scores of each error type compared with those of single BT models with different seeds.

The main contributions of this study are as follows:

  • We confirmed that the correction tendencies of the GEC model differ for each BT model.

  • We found that the combination of different BT models improves or interpolates the $\mathrm{F_{0.5}}$ scores compared with those of single BT models with different seeds.

2 Related Work

2.1 Back-Translation in Grammatical Error Correction

Sennrich et al. (2016a) showed that BT can effectively improve neural machine translation, and many MT studies have since focused on BT (Poncelas et al., 2018; Fadaee and Monz, 2018; Edunov et al., 2018; Graça et al., 2019; Caswell et al., 2019; Edunov et al., 2020; Soto et al., 2020; Dou et al., 2020). Subsequently, BT was applied to GEC. For example, Xie et al. (2018) proposed noising beam search methods, and Ge et al. (2018a) proposed back-boost learning. Moreover, Rei et al. (2017) and Kasewa et al. (2018) applied BT to the grammatical error detection task.

Kiyono et al. (2019) compared pseudo data generation methods, including BT. They reported that (1) the GEC model using BT achieved the best performance and (2) using pseudo data for pre-training improves the GEC model more effectively than using a combination of pseudo data and genuine parallel data. This is because the amount of pseudo data is much larger than that of genuine parallel data. This usage of pseudo data in GEC contrasts with the usage of a combination of pseudo data and genuine parallel data in MT (Sennrich et al., 2016a; Edunov et al., 2018; Caswell et al., 2019).

Htut and Tetreault (2019) compared four GEC models—Transformer, CNN, PRPN (Shen et al., 2018), and ON-LSTM (Shen et al., 2019)—using pseudo data generated by different BT models. Specifically, they used Transformer and CNN as BT models. It was reported that the Transformer using pseudo data generated by CNN achieved the best $\mathrm{F_{0.5}}$ score. However, the correction tendencies for each BT model were not reported. Moreover, although using pseudo data for pre-training is common in GEC (Zhao et al., 2019; Lichtarge et al., 2019; Grundkiewicz et al., 2019; Zhou et al., 2020a; Hotate et al., 2020), they used a less common method of utilizing pseudo data for re-training after training with genuine parallel data. Therefore, we used Transformer as the GEC model and investigated correction tendencies when using Transformer, CNN, and LSTM as BT models. Further, we used pseudo data to pre-train the GEC model.

2.2 Correction Tendencies When Using Each Pseudo Data Generation Method

White and Rozovskaya (2020) conducted a comparative study of two rule/probability-based pseudo data generation methods. The first method (Grundkiewicz et al., 2019) generates pseudo data using a confusion set based on a spell checker. The second method (Choe et al., 2019) generates pseudo data using human edits extracted from annotated GEC corpora or by replacing prepositions/nouns/verbs according to predefined rules. They reported that the former performs better on spelling errors, whereas the latter performs better on noun number and tense errors. In addition, Lichtarge et al. (2019) compared pseudo data extracted from Wikipedia edit histories with pseudo data generated by round-trip translation. They reported that the former enables better performance on morphology and orthography errors, whereas the latter enables better performance on preposition and pronoun errors. Similarly, we report the correction tendencies of the GEC model when using pseudo data generated by three BT models with different architectures.

Some studies have used a combination of pseudo data generated by different methods to train the GEC model (Lichtarge et al., 2019; Zhou et al., 2020a, b; Wan et al., 2020). For example, Zhou et al. (2020a) proposed a pseudo data generation method that pairs sentences translated by statistical machine translation and neural machine translation; they then combined the resulting pseudo data with pseudo data generated by BT to pre-train the GEC model. However, they did not report the correction tendencies of the GEC model when using the combined pseudo data. In contrast, we report the correction tendencies when using a combination of pseudo data generated by different BT models.

Dataset      Sentences  References  Split
BEA-train      564,684           1  train
BEA-valid        4,384           1  valid
CoNLL-2014       1,312           2  test
JFLEG              747           4  test
BEA-test         4,477           5  test
Wikipedia    9,000,000           -  -
Table 1: Datasets used in the experiments.
                                         CoNLL-2014                       JFLEG      BEA-test
Back-translation model     Pseudo data  Prec.      Rec.       F0.5       GLEU       Prec.      Rec.       F0.5
None (Baseline)            -            58.5/65.8  31.3/31.5  49.8/54.0  53.0/53.7  52.6/61.4  42.8/42.8  50.2/56.5
Transformer                9M           65.0/68.6  37.6/37.7  56.7/59.0  57.7/58.3  61.1/66.5  49.8/50.7  58.4/62.6
CNN                        9M           64.0/68.1  37.4/37.4  56.0/58.5  57.8/58.4  61.9/67.5  50.7/51.0  59.3/63.4
LSTM                       9M           64.7/68.8  36.2/36.4  55.9/58.4  57.0/57.4  61.3/67.1  49.5/49.9  58.5/62.8
Transformer & CNN          18M          65.2/69.1  38.7/39.1  57.3/59.9  57.9/58.5  63.1/67.6  51.1/51.1  60.2/63.5
Transformer & Transformer  18M          65.5/68.3  37.9/38.0  57.2/58.9  57.5/58.0  63.0/67.0  51.0/50.7  60.2/63.0
CNN & CNN                  18M          65.6/69.1  38.2/38.7  57.3/59.8  57.9/58.6  61.9/67.1  51.4/51.6  59.5/63.3
Table 2: Results of each GEC model. The left and right scores represent single and ensemble model results, respectively. The top group shows the performance of the GEC model using each single BT model, and the bottom group shows the performance when using combined pseudo data.

3 Experimental Setup

3.1 Dataset

Table 1 shows the details of the datasets used in the experiments. We used the BEA-2019 workshop official shared task dataset (Bryant et al., 2019) as the training and validation data. This dataset consists of FCE (Yannakoudakis et al., 2011), Lang-8 (Mizumoto et al., 2011; Tajiri et al., 2012), NUCLE (Dahlmeier et al., 2013), and W&I+LOCNESS (Granger, 1998; Yannakoudakis et al., 2018). Following Chollampatt and Ng (2018), we removed sentence pairs with identical source and target sentences from the training data. Next, we applied byte pair encoding (Sennrich et al., 2016b) to both the source and target sentences; we acquired the subwords from the target sentences in the training data and set the vocabulary size to 8,000. Hereinafter, we refer to the training and validation data as BEA-train and BEA-valid, respectively.
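
As a concrete illustration, the following sketch performs this preprocessing step with the subword-nmt package; the file paths are hypothetical, and the codes are learned from the target side of the training data with a vocabulary size of 8,000 as described above.

    # Sketch of the BPE step, assuming the subword-nmt package
    # (https://github.com/rsennrich/subword-nmt); file paths are hypothetical.
    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # Learn merge operations from the target sentences of the training data.
    with open("bea-train.tgt") as infile, open("bpe.codes", "w") as outfile:
        learn_bpe(infile, outfile, num_symbols=8000)

    # Apply the learned codes to both source and target sentences.
    with open("bpe.codes") as codes:
        bpe = BPE(codes)
    print(bpe.process_line("This is an exmaple sentence ."))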

We used Wikipedia (the 2020-07-06 dump file at https://dumps.wikimedia.org/other/cirrussearch/) as a seed corpus to generate pseudo data and removed possibly inappropriate sentences, such as those containing URLs. In total, we randomly extracted 9M sentences.

3.2 Evaluation

We evaluated on the CoNLL-2014 test set (CoNLL-2014) (Ng et al., 2014), the JFLEG test set (JFLEG) (Heilman et al., 2014; Napoles et al., 2017), and the official test set of the BEA-2019 shared task (BEA-test). We reported $\mathrm{M^{2}}$ scores (Dahlmeier and Ng, 2012) for CoNLL-2014 and GLEU (Napoles et al., 2015, 2016) for JFLEG. We also reported the scores measured by ERRANT (Felice et al., 2016; Bryant et al., 2017) for BEA-valid and BEA-test. All the reported results, except for the ensemble model, are the average of three distinct trials using three different random seeds. To reduce the influence of the BT model's seed, we prepared BT models trained with the seed corresponding to each GEC model and pre-trained each GEC model using pseudo data generated by its corresponding BT model. For the ensemble model, we reported the ensemble results of the three GEC models.

3.3 Grammatical Error Correction Model

Following Kiyono et al. (2019), we adopted Transformer, a representative EncDec-based model, using the fairseq toolkit (Ott et al., 2019). We used the “Transformer (base)” settings of Vaswani et al. (2017), which comprise a 6-layer encoder and decoder with a dimensionality of 512 for both input and output, 2,048 for the inner layers, and 8 self-attention heads; considering the limitations of our computing resources, we used “Transformer (base)” instead of “Transformer (big)”. We pre-trained the GEC models on the 9M pseudo data generated by each BT model (see Section 3.4 for details of the BT models) and then fine-tuned them on BEA-train. We optimized the model with Adam (Kingma and Ba, 2015) during pre-training and with Adafactor (Shazeer and Stern, 2018) during fine-tuning. Most of the hyperparameter settings were the same as those described in Kiyono et al. (2019). Additionally, as a baseline model, we trained a GEC model on BEA-train only, without pre-training.
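
The following is a conceptual sketch of this two-stage schedule; the tiny linear model, random tensors, and loss are hypothetical stand-ins (the paper trains fairseq's Transformer (base) with label-smoothed cross-entropy and uses Adafactor, not Adam, for fine-tuning).

    # Conceptual sketch of pre-training on pseudo data, then fine-tuning on
    # BEA-train. All components here are toy stand-ins, not the actual setup.
    import torch
    import torch.nn as nn

    model = nn.Linear(16, 16)   # stand-in for the Transformer GEC model
    criterion = nn.MSELoss()    # stand-in for label-smoothed cross-entropy

    def run_stage(model, batches, optimizer):
        model.train()
        for src, tgt in batches:
            optimizer.zero_grad()
            loss = criterion(model(src), tgt)
            loss.backward()
            optimizer.step()

    pseudo = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(10)]
    bea_train = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(3)]

    # Stage 1: pre-training on pseudo data (the paper uses Adam).
    run_stage(model, pseudo,
              torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98)))
    # Stage 2: fine-tuning on genuine parallel data (the paper uses Adafactor;
    # Adam with a smaller learning rate stands in here).
    run_stage(model, bea_train, torch.optim.Adam(model.parameters(), lr=1e-4))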

We also investigated the correction tendencies when using a combination of pseudo data generated by different BT models. To this end, we pre-trained a GEC model on the combined pseudo data and then fine-tuned it on BEA-train. Notably, in this experiment, we combined pseudo data generated by the Transformer and CNN because they improved the GEC models more than LSTM in most cases (Section 4.1). Specifically, we obtained 9M pseudo data from each of the Transformer and CNN and then created 18M pseudo data by combining them. To eliminate the effect of simply increasing the amount of pseudo data, we also prepared GEC models that used a combination of pseudo data generated by single BT models with different seeds. We provided all BT models with the same target sentences to focus on the differences in the pseudo source sentences. Hence, in the combined pseudo data, the number of source sentence types increases, whereas the number of target sentence types does not.
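
A minimal sketch of this combination step, assuming the three sentence lists have already been generated (all names are hypothetical):

    # Sketch of combining the two 9M pseudo data sets. Both BT models were fed
    # the same target (grammatical) sentences, so the combined set increases
    # the variety of source sentences only; all names are hypothetical.
    def combine_pseudo_data(targets, transformer_sources, cnn_sources):
        """Pair each BT model's noisy sources with the shared clean targets."""
        assert len(transformer_sources) == len(targets) == len(cnn_sources)
        combined = list(zip(transformer_sources, targets))  # 9M pairs
        combined += list(zip(cnn_sources, targets))         # +9M pairs = 18M
        return combined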

3.4 Back-Translation Model

Based on GEC studies that used BT, we selected the Transformer (Vaswani et al., 2017), CNN (Gehring et al., 2017), and LSTM (Luong et al., 2015). For all BT models, we used the implementations in the fairseq toolkit with their default settings, apart from the common settings described below; when training each BT model, the --arch argument in the fairseq toolkit was set to transformer, fconv, and lstm for the Transformer, CNN, and LSTM, respectively.

Common settings.

We used the Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.98$. We used label-smoothed cross-entropy (Szegedy et al., 2016) as the loss function and selected the model that achieved the smallest loss on BEA-valid. We set the maximum number of epochs to 40. The learning rate schedule is the same as that described in Vaswani et al. (2017). We applied dropout (Srivastava et al., 2014) with a rate of 0.3. We set the beam size to 5 with length normalization. Moreover, to generate diverse errors, we used the noising beam search method proposed by Xie et al. (2018). In this method, we add $r\beta_{\mathrm{random}}$ to the score of each hypothesis in the beam search, where $r$ is randomly sampled from a uniform distribution over the interval $[0,1]$, and $\beta_{\mathrm{random}}\in\mathbb{R}_{\geq 0}$ is a hyperparameter that adjusts the noise scale. In this experiment, $\beta_{\mathrm{random}}$ was set to 8, 10, and 12 for the Transformer, CNN, and LSTM, respectively; each value achieved the best $\mathrm{F_{0.5}}$ score on BEA-valid in preliminary experiments.
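
A minimal sketch of this noising step, with hypotheses as a hypothetical list of (sequence, score) pairs maintained by the beam search:

    # Sketch of the noising step of Xie et al. (2018) as described above:
    # each hypothesis score in the beam is perturbed by r * beta_random,
    # where r ~ Uniform[0, 1]. All names are hypothetical.
    import random

    def noise_beam_scores(hypotheses, beta_random):
        """Perturb beam search hypothesis scores to diversify pseudo errors."""
        noised = []
        for sequence, score in hypotheses:
            r = random.uniform(0.0, 1.0)  # r sampled from Uniform[0, 1]
            noised.append((sequence, score + r * beta_random))
        return noised

    # At each decoding step, the beam then keeps the top-k hypotheses ranked
    # by the noised scores, e.g., with beta_random = 8 for the Transformer.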

Transformer.

Our Transformer model was based on Vaswani et al. (2017); it has a 6-layer encoder and decoder with 512-dimensional embeddings, 2,048-dimensional inner layers, and 8 self-attention heads.

CNN.

Our CNN model was based on Gehring et al. (2017); it has a 20-layer encoder and decoder with 512-dimensional embeddings, both using kernels of width 3 and a hidden size of 512.

LSTM.

Our LSTM model was based on Luong et al. (2015); it has a 1-layer encoder and decoder with 512-dimensional embeddings and a hidden size of 512.

Error type  Freq.  Baseline   Transformer  CNN        LSTM       Transformer & CNN  Transformer & Transformer  CNN & CNN
OTHER       697    22.2±1.77  31.8±0.71    31.7±0.77  30.6±0.16  34.2±1.03          31.8±1.01                  31.6±0.74
PUNCT       613    65.6±2.02  64.6±0.42    67.8±0.83  67.3±1.83  65.9±1.51          66.0±0.73                  67.8±0.93
DET         607    53.8±0.71  64.8±1.62    65.0±0.41  65.2±0.83  64.8±0.64          66.7±1.15                  64.7±0.75
PREP        417    48.2±0.55  58.1±0.76    59.3±0.54  55.2±1.74  61.1±0.43          60.3±0.76                  60.3±1.06
ORTH        381    72.7±2.47  77.2±0.50    78.7±1.50  78.0±1.95  79.2±1.25          78.4±1.28                  78.8±0.74
SPELL       315    58.3±3.49  71.0±1.71    71.1±1.45  71.6±0.50  73.3±1.03          72.5±0.40                  71.1±0.49
NOUN:NUM    263    57.8±2.23  64.4±1.09    63.7±0.90  63.9±1.35  66.2±0.43          66.3±0.61                  64.6±1.41
VERB:TENSE  256    43.9±2.35  52.1±1.58    54.6±0.94  52.6±0.50  53.7±1.71          54.6±0.64                  54.8±1.27
VERB:FORM   213    62.0±2.26  66.7±2.63    67.1±0.46  66.0±1.60  66.3±0.34          66.9±1.54                  66.6±1.01
VERB        196    32.5±3.41  36.0±1.18    36.3±0.91  39.7±3.05  42.7±3.83          39.0±0.76                  38.2±0.98
VERB:SVA    157    66.1±1.38  73.7±3.00    75.6±0.86  73.8±2.51  75.1±1.04          76.3±1.20                  74.3±0.44
MORPH       155    54.0±2.03  61.9±1.97    63.8±1.23  63.8±0.53  64.5±0.62          66.3±1.26                  63.8±2.84
PRON        139    43.8±2.00  53.0±2.79    51.8±0.14  49.6±1.93  53.3±1.10          52.7±2.75                  53.3±0.46
NOUN        129    19.7±2.04  31.4±0.62    30.2±2.39  30.5±2.17  35.9±2.90          34.5±1.48                  32.8±2.80
Table 3: Each error type's $\mathrm{F_{0.5}}$ score of the single models on the BEA-test; the columns from "Transformer" to "CNN & CNN" indicate the back-translation model(s). We extracted error types with a frequency of 100 or more; the total frequency of all error types was 4,882. For details of the error types, see Bryant et al. (2017).

4 Results

4.1 Overall Results

Separate pseudo data.

The top group in Table 2 presents the results of the GEC model using each BT model; the best BT model differed across test sets. The GEC model using the Transformer achieved the best scores on CoNLL-2014. In contrast, on JFLEG and BEA-test, the GEC model using CNN achieved the best scores. Moreover, the GEC model using LSTM achieved a higher $\mathrm{F_{0.5}}$ score than that using the Transformer on BEA-test. These results suggest that the Transformer, which is robust as a GEC model (Kiyono et al., 2019), is not necessarily a good BT model.

Combined pseudo data.

The bottom group of Table 2 shows the results of the GEC model using combined pseudo data. As shown in Table 2, the combination of pseudo data generated by different BT models consistently improved performance compared with pseudo data from a single source (Transformer & CNN > Transformer, CNN). In contrast, for some of the entries in Table 2, the performance of the GEC models using single BT models with different seeds was lower than that using only a single BT model. For example, when using the Transformer as the BT model, the $\mathrm{F_{0.5}}$ score of the ensemble model using a single BT model was 59.0 on CoNLL-2014, whereas that using two same-architecture BT models was 58.9 (Transformer & Transformer: 58.9 < Transformer: 59.0). Similarly, for CNN, the $\mathrm{F_{0.5}}$ score of the ensemble model using only a single BT model was 63.4 on BEA-test, whereas that using two same-architecture BT models was 63.3 (CNN & CNN: 63.3 < CNN: 63.4). Hence, combining different BT models enables the construction of a more robust GEC model than combining single BT models with different seeds.

4.2 Results of Each Error Type

Separate pseudo data.

The left side of Table 3 shows the $\mathrm{F_{0.5}}$ scores of the single models on the BEA-test across various error types. When using the Transformer as the BT model, the performance on PRON was high. In contrast, the performance on PREP, VERB:TENSE, and VERB:SVA was high when using CNN, and the performance on VERB was high when using LSTM, to name a few. Therefore, the correction tendencies for each error type appear to differ depending on the BT model.

For PUNCT, the performance of the GEC model using the Transformer was lower than that of the baseline model (Transformer: 64.6 < Baseline: 65.6). Moreover, when using CNN or LSTM as the BT model, the performance on PUNCT improved by only approximately 2 points in $\mathrm{F_{0.5}}$ over the baseline model (CNN: 67.8, LSTM: 67.3 > Baseline: 65.6). This improvement is small compared with that of the other error types. Therefore, when using pseudo data generated by BT, PUNCT appears to be an error type that is difficult to improve.

              Transformer                                        CNN                                                LSTM
Error type    Token       Type        F0.5 (w/ FT)  F0.5 (w/o FT)    Token       Type        F0.5 (w/ FT)  F0.5 (w/o FT)    Token       Type        F0.5 (w/ FT)  F0.5 (w/o FT)
Overall 64,733,183 12,364,575 58.4 32.7 77,784,638 17,711,223 59.3 31.4 90,205,852 25,502,133 58.5 25.2
OTHER 16,463,382 6,084,184 31.8 10.0 20,237,776 8,453,119 31.7 9.6 29,286,403 13,844,773 30.6 6.3
PUNCT 3,716,117 37,360 64.6 47.1 3,814,449 46,724 67.8 46.1 4,082,594 53,739 67.3 43.1
DET 8,074,615 39,606 64.8 41.5 8,491,264 39,402 65.0 39.2 8,217,106 33,389 65.2 33.0
PREP 6,832,627 19,521 58.1 36.7 7,935,894 23,564 59.3 35.8 8,091,043 25,923 55.2 30.3
ORTH 3,378,022 521,032 77.2 62.7 3,973,439 646,475 78.7 61.4 3,587,805 513,787 78.0 60.0
SPELL 6,620,395 2,795,425 71.0 57.0 11,224,522 4,737,493 71.1 56.3 11,342,223 5,643,091 71.6 50.8
NOUN:NUM 2,241,413 31,939 64.4 45.2 2,149,748 30,205 63.7 43.9 2,177,546 28,226 63.9 41.3
VERB:TENSE 2,585,017 58,935 52.1 27.2 2,599,663 60,266 54.6 26.6 2,418,040 59,207 52.6 22.6
VERB:FORM 1,287,912 47,071 66.7 45.5 1,421,381 48,776 67.1 46.2 1,517,365 48,117 66.0 41.1
VERB 1,821,117 328,147 36.0 18.5 2,201,360 453,181 36.3 17.2 2,704,117 647,785 39.7 12.9
VERB:SVA 761,768 6,564 73.7 52.5 784,762 6,136 75.6 52.8 824,241 6,019 73.8 45.5
MORPH 2,306,204 148,506 61.9 32.5 2,308,793 147,657 63.8 32.6 2,613,870 167,440 63.8 29.2
PRON 810,875 3,642 53.0 14.7 995,686 4,013 51.8 12.7 1,248,554 5,267 49.6 10.9
NOUN 4,402,909 1,888,994 31.4 14.8 6,155,680 2,697,991 30.2 14.4 8,196,758 4,032,482 30.5 9.8
Table 4: Number of edit pair tokens and types in the pseudo data generated by each BT model, and each error type's $\mathrm{F_{0.5}}$ score of the single models with and without fine-tuning (FT) on the BEA-test. As in Table 3, we extracted error types with a frequency of 100 or more in the BEA-test.

Combined pseudo data.

The right side of Table 3 shows the $\mathrm{F_{0.5}}$ scores of the single models using combined pseudo data on the BEA-test across various error types. Except for 3 of the 14 error types shown in Table 3, the GEC model using Transformer & CNN yielded higher $\mathrm{F_{0.5}}$ scores than at least one of Transformer & Transformer and CNN & CNN. Therefore, the combination of different BT models appears to improve or interpolate performance compared with that of single BT models with different seeds.

For OTHER, the combination of single BT models with different seeds did not improve performance compared with a single BT model (Transformer & Transformer: 31.8 = Transformer: 31.8; CNN & CNN: 31.6 < CNN: 31.7). Conversely, the combination of different BT models improved the performance on OTHER compared with a single BT model (Transformer & CNN: 34.2 > Transformer: 31.8, CNN: 31.7). Thus, by using different BT models, the GEC model can be expected to correct more diverse error types.

Effects of different seeds.

Here, we consider the effect of different seeds in the BT model. For some error types in Table 3, the GEC model using single BT models with different seeds has a higher $\mathrm{F_{0.5}}$ score than that using different BT models. One reason for this is that there is some variation (i.e., a high standard deviation) in the $\mathrm{F_{0.5}}$ score of each error type, even when merely changing the seed of the BT model. For example, in the GEC model using the Transformer, the standard deviation of DET was 1.62, which is relatively high, and the $\mathrm{F_{0.5}}$ score of DET using Transformer & Transformer was higher than that using Transformer & CNN. Thus, for error types with such variation, using single BT models with different seeds may improve performance compared with using different BT models.

5 Discussion

We examined the number of edit pairs in the pseudo data generated by each BT model. We annotated the pseudo data using ERRANT and extracted edit pairs from the pseudo source sentences and the target sentences. Table 4 shows the number of edit pair tokens and types in the pseudo data generated by each BT model. We expected that the higher the number of errors of each error type, the better the $\mathrm{F_{0.5}}$ score of the GEC model for that error type. However, the results did not show such a tendency. Specifically, when the number of edit pair tokens and types was the highest for an error type, only 6 of the 14 error types had the highest $\mathrm{F_{0.5}}$ score (ORTH, SPELL, NOUN:NUM, VERB:TENSE, VERB, and MORPH). This implies that simply increasing the number of tokens or types of an error type may not improve the GEC model's performance on that error type.
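
A minimal sketch of this edit extraction, assuming the ERRANT Python package and an installed spaCy English model; the example sentence pair is hypothetical:

    # Sketch of extracting edit pairs with ERRANT. Requires the errant
    # package and a spaCy English model; the sentences are hypothetical.
    import errant

    annotator = errant.load("en")
    orig = annotator.parse("He go to school yesterday .")   # pseudo source
    cor = annotator.parse("He went to school yesterday .")  # target
    for edit in annotator.annotate(orig, cor):
        # Each edit carries its original span, correction, and error type,
        # e.g., "go -> went (R:VERB:TENSE)".
        print(f"{edit.o_str} -> {edit.c_str} ({edit.type})")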

Moreover, we investigated the performance of the GEC model with and without fine-tuning. As shown in Table 4, when fine-tuning was not carried out (i.e., pre-training only), the GEC model using the Transformer had the highest $\mathrm{F_{0.5}}$ score, with a 7.5-point difference in $\mathrm{F_{0.5}}$ between the Transformer and the LSTM (Transformer: 32.7 > LSTM: 25.2). However, interestingly, when fine-tuning was performed, the GEC model using LSTM achieved a better $\mathrm{F_{0.5}}$ score than that using the Transformer (Transformer: 58.4 < LSTM: 58.5). This result suggests that even if the performance of the GEC model is low after pre-training, it may become high after fine-tuning.

6 Conclusions

In this study, we investigated the correction tendencies of the GEC model for each BT model. The results showed that the correction tendencies for each error type varied depending on the BT model. In addition, we found that the combination of different BT models improves or interpolates the $\mathrm{F_{0.5}}$ scores compared with those of single BT models with different seeds.

Acknowledgments

We would like to thank Lang-8, Inc. for providing the text data. We would also like to thank the anonymous reviewers for their valuable comments. This work was partly supported by JSPS KAKENHI Grant Number 19KK0286.

References