Back-Translated Task-Adaptive Pretraining:
Improving Accuracy and Robustness on Text Classification
Abstract
Language models (LMs) pretrained on a large text corpus and then fine-tuned on a downstream task have become the de facto training strategy for many natural language processing (NLP) tasks. Recently, adaptive pretraining methods that re-pretrain the pretrained language model with task-relevant data have shown significant performance improvements. However, current adaptive pretraining methods suffer from underfitting on the task distribution owing to the relatively small amount of data available to re-pretrain the LM. To fully exploit the concept of adaptive pretraining, we propose back-translated task-adaptive pretraining (BT-TAPT), which increases the amount of task-specific data for LM re-pretraining by augmenting the task data with back-translation, thereby generalizing the LM to the target task domain. The experimental results show that the proposed BT-TAPT yields improved classification accuracy on both low- and high-resource data and better robustness to noise than conventional adaptive pretraining methods.
1 Introduction
In the history of natural language processing (NLP), the rise of large language models (LMs) trained on huge text corpora was a game-changer. Before the advent of these LMs, an NLP task-specific model was trained only on a small amount of labeled data. Due to the high cost of label annotation for text data, insufficient training data was always one of the main obstacles to NLP model progress Liu et al. (2016). However, researchers found that large LMs trained on a huge amount of unlabeled, i.e., task-independent, text data, such as BERT Devlin et al. (2019) or GPT-3 Brown et al. (2020), significantly improved the performance of various NLP tasks simply by starting from a pretrained LM and fine-tuning it on the task-specific labeled data. This strategy, i.e., combining a large pretrained LM with task-specific fine-tuning, outperformed the state-of-the-art models in various NLP tasks such as text classification Howard and Ruder (2018), natural language inference Peters et al. (2018), summarization Lewis et al. (2020), and question answering Howard and Ruder (2018); Lan et al. (2019).
Although pretrained LMs learn generalized language representations from corpora collected across a wide range of domains, these representations are often insufficient to fully capture the specific domain of a downstream task. To overcome this limitation, adaptive pretraining, which re-pretrains the pretrained LM with task-relevant data before the fine-tuning phase, was proposed Beltagy et al. (2019); Sun et al. (2019); Lee et al. (2020); Gururangan et al. (2020). In practice, however, it is often challenging to obtain additional in-domain data that shares the same characteristics as the task data, for example when developing an intent classifier from data collected by a newly launched chatbot Anaby-Tavor et al. (2020). Moreover, when only a small amount of task data is available, adaptive pretraining on the task data alone is insufficient to generalize the LM to the task distribution. Hence, although adaptive pretraining improves performance, the task-adaptively pretrained LM may still be underfitted on the task distribution.
To solve this problem, we propose a back-translated task-adaptive pretraining (BT-TAPT) strategy that augments the task data with back-translation to secure a larger amount of task-relevant data and thus better generalize the pretrained LM to the target task domain. Although text augmentation has helped improve the generalization and robustness of NLP models in various tasks such as classification, translation, and question answering Sennrich et al. (2016b); Edunov et al. (2018); Yu et al. (2018); Wei and Zou (2019); Xie et al. (2019), augmented data have thus far been used only in the fine-tuning step.
Figure 1 illustrates the expected advantage of the proposed BT-TAPT. In BT-TAPT, adaptive pretraining of the LM is first conducted with the original task data. The task corpus is then augmented with back-translation using an appropriate sampling method such as nucleus sampling Holtzman et al. (2019). In this way, we can generate various paraphrases from the original task corpus. The augmented task corpus is then used to re-pretrain the adaptively pretrained LM once more, so that the LM generalizes better to the target task domain. Based on these consecutive pretraining procedures, we can expect the overlap between the language model domain and the target task domain to increase, as described in Figure 1.

To verify the proposed BT-TAPT, we employed two well-known pretrained LMs: BERT and RoBERTa Liu et al. (2019). The performance of BT-TAPT was evaluated on six text classification datasets and compared with two benchmark methods: the pretrained LM and the task-adaptively pretrained LM. In general, the experimental results show that the proposed BT-TAPT yields higher classification accuracy than the benchmark methods. In addition, we verified the robustness of BT-TAPT by generating five types of noise for the test datasets and comparing the performance with the benchmark methods. As expected, BT-TAPT showed more robust classification performance than the baseline methods, supporting the claim that back-translation-based augmented data improve the generalization ability of pretrained LMs.
Consequently, our contribution can be summarized as follows:
• We propose a new adaptive pretraining method (BT-TAPT) that generalizes LMs to the task distribution using a back-translation-augmented downstream task corpus.
• BT-TAPT enhances the performance of downstream tasks.
• BT-TAPT shows better robustness to noisy text data.
2 Background
This section briefly reviews three key components of the study: masked language model, language model adaptation, and text data augmentation.
2.1 Masked Language Model
The masked language model (MLM) proposed in Devlin et al. (2019) is a pretraining method that predicts the original tokens of a sentence in which some tokens are masked with the special token [MASK]. Let $X = \{x^{(1)}, \dots, x^{(N)}\}$ denote a set of unannotated sentences, where $x = (x_1, \dots, x_T)$ is a sequence of tokens and $x_t$ is a token in the sequence. In the pretraining process of the masked language model, noise is added to each sequence by randomly replacing some tokens with the [MASK] token. Let $\hat{x}$ be the noised sentence and $\mathcal{M}$ be the set of positions whose tokens are randomly replaced. The training objective can be formulated as follows:

$$\mathcal{L}_{\text{MLM}}(\theta) = -\sum_{x \in X} \sum_{t \in \mathcal{M}} \log p_{\theta}(x_t \mid \hat{x}),$$

where $\theta$ denotes the parameters of the LM.
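For illustration, the following sketch shows how this masking noise can be generated with the Hugging Face transformers library; it is a minimal example for intuition only, and the 15% masking probability follows Devlin et al. (2019) rather than any setting specific to this work.

```python
# Minimal sketch of MLM input corruption, assuming the Hugging Face
# `transformers` library; 15% masking follows Devlin et al. (2019).
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoding = tokenizer("the movie was surprisingly good", return_tensors="pt")

# The collator replaces a random subset of tokens (mostly with [MASK], some
# with random tokens) and sets labels only at the corrupted positions; all
# other label positions are -100 and are ignored by the MLM loss.
batch = collator([{"input_ids": encoding["input_ids"][0]}])
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])
```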
Because MLMs do not require any task-specific labels, they can be trained on a large unannotated corpus, which has led researchers to believe that they can learn a general representation of a single language or even multiple languages. This belief has been partially supported, as pretrained MLMs based on a gigantic text corpus followed by fine-tuning on a small amount of task-specific data often outperform the state-of-the-art models for a wide range of NLP tasks Liu et al. (2019); Lan et al. (2019); Song et al. (2019); Raffel et al. (2020).
2.2 Language Model Adaptation
LMs are pretrained on a vast amount of text from different sources such as BookCorpus Zhu et al. (2015), Wikipedia, CC-News Liu et al. (2019), OpenWebText Gokaslan and Cohen (2019), and Stories Trinh and Le (2018). Contrary to the expectation that such broad corpora allow LMs to generalize their language representations, LMs have been reported to be highly dependent on the domain of their training data, and their performance on other domains is not as good as on the training domain.
To resolve this issue, language model adaptation, which re-pretrains the LM before fine-tuning it for a specific task, was proposed Sun et al. (2019); Gururangan et al. (2020). There are two main streams of language model adaptation: domain-adaptive pretraining (DAPT) and task-adaptive pretraining (TAPT). DAPT re-pretrains the LM on a new dataset from the same domain as the target task, whereas TAPT directly re-pretrains the LM on the target task data. If the task domain is uncommon or the data acquisition cost is high, TAPT can be more suitable than DAPT.
2.3 Text Data Augmentation
Data augmentation has thus far been used less actively in NLP than in computer vision. While many simple label-preserving operations exist for images, such as flipping, rotation, cropping, and translation, languages barely have such operations. Hence, synonym replacement using WordNet has been the most popular augmentation method in NLP Wang and Yang (2015); Zhang et al. (2015).
Recently, easy data augmentation (EDA) techniques consisting of synonym replacement, random insertion, random swap, and random deletion have proved useful for text classification Wei and Zou (2019). Despite the performance gains, EDA has a significant limitation: except for synonym replacement, its operations usually damage the semantics of the original sentences Kumar et al. (2020). As an alternative that preserves the semantics of the original sentences, back-translation, which translates a sentence from the source language into another language and then back into the source language, was proposed. The technique has been shown to improve performance in various NLP tasks, such as machine translation Sennrich et al. (2016a), question answering Yu et al. (2018), and semi-supervised text classification Xie et al. (2019).

Data | Task | # Class | Average length | Maximum length | # Train | # Train (Low.) | # Dev | # Test
IMDB | Sentiment | 2 | 297 | 2,291 | 20,000 | 5,000 | 5,000 | 25,000
MR | Sentiment | 2 | 28 | 73 | 5,000 | - | 2,662 | 3,000
SST2 | Sentiment | 2 | 25 | 70 | 6,920 | - | 872 | 1,822
Amazon | Sentiment | 2 | 202 | 7,810 | 115,251 | 11,525 | 5,000 | 25,000
TREC | Question | 6 | 13 | 38 | 4,906 | - | 546 | 500
AGNews | Topic | 4 | 52 | 255 | 115,000 | 10,000 | 5,000 | 7,600
3 Proposed Method
We introduce back-translated task-adaptive pretraining (BT-TAPT), a new adaptive pretraining strategy that helps when task data is insufficient and in-domain data is unavailable. The overall process of BT-TAPT is shown in Figure 2.
3.1 When Task Data Is Insufficient
While TAPT contributes to task-specific performance improvement, Gururangan et al. (2020) showed that continued pretraining on human-curated unlabeled data – a corpus from the same source as the task data – yields an additional performance gain; this variant is called human-curated TAPT. An automatic selection approach that retrieves, from an in-domain corpus, unlabeled data aligned with the task data distribution has also proven beneficial. Despite these favorable results, such methods cannot be used when human-curated data or in-domain data are unavailable. Motivated by this limitation, we propose an advanced adaptive pretraining approach that requires only the task data.
3.2 Back-Translated TAPT
If the amount of task data is insufficient, the LM may still be underfitted on the task domain after TAPT. In this case, the LM would generalize better to the task if more task-related sentences were available. The proposed BT-TAPT is an additional adaptation method that uses human-like task-related sentences. To generate plausible sentences, we use back-translation with nucleus (top-p) sampling. We use nucleus sampling instead of traditional beam search because the former better serves the purpose of augmentation: creating label-preserving, semantics-preserving data that are not identical to the original data. Whereas a sentence back-translated with beam search tends to be almost identical to the original, sentences generated by back-translation with a proper sampling method are usually paraphrases of the original sentence that contain varied expressions without deviating significantly from the domain of the original data.
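To make the contrast concrete, the sketch below performs one EN→DE→EN round trip with beam search and one with nucleus sampling. It uses the Hugging Face FSMT ports of the FAIR WMT'19 models as a stand-in for the fairseq checkpoints described in Section 4.2, and the top_p value of 0.9 is an illustrative choice rather than the value used in our experiments.

```python
# Sketch of beam-search vs. nucleus-sampled back-translation (EN -> DE -> EN).
# The FSMT checkpoints and the top_p value are illustrative stand-ins.
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

en_de_tok = FSMTTokenizer.from_pretrained("facebook/wmt19-en-de")
en_de = FSMTForConditionalGeneration.from_pretrained("facebook/wmt19-en-de")
de_en_tok = FSMTTokenizer.from_pretrained("facebook/wmt19-de-en")
de_en = FSMTForConditionalGeneration.from_pretrained("facebook/wmt19-de-en")


def translate(model, tokenizer, text, **gen_kwargs):
    """Translate `text` and return the decoded hypotheses as strings."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, **gen_kwargs)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


sentence = "The acting was superb, but the plot dragged on far too long."

# Beam search: the round trip usually returns a near-copy of the input.
german = translate(en_de, en_de_tok, sentence, num_beams=5)[0]
print(translate(de_en, de_en_tok, german, num_beams=5)[0])

# Nucleus (top-p) sampling: each round trip yields a different paraphrase.
german = translate(en_de, en_de_tok, sentence, do_sample=True, top_p=0.9)[0]
print(translate(de_en, de_en_tok, german, do_sample=True, top_p=0.9,
                num_return_sequences=3))
```

In BT-TAPT, this sampled round trip is repeated for every sentence in the task corpus to produce multiple paraphrases per original sentence.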
When the LM is further pretrained only with the task data, the noise applied to the input at each epoch consists only of changes in the positions of the [MASK] tokens and a small amount of random token replacement. In contrast, the noise used in BT-TAPT includes not only changes of [MASK] positions and token replacement but also synonym substitution and rephrasing facilitated by sampling-based back-translation.
3.3 BT-TAPT Process
We first apply TAPT to the pretrained LM to expose it to grammatically correct sentences from the task domain. We then generate multiple sentences with back-translation, adopting nucleus sampling to ensure proper diversity and fluency of the generated sentences. We generate 20 sentences for each original sentence so that the LM is exposed to semantically and syntactically diverse expressions. We further re-pretrain the LM on these sentences, which contain domain-related but less precise expressions. By learning from such paraphrases, the LM can cover a broader range of the task distribution, as depicted in Figure 1. After the entire adaptive pretraining phase is completed, the LM is fine-tuned on the task data.
Model | IMDB | Amazon | AGNews | TREC | MR | SST2
BERT | 92.2±0.3 | 60.8±2.3 | 92.1±0.1 | 96.7±0.1 | 86.5±0.7 | 91.0±0.7
+ tapt | 93.0±0.2 | 67.0±0.8 | 92.7±0.1 | 96.0±0.3 | 85.7±0.3 | 90.6±0.3
+ bt-tapt | 93.3±0.2 | 67.3±0.9 | 92.7±0.1 | 96.9±0.3 | 86.4±0.4 | 92.4±0.3
RoBERTa | 94.2±0.3 | 63.7±1.9 | 92.4±0.3 | 96.7±0.5 | 89.3±0.7 | 93.5±0.6
+ tapt | 94.4±0.2 | 64.5±2.8 | 92.6±0.3 | 96.2±0.2 | 89.4±0.4 | 93.8±0.6
+ bt-tapt | 94.4±0.1 | 67.7±0.8 | 92.6±0.1 | 96.5±0.3 | 89.7±0.3 | 93.8±0.5
Model | †IMDB | †Amazon | †AGNews
BERT | 93.7±0.1 | 65.3±1.4 | 94.1±0.1
+ tapt | 94.8±0.2 | 68.1±2.8 | 94.7±0.2
+ bt-tapt | 95.1±0.1 | 69.7±1.9 | 94.6±0.1
RoBERTa | 95.1±0.1 | 65.7±2.2 | 94.6±0.3
+ tapt | 95.6±0.1 | 68.5±1.2 | 94.9±0.1
+ bt-tapt | 95.7±0.0 | 69.2±0.9 | 95.0±0.1
4 Experiments
We verify the proposed BT-TAPT on six widely studied classification datasets with two well-known pretrained LMs: BERT and RoBERTa. We compare the performance of BT-TAPT with two benchmark methods: the base pretrained model and TAPT.
4.1 Datasets & Performance Metrics
Four sentiment classification datasets, i.e., IMDB Maas et al. (2011), MR Pang and Lee (2005), SST2 Socher et al. (2013), and Amazon He and McAuley (2016); Gururangan et al. (2020), and two other classification datasets, i.e., TREC Li and Roth (2002) for question classification and AGNews Zhang et al. (2015) for topic classification, were used in the experiments. Simple accuracy was used as the classification performance metric, except for Amazon, for which macro-F1 was used owing to its class imbalance.
We divided the datasets into two groups: datasets with more than 20,000 training examples were considered high-resource, and the others low-resource. To create additional low-resource datasets, IMDB and AGNews were down-sampled to 2,500 examples per class, and Amazon was down-sampled to 10% of its training set. The datasets are summarized in Table 1.
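For reference, the metrics can be computed as in the minimal sketch below, assuming scikit-learn; the label arrays are placeholders.

```python
# Minimal sketch of the evaluation metrics: plain accuracy for most datasets
# and macro-F1 for the class-imbalanced Amazon dataset. Labels are placeholders.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0, 1]   # placeholder gold labels
y_pred = [0, 1, 0, 0, 1]   # placeholder model predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```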
4.2 Training Details
As the pretrained LMs, we employed the BERT-base and RoBERTa-base models provided by huggingface Wolf et al. (2020) (https://github.com/huggingface/transformers). For back-translation, we employed the transformer-big model from Facebook Ng et al. (2019) (https://github.com/pytorch/fairseq/tree/master/examples/translation), a WMT'19 winner. We translated sentences from English to German and then back-translated them from German to English. When generating translations in the decoding process, we used nucleus sampling. Back-translation generated a total of 20 sentences for each original sentence.
For both TAPT and BT-TAPT, we re-pretrained the LMs for 100K steps in the high-resource setting (IMDB, Amazon, and AGNews), following Sun et al. (2019). In the low-resource setting, we re-pretrained for 50K steps because the model converges faster on a small dataset than on a large one. We used a batch size of 64, a maximum sequence length of 512, and the AdamW optimizer with a learning rate of 5e-5 and a linear learning rate scheduler in which 10% of the steps were used for warm-up and the rest for decay. For task-specific fine-tuning, the models were trained for 2 or 3 epochs with a batch size of 8, 16, or 32, a weight decay of 0 or 0.01, and a fixed learning rate of 2e-5 with the same linear learning rate scheduler as in re-pretraining. We stopped fine-tuning when the validation loss started to increase and report the best performance among all hyperparameter combinations. To compensate for the effect of random seed initialization, we repeated each experiment five times and report the average and standard deviation of the performance metrics. All experiments were conducted on an Nvidia DGX Station with four 32GB Nvidia V100 GPUs.
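For concreteness, the sketch below shows how such MLM re-pretraining can be configured with the transformers Trainer under the hyperparameters listed above; the corpus path, the single-device batch size, and the dataset handling are illustrative placeholders rather than the exact scripts used in our experiments.

```python
# Minimal sketch of TAPT/BT-TAPT re-pretraining with the `transformers` Trainer.
# The corpus file is a placeholder: for TAPT it holds the task sentences, for
# BT-TAPT the back-translated paraphrases (one sentence per line).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

corpus = load_dataset("text", data_files={"train": "task_or_bt_corpus.txt"})
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="bt-tapt-checkpoints",
    max_steps=100_000,                # 50K steps in the low-resource setting
    per_device_train_batch_size=64,   # single-device sketch of the batch size
    learning_rate=5e-5,               # AdamW is the Trainer default optimizer
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                 # 10% warm-up, linear decay afterwards
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=corpus["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```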

4.3 Text Classification Performance
Tables 2 and 3 show the classification performance on each dataset in the high- and low-resource settings, respectively. In both tables, we observe that further re-pretraining with the back-translation-augmented dataset improves the model in terms of either classification accuracy or stability, regardless of the base pretrained LM. The proposed BT-TAPT either yields higher classification accuracy than TAPT without an increase in standard deviation (e.g., IMDB with BERT, SST2 with BERT, and MR with RoBERTa) or reduces the performance variation while maintaining the classification accuracy (e.g., IMDB with RoBERTa, AGNews with RoBERTa, and SST2 with RoBERTa). There are some exceptions: as mentioned in Sun et al. (2019), re-pretraining can harm performance on datasets without enough sentences to represent the domain distribution in general, such as TREC with RoBERTa and MR with BERT. However, BT-TAPT appears to compensate for this gap by supplementing additional augmented data.
Augmentation | Accuracy
TAPT | 93.0±0.2
None | 92.8±0.2 (−0.2%p)
+ EDA | 92.7±0.2 (−0.3%p)
+ Embedding | 92.9±0.2 (−0.1%p)
+ TF-IDF | 92.9±0.4 (−0.1%p)
+ Back-Translation | 93.3±0.2 (+0.3%p)
Figure 3 shows the effect of re-pretraining the BERT model with task data and back-translated task data on the full IMDB dataset. The solid line denotes the classification performance after fine-tuning the LM re-pretrained for the number of steps given on the x-axis, and the shaded area is the standard deviation of accuracy over the five repetitions. With TAPT, the accuracy increases with the number of steps. However, the effect of TAPT vanishes beyond a certain point; for example, the accuracy degrades after 80K steps in Figure 3. The performance variation is also unstable with TAPT. When the proposed BT-TAPT is applied after 100K steps, the accuracy rebounds as training continues. We also observe that with BT-TAPT the accuracy not only improves further but also retains low variability, as indicated by the narrow bandwidth of the shaded area.
4.4 Comparing Augmentation Methods
We compared back-translation with other widely used augmentation methods – EDA, Embedding, and TF-IDF – for adaptive pretraining. Embedding Mrkšić et al. (2016) replaces an arbitrary token with a nearby token in the embedding space. In contrast, TF-IDF Xie et al. (2019) replaces uninformative words with low TF-IDF scores while preserving those with high TF-IDF scores. As a baseline, we also include a model without any augmentation, i.e., TAPT applied again for the same number of training steps as the other augmentation methods.
We first applied TAPT to BERT-base for 50K steps using the IMDB low-resource dataset and then re-pretrained it on the augmented dataset generated by each method with a transformation probability of 0.1. Table 4 shows the classification accuracy for the different augmentation methods. As discussed for Figure 3, excessive TAPT often degrades the final performance: another 100K steps of TAPT resulted in lower accuracy than no additional re-pretraining. Moreover, none of the benchmark text augmentation methods improved the classification performance or reduced the performance variation. Only the proposed BT-TAPT enhanced the classification accuracy without increasing the variation. The reason behind this observation is that back-translation can generate diverse and realistic paraphrases, whereas the other methods only partially modify the sentence with simple operations. These simply modified sentences sometimes damage the semantics of the original sentence, which hinders proper LM re-pretraining.

4.5 Quantity of Augmentation
To investigate the effect of the number of back-translated sentences on downstream task performance, we generated between 1 and 50 paraphrases per sentence in the BT-TAPT process on the IMDB low-resource dataset. Figure 4 shows the classification accuracy with respect to the number of augmentations. In general, the classification accuracy increases over TAPT as the number of back-translated sentences grows and saturates beyond a certain number of augmentations. Because the performance differences beyond 20 augmentations are marginal, we chose 20 as the final number of back-translated augmentations per sentence.

4.6 Comparison of BT-TAPT Strategies
We investigated the possible strategies for deploying back-translation in BT-TAPT. TAPT & BT uses both the task data and the back-translated data for re-pretraining simultaneously. BT → TAPT re-pretrains with the back-translated data first and then with the task data, and TAPT → BT does the reverse. As seen in Figure 5, applying TAPT followed by re-pretraining with back-translated sentences yields the highest accuracy with the smallest variation. In contrast, using the back-translated sentences simultaneously during re-pretraining enlarges the deviation even though it improves the accuracy. This implies that the language model should first adjust to the target domain with a well-formed corpus and then be exposed to various domain-related augmented sentences.

4.7 Performance on Small Datasets
In real-world situations, task data are commonly insufficient. To validate that BT-TAPT can effectively handle such data shortages, we conducted additional experiments with only 100, 500, and 1,000 sentences from the IMDB dataset. As shown in Figure 6, BT-TAPT is especially beneficial in extremely low-resource settings: the performance is noticeably improved even with only 100 sentences. This result supports the claim that augmented texts with diversified expressions but preserved semantics help LMs adapt to the domain distribution.

5 Robustness to Noise
When a fine-tuned classifier is applied to a real task, the inputs at inference time include not only clean data of the kind sufficiently observed during training but also noisy data. We expect the proposed BT-TAPT to be more robust to various types of noise because a model trained with BT-TAPT encounters more diverse contexts than one trained with TAPT. To verify this assumption, we constructed noisy test datasets by applying corruptions and perturbations and compared the classification performance of the three models.
5.1 Types of Noise
Five realistic noise types were generated for the test sets of AGNews, SST2, MR, and TREC. Note that these noises are added only to the test dataset, not the training dataset.
Synonym: Replace random words in the sentence with synonyms from the WordNet thesaurus.
BT beam: Generate a paraphrase using back-translation with beam search. This does not shift the original sentence substantially because beam search does not employ sampling; nevertheless, it produces more changes than synonym replacement.
BT top-p: Generate multiple paraphrases using back-translation with nucleus sampling. This creates diverse but partially incorrect sentences.
Char swap: Generate random character-level noise with a transformation probability of 0.1, consisting of deleting, adding, or changing the order of characters in a sentence.
InvTest: Apply the invariance test proposed in Ribeiro et al. (2020), which applies label-preserving perturbations such as changing numbers or location names. A fine-tuned classifier should produce the same output whether or not the input is changed by the invariance test.
Among the five noise scenarios, BT beam and BT top-p used the English→German and German→English machine translation models of Ng et al. (2019), the same models used in Section 4. Char swap, Synonym, and InvTest used the TextAttack Morris et al. (2020) python package. Because randomness exists in all methods except BT beam, where the maximum-probability sequence is used for decoding, we created five different noised versions of each dataset.
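As an illustration, the sketch below reproduces three of these noise types with TextAttack augmenters; the augmenter classes, the replacement probabilities, and the example sentence are assumptions made for illustration rather than the exact configuration used here, and the two back-translation noises can reuse the sampled round trip sketched in Section 3.2.

```python
# Sketch of noisy test-set generation with the TextAttack package; augmenter
# choices and probabilities are illustrative assumptions.
from textattack.augmentation import (CharSwapAugmenter, CheckListAugmenter,
                                     WordNetAugmenter)

sentence = "The restaurant in New York was packed on Friday night."

# Synonym: WordNet-based replacement of a fraction of the words.
synonym = WordNetAugmenter(pct_words_to_swap=0.1, transformations_per_example=1)
# Char swap: character-level deletions, insertions, swaps, and substitutions.
charswap = CharSwapAugmenter(pct_words_to_swap=0.1, transformations_per_example=1)
# InvTest: CheckList-style label-preserving edits (names, locations, numbers).
invtest = CheckListAugmenter(pct_words_to_swap=0.1, transformations_per_example=1)

for name, augmenter in [("Synonym", synonym), ("CharSwap", charswap),
                        ("InvTest", invtest)]:
    print(name, augmenter.augment(sentence))
```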
5.2 Results
To evaluate the robustness of TAPT and BT-TAPT, we measured the accuracy gain, i.e., the change in classification accuracy on the noised test set after TAPT or BT-TAPT was applied. Figure 7 shows the accuracy gain for each dataset under the different noise settings. Because the first two noise types, BT beam and BT top-p, generate sentences similar to those used in BT-TAPT, BT-TAPT improves the base pretrained model by a significant margin on most datasets, whereas TAPT sometimes fails to improve the base model. Beyond these back-translation-based noises, BT-TAPT still reports favorable accuracy gains. In contrast, TAPT not only achieves smaller accuracy gains than BT-TAPT but also sometimes degrades the pretrained model (negative accuracy gain) even though it re-pretrains the base LM. Consequently, we conclude that the proposed BT-TAPT is more robust to unexpected data variations and will therefore be practically helpful in real-world tasks where language is dynamically changing, evolving, and sometimes intentionally or unintentionally corrupted.
6 Conclusion
In this paper, we proposed BT-TAPT, a new adaptive pretraining method for generalizing LMs to task domains when task data is insufficient. In contrast to TAPT, which uses only the task-related unlabeled data, the proposed BT-TAPT generates augmented data based on back-translation and uses it to further re-pretrain the LM.
Experiments on six text classification datasets show that BT-TAPT not only improved the classification accuracy but also reduced its deviation. Moreover, BT-TAPT was found to be particularly effective on small datasets and more robust to various types of noise. Future work will consider applying BT-TAPT to other NLP tasks such as question answering and summarization.
References
- Anaby-Tavor et al. (2020) Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.
- Beltagy et al. (2019) Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3606–3611.
- Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500.
- Gokaslan and Cohen (2019) Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus.
- Gururangan et al. (2020) Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.
- He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pages 507–517.
- Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.
- Kumar et al. (2020) Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18–26.
- Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations.
- Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
- Liu et al. (2016) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2873–2879.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Maas et al. (2011) Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150.
- Morris et al. (2020) John X Morris, Eli Lifland, Jin Yong Yoo, and Yanjun Qi. 2020. Textattack: A framework for adversarial attacks in natural language processing. arXiv preprint arXiv:2005.05909.
- Mrkšić et al. (2016) Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Lina M Rojas Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148.
- Ng et al. (2019) Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook fair’s wmt19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.
- Ribeiro et al. (2020) Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of nlp models with checklist. arXiv preprint arXiv:2005.04118.
- Sennrich et al. (2016a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96.
- Sennrich et al. (2016b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.
- Socher et al. (2013) Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455–465.
- Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pages 5926–5936. PMLR.
- Sun et al. (2019) Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune bert for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.
- Trinh and Le (2018) Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
- Wang and Yang (2015) William Yang Wang and Diyi Yang. 2015. That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using# petpeeve tweets. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 2557–2563.
- Wei and Zou (2019) Jason Wei and Kai Zou. 2019. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6383–6389.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Xie et al. (2019) Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
- Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. In International Conference on Learning Representations.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28:649–657.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.