
Generative Data Augmentation for Commonsense Reasoning

Yiben Yang     Chaitanya Malaviya†Π     Jared Fernandez♠♡     Swabha Swayamdipta    
Ronan Le Bras     Ji-Ping Wang     Chandra Bhagavatula     Yejin Choi†♢     Doug Downey♠†

Northwestern University, Evanston, IL, USA
Allen Institute for Artificial Intelligence, Seattle, WA, USA
Π University of Pennsylvania, Philadelphia, PA, USA
Carnegie Mellon University, Pittsburgh, PA, USA
University of Washington, Seattle, WA, USA
{yiben.yang@,jared.fern@u.,jzwang@}northwestern.edu
{chaitanyam,swabhas,ronanlb,chandrab,yejinc,dougd}@allenai.org
Abstract

Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAug^c, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAug^c consistently outperforms existing data augmentation methods based on back-translation, establishing a new state-of-the-art on WinoGrande, Codah, and CommonsenseQA. It also enhances out-of-distribution generalization, proving more robust against adversarial attacks and perturbations. Our analysis demonstrates that G-DAug^c produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.

1 Introduction

Figure 1: Example of a selected high-quality generated example compared to a human-authored example from the WinoGrande dataset. Composing commonsense questions can require creativity.

While recent advances in large-scale neural language models Devlin et al. (2019); Liu et al. (2019); Radford et al. (2019); Raffel et al. (2019) have led to strong performance on several commonsense reasoning benchmarks Talmor et al. (2019); Lv et al. (2020); Sakaguchi et al. (2020), their accuracy by and large depends on the availability of large-scale human-authored training data. However, crowdsourcing examples at scale for each new task and domain can be prohibitively expensive. Moreover, human-authored data has been shown to exhibit annotation artifacts Gururangan et al. (2018); Agrawal et al. (2018); Schwartz et al. (2017), leading to models with considerably weaker performance on out-of-distribution samples Jia and Liang (2017); Belinkov and Bisk (2017); Iyyer et al. (2018).

A candidate solution that has shown promise in other tasks, such as reading comprehension, is to augment a human-authored training set with a large set of synthetically-generated examples Zhou et al. (2017); Du et al. (2017); Zhao et al. (2018a). But, generating synthetic examples for commonsense reasoning poses a unique challenge. In reading comprehension, for instance, the goal of data augmentation is to generate questions that are directly answerable by a given reference passage. In contrast, answering commonsense questions relies on commonsense notions that are seldom stated explicitly Gordon and Van Durme (2013); Forbes and Choi (2017), and authoring such questions can require creativity (see Figure 1). Based on promising evidence from previous work Yang et al. (2018); Trinh and Le (2018); Bosselut et al. (2019); Davison et al. (2019), we hypothesize that pretrained language models, such as GPT-2 Radford et al. (2019), capture some common sense expressed implicitly in their pretraining corpus. Could questions generated by such models serve as helpful training data? In this work, we explore this question through Generative Data Augmentation for commonsense reasoning (G-DAug^c; §2): a novel framework for augmenting training data with diverse and informative synthetic training examples to improve both in-distribution performance and out-of-distribution generalization of commonsense reasoning models (code: https://github.com/yangyiben/G-DAUG-c-Generative-Data-Augmentation-for-Commonsense-Reasoning).

Although a generative model allows us to produce large pools of synthetic training examples, the generated examples may be noisy or redundant. To ensure that we use the most informative examples for augmentation, we introduce data selection methods based on influence functions Koh and Liang (2017) and a heuristic to maximize the diversity of the generated data pool. Finally, we propose an effective two-stage training scheme for augmentation with synthetic data. In experiments across multiple commonsense benchmarks, we show that G-DAug^c can mitigate the expense and brittleness resulting from large training sets for commonsense reasoning tasks.

To summarize, our contributions include:

  1. G-DAug^c, a generative data augmentation framework for commonsense reasoning (§2),

  2. novel selection methods that identify informative and diverse synthetic training examples from the generated pool (§3),

  3. experiments showing that G-DAug^c improves in-distribution performance, achieving a 1-4% average absolute gain across four commonsense reasoning datasets and state-of-the-art results on the WinoGrande Sakaguchi et al. (2020), CommonsenseQA Talmor et al. (2019), and Codah Chen et al. (2019) benchmarks, and also improves model robustness in terms of resistance to adversarial attacks Jin et al. (2020) and accuracy on perturbed evaluation sets (§4), and

  4. a comprehensive analysis of the factors that influence G-DAug^c's performance (§5).

Figure 2: Illustration of the G-DAug^c process: (1) generate synthetic data and train a task model, (2) relabel the generated data using the task model, (3) filter the generated data based on estimated influence scores, (4) further select a subset based on a diversity-maximizing heuristic, (5) train a new task model using the filtered generations (synthetic training), and (6) further train this model using the original training data (organic training).

2 G-DAug^c

We now describe our framework for Generative Data Augmentation for Commonsense Reasoning (G-DAug^c). Figure 2 shows an overview of the approach. We describe G-DAug^c's data generation procedure (steps 1 and 2 in the figure) in this section, and cover the data selection and training components (steps 3-6) in §3.

2.1 Synthetic Training Data Generation

We will use multiple choice question answering as a running example to describe synthetic data generation. Formally, consider a dataset of $N$ questions $\mathcal{D}=\{(\mathbf{Q}^{i},\mathcal{C}^{i},y^{i}):i=1,2,\ldots,N\}$, where $\mathbf{Q}^{i}$ is a sequence of words denoting the $i^{\mathrm{th}}$ question, $\mathcal{C}^{i}=\{\mathbf{C}^{i}_{j}:j=1,2,\ldots,K\}$ is the corresponding choice set with $K$ choices (each also a word sequence), and $y^{i}\in\{1,2,\ldots,K\}$ is the ground-truth label. We denote the answer as $\mathbf{C}^{i}_{y^{i}}$ and the distractors as $\mathbf{C}^{i}_{j}$ for $j\neq y^{i}$.

Our text generators are pretrained generative language models, finetuned to maximize the log-likelihood of a sequence of text $\mathbf{W}$: $\mathcal{L}_{W}(\boldsymbol{\theta})=\sum_{t=1}^{T}\log P(w_{t}|\mathbf{W}_{1:t-1};\boldsymbol{\theta})$, where $\mathbf{W}_{1:t-1}$ denotes a subsequence of $\mathbf{W}$ ($\mathbf{W}_{1:0}$ denotes the empty sequence) and $\boldsymbol{\theta}$ denotes the model parameters. Below, we describe how we use variations of this objective to finetune different LMs to generate questions, answers and distractors; specific modifications for other tasks, e.g. textual entailment, are discussed in Appendix A.

Generating Synthetic Questions

To train our question generator, we finetune the LM on the training question set $\{\mathbf{Q}^{i}\}$ to optimize the language modeling objective $\mathcal{L}_{q}(\boldsymbol{\theta}_{q})=\sum_{i=1}^{N}\log P(\mathbf{Q}^{i};\boldsymbol{\theta}_{q})$, where $\boldsymbol{\theta}_{q}$ denotes the parameters of the question generator. After finetuning, we generate new questions with nucleus sampling Holtzman et al. (2020), which is suitable for generating long-form text.
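
For concreteness, the sketch below finetunes a question generator on the language modeling objective above and then decodes with nucleus sampling, using the HuggingFace library the authors report using (Appendix C). The tiny `questions` list, the per-example gradient steps, and all hyperparameters are illustrative placeholders rather than the paper's settings.

```python
# A minimal question-generator sketch: finetune GPT-2 on training questions,
# then sample new questions with nucleus (top-p) sampling.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)

questions = ["Q: Where can I stand on a river to see water falling without getting wet?"]

model.train()
for q in questions:  # one gradient step per question, for brevity
    ids = tokenizer(q + tokenizer.eos_token, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # token-level log-likelihood objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
out = model.generate(
    tokenizer("Q:", return_tensors="pt").input_ids,
    do_sample=True, top_p=0.9, max_length=40,   # nucleus sampling
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```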

Generating Synthetic Answers and Distractors

To generate choice sets, we independently finetune two separate generative LMs, one for answers and the other for distractors. The answer and distractor generators are trained to maximize the conditional log-likelihood of the answer and the distractors, respectively, given the question. Mathematically, we optimize both $\mathcal{L}_{a}(\boldsymbol{\theta}_{a})=\sum_{i=1}^{N}\log P(\mathbf{C}^{i}_{y^{i}}|\mathbf{Q}^{i};\boldsymbol{\theta}_{a})$ and $\mathcal{L}_{d}(\boldsymbol{\theta}_{d})=\sum_{i=1}^{N}\sum_{j\neq y^{i}}\log P(\mathbf{C}^{i}_{j}|\mathbf{Q}^{i};\boldsymbol{\theta}_{d})$, where $\boldsymbol{\theta}_{a}$ and $\boldsymbol{\theta}_{d}$ denote the parameters of the answer and distractor generators, respectively. For answers, we use nucleus sampling with low temperature (for long answers) or greedy decoding (for short answers). To encourage diversity across generated distractors, we use nucleus sampling without temperature scaling for the distractors.

Data Relabeling.

Our choice of generative LMs naturally defines labels for the synthetic choice sets. Alternatively, we consider using a supervised task model, trained on the original training set, to relabel a candidate pool of synthetic answers and distractors. This is similar to treating the synthetic questions as unlabeled data and applying self-training. The utility of this self-training can be task-dependent; in our experiments, we use validation performance to determine whether or not to relabel our synthetic training data.
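
A minimal sketch of this relabeling step follows; `score_choice` is a hypothetical stand-in for the finetuned task model's scoring function, not an API from the paper.

```python
# Relabel synthetic choice sets with a task model trained on the original data:
# the highest-scoring choice becomes the new answer (a form of self-training).
def relabel(synthetic_examples, score_choice):
    relabeled = []
    for question, choices in synthetic_examples:
        scores = [score_choice(question, c) for c in choices]
        new_label = max(range(len(choices)), key=lambda j: scores[j])
        relabeled.append((question, choices, new_label))
    return relabeled
```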

3 Synthetic Data Selection and Training

The above generation method can produce a large pool of examples, but training on all of them would be computationally expensive and might harm performance due to noisy generations. Here, we propose three data selection methods aimed at choosing more effective training examples from the generated pool (§3.1). Further, we outline a simple staged training procedure (§3.2) to mitigate the negative impact from noise in the synthetic data.

3.1 Selecting High-quality and Diverse Synthetic Examples

A randomly sampled synthetic dataset may contain examples that are similar to one another, along with low-quality generations Holtzman et al. (2020). We refer to this random selection approach as G-DAug^c-Rand. We hypothesize that a diverse and high-quality synthetic set would benefit the task model more. We present three data selection algorithms that target quality, diversity, and a combination of both.

Filtering with Influence Functions.

We hypothesize that filtering out detrimental synthetic training examples can boost downstream performance Bras et al. (2020). A given training example $x$ is considered detrimental if including $x$ in the training set results in a higher generalization error, approximated by the validation loss, i.e.:

$\mathcal{L}(\mathcal{X},\boldsymbol{\theta})=\frac{1}{|\mathcal{X}|}\sum_{x_{i}\in\mathcal{X}}l(x_{i},\boldsymbol{\theta}),$
$\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}(\mathcal{X}_{tr}\cup\{x\}))-\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}(\mathcal{X}_{tr}))>0.$

This would naively require retraining the model with xx, which is computationally prohibitive. Fortunately, the validation loss change can be efficiently approximated through the use of influence functions Atkinson et al. (1983); Koh and Liang (2017). While previous work focuses on removing or perturbing existing training examples Koh and Liang (2017); Wang et al. (2018), we use influence functions to estimate the effect of including a novel synthetic example.

The main result from previous work (Atkinson et al., 1983; Koh and Liang, 2017) tells us that the influence of upweighting a training example $x$ by some small $\epsilon$ on the model parameters $\hat{\boldsymbol{\theta}}$, with corresponding parameter space $\Theta$, is given by:

$\hat{\boldsymbol{\theta}}_{\epsilon,x}=\underset{\boldsymbol{\theta}\in\Theta}{\mathrm{argmin}}\ \epsilon\,l(x,\boldsymbol{\theta})+\frac{1}{\sum_{i=1}^{N}w_{i}}\sum_{i=1}^{N}w_{i}\,l(x_{i},\boldsymbol{\theta})$
$\mathcal{I}_{up,params}(x):=\left.\frac{d\hat{\boldsymbol{\theta}}_{\epsilon,x}}{d\epsilon}\right|_{\epsilon=0}=-H^{-1}_{\hat{\boldsymbol{\theta}}}\nabla_{\boldsymbol{\theta}}l(x,\hat{\boldsymbol{\theta}}),$

where $w_{i}$ is the weight for training example $x_{i}$ and $H_{\hat{\boldsymbol{\theta}}}$ is the Hessian evaluated at $\hat{\boldsymbol{\theta}}$. The above result is a slight generalization of Koh and Liang (2017), but it is straightforward to generalize their proof to the weighted empirical risk case. Then, we apply the chain rule to get the influence of upweighting $x$ on the validation loss:

$\mathcal{I}_{up,loss}(x):=\left.\frac{d\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}_{\epsilon,x})}{d\epsilon}\right|_{\epsilon=0}=\nabla_{\boldsymbol{\theta}}\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}})^{\top}\mathcal{I}_{up,params}(x).$

Note that $\mathcal{L}(\mathcal{X}_{tr},\boldsymbol{\theta})$ can be rewritten in the following weighted-average form to incorporate a new training example $x_{new}$:

$\mathcal{L}(\mathcal{X}_{tr},\boldsymbol{\theta})=\frac{1}{\sum_{i=1}^{N+1}w_{i}}\sum_{i=1}^{N+1}w_{i}\,l(x_{i},\boldsymbol{\theta}),$

where $w_{i}=1\ \forall i\neq N+1$, $w_{N+1}=0$ and $x_{N+1}=x_{new}$. Adding the new training example $x_{new}$ is equivalent to upweighting $x_{N+1}$ by $\frac{1}{N}$:

$\mathcal{L}(\mathcal{X}_{tr}\cup\{x_{new}\},\boldsymbol{\theta})\propto\frac{1}{N}l(x_{N+1},\boldsymbol{\theta})+\frac{1}{\sum_{i=1}^{N+1}w_{i}}\sum_{i=1}^{N+1}w_{i}\,l(x_{i},\boldsymbol{\theta}).$

Applying the influence function $\mathcal{I}_{up,loss}(x)$, we obtain the following linear approximation of the change in validation loss upon adding the training example $x_{new}$:

$\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}(\mathcal{X}_{tr}\cup\{x_{new}\}))-\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}(\mathcal{X}_{tr}))\approx\frac{1}{N}\mathcal{I}_{up,loss}(x_{new}).$

We adopt the stochastic estimation method described in Koh and Liang (2017) to efficiently compute $\mathcal{I}_{up,loss}$. Detrimental synthetic data will have $\frac{1}{N}\mathcal{I}_{up,loss}>0$.

Another distinction between our approach and Koh and Liang (2017) is that they compute the influence of a single training example on a single test example, whereas we estimate the influence of a synthetic training example on all validation examples at once, which makes our approach scalable to large pools of synthetic data. Our approach, referred to as G-DAug^c-Influence, filters out detrimental synthetic data (i.e., the examples that have a positive estimated influence on the validation loss).
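
The sketch below illustrates this filtering pipeline using the LiSSA-style stochastic estimator from Koh and Liang (2017). The `hvp` callable (a stochastic Hessian-vector product over training batches), the damping and scale constants, and the gradient helpers are illustrative assumptions, not the authors' exact implementation.

```python
# Influence-based filtering: estimate s_test = H^{-1} grad(L_val) once, then
# score every synthetic example with a single gradient dot product.
import torch

def inverse_hvp(v, hvp, damping=0.01, scale=25.0, steps=100):
    # LiSSA recursion: h <- v + (1 - damping) * h - hvp(h) / scale.
    # Its fixed point, divided by `scale`, approximates H^{-1} v
    # (with a small damping term for numerical stability).
    h = [vi.clone() for vi in v]
    for _ in range(steps):
        hv = hvp(h)  # Hessian-vector product estimated on a sampled batch
        h = [vi + (1.0 - damping) * hi - hvi / scale
             for vi, hi, hvi in zip(v, h, hv)]
    return [hi / scale for hi in h]

def influence_on_val_loss(example_grad, s_test):
    # I_up,loss(x) = -grad(L_val)^T H^{-1} grad(l(x)); positive => detrimental.
    return -sum(torch.dot(g.flatten(), s.flatten())
                for g, s in zip(example_grad, s_test))

# s_test = inverse_hvp(grad_of(validation_loss), hvp)
# kept = [x for x in pool if influence_on_val_loss(grad_of(x), s_test) <= 0]
```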

Selecting Diverse Examples.

While G-DAug^c-Influence promotes training data quality, it ignores diversity; we hypothesize that better diversity can provide a more reliable training signal. We propose a simple greedy algorithm that iteratively selects the synthetic training example from the pool that maximizes a diversity measure. Here, we use a simple measure of diversity equal to the number of unique unigrams in the selected training set. Surprisingly, preliminary experiments with a more sophisticated diversity method based on embedding distance did not improve results (see Appendix E for details). We refer to this approach as G-DAug^c-Diversity (see Algorithm 1).

Algorithm 1 G-DAug^c-Diversity
Input: Synthetic data pool $\mathcal{D}_{pool}$, target size $N$
Output: Synthetic dataset $\mathcal{D}_{synthetic}$
Initialization: $\mathcal{D}_{synthetic}\leftarrow\{\}$
repeat
   $x_{max}=\mathrm{argmax}_{x\in\mathcal{D}_{pool}}\ \text{\#n-grams}(\mathcal{D}_{synthetic}\cup\{x\})-\text{\#n-grams}(\mathcal{D}_{synthetic})$
   Add $x_{max}$ to $\mathcal{D}_{synthetic}$
   Remove $x_{max}$ from $\mathcal{D}_{pool}$
until $|\mathcal{D}_{synthetic}|=N$
return $\mathcal{D}_{synthetic}$
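
A direct Python rendering of Algorithm 1, with unigrams as the n-grams and each example treated as a whitespace-tokenized string for simplicity:

```python
# Greedily pick the example that adds the most new unigrams to the selected set.
def diversity_select(pool, target_size):
    selected, vocab = [], set()
    pool = list(pool)
    while len(selected) < target_size and pool:
        gains = [len(set(x.split()) - vocab) for x in pool]
        best = max(range(len(pool)), key=lambda i: gains[i])
        x = pool.pop(best)       # remove x_max from the pool
        selected.append(x)       # add x_max to the selected set
        vocab |= set(x.split())
    return selected
```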

Combining Influence Filtering and Diversity Maximization

G-DAug^c-Influence and G-DAug^c-Diversity have complementary benefits: the former aims to improve the quality of individual examples by filtering out detrimental ones, while the latter is designed to compose a diverse training set but does not consider quality. To reap both benefits, we propose a combined selection technique, G-DAug^c-Combo, that first filters the data using G-DAug^c-Influence, then selects examples according to G-DAug^c-Diversity.

3.2 Training with Synthetic Data

In traditional data augmentation, new data is usually mixed with the original training examples to create an augmented training set Wei and Zou (2019); Kafle et al. (2017). However, when augmenting with data produced by a generative model, label noise can be detrimental to learning Kafle et al. (2017). Moreover, the generated questions themselves can be noisy, i.e., nonsensical or ambiguous (see Table 7 in §5.2). To address this issue, we propose a simple training procedure that treats the synthetic and original data differently. We first train a model on the synthetic data (Synthetic Training), then further train it on the original, human-authored training set (Organic Training). The motivation is to correct any unfavorable noise that may have been learned during the first stage by subsequently training on original data, as more recent training data is favored by neural models Goodfellow et al. (2014).

We also experiment with a mixing approach that minimizes a weighted average of the loss for the synthetic data and the original data, with an importance weight to downweight the synthetic examples to mitigate noise. We find that two-stage training performs better than the importance-weighted loss (see Section 5).
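
Schematically, the two stages (and the mixing alternative) look as follows; `train_epoch`, the data loaders, and the epoch counts are placeholders for a standard finetuning loop, not the paper's configuration.

```python
# Two-stage training: synthetic training first, then organic training on the
# original human-authored data, so the cleaner data is seen most recently.
def two_stage_train(model, synthetic_loader, organic_loader, train_epoch,
                    synthetic_epochs=1, organic_epochs=5):
    for _ in range(synthetic_epochs):   # Stage 1: synthetic training
        train_epoch(model, synthetic_loader)
    for _ in range(organic_epochs):     # Stage 2: organic training
        train_epoch(model, organic_loader)
    return model

# The mixing alternative would instead minimize a single weighted objective,
# e.g. loss = alpha * synthetic_loss + organic_loss with alpha < 1 to
# downweight synthetic examples; the paper finds two-stage training better.
```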

4 Experiments

Method              CSQA (Acc)  WinoGrande (AUC)  Codah (Acc)  HellaSwag-2K (Acc)  Average
RoBERTa (reported)  72.1        66.4              -            -                   -
RoBERTa (ours)      71.6        67.5              82.3         75.4                74.2
BackTranslation     70.2        67.2              81.8         73.0                73.1
G-DAug^c-Rand       71.8        70.9              83.6         75.9                75.6
G-DAug^c-Influence  72.1        70.9              84.3         75.8                75.8
G-DAug^c-Diversity  72.3        71.2              83.5         76.1                75.8
G-DAug^c-Combo      72.6        71.4              84.0         76.8                76.2
Table 1: Results on the test sets of four commonsense benchmarks. RoBERTa (reported) is the result for the RoBERTa-large baseline reported on public leaderboards (https://leaderboard.allenai.org/winogrande/submissions/public, https://www.tau-nlp.org/csqa-leaderboard). RoBERTa (ours) is a re-evaluation of the RoBERTa-large model using our setup. All G-DAug^c methods outperform the baseline methods, and G-DAug^c-Combo performs the best overall.

We present experiments on four commonsense multiple choice QA benchmarks: CommonsenseQA Talmor et al. (2019), WinoGrande Sakaguchi et al. (2020), Codah Chen et al. (2019) and HellaSwag Zellers et al. (2019). Our techniques are also directly applicable to other closed-book multiple choice QA setups, such as science QA, and, with minor modifications, to textual entailment tasks. To evaluate G-DAug^c's extensibility to these settings, we also experiment with a textual entailment task, SNLI Bowman et al. (2015), and a closed-book version of the ARC-Challenge Scientific QA task Clark et al. (2018), in which access to the scientific corpus for the ARC dataset (or any other information source) is disallowed at test time. We simulate low-resource settings on the large HellaSwag and SNLI datasets by downsampling them to 2K and 3K training samples, respectively; the other datasets are either already low-resource or have a low-resource component. Dataset details are provided in Appendix A.

Robustness Evaluation

In addition to measuring in-distribution performance, we also analyze robustness to perturbed or adversarial data. Following Wei and Zou (2019), we perform WordNet-based Fellbaum (1998) synonym replacement on the validation or test set (when test labels are available) with a 10% replacement rate, using https://github.com/jasonwei20/eda_nlp. Our second evaluation, with TextFooler Jin et al. (2020), identifies the most important words and replaces them with the most semantically and grammatically suitable substitutes until the model prediction is altered. We adopt two metrics to measure robustness under TextFooler's attacks: 1) failure rate, the proportion of examples for which TextFooler fails to change the prediction, and 2) average perturbation ratio, the average fraction of words replaced when TextFooler succeeds in altering a prediction. We re-implement TextFooler with two minor changes: we only swap words in questions, not answers, and we replace the Universal Sentence Encoder with SRoBERTa Reimers and Gurevych (2019).
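
For concreteness, a rough sketch of the WordNet synonym-replacement perturbation is shown below. It follows the spirit of the EDA implementation linked above but omits details such as stopword handling, so it should be read as an approximation.

```python
# Replace ~10% of the words in a sentence with WordNet synonyms.
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") first

def synonym_replace(sentence, rate=0.1):
    words = sentence.split()
    n = max(1, int(rate * len(words)))
    for i in random.sample(range(len(words)), min(n, len(words))):
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i])
                  for l in s.lemmas()} - {words[i]}
        if lemmas:  # only replace when a synonym exists
            words[i] = random.choice(sorted(lemmas))
    return " ".join(words)
```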

Method              CSQA  WinoGrande  Codah  HellaSwag-2K  Average
RoBERTa (ours)      69.9  63.8        74.7   63.2          67.9
BackTranslation     69.0  62.3        75.5   65.4          68.1
G-DAug^c-Rand       72.1  65.5        75.9   64.1          69.4
G-DAug^c-Influence  71.0  65.7        76.2   64.3          69.3
G-DAug^c-Diversity  71.6  66.0        76.0   64.8          69.6
G-DAug^c-Combo      72.0  66.0        76.0   65.2          69.8
Table 2: Results on WordNet-based synonym replacement sets. For Codah and HellaSwag-2K, we perturb the test sets, as the labels are available. G-DAug^c-Combo achieves the highest average score.

4.1 Experimental Settings

We use RoBERTa Liu et al. (2019) as our pretrained task model, and GPT-2 Radford et al. (2019) as our pretrained generator, both implemented with the HuggingFace library Wolf et al. (2019). We use validation performance to decide whether to relabel for CommonsenseQA and WinoGrande, and apply relabeling by default on all other tasks (tuning this choice may boost performance). To perform a controlled comparison, we restrict the synthetic set size to be equal across all methods. We repeat all experiments with 10 random restarts and pick the best model based on validation performance. Additional experimental details, with hyperparameters, are provided in Appendix C.

Baselines

Our first baseline is a finetuned RoBERTa model with no augmentation. We also compare with existing work on data augmentation via the BackTranslation approach of Xie et al. (2019) (https://github.com/google-research/uda/); under our setting, the original and backtranslated data are mixed at random.

4.2 In-Distribution Results

Our main results for commonsense question answering are reported in Table 1. All G-DAug^c variants outperform the baselines, highlighting the impact of generative data augmentation. On average, every other variant achieves higher test performance than G-DAug^c-Rand, which further highlights the importance of our data selection approaches. In addition, the influence and diversity selection methods score similarly; however, their combination (in G-DAug^c-Combo) outperforms either alone, which suggests that they are complementary selection approaches. More specifically, G-DAug^c-Combo performs the best on 3/4 tasks and obtains the highest average score. Further, G-DAug^c-Combo provides a 5.0% absolute gain over previously published state-of-the-art results on WinoGrande (these results are state-of-the-art for our model class; higher scores have been obtained using a T5 model with roughly an order of magnitude more parameters than ours). For CommonsenseQA, G-DAug^c-Combo outperforms the previous non-ensemble state-of-the-art Zhu et al. (2020) by 0.4%. We also achieve a new state-of-the-art on Codah, where the previous best (BERT-based) score was 67.5% Chen et al. (2019). We find that BackTranslation hurts performance, and uniformly underperforms the RoBERTa baseline. See Appendix B for validation set results.

4.3 Robustness Results

Table 2 presents our evaluation on synonym replacement sets. The G-DAug^c variants outperform the baselines, and G-DAug^c-Combo obtains the best average performance. Table 3 shows results on the TextFooler adversarial attacks. Models trained with data augmentation are more robust to adversarial attacks, as all G-DAug^c variants and BackTranslation outperform the RoBERTa baseline on both metrics. G-DAug^c-Diversity obtains the best failure rate and average perturbation ratio (higher is better for both metrics), and G-DAug^c-Combo performs comparably with slightly worse numbers. Overall, the findings suggest that optimizing diversity increases robustness.

Method              CSQA         WinoGrande  Codah        HellaSwag-2K  Average
RoBERTa (ours)      14.8 / 12.6  4.5 / 7.8   30.9 / 15.8  17.4 / 9.8    16.9 / 11.5
BackTranslation     17.0 / 12.9  5.0 / 8.2   37.1 / 15.9  20.2 / 10.2   19.8 / 11.8
G-DAug^c-Rand       15.6 / 13.0  5.7 / 8.4   36.2 / 15.9  20.0 / 10.6   19.4 / 12.0
G-DAug^c-Influence  16.3 / 12.8  5.4 / 8.4   34.9 / 15.8  19.2 / 10.7   19.0 / 11.9
G-DAug^c-Diversity  16.0 / 12.9  5.9 / 8.4   36.1 / 16.2  21.4 / 10.4   19.9 / 12.0
G-DAug^c-Combo      16.5 / 12.6  5.9 / 8.5   35.2 / 15.7  21.3 / 10.5   19.7 / 11.8
Table 3: Robustness to TextFooler-based adversarial attacks (failure rate / average perturbation ratio; higher is better for both). Models trained with augmented data are more robust to TextFooler's attacks than models without data augmentation. On average, G-DAug^c-Diversity performs the best.

4.4 Results on ARC and SNLI

We explore G-DAug^c's applicability outside of the commonsense domain in Table 4, via evaluation on the closed-book ARC-Challenge Scientific QA. Valid science questions are hard to generate because their semantics need to be precise, and we find that many of G-DAug^c's generations for ARC are noisy. Perhaps surprisingly, G-DAug^c nonetheless outperforms the baselines by a large margin. G-DAug^c-Influence achieves the best in-distribution performance, while G-DAug^c-Diversity is the most robust against TextFooler but has worse accuracy than G-DAug^c-Rand. This may suggest that optimizing for quality is more important when the synthetic data is noisier.

Method              ARC-Challenge Scientific QA          SNLI-3K
                    Val.  Test  Syn.  TF:Fail  TF:Pert   Val.  Test  Syn.  TF:Fail  TF:Pert  NLI Diag.
RoBERTa (ours)      43.5  39.4  35.2  6.6      9.3       91.8  88.6  77.5  17.0     20.2     56.7
BackTranslation     43.1  43.1  42.4  6.6      9.3       91.2  88.1  81.0  18.8     21.7     54.0
G-DAug^c-Rand       50.8  48.1  43.4  12.9     10.8      91.8  89.0  78.6  17.7     20.6     57.4
G-DAug^c-Influence  51.5  48.5  45.2  12.4     11.0      92.3  88.7  78.6  18.0     20.7     56.9
G-DAug^c-Diversity  49.5  47.5  42.2  13.9     10.8      92.0  89.0  79.4  19.0     20.5     57.7
G-DAug^c-Combo      50.8  48.2  43.8  13.1     10.7      91.9  88.7  78.7  16.7     20.5     57.6
Table 4: Results on closed-book ARC-Challenge Scientific QA and SNLI-3K, along with robustness to synonym replacement (Syn.), TextFooler (TF) attacks and NLI Diagnostics. G-DAug^c improves accuracy and robustness.

We also evaluate G-DAug^c on a textual entailment task using the SNLI dataset Bowman et al. (2015) in Table 4. This task has a different format: it is a pair-wise classification task with 3 labels (details in Appendix A). We find that G-DAug^c slightly improves accuracy and robustness over the baselines. The performance is likely affected by a label skew introduced by influence-based filtering.

5 Analysis and Discussion

We now analyze G-DAug^c's performance, focusing on WinoGrande, where G-DAug^c offers the most benefit. We first identify several factors that affect performance, and then present evidence that G-DAug^c works by transferring knowledge from the pretrained generator to the task model.

5.1 Factors that Affect G-DAug^c's Performance

G-DAug^c is effective at different training sizes.

Figure 3: Validation results for different training set sizes on the WinoGrande dataset (in log scale). G-DAug^c helps more for smaller training sizes.

Figure 3 illustrates that our winning strategy, G-DAug^c-Combo, remains effective on WinoGrande as the amount of training data varies. The improvement over the baseline is largest in the low-resource (small training size) regime. For the smallest sizes, XS and S, G-DAug^c-Combo increases the effective training size by a factor of 4 (i.e., training on XS or S matches unaugmented RoBERTa's performance on S or M, respectively). In contrast, BackTranslation only helps for the XS size, and hurts performance at larger sizes.

Staged training is essential.

G-DAug^c uses a two-stage training method (Section 3.2) aimed at mitigating the effect of noise in the generated data. We analyze alternative training protocols on the WinoGrande-L dataset: Mixing (training on the union of generated and original data) and Importance-Weighted Loss. Compared to a no-augmentation baseline (with an accuracy of 75.9), two-stage training (+1.8) outperforms both mixing (+0.0) and the importance-weighted loss (+0.7).

Filtering synthetic data does not hurt accuracy.

      Random  Influence  Diversity  Whole Pool
Size  127478  127478     127478     380700
Acc   71.7    74.4       73.0       73.1
Table 5: Results comparing G-DAug^c's filtering methods against using the entire synthetic data pool for augmentation, on WinoGrande-M.

G-DAug^c's filtering methods are designed to identify a high-quality and diverse subset of the generated data, to reduce training cost (compared to training on the entire generated pool) without harming accuracy. We evaluate whether G-DAug^c achieves this in Table 5, by comparing G-DAug^c-Influence and G-DAug^c-Diversity against using the entire synthetic data pool (G-DAug^c-Combo utilizes a larger pool, so it is not comparable). The selection approaches provide comparable or better accuracy than using the entire pool, despite using a third as much data.

5.2 Why Does G-DAug^c Work?

Below, we present analysis suggesting that G-DAug^c works by transferring knowledge from the pretrained model to the task model. In particular, we find that using a pretrained generator is critical, and that the generated questions are often coherent, include new semantic units, and carry informative labels.

Using a Pretrained Generator is critical.

We analyze the impact of the pretrained generator by comparing our standard G-DAug^c-Rand setting with a setting where the generator is not pretrained, but instead trained from scratch. We find that using GPT-2 trained from scratch results in a score of 67.8% on the WinoGrande-M validation set. This is a slight improvement (by 0.2%) over the unaugmented baseline, but is far inferior to the 3.9% improvement obtained when using the pretrained GPT-2. This suggests that using a pretrained generator is critical for G-DAug^c.

Synthetic data labels are important.

WinoGrande-L CSQA
Baseline 75.9 77.1
Generator label 76.2 78.1
Random relabeling 66.8 77.1
Model relabeling 77.7 77.7
Table 6: Validation accuracy of G-DAug^c with different labeling methods on WinoGrande-L and CommonsenseQA. Random labels hurt accuracy, and model relabeling helps on WinoGrande but not on CommonsenseQA.

Even fully unsupervised language model pretraining can improve performance when using task-relevant data Gururangan et al. (2020). This raises the question of whether G-DAug^c boosts performance by simply exposing the model to more task-relevant text, or if the generated labels are in fact informative. A related question is whether G-DAug^c's optional self-supervised relabeling improves performance. We analyze these questions for WinoGrande-L and CommonsenseQA in Table 6, evaluating G-DAug^c with three labeling methods: (i) generator labels, (ii) random relabeling, and (iii) relabeling with a task model. When the generator labels are flipped randomly, G-DAug^c is unable to outperform the baselines for either dataset (in fact, it dramatically underperforms on WinoGrande-L). This implies that the correctness of labels is crucial for G-DAug^c. Self-supervised relabeling provides a 1.5% absolute gain on WinoGrande-L, but a 0.4% drop on CommonsenseQA, which suggests its utility is task-dependent.

Rating Description Examples Count Pct.
1 Nonsensical
What is a square leg made of made out of?
What country does a cow go to make a milk run?
54 3.89%
2 Ambiguous or unanswerable
A person is a human, but they are called what?
He hated flying, the controls were what?
306 22.06%
3 Minor errors (e.g., grammar)
What do you put on your head to do when you’re swimming?
Where does a bugle call be played?
138 9.95%
4 Coherent and Fluent
What is a person likely to feel when applying for jobs?
If you’re running late for work what would you be doing?
889 64.10%
Table 7: Examples and prevalence of generated commonsense questions with different manually-assigned fluency ratings, for the CommonsenseQA dataset. Ratings of 3 and higher correspond to questions that are answerable and address common sense, and most of G-DAug^c's generated questions fall into this category.
Figure 4: OpenIE analysis of the original data and the synthetic data used by G-DAug^c-Combo on WinoGrande-M. The synthetic dataset contains many more unique semantic units than the original dataset.

G-DAug^c introduces new semantic units.

We investigate how distinct the generated questions are from each other and from the original training data. We observe that G-DAug^c only rarely generates exact duplicate questions (e.g., on CommonsenseQA, 0.06% of the questions are duplicates). We further investigate whether G-DAug^c introduces new entities and relations to the training data, or merely reuses the ones found in the original training set. We quantify the diversity of our synthetic dataset relative to the original data by counting the number of unique semantic units produced by performing Open Information Extraction Banko et al. (2007) on the data. Specifically, we run the Stanford Open IE package Angeli et al. (2015) and report the number of unique triplets, relations and entities extracted from our WinoGrande-M datasets in Figure 4. The synthetic data includes many more unique semantic units than the original training data, suggesting that G-DAug^c does introduce new semantic units into the training set.

G-DAug^c produces mostly fluent questions.

To evaluate G-DAug^c's output for fluency, we employ three human annotators to rate generated CommonsenseQA questions for their coherence and answerability on a scale of 1 to 4, where a rating of 3 denotes an acceptable question. We obtained a total of 1,387 labels. We measured annotator agreement on a separate set of 50 questions, obtaining a Fleiss' Kappa of 0.41, which is at the low end of moderate agreement, acceptable given the subjective nature of the task. A large majority (74.04%) of questions met the acceptability threshold, with an overall average rating of 3.34. Examples are shown in Table 7.

Next, we ask annotators to answer the 1,027 acceptable questions, where they can edit choices (but not questions) if they are unable to pick a unique correct answer from the given choices. The editing rate is relatively high, at 55.3%. We mix these human-labeled examples with the original training set to train a RoBERTa model, and obtain 78.1% validation accuracy, which is comparable to G-DAug^c, despite using approximately 50x fewer questions. This suggests that human labels can provide higher leverage than the noisy labels from G-DAug^c, although human labeling is expensive.

Additional analyses, provided in Appendix F, show that model sharpness, approximated by the Hessian trace Yao et al. (2019), does not completely explain G-DAug^c's performance, and that G-DAug^c is more effective than ensembling with a finetuned generator.

6 Related Work

Data augmentation is a common practice in computer vision, where it takes the form of image transformations like translation and rotation Perez and Wang (2017). For language tasks, data augmentation is less straightforward. Broadly, previous augmentation methods have used back-translation architectures Sennrich et al. (2016); Xie et al. (2019), heuristics based on syntactic and semantic properties of text including word replacements using a thesaurus Zhang et al. (2015); Wei and Zou (2019) and word embeddings Wang and Yang (2015); Fadaee et al. (2017); Kobayashi (2018); Wu et al. (2019), and recently, generative models for synthesizing novel examples for text classification and reading comprehension Anaby-Tavor et al. (2020); Kumar et al. (2020); Puri et al. (2020b). Our framework is similar to the last of these as we focus on generative models for data augmentation, but our work is the first to present a generative approach for the challenging commonsense QA setting, and we introduce new data selection approaches to improve the informativeness and diversity of synthetic data.

Concurrently, there has been work on generating adversarial examples for analyzing black-box classifiers. These approaches use generative adversarial networks Zhao et al. (2018b) and population-based optimization algorithms Alzantot et al. (2018). Previous work has also presented methods to generate questions for reading comprehension Heilman and Smith (2010); Rus et al. (2011); Alberti et al. (2019); Puri et al. (2020a), online tutoring Lindberg et al. (2013), factual QA Serban et al. (2016) and visual question generation Mostafazadeh et al. (2016). A comprehensive survey on neural question generation can be found in Pan et al. (2019). Our work is distinct in that it targets question generation in a closed-book setting, investigates the generation of answers as well as distractors, and is aimed at data augmentation.

7 Conclusion

We introduced G-DAug^c, a novel data augmentation framework that generates synthetic training data while preserving quality and diversity. We demonstrate that G-DAug^c is effective on multiple commonsense reasoning benchmarks, with improvements in in-distribution performance as well as robustness against perturbed evaluation sets and challenge sets. Our analysis shows that G-DAug^c tends to perform better in low-resource settings and that our data selection strategies are important for performance. Future work might explore more sophisticated methods to enhance the quality and diversity of generated training data, including having humans in the loop for relabeling.

Acknowledgments

This work was supported in part by NSF Grant IIS-1351029. We thank Iz Beltagy, Jonathan Bragg, Isabel Cachola, Arman Cohan, Mike D’Arcy, Daniel King, Kyle Lo, and Lucy Lu Wang for helpful comments.

References

Appendix A Datasets

CommonsenseQA

Talmor et al. (2019): CommonsenseQA is a multiple choice QA dataset consisting of 12,247 examples, which aims to test commonsense reasoning capabilities. We use the official random split (v1.11), which is an 80/10/10 split. We apply greedy decoding to generate answers, as answers are fairly short for this dataset.

WinoGrande

Sakaguchi et al. (2020): WinoGrande is a benchmark for commonsense reasoning, inspired by the original Winograd Schema Challenge design Levesque et al. (2011), with a larger dataset size and higher difficulty level. It consists of 44K questions with five different training sizes: 160, 640, 2,558, 10,234 and 40,398 questions. The evaluation metric is Area Under the (learning) Curve. We observe that applying top-2 greedy decoding to the answer generator yields a satisfactory set of choices, so the distractor generator is not used for this task. The Winograd schema requires that questions in twin pairs have opposite labels Levesque et al. (2011). We use the following method to generate twin questions: (1) generate a sequence until a blank symbol "_" is produced; (2) use two independent runs of sampling to complete the question in two different ways, forming twins. This process does not guarantee that the labels of the two twins will differ, so we further filter out generated pairs that do not have different labels (see the sketch below).
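
The twin-generation loop can be sketched as follows; `sample_until`, `complete`, and `answer_of` are hypothetical stand-ins for the generator and labeling calls described above, not functions from the paper.

```python
# Generate a WinoGrande-style twin pair: sample a prefix up to the blank "_",
# complete it twice with independent sampling runs, and keep the pair only if
# the two completions take different labels.
def generate_twin_pair(sample_until, complete, answer_of):
    prefix = sample_until("_")           # step 1: generate until the blank
    twin_a = prefix + complete(prefix)   # step 2: two independent completions
    twin_b = prefix + complete(prefix)
    if answer_of(twin_a) != answer_of(twin_b):
        return twin_a, twin_b
    return None  # filtered out: twins do not have different labels
```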

Codah

Chen et al. (2019): Codah is an adversarially-constructed benchmark which tests commonsense reasoning using sentence-completion questions, inspired by the Swag dataset Zellers et al. (2018). It contains 2,801 questions in total, and uses 5-fold cross validation for evaluation (the original CODAH work does not specify a particular 5-fold split, so we choose the folds randomly; we will release our splits for replicability). We lower the temperature to 0.5 for answer generation in order to increase the confidence of the generated answers.

HellaSwag

Zellers et al. (2019): HellaSwag is a more challenging version of the Swag dataset Zellers et al. (2018), and the task is similar to Codah. The dataset consists of 70K questions where each question comes from one of two domains: ActivityNet or WikiHow. In order to test our methods under a low-resource setting, we downsample the training set to 2,000 examples. We take a random sample of 1000 questions from the original validation set to serve as our validation data, and another non-overlapping random sample of 5,000 questions from the same set as our test data. The generation settings are the same as Codah’s.

SNLI

Bowman et al. (2015): SNLI is a natural language inference dataset with 570K pairs of labeled sentences. The label assigned to each sentence pair is one of entailment, contradiction or neutral. For low-resource experiments, we downsample the dataset to 3K training examples, containing 1K unique premises with one hypothesis for each of the three labels. Similarly, we use a downsampled development set with 999 examples (333 premises, each with one hypothesis per label). The generative model is finetuned by providing the premise, label and hypothesis, separated by special delimiters marking the beginning and end of each element.

ARC-Challenge

Clark et al. (2018): The ARC dataset consists of 7,787 natural grade-school science questions that are used on standardized tests. The ARC-Challenge set contains 2,590 questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We use the official split, which has 1,119 train, 299 validation, and 1,172 test examples. The generation settings are the same as CommonsenseQA's.

Appendix B Validation Set Results

In Table 8, we summarize our main results on the validation sets, comparing the G-DAug^c methods against an unaugmented baseline and a backtranslation augmentation baseline. All G-DAug^c methods consistently outperform the baseline methods on every benchmark. The proposed selection methods provide an extra boost on average, compared to G-DAug^c-Rand. Among those, G-DAug^c-Influence achieves the best performance across all tasks, which is expected as it selects examples that are helpful in reducing validation loss. Interestingly, G-DAug^c-Combo scores lower than G-DAug^c-Influence, although it outperforms G-DAug^c-Diversity. Finally, backtranslation does not demonstrate any benefit and obtains lower results than the unaugmented baseline on all benchmarks.

Method              CSQA (Acc)  WinoGrande (AUC)  Codah (Acc)  HellaSwag-2K (Acc)  Average
RoBERTa (reported)  78.4        66.6              -            -                   -
RoBERTa (ours)      77.1        68.4              84.2         75.2                76.2
BackTranslation     76.4        67.7              83.4         74.2                75.4
G-DAug^c-Rand       78.1        72.0              85.7         77.2                78.3
G-DAug^c-Influence  78.8        73.0              87.2         78.3                79.3
G-DAug^c-Diversity  78.1        72.8              86.0         76.6                78.4
G-DAug^c-Combo      78.2        72.7              86.7         77.5                78.8
Table 8: Results on the validation sets of four commonsense benchmarks. All G-DAug^c methods outperform the baseline methods; in particular, G-DAug^c-Influence performs the best on all tasks, which is expected as it selects examples that are helpful in reducing validation loss.

Appendix C Hyperparameter Settings and Input Formats

Hyperparameter settings for finetuning GPT-2, RoBERTa and G-DAug^c are shown in Tables 11, 12, 14, 15 and 16. We manually tune the learning rate and the number of epochs for GPT-2 finetuning based on validation perplexity. For finetuning RoBERTa baseline models, we select the number of epochs from {1,3,5,8,10} based on validation accuracy for CSQA, WinoGrande and HellaSwag-2K. For Codah, SNLI-3K and ARC-Challenge, we simply use 5 epochs. For G-DAug^c synthetic training, we train all models using a learning rate of 5e-6 for one epoch. For G-DAug^c organic training, we use the same hyperparameter settings as the RoBERTa baselines (except for CSQA and HellaSwag-2K, where we find that training for 2 fewer epochs gives significantly better results). In Tables 9 and 10, we specify the input formats for finetuning GPT-2 and RoBERTa. Finally, we benchmark the running time of our implementations of the influence and diversity selection methods on the task of selecting 127,478 examples from a pool of 380,700 candidates for WinoGrande-M. We use one Nvidia 2080 Ti GPU and one Intel Core i9-7900X with 10 cores and a clock speed of 3.3 GHz. The running time of the influence and diversity algorithms is about 8.3 hours and 2.9 hours, respectively.

Task            Format
CSQA            Q: Where can I stand on a river to see water falling without getting wet? A: waterfall </s>
WinoGrande      </s>Feeling a draft, William asked Neil to please close the front door because _ was closer.</s>Neil</s>
Codah           </s>I am always very hungry before I go to bed. I am</s>concerned that this is an illness.</s>
HellaSwag-2K    </s>A man is on a sandy beach, playing croquette. he</s>is parasailing, making a random move.</s>
SNLI-3K         <PREM>Five black dogs run in a field.</PREM><ANS>entailment</ANS><HYP>Some animals running.</HYP>
ARC-Challenge   Q: Which of the following is an example of a physical change? A: breaking a glass </s>
Table 9: Input formats for GPT-2. "Q:" and "A:" are the prefixes for a question and a candidate answer (choice).
Task            Format
CSQA            <s>Q: Where can I stand on a river to see water falling without getting wet?</s> A: waterfall </s>
WinoGrande      <s>Feeling a draft, William asked Neil to please close the front door because _ was closer.</s>Neil</s>
Codah           <s>I am always very hungry before I go to bed. I am</s>concerned that this is an illness.</s>
HellaSwag-2K    <s>A man is on a sandy beach, playing croquette. he</s>is parasailing, making a random move.</s>
SNLI-3K         <s>Five black dogs run in a field.</s>Some animals running.</s>
ARC-Challenge   <s>Q: Which of the following is an example of a physical change?</s>A: breaking a glass </s>
Table 10: Input formats for RoBERTa. "Q:" and "A:" are the prefixes for a question and a candidate answer (choice).
Hyperparam CSQA WinoGrande Codah HellaSwag-2K SNLI-3K ARC-Challenge
Version Large Medium Medium Medium Large Medium
Hardware I9-7900X RTX 2080Ti RTX 2080Ti RTX 2080Ti RTX 8000 RTX 2080Ti
Optimizer AdamW AdamW AdamW AdamW AdamW AdamW
Adam β1\beta_{1} 0.9 0.9 0.9 0.9 0.9 0.9
Adam β2\beta_{2} 0.98 0.98 0.98 0.98 0.999 0.98
Adam ϵ\epsilon 1e-6 1e-6 1e-6 1e-6 1e-8 1e-6
Mixed Precision No Yes Yes Yes Yes Yes
LR (q/a/d) 1e-5/5e-6/2e-5 * 4e-5/5e-5/5e-5 4e-5/5e-5/5e-5 5e-5 2e-5/1e-5/1e-5
Epochs (q/a/d) 3/5/3 * 3/3/3 3/3/3 3 3/5/5
Grad Clipping 1.0 1.0 1.0 1.0 1.0 1.0
Weight Decay 0.01 0.01 0.01 0.01 0.0 0.01
Batch Size 16 16 16 16 16 16
Max Length (q/a/d) 62/70/70 72/72/- 62/92/92 62/128/128 128 90/120
Warmup Ratio 0.06 0.06 0.06 0.06 0.06 0.06
LR Decay Linear Linear Linear Linear Linear Linear
Table 11: Hyperparameter settings for finetuning GPT-2. "q/a/d" stands for "question/answer/distractor". Some hyperparameters for WinoGrande (marked *) are shown in a separate table, as they vary with the training size.
Hyperparam XS S M L XL
LR (q/a) 5e-5/5e-5 2e-5/5e-5 2e-5/5e-5 2e-5/5e-5 1e-5/5e-5
Epochs (q/a) 8/12 6/6 3/3 3/3 3/1
Table 12: Hyperparameter settings for finetuning GPT-2 on WinoGrande.
Test AUC
Baseline 67.5
Baseline + Generator 67.5
G-DAug^c-Combo 71.4
Table 13: Test performance of an unaugmented baseline model and the same model ensembled with a finetuned GPT-2 generator on WinoGrande. We use weighted average ensemble with weights tuned on validation data.
Hyperparam CSQA WinoGrande Codah HellaSwag-2K SNLI-3K ARC-Challenge
Version Large Large Large Large Large Large
Hardware RTX 2080Ti RTX 2080Ti RTX 2080Ti RTX 2080Ti RTX 8000 RTX 2080Ti
Optimizer AdamW AdamW AdamW AdamW AdamW AdamW
Adam β1\beta_{1} 0.9 0.9 0.9 0.9 0.9 0.9
Adam β2\beta_{2} 0.98 0.98 0.98 0.98 0.98 0.98
Adam ϵ\epsilon 1e-6 1e-6 1e-6 1e-6 1e-6 1e-6
Mixed Precision Yes Yes Yes Yes Yes Yes
LR 1e-5 * 1e-5 1e-5 1e-5 1e-5
Epochs 5 * 5 3 5 5
Grad Clipping 0.0 0.0 0.0 0.0 0.0 0.0
Weight Decay 0.01 0.01 0.01 0.01 0.01 0.01
Batch Size 16 16 16 16 16 16
Max Length 70 70 90 128 128 120
Warmup Ratio 0.06 0.06 0.06 0.06 0.06 0.06
LR Decay Linear Linear Linear Linear Linear Linear
Table 14: Hyperparameter settings for finetuning RoBERTa. Some hyperparameters for WinoGrande are shown in a separate table as they vary with the training set size.
Hyperparam XS S M L XL
LR 1e-5 1e-5 1e-5 1e-5 1e-5
Epochs 10 8 5 5 5
Table 15: Hyperparameter settings for finetuning RoBERTa on WinoGrande.
Hyperparam CSQA WinoGrande Codah HellaSwag-2K SNLI-3K ARC-Challenge
Synthetic Data Size 50K ~50K-130K (see caption) 100K 50K 100K 50K
LR (synthetic) 5e-6 5e-6 5e-6 5e-6 5e-6 5e-6
Epochs (synthetic) 1 1 1 1 1 1
Table 16: Additional hyperparameter settings for G-DAug^c two-stage training. For finetuning on the original data, we use the same settings as the RoBERTa baselines (except for CSQA and HellaSwag-2K, where we find that training for 2 fewer epochs gives significantly better results). For WinoGrande, we generate 400K examples before the rejection procedure (see Appendix A); the number of examples retained after rejection ranges from approximately 50K to 130K, depending on the training size.

Appendix D Influence Functions

In practice, since the generalization error is usually approximated by validation loss, a training example $x_{i}$ is considered detrimental if it increases validation loss, i.e.:

$\mathcal{L}(\mathcal{X},\boldsymbol{\theta})=\frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}}l(x,\boldsymbol{\theta}),$  (1)
$\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}(\mathcal{X}_{train}\cup\{x_{i}\}))-\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}(\mathcal{X}_{train}))>0,$  (2)

where $\mathcal{X}_{train}=\{x_{i}\}_{i=1}^{N}$ is a training set, $\mathcal{X}_{val}=\{x_{i}\}_{i=1}^{M}$ is a validation set, $l$ is a loss function, and $\hat{\boldsymbol{\theta}}(\mathcal{X}_{train})=\underset{\boldsymbol{\theta}\in\Theta}{\mathrm{argmin}}\ \mathcal{L}(\mathcal{X}_{train},\boldsymbol{\theta})$ is an empirical risk minimizer.

The main result from previous work (Atkinson et al., 1983; Koh and Liang, 2017) tells us that the influence of upweighting a training example $x$ by some small $\epsilon$ on the model parameters $\hat{\boldsymbol{\theta}}$, with corresponding parameter space $\Theta$, is given by:

$\hat{\boldsymbol{\theta}}_{\epsilon,x}=\underset{\boldsymbol{\theta}\in\Theta}{\mathrm{argmin}}\ \epsilon\,l(x,\boldsymbol{\theta})+\frac{1}{\sum_{i=1}^{N}w_{i}}\sum_{i=1}^{N}w_{i}\,l(x_{i},\boldsymbol{\theta})$  (3)
$\mathcal{I}_{up,params}(x):=\left.\frac{d\hat{\boldsymbol{\theta}}_{\epsilon,x}}{d\epsilon}\right|_{\epsilon=0}=-H^{-1}_{\hat{\boldsymbol{\theta}}}\nabla_{\boldsymbol{\theta}}l(x,\hat{\boldsymbol{\theta}}),$  (4)

where $w_{i}$ is the weight for training example $x_{i}$ and $H_{\hat{\boldsymbol{\theta}}}=\frac{1}{\sum_{i=1}^{N}w_{i}}\sum_{i=1}^{N}w_{i}\nabla_{\boldsymbol{\theta}}^{2}l(x_{i},\hat{\boldsymbol{\theta}})$ is the Hessian evaluated at $\hat{\boldsymbol{\theta}}$. The above result is a slight generalization of Koh and Liang (2017), since the simple average used in that work is a special case of our weighted average; it is straightforward to generalize their proof to the weighted empirical risk case, and we omit the details of the proof in this paper. Then, we apply the chain rule to get the influence of upweighting $x$ on the validation loss:

$\mathcal{I}_{up,loss}(x):=\left.\frac{d\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}_{\epsilon,x})}{d\epsilon}\right|_{\epsilon=0}$  (5)
$=\nabla_{\boldsymbol{\theta}}\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}})^{\top}\mathcal{I}_{up,params}(x).$  (6)

Note that $\mathcal{L}(\mathcal{X}_{train},\boldsymbol{\theta})$ can be rewritten in the following weighted-average form to incorporate a new training example $x_{new}$:

$\mathcal{L}(\mathcal{X}_{train},\boldsymbol{\theta})=\frac{1}{\sum_{i=1}^{N+1}w_{i}}\sum_{i=1}^{N+1}w_{i}\,l(x_{i},\boldsymbol{\theta}),$

where $w_{i}=1\ \forall i\neq N+1$, $w_{N+1}=0$ and $x_{N+1}=x_{new}$. Adding the new training example $x_{new}$ is equivalent to upweighting $x_{N+1}$ by $\frac{1}{N}$:

$\mathcal{L}(\mathcal{X}_{train}\cup\{x_{new}\},\boldsymbol{\theta})=\frac{N}{N+1}\Big(\frac{1}{N}l(x_{N+1},\boldsymbol{\theta})+\frac{1}{\sum_{i=1}^{N+1}w_{i}}\sum_{i=1}^{N+1}w_{i}\,l(x_{i},\boldsymbol{\theta})\Big)$
$\propto\frac{1}{N}l(x_{N+1},\boldsymbol{\theta})+\frac{1}{\sum_{i=1}^{N+1}w_{i}}\sum_{i=1}^{N+1}w_{i}\,l(x_{i},\boldsymbol{\theta}).$

Applying the influence function $\mathcal{I}_{up,loss}(x)$, we obtain the following linear approximation of the change in validation loss upon adding the training example $x_{new}$:

$\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}(\mathcal{X}_{train}\cup\{x_{new}\}))-\mathcal{L}(\mathcal{X}_{val},\hat{\boldsymbol{\theta}}(\mathcal{X}_{train}))$  (7)
$\approx\frac{1}{N}\mathcal{I}_{up,loss}(x_{new}).$  (8)

We adopt the stochastic estimation method described in Koh and Liang (2017) to efficiently compute $\mathcal{I}_{up,loss}$. Detrimental synthetic data will have $\frac{1}{N}\mathcal{I}_{up,loss}>0$.

Appendix E Diversity Selection using Embedding Distance

We define our embedding-distance-based diversity measure as the sum of the cosine distances between every pair of selected examples. To attempt to maximize this measure, we use a greedy algorithm that at each iteration randomly samples 10K candidate examples from the pool and selects the candidate that maximizes the distance between it and its nearest neighbor in the set of examples selected so far. We use SRoBERTa Reimers and Gurevych (2019) as our sentence embedding method and Faiss Johnson et al. (2017) as our nearest neighbor searcher. We compare the embedding-distance-based measure with the unigram approach on the WinoGrande dataset. The embedding-based diversity selection is not found to be more effective than the unigram approach; in fact, it performs 0.6% worse.
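
A numpy-only sketch of this greedy procedure follows. The paper used SRoBERTa embeddings and Faiss for the nearest-neighbor search; here `embed` is a placeholder encoder (assumed to return L2-normalized vectors) and the search is a plain matrix product.

```python
# Greedy farthest-point selection: at each step, sample candidates and pick the
# one with the largest cosine distance to its nearest selected neighbor.
import numpy as np

def embedding_diversity_select(pool_texts, embed, target_size, sample_size=10000):
    rng = np.random.default_rng(0)
    emb = embed(pool_texts)                  # (n, d), rows L2-normalized
    remaining = list(range(len(pool_texts)))
    selected = [remaining.pop(0)]            # seed with an arbitrary example
    while len(selected) < target_size and remaining:
        cand = rng.choice(remaining, size=min(sample_size, len(remaining)),
                          replace=False)
        sims = emb[cand] @ emb[selected].T   # cosine similarity to selected set
        dists = 1.0 - sims.max(axis=1)       # distance to nearest neighbor
        pick = int(cand[dists.argmax()])
        selected.append(pick)
        remaining.remove(pick)
    return [pool_texts[i] for i in selected]
```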

Appendix F Additional Analysis

Sharpness Analysis.

Previous work Hochreiter and Schmidhuber (1997); Keskar et al. (2016); Yao et al. (2019) has shown that models with flatter local minima tend to generalize better. Moreover, Hao et al. (2019) show that pretraining helps BERT to achieve flat and wide optima in the finetuning stage, which partially explains its performance benefits. We investigate whether G-DAug^c's data augmentation may also encourage flatter optima. Specifically, using the fact that a larger Hessian trace implies a sharper local minimum Yao et al. (2019), we compute the Hessian trace of 10 baseline and 10 G-DAug^c-Combo models using the Hutchinson method Avron and Toledo (2011) and find an average relative decrease of 9.5% for G-DAug^c-Combo, suggesting that G-DAug^c does find slightly flatter optima. Likewise, when comparing the best performing models of each approach, G-DAug^c-Combo's best model is slightly flatter than the baseline (a relative decrease of 0.2%). However, we also find the contradictory fact that, over the 20 models, flatter optima tend to be associated with worse task performance (Spearman correlation of 0.39, p ≈ 0.09). So, it does not appear that sharpness explains G-DAug^c's performance advantage over the baseline; a more thorough analysis of this hypothesis is an item of future work.
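
For reference, a minimal PyTorch sketch of the Hutchinson trace estimator follows; `loss_fn`, `params`, and the sample count are illustrative placeholders.

```python
# Hutchinson estimator: E[v^T H v] = tr(H) for Rademacher-distributed v.
import torch

def hutchinson_trace(loss_fn, params, n_samples=50):
    estimates = []
    for _ in range(n_samples):
        loss = loss_fn()  # rebuild the graph for each draw
        grads = torch.autograd.grad(loss, params, create_graph=True)
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # +/-1
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params)  # Hessian-vector product H v
        estimates.append(sum((hv * v).sum() for hv, v in zip(hvs, vs)).item())
    return sum(estimates) / len(estimates)

# params = [p for p in model.parameters() if p.requires_grad]
```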

Generator/Task Model Ensemble.

G-DAug^c harnesses pretrained knowledge from GPT-2 in order to improve a RoBERTa-based task model. A more standard approach for model combination (albeit with twice the computational cost at runtime) would be to ensemble the two models instead. We evaluate ensembling a baseline RoBERTa model with a finetuned GPT-2 generator for WinoGrande in Table 13. We adopt a weighted-average ensemble method, where the weights are tuned on validation data (the tuning is important to achieve peak performance). The ensemble model performs the same as the baseline model, and G-DAug^c-Combo outperforms both of them by 3.9%. This suggests that G-DAug^c is more effective than simply ensembling with the finetuned generator.
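
A small sketch of such a weighted-average ensemble follows; the probability arrays and the grid search are illustrative assumptions (the paper states only that the weights are tuned on validation data).

```python
# Mix per-choice probabilities from the task model and the generator, with the
# mixing weight chosen by grid search on validation accuracy.
import numpy as np

def ensemble_predict(task_probs, lm_probs, weight):
    # task_probs, lm_probs: (n_examples, n_choices) choice probabilities
    return (weight * task_probs + (1.0 - weight) * lm_probs).argmax(axis=1)

def tune_weight(task_probs, lm_probs, labels, grid=np.linspace(0, 1, 21)):
    accs = [(ensemble_predict(task_probs, lm_probs, w) == labels).mean()
            for w in grid]
    return grid[int(np.argmax(accs))]
```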