
Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information

Kawin Ethayarajh    Yejin Choi    Swabha Swayamdipta
Abstract

Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty, w.r.t. a model $\mathcal{V}$, as the lack of $\mathcal{V}$-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for $\mathcal{V}$. We further introduce pointwise $\mathcal{V}$-information (pvi) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, $\mathcal{V}$-usable information and pvi also permit the converse: for a given model $\mathcal{V}$, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks.


1 Introduction

Datasets are designed to act as proxies for real-world tasks, yet most bear limited semblance to the tasks they purport to reflect (Torralba & Efros, 2011; Recht et al., 2019). Understanding dataset difficulty is therefore imperative to understanding progress in AI. In practice, however, estimating dataset difficulty is often limited to an informal comparison of state-of-the-art model performance to that of humans; the bigger the performance gap, the harder the dataset is said to be (Ethayarajh & Jurafsky, 2020; Ma et al., 2021). However, such performance metrics offer little understanding of the differential difficulty of individual instances, or of which attributes in the input a given model finds useful.

Figure 1: The Stanford NLI dataset contains more BERT-usable information than the MultiNLI and CoLA datasets, making it easier for BERT-base. Shown is the distribution of instance difficulty (pvi) in the held-out sets for each; dotted lines denote the average pvi.

To understand why a dataset is difficult, we extend recent work in information theory (Xu et al., 2019). To illustrate, consider a model family $\mathcal{V}$ that can learn to map a sentence $X$ to its sentiment $Y$. Even if $X$ were to be encrypted, the information $X$ contains about $Y$ would not be removed; in other words, the Shannon mutual information would be unchanged (Shannon, 1948). However, encryption makes predicting the sentiment a lot more difficult for $\mathcal{V}$. But why? Intuitively, the task is easier when $X$ is unencrypted because the information it contains is usable by $\mathcal{V}$; when $X$ is encrypted, the information still exists but becomes unusable. This quantity, $\mathcal{V}$-usable information, reflects the ease with which $\mathcal{V}$ can predict $Y$ given $X$. Xu et al. (2019) show that it can be measured using the predictive $\mathcal{V}$-information framework, which generalizes Shannon information to consider computational constraints.

Our work extends the above framework by framing dataset difficulty as the lack of $\mathcal{V}$-usable information (we use the terms "$\mathcal{V}$-usable information" and "$\mathcal{V}$-information" from Xu et al. (2019) interchangeably). The higher the $\mathcal{V}$-usable information, the easier the dataset is for $\mathcal{V}$. Not only does this framework allow comparisons of models w.r.t. the same dataset, but also of different datasets w.r.t. the same model. Figure 1 illustrates that different datasets provide different amounts of usable information for the same model, even when the task is identical (i.e., natural language inference in the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets).

Building on the aggregate estimate of dataset difficulty, we introduce a measure called pointwise $\mathcal{V}$-information (pvi) for estimating the difficulty of each instance w.r.t. a given distribution (§3). pvi estimates allow us to compare not only individual instances, but also the difficulty of slices of data w.r.t. $\mathcal{V}$. On datasets containing more usable information (e.g., SNLI), pvi estimates are highly correlated (Pearson $r \geq 0.75$) across different models, seeds, and training time, and with human judgments of difficulty.

Comparisons of $\mathcal{V}$-usable information before and after isolating an input attribute shed light on why the dataset is easy or difficult for $\mathcal{V}$ (§4), which has significant implications for interpretability in AI (Miller, 2019). Our code and data are available here. Specifically, we use $\mathcal{V}$-usable information to identify some limitations in benchmarks that are widely used in NLP to test for a model's understanding of different language phenomena:

  • Word ordering has a limited impact on the difficulty of a popular natural language entailment benchmark, SNLI (Bowman et al., 2015), even though entailment describes a causal relationship.

  • Some of the most difficult instances in SNLI and a popular grammaticality detection benchmark, CoLA (Warstadt et al., 2018), are mislabelled.

  • In a popular dataset for hate speech detection (Davidson et al., 2017), just 50 (potentially) offensive words contain most of the BERT-usable information about the label; subtler bias may be going undetected.

2 š’±\mathcal{V}-Usable Information

2.1 Background

Consider a model family $\mathcal{V}$, which can be trained to map text input $X$ to its label $Y$. If we encrypted the text, or translated it into a language with a very complex grammar, it would be harder to predict $Y$ given $X$ using the same $\mathcal{V}$. How might we measure this increase in difficulty? Shannon (1948)'s mutual information $I(X;Y)$ is not an option: it would not change after $X$ is encrypted, as it allows for unbounded computation, including any needed to decrypt the text.

Intuitively, the task is easier when $X$ is unencrypted because the information it contains is usable by $\mathcal{V}$; when $X$ is encrypted, this information still exists but becomes unusable. This quantity, called $\mathcal{V}$-usable information, provides an estimate of the difficulty of a dataset w.r.t. $\mathcal{V}$. It can be measured under a framework called predictive $\mathcal{V}$-information, which generalizes Shannon information to measure how much information can be extracted from $X$ about $Y$ when constrained to functions in $\mathcal{V}$, written as $I_{\mathcal{V}}(X \to Y)$ (Xu et al., 2019). The greater the $I_{\mathcal{V}}(X \to Y)$, the easier the dataset is for $\mathcal{V}$. If $\mathcal{V}$ is the set of all functions, i.e., under unbounded computation, $\mathcal{V}$-information reduces to Shannon information.

Processing the input with $\tau$ (e.g., by decrypting the text) can make prediction easier, allowing $I_{\mathcal{V}}(\tau(X) \to Y) \geq I_{\mathcal{V}}(X \to Y)$. Although this violates the data processing inequality, it explains the usefulness of certain types of processing, such as representation learning. Compared to $X$, the learned representations cannot have more Shannon information with $Y$, but they can have more usable information.

2.2 Definitions

As defined in Xu et al. (2019):

Definition 2.1.

Let $X, Y$ denote random variables with sample spaces $\mathcal{X}, \mathcal{Y}$ respectively. Let $\varnothing$ denote a null input that provides no information about $Y$. Given a predictive family $\mathcal{V} \subseteq \Omega = \{f : \mathcal{X} \cup \varnothing \to P(\mathcal{Y})\}$, the predictive $\mathcal{V}$-entropy is

Hš’±ā€‹(Y)=inffāˆˆš’±š”¼ā€‹[āˆ’log2⁔f​[āˆ…]​(Y)]H_{\mathcal{V}}(Y)=\inf_{f\in\mathcal{V}}\mathbb{E}[-\log_{2}f[\varnothing](Y)] (1)

and the conditional $\mathcal{V}$-entropy is

Hš’±ā€‹(Y|X)=inffāˆˆš’±š”¼ā€‹[āˆ’log2⁔f​[X]​(Y)]H_{\mathcal{V}}(Y|X)=\inf_{f\in\mathcal{V}}\mathbb{E}[-\log_{2}f[X](Y)] (2)

We use $\log_{2}$ to measure the entropies in bits of information, though one could also use $\log_{e}$ and measure them in nats instead.

Put simply, $f[X]$ and $f[\varnothing]$ produce a probability distribution over the labels. The goal is to find the $f \in \mathcal{V}$ that maximizes the log-likelihood of the label data with the input (Eq. 2) and without it (Eq. 1). $f[\varnothing]$ models the label entropy, so $\varnothing$ can be set to an empty string for most NLP tasks. Although a predictive family has a technical definition ($\mathcal{V}$ is a subset of all possible mappings from $\mathcal{X}$ to $P(\mathcal{Y})$ that satisfies optional ignorance: for any $P$ in the range of some $f \in \mathcal{V}$, there exists some $f' \in \mathcal{V}$ s.t. $f'[X] = f'[\varnothing] = P$; see Xu et al. (2019) for why optional ignorance is necessary), most neural models, provided they are finetuned without any frozen parameters, easily meet this definition. Further, as per Xu et al. (2019):

Definition 2.2.

Let $X$ and $Y$ denote random variables with sample spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively. Given a predictive family $\mathcal{V}$, the $\mathcal{V}$-information is

Iš’±ā€‹(X→Y)=Hš’±ā€‹(Y)āˆ’Hš’±ā€‹(Y|X)I_{\mathcal{V}}(X\to Y)=H_{\mathcal{V}}(Y)-H_{\mathcal{V}}(Y|X) (3)

Because we are estimating this quantity on a finite dataset, the estimate can differ from the true $\mathcal{V}$-information. Xu et al. (2019) provide PAC bounds for this error, where less complex $\mathcal{V}$ and larger datasets yield tighter bounds. Xu et al. (2019) also list several useful properties of $\mathcal{V}$-information:

  • Non-Negativity: $I_{\mathcal{V}}(X \to Y) \geq 0$.

  • Independence: If $X$ is independent of $Y$, $I_{\mathcal{V}}(X \to Y) = 0$.

  • Monotonicity: If $\mathcal{U} \subseteq \mathcal{V}$, then $H_{\mathcal{U}}(Y) \geq H_{\mathcal{V}}(Y)$ and $H_{\mathcal{U}}(Y|X) \geq H_{\mathcal{V}}(Y|X)$.

Training with the cross-entropy loss finds the $f \in \mathcal{V}$ that maximizes the log-likelihood of $Y$ given $X$ (Xu et al., 2019). Thus, $H_{\mathcal{V}}(Y|X)$ can be easily computed by standard training or by finetuning a pre-trained model (improving model calibration using more advanced methods (Kumar et al., 2019) is a possible direction of future work). We estimate $H_{\mathcal{V}}(Y|X)$ by calculating $\mathbb{E}[-\log f[X](Y)]$ on an identically distributed held-out set, where $Y$ is the gold label. Since training with cross-entropy ultimately aims to find the infimum over the data distribution, not just the training set, it is important not to overfit the model to the training instances; this is of added significance for estimating $H_{\mathcal{V}}(Y|X)$. We estimate $H_{\mathcal{V}}(Y)$ by training or finetuning another model where $X$ is replaced by $\varnothing$, intended to fit the label distribution. As such, computing $\mathcal{V}$-information involves training or finetuning only two models.
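As a concrete illustration of this two-model recipe, the following is a minimal sketch (ours, not the authors' released code) of estimating $H_{\mathcal{V}}(Y|X)$, $H_{\mathcal{V}}(Y)$, and their difference for BERT-base on SNLI with the Hugging Face transformers and datasets libraries; the hyperparameters are illustrative, and the column/split names are those of the SNLI dataset on the Hub.

```python
# Sketch: estimate H_V(Y|X) and H_V(Y) by finetuning BERT-base twice, once on the real
# input and once on a null (empty-string) input, then measuring held-out cross-entropy in bits.
import numpy as np
import torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-cased")

def encode(batch, null_input):
    if null_input:
        # g models the label distribution alone: every input is the empty string.
        return tok(["" for _ in batch["premise"]], truncation=True,
                   padding="max_length", max_length=128)
    return tok(batch["premise"], batch["hypothesis"], truncation=True,
               padding="max_length", max_length=128)

def finetune_and_entropy(null_input):
    data = load_dataset("snli").filter(lambda ex: ex["label"] != -1)
    data = data.map(lambda b: encode(b, null_input), batched=True)
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)
    args = TrainingArguments(output_dir="out", num_train_epochs=2,
                             per_device_train_batch_size=64, learning_rate=2e-5)
    trainer = Trainer(model=model, args=args, train_dataset=data["train"])
    trainer.train()
    # Cross-entropy of the gold labels on the held-out set, converted from nats to bits.
    logits = trainer.predict(data["test"]).predictions
    logp = torch.log_softmax(torch.tensor(logits), dim=-1).numpy()
    gold = np.array(data["test"]["label"])
    return -logp[np.arange(len(gold)), gold].mean() / np.log(2)

H_y_given_x = finetune_and_entropy(null_input=False)   # uses g', finetuned on (x, y)
H_y = finetune_and_entropy(null_input=True)            # uses g, finetuned on (null, y)
print(f"V-information estimate: {H_y - H_y_given_x:.3f} bits")
```

The same two finetuned models, $g'$ and $g$, are reused for the pointwise estimates introduced in §3.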

2.3 Assumptions

Implicit in estimating the $\mathcal{V}$-information is the assumption that the data used to find the optimal $f \in \mathcal{V}$ and the data used to estimate $H_{\mathcal{V}}(Y|X)$ are identically distributed, since $\mathcal{V}$-information is ultimately a function of two random variables $X, Y$. This dependence on the data distribution makes $\mathcal{V}$-information well-suited for estimating and interpreting dataset difficulty. However, it is still possible to estimate the difficulty of sub-populations or subsets of the data, though it would be imprecise to refer to this measure as $\mathcal{V}$-information (see §3.1 for details). We also assume that the difference between the empirical $\mathcal{V}$-information (calculated using some finite dataset) and the true $\mathcal{V}$-information (calculated over the distributions) is negligible, though this may not hold, for example, if the dataset is too small (see Appendix A).

Figure 2: Comparing the $\mathcal{V}$-usable information estimate to accuracy in SNLI. In the first three epochs, estimates on the test set are similar across all models (top), but due to over-fitting, the estimates diverge and decline. The test accuracy (bottom) for each model loosely tracks the $\mathcal{V}$-information estimate for that model, since extracting information makes prediction easier.

2.4 Implications

š’±\mathcal{V}-usable information allows us to compare

  • i.

    different models š’±\mathcal{V} by computing Iš’±ā€‹(X→Y)I_{\mathcal{V}}(X\to Y) for the same X,YX,Y (Fig.Ā 2),

  • ii.

    different datasets {(x,y)}\{(x,y)\} by computing Iš’±ā€‹(X→Y)I_{\mathcal{V}}(X\to Y) for the same š’±\mathcal{V} (Fig.Ā 1), and

  • iii.

    different input variables XiX_{i} by computing Iš’±ā€‹(Xi→Y)I_{\mathcal{V}}(X_{i}\to Y) for the same š’±\mathcal{V} and YY (Fig.Ā 4; §4).

While common classification metrics, such as accuracy or $F_1$ score, are often used for the above comparisons, $\mathcal{V}$-usable information offers a theoretically rigorous framework, making it better suited for interpretability. The $\mathcal{V}$-usable information is measured in bits or nats (depending on the log base), allowing for standardized comparisons across models and datasets. Additionally, consider the case where $X$ and $Y$ are independent: here, model accuracy would be no greater than the majority class frequency, but this frequency varies across datasets. $\mathcal{V}$-information avoids this problem by factoring in the label entropy $H_{\mathcal{V}}(Y)$; if $X, Y$ are independent, then the $\mathcal{V}$-information is provably zero.
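The following toy calculation (ours) makes the last point concrete: under independence, the best any $f \in \mathcal{V}$ can do is fit the label marginal, so the $\mathcal{V}$-information is zero no matter how high the majority-class accuracy baseline happens to be.

```python
# Toy illustration: with X independent of Y, H_V(Y|X) = H_V(Y), so I_V(X -> Y) = 0,
# even though the majority-class accuracy baseline differs across the two label marginals.
import numpy as np

for marginal in ([0.9, 0.1], [0.34, 0.33, 0.33]):
    p = np.array(marginal)
    H_y = -(p * np.log2(p)).sum()   # H_V(Y), assuming V can fit the label marginal
    H_y_given_x = H_y               # X provides no usable reduction in uncertainty
    print(f"majority acc = {p.max():.2f}, V-information = {H_y - H_y_given_x:.2f} bits")
```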

Say we wish to compare two predictive families, $\mathcal{V}$ and $\mathcal{U}$, such that $\mathcal{U} \subseteq \mathcal{V}$. Assuming both families can model the label distribution, the task will be at least as easy for the larger family. This provably obviates the need to evaluate simpler function families (e.g., linear functions) when estimating dataset difficulty. Our experiments show that this bears out in practice as well (Appendix B).

2.5 š’±\mathcal{V}-Usable Information in Practice

We consider the natural language inference (NLI) task, which involves predicting whether a text hypothesis entails, contradicts or is neutral to a text premise. We first apply the $\mathcal{V}$-information framework to estimate the difficulty of a large-scale NLI dataset, Stanford NLI (SNLI; Bowman et al., 2015), across different state-of-the-art models. The four models we use are GPT2-small (Radford et al., 2019), BERT-base-cased (Devlin et al., 2019), DistilBERT-base-uncased (Sanh et al., 2019), and BART-base (Lewis et al., 2020). Figure 2 shows the $\mathcal{V}$-information estimate for all four, as well as their accuracy on the SNLI train and held-out (test) sets, across 10 training epochs. See Appendix B for results with larger models.

Model performance tracks $\mathcal{V}$-information.

As seen in Figure 2, the model with the most $\mathcal{V}$-information on the SNLI test set is also the most accurate. This is intuitive, since extracting more information makes prediction easier. Overall, BART-base extracts the most $\mathcal{V}$-information, followed by BERT-base, DistilBERT-base, and GPT2-small; accuracy follows the same trend.

š’±\mathcal{V}-information is more sensitive to over-fitting than held-out performance.

At epoch 10, the $\mathcal{V}$-information is at its lowest for all models, although the SNLI test accuracy has only declined slightly from its peak, as seen in Figure 2. This is because the models start becoming less certain about the correct label long before they start predicting the wrong label. This causes $H_{\mathcal{V}}(Y|X)$ to rise, and thus $I_{\mathcal{V}}(X \to Y)$ to decline, even while most of the probability mass is still placed on the correct label. This suggests that, compared to performance metrics like test accuracy, $\mathcal{V}$-information can more readily inform us of over-fitting.

Different datasets for the same task can have different amounts of $\mathcal{V}$-usable information.

We consider the MultiNLI dataset (Williams et al., 2018), a multi-genre counterpart of SNLI. Despite both being proxies for the NLI task, SNLI and MultiNLI have significantly different amounts of BERT-usable information, as shown in Figure 1. The $\mathcal{V}$-information framework provides a principled means of measuring this difference in levels of difficulty; MultiNLI is expected to be more difficult than SNLI due to the diversity of genres it considers. Also shown is CoLA (Warstadt et al., 2018), a dataset for linguistic acceptability where each sentence is labeled as grammatical or not; this task is seemingly more difficult than NLI for BERT.

3 Measuring Pointwise Difficulty

While š’±\mathcal{V}-information provides an aggregate measure of dataset difficulty (§2), a closer analysis requires measuring the degree of usable information in individual instances (w.r.t.Ā a given distribution). We extend the š’±\mathcal{V}-information framework to introduce a new measure called pointwise š’±\mathcal{V}-information (pvi) for individual instances. The higher the pvi, the easier the instance is for š’±\mathcal{V}, under the given distribution.

Definition 3.1 (Pointwise $\mathcal{V}$-Information).

Given random variables $X, Y$ and a predictive family $\mathcal{V}$, the pointwise $\mathcal{V}$-information (pvi) of an instance $(x, y)$ is

$\text{pvi}(x \to y) = -\log_{2} g[\varnothing](y) + \log_{2} g'[x](y)$   (4)

where $g \in \mathcal{V}$ s.t. $\mathbb{E}[-\log g[\varnothing](Y)] = H_{\mathcal{V}}(Y)$ and $g' \in \mathcal{V}$ s.t. $\mathbb{E}[-\log g'[X](Y)] = H_{\mathcal{V}}(Y|X)$.

If š’±\mathcal{V} were, for instance, the BERT function family, g′g^{\prime} and gg would be the models after finetuning BERT with and without the input respectively. For a held-out instance (x,y)(x,y), pvi​(x→y)\text{{pvi}}(x\to y) is the difference in the log-probability these models place on the gold label. pvi is to š’±\mathcal{V}-information what pmi is to Shannon information:

$I(X;Y) = \mathbb{E}_{x,y \sim P(X,Y)}[\text{pmi}(x,y)]$
$I_{\mathcal{V}}(X \to Y) = \mathbb{E}_{x,y \sim P(X,Y)}[\text{pvi}(x \to y)]$   (5)

Given this relationship, our understanding of $\mathcal{V}$-information extends to pvi as well: higher pvi instances are easier for $\mathcal{V}$ and vice-versa. A higher pvi increases the odds of being predicted correctly; this is intuitive because a correct prediction of a non-majority-class instance requires that some information be extracted from the instance. Although the $\mathcal{V}$-information cannot be negative, the pvi can be, much like how pmi can be negative even though Shannon information cannot. A negative pvi simply means that the model is better off predicting the majority class than considering $X$, which can happen for many reasons (e.g., mislabelling). Examples with negative pvi can still be predicted correctly, as long as $g'$ places most of the probability mass on the correct label. Algorithm 1 shows our computation of pvi and $\mathcal{V}$-information (by averaging over pvi).

The pvi of an instance $(x, y)$ w.r.t. $\mathcal{V}$ should only depend on the distribution of the random variables. Sampling more from $P(X,Y)$ during finetuning should not change $\text{pvi}(x \to y)$. However, an instance can be drawn from different distributions, in which case we would expect its pvi to differ. For example, say we have restaurant reviews and movie reviews, along with their sentiment. The instance ("That was great!", positive) could be drawn from either distribution, but we would expect its pvi to be different in each (even though $\mathcal{V}$ is the same).

3.1 Implications

In addition to the comparisons that $\mathcal{V}$-information allows us to make (§2.4), pvi allows us to compare:

  iv. different instances $(x, y)$ by computing $\text{pvi}(x \to y)$ for the same $X, Y, \mathcal{V}$ (Tables 1, 4; Fig. 11), and

  v. different slices or subsets of the data by computing the average pvi over instances in each slice (Table 2; Fig. 5).

Note that the average pvi of a slice of data is not its $\mathcal{V}$-information, since we optimize the model w.r.t. the entire distribution. However, since in practice one often wishes to understand the relative difficulty of different subpopulations w.r.t. the training distribution, calculating the average pvi, as opposed to the $\mathcal{V}$-information of the subpopulation itself, is more useful.
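As a small sketch (ours) of comparison (v), the mean pvi of a slice is simply a group-by average over held-out pvi estimates; the dataframe and its columns below are illustrative stand-ins for the outputs of Algorithm 1.

```python
# Mean pvi per slice (here, slicing a toy held-out set by gold label).
import pandas as pd

df = pd.DataFrame({
    "label": ["entailment", "entailment", "neutral", "contradiction"],
    "pvi":   [1.8, -0.4, 1.1, 2.0],
})
# Note: this is the average pvi of each slice, not the slice's V-information.
print(df.groupby("label")["pvi"].mean())
```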

Algorithm 1: pvi and $\mathcal{V}$-Information

Input: training data $\mathcal{D}_{\text{train}} = \{(\text{input } x_i, \text{gold label } y_i)\}_{i=1}^{m}$, held-out data $\mathcal{D}_{\text{test}} = \{(\text{input } x_i, \text{gold label } y_i)\}_{i=1}^{n}$, model $\mathcal{V}$

  $g' \leftarrow$ Finetune $\mathcal{V}$ on $\mathcal{D}_{\text{train}}$
  $\varnothing \leftarrow$ empty string (null input)
  $g \leftarrow$ Finetune $\mathcal{V}$ on $\{(\varnothing, y_i)\ |\ (x_i, y_i) \in \mathcal{D}_{\text{train}}\}$
  $H_{\mathcal{V}}(Y), H_{\mathcal{V}}(Y|X) \leftarrow 0, 0$
  for $(x_i, y_i) \in \mathcal{D}_{\text{test}}$ do
    $H_{\mathcal{V}}(Y) \leftarrow H_{\mathcal{V}}(Y) - \frac{1}{n}\log_{2} g[\varnothing](y_i)$
    $H_{\mathcal{V}}(Y|X) \leftarrow H_{\mathcal{V}}(Y|X) - \frac{1}{n}\log_{2} g'[x_i](y_i)$
    $\text{pvi}(x_i \to y_i) \leftarrow -\log_{2} g[\varnothing](y_i) + \log_{2} g'[x_i](y_i)$
  end for
  $\hat{I}_{\mathcal{V}}(X \to Y) = \frac{1}{n}\sum_i \text{pvi}(x_i \to y_i) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y|X)$

Algorithm 1: After finetuning on a dataset of size $n$, the $\mathcal{V}$-information and pvi can be calculated in $O(n)$ time.
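A small numpy sketch (ours) of Algorithm 1's inner loop, assuming the gold-label probabilities under $g'$ and $g$ have already been computed for each held-out instance:

```python
# Per-instance pvi (Eq. 4) and its average, which recovers the V-information estimate (Eq. 3).
import numpy as np

p_gold_with_x = np.array([0.95, 0.60, 0.02])   # g'[x_i](y_i) for three toy instances
p_gold_null = np.array([0.33, 0.34, 0.33])     # g[null](y_i) for the same instances

pvi = np.log2(p_gold_with_x) - np.log2(p_gold_null)   # pvi(x_i -> y_i), in bits
H_y = -np.log2(p_gold_null).mean()                    # H_V(Y)
H_y_given_x = -np.log2(p_gold_with_x).mean()          # H_V(Y|X)
print(pvi, pvi.mean(), H_y - H_y_given_x)             # the last two quantities agree
```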

3.2 pvi in Practice

pvi can be used to find mislabelled instances.

Correctly predicted instances have higher pvi values than incorrectly predicted ones. For the held-out sets in SNLI, MultiNLI and CoLA, the difference in mean pvi between instances correctly and incorrectly predicted by BERT-base is 3.03, 2.87, and 2.45 bits respectively. These differences are statistically significant ($p < 0.001$). Table 1 shows the most difficult (lowest pvi) instances from CoLA; we further find that some of these are in fact mislabelled (see Appendix C for an analysis of SNLI).

Sentence | Label | pvi
Wash you! | No | -4.616
Who achieved the best result was Angela. | No | -4.584
Sue gave to Bill a book. | No | -3.649
Only Churchill remembered Churchill giving the Blood, Sweat and Tears speech. | No | -3.571
Cynthia chewed. | No | -3.510
It is a golden hair. | Yes | -3.251
I won't have some money. | No | -3.097
You may pick every flower, but leave a few for Mary. | No | -2.875
I know which book Mag read, and which book Bob said that you hadn't. | Yes | -2.782
John promise Mary to shave himself. | Yes | -2.609

Table 1: The 10 hardest (lowest pvi) instances in the CoLA in-domain test set for grammaticality detection (the label indicates grammaticality), according to BERT-base. Examples in red are assessed to be mislabelled by authors of this work. For example, 'Cynthia chewed.' might be grammatical because the verb 'chew' could be intransitive in this usage. This suggests that pvi could be used to identify mislabelled examples. All of these examples were predicted incorrectly by BERT-base.
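A short sketch (ours) of the re-annotation workflow Table 1 suggests: rank held-out instances by pvi, lowest first, and review the hardest ones manually; the arrays below are toy stand-ins for the outputs of Algorithm 1.

```python
# Surface candidate label errors: the lowest-pvi instances are reviewed first.
import numpy as np

pvi_scores = np.array([2.1, -4.616, 0.3, -3.510])
sentences = ["(an easy instance)", "Wash you!", "(a medium instance)", "Cynthia chewed."]
for i in np.argsort(pvi_scores)[:2]:                  # the k hardest instances
    print(f"{pvi_scores[i]:+.3f}  {sentences[i]}")    # candidates for manual re-annotation
```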

The pvi threshold at which predictions become incorrect is similar across datasets.

In Figure 3, we plot the pvi distribution of correctly and incorrectly predicted instances in each dataset. As expected, high-pvi instances are predicted correctly and low-pvi instances are not. Notably, the point at which instances start being incorrectly predicted is similar across datasets (pvi $\approx 0.5$). Such a pattern could not be observed with a performance metric because the label spaces are different, which demonstrates why the $\mathcal{V}$-information framework is so useful for cross-dataset comparison.

Figure 3: The distribution of pvi for correctly and incorrectly predicted instances in each dataset. Note that the point at which instances start being incorrectly predicted is similar across datasets ($\sim$0.5 bits). In contrast, because the label space is different across CoLA and the other two datasets, such a comparison could not be made with a performance-based metric.

pvi estimates are highly consistent across models, training epochs, and random initializations.

The cross-model Pearson correlation between pvi estimates of SNLI instances is very high ($r > 0.80$). However, the cross-model Pearson correlation is lower for CoLA ($0.40 < r < 0.65$); see Fig. 9 in Appendix D. This is because, as visualized in Figure 1, CoLA has less usable information, making difficulty estimates noisier. In the limit, if a dataset contained no usable information, then we would expect the correlation between pvi estimates across different models and seeds to be close to zero. It is also worth noting, however, that a high degree of cross-model correlation, as with SNLI, does not preclude comparisons between different models on the same dataset. Rather, it suggests that in SNLI, a minority of instances is responsible for distinguishing one model's performance from another. This is not surprising: given the similar complexity and architecture of these models, we would expect most instances to be equally easy. Moreover, despite the performance of Transformer-based models varying across random initializations (Dodge et al., 2019, 2020; Mosbach et al., 2020), we find that pvi estimates are quite stable: the correlation across seeds is $r > 0.85$ (for SNLI-finetuned BERT-base, across 4 seeds); see Table 6 in Appendix D. pvi also concurs with human judgments of difficulty; see Fig. 10 in Appendix D.

Figure 4: The amount of $\mathcal{V}$-usable information contained in different input attributes about the gold labels in SNLI. The token identity alone (regardless of order) provides most of the information for all models (see shuffled). The premise, which can be shared by multiple instances, is useless alone; the hypothesis, which is unique to an instance, is quite useful even without a premise, suggesting it may contain annotation artefacts.

4 Uncovering Dataset Artefacts

A key limitation of standard evaluation metrics (e.g., accuracy) is the lack of interpretability: there is no straightforward way to understand why a dataset is as difficult as it is. $\mathcal{V}$-usable information offers an answer by allowing comparison of different input variables $X_i$ under the same $\mathcal{V}$ and $Y$, as discussed in §2.4. We consider two approaches for this: applying input transformations (§4.1), and slicing the dataset (§4.2).

4.1 Input Transformations

Our first approach involves applying different transformations $\tau_i(X)$ to isolate an attribute $a$, followed by calculating $I_{\mathcal{V}}(\tau_i(X) \to Y)$ to measure how much information (usable by $\mathcal{V}$) the attribute contains about the label. For example, by shuffling the tokens in $X$, we can isolate the influence of the word order attribute.

Given that a transformation may make information more accessible (e.g., decrypting some encrypted text; cf. §2), it is possible for $I_{\mathcal{V}}(\tau_i(X) \to Y) \geq I_{\mathcal{V}}(X \to Y)$, so the latter should not be treated as an upper bound. Such transformations were applied by O'Connor & Andreas (2021) to understand what syntactic features Transformers use in next-token prediction; we take this a step further, aiming to discover annotation artefacts, compare individual instances, and ultimately understand the dataset itself. We present our findings on SNLI, CoLA, as well as DWMW17 (Davidson et al., 2017), a dataset for hate speech detection, where input posts are labeled as hate speech, offensive, or neither.

We apply transformations to the SNLI input to isolate different attributes (see Appendix E for an example): shuffled (shuffle tokens randomly), hypothesis-only (only include the hypothesis), premise-only (only include the premise), and overlap (tokens in both the premise and hypothesis).
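Below is an illustrative implementation (ours) of these transformations; the whitespace tokenization and the ordering of the overlap tokens are simplifications, not the paper's exact preprocessing.

```python
# tau_i(X): isolate an attribute of the (premise, hypothesis) input.
import random

def shuffled(premise, hypothesis, seed=0):
    tokens = f"{premise} {hypothesis}".split()
    random.Random(seed).shuffle(tokens)   # token identity kept, word order destroyed
    return " ".join(tokens)

def hypothesis_only(premise, hypothesis):
    return hypothesis

def premise_only(premise, hypothesis):
    return premise

def overlap(premise, hypothesis):
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    return " ".join(t for t in hypothesis.split() if t.lower() in shared)
```

Each transformed copy of the dataset is used to finetune a fresh model, and $I_{\mathcal{V}}(\tau_i(X) \to Y)$ is then estimated exactly as in Algorithm 1.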

Token identity alone provides most of the usable information in SNLI.

Figure 4 shows that the token identity alone, isolated by shuffling the input, contains most of the usable information for all models. The premise, which is often shared by multiple instances, is useless alone; the hypothesis, which is unique to an instance, is useful even without a premise. This corroborates the well-known annotation artefacts in SNLI (Gururangan et al., 2018; Poliak et al., 2018), which are spurious correlations exploited by models to predict the correct answer for the wrong reasons.

Hate speech detection might have lexical biases.

Automatic hate speech detection is an increasingly important part of online moderation, but what causes a model to label speech as offensive? We find that in DWMW17, the text contains 0.724 bits of BERT-usable information about the label. Additionally, if one removed all the tokens except for 50 (potentially) offensive ones, comprising common racial and homophobic slurs (these terms were manually chosen based on a cursory review of the dataset and are listed in Appendix E), from the input post hoc, there would still remain 0.490 bits of BERT-usable information. In other words, just 50 (potentially) offensive words contain most of the BERT-usable information in DWMW17. Our findings corroborate prior work which shows that certain lexical items (e.g., swear words, identity mentions) are responsible for hate speech prediction (Dixon et al., 2018; Dinan et al., 2019). Allowing models to do well by simply pattern-matching may permit subtleties in hate speech to go undetected, perpetuating harm towards minority groups (Blodgett et al., 2020).
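A sketch (ours) of the vocabulary-filtering transformation described above; OFFENSIVE_TERMS is a hypothetical placeholder for the 50-term list in Appendix E, which we do not reproduce here.

```python
# Keep only tokens from a fixed list of (potentially) offensive terms.
OFFENSIVE_TERMS = {"<term_1>", "<term_2>"}   # placeholder for the actual 50-term list

def offensive_only(post):
    return " ".join(t for t in post.split() if t.lower() in OFFENSIVE_TERMS)
```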

4.2 Slicing Datasets

Attribute | Entailment | Neutral | Contradiction
original | 1.188 | 1.064 | 1.309
shuffled | 1.130 | 0.984 | 1.224
hypothesis only | 0.573 | 0.553 | 0.585
premise only | 0.032 | -0.016 | -0.016
overlap | 0.415 | 0.177 | 0.298

Table 2: The average amount of usable information (i.e., mean pvi, in bits) that each attribute contains about each class in SNLI, according to BERT-base. Some attributes are more useful for a particular class: e.g., the degree of premise-hypothesis overlap is most useful for predicting 'entailment'. Note that the mean pvi for a particular class is different from the $\mathcal{V}$-information.

Certain attributes are more useful for certain classes.

Comparing the usefulness of an attribute across classes can help identify systemic annotation artefacts. This can be done by simply averaging the pvi over the slice of data whose difficulty we are interested in measuring. Note that the equivalence between $\mathcal{V}$-information and expected pvi only holds when the model used to estimate pvi is trained over the entire dataset, which means that the average pvi of a slice of data is not its $\mathcal{V}$-information. It would not make sense to estimate the $\mathcal{V}$-information of a slice because it would require training on examples from just one class, in which case the $\mathcal{V}$-information would be zero. Thus, the mean pvi is the only usable difficulty measure here.

We do this for SNLI in Table 2. We see that the tokens in the premise-hypothesis overlap contain much more BERT-usable information about the 'entailment' class than 'contradiction' or 'neutral'. This is unsurprising, given that the simplest means of entailing a premise is to copy it into the hypothesis and provide some additional detail. If there is no inherent reason for an attribute to be more or less useful for a class, as there is for overlap and entailment, there may be an artefact at work. Even when there is an inherent reason for an attribute to be useful for a particular slice of the data, an attribute being exceptionally useful may also be evidence of a dataset artefact. For example, if the premise-hypothesis overlap provided almost all the usable information needed for entailment, it may be because the crowdworkers who created the dataset took a shortcut by copying the premise to create the hypothesis.

In Appendix F, we show how similar comparisons can be made between instances.

Certain subsets of each class are more difficult than others.

In Figure 5, we bin the examples in each SNLI class by the level of hypothesis-premise overlap and plot the average pvi. We see that entailment instances with no hypothesis-premise overlap are the most difficult (i.e., lowest mean pvi) while contradiction instances with no overlap are the easiest (i.e., highest mean pvi). This is not surprising, since annotation artefacts in SNLI arise from constructing entailment and contradiction via trivial changes to the premise (Gururangan et al., 2018).

We additionally consider slices in the dataset based on dataset cartography (Swayamdipta et al., 2020), which uses training dynamics to differentiate instances via their (1) confidence (i.e., mean probability of the correct label across epochs), and (2) variability (i.e., variance of the former). The result is a dataset map revealing three regions: easy-to-learn, hard-to-learn, and ambiguous w.r.t. the trained model. Slices of the dataset based on cartographic regions have distinct ranges of average pvi (Fig. 13 in Appendix H).

Figure 5: The mean pvi of SNLI instances according to BERT-base, broken down by the overlap length (i.e., the number of tokens shared by the hypothesis and premise). Entailment examples with no overlap are the most difficult (i.e., lowest mean pvi).

4.3 Token-level Artefacts

WARNING: The following content contains language from the DWMW17 dataset that is offensive in nature.

DWMW17 (Davidson et al., 2017)
Hate Speech | Offensive | Neither
f*ggots (3.844) | r*tards (2.821) | lame (4.426)
f*g (3.73) | n*gs (2.716) | clothes (0.646)
f*ggot (3.658) | n*gro (2.492) | dog (0.616)
c**ns (3.53) | n*g (2.414) | cat (0.538)
n*ggers (3.274) | c*nts (2.372) | iDntWearCondoms (0.517)

CoLA (Warstadt et al., 2018)
Grammatical | Ungrammatical
will (0.267) | book (2.737)
John (0.168) | is (2.659)
. (0.006) | was (2.312)
and (-0.039) | of (2.308)
in (-0.05) | to (1.972)

Table 3: Token-level annotation artefacts in DWMW17 and CoLA. These are the tokens whose omission leads to the greatest average increase in conditional entropy for each class (given in parentheses). Note that certain racial slurs are correctly identified as 'hate speech' but in-group variants of the same terms fall under 'offensive' instead. The full lists are available in Appendix G.

Transforming $X$ and then measuring the $\mathcal{V}$-information to discover all token-level signals and artefacts is untenable, since we would need to finetune one new model per token. Instead, we compute the change in the $\mathcal{V}$-information estimate after removing a token $t$, which yields the modified input $x_{\neg t}$. We use the same model $g'$ but evaluate only on a slice of the data, $\mathcal{D}_{C,t}$, which contains the token $t$ and belongs to the class $C$ of interest. This simplifies to measuring the increase in conditional entropy:

1|š’ŸC,t|āˆ‘š’ŸC,t[āˆ’log2g′[x¬t](y)+log2g′[x](y)]\begin{split}\frac{1}{|\mathcal{D}_{C,t}|}\sum_{\mathcal{D}_{C,t}}[&-\log_{2}g^{\prime}[x_{\neg t}](y)+\log_{2}g^{\prime}[x](y)]\end{split}

Token-level signals and artefacts can be discovered using leave-one-out.

Table 3 shows that auxiliary verbs (e.g., be, did) and prepositions are most indicative of ungrammatical sentences in CoLA; in contrast, grammatical sentences have no strong indicators, with no word on average increasing the conditional entropy above 0.30 upon omission.

In DWMW17, racial and homophobic slurs are the top indicators of 'hate speech'. However, in-group variants of the same racial slur, commonly used in African-American Vernacular English (AAVE), fall under 'offensive' instead. The fact that AAVE terms are marked as 'offensive' supports previous findings that hate speech detection datasets may themselves be biased (Sap et al., 2019). In SNLI, we found many of the token-level artefacts matching those found using descriptive statistics in Gururangan et al. (2018). The complete word lists are available in Appendix G.

4.4 Conditioning Out Information

What if we wanted to measure how much BERT-usable information offensive words contain about the label in DWMW17 beyond that which is captured in the sentiment? In other words, if we already had access to the sentiment polarity of a text (positive/negative/neutral), how many additional bits of information would the offensive words provide? We cannot estimate this by simply subtracting $I_{\mathcal{V}}(\text{sentiment} \to Y)$ from $I_{\mathcal{V}}(\text{offensive} \to Y)$, since that difference could potentially be negative. Acquiring another random variable should not decrease the amount of information we have about the label (at worst, it should be useless).

To capture this intuition, Hewitt et al. (2021) proposed conditional $\mathcal{V}$-information, which allows one to condition out any number of random variables. Given a set of random variables $\mathcal{B}$ that we want to condition out, it is defined as:

Iš’±ā€‹(X→Y|ℬ)=Hš’±ā€‹(Y|ℬ)āˆ’Hš’±ā€‹(Y|ā„¬āˆŖ{X})I_{\mathcal{V}}(X\to Y|\mathcal{B})=H_{\mathcal{V}}(Y|\mathcal{B})-H_{\mathcal{V}}(Y|\mathcal{B}\cup\{X\}) (6)

The conditional entropy with respect to multiple variables is the only new concept here. It is estimated in practice by concatenating the text inputs represented by $\mathcal{B}$ and $X$, which in our example are the sentiment polarity (one of 'negative'/'neutral'/'positive') and the sequence of offensive words in the input. The actual model family need not change to accommodate the longer text, as long as it remains under the input token limit. (If the inputs were vectors, the inputs represented by $\mathcal{B}$ and $\mathcal{B} \cup \{X\}$ would need to be of the same size; to do so, we would concatenate a zero vector to the former (Hewitt et al., 2021).) We find that offensive words contain 0.482 bits of BERT-usable information about the label beyond that which is contained in text sentiment, where the sentiment was categorized as positive/neutral/negative based on the polarity estimated by spaCy's built-in sentiment classifier (Honnibal & Montani, 2017). This is close to all of the BERT-usable information that the offensive words contain about the label (0.490 bits), suggesting that the predictive power of (potentially) offensive words is not mediated through sentiment in DWMW17.
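A sketch (ours) of how the text inputs are formed for Eq. 6 when conditioning out sentiment in DWMW17: one model is finetuned on $\mathcal{B}$ alone and another on $\mathcal{B}$ together with $X$, and both conditional entropies are estimated as in §2.2. The "[SEP]" separator below is our choice, not something prescribed by the paper.

```python
# Inputs for the two finetuned models used to estimate the conditional V-information.
def input_for_B(sentiment):                            # used for H_V(Y | B)
    return sentiment                                    # e.g., "negative"

def input_for_B_and_X(sentiment, offensive_tokens):     # used for H_V(Y | B ∪ {X})
    return sentiment + " [SEP] " + " ".join(offensive_tokens)

print(input_for_B_and_X("negative", ["<term_1>", "<term_2>"]))   # placeholder tokens
```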

5 Related Work

While prior literature has acknowledged that not all data instances are equal (Vodrahalli et al., 2018; Swayamdipta et al., 2020), there have been few efforts to estimate dataset difficulty formally and directly. As a notable exception, Zhang et al. (2020) proposed DIME, an information-theoretic measure to estimate a lower bound on the lowest possible (i.e., model-agnostic) 0-1 error. Model-agnostic approaches do not explain why some datasets are easier for some models, and have limited interpretability. In contrast, $\mathcal{V}$-information and pvi are specific to a model family $\mathcal{V}$.

Various techniques have been proposed to differentiate data instances within a dataset. Text-based heuristics such as word identity (Bengio et al., 2009) or input length (Spitkovsky et al., 2010; Gururangan et al., 2018) have sometimes been used as proxies for instance difficulty, but offer limited insight into difficulty w.r.t. models. Other approaches consider training loss (Han et al., 2018; Arazo et al., 2019; Shen & Sanghavi, 2019), confidence (Hovy et al., 2013), prediction variance (Chang et al., 2017), and area under the curve (Pleiss et al., 2020). Estimates relying on model training dynamics (Toneva et al., 2018; Swayamdipta et al., 2020), gradient magnitudes (Vodrahalli et al., 2018), or loss magnitudes (Han et al., 2018) are sensitive to factors such as variance during steps of training. Influence functions (Koh & Liang, 2017), forgetting events (Toneva et al., 2018), and the Data Shapley (Ghorbani & Zou, 2019; Jia et al., 2019) can all be used to assign pointwise estimates of importance to data instances based on their contribution to the decision boundary. Moreover, although these methods all capture some aspect of difficulty, they do not lend themselves to interpreting datasets as readily as the predictive $\mathcal{V}$-information framework.

Given its dependence on training behavior across time, cartography (Swayamdipta et al., 2020) offers complementary benefits to $\mathcal{V}$-information. It can be non-trivial to measure differences between, say, a CoLA data map and an SNLI data map w.r.t. BERT. In contrast, $\mathcal{V}$-information provides a formal framework to make dataset difficulty estimates as an aggregate to compare datasets w.r.t. a model. Other work has offered insight by splitting the data into "easy" and "hard" sets with respect to some attribute and studying changes in model performance, but these methods do not offer a pointwise estimate of difficulty (Sugawara et al., 2018; Rondeau & Hazen, 2018; Sen & Saffari, 2020).

Item response theory (IRT; Embretson & Reise, 2013) allows the difficulty of instances to be learned via parameters in a probabilistic model meant to explain model performance (Lalor et al., 2018; Rodriguez et al., 2021). However, it does not formally relate dataset difficulty to the model being evaluated. Estimating instance difficulty is also evocative of instance selection for active learning (Lewis & Catlett, 1994; Fu et al., 2013; Liu & Motoda, 2002); however, these estimates could change as the dataset picks up new instances. In contrast, pvi estimates are relatively stable, especially when the dataset has higher $\mathcal{V}$-information. Uncertainty sampling, for example, picks the instances that the partially trained model is least certain about (Lewis & Gale, 1994; Nigam et al., 2000), which could be interpreted as a measure of difficulty. However, once an instance is used for training, the model may become much more certain about it, meaning that the uncertainty values are unstable.

Interpretability of the role of certain attributes in trained models has lately led to the discovery of many dataset artefacts in NLP. Our approach to discovering dataset artefacts can also complement existing approaches to artefact discovery (Gardner et al., 2021; Pezeshkpour et al., 2021; Le Bras et al., 2020). Rissanen data analysis (Perez et al., 2021) offers a complementary method for interpretability w.r.t. attributes; it involves calculating the minimum description length (MDL): how many bits are needed to transmit the gold labels from a sender to a recipient when both have access to the same model and inputs. Since the framework depends on the order of instances (i.e., what data has been transmitted thus far), it is unsuitable for estimating dataset difficulty. In contrast, $\mathcal{V}$-information is defined w.r.t. a data distribution, so it is (in theory) agnostic to the data and its ordering during finetuning.

š’±\mathcal{V}-information (Xu etĀ al., 2019) has had limited adoption in NLP. It has been used to study what context features Transformers actually use (O’Connor & Andreas, 2021), as well as to condition out information for probing-based interpretability techniques (Hewitt etĀ al., 2021; Pimentel & Cotterell, 2021). However, to the best of our knowledge, ours is the first approach to use š’±\mathcal{V}-usable information for estimating the difficulty of NLP datasets.

6 Future Work

There has been much work in the way of model interpretability, but relatively little in the way of dataset interpretability. Our framework will allow datasets to be probed, helping us understand what exactly we are testing for in models and how pervasive annotation artefacts really are. By identifying the attributes responsible for difficulty, it will be possible to build challenge sets in a more principled way and reduce artefacts in existing datasets. By studying which attributes contain information that is unusable by existing SOTA models, model creators may make more precise changes to architectures. More immediate directions of future work include:

  1. Understanding how changes to the data distribution change the difficulty of individual examples.

  2. Extending $\mathcal{V}$-information to open-ended text generation, which does not induce explicit distributions over the output space. This may require truncating the output space (e.g., using beam search with fixed width).

  3. Applying $\mathcal{V}$-information to estimate dataset difficulty in other modalities (e.g., image, audio, tabular, etc.). There is nothing limiting the use of $\mathcal{V}$-information to the NLP domain. For example, one could create a set of image filters, for different colors and objects, use them to transform the image, and then measure the drop in usable information.

7 Conclusion

We provided an information-theoretic perspective to understanding and interpreting the difficulty of various NLP datasets. We extended predictive $\mathcal{V}$-information to estimate difficulty at the dataset level, and then introduced pointwise $\mathcal{V}$-information (pvi) for measuring the difficulty of individual instances. We showed that instances with lower pvi had lower levels of annotator agreement and were less likely to be predicted correctly. We then demonstrated how systemic and token-level annotation artefacts in a dataset could be discovered by manipulating the input before calculating these measures. Our studies indicate that $\mathcal{V}$-information offers a new, efficient means of interpreting NLP datasets.

Acknowledgements

We thank Dan Jurafsky, Nelson Liu, Daniel Khashabi, and the anonymous reviewers for their helpful comments.

References

  • Arazo et al. (2019) Arazo, E., Ortego, D., Albert, P., O'Connor, N., and McGuinness, K. Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning, pp. 312–321. PMLR, 2019. URL https://arxiv.org/abs/1904.11238.
  • Bengio et al. (2009) Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 41–48, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553380. URL https://doi.org/10.1145/1553374.1553380.
  • Blodgett et al. (2020) Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454–5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main.485.
  • Bowman et al. (2015) Bowman, S., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642, 2015. URL https://aclanthology.org/D15-1075/.
  • Chang et al. (2017) Chang, H.-S., Learned-Miller, E., and McCallum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. Advances in Neural Information Processing Systems, 30:1002–1012, 2017. URL https://arxiv.org/abs/1704.07433.
  • Davidson et al. (2017) Davidson, T., Warmsley, D., Macy, M., and Weber, I. Automated hate speech detection and the problem of offensive language. In Proceedings of the 11th International AAAI Conference on Web and Social Media, ICWSM '17, pp. 512–515, 2017. URL https://arxiv.org/abs/1703.04009.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Dinan et al. (2019) Dinan, E., Humeau, S., Chintagunta, B., and Weston, J. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4537–4546, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1461. URL https://aclanthology.org/D19-1461.
  • Dixon et al. (2018) Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES '18, pp. 67–73, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450360128. doi: 10.1145/3278721.3278729. URL https://doi.org/10.1145/3278721.3278729.
  • Dodge et al. (2019) Dodge, J., Gururangan, S., Card, D., Schwartz, R., and Smith, N. A. Show your work: Improved reporting of experimental results. In Proc. of EMNLP-IJCNLP, pp. 2185–2194, 2019. URL https://aclanthology.org/D19-1224/.
  • Dodge et al. (2020) Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H., and Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping, 2020. URL https://arxiv.org/abs/2002.06305. arXiv preprint arXiv:2002.06305.
  • Embretson & Reise (2013) Embretson, S. E. and Reise, S. P. Item response theory. Psychology Press, 2013. URL https://doi.org/10.4324/9781410605269.
  • Ethayarajh & Jurafsky (2020) Ethayarajh, K. and Jurafsky, D. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4846–4853, 2020. URL https://aclanthology.org/2020.emnlp-main.393/.
  • Fu et al. (2013) Fu, Y., Zhu, X., and Li, B. A survey on instance selection for active learning. Knowledge and Information Systems, 35(2):249–283, 2013. URL https://link.springer.com/article/10.1007/s10115-012-0507-8.
  • Gardner et al. (2021) Gardner, M., Merrill, W., Dodge, J., Peters, M. E., Ross, A., Singh, S., and Smith, N. Competency problems: On finding and removing artifacts in language data, 2021. URL https://arxiv.org/abs/2104.08646.
  • Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Data Shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. PMLR, 2019. URL https://arxiv.org/abs/1904.02868.
  • Gururangan et al. (2018) Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. Annotation artifacts in natural language inference data. In Proc. of NAACL, pp. 107–112, 2018. URL https://aclanthology.org/N18-2017.
  • Han et al. (2018) Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I. W., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018. URL https://arxiv.org/abs/1804.06872.
  • Hewitt et al. (2021) Hewitt, J., Ethayarajh, K., Liang, P., and Manning, C. D. Conditional probing: measuring usable information beyond a baseline. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1626–1639, 2021. URL https://aclanthology.org/2021.emnlp-main.122/.
  • Honnibal & Montani (2017) Honnibal, M. and Montani, I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017. URL https://spacy.io/.
  • Hovy et al. (2013) Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., and Hovy, E. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1120–1130, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/N13-1132.
  • Jia et al. (2019) Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gürel, N. M., Li, B., Zhang, C., Song, D., and Spanos, C. J. Towards efficient data valuation based on the Shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. PMLR, 2019. URL https://arxiv.org/abs/1902.10275.
  • Koh & Liang (2017) Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894. PMLR, 2017. URL https://arxiv.org/abs/1703.04730v3.
  • Kumar et al. (2019) Kumar, A., Liang, P., and Ma, T. Verified uncertainty calibration. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3792–3803, 2019. URL https://arxiv.org/abs/1909.10155.
  • Lalor et al. (2018) Lalor, J. P., Wu, H., Munkhdalai, T., and Yu, H. Understanding deep learning performance through an examination of test set difficulty: A psychometric case study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4711–4716, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1500. URL https://aclanthology.org/D18-1500.
  • Le Bras et al. (2020) Le Bras, R., Swayamdipta, S., Bhagavatula, C., Zellers, R., Peters, M., Sabharwal, A., and Choi, Y. Adversarial filters of dataset biases. In International Conference on Machine Learning, pp. 1078–1088. PMLR, 2020. URL https://arxiv.org/abs/2002.04108.
  • Lewis & Catlett (1994) Lewis, D. D. and Catlett, J. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pp. 148–156. Elsevier, 1994. URL https://openreview.net/forum?id=HyEyBoWObr.
  • Lewis & Gale (1994) Lewis, D. D. and Gale, W. A. A sequential algorithm for training text classifiers. In SIGIR'94, pp. 3–12. Springer, 1994. URL https://arxiv.org/abs/cmp-lg/9407020.
  • Lewis et al. (2020) Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. of ACL, pp. 7871–7880, 2020. URL https://aclanthology.org/2020.acl-main.703/.
  • Liu & Motoda (2002) Liu, H. and Motoda, H. On issues of instance selection. Data Mining and Knowledge Discovery, 6(2):115, 2002. URL https://link.springer.com/article/10.1023/A:1014056429969.
  • Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692. arXiv preprint arXiv:1907.11692.
  • Ma etĀ al. (2021) Ma, Z., Ethayarajh, K., Thrush, T., Jain, S., Wu, L., Jia, R., Potts, C., Williams, A., and Kiela, D. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking, 2021. URL https://arxiv.org/abs/2106.06052. arXiv preprint arXiv:2106.06052.
  • Miller (2019) Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, Feb 2019. ISSN 0004-3702. doi: 10.1016/j.artint.2018.07.007. URL http://dx.doi.org/10.1016/J.ARTINT.2018.07.007.
  • Mosbach et al. (2020) Mosbach, M., Andriushchenko, M., and Klakow, D. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations, 2020. URL https://arxiv.org/abs/2006.04884.
  • Nigam etĀ al. (2000) Nigam, K., McCallum, A.Ā K., Thrun, S., and Mitchell, T. Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2):103–134, 2000. URL https://link.springer.com/article/10.1023/A:1007692713085.
  • O’Connor & Andreas (2021) O’Connor, J. and Andreas, J. What context features can transformer language models use? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.Ā  851–864, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.70. URL https://aclanthology.org/2021.acl-long.70.
  • Perez etĀ al. (2021) Perez, E., Kiela, D., and Cho, K. Rissanen data analysis: Examining dataset characteristics via description length. In ICML, 2021. URL https://arxiv.org/abs/2103.03872.
  • Pezeshkpour etĀ al. (2021) Pezeshkpour, P., Jain, S., Singh, S., and Wallace, B.Ā C. Combining feature and instance attribution to detect artifacts, 2021. URL https://arxiv.org/abs/2107.00323. arXiv preprint arXiv:2107.00323.
  • Pimentel & Cotterell (2021) Pimentel, T. and Cotterell, R. A bayesian framework for information-theoretic probing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.Ā  2869–2887, 2021. URL https://arxiv.org/abs/2109.03853.
  • Pleiss etĀ al. (2020) Pleiss, G., Zhang, T., Elenberg, E.Ā R., and Weinberger, K.Ā Q. Identifying mislabeled data using the area under the margin ranking. Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2001.10528.
  • Poliak etĀ al. (2018) Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., and VanĀ Durme, B. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pp.Ā  180–191, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/S18-2023. URL https://aclanthology.org/S18-2023.
  • Radford etĀ al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
  • Recht et al. (2019) Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019. URL http://proceedings.mlr.press/v97/recht19a.html.
  • Rodriguez etĀ al. (2021) Rodriguez, P., Barrow, J., Hoyle, A.Ā M., Lalor, J.Ā P., Jia, R., and Boyd-Graber, J. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.Ā  4486–4503, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.346. URL https://aclanthology.org/2021.acl-long.346.
  • Rondeau & Hazen (2018) Rondeau, M.-A. and Hazen, T. J. Systematic error analysis of the Stanford Question Answering Dataset. In Proceedings of the Workshop on MRQA, pp. 12–20, 2018. URL https://aclanthology.org/W18-2602/.
  • Sanh et al. (2019) Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019. URL https://arxiv.org/abs/1910.01108. arXiv preprint arXiv:1910.01108.
  • Sap etĀ al. (2019) Sap, M., Card, D., Gabriel, S., Choi, Y., and Smith, N.Ā A. The risk of racial bias in hate speech detection. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp.Ā  1668–1678, 2019. URL https://aclanthology.org/P19-1163/.
  • Sen & Saffari (2020) Sen, P. and Saffari, A. What do models learn from question answering datasets? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.Ā  2429–2438, 2020. URL https://aclanthology.org/2020.emnlp-main.190.
  • Shannon (1948) Shannon, C.Ā E. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948. URL https://ieeexplore.ieee.org/abstract/document/6773024/.
  • Shen & Sanghavi (2019) Shen, Y. and Sanghavi, S. Learning with bad training data via iterative trimmed loss minimization. In Proc. of ICML, pp.Ā  5739–5748. PMLR, 2019. URL https://proceedings.mlr.press/v97/shen19e.html.
  • Spitkovsky etĀ al. (2010) Spitkovsky, V.Ā I., Alshawi, H., and Jurafsky, D. From baby steps to leapfrog: How ā€œless is moreā€ in unsupervised dependency parsing. In Proc. of NAACL, pp.Ā  751–759. Association for Computational Linguistics, 2010. URL https://aclanthology.org/N10-1116.
  • Sugawara etĀ al. (2018) Sugawara, S., Inui, K., Sekine, S., and Aizawa, A. What makes reading comprehension questions easier? In Proc. of EMNLP, pp.Ā  4208–4219, 2018. URL https://arxiv.org/abs/1808.09384.
  • Swayamdipta etĀ al. (2020) Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N.Ā A., and Choi, Y. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proc. of EMNLP, pp.Ā  9275–9293, 2020. URL https://aclanthology.org/2020.emnlp-main.746/.
  • Toneva etĀ al. (2018) Toneva, M., Sordoni, A., des Combes, R.Ā T., Trischler, A., Bengio, Y., and Gordon, G.Ā J. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1812.05159.
  • Torralba & Efros (2011) Torralba, A. and Efros, A.Ā A. Unbiased look at dataset bias. In CVPR 2011, pp.Ā  1521–1528. IEEE, 2011. URL https://ieeexplore.ieee.org/abstract/document/5995347.
  • Vodrahalli etĀ al. (2018) Vodrahalli, K., Li, K., and Malik, J. Are all training examples created equal? an empirical study, 2018. URL https://arxiv.org/abs/1811.12569.
  • Warstadt etĀ al. (2018) Warstadt, A., Singh, A., and Bowman, S.Ā R. Neural network acceptability judgments, 2018. URL https://arxiv.org/abs/1805.12471. arXiv preprint arXiv:1805.12471.
  • Williams etĀ al. (2018) Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. of NAACL, pp.Ā  1112–1122. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1101. URL https://aclanthology.org/N18-1101.
  • Xu etĀ al. (2019) Xu, Y., Zhao, S., Song, J., Stewart, R., and Ermon, S. A theory of usable information under computational constraints. In International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/2002.10689.
  • Zhang etĀ al. (2020) Zhang, P., Wang, H., Naik, N., Xiong, C., and Socher, R. DIME: An Information-Theoretic difficulty measure for AI datasets. In Proc. of NeurIPS 2020 Workshop for Deep Learning through Information Geometry, October 2020. URL https://openreview.net/forum?id=kvqPFy0hbF.

Appendix A Training Data

In Figure 6, we plot the $\mathcal{V}$-information estimate on the SNLI test set as BERT-base is trained on increasing amounts of training data. This tests the assumption that the training set is large enough to find the function $f \in \mathcal{V}$ that minimizes the conditional entropy. Although this assumption cannot be validated with complete certainty, since we do not have access to the true distribution, the estimate plateauing before all the training data is used suggests that training set size is not a limiting factor. We find that this is indeed the case for SNLI: on average, 80% of the training data yields the same estimate as the entire training set. When this assumption does not hold, readers may want to consider measuring the Bayesian mutual information instead (Pimentel & Cotterell, 2021).
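For concreteness, the sketch below shows one way the curve in Figure 6 could be reproduced. It is a minimal illustration rather than the released code: it follows the standard estimator $\hat{I}_{\mathcal{V}}(X \to Y) = \hat{H}_{\mathcal{V}}(Y) - \hat{H}_{\mathcal{V}}(Y \mid X)$ and assumes a hypothetical helper finetune_and_score(frac) that finetunes BERT-base on a random subsample of the SNLI training set and returns the gold-label probabilities assigned to each test instance by the conditional model and by a null-input model.

```python
import numpy as np

def v_information(p_y_given_x, p_y_given_null):
    """V-information estimate in bits, averaged over a held-out set.

    p_y_given_x[i]    -- prob. the conditional model (finetuned on real inputs)
                         assigns to the gold label of held-out instance i
    p_y_given_null[i] -- prob. the null model (finetuned on empty inputs)
                         assigns to the same gold label
    """
    h_y_given_x = -np.mean(np.log2(p_y_given_x))   # estimated H_V(Y | X)
    h_y = -np.mean(np.log2(p_y_given_null))        # estimated H_V(Y), via the null model
    return h_y - h_y_given_x

# Hypothetical sweep over training-set fractions, as in Figure 6:
# for frac in (0.2, 0.4, 0.6, 0.8, 1.0):
#     p_cond, p_null = finetune_and_score(frac)    # hypothetical helper
#     print(frac, v_information(p_cond, p_null))
```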

Figure 6: The š’±\mathcal{V}-information estimate on the SNLI test set when BERT-base is trained on increasing fractions of the training data, drawn as a random sample (with replacement). Here we plot the average and standard deviation across four samples for each fraction.

Appendix B Larger Models

Figure 7: Comparing accuracy and the $\mathcal{V}$-information estimate on the SNLI train and test sets w.r.t. various models.

In Figure 7, we plot the $\mathcal{V}$-information estimate for the SNLI train and test sets. In Figure 8, we plot the $\mathcal{V}$-information estimate on the CoLA in-domain held-out set for the four models that we previously studied, as well as a larger model, RoBERTa-large (Liu et al., 2019). Despite the increase in scale, the trends observed in §2 still hold.

Figure 8: Comparing accuracy and the $\mathcal{V}$-information estimate on the CoLA in-domain train and held-out sets w.r.t. various models.

Appendix C Qualitative Analysis

Premise | Hypothesis | Label | pvi
Twenty five people are marching. | A man plays the trombone on the sidewalk. | N | -9.966
A woman in a striped shirt holds an infant. | A person is watching TV. | N | -9.612
A person swimming in a swimming pool. | A person embraces the cold | N | -9.152
Women enjoying a game of table tennis. | Women are playing ping pong. | E | -8.713
A boy dressed for summer in a green shirt and kahki shorts extends food to a reindeer in a petting zoo. | A boy alien dressed for summer in a green shirt and kahki shorts | E | -8.486
Two skateboarders, one wearing a black t-shirt and the other wearing a white t-shirt, race each other. | Two snowboarders race. | E | -8.087
An Asian woman dressed in a colorful outfit laughing. | The woman is not laughing. | E | -7.903
An older gentleman looks at the camera while he is building a deck. | An older gentleman in overalls looks at the camera while he is building a stained red deck in front of a house. | E | -7.709
A man wearing black pants, an orange and brown striped shirt, and a black bandanna in a "just thrown a bowling ball" stance. | The bandana is expensive. | C | -7.685
Two girls kissing a man with a black shirt and brown hair on the cheeks. | Two girls kiss. | C | -7.582
Table 4: The 10 hardest (lowest pvi) instances in the SNLI test set, according to BERT-base. ā€˜E’ denotes entailment, ā€˜N’ neutral, and ā€˜C’ contradiction. Instances that are possibly mislabelled are colored red.

In Table 4, we list the 10 hardest instances in the SNLI test set according to BERT-base. All three classes (entailment, neutral, and contradiction) are represented in this list, with entailment slightly over-represented. Some of the examples are in fact mislabelled; e.g., 'PREMISE: An Asian woman dressed in a colorful outfit laughing. HYPOTHESIS: The woman is not laughing.' is labelled 'entailment' even though the correct label is 'contradiction'.
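A ranking like Table 4 can be produced with a few lines once per-instance gold-label probabilities are available from the conditional and null models; the snippet below is a sketch under that assumption (the array names p_cond and p_null are illustrative, not from the released code).

```python
import numpy as np

def pointwise_v_information(p_y_given_x, p_y_given_null):
    """pvi(x -> y) = -log2 g'[null](y) + log2 g[x](y), in bits, per instance."""
    return np.log2(p_y_given_x) - np.log2(p_y_given_null)

# Hypothetical usage over the SNLI test set:
# pvi = pointwise_v_information(p_cond, p_null)
# hardest = np.argsort(pvi)[:10]        # lowest-pvi instances, as in Table 4
# easiest = np.argsort(pvi)[-10:][::-1] # highest-pvi instances
```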

Appendix D Consistency of pvi estimates

Figure 9: Cross-model Pearson's $r$ between pvi estimates made by different finetuned models, on the SNLI and CoLA test sets. For SNLI, the estimates are consistent: what one model finds difficult, others find difficult as well. Since CoLA has less usable information for all these models, the correlations are lower. All correlations are highly statistically significant ($p < 0.001$).
Figure 10: Examples that human annotators find easier (as measured by the fraction of annotators, in the range [0.5, 1.0], that agree with the gold label) also have higher pvi on average.
BERT-base
Epoch/Epoch 1 2 3 5 10
1 1.000 0.908 0.871 0.838 0.762
2 0.908 1.000 0.929 0.883 0.795
3 0.871 0.929 1.000 0.879 0.796
5 0.838 0.883 0.879 1.000 0.833
10 0.762 0.795 0.796 0.833 1.000
BART-base
Epoch/Epoch 1 2 3 5 10
1 1.000 0.925 0.885 0.853 0.754
2 0.925 1.000 0.952 0.906 0.807
3 0.885 0.952 1.000 0.914 0.814
5 0.853 0.906 0.914 1.000 0.862
10 0.754 0.807 0.814 0.862 1.000
DistilBERT-base
Epoch/Epoch 1 2 3 5 10
1 1.000 0.928 0.884 0.828 0.766
2 0.928 1.000 0.952 0.890 0.825
3 0.884 0.952 1.000 0.900 0.819
5 0.828 0.890 0.900 1.000 0.860
10 0.766 0.825 0.819 0.860 1.000
GPT2
Epoch/Epoch 1 2 3 5 10
1 1.000 0.931 0.887 0.855 0.747
2 0.931 1.000 0.961 0.918 0.813
3 0.887 0.961 1.000 0.933 0.827
5 0.855 0.918 0.933 1.000 0.874
10 0.747 0.813 0.827 0.874 1.000
Table 5: Cross-epoch Pearson correlation between pvi estimates made on the SNLI test set while finetuning various models on the SNLI training set. The estimates are stable: when an instance is easy (or difficult) early on, it generally remains easy (or difficult). For all models studied, the cross-epoch correlation does not dip below 0.80 within the first five epochs.
seed Run 1 Run 2 Run 3 Run 4
Run 1 1.000 0.877 0.884 0.885
Run 2 0.877 1.000 0.887 0.882
Run 3 0.884 0.887 1.000 0.895
Run 4 0.885 0.882 0.895 1.000
Table 6: Cross-seed Pearson correlation between pvi estimates made after one epoch of finetuning BERT on SNLI with different random seeds. Estimates are stable: what a model finds difficult is mostly not due to chance.

Cross-Model Correlations

Figure 9 shows a heatmap of the cross-model Pearson's $r$ between pvi estimates made by different finetuned models on the SNLI and CoLA test sets; these results support the findings in §2.5.

Human Agreement

In Figure 10, we plot the average pvi at different levels of annotator agreement. Examples with higher annotator agreement have higher pvi on average; that is, what humans find difficult, these models also tend to find difficult.

Cross-Epoch Correlations

In Table 5, we list the cross-epoch Pearson correlation between pvi estimates made by the same model on the SNLI test set over the course of finetuning. The correlation is high ($r > 0.80$ during the first 5 epochs), suggesting that when an instance is easy (or difficult) early on, it tends to remain easy (or difficult).

Cross-Seed Correlations

In Table 6, we list the Pearson correlation between pvi estimates made by BERT across training runs with different random seeds. The correlation is high ($r > 0.87$), suggesting that what a model finds difficult is mostly not due to chance.
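The correlations in Figure 9 and Tables 5–6 are plain Pearson correlations between vectors of per-instance pvi values computed under different settings. The sketch below assumes the pvi values for each setting are stored in a dict of equal-length arrays (the dict keys shown in the usage comment are illustrative).

```python
import numpy as np
from scipy.stats import pearsonr

def pvi_correlations(pvi_by_setting):
    """Pairwise Pearson's r between pvi vectors from different settings
    (e.g., different finetuned models, training epochs, or random seeds)."""
    names = sorted(pvi_by_setting)
    r = np.eye(len(names))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i < j:
                r[i, j] = r[j, i] = pearsonr(pvi_by_setting[a], pvi_by_setting[b])[0]
    return names, r

# names, r = pvi_correlations({"epoch1": pvi_e1, "epoch5": pvi_e5, "epoch10": pvi_e10})
```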

Appendix E Transformations

Attribute | Transformation | Transformed Input
Original | (none) | PREMISE: Two girls kissing a man with a black shirt and brown hair on the cheeks. HYPOTHESIS: Two girls kiss.
Shuffled | shuffle tokens randomly | PREMISE: girls two a kissing man with a black cheeks shirt and hair brown on the . HYPOTHESIS: kiss two . girls
Hypothesis-only | only include hypothesis | HYPOTHESIS: Two girls kiss.
Premise-only | only include premise | PREMISE: Two girls kissing a man with a black shirt and brown hair on the cheeks.
Overlap | hypothesis-premise overlap | PREMISE: Two girls [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] . HYPOTHESIS: Two girls [MASK] .
Table 7: Given an NLI instance (see ā€˜Original’), each transformation isolates some attribute from the input. The headers ā€˜PREMISE’ and ā€˜HYPOTHESIS’ were added by us to transform the two sentence inputs into a single text input for all models that were evaluated.

WARNING: The following content contains language from the DWMW17 dataset that is offensive in nature.

In Table 7, we provide an instance from the SNLI test set in its original form and after various attribute-specific transformations have been applied to it. These only capture a small subset of the space of possible transformations.
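The transformations in Table 7 are straightforward to implement; below is a rough sketch. It uses whitespace tokenization for simplicity, which is an assumption on our part and only an approximation of the tokenization used in the original experiments.

```python
import random

def shuffled(premise, hypothesis, seed=0):
    """'Shuffled': randomly permute the tokens within each sentence."""
    rng = random.Random(seed)
    def shuf(s):
        toks = s.split()
        rng.shuffle(toks)
        return " ".join(toks)
    return f"PREMISE: {shuf(premise)} HYPOTHESIS: {shuf(hypothesis)}"

def hypothesis_only(premise, hypothesis):
    """'Hypothesis-only': drop the premise entirely."""
    return f"HYPOTHESIS: {hypothesis}"

def premise_only(premise, hypothesis):
    """'Premise-only': drop the hypothesis entirely."""
    return f"PREMISE: {premise}"

def overlap(premise, hypothesis):
    """'Overlap': keep only tokens shared by premise and hypothesis; mask the rest."""
    shared = {t.lower() for t in premise.split()} & {t.lower() for t in hypothesis.split()}
    mask = lambda s: " ".join(t if t.lower() in shared else "[MASK]" for t in s.split())
    return f"PREMISE: {mask(premise)} HYPOTHESIS: {mask(hypothesis)}"
```

Each transformed dataset is then treated exactly like the original: a model is finetuned on the transformed training inputs, and the resulting $\mathcal{V}$-information and pvi values measure how much usable information the isolated attribute carries about the label.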

For DWMW17, we hand-picked a set of 50 potentially offensive words based on a cursory review of the dataset to see how much information these terms alone contain about the label: ā€˜nigga’, ā€˜niggas’, ā€˜niggah’, ā€˜niggahs’, ā€˜hoe’, ā€˜hoes’, ā€˜bitch’, ā€˜bitches’, ā€˜whitey’, ā€˜white trash’, ā€˜cracker’, ā€˜crackers’, ā€˜beaner’, ā€˜beaners’, ā€˜pussy’, ā€˜pussies’, ā€˜fag’, ā€˜fags’, ā€˜faggot’, ā€˜faggots’, ā€˜ho’, ā€˜hos’, ā€˜redneck’, ā€˜rednecks’, ā€˜porn’, ā€˜fuck’, ā€˜fucks’, ā€˜fucker’, ā€˜fuckers’, ā€˜motherfucker’, ā€˜motherfuckers’, ā€˜nigger’, ā€˜niggers’, ā€˜coon’, ā€˜coons’, ā€˜niggaz’, ā€˜nig’, ā€˜nigs’, ā€˜slut’, ā€˜sluts’, ā€˜wigger’, ā€˜wiggers’, ā€˜fucked’, ā€˜fucking’, ā€˜wigga’, ā€˜wiggas’, ā€˜retard’, ā€˜retards’, and ā€˜retarded’.

Appendix F Instance-wise Comparisons

#7717: PREMISE: Little kids play a game of running around a pole. HYPOTHESIS: The kids are fighting outside.
#9627: PREMISE: A group of people watching a boy getting interviewed by a man. HYPOTHESIS: A group of people are sleeping on Pluto.

Figure 11: The pvi of two SNLI 'neutral' instances (#7717 and #9627) w.r.t. BERT-base after attribute-specific transformations, as well as the $\mathcal{V}$-information estimate (i.e., average pvi over the data) for each attribute. The latter instance is easier for BERT, likely because its hypothesis is much more informative due to being so different from its premise. Note that it makes sense to compare instances w.r.t. the same attribute, but not different attributes w.r.t. the same instance, since the models used to estimate the attribute $\mathcal{V}$-information $I_{\mathcal{V}}(\tau_a(X) \to Y)$ are chosen to maximize the likelihood of all the data.

Certain attributes are responsible for the difficulty of certain examples. Figure 11 is an example of how we might do a fine-grained comparison of instances to understand why one may be more difficult for a given model. We compare two SNLI 'neutral' instances from the test set to try to understand why #9627 is easier for BERT than #7717 (i.e., why $\textsc{pvi}(x_{9627} \to y_{9627}) > \textsc{pvi}(x_{7717} \to y_{7717})$), finding that it is likely due to the former's hypothesis being more informative. While different instances can be compared w.r.t. the same attribute, different attributes cannot be compared w.r.t. the same instance, since the models used to estimate the attribute-specific $\mathcal{V}$-information $I_{\mathcal{V}}(\tau_a(X) \to Y)$ are chosen to maximize the likelihood of all the data. This is why, for example, the pvi of #7717 is higher after its tokens have been shuffled, even though the average pvi (i.e., dataset-level $\mathcal{V}$-information) declines after shuffling tokens.

Appendix G Token-Level Artefacts

WARNING: The following content contains language from the DWMW17 dataset that is offensive in nature.

In Table 8, we list the tokens in the SNLI, CoLA, and DWMW17 datasets that, when dropped out, cause the greatest decrease in the $\mathcal{V}$-information estimate. These are token-level artefacts of each class in the dataset. In the DWMW17 hate speech detection dataset, racial and homophobic slurs are artefacts of hate speech, while ableist and sexual slurs are artefacts of offensive speech. In-group AAVE terms are also predictive of offensive speech in DWMW17 even when they are used non-offensively, hinting at possible bias in the dataset (Sap et al., 2019). In CoLA, auxiliary verbs and prepositions are artefacts of ungrammatical sentences; grammatical sentences don't have any artefacts. For SNLI, we recover many of the token-level artefacts found by Gururangan et al. (2018) using descriptive statistics, even uncommon ones, such as 'cat' for contradiction.
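The sketch below is one cheap way to approximate this kind of token-level analysis: it re-scores each instance with a single token dropped and ranks tokens by the average increase in the model's negative log-probability of the gold label (i.e., the increase in estimated conditional entropy) per class. It assumes a hypothetical gold_label_prob(text, label) helper backed by the finetuned model, and the exact procedure used for Table 8 may differ in its details.

```python
from collections import defaultdict
import numpy as np

def token_artefacts(instances, gold_label_prob, top_k=10):
    """Rank (label, token) pairs by the average increase in -log2 p(y|x)
    when the token is dropped from inputs of that class.

    instances       -- iterable of (text, label) pairs
    gold_label_prob -- hypothetical callable: (text, label) -> probability the
                       finetuned model assigns to `label` given `text`
    """
    deltas = defaultdict(list)
    for text, label in instances:
        tokens = text.split()
        base = -np.log2(gold_label_prob(text, label))
        for tok in set(tokens):
            ablated = " ".join(t for t in tokens if t != tok)
            deltas[(label, tok)].append(-np.log2(gold_label_prob(ablated, label)) - base)
    ranked = sorted(((float(np.mean(v)), label, tok)
                     for (label, tok), v in deltas.items()), reverse=True)
    return ranked[:top_k]
```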

Appendix H Relation to Dataset Cartography

Figure 12: Relationship between pvi and the training dynamics-based data map (Swayamdipta et al., 2020) for the SNLI held-out (test) set, computed for the DistilBERT-base architecture. As in Swayamdipta et al. (2020), the $y$-axis corresponds to confidence, i.e., the mean probability of the true class across training epochs, and the $x$-axis corresponds to variability, i.e., the standard deviation of the true class probabilities across the same. Colours indicate binned values of pvi. pvi estimates track closely with confidence.
Figure 13: Average pvi of the 10% most ambiguous, 10% hardest-to-learn, and 10% easiest-to-learn regions of the SNLI / DistilBERT-base data map (Fig. 12). Hard-to-learn instances are frequently mislabeled, and therefore also have the lowest average pvi values. Easy-to-learn instances, the most common class of instances in the SNLI dataset, have the highest average pvi values. Ambiguous instances are those that the model frequently changes its decision on during training; these have lower average pvi values than easy-to-learn instances.

Swayamdipta et al. (2020) introduced dataset cartography, a method to automatically analyze and diagnose datasets with respect to a trained model. It offers a complementary understanding of datasets and their properties, taking into account the behavior of a model towards different data instances during training. This behavior, termed training dynamics, helps differentiate instances via their (1) confidence (i.e., mean probability of the correct label across epochs) and (2) variability (i.e., standard deviation of the former). The result is a data map revealing three regions:

  • Easy-to-learn (high confidence, low variability) instances are the most frequent in the dataset; high-capacity models like BERT predict them correctly throughout training.
  • Hard-to-learn (low confidence, low variability) instances are predicted incorrectly throughout training; these were often shown to correspond to mislabeled examples.
  • Ambiguous (high variability) instances are those the model often changes its prediction for; these examples are the most responsible for high test performance, both in and out of distribution.

Figure 12 shows that pvi values track closely with the confidence axis of an SNLI / DistilBERT-base data map (Swayamdipta et al., 2020). (Data maps were originally plotted on training data; however, they can be plotted on held-out data by computing the training dynamics measures on it after every training epoch.) Data maps and pvi estimates offer orthogonal perspectives on instance difficulty, the former capturing the behavior of instances as training proceeds. Moreover, $\mathcal{V}$-information can estimate dataset difficulty in aggregate (§2), which is not the case for training dynamics measures, which offer only per-instance estimates. Both approaches can be helpful for discovering data artefacts. Predictive $\mathcal{V}$-information estimates, however, offer the unique capability of transforming the input to discover the value of certain attributes in an efficient manner.

In Figure 13, we report the average pvi estimates of the three regions discovered via data maps:

  • Easy-to-learn (high confidence, low variability) instances have the highest average pvi, indicating that they contain the most DistilBERT-usable information.
  • Hard-to-learn (low confidence, low variability) instances have the lowest average pvi, indicating that they contain the least DistilBERT-usable information. This is not surprising, since they often correspond to mislabeled instances, from which it is difficult to extract usable information.
  • Ambiguous (high variability) instances have a lower average pvi, indicating that they contain some usable information w.r.t. DistilBERT, but not as much as the easy-to-learn instances.

For each bar in the plot, we consider the 10% of the dataset belonging to each region, i.e., the instances with the most extreme values of the corresponding confidence and variability measures.
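For reference, the data-map coordinates behind Figures 12 and 13 can be computed from the gold-label probabilities recorded after each training epoch. The sketch below follows the definitions of confidence and variability above; the array and variable names are illustrative.

```python
import numpy as np

def data_map_coordinates(gold_probs_per_epoch):
    """Training-dynamics coordinates of Swayamdipta et al. (2020).

    gold_probs_per_epoch -- array of shape (n_epochs, n_instances): probability
                            of the gold label after each epoch, per instance.
    Returns per-instance confidence (mean over epochs) and variability (std).
    """
    confidence = gold_probs_per_epoch.mean(axis=0)
    variability = gold_probs_per_epoch.std(axis=0)
    return confidence, variability

# Average pvi of the 10% hardest-to-learn instances (lowest confidence), as in Figure 13:
# k = int(0.1 * len(confidence))
# print(pvi[np.argsort(confidence)[:k]].mean())
```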

WARNING: The following content contains language from the DWMW17 dataset that is offensive in nature.

DWMW17 (Davidson et al., 2017)
Hate Speech | Offensive | Neither
f*ggots (3.844) | r*tards (2.821) | lame (4.426)
f*g (3.73) | n*gs (2.716) | clothes (0.646)
f*ggot (3.658) | n*gro (2.492) | dog (0.616)
c*ons (3.53) | n*g (2.414) | cat (0.538)
n*ggers (3.274) | c*nts (2.372) | iDntWearCondoms (0.517)
qu*er (3.163) | p*ssies (2.29) | thank (0.47)
co*n (3.137) | qu*er (2.213) | kick (0.423)
n*gger (3.094) | r*tarded (1.997) | 30 (0.345)
d*ke (3.01) | c*nt (1.919) | football (0.334)
f*gs (2.959) | b*tches (1.858) | soul (0.323)
SNLI (Bowman et al., 2015)
Entailment | Neutral | Contradiction
nap (3.256) | tall (4.246) | Nobody (7.258)
bald (3.183) | naked (2.193) | not (4.898)
crying (2.733) | indoors (1.724) | no (4.458)
Woman (2.517) | light (1.442) | naked (3.583)
asleep (2.482) | fun (1.318) | crying (2.938)
sleeping (2.416) | bed (1.006) | indoors (2.523)
soda (2.267) | motorcycle (0.993) | vegetables (2.295)
bed (2.136) | works (0.969) | sleeping (2.293)
not (2.111) | race (0.943) | jogging (2.17)
snowboarder (2.099) | daughter (0.924) | cat (2.092)
CoLA (Warstadt et al., 2018)
Grammatical | Ungrammatical
will (0.267) | book (2.737)
John (0.168) | is (2.659)
. (0.006) | was (2.312)
and (-0.039) | of (2.308)
in (-0.05) | to (1.972)
' (-0.063) | you (1.903)
to (-0.195) | be (1.895)
of (-0.195) | in (1.618)
that (-0.379) | did (1.558)
the (-0.481) | The (1.427)
Table 8: Token-level annotation artefacts in each dataset. These are the tokens whose omission leads to the greatest average increase in conditional entropy for each class (given in parentheses).