Understanding Dataset Difficulty with 𝒱-Usable Information
Abstract
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty, w.r.t. a model 𝒱, as the lack of 𝒱-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for 𝒱. We further introduce pointwise 𝒱-information (pvi) for measuring the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models for the same dataset, 𝒱-usable information and pvi also permit the converse: for a given model 𝒱, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artefacts in widely-used NLP benchmarks.
1 Introduction
Datasets are designed to act as proxies for real-world tasks, yet most bear limited semblance to the tasks they purport to reflect (Torralba & Efros, 2011; Recht et al., 2019). Understanding dataset difficulty is therefore imperative to understanding progress in AI. In practice, however, estimating dataset difficulty is often limited to an informal comparison of state-of-the-art model performance to that of humans; the bigger the performance gap, the harder the dataset is said to be (Ethayarajh & Jurafsky, 2020; Ma et al., 2021). However, such performance metrics offer little understanding of the differential difficulty of individual instances, or of which attributes in the input a given model finds useful.

To understand why a dataset is difficult, we extend recent work in information theory (Xu et al., 2019). To illustrate, consider a model family 𝒱 that can learn to map a sentence X to its sentiment Y. Even if X were to be encrypted, the information X contains about Y would not be removed; in other words, the Shannon mutual information I(X; Y) would be unchanged (Shannon, 1948). However, encryption makes predicting the sentiment a lot more difficult for 𝒱. But why? Intuitively, the task is easier when X is unencrypted because the information it contains is usable by 𝒱; when X is encrypted, the information still exists but becomes unusable. This quantity, the 𝒱-usable information, reflects the ease with which 𝒱 can predict Y given X. Xu et al. (2019) show that it can be measured using the predictive 𝒱-information framework, which generalizes Shannon information to consider computational constraints.
Our work extends the above framework by framing dataset difficulty as the lack of 𝒱-usable information (we use the terms "𝒱-usable information" and "𝒱-information" from Xu et al. (2019) interchangeably). The higher the 𝒱-usable information, the easier the dataset is for 𝒱. Not only does this framework allow comparisons of models w.r.t. the same dataset, but also of different datasets w.r.t. the same model. Figure 1 illustrates that different datasets provide different amounts of usable information for the same model, even when the task is identical (i.e., natural language inference in the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets).
Building on the aggregate estimate of dataset difficulty, we introduce a measure called pointwise 𝒱-information (pvi) for estimating the difficulty of each instance w.r.t. a given distribution (§3). pvi estimates allow us to compare not only individual instances, but also the difficulty of slices of data w.r.t. 𝒱. On datasets containing more usable information (e.g., SNLI), pvi estimates are highly correlated (Pearson correlation above 0.75) across different models, seeds, and training time, and with human judgments of difficulty.
Comparisons of 𝒱-usable information before and after isolating an input attribute shed light on why the dataset is easy or difficult for 𝒱 (§4), which has significant implications for interpretability in AI (Miller, 2019). (Our code and data are available here.) Specifically, we use 𝒱-usable information to identify some limitations in benchmarks that are widely used in NLP to test for a model's understanding of different language phenomena:
- Word ordering has a limited impact on the difficulty of a popular natural language entailment benchmark, SNLI (Bowman et al., 2015), even though entailment describes a causal relationship.
- Some of the most difficult instances in SNLI and a popular grammaticality detection benchmark, CoLA (Warstadt et al., 2018), are mislabelled.
- In a popular dataset for hate speech detection (Davidson et al., 2017), just 50 (potentially) offensive words contain most of the BERT-usable information about the label; less subtle bias may be going undetected.
2 𝒱-Usable Information
2.1 Background
Consider a model family 𝒱, which can be trained to map text input X to its label Y. If we encrypted the text, or translated it into a language with a very complex grammar, it would be harder to predict Y given X using the same 𝒱. How might we measure this increase in difficulty? Shannon (1948)'s mutual information is not an option: it would not change after X is encrypted, as it allows for unbounded computation, including any needed to decrypt the text.
Intuitively, the task is easier when X is unencrypted because the information it contains is usable by 𝒱; when X is encrypted, this information still exists but becomes unusable. This quantity, called 𝒱-usable information, provides an estimate of the difficulty of a dataset w.r.t. 𝒱. It can be measured under a framework called predictive 𝒱-information, which generalizes Shannon information to measure how much information can be extracted from X about Y when constrained to functions in 𝒱, written as I_𝒱(X → Y) (Xu et al., 2019). The greater the 𝒱-information, the easier the dataset is for 𝒱. If 𝒱 is the set of all functions, i.e., under unbounded computation, 𝒱-information reduces to Shannon information.
Processing the input with some τ (e.g., by decrypting the text) can make prediction easier, allowing I_𝒱(τ(X) → Y) > I_𝒱(X → Y). Although this violates the data processing inequality, it explains the usefulness of certain types of processing, such as representation learning. Compared to X, the learned representations cannot have more Shannon information with Y, but they can have more usable information.
2.2 Definitions
As defined in Xu et al. (2019):
Definition 2.1.
Let X and Y denote random variables with sample spaces 𝒳 and 𝒴, respectively. Let ∅ denote a null input that provides no information about Y. Given a predictive family 𝒱, the predictive 𝒱-entropy is
$H_{\mathcal{V}}(Y) = \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log_2 f[\varnothing](Y)\right] \qquad (1)$
and the conditional 𝒱-entropy is
$H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log_2 f[X](Y)\right] \qquad (2)$
We use log base 2 to measure the entropies in bits of information, though one could also use the natural logarithm and measure them in nats instead.
Put simply, f[∅] and f[X] produce a probability distribution over the labels. The goal is to find the f ∈ 𝒱 that maximizes the log-likelihood of the label data with the input (Eq. 2) and without it (Eq. 1). f[∅] models the label entropy, so ∅ can be set to an empty string for most NLP tasks. Although a predictive family has a technical definition (𝒱 is a subset of all possible mappings from 𝒳 ∪ {∅} to distributions over 𝒴 that satisfies optional ignorance: for any distribution in the range of some f ∈ 𝒱, there is some f′ ∈ 𝒱 that maps every input to that distribution; see Xu et al. (2019) for why optional ignorance is necessary), most neural models, provided they are finetuned without any frozen parameters, easily meet this definition. Further, as per Xu et al. (2019):
Definition 2.2.
Let X and Y denote random variables with sample spaces 𝒳 and 𝒴, respectively. Given a predictive family 𝒱, the 𝒱-information is
$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X) \qquad (3)$
Because we are estimating this quantity on a finite dataset, the estimate can differ from the true 𝒱-information. Xu et al. (2019) provide PAC bounds for this error, where less complex 𝒱 and larger datasets yield tighter bounds. Xu et al. (2019) also list several useful properties of 𝒱-information:
- Non-Negativity: $I_{\mathcal{V}}(X \to Y) \geq 0$.
- Independence: If X is independent of Y, then $I_{\mathcal{V}}(X \to Y) = 0$.
- Monotonicity: If $\mathcal{U} \subseteq \mathcal{V}$, then $H_{\mathcal{U}}(Y) \geq H_{\mathcal{V}}(Y)$ and $H_{\mathcal{U}}(Y \mid X) \geq H_{\mathcal{V}}(Y \mid X)$.
Training with the cross-entropy loss finds the f ∈ 𝒱 that maximizes the log-likelihood of Y given X (Xu et al., 2019). Thus, H_𝒱(Y | X) can be easily computed by standard training or by finetuning a pre-trained model. (Improving model calibration using more advanced methods (Kumar et al., 2019) is a possible direction of future work.) We estimate H_𝒱(Y | X) by calculating the average negative log-probability of the gold label on an identically distributed held-out set. Since training with cross-entropy ultimately aims to find the infimum over the data distribution, not just the training set, it is important not to overfit the model to the training instances; this is of added significance for estimating 𝒱-information. We estimate H_𝒱(Y) by training or finetuning another model where X is replaced by ∅, intended to fit the label distribution. As such, computing 𝒱-information involves training or finetuning only two models.
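The estimation procedure above can be summarized with the following empirical quantities, where g is the model finetuned on (x, y) pairs, g′ is the model finetuned on (∅, y) pairs, and the sums run over a held-out set of n instances (this is a restatement of the procedure described in the text, not an additional result):

```latex
\hat{H}_{\mathcal{V}}(Y \mid X) = -\frac{1}{n}\sum_{i=1}^{n} \log_2 g[x_i](y_i), \qquad
\hat{H}_{\mathcal{V}}(Y) = -\frac{1}{n}\sum_{i=1}^{n} \log_2 g'[\varnothing](y_i), \qquad
\hat{I}_{\mathcal{V}}(X \to Y) = \hat{H}_{\mathcal{V}}(Y) - \hat{H}_{\mathcal{V}}(Y \mid X)
```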
2.3 Assumptions
Implicit in estimating the 𝒱-information is the assumption that the data used to find the optimal f ∈ 𝒱 and the data used to estimate the 𝒱-entropies are identically distributed, since 𝒱-information is ultimately a function of the two random variables X and Y. This dependence on the data distribution makes 𝒱-information well-suited for estimating and interpreting dataset difficulty. However, it is still possible to estimate the difficulty of sub-populations or subsets of the data, though it would be imprecise to refer to this measure as 𝒱-information (see §3.1 for details). We also assume that the difference between the empirical 𝒱-information (calculated using some finite dataset) and the true 𝒱-information (calculated over the distributions) is negligible, though this may not hold, for example, if the dataset is too small (see Appendix A).

2.4 Implications
𝒱-usable information allows us to compare:
- i. different models, by computing I_𝒱(X → Y) under different 𝒱 for the same X and Y (Fig. 2),
- ii. different datasets, by computing I_𝒱(X → Y) for the same 𝒱 (Fig. 1), and
- iii. different attributes of the input, by computing I_𝒱(τ(X) → Y) for the same 𝒱 and Y (§4).
While common classification metrics, such as accuracy or F1 score, are often used for the above comparisons, 𝒱-usable information offers a theoretically rigorous framework, making it better suited for interpretability. The 𝒱-usable information is measured in bits or nats (depending on the log base), allowing for standardized comparisons across models and datasets. Additionally, consider the case where X and Y are independent: here, model accuracy would be no greater than the majority-class frequency, but this frequency varies across datasets. 𝒱-information avoids this problem by factoring in the label entropy H_𝒱(Y); if X and Y are independent, then the 𝒱-information is provably zero.
Say we wish to compare two predictive families, 𝒰 and 𝒱, such that 𝒰 ⊆ 𝒱. Assuming both families can model the label distribution, the task will be at least as easy for the larger family. This provably obviates the need to evaluate simpler function families (e.g., linear functions) when estimating dataset difficulty. Our experiments show that this bears out in practice as well (Appendix B).
2.5 𝒱-Usable Information in Practice
We consider the natural language inference (NLI) task, which involves predicting whether a text hypothesis entails, contradicts, or is neutral to a text premise. We first apply the 𝒱-information framework to estimate the difficulty of a large-scale NLI dataset, Stanford NLI (SNLI; Bowman et al., 2015), across different state-of-the-art models. The four models we use are GPT2-small (Radford et al., 2019), BERT-base-cased (Devlin et al., 2019), DistilBERT-base-uncased (Sanh et al., 2019), and BART-base (Lewis et al., 2020). Figure 2 shows the 𝒱-information estimate for all four, as well as their accuracy on the SNLI train and held-out (test) sets, across 10 training epochs. See Appendix B for results with larger models.
Model performance tracks 𝒱-information.
As seen in Figure 2, the model with the most 𝒱-information on the SNLI test set is also the most accurate. This is intuitive, since extracting more information makes prediction easier. Overall, BART-base extracts the most 𝒱-information, followed by BERT-base, DistilBERT-base, and GPT2-small; accuracy follows the same trend.
𝒱-information is more sensitive to over-fitting than held-out performance.
At epoch 10, the 𝒱-information is at its lowest for all models, although the SNLI test accuracy has only declined slightly from its peak, as seen in Figure 2. This is because the models start becoming less certain about the correct label long before they start predicting the wrong label. This causes the conditional entropy H_𝒱(Y | X) to rise, and thus the 𝒱-information to decline, even while most of the probability mass is still placed on the correct label. This suggests that, compared to performance metrics like test accuracy, 𝒱-information can more readily inform us of over-fitting.
Different datasets for the same task can have different amounts of 𝒱-usable information.
We consider the MultiNLI dataset (Williams et al., 2018), a multi-genre counterpart of SNLI. Despite both being proxies for the NLI task, SNLI and MultiNLI have significantly different amounts of BERT-usable information, as shown in Figure 1. The 𝒱-information framework provides a principled means of measuring this difference in levels of difficulty; MultiNLI is expected to be more difficult than SNLI due to the diversity of genres it considers. Also shown is CoLA (Warstadt et al., 2018), a dataset for linguistic acceptability where each sentence is labeled as grammatical or not; this task is seemingly more difficult than NLI for BERT.
3 Measuring Pointwise Difficulty
While 𝒱-information provides an aggregate measure of dataset difficulty (§2), a closer analysis requires measuring the degree of usable information in individual instances (w.r.t. a given distribution). We extend the 𝒱-information framework to introduce a new measure called pointwise 𝒱-information (pvi) for individual instances. The higher the pvi, the easier the instance is for 𝒱, under the given distribution.
Definition 3.1 (Pointwise 𝒱-Information).
Given random variables X and Y and a predictive family 𝒱, the pointwise 𝒱-information (pvi) of an instance (x, y) is
$\mathrm{pvi}(x \to y) = -\log_2 g'[\varnothing](y) + \log_2 g[x](y) \qquad (4)$
where $g' \in \mathcal{V}$ s.t. $\mathbb{E}[-\log_2 g'[\varnothing](Y)] = H_{\mathcal{V}}(Y)$ and $g \in \mathcal{V}$ s.t. $\mathbb{E}[-\log_2 g[X](Y)] = H_{\mathcal{V}}(Y \mid X)$.
If 𝒱 were, for instance, the BERT function family, g and g′ would be the models after finetuning BERT with and without the input, respectively. For a held-out instance (x, y), the pvi is the difference in the log-probability these models place on the gold label. pvi is to 𝒱-information what pmi is to Shannon information:
$I_{\mathcal{V}}(X \to Y) = \mathbb{E}_{x, y}\left[\mathrm{pvi}(x \to y)\right] \qquad (5)$
Given this relationship, our understanding of 𝒱-information extends to pvi as well: higher-pvi instances are easier for 𝒱 and vice versa. A higher pvi increases the odds of being predicted correctly; this is intuitive because a correct prediction of a non-majority-class instance requires that some information be extracted from the instance. Although the 𝒱-information cannot be negative, the pvi can be, much like how pmi can be negative even though Shannon information cannot. A negative pvi simply means that the model is better off predicting the majority class than considering x, which can happen for many reasons (e.g., mislabelling). Examples with negative pvi can still be predicted correctly, as long as g[x] places most of the probability mass on the correct label. Algorithm 1 shows our computation of pvi and 𝒱-information (by averaging over pvi).
The pvi of an instance w.r.t. X and Y should only depend on the distribution of the random variables; sampling more from that distribution during finetuning should not change the pvi. However, an instance can be drawn from different distributions, in which case we would expect its pvi to differ. For example, say we have restaurant reviews and movie reviews, along with their sentiment. The instance ("That was great!", positive) could be drawn from either distribution, but we would expect its pvi to be different in each (even though the instance itself is the same).
3.1 Implications
In addition to the comparisons that 𝒱-information allows us to make (§2.4), pvi allows us to compare:
- iv. different instances drawn from the same distribution, w.r.t. the same 𝒱, and
- v. different slices of the same dataset, w.r.t. the same 𝒱.
Note that the average pvi of a slice of data is not its 𝒱-information, since we optimize the model w.r.t. the entire distribution. However, since in practice one often wishes to understand the relative difficulty of different subpopulations w.r.t. the training distribution, calculating the average pvi, as opposed to the 𝒱-information of the subpopulation itself, is more useful.
Algorithm 1: Computing pvi and 𝒱-information
Input: training data $\mathcal{D}_{\text{train}} = \{(x_i, y_i)\}$, held-out data $\mathcal{D}_{\text{test}}$, model family 𝒱
$g \leftarrow$ finetune 𝒱 on $\mathcal{D}_{\text{train}}$
$g' \leftarrow$ finetune 𝒱 on $\{(\varnothing, y_i)\}$
for $(x_i, y_i) \in \mathcal{D}_{\text{test}}$ do
  $\mathrm{pvi}(x_i \to y_i) \leftarrow -\log_2 g'[\varnothing](y_i) + \log_2 g[x_i](y_i)$
end for
$\hat{I}_{\mathcal{V}}(X \to Y) \leftarrow$ the mean pvi over $\mathcal{D}_{\text{test}}$
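For concreteness, a minimal Python sketch of Algorithm 1 is given below. It assumes the two finetuned models are available as callables returning a probability distribution over labels; the names (`model_xy`, `model_null`) and the data format are illustrative, not taken from the paper's released code.

```python
import math

def pvi_and_v_information(held_out, model_xy, model_null):
    """Compute per-instance pvi and their mean (the empirical V-information).

    held_out:   iterable of (x, y) pairs, identically distributed with the
                data used for finetuning.
    model_xy:   g  -- finetuned with inputs;   model_xy(x) -> {label: prob}
    model_null: g' -- finetuned on (null, y);  model_null() -> {label: prob}
    """
    pvis = []
    for x, y in held_out:
        p_given_x = model_xy(x)[y]          # g[x](y)
        p_null = model_null()[y]            # g'[null](y)
        pvis.append(math.log2(p_given_x) - math.log2(p_null))
    v_information = sum(pvis) / len(pvis)   # average pvi = estimate of I_V(X -> Y)
    return pvis, v_information
```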
3.2 pvi in Practice
pvi can be used to find mislabelled instances.
Correctly predicted instances have higher pvi values than incorrectly predicted ones. For the held-out sets in SNLI, MultiNLI and CoLA, the difference in mean pvi between instances correctly and incorrectly predicted by BERT-base is 3.03, 2.87, and 2.45 bits respectively. These differences are statistically significant (p < 0.001). Table 1 shows the most difficult (lowest-pvi) instances from CoLA; we further find that some of these are in fact mislabelled (see Appendix C for an analysis of SNLI).
Sentence | Label | PVI |
---|---|---|
Wash you! | No | -4.616 |
Who achieved the best result was Angela. | No | -4.584 |
Sue gave to Bill a book. | No | -3.649 |
Only Churchill remembered Churchill giving the Blood, Sweat and Tears speech. | No | -3.571 |
Cynthia chewed. | No | -3.510 |
It is a golden hair. | Yes | -3.251 |
I wonāt have some money. | No | -3.097 |
You may pick every flower, but leave a few for Mary. | No | -2.875 |
I know which book Mag read, and which book Bob said that you hadnāt. | Yes | -2.782 |
John promise Mary to shave himself. | Yes | -2.609 |
The pvi threshold at which predictions become incorrect is similar across datasets.
In Figure 3, we plot the pvi distribution of correctly and incorrectly predicted instances in each dataset. As expected, high-pvi instances are predicted correctly and low-pvi instances are not. Notably, the pvi threshold at which instances start being incorrectly predicted is similar across datasets. Such a pattern could not be observed with a performance metric because the label spaces are different, evincing why the 𝒱-information framework is so useful for cross-dataset comparison.

pvi estimates are highly consistent across models, training epochs, and random initializations.
The cross-model Pearson correlation between pvi estimates of SNLI instances is very high (above 0.80). However, the cross-model Pearson correlation is lower for CoLA (between 0.40 and 0.65); see Fig. 9 in Appendix D. This is because, as visualized in Figure 1, CoLA has less usable information, making difficulty estimates noisier. In the limit, if a dataset contained no usable information, then we would expect the correlation between pvi estimates across different models and seeds to be close to zero. It is also worth noting, however, that a high degree of cross-model correlation, as with SNLI, does not preclude comparisons between different models on the same dataset. Rather, it suggests that in SNLI, a minority of instances is responsible for distinguishing one model's performance from another. This is not surprising: given the similar complexity and architecture of these models, we would expect most instances to be equally easy. Moreover, despite the performance of Transformer-based models varying across random initializations (Dodge et al., 2019, 2020; Mosbach et al., 2020), we find that pvi estimates are quite stable: the correlation across seeds is high (for SNLI-finetuned BERT-base, across 4 seeds); see Table 6 in Appendix D. pvi also concurs with human judgments of difficulty; see Fig. 10 in Appendix D.

4 Uncovering Dataset Artefacts
A key limitation of standard evaluation metrics (e.g., accuracy) is the lack of interpretability: there is no straightforward way to understand why a dataset is as difficult as it is. 𝒱-usable information offers an answer by allowing comparison of different input variables under the same 𝒱 and Y, as discussed in §2.4. We consider two approaches for this: applying input transformations (§4.1) and slicing the dataset (§4.2).
4.1 Input Transformations
Our first approach involves applying different transformations τ(·) to the input in order to isolate an attribute, followed by calculating I_𝒱(τ(X) → Y) to measure how much information (usable by 𝒱) the attribute contains about the label. For example, by shuffling the tokens in X, we can isolate the influence of the word-order attribute.
Given that a transformation may make information more accessible (e.g., decrypting some encrypted text; cf. §2), it is possible for I_𝒱(τ(X) → Y) > I_𝒱(X → Y), so the latter should not be treated as an upper bound. Such transformations were applied by O'Connor & Andreas (2021) to understand what syntactic features Transformers use in next-token prediction; we take this a step further, aiming to discover annotation artefacts, compare individual instances, and ultimately understand the dataset itself. We present our findings on SNLI and CoLA, as well as DWMW17 (Davidson et al., 2017), a dataset for hate speech detection, where input posts are labeled as hate speech, offensive, or neither.
We apply transformations to the SNLI input to isolate different attributes (see Appendix E for an example): shuffled (shuffle tokens randomly), hypothesis-only (only include the hypothesis), premise-only (only include the premise), and overlap (keep only tokens appearing in both the premise and hypothesis).
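To illustrate, these transformations can be implemented as simple string manipulations of the premise-hypothesis pair before finetuning. The sketch below assumes whitespace tokenization and uses a [MASK] placeholder for non-overlapping tokens, mirroring the example in Appendix E; it is not necessarily the exact preprocessing used for the reported numbers.

```python
import random

def transform(premise: str, hypothesis: str, attribute: str) -> str:
    """Return the transformed NLI input that isolates the given attribute."""
    p_toks, h_toks = premise.split(), hypothesis.split()
    if attribute == "shuffled":          # isolate token identity (no word order)
        random.shuffle(p_toks)
        random.shuffle(h_toks)
        return f"PREMISE: {' '.join(p_toks)} HYPOTHESIS: {' '.join(h_toks)}"
    if attribute == "hypothesis-only":
        return f"HYPOTHESIS: {hypothesis}"
    if attribute == "premise-only":
        return f"PREMISE: {premise}"
    if attribute == "overlap":           # keep only tokens shared by both sides
        shared = set(p_toks) & set(h_toks)
        mask = lambda toks: " ".join(t if t in shared else "[MASK]" for t in toks)
        return f"PREMISE: {mask(p_toks)} HYPOTHESIS: {mask(h_toks)}"
    return f"PREMISE: {premise} HYPOTHESIS: {hypothesis}"  # original input
```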
Token identity alone provides most of the usable information in SNLI.
Figure 4 shows that the token identity alone, isolated by shuffling the input, contains most of the usable information for all models. The premise, which is often shared by multiple instances, is useless alone; the hypothesis, which is unique to an instance, is useful even without a premise. This corroborates the well-known annotation artefacts in SNLI (Gururangan et al., 2018; Poliak et al., 2018), which are spurious correlations exploited by models to predict the correct answer for the wrong reasons.
Hate speech detection might have lexical biases.
Automatic hate speech detection is an increasingly important part of online moderation, but what causes a model to label speech as offensive? We find that in DWMW17, the text contains 0.724 bits of BERT-usable information about the label. Additionally, if one removed all the tokens except for 50 (potentially) offensive ones, comprising common racial and homophobic slurs (these terms were manually chosen based on a cursory review of the dataset and are listed in Appendix E), from the input post hoc, there still remain 0.490 bits of BERT-usable information. In other words, just 50 (potentially) offensive words contain most of the BERT-usable information in DWMW17. Our findings corroborate prior work which shows that certain lexical items (e.g., swear words, identity mentions) are responsible for hate speech prediction (Dixon et al., 2018; Dinan et al., 2019). Allowing models to do well by simply pattern-matching may permit subtleties in hate speech to go undetected, perpetuating harm towards minority groups (Blodgett et al., 2020).
4.2 Slicing Datasets
Attribute | Entailment | Neutral | Contradiction |
---|---|---|---|
original | 1.188 | 1.064 | 1.309 |
shuffled | 1.130 | 0.984 | 1.224 |
hypothesis only | 0.573 | 0.553 | 0.585 |
premise only | 0.032 | -0.016 | -0.016 |
overlap | 0.415 | 0.177 | 0.298 |
Certain attributes are more useful for certain classes.
Comparing the usefulness of an attribute across classes can help identify systemic annotation artefacts. This can be done by simply averaging the pvi over the slice of data whose difficulty we are interested in measuring. Note that the equivalence between 𝒱-information and expected pvi only holds when the model used to estimate pvi is trained over the entire dataset, which means that the average pvi of a slice of data is not its 𝒱-information. It would not make sense to estimate the 𝒱-information of a slice anyway, because that would require training on examples from just one class, in which case the 𝒱-information would be zero. Thus the only usable difficulty measure is the mean pvi.
We do this for SNLI in Table 2. We see that the tokens in the premise-hypothesis overlap contain much more BERT-usable information about the "entailment" class than about "contradiction" or "neutral". This is unsurprising, given that the simplest means of entailing a premise is to copy it into the hypothesis and provide some additional detail. If there is no inherent reason for an attribute to be more or less useful (such as overlap for entailment), there may be an artefact at work. Even when there is an inherent reason for an attribute to be useful for a particular slice of the data, an attribute being exceptionally useful may also be evidence of a dataset artefact. For example, if the premise-hypothesis overlap provided almost all the usable information needed for entailment, it may be because the crowdworkers who created the dataset took a shortcut by copying the premise to create the hypothesis.
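Once per-instance pvi values have been computed (e.g., with Algorithm 1), slice-level difficulty estimates such as those in Table 2 reduce to a group-by average. A sketch using pandas follows; the file and column names are hypothetical.

```python
import pandas as pd

# One row per held-out instance: its gold label, any slice attributes of
# interest, and its precomputed pvi (hypothetical file).
df = pd.read_csv("snli_test_pvi.csv")

# Mean pvi per gold label; repeating this for each transformed input
# (shuffled, hypothesis-only, ...) yields a table like Table 2.
print(df.groupby("label")["pvi"].mean())

# Mean pvi of an arbitrary slice, e.g. entailment instances with no
# premise-hypothesis overlap (cf. Figure 5).
mask = (df["label"] == "entailment") & (df["overlap"] == 0)
print(df.loc[mask, "pvi"].mean())
```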
In Appendix F, we show how similar comparisons can be made between instances.
Certain subsets of each class are more difficult than others.
In Figure 5, we bin the examples in each SNLI class by the level of hypothesis-premise overlap and plot the average pvi. We see that entailment instances with no hypothesis-premise overlap are the most difficult (i.e., lowest mean pvi), while contradiction instances with no overlap are the easiest (i.e., highest mean pvi). This is not surprising, since annotation artefacts in SNLI arise from constructing entailment and contradiction via trivial changes to the premise (Gururangan et al., 2018).
We additionally consider slices of the dataset based on dataset cartography (Swayamdipta et al., 2020), which uses training dynamics to differentiate instances via their (1) confidence (i.e., mean probability of the correct label across epochs) and (2) variability (i.e., variance of the former). The result is a dataset map revealing three regions, easy-to-learn, hard-to-learn, and ambiguous, w.r.t. the trained model. Slices of the dataset based on cartographic regions have distinct ranges of average pvi (Fig. 13 in Appendix H).

4.3 Token-level Artefacts
WARNING: The following content contains language from the DWMW17 dataset that is offensive in nature.
DWMW17 (Davidson et al., 2017)
Hate Speech | Offensive | Neither
---|---|---
f*ggots (3.844) | r*tards (2.821) | lame (4.426) |
f*g (3.73) | n*gs (2.716) | clothes (0.646) |
f*ggot (3.658) | n*gro (2.492) | dog (0.616) |
c**ns (3.53) | n*g (2.414) | cat (0.538) |
n*ggers (3.274) | c*nts (2.372) | iDntWearCondoms (0.517) |
CoLA (Warstadt et al., 2018)
Grammatical | Ungrammatical
---|---
will (0.267) | book (2.737) |
John (0.168) | is (2.659) |
. (0.006) | was (2.312) |
and (-0.039) | of (2.308) |
in (-0.05) | to (1.972) |
Transforming the input and then measuring the 𝒱-information to discover all token-level signals and artefacts is untenable, since we would need to finetune one new model per token. Instead, we compute the change in the 𝒱-information estimate after removing a token t from the input x post hoc, which yields modified input x_{¬t}. We use the same model g, but evaluate only on a slice of the data C_t, which contains the token and belongs to the class of interest. This simplifies to measuring the increase in conditional entropy over the slice:
$\frac{1}{|C_t|} \sum_{(x_i, y_i) \in C_t} \left( \log_2 g[x_i](y_i) - \log_2 g[x_{i, \neg t}](y_i) \right)$
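A sketch of this leave-one-out computation is shown below, assuming a fixed finetuned model g (a callable returning label probabilities) and a precomputed slice of held-out instances containing the token; removing the token by simple string deletion is an illustrative simplification.

```python
import math

def token_importance(token, slice_xy, model_xy):
    """Average increase in conditional V-entropy (in bits) on a slice when
    `token` is removed post hoc; larger values indicate stronger class indicators."""
    deltas = []
    for x, y in slice_xy:                      # instances containing `token`
        x_loo = " ".join(t for t in x.split() if t != token)
        p_full = model_xy(x)[y]                # g[x](y)
        p_loo = model_xy(x_loo)[y]             # g[x with token removed](y)
        deltas.append(math.log2(p_full) - math.log2(p_loo))
    return sum(deltas) / len(deltas)
```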
Token-level signals and artefacts can be discovered using leave-one-out.
Table 3 shows that auxiliary verbs (e.g., be, did) and prepositions are most indicative of ungrammatical sentences in CoLA; in contrast, grammatical sentences have no strong indicators, with no word on average increasing the conditional entropy above 0.30 upon omission.
In DWMW17, racial and homophobic slurs are the top indicators of "hate speech". However, in-group variants of the same racial slur, commonly used in African-American Vernacular English (AAVE), fall under "offensive" instead. The fact that AAVE terms are marked as "offensive" supports previous findings that hate speech detection datasets may themselves be biased (Sap et al., 2019). In SNLI, we found many of the token-level artefacts matching those found using descriptive statistics by Gururangan et al. (2018). The complete word lists are available in Appendix G.
4.4 Conditioning Out Information
What if we wanted to measure how much BERT-usable information offensive words contain about the label in DWMW17 beyond that which is captured in the sentiment? In other words, if we already had access to the sentiment polarity of a text (positive/negative/neutral), how many additional bits of information would the offensive words provide? We cannot estimate this by simply subtracting one 𝒱-information estimate from the other, since that difference could potentially be negative. Acquiring another random variable should not decrease the amount of information we have about the label (at worst, it should be useless).
To capture this intuition, Hewitt et al. (2021) proposed conditional 𝒱-information, which allows one to condition out any number of random variables. Given a set of random variables B that we want to condition out, it is defined as:
$I_{\mathcal{V}}(X \to Y \mid B) = H_{\mathcal{V}}(Y \mid B) - H_{\mathcal{V}}(Y \mid X, B) \qquad (6)$
The conditional entropy with respect to multiple variables is the only new concept here. It is estimated in practice by concatenating the text inputs represented by B and X, which in our example are the sentiment polarity (one of "negative"/"neutral"/"positive") and the sequence of offensive words in the input. The actual model family need not change to accommodate the longer text, as long as it remains under the input token limit. (If the inputs were vectors, the inputs represented by B and X would need to be of the same size; to do so, we would concatenate a zero vector to the former (Hewitt et al., 2021).) We find that offensive words contain 0.482 bits of BERT-usable information about the label beyond that which is contained in the text sentiment. (The sentiment was categorized as positive/neutral/negative based on the polarity estimated by spaCy's built-in sentiment classifier (Honnibal & Montani, 2017).) This is close to all of the BERT-usable information that the offensive words contain about the label (0.490 bits), suggesting that the predictive power of (potentially) offensive words is not mediated through sentiment in DWMW17.
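To make the construction concrete, the sketch below builds the two text inputs needed to estimate H_𝒱(Y | B) and H_𝒱(Y | X, B) for this example, with B the sentiment polarity and X the offensive-word sequence; the string format and helper name are assumptions for illustration, not the paper's exact preprocessing.

```python
def conditional_inputs(post: str, sentiment: str, offensive_words: set):
    """Return (input for H_V(Y|B), input for H_V(Y|X,B)) as plain text.

    post:            the original DWMW17 post
    sentiment:       "negative", "neutral", or "positive" (the variable B)
    offensive_words: the 50 hand-picked terms from Appendix E (defines X)
    """
    x_text = " ".join(t for t in post.split() if t.lower() in offensive_words)
    input_b = f"SENTIMENT: {sentiment}"                   # conditions on B only
    input_xb = f"SENTIMENT: {sentiment} WORDS: {x_text}"  # conditions on X and B
    return input_b, input_xb

# One model is then finetuned on input_b and another on input_xb; the drop in
# held-out conditional entropy between the two is the conditional estimate.
```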
5 Related Work
While prior literature has acknowledged that not all data instances are equal (Vodrahalli et al., 2018; Swayamdipta et al., 2020), there have been few efforts to estimate dataset difficulty formally and directly. As a notable exception, Zhang et al. (2020) proposed DIME, an information-theoretic measure to estimate a lower bound on the lowest possible (i.e., model-agnostic) 0-1 error. Model-agnostic approaches do not explain why some datasets are easier for some models, and have limited interpretability. In contrast, 𝒱-information and pvi are specific to a model family 𝒱.
Various techniques have been proposed to differentiate data instances within a dataset. Text-based heuristics such as word identity (Bengio et al., 2009) or input length (Spitkovsky et al., 2010; Gururangan et al., 2018) have sometimes been used as proxies for instance difficulty, but offer limited insight into difficulty w.r.t. models. Other approaches consider training loss (Han et al., 2018; Arazo et al., 2019; Shen & Sanghavi, 2019), confidence (Hovy et al., 2013), prediction variance (Chang et al., 2017), and area under the curve (Pleiss et al., 2020). Estimates relying on model training dynamics (Toneva et al., 2018; Swayamdipta et al., 2020), gradient magnitudes (Vodrahalli et al., 2018), or loss magnitudes (Han et al., 2018) are sensitive to factors such as variance during steps of training. Influence functions (Koh & Liang, 2017), forgetting events (Toneva et al., 2018), and the Data Shapley (Ghorbani & Zou, 2019; Jia et al., 2019) can all be used to assign pointwise estimates of importance to data instances based on their contribution to the decision boundary. Moreover, although these methods all capture some aspect of difficulty, they do not lend themselves to interpreting datasets as readily as the predictive 𝒱-information framework.
Given its dependence on training behavior across time, cartography (Swayamdipta et al., 2020) offers benefits complementary to 𝒱-information. It can be non-trivial to measure differences between, say, a CoLA data map and an SNLI data map w.r.t. BERT. In contrast, 𝒱-information provides a formal framework for making aggregate dataset difficulty estimates that can be compared across datasets w.r.t. a model. Other work has offered insight by splitting the data into "easy" and "hard" sets with respect to some attribute and studying changes in model performance, but these methods do not offer a pointwise estimate of difficulty (Sugawara et al., 2018; Rondeau & Hazen, 2018; Sen & Saffari, 2020).
Item response theory (IRT; Embretson & Reise, 2013) allows the difficulty of instances to be learned via parameters in a probabilistic model meant to explain model performance (Lalor et al., 2018; Rodriguez et al., 2021). However, it does not formally relate dataset difficulty to the model being evaluated. Estimating instance difficulty is also evocative of instance selection for active learning (Lewis & Catlett, 1994; Fu et al., 2013; Liu & Motoda, 2002); however, these estimates could change as the dataset picks up new instances. In contrast, pvi estimates are relatively stable, especially when the dataset has higher 𝒱-information. Uncertainty sampling, for example, picks the instances that the partially trained model is least certain about (Lewis & Gale, 1994; Nigam et al., 2000), which could be interpreted as a measure of difficulty. However, once an instance is used for training, the model may become much more certain about it, meaning that the uncertainty values are unstable.
Interpreting the role of certain attributes in trained models has lately led to the discovery of many dataset artefacts in NLP. Our approach to discovering dataset artefacts can also complement existing approaches to artefact discovery (Gardner et al., 2021; Pezeshkpour et al., 2021; Le Bras et al., 2020). Rissanen data analysis (Perez et al., 2021) offers a complementary method for interpretability w.r.t. attributes; it involves calculating the minimum description length (MDL): how many bits are needed to transmit the gold labels from a sender to a recipient when both have access to the same model and inputs. Since that framework depends on the order of instances (i.e., what data has been transmitted thus far), it is unsuitable for estimating dataset difficulty. In contrast, 𝒱-information is defined w.r.t. a data distribution, so it is (in theory) agnostic to the data and its ordering in finetuning.
𝒱-information (Xu et al., 2019) has had limited adoption in NLP. It has been used to study what context features Transformers actually use (O'Connor & Andreas, 2021), as well as to condition out information for probing-based interpretability techniques (Hewitt et al., 2021; Pimentel & Cotterell, 2021). However, to the best of our knowledge, ours is the first approach to use 𝒱-usable information for estimating the difficulty of NLP datasets.
6 Future Work
There has been much work in the way of model interpretability, but relatively little in the way of dataset interpretability. Our framework will allow datasets to be probed, helping us understand what exactly we are testing for in models and how pervasive annotation artefacts really are. By identifying the attributes responsible for difficulty, it will be possible to build challenge sets in a more principled way and reduce artefacts in existing datasets. By studying which attributes contain information that is unusable by existing SOTA models, model creators may make more precise changes to architectures. More immediate directions of future work include:
1. Understanding how changes to the data distribution change the difficulty of individual examples.
2. Extending 𝒱-information to open-ended text generation, which does not induce explicit distributions over the output space. This may require truncating the output space (e.g., using beam search with a fixed width).
3. Applying 𝒱-information to estimate dataset difficulty in other modalities (e.g., image, audio, tabular, etc.). There is nothing limiting the use of 𝒱-information to the NLP domain. For example, one could create a set of image filters (for different colors and objects), use them to transform the image, and then measure the drop in usable information.
7 Conclusion
We provided an information-theoretic perspective for understanding and interpreting the difficulty of various NLP datasets. We extended predictive 𝒱-information to estimate difficulty at the dataset level, and then introduced pointwise 𝒱-information (pvi) for measuring the difficulty of individual instances. We showed that instances with lower pvi had lower levels of annotator agreement and were less likely to be predicted correctly. We then demonstrated how systemic and token-level annotation artefacts in a dataset could be discovered by manipulating the input before calculating these measures. Our studies indicate that 𝒱-information offers a new, efficient means of interpreting NLP datasets.
Acknowledgements
We thank Dan Jurafsky, Nelson Liu, Daniel Khashabi, and the anonymous reviewers for their helpful comments.
References
- Arazo et al. (2019) Arazo, E., Ortego, D., Albert, P., O'Connor, N., and McGuinness, K. Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning, pp. 312–321. PMLR, 2019. URL https://arxiv.org/abs/1904.11238.
- Bengio et al. (2009) Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 41–48, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553380. URL https://doi.org/10.1145/1553374.1553380.
- Blodgett et al. (2020) Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454–5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main.485.
- Bowman et al. (2015) Bowman, S., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642, 2015. URL https://aclanthology.org/D15-1075/.
- Chang et al. (2017) Chang, H.-S., Learned-Miller, E., and McCallum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. Advances in Neural Information Processing Systems, 30:1002–1012, 2017. URL https://arxiv.org/abs/1704.07433.
- Davidson et al. (2017) Davidson, T., Warmsley, D., Macy, M., and Weber, I. Automated hate speech detection and the problem of offensive language. In Proceedings of the 11th International AAAI Conference on Web and Social Media, ICWSM '17, pp. 512–515, 2017. URL https://arxiv.org/abs/1703.04009.
- Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
- Dinan et al. (2019) Dinan, E., Humeau, S., Chintagunta, B., and Weston, J. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4537–4546, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1461. URL https://aclanthology.org/D19-1461.
- Dixon et al. (2018) Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES '18, pp. 67–73, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450360128. doi: 10.1145/3278721.3278729. URL https://doi.org/10.1145/3278721.3278729.
- Dodge et al. (2019) Dodge, J., Gururangan, S., Card, D., Schwartz, R., and Smith, N. A. Show your work: Improved reporting of experimental results. In Proc. of EMNLP-IJCNLP, pp. 2185–2194, 2019. URL https://aclanthology.org/D19-1224/.
- Dodge et al. (2020) Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H., and Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping, 2020. URL https://arxiv.org/abs/2002.06305. arXiv preprint arXiv:2002.06305.
- Embretson & Reise (2013) Embretson, S. E. and Reise, S. P. Item response theory. Psychology Press, 2013. URL https://doi.org/10.4324/9781410605269.
- Ethayarajh & Jurafsky (2020) Ethayarajh, K. and Jurafsky, D. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4846–4853, 2020. URL https://aclanthology.org/2020.emnlp-main.393/.
- Fu et al. (2013) Fu, Y., Zhu, X., and Li, B. A survey on instance selection for active learning. Knowledge and Information Systems, 35(2):249–283, 2013. URL https://link.springer.com/article/10.1007/s10115-012-0507-8.
- Gardner et al. (2021) Gardner, M., Merrill, W., Dodge, J., Peters, M. E., Ross, A., Singh, S., and Smith, N. Competency problems: On finding and removing artifacts in language data, 2021. URL https://arxiv.org/abs/2104.08646.
- Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Data Shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. PMLR, 2019. URL https://arxiv.org/abs/1904.02868.
- Gururangan et al. (2018) Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. Annotation artifacts in natural language inference data. In Proc. of NAACL, pp. 107–112, 2018. URL https://aclanthology.org/N18-2017.
- Han et al. (2018) Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I. W., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018. URL https://arxiv.org/abs/1804.06872.
- Hewitt et al. (2021) Hewitt, J., Ethayarajh, K., Liang, P., and Manning, C. D. Conditional probing: measuring usable information beyond a baseline. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1626–1639, 2021. URL https://aclanthology.org/2021.emnlp-main.122/.
- Honnibal & Montani (2017) Honnibal, M. and Montani, I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017. URL https://spacy.io/.
- Hovy et al. (2013) Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., and Hovy, E. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1120–1130, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/N13-1132.
- Jia et al. (2019) Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gürel, N. M., Li, B., Zhang, C., Song, D., and Spanos, C. J. Towards efficient data valuation based on the Shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. PMLR, 2019. URL https://arxiv.org/abs/1902.10275.
- Koh & Liang (2017) Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894. PMLR, 2017. URL https://arxiv.org/abs/1703.04730v3.
- Kumar et al. (2019) Kumar, A., Liang, P., and Ma, T. Verified uncertainty calibration. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3792–3803, 2019. URL https://arxiv.org/abs/1909.10155.
- Lalor et al. (2018) Lalor, J. P., Wu, H., Munkhdalai, T., and Yu, H. Understanding deep learning performance through an examination of test set difficulty: A psychometric case study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4711–4716, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1500. URL https://aclanthology.org/D18-1500.
- Le Bras et al. (2020) Le Bras, R., Swayamdipta, S., Bhagavatula, C., Zellers, R., Peters, M., Sabharwal, A., and Choi, Y. Adversarial filters of dataset biases. In International Conference on Machine Learning, pp. 1078–1088. PMLR, 2020. URL https://arxiv.org/abs/2002.04108.
- Lewis & Catlett (1994) Lewis, D. D. and Catlett, J. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pp. 148–156. Elsevier, 1994. URL https://openreview.net/forum?id=HyEyBoWObr.
- Lewis & Gale (1994) Lewis, D. D. and Gale, W. A. A sequential algorithm for training text classifiers. In SIGIR '94, pp. 3–12. Springer, 1994. URL https://arxiv.org/abs/cmp-lg/9407020.
- Lewis et al. (2020) Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. of ACL, pp. 7871–7880, 2020. URL https://aclanthology.org/2020.acl-main.703/.
- Liu & Motoda (2002) Liu, H. and Motoda, H. On issues of instance selection. Data Mining and Knowledge Discovery, 6(2):115, 2002. URL https://link.springer.com/article/10.1023/A:1014056429969.
- Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692. arXiv preprint arXiv:1907.11692.
- Ma et al. (2021) Ma, Z., Ethayarajh, K., Thrush, T., Jain, S., Wu, L., Jia, R., Potts, C., Williams, A., and Kiela, D. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking, 2021. URL https://arxiv.org/abs/2106.06052. arXiv preprint arXiv:2106.06052.
- Miller (2019) Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, Feb 2019. ISSN 0004-3702. doi: 10.1016/j.artint.2018.07.007. URL http://dx.doi.org/10.1016/J.ARTINT.2018.07.007.
- Mosbach et al. (2020) Mosbach, M., Andriushchenko, M., and Klakow, D. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations, 2020. URL https://arxiv.org/abs/2006.04884.
- Nigam et al. (2000) Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103–134, 2000. URL https://link.springer.com/article/10.1023/A:1007692713085.
- O'Connor & Andreas (2021) O'Connor, J. and Andreas, J. What context features can transformer language models use? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 851–864, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.70. URL https://aclanthology.org/2021.acl-long.70.
- Perez et al. (2021) Perez, E., Kiela, D., and Cho, K. Rissanen data analysis: Examining dataset characteristics via description length. In ICML, 2021. URL https://arxiv.org/abs/2103.03872.
- Pezeshkpour et al. (2021) Pezeshkpour, P., Jain, S., Singh, S., and Wallace, B. C. Combining feature and instance attribution to detect artifacts, 2021. URL https://arxiv.org/abs/2107.00323. arXiv preprint arXiv:2107.00323.
- Pimentel & Cotterell (2021) Pimentel, T. and Cotterell, R. A Bayesian framework for information-theoretic probing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2869–2887, 2021. URL https://arxiv.org/abs/2109.03853.
- Pleiss et al. (2020) Pleiss, G., Zhang, T., Elenberg, E. R., and Weinberger, K. Q. Identifying mislabeled data using the area under the margin ranking. Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2001.10528.
- Poliak et al. (2018) Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., and Van Durme, B. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pp. 180–191, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/S18-2023. URL https://aclanthology.org/S18-2023.
- Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
- Recht et al. (2019) Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019. URL http://proceedings.mlr.press/v97/recht19a.html.
- Rodriguez et al. (2021) Rodriguez, P., Barrow, J., Hoyle, A. M., Lalor, J. P., Jia, R., and Boyd-Graber, J. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4486–4503, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.346. URL https://aclanthology.org/2021.acl-long.346.
- Rondeau & Hazen (2018) Rondeau, M.-A. and Hazen, T. J. Systematic error analysis of the Stanford question answering dataset. In Proceedings of the Workshop on MRQA, pp. 12–20, 2018. URL https://aclanthology.org/W18-2602/.
- Sanh et al. (2019) Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019. URL https://arxiv.org/abs/1910.01108. arXiv preprint arXiv:1910.01108.
- Sap et al. (2019) Sap, M., Card, D., Gabriel, S., Choi, Y., and Smith, N. A. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1668–1678, 2019. URL https://aclanthology.org/P19-1163/.
- Sen & Saffari (2020) Sen, P. and Saffari, A. What do models learn from question answering datasets? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2429–2438, 2020. URL https://aclanthology.org/2020.emnlp-main.190.
- Shannon (1948) Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. URL https://ieeexplore.ieee.org/abstract/document/6773024/.
- Shen & Sanghavi (2019) Shen, Y. and Sanghavi, S. Learning with bad training data via iterative trimmed loss minimization. In Proc. of ICML, pp. 5739–5748. PMLR, 2019. URL https://proceedings.mlr.press/v97/shen19e.html.
- Spitkovsky et al. (2010) Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. From baby steps to leapfrog: How "less is more" in unsupervised dependency parsing. In Proc. of NAACL, pp. 751–759. Association for Computational Linguistics, 2010. URL https://aclanthology.org/N10-1116.
- Sugawara et al. (2018) Sugawara, S., Inui, K., Sekine, S., and Aizawa, A. What makes reading comprehension questions easier? In Proc. of EMNLP, pp. 4208–4219, 2018. URL https://arxiv.org/abs/1808.09384.
- Swayamdipta et al. (2020) Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N. A., and Choi, Y. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proc. of EMNLP, pp. 9275–9293, 2020. URL https://aclanthology.org/2020.emnlp-main.746/.
- Toneva et al. (2018) Toneva, M., Sordoni, A., des Combes, R. T., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1812.05159.
- Torralba & Efros (2011) Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In CVPR 2011, pp. 1521–1528. IEEE, 2011. URL https://ieeexplore.ieee.org/abstract/document/5995347.
- Vodrahalli et al. (2018) Vodrahalli, K., Li, K., and Malik, J. Are all training examples created equal? An empirical study, 2018. URL https://arxiv.org/abs/1811.12569.
- Warstadt et al. (2018) Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments, 2018. URL https://arxiv.org/abs/1805.12471. arXiv preprint arXiv:1805.12471.
- Williams et al. (2018) Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. of NAACL, pp. 1112–1122. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1101. URL https://aclanthology.org/N18-1101.
- Xu et al. (2019) Xu, Y., Zhao, S., Song, J., Stewart, R., and Ermon, S. A theory of usable information under computational constraints. In International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/2002.10689.
- Zhang et al. (2020) Zhang, P., Wang, H., Naik, N., Xiong, C., and Socher, R. DIME: An information-theoretic difficulty measure for AI datasets. In Proc. of NeurIPS 2020 Workshop for Deep Learning through Information Geometry, October 2020. URL https://openreview.net/forum?id=kvqPFy0hbF.
Appendix A Training Data
In Figure 6, we plot the 𝒱-information estimate on the SNLI test set as BERT-base is trained on increasing amounts of training data. This is to test the assumption that the training set is sufficiently large to find the f ∈ 𝒱 that minimizes the conditional entropy. Although this assumption is impossible to validate with complete certainty, since we do not have access to the true distribution, if the 𝒱-information estimate plateaus before all the training data is used, it suggests that the training set size is not a limiting factor in the estimation. We find that this is indeed the case with SNLI, where 80% of the training data on average provides the same estimate as using the entire training set. In cases when this assumption does not hold, readers may want to consider measuring the Bayesian mutual information instead (Pimentel & Cotterell, 2021).

Appendix B Larger Models

In Figure 7, we plot the 𝒱-information estimate for the SNLI test and train sets. In Figure 8, we plot the 𝒱-information estimate on the CoLA in-domain held-out set for the four models that we previously studied, as well as a larger model, RoBERTa-large (Liu et al., 2019). Despite the increase in scale, the trends observed in §2 still hold.

Appendix C Qualitative Analysis
premise | hypothesis | label | PVI |
---|---|---|---|
Twenty five people are marching. | A man plays the trombone on the sidewalk. | N | -9.966 |
A woman in a striped shirt holds an infant. | A person is watching TV. | N | -9.612 |
A person swimming in a swimming pool. | A person embraces the cold | N | -9.152 |
Women enjoying a game of table tennis. | Women are playing ping pong. | E | -8.713 |
A boy dressed for summer in a green shirt and kahki shorts extends food to a reindeer in a petting zoo. | A boy alien dressed for summer in a green shirt and kahki shorts | E | -8.486 |
Two skateboarders, one wearing a black t-shirt and the other wearing a white t-shirt, race each other. | Two snowboarders race. | E | -8.087 |
An Asian woman dressed in a colorful outfit laughing. | The woman is not laughing. | E | -7.903 |
An older gentleman looks at the camera while he is building a deck. | An older gentleman in overalls looks at the camera while he is building a stained red deck in front of a house. | E | -7.709 |
A man wearing black pants, an orange and brown striped shirt, and a black bandanna in a ājust thrown a bowling ballā stance. | The bandana is expensive. | C | -7.685 |
Two girls kissing a man with a black shirt and brown hair on the cheeks. | Two girls kiss. | C | -7.582 |
In Table 4, we list the 10 hardest instances in the SNLI test set according to BERT-base. All three classes (entailment, neutral, and contradiction) are represented in this list, with entailment being slightly over-represented. We see that some of the examples are in fact mislabelled; e.g., "PREMISE: An Asian woman dressed in a colorful outfit laughing. HYPOTHESIS: The woman is not laughing." is labelled as "entailment" even though the correct label is "contradiction".
Appendix D Consistency of pvi Estimates


Table 5: Cross-epoch Pearson correlations of pvi estimates on the SNLI test set over the course of finetuning.

BERT-base
Epoch/Epoch | 1 | 2 | 3 | 5 | 10
---|---|---|---|---|---
1 | 1.000 | 0.908 | 0.871 | 0.838 | 0.762 |
2 | 0.908 | 1.000 | 0.929 | 0.883 | 0.795 |
3 | 0.871 | 0.929 | 1.000 | 0.879 | 0.796 |
5 | 0.838 | 0.883 | 0.879 | 1.000 | 0.833 |
10 | 0.762 | 0.795 | 0.796 | 0.833 | 1.000 |
BART-base
Epoch/Epoch | 1 | 2 | 3 | 5 | 10
---|---|---|---|---|---
1 | 1.000 | 0.925 | 0.885 | 0.853 | 0.754 |
2 | 0.925 | 1.000 | 0.952 | 0.906 | 0.807 |
3 | 0.885 | 0.952 | 1.000 | 0.914 | 0.814 |
5 | 0.853 | 0.906 | 0.914 | 1.000 | 0.862 |
10 | 0.754 | 0.807 | 0.814 | 0.862 | 1.000 |
DistilBERT-base
Epoch/Epoch | 1 | 2 | 3 | 5 | 10
---|---|---|---|---|---
1 | 1.000 | 0.928 | 0.884 | 0.828 | 0.766 |
2 | 0.928 | 1.000 | 0.952 | 0.890 | 0.825 |
3 | 0.884 | 0.952 | 1.000 | 0.900 | 0.819 |
5 | 0.828 | 0.890 | 0.900 | 1.000 | 0.860 |
10 | 0.766 | 0.825 | 0.819 | 0.860 | 1.000 |
GPT2
Epoch/Epoch | 1 | 2 | 3 | 5 | 10
---|---|---|---|---|---
1 | 1.000 | 0.931 | 0.887 | 0.855 | 0.747 |
2 | 0.931 | 1.000 | 0.961 | 0.918 | 0.813 |
3 | 0.887 | 0.961 | 1.000 | 0.933 | 0.827 |
5 | 0.855 | 0.918 | 0.933 | 1.000 | 0.874 |
10 | 0.747 | 0.813 | 0.827 | 0.874 | 1.000 |
Table 6: Cross-seed Pearson correlations of BERT pvi estimates on the SNLI test set across four training runs.

Run/Run | Run 1 | Run 2 | Run 3 | Run 4
---|---|---|---|---
Run 1 | 1.000 | 0.877 | 0.884 | 0.885 |
Run 2 | 0.877 | 1.000 | 0.887 | 0.882 |
Run 3 | 0.884 | 0.887 | 1.000 | 0.895 |
Run 4 | 0.885 | 0.882 | 0.895 | 1.000 |
Cross-Model Correlations
Human Agreement
In Figure 10, we plot the average pvi at different levels of annotator agreement. We find that what humans find difficult (i.e., instances with lower annotator agreement) also tends to be difficult according to pvi.
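The aggregation behind Figure 10 is a simple group-by; a minimal sketch, assuming a dataframe with per-instance `pvi` and `agreement` columns (both names, and the file, are hypothetical):

```python
import pandas as pd

# One row per SNLI test instance:
#   pvi        - pointwise V-information estimate for the instance
#   agreement  - fraction of annotators who chose the gold label
df = pd.read_csv("snli_test_pvi.csv")  # hypothetical file
print(df.groupby("agreement")["pvi"].mean().sort_index())
# Per Figure 10, lower annotator agreement corresponds to lower average pvi.
```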
Cross-Epoch Correlations
In Table 5, we list the cross-epoch Pearson correlations between pvi estimates made by the same model on the SNLI test set over the course of finetuning. The correlations are high (above 0.82 between any two of epochs 1 through 5), suggesting that instances that are easy (or difficult) early in training tend to remain so.
Cross-Seed Correlations
In Table 6, we list the Pearson correlations between pvi estimates made by BERT across different training runs. The correlations are high (above 0.87 for every pair of runs), suggesting that what a model finds difficult is not due to chance.
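Tables 5 and 6 can both be computed from aligned per-instance pvi vectors; a minimal sketch follows, where the container name and the placeholder data are hypothetical:

```python
import numpy as np

# pvi_by_run: {epoch or run id -> list of pvi estimates for the same test
# instances, in a fixed order}. Random placeholder data for illustration only.
rng = np.random.default_rng(0)
pvi_by_run = {e: rng.normal(size=10_000) for e in (1, 2, 3, 5, 10)}

keys = sorted(pvi_by_run)
corr = np.corrcoef([pvi_by_run[k] for k in keys])  # rows = runs; Pearson r matrix
for i, k in enumerate(keys):
    print(k, " ".join(f"{corr[i, j]:.3f}" for j in range(len(keys))))
```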
Appendix E Transformations
Table 7: An instance from the SNLI test set in its original form and after various attribute-specific transformations.

Attribute | Transformation | Transformed Input
---|---|---
Original | (none) | PREMISE: Two girls kissing a man with a black shirt and brown hair on the cheeks. HYPOTHESIS: Two girls kiss.
Shuffled | shuffle tokens randomly | PREMISE: girls two a kissing man with a black cheeks shirt and hair brown on the . HYPOTHESIS: kiss two . girls |
Hypothesis-only | only include hypothesis | HYPOTHESIS: Two girls kiss. |
Premise-only | only include premise | PREMISE: Two girls kissing a man with a black shirt and brown hair on the cheeks. |
Overlap | hypothesis-premise overlap | PREMISE: Two girls [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] . HYPOTHESIS: Two girls [MASK] . |
WARNING: The following content contains language from the DWMW17 dataset that is offensive in nature.
In Table 7, we provide an instance from the SNLI test set in its original form and after various attribute-specific transformations have been applied to it. These only capture a small subset of the space of possible transformations.
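A plausible implementation of the transformations in Table 7 is sketched below; the exact tokenization and masking rules used to produce the paper's transformed datasets may differ, so treat this as illustrative.

```python
import random

def overlap(premise_toks, hypothesis_toks, mask="[MASK]"):
    """Keep only tokens shared by premise and hypothesis; mask the rest."""
    p_set, h_set = set(premise_toks), set(hypothesis_toks)
    p = [t if t in h_set else mask for t in premise_toks]
    h = [t if t in p_set else mask for t in hypothesis_toks]
    return p, h

def transform(premise, hypothesis, attribute):
    p, h = premise.split(), hypothesis.split()  # whitespace tokenization, for simplicity
    if attribute == "hypothesis-only":
        return f"HYPOTHESIS: {hypothesis}"
    if attribute == "premise-only":
        return f"PREMISE: {premise}"
    if attribute == "shuffled":
        random.shuffle(p)
        random.shuffle(h)
    elif attribute == "overlap":
        p, h = overlap(p, h)
    return f"PREMISE: {' '.join(p)} HYPOTHESIS: {' '.join(h)}"

print(transform("Two girls kissing a man with a black shirt and brown hair on the cheeks .",
                "Two girls kiss .", "overlap"))
```

In each case, the attribute-specific $\mathcal{V}$-information is estimated with models finetuned on the correspondingly transformed inputs, which is why estimates for different attributes are not directly comparable on a single instance (see Appendix F).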
For DWMW17, we hand-picked a set of 50 potentially offensive words based on a cursory review of the dataset to see how much information these terms alone contain about the label: "nigga", "niggas", "niggah", "niggahs", "hoe", "hoes", "bitch", "bitches", "whitey", "white trash", "cracker", "crackers", "beaner", "beaners", "pussy", "pussies", "fag", "fags", "faggot", "faggots", "ho", "hos", "redneck", "rednecks", "porn", "fuck", "fucks", "fucker", "fuckers", "motherfucker", "motherfuckers", "nigger", "niggers", "coon", "coons", "niggaz", "nig", "nigs", "slut", "sluts", "wigger", "wiggers", "fucked", "fucking", "wigga", "wiggas", "retard", "retards", and "retarded".
Appendix F Instance-wise Comparisons
#7717: PREMISE: Little kids play a game of running around a pole. HYPOTHESIS: The kids are fighting outside.
#9627: PREMISE: A group of people watching a boy getting interviewed by a man. HYPOTHESIS: A group of people are sleeping on Pluto.

Certain attributes are responsible for the difficulty of certain examples. Figure 11 is an example of how we might do a fine-grained comparison of instances to understand why one may be more difficult for a given model. We compare two SNLI "neutral" instances from the test set to try to understand why #9627 is easier for BERT than #7717 (i.e., why its pvi is higher), finding that this is likely due to the former's hypothesis being more informative. While different instances can be compared w.r.t. the same attribute, different attributes cannot be compared w.r.t. the same instance, since the models used to estimate the attribute-specific $\mathcal{V}$-information are chosen to maximize the likelihood of all the data. This is why, for example, the pvi of #7717 is higher after its tokens have been shuffled, even though the average pvi (i.e., the dataset-level $\mathcal{V}$-information) declines after shuffling tokens.
Appendix G Token-Level Artefacts
WARNING: The following content contains language from the DWMW17 dataset that is offensive in nature.

In Table 8, we list the tokens in the SNLI, CoLA, and DWMW17 datasets that, when dropped out, cause the greatest decrease in the $\mathcal{V}$-information estimate. These are token-level artefacts of each class in the dataset. In the DWMW17 hate speech detection dataset, racial and homophobic slurs are artefacts of hate speech, while ableist and sexual slurs are artefacts of offensive speech. In-group AAVE terms are also predictive of offensive speech in DWMW17 even when they are used non-offensively, hinting at possible bias in the dataset (Sap et al., 2019). In CoLA, auxiliary verbs and prepositions are artefacts of ungrammatical sentences; grammatical sentences don't have any artefacts. For SNLI, we recover many of the token-level artefacts found by Gururangan et al. (2018) using descriptive statistics, even uncommon ones such as "cat" for contradiction.
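The token-level analysis can be approximated as follows. This sketch re-scores an already-finetuned model on inputs with a single token type removed, rather than re-estimating $\mathcal{V}$-information from scratch, so it illustrates the idea rather than the exact procedure; `pvi_fn` is assumed to be a per-instance pvi function such as the one sketched in Appendix C, adapted to single-text inputs.

```python
from collections import defaultdict

# data: list of (tokens, label_id) pairs for one dataset split.
# pvi_fn(text, label_id) -> per-instance pvi (assumed helper, see Appendix C).
def token_artefacts(data, pvi_fn, top_k=10):
    drops = defaultdict(list)  # (label_id, token) -> pvi decreases when token is removed
    for tokens, y in data:
        base = pvi_fn(" ".join(tokens), y)
        for t in set(tokens):
            ablated = " ".join(w for w in tokens if w != t)
            drops[(y, t)].append(base - pvi_fn(ablated, y))
    # Average the decrease per (class, token); the largest average drops per
    # class are that class's token-level artefacts (cf. Table 8).
    by_class = defaultdict(list)
    for (y, t), deltas in drops.items():
        by_class[y].append((sum(deltas) / len(deltas), t))
    return {y: sorted(vals, reverse=True)[:top_k] for y, vals in by_class.items()}

# Note: this requires O(instances x unique tokens) forward passes; in practice
# one would batch the calls and restrict attention to reasonably frequent tokens.
```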
Appendix H Relation to Dataset Cartography


Swayamdipta et al. (2020) introduced dataset cartography, a method to automatically analyze and diagnose datasets with respect to a trained model. It offers a complementary understanding of datasets and their properties, taking into account the behavior of a model towards different data instances during training. This behavior, termed training dynamics, helps differentiate instances via their (1) confidence (i.e., the mean probability of the correct label across epochs) and (2) variability (i.e., the standard deviation of the former); a minimal sketch of both measures follows the list below. The result is a dataset map revealing three regions:
• Easy-to-learn (high confidence, low variability) instances are the most frequent in the dataset; high-capacity models like BERT predict them correctly throughout training.
• Hard-to-learn (low confidence, low variability) instances are predicted incorrectly throughout training; these were often shown to correspond to mislabeled examples.
• Ambiguous (high variability) instances are those for which the model often changes its prediction; these examples are the ones most responsible for high test performance, both in and out of distribution.
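A minimal sketch of the two training-dynamics measures, assuming an array `probs` that stores, for every instance, the probability assigned to its gold label after each training epoch (the file name is hypothetical):

```python
import numpy as np

# probs[i, e]: probability the model assigns to the gold label of instance i
# after training epoch e (shape: num_instances x num_epochs).
probs = np.load("gold_label_probs.npy")  # hypothetical file

confidence = probs.mean(axis=1)   # mean gold-label probability across epochs
variability = probs.std(axis=1)   # spread of that probability across epochs

# Rough region selection, mirroring the 10% cut-offs used later in this appendix.
easy_to_learn = confidence >= np.quantile(confidence, 0.9)   # high confidence
hard_to_learn = confidence <= np.quantile(confidence, 0.1)   # low confidence
ambiguous = variability >= np.quantile(variability, 0.9)     # high variability
```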
Figure 13 shows that pvi values track closely with the confidence axis of an SNLI-DistilBERT-base data map (Swayamdipta et al., 2020). (Data maps were originally plotted on training data; however, they can be plotted on held-out data by computing the training dynamics measures on it after every training epoch.) Data maps and pvi estimates offer orthogonal perspectives on instance difficulty, the former capturing the behavior of instances as training proceeds. Moreover, $\mathcal{V}$-information can estimate dataset difficulty in aggregate (§2), which is not the case for training dynamics metrics, which offer only point estimates. Both approaches can be helpful for discovering data artefacts. Predictive $\mathcal{V}$-information estimates, however, offer the unique capability of transforming the input to discover the value of certain attributes in an efficient manner.
In Figure 13, we report the average pvi estimates of the three regions discovered via data maps:
• Easy-to-learn (high confidence, low variability) instances have the highest average pvi, indicating that they contain the most DistilBERT-usable information.
• Hard-to-learn (low confidence, low variability) instances have the lowest average pvi, indicating that they contain the least DistilBERT-usable information. This is not surprising, since they often correspond to mislabeled instances, from which it can be difficult to extract usable information.
• Ambiguous (high variability) instances have a lower average pvi than the easy-to-learn instances, indicating that there is some usable information w.r.t. DistilBERT, but not as much.
For each bar in the plot, we consider the 10% of the dataset belonging most strongly to each region, as determined by the corresponding confidence and variability measures.
WARNING: The following content contains language from the DWMW17 dataset that is offensive in nature.
Table 8: Token-level artefacts: the tokens in each class of DWMW17, SNLI, and CoLA that, when dropped out, cause the greatest decrease in the $\mathcal{V}$-information estimate.

DWMW17 (Davidson et al., 2017)
Hate Speech | Offensive | Neither
---|---|---
f*ggots (3.844) | r*tards (2.821) | lame (4.426) |
f*g (3.73) | n*gs (2.716) | clothes (0.646) |
f*ggot (3.658) | n*gro (2.492) | dog (0.616) |
c*ons (3.53) | n*g (2.414) | cat (0.538) |
n*ggers (3.274) | c*nts (2.372) | iDntWearCondoms (0.517) |
qu*er (3.163) | p*ssies (2.29) | thank (0.47) |
co*n (3.137) | qu*er (2.213) | kick (0.423) |
n*gger (3.094) | r*tarded (1.997) | 30 (0.345) |
d*ke (3.01) | c*nt (1.919) | football (0.334) |
f*gs (2.959) | b*tches (1.858) | soul (0.323) |
SNLI (Bowman et al., 2015)
Entailment | Neutral | Contradiction
---|---|---
nap (3.256) | tall (4.246) | Nobody (7.258) |
bald (3.183) | naked (2.193) | not (4.898) |
crying (2.733) | indoors (1.724) | no (4.458) |
Woman (2.517) | light (1.442) | naked (3.583) |
asleep (2.482) | fun (1.318) | crying (2.938) |
sleeping (2.416) | bed (1.006) | indoors (2.523) |
soda (2.267) | motorcycle (0.993) | vegetables (2.295) |
bed (2.136) | works (0.969) | sleeping (2.293) |
not (2.111) | race (0.943) | jogging (2.17) |
snowboarder (2.099) | daughter (0.924) | cat (2.092) |
CoLA (Warstadt et al., 2018)
Grammatical | Ungrammatical
---|---
will (0.267) | book (2.737) |
John (0.168) | is (2.659) |
. (0.006) | was (2.312) |
and (-0.039) | of (2.308) |
in (-0.05) | to (1.972) |
' (-0.063) | you (1.903)
to (-0.195) | be (1.895) |
of (-0.195) | in (1.618) |
that (-0.379) | did (1.558) |
the (-0.481) | The (1.427) |