Assessing Out-of-Domain Language Model Performance from Few Examples
Abstract
While pretrained language models have exhibited impressive generalization capabilities, they still behave unpredictably under certain domain shifts. In particular, a model may learn a reasoning process on in-domain training data that does not hold for out-of-domain test data. We address the task of predicting out-of-domain (OOD) performance in a few-shot fashion: given a few target-domain examples and a set of models with similar training performance, can we understand how these models will perform on OOD test data? We benchmark the performance on this task when looking at model accuracy on the few-shot examples, then investigate how to incorporate analysis of the models’ behavior using feature attributions to better tackle this problem. Specifically, we explore a set of “factors” designed to reveal model agreement with certain pathological heuristics that may indicate worse generalization capabilities. On textual entailment, paraphrase recognition, and a synthetic classification task, we show that attribution-based factors can help rank relative model OOD performance. However, accuracy on a few-shot test set is a surprisingly strong baseline, particularly when the system designer does not have in-depth prior knowledge about the domain shift.
1 Introduction
The question of whether models have learned the right behavior on a training set is crucial for generalization. Deep models have a propensity to learn shallow reasoning shortcuts Geirhos et al. (2020) like single-word correlations Gardner et al. (2021) or predictions based on partial inputs Poliak et al. (2018), particularly for problems like natural language inference Gururangan et al. (2018); McCoy et al. (2019) and question answering Jia and Liang (2017); Chen and Durrett (2019). Unless we use evaluation sets tailored to these spurious signals, accurately understanding if a model is learning them remains a hard problem Bastings et al. (2021); Kim et al. (2021); Hupkes et al. (2022).

This paper addresses the problem of predicting whether a model will work well in a target domain given only a few examples from that domain. This setting is realistic: a system designer can typically hand-label a few examples to serve as a test set. Computing accuracy on this small set and using that as a proxy for full-test set performance is a simple baseline for our task, but has high variance, which may cause us to incorrectly rank two models that achieve somewhat similar performance. We hypothesize that we can do better if we can interpret the model’s behavior beyond accuracy. With the rise of techniques to analyze post-hoc feature importance in machine-learned models Lundberg and Lee (2017); Ribeiro et al. (2016); Sundararajan et al. (2017), we have seen not just better interpretation of models, but improvements such as constraining them to avoid using certain features Ross et al. (2017) like those associated with biases Liu and Avci (2019); Kennedy et al. (2020), or trying to more generally teach the right reasoning process for a problem Yao et al. (2021); Tang et al. (2021); Pruthi et al. (2022). If post-hoc interpretation can strengthen a model’s ability to generalize, can it also help us understand that generalization?
Figure 1 illustrates the role this understanding can play. We have three trained models and are trying to rank them for suitability on a new domain. The small labeled dataset is a useful (albeit noisy) indicator of success. However, by checking model attributions on our few OOD samples, we can more deeply understand model behavior and analyze if they use certain pathological heuristics. Unlike past work Adebayo et al. (2022), we seek to automate this process as much as possible, provided the unwanted behaviors are characterizable by describable heuristics. We use scalar factors, which are simple functions of model attributions, to estimate proximity to these heuristics, similar to characterizing behavior in past work Ye et al. (2021). We then evaluate whether these factors allow us to correctly rank the models’ performance on OOD data.
On both synthetic Warstadt et al. (2020) and real McCoy et al. (2019); Zhang et al. (2019) datasets, we find that, between models with similar architectures but different training processes, both our accuracy baseline and attribution-based factors are good at distinguishing relative model performance on OOD data. However, on models with different base architectures, we discover interesting patterns: factors can very strongly distinguish between different types of models, but cannot always map these differences to correct predictions of OOD performance. In practice, we find probe set accuracy to be a quick and reliable tool for understanding OOD performance, whereas factors are capable of more fine-grained distinctions in certain situations.
Our Contributions:
(1) We benchmark, in several settings, methods for predicting and understanding relative OOD performance with few-shot OOD samples. (2) We establish a ranking-based evaluation framework for systems in our problem setting. (3) We analyze patterns in how accuracy on a few-shot set and factors derived from token attributions distinguish models.

2 Motivating Example
To expand on Figure 1, Figure 2 shows an in-depth motivating example of our process. We show three feature attributions from three different models on an example from the HANS dataset McCoy et al. (2019). These models have (unknown) varied OOD performance but similar performance on the in-domain MNLI Williams et al. (2018) data. Our task is then to correctly rank these models’ performance on the HANS dataset in a few-shot manner.
We can consider ranking these models via simple metrics like accuracy on the small few-shot dataset, where higher-scoring models are higher-ranked. However, such estimates can be high variance on small datasets. In Figure 2, only M3 predicts non-entailment correctly, and we cannot distinguish the OOD performance of M1 and M2 without additional information.
Thus, we turn to explanations to gain more insight into the models’ underlying behavior. With faithful attributions, we should be able to determine if the model is following simple inaccurate rules called heuristics McCoy et al. (2019). Figure 2 shows the subsequence heuristic, under which a model predicts entailment whenever the hypothesis is a subsequence of the premise. Crucially, we can use model attributions to assess model use of this heuristic: we can sum the attribution mass the model places on the subsequence tokens. We use the term factors to refer to such functions over model attributions.
The use of factors potentially allows for the automation of detection of spurious signals or shortcut learning Geirhos et al. (2020). While prior work has shown that spurious correlations are hard for a human user to detect from explanations Adebayo et al. (2022), well-designed factors could automatically analyze model behavior across a number of tasks and detect such failures.
3 Attributions to Predict Performance
In this section, we formalize the ideas presented thus far. Token-level attribution methods (a subset of post-hoc explanations) are methods which, given an input sequence of tokens x = (x_1, …, x_n) and a model prediction for some task, assign an explanation φ = (φ_1, …, φ_n), where φ_i corresponds to an attribution or importance score for the corresponding token x_i towards the final prediction. For cases where the model, prediction, and inputs are unambiguous, we abbreviate this simply as φ.
We assume that the model is trained on an in-domain training dataset D_ID and will be evaluated on some unknown OOD set D_OOD. Given two models m_1 and m_2, together with a small amount of OOD data D_probe (10 examples or fewer in our settings), our task is to predict which model will generalize better. We break the process into two steps (see Figure 2):
1. Hypothesize a heuristic.
First we must identify an underlying heuristic h that reflects pathological model behavior on the OOD dataset. For example, the subsequence heuristic in Figure 2 always predicts entailed if the hypothesis is contained within the premise. Let a_i(h) abstractly reflect how closely the i-th model’s behavior aligns with h, and let p_i be the true OOD performance of model m_i. If we then assume that h faithfully models some pathological behavior, we should have that a_i(h) > a_j(h) implies p_i < p_j. In other words, the more a model agrees with a pathological heuristic h, the worse it performs.
2. Measure alignment.
We now want to predict the ranking of the p_i; however, with few labeled examples there may be high variance in directly evaluating these metrics. We instead use factors f, which map tokens and their attributions for model m_i to scalar scores that should correlate with the heuristic h. Factors can be designed to align with known pathological heuristics, where higher scores indicate stronger model agreement with the associated heuristic. We then estimate the ranking of the p_i using the relative ranking of the corresponding a_i(h), approximated through factors.
Concretely, to measure the alignment, we first compute for each input x in D_probe the prediction m_i(x) and the explanation φ for that prediction. These are used to compute the score f(x, φ) for model m_i. We take the overall score F_i of the model to be the mean of f over the examples in D_probe. We then directly rank models on the basis of the F_i values: the higher the average factor value (the more the model follows the heuristic), the lower the relative ranking. Therefore, we can sort the models by these values and arrive at a predicted ranking. We later also consider factors which do not intuitively map to specific heuristics.
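To make this procedure concrete, the sketch below (a minimal illustration under our own naming assumptions, not the authors’ released code) averages a factor over the probe set and ranks models so that a lower mean factor score predicts better OOD performance; `model`, `get_attributions`, and `factor` are hypothetical callables standing in for any classifier, attribution method, and factor function.

```python
# Minimal sketch (not the authors' released code): average a factor over the
# few-shot probe set and rank models so that a lower mean factor score
# (less agreement with the pathological heuristic) predicts better OOD performance.
from typing import Callable, List, Sequence

import numpy as np


def mean_factor(model: Callable, probe_set: Sequence[dict],
                get_attributions: Callable, factor: Callable) -> float:
    """Average the scalar factor f(x, phi) over the probe set D_probe."""
    scores = []
    for example in probe_set:
        pred = model(example["tokens"])                         # prediction m_i(x)
        phi = get_attributions(model, example["tokens"], pred)  # one score per token
        scores.append(factor(example, phi))                     # scalar factor value
    return float(np.mean(scores))


def predict_ranking(models: List[Callable], probe_set: Sequence[dict],
                    get_attributions: Callable, factor: Callable) -> List[int]:
    """Return model indices ordered from predicted-best to predicted-worst."""
    means = [mean_factor(m, probe_set, get_attributions, factor) for m in models]
    return sorted(range(len(models)), key=lambda i: means[i])
```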
Baselines
We also consider three principal explanation-agnostic baselines. A natural baseline given D_probe is to simply use the accuracy (ACC) on this dataset, i.e., the fraction of probe examples a model classifies correctly. However, this may be noisy with only a few examples, and it frequently leads to ties. (Most of the datasets we consider are constructed specifically to mislead models following the heuristic, so this baseline directly measures agreement with the heuristic h.)
We can also assess model confidence (CONF), the softmax probability of the predicted label, as well as CONF-GT, the softmax probability of the ground-truth label.
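A minimal sketch of these three baselines, assuming each model’s softmax outputs over the probe set are available as arrays (the array names are illustrative):

```python
# Minimal sketch of the explanation-agnostic baselines, assuming softmax
# outputs over the probe set are available as arrays (names are illustrative).
import numpy as np


def baseline_scores(probs: np.ndarray, labels: np.ndarray) -> dict:
    """probs: (num_examples, num_classes) softmax outputs; labels: gold labels."""
    preds = probs.argmax(axis=1)
    return {
        "ACC": float((preds == labels).mean()),                # few-shot accuracy
        "CONF": float(probs.max(axis=1).mean()),               # confidence in predicted label
        "CONF-GT": float(probs[np.arange(len(labels)), labels].mean()),  # confidence in gold label
    }
```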
4 Experimental Setup
4.1 Models Compared
In this work, we compare models that vary along different axes and thus achieve different OOD performance. The first approach we use is inoculation Liu et al. (2019a), which involves fine-tuning models on small amounts or batches of OOD data alongside in-domain data to increase model performance on OOD data. The second approach we use is varying the model architecture and pre-training (e.g., using a stronger pre-trained Transformer model).
In Section 5, we use inoculation to create 5 RoBERTa-base Liu et al. (2019b) models of varying performance for each of the three MSGS sets. In Section 6, where we consider the HANS and PAWS datasets, we inoculate a variety of models. For HANS, we inoculate 5 RoBERTa-large models; we additionally examine DeBERTa-v3-base He et al. (2021b, a) and ELECTRA-base Clark et al. (2020) models fine-tuned on in-domain MNLI data. For PAWS, we inoculate 4 RoBERTa-base models on the in-domain set, and we also inoculate ELECTRA-base and DeBERTa-base models. We include complete details for these models in Appendix A. The generated models represent a realistic problem scenario: a practitioner may have many different models with similar in-domain performance but different OOD performance. We specifically crafted suites of models which have both near pairs (models with similar OOD performance) and far pairs (models with more distant OOD performance).
4.2 Attribution Methods
We experiment with several token-level attribution methods. LIME Ribeiro et al. (2016) computes attribution scores using the weights of a linear model approximating model behavior near a datapoint. SHAP Lundberg and Lee (2017) is similar to LIME, but uses a procedure based on Shapley values. Finally, Integrated Gradients (tokig) Sundararajan et al. (2017) computes attributions by performing a line integral over the gradients with respect to token embeddings on a path from a baseline token to the actual input token; commonly, this baseline token is chosen to be <MASK>. While intuitively sensible, Harbecke (2021) has voiced concerns regarding the use of tokig in NLP.
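For illustration, the sketch below computes integrated gradients directly over input embeddings with a simple Riemann-sum approximation of the path integral; `forward_from_embeds` is a hypothetical wrapper mapping embeddings to class logits, and real experiments would typically use an existing library implementation.

```python
# Illustrative integrated-gradients sketch over token embeddings using a
# Riemann-sum approximation of the path integral.
import torch


def integrated_gradients(forward_from_embeds, input_embeds: torch.Tensor,
                         baseline_embeds: torch.Tensor, target: int,
                         steps: int = 50) -> torch.Tensor:
    """Return one attribution per token for input_embeds of shape [seq_len, dim]."""
    total_grads = torch.zeros_like(input_embeds)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate between the baseline embeddings and the actual inputs.
        point = (baseline_embeds + alpha * (input_embeds - baseline_embeds)).detach()
        point.requires_grad_(True)
        logit = forward_from_embeds(point.unsqueeze(0))[0, target]
        grad, = torch.autograd.grad(logit, point)
        total_grads += grad
    avg_grads = total_grads / steps
    # Sum over the embedding dimension to get a scalar score per token.
    return ((input_embeds - baseline_embeds) * avg_grads).sum(dim=-1)
```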
4.3 Evaluation Setup
Because model ranking using a small probe set may be unstable, we conduct all experiments over a number of different sampled sets. We first sample examples from each OOD set (in the range of 200-600), then generate explanations for all models on each example. We then take 400-500 bootstrap samples of size n (we report results for n = 10, as experimental results were similar for sizes 5 and 20), simulating many few-shot evaluations. For each bootstrap sample, we analyze all model pairs. Details can be found in Appendix B.
We define a “success” as a technique correctly ranking a model pair with respect to the models’ performance on the full OOD set; otherwise, it is a “failure”. We define pairwise accuracy as the accuracy of a method ranking a particular model pair across all bootstrap samples. We define few-shot accuracy (or just accuracy) as the average of the pairwise accuracies over the model pairs. By reporting ranking accuracy across a diverse set of models, we ensure a comprehensive evaluation.
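The evaluation reduces to a small amount of bookkeeping; the sketch below (with illustrative, not official, array layouts) computes pairwise and few-shot accuracy from per-bootstrap method scores, assuming higher scores indicate predicted-better models.

```python
# Sketch of the ranking evaluation: a method "succeeds" on a model pair in a
# bootstrap sample if it orders the pair the same way as the models'
# accuracies on the full OOD set.
from itertools import combinations

import numpy as np


def few_shot_accuracy(method_scores: np.ndarray, true_ood_acc: np.ndarray) -> float:
    """method_scores: (num_bootstrap, num_models) scores, higher = predicted better.
    true_ood_acc: (num_models,) accuracy of each model on the full OOD set."""
    pair_accs = []
    for i, j in combinations(range(len(true_ood_acc)), 2):
        true_order = np.sign(true_ood_acc[i] - true_ood_acc[j])
        pred_order = np.sign(method_scores[:, i] - method_scores[:, j])
        pair_accs.append(float((pred_order == true_order).mean()))  # pairwise accuracy
    return float(np.mean(pair_accs))  # average over all model pairs
```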
5 MSGS: A Proof of Concept

We first show experiments on the Mixed Signals Generalization Set (MSGS) dataset presented in Warstadt et al. (2020) as a proof of concept for our methodology. MSGS is a synthetic classification dataset. The training (in-domain) set is composed of sentences where both some linguistic feature (e.g., the presence of an adjective) and a spurious surface feature (e.g., the word “the” being in the sentence) are always associated with a positive label. This data is ambiguous: the model could rely completely on either the linguistic or the surface feature and still get 100% accuracy on in-domain data. Warstadt et al. (2020) then create sets of OOD data where the linguistic feature is associated with the positive label and the surface feature with the negative label. The resulting test accuracy reflects model reliance on one feature or the other. Warstadt et al. (2020) use this to investigate what generalizations are learned at which stages of model pre-training; we investigate whether information from small probe sets can help assess model reliance on the surface feature.
We consider three of their linguistic features: MORPH (presence of an irregular past-tense verb like “drew”), ADJECT (presence of an adjective), and VERB (whether the main verb is an -ing verb), each paired with the surface feature of “the” being in the sentence.
We design factors which look at attributions on the tokens corresponding to these linguistic features, including the surrounding tokens to account for feature dependence on nearby words. Our factor sums the attribution mass in a window around the feature-critical word, f = Σ_{j ∈ [i−w, i+w]} φ_j, where i is the index of the feature-critical word for that dataset (e.g., “slept” for IRREG), w is the window size, and φ_j is the attribution at index j. This factor corresponds closely to the heuristic that the dataset was designed for; alternately, we can see this factor as inversely proportional to how much other information the model is using (that is, information outside of this window). We name the factors IRREG, VERB, and ADJ for the MORPH, VERB, and ADJECT sets respectively.
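A hedged sketch of this factor follows; the window size and the absolute-value normalization are assumptions for illustration rather than the paper’s exact formula.

```python
# Hedged sketch of the MSGS factors (IRREG / VERB / ADJ): the share of absolute
# attribution mass falling in a small window around the feature-critical token.
import numpy as np


def msgs_factor(attributions: np.ndarray, critical_index: int, window: int = 1) -> float:
    """attributions: one score per token; critical_index: position of the
    feature-critical word (e.g., the irregular past-tense verb)."""
    mass = np.abs(attributions)
    lo = max(0, critical_index - window)
    hi = min(len(mass), critical_index + window + 1)
    return float(mass[lo:hi].sum() / (mass.sum() + 1e-12))
```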
Note that this approach assumes that a system designer has prior knowledge of the relevant linguistic and surface feature. This is a generous assumption, and for this dataset is almost sufficient to formulate the rule used to construct it, hence why we call this a proof of concept. We will show more realistic conditions in Section 6.
Models
To create a suite of models with varying performance, we inoculate models following the approach outlined in Section 4.1. We evaluate our factors via the ranking accuracy described in Section 4.3. More details about the inoculation are given in Appendix A.
Feature | Method | Accuracy
---|---|---
MORPH | ACC | 90.9
MORPH | CONF | 50.9
MORPH | CONF-GT | 90.1
MORPH | IRREG | 89.2 (tokig) / 90.6 (shap) / 92.8 (lime)
VERB | ACC | 94.5
VERB | CONF | 58.0
VERB | CONF-GT | 93.3
VERB | VERB | 92.1 (tokig) / 94.0 (shap) / 94.9 (lime)
ADJECT | ACC | 89.9
ADJECT | CONF | 50.5
ADJECT | CONF-GT | 91.3
ADJECT | ADJ | 87.4 (tokig) / 92.1 (shap) / 93.5 (lime)
Results
Table 1 shows the results on this dataset. Our ACC baseline performs well: when models differ greatly in performance (e.g., one gets 50% and another gets 90% on the OOD set), accuracy on the small probe set ranks them correctly despite the small subset size. The high regularity of the dataset also means that a model’s behavior does not vary greatly from example to example, further reducing variance. However, this ranking is still not perfect. By contrast, CONF performs very poorly, showing that confidence is not helpful for measuring model behavior.
Overall, we see that methods using explanations are able to beat the ACC baseline, with the exception of tokig. We additionally found trends within the explanation techniques themselves, with lime reliably performing the best, and tokig being the worst. But generally, all techniques can offer relevant information, and in the best case, the attributions can tell us more reliably what a model is learning than evaluation on a small set of data can. In Section 6, we investigate if these results generalize to real-world datasets.
6 Realistic OOD Settings
We now consider two datasets corresponding to realistic OOD settings treated in past work.
First, HANS McCoy et al. (2019) targets spurious heuristics within MNLI Williams et al. (2018), such as the hypothesis being a subsequence of the premise, with balanced test sets that can be used to detect model reliance on these heuristics. Models following these heuristics always predict entailed for the hypotheses, and will perform at random chance accuracy on the dataset. We use MNLI as our in-domain training set in this setting.
Second, PAWS Zhang et al. (2019) is a paraphrase identification task. PAWS-QQP is an OOD dataset for Quora Question Pairs (QQP) Iyer et al. (2017) that is composed of pairs with swapped content words/phrases (e.g., I ran from the Grand Canyon to California becomes I ran from California to the Grand Canyon). A paraphrase model that relies heavily on lexical overlap will not be sensitive to these changes and will always predict the positive (paraphrase) label. We use QQP as our in-domain training set in this setting.
Details regarding models used in this section are presented in Section 4.1. From the test sets of the corresponding datasets, we randomly sample 400 examples from PAWS and 600 from HANS-CON and HANS-SUB each for use in bootstrap sampling, as detailed in Section 4.3. Information regarding the datasets considered can be found in Table 9.
6.1 Factors
General Factors
Both HANS and PAWS involve comparing two sequences of tokens, unlike MSGS, which is classification over a single sequence. We can define our input as composed of two sequences s_a and s_b with respective attributions φ_a and φ_b. We evaluate a number of factors that generally target sensitivity to both sequences and their differences, which represent a broad class of potential heuristics.
- MAX-DIFF: The difference between the maximum attribution in s_a and the maximum attribution in s_b.
- SUM-DIFF: The difference between the sums of attributions over s_a and s_b.
- INDEX-DIFF: The difference of attributions between words shared by s_a and s_b.
- FIRST-TOK: The attribution at the first <SEP> token.
We explicitly note that this is the exhaustive set of factors we experimented with, not a cherry-picked set, in order to provide a comprehensive view of what does and doesn’t work. We crafted these by manually examining attribution patterns on various datasets rather than trying a large number and keeping the best ones.
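For concreteness, a sketch of these four factors is shown below, assuming attributions have already been split into the two sequences; INDEX-DIFF’s word matching is simplified to exact string overlap of first occurrences, which is an assumption.

```python
# Sketch of the four general factors over a two-sequence input.
import numpy as np


def max_diff(phi_a: np.ndarray, phi_b: np.ndarray) -> float:
    """MAX-DIFF: difference of the maximum attributions of the two sequences."""
    return float(np.max(phi_a) - np.max(phi_b))


def sum_diff(phi_a: np.ndarray, phi_b: np.ndarray) -> float:
    """SUM-DIFF: difference of the summed attributions of the two sequences."""
    return float(np.sum(phi_a) - np.sum(phi_b))


def index_diff(tokens_a: list, phi_a: np.ndarray, tokens_b: list, phi_b: np.ndarray) -> float:
    """INDEX-DIFF: mean attribution difference over words shared by both sequences."""
    shared = set(tokens_a) & set(tokens_b)
    diffs = [phi_a[tokens_a.index(w)] - phi_b[tokens_b.index(w)] for w in shared]
    return float(np.mean(diffs)) if diffs else 0.0


def first_tok(phi_at_first_sep: float) -> float:
    """FIRST-TOK: attribution placed on the first <SEP> token."""
    return float(phi_at_first_sep)
```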
HANS Factors
We look at the “subsequence” heuristic discussed in Section 2 and the constituent heuristic, which assumes that the premise entails all complete subtrees in its parse-tree. For the subsequence OOD set (HANS-SUB) we note that the INDEX-DIFF factor, which specifically examines tokens in the shared subsequence, captures the setting’s pathological heuristic.
On the constituent OOD set (HANS-CON) we evaluate a factor that examines the attribution on the control words of the premise. For example, for the premise “Unless the doctors ran, the lawyers encouraged the scientists” and the hypothesis “The doctors ran”, we would consider the attributions on the word “Unless”.
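A speculative sketch of this CONST factor, under the assumption that it averages the attribution on clause-introducing control words of the premise; the control-word list here is purely illustrative.

```python
# Speculative sketch of the CONST factor: average the attribution on the
# clause-introducing control words of the premise (illustrative word list).
CONTROL_WORDS = {"unless", "although", "if", "whether", "because", "while"}


def const_factor(premise_tokens: list, premise_attributions: list) -> float:
    vals = [a for t, a in zip(premise_tokens, premise_attributions)
            if t.lower() in CONTROL_WORDS]
    return float(sum(vals) / len(vals)) if vals else 0.0
```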
Ranking Method | PAWS | HANS-SUB | HANS-CON
---|---|---|---
Baselines | | |
ACC | 88.7 | 90.6 | 81.6
CONF | 9.2 | 40.4 | 52.8
CONF-GT | 34.9 | 20.2 | 38.9
RANDOM | 50.7 | 51.4 | 49.6
Factors | | |
CONST | | | 87.1
SWAP-MAX-DIFF | 76.2 | |
SWAP-AVG | 91.4 | |
INDEX-DIFF | 60.5 | 91.3 | 68.6
MAX-DIFF | 65.9 | 69.2 | 50.6
SUM-DIFF | 56.6 | 60.0 | 75.2
FIRST-TOK | 74.3 | 50.4 | 55.6
PAWS Factors
We further investigate two intuitive factors based on the construction of the OOD set. SWAP-AVG uses the average attribution across all swapped tokens, and SWAP-MAX-DIFF takes the difference between the highest-magnitude swapped-token attribution in the first sentence and that in the second sentence. For example, for the pair (“What factors cause a good person to become bad ?”, “What factors cause a bad person to become good ?”), SWAP-AVG would consider the attributions on “good” and “bad”; SWAP-MAX-DIFF is analogous.
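A sketch of these two PAWS factors, assuming the positions of the swapped words in each sentence are already known (e.g., recovered by diffing the two sentences); the sign convention for SWAP-MAX-DIFF is an assumption.

```python
# Sketch of the PAWS factors over attributions and known swapped-token indices.
import numpy as np


def swap_avg(phi_a: np.ndarray, swapped_a: list, phi_b: np.ndarray, swapped_b: list) -> float:
    """SWAP-AVG: average attribution over all swapped tokens in both sentences."""
    vals = [phi_a[i] for i in swapped_a] + [phi_b[j] for j in swapped_b]
    return float(np.mean(vals))


def swap_max_diff(phi_a: np.ndarray, swapped_a: list, phi_b: np.ndarray, swapped_b: list) -> float:
    """SWAP-MAX-DIFF: difference of the highest-magnitude swapped-token
    attributions of the two sentences."""
    max_a = max((phi_a[i] for i in swapped_a), key=abs)
    max_b = max((phi_b[j] for j in swapped_b), key=abs)
    return float(max_a - max_b)
```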
6.2 Inoculated Results
We first evaluate models that differ primarily through inoculation, as described in Section 4.1. Results are shown in Table 2 using SHAP, which our experiments in this setting identified as the best-performing attribution method. The conclusions here differ somewhat from those on MSGS. We note that the ACC baseline remains strong, while CONF is near random. We find that certain attribution factors are able to outperform the ACC baseline, with SWAP-AVG the best on PAWS (91.4%), INDEX-DIFF the best on HANS-SUB (91.3%), and CONST on HANS-CON (87.1%).
This shows that even in these settings, which are more realistic than MSGS, the right choice of factor reveals meaningful information about model generalization. Moreover, the factors that work well are those hand-designed for these datasets, confirming our hypothesis that measuring association with a heuristic via a factor can reveal something about performance.
We qualify these results by noting that in a true few-shot setting, there is some uncertainty about whether a chosen factor is truly the best one. As a coarse option, we find ACC to be reliable. However, these high-performing factors would still be useful in conjunction with accuracy, or if we had previously validated a factor as ranking models well and wanted to apply it to rank new models in this domain; the factors may generalize to new models even if they do not necessarily generalize to new datasets.
6.3 Architectural Change Results
We further examine our approach when ranking the performance of different pre-trained models (RoBERTa, ELECTRA, and DeBERTa).
Table 3 shows that GUESS, a baseline computed as the expected accuracy of randomly picking one model as the best and then guessing consistently with that choice, is already strong at 72%. Factors also do well in this setting, with all of the general factors outperforming the much weaker ACC baseline.
This suggests that in the few-shot setting, factors are able to capture distributional information that the baselines cannot. However, to qualify this: given that each set only compares 3 pairs of models, it is easier for factors to happen upon strong accuracy patterns by chance.
Thus, in Table 4, we analyze this further by showing results on individual model pairs (R1 and R2 are RoBERTa; E is ELECTRA; D is DeBERTa). R1-E and D-R2 have different architectures but similar OOD accuracy (see Table 6 in the Appendix). R1-D, E-D, and E-R2 are pairs of different model types with more distant accuracy. Accuracy values for a single pair on this single dataset therefore only reflect differences across bootstrap samples. What we find in common across these types of pairings is that while some values are close to 50%, including the ACCURACY baseline, each column has several factors achieving very distinct (0% or 100%) accuracy values, consistently differentiating these models. As we note in Figure 4 (Appendix), this pattern of strong distinctions is quite common when different types of models are compared. We further discuss this in Section 7.
Ranking Method | hans/paws pooled |
---|---|
Baseline | |
ACCURACY | 61.0 |
GUESS | 72.0 |
Factors | |
SET-DEPENDENT | 75.5 |
MAX-DIFF | 72.6 |
INDEX-DIFF | 83.0 |
SUM-DIFF | 83.1 |
FIRST-TOK | 69.4 |
Ranking Method \ Model Pair | R1-E | R1-D | E-R2 | D-R2 | E-D
---|---|---|---|---|---
Baseline | |||||
ACCURACY | 77.4 | 54.6 | 55.2 | 67.2 | 64.4 |
Factors | |||||
SWAP-MAX-DIFF | 57.6 | 57.6 | 70.6 | 65.6 | 42.6 |
SWAP-AVG | 93.4 | 93.4 | 65.0 | 47.6 | 17.0 |
MAX-DIFF | 63.2 | 63.2 | 39.2 | 4.8 | 0.2 |
INDEX-DIFF | 0 | 0 | 99.6 | 99.6 | 15.0 |
SUM-DIFF | 99.8 | 100 | 0 | 0 | 88.0 |
FIRST-TOK | 1.2 | 100 | 99.6 | 0 | 0 |
7 Analysis / Discussion
Accuracy is reliable, but factors can provide more fine-grained distinctions.
On MSGS, where factors beat strong accuracy baselines, we notice that these pairwise accuracies are consistently high. For example, on the MORPH setting, for two models with 95% and 98% accuracy, our IRREG factor is 100% accurate, while the accuracy baseline is only 58%, as accuracy on a small probe set does not discriminate well between two models with such close overall accuracy.
This holds at the fine-grained pairwise level as well. Figure 5 (also see Figure 6 in appendix) shows the baseline accuracy against a specific factor’s accuracy for each model pair in MSGS. Each datapoint in the scatterplot represents a model pair and a point’s vertical distance from the red line represents how much better or worse a given factor does compared to the baseline on a specific pair. We see a regular trend: explanations seem to systematically outperform the baseline across various pairs, with a few significant deviations for low-performing pairs.
These results suggest that explanations can be useful and do add information otherwise missing from accuracy probing alone, especially when the underlying model architecture is held constant. With differing architectures (Figure 6), the problem is made more difficult, and selecting the right factor is less obvious; few-shot accuracy may be more reliable in this setting. Note, however, that these successes from any technique are in spite of us only inspecting 10 examples from the target domain.


Factors differentiate models strongly, though not always in a way aligned with OOD performance.
Figure 4 and Table 4 both show that factors often consistently decide in favor of a certain model regardless of the choice of probe set, especially when comparing models with different base architectures. Ranking accuracy depends on whether these strong preferences consistently select the model with higher OOD performance across a spectrum of models; thus, the tendency of factors to strongly favor a specific model does not necessarily translate into strong overall ranking performance, but it does strongly imply that these factors extract meaningful information about the model from its attributions. Looking closely at Table 4, we see that even between different model architectures, certain factors are more (INDEX-DIFF) or less (SWAP-MAX-DIFF) capable of making these distinctions.
Factors as projections of model feature space.
Based on these results, we have evidence that the distributions of attributions are unique to models: in other words, a factor is like a scalar signature for a model’s feature space with respect to some relevant features. Methods like inoculation, which change a model’s behavior in direct ways, lead to regular changes in that signature. In these cases, factors align with OOD performance, which explains why factors are so strong in our inoculated experiments. For our non-inoculated experiments (e.g., ELECTRA vs. DeBERTa), the feature spaces are fundamentally different, so factor signatures still capture these differences, but in a way less aligned with ranking on OOD performance. Future work may be able to expand on these differences and what they tell us beyond OOD performance.
8 Related Work
This paper relates to a long line of work on understanding explanations, including investigating human ability to interpret explanations Miller (2019); Jacovi and Goldberg (2020); Alqaraawi et al. (2020); Nguyen et al. (2021), explanations’ faithfulness and ability to detect shortcuts Geirhos et al. (2020) or spurious features Bastings et al. (2021); Madsen et al. (2021); Zhou et al. (2021), and applications to OOD data Ye and Durrett (2022); Choi et al. (2022), including papers in the intersection of multiple directions Adebayo et al. (2022); Kim et al. (2021).
Past work has also investigated using explanations to detect spurious correlations Kim et al. (2021); Bastings et al. (2021); Adebayo et al. (2022). Our work differs in that we focus on ranking an array of models exhibiting different levels of generalization ability, as opposed to giving a binary judgment of whether a model is relying on some shortcuts Kim et al. (2021); Bastings et al. (2021); Adebayo et al. (2022). In addition, we experiment with tasks having nuanced shortcuts “in the wild”, in contrast to the synthetically constructed datasets in Bastings et al. (2021). In particular, Adebayo et al. (2022) study the usefulness of explanations in detecting unknown spurious features in an image classification task involving (realistic) possible shortcuts, but find that attributions are ineffective for detecting unknown shortcuts in practice.
9 Conclusion
We establish a robust framework for evaluation of fine-grained few-shot prediction of OOD performance, benchmarking approaches in this setting on a range of models. We find that accuracy is a reliable baseline, but intuitive attribution-based factors derived from explanations can sometimes better predict how models will perform in OOD settings, even when they have similar in-domain performance. We further analyze patterns of our approaches, discovering the potential for factors to represent views of model feature space, leaving further exploration to future work.
10 Limitations
There are a large number of explanation techniques and many domains these have been applied to. We focus here on a set of textual reasoning tasks like entailment where spurious correlations have been frequently identified. However, correlations in other settings like medical imaging Adebayo et al. (2022) could yield different results. We also note that these datasets are all English-language and use English pre-trained models, so different settings may yield different results; additionally, our factors depend on how explanations are normalized between different examples.
Our paper and analysis themselves comment on the limitations of our methodology as well as explanations as a whole: we find that while explanations often can clearly distinguish different models, knowing which factors will do so, or guaranteeing that explanations align with OOD performance, remains difficult.
Acknowledgments
This work was supported by NSF CAREER Award IIS-2145280, a gift from Salesforce, Inc., and a gift from Adobe. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources used to conduct this research.
References
- Adebayo et al. (2022) Julius Adebayo, Michael Muelly, Harold Abelson, and Been Kim. 2022. Post hoc explanations may be ineffective for detecting unknown spurious correlation. In International Conference on Learning Representations.
- Alqaraawi et al. (2020) Ahmed Alqaraawi, Martin Schuessler, Philipp Weiß, Enrico Costanza, and Nadia Berthouze. 2020. Evaluating saliency map explanations for convolutional neural networks: A user study. In Proceedings of the 25th International Conference on Intelligent User Interfaces, IUI ’20, page 275–285, New York, NY, USA. Association for Computing Machinery.
- Bastings et al. (2021) Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, and Katja Filippova. 2021. "Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification. In arXiv.
- Chen and Durrett (2019) Jifan Chen and Greg Durrett. 2019. Understanding Dataset Design Choices for Multi-hop Reasoning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4026–4032, Minneapolis, Minnesota. Association for Computational Linguistics.
- Choi et al. (2022) Jihye Choi, Jayaram Raghuram, Ryan Feng, Jiefeng Chen, Somesh Jha, and Atul Prakash. 2022. Concept-based explanations for out-of-distribution detectors. In arXiv.
- Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations (ICLR).
- Gardner et al. (2021) Matt Gardner, William Merrill, Jesse Dodge, Matthew Peters, Alexis Ross, Sameer Singh, and Noah A. Smith. 2021. Competency Problems: On Finding and Removing Artifacts in Language Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1801–1813, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.
- Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation Artifacts in Natural Language Inference Data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
- Harbecke (2021) David Harbecke. 2021. Explaining natural language processing classifiers with occlusion and language modeling. arXiv preprint arXiv:2101.11889.
- He et al. (2021a) Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021a. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. In arXiv ePrint 2111.09543.
- He et al. (2021b) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021b. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.
- Hupkes et al. (2022) Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Rita Frieske, Ryan Cotterell, and Zhijing Jin. 2022. State-of-the-art generalisation research in NLP: a taxonomy and review. In arXiv.
- Iyer et al. (2017) Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.
- Jacovi and Goldberg (2020) Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online. Association for Computational Linguistics.
- Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.
- Kennedy et al. (2020) Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, and Xiang Ren. 2020. Contextualizing Hate Speech Classifiers with Post-hoc Explanation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5435–5442, Online. Association for Computational Linguistics.
- Kim et al. (2021) Joon Sik Kim, Gregory Plumb, and Ameet Talwalkar. 2021. Sanity simulations for saliency methods. CoRR, abs/2105.06506.
- Liu and Avci (2019) Frederick Liu and Besim Avci. 2019. Incorporating Priors with Feature Attribution on Text Classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6274–6283, Florence, Italy. Association for Computational Linguistics.
- Liu et al. (2019a) Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019a. Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179, Minneapolis, Minnesota. Association for Computational Linguistics.
- Liu et al. (2019b) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert pretraining approach. In arXiv.
- Lundberg and Lee (2017) Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NeurIPS’17, page 4768–4777, Red Hook, NY, USA. Curran Associates Inc.
- Madsen et al. (2021) Andreas Madsen, Nicholas Meade, Vaibhav Adlakha, and Siva Reddy. 2021. Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining. In arXiv ePrint 2110.08412.
- McCoy et al. (2019) Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
- Miller (2019) Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.
- Nguyen et al. (2021) Giang Nguyen, Daeyoung Kim, and Anh Nguyen. 2021. The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. In Advances in Neural Information Processing Systems.
- Poliak et al. (2018) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis Only Baselines in Natural Language Inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.
- Pruthi et al. (2022) Danish Pruthi, Rachit Bansal, Bhuwan Dhingra, Livio Baldini Soares, Michael Collins, Zachary C. Lipton, Graham Neubig, and William W. Cohen. 2022. Evaluating Explanations: How Much Do Explanations from the Teacher Aid Students? Transactions of the Association for Computational Linguistics, 10:359–375.
- Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining.
- Ross et al. (2017) Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 2662–2670.
- Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 3319–3328. JMLR.org.
- Tang et al. (2021) Liyan Tang, Dhruv Rajan, Suyash Mohan, Abhijeet Pradhan, R. Nick Bryan, and Greg Durrett. 2021. Making Document-Level Information Extraction Right for the Right Reasons. arXiv.
- Warstadt et al. (2020) Alex Warstadt, Yian Zhang, Xiaocheng Li, Haokun Liu, and Samuel R. Bowman. 2020. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217–235, Online. Association for Computational Linguistics.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
- Yao et al. (2021) Huihan Yao, Ying Chen, Qinyuan Ye, Xisen Jin, and Xiang Ren. 2021. Refining Language Models with Compositional Explanations. In Advances in Neural Information Processing Systems, volume 34, pages 8954–8967. Curran Associates, Inc.
- Ye and Durrett (2022) Xi Ye and Greg Durrett. 2022. Can Explanations Be Useful for Calibrating Black Box Models? In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6199–6212, Dublin, Ireland. Association for Computational Linguistics.
- Ye et al. (2021) Xi Ye, Rohan Nair, and Greg Durrett. 2021. Connecting Attributions and QA Model Behavior on Realistic Counterfactuals. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5496–5512, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Zhang et al. (2019) Yuan Zhang, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Minneapolis, Minnesota. Association for Computational Linguistics.
- Zhou et al. (2021) Yilun Zhou, Serena Booth, Marco Tulio Ribeiro, and Julie Shah. 2021. Do feature attribution methods correctly attribute features? In eXplainable AI approaches for debugging and diagnosis.
Appendix A Details of Inoculation
One of the methods we used to obtain models with different performances on the OOD sets was inoculation Liu et al. (2019a), which involves fine-tuning or further fine-tuning models on small amounts or batches of OOD data alongside in-domain data to bring model performance on OOD sets up.
MSGS
We borrow notation from Warstadt et al. (2020). Most of the fine-tuning data is ambiguous data that does not test the spurious correlation, but we add in small percentages of non-ambiguous data whose labels favor either the surface or the linguistic generalization, tilting the model in that direction. For each set (VERB, MORPH, ADJECT), we used the following inoculation splits, where linguistic (L) and surface (S) denote the feature that the inoculation data favors: 2% L; 2% S; mixed (2% L, 1% S); mixed (1% L, 2% S); mixed (2% L, 2% S); in addition to no inoculation. The results on the OOD sets are presented in Table 5.
HANS
PAWS
We used several inoculation techniques to get a variable number of models here. For our RoBERTa-base model, we start with the base model (35% OOD accuracy) and fine-tune it on in-domain data with 2% OOD data mixed in. We trained this over several epochs to get models with 82.8% and 90.8% accuracy on the OOD set. We also tried fine-tuning our 35% model on batches of pure OOD data to get a model with 69% accuracy. For our ELECTRA and DeBERTa models, we use similar batch-only inoculation (fine-tuning on batches of only OOD data). More details are present in Table 6.
Inoculation Split | VERB | MORPH | ADJECT
---|---|---|---
No-inoc | 12.0 | 95.0 | 51.0 |
2L | 99.0 | 98.0 | 99.2 |
2S | 0.0 | 0.0 | 0.0 |
2L 1S | 80.0 | 68.0 | 73.0 |
1L 2S | 33.0 | 57.0 | 32.0 |
2L 2S | 53.7 | 49.0 | 56.0 |
Tables 7 and 8 contain the same information as Table 2, but for the other two explanation techniques we study.
Dataset | OOD Performance | Huggingface Model Name | LR | Warmup | Steps |
---|---|---|---|---|---|
HANS | 99.8/99.4 | roberta-large-mnli | 500 | 150 | |
HANS | 96.7/97.6 | roberta-large-mnli | 500 | 100 | |
HANS | 87.1/70.1 | roberta-large-mnli | 500 | 75 | |
HANS | 79.5/62.5 | roberta-large-mnli | 500 | 50 | |
HANS | 69.9/58.7 | roberta-large-mnli | 25 | ||
HANS | 66.8/57.8 | roberta-large-mnli | |||
HANS | 63.5/72.5 | howey/electra-base-mnli | |||
HANS | 62.7/65.7 | MoritzLaurer/DeBERTa-v3-base-mnli | |||
MSGS | Table 5 | roberta-base | 600 | 6000 | |
PAWS | 90.8 | roberta-base | 1200 | 12000 | |
PAWS | 82.8 | roberta-base | 1200 | 12000 | |
PAWS | 69.0 | roberta-base | 1200 | 7600 | |
PAWS | 35.0 | roberta-base | 1200 | 12000 | |
PAWS | 80.5 | google/electra-base-discriminator | 1200 | 7700 | |
PAWS | 71.8 | microsoft/deberta-base | 1200 | 7600 |
Ranking Method | PAWS | HANS-SUB | HANS-CON
---|---|---|---
Baselines | | |
ACCURACY | 88.7 | 90.6 | 81.6
CONFIDENCE | 9.2 | 40.4 | 52.8
RANDOM | 50.7 | 51.4 | 49.6
Explanations | | |
CONST | | | 79.4
SWAP-MAX-DIFF | 80.6 | |
SWAP-AVG | 98.2 | |
MAX-DIFF | 70.1 | 67.2 | 58.3
INDEX-DIFF | 70.5 | 88.5 | 67.2
SUM-DIFF | 53.9 | 59.1 | 60.4
FIRST-TOK | 55.4 | 51.0 | 81.3
Ranking Method | PAWS | HANS-SUB | HANS-CON
---|---|---|---
Baselines | | |
ACCURACY | 88.7 | 90.6 | 81.6
CONFIDENCE | 9.2 | 40.4 | 52.8
RANDOM | 50.7 | 51.4 | 49.6
Explanations | | |
CONST | | | 79.2
SWAP-MAX-DIFF | 84.3 | |
SWAP-AVG | 85.6 | |
MAX-DIFF | 86.9 | 55.2 | 69.9
INDEX-DIFF | 51.4 | 85.8 | 53.4
SUM-DIFF | 51.6 | 77.4 | 68.1
FIRST-TOK | 64.0 | 69.7 | 59.2
Appendix B Bootstrapping Details
We now describe our process for bootstrapping and evaluating the capability of explanations in our setting.
For a sampled population of examples from the OOD set and the set of models we examine at a time, we generate explanations for each model on every example in the sampled population. We then repeatedly (500 times) take a sample with replacement of 10 examples each from this pool of precomputed explanations. We calculate factors for each of the 10 explanations in each sample and pool them to get one factor metric per model.
For each model pair, we then compare the ground-truth ranking of the models with the ranking induced by their respective factor metrics, recording a success where these match and a failure otherwise. Averaging across our 500 bootstrap samples gives pairwise distributions (the distribution of successes vs. failures on a sample for a given pair), which we can further aggregate to get few-shot accuracies.
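A sketch of this bootstrap loop, assuming per-example factor values have already been computed for every model over the sampled population (the array layout is illustrative):

```python
# Sketch of the bootstrap loop over precomputed per-example factor values.
import numpy as np


def bootstrap_model_scores(factor_scores: np.ndarray, num_samples: int = 500,
                           probe_size: int = 10, seed: int = 0) -> np.ndarray:
    """factor_scores: (num_models, population_size) per-example factor values.
    Returns (num_samples, num_models): mean factor per model for each bootstrap sample."""
    rng = np.random.default_rng(seed)
    num_models, population = factor_scores.shape
    out = np.empty((num_samples, num_models))
    for s in range(num_samples):
        idx = rng.choice(population, size=probe_size, replace=True)  # sample with replacement
        out[s] = factor_scores[:, idx].mean(axis=1)
    return out
```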
Note that, in practice, to reduce run-to-run variance, we fix the population of 500 bootstrap probe sets, but we validated that re-running on newly sampled populations did not greatly change any numbers. Though we tried several probe sizes (5, 10, 20), we use a probe size of 10 as a realistic size for our setting, one that would not be burdensome to hand-label in practice.
Our methodology can be run quickly in a post-hoc manner as many times as needed on top of a population of the necessary explanations.
ID Set | OOD Set | OOD Set Size | Probe Set Size
---|---|---|---
msgs | MORPH | 10000 | 10
msgs | VERB | 10000 | 10
msgs | ADJECT | 10000 | 10
mnli | HANS-SUB | 10000 | 10
mnli | HANS-CON | 10000 | 10
qqp | PAWS | 677 | 10
Appendix C Additional Plots
Figure 4 shows additional information about the distribution of pairwise accuracies between different model architectures.

Appendix D Reproducibility
D.1 Computing Infrastructure
All experiments were conducted on a desktop with 2 NVIDIA 1080 Ti (11 GB) and 1 NVIDIA Titan Xp (12 GB).
D.2 Runtimes
For PAWS and MSGS fine-tuned models, we fine-tuned for roughly 1 GPU hour per model. Since HANS models were trained for very few steps, their training time is inconsequential. Generating attributions required for numerical evaluation took less than 6 GPU hours.
D.3 Dataset Details
We used datasets in the JSONL format. We simplified all our dataset settings to binary classification for simplicity, and used data directly from the downloads made available in the original papers.