SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers
Abstract
We introduce SelfExplain, a novel self-explaining model that explains a text classifier’s predictions using phrase-based concepts. SelfExplain augments existing neural classifiers by adding (1) a globally interpretable layer that identifies the most influential concepts in the training set for a given sample and (2) a locally interpretable layer that quantifies the contribution of each local input concept by computing a relevance score relative to the predicted label. Experiments across five text-classification datasets show that SelfExplain facilitates interpretability without sacrificing performance. Most importantly, explanations from SelfExplain are sufficient for model predictions and are perceived as more adequate, trustworthy and understandable by human judges than existing widely-used baselines. (Code and data are publicly available at https://github.com/dheerajrajagopal/SelfExplain.)
1 Introduction
Neural network models are often opaque: they provide limited insight into interpretations of model decisions and are typically treated as “black boxes” (Lipton, 2018). There has been ample evidence that such models overfit to spurious artifacts (Gururangan et al., 2018; McCoy et al., 2019; Kumar et al., 2019) and amplify biases in data (Zhao et al., 2017; Sun et al., 2019). This underscores the need to understand model decision making.

Prior work in interpretability for neural text classification predominantly follows two approaches: (i) post-hoc explanation methods that explain predictions for previously trained models based on model internals, and (ii) inherently interpretable models whose interpretability is built-in and optimized jointly with the end task. While post-hoc methods (Simonyan et al., 2014; Koh and Liang, 2017; Ribeiro et al., 2016) are often the only option for already-trained models, inherently interpretable models (Melis and Jaakkola, 2018; Arik and Pfister, 2020) may provide greater transparency since explanation capability is embedded directly within the model (Kim et al., 2014; Doshi-Velez and Kim, 2017; Rudin, 2019).
In natural language applications, feature attribution based on attention scores (Xu et al., 2015) has been the predominant method for developing inherently interpretable neural classifiers. Such methods interpret model decisions locally by explaining the classifier’s decision as a function of relevance of features (words) in input samples. However, such interpretations were shown to be unreliable (Serrano and Smith, 2019; Pruthi et al., 2020) and unfaithful (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019). Moreover, with natural language being structured and compositional, explaining the role of higher-level compositional concepts like phrasal structures (beyond individual word-level feature attributions) remains an open challenge. Another known limitation of such feature attribution based methods is that the explanations are limited to the input feature space and often require additional methods (e.g. Han et al., 2020) for providing global explanations, i.e., explaining model decisions as a function of influential training data.
In this work, we propose SelfExplain, a self-explaining model that incorporates both global and local interpretability layers into neural text classifiers. Compared to word-level feature attributions, we use high-level phrase-based concepts, producing a more holistic picture of a classifier’s decisions. SelfExplain incorporates: (i) a Locally Interpretable Layer (LIL), which quantifies, via activation differences, the relevance of each concept in an input sample to the final label distribution; and (ii) a Globally Interpretable Layer (GIL), which uses maximum inner product search (MIPS) to retrieve the most influential concepts from the training data for a given input sample. We show how GIL and LIL layers can be integrated into transformer-based classifiers, converting them into self-explaining architectures. The interpretability of the classifier is enforced through regularization (Melis and Jaakkola, 2018), and the entire model is end-to-end differentiable. To the best of our knowledge, SelfExplain is the first self-explaining neural text classification approach to provide both global and local interpretability in a single model.
Ultimately, this work is a step towards combining the generalization power of neural networks with the benefits of interpretable statistical classifiers built on hand-engineered features: our experiments on three text classification tasks spanning five datasets with pretrained transformer models show that incorporating LIL and GIL layers facilitates richer interpretability while maintaining end-task performance. The explanations from SelfExplain sufficiently reflect model predictions and are perceived by human annotators as more understandable, more trustworthy, and better at justifying the model predictions than strong baseline interpretability methods.
2 SelfExplain
Let $\mathcal{M}$ be a neural $C$-class classification model that maps inputs $\mathcal{X}$ to outputs $\mathcal{Y}$. SelfExplain builds interpretability into $\mathcal{M}$, providing a set of explanations via high-level “concepts” that explain the classifier’s predictions. We first define interpretable concepts in §2.1. We then describe how these concepts are incorporated into a concept-aware encoder in §2.2. In §2.3, we define our Local Interpretability Layer (LIL), which provides local explanations by assigning relevance scores to the constituent concepts of the input. In §2.4, we define our Global Interpretability Layer (GIL), which provides global explanations by retrieving influential concepts from the training data. Finally, in §2.5, we describe the end-to-end training procedure and optimization objectives.
2.1 Defining human-interpretable concepts
Since natural language is highly compositional (Montague, 1970), it is essential that interpreting a text sequence goes beyond individual words. We define the set of basic units that are interpretable by humans as concepts. In principle, concepts can be words, phrases, sentences, paragraphs or abstract entities. In this work, we focus on phrases as our concepts, specifically all non-terminals in a constituency parse tree. Given any sequence $x$, we decompose it into its component non-terminal phrases $\{nt_1, \ldots, nt_J\}$, where $J$ denotes the number of non-terminal phrases in $x$.
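As a concrete illustration, the following minimal sketch enumerates phrase-level concepts from a bracketed constituency parse; the parse string, the `min_len` filter and the helper name are illustrative assumptions, and the parser itself (e.g., Kitaev and Klein, 2018) is assumed to be run separately.

```python
# Minimal sketch: extracting phrase-level concepts (non-terminals) from a
# constituency parse. The bracketed parse string is assumed to come from an
# external constituency parser.
from nltk.tree import Tree

def extract_concepts(parse_str: str, min_len: int = 2):
    """Return the phrase under each non-terminal of the parse tree."""
    tree = Tree.fromstring(parse_str)
    concepts = []
    for subtree in tree.subtrees():
        phrase = " ".join(subtree.leaves())
        # skip single tokens and duplicate phrases
        if len(subtree.leaves()) >= min_len and phrase not in concepts:
            concepts.append(phrase)
    return concepts

parse = "(S (NP (DT the) (NN movie)) (VP (VBZ is) (ADJP (RB surprisingly) (JJ good))))"
print(extract_concepts(parse))
# ['the movie is surprisingly good', 'the movie', 'is surprisingly good', 'surprisingly good']
```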
Given an input sample $x$, $\mathcal{M}$ is trained to produce two types of explanations: (i) global explanations from the training data and (ii) local explanations, which are phrases in $x$. We show an example in Figure 1. Global explanations are obtained by identifying the most influential concepts in a “concept store” $Q$, which is constructed to contain all concepts from the training set by extracting the phrase under each non-terminal in the syntax tree of every training sample (detailed in §2.4). Local interpretability is achieved by decomposing the input sample into its constituent phrases under each non-terminal in its syntax tree; each such concept is assigned a score that quantifies its contribution to the sample’s label distribution for the given task, and $\mathcal{M}$ outputs the most relevant local concepts.

2.2 Concept-Aware Encoder
We obtain the encoded representation of our input sequence $x$ from a pretrained transformer model (Vaswani et al., 2017; Liu et al., 2019; Yang et al., 2019) by extracting the final-layer output $h$. Additionally, we compute representations of the concepts: each non-terminal $nt_j$ in $x$ is represented as the mean of its constituent word representations, $u_j = \frac{1}{|nt_j|} \sum_{w \in nt_j} h_w$, where $|nt_j|$ denotes the number of words in the phrase $nt_j$. To represent the root node of the syntax tree, $nt_{root}$ (i.e., the entire input), we use the pooled ([CLS] token) representation of the pretrained transformer as $u_{root}$. (We experimented with different pooling strategies, namely mean pooling, sum pooling and the pooled [CLS] token representation; all performed similarly, and we chose the pooled [CLS] token for the final model since it is the most commonly used way to represent the entire input.) Following the traditional neural classifier setup, the output of the classification layer is computed as
$$ l = \mathrm{softmax}\big(W_l \, g(u_{root}) + b_l\big), $$
where $g$ is an activation layer, $W_l$ and $b_l$ are the classification layer parameters, and $\hat{y}$ denotes the index of the predicted class.
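For illustration, below is a minimal sketch of the concept-aware head, assuming precomputed encoder hidden states, ReLU as a stand-in for the activation $g$, and token-offset spans for the non-terminals; the class name and structure are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConceptAwareHead(nn.Module):
    """Sketch of the concept-aware classification head described in Sec. 2.2."""
    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.g = nn.ReLU()                       # stand-in for the activation g
        self.w_l = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states, phrase_spans):
        # hidden_states: (seq_len, hidden) final-layer outputs of the encoder
        # phrase_spans:  list of (start, end) token offsets, one per non-terminal
        u_root = hidden_states[0]                # pooled [CLS]-style representation
        u_phrases = torch.stack(
            [hidden_states[s:e].mean(dim=0) for s, e in phrase_spans])
        logits = self.w_l(self.g(u_root))        # softmax is applied inside the loss
        return logits, u_root, u_phrases
```

The returned `u_root` and `u_phrases` feed the LIL and GIL layers described next.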
2.3 Local Interpretability Layer (LIL)
For local interpretability, we compute a local relevance score for all input concepts from the sample $x$. Approaches that assign relative importance scores to input features through activation differences (Shrikumar et al., 2017; Montavon et al., 2017) are widely adopted for interpretability in computer vision applications. Motivated by this, we adopt a similar approach for NLP, learning the attribution of each concept to the final label distribution via activation differences. Each non-terminal $nt_j$ is assigned a score that quantifies its contribution to the label, in comparison to the contribution of the root node $nt_{root}$. The most contributing phrases are then used to locally explain the model’s decisions.
Given the encoder, LIL computes the contribution of each $nt_j$ on its own to the final prediction. We first build a representation of the input without the contribution of phrase $nt_j$ and use it to score the labels:
$$ s_j = \mathrm{softmax}\big(W_v \, g(u_{root} - u_j) + b_v\big), $$
where $g$ is an activation function and $W_v$, $b_v$ are the parameters of the LIL layer. Here, $s_j$ signifies a label distribution without the contribution of $nt_j$. Using this, the relevance score of each $nt_j$ for the final prediction is given by the difference between the classifier score for the predicted label based on the entire input and the label score based on the input without $nt_j$:
$$ r_j = l_{\hat{y}} - s_{j,\hat{y}}, $$
where $r_j$ is the relevance score of the concept $nt_j$.
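A minimal sketch of this scoring, assuming the subtraction-based construction above and a dedicated linear head for LIL (function and argument names are illustrative):

```python
import torch

def lil_relevance(l_probs, u_root, u_phrases, lil_head, predicted_class):
    """Relevance of each phrase via activation differences (sketch).

    l_probs:    label distribution from the main classifier, shape (C,)
    u_root:     root ([CLS]) representation, shape (H,)
    u_phrases:  phrase representations, shape (J, H)
    lil_head:   nn.Linear(H, C) playing the role of (W_v, b_v)
    """
    # label distribution without each phrase's contribution (assumed subtraction)
    s = torch.softmax(lil_head(torch.relu(u_root - u_phrases)), dim=-1)
    # r_j: drop in the predicted-class score when nt_j is removed
    return l_probs[predicted_class] - s[:, predicted_class]
```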
2.4 Global Interpretability layer (GIL)
The Global Interpretability Layer (GIL) aims to interpret each data sample by providing the set of concepts from the training data that most influenced the model’s prediction. This is advantageous because it reveals how important concepts from the training set influenced the decision to predict the label of a new input, at a finer granularity than methods that use entire training samples for post-hoc interpretability (Koh and Liang, 2017; Han et al., 2020).
We first build a concept store $Q$ that holds all concepts from the training data. Given the model $\mathcal{M}$, we represent each concept candidate $q_k$ from the training data as the mean-pooled representation of its constituent words, $q_k = \frac{1}{|q_k|}\sum_{w \in q_k} e(w)$, where $e$ denotes the embedding layer of $\mathcal{M}$ and $|q_k|$ is the number of words in $q_k$. $Q$ is thus represented by the set $\{q_1, \ldots, q_{N_Q}\}$, where $N_Q$ is the number of concepts in the training data. As the model is finetuned for a downstream task, these representations are continually updated; in practice, we re-index all candidate representations after a fixed number of training steps.
For any input $x$, GIL produces the set of concepts from $Q$ that are most influential for the model’s prediction, as measured by a cosine similarity function $d$. Taking $u_{root}$ as input, GIL uses dense inner product search to retrieve the top-$K$ influential concepts for the sample. Differentiable retrieval through Maximum Inner Product Search (MIPS) has been shown to be effective in question-answering settings (Guu et al., 2020; Dhingra et al., 2020) for leveraging retrieved knowledge for reasoning; MIPS can also often be scaled efficiently using approximate algorithms (Shrivastava and Li, 2014). Motivated by this, we repurpose this retrieval approach to identify influential concepts from the training data and learn it end-to-end via backpropagation. Our inner product model for GIL is defined as follows:
$$ C_G = \operatorname*{top\text{-}K}_{q_k \in Q} \; d(u_{root}, q_k), \qquad d(u_{root}, q_k) = \frac{u_{root} \cdot q_k}{\lVert u_{root} \rVert \, \lVert q_k \rVert}. $$
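A minimal sketch of the retrieval step, assuming a HuggingFace-style tokenizer and the model's embedding layer for building the concept store; at scale, the exact matrix multiplication below would typically be replaced by an (approximate) MIPS index:

```python
import torch
import torch.nn.functional as F

def build_concept_store(phrases, embedding_layer, tokenizer):
    """Mean-pooled representation for every training-set concept (sketch)."""
    reps = []
    for phrase in phrases:
        ids = torch.tensor(tokenizer(phrase)["input_ids"])   # includes special tokens
        reps.append(embedding_layer(ids).mean(dim=0))
    return torch.stack(reps)                                 # (num_concepts, hidden)

def gil_top_k(u_root, concept_store, k=5):
    """Retrieve the k most influential concepts by cosine similarity.

    Normalizing both sides turns the inner product search into cosine search.
    """
    q = F.normalize(concept_store, dim=-1)
    x = F.normalize(u_root, dim=-1)
    scores = q @ x
    top = torch.topk(scores, k)
    return top.indices, top.values
```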
2.5 Training
SelfExplain is trained to maximize the conditional log-likelihood of predicting the class at all three final layers: the linear layer (for label prediction), LIL, and GIL. Regularizing models with explanation-specific losses has been shown to improve inherently interpretable models for local interpretability (Melis and Jaakkola, 2018). We extend this idea to both the global and local interpretable outputs of our classifier: during training, we regularize the loss through the GIL and LIL layers by optimizing their outputs for the end task as well. For the GIL layer, we aggregate the representations of the retrieved concepts $C_G = \{q_1, \ldots, q_K\}$ as a weighted sum, followed by an activation layer, a linear layer and a softmax, and compute the corresponding log-likelihood loss:
$$ l_G = \mathrm{softmax}\Big(W_G \, g\big(\textstyle\sum_{k=1}^{K} w_k \, q_k\big) + b_G\Big), \qquad \mathcal{L}_G = -\log l_G[y], $$
where the weights $w_k$ of the sum are learned, $g$ is the activation, and the softmax produces the label distribution for the GIL layer. For the LIL layer, we analogously compute a weighted aggregated representation over the phrase representations $\{u_j\}$ and the corresponding log-likelihood loss $\mathcal{L}_L$. To train the model, we optimize the joint loss
$$ \mathcal{L} = \mathcal{L}_{task} + \alpha \, \mathcal{L}_G + \beta \, \mathcal{L}_L, $$
where $\alpha$ and $\beta$ are regularization hyper-parameters. All loss components use the cross-entropy loss based on the task label $y$.
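A minimal sketch of the joint objective; the default values of alpha and beta are illustrative, not the tuned hyper-parameters:

```python
import torch.nn.functional as F

def joint_loss(task_logits, gil_logits, lil_logits, labels, alpha=0.1, beta=0.1):
    """Task loss regularized by the GIL and LIL explanation losses (sketch)."""
    loss_task = F.cross_entropy(task_logits, labels)
    loss_gil = F.cross_entropy(gil_logits, labels)
    loss_lil = F.cross_entropy(lil_logits, labels)
    return loss_task + alpha * loss_gil + beta * loss_lil
```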
3 Dataset and Experiments
Table 1: Dataset statistics (C: number of classes; L: average sentence length).

| Dataset | C | L | Train | Test |
|---|---|---|---|---|
| SST-2 | 2 | 19 | 68,222 | 1,821 |
| SST-5 | 5 | 18 | 10,754 | 1,101 |
| TREC-6 | 6 | 10 | 5,451 | 500 |
| TREC-50 | 50 | 10 | 5,451 | 499 |
| SUBJ | 2 | 23 | 8,000 | 1,000 |
Table 2: Classification accuracy of SelfExplain compared to the base transformer models; $K$ is the number of retrieved global concepts.

| Model | SST-2 | SST-5 | TREC-6 | TREC-50 | SUBJ |
|---|---|---|---|---|---|
| XLNet | 93.4 | 53.8 | 96.6 | 82.8 | 96.2 |
| SelfExplain-XLNet ($K$=5) | 94.6 | 55.2 | 96.4 | 83.0 | 96.4 |
| SelfExplain-XLNet ($K$=10) | 94.4 | 55.2 | 96.4 | 82.8 | 96.4 |
| RoBERTa | 94.8 | 53.5 | 97.0 | 89.0 | 96.2 |
| SelfExplain-RoBERTa ($K$=5) | 95.1 | 54.3 | 97.6 | 89.4 | 96.3 |
| SelfExplain-RoBERTa ($K$=10) | 95.1 | 54.1 | 97.6 | 89.2 | 96.3 |
Datasets:
We evaluate our framework on five classification datasets: (i) SST-2 (https://gluebenchmark.com/tasks), the sentiment classification task of Socher et al. (2013), where the goal is to predict the binary sentiment of movie-review sentences; (ii) SST-5 (https://nlp.stanford.edu/sentiment/index.html), a fine-grained sentiment classification task that uses the same data but casts it as a 5-class problem; (iii) TREC-6 (https://cogcomp.seas.upenn.edu/Data/QA/QC/), a question classification task proposed by Li and Roth (2002), where each question is classified into one of 6 question types; (iv) TREC-50, a fine-grained version of the same TREC question classification task with 50 classes; and (v) SUBJ, a subjective/objective binary classification dataset (Pang and Lee, 2005). Dataset statistics are shown in Table 1.
Experimental Settings:
For our SelfExplain experiments, we consider two transformer encoder configurations as our base models: (1) RoBERTa encoder (Liu et al., 2019) — a robustly optimized version of BERT Devlin et al. (2019). (2) XLNet encoder Yang et al. (2019) — a transformer model based on Transformer-XL Dai et al. (2019) architecture.
We incorporate SelfExplain into RoBERTa and XLNet, and use these encoders without the GIL and LIL layers as baselines. We generate constituency parse trees (Kitaev and Klein, 2018) to extract target concepts for the input, and otherwise follow the same pre-processing steps as the original encoder configurations. We also retain the hyperparameters and weights from the pre-training of the encoders. The architecture with GIL and LIL modules is fine-tuned on the datasets described in §3. For the number of retrieved global influential concepts $K$, we consider two settings, $K \in \{5, 10\}$. We also perform hyperparameter tuning and report results for the best model configuration. All models were trained on an NVIDIA V100 GPU.
Classification Results:
We first evaluate the end-task utility of classification models after incorporating the GIL and LIL layers (Table 2). Across the different classification tasks, SelfExplain-RoBERTa and SelfExplain-XLNet consistently perform on par with or better than the base models, except for a marginal drop on the TREC-6 dataset for SelfExplain-XLNet. We also observe that the hyperparameter $K$ did not make a noticeable difference. Additional ablation experiments in Table 3 suggest that the gains through GIL and LIL are complementary and that both layers contribute to performance.
Table 3: Ablation of the GIL and LIL layers (accuracy on SST-2).

| Model | Accuracy |
|---|---|
| XLNet-Base | 93.4 |
| SelfExplain-XLNet + LIL | 94.3 |
| SelfExplain-XLNet + GIL | 94.0 |
| SelfExplain-XLNet + GIL + LIL | 94.6 |
| RoBERTa-Base | 94.8 |
| SelfExplain-RoBERTa + LIL | 94.8 |
| SelfExplain-RoBERTa + GIL | 94.8 |
| SelfExplain-RoBERTa + GIL + LIL | 95.1 |
4 Explanation Evaluation
Explanations are notoriously difficult to evaluate quantitatively (Doshi-Velez et al., 2017). A good model explanation should be (i) relevant to the current input and prediction and (ii) understandable to humans (DeYoung et al., 2020; Jacovi and Goldberg, 2020; Wiegreffe et al., 2020; Jain et al., 2020). Towards this, we evaluate the explanations along the following criteria:
- Sufficiency: Do explanations sufficiently reflect the model predictions?
- Plausibility: Do explanations appear plausible and understandable to humans?
- Trustability: Do explanations improve human trust in model predictions?
As model explanations for these evaluations, we extract from SelfExplain (i) the most relevant local concepts, i.e., the top-ranked phrases according to the LIL relevance scores $r_j$, and (ii) the top influential global concepts, i.e., the concepts ranked highest by the output of the GIL layer.
4.1 Do SelfExplain explanations reflect predicted labels?
Sufficiency aims to evaluate whether model explanations alone are highly indicative of the predicted label (Jacovi et al., 2018; Yu et al., 2019). The “Faithfulness-by-construction” (FRESH) pipeline (Jain et al., 2020) is an example of such a framework: the explanations alone, without the remaining parts of the input, must be sufficient for predicting a label. In FRESH, a BERT-based classifier (Devlin et al., 2019) is trained to perform the task using only the extracted explanations, without the rest of the input. An explanation that achieves high accuracy under this classifier is indicative of its ability to recover the original model prediction.
We evaluate the explanations on the sentiment analysis task. Explanations from SelfExplain are incorporated into the FRESH framework, and we compare their predictive accuracy against baseline explanation methods. Following Jain et al. (2020), we use the same experimental setup and saliency-based baselines, namely attention-based (Lei et al., 2016; Bastings et al., 2019) and gradient-based (Li et al., 2016) explanation methods. In these experiments, explanations are pruned to at most 20% of the input; for SelfExplain, we select up to the top-$K$ concepts within this 20% threshold (a sketch of this pruning step follows Table 4). From Table 4, we observe that SelfExplain explanations from LIL and GIL show higher predictive performance than all baseline methods. Additionally, GIL explanations outperform the full-text setting (an explanation that uses all of the input sample), which is often considered an upper bound for span-based explanation approaches. We hypothesize that this is because GIL explanation concepts retrieved from the training data are highly relevant for disambiguating the input text. In summary, outputs from SelfExplain are more predictive of the label than prior explanation methods, indicating higher sufficiency of the explanations.
Table 4: Sufficiency evaluation in the FRESH framework: accuracy of a classifier trained only on the extracted explanations.

| Model | Explanation | Accuracy |
|---|---|---|
| Full input text | – | 0.90 |
| Lei et al. (2016) | | |
| Bastings et al. (2019) | | |
| Li et al. (2016) | | |
| [CLS] Attn | | |
| SelfExplain-LIL | top-$K$ concepts | 0.84 |
| SelfExplain-GIL | top-$K$ concepts | 0.93 |
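To make the setup in §4.1 concrete, here is a minimal sketch of how the explanation-only inputs could be constructed before training a fresh classifier on them; the function name, the word-level length counting and the 20% budget heuristic are assumptions of the sketch.

```python
def build_sufficiency_dataset(examples, explain_fn, max_frac=0.2):
    """FRESH-style sufficiency setup (sketch): keep only the explanation
    phrases (up to ~20% of the input) and train/evaluate a classifier on them."""
    reduced = []
    for text, label in examples:
        phrases = explain_fn(text)              # ranked concepts from LIL or GIL
        budget = max(1, int(max_frac * len(text.split())))
        kept, used = [], 0
        for p in phrases:
            n = len(p.split())
            if used + n > budget and kept:
                break
            kept.append(p)
            used += n
        reduced.append((" ".join(kept), label))
    return reduced
```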
4.2 Are SelfExplain explanations plausible and trustable for humans?
Table 5: Example interpretations by SelfExplain.

| Sample | Label | Top relevant phrases from LIL | Top influential concepts from GIL |
|---|---|---|---|
| | neg | for days | |
| | pos | corny, schmaltzy, of heart | |
| | neg | comprehensible, the lack of | |
| | pos | the structure of the film | |
Human evaluation is commonly used to evaluate plausibility and trustability. To this end, 14 human judges (graduate students in computer science) annotated 50 samples from the SST-2 validation set of sentiment excerpts (Socher et al., 2013). Each judge compared local and global explanations produced by the SelfExplain-XLNet model against two commonly used interpretability methods: (i) influence functions (Han et al., 2020) for global interpretability and (ii) saliency detection (Simonyan et al., 2014) for local interpretability. We follow the setup discussed in Han et al. (2020). Each judge was provided with the evaluation criteria (detailed next) and a corresponding description. The models being evaluated were anonymized, and judges were asked to rate them according to the evaluation criteria alone.
Following Ehsan et al. (2019), we analyse the plausibility of explanations, i.e., how users would perceive such explanations had they been generated by humans. We adopt two criteria proposed by Ehsan et al. (2019):
Adequate justification:
Adequately justifying a prediction is considered an important criterion for the acceptance of a model (Davis, 1989). We evaluate the adequacy of explanations by asking human judges: “Does the explanation adequately justify the model prediction?” Participants deemed explanations that were irrelevant or incomplete as less adequately justifying the model prediction. Human judges were shown (i) the input, (ii) the gold label, (iii) the predicted label, and (iv) explanations from the baselines and SelfExplain. The models were anonymized and shuffled.
Figure 3 (left) shows that SelfExplain achieves a gain of 32% in perceived adequate justification, providing further evidence that humans perceived SelfExplain explanations as more plausible compared to the baselines.


Understandability:
An essential criterion for transparency in an AI system is the ability of a user to understand model explanations (Doshi-Velez et al., 2017). Our understandability metric evaluates whether a human judge can understand the explanations presented by the model, which would equip a non-expert to verify the model predictions. Human judges were presented with (i) the input, (ii) the gold label, (iii) the predicted sentiment label, and (iv) explanations from the different methods (the baselines and SelfExplain), and were asked to select the explanation they perceived as most understandable. Figure 3 (right) shows that SelfExplain achieves a 29% improvement over the best-performing baseline in terms of understandability of the model explanation.
Trustability:
In addition to plausibility, we also evaluate user trust in the explanations (Singh et al., 2019; Jin et al., 2020). We follow the same experimental setup as Singh et al. (2019) and Jin et al. (2020) to compute a mean trust score. For each data sample, subjects were shown the model prediction and the explanations from the three interpretability methods, and were asked to rate each explanation on a Likert scale of 1–5 according to how much trust it instilled in the model. Figure 4 shows the mean trust score of SelfExplain in comparison to the baselines. We observe that concept-based explanations are perceived as more trustworthy by humans.
5 Analysis
Table 5 shows example interpretations by SelfExplain. In this section we present additional analyses of its explanations (further analysis appears in the appendix due to space constraints).
Does SelfExplain’s explanation help predict model behavior? In this setup, humans are presented with an input and an explanation, and must correctly predict the model’s output (Doshi-Velez and Kim, 2017; Lertvittayakumjorn and Toni, 2019; Hase and Bansal, 2020). We randomly selected 16 samples from the dev set, spanning an equal number of true positives, true negatives, false positives and false negatives. Three human judges were asked to predict the model decision with and without the model explanation. When users were presented with the explanation, their ability to predict the model decision improved by an average of 22%, showing that SelfExplain’s explanations help humans better understand the model’s behavior.
Performance Analysis:
For GIL, we study the computational trade-off of varying the number of retrieved influential concepts $K$. As Table 6 shows, there is only a marginal drop in training speed and a small increase in memory when moving from the base model to the SelfExplain model with both GIL and LIL. From our experiments with human judges, we found that a small $K$ ($K{=}5$ in our configuration) is preferable for sentence-level classification tasks, balancing performance and ease of interpretability.
Table 6: Training throughput and memory usage (relative to the base model) for different numbers of retrieved GIL concepts $K$.

| GIL top-$K$ | steps/sec | memory |
|---|---|---|
| base | 2.74 | 1 |
| $K$=5* | 2.50 | 1.03 |
| $K$=100 | 2.48 | 1.04 |
| $K$=1000 | 2.20 | 1.07 |
LIL-GIL-Linear layer agreement:
To understand whether our explanations lead to the same label as the model’s prediction, we analyze whether the final activations of the GIL and LIL layers agree with those of the linear layer. Towards this, we compute the agreement between the label distributions from the GIL and LIL layers and the distribution from the linear layer. For SelfExplain-XLNet on the SST-2 dataset, the LIL-linear agreement F1 is 96.6%, the GIL-linear F1 is 100%, and the three-way GIL-LIL-linear F1 is 96.6%. These high agreement rates validate that GIL and LIL concepts lead to the same classification predictions as the model’s final layer.
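A small sketch of how such agreement numbers could be computed from per-example argmax predictions of the three layers (the micro-averaging choice is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

def layer_agreement(linear_preds, gil_preds, lil_preds):
    """Agreement of each interpretability layer with the linear layer (sketch),
    treating the linear layer's predictions as the reference labels."""
    linear, gil, lil = map(np.asarray, (linear_preds, gil_preds, lil_preds))
    return {
        "LIL-linear": f1_score(linear, lil, average="micro"),
        "GIL-linear": f1_score(linear, gil, average="micro"),
        "all-agree": float(np.mean((gil == linear) & (lil == linear))),
    }
```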
Are LIL concepts relevant?
For this analysis, we randomly selected 50 samples from the SST-2 dev set and removed the most salient phrases ranked by LIL. Annotators were asked to predict the label without the most relevant local concept, and their accuracy dropped by 7%. We also computed the SelfExplain-XLNet classifier’s accuracy on the same perturbed inputs, and its accuracy dropped as well (statistically significant by the Wilson interval test). This suggests that LIL captures relevant local concepts. Samples from this experiment are shown in §A.3.
6 Related Work
Post-hoc Interpretation Methods:
Predominant approaches to post-hoc interpretability in NLP are gradient-based (Simonyan et al., 2014; Sundararajan et al., 2017; Smilkov et al., 2017). Other post-hoc interpretability methods, such as Singh et al. (2019) and Jin et al. (2020), decompose relevant and irrelevant aspects of hidden states to obtain a relevance score. While these methods focus on local interpretability, works such as Han et al. (2020) aim to retrieve influential training samples for global interpretation. Global interpretability methods are useful not only to facilitate explainability, but also to detect and mitigate artifacts in data (Pezeshkpour et al., 2021; Han and Tsvetkov, 2021).
Inherently Interpretable Models:
Heat maps based on attention (Bahdanau et al., 2014) are among the most commonly used interpretability tools for downstream tasks such as machine translation (Luong et al., 2015), summarization (Rush et al., 2015) and reading comprehension (Hermann et al., 2015). Another line of work explores rationales (Lei et al., 2016), including those collected through expert annotation (Zaidan and Eisner, 2008); notable datasets of external rationales include CoS-E (Rajani et al., 2019), e-SNLI (Camburu et al., 2018) and, more recently, the ERASER benchmark (DeYoung et al., 2020). Alternative lines of work in this class of models include Card et al. (2019), who interpret a given sample as a weighted sum of training samples, and Croce et al. (2019), who identify influential training samples using a kernel-based transformation function. Jiang and Bansal (2019) produce interpretations through modular architectures, where model decisions are explained through the outputs of intermediate modules. A class of inherently interpretable classifiers explains model predictions locally using human-understandable high-level concepts such as prototypes (Melis and Jaakkola, 2018; Chen et al., 2019) and interpretable classes (Koh et al., 2020; Yeh et al., 2020). These were recently proposed for computer vision applications, but despite their promise have not yet been widely adopted in NLP. SelfExplain is similar in spirit to Melis and Jaakkola (2018) but additionally provides explanations via training data concepts for neural text classification tasks.
7 Conclusion
In this paper, we propose SelfExplain, a novel self-explaining framework that enables explanations through higher-level concepts, moving beyond low-level word attributions. SelfExplain provides both local explanations (via the relevance of each input concept) and global explanations (via influential concepts from the training data) in a single framework, through two novel modules (LIL and GIL), and is trainable end-to-end. Through human evaluation, we show that our model’s explanations are perceived as more trustworthy, understandable, and adequate for explaining model decisions compared to previous approaches to explainability.
This opens an exciting research direction for building inherently interpretable models for text classification. Future work will extend the framework to other tasks and to longer contexts beyond single input sentences. We will also explore additional approaches to extracting target local and global concepts, including abstract syntactic, semantic, and pragmatic linguistic features. Finally, we will study the right level of abstraction for generating explanations for each of these tasks in a human-friendly way.
Acknowledgements
This material is based upon work funded by the DARPA CMO under Contract No. HR001120C0124, and by the United States Department of Energy (DOE) National Nuclear Security Administration (NNSA) Office of Defense Nuclear Nonproliferation Research and Development (DNN R&D) Next-Generation AI research portfolio. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
References
- Arik and Pfister (2020) Sercan Ö. Arik and T. Pfister. 2020. Protoattend: Attention-based prototypical learning. J. Mach. Learn. Res., 21:210:1–210:35.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR.
- Bastings et al. (2019) Jasmijn Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977, Florence, Italy. Association for Computational Linguistics.
- Camburu et al. (2018) Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-snli: Natural language inference with natural language explanations. In NeurIPS.
- Card et al. (2019) Dallas Card, Michael Zhang, and Noah A Smith. 2019. Deep weighted averaging classifiers. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 369–378.
- Chen et al. (2019) Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. 2019. This looks like that: deep learning for interpretable image recognition. In Advances in neural information processing systems, pages 8930–8941.
- Croce et al. (2019) Danilo Croce, Daniele Rossini, and Roberto Basili. 2019. Auditing deep learning processes through kernel-based explanatory models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4028–4037.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- Davis (1989) Fred D. Davis. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q., 13:319–340.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- DeYoung et al. (2020) Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, Online. Association for Computational Linguistics.
- Dhingra et al. (2020) Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, and William W. Cohen. 2020. Differentiable reasoning over a virtual knowledge base. In International Conference on Learning Representations.
- Doshi-Velez and Kim (2017) Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
- Doshi-Velez et al. (2017) Finale Doshi-Velez, Mason Kortz, Ryan Budish, Chris Bavitz, Sam Gershman, D. O’Brien, Stuart Schieber, J. Waldo, D. Weinberger, and Alexandra Wood. 2017. Accountability of ai under the law: The role of explanation. ArXiv, abs/1711.01134.
- Ehsan et al. (2019) Upol Ehsan, Pradyumna Tambwekar, Larry Chan, Brent Harrison, and Mark O Riedl. 2019. Automated rationale generation: a technique for explainable ai and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 263–274.
- Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL 2018, pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
- Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.
- Han and Tsvetkov (2021) Xiaochuang Han and Yulia Tsvetkov. 2021. Influence tuning: Demoting spurious correlations via instance attribution and instance-driven updates. In Findings of EMNLP.
- Han et al. (2020) Xiaochuang Han, Byron C. Wallace, and Yulia Tsvetkov. 2020. Explaining black box predictions and unveiling data artifacts through influence functions. In ACL.
- Hase and Bansal (2020) Peter Hase and Mohit Bansal. 2020. Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5540–5552, Online. Association for Computational Linguistics.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.
- Jacovi and Goldberg (2020) Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online. Association for Computational Linguistics.
- Jacovi et al. (2018) Alon Jacovi, Oren Sar Shalom, and Y. Goldberg. 2018. Understanding convolutional neural networks for text classification. ArXiv, abs/1809.08037.
- Jain and Wallace (2019) Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.
- Jain et al. (2020) Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and Byron C. Wallace. 2020. Learning to faithfully rationalize by construction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4459–4473, Online. Association for Computational Linguistics.
- Jiang and Bansal (2019) Yichen Jiang and Mohit Bansal. 2019. Self-assembling modular networks for interpretable multi-hop reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4474–4484, Hong Kong, China. Association for Computational Linguistics.
- Jin et al. (2020) Xisen Jin, Zhongyu Wei, Junyi Du, Xiangyang Xue, and Xiang Ren. 2020. Towards hierarchical importance attribution: Explaining compositional semantics for neural sequence models. In International Conference on Learning Representations.
- Kim et al. (2014) Been Kim, Cynthia Rudin, and Julie A Shah. 2014. The bayesian case model: A generative approach for case-based reasoning and prototype classification. In Advances in neural information processing systems, pages 1952–1960.
- Kitaev and Klein (2018) Nikita Kitaev and D. Klein. 2018. Constituency parsing with a self-attentive encoder. In ACL.
- Koh and Liang (2017) Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR. org.
- Koh et al. (2020) Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept bottleneck models. NeurIPS.
- Yeh et al. (2020) Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Pradeep Ravikumar, and Tomas Pfister. 2020. On completeness-aware concept-based explanations in deep neural networks. In NeurIPS.
- Kumar et al. (2019) Sachin Kumar, Shuly Wintner, Noah A. Smith, and Yulia Tsvetkov. 2019. Topics to avoid: Demoting latent confounds in text classification. In Proc. EMNLP, pages 4151–4161.
- Lei et al. (2016) Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, Austin, Texas. Association for Computational Linguistics.
- Lertvittayakumjorn and Toni (2019) Piyawat Lertvittayakumjorn and Francesca Toni. 2019. Human-grounded evaluations of explanation methods for text classification. In EMNLP/IJCNLP.
- Li et al. (2016) J. Li, Xinlei Chen, E. Hovy, and Dan Jurafsky. 2016. Visualizing and understanding neural models in nlp. In HLT-NAACL.
- Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pages 1–7. Association for Computational Linguistics.
- Lipton (2018) Zachary C Lipton. 2018. The mythos of model interpretability. Queue, 16(3):31–57.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
- McCoy et al. (2019) R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proc. ACL.
- Melis and Jaakkola (2018) David Alvarez Melis and Tommi Jaakkola. 2018. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, pages 7775–7784.
- Montague (1970) Richard Montague. 1970. English as a formal language. In Bruno Visentini, editor, Linguaggi nella societa e nella tecnica, pages 188–221. Edizioni di Communita.
- Montavon et al. (2017) Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, W. Samek, and K. Müller. 2017. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognit., 65:211–222.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv preprint cs/0506075.
- Pezeshkpour et al. (2021) Pouya Pezeshkpour, Sarthak Jain, Sameer Singh, and Byron C Wallace. 2021. Combining feature and instance attribution to detect artifacts. arXiv preprint arXiv:2107.00323.
- Pruthi et al. (2020) Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary C Lipton. 2020. Learning to deceive with attention-based explanations. In ACL.
- Rajani et al. (2019) Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! leveraging language models for commonsense reasoning. ACL.
- Ribeiro et al. (2016) Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101, San Diego, California. Association for Computational Linguistics.
- Rudin (2019) Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215.
- Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.
- Serrano and Smith (2019) Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.
- Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. volume 70 of Proceedings of Machine Learning Research, pages 3145–3153, International Convention Centre, Sydney, Australia. PMLR.
- Shrivastava and Li (2014) Anshumali Shrivastava and Ping Li. 2014. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pages 2321–2329.
- Simonyan et al. (2014) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034.
- Singh et al. (2019) Chandan Singh, W. James Murdoch, and Bin Yu. 2019. Hierarchical interpretations for neural network predictions. In International Conference on Learning Representations.
- Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. 2017. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Sun et al. (2019) Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating gender bias in natural language processing: Literature review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1630–1640, Florence, Italy. Association for Computational Linguistics.
- Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3319–3328. JMLR. org.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Wiegreffe et al. (2020) Sarah Wiegreffe, Ana Marasović, and Noah A. Smith. 2020. Measuring association between labels and free-text rationales. ArXiv, abs/2010.12762.
- Wiegreffe and Pinter (2019) Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, R. Salakhutdinov, R. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764.
- Yu et al. (2019) Mo Yu, S. Chang, Y. Zhang, and T. Jaakkola. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. ArXiv, abs/1910.13294.
- Zaidan and Eisner (2008) Omar Zaidan and Jason Eisner. 2008. Modeling annotators: A generative approach to learning from annotator rationales. In Proceedings of the 2008 conference on Empirical methods in natural language processing, pages 31–40.
- Zhao et al. (2017) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proc. of EMNLP, pages 2979–2989.
Appendix A Appendix
A.1 Additional Analysis
Table 8: A representative example of similar inputs with their top-ranked local phrases (LIL) and influential global concepts (GIL).
Stability: do similar examples have similar explanations?
Melis and Jaakkola (2018) argue that a crucial property that interpretable models need to address is stability, where the model should be robust enough that a minimal change in the input should not lead to drastic changes in the observed interpretations. We qualitatively analyze this by measuring the overlap of SelfExplain’s extracted concepts for similar examples. Table 8 shows a representative example in which minor variations in the input lead to differently ranked local phrases, but their global influential concepts remain stable.
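As an illustration, one simple way to quantify this overlap is a Jaccard score over the concept sets extracted for two minimally different inputs (a rough heuristic, not the protocol used in the human evaluation):

```python
def explanation_overlap(concepts_a, concepts_b):
    """Jaccard overlap between the explanation sets of two similar inputs (sketch)."""
    a, b = set(concepts_a), set(concepts_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0
```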
A.2 Qualitative Examples
Table 9 shows some qualitative examples from our best performing SST-2 model.
Table 9: Qualitative examples of local explanations (from the input) and global explanations (from the training data) produced by the SST-2 model.

| Input Sentence | Explanation from Input | Explanation from Training Data |
|---|---|---|
| | 'much to enjoy', 'to enjoy', 'to mull over' | |
| | | 'dazzle and delight us' |
| nervous breakdowns are not entertaining . | 'nervous breakdowns', 'are not entertaining' | 'mesmerizing portrait' |
| too slow , too long and too little happens . | 'too long', 'too little happens', 'too little' | |
| very bad . | 'very bad' | |
| it treats women like idiots . | 'treats women like idiots', 'like idiots' | |
| too much of the humor falls flat . | | 'infuriating' |
| | | 'with terrific flair' |
| | "do n't give a damn" | 'spiteful idiots' |
| | | 'bang' |
| | | 'all surface psychodramatics' |
A.3 Relevant Concept Removal
Table 10 shows samples where the model flipped its label after the most relevant local concept was removed. For each sample, we show the original input, the perturbed input after removing the most relevant local concept, and the corresponding model predictions.
Table 10: Samples for which removing the most relevant local concept (ranked by LIL) changes the model prediction.

| Original Input | Perturbed Input | Original Prediction | Perturbed Prediction |
|---|---|---|---|
| unflinchingly bleak and desperate | unflinch ________________ | negative | positive |
| | | positive | negative |
| | | positive | negative |
| | | positive | negative |
| holden caulfield did it better . | holden caulfield __________ . | negative | positive |
| | | positive | negative |
| | | positive | negative |
| | | positive | negative |
| | | negative | negative |