Overlooked Factors in Concept-based Explanations:
Dataset Choice, Concept Learnability, and Human Capability
Abstract
Concept-based interpretability methods aim to explain a deep neural network model’s components and predictions using a pre-defined set of semantic concepts. These methods evaluate a trained model on a new, “probe” dataset and correlate the model’s outputs with concepts labeled in that dataset. Despite their popularity, they suffer from limitations that are not well-understood and articulated in the literature. In this work, we identify and analyze three commonly overlooked factors in concept-based explanations. First, we find that the choice of the probe dataset has a profound impact on the generated explanations. Our analysis reveals that different probe datasets lead to very different explanations, suggesting that the generated explanations are not generalizable outside the probe dataset. Second, we find that concepts in the probe dataset are often harder to learn than the target classes they are used to explain, calling into question the correctness of the explanations. We argue that only easily learnable concepts should be used in concept-based explanations. Finally, while existing methods use hundreds or even thousands of concepts, our human studies reveal a much stricter upper bound of 32 concepts or less, beyond which the explanations are much less practically useful. We discuss the implications of our findings and provide suggestions for future development of concept-based interpretability methods. Code for our analysis and user interface can be found at https://github.com/princetonvisualai/OverlookedFactors
1 Introduction
Performance and opacity are often correlated in deep neural networks: the highly parameterized nature of these models that enables them to achieve high task accuracy also reduces their interpretability. However, in order to responsibly use and deploy them, especially in high-risk settings such as medical diagnosis, we need these models to be interpretable, i.e., understandable by people. With the growing recognition of the importance of interpretability, many methods have been proposed in recent years to explain some aspects of neural networks and render them more interpretable (see [4, 14, 18, 42, 44, 53] for surveys).

In this work, we dive into concept-based interpretability methods for image classification models, which explain model components and/or predictions using a pre-defined set of semantic concepts [5, 16, 25, 29, 56]. Given access to a trained model and a set of images labelled with semantic concepts (i.e., a “probe” dataset), these methods produce explanations with the provided concepts. See Fig. 1 for an example explanation.
Concept-based methods are a particularly promising approach for bridging the interpretability gap between complex models and human understanding, as they explain model components and predictions with human-interpretable units, i.e., semantic concepts. Recent work finds that people prefer concept-based explanations over other forms (e.g., heatmap and example-based) because they resemble human reasoning and explanations [27]. Further, concept-based methods uniquely provide a global, high-level understanding of a model, e.g., how it predicts a certain class [56, 39] and what the model (or some part of it) has learned [25, 5, 16]. These insights are difficult to gain from local explanation methods that only provide an explanation for a single model prediction, such as saliency maps that highlight relevant regions within an image.
However, existing research on concept-based interpretability methods focuses heavily on new method development, ignoring important factors such as the probe dataset used to generate explanations or the concepts composing the explanations. Outside the scope of concept-based methods, there have been several recent works that study the effect of different factors on explanations. These works, however, are either limited to saliency maps [1, 28, 31, 41] or make a general call for transparency, e.g., to include more information when releasing an interpretability method [47].
In this work, we conduct an in-depth study of commonly overlooked factors in concept-based interpretability methods. Concretely, we analyze four representative methods: NetDissect [5], TCAV [25], Concept Bottleneck [29] and IBD [56]. These are a representative and comprehensive set of existing concept-based interpretability methods for computer vision models. Using multiple probe datasets (ADE20k [57, 58] and Pascal [13] for NetDissect, TCAV and IBD; CUB-200-2011 [48] for Concept Bottleneck), we examine the effects of (1) the choice of probe dataset, (2) the concepts used within the explanation, and (3) the complexity of the explanation. Through our analyses, we learn a number of key insights, which we summarize below:
- The choice of the probe dataset has a profound impact on explanations. We repeatedly find that different probe datasets give rise to different explanations, when explaining the same model with the same interpretability method. For instance, the prediction of the arena/hockey class is explained with concepts {grandstand, goal, ice-rink, skate-board} with one probe dataset, and {plaything, road} with another probe dataset. We highlight that concept-based explanations are not solely determined by the model or the interpretability method. Hence, probe datasets should be chosen with caution. Specifically, we suggest using probe datasets whose data distribution is similar to that of the dataset the model-being-explained was trained on.
- Concepts used in explanations are frequently harder to learn than the classes they aim to explain. The choice of concepts used in explanations is dependent on the available concepts in the probe dataset. Surprisingly, we find that learning some of these concepts is harder than learning the target classes. For example, in one experiment we find that the target class bathroom is explained using concepts {toilet, shower, countertop, bathtub, screen-door}, all of which are harder to learn than bathroom. Moreover, these concepts can be hard for people to identify, limiting the usefulness of these explanations. We argue that learnability is a necessary (albeit not sufficient) condition for the correctness of the explanations, and advocate for future explanations to only use concepts that are easily learnable. (Ideally, future methods would also include causal rather than purely correlation-based explanations.)
- Current explanations use hundreds or even thousands of concepts, but human studies reveal a much stricter upper bound. We conduct human studies with 125 participants recruited from Amazon Mechanical Turk to understand how well people reason with concept-based explanations with varying numbers of concepts. We find that participants struggle to identify relevant concepts in images as the number of concepts increases (the percentage of concepts recognized per image decreases from 71.7% with 8 concepts to 56.8% with 32 concepts). Moreover, the majority of the participants prefer that the number of concepts be limited to 32. We also find that concept-based explanations offer little to no advantage in predicting model output compared to example-based explanations (the participants' mean accuracy at predicting the model output is 64.8% when given explanations with 8 concepts, versus 60.0% when given example-based explanations).
These findings highlight the importance of vetting intuitions when developing and using interpretability methods. We have open-sourced our analysis code and human study user interface to aid with this process in the future: https://github.com/princetonvisualai/OverlookedFactors.
2 Related work
Interpretability methods for computer vision models range from highlighting areas within an image that contribute to a model’s prediction (i.e., saliency maps) [9, 15, 37, 45, 46, 51, 52, 54] to labelling model components (e.g., neurons) [5, 16, 25, 56], highlighting concepts that contribute to the model’s prediction [56, 39] and designing models that are interpretable-by-design [8, 10, 29, 34]. In this work, we focus on concept-based interpretability methods. These include post-hoc methods that label a trained model’s components and/or predictions [5, 16, 25, 56, 39] and interpretable-by-design methods that use pre-defined concepts [29]. We focus on methods for image classification models, where most interpretability research has been and is being conducted. More recently, concept-based methods have also been developed and used for other types of models (e.g., image similarity models [38], language models [50, 7]); however, these are outside the scope of this paper.
Our work is similar in spirit to a growing group of works that propose checks and evaluation protocols to better understand the capabilities and limitations of interpretability methods [1, 3, 2, 21, 23, 26, 28, 32, 41, 49]. Many of these works examine how sensitive post-hoc saliency maps are to different factors such as input perturbations, model weights, or the output class being explained. In contrast, we conduct an in-depth study of concept-based interpretability methods. Despite their popularity, little is understood about their interpretability and usefulness to human users, or their sensitivity to auxiliary inputs such as the probe dataset. We seek to fill this gap with our work and assist with future development and use of concept-based interpretability methods. To the best of our knowledge, we are the first to investigate the effect of the probe dataset and concepts used for concept-based explanations. There has been work investigating the effect of explanation complexity on human understanding [30]; however, it is limited to decision sets.
We also echo the call for releasing more information when releasing datasets [17], models [12, 33] and interpretability methods [47]. More concretely, we suggest that concept-based interpretability method developers include results from our proposed analyses in their method release, in addition to filling out the explainability fact sheet proposed by Sokol et al. [47], to help researchers and practitioners better understand, use, and build on these methods.
3 Dataset choice: Probe dataset has a profound impact on the explanations
Scene class | Top concepts from ADE20k-generated explanations | Top concepts from Pascal-generated explanations |
---|---|---|
arena/hockey | grandstand, goal, ice-rink, scoreboard | plaything, road |
auto-showroom | car, light, trade-name, floor, wall | car, stage, grandstand, baby-buggy, ground |
bedroom | bed, cup, tapestry, lamp, blind | bed, frame, wood, sofa, bedclothes |
bow-window | windowpane, seat, cushion, wall, heater | windowpane, tree, shelves, curtain, cup |
conf-room | swivel-chair, table, mic, chair, document | bench, napkin, plate, candle, table |
corn-field | field, plant, sky, streetlight | tire, sky, dog, water, signboard |
garage/indoor | bicycle, brush, car, tank, ladder | bicycle, vending-mach, tire, motorbike, floor |
hardware-store | shelf, merchandise, pallet, videos, box | rope, shelves, box, bottle, pole |
legis-chamber | seat, chair, pedestal, flag, witness-stand | mic, book, paper |
tree-farm | tree, hedge, land, path, pole | tree, tent, sheep, mountain, rock |
Neuron | ADE20k label | ADE20k score | Pascal label | Pascal score
---|---|---|---|---
9 | plant | 0.082 | potted-plant | 0.194 |
181 | plant | 0.068 | potted-plant | 0.140 |
318 | computer | 0.079 | tv | 0.251 |
386 | autobus | 0.067 | bus | 0.200 |
435 | runway | 0.071 | airplane | 0.189 |
185 | chair | 0.077 | horse | 0.153 |
239 | pool-table | 0.069 | horse | 0.171 |
257 | tent | 0.042 | bus | 0.279 |
384 | washer | 0.043 | bicycle | 0.201 |
446 | pool-table | 0.193 | tv | 0.086 |
Concept-based explanations are generated by running a trained model on a “probe” dataset (typically not the training dataset) which has concepts labelled within it. The choice of probe dataset has been almost entirely dictated by which datasets have concept labels. The most commonly used dataset is the Broden dataset [5]. It contains images from four datasets (ADE20k [57, 58], Pascal [13], OpenSurfaces [6], Describable Textures Dataset [11]) and labels for over 1190 concepts, comprising object, object part, color, scene, and texture concepts.
In this section, we investigate the effect of the probe dataset by comparing explanations generated using two different subsets of the Broden dataset: ADE20k and Pascal. We experiment with three different methods for generating concept-based explanations: Baseline, NetDissect [5], and TCAV [25], and find that the generated explanations heavily depend on the choice of probe dataset. This finding implies that these explanations can only be used for images drawn from the same distribution as the probe dataset.
Model explained. Following prior work [5, 25, 56], we explain a ResNet18-based [20] scene classification model trained on the Places365 dataset [55], which predicts one of 365 scene classes given an input image.
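A minimal sketch of loading such a model is shown below; the checkpoint filename and the DataParallel key-renaming step are assumptions based on the publicly released Places365 checkpoints and may need adjusting for a particular setup.

```python
import torch
import torchvision

# Assumes a ResNet18 checkpoint from the Places365 project has been downloaded
# locally as `resnet18_places365.pth.tar` (hypothetical local path).
model = torchvision.models.resnet18(num_classes=365)
checkpoint = torch.load("resnet18_places365.pth.tar", map_location="cpu")
# Checkpoints saved with DataParallel prefix parameter names with "module.";
# strip the prefix before loading.
state_dict = {k.replace("module.", ""): v for k, v in checkpoint["state_dict"].items()}
model.load_state_dict(state_dict)
model.eval()  # the model is only probed, never fine-tuned
```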
Probe datasets. We use two probe datasets: ADE20k [57, 58] (19733 images, license: BSD 3-Clause) and Pascal [13] (10103 images, license: unknown). (To the best of our knowledge, most of the images used do not include personally identifiable information or offensive content; however, some feature people without their consent and may contain identifiable information.) They are two different subsets of the Broden dataset [5] and are labelled with objects and parts. We randomly split each dataset into training (60%), validation (20%), and test (20%) sets, using the new training set for learning explanations, the validation set for tuning hyperparameters (e.g., learning rate and regularization parameters), and the test set for reporting our findings.
Interpretability methods. We investigate the effect of the probe dataset on three types of concept-based explanations. First, we study a simple Baseline method that measures correlations between the model’s prediction and concepts, and generates class-level explanations as a linear combination of concepts as in Fig. 1. Similar to Ramaswamy et al. [39], we learn a logistic regression model that matches the model-being-explained’s prediction, given access to ground-truth concept labels within the image. We use an l1 penalty to prioritize explanations with fewer concepts. Second, we study NetDissect [5], which identifies neurons within the model-being-explained that are highly activated by certain concepts and generates neuron-level explanations (concept labels); we use the code provided by the authors (https://github.com/CSAILVision/NetDissect-Lite). Finally, we study TCAV [25], which generates explanations in the form of concept activation vectors, i.e., vectors within the model-being-explained’s feature space that correspond to labelled concepts.
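To make the Baseline concrete, the sketch below shows one way to fit such a sparse linear explanation with scikit-learn; the variable names and the regularization strength are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_baseline_explanation(concept_matrix, model_predictions, C=0.1):
    """concept_matrix: (n_images, n_concepts) binary ground-truth concept labels.
    model_predictions: (n_images,) class predicted by the model being explained
    (not the dataset's true label). The l1 penalty encourages explanations that
    use few concepts."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    clf.fit(concept_matrix, model_predictions)
    return clf

def top_concepts(clf, class_name, concept_names, k=5):
    # Report the k concepts with the largest positive weight for one class
    # (assumes a multiclass explanation, i.e., one weight row per class).
    class_idx = list(clf.classes_).index(class_name)
    weights = clf.coef_[class_idx]
    order = np.argsort(weights)[::-1][:k]
    return [(concept_names[i], float(weights[i])) for i in order]
```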
Results. For all three explanation types, we find that using different probe datasets results in very different explanations. To begin, we show in Tab. 1 how Baseline explanations differ when using ADE20k vs. Pascal as the probe dataset. For example, when explaining the corn-field scene prediction, the Pascal-generated explanation highlights dog as important, whereas the ADE20k-generated explanation does not. For the legis-chamber scene, ADE20k highlights chair as important, whereas Pascal does not.
We observe a similar difference for NetDissect (see Tab. 2). We label 123 neurons separately using ADE20k and Pascal, and find that 60 of them are given very different concept labels (e.g., neuron 239 is labelled pool-table by ADE20k and horse by Pascal). It is possible that these neurons are poly-semantic, i.e., neurons that reference multiple concepts, as noted in [16, 35]; however, as we explore in the supp. mat., the score for the concept from the other dataset is usually below 0.04, the threshold used in [5] to identify “highly activated neurons.” Again, this result highlights the impact of the probe dataset on explanations.
Similarly, TCAV concept activation vectors learned using ADE20k vs. Pascal are different, i.e., they have low cosine similarity (see Fig. 2). We compute concept activation vectors for 32 concepts which have a base rate of over 1% in both datasets combined, then calculate the cosine similarity of each concept vector. We also compute the ROC AUC for each concept vector to measure how well the concept vector corresponds to the concept. We find that the similarity is low (0.078 on average), even though the selected concepts were those that can be learned reasonably well (mean ROC AUC for these concepts is over 85%). We suspect that the explanations are radically different due to differences in the probe dataset distribution. For instance, some concepts have very different base rates in the two datasets: dog has a base rate of 12.0% in Pascal but 0.5% in ADE20k; chair has a base rate of 16.7% in ADE20k but 13.5% in Pascal. We present more analyses in the supp. mat.
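The comparison above can be reproduced with a short script along the following lines; this is a simplified sketch (a linear probe on penultimate-layer features, followed by a cosine similarity between the two weight vectors), not the authors' exact TCAV implementation, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_cav(features, has_concept, C=1.0):
    """Learn a concept activation vector as the unit-normalized weight vector of
    a linear classifier separating images with vs. without the concept, using
    penultimate-layer features of the model being explained."""
    clf = LogisticRegression(penalty="l2", solver="liblinear", C=C, max_iter=1000)
    clf.fit(features, has_concept)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)

def cav_cosine_similarity(feats_ade, labels_ade, feats_pascal, labels_pascal):
    # Compare the vectors learned from the two probe datasets for one concept.
    v_ade = learn_cav(feats_ade, labels_ade)
    v_pascal = learn_cav(feats_pascal, labels_pascal)
    return float(np.dot(v_ade, v_pascal))  # both vectors are unit norm
```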
Concept | ADE20k AUC | Pascal AUC | Cosine similarity
---|---|---|---
ceiling | 96.6 | 93.0 | 0.267
box | 83.0 | 80.1 | 0.086
pole | 89.0 | 79.3 | 0.059
bag | 79.4 | 75.4 | 0.006
rock | 92.6 | 82.8 | -0.024
mean | 92.0 | 88.1 | 0.087

4 Concept learnability: Concepts used are less learnable than target classes
In Sec. 3, we investigated how the choice of the probe dataset influences the generated explanations. In this section, we investigate the individual concepts used within explanations. An implicit assumption made in concept-based interpretability methods is that the concepts used in explanations are easier to learn than the target classes being explained. For instance, when explaining the class bedroom with the concept bed, we are assuming (and hoping) that the model first learns the concept bed, then uses this concept and others to predict the class bedroom. However, if bed is harder to learn than bedroom, this would not be the case. This assumption also aligns with works that argue that “simpler” concepts (e.g., edges and textures) are learned in early layers and “complex” concepts (e.g., parts and objects) are learned in later layers [5, 16].
We thus investigate the learnability of concepts used by different explanation methods. Somewhat surprisingly, we find that the concepts used are frequently harder to learn than the target classes, raising concerns about the correctness of concept-based explanations.


Setup. To compare the learnability of concepts vs. classes, we learn models for the concepts (the learnability of the classes is already known from the model-being-explained). Concretely, we extract features for the probe dataset using an ImageNet [43]-pretrained ResNet18 [20] model and train a linear model using sklearn’s [36] LogisticRegression to predict concepts from the ResNet18 features. (We also tried using features from a Places365-pretrained model and did not find a significant difference.) We do so for the two most commonly used probe datasets: Broden [5] and CUB-200-2011 [48]. Broden concepts are frequently used to explain Places365 classes (as done in NetDissect [5], Net2Vec [16], IBD [56], and ELUDE [39]), while CUB concepts are used to explain the CUB target classes (as done in Concept Bottleneck [29] and ELUDE [39]).
Evaluation. We evaluate learnability with normalized average precision (AP) [22]. We choose normalized AP for two reasons: first, to avoid having to set a threshold, and second, to fairly compare concepts and scenes that have very different base rates. In our experiments, we set the base rate to be that of the classes: 1/365 when comparing Broden concepts vs. Places365 classes and 1/200 when comparing CUB concepts vs. CUB classes.
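As a rough illustration, the snippet below computes an AP in which precision is recomputed as if positives occurred at a fixed target base rate; it is a minimal sketch in the spirit of the normalization of Hoiem et al. [22], not necessarily the exact implementation used in our experiments.

```python
import numpy as np
from sklearn.metrics import roc_curve

def normalized_average_precision(y_true, y_score, target_base_rate):
    """Average precision with precision recomputed as if positives occurred at
    `target_base_rate` (e.g., 1/365 for Places365 classes), so that concepts
    and classes with very different base rates can be compared fairly."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    b = target_base_rate
    # Precision at each threshold if the positive class had base rate b.
    precision_norm = (b * tpr) / (b * tpr + (1.0 - b) * fpr + 1e-12)
    # Integrate normalized precision over recall (tpr) to obtain an AP.
    return float(np.trapz(precision_norm, tpr))
```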
Results. In both settings, we find that the concepts are much harder to learn than the target classes. The median normalized AP for Broden concepts is 7.6%, much lower than 37.5% of Places365 classes. Similarly, the median normalized AP for CUB concepts is 2.3%, much lower than 65.9% of CUB classes. Histograms of normalized APs are shown in Fig. 3 (Broden/Places365) and the supp. mat. (CUB).
However, is it possible that each class is explained by concepts that are more learnable than the class? Our investigation with IBD [56] explanations suggests this is not the case. IBD greedily learns a basis of concept vectors, as well as a residual vector, and decomposes each model prediction into a linear combination of the basis and residual vectors; we use the code provided by the authors (https://github.com/CSAILVision/IBD). For 10 randomly chosen scene classes, we compare the normalized AP of the scene class vs. the 5 concepts with the highest coefficients (i.e., the 5 concepts that are the most important for explaining the prediction). See Tab. 3 for the results. We find that all 10 scene classes are explained with at least one concept that is harder to learn than the class. For some classes (e.g., bathroom, kitchen), all concepts used in the explanation are harder to learn than the class.
Scene class (norm. AP) | Concept 1 (norm. AP) | Concept 2 (norm. AP) | Concept 3 (norm. AP) | Concept 4 (norm. AP) | Concept 5 (norm. AP)
---|---|---|---|---|---
arena/perform (38.8) | tennis court (74.0) | grandstand (44.4) | ice rink (40.7) | valley (19.0) | stage (11.9)
art-gallery (27.4) | binder (42.6) | drawing (10.8) | painting (10.5) | frame (2.5) | sculpture (0.7)
bathroom (43.3) | toilet (39.9) | shower (18.8) | countertop (12.6) | bathtub (11.1) | screen door (9.6)
kasbah (50.2) | ruins (64.3) | desert (17.3) | arch (16.2) | dirt track (8.9) | bottle rack (4.2)
kitchen (33.9) | work surface (24.8) | stove (18.2) | cabinet (10.3) | refrigerator (8.8) | doorframe (2.8)
lock-chamber (36.5) | water wheel (47.4) | dam (43.7) | boat (16.1) | embankment (4.8) | footbridge (4.1)
pasture (19.2) | cow (63.7) | leaf (21.1) | valley (19.0) | field (6.8) | slope (4.1)
physics-lab (17.1) | computer (25.4) | machine (4.5) | monitor-device (3.3) | bicycle (1.7) | sewing-machine (1.5)
store/indoor (20.4) | shanties (72.5) | patty (18.5) | bookcase (13.5) | shelf (4.2) | cup (1.3)
water-park (38.3) | roller coaster (73.0) | hot tub (59.1) | playground (44.9) | ride (38.0) | swimming pool (36.7)
Our experiments show that a significant fraction of the concepts used by existing concept-based interpretability methods are harder to learn than the target classes, issuing a wake-up call to the field. In the following section, we show that these concepts can also be hard for people to identify.
5 Human capability: Human studies reveal an upper bound of 32 concepts
Existing concept-based explanations use a large number of concepts: NetDissect [5] and Net2Vec [16] use all 1197 concepts labelled within the Broden [5] dataset; IBD [56] uses Broden object and art concepts with at least 10 examples (660 concepts); and Concept Bottleneck [29] uses all concepts that are predominantly present for at least 10 classes from CUB [48] (112 concepts). However, can people actually reason with these many concepts?
In this section, we study this important yet overlooked aspect of concept-based explanations: explanation complexity and how it relates to human capability and preference. Specifically, we investigate: (1) How well do people recognize concepts in images? (2) How do the (concept recognition) task performance and time change as the number of concepts varies? (3) How well do people predict the model output for a new image using explanations? (4) How do people trade off simplicity and correctness of concept-based explanations? To answer these questions, we design and conduct a human study. We describe the study design in Sec. 5.1 and report findings in Sec. 5.2.
5.1 Human study design
We build on the study design and user interface (UI) of HIVE [26], and design a two-part study to understand how understandable and useful concept-based explanations are to human users with potentially limited knowledge about machine learning. To the best of our knowledge, we are the first to investigate such properties of concept-based explanations for computer vision models. (There are works examining the complexity of explanations for other types of models: for example, Lage et al. [30] investigate the complexity of explanations over decision sets, and Bolukbasi et al. [7] investigate this for concept-based explanations of language models.)
Part 1: Recognize concepts and predict the model output. First, we present participants with an image and a set of concepts and ask them to identify whether each concept is present or absent in the image. We also show explanations for 4 classes whose scores are calculated in real time based on the concepts selected. As a final question, we ask participants to select the class they think the model predicts for the given image. See Fig. 4 (left) for the study UI.
To ensure that the task is doable and is only affected by explanation complexity (number of concepts used) and not the complexity of the model and its original prediction task (e.g., 365-way scene classification), we generate explanations for only 4 classes and ask participants to identify which of the 4 classes corresponds to the model’s prediction. We only show images where the model output matches the explanation output (i.e., the model predicts the class with the highest explanation score, calculated with ground-truth concept labels), since our goal is to understand how people reason with concept-based explanations with varying complexity.
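The real-time scoring in the UI and the final model-output question both reduce to evaluating the linear explanations on the participant's concept selections; a minimal sketch with illustrative variable names is shown below.

```python
import numpy as np

def explanation_scores(selected, weights, bias):
    """selected: (n_concepts,) binary vector of concepts a participant marked present.
    weights: (n_classes, n_concepts) coefficients of the linear explanations for
    the 4 candidate classes; bias: (n_classes,) intercepts."""
    return weights @ selected + bias

def predicted_class(selected, weights, bias, class_names):
    # The participant's implied guess of the model output: the candidate class
    # whose explanation score is highest under their concept selections.
    return class_names[int(np.argmax(explanation_scores(selected, weights, bias)))]
```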

Part 2: Choose the ideal tradeoff between simplicity and correctness. Next, we ask participants to reason about two properties of concept-based explanations: simplicity, i.e., the number of concepts used in a given set of explanations, and correctness, i.e., the percentage of model predictions correctly explained by explanations, which is the percentage of times the model output class has the highest explanation score. See Fig. 4 (right) for the study UI. We convey the notion of a simplicity-correctness tradeoff through bar plots that show the correctness of explanations of varying simplicity/complexity (4, 8, 16, 32, 64 concepts). We then ask participants to choose the explanation they prefer the most and provide a short justification for their choice.
Full study design and experimental details. In summary, our study consists of the following steps. For each participant, we introduce the study, receive informed consent for participation in the study, and collect information about their demographic (optional) and machine learning experience. We then introduce concept-based explanations in simple terms, and show a preview of the concept recognition and model output prediction task in Part 1. The participant then completes the task for 10 images. In Part 2, the participant indicates their preference for explanation complexity, given simplicity and correctness information. There are no foreseeable risks in participation in the study, and our study design was approved by our institution’s IRB.
Using this study design, we investigate explanations that take the form of a linear combination of concepts (e.g., Baseline, IBD [56], Concept Bottleneck [29]). Explanations are generated using the Baseline method, which is a logistic regression model trained to predict the model’s output using concepts (see Sec. 3 for details). Note that we are evaluating the form of explanation (linear combination of concepts) rather than a specific explanation method. The choice of the method does not impact the task.
Specifically, we compare four types of explanations: concept-based explanations that use (1) 8 concepts, (2) 16 concepts, (3) 32 concepts, and (4) example-based explanations that consist of 10 example images for which the model predicts a certain class. We include (4) as a method that doesn’t use concepts. In Jeyakumar et al. [24], this type of explanation is shown to be preferred over saliency-type explanations for image classification; here, we compare this to concept-based explanations.
For a fair comparison, all four are evaluated on the same set of images. In short, we conduct a between-group study with 125 participants recruited through Amazon Mechanical Turk. Participants were compensated based on the state-level minimum wage of $12/hr. In total, $800 was spent on running human studies. See supp. mat. for more details.
5.2 Key findings from the human studies
When presented with more concepts, participants spend more time but are worse at recognizing concepts. The median time participants spend on each image is 17.4 sec. for 8-concept, 27.5 sec. for 16-concept, and 46.2 sec. for 32-concept explanations. This is expected, since participants are asked to make a judgment for each and every concept. When given example-based explanations with no such task, participants spend only 11.6 seconds on each image. Interestingly, the concept recognition performance, reported in terms of mean recall (i.e., the percentage of concepts in the image that are recognized) and standard deviation, decreases from 71.7% ± 27.7% (8 concepts) to 61.0% ± 28.5% (16 concepts) to 56.8% ± 24.9% (32 concepts). While these numbers are far from perfect recall (100%), participants are better at judging whether concepts are present when shown fewer concepts.
Concept-based explanations offer little to no advantage in model output prediction over example-based explanations. Indeed, we see that the participants’ errors in concept recognition sometimes result in an incorrect class having the highest explanation score. When predicting the model output as the class with the highest explanation score, calculated based on the participants’ concept selections, the mean accuracy and standard deviation of model output prediction are 64.8% ± 23.9% (8 concepts), 63.2% ± 26.9% (16 concepts), and 63.6% ± 22.2% (32 concepts). These are barely higher than the 60.0% ± 30.2% of example-based explanations, which are simpler and require less time to complete the task.
The majority of participants prefer explanations with 8, 16, or 32 concepts. When given options of explanations that use 4, 8, 16, 32, or 64 concepts, 82% of participants prefer explanations with 8, 16, or 32 concepts (28%, 33%, 21% respectively). Only 6% prefer those with 64 concepts, suggesting that existing explanations that use hundreds or even thousands of concepts do not cater to human preferences. In the written responses, many favored having fewer concepts (e.g., “the lesser, the better”) and expressed concerns against having too many (e.g., “I think 32 is a lot, but 16 is an adequate enough number that it could still predict well…”). In making the tradeoff, some valued correctness above all else (e.g., “Out of all the options, 32 is the most correct”), while others reasoned about marginal benefits (e.g., “I would prefer explanations that use 16 concepts because it seems that the difference in percentage of correctness is much closer and less, than other levels of concepts”). Overall, we find that participants actively reason about both simplicity and correctness of explanations.
6 Discussion
Our analyses yield immediate suggestions for improving the quality and usability of concept-based explanations. First, we suggest choosing a probe dataset whose distribution is similar to that of the dataset the model was trained on. Second, we suggest only using concepts that are more learnable than the target classes. Third, we suggest limiting the number of concepts used within an explanation to under 32, so that explanations are not overwhelming to people.
The final suggestion is easy to implement. However, the first two are easier said than done, since the number of available probe datasets (i.e., large-scale datasets with concept labels) is minimal, forcing researchers to use the Broden dataset [5] or the CUB dataset [48]. Hence, we argue that creating diverse and high-quality probe datasets is of utmost importance for research on concept-based explanations.
Another concern is that these methods do highlight hard-to-learn concepts when given access to them, suggesting that they sometimes capture correlations rather than causal relationships. Methods by Goyal et al. [19], which output patches within the image that need to be changed for the model’s prediction to change, or Fong et al. [15], which find regions within the image that maximally contribute to the model’s prediction, are more in line with capturing causal relationships. However, these only produce local explanations, i.e., explanations of a single model prediction, and not class-level global explanations. One approach to capturing causal relationships is to generate counterfactual images with or without certain concepts using generative models [40] and observe changes in model predictions.
7 Limitations and future work
Our findings come with a few caveats. First, due to the lack of available probe datasets, we tested each concept-based interpretability method in a single setting. That is, we tested NetDissect [5], TCAV [25], and IBD [56] on a scene classifier trained on the Places365 dataset [55], and Concept Bottleneck [29] on the CUB dataset [48]. We plan to expand our analyses as more probe datasets become available. Second, all participants in our human studies were recruited from Amazon Mechanical Turk. This means that our participants represent a population with limited ML background: the self-reported ML experience was 2.5 ± 1.0 (on a scale of 1 to 5), which is between “2: have heard about…” and “3: know the basics…” We believe Part 1 results of our human studies (described in Sec. 5.1) will not vary with participants’ ML expertise or role in the ML pipeline, as we are only asking participants to identify concepts in images. However, Part 2 results may vary (e.g., developers debugging an ML model may be more willing to trade off explanation simplicity for correctness than lay end-users). Investigating differences in perceptions and uses of concept-based explanations across user groups is an important direction for future research.
8 Conclusion
In this work, we examined implicit assumptions made in concept-based interpretability methods along three axes: the choice of the probe datasets, the learnability of the concepts used, and the complexity of explanations. We found that the choice of the probe dataset profoundly influences the generated explanations, implying that these explanations can only be used for images from the probe dataset distribution. We also found that a significant fraction of the concepts used within explanations are harder for a model to learn than the target classes they aim to explain. Finally, we found that people struggle to identify concepts in images when given too many concepts, and that explanations with 32 or fewer concepts are preferred. We hope our proposed analyses and findings lead to more careful use and development of concept-based explanations.
Acknowledgements. We foremost thank our participants for taking the time to participate in our study. We also thank the authors of [5, 56, 29, 26] for open-sourcing their code and/or models. Finally, we thank the anonymous reviewers and the Princeton Visual AI Lab members (especially Nicole Meister) who provided helpful feedback on our work. This material is based upon work partially supported by the National Science Foundation under Grants No. 1763642, 2145198 and 2112562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We also acknowledge support from the Princeton SEAS Howard B. Wentz, Jr. Junior Faculty Award to OR, Princeton SEAS Project X Fund to RF and OR, Open Philanthropy Grant to RF, and NSF Graduate Research Fellowship to SK.
References
- [1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In NeurIPS, 2018.
- [2] Julius Adebayo, Michael Muelly, Harold Abelson, and Been Kim. Post hoc explanations may be ineffective for detecting unknown spurious correlation. In ICLR, 2022.
- [3] Julius Adebayo, Michael Muelly, Ilaria Liccardi, and Been Kim. Debugging tests for model explanations. In NeurIPS, 2020.
- [4] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 2020.
- [5] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, 2017.
- [6] Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM Trans. Graph., 2014.
- [7] Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. An interpretability illusion for BERT, 2021.
- [8] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. In ICLR, 2018.
- [9] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, 2018.
- [10] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: Deep learning for interpretable image recognition. In NeurIPS, 2019.
- [11] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.
- [12] Anamaria Crisan, Margaret Drouhard, Jesse Vig, and Nazneen Rajani. Interactive model cards: A human-centered approach to model documentation. In FAccT, 2022.
- [13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, 2010.
- [14] Ruth Fong. Understanding convolutional neural networks. PhD thesis, University of Oxford, 2020.
- [15] Ruth Fong, Mandela Patrick, and Andrea Vedaldi. Understanding deep networks via extremal perturbations and smooth masks. In ICCV, 2019.
- [16] Ruth Fong and Andrea Vedaldi. Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In CVPR, 2018.
- [17] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021.
- [18] Leilani H. Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In DSAA, 2018.
- [19] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In ICML, 2019.
- [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [21] Adrian Hoffmann, Claudio Fanconi, Rahul Rade, and Jonas Kohler. This looks like that… does it? Shortcomings of latent space prototype interpretability in deep networks. In ICML Workshops, 2021.
- [22] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In ECCV, 2012.
- [23] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. In NeurIPS, 2019.
- [24] Jeya Vikranth Jeyakumar, Joseph Noor, Yu-Hsi Cheng, Luis Garcia, and Mani Srivastava. How can i explain this to you? an empirical study of deep neural network explanation methods. In NeurIPS, 2020.
- [25] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML, 2018.
- [26] Sunnie S. Y. Kim, Nicole Meister, Vikram V. Ramaswamy, Ruth Fong, and Olga Russakovsky. HIVE: Evaluating the human interpretability of visual explanations. In ECCV, 2022.
- [27] Sunnie S. Y. Kim, Elizabeth Anne Watkins, Olga Russakovsky, Ruth Fong, and Andrés Monroy-Hernández. “Help me help the AI”: Understanding how explainability can support human-AI interaction. In CHI, 2023.
- [28] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un) reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer Cham, 2019.
- [29] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In ICML, 2020.
- [30] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Samuel J Gershman, and Finale Doshi-Velez. Human evaluation of models built for interpretability. In HCOMP, 2019.
- [31] Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In ECCV, 2016.
- [32] Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended? In ICLR Workshops, 2021.
- [33] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In FAccT, 2019.
- [34] Meike Nauta, Ron van Bree, and Christin Seifert. Neural prototype trees for interpretable fine-grained image recognition. In CVPR, 2021.
- [35] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020.
- [36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 2011.
- [37] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models. In BMVC, 2018.
- [38] Bryan A. Plummer, Mariya I. Vasileva, Vitali Petsiuk, Kate Saenko, and David Forsyth. Why do these match? explaining the behavior of image similarity models. In ECCV, 2020.
- [39] Vikram V. Ramaswamy, Sunnie S. Y. Kim, Nicole Meister, Ruth Fong, and Olga Russakovsky. ELUDE: Generating interpretable explanations via a decomposition into labelled and unlabelled features. arXiv:2206.07690, 2022.
- [40] Vikram V. Ramaswamy, Sunnie S. Y. Kim, and Olga Russakovsky. Fair attribute classification through latent space de-biasing. In CVPR, 2021.
- [41] Sylvestre-Alvise Rebuffi, Ruth Fong, Xu Ji, and Andrea Vedaldi. There and back again: Revisiting backpropagation saliency methods. In CVPR, 2020.
- [42] Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges. In Statistics Surveys, 2021.
- [43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. In IJCV, 2015.
- [44] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller, editors. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Computer Science. Springer, 2019.
- [45] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
- [46] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshops, 2014.
- [47] Kacper Sokol and Peter Flach. Explainability fact sheets: A framework for systematic assessment of explainable approaches. In FAccT, 2020.
- [48] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- [49] Mengjiao Yang and Been Kim. Benchmarking attribution methods with relative feature importance. arXiv:1907.09701, 2019.
- [50] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In NeurIPS, 2020.
- [51] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
- [52] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. IJCV, 2018.
- [53] Quanshi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: A survey. Frontiers of Information Technology & Electronic Engineering, 2018.
- [54] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
- [55] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.
- [56] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable basis decomposition for visual explanation. In ECCV, 2018.
- [57] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20k dataset. In CVPR, 2017.
- [58] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20k dataset. IJCV, 2019.
Appendix
In this supplementary document, we provide additional details on some sections of the main paper.
- Section A: We provide more information regarding our experimental setup over all experiments.
- Section B: We provide additional results from our experiments regarding probe dataset choice from Section 3 of the main paper.
- Section C: We provide additional results from our experiments regarding concept choice from Section 4 of the main paper.
- Section D: We supplement Section 5 of the main paper and provide more information about our human studies.
- Section E: We supplement Section 5 of the main paper and show snapshots of our full user interface.
Appendix A Experimental details
Here we provide additional experimental details regarding all our setups, as well as the computational power we needed.
TCAV. Using the features extracted from the penultimate layer of the ResNet18-based [20] model trained on the Places365 dataset [55], we use scikit-learn’s [36] LogisticRegression models to predict the ground-truth attributes in each case. We use the liblinear solver with an l2 penalty, and pick the regularization weight as a hyperparameter based on performance (ROC AUC) on a validation set.
Baseline. Given the ground-truth labelled concepts for an image, this explanation attempts to predict the blackbox model’s output on the image. We use scikit-learn’s [36] LogisticRegression model with a liblinear solver and an l1 penalty, to prioritize learning simpler explanations. For the experiment reported in Section 3 of the main paper, we pick the regularization weight as a hyperparameter, choosing the weight with the best performance on a validation set. When generating explanations of different complexities for our human studies, we vary the regularization parameter, picking explanations that use a total of 4, 8, 16, 32, or 64 concepts.
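A sketch of this regularization sweep is shown below; the grid of regularization strengths and the tolerance for "nonzero" weights are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def explanation_with_n_concepts(concepts, predictions, target_n,
                                Cs=np.logspace(-3, 1, 30)):
    """Sweep the l1 regularization strength and return the fitted explanation
    whose number of concepts with nonzero weight (in any class) is closest to
    `target_n` (e.g., 4, 8, 16, 32, or 64)."""
    best = None
    for C in Cs:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
        clf.fit(concepts, predictions)
        n_used = int((np.abs(clf.coef_) > 1e-6).any(axis=0).sum())
        if best is None or abs(n_used - target_n) < abs(best[0] - target_n):
            best = (n_used, clf)
    return best[1]
```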
Learning concepts. We computed features for all images from ADE20k [57, 58] using the penultimate layer of a ResNet18 [20] model trained on ImageNet [43]. We then learned a linear model for all concepts that had over 10 positive samples within the dataset, using the LogisticRegression model from scikit-learn [36]. As with the other models, we use a liblinear solver with an l2 penalty, choosing the regularization weight based on performance (ROC AUC) on a validation set. As mentioned, we report the normalized AP [22] to be able to compare across concepts and target classes with varying base rates.
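The pipeline amounts to a frozen feature extractor followed by per-concept linear probes; a self-contained sketch is given below (the data loader, the hyperparameter grid, and a recent torchvision version for the weights API are assumptions).

```python
import numpy as np
import torch
import torchvision
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# ImageNet-pretrained ResNet18 with its classification head removed,
# so the forward pass returns penultimate-layer (512-d) features.
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    # `loader` is assumed to yield (image_batch, _) pairs of preprocessed images.
    return np.concatenate([backbone(images).cpu().numpy() for images, _ in loader])

def fit_concept_probe(X_tr, y_tr, X_val, y_val, Cs=(0.01, 0.1, 1.0, 10.0)):
    # Linear probe for one concept; the l2 strength is chosen by validation ROC AUC.
    best = None
    for C in Cs:
        clf = LogisticRegression(penalty="l2", solver="liblinear", C=C, max_iter=1000)
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_val, clf.decision_function(X_val))
        if best is None or auc > best[0]:
            best = (auc, clf)
    return best[1]
```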
Appendix B Probe dataset choice: more details
In Section 3 of the main paper, we showed that the choice of probe dataset can have a significant impact on the output of concept-based explanations. In this section, we provide more details from the experiments supporting this claim.
B.1 Varying the probe dataset
Here we provide the full results from Section 3 of the main paper, where we compute concept-based explanations using two different methods (NetDissect [5] and TCAV [25]) with either ADE20k or Pascal as the probe dataset.
NetDissect. Table 4 contains the labels generated for all neurons that are strongly activated when using either ADE20k [57, 58] or Pascal [13] as the probe dataset. A majority of these neurons (69/123) are assigned very different concepts by the two probe datasets.
Neuron | ADE20k label | ADE20k score | Pascal label | Pascal score | Neuron | ADE20k label | ADE20k score | Pascal label | Pascal score |
---|---|---|---|---|---|---|---|---|---|
1 | counter | 0.059 | bottle | 0.049 | 3 | sea | 0.067 | water | 0.065 |
4 | seat | 0.064 | tvmonitor | 0.074 | 8 | vineyard | 0.048 | plant | 0.043 |
9 | plant | 0.082 | pottedplant | 0.194 | 22 | bookcase | 0.07 | bus | 0.048 |
30 | house | 0.094 | building | 0.043 | 37 | boat | 0.043 | boat | 0.213 |
43 | bed | 0.151 | bed | 0.075 | 47 | pool table | 0.135 | airplane | 0.079 |
60 | plane | 0.052 | airplane | 0.168 | 63 | field | 0.053 | muzzle | 0.042 |
69 | person | 0.047 | hair | 0.086 | 73 | water | 0.041 | bird | 0.080 |
79 | plant | 0.064 | pottedplant | 0.064 | 90 | mountain | 0.071 | mountain | 0.066 |
102 | bathtub | 0.040 | cat | 0.055 | 104 | cradle | 0.081 | bus | 0.112 |
105 | sea | 0.106 | water | 0.058 | 106 | rock | 0.048 | rock | 0.06 |
110 | painting | 0.119 | painting | 0.06 | 112 | field | 0.05 | bus | 0.051 |
113 | table | 0.116 | table | 0.066 | 115 | plane | 0.046 | airplane | 0.147 |
120 | sidewalk | 0.042 | track | 0.075 | 125 | table | 0.049 | wineglass | 0.047 |
126 | stove | 0.064 | bottle | 0.163 | 127 | book | 0.104 | book | 0.096 |
131 | signboard | 0.043 | body | 0.069 | 134 | bathtub | 0.088 | boat | 0.059 |
141 | skyscraper | 0.065 | cage | 0.068 | 155 | mountain | 0.091 | train | 0.058 |
158 | book | 0.042 | book | 0.052 | 165 | sea | 0.051 | water | 0.051 |
168 | railroad train | 0.055 | train | 0.193 | 172 | car | 0.055 | bus | 0.101 |
173 | car | 0.052 | bus | 0.099 | 181 | plant | 0.068 | pottedplant | 0.14 |
183 | person | 0.041 | horse | 0.187 | 184 | cradle | 0.046 | cat | 0.042 |
185 | chair | 0.077 | horse | 0.153 | 186 | person | 0.051 | bird | 0.094 |
191 | swimming pool | 0.044 | pottedplant | 0.072 | 198 | pool table | 0.064 | ceiling | 0.066 |
208 | shelf | 0.047 | bus | 0.062 | 211 | computer | 0.076 | tvmonitor | 0.089 |
217 | toilet | 0.049 | hair | 0.055 | 218 | case | 0.044 | track | 0.165 |
219 | plane | 0.065 | airplane | 0.189 | 220 | road | 0.066 | road | 0.066 |
222 | grass | 0.105 | grass | 0.046 | 223 | house | 0.069 | airplane | 0.055 |
231 | grandstand | 0.097 | screen | 0.047 | 234 | bridge | 0.05 | train | 0.042 |
239 | pool table | 0.069 | horse | 0.171 | 245 | water | 0.063 | water | 0.042 |
247 | plane | 0.079 | airplane | 0.177 | 248 | bed | 0.127 | tvmonitor | 0.063 |
251 | sofa | 0.073 | pottedplant | 0.053 | 257 | tent | 0.042 | bus | 0.279 |
260 | flower | 0.082 | food | 0.069 | 267 | apparel | 0.042 | car | 0.045 |
276 | earth | 0.041 | rock | 0.047 | 278 | field | 0.06 | sheep | 0.044 |
280 | mountain | 0.045 | mountain | 0.056 | 287 | plant | 0.078 | pottedplant | 0.07 |
289 | pool table | 0.049 | food | 0.059 | 290 | mountain | 0.085 | mountain | 0.097 |
293 | shelf | 0.074 | bottle | 0.105 | 298 | path | 0.047 | motorbike | 0.068 |
305 | waterfall | 0.057 | mountain | 0.047 | 309 | washer | 0.109 | bus | 0.065 |
318 | computer | 0.079 | tvmonitor | 0.251 | 322 | ball | 0.054 | sheep | 0.044 |
324 | mountain | 0.071 | motorbike | 0.048 | 325 | person | 0.04 | head | 0.059 |
327 | waterfall | 0.055 | bird | 0.087 | 337 | water | 0.072 | boat | 0.109 |
341 | sea | 0.153 | boat | 0.076 | 344 | person | 0.052 | person | 0.048 |
345 | autobus | 0.042 | bus | 0.142 | 347 | palm | 0.051 | bicycle | 0.083 |
348 | mountain | 0.058 | mountain | 0.125 | 354 | cradle | 0.042 | chair | 0.053 |
357 | rock | 0.058 | sheep | 0.061 | 360 | pool table | 0.048 | bird | 0.041 |
364 | field | 0.058 | plant | 0.041 | 372 | work surface | 0.045 | cabinet | 0.049 |
379 | bridge | 0.092 | bus | 0.046 | 383 | bed | 0.069 | curtain | 0.079 |
384 | washer | 0.043 | bicycle | 0.201 | 386 | autobus | 0.067 | bus | 0.200 |
387 | hovel | 0.04 | train | 0.085 | 389 | chair | 0.066 | chair | 0.051 |
398 | windowpane | 0.073 | windowpane | 0.07 | 400 | plant | 0.043 | pottedplant | 0.097 |
408 | toilet | 0.045 | bottle | 0.099 | 412 | bed | 0.079 | airplane | 0.086 |
413 | pool table | 0.09 | motorbike | 0.07 | 415 | seat | 0.044 | tvmonitor | 0.045 |
417 | sand | 0.06 | sand | 0.049 | 419 | bed | 0.061 | tvmonitor | 0.054 |
422 | seat | 0.089 | tvmonitor | 0.056 | 430 | bed | 0.078 | bedclothes | 0.042 |
434 | case | 0.047 | cup | 0.041 | 435 | runway | 0.072 | airplane | 0.189 |
438 | plane | 0.045 | airplane | 0.235 | 444 | sofa | 0.045 | plant | 0.09 |
445 | car | 0.201 | car | 0.093 | 446 | pool table | 0.193 | tvmonitor | 0.086 |
454 | car | 0.218 | car | 0.156 | 463 | snow | 0.059 | snow | 0.118 |
465 | crosswalk | 0.097 | road | 0.047 | 475 | cradle | 0.061 | train | 0.132 |
477 | desk | 0.104 | tvmonitor | 0.085 | 480 | sofa | 0.086 | sofa | 0.081 |
483 | swivel chair | 0.052 | horse | 0.041 | 484 | water | 0.15 | water | 0.102 |
485 | sofa | 0.056 | airplane | 0.045 | 500 | sofa | 0.156 | sofa | 0.11 |
502 | washer | 0.07 | train | 0.134 | 503 | bookcase | 0.109 | book | 0.075 |
509 | computer | 0.044 | tvmonitor | 0.074 |
As mentioned by Fong et al. [16] and Olah et al. [35], neurons in deep neural networks can be poly-semantic, i.e., some neurons can recognize multiple concepts. We check whether the results above are due to such neurons, and confirm that this is not the case: out of the 69 neurons, only 7 are highly activated (IoU > 0.04) by both concepts. Table 5 contains the IoU scores for both the ADE20k and Pascal labels for each neuron that is assigned very different concepts.
Neuron | ADE20k label | Pascal label | IoU, ADE20k label (probe: ADE20k) | IoU, Pascal label (probe: ADE20k) | IoU, ADE20k label (probe: Pascal) | IoU, Pascal label (probe: Pascal)
---|---|---|---|---|---|---
1 | counter | bottle | 0.059 | 0.006 | 0.006 | 0.049 |
4 | seat | tvmonitor | 0.064 | 0.0 | 0.0 | 0.074 |
22 | bookcase | bus | 0.07 | 0.0 | 0.0 | 0.048 |
47 | pool table | airplane | 0.135 | 0.0 | 0.002 | 0.079 |
63 | field | muzzle | 0.053 | 0.0 | 0.0 | 0.042 |
73 | water | bird | 0.041 | 0.002 | 0.052 | 0.08 |
102 | bathtub | cat | 0.04 | 0.0 | 0.0 | 0.055 |
104 | cradle | bus | 0.081 | 0.0 | 0.0 | 0.112 |
112 | field | bus | 0.05 | 0.0 | 0.0 | 0.051 |
120 | sidewalk | track | 0.042 | 0.001 | 0.023 | 0.075 |
125 | table | wineglass | 0.049 | 0.0 | 0.043 | 0.047 |
126 | stove | bottle | 0.064 | 0.029 | 0.005 | 0.163 |
131 | signboard | body | 0.043 | 0.0 | 0.06 | 0.069 |
134 | bathtub | boat | 0.088 | 0.001 | 0.005 | 0.059 |
141 | skyscraper | cage | 0.065 | 0.001 | 0.0 | 0.068 |
155 | mountain | train | 0.091 | 0.0 | 0.038 | 0.058 |
172 | car | bus | 0.055 | 0.0 | 0.015 | 0.101 |
173 | car | bus | 0.052 | 0.0 | 0.013 | 0.099 |
183 | person | horse | 0.041 | 0.016 | 0.003 | 0.187 |
184 | cradle | cat | 0.046 | 0.0 | 0.0 | 0.042 |
185 | chair | horse | 0.077 | 0.014 | 0.011 | 0.153 |
186 | person | bird | 0.051 | 0.001 | 0.017 | 0.094 |
191 | swimming pool | pottedplant | 0.044 | 0.0 | 0.0 | 0.072 |
198 | pool table | ceiling | 0.064 | 0.035 | 0.001 | 0.066 |
208 | shelf | bus | 0.047 | 0.0 | 0.0 | 0.062 |
217 | toilet | hair | 0.049 | 0.001 | 0.0 | 0.055 |
218 | case | track | 0.044 | 0.001 | 0.0 | 0.165 |
223 | house | airplane | 0.069 | 0.0 | 0.0 | 0.055 |
231 | grandstand | screen | 0.097 | 0.0 | 0.007 | 0.047 |
234 | bridge | train | 0.05 | 0.0 | 0.014 | 0.042 |
239 | pool table | horse | 0.069 | 0.011 | 0.0 | 0.171 |
248 | bed | tvmonitor | 0.127 | 0.0 | 0.027 | 0.063 |
251 | sofa | pottedplant | 0.073 | 0.0 | 0.033 | 0.053 |
257 | tent | bus | 0.042 | 0.0 | 0.005 | 0.279 |
260 | flower | food | 0.082 | 0.033 | 0.064 | 0.069 |
267 | apparel | car | 0.042 | 0.023 | 0.0 | 0.045 |
278 | field | sheep | 0.06 | 0.0 | 0.0 | 0.044 |
289 | pool table | food | 0.049 | 0.024 | 0.0 | 0.059 |
293 | shelf | bottle | 0.074 | 0.025 | 0.0 | 0.105 |
298 | path | motorbike | 0.047 | 0.0 | 0.0 | 0.068 |
305 | waterfall | mountain | 0.057 | 0.049 | 0.0 | 0.047 |
309 | washer | bus | 0.109 | 0.0 | 0.013 | 0.065 |
322 | ball | sheep | 0.054 | 0.0 | 0.005 | 0.044 |
324 | mountain | motorbike | 0.071 | 0.0 | 0.015 | 0.048 |
327 | waterfall | bird | 0.055 | 0.001 | 0.0 | 0.087 |
337 | water | boat | 0.072 | 0.031 | 0.053 | 0.109 |
341 | sea | boat | 0.153 | 0.014 | 0.0 | 0.076 |
347 | palm | bicycle | 0.051 | 0.001 | 0.0 | 0.083 |
354 | cradle | chair | 0.042 | 0.03 | 0.0 | 0.053 |
357 | rock | sheep | 0.058 | 0.0 | 0.006 | 0.061 |
360 | pool table | bird | 0.048 | 0.0 | 0.0 | 0.041 |
379 | bridge | bus | 0.092 | 0.0 | 0.03 | 0.046 |
383 | bed | curtain | 0.069 | 0.064 | 0.01 | 0.079 |
384 | washer | bicycle | 0.043 | 0.018 | 0.0 | 0.201 |
387 | hovel | train | 0.04 | 0.0 | 0.0 | 0.085 |
408 | toilet | bottle | 0.045 | 0.002 | 0.0 | 0.099 |
412 | bed | airplane | 0.079 | 0.0 | 0.008 | 0.086 |
413 | pool table | motorbike | 0.09 | 0.0 | 0.003 | 0.07 |
415 | seat | tvmonitor | 0.044 | 0.0 | 0.0 | 0.045 |
419 | bed | tvmonitor | 0.061 | 0.0 | 0.016 | 0.054 |
422 | seat | tvmonitor | 0.089 | 0.0 | 0.0 | 0.056 |
434 | case | cup | 0.047 | 0.001 | 0.0 | 0.041 |
444 | sofa | plant | 0.045 | 0.009 | 0.014 | 0.09 |
446 | pool table | tvmonitor | 0.193 | 0.0 | 0.006 | 0.086 |
475 | cradle | train | 0.061 | 0.0 | 0.0 | 0.132 |
477 | desk | tvmonitor | 0.104 | 0.0 | 0.0 | 0.085 |
483 | swivel chair | horse | 0.052 | 0.006 | 0.0 | 0.041 |
485 | sofa | airplane | 0.056 | 0.0 | 0.024 | 0.045 |
502 | washer | train | 0.07 | 0.0 | 0.006 | 0.134 |
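For concreteness, the polysemanticity check described above can be implemented as a simple count over the rows of Table 5. The sketch below is illustrative rather than the released analysis script: it assumes each row carries the four IOU values shown in the table, and it flags a neuron as a candidate polysemantic neuron when both of its top concepts clear the IOU threshold under the same probe dataset.

```python
# Illustrative sketch of the polysemanticity check; the row format and the exact
# criterion (both concepts above threshold under the same probe dataset) are
# assumptions, not the released analysis code.

THRESHOLD = 0.04  # IOU cut-off used in the text

def count_polysemantic(rows, threshold=THRESHOLD):
    """rows: iterable of dicts holding the four IOU values per neuron (as in Table 5)."""
    count = 0
    for r in rows:
        both_on_ade = (r["iou_ade_label_probe_ade"] > threshold
                       and r["iou_pascal_label_probe_ade"] > threshold)
        both_on_pascal = (r["iou_ade_label_probe_pascal"] > threshold
                          and r["iou_pascal_label_probe_pascal"] > threshold)
        if both_on_ade or both_on_pascal:
            count += 1
    return count

# Two rows from Table 5 (neurons 1 and 383) as a usage example:
rows = [
    {"neuron": 1, "iou_ade_label_probe_ade": 0.059, "iou_pascal_label_probe_ade": 0.006,
     "iou_ade_label_probe_pascal": 0.006, "iou_pascal_label_probe_pascal": 0.049},
    {"neuron": 383, "iou_ade_label_probe_ade": 0.069, "iou_pascal_label_probe_ade": 0.064,
     "iou_ade_label_probe_pascal": 0.010, "iou_pascal_label_probe_pascal": 0.079},
]
print(count_polysemantic(rows))  # -> 1: only neuron 383 clears the bar for both concepts
```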
TCAV. Table 6 reports, for each of the 32 concepts with a base rate of at least 1%, the cosine similarity between the concept activation vectors learned with ADE20k and with Pascal as the probe dataset, together with each vector's concept-prediction AUC. Overall, the two vectors for a given concept are not very similar, even though each predicts its concept well; a minimal sketch of this comparison follows the table.
Concept | ADE20k AUC (%) | Pascal AUC (%) | Cos. sim. | Concept | ADE20k AUC (%) | Pascal AUC (%) | Cos. sim.
---|---|---|---|---|---|---|---|
bag | 79.4 | 75.4 | 0.006 | book | 90.4 | 84.6 | 0.138 |
bottle | 88.5 | 85.6 | 0.035 | box | 83.0 | 80.1 | 0.086 |
building | 97.4 | 90.0 | 0.161 | cabinet | 91.3 | 92.4 | 0.03 |
car | 96.9 | 90.3 | 0.147 | ceiling | 96.6 | 93.0 | 0.267 |
chair | 90.5 | 89.6 | 0.034 | curtain | 91.6 | 89.5 | 0.112 |
door | 81.5 | 87.8 | 0.134 | fence | 86.1 | 84.7 | 0.09 |
floor | 97.4 | 92.1 | 0.208 | grass | 95.1 | 91.7 | 0.04 |
light | 92.4 | 85.0 | 0.043 | mountain | 94.2 | 90.8 | 0.02 |
painting | 94.8 | 91.4 | 0.116 | person | 92.2 | 92.1 | 0.253 |
plate | 90.6 | 94.8 | -0.009 | pole | 89.0 | 79.3 | 0.059 |
pot | 79.3 | 85.2 | 0.142 | road | 98.0 | 91.8 | 0.041 |
rock | 92.6 | 82.8 | -0.024 | sidewalk | 97.0 | 92.5 | 0.071 |
signboard | 90.6 | 76.5 | 0.091 | sky | 98.9 | 79.8 | 0.104 |
sofa | 95.9 | 91.2 | -0.009 | table | 93.4 | 93.5 | 0.06 |
tree | 96.8 | 89.2 | 0.172 | wall | 95.9 | 91.3 | 0.027 |
water | 95.2 | 94.6 | 0.078 | windowpane | 91.5 | 90.1 | 0.078 |
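To make this comparison concrete, the following is a minimal sketch, under assumed inputs, of how a concept activation vector can be learned from each probe dataset and compared. Here `acts_*` and `y_*` are placeholders for pre-extracted intermediate-layer activations and binary concept labels; the vector is taken to be the unit-normalized weight vector of a linear classifier, and agreement is measured by cosine similarity.

```python
# Minimal sketch of the CAV comparison; `acts_*` / `y_*` are placeholders for
# pre-extracted activations and binary concept labels from each probe dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fit_cav(acts, labels):
    """Fit a linear classifier on activations; its unit-norm weight vector is the CAV."""
    clf = LogisticRegression(max_iter=1000).fit(acts, labels)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w), clf

def cav_agreement(acts_ade, y_ade, acts_pascal, y_pascal):
    cav_ade, clf_ade = fit_cav(acts_ade, y_ade)
    cav_pas, clf_pas = fit_cav(acts_pascal, y_pascal)
    # For brevity each classifier is scored on its own training split; held-out
    # splits would be used to obtain numbers comparable to Table 6.
    auc_ade = roc_auc_score(y_ade, clf_ade.decision_function(acts_ade))
    auc_pas = roc_auc_score(y_pascal, clf_pas.decision_function(acts_pascal))
    cos_sim = float(np.dot(cav_ade, cav_pas))
    return auc_ade, auc_pas, cos_sim
```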
B.2 Difference in probe dataset distribution
The first way we compare the two probe datasets is through the base rates of their concepts. As noted in Section 3 of the main paper, there are some sizable differences. Figure 5 shows the base rates for all concepts highlighted in Table 2 of the main paper. Concepts with very different base rates include wall (highlighted for bow-window when using ADE20k, but not Pascal), floor (highlighted for auto-showroom when using ADE20k, but not Pascal), dog (highlighted for corn-field when using Pascal, but not ADE20k), and pole (highlighted for hardware-store when using Pascal, but not ADE20k).
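For reference, a base rate here is simply the fraction of probe images whose labels include the concept; the short sketch below, with hypothetical label dictionaries, shows the computation.

```python
# Illustrative base-rate computation; `ade20k_labels` is a hypothetical dict
# mapping image id -> set of labeled concepts.

def base_rates(concept_labels, concepts):
    n = len(concept_labels)
    return {c: sum(c in labels for labels in concept_labels.values()) / n
            for c in concepts}

# Example with toy annotations for two images:
ade20k_labels = {"img1": {"wall", "floor"}, "img2": {"wall"}}
print(base_rates(ade20k_labels, ["wall", "floor", "dog"]))
# -> {'wall': 1.0, 'floor': 0.5, 'dog': 0.0}
```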

However, beyond base rates, the images themselves look very different across scenes. We visualize random images from different scenes in Figure 6 and find, for example, that images labelled bedroom in Pascal tend to show a person or an animal sleeping on a bed, with little of the rest of the bedroom visible, whereas ADE20k features images of full bedrooms. Similarly, images labelled tree-farm contain people in Pascal but not in ADE20k.

Upper bounds.
Finally, we present a simple way to compare the probe dataset with the training dataset, by noting that the probe dataset establishes a strict upper bound on the fraction of the model that can be explained. This is intuitively true because the set of labeled semantic concepts is finite, but the issue runs deeper. Consider the following experiment: we take the original black-box model, run it on a probe dataset to make predictions, and then train a new classifier to emulate those predictions. If this classifier is restricted to using only the labeled concepts, this is similar to a concept-based explanation. However, even if it is trained on the rich underlying visual features, it would not perform perfectly because of the differences between the original training dataset and the probe dataset.
Concretely, consider a black-box ResNet18-based [20] model trained on the Places365 [55] dataset. We reset and re-train its final linear classification layer on the Pascal [13] probe dataset to emulate the original scene predictions; this achieves only 63.7% accuracy. With ADE20k [57, 58] as the probe dataset, it achieves a slightly better 75.7% accuracy, suggesting that ADE20k is somewhat more similar to Places365 than Pascal is, but still far from fully capturing its distribution. This is not to suggest that the only way to generate concept-based explanations is to collect concept labels for the original training set (which may lead to overfitting); rather, it’s important to acknowledge this limitation and to quantify explanation methods against such upper bounds.
Similarly, we can ask how well the Concept Bottleneck model [29] can be explained using the CUB test dataset. In this case, since the training and test distributions are (hopefully!) similar, we would expect the upper bound to be reasonably high. We verify this with the same setup: resetting and retraining the final linear layer, with the model’s predictions as targets, achieves an accuracy of 89.3%.
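A minimal PyTorch sketch of these upper-bound experiments is given below, assuming pre-trained weights are available and using a hypothetical `probe_loader` over probe-dataset images; it re-trains only a fresh final linear layer to emulate the frozen model's own predictions. The reported accuracy is then the fraction of probe images on which the new head reproduces the original model's prediction.

```python
# Sketch of the upper-bound experiment (assumed setup, not the exact training code):
# freeze the backbone, use the black-box model's own predictions on probe images as
# targets, and train a new final linear layer on the frozen features.
import torch
import torch.nn as nn
import torchvision.models as models

def retrain_final_layer(probe_loader, num_classes=365, epochs=1):
    model = models.resnet18(num_classes=num_classes)  # assume Places365 weights are loaded here
    model.eval()
    backbone = nn.Sequential(*list(model.children())[:-1])  # everything except the final fc
    for p in backbone.parameters():
        p.requires_grad = False

    new_head = nn.Linear(model.fc.in_features, num_classes)
    optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, _ in probe_loader:  # concept labels are not used here
            with torch.no_grad():
                feats = backbone(images).flatten(1)    # frozen penultimate features
                targets = model(images).argmax(dim=1)  # emulate the original predictions
            loss = criterion(new_head(feats), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return new_head
```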
Appendix C Concepts used: more details
Here, we provide additional results on learning CUB concepts from Section 4.2 of the main paper. The CUB dataset was used by Concept Bottleneck [29], an interpretable-by-design model that learns the concepts as an intermediate layer within the network and then uses those concepts to predict the target class. Figure 7 contains histograms of the normalized AP scores for the 112 concepts from CUB [48], as well as the APs for the target bird classes learned by the model. As with the Broden [5] concepts, we learn a linear model on features from an ImageNet [43]-trained ResNet18 [20] model. On average, we see that the bird classes are much better learned than the concepts.
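As a rough sketch of this concept-learnability probe (under assumed, pre-extracted features rather than the released pipeline), each concept can be fit with a linear classifier on frozen ImageNet ResNet18 features and scored with average precision; the normalized AP in Figure 7 additionally adjusts raw AP for each concept's base rate.

```python
# Illustrative concept-learnability probe; `*_feats` and `*_concepts` are placeholders
# for frozen ImageNet ResNet18 features and binary concept annotation matrices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def concept_aps(train_feats, train_concepts, test_feats, test_concepts):
    """One linear classifier per concept column; returns the test AP of each."""
    aps = []
    for c in range(train_concepts.shape[1]):
        clf = LogisticRegression(max_iter=1000).fit(train_feats, train_concepts[:, c])
        scores = clf.decision_function(test_feats)
        aps.append(average_precision_score(test_concepts[:, c], scores))
    return np.array(aps)
```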


Appendix D Human study details
In Section 5 of the main paper, we discuss the human studies we ran to understand how well humans are able to reason about concept-based explanations as the number of concepts used within the explanation increases. In this section, we provide additional details.
To recap, we compare four types of explanations: (1) concept-based explanations that use 8 concepts, (2) concept-based explanations that use 16 concepts, (3) concept-based explanations that use 32 concepts, and (4) example-based explanations that consist of 10 example images for which the model predicts a certain class. (4) is a baseline that doesn’t use concepts.
For a fair comparison, all four types of explanations are evaluated on the same inputs. We generate five sets of inputs, where each set consists of 5 images from one scene group (commercial buildings, shops, markets, cities, and towns) and 5 images from another scene group (home or hotel). Recall that these are images where the model output matches the explanation output (i.e., the class with the highest explanation score calculated from ground-truth concept labels). Hence, if a participant correctly identifies all concepts that appear in a given image, they are guaranteed to arrive at the highest explanation score for the model output class.
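To illustrate what "explanation score" means here, the sketch below assumes, as in many concept-based explanations, that each class is summarized by a weighting over concepts, so an image's score for a class is the weighted sum over the concepts present in it; the weights and concept names are made up for illustration and are not the explanations used in our studies.

```python
# Hedged illustration of an explanation score: a per-class weighted sum of the
# concepts present in the image. `class_weights` and the concept list are hypothetical.
import numpy as np

def explanation_scores(concepts_present, class_weights):
    """
    concepts_present: binary vector, 1 if the concept is labeled/identified in the image.
    class_weights: (num_classes, num_concepts) matrix of per-class concept weights.
    The explanation's predicted class is the argmax of the returned scores.
    """
    return class_weights @ concepts_present

concepts_present = np.array([1, 0, 1, 0])         # e.g. [bed, car, lamp, road]
class_weights = np.array([[0.8, 0.0, 0.5, 0.0],   # a "bedroom"-like class
                          [0.0, 0.9, 0.0, 0.7]])  # a "street"-like class
print(explanation_scores(concepts_present, class_weights).argmax())  # -> 0
```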
To reduce variance with respect to the input, we had 5 participants for each combination of input set and explanation type. For the 32-concept explanations, each participant saw 5 images from only one of the two scene groups, because the study became too long and overwhelming with the full set of 10 images. For all other explanations, each participant saw the full set of 10 images. In total, we had 125 participants: 50 for the study with 32-concept explanations and 25 for each of the other three studies. Each participant saw only one type of explanation, as we conducted a between-group study.
More specifically, we recruited participants through Amazon Mechanical Turk who are US-based, have completed over 1,000 Human Intelligence Tasks, and have a prior approval rate of at least 98%. The demographic distribution was: man 59%, woman 41%; no race/ethnicity reported 82%, White 17%, Black/African American 1%, Asian 1%. The self-reported machine learning experience was 2.5 ± 1.0, between “2: have heard about…” and “3: know the basics…”. We did not collect any personally identifiable information. Participants were compensated based on the state-level minimum wage of $12/hr. In total, $800 was spent on running the human studies.
Appendix E User interface snapshots
In Section 5.1 of the main paper, we outlined our human study design. We note that much of our study design and UI is based on recent work by Kim et al. [26], who propose HIVE, a human evaluation framework for visual interpretability methods. Here we provide snapshots of our study UIs in the following order.
Study introduction. For each participant, we introduce the study, present a consent form, and obtain informed consent for participation. The consent form was approved by our institution’s Institutional Review Board and acknowledges that participation is voluntary, that refusal to participate involves no penalty or loss of benefits, etc. See Fig. 9.
Demographics and background. Following HIVE [26], we request optional demographic data regarding gender identity, race and ethnicity, as well as the participant’s experience with machine learning. We collect this information to help future researchers calibrate our results. See Fig. 9.
Method introduction. We introduce concept-based explanations in simple terms. This page is not shown for the study with example-based explanations. See Fig. 10.
Task preview. We present a practice example to help participants get familiar with the task. This page is not shown for the study with example-based explanations. See Fig. 11.
Part 1: Recognize concepts and guess the model output. After the preview, participants move on to the main task, where they are asked to recognize concepts in a given photo (for concept-based explanations) and predict the model output (for all explanations). We show the UI for each type of explanation we study in the figures that follow.
Part 2: Choose the ideal tradeoff between simplicity and correctness. Concept-based explanations can have varying levels of complexity/simplicity and correctness, so we investigate how participants reason about these two properties. To do so, we show examples of concept-based explanations that use different numbers of concepts, as well as bar plots with the correctness values of certain instantiations of concept-based explanations. We then ask participants to choose the explanation they prefer the most and provide a short written justification for their choice. See Fig. 16.
Feedback. At the end of the study, participants can optionally provide feedback. See Fig. 17.