
Overlooked Factors in Concept-based Explanations:
Dataset Choice, Concept Learnability, and Human Capability

Vikram V. Ramaswamy, Sunnie S. Y. Kim, Ruth Fong, Olga Russakovsky
Princeton University
{vr23, suhk, ruthfong, olgarus}@cs.princeton.edu
Abstract

Concept-based interpretability methods aim to explain a deep neural network model’s components and predictions using a pre-defined set of semantic concepts. These methods evaluate a trained model on a new, “probe” dataset and correlate the model’s outputs with concepts labeled in that dataset. Despite their popularity, they suffer from limitations that are not well understood or articulated in the literature. In this work, we identify and analyze three commonly overlooked factors in concept-based explanations. First, we find that the choice of the probe dataset has a profound impact on the generated explanations. Our analysis reveals that different probe datasets lead to very different explanations, suggesting that the generated explanations are not generalizable outside the probe dataset. Second, we find that concepts in the probe dataset are often harder to learn than the target classes they are used to explain, calling into question the correctness of the explanations. We argue that only easily learnable concepts should be used in concept-based explanations. Finally, while existing methods use hundreds or even thousands of concepts, our human studies reveal a much stricter upper bound of 32 concepts or fewer, beyond which the explanations are much less practically useful. We discuss the implications of our findings and provide suggestions for future development of concept-based interpretability methods. Code for our analysis and user interface can be found at https://github.com/princetonvisualai/OverlookedFactors

1 Introduction

Performance and opacity are often correlated in deep neural networks: the highly parameterized nature of these models that enables them to achieve high task accuracy also reduces their interpretability. However, in order to responsibly use and deploy them, especially in high-risk settings such as medical diagnoses, we need these models to be interpretable, i.e., understandable by people. With the growing recognition of the importance of interpretability, many methods have been proposed in recent years to explain some aspects of neural networks and render them more interpretable (see [4, 14, 18, 42, 44, 53] for surveys).

Figure 1: Concept-based interpretability methods explain model components and/or predictions using a pre-defined set of semantic concepts. In this example, a scene classification model’s prediction bedroom is explained as a complex linear combination of 37 visual concepts, with the final explanation score calculated based on the presence or absence of these concepts. The coefficients are learned by evaluating the model on a new, “probe” dataset, and correlating its predictions with visual concepts labeled in that dataset. However, concept-based explanations can (1) be noisy and heavily dependent on the probe dataset, (2) use concepts that are hard to learn (all concepts in red are harder to learn than the class bedroom) and (3) be overwhelming to people due to the complexity of the explanation.

In this work, we dive into concept-based interpretability methods for image classification models, which explain model components and/or predictions using a pre-defined set of semantic concepts [5, 16, 25, 29, 56]. Given access to a trained model and a set of images labelled with semantic concepts (i.e., a “probe” dataset), these methods produce explanations with the provided concepts. See Fig. 1 for an example explanation.

Concept-based methods are a particularly promising approach for bridging the interpretability gap between complex models and human understanding, as they explain model components and predictions with human-interpretable units, i.e., semantic concepts. Recent work finds that people prefer concept-based explanations over other forms (e.g., heatmap and example-based) because they resemble human reasoning and explanations [27]. Further, concept-based methods uniquely provide a global, high-level understanding of a model, e.g., how it predicts a certain class [56, 39] and what the model (or some part of it) has learned [25, 5, 16]. These insights are difficult to gain from local explanation methods that only provide an explanation for a single model prediction, such as saliency maps that highlight relevant regions within an image.

However, existing research on concept-based interpretability methods focuses heavily on new method development, ignoring important factors such as the probe dataset used to generate explanations or the concepts composing the explanations. Outside the scope of concept-based methods, several recent works study the effect of different factors on explanations. These works, however, are either limited to saliency maps [1, 28, 31, 41] or make a general call for transparency, e.g., to include more information when releasing an interpretability method [47].

In this work, we conduct an in-depth study of commonly overlooked factors in concept-based interpretability methods. Concretely, we analyze four representative methods: NetDissect [5], TCAV [25], Concept Bottleneck [29] and IBD [56]. These are a representative and comprehensive set of existing concept-based interpretability methods for computer vision models. Using multiple probe datasets (ADE20k [57, 58] and Pascal [13] for NetDissect, TCAV and IBD; CUB-200-2011 [48] for Concept Bottleneck), we examine the effects of (1) the choice of probe dataset, (2) the concepts used within the explanation, and (3) the complexity of the explanation. Through our analyses, we learn a number of key insights, which we summarize below:

  • The choice of the probe dataset has a profound impact on explanations. We repeatedly find that different probe datasets give rise to different explanations, when explaining the same model with the same interpretability method. For instance, the prediction of the arena/hockey class is explained with concepts {grandstand, goal, ice-rink, skate-board} with one probe dataset, and {plaything, road} with another probe dataset. We highlight that concept-based explanations are not solely determined by the model or the interpretability method. Hence, probe datasets should be chosen with caution. Specifically, we suggest using probe datasets whose data distribution is similar to that of the dataset the model-being-explained was trained on.

  • Concepts used in explanations are frequently harder to learn than the classes they aim to explain. The choice of concepts used in explanations is dependent on the available concepts in the probe dataset. Surprisingly, we find that learning some of these concepts is harder than learning the target classes. For example, in one experiment we find that the target class bathroom is explained using concepts {toilet, shower, countertop, bathtub, screen-door}, all of which are harder to learn than bathroom. Moreover, these concepts can be hard for people to identify, limiting the usefulness of these explanations. We argue that learnability is a necessary (albeit not sufficient) condition for the correctness of the explanations, and advocate for future explanations to only use concepts that are easily learnable. (Ideally, future methods would also provide causal rather than purely correlation-based explanations.)

  • Current explanations use hundreds or even thousands of concepts, but human studies reveal a much stricter upper bound. We conduct human studies with 125 participants recruited from Amazon Mechanical Turk to understand how well people reason with concept-based explanations with varying numbers of concepts. We find that participants struggle to identify relevant concepts in images as the number of concepts increases (the percentage of concepts recognized per image decreases from 71.7% ± 27.7% with 8 concepts to 56.8% ± 24.9% with 32 concepts). Moreover, the majority of the participants prefer that the number of concepts be limited to 32. We also find that concept-based explanations offer little to no advantage in predicting model output compared to example-based explanations (the participants’ mean accuracy at predicting the model output when given access to explanations with 8 concepts is 64.8% ± 23.9%, whereas the accuracy when given access to example-based explanations is 60.0% ± 30.2%).

These findings highlight the importance of vetting intuitions when developing and using interpretability methods. We have open-sourced our analysis code and human study user interface to aid with this process in the future: https://github.com/princetonvisualai/OverlookedFactors.

2 Related work

Interpretability methods for computer vision models range from highlighting areas within an image that contribute to a model’s prediction (i.e., saliency maps) [9, 15, 37, 45, 46, 51, 52, 54] to labelling model components (e.g., neurons) [5, 16, 25, 56], highlighting concepts that contribute to the model’s prediction [56, 39], and designing models that are interpretable-by-design [8, 10, 29, 34]. In this work, we focus on concept-based interpretability methods. These include post-hoc methods that label a trained model’s components and/or predictions [5, 16, 25, 56, 39] and interpretable-by-design methods that use pre-defined concepts [29]. We focus on methods for image classification models, where most interpretability research has been and is being conducted. Concept-based methods have recently been developed for other types of models (e.g., image similarity models [38], language models [50, 7]); however, these are outside the scope of this paper.

Our work is similar in spirit to a growing group of works that propose checks and evaluation protocols to better understand the capabilities and limitations of interpretability methods [1, 3, 2, 21, 23, 26, 28, 32, 41, 49]. Many of these works examine how sensitive post-hoc saliency maps are to different factors such as input perturbations, model weights, or the output class being explained. On the other hand, we conduct an in-depth study of concept-based interpretability methods. Despite their popularity, little is understood about their interpretability and usefulness to human users, or their sensitivity to auxiliary inputs such as the probe dataset. We seek to fill this gap with our work and assist with future development and use of concept-based interpretability methods. To the best of our knowledge, we are the first to investigate the effect of the probe dataset and concepts used for concept-based explanations. There has been work investigating the effect of explanation complexity on human understanding [30], however, it is limited to decision sets.

We also echo the call for releasing more information when releasing datasets [17], models [12, 33] and interpretability methods [47]. More concretely, we suggest that concept-based interpretability method developers include results from our proposed analyses in their method release, in addition to filling out the explainability fact sheet proposed by Sokol et al. [47], to help researchers and practitioners better understand, use, and build on these methods.

3 Dataset choice: Probe dataset has a profound impact on the explanations

Scene class | Top concepts from ADE20k-generated explanations | Top concepts from Pascal-generated explanations
arena/hockey | grandstand, goal, ice-rink, scoreboard | plaything, road
auto-showroom | car, light, trade-name, floor, wall | car, stage, grandstand, baby-buggy, ground
bedroom | bed, cup, tapestry, lamp, blind | bed, frame, wood, sofa, bedclothes
bow-window | windowpane, seat, cushion, wall, heater | windowpane, tree, shelves, curtain, cup
conf-room | swivel-chair, table, mic, chair, document | bench, napkin, plate, candle, table
corn-field | field, plant, sky, streetlight | tire, sky, dog, water, signboard
garage/indoor | bicycle, brush, car, tank, ladder | bicycle, vending-mach, tire, motorbike, floor
hardware-store | shelf, merchandise, pallet, videos, box | rope, shelves, box, bottle, pole
legis-chamber | seat, chair, pedestal, flag, witness-stand | mic, book, paper
tree-farm | tree, hedge, land, path, pole | tree, tent, sheep, mountain, rock
Table 1: Impact of probe dataset on Baseline (Sec. 3). We compare Baseline explanations generated using ADE20k vs. Pascal. For 10 randomly selected scene classes, we show concepts with the largest coefficients in each explanation. In bold are concepts in one explanation but not the other, e.g., the concept grandstand is important for explaining the arena/hockey scene prediction when using ADE20k, but not when using Pascal. These results show that the probe dataset has a huge impact on the explanations.
Neuron | ADE20k label | ADE20k score | Pascal label | Pascal score
9 | plant | 0.082 | potted-plant | 0.194
181 | plant | 0.068 | potted-plant | 0.140
318 | computer | 0.079 | tv | 0.251
386 | autobus | 0.067 | bus | 0.200
435 | runway | 0.071 | airplane | 0.189
185 | chair | 0.077 | horse | 0.153
239 | pool-table | 0.069 | horse | 0.171
257 | tent | 0.042 | bus | 0.279
384 | washer | 0.043 | bicycle | 0.201
446 | pool-table | 0.193 | tv | 0.086
Table 2: Impact of probe dataset on NetDissect [5] (Sec. 3). We compare NetDissect explanations (concept labels) for 10 neurons of the model-being-explained generated using ADE20k vs. Pascal. We find that while some neurons correspond to the same or similar concepts (top half), others correspond to wildly different concepts (bottom half), highlighting the impact of the probe dataset.

Concept-based explanations are generated by running a trained model on a “probe” dataset (typically not the training dataset) in which concepts are labelled. The choice of probe dataset has been almost entirely dictated by which datasets have concept labels. The most commonly used dataset is the Broden dataset [5]. It contains images from four datasets (ADE20k [57, 58], Pascal [13], OpenSurfaces [6], Describable Textures Dataset [11]) and labels for over 1190 concepts, spanning objects, object parts, colors, scenes, and textures.

In this section, we investigate the effect of the probe dataset by comparing explanations generated using two different subsets of the Broden dataset: ADE20k and Pascal. We experiment with three different methods for generating concept-based explanations: Baseline, NetDissect [5], and TCAV [25], and find that the generated explanations heavily depend on the choice of probe dataset. This finding implies that these explanations can only be used for images drawn from the same distribution as the probe dataset.

Model explained. Following prior work [5, 25, 56], we explain a ResNet18-based [20] scene classification model trained on the Places365 dataset [55], which predicts one of 365 scene classes given an input image.

Probe datasets. We use two probe datasets: ADE20k [57, 58] (19733 images, license: BSD 3-Clause) and Pascal [13] (10103 images, license: unknown). (To the best of our knowledge, most images used do not include personally identifiable information or offensive content; however, some feature people without their consent and might contain identifiable information.) The two datasets are different subsets of the Broden dataset [5] and are labelled with objects and parts. We randomly split each dataset into training (60%), validation (20%), and test (20%) sets, using the new training set for learning explanations, the validation set for tuning hyperparameters (e.g., learning rate and regularization parameters), and the test set for reporting our findings.

Interpretability methods. We investigate the effect of the probe dataset on three types of concept-based explanations. First, we study a simple Baseline method that measures correlations between the model’s prediction and concepts, and generates class-level explanations as a linear combination of concepts as in Fig. 1. Similar to Ramaswamy et al. [39], we learn a logistic regression model that matches the model-being-explained’s prediction, given access to ground-truth concept labels within the image. We use an l1 penalty to prioritize explanations with fewer concepts. Second, we study NetDissect [5], which identifies neurons within the model-being-explained that are highly activated by certain concepts and generates neuron-level explanations (concept labels); we use the code provided by the authors: https://github.com/CSAILVision/NetDissect-Lite. Finally, we study TCAV [25], which generates explanations in the form of concept activation vectors, i.e., vectors within the model-being-explained’s feature space that correspond to labelled concepts.
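To make the Baseline concrete, the following is a minimal sketch of how such an explanation could be fit with scikit-learn. The variable names and synthetic data are ours; in practice, the regularization strength is tuned on the validation set as described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs (names are ours, not from the released code):
#   concepts:    (n_images, n_concepts) binary ground-truth concept labels
#   model_preds: (n_images,) classes predicted by the model-being-explained
rng = np.random.default_rng(0)
concepts = rng.integers(0, 2, size=(1000, 50))
model_preds = rng.integers(0, 4, size=1000)

# Fit one sparse (l1-penalized) logistic regression per predicted class,
# so each class is explained by a short linear combination of concepts.
explanations = {}
for cls in np.unique(model_preds):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(concepts, (model_preds == cls).astype(int))
    explanations[cls] = clf.coef_.ravel()

# Concepts with the largest coefficients form the reported explanation.
top5 = np.argsort(explanations[0])[::-1][:5]
print("Top concept indices for class 0:", top5)
```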

Results. For all three explanation types, we find that using different probe datasets results in very different explanations. To begin, we show in Tab. 1 how Baseline explanations differ when using ADE20k vs. Pascal as the probe dataset. For example, when explaining the corn-field scene prediction, the Pascal-generated explanation highlights dog as important, whereas the ADE20k-generated explanation does not. For the legis-chamber scene, ADE20k highlights chair as important, whereas Pascal does not.

We observe a similar difference for NetDissect (see Tab. 2). We label 123 neurons separately using ADE20k and Pascal, and find that 60 of them are given very different concept labels (e.g., neuron 239 is labelled pool-table by ADE20k and horse by Pascal). (It is possible that these neurons are poly-semantic, i.e., neurons that respond to multiple concepts, as noted in [16, 35]. However, as we explore in the supp. mat., the score for the concept from the other dataset is usually below 0.04, the threshold used in [5] to identify “highly activated neurons.”) Again, this result highlights the impact of the probe dataset on explanations.
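For intuition, below is a simplified sketch of NetDissect-style scoring for a single neuron (our own code, not the authors' released implementation). It assumes the neuron's activation maps have already been upsampled to the segmentation resolution and uses a top-quantile threshold in the spirit of the original method.

```python
import numpy as np

def netdissect_ious(activations, concept_masks, quantile=0.995):
    """Score one neuron against each concept (simplified sketch).

    activations:   (n_images, H, W) activation maps for the neuron,
                   upsampled to the segmentation resolution.
    concept_masks: dict mapping concept name -> (n_images, H, W) binary masks.
    Returns the IoU between the thresholded activations and each concept.
    """
    # Per-neuron threshold: keep roughly the top 0.5% of activation values.
    thresh = np.quantile(activations, quantile)
    binarized = activations > thresh

    ious = {}
    for concept, masks in concept_masks.items():
        intersection = np.logical_and(binarized, masks).sum()
        union = np.logical_or(binarized, masks).sum()
        ious[concept] = intersection / max(union, 1)
    return ious

# The neuron is labelled with the concept of highest IoU; NetDissect keeps
# the label only if that IoU exceeds 0.04.
```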

Similarly, TCAV concept activation vectors learned using ADE20k vs. Pascal are different, i.e., they have low cosine similarity (see Fig. 2). We compute concept activation vectors for 32 concepts that have a base rate of over 1% in both datasets combined, then calculate the cosine similarity between the two vectors (one per probe dataset) for each concept. We also compute the ROC AUC for each concept vector to measure how well the concept vector corresponds to the concept. We find that the similarity is low (0.078 on average), even though the selected concepts were those that can be learned reasonably well (mean ROC AUC for these concepts is over 85%). We suspect that the explanations are radically different due to differences in the probe dataset distribution. For instance, some concepts have very different base rates in the two datasets: dog has a base rate of 12.0% in Pascal but 0.5% in ADE20k; chair has a base rate of 16.7% in ADE20k but 13.5% in Pascal. We present more analyses in the supp. mat.

Concept | ADE20k AUC | Pascal AUC | Cosine sim
ceiling | 96.6 | 93.0 | 0.267
box | 83.0 | 80.1 | 0.086
pole | 89.0 | 79.3 | 0.059
bag | 79.4 | 75.4 | 0.006
rock | 92.6 | 82.8 | -0.024
mean | 92.0 | 88.1 | 0.087
Figure 2: Impact of probe dataset on TCAV [25] (Sec. 3). We compare TCAV concept activation vectors learned using ADE20k vs. Pascal. (Top) For 5 concepts randomly selected out of 32, we show their learnability in each dataset (AUC) and cosine similarity between the two vectors. While these concepts can be learned reasonably well (AUCs are high), their learned activation vectors have low similarity (Cosine sim is low). (Bottom) The histogram of cosine similarity scores for all 32 concepts again shows that the two activation vectors for the same concept are not very similar.
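As a concrete illustration of the comparison above, the sketch below learns one concept activation vector per probe dataset as the (normalized) weight vector of a linear classifier in the model's feature space, then reports their cosine similarity. The data here is synthetic and the classifier settings are our assumptions; the original TCAV implementation may differ in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def concept_vector(features, has_concept):
    """Concept activation vector: normalized weight vector of a linear
    classifier separating features of images with vs. without the concept."""
    clf = LogisticRegression(penalty="l2", solver="liblinear", C=1.0)
    clf.fit(features, has_concept)
    # AUC on the training split here; a held-out split would be used in practice.
    auc = roc_auc_score(has_concept, clf.decision_function(features))
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w), auc

# Stand-ins for penultimate-layer features of the model-being-explained.
rng = np.random.default_rng(0)
feat_ade, lab_ade = rng.normal(size=(500, 512)), rng.integers(0, 2, 500)
feat_pas, lab_pas = rng.normal(size=(500, 512)), rng.integers(0, 2, 500)

cav_ade, auc_ade = concept_vector(feat_ade, lab_ade)
cav_pas, auc_pas = concept_vector(feat_pas, lab_pas)
print("ROC AUCs:", auc_ade, auc_pas)
print("Cosine similarity between the two vectors:", float(cav_ade @ cav_pas))
```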

4 Concept learnability: Concepts used are less learnable than target classes

In Sec. 3, we investigated how the choice of the probe dataset influences the generated explanations. In this section, we investigate the individual concepts used within explanations. An implicit assumption made in concept-based interpretability methods is that the concepts used in explanations are easier to learn than the target classes being explained. For instance, when explaining the class bedroom with the concept bed, we are assuming (and hoping) that the model first learns the concept bed, then uses this concept and others to predict the class bedroom. However, if bed is harder to learn than bedroom, this would not be the case. This assumption also aligns with works that argue that “simpler” concepts (e.g., edges and textures) are learned in early layers and “complex” concepts (e.g., parts and objects) are learned in later layers [5, 16].

We thus investigate the learnability of concepts used by different explanation methods. Somewhat surprisingly, we find that the concepts used are frequently harder to learn than the target classes, raising concerns about the correctness of concept-based explanations.

Figure 3: Overall comparison of concept vs. class learnability (Sec. 4). We compare the learnability, quantified as normalized AP of concept/class predictors, of Broden concepts (top) vs. Places365 scene classes (below). Overall, the concepts have much lower normalized AP (i.e., are harder to learn) than the classes.

Setup. To compare the learnability of concepts vs. classes, we learn models for the concepts (the learnability of the classes is already known from the model-being-explained). Concretely, we extract features for the probe dataset using an ImageNet [43]-pretrained ResNet18 [20] model and train a linear model using sklearn’s [36] LogisticRegression to predict concepts from the ResNet18 features. (We also tried using features from a Places365-pretrained model and did not find a significant difference.) We do so for the two most commonly used probe datasets: Broden [5] and CUB-200-2011 [48]. Broden concepts are frequently used to explain Places365 classes (as done in NetDissect [5], Net2Vec [16], IBD [56], and ELUDE [39]), while CUB concepts are used to explain the CUB target classes (as done in Concept Bottleneck [29] and ELUDE [39]).

Evaluation. We evaluate learnability with normalized average precision (AP) [22]. We choose normalized AP for two reasons: first, to avoid having to set a threshold and second, to fairly compare concepts and scenes that have very different base rates. In our experiments, we set the base rate to be that of the classes: 1/365 when comparing Broden concepts vs. Places365 classes and 1/200 when comparing CUB concepts vs. CUB classes.
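As a reference, one way to compute a base-rate-normalized AP is to reweight precision so that positives make up the chosen base rate. The sketch below follows our reading of the normalization in [22] and may differ in detail from the exact implementation used.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def normalized_ap(y_true, y_score, base_rate):
    """AP with precision reweighted so that positives make up `base_rate`
    of the data (a sketch of base-rate normalization in the spirit of [22])."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    n_pos, n_neg = y_true.sum(), len(y_true) - y_true.sum()
    tp = recall * n_pos  # true positives at each threshold
    fp = np.divide(tp * (1 - precision), precision,
                   out=np.full_like(tp, float(n_neg)), where=precision > 0)
    fpr = fp / n_neg     # false-positive rate at each threshold
    prec_norm = (recall * base_rate) / (
        recall * base_rate + fpr * (1 - base_rate) + 1e-12)
    # sklearn-style AP: precision weighted by increments in recall.
    return float(-np.sum(np.diff(recall) * prec_norm[:-1]))

# Toy usage with the Places365 base rate of 1/365.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05])
print(normalized_ap(y_true, y_score, base_rate=1 / 365))
```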

Results. In both settings, we find that the concepts are much harder to learn than the target classes. The median normalized AP for Broden concepts is 7.6%, much lower than 37.5% of Places365 classes. Similarly, the median normalized AP for CUB concepts is 2.3%, much lower than 65.9% of CUB classes. Histograms of normalized APs are shown in Fig. 3 (Broden/Places365) and the supp. mat. (CUB).

However, is it possible that each class is explained by concepts that are more learnable than the class? Our investigation with IBD [56] explanations suggests this is not the case. IBD greedily learns a basis of concept vectors, as well as a residual vector, and decomposes each model prediction into a linear combination of the basis and residual vectors (we use the code provided by the authors: https://github.com/CSAILVision/IBD). For 10 randomly chosen scene classes, we compare the normalized AP of the scene class vs. the 5 concepts with the highest coefficients (i.e., the 5 concepts that are most important for explaining the prediction). See Tab. 3 for the results. We find that all 10 scene classes are explained with at least one concept that is harder to learn than the class. For some classes (e.g., bathroom, kitchen), all concepts used in the explanation are harder to learn than the class.

Scene class (normalized AP) | Top-5 IBD concepts (normalized AP)
arena/perform (38.8) | tennis court (74.0), grandstand (44.4), ice rink (40.7), valley (19.0), stage (11.9)
art-gallery (27.4) | binder (42.6), drawing (10.8), painting (10.5), frame (2.5), sculpture (0.7)
bathroom (43.3) | toilet (39.9), shower (18.8), countertop (12.6), bathtub (11.1), screen door (9.6)
kasbah (50.2) | ruins (64.3), desert (17.3), arch (16.2), dirt track (8.9), bottle rack (4.2)
kitchen (33.9) | work surface (24.8), stove (18.2), cabinet (10.3), refrigerator (8.8), doorframe (2.8)
lock-chamber (36.5) | water wheel (47.4), dam (43.7), boat (16.1), embankment (4.8), footbridge (4.1)
pasture (19.2) | cow (63.7), leaf (21.1), valley (19.0), field (6.8), slope (4.1)
physics-lab (17.1) | computer (25.4), machine (4.5), monitor-device (3.3), bicycle (1.7), sewing-machine (1.5)
store/indoor (20.4) | shanties (72.5), patty (18.5), bookcase (13.5), shelf (4.2), cup (1.3)
water-park (38.3) | roller coaster (73.0), hot tub (59.1), playground (44.9), ride (38.0), swimming pool (36.7)
Table 3: Class-level comparison of concept vs. class learnability (Sec. 4). We report normalized AP scores (higher indicates greater learnability) for 10 randomly chosen scene classes, along with the 5 concepts with the highest IBD explanation coefficients for each. Concepts whose normalized AP scores are lower than that of the scene class are shown in red, whereas concepts with higher scores are shown in blue. All scenes are explained by at least one concept with a lower normalized AP, and some scenes are only explained by concepts with lower normalized AP.

Our experiments show that a significant fraction of the concepts used by existing concept-based interpretability methods are harder to learn than the target classes, issuing a wake-up call to the field. In the following section, we show that these concepts can also be hard for people to identify.

5 Human capability: Human studies reveal an upper bound of 32 concepts

Existing concept-based explanations use a large number of concepts: NetDissect [5] and Net2Vec [16] use all 1197 concepts labelled within the Broden [5] dataset; IBD [56] uses Broden object and art concepts with at least 10 examples (660 concepts); and Concept Bottleneck [29] uses all concepts that are predominantly present for at least 10 classes from CUB [48] (112 concepts). However, can people actually reason with this many concepts?

In this section, we study this important yet overlooked aspect of concept-based explanations: explanation complexity and how it relates to human capability and preference. Specifically, we investigate: (1) How well do people recognize concepts in images? (2) How do the (concept recognition) task performance and time change as the number of concepts varies? (3) How well do people predict the model output for a new image using explanations? (4) How do people trade off simplicity and correctness of concept-based explanations? To answer these questions, we design and conduct a human study. We describe the study design in Sec. 5.1 and report findings in Sec. 5.2.

5.1 Human study design

We build on the study design and user interface (UI) of HIVE [26], and design a two-part study to understand how understandable and useful concept-based explanations are to human users with potentially limited knowledge about machine learning. To the best of our knowledge, we are the first to investigate such properties of concept-based explanations for computer vision models. (There are works examining the complexity of explanations for other types of models: for example, Lage et al. [30] investigate the complexity of explanations over decision sets, and Bolukbasi et al. [7] do so for concept-based explanations of language models.)

Part 1: Recognize concepts and predict the model output. First, we present participants with an image and a set of concepts and ask them to identify whether each concept is present or absent in the image. We also show explanations for 4 classes whose scores are calculated in real time based on the concepts selected. As a final question, we ask participants to select the class they think the model predicts for the given image. See Fig. 4 (left) for the study UI.

To ensure that the task is doable and is only affected by explanation complexity (number of concepts used) and not the complexity of the model and its original prediction task (e.g., 365-way scene classification), we generate explanations for only 4 classes and ask participants to identify which of the 4 classes corresponds to the model’s prediction. We only show images where the model output matches the explanation output (i.e., the model predicts the class with the highest explanation score, calculated with ground-truth concept labels), since our goal is to understand how people reason with concept-based explanations of varying complexity.

Figure 4: Human study UI (Sec. 5). We show a simplified version of the UI we developed for our human studies. In Part 1, we ask participants to guess the model’s prediction for a given image by recognizing concepts and using the provided explanations. In Part 2, we show participants explanations with different levels of simplicity and correctness, then ask which one they prefer the most.

Part 2: Choose the ideal tradeoff between simplicity and correctness. Next, we ask participants to reason about two properties of concept-based explanations: simplicity, i.e., the number of concepts used in a given set of explanations, and correctness, i.e., the percentage of model predictions correctly explained by explanations, which is the percentage of times the model output class has the highest explanation score. See Fig. 4 (right) for the study UI. We convey the notion of a simplicity-correctness tradeoff through bar plots that show the correctness of explanations of varying simplicity/complexity (4, 8, 16, 32, 64 concepts). We then ask participants to choose the explanation they prefer the most and provide a short justification for their choice.

Full study design and experimental details. In summary, our study consists of the following steps. For each participant, we introduce the study, receive informed consent for participation, and collect information about their demographics (optional) and machine learning experience. We then introduce concept-based explanations in simple terms, and show a preview of the concept recognition and model output prediction task in Part 1. The participant then completes the task for 10 images. In Part 2, the participant indicates their preference for explanation complexity, given simplicity and correctness information. There are no foreseeable risks in participation, and our study design was approved by our institution’s IRB.

Using this study design, we investigate explanations that take the form of a linear combination of concepts (e.g., Baseline, IBD [56], Concept Bottleneck [29]). Explanations are generated using the Baseline method, which is a logistic regression model trained to predict the model’s output using concepts (see Sec. 3 for details). Note that we are evaluating the form of explanation (linear combination of concepts) rather than a specific explanation method. The choice of the method does not impact the task.
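For concreteness, the sketch below shows how an explanation score and the predicted class can be computed from a participant's concept selections. The coefficients and intercepts here are made-up stand-ins for the Baseline explanation weights.

```python
import numpy as np

# Hypothetical explanations for 4 classes over a 3-concept vocabulary:
# per-class coefficients and intercepts from the Baseline logistic regression.
coefs = np.array([[1.2, 0.0, -0.4],
                  [0.0, 0.9, 0.3],
                  [0.5, 0.5, 0.0],
                  [-0.2, 0.0, 1.1]])
intercepts = np.array([-0.5, -0.3, -0.4, -0.2])

# A participant's (or the ground-truth) binary concept selections for an image.
selected = np.array([1, 0, 1])

# Explanation score per class: linear combination of the selected concepts.
scores = coefs @ selected + intercepts
predicted_class = int(np.argmax(scores))
print("scores:", scores, "-> predicted model output:", predicted_class)
```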

Specifically, we compare four types of explanations: concept-based explanations that use (1) 8 concepts, (2) 16 concepts, (3) 32 concepts, and (4) example-based explanations that consist of 10 example images for which the model predicts a certain class. We include (4) as a method that doesn’t use concepts. In Jeyakumar et al. [24], this type of explanation is shown to be preferred over saliency-type explanations for image classification; here, we compare this to concept-based explanations.

For a fair comparison, all four are evaluated on the same set of images. In short, we conduct a between-group study with 125 participants recruited through Amazon Mechanical Turk. Participants were compensated based on the state-level minimum wage of $12/hr. In total, approximately $800 was spent on running human studies. See supp. mat. for more details.

5.2 Key findings from the human studies

When presented with more concepts, participants spend more time but are worse at recognizing concepts. The median time participants spend on each image is 17.4 sec. for 8-concept, 27.5 sec. for 16-concept, and 46.2 sec. for 32-concept explanations. This is expected, since participants are asked to make a judgment for each and every concept. When given example-based explanations with no such task, participants spend only 11.6 seconds on each image. Interestingly, the concept recognition performance, reported in terms of mean recall (i.e., the percentage of concepts in the image that are recognized) and standard deviation, decreases from 71.7% ± 27.7% (8 concepts) to 61.0% ± 28.5% (16 concepts) to 56.8% ± 24.9% (32 concepts). While these numbers are far from perfect recall (100%), participants are better at judging whether concepts are present when shown fewer concepts.

Concept-based explanations offer little to no advantage in model output prediction over example-based explanations. Indeed, we see that participants’ errors in concept recognition often result in an incorrect class having the highest explanation score. When predicting the model output as the class with the highest explanation score, calculated based on the participants’ concept selections, the mean accuracy and standard deviation of model output prediction are 64.8% ± 23.9% (8 concepts), 63.2% ± 26.9% (16 concepts), and 63.6% ± 22.2% (32 concepts). These are barely higher than the 60.0% ± 30.2% of example-based explanations, which are simpler and require less time to complete the task.

The majority of participants prefer explanations with 8, 16, or 32 concepts. When given options of explanations that use 4, 8, 16, 32, or 64 concepts, 82% of participants prefer explanations with 8, 16, or 32 concepts (28%, 33%, 21% respectively). Only 6% prefer those with 64 concepts, suggesting that existing explanations that use hundreds or even thousands of concepts do not cater to human preferences. In the written responses, many favored having fewer concepts (e.g., “the lesser, the better”) and expressed concerns against having too many (e.g., “I think 32 is a lot, but 16 is an adequate enough number that it could still predict well…”). In making the tradeoff, some valued correctness above all else (e.g., “Out of all the options, 32 is the most correct”), while others reasoned about marginal benefits (e.g., “I would prefer explanations that use 16 concepts because it seems that the difference in percentage of correctness is much closer and less, than other levels of concepts”). Overall, we find that participants actively reason about both simplicity and correctness of explanations.

6 Discussion

Our analyses yield immediate suggestions for improving the quality and usability of concept-based explanations. First, we suggest choosing a probe dataset whose distribution is similar to that of the dataset the model was trained on. Second, we suggest only using concepts that are more learnable than the target classes. Third, we suggest limiting the number of concepts used within an explanation to under 32, so that explanations are not overwhelming to people.

The final suggestion is easy to implement. However, the first two are easier said than done, since the number of available probe datasets (i.e., large-scale datasets with concept labels) is minimal, forcing researchers to use the Broden dataset [5] or the CUB dataset [48]. Hence, we argue that creating diverse and high-quality probe datasets is of utmost importance for research on concept-based explanations.

Another concern is that these methods do highlight hard-to-learn concepts when given access to them, suggesting that they sometimes capture correlations rather than causation. Methods by Goyal et al. [19], which output patches within the image that need to be changed for the model’s prediction to change, or Fong et al. [15], which find regions within the image that maximally contribute to the model’s prediction, are more in line with capturing causal relationships. However, these only produce local explanations, i.e., explanations of a single model prediction, and not class-level global explanations. One approach to capturing causal relationships is to generate counterfactual images with or without certain concepts using generative models [40] and observe changes in model predictions.

7 Limitations and future work

Our findings come with a few caveats. First, due to the lack of available probe datasets, we tested each concept-based interpretability method in a single setting. That is, we tested NetDissect [5], TCAV [25] and IBD [56] on a scene classifier trained on the Places365 dataset [55], and Concept Bottleneck [29] on the CUB dataset [48]. We plan to expand our analyses as more probe datasets become available. Second, all participants in our human studies were recruited from Amazon Mechanical Turk. This means that our participants represent a population with limited ML background: the self-reported ML experience was 2.5 ± 1.0 (on a scale of 1 to 5), which is between “2: have heard about…” and “3: know the basics…” We believe Part 1 results of our human studies (described in Sec. 5.1) will not vary with participants’ ML expertise or role in the ML pipeline, as we are only asking participants to identify concepts in images. However, Part 2 results may vary (e.g., developers debugging an ML model may be more willing to trade off explanation simplicity for correctness than lay end-users). Investigating differences in perceptions and uses of concept-based explanations across user groups is an important direction for future research.

8 Conclusion

In this work, we examined implicit assumptions made in concept-based interpretability methods along three axes: the choice of the probe dataset, the learnability of the used concepts, and the complexity of explanations. We found that the choice of the probe dataset profoundly influences the generated explanations, implying that these explanations can only be used for images from the probe dataset distribution. We also found that a significant fraction of the concepts used within explanations are harder for a model to learn than the target classes they aim to explain. Finally, we found that people struggle to identify concepts in images when given too many concepts, and that explanations with fewer than 32 concepts are preferred. We hope our proposed analyses and findings lead to more careful use and development of concept-based explanations.

Acknowledgements. We foremost thank our participants for taking the time to participate in our study. We also thank the authors of [5, 56, 29, 26] for open-sourcing their code and/or models. Finally, we thank the anonymous reviewers and the Princeton Visual AI Lab members (especially Nicole Meister) who provided helpful feedback on our work. This material is based upon work partially supported by the National Science Foundation under Grants No. 1763642, 2145198 and 2112562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We also acknowledge support from the Princeton SEAS Howard B. Wentz, Jr. Junior Faculty Award to OR, Princeton SEAS Project X Fund to RF and OR, Open Philanthropy Grant to RF, and NSF Graduate Research Fellowship to SK.

References

  • [1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In NeurIPS, 2018.
  • [2] Julius Adebayo, Michael Muelly, Harold Abelson, and Been Kim. Post hoc explanations may be ineffective for detecting unknown spurious correlation. In ICLR, 2022.
  • [3] Julius Adebayo, Michael Muelly, Ilaria Liccardi, and Been Kim. Debugging tests for model explanations. In NeurIPS, 2020.
  • [4] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 2020.
  • [5] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, 2017.
  • [6] Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM Trans. Graph., 2014.
  • [7] Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. An interpretability illusion for BERT, 2021.
  • [8] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In ICLR, 2018.
  • [9] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, 2018.
  • [10] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: Deep learning for interpretable image recognition. In NeurIPS, 2019.
  • [11] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.
  • [12] Anamaria Crisan, Margaret Drouhard, Jesse Vig, and Nazneen Rajani. Interactive model cards: A human-centered approach to model documentation. In FAccT, 2022.
  • [13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, 2010.
  • [14] Ruth Fong. Understanding convolutional neural networks. PhD thesis, University of Oxford, 2020.
  • [15] Ruth Fong, Mandela Patrick, and Andrea Vedaldi. Understanding deep networks via extremal perturbations and smooth masks. In ICCV, 2019.
  • [16] Ruth Fong and Andrea Vedaldi. Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In CVPR, 2018.
  • [17] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021.
  • [18] Leilani H. Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In DSAA, 2018.
  • [19] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In ICML, 2019.
  • [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [21] Adrian Hoffmann, Claudio Fanconi, Rahul Rade, and Jonas Kohler. This looks like that… does it? Shortcomings of latent space prototype interpretability in deep networks. In ICML Workshops, 2021.
  • [22] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In ECCV, 2012.
  • [23] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. In NeurIPS, 2019.
  • [24] Jeya Vikranth Jeyakumar, Joseph Noor, Yu-Hsi Cheng, Luis Garcia, and Mani Srivastava. How can I explain this to you? An empirical study of deep neural network explanation methods. In NeurIPS, 2020.
  • [25] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML, 2018.
  • [26] Sunnie S. Y. Kim, Nicole Meister, Vikram V. Ramaswamy, Ruth Fong, and Olga Russakovsky. HIVE: Evaluating the human interpretability of visual explanations. In ECCV, 2022.
  • [27] Sunnie S. Y. Kim, Elizabeth Anne Watkins, Olga Russakovsky, Ruth Fong, and Andrés Monroy-Hernández. “Help me help the AI”: Understanding how explainability can support human-AI interaction. In CHI, 2023.
  • [28] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un) reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer Cham, 2019.
  • [29] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In ICML, 2020.
  • [30] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Samuel J Gershman, and Finale Doshi-Velez. Human evaluation of models built for interpretability. In HCOMP, 2019.
  • [31] Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In ECCV, 2016.
  • [32] Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended? In ICLR Workshops, 2021.
  • [33] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In FAccT, 2019.
  • [34] Meike Nauta, Ron van Bree, and Christin Seifert. Neural prototype trees for interpretable fine-grained image recognition. In CVPR, 2021.
  • [35] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020.
  • [36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 2011.
  • [37] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models. In BMVC, 2018.
  • [38] Bryan A. Plummer, Mariya I. Vasileva, Vitali Petsiuk, Kate Saenko, and David Forsyth. Why do these match? Explaining the behavior of image similarity models. In ECCV, 2020.
  • [39] Vikram V. Ramaswamy, Sunnie S. Y. Kim, Nicole Meister, Ruth Fong, and Olga Russakovsky. ELUDE: Generating interpretable explanations via a decomposition into labelled and unlabelled features. arXiv:2206.07690, 2022.
  • [40] Vikram V. Ramaswamy, Sunnie S. Y. Kim, and Olga Russakovsky. Fair attribute classification through latent space de-biasing. In CVPR, 2021.
  • [41] Sylvestre-Alvise Rebuffi, Ruth Fong, Xu Ji, and Andrea Vedaldi. There and back again: Revisiting backpropagation saliency methods. In CVPR, 2020.
  • [42] Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges. In Statistics Surveys, 2021.
  • [43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. In IJCV, 2015.
  • [44] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller, editors. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Computer Science. Springer, 2019.
  • [45] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [46] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshops, 2014.
  • [47] Kacper Sokol and Peter Flach. Explainability fact sheets: A framework for systematic assessment of explainable approaches. In FAccT, 2020.
  • [48] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [49] Mengjiao Yang and Been Kim. Benchmarking attribution methods with relative feature importance. arXiv:1907.09701, 2019.
  • [50] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In NeurIPS, 2020.
  • [51] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
  • [52] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. IJCV, 2018.
  • [53] Quanshi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: A survey. Frontiers of Information Technology & Electronic Engineering, 2018.
  • [54] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
  • [55] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.
  • [56] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable basis decomposition for visual explanation. In ECCV, 2018.
  • [57] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20k dataset. In CVPR, 2017.
  • [58] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20k dataset. IJCV, 2019.

Appendix

In this supplementary document, we provide additional details on some sections of the main paper.

  • Section A: We provide more information regarding the experimental setup for all experiments.

  • Section B: We provide additional results from our experiments on probe dataset choice from Section 3 of the main paper.

  • Section C: We provide additional results from our experiments on concept choice from Section 4 of the main paper.

  • Section D: We supplement Section 5 of the main paper and provide more information about our human studies.

  • Section E: We supplement Section 5 of the main paper and show snapshots of our full user interface.

Appendix A Experimental details

Here we provide additional experimental details for all of our setups, as well as the computational resources used.

TCAV. Using the features extracted from the penultimate layer of the ResNet18-based [20] model trained on the Places365 dataset [55], we use scikit-learn’s [36] LogisticRegression models to predict the ground-truth attributes in each case. We use the liblinear solver with an l2 penalty, and pick the regularization weight as a hyperparameter based on the performance (ROC AUC) on a validation set.

Baseline. Given the ground-truth labelled concepts for an image, this explanation attempts to predict the blackbox model’s output on the image. We use scikit-learn’s [36] LogisticRegression model with a liblinear solver and an l1 penalty, to prioritize learning simpler explanations. For the experiment reported in Section 3 of the main paper, we pick the regularization weight as a hyperparameter, choosing the weight with the best performance on a validation set. When generating explanations of different complexities for our human studies, we vary the regularization parameter, picking explanations that use a total of 4, 8, 16, 32, or 64 concepts.
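A minimal sketch of such a sweep is shown below, reusing the Baseline setup sketched in Section 3; the grid of regularization strengths and the selection rule are our choices, not necessarily those of the released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_concept_budget(concepts, model_preds, n_concepts):
    """Sweep the l1 regularization strength and return the least-regularized
    model whose explanations use at most `n_concepts` concepts in total
    (a sketch; the exact sweep used for the human studies may differ)."""
    best = None
    for c in np.logspace(-3, 1, 30):  # increasing C = weaker regularization
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=c)
        clf.fit(concepts, model_preds)
        # Count concepts with a nonzero coefficient for at least one class.
        n_used = int(np.count_nonzero(np.any(clf.coef_ != 0, axis=0)))
        if n_used <= n_concepts:
            best = clf
    return best
```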

Learning concepts. We computed features for all images from ADE20k [57, 58] using the penultimate layer of a ResNet18 [20] model trained on ImageNet [43]. We then learned a linear model for every concept that had over 10 positive samples within the dataset, using the LogisticRegression model from scikit-learn [36]. As with the other models, we use a liblinear solver with an l2 penalty, choosing the regularization weight based on performance (ROC AUC) on a validation set. As mentioned, we report the normalized AP [22] to be able to compare across concepts and target classes with varying base rates.
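For reference, a minimal sketch of the feature-extraction step (our own code, using the torchvision 0.13+ weights API, not the original scripts):

```python
import torch
from torchvision import models

# Extract 512-d penultimate-layer features from an ImageNet-pretrained
# ResNet18; these features feed the per-concept logistic-regression probes.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # drop the 1000-way classification head
backbone.eval()

with torch.no_grad():
    batch = torch.randn(8, 3, 224, 224)  # stand-in for a normalized image batch
    features = backbone(batch)           # shape: (8, 512)
print(features.shape)
```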

Run times. Training each of the linear models took less than 2 min on a CPU. Computing features using a ResNet18 [20] model trained on either Places365 [55] or ImageNet [43] for the ADE20k [57, 58] and Pascal [13] datasets took less than 15 min using an NVIDIA RTX 2080 GPU.

Appendix B Probe dataset choice: more details

In Section 3 of the main paper, we showed that the choice of probe dataset can have a significant impact on the output of concept-based explanations. Here we provide more details from those experiments.

B.1 Varying the probe dataset

Here we provide the full results from Section 3 of the main paper, where we compute concept-based explanations using two different methods (NetDissect [5] and TCAV [25]) with either ADE20k or Pascal as the probe dataset.

NetDissect. Table 4 contains the labels generated for all neurons that are strongly activated when using either ADE20k [57, 58] or Pascal [13] as the probe dataset. A majority of the neurons (69/123) correspond to very different concepts.

Neuron ADE20k label ADE20k score Pascal label Pascal score Neuron ADE20k label ADE20k score Pascal label Pascal score
1 counter 0.059 bottle 0.049 3 sea 0.067 water 0.065
4 seat 0.064 tvmonitor 0.074 8 vineyard 0.048 plant 0.043
9 plant 0.082 pottedplant 0.194 22 bookcase 0.07 bus 0.048
30 house 0.094 building 0.043 37 boat 0.043 boat 0.213
43 bed 0.151 bed 0.075 47 pool table 0.135 airplane 0.079
60 plane 0.052 airplane 0.168 63 field 0.053 muzzle 0.042
69 person 0.047 hair 0.086 73 water 0.041 bird 0.080
79 plant 0.064 pottedplant 0.064 90 mountain 0.071 mountain 0.066
102 bathtub 0.040 cat 0.055 104 cradle 0.081 bus 0.112
105 sea 0.106 water 0.058 106 rock 0.048 rock 0.06
110 painting 0.119 painting 0.06 112 field 0.05 bus 0.051
113 table 0.116 table 0.066 115 plane 0.046 airplane 0.147
120 sidewalk 0.042 track 0.075 125 table 0.049 wineglass 0.047
126 stove 0.064 bottle 0.163 127 book 0.104 book 0.096
131 signboard 0.043 body 0.069 134 bathtub 0.088 boat 0.059
141 skyscraper 0.065 cage 0.068 155 mountain 0.091 train 0.058
158 book 0.042 book 0.052 165 sea 0.051 water 0.051
168 railroad train 0.055 train 0.193 172 car 0.055 bus 0.101
173 car 0.052 bus 0.099 181 plant 0.068 pottedplant 0.14
183 person 0.041 horse 0.187 184 cradle 0.046 cat 0.042
185 chair 0.077 horse 0.153 186 person 0.051 bird 0.094
191 swimming pool 0.044 pottedplant 0.072 198 pool table 0.064 ceiling 0.066
208 shelf 0.047 bus 0.062 211 computer 0.076 tvmonitor 0.089
217 toilet 0.049 hair 0.055 218 case 0.044 track 0.165
219 plane 0.065 airplane 0.189 220 road 0.066 road 0.066
222 grass 0.105 grass 0.046 223 house 0.069 airplane 0.055
231 grandstand 0.097 screen 0.047 234 bridge 0.05 train 0.042
239 pool table 0.069 horse 0.171 245 water 0.063 water 0.042
247 plane 0.079 airplane 0.177 248 bed 0.127 tvmonitor 0.063
251 sofa 0.073 pottedplant 0.053 257 tent 0.042 bus 0.279
260 flower 0.082 food 0.069 267 apparel 0.042 car 0.045
276 earth 0.041 rock 0.047 278 field 0.06 sheep 0.044
280 mountain 0.045 mountain 0.056 287 plant 0.078 pottedplant 0.07
289 pool table 0.049 food 0.059 290 mountain 0.085 mountain 0.097
293 shelf 0.074 bottle 0.105 298 path 0.047 motorbike 0.068
305 waterfall 0.057 mountain 0.047 309 washer 0.109 bus 0.065
318 computer 0.079 tvmonitor 0.251 322 ball 0.054 sheep 0.044
324 mountain 0.071 motorbike 0.048 325 person 0.04 head 0.059
327 waterfall 0.055 bird 0.087 337 water 0.072 boat 0.109
341 sea 0.153 boat 0.076 344 person 0.052 person 0.048
345 autobus 0.042 bus 0.142 347 palm 0.051 bicycle 0.083
348 mountain 0.058 mountain 0.125 354 cradle 0.042 chair 0.053
357 rock 0.058 sheep 0.061 360 pool table 0.048 bird 0.041
364 field 0.058 plant 0.041 372 work surface 0.045 cabinet 0.049
379 bridge 0.092 bus 0.046 383 bed 0.069 curtain 0.079
384 washer 0.043 bicycle 0.201 386 autobus 0.067 bus 0.200
387 hovel 0.04 train 0.085 389 chair 0.066 chair 0.051
398 windowpane 0.073 windowpane 0.07 400 plant 0.043 pottedplant 0.097
408 toilet 0.045 bottle 0.099 412 bed 0.079 airplane 0.086
413 pool table 0.09 motorbike 0.07 415 seat 0.044 tvmonitor 0.045
417 sand 0.06 sand 0.049 419 bed 0.061 tvmonitor 0.054
422 seat 0.089 tvmonitor 0.056 430 bed 0.078 bedclothes 0.042
434 case 0.047 cup 0.041 435 runway 0.072 airplane 0.189
438 plane 0.045 airplane 0.235 444 sofa 0.045 plant 0.09
445 car 0.201 car 0.093 446 pool table 0.193 tvmonitor 0.086
454 car 0.218 car 0.156 463 snow 0.059 snow 0.118
465 crosswalk 0.097 road 0.047 475 cradle 0.061 train 0.132
477 desk 0.104 tvmonitor 0.085 480 sofa 0.086 sofa 0.081
483 swivel chair 0.052 horse 0.041 484 water 0.15 water 0.102
485 sofa 0.056 airplane 0.045 500 sofa 0.156 sofa 0.11
502 washer 0.07 train 0.134 503 bookcase 0.109 book 0.075
509 computer 0.044 tvmonitor 0.074
Table 4: We show labels for all neurons from the penultimate layer of a ResNet18 model that are marked as highly activated by NetDissect [5] under both probe datasets (each row of the table above lists two neurons). We find that 69 of the 123 neurons correspond to labels that are radically different (shown in red). The remainder correspond to either the same or very similar concepts.

As noted by Fong et al. [16] and Olah et al. [35], neurons in deep neural networks can be polysemantic, i.e., a single neuron can recognize multiple concepts. We check whether the disagreements above are due to such neurons and confirm that they are not: out of the 69 neurons, only 7 are highly activated (IOU > 0.04) by both concepts. Table 5 contains the IOU scores of both the ADE20k and Pascal labels, under both probe datasets, for each neuron that is assigned very different concepts.
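For reference, below is a minimal sketch of the IoU computation underlying this check (not NetDissect's actual implementation: the aggregation of intersections and unions over the full probe dataset is omitted, and all names are placeholders). A neuron is counted as polysemantic here only if both its ADE20k label and its Pascal label reach IOU > 0.04.

```python
import numpy as np

def netdissect_style_iou(activation_map, concept_mask, threshold):
    """IoU between a neuron's thresholded activation map and a binary concept
    segmentation mask, assuming both are aligned to the same spatial size."""
    active = activation_map > threshold                        # binarize the neuron's activations
    intersection = np.logical_and(active, concept_mask).sum()
    union = np.logical_or(active, concept_mask).sum()
    return intersection / union if union > 0 else 0.0
```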

neuron  ADE20k label  Pascal label  IOU of ADE20k label (ADE20k probe)  IOU of Pascal label (ADE20k probe)  IOU of ADE20k label (Pascal probe)  IOU of Pascal label (Pascal probe)
1 counter bottle 0.059 0.006 0.006 0.049
4 seat tvmonitor 0.064 0.0 0.0 0.074
22 bookcase bus 0.07 0.0 0.0 0.048
47 pool table airplane 0.135 0.0 0.002 0.079
63 field muzzle 0.053 0.0 0.0 0.042
73 water bird 0.041 0.002 0.052 0.08
102 bathtub cat 0.04 0.0 0.0 0.055
104 cradle bus 0.081 0.0 0.0 0.112
112 field bus 0.05 0.0 0.0 0.051
120 sidewalk track 0.042 0.001 0.023 0.075
125 table wineglass 0.049 0.0 0.043 0.047
126 stove bottle 0.064 0.029 0.005 0.163
131 signboard body 0.043 0.0 0.06 0.069
134 bathtub boat 0.088 0.001 0.005 0.059
141 skyscraper cage 0.065 0.001 0.0 0.068
155 mountain train 0.091 0.0 0.038 0.058
172 car bus 0.055 0.0 0.015 0.101
173 car bus 0.052 0.0 0.013 0.099
183 person horse 0.041 0.016 0.003 0.187
184 cradle cat 0.046 0.0 0.0 0.042
185 chair horse 0.077 0.014 0.011 0.153
186 person bird 0.051 0.001 0.017 0.094
191 swimming pool pottedplant 0.044 0.0 0.0 0.072
198 pool table ceiling 0.064 0.035 0.001 0.066
208 shelf bus 0.047 0.0 0.0 0.062
217 toilet hair 0.049 0.001 0.0 0.055
218 case track 0.044 0.001 0.0 0.165
223 house airplane 0.069 0.0 0.0 0.055
231 grandstand screen 0.097 0.0 0.007 0.047
234 bridge train 0.05 0.0 0.014 0.042
239 pool table horse 0.069 0.011 0.0 0.171
248 bed tvmonitor 0.127 0.0 0.027 0.063
251 sofa pottedplant 0.073 0.0 0.033 0.053
257 tent bus 0.042 0.0 0.005 0.279
260 flower food 0.082 0.033 0.064 0.069
267 apparel car 0.042 0.023 0.0 0.045
278 field sheep 0.06 0.0 0.0 0.044
289 pool table food 0.049 0.024 0.0 0.059
293 shelf bottle 0.074 0.025 0.0 0.105
298 path motorbike 0.047 0.0 0.0 0.068
305 waterfall mountain 0.057 0.049 0.0 0.047
309 washer bus 0.109 0.0 0.013 0.065
322 ball sheep 0.054 0.0 0.005 0.044
324 mountain motorbike 0.071 0.0 0.015 0.048
327 waterfall bird 0.055 0.001 0.0 0.087
337 water boat 0.072 0.031 0.053 0.109
341 sea boat 0.153 0.014 0.0 0.076
347 palm bicycle 0.051 0.001 0.0 0.083
354 cradle chair 0.042 0.03 0.0 0.053
357 rock sheep 0.058 0.0 0.006 0.061
360 pool table bird 0.048 0.0 0.0 0.041
379 bridge bus 0.092 0.0 0.03 0.046
383 bed curtain 0.069 0.064 0.01 0.079
384 washer bicycle 0.043 0.018 0.0 0.201
387 hovel train 0.04 0.0 0.0 0.085
408 toilet bottle 0.045 0.002 0.0 0.099
412 bed airplane 0.079 0.0 0.008 0.086
413 pool table motorbike 0.09 0.0 0.003 0.07
415 seat tvmonitor 0.044 0.0 0.0 0.045
419 bed tvmonitor 0.061 0.0 0.016 0.054
422 seat tvmonitor 0.089 0.0 0.0 0.056
434 case cup 0.047 0.001 0.0 0.041
444 sofa plant 0.045 0.009 0.014 0.09
446 pool table tvmonitor 0.193 0.0 0.006 0.086
475 cradle train 0.061 0.0 0.0 0.132
477 desk tvmonitor 0.104 0.0 0.0 0.085
483 swivel chair horse 0.052 0.006 0.0 0.041
485 sofa airplane 0.056 0.0 0.024 0.045
502 washer train 0.07 0.0 0.006 0.134
Table 5: For each neuron from Tab. 4 that is assigned radically different concepts when explanations are computed using ADE20k vs. Pascal, we also compute the IOU score of each label under the other probe dataset. Apart from the 7 neurons marked in red, all of these IOU scores are below 0.04, suggesting that the disagreement is not due to polysemantic neurons.

TCAV. In Table 6, we report the cosine similarities between the concept activation vectors learned using ADE20k and Pascal as probe datasets, for all 32 concepts that have a base rate of at least 1%. On the whole, the two vectors for a given concept are not very similar, even though each predicts its concept well (high AUC).
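As a rough illustration of this comparison (a sketch under the assumption that intermediate-layer activations for images with and without the concept are precomputed; the exact linear model and training details may differ), each concept activation vector is the unit-normalized weight vector of a linear classifier, and the two vectors for a concept are compared via their dot product:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_with_concept, acts_without_concept):
    """Fit a linear classifier separating activations of images that contain the
    concept from those that do not; return the unit-normalized weight vector."""
    X = np.concatenate([acts_with_concept, acts_without_concept])
    y = np.concatenate([np.ones(len(acts_with_concept)),
                        np.zeros(len(acts_without_concept))])
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
    return w / np.linalg.norm(w)

# Cosine similarity between the CAVs for the same concept learned from the two
# probe datasets (both vectors are unit length, so the dot product suffices):
# cos_sim = float(np.dot(cav_ade20k, cav_pascal))
```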

Concept ADE20k AUC Pascal AUC Cos.sim. Concept ADE20k AUC Pascal AUC Cos.sim.
bag 79.4 75.4 0.006 book 90.4 84.6 0.138
bottle 88.5 85.6 0.035 box 83.0 80.1 0.086
building 97.4 90.0 0.161 cabinet 91.3 92.4 0.03
car 96.9 90.3 0.147 ceiling 96.6 93.0 0.267
chair 90.5 89.6 0.034 curtain 91.6 89.5 0.112
door 81.5 87.8 0.134 fence 86.1 84.7 0.09
floor 97.4 92.1 0.208 grass 95.1 91.7 0.04
light 92.4 85.0 0.043 mountain 94.2 90.8 0.02
painting 94.8 91.4 0.116 person 92.2 92.1 0.253
plate 90.6 94.8 -0.009 pole 89.0 79.3 0.059
pot 79.3 85.2 0.142 road 98.0 91.8 0.041
rock 92.6 82.8 -0.024 sidewalk 97.0 92.5 0.071
signboard 90.6 76.5 0.091 sky 98.9 79.8 0.104
sofa 95.9 91.2 -0.009 table 93.4 93.5 0.06
tree 96.8 89.2 0.172 wall 95.9 91.3 0.027
water 95.2 94.6 0.078 windowpane 91.5 90.1 0.078
Table 6: Cosine similarities between the concept activation vectors learned using the ADE20k and Pascal datasets, alongside each vector's AUC for predicting its concept. In general, the vectors learned from the two probe datasets are not well aligned.

B.2 Difference in probe dataset distribution

The first way we examine the differences between the two probe datasets is to compare the base rates of concepts within each dataset. As noted in Section 3 of the main paper, there are some sizable differences. Figure 5 shows the base rates for all concepts highlighted in Table 2 of the main paper. Concepts with very different base rates include wall (highlighted for bow-window when using ADE20k, but not Pascal), floor (highlighted for auto-showroom when using ADE20k, but not Pascal), dog (highlighted for corn-field when using Pascal, but not ADE20k), and pole (highlighted for hardware-store when using Pascal, but not ADE20k).
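Here, the base rate of a concept is simply the fraction of images in a probe dataset whose labels include that concept; a small sketch (with a hypothetical input format) of how these rates can be computed:

```python
from collections import Counter

def concept_base_rates(per_image_concepts):
    """per_image_concepts: list with one set of concept names per image.
    Returns the fraction of images in which each concept appears."""
    counts = Counter()
    for concepts in per_image_concepts:
        counts.update(set(concepts))
    n = len(per_image_concepts)
    return {concept: count / n for concept, count in counts.items()}
```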

Refer to caption
Figure 5: Different concepts have very different base rates across Pascal and ADE20k. The graph shows the base rates for the different concepts highlighted within Table 2 in the main paper.

Beyond the base rates, however, the images themselves look very different across the two datasets. We visualize random images from different scenes in Figure 6 and find, for example, that images labelled bedroom in Pascal tend to show a person or animal sleeping on a bed, with little of the rest of the bedroom visible, whereas ADE20k features images of full bedrooms. Similarly, images labelled tree-farm contain people in Pascal, but do not in ADE20k.

Refer to caption
Figure 6: Example images from ADE20k and Pascal for 4 scene classes that received very different explanations in Table 2 of the main paper. These classes have very different distributions across the two datasets; for example, the images labelled as bedroom in Pascal tend to show an animal or person on a bed, whereas the ones from ADE20k do not.

Upper bounds.

Finally, we present a simple method for comparing the probe dataset to the original training dataset, by noting that the probe dataset establishes a strict upper bound on the fraction of the model's predictions that can be explained. This is intuitively true since the set of labeled semantic concepts is finite, but the issue actually goes deeper. Consider the following experiment: we take the original black-box model, run it on a probe dataset to make predictions, and then train a new classifier to emulate those predictions. If this classifier is restricted to use only the labeled concepts, then it resembles a concept-based explanation. However, even if it is trained on the rich underlying visual features, it would not perform perfectly, due to the differences between the original training dataset and the probe dataset.

Concretely, consider a black-box ResNet18-based [20] model trained on the Places365 [55] dataset. We reset and re-train its final linear classification layer on the Pascal [13] probe dataset to emulate the original scene predictions; this achieves only 63.7% accuracy. Similarly, with ADE20k [57, 58] as the probe dataset, it achieves a slightly better 75.7% accuracy, suggesting that this dataset is somewhat more similar to Places365 than Pascal, but still far from fully capturing its distribution. This is not to suggest that the only way to generate concept-based explanations is to collect concept labels for the original training set (which may lead to overfitting); rather, it is important to acknowledge this limitation and to quantify explanation methods with such upper bounds.
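A sketch of this upper-bound computation is below, assuming a frozen backbone that returns flattened penultimate features and access to the original classification head; the function name, data loader, and training hyperparameters are placeholders rather than a prescribed recipe. The reported upper bound is the agreement between the retrained head and the black-box model's predictions on held-out probe images.

```python
import torch
import torch.nn as nn

def fit_upper_bound_head(backbone, original_head, probe_loader,
                         feat_dim=512, num_classes=365,
                         epochs=10, lr=1e-3, device="cuda"):
    """Re-train only a fresh final linear layer to emulate the frozen model's
    own predictions on the probe dataset."""
    backbone.eval(); original_head.eval()
    emulator = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.Adam(emulator.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, _ in probe_loader:
            images = images.to(device)
            with torch.no_grad():
                feats = backbone(images)                      # penultimate features
                targets = original_head(feats).argmax(dim=1)  # black-box predictions
            loss = loss_fn(emulator(feats), targets)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return emulator
```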

Similarly, we can ask how well the Concept Bottleneck model [29] can be explained using the CUB test set. In this case, since the training and test distributions are (hopefully!) similar, we would expect the upper bound to be reasonably high. We check this with the same setup and find that this is indeed the case: resetting and retraining the final linear layer, using the model's predictions as targets, achieves an accuracy of 89.3%.

Appendix C Concepts used: more details

Here, we provide additional results on learning the CUB concepts from Section 4.2 of the main paper. The CUB dataset was used by Concept Bottleneck [29], an interpretable-by-design model that learns the concepts as an intermediate layer within the network and then uses these concepts to predict the target class. Figure 7 contains histograms of the normalized AP scores for the 112 concepts from CUB [48], as well as the normalized APs for the target bird classes learned by the model. As with the Broden [5] concepts, we learn a linear model using features from an ImageNet [43]-trained ResNet18 [20] model. On average, the bird classes are learned much better than the concepts.
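For concreteness, a sketch of how a single concept's AP can be measured with a linear probe on frozen features is shown below (the normalized AP reported in the paper additionally adjusts for differences in base rate across concepts and classes; that normalization step, and the feature extraction itself, are omitted here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def concept_ap(train_feats, train_labels, test_feats, test_labels):
    """Linear probe for one binary concept on frozen ResNet18 features;
    returns average precision on held-out images."""
    probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    scores = probe.decision_function(test_feats)
    return average_precision_score(test_labels, scores)
```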

Refer to caption
Refer to caption
Figure 7: We compare the normalized APs when learning the CUB concepts (left) to the normalized APs of the CUB target classes for the Concept Bottleneck model (right). On average, the concepts are much harder to learn.

Appendix D Human study details

In Section 5 of the main paper, we discuss the human studies we ran to understand how well humans are able to reason about concept-based explanations as the number of concepts used within the explanation increases. In this section, we provide additional details.

To recap, we compare four types of explanations: (1) concept-based explanations that use 8 concepts, (2) concept-based explanations that use 16 concepts, (3) concept-based explanations that use 32 concepts, and (4) example-based explanations that consist of 10 example images for which the model predicts a certain class. (4) is a baseline that doesn’t use concepts.

For a fair comparison, all four types of explanations are evaluated on the same inputs. We generate five input sets, where each set consists of 5 images from one scene group (commercial buildings, shops, markets, cities, and towns) and 5 images from another scene group (home or hotel). Recall that these are images where the model output matches the explanation output (i.e., the class with the highest explanation score calculated from ground-truth concept labels). Hence, if a participant correctly identifies all concepts that appear in a given image, they are guaranteed to get the highest explanation score for the model output class.
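As an illustration of this matching criterion (a minimal sketch; the variable names are ours and the bias term is included only for generality), the explanation score of each class is a weighted sum of binary concept indicators, and the explanation's output is the highest-scoring class:

```python
import numpy as np

def explanation_prediction(concept_present, weights, bias=0.0):
    """concept_present: binary vector over the explanation's concepts.
    weights: (num_classes, num_concepts) explanation coefficients.
    Returns the per-class explanation scores and the highest-scoring class."""
    scores = weights @ concept_present + bias
    return scores, int(np.argmax(scores))
```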

To reduce variance with respect to the input, we had 5 participants for each combination of input set and explanation type. For the 32-concept explanations, each participant saw 5 images from only one of the two scene groups, because the study became too long and overwhelming with the full set of 10 images. For all other explanations, each participant saw the full set of 10 images. In total, we had 125 participants: 50 for the study with 32-concept explanations and 25 for each of the other three studies. Each participant saw only one type of explanation, as we conducted a between-group study.

More specifically, we recruited participants through Amazon Mechanical Turk who are US-based, have completed over 1000 Human Intelligence Tasks, and have a prior approval rate of at least 98%. The demographic distribution was: man 59%, woman 41%; no race/ethnicity reported 82%, White 17%, Black/African American 1%, Asian 1%. The self-reported machine learning experience was 2.5 ± 1.0, between “2: have heard about…” and “3: know the basics…”. We did not collect any personally identifiable information. Participants were compensated based on the state-level minimum wage of $12/hr. In total, approximately $800 was spent on running the human studies.

Appendix E User interface snapshots

In Section 5.1 of the main paper, we outlined our human study design. We note that much of our study design and UI is based on the recent work of Kim et al. [26], who propose HIVE, a human evaluation framework for evaluating visual interpretability methods. Here, we provide snapshots of our study UIs in the following order.

Study introduction. For each participant, we introduce the study, present a consent form, and receive informed consent for participation in the study. The consent form was approved by our institution’s Institutional Review Board and acknowledges that participation is voluntary, refusal to participate will involve no penalty or loss of benefits, etc. See Fig. 8.

Demographics and background. Following HIVE [26], we request optional demographic data regarding gender identity, race and ethnicity, as well as the participant’s experience with machine learning. We collect this information to help future researchers calibrate our results. See Fig. 9.

Method introduction. We introduce concept-based explanations in simple terms. This page is not shown for the study with example-based explanations. See Fig. 10.

Task preview. We present a practice example to help participants get familiar with the task. This page is not shown for the study with example-based explanations. See Fig. 11.

Part 1: Recognize concepts and guess the model output. After the preview, participants move on to the main task, where they are asked to recognize concepts in a given photo (for concept-based explanations) and predict the model output (for all explanations). We show the UI for each type of explanation we study:

  • 8-concept explanations (Fig. 12)

  • 16-concept explanations (Fig. 13)

  • 32-concept explanations (Fig. 14)

  • Example-based explanations (Fig. 15)

Part 2: Choose the ideal tradeoff between simplicity and correctness. Concept-based explanations can have varying levels of complexity/simplicity and correctness; hence, we investigate how participants reason about these two properties. To do so, we show examples of concept-based explanations that use different numbers of concepts, as well as bar plots with the correctness values for certain instantiations of concept-based explanations. We then ask participants to choose the explanation they prefer the most and to provide a short written justification for their choice. See Fig. 16.

Feedback. At the end of the study, participants can optionally provide feedback. See Fig. 17.

Refer to caption
Figure 8: UI - Study introduction
Refer to caption
Figure 9: UI - Demographics and background
Refer to caption
Refer to caption
Figure 10: UI - Method introduction
Refer to caption
Figure 11: UI - Task preview
Refer to caption
Refer to caption
Figure 12: UI - Part 1: Recognize concepts and guess the model output (8-concept explanations)
Refer to caption
Refer to caption
Figure 13: UI - Part 1: Recognize concepts and guess the model output (16-concept explanations)
Refer to caption
Refer to caption
Refer to caption
Figure 14: UI - Part 1: Recognize concepts and guess the model output (32-concept explanations)
Refer to caption
Refer to caption
Figure 15: UI - Part 1: Guess the model output (example-based explanations)
Refer to caption
Refer to caption
Refer to caption
Figure 16: UI - Part 2: Choose the ideal tradeoff between simplicity and correctness
Refer to caption
Figure 17: UI - Feedback