
Overlooked Factors in Concept-based Explanations:
Dataset Choice, Concept Learnability, and Human Capability

Vikram V. Ramaswamy, Sunnie S. Y. Kim, Ruth Fong, Olga Russakovsky
Princeton University
{vr23, suhk, ruthfong, olgarus}@cs.princeton.edu
Abstract

Concept-based interpretability methods aim to explain a deep neural network model’s components and predictions using a pre-defined set of semantic concepts. These methods evaluate a trained model on a new, “probe” dataset and correlate the model’s outputs with concepts labeled in that dataset. Despite their popularity, they suffer from limitations that are not well understood or articulated in the literature. In this work, we identify and analyze three commonly overlooked factors in concept-based explanations. First, we find that the choice of the probe dataset has a profound impact on the generated explanations. Our analysis reveals that different probe datasets lead to very different explanations, suggesting that the generated explanations are not generalizable outside the probe dataset. Second, we find that concepts in the probe dataset are often harder to learn than the target classes they are used to explain, calling into question the correctness of the explanations. We argue that only easily learnable concepts should be used in concept-based explanations. Finally, while existing methods use hundreds or even thousands of concepts, our human studies reveal a much stricter upper bound of 32 concepts or fewer, beyond which the explanations are much less practically useful. We discuss the implications of our findings and provide suggestions for future development of concept-based interpretability methods. Code for our analysis and user interface can be found at https://github.com/princetonvisualai/OverlookedFactors

1 Introduction

Performance and opacity are often correlated in deep neural networks: the highly parameterized nature of these models that enables them to achieve high task accuracy also reduces their interpretability. However, in order to responsibly use and deploy them, especially in high-risk settings such as medical diagnoses, we need these models to be interpretable, i.e., understandable by people. With the growing recognition of the importance of interpretability, many methods have been proposed in recent years to explain some aspects of neural networks and render them more interpretable (see [4, 14, 18, 42, 44, 53] for surveys).

Figure 1: Concept-based interpretability methods explain model components and/or predictions using a pre-defined set of semantic concepts. In this example, a scene classification model’s prediction bedroom is explained as a complex linear combination of 37 visual concepts, with the final explanation score calculated based on the presence or absence of these concepts. The coefficients are learned by evaluating the model on a new, “probe” dataset, and correlating its predictions with visual concepts labeled in that dataset. However, concept-based explanations can (1) be noisy and heavily dependent on the probe dataset, (2) use concepts that are hard to learn (all concepts in red are harder to learn than the class bedroom) and (3) be overwhelming to people due to the complexity of the explanation.

In this work, we dive into concept-based interpretability methods for image classification models, which explain model components and/or predictions using a pre-defined set of semantic concepts [5, 16, 25, 29, 56]. Given access to a trained model and a set of images labelled with semantic concepts (i.e., a “probe” dataset), these methods produce explanations with the provided concepts. See Fig. 1 for an example explanation.

Concept-based methods are a particularly promising approach for bridging the interpretability gap between complex models and human understanding, as they explain model components and predictions with human-interpretable units, i.e., semantic concepts. Recent work finds that people prefer concept-based explanations over other forms (e.g., heatmap and example-based) because they resemble human reasoning and explanations [27]. Further, concept-based methods uniquely provide a global, high-level understanding of a model, e.g., how it predicts a certain class [56, 39] and what the model (or some part of it) has learned [25, 5, 16]. These insights are difficult to gain from local explanation methods that only provide an explanation for a single model prediction, such as saliency maps that highlight relevant regions within an image.

However, existing research on concept-based interpretability methods focuses heavily on new method development, ignoring important factors such as the probe dataset used to generate explanations or the concepts composing the explanations. Outside the scope of concept-based methods, several recent works study the effect of different factors on explanations. These works, however, are either limited to saliency maps [1, 28, 31, 41] or make a general call for transparency, e.g., to include more information when releasing an interpretability method [47].

In this work, we conduct an in-depth study of commonly overlooked factors in concept-based interpretability methods. Concretely, we analyze four representative methods: NetDissect [5], TCAV [25], Concept Bottleneck [29] and IBD [56]. These are a representative and comprehensive set of existing concept-based interpretability methods for computer vision models. Using multiple probe datasets (ADE20k [57, 58] and Pascal [13] for NetDissect, TCAV and IBD; CUB-200-2011 [48] for Concept Bottleneck), we examine the effects of (1) the choice of probe dataset, (2) the concepts used within the explanation, and (3) the complexity of the explanation. Through our analyses, we learn a number of key insights, which we summarize below:

  • The choice of the probe dataset has a profound impact on explanations. We repeatedly find that different probe datasets give rise to different explanations, when explaining the same model with the same interpretability method. For instance, the prediction of the arena/hockey class is explained with concepts {grandstand, goal, ice-rink, skate-board} with one probe dataset, and {plaything, road} with another probe dataset. We highlight that concept-based explanations are not solely determined by the model or the interpretability method. Hence, probe datasets should be chosen with caution. Specifically, we suggest using probe datasets whose data distribution is similar to that of the dataset the model-being-explained was trained on.

  • Concepts used in explanations are frequently harder to learn than the classes they aim to explain. The choice of concepts used in explanations is dependent on the available concepts in the probe dataset. Surprisingly, we find that learning some of these concepts is harder than learning the target classes. For example, in one experiment we find that the target class bathroom is explained using concepts {toilet, shower, countertop, bathtub, screen-door}, all of which are harder to learn than bathroom. Moreover, these concepts can be hard for people to identify, limiting the usefulness of these explanations. We argue that learnability is a necessary (albeit not sufficient) condition for the correctness of the explanations, and advocate for future explanations to only use concepts that are easily learnable. (Ideally, future methods would also provide causal rather than purely correlation-based explanations.)

  • Current explanations use hundreds or even thousands of concepts, but human studies reveal a much stricter upper bound. We conduct human studies with 125 participants recruited from Amazon Mechanical Turk to understand how well people reason with concept-based explanations with varying numbers of concepts. We find that participants struggle to identify relevant concepts in images as the number of concepts increases (the percentage of concepts recognized per image decreases from 71.7% ± 27.7% with 8 concepts to 56.8% ± 24.9% with 32 concepts). Moreover, the majority of the participants prefer that the number of concepts be limited to 32. We also find that concept-based explanations offer little to no advantage in predicting model output compared to example-based explanations (the participants’ mean accuracy at predicting the model output when given access to explanations with 8 concepts is 64.8% ± 23.9%, whereas the accuracy when given access to example-based explanations is 60.0% ± 30.2%).

These findings highlight the importance of vetting intuitions when developing and using interpretability methods. We have open-sourced our analysis code and human study user interface to aid with this process in the future: https://github.com/princetonvisualai/OverlookedFactors.

2 Related work

Interpretability methods for computer vision models range from highlighting areas within an image that contribute to a model’s prediction (i.e., saliency maps) [9, 15, 37, 45, 46, 51, 52, 54] to labelling model components (e.g., neurons) [5, 16, 25, 56], highlighting concepts that contribute to the model’s prediction [56, 39], and designing models that are interpretable-by-design [8, 10, 29, 34]. In this work, we focus on concept-based interpretability methods. These include post-hoc methods that label a trained model’s components and/or predictions [5, 16, 25, 56, 39] and interpretable-by-design methods that use pre-defined concepts [29]. We focus on methods for image classification models, where most interpretability research has been and is being conducted. Concept-based methods have recently been developed for other types of models (e.g., image similarity models [38], language models [50, 7]); however, these are outside the scope of this paper.

Our work is similar in spirit to a growing group of works that propose checks and evaluation protocols to better understand the capabilities and limitations of interpretability methods [1, 3, 2, 21, 23, 26, 28, 32, 41, 49]. Many of these works examine how sensitive post-hoc saliency maps are to different factors such as input perturbations, model weights, or the output class being explained. On the other hand, we conduct an in-depth study of concept-based interpretability methods. Despite their popularity, little is understood about their interpretability and usefulness to human users, or their sensitivity to auxiliary inputs such as the probe dataset. We seek to fill this gap with our work and assist with future development and use of concept-based interpretability methods. To the best of our knowledge, we are the first to investigate the effect of the probe dataset and concepts used for concept-based explanations. There has been work investigating the effect of explanation complexity on human understanding [30], however, it is limited to decision sets.

We also echo the call for releasing more information when releasing datasets [17], models [12, 33] and interpretability methods [47]. More concretely, we suggest that concept-based interpretability method developers include results from our proposed analyses in their method release, in addition to filling out the explainability fact sheet proposed by Sokol et al. [47], to help researchers and practitioners better understand, use, and build on these methods.

3 Dataset choice: Probe dataset has a profound impact on the explanations

Scene class | Top concepts from ADE20k-generated explanations | Top concepts from Pascal-generated explanations
arena/hockey | grandstand, goal, ice-rink, scoreboard | plaything, road
auto-showroom | car, light, trade-name, floor, wall | car, stage, grandstand, baby-buggy, ground
bedroom | bed, cup, tapestry, lamp, blind | bed, frame, wood, sofa, bedclothes
bow-window | windowpane, seat, cushion, wall, heater | windowpane, tree, shelves, curtain, cup
conf-room | swivel-chair, table, mic, chair, document | bench, napkin, plate, candle, table
corn-field | field, plant, sky, streetlight | tire, sky, dog, water, signboard
garage/indoor | bicycle, brush, car, tank, ladder | bicycle, vending-mach, tire, motorbike, floor
hardware-store | shelf, merchandise, pallet, videos, box | rope, shelves, box, bottle, pole
legis-chamber | seat, chair, pedestal, flag, witness-stand | mic, book, paper
tree-farm | tree, hedge, land, path, pole | tree, tent, sheep, mountain, rock
Table 1: Impact of probe dataset on Baseline (Sec. 3). We compare Baseline explanations generated using ADE20k vs. Pascal. For 10 randomly selected scene classes, we show concepts with the largest coefficients in each explanation. In bold are concepts in one explanation but not the other, e.g., the concept grandstand is important for explaining the arena/hockey scene prediction when using ADE20k, but not when using Pascal. These results show that the probe dataset has a huge impact on the explanations.
Neuron | ADE20k label | ADE20k score | Pascal label | Pascal score
9 | plant | 0.082 | potted-plant | 0.194
181 | plant | 0.068 | potted-plant | 0.140
318 | computer | 0.079 | tv | 0.251
386 | autobus | 0.067 | bus | 0.200
435 | runway | 0.071 | airplane | 0.189
185 | chair | 0.077 | horse | 0.153
239 | pool-table | 0.069 | horse | 0.171
257 | tent | 0.042 | bus | 0.279
384 | washer | 0.043 | bicycle | 0.201
446 | pool-table | 0.193 | tv | 0.086
Table 2: Impact of probe dataset on NetDissect [5] (Sec. 3). We compare NetDissect explanations (concept labels) for 10 neurons of the model-being-explained generated using ADE20k vs. Pascal. We find that while some neurons correspond to the same or similar concepts (top half), others correspond to wildly different concepts (bottom half), highlighting the impact of the probe dataset.

Concept-based explanations are generated by running a trained model on a “probe” dataset (typically not the training dataset) in which concepts are labelled. The choice of probe dataset has been almost entirely dictated by which datasets have concept labels. The most commonly used dataset is the Broden dataset [5]. It contains images from four datasets (ADE20k [57, 58], Pascal [13], OpenSurfaces [6], Describable Textures Dataset [11]) and labels for over 1190 concepts, spanning objects, object parts, colors, scenes, and textures.

In this section, we investigate the effect of the probe dataset by comparing explanations generated using two different subsets of the Broden dataset: ADE20k and Pascal. We experiment with three different methods for generating concept-based explanations: Baseline, NetDissect [5], and TCAV [25], and find that the generated explanations heavily depend on the choice of probe dataset. This finding implies that these explanations can only be used for images drawn from the same distribution as the probe dataset.

Model explained. Following prior work [5, 25, 56], we explain a ResNet18-based [20] scene classification model trained on the Places365 dataset [55], which predicts one of 365 scene classes given an input image.

Probe datasets. We use two probe datasets: ADE20k [57, 58] (19733 images, license: BSD 3-Clause) and Pascal [13] (10103 images, license: unknown). (To the best of our knowledge, most images used do not include personally identifiable information or offensive content; however, some feature people without their consent and might contain identifiable information.) The two datasets are different subsets of the Broden dataset [5] and are labelled with objects and parts. We randomly split each dataset into training (60%), validation (20%), and test (20%) sets, using the new training set for learning explanations, the validation set for tuning hyperparameters (e.g., learning rate and regularization parameters), and the test set for reporting our findings.

Interpretability methods. We investigate the effect of the probe dataset on three types of concept-based explanations. First, we study a simple Baseline method that measures correlations between the model’s prediction and concepts, and generates class-level explanations as a linear combination of concepts as in Fig. 1. Similar to Ramaswamy et al. [39], we learn a logistic regression model that matches the model-being-explained’s prediction, given access to ground-truth concept labels within the image. We use an l1 penalty to prioritize explanations with fewer concepts. Second, we study NetDissect [5], which identifies neurons within the model-being-explained that are highly activated by certain concepts and generates neuron-level explanations (concept labels); we use the code provided by the authors: https://github.com/CSAILVision/NetDissect-Lite. Finally, we study TCAV [25], which generates explanations in the form of concept activation vectors, i.e., vectors within the model-being-explained’s feature space that correspond to labelled concepts.
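To make the Baseline concrete, the following is a minimal sketch of how such an explanation could be fit with scikit-learn. The variable names and synthetic data are ours; in practice, the regularization strength is tuned on the validation set as described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs (names are ours, not from the released code):
#   concepts:    (n_images, n_concepts) binary ground-truth concept labels
#   model_preds: (n_images,) classes predicted by the model-being-explained
rng = np.random.default_rng(0)
concepts = rng.integers(0, 2, size=(1000, 50))
model_preds = rng.integers(0, 4, size=1000)

# Fit one sparse (l1-penalized) logistic regression per predicted class,
# so each class is explained by a short linear combination of concepts.
explanations = {}
for cls in np.unique(model_preds):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(concepts, (model_preds == cls).astype(int))
    explanations[cls] = clf.coef_.ravel()

# Concepts with the largest coefficients form the reported explanation.
top5 = np.argsort(explanations[0])[::-1][:5]
print("Top concept indices for class 0:", top5)
```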

Results. For all three explanation types, we find that using different probe datasets results in very different explanations. To begin, we show in Tab. 1 how Baseline explanations differ when using ADE20k vs. Pascal as the probe dataset. For example, when explaining the corn-field scene prediction, the Pascal-generated explanation highlights dog as important, whereas the ADE20k-generated explanation does not. For the legis-chamber scene, ADE20k highlights chair as important, whereas Pascal does not.

We observe a similar difference for NetDissect (see Tab. 2). We label 123 neurons separately using ADE20k and Pascal, and find that 60 of them are given very different concept labels (e.g., neuron 239 is labelled pool-table by ADE20k and horse by Pascal). (It is possible that these neurons are poly-semantic, i.e., neurons that respond to multiple concepts, as noted in [16, 35]. However, as we explore in the supp. mat., the score for the concept from the other dataset is usually below 0.04, the threshold used in [5] to identify “highly activated neurons.”) Again, this result highlights the impact of the probe dataset on explanations.
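For intuition, below is a simplified sketch of NetDissect-style scoring for a single neuron (our own code, not the authors' released implementation). It assumes the neuron's activation maps have already been upsampled to the segmentation resolution and uses a top-quantile threshold in the spirit of the original method.

```python
import numpy as np

def netdissect_ious(activations, concept_masks, quantile=0.995):
    """Score one neuron against each concept (simplified sketch).

    activations:   (n_images, H, W) activation maps for the neuron,
                   upsampled to the segmentation resolution.
    concept_masks: dict mapping concept name -> (n_images, H, W) binary masks.
    Returns the IoU between the thresholded activations and each concept.
    """
    # Per-neuron threshold: keep roughly the top 0.5% of activation values.
    thresh = np.quantile(activations, quantile)
    binarized = activations > thresh

    ious = {}
    for concept, masks in concept_masks.items():
        intersection = np.logical_and(binarized, masks).sum()
        union = np.logical_or(binarized, masks).sum()
        ious[concept] = intersection / max(union, 1)
    return ious

# The neuron is labelled with the concept of highest IoU; NetDissect keeps
# the label only if that IoU exceeds 0.04.
```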

Similarly, TCAV concept activation vectors learned using ADE20k vs. Pascal are different, i.e., they have low cosine similarity (see Fig. 2). We compute concept activation vectors for 32 concepts that have a base rate of over 1% in both datasets combined, then calculate the cosine similarity between the two vectors (one per probe dataset) for each concept. We also compute the ROC AUC for each concept vector to measure how well the concept vector corresponds to the concept. We find that the similarity is low (0.078 on average), even though the selected concepts were those that can be learned reasonably well (mean ROC AUC for these concepts is over 85%). We suspect that the explanations are radically different due to differences in the probe dataset distribution. For instance, some concepts have very different base rates in the two datasets: dog has a base rate of 12.0% in Pascal but 0.5% in ADE20k; chair has a base rate of 16.7% in ADE20k but 13.5% in Pascal. We present more analyses in the supp. mat.

Concept | ADE20k AUC | Pascal AUC | Cosine sim
ceiling | 96.6 | 93.0 | 0.267
box | 83.0 | 80.1 | 0.086
pole | 89.0 | 79.3 | 0.059
bag | 79.4 | 75.4 | 0.006
rock | 92.6 | 82.8 | -0.024
mean | 92.0 | 88.1 | 0.087
Figure 2: Impact of probe dataset on TCAV [25] (Sec. 3). We compare TCAV concept activation vectors learned using ADE20k vs. Pascal. (Top) For 5 concepts randomly selected out of 32, we show their learnability in each dataset (AUC) and cosine similarity between the two vectors. While these concepts can be learned reasonably well (AUCs are high), their learned activation vectors have low similarity (Cosine sim is low). (Bottom) The histogram of cosine similarity scores for all 32 concepts again shows that the two activation vectors for the same concept are not very similar.
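As a concrete illustration of the comparison above, the sketch below learns one concept activation vector per probe dataset as the (normalized) weight vector of a linear classifier in the model's feature space, then reports their cosine similarity. The data here is synthetic and the classifier settings are our assumptions; the original TCAV implementation may differ in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def concept_vector(features, has_concept):
    """Concept activation vector: normalized weight vector of a linear
    classifier separating features of images with vs. without the concept."""
    clf = LogisticRegression(penalty="l2", solver="liblinear", C=1.0)
    clf.fit(features, has_concept)
    # AUC on the training split here; a held-out split would be used in practice.
    auc = roc_auc_score(has_concept, clf.decision_function(features))
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w), auc

# Stand-ins for penultimate-layer features of the model-being-explained.
rng = np.random.default_rng(0)
feat_ade, lab_ade = rng.normal(size=(500, 512)), rng.integers(0, 2, 500)
feat_pas, lab_pas = rng.normal(size=(500, 512)), rng.integers(0, 2, 500)

cav_ade, auc_ade = concept_vector(feat_ade, lab_ade)
cav_pas, auc_pas = concept_vector(feat_pas, lab_pas)
print("ROC AUCs:", auc_ade, auc_pas)
print("Cosine similarity between the two vectors:", float(cav_ade @ cav_pas))
```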

4 Concept learnability: Concepts used are less learnable than target classes

In Sec. 3, we investigated how the choice of the probe dataset influences the generated explanations. In this section, we investigate the individual concepts used within explanations. An implicit assumption made in concept-based interpretability methods is that the concepts used in explanations are easier to learn than the target classes being explained. For instance, when explaining the class bedroom with the concept bed, we are assuming (and hoping) that the model first learns the concept bed, then uses this concept and others to predict the class bedroom. However, if bed is harder to learn than bedroom, this would not be the case. This assumption also aligns with works that argue that “simpler” concepts (e.g., edges and textures) are learned in early layers and “complex” concepts (e.g., parts and objects) are learned in later layers [5, 16].

We thus investigate the learnability of concepts used by different explanation methods. Somewhat surprisingly, we find that the concepts used are frequently harder to learn than the target classes, raising concerns about the correctness of concept-based explanations.

Figure 3: Overall comparison of concept vs. class learnability (Sec. 4). We compare the learnability, quantified as normalized AP of concept/class predictors, of Broden concepts (top) vs. Places365 scene classes (below). Overall, the concepts have much lower normalized AP (i.e., are harder to learn) than the classes.

Setup. To compare the learnability of concepts vs. classes, we learn models for the concepts (the learnability of the classes is already known from the model-being-explained). Concretely, we extract features for the probe dataset using an ImageNet [43]-pretrained ResNet18 [20] model and train a linear model using sklearn’s [36] LogisticRegression to predict concepts from the ResNet18 features. (We also tried using features from a Places365-pretrained model and did not find a significant difference.) We do so for the two most commonly used probe datasets: Broden [5] and CUB-200-2011 [48]. Broden concepts are frequently used to explain Places365 classes (as done in NetDissect [5], Net2Vec [16], IBD [56], and ELUDE [39]), while CUB concepts are used to explain the CUB target classes (as done in Concept Bottleneck [29] and ELUDE [39]).

Evaluation. We evaluate learnability with normalized average precision (AP) [22]. We choose normalized AP for two reasons: first, to avoid having to set a threshold and second, to fairly compare concepts and scenes that have very different base rates. In our experiments, we set the base rate to be that of the classes: 1/365 when comparing Broden concepts vs. Places365 classes and 1/200 when comparing CUB concepts vs. CUB classes.
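As a reference, one way to compute a base-rate-normalized AP is to reweight precision so that positives make up the chosen base rate. The sketch below follows our reading of the normalization in [22] and may differ in detail from the exact implementation used.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def normalized_ap(y_true, y_score, base_rate):
    """AP with precision reweighted so that positives make up `base_rate`
    of the data (a sketch of base-rate normalization in the spirit of [22])."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    n_pos, n_neg = y_true.sum(), len(y_true) - y_true.sum()
    tp = recall * n_pos  # true positives at each threshold
    fp = np.divide(tp * (1 - precision), precision,
                   out=np.full_like(tp, float(n_neg)), where=precision > 0)
    fpr = fp / n_neg     # false-positive rate at each threshold
    prec_norm = (recall * base_rate) / (
        recall * base_rate + fpr * (1 - base_rate) + 1e-12)
    # sklearn-style AP: precision weighted by increments in recall.
    return float(-np.sum(np.diff(recall) * prec_norm[:-1]))

# Toy usage with the Places365 base rate of 1/365.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05])
print(normalized_ap(y_true, y_score, base_rate=1 / 365))
```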

Results. In both settings, we find that the concepts are much harder to learn than the target classes. The median normalized AP for Broden concepts is 7.6%, much lower than 37.5% of Places365 classes. Similarly, the median normalized AP for CUB concepts is 2.3%, much lower than 65.9% of CUB classes. Histograms of normalized APs are shown in Fig. 3 (Broden/Places365) and the supp. mat. (CUB).

However, is it possible that each class is explained by concepts that are more learnable than the class? Our investigation with IBD [56] explanations suggests this is not the case. IBD greedily learns a basis of concept vectors, as well as a residual vector, and decomposes each model prediction into a linear combination of the basis and residual vectors (we use the code provided by the authors: https://github.com/CSAILVision/IBD). For 10 randomly chosen scene classes, we compare the normalized AP of the scene class vs. the 5 concepts with the highest coefficients (i.e., the 5 concepts that are most important for explaining the prediction). See Tab. 3 for the results. We find that all 10 scene classes are explained with at least one concept that is harder to learn than the class. For some classes (e.g., bathroom, kitchen), all concepts used in the explanation are harder to learn than the class.

Scene class (normalized AP) | Top-5 IBD concepts (normalized AP)
arena/perform (38.8) | tennis court (74.0), grandstand (44.4), ice rink (40.7), valley (19.0), stage (11.9)
art-gallery (27.4) | binder (42.6), drawing (10.8), painting (10.5), frame (2.5), sculpture (0.7)
bathroom (43.3) | toilet (39.9), shower (18.8), countertop (12.6), bathtub (11.1), screen door (9.6)
kasbah (50.2) | ruins (64.3), desert (17.3), arch (16.2), dirt track (8.9), bottle rack (4.2)
kitchen (33.9) | work surface (24.8), stove (18.2), cabinet (10.3), refrigerator (8.8), doorframe (2.8)
lock-chamber (36.5) | water wheel (47.4), dam (43.7), boat (16.1), embankment (4.8), footbridge (4.1)
pasture (19.2) | cow (63.7), leaf (21.1), valley (19.0), field (6.8), slope (4.1)
physics-lab (17.1) | computer (25.4), machine (4.5), monitor-device (3.3), bicycle (1.7), sewing-machine (1.5)
store/indoor (20.4) | shanties (72.5), patty (18.5), bookcase (13.5), shelf (4.2), cup (1.3)
water-park (38.3) | roller coaster (73.0), hot tub (59.1), playground (44.9), ride (38.0), swimming pool (36.7)
Table 3: Class-level comparison of concept vs. class learnability (Sec. 4). We report normalized AP scores (higher indicates greater learnability) for 10 randomly chosen scene classes, along with the 5 concepts with the highest IBD explanation coefficients for each. Concepts whose normalized AP scores are lower than that of the scene class are shown in red, whereas concepts with higher scores are shown in blue. All scenes are explained by at least one concept with a lower normalized AP, and some scenes are only explained by concepts with lower normalized AP.

Our experiments show that a significant fraction of the concepts used by existing concept-based interpretability methods are harder to learn than the target classes, issuing a wake-up call to the field. In the following section, we show that these concepts can also be hard for people to identify.

5 Human capability: Human studies reveal an upper bound of 32 concepts

Existing concept-based explanations use a large number of concepts: NetDissect [5] and Net2Vec [16] use all 1197 concepts labelled within the Broden [5] dataset; IBD [56] uses Broden object and art concepts with at least 10 examples (660 concepts); and Concept Bottleneck [29] uses all concepts that are predominantly present for at least 10 classes from CUB [48] (112 concepts). However, can people actually reason with this many concepts?

In this section, we study this important yet overlooked aspect of concept-based explanations: explanation complexity and how it relates to human capability and preference. Specifically, we investigate: (1) How well do people recognize concepts in images? (2) How do the (concept recognition) task performance and time change as the number of concepts varies? (3) How well do people predict the model output for a new image using explanations? (4) How do people trade off simplicity and correctness of concept-based explanations? To answer these questions, we design and conduct a human study. We describe the study design in Sec. 5.1 and report findings in Sec. 5.2.

5.1 Human study design

We build on the study design and user interface (UI) of HIVE [26], and design a two-part study to understand how understandable and useful concept-based explanations are to human users with potentially limited knowledge about machine learning. To the best of our knowledge, we are the first to investigate such properties of concept-based explanations for computer vision models. (There are works examining the complexity of explanations for other types of models: for example, Lage et al. [30] investigate the complexity of explanations over decision sets, and Bolukbasi et al. [7] do so for concept-based explanations of language models.)

Part 1: Recognize concepts and predict the model output. First, we present participants with an image and a set of concepts and ask them to identify whether each concept is present or absent in the image. We also show explanations for 4 classes whose scores are calculated in real time based on the concepts selected. As a final question, we ask participants to select the class they think the model predicts for the given image. See Fig. 4 (left) for the study UI.

To ensure that the task is doable and is only affected by explanation complexity (number of concepts used) and not the complexity of the model and its original prediction task (e.g., 365-way scene classification), we generate explanations for only 4 classes and ask participants to identify which of the 4 classes corresponds to the model’s prediction. We only show images where the model output matches the explanation output (i.e., the model predicts the class with the highest explanation score, calculated with ground-truth concept labels), since our goal is to understand how people reason with concept-based explanations of varying complexity.

Figure 4: Human study UI (Sec. 5). We show a simplified version of the UI we developed for our human studies. In Part 1, we ask participants to guess the model’s prediction for a given image by recognizing concepts and using the provided explanations. In Part 2, we show participants explanations with different levels of simplicity and correctness, then ask which one they prefer the most.

Part 2: Choose the ideal tradeoff between simplicity and correctness. Next, we ask participants to reason about two properties of concept-based explanations: simplicity, i.e., the number of concepts used in a given set of explanations, and correctness, i.e., the percentage of model predictions correctly explained by explanations, which is the percentage of times the model output class has the highest explanation score. See Fig. 4 (right) for the study UI. We convey the notion of a simplicity-correctness tradeoff through bar plots that show the correctness of explanations of varying simplicity/complexity (4, 8, 16, 32, 64 concepts). We then ask participants to choose the explanation they prefer the most and provide a short justification for their choice.

Full study design and experimental details. In summary, our study consists of the following steps. For each participant, we introduce the study, receive informed consent for participation, and collect information about their demographics (optional) and machine learning experience. We then introduce concept-based explanations in simple terms, and show a preview of the concept recognition and model output prediction task in Part 1. The participant then completes the task for 10 images. In Part 2, the participant indicates their preference for explanation complexity, given simplicity and correctness information. There are no foreseeable risks in participation, and our study design was approved by our institution’s IRB.

Using this study design, we investigate explanations that take the form of a linear combination of concepts (e.g., Baseline, IBD [56], Concept Bottleneck [29]). Explanations are generated using the Baseline method, which is a logistic regression model trained to predict the model’s output using concepts (see Sec. 3 for details). Note that we are evaluating the form of explanation (linear combination of concepts) rather than a specific explanation method. The choice of the method does not impact the task.
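For concreteness, the sketch below shows how an explanation score and the predicted class can be computed from a participant's concept selections. The coefficients and intercepts here are made-up stand-ins for the Baseline explanation weights.

```python
import numpy as np

# Hypothetical explanations for 4 classes over a 3-concept vocabulary:
# per-class coefficients and intercepts from the Baseline logistic regression.
coefs = np.array([[1.2, 0.0, -0.4],
                  [0.0, 0.9, 0.3],
                  [0.5, 0.5, 0.0],
                  [-0.2, 0.0, 1.1]])
intercepts = np.array([-0.5, -0.3, -0.4, -0.2])

# A participant's (or the ground-truth) binary concept selections for an image.
selected = np.array([1, 0, 1])

# Explanation score per class: linear combination of the selected concepts.
scores = coefs @ selected + intercepts
predicted_class = int(np.argmax(scores))
print("scores:", scores, "-> predicted model output:", predicted_class)
```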

Specifically, we compare four types of explanations: concept-based explanations that use (1) 8 concepts, (2) 16 concepts, (3) 32 concepts, and (4) example-based explanations that consist of 10 example images for which the model predicts a certain class. We include (4) as a method that doesn’t use concepts. In Jeyakumar et al. [24], this type of explanation is shown to be preferred over saliency-type explanations for image classification; here, we compare this to concept-based explanations.

For a fair comparison, all four are evaluated on the same set of images. In short, we conduct a between-group study with 125 participants recruited through Amazon Mechanical Turk. Participants were compensated based on the state-level minimum wage of $12/hr. In total, approximately $800 was spent on running human studies. See supp. mat. for more details.

5.2 Key findings from the human studies

When presented with more concepts, participants spend more time but are worse at recognizing concepts. The median time participants spend on each image is 17.4 sec. for 8-concept, 27.5 sec. for 16-concept, and 46.2 sec. for 32-concept explanations. This is expected, since participants are asked to make a judgment for each and every concept. When given example-based explanations with no such task, participants spend only 11.6 seconds on each image. Interestingly, the concept recognition performance, reported in terms of mean recall (i.e., the percentage of concepts in the image that are recognized) and standard deviation, decreases from 71.7% ± 27.7% (8 concepts) to 61.0% ± 28.5% (16 concepts) to 56.8% ± 24.9% (32 concepts). While these numbers are far from perfect recall (100%), participants are better at judging whether concepts are present when shown fewer concepts.

Concept-based explanations offer little to no advantage in model output prediction over example-based explanations. Indeed, we see that participants’ errors in concept recognition often result in an incorrect class having the highest explanation score. When predicting the model output as the class with the highest explanation score, calculated based on the participants’ concept selections, the mean accuracy and standard deviation of model output prediction are 64.8% ± 23.9% (8 concepts), 63.2% ± 26.9% (16 concepts), and 63.6% ± 22.2% (32 concepts). These are barely higher than the 60.0% ± 30.2% of example-based explanations, which are simpler and require less time to complete the task.

The majority of participants prefer explanations with 8, 16, or 32 concepts. When given options of explanations that use 4, 8, 16, 32, or 64 concepts, 82% of participants prefer explanations with 8, 16, or 32 concepts (28%, 33%, 21% respectively). Only 6% prefer those with 64 concepts, suggesting that existing explanations that use hundreds or even thousands of concepts do not cater to human preferences. In the written responses, many favored having fewer concepts (e.g., “the lesser, the better”) and expressed concerns against having too many (e.g., “I think 32 is a lot, but 16 is an adequate enough number that it could still predict well…”). In making the tradeoff, some valued correctness above all else (e.g., “Out of all the options, 32 is the most correct”), while others reasoned about marginal benefits (e.g., “I would prefer explanations that use 16 concepts because it seems that the difference in percentage of correctness is much closer and less, than other levels of concepts”). Overall, we find that participants actively reason about both simplicity and correctness of explanations.

6 Discussion

Our analyses yield immediate suggestions for improving the quality and usability of concept-based explanations. First, we suggest choosing a probe dataset whose distribution is similar to that of the dataset the model was trained on. Second, we suggest only using concepts that are more learnable than the target classes. Third, we suggest limiting the number of concepts used within an explanation to under 32, so that explanations are not overwhelming to people.

The final suggestion is easy to implement. However, the first two are easier said than done, since the number of available probe datasets (i.e., large-scale datasets with concept labels) is minimal, forcing researchers to use the Broden dataset [5] or the CUB dataset [48]. Hence, we argue that creating diverse and high-quality probe datasets is of utmost importance for research on concept-based explanations.

Another concern is that these methods do highlight hard-to-learn concepts when given access to them, suggesting that they sometimes capture correlations rather than causation. Methods by Goyal et al. [19], which output patches within the image that need to be changed for the model’s prediction to change, or Fong et al. [15], which find regions within the image that maximally contribute to the model’s prediction, are more in line with capturing causal relationships. However, these only produce local explanations, i.e., explanations of a single model prediction, and not class-level global explanations. One approach to capturing causal relationships is to generate counterfactual images with or without certain concepts using generative models [40] and observe changes in model predictions.

7 Limitations and future work

Our findings come with a few caveats. First, due to the lack of available probe datasets, we tested each concept-based interpretability method in a single setting. That is, we tested NetDissect [5], TCAV [25] and IBD [56] on a scene classifier trained on the Places365 dataset [55], and Concept Bottleneck [29] on the CUB dataset [48]. We plan to expand our analyses as more probe datasets become available. Second, all participants in our human studies were recruited from Amazon Mechanical Turk. This means that our participants represent a population with limited ML background: the self-reported ML experience was 2.5 ± 1.0 (on a scale of 1 to 5), which is between “2: have heard about…” and “3: know the basics…” We believe Part 1 results of our human studies (described in Sec. 5.1) will not vary with participants’ ML expertise or role in the ML pipeline, as we are only asking participants to identify concepts in images. However, Part 2 results may vary (e.g., developers debugging an ML model may be more willing to trade off explanation simplicity for correctness than lay end-users). Investigating differences in perceptions and uses of concept-based explanations across user groups is an important direction for future research.

8 Conclusion

In this work, we examined implicit assumptions made in concept-based interpretability methods along three axes: the choice of the probe dataset, the learnability of the used concepts, and the complexity of explanations. We found that the choice of the probe dataset profoundly influences the generated explanations, implying that these explanations can only be used for images from the probe dataset distribution. We also found that a significant fraction of the concepts used within explanations are harder for a model to learn than the target classes they aim to explain. Finally, we found that people struggle to identify concepts in images when given too many concepts, and that explanations with fewer than 32 concepts are preferred. We hope our proposed analyses and findings lead to more careful use and development of concept-based explanations.

Acknowledgements. We foremost thank our participants for taking the time to participate in our study. We also thank the authors of [5, 56, 29, 26] for open-sourcing their code and/or models. Finally, we thank the anonymous reviewers and the Princeton Visual AI Lab members (especially Nicole Meister) who provided helpful feedback on our work. This material is based upon work partially supported by the National Science Foundation under Grants No. 1763642, 2145198 and 2112562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We also acknowledge support from the Princeton SEAS Howard B. Wentz, Jr. Junior Faculty Award to OR, Princeton SEAS Project X Fund to RF and OR, Open Philanthropy Grant to RF, and NSF Graduate Research Fellowship to SK.

References

  • [1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In NeurIPS, 2018.
  • [2] Julius Adebayo, Michael Muelly, Harold Abelson, and Been Kim. Post hoc explanations may be ineffective for detecting unknown spurious correlation. In ICLR, 2022.
  • [3] Julius Adebayo, Michael Muelly, Ilaria Liccardi, and Been Kim. Debugging tests for model explanations. In NeurIPS, 2020.
  • [4] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 2020.
  • [5] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, 2017.
  • [6] Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM Trans. Graph., 2014.
  • [7] Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. An interpretability illusion for BERT, 2021.
  • [8] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In ICLR, 2018.
  • [9] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, 2018.
  • [10] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: Deep learning for interpretable image recognition. In NeurIPS, 2019.
  • [11] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.
  • [12] Anamaria Crisan, Margaret Drouhard, Jesse Vig, and Nazneen Rajani. Interactive model cards: A human-centered approach to model documentation. In FAccT, 2022.
  • [13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, 2010.
  • [14] Ruth Fong. Understanding convolutional neural networks. PhD thesis, University of Oxford, 2020.
  • [15] Ruth Fong, Mandela Patrick, and Andrea Vedaldi. Understanding deep networks via extremal perturbations and smooth masks. In ICCV, 2019.
  • [16] Ruth Fong and Andrea Vedaldi. Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In CVPR, 2018.
  • [17] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021.
  • [18] Leilani H. Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In DSAA, 2018.
  • [19] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In ICML, 2019.
  • [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [21] Adrian Hoffmann, Claudio Fanconi, Rahul Rade, and Jonas Kohler. This looks like that… does it? Shortcomings of latent space prototype interpretability in deep networks. In ICML Workshops, 2021.
  • [22] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In ECCV, 2012.
  • [23] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. In NeurIPS, 2019.
  • [24] Jeya Vikranth Jeyakumar, Joseph Noor, Yu-Hsi Cheng, Luis Garcia, and Mani Srivastava. How can I explain this to you? An empirical study of deep neural network explanation methods. In NeurIPS, 2020.
  • [25] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML, 2018.
  • [26] Sunnie S. Y. Kim, Nicole Meister, Vikram V. Ramaswamy, Ruth Fong, and Olga Russakovsky. HIVE: Evaluating the human interpretability of visual explanations. In ECCV, 2022.
  • [27] Sunnie S. Y. Kim, Elizabeth Anne Watkins, Olga Russakovsky, Ruth Fong, and Andrés Monroy-Hernández. “Help me help the AI”: Understanding how explainability can support human-AI interaction. In CHI, 2023.
  • [28] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un) reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer Cham, 2019.
  • [29] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In ICML, 2020.
  • [30] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Samuel J Gershman, and Finale Doshi-Velez. Human evaluation of models built for interpretability. In HCOMP, 2019.
  • [31] Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In ECCV, 2016.
  • [32] Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended? In ICLR Workshops, 2021.
  • [33] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In FAccT, 2019.
  • [34] Meike Nauta, Ron van Bree, and Christin Seifert. Neural prototype trees for interpretable fine-grained image recognition. In CVPR, 2021.
  • [35] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020.
  • [36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 2011.
  • [37] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models. In BMVC, 2018.
  • [38] Bryan A. Plummer, Mariya I. Vasileva, Vitali Petsiuk, Kate Saenko, and David Forsyth. Why do these match? Explaining the behavior of image similarity models. In ECCV, 2020.
  • [39] Vikram V. Ramaswamy, Sunnie S. Y. Kim, Nicole Meister, Ruth Fong, and Olga Russakovsky. ELUDE: Generating interpretable explanations via a decomposition into labelled and unlabelled features. arXiv:2206.07690, 2022.
  • [40] Vikram V. Ramaswamy, Sunnie S. Y. Kim, and Olga Russakovsky. Fair attribute classification through latent space de-biasing. In CVPR, 2021.
  • [41] Sylvestre-Alvise Rebuffi, Ruth Fong, Xu Ji, and Andrea Vedaldi. There and back again: Revisiting backpropagation saliency methods. In CVPR, 2020.
  • [42] Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges. In Statistics Surveys, 2021.
  • [43] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. In IJCV, 2015.
  • [44] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller, editors. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Computer Science. Springer, 2019.
  • [45] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [46] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshops, 2014.
  • [47] Kacper Sokol and Peter Flach. Explainability fact sheets: A framework for systematic assessment of explainable approaches. In FAccT, 2020.
  • [48] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [49] Mengjiao Yang and Been Kim. Benchmarking attribution methods with relative feature importance. arXiv:1907.09701, 2019.
  • [50] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In NeurIPS, 2020.
  • [51] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
  • [52] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. IJCV, 2018.
  • [53] Quanshi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: A survey. Frontiers of Information Technology & Electronic Engineering, 2018.
  • [54] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
  • [55] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.
  • [56] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable basis decomposition for visual explanation. In ECCV, 2018.
  • [57] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20k dataset. In CVPR, 2017.
  • [58] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20k dataset. IJCV, 2019.

Appendix

In this supplementary document, we provide additional details on some sections of the main paper.

  • Section A: We provide more information regarding the experimental setup for all experiments.

  • Section B: We provide additional results from our experiments on probe dataset choice from Section 3 of the main paper.

  • Section C: We provide additional results from our experiments on concept choice from Section 4 of the main paper.

  • Section D: We supplement Section 5 of the main paper and provide more information about our human studies.

  • Section E: We supplement Section 5 of the main paper and show snapshots of our full user interface.

Appendix A Experimental details

Here we provide additional experimental details for all of our setups, as well as the computational resources used.

TCAV. Using the features extracted from the penultimate layer of the ResNet18-based [20] model trained on the Places365 dataset [55], we use scikit-learn’s [36] LogisticRegression models to predict the ground-truth attributes in each case. We use the liblinear solver with an l2 penalty, and pick the regularization weight as a hyperparameter based on the performance (ROC AUC) on a validation set.

Baseline. Given the ground-truth labelled concepts for an image, this explanation attempts to predict the blackbox model’s output on the image. We use scikit-learn’s [36] LogisticRegression model with a liblinear solver and an l1 penalty, to prioritize learning simpler explanations. For the experiment reported in Section 3 of the main paper, we pick the regularization weight as a hyperparameter, choosing the weight with the best performance on a validation set. When generating explanations of different complexities for our human studies, we vary the regularization parameter, picking explanations that use a total of 4, 8, 16, 32, or 64 concepts.
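A minimal sketch of such a sweep is shown below, reusing the Baseline setup sketched in Section 3; the grid of regularization strengths and the selection rule are our choices, not necessarily those of the released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_concept_budget(concepts, model_preds, n_concepts):
    """Sweep the l1 regularization strength and return the least-regularized
    model whose explanations use at most `n_concepts` concepts in total
    (a sketch; the exact sweep used for the human studies may differ)."""
    best = None
    for c in np.logspace(-3, 1, 30):  # increasing C = weaker regularization
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=c)
        clf.fit(concepts, model_preds)
        # Count concepts with a nonzero coefficient for at least one class.
        n_used = int(np.count_nonzero(np.any(clf.coef_ != 0, axis=0)))
        if n_used <= n_concepts:
            best = clf
    return best
```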

Learning concepts. We computed features for all images from ADE20k [57, 58] using the penultimate layer of a ResNet18 [20] model trained on ImageNet [43]. We then learned a linear model for every concept that had over 10 positive samples within the dataset, using the LogisticRegression model from scikit-learn [36]. As with the other models, we use a liblinear solver with an l2 penalty, choosing the regularization weight based on performance (ROC AUC) on a validation set. As mentioned, we report the normalized AP [22] to be able to compare across concepts and target classes with varying base rates.
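For reference, a minimal sketch of the feature-extraction step (our own code, using the torchvision 0.13+ weights API, not the original scripts):

```python
import torch
from torchvision import models

# Extract 512-d penultimate-layer features from an ImageNet-pretrained
# ResNet18; these features feed the per-concept logistic-regression probes.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # drop the 1000-way classification head
backbone.eval()

with torch.no_grad():
    batch = torch.randn(8, 3, 224, 224)  # stand-in for a normalized image batch
    features = backbone(batch)           # shape: (8, 512)
print(features.shape)
```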

Run times. Training each of the linear models took less than 2 min on a CPU. Computing features using a ResNet18 [20] model trained on either Places365 [55] or ImageNet [43] for the ADE20k [57, 58] and Pascal [13] datasets took less than 15 min using an NVIDIA RTX 2080 GPU.

Appendix B Probe dataset choice: more details

In Section 3 of the main paper, we showed that the choice of probe dataset can have a significant impact on the output of concept-based explanations. Here we provide more details from those experiments.

B.1 Varying the probe dataset

Here we provide the full results from Section 3 of the main paper, where we compute concept-based explanations using two different methods (NetDissect [5] and TCAV [25]) with either ADE20k or Pascal as the probe dataset.

NetDissect. Table 4 contains the labels generated for all neurons that are strongly activated when using either ADE20k [57, 58] or Pascal [13] as the probe dataset. A majority of the neurons (69/123) correspond to very different concepts.

Neuron ADE20k label ADE20k score Pascal label Pascal score Neuron ADE20k label ADE20k score Pascal label Pascal score
1 counter 0.059 bottle 0.049 3 sea 0.067 water 0.065
4 seat 0.064 tvmonitor 0.074 8 vineyard 0.048 plant 0.043
9 plant 0.082 pottedplant 0.194 22 bookcase 0.07 bus 0.048
30 house 0.094 building 0.043 37 boat 0.043 boat 0.213
43 bed 0.151 bed 0.075 47 pool table 0.135 airplane 0.079
60 plane 0.052 airplane 0.168 63 field 0.053 muzzle 0.042
69 person 0.047 hair 0.086 73 water 0.041 bird 0.080
79 plant 0.064 pottedplant 0.064 90 mountain 0.071 mountain 0.066
102 bathtub 0.040 cat 0.055 104 cradle 0.081 bus 0.112
105 sea 0.106 water 0.058 106 rock 0.048 rock 0.06
110 painting 0.119 painting 0.06 112 field 0.05 bus 0.051
113 table 0.116 table 0.066 115 plane 0.046 airplane 0.147
120 sidewalk 0.042 track 0.075 125 table 0.049 wineglass 0.047
126 stove 0.064 bottle 0.163 127 book 0.104 book 0.096
131 signboard 0.043 body 0.069 134 bathtub 0.088 boat 0.059
141 skyscraper 0.065 cage 0.068 155 mountain 0.091 train 0.058
158 book 0.042 book 0.052 165 sea 0.051 water 0.051
168 railroad train 0.055 train 0.193 172 car 0.055 bus 0.101
173 car 0.052 bus 0.099 181 plant 0.068 pottedplant 0.14
183 person 0.041 horse 0.187 184 cradle 0.046 cat 0.042
185 chair 0.077 horse 0.153 186 person 0.051 bird 0.094
191 swimming pool 0.044 pottedplant 0.072 198 pool table 0.064 ceiling 0.066
208 shelf 0.047 bus 0.062 211 computer 0.076 tvmonitor 0.089
217 toilet 0.049 hair 0.055 218 case 0.044 track 0.165
219 plane 0.065 airplane 0.189 220 road 0.066 road 0.066
222 grass 0.105 grass 0.046 223 house 0.069 airplane 0.055
231 grandstand 0.097 screen 0.047 234 bridge 0.05 train 0.042
239 pool table 0.069 horse 0.171 245 water 0.063 water 0.042
247 plane 0.079 airplane 0.177 248 bed 0.127 tvmonitor 0.063
251 sofa 0.073 pottedplant 0.053 257 tent 0.042 bus 0.279
260 flower 0.082 food 0.069 267 apparel 0.042 car 0.045
276 earth 0.041 rock 0.047 278 field 0.06 sheep 0.044
280 mountain 0.045 mountain 0.056 287 plant 0.078 pottedplant 0.07
289 pool table 0.049 food 0.059 290 mountain 0.085 mountain 0.097
293 shelf 0.074 bottle 0.105 298 path 0.047 motorbike 0.068
305 waterfall 0.057 mountain 0.047 309 washer 0.109 bus 0.065
318 computer 0.079 tvmonitor 0.251 322 ball 0.054 sheep 0.044
324 mountain 0.071 motorbike 0.048 325 person 0.04 head 0.059
327 waterfall 0.055 bird 0.087 337 water 0.072 boat 0.109
341 sea 0.153 boat 0.076 344 person 0.052 person 0.048
345 autobus 0.042 bus 0.142 347 palm 0.051 bicycle 0.083
348 mountain 0.058 mountain 0.125 354 cradle 0.042 chair 0.053
357 rock 0.058 sheep 0.061 360 pool table 0.048 bird 0.041
364 field 0.058 plant 0.041 372 work surface 0.045 cabinet 0.049
379 bridge 0.092 bus 0.046 383 bed 0.069 curtain 0.079
384 washer 0.043 bicycle 0.201 386 autobus 0.067 bus 0.200
387 hovel 0.04 train 0.085 389 chair 0.066 chair 0.051
398 windowpane 0.073 windowpane 0.07 400 plant 0.043 pottedplant 0.097
408 toilet 0.045 bottle 0.099 412 bed 0.079 airplane 0.086
413 pool table 0.09 motorbike 0.07 415 seat 0.044 tvmonitor 0.045
417 sand 0.06 sand 0.049 419 bed 0.061 tvmonitor 0.054
422 seat 0.089 tvmonitor 0.056 430 bed 0.078 bedclothes 0.042
434 case 0.047 cup 0.041 435 runway 0.072 airplane 0.189
438 plane 0.045 airplane 0.235 444 sofa 0.045 plant 0.09
445 car 0.201 car 0.093 446 pool table 0.193 tvmonitor 0.086
454 car 0.218 car 0.156 463 snow 0.059 snow 0.118
465 crosswalk 0.097 road 0.047 475 cradle 0.061 train 0.132
477 desk 0.104 tvmonitor 0.085 480 sofa 0.086 sofa 0.081
483 swivel chair 0.052 horse 0.041 484 water 0.15 water 0.102
485 sofa 0.056 airplane 0.045 500 sofa 0.156 sofa 0.11
502 washer 0.07 train 0.134 503 bookcase 0.109 book 0.075
509 computer 0.044 tvmonitor 0.074
Table 4: We show labels for all neurons from the penultimate layer of a ResNet18 model that are marked as highly activated by NetDissect [5] under both probe datasets (each row of the table above lists two neurons). We find that 69 of the 123 neurons correspond to labels that are radically different (shown in red). The remainder correspond to either the same or very similar concepts.

As noted by Fong et al. [16] and Olah et al. [35], neurons in deep neural networks can be polysemantic, i.e., a single neuron can recognize multiple concepts. We check whether the disagreements above are due to such neurons and confirm that they are not: out of the 69 neurons, only 7 are highly activated (IOU > 0.04) by both concepts. Table 5 contains the IOU scores of both the ADE20k and Pascal labels, under both probe datasets, for each neuron that is assigned very different concepts.
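For reference, below is a minimal sketch of the IoU computation underlying this check (not NetDissect's actual implementation: the aggregation of intersections and unions over the full probe dataset is omitted, and all names are placeholders). A neuron is counted as polysemantic here only if both its ADE20k label and its Pascal label reach IOU > 0.04.

```python
import numpy as np

def netdissect_style_iou(activation_map, concept_mask, threshold):
    """IoU between a neuron's thresholded activation map and a binary concept
    segmentation mask, assuming both are aligned to the same spatial size."""
    active = activation_map > threshold                        # binarize the neuron's activations
    intersection = np.logical_and(active, concept_mask).sum()
    union = np.logical_or(active, concept_mask).sum()
    return intersection / union if union > 0 else 0.0
```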

neuron  ADE20k label  Pascal label  IOU of ADE20k label (ADE20k probe)  IOU of Pascal label (ADE20k probe)  IOU of ADE20k label (Pascal probe)  IOU of Pascal label (Pascal probe)
1 counter bottle 0.059 0.006 0.006 0.049
4 seat tvmonitor 0.064 0.0 0.0 0.074
22 bookcase bus 0.07 0.0 0.0 0.048
47 pool table airplane 0.135 0.0 0.002 0.079
63 field muzzle 0.053 0.0 0.0 0.042
73 water bird 0.041 0.002 0.052 0.08
102 bathtub cat 0.04 0.0 0.0 0.055
104 cradle bus 0.081 0.0 0.0 0.112
112 field bus 0.05 0.0 0.0 0.051
120 sidewalk track 0.042 0.001 0.023 0.075
125 table wineglass 0.049 0.0 0.043 0.047
126 stove bottle 0.064 0.029 0.005 0.163
131 signboard body 0.043 0.0 0.06 0.069
134 bathtub boat 0.088 0.001 0.005 0.059
141 skyscraper cage 0.065 0.001 0.0 0.068
155 mountain train 0.091 0.0 0.038 0.058
172 car bus 0.055 0.0 0.015 0.101
173 car bus 0.052 0.0 0.013 0.099
183 person horse 0.041 0.016 0.003 0.187
184 cradle cat 0.046 0.0 0.0 0.042
185 chair horse 0.077 0.014 0.011 0.153
186 person bird 0.051 0.001 0.017 0.094
191 swimming pool pottedplant 0.044 0.0 0.0 0.072
198 pool table ceiling 0.064 0.035 0.001 0.066
208 shelf bus 0.047 0.0 0.0 0.062
217 toilet hair 0.049 0.001 0.0 0.055
218 case track 0.044 0.001 0.0 0.165
223 house airplane 0.069 0.0 0.0 0.055
231 grandstand screen 0.097 0.0 0.007 0.047
234 bridge train 0.05 0.0 0.014 0.042
239 pool table horse 0.069 0.011 0.0 0.171
248 bed tvmonitor 0.127 0.0 0.027 0.063
251 sofa pottedplant 0.073 0.0 0.033 0.053
257 tent bus 0.042 0.0 0.005 0.279
260 flower food 0.082 0.033 0.064 0.069
267 apparel car 0.042 0.023 0.0 0.045
278 field sheep 0.06 0.0 0.0 0.044
289 pool table food 0.049 0.024 0.0 0.059
293 shelf bottle 0.074 0.025 0.0 0.105
298 path motorbike 0.047 0.0 0.0 0.068
305 waterfall mountain 0.057 0.049 0.0 0.047
309 washer bus 0.109 0.0 0.013 0.065
322 ball sheep 0.054 0.0 0.005 0.044
324 mountain motorbike 0.071 0.0 0.015 0.048
327 waterfall bird 0.055 0.001 0.0 0.087
337 water boat 0.072 0.031 0.053 0.109
341 sea boat 0.153 0.014 0.0 0.076
347 palm bicycle 0.051 0.001 0.0 0.083
354 cradle chair 0.042 0.03 0.0 0.053
357 rock sheep 0.058 0.0 0.006 0.061
360 pool table bird 0.048 0.0 0.0 0.041
379 bridge bus 0.092 0.0 0.03 0.046
383 bed curtain 0.069 0.064 0.01 0.079
384 washer bicycle 0.043 0.018 0.0 0.201
387 hovel train 0.04 0.0 0.0 0.085
408 toilet bottle 0.045 0.002 0.0 0.099
412 bed airplane 0.079 0.0 0.008 0.086
413 pool table motorbike 0.09 0.0 0.003 0.07
415 seat tvmonitor 0.044 0.0 0.0 0.045
419 bed tvmonitor 0.061 0.0 0.016 0.054
422 seat tvmonitor 0.089 0.0 0.0 0.056
434 case cup 0.047 0.001 0.0 0.041
444 sofa plant 0.045 0.009 0.014 0.09
446 pool table tvmonitor 0.193 0.0 0.006 0.086
475 cradle train 0.061 0.0 0.0 0.132
477 desk tvmonitor 0.104 0.0 0.0 0.085
483 swivel chair horse 0.052 0.006 0.0 0.041
485 sofa airplane 0.056 0.0 0.024 0.045
502 washer train 0.07 0.0 0.006 0.134
Table 5: For each neuron from Tab. 4 that is assigned radically different concepts when explanations are computed using ADE20k vs. Pascal, we also compute the IOU score of each label under the other probe dataset. Apart from the 7 neurons marked in red, all of these IOU scores are below 0.04, suggesting that the disagreement is not due to polysemantic neurons.

TCAV. In Table 6, we report the cosine similarities between the concept activation vectors learned using ADE20k and Pascal as probe datasets, for all 32 concepts that have a base rate of at least 1%. On the whole, the two vectors for a given concept are not very similar, even though each predicts its concept well (high AUC).
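As a rough illustration of this comparison (a sketch under the assumption that intermediate-layer activations for images with and without the concept are precomputed; the exact linear model and training details may differ), each concept activation vector is the unit-normalized weight vector of a linear classifier, and the two vectors for a concept are compared via their dot product:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_with_concept, acts_without_concept):
    """Fit a linear classifier separating activations of images that contain the
    concept from those that do not; return the unit-normalized weight vector."""
    X = np.concatenate([acts_with_concept, acts_without_concept])
    y = np.concatenate([np.ones(len(acts_with_concept)),
                        np.zeros(len(acts_without_concept))])
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
    return w / np.linalg.norm(w)

# Cosine similarity between the CAVs for the same concept learned from the two
# probe datasets (both vectors are unit length, so the dot product suffices):
# cos_sim = float(np.dot(cav_ade20k, cav_pascal))
```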

Concept ADE20k AUC Pascal AUC Cos.sim. Concept ADE20k AUC Pascal AUC Cos.sim.
bag 79.4 75.4 0.006 book 90.4 84.6 0.138
bottle 88.5 85.6 0.035 box 83.0 80.1 0.086
building 97.4 90.0 0.161 cabinet 91.3 92.4 0.03
car 96.9 90.3 0.147 ceiling 96.6 93.0 0.267
chair 90.5 89.6 0.034 curtain 91.6 89.5 0.112
door 81.5 87.8 0.134 fence 86.1 84.7 0.09
floor 97.4 92.1 0.208 grass 95.1 91.7 0.04
light 92.4 85.0 0.043 mountain 94.2 90.8 0.02
painting 94.8 91.4 0.116 person 92.2 92.1 0.253
plate 90.6 94.8 -0.009 pole 89.0 79.3 0.059
pot 79.3 85.2 0.142 road 98.0 91.8 0.041
rock 92.6 82.8 -0.024 sidewalk 97.0 92.5 0.071
signboard 90.6 76.5 0.091 sky 98.9 79.8 0.104
sofa 95.9 91.2 -0.009 table 93.4 93.5 0.06
tree 96.8 89.2 0.172 wall 95.9 91.3 0.027
water 95.2 94.6 0.078 windowpane 91.5 90.1 0.078
Table 6: Cosine similarities between the concept activation vectors learned using the ADE20k and Pascal datasets, alongside each vector's AUC for predicting its concept. In general, the vectors learned from the two probe datasets are not well aligned.

B.2 Difference in probe dataset distribution

The first way we examine the differences between the two probe datasets is to compare the base rates of concepts within each dataset. As noted in Section 3 of the main paper, there are some sizable differences. Figure 5 shows the base rates for all concepts highlighted in Table 2 of the main paper. Concepts with very different base rates include wall (highlighted for bow-window when using ADE20k, but not Pascal), floor (highlighted for auto-showroom when using ADE20k, but not Pascal), dog (highlighted for corn-field when using Pascal, but not ADE20k), and pole (highlighted for hardware-store when using Pascal, but not ADE20k).
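Here, the base rate of a concept is simply the fraction of images in a probe dataset whose labels include that concept; a small sketch (with a hypothetical input format) of how these rates can be computed:

```python
from collections import Counter

def concept_base_rates(per_image_concepts):
    """per_image_concepts: list with one set of concept names per image.
    Returns the fraction of images in which each concept appears."""
    counts = Counter()
    for concepts in per_image_concepts:
        counts.update(set(concepts))
    n = len(per_image_concepts)
    return {concept: count / n for concept, count in counts.items()}
```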

Refer to caption
Figure 5: Different concepts have very different base rates across Pascal and ADE20k. The graph shows the base rates for the different concepts highlighted within Table 2 in the main paper.

Beyond the base rates, however, the images themselves look very different across the two datasets. We visualize random images from different scenes in Figure 6 and find, for example, that images labelled bedroom in Pascal tend to show a person or animal sleeping on a bed, with little of the rest of the bedroom visible, whereas ADE20k features images of full bedrooms. Similarly, images labelled tree-farm contain people in Pascal, but do not in ADE20k.

Refer to caption
Figure 6: Example images from ADE20k and Pascal for 4 scene classes that received very different explanations in Table 2 of the main paper. These classes have very different distributions across the two datasets; for example, the images labelled as bedroom in Pascal tend to show an animal or person on a bed, whereas the ones from ADE20k do not.

Upper bounds.

Finally, we present a simple method for comparing the probe dataset to the original training dataset, by noting that the probe dataset establishes a strict upper bound on the fraction of the model's predictions that can be explained. This is intuitively true since the set of labeled semantic concepts is finite, but the issue actually goes deeper. Consider the following experiment: we take the original black-box model, run it on a probe dataset to make predictions, and then train a new classifier to emulate those predictions. If this classifier is restricted to use only the labeled concepts, then it resembles a concept-based explanation. However, even if it is trained on the rich underlying visual features, it would not perform perfectly, due to the differences between the original training dataset and the probe dataset.

Concretely, consider a black-box ResNet18-based [20] model trained on the Places365 [55] dataset. We reset and re-train its final linear classification layer on the Pascal [13] probe dataset to emulate the original scene predictions; this achieves only 63.7% accuracy. Similarly, with ADE20k [57, 58] as the probe dataset, it achieves a slightly better 75.7% accuracy, suggesting that this dataset is somewhat more similar to Places365 than Pascal, but still far from fully capturing its distribution. This is not to suggest that the only way to generate concept-based explanations is to collect concept labels for the original training set (which may lead to overfitting); rather, it is important to acknowledge this limitation and to quantify explanation methods with such upper bounds.
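A sketch of this upper-bound computation is below, assuming a frozen backbone that returns flattened penultimate features and access to the original classification head; the function name, data loader, and training hyperparameters are placeholders rather than a prescribed recipe. The reported upper bound is the agreement between the retrained head and the black-box model's predictions on held-out probe images.

```python
import torch
import torch.nn as nn

def fit_upper_bound_head(backbone, original_head, probe_loader,
                         feat_dim=512, num_classes=365,
                         epochs=10, lr=1e-3, device="cuda"):
    """Re-train only a fresh final linear layer to emulate the frozen model's
    own predictions on the probe dataset."""
    backbone.eval(); original_head.eval()
    emulator = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.Adam(emulator.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, _ in probe_loader:
            images = images.to(device)
            with torch.no_grad():
                feats = backbone(images)                      # penultimate features
                targets = original_head(feats).argmax(dim=1)  # black-box predictions
            loss = loss_fn(emulator(feats), targets)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return emulator
```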

Similarly, we can ask how well the Concept Bottleneck model [29] can be explained using the CUB test set. In this case, since the training and test distributions are (hopefully!) similar, we would expect the upper bound to be reasonably high. We check this with the same setup and find that this is indeed the case: resetting and retraining the final linear layer, using the model's predictions as targets, achieves an accuracy of 89.3%.

Appendix C Concepts used: more details

Here, we provide additional results on learning the CUB concepts from Section 4.2 of the main paper. The CUB dataset was used by Concept Bottleneck [29], an interpretable-by-design model that learns the concepts as an intermediate layer within the network and then uses these concepts to predict the target class. Figure 7 contains histograms of the normalized AP scores for the 112 concepts from CUB [48], as well as the normalized APs for the target bird classes learned by the model. As with the Broden [5] concepts, we learn a linear model using features from an ImageNet [43]-trained ResNet18 [20] model. On average, the bird classes are learned much better than the concepts.
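For concreteness, a sketch of how a single concept's AP can be measured with a linear probe on frozen features is shown below (the normalized AP reported in the paper additionally adjusts for differences in base rate across concepts and classes; that normalization step, and the feature extraction itself, are omitted here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def concept_ap(train_feats, train_labels, test_feats, test_labels):
    """Linear probe for one binary concept on frozen ResNet18 features;
    returns average precision on held-out images."""
    probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    scores = probe.decision_function(test_feats)
    return average_precision_score(test_labels, scores)
```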

Refer to caption
Refer to caption
Figure 7: We compare the normalized APs when learning the CUB concepts (left) to the normalized APs of the CUB target classes for the Concept Bottleneck model (right). On average, the concepts are much harder to learn.

Appendix D Human study details

In Section 5 of the main paper, we discuss the human studies we ran to understand how well humans are able to reason about concept-based explanations as the number of concepts used within the explanation increases. In this section, we provide additional details.

To recap, we compare four types of explanations: (1) concept-based explanations that use 8 concepts, (2) concept-based explanations that use 16 concepts, (3) concept-based explanations that use 32 concepts, and (4) example-based explanations that consist of 10 example images for which the model predicts a certain class. (4) is a baseline that doesn’t use concepts.

For a fair comparison, all four types of explanations are evaluated on the same inputs. We generate five input sets, where each set consists of 5 images from one scene group (commercial buildings, shops, markets, cities, and towns) and 5 images from another scene group (home or hotel). Recall that these are images where the model output matches the explanation output (i.e., the class with the highest explanation score calculated from ground-truth concept labels). Hence, if a participant correctly identifies all concepts that appear in a given image, they are guaranteed to get the highest explanation score for the model output class.
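As an illustration of this matching criterion (a minimal sketch; the variable names are ours and the bias term is included only for generality), the explanation score of each class is a weighted sum of binary concept indicators, and the explanation's output is the highest-scoring class:

```python
import numpy as np

def explanation_prediction(concept_present, weights, bias=0.0):
    """concept_present: binary vector over the explanation's concepts.
    weights: (num_classes, num_concepts) explanation coefficients.
    Returns the per-class explanation scores and the highest-scoring class."""
    scores = weights @ concept_present + bias
    return scores, int(np.argmax(scores))
```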

To reduce variance with respect to the input, we had 5 participants for each combination of input set and explanation type. For the 32-concept explanations, each participant saw 5 images from only one of the two scene groups, because the study became too long and overwhelming with the full set of 10 images. For all other explanations, each participant saw the full set of 10 images. In total, we had 125 participants: 50 for the study with 32-concept explanations and 25 for each of the other three studies. Each participant saw only one type of explanation, as we conducted a between-group study.

More specifically, we recruited participants through Amazon Mechanical Turk who are US-based, have completed over 1000 Human Intelligence Tasks, and have a prior approval rate of at least 98%. The demographic distribution was: man 59%, woman 41%; no race/ethnicity reported 82%, White 17%, Black/African American 1%, Asian 1%. The self-reported machine learning experience was 2.5 ± 1.0, between “2: have heard about…” and “3: know the basics…”. We did not collect any personally identifiable information. Participants were compensated based on the state-level minimum wage of $12/hr. In total, approximately $800 was spent on running the human studies.

Appendix E User interface snapshots

In Section 5.1 of the main paper, we outlined our human study design. We note that much of our study design and UI is based on the recent work of Kim et al. [26], who propose HIVE, a human evaluation framework for evaluating visual interpretability methods. Here, we provide snapshots of our study UIs in the following order.

Study introduction. For each participant, we introduce the study, present a consent form, and receive informed consent for participation in the study. The consent form was approved by our institution’s Institutional Review Board and acknowledges that participation is voluntary, refusal to participate will involve no penalty or loss of benefits, etc. See Fig. 8.

Demographics and background. Following HIVE [26], we request optional demographic data regarding gender identity, race and ethnicity, as well as the participant’s experience with machine learning. We collect this information to help future researchers calibrate our results. See Fig. 9.

Method introduction. We introduce concept-based explanations in simple terms. This page is not shown for the study with example-based explanations. See Fig. 10.

Task preview. We present a practice example to help participants get familiar with the task. This page is not shown for the study with example-based explanations. See Fig. 11.

Part 1: Recognize concepts and guess the model output. After the preview, participants move on to the main task, where they are asked to recognize concepts in a given photo (for concept-based explanations) and predict the model output (for all explanations). We show the UI for each type of explanation we study:

  • 8-concept explanations (Fig. 12)

  • 16-concept explanations (Fig. 13)

  • 32-concept explanations (Fig. 14)

  • Example-based explanations (Fig. 15)

Part 2: Choose the ideal tradeoff between simplicity and correctness. Concept-based explanations can have varying levels of complexity/simplicity and correctness; hence, we investigate how participants reason about these two properties. To do so, we show examples of concept-based explanations that use different numbers of concepts, as well as bar plots with the correctness values for certain instantiations of concept-based explanations. We then ask participants to choose the explanation they prefer the most and to provide a short written justification for their choice. See Fig. 16.

Feedback. At the end of the study, participants can optionally provide feedback. See Fig. 17.

Refer to caption
Figure 8: UI - Study introduction
Refer to caption
Figure 9: UI - Demographics and background
Refer to caption
Refer to caption
Figure 10: UI - Method introduction
Refer to caption
Figure 11: UI - Task preview
Refer to caption
Refer to caption
Figure 12: UI - Part 1: Recognize concepts and guess the model output (8-concept explanations)
Refer to caption
Refer to caption
Figure 13: UI - Part 1: Recognize concepts and guess the model output (16-concept explanations)
Refer to caption
Refer to caption
Refer to caption
Figure 14: UI - Part 1: Recognize concepts and guess the model output (32-concept explanations)
Refer to caption
Refer to caption
Figure 15: UI - Part 1: Guess the model output (example-based explanations)
Refer to caption
Refer to caption
Refer to caption
Figure 16: UI - Part 2: Choose the ideal tradeoff between simplicity and correctness
Refer to caption
Figure 17: UI - Feedback