Towards Robust Metrics For Concept Representation Evaluation
Abstract
Recent work on interpretability has focused on concept-based explanations, where deep learning models are explained in terms of high-level units of information, referred to as concepts. Concept learning models, however, have been shown to be prone to encoding impurities in their representations, failing to fully capture meaningful features of their inputs. While concept learning lacks metrics to measure such phenomena, the field of disentanglement learning has explored the related notion of underlying factors of variation in the data, with plenty of metrics to measure the purity of such factors. In this paper, we show that such metrics are not appropriate for concept learning and propose novel metrics for evaluating the purity of concept representations in both approaches. We show the advantage of these metrics over existing ones and demonstrate their utility in evaluating the robustness of concept representations and interventions performed on them. In addition, we show their utility for benchmarking state-of-the-art methods from both families and find that, contrary to common assumptions, supervision alone may not be sufficient for pure concept representations.
1 Introduction
Addressing the lack of interpretability of deep neural networks (DNNs) has given rise to explainability methods, most common of which are feature importance methods (Ribeiro, Singh, and Guestrin 2016; Lundberg and Lee 2017) that quantify the contribution of input features to certain predictions (Bhatt et al. 2020). However, input features may not necessarily form the most intuitive basis for explanations, in particular when using low-level features such as pixels. Concept-based explainability (Kim et al. 2018; Ghorbani et al. 2019; Koh et al. 2020; Yeh et al. 2020; Ciravegna et al. 2021) remedies this issue by constructing an explanation at a concept level, where concepts are considered intermediate, high-level and semantically meaningful units of information commonly used by humans to explain their decisions. Recent work, however, has shown that concept learning (CL) models may not correctly capture the intended semantics of their representations (Margeloiu et al. 2021), and that their learnt concept representations are prone to encoding impurities (i.e., more information in a concept than what is intended) (Mahinpei et al. 2021). Such phenomena may have severe consequences for how such representations can be interpreted (as shown in the misleading attribution maps described by Margeloiu et al. (2021)) and used in practice (as shown later in our intervention results). Nevertheless, the CL literature is yet to see concrete metrics to appropriately capture and measure these phenomena.
In contrast, the closely-related field of disentanglement learning (DGL) (Bengio, Courville, and Vincent 2013; Higgins et al. 2017; Locatello et al. 2019, 2020b), where methods aim to learn intermediate representations aligned to disentangled factors of variation in the data, offers a wide array of metrics for evaluating the quality of latent representations. However, despite the close relationship between concept representations in CL and latent codes in DGL, metrics proposed in DGL are built on assumptions that do not hold in CL, as explained in Section 2, and are thus inappropriate to measure the aforementioned undesired phenomena in CL.
In this paper, we show the inadequacy of current metrics and introduce two novel metrics for evaluating the purity of intermediate representations in CL. Our results indicate that our metrics can be used in practice for quality assurance of such intermediate representations for:
1. Detecting impurities (i) concealed in soft representations, (ii) caused by different model capacities, or (iii) caused by spurious correlations.
2. Indicating when concept interventions are safe.
3. Revealing the impact of supervision on concept purity.
4. Being robust to inter-concept correlations.
2 Background and Motivation
Notation
In CL, the aim is to find a low-dimensional intermediate representation of the data, similar to latent codes in DGL. This low-dimensional representation corresponds to a matrix $\hat{\mathbf{c}} \in \mathbb{R}^{m \times k}$ in which the $i$-th column $\hat{\mathbf{c}}_i$ constitutes an $m$-dimensional representation of the $i$-th of $k$ concepts, assuming that the length of all concept representations can be made equal using zero-padding. Under this view, elements in $\hat{\mathbf{c}}_i$ are expected to have high values (under some reasonable aggregation function) if the $i$-th concept is considered to be activated. As most CL methods assume $m = 1$, for succinctness we use $\hat{c}_i$ in place of $\hat{\mathbf{c}}_i$ when $m = 1$. Analogously, as each latent code aims to encode an independent factor of variation (i.e., concept) in DGL, we use $\hat{\mathbf{c}}$ for both learnt concept representations and latent codes. Similarly, we refer to both ground truth concepts and factors of variation as $\mathbf{c} = (c_1, \dots, c_k)$.
Concept Learning
In supervised CL, access to concept labels for each input, in addition to task labels, is assumed. Supervised CL makes use of: (i) a concept encoder function $g(\mathbf{x}) = \hat{\mathbf{c}}$ that maps the inputs to a concept representation; and (ii) a label predictor function $f(\hat{\mathbf{c}}) = \hat{y}$ that maps the concept representations to a downstream task’s set of labels $\mathcal{Y}$. Together, these two functions constitute a Concept Bottleneck Model (CBM) (Koh et al. 2020). A notable approach that uses the bottleneck idea is Concept Whitening (CW) (Chen, Bei, and Rudin 2020) which introduces a batch normalisation module whose activations are trained to be aligned with sets of binary concepts. Unlike supervised CL, in unsupervised CL concept annotations are not available and concepts are discovered in an unsupervised manner with the help of task labels. Two notable data modality agnostic approaches in this family are Completeness-aware Concept Discovery (CCD) (Yeh et al. 2020) and Self-Explainable Neural Networks (SENNs) (Alvarez-Melis and Jaakkola 2018). We refer to supervision from ground truth concepts in supervised CL as explicit, while supervision from task labels alone is referred to as implicit.
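The encoder/predictor split can be made concrete with a minimal sketch. The snippet below is illustrative only: it assumes a jointly-trained CBM with a sigmoidal bottleneck over $k$ binary concepts, and its layer sizes and loss-mixing weight `alpha` are arbitrary placeholders rather than the configurations used in our experiments.

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """Minimal jointly-trained CBM sketch: x -> concepts -> label."""

    def __init__(self, n_features, n_concepts, n_labels):
        super().__init__()
        # (i) concept encoder g: inputs -> concept representation c_hat
        self.concept_encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, n_concepts),
        )
        # (ii) label predictor f: concept representation -> task logits
        self.label_predictor = nn.Sequential(
            nn.Linear(n_concepts, 64), nn.ReLU(),
            nn.Linear(64, n_labels),
        )

    def forward(self, x):
        c_logits = self.concept_encoder(x)
        c_hat = torch.sigmoid(c_logits)          # sigmoidal bottleneck ("CBM")
        y_logits = self.label_predictor(c_hat)   # feed c_logits instead for "CBM-Logits"
        return c_hat, y_logits

def joint_loss(c_hat, y_logits, c_true, y_true, alpha=1.0):
    # Joint objective: task loss + alpha * explicit concept supervision.
    # c_true: float tensor of 0/1 concept labels; y_true: long tensor of class indices.
    task_loss = nn.functional.cross_entropy(y_logits, y_true)
    concept_loss = nn.functional.binary_cross_entropy(c_hat, c_true)
    return task_loss + alpha * concept_loss
```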
On the other hand, generative models (e.g., VAEs (Kingma and Welling 2014)) used in DGL assume that data is generated from a set of independent factors of variation $\mathbf{c}$. Thus, the goal is to find a function $g(\mathbf{x}) = \hat{\mathbf{c}}$ that maps inputs to a disentangled latent representation. In the light of recent work (Locatello et al. 2019) showing the impossibility of learning disentangled representations without any supervision, as in $\beta$-VAEs (Higgins et al. 2017), recent work suggests using weak supervision for learning latent codes (Locatello et al. 2020a).
Shortcomings of Current Metrics
Generally, the DGL literature defines concept quality in terms of disentanglement, i.e., the more decorrelated the learnt concepts are, the better (see Appendix A.1 for a summary of DGL metrics). We argue that existing DGL metrics are inadequate for ensuring concept quality in CL as they: (i) assume that each concept is represented with a single scalar value, which is not the case in some modern CL methods such as CW; (ii) fail to capture subtle impurities encoded within continuous representations (as demonstrated in Section 4); (iii) may assume access to a tractable concept-to-sample generative process (something uncommon in real-world datasets); and (iv) assume that inter-concept dependencies are undesired, an assumption that may not be realistic in the real world, where ground truth concept labels are often correlated. This can be observed in Figure 1(a), where concept labels in the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al. 2011), a widely used CL benchmark, are seen to be highly correlated.
Metrics in CL (Koh et al. 2020; Yeh et al. 2020; Kazhdan et al. 2020), on the other hand, mainly define concept quality w.r.t. the downstream task (e.g., task predictive accuracy), and rarely evaluate properties of concept representations w.r.t. the ground truth concepts (with the exception of concept predictive accuracy). Nevertheless, it is possible for two CL models to learn concept representations that yield similar task and concept accuracy but have vastly different properties/qualities. For example, in Figure 1(b) we show a simple experiment in which two CBMs trained on a toy dataset with 3 independent concepts, where “CBM” uses a sigmoidal bottleneck and “CBM-Logits” uses plain logits in its bottleneck, generate concept representations with the same concept/task accuracies yet significantly different inter-concept correlations (details in Appendix A.2).
3 Measuring Purity of Concept Representations
To address the shortcomings of existing metrics, in this section we propose two metrics that make no assumptions about (i) the presence/absence of correlations between concepts, (ii) the underlying data-generating process, and (iii) the dimensionality of a concept representation. Specifically, we focus on measuring the quality of a concept representation in terms of its “purity”, defined here as whether the predictive power of a learnt concept representation over other concepts is similar to what we would expect from their corresponding ground truth labels. We begin by introducing the oracle impurity score (OIS), a metric that quantifies impurities localised within individual learnt concepts. Then, we introduce the niche impurity score (NIS) as a metric that focuses on capturing impurities distributed across the set of learnt concept representations.
3.1 Oracle Impurity
To circumvent the aforementioned limitations of existing DGL metrics, we take inspiration from (Mahinpei et al. 2021), where they informally measure concept impurity as how predictive a CBM-generated concept probability is for the ground truth value of other independent concepts. If the pre-defined concepts are independent, then the inter-concept predictive performance should be no better than random. To generalise this assumption beyond independent concepts, we first measure the predictability of ground truth concepts w.r.t. one another. Then we measure the predictability of learnt concepts w.r.t. the ground truth ones. The divergence between the former and the latter acts as an impurity metric, measuring the amount of undesired information that is encoded, or lacking, in the learnt concepts. To formally introduce our metric, we begin by defining a purity matrix.
Definition 3.1 (Purity Matrix).
Given a set of concept representations $\hat{\mathbf{C}} = \{\hat{\mathbf{c}}^{(l)}\}_{l=1}^{N}$ and corresponding discrete ground truth concept annotations $\mathbf{C} = \{\mathbf{c}^{(l)}\}_{l=1}^{N}$, assume that $\hat{\mathbf{C}}$ and $\mathbf{C}$ are aligned element-wise: for all $i$, the $i$-th concept representation in $\hat{\mathbf{C}}$ encodes the same concept as the $i$-th concept label in $\mathbf{C}$. The Purity Matrix of $\hat{\mathbf{C}}$, given ground truth labels $\mathbf{C}$, is defined as a $k \times k$ matrix $\pi(\hat{\mathbf{C}}, \mathbf{C})$ whose entries are given by:

$$\pi_{i,j} := \text{AUC-ROC}\Big(\big\{\big(\psi_{i \to j}(\hat{\mathbf{c}}_i^{(l)}),\; c_j^{(l)}\big)\big\}_{l=1}^{N}\Big)$$

where $\psi_{i \to j}$ is a non-linear model (e.g., an MLP) mapping the $i$-th concept’s representation $\hat{\mathbf{c}}_i$ to a probability distribution over all values concept $c_j$ may take.
The $(i, j)$-th entry of $\pi(\hat{\mathbf{C}}, \mathbf{C})$ contains the AUC-ROC obtained when predicting the ground truth value of concept $c_j$ given the $i$-th concept representation. Therefore, the diagonal entries of this matrix show how good a concept representation is at predicting its aligned ground truth label, while the off-diagonal entries show how good such a representation is at predicting the ground truth labels of other concepts. Intuitively, one can think of the $(i, j)$-th entry of this matrix as a proxy for the mutual information between the $i$-th concept representation and the $j$-th ground truth concept. While in principle other methods could be used to estimate this mutual information (e.g., histogram binning), we choose the test AUC-ROC of a trained non-linear model primarily for its tractability, its bounded nature, and its easy generalisation to non-scalar concept representations. Furthermore, while in this work we focus on concepts that are binary in nature, our definition can be applied to multivariate concepts by using the mean one-vs-all AUC-ROC score. See Appendix A.3 for implementation details and Appendix A.4 for a discussion on how the OIS is robust to the model selected for $\psi_{i \to j}$.
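A minimal sketch of how a purity matrix can be estimated in practice is shown below. It assumes scalar concept representations stacked into an array `c_hat` of shape `(N, k)` and binary ground truth labels `c_true` of the same shape; the helper-MLP size and train/test split follow Appendix A.3, while the function name `purity_matrix` is ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def purity_matrix(c_hat, c_true, hidden=(32,), seed=0):
    """Estimate pi where pi[i, j] is the test AUC-ROC of predicting
    ground truth concept j from the i-th learnt concept representation."""
    n_samples, k = c_true.shape
    pi = np.zeros((k, k))
    for i in range(k):
        # Representation of the i-th concept (reshaped in case it is scalar).
        x_i = c_hat[:, i].reshape(n_samples, -1)
        for j in range(k):
            x_tr, x_te, y_tr, y_te = train_test_split(
                x_i, c_true[:, j], test_size=0.2, random_state=seed,
                stratify=c_true[:, j])
            clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500,
                                random_state=seed)
            clf.fit(x_tr, y_tr)
            # Test AUC-ROC of the helper model psi_{i -> j}.
            pi[i, j] = roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1])
    return pi
```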
This matrix allows us to construct a metric for quantifying the impurity of a concept encoder:
Definition 3.2 (Oracle Impurity Score (OIS)).
Let $g$ be a concept encoder and let $\mathbf{X} = \{\mathbf{x}^{(l)}\}_{l=1}^{N}$ and $\mathbf{C} = \{\mathbf{c}^{(l)}\}_{l=1}^{N}$ be ordered sets of testing samples and their corresponding concept annotations, respectively. If, for any ordered set $\mathbf{X}$, we define $g(\mathbf{X})$ as $\{g(\mathbf{x}^{(l)})\}_{l=1}^{N}$, then the OIS is defined as:

$$\text{OIS}(g, \mathbf{X}, \mathbf{C}) := \frac{2}{k}\, \big\| \pi\big(g(\mathbf{X}), \mathbf{C}\big) - \pi\big(\mathbf{C}, \mathbf{C}\big) \big\|_F$$

where $\|A\|_F$ represents the Frobenius norm of matrix $A$.
Intuitively, the OIS measures the total deviation of an encoder’s purity matrix from the purity matrix obtained using the ground truth concept labels only (i.e., the “oracle matrix” $\pi(\mathbf{C}, \mathbf{C})$). We opt to measure this divergence using the Frobenius norm of their difference in order to obtain a bounded output which can be easily interpreted. Since each entry in the difference can be at most $1/2$, the upper bound of $\|\pi(g(\mathbf{X}), \mathbf{C}) - \pi(\mathbf{C}, \mathbf{C})\|_F$ is $k/2$. Therefore, the OIS includes a factor of $2/k$ to guarantee that it ranges in $[0, 1]$. This allows interpreting an OIS of $1$ as a complete misalignment between $\pi(g(\mathbf{X}), \mathbf{C})$ and $\pi(\mathbf{C}, \mathbf{C})$ (i.e., the $i$-th concept representation can predict all other concept labels except its corresponding one even when concepts are independent). An impurity score of $0$, on the other hand, represents a perfect alignment between the two purity matrices (i.e., the $i$-th concept representation does not encode any unnecessary information for predicting any other concept $c_j$).
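Given the purity-matrix estimator sketched above, the OIS itself reduces to a normalised Frobenius distance between the learnt and oracle matrices. The sketch below reuses the hypothetical `purity_matrix` helper from the previous block.

```python
import numpy as np

def oracle_impurity_score(c_hat, c_true):
    """OIS = (2 / k) * ||pi(c_hat, c) - pi(c, c)||_F, bounded in [0, 1]."""
    k = c_true.shape[1]
    pi_hat = purity_matrix(c_hat, c_true)                     # learnt purity matrix
    pi_oracle = purity_matrix(c_true.astype(float), c_true)   # oracle purity matrix
    return 2.0 * np.linalg.norm(pi_hat - pi_oracle, ord="fro") / k
```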
3.2 Niche Impurity
While the OIS is able to correctly capture impurities that are localised within specific, individual concept representations, it is also possible that information pertinent to unrelated concepts is encoded across multiple learnt representations. To tractably capture such a phenomenon, we propose the Niche Impurity Score (NIS), inspired by the theory of niching. In ecology, a niche is considered to be a resource-constrained subspace of the environment that can support different types of life (Darwin 1859). Analogously, the NIS looks at the predictive power of subsets of disentangled concepts. In contrast with the OIS, the NIS is concerned with impurities encoded in sets of learnt concept representations rather than impurities in individual concept representations. The NIS efficiently quantifies the amount of shared information across concept representations by looking at how predictive disentangled subsets of concept representations are of ground truth concepts. We start by describing a concept nicher, a function that ranks learnt concepts by how much information they share with the ground truth ones. We then define a concept niche for a ground truth concept as a set of learnt concepts that are highly ranked by the concept nicher, while the set of concepts outside the niche is referred to as the concept niche complement. We conclude by constructing the NIS by looking at how predictable a ground truth concept is from its corresponding concept niche complement. The collective NIS of all concepts, therefore, represents impurities encoded across the entire bottleneck.
Definition 3.3 (Concept nicher).
Given a set of concept representations $\hat{\mathbf{C}}$ and their corresponding ground truth concept labels $\mathbf{C}$, we define a concept nicher as a function $\mu(i, j) \in [0, 1]$ that returns a value close to $1$ if the $i$-th learnt concept $\hat{\mathbf{c}}_i$ is entangled with the $j$-th ground truth concept $c_j$, and a value close to $0$ otherwise.
Our definition above can be instantiated in various ways, depending on how entanglement is measured. In favour of efficiency, we measure entanglement using the absolute Pearson correlation $|\rho|$, as this measure can efficiently discover (a linear form of) association between variables (Altman and Krzywinski 2015). We call this instantiation the concept-correlation nicher (CCorrN) and define it as $\mu_{\text{CCorrN}}(i, j) := |\rho(\hat{c}_i, c_j)|$.
If $\hat{\mathbf{c}}_i$ is not a scalar representation (i.e., $m > 1$), then for simplicity we use the maximum absolute correlation coefficient between all entries in $\hat{\mathbf{c}}_i$ and the target concept label $c_j$ as a representative correlation coefficient for the entire representation $\hat{\mathbf{c}}_i$. We then define a concept niche as:
Definition 3.4 (Concept niche).
The concept niche $N_j$ for target concept $c_j$, determined by concept nicher $\mu$ and threshold $\beta \in [0, 1]$, is defined as $N_j := \{\, i \mid \mu(i, j) \geq \beta \,\}$.
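A sketch of the CCorrN nicher and the resulting niches, under the assumption of scalar concept representations, is given below; the function names are ours.

```python
import numpy as np

def ccorr_nicher(c_hat, c_true):
    """mu[i, j] = |Pearson correlation(c_hat_i, c_j)| for scalar representations."""
    k = c_true.shape[1]
    mu = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            mu[i, j] = abs(np.corrcoef(c_hat[:, i], c_true[:, j])[0, 1])
    return mu

def concept_niche(mu, j, beta):
    """Indices of learnt concepts whose nicher score for concept j is at least beta."""
    return [i for i in range(mu.shape[0]) if mu[i, j] >= beta]
```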
From this, the Niche Impurity (NI) measures the predictive capacity of the complement of concept niche $N_j$, referred to as $N_j^c$, for the $j$-th ground truth concept:
Definition 3.5 (Niche Impurity (NI)).
Given a classifier $f$ mapping concept representations to ground truth concept predictions, a concept nicher $\mu$, a threshold $\beta$, and labelled concept representations $(\hat{\mathbf{C}}, \mathbf{C})$, the Niche Impurity of the $j$-th output of $f$ is defined as $\text{NI}_j(f, \mu, \beta)$, the test AUC-ROC of the $j$-th output of $f_{N_j^c}$ with respect to $c_j$, where $f_{N_j^c}$ is the classifier resulting from masking all entries in $N_j$ when feeding $f$ with concept representations.
Although $f$ can be any classifier, for simplicity in our experiments we use a small ReLU MLP (see Appendix A.4 for a discussion on our metric’s robustness to $f$’s architecture). Intuitively, a NI of $0.5$ (a random AUC for the niche complement) indicates that the concepts inside the niche are the only concepts predictive of the $j$-th concept; that is, concepts outside the niche do not hold any predictive information about the $j$-th concept. Finally, the Niche Impurity Score metric measures how much information apparently disentangled concepts are actually sharing:
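The sketch below estimates a single concept's niche impurity with one helper MLP per concept, a simplification of using a single multi-output classifier $f$; it assumes masking means zeroing out the niche's entries at test time, and the hidden-layer sizes are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def niche_impurity(c_hat, c_true, j, niche, hidden=(32, 32), seed=0):
    """AUC-ROC of predicting concept j once its niche has been masked out."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        c_hat, c_true[:, j], test_size=0.2, random_state=seed,
        stratify=c_true[:, j])
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500,
                        random_state=seed)
    clf.fit(x_tr, y_tr)
    # Mask (zero out) the entries inside the niche so that only the niche
    # complement is available to the classifier at evaluation time.
    x_masked = x_te.copy()
    x_masked[:, niche] = 0.0
    return roc_auc_score(y_te, clf.predict_proba(x_masked)[:, 1])
```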
Definition 3.6 (Niche Impurity Score (NIS)).
Given a classifier $f$ and concept nicher $\mu$, the niche impurity score is defined as the (normalised) summation of niche impurities across all concepts, integrated over different values of $\beta$: $\text{NIS}(f, \mu) := \int_0^1 \frac{1}{k} \sum_{j=1}^{k} \text{NI}_j(f, \mu, \beta) \, d\beta$.
In practice, this integral is estimated using the trapezoid method over a uniform grid of values of $\beta$ in $[0, 1]$. Furthermore, we parameterise $f$ as a small MLP, leading to a tractable impurity metric that scales to large concept sets. Intuitively, a NIS of $1$ means that all the information needed to perfectly predict each ground truth concept is spread across many different and disentangled concept representations. In contrast, a NIS around $0.5$ (random AUC) indicates that no concept can be predicted from any subset of concept representations.
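Putting the pieces together, the sketch below computes the full NIS by averaging niche impurities over concepts and integrating over the threshold with the trapezoid rule on an assumed uniform grid; it reuses the hypothetical helpers defined in the previous sketches.

```python
import numpy as np

def niche_impurity_score(c_hat, c_true, n_thresholds=21):
    """NIS: mean niche impurity across concepts, integrated over beta in [0, 1]."""
    k = c_true.shape[1]
    mu = ccorr_nicher(c_hat, c_true)
    betas = np.linspace(0.0, 1.0, n_thresholds)  # assumed uniform grid of thresholds
    mean_ni = []
    for beta in betas:
        ni_per_concept = [
            niche_impurity(c_hat, c_true, j, concept_niche(mu, j, beta))
            for j in range(k)
        ]
        mean_ni.append(np.mean(ni_per_concept))
    # Trapezoid-rule estimate of the integral over beta.
    return np.trapz(mean_ni, betas)
```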
4 Experiments
We now give a brief account of the experimental setup and datasets, followed by highlighting the utility of our impurity metrics and their applications to model benchmarking.
Datasets
To have datasets compatible with both CL and DGL, we construct tasks whose samples are fully described by a vector of ground truth generative factors. Moreover, we simulate real-world scenarios by designing tasks with varying degrees of dependencies in their concept annotations. To achieve this, we first design a parametric binary-class dataset TabularToy($\delta$), a variation of the tabular dataset proposed by Mahinpei et al. (2021). We also construct two multiclass image-based parametric datasets, dSprites($\lambda$) and 3dshapes($\lambda$), based on the dSprites (Matthey et al. 2017) and 3dshapes (Burgess and Kim 2018) datasets, respectively. They consist of image samples generated from vectors of 5 and 6 factors of variation, respectively. Both datasets have one binary concept annotation per factor of variation. Parameters $\delta$ and $\lambda$ control the degree of concept inter-dependencies during generation: $\delta = 0$ and $\lambda = 0$ represent inter-concept independence, while higher values represent stronger inter-concept dependencies. For dataset details see Appendix A.6.
| | OIS | NIS | SAP | MIG | $R^4$ | DCI Dis |
|---|---|---|---|---|---|---|
| Baseline Soft (%) | 4.69 ± 0.43 | 66.25 ± 2.31 | 48.74 ± 0.41 | 99.93 ± 0.03 | 99.95 ± 0.00 | 99.99 ± 0.00 |
| Impure Soft (%) | 22.58 ± 2.34 | 72.36 ± 1.26 | 48.83 ± 0.53 | 99.93 ± 0.04 | 99.95 ± 0.00 | 99.50 ± 0.01 |
| p-value | | | | | | |
Baselines and Setup
We compare the purity of concept representations in various methods using our metrics. We select representative methods from (i) supervised CL (i.e., jointly-trained CBMs (Koh et al. 2020) with sigmoidal and logits bottlenecks, and CW (Chen, Bei, and Rudin 2020) both when its representations are reduced through a MaxPool-Mean reduction and when no feature map reduction is applied), (ii) unsupervised CL (i.e., CCD (Yeh et al. 2020) and SENN (Alvarez-Melis and Jaakkola 2018)), (iii) unsupervised DGL (vanilla VAE (Kingma and Welling 2014) and $\beta$-VAE (Higgins et al. 2017)), and (iv) weakly supervised DGL (Ada-GVAE and Ada-MLVAE (Locatello et al. 2020a)). For each method and metric, we report the average metric values and confidence intervals obtained from 5 different random seeds. We include details on training and architecture hyperparameters in Appendix A.6.
4.1 Results and Discussion
In contrast to DGL metrics, our metrics can meaningfully capture impurities concealed in representations.
We begin by empirically showing that our metrics indeed capture impurities encoded within a concept representation. For this, we prepare a simple synthetic dataset of ground-truth concept vectors $\mathbf{c} \in \{0, 1\}^k$ where, for each sample, the $i$-th concept $c_i$ is a binary indicator of the sign taken by a latent variable $z_i$ sampled from a joint normal distribution with zero mean and covariance $\Sigma$. During construction, we simulate real-world co-dependencies between different concepts by setting $\Sigma$’s non-diagonal entries to a fixed non-zero value. To evaluate whether our metrics can discriminate between different levels of impurities encoded in different concept representations, we construct two sets of soft concept representations. First, we construct a baseline “pure” fuzzy representation of vector $\mathbf{c}$ by sampling $\hat{c}_i$ uniformly from a high-activation interval if $c_i = 1$ or from a low-activation interval if $c_i = 0$. Notice that each dimension of this representation preserves enough information to perfectly predict each concept’s activation state without encoding any extra information. In contrast, we construct a perfectly “impure” soft concept representation by encoding, as part of each concept’s fuzzy representation, the state of all other concepts. For this we partition and tile the high- and low-activation intervals into $2^{k-1}$ equally sized and disjoint subintervals, indexed by $d_i(\mathbf{c})$, the decimal representation of the binary vector resulting from removing the $i$-th dimension of $\mathbf{c}$. From here, we generate an impure representation of ground truth concept vector $\mathbf{c}$ by sampling $\hat{c}_i$ uniformly from the $d_i(\mathbf{c})$-th high-activation subinterval if $c_i = 1$ or from the $d_i(\mathbf{c})$-th low-activation subinterval otherwise. Intuitively, each concept in this soft representation encodes the activation state of every other concept using different confidence ranges. Therefore, one can perfectly predict all concepts from a single concept’s representation, an impossibility from ground truth concepts alone.
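The sketch below generates the two representation sets under explicit assumptions: the "pure" representation samples uniformly from [0, 0.5) when a concept is off and from [0.5, 1) when it is on, and the "impure" representation subdivides each of these two ranges into $2^{k-1}$ equal bins indexed by the decimal encoding of the remaining concepts. The exact intervals used in our experiments may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def pure_soft(c):
    """Sample a 'pure' fuzzy representation: high range iff the concept is on."""
    low = rng.uniform(0.0, 0.5, size=c.shape)
    high = rng.uniform(0.5, 1.0, size=c.shape)
    return np.where(c == 1, high, low)

def impure_soft(c):
    """Encode the state of every other concept inside each concept's range."""
    n, k = c.shape
    out = np.zeros((n, k))
    n_bins = 2 ** (k - 1)
    for i in range(k):
        others = np.delete(c, i, axis=1)
        # Decimal index of the remaining concepts' joint state.
        idx = others.dot(2 ** np.arange(k - 1)[::-1])
        # Split [0, 0.5) and [0.5, 1) into 2^(k-1) equal sub-intervals each.
        offset = np.where(c[:, i] == 1, 0.5, 0.0)
        start = offset + idx * (0.5 / n_bins)
        out[:, i] = rng.uniform(start, start + 0.5 / n_bins)
    return out
```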
We hypothesize that if a metric is capable of accurately capturing undesired impurities within concept representations, then it should generate vastly different scores for the two representation sets constructed above. To verify this hypothesis, we evaluate our metrics, together with a selection of DGL disentanglement metrics, and show our results in Table 1. We include SAP (Kumar, Sattigeri, and Balakrishnan 2017), $R^4$ (Ross and Doshi-Velez 2021), mutual information gap (MIG) (Chen et al. 2018), and DCI Disentanglement (DCI Dis) (Eastwood and Williams 2018) as representative DGL metrics, given their wide use in the DGL literature (Ross and Doshi-Velez 2021; Zaidi et al. 2020). Our results show that our metrics correctly capture the difference in impurity between the two representation sets in a statistically significant manner. In contrast, existing DGL metrics are incapable of clearly discriminating between these two impurity extremes, with DCI being the only metric that produces statistically significant differences, albeit with minimal score differences (less than 0.5%). Surprisingly, although MIG is inspired by a mutual information (MI) argument similar to that behind our OIS metric, it was unable to capture any meaningful differences between our two representation types. We believe this is because computing the MIG requires an estimate of the MI which, being sensitive to hyperparameters, may fail to capture important differences. These results, therefore, support using a non-linear model’s test AUC as a proxy for the MI. Further details can be found in Appendix A.7.
Our metrics can capture impurities caused by differences in concept representations and model capacities, as well as by accidental spurious correlations.
Impurities in a CL model can come from different sources, such as differences in concept representations, as previously shown in Figure 1(b), or architectural constraints (e.g., a CBM trained with a partial/incomplete set of concepts). Here, we show that impurities caused by differences in the nature of concept representations, as well as by inadequate model capacities and spurious correlations, can be successfully captured by our metrics and thus avoided.
Differences in concept representations: in Figure 2(a), we show the impurities in (1) a CBM with a sigmoidal bottleneck (CBM) vs a CBM with logits in its bottleneck (CBM-Logits), and (2) a CW module with and without feature map reduction (CW MaxPool-Mean vs CW Feature Map). Our metrics show that CBM-Logits and CW Feature Map are prone to encoding more impurities than their counterparts. This is because their representations are less constrained: logit activations can take values in any range (as opposed to $[0, 1]$ in CBM), and CW Feature Map preserves all information from its concept feature map by not reducing it to a scalar. The exception to this is the failure of NIS to detect impurities in CBM-Logits for 3dshapes. We hypothesise that this is due to this task’s higher complexity, forcing both CBM and CBM-Logits to distribute inter-concept information across all representations more than in other datasets.
Differences in model capacity: low-capacity models may be forced to use their concept representations to encode information outside their corresponding ground truth concept. To verify this, we train a CBM on TabularToy whose concept encoder and label predictor are three-layered ReLU MLPs. We vary the capacity of either the concept encoder or the label predictor by changing the number of units in its hidden layers, while fixing the number of hidden units in its counterpart. We then monitor the accuracy of concept representations w.r.t. their aligned ground truth concepts as well as their OIS. We observe (Figure 2(b)) that as the concept encoder and label predictor capacities decrease, the CBM exhibits significantly higher impurity and lower concept accuracy. Note that the concept encoder capacity has a significantly greater effect on the purity of the representations compared to the label predictor capacity. Measuring impurities in a systematic way using our metrics can therefore guide the design of CL architectures that are less prone to impurities.
Spurious correlations: we create a variation of dSprites, where we randomly introduce spurious correlations by assigning each sample a class-specific background colour with 75% probability (see Appendix A.8 for details). We train two identical CBMs on dSprites and its corrupted counterpart. During training, we note that CBM-S (the CBM trained on the corrupted data) has a higher task validation accuracy than the other CBM (Figure 3), while having similar concept validation accuracies. Nevertheless, when we evaluate both models using a test set sampled from the original dSprites dataset, we see an interesting result: both models can predict ground truth concept labels with similarly high accuracy. However, unlike CBM, CBM-S struggles to predict the task labels. Failure of CBM-S to accurately predict task labels is remarkable as labels in this dataset are uniquely a function of their corresponding concept annotations and CBM-S is able to accurately predict concepts in the original dSprites dataset. We conjecture that this is due to the fact that concepts in CBM-S encode significantly more information than needed, essentially encoding the background colour in addition to the original concepts as part of their representations. To verify this, we evaluate the OIS and NIS of the concept representations learnt by both models and observe that, in line with our intuition, CBM-S indeed encodes significantly more impurity. Our metrics can therefore expose spurious correlations captured by CL methods which appear to be highly predictive of concept labels while underperforming in their downstream task.
Our metrics can indicate when it is safe to perform interventions on a CBM by giving a realistic picture of impurities.
A major potential consequence of not being able to measure impurities faithfully is that concept interventions (Koh et al. 2020), which allow domain experts to adjust the model by intervening and fixing predictions at the concept level, may fail: adjusting a concept may unintentionally impact the label predictor’s understanding of another concept if representation $\hat{\mathbf{c}}_i$ encodes unnecessary information about concept $c_j$. To see this in practice, consider a CBM model and a CBM-Logits model both trained to convergence on dSprites, and both achieving fairly similar task and concept accuracies (Figure 5). We then perform interventions at random on their concept representations as follows: in CBM, where concept activations represent probabilities, we intervene on the $i$-th concept representation by setting $\hat{c}_i$ to the value of its corresponding ground truth concept $c_i$. In CBM-Logits, as in (Koh et al. 2020), we intervene on the $i$-th concept by setting it to the 5th percentile of the empirical distribution of $\hat{c}_i$ if $c_i = 0$, and to the 95th percentile of that concept’s distribution if $c_i = 1$. Interestingly, our results (Figure 5) show that random interventions cause a significant drop in the task accuracy of CBM-Logits while leading to an increase in accuracy for CBM. Looking at the impurities of these two models, we observe that although the CBM-Logits model has better accuracy, both its OIS and NIS scores are considerably higher than those of the CBM model, explaining why interventions had such undesired consequences.
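A sketch of the two intervention mechanisms described above is given below; `c_hat` holds the bottleneck activations for a batch, `c_true` the corresponding ground truth concepts, and `train_c_hat` the empirical activation distribution used to compute the percentiles (the variable names are ours).

```python
import numpy as np

def intervene_sigmoid(c_hat, c_true, idx):
    """CBM (probabilities): overwrite concept idx with its ground truth value."""
    c_new = c_hat.copy()
    c_new[:, idx] = c_true[:, idx]
    return c_new

def intervene_logits(c_hat, c_true, idx, train_c_hat):
    """CBM-Logits: overwrite concept idx with the 5th / 95th percentile of its
    empirical training distribution, depending on the ground truth value."""
    low = np.percentile(train_c_hat[:, idx], 5)
    high = np.percentile(train_c_hat[:, idx], 95)
    c_new = c_hat.copy()
    c_new[:, idx] = np.where(c_true[:, idx] == 1, high, low)
    return c_new
```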
To rule out the difference in intervention mechanism as the cause of these results, we train two CBM-Logits models with the same concept encoder capacity but different capacities in their label predictors and observe the same phenomenon as above: the lower-capacity model, whose performance degrades upon intervention, also exhibits higher OIS and NIS scores than the higher-capacity model. Further details about this experiment are documented in Appendix A.10.
Our metrics can provide insights on the impact of different degrees of supervision on concept purity.
As models of various families benefit from varying degrees of supervision ranging from explicit supervision (supervised CL) to implicit (unsupervised CL), weak (weakly-supervised DGL) and no supervision at all (unsupervised DGL), different models are expected to learn concept representations of varying purity. The intuitive assumption is that more supervision leads to better and purer concepts. Here, we compare models from all families using our metrics and show that, contrary to our intuition, this is not necessarily the case. Within variants of CBM and CW, we choose CBM without logits and CW MaxPool-Mean, as they tend to encode fewer impurities (see Figure 2(a)). Furthermore, given the tabular nature of TabularToy, we do not compare DGL methods in this task. Finally, for details on computing our metrics when an alignment between ground truth concepts and learnt representations is missing, see Appendix A.5.
In terms of task accuracy, the overall set of learnt concept representations is equally predictive of the downstream task across all surveyed methods (see Appendix A.9 for details). However, as discussed previously, models with the same task accuracy can encode highly varied levels of impurities in their individual concept representations. Figure 4(a) shows a comparison of impurities observed across methods using the OIS. CBM’s individual concepts consistently experience the least amount of impurity due to receiving explicit supervision, which is to be expected. Unexpectedly though, we observe that the same explicit supervision can lead to highly impure representations in CW. Indeed, CW’s impurities are on par with or greater than those of unsupervised approaches. Looking into implicit supervision, we observe that individual concepts in CCD and SENN do not correspond well to the ground truth ones. This indicates that the information about each ground truth concept is distributed across the overall representation rather than localised to individual concepts, leading to relatively high OIS. We attribute CCD’s lower impurity, compared with SENN, to its use of a regularisation term that encourages coherence between concept representations in similar samples and misalignment between concept representations in dissimilar samples. More interestingly, however, both CCD and SENN encode higher impurities than some DGL approaches, despite benefiting from task supervision. Within DGL approaches, astonishingly, using no supervision at all (unsupervised DGL) can result in purer individual concept representations than those of weakly-supervised DGL methods. This suggests that concept information may be heavily distributed in weakly-supervised DGL methods.
Moving from individual concepts, Figure 4(b) shows a comparison of impurities observed across subsets of concept representations using our NIS metric. Similar to our OIS results, the overall set of concept representations in CBM shows the least amount of impurity. Unlike individual impurities, however, the overall set of concept representations in DGL methods shows a higher impurity than that of explicitly supervised approaches. This can be explained by the fact that DGL methods seem to learn representations that are not fully aligned with our defined set of ground truth concepts, yet, when taken as a whole, they are still highly predictive of individual concepts. This would lead to complement niches being highly predictive of individual ground truth concepts even when individual representations in those niches are not fully predictive of that concept itself, resulting in relatively high NIS scores and lower OIS scores. Furthermore, notice that weakly supervised DGL methods show a lower niche impurity than unsupervised DGL methods, suggesting, as in Locatello et al. (2020a), that weakly-supervised representations are indeed more disentangled. We notice, however, that this decrease in impurity for weakly-supervised methods comes at the cost of their latent codes being less effective at predicting individual concepts than unsupervised latent codes (see Appendix A.11). Finally, within methods benefiting from implicit supervision, the overall set of learnt concepts in CCD has fewer impurities than that of SENN, which was similarly observed with individual concepts above.
5 Conclusion
Impurities in concept representations can lead to models accidentally capturing spurious correlations and can be indicative of potentially unexpected behaviour during concept interventions, which is crucial given that performing safe interventions is one of the main motivations behind CBMs. Despite this importance, current metrics in CL literature and the related field of DGL fail to fully capture such impurities. In this paper, we address these limitations by introducing two novel robust metrics that are able to circumvent several limitations in existing metrics and correctly capture impurities in learnt concept representations. Indeed, for the first time we are able to compare the purity of concepts in CL and DGL systematically, and show that, contrary to the common assumption, more explicit supervision does not necessarily translate to better concept quality, as measured by purity. More importantly, beyond comparison, our experiments show the utility of these metrics in designing and training more reliable and robust concept learning models. Therefore, we envision them to be an integral part of future tools developed for the safe deployment of concept-based models in real-world scenarios.
Acknowledgements
The authors would like to thank our reviewers for their insightful comments on earlier versions of this manuscript. MEZ acknowledges support from the Gates Cambridge Trust via a Gates Cambridge Scholarship. PB acknowledges support from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 848077. UB acknowledges support from DeepMind and the Leverhulme Trust via the Leverhulme Centre for the Future of Intelligence (CFI), and from a JP Morgan Chase AI PhD Fellowship. AW acknowledges support from a Turing AI Fellowship under grant EP/V025279/1, The Alan Turing Institute, and the Leverhulme Trust via CFI. MJ is supported by the EPSRC grant EP/T019603/1.
References
- Abadi et al. (2016) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), 265–283.
- Altman and Krzywinski (2015) Altman, N.; and Krzywinski, M. 2015. Points of Significance: Association, correlation and causation. Nature methods, 12(10).
- Alvarez-Melis and Jaakkola (2018) Alvarez-Melis, D.; and Jaakkola, T. S. 2018. Towards robust interpretability with self-explaining neural networks. In Neural Information Processing Systems (NeurIPS).
- Bengio, Courville, and Vincent (2013) Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798–1828.
- Bhatt et al. (2020) Bhatt, U.; Xiang, A.; Sharma, S.; Weller, A.; Taly, A.; Jia, Y.; Ghosh, J.; Puri, R.; Moura, J. M.; and Eckersley, P. 2020. Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 648–657.
- Burgess and Kim (2018) Burgess, C.; and Kim, H. 2018. 3D Shapes Dataset. https://github.com/deepmind/3dshapes-dataset/.
- Chen et al. (2018) Chen, R. T.; Li, X.; Grosse, R.; and Duvenaud, D. 2018. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942.
- Chen, Bei, and Rudin (2020) Chen, Z.; Bei, Y.; and Rudin, C. 2020. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2(12): 772–782.
- Ciravegna et al. (2021) Ciravegna, G.; Barbiero, P.; Giannini, F.; Gori, M.; Lió, P.; Maggini, M.; and Melacci, S. 2021. Logic Explained Networks. CoRR, abs/2108.05149.
- Darwin (1859) Darwin, C. 1859. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life.
- Eastwood and Williams (2018) Eastwood, C.; and Williams, C. K. I. 2018. A Framework for the Quantitative Evaluation of Disentangled Representations. In International Conference on Learning Representations (ICLR). OpenReview.net.
- Ghorbani et al. (2019) Ghorbani, A.; Wexler, J.; Zou, J. Y.; and Kim, B. 2019. Towards Automatic Concept-based Explanations. In Neural Information Processing Systems (NeurIPS), 9273–9282.
- Higgins et al. (2017) Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In International Conference on Learning Representations (ICLR). OpenReview.net.
- Kazhdan et al. (2021) Kazhdan, D.; Dimanov, B.; Andrés-Terré, H.; Jamnik, M.; Liò, P.; and Weller, A. 2021. Is Disentanglement all you need? Comparing Concept-based & Disentanglement Approaches. CoRR, abs/2104.06917.
- Kazhdan et al. (2020) Kazhdan, D.; Dimanov, B.; Jamnik, M.; Liò, P.; and Weller, A. 2020. Now You See Me (CME): Concept-based Model Extraction. In Conference on Information and Knowledge Management (CIKM) Workshops, volume 2699 of CEUR Workshop Proceedings. CEUR-WS.org.
- Kim et al. (2018) Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C. J.; Wexler, J.; Viégas, F. B.; and Sayres, R. 2018. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, 2673–2682. PMLR.
- Kim and Mnih (2018) Kim, H.; and Mnih, A. 2018. Disentangling by Factorising. In Dy, J. G.; and Krause, A., eds., International Conference on Machine Learning, (ICML), volume 80 of Proceedings of Machine Learning Research, 2654–2663. PMLR.
- Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kingma and Welling (2014) Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In Bengio, Y.; and LeCun, Y., eds., International Conference on Learning Representations (ICLR).
- Koh et al. (2020) Koh, P. W.; Nguyen, T.; Tang, Y. S.; Mussmann, S.; Pierson, E.; Kim, B.; and Liang, P. 2020. Concept Bottleneck Models. In International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, 5338–5348. PMLR.
- Kumar, Sattigeri, and Balakrishnan (2017) Kumar, A.; Sattigeri, P.; and Balakrishnan, A. 2017. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848.
- Locatello et al. (2019) Locatello, F.; Bauer, S.; Lucic, M.; Rätsch, G.; Gelly, S.; Schölkopf, B.; and Bachem, O. 2019. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, 4114–4124. PMLR.
- Locatello et al. (2020a) Locatello, F.; Poole, B.; Rätsch, G.; Schölkopf, B.; Bachem, O.; and Tschannen, M. 2020a. Weakly-Supervised Disentanglement Without Compromises. In International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, 6348–6359. PMLR.
- Locatello et al. (2020b) Locatello, F.; Tschannen, M.; Bauer, S.; Rätsch, G.; Schölkopf, B.; and Bachem, O. 2020b. Disentangling Factors of Variations Using Few Labels. In International Conference on Learning Representations (ICLR). OpenReview.net.
- Lundberg and Lee (2017) Lundberg, S. M.; and Lee, S. 2017. A Unified Approach to Interpreting Model Predictions. In Annual Conference on Neural Information Processing Systems (NeurIPS), 4765–4774.
- Mahinpei et al. (2021) Mahinpei, A.; Clark, J.; Lage, I.; Doshi-Velez, F.; and Pan, W. 2021. Promises and Pitfalls of Black-Box Concept Learning Models. CoRR, abs/2106.13314.
- Margeloiu et al. (2021) Margeloiu, A.; Ashman, M.; Bhatt, U.; Chen, Y.; Jamnik, M.; and Weller, A. 2021. Do Concept Bottleneck Models Learn as Intended? CoRR, abs/2105.04289.
- Matthey et al. (2017) Matthey, L.; Higgins, I.; Hassabis, D.; and Lerchner, A. 2017. dsprites: Disentanglement testing sprites dataset.
- Ribeiro, Singh, and Guestrin (2016) Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. ”Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.
- Ridgeway and Mozer (2018) Ridgeway, K.; and Mozer, M. C. 2018. Learning Deep Disentangled Embeddings With the F-Statistic Loss. In Advances in Neural Information Processing Systems (NeurIPS), 185–194.
- Ross and Doshi-Velez (2021) Ross, A.; and Doshi-Velez, F. 2021. Benchmarks, algorithms, and metrics for hierarchical disentanglement. In International Conference on Machine Learning, 9084–9094. PMLR.
- Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
- Welch (1947) Welch, B. L. 1947. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika, 34(1-2): 28–35.
- Yeh et al. (2020) Yeh, C.; Kim, B.; Arik, S. Ö.; Li, C.; Pfister, T.; and Ravikumar, P. 2020. On Completeness-aware Concept-Based Explanations in Deep Neural Networks. In Neural Information Processing Systems (NeurIPS).
- Zaidi et al. (2020) Zaidi, J.; Boilard, J.; Gagnon, G.; and Carbonneau, M.-A. 2020. Measuring disentanglement: A review of metrics. arXiv preprint arXiv:2012.09276.
Appendix A Appendix
A.1 Metrics Related to Properties of Concept Representations w.r.t. Ground Truth Concepts
Refer to Table A1 for a summary of some metrics used for measuring the different properties of disentangled representations which are applicable to DGL concept representations.
Individual learnt concepts w.r.t. ground truth concepts | Overall set of learnt concepts w.r.t. ground truth concepts |
---|---|
Modularity (Ridgeway and Mozer 2018): Whether each learnt concept corresponds to at most one ground truth one. Measured by deviation from the ideal case where each learnt concept has high mutual information with only one ground truth concept and shares no information with others. | Explicitness (Ridgeway and Mozer 2018): Whether the overall set of learnt concepts can predict each individual ground truth concept using a simple (e.g., linear) classifier. Measured by the average predictive performance of concept vector. |
Mutual information gap (Chen et al. 2018): Whether learnt concepts are disentangled. Measured by averaging the gap in mutual information between the two learnt concepts that have the highest mutual information with a ground truth concept. The mutual information needs to be estimated if concept representations are continuous. This metric generalises the disentanglement scores in Higgins et al. (2017) and Kim and Mnih (2018). | DCI Informativeness (Eastwood and Williams 2018): Whether the overall concept vector can predict each ground truth concept with low prediction error. Measured by average prediction error of concept vector. |
DCI Disentanglement (Eastwood and Williams 2018): Whether each learnt concept captures at most one ground truth one. Measured by the weighted average of disentanglement degree of each learnt concept. Such degree is calculated based on entropy, where high entropy for a learnt concept shows its equal importance for all ground truth ones and therefore its low disentanglement degree. The weight is calculated based on the aggregation of relative importance of a learnt concept in predicting each ground truth concept. | |
Alignment (Yeh et al. 2020): Whether the learnt concepts match the ground truth ones. Measured by average accuracy of predicting each ground truth concept using the learnt one that predicts it best. | |
SAP (Kumar, Sattigeri, and Balakrishnan 2017): Whether learnt concept representations are aligned to one and only one ground truth generative factor. Measured as the average spread in importance scores (computed using e.g. linear regression) between the two most predictive concept representations for each ground truth concept. This metric, however, does not naturally capture whether a concept representation is highly predictive of two or more ground truth concepts. | |
$R^4$ (Ross and Doshi-Velez 2021): A generalisation of the SAP score to measure whether the learnt concept representations are aligned to one and only one ground truth generative factor. It operates similarly to SAP but it (1) uses a non-linear method for computing importance scores, (2) computes the importance spread across all concept representations rather than between the top two concept representations, and (3) computes concept representation importance scores using a bijective correspondence (where a concept representation’s importance score is a function of how predictive its representation is for a ground truth concept and how predictive that same ground truth concept is for that concept’s latent representation). |
A.2 Toy CBM Correlation Experiment Details
For our toy CBM experiment described in Figure 1(b) in Section 2, we consider the toy dataset proposed in Mahinpei et al. (2021), where samples have three concept annotations with no correlation and a label which is a function of the concepts (see Appendix A.6 for details). Using the task and concept labels in this dataset, we train two CBMs with the exact same architecture (the same architecture used for our TabularToy dataset as described in Appendix A.6), with the exception that one CBM uses the raw logits outputted by the concept encoder in its bottleneck while the other applies a sigmoidal activation to those logits. After training to convergence, we observe that both of these models are able to achieve an almost identical average concept accuracy and an almost perfect task accuracy. Nevertheless, as seen in Figure 1(b), when looking at the absolute value of the Pearson correlation coefficients between concept representations for both models, the CBM-Logits model learns concept representations which are not fully independent of each other.
A.3 Purity Matrix Implementation Details
We compute the $(i, j)$-th entry of the purity matrix as follows: we split the original testing data into two disjoint sets, a new training set $\mathcal{D}_{\text{train}}$ and a new testing set $\mathcal{D}_{\text{test}}$, using a traditional 80%-20% split. We then use the concept representations learnt for the $i$-th concept for samples in $\mathcal{D}_{\text{train}}$ to train a two-layered ReLU MLP $\psi_{i \to j}$ with a single hidden layer with 32 activations to predict the truth value of the $j$-th ground-truth concept from the $i$-th learnt concept representation. In other words, we train $\psi_{i \to j}$ using labelled samples $\{(\hat{\mathbf{c}}_i^{(l)}, c_j^{(l)})\}$ from $\mathcal{D}_{\text{train}}$. For simplicity, we train each helper model for a small, fixed number of epochs using mini-batches of training samples from $\mathcal{D}_{\text{train}}$. Finally, we set the $(i, j)$-th entry of the purity matrix as the AUC achieved when evaluating $\psi_{i \to j}$ on the new testing set $\mathcal{D}_{\text{test}}$.
A.4 Architecture Choice for Helper Model in OIS and NIS
Both the OIS and NIS depend on using a separate helper parametric model in order to compute their values. In the case of the OIS, its computation requires training $k^2$ models $\psi_{i \to j}$ which, as discussed in Appendix A.3, we implement as a simple two-layer ReLU MLP with 32 activations in its hidden layer. Similarly, to compute the NIS we need to train a helper classifier $f$ which, for simplicity, we implement as a small ReLU MLP. This strategy, i.e., using a helper classifier to compute a metric, is akin to that seen in other metrics in the DGL space such as SAP (Kumar, Sattigeri, and Balakrishnan 2017) and $R^4$ (Ross and Doshi-Velez 2021). Nevertheless, the requirement of including a parametric model as part of the metric’s computation raises the question of how sensitive our metrics are to the choice of model for these helper architectures. Hence, in this section we address this question and empirically show that our metrics’ results are consistent once the capacity of the helper models is not constrained (i.e., by extremely small hidden layers).
In Figure A.1, we show the result of computing both the OIS and NIS for dSprites as we vary the number of activations used in the hidden layers of $\psi_{i \to j}$ and $f$. Specifically, we compute both metrics for a CBM with logit bottleneck activations (Joint-CBM-Logits) and one whose activations are sigmoidal (Joint-CBM). Both models use the architecture and training procedure described for our other dSprites experiments in Appendix A.6. Our ablation on the capacity of the helper models shows that, for both the OIS and NIS, our metrics are stable and consistent as long as the helper models are provided with a non-trivial amount of capacity (e.g., we observe that, beyond a small minimum hidden size, both the OIS and NIS yield close to constant results in dSprites). These results motivate the architectures we selected for our helper models for all the results presented in this paper and suggest that our metrics can be used in practice without the need to fine-tune their helper models.
A.5 Concept Alignment in Unsupervised Concept Representations
In the presence of ground truth concept annotations, one can still compute a purity matrix, and therefore the OIS, even when the concept representations being evaluated were learnt without direct concept supervision. We achieve this by finding an injective alignment $\sigma$ between ground truth concepts and learnt concept representations (we assume the number of learnt representations is at least the number of ground truth concepts $k$). In this setting, we let $\sigma(i) = j$ represent the fact that the $i$-th ground truth concept $c_i$ is best represented by the $j$-th learnt concept representation $\hat{\mathbf{c}}_j$. In our work, we greedily compute this alignment starting from a set $U_c$ of unmatched ground truth concepts and a set $U_{\hat{c}}$ of unmatched learnt concept representations, iteratively constructing $\sigma$ by adding one match at a time. Specifically, at each step we match ground truth concept $c_i \in U_c$ with learnt concept representation $\hat{\mathbf{c}}_j \in U_{\hat{c}}$ if one can predict concept $c_i$ from $\hat{\mathbf{c}}_j$ better than any other unmatched concept representation can predict any other unmatched ground truth concept. We evaluate the predictability of ground truth concept $c_i$ from learnt concept representation $\hat{\mathbf{c}}_j$ by training a ReLU MLP with a single hidden layer with 32 activations and evaluating its AUC on a test set. Once a match between ground truth concept $c_i$ and learnt concept representation $\hat{\mathbf{c}}_j$ has been established, we remove $c_i$ from $U_c$ and $\hat{\mathbf{c}}_j$ from $U_{\hat{c}}$. We repeat this process until we have found a match for every ground truth concept (i.e., until $U_c$ becomes the empty set). Notice that in practice, one needs to compute the predictability of each ground truth concept from each concept representation only once when building alignment $\sigma$.
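A sketch of this greedy alignment is shown below; `predictability[i, j]` is assumed to already hold the test AUC of predicting ground truth concept `i` from learnt representation `j` (computed once, as noted above), and the function returns a mapping from each ground truth concept to its matched representation.

```python
import numpy as np

def greedy_alignment(predictability):
    """Greedily match each ground truth concept (row) to a learnt
    representation (column) by repeatedly taking the best remaining pair."""
    n_concepts, n_reprs = predictability.shape
    unmatched_concepts = set(range(n_concepts))
    unmatched_reprs = set(range(n_reprs))
    sigma = {}
    scores = predictability.astype(float)  # working copy we can mask
    while unmatched_concepts:
        # Best remaining (concept, representation) pair.
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        i, j = int(i), int(j)
        sigma[i] = j
        unmatched_concepts.discard(i)
        unmatched_reprs.discard(j)
        # Remove the matched row and column from further consideration.
        scores[i, :] = -np.inf
        scores[:, j] = -np.inf
    return sigma
```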
A.6 Experiment Details
In this section we provide further details on the datasets used for our experiments as well as the architectures and training regimes used in our work.
Datasets
For benchmarking, we use three datasets: a simple synthetic toy tabular dataset extending that defined in (Mahinpei et al. 2021), dSprites (Matthey et al. 2017), and 3dshapes (Burgess and Kim 2018). For our evaluation, we re-implement from scratch the first dataset while we use the open-sourced MIT-licensed releases of the latter two datasets. In order to investigate the impact of concept correlation on the purity of concept representations in each method, we allow a varying degree of correlations in concept annotations and propose the following parameterized tasks:
• TabularToy($\delta$): this dataset consists of inputs $\mathbf{x}$ such that the $i$-th coordinate $x_i$ is generated by applying a non-invertible nonlinear function to 3 latent factors $(z_1, z_2, z_3)$. These latent factors are sampled from a multivariate normal distribution with zero mean and a covariance matrix with $\delta$ in its off-diagonal elements. The concept annotations for each sample correspond to the binary vector $\mathbf{c} = (\mathbb{1}[z_1 > 0], \mathbb{1}[z_2 > 0], \mathbb{1}[z_3 > 0])$ and the task we use is to determine whether at least two of the latent variables are positive, i.e., we set $y = \mathbb{1}[c_1 + c_2 + c_3 \geq 2]$. The individual functions used to generate each coordinate of $\mathbf{x}$ are the same as those defined in (Mahinpei et al. 2021), as are the sizes of the generated training and testing sets (a minimal generation sketch is given after this list).
• dSprites($\lambda$): we define a task based on the dSprites dataset where each sample is a grayscale image containing a square, an ellipse, or a heart with varying positions, scales, and rotations. Each sample is procedurally generated from a vector of 5 ground truth factors of variation (one of which indicates an angle of rotation) and is assigned a binary concept annotation vector with 5 elements, one per factor of variation. For this task, we construct a set of labels $y$ from the concept annotations as a piecewise function of the base-10 representation of the binary concept vector. Finally, we parameterise this dataset on the correlation number $\lambda$ that indicates the number of random correlations we introduce across the sample-generating factors of variation (with $\lambda = 0$ implying all factors of variation are independent). For example, if $\lambda = 1$, we introduce a conditional correlation between factors of variation $z_1$ (“shape”) and $z_2$ (“scale”) by assigning each value of $z_1$ a random subset of values that $z_2$ may take given $z_1$. This subset is sampled by selecting, at random for each possible value of $z_1$, half of all the values that $z_2$ can take. More specifically, if $z_1$ and $z_2$ can take a total of $n_1$ and $n_2$ different values, respectively, then for each value $v$ of $z_1$ we constrain $z_2$ to take only values from the set $S_v := \text{SWOR}(\{1, \dots, n_2\}, \lceil n_2 / 2 \rceil)$, where SWOR stands for Sample Without Replacement and $\text{SWOR}(A, n)$ is a function that takes in a set $A$ and a number $n$ and returns a set of $n$ elements sampled without replacement from $A$. This process is recursively extended for higher values of $\lambda$ by letting the dataset generated with correlation number $\lambda$ be the same as the dataset generated with $\lambda - 1$ with the addition of a new conditional correlation between a further pair of factors of variation. Finally, in order to maintain a constant dataset cardinality as $\lambda$ varies, we subsample all allowed factors of variation by selecting only every other allowed value for each of them. This guarantees that once a conditional correlation is added, the cardinality of the resulting dataset is the same as the previous one. Because of this, all parametric variants of this dataset consist of approximately the same number of samples.
• 3dshapes($\lambda$): we define a task based on the 3dshapes dataset where each sample is a color image containing a sphere, a cube, a capsule, or a cylinder with varying component hues, orientation, and scale. Each sample is procedurally generated from a vector of 6 ground truth factors of variation and is assigned a binary concept annotation vector with 6 elements, one per factor of variation. For this task, we construct a set of labels $y$ from the concept annotations as a piecewise function of the binary concept vector. As in the dSprites task defined above, we further parameterise this dataset with parameter $\lambda$ to control the number of random conditional correlations we introduce at construction time. The procedure used to introduce such correlations is the same as in dSprites($\lambda$), but with a slightly different rule for determining the set of values that we sample from. Similarly, we use the same subsampling as in dSprites($\lambda$) to maintain a constant-sized dataset, resulting in all parametric variants of this dataset having approximately the same number of samples.
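As referenced in the TabularToy item above, a minimal generation sketch follows. The specific non-invertible functions mapping the latents to the input coordinates are placeholders here and only loosely follow Mahinpei et al. (2021); the concept and label constructions match the description above.

```python
import numpy as np

def generate_tabular_toy(n_samples, delta, seed=0):
    """Sketch of TabularToy(delta): latents z ~ N(0, Sigma) with off-diagonal
    delta, binary concepts c_i = 1[z_i > 0], label y = 1[at least two z_i > 0].
    The input features below are placeholder non-invertible nonlinearities,
    not the exact functions of Mahinpei et al. (2021)."""
    rng = np.random.default_rng(seed)
    sigma = np.full((3, 3), delta)
    np.fill_diagonal(sigma, 1.0)
    z = rng.multivariate_normal(np.zeros(3), sigma, size=n_samples)
    c = (z > 0).astype(int)                   # concept annotations
    y = (c.sum(axis=1) >= 2).astype(int)      # task label
    x = np.column_stack([                     # placeholder feature maps
        np.sin(z[:, 0]) + z[:, 1] ** 2,
        np.cos(z[:, 1]) * z[:, 2],
        np.abs(z[:, 0] * z[:, 2]),
        np.tanh(z[:, 0] + z[:, 1] + z[:, 2]),
    ])
    return x, c, y
```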

Model Architectures and Training
For our evaluation, we select CBM (Koh et al. 2020) from the supervised CL family due to its fundamental role in CL. We focus on jointly trained CBMs, where the task-specific loss and the concept prediction loss are minimised jointly. Specifically, we evaluate CBMs that use sigmoid activations in the output of their concept encoder models (Joint-CBM) as well as CBMs with concept encoders outputting logits (Joint-CBM-Logits). Given that CW (Chen, Bei, and Rudin 2020) explicitly aims to decorrelate concept representations, we use it as another supervised CL baseline, both when reducing the concept representations into scalars using a MaxPool-Mean reduction as in (Chen, Bei, and Rudin 2020) (CW-MaxPoolMean) and when using entire feature maps as concept representations (CW Feature Map). From the unsupervised CL family, we select CCD (Yeh et al. 2020) due to its data-agnostic nature and SENN (Alvarez-Melis and Jaakkola 2018) due to its particular mix of ideas from both the DGL and CL literature. From the DGL family, we consider two weakly supervised methods, Adaptive Group Variational Autoencoder (Ada-GVAE) and Adaptive Multilevel Variational Autoencoder (Ada-MLVAE) (Locatello et al. 2020a), as well as two unsupervised methods, namely vanilla Variational Autoencoders (VAE) (Kingma and Welling 2014) and β-VAE (Higgins et al. 2017).
Toy Dataset Setup
With the exception of the introductory example in Section 2, in all reported dataset results for CBM, CBM-Logits, CCD, and SENN we use a 4-layer ReLU MLP with activations as the concept encoder and a 4-layer ReLU MLP with activations as the label predictor (for SENN, we also include a 4-layer ReLU MLP with activations as its helper decoder during training). For CW, we use the same architecture, with the exception that a CW module is applied to the output of the concept encoder model. For CBM, CBM-Logits, SENN, and CCD, we train each model for epochs with a batch size of . For CW, we train each model for epochs, with CW updates occurring every 20 batches, using a larger batch size of to obtain better batch statistics. Finally, for the mixture hyperparameter of the joint loss in CBM and CBM-Logits, we use a value of unless specified otherwise.
For the example TabularToy experiment in Section 2, we use the same training procedure as for the CBMs described above, but with a 4-layer ReLU MLP with activations as the concept encoder and a 3-layer ReLU MLP with activations as the label predictor. Everything else is set up as above.
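As an illustration of the jointly trained CBM setup described above, the following sketch wires a small sigmoid-output concept encoder into a label predictor and minimises a weighted sum of the concept and task losses. The hidden widths, input dimensionality, concept/label counts, and mixture weight are placeholders rather than the values used in our experiments.

```python
import tensorflow as tf

N_CONCEPTS, N_LABELS, ALPHA = 3, 2, 1.0  # placeholder values

# Concept encoder: 4-layer ReLU MLP with sigmoid concept outputs (Joint-CBM).
concept_encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_CONCEPTS, activation="sigmoid"),
])

# Label predictor: 4-layer ReLU MLP mapping concepts to task logits.
label_predictor = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_LABELS),
])

inputs = tf.keras.Input(shape=(7,))  # placeholder toy input dimensionality
concepts = concept_encoder(inputs)
labels = label_predictor(concepts)
model = tf.keras.Model(inputs, [concepts, labels])

# Joint loss: task loss plus ALPHA times the concept loss, minimised together.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=[tf.keras.losses.BinaryCrossentropy(),
          tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)],
    loss_weights=[ALPHA, 1.0],
)
# model.fit(x, [c, y], epochs=..., batch_size=...)
```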

Implementation Details
dSprites and 3dshapes Setup
With the exception of our dSprites intervention experiments in Section 4, for our CBM, CBM-Logits, and CCD experiments on the dSprites and 3dshapes tasks, we use a Convolutional Neural Network (CNN) with four (3x3-Conv + BatchNorm + ReLU + MaxPool) blocks with feature maps followed by three fully-connected layers with activations for the concept encoder model (with being the number of ground truth concepts in the dataset). Furthermore, for the label predictor model we use a simple 4-layer ReLU MLP with activations (with being the number of output labels in the task). For CW's concept encoder, we use a CNN with three (3x3-Conv + BatchNorm + ReLU + MaxPool) blocks with feature maps followed by a (3x3-Conv + CW) block with feature maps. For the label predictor, we use a model composed of a MaxPool layer followed by five ReLU fully-connected layers with activations . All models evaluated for CBM, CBM-Logits, and CCD are trained for epochs using a batch size of . In contrast, CW models are trained for epochs using a batch size of and a CW module update step every batches. Finally, a value of is used for the mixture hyperparameter during joint CBM and CBM-Logits training.
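For concreteness, a Keras-style sketch of the convolutional concept encoder described above is given below. Only the block structure (four 3x3-Conv + BatchNorm + ReLU + MaxPool blocks followed by three fully-connected layers) is taken from the description; the filter counts, fully-connected widths, and concept count are assumed placeholders.

```python
import tensorflow as tf

def conv_block(filters):
    # One (3x3-Conv + BatchNorm + ReLU + MaxPool) block.
    return [
        tf.keras.layers.Conv2D(filters, 3, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPool2D(),
    ]

def make_concept_encoder(n_concepts, filters=(16, 32, 64, 64), fc_units=(128, 64)):
    # Placeholder filter counts and fully-connected widths.
    layers = []
    for f in filters:
        layers += conv_block(f)
    layers += [tf.keras.layers.Flatten()]
    layers += [tf.keras.layers.Dense(u, activation="relu") for u in fc_units]
    # Sigmoid concept outputs for Joint-CBM; no activation for Joint-CBM-Logits.
    layers += [tf.keras.layers.Dense(n_concepts, activation="sigmoid")]
    return tf.keras.Sequential(layers)

encoder = make_concept_encoder(n_concepts=6)
encoder.build(input_shape=(None, 64, 64, 1))  # dSprites-sized grayscale input
```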
For the CBM and CBM-Logits models used in our intervention experiments, we use the same training setup and concept encoder architecture as described above but different label predictor architectures. For the CBMs with the same capacity but different activation functions on their concept encoders (i.e., logits vs sigmoid), we use a simple 4-layer ReLU MLP with activations as their label predictor architecture. For the intervention experiment showcasing two CBMs with different capacities in Appendix A.10, we use the same concept encoder architecture as described for CBMs above, and we use a ReLU MLP with layers as the label predictor in the higher-capacity model and a ReLU MLP with layers as the label predictor in the lower-capacity model.
For evaluating VAE, β-VAE, Ada-GVAE, and Ada-MLVAE, we use the same architecture as CBM's and CCD's concept encoder for the encoder, and the same architecture as in (Locatello et al. 2020a) for the decoder. The decoder consists of two ReLU fully-connected layers with activations followed by four ReLU deconvolutional layers with feature maps . All DGL models are trained for epochs using a batch size of . Weakly-supervised models are trained with a dataset consisting of pairs of images that share at least one factor of variation (with being the number of samples in the original dataset), while unsupervised models are trained with the same dataset used for the CL methods. As with the other methods, we train all DGL models using a default Adam optimizer (Kingma and Ba 2014) with learning rate .
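To illustrate what the weakly-supervised training pairs look like, the sketch below generates a pair of factor vectors that agree on at least one randomly chosen factor (the second sample re-randomises the remaining factors). It is a simplified stand-in for the pairing scheme of Locatello et al. (2020a), with factor ranges given as placeholders.

```python
import numpy as np

def sample_pair(factor_sizes, rng):
    """Return two factor vectors that share at least one factor of variation."""
    v1 = np.array([rng.integers(s) for s in factor_sizes])
    n_factors = len(factor_sizes)
    # Choose a non-empty subset of factors to keep fixed across the pair.
    n_shared = rng.integers(1, n_factors + 1)
    shared = rng.choice(n_factors, size=n_shared, replace=False)
    v2 = np.array([rng.integers(s) for s in factor_sizes])
    v2[shared] = v1[shared]
    return v1, v2

rng = np.random.default_rng(0)
factor_sizes = [3, 6, 40, 32, 32]  # placeholder, dSprites-like factor ranges
v1, v2 = sample_pair(factor_sizes, rng)
print(v1, v2)  # the two vectors agree on at least one coordinate
```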
When evaluating CCD, we use a threshold of for computing concept scores and the same regulariser parameters as in the code released by Yeh et al. (2020). Finally, all CCD models, across all tasks, are trained for epochs with a batch size of using a default Adam optimizer with learning rate .
When benchmarking SENN, we use the same architectures as in the DGL methods for the concept encoder and its corresponding decoder. Note that the decoder in this case is only used as part of the regularisation term during training. For the weight model (i.e., the "parameterizer"), we use a simple ReLU MLP with unit sizes (with being the number of concepts SENN will learn). Finally, we use an additive aggregation function, with the robustness and sparsity regularisation strengths set as in (Alvarez-Melis and Jaakkola 2018). We train our SENN models for 100 epochs using a batch size of 32 and a default Adam optimizer with learning rate .
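For reference, SENN's additive aggregation combines each concept score with its input-dependent relevance from the parameterizer. A minimal sketch of that aggregation step alone (with assumed tensor shapes, and omitting the robustness and sparsity regularisers) is:

```python
import tensorflow as tf

def senn_additive_aggregation(concept_scores, relevances):
    """Additive SENN prediction: for each class, sum relevance * concept score.

    concept_scores: (batch, n_concepts)         output of the concept encoder
    relevances:     (batch, n_concepts, n_out)  output of the parameterizer
    returns:        (batch, n_out)              class logits
    """
    return tf.einsum("bc,bco->bo", concept_scores, relevances)

# Illustrative shapes only.
scores = tf.random.uniform((8, 5))
thetas = tf.random.normal((8, 5, 3))
logits = senn_additive_aggregation(scores, thetas)  # shape (8, 3)
```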
Libraries
We implemented all of our methods in Python 3.7 using TensorFlow (Abadi et al. 2016) as our main framework. For our implementation of CW, we adapted and modified the MIT-licensed public release of CW accompanying the original paper (Chen, Bei, and Rudin 2020). Similarly, for our implementation of all DGL methods, we adapted the MIT-licensed open-sourced library accompanying the publication of (Locatello et al. 2020a). Finally, we make use of the open-sourced MIT-licensed library released by Kazhdan et al. (2021) for easy access to concept-based evaluation metrics and dataset wrappers for 3dshapes and dSprites. All other methods and metrics were implemented from scratch and will be made publicly available upon release of the paper.
Resource Utilization
We ran our experiments on a private GPU cluster with 125 GiB of RAM, 4 Nvidia Titan Xp GPUs, and 40 Intel(R) Xeon(R) CPUs (E5-2630 v4 @ 2.20GHz). We estimate that around 180 GPU-hours were required to complete all of our experiments (this includes running all experiments across 5 random seeds).
A.7 DGL Metric Evaluation Details
In order to compare our metrics with commonly used DGL metrics on the toy dataset of Section 4, we make use of the open-sourced library accompanying the publication of (Locatello et al. 2020a) to compute DCI and MIG scores. Furthermore, we follow the implementation accompanying the publication of (Ross and Doshi-Velez 2021) for evaluating the metric and SAP scores. When computing the MIG, we follow the same steps as Ross and Doshi-Velez (2021) and estimate the mutual information using a 2D histogram with bins. As in the rest of our experiments, we report each metric after computing it over toy datasets sampled with 5 different random seeds, and we include p-values computed using a two-sided Welch's t-test (Welch 1947).
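As a reference point for the histogram-based MIG estimation mentioned above, the following is a minimal sketch (with a placeholder bin count, and not the exact implementation we used) of estimating mutual information with a 2D histogram and computing MIG as the normalised gap between the two most informative latents for each factor.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def histogram_mi(z, v, bins=20):
    """Estimate I(z; v) by discretising the pair with a 2D histogram."""
    joint, _, _ = np.histogram2d(z, v, bins=bins)
    return mutual_info_score(None, None, contingency=joint)

def mig(latents, factors, bins=20):
    """latents: (N, n_latents) with n_latents >= 2; factors: (N, n_factors), discrete."""
    n_latents, n_factors = latents.shape[1], factors.shape[1]
    mi = np.array([[histogram_mi(latents[:, i], factors[:, j], bins)
                    for j in range(n_factors)] for i in range(n_latents)])
    gaps = []
    for j in range(n_factors):
        sorted_mi = np.sort(mi[:, j])[::-1]
        # The factor's entropy normalises the gap between the top-2 latents.
        h = mutual_info_score(factors[:, j], factors[:, j])
        gaps.append((sorted_mi[0] - sorted_mi[1]) / max(h, 1e-12))
    return float(np.mean(gaps))
```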
A.8 Corrupted dSprites Samples
Figure A.4 shows 8 corrupted dSprites samples, one for each class in this task, where a spurious correlation was introduced by setting each sample's background to a value determined by its assigned class label. In our experiments, we introduce this spurious correlation for 75% of the training samples at random.
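For illustration, a minimal sketch of this corruption (with a hypothetical label-to-background mapping; the actual values used may differ) is:

```python
import numpy as np

def corrupt_backgrounds(images, labels, n_classes, fraction=0.75, seed=0):
    """Set the background of a random `fraction` of images to a value
    that depends on each image's task label (placeholder mapping)."""
    rng = np.random.default_rng(seed)
    images = images.copy().astype(np.float32)
    corrupt = rng.random(len(images)) < fraction
    for idx in np.where(corrupt)[0]:
        background = images[idx] == 0.0              # dSprites backgrounds are zero-valued
        value = (labels[idx] + 1) / (n_classes + 1)  # hypothetical label-to-value map
        images[idx][background] = value
    return images
```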

A.9 Downstream Task Predictive AUC
Figure A.2 shows that the downstream task predictive AUC for all datasets using raw inputs, both in the absence of any correlations and with the maximum number of correlations between concepts, is relatively high and similar across methods. Nevertheless, if the concept representations learnt by the various methods are good surrogates for the inputs, we would also expect them to recover the same predictive performance. Figure A.3 confirms that this holds to a good degree by looking at the task AUC of a simple ReLU MLP with hidden layers trained to predict the corresponding task labels using the concept representations learnt by each method.
A.10 Intervention Performance in Two CBM-Logits With Different Capacities
To rule out the possibility that our intervention results presented in Section 4 are due to the different intervention mechanisms used by Joint-CBM and Joint-CBM-Logits, in this section we show that a similar trend can be observed in CBMs that use the same intervention mechanism but have different capacities. To observe this, we train two CBM-Logits models with the same concept encoder capacity but different capacities in the MLPs used as their label predictors. We see that the higher-capacity model has a higher task accuracy (by about 4%) but a similar concept accuracy to the lower-capacity model. Randomly intervening on their representations, however, produces very different results: interventions in the higher-capacity model lead to a boost in performance, while interventions in the lower-capacity model lead to performance degradation. Upon inspecting the impurities (see Figure A.5), we see that the lower-capacity model has considerably higher OIS and NIS scores than the higher-capacity model, explaining the performance degradation caused by interventions in the lower-capacity model.
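To make the intervention procedure concrete, the sketch below shows the general form of a random concept intervention: a random subset of a CBM's predicted concept values is replaced with the corresponding ground truth values before being passed to the label predictor. The intervention probability is a placeholder, and the handling of logit-scale concepts in CBM-Logits is omitted for simplicity.

```python
import numpy as np

def randomly_intervene(predicted_concepts, true_concepts, p=0.5, seed=0):
    """Replace a random subset of predicted concept values with ground truth.

    predicted_concepts, true_concepts: (N, n_concepts) arrays in [0, 1].
    p: probability of intervening on each concept of each sample (placeholder).
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(predicted_concepts.shape) < p
    return np.where(mask, true_concepts, predicted_concepts)

# The intervened concepts are then fed to the label predictor, e.g.:
# task_logits = label_predictor(randomly_intervene(c_hat, c_true))
```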


A.11 Average Concept Predictive AUC from Overall Concept Representations in DGL
When benchmarking methods in Section 4, we observe that weakly-supervised DGL methods exhibit lower niche impurity than unsupervised DGL methods. Here, we hypothesise that this difference is due to their overall representations being less predictive of individual concepts than the representations learnt by unsupervised DGL methods. Figure A.6 confirms this hypothesis by showing the average predictive concept AUC of a classifier trained to predict ground truth concepts from the overall concept representations of each of our DGL methods. If the overall concept representations correctly captured all of the ground truth concepts, then a classifier trained on those representations would be highly predictive of all concepts. Indeed, as predicted, weakly-supervised DGL methods have overall concept representations that are, on average, less predictive of ground truth concepts than those of unsupervised DGL methods. This result provides an explanation for the lower niche impurity scores observed in weakly-supervised DGL methods compared to unsupervised DGL methods and suggests future work exploring the cause of this observation.
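The probing procedure behind Figure A.6 can be summarised as: train a classifier to predict each ground truth concept from the full learnt representation and average the resulting per-concept AUCs. The sketch below uses a logistic regression probe as an illustrative stand-in for the classifier used in our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def average_concept_auc(train_repr, train_concepts, test_repr, test_concepts):
    """Mean AUC of per-concept probes trained on the overall representation.

    *_repr: (N, d) learnt representations; *_concepts: (N, n_concepts) binary labels.
    Assumes each concept has both positive and negative training examples.
    """
    aucs = []
    for j in range(train_concepts.shape[1]):
        probe = LogisticRegression(max_iter=1000)
        probe.fit(train_repr, train_concepts[:, j])
        scores = probe.predict_proba(test_repr)[:, 1]
        aucs.append(roc_auc_score(test_concepts[:, j], scores))
    return float(np.mean(aucs))
```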