Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
Abstract
Enhancing model interpretability can address spurious correlations by revealing how models draw their predictions. Concept Bottleneck Models (CBMs) can provide a principled way of disclosing and guiding model behaviors through human-understandable concepts, albeit at a high cost in human effort for data annotation. In this paper, we leverage a synergy of multiple foundation models to construct CBMs with nearly no human effort. We discover undesirable biases in CBMs built on pre-trained models and propose a novel framework designed to exploit pre-trained models while being immune to these biases, thereby reducing vulnerability to spurious correlations. Specifically, our method offers a seamless pipeline that adopts foundation models for assessing potential spurious correlations in datasets, annotating concepts for images, and refining the annotations for improved robustness. We evaluate the proposed method on multiple datasets, and the results demonstrate its effectiveness in reducing model reliance on spurious correlations while preserving its interpretability.
Keywords: Multimodal Large Language Model for Data Annotation · Spurious Correlations · Concept Bottleneck Model

1 Introduction
Deep learning models excel in various vision tasks, such as image classification, by learning from massive training data. However, the black-box nature of deep learning models makes them vulnerable to spurious correlations residing in training data. Specifically, models may rely on spurious relationships for predictions, often unaware of having learned such flawed correlations. Such learned spurious correlations can hardly be diagnosed and eliminated after model training.
A promising direction to address this issue is to improve model interpretability by using the human-comprehensible concepts introduced in prior studies [15, 18]. These methods elucidate how a set of pre-defined concepts contributes to model predictions. In particular, [18] introduces Concept Bottleneck Models (CBMs), which encode inputs according to the concepts present in them and use these concepts to draw predictions. Moreover, [1, 38] demonstrate the use of human-understandable concepts for identifying and addressing spurious correlations. Despite their advantages, such methods require concept annotations of the images, which demand considerably greater human effort compared to black-box models.
Recent advancements in large foundation models, such as CLIP [29] and GPT-3 [5], have substantially reduced the cost required to construct CBMs. [46, 27] present methods to represent concepts using features from CLIP to ease CBM construction. [27] uses GPT-3 to collect relevant concepts for specific classification tasks. Moreover, [46] proposes a technique to mitigate spurious correlations in CBMs by pruning classifier weights tied with misleading concepts, exhibiting the capability of CBMs to reduce spurious correlations. However, selecting weights to prune still requires human expertise. In addition, as demonstrated in Sec. 4, such CBMs are limited in addressing spurious correlations as they tend to overlook the biases inherent in large foundation models like CLIP.
To overcome the aforementioned shortcomings, we present a framework for constructing CBMs that effectively tackle spurious correlations at a low cost, leveraging the capacities of multiple foundation models. Specifically, the proposed framework accomplishes CBM construction in three stages. In the first stage, we employ a multimodal large language model (MLLM) and a large language model (LLM) to summarize datasets and create comprehensive concept pools. The LLM identifies all the potentially helpful visual concepts for each dataset. Considering that some concepts may be associated with spurious correlations, automatic concept filtering with the MLLM is applied to identify and remove those tied to potential spurious correlations within each dataset. In the second stage, automatic concept annotation is performed using the MLLM based on the filtered concept list. Inspired by the observed vulnerability of the CBMs built on pre-trained models [46, 27] to spurious correlations, as showcased in Sec. 4, we obtain binary annotations through foundation models rather than using potentially biased raw representations from the pre-trained models. The resulting CBMs thus largely avoid inheriting unintended biases from pre-trained models. Despite the remarkable capabilities of MLLMs on various tasks, we recognize that their concept annotation quality is inferior to that of human experts. To bridge this gap in concept annotation accuracy, we introduce an optional third stage for annotation refinement using chains of vision foundation models, improving the reliability of the MLLM concept annotations with minimal human effort. Applying this framework, we develop CBMs and showcase their efficacy in addressing spurious correlations in real-world challenges with little to no reliance on human labor.

2 Related Works
Concept-based Models. Using human-understandable concepts to guide and interpret model behavior has been investigated in [18, 15, 45, 1, 38, 9]. Particularly, [18] introduces CBMs, which map individual neurons in an intermediate layer to concepts, enabling human intervention at test time to improve accuracy. [15] proposes Concept Activation Vectors (CAVs), aligned with the feature space of an intermediate layer, to assess the influence of concepts on model predictions. However, the enhanced interpretability of these models comes at a high annotation cost and reduced performance compared to black-box models. While recent studies [46, 27] leverage the capabilities of pre-trained models such as CLIP and GPT-3 to reduce the cost of building CBMs and match black-box performance, their potential to mitigate spurious correlations in real-world scenarios remains unexplored.
Improving robustness to spurious correlations. Previous studies [49, 23, 16, 31, 44, 40, 38] improve the robustness of black-box models to spurious correlations, but their lack of interpretability hinders a deep understanding of model behaviors. Moreover, the approaches proposed in [31, 48, 32] require group annotations that are hard to obtain. To address spurious correlations through model behavior interpretation, prior studies [1, 38, 4, 46] employ concepts and reveal spurious correlations in a human-interpretable manner. Particularly, [1, 38] utilize CAVs to elucidate spurious correlations within models, and [38] further mitigates the discovered biases through concept-aware interventions. However, generating CAVs is expensive as it requires concept annotations of images. In addition, [46] introduces CBMs built on a pre-trained model and proposes a weight-pruning method for mitigating spurious correlations. Nevertheless, it is not applicable to large-scale setups and does not adequately address model biases inherited from the pre-trained models.
Foundation Models. Due to the impressive capabilities of LLMs, numerous studies have focused on extending LLMs’ capabilities to incorporate multimodal inputs such as images and audio [2, 3, 20, 19, 28, 52, 37, 47]. LLaVA [25, 24] stands out for its exceptional performance in visual and language understanding among open-source MLLMs. In addition, [43] demonstrates how to use LLMs for interactive decision-making by leveraging external tools (e.g., search engines, foundation models). They propose effective strategies for planning and activating these tools using LLMs. Similarly, recent studies [7, 8, 42, 12] introduce approaches to solve diverse vision tasks, such as visual math reasoning, image spatial understanding, and image editing, by exploiting different vision foundation models (VFMs). A chain of VFMs, activated by LLMs (or MLLMs) for specific sub-tasks, enhances the ability to accomplish the intended vision task. Furthermore, while recent advancements in LLMs have prompted many studies to use them for generating training data [35, 30, 39, 41, 21], the use of MLLMs for data annotation remains largely unexplored. To our knowledge, we are the first to investigate the use of MLLMs for data annotation.
3 Preliminaries
CBMs are constructed with human-comprehensible concepts. For example, in the CUB dataset [34], which includes 200 bird species, black wing color and cone bill shape are used as concepts to build CBMs. Note that we use attribute and concept interchangeably in the following sections. CBMs comprise two stages: a concept model $g: \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}$ and a classifier $f: \mathbb{R}^{k} \rightarrow \mathbb{R}^{m}$. The concept model predicts concept presence in the input $x$, while the classifier predicts labels based on the concept representation $c = g(x)$, the output of the concept model, where $d$ denotes the dimension of inputs, $k$ denotes the number of concepts used, and $m$ indicates the number of classes. In this work, we only focus on cases where $g$ and $f$ are independently trained [18].
In [18], a concept model is trained to predict a binary annotation for each concept. This setup requires datasets comprising tuples of images, labels, and binary concept annotations $(x, y, c)$, where $x \in \mathbb{R}^{d}$, $y \in \{1, \dots, m\}$, and $c \in \{0, 1\}^{k}$. We dub this approach as Annotation-based CBM. Recently, Post-hoc CBM [46] and Label-free CBM [27] suggest building CBMs using the CLIP text encoder, removing the need to train a concept model. Particularly, [46] obtains $c$ by projecting the image embedding $E_I(x) \in \mathbb{R}^{e}$ onto the corresponding text embeddings from CLIP, such that $c = W E_I(x)$, where $W \in \mathbb{R}^{k \times e}$ denotes the matrix of concept vectors from the CLIP text encoder and $e$ is the size of the embedding space. We term these approaches as CLIP-based CBM.
We categorize CBMs into soft CBMs and hard CBMs based on how concepts are modeled, following [13]. A soft CBM characterizes each element of $c$ as a real number representing the degree of relatedness or presence of a concept. In contrast, a hard CBM represents each element of $c$ with a binary value, indicating the presence or absence of the corresponding concept in the input. Annotation-based CBMs align with hard CBMs, while CLIP-based CBMs fall into the category of soft CBMs.
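To make the hard/soft distinction concrete, the sketch below contrasts the two kinds of concept representations. This is an illustrative PyTorch sketch with our own module and function names (e.g., `AnnotationBasedCBM`, `clip_soft_concepts`), not the code of the cited implementations.

```python
import torch
import torch.nn as nn


class AnnotationBasedCBM(nn.Module):
    """Hard CBM: a concept model g predicts binary concepts; a classifier f maps them to labels."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_concepts: int, num_classes: int):
        super().__init__()
        self.concept_model = nn.Sequential(backbone, nn.Linear(feat_dim, num_concepts))  # g
        self.classifier = nn.Linear(num_concepts, num_classes)                            # f

    def forward(self, x):
        concept_logits = self.concept_model(x)
        c_hard = (torch.sigmoid(concept_logits) > 0.5).float()  # binary concept vector c
        return self.classifier(c_hard), concept_logits          # class logits, concept logits


def clip_soft_concepts(image_emb: torch.Tensor, concept_vectors: torch.Tensor) -> torch.Tensor:
    """Soft CBM: project CLIP image embeddings onto CLIP text-encoder concept vectors.

    image_emb: (batch, e); concept_vectors: (k, e). Returns real-valued concept scores (batch, k).
    """
    concept_vectors = concept_vectors / concept_vectors.norm(dim=-1, keepdim=True)
    return image_emb @ concept_vectors.t()
```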
4 Limitation of using CLIP-based CBMs to Alleviate Spurious Correlation
Compared to end-to-end models, CBMs have the potential to reduce spurious correlations by filtering out concepts identified to contribute to spurious correlations with the aid of human expertise. In this section, we highlight key observations suggesting that despite careful concept selection, CLIP-based CBMs can still be compromised by spurious correlations due to inherent biases in CLIP.
We experiment on Waterbirds [31], a benchmark synthesized by pasting birds from CUB [34] onto background images from Places [50]. The dataset exhibits spurious correlations, with landbirds predominantly appearing against land backgrounds and waterbirds mostly set against water backgrounds. We use the 112 visual concepts provided with CUB in [18] for constructing CBMs, so that they only consist of attributes related to the appearance of birds, not the background. To assess the robustness of the concept representation $c$ from the concept models of CBMs against spurious correlations, we train an additional classifier, $f_{bg}$, which takes $c$ as input and predicts the labels associated with spurious correlations, land and water. Table 1 shows that the classifier $f_{bg}$ built on the concept representations of the CLIP-based CBM (Post-hoc CBM) attains an accuracy largely exceeding random guessing (50%). This suggests that even with only concepts irrelevant to spurious correlations, $c$ of the CLIP-based CBM can still be noticeably biased towards the spurious elements. Conversely, the accuracy of a classifier built on $c$ of the Annotation-based CBM [18] is as low as random guessing, implying that $c$ from the Annotation-based CBM is nearly free from biases leading to spurious correlations.
Model | Background Acc. |
---|---|
Standard ResNet-50[14] | 90.1% |
Post-hoc CBM [46] | 72.8% |
Annotation-based CBM[18] | 53.0% |
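The probe behind Table 1 can be realized as a simple classifier trained on frozen concept representations. The following is a minimal sketch under the assumption that the concept representations and background labels are pre-computed tensors; the function and variable names are illustrative.

```python
import torch
import torch.nn as nn


def train_background_probe(concept_reps: torch.Tensor, bg_labels: torch.Tensor,
                           epochs: int = 100, lr: float = 1e-2) -> float:
    """Train a linear probe that predicts the spurious attribute (land vs. water background)
    from frozen concept representations c; accuracy well above 50% indicates that c leaks
    background information."""
    probe = nn.Linear(concept_reps.shape[1], 2)
    opt = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(concept_reps), bg_labels).backward()
        opt.step()
    with torch.no_grad():
        acc = (probe(concept_reps).argmax(dim=1) == bg_labels).float().mean().item()
    return acc  # in practice, report accuracy on a held-out split rather than the training data
```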
There are two possible reasons why concept representations from CLIP-based CBMs exhibit biases, in contrast to those from Annotation-based CBMs. The first relates to the inherent biases residing in CLIP. CLIP-based CBMs generate concept vectors using CLIP encoders, which can be affected by biases in the pre-training data. For example, during the training phase of CLIP, visual attributes of waterbirds may often appear against oceanic backgrounds, whereas landbirds’ attributes more commonly appear in forest environments. This can accidentally encode a bias within the encoder that CLIP-based CBMs later use. Second, the difference can stem from the types of CBMs. As mentioned in Sec. 3, CLIP-based CBMs belong to soft CBMs, while Annotation-based CBMs are hard CBMs. [13] shows the potential risk that soft CBMs, in contrast to hard CBMs, can allow the leakage of unintended information from the concept model to the classifier. This leakage in soft CBMs may produce biased features, rendering them vulnerable to spurious correlations.
5 Our Framework
In Sec. 4, we investigated the potential of Annotation-based CBMs in addressing spurious correlations compared to CLIP-based CBMs. Nevertheless, building Annotation-based CBMs can be expensive due to the need for binary concept annotations. In this section, we propose a framework for constructing Annotation-based CBMs with minimal human effort using various foundation models.
Figure 1 provides a comprehensive overview of our framework, which unfolds in three stages. First, we collect a pool of visual concepts essential for classification, where spurious concepts are automatically filtered out. Second, we use an MLLM to annotate the selected concepts for each image. The optional third stage involves correcting imprecise concept annotations by using chains of vision foundation models, which improves the reliability of the MLLM’s annotations. Using the concept annotations obtained through these stages, we construct Annotation-based CBMs, denoted as LLaVA-based CBMs, which are effective in mitigating spurious correlations. Detailed procedures are described in the following.
5.1 Collecting Visual Attributes Unaffected by Spurious Correlation
Constructing CBMs demands gathering a list of essential visual concepts, where concept quality significantly influences the classification model’s capability. Particularly, mitigating spurious correlations requires careful concept selection to retain a list of concepts unrelated to such correlations. We propose a method outlined below, designed to collect diverse concepts while avoiding those linked with misleading correlations.
Step 1: Use LLM to collect diverse visual concepts. We leverage the capability of GPT-3 [5] in identifying key concepts for target classes and collect diverse visual attributes for each classification task as in [27]. Specifically, GPT-3 is prompted to provide important features, superclass, and things seen around each class via the OpenAI API. To further diversify the set of attributes, we additionally prompt GPT-3 with “Provide a list of visual attributes to distinguish between a {class 1} and a {class 2}”. Then, we filter out concepts too similar to each other and to the class names as in [27]. We denote the collected concept set as $\mathcal{A}$. However, we remark that $\mathcal{A}$ may include elements tied to spurious correlations.
Step 2: Use MLLM to detect potential spurious correlations. To identify potential spurious correlations within datasets, we propose two sub-steps using LLaVA (MLLM).
(i) We collect descriptions for each image by prompting LLaVA with “Describe the image in a sentence.” We denote the description for input $x_i$ as $t_i$ and the collection of descriptions as $T = \{t_i\}_{i=1}^{N}$, where $N$ represents the number of training data. Next, we leverage the in-context learning capability of GPT-3. We prompt GPT-3 with a few examples formatted as (sentence, concepts) pairs to extract keywords from the descriptions and denote the collection of the keywords as $K = \{k_i\}_{i=1}^{N}$, where $k_i$ is the set of keywords extracted from $t_i$ and its size varies across inputs. We append the class label to each $k_i$ so that each keyword set contains the corresponding class.

(ii) We compute correlations between a class label and each keyword, and find the keywords showing high correlations with the label. We start by creating a vocabulary set $V$ using $K$. Assuming $v$ vocabularies, we vectorize each $k_i$ to represent the presence of each vocabulary, resulting in binary vectors $b_i \in \{0, 1\}^{v}$. The collection of the vectors is denoted as $B = \{b_i\}_{i=1}^{N}$. Each vocabulary is considered as a variable, yielding $N$ samples for the variables. We measure correlations between the class labels and the vocabularies by computing the point biserial correlation coefficients, which range from -1 to 1. By sorting keywords by their coefficients, we identify attributes highly correlated with the labels. We use a threshold to select highly correlated attributes and denote them as $\mathcal{S}$. This method can detect spurious correlations. Figure 2 illustrates that the process captures strong correlations between landbird and attributes related to known spurious concepts, such as tree.
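A minimal sketch of sub-step (ii), assuming the per-image keyword sets have already been extracted; it relies on scipy’s point-biserial implementation, and the variable names and default threshold are illustrative.

```python
import numpy as np
from scipy.stats import pointbiserialr


def detect_spurious_keywords(keyword_sets, class_labels, threshold=0.2):
    """keyword_sets: list of N keyword sets (one per training image);
    class_labels: binary array of length N. Returns keywords whose point-biserial
    correlation with the class label exceeds the threshold, sorted by magnitude."""
    vocab = sorted(set().union(*keyword_sets))
    # Binary presence matrix B of shape (N, |V|)
    B = np.array([[int(w in kws) for w in vocab] for kws in keyword_sets], dtype=float)
    y = np.asarray(class_labels, dtype=float)
    flagged = []
    for j, word in enumerate(vocab):
        if B[:, j].std() == 0:      # skip constant columns (correlation undefined)
            continue
        r, _ = pointbiserialr(y, B[:, j])
        if abs(r) >= threshold:
            flagged.append((word, r))
    return sorted(flagged, key=lambda t: -abs(t[1]))
```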
Step 3: Filtering concepts related to spurious correlations. We curate a concept pool with $\mathcal{A}$ and $\mathcal{S}$ obtained from the previous steps. To filter out elements linked with spurious correlations, we exclude from $\mathcal{A}$ the elements similar to those of $\mathcal{S}$. Similarity between elements is computed based on an ensemble of the CLIP text encoder and all-mpnet-base-v2 sentence embeddings as in [27].
The detailed procedure and the finalized list of attributes for each dataset are provided in Appendix 0.A.
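Below is a sketch of the Step 3 filtering under the assumption that similarity is computed with the all-mpnet-base-v2 model from the sentence-transformers library alone; the ensemble with the CLIP text encoder and the exact similarity threshold are omitted for brevity.

```python
from sentence_transformers import SentenceTransformer


def filter_spurious_concepts(concepts, spurious_keywords, sim_threshold=0.6):
    """Drop every candidate concept whose embedding is too similar to any
    detected spurious keyword (cosine similarity above sim_threshold)."""
    model = SentenceTransformer("all-mpnet-base-v2")
    c_emb = model.encode(concepts, normalize_embeddings=True)
    s_emb = model.encode(spurious_keywords, normalize_embeddings=True)
    sims = c_emb @ s_emb.T                      # (num_concepts, num_spurious)
    keep = sims.max(axis=1) < sim_threshold
    return [c for c, k in zip(concepts, keep) if k]


# Hypothetical usage:
# filter_spurious_concepts(["a sharp blade", "a can"], ["can", "person"]) -> ["a sharp blade"]
```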
5.2 Annotating the Concepts with LLaVA
In this stage, we generate binary concept annotations of the training data by querying an MLLM to decide whether the curated concepts are present in each image. The prompts consist of both an image and text. For example, we query LLaVA-v1.5-13B using prompts “Does the object have {attribute}?” alongside the corresponding image. The text responses from LLaVA are then converted into binary values, which we use as concept annotations for each image. Examples of prompts and responses are provided in Sec. 0.A.1. We acknowledge that prompt engineering plays a crucial role in influencing the output of LLMs (MLLMs) [51, 36, 11]. In our work, we have minimally engaged in prompt engineering, leaving room for future optimization of prompts. We remark that concept annotation used to require considerable human effort. Our framework effectively replaces the human effort with open-source MLLMs, like LLaVA, cutting the cost down to nearly zero.
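The annotation loop can be summarized by the sketch below; `query_llava` is a hypothetical stand-in for whichever LLaVA-v1.5-13B inference interface is available, and the yes/no parsing is deliberately simple.

```python
def query_llava(image_path: str, prompt: str) -> str:
    """Hypothetical wrapper around a LLaVA-v1.5-13B inference call; returns the text response."""
    raise NotImplementedError


def annotate_image(image_path: str, concepts: list[str]) -> list[int]:
    """Query the MLLM once per curated concept and convert its answer to a binary annotation."""
    annotations = []
    for concept in concepts:
        response = query_llava(image_path, f"Does the object have {concept}?")
        annotations.append(1 if response.strip().lower().startswith("yes") else 0)
    return annotations


# Hypothetical usage:
# c = annotate_image("img_001.jpg", ["a sharp blade", "an ornamental handle"])
```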
Evaluating MLLM in concept annotation. LLaVA demonstrates its multimodal understanding capabilities on challenging benchmarks, such as coarse- and fine-grained recognition tasks [24]. However, annotations provided by LLaVA can often contain errors. To evaluate LLaVA’s reliability as a proxy annotator, we examine the accuracy of the annotations. For Waterbirds, which already comes with 112 attributes and their labels from CUB, we can compare LLaVA’s responses against the ground truth. It is worth noting, however, that the ground truth may not be completely accurate, as it underwent post-processing to reduce annotation noise, as detailed in [18]. We discuss details in Appendix 0.B.
Recall | Precision | F1 score
---|---|---
0.92 | 0.21 | 0.31
Table 2 presents the average recall, precision, and F1 score across all training images and 112 attributes. The results indicate that LLaVA tends to respond positively to queries, resulting in noticeable false positives. Furthermore, we construct Annotation-based CBMs with LLaVA annotations and human annotations, respectively, and compare their performance. As indicated in Tab. 3, the CBM with human annotations outperforms the CBM with LLaVA annotations, implying that the inferior annotation quality of LLaVA affects the performance of CBMs based on these annotations.
Annotator | Worst-group Acc. | Average Acc. |
---|---|---|
Human | 85.0% | 98.4% |
LLaVA | 80.8% | 83.8% |
To reduce the accuracy gap between human annotations and LLaVA annotations without relying on human expertise, we introduce an optional automatic annotation refinement process in the following section.
5.3 Automatic Annotation Refinement
Motivated by the observed imperfection of LLaVA annotations, we develop an optional refinement process for improved annotation quality. Devising an adequate refinement requires us first to understand the imperfections of LLaVA. We examine randomly selected probing images and their corresponding response pairs to identify errors made by LLaVA. Our inspection reveals that LLaVA often confuses the image background with the main object. For instance, in classifying letter opener and can opener in ImageNet-Opener, wood material is used as a visual concept (we provide more details in Sec. 6). We observe that LLaVA frequently makes mistakes when answering whether the object is made of wood, especially when the object is placed on a wooden table.

Once LLaVA’s errors are identified, the next step entails correcting them with nearly no human effort, leveraging an LLM and VFMs. Recently, exploiting different VFMs to perform (sub-)tasks invoked by LLMs has been actively explored [7, 8, 42, 12]. The LLM learns to utilize tools (e.g., VFMs) through in-context learning and activates these tools to accomplish a specific task. Drawing insights from probing LLaVA’s errors, our task is set to rectify annotations by eliminating potentially confusing backgrounds in images before querying LLaVA. To accomplish this, we employ LangChain [6] for the implementation. We construct a chain of tools composed of BLIP-2 [20], Grounding DINO [26], and SAM [17], leveraging the in-context learning capability of LLMs by providing specific examples that guide the activation of VFMs. For instance, the instruction “ignore the background” (Fig. 3) triggers the chain of tools. The LLM first activates the visual question answering model, BLIP-2, to identify the main object in the image. Next, it uses Grounding DINO to obtain the bounding box of the main object identified by BLIP-2. Then, the LLM applies SAM to segment the object associated with the bounding box. Finally, the background-removed outputs from this series of tools are used to query LLaVA, making the annotations more accurate. Figure 3 shows an example of how the refinement corrects the annotations. Note that our choice of VFMs is guided by the investigation of LLaVA’s errors. We presume the VFMs to be interchangeable, and the selection of which VFMs to use can be guided by the potential errors identified in the MLLM.
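The refinement chain can be approximated by the following sketch. `caption_main_object`, `detect_box`, and `segment_with_box` are hypothetical wrappers around BLIP-2, Grounding DINO, and SAM, respectively (we do not reproduce the LangChain tool definitions here), and whiting out the background is one simple choice for compositing.

```python
import numpy as np
from PIL import Image


def caption_main_object(image: Image.Image) -> str:
    """Hypothetical BLIP-2 wrapper: answer 'What is the main object in the image?'."""
    raise NotImplementedError


def detect_box(image: Image.Image, text_query: str):
    """Hypothetical Grounding DINO wrapper: return an (x0, y0, x1, y1) box for the query."""
    raise NotImplementedError


def segment_with_box(image: Image.Image, box) -> np.ndarray:
    """Hypothetical SAM wrapper: return a boolean (H, W) foreground mask for the box."""
    raise NotImplementedError


def remove_background(image: Image.Image) -> Image.Image:
    """Chain BLIP-2 -> Grounding DINO -> SAM to keep only the main object."""
    obj = caption_main_object(image)
    mask = segment_with_box(image, detect_box(image, obj))
    arr = np.array(image)
    arr[~mask] = 255                      # white out background pixels
    return Image.fromarray(arr)


# The background-removed image is then passed to the same annotation prompts as before,
# e.g. query the MLLM with "Is the object made of wood material?" on remove_background(img).
```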
6 Experiment
We build LLaVA-based CBMs based on the proposed framework and demonstrate the effectiveness of the CBMs in mitigating spurious correlations.
6.1 Experimental Setups
Baselines. We compare LLaVA-based CBMs with several methods including Empirical Risk Minimization (ERM) and CLIP-based CBMs (Label-free CBM [27] and Post-hoc CBM [46]). Label-free CBM constructs a concept model using GPT-3 and CLIP. On the other hand, Post-hoc CBM uses ConceptNet [33] to generate concept sets and CLIP to obtain concept representations. We provide the concept sets used in these models in Sec. 0.C.4. Furthermore, we compare ours with [40, 38], which effectively mitigate spurious correlations and demonstrate state-of-the-art performance on the datasets we use. We include Group DRO [31], known as an oracle, which requires group annotations, for comparison. Since our method provides interpretability, our primary focus in comparisons is on other methods that also offer this feature.
Datasets. We assess our method on multiple challenging benchmarks that contain spurious correlations, as detailed below. We provide the concepts we gather for each dataset in App. 0.A.2. See App. 0.C.2 for the annotation time details.
(i) ImageNet-Opener: We use the classes can opener and letter opener from a subset of ImageNet [10] as suggested in [40]. In particular, a can opener usually appears together with a can in the training images, creating spurious correlations. When a can is absent from the image, a can opener might be erroneously classified as a letter opener.
(ii) Metashifts: Metashifts [22] provides many subsets of data corresponding to different contexts, naturally creating distribution shifts. [38] proposes experiment setups for evaluating model robustness to spurious correlations using Metashifts by creating disjoint spurious attributes for each class. The training set includes two classes, cats with sofa/bed and dogs with bench/bike. The testing set consists of cats and dogs, both with a shelf.
(iii) Waterbirds: Waterbirds [31] comprises two classes: landbirds and waterbirds. In the training set, landbirds and waterbirds primarily appear on land backgrounds and water backgrounds, respectively, while the test set is balanced.
Evaluation Metrics. We assess the methods based on average accuracy and worst-group accuracy. For details on setting groups for each dataset, see App. 0.C.3.
Model Architecture and Training Details. Both the ERMs and the concept models ($g$) of CBMs are built on ResNet-50 [14], pretrained on ImageNet. To build LLaVA-based CBMs, we train the concept model and the classifier independently. We provide more details on other hyperparameters in Sec. 0.C.1. For a fair comparison, we use the pre-trained CLIP encoders based on ResNet-50 for Label-free CBM and Post-hoc CBM. Label-free CBM and Post-hoc CBM results are reproduced using the official implementations (https://github.com/mertyg/post-hoc-cbm and https://github.com/Trustworthy-ML-Lab/Label-free-CBM) with identical configurations. We report results for all methods using early stopping based on worst-group validation accuracy, as per standard practice in prior works [31]. For Waterbirds, we use additional prompts to gather a richer concept pool. We provide more details on the prompts used in Waterbirds in Sec. 0.A.1.
6.2 Results
Method | Interpretability | Metashifts Worst Acc. | Metashifts Avg. Acc. | ImageNet-Opener Worst Acc. | ImageNet-Opener Avg. Acc.
---|---|---|---|---|---
ERM∗ | ✗ | 62.1% | 72.9% | 68.0% | 80.1%
DISC [38]∗ | | 73.5% | 75.5% | |
[40]∗ | ✗ | | | 68.0% | 73.9%
Label-free CBM [27] | ✓ | 64.1% | 74.1% | 82.5% | 78.0%
Post-hoc CBM [46] | ✓ | 71.4% | 80.4% | 45.0% | 48.0%
LLaVA-based CBM | ✓ | 78.0% | 80.0% | 90.0% | 86.1%
Group DRO [31]∗ | ✗ | 66.0% | 73.6% | 76.0% | 78.4%
Quantitative Results. Table 4 and Table 5 report the average and the worst-group accuracies of different methods across various datasets. LLaVA-based CBMs outperform other baseline models in most cases. ERM presents a significantly lower worst-group accuracy than average accuracy, suggesting its strong reliance on spurious correlations. Although CLIP-based CBMs may narrow the gap between the worst-group and the average accuracy on Metashifts and ImageNet-Opener, the improvements remain limited. Moreover, when compared to state-of-the-art methods, such as [40] for ImageNet-Opener, DISC [38] for Metashifts, and Group DRO [31] as an oracle for both, our method exhibits notably superior performance. Although LLaVA-based CBM may not outperform Human Annotation CBM or Group DRO in Tab. 5, it is worth noting that these methods rely on human effort, which can be costly. Moreover, LLaVA-based CBMs offer interpretability, which is missing in ERM and Group DRO. Additionally, we provide the results of using LLaVA directly for label prediction in App. 0.C.5.
Method | Worst-group Acc. | Average Acc. |
---|---|---|
ERM | 72.6% | 97.3% |
Label-free CBM [27] | 55.0% | 82.8% |
Post-hoc CBM [46] | 31.6% | 91.7% |
LLaVA-based CBM | 83.2% | 94.2% |
Human Anno. CBM (CUB) | 85.0% | 98.4% |
Group DRO [31] | 91.4% | 93.5% |
Particularly noteworthy is the superiority of LLaVA-based CBM over CLIP-based CBMs [46, 27] across various datasets. Note that CLIP-based CBMs rely on concepts from CUB, which, as clarified in Sec. 4, focus solely on spurious-free bird-appearance concepts; this reliance hinders generalization to novel applications. Meanwhile, our approach is not dependent on CUB, making it applicable to a wide range of datasets. Our superior performance can be attributed to two potential factors. First, as discussed in Sec. 4, we posit that biases inherent in CLIP can negatively affect CBMs built on it. Table 6 further supports this claim. Despite utilizing identical concept sets that exclude spurious elements, CLIP-based CBMs exhibit a significant drop in worst-group accuracy, a shortcoming considerably mitigated in LLaVA-based CBMs. Second, our strategy for gathering visual attributes unaffected by spurious correlations, as proposed in Sec. 5.1, proves effective, ensuring that the collected concept sets are unlikely to include concepts relevant to spurious correlations. Refer to App. 0.C.4 for the concept sets used in CLIP-based CBMs across datasets.
Method | Worst-group Acc. | Average Acc. |
---|---|---|
Post-hoc CBM [46] | 31.6% | 91.7% |
LLaVA-based CBM (CUB) | 80.8% | 83.8% |
Label-free CBM [27] | 55.0% | 82.8% |
LLaVA-based CBM (Label-free) | 80.5% | 94.1% |
Spurious Correlations Detection. We showcase concept pools collected using the procedure outlined in Sec. 5.1. For ImageNet-Opener, we collect 47 visual concepts by prompting GPT-3 in Step 1. Step 2 identifies the words “can”, “person”, and “opener” as highly correlated with the class labels. The known spurious concept “can” is detected in this step. In Step 3, we exclude five visual concepts identified as similar to the detected spurious concepts from the initial 47, leaving 42 visual attributes.

In Metashifts, among the 28 visual concepts collected from Step 1, Step 2 detects 16 words, including “benches”, “frisbee”, “bike”, “leash”, “couch”, “bed”, and “computer”, as potential spurious concepts. Interestingly, this set contains not only the known spurious concepts in Metashifts, such as “bed”, “couch”, “bike”, and “bench”, but also other potential spurious concepts such as “leash”, “computer”, and “frisbee”. We filter out 3 concepts (“leash”, “a cat bed”, “a cat toy”) from our initial concept pool based on similarity and obtain 24 visual concepts free from these biases. We show the Step 2 examples for Metashifts in Fig. 4. We provide examples for Waterbirds in App. 0.A.2.
The Effect of Automatic Annotation Refinement. We use ImageNet-Opener to assess the effectiveness of the annotation refinement process introduced in Sec. 5.3. As discussed in Sec. 5.3, one of the common errors made by LLaVA is incorrectly identifying objects as being made of wood when the opener is placed on a wooden table. To mitigate such errors, we refine the annotations of two specific concepts, “wood material” and “knife-like shape”, through the proposed refinement process using an LLM and a chain of vision foundation models.
In our qualitative evaluation of LLaVA’s responses, we focus on changes in LLaVA’s responses before and after the refinement. For “wood material”, 21% of can opener responses initially identified the object as made of wood; this dropped to 3% after refinement, aligning better with common knowledge. For letter opener, positive answers to “wood material” decreased from 41% to 11%. Regarding “knife-like shape”, positive identifications for can opener dropped from 90% to 86%, while for letter opener they marginally reduced from 98.5% to 98.0%, reflecting the expected relevance of “knife-like shape” to letter opener.
Based on the refined annotation, we examine the impact of the refinement process on classification performance.
 | Worst. Acc. | Avg. Acc. |
---|---|---|
without refinement | 80.0% | 82.4% |
with refinement | 90.0% | 86.1% |
Table 7 shows that the refinement positively impacts accuracy gains compared to the CBM without annotation refinement. The result highlights the effectiveness of the annotation refinement in error correction and the subsequent improvement in the accuracy of LLaVA-based CBMs.
Interpretability of LLaVA-based CBMs. We examine how the models derive their predictions, which is one of the advantages of employing CBMs. For ImageNet-Opener, we identify the most common attributes associated with correctly predicted test samples. Specifically, all correctly predicted letter opener samples share 19 positive attributes, while correct can opener samples consistently predict 21 specific attributes positively. There are 17 overlapping attributes between the two classes. We focus on the non-overlapping attributes for each class and observe that these concepts align well with common visual attributes of each class, as listed below (a sketch of this analysis follows the list):
- Letter opener: {“sharp blade”, “an ornamental handle”}
- Can opener: {“a round wheel blade”, “a bottle opener”, “a lever for opening cans”, “a counter”}
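A sketch of how the shared and distinct positive attributes can be extracted from the binary concept predictions of correctly classified test samples; the array and variable names are illustrative.

```python
import numpy as np


def shared_positive_attributes(concept_preds: np.ndarray, correct_mask: np.ndarray,
                               labels: np.ndarray, target_class: int, concept_names: list[str]):
    """Return the concepts predicted positive for *every* correctly classified sample of a class.
    concept_preds: (N, k) binary predictions; correct_mask: (N,) bool; labels: (N,) class ids."""
    rows = concept_preds[correct_mask & (labels == target_class)]
    always_on = rows.min(axis=0) == 1          # concept is 1 for all selected samples
    return {name for name, on in zip(concept_names, always_on) if on}


# Non-overlapping attributes per class (hypothetical usage):
# letter = shared_positive_attributes(P, ok, y, LETTER_OPENER, names)
# can    = shared_positive_attributes(P, ok, y, CAN_OPENER, names)
# distinct_letter, distinct_can = letter - can, can - letter
```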
We further explore incorrectly classified test samples to understand why the model makes wrong predictions. For example, in some incorrect samples of letter opener, the model fails to identify the distinct concept “an ornamental handle” and tends to positively predict all concepts belonging to can opener. On the other hand, Label-free CBM’s top 2 weights for predicting can opener are assigned to the concepts “a small, handheld tool” and “a can”. In the case of letter opener, they are “a long, thin blade” and “a pen”. Thus, an image of a can opener lacking “a can” tends to be misclassified, while a correct prediction occurs if “a can” is present. The examples are illustrated in Fig. A. The results suggest that LLaVA-based CBMs are beneficial for mitigating spurious correlations, unlike Label-free CBMs.
7 Conclusion
In this paper, we explored mitigating spurious correlations using CBMs developed with minimal human effort by integrating multiple foundation models. We introduced a framework that collects visual concepts that are essential but not affected by spurious correlations and performs concept annotation for each image to build CBMs, all leveraging the capabilities of an MLLM and an LLM. Furthermore, we suggested an optional refinement method to further improve the reliability of the annotations produced by the MLLM. We demonstrated the effectiveness of our approach in tackling spurious correlations on diverse real-world challenges.
Acknowledgements
This work was supported by NIH (1U01CA269192).
References
- [1] Abid, A., Yuksekgonul, M., Zou, J.: Meaningfully debugging model mistakes using conceptual counterfactual explanations. In: International Conference on Machine Learning (2022)
- [2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems (2022)
- [3] Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., et al.: Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023)
- [4] Bontempelli, A., Teso, S., Tentori, K., Giunchiglia, F., Passerini, A.: Concept-level debugging of part-prototype networks. International Conference on Learning Representations (2023)
- [5] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems (2020)
- [6] Chase, H.: LangChain (Oct 2022), https://github.com/langchain-ai/langchain
- [7] Chen, J., Guo, H., Yi, K., Li, B., Elhoseiny, M.: Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2022)
- [8] Chen, W.G., Spiridonova, I., Yang, J., Gao, J., Li, C.: Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint arXiv:2311.00571 (2023)
- [9] Chen, Z., Bei, Y., Rudin, C.: Concept whitening for interpretable image recognition. Nature Machine Intelligence (2020)
- [10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009)
- [11] Gu, J., Han, Z., Chen, S., Beirami, A., He, B., Zhang, G., Liao, R., Qin, Y., Tresp, V., Torr, P.: A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980 (2023)
- [12] Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023)
- [13] Havasi, M., Parbhoo, S., Doshi-Velez, F.: Addressing leakage in concept bottleneck models. Advances in Neural Information Processing Systems (2022)
- [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- [15] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al.: Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In: International Conference on Machine Learning (2018)
- [16] Kirichenko, P., Izmailov, P., Wilson, A.G.: Last layer re-training is sufficient for robustness to spurious correlations. International Conference on Learning Representations (2023)
- [17] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
- [18] Koh, P.W., Nguyen, T., Tang, Y.S., Mussmann, S., Pierson, E., Kim, B., Liang, P.: Concept bottleneck models. In: International Conference on Machine Learning (2020)
- [19] Li, B., Zhang, Y., Chen, L., Wang, J., Yang, J., Liu, Z.: Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 (2023)
- [20] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning (2023)
- [21] Li, L., Dou, Z.Y., Peng, N., Chang, K.W.: Desco: Learning object recognition with rich language descriptions. Advances in Neural Information Processing Systems (2023)
- [22] Liang, W., Zou, J.: Metashift: A dataset of datasets for evaluating contextual distribution shifts and training conflicts. International Conference on Learning Representations (2022)
- [23] Liu, E.Z., Haghgoo, B., Chen, A.S., Raghunathan, A., Koh, P.W., Sagawa, S., Liang, P., Finn, C.: Just train twice: Improving group robustness without training group information. In: International Conference on Machine Learning (2021)
- [24] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)
- [25] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)
- [26] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
- [27] Oikarinen, T., Das, S., Nguyen, L.M., Weng, T.W.: Label-free concept bottleneck models. International Conference on Learning Representations (2023)
- [28] Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
- [29] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)
- [30] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems (2023)
- [31] Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. International Conference on Learning Representations (2020)
- [32] Sagawa, S., Raghunathan, A., Koh, P.W., Liang, P.: An investigation of why overparameterization exacerbates spurious correlations. In: International Conference on Machine Learning (2020)
- [33] Speer, R., Chin, J., Havasi, C.: Conceptnet 5.5: An open multilingual graph of general knowledge. In: Proceedings of the AAAI conference on artificial intelligence (2017)
- [34] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds-200-2011 dataset (2011)
- [35] Wang, S., Tan, Z., Guo, R., Li, J.: Noise-robust fine-tuning of pretrained language models via external guidance. Findings of the Association for Computational Linguistics: EMNLP (2023)
- [36] White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382 (2023)
- [37] Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519 (2023)
- [38] Wu, S., Yuksekgonul, M., Zhang, L., Zou, J.: Discover and cure: Concept-aware mitigation of spurious correlation. International Conference on Machine Learning (2023)
- [39] Yan, A., Wang, Y., Zhong, Y., Dong, C., He, Z., Lu, Y., Wang, W.Y., Shang, J., McAuley, J.: Learning concise and descriptive attributes for visual recognition. In: International Conference on Computer Vision (2023)
- [40] Yang, Y., Nushi, B., Palangi, H., Mirzasoleiman, B.: Mitigating spurious correlations in multi-modal models during fine-tuning. International Conference on Machine Learning (2023)
- [41] Yang, Y., Panagopoulou, A., Zhou, S., Jin, D., Callison-Burch, C., Yatskar, M.: Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023)
- [42] Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., Wang, L.: Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
- [43] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: React: Synergizing reasoning and acting in language models. International Conference on Learning Representations (2023)
- [44] Ye, H., Zou, J., Zhang, L.: Freeze then train: Towards provable representation learning under spurious correlations and feature noise. In: International Conference on Artificial Intelligence and Statistics (2023)
- [45] Yeh, C.K., Kim, B., Arik, S., Li, C.L., Pfister, T., Ravikumar, P.: On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems (2020)
- [46] Yuksekgonul, M., Wang, M., Zou, J.: Post-hoc concept bottleneck models. International Conference on Learning Representations (2023)
- [47] Zeng, A., Attarian, M., Ichter, B., Choromanski, K., Wong, A., Welker, S., Tombari, F., Purohit, A., Ryoo, M., Sindhwani, V., et al.: Socratic models: Composing zero-shot multimodal reasoning with language. International Conference on Learning Representations (2023)
- [48] Zhang, J., Menon, A., Veit, A., Bhojanapalli, S., Kumar, S., Sra, S.: Coping with label shift via distributionally robust optimisation. International Conference on Learning Representations (2021)
- [49] Zhang, M., Sohoni, N.S., Zhang, H.R., Finn, C., Ré, C.: Correct-n-contrast: A contrastive approach for improving robustness to spurious correlations. International Conference on Machine Learning (2022)
- [50] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE transactions on Pattern Analysis and Machine Intelligence (2017)
- [51] Zhou, Y., Muresanu, A.I., Han, Z., Paster, K., Pitis, S., Chan, H., Ba, J.: Large language models are human-level prompt engineers. International Conference on Learning Representations (2023)
- [52] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Appendix 0.A Prompts and Visual Concepts
0.A.1 Prompts
Stage 1 in Sec. 5.1. In Step 1, we prompt GPT-3 to get important features, superclass, things seen around [27] and distinguished visual features of each class. For the queries related to important features, superclass, and things seen around, we use the in-context learning examples as described in [27]. The prompts used to query GPT-3 are the following:
- important features - List the most important features for recognizing something as a {class}
- superclass - Give superclasses for the word {class}
- things seen around - List the things most commonly seen around a {class}
- distinguished visual features - Provide a list of visual features to distinguish between a {class 1} and {class 2}
For Waterbirds, we use different prompts to collect a more comprehensive concept set. We first obtain subclasses of each class. This yields a total of 171 distinct bird species initially, and through a subsequent process using the same subclass prompts, we obtain a total of 1852 bird species. Then, we use the prompts suggested above for important features, superclass, and things seen around.
In Step 2, we obtain image descriptions by prompting LLaVA with “Describe the image in a sentence” and the corresponding image. Then, we extract keywords from the list of descriptions by querying GPT-3, leveraging its in-context learning capabilities. The prompts used to query GPT-3 for each dataset are detailed in Tab. A. We use the nltk tokenizer (https://www.nltk.org) to tokenize the collection of keywords and remove prepositions. We compute the point biserial correlation coefficients to detect potential spurious features. To prevent the correlation coefficients from being split among similar words, we merge similar words into one based on similarity (using the CLIP text encoder and all-mpnet-base-v2 sentence embeddings) and add up the correlation coefficients of the merged words.
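A sketch of the merging step under the assumption that keyword similarity is computed with all-mpnet-base-v2 alone; the greedy grouping and the similarity threshold are our own illustrative choices.

```python
import numpy as np
from sentence_transformers import SentenceTransformer


def merge_similar_keywords(words, coeffs, sim_threshold=0.8):
    """Greedily merge near-duplicate keywords (e.g. 'bench' / 'benches') and sum their
    point-biserial coefficients so that a correlation is not split across word variants."""
    model = SentenceTransformer("all-mpnet-base-v2")
    emb = model.encode(words, normalize_embeddings=True)
    merged, used = {}, set()
    for i, w in enumerate(words):
        if i in used:
            continue
        group = [j for j in range(len(words))
                 if j not in used and float(emb[i] @ emb[j]) >= sim_threshold]
        used.update(group)
        merged[w] = float(np.sum([coeffs[j] for j in group]))
    return merged
```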
Datasets | Prompts |
---|---|
Instruction : Extract concepts from the sentence such as the examples below. | |
The first example, | |
sentence : “A can opener is sitting on top of a can of food.”, | |
concepts : “can opener, a can of food” | |
Imagenet-Opener | The second example, |
sentence : “A knife with a wooden handle is sitting on a green plate.”, | |
concepts : “knife with a wooden handle, green plate” | |
The third example, | |
sentence : “A blue wrench with a handle is shown in a close-up view.”, | |
concepts : “blue wrench with a handle” | |
Instruction : Extract concepts from the sentence such as the examples below. | |
The first example, | |
sentence : “An orange tabby cat is laying on a bed next to a laptop computer.”, | |
concepts : “tabby can, laying on a bed, laptop computer” | |
Metashifts | The second example, |
sentence : “A cat is laying on a blanket and eating food from a fork.”, | |
concepts : “cat, laying on a blanket, eating food” | |
The third example, | |
sentence : “A cat is sleeping on a couch, with its head resting on a pillow.”, | |
concepts : “cat, sleeping on a couch, resting on a pillow” | |
Instruction : Extract concepts from the sentence such as the examples below. | |
The first example, | |
sentence : “The image features a seagull flying over the ocean, with its wings spread wide open as it soars | |
through the sky.”, | |
concepts : “seagull flying over the ocean, wings spread wide open.” | |
Waterbirds | The second example, |
sentence : “The image shows a seagull floating on water, with its wings spread out, and its body resting | |
on the surface.”, | |
concepts : “seagull floating on water, wings spread out, body resting on the surface.” | |
The third example, | |
sentence : “The image features a bird perched on a tall bamboo plant, with a blue and black color scheme.”, | |
concepts : “ perched on a tall bamboo plant, with a blue and black color scheme” |
Stage 2 in Sec. 5.2. Table B and Table C illustrate the prompts for querying LLaVA to annotate the concepts of each image for ImageNet-Opener and Metashifts, respectively. Note that the concepts are obtained from Stage 1. For Waterbirds, we use “Does the object have {attribute}?” for all attributes.
Prompts |
---|
Does the object have a bulkier shape with gears and handles? |
Does the object have a round cutting wheel shape? |
Is the object made of metal material? |
Is the object made of plastic material? |
Is the object made of rubber material? |
Does the object have a round wheel blade? |
Does the object have a blade not overly sharp to touch? |
Does the object have a handle for better grip? |
Does the object have a bottle opener? |
Does the object have a lid lifter? |
Does the object have a sleek shape? |
Does the object have a streamlined shape? |
Does the object have a knife-like shape? |
Does the object have a delicate size? |
Is the object made of wood material? |
Is the object made of ornamental material? |
Does the object have a solid piece? |
Does the object have a flat blade? |
Does the object have a sharp blade? |
Does the object have a slim handle? |
Does the object have an ornamental handle? |
Does the object have a blade for cutting open cans? |
Does the object have a blunt end? |
Does the object have a chair? |
Does the object have a comfortable grip? |
Does the object have a computer? |
Does the object have a counter? |
Does the object have a cupboard? |
Does the object have a desk? |
Does the object have a durable construction? |
Does the object have a fridge? |
Does the object have a kitchen? |
Does the object have a lever for opening cans? |
Does the object have a long, thin blade? |
Does the object have a mailbox? |
Does the object have a pen? |
Does the object have a pointed end? |
Does the object have a printer? |
Does the object have a small handle? |
Does the object have a stamp? |
Does the object have envelopes? |
Does the object have a paper? |
Prompts |
---|
Does the object have a collar? |
Does the object have a dog tag? |
Does the object have a food bowl? |
Does the object have a litter box? |
Does the object have a person? |
Does the object have a scratching post? |
Does the object have a toy? |
Does the object have floppy ears? |
Does the object have a semi-erect ears? |
Does the object have a prominent snout with more separated nostrils? |
Does the object have a curly tail? |
Does the object have a straight tail? |
Does the object have a fluffy tail? |
Does the object have a sleek tail? |
Does the object have a smooth fur? |
Does the object have a curly fur? |
Does the object have a wirly fur? |
Does the object have round paws? |
Does the object have erect ears? |
Does the object have a short snout? |
Does the object have a flat snout with closely set nostrils? |
Does the object have a slender tail? |
Does the object have a flexible tail? |
Does the object have oval-shaped paws with retractable claws? |
0.A.2 Collected Visual Attributes in Sec. 5.1
We collect the visual concepts relevant to each class from Step 1 and they are described in Tab. D. We refrain from enumerating the visual concepts of Waterbirds due to their large number (444 attributes). We detect potential spurious features in Step 2 as detailed in Tab. E. In Step 3, we finalize the visual concepts for constructing CBMs by filtering out potentially spurious features from the relevant concepts based on their similarity. The visual concepts used for constructing CBMs in each dataset are described in Tab. F. We refrain from enumerating the visual concepts of Waterbirds due to their large number (428 attributes). The concepts including “a palm tree”, “a perching posture”, “a green body color” and “a beach or ocean scene” are filtered out in Step 3.
Datasets | Visual Concepts |
---|---|
“a blade for cutting open cans”, “a blunt end”, “a can”, “a chair”, “a comfortable grip”, | |
“a computer”, “a counter”, “a cupboard”, “a desk”, “a durable construction”, “a fridge” | |
“a kitchen”, “a lever for opening cans”, ‘a long, thin blade”, “a mailbox”, | |
“opener”, “a pen”, “a person”, “a pointed end”, “a printer”, | |
Imagenet-Opener | “a stamp”, “a table”, ‘envelopes”, “paper”, “knife”, “a bulkier shape with gears and handles”, |
“a round cutting wheel shape”, “metal material”, “plastic material”, “rubber material” , “a round wheel blade”, | |
“a blade not overly sharp to touch”, “a handle for better grip”,“a bottle opener”, “a lid lifter”, “a sleek shape”, | |
“a streamlined shape”, “a knife-like shape”, “delicate size”, “wood material”, “ornamental material”, | |
“solid piece”, “a flat blade”, “a sharp blade”, “a slim handle”, “an ornamental handle” | |
“a collar”, “a dog tag”, “a food bowl”, “a litter box”, “a person”, “a scratching post”, “a toy”, | |
“leash”, “a cat bed”, “a cat toy”, “floppy ear”,“semi-erect ears”, “prominent snout with more separated nostrils”, | |
Metashifts | “a curly tail”,“a straight tail”, “a fluffy tail”, “a sleek tail”, “a smooth fur”, “a curly fur”, “a wirly fur”, |
“round paws”, “erect ears”, “a short snout”, “a flat snout with closely set nostrils”, | |
“a slender tail”, “a flexible tail”, “oval-shaped paws with retractable claws”, “mammal” |
Datasets | Detected Keywords (Correlation Coefficient) |
---|---|
Can Opener - can (0.39) | |
opener (0.24) | |
person (0.22) | |
Imagenet-Opener | Letter Opener - knife (0.52) |
wood (0.28) | |
gold (0.27) | |
design (0.26) | |
surface (0.20) | |
Cat - kitty (0.96) | |
computer (0.44) | |
dark (0.35) | |
laying (0.33) | |
bed (0.32) | |
couch (0.24) | |
Metashifts | Dog - puppy (0.97) |
women (0.35) | |
leash (0.32) | |
walking (0.30) | |
small (0.27) | |
bike (0.25) | |
frisbee (0.23) | |
catching (0.20) | |
benches (0.20) | |
jumping (0.20) | |
Waterbird - water (0.46) | |
body (0.27) | |
beach (0.27) | |
bird (0.27) | |
background (0.27) | |
ship (0.26) | |
seagull (0.24) | |
large (0.24) | |
duck (0.21) | |
standing (0.2) | |
Waterbirds | Landbird - tree (0.58) |
perched (0.50) | |
surrounded (0.39) | |
bamboo (0.27) | |
green (0.24) | |
plant (0.24) | |
small (0.23) | |
path (0.23) |
Datasets | Visual Concepts |
---|---|
“a blade for cutting open cans”, | |
“a chair”, “a comfortable grip”, | |
“a computer”, “a counter”, “a cupboard”, | |
“a desk”, “a durable construction”, | |
“a kitchen”, “a lever for opening cans”, | |
‘a long, thin blade”, “a mailbox”, | |
“a pen”, “a pointed end”, “a printer”, | |
Imagenet-Opener | “a stamp”, ‘envelopes”, “paper”, |
“a bulkier shape with gears and handles”, | |
“a round cutting wheel shape”, | |
“metal material”, “plastic material”, | |
“rubber material” , “a round wheel blade”, | |
“a blade not overly sharp to touch”, | |
“a handle for better grip”,“a bottle opener”, | |
“a lid lifter”, “a sleek shape”, “a fridge”, | |
“a streamlined shape”, “a blunt end”, | |
“a knife-like shape”, “delicate size”, | |
“wood material”, “ornamental material”, | |
“solid piece”, “a flat blade”, | |
“a slim handle”, “an ornamental handle”, | |
“a sharp blade” | |
“a collar”, “a dog tag”, | |
“a food bowl”, “a litter box”, “a person”, | |
“a scratching post”, “a toy”, | |
“floppy ear”,“semi-erect ears”, | |
“prominent snout with more separated nostrils”, | |
Metashifts | “a curly tail”,“a straight tail”, |
“a fluffy tail”, “a sleek tail”, | |
“a smooth fur”, “a curly fur”, “a wirly fur”, | |
“round paws”, “erect ears”, “a short snout”, | |
“a flat snout with closely set nostrils”, | |
“a slender tail”, “a flexible tail”, | |
“oval-shaped paws with retractable claws” |
Appendix 0.B Concept Annotations of CUB Post-processing
The CUB dataset [34] comprises 11,788 images of birds from 200 species, with each image additionally annotated with binary concepts that denote specific bird attributes, such as wing color and beak shape. [18] refine the concept annotations of CUB to build CBMs. To enhance the accuracy and coherence of the annotations, which are compromised due to contributions from multiple crowdworkers (who are not bird experts), they implement majority voting to consolidate instance-level concept annotations into class-level concepts: e.g., if more than 50% of crows have black wings in the data, then they set all crows to have black wings. This makes the approximation that all birds of the same species in the training data should share the same concept annotations. After majority voting, they further filter out sparse concepts, retaining only concepts that are present in at least 10 classes after majority voting. This process results in a refined list of 112 concept annotations.
Appendix 0.C Experiment Details
0.C.1 Training Details
Annotation-based CBMs comprise two main components: a concept model and a classifier. The concept models consist of the ResNet-50 architecture for feature extraction and multiple single-layer classifiers built on top of the feature extractor. The concept models are trained to predict individual concepts with the binary cross-entropy loss where each individual concept prediction task is weighted by the ratio of class imbalance for that individual concept, as suggested in [18]. We use a batch size of 64, a learning rate of 0.01, and SGD with momentum of 0.9 as the optimizer to train the concept models. The classifiers of annotation-based CBMs are constructed with a small four-layer MLP. We train classifiers with a learning rate of 0.001, a batch size of 64, and SGD with momentum of 0.9 as the optimizer.
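A minimal PyTorch sketch of the concept model and its class-imbalance-weighted loss, following the hyperparameters stated above; the helper names are ours, and data loading and the separate classifier training are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models


def build_concept_model(num_concepts: int) -> nn.Module:
    """ResNet-50 feature extractor with a single-layer head per concept, implemented jointly
    as one linear layer that produces one logit per concept."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    backbone.fc = nn.Linear(backbone.fc.in_features, num_concepts)
    return backbone


def concept_loss(logits: torch.Tensor, targets: torch.Tensor, pos_weight: torch.Tensor):
    """Binary cross-entropy per concept, weighted by each concept's positive/negative ratio
    to counter class imbalance; pos_weight has shape (num_concepts,)."""
    return nn.functional.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)


# Training loop sketch (hyperparameters follow the text: SGD, lr=0.01, momentum=0.9, batch size 64):
# opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# for x, c in loader:
#     opt.zero_grad(); concept_loss(model(x), c.float(), pos_weight).backward(); opt.step()
```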
0.C.2 The Time for Data Annotations
Dataset | # of concepts | # of images | time (s) for 1 image | Total time (h) |
---|---|---|---|---|
Waterbirds | 428 | 5,914 | 180 | 295 |
Metashift | 24 | 1,105 | 10 | 3 |
ImageNet-Opener | 42 | 2,286 | 18 | 11 |
In Tab. G, we quantify the annotation speed on two NVIDIA RTX 3090 GPUs. Multi-GPU parallelization is used to expedite the Waterbirds annotation. We emphasize the substantially reduced cost compared to human annotation. For instance, each of the 312 concepts in CUB was annotated by crowdworkers, with additional post-processing (see App. 0.B).
0.C.3 Evaluation Metrics
We assess the methods based on average accuracy and worst-group accuracy. For Waterbirds, the worst-group accuracy denotes the lowest accuracy across groups, defined by the combination of spurious attributes and classes, and the average accuracy is the adjusted classification accuracy, i.e., per-group accuracies weighted according to the group sizes in the training data, as in [31]. For ImageNet-Opener, the adjusted classification accuracy is used as the average accuracy, and the worst-performing group is identified as can opener images without the presence of a can, which is considered the spurious concept, as in [40]. For Metashifts, the worst-group accuracy indicates the accuracy on the class dog, and the average accuracy denotes the accuracy across both classes with no adjustments, as suggested in [22].
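A sketch of the two metrics, assuming predictions, labels, group indices, and the training-set group proportions are available as arrays; the function and variable names are illustrative.

```python
import numpy as np


def group_metrics(preds: np.ndarray, labels: np.ndarray, groups: np.ndarray,
                  train_group_fracs: np.ndarray):
    """Worst-group accuracy and the adjusted average accuracy (per-group accuracies
    re-weighted by the group proportions in the training data, as in [31]).

    train_group_fracs must be aligned with np.unique(groups) and sum to 1."""
    group_ids = np.unique(groups)
    accs = np.array([(preds[groups == g] == labels[groups == g]).mean() for g in group_ids])
    worst_group_acc = float(accs.min())
    adjusted_avg_acc = float(np.sum(accs * train_group_fracs))
    return worst_group_acc, adjusted_avg_acc
```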
0.C.4 The Concept Sets used in CLIP-based CBMs and Human Annotation CBM
Waterbirds. [18] provide 112 visual attributes specifically related to the appearance of birds, not the backgrounds. As described in Appendix 0.B, these attributes have been annotated by humans. We include these concept annotations $c$ for each image and generate data tuples $(x, y, c)$ from the data pairs of Waterbirds. In Tab. 5, Human Anno. CBM refers to the Annotation-based CBM built on these 112 attributes and annotations. LLaVA-based CBM (CUB) in Tab. 5 denotes the CBM constructed using the LLaVA-annotated concepts. Post-hoc CBM [46] also uses these 112 attributes. On the other hand, Label-free CBM [27] suggests employing GPT-3 to gather essential visual features for classifying the 200 bird species in the CUB dataset. They collect a list of 370 visual attributes, focusing solely on bird appearance. We create prompts based on these 370 attributes and use LLaVA to annotate them. LLaVA-based CBM (Label-free) denotes the resulting CBM.
Metashifts and ImageNet-Opener. Label-free CBM uses GPT-3 to collect concept sets as suggested in Sec 3.1 in [27]. For ImageNet-Opener, the concepts used for Label-free CBM are {a blade for cutting open cans, a blunt end, a can, a chair, a comfortable grip, a computer, a counter, a cupboard, a desk, a durable construction, a fridge, a kitchen, a lever for opening cans, a long, thin blade, a mailbox, a metal construction, a pen, a person, a pointed end, a printer, a sharp point at one end, a slender, sword-like shape, “a small handle”, “a small, handheld tool”, a table, envelopes, knife, opener}. For Metashifts, Label-free CBM uses { a bowl, a cat bed, a cat food bowl, a cat toy, a collar, a dog tag, a leash, a litter box, a person, a round face, a scratching post, a tail, a toy, a water bowl, four legs, fur, green or yellow eyes, large, oval eyes, mammal, pointed ears}.
On the other hand, Post-hoc CBM uses ConceptNet [33] to collect concepts. For ImageNet-Opener, the concept set Post-hoc CBM uses is {sharp, knife, opener}. For Metashifts, it is {playing dead, been shaved, a, fleas, penis, claws, hungry, domestic animal, feline, black, big heart, four legs, hair, sharp claws, faithful companion, teeth, sharp teeth, two ears, brains, nose, woman, flag, gray, brown, loyal friend, gossip, the tail, eyes, mammal, pet, paws, alive, whisker, nice friend, fur, chap, mean, canine, good friend, fun, legs, thirsty, one mouth}.
0.C.5 Using LLaVA directly to Make Predictions
For Waterbirds, when we directly query LLaVA (13B) with “Is the bird in the image a waterbird or a landbird? Answer shortly.”, the average accuracy is 74.2% and the worst group accuracy is 90.6%. LLaVA’s response is biased towards classifying birds as waterbirds, resulting in an accuracy of only 4% for landbirds with water backgrounds.
0.C.6 The Impact of Filtering Threshold on CBM performance.
In Metashift, setting a threshold of 0.2 results in the detection of 16 correlated keywords and the removal of 3 spurious concepts among 27 concepts from Step 1. Raising the threshold to 0.25 reduces correlated keywords to 11, with no change in removed concepts. Further raising the threshold to 0.33 retains all 27 concepts, resulting in a performance drop, with a worst-group accuracy of 75.2% and an average accuracy of 80.6%.
0.C.7 Label-free Interpretability Results
Figure A shows the inputs with explanations of the contribution of each concept to the model’s prediction. Images of can opener lacking a can tend to be misclassified (upper). A correct prediction occurs if a can is present (lower).
