I saw, I conceived, I concluded: Progressive Concepts as Bottlenecks

Manxi Lin¹, Aasa Feragen¹, Zahra Bashir²
Martin Grønnebæk Tolsgaard², Anders Nymark Christensen¹
¹Technical University of Denmark ²CAMES
{manli, afhar, anym}@dtu.dk [email protected] [email protected]

Abstract

Concept bottleneck models (CBMs) include a bottleneck of human-interpretable concepts providing explainability and intervention during inference by correcting the predicted, intermediate concepts. This makes CBMs attractive for high-stakes decision-making. In this paper, we take the quality assessment of fetal ultrasound scans as a real-life use case for CBM decision support in healthcare. For this case, simple binary concepts are not sufficiently reliable, as they are mapped directly from images of highly variable quality, for which variable model calibration might lead to unstable binarized concepts. Moreover, scalar concepts do not provide the intuitive spatial feedback requested by users.

To address this, we design a hierarchical CBM imitating the sequential expert decision-making process of ”seeing”, ”conceiving” and ”concluding”. Our model first passes through a layer of visual, segmentation-based concepts, and next a second layer of property concepts directly associated with the decision-making task. We note that experts can intervene on both the visual and property concepts during inference. Additionally, we increase the bottleneck capacity by considering task-relevant concept interaction.

Our application of ultrasound scan quality assessment is challenging, as it relies on balancing the (often poor) image quality against an assessment of the visibility and geometric properties of standardized image content. Our validation shows that – in contrast with previous CBM models – our CBM models actually outperform equivalent concept-free models in terms of predictive performance. Moreover, we illustrate how interventions can further improve our performance over the state-of-the-art.

1 Introduction

How would an experienced sonographer recognize a high-quality femur screening (femur standard plane) from thousands of ultrasound scans? Fig. 1 illustrates the standard procedure [20] of first seeking for specific organs to have a rough understanding of the image, and subsequently, based on the concepts from their knowledge, making a decision. This is also the most common process we follow when understanding images in our daily life - we see some objects/regions, conceive their properties according to our knowledge, and finally conclude the content of the image.

Refer to caption — Figure 1: The process of a specialist recognizing a femur standard plane. The concepts in the ”Conceiving” stage are taken from [21].

Due to their superior performance, deep neural networks (DNNs) are increasingly entering all aspects of human decision-making. In applications associated with human safety and privacy, such as healthcare and autonomous driving, their black-box nature is associated with potentially unacceptable risk, and as a consequence, increasing efforts are directed towards the explainability of model decisions. Compared to post hoc interpretations [43, 45, 33, 23, 57], which are intended to explain arbitrary black-box models, intrinsically explainable models [9, 24, 41, 2, 10, 17, 48], or white-box models, provide ante-hoc explanation during inference. Some methods [9, 48, 29] reason in an intuitive way by considering visual prototypes or clues. These methods highlight regions of interest in the image that are semantically meaningful and understandable for humans. These visual explanations are intuitively helpful by explaining ”The model is looking at this” and ”this looks like that”, yet still containing redundant information and requiring more from the users’ prior knowledge for ”what is this”. In contrast, other methods, such as CBMs [24], aim to explain the reasoning process with human specialist-predefined concepts, such as colors and shapes of objects, which are precise and accurate for human understanding.

When considering a problem with input $x$ and target $y$ , concept bottleneck models [24] predict $y$ from expert-annotated concept labels. Specifically, given the concepts $c\in{\mathbb{R}}^{d}$ , concept bottleneck models learn the mapping $g\colon x\mapsto c$ and $f\colon c\mapsto y$ sequentially, jointly or independently [24]. Here, $d$ is the dimension of the concept bottleneck. When the prediction of $y$ is based on human-specified concepts, the model is inherently explainable. However, recent results [31, 30] suggest a potential information leakage from $y$ to the mapping $g$ during the sequential and joint training, which means these CBMs do not learn concepts in an intended manner, but rather as a proxy for target information. This leakage also undermined the intervenability of CBMs [31]. The independent CBMs eliminate the target information leakage by learning $y$ from the ground truth $c$ . Nonetheless, saliency maps show that the independent CBMs fail to learn concepts from semantically meaningful information [31]. This illustrates that CBMs do not really understand the meaning of given concepts - the $x\mapsto c$ mapping is actually learned by a non-transparent model, which is neither intervenable nor explainable.

In this paper, we reconsider concept learning in CBMs. In psychology, the concept is formally defined as the label of a set of things that have something in common [4]. As illustrated in Fig. 1, the femur bone visualization, the expert-specified standard plane requirements, and even the category ”femur standard plane”, can be considered a progression of concepts with increasing abstraction. Here, abstraction means the level of difficulty for a non-expert to understand. To this end, the visual understanding task is a gradual abstraction process of concepts from redundant information. This is also supported by early post-hoc studies [54, 53], which reveal that shallower layers in a neural network tend to learn simpler concepts, while the higher layers prefer abstract ones. Inspired by the procedure ”seeing”, ”conceiving”, and ”concluding”, we improve CBMs by bottlenecking progressive concepts. This includes mapping from images to semantically meaningful visual concepts; abstraction from visual concepts to property concepts; and finally category prediction. The segmentation bottleneck visualizes the region of interest for the learned property concepts, which improves both the intervenability and interpretability of the model.

Existing work [41, 52, 22] investigates the lack of expressivity of CBM predictors. The abstraction from property concepts to task targets can be challenging, especially when the concept is insufficient or the predictor does not have enough capacity to model a complex mapping. Learning additional concepts with e.g. self-supervision [41] fails to address the information leakage problem as trained in a joint way. The learned additional concepts also weaken the human-model communication, since they cannot easily be corrected by experts. Here, we enhance the predictor capacity by learning the interaction of the predefined concepts.

Our key contributions are as follows:

•

We propose Progressive Concept Bottleneck Models, providing both segmentation and property explanations. The property-level explanation has semantically meaningful representations in the input space.
•

Our model allows human-model communication at different abstraction levels, which paves the way for human intervention at different knowledge levels.
•

We expand the model capacity while keeping the explainability via learnable concept interaction.
•

Our model outperforms existing methods in terms of both accuracy and explainability on a challenging fetal ultrasound quality assessment task.

2 Related work

We consider local explanations [5] for visual tasks, focusing on individual predictions.

Explaining with visual concepts.

Visual attention, also known as visuospatial selection attention, allows humans to selectively mask out redundant information in a cognition task [27]. Explaining with visual concepts means understanding an image by ”seeing” regions of interest. Therefore, it is natural to interpret the neural network behavior with visual concepts. Class activation mapping [57] and its successors [42, 46, 14, 38] localize the features of the DNNs that are responsible for the classification decision on the image by fusing channel contributions. Zeiler et al. [54] visualized the intermediate layers in a DNN with de-convolutional neural networks. Feature visualization [33] makes the learned features in DNN explicit via optimization. Network dissection [6] aligns neuron activation with expert-given visual concepts and produces activation masks for learned features. These post-hoc explanations, however, do not benefit the reasoning process of networks but only try to explain well-trained, maybe biased black-box models.

Other methods inject visual concepts into the decision process of DNNs. Region-based recognition methods [48, 19, 47, 18] locate visual concepts of the input images and obtain corresponding feature vectors, based on which the network makes decisions. Interpretable convolutional neural networks [56] design a loss to push the high-level DNN features toward reasonable visual concepts without extra annotation. ProtoPNet [9] not only focuses on the regions of interest but also pointing prototypical cases similar to those visual concepts.

Explaining with visual concepts, i.e., the process ”I saw, I concluded” is intuitively meaningful in visual tasks. However, Pazzani et al. [34] argue that explanations should be what a domain expert would use, and visual explanations can not fulfill the explainability requirement in some cases. In addition, existing visual self-explanation models do not allow human-model communication, i.e., intervention.

Explaining with property concepts.

Property concepts refer to any abstraction distilled by domain experts, e.g., a color, a shape, or even an idea [32]. They are used to describe the typical human-understandable characteristic of an image. Explaining property concepts means ”conceiving” the property of the image before concluding the content. Kim et al. [23] provided a global explanation by generalizing concepts in the activation space of DNN and calculating the conceptual sensitivity to input samples. Hierarchical neuron concept explainer [45], in contrast, produces local explanation by associating neurons with property concepts at a hierarchical level. Different from these post-hoc methods, Concept Learning Models [24, 10, 29, 3, 25, 52] represents concepts with neuron activation, which makes it intrinsically interpretable. Classification transformer [25] employs single-class labels as concepts for a multi-label classification task. Concept whitening [10] replaces the normalization layers in a pre-trained DNN with a layer constrained to represent concepts. CBMs [24] enables human explanation inspection and correction by intervening the neuron activation. The limited number of expert-given concepts as well as the lack of expressivity in the predictor constrain the performance of CBMs. Sarkar et al. [41] supplemented the property concepts by self-supervised learning and enhances the faithfulness of the concepts, which, nevertheless, risks information leakage during concept learning, according to [31, 30]. The independent CBMs do not suffer from the leakage, but can still be learning bias as concepts since the mapping from the input image to properties is not forced to consider the semantic content. We argue directly ”conceiving” from the image may be not enough - the property concepts rely on specific regions of the image. The explainable model should consider the relevance between the visual concepts and the property concepts. In this paper, we not only ”saw” image with segmentation visual concepts, but also ”conceived” the visual concepts with more abstract property concepts. Our model allows human intervention at both the semantic level and property level for debugging network representations. Moreover, we improve the model capacity by interacting with property concepts. The closest CBM variant to ours is the hierarchical concept bottleneck model [35], which incorporates CBM as parts of a Mask R-CNN [15]. Given a scene consisting of multiple objects, the hierarchical concept bottleneck model only classifies single objects in the scene. In contrast, our model considers each object as part of the scene and is able to recognize the scene category.

Segmentation for classification.

In this paper, we built a segmentation network to ”see” the visual concepts. Segmentation as part of a classification network is a common solution in some computer vision applications, e.g., region-based fine-grained image recognition [15, 19], scene recognition [28] and defect detection [8, 51, 36]. Huang et al. [19] weakly localized different parts of objects via region grouping and predicted fine-grained labels from the region features. Mask CNN [48] masks learned features with the segmentation prediction. Semantic-aware scene recognition model (SASceneNet) [28] treats segmentation as another modal. It learns features from segmentation and from the input image in different branches and fuses them to make a final prediction. Bozic et al. [8] proposed a two-stage network detecting defects based on the high-level features learned from a segmentation network. These segmentation-for-classification methods provide visual explanations [48]. In this paper, our model does not only provide visual explanations, but more importantly, the proposed model also allows human intervention in the learned segmentation.

3 Method

We consider a visual task where the input image $x\in{\mathbb{R}}^{h\times w\times m}$ with $m$ channels, height $h$ and width $w$ , induces a predicted target $y$ . A black box model would learn a direct mapping predicting $x\mapsto y$ , which is non-transparent and thus not explainable.

As stated before, the visual understanding task is essentially a hierarchical abstraction process of concepts. Inspired by the heuristic procedure ”seeing”, ”conceiving” and ”concluding”, we propose Progressive Concept Bottleneck Models (PCBMs), which consist of three stages corresponding to the three steps in the procedure. Fig. 2 demonstrates the architecture of PCBM. We represent PCBM as a mapping $f(l(g(x)))$ , where $g$ ”saw” segmentation concepts from $x$ , $l$ ”conceived” property concepts from the segmentation, and $f$ ”concluded” the prediction from property. To eliminate information leakage [31, 30] and to enable concept intervention, the three stages were trained independently. This means that at each stage, ground truth targets from the previous stage are used for training, whereas predicted targets are used for testing. We introduce the $g$ , $l$ , and $f$ in detail respectively in Sec. 3.1, Sec. 3.2 and Sec. 3.3.

3.1 ”Seeing” with an observer

We built an observer network to learn $g$ , which is a mapping ”seeing” segmentation concepts from the input image $x$ . We define these concepts to be experts’ regions of interest in the input image $x$ . We first trained a segmentation convolutional neural network (CNN) to learn specialist-annotated segmentation mask $s\in{\mathbb{R}}^{h\times w\times n}$ , where $n$ is the number of segmentation concepts. During inference, the observer predicts a segmentation map $p(s|x)\in{\mathbb{R}^{h\times w\times n}}$ . Since the segmentation map does not include texture information, which is important for deriving object property, we constructed segmentation concepts by soft-masking the original image $x$ with the segmentation map. The $i$ -th segmentation concept $\bar{s}_{i}\in{\mathbb{R}^{h\times w\times m}}$ for each predicted segmentation map $p(s_{i}|x)\in{\mathbb{R}^{h\times w}}$ is represented as:

\bar{s}_{i}=\{p(s_{i}|x)\odot x_{j}|j=1,2,...,m\}

(1)

where $x_{j}$ is the $j$ -th channel of the input image and $\odot$ means Hadamard product. The segmentation concepts are gathered as $\bar{s}=\{\bar{s}_{i}|i=1,2,...,n\}$ . We call the output of the observer net the segmentation bottleneck, where the model provides visual explanations that human users can inspect.

3.2 ”Conceiving” with a conceiver

We construct a conceiver network to learn the mapping $l$ from segmentation concepts to high-level property concepts, imitating the process of ”conceiving” characteristics of seen objects/regions, and associating it with expert knowledge. Following CBMs, we consider property concepts as either scalars indicating an score property, or binary referring to a categorical property. Our conceiver is an encoder CNN that maps inputs to a concept space $\mathbb{R}^{d}$ , given expert-annotated property concepts $c\in\mathbb{R}^{d}$ . Like previous CBMs, we assume property concepts to be irrelevant to each other. Each output neuron of the conceiver net is thus trained independently to align with one property concept.

The property concepts describe single objects (e.g. angles), regions of interest (e.g. image occupancy), or whole image scenes (e.g. symmetry). As a consequence, a property concept may rely on multiple segmentation concepts. For example, assess the symmetry of a head ultrasound plane, we need to consider all the segmented organs. The conceiver models the relationship between segmentation and property concepts. This is done by concatenating the segmentation concepts and flattening the dimensions other than the spatial dimension before feeding into the conceiver. That is, the conceiver is learning a mapping $l:\mathbb{R}^{h\times w\times\mu}\rightarrow{}\mathbb{R}^{d}$ , where $\mu=m\cdot n$ . We name the output layer of the conceiver net the property bottleneck, where the model provides more property-oriented explanations for users than just visual explanations.

Combining the observer- and conceiver nets, our PCBM allows human intervention in the segmentation bottleneck. When users disagree with the visual explanation, they can correct the segmentation concepts, to intervene in the property prediction. We are the first to introduce this vision-level human intervention.

3.3 ”Concluding” with a predictor

The final predictor ”concludes” targets from property concepts. The predictor learns $f\colon c\mapsto y$ from annotated concepts when training, and predicts $f(l(g(x)))$ from predicted concepts at test time. The predictor was trained independently, allowing intervention at the property bottleneck.

As indicated in Sec. 1, the predictor’s capacity is a factor constraining the full model performance. We argue that modeling concept interaction can improve this capacity. Specifically, we consider interaction between binary (categorical) property concepts. As in Sec. 3.2, both CBMs and our PCBM learn different property concepts independently. In the second PCBM stage, the conceiver performs binary classification on each binary property concept separately. The predicted probability of the $i$ -th binary concept $c_{i}$ is $\hat{c}_{i}=p(c_{i}|s)$ , given segmentation concepts $s$ . As we assume that concepts are independent, the interaction of two binary concepts $p(c_{i}|s)$ and $p(c_{i}|s)$ is described as:

p(c_{i},c_{j}|s)=p(c_{i}|s)\cdot p(c_{i}|s).

(2)

A solution would be a large MLP. However, the implicit concept interaction would neither be transparent nor visible.

We thus propose a task-specific module to introduce concept interaction. For ease of explanation, we temporarily assume that all property concepts $c$ are binary. The module first takes in the predicted probability of binary property concepts $\hat{c}=[\hat{c}_{1},\hat{c}_{2},...,\hat{c}_{d}]$ into an MLP with ReLU [1] activation to generate non-negative weight $\omega=[\omega_{1},\omega_{2},...,\omega_{d}]$ for each concept. This is described as:

\omega=ReLU(\epsilon(\hat{c})),

(3)

where $\epsilon$ is an MLP. Note that the negative values given by the MLP are suppressed to $0$ after the ReLU activation, which means $\omega$ could include some zero weights. We further represent the interacted concepts $\bar{c}$ as:

\bar{c}=\frac{{(\hat{c}{\omega^{T}})}^{2}-{\hat{c}}^{2}{({{{\omega}^{2}}})}^{T}}{{(\sum_{i=1}^{i=d}\omega_{i})}^{2}-\sum_{i=1}^{i=d}{\omega_{i}}^{2}}.

(4)

By the square expansion formula, $\bar{c}$ can be described as:

\bar{c}=\frac{\sum_{i,j\in\{1,2,...,d\}}\omega_{i}\omega_{j}\hat{c}_{i}\hat{c}_{j}}{\sum_{i,j\in\{1,2,...,d\}}\omega_{i}\omega_{j}},

(5)

which is the linear combination of pairwise concept interactions. When $\omega_{i}$ or $\omega_{j}$ is 0, the corresponding concept interaction is not selected. The constructed concepts are linear combinations of pairwise concept interactions, whose components are transparent to users if checking the weight $\omega$ .

We use $\sqrt{\bar{c}}$ , which is in the same range as $c$ , and concatenate it with $c$ as input for the predictor. The predictor and the concept interaction are trained jointly. As the concept interaction is obtained from the predefined property concepts, we keep the intervenability of the property bottleneck. Since we consider the pair-wise interaction of concepts, this fuses the information from up to $\binom{d}{2}$ potential concept pairs to intensify the bottleneck.

4 Experiments

The performance of our model was validated using an excerpt from a national fetal ultrasound screening database (REFERENCE ANONYMIZED).

Data preparation.

For the experimental validation, we used a dataset consisting of n = 2666 images obtained from a typical 3rd-trimester growth screening ultrasound examination. The images consisted of both standard planes (SP) and non-standard planes (NSP) from four distinct anatomies, namely the head, the femur, the abdomen, and the maternal cervix. Standardized planes from these anatomies are routinely acquired to record fetal biometric parameters associated with fetal growth and pre-term birth.

The dataset statistics are detailed in Table 1. For each test, the experiment was repeated 10 times resampling training (50%), validation (10%) and test (40%) sets, divided on the subject level to avoid subject overlap between splits.

The recorded planes were annotated¹¹1https://github.com/wkentaro/labelme by an expert (MD, PhD fellow in Fetal Medicine). The images were divided into eight classes consisting of the four anatomies along with an SP/NSP classification; which done according to established international quality criteria [40, 21]. The relevant anatomical structures (see appendix) were manually outlined, and their corresponding concepts annotated.

Concepts were either binary (true/false) representing correct angle, image occupancy (size of object) and symmetry, or a scalar quality score of 0-10. Quality scores assess the visual quality of anatomical structures and caliper locations (0 = not visible, 10 = excellent visualisation).

All images were annotated by one annotator but selected low- and high-quality annotated images were sent to two fetal medical experts to ensure continued agreement in the annotation process and to minimize potential annotation bias. Images with huge disagreement or need for discussion was presented to a panel of fetal medical experts.

Table 1: Overview of dataset and division across classes.

femur		abdomen		head		cervix
SP	NSP	SP	NSP	SP	NSP	SP	NSP
539	59	133	545	65	556	687	82

Implementation details.

Following CBMs, the property concept label $c$ of each image is a vector of size 27, where each element represents a concept. When the concept was not applicable to the image, the corresponding position is padded with 0. To discriminate the padded 0 and the concepts with value ”false”, we smoothed the applicable binary concept labels from 0 and 1 to 0.01 and 0.99 for each image. The images were resized to $224\times 288$ , and the pixel intensity was normalized to $[-1,1]$ .

We trained a DTU-Net [26] with an ImageNet [12] pre-trained RegNetY-1.6GF [37] backbone as the observer network. This is because DTU-Net is good at capturing curvilinear structures, and half of the segmentation concepts in this dataset are curvilinear. For the conceiver, we trained a ResNet-18 [16] pre-trained on ImageNet. We modified the first convolution layer and the fully connected layer to fit the concept size. The predictor is an MLP with a hidden layer containing 1024 neurons. The MLP $\epsilon$ for concept interaction has three hidden layers with 12 neurons each. The hidden layers are equipped with batch normalization and ReLU activation. We introduce the training details as well as model complexity in the appendix.

Model performance.

We compare our model performance to a series of baseline models as shown in Tab. 2. The standard model is a network with the same architecture as our PCBM, but trained end-to-end for classification. The Hadamard product in the segmentation bottleneck was replaced with a convolution layer to help the model convergence. We also include SonoNet-32 [7], a state-of-the-art classification network for ultrasound plane classification. Similar to our PCBM, another baseline SASceneNet-18 [28] also includes segmentation as part of a classification model. Besides these black-box models, we trained a standard CBM as a baseline as well. For a fair comparison, we used a DTU-Net for segmentation in SASceneNet-18, a ResNet-18, and an MLP with 1024 neurons in a single hidden layer to build CBM.

These models were trained with the same settings as our PCBM. We evaluated the model performance with three metrics: overall classification accuracy over instances (OA), average classification accuracy over categories (MA), and Matthews correlation coefficient (MCC) [11]. Given the class imbalance, MA and MCC can better reflect the model performance on underrepresented classes compared to OA.

Method	OA (%)	MA (%)	MCC (%)
Standard Model	$88.20\pm 0.30$	$58.83\pm 1.12$	$85.31\pm 0.37$
SonoNet-32 [7]	$86.91\pm 0.57$	$54.62\pm 1.85$	$83.71\pm 0.71$
SASceneNet-18	$88.20\pm 0.38$	$58.37\pm 1.49$	$85.26\pm 0.47$
[28]
CBM [24]	$86.72\pm 0.46$	$59.16\pm 1.82$	$83.40\pm 0.58$
PCBM (ours)	$\textbf{88.62}\pm 0.77$	$\textbf{66.51}\pm 1.85$	$\textbf{85.80}\pm 0.95$

Table 2: Classification performance of different models on the ultrasound dataset. ’OA’ means overall classification accuracy over instances; ’MA’ refers to the average classification accuracy over categories; ’MCC’ stands for Matthews correlation coefficient.

According to to Tab. 2, our PCBM outperforms all the baselines on the three metrics. Specifically, our model surpasses other methods by at least $7.35\%$ in terms of the mean MA over 10 splits, which indicates that our model is improving recognition of underrepresented classes.

Fig. 3 gives further insight into our performance in terms of sensitivity and specificity split across different anatomies. Here we see that all models struggle with specificity for femur and cervix, as well as with sensitivity for abdomen and head. This is caused by a difference in the difficulty level of the different scans, as also discussed in the Discussion and Conclusion below. Fig 3 shows us, however, that the PCBM performs on par with or considerably better than the other methods on the challenging categories.

Correctness of explanation.

Method	RMSE	COA (%)
CBM	$0.1184\pm 0.0010$	$75.80\pm 0.09$
PCBM (ours)	$\textbf{0.1072}\pm 0.0024$	$\textbf{99.19}\pm 0.10$

Table 3: Concept classification performance of different models on the ultrasound dataset. ’RMSE’ stands for the root-mean-square deviation of the learned concepts to groud truth; ’COA’ is the binary classification accuracy of concepts.

As the model’s performance depends on two layers of intermediate concepts - segmentation and property - we first validate the performance of those. For the segmentation concepts we validate in terms of IoU, see Fig. 4. Our observer net achieved a high IoU in many categories, except for some performance drops in the cervical canal outline, and cervix inner and outer boundary. These concepts are thin and long curvilinear structures, where IoU is naturally lower even for good performance. The property concepts, on the other hand, are validated in terms of classification accuracy (binary concepts) and root mean squared error (RMSE, scalar concepts); these results are shown in Fig. 4. We see that the binary concepts achieve a classification accuracy well above 95%, whereas the scalar concepts achieve an RMSE of 1.5 or below. We also perform a comparison across concepts between our own model and a standard CBM model; these results are found in Tab. 3. The concept ”cervix symmetry” has a relatively low accuracy compared to other binary concepts. This could be because of the relatively poor segmentation result of the curvilinear structures in cervix planes, which means that our conceiver learned the correct relationship between the segmentation and property. This is further demonstrated by our following experiments on the role of segmentation bottleneck.

Role of the segmentation bottleneck.

Property concept should be predicted on the relevant parts of the input space [31]. Compared to CBM, our PCBM has an extra segmentation bottleneck, which provides a visual explanation while assuring the property concepts to be learned from semantically meaningful regions in the input image. For validation, we inpainted [44] specific organs (thalamus, cavum septi pellucidi, stomach bubble, and umbilical vein) with nearby textures from their images respectively, and tested CBM and our PCBM on the inpainted images. These organs are picked because each of them is directly associated with a property concept indicating the quality grade, and they appear in most images of their anatomy. If the model is learning the correct concept from the image, ideally, images where the organs were absent, such as the inpainted images, should predict the corresponding quality concepts to be zero. For this reason, we evaluated the model performance with the value of the associated concept in Tab. 4. We see that our model has learned the ”real” concepts of thalamus, cavum septi pellucidi, and umbilical vein while CBM did not. Thanks to the segmentation bottleneck, our model first ”saw” regions of interest in the image, then predicted property over them. CBM, in contrast, lacks the segmentation stage and thus suffers from the confounding information in the image. For the stomach bubble, both of the models were biased, but the bias was reduced for PCBM. See the appendix for examples.

Method	Thalamus	CSP	Stomach Bubble	UV
CBM	$0.4176\pm 0.0479$	$0.2389\pm 0.0439$	$0.6834\pm 0.0712$	$0.1464\pm 0.0410$
PCBM	$\textbf{0.1300}\pm 0.0236$	$\textbf{0.0586}\pm 0.0164$	$\textbf{0.5410}\pm 0.0761$	$\textbf{0.0218}\pm 0.0006$

Table 4: Predicted quality value in inpainted image. Here, CSP means cavum septi pellucidi; UV means umbilical vein.

Test-time intervention.

Similar to CBM, our PCBM allows users to inspect and intervene on both concept bottlenecks to correct wrong explanations. The model prediction is updated when the explanations are corrected. We simulated this by replacing predicted concepts with ground truth and recording the change in model performance.

We first intervened in the segmentation bottleneck. As shown in Tab. 5, the intervention improved the predictive performance on property concepts, and thus improved the classification result. This is because our conceiver modeled a relationship between the segmentation concepts and the property concepts. The improvement is, however, marginal, as the observer network already achieved a good performance in segmentation, see Fig. 4.

	w/o intervention	with intervention
OA(%)	$88.62\pm 0.77$	$\textbf{88.95}\pm 0.86$
MA(%)	$66.51\pm 1.85$	$\textbf{66.57}\pm 2.42$
MCC(%)	$85.80\pm 0.95$	$\textbf{86.19}\pm 1.07$
RMSE	$0.1072\pm 0.0024$	$\textbf{0.1067}\pm 0.0025$
COA (%)	$99.19\pm 0.10$	$\textbf{99.22}\pm 0.10$

Table 5: Test-time intervention at the segmentation bottleneck.

At the property level, we intervened concepts one by one by greedy best-first search [50] at each split. We reported the tendency of the mean and standard deviation of overall classification accuracy over 10 splits with the increase of intervened concepts in Fig. 5. We see from the figure that our PCBM benefits more from the intervention than CBM; likely because PCBM, which considers concept interaction, has a larger predictor capacity than CBM. See appendix for test-time intervention examples.

Ablation studies.

SB	CB	CI	OA (%)	MA (%)	MCC (%)
	✓		$86.72\pm 0.46$	$59.16\pm 1.82$	$83.40\pm 0.58$
	✓	✓	$86.86\pm 0.28$	$60.89\pm 1.13$	$83.54\pm 0.80$
✓			$\textbf{89.88}\pm 0.71$	$63.10\pm 2.29$	$\textbf{87.33}\pm 0.88$
✓	✓		$88.19\pm 0.96$	$65.11\pm 1.40$	$85.27\pm 1.14$
✓	✓	✓	$88.62\pm 0.77$	$\textbf{66.51}\pm 1.85$	$85.80\pm 0.95$

Table 6: Ablation studies. ’SB’ = segmentation bottleneck; ’CB’ = property bottleneck; ’CI’ = concept interaction module.

We conducted ablation studies to evaluate the component contribution of PCBM. The result in Tab. 6 shows that all of our three technical contributions, i.e., the semantic bottleneck, the progressive bottleneck architecture, and the concept interaction module contribute to the model performance. Specifically, we find that the concept interaction module is more effective with the presence of the segmentation bottleneck. This could be because of the more accurate property prediction. We further evaluated the faithfulness of the learned concept interaction in the appendix. The model with a single segmentation bottleneck achieves the highest OA as well as MCC among the variants. This demonstrates the importance of our introduction of segmentation concepts. The full model, while it has a lower OA and MCC than the segmentation bottleneck model, surpasses other variants in terms of MA, and provides extra property explanations in contrast to the single segmentation-based model.

5 Discussion and conclusion

Strengths.

We have designed a progressive concept bottleneck model (PCBM) which utilizes a progressive sequence of explanatory concepts which are naturally aligned with the clinician’s thinking process. Indeed, during pilot testing, clinicians emphasize the intuitive utility of the segmentation concepts to quickly make decisions about images. Conversely, the property concepts are far easier to intervene on at test time. Thus, both concept bottlenecks are valuable both from a performance and a user perspective.

We demonstrate the PCBM performance on a challenging, clinical task, and show that it obtains superior performance to other methods with and without explanations.

Limitations.

Our segmentation concept is learned from supervised segmentation, which needs extra annotation. As future research, the segmentation bottleneck might be replaced by attention maps [13] learned from other models, saliency maps from eye trackers [39], or by using methods from weak labeling segmentation [55].

We only consider pair-wise concept interaction. The interaction of three or more concepts would be interesting to explore in the future. Moreover, according to the Tab. 4, our model still faces the risk of learning bias sometimes, which leaves open potential for improvement.

Tab. 1 shows that our dataset is highly imbalanced across different categories. For instance, the ”cervix SP” class has 687 images while the ”head SP” class only has 65 scans. This is caused by a large variation in the difficulty of acquiring different standard planes, meaning many recorded head planes in the national database are NSP, whereas most recorded cervix scans are SP. A closer inspection reveals that this same tendency also affects the difficulty of the classification problems: the annotated quality concepts show that many of the head and abdomen images annotated as SP, are in fact close to being NSP. In other words, the poor image quality and dataset imbalance are both caused by the challenging problem of obtaining a good ultrasound standard plane for certain anatomies. While this is a limitation, it is also a highly realistic one, which to some degree explains why all models are low in either sensitivity or specificity as shown in Fig. 3.

Conclusion.

We propose progressive concept bottleneck model architecture, which consists of an observer ”seeing” segmentation concepts from input images, a conceiver ”conceiving” property concepts from the segmentation, and a predictor ”concluding” based on the property. The segmentation bottleneck between the observer and the conceiver, and the property bottleneck linking the conceiver and the predictor allow human-model communication at both visual and property levels. Furthermore, we introduce concept interaction enhancing the predictor’s capacity.

We demonstrate our model in the real-world scenario of decision support for assessing the quality of fetal ultrasound scans. This is challenging because of poor image quality due to image acquisition being difficult as well. Our experiments demonstrate that our model surpasses baseline models including previous CBMs and state-of-the-art classification models at recognizing ultrasound standard planes. Moreover, we show the importance of our segmentation bottleneck for learning the ”real” concepts. Finally, the PCBM supports human intervention at different levels, enabling even higher performance.

Appendix A Fetal ultrasound quality assessment

Fetal ultrasound quality assessment is a crucial step for accurate biometric measurement in obstetric examination [49]. The aim of this task is to recognize high-quality ultrasound scans of specific views of the fetus (standard planes) from all the images from an ultrasound screening. The International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) defines the standard plane and provides guidelines for clinicians [40]. In this paper, we perform the quality assessment task for the ultrasound scans from the 3rd trimester. There are four different standard planes worth attention to in the 3rd trimester: the Cephalic (head) plane, Abdominal plane, Femoral plane, and Cervix plane. The ISUOG practical guidelines for these planes are listed in Tab. 7.

Cephalic (head) plane	Abdominal plane	Femoral plane	Cervix plane
Symmetrical plane;	Symmetrical plane;	Both ends of bone clearly visible;	Cervix occupies $50-75\%$ of the total image;
plane showing thalamus;	plane showing stomach bubble;	$<45\degree$ angle to horizontal;	the bladder is empty;
plane showing cavum septi pellucidi;	plane showing portal sinus;	femur occupying more than half of total image;	the cervix is symmetric (no signs of excessive pressure);
cerebellum not visible;	kidneys not visible;	calipers placed correctly.	the cervical canal is visualized sufficiently;
head occupying more than half of total image;	abdomen occupying more than half of total image;		calipers placed correctly at the internal and external orificium.
calipers and dotted ellipse placed correctly.	calipers and dotted ellipse placed correctly.

Table 7: ISUOG practical guidelines for the 3rd-trimester standard planes. The guidelines are summarized from [40, 21].

Fetal ultrasound quality assessment is a fine-grained image recognition problem, which is challenging because of small inter-class variations and large intra-class variations. Specifically, according to to Tab. 7, an ultrasound image needs to meet the four criteria at the third column to be a femoral standard plane. Fig. 6 shows examples of two femur standard planes and a femur non-standard plane. In the figure, (a) and (b) looks different from each other, while both of them meet all the criteria. This indicates the fact that the varying image quality during ultrasound screening leads to various femur looking in the standard plane, i.e., a high intra-class variation. (c) in the figure, although similar to (a), has an unclear right end, and thus is not a standard plane. In this work, the proposed PCBM outperforms the state-of-the-art models in this challenging task.

As stated in the main paper, depending on the visibility, an annotator gave a quality score from 0 to 10 to the specific organs. In experiments, we normalized these scores to $[0,1]$ . According to to Tab. 7, the organ quality in an image can reflect the image quality to a degree. We computed the average of the organ quality in each image to illustrate the image quality. Fig. 7 shows the image quality distribution of different classes. The figure demonstrates that in each class, the image quality varies a lot.

Appendix B Training details

This section presents the training details of the proposed progressive concept bottleneck models (PCBMs) for the fetal ultrasound quality assessment task.

We trained a DTU-Net [26] with an ImageNet [12] pre-trained RegNetY-1.6GF [37] backbone as the observer network. The segmentation loss was a combination of dice loss and a weighted focal loss. Model performance was evaluated by IoU.

For the conceiver, we trained a ResNet-18 [16] pre-trained on ImageNet. We modified the first convolution layer and the fully connected layer to fit the concept size. For the scalar property concepts, the loss was measured by mean-square error while for the binary concepts, the loss was computed by binary cross entropy. For scalar concepts, the model performance was evaluated by root-mean-square error (RMSE). the concept interaction module only took the binary property concepts as input. The model performance on the binary concepts was evaluated by classification accuracy.

The predictor is an MLP with a hidden layer containing 1024 neurons trained with cross-entropy loss. The MLP $\epsilon$ for concept interaction has three hidden layers with 12 neurons each. The hidden layers are equipped with batch normalization and ReLU activation.

The observer, conceiver, and predictor were trained with an AdamW optimizer for 200, 50, and 50 epochs respectively. The weight decay was set to 1e-6. The initial learning rate was set to 1e-4, multiplied by 0.1 if the validation loss stopped improving for 10 epochs. The batch size was set to 64 for classification and to 8 for segmentation. The model with the best performance in the validation set was picked. The model performance we reported was evaluated on the test set. We oversampled images from category ”femur NSP”, ”abdomen SP”, ”head SP” and ”cervix NSP” during training to alleviate the data imbalance over classes. We implemented the code with Python 3.7.7 and PyTorch 1.12.1. The experiments were conducted in an Ubuntu 18.04 environment on a Quadro RTX 6000 GPU.

Appendix C Model complexity

Tab. 8 reports the complexity of different components in our PCBM in the ultrasound quality assessment task. We evaluate the model complexity with four metrics: model parameters (Params), the floating point operations per second (FLOPs), and the inference time of the model for each image counted in milliseconds. The experiment was conducted on a Quadro RTX 6000 GPU with a batch size of 8. The Params and FLOPs of the model was evaluated by thop ²²2https://github.com/Lyken17/pytorch-OpCounter.

Component	Params(M)	FLOPs(G)	IT (ms)
Observer	$32.71$	$307.02$	$31.79$
Conceiver	$11.31$	$271.29$	$4.07$
Predictor	$0.04502$	$0.00303$	$3.72$

Table 8: Model complexity of different components in our PCBM. ’Params’ means the model parameters. ’FLOPS’ means the floating point operations per second. ’IT’ means the inference time of the model for each image.

The full PCBM took approximately 5 hours to be trained on a single Quadro RTX 6000 GPU for a split. Although including a segmentation network, our model can still infer an image within 39 ms.

Appendix D Role of the segmentation bottleneck

Fig. 8 proves the importance of the segmentation bottleneck in our model. On the top row, the image is predicted to be an abdomen standard plane since the umbilical vein has a quality grade of 0.53. However, when we inpainted the umbilical vein from the image on the middle and bottom row. Our PCBM (bottom) updates the quality grade to 0.02, which means that the umbilical vein is invisible in the image, while CBM still predicts the quality to be 0.48, which is not correct. This incorrect concept prediction then leads to the incorrect plane classification result.

Appendix E Examples of successful intervention

Human intervention at the property concept level plays an important role in CBM. Fig. 9 shows an example of successful human-model communication in our PCBM. Different from CBM, our model allows intervention at both visual and property levels.

On the top row of the figure, our PCBM made a wrong prediction, since the observer fails to recognize the fossa posterior (the green area on the middle-row segmentation). Human experts corrected the visual explanation by changing the intermediate segmentation output on the middle row, which also altered the predicted fossa posterior quality concept from 0.02 to 0.92, and affected the classification result. On the bottom row of the figure, we show the property-level intervention, the same as that in CBM, which ignores the segmentation bottleneck and directly communicates with the property bottleneck. This intervention also corrects the prediction successfully.

Appendix F Faithfulness of concepts

We propose a concept interaction module in this paper, which selects and combines the interaction of concept pairs linearly into a new concept. The new concept is concatenated with the expert-given concepts and fed into the predictor. The ablation study in the main paper demonstrates that the proposed concept interaction module contributes to the model performance.

To further evaluate the effectiveness of the concept interaction module, we assess the faithfulness of the concepts, following a similar test from [41]. Faithfulness is measured with predictive performance solely on concepts, which represents the information conveyed by the concepts. In this test, we compared the faithfulness of the concepts with and without the new concept constructed by concept interaction to quantify the extra information introduced by this concept. For CBM, we trained an MLP on $c$ . For our model, we trained an MLP on the concatenation of $c$ and the learned $\bar{c}$ from the full PCBM we trained before. Note that $\bar{c}$ was fixed when we were training the MLP, as it was obtained from previous experiments. This is different from that during training the PCBM. Both the MLPs have one latent layer with 1024 neurons. Tab. 9 illustrates the model performance.

Method	OA(%)	MA(%)	MCC(%)
CBM	$86.72\pm 0.46$	$59.16\pm 1.82$	$83.40\pm 0.58$
PCBM (ours)	$\textbf{88.28}\pm 0.87$	$\textbf{68.57}\pm 1.76$	$\textbf{85.41}\pm 1.05$

Table 9: Comparison of faithfulness of the concepts used in CBM and our PCBM on the ultrasound dataset.

According to the table, although we only introduced one additional concept, all the metrics, including OA, MA and MCC have a significant improvement. This proves the effectiveness of the proposed concept interaction.

Appendix G Rule-based intervention based on segmentation

In Tab. 4 in the main paper, when we inpainted specific organs in the image, we checked the predicted quality concepts of CBM and PCBM. We see that our model has learned the ”real” concepts of the thalamus, cavum septi pellucidi, and umbilical vein while CBM did not. This experiment demonstrates that our model can learn meaningful property concepts from the segmentation, as the segmentation contains more key features than the raw image. However, for the stomach bubble, both CBM and PCBM were biased, even though the bias was reduced for PCBM.

We would like to introduce another use of our segmentation concepts in this case. Although some of the predicted property concepts, e.g., the predicted stomach bubble quality, can still be biased, the rule-based correction based on the segmentation can alleviate this problem. In the fetal ultrasound quality assessment task, we can still correct it automatically by involving the segmentation concepts - that is, the user can define a rule in the model, which set the organ quality to 0 when the organ does not appear in the segmentation prediction. Note that this is a potential solution for the biased concept prediction made by our PCBM designed for the specific dataset and it is not implemented in this paper to keep the generalization of our model for different scenarios.

Appendix H Segmentation and property concepts

Based on Tab. 7, we consider 14 segmentation concepts and 27 corresponding property concepts in this paper. The segmentation concepts for each anatomy are:

•

Background (all anatomies)
•

Cervical Canal Outline (Cervix)
•

Cervix Outer Boundary (Cervix)
•

Cervix Inner Boundary (Cervix)
•

Bladder (Cervix)
•

Femur Bone (Femur)
•

Stomach Bubble (Abdomen)
•

Outer Skin Boundary (Abdomen)
•

Umbilical Vein (Abdomen), note that when umbilical vein is of low quality, it is called portal sinus
•

Kidney (Abdomen)
•

Thalamus (Head)
•

Fossa Posterior (Head), also called cerebellum
•

Cavum Septi Pellucidi (Head)
•

Outer Bone Boundary (Head)

”Background” is considered a concept in our model, since it plays an important role in deciding whether a plane is symmetric or not.

The property concepts are:

•

Is femur bone left end visible? (scalar, 0 - 10)
•

Is femur bone right end visible? (scalar, 0 - 10)
•

Is femur bone $<45\degree$ to horizontal? (binary)
•

Is femur bone occupying more than half of the image? (binary)
•

Does the plane fits the criteria of symmetry abdomen? (binary)
•

Is stomach visible? (scalar, 0 - 10)
•

Is umbilical vein visible? (scalar, 0 - 10)
•

Is kidney visible? (scalar, 0 - 10)
•

Is abdomen occupying more than half of the image? (binary)
•

Can we place the caliper ada1 (a caliper)? (scalar, 0 - 10)
•

Can we place the caliper ada2? (scalar, 0 - 10)
•

Can we place the caliper adb1? (scalar, 0 - 10)
•

Can we place the caliper adb2? (scalar, 0 - 10)
•

Does the plane fits the criteria of symmetry head? (binary)
•

Is thalamus visible? (scalar, 0 - 10)
•

Is cavum septi pellucidi visible? (scalar, 0 - 10)
•

Is fossa posterior visible? (scalar, 0 - 10)
•

Is head occupying more than half of the image? (binary)
•

Can we place the caliper bpd_near? (scalar, 0 - 10)
•

Can we place the caliper bpd_far? (scalar, 0 - 10)
•

Can we place the caliper ofd_occ? (scalar, 0 - 10)
•

Can we place the caliper ofd_fro? (scalar, 0 - 10)
•

Is cervix occupying $50-75\%$ of the image? (binary)
•

Is bladder visible? (scalar, 0 - 10)
•

Does the plane fits the criteria of symmetry cervix? (binary)
•

Can we place the caliper orif_inner? (scalar, 0 - 10)
•

Can we place the caliper orif_ext? (scalar, 0 - 10)

References

[1] Abien Fred Agarap. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018.
[2] David Alvarez Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining neural networks. Advances in neural information processing systems, 31, 2018.
[3] Mohamed Amgad, Roberto Salgado, and Lee AD Cooper. Mutils: Explainable, multiresolution computational scoring of tumor-infiltrating lymphocytes in breast carcinomas using clinical guidelines. medRxiv, 2022.
[4] E James Archer. The psychological nature of concepts. In Analyses of concept learning, pages 37–49. Elsevier, 1966.
[5] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82–115, 2020.
[6] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6541–6549, 2017.
[7] Christian F Baumgartner, Konstantinos Kamnitsas, Jacqueline Matthew, Tara P Fletcher, Sandra Smith, Lisa M Koch, Bernhard Kainz, and Daniel Rueckert. Sononet: real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE transactions on medical imaging, 36(11):2204–2215, 2017.
[8] Jakob Božič, Domen Tabernik, and Danijel Skočaj. End-to-end training of a two-stage neural network for defect detection. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 5619–5626. IEEE, 2021.
[9] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. This looks like that: deep learning for interpretable image recognition. Advances in neural information processing systems, 32, 2019.
[10] Zhi Chen, Yijie Bei, and Cynthia Rudin. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2(12):772–782, 2020.
[11] Davide Chicco and Giuseppe Jurman. The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC genomics, 21(1):1–13, 2020.
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
[13] Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, and Hironobu Fujiyoshi. Attention branch network: Learning of attention mechanism for visual explanation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10705–10714, 2019.
[14] Wei Gao, Fang Wan, Xingjia Pan, Zhiliang Peng, Qi Tian, Zhenjun Han, Bolei Zhou, and Qixiang Ye. Ts-cam: Token semantic coupled attention map for weakly supervised object localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2886–2895, 2021.
[15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[17] P Hitzler and MK Sarker. Human-centered concept explanations for neural networks. Neuro-Symbolic Artificial Intelligence: The State of the Art, 342(337):2, 2022.
[18] Yunqing Hu, Xuan Jin, Yin Zhang, Haiwen Hong, Jingfeng Zhang, Yuan He, and Hui Xue. Rams-trans: Recurrent attention multi-scale transformer for fine-grained image recognition. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4239–4248, 2021.
[19] Zixuan Huang and Yin Li. Interpretable and accurate fine-grained recognition via region grouping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8662–8672, 2020.
[20] Dominik Jakubowski, Daria Salloum, Andrzej Torbe, Sebastian Kwiatkowski, and Magdalena Bednarek-Jedrzejek. The crown-rump length measurement—isuog criteria and clinical practice. Ginekologia Polska, 91(11):674–678, 2020.
[21] KO Kagan and J Sonek. How to measure cervical length. Ultrasound in Obstetrics & Gynecology, 45(3):358–362, 2015.
[22] Dmitry Kazhdan, Botty Dimanov, Mateja Jamnik, Pietro Liò, and Adrian Weller. Now you see me (cme): concept-based model extraction. Third Workshop on Advances in Interpretable Machine Learning and Artificial Intelligence (AIMLAI’20), 2020.
[23] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668–2677. PMLR, 2018.
[24] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International Conference on Machine Learning, pages 5338–5348. PMLR, 2020.
[25] Jack Lanchantin, Tianlu Wang, Vicente Ordonez, and Yanjun Qi. General multi-label image classification with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16478–16488, 2021.
[26] Manxi Lin, Zahra Bashir, Martin Grønnebæk Tolsgaard, Anders Nymark Christensen, and Aasa Feragen. Dtu-net: Learning topological similarity for curvilinear structure segmentation. arXiv preprint arXiv:2205.11115, 2022.
[27] Denise Elfriede Liesa Lockhofen and Christoph Mulert. Neurochemistry of visual attention. Frontiers in Neuroscience, 15:643597, 2021.
[28] Alejandro López-Cifuentes, Marcos Escudero-Viñolo, Jesús Bescós, and Álvaro García-Martín. Semantic-aware scene recognition. Pattern Recognition, 102:107256, 2020.
[29] Max Losch, Mario Fritz, and Bernt Schiele. Semantic bottlenecks: Quantifying and improving inspectability of deep representations. International Journal of Computer Vision, 129(11):3136–3153, 2021.
[30] Anita Mahinpei, Justin Clark, Isaac Lage, Finale Doshi-Velez, and Weiwei Pan. Promises and pitfalls of black-box concept learning models. In Proceedings at the International Conference on Machine Learning: Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI, 2021.
[31] Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended? In Proceedings of The Ninth International Conference on Learning Representations Workshop on Responsible AI, 2021.
[32] Christoph Molnar. Interpretable Machine Learning. BOOKDOWN, 2 edition, 2022.
[33] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization: How neural networks build up their understanding of images. distill, 2017.
[34] Michael Pazzani, Severine Soltani, Robert Kaufman, Samson Qian, and Albert Hsiao. Expert-informed, user-centric explanations for machine learning. Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2022.
[35] Federico Pittino, Vesna Dimitrievska, and Rudolf Heer. Hierarchical concept bottleneck models for explainable images segmentation, objects fine classification and tracking. Objects Fine Classification and Tracking, 2021.
[36] Domen Racki, Dejan Tomazevic, and Danijel Skocaj. A compact convolutional neural network for textured surface anomaly detection. In 2018 IEEE winter conference on applications of computer vision (WACV), pages 1331–1339. IEEE, 2018.
[37] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10428–10436, 2020.
[38] Harish Guruprasad Ramaswamy et al. Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 983–991, 2020.
[39] Yao Rong, Wenjia Xu, Zeynep Akata, and Enkelejda Kasneci. Human attention in fine-grained classification. In Proceedings of the 32nd British Machine Vision Conference (BMVC), 2021.
[40] LJ Salomon, Z Alfirevic, F Da Silva Costa, RL Deter, F Figueras, Tet al Ghi, P Glanc, A Khalil, W Lee, R Napolitano, et al. Isuog practice guidelines: ultrasound assessment of fetal biometry and growth. Ultrasound in obstetrics & gynecology, 53(6):715–723, 2019.
[41] Anirban Sarkar, Deepak Vijaykeerthy, Anindya Sarkar, and Vineeth N Balasubramanian. A framework for learning ante-hoc explainable models via concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10286–10295, 2022.
[42] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[43] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pages 9269–9278. PMLR, 2020.
[44] Alexandru Telea. An image inpainting technique based on the fast marching method. Journal of graphics tools, 9(1):23–34, 2004.
[45] Andong Wang, Wei-Ning Lee, and Xiaojuan Qi. Hint: Hierarchical neuron concept explainer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10254–10264, 2022.
[46] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 24–25, 2020.
[47] Jun Wang, Xiaohan Yu, and Yongsheng Gao. Feature fusion vision transformer for fine-grained visual categorization. In Proceedings of the 32nd British Machine Vision Conference (BMVC), 2021.
[48] Xiu-Shen Wei, Chen-Wei Xie, Jianxin Wu, and Chunhua Shen. Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
[49] Lingyun Wu, Jie-Zhi Cheng, Shengli Li, Baiying Lei, Tianfu Wang, and Dong Ni. Fuiqa: Fetal ultrasound image quality assessment with deep convolutional networks. IEEE Transactions on Cybernetics, 47(5):1336–1349, 2017.
[50] Fan Xie, Martin Müller, and Robert Holte. Jasper: the art of exploration in greedy best first search. The Eighth International Planning Competition (IPC-2014), pages 39–42, 2014.
[51] Liang Xu, Shuai Lv, Yong Deng, and Xiuxi Li. A weakly supervised surface defect detection based on convolutional neural network. IEEE Access, 8:42285–42296, 2020.
[52] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems, 33:20554–20565, 2020.
[53] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. Deep Learning Workshop, 31 st International Conference on Machine Learning, 2015.
[54] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
[55] Man Zhang, Yong Zhou, Jiaqi Zhao, Yiyun Man, Bing Liu, and Rui Yao. A survey of semi-and weakly supervised semantic segmentation of images. Artificial Intelligence Review, 53(6):4259–4288, 2020.
[56] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8827–8836, 2018.
[57] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition, 2016.