
SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition

Liangzhi Li1, Bowen Wang2, Manisha Verma1, Yuta Nakashima1,
Ryo Kawasaki3, Hajime Nagahara1
Osaka University, Japan
1{li, mverma, n-yuta, nagahara}@ids.osaka-u.ac.jp
2[email protected]
3[email protected]
Abstract

Explainable artificial intelligence has been gaining attention in the past few years. However, most existing methods are based on gradients or intermediate features, which are not directly involved in the decision-making process of the classifier. In this paper, we propose a slot attention-based classifier called SCOUTER for transparent yet accurate classification. Two major differences from other attention-based methods include: (a) SCOUTER’s explanation is involved in the final confidence for each category, offering more intuitive interpretation, and (b) all the categories have their corresponding positive or negative explanation, which tells “why the image is of a certain category” or “why the image is not of a certain category.” We design a new loss tailored for SCOUTER that controls the model’s behavior to switch between positive and negative explanations, as well as the size of explanatory regions. Experimental results show that SCOUTER can give better visual explanations in terms of various metrics while keeping good accuracy on small and medium-sized datasets. Code is available at https://github.com/wbw520/scouter.

1 Introduction

It is of great significance to know how deep models make predictions, especially in fields like medical diagnosis, where adopting black-box models carries potential risks. Explainable artificial intelligence (XAI), which offers a close look into a model's inference process, has therefore gained much attention.

The most popular paradigm in XAI is attributive explanation, which gives the contribution of pixels or regions to the final prediction [31, 7, 28, 30]. One natural question that arises here is how these regions contribute to the decision. For a better view of this, let $g_{l}(v)=w_{l}^{\top}v+b_{l}$ denote a fully-connected (FC) classifier for category $l$, where $w_{l}$ and $b_{l}$ are a trainable vector and scalar, respectively. Training this classifier may be interpreted as a process to find, from the training samples, a combination of discriminative patterns $s_{li}$ with corresponding weights $\gamma_{i}$, i.e.,

$g_{l}(v)=\left(\sum_{i}\gamma_{i}s_{li}^{\top}\right)v+b_{l}.$ (1)

In general, these patterns can include positive and negative ones. Given $v$ of an image of $l$, a positive pattern gives $s_{li}^{\top}v>0$. A negative pattern, in contrast, gives $s_{li}^{\top}v<0$ for $v$ of any category other than $l$, which means that the presence of the pattern described by $s_{li}$ is a support for not being category $l$. Therefore, the set $\mathcal{S}_{l}$ of all (linearly independent) patterns for $l$ can be the union of the sets $\mathcal{S}^{+}_{l}$ and $\mathcal{S}^{-}_{l}$ of all positive and negative patterns.

Differentiation of positive/negative patterns gives useful information on the decision. Figure LABEL:fig_story (top) shows an MNIST image as an example. One of the positive patterns that makes the image a 7 can be the acute angle formed by white line segments around the top-right corner, as in the second image. Meanwhile, the sixth image shows that the presence of the horizontal line is a support for not being 1. A practical application [2] in medical image analysis also points out the importance of visualizing positive/negative patterns. Despite this obvious benefit, recent mainstream methods like [49, 31, 27, 10] have not extensively studied this differentiation.

Positive and negative patterns lead to two interesting questions: (i) Can we provide a positive explanation and a negative explanation that visually show the support regions in the image corresponding to positive and negative patterns? (ii) Since the combination of patterns to be learned is rather arbitrary, and any combination is possible as long as it is discriminative, can we express a preference on the combination so as to leverage prior knowledge of the task during training?

In this paper, we re-formulate explainable AI with an explainable classifier, coined SCOUTER (Slot-based COnfigUrable and Transparent classifiER), which tries to find either positive or negative patterns in images. This approach is closer to attention-based approaches (e.g., [21, 40]) than to post-hoc approaches [49, 31, 30]. Our newly proposed explainable slot attention (xSlot) module is the main building block of SCOUTER. This module is built on top of the recently-emerged slot attention [23], which offers an object-centric approach to image representation. The xSlot module identifies the spatial support of either positive or negative patterns for each category in the image, which is directly used as the confidence value of that category; the commonly-used FC classifier is no longer necessary. The xSlot module can also be used to visualize the support, as shown in Fig. LABEL:fig_story. SCOUTER is also characterized by its configurability over the patterns to be learned, i.e., the choice of positive or negative patterns and the desirable size of the patterns, which can incorporate prior knowledge of the task. The controllable size of explanations can be beneficial for some applications, e.g., disease diagnosis in medicine, defect recognition in manufacturing, etc.

Contribution

Our transparent classifier, SCOUTER, explicitly models positive and negative patterns with a dedicated loss, allowing one to set a preference over the spatial size of the patterns to be learned. We experimentally show that SCOUTER successfully learns both positive and negative patterns and visualizes their supports in the given image as the explanation, achieving state-of-the-art results on several commonly-used metrics such as IAUC/DAUC [27]. Our case study in medicine also highlights the importance of both types of explanations, as well as of controlling the area size of explanatory regions.

2 Related Work

2.1 Explainable AI

There are mainly three XAI paradigms [43], i.e., post-hoc, intrinsic, and distillation. The post-hoc paradigm usually provides a heat map highlighting the regions important for the decision (e.g., [31, 30]); the heat map is computed outside the forward path of the model. The intrinsic paradigm extracts the important pieces of information within the forward path of the model, e.g., as attention maps (e.g., [21, 40, 24, 42]). Distillation methods are built upon model distillation [16]; the basic idea is to use an inherently transparent model to mimic the behavior of a black-box model (e.g., [48, 29]).

The post-hoc paradigm has been the most extensively studied among them. The most popular type of method is based on channel activations or back-propagation, including CAM [49], GradCAM [31], DeepLIFT [32], and their extensions [3, 25, 36, 35, 7]. Another type is perturbation-based, including Occlusion [45], RISE [27], meaningful perturbations [11], real-time saliency [5], extremal perturbations [10], I-GOS [28], IBA [30], etc. These methods basically give attributive explanations, which visualize the support regions of learned patterns for each category $l$ in the set $\mathcal{L}$ of all possible categories. This visualization can be done by finding regions in the feature maps or the input image that have a large impact on the score $g_{l}$. By definition, attributive explanation is the same as our positive explanation.

Some of the methods for attributive explanation can thus be extended to provide negative explanations by negating the sign of the score, feature maps, or gradients. It should be noted that interpreting the visual explanations of gradient-based methods [31, 3, 25] may not be straightforward because of the linearization of $g_{l}$ for the given image, and thus the resulting visualization may not highlight the support regions for negative patterns. GradCAM [31] refers to its negative variant as counterfactual explanation, which gives regions that can change the decision, emphasizing how it should be interpreted.

Discriminant explanation is a new type of XAI in the post-hoc paradigm, which appeared in [38] to show “why image $x$ belongs to category $l$ rather than $l^{\prime}$.” This can be interpreted using the set $\mathcal{S}_{l}^{+}$ of all possible positive patterns for $l$ and the set $\mathcal{S}_{l^{\prime}}^{-}$ of all possible negative patterns for $l^{\prime}$: it tries to spot a (combination of) discriminative pattern $s$ that lies in the intersection $\mathcal{S}_{l}^{+}\cap\mathcal{S}_{l^{\prime}}^{-}$. Due to the unavailability of negative patterns, the method in [38] first finds (a combination of) positive patterns and uses the complement of the region containing the positive patterns as a proxy for negative patterns. Goyal et al. [13] give another line of counterfactual explanation. Given two images of categories $l$ and $l^{\prime}$, they find the region in the image of $l$ whose replacement with a certain region in the image of $l^{\prime}$ changes the prediction from $l$ to $l^{\prime}$. This can also be implemented using discriminant explanation.

SCOUTER computes a heat map to spot regions important for the decision in the forward path, so it falls into the intrinsic paradigm. Together with the dedicated loss, it can directly identify positive and negative patterns with control over the size of patterns.

2.2 Self-attention in Computer Vision

Self-attention was first introduced in the Transformer [34], in which self-attention layers scan through the input elements one by one and update them using an aggregation over the whole input. Initially, self-attention was used in place of recurrent neural networks for sequential data, e.g., in natural language processing [8]. Recently, self-attention has been adopted in computer vision, e.g., Image Transformer [26], Axial-DeepLab [37], DEtection TRansformer (DETR) [1], Image Generative Pre-trained Transformer (Image GPT) [4], etc. Slot attention [23] is also based on this mechanism and extracts object-centric features from images (there are some other works [15, 12] using the concept of slots); however, the original slot attention is tested only on synthetic image datasets. SCOUTER is based on slot attention but is designed to be an explainable classifier applicable to natural images.

Figure 1: Classification pipeline. (a) Overview of the classification model. (b) The xSlot Attention module in SCOUTER.

3 SCOUTER

Given an image $x$, the objective of a classification model is to find its most probable category $l$ in the category set $\mathcal{L}=\{l_{1},l_{2},\dots,l_{n}\}$. This can be done by first extracting features $F=B(x)\in\mathbb{R}^{c\times h\times w}$ using a backbone network $B$. $F$ is then mapped into a score vector $o\in\mathbb{R}^{n}$, representing the confidence values, using FC layers and softmax as the classifier. However, such FC classifiers can learn an arbitrary (nonlinear) transformation and thus can be black-box.

We replace such an FC classifier with our SCOUTER (Fig. 1(a)), consisting of the xSlot attention module, which produces the confidence for each category given the features $F$. The whole network, including the backbone, is trained with the SCOUTER loss, which provides control over the size of support regions and switching between positive and negative explanations.
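As a rough sketch of this pipeline (the class and attribute names below are ours, not the released code), the only change from a standard classifier is that the backbone's feature maps go to the xSlot head instead of pooling plus an FC layer:

```python
import torch
import torch.nn as nn

class ScouterClassifier(nn.Module):
    """Backbone B followed by the xSlot attention head (hypothetical names)."""
    def __init__(self, backbone: nn.Module, xslot_head: nn.Module):
        super().__init__()
        self.backbone = backbone   # B(x) -> F of shape (batch, c, h, w)
        self.xslot = xslot_head    # xSlot(F) -> one confidence per category

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)   # F in R^{c x h x w}
        return self.xslot(feats)   # (batch, n); no FC classifier on top
```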

3.1 xSlot Attention

In the original slot attention mechanism [23], a slot is a representation of a local region aggregated based on the attention over the feature maps. A single slot attention module with multiple slots is attached on top of the backbone network $B$, and each slot produces its own feature as output. This configuration is handy when there are multiple objects of interest. The same idea can be transferred to spotting the supports in the input image that lead to a certain decision.

The main building block of SCOUTER is the xSlot attention module, which is a variant of the slot attention module tailored for SCOUTER. Each slot of the xSlot attention module is associated with a category and gives the confidence that the input image falls into that category. With the slot attention mechanism, the slot for category $l$ is required to find the support $\mathcal{S}_{l}$ in the image that directly correlates with $l$.

Given feature $F$, the xSlot attention module updates slot $w^{(t)}_{l}$ for $T$ times, where $w^{(t)}_{l}$ represents the slot after $t$ updates and $l\in\mathcal{L}$ is the category associated with this slot. The slot is initialized with random weights, i.e.,

$w^{(0)}_{l}\sim\mathcal{N}(\mu,\operatorname{diag}(\sigma))\in\mathbb{R}^{1\times c^{\prime}},$ (2)

where $\mu$ and $\sigma$ are the mean and variance of a Gaussian, and $c^{\prime}$ is the size of the weight vector. We denote the slots for all categories by $W^{(t)}\in\mathbb{R}^{n\times c^{\prime}}$.

The slot $W^{(t+1)}$ is updated using $W^{(t)}$ and the feature $F$. First, $F$ goes through a $1\times 1$ convolutional layer, to reduce the number of channels, and the ReLU nonlinearity, giving $F^{\prime}=\operatorname{ReLU}(\operatorname{Conv}(F))\in\mathbb{R}_{+}^{c^{\prime}\times d}$, with $F$'s spatial dimensions flattened ($d=hw$). $F^{\prime}$ is augmented by adding a position embedding to take the spatial information into account, following [34, 1], i.e., $\tilde{F}=F^{\prime}+\operatorname{PE}$, where $\operatorname{PE}$ is the position embedding. We then use two multilayer perceptrons (MLPs) $Q$ and $K$, each of which has three FC layers with the ReLU nonlinearity between them. This design gives more flexibility in the computation of the query and key of the self-attention mechanism. Using

$Q(W^{(t)})\in\mathbb{R}^{n\times c^{\prime}},\quad K(\tilde{F})\in\mathbb{R}^{c^{\prime}\times d},$ (3)

we obtain the dot-product attention $A^{(t)}$ using the sigmoid $\sigma$ as

$A^{(t)}=\sigma(Q(W^{(t)})K(\tilde{F}))\in(0,1)^{n\times d}.$ (4)

The attention is used to compute the weighted sum of features in the spatial dimensions by

$U^{(t)}=A^{(t)}{F^{\prime}}^{\top}\in\mathbb{R}^{n\times c^{\prime}},$ (5)

and the slot $W^{(t)}$ is eventually updated through a gated recurrent unit (GRU) as

$W^{(t+1)}=\operatorname{GRU}(U^{(t)},W^{(t)}),$ (6)

taking $U^{(t)}$ and $W^{(t)}$ as the input and the hidden state, respectively. Following the original slot attention module, we update the slot for $T=3$ times.

The output of the xSlot attention module is the sum of all elements for category $l$ in $U^{(T)}$, which is a function of $F$. Formally, the output of the xSlot attention module is:

$\operatorname{xSlot}(F)=U^{(T)}\mathbf{1}_{c^{\prime}}\in\mathbb{R}_{+}^{n},$ (7)

where $\mathbf{1}_{c^{\prime}}$ is the column vector with all $c^{\prime}$ elements being 1.

From Eqs. (5) and (7), we have $\operatorname{xSlot}(F)=A^{(T)}F^{\prime\top}\mathbf{1}_{c^{\prime}}$, where $F^{\prime\top}\mathbf{1}_{c^{\prime}}\in\mathbb{R}^{d}$ is a reduction of $F$ and is a class-agnostic map. The $l$-th row of $A^{(T)}$ can then be viewed as spatial weights over the map $F^{\prime\top}\mathbf{1}_{c^{\prime}}$ that spot where the support regions for category $l$ are ($F^{\prime\top}\mathbf{1}_{c^{\prime}}$ can be viewed as a single map that includes a mixture of the supports for all categories). To visualize the support regions, we reshape and resize each row of $A^{(T)}$ to the input image size.

Note that in the original slot attention module, a linear transformation $V(\tilde{F})$ is applied to the features, which are then weighted using Eq. (5). The xSlot attention module omits this transformation, as it already has a sufficient number of learnable parameters, and thus flexibility, in $Q$, $K$, the GRU, etc. Also, the confidences given by Eq. (7) would typically be computed by an FC layer, whereas SCOUTER just sums up the output of the xSlot attention module, which is actually the presence of the learned supports for each category. This simplicity is essential for a transparent classifier, as discussed in Section 3.3.
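Putting Eqs. (2)–(7) and (10) together, a minimal PyTorch sketch of the xSlot attention module might look as follows. The names are ours, and several details (how the Gaussian slot parameters are learned, how the position embedding is built and passed in) are assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XSlotAttention(nn.Module):
    """Sketch of the xSlot attention head (Eqs. 2-7, 10); hyper-parameters follow
    the paper where stated (c' = 64, T = 3), everything else is an assumption."""
    def __init__(self, n_classes, in_channels, slot_dim=64, n_iters=3, e=+1):
        super().__init__()
        self.n, self.c, self.T, self.e = n_classes, slot_dim, n_iters, e
        self.reduce = nn.Conv2d(in_channels, slot_dim, kernel_size=1)     # 1x1 conv
        # learnable Gaussian parameters for slot initialization (Eq. 2)
        self.mu = nn.Parameter(torch.zeros(1, n_classes, slot_dim))
        self.log_sigma = nn.Parameter(torch.zeros(1, n_classes, slot_dim))
        def mlp():  # three FC layers with ReLU in between, used for Q and K
            return nn.Sequential(nn.Linear(slot_dim, slot_dim), nn.ReLU(),
                                 nn.Linear(slot_dim, slot_dim), nn.ReLU(),
                                 nn.Linear(slot_dim, slot_dim))
        self.q, self.k = mlp(), mlp()
        self.gru = nn.GRUCell(slot_dim, slot_dim)

    def forward(self, feat, pos_emb):
        # feat: (b, c, h, w); pos_emb: position embedding of shape (c', h*w)
        b = feat.size(0)
        fp = F.relu(self.reduce(feat)).flatten(2)            # F': (b, c', d), d = h*w
        ft = fp + pos_emb                                    # F~ = F' + PE
        w = self.mu + self.log_sigma.exp() * torch.randn(
            b, self.n, self.c, device=feat.device)           # W^(0) ~ N(mu, diag(sigma))
        key = self.k(ft.transpose(1, 2)).transpose(1, 2)     # K(F~): (b, c', d)
        for _ in range(self.T):                              # Eqs. (4)-(6), T = 3
            attn = torch.sigmoid(self.q(w) @ key)            # A^(t) in (0,1)^{n x d}
            u = attn @ fp.transpose(1, 2)                    # U^(t) = A^(t) F'^T
            w = self.gru(u.reshape(-1, self.c),
                         w.reshape(-1, self.c)).view(b, self.n, self.c)
        attn = torch.sigmoid(self.q(w) @ key)                # A^(T) from the final slots
        u = attn @ fp.transpose(1, 2)                        # U^(T)
        self.last_attn = attn                                # kept for visualization / area loss
        return self.e * u.sum(-1)                            # o = e * U^(T) 1_{c'} (Eqs. 7, 10)
```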

3.2 SCOUTER Loss

The whole model, including the backbone network, can be trained by simply applying softmax to $\operatorname{xSlot}(F)$ and minimizing the cross-entropy loss $\ell_{\operatorname{CE}}$. However, in some cases the model prefers attending to a broad area (e.g., a slot covers a combination of several supports that occupy large areas in the image), depending on the content of the image. As argued in Section 1, it can be beneficial to have control over the area of support regions so as to constrain the coverage of a single slot.

Therefore, we design the SCOUTER loss to limit the area of support regions. The SCOUTER loss is defined by

$\ell_{\operatorname{SCOUTER}}=\ell_{\operatorname{CE}}+\lambda\ell_{\operatorname{Area}},$ (8)

where $\ell_{\operatorname{Area}}$ is the area loss and $\lambda$ is a hyper-parameter that adjusts the importance of the area loss. The area loss is defined by

$\ell_{\operatorname{Area}}=\mathbf{1}_{n}^{\top}A^{(T)}\mathbf{1}_{d},$ (9)

which simply sums up all the elements in $A^{(T)}$. With a larger $\lambda$, SCOUTER attends to smaller regions by selecting fewer and smaller supports; with a smaller $\lambda$, it prefers larger areas.
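A minimal sketch of the SCOUTER loss (Eqs. 8–9); the function signature and the batch averaging of the area term are our assumptions, not the released code:

```python
import torch
import torch.nn.functional as F

def scouter_loss(logits, attn, targets, lam=10.0):
    """logits: (batch, n) output of xSlot; attn: (batch, n, d) final attention A^(T);
    lam is the hyper-parameter lambda of Eq. (8)."""
    ce = F.cross_entropy(logits, targets)     # softmax cross-entropy term l_CE
    area = attn.sum(dim=(1, 2)).mean()        # 1_n^T A^(T) 1_d, averaged over the batch
    return ce + lam * area                    # l_SCOUTER = l_CE + lambda * l_Area
```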

3.3 Switching Positive and Negative Explanation

The model with the SCOUTER loss in Eq. (8) can only provide positive explanations, since larger elements in $A^{(T)}$ mean that the prediction is made based on the corresponding features. We therefore introduce a hyper-parameter $e\in\{+1,-1\}$ in Eq. (7), i.e.,

$o=\operatorname{xSlot}_{e}(F)=e\cdot U^{(T)}\mathbf{1}_{c^{\prime}}\in\mathbb{R}^{n},$ (10)

where $o=\{o_{1},\dots,o_{n}\}$. This hyper-parameter configures the xSlot attention module to learn to find either positive or negative supports.

With the softmax cross-entropy loss, the model learns to give the largest confidence $o_{l}$ to the ground-truth (GT) category $l$ and smaller values $o_{l^{\prime}}$ to wrong categories $l^{\prime}\neq l$. For $e=+1$, all elements given by $\operatorname{xSlot}$ are non-negative, since both $A^{(T)}$ and $F^{\prime}$ are non-negative and thus so is $U^{(T)}$. For an arbitrary non-negative $F^{\prime}$, thanks to the simple reduction in Eq. (7), a larger $o_{l}$ can be produced only when some elements in $a^{(T)}_{l}$, the row vector in $A^{(T)}$ corresponding to $l$, are close to 1, whereas a smaller $o_{l^{\prime}}$ is given when all elements in $a^{(T)}_{l^{\prime}}$ are close to 0. Therefore, by setting $e$ to $+1$, the model learns to find the positive supports $\mathcal{S}^{+}_{l}$ in images of the GT category. The visualization of $a^{(T)}_{l}$ thus serves as a positive explanation, as in Fig. LABEL:fig_story (left).

On the contrary, for $e=-1$, all elements in $o$ are non-positive, and thus the prediction by Eq. (10) gives $o_{l}$ close to 0 for the correct category $l$ and smaller $o_{l^{\prime}}$ for non-GT categories $l^{\prime}$. To make $o_{l}$ close to 0, all elements in $a^{(T)}_{l}$ must be close to 0, and a smaller $o_{l^{\prime}}$ is given when $a^{(T)}_{l^{\prime}}$ has some elements close to 1. For this, the model learns to find the negative supports $\mathcal{S}^{-}$ that do not appear in the images of the GT category. As a result, $a^{(T)}_{l^{\prime}}$ can be used as a negative explanation, as shown in Fig. LABEL:fig_story (right).
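The switch between the two behaviours is a single sign flip. A toy example (the numbers are made up) shows why the same cross-entropy objective pushes the attention in opposite directions:

```python
import torch

u_sum = torch.tensor([[0.9, 0.1, 0.0]])   # U^(T) 1_{c'} for three hypothetical categories
for e in (+1, -1):
    o = e * u_sum                          # Eq. (10)
    print(e, torch.softmax(o, dim=-1))
# e = +1: category 0 has the highest confidence, so its slot is rewarded for
#         attending to positive supports (attention close to 1 somewhere).
# e = -1: category 0 now has the lowest confidence; to win, the GT slot must
#         attend to almost nothing, while non-GT slots attend to negative supports.
```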

4 Experiments

4.1 Experimental Setup

We chose the ImageNet dataset [6] for a detailed evaluation of SCOUTER for three reasons: (i) it is commonly used in the evaluation of classification models; (ii) it has many classes with similar semantics and appearances, and the relationships among them are available in the WordNet synsets, which can be used to evaluate positive and negative explanations; and (iii) bounding boxes are available for foreground objects, which helps measure the accuracy of visual explanations. In the experiments, we use subsets of ImageNet obtained by extracting the first $n$ ($0<n\leq 1{,}000$) categories in ascending order of the category IDs. We also report classification performance on the Con-text [20] and CUB-200 [39] datasets and illustrate glaucoma diagnosis with quantitative and qualitative results on the ACRIMA [9] dataset.
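The subsets can be built by sorting the WordNet IDs; a hypothetical helper (assuming the standard ImageNet directory layout, not part of the released code) is sketched below:

```python
import os

def imagenet_subset(root: str, n: int) -> list[str]:
    """Pick the first n ImageNet categories in ascending order of their
    WordNet IDs (directory names such as 'n01440764'); hypothetical helper."""
    wnids = sorted(os.listdir(os.path.join(root, "train")))
    return wnids[:n]   # images of these classes form the training/validation subsets
```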

The size of the input images is $260\times 260$. The feature $F$ extracted by the backbone network is mapped into a new feature $F^{\prime}$ with the channel number $c^{\prime}=64$. The models were trained on the training set for 20 epochs, and the performance scores are computed on the validation set with the trained models after the last epoch. All the quantitative results are obtained by averaging the scores from three independent runs.

Table 1: Evaluation of the explanations. Positive explanations are from the GT class, while negative explanations are from the least similar class (LSC).
Methods Year Type Area Size Precision \uparrow IAUC \uparrow DAUC \downarrow Infidelity \downarrow Sensitivity \downarrow
Positive CAM [49] 2016 Back-Prop 0.0835 0.7751 0.7185 0.5014 0.1037 0.1123
GradCAM [31] 2017 Back-Prop 0.0838 0.7758 0.7187 0.5015 0.1038 0.0739
GradCAM++ [3] 2018 Back-Prop 0.0836 0.7861 0.7306 0.4779 0.1036 0.0807
S-GradCAM++ [25] 2019 Back-Prop 0.0868 0.7983 0.6991 0.4896 0.1548 0.0812
Score-CAM [36] 2020 Back-Prop 0.0818 0.7714 0.7213 0.5247 0.1035 0.0900
SS-CAM [35] 2020 Back-Prop 0.1062 0.7902 0.7143 0.4570 0.1109 0.1183
\quad\llcorner w/ threshold 2020 Back-Prop 0.0496 0.8243 0.6010 0.7781 0.1079 0.0790
RISE [27] 2018 Perturbation 0.3346 0.5566 0.6913 0.4903 0.1199 0.1548
Extremal Perturbation [10] 2019 Perturbation 0.1458 0.8944 0.7121 0.5213 0.1042 0.2097
I-GOS [28] 2020 Perturbation 0.0505 0.8471 0.6838 0.3019 0.1106 0.6099
IBA [30] 2020 Perturbation 0.0609 0.8019 0.6688 0.5044 0.1039 0.0894
SCOUTER+ ($\lambda=1$) Intrinsic 0.1561 0.8493 0.7512 0.1753 0.0799 0.0796
SCOUTER+ ($\lambda=3$) Intrinsic 0.0723 0.8488 0.7650 0.1423 0.0949 0.0608
SCOUTER+ ($\lambda=10$) Intrinsic 0.0476 0.9257 0.7647 0.2713 0.0840 0.1150
Negative CAM [49] 2016 Back-Prop 0.1876 0.3838 0.6069 0.6584 0.1070 0.0617
GradCAM [31] 2017 Back-Prop 0.0988 0.6543 0.6289 0.7281 0.1060 0.5493
GradCAM++ [3] 2018 Back-Prop 0.0879 0.6280 0.6163 0.6017 0.1047 0.3114
S-GradCAM++ [25] 2019 Back-Prop 0.1123 0.6477 0.6036 0.5430 0.1071 0.0590
RISE [27] 2018 Perturbation 0.4589 0.4490 0.4504 0.7078 0.1064 0.0607
Extremal Perturbation [10] 2019 Perturbation 0.1468 0.6390 0.2089 0.7626 0.1068 0.8733
SCOUTER- ($\lambda=1$) Intrinsic 0.0643 0.8238 0.7343 0.1969 0.0046 0.0567
SCOUTER- ($\lambda=3$) Intrinsic 0.0545 0.8937 0.6958 0.4286 0.0196 0.1497
SCOUTER- ($\lambda=10$) Intrinsic 0.0217 0.8101 0.6730 0.7333 0.0014 0.1895
Table 2: Area sizes of the explanations ($\lambda=10$).
Methods Target Classes
GT Highly-similar Similar Dissimilar
SCOUTER+ 0.0476 0.0259 0.0093 0.0039
SCOUTER- 0.0097 0.0141 0.0185 0.0204

4.2 Explainability

To evaluate the quality of the visual explanations, we use the bounding boxes provided in ImageNet as a proxy of the object regions and compute the fraction of the explanation located inside the bounding box over the whole explanation. Specifically, for the set $I$ of all pixels in the input image and the set $D$ of all pixels in the bounding box, we define the explanation precision as $\operatorname{Precision}_{l}=\frac{\sum_{i\in D}a^{l}_{i}}{\sum_{i\in I}a^{l}_{i}}$, where $l\in\mathcal{L}$ is the category and $a^{l}_{i}\in[0,1]$ is the value of pixel $i$ in $\bar{A}_{l}$, the attention map resized to the input image size by bilinear interpolation. We compute this metric on the visualization results of the GT category for positive explanations and on the least similar class (LSC) (using Eq. (11)) for negative explanations, as LSC images usually show strong and consistent negative explanations. This precision metric is actually a generalization of the pointing game [47], which counts one hit when the point with the largest value on the heat map falls inside the bounding box; the final score is calculated as $\frac{\#\text{Hits}}{\#\text{Hits}+\#\text{Misses}}$.
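A sketch of this precision metric (assuming a square feature map and a single bounding box per image; the function name and signature are ours):

```python
import torch
import torch.nn.functional as F

def explanation_precision(attn_row, bbox, image_hw):
    """attn_row: 1-D tensor of length d = h*w (one row of A^(T));
    bbox: (x0, y0, x1, y1) in input-image pixels; image_hw: (H, W)."""
    h = w = int(attn_row.numel() ** 0.5)                 # assume a square feature map
    a = attn_row.reshape(1, 1, h, w)
    a = F.interpolate(a, size=image_hw, mode="bilinear",
                      align_corners=False)[0, 0]         # bar{A}_l, resized to the input
    x0, y0, x1, y1 = bbox
    inside = a[y0:y1, x0:x1].sum()                       # attention mass inside D
    return (inside / a.sum()).item()                     # Precision_l
```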

We also adopt several other metrics: (i) insertion area under curve (IAUC) [27], which measures the accuracy gain of a model when pixels are gradually added, according to their importance given in the explanation (heat map), to a synthesized input image; (ii) deletion area under curve (DAUC) [27], which measures the performance drop when important pixels are gradually removed from the input image; (iii) infidelity [44], which measures the degree to which the explanation captures how the prediction changes in response to input perturbations; and (iv) sensitivity [44], which measures the degree to which the explanation itself is affected by input perturbations. In addition, we calculate (v) the overall size of the explanation area as $\operatorname{Area}_{l}=\sum_{i\in I}a^{l}_{i}$, since for some applications a smaller value is better for pinpointing the supports that differentiate one class from the others.
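As a rough sketch of how the deletion curve can be computed (the insertion variant is symmetric, starting from a degraded image and adding pixels back); the zero baseline, step count, and mean-based AUC approximation are simplifications of [27]:

```python
import torch

@torch.no_grad()
def deletion_auc(model, image, heatmap, target, steps=50):
    """image: (c, h, w); heatmap: (h, w) explanation; target: GT class index."""
    c, h, w = image.shape
    order = heatmap.flatten().argsort(descending=True)   # most important pixels first
    x = image.clone().unsqueeze(0)                       # (1, c, h, w)
    per_step = order.numel() // steps
    probs = []
    for i in range(steps + 1):
        p = torch.softmax(model(x), dim=-1)[0, target].item()
        probs.append(p)                                  # probability after removing i chunks
        idx = order[i * per_step:(i + 1) * per_step]
        x.view(1, c, -1)[0, :, idx] = 0                  # delete the next chunk of pixels
    return sum(probs) / len(probs)                       # approximate area under the deletion curve
```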

We conduct the explainability experiments on the ImageNet subset with the first 100 classes. We train seven models with (1) an FC classifier, (2)–(4) SCOUTER+ ($\lambda=1,3,10$), and (5)–(7) SCOUTER- ($\lambda=1,3,10$), using ResNeSt 26 [46] as the backbone. The results of the competing methods are obtained from the FC classifier-based model. In addition, as introduced in Section 2.1, some of the existing works can give negative explanations; we therefore also implement their negative variants, by using negative feature maps/gradients or modifying their objective functions, and compare our results with them.

The numerical results are shown in Table 1. SCOUTER can generate explanations with different area sizes while achieving good scores on all metrics. These results demonstrate that the visualization by SCOUTER is preferable in terms of controllable area sizes, high precision, insensitivity to noise (sensitivity), and good explainability (infidelity, IAUC, and DAUC). Among the competing methods, extremal perturbation [10], I-GOS [28], and IBA [30] also take the size of support regions into account, and thus some of them give smaller explanatory regions. Extremal perturbation's explanatory regions cover some parts of the foreground objects, which leads to a high precision score, but its performance on the other metrics is not satisfactory. I-GOS and IBA give small explanation areas; however, I-GOS results in a low IAUC and a high sensitivity score, and IBA gets relatively low IAUC and DAUC scores, which means its explanations do not attend to the correct pixels and thus lack explainability.

One could argue that the area size can be controlled by thresholding the heat map. To verify this, we apply a threshold ($a\geq 0.2$) to one of the back-propagation-based methods (SS-CAM) to obtain explanations of smaller size. This variant suffers a deterioration in IAUC and DAUC (significantly worse than I-GOS, IBA, and SCOUTER), which represents a large drop in explainability and hampers its use in applications requiring small explanations.

To further explore the explanations for non-GT categories, we define the semantic similarity between categories based on [41], which uses WordNet, as

$\operatorname{Similarity}=2\,\frac{d(\operatorname{LCS}(l,l^{\prime}))}{d(l)+d(l^{\prime})},$ (11)

where $d(\cdot)$ gives the depth of a category in WordNet, and $\operatorname{LCS}(l,l^{\prime})$ is the least common subsumer of two arbitrary categories $l$ and $l^{\prime}$. We define highly-similar categories as category pairs with a similarity score of at least 0.9, similar categories as those with a score in $[0.7,0.9)$, and the remaining categories as dissimilar. Table 2 summarizes the area sizes of the explanatory regions for the GT, highly-similar, similar, and dissimilar categories. A clear trend emerges: SCOUTER+ decreases the area size as the inter-category similarity becomes lower, while SCOUTER- gives larger explanatory regions for dissimilar categories.
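Eq. (11) is the Wu-Palmer similarity, which NLTK exposes directly (assuming a recent NLTK with the WordNet corpus installed); ImageNet category IDs encode WordNet noun offsets:

```python
from nltk.corpus import wordnet as wn

def category_similarity(wnid_a: str, wnid_b: str) -> float:
    """Wu-Palmer similarity between two ImageNet categories given as WordNet IDs."""
    a = wn.synset_from_pos_and_offset("n", int(wnid_a[1:]))
    b = wn.synset_from_pos_and_offset("n", int(wnid_b[1:]))
    return a.wup_similarity(b)

# e.g., category_similarity("n02123045", "n02123159")  # tabby cat vs. tiger cat
```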

Figure 2: Visualized positive explanations using SCOUTER+ and existing methods. The numbers in the parentheses are the $\lambda$ values used in the SCOUTER training.
Figure 3: Visualized negative explanations using SCOUTER- and an existing method. The numbers in the parentheses are the $\lambda$ values used in the SCOUTER training.
Figure 4: Classification performance of different models with the FC classifier, SCOUTER+ ($\lambda=10$), and SCOUTER- ($\lambda=10$). The horizontal axis is the category number $n$ (in logarithmic scale), which is used to generate the training and test sets from the first $n$ categories of the ImageNet dataset; the vertical axis is the accuracy of the model.

Some visualization results are given in Figs. 2 and 3. SCOUTER gives reasonable and accurate explanations. Comparing SCOUTER+'s explanations with those of SS-CAM [35] and IBA [30], we find that SCOUTER+ gives explanations with more flexible shapes that fit the target objects better. For example, in the first row of Fig. 2, SCOUTER+ gives more accurate attention around the neck; in the second row, it accurately finds the individual entities. Compared with SS-CAM, IBA shows smaller explanatory regions; however, IBA is less precise and less reasonable, which is consistent with the numerical results in Table 1.

In Fig. 3, SCOUTER- can also find the negative supports, e.g., the wattle of the hen, and the hammerhead and the fin of the shark. In addition, although the negative variant of S-GradCAM++ performs well on the first row, its explanation in the second row does not fit the object's shape well and fails to pinpoint the key difference (the head).

4.3 Classification Performance

We compare SCOUTER and FC classifiers with several commonly used backbone networks with respect to classification accuracy. The results are summarized in Fig. 4. As the number of categories increases, both the FC classifier and SCOUTER show a performance drop, and they follow similar trends with respect to the category number.

The relationship between $\lambda$, which controls the size of explanatory regions, and the classification accuracy is shown in Fig. 5 for the ResNeSt 26 model with $n=100$. A clear pattern is that the area sizes of both SCOUTER+ and SCOUTER- drop quickly as $\lambda$ increases. However, there is no significant trend in classification accuracy, presumably because the cross-entropy loss term works well regardless of $\lambda$.

Also, according to the visualization results in Figs. 2 and 3, a larger $\lambda$ does not simply shrink the explanatory regions. Instead, SCOUTER shifts its focus from some larger supports to fewer, smaller yet still decisive supports. For example, in the first row of Fig. 3, when $\lambda$ is small, SCOUTER- can easily decide that the input image is not a cock because of the unique feathers on the neck. With a larger $\lambda$, SCOUTER finds a smaller combination of supports (i.e., the wattle), and thus the explanation changes from the (larger) neck to the (smaller) wattle region.

Figure 5: Relationships between $\lambda$ and the explanation area size, and between $\lambda$ and the classification accuracy, for the GT (SCOUTER+, left) and LSC (SCOUTER-, right) when $n=100$. The horizontal axis is in logarithmic scale.

We also summarize the classification performance of the FC classifier, SCOUTER+ ($\lambda=10$), and SCOUTER- ($\lambda=10$) on the ImageNet [6], Con-text [20], and CUB-200 [39] datasets in Table 3. The subsets with $n=100$ are adopted for ImageNet and CUB-200, while all 30 categories are used for Con-text. The results show that SCOUTER generalizes to different domains and performs comparably to the FC classifier on all datasets. Also, SCOUTER's number of parameters is comparable to the FC classifier's (more details in the supplementary material).

One drawback of SCOUTER is that its training is unstable when $n>100$, possibly because of the increasing difficulty of finding effective supports that consistently appear in all images of the same category but are not shared by other categories. This drawback limits the application of SCOUTER to small- or medium-sized datasets.

4.4 Case Study

SCOUTER uses the area loss, which constrains the size of the supports. This constraint can benefit some applications, including the classification of medical images, since small support regions can precisely show the symptoms and are more informative in some cases. Also, no previous method could give negative explanations, even though they are actually needed: doctors need reasons to rule out some diseases. SCOUTER was designed with these needs in mind and is being tested in hospitals for glaucoma (Fig. 6), artery hardening (supplementary material), etc.

For glaucoma diagnosis, we tested SCOUTER with $\lambda=10$ on a publicly available dataset, ACRIMA [9], which has two categories (normal and glaucoma). ResNeSt 26 is used as the backbone. The results are shown in Table 4. Both SCOUTER+ and SCOUTER- achieve better performance than the FC classifier. Besides, SCOUTER is preferred for this task because doctors are eager to know the precise regions in the optic disc that lead to the machine diagnosis. In the visualization results in Fig. 6, SCOUTER shows more precise and reasonable explanations, which locate on some vessels in the optic disc and have clinical meaning (vessel shape changes due to the enlarged optic cup), as verified by doctors. Although IBA also gives small regions, they cover some unrelated or uninformative locations. In addition, it is notable that the facts used to admit the category “Glaucoma” need not match the facts used to deny “Normal,” as they are only subsets of the support sets, and an on-purpose negative explanation is especially helpful for doctors when the machine decision goes against their expectations.

Table 3: Classification accuracy on various datasets.
Models ImageNet Con-text CUB-200
ResNeSt 26 (FC) 0.8080 0.6732 0.7538
ResNeSt 26 (SCOUTER+) 0.7991 0.6870 0.7362
ResNeSt 26 (SCOUTER-) 0.7946 0.6866 0.7490
ResNeSt 50 (FC) 0.8158 0.6918 0.7739
ResNeSt 50 (SCOUTER+) 0.8242 0.6943 0.7397
ResNeSt 50 (SCOUTER-) 0.8066 0.6922 0.7600
ResNeSt 101 (FC) 0.8255 0.7038 0.7804
ResNeSt 101 (SCOUTER+) 0.8251 0.7131 0.7428
ResNeSt 101 (SCOUTER-) 0.8267 0.7062 0.7643
Figure 6: Explanations for a positive sample in the glaucoma diagnosis dataset. The bottom row is from SCOUTER+/- for normal (N.) and glaucoma (G.).

5 Conclusion

In this paper, we propose an explainable classifier with two variants, SCOUTER+ and SCOUTER-, which respectively give a positive or negative explanation of the classification process. SCOUTER adopts an explainable variant of slot attention, namely xSlot attention, which is also based on self-attention. Moreover, a loss is designed to control the size of explanatory regions. Experimental results show that SCOUTER gives accurate explanations while keeping good classification performance.

Table 4: Classification Performance on ACRIMA Dataset [9].
Methods AUC Acc. Prec. Rec. F1 Kappa
FC 0.9997 0.9857 0.9915 0.9831 0.9872 0.9710
SCOUTER+ 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
SCOUTER- 0.9999 0.9952 1.0000 0.9915 0.9957 0.9903

6 Acknowledgments

This work was supported by Council for Science, Technology and Innovation (CSTI), cross-ministerial Strategic Innovation Promotion Program (SIP), “Innovative AI Hospital System” (Funding Agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)). This work was also supported by JSPS KAKENHI Grant Number 19K10662, 20K23343, 21K17764, MHLW AC Program Grant Number 20AC1006, and JST CREST Grant Number JPMJCR20D3, Japan.

References

  • [1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
  • [2] Jooyoung Chang, Jinho Lee, Ahnul Ha, Young Soo Han, Eunoo Bak, Seulggie Choi, Jae Moon Yun, Uk Kang, Il Hyung Shin, Joo Young Shin, et al. Explaining the rationale of deep learning glaucoma decisions with adversarial examples. Ophthalmology, 128(1):78–88, 2021.
  • [3] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In IEEE WACV, pages 839–847, 2018.
  • [4] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
  • [5] Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In NeurIPS, pages 6967–6976, 2017.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE CVPR, pages 248–255, 2009.
  • [7] Saurabh Desai and Harish G. Ramaswamy. Ablation-CAM: Visual explanations for deep convolutional network via gradient-free localization. In IEEE WACV, pages 972–980, 2020.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [9] Andres Diaz-Pinto, Sandra Morales, Valery Naranjo, Thomas Köhler, Jose M Mossi, and Amparo Navea. CNNs for automatic glaucoma assessment using fundus images: an extensive validation. Biomedical Engineering Online, 18(1):29, 2019.
  • [10] Ruth Fong, Mandela Patrick, and Andrea Vedaldi. Understanding deep networks via extremal perturbations and smooth masks. In IEEE ICCV, pages 2950–2958, 2019.
  • [11] Ruth C Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In IEEE ICCV, pages 3429–3437, 2017.
  • [12] Anirudh Goyal, Alex Lamb, Phanideep Gampa, Philippe Beaudoin, Sergey Levine, Charles Blundell, Yoshua Bengio, and Michael Mozer. Object files and schemata: Factorizing declarative and procedural knowledge in dynamical systems. arXiv preprint arXiv:2006.16225, 2020.
  • [13] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In ICML, 2019.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770–778, 2016.
  • [15] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state with recurrent entity networks. In ICLR, 2017.
  • [16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [17] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [18] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In IEEE CVPR, pages 7132–7141, 2018.
  • [19] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE CVPR, pages 4700–4708, 2017.
  • [20] Sezer Karaoglu, Ran Tao, Jan van Gemert, and Theo Gevers. Con-Text: Text detection for fine-grained object classification. IEEE TIP, 26(8):3965–3980, 2017.
  • [21] Jinkyu Kim and John Canny. Interpretable learning for self-driving cars by visualizing causal attention. In IEEE ICCV, pages 2942–2950, 2017.
  • [22] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [23] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. arXiv preprint arXiv:2006.15055, 2020.
  • [24] David Mascharka, Philip Tran, Ryan Soklaski, and Arjun Majumdar. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. In IEEE CVPR, pages 4942–4950, 2018.
  • [25] Daniel Omeiza, Skyler Speakman, Celia Cintas, and Komminist Weldermariam. Smooth Grad-CAM++: An enhanced inference level visualization technique for deep convolutional neural network models. arXiv preprint arXiv:1908.01224, 2019.
  • [26] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, 2018.
  • [27] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models. In BMVC, 2018.
  • [28] Zhongang Qi, Saeed Khorram, and Li Fuxin. Visualizing deep networks by optimizing with integrated gradients. In AAAI, volume 34, pages 11890–11898, 2020.
  • [29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you? Explaining the predictions of any classifier. In KDD, pages 1135–1144, 2016.
  • [30] Karl Schulz, Leon Sixt, Federico Tombari, and Tim Landgraf. Restricting the flow: Information bottlenecks for attribution. In ICLR, 2020.
  • [31] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE ICCV, pages 618–626, 2017.
  • [32] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, page 3145–3153, 2017.
  • [33] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
  • [34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [35] Haofan Wang, Rakshit Naidu, Joy Michael, and Soumya Snigdha Kundu. SS-CAM: Smoothed Score-CAM for sharper visual feature localization. arXiv preprint arXiv:2006.14255, 2020.
  • [36] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In IEEE CVPR Workshops, pages 24–25, 2020.
  • [37] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.07853, 2020.
  • [38] Pei Wang and Nuno Vasconcelos. SCOUT: Self-aware discriminant counterfactual explanations. In IEEE CVPR, pages 8981–8990, 2020.
  • [39] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • [40] Zbigniew Wojna, Alex Gorban, Dar-Shyang Lee, Kevin Murphy, Qian Yu, Yeqing Li, and Julian Ibarz. Attention-based extraction of structured information from street view imagery. In ICDAR, pages 844–850, 2017.
  • [41] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In ACL, page 133–138, 1994.
  • [42] Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019.
  • [43] Ning Xie, Gabrielle Ras, Marcel van Gerven, and Derek Doran. Explainable deep learning: A field guide for the uninitiated. arXiv preprint arXiv:2004.14545, 2020.
  • [44] Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I Inouye, and Pradeep K Ravikumar. On the (in)fidelity and sensitivity of explanations. In NeurIPS, pages 10967–10978, 2019.
  • [45] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
  • [46] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Muller, R. Manmatha, Mu Li, and Alexander Smola. ResNeSt: Split-attention networks. arXiv preprint arXiv:2004.08955, 2020.
  • [47] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
  • [48] Quanshi Zhang, Yu Yang, Haotian Ma, and Ying Nian Wu. Interpreting CNNs via decision trees. In IEEE CVPR, pages 6261–6270, 2019.
  • [49] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In IEEE CVPR, pages 2921–2929, 2016.

SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition
(Supplementary Material)

1 Computational Costs

Table 1 gives the cost comparison between SCOUTER and the FC classifier. Compared with the FC classifier, SCOUTER requires a similar computational cost (slightly higher) and a similar number of parameters (slightly lower). The increase in computational cost (flops) comes from the small FC layers (i.e., $Q$ and $K$), the GRU, and some matrix calculations in the xSlot module. However, as shown in the lower part of Fig. 1, this increase is not significant.

On the other hand, as shown in the upper part of Fig. 1, SCOUTER has more parameters than the FC classifier when $n$ is roughly in $[0,90]$. This is because the FC layers and the GRU, which are shared among all slots, have a certain number of parameters. For $n>90$, SCOUTER uses fewer parameters than the FC classifier, because it has only $c^{\prime}$ (64 in our implementation) learnable parameters per category, which is much less than the FC classifier typically needs per class (2,048 parameters for ResNet 50).
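A back-of-the-envelope comparison of the per-category parameters for $n=100$ (shared xSlot components such as $Q$, $K$, the GRU, and the $1\times 1$ convolution are excluded; 2,048 is the ResNet 50 feature dimension):

```python
n = 100                           # number of categories
c_backbone = 2048                 # output channels of ResNet 50
c_slot = 64                       # c' in SCOUTER

fc_params = n * c_backbone + n    # FC classifier: weights + biases
slot_params = n * c_slot          # per-category slot parameters only
print(fc_params, slot_params)     # 204900 vs. 6400
```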

Compared to the differences in the computational costs and parameter counts across backbone models, the additional cost of SCOUTER is almost negligible.

2 Components of xSlot Attention Module

In SCOUTER, we adopt a variant of slot attention [23]. We make essential modifications to several components in order to enable explainable classification, while other components, i.e., the gated recurrent unit (GRU) and the position embedding (PE), remain unchanged; their effects on classification and explainability are unexplored. To test the performance of SCOUTER with and without these components, we consider two variants. The first is SCOUTER without GRU, in which the GRU component used to update the slot weights is replaced with an average operation. The second is SCOUTER without PE, where the flattened input features are used directly, without adding position information.

Table 2 shows the performance of SCOUTER+ and SCOUTER- as well as their variants on several metrics, including computational costs, classification accuracy, and explainability. SCOUTER with all components achieves better results than the variants on most metrics, except for computational costs. The absence of the GRU or PE not only decreases the classification accuracy but also deteriorates all explainability metrics, which proves their necessity.

Table 1: Cost comparison of SCOUTER and the FC classifier ($n=100$; input images are of size $260\times 260$).
Models Params (M) Flops (G)
FC SCOUTER   FC SCOUTER
ResNet 26 [14] 14.1511 14.1298 3.4238 3.4565
ResNet 50 [14] 23.7129 23.6916 5.9830 6.0171
ResNeSt 26 [46] 15.2253 15.2041 5.1803 5.2130
ResNeSt 50 [46] 25.6391 25.6179 7.7430 7.7762
DenseNet 121 [19] 7.0564 7.0719 3.7536 3.7805
DenseNet 169 [19] 12.6510 12.6435 4.4396 4.4683
MobileNet 75 [17] 1.1194 0.6537 0.0563 0.0812
MobileNet 100 [17] 4.3301 3.0859 0.3154 0.3421
SeResNet 18 [18] 11.3169 11.3509 2.6473 2.6726
SeResNet 50 [18] 26.2439 26.2226 5.6758 5.7098
EfficientNet B2 [33] 7.8419 7.8437 1.0250 1.0564
EfficientNet B5 [33] 28.5457 28.5244 3.6391 3.6721
Figure 1: Flops and parameter sizes of SCOUTER and the FC classifier with ResNet 50.
Table 2: Performance comparison of SCOUTER and its variants on a subset ($n=100$) of the ImageNet dataset. $\lambda$ is set to 10 during training, and ResNeSt 26 is adopted as the backbone. The explanation performance is measured on the GT category for the positive explanation and on the least similar class (LSC) for the negative explanation.
Explanation Type Variants Computational Costs Classification Explainability
Params (M) Flops (G) Accuracy Precision IAUC DAUC
Positive SCOUTER+ 15.2041 5.2130 0.7991 0.9257 0.7647 0.2713
w.o. GRU 15.1791 5.1901 0.7961 0.9219 0.7456 0.2866
w.o. PE 15.2041 5.2130 0.7974 0.8973 0.7557 0.3002
Negative SCOUTER- 15.2041 5.2130 0.7946 0.8101 0.6730 0.7333
w.o. GRU 15.1791 5.1901 0.7910 0.7904 0.5959 0.7529
w.o. PE 15.2041 5.2130 0.7903 0.8067 0.6141 0.7661

3 Classification Performance when $n>100$

Training of SCOUTER becomes unstable when the category number $n$ of the ImageNet [6] subsets is larger than 100. One possible reason is that it is difficult to find consistent and discriminative supports when there are many categories. Fig. 2 shows the classification performance when $n>100$. The number of independent training runs is increased to 5, as the training process becomes unstable and often results in failures (low classification accuracy) when $n>100$. $\lambda$ is set to 10, and ResNeSt 26 [46] is adopted as the backbone, with a batch size of 70 and 20 training epochs (both the same as the settings of the experiments in the main paper). Although SCOUTER+ and SCOUTER- can sometimes achieve performance similar to the FC classifier when $n<400$, they become significantly more unstable as the category number $n$ increases. As stated in the main paper, SCOUTER can only be used on small- or medium-sized datasets due to this issue.

Figure 2: The classification performance of the FC classifier, SCOUTER+, and SCOUTER- when $2\leq n\leq 1000$. We show violin plots as well as the average value for SCOUTER+ and SCOUTER-, while only the average value is shown for the FC classifier.

4 Inter- and Intra-Category Explanation

To better understand (i) what supports SCOUTER uses as the basis for its decision making, (ii) how these supports are differentiated among different categories, and (iii) whether they are consistent across images of the same category, we give some additional visualizations on the MNIST dataset [22] in Figs. 3 and 4 for SCOUTER+ and SCOUTER-, respectively. MNIST is adopted here because similarities and dissimilarities among categories (digits) are obvious and easier to understand than in ImageNet. In these two figures, (a) shows the inter-category visualization, i.e., what the supports for the “Predicted” category look like given an image of the “Actual” category, while (b) shows the intra-category visualization, i.e., the supports for different images of the same category. For the latter, we use the digit 6 as an example, with the first ten samples of category 6 in the MNIST test set.

In the inter-category visualization in Fig. 3, we can see that SCOUTER+ successfully finds supports in the images of the ground-truth (GT) categories. Notably, it also finds weaker supports for some categories with similar appearances, e.g., the supports for “why 5 is 6” (as the lower half of this handwritten 5 is a little confusing and very close to the lower part of a 6), as well as for “why 0 is 9” and “why 8 is 9” (both 0 and 8 have a circle like the one in 9).

Similarly, in Fig. 4, SCOUTER- finds no supports in the images of the GT categories, while it finds strong supports for the non-GT categories. As digit recognition is an easy task, SCOUTER- can use some very simple supports to deny most non-GT categories. For example, in the prediction of “why 1 is not [non-GT categories],” all the slots of SCOUTER- find that the top end of the vertical stroke is 1's unique pattern; thus, they can deny all other categories with this support. Among visually similar categories, the negative explanations are more informative. For example, in the visualizations of “why 9 is not 1” and “why 9 is not 7,” SCOUTER- precisely highlights the discriminative regions without which 9 would look like the other two digits.

Also, in the intra-category visualization, both SCOUTER+ and SCOUTER- show consistent supports across images of the same category. When predicting “why 6 is 6,” SCOUTER+ always looks at the region close to the crossing point of the bottom circle and the vertical stroke. For the explanation “why 6 is not 2,” SCOUTER- always recognizes the presence of the vertical stroke, which does not exist in the digit 2, as well as the absence of the bottom horizontal stroke, which is essential for 2.

5 Some More Visualizations

In this section, we show more visualization results for ImageNet using SCOUTER and competing methods, including I-GOS [28], IBA [30], CAM [49], GradCAM [31], GradCAM++ [3], S-GradCAM++ [25], Score-CAM [36], SS-CAM [35], and Extremal Perturbation [10].

Subsets with $n=100$ categories are used for training and visualization. Besides the first $n$ categories (as used in the main paper), we also use several other subsets (with the same category number) of the ImageNet dataset in order to provide more diverse visualizations. Figs. 5 and 6 give examples of positive explanations, while Fig. 7 gives examples of negative explanations.

Among the positive explanations, SCOUTER+ finds reasonable and precise supports. In particular, for the image of “parallel bars,” SCOUTER+ provides an explanatory region along the horizontal bar. In addition, SCOUTER- with the least similar class (LSC) also finds supports on the foreground objects, which can be used to deny the LSC categories but are not enough for admitting the GT category; this conforms to the quantitative results in the main paper.

Moreover, as shown in Fig. 7, SCOUTER- can give very detailed explanations for categories with high visual similarity, e.g., the differences in the eyes and ears between “Labrador retriever” and “golden retriever,” and the difference in the horns between “water ox” and “ox.”

Figs. 8 and 9 show some more examples of two medical applications (glaucoma diagnosis and artery hardening diagnosis). We can see that SCOUTER+ and SCOUTER- perform well in both tasks.

Figure 3: Visualized positive explanations using SCOUTER+ (with ResNet 18 [14] and $\lambda=1$) on the MNIST dataset [22].
Figure 4: Visualized negative explanations using SCOUTER- (with ResNet 18 [14] and $\lambda=1$) on the MNIST dataset [22].
Figure 5: More examples of visualized positive explanations (Part 1). The number in parentheses represents the $\lambda$ value used in the SCOUTER training.
Figure 6: More examples of visualized positive explanations (Part 2). The number in parentheses represents the $\lambda$ value used in the SCOUTER training.
Figure 7: More examples of visualized negative explanations for similar categories. The number in parentheses represents the $\lambda$ value used in the SCOUTER training.
Figure 8: More examples of visualized explanations for glaucoma diagnosis on three positive samples using SS-CAM, IBA, SCOUTER+, and SCOUTER-. The first three methods are for “why this is glaucoma,” while SCOUTER- is for “why this is not normal.” SCOUTER shows explanations covering only related regions (vessel shape changes), which have been validated by two doctors.
Figure 9: More examples of visualized explanations for artery hardening diagnosis on three “moderate” samples using SS-CAM, IBA, SCOUTER+, and SCOUTER-. The first three methods are for “why this is moderate,” while SCOUTER- is for “why this is not none.” The white circles give the approximate location of the symptoms (shape changes on the vessel wall of the vein, caused by the increased blood pressure in the artery). SCOUTER gives precise explanations that are mostly within the symptom region and precisely on the wall of the vein. The explanations of SS-CAM are off target in the first row and on the wrong vessels (artery) in the first and third rows, while IBA fails to find the symptom in the second and third rows.