

Deeply Explain CNN via Hierarchical Decomposition

Ming-Ming Cheng*, Peng-Tao Jiang*, Ling-Hao Han, Liang Wang, Philip Torr M.M. Cheng, P.T. Jiang, L.H. Han are with TKLNDST, CS, Nankai University. M.M. Cheng is the corresponding author ([email protected]). * denotes equal contribution. L. Wang is with NLPR. P. Torr is with the University of Oxford.
Abstract

In computer vision, some attribution methods for explaining CNNs attempt to study how intermediate features affect the network prediction. However, they usually ignore the feature hierarchies among the intermediate features. This paper introduces a hierarchical decomposition framework to explain a CNN’s decision-making process in a top-down manner. Specifically, we propose a gradient-based activation propagation (gAP) module that can decompose any intermediate CNN decision to its lower layers and find the supporting features. We then utilize the gAP module to iteratively decompose the network decision into supporting evidence from different CNN layers. The proposed framework generates a deep hierarchy of strongly associated supporting evidence for the network decision, which provides insight into the decision-making process. Moreover, gAP requires neither network architecture modification nor extra training, making it effort-free to apply to CNN-based models. Experiments show the effectiveness of the proposed method. The code and an interactive demo website will be made publicly available.

Index Terms:
Explaining CNNs, hierarchical decomposition.

1 Introduction

Deep convolutional neural networks (CNNs) have achieved significant improvements on various computer vision tasks, such as image recognition [1, 2, 3], object detection [4, 5, 6], semantic segmentation [7, 8, 9, 10], traffic environment analysis [11, 12], and medical image understanding [13, 14]. Despite their high performance, CNNs are usually used as black boxes, as their internal decision process is unclear. Moreover, plenty of recent research [15, 16, 17] has pointed out that successful CNN models can still be fooled by adversarial examples whose changes cannot even be noticed by human eyes. Given these concerns, it is difficult for humans to trust well-performing yet opaque CNN models. Therefore, the interpretability of CNNs is as crucial as their performance, especially in critical applications.

Figure 1: Overview of our approach. The middle is an evidence pyramid for the network prediction, and the bottom shows the important features from different stages of VGG-16. The colored circles represent the features. We detect interactions among the features and show how they are combined at different hierarchies in the decision-making process.

A fully interpretable convolutional neural network is a long-standing holy grail for deep learning researchers. To this end, researchers have proposed a wide range of techniques. Feature attribution (or saliency) methods [18, 19, 20] provide a powerful tool for interpretability. They attribute an output prediction of a CNN to the input image, where the generated saliency map tells us which pixels are important to the prediction. Such ability helps humans understand how the input affects the prediction. Another set of feature attribution methods [21, 22] measures the importance of intermediate features towards a prediction. They further select important features and study their impact on the prediction. Apart from estimating feature importance, the relationships among intermediate features [23] are also important for understanding predictions but have received little attention.

CNNs have demonstrated a strong ability to gradually abstract image contents and generate features at different semantic levels, e.g., blobs/edges, textures, and object parts [24]. While discovering important features can provide a rich set of evidence for the output prediction, isolated evidence is less convincing and informative than an evidence chain [25] or an evidence pyramid [26]. According to the feature integration theory developed by Treisman et al. [27], the human brain first extracts basic features and then utilizes attention to combine individual features to perceive the object. Ideally, we would expect a hierarchical evidence tree as demonstrated in Fig. 1, which attributes a CNN decision to multiple key features, each of which can be recursively attributed to more basic features. By associating intermediate features like ‘head’, ‘face’, ‘eye’, ‘nose’, and ‘edge’ in this example, a group of strongly associated evidence corresponding to the network’s inner state emerges, mirroring how human perception assembles basic features into a decision.

There are two major challenges for existing feature attribution methods to achieve such a hierarchical decomposition. Firstly, directly decomposing millions of feature responses in all channels and all spatial locations is both computationally infeasible and cognitively overloading for humans. Meanwhile, feature attribution methods such as [21, 22] are quite time-consuming because they need to repeat the backpropagation process many times. Secondly, some attribution methods [28, 29] generate an attention map for the whole layer, rather than a group of attention maps, one for each feature channel. The channel-wise attention maps are crucial for the iterative decomposition process as they indicate the most important neuron in a feature channel to be decomposed. To alleviate these issues, we propose an efficient gradient-based Activation Propagation (gAP) module, which decomposes a feature response at any CNN location to its lower layer. As the gAP module generates an activation map for each feature channel, we can easily select a few of the most activated feature channels as crucial evidence, obtaining human-scale explanations. For each of those selected feature channels, the CNN feature at the most activated spatial position can be iteratively decomposed. By avoiding decomposing features at too many spatial locations, we further reduce the number of potential visualizations to a human scale.

The proposed decomposition framework can effectively generate hierarchical explanations (see Fig. 1), which builds relationships among crucial intermediate features. We have conducted extensive experiments on several aspects, including a sanity check of the gAP module and understanding the network decisions. Experiments show the effectiveness of our framework to explain network decisions. In summary, we make two major contributions:

  • We propose an efficient gradient-based Activation Propagation (gAP) module, which decomposes the network decision and intermediate features to find their key supporting evidence from previous layers.

  • We propose a hierarchical decomposition framework, which builds relationships among important intermediate features, enabling hierarchical explanations with human-scale supporting evidence.

2 Related Work

The interpretability of CNNs has been actively studied, with major progress in three main areas, including feature attribution, feature visualization, and knowledge distillation to explainable models.

(a) Feature Visualization [30]
(b) Class Activation Map [28]
(c) High-level Decomposition [31] (example components: building_facade (45%), balcony (9.65%), window (7.04%))
Figure 2: Illustration of different kinds of interpretative methods.

2.1 Feature Attribution

Feature attribution methods typically generate a saliency map to locate the input locations important to the output. We classify them into three categories: backpropagation-based methods, perturbation-based methods, and activation-based methods.

Backpropagation-based methods. In the early days, Sung et al. [32] rank the importance of inputs for backpropagation networks using tools such as sensitivity analysis. Baehrens et al. [33] identify the feature importance for a particular instance by computing the gradients of the decision function. Simonyan et al. [20] backpropagate the gradients of the output prediction w.r.t. the input image and generate a saliency map that indicates the importance of each pixel in the image. Guided Backpropagation [30] and Deconvnet [24] utilize different backpropagation rules through ReLU, where they both zero out the negative gradients. Sundararajan et al. [18] consider the saturation and thresholding problem; they compute the saliency map by accumulating the gradients along a path from a base image to the input image. Another set of methods, such as LRP [34], DeepTaylor [35], RectGrad [36], DeepLift [19], FullGrad [37], PatternAttribution [38], and Excitation Backprop [39], utilize different top-down relevance propagation rules. Yang et al. [40] attempt to learn the propagation rule automatically for attribution map generation. SmoothGrad [41] sharpens gradient-based saliency maps to reduce visual noise. Zintgraf et al. [42] identify not only the important regions supporting the network decision but also the regions against the decision. Moreover, some methods [21, 22] measure the importance of a hidden unit to the prediction based on backpropagation. These methods can find the most important features from different layers of deep networks. Kim et al. [43] study high-level concepts instead of low-level features for interpreting the internal state of a neural network. They utilize directional derivatives to quantify the importance of high-level concepts to a classification result.

Perturbation-based methods. These methods perturb the input to observe the output changes. Zeiler et al. [24] occlude the input image by sliding a gray square and use the change of the output as the importance. Petsiuk et al. [44] randomly sample masked regions. Ribeiro et al. [45] utilize super-pixels to select occluded image regions; they learn a local linear model to compute the contribution of each super-pixel. Besides, recent methods [46, 47, 48] learn a perturbation map, where the map applied to the input image can maximally affect the prediction. Fong et al. [47] also apply the input attribution method to study the salient channels of deep networks.

Activation-based methods. These methods [28, 29, 49] generate a coarse class activation map by linearly combining the feature channels from a convolutional layer. The class activation map is upsampled to the size of the input image and provides image-level evidence that is important for the network prediction, as demonstrated in Fig. 2 (b). Zhou et al. [28] propose Class Activation Mapping (CAM), which requires a specific network with a global average pooling layer to generate class activation maps. Later, Grad-CAM [29] and Grad-CAM++ [49] generalize the CAM method to other tasks by utilizing task-specific gradients as weights. Unlike Grad-CAM, Score-CAM [50] utilizes the forward-pass score on the target class to obtain the weight of each activation map. Recently, Zhou et al. [31] attempt to decompose the network decision into several semantic components and study each component’s contribution. As shown in Fig. 2 (c), the class activation map is decomposed into several semantic components.

The aforementioned attribution methods mostly focus on generating saliency/activation maps to study how the input affects the output prediction. Although some attribution methods can measure the importance of intermediate features to the output prediction, they usually neglect to study the relationships among different intermediate features. As pointed out by Olah et al. [23], the relationships among different intermediate features are also important for interpreting a prediction. We decompose not only the network decision but also the intermediate features to find their supporting evidence from previous layers, explaining how these associated intermediate features affect each other. While the LRP [34] method propagates feature importance to intermediate features, the feature importance for different channels is coupled in the backpropagation process. This method generates simple explanations for the entire network behavior rather than hierarchical explanations.

2.2 Feature Visualization

Visualizing the CNN features of the intermediate layers can provide insight into what these layers learn. For the first layer of the CNN, we can directly project its three-channel weights into the image space. To visualize the features from higher layers, researchers have proposed many alternative approaches. Among them, Erhan et al. [51] and Simonyan et al. [20] utilize the gradient ascent algorithm to find the optimal stimuli in the image space that maximizes the neuron activations. Other methods [24, 30, 52] identify the image patches from the dataset that maximize the neuron activation of the CNN layers, as shown in Fig. 2 (a). Guided Backpropagation [30] and Deconvnet [24] also utilize the top-down gradients to discover the patterns that the intermediate layers learn. Using the natural image prior, feature inversion methods [53, 54, 55, 56, 57] learn an image to reconstruct the neuron activation. Furthermore, the recent methods [58, 59, 60] attempt to detect the concepts learned by intermediate CNN layers. The above feature visualization methods explore what the intermediate features detect, but they do not answer how the network assembles individual features to make a prediction.

2.3 Distill Knowledge to Explainable Models

Recently, another research line has attempted to transfer the powerful ability of CNNs to explainable models, such as decision trees or linear models, that approximate the behavior of the original model. Chen et al. [61] distill the knowledge into an explainable additive model. Ribeiro et al. [45] utilize a local linear model to approximate the original model, studying how the input affects any classifier’s decisions. Frosst et al. [62] and Liu et al. [63] distill the learned knowledge of a CNN into a decision tree. These methods only bridge the network decision and the input; they cannot help the user understand how the internal features of CNNs affect the network decision and each other. Our hierarchical decomposition is also an approximation to the original model. Unlike the above methods, our hierarchical decomposition not only highlights the important features for the network decision but also builds relationships among the feature channels from different layers. With our method, we can obtain the states of the internal features and see how they affect each other and the network decision.

2.4 Intrinsic Interpretable Models

Beyond post-hoc interpretability analysis of a trained CNN, some researchers have attempted to explore inherently interpretable models. Chen et al. [64] propose a deep network architecture called the prototypical part network. The network has a transparent reasoning process that first computes similarity scores between image patches and learned prototypes; the network then makes predictions based on a weighted sum of the similarity scores. Concept bottleneck models [65, 66, 67] are also inherently interpretable. Unlike those post-hoc methods [58, 59] that utilize human-specified concepts to generate explanations, they directly predict a set of human-specified concepts at training time and then use these concepts to make predictions, so the reasoning process is interpretable. Some recent intrinsic interpretable models [65, 64] first utilize VGG [1] or ResNet [2] to extract high-level features and then perform the reasoning process on these features. Our method is complementary to such CNN-based intrinsic interpretable models because one can use the hierarchical decomposition to provide more hierarchical evidence from the feature extractor if needed.

3 Methodology

3.1 Gradient-based Activation Propagation

We begin by defining the notation for the CNN, as illustrated in Fig. 3. In the $l^{th}$ CNN layer, the features $\mathbf{F}^{l}$, partial gradients $\mathbf{G}^{l}$, and corresponding neuron activations $\mathbf{A}^{l}$ are 3D tensors with the same size, i.e., $\mathbf{G}^{l},\mathbf{A}^{l},\mathbf{F}^{l}\in\mathbb{R}^{K^{l}\times H^{l}\times W^{l}}$, where $K^{l}$ is the number of channels and $H^{l}\times W^{l}$ is the spatial size of CNN layer $l$. To find supporting evidence for the final CNN decision or any intermediate feature response, we propose a gradient-based activation propagation (gAP) method. Using the gAP module, we can understand a decision of interest at a CNN layer by localizing the most related evidence in its previous layer.

Figure 3: Our gradient-based activation propagation (gAP) method explains a decision of interest $\mathbf{F}_{k^{\prime}}^{l+1}(x_{k^{\prime}}^{l+1},y_{k^{\prime}}^{l+1})\in\mathbb{R}$, i.e., the CNN feature illustrated by the black dot, by localizing the most related neuron activations in its previous CNN layer.

As shown in Fig. 3, we decompose a CNN feature (i.e., a decision of interest) $\mathbf{F}^{l+1}_{k^{\prime},x,y}$ at convolutional layer $l+1$, channel $k^{\prime}$, and spatial position $(x,y)$, to find its supporting evidence in the previous convolutional layer $l$. In this work, we are interested in understanding the strong feature response $\mathbf{F}^{l+1}_{k^{\prime},x,y}$ that has the largest contribution to the decision within its feature channel. In typical CNNs, a feature at layer $l+1$ is computed as a linear combination of features from the previous layer $l$ followed by a ReLU. For the strong feature $\mathbf{F}^{l+1}_{k^{\prime},x,y}$, we have

$\mathbf{F}^{l+1}_{k^{\prime},x,y}=\text{ReLU}(\mathbf{w}^{l}\cdot\mathbf{F}^{l})=\mathbf{w}^{l}\cdot\mathbf{F}^{l}$,   (1)

where $\mathbf{w}_{k}^{l}$ is the linear weight for combining the $k^{th}$ feature channel of $\mathbf{F}^{l}$. To obtain the weight, we first use backpropagation to compute the partial gradient map $\mathbf{G}^{l}_{k}$ of the feature $\mathbf{F}^{l+1}_{k^{\prime},x,y}$ w.r.t. the feature map $\mathbf{F}^{l}_{k}$ by

$\mathbf{w}_{k}^{l}=\mathbf{G}_{k}^{l}=\underbrace{\frac{\partial{\mathbf{F}^{l+1}_{k^{\prime},x,y}}}{\partial{\mathbf{F}^{l}_{k}}}}_{\textup{gradients via backprop}}$.   (2)

The gradient map $\mathbf{G}_{k}^{l}$ captures the ‘importance’ of the feature map $\mathbf{F}_{k}^{l}$ for the decision $\mathbf{F}^{l+1}_{k^{\prime},x,y}$.

We employ the gradient map $\mathbf{G}_{k}^{l}$ to generate an activation map

$\mathbf{A}^{l}_{k}=\mathbf{G}_{k}^{l}\cdot\mathbf{F}^{l}_{k}$.   (3)

The activation map indicates the contribution of each feature in $\mathbf{F}_{k}^{l}$ to the decision $\mathbf{F}^{l+1}_{k^{\prime},x,y}$. Based on its corresponding activation map, each channel’s contribution to the decision can be computed by

$\alpha_{k}^{l}=\frac{1}{Z^{l}}\sum_{x}^{H^{l}}\sum_{y}^{W^{l}}\mathbf{A}^{l}_{k,x,y}$,   (4)

where $Z^{l}=H^{l}\times W^{l}$ denotes the number of spatial positions in the activation map $\mathbf{A}^{l}_{k}$. We can also identify the feature $\mathbf{F}^{l}_{k,\hat{x},\hat{y}}$ in the $k^{th}$ feature channel that contributes the most to the decision, in which

$(\hat{x},\hat{y})=\operatorname*{arg\,max}_{(x,y)}\mathbf{A}^{l}_{k,x,y}$.   (5)

Thus, for each decision, we can find the most important feature channel $\mathbf{F}_{k}^{l}$ according to the contribution $\alpha_{k}^{l}$ computed by Eqn. (4). In the most important channel, we can also identify the feature $\mathbf{F}^{l}_{k,\hat{x},\hat{y}}$ that contributes most to the decision according to Eqn. (5). In the top row of Fig. 4, we show the three most important activation maps $\mathbf{A}^{4}_{131}$, $\mathbf{A}^{4}_{255}$, and $\mathbf{A}^{4}_{452}$ in layer conv4_3 for the decision from $\mathbf{F}_{277}^{5}$. These activation maps provide spatial channel responses to the decision, benefiting human understanding. Using Guided Backpropagation [30], we visualize the most contributing feature by generating sharp visualizations, which highlight the associated input. An example is shown in the bottom row of Fig. 4.
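Below is a minimal sketch of the gAP computation (Eqns. (2)-(5)) in PyTorch, assuming the features of layers $l$ and $l+1$ have been captured, e.g., with forward hooks; the function name and tensor layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the gAP module (Eqns. (2)-(5)). `feat_lo` (layer l) and
# `feat_hi` (layer l+1) are assumed names for hooked intermediate features.
import torch

def gap_decompose(feat_lo, feat_hi, k_prime, x, y):
    """Decompose the decision feat_hi[k', x, y] onto the channels of feat_lo.

    feat_lo: (K_l, H_l, W_l) tensor that is part of the graph producing feat_hi.
    Returns activation maps A (Eqn. 3), channel contributions alpha (Eqn. 4),
    and the peak location of each channel (Eqn. 5).
    """
    decision = feat_hi[k_prime, x, y]                     # scalar decision of interest
    grads, = torch.autograd.grad(decision, feat_lo,       # G^l via backprop (Eqn. 2)
                                 retain_graph=True)
    A = grads * feat_lo                                   # A^l_k = G^l_k * F^l_k (Eqn. 3)
    alpha = A.mean(dim=(1, 2))                            # per-channel contribution (Eqn. 4)
    flat = A.flatten(1).argmax(dim=1)                     # per-channel peak (Eqn. 5)
    peaks = [(int(p) // A.shape[2], int(p) % A.shape[2]) for p in flat]
    return A.detach(), alpha.detach(), peaks
```

Selecting the channels with the largest `alpha` and recursing on their peak positions yields the hierarchical decomposition described in Sec. 3.2.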

Figure 4: Example of the most significant activation maps (upper row) and their corresponding visualizations (lower row) for layer conv4_3, which contains 512 channels. The black dot denotes the peak location in the activation map.

Discussion. Our gAP module is inspired by CAM [28] and Grad-CAM [29], which explain CNN decisions by class activation localization. To explain the relation and difference to our gAP module, we first revisit CAM and Grad-CAM. It has been proved that Grad-CAM is a strict generalization of CAM [29]. Without loss of generality, we consider the same network discussed in [28]. For an image classification CNN, the CNN features $\mathbf{F}^{L}$ of the last convolutional layer are spatially pooled by a global average pooling layer to obtain feature vectors. The network performs a linear combination of the feature vectors by feeding them into a fully connected layer before the softmax. Let $C$ denote the number of classes. The classification score before softmax, $S^{c}$, for each class $c\in\{1,2,\dots,C\}$ is

$S^{c}=\sum_{k}^{K^{L}}w^{c}_{k}\overbrace{\frac{1}{Z^{L}}\sum_{x}^{H^{L}}\sum_{y}^{W^{L}}}^{\textup{global average pooling}}\mathbf{F}^{L}_{k,x,y}=\frac{1}{Z^{L}}\sum_{x}^{H^{L}}\sum_{y}^{W^{L}}\sum_{k}^{K^{L}}w^{c}_{k}\mathbf{F}^{L}_{k,x,y}$,   (6)

where $w_{k}^{c}$ is the weight connecting the $k^{th}$ feature map with the $c^{th}$ class. The contribution of a feature $\mathbf{F}^{L}_{k,x,y}$ to $S^{c}$ is $w^{c}_{k}\mathbf{F}^{L}_{k,x,y}$. CAM generates a class activation map $\mathbf{M}^{c}$ by summing over all feature maps,

$\mathbf{M}^{c}=\sum_{k}^{K^{L}}w^{c}_{k}\mathbf{F}^{L}_{k}$,   (7)

where each value in $\mathbf{M}^{c}$ indicates the contribution of each spatial location to $S^{c}$.

For the linear function, the importance weight is also equal to the gradient. Thus, we can also obtain the weight by computing the back-propagating gradients,

$w^{c}_{k}=\sum_{x}^{H^{L}}\sum_{y}^{W^{L}}\frac{\partial{S^{c}}}{\partial{\mathbf{F}^{L}_{k,x,y}}}$.   (8)

The detailed derivation of $w^{c}_{k}$ is given in [29]. Eqn. (8) is also how Grad-CAM computes the weight $w^{c}_{k}$. A slight difference is that Grad-CAM multiplies $w^{c}_{k}$ by a proportionality constant, i.e.,

$w^{c}_{k}=\frac{1}{Z^{L}}\sum_{x}^{H^{L}}\sum_{y}^{W^{L}}\frac{\partial{S^{c}}}{\partial{\mathbf{F}^{L}_{k,x,y}}}$,   (9)

where the proportionality constant $1/Z^{L}$ is normalized out.

(Figure panels, left to right: $\mathbf{M}_{\textup{Grad-CAM}}^{\textup{person}}$, $\mathbf{A}_{277}^{5}$, $\mathbf{A}_{255}^{4}$, $\mathbf{A}_{216}^{3}$, $\mathbf{A}_{127}^{2}$.)
Figure 5: Illustration of our hierarchical decomposition process. The number for each visualization denotes the channel id of VGG-16 [1]. At each stage, we decompose one of the top-3 most important feature channels to the lower layer. Following the blue line, we zoom in on an activation map for the decision. The gray dashed line represents the decomposition of the feature response corresponding to the maximal activation. Additionally, we make the decomposition process interactive: at each stage, the user can select any decision and decompose it.

Considering the scores $\{S^{c}\}$ as CNN features with $C$ channels and spatial size $1\times 1$, i.e., $\mathbf{F}_{c}^{L+1}(1,1)=S^{c}$, we can plug Eqn. (2) into Eqn. (9) and get

$w^{c}_{k}=\frac{1}{Z^{L}}\sum_{x}^{H^{L}}\sum_{y}^{W^{L}}\mathbf{G}_{k,x,y}^{L}$.   (10)

Due to the global average pooling layer, the gradient of each element in $\mathbf{F}^{L}_{k}$ is the same, i.e., $\forall x,y:\ \mathbf{G}_{k,x,y}^{L}=w^{c}_{k}$. The class activation map $\mathbf{M}^{c}$ can be written as

$\mathbf{M}^{c}=\sum_{k}^{K^{L}}\mathbf{G}^{L}_{k}\cdot\mathbf{F}^{L}_{k}=\sum_{k}^{K^{L}}\mathbf{A}^{L}_{k}$.   (11)

Eqn. (11) suggests that the class activation map $\mathbf{M}^{c}$ of Grad-CAM can be generated by simply summing the activation maps $\mathbf{A}^{L}_{k}$ from our gAP.
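As a quick illustration of this relation, the following snippet reuses the `gap_decompose` sketch above with assumed variable names; summing the per-channel gAP activation maps for a class score yields a Grad-CAM-like map, up to a constant factor and Grad-CAM's final ReLU and upsampling steps.

```python
# Assumed setup: `feat_last` is the last conv feature map (K_L, H, W) retained in
# the autograd graph, `logits` the pre-softmax scores reshaped to (C, 1, 1), and
# `c` the class of interest. Illustrative check of Eqn. (11), not the official
# Grad-CAM implementation.
A, alpha, _ = gap_decompose(feat_last, logits, k_prime=c, x=0, y=0)
grad_cam_like = A.sum(dim=0)   # M^c = sum_k A^L_k, before ReLU and upsampling
```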

The differences between gAP and Grad-CAM/CAM are:

  • Grad-CAM/CAM combine all activation maps to generate a single class activation map $\mathbf{M}^{c}$, which highlights important regions supporting the prediction. Our gAP method explains a decision of interest by generating a group of activation maps $\{\mathbf{A}_{k}\}$. Each activation map corresponds to a feature channel, which is crucial for our iterative decomposition process.

  • Grad-CAM/CAM generate class activation maps from the last convolutional layer to explain the prediction. Our gAP generalizes this idea and iteratively decomposes a decision of any CNN layer to its lower layer.

While the above derivations apply to adjacent layers, we empirically find that satisfactory decomposition results can also be obtained when applying the gAP module between two layers from different stages of CNN (see Sec. 4.1). In the following, we will describe how we build hierarchical explanations for the network decisions.

3.2 Hierarchical Decomposition

Fig. 5 demonstrates an example of our hierarchical decomposition process. First, we decompose the network decision to the last convolutional layer and find the top few most crucial supporting features. Then, we decompose each of the supporting features to their previous layer and iteratively repeat the decomposition process until the bottom layer. As mentioned in Sec. 1, the key challenge is that naively building the hierarchical decomposition generates too many visualizations, which is a cognitive burden for humans. Even if we only decompose the single maximally contributing feature in each channel (see also Eqn. (5)), directly decomposing all channels in VGG-16 would generate $512^{3}\times 512^{3}\times 256^{3}\times 128^{2}\times 64^{2}\approx 2.5\times 10^{22}$ visualizations.

To obtain human-scale visualizations, we propose two strategies to reduce the number of visualizations. Firstly, we only decompose the top few most important features at each layer. Experiments (see Sec. 4.1) verify that a small subset of feature channels in a layer accounts for the majority of the contributions to a decision. Thus, we select the top few most important channels. Secondly, we simplify the top-down decision decomposition process by utilizing only the last convolutional layer of each stage. Current popular CNNs [1, 2] usually reduce the spatial size of feature maps after each stage, where a stage is composed of a set of convolution layers with the same output resolution. Each stage learns different patterns, such as blobs/edges, textures, and object parts [30, 24]. Experiments verify that when using the gAP module between two layers from two consecutive stages, we can obtain visually meaningful decomposition results (see Fig. 5). With these two strategies, we can largely reduce the number of visualizations and obtain human-scale explanations.

An example of the VGG-16 classification network is shown in Fig. 5. We select conv1_2, conv2_2, conv3_3, conv4_3, conv5_3 and index these layers as $\{1,2,\dots,L\}$, where $L=5$. The network output before softmax can be considered as the $6^{th}$ CNN layer, with features $\mathbf{F}^{6}\in\mathbb{R}^{C\times 1\times 1}$. The decomposition process starts from the CNN decision $\mathbf{F}^{6}_{c}(1,1)$, where $c$ corresponds to the ‘person’ class. Using gAP, we first decompose the CNN decision $\mathbf{F}^{6}_{c}(1,1)$ to the $5^{th}$ layer. The decomposition generates a set of activation maps $\{\mathbf{A}^{5}\}$ at the $5^{th}$ layer for $\mathbf{F}^{6}_{c}(1,1)$. We use Eqn. (4) to select the top $N$ (e.g., $N=3$) most important activation maps, i.e., $\mathbf{A}^{5}_{389}$, $\mathbf{A}^{5}_{277}$, and $\mathbf{A}^{5}_{259}$. We continue to decompose the decisions from $\mathbf{F}^{5}_{389}$, $\mathbf{F}^{5}_{277}$, and $\mathbf{F}^{5}_{259}$, and find the top $N$ most important activation maps at the $4^{th}$ layer for each of them. However, directly decomposing a whole feature map is not straightforward, because not all features in a feature map contribute to the decision (see the activation maps in the top row of Fig. 4). We therefore select the most representative feature, i.e., the one that contributes most to a decision, and decompose it. We utilize Eqn. (5) to find the feature $\mathbf{F}_{k,\hat{x},\hat{y}}^{l}$ corresponding to the maximum activation and then decompose it to layer $l-1$ using gAP. This hierarchical decomposition process runs recursively until we decompose the CNN decision to the lowest layer. A sketch of this recursion is given below.
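The sketch below uses the same assumptions as the earlier `gap_decompose` sketch; `feats` is an assumed list holding the retained stage features (conv1_2, ..., conv5_3) followed by the pre-softmax scores, and is not the authors' implementation.

```python
# Hypothetical recursive driver for the hierarchical decomposition: at each level,
# keep the top-N channels by contribution (Eqn. 4) and recurse on each channel's
# peak position (Eqn. 5).
import torch

def decompose(feats, layer, k_prime, x, y, top_n=3, trace=None):
    if trace is None:
        trace = []
    if layer == 0:                                        # reached the lowest selected layer
        return trace
    A, alpha, peaks = gap_decompose(feats[layer - 1], feats[layer], k_prime, x, y)
    for k in torch.topk(alpha, top_n).indices.tolist():
        px, py = peaks[k]
        trace.append((layer - 1, k, float(alpha[k]), (px, py)))
        decompose(feats, layer - 1, k, px, py, top_n, trace)
    return trace

# e.g., decompose(feats, layer=5, k_prime=person_class, x=0, y=0) starts from the
# 'person' score treated as F^6_c(1,1) and walks down to conv1_2.
```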

The number of visualizations $N$ is a flexible parameter that controls how many top-response feature channels are selected during each decomposition. To ease human cognition, $N$ is set to 3 in Fig. 5. Moreover, we make the hierarchical decomposition interactive, so that users can choose the features to be decomposed and easily access the information they need. We also provide a video of the interactive demo in the supplementary materials. In Fig. 5, we can see that the features detected in high-level layers can be decomposed into different parts detected in low-level layers. The hierarchical decomposition process tracks important features and recursively explains the evidence using evidence from lower layers. For instance, the classification result of ‘person’ is decomposed into ‘face’ and ‘hand’ evidence. The ‘face’ evidence is then decomposed into ‘eye’, ‘nose’, and ‘lower jaw’. This process continues until we reach the lowest layer, which usually detects edge and blob features.

Difference to layer-wise attribution methods. Some attribution methods, such as LRP [34], hierarchically propagate importance to the input in a layer-wise manner. They generate a single saliency map that indicates the importance of each pixel in the input. Unlike them, our method decouples the importance propagation chain and produces a rich hierarchy of activation maps and corresponding visualizations. To explain a person image, our method finds a group of evidence, e.g., activation maps for ‘face’, ‘hand’, etc. Each piece of evidence is associated with its own supporting evidence, e.g., ‘face’ has supporting activation maps for ‘eye’, ‘nose’, etc. Our method provides informative details of the internal features and their relations.

4 Experiments

In this section, we first conduct experiments to verify the correctness and efficiency of the decision decomposition. Then, we use the hierarchical decomposition process to analyze network characteristics and explain network decisions. We conduct experiments on two popular datasets, ImageNet [68] and PASCAL VOC [69]. On the PASCAL VOC dataset, the augmented training set containing 10582 training images is used to fine-tune different classification networks. All the experiments are tested on a single RTX 2080Ti GPU.

4.1 Sanity Check for gAP

The effectiveness of gAP. We have shown that the gradient-based Activation Propagation (gAP) module helps to decompose the network decision hierarchically for CNN-based models. During the decomposition process, what matters most is the accuracy of the channel contributions calculated by the gAP module. Thus, we first examine the accuracy of the channel contributions to the decision of interest. Following [21, 70, 60], we take the drop in the decision score when removing one feature channel at a time as the ground truth of that channel’s contribution. Specifically, given an input image $I$, let $f^{l+1}$ be a decision score at the $(l+1)^{th}$ layer, and let $\hat{f}^{l+1}$ denote the decision score when setting the $k^{th}$ feature channel in the $l^{th}$ layer to its average activation. The score drop $\hat{\alpha}_{k}^{l}=f^{l+1}-\hat{f}^{l+1}$ is the $k^{th}$ channel’s ground-truth contribution to this decision.

TABLE I: The Pearson correlation coefficient (PCC) of different settings. $\rightarrow$ denotes the decomposition. S5-S1 denote the last convolutional layers of the 5 stages in VGG-16 [1]. AA: Average Activation. MA: Maximum Activation. AG: Average Gradient. MG: Maximum Gradient. T: target category. Average activation achieves the best result.
ImageNet | T $\rightarrow$ S5 | S5 $\rightarrow$ S4 | S4 $\rightarrow$ S3 | S3 $\rightarrow$ S2 | S2 $\rightarrow$ S1
AA | 0.985 | 0.959 | 0.933 | 0.898 | 0.895
MA | 0.897 | 0.912 | 0.894 | 0.864 | 0.890
AG | 0.623 | 0.421 | 0.497 | 0.545 | 0.472
MG | 0.454 | 0.456 | 0.567 | 0.594 | 0.606
VOC | T $\rightarrow$ S5 | S5 $\rightarrow$ S4 | S4 $\rightarrow$ S3 | S3 $\rightarrow$ S2 | S2 $\rightarrow$ S1
AA | 0.987 | 0.961 | 0.932 | 0.899 | 0.893
MA | 0.917 | 0.913 | 0.892 | 0.856 | 0.897
AG | 0.702 | 0.492 | 0.525 | 0.564 | 0.480
MG | 0.575 | 0.525 | 0.536 | 0.583 | 0.669

The Pearson Correlation Coefficient (PCC) metric [71] is utilized to measure the linear correlation between the ground-truth contribution $\hat{\alpha}^{l}\in\mathbb{R}^{K^{l}}$ and the contribution $\alpha^{l}\in\mathbb{R}^{K^{l}}$ estimated by Eqn. (4). A PCC value of 1 indicates a perfect positive linear correlation between the two variables (0 denotes no linear correlation, and -1 denotes a total negative linear correlation). The PCC metric is computed by

$\rho=\frac{\mathbb{E}\left[(\alpha^{l}-\mu_{\alpha^{l}})(\hat{\alpha}^{l}-\mu_{\hat{\alpha}^{l}})\right]}{\sigma_{\alpha^{l}}\cdot\sigma_{\hat{\alpha}^{l}}}$,   (12)

where $\mu$ and $\sigma$ denote the mean and standard deviation, respectively.

As shown in Tab. I, we study several strategies for calculating the contribution of a feature channel to the decision of interest. The contribution computed by averaging activations (i.e., Eqn. (4)) obtains the highest PCC value with the ground truth. For all stages in VGG-16, there are strong linear correlations between the computed contributions and the ground truth. This high correlation verifies the effectiveness of the gAP module.
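The check itself is simple to reproduce; below is a minimal sketch under stated assumptions: `forward_from` is a hypothetical helper that recomputes the decision of interest from the (possibly ablated) layer-$l$ features, and the Pearson correlation follows Eqn. (12).

```python
# Ground-truth channel contributions by ablation (set a channel to its average
# activation, record the score drop) and their Pearson correlation with the gAP
# estimates. `forward_from`, `feat_lo`, and index names are assumptions.
import torch

def ground_truth_contributions(feat_lo, forward_from, k_prime, x, y):
    base = forward_from(feat_lo)[k_prime, x, y]            # f^{l+1}
    drops = []
    for k in range(feat_lo.shape[0]):
        ablated = feat_lo.clone()
        ablated[k] = feat_lo[k].mean()                      # replace channel k by its mean
        drops.append(base - forward_from(ablated)[k_prime, x, y])  # score drop (hat alpha_k)
    return torch.stack(drops)

def pearson(a, b):                                          # Eqn. (12)
    a, b = a - a.mean(), b - b.mean()
    return (a * b).mean() / (a.std(unbiased=False) * b.std(unbiased=False))
```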

In terms of computational efficiency, measuring channel contributions by iteratively removing feature channels in a layer and recording the score drop [21, 70, 60] is rather time-consuming. In comparison, gAP needs only one backpropagation pass to calculate the channel contributions to a decision. On a VGG-16 backbone, calculating the ground-truth channel contributions of an image takes about 10s, while the gAP module takes only about 50ms, nearly 200x faster. With this efficiency advantage, our hierarchical decomposition process can immediately yield detailed explanations of a network decision.

Figure 6: (a) The first five plots show the contribution of each channel in a CNN layer to the decision. The contribution of a channel to a decision denotes how much it affects the decision. gAP denotes the proposed gradient-based activation propagation method. GT denotes the method that removes feature channels. ‘Stage 1-5’ denotes the last convolutional layer of each stage. The channel contributions are sorted in descending order. The contribution distribution calculated by gAP remains almost the same as that of the ground truth. Besides, the contribution distribution in a layer is long-tailed. (b) The last chart plots the number of activated channels and the number of all channels at different layers. In high-level layers, there are many activated channels with similar effects on a decision.

The distribution of contributions. As shown in the first five curves of Fig. 6, the distribution of the channel contributions in a CNN layer is long-tailed. A small number of feature channels play the most important role for a decision of interest. In deeper layers of the network, the proportion of important feature channels decreases; the feature channels in high-level layers are usually more discriminative. This fact is in line with the accepted notion [24]. Besides, we also check how many feature channels at a CNN layer work together to determine a decision at a higher CNN layer. We call channels with $\alpha_{k}^{l}>0$ activated channels and count the number of activated channels in the decision decomposition process. As shown in the last chart of Fig. 6, when decomposing a decision from layer conv2_2 to layer conv1_2, nearly all channels in layer conv1_2 are activated. However, for the decomposition from the final decision to layer conv5_3, the number of activated channels is much smaller than the total number of channels in layer conv5_3.

(a) model randomization test
(b) data randomization test
Figure 7: Sanity check for different attribution methods using cascading model parameter and data randomization tests. SG: SmoothGrad [41], GB: Guided Backprop [30], IG: Integrated Gradients [18]. The Spearman rank correlation metric [72] is used to measure the correlation between the attribution maps of the original model and the randomized model. Low correlation means the attribution method is sensitive to the model parameters and the data labeling, and thus suitable for explaining the model decisions. Our gAP obtains low correlation values in these two tests. Best viewed zoomed in.

Channel-effect overlaps. Using the gAP module, we observe that the activation maps of some channels decomposed from the same decision often have strong activations at similar spatial locations. Such spatial locations usually denote an underlying concept [58, 59] contributing to the decision. When presenting visualizations of the hierarchical decomposition, we merge these duplicate channels with similar effects for better human understanding. Specifically, when decomposing a decision of interest into the lower layer, we obtain an activation map for each channel in this layer. We first threshold the activation maps into binary masks and then compute the Intersection-over-Union (IoU) between them. We then apply the non-maximum suppression algorithm [73] to suppress activation maps with an IoU score larger than 0.9, where the activation maps are sorted by the contribution scores from Eqn. (4). As shown in Fig. 6, we present how many activated channels have large overlaps with each other. In low-level layers, the number of activated channels with large overlaps is very small, but in high-level layers, there are many activated channels with similar effects on a decision.
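A minimal sketch of this merging step is given below, assuming binarization at zero and the 0.9 IoU threshold mentioned above; the greedy suppression over contribution-sorted channels is one plausible realization of the NMS described here, not the authors' exact implementation.

```python
# Merge channels with near-duplicate effects: binarize each activation map, then
# greedily keep channels in contribution order and drop any channel whose mask
# overlaps a kept one with IoU > 0.9. Threshold values are illustrative.
import torch

def merge_similar_channels(A, alpha, iou_thresh=0.9, bin_thresh=0.0):
    masks = (A > bin_thresh).flatten(1).float()             # one binary mask per channel
    keep = []
    for k in torch.argsort(alpha, descending=True).tolist():
        duplicate = False
        for j in keep:
            inter = (masks[k] * masks[j]).sum()
            union = masks[k].sum() + masks[j].sum() - inter
            if union > 0 and inter / union > iou_thresh:
                duplicate = True
                break
        if not duplicate:
            keep.append(k)
    return keep                                             # channel ids with distinct effects
```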

Sanity checks for gAP. Adebayo et al. [74] propose the model parameter and data randomization tests as sanity checks for visual attribution methods. These two tests check whether an attribution method is sensitive to the model parameters and the labeling of the data. An attribution method insensitive to the model parameters and data labels is inadequate for debugging the model and for explaining mechanisms that depend on the relationship between the instances and the labeling of the data. To generate saliency maps from our gAP, we hierarchically decompose the decision down to the data layer and sum all the gradients from each decomposition. We perform the model parameter randomization test on the pretrained ResNet-18 model [2] and randomly re-initialize the model parameters from the top layer to the bottom layer in a cascading manner. We utilize the Spearman rank correlation metric to compute the difference between the attribution maps from the original model and the randomly initialized model. Besides, we perform the data randomization test by comparing the saliency maps from CNNs trained with true labels and with permuted labels, respectively.

In Fig. 7(a), the low Spearman correlation indicates that the attribution maps from the original model and the randomly initialized model differ substantially, which demonstrates that gAP is sensitive to the model parameters. In Fig. 7(b), the low Spearman correlation also indicates that gAP is sensitive to the labeling of the data. The experimental results verify that our method can be used for debugging models. Visual comparisons are shown in the supplementary materials.

Figure 8: The classification accuracy of the sparser model generated by gAP and the original model. The match rate denotes the prediction agreement between the sparser model and the original model.
Figure 9: Comparisons of the classification accuracy after removing the top few most important feature channels (the lower the better). The y-axis is the classification accuracy on the ILSVRC validation set [75]. The x-axis is the number of important feature channels ablated. Conductance: [21], InternalInfluence: [22], LRP: [34].

Is the top-k decomposition a good approximation to the original model? We test the classification accuracy of the sparser surrogate model generated by gAP. Moreover, we measure the match rate by comparing the predictions of the sparser surrogate model and the original model. Specifically, we decompose from the decision of the predicted category to the bottom layers and select the top-k important features in each decomposition to make predictions. As shown in Fig. 8, when using the top-16 decomposition, the sparser surrogate model has a classification accuracy similar to the original model. According to the match rate, when using the top-16 decomposition, the predictions of the sparser surrogate model and the original model are consistent on almost all samples. The sparser surrogate model, which selects only a small number of feature channels, is thus a good approximation of the original model.

Comparison with individual-based methods. The individual-based methods [21, 22] compute the importance of each channel from different layers to the final network decision. Compared with individual-based methods, gAP can help us explore the relationships among different feature channels. To compare with them directly, we propagate the importance of each selected channel of the top layer to the shallow layers. We select the top-$N$ most important channels from different layers of VGG-16 and ablate them to observe the change in classification accuracy. We conduct experiments on the ILSVRC validation set [75]. As shown in Fig. 9, when removing the top few most important feature channels, gAP obtains lower classification accuracy than the other individual-based methods. We attribute this to the fact that gAP only propagates the contributions of those important feature channels to lower layers, which reduces the interference of other feature channels. Compared with individual-based methods, gAP can not only effectively detect the important features but also reveal how these features affect each other.

(Figure panels, left to right: prediction: cat $\rightarrow$ $\mathbf{A}_{328}^{5}$ $\rightarrow$ $\mathbf{A}_{330}^{4}$ $\rightarrow$ patterns for $\mathbf{F}_{330}^{4}$.)
Figure 10: Analysis of a failure example. The leftmost is a dog image that is misclassified as the cat category. We decompose the network decision to layer conv4_3. The rightmost shows the patterns that maximally activate the $330^{th}$ channel. For this example, the channels sensitive to attributes of the cat category have strong activations, causing VGG-16 to make a wrong decision.

4.2 Diagnosing CNN

Analyzing failure predictions of CNN. Previous work [29] can generate class activation maps for the network predictions, highlighting the most important image regions supporting the network decision. However, such an explanation is not informative enough. The hierarchical decomposition can further provide a more detailed explanation for the network decision. We decompose the network’s decision iteratively to the low-level layers and find the most important feature channels at different layers. We can see each channel’s contribution to the network decision. Further, important channels and their corresponding activation maps can also be studied.

(a) Decomposition (top: original image, picket_fence $\rightarrow$ $\mathbf{A}_{181}^{5}$, $\mathbf{A}_{146}^{5}$, $\mathbf{A}_{100}^{5}$, $\mathbf{A}_{499}^{5}$; bottom: adversarial image, church $\rightarrow$ $\mathbf{A}_{33}^{5}$, $\mathbf{A}_{440}^{5}$, $\mathbf{A}_{188}^{5}$, $\mathbf{A}_{466}^{5}$)
(b) Peak Response
(c) Average Peak Response
Figure 11: Example of adversarial attacks. (a) The top is the original image, and the bottom is the adversarial image. $\rightarrow$ denotes the decomposition. (b) plots the peak feature responses of the most important channels for the network decision in the original image and the adversarial image. (c) plots the average peak feature responses of the top 4 most important channels for the network decision in the original image and the adversarial image over the whole ILSVRC validation dataset [75]. The peak values of the important channels for the correct category largely decrease, and those for the wrong category increase by a large margin.

As shown in Fig. 10, we use the hierarchical decomposition to examine a wrong decision of the CNN. Fig. 10 demonstrates a failure case: a dog image is misclassified as the cat category with a probability of 99%. We first decompose the network decision to layer conv5_3 and find the most important channel, i.e., the $328^{th}$ channel, with a 32.3% contribution. We further decompose channel $328$ to layer conv4_3 and find the most important channel, i.e., the $330^{th}$ channel. The activation map $\mathbf{A}_{330}^{4}$ has strong activations at the ear region. Moreover, the patterns that maximally activate the $330^{th}$ channel are ear image patches of the cat category. We find that the dog’s ear in this example has a similar shape to those cat-ear image patches. We further occlude the image region of the dog’s ear and observe that the CNN then correctly predicts the dog category with a probability of 65%. With the hierarchical decomposition, we find that the CNN makes the wrong decision because it takes the dog’s ear as a cat’s ear in this example.

Analyzing adversarial attacks. Current CNN models are vulnerable to adversarial attacks. When adversarial attack algorithms add a small perturbation to the original images, CNN models easily misclassify them. To understand how adversarial images successfully fool CNN models, following [60], we study the change of the feature responses of important channels. As shown in Fig. 11(a), we present the original image (top row) and the adversarial image (bottom row). The adversarial image is generated by a popular attack algorithm [76]. VGG-16 classifies the original image as the picket_fence category (probability 92%) and the adversarial image as the church category (probability 100%). Through our decomposition from the network decision to layer conv5_3, we find the top few most important feature channels for the picket_fence and church categories, respectively.

As shown in Fig. 11(b), when comparing the adversarial image to the original image, we observe that the peak feature responses of the important channels for the picket_fence category, i.e., the $181^{st}$, $146^{th}$, $100^{th}$, and $499^{th}$ channels, largely decrease by 11.3, 14.5, 4.7, and 11.1, respectively. In contrast, the peak feature responses of the important channels for the church category, i.e., the $33^{rd}$, $440^{th}$, $188^{th}$, and $466^{th}$ channels, largely increase by 8.7, 10.4, 16.4, and 5.5, respectively. As shown in Fig. 11(c), we also compute the average peak responses of important channels on the whole ILSVRC validation dataset [75]. The adversarial attack algorithms change the feature responses of important channels to affect the final network decision: for the important channels, they reduce the correct category’s feature responses and increase the wrong category’s feature responses.

(Figure panels: boat $\rightarrow$ $\mathbf{A}_{331}^{5}$ $\rightarrow$ $\mathbf{A}_{169}^{4}$, $\mathbf{A}_{404}^{4}$, $\mathbf{A}_{115}^{4}$; bird $\rightarrow$ $\mathbf{A}_{270}^{5}$ $\rightarrow$ $\mathbf{A}_{1}^{4}$, $\mathbf{A}_{435}^{4}$, $\mathbf{A}_{175}^{4}$.)
Figure 12: The context information in the activation maps. $\rightarrow$ denotes the decomposition.
Figure 13: Context information for each category in the PASCAL VOC dataset.

The context in activation maps. Context information [77, 78] is crucial for recognition. A known prior is that a target category usually appears in a specific context: for example, boats usually appear on seas or lakes, and birds often stand on tree branches. Through our decision decomposition, we find some context in the activation maps that supports the CNN prediction. Fig. 12 shows that the $331^{st}$ channel in layer conv5_3 has strong responses to the image’s ‘boat’ region. We decompose the peak point indicated by the activation map to layer conv4_3. The $169^{th}$, $404^{th}$, and $115^{th}$ channels are the top-3 most important channels. The most important one is the $169^{th}$ channel, whose corresponding activation map locates the sea.

To quantitatively analyze the context information contained in the activation maps, we utilize the PASCAL-Context dataset [79] for evaluation. We select the images with context annotations from the PASCAL VOC validation set [69] and compute the most frequent context labels for each category. Specifically, we perform the hierarchical decomposition to layer conv4_3, obtaining the activation map for each selected channel. The activation map is first thresholded to a binary map. Then we compute the IoU between the binary activation map and each context region. The activation map is assigned with the label of the context region corresponding to the largest IoU. In Fig. 13, we have shown the top few most frequent context labels for three categories, i.e., bird, boat, and train. These categories usually appear in a specific environment. This fact suggests that the context of the objects is critical for recognition. The context information of other categories and the qualitative examples are shown in the supplementary materials.
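A minimal sketch of the context-label assignment described above is given below, assuming the context annotations are available as per-label binary masks at the activation-map resolution (resizing is omitted); the binarization threshold and annotation format are illustrative assumptions.

```python
# Assign a context label to a channel's activation map: binarize the map, compute
# IoU against each annotated context region, and take the label with the largest
# IoU. `context_masks` (dict: label -> binary mask) is an assumed format.
import torch

def assign_context_label(act_map, context_masks, bin_thresh=0.0):
    mask = (act_map > bin_thresh).float()
    best_label, best_iou = None, 0.0
    for label, region in context_masks.items():
        inter = (mask * region).sum()
        union = mask.sum() + region.sum() - inter
        iou = float(inter / union) if union > 0 else 0.0
        if iou > best_iou:
            best_label, best_iou = label, iou
    return best_label, best_iou
```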

Channel discrimination analysis. We utilize the hierarchical decomposition to explore the discriminative information of the channels in different layers. Specifically, we define a discriminative degree $D$ to measure the discriminative information of a channel. When performing the hierarchical decomposition for the images with label $c$, we count the number of times $N_{c}$ that channel $k$ ranks among the top-3 contributors to a decision. $N_{c}$ is summed over all images from the validation set. Then the discriminative degree $D$ is computed by

$D=\frac{\max\limits_{c}N_{c}}{\sum_{c=1}^{C}N_{c}}$,   (13)

where $C$ denotes the number of categories in the dataset. When channel $k$ is only decomposed from a single category, the discriminative degree is $D=1$. Conversely, $D$ attains its minimum value $1/C$ when the channel is decomposed from each category an equal number of times.
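The statistic is straightforward to compute; below is a minimal sketch assuming `top3_records` is a list of (channel, image label) pairs collected while running the hierarchical decomposition over a validation set (the collection step itself is omitted).

```python
# Discriminative degree (Eqn. 13): for each channel, D = max_c N_c / sum_c N_c,
# where N_c counts how often the channel ranks in the top-3 contributors for
# images of category c. The input format is an assumption.
from collections import defaultdict

def discriminative_degree(top3_records):
    counts = defaultdict(lambda: defaultdict(int))           # counts[k][c] = N_c
    for channel, label in top3_records:
        counts[channel][label] += 1
    return {k: max(per_c.values()) / sum(per_c.values())     # D per channel
            for k, per_c in counts.items()}
```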

Figure 14: The discriminative degrees of the disentangled channels from different layers of different CNNs on the validation set of PASCAL VOC [69] and ImageNet [68].

We apply the hierarchical decomposition to different CNNs. As shown in Fig. 14, the discriminative degrees of channels in low-level layers are very small; these channels usually have strong activations for multiple categories. This indicates that the basic features detected by channels in low-level layers are shared among different categories and thus lack discriminative information for classification. However, in high-level layers of CNNs, the discriminative degrees of the channels are much larger than those in low-level layers, because the high-level layers gradually combine basic features from low-level layers to form more discriminative features. In high-level layers, different categories tend to highlight their own discriminative channels. These results provide additional evidence for the conclusion found by Zeiler et al. [24].

Moreover, for the high-level layers of different CNNs, the discriminative degrees of the channels gradually increase with the growth of the network depth (ResNet-50 [2] > VGG-16 [1] > AlexNet [75]). This difference suggests that the high-level layers of ResNet-50 have a stronger discriminative ability. The strong discriminative ability of the channels can effectively reduce confusion among different categories, which helps ResNet-50 achieve higher classification accuracy than VGG-16 and AlexNet.

5 Limitation

The proposed hierarchical decomposition method explains an individual decision by selecting a set of strongly correlated channels from different layers of a CNN. These feature channels provide a rich hierarchy of evidence. However, raw feature channels can be difficult for a non-expert user to interpret, because not all examples are as easy to understand as the person image. In the future, we will therefore attempt to build connections between the selected feature channels and human-specified concepts for better human understanding.

Besides, following [21, 60], we have removed channels individually to study their contributions. However, as verified in [59, 80], representations are usually distributed among multiple channels. We observe that the activation maps of some channels decomposed from the same decision often have strong activations at similar spatial locations. This phenomenon suggests that multiple feature channels produce class responses together. One possible remedy to the flaw of removing channels individually is to first find those feature channels with similar effects by measuring the overlap between their corresponding activation maps, and then analyze the joint contribution of these feature channels to the network decision. In this paper, we focus on building the evidence hierarchy; addressing the issue of removing individual channels will be our future work.

6 Conclusion

We present a novel gradient-based activation propagation (gAP) scheme that can decompose any CNN layer’s decision to its lower layers. Based on gAP, the network decision can be hierarchically decomposed into a rich evidence pyramid associated with all layers of the CNN model. Our method allows users to delve deep into the CNN’s decision-making process in a top-down manner. We have experimentally verified the effectiveness of our method and demonstrated its ability to understand and diagnose CNN predictions. While we currently focus mostly on explaining CNN-based image classifiers, we will study how to generalize the framework to other tasks and other deep learning models in the future. The source code and an interactive demo website will be made publicly available.

References

  • [1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Int. Conf. Learn. Represent., 2015.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016.
  • [3] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 4700–4708.
  • [4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 580–587.
  • [5] R. Girshick, “Fast r-cnn,” in Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
  • [6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Adv. Neural Inform. Process. Syst., 2015, pp. 91–99.
  • [7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2015.
  • [8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., 2017.
  • [9] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017.
  • [10] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017.
  • [11] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu, “Traffic-sign detection and classification in the wild,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 2110–2118.
  • [12] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning lightweight lane detection cnns by self attention distillation,” in Int. Conf. Comput. Vis., 2019, pp. 1013–1021.
  • [13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Int. Conf. Medical image computing and computer-assisted intervention, 2015, pp. 234–241.
  • [14] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
  • [15] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Int. Conf. Learn. Represent., 2014.
  • [16] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Int. Conf. Learn. Represent. Worksh., 2017.
  • [17] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, “Synthesizing robust adversarial examples,” in Int. Conf. Mach. Learn., 2018.
  • [18] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in Int. Conf. Mach. Learn., 2017.
  • [19] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” in Int. Conf. Mach. Learn., 2017.
  • [20] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” in Int. Conf. Learn. Represent. Worksh., 2014.
  • [21] K. Dhamdhere, M. Sundararajan, and Q. Yan, “How important is a neuron?” Int. Conf. Learn. Represent., 2019.
  • [22] K. Leino, S. Sen, A. Datta, M. Fredrikson, and L. Li, “Influence-directed explanations for deep convolutional networks,” in 2018 IEEE International Test Conference (ITC).   IEEE, 2018, pp. 1–8.
  • [23] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, “Zoom in: An introduction to circuits,” Distill, vol. 5, no. 3, p. e00024.001, 2020.
  • [24] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Eur. Conf. Comput. Vis.   Springer, 2014.
  • [25] P. C. Giannelli, “Chain of custody and the handling of real evidence,” Am. Crim. L. Rev., vol. 20, p. 527, 1982.
  • [26] M. H. Murad, N. Asi, M. Alsawas, and F. Alahdab, “New evidence pyramid,” BMJ Evidence-Based Medicine, vol. 21, no. 4, pp. 125–127, 2016.
  • [27] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive psychology, vol. 12, no. 1, pp. 97–136, 1980.
  • [28] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016.
  • [29] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” Int. J. Comput. Vis., vol. 128, no. 2, pp. 336–359, 2020.
  • [30] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” in Int. Conf. Learn. Represent. Worksh., 2015.
  • [31] B. Zhou, Y. Sun, D. Bau, and A. Torralba, “Interpretable basis decomposition for visual explanation,” in Eur. Conf. Comput. Vis., 2018, pp. 119–134.
  • [32] A. Sung, “Ranking importance of input parameters of neural networks,” Expert systems with Applications, vol. 15, no. 3-4, pp. 405–411, 1998.
  • [33] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, “How to explain individual classification decisions,” The Journal of Machine Learning Research, vol. 11, pp. 1803–1831, 2010.
  • [34] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PloS one, vol. 10, no. 7, p. e0130140, 2015.
  • [35] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, “Explaining nonlinear classification decisions with deep taylor decomposition,” Pattern Recognition, vol. 65, pp. 211–222, 2017.
  • [36] B. Kim, J. Seo, S. Jeon, J. Koo, J. Choe, and T. Jeon, “Why are saliency maps noisy? cause of and solution to noisy saliency maps,” in Int. Conf. Comput. Vis. Worksh., 2019, pp. 4149–4157.
  • [37] S. Srinivas and F. Fleuret, “Full-gradient representation for neural network visualization,” in Adv. Neural Inform. Process. Syst., 2019, pp. 4124–4133.
  • [38] P.-J. Kindermans, K. T. Schütt, M. Alber, K.-R. Müller, D. Erhan, B. Kim, and S. Dähne, “Learning how to explain neural networks: Patternnet and patternattribution,” in Int. Conf. Learn. Represent., 2018.
  • [39] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” in Eur. Conf. Comput. Vis., 2016.
  • [40] Y. Yang, J. Qiu, M. Song, D. Tao, and X. Wang, “Learning propagation rules for attribution map generation,” in Eur. Conf. Comput. Vis., 2020, pp. 672–688.
  • [41] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “Smoothgrad: removing noise by adding noise,” in Int. Conf. Mach. Learn. Worksh., 2017.
  • [42] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling, “Visualizing deep neural network decisions: Prediction difference analysis,” in Int. Conf. Learn. Represent., 2017.
  • [43] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas et al., “Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav),” in Int. Conf. Mach. Learn.   PMLR, 2018, pp. 2668–2677.
  • [44] V. Petsiuk, A. Das, and K. Saenko, “Rise: Randomized input sampling for explanation of black-box models,” in Brit. Mach. Vis. Conf., 2018.
  • [45] M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why should I trust you?” Explaining the predictions of any classifier,” in ACM SIGKDD, 2016, pp. 1135–1144.
  • [46] R. C. Fong and A. Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” in Int. Conf. Comput. Vis., 2017, pp. 3429–3437.
  • [47] R. Fong, M. Patrick, and A. Vedaldi, “Understanding deep networks via extremal perturbations and smooth masks,” in Int. Conf. Comput. Vis., 2019, pp. 2950–2958.
  • [48] P. Dabkowski and Y. Gal, “Real time image saliency for black box classifiers,” in Adv. Neural Inform. Process. Syst., 2017.
  • [49] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 839–847.
  • [50] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-cam: Score-weighted visual explanations for convolutional neural networks,” in IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2020, pp. 24–25.
  • [51] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, “Visualizing higher-layer features of a deep network,” University of Montreal, vol. 1341, no. 3, p. 1, 2009.
  • [52] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene cnns,” in Int. Conf. Learn. Represent., 2015.
  • [53] A. Mordvintsev, C. Olah, and M. Tyka, “Inceptionism: Going deeper into neural networks,” 2015.
  • [54] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, “Understanding neural networks through deep visualization,” in Int. Conf. Mach. Learn. Worksh., 2015.
  • [55] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 5188–5196.
  • [56] A. Dosovitskiy and T. Brox, “Inverting visual representations with convolutional networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 4829–4837.
  • [57] C. Olah, A. Mordvintsev, and L. Schubert, “Feature visualization,” Distill, vol. 2, no. 11, p. e7, 2017.
  • [58] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6541–6549.
  • [59] R. Fong and A. Vedaldi, “Net2vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 8730–8738.
  • [60] D. Bau, J.-Y. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, and A. Torralba, “Understanding the role of individual units in a deep neural network,” Proceedings of the National Academy of Sciences, 2020.
  • [61] R. Chen, H. Chen, J. Ren, G. Huang, and Q. Zhang, “Explaining neural networks semantically and quantitatively,” in Int. Conf. Comput. Vis., 2019, pp. 9187–9196.
  • [62] N. Frosst and G. Hinton, “Distilling a neural network into a soft decision tree,” in CEX workshop at AIIA, 2017.
  • [63] X. Liu, X. Wang, and S. Matwin, “Improving the interpretability of deep neural networks with knowledge distillation,” in IEEE Int. Conf. Data Mining Worksh.   IEEE, 2018, pp. 905–912.
  • [64] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, “This looks like that: Deep learning for interpretable image recognition,” in Adv. Neural Inform. Process. Syst., vol. 32, 2019, pp. 8930–8941.
  • [65] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang, “Concept bottleneck models,” in Int. Conf. Mach. Learn., 2020, pp. 5338–5348.
  • [66] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute and simile classifiers for face verification,” in Int. Conf. Comput. Vis., 2009, pp. 365–372.
  • [67] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in IEEE Conf. Comput. Vis. Pattern Recog.   IEEE, 2009, pp. 951–958.
  • [68] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
  • [69] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” Int. J. Comput. Vis., 2015.
  • [70] Q. Zhang, Y. Yang, H. Ma, and Y. N. Wu, “Interpreting cnns via decision trees,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 6261–6270.
  • [71] J. Benesty, J. Chen, Y. Huang, and I. Cohen, “Pearson correlation coefficient,” in Noise reduction in speech processing.   Springer, 2009, pp. 1–4.
  • [72] P. Sedgwick, “Spearman’s rank correlation coefficient,” BMJ, vol. 349, 2014.
  • [73] A. Neubeck and L. Van Gool, “Efficient non-maximum suppression,” in 18th International Conference on Pattern Recognition (ICPR’06), vol. 3.   IEEE, 2006, pp. 850–855.
  • [74] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim, “Sanity checks for saliency maps,” in Adv. Neural Inform. Process. Syst., vol. 31, 2018.
  • [75] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Adv. Neural Inform. Process. Syst., 2012, pp. 1097–1105.
  • [76] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Int. Conf. Learn. Represent., 2018.
  • [77] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free?-weakly-supervised learning with convolutional neural networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 685–694.
  • [78] S. Kumar and M. Hebert, “A hierarchical field framework for unified context-based classification,” in Int. Conf. Comput. Vis., vol. 2.   IEEE, 2005, pp. 1284–1291.
  • [79] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 891–898.
  • [80] M. L. Leavitt and A. S. Morcos, “Selectivity considered harmful: evaluating the causal impact of class selectivity in dnns,” in Int. Conf. Learn. Represent., 2020.