
Informative Class Activation Maps

Zhenyue Qin    Dongwoo Kim    Tom Gedeon
Abstract

We study how to evaluate the quantitative information content of a region within an image for a particular label. To this end, we bridge class activation maps with information theory and develop an informative class activation map (infoCAM). Given a classification task, infoCAM depicts how the information that partial regions carry about a label accumulates to that of the entire image. We can therefore use infoCAM to locate the most informative features for a label. Applied to image classification, infoCAM outperforms the traditional class activation map on the weakly supervised object localisation task, and we achieve state-of-the-art results on Tiny-ImageNet.


1 Introduction

Class activation maps (CAMs) are useful tools for identifying the regions of an image that correspond to a particular label, and many weakly supervised object localisation methods build on them. In this paper, we formalise CAMs in terms of information theory. Viewing neural network classifiers as mutual information evaluators, we can quantify the relationship between the information that the entire image and its local regions carry about a label. We call the resulting map the Informative Class Activation Map (infoCAM), since it is grounded in information theory. Moreover, infoCAM improves performance on the weakly supervised object localisation (WSOL) task compared with the original CAM.

2 Informative Class Activation Map

To explain infoCAM, we first introduce the concept and definition of the class activation map. We then show how to apply it to weakly supervised object localisation (WSOL).

Figure 1: A visualisation of the infoCAM procedure for the WSOL task. The task is to draw a bounding box around the target object in the original image. The procedure is: 1) feed the input image into a CNN to extract its feature maps, 2) for each region within the feature maps, evaluate the PMI difference between the true label and the other labels, 3) generate a bounding box by keeping the regions whose infoCAM values exceed a threshold and finding the largest connected region, and 4) interpolate and map the bounding box back to the image.

2.1 CAM: Class Activation Map

Contemporary classification convolutional neural networks (CNNs) consist of stacks of convolutional layers interleaved with pooling layers that extract visual features. These convolutional layers produce feature maps. A feature map is a collection of 2-dimensional grids; its size depends on the structure of the convolution and pooling layers and is generally smaller than the original image, and the number of feature maps equals the number of convolutional filters. The feature maps of the final convolutional layer are usually averaged, flattened and fed into a fully-connected layer for classification (Lin et al., 2014). Given $K$ feature maps $g_1, \dots, g_K$, the fully-connected layer consists of a weight matrix $W \in \mathbb{R}^{M \times K}$, where $w_k^y$ denotes the scalar weight connecting feature $k$ to class $y$. We use $g_k(a, b)$ to denote the value of feature map $g_k$ at the 2-dimensional spatial point $(a, b)$. In (Zhou et al., 2016), the authors propose a way to interpret the importance of each point in the feature maps. The importance of spatial point $(a, b)$ for class $y$ is defined as a weighted sum over features:

M_{y}(a,b) = \sum_{k} w_{k}^{y}\, g_{k}(a,b). \qquad (1)

We redefine $M_y(a,b)$ as the intensity of the point $(a,b)$. The collection of these intensity values over all grid points forms a class activation map (CAM). The CAM highlights the most relevant regions in feature space for classifying $y$. The input to the softmax layer corresponding to the class label $y$ is:

\sum_{a,b} M_{y}(a,b) = n(\mathbf{x})_{y}. \qquad (2)

Intuitively, the weight $w_k^y$ indicates the overall importance of the $k$-th feature to class $y$, and the intensity $M_y(a,b)$ indicates how much the feature map at spatial location $(a,b)$ contributes to classifying image $\mathbf{x}$ as $y$.
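For concreteness, a minimal PyTorch sketch of this computation is given below. It is an illustration rather than the implementation used in our experiments; the tensor names and shapes are assumptions.

```python
import torch

def class_activation_map(feature_maps: torch.Tensor,
                         fc_weight: torch.Tensor,
                         y: int) -> torch.Tensor:
    """Compute M_y(a, b) = sum_k w_k^y * g_k(a, b)  (Eq. 1).

    feature_maps: tensor of shape (K, H, W), the K feature maps g_1..g_K
                  of the last convolutional layer for one image.
    fc_weight:    tensor of shape (M, K), the fully-connected weights W,
                  where row y holds the weights w^y for class y.
    y:            index of the class of interest.
    """
    w_y = fc_weight[y]                                    # (K,)
    cam = torch.einsum('k,khw->hw', w_y, feature_maps)    # weighted sum over K
    return cam

# Sanity check of Eq. 2: summing the CAM over all spatial locations recovers
# the logit n(x)_y entering the softmax (up to the global-average-pooling
# normalisation and any bias term):
# logit_y = class_activation_map(g, W, y).sum()
```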

WSOL: The aim of WSOL is to identify the region containing the target object in an image given only its label, without any pixel-level supervision. Previous approaches tackle WSOL by creating a bounding box from the CAM (Choe & Shim, 2019): the CAM is thresholded to keep the locations whose intensity exceeds a certain value, a box is drawn around them, and the box is then upsampled to match the size of the original image.

2.2 InfoCAM: Informative Class Activation Map

In (Qin et al., 2019), the authors show that, from an information-theoretic perspective, a softmax classifier can be interpreted as a mutual information estimator between inputs and labels. We extend this notion of mutual information from a pair consisting of an input image and a label to pairs of image regions and labels, so as to capture the regions that share high mutual information with a label.

To simplify the discussion, we assume here that there is only one feature map, i.e., $K = 1$; the results extend to the general case $K > 1$ without loss of generality. We introduce a region $R$ containing a subset of the grid points in the feature map $g$.

Mutual information is the expectation of the point-wise mutual information (PMI) between two variables, i.e., $\mathbb{I}(\mathbf{X}, Y) = \mathbb{E}_{\mathbf{x}, y}[\operatorname{PMI}(\mathbf{x}, y)]$. Given two instances $\mathbf{x}$ and $y$, we can estimate their PMI as:

\operatorname{PMI}(\mathbf{x}, y) = n(\mathbf{x})_{y} - \log\sum_{y'=1}^{M}\exp\big(n(\mathbf{x})_{y'}\big) + \log M.
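For concreteness, this estimate can be read directly off the classifier logits; the minimal sketch below is illustrative only, with `logits` assumed to be the length-$M$ vector $n(\mathbf{x})$ of pre-softmax scores.

```python
import math
import torch

def pmi(logits: torch.Tensor, y: int) -> torch.Tensor:
    """PMI(x, y) = n(x)_y - logsumexp(n(x)) + log M."""
    M = logits.shape[0]
    return logits[y] - torch.logsumexp(logits, dim=0) + math.log(M)
```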

The PMI is close to $\log M$ when $n(\mathbf{x})_y$ dominates the log-sum-exp, i.e., when $y$ is the most likely label. To find the region that contributes most to the classification, we compute the difference between the PMI with the true label and the average PMI with the other labels, and decompose it into a point-wise summation:

\operatorname{Diff}(\operatorname{PMI}(\mathbf{x})) = \operatorname{PMI}(\mathbf{x}, y^{*}) - \frac{1}{M-1}\sum_{y' \neq y^{*}} \operatorname{PMI}(\mathbf{x}, y')
= \sum_{(a,b) \in g}\Big( w^{y^{*}} g(a,b) - \frac{1}{M-1}\sum_{y' \neq y^{*}} w^{y'} g(a,b) \Big).

This point-wise decomposition suggests that we can compute the PMI difference restricted to a region. Based on this observation, we propose a new CAM, named informative CAM or infoCAM, with a new intensity function $M_y^{\operatorname{Diff}}(R)$ between region $R$ and label $y$, defined as follows:

M_{y}^{\operatorname{Diff}}(R) = \sum_{(a,b) \in R}\Big( w^{y} g(a,b) - \frac{1}{M-1}\sum_{y' \neq y} w^{y'} g(a,b) \Big).

The infoCAM highlights the region that determines the classification boundary against the other labels. We further simplify the above equation to the difference between the PMI with the true label and that with the most-unlikely label according to the classifier's outputs, denoted infoCAM+, with the new intensity:

M_{y}^{\operatorname{Diff}^{+}}(R) = \sum_{(a,b) \in R}\Big( w^{y} g(a,b) - w^{y'} g(a,b) \Big), \qquad (3)

where $y' = \arg\min_{m} \sum_{(a,b) \in R} w^{m} g(a,b)$.
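The two intensities reduce to simple tensor operations on the per-class CAMs. The sketch below is our own illustration rather than the released implementation; summing the returned point-wise maps over a region $R$ gives $M_y^{\operatorname{Diff}}(R)$ and $M_y^{\operatorname{Diff}^+}(R)$, and for simplicity the most-unlikely label is chosen over the whole feature map rather than per region.

```python
import torch

def info_cam_maps(feature_maps: torch.Tensor,
                  fc_weight: torch.Tensor,
                  y: int):
    """Point-wise infoCAM and infoCAM+ intensities for the true class y.

    feature_maps: (K, H, W) feature maps g_k of one image.
    fc_weight:    (M, K) fully-connected weights.
    Returns two (H, W) maps; summing either over a region R gives the
    corresponding region intensity.
    """
    M = fc_weight.shape[0]
    # CAMs of all classes at once: (M, H, W)
    cams = torch.einsum('mk,khw->mhw', fc_weight, feature_maps)

    # infoCAM: subtract the average CAM of all other classes.
    others = (cams.sum(dim=0) - cams[y]) / (M - 1)
    info_cam = cams[y] - others

    # infoCAM+: subtract the CAM of the least likely class, chosen here by
    # the smallest total activation over the whole map (a simplification of
    # the per-region argmin in Eq. 3).
    totals = cams.sum(dim=(1, 2))
    totals[y] = float('inf')          # exclude the true class
    y_neg = int(torch.argmin(totals))
    info_cam_plus = cams[y] - cams[y_neg]
    return info_cam, info_cam_plus
```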

The complete procedure of WSOL with infoCAM is visually illustrated in Figure 1.

3 Object Localisation with InfoCAM

In this section, we demonstrate experimental results with infoCAM for WSOL. We first describe the experimental settings and then present the results.

                             CUB-200-2011                  Tiny-ImageNet
                             GT Loc. (%)  Top-1 Loc. (%)   GT Loc. (%)  Top-1 Loc. (%)
VGG     CAM                    42.49        31.38            53.49        33.48
        CAM (ADL)              71.59        53.01            52.75        32.26
        infoCAM                52.96        39.79            55.50        34.27
        infoCAM (ADL)          73.35        53.80            53.95        33.05
        infoCAM+               59.43        44.40            55.25        34.27
        infoCAM+ (ADL)         75.89        54.35            53.91        32.94
ResNet  CAM                    61.66        50.84            54.56        40.55
        CAM (ADL)              57.83        46.56            52.66        36.88
        infoCAM                64.78        53.22            57.79        43.34
        infoCAM (ADL)          67.75        54.71            54.18        37.79
        infoCAM+               68.99        55.83            57.71        43.07
        infoCAM+ (ADL)         69.63        55.20            53.70        37.71
Table 1: Localisation results of CAM and infoCAM on CUB-200-2011 and Tiny-ImageNet. InfoCAM outperforms CAM on object localisation with the same model architecture. Bold values represent the highest accuracy for a given metric.

3.1 Experimental settings

We evaluate WSOL performance on CUB-200-2011 (Wah et al., 2011) and Tiny-ImageNet (Fei-Fei). CUB-200-2011 consists of 200 bird species. Since the dataset depicts only birds and no other kinds of objects, variations between classes are subtle (Dubey et al., 2018). Relying on such subtle cues can degrade localisation accuracy (Choe & Shim, 2019).

Tiny-ImageNet is a reduced version of ImageNet. Compared with the full ImageNet, training classifiers on Tiny-ImageNet is faster because of the reduced image resolution and the smaller number of images, yet classification becomes more challenging (Odena et al., 2017).

Figure 2: Visualisation of comparison between CAM and infoCAM+. Red and green boxes represent the ground truth and prediction, respectively. Brighter regions represent higher CAM or infoCAM+ values.

To evaluate localisation, we first need to generate a bounding box for the object within an image. We generate bounding boxes in the same way as (Zhou et al., 2016): after evaluating infoCAM within each region of an image, we retain only the regions whose infoCAM values exceed 20% of the maximum infoCAM value and discard all the other regions. We then draw the smallest bounding box that covers the largest connected component.
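A minimal sketch of this box-generation step is shown below. It is our own illustration, not the experimental code: it uses scipy.ndimage for connected components and rescales the box coordinates to the original image instead of interpolating the map.

```python
import numpy as np
from scipy import ndimage

def cam_to_bbox(cam: np.ndarray, image_size: tuple, threshold: float = 0.2):
    """Turn an (h, w) infoCAM into a box (x_min, y_min, x_max, y_max) on the image.

    Keep locations above `threshold` * max, take the largest connected
    component, fit the smallest covering box, then rescale the box to
    `image_size` = (H, W) of the original image.
    """
    mask = cam >= threshold * cam.max()
    labelled, n = ndimage.label(mask)
    if n == 0:                               # fall back to the full image
        return 0.0, 0.0, float(image_size[1]), float(image_size[0])
    # size of each connected component (labels start at 1)
    sizes = ndimage.sum(mask, labelled, index=range(1, n + 1))
    largest = (labelled == (int(np.argmax(sizes)) + 1))
    ys, xs = np.where(largest)
    h, w = cam.shape
    sy, sx = image_size[0] / h, image_size[1] / w
    return xs.min() * sx, ys.min() * sy, (xs.max() + 1) * sx, (ys.max() + 1) * sy
```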

Following (Choe & Shim, 2019), we evaluate localisation performance with two accuracy measures: 1) localisation accuracy with known ground-truth class (GT Loc.), and 2) top-1 localisation accuracy (Top-1 Loc.). GT Loc. draws the bounding box given the ground-truth image label, whereas Top-1 Loc. draws the bounding box from the most likely predicted label and also requires correct classification. The localisation of an image is judged correct when the intersection over union (IoU) of the estimated and ground-truth bounding boxes is greater than 50%.
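For reference, a minimal sketch of the IoU criterion shared by both metrics, assuming boxes in (x_min, y_min, x_max, y_max) format:

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# GT Loc.:    correct if iou(pred_box_for_true_label, gt_box) > 0.5
# Top-1 Loc.: additionally requires the predicted label to be correct
```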

Figure 3: Visualisation of localisation with ResNet50 without using ADL on CUB-200-2011. Images in the second and the third row correspond to CAM and infoCAM+, respectively. Estimated (green) and ground-truth (red) bounding boxes are shown separately.

We adopt the same network architectures and hyper-parameters as (Choe & Shim, 2019), which reports the current state-of-the-art performance. Specifically, the backbones are ResNet50 (He et al., 2016) and a variant of VGG16 (Szegedy et al., 2015) in which the fully-connected layers are replaced with global average pooling (GAP) layers to reduce the number of parameters. The standard softmax is used as the final layer since both datasets are well balanced. InfoCAM requires the region parameter $R$, for which we use a square region. The side length of $R$ is set to 5 and 4 for VGG and ResNet on CUB-200-2011, respectively, and to 3 for both VGG and ResNet on Tiny-ImageNet.
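Because $R$ is a square region, the region intensity for every candidate position can be computed at once by convolving the point-wise map with a box filter; a short sketch under this assumption:

```python
import torch
import torch.nn.functional as F

def region_scores(point_map: torch.Tensor, region_size: int) -> torch.Tensor:
    """Sum a point-wise infoCAM over every square region of side `region_size`.

    point_map: (H, W) point-wise infoCAM intensities.
    Returns an (H - region_size + 1, W - region_size + 1) map whose entry
    (i, j) is the region intensity of the square with top-left corner (i, j).
    """
    x = point_map[None, None]                            # (1, 1, H, W)
    kernel = torch.ones(1, 1, region_size, region_size)  # box filter
    return F.conv2d(x, kernel)[0, 0]
```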

These models are also tested with the Attention-based Dropout Layer (ADL), which addresses the localisation-degradation problem (Choe & Shim, 2019). ADL randomly hides some of the most discriminative image regions during training so that CNN-based classifiers attend to the entire object. ADL-based approaches show state-of-the-art WSOL performance on CUB-200-2011 (Choe & Shim, 2019) and Tiny-ImageNet (Choe et al., 2018) and are computationally efficient. We combine ADL with infoCAM to enhance WSOL capability.
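For intuition, a simplified sketch of such an attention-based dropout layer is given below. It follows the description above (randomly hiding the most discriminative locations during training) but is not the reference ADL implementation, and the hyper-parameter names are placeholders.

```python
import torch
import torch.nn as nn

class SimplifiedADL(nn.Module):
    """Simplified sketch of an attention-based dropout layer.

    During training it either (i) drops the most discriminative spatial
    locations, or (ii) rescales features by an importance map, chosen at
    random on each forward pass.
    """
    def __init__(self, drop_rate: float = 0.75, gamma: float = 0.9):
        super().__init__()
        self.drop_rate, self.gamma = drop_rate, gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        if not self.training:
            return x
        attention = x.mean(dim=1, keepdim=True)            # (B, 1, H, W)
        if torch.rand(1).item() < self.drop_rate:
            peak = attention.amax(dim=(2, 3), keepdim=True)
            mask = (attention < self.gamma * peak).float()  # hide the peaks
        else:
            mask = torch.sigmoid(attention)                 # importance map
        return x * mask
```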

To prevent overfitting to the test data, we split the original validation images evenly into two halves, one used for validation during training and the other serving as the final test set. We pick the model from the epoch with the highest top-1 classification accuracy on the validation set and report results on the test set. All experiments are run on two Nvidia 2080 Ti GPUs with the PyTorch deep learning framework (Paszke et al., 2017).

3.2 Experimental Results

Table 1 shows the localisation results on CUB-200-2011 and Tiny-ImageNet. The results demonstrate that infoCAM consistently improves over the original CAM for WSOL across networks and datasets, and infoCAM and infoCAM+ perform comparably. ADL improves both models on CUB-200-2011 but reduces performance on Tiny-ImageNet. We conjecture that dropping any part of a Tiny-ImageNet image with ADL significantly harms classification since the images are relatively small.

Figure 2 highlights the difference between CAM and infoCAM: infoCAM assigns relatively high intensity to the whole object, whereas CAM focuses only on the head of the bird. Figure 5 in the Appendix presents additional visualisations comparing the localisation performance of CAM and infoCAM, both without the assistance of ADL. From these visualisations, we observe that the bounding boxes generated from infoCAM fit the objects more closely than those from the original CAM. That is, infoCAM tends to cover precisely the areas where objects exist, with almost no extraneous or missing areas; for example, CAM highlights only the bird heads, whereas infoCAM also covers the bird bodies.

3.3 Localisation of multiple objects with InfoCAM

(a) Original   (b) CAM   (c) InfoCAM
Figure 4: Visualisation of comparison between CAM and infoCAM for the multi-MNIST dataset. Each image contains one or two digits, on the left and/or right side. We aim to extract digit 0.

So far, we have shown localisation results for a multi-class classification problem. We further extend our experiments to multi-label classification. A softmax function is a generalisation of its binary counterpart, the sigmoid function; therefore, we can apply infoCAM to each label of a multi-label classification problem, which is a collection of binary classification tasks.
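A sketch of this per-label treatment is given below, assuming a multi-label head with one weight row and one sigmoid per label; the function name and the presence threshold are our own illustrative choices, and the region-wise infoCAM machinery of Section 2.2 can then be applied to each returned per-label map.

```python
import torch

def per_label_cams(feature_maps: torch.Tensor,
                   fc_weight: torch.Tensor,
                   present_threshold: float = 0.0):
    """Per-label activation maps for a multi-label (sigmoid) head.

    feature_maps: (K, H, W) feature maps of one image.
    fc_weight:    (L, K) one weight row per label, each followed by a sigmoid.
    Returns {label_index: (H, W) map} for labels predicted present, i.e.
    whose logit (spatial sum of the map, ignoring any bias) is positive.
    """
    cams = torch.einsum('lk,khw->lhw', fc_weight, feature_maps)  # (L, H, W)
    logits = cams.sum(dim=(1, 2))                                # one logit per label
    present = torch.nonzero(logits > present_threshold).flatten()
    return {int(l): cams[l] for l in present}
```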

For the experiment, we construct a double-digit MNIST dataset in which each image contains up to two digits randomly sampled from the original MNIST dataset (LeCun et al., 2010): one digit on the left side and the other on the right side, with some images containing only a single digit. For each side, we first decide whether to include a digit by drawing from a Bernoulli distribution with a mean of 0.7; each included digit is then sampled uniformly at random. We remove images that contain no digits. Random samples from the double-digit MNIST dataset are shown in Figure 4(a).
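A minimal sketch of this construction is given below; torchvision's MNIST is used as a stand-in for the original dataset, and the canvas layout constants are assumptions of the sketch.

```python
import random
import numpy as np
from torchvision.datasets import MNIST

def make_double_digit_mnist(n_images: int = 10000, p_side: float = 0.7, seed: int = 0):
    """Build images with up to two MNIST digits (left and/or right side).

    Each side independently contains a digit with probability p_side; canvases
    that would contain no digit are re-drawn.  The 28 x 56 canvas size is an
    assumption of this sketch.
    """
    rng = random.Random(seed)
    mnist = MNIST(root='./data', train=True, download=True)
    digits = np.stack([np.array(img) for img, _ in mnist])
    labels = np.array([lbl for _, lbl in mnist])

    images, image_labels = [], []
    while len(images) < n_images:
        canvas, present = np.zeros((28, 56), dtype=np.uint8), []
        for x0 in (0, 28):                       # left side, right side
            if rng.random() < p_side:
                i = rng.randrange(len(digits))   # uniform over MNIST examples
                canvas[:, x0:x0 + 28] = digits[i]
                present.append(int(labels[i]))
        if present:                              # discard empty canvases
            images.append(canvas)
            image_labels.append(present)
    return images, image_labels
```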

Type          Digit Classification Accuracy
              0     1     2     3     4     5     6     7     8     9
sigmoid       1.00  0.84  0.86  0.94  0.89  0.87  0.87  0.86  1.00  1.00
PC-sigmoid    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
Table 2: Comparison of classification accuracy with sigmoid and PC-sigmoid on the double-digit MNIST dataset.

We first compare classification accuracy between the original sigmoid and PC-sigmoid, the sigmoid counterpart of the probability-corrected softmax in (Qin et al., 2019). As shown in Table 2, PC-sigmoid increases the classification accuracy for every digit type on the test set. InfoCAM also improves localisation accuracy on the WSOL task: CAM achieves a localisation accuracy of 91%, and infoCAM raises it to 98%. Qualitative visualisations are displayed in Figure 4; we aim to preserve the regions of an image that are most relevant to a digit and erase all the other regions. The visualisations show that infoCAM localises digits more accurately than CAM.

4 Conclusion

In (Qin et al., 2019), the authors convert neural network classifiers into mutual information estimators. Building on this view, we use the point-wise mutual information between inputs and labels to locate objects within images more precisely, and we provide an information-theoretic interpretation of class activation maps. Experimental results demonstrate the effectiveness of our proposed method.

References

  • Choe & Shim (2019) Choe, J. and Shim, H. Attention-based dropout layer for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2219–2228, 2019.
  • Choe et al. (2018) Choe, J., Park, J. H., and Shim, H. Improved techniques for weakly-supervised object localization. arXiv preprint arXiv:1802.07888, 2018.
  • Dubey et al. (2018) Dubey, A., Gupta, O., Guo, P., Raskar, R., Farrell, R., and Naik, N. Pairwise confusion for fine-grained visual classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  70–86, 2018.
  • Fei-Fei Fei-Fei, L. Tiny imagenet visual recognition challenge. https://tiny-imagenet.herokuapp.com/. Accessed: 2019-11-03.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  770–778, 2016.
  • LeCun et al. (2010) LeCun, Y., Cortes, C., and Burges, C. Mnist handwritten digit database. 2010.
  • Lin et al. (2014) Lin, M., Chen, Q., and Yan, S. Network in network. In International Conference on Learning Representation, 2014.
  • Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  2642–2651. JMLR. org, 2017.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In Proceedings of Neural Information Processing Systems, 2017.
  • Qin et al. (2019) Qin, Z., Kim, D., and Gedeon, T. Rethinking softmax with cross-entropy: Neural network classifier as mutual information estimator. arXiv preprint arXiv:1911.10688, 2019.
  • Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1–9, 2015.
  • Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The caltech-ucsd birds-200-2011 dataset. 2011.
  • Zhou et al. (2016) Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2921–2929, 2016.

Appendix A Further Result

In this section, we present some further results on localisation and classification.

A.1 Localisation and Classification Result

Tables 3 and 4 reproduce the main results together with classification accuracy. Note that the classification performance of CAM and infoCAM is identical since infoCAM does not modify the training objective. These results can be used to understand the effect of ADL on the classification task.

                             GT Loc. (%)  Top-1 Loc. (%)  Top-1 Cls (%)  Top-5 Cls (%)
VGG-16-GAP  CAM                42.49        31.38           73.97          91.83
            CAM (ADL)          71.59        53.01           71.05          90.20
            infoCAM            52.96        39.79           -              -
            infoCAM (ADL)      73.35        53.80           -              -
            infoCAM+           59.43        44.40           -              -
            infoCAM+ (ADL)     75.89        54.35           -              -
ResNet-50   CAM                61.66        50.84           80.54          94.09
            CAM (ADL)          57.83        46.56           79.22          94.02
            infoCAM            64.78        53.22           -              -
            infoCAM (ADL)      67.75        54.71           -              -
            infoCAM+           68.99        55.83           -              -
            infoCAM+ (ADL)     69.63        55.20           -              -
Table 3: Evaluation results of CAM and infoCAM on CUB-200-2011. Note that the classification accuracy of infoCAM is the same as that of CAM. InfoCAM always outperforms CAM on object localisation under the same model architecture.
                             GT Loc. (%)  Top-1 Loc. (%)  Top-1 Cls (%)  Top-5 Cls (%)
VGG-16-GAP  CAM                53.49        33.48           55.25          79.19
            CAM (ADL)          52.75        32.26           52.48          78.75
            infoCAM            55.50        34.27           -              -
            infoCAM (ADL)      53.95        33.05           -              -
            infoCAM+           55.25        34.27           -              -
            infoCAM+ (ADL)     53.91        32.94           -              -
ResNet-50   CAM                54.56        40.55           66.45          86.22
            CAM (ADL)          52.66        36.88           63.21          83.47
            infoCAM            57.79        43.34           -              -
            infoCAM (ADL)      54.18        37.79           -              -
            infoCAM+           57.71        43.07           -              -
            infoCAM+ (ADL)     53.70        37.71           -              -
Table 4: Evaluation results of CAM and infoCAM on Tiny-ImageNet. Note that the classification accuracy of infoCAM is the same as that of CAM. InfoCAM always outperforms CAM on object localisation under the same model architecture.

A.2 Ablation Study

Table 5 shows the results of the ablation study. We test the importance of three features: 1) ADL, 2) the region parameter $R$, and 3) the subtraction term in the infoCAM equation. Combined with the results in the main text, the ablation suggests that both the region parameter and the subtraction term are necessary to improve localisation, whereas the benefit of ADL depends on the dataset. We conjecture that ADL is ill-suited to Tiny-ImageNet: removing any part of a small image during training, which is what ADL does, harms localisation more than it does for relatively large images.

ADL  R  Subtraction Term   GT Loc. (%)   Top-1 Loc. (%)
N    N  N                  42.49         31.38
N    N  Y                  47.59 ↑       35.01 ↑
N    Y  N                  53.40 ↑       40.19 ↑
Y    N  N                  71.59         53.01
Y    N  Y                  75.78 ↑       54.28 ↑
Y    Y  N                  73.56 ↑       53.94 ↑
(a) Localisation results on CUB-200-2011 with VGG-GAP.

ADL  R  Subtraction Term   GT Loc. (%)   Top-1 Loc. (%)
N    N  N                  54.56         40.55
N    N  Y                  54.29 ↓       40.51 ↓
N    Y  N                  57.73 ↑       43.34 ↑
Y    N  N                  52.66         36.88
Y    N  Y                  52.52 ↓       37.08 ↑
Y    Y  N                  54.15 ↑       37.76 ↑
(b) Localisation results on Tiny-ImageNet with ResNet50.

Table 5: Ablation study on the importance of the region parameter R and the subtraction term in the formulation of infoCAM. Y and N indicate whether the corresponding feature is used. Arrows indicate the change in performance relative to the case in which neither feature is used.

A.3 Localisation Examples from Tiny-ImageNet

We present examples from the Tiny-ImageNet dataset in Figure 5. These examples show that infoCAM draws tighter bounding boxes around the target objects.

Figure 5: Visualisation of localisation with ResNet50 on CUB-200-2011 and Tiny-ImageNet, without the assistance of ADL. The images in the second row are generated by the original CAM approach and those in the third row correspond to infoCAM. The red and green bounding boxes show the ground truth and the estimation, respectively.