Informative Class Activation Maps
Abstract
We study how to evaluate the quantitative information content of a region within an image with respect to a particular label. To this end, we bridge class activation maps with information theory and develop the informative class activation map (infoCAM). Given a classification task, infoCAM depicts how the information that partial regions carry about a label accumulates into the information carried by the entire image. We can therefore use infoCAM to locate the most informative features for a label. When applied to an image classification task, infoCAM outperforms the traditional class activation map on the weakly supervised object localisation task. We achieve state-of-the-art results on Tiny-ImageNet.
1 Introduction
Class activation maps are useful tools for identifying regions of an image that correspond to particular labels, and many weakly supervised object localisation methods are built on class activation maps (CAMs). In this paper, we formalise the connection between CAMs and information theory. Viewing neural network classifiers as mutual information estimators, we can quantify the relationship between the information that the entire image and its local regions carry about a label. We call the resulting map the Informative Class Activation Map (infoCAM), since it is grounded in information theory. InfoCAM also improves performance on the weakly supervised object localisation (WSOL) task compared with the original CAM.
2 Informative Class Activation Map
To explain infoCAM, we first introduce the concept and definition of the class activation map. We then show how to apply it to weakly supervised object localisation (WSOL).

2.1 CAM: Class Activation Map
Contemporary classification convolutional neural networks (CNNs) consist of stacks of convolutional layers interleaved with pooling layers that extract visual features. These convolutional layers produce feature maps. A feature map is a collection of 2-dimensional grids; its size depends on the structure of the convolution and pooling layers and is generally smaller than the original image, and the number of feature maps equals the number of convolutional filters. The feature maps from the final convolutional layer are usually averaged, flattened and fed into the fully-connected layer for classification (Lin et al., 2014). Given $K$ feature maps $f_1, \dots, f_K$, the fully-connected layer consists of a weight matrix $W$, where $w_k^c$ is the scalar weight connecting feature $k$ to class $c$. We use $f_k(i, j)$ to denote the value of feature map $f_k$ at the 2-dimensional spatial point $(i, j)$. In (Zhou et al., 2016), the authors propose a way to interpret the importance of each point in the feature maps. The importance of spatial point $(i, j)$ for class $c$ is defined as a weighted sum over features:
$$
M_c(i, j) = \sum_{k=1}^{K} w_k^c \, f_k(i, j) \qquad (1)
$$
We refer to $M_c(i, j)$ as the intensity of point $(i, j)$ for class $c$. The collection of these intensity values over all grid points forms the class activation map (CAM). The CAM highlights the regions of the feature space most relevant to classifying the image as class $c$. The input to the softmax layer corresponding to class label $c$ is:
$$
s_c = \sum_{(i, j)} \sum_{k=1}^{K} w_k^c \, f_k(i, j) = \sum_{(i, j)} M_c(i, j) \qquad (2)
$$
Intuitively, the weight $w_k^c$ indicates the overall importance of the $k$-th feature to class $c$, and the intensity $M_c(i, j)$ indicates how much the feature maps at spatial location $(i, j)$ contribute to classifying the image as class $c$.
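To make the computation concrete, the following is a minimal PyTorch sketch of Eqs. (1) and (2); the tensor names, shapes and the helper function are illustrative rather than part of the original method.

```python
import torch

def compute_cam(feature_maps: torch.Tensor, fc_weight: torch.Tensor, target_class: int) -> torch.Tensor:
    """Compute the class activation map M_c(i, j) of Eq. (1).

    feature_maps: (K, H, W) activations of the final convolutional layer.
    fc_weight:    (C, K) weights of the final fully-connected layer.
    target_class: index c of the class of interest.
    """
    w_c = fc_weight[target_class]                       # (K,)
    cam = torch.einsum('k,khw->hw', w_c, feature_maps)  # weighted sum over features
    return cam

# The pre-softmax logit of Eq. (2) is the sum of the CAM over all grid points:
# s_c = compute_cam(feature_maps, fc_weight, c).sum()
```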
WSOL: The aim of WSOL is to identify the region containing the target object in an image given only an image-level label, without any pixel-level supervision. Previous approaches tackle WSOL by deriving a bounding box from the CAM (Choe & Shim, 2019): the box is drawn around the locations whose CAM intensity exceeds a certain threshold and is then upsampled to match the size of the original image.
2.2 InfoCAM: Informative Class Activation Map
In (Qin et al., 2019), the authors show that a softmax classifier carries an explicit information-theoretic relationship between inputs and labels. We extend the notion of mutual information from being between an entire input image and a label to being between regions of the input image and a label, in order to capture the regions that share high mutual information with a label.
To simplify the discussion, we assume here that there is only one feature map, i.e., $K = 1$; the following results extend to the general case $K > 1$ without loss of generality. We introduce a region $R$, a subset of the grid points of the feature map.
Mutual information is the expectation of the point-wise mutual information (PMI) between two variables, i.e., $I(X; Y) = \mathbb{E}_{p(x, y)}[\mathrm{PMI}(x, y)]$ with $\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$. Given an image $x$ and a label $y$, following (Qin et al., 2019) we can estimate their PMI from the classifier logits as:

$$
\widehat{\mathrm{PMI}}(x, y) = s_y - \log \frac{1}{|\mathcal{Y}|} \sum_{y' \in \mathcal{Y}} \exp(s_{y'}),
$$

where $\mathcal{Y}$ is the set of labels and $s_y$ is the logit of Eq. (2) for label $y$.
The PMI estimate is close to $\log |\mathcal{Y}|$ when $s_y$ is the maximum argument in the log-sum-exp. To find the region that is most beneficial to the classification, we compute the difference between the PMI with the true label and the average PMI with the other labels; the log-sum-exp terms cancel, so the difference decomposes into a point-wise summation:

$$
\widehat{\mathrm{PMI}}(x, y) - \frac{1}{|\mathcal{Y}| - 1} \sum_{y' \neq y} \widehat{\mathrm{PMI}}(x, y')
= \sum_{(i, j)} \Big[ M_y(i, j) - \frac{1}{|\mathcal{Y}| - 1} \sum_{y' \neq y} M_{y'}(i, j) \Big].
$$
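Both quantities above are straightforward to compute from the classifier logits. The sketch below is a hedged illustration assuming a softmax classifier trained on balanced classes; the function names are ours, not from (Qin et al., 2019).

```python
import math
import torch

def estimate_pmi(logits: torch.Tensor, label: int) -> torch.Tensor:
    """Estimate PMI(x, y) from the logits of a softmax classifier trained
    on a balanced dataset, following the estimate above."""
    n_classes = logits.shape[-1]
    log_mean_exp = torch.logsumexp(logits, dim=-1) - math.log(n_classes)
    return logits[..., label] - log_mean_exp

def pmi_contrast(logits: torch.Tensor, label: int) -> torch.Tensor:
    """Difference between the PMI with the true label and the average PMI
    with the other labels; the log-sum-exp terms cancel out."""
    n_classes = logits.shape[-1]
    others = (logits.sum(dim=-1) - logits[..., label]) / (n_classes - 1)
    return logits[..., label] - others
```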
The point-wise decomposition suggests that we can compute the PMI difference restricted to a region. Based on this observation, we propose a new CAM, named informative CAM or infoCAM, with a new intensity function between a region $R$ and a label $y$:

$$
\mathrm{infoCAM}(R, y) = \sum_{(i, j) \in R} \Big[ M_y(i, j) - \frac{1}{|\mathcal{Y}| - 1} \sum_{y' \neq y} M_{y'}(i, j) \Big].
$$
InfoCAM highlights the region that determines the classification boundary against the other labels. We can further simplify the above expression to the difference between the PMI with the true label and the PMI with the most-unlikely label according to the classifier's output, which we denote infoCAM+, with intensity:
$$
\mathrm{infoCAM{+}}(R, y) = \sum_{(i, j) \in R} \big[ M_y(i, j) - M_{y^*}(i, j) \big], \qquad (3)
$$
where $y^* = \arg\min_{y' \in \mathcal{Y}} s_{y'}$ is the label with the smallest logit.
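Given per-class activation maps and logits, both intensities can be computed with a few tensor operations. The following sketch is illustrative only; the variable names and the helper function are our assumptions, not the authors' implementation.

```python
import torch

def info_cam(cams: torch.Tensor, logits: torch.Tensor, true_class: int, use_plus: bool = False) -> torch.Tensor:
    """Per-location infoCAM intensities from per-class activation maps.

    cams:   (C, H, W) class activation maps M_c(i, j), one per class.
    logits: (C,) classifier outputs s_c, used only to pick y* for infoCAM+.
    Summing the returned map over a region R gives infoCAM(R, y) or infoCAM+(R, y).
    """
    n_classes = cams.shape[0]
    cam_true = cams[true_class]
    if use_plus:
        # infoCAM+: contrast against the most-unlikely label y* = argmin_c s_c
        y_star = int(torch.argmin(logits))
        return cam_true - cams[y_star]
    # infoCAM: contrast against the average CAM of all other labels
    others = (cams.sum(dim=0) - cam_true) / (n_classes - 1)
    return cam_true - others
```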
The complete procedure of WSOL with infoCAM is visually illustrated in Figure 1.
3 Object Localisation with InfoCAM
In this section, we demonstrate experimental results with infoCAM for WSOL. We first describe the experimental settings and then present the results.
Table 1: WSOL accuracy on CUB-200-2011 (CUB) and Tiny-ImageNet (Tiny).

| Backbone | Method | CUB GT Loc. (%) | CUB Top-1 Loc. (%) | Tiny GT Loc. (%) | Tiny Top-1 Loc. (%) |
|---|---|---|---|---|---|
| VGG | CAM | 42.49 | 31.38 | 53.49 | 33.48 |
| VGG | CAM (ADL) | 71.59 | 53.01 | 52.75 | 32.26 |
| VGG | infoCAM | 52.96 | 39.79 | 55.50 | 34.27 |
| VGG | infoCAM (ADL) | 73.35 | 53.80 | 53.95 | 33.05 |
| VGG | infoCAM+ | 59.43 | 44.40 | 55.25 | 34.27 |
| VGG | infoCAM+ (ADL) | 75.89 | 54.35 | 53.91 | 32.94 |
| ResNet | CAM | 61.66 | 50.84 | 54.56 | 40.55 |
| ResNet | CAM (ADL) | 57.83 | 46.56 | 52.66 | 36.88 |
| ResNet | infoCAM | 64.78 | 53.22 | 57.79 | 43.34 |
| ResNet | infoCAM (ADL) | 67.75 | 54.71 | 54.18 | 37.79 |
| ResNet | infoCAM+ | 68.99 | 55.83 | 57.71 | 43.07 |
| ResNet | infoCAM+ (ADL) | 69.63 | 55.20 | 53.70 | 37.71 |
3.1 Experimental settings
We evaluate WSOL performance on CUB-200-2011 (Wah et al., 2011) and Tiny-ImageNet (Fei-Fei). CUB-200-2011 consists of 200 bird species. Since the dataset contains only birds and no other kinds of objects, variations between classes are subtle (Dubey et al., 2018), and relying on such nuances alone can degrade localisation accuracy (Choe & Shim, 2019).
Tiny-ImageNet is a reduced version of ImageNet. Compared with the full ImageNet, training classifiers on Tiny-ImageNet is faster because the images are smaller and fewer, yet classification becomes more challenging (Odena et al., 2017).

To evaluate localisation, we first need to generate a bounding box for the object within an image. We generate the bounding box in the same way as (Zhou et al., 2016): after evaluating infoCAM within each region of an image, we retain only the regions whose infoCAM values exceed 20% of the maximum infoCAM value and discard all the other regions. We then draw the smallest bounding box that covers the largest connected component.
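A minimal sketch of this box-generation step is shown below, assuming the activation map has already been upsampled to the image resolution and using SciPy's connected-component labelling; the 20% threshold follows the description above, everything else is illustrative.

```python
import numpy as np
from scipy import ndimage

def cam_to_bbox(cam: np.ndarray, threshold: float = 0.2):
    """Turn an (upsampled) activation map into a bounding box:
    keep locations above `threshold` * max, find the largest connected
    component, and return the smallest box (x0, y0, x1, y1) covering it."""
    mask = cam >= threshold * cam.max()
    labelled, n_components = ndimage.label(mask)
    if n_components == 0:
        return None
    # pick the connected component with the largest area
    sizes = ndimage.sum(mask, labelled, index=range(1, n_components + 1))
    largest = int(np.argmax(sizes)) + 1
    ys, xs = np.where(labelled == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()
```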
We follow the same evaluation metrics as (Choe & Shim, 2019), measuring localisation performance with two accuracies: 1) localisation accuracy with the ground-truth class known (GT Loc.), and 2) top-1 localisation accuracy (Top-1 Loc.). GT Loc. draws the bounding box from the ground-truth image label, whereas Top-1 Loc. draws the bounding box from the predicted most likely label and also requires correct classification. The localisation of an image is judged correct when the intersection over union (IoU) of the estimated and ground-truth bounding boxes exceeds 50%.
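For reference, the IoU criterion can be computed as in the following sketch; the box format and names are illustrative.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    return inter / (area_a + area_b - inter)

# A localisation counts as correct when iou(pred_box, gt_box) > 0.5;
# Top-1 Loc. additionally requires the predicted class to be correct.
```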

We adopt the same network architectures and hyper-parameters as (Choe & Shim, 2019), which reports the current state-of-the-art performance. Specifically, the backbones are ResNet50 (He et al., 2016) and a variation of VGG16 (Szegedy et al., 2015) in which the fully-connected layers are replaced with a global average pooling (GAP) layer to reduce the number of parameters. The standard softmax is used as the final layer since both datasets are well balanced. InfoCAM requires a region parameter $R$; we use a square region, with its size chosen separately for VGG and ResNet on CUB-200-2011 and shared by both backbones on Tiny-ImageNet.
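As an illustration only, a VGG16 backbone with a GAP-based classification head might be assembled as below in PyTorch; the exact layer sizes used by (Choe & Shim, 2019) may differ, so treat this as a sketch rather than the reference architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG16GAP(nn.Module):
    """VGG16 feature extractor with the fully-connected head replaced by
    a convolution, global average pooling and a linear classifier."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = models.vgg16().features            # conv feature maps (load pretrained weights as needed)
        self.conv = nn.Conv2d(512, 1024, kernel_size=3, padding=1)
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, x):
        f = torch.relu(self.conv(self.features(x)))        # (N, 1024, H, W) feature maps
        pooled = f.mean(dim=(2, 3))                        # global average pooling
        return self.classifier(pooled)                     # logits s_c

# CAMs follow from f and self.classifier.weight as in Eq. (1).
```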
These models are also tested with the Attention-based Dropout Layer (ADL), which addresses the localisation degradation problem (Choe & Shim, 2019). ADL randomly drops some of the most discriminative image regions during training so that CNN-based classifiers attend to the entire object. ADL-based approaches demonstrate state-of-the-art WSOL performance on CUB-200-2011 (Choe & Shim, 2019) and Tiny-ImageNet (Choe et al., 2018) and are computationally efficient. We combine ADL with infoCAM to enhance WSOL capability.
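The sketch below illustrates the general ADL mechanism as described above (randomly choosing between a drop mask and an importance map computed from the channel-averaged attention); the hyper-parameter values and implementation details are illustrative, not those of the reference implementation.

```python
import torch
import torch.nn as nn

class ADL(nn.Module):
    """Rough sketch of an attention-based dropout layer: during training,
    either erase the most discriminative region (drop mask) or re-weight
    features by an importance map, chosen at random per step."""
    def __init__(self, drop_rate: float = 0.75, gamma: float = 0.9):
        super().__init__()
        self.drop_rate = drop_rate   # probability of applying the drop mask (illustrative)
        self.gamma = gamma           # threshold relative to the attention maximum (illustrative)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return x
        attention = x.mean(dim=1, keepdim=True)                    # (N, 1, H, W) channel-averaged map
        if torch.rand(1).item() < self.drop_rate:
            max_val = attention.amax(dim=(2, 3), keepdim=True)
            mask = (attention < self.gamma * max_val).float()      # erase the peak region
        else:
            mask = torch.sigmoid(attention)                        # importance map
        return x * mask
```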
To prevent overfitting to the test data, we evenly split the original validation images into two parts: one is used for validation during training and the other serves as the final test set. We pick the model from the epoch with the highest top-1 classification accuracy on the validation set and report results on the test set. All experiments are run on two Nvidia 2080-Ti GPUs with the PyTorch deep learning framework (Paszke et al., 2017).
3.2 Experimental Results
Table 1 shows the localisation results on CUB-200-2011 and Tiny-ImageNet. The results demonstrate that infoCAM consistently improves accuracy over the original CAM for WSOL across networks and datasets. InfoCAM and infoCAM+ perform comparably to each other. ADL improves both models on CUB-200-2011, but reduces their performance on Tiny-ImageNet. We conjecture that dropping any part of a Tiny-ImageNet image with ADL strongly affects classification since the images are relatively small.
Figure 2 highlights the difference between CAM and infoCAM. InfoCAM assigns relatively high intensity across the whole object, whereas CAM focuses only on the head of the bird. Figure 5 in the Appendix presents additional visualisations comparing the localisation of CAM and infoCAM, both without ADL. In these visualisations, the bounding boxes generated from infoCAM fit the objects more closely than those from the original CAM: infoCAM tends to cover precisely the areas where the objects are, with little extraneous or missing area. For example, CAM highlights the bird heads, whereas infoCAM also covers the bird bodies.
3.3 Localisation of multiple objects with InfoCAM



So far, we have shown localisation results for a multi-class classification problem. We now extend our localisation experiments to multi-label classification. The softmax function is a generalisation of the sigmoid function, its binary special case, so we can apply infoCAM to each label of a multi-label classification problem, which is a collection of binary classification tasks.
For the experiment, we construct a double-digit MNIST dataset in which each image contains up to two digits randomly sampled from the original MNIST dataset (LeCun et al., 2010), one placed on the left side and one on the right side; some images contain only a single digit. For each side, we first decide whether to include a digit by sampling from a Bernoulli distribution with mean 0.7, and each included digit is then drawn uniformly at random. We discard images that contain no digits. Random samples from the double-digit MNIST dataset are shown in Figure 4(a).
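A rough sketch of how such a dataset could be generated is shown below; the canvas size, placement, and the way a digit is drawn (a random MNIST example rather than an exactly uniform choice over digit classes) are our illustrative assumptions.

```python
import random
import torch
from torchvision import datasets, transforms

def make_double_digit_mnist(n_images: int, p_side: float = 0.7, seed: int = 0):
    """Build images with up to two MNIST digits (left / right half), each side
    populated with probability p_side; images with no digit are discarded.
    The 28x56 canvas and digit placement are illustrative choices."""
    random.seed(seed)
    mnist = datasets.MNIST('data', train=True, download=True,
                           transform=transforms.ToTensor())
    images, labels = [], []
    while len(images) < n_images:
        canvas = torch.zeros(1, 28, 56)
        present = torch.zeros(10)                            # multi-hot label over digits 0-9
        for side, x0 in (('left', 0), ('right', 28)):
            if random.random() < p_side:
                digit_img, digit = mnist[random.randrange(len(mnist))]
                canvas[:, :, x0:x0 + 28] = digit_img
                present[digit] = 1.0
        if present.sum() == 0:                               # reject empty images
            continue
        images.append(canvas)
        labels.append(present)
    return torch.stack(images), torch.stack(labels)
```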
Table 2: Digit classification accuracy on the double-digit MNIST test set.

| Type | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| sigmoid | 1.00 | 0.84 | 0.86 | 0.94 | 0.89 | 0.87 | 0.87 | 0.86 | 1.00 | 1.00 |
| PC-sigmoid | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
We first compare the classification accuracy obtained with the original sigmoid and with PC-sigmoid, the probability-corrected variant of the sigmoid following (Qin et al., 2019). As shown in Table 2, PC-sigmoid improves the classification accuracy, reaching perfect accuracy for every digit type on the test set. InfoCAM also improves localisation accuracy for the WSOL task: CAM achieves a localisation accuracy of 91%, while infoCAM raises it to 98%. Qualitative visualisations are shown in Figure 4. We aim to preserve the regions of an image that are most relevant to a digit and erase all other regions. The visualisations show that infoCAM localises digits more accurately than CAM.
4 Conclusion
In (Qin et al., 2019), the authors convert neural network classifiers into mutual information estimators. Using the point-wise mutual information between inputs and labels, we can locate objects within images more precisely, and we provide a more information-theoretic interpretation of class activation maps. Experimental results demonstrate the effectiveness of the proposed method.
References
- Choe & Shim (2019) Choe, J. and Shim, H. Attention-based dropout layer for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2219–2228, 2019.
- Choe et al. (2018) Choe, J., Park, J. H., and Shim, H. Improved techniques for weakly-supervised object localization. arXiv preprint arXiv:1802.07888, 2018.
- Dubey et al. (2018) Dubey, A., Gupta, O., Guo, P., Raskar, R., Farrell, R., and Naik, N. Pairwise confusion for fine-grained visual classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 70–86, 2018.
- Fei-Fei, L. Tiny ImageNet visual recognition challenge. https://tiny-imagenet.herokuapp.com/. Accessed: 2019-11-03.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- LeCun et al. (2010) LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. 2010.
- Lin et al. (2014) Lin, M., Chen, Q., and Yan, S. Network in network. In International Conference on Learning Representation, 2014.
- Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2642–2651. JMLR.org, 2017.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In Proceedings of Neural Information Processing Systems, 2017.
- Qin et al. (2019) Qin, Z., Kim, D., and Gedeon, T. Rethinking softmax with cross-entropy: Neural network classifier as mutual information estimator. arXiv preprint arXiv:1911.10688, 2019.
- Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
- Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. 2011.
- Zhou et al. (2016) Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.
Appendix A Further Result
In this section, we present some further results on localisation and classification.
A.1 Localisation and Classification Result
The tables below reproduce the main results together with the classification results on CUB-200-2011 and Tiny-ImageNet. Note that the classification performance of CAM and infoCAM is identical, since infoCAM does not modify the training objective. The results can be used to understand the effect of ADL on the classification task.
CUB-200-2011:

| Backbone | Method | GT Loc. (%) | Top-1 Loc. (%) | Top-1 Cls (%) | Top-5 Cls (%) |
|---|---|---|---|---|---|
| VGG-16-GAP | CAM | 42.49 | 31.38 | 73.97 | 91.83 |
| VGG-16-GAP | CAM (ADL) | 71.59 | 53.01 | 71.05 | 90.20 |
| VGG-16-GAP | infoCAM | 52.96 | 39.79 | – | – |
| VGG-16-GAP | infoCAM (ADL) | 73.35 | 53.80 | – | – |
| VGG-16-GAP | infoCAM+ | 59.43 | 44.40 | – | – |
| VGG-16-GAP | infoCAM+ (ADL) | 75.89 | 54.35 | – | – |
| ResNet-50 | CAM | 61.66 | 50.84 | 80.54 | 94.09 |
| ResNet-50 | CAM (ADL) | 57.83 | 46.56 | 79.22 | 94.02 |
| ResNet-50 | infoCAM | 64.78 | 53.22 | – | – |
| ResNet-50 | infoCAM (ADL) | 67.75 | 54.71 | – | – |
| ResNet-50 | infoCAM+ | 68.99 | 55.83 | – | – |
| ResNet-50 | infoCAM+ (ADL) | 69.63 | 55.20 | – | – |
Tiny-ImageNet:

| Backbone | Method | GT Loc. (%) | Top-1 Loc. (%) | Top-1 Cls (%) | Top-5 Cls (%) |
|---|---|---|---|---|---|
| VGG-16-GAP | CAM | 53.49 | 33.48 | 55.25 | 79.19 |
| VGG-16-GAP | CAM (ADL) | 52.75 | 32.26 | 52.48 | 78.75 |
| VGG-16-GAP | infoCAM | 55.50 | 34.27 | – | – |
| VGG-16-GAP | infoCAM (ADL) | 53.95 | 33.05 | – | – |
| VGG-16-GAP | infoCAM+ | 55.25 | 34.27 | – | – |
| VGG-16-GAP | infoCAM+ (ADL) | 53.91 | 32.94 | – | – |
| ResNet-50 | CAM | 54.56 | 40.55 | 66.45 | 86.22 |
| ResNet-50 | CAM (ADL) | 52.66 | 36.88 | 63.21 | 83.47 |
| ResNet-50 | infoCAM | 57.79 | 43.34 | – | – |
| ResNet-50 | infoCAM (ADL) | 54.18 | 37.79 | – | – |
| ResNet-50 | infoCAM+ | 57.71 | 43.07 | – | – |
| ResNet-50 | infoCAM+ (ADL) | 53.70 | 37.71 | – | – |
A.2 Ablation Study
The tables below show the results of the ablation study on CUB-200-2011 and Tiny-ImageNet. We test the importance of three components: 1) ADL, 2) the region parameter $R$, and 3) the subtraction term in the infoCAM equation. Combined with the results in the main text, the ablation suggests that both the region parameter and the subtraction term are necessary to improve localisation, while the benefit of ADL depends on the dataset. We conjecture that ADL is not well suited to Tiny-ImageNet, since removing any part of such small images during training, which is what ADL does, hurts localisation more than it does for relatively large images.
Ablation on CUB-200-2011 (VGG backbone):

| ADL | Region $R$ | Subtraction term | GT Loc. (%) | Top-1 Loc. (%) |
|---|---|---|---|---|
| N | N | N | 42.49 | 31.38 |
| N | N | Y | 47.59 | 35.01 |
| N | Y | N | 53.40 | 40.19 |
| Y | N | N | 71.59 | 53.01 |
| Y | N | Y | 75.78 | 54.28 |
| Y | Y | N | 73.56 | 53.94 |
Ablation on Tiny-ImageNet (ResNet backbone):

| ADL | Region $R$ | Subtraction term | GT Loc. (%) | Top-1 Loc. (%) |
|---|---|---|---|---|
| N | N | N | 54.56 | 40.55 |
| N | N | Y | 54.29 | 40.51 |
| N | Y | N | 57.73 | 43.34 |
| Y | N | N | 52.66 | 36.88 |
| Y | N | Y | 52.52 | 37.08 |
| Y | Y | N | 54.15 | 37.76 |
A.3 Localisation Examples from Tiny-ImageNet
We present examples from the Tiny-ImageNet dataset in Figure 5. These examples show that infoCAM draws tighter bounding boxes around the target objects.
