Fine-Grained Attention for Weakly Supervised Object Localization

Junghyo Sohn¹ Eunjin Jeon² Wonsik Jung² Eunsong Kang² Heung-Il Suk^1,2,
¹Department of Artificial Intelligence, Korea University, Korea
²Department of Brain and Cognitive Engineering, Korea University, Korea
{jhsohn0633, eunjinjeon, ssikjeong1, eunsong1210, hisuk} @korea.ac.kr Corresponding author.

Abstract

Although recent advances in deep learning accelerated an improvement in a weakly supervised object localization (WSOL) task, there are still challenges to identify the entire body of an object, rather than only discriminative parts. In this paper, we propose a novel residual fine-grained attention (RFGA) module that autonomously excites the less activated regions of an object by utilizing information distributed over channels and locations within feature maps in combination with a residual operation. To be specific, we devise a series of mechanisms of triple-view attention representation, attention expansion, and feature calibration. Unlike other attention-based WSOL methods that learn a coarse attention map, having the same values across elements in feature maps, our proposed RFGA learns fine-grained values in an attention map by assigning different attention values for each of the elements. We validated the superiority of our proposed RFGA module by comparing it with the recent methods in the literature over three datasets. Further, we analyzed the effect of each mechanism in our RFGA and visualized attention maps to get insights.

Refer to caption — Figure 1: (a) Overview of our RFGA method that generates fine-grained attended maps to be efficient for WSOL by considering triple-view attentions (channel, height, and width before a classifier. The full attention map is generated by an outer sum of triple-view attentions. (b) Comparison of CAM [40] and our RFGA with respect to an activation map (left) and a localization (right). For localization, red and green boxes denote the ground-truth and predicted bounding boxes, respectively. In addition, green-masked region indicates the activation map after applying a threshold.

1 Introduction

During the last decade, researchers have developed various forms of deep learning-based models and achieved remarkable performance in object localization of inferring the bounding box of objects in natural images [6, 9, 23]. However, from the learning and data efficiency perspectives, the major limitation of those works is the use of a fully-labeled dataset for supervision. Undoubtedly, it is labor-intensive and time-consuming to make such a fully-labeled dataset, thus causing their applicability limited in practice.

Meanwhile, Weakly Supervised Object Localization (WSOL) methods, which employ only class labels but no information of the targeted bounding box of objects [40, 29, 41, 24, 36, 37, 5, 32, 21, 1], has attracted by showing their great potential for the same task, being trained in a more data-efficient manner. The main idea of WSOL methods is to detect the class-discriminative regions via an object recognition task and to utilize those regions for the localization of the identified object. For example, Class Activation Map (CAM) [40] that estimates the class-specific discriminative regions based on the inferred class scores is one of the representative methods in WSOL. In the meantime, various studies [40, 41, 29, 24, 36, 37, 5, 32, 21, 1, 4] addressed that CAM-based methods is not capable of capturing the overall object regions in a finer way, because it focuses on only the class-discriminative regions, disregarding non-discriminative regions. Hence, many of the output bounding boxes are not tight enough to the target object by resulting in either over-sized or under-sized. There have been efforts to tackle these challenges via diverse network architectures or learning strategies [30, 29, 24, 41, 16, 36, 5, 21, 35, 5, 21, 1].

In principle, those methods devised different kinds of mechanisms to mitigate the major problem of focusing on only the discriminative regions in localization, by intentionally corrupting (i.e., erasing) an input image [35, 24, 29] or a feature [36, 21] or generating an attention map. For the image corruption methods, two different strategies were exploited, namely, random corruption and network-guided corruption. First, the random corruption approach removes a small patch within an image at random and uses the corrupted image to learn richer feature representations [24, 35]. This helps the trained network to discover diverse discriminative representations, thus to detect more object-related regions. The network-guided corruption approach adaptively corrupts by dropping out the most discriminative regions based on the integrated activation maps [29, 36, 5, 21]. As for the attention-based methods [5, 37], they use a specially designed module to generate an attention map, based on which the most discriminative regions are hidden to capture the integral extent of an object.

While those methods helped improve the performance, they have limitations and issues that should be considered further. First, the random-corruption approach [24, 35] potentially disrupts the network learning due to unexpected information loss [36, 5]. For example, if the object-characteristic parts were removed from an input image, a network is enforced to discover other parts from the remaining regions. Obviously, when there exists no further discriminative region, it would be trained in a wrong way. Second, the network-guided corruption approach introduces additional hyperparameters to determine the most discriminative regions and their sizes in activation maps. Third, the attention-based methods [29, 36, 5, 21] mostly exploit the coarse information in the form of channel or spatial attention and apply the same attention values to units in feature maps.

In this paper, we propose a novel fine-grained attention method that efficiently and accurately localizes object-related regions in an image. Specifically, we propose a new mechanism to generate a fine-level attention map that allows to utilize a series of information distributed over channels and locations within feature maps. The fine-level attention map is in the same size of as the input feature maps to the attention-based module, thus the attention is assigned for each of the units across feature maps and channels. Compared to the corruption-based approaches, our proposed method doesn’t need to mask patches in an image and doesn’t have additional hyperparameters for most discriminative regions selection.

The main contributions of our work are three-fold:

•

We propose a novel mechanism to represent a fine-grained attention that allows us to utilize feature representations globally in high resolution, thus to localize an object accurately.
•

In combination with a residual connection, our attention module autonomously concentrates on the less activated regions. Accordingly, it is more likely to focus on other informative regions of an object in an image.
•

In the experiments, our proposed method, Residual Fine-Grained Attention (RFGA), achieved the state-of-the-art object localization performances in the metrics of mIOU and MaxBoxAcc [4] on three datasets, i.e., CUB-200-2011 [25], FGVC Aircraft [22], and Stanford Dogs [15].

2 Related Work

Weakly supervised object localization. WSOL can be mainly categorized into two approaches depending on the selection method of erasing (i.e., corrupting) regions: (1) random corruption [24, 35] and (2) network-guided currption methods [36, 37, 5, 21]. With regard to the random corruption method, Singh and Lee [24] devised Hide-and-Seek (HaS), which randomly drops the patches of input images in order to encourage the network to find other relevant regions rather than only focus on the most discriminative parts of an object. Yun et al. [35] introduced CutMix where the randomly erasing (i.e., cutting) patches are filled with patches of another class and the corresponding labels are also mixed. Though these methods have been considered as an efficient data augmentation method due to their no requirement of parameters, the random corruption can negatively affect localization performance due to its brute-force elimination of the input images [1, 36, 5].

For the network-guided corruption methods [36, 5, 37, 21], the most discriminative regions of the original image or feature map are dropped with a threshold (i.e., drop rate). Zhang et al. [36] proposed an Adversarial Complementary Learning (ACoL) to find the complementary regions through an adversarial learning between two parallel-classifiers; one to erase discriminative regions and the other to learn other discriminative regions except for the erased regions. Similar to ACoL, Choe et al. [5] introduced an Attention-based Dropout Layer (ADL), which generates a drop mask and an importance map by utilizing the self-attention mechanism and then randomly selects one of them for thresholded feature maps. In addition to these methods, [36, 5, 37, 21] also exploited a self-attention mechanism to identify discriminative regions. However, they all require a drop rate as a criterion of the masking. In these regards, our proposed RFGA is capable of discovering from the less and to the more discriminative regions by using a novel self-attention module, without setting up the drop rate.

Attention based Deep Neural Networks. Attention mechanisms have widely used to enhance the representational power of features for their tasks. Among various attention mechanisms [34, 8, 26, 39, 7, 13, 28, 38, 2], here, we focus on a context fusion based mechanism [12, 31, 27, 19, 42, 20, 17, 19, 11, 18, 33] that strengthens the feature maps to be more meaningful by aggregating information from every pixel. For instance, Hu et al. [12] proposed a Squeeze-and-Excitation Network (SENet), a highly simple and efficient gating mechanism to consider the channel-wise relationships among feature maps of the basic architectures. Likewise, Woo et al. [31] devised a Convolutional Block Attention Module (CBAM) that sequentially combines two separate attention maps for channel and spatial. Different from SENet [12], CBAM [31] additionally considered spatial attention which involves “where” to focus. Moreover, to alleviate a limitation of SENet [12] that utilizes fully-connected layers in order for the reduction of the computational cost at the expense of the association between channel and weight, Wang et al. [27] introduced an Efficient Channel Attention Network (ECA-Net) [27] that deploys a 1D convolutional layer to obtain cross-channel attention, while maintaining lower model complexity.

However, since [12, 31, 27] emphasized meaningful features by multiplying the same attention values, where the different information corresponding to spatial (i.e., height and weight) or channel dimensions might be ignored, they can be unsuitable for WSOL where the fine location information is demanded. Meanwhile, our RFGA generates a detailed attention map that has different attention values across all regions by inferring the intersection of triple-view (i.e., height, weight, and channel) attentions.

3 Residual Fine-Grained Attention

In this section, we present the details of our proposed residual fine-grained attention (RFGA) module. The RFGA is applied on the output feature maps before feeding into a classifier (Fig. 1) to induce the model to learn the entire region of an object. Hereafter, we regard the output feature maps as a feature tensor (3D) without loss of generality.

Our RFGA generates a self-attention tensor, which is generated from three types of view-dependent attention maps by projecting the input feature tensor into channel, width, and height dimensions, respectively. On the contrary to this, the existing works [12, 27] primarily consider channel-wise attention by ignoring the spatial characteristics distributed over the different maps in a feature tensor. The RFGA-generated attention tensor presents a fine-grained characteristic in the sense of assigning different attention values for each of the elements in a tensor. Notably, the residual connection in RFGA leads the attention tensor to focus on less discriminative areas of an object as well. In these regards, the final output feature tensor presents an enriched representation resulting in better object localization output, even relatively less discriminative feature for classification. The overall architecture of the proposed RFGA is illustrated in Fig. 2 and the detailed descriptions are given below.

3.1 Triple-view Attentions

Let $\mathbf{X}\in\mathbb{R}^{C\times H\times W}$ be an input feature tensor, where $C$ , $H$ , and $W$ denote the dimension of the channel, height, and width, respectively. To condense the global distribution of an input feature tensor $\mathbf{X}$ in the triple views, we apply an average pooling in each dimension of the tensor, i.e., channel, height, and width as follows:

	$\displaystyle\mathbf{c}=\text{AvgPool}_{\text{wh}}(\mathbf{X})$		(1)
	$\displaystyle\mathbf{h}=\text{AvgPool}_{\text{w}}(\mathbf{X})$		(2)
	$\displaystyle\mathbf{w}=\text{AvgPool}_{\text{h}}(\mathbf{X})$		(3)

where $\text{AvgPool}_{\{\cdot\}}$ is an average pooling operator with respect to the dimensions of $\{\cdot\}$ .

After computations in Eq. (1)-(3), the three pooled features of $\mathbf{c}\in\mathbb{R}^{C\times 1\times 1}$ , $\mathbf{h}\in\mathbb{R}^{C\times H\times 1}$ , and $\mathbf{w}\in\mathbb{R}^{C\times 1\times W}$ can be regarded as a summary of the extracted features in $\mathbf{X}$ from different viewpoints. Surely, the three views carry different information distributed in the input feature tensor $\mathbf{X}$ . That is, $\mathbf{c}$ captures which feature representations are highly activated, and $\mathbf{h}$ and $\mathbf{w}$ reflect, respectively, where the discriminative features distributed vertically and horizontally across channels.

Subsequently, in order to utilize their local interaction among units in each pooled feature [27], we apply a 1D convolution with a kernel size of $k$ and zero-padding without biases, thus keeping their dimensionality. Then, a batch normalization [14] and a non-linear activation function are applied as follows:

	$\displaystyle\mathbf{z}_{\text{h}}=\sigma(\text{BN}(\mathbf{W}_{\text{h}}(\mathbf{h})))$		(4)
	$\displaystyle\mathbf{z}_{\text{w}}=\sigma(\text{BN}(\mathbf{W}_{\text{w}}(\mathbf{w})))$		(5)
	$\displaystyle\mathbf{z}_{\text{c}}=\sigma(\text{BN}(\mathbf{W}_{\text{c}}(\mathbf{c})))$		(6)

where $\sigma(\cdot)$ is a sigmoid function and $\mathbf{W}_{\{\cdot\}}$ indicates the 1D convolutional layer for the respective pooled features. Here, $\mathbf{z}_{\text{h}}\in\mathbb{R}^{C\times H\times 1}$ , $\mathbf{z}_{\text{w}}\in\mathbb{R}^{C\times 1\times W}$ , and $\mathbf{z}_{\text{c}}\in\mathbb{R}^{C\times 1\times 1}$ corresponds to the resulting triple-view attentions.

3.2 Attentions Expansion

We propose to expand the triple-view attentions of $\mathbf{z}_{\text{h}}$ , $\mathbf{z}_{\text{w}}$ , and $\mathbf{z}_{\text{c}}$ in the size of an input feature tensor $\mathbf{X}$ , by which it is beneficial to reflect the attention information back into the input feature tensor $\mathbf{X}$ in a fine-grained manner. Therefore, we create an attention map $\mathbf{M}\in\mathbb{R}^{C\times H\times W}$ of the same size of the input feature map $\mathbf{X}$ by means of an expansion function $f$ defined by an outer sum as follows:

\mathbf{M}=\sigma(f(\mathbf{z}_{\text{h}},\mathbf{z}_{\text{w}},\mathbf{z}_{\text{c}})).

(7)

It should be note that values in the attention map $\mathbf{M}$ are likely to be different from each other, resulting in a fine-grained attention map. Our fine-grained attention map representation method is of major contrast to the previous attention-based methods that learn a coarse attention map, having the same values across elements within the same channel, for example.

3.3 Feature Calibration

With the attention tensor estimated in Eq. (7), we then apply it to the input feature tensor. In particular, we consider computational approaches as follows:

\displaystyle\mathbf{\hat{X}}=\left\{\begin{array}[]{ll}\mathbf{X}\odot\mathbf{M}&\text{(non-residual)}\\ \mathbf{X}\oplus(\mathbf{X}\odot\mathbf{M})&\text{(residual)}\end{array}\right.

(10)

where $\odot$ and $\oplus$ denote Hadamard product and element-wise summation, respectively.

Fundamentally, although those two approaches employ fine-level attention maps, enabling detailed feature calibration at element-level units, they work in different ways in terms of feature representation learning. The non-residual approach is to mine and calibrate as many discriminative features as possible by multiplying the input feature tensor with the corresponding attention tensor. Meanwhile, in the residual approach, because of the element-wise sum operation (i.e., a residual operation), the element-wise multiplication part between the input feature tensor and the attention tensor is likely to excite the locations where the input feature tensor $\mathbf{X}$ may have less activations. Hence, the attention module described in Section 3.2 is trained to give more attention to locations where the input feature tensor presents relatively small activations. Accordingly, the output attention tensor in $\mathbf{M}$ plays the role of exciting the less activated regions in $\mathbf{X}$ while inhibiting the more activated regions. This interpretable phenomenon is clearly observed from our experimental results in Fig. 5 and Fig. 6.

Table 1: WSOL performances between our proposed method (RFGA) and comparative methods on CUB-200-2011 (CUB), FGVC Aircraft (Aircraft), and Stanford Dogs (Dogs) datasets. Note that w/ R and w/o R denote with residual connection and without residual connection, and

\tau

indicates a threshold of activation map. The best performance is indicated in boldface.

Method	CUB							Aircraft							Dogs
	MaxBoxAcc (%)						mIoU (optimal $\tau$ )	MaxBoxAcc (%)						mIoU (optiaml $\tau$ )	MaxBoxAcc (%)						mIoU (optiaml $\tau$ )
	$\delta=0.5$	$\delta=0.6$	$\delta=0.7$	$\delta=0.8$	$\delta=0.9$	Avg.	mIoU (optimal $\tau$ )	$\delta=0.5$	$\delta=0.6$	$\delta=0.7$	$\delta=0.8$	$\delta=0.9$	Avg.	mIoU (optiaml $\tau$ )	$\delta=0.5$	$\delta=0.6$	$\delta=0.7$	$\delta=0.8$	$\delta=0.9$	Avg.	mIoU (optiaml $\tau$ )
SENet [12]	69.92	39.49	15.26	4.04	0.55	25.85	0.6598	82.81	63.67	33.30	8.88	1.95	38.12	0.7148	81.60	69.95	52.87	31.18	10.34	49.18	0.7850
CBAM [31]	68.19	39.16	15.36	3.75	0.55	25.40	0.6538	75.67	57.88	35.04	11.97	2.16	36.54	0.7179	81.60	69.63	52.44	30.79	10.36	48.96	0.7875
ECA-Net [27]	69.23	39.61	15.21	3.66	0.48	25.63	0.6603	89.41	75.46	41.31	10.44	2.16	43.75	0.7327	82.72	71.48	54.88	32.72	10.92	50.54	0.7901
CAM [40]	68.54	39.59	16.00	3.92	0.54	25.71	0.6565	84.22	66.31	36.84	11.70	1.98	40.21	0.7234	81.05	69.58	53.87	31.76	10.51	49.35	0.7884
HaS [24]	64.43	35.88	13.74	3.40	0.54	23.59	0.6474	73.18	52.12	29.58	10.26	2.04	33.43	0.7147	78.72	67.47	51.20	30.61	10.24	47.64	0.7833
ACoL [36]	57.92	29.03	9.75	2.19	0.35	19.84	0.6197	51.40	28.11	12.84	5.01	1.86	19.84	0.6259	68.88	54.55	37.84	21.26	7.55	38.01	0.7141
ADL [5]	62.69	33.47	12.08	3.12	0.45	22.36	0.6402	64.36	40.41	17.94	6.00	1.92	26.12	0.6593	79.17	66.54	48.40	27.75	9.46	46.26	0.7667
CutMix [35]	72.58	45.37	19.88	5.37	0.66	28.71	0.6724	79.75	71.83	56.95	29.04	4.86	48.48	0.7775	74.55	62.30	47.38	28.45	10.31	44.59	0.7822
RFGA w/o R	71.85	42.04	17.24	4.09	0.59	27.16	0.6635	96.22	87.22	53.29	15.78	2.37	50.97	0.7600	84.81	74.62	59.22	36.49	12.10	53.44	0.7613
RFGA w/ R (Ours)	75.99	51.19	26.30	8.08	1.19	32.55	0.6970	89.11	79.15	62.59	35.70	6.48	54.60	0.8101	84.15	74.69	60.55	39.14	13.61	54.42	0.8135

4 Experiment

4.1 Experimental Setup

Datasets. We validated our RFGA over three public datasets for WSOL, i.e., CUB-200-2011 [25], FGVC Aircraft [22], and Stanford Dogs [15]. First, CUB-200-2011 includes a total of $11,788$ images from $200$ bird categories, which is divided into $5,994$ images for training and $5,794$ images for evaluation. FGVC Aircraft consists of $10,000$ images across $100$ aircraft categories with $3,334$ for training, $3,333$ for a validation, and $3,333$ for testing. Stanford Dogs contains a total of $20,580$ dog samples in $102$ categories, which is composed of $12,000$ training samples and $8,580$ test samples.

Competing methods. We compared our RFGA with five existing state-of-the-art WSOL methods; e.g., CAM [40], HaS [24], ACoL [36], ADL [5], and CutMix [35]. Further, in order to see the effectiveness of attention methods in WSOL, we compared to three other context fusion based attention methods; e.g., SENet [12], CBAM [31], and ECA-Net [27] as well.

Evaluation metric. In order for quantitative evaluation, we used MaxBoxAcc [4] over the IoU thresholds $\delta\in\{0.5,0.6,0.7,0.8,0.9\}$ at the optimal activation map threshold. A threshold of activation map, $\tau$ , is set between 0 and 1 at 0.01 intervals. Therefore, we finally obtained the results, measuring various localization performances over threshold $\tau$ for an activation map at various levels of $\delta$ .

4.2 Implementation Details

We used ResNet-50 [10] pre-trained with ImageNet data as a backbone network. In order to obtain localization maps, we used feature maps of $1\times 1$ convolutional layers, similar to ACoL [36]. For the kernel size $k$ in the triple-view attentions, we used $3$ , according to the work of [27]. The input images of training were resized to $600\times 600$ and then we cropped randomly $448\times 448$ patches from the resized images. In addition, the input images were flipped horizontally with a probability of $0.5$ . Meanwhile, the test images were resized to $448\times 448$ . We trained our RFGA for a total of $60$ epochs with a mini-batch size of $20$ and an initial learning rate of $0.01$ that was decreased by $0.1$ after every $15$ epochs. Further, we used the stochastic gradient descent optimizer with a momentum of $0.9$ . More details for the settings of comparative methods can be found in Supplemantary. We implemented all methods in PyTorch¹¹1https://pytorch.org and trained with Titan X GPU.

4.3 Experimental Results

Quantitative evaluation. Table 1 summarizes the performance of the competing methods at the optimal activation map threshold $\tau$ . We observed that our RFGA method outperformed localization performance compared to other competing methods in terms of MaxBoxAcc and Mean Intersection over Union (mIoU) which is the average IoU of all images at optiaml $\tau$ . Further, it is noteworthy that our RFGA showed its superiority to all competing methods in all cases including various IoU thresholds over three datasets.

Table 2: Classification performances on CUB-200-2011 (CUB), FGVC Aircraft (Aircraft), and Stanford Dogs (Dogs) datasets. Note that w/ R and w/o R denote with residual connection and without residual connection. The best performance is indicated in boldface.

Method	Classification Top-1 Accuracy (%)
Method	CUB	Aircraft	Dogs
SENet [12]	80.84	63.04	79.51
CBAM [31]	80.57	58.69	79.36
ECA-Net [27]	80.31	65.62	78.55
CAM [40]	81.41	63.16	79.35
HaS [24]	70.54	41.97	73.15
ACoL [36]	61.70	11.94	47.28
ADL [5]	63.32	46.38	67.63
CutMix [35]	83.55	59.59	83.17
RFGA w/o R	79.75	65.83	76.13
RFGA w/ R (Ours)	76.65	52.99	68.17

Qualitative visualization. We visualized the predicted localization bounding boxes and activation maps for all methods in Fig. 3. Each image presents the localization results at the optimal threshold $\tau$ where the IoU of the bounding box from the activation map achieves maximum value. We observed that RFGA elaborately localized the entire part of an object for CUB-200-2011, FGVC aircraft, and Stanford Dogs datasets. While most competing methods focused on the partial objects or covered in excess of the exact object region, our RFGA tightly bounded the entire and specified region of the object in an image, thereby achieving the best localization performance.

Classification. We additionally report the classification accuracy in Table 2 to explore the relation between localization and classification. By following to work of [3], we selected the model at the last epoch without regard to validation results. However, we observed that most WSOL methods showed a tendency of achieving the best localization performance in the early stage of training in spite of low classification performance. Consequently, we believe that the classification performance is not correlated with the localization performance, consistent with the work in [4].

4.4 Analysis

Hyperparameter analysis. We plotted the change of localization performance (MaxBoxAcc [4]) of all methods by varying the value of $\tau$ in Fig. 4. Our proposed RFGA showed the best performance at a smaller $\tau$ value than that of other comparative methods. From the results, we could infer that most of the high activation values in our RFGA-based feature tensor were well aligned to the object-related region.

Visualization of attention maps. In order to get an insight into the working of our RFGA, we visualized all triple-view attention maps (i.e., $\mathbf{z}_{c}$ , $\mathbf{z}_{w}$ , $\mathbf{z}_{h}$ ) as well as the final combined attention map (i.e., $\mathbf{M}$ ) (top) and the input feature map $\mathbf{X}$ , the element-wise product of $\mathbf{X}$ and $\mathbf{M}$ (i.e., $A(\mathbf{X})$ ), the resulting output feature map $\hat{\mathbf{X}}$ via a residual approach, and the difference between $\mathbf{X}$ and $\hat{\mathbf{X}}$ (bottom) in Fig. 5. We took an average of each attention vector along the channel axis and expanded the averaged vectors, $\mathbf{z}_{c}$ , $\mathbf{z}_{w}$ , and $\mathbf{z}_{h}$ to a matrix by repetition for a visualization. It should be noted that we normalized each matrix in a range of $[0,1]$ . Contrary to the activation map of CAM that only focuses on the body of a bird, our RFGA pays additional attention to the wings, resulting in the entire object attention. In accordance with Fig. 3, we validated the effectiveness of our fine-grained calibration of features in WSOL.

4.5 Ablation Study

Effect of triple-view attention. We assumed that our RFGA could localize an object from a variety of fine-grained information. To validate the effectiveness of the fine-grained attention, we conducted ablation studies with respect to each of the attention maps generated from different views, i.e., channel $\mathbf{z}_{c}$ , vertical $\mathbf{z}_{h}$ , and horizontal $\mathbf{z}_{w}$ . We employed only one-view attention out of those three view when training the same architecture, and reported the results in Table 3. Our triple-view attentions method clearly outperformed all the ablation cases. Based on those results, we believe that our fine-grained attention map from the triple-view attentions is capable of calibrating features through the complementary relations inherent in the input feature tensor.

Table 3: Comparison of different attention views and fine-grained attention (denoted as Full) in RFGA. The best performance is highlighted in boldface. Note that we expand the triple-view attentions (channel, height and width) to a fine-grained attention map by an outer sum.

Dataset	Metric	Attention Dimension
Dataset	Metric	Channel	Height	Width	Full
CUB	MaxBoxAcc	31.71	6.78	31.03	32.55
CUB	mIoU	0.6893	0.3862	0.6865	0.6970
Aircraft	MaxBoxAcc	52.65	21.55	45.53	54.60
Aircraft	mIoU	0.7866	0.6120	0.7556	0.8101
Dogs	MaxBoxAcc	52.81	30.21	53.13	54.42
Dogs	mIoU	0.8001	0.5710	0.8018	0.8135

Effect of residual connection. In order to investigate residual connection effect, we compared the residual approach and the non-residual approach in Eq. (10) in terms of the localization task in Table 1 and the classification task in Table 2. We also visualized their respective attention maps in Fig. 6. The residual approach generated the attention maps that focused on both the most and the less discriminative regions of an object. Meanwhile, for the non-residual approach, their attention maps showed the opposite patterns to those of the residual-based maps. Based on the understanding of a residual operation, note that $\mathbf{X}\oplus A(\mathbf{X})$ leads the function $A(\mathbf{X})$ to learn some amount of information that the input feature tensor $\mathbf{X}$ may have missed or less emphasized. From the viewpoint of attention map generation, the role of $A(\mathbf{X})$ can be interpreted as to inhibit the regions of high activation values (as those are already well represented in $\mathbf{X}$ ) and to excite the less activated regions where the target task-related information is inherent. Here, the inhibition effect can be related to those of the specially-designed module in ACoL[36] and ADL[5] that erase the discriminative features.

5 Conclusion

In this paper, we proposed a novel residual fine-grained attention module to localize an object accurately. Our proposed RFGA consisted of three components; (i) the triple-view attentions, (ii) expansion of the attentions to a high resolution, and (iii) calibration of the feature. Notably, our proposed method does not require a hyperparameter such as a drop rate for masking discriminative regions. Based on the evaluation with the metrics of mIOU and MacBoxAcc [4] over three datasets, our proposed method achieved the highest performance. In our exhaustive ablation studies, we presented the validity of all the three components and also interpreted the inner working of the feature calibration formulated by a residual operation. It is noteworthy that because our proposed RFGA is plugged in between the last convolution layer and a classifier, it is applicable to other CNN architectures without modifying the original network architecture. In that sense, it would be our forthcoming research issue to more generalize its application to multi-object localization.

Acknowledgement. This work was supported by Institute of Information $\&$ communications Technology Planning $\&$ Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079 , Artificial Intelligence Graduate School Program(Korea University)).

References

[1] Sadbhavana Babar and Sukhendu Das. Where to Look?: Mining complementary image regions for weakly supervised object localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1010–1019, 2021.
[2] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
[3] Junsuk Choe, Seong Joon Oh, Sanghyuk Chun, Zeynep Akata, and Hyunjung Shim. Evaluation for weakly supervised object localization: Protocol, metrics, and datasets. arXiv preprint arXiv:2007.04178, 2020.
[4] Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata, and Hyunjung Shim. Evaluating weakly supervised object localization methods right. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3133–3142, 2020.
[5] Junsuk Choe and Hyunjung Shim. Attention-based dropout layer for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2219–2228, 2019.
[6] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2009.
[7] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Kronecker attention networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 229–237, 2020.
[8] Yu Gao, Xintong Han, Xun Wang, Weilin Huang, and Matthew Scott. Channel interaction networks for fine-grained image categorization. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10818–10825, 2020.
[9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 580–587, 2014.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 770–778, 2016.
[11] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. arXiv preprint arXiv:1810.12348, 2018.
[12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[13] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603–612, 2019.
[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[15] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
[16] Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So Kweon. Two-phase learning for weakly supervised object localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 3534–3543, 2017.
[17] Ildoo Kim, Woonhyuk Baek, and Sungwoong Kim. Spatially attentive output layer for image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9533–9542, 2020.
[18] Minchul Kim, Jongchan Park, Seil Na, Chang Min Park, and Donggeun Yoo. Learning visual context by comparison. In Proceedings of the European Conference on Computer Vision, pages 576–592. Springer, 2020.
[19] HyunJae Lee, Hyo-Eun Kim, and Hyeonseob Nam. SRM: A style-based recalibration module for convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1854–1862, 2019.
[20] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Changhu Wang, and Jiashi Feng. Improving convolutional networks with self-calibrated convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10096–10105, 2020.
[21] Jinjie Mai, Meng Yang, and Wenfeng Luo. Erasing integrated learning: A simple yet effective approach for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8766–8775, 2020.
[22] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[24] Krishna Kumar Singh and Yong Jae Lee. Hide-and-Seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 3544–3553. IEEE, 2017.
[25] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD birds-200-2011 dataset. 2011.
[26] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In Proceedings of the European Conference on Computer Vision, pages 108–126. Springer, 2020.
[27] Q Wang, B Wu, P Zhu, P Li, W Zuo, and Q Hu. Eca-net: Efficient channel attention for deep convolutional neural networks, 2020 ieee. In CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.
[28] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[29] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 1568–1576, 2017.
[30] Yunchao Wei, Xiaodan Liang, Yunpeng Chen, Xiaohui Shen, Ming-Ming Cheng, Jiashi Feng, Yao Zhao, and Shuicheng Yan. STC: A simple to complex framework for weakly-supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2314–2320, 2016.
[31] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, pages 3–19, 2018.
[32] Haolan Xue, Chang Liu, Fang Wan, Jianbin Jiao, Xiangyang Ji, and Qixiang Ye. DANet: Divergent activation for weakly supervised object localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6589–6598, 2019.
[33] Qing-Long Zhang Yu-Bin Yang. SA-Net: Shuffle attention for deep convolutional neural networks. arXiv preprint arXiv:2102.00240, 2021.
[34] Kaiyu Yue, Ming Sun, Yuchen Yuan, Feng Zhou, Errui Ding, and Fuxin Xu. Compact generalized non-local network. arXiv preprint arXiv:1810.13125, 2018.
[35] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6023–6032, 2019.
[36] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas S Huang. Adversarial complementary learning for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1325–1334, 2018.
[37] Xiaolin Zhang, Yunchao Wei, Guoliang Kang, Yi Yang, and Thomas Huang. Self-produced guidance for weakly-supervised object localization. In Proceedings of the European Conference on Computer Vision, pages 597–613, 2018.
[38] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10076–10085, 2020.
[39] Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, and Jiebo Luo. Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5012–5021, 2019.
[40] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , June 2016.
[41] Yi Zhu, Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Soft proposal networks for weakly supervised object localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1841–1850, 2017.
[42] Peiqin Zhuang, Yali Wang, and Yu Qiao. Learning attentive pairwise interaction for fine-grained classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13130–13137, 2020.