
College of Control Science and Engineering, Zhejiang University, China. Email: [email protected]

CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization

Ming Li
Abstract

Weakly supervised object localization (WSOL) is a challenging task that aims to localize objects using only category labels. However, there is a contradiction between classification and localization, because an accurate classification network tends to attend to the discriminative regions of objects rather than their entirety. We argue that this discrimination is caused by the handcrafted threshold selection in CAM-based methods. We therefore propose Clustering and Filter on Tokens (CaFT), built on a Vision Transformer (ViT) backbone, to address the problem in another way. CaFT first sends the patch tokens of the split image to ViT and clusters the output tokens to generate an initial mask of the object. Second, CaFT treats the initial masks as pseudo labels to train a shallow convolution head (Attention Filter, AtF) attached to the backbone, which directly extracts the mask from the tokens. CaFT then splits each image into parts, predicts a mask for each part, and merges them into one refined mask. Finally, a new AtF is trained on the refined masks and used to predict the object box. Experiments verify that CaFT outperforms previous work, achieving 97.55% and 69.86% localization accuracy with ground-truth class on CUB-200 and ImageNet-1K respectively. CaFT provides a fresh way to think about the WSOL task.

Keywords:
Weakly Supervised Object Localization (WSOL) · Clustering · Vision Transformer.

1 Introduction

Current deep learning methods have achieved enormous progress and widespread application. However, fully supervised learning always requires a large amount of accurately labeled data, which is costly to acquire, so weakly supervised learning has become an important and challenging field to explore. In recent years, many methods have been proposed to solve WSOL tasks.

Weakly supervised object localization (WSOL) aims to localize the object in an image without bounding-box labels. The most influential method in WSOL is Class Activation Mapping (CAM) [20], which uses the activation map from the last convolution layer to generate the object's bounding box. But classification models prefer to attend to the most discriminative region of the object rather than the entirety, and many methods built on CAM inherit this behavior [9, 17, 18, 2]. Moreover, after the model generates the activation maps, a threshold on the activation values must be chosen to obtain the final bounding box, a process involving considerable handcrafted design. Because activation conditions differ across images, it is difficult to set one threshold for all samples.

Therefore, we propose a method that uses the k-means [6] clustering algorithm to automatically learn the threshold of each region in an image. By dividing the feature map from a convolutional neural network (CNN) into eigenvectors of length equal to the channel dimension, we can cluster these vectors. However, it is then unclear which cluster represents the object region. Vision Transformer (ViT) [3] is a framework built on the self-attention mechanism. It has a special class token that is independent of the other tokens in the output feature map. By including the class token in the clustering, it is easy to choose the cluster containing the class token as the foreground and generate the mask of the object.

In addition, following previous work [16], we also shift the weakly supervised task to a pseudo-supervised task. The method of [16] generates pseudo bounding-box labels. In contrast, we consider the mask result of clustering as the pseudo label and use a shallow convolutional module (called Attention Filter, AtF), attached to ViT's output tokens, to generate a more accurate and robust mask. AtF is designed as a two-class (foreground and background) classification module that directly outputs the object region, so it also needs no threshold, unlike a regression approach.

We summarize our contributions as follows:

1. We propose a novel combination of clustering and deep learning for the WSOL task, which removes the difficult threshold selection of CAM-based methods and generates a complete mask of the object.

2. We propose a lightweight module, AtF, trained on pseudo mask labels, to extract information from the feature map and generate an accurate mask of the object.

3. We propose a high-quality transformer-based architecture for the WSOL task, which achieves 97.55% and 69.86% localization accuracy with ground-truth class on CUB-200 and ImageNet-1K respectively. We name the method Clustering and Filter on Tokens (CaFT).

2 Related Work

2.0.1 Weakly Supervised Object Localization (WSOL)

is a challenging task of learning object localization given only category labels. It is attractive because image-level category labels are easier and cheaper to obtain than localization labels.

CAM [20] is the pioneering work in WSOL. It uses global average pooling to replace the fully connected layer in CNN-based classification models, and uses the class activation maps to derive the object region. However, class activation maps have two problems. First, they usually attend to the most discriminative parts of the object. Second, after generating the class activation maps, a handcrafted threshold is needed to separate foreground from background, and different threshold choices often greatly influence the quality of the predicted bounding box.

2.0.2 Development of CAM-based Methods.

To solve the problems of CAM and improve the performance of localization, a lot of different methods are proposed.

For example, HaS [9] hides random patches of the image during training, which aims to reduce the reliance on the most discriminative regions. CutMix [15] mixes patches of different images. ACoL [17] uses two classifier branches to erase and predict discriminative regions. ADL [2] erases feature maps corresponding to discriminative regions during inference. These methods are based on erasing part of the image information during training or inference. DA-Net [14] uses a discrepant activation method to reduce the similarity of CAMs and gathers them into less discriminative activation maps. SPG [18] and I2C [19] constrain pixel-level correlations in the network to reduce the dependence on discriminative regions. PSOL [16] argues that localization and classification should be treated as two separate tasks, and uses a class-agnostic localization method to predict bounding boxes.

2.0.3 Vision Transformer.

The transformer was originally an architecture for sequential tasks and has recently been applied to the computer vision domain. ViT [3] applies a pure transformer directly to sequences of image patches to explore spatial correlation, and achieves strong performance on classification tasks. DeiT [10] introduces several training strategies that make ViT effective even when trained only on the smaller ImageNet-1K dataset. The results of ViT on image classification are encouraging. Recently, transformer architectures have also been deployed for WSOL. TS-CAM [4] first splits an image into a sequence of patch tokens for spatial embedding, then re-allocates category-related semantics to the patch tokens, and finally couples the patch tokens with a semantic-agnostic attention map to achieve semantic-aware localization.

All the methods mentioned above substantially advance WSOL, but weaknesses remain regarding discriminative regions and threshold selection. Our CaFT replaces handcrafted threshold selection with a self-learning process by clustering the points of the feature map. Thanks to the label-agnostic clustering and the global view of ViT, the activated region is discretized and the foreground/background threshold is no longer fixed. This sets a more flexible boundary that segments the complete object from the background rather than only the region beneficial to classification.

Figure 1: CaFT structure. The ViT backbone converts the image to tokens and outputs the category. The tokens of the last three layers and the position embedding parameters are merged into $map_m$. In training: (a) clustering tokens and post-processing; (b) using the mask of (a) to train AtF1; (c) dividing the input image and merging the output masks to train AtF2. CaFT then uses the output mask of AtF2 to draw the bounding box. In inference, CaFT directly inputs $map_m$ to AtF2 to predict the mask and the bounding box.
Figure 2: The structure of AtF. We use two convolution layers (dotted box) on CUB-200 and three convolution layers on ImageNet-1K. All convolution layers have the same dimension as $map_m$.

3 Methodology

In this section, we first introduce the main architecture of CaFT and then analyze each of its modules in turn.

3.1 Overview

We use a ViT model as the backbone and directly use its output as the final classification result.

We take the tokens of the last three blocks as $map_0$, $map_1$, $map_2$ respectively, and the position embedding parameters of ViT as $map_p$. We then merge these four maps into $map_m$ as

$$map_m=\sum_{i\in\{0,1,2,p\}}\alpha_i\times map_i \qquad (1)$$

where $\alpha_i$ is the merge ratio.
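As a concrete illustration, the following sketch merges the token maps according to Eq. (1); the $\alpha_i$ values shown are placeholders rather than the ratios used in our experiments.

```python
import torch

def merge_token_maps(map0, map1, map2, map_p, alphas=(0.3, 0.3, 0.4, 0.05)):
    """Weighted merge of the last three blocks' tokens and the position
    embedding, following Eq. (1). The alpha values are illustrative only."""
    return sum(a * m for a, m in zip(alphas, (map0, map1, map2, map_p)))

# ViT-B/16 at 384x384: 576 patch tokens + 1 class token, embedding dim 768.
map0, map1, map2 = (torch.randn(1, 577, 768) for _ in range(3))
map_p = torch.randn(1, 577, 768)   # learned position embedding, broadcast over the batch
map_m = merge_token_maps(map0, map1, map2, map_p)
print(map_m.shape)                 # torch.Size([1, 577, 768])
```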

In the training process, we design three steps on $map_m$, as shown in Fig. 1.

(a) At the beginning of training, we input the merged tokens of $map_m$ to the clustering module, in which the tokens are clustered into 3 categories by the k-means algorithm and the tokens in the same cluster as the class token are chosen as the foreground. The next step is to binarize the category distribution map according to the foreground and apply a classical image filter to reduce noise. At this point, the initial masks of the images have been generated.

(b) The initial masks from clustering are then used as pseudo targets for Attention Filter 1 (AtF1), a two-category classifier of foreground and background.

(c) After AtF1 is trained, CaFT divides the input image into four parts, feeds them to the model respectively, and merges their masks into a more refined mask. A new AtF2 with the same structure as AtF1 is trained on the pseudo labels given by the refined masks.

In inference, $map_m$ is directly input to AtF2, and the output mask of AtF2 is used to generate one bounding box per image.
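A minimal sketch of this final box-drawing step is given below: the patch-level foreground mask is converted to one pixel-level box by taking the tight external border of the foreground; the 16-pixel patch size is assumed for ViT-B/16.

```python
import numpy as np

def mask_to_box(mask, patch=16):
    """Convert a binary (H, W) patch-level mask into one (x1, y1, x2, y2) box
    in pixel coordinates by taking the tight border of the foreground."""
    ys, xs = np.where(mask > 0)
    if len(xs) == 0:               # empty mask: fall back to the full image
        h, w = mask.shape
        return 0, 0, w * patch, h * patch
    return (xs.min() * patch, ys.min() * patch,
            (xs.max() + 1) * patch, (ys.max() + 1) * patch)

mask = np.zeros((24, 24), dtype=np.uint8)   # 24x24 mask from a 384x384 input
mask[6:18, 4:20] = 1
print(mask_to_box(mask))                    # (64, 96, 320, 288)
```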

3.2 ViT Backbone

ViT first separates the image into patches, which are flattened and passed through a linear layer to become n tokens. Different from a CNN, ViT adds an extra class token and inputs it to the network along with the image tokens. In addition, ViT adds a position embedding to every token before it is input. Experiments in [3] show that the position embedding is related to the corresponding position of the tokens. The main structure of ViT is the encoder, in which tokens pass through cascaded blocks consisting of a multi-head self-attention layer and a multilayer perceptron (MLP). After propagation through the encoder, the network outputs the same number of tokens $t_0, t_1, \ldots, t_{n-1}, t_n$, and uses the class token $t_0$ to infer the category.

Slightly different from the original ViT model, we keep the output of the last three blocks of ViT for the subsequent process.

3.3 Clustering of Tokens

Classical CAM-based methods usually need a handcrafted fixed threshold to decide the foreground and background. For different inputs, the most suitable threshold for selecting the foreground differs, as shown in Fig. 3. Under the previous frameworks, however, one can only choose a compromise value according to the performance on the whole dataset. In other words, a fixed design cannot extract an accurate boundary between foreground and background. Because of this compromised boundary, the predicted region is unable to cover the entire object and is inclined to focus on the most discriminative region.

Figure 3: Regions under different thresholds. In the original CAM, after computing the classification scores of the positions, a uniform threshold must be chosen for different images. In the left image the suitable threshold to cover the object is around 0.9, while in the right image it is around 0.4.

Therefore, we decide to avoid this handcrafted process. We find that the eigenvectors making up the feature map have apparent correlations with each other, but the similarity is still a continuous value with different magnitudes across images.

So we apply a clustering method to this task and choose the common k-means algorithm to automatically learn the boundary of foreground and background at the single-image level. At the beginning of the experiment, as shown in Fig. 4, we clustered each point of the feature map of a CNN-based network (ResNet50 [5] in this experiment). However, the clustering result is unstable across different random seeds, which affect the initial state of clustering, as shown in the top row of Fig. 4(b). By computing the similarity matrix of the eigenvectors, we find that the similarity between CNN vectors is generally low (Fig. 4(c)). Moreover, in Fig. 4(d), a vector tends to have higher similarity with nearby vectors and lower similarity with distant vectors even if they all lie in the object region. Most importantly, CNN features have a fatal problem: although we can cluster the points into categories, there is no way to decide which cluster belongs to the object region.

Figure 4: Comparison between CNN and ViT. The top row is the result of the CNN and the bottom row is the result of ViT. (a) The raw input image. (b) Clustering results with different random seeds. (c) The similarity curve of points on the feature map. (d) The similarity matrix of points.

These problems are easy to solve with the Vision Transformer (ViT). First, due to the special design of the extra class token, there are obvious correlations between the class token and the other tokens covering the object. These correlations are long-range, which stems from the different topological structures of CNN eigenvectors and ViT tokens: as shown in Fig. 5, eigenvectors are only indirectly connected, while tokens acquire information from each other directly. From the similarity curve on the same image in Fig. 4(c), it is obvious that the similarity of tokens has a larger variance than that of CNN vectors, which means a higher degree of separation between clusters. Therefore, the clustering result is more stable for ViT tokens.

Figure 5: The correlation of points in (a) CNN and (b) ViT.

To verify the feasibility of using the class token as the flag of the object cluster, we compute the mean Euclidean distance between the class token and the cluster center over sampled images. Its value (10.653) is close to the mean within-cluster distance to the cluster center ($D_{ic}$) and much smaller than the clustering radius ($D_r$) of $map_m$ shown in Table 1. This confirms that the class token belongs to the cluster of the object region.

As Fig. 7(c) shows, after choosing the cluster of the class token as the positive region, the positive mask covers almost the entire object, but there are some noisy points in the background. We therefore use a simple smoothing filter from image processing to suppress this noise, which yields Fig. 7(d). Although the mask covers the entire object, its edge does not fit the object perfectly. Therefore, we propose the next step, the Attention Filter, to refine the edges and to replace the clustering so as to accelerate inference.
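The clustering step described above can be sketched as follows; the median filter stands in for the simple smoothing filter, and the token layout (class token first, then 24x24 patch tokens) assumes ViT-B/16 at 384x384.

```python
import numpy as np
from scipy.ndimage import median_filter
from sklearn.cluster import KMeans

def initial_mask(tokens, grid=24, k=3, seed=0):
    """Cluster the merged tokens (class token first, then grid*grid patch
    tokens) with k-means, keep the cluster containing the class token as
    foreground, then smooth isolated noise with a small filter."""
    labels = KMeans(n_clusters=k, random_state=seed).fit_predict(tokens)
    foreground = labels[0]                         # cluster of the class token
    mask = (labels[1:] == foreground).astype(np.uint8).reshape(grid, grid)
    return median_filter(mask, size=3)             # suppress isolated noisy points

tokens = np.random.randn(1 + 24 * 24, 768).astype(np.float32)
print(initial_mask(tokens).shape)                  # (24, 24)
```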

3.4 Attention Filter

Some CAM-based methods separate the training of WSOL into several steps. PSOL [16] uses the DDT [13] algorithm to generate bounding boxes and treats them as pseudo box labels to train a box-regression network in the second stage. Inspired by this, we design a multi-stage training process. Unlike bounding-box regression, we treat the subsequent training as segmentation training and take the output mask of clustering as the pseudo label.

Through the previous clustering we obtain a set of masks that cover the object but contain some background noise. We apply a Gaussian filter to the noisy masks to smooth out part of these points. Even at this stage the model can already predict bounding boxes: with a ViT-B backbone it achieves 70.16% and 57.59% GT-Known accuracy on CUB-200 and ImageNet-1K respectively.

However, this is not enough for a high-quality WSOL model. Besides the background noise, the clustering step has a low inference speed because of the iterations of the clustering algorithm. From our early experiments on CNNs, we found that there is less noise in CNN clustering results, and that convolving the feature map with a shared kernel can increase the separability of the foreground and smooth the noise. Therefore, we attach a shallow convolution head (Fig. 2) to the backbone, which directly extracts the attention mask of the object from the merged tokens $map_m$. This module plays a role analogous to a filter in image denoising, so we name it the Attention Filter (AtF).

Taking ViT-B-384 as an example, after obtaining $map_m$, we remove the class token and send the rest as a feature map of shape (D, H, W) to cascaded convolution layers (AtF1), which output an attention mask of shape (2, H, W), where D is the embedding dimension of ViT. The two output channels correspond to the background and foreground classes; after selecting the foreground points, AtF1 outputs a (1, H, W) mask from which the bounding box is obtained as the external border of the foreground. During the training of AtF1 the ViT backbone is frozen, which prevents the classification accuracy of the backbone from being affected and also keeps the strong-fitting ViT backbone from over-fitting the training set and the noise in the pseudo labels.
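A minimal PyTorch sketch of AtF is shown below, assuming the two-layer variant used on CUB-200 with batch normalization and ReLU; the exact layer widths and training hyper-parameters are not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionFilter(nn.Module):
    """Shallow 1x1-convolution head on the merged patch tokens, reshaped to
    (D, H, W), producing background/foreground logits of shape (2, H, W)."""
    def __init__(self, dim=768, num_layers=2):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Conv2d(dim, dim, kernel_size=1),
                       nn.BatchNorm2d(dim),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(dim, 2, kernel_size=1))   # background / foreground
        self.head = nn.Sequential(*layers)

    def forward(self, map_m):                  # map_m: (B, 1 + H*W, D), class token first
        b, n, d = map_m.shape
        hw = int((n - 1) ** 0.5)
        x = map_m[:, 1:].transpose(1, 2).reshape(b, d, hw, hw)  # drop the class token
        return self.head(x)                    # (B, 2, H, W)

logits = AttentionFilter()(torch.randn(2, 577, 768))
print(logits.shape)                            # torch.Size([2, 2, 24, 24])
```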

After being trained on the initial masks from clustering, AtF1 brings a large rise in localization accuracy, from 70.16% to 85.64% on CUB-200. The output mask is shown in Fig. 7(e). A $1\times1$ convolution is similar to a fully connected layer applied to each token, so it is essentially an eigenspace transformation of the tokens that enhances the features. After the transformation, the ambiguous vectors around the cluster boundary become separable. We apply the same clustering process used on the original tokens of $map_m$ to the feature maps of the convolution layers in AtF (Conv1, Conv2). Table 1 compares the following metrics. The mean within-cluster distance to the cluster center ($D_{ic}$) is

$$D_{ic}=\frac{1}{n_o}\sum_{x\in C_o}\mathrm{Euclidean}(x,c_o), \qquad (2)$$

where Euclidean(·,·) denotes the Euclidean distance, $C_o$ is the set of points in the cluster of the object region with size $n_o$, and $c_o$ is the center of that cluster. The clustering radius ($D_r$) is

$$D_r=\max\left\{\mathrm{Euclidean}(x,c_o)\,|\,x\in C_o\right\}. \qquad (3)$$

The distance between cluster centers ($D_{cc}$), together with $D_{cc}$ divided by $D_{ic}$ and by $D_r$, is

$$D_{cc}=\frac{1}{k-1}\sum_{q=1}^{k}\mathrm{Euclidean}(c_o,c_q), \qquad (4)$$

where $k$ is the number of clusters in the k-means algorithm and $c_q$ is the center of cluster $q$. The Calinski-Harabasz score [1] ($Score$) is defined as

$$Score=\frac{tr(B_k)}{tr(W_k)}\times\frac{n_E-k}{k-1}, \qquad (5)$$

where $n_E$ is the size of the whole token set $E$, $W_k$ is the within-cluster dispersion matrix and $B_k$ is the between-group dispersion matrix, defined by

$$W_k=\sum_{q=1}^{k}\sum_{x\in C_q}(x-c_q)(x-c_q)^T,\quad B_k=\sum_{q=1}^{k}n_q(c_q-c_E)(c_q-c_E)^T \qquad (6)$$

where $c_E$ is the center of $E$ and $n_q$ is the number of points in cluster $q$. The Calinski-Harabasz score evaluates clustering quality by within-cluster and between-cluster variance; a higher score corresponds to better-defined clusters.

Table 1: Comparison of clustering on $map_m$ and the feature maps of AtF1
Feature maps | $D_{ic}$ | $D_r$ | $D_{cc}$ | $D_{cc}/D_{ic}$ | $D_{cc}/D_r$ | $Score$
$map_m$ | 9.29 | 18.03 | 16.58 | 1.78 | 0.92 | 188.86
Conv1 | 8.37 | 16.16 | 24.72 | 2.95 | 1.53 | 688.55
Conv2 | 3.87 | 9.30 | 19.39 | 5.01 | 2.08 | 1603.12

Through AtF1, $D_{ic}$ and $D_r$ decrease while $D_{cc}/D_{ic}$, $D_{cc}/D_r$ and $Score$ increase, which means tighter within-cluster aggregation, larger spacing between clusters and a more stable clustering result. The visual result is shown in Fig. 6. Therefore, AtF is able to enhance the features and reduce the noisy points in the mask.
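The quantities in Table 1 can be reproduced per image with a short script like the one below; `object_label` is assumed to be the cluster index assigned to the class token.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def cluster_metrics(tokens, object_label, k=3, seed=0):
    """Compute D_ic, D_r, D_cc (Eqs. 2-4) and the Calinski-Harabasz score
    (Eq. 5) for one image's tokens."""
    km = KMeans(n_clusters=k, random_state=seed).fit(tokens)
    labels, centers = km.labels_, km.cluster_centers_
    c_o = centers[object_label]
    d_obj = np.linalg.norm(tokens[labels == object_label] - c_o, axis=1)
    d_ic = d_obj.mean()                                             # Eq. (2)
    d_r = d_obj.max()                                               # Eq. (3)
    d_cc = np.linalg.norm(centers - c_o, axis=1).sum() / (k - 1)    # Eq. (4)
    score = calinski_harabasz_score(tokens, labels)                 # Eq. (5)
    return d_ic, d_r, d_cc, score

tokens = np.random.randn(576, 768)
print(cluster_metrics(tokens, object_label=0))
```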

Figure 6: Comparison of clustering results with and without AtF.

To further enhance the result, we divide the training image into four parts and use AtF1 to predict four masks, which are stitched into a more refined mask used as the pseudo label for AtF2. AtF2 trained on the refined masks performs much better than AtF2 trained directly on the output masks of AtF1, even though the box prediction from the refined mask itself is only slightly better on CUB-200 (and even worse on ImageNet-1K), as shown in Table 2. The reason is that the refined mask preserves edge details better while adding a few sporadic noisy points, as shown in Fig. 7(f). As Fig. 7(g) shows, AtF2 is able to smooth these noisy points while learning more accurate edge information from the refined masks. In contrast, AtF2 trained on the AtF1 masks tends to fall into the same local minimum as AtF1 because the two modules share the same structure. This is why AtF2 trained on the refined masks rises sharply from 87.11% to 94.17% and surpasses AtF2 trained on the AtF1 masks by a large margin.

Table 2: Comparison between AtF1 masks and refined masks on CUB-200
AtF2 trained on | Own result Gt-Known | Own result mean IoU | AtF2 result Gt-Known | AtF2 result mean IoU
AtF1 mask | 85.64 | 0.6629 | 86.97 | 0.6744
Refined mask | 87.11 | 0.6885 | 94.17 | 0.7418

Note that the dividing strategy can only be used after AtF1 is obtained, not during clustering, because a divided part of an image may contain no object and the clustering result from a blank part is unreliable.
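The refinement step can be sketched as below, assuming `predict_mask` wraps the backbone plus AtF1 and returns a patch-level mask per image; the exact resizing of the quadrants is an implementation detail we assume here.

```python
import torch
import torch.nn.functional as F

def refined_mask(image, predict_mask):
    """Split the input into four quadrants, resize each back to full
    resolution, predict a mask for each with AtF1, and stitch the four masks
    into one 2x-finer mask used as the pseudo label for AtF2."""
    b, c, h, w = image.shape
    quads = [image[:, :, :h // 2, :w // 2], image[:, :, :h // 2, w // 2:],
             image[:, :, h // 2:, :w // 2], image[:, :, h // 2:, w // 2:]]
    masks = []
    for q in quads:
        q = F.interpolate(q, size=(h, w), mode="bilinear", align_corners=False)
        masks.append(predict_mask(q))               # (B, H_m, W_m) per quadrant
    top = torch.cat(masks[:2], dim=2)               # stitch left | right
    bottom = torch.cat(masks[2:], dim=2)
    return torch.cat([top, bottom], dim=1)          # (B, 2*H_m, 2*W_m)
```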

Figure 7: Results of CaFT. (a) The raw images. (b) The clustering results. (c) Binarization according to the class token. (d) The result of clustering. (e) The result of AtF1. (f) The refined mask. (g) The result of AtF2. (h) The ground truth box (green) and the predicted box (blue).

4 Experiments

4.1 Experimental Settings

4.1.1 Datasets.

Two widely used datasets are employed to evaluate WSOL methods: CUB-200 [11] and ImageNet-1K [7]. CUB-200 contains 200 bird categories with 5,994 training images and 5,794 testing images. ImageNet-1K is a large-scale classification dataset with 1,000 classes, containing 1,281,197 training images and 50,000 validation images; the validation images also have object bounding-box labels.

4.1.2 Evaluation Metrics.

Following previous work [20], Top-1/Top-5 localization accuracy (Top-1/Top-5 Loc) and localization accuracy with the ground-truth class (Gt-Known Loc) are used as evaluation metrics. Gt-Known Loc counts a prediction as correct when the intersection over union (IoU) between the predicted bounding box and the ground-truth bounding box is at least 0.5. Top-1 Loc is correct when both the Top-1 classification result and Gt-Known Loc are correct. Top-5 Loc is correct when, among the Top-5 predictions, there is one whose classification and localization results are both correct. Because CaFT predicts only one box per image, Top-5 Loc differs from Top-1 Loc only in the classification result.
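For reference, the two metrics reduce to the following per-image check (a standard IoU computation; the function names are ours):

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def gt_known_correct(pred_box, gt_boxes, thr=0.5):
    """Gt-Known Loc: correct if IoU with a ground-truth box reaches 0.5.
    Top-1 Loc additionally requires the Top-1 class to be correct."""
    return any(iou(pred_box, gt) >= thr for gt in gt_boxes)

print(gt_known_correct((10, 10, 100, 100), [(20, 20, 110, 110)]))  # True (IoU ~ 0.65)
```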

4.2 Implementation Details

CaFT is based on the ViT backbone [3] pre-trained on ImageNet-1K [7]. Each input image is directly resized to (384, 384). For the CUB-200 experiments, we start from the pre-trained ViT weights and fine-tune the classification backbone on CUB-200; the subsequent training never changes the backbone parameters. The Top-1 classification accuracy of the backbone is around 88.50% on CUB-200 and 83.70% on ImageNet-1K.

We use the k-means implementation in scikit-learn with 3 clusters and default settings for the remaining parameters. On CUB-200 we use the whole training set of 5,994 images. For ImageNet-1K, because of the high computational cost of the k-means algorithm, we randomly sample 10 images from each class of the training set to generate pseudo mask labels and train AtF on this mini-set (10,000 images in total).

We use the SGD optimizer for all training. For training the AtFs, we set the learning rate to 0.1 with a cosine scheduler, and train for 20 epochs on CUB-200 and 5 epochs on ImageNet-1K.

We test ViT-B and ViT-L models. For ViT-B, we follow the full process of clustering, AtF1, refined masks and AtF2; for ViT-L, we directly train AtF2 using the pseudo mask labels generated by the ViT-B model.

Table 3: Results at different stages of CaFT. Center cropping is not used.
Method | Backbone | CUB-200 Top-1 Loc | CUB-200 Gt-Known Loc | ImageNet-1K Top-1 Loc | ImageNet-1K Gt-Known Loc
Clustering | ViT-B | 63.63 | 70.16 | 49.68 | 57.59
AtF1 | ViT-B | 77.32 | 85.64 | 54.74 | 63.45
Refined | ViT-B | 78.62 | 87.11 | 54.04 | 62.59
AtF2 | ViT-B | 84.79 | 94.17 | 56.02 | 64.79

4.3 Performance

4.3.1 Main Results.

The main results of CaFT on CUB-200 and ImageNet-1K are shown in Table 4 and Table 5, compared with other methods. The large model achieves the best results, and the base model is also competitive. An interesting phenomenon is that Top-1 Loc divided by Gt-Known Loc surpasses 90.00% on CUB-200 and 86.50% on ImageNet-1K, exceeding the Top-1 classification accuracy of the backbone (around 88.50% on CUB-200 and 83.70% on ImageNet-1K). A reasonable explanation is that the localization accuracy and classification accuracy of CaFT are significantly positively correlated, whereas in CAM there is often a contradiction between the two. The attention of CaFT does not shift toward the discriminative region as classification accuracy increases; on the contrary, the mask generated by a better classification backbone consists of fewer noisy points, which benefits clustering and the training of AtF, and hence localization. The Top-1 Loc and Gt-Known Loc at different stages of CaFT are shown in Table 3.

Table 4: Comparison of CaFT with state-of-the-art methods on CUB-200
Methods | Backbone | Top-1 Loc | Top-5 Loc | Gt-Known Loc
CAM [20] | VGG-GAP | 36.13 | - | -
ACoL [17] | VGG-GAP | 45.92 | 56.51 | 62.96
ADL [2] | VGG-GAP | 52.36 | - | 73.96
DDT [13] | VGG16 | 62.30 | 78.15 | 84.55
SPG [18] | InceptionV3 | 46.64 | 57.72 | -
ADL [2] | ResNet50-SE | 62.29 | - | 71.99
I2C [19] | InceptionV3 | 65.99 | 68.34 | 72.60
DANet [14] | VGG16 | 52.50 | 62.00 | 67.70
PSOL [16] | DenseNet161+EfficientNet-B7 | 77.44 | 89.51 | 93.01
SPOL [12] | ResNet50+DenseNet161 | 79.74 | 93.69 | 96.46
SPOL [12] | ResNet50+EfficientNet-B7 | 80.12 | 93.44 | 96.46
TS-CAM [4] | Deit-S | 71.30 | 83.80 | 87.70
TS-CAM [4] | Deit-B-384 | 75.80 | 84.10 | 86.60
TS-CAM [4] | Conformer-S | 77.20 | 90.90 | 94.10
CaFT (ours) | ViT-B-384* | 84.79 | 92.75 | 94.17
CaFT (ours) | ViT-B-384 | 86.57 | 94.10 | 95.50
CaFT (ours) | ViT-L-384* | 86.80 | 95.18 | 96.44
CaFT (ours) | ViT-L-384 | 88.26 | 96.38 | 97.55
* This model does not use center cropping.

Table 5: Comparison of CaFT with state-of-the-art methods on ImageNet-1K
Methods | Backbone | Top-1 Loc | Top-5 Loc | Gt-Known Loc
CAM [20] | VGG-GAP | 42.80 | 54.86 | 59.00
ACoL [17] | VGG-GAP | 45.83 | 59.43 | 62.96
ADL [2] | VGG-GAP | 44.92 | - | -
DDT [13] | VGG16 | 47.31 | 58.23 | 61.41
SPG [18] | InceptionV3 | 48.60 | 60.00 | 64.69
ADL [2] | ResNet50-SE | 48.53 | - | -
I2C [19] | InceptionV3 | 53.11 | 64.13 | 68.50
DANet [14] | GoogLeNet | 47.50 | 58.30 | -
PSOL [16] | DenseNet161+EfficientNet-B7 | 58.00 | 65.02 | 66.28
SPOL [12] | ResNet50+DenseNet161 | 56.40 | 66.48 | 69.02
SPOL [12] | ResNet50+EfficientNet-B7 | 59.14 | 67.15 | 69.02
TS-CAM [4] | Deit-S | 53.40 | 64.30 | 67.60
CaFT (ours) | ViT-B-384* | 56.02 | 63.49 | 64.79
CaFT (ours) | ViT-B-384 | 58.06 | 65.74 | 67.03
CaFT (ours) | ViT-L-384* | 58.35 | 65.91 | 67.20
CaFT (ours) | ViT-L-384 | 60.59 | 68.56 | 69.86
* This model does not use center cropping.

4.3.2 About Center Cropping.

Much recent work [17, 18, 16, 12] uses center cropping during inference, so we follow it and also report CaFT results with center cropping for comparison. However, we think center cropping effectively resizes the ground-truth boxes and reduces the difficulty of the task regardless of model design: by cropping the input images, the proportion of the bounding box in the image is increased, so that even directly using the image boundary as the prediction can yield an IoU above 0.5 and be counted as a positive sample. This makes the evaluation less reliable, so we also report results without center cropping to show the objective performance.

Our CaFT outperforms previous work on the Top-1 Loc, Top-5 Loc and Gt-Known Loc metrics. With the ViT-L backbone, CaFT achieves 88.26% Top-1 Loc and 97.55% Gt-Known Loc on CUB-200, and 60.59% Top-1 Loc and 69.86% Gt-Known Loc on ImageNet-1K. With the ViT-B backbone, CaFT still achieves 86.57% Top-1 Loc and 95.50% Gt-Known Loc on CUB-200, and 58.06% Top-1 Loc and 67.03% Gt-Known Loc on ImageNet-1K, which is also competitive. As argued in [4], transformer-based networks preserve an overall view of objects. By clustering, CaFT extracts a more complete mask from the feature map than CAM. Although there are noisy points, they are easy to erase through the subsequent convolution layers with $1\times1$ kernels.

To compare the effectiveness of the models more comprehensively, we show the localization accuracy under multiple IoU thresholds in Fig. 8. CaFT outperforms the previous methods at every IoU threshold.

Figure 8: Comparison of Gt-Known localization accuracy under different IoU thresholds on CUB-200 [11]. CaFT-B* does not use center cropping.

4.4 Ablation Study and More Details

In this section, we verify the components of CaFT by an ablation study on the CUB-200 dataset and discuss the effect of several design details.

4.4.1 Attention Filter.

Different from [16], we use the masks from clustering as pseudo labels rather than bounding boxes. To compare the effect of these two kinds of pseudo labels, we keep the other structures and parameters fixed and replace the AtF at different stages with bounding-box regression. The results and the training process are shown in Fig. 9(a) and Table 6.

Box reg1 is trained on the pseudo box labels generated by clustering and box reg2 is trained on the pseudo box labels generated by AtF1. The backbone of the box regression is ViT-B to rule out the effect of different backbones; we replace the MLP head on the class token with a regression head predicting [x, y, w, h], where the regression targets are values relative to the image size. From Table 6 we can see that even when trained on the same clustering result, box reg1 cannot reach the localization accuracy of AtF1. Moreover, for box reg2 trained on the boxes predicted by AtF1, the results fall sharply. Beyond the final accuracy, the training curves show that the convergence of box regression is worse than that of AtF. This experiment shows that box regression can increase localization accuracy to a certain extent but cannot deliver a stable, high-quality result. On the one hand, box regression usually needs a larger model than AtF and easily over-fits the training set as well as the noise in the pseudo labels. On the other hand, as known from fully supervised localization, box regression is an inherently unstable training objective because of the four-parameter regression, and this is even more severe with pseudo labels. Therefore, box regression is unable to smooth and correct the errors in the pseudo labels.
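For completeness, a sketch of the box-regression baseline used in this comparison is given below; the hidden size and the use of a sigmoid to keep the outputs relative to the image size are assumptions.

```python
import torch
import torch.nn as nn

class BoxRegressionHead(nn.Module):
    """Regression head on the frozen ViT class token predicting [x, y, w, h]
    as values relative to the image size."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True),
                                  nn.Linear(hidden, 4), nn.Sigmoid())

    def forward(self, cls_token):          # cls_token: (B, D)
        return self.head(cls_token)        # (B, 4) in [0, 1]

boxes = BoxRegressionHead()(torch.randn(2, 768))
print(boxes.shape)                         # torch.Size([2, 4])
```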

Figure 9: Training process. (a) Comparison of box regression and AtFs. (b) Comparison of different AtF structures.

We also try to directly train a deep network to extract the mask from raw images using the pseudo mask labels from clustering, with ResNet50 [5] and VGG16 [8]; the results are shown in Table 6. However, due to their depth these networks easily over-fit the background noise, so the localization accuracy is even worse than that of clustering.

Table 6: Comparison of different training methods on pseudo masks.
Method | Backbone | Gt-Known Loc
Clustering | ViT-B | 70.16
AtF1 | ViT-B | 85.64
AtF2 | ViT-B | 94.17
box reg1 | ViT-B | 79.13
box reg2 | ViT-B | 78.75
Direct | ResNet50 | 68.42
Direct | VGG16 | 58.21

A vital problem of pseudo labels is over-fitting, so we control the size of AtF and compare different structures in Fig. 9(b). AtFS has two $1\times1$ convolution layers with batch normalization and an activation function; AtFT has one $1\times1$ convolution layer with batch normalization and an activation function; AtFS-K3 has the same structure as AtFS but with a $3\times3$ kernel in the first convolution layer. The results show that AtFS is the best structure with a stable training curve, while AtFS-K3 over-fits. The normalized trained weights of the $3\times3$ kernel show that the center of the kernel has much larger weight than its surroundings. Therefore, the center is crucial, while merging surrounding information has the opposite effect and may confuse the judgement of edge points.

4.4.2 Fusion of Multi-layers.

CaFT merges the tokens of the last three layers and the position embedding parameters before clustering and AtF. We compare the influence of the number of fused layers on the clustering result. As shown in Fig. 10, under different merge ratios, the more layers are merged, the higher Gt-Known Loc and mean IoU become, although the magnitude of the improvement declines. Moreover, putting more weight on the last layer brings a small additional gain.

Merging the position embedding parameters of ViT can also lift the result slightly: keeping the ratios fixed, results with position embedding parameters are about 0.05% higher than their counterparts.

Figure 10: Comparison of different numbers of merged layers. "1, 2, 3" denotes the number of layers being merged.

5 Conclusion

In this paper, we have proposed the Clustering and Filter on Tokens (CaFT) model for weakly supervised object localization. CaFT aims to solve the problems of discriminative-region preference and threshold selection in WSOL. We use a clustering method and the special class token of the Vision Transformer (ViT) backbone to generate the initial object mask. To filter the noise of the initial mask, we train a shallow convolution head (Attention Filter, AtF) on the initial masks as pseudo labels to extract a more accurate mask from the tokens. Experiments on the CUB-200 and ImageNet-1K datasets show the effectiveness of CaFT. Different from CAM-based methods, CaFT provides a fresh, clustering-based way to think about WSOL tasks.

References

  • [1] Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communications in Statistics-theory and Methods 3(1), 1–27 (1974)
  • [2] Choe, J., Shim, H.: Attention-based dropout layer for weakly supervised object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2219–2228 (2019)
  • [3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [4] Gao, W., Wan, F., Pan, X., Peng, Z., Tian, Q., Han, Z., Zhou, B., Ye, Q.: Ts-cam: Token semantic coupled attention map for weakly supervised object localization. arXiv preprint arXiv:2103.14862 (2021)
  • [5] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [6] MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. vol. 1, pp. 281–297. Oakland, CA, USA (1967)
  • [7] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
  • [8] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [9] Singh, K.K., Lee, Y.J.: Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In: 2017 IEEE international conference on computer vision (ICCV). pp. 3544–3553. IEEE (2017)
  • [10] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning. pp. 10347–10357. PMLR (2021)
  • [11] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011)
  • [12] Wei, J., Wang, Q., Li, Z., Wang, S., Zhou, S.K., Cui, S.: Shallow feature matters for weakly supervised object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5993–6001 (2021)
  • [13] Wei, X.S., Zhang, C.L., Wu, J., Shen, C., Zhou, Z.H.: Unsupervised object discovery and co-localization by deep descriptor transformation. Pattern Recognition 88, 113–126 (2019)
  • [14] Xue, H., Liu, C., Wan, F., Jiao, J., Ji, X., Ye, Q.: Danet: Divergent activation for weakly supervised object localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6589–6598 (2019)
  • [15] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6023–6032 (2019)
  • [16] Zhang, C.L., Cao, Y.H., Wu, J.: Rethinking the route towards weakly supervised object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13460–13469 (2020)
  • [17] Zhang, X., Wei, Y., Feng, J., Yang, Y., Huang, T.S.: Adversarial complementary learning for weakly supervised object localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1325–1334 (2018)
  • [18] Zhang, X., Wei, Y., Kang, G., Yang, Y., Huang, T.: Self-produced guidance for weakly-supervised object localization. In: Proceedings of the European conference on computer vision (ECCV). pp. 597–613 (2018)
  • [19] Zhang, X., Wei, Y., Yang, Y.: Inter-image communication for weakly supervised localization. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16. pp. 271–287. Springer (2020)
  • [20] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)