
Semi-Weakly Supervised Object Detection by Sampling
Pseudo Ground-Truth Boxes

Akhil Meethal, Laboratoire d'imagerie, de vision et d'intelligence artificielle (LIVIA), Dept. of Systems Engineering, École de technologie supérieure, Montreal, Canada
Marco Pedersoli, LIVIA, Dept. of Systems Engineering, École de technologie supérieure, Montreal, Canada
Zhongwen Zhu, GAIA Montreal, Ericsson Canada
Francisco Perdigon Romero, GAIA Montreal, Ericsson Canada
Eric Granger, LIVIA, Dept. of Systems Engineering, École de technologie supérieure, Montreal, Canada
Abstract

Semi- and weakly-supervised learning have recently attracted considerable attention in the object detection literature, since they can alleviate the annotation cost needed to successfully train deep learning models. State-of-the-art approaches for semi-supervised learning rely on student-teacher models trained in a multi-stage process with considerable data augmentation. Custom networks have been developed for the weakly-supervised setting, making it difficult to adapt them to different detectors. In this paper, a weakly semi-supervised training method is introduced that reduces these training challenges, yet achieves state-of-the-art performance by leveraging only a small fraction of fully-labeled images together with the information in weakly-labeled images. In particular, our generic sampling-based learning strategy produces pseudo ground-truth (GT) bounding box annotations in an online fashion, eliminating the need for multi-stage training and student-teacher network configurations. These pseudo GT boxes are sampled from weakly-labeled images based on the categorical scores of object proposals, accumulated via a score propagation process. Empirical results on the Pascal VOC dataset (our code is available at: https://github.com/akhilpm/SemiWSOD) indicate that the proposed approach improves performance by 5.0% when using VOC 2007 as fully-labeled and VOC 2012 as weakly-labeled data. Moreover, with 5-10% fully annotated images, we observed an improvement of more than 10% in mAP, showing that a modest investment in image-level annotation can substantially improve detection performance.

1 Introduction

Recent advances in object detection are mainly due to deep learning models trained on massive amounts of densely annotated images [10, 3, 16, 20, 5, 30, 17]. The main development bottleneck of such detectors is the requirement for dense annotation of each image, i.e., the bounding box and class label of every instance present. Producing a curated and well-balanced annotated dataset is therefore a costly undertaking, often requiring expert knowledge. These challenges have motivated research on learning paradigms that train object detectors with reduced annotation. Popular approaches in this direction can be broadly categorized into semi-supervised learning [29, 32], weakly supervised learning [28, 22], active learning [23, 15], and few-shot learning [18]. In the semi-supervised setting, some approaches additionally use weak image-level labels [9] or point annotations [4].

Common steps in semi- and weakly-supervised learning consist of first extracting pseudo GT boxes, and then self-training the detector with these pseudo annotations [28, 4, 29, 22]. However, this often results in multiple stages of training, leading to learning difficulties that vary with the setting. In addition, the extraction of pseudo GT boxes often involves precise thresholding of detection scores to find a good trade-off between precision and recall, which can lead to unstable training [13]. In semi-supervised learning, the self-training targets are obtained from a first stage of training with only the fully annotated samples, which are few in number and therefore prone to overfitting [4]. In weakly-supervised learning, a custom MIL (Multiple Instance Learning) based detector must first be trained to generate these pseudo annotations, and then a standard fully-supervised detector is self-trained with the pseudo targets [28, 22]. Therefore, multiple training stages are required, and multiple detection architectures need to be trained. Moreover, the pseudo-target extraction must be performed offline, resulting in a training process that is not end-to-end.

In this paper, we propose an online self-training method for weakly semi-supervised object detection that exploits a small fraction of fully-annotated images, along with the remaining weakly-annotated ones, for efficient training. In this setting, we propose a sampling strategy where pseudo GT boxes are sampled using the semantics accumulated on object proposal regions. For images with real bounding box annotations, we compute the usual classification and localization losses to update the network. For images with weak annotations, we sample a set of pseudo GT boxes for each category present in the image, and classify and regress against them, which again allows the use of a standard detector. The pseudo GT boxes are sampled from object proposals extracted from the weakly-labeled images. Thus, during training, our network learns from fully- and weakly-annotated images in a single stage, using any backbone object detection architecture. Unlike other self-training based models for semi-supervised object detection, our method does not require thresholding the detection scores to define the number of pseudo GT boxes, which simplifies the training process. For sampling, we use the category-specific scores of the object proposals, which are accumulated by propagating the scores of detection boxes to object proposals based on their Intersection over Union (IoU). For experimental validation, the proposed approach is compared against state-of-the-art methods for semi-supervised learning on the Pascal VOC dataset.

2 Related work

2.1 Supervised Object Detection:

Models for object detection can be broadly classified into two-stage [21, 16] and one-stage [20, 17] detectors. In both cases, supervised object detection requires class and bounding box labels for each instance present in the image [11]. Two-stage detectors have a first stage that extracts RoIs (candidate object regions), whose reliability as potential candidates is quantified by an objectness score. Earlier approaches extract these RoIs using low-level image features, as in R-CNN [11], Fast R-CNN [10], etc. Later, end-to-end two-stage models emerged as more accurate detectors, where an additional learnable head called the RPN (Region Proposal Network) regresses candidate regions [21, 16, 5]. In contrast, one-stage detectors avoid the RoI extraction stage, and classify and regress directly from anchor boxes. They are generally fast and applicable to real-time object detection [20, 19, 17]. Though two-stage detectors are typically slower than their one-stage counterparts, extracting reliable candidate regions in the first stage gives them an edge in localization accuracy. Recently, anchor-free detectors [30, 7] have grown in popularity since they dispense with the hand-designed anchor box selection and matching process, and provide even faster detectors.

2.2 Semi/Weakly-Supervised Object Detection:

Semi-supervised object detectors rely on a small subset of labeled images and a large collection of unlabeled images to train a detector [29]. They have been explored in many different forms. Recently, detectors trained in a student-teacher fashion have gained popularity in the research community [32, 27]. In this setting, a teacher model is first trained using the available annotated data, and then used to produce pseudo-labels for the unlabeled data. Once the pseudo-labels are obtained, a student model is trained on the combined dataset. Though our proposed method also makes use of pseudo-labeling, it differs from these methods in that it requires neither two separate networks nor multiple training stages, making it a simpler solution for semi-supervised detection.

Weakly-supervised detectors, on the other hand, rely only on image-class labels [28, 1, 6]. Research on weakly-supervised methods has progressed greatly in recent years [22]. However, a fundamental problem is their inability to distinguish the entire object from its parts and context. As a result, these detectors often produce bounding boxes on discriminative object parts, and fail to separate object boundaries when multiple objects of the same class are spatially adjacent, yielding imprecise localizations that do not differentiate object boundaries from their context. Our proposed method also learns from a vast majority of weakly-labeled images, but combines them with a few fully-annotated images, providing better objectness distillation from the fully- to the weakly-annotated images. We argue that this is the most practical setting, as we can typically annotate a small number of images with bounding boxes and supply the rest with weak annotations, or no annotations at all. Some approaches also attempt to leverage additional, easy to obtain weak supervision for the unlabeled data, in particular weak image-level labels [9] or point annotations [4]. Though these labels require additional annotation effort, they are easy to obtain and may provide substantial improvements. Our approach also relies on weak image-class labels.

2.3 Self-Training:

Self-training has emerged as a popular approach in weakly and semi-supervised object detection [28, 22, 4]. In these approaches, a detector is first trained on the available data, and the detections from this first training are then used to train a refined model; this procedure is often repeated multiple times. However, these methods perform self-training with pseudo labels obtained offline, e.g., a Faster R-CNN network is trained with the detection results of a weakly-supervised detector as pseudo labels [28, 22]. In contrast, our method performs self-training with pseudo labels for the weakly labeled images in an online fashion: at each iteration, our approach samples pseudo GT boxes for each given weak label using the semantics accumulated by a score propagation block. Given the online nature of this self-training, training is a simple one-stage process.

3 Proposed Method

In this section, we introduce a learning technique for semi-weakly supervised object detection. Fig. 1 illustrates the overall system design. The backbone of our system can be any fully supervised architecture; we used Faster R-CNN [21] in our experiments. For every input image, the detector is employed in a different way, depending on the available annotation level. For fully-labeled images, we perform a normal forward-backward cycle using the real GT annotations provided. For weakly-labeled images, we use an importance sampling approach to select the most likely bounding boxes for each class in the image, and use them as pseudo ground-truth for training. This allows us to use the same detector for both the fully-labeled and the weakly-labeled images.

When combining weakly and fully supervised learning, we need to determine the right importance to associate with the two learning tasks. To do so, we rely on a hyper-parameter that defines the sampling ratio between fully- and weakly-labeled images. This lets us assign a different level of importance to each task without changing the actual losses defined in the detector code. The rest of this section introduces the learning steps for strongly and weakly-annotated images, and then presents how the sampling ratio is applied within the training process.

Figure 1: Proposed method for weakly semi-supervised training. For fully-labeled images, our approach uses the backbone detector without any modifications. For weakly-labeled images, our approach samples pseudo GT boxes from a set of proposals. The probability of sampling a proposal is proportional to the score given by the detector to the most overlapping detection. In this way, the sampling approach focuses more and more on the objects of interest in the image.

3.1 Learning with Strong Annotations:

For the images that are strongly annotated (i.e., with a bounding box annotation for each object present), the learning step is straightforward. Given an input image $I$, the ground truth (GT) annotations are defined by the bounding box positions $\mathcal{B}=\{b_0, b_1, \cdots, b_N\}$ and corresponding classes $\mathcal{C}=\{c_0, c_1, \cdots, c_N\}$. Here $b=(x_0, y_0, x_1, y_1)$ is a vector of 4 values that represents, for instance, the top-left and bottom-right corners of a box, while $c\in\mathcal{C}$ is a discrete value that represents the object category. In our experiments we use Faster R-CNN as the detector [21], and thus use a loss of the form:

$$L_F = \sum_{j\in\mathcal{D}}\sum_{i\in\mathcal{B}} \frac{1}{N_{cls}} L_{cls}(f_{c_i}(d_j, I)) + \lambda \frac{1}{N_{reg}} L_{reg}(c_i, d_j, b_i), \qquad (1)$$

For each GT bounding box $b_i$, the loss is computed from the scores $f_{c_i}$ and the overlap of the obtained detections $d_j$. $L_{cls}$ and $L_{reg}$ denote the classification and localization losses, respectively. $N_{cls}$ and $N_{reg}$ are normalization factors that depend on the number of foreground and background RoIs considered. $\lambda$ is a hyperparameter that controls the relative importance of the classification and localization losses. Note that the exact form of the loss can vary according to the fully supervised detector architecture used in the model, but our approach is independent of the specific fully supervised loss.

3.2 Learning with Weak Annotations:

In the case of weak supervision, we know the object classes $c$ that are present in the image, but not the bounding box locations $b_i$. Thus, the general approach of weakly supervised models is to learn, during training, which regions of the image are most likely to contain the object of interest. A typical way of doing this is to compute the classification loss for a given class $c$ and an image $I$ as a weighted sum of scores:

$$L_{cls}\Big(\underbrace{\sum_{l} w_l\, f_c(p_l, I)}_{h}\Big), \qquad (2)$$

where $f_c(p_l, I)$ is the score of the detector for ground truth class $c$ and object proposal $p_l\in\mathcal{P}$ of the image $I$, and $w_l$ is a normalized weight associated with each object proposal $p_l$. The weights $w_l$ for an image $I$ are computed as normalized scores:

$$w_l = \frac{\exp\{f_{c^*}(p_l, I)/T\}}{\sum_j \exp\{f_{c^*}(p_j, I)/T\}}, \qquad (3)$$

with $T$ being the temperature parameter that defines the sharpness of the weight distribution, a hyper-parameter of the learning approach. In this way, boxes with higher scores have more impact on the learning, which therefore focuses increasingly on the image locations most likely to contain the object of interest. While this formulation works well, it is computationally expensive, because all box locations $p_l$ must be evaluated for each image at every training iteration.
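To make Eqn. (3) concrete, the weight computation amounts to a temperature-scaled softmax over the per-proposal scores. A minimal PyTorch sketch follows (the function name is ours; the default $T=2.5$ is the value used in Sec. 5.1):

```python
import torch

def proposal_weights(scores: torch.Tensor, T: float = 2.5) -> torch.Tensor:
    """Temperature-scaled softmax over per-proposal scores (Eqn. (3)).

    scores: 1-D tensor holding f_c*(p_l, I) for every proposal of one class.
    A larger T flattens the distribution (more exploration); a smaller T
    sharpens it (more exploitation).
    """
    return torch.softmax(scores / T, dim=0)
```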

Here, we propose to approximate the weighted sum of scores $h$ with Monte Carlo sampling: instead of computing the sum over all bounding box locations $l$, we uniformly sample $K$ boxes:

$$h \approx \hat{h} = \sum_{k\sim\mathcal{U}} \theta_k\, f_c(p_k, I), \qquad (4)$$

where $\theta_k$ is the weight associated with bounding box $k$. This allows us to compute only $K$ evaluations of the expensive $f$, while using an unbiased estimate of the weakly supervised scoring function. We call these samples pseudo GT bounding boxes because, as we will see in the next section, they can be passed to the detection algorithm as ground truth annotations. This allows our algorithm to use any off-the-shelf object detector without modification.

However, when sampling, the weights $\theta_k$ associated with the bounding box scores cannot be computed directly: they are the normalized versions of the object scores $f_c(p_k, I)$ for the boxes $p_k$, but we do not have all the scores needed for the normalization factor, as we sampled only a few boxes. Instead, for each image $I$ we keep in memory the scores $s_l$ associated with the bounding boxes $p_l$ and update only the scores of the $K$ sampled boxes. Then, $\theta_k$ is computed as the normalized version of $s_k$:

$$\theta_k = \frac{\exp\{s_k/T\}}{\sum_j \exp\{s_j/T\}} \qquad (5)$$

As $\theta_k$ approximates $w_l$, once the scores $f$ no longer change (i.e., at convergence), $\hat{h}$ becomes $h$. Therefore $\hat{h}$ is an unbiased estimate of $h$. While this approach would work, sampling uniformly over all possible bounding box proposals $p_l$ would make learning very slow, because most samples would not come from the object of interest. Instead, in this work we use an importance sampling approach. We use $\theta_k$ as the sampling probabilities of a multinomial distribution $\mathcal{M}(\theta_k)$, so that bounding box proposal $p_k$ is sampled with probability $\theta_k$. In this case, to maintain the same estimate of $h$, we need to divide by the sampling probability $\theta_k$. Thus the final estimate of $h$ is:

$$\hat{h} = \sum_{k\sim\mathcal{M}(\theta_k)} f_c(p_k, I), \qquad (6)$$

which is again an unbiased estimator of the classification score of an image, but with lower variance. Thus, the final weakly supervised loss $L_W$ for an image $I$ is the same as in Eqn. (1), but with the ground truth bounding boxes $b_i$ replaced by the pseudo GT boxes sampled from the object proposals $\mathcal{P}$.
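To illustrate the two estimators, the following PyTorch sketch contrasts the uniform estimate of Eqn. (4) with the importance-sampled estimate of Eqn. (6); `scores` and `theta` stand for the per-proposal scores and the normalized weights of Eqn. (5), and, as in the equations above, constant normalization factors are omitted:

```python
import torch

def estimate_h_uniform(scores, theta, K=5):
    """Eqn. (4): sample K proposals uniformly and weight each score by its
    stored weight theta_k (constant normalization factors are absorbed
    into the loss weight, as in the paper)."""
    idx = torch.randint(len(scores), (K,))
    return (theta[idx] * scores[idx]).sum()

def estimate_h_importance(scores, theta, K=5):
    """Eqn. (6): draw K proposals from the multinomial M(theta); dividing
    by the sampling probability theta_k cancels the weight, leaving a
    plain sum over the sampled scores."""
    idx = torch.multinomial(theta, K, replacement=True)
    return scores[idx].sum()
```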

3.3 Combining Strong and Weak Annotation:

While the fully supervised object detector uses ground truth boxes that are correct, the weakly supervised counterpart estimates the object box locations during training, so the estimation can be noisy. Thus, when learning with strong and weak labels, we may want a hyper-parameter that balances the relative importance of the two losses. In this case the final loss is $L = L_F + \lambda L_W$. As the aim of this approach is to avoid directly modifying the detector's loss, we express the weight $\lambda$ as a sampling ratio. That is, we use a ratio parameter $r$ that controls the amount of training data drawn from the fully and weakly labeled pools. For instance, $r=0.6$ means that we sample with probability 0.6 from the pool of fully-labeled samples and 0.4 from the weakly-labeled samples. This approach has the advantage of not changing the internals of the underlying detector, such as the loss function or additional regularization. With this design, we can feed both the fully and weakly annotated images in parallel to the model and train it in a single stage.
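As an illustration, the ratio parameter can be realized as a trivial sampler over the two image pools. The sketch below is ours (list-based pools are an assumption; the default $r=0.7$ is taken from the ablation in Sec. 5.2.3):

```python
import random

def next_training_image(fully_labeled, weakly_labeled, r=0.7):
    """Realize the loss weight lambda as a sampling ratio: with probability
    r the next image comes from the fully-labeled pool, otherwise from the
    weakly-labeled pool. The detector's internal losses stay untouched."""
    pool = fully_labeled if random.random() < r else weakly_labeled
    return random.choice(pool)
```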

3.4 Learning Algorithm

In this section we describe in more detail the different parts of the proposed learning algorithm, summarized in Alg. 1. For the sake of simplicity, the algorithm is shown for a single image $I$, but it extends trivially to a batch of images.

For supervised samples, our algorithm directly uses the bounding box annotations $\mathcal{B}$ and the corresponding classes $\mathcal{C}$ for inference (detect). For weak supervision, inference is performed on pseudo GT annotations $(\hat{\mathcal{B}}, \hat{\mathcal{C}})$ obtained by sampling object proposals (sample). Then, the obtained detections $\mathcal{D}$ and scores $\mathcal{S}^D$ are used to update the proposal scores $\mathcal{S}^P$ (update). In both cases, the obtained detections $\mathcal{D}$ and scores $\mathcal{S}^D$ are used to compute the loss $L$ and update the recognition model (backprop).

Input: image $I$, GT $(\mathcal{B}, \mathcal{C})$, proposals and scores $(\mathcal{P}, \mathcal{S}^P)$
if $\mathcal{B} \neq \emptyset$ then
       // fully supervised
       $\mathcal{D}, \mathcal{S}^D$ = detect($I$, $\mathcal{B}$, $\mathcal{C}$);
else
       // weakly supervised
       $\hat{\mathcal{B}}, \hat{\mathcal{C}}$ = sample($\mathcal{P}$, $\mathcal{S}^P$, $\mathcal{C}$);
       $\mathcal{D}, \mathcal{S}^D$ = detect($I$, $\hat{\mathcal{B}}$, $\hat{\mathcal{C}}$);
       $\mathcal{S}^P$ = update($\mathcal{P}$, $\mathcal{D}$, $\mathcal{S}^D$);
end if
backprop($I$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{D}$, $\mathcal{S}^D$)
Algorithm 1: Semi-weakly supervised learning with pseudo GT
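A Python rendering of Alg. 1 may help fix ideas. Here `detect` and `backprop` are hypothetical stand-ins for detector-specific code, while `sample_pseudo_gt` and `propagate_scores` are sketched in the following subsections:

```python
def train_step(image, gt_boxes, gt_classes, image_classes,
               proposals, prop_scores, detector):
    """One iteration of Alg. 1 for a single image. `prop_scores` is the
    persistent per-image score table S^P that survives across epochs."""
    if len(gt_boxes) > 0:                      # fully supervised branch
        boxes, classes = gt_boxes, gt_classes
        dets, det_scores = detect(detector, image, boxes, classes)
    else:                                      # weakly supervised branch
        boxes, classes = sample_pseudo_gt(proposals, prop_scores,
                                          image_classes)
        dets, det_scores = detect(detector, image, boxes, classes)
        prop_scores = propagate_scores(proposals, dets, det_scores,
                                       prop_scores)
    backprop(detector, image, boxes, classes, dets, det_scores)
    return prop_scores
```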

3.4.1 Detection inference

(detect) This can be any detection algorithm that takes as input the ground truth annotations $(\mathcal{B}, \mathcal{C})$ for an image $I$ and returns detections $\mathcal{D}$ with associated scores $\mathcal{S}^D$ for all classes. Notice that, in general, ground truth annotations $\mathcal{B}$ are not needed to perform inference; however, many detectors use them to restrict the computation of detections to the neighborhood of the GT annotations.

3.4.2 Sampling Pseudo GT

(sample) For each proposal $p_l\in\mathcal{P}$ we have the corresponding classification score $s_{l,c}$ for a given class $c$. This score is accumulated from the detector output via the score propagation step explained in the next section. For each class present in an image, we consider the scores of all proposals $\mathcal{P}$, denoted by $\mathcal{S}^P$, and sample $K$ boxes from the multinomial distribution $\mathcal{M}$ with probabilities $\theta$ computed as in Eqn. (5). Figure 2 illustrates the sampling process during training for the person category. Although pseudo GT boxes are initially sampled almost randomly from the image, the sampling converges to meaningful locations for the person category in later stages.
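A possible implementation of the sample step, assuming the proposal scores are kept in an (N, C) tensor (the exact data layout in our code may differ):

```python
import torch

def sample_pseudo_gt(proposals, prop_scores, image_classes, K=5, T=2.5):
    """Draw K pseudo GT boxes per weak class label (a sketch).

    proposals:     (N, 4) selective-search boxes.
    prop_scores:   (N, C) accumulated per-class proposal scores S^P.
    image_classes: class indices present in the image (the weak labels).
    """
    boxes, classes = [], []
    for c in image_classes:
        theta = torch.softmax(prop_scores[:, c] / T, dim=0)  # Eqn. (5)
        idx = torch.multinomial(theta, K, replacement=True)  # k ~ M(theta)
        boxes.append(proposals[idx])
        classes.append(torch.full((K,), c, dtype=torch.long))
    return torch.cat(boxes), torch.cat(classes)
```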

[Figure 2 panels: sampled pseudo GT boxes at Epochs 1, 5, 10, 15, 20, 25, 30, and 35.]
Figure 2: Evolution of the pseudo GT sampling. While in the first iterations of training, bounding boxes are sampled almost randomly (exploration), after some training the algorithm learns to sample only from meaningful locations (exploitation).

3.4.3 Score Propagation

(update) Score propagation is the component that updates the score values $\mathcal{S}^P$ of the object proposals $\mathcal{P}$. If the output bounding boxes of the detector $\mathcal{D}$ corresponded to the object proposals $\mathcal{P}$, we could directly copy the detection scores to our pool of proposals. Instead, as the output detections $\mathcal{D}$ of modern detectors are generated by regression, we propose a method to propagate the scores from the output detections $\mathcal{D}$ to the scores $\mathcal{S}^P$ of our object proposals, which are then sampled as pseudo GT. During learning, the proposals accumulate scores from their overlapping boxes produced by the detector. In our design, the score propagation is weighted by the overlap between a proposal $p_l$ and a detection $d$. This helps the proposals aggregate the detection scores of their neighborhood regions during learning.

Proposal scores are initialized to 0 and then, during learning, updated based on detection scores. We explored several criteria for propagating the scores and observed that propagating from the maximum-overlapping detection box helps the model collect better semantics for the region. Thus, we define $\gamma$ as the maximum intersection over union between proposal $p$ and all detection boxes $d\in\mathcal{D}$: $\gamma = \max_{d\in\mathcal{D}} \frac{p\cap d}{p\cup d}$. For each proposal $p_l$ in the image, we then update its score $s_{l,c}$ proportionally to $\gamma$:

$$s_{l,c} = (1-\gamma)\, s_{l,c} + \gamma\, s_{d,c}, \qquad (7)$$

where $s_{d,c}$ is the score of the maximum-overlapping detection box $d$ for category $c$. In this way, the score of a proposal $l$ with high overlap with detection $d$ receives a strong update, while a proposal with low overlap keeps its stored score $s_{l,c}$ largely unchanged.
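A possible implementation of the update step of Eqn. (7), using torchvision's pairwise IoU; the thresholded variant from Table 1 is included, and the in-place update of the score table is an implementation choice of this sketch:

```python
import torch
from torchvision.ops import box_iou

def propagate_scores(proposals, dets, det_scores, prop_scores, t=0.3):
    """Eqn. (7): blend each proposal's stored score with the score of its
    maximum-overlapping detection, weighted by their IoU gamma. Only
    proposals with gamma > t are touched (the best setting in Table 1).

    proposals: (N, 4), dets: (M, 4), det_scores: (M, C),
    prop_scores: (N, C) persistent table S^P, updated in place."""
    if len(dets) == 0:
        return prop_scores
    iou = box_iou(proposals, dets)        # (N, M) pairwise IoU
    gamma, j = iou.max(dim=1)             # best overlap and its detection
    keep = gamma > t
    g = gamma[keep].unsqueeze(1)
    prop_scores[keep] = (1 - g) * prop_scores[keep] + g * det_scores[j[keep]]
    return prop_scores
```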

3.4.4 Updating the model

(backprop) The update of the detector model proceeds as usual. The loss $L_F$ comparing the ground truth (or pseudo GT) annotations $(\mathcal{B}, \mathcal{C})$ with the obtained detections $(\mathcal{D}, \mathcal{S}^D)$ is computed, and the model parameters are updated with standard back-propagation.

4 Extensions

4.1 Reducing Proposals with Class Activation Maps:

In our basic model we use around 2000 object proposals obtained from the selective search algorithm [31] in order to achieve high detection recall. However, keeping so many proposals means we must maintain a large table of scores, which makes the algorithm slow and noisier at the beginning of training, as there are many possible regions to explore. In our experiments, we tested using a Class Activation Map (CAM) model [25] to reduce the initial number of proposals. In practice, for each class present in an image, we extract its CAM. Then, for each CAM region, only the proposals that overlap it by at least $\rho$ are kept. The final set of proposals is the union of the proposals selected for each class.
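A sketch of this filtering step, under the simplifying assumption that each CAM region is summarized by its bounding box; measuring overlap as intersection area over proposal area is our interpretation of the $\rho$ criterion:

```python
import torch

def filter_proposals_by_cam(proposals, cam_boxes, rho=0.1):
    """Keep proposals that overlap a CAM region of any class present in the
    image. The final set is the union over classes, as described above.

    proposals: (N, 4) boxes; cam_boxes: (num_classes_present, 4) tensor,
    one (x0, y0, x1, y1) box per class present in the image."""
    keep = torch.zeros(len(proposals), dtype=torch.bool)
    areas = ((proposals[:, 2] - proposals[:, 0])
             * (proposals[:, 3] - proposals[:, 1]))
    for cam in cam_boxes:
        ix0 = torch.maximum(proposals[:, 0], cam[0])
        iy0 = torch.maximum(proposals[:, 1], cam[1])
        ix1 = torch.minimum(proposals[:, 2], cam[2])
        iy1 = torch.minimum(proposals[:, 3], cam[3])
        inter = (ix1 - ix0).clamp(min=0) * (iy1 - iy0).clamp(min=0)
        keep |= (inter / areas) >= rho
    return proposals[keep]
```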

4.2 From Weakly to Fully Semi-Supervised Learning:

We extend the proposed weakly semi-supervised approach to a standard semi-supervised one, in which part of the data is fully supervised and the rest has no supervision at all. To do so, we keep the same structure as before, but for the non-annotated images, instead of using ground truth image-level labels, we use the labels predicted by the image classifier already used to compute the CAMs. As we will see in the experiments, this provides almost the same level of performance as the semi-weakly supervised setting, but without the need for weak labels.

5 Experiments

5.1 Experimental Setup:

Datasets and evaluation metric: The effectiveness of our proposed method is evaluated on VOC07 and VOC12 [8]. We perform ablation experiments on VOC07. To test the performance of our algorithm with different amounts of fully and weakly annotated training data, we split the trainval set of VOC07 into different percentages of fully annotated and weakly annotated images. We used 0%, 5%, 10%, and 20% of the images with bounding box annotations in this study, with the rest carrying image-level labels. Images are sampled randomly to create the fully and weakly annotated splits. For all experiments on the VOC dataset, we used the test set for evaluation. The standard VOC AP metric (AP 50) is used to measure the performance of the model. Finally, to compare with other weakly and semi-supervised approaches, we trained using VOC 2007 trainval as the fully labeled set, and VOC 2012 trainval as the weakly labeled and unlabeled sets.

Implementation details: In most of our experiments, we used VGG16 [26] as the CNN backbone, pre-trained on the ImageNet [24] dataset. We also report results with a ResNet101 backbone [12] for fair comparison with other state-of-the-art methods. The backbone detector used in our study is Faster R-CNN [21]. The whole network is trained end-to-end using stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. The initial learning rate is set to 1e-2 and decayed at epochs [5, 10] by a factor of 10. We trained the model for 20 epochs with a batch size of 8. The temperature parameter $T$ of the multinomial distribution used for sampling is set to 2.5. For each class present in an image, we sample $K=5$ object proposals as pseudo GT during training.
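For reference, this optimization setup corresponds to the following standard PyTorch configuration (a sketch; `detector` stands for the Faster R-CNN model being trained):

```python
import torch

def build_optimizer(detector: torch.nn.Module):
    """SGD with momentum 0.9 and weight decay 5e-4; the initial lr of 1e-2
    is decayed by a factor of 10 at epochs 5 and 10, as described above."""
    optimizer = torch.optim.SGD(detector.parameters(), lr=1e-2,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[5, 10], gamma=0.1)
    return optimizer, scheduler
```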

During training, the shorter edge of each input image is randomly rescaled within {480, 576, 688, 864, 1200}. We only use horizontal flipping for data augmentation. Object proposals are extracted using the selective search algorithm [31]; typically, up to 2000 object proposals are extracted per image for good recall of all object instances. Images are normalized with mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225], as in ImageNet training [24]. The network is trained on an NVIDIA V100 GPU with 32GB of memory.

5.2 Ablation Study:

Several ablation studies are conducted to assess the individual components of our proposed model. First, we study the impact on performance of different ways of defining the sampler and the score propagation module. Then, we analyze the impact of learning with a growing amount of annotations. Finally, we study the contribution of the different types of errors made by the model. All of these studies are conducted on PASCAL VOC 2007 by training the model on its trainval set and testing on its test set. We used 10% bounding box annotations in this analysis, while the rest of the images are weakly annotated.

5.2.1 Sampler and score propagation

To understand whether the sampler is learning meaningful object locations, we analyze the heatmap produced by the score distributions of the object proposals for a given class. To obtain the heatmap, for each pixel location, the scores of all object proposals covering that pixel are added, and then normalized by the number of covering proposals. Fig. 3 shows some example heatmaps. The active regions of the heatmaps correlate well with object locations, showing that the sampler extracts meaningful semantic information through sampling and score propagation. We also notice that for small objects (the ducks in the top-right image) or objects with a recurrent background (the train), the sampler selects not only the object of interest but also some background. However, this is a problem common to all weakly supervised approaches.
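The heatmap computation itself is straightforward; a sketch of the per-pixel aggregation described above (the function name and data layout are ours):

```python
import torch

def score_heatmap(proposals, scores, height, width):
    """Each pixel receives the mean score of all proposals covering it.

    proposals: (N, 4) integer boxes (x0, y0, x1, y1);
    scores: per-proposal scores for one class."""
    acc = torch.zeros(height, width)
    cnt = torch.zeros(height, width)
    for (x0, y0, x1, y1), s in zip(proposals.long().tolist(), scores):
        acc[y0:y1, x0:x1] += s
        cnt[y0:y1, x0:x1] += 1
    return acc / cnt.clamp(min=1)
```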

Figure 3: Examples of heatmaps of sampler scores for images from different Pascal VOC categories.

Score propagation can be designed in many ways, based on, e.g., all detection boxes, or a selected set of boxes matching some quality criterion. In this study, 3 settings are considered: (1) score propagation from all detection boxes, (2) from the maximum-overlapping box, and (3) from the maximum-overlapping box only when the overlap is above a threshold $t$. We found that $t=0.3$ provides the best performance. Table 1 summarizes the results of this study on the VOC 2007 dataset; the model is trained on a 10% fully-annotated split of its trainval set and evaluated on the test set. Propagating scores from the maximum-overlapping detection box of each proposal provides the highest mAP. Requiring the overlap to exceed the threshold $t$ imposes an additional quality constraint on score propagation and improves the results further. Score propagation from all detection boxes does not perform well, although it provides a smoother update of the object proposal scores. This may be due to high scores being spread over a large area when all detection boxes propagate their scores, which results in the incorrect sampling of over-sized proposals, especially for smaller objects.

Table 1: Analysis of score propagation strategies. Performance is measured for a weakly semi-supervised model trained with 10% full annotations and the remaining images weakly labeled, on the VOC 2007 dataset.
Score Propagation Strategy | mAP
Propagate from all boxes | 57.2
Propagate from max-overlapping box | 58.3
Propagate from max-overlapping box when IoU > $t$ | 60.3

5.2.2 Impact of fully annotated images

Figure 4: The impact on mAP performance of adding more fully annotated images during training on the VOC 2007 dataset.

In this experiment, we compare the performance of a detector baseline trained only with strong labels (red line) to a model trained with strong and weak labels using our sampling approach (green line). Figure 4 shows the results. As expected, the gain of our model is more significant when the amount of strong labels is small. For instance, with 5% strong labels, our model improves over the baseline by 10 points. As the percentage of strong labels increases, the gain shrinks to a few points at 50% strong labels. This experiment shows that our approach is particularly useful when only a very small amount of fully labeled data is available and the rest is weakly labeled. In this setting, our model can approach the performance of a fully supervised model, but with far fewer annotations.

5.2.3 Impact of the ratio parameter

We analyze the importance of the ratio parameter $r$ for balancing the number of fully and weakly annotated training images. In Table 2, it can be observed that without this balancing, the detection performance is even worse than in the setting where only fully annotated images are used. With proper ratio balancing ($r=0.7$), the mAP of the detector significantly outperforms the baseline using only fully supervised images. Thus, by tuning this ratio parameter, we can effectively leverage the large pool of weakly-annotated images. One appealing property of this strategy is that it does not require any change to the model architecture or loss function.

Table 2: Impact of the ratio of fully to weakly annotated images.
Settings | mAP
10% fully annotated images | 50.9
10% fully annotated, remaining weakly annotated (without ratio balancing) | 47.5
10% fully annotated, remaining weakly annotated (with ratio balancing) | 60.3

5.2.4 Impact of CAM proposals

Table 3 shows the impact on performance of using the overlap between object proposals and the CAM of the relevant class during sampling. The CAM is obtained by training a VGG16 [26] network on the multi-label VOC 2007 image-level labels. The overlap of selective search proposals [31] with the CAMs of all classes present in the image is then computed, and proposals without sufficient overlap with any CAM, which likely come from background regions, are discarded. This results in a slight loss of recall, but an improvement in mAP, especially with few fully annotated images, due to the reduction of noisy proposal regions that could misguide the sampler. In practice, we used an overlap threshold of 0.1, which results in a 5% reduction in recall but reduces the average number of proposals by a factor of 4, to approximately 500 per image. From Table 3, it is clear that filtering noisy proposals with CAMs improves mAP. However, the impact of the CAM proposals diminishes as more fully annotated images become available. This is in accordance with the general fact that, with more annotations, the appearance model becomes more accurate, and the model itself becomes powerful enough to distinguish object boundaries well.

Table 3: Impact on performance when using proposals filtered by a CAM model.
% images with bounding box annotation | mAP without CAM proposals | mAP with CAM proposals
0% | 27.2 | 35.5
5% | 48.4 | 53.1
10% | 57.6 | 60.3
20% | 64.6 | 65.5

5.2.5 Loss in performance

We also analyze the distribution of the errors of our model using the TIDE [2] evaluation tool (see Figure 5). The localization error contributes the most to the overall errors made by our detection model. This is expected: a large fraction of the images lack bounding box labels, so the objectness distilled from a small fraction of fully annotated images is insufficient to capture large variations in appearance. Missed ground truth is the next major error. This is mainly a consequence of the limited exploration capacity of the sampler: once some major object regions start receiving higher scores from score propagation, the sampler can miss other difficult instances, especially smaller objects. Our sampler then no longer draws candidate proposals from those regions, and they remain undetected. Frequently co-occurring background regions are also challenging for the sampler, since such regions can likewise accumulate high scores over time from the score propagation block; they may be sampled many times, resulting in detection boxes on background regions.

Figure 5: TIDE [2] evaluation of detection errors. Error types are: Cls: localized correctly but classified incorrectly; Loc: classified correctly but localized incorrectly; Both: both classification and localization errors; Dupe: duplicate detection error; Bkg: background detected as foreground; Miss: missed ground truth.

5.3 Comparison with State-of-the-Art Methods:

Table 4 compares our method with state-of-the-art methods for semi- and weakly-supervised training of object detectors. Pascal VOC 2007 is used as the fully-labeled data, while VOC 2012 serves as the unlabeled or weakly-labeled set. We first evaluate our method in the semi-weakly supervised setting. The only other method performing semi-weakly supervised learning is WSSOD [9]. In this setting our method outperforms it, while also being more flexible, as it can be used with any detector model. Compared to the model trained only on VOC07 (lower bound), we observe a significant improvement (5.0%) when using additional weakly labeled data, approaching a model with full annotations on both datasets (upper bound).

We then compare our model to the state-of-the-art in the plain semi-supervised setting. To report results in this setting, we train a classifier on the available fully-labeled images and use it to obtain weak image-level labels for the unlabeled images. This requires training an additional classifier, but it is less expensive than the detector pre-training used in most semi-supervised methods. The results in the table indicate a significant improvement in performance, at the moderate additional cost of collecting weak image-level labels. In the semi-supervised case, our method also shows an improvement of 3.4%, outperforming most of the methods in terms of mAP and AP 50.

Table 4: Comparison of mAP performance for different methods on the VOC 2007 test set. Models are trained using VOC 2007 as the fully annotated set and VOC 2012 as the weakly annotated or unlabeled set. We report the VOC-style mAP (AP 50) and the COCO-style mAP (AP).
Method | AP 50 | AP
Fully Supervised VOC07 (Lower Bound) | 74.4 | -
Semi-weakly supervised: VOC07 (Fully) + 12 (Weakly)
WSSOD [9], arXiv 2021 | 78.9 | -
Ours | 79.4 | 47.3
Semi-supervised: VOC07 (Fully) + 12 (Unsup.)
CSD [13], NeurIPS 2019 | 74.7 | 42.7
STAC [27], arXiv 2020 | 77.4 | 44.6
WSSOD [9], arXiv 2021 | 78.0 | -
ISD [14], CVPR 2021 | 74.4 | -
Ours | 77.8 | 44.2
Fully Supervised VOC07+12 (Upper Bound) | 80.9 | -

6 Conclusion

This paper introduces a sampling-based framework that can adapt any detector model requiring bounding box annotations to weakly- and semi-supervised learning. With single-stage online learning, our method effectively exploits images without bounding box annotations by sampling pseudo GT boxes from their object proposals, based on a score accumulated for each region via a score propagation mechanism. By repeating this sampling together with the normal detector weight updates, our method learns to sample representative regions as pseudo GT boxes for the unlabeled images. With this sampling-based framework, training is a single-stage process. Unlike the vast majority of semi-supervised techniques in the literature, the pseudo GT boxes are not controlled by a threshold, which makes the learning process simple and straightforward. Our experimental validation on VOC07 shows that our method achieves excellent detection performance with a reduced amount of fully annotated data plus additional image-level annotations.

References

  • [1] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
  • [2] D. Bolya, S. Foley, J. Hays, and J. Hoffman. TIDE: A general toolbox for identifying object detection errors. In ECCV, 2020.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [4] L. Chen, T. Yang, X. Zhang, W. Zhang, and J. Sun. Points as queries: Weakly semi-supervised object detection by points. In CVPR, 2021.
  • [5] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NeurIPS, 2016.
  • [6] A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. V. Gool. Weakly supervised cascaded convolutional networks. In CVPR, 2017.
  • [7] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian. CenterNet: Keypoint triplets for object detection. arXiv:1904.08189, 2019.
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
  • [9] S. Fang, Y. Cao, X. Wang, K. Chen, D. Lin, and W. Zhang. WSSOD: A new pipeline for weakly- and semi-supervised object detection. arXiv:2105.11293, 2021.
  • [10] R. Girshick. Fast r-cnn. In ICCV, 2015.
  • [11] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [13] J. Jeong, S. Lee, J. Kim, and N. Kwak. Consistency-based semi-supervised learning for object detection. In NeurIPS, 2019.
  • [14] J. Jeong, V. Verma, M. Hyun, J. Kannala, and N. Kwak. Interpolation-based semi-supervised learning for object detection. In CVPR, 2021.
  • [15] Y. Li, B. Fan, W. Zhang, W. Ding, and J. Yin. Deep active learning for object detection. Journal of Information Sciences, 2021.
  • [16] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV, 2016.
  • [18] L. Qiao, Y. Zhao, Z. Li, X. Qiu, J. Wu, and C. Zhang. Defrcn: Decoupled faster r-cnn for few-shot object detection. In ICCV, 2021.
  • [19] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [20] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In CVPR, 2017.
  • [21] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [22] Z. Ren, Z. Yu, X. Yang, M. Liu, Y. J. Lee, A. G. Schwing, and J. Kautz. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In CVPR, 2020.
  • [23] S. Roy, A. Unmesh, and V. P. Namboodiri. Deep active learning for object detection. In BMVC, 2018.
  • [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • [25] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [27] K. Sohn, Z. Zhang, C. Li, H. Zhang, C. Lee, and T. Pfister. A simple semi-supervised learning framework for object detection. arXiv:2005.04757, 2020.
  • [28] P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In CVPR, 2017.
  • [29] Y. Tang, W. Chen, Y. Luo, and Y. Zhang. Humble teachers teach better students for semi-supervised object detection. In CVPR, 2021.
  • [30] Z. Tian, C. Shen, H. Chen, and T. He. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.
  • [31] K. van de Sande, J. Uijlings, T. Gevers, and A. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
  • [32] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu. End-to-end semi-supervised object detection with soft teacher. In ICCV, 2021.