
Semi-Weakly Supervised Object Detection by Sampling
Pseudo Ground-Truth Boxes

Akhil Meethal, Laboratoire d'imagerie, de vision et d'intelligence artificielle (LIVIA), Dept. of Systems Engineering, École de technologie supérieure, Montreal, Canada
Marco Pedersoli, LIVIA, Dept. of Systems Engineering, École de technologie supérieure, Montreal, Canada
Zhongwen Zhu, GAIA Montreal, Ericsson Canada
Francisco Perdigon Romero, GAIA Montreal, Ericsson Canada
Eric Granger, LIVIA, Dept. of Systems Engineering, École de technologie supérieure, Montreal, Canada
Abstract

Semi- and weakly-supervised learning have recently attracted considerable attention in the object detection literature, since they can alleviate the annotation cost needed to successfully train deep learning models. State-of-the-art approaches for semi-supervised learning rely on student-teacher models trained in a multi-stage process with considerable data augmentation. Custom networks have been developed for the weakly-supervised setting, making it difficult to adapt them to different detectors. In this paper, a weakly semi-supervised training method is introduced that reduces these training challenges, yet achieves state-of-the-art performance by leveraging only a small fraction of fully-labeled images together with the information in weakly-labeled images. In particular, our generic sampling-based learning strategy produces pseudo ground-truth (GT) bounding box annotations in an online fashion, eliminating the need for multi-stage training and student-teacher network configurations. These pseudo GT boxes are sampled from weakly-labeled images based on the categorical scores of object proposals, accumulated via a score propagation process. Empirical results on the Pascal VOC dataset (our code is available at: https://github.com/akhilpm/SemiWSOD) indicate that the proposed approach improves performance by 5.0% when using VOC 2007 as fully-labeled and VOC 2012 as weakly-labeled data. Moreover, with 5-10% fully annotated images, we observed an improvement of more than 10% in mAP, showing that a modest investment in image-level annotation can substantially improve detection performance.

1 Introduction

Recent advances in object detection are mainly due to deep learning models trained on massive amounts of densely annotated images [10, 3, 16, 20, 5, 30, 17]. The main development bottleneck of such detectors is the requirement for dense annotation of each image, i.e., the bounding box and class label of every instance present. Producing a curated and well-balanced annotated dataset is therefore a costly undertaking, often requiring expert knowledge. These challenges have motivated research on learning paradigms that train object detectors with reduced annotation. Popular approaches in this direction can be broadly categorized into semi-supervised learning [29, 32], weakly supervised learning [28, 22], active learning [23, 15], and few-shot learning [18]. In the semi-supervised setting, some approaches additionally use weak image-level labels [9] or point annotations [4].

Common steps in semi- and weakly-supervised learning consist of first extracting pseudo GT boxes, and then self-training the detector with these pseudo annotations [28, 4, 29, 22]. However, this often results in multiple stages of training, leading to learning difficulties that vary with the setting. In addition, the extraction of pseudo GT boxes often involves precise thresholding of detection scores to find a good trade-off between precision and recall, which can lead to unstable training [13]. In semi-supervised learning, the self-training targets are obtained from a first stage of training with only the fully annotated samples, which are few in number and therefore prone to overfitting [4]. In weakly-supervised learning, a custom MIL (Multiple Instance Learning) based detector must first be trained to generate these pseudo annotations, and then a standard fully-supervised detector is self-trained with the pseudo targets [28, 22]. Therefore, multiple training stages are required, and multiple detection architectures need to be trained. Moreover, the pseudo-target extraction must be performed offline, resulting in a training process that is not end-to-end.

In this paper, we propose an online self-training method for weakly semi-supervised object detection that exploits a small fraction of fully-annotated images, along with the remaining weakly-annotated ones, for efficient training. In this setting, we propose a sampling strategy where pseudo GT boxes are sampled using the semantics accumulated on object proposal regions. For images with real bounding box annotations, we compute the usual classification and localization losses to update the network. For images with weak annotations, we sample a set of pseudo GT boxes for each category present in the image, and classify and regress against them, which again allows the use of a standard detector. The pseudo GT boxes are sampled from object proposals extracted from the weakly-labeled images. Thus, during training, our network learns from fully- and weakly-annotated images in a single stage, using any backbone object detection architecture. Unlike other self-training based models for semi-supervised object detection, our method does not require thresholding the detection scores to define the number of pseudo GT boxes, which simplifies the training process. For sampling, we use the category-specific scores of the object proposals, which are accumulated by propagating the scores of detection boxes to object proposals based on their Intersection over Union (IoU). For experimental validation, the proposed approach is compared against state-of-the-art methods for semi-supervised learning on the Pascal VOC dataset.

2 Related work

2.1 Supervised Object Detection:

Models for object detection can be broadly classified into two-stage [21, 16] and one-stage [20, 17] detectors. In both cases, supervised object detection requires class and bounding box labels for each instance present in the image [11]. Two-stage detectors have a first stage that extracts RoIs (candidate object regions), whose reliability as potential candidates is quantified by an objectness score. Earlier approaches extract these RoIs using low-level image features, as in R-CNN [11], Fast R-CNN [10], etc. Later, end-to-end two-stage models emerged as more accurate detectors, where an additional learnable head called the RPN (Region Proposal Network) regresses candidate regions [21, 16, 5]. In contrast, one-stage detectors avoid the RoI extraction stage, and classify and regress directly from anchor boxes. They are generally fast and applicable to real-time object detection [20, 19, 17]. Though two-stage detectors are typically slower than their one-stage counterparts, extracting reliable candidate regions in the first stage gives them an edge in localization accuracy. Recently, anchor-free detectors [30, 7] have grown in popularity since they dispense with the hand-designed anchor box selection and matching process, and provide even faster detectors.

2.2 Semi/Weakly-Supervised Object Detection:

Semi-supervised object detectors rely on a small subset of labeled images and a large collection of unlabeled images to train a detector [29]. They have been explored in many different forms. Recently, detectors trained in a student-teacher fashion have gained popularity in the research community [32, 27]. In this setting, a teacher model is first trained using the available annotated data, and then used to produce pseudo-labels for the unlabeled data. Once the pseudo-labels are obtained, a student model is trained on the combined dataset. Though our proposed method also makes use of pseudo-labeling, it differs from these methods in that it requires neither two separate networks nor multiple training stages, making it a simpler solution for semi-supervised detection.

Weakly-supervised detectors, on the other hand, rely only on image-class labels [28, 1, 6]. Research on weakly-supervised methods has progressed greatly in recent years [22]. However, a fundamental problem is their inability to distinguish the entire object from its parts and context. As a result, these detectors often produce bounding boxes on discriminative object parts, and fail to separate object boundaries when multiple objects of the same class are spatially adjacent, yielding imprecise localizations that do not differentiate object boundaries from their context. Our proposed method also learns from a vast majority of weakly-labeled images, but combines them with a few fully-annotated images, providing better objectness distillation from the fully- to the weakly-annotated images. We argue that this is the most practical setting, as we can typically annotate a small number of images with bounding boxes and supply the rest with weak annotations, or no annotations at all. Some approaches also attempt to leverage additional, easy to obtain weak supervision for the unlabeled data, in particular weak image-level labels [9] or point annotations [4]. Though these labels require additional annotation effort, they are easy to obtain and may provide substantial improvements. Our approach also relies on weak image-class labels.

2.3 Self-Training:

Self-training has emerged as a popular approach in weakly and semi-supervised object detection [28, 22, 4]. In these approaches, a detector is first trained on the available data, and the detections from this first training are then used to train a refined model; this procedure is often repeated multiple times. However, these methods perform self-training with pseudo labels obtained offline, e.g., a Faster R-CNN network is trained with the detection results of a weakly-supervised detector as pseudo labels [28, 22]. In contrast, our method performs self-training with pseudo labels for the weakly labeled images in an online fashion: at each iteration, our approach samples pseudo GT boxes for each given weak label using the semantics accumulated by a score propagation block. Given the online nature of this self-training, training is a simple one-stage process.

3 Proposed Method

In this section, we introduce a learning technique for semi-weakly supervised object detection. Fig. 1 illustrates the overall system design. The backbone of our system can be any fully supervised architecture; we used Faster R-CNN [21] in our experiments. For every input image, the detector is employed in a different way, depending on the available annotation level. For fully-labeled images, we perform a normal forward-backward cycle using the real GT annotations provided. For weakly-labeled images, we use an importance sampling approach to select the most likely bounding boxes for each class in the image, and use them as pseudo ground-truth for training. This allows us to use the same detector for both the fully-labeled and the weakly-labeled images.

When combining weakly and fully supervised learning, we need to determine the right importance to associate with the two learning tasks. To do so, we rely on a hyper-parameter that defines the sampling ratio between fully- and weakly-labeled images. This lets us assign a different level of importance to each task without changing the actual losses defined in the detector code. The rest of this section introduces the learning steps for strongly and weakly-annotated images, and then presents how the sampling ratio is applied within the training process.

Figure 1: Proposed method for weakly semi-supervised training. For fully-labeled images, our approach uses the backbone detector without any modifications. For weakly-labeled images, our approach samples pseudo GT boxes from a set of proposals. The probability of sampling a proposal is proportional to the score given by the detector to the most overlapping detection. In this way, the sampling approach focuses more and more on the objects of interest in the image.

3.1 Learning with Strong Annotations:

For the images that are strongly annotated (i.e., with a bounding box annotation for each object present), the learning step is straightforward. Given an input image $I$, the ground truth (GT) annotations are defined by the bounding box positions $\mathcal{B}=\{b_0, b_1, \cdots, b_N\}$ and corresponding classes $\mathcal{C}=\{c_0, c_1, \cdots, c_N\}$. Here $b=(x_0, y_0, x_1, y_1)$ is a vector of 4 values that represents, for instance, the top-left and bottom-right corners of a box, while $c\in\mathcal{C}$ is a discrete value that represents the object category. In our experiments we use Faster R-CNN as the detector [21], and thus use a loss of the form:

$$L_F = \sum_{j\in\mathcal{D}}\sum_{i\in\mathcal{B}} \frac{1}{N_{cls}} L_{cls}(f_{c_i}(d_j, I)) + \lambda \frac{1}{N_{reg}} L_{reg}(c_i, d_j, b_i), \qquad (1)$$

For each GT bounding box $b_i$, the loss is computed from the scores $f_{c_i}$ and the overlap of the obtained detections $d_j$. $L_{cls}$ and $L_{reg}$ denote the classification and localization losses, respectively. $N_{cls}$ and $N_{reg}$ are normalization factors that depend on the number of foreground and background RoIs considered. $\lambda$ is a hyperparameter that controls the relative importance of the classification and localization losses. Note that the exact form of the loss can vary according to the fully supervised detector architecture used in the model, but our approach is independent of the specific fully supervised loss.

3.2 Learning with Weak Annotations:

In the case of weak supervision, we know the object classes $c$ that are present in the image, but not the bounding box locations $b_i$. Thus, the general approach of weakly supervised models is to learn, during training, which regions of the image are most likely to contain the object of interest. A typical way of doing this is to compute the classification loss for a given class $c$ and an image $I$ as a weighted sum of scores:

$$L_{cls}\Big(\underbrace{\sum_{l} w_l\, f_c(p_l, I)}_{h}\Big), \qquad (2)$$

where $f_c(p_l, I)$ is the score of the detector for ground truth class $c$ and object proposal $p_l\in\mathcal{P}$ of the image $I$, and $w_l$ is a normalized weight associated with each object proposal $p_l$. The weights $w_l$ for an image $I$ are computed as normalized scores:

$$w_l = \frac{\exp\{f_{c^*}(p_l, I)/T\}}{\sum_j \exp\{f_{c^*}(p_j, I)/T\}}, \qquad (3)$$

with $T$ being the temperature parameter that defines the sharpness of the weight distribution, a hyper-parameter of the learning approach. In this way, boxes with higher scores have more impact on the learning, which therefore focuses increasingly on the image locations most likely to contain the object of interest. While this formulation works well, it is computationally expensive, because all box locations $p_l$ must be evaluated for each image at every training iteration.
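To make Eqn. (3) concrete, the weight computation amounts to a temperature-scaled softmax over the per-proposal scores. A minimal PyTorch sketch follows (the function name is ours; the default $T=2.5$ is the value used in Sec. 5.1):

```python
import torch

def proposal_weights(scores: torch.Tensor, T: float = 2.5) -> torch.Tensor:
    """Temperature-scaled softmax over per-proposal scores (Eqn. (3)).

    scores: 1-D tensor holding f_c*(p_l, I) for every proposal of one class.
    A larger T flattens the distribution (more exploration); a smaller T
    sharpens it (more exploitation).
    """
    return torch.softmax(scores / T, dim=0)
```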

Here, we propose to approximate the weighted sum of scores $h$ with Monte Carlo sampling: instead of computing the sum over all bounding box locations $l$, we uniformly sample $K$ boxes:

$$h \approx \hat{h} = \sum_{k\sim\mathcal{U}} \theta_k\, f_c(p_k, I), \qquad (4)$$

where $\theta_k$ is the weight associated with bounding box $k$. This allows us to compute only $K$ evaluations of the expensive $f$, while using an unbiased estimate of the weakly supervised scoring function. We call these samples pseudo GT bounding boxes because, as we will see in the next section, they can be passed to the detection algorithm as ground truth annotations. This allows our algorithm to use any off-the-shelf object detector without modification.

However, when sampling, the weights $\theta_k$ associated with the bounding box scores cannot be computed directly: they are the normalized versions of the object scores $f_c(p_k, I)$ for the boxes $p_k$, but we do not have all the scores needed for the normalization factor, as we sampled only a few boxes. Instead, for each image $I$ we keep in memory the scores $s_l$ associated with the bounding boxes $p_l$ and update only the scores of the $K$ sampled boxes. Then, $\theta_k$ is computed as the normalized version of $s_k$:

$$\theta_k = \frac{\exp\{s_k/T\}}{\sum_j \exp\{s_j/T\}} \qquad (5)$$

As $\theta_k$ approximates $w_l$, once the scores $f$ no longer change (i.e., at convergence), $\hat{h}$ becomes $h$. Therefore $\hat{h}$ is an unbiased estimate of $h$. While this approach would work, sampling uniformly over all possible bounding box proposals $p_l$ would make learning very slow, because most samples would not come from the object of interest. Instead, in this work we use an importance sampling approach. We use $\theta_k$ as the sampling probabilities of a multinomial distribution $\mathcal{M}(\theta_k)$, so that bounding box proposal $p_k$ is sampled with probability $\theta_k$. In this case, to maintain the same estimate of $h$, we need to divide by the sampling probability $\theta_k$. Thus the final estimate of $h$ is:

$$\hat{h} = \sum_{k\sim\mathcal{M}(\theta_k)} f_c(p_k, I), \qquad (6)$$

which is again an unbiased estimator of the classification score of an image, but with lower variance. Thus, the final weakly supervised loss $L_W$ for an image $I$ is the same as in Eqn. (1), but with the ground truth bounding boxes $b_i$ replaced by the pseudo GT boxes sampled from the object proposals $\mathcal{P}$.
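To illustrate the two estimators, the following PyTorch sketch contrasts the uniform estimate of Eqn. (4) with the importance-sampled estimate of Eqn. (6); `scores` and `theta` stand for the per-proposal scores and the normalized weights of Eqn. (5), and, as in the equations above, constant normalization factors are omitted:

```python
import torch

def estimate_h_uniform(scores, theta, K=5):
    """Eqn. (4): sample K proposals uniformly and weight each score by its
    stored weight theta_k (constant normalization factors are absorbed
    into the loss weight, as in the paper)."""
    idx = torch.randint(len(scores), (K,))
    return (theta[idx] * scores[idx]).sum()

def estimate_h_importance(scores, theta, K=5):
    """Eqn. (6): draw K proposals from the multinomial M(theta); dividing
    by the sampling probability theta_k cancels the weight, leaving a
    plain sum over the sampled scores."""
    idx = torch.multinomial(theta, K, replacement=True)
    return scores[idx].sum()
```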

3.3 Combining Strong and Weak Annotation:

While the fully supervised object detector uses ground truth boxes that are correct, the weakly supervised counterpart estimates the object box locations during training, so the estimation can be noisy. Thus, when learning with strong and weak labels, we may want a hyper-parameter that balances the relative importance of the two losses. In this case the final loss is $L = L_F + \lambda L_W$. As the aim of this approach is to avoid directly modifying the detector's loss, we express the weight $\lambda$ as a sampling ratio. That is, we use a ratio parameter $r$ that controls the amount of training data drawn from the fully and weakly labeled pools. For instance, $r=0.6$ means that we sample with probability 0.6 from the pool of fully-labeled samples and 0.4 from the weakly-labeled samples. This approach has the advantage of not changing the internals of the underlying detector, such as the loss function or additional regularization. With this design, we can feed both the fully and weakly annotated images in parallel to the model and train it in a single stage.
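As an illustration, the ratio parameter can be realized as a trivial sampler over the two image pools. The sketch below is ours (list-based pools are an assumption; the default $r=0.7$ is taken from the ablation in Sec. 5.2.3):

```python
import random

def next_training_image(fully_labeled, weakly_labeled, r=0.7):
    """Realize the loss weight lambda as a sampling ratio: with probability
    r the next image comes from the fully-labeled pool, otherwise from the
    weakly-labeled pool. The detector's internal losses stay untouched."""
    pool = fully_labeled if random.random() < r else weakly_labeled
    return random.choice(pool)
```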

3.4 Learning Algorithm

In this section we describe in more detail the different parts of the proposed learning algorithm, summarized in Alg. 1. For the sake of simplicity, the algorithm is shown for a single image $I$, but it extends trivially to a batch of images.

For supervised samples, our algorithm directly uses the bounding box annotations $\mathcal{B}$ and the corresponding classes $\mathcal{C}$ for inference (detect). For weak supervision, inference is performed on pseudo GT annotations $(\hat{\mathcal{B}}, \hat{\mathcal{C}})$ obtained by sampling object proposals (sample). Then, the obtained detections $\mathcal{D}$ and scores $\mathcal{S}^D$ are used to update the proposal scores $\mathcal{S}^P$ (update). In both cases, the obtained detections $\mathcal{D}$ and scores $\mathcal{S}^D$ are used to compute the loss $L$ and update the recognition model (backprop).

Input: image $I$, GT $(\mathcal{B}, \mathcal{C})$, proposals and scores $(\mathcal{P}, \mathcal{S}^P)$
if $\mathcal{B} \neq \emptyset$ then
       // fully supervised
       $\mathcal{D}, \mathcal{S}^D$ = detect($I$, $\mathcal{B}$, $\mathcal{C}$);
else
       // weakly supervised
       $\hat{\mathcal{B}}, \hat{\mathcal{C}}$ = sample($\mathcal{P}$, $\mathcal{S}^P$, $\mathcal{C}$);
       $\mathcal{D}, \mathcal{S}^D$ = detect($I$, $\hat{\mathcal{B}}$, $\hat{\mathcal{C}}$);
       $\mathcal{S}^P$ = update($\mathcal{P}$, $\mathcal{D}$, $\mathcal{S}^D$);
end if
backprop($I$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{D}$, $\mathcal{S}^D$)
Algorithm 1: Semi-weakly supervised learning with pseudo GT
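A Python rendering of Alg. 1 may help fix ideas. Here `detect` and `backprop` are hypothetical stand-ins for detector-specific code, while `sample_pseudo_gt` and `propagate_scores` are sketched in the following subsections:

```python
def train_step(image, gt_boxes, gt_classes, image_classes,
               proposals, prop_scores, detector):
    """One iteration of Alg. 1 for a single image. `prop_scores` is the
    persistent per-image score table S^P that survives across epochs."""
    if len(gt_boxes) > 0:                      # fully supervised branch
        boxes, classes = gt_boxes, gt_classes
        dets, det_scores = detect(detector, image, boxes, classes)
    else:                                      # weakly supervised branch
        boxes, classes = sample_pseudo_gt(proposals, prop_scores,
                                          image_classes)
        dets, det_scores = detect(detector, image, boxes, classes)
        prop_scores = propagate_scores(proposals, dets, det_scores,
                                       prop_scores)
    backprop(detector, image, boxes, classes, dets, det_scores)
    return prop_scores
```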

3.4.1 Detection inference

(detect) This can be any detection algorithm that takes as input the ground truth annotations $(\mathcal{B}, \mathcal{C})$ for an image $I$ and returns detections $\mathcal{D}$ with associated scores $\mathcal{S}^D$ for all classes. Notice that, in general, ground truth annotations $\mathcal{B}$ are not needed to perform inference; however, many detectors use them to restrict the computation of detections to the neighborhood of the GT annotations.

3.4.2 Sampling Pseudo GT

(sample) For each proposal $p_l\in\mathcal{P}$ we have the corresponding classification score $s_{l,c}$ for a given class $c$. This score is accumulated from the detector output via the score propagation step explained in the next section. For each class present in an image, we consider the scores of all proposals $\mathcal{P}$, denoted by $\mathcal{S}^P$, and sample $K$ boxes from the multinomial distribution $\mathcal{M}$ with probabilities $\theta$ computed as in Eqn. (5). Figure 2 illustrates the sampling process during training for the person category. Although pseudo GT boxes are initially sampled almost randomly from the image, the sampling converges to meaningful locations for the person category in later stages.
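A possible implementation of the sample step, assuming the proposal scores are kept in an (N, C) tensor (the exact data layout in our code may differ):

```python
import torch

def sample_pseudo_gt(proposals, prop_scores, image_classes, K=5, T=2.5):
    """Draw K pseudo GT boxes per weak class label (a sketch).

    proposals:     (N, 4) selective-search boxes.
    prop_scores:   (N, C) accumulated per-class proposal scores S^P.
    image_classes: class indices present in the image (the weak labels).
    """
    boxes, classes = [], []
    for c in image_classes:
        theta = torch.softmax(prop_scores[:, c] / T, dim=0)  # Eqn. (5)
        idx = torch.multinomial(theta, K, replacement=True)  # k ~ M(theta)
        boxes.append(proposals[idx])
        classes.append(torch.full((K,), c, dtype=torch.long))
    return torch.cat(boxes), torch.cat(classes)
```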

[Figure 2 panels: sampled pseudo GT boxes at Epochs 1, 5, 10, 15, 20, 25, 30, and 35.]
Figure 2: Evolution of the pseudo GT sampling. While in the first iterations of training, bounding boxes are sampled almost randomly (exploration), after some training the algorithm learns to sample only from meaningful locations (exploitation).

3.4.3 Score Propagation

(update) Score propagation is the component that updates the score values $\mathcal{S}^P$ of the object proposals $\mathcal{P}$. If the output bounding boxes of the detector $\mathcal{D}$ corresponded to the object proposals $\mathcal{P}$, we could directly copy the detection scores to our pool of proposals. Instead, as the output detections $\mathcal{D}$ of modern detectors are generated by regression, we propose a method to propagate the scores from the output detections $\mathcal{D}$ to the scores $\mathcal{S}^P$ of our object proposals, which are then sampled as pseudo GT. During learning, the proposals accumulate scores from their overlapping boxes produced by the detector. In our design, the score propagation is weighted by the overlap between a proposal $p_l$ and a detection $d$. This helps the proposals aggregate the detection scores of their neighborhood regions during learning.

Proposal scores are initialized to 0 and then, during learning, updated based on detection scores. We explored several criteria for propagating the scores and observed that propagating from the maximum-overlapping detection box helps the model collect better semantics for the region. Thus, we define $\gamma$ as the maximum intersection over union between proposal $p$ and all detection boxes $d\in\mathcal{D}$: $\gamma = \max_{d\in\mathcal{D}} \frac{p\cap d}{p\cup d}$. For each proposal $p_l$ in the image, we then update its score $s_{l,c}$ proportionally to $\gamma$:

$$s_{l,c} = (1-\gamma)\, s_{l,c} + \gamma\, s_{d,c}, \qquad (7)$$

where $s_{d,c}$ is the score of the maximum-overlapping detection box $d$ for category $c$. In this way, the score of a proposal $l$ with high overlap with detection $d$ receives a strong update, while a proposal with low overlap keeps its stored score $s_{l,c}$ largely unchanged.
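A possible implementation of the update step of Eqn. (7), using torchvision's pairwise IoU; the thresholded variant from Table 1 is included, and the in-place update of the score table is an implementation choice of this sketch:

```python
import torch
from torchvision.ops import box_iou

def propagate_scores(proposals, dets, det_scores, prop_scores, t=0.3):
    """Eqn. (7): blend each proposal's stored score with the score of its
    maximum-overlapping detection, weighted by their IoU gamma. Only
    proposals with gamma > t are touched (the best setting in Table 1).

    proposals: (N, 4), dets: (M, 4), det_scores: (M, C),
    prop_scores: (N, C) persistent table S^P, updated in place."""
    if len(dets) == 0:
        return prop_scores
    iou = box_iou(proposals, dets)        # (N, M) pairwise IoU
    gamma, j = iou.max(dim=1)             # best overlap and its detection
    keep = gamma > t
    g = gamma[keep].unsqueeze(1)
    prop_scores[keep] = (1 - g) * prop_scores[keep] + g * det_scores[j[keep]]
    return prop_scores
```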

3.4.4 Updating the model

(backprop) The update of the detector model proceeds as usual. The loss $L_F$ comparing the ground truth (or pseudo GT) annotations $(\mathcal{B}, \mathcal{C})$ with the obtained detections $(\mathcal{D}, \mathcal{S}^D)$ is computed, and the model parameters are updated with standard back-propagation.

4 Extensions

4.1 Reducing Proposals with Class Activation Maps:

In our basic model we use around 2000 object proposals obtained from the selective search algorithm [31] in order to achieve high detection recall. However, keeping so many proposals means we must maintain a large table of scores, which makes the algorithm slow and noisier at the beginning of training, as there are many possible regions to explore. In our experiments, we tested using a Class Activation Map (CAM) model [25] to reduce the initial number of proposals. In practice, for each class present in an image, we extract its CAM. Then, for each CAM region, only the proposals that overlap it by at least $\rho$ are kept. The final set of proposals is the union of the proposals selected for each class.
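A sketch of this filtering step, under the simplifying assumption that each CAM region is summarized by its bounding box; measuring overlap as intersection area over proposal area is our interpretation of the $\rho$ criterion:

```python
import torch

def filter_proposals_by_cam(proposals, cam_boxes, rho=0.1):
    """Keep proposals that overlap a CAM region of any class present in the
    image. The final set is the union over classes, as described above.

    proposals: (N, 4) boxes; cam_boxes: (num_classes_present, 4) tensor,
    one (x0, y0, x1, y1) box per class present in the image."""
    keep = torch.zeros(len(proposals), dtype=torch.bool)
    areas = ((proposals[:, 2] - proposals[:, 0])
             * (proposals[:, 3] - proposals[:, 1]))
    for cam in cam_boxes:
        ix0 = torch.maximum(proposals[:, 0], cam[0])
        iy0 = torch.maximum(proposals[:, 1], cam[1])
        ix1 = torch.minimum(proposals[:, 2], cam[2])
        iy1 = torch.minimum(proposals[:, 3], cam[3])
        inter = (ix1 - ix0).clamp(min=0) * (iy1 - iy0).clamp(min=0)
        keep |= (inter / areas) >= rho
    return proposals[keep]
```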

4.2 From Weakly to Fully Semi-Supervised Learning:

We extend the proposed weakly semi-supervised approach to a standard semi-supervised one, in which part of the data is fully supervised and the rest has no supervision at all. To do so, we keep the same structure as before, but for the non-annotated images, instead of using ground truth image-level labels, we use the labels predicted by the image classifier already used to compute the CAMs. As we will see in the experiments, this provides almost the same level of performance as the semi-weakly supervised setting, but without the need for weak labels.

5 Experiments

5.1 Experimental Setup:

Datasets and evaluation metric: The effectiveness of our proposed method is evaluated on VOC07 and VOC12 [8]. We perform ablation experiments on VOC07. To test the performance of our algorithm with different amounts of fully and weakly annotated training data, we split the trainval set of VOC07 into different percentages of fully annotated and weakly annotated images. We used 0%, 5%, 10%, and 20% of the images with bounding box annotations in this study, with the rest carrying image-level labels. Images are sampled randomly to create the fully and weakly annotated splits. For all experiments on the VOC dataset, we used the test set for evaluation. The standard VOC AP metric (AP 50) is used to measure the performance of the model. Finally, to compare with other weakly and semi-supervised approaches, we trained using VOC 2007 trainval as the fully labeled set, and VOC 2012 trainval as the weakly labeled and unlabeled sets.

Implementation details: In most of our experiments, we used VGG16 [26] as the CNN backbone, pre-trained on the ImageNet [24] dataset. We also report results with a ResNet101 backbone [12] for fair comparison with other state-of-the-art methods. The backbone detector used in our study is Faster R-CNN [21]. The whole network is trained end-to-end using stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. The initial learning rate is set to 1e-2 and decayed at epochs [5, 10] by a factor of 10. We trained the model for 20 epochs with a batch size of 8. The temperature parameter $T$ of the multinomial distribution used for sampling is set to 2.5. For each class present in an image, we sample $K=5$ object proposals as pseudo GT during training.
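For reference, this optimization setup corresponds to the following standard PyTorch configuration (a sketch; `detector` stands for the Faster R-CNN model being trained):

```python
import torch

def build_optimizer(detector: torch.nn.Module):
    """SGD with momentum 0.9 and weight decay 5e-4; the initial lr of 1e-2
    is decayed by a factor of 10 at epochs 5 and 10, as described above."""
    optimizer = torch.optim.SGD(detector.parameters(), lr=1e-2,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[5, 10], gamma=0.1)
    return optimizer, scheduler
```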

During training, the shorter edge of each input image is randomly rescaled within {480, 576, 688, 864, 1200}. We only use horizontal flipping for data augmentation. Object proposals are extracted using the selective search algorithm [31]; typically, up to 2000 object proposals are extracted per image for good recall of all object instances. Images are normalized with mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225], as in ImageNet training [24]. The network is trained on an NVIDIA V100 GPU with 32GB of memory.

5.2 Ablation Study:

Several ablation studies are conducted to assess the individual components of our proposed model. First, we study the impact on performance of different ways of defining the sampler and the score propagation module. Then, we analyze the impact of learning with a growing amount of annotations. Finally, we study the contribution of the different types of errors made by the model. All of these studies are conducted on PASCAL VOC 2007 by training the model on its trainval set and testing on its test set. We used 10% bounding box annotations in this analysis, while the rest of the images are weakly annotated.

5.2.1 Sampler and score propagation

To understand whether the sampler is learning meaningful object locations, we analyze the heatmap produced by the score distributions of the object proposals for a given class. To obtain the heatmap, for each pixel location, the scores of all object proposals covering that pixel are added, and then normalized by the number of covering proposals. Fig. 3 shows some example heatmaps. The active regions of the heatmaps correlate well with object locations, showing that the sampler extracts meaningful semantic information through sampling and score propagation. We also notice that for small objects (the ducks in the top-right image) or objects with a recurrent background (the train), the sampler selects not only the object of interest but also some background. However, this is a problem common to all weakly supervised approaches.
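The heatmap computation itself is straightforward; a sketch of the per-pixel aggregation described above (the function name and data layout are ours):

```python
import torch

def score_heatmap(proposals, scores, height, width):
    """Each pixel receives the mean score of all proposals covering it.

    proposals: (N, 4) integer boxes (x0, y0, x1, y1);
    scores: per-proposal scores for one class."""
    acc = torch.zeros(height, width)
    cnt = torch.zeros(height, width)
    for (x0, y0, x1, y1), s in zip(proposals.long().tolist(), scores):
        acc[y0:y1, x0:x1] += s
        cnt[y0:y1, x0:x1] += 1
    return acc / cnt.clamp(min=1)
```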

Figure 3: Examples of heatmaps of sampler scores for images from different Pascal VOC categories.

Score propagation can be designed in many ways, based on, e.g., all detection boxes, or a selected set of boxes matching some quality criterion. In this study, 3 settings are considered: (1) score propagation from all detection boxes, (2) from the maximum-overlapping box, and (3) from the maximum-overlapping box only when the overlap is above a threshold $t$. We found that $t=0.3$ provides the best performance. Table 1 summarizes the results of this study on the VOC 2007 dataset; the model is trained on a 10% fully-annotated split of its trainval set and evaluated on the test set. Propagating scores from the maximum-overlapping detection box of each proposal provides the highest mAP. Requiring the overlap to exceed the threshold $t$ imposes an additional quality constraint on score propagation and improves the results further. Score propagation from all detection boxes does not perform well, although it provides a smoother update of the object proposal scores. This may be due to high scores being spread over a large area when all detection boxes propagate their scores, which results in the incorrect sampling of over-sized proposals, especially for smaller objects.

Table 1: Analysis of score propagation strategies. Performance is measured for a weakly semi-supervised model trained with 10% full annotations and the remaining images weakly labeled, on the VOC 2007 dataset.
Score Propagation Strategy | mAP
Propagate from all boxes | 57.2
Propagate from max-overlapping box | 58.3
Propagate from max-overlapping box when IoU > $t$ | 60.3

5.2.2 Impact of fully annotated images

Figure 4: The impact on mAP performance of adding more fully annotated images during training on the VOC 2007 dataset.

In this experiment, we compare the performance of a detector baseline trained only with strong labels (red line) to a model trained with strong and weak labels using our sampling approach (green line). Figure 4 shows the results. As expected, the gain of our model is more significant when the amount of strong labels is small. For instance, with 5% strong labels, our model improves over the baseline by 10 points. As the percentage of strong labels increases, the gain shrinks to a few points at 50% strong labels. This experiment shows that our approach is particularly useful when only a very small amount of fully labeled data is available and the rest is weakly labeled. In this setting, our model can approach the performance of a fully supervised model, but with far fewer annotations.

5.2.3 Impact of the ratio parameter

We analyze the importance of the ratio parameter $r$ for balancing the number of fully and weakly annotated training images. In Table 2, it can be observed that without this balancing, the detection performance is even worse than in the setting where only fully annotated images are used. With proper ratio balancing ($r=0.7$), the mAP of the detector significantly outperforms the baseline using only fully supervised images. Thus, by tuning this ratio parameter, we can effectively leverage the large pool of weakly-annotated images. One appealing property of this strategy is that it does not require any change to the model architecture or loss function.

Table 2: Impact of the ratio of fully to weakly annotated images.
Settings | mAP
10% fully annotated images | 50.9
10% fully annotated, remaining weakly annotated (without ratio balancing) | 47.5
10% fully annotated, remaining weakly annotated (with ratio balancing) | 60.3

5.2.4 Impact of CAM proposals

Table 3 shows the impact on performance of using the overlap between object proposals and the CAM of the relevant class during sampling. The CAM is obtained by training a VGG16 [26] network on the multi-label VOC 2007 image-level labels. The overlap of selective search proposals [31] with the CAMs of all classes present in the image is then computed, and proposals without sufficient overlap with any CAM, which likely come from background regions, are discarded. This results in a slight loss of recall, but an improvement in mAP, especially with few fully annotated images, due to the reduction of noisy proposal regions that could misguide the sampler. In practice, we used an overlap threshold of 0.1, which results in a 5% reduction in recall but reduces the average number of proposals by a factor of 4, to approximately 500 per image. From Table 3, it is clear that filtering noisy proposals with CAMs improves mAP. However, the impact of the CAM proposals diminishes as more fully annotated images become available. This is in accordance with the general fact that, with more annotations, the appearance model becomes more accurate, and the model itself becomes powerful enough to distinguish object boundaries well.

Table 3: Impact on performance when using proposals filtered by a CAM model.
% images with bounding box annotation | mAP without CAM proposals | mAP with CAM proposals
0% | 27.2 | 35.5
5% | 48.4 | 53.1
10% | 57.6 | 60.3
20% | 64.6 | 65.5

5.2.5 Loss in performance

We also analyze the distribution of the errors of our model using the TIDE [2] evaluation tool (see Figure 5). The localization error contributes the most to the overall errors made by our detection model. This is expected: a large fraction of the images lack bounding box labels, so the objectness distilled from a small fraction of fully annotated images is insufficient to capture large variations in appearance. Missed ground truth is the next major error. This is mainly a consequence of the limited exploration capacity of the sampler: once some major object regions start receiving higher scores from score propagation, the sampler can miss other difficult instances, especially smaller objects. Our sampler then no longer draws candidate proposals from those regions, and they remain undetected. Frequently co-occurring background regions are also challenging for the sampler, since such regions can likewise accumulate high scores over time from the score propagation block; they may be sampled many times, resulting in detection boxes on background regions.

Figure 5: TIDE [2] evaluation of detection errors. Error types are: Cls: localized correctly but classified incorrectly; Loc: classified correctly but localized incorrectly; Both: both classification and localization errors; Dupe: duplicate detection error; Bkg: background detected as foreground; Miss: missed ground truth.

5.3 Comparison with State-of-the-Art Methods:

Table 4 compares our method with state-of-the-art methods for semi- and weakly-supervised training of object detectors. Pascal VOC 2007 is used as the fully-labeled data, while VOC 2012 serves as the unlabeled or weakly-labeled set. We first evaluate our method in the semi-weakly supervised setting. The only other method performing semi-weakly supervised learning is WSSOD [9]. In this setting our method outperforms it, while also being more flexible, as it can be used with any detector model. Compared to the model trained only on VOC07 (lower bound), we observe a significant improvement (5.0%) when using additional weakly labeled data, approaching a model with full annotations on both datasets (upper bound).

We then compare our model to the state-of-the-art in the plain semi-supervised setting. To report results in this setting, we train a classifier on the available fully-labeled images and use it to obtain weak image-level labels for the unlabeled images. This requires training an additional classifier, but it is less expensive than the detector pre-training used in most semi-supervised methods. The results in the table indicate a significant improvement in performance, at the moderate additional cost of collecting weak image-level labels. In the semi-supervised case, our method also shows an improvement of 3.4%, outperforming most of the methods in terms of mAP and AP 50.

Table 4: Comparison of mAP performance for different methods on the VOC 2007 test set. Models are trained using VOC 2007 as the fully annotated set and VOC 2012 as the weakly annotated or unlabeled set. We report the VOC-style mAP (AP 50) and the COCO-style mAP (AP).
Method | AP 50 | AP
Fully Supervised VOC07 (Lower Bound) | 74.4 | -
Semi-weakly supervised: VOC07 (Fully) + 12 (Weakly)
WSSOD [9], arXiv 2021 | 78.9 | -
Ours | 79.4 | 47.3
Semi-supervised: VOC07 (Fully) + 12 (Unsup.)
CSD [13], NeurIPS 2019 | 74.7 | 42.7
STAC [27], arXiv 2020 | 77.4 | 44.6
WSSOD [9], arXiv 2021 | 78.0 | -
ISD [14], CVPR 2021 | 74.4 | -
Ours | 77.8 | 44.2
Fully Supervised VOC07+12 (Upper Bound) | 80.9 | -

6 Conclusion

This paper introduces a sampling-based framework that can adapt any detector model requiring bounding box annotations to weakly- and semi-supervised learning. With single-stage online learning, our method effectively exploits images without bounding box annotations by sampling pseudo GT boxes from their object proposals, based on a score accumulated for each region via a score propagation mechanism. By repeating this sampling together with the normal detector weight updates, our method learns to sample representative regions as pseudo GT boxes for the unlabeled images. With this sampling-based framework, training is a single-stage process. Unlike the vast majority of semi-supervised techniques in the literature, the pseudo GT boxes are not controlled by a threshold, which makes the learning process simple and straightforward. Our experimental validation on VOC07 shows that our method achieves excellent detection performance with a reduced amount of fully annotated data plus additional image-level annotations.

References

  • [1] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
  • [2] D. Bolya, S. Foley, J. Hays, and J. Hoffman. TIDE: A general toolbox for identifying object detection errors. In ECCV, 2020.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [4] L. Chen, T. Yang, X. Zhang, W. Zhang, and J. Sun. Points as queries: Weakly semi-supervised object detection by points. In CVPR, 2021.
  • [5] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NeurIPS, 2016.
  • [6] A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. V. Gool. Weakly supervised cascaded convolutional networks. In CVPR, 2017.
  • [7] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian. CenterNet: Keypoint triplets for object detection. arXiv:1904.08189, 2019.
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
  • [9] S. Fang, Y. Cao, X. Wang, K. Chen, D. Lin, and W. Zhang. WSSOD: A new pipeline for weakly- and semi-supervised object detection. arXiv:2105.11293, 2021.
  • [10] R. Girshick. Fast r-cnn. In ICCV, 2015.
  • [11] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [13] J. Jeong, S. Lee, J. Kim, and N. Kwak. Consistency-based semi-supervised learning for object detection. In NeurIPS, 2019.
  • [14] J. Jeong, V. Verma, M. Hyun, J. Kannala, and N. Kwak. Interpolation-based semi-supervised learning for object detection. In CVPR, 2021.
  • [15] Y. Li, B. Fan, W. Zhang, W. Ding, and J. Yin. Deep active learning for object detection. Journal of Information Sciences, 2021.
  • [16] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV, 2016.
  • [18] L. Qiao, Y. Zhao, Z. Li, X. Qiu, J. Wu, and C. Zhang. Defrcn: Decoupled faster r-cnn for few-shot object detection. In ICCV, 2021.
  • [19] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [20] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In CVPR, 2017.
  • [21] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [22] Z. Ren, Z. Yu, X. Yang, M. Liu, Y. J. Lee, A. G. Schwing, and J. Kautz. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In CVPR, 2020.
  • [23] S. Roy, A. Unmesh, and V. P. Namboodiri. Deep active learning for object detection. In BMVC, 2018.
  • [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • [25] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [27] K. Sohn, Z. Zhang, C. Li, H. Zhang, C. Lee, and T. Pfister. A simple semi-supervised learning framework for object detection. arXiv:2005.04757, 2020.
  • [28] P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In CVPR, 2017.
  • [29] Y. Tang, W. Chen, Y. Luo, and Y. Zhang. Humble teachers teach better students for semi-supervised object detection. In CVPR, 2021.
  • [30] Z. Tian, C. Shen, H. Chen, and T. He. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.
  • [31] K. van de Sande, J. Uijlings, T. Gevers, and A. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.
  • [32] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu. End-to-end semi-supervised object detection with soft teacher. In ICCV, 2021.