
CrossRectify: Leveraging Disagreement for Semi-supervised Object Detection

Chengcheng Ma Xingjia Pan Qixiang Ye Fan Tang [email protected] Weiming Dong Changsheng Xu National Lab of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, 100190, China School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, 100049, China Youtu Lab, Tencent Inc., Shanghai, 200233, China School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (UCAS), Beijing, 101408, China Jilin University, Changchun, 130000, China
Abstract

Semi-supervised object detection has recently achieved substantial progress. As a mainstream solution, self-labeling-based methods train the detector on both labeled data and unlabeled data with pseudo labels predicted by the detector itself, but their performance gains are always limited. Through experimental analysis, we reveal that the underlying reason is that the detector is misguided by the incorrect pseudo labels it predicts itself (dubbed self-errors). These self-errors can hurt performance even more than random-errors, and can be neither discerned nor rectified during the self-labeling process. In this paper, we propose an effective detection framework named CrossRectify, which obtains accurate pseudo labels by simultaneously training two detectors with different initial parameters. Specifically, the proposed approach leverages the disagreements between the two detectors to discern the self-errors and refines the pseudo label quality by the proposed cross-rectifying mechanism. Extensive experiments show that CrossRectify achieves superior performance with various detector structures on 2D and 3D detection benchmarks.

keywords:
object detection, semi-supervised learning, 2D semi-supervised object detection, 3D semi-supervised object detection, self-labeling
journal: Pattern Recognition

1 Introduction

The success of deep learning has greatly promoted the development of object detection approaches [1, 2, 3, 4, 5, 6], and a large amount of labeled data is essential to the training process of object detectors. However, as illustrated in [7], acquiring a large amount of labeled data with bounding-box-level annotations is always labor-intensive and expensive. In comparison, unlabeled data are much easier and cheaper to collect. Therefore, semi-supervised object detection [8] has recently been investigated to reduce the cost of data annotation, leveraging only a few labeled data and a large amount of unlabeled data to train object detectors.

Semi-supervised object detection (SSOD) has achieved significant progress in recent years, and one mainstream of existing SSOD solutions is based on the self-labeling scheme [9]. The core idea of the self-labeling scheme is to first utilize the current detector to predict pseudo bounding boxes for unlabeled data in each training iteration, then conduct detector training with both labeled data and pseudo-labeled data. However, compared with fully-supervised baselines, the performance increments brought by self-labeling-based methods are consistently observed to be limited. For example, the absolute AP50 gain is only 0.40% (0.46%) without (with) the mix-up data augmentation [10] over the SSD300 [1] detector on the Pascal VOC [11] benchmark (as shown in Fig. 1). To reveal the reason behind this phenomenon, we introduce the ground-truth annotations of unlabeled data for in-depth analysis. Existing self-labeling-based SSOD methods generate pseudo labels by selecting bounding boxes with high confidence scores to ensure the quality of pseudo labels. However, we observe that part of the high-confidence boxes are still misclassified (dubbed self-errors), and these self-errors can hurt the detection performance even more severely than random errors: when we replace the misclassified category labels with random labels (dubbed random-errors) during the self-labeling training process, the final performance is even improved. Unfortunately, these self-errors can be neither discerned nor rectified by the detector itself, which we summarize as two inherent limitations of the self-labeling training scheme. These two limitations lead the detector to be misguided by self-errors, and finally result in insignificant performance increments.

Figure 1: The proposed CrossRectify method and its ensemble variant CrossRectify* outperform the self-labeling-based semi-supervised object detection method by large margins on the Pascal VOC benchmark dataset, without/with the mix-up data augmentation. (For interpretation of the references to color in this figure legend, please refer to the online version of this article.)

Since one single detector can neither discern nor rectify the misclassified pseudo bounding boxes, it is necessary to introduce guidance from another, distinct detector into the training process. Recently, a few works [12, 13, 14, 15] have illustrated that two differently initialized models with an identical structure yield diverse results on the same training sample during the training process. Inspired by this fact, in this paper we propose an effective and general training framework named CrossRectify for both 2D and 3D semi-supervised object detection. In CrossRectify, two detectors with the same structure but different initialization are trained simultaneously, and each detector is supervised by the pseudo labels generated by the proposed cross-rectifying mechanism. Specifically, the cross-rectifying mechanism first leverages the disagreements between the two detectors on the same objects to discern the latent self-errors predicted by each single detector. Then, the pseudo labels are generated from the bounding boxes predicted by both detectors, through a simple yet effective compare-and-assign pipeline that takes confidence scores into account. In this way, the proposed CrossRectify method can discern and rectify the self-errors and improve the pseudo label quality.

Note that a few recent works utilize another separate detector for pseudo label generation, which can somewhat alleviate the limitations of the self-labeling training scheme. For instance, [16, 17, 18, 19, 20] adopt the teacher-student mutual learning framework [21] and utilize a specific teacher detector to generate pseudo labels in an offline or online way. However, these approaches still suffer from the following shortcomings: the pseudo labels remain fixed throughout training [16, 17], or the teacher detector converges to the student detector in the late stage of training [18, 19, 20], so the labeling process degenerates into the self-labeling manner and suffers from the same limitations. Similar to our method, [22] proposes the co-rectify method, which trains two models simultaneously and takes the average of the two prediction sets as pseudo labels; to the best of our knowledge, it is the only prior work adopting the co-training framework [12, 13, 14, 15]. However, we conduct a quantitative comparison of pseudo label quality, and the experimental results validate the superiority of our CrossRectify method over co-rectify.

We carry out extensive experiments on both 2D and 3D semi-supervised object detection tasks to verify the effectiveness and versatility of the proposed CrossRectify method. As shown in Section 5.3, our method obtains consistent and substantial improvements over state-of-the-art SSOD methods on the Pascal VOC, MS-COCO, and SUN-RGBD benchmark datasets, with absolute AP improvements of more than 1%.

Our main contributions are summarized as follows:

1) We point out that the performance gains of self-labeling-based SSOD approaches are always limited, and that the reason behind this phenomenon is that the detector can neither discern nor rectify the misclassified pseudo bounding boxes predicted by itself.

2) We propose an effective approach named CrossRectify to discern and rectify the misclassified pseudo bounding boxes using the disagreements between two detectors, which addresses the inherent limitations of self-labeling and improves the detection performance.

3) We conduct extensive experiments on both 2D and 3D object detection benchmark datasets, and the results verify the superiority of the proposed CrossRectify approach over state-of-the-art approaches.

In the remainder of this paper, we briefly review related works in Section 2. We then describe the two limitations of self-labeling-based SSOD approaches in Section 3 and provide the technical details of the proposed CrossRectify method in Section 4. Finally, we report the detection performances on 2D and 3D SSOD tasks in Section 5 and conclude the paper in Section 6.

2 Related Work

2.1 2D and 3D object detection

Object detection is one of the most significant tasks in computer vision, covering both 2D and 3D scenes. In the field of 2D object detection, detector structures can be categorized into single-stage (SSD [1], etc.) and two-stage (Faster-RCNN [2], Cascade R-CNN [23], etc.), depending on whether a region proposal network is utilized. Although these detectors have reached outstanding performances, their training processes heavily rely on a large amount of labeled data with bounding box annotations, which are always laborious and expensive to acquire [7]. In this paper, we focus on how to leverage a large amount of unlabeled data to improve performance. For fair comparisons with existing works, we conduct experiments with the SSD300 and Faster-RCNN-FPN detector structures for 2D object detection, and with the VoteNet [24] structure for 3D object detection.

2.2 Semi-supervised Object Detection (SSOD)

The existing SSOD approaches can be categorized as follows.

Consistency regularization

Many existing SSOD methods utilize the consistency regularization proposed in semi-supervised learning (SSL), such as CSD [25], ISD [26], and PL [27]. The key idea of consistency regularization is to require the detector to predict consistently on weakly and strongly augmented versions of the same input. We point out that consistency-regularization-based methods can be regarded as a special case of the self-labeling training scheme, because the detector is supervised by the pseudo labels it predicts itself on the weakly augmented image. Note that these studies report large performance improvements over fully-supervised baselines, but the experimental settings are somewhat unfair. In fact, under the same data augmentation, the performance increments brought by consistency regularization are consistently observed to be limited. We analyze the reason behind this phenomenon in this paper and point out the inherent limitations of the self-labeling training scheme, as well as of consistency regularization.

Teacher-student mutual learning

Beyond consistency regularization, a few SSOD works are based on the teacher-student mutual learning framework [21]. In an offline or online manner, the pseudo labels are generated by a teacher model instead of by the student model itself. As for the former, [16, 17] first pre-train the teacher model with the available labeled data, then utilize the teacher to annotate the entire unlabeled set. However, the pseudo labels are generated only once and remain fixed during semi-supervised training, so the final performance of the student model is limited by that of the teacher model. As for the latter, [18, 19, 20] compute the exponential moving average of the student as the teacher, and the teacher is utilized to generate pseudo labels during the semi-supervised training process. However, we observe that the teacher converges to the student in the late stage of training, which indicates that the labeling process degenerates into the self-labeling manner and suffers from the same limitations.

Co-rectify and CPS

As another research line of semi-supervised learning, co-training methods [12, 13, 14, 15] train two models in a collaborative manner. Each model learns from the pseudo labels predicted by its counterpart, which seems a promising way to mitigate the limitations of self-labeling methods. The only prior work applying the idea of co-training to the SSOD task is co-rectify [22]. In [22], the pseudo boxes for unlabeled data are first predicted by one detector, then refined by the corresponding predictions from another model, with probability scores and coordinates being averaged. Besides, a recent study [28] also adopts co-training and proposes cross pseudo supervision (CPS) for the semi-supervised semantic segmentation task, where each model is supervised by the pseudo segmentation maps predicted by the other model. We note that CPS can be adapted to the SSOD task. However, our quantitative comparisons show that neither co-rectify nor CPS can fully exploit the advantages of multiple models and improve the quality of pseudo labels.

3 Problem Analysis

In this section, we first introduce the preliminaries in semi-supervised object detection (SSOD), then analyze the inherent limitations of self-labeling-based SSOD methods.

3.1 Preliminaries

Under the semi-supervised setting, an object detector $f$ is trained on a labeled dataset $D_{l}=\{\boldsymbol{x}_{i}^{l},\boldsymbol{y}_{i}^{l}\}_{i=1}^{N_{l}}$ with $N_{l}$ samples and an unlabeled dataset $D_{u}=\{\boldsymbol{x}_{j}^{u}\}_{j=1}^{N_{u}}$ with $N_{u}$ samples. For a labeled image $\boldsymbol{x}^{l}$, its annotation $\boldsymbol{y}^{l}=(\boldsymbol{c},\boldsymbol{t})$ contains the category labels $\boldsymbol{c}$ and coordinates $\boldsymbol{t}$ of all foreground objects.

Overall, the detector model $f$ is optimized by minimizing the supervised loss $L_{S}$ on labeled data and the unsupervised loss $L_{U}$ on unlabeled data, formulated as:

$L = L_{S} + \lambda_{U}\cdot L_{U}$, (1)

where $\lambda_{U}$ denotes the weight factor. Generally, the supervised loss $L_{S}$ consists of the classification loss $l_{cls}$ and the coordinate regression loss $l_{reg}$:

$L_{S} = l_{cls}\big(f_{cls}(\boldsymbol{x}^{l}),\boldsymbol{c}\big) + l_{reg}\big(f_{loc}(\boldsymbol{x}^{l}),\boldsymbol{t}\big)$, (2)

where $f_{cls}(\boldsymbol{x}^{l})$ and $f_{loc}(\boldsymbol{x}^{l})$ stand for the probabilities and coordinates predicted by the classification and localization branches of detector $f$, respectively.

In each training iteration, the self-labeling-based SSOD method utilizes the current detector to predict bounding boxes on the unlabeled inputs $\boldsymbol{x}^{u}$, then selects the pseudo labels $\hat{\boldsymbol{y}}$ with confidence larger than the threshold $\tau$, and finally computes the unsupervised loss:

$L_{U} = l_{cls}\big(f_{cls}(\boldsymbol{x}^{u}),\hat{\boldsymbol{c}}\big) + l_{reg}\big(f_{loc}(\boldsymbol{x}^{u}),\hat{\boldsymbol{t}}\big)$, (3)

where $\hat{\boldsymbol{y}}=(\hat{\boldsymbol{c}},\hat{\boldsymbol{t}})=\big(\operatorname*{argmax} f_{cls}(\boldsymbol{x}^{u}),\, f_{loc}(\boldsymbol{x}^{u})\big)$ and $\max f_{cls}(\boldsymbol{x}^{u})>\tau$ are correspondingly satisfied.

As a special case of self-labeling training, the consistency-regularization-based methods introduce a weak data augmentation $\alpha(\cdot)$ and strong data augmentations $A(\cdot)$, and Eq. (3) is updated as:

$L_{U} = l_{cls}\big(f_{cls}(A(\boldsymbol{x}^{u})),\operatorname*{argmax} f_{cls}(\alpha(\boldsymbol{x}^{u}))\big) + l_{reg}\big(f_{loc}(A(\boldsymbol{x}^{u})),\, f_{loc}(\alpha(\boldsymbol{x}^{u}))\big)$. (4)

Note that for consistency regularization, the loss term in Eq. (4) can be computed on both labeled and unlabeled data, and the strong augmentations can also boost the performance of fully-supervised training [26]. Accordingly, the total loss in Eq. (1) is augmented as $L = L_{S} + \lambda_{U}\cdot\big(L_{U}(\boldsymbol{x}^{l}) + L_{U}(\boldsymbol{x}^{u})\big)$, and the fully-supervised baseline is trained by optimizing $L = L_{S} + \lambda_{U}\cdot L_{U}(\boldsymbol{x}^{l})$.
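For concreteness, the following PyTorch-style sketch shows how the unsupervised loss of Eq. (3) can be assembled; the single-call detector interface returning per-box class probabilities and coordinates, and the loss callables l_cls and l_reg, are hypothetical placeholders for illustration, not the actual implementation.

```python
import torch

def self_labeling_loss(detector, x_u, l_cls, l_reg, tau=0.5):
    """Sketch of Eq. (3): the detector pseudo-labels the unlabeled batch x_u,
    then is trained on the boxes whose maximum class probability exceeds tau."""
    with torch.no_grad():
        probs, coords = detector(x_u)        # per-box class probabilities and coordinates
        conf, c_hat = probs.max(dim=-1)      # confidence and pseudo category for each box
        keep = conf > tau                    # keep only high-confidence pseudo boxes
        c_hat, t_hat = c_hat[keep], coords[keep]

    probs_s, coords_s = detector(x_u)        # second pass, this time with gradients
    return l_cls(probs_s[keep], c_hat) + l_reg(coords_s[keep], t_hat)

# The total loss of Eq. (1) then combines the supervised and unsupervised terms:
#   loss = supervised_loss + lambda_u * self_labeling_loss(detector, x_u, l_cls, l_reg)
```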

Table 1: Results of self-labeling-based semi-supervised object detection methods under various data augmentations. The benchmark dataset is Pascal VOC and the detector structure is SSD300. “SeLa” stands for self-labeling. “identical”, “HF”, and “MU” stand for no data augmentation, horizontal-flip augmentation (CSD [25]), and mix-up augmentation (ISD [26]), respectively. The figures in brackets are the performance increments over the fully-supervised baselines, which are consistently marginal for the self-labeling-based methods.
Method $A(\cdot)$ Labeled Unlabeled AP50 (%)
Supervised identical VOC07 - 71.73
SeLa identical VOC07 VOC12 72.13 (+0.40)
Supervised identical VOC0712 - 77.37 (+5.64)
Supervised HF VOC07 - 71.89
SeLa HF VOC07 VOC12 72.35 (+0.46)
Supervised HF VOC0712 - 77.26 (+5.37)
Supervised MU VOC07 - 73.04
SeLa MU VOC07 VOC12 73.50 (+0.46)
Supervised MU VOC0712 - 78.83 (+5.79)

3.2 Limitations of Self-labeling

To examine the performance of existing self-labeling-based SSOD methods, we conduct experiments on the Pascal VOC benchmark dataset [11] with the SSD300 structure [1]. We use the trainval set of VOC07 as labeled data and the trainval set of VOC12 as unlabeled data, and report the AP50 performance on the test set of VOC07. The confidence threshold $\tau$ is fixed at 0.5, similar to [25] and [26]. For comparison, we conduct fully-supervised training with the same hyper-parameters (batch size, iteration number, etc.) as the baseline. As shown in Table 1, the AP50 improvement achieved by self-labeling is only 0.40%. Besides, we test two representative consistency-regularization-based methods, namely CSD [25] and ISD [26]. Similarly, their detection results achieve only 0.46% absolute AP50 gains over the corresponding baselines, which indicates the ineffectiveness of the self-labeling training scheme.

Table 2: Analysis on the limitations of self-labeling methods. The benchmark dataset is Pascal VOC and the detector structure is SSD300. TP (FP) stands for the correctly (falsely) classified pseudo boxes. “SeLa” stands for self-labeling, which uses both TP and FP for training. The figures in brackets are the AP50 increments over the fully-supervised baseline.
Method Labeled Unlabeled $\tau$ AP50
Supervised VOC07 - - 71.73
SeLa VOC07 VOC12 0.5 72.13 (+0.40)
SeLa VOC07 VOC12 0.5→0.8 72.09 (+0.36)
SeLa VOC07 VOC12 0.8 72.12 (+0.39)
SeLa (use TP and discard FP) VOC07 VOC12 0.5 74.03 (+2.30)
SeLa (use TP and random labeled FP) VOC07 VOC12 0.5 73.87 (+2.14)
SeLa (use GT labels for TP and FP) VOC07 VOC12 0.5 74.86 (+3.13)
Supervised VOC0712 - - 77.37 (+5.64)
Figure 2: Pseudo label quality under different confidence thresholds in self-labeling training. (a) Precision of pseudo labels. (b) Average number of correctly classified pseudo boxes in each iteration. (For interpretation of the references to color in this figure legend, please refer to the online version of this article.)

To reveal the reason behind this phenomenon and find possible solutions, we introduce the ground-truth category labels of all pseudo bounding boxes for in-depth analysis. Although existing self-labeling-based SSOD methods generate pseudo labels by selecting bounding boxes with high confidence scores to ensure the quality of pseudo labels, Fig. 2(a) illustrates that part of the high-confidence pseudo bounding boxes are still misclassified. We name these incorrect boxes “self-errors” for clarity. Since all pseudo boxes are predicted by the detector, it is impossible for the detector itself to discern the self-errors, which we summarize as the first inherent limitation of the self-labeling process. We further conduct an experiment to illustrate how much this limitation affects the detection performance: in each training iteration, when we use the correct pseudo bounding boxes and discard the incorrect ones for training, the AP50 result increases from 72.13% to 74.03% (see the 2nd and 5th rows in Table 2). Besides, we note that the detection performance cannot be improved by naively increasing the confidence threshold $\tau$. As shown in Fig. 2(a) and (b), the precision of pseudo bounding boxes increases from 71% to 91% as the threshold increases from 0.4 to 0.8, but a large threshold also discards many correct pseudo boxes and wastes the unlabeled training data. Correspondingly, we conduct self-labeling training under three settings of the threshold $\tau$: (i) fixed at 0.5, (ii) fixed at 0.8, (iii) rising gradually from 0.5 to 0.8 during the whole training process. As displayed in Table 2 (2nd to 4th rows), the trade-off between precision and recall leads to similar final performances (about 72.1% AP50).
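The statistics in Fig. 2 can be estimated roughly as in the sketch below, which counts how many high-confidence pseudo boxes agree in class with an overlapping ground-truth box; the IoU-based matching and the 0.5 overlap threshold are assumptions made for illustration, not necessarily the exact protocol used for the figure.

```python
def pseudo_label_precision(pseudo_boxes, gt_boxes, iou_fn, iou_thr=0.5):
    """Fraction of pseudo boxes whose class matches an overlapping ground-truth box.
    pseudo_boxes / gt_boxes: iterables of (class_label, box) pairs."""
    correct = 0
    for p_label, p_box in pseudo_boxes:
        # a pseudo box counts as correct if some GT box of the same class overlaps it enough
        if any(g_label == p_label and iou_fn(p_box, g_box) >= iou_thr
               for g_label, g_box in gt_boxes):
            correct += 1
    return correct / max(len(pseudo_boxes), 1)
```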

Since the self-errors cannot be discerned during the self-labeling process, it is also impossible for the detector to rectify them, which we summarize as the second limitation of the self-labeling process. In each training iteration, when we use the ground-truth (GT) category labels for all pseudo bounding boxes, the AP50 result increases to 74.86%, a 3.13% absolute gain instead of 0.40% (see the penultimate row in Table 2). This phenomenon shows the effectiveness of pseudo label rectification. Besides, we find another interesting phenomenon: when we replace the misclassified category labels with random labels (dubbed random-errors) during the self-labeling training process, the final performance increases to 73.87% (see the third row from the bottom in Table 2). We also conduct experiments on the Faster-RCNN-FPN structure [2] and observe a similar trend: AP50 obtains a 0.2% gain when the misclassified labels are replaced with random labels during training. These phenomena imply that the detector model is misguided more severely by self-errors than by random-errors.

Based on the above experimental analysis, we conclude that the two inherent limitations of self-labeling-based SSOD methods lead the detector to be misguided by self-errors, and that self-errors hurt detector performance even more than random-errors.

4 Methodology

4.1 CrossRectify

Since one single detector can neither discern nor rectify its self-errors, an intuitive idea is to utilize another model to deal with them. Inspired by the fact that two models with the same structure but different initialization can yield different predictions on the same input [12, 13, 14, 15], we present the CrossRectify method to address the inherent limitations of self-labeling.

In CrossRectify, two detectors with the same structure but distinct initialization, $f_{A}$ and $f_{B}$, are trained simultaneously (taking the 2D object detection task for example, the backbone parameters of both detectors are initialized from the ImageNet-pretrained model, while the parameters of the detection heads are randomly initialized). Both detectors are trained by jointly optimizing the supervised and unsupervised losses in Eq. (1). For simplicity, we only describe how to generate the pseudo bounding boxes $\hat{\boldsymbol{y}}_{A}$ for training detector $f_{A}$, since the pseudo boxes $\hat{\boldsymbol{y}}_{B}$ for training detector $f_{B}$ are generated in the same way. Generating $\hat{\boldsymbol{y}}_{A}$ involves three steps: 1) conducting detector feed-forward; 2) matching predicted bounding boxes; 3) cross-rectifying the matched boxes to generate pseudo labels. They are explained in detail sequentially below. The label generation process of $\hat{\boldsymbol{y}}_{A}$ is illustrated in Fig. 3 and briefly summarized in Algorithm 1.

Figure 3: Overview of the pseudo label generation process in the proposed algorithm. Refer to Section 4 for more details. (For color discrimination in this figure, please refer to the online version of this article.)

Detector feed-forward

In each training iteration, we utilize the two detector models $f_{A}$ and $f_{B}$ to predict on the unlabeled inputs, then select the bounding boxes whose maximum probability scores are higher than the threshold $\tau$, denoted as $\overline{\boldsymbol{y}}_{A}=(\overline{\boldsymbol{p}}_{A},\overline{\boldsymbol{t}}_{A})$ and $\overline{\boldsymbol{y}}_{B}=(\overline{\boldsymbol{p}}_{B},\overline{\boldsymbol{t}}_{B})$, where $\overline{\boldsymbol{p}}$ and $\overline{\boldsymbol{t}}$ denote the probability scores and coordinates of all selected bounding boxes.

Matching bounding boxes

For each box in $\overline{\boldsymbol{y}}_{A}$, we search for its best matching box among all boxes in $\overline{\boldsymbol{y}}_{B}$. For example, for the $i$-th box $(\overline{\mathbf{p}}_{A,i},\overline{\mathbf{t}}_{A,i})$ in $\overline{\boldsymbol{y}}_{A}$, the matching process is formulated as:

$j^{\ast}=\operatorname*{argmax}_{j\in\{1,\cdots,|\overline{\boldsymbol{y}}_{B}|\}} M\big((\overline{\mathbf{p}}_{A,i},\overline{\mathbf{t}}_{A,i}),(\overline{\mathbf{p}}_{B,j},\overline{\mathbf{t}}_{B,j})\big)$, (5)

where $M(\cdot,\cdot)$ stands for the matching metric and differs slightly across detector types. For Faster-RCNN [2], $M\big((\overline{\mathbf{p}}_{A,i},\overline{\mathbf{t}}_{A,i}),(\overline{\mathbf{p}}_{B,j},\overline{\mathbf{t}}_{B,j})\big)$ is the intersection over union (IoU) between the two boxes:

$M\big((\overline{\mathbf{p}}_{A,i},\overline{\mathbf{t}}_{A,i}),(\overline{\mathbf{p}}_{B,j},\overline{\mathbf{t}}_{B,j})\big)=\dfrac{\overline{\mathbf{t}}_{A,i}\cap\overline{\mathbf{t}}_{B,j}}{\overline{\mathbf{t}}_{A,i}\cup\overline{\mathbf{t}}_{B,j}}$. (6)

Specifically, if the IoU between $(\overline{\mathbf{p}}_{A,i},\overline{\mathbf{t}}_{A,i})$ and every box in $\overline{\boldsymbol{y}}_{B}$ is below a certain threshold $\delta$, we create a virtual bounding box to match it, setting $(\overline{\mathbf{p}}_{B,j^{\ast}},\overline{\mathbf{t}}_{B,j^{\ast}})=(\overline{\mathbf{p}}_{A,i},\overline{\mathbf{t}}_{A,i})$ with the matching metric $M(\cdot,\cdot)$ equal to 1. For the SSD structure [1], $M\big((\overline{\mathbf{p}}_{A,i},\overline{\mathbf{t}}_{A,i}),(\overline{\mathbf{p}}_{B,j},\overline{\mathbf{t}}_{B,j})\big)$ equals 1 if the two boxes are based on the same anchor, and 0 otherwise. Note that $M(\cdot,\cdot)$ for SSD can also be specified as the IoU, like that for Faster-RCNN, but we find the anchor correspondence more effective in our experiments. For VoteNet [24], the matching metric $M(\cdot,\cdot)$ is specified as the negative Euclidean distance between the centers of the two bounding boxes, computed as:

$M\big((\overline{\mathbf{p}}_{A,i},\overline{\mathbf{t}}_{A,i}),(\overline{\mathbf{p}}_{B,j},\overline{\mathbf{t}}_{B,j})\big)=-\left\|C(\overline{\mathbf{t}}_{A,i})-C(\overline{\mathbf{t}}_{B,j})\right\|_{2}$, (7)

where $C(\cdot)$ denotes the center of a box.
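A simplified sketch of the matching step of Eqs. (5)-(6) for the IoU-based case is given below; the (x1, y1, x2, y2) box format and the helper names are illustrative assumptions, and the anchor-correspondence (SSD) and center-distance (VoteNet, Eq. (7)) variants follow the same pattern with a different metric.

```python
def iou(box_a, box_b):
    """IoU between two axis-aligned boxes in (x1, y1, x2, y2) format, as in Eq. (6)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def best_match(box_a, boxes_b, metric=iou):
    """Eq. (5): return the index of the box in boxes_b maximizing the matching metric."""
    scores = [metric(box_a, box_b) for box_b in boxes_b]
    j_star = max(range(len(scores)), key=scores.__getitem__)
    return j_star, scores[j_star]
```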

Algorithm 1 Generating pseudo bounding boxes $\hat{\boldsymbol{y}}_{A}$ via CrossRectify for training detector $f_{A}$.
Input: Object detectors $f_{A}$ and $f_{B}$, and an unlabeled input.
Output: The pseudo bounding boxes for training $f_{A}$.
1: Utilize $f_{A}$ and $f_{B}$ to predict on the unlabeled input.
2: Select the predicted boxes whose maximum probability scores are larger than the threshold $\tau$, denoted as $\overline{\boldsymbol{y}}_{A}$ and $\overline{\boldsymbol{y}}_{B}$.
3: Compute the matching metric $M(\cdot,\cdot)$ between each bounding box in $\overline{\boldsymbol{y}}_{A}$ and all bounding boxes in $\overline{\boldsymbol{y}}_{B}$.
4: Find the best matching box for each bounding box in $\overline{\boldsymbol{y}}_{A}$ (using Eq. 5).
5: Compare the maximum probability scores within each matched pair and obtain the pseudo boxes (using Eq. 8).

Cross-rectifying

Based on the matched bounding box pair, each pseudo bounding box $(\hat{c}_{A,i},\hat{\mathbf{t}}_{A,i})$ within $\hat{\boldsymbol{y}}_{A}$ can be generated as:

$(\hat{c}_{A,i},\hat{\mathbf{t}}_{A,i})=\begin{cases}(\operatorname*{argmax}\overline{\mathbf{p}}_{B,j^{\ast}},\,\overline{\mathbf{t}}_{B,j^{\ast}}), & \text{if } \max\overline{\mathbf{p}}_{A,i}<\max\overline{\mathbf{p}}_{B,j^{\ast}}\\ (\operatorname*{argmax}\overline{\mathbf{p}}_{A,i},\,\overline{\mathbf{t}}_{A,i}), & \text{otherwise}.\end{cases}$ (8)

Note that Eq. (8) covers two situations. (a) When both detectors predict the same class for a certain object, we adopt it as the pseudo label, since two consistent decisions are more reliable than one. (b) When the two detectors disagree on a certain object, the bounding box tends to be unreliable. In this case, the bounding box with the higher confidence is taken as the pseudo label. The rationale behind this cross-rectifying mechanism is that bounding boxes with higher confidence scores are more likely to be correctly classified (see Fig. 2(a)). Thus, the wrong elements in $\overline{\boldsymbol{y}}_{A}$ can be both discerned and rectified in this manner.
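The rule of Eq. (8) amounts to a simple confidence comparison within each matched pair, roughly as sketched below; the list-based representation of probability vectors and coordinates is assumed only for illustration.

```python
def cross_rectify(p_a, t_a, p_b, t_b):
    """Eq. (8): between a matched pair of boxes from f_A and f_B, keep the one
    with the higher maximum class probability as the pseudo label for f_A."""
    if max(p_a) < max(p_b):
        # the counterpart f_B is more confident: adopt its class and coordinates
        return p_b.index(max(p_b)), t_b
    # otherwise keep f_A's own prediction
    return p_a.index(max(p_a)), t_a
```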

When the training process ends, we evaluate the performance of a single detector. Moreover, to exploit the different detection abilities of the two detectors, we propose to adopt the weighted boxes fusion (WBF) [29] strategy to ensemble the two prediction sets; the corresponding performance is denoted as CrossRectify*.
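As a rough illustration of this ensembling step, the snippet below uses the open-source ensemble_boxes package released with [29]; the toy inputs and fusion parameters are assumptions, since the exact settings are not specified here.

```python
from ensemble_boxes import weighted_boxes_fusion

# Toy predictions from the two trained detectors on one image; boxes are
# normalized to [0, 1] in (x1, y1, x2, y2) format, one list per detector.
boxes_a, boxes_b = [[0.10, 0.10, 0.50, 0.50]], [[0.12, 0.11, 0.52, 0.49]]
scores_a, scores_b = [0.90], [0.85]
labels_a, labels_b = [3], [3]

boxes, scores, labels = weighted_boxes_fusion(
    [boxes_a, boxes_b], [scores_a, scores_b], [labels_a, labels_b],
    weights=[1, 1],      # equal weight for both detectors (an assumption)
    iou_thr=0.55,        # IoU threshold for clustering boxes
    skip_box_thr=0.0,    # keep all boxes regardless of score
)
```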

4.2 Comparisons with other works

We now conduct quantitative analysis to show the superiority of CrossRectify in improving pseudo label quality, compared with other recent works.

Teacher-student mutual learning

As discussed in Section 2.2, some recent SSOD works [16, 17, 18, 19, 20] are built on offline/online teacher-student mutual learning [21]. These works can alleviate the self-errors of the self-labeling process by introducing another separate object detector for pseudo label generation. However, for the offline methods [16, 17], the pseudo labels are generated only once and remain fixed when training the student detector, so the student performance is upper-bounded by that of the teacher. For instance, we conduct experiments with the Faster-RCNN-FPN detector on the MS-COCO benchmark dataset under 10% degree of supervision. The AP50:95 performance of the teacher detector after fully-supervised pre-training is 23.86%, while that of the student detector supervised by the teacher only increases by a 3.30% absolute gain, far below the result in Table 4 (34.89% AP50:95). A similar phenomenon can be observed with the SSD300 structure on the Pascal VOC benchmark dataset in Table 3 (only a 0.79% AP50 gain).

As for the online methods [18, 19, 20], the teacher detector converges to the student detector and yields similar predictions in the late stage of training, so the pseudo label generation process degenerates into the self-labeling process and suffers from the same limitations. For instance, we conduct online teacher-student mutual learning with SSD300 on Pascal VOC. As shown in Fig. 4(a), the average KL-divergence between the probability scores predicted by the teacher and student detectors reaches zero in the last 40k iterations. Correspondingly, the detection performance shown in Table 3 also indicates the ineffectiveness of online teacher-student mutual learning.
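The divergence curve in Fig. 4(a) can be estimated along the lines of the sketch below, which averages the KL divergence between the class distributions that the two detectors predict for the same boxes; the exact averaging protocol is an assumption made for illustration.

```python
import torch

def mean_kl(p_teacher, p_student, eps=1e-8):
    """Average KL(p_teacher || p_student) over boxes.
    p_teacher, p_student: tensors of shape (num_boxes, num_classes), rows summing to 1."""
    p = p_teacher.clamp(min=eps)
    q = p_student.clamp(min=eps)
    return (p * (p.log() - q.log())).sum(dim=-1).mean()
```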

Figure 4: Left: (a) comparison of the average KL-divergence between probability scores predicted by the two detectors, computed over each 12k iterations. “TS-online” stands for teacher-student mutual learning in the online manner. Right: (b) comparison of the average precision of pseudo bounding boxes among different methods over each 12k iterations. (For interpretation of the references to color in this figure legend, please refer to the online version of this article.)

Co-rectify and CPS

Recently, a co-training-based SSOD method named co-rectify was proposed in [22], which is, to the best of our knowledge, the only prior work applying the idea of co-training to the SSOD task. In co-rectify, the pseudo bounding boxes are first predicted by one detector, then refined by the corresponding predictions from the other model, with probability scores and coordinates being averaged. Besides, a recent work proposes cross pseudo supervision (CPS) [28] for the semi-supervised semantic segmentation task and achieves state-of-the-art performance, where each model directly takes the predictions of the other model as pseudo labels. CPS can be adapted to the SSOD task, with each detector supervised by the other detector. However, as shown in Fig. 4(b), the pseudo label precision of these methods is inferior to that of CrossRectify (all trained semi-supervised with SSD300 on Pascal VOC). We attribute this to the fact that simply averaging multiple predictions (co-rectify) or directly taking the predictions of another model as supervision (CPS) cannot fully exploit the advantages of multiple models, in contrast to our cross-rectifying mechanism. Their inferior performances in Table 3 and Table 6 also support this analysis. Besides, we investigate more alternative strategies for pseudo label rectification and observe that cross-rectifying is the most effective one (detailed in Section 5.4).
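To make the contrast concrete, the two baseline labeling rules can be sketched as follows, in the same illustrative list-based representation as the cross-rectifying sketch after Eq. (8); both are paraphrases of the published descriptions rather than the original implementations.

```python
def co_rectify_label(p_a, t_a, p_b, t_b):
    """Co-rectify-style refinement [22]: probability scores and coordinates of a
    matched pair are averaged, and the class of the averaged scores is the label."""
    p = [(a + b) / 2 for a, b in zip(p_a, p_b)]
    t = [(a + b) / 2 for a, b in zip(t_a, t_b)]
    return p.index(max(p)), t

def cps_label(p_b, t_b):
    """CPS-style supervision [28]: directly adopt the counterpart's prediction."""
    return p_b.index(max(p_b)), t_b
```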

5 Experiments

5.1 Datasets and Evaluation Metrics

2D semi-supervised object detection

We evaluate the proposed CrossRectify on two widely-used benchmark datasets, i.e., Pascal VOC [11] and MS-COCO [30]. Pascal VOC has 20 object categories. We take the VOC07 trainval set (5,011 images) as labeled data and the VOC12 trainval set (11,540 images) as unlabeled data. The detection performance is evaluated on the VOC07 test set (4,952 images) using the VOC-style AP50 metric. MS-COCO has 80 object categories. We follow the same settings as those in [16, 18, 20, 22, 31] to randomly sample 1/2/5/10% of the COCO2017 train set (118,287 images) as labeled data and take the remaining part as unlabeled data. We create five data folds under each degree of supervision and report the mean and standard deviation of the five results. The detection performance is evaluated on the COCO2017 val set (5,000 images) using the COCO-style AP50:95 metric.

3D semi-supervised object detection

We follow [32] to conduct experiments on the SUN-RGBD benchmark dataset [33]. We randomly sample 5% of the 5,285 training samples as labeled data and take the remaining part as unlabeled data. The detection performance is evaluated on the 5,050 validation samples, using both the AP25 and AP50 metrics.

5.2 Implementation Details

Detector structures

We carry out experiments on the Pascal VOC dataset with two detector structures, namely SSD300 [1] with a VGG-16 backbone and Faster-RCNN-FPN [2, 34] with a ResNet-50 backbone. The latter structure is also utilized in the experiments on the MS-COCO dataset. For 3D detection, we utilize VoteNet [24] with a PointNet++ backbone [35].

Training settings

We utilize the PyTorch implementation (https://github.com/amdegroot/ssd.pytorch) to train SSD300 on Pascal VOC. Within a total of 120k iterations, we conduct fully-supervised training in the first 12k iterations as warm-up. We ramp up/down the unsupervised loss weight $\lambda_{U}$, and set the threshold $\tau$ and batch size to 0.5 and 32 according to [26]. We utilize the Detectron2 platform (https://github.com/facebookresearch/detectron2) to train Faster-RCNN-FPN on Pascal VOC. We train for a total of 36k iterations with the first 6k as fully-supervised warm-up, and adopt the same data augmentation strategy as in [18]. We set $\lambda_{U}$ to 2.0 and the threshold $\tau$ to 0.7 following [18]. The batch sizes for labeled data and unlabeled data are both 16, and the threshold $\delta$ on the matching metric is 0.5. To show the generality of our CrossRectify method across different platforms, we adopt MMDetection (https://github.com/open-mmlab/mmdetection) to train Faster-RCNN-FPN on MS-COCO. We train for a total of 180k iterations and adopt the data augmentation strategies in [20]. Under 1% degree of supervision, we conduct fully-supervised warm-up in the first 80k iterations to ensure training stability. We set $\lambda_{U}$ to 4.0 and the threshold $\tau$ to 0.9 according to [20]. The batch sizes for labeled data and unlabeled data are 8 and 32, respectively, and the threshold $\delta$ on the matching metric is 0.5. To train VoteNet on SUN-RGBD, we first conduct fully-supervised pre-training for 900 iterations, then conduct semi-supervised training for 1k iterations, following [32].

Note that the exponential moving average (EMA) strategy is commonly used in pseudo-label-based methods, since a detector aggregated by EMA yields more conservative and stable predictions than the detector itself [18, 19, 20]. For fair comparisons, we follow this common practice in our experiments and utilize the EMAs of the two detectors to conduct the detector feed-forward process.
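A minimal sketch of the EMA aggregation is given below; the decay value is a typical choice rather than necessarily the one used in our experiments, and buffers (e.g., BatchNorm statistics) are omitted for brevity.

```python
import torch

@torch.no_grad()
def ema_update(ema_detector, detector, decay=0.999):
    """theta_ema <- decay * theta_ema + (1 - decay) * theta, applied parameter-wise."""
    for p_ema, p in zip(ema_detector.parameters(), detector.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```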

5.3 Results

Pascal VOC

Table 3 shows the results of our CrossRectify method compared with other training frameworks on Pascal VOC. For the SSD300 detector, we take the 71.73% AP50 performance of fully-supervised training as the baseline. As can be seen, our proposed method obtains a 73.65% AP50 result, while the results of all compared approaches are only about 72.50%. This comparison validates the effectiveness of CrossRectify in improving pseudo label quality. Besides, the WBF-merged [29] results from both detectors further boost the final performance, denoted as CrossRectify*. Under the mix-up data augmentation [10], our CrossRectify method still outperforms the self-labeling-based method ISD [26] (by a 1.41% margin).

Table 3: 2D Semi-supervised Object Detection performances (AP50) on Pascal VOC benchmark dataset.
Model Backbone Method Labeled Unlabeled Threshold AP50
SSD300 VGG-16 Supervised VOC07 - - 71.73
Self-Labeling VOC07 VOC12 0.5 72.13 (+0.40)
Online Teacher-Student Mutual Teaching VOC07 VOC12 0.5 72.56 (+0.83)
Offline Teacher-Student Mutual Teaching[17] VOC07 VOC12 - 72.52 (+0.79)
Cross Pseudo Supervision[28] VOC07 VOC12 - 72.56 (+0.83)
Co-rectify[22] VOC07 VOC12 0.5 72.48 (+0.75)
CrossRectify (ours) VOC07 VOC12 0.5 73.56 (+1.83)
CrossRectify* (ours) VOC07 VOC12 0.5 74.97 (+3.24)
Supervised + MixUp VOC07 - - 73.04
Self-Labeling + MixUp (ISD [26]) VOC07 VOC12 0.5 73.50 (+0.46)
CrossRectify + MixUp (ours) VOC07 VOC12 0.5 74.91 (+1.87)
CrossRectify* + MixUp (ours) VOC07 VOC12 0.5 76.16 (+3.12)
Faster-RCNN-FPN ResNet-50 Supervised VOC07 - - 76.90
CSD[25] VOC07 VOC12 - 77.50 (+0.60)
STAC[16] VOC07 VOC12 - 77.50 (+0.60)
Co-rectify[22] VOC07 VOC12 - 79.20 (+2.30)
Combating Noise[31] VOC07 VOC12 - 80.60 (+3.70)
Humble Teacher[19] VOC07 VOC12 0.7 80.94 (+3.94)
Unbiased Teacher[18] VOC07 VOC12 0.7 80.51 (+3.61)
CrossRectify (ours) VOC07 VOC12 0.7 81.56 (+4.66)
CrossRectify* (ours) VOC07 VOC12 0.7 82.34 (+5.44)

For the Faster-RCNN-FPN detector, we compare CrossRectify with previous methods, and our method improves the AP50 result by a 4.66% margin over the fully-supervised baseline, achieving state-of-the-art performance. Note that Unbiased Teacher [18] reports its performance with the COCO-style AP50 metric in the original paper. For a fair comparison, we instead adopt the VOC-style AP50 metric for Unbiased Teacher, and its AP50 rises from 77.37% to 80.51%, which is still surpassed by CrossRectify (81.56%).

Table 4: 2D Semi-supervised object detection performances (AP50:95) on MS-COCO benchmark dataset.
Model Backbone Method Proportion of labeled data
1% 2% 5% 10%
Faster-RCNN-FPN ResNet-50 Supervised 9.05 ± 0.16 12.70 ± 0.15 18.47 ± 0.22 23.86 ± 0.81
CSD[25] 10.51 ± 0.06 13.93 ± 0.12 18.63 ± 0.07 22.46 ± 0.08
STAC[16] 13.97 ± 0.35 18.25 ± 0.25 24.38 ± 0.12 28.64 ± 0.21
Unbiased Teacher[18] 20.75 ± 0.12 24.30 ± 0.07 28.27 ± 0.11 31.50 ± 0.10
Humble Teacher[19] 16.96 ± 0.38 21.72 ± 0.24 27.70 ± 0.15 31.61 ± 0.28
Co-rectify[22] 18.05 ± 0.15 22.45 ± 0.15 26.75 ± 0.05 30.40 ± 0.05
Combating Noise[31] 18.41 ± 0.10 24.00 ± 0.15 28.96 ± 0.29 32.43 ± 0.20
Soft Teacher[20] 20.46 ± 0.39 26.20 ± 0.10 30.74 ± 0.08 34.04 ± 0.14
CrossRectify (ours) 21.90 ± 0.11 26.70 ± 0.07 31.70 ± 0.04 34.89 ± 0.07
CrossRectify* (ours) 22.50 ± 0.12 27.60 ± 0.07 32.80 ± 0.05 36.30 ± 0.07
Figure 5: The visual comparisons between Self-Labeling (the first row) and the proposed CrossRectify method (the second row) on MS-COCO under 1% degree of supervision. (For color discrimination in this figure, please refer to the online version of this article.)

MS-COCO

Table 4 shows the performance of our CrossRectify method compared with previous state-of-the-art methods on the MS-COCO dataset. Under 1%, 2%, 5%, and 10% degrees of supervision, the proposed CrossRectify obtains consistent and substantial improvements, surpassing Soft Teacher [20] by 1.46%, 0.50%, 0.96%, and 0.85% AP50:95 margins, respectively. These comparative results further verify the effectiveness of the proposed method. Moreover, we visualize the pseudo bounding boxes of some unlabeled images in Fig. 5. Compared with the self-labeling training scheme, our method yields more accurate pseudo boxes.

SUN-RGBD

Table 5 shows the comparison with all previous works (i.e., SESS [36] and 3DIoUMatch [32]) on the SUN-RGBD benchmark dataset. Under 5% degree of supervision, our CrossRectify outperforms the state-of-the-art 3DIoUMatch method by 3.1 AP25 and 1.9 AP50 margins. These results validate the effectiveness of CrossRectify on the 3D semi-supervised object detection task. We omit the WBF-merged performance CrossRectify*, because WBF does not support 3D bounding boxes with different rotation angles.

Table 5: 3D Semi-supervised object detection performances (AP25 and AP50) on SUN-RGBD benchmark dataset.
Model Backbone Method AP25 AP50
VoteNet PointNet++ Supervised 29.9 ± 1.5 10.5 ± 0.5
SESS[36] 34.2 ± 2.0 13.1 ± 1.0
3DIoUMatch[32] 39.0 ± 1.9 21.1 ± 1.7
CrossRectify (ours) 42.1 ± 1.7 23.0 ± 1.2

5.4 Empirical Study

Pseudo label rectification strategy

We now investigate alternative strategies for rectifying the pseudo labels, including: (a) only utilizing the intersection of the two prediction sets from the two detectors as pseudo labels, which is composed of the objects classified as the same class by both detectors; (b) only utilizing the difference set of the two prediction sets as pseudo labels, which is composed of the objects classified as different classes; (c) directly taking all bounding boxes predicted by the other detector as pseudo labels, which corresponds to cross pseudo supervision (CPS) [28]. As observed in Table 6, none of these strategies can ensure the pseudo label quality, and all lead to inferior performances.
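Strategies (a) and (b) can be sketched as follows, again with the illustrative list-based box representation used in the earlier sketches (an assumption, not the exact implementation):

```python
def intersection_strategy(matched_pairs):
    """Strategy (a): keep only matched boxes on which both detectors predict the same class."""
    return [(p_a.index(max(p_a)), t_a)
            for (p_a, t_a), (p_b, t_b) in matched_pairs
            if p_a.index(max(p_a)) == p_b.index(max(p_b))]

def difference_strategy(matched_pairs):
    """Strategy (b): keep only matched boxes on which the two detectors disagree."""
    return [(p_a.index(max(p_a)), t_a)
            for (p_a, t_a), (p_b, t_b) in matched_pairs
            if p_a.index(max(p_a)) != p_b.index(max(p_b))]
```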

Extension to more detectors

Our proposed CrossRectify method can be easily extended to train more than two detectors simultaneously. Specifically, during the pseudo label rectification process, each pseudo bounding box is re-labeled by the majority vote of all predicted classes and re-located by the average of all predicted coordinates (see the sketch below). As displayed in Table 7, CrossRectify over four SSD300 detectors brings only a 0.06% AP50 improvement for each detector on average. We believe that the two-detector scenario is already able to cross-rectify the misclassified pseudo labels adequately.
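A sketch of this multi-detector rectification rule (an illustrative reconstruction, not the exact implementation):

```python
from collections import Counter

def rectify_with_many(matched_classes, matched_coords):
    """Multi-detector extension: re-label a group of matched boxes by the majority
    class and re-locate it by the mean of all predicted coordinates."""
    majority_class = Counter(matched_classes).most_common(1)[0][0]
    num_boxes, dim = len(matched_coords), len(matched_coords[0])
    mean_coords = [sum(c[k] for c in matched_coords) / num_boxes for k in range(dim)]
    return majority_class, mean_coords
```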

Table 6: Empirical study on strategies of pseudo label rectification.
Model Labeled Unlabeled Strategy AP50
SSD300 VOC07 - - 71.73
VOC07 VOC12 intersection 72.59
VOC07 VOC12 difference set 65.52
VOC07 VOC12 CPS 72.56
VOC07 VOC12 CrossRectify 73.65
Table 7: Extension to four detector models on Pascal VOC dataset.
Model Index Single Average WBF-Merged
SSD300 detector #1 73.67 73.65 74.83
detector #2 73.63
detector #1 73.60 73.71 75.84
detector #2 73.71
detector #3 73.80
detector #4 73.73

6 Conclusion

In this paper, we propose the CrossRectify training framework for the semi-supervised object detection task, aiming to address the inherent limitations of self-labeling-based methods. In CrossRectify, two detectors with the same structure but different initialization are trained simultaneously, and the disagreements between the two detectors on the same objects are utilized to discern and rectify the latent self-errors predicted by each single detector. Moreover, we conduct quantitative analysis and experiments to show the superiority of CrossRectify in improving pseudo label quality compared with other recent works. Extensive results on both 2D and 3D semi-supervised object detection tasks validate the effectiveness and versatility of CrossRectify.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grants 61832016, U20B2070, 6210070958, and 62102162.

References

  • [1] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, in: European conference on computer vision, Springer, 2016, pp. 21–37.
  • [2] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015) 91–99.
  • [3] B. Bosquet, M. Mucientes, V. M. Brea, Stdnet-st: Spatio-temporal convnet for small object detection, Pattern Recognition 116 (2021) 107929.
  • [4] H. Wang, Q. Wang, P. Li, W. Zuo, Multi-scale structural kernel representation for object detection, Pattern Recognition 110 (2021) 107593.
  • [5] Y. Kong, M. Feng, X. Li, H. Lu, X. Liu, B. Yin, Spatial context-aware network for salient object detection, Pattern Recognition 114 (2021) 107867.
  • [6] J. Zhang, H. Su, W. Zou, X. Gong, Z. Zhang, F. Shen, Cadn: a weakly supervised learning-based category-aware object detection network for surface defect detection, Pattern Recognition 109 (2021) 107571.
  • [7] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al., The open images dataset v4, International Journal of Computer Vision 128 (7) (2020) 1956–1981.
  • [8] C. Rosenberg, M. Hebert, H. Schneiderman, Semi-supervised self-training of object detection models, in: Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION’05)-Volume 1-Volume 01, 2005, pp. 29–36.
  • [9] Y. M. Asano, C. Rupprecht, A. Vedaldi, Self-labelling via simultaneous clustering and representation learning, in: International Conference on Learning Representations (ICLR), 2020.
  • [10] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations, 2018.
  • [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International journal of computer vision 88 (2) (2010) 303–338.
  • [12] S. Qiao, W. Shen, Z. Zhang, B. Wang, A. Yuille, Deep co-training for semi-supervised image recognition, in: Proceedings of the european conference on computer vision (eccv), 2018, pp. 135–152.
  • [13] Z. Ke, D. Wang, Q. Yan, J. Ren, R. W. Lau, Dual student: Breaking the limits of the teacher in semi-supervised learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6728–6736.
  • [14] H. Wei, L. Feng, X. Chen, B. An, Combating noisy labels by agreement: A joint training method with co-regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13726–13735.
  • [15] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, M. Sugiyama, How does disagreement help generalization against label corruption?, in: International Conference on Machine Learning, PMLR, 2019, pp. 7164–7173.
  • [16] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, T. Pfister, A simple semi-supervised learning framework for object detection, arXiv preprint arXiv:2005.04757.
  • [17] Z. Wang, Y. Li, Y. Guo, L. Fang, S. Wang, Data-uncertainty guided multi-phase learning for semi-supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4568–4577.
  • [18] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, P. Vajda, Unbiased teacher for semi-supervised object detection, in: International Conference on Learning Representations, 2020.
  • [19] Y. Tang, W. Chen, Y. Luo, Y. Zhang, Humble teachers teach better students for semi-supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3132–3141.
  • [20] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, Z. Liu, End-to-end semi-supervised object detection with soft teacher, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3060–3069.
  • [21] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1195–1204.
  • [22] Q. Zhou, C. Yu, Z. Wang, Q. Qian, H. Li, Instant-teaching: An end-to-end semi-supervised object detection framework, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4081–4090.
  • [23] Z. Cai, N. Vasconcelos, Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162.
  • [24] C. R. Qi, O. Litany, K. He, L. J. Guibas, Deep hough voting for 3d object detection in point clouds, in: proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9277–9286.
  • [25] J. Jeong, S. Lee, J. Kim, N. Kwak, Consistency-based semi-supervised learning for object detection, Advances in neural information processing systems 32 (2019) 10759–10768.
  • [26] J. Jeong, V. Verma, M. Hyun, J. Kannala, N. Kwak, Interpolation-based semi-supervised learning for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11602–11611.
  • [27] P. Tang, C. Ramaiah, Y. Wang, R. Xu, C. Xiong, Proposal learning for semi-supervised object detection, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2291–2301.
  • [28] X. Chen, Y. Yuan, G. Zeng, J. Wang, Semi-supervised semantic segmentation with cross pseudo supervision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2613–2622.
  • [29] R. Solovyev, W. Wang, T. Gabruseva, Weighted boxes fusion: Ensembling boxes from different object detection models, Image and Vision Computing 107 (2021) 104117.
  • [30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755.
  • [31] Z. Wang, Y.-L. Li, Y. Guo, S. Wang, Combating noise: Semi-supervised learning by region uncertainty quantification, Advances in Neural Information Processing Systems 34.
  • [32] H. Wang, Y. Cong, O. Litany, Y. Gao, L. J. Guibas, 3dioumatch: Leveraging iou prediction for semi-supervised 3d object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14615–14624.
  • [33] S. Song, S. P. Lichtenberg, J. Xiao, Sun rgb-d: A rgb-d scene understanding benchmark suite, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
  • [34] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
  • [35] C. R. Qi, L. Yi, H. Su, L. J. Guibas, Pointnet++: Deep hierarchical feature learning on point sets in a metric space, Advances in neural information processing systems 30.
  • [36] N. Zhao, T.-S. Chua, G. H. Lee, Sess: Self-ensembling semi-supervised 3d object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11079–11087.