CrossRectify: Leveraging Disagreement for Semi-supervised Object Detection
Abstract
Semi-supervised object detection has recently achieved substantial progress. As a mainstream solution, self-labeling-based methods train the detector on both labeled data and unlabeled data with pseudo labels predicted by the detector itself, but their performance gains are always limited. Through experimental analysis, we reveal that the underlying reason is that the detector is misguided by the incorrect pseudo labels predicted by itself (dubbed self-errors). These self-errors can hurt performance even worse than random-errors, and can be neither discerned nor rectified during the self-labeling process. In this paper, we propose an effective detection framework named CrossRectify, which obtains accurate pseudo labels by simultaneously training two detectors with different initial parameters. Specifically, the proposed approach leverages the disagreements between detectors to discern the self-errors and refines the pseudo-label quality via the proposed cross-rectifying mechanism. Extensive experiments show that CrossRectify outperforms existing methods across various detector structures on 2D and 3D detection benchmarks.
keywords: object detection, semi-supervised learning, 2D semi-supervised object detection, 3D semi-supervised object detection, self-labeling

1 Introduction
The success of deep learning has greatly promoted the development of object detection approaches [1, 2, 3, 4, 5, 6], and a large amount of labeled data is essential to the training process of object detectors. However, as illustrated in [7], it is always labor-intensive and expensive to acquire a large amount of labeled data with bounding-box-level annotations. In comparison, unlabeled data are much easier and cheaper to collect. Therefore, semi-supervised object detection [8] has recently been investigated to reduce the cost of data annotation, leveraging only a few labeled data and a large amount of unlabeled data to train object detectors.
Semi-supervised object detection (SSOD) has achieved significant progress in recent years, and one mainstream family of existing SSOD solutions is based on the self-labeling scheme [9]. The core idea of self-labeling is to first utilize the current detector to predict pseudo bounding boxes for unlabeled data in each training iteration, then conduct detector training with both labeled data and pseudo-labeled data. However, compared with fully-supervised baselines, the performance increments brought by self-labeling-based methods are often observed to be limited. For example, the absolute AP50 gain is only 0.40% (0.46%) without (with) the mix-up data augmentation [10] over the SSD300 [1] detector on the Pascal VOC [11] benchmark (as shown in Fig. 1). To reveal the reason behind this phenomenon, we introduce the ground-truth annotations of unlabeled data for in-depth analysis. Existing self-labeling-based SSOD methods generate pseudo labels by selecting bounding boxes with high confidence scores to ensure the quality of pseudo labels. However, we observe that some high-confidence boxes are still misclassified (dubbed self-errors), and these self-errors can hurt the detection performance severely. Specifically, when we replace the misclassified category labels with random labels (dubbed random-errors) during the self-labeling training process, the final performance is even improved. Unfortunately, these self-errors can be neither discerned nor rectified by the detector itself, which we summarize as two inherent limitations of the self-labeling training scheme. These two limitations lead the detector to be misguided by self-errors, and finally result in insignificant performance increments.

Since one single detector can neither discern nor rectify the misclassified pseudo bounding boxes, it is necessary to introduce guidance information from another, distinct detector into the training process. Recently, a few works [12, 13, 14, 15] have illustrated that two differently initialized models with an identical structure can yield diverse results on the same training sample during the training process. Inspired by this fact, in this paper we propose an effective and general training framework named CrossRectify for both the 2D and 3D semi-supervised object detection tasks. In CrossRectify, two detectors with the same structure but different initialization are trained simultaneously, and each detector is supervised by the pseudo labels generated by the proposed cross-rectifying mechanism. Specifically, the cross-rectifying mechanism first leverages the disagreements between the two detectors on the same objects to discern the latent self-errors predicted by each single detector. Then, the pseudo labels are generated from the bounding boxes predicted by both detectors, through a simple yet effective compare-then-assign pipeline that takes confidence scores into account. In this way, the proposed CrossRectify method can discern and rectify the self-errors and improve the pseudo-label quality.
Note that a few recent works utilize another separate detector for pseudo-label generation, which can partly alleviate the limitations of the self-labeling training scheme. For instance, [16, 17, 18, 19, 20] adopt the teacher-student mutual learning framework [21] and utilize a specific teacher detector to generate pseudo labels in an offline or online way. However, these approaches still suffer from the following shortcomings: the pseudo labels remain fixed throughout training [16, 17], or the teacher detector converges to the student detector in the late stage of training [18, 19, 20], so that the labeling process degenerates into the self-labeling manner and suffers from the same limitations. Similar to our method, [22] proposes the co-rectify method to train two models simultaneously and takes the average of the two prediction sets as pseudo labels; to the best of our knowledge, it is the only prior work adopting the co-training framework [12, 13, 14, 15]. However, our quantitative comparison of pseudo-label quality validates the superiority of CrossRectify over co-rectify.
We carry out extensive experiments on both 2D and 3D semi-supervised object detection tasks to verify the effectiveness and versatility of the proposed CrossRectify method. As illustrated in Section 5.3, our method obtains consistent and substantial improvements over the state-of-the-art SSOD methods on the Pascal VOC, MS-COCO, and SUN-RGBD benchmark datasets, surpassing them by around 1% absolute AP margins.
Our main contributions are summarized as follows:
- 1) We point out that the performance of self-labeling-based SSOD approaches is always limited, and that the reason lies in the fact that the detector can neither discern nor rectify the misclassified pseudo bounding boxes predicted by itself.
- 2) We propose an effective approach named CrossRectify, which discerns and rectifies the misclassified pseudo bounding boxes using the disagreements between two detectors, addressing the inherent limitations of self-labeling and improving the detection performance.
- 3) We conduct extensive experiments on both 2D and 3D object detection benchmark datasets, and the results verify the superiority of the proposed CrossRectify approach over state-of-the-art approaches.
In the remainder of this paper, we briefly review related works in Section 2. We then describe the two limitations of self-labeling-based SSOD approaches in Section 3, and provide the technical details of the proposed CrossRectify method in Section 4. Finally, we report the detection performances on the 2D and 3D SSOD tasks in Section 5 and conclude the paper in Section 6.
2 Related Work
2.1 2D and 3D object detection
Object detection is one of the most significant tasks in computer vision, covering both 2D and 3D scenes. In the field of 2D object detection, detector structures can be categorized into single-stage (e.g., SSD [1]) and two-stage (e.g., Faster-RCNN [2], Cascade R-CNN [23]), depending on whether a region proposal network is utilized. Although these detectors have achieved outstanding performance, their training processes heavily rely on a large amount of labeled data with bounding box annotations, which are laborious and expensive to acquire [7]. In this paper, we focus on how to leverage a large amount of unlabeled data for performance improvement. For fair comparisons with existing works, we conduct experiments with the SSD300 and Faster-RCNN-FPN detector structures for 2D object detection, and with the VoteNet [24] structure for 3D object detection.
2.2 Semi-supervised Object Detection (SSOD)
The existing SSOD approaches can be categorized as follows.
Consistency regularization
Many existing SSOD methods utilize the consistency regularization proposed in semi-supervised learning (SSL), such as CSD [25], ISD [26], and PL [27]. The key idea of consistency regularization is to require the detector to predict consistently on weakly and strongly augmented versions of the same input. We point out that consistency-regularization-based methods can be regarded as a special case of the self-labeling training scheme, because the detector is supervised by pseudo labels it predicts itself on the weakly augmented image. Note that these studies report large performance improvements over fully-supervised baselines, but the experimental settings are somewhat unfair: under the same data augmentation, the performance increments brought by consistency regularization are consistently observed to be limited. We analyze the reason behind this phenomenon in this paper, and point out the inherent limitations of the self-labeling training scheme, including consistency regularization.
Teacher-student mutual learning
Beyond consistency regularization, a few SSOD works are based on the teacher-student mutual learning framework [21], where the pseudo labels are generated by a teacher model, instead of by the student model itself, in an offline or online manner. As for the former, [16, 17] first pre-train the teacher model with the available labeled data, then utilize the teacher to annotate the entire unlabeled set. However, the pseudo labels are generated only once and remain fixed during semi-supervised training, so the final performance of the student model is limited by that of the teacher model. As for the latter, [18, 19, 20] maintain the teacher as an exponential moving average of the student and utilize it to generate pseudo labels during the semi-supervised training process. However, we observe that the teacher converges to the student in the late stage of training, which indicates that the labeling process degenerates into the self-labeling manner and suffers from the same limitations.
Co-rectify and CPS
As another research line of semi-supervised learning, co-training methods [12, 13, 14, 15] train two models in a collaborative manner. Each model can learn from the pseudo labels predicted by its counterpart, which seems a promising way to mitigate the limitations of self-labeling methods. To the best of our knowledge, the only prior work applying the idea of co-training to the SSOD task is co-rectify [22]. In [22], the pseudo boxes for unlabeled data are first predicted by one detector, then refined by the corresponding predictions from the other model, with probability scores and coordinates being averaged. Besides, a recent study [28] also adopts co-training and proposes cross pseudo supervision (CPS) for the semi-supervised semantic segmentation task, where each model is supervised by the pseudo maps predicted by the other model; CPS can be adapted to the SSOD task. However, our quantitative comparisons show that neither co-rectify nor CPS can fully exploit the advantages of multiple models or improve the quality of pseudo labels.
3 Problem Analysis
In this section, we first introduce the preliminaries in semi-supervised object detection (SSOD), then analyze the inherent limitations of self-labeling-based SSOD methods.
3.1 Preliminaries
Under the semi-supervised setting, an object detector is trained on a labeled dataset $\mathcal{D}_l = \{(x_i^l, y_i^l)\}_{i=1}^{N_l}$ with $N_l$ samples and an unlabeled dataset $\mathcal{D}_u = \{x_i^u\}_{i=1}^{N_u}$ with $N_u$ samples. For a labeled image $x_i^l$, its annotation $y_i^l$ contains the category labels and coordinates of all foreground objects.
Overall, the detector model $\theta$ is optimized by minimizing the supervised loss $\mathcal{L}_{\mathrm{sup}}$ on labeled data and the unsupervised loss $\mathcal{L}_{\mathrm{unsup}}$ on unlabeled data, formulated as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \lambda\,\mathcal{L}_{\mathrm{unsup}}, \qquad (1)$$

where $\lambda$ denotes the weight factor. Generally, the supervised loss consists of the classification loss $\mathcal{L}_{\mathrm{cls}}$ and the coordinate regression loss $\mathcal{L}_{\mathrm{reg}}$:

$$\mathcal{L}_{\mathrm{sup}} = \sum_{i} \Big[ \mathcal{L}_{\mathrm{cls}}\big(p(x_i^l;\theta),\, y_{i,\mathrm{cls}}^l\big) + \mathcal{L}_{\mathrm{reg}}\big(t(x_i^l;\theta),\, y_{i,\mathrm{reg}}^l\big) \Big], \qquad (2)$$

where $p(\cdot\,;\theta)$ and $t(\cdot\,;\theta)$ stand for the probabilities and coordinates predicted by the classification and localization branches of detector $\theta$, respectively.
In each training iteration, the self-labeling-based SSOD method utilizes the current detector to predict bounding boxes on the unlabeled inputs $x_i^u$, then selects as pseudo labels $\hat{y}_i^u$ the boxes whose confidence is larger than the threshold $\tau$, and finally computes the unsupervised loss:

$$\mathcal{L}_{\mathrm{unsup}} = \sum_{i} \Big[ \mathcal{L}_{\mathrm{cls}}\big(p(x_i^u;\theta),\, \hat{y}_{i,\mathrm{cls}}^u\big) + \mathcal{L}_{\mathrm{reg}}\big(t(x_i^u;\theta),\, \hat{y}_{i,\mathrm{reg}}^u\big) \Big], \qquad (3)$$

where $\hat{y}_{i,\mathrm{cls}}^u = \arg\max p(x_i^u;\theta)$ and $\max p(x_i^u;\theta) > \tau$ are correspondingly satisfied.
As a special case of self-labeling training, the consistency-regularization-based methods introduce a weak data augmentation $\mathcal{A}_w$ and a strong data augmentation $\mathcal{A}_s$, and Eq. (3) is updated as:

$$\mathcal{L}_{\mathrm{unsup}} = \sum_{i} \Big[ \mathcal{L}_{\mathrm{cls}}\big(p(\mathcal{A}_s(x_i^u);\theta),\, \hat{y}_{i,\mathrm{cls}}^u\big) + \mathcal{L}_{\mathrm{reg}}\big(t(\mathcal{A}_s(x_i^u);\theta),\, \hat{y}_{i,\mathrm{reg}}^u\big) \Big], \qquad (4)$$

where the pseudo labels $\hat{y}_i^u$ are predicted on the weakly augmented input $\mathcal{A}_w(x_i^u)$. Note that for consistency regularization, the loss term in Eq. (4) can be computed on both labeled and unlabeled data, and the strong augmentations can also boost the performance of fully-supervised training [26]. Accordingly, the total loss in Eq. (1) is augmented as $\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \lambda\big(\mathcal{L}_{\mathrm{unsup}}^{l} + \mathcal{L}_{\mathrm{unsup}}^{u}\big)$, where the superscripts indicate the consistency term computed on labeled and unlabeled data, and the fully-supervised baseline is trained by optimizing $\mathcal{L}_{\mathrm{sup}}$ under the same strong augmentations.
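To make the training objective concrete, the following is a minimal PyTorch-style sketch of the self-labeling loss in Eqs. (1)-(3). The `detector` callable and the batch layout are hypothetical stand-ins, not the interface of any particular codebase, and the standard cross-entropy and smooth-L1 criteria are used as placeholders for $\mathcal{L}_{\mathrm{cls}}$ and $\mathcal{L}_{\mathrm{reg}}$.

```python
import torch
import torch.nn.functional as F

def self_labeling_loss(detector, labeled_batch, unlabeled_images, tau=0.5, lam=1.0):
    # Supervised term, Eq. (2): standard classification + regression loss.
    images, gt_cls, gt_boxes = labeled_batch
    logits, coords = detector(images)
    l_sup = F.cross_entropy(logits, gt_cls) + F.smooth_l1_loss(coords, gt_boxes)

    # Pseudo-labeling: the detector predicts on the unlabeled inputs and keeps
    # only boxes whose maximum class probability exceeds the threshold tau.
    with torch.no_grad():
        u_logits, u_coords = detector(unlabeled_images)
        conf, pseudo_cls = u_logits.softmax(dim=-1).max(dim=-1)
        keep = conf > tau

    # Unsupervised term, Eq. (3): the detector is supervised by its own
    # high-confidence predictions -- the source of the "self-errors".
    logits_u, coords_u = detector(unlabeled_images)
    l_unsup = (F.cross_entropy(logits_u[keep], pseudo_cls[keep])
               + F.smooth_l1_loss(coords_u[keep], u_coords[keep]))

    return l_sup + lam * l_unsup  # total loss, Eq. (1)
```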
Table 1: AP50 (%) of self-labeling (SeLa) with SSD300 on Pascal VOC under different data augmentations (HF: horizontal flip; MU: mix-up).

| Method | Augmentation | Labeled | Unlabeled | AP50 (%) |
| --- | --- | --- | --- | --- |
| Supervised | identical | VOC07 | - | 71.73 |
| SeLa | identical | VOC07 | VOC12 | 72.13 (+0.40) |
| Supervised | identical | VOC0712 | - | 77.37 (+5.64) |
| Supervised | HF | VOC07 | - | 71.89 |
| SeLa | HF | VOC07 | VOC12 | 72.35 (+0.46) |
| Supervised | HF | VOC0712 | - | 77.26 (+5.37) |
| Supervised | MU | VOC07 | - | 73.04 |
| SeLa | MU | VOC07 | VOC12 | 73.50 (+0.46) |
| Supervised | MU | VOC0712 | - | 78.83 (+5.79) |
3.2 Limitations of Self-labeling
To verify the performance of existing self-labeling-based SSOD methods, we conduct experiments on the Pascal VOC benchmark dataset [11] with the SSD300 structure [1]. We use the trainval set of VOC07 as labeled data and the trainval set of VOC12 as unlabeled data, and report the AP50 performance on the test set of VOC07. The confidence threshold $\tau$ is fixed at 0.5, similar to [25] and [26]. For comparison, we conduct fully-supervised training with the same hyper-parameters (batch size, number of iterations, etc.) as the baseline. As shown in Table 1, the AP50 improvement achieved by self-labeling is only 0.40%. Besides, we test two representative consistency-regularization-based methods, namely CSD [25] and ISD [26]. Similarly, the detection results achieve only 0.46% absolute AP50 gains over the baseline, which indicates the inefficiency of the self-labeling training scheme.
Table 2: Diagnostic experiments on self-labeling (SeLa) with SSD300 on Pascal VOC under different confidence thresholds $\tau$ and pseudo-label corrections.

| Method | Labeled | Unlabeled | Threshold $\tau$ | AP50 (%) |
| --- | --- | --- | --- | --- |
| Supervised | VOC07 | - | - | 71.73 |
| SeLa | VOC07 | VOC12 | 0.5 | 72.13 (+0.40) |
| SeLa | VOC07 | VOC12 | 0.5→0.8 | 72.09 (+0.36) |
| SeLa | VOC07 | VOC12 | 0.8 | 72.12 (+0.39) |
| SeLa (use TP and discard FP) | VOC07 | VOC12 | 0.5 | 74.03 (+2.30) |
| SeLa (use TP and randomly labeled FP) | VOC07 | VOC12 | 0.5 | 73.87 (+2.14) |
| SeLa (use GT labels for TP and FP) | VOC07 | VOC12 | 0.5 | 74.86 (+3.13) |
| Supervised | VOC0712 | - | - | 77.37 (+5.64) |

To reveal the reason behind this phenomenon and find possible solutions, we introduce the ground-truth category labels of all pseudo bounding boxes for in-depth analysis. Although all existing self-labeling-based SSOD methods generate pseudo labels by selecting bounding boxes with high confidence scores to ensure the quality of pseudo labels, Fig. 2(a) illustrates that part of the high-confidence pseudo bounding boxes can still be misclassified. We name these incorrect boxes "self-errors" for clarity. Since all pseudo boxes are predicted by the detector, it is impossible for the detector itself to discern the self-errors, which we summarize as the first inherent limitation of the self-labeling process. We further conduct an experiment to illustrate how much this limitation affects the detection performance: in each training iteration, when we keep the correct pseudo bounding boxes and discard the incorrect ones for training, the AP50 result increases from 72.13% to 74.03% (see the 2nd and 5th rows in Table 2). Besides, we note that the detection performance cannot be improved by naively increasing the confidence threshold $\tau$. As shown in Fig. 2(a) and (b), the precision of pseudo bounding boxes increases from 71% to 91% as the threshold increases from 0.4 to 0.8, but a large threshold also filters out correct pseudo boxes and wastes the unlabeled training data. Correspondingly, we conduct self-labeling training under three settings of the threshold $\tau$: (i) fixed at 0.5, (ii) fixed at 0.8, (iii) rising gradually from 0.5 to 0.8 during the whole training process. As displayed in Table 2 (2nd to 4th rows), the trade-off between precision and recall leads to similar final performances (about 72.1% AP50).
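As a concrete illustration of this diagnostic, the sketch below measures pseudo-label precision at several confidence thresholds when ground-truth classes are available for matched boxes; the input arrays are hypothetical and simply assumed to be aligned per pseudo box.

```python
import numpy as np

def pseudo_label_precision(scores, pred_cls, gt_cls, thresholds=(0.4, 0.5, 0.8)):
    """For each threshold, return (precision, number of surviving boxes)."""
    stats = {}
    for tau in thresholds:
        keep = scores > tau                                # boxes surviving tau
        correct = (pred_cls[keep] == gt_cls[keep]).sum()   # correctly classified
        kept = keep.sum()
        # Precision rises with tau, but `kept` shrinks: correct boxes are
        # discarded and unlabeled data is wasted, hence the trade-off.
        stats[tau] = (correct / max(kept, 1), int(kept))
    return stats
```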
Since the self-errors cannot be discerned during the self-labeling process, it is also impossible for the detector to rectify them, which we summarize as the second limitation of the self-labeling process. In each training iteration, when we utilize the ground-truth (GT) category labels for all pseudo bounding boxes, the AP50 result increases to 74.86%, a 3.13% absolute gain rather than 0.40% (see the penultimate row in Table 2). This shows the effectiveness of pseudo-label rectification. Besides, we find another interesting phenomenon: when we replace the misclassified category labels with random labels (dubbed random-errors) during the self-labeling training process, the final performance increases to 73.87% (see the third row from the bottom in Table 2). We also conduct experiments on the Faster-RCNN-FPN structure [2] and observe a similar trend: the AP50 obtains a 0.2% gain by replacing the misclassified labels with random labels during training. These phenomena imply that the detector model is misguided more severely by self-errors than by random-errors.
Based on the above experimental analysis, we conclude that the two inherent limitations of self-labeling-based SSOD methods lead the detector to be misguided by self-errors, and that self-errors hurt the detector's performance even more than random-errors.
4 Methodology
4.1 CrossRectify
Since one single detector can neither discern nor rectify its self-errors, an intuitive idea is to utilize another model to deal with them. Inspired by the fact that two models with the same structure but different initialization can yield different predictions on the same input [12, 13, 14, 15], we present the CrossRectify method to address the inherent limitations of self-labeling.
In CrossRectify, two detectors with the same structure but distinct initialization (taking the 2D object detection task as an example, the backbone parameters of both detectors are initialized from the ImageNet-pretrained model, while the parameters of the detection heads are randomly initialized), denoted as $f_A$ and $f_B$, are trained simultaneously. Both detectors are trained by jointly optimizing the supervised and unsupervised losses in Eq. (1). For simplicity, we only introduce how to generate the pseudo bounding boxes $\hat{Y}_A$ for training detector $f_A$, since the pseudo boxes $\hat{Y}_B$ for training detector $f_B$ are generated in the same way. There are three steps in generating $\hat{Y}_A$: 1) conducting detector feed-forward; 2) matching predicted bounding boxes; 3) cross-rectifying the matched boxes to generate pseudo labels. They are explained in detail below. The label generation process of $\hat{Y}_A$ is illustrated in Fig. 3 and briefly summarized in Algorithm 1.

Detector feed-forward
In each training iteration, we utilize the two detector models $f_A$ and $f_B$ to predict on the unlabeled inputs, then select the bounding boxes whose maximum probability scores are higher than the threshold $\tau$, denoted as $\mathcal{B}_A$ and $\mathcal{B}_B$; each selected box carries the probability scores $p$ and coordinates $t$ predicted by the corresponding detector.
Matching bounding boxes
For each box $b_i^A$ in $\mathcal{B}_A$, we search for its best-matched box among all boxes in $\mathcal{B}_B$. Concretely, for the $i$-th box in $\mathcal{B}_A$, the matching process is formulated as:

$$m(i) = \arg\max_{j}\ \mathcal{M}\big(b_i^A, b_j^B\big), \qquad (5)$$
where $\mathcal{M}(\cdot,\cdot)$ stands for the matching metric, which differs slightly across detector structures. For Faster-RCNN [2], $\mathcal{M}$ is the intersection over union (IoU) between two boxes:

$$\mathcal{M}\big(b_i^A, b_j^B\big) = \mathrm{IoU}\big(b_i^A, b_j^B\big) = \frac{\big|b_i^A \cap b_j^B\big|}{\big|b_i^A \cup b_j^B\big|}. \qquad (6)$$
Specifically, if the IoU values between $b_i^A$ and all boxes in $\mathcal{B}_B$ fall below a certain threshold $\tau_{\mathrm{IoU}}$, we create a virtual background box $b_{\varnothing}^B$ to match it, i.e., $b_{m(i)}^B = b_{\varnothing}^B$, with the matching metric set to 1. For the SSD structure [1], $\mathcal{M}$ equals 1 if the two boxes are predicted from the same anchor, and 0 otherwise. Note that $\mathcal{M}$ for SSD can also be specified as the IoU, like that for Faster-RCNN, but we find the anchor correspondence more effective in our experiments. For VoteNet [24], the matching metric is specified as the negative Euclidean distance between the centers of the two bounding boxes:
$$\mathcal{M}\big(b_i^A, b_j^B\big) = -\big\|\, c\big(b_i^A\big) - c\big(b_j^B\big) \big\|_2, \qquad (7)$$

where $c(\cdot)$ denotes the center of a box.
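The three instantiations of the matching metric $\mathcal{M}$ can be sketched as follows; these are our minimal reference implementations (not taken from any released codebase), with 2D boxes given as (x1, y1, x2, y2) and VoteNet boxes represented by their 3D centers.

```python
import numpy as np

def iou_metric(box_a, box_b):                      # Faster-RCNN, Eq. (6)
    x1, y1 = np.maximum(box_a[:2], box_b[:2])      # top-left of intersection
    x2, y2 = np.minimum(box_a[2:], box_b[2:])      # bottom-right of intersection
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def anchor_metric(anchor_id_a, anchor_id_b):       # SSD: same anchor or not
    return 1.0 if anchor_id_a == anchor_id_b else 0.0

def center_metric(center_a, center_b):             # VoteNet, Eq. (7)
    return -np.linalg.norm(np.asarray(center_a) - np.asarray(center_b))
```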
Cross-rectifying
Based on each matched pair $\big(b_i^A, b_{m(i)}^B\big)$, the corresponding pseudo bounding box within $\hat{Y}_A$ is generated as:

$$\hat{b}_i = \begin{cases} b_i^A, & \text{if } \mathrm{cls}\big(b_i^A\big) = \mathrm{cls}\big(b_{m(i)}^B\big), \\ \arg\max_{b \in \{b_i^A,\, b_{m(i)}^B\}} \max p(b), & \text{otherwise}, \end{cases} \qquad (8)$$

where $\mathrm{cls}(\cdot)$ and $\max p(\cdot)$ denote the predicted class and confidence score of a box.
Note that Eq. (8) covers two situations. (a) When both detectors predict the same class for a certain object, we adopt it as the pseudo label, since two consistent decisions are more reliable than one. (b) When the two detectors disagree on a certain object, the bounding box tends to be unreliable; in this case, the box with the higher confidence is regarded as the pseudo label. The rationale behind this cross-rectifying mechanism is that bounding boxes with higher confidence scores are more likely to be correctly classified (see Fig. 2(a)). Thus, the wrong elements in $\hat{Y}_A$ can be both discerned and rectified in this manner.
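Putting the three steps together, the following is a sketch of the label-generation pipeline for detector $f_A$, in the spirit of Algorithm 1. The box representation (dicts with "coords", "cls", "conf" fields) and the handling of the virtual background match are our assumptions rather than the released implementation.

```python
def cross_rectify(boxes_a, boxes_b, match_metric, iou_thr=0.5):
    """Generate pseudo labels for f_A from both detectors' high-confidence boxes."""
    pseudo_labels = []
    for box_a in boxes_a:
        # Step 2: matching, Eq. (5) -- find the best counterpart in boxes_b.
        scores = [match_metric(box_a["coords"], b["coords"]) for b in boxes_b]
        if not boxes_b or max(scores) < iou_thr:
            # No counterpart above the threshold: a virtual background match;
            # we assume the box from f_A is then kept as-is.
            pseudo_labels.append(box_a)
            continue
        box_b = boxes_b[max(range(len(boxes_b)), key=lambda j: scores[j])]

        # Step 3: cross-rectifying, Eq. (8).
        if box_a["cls"] == box_b["cls"]:
            pseudo_labels.append(box_a)    # agreement: adopt the prediction
        else:
            # disagreement: keep the more confident of the two predictions
            pseudo_labels.append(box_a if box_a["conf"] >= box_b["conf"] else box_b)
    return pseudo_labels
```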
When the training process ends, we evaluate the performance of a single detector. Moreover, to exploit the different detection abilities of the two detectors, we propose to adopt the weighted boxes fusion (WBF) [29] strategy to ensemble the two prediction sets. The corresponding performance is denoted as CrossRectify∗.
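For reference, a usage sketch of WBF with the `ensemble-boxes` package released alongside [29] is shown below; the toy inputs and the two thresholds are illustrative values, not the settings used in our experiments. WBF expects per-model lists of boxes normalized to [0, 1], plus scores and labels.

```python
from ensemble_boxes import weighted_boxes_fusion

# Toy predictions from the two detectors on one image (normalized coordinates).
boxes_list = [[[0.10, 0.10, 0.40, 0.40]],        # detector f_A
              [[0.12, 0.11, 0.42, 0.39]]]        # detector f_B
scores_list = [[0.90], [0.80]]
labels_list = [[1], [1]]

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    iou_thr=0.55,        # overlapping boxes above this IoU are fused
    skip_box_thr=0.0,    # do not pre-filter low-score boxes
)
```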
4.2 Comparisons with other works
We now conduct quantitative analysis to show the superiority of CrossRectify in improving pseudo-label quality, compared with other recent works.
Teacher-student mutual learning
As discussed in Section 2.2, some recent SSOD works [16, 17, 18, 19, 20] are built on offline/online teacher-student mutual learning [21]. These works can alleviate the self-errors in the self-labeling process by introducing another separate object detector for pseudo-label generation. However, for the offline methods [16, 17], the pseudo labels are generated only once and remain fixed when training the student detector, so the student performance is upper-bounded by that of the teacher. For instance, we conduct experiments with the Faster-RCNN-FPN detector on the MS-COCO benchmark dataset under the 10% degree of supervision. The AP50:95 performance of the teacher detector after fully-supervised pre-training is 23.86%, while that of the student detector supervised by the teacher increases by only 3.30% absolute, far from the results in Table 4 (34.89% AP50:95). A similar phenomenon can be observed for the SSD300 structure on the Pascal VOC benchmark dataset in Table 3 (only a 0.79% AP50 gain).
As for the online methods [18, 19, 20], the teacher detector converges to the student detector and yields similar predictions in the late stage of training, so the pseudo-label generation process degenerates to the self-labeling process and suffers from the same limitations. For instance, we conduct online teacher-student mutual learning based on SSD300 and Pascal VOC. As shown in Fig. 4(a), the average KL-divergence between the probability scores predicted by the teacher and student detectors reaches zero in the last 40k iterations. Correspondingly, the detection performance shown in Table 3 also indicates the ineffectiveness of online teacher-student mutual learning.
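The divergence diagnostic can be sketched as follows: we report the average KL divergence between the class distributions predicted by teacher and student on the same boxes, so a near-zero value indicates that the teacher has collapsed onto the student. The function below is a minimal PyTorch sketch under that definition.

```python
import torch.nn.functional as F

def teacher_student_kl(student_logits, teacher_logits):
    # KL(teacher || student), averaged over the predicted boxes.
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```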


Co-rectify and CPS
Recently, a co-training-based SSOD method named co-rectify was proposed in [22]; to the best of our knowledge, it is the only prior work applying the idea of co-training to the SSOD task. In co-rectify, the pseudo bounding boxes are first predicted by one detector, then refined by the corresponding predictions from the other model, with probability scores and coordinates being averaged. Besides, a recent work proposes cross pseudo supervision (CPS) [28] for the semi-supervised semantic segmentation task and achieves state-of-the-art performance; each model directly takes the predictions from the other model as pseudo labels. CPS can be adapted to the SSOD task by letting each detector be supervised by the other. However, as shown in Fig. 4(b), the pseudo-label precision of these methods is inferior to that of CrossRectify (conducting semi-supervised training with SSD300 on Pascal VOC). We attribute this to the fact that simply averaging multiple predictions (co-rectify) or directly taking predictions from the other model as supervision (CPS) cannot fully exploit the advantages of multiple models, in contrast to our cross-rectifying mechanism. Their inferior performances in Table 3 and Table 6 also support this view. Besides, we investigate further alternative strategies for pseudo-label rectification and observe that cross-rectifying turns out to be the most effective one (as detailed in Section 5.4).
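To make the contrast explicit, the snippets below schematically paraphrase the two compared rules under our reading of [22] and [28]; the field names are illustrative, and `box_a`/`box_b` are a matched pair of predictions as in the earlier sketches.

```python
def co_rectify_refine(box_a, box_b):
    # Co-rectify [22], as we understand it: average the probability scores
    # and coordinates of the matched pair.
    return {
        "coords": [(a + b) / 2 for a, b in zip(box_a["coords"], box_b["coords"])],
        "probs": [(a + b) / 2 for a, b in zip(box_a["probs"], box_b["probs"])],
    }

def cps_label(box_b):
    # CPS [28], adapted to SSOD: directly take the other detector's
    # prediction as supervision, without any comparison step.
    return box_b
```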
5 Experiments
5.1 Datasets and Evaluation Metrics
2D semi-supervised object detection
We evaluate the proposed CrossRectify on two widely-used benchmark datasets, i.e., Pascal VOC [11] and MS-COCO [30]. Pascal VOC has 20 object categories. We take the VOC07 trainval set (5,011 images) as labeled data and the VOC12 trainval set (11,540 images) as unlabeled data. The detection performance is evaluated on the VOC07 test set (4,952 images) using the VOC style AP50 metric. MS-COCO has 80 object categories. We follow the same settings as those in [16, 18, 20, 22, 31] and randomly sample 1/2/5/10% of the COCO2017 train set (118,287 images) as labeled data, taking the remaining part as unlabeled data. We create five data folds under each degree of supervision, and report the mean and standard deviation over the five results. The detection performance is evaluated on the COCO2017 val set (5,000 images) using the COCO style AP50:95 metric.
3D semi-supervised object detection
Following [32], we evaluate on the SUN-RGBD benchmark [33], which contains 5,285 training and 5,050 validation RGB-D scans. We randomly sample 5% of the training set as labeled data and treat the rest as unlabeled, and evaluate with the mean average precision at 3D IoU thresholds of 0.25 and 0.5 (AP25 and AP50), following the standard VoteNet evaluation protocol [24].
5.2 Implementation Details
Detector structures
We carry out experiments on the Pascal VOC dataset with two detector structures, namely SSD300 [1] with a VGG-16 backbone and Faster-RCNN-FPN [2, 34] with a ResNet-50 backbone. The latter structure is also utilized in the experiments on the MS-COCO dataset. For 3D detection, we utilize VoteNet [24] with a PointNet++ backbone [35].
Training settings
We utilize a public Pytorch implementation (https://github.com/amdegroot/ssd.pytorch) to train SSD300 on Pascal VOC. Within a total of 120k iterations, we conduct fully-supervised training in the first 12k iterations as warm-up. We ramp up/down the unsupervised loss weight $\lambda$, and set the threshold $\tau$ and batch size to 0.5 and 32 following [26]. We utilize the Detectron2 platform (https://github.com/facebookresearch/detectron2) to train Faster-RCNN-FPN on Pascal VOC. We train for a total of 36k iterations with the first 6k being fully-supervised warm-up, and adopt the same data augmentation strategy as that in [18]. We set $\lambda$ to 2.0 and the threshold $\tau$ to 0.7 following [18]. The batch sizes for labeled data and unlabeled data are both 16. The threshold $\tau_{\mathrm{IoU}}$ on the matching metric is 0.5. To show the generality of our CrossRectify method across different platforms, we adopt MMDetection (https://github.com/open-mmlab/mmdetection) to train Faster-RCNN-FPN on MS-COCO. We train for a total of 180k iterations and adopt the data augmentation strategies in [20]. Under the 1% degree of supervision, we conduct fully-supervised warm-up in the first 80k iterations to ensure the stability of training. We set $\lambda$ to 4.0 and the threshold $\tau$ to 0.9 following [20]. The batch sizes for labeled data and unlabeled data are 8 and 32, respectively. The threshold $\tau_{\mathrm{IoU}}$ on the matching metric is 0.5. To train VoteNet on SUN-RGBD, we first conduct fully-supervised pre-training for 900 iterations, then conduct semi-supervised training for 1k iterations, following [32].
Note that the exponential moving average (EMA) strategy is commonly used in pseudo-label-based methods, since a detector aggregated by EMA yields more conservative and stable predictions than the detector itself [18, 19, 20]. For fair comparisons, we follow this common practice in our experiments and utilize the EMAs of the two detectors to conduct the detector feed-forward process.
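A minimal sketch of this EMA aggregation is given below; the momentum value is a typical choice, not a setting reported in this paper.

```python
import torch

@torch.no_grad()
def ema_update(ema_detector, detector, momentum=0.999):
    # In-place update: ema <- momentum * ema + (1 - momentum) * current weights.
    for p_ema, p in zip(ema_detector.parameters(), detector.parameters()):
        p_ema.mul_(momentum).add_(p, alpha=1.0 - momentum)
```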
5.3 Results
Pascal VOC
Table 3 shows the results of our CrossRectify method compared with other training frameworks on Pascal VOC. For the SSD300 detector, we take the 71.73% AP50 performance of fully-supervised training as the baseline. As can be seen, our proposed method obtains a 73.65% AP50 result, while the results of all compared approaches are only around 72.5%. This comparison validates the effectiveness of CrossRectify in improving pseudo-label quality. Besides, the WBF-merged [29] results from both detectors further boost the final performance, denoted as CrossRectify∗. Under the mix-up data augmentation [10], our CrossRectify method still performs better than the self-labeling-based method ISD [26] (by a 1.41% margin).
Table 3: Comparison with other training frameworks on Pascal VOC (VOC style AP50).

| Model | Backbone | Method | Labeled | Unlabeled | Threshold | AP50 (%) |
| --- | --- | --- | --- | --- | --- | --- |
| SSD300 | VGG-16 | Supervised | VOC07 | - | - | 71.73 |
| | | Self-Labeling | VOC07 | VOC12 | 0.5 | 72.13 (+0.40) |
| | | Online Teacher-Student Mutual Teaching | VOC07 | VOC12 | 0.5 | 72.56 (+0.83) |
| | | Offline Teacher-Student Mutual Teaching [17] | VOC07 | VOC12 | - | 72.52 (+0.79) |
| | | Cross Pseudo Supervision [28] | VOC07 | VOC12 | - | 72.56 (+0.83) |
| | | Co-rectify [22] | VOC07 | VOC12 | 0.5 | 72.48 (+0.75) |
| | | CrossRectify (ours) | VOC07 | VOC12 | 0.5 | 73.56 (+1.83) |
| | | CrossRectify∗ (ours) | VOC07 | VOC12 | 0.5 | 74.97 (+3.24) |
| | | Supervised + MixUp | VOC07 | - | - | 73.04 |
| | | Self-Labeling + MixUp (ISD [26]) | VOC07 | VOC12 | 0.5 | 73.50 (+0.46) |
| | | CrossRectify + MixUp (ours) | VOC07 | VOC12 | 0.5 | 74.91 (+1.87) |
| | | CrossRectify∗ + MixUp (ours) | VOC07 | VOC12 | 0.5 | 76.16 (+3.12) |
| Faster-RCNN-FPN | ResNet-50 | Supervised | VOC07 | - | - | 76.90 |
| | | CSD [25] | VOC07 | VOC12 | - | 77.50 (+0.60) |
| | | STAC [16] | VOC07 | VOC12 | - | 77.50 (+0.60) |
| | | Co-rectify [22] | VOC07 | VOC12 | - | 79.20 (+2.30) |
| | | Combating Noise [31] | VOC07 | VOC12 | - | 80.60 (+3.70) |
| | | Humble Teacher [19] | VOC07 | VOC12 | 0.7 | 80.94 (+3.94) |
| | | Unbiased Teacher [18] | VOC07 | VOC12 | 0.7 | 80.51 (+3.61) |
| | | CrossRectify (ours) | VOC07 | VOC12 | 0.7 | 81.56 (+4.66) |
| | | CrossRectify∗ (ours) | VOC07 | VOC12 | 0.7 | 82.34 (+5.44) |
For the Faster-RCNN-FPN detector, we compare CrossRectify with previous methods, and our method improves the AP50 result by a 4.66% margin over the fully-supervised baseline, achieving state-of-the-art performance. Note that Unbiased Teacher [18] reports its performance using the COCO style AP50 metric in the original paper. For a fair comparison, we instead adopt the VOC style AP50 metric for Unbiased Teacher, whose AP50 then rises from 77.37% to 80.51%, still surpassed by that of CrossRectify with a 0.59% margin.
Table 4: Comparison on MS-COCO (COCO style AP50:95, mean ± standard deviation over five folds) under different proportions of labeled data.

| Model | Backbone | Method | 1% | 2% | 5% | 10% |
| --- | --- | --- | --- | --- | --- | --- |
| Faster-RCNN-FPN | ResNet-50 | Supervised | 9.05 ± 0.16 | 12.70 ± 0.15 | 18.47 ± 0.22 | 23.86 ± 0.81 |
| | | CSD [25] | 10.51 ± 0.06 | 13.93 ± 0.12 | 18.63 ± 0.07 | 22.46 ± 0.08 |
| | | STAC [16] | 13.97 ± 0.35 | 18.25 ± 0.25 | 24.38 ± 0.12 | 28.64 ± 0.21 |
| | | Unbiased Teacher [18] | 20.75 ± 0.12 | 24.30 ± 0.07 | 28.27 ± 0.11 | 31.50 ± 0.10 |
| | | Humble Teacher [19] | 16.96 ± 0.38 | 21.72 ± 0.24 | 27.70 ± 0.15 | 31.61 ± 0.28 |
| | | Co-rectify [22] | 18.05 ± 0.15 | 22.45 ± 0.15 | 26.75 ± 0.05 | 30.40 ± 0.05 |
| | | Combating Noise [31] | 18.41 ± 0.10 | 24.00 ± 0.15 | 28.96 ± 0.29 | 32.43 ± 0.20 |
| | | Soft Teacher [20] | 20.46 ± 0.39 | 26.20 ± 0.10 | 30.74 ± 0.08 | 34.04 ± 0.14 |
| | | CrossRectify (ours) | 21.90 ± 0.11 | 26.70 ± 0.07 | 31.70 ± 0.04 | 34.89 ± 0.07 |
| | | CrossRectify∗ (ours) | 22.50 ± 0.12 | 27.60 ± 0.07 | 32.80 ± 0.05 | 36.30 ± 0.07 |

MS-COCO
Table 4 shows the performance of our CrossRectify method compared with previous state-of-the-art methods on the MS-COCO dataset. Under the 1%, 2%, 5%, and 10% degrees of supervision, the proposed CrossRectify obtains consistent and substantial improvements, surpassing Soft Teacher [20] by 1.46%, 0.50%, 0.96%, and 0.85% AP50:95 margins, respectively. These comparative results further verify the effectiveness of the proposed method. Moreover, we visualize the pseudo bounding boxes for some unlabeled images in Fig. 5. Compared with the self-labeling training scheme, our method yields more accurate pseudo boxes.
SUN-RGBD
Table 5 shows the comparison with all previous works (i.e., SESS [36] and 3DIoUMatch [32]) on the SUN-RGBD benchmark dataset. Under the 5% degree of supervision, our CrossRectify outperforms the state-of-the-art 3DIoUMatch method by 3.1 AP25 and 1.9 AP50 margins. These results validate the effectiveness of CrossRectify on the 3D semi-supervised object detection task. We omit the WBF-merged performance CrossRectify∗, because WBF does not support 3D bounding boxes with different rotation angles.
5.4 Empirical Study
Pseudo label rectification strategy
We now investigate alternative strategies for rectifying the pseudo labels, including: (a) only utilizing the intersection of the two prediction sets from the two detectors as pseudo labels, i.e., the objects classified as the same class by both detectors; (b) only utilizing the difference of the two prediction sets as pseudo labels, i.e., the objects classified as different classes; (c) directly taking all predicted bounding boxes from the other detector as pseudo labels, which is exactly cross pseudo supervision (CPS) [28]. As observed in Table 6, none of these strategies can ensure the pseudo-label quality, and they all lead to inferior performances.
Extension to more detectors
Our proposed CrossRectify method can easily be extended to train more than two detectors simultaneously. Specifically, during the pseudo-label rectification process, each pseudo bounding box is re-labeled by the majority vote over all predicted classes, and re-located by the average of all predicted coordinates (see the sketch below). As displayed in Table 7, CrossRectify over four SSD300 detectors brings only a 0.06% average AP50 improvement for each detector. We therefore believe that the two-detector setting already suffices to cross-rectify the misclassified pseudo labels.
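A sketch of this N-detector rectification rule is given below; the grouping of matched boxes across detectors is assumed to be produced by the matching step of Section 4.1, and the box representation follows the earlier sketches.

```python
from collections import Counter

def rectify_group(matched_boxes):
    """Rectify one group of matched boxes, one per detector."""
    classes = [b["cls"] for b in matched_boxes]
    majority_cls = Counter(classes).most_common(1)[0][0]   # majority vote
    n = len(matched_boxes)
    # Re-locate by averaging coordinates across all detectors.
    avg_coords = [sum(c) / n for c in zip(*(b["coords"] for b in matched_boxes))]
    return {"cls": majority_cls, "coords": avg_coords}
```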
Table 6: Comparison of pseudo-label rectification strategies with SSD300 on Pascal VOC.

| Model | Labeled | Unlabeled | Strategy | AP50 (%) |
| --- | --- | --- | --- | --- |
| SSD300 | VOC07 | - | - | 71.73 |
| | VOC07 | VOC12 | intersection | 72.59 |
| | VOC07 | VOC12 | difference set | 65.52 |
| | VOC07 | VOC12 | CPS | 72.56 |
| | VOC07 | VOC12 | CrossRectify | 73.65 |
Table 7: Extension of CrossRectify from two to four SSD300 detectors on Pascal VOC (AP50, %).

| Model | Index | Single | Average | WBF-Merged |
| --- | --- | --- | --- | --- |
| SSD300 | detector #1 | 73.67 | 73.65 | 74.83 |
| | detector #2 | 73.63 | | |
| | detector #1 | 73.60 | 73.71 | 75.84 |
| | detector #2 | 73.71 | | |
| | detector #3 | 73.80 | | |
| | detector #4 | 73.73 | | |
6 Conclusion
In this paper, we propose the CrossRectify training framework for the semi-supervised object detection task, aiming to address the inherent limitations of self-labeling-based methods. In CrossRectify, two detectors with the same structure but different initialization are trained simultaneously, and the disagreements between the two detectors on the same objects are utilized to discern and rectify the latent self-errors predicted by each single detector. Moreover, we conduct in-depth analysis and quantitative experiments to show the superiority of CrossRectify in improving pseudo-label quality over other recent works. Extensive results on both 2D and 3D semi-supervised object detection tasks validate the effectiveness and versatility of CrossRectify.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Grants 61832016, U20B2070, 6210070958, and 62102162.
References
- [1] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, in: European conference on computer vision, Springer, 2016, pp. 21–37.
- [2] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015) 91–99.
- [3] B. Bosquet, M. Mucientes, V. M. Brea, STDnet-ST: Spatio-temporal ConvNet for small object detection, Pattern Recognition 116 (2021) 107929.
- [4] H. Wang, Q. Wang, P. Li, W. Zuo, Multi-scale structural kernel representation for object detection, Pattern Recognition 110 (2021) 107593.
- [5] Y. Kong, M. Feng, X. Li, H. Lu, X. Liu, B. Yin, Spatial context-aware network for salient object detection, Pattern Recognition 114 (2021) 107867.
- [6] J. Zhang, H. Su, W. Zou, X. Gong, Z. Zhang, F. Shen, CADN: a weakly supervised learning-based category-aware object detection network for surface defect detection, Pattern Recognition 109 (2021) 107571.
- [7] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al., The open images dataset v4, International Journal of Computer Vision 128 (7) (2020) 1956–1981.
- [8] C. Rosenberg, M. Hebert, H. Schneiderman, Semi-supervised self-training of object detection models, in: Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION’05)-Volume 1-Volume 01, 2005, pp. 29–36.
- [9] Y. M. Asano, C. Rupprecht, A. Vedaldi, Self-labelling via simultaneous clustering and representation learning, in: International Conference on Learning Representations (ICLR), 2020.
- [10] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations, 2018.
- [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International journal of computer vision 88 (2) (2010) 303–338.
- [12] S. Qiao, W. Shen, Z. Zhang, B. Wang, A. Yuille, Deep co-training for semi-supervised image recognition, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 135–152.
- [13] Z. Ke, D. Wang, Q. Yan, J. Ren, R. W. Lau, Dual student: Breaking the limits of the teacher in semi-supervised learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6728–6736.
- [14] H. Wei, L. Feng, X. Chen, B. An, Combating noisy labels by agreement: A joint training method with co-regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13726–13735.
- [15] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, M. Sugiyama, How does disagreement help generalization against label corruption?, in: International Conference on Machine Learning, PMLR, 2019, pp. 7164–7173.
- [16] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, T. Pfister, A simple semi-supervised learning framework for object detection, arXiv preprint arXiv:2005.04757.
- [17] Z. Wang, Y. Li, Y. Guo, L. Fang, S. Wang, Data-uncertainty guided multi-phase learning for semi-supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4568–4577.
- [18] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, P. Vajda, Unbiased teacher for semi-supervised object detection, in: International Conference on Learning Representations, 2020.
- [19] Y. Tang, W. Chen, Y. Luo, Y. Zhang, Humble teachers teach better students for semi-supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3132–3141.
- [20] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, Z. Liu, End-to-end semi-supervised object detection with soft teacher, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3060–3069.
- [21] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1195–1204.
- [22] Q. Zhou, C. Yu, Z. Wang, Q. Qian, H. Li, Instant-teaching: An end-to-end semi-supervised object detection framework, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4081–4090.
- [23] Z. Cai, N. Vasconcelos, Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162.
- [24] C. R. Qi, O. Litany, K. He, L. J. Guibas, Deep Hough voting for 3D object detection in point clouds, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9277–9286.
- [25] J. Jeong, S. Lee, J. Kim, N. Kwak, Consistency-based semi-supervised learning for object detection, Advances in neural information processing systems 32 (2019) 10759–10768.
- [26] J. Jeong, V. Verma, M. Hyun, J. Kannala, N. Kwak, Interpolation-based semi-supervised learning for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11602–11611.
- [27] P. Tang, C. Ramaiah, Y. Wang, R. Xu, C. Xiong, Proposal learning for semi-supervised object detection, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2291–2301.
- [28] X. Chen, Y. Yuan, G. Zeng, J. Wang, Semi-supervised semantic segmentation with cross pseudo supervision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2613–2622.
- [29] R. Solovyev, W. Wang, T. Gabruseva, Weighted boxes fusion: Ensembling boxes from different object detection models, Image and Vision Computing 107 (2021) 104117.
- [30] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755.
- [31] Z. Wang, Y.-L. Li, Y. Guo, S. Wang, Combating noise: Semi-supervised learning by region uncertainty quantification, Advances in Neural Information Processing Systems 34.
- [32] H. Wang, Y. Cong, O. Litany, Y. Gao, L. J. Guibas, 3DIoUMatch: Leveraging IoU prediction for semi-supervised 3D object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14615–14624.
- [33] S. Song, S. P. Lichtenberg, J. Xiao, SUN RGB-D: A RGB-D scene understanding benchmark suite, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
- [34] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
- [35] C. R. Qi, L. Yi, H. Su, L. J. Guibas, Pointnet++: Deep hierarchical feature learning on point sets in a metric space, Advances in neural information processing systems 30.
- [36] N. Zhao, T.-S. Chua, G. H. Lee, SESS: Self-ensembling semi-supervised 3D object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11079–11087.