Boosting Semi-Supervised 3D Object Detection with Semi-Sampling

Xiaopei Wu^1,2*, Yang Zhao^1*, Liang Peng¹, Hua Chen³, Xiaoshui Huang²,
Binbin Lin¹, Haifeng Liu¹, Deng Cai¹, Wanli Ouyang²
¹ Zhejiang University ² Shanghai AI Laboratory ³ Beijing Aircraft Technology Research Institute
{wuxiaopei, zhaoyang6, pengliang}@zju.edu.cn

Abstract

Current 3D object detection methods heavily rely on an enormous amount of annotations. Semi-supervised learning can be used to alleviate this issue. Previous semi-supervised 3D object detection methods directly follow the practice of fully-supervised methods to augment labeled and unlabeled data, which is sub-optimal. In this paper, we design a data augmentation method for semi-supervised learning, which we call Semi-Sampling. Specifically, we use ground truth labels and pseudo labels to crop gt samples and pseudo samples on labeled frames and unlabeled frames, respectively. Then we can generate a gt sample database and a pseudo sample database. When training a teacher-student semi-supervised framework, we randomly select gt samples and pseudo samples to both labeled frames and unlabeled frames, making a strong data augmentation for them. Our semi-sampling can be regarded as an extension of gt-sampling to semi-supervised learning. Our method is simple but effective. We consistently improve state-of-the-art methods on ScanNet, SUN-RGBD, and KITTI benchmarks by large margins. For example, when training using only 10% labeled data on ScanNet, we achieve 3.1 mAP and 6.4 mAP improvement upon 3DIoUMatch in terms of [email protected] and [email protected]. When training using only 1% labeled data on KITTI, we boost 3DIoUMatch by 3.5 mAP, 6.7 mAP and 14.1 mAP on car, pedestrian and cyclist classes. Codes will be made publicly available at https://github.com/LittlePey/Semi-Sampling.

^†^†*: equal contribution

1 Introduction

In recent years, the boom in deep learning has led to the rapid development of 3D object detection. Many 3D object detection methods [11, 12, 19, 22, 38, 32, 2, 9] make impressive performances. However, the 3D object detection task heavily relies on the availability of large 3D datasets annotated very well. The careful annotations for the raw data can be costly on both human resources and time. By contrast, the cost of acquiring unlabeled data is far less than labeled data, and the amount of unlabeled data is huge. The natural idea is to exploit unlabeled data to facilitate model learning. Semi-supervised learning (SSL) provides means of leveraging unlabeled data to improve model performance when labeled data is limited. Existing semi-supervised 3D object detection methods usually use a teach-student mutual learning framework to facilitate student learning from the EMA teacher. SESS[37] is the pioneering work. It designs a thorough perturbation scheme to enhance the generalization of the network on unlabeled data. 3DIoUMatch[30] makes use of unlabeled data in the form of pseudo labels. It introduces a confidence-based filtering mechanism to remove poorly localized pseudo labels and achieves notable improvement compared to SESS.

In this work, we concentrate on data augmentation for semi-supervised learning. For the student network, the training data consists of two parts: labeled data and unlabeled data. Data augmentation for labeled data has been intensively studied, while data augmentation for unlabeled data is under-explored. For example, the gt-sampling [35] strategy is a practical data augmentation for labeled data. It crops ground truth samples from other frames and pastes them to empty space of the current frame with the help of collision detection. However, there is no its counterpart on unlabeled data. Considering that unlabeled data account for most of the training data in semi-supervised learning, it is meaningful to perform an object sampling strategy similar to gt-sampling on them. In addition, the massive unlabeled data can provide a vast amount of unseen objects, which can be cropped and pasted to both labeled or unlabeled frames, making a stronger data augmentation.

Based on the above analysis, we propose a general object sampling strategy called Semi-Sampling. The key idea is to bridge the object sampling gap between labeled frames and unlabeled frames. We can use gt-sampling not only on labeled frames but also on unlabeled frames. We can crop objects not only in labeled frames but also in unlabeled frames for object sampling, which we call pseudo-sampling. Our semi-sampling can be regarded as a general extension of gt-sampling in semi-supervised learning, and our method can increase the diversity of semi-supervised training data greatly. Extensive experiments on indoor and outdoor datasets demonstrate its effectiveness.

To further boost the fully-supervised methods with semi-supervised learning, we construct a new dataset Omni-KITTI with KITTI raw data. KITTI raw data consists of many sequences, and the KITTI 3D object detection dataset is sampled from these sequences. Many frames in KITTI raw are not annotated, which can be used as unlabeled data. We provide a benchmark for Omni-KITTI, which gives the results of 3DIoUMatch and our method on Omni-KITTI.

To summarize, our contributions are listed as follows:

•

We propose a new object sampling strategy Semi-Sampling to augment both labeled frames and unlabeled frames.
•

We achieve remarkable improvement over prior art on the indoor 3D object detection benchmark ScanNet, SUN-RGBD, and outdoor 3D object detection benchmark KITTI.
•

We contribute a new benchmark dataset setting called Omni-KITTI for semi-supervised learning, and we also achieve a promising result on the dataset.

2 Related Works

2.1 Semi-Supervised Learning

Semi-Supervised Learning(SSL) draws growing attention in a wide range of research areas like image classification and segmentation. Since unlabeled data can be obtained more easily than labeled data, unlabeled data is far more than labeled data. Semi-supervised learning focuses on leveraging the labeled and unlabeled data to help the learning of the model. Current semi-supervised methods can be roughly divided into two groups. One is pseudo-label based methods like [8]. The method pre-trains a model on labeled data and uses the pre-trained model to infer the unlabeled data. The prediction will be used as the ground truth of the unlabeled data. For the pseudo-label based method, setting a classification confidence threshold to filter pseudo labels with low quality can bring a big improvement to the performance, which is shown in FixMatch[23]. Another kind of method is based on consistency regularization, such as [1, 13, 20, 7]. They build a regularization loss with unlabeled images, which encourages the model to generate similar predictions on different perturbations of the same image. There are many ways to perform perturbations, such as conducting data augmentation on input data[20] and adding perturbations to the model[1]. The methods mentioned above commonly use a framework called teacher-student, which is proposed in[27]. In the teacher-student framework, the teacher network will be initialized with frozen weight, and the weight of the teacher will be upgraded during training as the EMA of the student netowrk.

2.2 Semi-Supervised Object Detection

Object detection is an essential task in computer vision, and great progress has been made in both 2D and 3D object detection areas. For semi-supervised object detection, they can also be separated into two groups: consistency methods [6, 26] and pseudo-label methods [24, 10, 34, 36]. CSD[6] proposes a consistency regularization to keep the consistency between predictions of an image and its flipped version. STAC[24] adopts a two-stage scheme for training Faster R-CNN[18]. In the first stage, STAC pre-trains a detector with labeled data and takes the predictions of the pre-trained detector on the unlabeled data as pseudo labels. In the second stage, it selects pseudo labels with high confidence and updates the model by enforcing consistency with strong augmentations.

For the semi-supervised 3D object detection in indoor scenes, SESS[37] is based on the consistency loss method, and 3DIoUMatch[30] is based on the pseudo-label method. SESS utilizes a mutual learning framework composed of a student and an EMA teacher. It uses three kinds of consistency losses to align the predictions of the student and teacher. 3DIoUMatch also uses a teacher-student framework, while it uses the predictions from the EMA teacher network as pseudo labels to supervise the student network on unlabeled data. For outdoor semi-supervised 3D object detection, most methods [16, 3, 14, 33] use the pseudo-label method. 3D Auto Labeling[16] leverages a multi-frame detector and an object-centric auto-labeling model to generate accurate pseudo labels.

Refer to caption — Figure 1: Illustration of our semi-sampling. We use ground truth bounding boxes to crop gt samples from labeled frames, generating a gt database. And we use pseudo labels on unlabeled frames to crop pseudo samples from unlabeled frames, generating a pseudo database. Semi-sampling augments the labeled and unlabeled frames by performing gt-sampling and pseudo-sampling on both of them.

3 Methods

3.1 Problem Definition

Given a scene $\boldsymbol{x}$ containing a set of objects that are represented as a 3D point cloud, where $\boldsymbol{x}\in\mathbb{R}^{n\times 3}$ denotes the scene containing $n$ points, our task is to localize the amodal oriented 3D bounding boxes for these objects and get their semantic class labels. Under the semi-supervised setting, the problem is more challenging due to the limited supervising information. Assuming that we have access to $N_{l}$ labeled point clouds and $N_{u}$ unlabeled point clouds. We donate the $i^{th}$ labeled point cloud as $P^{l}_{i}=(\boldsymbol{x}^{l}_{i},\boldsymbol{y}^{l}_{i})$ and the $i^{th}$ unlabeled point cloud as $P^{u}_{i}=\boldsymbol{x}^{u}_{i}$ , where $\boldsymbol{x}^{l}_{i}$ and $\boldsymbol{x}^{u}_{i}$ are point clouds and $\boldsymbol{y}^{l}_{i}$ is ground truth annotations for $\boldsymbol{x}^{l}_{i}$ . Each item in $\boldsymbol{y}^{l}_{i}$ consists of a semantic class $s$ (1-of- $C$ predefined classes), an amodal 3D bounding box parameterized by its center $c=(c^{x},c^{y},c^{z})$ , size $d=(d^{x},d^{y},d^{z})$ and an orientation $\theta$ along the upright-axis.

3.2 Teacher-Student Framework

In semi-supervised learning, teacher-student framework [27] is widely used to propagate information from labeled data to unlabeled data. Here we introduce how the teacher-student framework works in the pseudo-label method. During the training stage, each batch contains labeled frames $\{\boldsymbol{x}^{l}_{i}\}^{B_{l}}_{i=1}$ and unlabeled frames $\{\boldsymbol{x}^{u}_{i}\}^{B_{u}}_{i=1}$ , where $B_{l}$ and $B_{u}$ are the numbers of labeled frames and unlabeled frames in a batch, respectively. The labeled frames $\{\boldsymbol{x}^{l}_{i}\}^{B_{l}}_{i=1}$ are used to supervise the student network as fully-supervised methods. For unlabeled frames $\{\boldsymbol{x}^{u}_{i}\}^{B_{u}}_{i=1}$ , they are first feed to the teacher network. The predictions of the teacher network will be used as pseudo labels $\{\boldsymbol{\tilde{y}}^{u}_{i}\}_{i=1}^{N_{u}}$ of unlabeled frames. To ensure quality, the pseudo labels with low classification scores will be removed. As the training goes on, the teacher network generates pseudo labels for the student, which facilitates the learning of the student. The student gradually updates the teacher via exponential moving average (EMA), which makes a better teacher. Therefore, the teacher-student framework is trained in a mutually-beneficial manner.

3.3 Gt Database and Pseudo Database

Gt database and pseudo database are the bases of our semi-sampling. They collect ground truth samples and pseudo samples cropped from labeled frames and unlabeled frames. When using semi-sampling, we will randomly select object samples from the two databases.

Gt Database.

Gt-sampling is a data augmentation method introduced by [35], which is widely used in outdoor 3D object detection. It uses ground truth bounding boxes to crops ground truth samples from labeled frames, and the cropped object samples will be collected to generate a gt database $\mathcal{G}=\{s_{i}^{l}\}_{i=1}^{N_{gt}}$ . $N_{gt}$ is the number of ground truth boxes in all labeled frames and $s_{i}^{l}$ denotes the $i^{th}$ ground truth sample. For $s_{i}^{l}$ , it consists of the corresponding point clouds ${p}_{i}^{l}$ in labeled frames and its ground truth bounding box ${b}_{i}^{l}$ . In the training stage, gt-sampling will randomly select ground truth samples from $\mathcal{G}$ and pastes them to the current frame. To prevent pasted samples from overlapping with existing objects in the current frame, collision detection is needed.

For the KITTI dataset, we directly follow existing outdoor 3D object-detecting methods to build a gt database. For indoor scenes, as far as we know, we are the first to use gt-sampling. There are some differences between outdoor and indoor scenes. In indoor scenes, 3D object bounding boxes may overlap with each other, such as a chair under a table. Using ground truth bounding boxes to crop the gt samples will involve points from other objects, resulting in a noisy gt database. For the ScanNet[4] dataset, as it provides an instance mask for each ground truth object, we can use instance masks instead of gt boxes to crop gt samples. In this way, only points belonging to the instance are cropped. For the SUN-RGBD[25] dataset, there is no instance mask for each object, so we have to use ground truth bounding boxes to crop gt samples. As a result, the gt database of SUN-RGBD is not clean, which will produce unreal scenes when performing gt-sampling. In the experiments on SUN-RGBD, we find that gt-sampling can improve model performance at the early training epochs. However, we observe accuracy degradation at the end of training. We address this problem by using the fade strategy [29], which disables gt-sampling when the model is near convergent. Please refer to Table 6 for more details.

Pseudo Database.

To build a pseudo database, we firstly pre-train a backbone detector on labeled frames $\{\boldsymbol{x}^{l}_{i},\boldsymbol{y}^{l}_{i}\}_{i=1}^{N_{l}}$ . Then, the pre-trained detector is used to forward unlabeled frames $\{\boldsymbol{x}^{u}_{i}\}_{i=1}^{N_{u}}$ . The predictions are used as pseudo labels $\{\boldsymbol{\tilde{y}}^{u}_{i}\}_{i=1}^{N_{u}}$ . Different from the gt database, we use pseudo labels to crop pseudo samples from unlabeled frames. With these pseudo samples, we can generate a pseudo database $\mathcal{P}=\{s_{i}^{u}\}_{i=1}^{N_{pse}}$ , where $N_{pse}$ is the number of all pseudo samples and $s_{i}^{u}$ denotes the $i^{th}$ pseudo sample. For $s_{i}^{u}$ , it consists of corresponding point clouds ${p}_{i}^{u}$ in unlabeled frames, predicted bounding box ${b}_{i}^{pred}$ , and predicted classification score ${c}_{i}^{pred}$ . Since there are many low-quality pseudo labels, when using the pseudo database, we will filter out high-quality pseudo samples by their scores.

3.4 Semi-Sampling

Current data augmentation methods are mainly designed for labeled data, but data augmentation for unlabeled data is less investigated. Moreover, unlabeled data can provide a large number of unseen objects which can be used for object sampling. Therefore, we propose semi-sampling to diversify labeled and unlabeled data and make full use of object samples from unlabeled data. As illustrated in Figure 1, our semi-sampling is composed of four object sampling strategies: gt-sampling on labeled frames, gt-sampling on unlabeled frames, pseudo-sampling on labeled frames, and pseudo-sampling on unlabeled frames. Here pseudo-sampling is sampling pseudo samples instead of gt samples compared to gt-sampling. Our semi-sampling method is a superset of the vanilla gt-sampling[35] that only pastes gt samples to labeled frames.

Semi-Sampling on Labeled Frames.

For gt-sampling on labeled frames, we follow the vanilla gt-sampling [35]. For pseudo-sampling on labeled frames, we randomly select some pseudo samples along with their pseudo labels from the pseudo database. The pseudo labels of selected pseudo samples serve two purposes. Firstly, they are used to perform collision detection between pseudo samples and ground truth objects on labeled frames. Secondly, we use them as labels of pseudo samples in labeled frames to supervise the student network at the training time, which requires the pseudo-label to be of high quality. Therefore, we set a pseudo label score threshold $\tau_{pseudo\_sample}$ . When performing pseudo-sampling, we only use pseudo samples whose pseudo label scores $c>\tau_{pseudo\_sample}$ .

Semi-Sampling on Unlabeled Frames.

When employing semi-sampling on unlabeled frames, there are no ground truth bounding boxes on unlabeled frames for collision detection with pasted samples. Considering that pseudo labels can provide approximate localization of foreground objects in unlabeled frames, we directly use them for collision detection. However, the massive pseudo labels may take up quite a lot of space. So we set a threshold $\tau_{unlabeled\_frame}$ to filter the pseudo labels in unlabeled frames with pseudo label scores $c>\tau_{unlabeled\_frame}$ . It is worth noting that the ground truth labels and pseudo labels of the pasted samples on the unlabeled frames will not be used to supervise the student network. All supervising information of unlabeled frames is produced by the teacher network online.

Category Shuffling.

When performing semi-sampling, we select several samples for each category from the gt database or pseudo database. However, indoor scenes are crowded and there are many categories, such as 18 for ScanNet and 10 for SUN-RGBD. If we choose samples according to a fixed category queue like outdoor methods, there may be no space for the samples of the category at the end of the queue, which will cause inter-class sample imbalance. To solve the issue, we simply shuffle the category queue every time we use semi-sampling.

4 Experiments

4.1 Datasets and Evaluation Metrics

ScanNet and SUN-RGBD.

For indoor scenes, we evaluate our method on ScanNet[4] and SUN-RGBD[25]. ScanNet is an indoor dataset consisting of 1,513 reconstructed meshes, among which 1,201 are training samples and the rest are validation samples. SUN-RGBD contains 10,335 RGB-D images of indoor scenes, which are split into 5,285 training samples and 5,050 validation samples. We follow a standard evaluation protocol [15] by using mean Average Precision (mAP) under iou thresholds of 0.25 and 0.50.

KITTI.

For outdoor scenes, we use KITTI[5] for evaluation. KITTI is a popular dataset for autonomous driving, which consists of fine annotations for 3D object detection. There are 7481 outdoor scenes for training and 7518 for testing. The training samples are generally divided into a training split of 3712 samples and a validation split of 3769 samples. We report the mAP with 40 recall positions, with a rotated IoU threshold of 0.7, 0.5, and 0.5 for the three classes, car, pedestrian, and cyclist, respectively.

Omni-KITTI.

Omni-supervised learning[17] is a special regime of semi-supervised learning, which uses all labeled data as well as large-scale unlabeled data. It is lower-bounded by the accuracy of training on all labeled data and it aims to boost the fully-supervised methods. To validate the effectiveness of our method under the omni-supervised setting, we construct the Omni-KITTI dataset with KITTI raw data. Specifically, KITTI raw data consists of many sequences containing all frames that emerge on the KITTI training and validation set. Therefore, we can use the KITTI training set as all labeled data (including 3712 samples) and take the remaining data that excludes KITTI validation sequences as unlabeled data (including 33507 samples). The final performance is evaluated on the KITTI validation set with the same evaluation protocol as the KITTI dataset.

10%

20%

100%

Dataset

Model

mAP

@0.25

mAP

@0.5

mAP

@0.25

mAP

@0.5

mAP

@0.25

mAP

@0.5

mAP

@0.25

mAP

@0.5

ScanNet

VoteNet[15]

27.9

10.8

36.9

18.2

46.9

27.5

57.8

36.0

SESS[37]

\backslash

\backslash

39.7

18.6

47.9

26.9

62.1

38.8

3DIoUMatch[30]

40.0

22.5

47.2

28.3

52.8

35.2

62.9

42.1

Semi-Sampling

41.1

27.7

50.3

34.7

54.4

38.9

64.6

47.7

improvements

+1.1

+5.2

+3.1

+6.4

+1.6

+3.7

+1.7

+5.6

SUN-RGBD

VoteNet[15]

29.9

10.5

38.9

17.2

45.7

22.5

58.0

33.4

SESS[37]

\backslash

\backslash

42.9

14.4

47.9

20.6

61.1

37.3

3DIoUMatch[30]

39.0

21.1

45.5

28.8

49.7

30.9

61.5

41.3

Semi-Sampling

40.1

24.3

50.0

31.5

54.4

34.5

63.2

45.7

improvements

+1.1

+3.2

+4.5

+2.7

+4.7

+3.6

+1.7

+4.4

Table 1: Comparison between Semi-Sampling and state-of-the-arts on ScanNet and SUN-RGBD val set under different ratios of labeled data. Following prior works, we also report the results with 100% labeled data, where the unlabeled data is a copy of the full dataset.

	1%			2%			5%
Model	Car	Pedestrian	Cyclist	Car	Pedestrian	Cyclist	Car	Pedestrian	Cyclist
PV-RCNN[22]	75.0	16.1	39.2	78.5	38.2	75.0	80.0	52.8	56.9
3DIoUMatch[30]	75.1	17.4	44.1	77.9	33.4	52.4	82.0	52.6	64.2
Semi-Sampling	78.6	24.1	58.2	81.9	47.5	63.3	82.3	54.9	71.1
improvements	+3.5	+6.7	+14.1	+4.0	+14.1	+10.9	+0.3	+2.3	+6.9

Table 2: Comparison between PV-RCNN, 3DIoUMatch and Semi-Sampling on KITTI val set under different ratios of labeled data. The results are evaluated with the AP calculated by 40 recall positions for car, pedestrian and cyclist classes. All the models in the table are trained with the same data split under each ratio.

4.2 Implementation Details

ScanNet and SUN-RGBD.

We re-implement the indoor 3DIoUMatch [30] on the OpenPCDet[28], achieving similar results as the official repo. Then we use the 3DIoUMatch as our baseline. In 3DIoUMatch, the backbone detector is an IoU-aware VoteNet which is a VoteNet equipped with a 3D IoU estimation module. Our backbone detector is also the same as 3DIoUMatch. All the experiments are trained on 4 GPUs. At pre-training stage, we train the IoU-aware VoteNet with a batch size of 8 and use the same data augmentation as 3DIoUMatch[30]. The IoU-aware VoteNet is optimized by an adam optimizer with an initial learning rate of 0.008 for 900 epochs. The learning rate is decayed by a factor of 0.1 at the $400^{th}$ , $600^{th}$ , $800^{th}$ epoch. The pre-trained model is not only used for the initialization of the training stage but also used to generate pseudo labels for unlabeled data and build a pseudo database. At the training stage, we train our method with a batch size of $\{4+8\}$ , where 4 is the number of labeled frames and 8 is the number of unlabeled frames. For the ScanNet dataset, our model is optimized by an adam optimizer with an initial learning rate of 0.004 for 1000 epochs. We decay the learning rate by 0.3, 0.3, 0.1, 0.1 at the $400^{th}$ , $600^{th}$ , $800^{th}$ , $900^{th}$ epoch, following the training strategy of 3DIoUMatch. For the SUN-RGBD dataset, the training is time-consuming. To speed up the convergence, we train our model for 300 epochs with an adam one-cycle optimizer and a maximum learning rate of 0.008. At the inference stage, we use the student IoU-aware VoteNet to forward input data.

KITTI.

There is an official implementation of outdoor 3DIoUMatch on OpenPCDet. We can directly use it as our baseline. We employ PV-RCNN as our backbone detector following 3DIoUMatch. At pre-training stage, the backbone detector is optimized by an adam one-cycle optimizer with the batch size 8, learning rate 0.01 for 80 epochs. For all ratios of labeled data, we repeat the labeled data 10 times. At the training stage, we repeat the labeled data 5 times and train the model for 60 epochs with the same optimizer and learning rate as the pre-training stage. For the experiments under 1% and 2% labeled data, we set $\tau_{unlabeled\_frame}$ = 0.5 to filter abundant pseudo labels on unlabeled frames for collision detection. The $\tau_{pseudo\_sample}$ , which is used for selecting pseudo samples for pseudo-sampling, is set to 0.8, 0.7, 0.7 for car, pedestrian and cyclist, respectively. For the experiments under 5% labeled data, we use higher $\tau_{unlabeled\_frame}$ and $\tau_{pseudo\_sample}$ . Because with more labeled data, the pre-trained model will produce more high-quality pseudo labels, which usually have higher scores. In practice, we set $\tau_{unlabeled\_frame}$ to 0.8 and $\tau_{pseudo\_sample}$ to 0.9, 0.8, 0.8 for car, pedestrian and cyclist. At the inference stage, we use the student PV-RCNN network to forward input data.

4.3 Comparison with State-of-the-Arts

ScanNet and SUN-RGBD.

We compare our methods with previous state-of-the-art works SESS[37] and 3DIoUMatch[30] on ScanNet and SUN-RGBD dataset under 5%, 10%, 20% and 100% labeled data. Our method outperforms the current state-of-the-art, 3DIoUMatch[30], under all ratios of labeled data and all listed datasets. For example, with the same backbone detector and the same split of 10% labeled data, our approach achieves 50.3 [email protected] and 34.7 [email protected] on ScanNet, outperforming previous best results by 3.1 and 6.4 points. For the SUN-RGBD dataset, we also attain decent results. With the same backbone detector and split of 5% labeled data, we improve 3DIoUMatch by 1.1 and 3.2 points in terms of [email protected] and [email protected]. When utilizing 100% labeled data, there are still gains on two datasets, suggesting that semi-supervised learning is powerful with only labeled data. It should be noted that because the gt and pseudo databases of the SUN-RGBD dataset are noisy, we do not use semi-sampling when training 10%, 20% and 100% SUN-RGBD. Instead, we perform gt-sampling with the fade strategy[29] on the pre-trained model. The good pre-trained models can also benefit the semi-supervised learning on 10%, 20% and 100% SUN-RGBD, achieving promising results.

KITTI.

We also provide a comparison with 3DIoUMatch on the outdoor dataset KITTI. As shown in Table 2, our semi-sampling method yields significant improvement. In particular, we outperform 3DIoUMatch by 3.5 mAP, 6.7 mAP and 14.1 mAP on car, pedestrian and cyclist classes under 1% labeled data. With only 5% labeled data, the performance of our method is close to the methods trained with 100% labeled data, as can be seen in Table 8.

4.4 Ablation Study

In this section, we conduct extensive ablation experiments to analyze the effect of our semi-sampling on three widely used 3D object detection datasets.

Effect of Semi-Sampling on KITTI.

We first ablate the effectiveness of each object sampling strategy of semi-sampling on the KITTI dataset. Our semi-sampling consists of gt-sampling on labeled frames, gt-sampling on unlabeled frames, pseudo-sampling on labeled frames, and pseudo-sampling on unlabeled frames. In outdoor scenes, the effectiveness of gt-sampling on labeled frames has been verified by a lot of prior works, so we use it by default. All experiments in Table 3 use the same pre-trained weight. In Table 3, experiment (a) is our baseline, namely 3DIoUMatch. Experiment (b) suggests that the pseudo-sampling on labeled frames can improve our baseline by a large margin. Further,

Exp.	Labeled Frame		Unlabeled Frame		KITTI 1%
Exp.	gt.	pseudo.	gt.	pseudo.	Car	Pedestrian	Cyclist
(a)	✓				75.1	17.4	44.1
(b)	✓	✓			77.6	20.3	56.0
(c)	✓	✓	✓		77.3	17.1	55.6
(d)	✓	✓		✓	78.5	21.0	57.8
(e)	✓	✓	✓	✓	78.6	24.1	58.2

Table 3: Ablation of semi-sampling on KITTI 1% labeled data. The results are for moderate difficulty level evaluated by the 3D mAP with 40 recall positions. “gt.” and “pseudo.” represent gt-sampling and pseudo-sampling.

Exp.	Labeled Frame		Unlabeled Frame		ScanNet 10%
Exp.	gt.	pseudo.	gt.	pseudo.	[email protected]	[email protected]
(a)					47.3	29.6
(b)	✓				48.3	32.7
(c)	✓	✓			47.7	31.7
(d)	✓		✓		50.3	34.7
(e)	✓			✓	48.2	33.1
(f)		✓			47.6	30.4
(g)			✓		46.7	29.1
(h)				✓	46.4	28.5

Table 4: Ablation study of semi-sampling on ScanNet 10% labeled data. We use the 10% data split provided by 3DIoUMatch as labeled data for a fair comparison.

Exp.	Labeled Frame		Unlabeled Frame		SUN-RGBD 5%
Exp.	gt.	pseudo.	gt.	pseudo.	[email protected]	[email protected]
(a)					39.4	23.0
(b)	✓				40.0	23.5
(c)	✓	✓			40.0	22.8
(d)	✓		✓		40.1	24.3
(e)	✓			✓	39.1	24.2
(f)		✓			39.4	23.9
(g)			✓		39.2	22.9
(h)				✓	40.0	22.9

Table 5: Ablation study of semi-sampling on SUN-RGBD 5% labeled data. We use the 5% data split provided by 3DIoUMatch as labeled data for a fair comparison.

experiment (d) obtains 0.9, 0.7, 1.8 improvements on car, pedestrian and cyclist compared to experiment (b), demonstrating the effectiveness of pseudo sampling on unlabeled frames. Interestingly, gt sampling on unlabeled frames leads to a decrease to experiment (b) as seen in experiment (c), while leading to an increase to experiment (d) as seen in experiment (e). Overall, when equipped with all object sampling strategies in our semi-sampling, we achieve the best performance, which indicates that it is helpful to augment both labeled and unlabeled data with the rich object samples from both of them.

Effect of Semi-Sampling on ScanNet and SUN-RGBD.

In order to be consistent with 3DIoUMatch, we ablate our method on ScanNet with 10% labeled data and on SUN-RGDB with 5% labeled data. We use pre-trained weights provided by 3DIoUMatch on ScanNet 10% and SUN-RGBD 5% to initialize our model in Table 4 and Table 5 for a fair comparison. And we use 3DIoUMatch as our baseline, corresponding to experiment (a) in both tables. Experiment (b) in Table 4 and Table 5 demonstrate the effectiveness of gt-sampling on labeled frames. Experiment (f)-(h) in Table 4 suggest that pseudo-sampling on labeled frames is also helpful while gt-sampling or pseudo-sampling on unlabeled frames can not lead to improvement on ScanNet. By contrast, Experiment (f)-(h) in Table 5 shows that pseudo-sampling on labeled and unlabeled frames benefits our baseline while gt-sampling on unlabeled frames can not bring improvement on SUN-RGBD. Overall, the best performance on both datasets is achieved when using gt-sampling on labeled and unlabeled frames.

ScanNet 100%

SUN-RGBD 100%

Exp.

Gt-sampling

Fade Strategy

mAP

@0.25

mAP

@0.5

mAP

@0.25

mAP

@0.5

(a)

70.2

54.2

64.1

47.2

(b)

✓

71.1

56.4

61.0

46.4

(c)

✓

from

20^{th}

epoch

70.7

56.0

64.6

48.1

(d)

✓

from

30^{th}

epoch

70.2

55.5

65.1

48.5

(e)

✓

from

40^{th}

epoch

71.1

55.7

64.9

47.5

Table 6: Ablation of gt-sampling on RBGNet. The total epoch of RBGNet is 64. “from

20^{th}

epoch” means that we disable gt-sampling after the

20^{th}

epochs, and so on.

Effect of Gt-Sampling on Fully-Supervised Method.

As discussed in Section 3.3, gt-sampling is a very useful data augmentation method in outdoor scenes while there is no investigation on it in indoor scenes. Here we conduct an experiment to study the effectiveness of gt-sampling in indoor scenes. To provide more value to the community, we use a fully-supervised state-of-the-art method RBGNet[31] as our baseline. As shown in experiment (b) of Table 6, gt-sampling boosts the strong baseline by 0.9 [email protected] and 2.2 [email protected] on ScanNet, achieving a new state-of-the-art on ScanNet. However, there is a decrease on SUN-RGBD. As depicted in Figure 2, in the early stage, gt-sampling is actually beneficial to the model performance. However, the performance degrades at the end compared to the baseline. We attribute the phenomenon to the noisy gt database of SUN-RGBD as mentioned in Section 3.3. In the early epochs, the model is not well-trained, and gt-sampling can augment the training data largely. In the late epochs, cleaner data are needed to refine the well-trained model, so the noisy gt samples become harmful for training. To make use of gt-sampling on SUN-RGBD, we use the fade strategy [29], which disables gt-sampling when the model is near convergent. We use three different epochs to study the best time to fade gt-sampling. As shown in experiments (c)-(e), when fading from the $30^{th}$ epoch, we get the best result and achieve a new state-of-the-art on SUN-RGBD.

	3D mAP
$\tau_{unlabeled\_frame}$	Car	Pedestrian	Cyclist
0.3	78.4	19.2	57.4
0.5	78.6	24.1	58.2
0.7	78.0	20.0	57.0

Table 7: Further study on the influence of the threshold

\tau_{unlabeled\_frame}

on KITTI 1% labeled data. The results are for moderate difficulty evaluated by the mAP with 40 recall positions.

Pseudo Label Threshold on Unlabeled Frames.

When performing semi-sampling on unlabeled frames, we use pseudo labels of unlabeled frames to do collision detection with the pasted samples. To remove low-quality pseudo labels in unlabeled frames, we set a score threshold $\tau_{unlabeled\_frame}$ to filter them. A low threshold will leave too many abundant pseudo labels, making there no space on unlabeled frames to place object samples. A high threshold will filter pseudo labels of some foreground objects on unlabeled frames. Without pseudo labels for collision detection, the pasted samples may overlap with them. In Table 7, we study three different thresholds, and the experiment shows that 0.5 is a good choice for our method.

Performance on Omni-KITTI.

		Omni-KITTI
Method	Car	Pedestrian	Cyclist
PV-RCNN[22]	84.4	54.5	70.4
PV-RCNN + 3DIoUMatch	84.3	55.1	71.9
PV-RCNN + Semi-Sampling	85.2	64.2	74.3
CT3D[21]	85.0	55.6	71.9
CT3D + Semi-Sampling	85.8	64.8	72.8

Table 8: Further study on KITTI 100% labeled data with Omni-KITTI as unlabeled data.

To further verify the effectiveness of our method under the omni-supervised setting, we evaluate our method on the Omni-KITTI dataset. We use the pre-train weight provided by PV-RCNN to initialize 3DIoUMatch and our method. Table 8 summarize the results. Our method pushes the fully-supervised PV-RCNN to a new level. Specifically, we outperform PV-RCNN by 0.8 mAP, 9.7 mAP and 3.9 mAP on car, pedestrian and cyclist classes. For 3DIoUMatch, its improvement on PV-RCNN is relatively small. In addition, we use another backbone detector CT3D[21] to validate our method. We use the pre-train weight provided by CT3D as initialization. As shown in Table 8, the improvement is also considerable, especially on the pedestrian class.

4.5 Qualitative Results and Analysis

Here we provide the qualitative comparison between our method and 3DIoUMatch on ScanNet and KITTI validation set. For better visualization on ScanNet, we filter out the predictions whose classification scores $s>0.25$ . As shown in Figure 3 and Figure 4, our method can suppress false positives greatly. Moreover, our method gives more accurate predictions and recalls more challenging objects, such as the chairs under the table, which confirms the effectiveness of our semi-sampling again. More qualitative results are provided in the supplementary material.

5 Conclusion

In this paper, we propose semi-sampling, a simple but effective data augmentation method for semi-supervised 3D object detection. In general, we perform object sampling on both labeled and unlabeled frames and utilize unlabeled data to augment labeled data. In addition, we contribute a new benchmark dataset setting Omni-KITTI for the study of omni-supervised learning. Experiments on the ScanNet, SUN-RGBD, KITTI and Omni-KITTI datasets demonstrate the effectiveness of our method.

References

[1] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. Advances in neural information processing systems, 27, 2014.
[2] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1090–1099, 2022.
[3] Benjamin Caine, Rebecca Roelofs, Vijay Vasudevan, Jiquan Ngiam, Yuning Chai, Zhifeng Chen, and Jonathon Shlens. Pseudo-labeling for scalable 3d object detection. arXiv preprint arXiv:2103.02093, 2021.
[4] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
[5] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[6] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. Consistency-based semi-supervised learning for object detection. In Advances in neural information processing systems, pages 10759–10768, 2019.
[7] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[8] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, 2013.
[9] Yingwei Li, Adams Wei Yu, Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Yifeng Lu, Denny Zhou, Quoc V Le, et al. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17182–17191, 2022.
[10] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480, 2021.
[11] Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-free 3d object detection via transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2949–2958, 2021.
[12] Ishan Misra, Rohit Girdhar, and Armand Joulin. An end-to-end transformer model for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2906–2917, 2021.
[13] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
[14] Jinhyung Park, Chenfeng Xu, Yiyang Zhou, Masayoshi Tomizuka, and Wei Zhan. Detmatch: Two teachers are better than one for joint 2d and 3d semi-supervised object detection. arXiv preprint arXiv:2203.09510, 2022.
[15] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9277–9286, 2019.
[16] Charles R Qi, Yin Zhou, Mahyar Najibi, Pei Sun, Khoa Vo, Boyang Deng, and Dragomir Anguelov. Offboard 3d object detection from point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6134–6144, 2021.
[17] Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni-supervised learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4119–4128, 2018.
[18] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[19] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. Fcaf3d: Fully convolutional anchor-free 3d object detection. arXiv preprint arXiv:2112.00322, 2021.
[20] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in neural information processing systems, pages 1163–1171, 2016.
[21] Hualian Sheng, Sijia Cai, Yuan Liu, Bing Deng, Jianqiang Huang, Xian-Sheng Hua, and Min-Jian Zhao. Improving 3d object detection with channel-wise transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2743–2752, 2021.
[22] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10529–10538, 2020.
[23] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.
[24] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
[25] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR, 2015.
[26] Peng Tang, Chetan Ramaiah, Yan Wang, Ran Xu, and Caiming Xiong. Proposal learning for semi-supervised object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2291–2301, 2021.
[27] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
[28] OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020.
[29] C. Wang, C. Ma, M. Zhu, and X. Yang. Pointaugmenting: Cross-modal augmentation for 3d object detection. In Computer Vision and Pattern Recognition, 2021.
[30] He Wang, Yezhen Cong, Or Litany, Yue Gao, and Leonidas J Guibas. 3dioumatch: Leveraging iou prediction for semi-supervised 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14615–14624, 2021.
[31] Haiyang Wang, Shaoshuai Shi, Ze Yang, Rongyao Fang, Qi Qian, Hongsheng Li, Bernt Schiele, and Liwei Wang. Rbgnet: Ray-based grouping for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1110–1119, 2022.
[32] Xiaopei Wu, Liang Peng, Honghui Yang, Liang Xie, Chenxi Huang, Chengqi Deng, Haifeng Liu, and Deng Cai. Sparse fuse dense: Towards high quality 3d detection with depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5418–5427, 2022.
[33] Hongyi Xu, Fengqi Liu, Qianyu Zhou, Jinkun Hao, Zhijie Cao, Zhengyang Feng, and Lizhuang Ma. Semi-supervised 3d object detection via adaptive pseudo-labeling. In 2021 IEEE International Conference on Image Processing (ICIP), pages 3183–3187. IEEE, 2021.
[34] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3060–3069, 2021.
[35] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[36] Fangyuan Zhang, Tianxiang Pan, and Bin Wang. Semi-supervised object detection with adaptive class-rebalancing self-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3252–3261, 2022.
[37] Na Zhao, Tat-Seng Chua, and Gim Hee Lee. Sess: Self-ensembling semi-supervised 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11079–11087, 2020.
[38] Wu Zheng, Weiliang Tang, Li Jiang, and Chi-Wing Fu. Se-ssd: Self-ensembling single-stage object detector from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14494–14503, 2021.