
DA-CIL: Towards Domain Adaptive Class-Incremental 3D Object Detection

The authors are grateful to the meta-reviewer and reviewers for their constructive comments and positive feedback. We are encouraged that the reviewers found the domain adaptive class-incremental scenario novel (R1) and critical (R4) in real-world environments, the work well-motivated (R1), clear (R4), well-written (R1, R4) and easy to follow (R1, R4), and the method novel (R1, R4), well-designed (R1), convincing (R1) and interesting (R3). All concerns are carefully considered and addressed below.

(MR, R1, R4) Comparison with more methods. The CIL method in [6] is designed for 3D object recognition and is hard to adapt to 3D detection, while the augmentation method in [7] is designed for 2D single-domain segmentation, which differs from our cross-domain scenario. To further validate the effectiveness of the proposed method, we have added the results of the general self-ensembling method [23] for UDA to the final version, i.e., 55.42% for CIL and 46.43% for DA-CIL. Due to the page limit, we will explore more baselines, such as [13, 26], self-training, and adversarial learning, for comparison in future work.
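For reference, self-ensembling in the mean-teacher style keeps a teacher model as an exponential moving average (EMA) of the student. A minimal sketch of such an update (the decay rate `alpha` and the function name are illustrative assumptions, not taken from [23] or our paper):

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               alpha: float = 0.999) -> None:
    """Self-ensembling: move each teacher weight toward the corresponding
    student weight, so the teacher stays an EMA of the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```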

(MR, R1) Evaluation of robustness. We evaluated the influence of the amount of augmented data by reporting the model performance with different maximum numbers of augmented objects for cross-domain copy-paste, i.e., 47.60% (1-2 objects), 47.08% (1-3 objects), and 46.86% (1-4 objects), which shows that the model is not sensitive to these changes. We will add the sensitivity analysis and revise the claim of robustness in the final version.
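As a minimal sketch of how the maximum object count enters cross-domain copy-paste (all names and the random placement strategy are illustrative assumptions; collision checks and ground alignment are omitted):

```python
import numpy as np

def cross_domain_copy_paste(scene_points, object_bank, max_objects=2, rng=None):
    """Paste a random number (1..max_objects) of source-domain object point
    clouds into a target-domain scene point cloud of shape (P, 3)."""
    rng = rng or np.random.default_rng()
    n = int(rng.integers(1, max_objects + 1))
    picks = rng.choice(len(object_bank), size=n, replace=False)
    xy_min = scene_points[:, :2].min(axis=0)
    xy_max = scene_points[:, :2].max(axis=0)
    for i in picks:
        # Drop the object at a random horizontal location within the scene.
        shift = np.append(rng.uniform(xy_min, xy_max), 0.0)
        scene_points = np.concatenate([scene_points, object_bank[i] + shift], axis=0)
    return scene_points
```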

(R1) Clarification on the loss function. In Eq. 5, there is no ordering constraint between M (the number of proposals from the in-domain teacher) and N (the number of proposals from the student). The loss function is averaged over M proposal pairs, obtained by selecting, for each in-domain teacher proposal, the closest student proposal based on the minimum Euclidean distance between their centers. Multiple in-domain proposals can be paired with the same student proposal, so M proposal pairs can always be formed for loss computation. We will add more explanation in the final version.
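A minimal sketch of this pairing, assuming proposal centers are given as (M, 3) and (N, 3) tensors and using a placeholder MSE term for the per-pair loss:

```python
import torch
import torch.nn.functional as F

def pair_by_center_distance(teacher_centers: torch.Tensor,
                            student_centers: torch.Tensor) -> torch.Tensor:
    """For each of the M teacher proposals, return the index of the closest
    of the N student proposals; indices may repeat."""
    dists = torch.cdist(teacher_centers, student_centers)  # (M, N) distances
    return dists.argmin(dim=1)                             # (M,) indices

def paired_consistency_loss(teacher_feats, student_feats,
                            teacher_centers, student_centers):
    """Average a per-pair loss over the M teacher-student proposal pairs."""
    idx = pair_by_center_distance(teacher_centers, student_centers)
    return F.mse_loss(student_feats[idx], teacher_feats)
```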

(R1) Model size. All methods used the same detection backbone as SDCoT, with 10.9 MB in model size and 0.94M parameters, which will be added in the final version.

(R2) Lack of practical application scenarios. Our approach is broadly applicable to various applications, e.g., autonomous driving, robotics, and interior design. For example, when autonomous driving models developed in one domain are applied to a new domain with different distributions (e.g., new roads and different lighting conditions) and novel classes (e.g., local vehicles and animals), the proposed framework can effectively tackle both domain shift and catastrophic forgetting. We will add more descriptions of practical application scenarios in the final version.

(R2) mAP@0.5 should be reported. For the CIL scenario, mAP@0.5 is 30.36% and 32.16% for SDCoT and our method, respectively. For CIL under domain shift, mAP@0.5 is 21.19% and 22.39% for SDCoT and our method, respectively. We will provide the mAP@0.5 detection performance of different methods in the final version.

(R2) Inference time. All methods used the same detection backbone, with an inference time of 0.2370 s per point cloud, which will be included in the final version.

(MR, R3) Evaluation on different domain gaps. The datasets used in our work include various scenes and scenarios, such as bathrooms, bedrooms, and offices. Figure 1 shows the difference between the two datasets in the distribution of object dimensions, indicating a serious domain gap. Due to the page limit, we plan to explore more scenarios with different domain shifts, such as geography-to-geography, day-to-night, and simulation-to-reality, to further verify the effectiveness of the proposed method in future work.

(MR, R4) Ablation study. To validate the contribution of each component in our method, we report the detection performance (mAP@0.25) of 43.57% (baseline), 46.60% (only in-domain CP), 46.86% (in-domain CP and cross-domain CP), and 47.60% (Ours). We will revise the ablation study in the final version.

(MR, R4) Clarification on color information. No color information is used in our model training; we only use color for better visualization with MeshLab (https://www.meshlab.net/). We followed SDCoT and modified VoteNet to align proposals from the teacher and student models and facilitate class-incremental learning, without leveraging color information. We will add more description of the modified VoteNet in the final version. VoteNet has already been implemented on the KITTI benchmark (https://github.com/qiqihaer/votenet-kitti), and therefore our method based on VoteNet is applicable to outdoor 3D object datasets such as KITTI. We plan to explore our method in the domain adaptive class-incremental scenario on outdoor datasets, such as the KITTI, Waymo Open Dataset, and nuScenes datasets used in [13], in future work, since both domain shift and catastrophic forgetting commonly exist in outdoor autonomous driving datasets.

(R1, R3) Typos and verbosity. We have corrected the typos and revised the first two listed contributions for clarity.

(MR) Are all digits significant in Tables? To show the significance of our method over the baseline SDCoT under the domain adaptive class-incremental setting, we conducted a one-tailed paired t-test on the per-class detection performance of SDCoT and our method, with the significance level set to 0.05. The p-value is 0.0444 < 0.05, which confirms that the improvements over SDCoT are significant.
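A minimal sketch of this test with SciPy (the per-class AP values below are placeholders for illustration only, not the paper's numbers):

```python
from scipy import stats

# Placeholder per-class AP values; the real values come from the paper's tables.
ap_sdcot = [41.2, 38.5, 52.7, 44.9, 39.8, 47.3, 35.1, 48.6]
ap_ours  = [43.0, 39.9, 53.8, 46.7, 41.2, 48.9, 36.0, 50.1]

# One-tailed paired t-test: H1 is that our per-class AP exceeds SDCoT's.
t_stat, p_value = stats.ttest_rel(ap_ours, ap_sdcot, alternative="greater")
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")  # significant if p < 0.05
```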