
I3DOD: Towards Incremental 3D Object Detection via Prompting

Wenqi Liang1,2,3,†, Gan Sun1,2,∗, Chenxi Liu1,2,3,†, Jiahua Dong1,2,3 and Kangru Wang4
{liangwenqi0123,sungan1412,liuchenxi0101,dongjiahua1995}@gmail.com  [email protected]
1State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, 110016, China; 2Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, 110169, China; 3University of Chinese Academy of Sciences, Beijing, 100049, China; 4Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, 200050, China. The corresponding author is Prof. Gan Sun. These authors contributed equally to this work. This work is supported by the National Natural Science Foundation of China under Grants 62003336 and 62273333, the CAS Youth Innovation Promotion Association Scholarship under Grant 2023207, and the State Key Laboratory of Robotics (2022-Z06).
Abstract

3D object detection has achieved significant performance in many fields, e.g., robotic systems, autonomous driving, and augmented reality. However, most existing methods suffer from catastrophic forgetting of old classes when applied to class-incremental scenarios. Meanwhile, current class-incremental 3D object detection methods neglect the relationship between the object localization information and the category semantic information, and assume that all the knowledge of the old model is reliable. To address the above challenges, we present a novel Incremental 3D Object Detection framework with the guidance of prompting, i.e., I3DOD. Specifically, we propose a task-shared prompt mechanism to learn the matching relationship between the object localization information and the category semantic information. After training on the current task, these prompts are stored in our prompt pool and used to preserve the relationships of old classes in the next task. Moreover, we design a reliable distillation strategy that transfers knowledge from two aspects: a reliable dynamic distillation is developed to filter out negative knowledge and transfer reliable 3D knowledge to the new detection model, and a relation feature distillation is proposed to capture the relations among responses in the feature space and protect the plasticity of the model when learning novel 3D classes. In the end, we conduct comprehensive experiments on two benchmark datasets, and our method outperforms the state-of-the-art object detection methods by $0.6\%\sim 2.7\%$ in terms of mAP@0.25.

I INTRODUCTION

3D object detection has recently received widespread attention in the computer vision and robot perception fields; it is developed to localize and identify 3D objects in a scene. It plays a vital role in autonomous driving [1], robotic systems [2] and augmented reality [3]. Especially in robotic systems, 3D object detection can help the robot perform 3D scene understanding and object localization. For instance, robots can effectively carry out the tasks of object grasping, obstacle avoidance and context awareness with the assistance of 3D object detection.

However, current deep learning-based methods (e.g., [4] [5]) in 3D object detection are trained on fixed data, while real-world data changes constantly over time. For example, a domestic robot can detect common domestic garbage in an indoor scene and then grasp it to clean it up. When the owner brings garbage with a brand-new package that the robot has never seen before, one straightforward way for the robot to detect the new garbage is to fine-tune the model on the point cloud data of the new classes. However, this manner leads to the notorious catastrophic forgetting of old 3D objects. The other way is to adapt existing class-incremental 2D object detection methods to 3D scenes, e.g., Zhao et al. [6] propose the co-teaching method SDCoT to distill 3D knowledge from the old model. However, SDCoT neglects the object localization information by assuming that the way of transferring anti-forgetting knowledge for 2D objects is identical to that for 3D objects. As shown in Fig. 1, given an input point cloud, the object detection model can locate an object and predict its 3D bounding box and category information after capturing the center of the object from the high-level feature (i.e., red points). Failure to match this relationship between the object localization information and the category semantic information exacerbates catastrophic forgetting of old classes. Moreover, when performing knowledge distillation, current methods assume that all responses produced by the old model are positive for the new model, whereas Feng et al. [7] have shown that negative responses exist and can hurt the performance of the new model.

Figure 1: Qualitative detection results obtained from: (a) fine-tuning on the new 3D classes and (b) a jointly-training strategy on all the 3D classes, where the red points denote the votes generated by VoteNet [4], and yellow and blue in the output denote the bounding boxes of the old and new 3D classes, respectively.
Figure 2: Overview framework of our method I3DOD, whose network structure is based on VoteNet [4]. It mainly consists of a Prompt Guidance Block to learn the category space information, a Reliable Dynamic Distillation module to screen out the reliable 3D knowledge from the regression head, and a Relation Feature Distillation module to distill the spatial positional relation in feature space.

To address the above problems, we propose a novel Incremental 3D Object Detection framework with the guidance of prompting, i.e., I3DOD, which tackles the challenge of catastrophic forgetting from two aspects. To reinforce the above-mentioned relationship for each class, we explore a prompt-based method. Concretely, we initialize a set of prompts to capture the object localization information from the high-level semantic information. These prompts are stored in our prompt pool and updated while learning a series of tasks. When training on the next task, our model adopts the task-shared prompts to recover the above-mentioned relationships of old classes. Moreover, we apply a reliable distillation strategy to distill reliable 3D knowledge from the old model to the new model. The reliable dynamic distillation is designed to filter out the negative knowledge produced by the regression head of the old model. In addition, we develop the relation feature distillation to protect the plasticity of the new model and distill the distribution of responses in the feature space.

The main contributions of this paper are summarized as follows:

  • We propose a novel class-incremental 3D object detection method with the guidance of prompting, which can learn new 3D objects in a plastic manner while mitigating catastrophic forgetting.

  • A prompt guidance block is designed to store prompts that encode the comprehension of old category information, and to explore the relationship between the object localization information and the high-level semantic information in class-incremental 3D object detection.

  • A reliable distillation strategy is designed to address heterogeneous forgetting, which purifies the anti-forgetting 3D knowledge of the old model and protects the plasticity of the new model. Our method achieves significant performance on SUN RGB-D and ScanNet in comparison to the state-of-the-art.

II Related Work

II-A Class-Incremental Learning

Since the issue of class-incremental learning was raised more than 20 years ago [8], class-incremental learning [9, 10, 11] has received much attention in different fields. Following [12], we classify class-incremental learning methods into four types: regularization-based [13, 14], replay-based [15, 16], pseudo-sample generation [17, 18], and architecture-based [19, 20, 21]. To distill knowledge of old tasks, knowledge distillation is introduced into incremental learning by LwF [14]. Recently, DDE [22] demonstrates the importance of causal effects in resisting catastrophic forgetting and introduces them into knowledge distillation. In 2D object detection, the early method [23] first applies knowledge distillation on the Faster R-CNN architecture. Faster ILOD [24] further adopts an adaptive multi-network distillation loss that enhances the accuracy and efficiency of the RPN-based detection method. MVCD [25] extends the application of knowledge distillation to further retain knowledge of old tasks by preserving correlations at the channel, point and instance levels. Class-incremental learning for 3D object detection is still under development. In [6], SDCoT introduces knowledge distillation into 3D class-incremental object detection to guarantee the performance of the model on old classes. However, the above-mentioned methods neglect the object localization information and assume that all responses produced by the old model are reliable for the new model.

II-B Prompt-based learning

Prompt-based learning was first proposed in natural language processing to control model prediction outputs and save the cost of pre-training. L2P [26] introduces prompt-based learning into class-incremental learning to guide model prediction with less computational cost, as prompt-based learning is able to capture task-specific knowledge with a small number of additional parameters. In [27], DualPrompt further improves performance by adding complementary learning systems to prompt-based learning. However, difficulties still exist when extending prompt-based learning to 3D object detection.

III The Proposed Method

III-A Problem Definition and Method Overview

In class-incremental 3D object detection, the model observes a series of class-incremental learning tasks $\{\mathcal{T}^{t}\}_{t=1}^{T}$ and a data stream $\{\mathcal{D}^{t}\}_{t=1}^{T}$, where $T$ represents the total number of tasks. Following SDCoT, the training data $\mathcal{D}^{t}$ and the class set $\mathcal{C}^{t}$ are only available in the $t$-th task, i.e., $\mathcal{D}^{t}\cap(\cup_{i=1}^{t-1}\mathcal{D}^{i})=\emptyset$ and $\mathcal{C}^{t}\cap(\cup_{i=1}^{t-1}\mathcal{C}^{i})=\emptyset$. Given a stream $\{\mathcal{D}^{t},\mathcal{C}^{t}\}_{t=1}^{T}$, in the $t$-th task we aim to detect 3D objects from all current classes $\mathcal{C}^{t}$ and old classes $\cup_{i=1}^{t-1}\mathcal{C}^{i}$.

We present the overview of our I3DOD in Fig. 2. A core purpose of our method is to tackle catastrophic forgetting of previous knowledge when the training data $\cup_{i=1}^{t-1}\mathcal{D}^{i}$ is not available in the current task. Specifically, we denote the backbone and model of the current task as $\mathcal{F}^{t}$ and $\Phi^{t}$; the old backbone $\mathcal{F}^{t-1}$ and model $\Phi^{t-1}$ are learned from the last task. The head of each model is composed of a regression head $\mathcal{H}_{R}$ and a classification head $\mathcal{H}_{C}$. To tackle the forgetting of old classes without affecting the learning of new classes, we propose our solution from two aspects. On the one hand, we propose a prompt guidance block to record the relation between the category semantic information of each class and its object localization information, which guides the matching of the two pieces of information to alleviate catastrophic forgetting. On the other hand, we design a distillation scheme (i.e., $\mathcal{L}_{RFD}$, $\mathcal{L}_{RDD}$ and $\mathcal{L}_{Dis}$ in Fig. 2) to effectively extract knowledge from the old model.

III-B Prompt Guidance Learning

In class-incremental 3D object detection, SDCoT [6] introduces a co-teaching distillation to transfer knowledge from two teacher models to a student model. However, this scheme is common in 2D vision tasks and neglects the matching relationship between the object localization information and the high-level semantic information. As shown in Fig. 1, we find that a matching relationship between these two pieces of information for old classes is crucial for the new model to capture the center of an object and locate it for detection. Inspired by L2P [26] and DualPrompt [27], we propose a prompt guidance block to address the above issues from the comprehension of the two pieces of information.

As shown in Fig. 2, given an input point cloud $\mathbf{x}_{i}^{t}$, a feature vector $\mathbf{f}_{i}^{t}$ and a subset of the input points $\mathbf{z}_{i}^{t}$ are obtained from the output of $\mathcal{F}^{t}$:

$\{\mathbf{z}_{i}^{t},\mathbf{f}_{i}^{t}\}=\mathcal{F}^{t}(\mathbf{x}_{i}^{t}),$  (1)

where $\mathbf{x}_{i}^{t}$ and $\mathbf{z}_{i}^{t}\in\mathbb{R}^{3}$, and $\mathbf{f}_{i}^{t}\in\mathbb{R}^{M\times D}$.

To build the relationship between the high-level feature $\mathbf{f}_{i}^{t}$ and the centers of 3D objects, we introduce a prompt pool $\{\mathbf{p}_{n}\}_{n=1}^{N}$, where $N$ represents the total prompt quantity and $\mathbf{p}_{n}\in\mathbb{R}^{D}$. Inspired by the great success of transformers, we introduce a multi-head self-attention module as our prompting function to combine the selected task-shared prompts $\{\mathbf{p}_{s}\}_{s=1}^{S}$ with the high-level feature $\mathbf{f}_{i}^{t}$. After obtaining $\{\mathbf{p}_{s}\}_{s=1}^{S}$ and $\mathbf{f}_{i}^{t}$, we feed them into our multi-head prompting self-attention module. Specifically, we obtain $\mathbf{h}_{Q}$, $\mathbf{h}_{K}$, $\mathbf{h}_{V}$ by repeating $\mathbf{f}_{i}^{t}$, where $\mathbf{h}_{Q}=\mathbf{h}_{K}=\mathbf{h}_{V}$. In addition, we split the selected prompts into $\mathbf{p}_{K}$ and $\mathbf{p}_{V}$, where $\mathbf{p}_{K},\mathbf{p}_{V}\in\mathbb{R}^{S/2\times D}$. We concatenate $\mathbf{p}_{K}$, $\mathbf{p}_{V}$ with $\mathbf{h}_{K}$, $\mathbf{h}_{V}$, respectively, and execute self-attention in parallel over $H$ heads. In the end, we obtain the $h$-th self-attention map $\mathbf{A}^{h}$ as:

$\mathbf{A}^{h}=\sigma\left(\frac{\mathbf{h}_{Q}\mathbf{W}_{Q}^{h}([\mathbf{p}_{K};\mathbf{h}_{K}]\mathbf{W}_{K}^{h})^{\top}}{\sqrt{d}}\right)([\mathbf{p}_{V};\mathbf{h}_{V}]\mathbf{W}_{V}^{h}),  (2)

where $\mathbf{W}_{Q}^{h}$, $\mathbf{W}_{K}^{h}$, $\mathbf{W}_{V}^{h}$ are the weight matrices of the $h$-th self-attention head, $d$ is the dimension of each head, and $\sigma$ denotes the softmax function.
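To make the prompting function concrete, below is a minimal PyTorch sketch of the prompting self-attention in Eq. (2), written in the prefix style of L2P/DualPrompt: the task-shared prompts are split into key and value halves and prepended to the feature-derived keys and values. The class name, layer widths, and tensor layout are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedSelfAttention(nn.Module):
    """Sketch of the multi-head prompting self-attention in Eq. (2)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, feat: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # feat:    (B, M, D) high-level features f_i^t from the backbone
        # prompts: (S, D) task-shared prompts, S assumed even; first half -> p_K, second -> p_V
        B, M, D = feat.shape
        p_k, p_v = prompts.chunk(2, dim=0)                       # (S/2, D) each
        p_k = p_k.unsqueeze(0).expand(B, -1, -1)
        p_v = p_v.unsqueeze(0).expand(B, -1, -1)

        q = self.w_q(feat)                                        # h_Q W_Q
        k = self.w_k(torch.cat([p_k, feat], dim=1))               # [p_K; h_K] W_K
        v = self.w_v(torch.cat([p_v, feat], dim=1))               # [p_V; h_V] W_V

        def split(x):  # (B, L, D) -> (B, H, L, d)
            return x.view(B, x.size(1), self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, M, D)         # concatenate the H heads
        return out
```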

TABLE I: Comparison results on the SUN RGB-D dataset in terms of mAP@0.25 and Recall. For each setting (5+5, 7+3, 9+1), B, N and Avg. denote performance on the base classes, the novel classes, and their average; Imp. denotes the improvement (↑) or drop (↓) of our full method over that row.

mAP@0.25 (%)
Methods | 5+5: B / N / Avg. / Imp. | 7+3: B / N / Avg. / Imp. | 9+1: B / N / Avg. / Imp.
Freeze and add | 57.8 / 28.0 / 42.9 / ↑15.1 | 54.9 / 35.8 / 45.4 / ↑10.2 | 56.4 / 46.1 / 51.3 / ↓2.3
Fine-tuning | 57.8 / 30.5 / 44.2 / ↑13.8 | 54.9 / 21.2 / 38.1 / ↑17.5 | 56.4 / 5.6 / 31.0 / ↑18.0
SDCoT [6] | 57.8 / 56.3 / 57.1 / ↑0.9 | 54.9 / 53.8 / 54.4 / ↑1.2 | 56.4 / 38.6 / 47.5 / ↑1.5
Ours w/o PGB & RDD & RFD | 57.8 / 54.2 / 56.0 / ↑2.0 | 54.9 / 53.3 / 54.1 / ↑1.5 | 56.4 / 36.1 / 46.3 / ↑2.7
Ours w/o PGB & RDD | 57.8 / 55.5 / 56.7 / ↑1.3 | 54.9 / 54.3 / 54.6 / ↑1.0 | 56.4 / 38.9 / 47.7 / ↑1.3
Ours w/o PGB | 57.8 / 56.4 / 57.1 / ↑0.9 | 54.9 / 54.6 / 54.8 / ↑0.8 | 56.4 / 40.0 / 48.2 / ↑0.8
Ours | 58.7 / 57.2 / 58.0 / – | 56.3 / 54.8 / 55.6 / – | 56.7 / 41.3 / 49.0 / –

Recall (%)
Methods | 5+5: B / N / Avg. / Imp. | 7+3: B / N / Avg. / Imp. | 9+1: B / N / Avg. / Imp.
Freeze and add | 80.3 / 60.2 / 70.3 / ↑12.1 | 80.7 / 71.4 / 76.1 / ↑6.4 | 80.3 / 80.0 / 80.2 / ↑0.7
Fine-tuning | 80.3 / 65.9 / 73.1 / ↑9.3 | 80.7 / 59.1 / 69.9 / ↑12.6 | 80.3 / 30.7 / 55.5 / ↑22.0
SDCoT [6] | 80.3 / 81.3 / 79.4 / ↑3.0 | 80.7 / 79.4 / 80.1 / ↑2.4 | 80.3 / 72.7 / 76.5 / ↑4.4
Ours w/o PGB & RDD & RFD | 80.3 / 82.5 / 81.4 / ↑1.0 | 80.7 / 81.0 / 80.9 / ↑1.6 | 80.3 / 72.1 / 76.2 / ↑4.7
Ours w/o PGB & RDD | 80.3 / 81.0 / 80.7 / ↑1.7 | 80.7 / 80.2 / 80.5 / ↑2.0 | 80.3 / 72.9 / 76.6 / ↑4.3
Ours w/o PGB | 80.3 / 81.6 / 81.0 / ↑1.4 | 80.7 / 81.9 / 81.3 / ↑1.2 | 80.3 / 76.8 / 78.6 / ↑2.3
Ours | 80.4 / 84.6 / 82.4 / – | 82.5 / 82.6 / 82.5 / – | 84.9 / 76.9 / 80.9 / –

TABLE II: Comparison results on the ScanNet dataset in terms of mAP@0.25 and Recall. The column layout follows Table I, with incremental settings 9+9, 14+4 and 17+1.

mAP@0.25 (%)
Methods | 9+9: B / N / Avg. / Imp. | 14+4: B / N / Avg. / Imp. | 17+1: B / N / Avg. / Imp.
Freeze and add | 60.8 / 30.8 / 45.8 / ↑14.1 | 53.5 / 42.3 / 47.9 / ↑7.5 | 57.4 / 53.8 / 55.6 / ↑0.2
Fine-tuning | 60.8 / 27.0 / 43.7 / ↑16.2 | 53.5 / 15.7 / 34.6 / ↑20.8 | 57.4 / 1.0 / 29.2 / ↑26.6
SDCoT [6] | 60.8 / 54.3 / 57.6 / ↑2.3 | 53.5 / 55.0 / 54.3 / ↑1.1 | 57.4 / 53.0 / 55.2 / ↑0.6
Ours | 61.4 / 58.4 / 59.9 / – | 53.8 / 56.9 / 55.4 / – | 57.5 / 54.2 / 55.8 / –

Recall (%)
Methods | 9+9: B / N / Avg. / Imp. | 14+4: B / N / Avg. / Imp. | 17+1: B / N / Avg. / Imp.
Freeze and add | 80.3 / 55.0 / 67.7 / ↑11.1 | 76.8 / 69.2 / 73.0 / ↑2.1 | 79.6 / 77.1 / 78.4 / ↑0.2
Fine-tuning | 80.3 / 50.5 / 65.4 / ↑13.4 | 76.8 / 36.7 / 56.8 / ↑18.3 | 79.6 / 8.9 / 44.3 / ↑23.3
SDCoT [6] | 80.3 / 75.2 / 77.8 / ↑1.0 | 76.8 / 74.7 / 75.8 / ↑1.3 | 79.6 / 74.6 / 77.1 / ↑1.5
Ours | 80.6 / 76.9 / 78.8 / – | 76.9 / 77.2 / 77.1 / – | 79.8 / 77.3 / 78.6 / –

TABLE III: Per-class mAP@0.25 (%) on the SUN RGB-D dataset under the setting of 5+5.

Methods / Classes | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Avg. | Imp.
Freeze and add | 66.7 | 79.1 | 26.0 | 69.6 | 20.5 | 0.7 | 0.1 | 6.0 | 9.6 | 1.7 | 28.0 | ↑29.2
Fine-tuning | 0.2 | 2.2 | 1.0 | 2.4 | 16.0 | 28.5 | 54.5 | 62.5 | 52.3 | 85.8 | 30.5 | ↑26.7
SDCoT [6] | 68.1 | 83.4 | 26.5 | 71.3 | 20.1 | 32.6 | 58.4 | 63.1 | 52.0 | 87.1 | 56.3 | ↑0.9
Ours | 69.5 | 82.7 | 29.5 | 71.8 | 21.3 | 34.9 | 58.6 | 64.6 | 52.6 | 86.1 | 57.2 | –

We concatenate all the $\mathbf{A}^{h}$ and apply an MLP layer and a one-dimensional convolution layer to extract the center information of 3D objects. As mentioned above, we use prompts to learn the relationship between the high-level feature and the object localization information in class-incremental learning, and store them for the next task. The outputs of our prompt guidance block are given as follows:

$\dot{\mathbf{z}}_{i}^{t}=\mathbf{z}_{i}^{t}+\mathbf{MLP}(\mathbf{Concat}(\{\mathbf{A}^{h}\}_{h=1}^{H})),$  (3)
$\ddot{\mathbf{z}}_{i}^{t}=\mathbf{z}_{i}^{t}+\mathbf{Conv}(\dot{\mathbf{z}}_{i}^{t}).$  (4)

Overall, we have established the structure of the prompt guidance block. Following VoteNet [4], the outputs of the prompt guidance block are expected to locate the centers of objects. Therefore, to train our prompt guidance block, we use the object center as the supervisory label. The loss function is given below:

$\mathcal{L}_{PGB}=\frac{1}{B}\sum_{i=1}^{B}\|\ddot{\mathbf{z}}_{i}^{t}-\mathbf{c}_{i}^{t}\|_{1},$  (5)

where $\mathbf{c}_{i}^{t}$ represents the ground-truth object center.
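The following is a minimal sketch of how Eqs. (3)-(5) could be assembled on top of the prompting attention sketched above. The layer widths, the assumption that $\mathbf{z}_{i}^{t}$ holds per-seed xyz coordinates matched to per-seed ground-truth centers, and the averaging over points and coordinates in the loss are simplifications for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PromptGuidanceBlock(nn.Module):
    """Sketch of the prompt guidance block (Eqs. (3)-(4)); assumes the
    PromptedSelfAttention module defined in the previous sketch."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = PromptedSelfAttention(dim, num_heads)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))
        self.conv = nn.Conv1d(3, 3, kernel_size=1)   # 1-D conv over the refined points

    def forward(self, z: torch.Tensor, feat: torch.Tensor, prompts: torch.Tensor):
        # z: (B, M, 3) seed points z_i^t; feat: (B, M, D) features f_i^t; prompts: (S, D)
        a = self.attn(feat, prompts)                                    # concatenated heads
        z_dot = z + self.mlp(a)                                         # Eq. (3)
        z_ddot = z + self.conv(z_dot.transpose(1, 2)).transpose(1, 2)   # Eq. (4)
        return z_ddot

def pgb_loss(z_ddot: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    # Eq. (5): L1 distance to ground-truth centers, here averaged over the
    # batch, points, and coordinates for simplicity.
    return (z_ddot - centers).abs().mean()
```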

In addition, the prompts have learned the matching relationships of the categories in the current task. We build a prompt pool to store these prompts and reuse them to recover these relationships, thereby tackling catastrophic forgetting of old classes in the next task.

Algorithm 1 Optimization pipeline of our I3DOD.
0:  Input: the data stream $\{\mathcal{D}^{t}\}_{t=1}^{T}$ and the hyper-parameters $\{\alpha,\beta,\gamma,\xi,\zeta\}$. Randomly initialize $\Phi^{1}$ and $\{\mathbf{p}_{s}\}_{s=1}^{S}$.
1:  for $t=1,2,\cdots,T$ do
2:     Initialize the classifier $\mathcal{H}_{C}^{t}$;
3:     while not converged do
4:        if $t=1$ then
5:           Optimize $\Phi^{t}$ and $\{\mathbf{p}_{s}\}_{s=1}^{S}$ by $\mathcal{L}_{S}$;
6:        else
7:           Optimize $\Phi^{t}$ and $\{\mathbf{p}_{s}\}_{s=1}^{S}$ by Eq. (9);
8:        end if
9:     end while
10:    Store $\{\mathbf{p}_{s}\}_{s=1}^{S}$ in the prompt pool and $\Phi^{t}$ as $\Phi^{t-1}$.
11: end for

III-C Reliable Dynamic Distillation

In class-incremental 2D or 3D object detection, performing bounding box distillation between the old and new detection models is difficult. Given a sample that does not contain old classes, the regression head of the old model still generates bounding boxes with low confidence, so there is too much noise in the responses of the regression head when the old model predicts on the training data of the current task. Inspired by [7], we divide all the responses into reliable and unreliable responses. Transferring unreliable responses from the old model to the new model hinders learning of the new classes and accelerates catastrophic forgetting of the old classes. For this reason, the current method SDCoT [6] does not adopt a knowledge distillation strategy on bounding boxes.

To tackle the above problems and effectively transfer the 3D knowledge of the old model, we propose a reliable dynamic distillation module to dynamically screen out the reliable 3D knowledge from the regression head. As shown in Fig. 2, through the forward calculation of $\Phi^{t}$, we obtain classification scores $\{S_{i}^{t}\}_{i=1}^{N}$ and bounding boxes $\{B_{i}^{t}\}_{i=1}^{N}$, where $N$ denotes the total number of proposals generated by the object proposal generation module. Specifically, we take the classification scores of $\Phi^{t}$ as the confidence score of each proposal, and select the subset of $\{S_{i}^{t}\}_{i=1}^{N}$ whose top-1 classification score falls in the old classes. We then compute the average $\mu$ and the standard deviation of all selected classification scores. The final threshold value $\tau$ is calculated as:

$\tau=\mu+\zeta\cdot\sqrt{\frac{\sum_{i=1}^{N}(S_{i}^{t}-\mu)^{2}\cdot\mathbb{I}_{\mathrm{argmax}\,S_{i}^{t}\in\cup_{j=1}^{t-1}\mathcal{C}^{j}}}{\sum_{i=1}^{N}\mathbb{I}_{\mathrm{argmax}\,S_{i}^{t}\in\cup_{j=1}^{t-1}\mathcal{C}^{j}}}},$  (6)

where $\mu=\frac{\sum_{i=1}^{N}S_{i}^{t}\cdot\mathbb{I}_{\mathrm{argmax}\,S_{i}^{t}\in\cup_{j=1}^{t-1}\mathcal{C}^{j}}}{\sum_{i=1}^{N}\mathbb{I}_{\mathrm{argmax}\,S_{i}^{t}\in\cup_{j=1}^{t-1}\mathcal{C}^{j}}}$, and $\mathbb{I}$ is the indicator function that equals 1 when its condition holds and 0 otherwise. The hyperparameter $\zeta$ controls the strictness of $\tau$.
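Below is a small sketch of how the dynamic threshold of Eq. (6) could be computed in PyTorch. The assumption that old classes occupy the first columns of the score matrix, as well as the function and argument names, are illustrative choices rather than the authors' implementation.

```python
import torch

def reliable_threshold(scores: torch.Tensor, num_old_classes: int, zeta: float = 1.2) -> float:
    """Sketch of the dynamic threshold tau in Eq. (6).

    scores: (N, C) per-proposal classification scores of the current model,
    assuming the first `num_old_classes` columns correspond to old classes.
    """
    top1 = scores.argmax(dim=1)                      # top-1 predicted class per proposal
    conf = scores.max(dim=1).values                  # confidence of the top-1 class
    old_mask = top1 < num_old_classes                # proposals predicted as old classes
    if old_mask.sum() == 0:
        return float("inf")                          # nothing reliable to distill
    sel = conf[old_mask]
    mu = sel.mean()                                  # average of the selected scores
    sigma = ((sel - mu) ** 2).mean().sqrt()          # their standard deviation
    return (mu + zeta * sigma).item()
```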

We then keep the bounding boxes $\{B_{i}^{t}\}_{i=1}^{O}$ and $\{B_{i}^{t-1}\}_{i=1}^{O}$ of the $O$ proposals whose confidence scores exceed the threshold $\tau$. These reliable responses transfer positive 3D knowledge from the teacher model to the student model, which is crucial to prevent catastrophic forgetting. Thus, the proposed reliable dynamic distillation loss for bounding boxes is defined as:

$\mathcal{L}_{RDD}=\frac{1}{O}\sum_{i=1}^{O}\|B_{i}^{t}-B_{i}^{t-1}\|_{2}.$  (7)
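A minimal sketch of Eq. (7), assuming the old and new models score the same set of proposals so their box parameters can be compared index by index; the tensor layout is an assumption for illustration.

```python
import torch

def rdd_loss(boxes_new: torch.Tensor, boxes_old: torch.Tensor,
             conf: torch.Tensor, tau: float) -> torch.Tensor:
    """Sketch of Eq. (7): L2 box distillation over reliable proposals only.

    boxes_new / boxes_old: (N, P) box parameters of the same N proposals from
    the new and old models; conf: (N,) confidence scores; tau from Eq. (6).
    """
    keep = conf > tau                                # the O proposals above the threshold
    if keep.sum() == 0:
        return boxes_new.new_zeros(())
    diff = boxes_new[keep] - boxes_old[keep]
    return diff.norm(dim=-1).mean()                  # average L2 norm over reliable proposals
```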

III-D Relation Feature Distillation

Most current class-incremental distillation methods based on features [10, 28] neglect the spatial positional relationships in the feature space, which further damages the model's plasticity when learning novel 3D classes. They assume that direct distillation has the same effect in the feature space as it does in the output space. However, the distribution of responses in the feature space is complex, and directly distilling responses cannot transfer knowledge of this distribution well.

Figure 3: Class-wise performance comparisons (mAP@0.25) on ScanNet dataset under the setting of 9 + 9.

To handle the above problems, we propose relation feature distillation to capture the relations among responses of the current and old models, $\mathbf{f}_{i}^{t}$ and $\mathbf{f}_{i}^{t-1}$. We define our relation feature distillation loss as:

$\mathcal{L}_{RFD}=\sum_{\mathbf{x}_{i}^{t},\mathbf{x}_{j}^{t}\in\mathcal{D}^{t}}\frac{\|\mathrm{Cos}(\mathbf{f}_{i}^{t},\mathbf{f}_{j}^{t})-\mathrm{Cos}(\mathbf{f}_{i}^{t-1},\mathbf{f}_{j}^{t-1})\|_{2}}{C_{|\mathcal{D}^{t}|}^{2}},$  (8)

where $\mathrm{Cos}(\cdot,\cdot)$ denotes cosine similarity and $C_{|\mathcal{D}^{t}|}^{2}$ is the total number of sample pairs selected from the training data $\mathcal{D}^{t}$.
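A minimal sketch of Eq. (8): the pairwise cosine-similarity structure of the new model's features is matched to that of the old model over all sample pairs in a batch. Pooling each scene's feature map into a single vector is an assumption made here for brevity.

```python
import torch
import torch.nn.functional as F

def rfd_loss(feat_new: torch.Tensor, feat_old: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (8): distill the pairwise cosine-similarity structure.

    feat_new / feat_old: (B, D) pooled features of the same batch from the
    current and old models (B >= 2 samples).
    """
    sim_new = F.cosine_similarity(feat_new.unsqueeze(1), feat_new.unsqueeze(0), dim=-1)  # (B, B)
    sim_old = F.cosine_similarity(feat_old.unsqueeze(1), feat_old.unsqueeze(0), dim=-1)
    # keep only the B*(B-1)/2 unordered sample pairs (i < j)
    idx = torch.triu_indices(feat_new.size(0), feat_new.size(0), offset=1)
    diff = sim_new[idx[0], idx[1]] - sim_old[idx[0], idx[1]]
    return diff.abs().mean()                          # average over all selected pairs
```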

At each training iteration, we compute a supervised loss $\mathcal{L}_{S}$ by following the loss function of VoteNet [4]; the supervised loss of our prompt guidance block $\mathcal{L}_{PGB}$ is also included in $\mathcal{L}_{S}$. Overall, the model $\Phi^{t}$ is optimized as follows:

$\mathcal{L}=\alpha\mathcal{L}_{S}+\beta\mathcal{L}_{RDD}+\gamma\mathcal{L}_{RFD}+\xi\mathcal{L}_{Dis},$  (9)

where $\alpha,\beta,\gamma,\xi$ are hyperparameters in our experiments.
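For clarity, Eq. (9) is simply a weighted sum of the four losses; a one-line sketch is shown below (the default weights are placeholders, the values actually used are reported in Sec. IV-B).

```python
def i3dod_loss(l_s, l_rdd, l_rfd, l_dis, alpha=1.0, beta=1.0, gamma=1.0, xi=1.0):
    """Sketch of Eq. (9): weighted combination of the supervised loss, the two
    proposed distillation losses, and the classification distillation loss."""
    return alpha * l_s + beta * l_rdd + gamma * l_rfd + xi * l_dis
```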

IV EXPERIMENTS

IV-A Datasets and Evaluation metrics

Datasets. Following SDCoT [6], we conduct our experiments on the SUN RGB-D [29] and ScanNet [30] datasets. The details of the two datasets are as follows: 1) SUN RGB-D consists of 5,285 training samples covering hundreds of object categories and 5,050 samples for evaluation. In line with the standard evaluation protocol in SDCoT, we use the same 10 categories to report our performance. 2) ScanNet includes 1,201 samples for training and 312 samples for evaluation. Following VoteNet [4], we generate our input point clouds by selecting vertices from the reconstructed meshes and derive the bounding boxes from the point-level labels. We use the same 18 categories in our experiments.

Evaluation metrics. We use mean average precision (mAP@0.25) and recall as the evaluation metrics for 3D object detection.

IV-B Implementation Details

For the hyper-parameters in our framework, we set the number of selected prompts $S$ to 10 and $\zeta$ in Eq. (6) to 1.2. The weights of the loss functions in Eq. (9) are set to $\alpha=10$, $\beta=0.8$, $\gamma=1$ and $\xi=1$, and we adopt the same method as SDCoT to schedule the respective contributions of $\beta$, $\gamma$ and $\xi$. For our prompt guidance block, we randomly initialize the prompts as learnable embeddings at the beginning of the first task and optimize them with $\mathcal{L}_{PGB}$ in Eq. (5). After training on the current task, we store these prompts in the prompt pool and use them to initialize the prompts of the next task. All models are trained with the Adam optimizer with an initial learning rate of 0.001, scheduled via cosine annealing.
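As a rough sketch, the per-task optimization described above could look as follows in PyTorch; `model`, `train_loader`, `total_loss_fn` (computing Eq. (9)) and the epoch count are placeholders supplied by the caller, not names or values from the paper.

```python
import torch

def train_one_task(model, train_loader, total_loss_fn, num_epochs: int = 150):
    """Sketch of the optimization setup: Adam with cosine-annealed learning rate."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # initial lr 0.001
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    for _ in range(num_epochs):
        for batch in train_loader:
            loss = total_loss_fn(model, batch)                     # weighted sum of Eq. (9)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                                           # cosine annealing of the lr
    return model
```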

Concretely, we design two baselines for class-incremental 3D object detection (i.e., Freeze and add and Fine-tuning in Table I). Freeze and add means that we freeze the model $\Phi^{t-1}$ to initialize the model $\Phi^{t}$; a new classifier for $\mathcal{C}^{t}$ is then added and trained as the only learnable module. Fine-tuning denotes that we fine-tune $\Phi^{t-1}$ (all parameters except the old classifier) together with a new classifier on the training data $\mathcal{D}^{t}$. In addition to the two baselines, we compare our I3DOD with SDCoT using the identical class order and task setting.

Figure 4: Qualitative results on the ScanNet dataset. The left column shows the three original scenes, the middle column shows the scenes with bounding boxes generated by our I3DOD, and the right column shows the scenes with ground-truth bounding boxes. Yellow and blue denote the ground-truth bounding boxes of old and new classes, and red represents the bounding boxes predicted by our I3DOD.

IV-C Quantitative Results

SUN RGB-D: As shown in Tabs. I, III and Fig. 3, we evaluate the effectiveness of all experiments on SUN RGB-D. In this paper, B means the base task and N denotes the novel task (e.g., 5+5 means that the base task consists of the first five classes and the other five classes belong to the novel task). Our model gains significant improvement over the other methods by $0.9\%\sim 22.0\%$ in terms of task-average mAP@0.25. In the 9+1 setting, our I3DOD exceeds the state-of-the-art method SDCoT by $2.9\%$ and $8.5\%$ in terms of mAP@0.25 and recall, respectively.

ScanNet: We report the performance of all comparison experiments in Tab. II and visualize some evaluation results in Fig. 4. Compared with the other methods, our I3DOD achieves solid improvement in all incremental settings. Furthermore, I3DOD outperforms SDCoT by $0.6\%$ on the base task and $4.1\%$ on the novel task. All the experiments show that our I3DOD can tackle catastrophic forgetting from the two aspects of category information comprehension and reliable representation distillation.

IV-D Ablation Studies

As presented in Tab. I and Fig. 5, we ablate the effectiveness of each module in our I3DOD by removing the proposed modules one by one (i.e., PGB, RDD, RFD). Specifically, we notice that the performance of Ours w/o PGB degrades by $0.8\%\sim 2.3\%$ in terms of task-average mAP@0.25, which shows that our prompt guidance block performs category information comprehension to gain consistent improvement on both new and old classes. Furthermore, Ours w/o PGB & RDD is worse by $0.2\%\sim 2.0\%$ than Ours w/o PGB, and achieves better performance than the baseline Ours w/o PGB & RDD & RFD, which reflects that our distillation strategy is significantly effective.

V CONCLUSIONS

In this paper, we propose a novel I3DOD framework to tackle the challenge of catastrophic forgetting in class-incremental 3D object detection. In particular, the prompt guidance block is designed to capture the relationship between the object localization information and the high-level semantic information. We then point out that the responses of the old detection model consist of positive and negative responses, and present our reliable dynamic distillation to filter out the negative knowledge of the old model. Meanwhile, a relation feature distillation module is designed to distill the spatial positional relationships in the feature space. Finally, we report significant performance gains over baseline methods and verify the effectiveness of each module in I3DOD.

Figure 5: Ablation study results on SUN RGB-D dataset under the class setting of 5+5 (left) and 7+3 (right).

References

  • [1] D. Wang, C. Devin, Q.-Z. Cai, P. Krähenbühl, and T. Darrell, “Monocular plan view networks for autonomous driving,” in IROS.   IEEE, 2019, pp. 2876–2883.
  • [2] Z. Zhou, L. Li, A. Fürsterling, H. J. Durocher, J. Mouridsen, and X. Zhang, “Learning-based object detection and localization for a mobile robot manipulator in sme production,” Robotics and Computer-Integrated Manufacturing, vol. 73, p. 102229, 2022.
  • [3] M. Billinghurst, A. Clark, G. Lee et al., “A survey of augmented reality,” Foundations and Trends® in Human–Computer Interaction, vol. 8, no. 2-3, pp. 73–272, 2015.
  • [4] C. R. Qi, O. Litany, K. He, and L. J. Guibas, “Deep hough voting for 3d object detection in point clouds,” in CVPR, 2019, pp. 9277–9286.
  • [5] X. Pan, Z. Xia, S. Song, L. E. Li, and G. Huang, “3d object detection with pointformer,” in CVPR, 2021, pp. 7463–7472.
  • [6] N. Zhao and G. H. Lee, “Static-dynamic co-teaching for class-incremental 3d object detection,” in AAAI, vol. 36, no. 3, 2022, pp. 3436–3445.
  • [7] T. Feng, M. Wang, and H. Yuan, “Overcoming catastrophic forgetting in incremental object detection via elastic response distillation,” in CVPR, 2022, pp. 9427–9436.
  • [8] S. Lewandowsky and S.-C. Li, “Catastrophic interference in neural networks: Causes, solutions, and data,” in Interference and inhibition in cognition.   Elsevier, 1995, pp. 329–361.
  • [9] J. Dong, L. Wang, Z. Fang, G. Sun, S. Xu, X. Wang, and Q. Zhu, “Federated class-incremental learning,” in CVPR, 2022, pp. 10 164–10 173.
  • [10] A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “Podnet: Pooled outputs distillation for small-tasks incremental learning,” in ECCV.   Springer, 2020, pp. 86–102.
  • [11] B. Ma, Y. Cong, and J. Dong, “Topology-aware graph convolution network for few-shot incremental 3d object learning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
  • [12] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying forgetting in classification tasks,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 7, pp. 3366–3385, 2021.
  • [13] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [14] Z. Li and D. Hoiem, “Learning without forgetting,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 2935–2947, 2017.
  • [15] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” in CVPR, 2017, pp. 2001–2010.
  • [16] L. Pellegrini, G. Graffieti, V. Lomonaco, and D. Maltoni, “Latent replay for real-time continual learning,” in IROS.   IEEE, 2020, pp. 10 203–10 209.
  • [17] C. Atkinson, B. McCane, L. Szymanski, and A. Robins, “Pseudo-recursal: Solving the catastrophic forgetting problem in deep neural networks,” arXiv preprint arXiv:1802.03875, 2018.
  • [18] F. Lavda, J. Ramapuram, M. Gregorova, and A. Kalousis, “Continual classification learning using generative models,” arXiv preprint arXiv:1810.10612, 2018.
  • [19] A. Mallya and S. Lazebnik, “Packnet: Adding multiple tasks to a single network by iterative pruning,” in CVPR, 2018, pp. 7765–7773.
  • [20] J. Dong, W. Liang, Y. Cong, and G. Sun, “Heterogeneous forgetting compensation for class-incremental learning,” in ICCV, Oct. 2023.
  • [21] R. Aljundi, P. Chakravarty, and T. Tuytelaars, “Expert gate: Lifelong learning with a network of experts,” in CVPR, 2017, pp. 3366–3375.
  • [22] X. Hu, K. Tang, C. Miao, X.-S. Hua, and H. Zhang, “Distilling causal effect of data in class-incremental learning,” in CVPR, 2021, pp. 3957–3966.
  • [23] K. Shmelkov, C. Schmid, and K. Alahari, “Incremental learning of object detectors without catastrophic forgetting,” in CVPR, 2017, pp. 3400–3409.
  • [24] C. Peng, K. Zhao, and B. C. Lovell, “Faster ilod: Incremental learning for object detectors based on faster rcnn,” Pattern Recognition Letters, vol. 140, pp. 109–115, 2020.
  • [25] D. Yang, Y. Zhou, A. Zhang, X. Sun, D. Wu, W. Wang, and Q. Ye, “Multi-view correlation distillation for incremental object detection,” Pattern Recognition, vol. 131, p. 108863, 2022.
  • [26] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” in CVPR, 2022, pp. 139–149.
  • [27] Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy et al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” in ECCV.   Springer, 2022, pp. 631–648.
  • [28] M. Kang, J. Park, and B. Han, “Class-incremental learning by knowledge distillation with adaptive feature consolidation,” in CVPR, 2022, pp. 16 071–16 080.
  • [29] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in CVPR, 2015, pp. 567–576.
  • [30] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in CVPR, 2017, pp. 5828–5839.