
Continual Learning for Pose-Agnostic Object Recognition in 3D Point Clouds

Xihao Wang
Technical University of Munich
[email protected]
   Xian Wei
East China Normal University
[email protected]
Corresponding Author
Abstract

Continual learning aims to learn multiple incoming tasks sequentially while keeping the performance on previously learned tasks at a consistent level. However, existing research on continual learning assumes that the pose of the object is pre-defined and well-aligned. For practical applications, this work focuses on pose-agnostic continual learning tasks, where the object's pose changes dynamically and unpredictably. The cost of the point cloud augmentation adopted by past approaches would rise sharply with each task increment in the continual learning process. To address this problem, we inject equivariance as additional prior knowledge into the networks. We propose a novel continual learning model that effectively distills the geometric equivariance information of previous tasks. The experiments show that our method overcomes the challenge of pose-agnostic scenarios on several mainstream point cloud datasets. We further conduct ablation studies to evaluate the contribution of each component of our approach.

1 Introduction

With the development of artificial intelligence, deep neural networks have demonstrated impressive ability in a wide variety of learning tasks across several fields [28, 29]. This outstanding capability partially depends on an ideal learning setting, where sufficient training samples are available and well-aligned. The world we live in, however, is made up of processes that are constantly changing. As new tasks arrive, artificial agents are prone to seriously forget the knowledge learned from old tasks while learning a new one, a phenomenon called catastrophic forgetting [19]. Thus, continual learning (CL) is introduced to address this problem and to promote deep neural networks that satisfy the requirements of the natural world [2].

However, especially in the 3D domain, most current continual learning research only evaluates objects in a well-aligned scenario. From the view of real-world applications, we draw our attention to a more practical scenario of pose-agnostic continual learning. As described in Figure 1, pose-agnostic CL considers the following aspects: (i) The pose of the object changes constantly in every test phase. (ii) Each task arrives sequentially, following the definition of incremental learning. (iii) Due to limited computing resources, the model can only save a small portion of data from previous tasks into the memory buffers. For instance, consider a robot server working in a realistic scenario, where it recognizes objects from the 3D point cloud data generated by its sensor. As the surroundings change, the robot continually learns new classes captured from its working perception. Due to the constraint of learning resources, the robot can only learn each sample with the first captured pose, while the object's pose varies every time the robot encounters the same object. As depicted in Figure 1, the bottle learned in task $t-1$ should still be recognized in task $t$ and task $t+1$, even though its pose is drastically changed in both tasks.

Figure 1: For instance, consider a robot server that categorizes new items from the 3D point cloud data captured in its surrounding environment. For each category, recognizing new items conspicuously depends on the pose in which the robot observes them. The pose of the object varies whenever the robot first learns it or encounters it again. We call the above scenario pose-agnostic continual learning.

Although pose-agnostic CL is a practical scenario close to the real world, exploration of 3D geometric data in continual learning is still rare. Among past approaches on 2D data, data augmentation is a prevalent strategy for solving the pose-agnostic problem [31]. In recent continual learning works, data augmentation increases the generalization of given tasks, which not only alleviates the problem caused by the pose of arriving samples but also mitigates catastrophic forgetting [26, 37]. Nevertheless, the data augmentation strategy is usually difficult to implement in the 3D domain. Due to the increased degrees of freedom, augmenting 3D data is always more exhaustive and expensive than augmenting 2D data. Furthermore, the computational burden of augmentation would rise sharply with each incoming new class in the continual learning process. Thus, we need a strategy that leverages sufficient geometric information from the 3D data while avoiding the problems caused by data augmentation.

To address this problem, we turn to introducing additional inductive bias, as prior knowledge, for pose-agnostic CL. In machine learning, a successful learning scheme usually needs to encode an appropriate inductive bias [6]: for instance, Occam's razor selecting a low-complexity model [21], translational invariance in convolutional neural networks [3], and time invariance in recurrent neural networks [44]. The popular attention mechanism is also an inductive rule derived from human intuition [51]. Moreover, research on the equivariance of deep neural networks has demonstrated that equivariant features retain more discriminative information [38, 17, 20, 14]. To this end, we design equivariance as an extra inductive bias in the internal network layers to solve the continual learning problem in the pose-agnostic scenario.

In this paper, we propose a novel continual learning approach that effectively combines injected rotation equivariance with knowledge distilled from previous tasks. First, to address the problem of the unpredictable object pose, we inject rotation equivariance into the feature extraction block. Then, to cooperate with the equivariant structure, we extend the knowledge distillation framework to leverage the enriched geometric information retained in the network feature maps. Finally, we construct a memory buffer that stores distinguished samples to help the model achieve better performance.

Our main contributions are: (i) We develop a dynamic knowledge distillation framework to mitigate catastrophic forgetting, which efficiently extracts equivariant features to overcome the challenge of the pose-agnostic CL scenario. (ii) Our model demonstrates effective exemplar storage and robustness when encountering complex scenarios. (iii) In 3D object recognition CL tasks, our method outperforms previous methods in pose-agnostic scenarios, and we propose the corresponding benchmark. Furthermore, we conduct ablation studies to validate each component of our approach.

2 Background and Related work

2.1 Continual Learning

Continual learning approaches can be classified into three major categories depending on the storage and usage of task-specific information during the learning process [15]. (i) Replay-based methods mitigate catastrophic forgetting by replaying data from past tasks stored in episodic memories. These methods were inspired by the idea of jointly training previous samples with the current task [45, 25, 9]. To address the problem of sample imbalance, constrained replay and pseudo-rehearsal methods were proposed [47, 33]. (ii) Regularization-based methods investigate the continual learning problem without storing any samples [27, 55, 8, 1]. [27] constrains the update of network parameters to mitigate forgetting, while [42] solves the continual learning problem by constraining each task's features. (iii) Dynamic architectural methods leverage separate network components to train different tasks [34, 46]: the network grows new branches to learn current tasks and freezes old branches to remember previous tasks. Beyond these categories, knowledge distillation, which transfers the knowledge of a previous model version to the next, can also mitigate catastrophic forgetting [32, 43].

Knowledge Distillation

Knowledge distillation transfers a teacher network's knowledge to a student network. The concept was first proposed in [23], and the works [32, 43] introduced it as one component for solving continual learning problems. In this setting, knowledge distillation defines the previous task model as the teacher network, and its "dark knowledge" is distilled into the current task model, defined as the student network, through the teacher's soft labels. Depending on the position of the distillation, these methods can be divided into two categories: distillation from logits [10, 11] and from intermediate features [22, 56].

2.2 Equivariance

A function $f:\mathcal{V}\rightarrow\mathcal{U}$ is called equivariant [18] with respect to the transformations $T_{g}:\mathcal{V}\rightarrow\mathcal{V}$ for the abstract group element $g\in G$, if for every $g$ there exists a transformation $S_{g}:\mathcal{U}\rightarrow\mathcal{U}$ such that

$$S_{g}[f(v)]=f(T_{g}[v]),\quad\text{for all}\quad g\in G,\ v\in\mathcal{V}.\tag{1}$$

Here $G$ indicates the set of symmetries, and $g$ is considered a group element describing a symmetry transformation. If we have two symmetry transformations $g,g^{\prime}\in G$ and compose them, the result $gg^{\prime}$ is another symmetry transformation [12]. In Eq. (1), the transformations $T_{g}$ and $S_{g}$ refer to the group action of the group element $g\in G$ on an object $v\in\mathcal{V}$. Standard point cloud deep learning architectures, such as PointNet [39], PointNet++ [40], and DGCNN [52], lack robustness to rotation. Because augmentation is expensive in the 3D domain, equivariance has immediately attracted the interest of researchers. Recent research has demonstrated the importance of equivariance for ensuring stable and predictable performance when nuisance transformations exist in the input data [20].
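
As a concrete illustration of Eq. (1), the short sketch below numerically checks equivariance for a simple map, the per-cloud centroid, under a rotation about the z-axis; the choice of function and the NumPy setup are illustrative assumptions rather than part of the method.

```python
# A minimal numerical check of Eq. (1), assuming f is the per-cloud centroid.
# The centroid is rotation-equivariant, so S_g coincides with T_g here.
import numpy as np

def rotation_z(theta: float) -> np.ndarray:
    """Rotation about the z-axis, one concrete element T_g of SO(3)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

points = np.random.randn(1024, 3)       # a toy point cloud v
R = rotation_z(0.7)                     # T_g

f = lambda v: v.mean(axis=0)            # centroid as an example equivariant map
lhs = R @ f(points)                     # S_g[f(v)]
rhs = f(points @ R.T)                   # f(T_g[v])
assert np.allclose(lhs, rhs)            # Eq. (1) holds
```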

SO(3) Equivariant method

In 3D roto-translation, SO(3) equivariance is a vital property. Based on SO(3) representation theory, SO(3)-equivariant network architectures have been proposed. Early on, spherical convolution leveraged the spherical harmonic domain to achieve SO(3) equivariance [13]. Then, a series of works built equivariance based on steerable kernels; these works can be divided into two styles, relying either on Tensor Field Network theory [48, 38] or on Lie group theory [18, 24]. Instead of explicitly designing steerable kernels, [16] proposed vector neuron representations that create SO(3) equivariance implicitly.

3 Our Method

3.1 Problem Definition

Among the basic continual learning settings summarized in [50], we focus on the class-incremental problem, which requires the model to infer the task without an explicit task identity. We formulate the data as arriving incrementally in batches of point sets $x_{i}\in\mathbb{R}^{s_{a}\times d_{v}}\in\mathcal{X}^{i}$ with corresponding true labels $y_{i}\in\mathbb{R}^{b}$, where $s_{a}$ denotes the number of sampled points, $b$ stands for the dimension of the label vector, and $d_{v}$ denotes the dimension of each point, depending on whether normal vectors are used. Each task $\mathcal{T}_{i}$ is composed of point cloud sets and their labels. At each incremental step, point cloud data is only available for the new classes $\mathcal{T}_{new}=\{(\mathcal{X}^{c+1},y^{c+1}),\cdots,(\mathcal{X}^{t},y^{t})\}$ together with a small amount of exemplar data $\mathcal{E}=\{\mathcal{E}^{1},\cdots,\mathcal{E}^{c}\}$. The exemplars are saved from the previous classes $\mathcal{X}_{old}=\{\mathcal{X}^{1},\cdots,\mathcal{X}^{c}\}$. We now formulate the pose-agnostic CL setup. As mentioned in the introduction, 3D point cloud data in realistic scenarios usually suffers from unclear pose alignment. In the pose-agnostic scenario, the input data is presented as $T_{g}\mathcal{X}^{t}$, where $T_{g}$ denotes an unpredictable SO(3) geometric transformation. As the representation of a group element $g\in SO(3)$, $T_{g}$ is a standard rotation matrix acting on $\mathcal{V}=\mathbb{R}^{3}$.
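
The following sketch illustrates this setup: each point cloud in a batch is presented under its own unpredictable rotation $T_{g}$. The batch shapes and the rotation-sampling routine are illustrative assumptions, not the paper's exact data pipeline.

```python
# A minimal sketch of the pose-agnostic input: every point cloud X^t is
# presented as T_g X^t with an unpredictable rotation T_g in SO(3).
import torch

def random_rotations(batch_size: int) -> torch.Tensor:
    """Random SO(3) matrices via the exponential of skew-symmetric matrices."""
    A = torch.randn(batch_size, 3, 3)
    skew = A - A.transpose(-1, -2)            # skew-symmetric generators
    return torch.matrix_exp(skew)             # orthogonal with determinant +1

x = torch.randn(32, 1024, 3)                             # (B, s_a, d_v=3) point clouds
T_g = random_rotations(32)                               # one unknown pose per sample
x_pose_agnostic = torch.einsum('bij,bnj->bni', T_g, x)   # T_g applied to every point
```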

Figure 2: The internal distillation structure consists of multiple loss sources for comprehensively constructing the model's loss function.

3.2 The Basic Learning Framework

To effectively address the pose-agnostic CL scenario, we propose a training model with three components: an equivariance-injected block, an internal distillation structure, and rehearsal exemplar distillation. To deal with geometrically transformed input point cloud data, we enforce rotation equivariance in the neural network by designing the feature extractor blocks. The internal distillation structure is responsible for transferring the extracted pose-agnostic representation from previous tasks to the current task. The rehearsal exemplar strategy consolidates the knowledge in the internal distillation structure.

3.2.1 Injecting Rotation Equivariance into Network

An equivariant network is a functional architecture for learning general features with a particular inductive bias. In an equivariant network, each layer is required to satisfy equivariance so that the equivariance dependency is transmitted through the network. Through this transmission, the semantic information of the spatial data is retained between layers, and the spatial information in each layer is isomorphic to the input point data. Our feature extraction network is composed of blocks whose layers can be divided into linear, non-linear, and pooling layers. We describe the SO(3)-equivariance of each layer in the following.

First, the linear layer is the indispensable feature-mapping layer in the composition blocks. It can be represented as $\mathcal{U}=f(\mathcal{V})=W\mathcal{V}$. When the rotated input passes through the linear layer, the representation can be written as:

$$f(T_{g}\mathcal{V})=W\mathcal{V}T_{g}=T_{g}f(\mathcal{V})=T_{g}W\mathcal{V},\tag{2}$$

which indicates that the linear layer naturally possesses the SO(3)-equivariance property. Thus, the linear layer can keep its original construction.
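
For illustration, a minimal PyTorch sketch of such an equivariant linear layer is given below, in the spirit of Eq. (2): the weight only mixes feature channels and never touches the 3D coordinate axis, so a rotation of the input commutes with the layer. The feature shape (batch, channels, points, 3) and the sanity check are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class EquivariantLinear(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # No bias term: adding a fixed vector would break rotation equivariance.
        self.weight = nn.Parameter(0.02 * torch.randn(out_channels, in_channels))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (B, C_in, N, 3) -> (B, C_out, N, 3); W acts on the channel dim only.
        return torch.einsum('oc,bcnd->bond', self.weight, v)

# Sanity check of the equivariance claimed in Eq. (2).
layer = EquivariantLinear(8, 16)
v = torch.randn(2, 8, 128, 3)
c, s = math.cos(0.7), math.sin(0.7)
R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])   # a rotation T_g
assert torch.allclose(layer(v @ R.T), layer(v) @ R.T, atol=1e-5)
```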

Second, the non-linear layer is usually critical for the neural network's learning ability. Inspired by the work [16], we apply an implicit representation to dynamically predict the direction of the input vector feature. At the non-linear layer, the input vector is the output of the preceding linear layer, $q=f_{lin}(\mathcal{V})=W\mathcal{V}$, and $k=\Phi\mathcal{V}$ is the normal direction associated with $q$. Depending on the distance in the inner product space, the output of the non-linear layer is $\mathcal{U}=f(\mathcal{V})=q$ when $\langle q,k\rangle\geq 0$; otherwise, $\mathcal{U}=f(\mathcal{V})=q-\left\langle q,\frac{k}{\|k\|}\right\rangle\frac{k}{\|k\|}$. Since a rotation does not influence distances in the inner product space, the SO(3)-equivariance can be verified as

$$f(T_{g}\mathcal{V})=T_{g}q-\left\langle T_{g}q,\frac{T_{g}k}{\|T_{g}k\|}\right\rangle\frac{T_{g}k}{\|T_{g}k\|}=T_{g}\left(q-\left\langle q,\frac{k}{\|k\|}\right\rangle\frac{k}{\|k\|}\right)=T_{g}f(\mathcal{V}),\quad\text{if }\langle q,k\rangle<0.\tag{3}$$
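
A minimal sketch of this direction-learning non-linearity, following the vector-neuron idea of [16], is shown below; the module name and the feature shape (B, C, N, 3) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EquivariantNonLinearity(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.W = nn.Parameter(0.02 * torch.randn(channels, channels))
        self.Phi = nn.Parameter(0.02 * torch.randn(channels, channels))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        q = torch.einsum('oc,bcnd->bond', self.W, v)      # feature vectors q
        k = torch.einsum('oc,bcnd->bond', self.Phi, v)    # learned directions k
        k_unit = k / (k.norm(dim=-1, keepdim=True) + 1e-8)
        dot = (q * k_unit).sum(dim=-1, keepdim=True)      # <q, k/||k||>
        # Keep q where <q, k> >= 0; otherwise remove its component along k.
        return torch.where(dot >= 0, q, q - dot * k_unit)
```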

Lastly, according to the blueprint of [6], pooling, which aggregates features, can be divided into two types. Global pooling is an invariance layer. For local pooling, we focus on max pooling, whose SO(3)-equivariant construction follows the insight of the non-linear layer [16]. Through a learned direction $\xi=\Psi\mathcal{V}$, max pooling chooses the input vector element $d^{*}$ that best aligns with the learned direction. The SO(3)-equivariance can be verified as follows:

$$f_{max}(T_{g}\mathcal{V})=\underset{d^{*}}{\operatorname{argmax}}\langle T_{g}\xi,T_{g}\mathcal{V}\rangle=T_{g}f_{max}(\mathcal{V}).\tag{4}$$
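
A corresponding sketch of the equivariant max pooling of Eq. (4) is given below: for each channel, a direction $\xi=\Psi\mathcal{V}$ is learned and the point whose feature vector best aligns with it is kept. Names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EquivariantMaxPool(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.Psi = nn.Parameter(0.02 * torch.randn(channels, channels))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (B, C, N, 3) -> (B, C, 3); pooling is over the N points.
        xi = torch.einsum('oc,bcnd->bond', self.Psi, v)        # learned directions
        score = (xi * v).sum(dim=-1)                           # <xi, v>: rotation-invariant
        idx = score.argmax(dim=-1, keepdim=True)               # best-aligned point index
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, 3)          # (B, C, 1, 3)
        return torch.gather(v, dim=2, index=idx).squeeze(2)    # gathering is equivariant
```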

3.2.2 Internal Distillation Structure

After the SO(3)-equivariance is injected into the feature extraction network, the model requires a structure that resists catastrophic forgetting by transferring learned knowledge from the current model $\Theta^{c}$ to the target model $\Theta^{t}$. Although more than one kind of method can achieve continual learning, we choose knowledge distillation as our approach because it can effectively transfer the equivariance learned by the previous model version [44]. However, previous knowledge distillation works are not adapted to our equivariant feature extraction network $\zeta(\cdot)$. Because our feature extraction network retains rich semantic information in its feature maps, we do not adopt the typical knowledge distillation framework that only utilizes the logits of the output layer. Instead, we leverage the feature maps between sections and combine them with the internal distillation block. Thus, the knowledge retained in the network is squeezed into the shallow portion to affect the final fully connected layer (illustrated in Figure 2).

During the training process, each selected feature map between sections is individually mapped to an external layer to keep a unified dimension. Similar to typical knowledge distillation, the information in the external layer is distilled and combined to constitute a comprehensive final loss:

$$\text{loss}_{\mathcal{X}}=\sum^{D}_{\delta=1}(loss_{1}+loss_{2}+loss_{3}+loss_{4}).\tag{5}$$

The final loss has four sources. To construct the loss on hard labels in the student model, the first source is the cross-entropy between the ground truth and each feature map in the target model:

$$loss_{1}=-(1-\gamma)\sum^{t}_{i=c}y^{i}\cdot\log(\sigma(\Theta^{t}(\mathcal{X}^{i}_{\delta}))),\tag{6}$$

where $(\mathcal{X},y)_{i=c}^{t}$ are the input samples of the new task, and $\Theta(\mathcal{X}_{\delta})$ denotes the logits computed from feature map $\delta$. The first source not only employs the final classifier layer but also allows every feature map, which contains the equivariant semantic information of the input, to participate in the computation of the cross-entropy. Then, in order to compute the distillation loss between soft labels and soft predictions, the second source is the KL-divergence between the student model's feature map and the corresponding feature map of the teacher model:

$$loss_{2}=\lambda\,\text{KL}(\sigma(\Theta^{t}(\mathcal{X}_{\delta})/T),\sigma(\Theta^{c}(\mathcal{X}_{\delta})/T)).\tag{7}$$

Here, the second source connects the distribution of each feature map from the teacher’s model to the student’s model. Next, the third source is the KL-divergence between each internal feature map and the final output:

$$loss_{3}=\gamma\,\text{KL}(\sigma(\Theta^{t}(\mathcal{X}_{\delta})/T),\sigma(\Theta^{t}(\mathcal{X}_{D})/T)),\tag{8}$$

where $\Theta^{t}(\mathcal{X}_{D})$ denotes the final classifier logits. The third source allows the feature maps hidden in the deep layers to affect the final classifier output at the most shallow layer. Lastly, in terms of Euclidean space, the fourth source enforces the "dark knowledge" of each feature map onto the shallow output layer:

$$loss_{4}=\kappa\|F_{\delta}-F_{D}\|^{2}_{2},\tag{9}$$

where $F_{\delta}$ and $F_{D}$ denote the feature maps of the indicated layers. With the above comprehensive loss constitution, the rich semantic information extracted by the equivariant network sufficiently facilitates the student network's inheritance of knowledge from the teacher network.
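
A minimal sketch of how the four loss sources of Eqs. (6)-(9) could be summed over the $D$ internal feature maps, as in Eq. (5), is given below. It assumes every feature map $\delta$ has already been mapped by its external layer to a feature vector and to class logits; the list-based interface, the default hyper-parameter values, and the detaching of the final branch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def internal_distillation_loss(student_logits, teacher_logits, student_feats,
                               targets, T=2.0, gamma=0.3, lam=1.0, kappa=0.1):
    """loss_X of Eq. (5): each list holds one entry per internal feature map delta,
    with the last entry being the final classifier / final feature map F_D."""
    final_logits = student_logits[-1].detach()   # distillation target for Eq. (8)
    final_feat = student_feats[-1].detach()      # F_D, target for Eq. (9)
    total = torch.zeros(())
    for s_logit, t_logit, s_feat in zip(student_logits, teacher_logits, student_feats):
        # Eq. (6): cross-entropy between ground truth and every feature map's logits.
        loss1 = (1.0 - gamma) * F.cross_entropy(s_logit, targets)
        # Eq. (7): KL between teacher and student softened predictions at this map.
        loss2 = lam * F.kl_div(F.log_softmax(s_logit / T, dim=1),
                               F.softmax(t_logit / T, dim=1), reduction='batchmean')
        # Eq. (8): KL between this internal feature map and the final output.
        loss3 = gamma * F.kl_div(F.log_softmax(s_logit / T, dim=1),
                                 F.softmax(final_logits / T, dim=1), reduction='batchmean')
        # Eq. (9): squared L2 distance between internal and final feature maps.
        loss4 = kappa * (s_feat - final_feat).pow(2).sum(dim=1).mean()
        total = total + loss1 + loss2 + loss3 + loss4
    return total
```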

3.2.3 Rehearsal Exemplar Distillation

Given the outstanding robustness and performance of rehearsal, we also employ a memory buffer to save selected exemplars of old classes. Thus, in each iteration, the input of the trained model contains the current task's data $\mathcal{X}_{new}$ and the stored exemplars $\mathcal{E}_{new}$. Since redundant input augmentation has been avoided by enforcing SO(3)-equivariance in the network, the burden of saving exemplar sets for the new tasks is remarkably relieved. The distillation of rehearsal exemplars consists of two steps: exemplar selection and exemplar distillation.

Exemplar Selection

From the input data $\mathcal{X}_{new}$, we randomly select a certain number of samples and add them to $\mathcal{E}_{old}=\mathcal{E}$ to create the new exemplar set $\mathcal{E}_{new}=\{\mathcal{E}^{1},\cdots,\mathcal{E}^{t}\}$. We define the rehearsal memory buffer to save up to $M$ samples. Meanwhile, to keep the number of saved samples balanced, we randomly select an equal number of samples for each class when it arrives with a new task. Hence, the number of samples per class in the updated exemplar set $\mathcal{E}_{new}$ is $r=M/t$. As the number of classes increases, it is necessary to discard some samples from each class to keep the total number of samples in the exemplar set fixed. We take the mean of the feature vectors as the center of $\mathcal{E}^{i}$, and the $r$ samples nearest to the center in Euclidean distance are picked:

$$d_{k}=\left\|\frac{1}{r}\sum^{r}_{i=1}\zeta(e_{i})-\zeta(e_{k})\right\|^{2}_{2},\quad\text{for all}\ e_{k}\in\{e_{1},\cdots,e_{r}\}.\tag{10}$$

Depending on this distance, the last $r^{\prime}-r$ samples will be discarded.
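
A minimal sketch of this selection rule, Eq. (10), is given below; `feature_extractor` stands for the equivariant backbone $\zeta(\cdot)$, and the function signature is an illustrative assumption.

```python
import torch

def select_exemplars(samples: torch.Tensor, feature_extractor, r: int):
    # samples: (n, s_a, d_v) point clouds of one class, with n >= r.
    with torch.no_grad():
        feats = feature_extractor(samples)            # (n, feat_dim)
    center = feats.mean(dim=0, keepdim=True)          # class mean feature
    dists = (feats - center).pow(2).sum(dim=1)        # squared L2 distance, Eq. (10)
    keep = dists.argsort()[:r]                        # keep the r nearest to the center
    return samples[keep]
```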

Exemplar Distillation

In order to balance the effect between new and old exemplar samples, the exemplar distillation loss consists of two elements:

$$\text{loss}_{\mathcal{E}}=-\sum^{c}_{i=1}\hat{y}^{i}\cdot\log(\sigma(\Theta^{t}(\hat{\mathcal{X}^{i}})))+\lambda\,\text{KL}(\sigma(\Theta^{t}(\hat{\mathcal{X}})/T),\sigma(\Theta^{c}(\hat{\mathcal{X}})/T)),\tag{11}$$

where $\hat{\mathcal{X}},\hat{y}$ denote samples from the exemplar set $\mathcal{E}_{new}$, $\sigma(\cdot)$ indicates the softmax function, $T$ stands for the distillation temperature, and $\lambda$ is a hyper-parameter. The first element is the cross-entropy between the exemplar samples and their labels. The second element is the Kullback-Leibler (KL) divergence loss that aligns the corresponding exemplar logits of the current and target models. The complete process is presented in Algorithm 1.

Input: new task data $\mathcal{T}_{new}=\{(\mathcal{X}^{c+1},y^{c+1}),\cdots,(\mathcal{X}^{t},y^{t})\}$, old exemplar set $\mathcal{E}_{old}=\{\mathcal{E}^{1},\cdots,\mathcal{E}^{c}\}$, current model $\Theta^{c}$, distillation temperature $T$, memory size $M$
Output: target model trained up to $t$ classes $\Theta^{t}$, new exemplar set $\mathcal{E}_{new}$
Update the per-class memory size $r=M/t$;
for $(\mathcal{X}^{i},y^{i})\in\mathcal{T}_{new}$ do
      Softmax over the logits from each internal feature map in the target model, $\sigma(\Theta^{t}(\mathcal{X}^{i}_{\delta}))$;
      Compute the classification loss Eq. (6) and the distillation loss Eq. (8) in the target model;
      Compute the feature map loss Eq. (9) in the L2 norm;
      Softmax over the logits from each internal feature map in the current model, $\sigma(\Theta^{c}(\mathcal{X}^{i}_{\delta}))$;
      Compute the distillation loss Eq. (7) between the current model and the target model;
      New exemplar set selection:
      Randomly pick $r$ samples $\mathcal{E}^{i}\rightarrow\{e_{1},\cdots,e_{r}\}\subset\mathcal{X}^{i}$;
      Calculate the mean feature $\frac{1}{r}\sum^{r}_{i=1}\zeta(e_{i})$;
      Arrange $\mathcal{E}^{i}\subset\mathcal{X}^{i}$ in descending order of the distance in Eq. (10);
      Load the input exemplar data:
      for $(\hat{\mathcal{X}^{i}},\hat{y}^{i})\in\mathcal{E}_{old}$ do
            Let $r^{\prime}$ be the number of samples of each class in $\mathcal{E}_{old}$;
            Compute the distillation loss of the exemplar set, Eq. (11);
            Discard the last $r^{\prime}-r$ old samples to generate $\hat{\mathcal{E}^{i}}\subset\mathcal{E}_{old}$;
      end for
      $\text{Loss}=\text{loss}_{\mathcal{X}}+\text{loss}_{\mathcal{E}}$;
      $\mathcal{E}_{new}\leftarrow\{\hat{\mathcal{E}^{1}},\cdots,\hat{\mathcal{E}^{c}},\mathcal{E}^{i}\}$;
end for
Algorithm 1: Pose-agnostic Continual Learning Algorithm
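
As a complement to the loop above, a minimal sketch of the exemplar loss of Eq. (11) is given below; the total loss of one iteration is then $\text{loss}_{\mathcal{X}}+\text{loss}_{\mathcal{E}}$ as in Algorithm 1. The function name and default hyper-parameter values are illustrative assumptions.

```python
import torch.nn.functional as F

def exemplar_loss(student_logits, teacher_logits, exemplar_labels, T=2.0, lam=1.0):
    # Cross-entropy on the stored exemplars (hard labels).
    ce = F.cross_entropy(student_logits, exemplar_labels)
    # Temperature-scaled KL term between the current (teacher) and target
    # (student) model logits on the same exemplars.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction='batchmean')
    return ce + lam * kd
```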

4 Experiment

In this section, we investigate the capabilities of our method and other continual learning approaches on two popular 3D point cloud datasets (ModelNet40 [54] and ScanObjectNN [49]) for the object recognition task, and achieve consistent improvements. Especially in the pose-agnostic case, our model overcomes the challenge of unpredictable rotation, which is frequently encountered in realistic scenarios. We describe the datasets and implementation details in Section 4.1, followed by the results, comparative analysis, and ablation studies.

Methods 4 8 12 16 20 24 28 32 36 40 Avg
Aligned/Aligned
LwF [32] 96.5 87.2 77.5 70.6 62.3 56.8 44.7 39.4 36.1 31.5 60.3
iCaRL [43] 96.8 90.4 83.6 78.3 72.5 67.3 59.6 53.1 47.8 39.6 68.9
DeeSIL [4] 97.7 91.5 85.4 80.5 74.4 71.8 65.3 58.7 52.4 43.7 72.1
EEIL [7] 97.6 93.8 87.5 81.6 78.2 74.7 69.2 62.4 56.8 48.1 75.0
IL2M [5] 97.8 95.1 89.4 85.7 83.8 82.2 78.4 72.8 67.9 57.6 81.1
DGMw [36] 97.5 93.2 86.4 82.5 80.1 78.4 73.6 65.3 61.5 53.4 77.2
DGMa [36] 97.5 93.4 84.7 81.8 79.5 77.8 74.1 67.4 60.8 51.5 76.8
BiC [53] 97.8 95.5 88.5 86.9 84.3 83.1 79.3 74.2 70.7 59.2 82.0
RPS-Net [41] 97.7 94.6 90.3 88.2 86.7 82.5 78.0 73.6 68.4 58.3 81.7
Ours 99.6 92.8 88.5 87.3 85.3 81.9 78.7 74.8 71.8 61.2 82.0
Pose-Agnostic/Pose-Agnostic
iCaRL 98.7 61.8 37.7 32.2 28.8 27.9 24.6 19.2 15.6 14.1 36.1
iCaRL with EQ 98.3 79.1 72.5 71.0 69.8 66.9 64.2 58.8 52.4 44.2 67.7
Ours w/o EQ 99.2 79.2 64.8 54.6 45.8 39.2 36.3 26.1 21.8 17.6 48.5
Ours w/o EM 98.3 86.3 65.4 51.7 47.1 40.8 37.6 34.1 26.4 25.8 51.4
Ours 99.6 94.8 88.5 88.3 87.3 81.9 78.7 76.8 71.8 63.2 83.2
Table 1: Quantitative comparisons on the ModelNet40 dataset; each column reports the accuracy (%) after learning the indicated number of classes. iCaRL with EQ denotes the iCaRL method with injected SO(3)-equivariance. Ours w/o EM indicates our model without exemplar memory. Ours w/o EQ denotes our model with the SO(3)-equivariance removed.
Figure 3: Investigation of accuracy tendency and forgetting rate on the ModelNet40 dataset.

4.1 Datasets and Implementation details

Regarding the datasets employed, ModelNet40 [54], which consists of clean 3D CAD object models from 40 classes, is divided into 9843 training samples and 2468 testing samples. We define the number of continual learning tasks as 10, and each task $\mathcal{T}_{i}$ contains 4 classes. The exemplar memory size $M$ is 500. The ScanObjectNN dataset [49] contains 15000 point cloud objects collected from 2902 unique real-world object instances in 15 categories. The number of continual learning tasks is 5, each task $\mathcal{T}_{i}$ contains 3 classes, and the memory size $M$ for the ScanObjectNN dataset is 400. We define the training and test environments to simulate the pose-agnostic case in realistic scenarios, where $\cdot/\cdot$ denotes the train/test situation: Aligned denotes that the pose of the model is well-aligned, and Pose-Agnostic denotes that the pose of the model is unknown due to an arbitrary rotation. We train our model on a server with 8 GeForce RTX 3090 GPUs (24GB).
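
For clarity, a minimal sketch of how these class-incremental splits could be constructed is shown below; the label array and the grouping order are illustrative assumptions.

```python
# ModelNet40's 40 classes grouped into 10 tasks of 4 classes each
# (5 tasks of 3 classes for ScanObjectNN's 15 categories).
import numpy as np

def make_task_splits(labels: np.ndarray, num_tasks: int):
    classes = np.unique(labels)                       # e.g. 40 classes for ModelNet40
    groups = np.array_split(classes, num_tasks)       # e.g. 10 groups of 4 classes
    return [np.where(np.isin(labels, g))[0] for g in groups]

# Example usage (assumed label array):
# task_indices = make_task_splits(train_labels, num_tasks=10)   # ModelNet40, M = 500
```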

4.2 Evaluation Metrics

We employ three metrics to evaluate the performance of our CL model: average accuracy, forgetting rate, and feature retention. In the test phase, the test samples also undergo agnostic geometric transformations to simulate the real world. (i) Average accuracy is reported as Avg Acc, similar to [43]. (ii) The forgetting rate $\mathcal{F}$ measures the accuracy drop from the first task, following the definition of [30]. (iii) Following [35], the feature retention $\mathcal{R}$ measures how much information is retained in the feature extractor. After the entire learning process is finished, $\mathcal{R}$ is computed as the performance ratio between the final model and the model obtained when the corresponding task arrived.
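
A minimal sketch of how the three metrics could be computed from an accuracy matrix is given below; the matrix layout and the exact averaging are illustrative assumptions consistent with the definitions above.

```python
# acc[i, j] = accuracy on task j measured after training on task i.
import numpy as np

def cl_metrics(acc: np.ndarray):
    num_tasks = acc.shape[0]
    avg_acc = acc[-1].mean()                   # Avg Acc: mean accuracy after the last task
    forgetting = acc[0, 0] - acc[-1, 0]        # F: accuracy drop on the first task
    # R: final-model performance relative to the performance measured when each
    # task arrived, averaged over tasks.
    retention = np.mean([acc[-1, j] / acc[j, j] for j in range(num_tasks)])
    return avg_acc, forgetting, retention
```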

4.3 Evaluation in the ModelNet40 dataset

As described in Table 1, we conduct the object classification task on the ModelNet40 dataset. From the per-task accuracy results, our model notably surpasses the other approaches on the average accuracy metric in the Aligned/Aligned case. In the Pose-Agnostic/Pose-Agnostic case, our model overcomes the challenge of maintaining satisfactory performance and avoiding catastrophic forgetting in an unfamiliar environment where the pose of the object is changeable. Our model achieves consistently similar results in both cases, whereas the other methods attain poor performance compared with their results in the well-aligned case.

Figure 3 illustrates the change in accuracy as new tasks arrive. Our model keeps a relatively high result on each task and achieves the best average accuracy (Avg Acc). Our model also presents the lowest forgetting rate $\mathcal{F}$ compared with the other methods. The ratio of information retained, $\mathcal{R}$, can be read from the slope of the accuracy curve. In Figure 3, our model exhibits a smooth decline from the first task to the final task, which indicates that the knowledge from the previous model is sufficiently transferred to the next model.

Figure 4: Investigation of accuracy tendency and forgetting rate on the ScanObjectNN dataset. A/A denotes the Aligned/Aligned situation; PA/PA indicates the Pose-Agnostic/Pose-Agnostic situation.
Ablation Studies

We conduct ablation studies in Table 1 to explore the effectiveness of the different components of our model. One study investigates the effect of equivariance injection. We additionally enforced equivariance in the iCaRL method, but it did not reach the performance obtained when the object pose is well-aligned: although injecting equivariance into the network alleviates the difficulty of the pose-agnostic case, it still requires a suitable structure to achieve good performance. On the other hand, we also tested removing the equivariance from our model. Even though this architecture obtained a high result on the first task, the accuracy decreased significantly as new tasks arrived. This result shows that equivariance contributes significantly to our model. Moreover, the exemplar memory also contributes substantially to the continual learning task. Thus, the studies show that each component of our model plays an indispensable role in achieving the desired performance.

4.4 Evaluation in the ScanObjectNN dataset

To challenge a more realistic scenario, we conducted an experiment on the ScanObjectNN dataset to evaluate our model. Different from the ModelNet40 dataset, the ScanObjectNN dataset is obtained from scanned indoor scene data with background noise. From Figure 4, we can see that our model also overcomes the challenge of the pose-agnostic problem. The iCaRL method performs well in the well-aligned scenario, but its average accuracy drops markedly under the pose-agnostic scenario. In contrast, our method performs similarly in the two scenarios. Due to the more complex environment, our model also exposes a limitation: it does not outperform the iCaRL method in the well-aligned scenario. However, we believe that our performance still has room for improvement with further fine-tuning.

5 Conclusion

In this paper, we address a realistic class-incremental continual learning scenario in which the pose of the object changes dynamically. To effectively address such a scenario, we propose to inject rotation equivariance as additional prior knowledge into the network and design a novel training framework to alleviate catastrophic forgetting. By injecting SO(3)-equivariance into the network, our model achieves competitive performance in both well-aligned and pose-agnostic scenarios. The experimental results on popular point cloud datasets demonstrate the effectiveness of our model. Specifically, our model sufficiently leverages the information among the network's feature maps to improve the efficiency of knowledge distillation. In addition, our approach removes the need for redundant data augmentation in the 3D domain. In future work, we will further investigate the relationship between other forms of prior knowledge and continual learning performance. We believe that our model will facilitate continual learning approaches for addressing real-world problems.

References

  • [1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision, pages 139–154, 2018.
  • [2] Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, and Jonghyun Choi. Rainbow memory: Continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8218–8227, 2021.
  • [3] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [4] Eden Belouadah and Adrian Popescu. Deesil: Deep-shallow incremental learning. In Proceedings of the European Conference on Computer Vision Workshops, pages 0–0, 2018.
  • [5] Eden Belouadah and Adrian Popescu. Il2m: Class incremental learning with dual memory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 583–592, 2019.
  • [6] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [7] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision, pages 233–248, 2018.
  • [8] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision, pages 532–547, 2018.
  • [9] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.
  • [10] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4794–4802, 2019.
  • [11] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
  • [12] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999, 2016.
  • [13] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018.
  • [14] Pim De Haan, Maurice Weiler, Taco Cohen, and Max Welling. Gauge equivariant mesh cnns: Anisotropic convolutions on geometric graphs. arXiv preprint arXiv:2003.05425, 2020.
  • [15] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2(6), 2019.
  • [16] Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and Leonidas J Guibas. Vector neurons: A general framework for so (3)-equivariant networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12200–12209, 2021.
  • [17] Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning so (3) equivariant representations with spherical cnns. In Proceedings of the European Conference on Computer Vision, pages 52–68, 2018.
  • [18] Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. In International Conference on Machine Learning, pages 3165–3176. PMLR, 2020.
  • [19] Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
  • [20] Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. Se (3)-transformers: 3d roto-translation equivariant attention networks. Advances in Neural Information Processing Systems, 33:1970–1981, 2020.
  • [21] Rohan Ghosh and Mehul Motani. Network-to-network regularization: Enforcing occam’s razor to improve generalization. Advances in Neural Information Processing Systems, 34, 2021.
  • [22] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3779–3787, 2019.
  • [23] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.
  • [24] Michael J Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. Lietransformer: Equivariant self-attention for lie groups. In International Conference on Machine Learning, pages 4533–4543. PMLR, 2021.
  • [25] David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
  • [26] Xisen Jin, Arka Sadhu, Junyi Du, and Xiang Ren. Gradient-based editing of memory examples for online task-free continual learning. Advances in Neural Information Processing Systems, 34, 2021.
  • [27] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • [28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
  • [29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [30] Seungwon Lee, James Stokes, and Eric Eaton. Learning shared knowledge for deep lifelong learning using deconvolutional networks. In International Joint Conferences on Artificial Intelligence, pages 2837–2844, 2019.
  • [31] Wonsung Lee, Kyungwoo Song, and Il-Chul Moon. Augmented variational autoencoders for collaborative filtering with auxiliary information. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1139–1148, 2017.
  • [32] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.
  • [33] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30, 2017.
  • [34] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  • [35] Sudhanshu Mittal, Silvio Galesso, and Thomas Brox. Essentials for class incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3513–3522, 2021.
  • [36] Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jahnichen, and Moin Nabi. Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11321–11329, 2019.
  • [37] Quang Pham, Chenghao Liu, and Steven Hoi. Dualnet: Continual learning, fast and slow. Advances in Neural Information Processing Systems, 34, 2021.
  • [38] Adrien Poulenard and Leonidas J Guibas. A functional approach to rotation equivariant non-linearities for tensor field networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13174–13183, 2021.
  • [39] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  • [40] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017.
  • [41] Jathushan Rajasegaran, Munawar Hayat, Salman H Khan, Fahad Shahbaz Khan, and Ling Shao. Random path selection for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
  • [42] Amal Rannen, Rahaf Aljundi, Matthew B Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1320–1328, 2017.
  • [43] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • [44] Mamshad Nayeem Rizve, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10836–10846, 2021.
  • [45] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
  • [46] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.
  • [47] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
  • [48] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219, 2018.
  • [49] Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international Conference on computer vision, pages 1588–1597, 2019.
  • [50] Gido M Van de Ven and Andreas S Tolias. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.
  • [51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [52] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics, 38(5):1–12, 2019.
  • [53] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 374–382, 2019.
  • [54] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
  • [55] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995. PMLR, 2017.
  • [56] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3713–3722, 2019.