Subspace Distillation for Continual Learning

Kaushik Roy [email protected], csiro.au Christian Simon [email protected] Peyman Moghadam [email protected], qut.edu.au Mehrtash Harandi [email protected] Monash University; Melbourne, VIC, Australia Australian national University; Canberra, ACT, Australia CSIRO, Data61; Brisbane, QLD, Australia Queensland University of Technology; Brisbane, QLD, Australia

Abstract

An ultimate objective in continual learning is to preserve knowledge learned in preceding tasks while learning new tasks. To mitigate forgetting prior knowledge, we propose a novel knowledge distillation technique that takes into the account the manifold structure of the latent/output space of a neural network in learning novel tasks. To achieve this, we propose to approximate the data manifold up-to its first order, hence benefiting from linear subspaces to model the structure and maintain the knowledge of a neural network while learning novel concepts. We demonstrate that the modeling with subspaces provides several intriguing properties, including robustness to noise and therefore effective for mitigating Catastrophic Forgetting in continual learning. We also discuss and show how our proposed method can be adopted to address both classification and segmentation problems. Empirically, we observe that our proposed method outperforms various continual learning methods on several challenging datasets including Pascal VOC, and Tiny-Imagenet. Furthermore, we show how the proposed method can be seamlessly combined with existing learning approaches to improve their performances. The codes of this article will be available at https://github.com/csiro-robotics/SDCL.

keywords:

Lifelong Learning , Subspace Distillation , Knowledge Distillation , Continual Semantic Segmentation , Catastrophic Forgetting , Background Shift , Continual Learning

^†^†journal: Neural Networks⁰⁰footnotetext: Manuscript submitted in May 2022. Manuscript accepted by Neural Networks in July 2023.

1 Introduction

Continual Learning (CL) is the process of robust, efficient and gradual learning in non-stationary environments. A fundamental aspect of intelligence is the capability of incrementally learning from sequential experiences. Equipping neural networks with CL capability requires the model to preserve its previously learned experiences while acquiring novel knowledge. Neural networks, trained in an offline mode, are currently the method of choice in a wide spectrum of problems in AI and machine learning. The underlying assumption here is that the model has the knowledge about all the decisions it should take in the future apriori. For example, all classes a model will encounter in future are known in an image classification task. Furthermore, in offline training, data used for training the model in future steps should be i.i.d, otherwise internal representations learned by the model are hardly useful.

In this paper, our focus is to design a mechanism that enables neural network model to learn continually in a dynamic environment. One may wonder is it advantageous for a model to learn sequentially like humans? Continual learning techniques will endow our machines to learn potentially over a lifetime, as does a human. Furthermore, having visual understanding and semantic segmentation in mind, continual adaptation to a changing target specification enables the model to learn a diverse, and growing set of classes. This aspect of continual learning is commonly considered as a necessity towards human-level artificial general intelligence [1]. Also, we note that continual learning methods could offer profound advantages for models even in stationary settings, by enabling them to improve their efficacy without the need to train from scratch upon availability of new data.

In a continual learning setting, current Deep Neural Networks (DNNs) exhibits a drastic fall in the overall performance when the model is trained on a series of tasks. Precisely, in absence of samples from old tasks, the performance degrades on previously encountered tasks after the model is trained on a novel tasks. In other words, the knowledge from previous tasks gets overwritten thoroughly and DNNs forget previously learned tasks abruptly once information relevant to novel tasks is presented [2]. This phenomenon of forgetting prior tasks because of the changes on critical weights related to previously observed tasks is often referred to as Catastrophic Forgetting [3, 4, 5, 6, 7]. Therefore, to design DNNs for Continual Learning, one needs to address Catastrophic Forgetting.

Refer to caption — Figure 1: Semantic shift of background classes at different learning steps in Continual Semantic Segmentation (CSS). Classes (*e.g*.. table and chair) from the old task at step $t$ and future task at step $t+1$ are collapsed into background at current task $t$ .

Another problem of interest in this paper is Continual Semantic Segmentation (CSS). Semantic segmentation [8, 9, 10] is the task of assigning a category label such as “person” or “vehicle” to every single pixel of an image. In class-incremental CSS problem, a model is sequentially exposed to learn a set of novel classes. At the end of each training step, the CSS model is supposed to classify a pixel with all the seen classes until current task for evaluation. Aside from catastrophic forgetting, in CSS, we need to tackle another fundamental problem, namely Background Shift [11].

In a conventional semantic segmentation setup, all object categories are predefined, and the class “background” encapsulates all other object categories that are not relevant to the problem at hand. In contrast and in CSS, at each learning step, the class background merely corresponds to categories that do not belong to any of the classes at the current step (see Fig. 1). As a result, the class background contains not only pixels from unseen and future classes but also pixels from previously seen and old classes. This setting can be considered as a dense prediction task with noisy labels as the future unseen classes or old seen ones are grouped under a super-class named background. If certain measures are not taken, the background shift could exacerbate the catastrophic forgetting even further.

A common way of addressing Catastrophic Forgetting and Background Shift is to distill the knowledge from old model to current one (see Fig. 3). Distillation methods such as LwF [12], PODnet [13]), often match the output/latent representation of a network, and hence ensuring that the prior knowledge remains unchanged in current model and the performance remains consistent on old tasks.

In this paper, we introduce a structured form of knowledge distillation [14] that is suitable for both Continual Learning with class-incremental setting and Continual Semantic Segmentation (CSS) [15, 11]. Precisely, we propose to distill the structure of the feature space in the intermediate layers of neural network to preserve the previously learned knowledge in the current model.

Our proposed method encodes the structure via low-dimensional subspaces. Subspaces have been used in a broad range of problems in computer vision to model the data manifold locally. Subspaces are robust to perturbation, and can be computed for high-dimensional data easily, hence has been employed with success for adapting neural networks [17, 18]. Therefore, to mitigate catastrophic forgetting in CL, we propose to maintain geometric structure of the feature space through encoding subspaces between models from sequential learning steps. Our approach starts with decomposing extracted feature map in the intermediate layers of deep neural network followed by constructing structures of it through selecting prominent subspaces that approximate the data manifold to the first-order. This enables us to impose constraint to maintain similar subspace structures between old and new models. In our method, we formulate the constraints using the geometry of Grassmannian [19], and propose to minimize the distance between corresponding feature subspaces of old and current model.

As a motivating example, to examine the ability of subspace distillation in improving catastrophic forgetting by preserving prior knowledge, we evaluate the performance of subspace distillation by adding it to existing state-of-the-art CSS methods, e.g.. ILT [15], MiB[11], and PLOP [16] and report the result in Fig. 2. The result shows that the average IOU for all of the three methods improves significantly on the Pascal VOC dataset. Furthermore, our proposed subspace distillation algorithm can be merged with other distillation techniques seamlessly and provides them with complementary constraints to enforce structural similarities.

Overall, our contributions in this paper are as follows:

1.

We propose a robust feature distillation strategy, namely Subspace Distillation (SD) to tackle catastrophic forgetting in CL through applying constraint on maintaining similar feature structure between old and new model.
2.

We present a generalized end-to-end continual learning framework using our proposed Subspace Distillation (SD) strategy in presence of a small subset of already observed samples from past tasks.
3.

Our proposed Subspace Distillation (SD) strategy requires backpropagation through Singular Value Decomposition (SVD) method as it relies on SVD to compute the basis of subspaces. We show how this can be done in closed form solution by using partial derivative.
4.

Our proposed approach outperforms state-of-the-art continual learning methods on MNIST, CIFAR10 and Tiny-Imagenet datasets with varying memory size.
5.

We also show that a significant improvement can be achieved by combining subspace distillation strategy with existing methods for CSS on Pascal VOC dataset for different short and long task settings.

2 Related Work

In this section we discuss related works in class-incremental learning and continual semantic segmentation as the targeted application.

2.1 Continual Learning

A variety of methods has been proposed to alleviate catastrophic forgetting for class-incremental classification problems [4, 12, 13, 20, 21, 22]. These methods for continual learning are classified primarily into three categories, such as regularization, dynamic architecture and memory replay based methods [23].

Regularization based methods preserve already learned information by imposing constraints on the update of weight [24, 25, 4, 26], intermediate feature representation [13, 27, 28], prediction [12, 29], gradient [30, 31], or combination thereof. Li et al. [12] in Learning without Forgetting (LwF) introduced knowledge distillation strategy in the output layer to minimize the dissimilarity between old task and new one. The significance of each synapses is measured to penalize update on most influential synapses and a surrogate loss function to estimate loss for old tasks is used with modified cost function in SI [26]. Hou et al. in [27] proposed cosine normalization to combat catastrophic forgetting and facilitate the seamless integration of new and prior knowledge within a continual learning model. By adjusting the magnitudes of the model’s weights, cosine normalization provides precise control over the influence of novel tasks. This mechanism ensures that the model does not prioritize the new task at the expense of previously learned tasks, thereby preserving valuable knowledge while accommodating novel knowledge. This normalization approach helps to mitigate the negative impact of catastrophic forgetting and enhances the model’s ability to generalize across multiple tasks. Cheraghian et al. in [32] introduced a semantic-aware distillation loss for few-shot class incremental learning that takes into account the semantic structure of the data. The distillation strategy utilizes semantic embeddings associated with each class to guide the distillation process. The integration of semantic information fosters the preservation of prior knowledge in the continual learning (CL) model by facilitating learning not only from the representations of the previous model but also from the semantic relationships among the classes. During the incremental learning process, PODNet in [13] used a Pooled Outputs Distillation (POD) mechanism to transfer knowledge from the previously learned tasks to the current task. Specifically, the outputs of the intermediate layers of the previous model are pooled and distilled into the corresponding layer of current model through minimizing the discrepancies using Euclidean distance of L2-normalized features. Mathematically, POD can be expressed as $L_{POD}=\left\|f_{i}^{old}({\mathbf{x}})-f_{i}^{new}({\mathbf{x}})\right\|^{2}$ , where $f_{i}^{old}({\mathbf{x}})$ and $f_{i}^{new}({\mathbf{x}})$ are the pooled output of the $i^{th}$ spatial position for input ${\mathbf{x}}$ using old and new model respectively. One of the limitations of PODNet is the absence of an explicit mechanism to effectively preserve the underlying latent structure, as it primarily relies on imposing constraints on the pooled features. Consequently, the presence of outliers or noisy attributes can potentially compromise the effectiveness of distillation in PODNet. In contrast, our proposed subspace distillation approach addresses this concern by imposing constraints on the low-dimensional subspaces derived from the latent features of both the old and current models. This not only enhances robustness to noise but also ensures the preservation of latent structure in continual learning scenarios.

Dynamic architecture based methods allocate new neurons to adapt to novel task. Andrei et al. [33] introduced progressive network where old knowledge remains unchanged by keeping previously trained model frozen and a novel sub-network with fixed resources is allocated to learn new knowledge. Yoon et al. [22] proposed dynamically extendable network (DEN) that learns a compact representation by selective training and expanding neural network capacity by optimal number of units when new task arrives. Recently, Douillard et al. in [34] proposed first transformer based architecture where dynamic expansion of task specific token is used.

Memory-based methods partially stores the previous data and train model by replaying stored old data together with new data [29, 35]. Rebuffi et al. [29] introduced iCARL method where herding based sample selection was used to keep a small portion of previous dataset in the memory and replayed interleaved with new samples. Many recent approaches have extended iCARL to bias correction problem for classifier [36], a metric learning model for imbalance dataset [27], a memory sampling method [37, 38, 39]. Javed et al. [36] introduced a dynamic threshold moving method to address the classifier bias generated by the knowledge distillation approach in iCARL. Hou et al. [27] in LUCIR proposed to combine inter-class separation constraint, old classes geometric structure preserving constraint and cosine normalization to tackle imbalance dataset problem. Aljundi et al. [37] claimed that memory sampling method is crucial, and accuse random memory sample selection for sub-optimal performance on old tasks. In [37] replay memory sampling is defined as a constrained optimization problem and formulated as solid angle minimization problem to maximize the diversity in the replay memory. While in iCARL, the memory constitutes samples randomly chosen based on the cluster centers, in [38] and [39], diverse sample selection mechanisms, including most interfered sample retrieval and global distribution matching based sampling are utilized.

To address the security concern, few recent approaches employed generative model to produce samples belonging to old classes [40] instead of storing subset of old samples in memory. Deep Generative Replay (DGR) [40] proposed a dual-model architecture, one for generating pseudo samples and another for solving tasks by replaying pseudo samples together with new samples. To reduce the memory footprint of storing real samples, compressed feature have been stored in [41, 42]. REMIND [41] used product quantization method to quantize latent representation and stored indices in memory that were used later to decode the representation for reply. However, because of incremental update in model, stored latent representation also requires adaptation. To fit the stored representation into current latent space, Iscen et al. [42] employed a multi-layer perceptron to map corresponding old and new feature map generated for images of current task.

Recently, Dark Experience Replay (DER++) [43] proposed to store both logit, and label for corresponding sample and used the reservoir sampling strategy to select samples from data stream for memory buffer. In DER++, knowledge distillation is performed by mapping output logit from current model with corresponding memory logit.

2.2 Continual Semantic Segmentation

In recent time, a growing number of works have emerged into continual semantic segmentation. In the literature, CSS methods primarily fall into two major categories: (i) regularization and (ii) Replay based model.

Following the success of regularization methods in CL, several works have proposed mechanism for controlling update of neuron weights to mitigate catastrophic forgetting in CSS [15, 11, 16, 44]. ILT [15] investigates a regularization-based technique by freezing weights of the encoder network after learning the first task and applying knowledge distillations for the upcoming tasks. ILT is also equipped with the masked cross-entropy loss and masks on the output of the current model to consider only the seen classes. However, this approach doesn’t resolve the background shift problem of CSS properly. Since the background in CSS may include previously seen objects, MiB [11] proposed to preserve the knowledge of an old model with knowledge distillation. However, the difference compared to standard knowledge distillation is that the probabilities of the background and the novel classes are appended such that the non-overlapping prediction of a novel class as the output of a current model can be aligned with the output of a previous model. As knowledge distillation shows a promising direction for CSS, PLOP [16] introduced pseudo-labelling to tackle the background shift problem and adapt Pooled Out Distillation (POD) [13] to preserve the previously learned knowledge. Since POD is developed for a classification problem in classical continual learning settings with reliance on global statistics, thus PLOP uses a multi-scale version of POD to integrate local and global statistics at different intermediate layers. SDR [44] leveraged contrastive learning together with novel sparsity constraint and prototype matching strategy to efficiently learn novel tasks and mitigate forgetting of prior knowledge. In SDR, to organize geometric structure of feature representation, clusters of data are described using prototypes that are forced to be closer in consecutive learning steps and far apart from one another by using prototype matching and repulsive force. Additional sparsity constraint imposed on feature representation of same classes helped to construct well-separated and tight cluster as well as create space for accommodating novel classes.

Replay based methods tackle the catastrophic forgetting in CSS by either storing real images from past tasks [45] or generating synthetic data of prior classes [46]. Cha et al. in SSUL [45] proposed to extract future classes from background that are defined as unknown classes to facilitate learning novel classes. SSUL used a subset of old samples for the first time in the literature of CSS to improve stability and plasticity of model. Additionally, freezing weights of encoder and old classifiers with binary cross entropy loss, and pseudo-labeling techniques helped to improve catastrophic forgetting in [45]. Maracani et al. [46] employed generative model to produce samples from previously seen tasks together with newly proposed background impainting method for pseudo-labeling to tackle background shift and catastrophic forgetting.

3 Preliminaries

In continual learning, a model needs to learn from a series of tasks. Each learning task is represented by its training set, often samples from novel classes or concepts. Let $\mathds{T}=\{\mathcal{T}_{1},\mathcal{T}_{2},\cdots,\mathcal{T}_{T}\}$ be a sequence of $T$ tasks. In the setup we are interested in this work, every task comprises of

\mathcal{D}^{t}=\Big{\{}\big{(}{\mathbf{X}}_{i},y_{i}\big{)}\Big{\}}_{i=1}^{n_{t}}

where ${\mathbf{X}}_{i}\in\mathcal{X}$ denotes a training image of size $W\times H$ and $y_{i}\in\mathcal{Y}$ is the corresponding class belonging to current task $t$ . We also maintain a fixed size memory $\mathcal{M}$ to retain a small subset of samples from old tasks to better tackle the catastrophic forgetting. The goal of our knowledge distillation based approach is to apply constraints on updating weights of model so that the model generates similar latent representation and prediction in future tasks (i.e., $t,t+1,\cdots$ ) for what it has learned previously (i.e., $1,2,\cdots,t-1$ ).

There is a subtle difference in setup when CL is considered for semantic segmentation. For the problem of continual semantic segmentation, the training set consists of input samples and their corresponding segmentation mask. We denote the training set of CSS by

\mathcal{D}^{t}=\Big{\{}\big{(}{\mathbf{X}}_{i},{\mathbf{Y}}_{i}\big{)}\Big{\}}_{i=1}^{n_{t}}

where ${\mathbf{Y}}_{i}\in\mathcal{Y}$ is the corresponding segmentation mask for the input image ${\mathbf{X}}_{i}$ . Each pixel of the image belongs to a set of classes given by the current task $C^{t}$ . Previously observed classes (i.e., $\mathbb{C}^{t-1}$ ) and future classes (e.g., $\mathbb{C}^{t+1}$ ) are all labeled as the background class $c_{bg}$ for task $t$ . In both continual semantic segmentation and classification tasks, at step $t$ , the model should be able to predict all the observed classes, $\mathbb{C}^{1:t}$ , throughout the learning experience.

3.1 Evaluating an Incremental Learning Model

Class-Incremental Learning The performance of continual learning methods are measured using the average task accuracy [25] in experiments. Average accuracy is computed by average performance across all the previously observed and current tasks after training on current task $t$ and is defined as: $\mathrm{Acc}_{t}=\frac{1}{t}\sum_{i=1}^{t}\mathrm{Acc}_{t,i},$ (1) where $Acc_{t,i}$ is the accuracy of task $i$ after learning task $t$ .

Continual Semantic Segmentation Intersection Over Union (IOU)[47] metric is commonly used to evaluate the robustness towards catastrophic forgetting, in other words stability, and the ability of learning new class (i.e., plasticity) of CSS methods. IOU is computed at the end of learning all task at step $t$ for (i) initial set of classes $C^{1}$ at first task, (ii) incrementally leaned classes $C^{2:t}$ , and (iii) all classes $C^{1:t}$ . IOU is defined as follows $\displaystyle\mathrm{IOU}=\dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$ (2) where $TP$ , $FP$ and $FN$ refers to true-positive, false-positive and false-negative, respectively.

3.2 Regularization with Knowledge Distillation

Distilling knowledge from old model to the current model has shown promises to mitigate catastrophic forgetting of neural network in a CL setting [12, 13]. Here, the old model with the knowledge of already observed tasks acts as a teacher model and the purpose is to distill knowledge from teacher model to student model (i.e., current model) such that prediction of these two models matches for previously seen tasks samples. Knowledge distillation is often performed on the probability space to relate the temperature smoothed probability distribution of old to the new model [12, 11]. Feature distillation on the other hand is performed on the feature space extracted from the intermediate layers of neural network to match corresponding local or global statistics of feature maps between old and new model [44, 15, 13, 16].

Assume that the extracted feature and the prediction using teacher model from step $t-1$ for input ${\mathbf{X}}$ are ${\mathbf{f}}^{t-1}$ and $\hat{y}^{t-1}$ , respectively. Similarly, let ${\mathbf{f}}^{t}$ and $\hat{y}^{t}$ be the feature and prediction respectively for the same input using the student model at step $t$ . The knowledge distillation is performed between teacher and student model by minimizing the following loss function

\displaystyle\mathrm{L}_{\text{kd}}({\mathbf{X}};\Theta^{t})\coloneqq-\sum_{c\in{\mathbb{C}^{1:t}}}\hat{y}^{t-1}_{c}~{}\log\,(\hat{y}^{t}_{c})\;.

(3)

Feature distillation strategy applies constraint on the similarity between old and new feature representation and is performed by minimizing a notion of distance between corresponding features of ${\mathbf{f}}^{t-1}$ and ${\mathbf{f}}^{t}$ , such as the distance induced by the $\ell_{1}$ norm as

\displaystyle\text{L}_{\ell{1}}({\mathbf{X}};\Theta^{t})\coloneqq\big{\|}{{\mathbf{f}}^{t-1}-{\mathbf{f}}^{t}}\big{\|}_{1}\;.

(4)

4 Methodology

In this section, we first discuss our main contribution, the Subspace Distillation (SD) , and its properties. Next, we will elaborate on how the SD will be used for classification and segmentation problems.

Subspace Distillation

Let ${\mathbf{F}}_{i}^{t}\in\mathbb{R}^{d\times p}$ and ${\mathbf{F}}_{i}^{t-1}\in\mathbb{R}^{d\times p}$ be the extracted features from layer $i$ of DNNs at step $t$ and $t-1$ , respectively. The goal of feature distillation is to ensure that the learned knowledge at step $t-1$ remain unchanged at step $t$ while training DNNs on novel dataset $\mathcal{D}^{t}$ . To avoid cluttering equations, from now on we drop the layer index, unless it is not clear from the context. In conventional feature distillation approaches, old feature maps or statistics are directly matched with corresponding new one. That is, in order to obtain ${\mathbf{f}}_{i}^{t-1},{\mathbf{f}}_{i}^{t}$ from ${\mathbf{F}}_{i}^{t-1},{\mathbf{F}}_{i}^{t}$ for distillation per Eq. 4, different pooling strategies (i.e.., channel, height or width pooling) have been discussed in PODnet [13].

Our hypothesis here is that distilling geometric structure of feature distribution can enrich the distillation process and will help mitigating the catastrophic forgetting in continual learning. In doing so, we propose to model feature maps ${\mathbf{F}}^{t}$ and ${\mathbf{F}}^{t-1}$ with a set of subspaces, $\mathbb{S}=\{\mathcal{S}_{j}\}_{j=1}^{\tau}$ . Soon, we will discuss how the set of subspaces will be constructed for the classification and segmentation problems, but for now, we focus on the main idea. Each subspace $\mathcal{S}\in\mathbb{S}$ is represented by its basis as $\mathbb{R}^{d\times m}\ni{\mathbf{P}};m\ll d$ with ${\mathbf{P}}^{\top}{\mathbf{P}}=\mathbf{I}_{m}$ . Note that, $m\leq p$ . Our goal here is to preserve the subspace structure of the intermediate feature maps extracted from different layers of old model to the new one. We argue that, robust knowledge distillation through maintaining similarity across subspaces will be advantageous to improve continual classification/segmentation paradigms. This is achieved by enforcing a constraint on subspace similarity between the old and new model. To do so, we propose to minimize the distance between corresponding subspace constructed from feature maps of the old and new model.
A valid distance between $\mathcal{S}_{i}$ and $\mathcal{S}_{j}$ is a distance that is invariant to the choice of the basis of the subspace. To be more specific, assume ${\mathbf{P}}_{i}\in\mathbb{R}^{d\times m}$ and ${\mathbf{P}}_{j}\in\mathbb{R}^{d\times m}$ are the basis for $\mathcal{S}_{i}$ and $\mathcal{S}_{j}$ , i.e., ${\mathbf{P}}_{i}{\mathbf{P}}_{i}^{\top}={\mathbf{P}}_{j}{\mathbf{P}}_{j}^{\top}=\mathbf{I}_{m}$ . Then a distance between $\mathcal{S}_{i}$ and $\mathcal{S}_{j}$ should meet :

d(\mathcal{S}_{i},\mathcal{S}_{j})=g({\mathbf{P}}_{i},{\mathbf{P}}_{j})=g({\mathbf{P}}_{i}{\mathbf{R}}_{i},{\mathbf{P}}_{j}{\mathbf{R}}_{j})\;,

where $g:\mathbb{R}^{d\times m}\times\mathbb{R}^{d\times m}\to\mathbb{R}_{+}$ is the distance function and ${\mathbf{R}}_{i},{\mathbf{R}}_{j}\in\mathcal{O}_{m}$ with $\mathcal{O}_{m}$ denoting the orthogonal group.

To be precise, given ${\mathbf{P}}_{i}^{t}$ and ${\mathbf{P}}_{i}^{t-1}$ , we opt to minimize the projection metric [48] as

\displaystyle\delta_{p}^{2}({\mathbf{P}}_{i}^{t},{\mathbf{P}}_{i}^{t-1})\coloneqq\left\|{\mathbf{P}}_{i}^{t\top}{\mathbf{P}}_{i}^{t}-{\mathbf{P}}_{i}^{t-1\top}{\mathbf{P}}_{i}^{t-1}\right\|^{2}_{F}=2m-2\left\|{\mathbf{P}}_{i}^{t\top}{\mathbf{P}}_{i}^{t-1}\right\|^{2}_{F}\;.

(5)

The projection metric is a proper distance on Grassmannian and endows intriguing properties, among them, the length of curves on Grassmannian obtained by $\delta_{p}(\cdot,\cdot)$ is simply related to the length obtained by the geodesic distance via a fixed constant [48]. In order to illustrate the rationale behind computing subspace distance using Eq.(5), we present a trivial example involving the XY plane residing within three-dimensional space ( $\mathbb{R}^{3}$ ) as a two-dimensional subspace. Consider the XY plane in $\mathbb{R}^{3}$ as a 2D subspace in 3D space. Both

\displaystyle{\mathbf{P}}_{1}

\displaystyle=\begin{pmatrix}1&0\\ 0&1\\ 0&0\\ \end{pmatrix}

and

\displaystyle{\mathbf{P}}_{2}

\displaystyle=\begin{pmatrix}\cos(\pi/4)&-\sin(\pi/4)\\ \sin(\pi/4)&\cos(\pi/4)\\ 0&0\\ \end{pmatrix}

represent the XY plane; hence their distance should be zero. The distance used in Eq.(5), i.e., $\left\|{\mathbf{P}}_{i}^{t\top}{\mathbf{P}}_{i}^{t}-{\mathbf{P}}_{i}^{t-1\top}{\mathbf{P}}_{i}^{t-1}\right\|^{2}_{F}$ not only satisfies the required conditions, but also closely related to the geodesics on the Grassmannian [48].
In Eq. (5), the mapping $f:\mathbb{R}^{d\times m}\to\mathbb{R}^{m\times m};f({\mathbf{P}})={\mathbf{P}}{\mathbf{P}}^{\top}$ is a diffeomorphism between Grassmannian and the space of symmetric matrices (positive semidefinite to be adequate). The induced distance $\delta_{p}^{2}({\mathbf{P}}_{i}^{t},{\mathbf{P}}_{i}^{t-1})=\left\|{\mathbf{P}}_{i}^{t\top}{\mathbf{P}}_{i}^{t}-{\mathbf{P}}_{i}^{t-1\top}{\mathbf{P}}_{i}^{t-1}\right\|^{2}_{F}$ is invariant to the action of the orthogonal group, satisfying the requirement of having a Grassmannian distance. Furthermore, since ${\mathbf{P}}^{\top}{\mathbf{P}}=\mathbf{I}_{m}$ , the distance can be simplified to $\delta_{p}^{2}({\mathbf{P}}_{i}^{t},{\mathbf{P}}_{i}^{t-1})=2m-2\left\|{\mathbf{P}}_{i}^{t\top}{\mathbf{P}}_{i}^{t-1}\right\|^{2}_{F}$ , which is computationally very attractive in comparison to geodesics on Grassmannian that require computing SVD.

Before formulating the SD loss for continual learning, note that in practice and in order to construct the subspace representing feature map, we rely on matrix decomposition techniques, and in particular on Singular Value Decomposition (SVD). In other words, given ${\mathbf{F}}_{i}^{t}$ and ${\mathbf{F}}_{i}^{t-1}$ , we first apply SVD to attain ${\mathbf{P}}_{i}$ and ${\mathbf{P}}_{i}^{t-1}$ and then use them accordingly for distillation. This operation differs from many common operations in deep learning, in the sense that a somewhat complicated matrix operation is involved, hence one may wonder how backpropagation will work in this case.

Backpropagation through SVD. First note that for $\delta_{p}^{2}({\mathbf{P}}_{i}^{t},{\mathbf{P}}_{i}^{t-1})$ , we have

\displaystyle\nabla_{{\mathbf{P}}}\triangleq\frac{\partial}{\partial{\mathbf{P}}_{i}^{t}}\delta_{p}^{2}\big{(}{\mathbf{P}}_{i}^{t},{\mathbf{P}}_{i}^{t-1}\big{)}=\frac{\partial}{\partial{\mathbf{P}}_{i}^{t}}\bigg{\{}2m-2\left\|{\mathbf{P}}_{i}^{t\top}{\mathbf{P}}_{i}^{t-1}\right\|^{2}_{F}\bigg{\}}=-2\frac{\partial}{\partial{\mathbf{P}}_{i}^{t}}\Tr\Bigg(\Big{(}{\mathbf{P}}_{i}^{t\top}{\mathbf{P}}_{i}^{t-1}\Big{)}^{\top}\Big{(}{\mathbf{P}}_{i}^{t\top}{\mathbf{P}}_{i}^{t-1}\Big{)}\Biggl{missing})=-4{\mathbf{P}}^{t-1}\Big{(}{\mathbf{P}}^{t-1}\Big{)}^{\top}{\mathbf{P}}^{t}\;.

(6)

Please note that the subspace from previous model acts as a teacher for distillation and we only need the gradient with respect to ${\mathbf{P}}_{i}^{t}$ to update our current model, hence the above derivation. The next step to update the weights of the current model is to obtain the gradient of the loss with respect to ${\mathbf{F}}_{i}^{t}$ , which requires us to backpropagate $\nabla_{{\mathbf{P}}}$ through the SVD operation. By applying chain rule (see [49] for backpropagation through matrix operations) ,

\displaystyle\Tr\Big(\nabla_{{\mathbf{P}}}^{\top}{\mathbf{P}}_{i}^{t}\Big{missing})=\Tr\Big(\nabla_{{\mathbf{F}}}^{\top}{\mathbf{F}}_{i}^{t}\Big{missing})\;,

(7)

with

\displaystyle\nabla_{{\mathbf{F}}}\triangleq\frac{\partial}{\partial{\mathbf{F}}_{i}^{t}}\delta_{p}^{2}\big{(}{\mathbf{P}}_{i}^{t},{\mathbf{P}}_{i}^{t-1}\big{)}\;.

(8)

Feature map ${\mathbf{F}}_{i}^{t}$ is decomposed as ${\mathbf{F}}_{i}^{t}={\mathbf{P}}_{i}^{t}{\mathbf{\Sigma}}{\mathbf{Q}}^{\top}$ , where ${\mathbf{F}}_{i}^{t}\in\mathbb{R}^{d\times p}$ , ${\mathbf{P}}_{i}^{t}\in\mathbb{R}^{d\times m}$ , ${\mathbf{\Sigma}}\in\mathbb{R}^{m\times p}$ , ${\mathbf{Q}}\in\mathbb{R}^{p\times p}$ and $m\leq p$ with the constraints $\big{(}{\mathbf{P}}_{i}^{t}\big{)}^{\top}{\mathbf{P}}_{i}^{t}={\mathbf{Q}}^{\top}{\mathbf{Q}}=\mathbf{I}$ . It can be shown that (see A for the derivation)

\displaystyle\nabla_{{\mathbf{F}}}={\mathbf{D}}{\mathbf{Q}^{\top}}-{\mathbf{P}}_{i}^{t}\Big{(}\big{(}{\mathbf{P}}_{i}^{t}\big{)}^{\top}{\mathbf{D}}\Big{)}_{\text{diag}}{\mathbf{Q}}^{\top}-2{\mathbf{P}}_{i}^{t}{\mathbf{\Sigma}}\Big{(}{\mathbf{K}}^{\top}\circ\big{(}{\mathbf{D}}^{\top}{\mathbf{P}}_{i}^{t}{\mathbf{\Sigma}}\big{)}\Big{)}_{\text{sym}}{\mathbf{Q}}^{\top}\;,

(9)

where ${\mathbf{K}}\in\mathbb{R}^{p\times p}$ is defined as follows

{\mathbf{K}}_{ij}=\begin{cases}\dfrac{1}{\sigma_{i}^{2}-\sigma_{j}^{2}},&i\neq j\\ 0,&i=j\end{cases}

(10)

\displaystyle{\mathbf{D}}=\nabla_{{\mathbf{F}}}{\mathbf{\Sigma}}^{-1}_{m}\;.

(11)

Here, ${\mathbf{\Sigma}}_{m}\in\mathbb{R}^{m\times m}$ consists of top $m$ rows and first $m$ columns of ${\mathbf{\Sigma}}$ . Note that, ${\mathbf{A}}_{\text{diag}}$ is the diagonal part of ${\mathbf{A}}$ (i.e., all off-diagonal elements are set to $0$ ).

4.1 Subspace Distillation for Continual Classification

For the task of continual classification, we distill the subspace constructed across samples in a mini-batch. For a mini-batch $\big{(}\mathcal{X}_{B},\mathcal{Y}_{B}\big{)}$ of size $b$ , let ${\mathbf{f}}_{j}\in\mathbb{R}^{d}$ be the latent representation for input ${\mathbf{X}}_{j}\in\mathcal{X}_{B}$ . The formulation below can be applied to any layer in a DNN and hence we drop the layer index for the sake of simplicity. Assume there are $p=\lfloor b/|\mathbb{C}^{t}|\rfloor$ samples per class in the mini-batch. In SD, we propose to compute class-wise subspaces using latent features generated from both old and new model. More specifically, from the mini-batch, we form $\{{\mathbf{F}}^{t-1}_{k}\}_{k=1}^{|\mathbb{C}^{t}|}$ and $\{{\mathbf{F}}^{t}_{k}\}_{k=1}^{|\mathbb{C}^{t}|}$ with ${\mathbf{F}}^{t-1}_{k},{\mathbf{F}}^{t}_{k}\in\mathbb{R}^{d\times p}$ by stacking corresponding samples in class $k$ into a matrix (i.e., ${\mathbf{F}}^{t}_{k}=\big{[}{\mathbf{f}}_{k,1},\cdots,{\mathbf{f}}_{k,p}\big{]}$ , where $\mathcal{Y}_{B}\ni y_{k,i}=k$ ). We note that stacking ordering is not important in forming ${\mathbf{F}}^{t-1}_{k},{\mathbf{F}}^{t}_{k}$ as we are merely interested in subspace spanned by the samples. Next, we represent each class $k$ from the teacher and the student model by its low-dimensional subspace ${\mathbf{P}}^{t-1}_{k},{\mathbf{P}}^{t}_{k}\in\mathbb{R}^{d\times m},m\leq p$ . This is achieved by applying thin SVD to ${\mathbf{F}}^{t-1}_{k},{\mathbf{F}}^{t}_{k}$ and picking up the top left singular vectors. With the above, the SD loss is defined as

\displaystyle\ell_{\text{SD}}^{\text{CL}}\big{(}\mathcal{X}_{B},\mathcal{Y}_{B}\big{)}\coloneqq\dfrac{1}{|\mathbb{C}^{t}|}\sum_{k=1}^{|\mathbb{C}^{t}|}\Big{(}2m-2\left\|{\mathbf{P}}_{k}^{t\top}{\mathbf{P}}_{k}^{t-1}\right\|^{2}_{F}\Big{)}\;.

(12)

4.2 Subspace Distillation for Continual Semantic Segmentation

For class-incremental semantic segmentation problem, we propose to compute subspace across the channel dimension of intermediate feature map for each sample in a batch. To reduce the memory overhead while performing subspace distillation, instead of computing subspace from all feature maps at a layer, we split the full feature maps into several smaller group $G$ with $p$ channels. To compute a subspace from each group, we need to form a matrix representation from the feature map. To do so, for a particular input, we form $\{{\mathbf{F}}^{t-1}_{g}\}_{g=1}^{G}$ and $\{{\mathbf{F}}^{t}_{g}\}_{g=1}^{G}$ with ${\mathbf{F}}^{t-1}_{g},{\mathbf{F}}^{t}_{g}\in\mathbb{R}^{d\times p}$ from the feature maps. Here, $d=wh$ , where $h$ and $w$ denote the height and width of the feature map. We encode the geometry of ${\mathbf{F}}^{t-1}_{j},{\mathbf{F}}^{t}_{j}$ by low-dimensional subspaces ${\mathbf{P}}^{t-1}_{g},{\mathbf{P}}^{t}_{g}\in\mathbb{R}^{d\times m}$ via SVD. The subspace distillation loss is defined as

\displaystyle\ell_{\text{SD}}^{\text{CSS}}\big{(}{\mathbf{X}}\big{)}\coloneqq\dfrac{1}{G}\sum_{g=1}^{G}\Big{(}2m-2\left\|{\mathbf{P}}_{g}^{t\top}{\mathbf{P}}_{g}^{t-1}\right\|^{2}_{F}\Big{)}\;.

(13)

4.3 Class-Incremental Continual Learning using Subspace Distillation

In class-incremental learning, we maintain a fixed memory to store a subset of samples using reservoir sampling strategy [50] from prior tasks. The samples in the memory are subsequently used during training the model on a novel task. We compute distillation loss between subspace constructed from latent feature maps extracted by feeding memory samples to old and new model. Afterwards, by minimizing subspace distillation loss combined with classification loss on novel and memory samples, we adapt DNNs model for class-incremental learning. The classification loss used in conjunction with the subspace distillation loss is the cross entropy defined as:

\displaystyle\ell_{\text{CE}}^{\text{CL}}({\mathbf{X}},y)=-\sum_{c\in{C^{1:t}}}y_{c}~{}\log~{}(\tilde{y}^{t}_{c})\;,

(14)

where, $y_{c}$ , and $\tilde{y}^{t}_{c}$ are the true label and predicted probability for class $c$ respectively for the input $X$ and $\tilde{y}^{t}=f_{\Theta^{t}}({\mathbf{X}})$ . Putting everything together, the overall loss used to train our model to tackle catastrophic forgetting in continual class incremental learning is

\displaystyle\mathcal{L}_{\mathrm{CL}}(\mathcal{D}^{t};\Theta^{t})\coloneqq\operatorname*{\mathbb{E}}\nolimits_{({\mathbf{X}},y)\sim\mathcal{D}^{t},~{}({\mathbf{X}^{\prime}},y^{\prime})\sim\mathcal{M}}\Big{[}\ell_{\mathrm{CE}}^{\text{CL}}\big{(}{\mathbf{X}},y\big{)}+\alpha\ell_{\mathrm{CE}}^{\text{CL}}\big{(}{\mathbf{X}^{\prime}},y^{\prime}\big{)}+\beta\ell_{\mathrm{SD}}^{\text{CL}}\big{(}{\mathbf{X}^{\prime}},y^{\prime}\big{)}\Big{]}\;,

(15)

where, ${\mathbf{X}}$ represents the images belonging to novel task while ${\mathbf{X}^{{}^{\prime}}}$ represents the examples from the memory buffer. $\alpha$ and $\beta$ are hyper-parameter used to control the contribution of the second loss term and subspace distillation, respectively. We present the overall steps of training a continual learning model using subspace distillation in Algorithm 1.

Algorithm 1 Class-Incremental Learning using Subspace Distillation

\text{Dataset}~{}\mathcal{D}^{t},~{}\text{Memory}~{}\mathcal{M},~{}\text{and Model from step}~{}t-1,~{}h_{\Theta^{t-1}}=h_{\text{feat}}^{t-1}\circ h_{\text{cls}}^{t-1}

2:The new model at time

t

with parameters

\Theta^{t}

\text{Initialize }\Theta^{t}~{}\text{ with }\Theta^{t-1}

4:for iteration

1

to max_iter do

5: Sample a mini batch

(\mathcal{X}_{B},\mathcal{Y}_{B})

from

{\mathcal{D}^{t}}

6: Sample a mini batch

({\mathbf{X}}^{\prime},{y}^{\prime})

from the memory

{\mathcal{M}}

\mathcal{\tilde{Y}}_{B}\leftarrow h_{\Theta^{t}}(\mathcal{X}_{B})

{\mathbf{f}}^{t}\leftarrow h_{\text{feat}}^{t}({\mathbf{X}}^{\prime})

\tilde{y}^{\prime}\leftarrow h_{\text{cls}}^{t}({\mathbf{f}}^{t})

10:

{\mathbf{f}}^{t-1}\leftarrow h_{\text{feat}}^{t-1}({\mathbf{X}}^{\prime})

11: Compute Cross Entropy loss,

\ell_{\mathrm{CE}}

between ground truth

\mathcal{Y}_{B}

and prediction

\mathcal{\tilde{Y}}_{B}

using Eq. (14)

12: Compute Cross Entropy loss,

\ell_{\mathrm{CE}}

between ground truth

y^{\prime}

and prediction

\tilde{y^{\prime}}

using Eq. (14)

13: Compute Subspace Distillation loss,

\ell_{\text{SD}}^{\text{CL}}

between

{\mathbf{f}}^{t-1}

and

{\mathbf{f}}^{t}

with Eq. (12)

14: Update

\Theta^{t}

by minimizing the overall loss defined based on two cross entropy loss,

\ell_{\mathrm{CE}}

computed for

\mathcal{X}_{B}

, and

{\mathbf{X}^{\prime}}

respectively and subspace distillation loss,

\ell_{\mathrm{SD}}

as in Eq. (15)

15:end for

4.4 Continual Semantic Segmentation using Subspace Distillation

Since image pixels belonging to prior classes are labeled as background in CSS, old model is employed for distilling knowledge. The idea of knowledge distillation is a crucial step and widely adapted in preserving prior knowledge for CSS. In CSS, we apply subspace distillation loss at intermediate layers of model to maintain consistency in geometric structure of latent features. Additionally, we use output distillation to ensure that the current model mimics the output of prior model. In other words, subspace distillation is combined with classification loss and output distillation loss to train CSS model at any step $t$ . Bellow, we briefly discuss the classification loss and output distillation loss for CSS.

4.4.1 Classification Loss

In semantic segmentation, the cross entropy loss is often applied at each pixel of the output generated by the DNN. However, in CSS and to train the model at time $t$ , the background class may include prior classes from already seen tasks. Hence, keeping background shift problem of CSS in mind, we define the cross entropy loss as follows:

\displaystyle\ell_{\mathrm{CE}}^{CSS}({\mathbf{X}},{\mathbf{Y}};\Theta^{t})=-\dfrac{1}{|{\mathbf{X}}|}\sum_{(x,y)\in({\mathbf{X}},{\mathbf{Y}})}\sum_{c\in{C^{t}}}~{}{y}^{t}_{x,c}\log~{}(\tilde{y}^{t}_{x,c})\;,

(16)

where ${y}^{t}_{x,c}$ is the ground truth label at pixel $x$ for class $c$ and $\tilde{y}^{t}_{x,c}$ is defined as follows

\tilde{y}^{t}_{x,c}=\begin{cases}\hat{y}^{t}_{x,c}~{}\text{if label is not background}\\ \sum_{c^{{}^{\prime}}\in{C^{1:t-1}}}\hat{y}^{t}_{x,c^{{}^{\prime}}}~{}\text{if label is background}.\\ \end{cases}

(17)

Here, $\hat{y}^{t}_{x,c}$ is the predicted probability for class $c$ at pixel $x$ using model at task $t$ .

4.4.2 Knowledge Distillation

Distilling knowledge from the old model to current one is crucial to mitigate the catastrophic forgetting and accurate semantic segmentation in continual learning setting. Output distillation by mimicking the predicted probability of old model to the new model has been widely used in the literature of continual learning [12, 51]. However, because of background shift problem, conventional knowledge distillation approach cannot be applied directly in the continual semantic segmentation. In this work, similar to MiB method, we follow the masked cross entropy by relating the old model’s prediction to the new model’s prediction after combining new classes probability with background. Therefore, the adapted output distillation loss is defined as follows:

\displaystyle\ell_{\mathrm{KD}}^{CSS}({\mathbf{X}},{\mathbf{Y}})=-\dfrac{1}{|{\mathbf{X}}|}\sum_{x\in{\mathbf{X}}}\sum_{c\in{C^{1:t-1}}}y^{t-1}_{x,c}~{}\log(\tilde{y}^{t}_{x,c})

(18)

where $y^{t-1}_{x,c}$ is the predicted probability using $f_{\theta^{t-1}}$ for class $c$ at pixel $x$ in image ${\mathbf{X}}$ . We define $\tilde{y}^{t}_{x,c}$ as follows:

\tilde{y}^{t}_{x,c}=\begin{cases}y^{t}_{x,c}&\text{if label $c\in{C^{1:t-1}}$ is not background}\\ \sum_{c^{{}^{\prime}}\in{C^{t}}}\limits y^{t}_{x,c^{{}^{\prime}}}&\text{if label $c\in{C^{1:t-1}}$ is background}.\\ \end{cases}

(19)

Since the modified distillation loss does not directly match the predicted background class probability of old model to the new model, the distillation loss plays a vital role in tackling the background label shift problem in class-incremental learning for semantic segmentation.

Finally, the combined loss of our end-to-end continual semantic segmentation method can be written as the linear combination of classification loss, $\ell_{\mathrm{CE}}$ , feature distillation using newly proposed subspace distillation loss, $\ell_{\mathrm{SD}}$ , and knowledge distillation loss, $\ell_{\mathrm{KD}}$

\displaystyle\mathcal{L}_{\mathrm{CSS}}\big{(}\mathcal{D}^{t};\Theta^{t}\big{)}\coloneqq\operatorname*{\mathbb{E}}\nolimits_{({\mathbf{X}},{\mathbf{Y}})\sim\mathcal{D}^{t}}\Big{[}\ell_{\mathrm{CE}}^{CSS}\big{(}{\mathbf{X}},{\mathbf{Y}}\big{)}+\alpha\ell_{\mathrm{KD}}^{CSS}\big{(}{\mathbf{X}},{\mathbf{Y}}\big{)}+\beta\ell_{\mathrm{SD}}^{CSS}\big{(}{\mathbf{X}}\big{)}\Big{]}

(20)

Here, $\alpha$ and $\beta$ are the predefined hyperparameter used to control the contribution of $\ell_{\mathrm{KD}}^{CSS}$ and $\ell_{\mathrm{SD}}^{CSS}$ respectively. Algorithm 2 summarizes the steps needed to be taken to train a new model for CSS.

Algorithm 2 Subspace Distillation for CSS

1:Dataset

\mathcal{D}^{t}

, Model from previous step

t-1

h_{\Theta^{t-1}}=h_{\text{encoder}}^{t-1}\circ h_{\text{decoder}}^{t-1}

2:New model at step

t

with parameter

\Theta^{t}

\text{Initialize }\Theta^{t}~{}\text{ with }\Theta^{t-1}

4:for iteration

1

to max_iter do

5: Sample a mini batch

(\mathcal{X}_{B},\mathcal{Y}_{B})

from

{\mathcal{D}^{t}}

{\mathbf{F}}^{t},\hat{\mathcal{Y}_{B}}^{t}\leftarrow h_{\Theta^{t}}(\mathcal{X}_{B})

{\mathbf{F}}^{t-1},\hat{\mathcal{Y}_{B}}^{t-1}\leftarrow h_{\Theta^{t-1}}(\mathcal{X}_{B})

8: Compute Cross Entropy loss,

\ell_{\mathrm{CE}}

between ground truth

\mathcal{Y}_{B}

and prediction

\mathcal{\tilde{Y}}_{B}^{t}

using Eq. (16)

9: Compute Output Distillation loss,

\ell_{\mathrm{KD}}

between the prediction from current and old model,

\mathcal{\tilde{Y}}_{B}^{t}

and

\mathcal{\tilde{Y}}_{B}^{t-1}

respectively using Eq. (18)

10: for

l\leftarrow 1

L

11: Split

{\mathbf{F}}^{t}[l]

and

{\mathbf{F}}^{t-1}[l]

into sub group

12: Compute Layer-wise Subspace Distillation loss,

\ell_{\mathrm{SD}}

between

{\mathbf{F}}^{t-1}[l]

and

{\mathbf{F}}^{t}[l]

using Eq. (13)

13: end for

14: Update

\Theta^{t}

by minimizing the linear combination of

\ell_{\mathrm{CE}}

\ell_{\mathrm{KD}}

and

\ell_{\mathrm{SD}}

with Eq. (20)

15:end for

5 Experiments

We start this section by describing the datasets, architectures, and implementation details used in our experiments for both continual image classification and semantic segmentation. For classification problem we focus on class-incremental and task-incremental settings while in case of continual semantic segmentation problem, we only follow the class-incremental setting.

Continual Learning for Classification

We evaluate our proposed method on 3 different benchmark datasets: MNIST [52], CIFAR-10 [53], Tiny Imagenet [54]. Following DER settings, To quantitatively demonstrate the effectiveness of our proposed subspace distillation method in different tasks settings, we split the MNIST, and CIFAR-10 datasets into sequences of 5 tasks having 2 classes per task. Tiny Imagenet is split in 10 tasks with equal number of classes in all sequential tasks (20 classes per task). In our comparative analysis we consider eight state-of-the-art regularization and distillation methods including LwF, oEWC, SI, iCARL, A-GEM, ER, DER and DER++. In our experiment, we follow class-incremental (CI) and task-incremental (TI) protocols described in [55]. The identity of task is provided along with input sample in TI setting while in case of CI setting, task identity is absent. In our experiments, we partition data into distinct sets of non-overlapping classes, hence applicable in both class-incremental and task-incremental scenarios. Furthermore, we maintain a consistent ordering of all classes across all algorithms, guaranteeing that each algorithm receives identical data for every task. Compared to the class-incremental scenario, where task identity is missing, task-incremental learning benefits from having access to task identifiers, which in return helps in selecting appropriate classifiers, rendering it a comparatively easier scenario. Conversely, the class-incremental scenario poses a challenge due to the absence of task identity.

Implementation Details

Task-incremental learning can be implemented using either a multi-head or single-head classifier, depending on the specific implementation. Following the implementation of DER [51], in our approach, we employed a single-head classifier model to learn in class-incremental scenarios. However, in task-incremental settings, we relied on the output masking technique of the single-head classifier to identify task-specific classes, leveraging the availability of task identity during inference. The output masking technique is used to selectively mask specific outputs of the single-head classifier, depending on the task at hand. This approach allows the single-head classifier to effectively prioritize the relevant outputs for the current task while disregarding the irrelevant ones. Following the setting described in [51, 56], a neural network with 2 fully connected layers of 100 neurons are used to extract latent feature for MNIST dataset. For CIFAR and Tiny Imagenet datasets, a modified Resnet18-like [29] structure is used for feature extraction. Finally, a single head linear classifier is used for classification across the experiments on MNIST, CIFAR and Tiny Imagenet datasets. We augment both stream and memory samples by applying random crop and horizontal flip for both CIFAR10 and Tiny-Imagenet datasets [51].

The SGD optimizer is used for training DNN model throughout the learning experiences with keeping flexibility in selection of batch size and mini-batch size for different task setting. Please refer to the appendix for the task-specific values of hyperparameters. Following DER training scheme, we train our model for one epoch at each learning step on MNIST dataset while for relatively complex dataset such as CIFAR10 and Tiny-Imagenet we use 50 and 100 epochs respectively for training.

Continual Semantic Segmentation

We benchmark our proposed method against state-of-the-art methods with different task settings on Pascal-VOC 2012 [57] dataset. We follow the experimental setup used in [11] for VOC dataset, baseline implementation and metric. Precisely, in our comparative study, we consider eight methods, elastic weight consolidation (EWC) [4], learning without forgetting (LwF) [12], Riemannian walk (RW) [25], ILT [15], MiB [11], SDR [44], GIFS [58] and PLOP [16],

The Pascal-VOC 2012 dataset has 10582 training images and 1449 validation images which is used for testing. Each pixel belongs to 20 foreground classes against background. We use three different tasks settings: 19-1, 15-5 and 15-5s in the experiments. In the first two settings we incrementally add 1 and 5 novel classes on the base model trained on 19 and 15 classes respectively. However, in the 15-5 setting we add 1 class in 5 consecutive tasks while base model remains similar to the 15-5 setting. In our evaluation of continual semantic segmentation methods, we follow class overlapped setting where task consists of images that may contain classes belonging to future task with a label of background.

Implementation Details

In our implementation, we use a Deeplab-V3 [59] architecture with ResNet-101 [60] as a backbone that is pretrained on Imagenet [61]. Following [62], to reduce required memory, we use inplace activated batch normalization for training our model. We use Stochastic Gradient Decent (SGD) with a momentum of 0.9 and learning decay of 1e-4 to train our model. Following [16], we crop the images to $512\times 512$ followed by applying random horizontal flip while training our model at each step on Pascal VOC dataset. While computing subspace distillation, we construct subspaces at intermediate layers using a group of 32 channels and we consider top 5 subspaces in our distillation strategy. We train our model with a learning rate of .001 from the second task while first task is trained with a higher learning rate of .01 and each task model is trained for 30 epochs with a batch size of 48 distributed over 4 GPU. We use $\alpha=10$ and $\beta=0.01$ while combining output distillation and KD with pixel-wise cross-entropy loss to compute overall loss. For computational efficiency, we employ O1 optimization from Nvidia APEX library¹¹1https://github.com/NVIDIA/apex to train with half precision. Finally, we use the validation image set to evaluate our model. Below, we present the experimental results of our proposed subspace distillation method for both continual learning with classification and continual semantic segmentation problem.

5.1 Continual Learning

Method	S-MNIST		S-CIFAR-10		S-Tiny Imagenet
Method	Task-IL	Class-IL	Task-IL	Class-IL	Task-IL	Class-IL
JOINT	99.65	97.92	98.29	92.20	82.04	59.87
SGD	87.15	19.90	61.02	19.61	17.93	7.79
LwF [12]	99.25	20.07	63.28	19.59	15.79	8.46
oEWC [63]	99.10	20.00	68.27	19.47	19.20	7.56
SI [26]	99.07	19.97	68.05	19.46	35.97	6.58
	Online Data Stream Setting with Tiny Memory (Buffer Size: 100)
ER [20]	97.72	73.80	77.85	32.87	28.07	5.85
DER [51]	98.48	77.12	80.72	32.43	27.73	4.26
Ours (SD)	98.35	79.37	81.65	35.1	30.11	6.05
	Small Memory (Buffer Size: 200)
iCARL [29]	98.28	70.51	88.99	49.02	28.19	7.53
ER [20]	97.86	80.43	91.19	44.79	38.17	8.49
DER [51]	98.80	84.55	91.40	61.93	40.22	11.87
Ours (SD)	97.71	85.28	92.88	61.85	39.52	8.54
DER [51] + SD	98.86	86.54	92.07	66.12	42.63	12.26
	Medium Memory (Buffer Size: 500)
iCARL [29]	98.81	74.55	88.22	47.55	31.55	9.38
ER [20]	98.89	86.57	93.61	57.74	48.64	9.99
DER [51]	98.84	90.54	93.40	70.51	51.78	17.75
Ours (SD)	99.00	89.00	94.86	71.85	48.60	10.03
DER [51] + SD	98.98	91.47	94.68	75.96	52.74	19.43
	Large Memory (Buffer Size: 5120)
iCARL [29]	98.32	70.60	92.23	55.07	40.83	14.08
ER [20]	99.33	93.40	96.98	82.47	67.29	27.40
DER [51]	99.29	94.9	95.43	83.81	69.50	36.73
Ours (SD)	99.69	95.90	97.18	84.75	69.13	29.32
DER [51] + SD	99.44	95.33	96.77	86.32	69.47	37.27

Table 1: Average top-1 accuracy on splited MNIST, CIFAR-10 and Tiny Imagenet for 5, 5 and 10 tasks respectively for standard L3 benchmarks. Best values are represented in bold. Higher is better. Results for LwF, oEWC, SI, iCARL, ER, and DER are from [51].

We analyze the efficacy of subspace distillation in continual learning setting for classification problem and present the result in Table 1. In comparison, we consider two knowledge distillation based methods (LwF [12], iCARL [29]), two regularization methods (SI [26], oEWC [63]) and three memory replay based methods (ER [20], DER [51], DER++ [51]). Additionally, we provide upper and lower bounds, that reflect to jointly training on tasks one by one or on all tasks with Stochastic Gradient Descent (SGD) without any specialized strategy designed for CL.

The results suggest that regularization methods performs poorly on class incremental learning setting as those methods are developed focusing on task incremental learning setting. This observation indicates that regularization towards the set of old parameters is not suitable for tackling catastrophic forgetting because of the local information modeling weight importance [51]. Overall, replay based methods outperform regularization methods with big margin across the datasets. Regularization methods performs good on MNIST datasets for task incremental settings while they perform notably worse on comparatively complex datasets, e.g. CIFAR10, Tiny-Imagenet. We observe that our method outperforms state-of-the-art replay base methods in task incremental learning setting on all datasets. Noticeably, for class incremental learning setting on split CIFAR-10 dataset our method performs noticeably better by 15% and 8% percentage points compared to iCARL and ER methods, respectively. However, DER method performs slightly better than subspace distillation method as DER relies on storing additional information (i.e., logits) together with corresponding image and label from past tasks in the memory for the distillation that requires additional memory requirements. We observe that by combining dark experience replay method DER, we can further improve our result. For instance, we can achieve about 2% and 5% performance gain on Tiny-Imagenet and CIFAR10 datasets respectively with medium size memory buffer.

We evaluate our model in a more constrained setting where each sample at a task is presented to model once. In other terms, we train our model for a single epoch at each task given a tiny buffer size of 100 samples. Overall, because of the complex setting, continual learning model faces underfitting problem and struggles to mitigate catastrophic forgetting. Replay-based distillation methods also suffer from poor performance when there are insufficient exemplars available, especially for previously observed classes where no memory exemplars exist. For instance, in S-Tiny Imagenet, during the final task, with just 100 memory exemplars, only one example is stored for the 100 previously seen classes out of a total of 180 prior classes, leading to a subpar performance of replay-based distillation methods. We observe that our method considerably improves on baseline ER method across the datasets and performs significantly better than DER. For example, our method enjoys around 2% improvement on both CIFAR-10 and Tiny-Imagenet datasets than state-of-the-art DER method for class incremental setting. However, we think that this scenario deserves further investigation because of its complex nature.

In task-incremental setting, we notice better performance compared to class-incremental setting for all methods across the datasets. The reason behind such observation is the presence of task identifier at the test time in task-incremental learning that makes the problem easier. We see that SI and oEWC perform badly on relatively complex datasets such as CIFAR10, and Tiny-Imagenet though regularization based methods are designed particularly for task-incremental scenario and our method outperforms both methods with unbridgeable gap in performance. Subspace distillation method performs remarkably better than iCARL across the settings with different memory size. For example, our method outperforms iCARL by 15% percentage points with large memory buffer on Tiny-Imagenet, and by 24% percentage points with medium size memory on CIFAR10 dataset. We also note that our method shows competitive performance with other memory replay based methods. We see around 2% improvement on ER for task incremental setting on Tiny Imagenet dataset while the performance gain is around 15% on CIFAR10 dataset for both small and medium size memory. Furthermore, by combining our method with DER, we see 4%, 5% and 3% performance improvements on DER with small, medium and large size memory buffer for task-incremental setting on CIFAR10 dataset.

5.2 Continual Semantic Segmentation

Table 2 reports the IOU of different baseline strategies and subspace distillation method for three scenarios: (i) 19-1, (ii) 15-5 and (iii) 15-1 overlap scenarios on Pascal-VOC dataset. The results suggest that our proposed subspace distillation based method outperforms state-of-the-art methods in 19-1 and 15-5 task settings by a significant margin. Though our method does not match the performance of PLOP in 15-5s task setting, we observe about 13% IOU improvement on MiB at last task. Furthermore our method outperforms ILT by around 30% in IOU of last task. We note that subspace distillation shows consistency in retaining prior knowledge across the settings. To show the efficacy of subspace distillation method in tacking catastrophic forgetting, we combine SD with ILT and PLOP methods, and see improvements on IOU in different settings. We observe that by combining SD with ILT the overall performance (i.e., mIOU) of ILT method improves by about 3% and 2% for 19-1 and 15-5 task settings respectively while performance remains similar for 15-5s setting. Similarly, we notice 2% performance improvement of PLOP methods after combining SD for 19-1 and 15-5s settings which suggests that SD imposes additional constraint on distillation strategy using multi-scale POD that helps learning complimentary information and tackling catastrophic forgetting better together. Though our method shows little lower plasticity in case of learning new classes, subspace distillation shows promises in retaining prior knowledge and tackling forgetting old classes.

Method	19-1 (2 tasks)			15-5 (2 tasks)			15-1 (6 tasks)
Method	0-19	20	All	0-15	16-20	All	0-15	16-20	All
JOINT	77.60	76.60	77.50	79.0	72.80	77.50	79.00	72.80	77.50
SGD	6.80	12.90	7.10	2.10	33.10	9.80	0.20	1.80	0.60
EWC [4]	26.90	14.00	26.30	24.30	35.50	27.10	0.30	4.30	1.30
LwF [12]	51.20	8.50	49.10	58.90	36.60	53.30	1.00	3.90	1.80
RW [25]	23.30	14.20	22.90	16.60	34.90	21.20	0.00	5.20	1.30
SDR [44]	71.30	23.40	69.0	76.30	50.20	70.10	47.30	14.70	39.50
GIFS [58]	57.88	32.82	56.69	23.61	16.43	21.90	59.36	13.89	48.53
ILT [15]	67.75	10.88	65.05	67.08	39.23	60.45	8.75	7.99	8.56
ILT [15] + SD	72.18	28.02	70.05	70.28	42.63	63.70	6.52	9.03	7.12
PLOP [16]	75.35	37.35	73.54	75.73	51.71	70.09	65.12	21.11	54.64
PLOP [16] + SD	76.50	46.39	75.07	75.63	50.17	69.57	66.83	21.56	56.05
MiB [11]	71.43	23.59	69.15	76.37	49.97	70.08	34.22	13.50	29.29
Ours (MiB [11] + SD)	76.09	20.16	73.43	78.10	51.21	71.70	50.62	14.41	42.00

Table 2: Performance of different continual semantic segmentation method in IOU on Pascal VOC dataset for three continual overlap class learning settings: (i) 19-1 (ii) 15-5 and (iii) 15-1 tasks. Best values are represented in bold. Results for EWC, LwF, RW, ILT, PLOP and MiB are extracted from PLOP [16] paper while the result for SDR and GIFS are collected from the corresponding original paper. Higher is better. SD: Subspace Distillation.

Fig. 4 reports the continual evolution of mean IOU for 15-5s (6 tasks) overlap setting on Pascal-VOC dataset. As depicted in the figure, the performance of MiB drops significantly throughout the learning steps while PLOP shows strength in preserving already learned knowledge and at each learning step PLOP performs significantly better than MiB. We observe considerably improved performance across the learning steps when subspace distillation is added with MiB. SD facilitates MiB to preserve previously learned knowledge and therefore we note about 9%, and 12% improvement of mean IOU at task 5, and 6 respectively after adding SD with MiB. Similarly, SD helps PLOP tackling catastrophic forgetting better and we see improvements on mean IOU at the end of all learning tasks.

6 Ablation Studies

In this section, we evaluate our model in terms of changes in activation map, and feature representation in different learning steps in class incremental learning setting on CIFAR10 dataset. Furthermore, in case of continual semantic segmentation on Pascal VOC dataset, we investigate the contribution of each regularizer in subspace distillation strategy and analyze our proposed method with (i) varying dimensionality of subspace and (ii) different no. of channels to construct the subspace.

[Uncaptioned image] — Table 3: The transition of what neural network is looking at in different learning tasks on CIFAR10 dataset using GradCAM [64]. The model trained using our subspace distillation method consistently attends in the regions of the images where the feature of the cat or bird (*i.e*.., around the head or body) is located while the activated regions shift significantly for state-of-the-art DER++ and DER.

Changes in Activation Map. We examine our model in terms of where neural network is looking at the input images for decision making and whether the activated region has been changed over time in continual learning setting. To do so, We first train a neural network on 5 tasks CIFAR10 datasets and, at the end of learning different tasks, we feed the model with an image that is presented to the model at second step followed by computing heatmap images using GradCAM [65]. We present the result in Figure 3 showing the evolution of the important regions in the images. The result suggests that DER method activated the body part of cat while our method looks at face region of cat image. Furthermore subspace distillation consistently activates the same regions and the changes in the activation map is lower than the state-of-the-art of DER++ method. Similarly, both DER and DER++ show inconsistency in activating the interested region where bird is located in the image while SD method exhibits robustness in activating the body part of bird in consecutive tasks.

Similarity in Representation. We analyze our models capability of retaining previously learned knowledge by maintaining similar feature representation in the intermediate layer of neural network while learning a series of tasks. Precisely, we employ Centered Kernel Alignment (CKA) metric [66] to compute similarity between intermediate feature representation generated by neural network at task $t$ and $t+1$ on test dataset of CIFAR10 and report the comparative result in Figure 5. Overall we observe that the CKA similarity steadily increases as the model faces more tasks and we think that is because the model becomes more stable as it gets adapted incrementally on novel dataset in presence of samples from prior tasks in memory buffer. We notice that our proposed subspace distillation method consistently maintains higher CKA similarity of feature representation throughout the learning experiences. For instance, SD method outperforms baseline method, ER, on CKA similarity metric by 8% in the first task while the gap reduces to 3% in the last task.

No. of Channels	15-5s (6 tasks)			19-1 (2 tasks)
No. of Channels	0-15	16-20	All	0-19	20	All
16	66.73	21.57	55.97	76.47	46.38	75.04
32	66.83	21.56	56.05	76.50	46.39	75.07
64	66.75	21.84	56.06	76.48	46.23	75.04

Table 4: Mean IOU for 15-5s and 19-1 incremental task setting on Pascal-VOC overlapped dataset with different no. of channels to compute subspace using SVD.

Varying no. of channels for computing subspace. We evaluate the effect of dimensionality of channels being used to construct the subspace on Pascal VOC dataset for 15-5s and 19-1 tasks setting by combining SD with PLOP. With the increasing no. of channels for constructing subspace, we do not observe any considerable changes in both stability and plasticity as the IOU for new classes and mean IOU for old classes remains stable. As reported in 4, subspace distillation exhibits robustness against the varying no. of channels used to construct subspace.

Dimension of Subspace	19-1 (2 tasks)
Dimension of Subspace	0-19	20	All
1	76.45	46.19	75.00
3	76.49	46.43	75.06
5	76.46	46.32	75.03
7	76.46	46.33	75.03

Table 5: Mean IOU for 19-1 overlap incremental task setting on Pascal-VOC dataset with different subspace dimension.

Subspace Dimensionality. To also examine the effect of dimensionality of subspace used in distilling structure for continual semantic segmentation, we combine subspace distillation with PLOP method and report the experimental result on Pascal VOC dataset for 19-1 overlap tasks setting in 5. The results suggest that, with the increase of subspace dimension, the performance on the already observed classes remains similar and subspace dimensionality exhibits less impact on the overall performance.

Contribution of each Regularizer. In order to evaluate the contribution of each regularization term in our proposed subspace distillation method for continual semantic segmentation and class-incremental learning, we conducted experiments on the Pascal VOC dataset with varying numbers of tasks and S-CIFAR10 dataset (see Tab. 6). Results in Tab. 6(a) suggest that both regularizers, $\ell_{\mathrm{KD}}^{CSS}$ and $\ell_{\mathrm{SD}}^{CSS}$ , contribute to improving overall performance across the different settings. For example, in the (19-1) 2-task setting, combining regularizer KD with the baseline using CE led to an increase in performance of around 9%. We also noted an additional 4% improvement on both previously observed classes and overall performance when regularizer SD was combined with KD in our proposed subspace distillation method. At the same time, the experimental findings displayed in Tab. 6(b) demonstrate that the combination of subspace distillation (SD), $\ell_{\mathrm{SD}}^{CL}$ , and memory replay, $\ell_{\mathrm{CE}}^{CL}$ yields substantial improvements in overall accuracy and forgetting on the S-CIFAR10 dataset. Specifically, there is an approximate increase of 9% in accuracy and a reduction of 15% in forgetting.

Method			19-1 (2 tasks)
CE	KD	SD	0-19 (Old Classes)	Overall
✓	✗	✗	62.30	59.85
✓	✓	✗	71.43 (+9.13)	69.15 (+9.30)
✓	✓	✓	76.09 (+4.66)	73.43 (+4.28)

(a) Experimental results of Continual Learning (CL) methods on Pascal VOC dataset for continual (19-1) 2 tasks overlap class learning settings.

Method			S-CIFAR10 (5 tasks; C-IL setting)
CE	Memory Replay	SD	Avg. Accuracy	Forgetting
✓	✗	✗	17.32	80.65
✓	✓	✗	38.97 (+21.65)	43.34 (-37.31)
✓	✓	✓	47.68 (+8.71)	27.86 (-15.48)

(b) Experimental results of CL methods trained for one epoch with 500 memory exemplars on the S-CIFAR10 dataset.

Table 6: Experimental results of (a) Continual Semantic Segmentation on Pascal VOC, and (b) Class-Incremenal laerning on S-CIFAR10 indicate that incorporating subspace distillation with memory replay significantly improves the performance of the CL method.

Computational Complexity. Our proposed distillation loss requires computing the subspace basis employing SVD. For an input feature vector ${\mathbf{F}}\in\mathbb{R}^{d\times m};d>m$ , computational complexity of SVD is $\mathcal{O}\big{(}d\times m\times\min(d,m)\big{)}=\mathcal{O}(dm^{2})$ . Assuming, ${\mathbf{P}}^{t}\in\mathbb{R}^{d\times n}$ and ${\mathbf{P}}^{t-1}\in\mathbb{R}^{d\times n}$ be the basis of top $n$ subspaces used in our computation, the computation cost of ${{\mathbf{P}}^{t}}^{\top}{\mathbf{P}}^{t-1}$ is $\mathcal{O}(n\times d\times n)=\mathcal{O}(dn^{2})$ . Therefore, the total computation cost of our proposed subspace distillation loss can be expressed as $\mathcal{O}(dm^{2}+dn^{2})$ . During our experiments on the 5-tasks CIFAR10 dataset, we found that one iteration of our replay-based subspace distillation (SD) method requires approximately 60ms on Tesla P100-SXM2 GPU for a batch of 32 samples, while the replay-based vanilla SGD method requires about 37ms. We rely on the assumption that a limited buffer memory is available throughout the learning process to store exemplars from prior tasks that might restrict the use of our method in privacy-focused applications.

7 Conclusion

In this work, we propose a generalised end-to-end continual learning framework where subspace distillation is at the core of it. Here, we model low dimensional intermediate feature representations using subspaces. By imposing constraint on maintaining similar subspace between old and new model, we ensure robustness in the model towards catastrophic forgetting. Proposed subspace distillation is equally effective for classification and semantic segmentation problem in continual learning scenarios. Empirical analysis shows that our proposed framework with subspace distillation achieves state-of-the-art performance in multiple settings on Pascal-VOC dataset for continual semantic segmentation and MNIST, CIFAR10, and Tiny-Imagenet datasets for class-incremental classification problem. In future, we would like to investigate more about the efficient use of subspace distillation for long tasks setting for CSS as well as for complex continual learning setting when model is trained on data stream that is presented to model once at training time.

Acknowledgments

P.M. and K.R. gratefully acknowledge co-funding of the project by the CSIRO’s Machine Learning and Artificial Intelligence Future Science Platform (MLAI FSP). K.R. also acknowledges funding from the CSIRO’s ResearchPlus Postgraduate Scholarship. M.H. gratefully acknowledges the support from the Australian Research Council (ARC), project DP230101176.

References

[1] R. Hadsell, D. Rao, A. A. Rusu, R. Pascanu, Embracing change: Continual learning in deep neural networks, Trends in cognitive sciences.
[2] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: A review, Neural Networks 113 (2019) 54–71.
[3] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, Y. Bengio, An empirical investigation of catastrophic forgetting in gradient-based neural networks, arXiv preprint arXiv:1312.6211.
[4] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the national academy of sciences 114 (13) (2017) 3521–3526.
[5] A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science 7 (2) (1995) 123–146.
[6] R. M. French, Catastrophic forgetting in connectionist networks, Trends in cognitive sciences 3 (4) (1999) 128–135.
[7] S. Thrun, Lifelong learning algorithms, in: Learning to learn, Springer, 1998, pp. 181–209.
[8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
[9] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: CVPR, 2015, pp. 3431–3440.
[10] V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE TPAMI 39 (12) (2017) 2481–2495.
[11] F. Cermelli, M. Mancini, S. R. Bulo, E. Ricci, B. Caputo, Modeling the background for incremental learning in semantic segmentation, in: CVPR, 2020, pp. 9233–9242.
[12] Z. Li, D. Hoiem, Learning without forgetting, IEEE TPAMI 40 (12) (2017) 2935–2947.
[13] A. Douillard, M. Cord, C. Ollion, T. Robert, E. Valle, Podnet: Pooled outputs distillation for small-tasks incremental learning, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, Springer, 2020, pp. 86–102.
[14] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531.
[15] U. Michieli, P. Zanuttigh, Incremental learning techniques for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[16] A. Douillard, Y. Chen, A. Dapogny, M. Cord, Plop: Learning without forgetting for continual semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4040–4050.
[17] C. Simon, P. Koniusz, R. Nock, M. Harandi, Adaptive subspaces for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4136–4145.
[18] T. Zhang, P. Ji, M. Harandi, W. Huang, H. Li, Neural collaborative subspace clustering, in: International Conference on Machine Learning, PMLR, 2019, pp. 7384–7393.
[19] A. Edelman, T. A. Arias, S. T. Smith, The geometry of algorithms with orthogonality constraints, SIAM journal on Matrix Analysis and Applications 20 (2) (1998) 303–353.
[20] D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, G. Wayne, Experience replay for continual learning, in: NeurIPS, 2019, pp. 350–360.
[21] A. Mallya, S. Lazebnik, Packnet: Adding multiple tasks to a single network by iterative pruning, in: CVPR, 2018, pp. 7765–7773.
[22] J. Yoon, E. Yang, J. Lee, S. J. Hwang, Lifelong learning with dynamically expandable networks, arXiv preprint arXiv:1708.01547.
[23] S. Ebrahimi, M. Elhoseiny, T. Darrell, M. Rohrbach, Uncertainty-guided continual learning with bayesian neural networks, arXiv preprint arXiv:1906.02425.
[24] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, T. Tuytelaars, Memory aware synapses: Learning what (not) to forget, in: ECCV, 2018, pp. 139–154.
[25] A. Chaudhry, P. K. Dokania, T. Ajanthan, P. H. Torr, Riemannian walk for incremental learning: Understanding forgetting and intransigence, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 532–547.
[26] F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, Proceedings of machine learning research 70 (2017) 3987.
[27] S. Hou, X. Pan, C. C. Loy, Z. Wang, D. Lin, Learning a unified classifier incrementally via rebalancing, in: CVPR, 2019, pp. 831–839.
[28] P. Dhar, R. V. Singh, K.-C. Peng, Z. Wu, R. Chellappa, Learning without memorizing, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5138–5146.
[29] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, C. H. Lampert, icarl: Incremental classifier and representation learning, in: CVPR, 2017, pp. 2001–2010.
[30] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, arXiv preprint arXiv:1706.08840.
[31] A. Chaudhry, M. Ranzato, M. Rohrbach, M. Elhoseiny, Efficient lifelong learning with a-gem, arXiv preprint arXiv:1812.00420.
[32] A. Cheraghian, S. Rahman, P. Fang, S. K. Roy, L. Petersson, M. Harandi, Semantic-aware knowledge distillation for few-shot class-incremental learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2534–2543.
[33] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive neural networks, arXiv preprint arXiv:1606.04671.
[34] A. Douillard, A. Ramé, G. Couairon, M. Cord, Dytox: Transformers for continual learning with dynamic token expansion, arXiv preprint arXiv:2111.11326.
[35] A. Gepperth, C. Karaoguz, A bio-inspired incremental learning architecture for applied perceptual problems, Cognitive Computation 8 (5) (2016) 924–934.
[36] K. Javed, F. Shafait, Revisiting distillation and incremental classifier learning, in: Asian conference on computer vision, Springer, 2018, pp. 3–17.
[37] R. Aljundi, M. Lin, B. Goujaud, Y. Bengio, Gradient based sample selection for online continual learning, in: NeurIPS, 2019, pp. 11816–11825.
[38] R. Aljundi, E. Belilovsky, T. Tuytelaars, L. Charlin, M. Caccia, M. Lin, L. Page-Caccia, Online continual learning with maximal interfered retrieval, in: NeurIPS, 2019, pp. 11849–11860.
[39] D. Isele, A. Cosgun, Selective experience replay for lifelong learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.
[40] H. Shin, J. K. Lee, J. Kim, J. Kim, Continual learning with deep generative replay, in: NeurIPS, 2017, pp. 2990–2999.
[41] T. L. Hayes, K. Kafle, R. Shrestha, M. Acharya, C. Kanan, Remind your neural network to prevent catastrophic forgetting, in: European Conference on Computer Vision, Springer, 2020, pp. 466–483.
[42] A. Iscen, J. Zhang, S. Lazebnik, C. Schmid, Memory-efficient incremental learning through feature adaptation, in: European Conference on Computer Vision, Springer, 2020, pp. 699–715.
[43] A. Douillard, Y. Chen, A. Dapogny, M. Cord, Tackling catastrophic forgetting and background shift in continual semantic segmentation, arXiv preprint arXiv:2106.15287.
[44] U. Michieli, P. Zanuttigh, Continual semantic segmentation via repulsion-attraction of sparse and disentangled latent representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1114–1124.
[45] S. Cha, Y. Yoo, T. Moon, et al., Ssul: Semantic segmentation with unknown label for exemplar-based class-incremental learning, Advances in Neural Information Processing Systems 34.
[46] A. Maracani, U. Michieli, M. Toldo, P. Zanuttigh, Recall: Replay-based continual learning in semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7026–7035.
[47] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International journal of computer vision 88 (2) (2010) 303–338.
[48] M. Harandi, R. Hartley, C. Shen, B. Lovell, C. Sanderson, Extrinsic methods for coding and dictionary learning on grassmann manifolds, IJCV 114 (2) (2015) 113–136.
[49] C. Ionescu, O. Vantzos, C. Sminchisescu, Matrix backpropagation for deep networks with structured layers, in: ICCV, 2015, pp. 2965–2973.
[50] J. S. Vitter, Random sampling with a reservoir, ACM Trans. Math. Softw. 11 (1) (1985) 37–57.
[51] P. Buzzega, M. Boschini, A. Porrello, D. Abati, S. Calderara, Dark experience for general continual learning: a strong, simple baseline, arXiv preprint arXiv:2004.07211.
[52] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[53] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images.
[54] Stanford, Tiny ImageNet Challenge (CS231n), http://tiny-imagenet.herokuapp.com/ (2015).
[55] G. M. van de Ven, A. S. Tolias, Three scenarios for continual learning, arXiv preprint arXiv:1904.07734.
[56] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, G. Tesauro, Learning to learn without forgetting by maximizing transfer and minimizing interference, arXiv preprint arXiv:1810.11910.
[57] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes challenge: A retrospective, International journal of computer vision 111 (1) (2015) 98–136.
[58] F. Cermelli, M. Mancini, Y. Xian, Z. Akata, B. Caputo, A few guidelines for incremental few-shot segmentation, arXiv preprint arXiv:2012.01415.
[59] L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587.
[60] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[61] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, Ieee, 2009, pp. 248–255.
[62] S. R. Bulo, L. Porzi, P. Kontschieder, In-place activated batchnorm for memory-optimized training of dnns, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5639–5647.
[63] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, R. Hadsell, Progress & compress: A scalable framework for continual learning, in: ICML, PMLR, 2018, pp. 4528–4537.
[64] J. Gildenblat, contributors, Pytorch library for cam methods, https://github.com/jacobgil/pytorch-grad-cam (2021).
[65] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
[66] S. Kornblith, M. Norouzi, H. Lee, G. Hinton, Similarity of neural network representations revisited, in: International Conference on Machine Learning, PMLR, 2019, pp. 3519–3529.

Appendix A Proof of Eq. 9

For the sake of simplicity, we avoid feature map index $i$ and task identifier $t$ in the following proof. Let, feature map ${\mathbf{F}}$ be a ${d\times p}$ matrix and $d\geq p$ . Then, ${\mathbf{F}}$ can be decomposed with ${\mathbf{F}}={\mathbf{P}}{\mathbf{\Sigma}}{\mathbf{Q}}^{\top}$ , where ${\mathbf{P}}\in\mathbb{R}^{d\times m}$ , ${\mathbf{\Sigma}}\in\mathbb{R}^{m\times p}$ , and ${\mathbf{Q}}\in\mathbb{R}^{p\times p}$ such that ${\mathbf{P}}{\mathbf{P}}^{\top}={\mathbf{Q}}{\mathbf{Q}}^{\top}={\mathbf{I}}$ . The differential of ${\mathbf{P}}$ can be expressed as

\displaystyle\differential{\mathbf{F}}=\differential{\mathbf{P}}{\mathbf{\Sigma}}{\mathbf{Q}}^{\top}+{\mathbf{P}}\differential{\mathbf{\Sigma}}{\mathbf{Q}}^{\top}+{\mathbf{P}}{\mathbf{\Sigma}}\differential{\mathbf{Q}}^{\top}

(21)

The differential $\differential{\mathbf{\Sigma}}$ is diagonal like ${\mathbf{\Sigma}}$ while $\differential{\mathbf{P}}$ and $\differential{\mathbf{Q}}$ maintain orthogonality constraints: $\differential{\mathbf{P}}^{\top}{\mathbf{P}}+{\mathbf{P}}^{\top}\differential{\mathbf{P}}=0$ and $\differential{\mathbf{Q}}^{\top}{\mathbf{Q}}+{\mathbf{Q}}^{\top}\differential{\mathbf{Q}}=0$ respectively. By applying the orthogonality of ${\mathbf{P}}$ and ${\mathbf{Q}}$ , Eq. 21 can be written as

\displaystyle{\mathbf{P}}^{\top}\differential{\mathbf{F}}{\mathbf{Q}}={\mathbf{P}}^{\top}\differential{\mathbf{P}}{\mathbf{\Sigma}}+\differential{\mathbf{\Sigma}}+{\mathbf{\Sigma}}\differential{\mathbf{Q}}^{\top}{\mathbf{Q}}

(22)

Since both ${\mathbf{P}}^{\top}\differential{\mathbf{P}}$ and $\differential{\mathbf{Q}}^{\top}{\mathbf{Q}}$ are anti-symmetric as well as zero diagonal whereas $\differential{\mathbf{\Sigma}}$ is diagonal, $\differential{\mathbf{\Sigma}}$ can be written as

\displaystyle\differential{\mathbf{\Sigma}}=\Big{(}{\mathbf{P}}^{\top}\differential{\mathbf{F}}{\mathbf{Q}}\Big{)}_{\text{diag}}

(23)

with ${\mathbf{A}}={\mathbf{P}}^{\top}\differential{\mathbf{P}}$ , ${\mathbf{B}}=\differential{\mathbf{Q}}^{\top}{\mathbf{Q}}$ and ${\mathbf{R}}={\mathbf{P}}^{\top}\differential{\mathbf{F}}{\mathbf{Q}}$ The off-diagonal part satisfies the following

	$\displaystyle{\mathbf{A}}{\mathbf{\Sigma}}+{\mathbf{\Sigma}}{\mathbf{B}}={\mathbf{R}}-{\mathbf{R}}_{\text{diag}}$
	$\displaystyle\Rightarrow{\mathbf{\Sigma}}^{\top}{\mathbf{A}}{\mathbf{\Sigma}}+{\mathbf{\Sigma}}^{\top}{\mathbf{\Sigma}}{\mathbf{B}}={\mathbf{\Sigma}}^{\top}\Big{(}{\mathbf{R}}-{\mathbf{R}}_{\text{diag}}\Big{)}={\mathbf{\Sigma}}^{\top}\bar{{\mathbf{R}}}$
	$\displaystyle\Rightarrow\begin{cases}\sigma_{i}a_{ij}\sigma_{j}+\sigma_{i}^{2}b_{ij}=\sigma_{i}\bar{{\mathbf{R}}}_{ij}\\ -\sigma_{j}a_{ij}\sigma_{i}-\sigma_{j}^{2}b_{ij}=\bar{{\mathbf{R}}}_{ji}\sigma_{j}\end{cases}$
	$\displaystyle\Rightarrow b_{ij}=\begin{cases}\Big{(}\sigma_{i}^{2}-\sigma_{j}^{2}\Big{)}^{-1}~{}\Big{(}\sigma_{i}\bar{{\mathbf{R}}}_{ij}+\bar{{\mathbf{R}}}_{ji}\sigma_{j}\Big{)}~{},~{}~{}~{}~{}i\neq j\\ 0~{},~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}i=j\end{cases}$		(24)

here $\bar{{\mathbf{R}}}={\mathbf{R}}-{\mathbf{R}}_{\text{diag}}$ and $\sigma_{i}={\mathbf{\Sigma}_{ii}}$ . Using Eq. A, we can write ${\mathbf{B}}$ as follows

\displaystyle{\mathbf{B}}={\mathbf{K}}\circ\Big{(}{\mathbf{\Sigma}}^{\top}\bar{{\mathbf{R}}}+\bar{{\mathbf{R}}}^{\top}{\mathbf{\Sigma}}\Big{)}={\mathbf{K}}\circ\Big{(}{\mathbf{\Sigma}}^{\top}{\mathbf{R}}+{\mathbf{R}}^{\top}{\mathbf{\Sigma}}\Big{)}

(25)

where

\displaystyle{\mathbf{K}}_{ij}=\begin{cases}\dfrac{1}{\sigma_{i}^{2}-\sigma_{j}^{2}},i\neq j\\ 0,i=j\end{cases}

(26)

Consequently,

\displaystyle\differential{\mathbf{Q}}=2{\mathbf{Q}}~{}\Bigg{(}{\mathbf{K}}^{\top}\circ\Big{(}{\mathbf{\Sigma}}^{\top}{\mathbf{P}}^{\top}\differential{\mathbf{F}}{\mathbf{Q}}\Big{)}_{\text{sym}}~{}\Bigg{)}

(27)

Using Eq. 23 and 27 we can obtain $\differential{\mathbf{P}}$ from Eq. 21

\displaystyle\differential{\mathbf{P}}{\mathbf{\Sigma}}=\differential{\mathbf{F}}{\mathbf{Q}}-{\mathbf{P}}\differential{\mathbf{\Sigma}}-{\mathbf{P}}{\mathbf{\Sigma}}\differential{\mathbf{Q}}^{\top}{\mathbf{Q}}=:{\mathbf{C}}

(28)

For any block form $\differential{\mathbf{P}}=(\differential{\mathbf{P}_{1}},\differential{\mathbf{P}_{2}})$ , the Eq. 28 would be satisfied, where $\differential{\mathbf{P}_{1}}\coloneqq{\mathbf{C}}{\mathbf{\Sigma}}_{d}^{-1}\in\mathbb{R}^{m\times d}$ and $\differential{\mathbf{P}_{2}}\coloneqq-{\mathbf{P}_{1}}\differential{\mathbf{P}_{1}}^{\top}{\mathbf{P}_{2}}\in\mathbb{R}^{m\times m-d}$ with ${\mathbf{\Sigma}}_{d}$ being the top $d$ rows of ${\mathbf{\Sigma}}$ . Therefore,

\displaystyle\differential{\mathbf{P}}=\Big{(}{\mathbf{C}}{\mathbf{\Sigma}}_{d}^{-1}~{}|~{}-{\mathbf{P}_{1}}{\mathbf{\Sigma}}_{d}^{-1}{\mathbf{C}}^{\top}{\mathbf{P}_{2}}\Big{)},~{}{\mathbf{C}}=\differential{\mathbf{F}}{\mathbf{Q}}-{\mathbf{P}}\differential{\mathbf{\Sigma}}-{\mathbf{P}}{\mathbf{\Sigma}}\differential{\mathbf{Q}}^{\top}{\mathbf{Q}}

(29)

and we have

\displaystyle\nabla_{{\mathbf{F}}}:\differential{\mathbf{F}}=\nabla_{{\mathbf{P}}}:\differential{\mathbf{P}}+\nabla_{{\mathbf{\Sigma}}}:\differential{\mathbf{\Sigma}}+\nabla_{{\mathbf{Q}}}:\differential{\mathbf{Q}}

(30)

As we only consider the basis of subspace ${\mathbf{P}}$ in the computation of subspace distillation loss, therefore

\displaystyle\nabla_{{\mathbf{F}}}:\differential{\mathbf{F}}=\nabla_{{\mathbf{P}}}:\differential{\mathbf{P}}

(31)

we simplify $\nabla_{{\mathbf{P}}}:\differential{\mathbf{P}}$ as follows

\displaystyle\nabla_{{\mathbf{P}}}:\differential{\mathbf{P}}=\Big{(}\nabla_{{\mathbf{P}}}\Big{)}_{1}:{\mathbf{C}}{\mathbf{\Sigma}}_{d}^{-1}+\Big{(}\nabla_{{\mathbf{P}}}\Big{)}_{2}:-{\mathbf{P}_{1}}{\mathbf{\Sigma}}_{d}^{-1}{\mathbf{C}}^{\top}{\mathbf{P}_{2}}

(32)

As we consider top $d$ subspace in the computation of subspace distillation loss, therefore $\Big{(}\nabla_{{\mathbf{P}}}\Big{)}_{2}$ is zero. Consequently

\displaystyle\begin{split}\nabla_{{\mathbf{P}}}:\differential{\mathbf{P}}&=\Big{(}\nabla_{{\mathbf{P}}}\Big{)}_{1}{\mathbf{\Sigma}}_{d}^{-1}:{\mathbf{C}}\\ &=\underbrace{\Big{(}\nabla_{{\mathbf{P}}}\Big{)}_{1}{\mathbf{\Sigma}}_{d}^{-1}}_{{\mathbf{D}}}:\Big{(}\differential{\mathbf{F}}{\mathbf{Q}}-{\mathbf{P}}\differential{\mathbf{\Sigma}}-{\mathbf{P}}{\mathbf{\Sigma}}\differential{\mathbf{Q}}^{\top}{\mathbf{Q}}\Big{)}\\ &={\mathbf{D}}:\differential{\mathbf{F}}{\mathbf{Q}}~{}-~{}{\mathbf{D}}:{\mathbf{P}}\differential{\mathbf{\Sigma}}~{}-~{}{\mathbf{D}}:{\mathbf{P}}{\mathbf{\Sigma}}\differential{\mathbf{Q}}^{\top}{\mathbf{Q}},~{}{\mathbf{D}}=\Big{(}\nabla_{{\mathbf{P}}}\Big{)}_{1}{\mathbf{\Sigma}}_{d}^{-1}\\ &={\mathbf{D}}{\mathbf{Q}}^{\top}:\differential{\mathbf{F}}~{}-~{}{\mathbf{P}}^{\top}{\mathbf{D}}:\differential{\mathbf{\Sigma}}~{}-~{}{\mathbf{\Sigma}}{\mathbf{P}}^{\top}{\mathbf{D}}{\mathbf{Q}}^{\top}:\differential{\mathbf{Q}}^{\top}\\ &={\mathbf{D}}{\mathbf{Q}}^{\top}:\differential{\mathbf{F}}~{}-~{}{\mathbf{P}}^{\top}{\mathbf{D}}:\differential{\mathbf{\Sigma}}~{}-~{}{\mathbf{Q}}{\mathbf{D}}^{\top}{\mathbf{P}}{\mathbf{\Sigma}}:\differential{\mathbf{Q}}\\ &={\mathbf{D}}{\mathbf{Q}}^{\top}:\differential{\mathbf{F}}~{}-~{}{\mathbf{P}}^{\top}{\mathbf{D}}:\Big{(}{\mathbf{P}}^{\top}\differential{\mathbf{P}}{\mathbf{Q}}\Big{)}_{\text{diag}}~{}-~{}{\mathbf{Q}}{\mathbf{D}}^{\top}{\mathbf{P}}{\mathbf{\Sigma}}:2{\mathbf{Q}}~{}\Biggl{(}{\mathbf{K}}^{\top}\circ\Big{(}{\mathbf{\Sigma}}^{\top}{\mathbf{P}}^{\top}\differential{\mathbf{P}}{\mathbf{Q}}\Big{)}_{\text{sym}}~{}\Biggl{)}\\ &={\mathbf{D}}{\mathbf{Q}}^{\top}:\differential{\mathbf{F}}~{}-~{}\Big{(}{\mathbf{P}}^{\top}{\mathbf{D}}\Big{)}_{\text{diag}}:{\mathbf{P}}^{\top}\differential{\mathbf{P}}{\mathbf{Q}}~{}-~{}2{\mathbf{Q}}^{\top}{\mathbf{Q}}{\mathbf{D}}^{\top}{\mathbf{P}}{\mathbf{\Sigma}}:~{}\Biggl{(}{\mathbf{K}}^{\top}\circ\Big{(}{\mathbf{\Sigma}}^{\top}{\mathbf{P}}^{\top}\differential{\mathbf{P}}{\mathbf{Q}}\Big{)}_{\text{sym}}~{}\Biggl{)}\\ &={\mathbf{D}}{\mathbf{Q}}^{\top}:\differential{\mathbf{F}}~{}-~{}{\mathbf{P}}\Big{(}{\mathbf{P}}^{\top}{\mathbf{D}}\Big{)}_{\text{diag}}{\mathbf{Q}}^{\top}:\differential{\mathbf{P}}~{}-~{}2\Big{(}{\mathbf{K}}^{\top}\circ{\mathbf{D}}^{\top}{\mathbf{P}}{\mathbf{\Sigma}}\Big{)}_{\text{sym}}:~{}\Big{(}{\mathbf{\Sigma}}^{\top}{\mathbf{P}}^{\top}\differential{\mathbf{P}}{\mathbf{Q}}\Big{)}\\ &={\mathbf{D}}{\mathbf{Q}}^{\top}:\differential{\mathbf{F}}~{}-~{}{\mathbf{P}}\Big{(}{\mathbf{P}}^{\top}{\mathbf{D}}\Big{)}_{\text{diag}}{\mathbf{Q}}^{\top}:\differential{\mathbf{P}}~{}-~{}2{\mathbf{P}}{\mathbf{\Sigma}}\Big{(}{\mathbf{K}}^{\top}\circ{\mathbf{D}}^{\top}{\mathbf{P}}{\mathbf{\Sigma}}\Big{)}_{\text{sym}}{\mathbf{Q}}^{\top}:\differential{\mathbf{P}}\end{split}

(33)

Finally by using Eq. 33 in Eq. 31, we have

\displaystyle\nabla_{{\mathbf{F}}}={\mathbf{D}}{\mathbf{Q}}^{\top}~{}-~{}{\mathbf{P}}\Big{(}{\mathbf{P}}^{\top}{\mathbf{D}}\Big{)}_{\text{diag}}{\mathbf{Q}}^{\top}~{}-~{}2{\mathbf{P}}{\mathbf{\Sigma}}\Big{(}{\mathbf{K}}^{\top}\circ{\mathbf{D}}^{\top}{\mathbf{P}}{\mathbf{\Sigma}}\Big{)}_{\text{sym}}{\mathbf{Q}}^{\top}

(34)

Appendix B Notation and Properties

The following notations have been used in the derivation

1.

Symmetric part of a square matrix ${\mathbf{A}}$ , ${\mathbf{A}}_{\text{sym}}=\dfrac{1}{2}\Big{(}{\mathbf{A}}^{\top}+{\mathbf{A}}\Big{)}$
2.

${\mathbf{A}}_{\text{diag}}$ be ${\mathbf{A}}$ with all off-diagonal elements set to $0$ .
3.

Element-wise product ${\mathbf{A}}\circ{\mathbf{B}}$
4.

Colon product ${\mathbf{A}}:{\mathbf{B}}=Tr({\mathbf{A}}^{\top}{\mathbf{B}})$

The following properties of inner product also have been used

	$\displaystyle{\mathbf{A}}:{\mathbf{B}}{\mathbf{C}}={\mathbf{A}}{\mathbf{C}}^{\top}:{\mathbf{B}}={\mathbf{B}}^{\top}{\mathbf{A}}:{\mathbf{C}}$		(35)
	$\displaystyle{\mathbf{A}}:{\mathbf{B}}\circ{\mathbf{C}}={\mathbf{B}}\circ{\mathbf{A}}:{\mathbf{C}}$		(36)
	$\displaystyle{\mathbf{A}}:{\mathbf{B}}_{\text{sym}}={\mathbf{A}}_{\text{sym}}:{\mathbf{B}}$		(37)
	$\displaystyle{\mathbf{A}}:{\mathbf{B}}_{\text{diag}}={\mathbf{A}}_{\text{diag}}:{\mathbf{B}}$		(38)

Appendix C Hyperparameter

C.1 Class-Incremental Learning

Method	S-MNIST		S-CIFAR-10		S-Tiny Imagenet
Method	Task-IL	Class-IL	Task-IL	Class-IL	Task-IL	Class-IL
	Online Data Stream Setting with Tiny Memory (Buffer Size: 100)
SD	lr: .01, $\alpha:8$ , $\beta:.4$	lr: .01, $\alpha:10$ , $\beta:.5$	lr:.03 , $\alpha:.5$ , $\beta:.75$	lr:.03 , $\alpha:.5$ , $\beta:.5$	lr: .03, $\alpha:1$ , $\beta:.25$	lr: .03, $\alpha:1$ , $\beta:.25$
	Small Memory (Buffer Size: 200)
SD	lr: .03, $\alpha:4$ , $\beta:.4$	lr: .03, $\alpha:4$ , $\beta:.4$	lr:.03 , $\alpha:.4$ , $\beta:.4$	lr:.03 , $\alpha:.4$ , $\beta:.4$	lr: .03, $\alpha:.1$ , $\beta:.1$	lr: .03, $\alpha:.1$ , $\beta:.1$
	Medium Memory (Buffer Size: 500)
SD	lr: .1, $\alpha:1$ , $\beta:.1$	lr: .1, $\alpha:1$ , $\beta:.1$	lr:.03 , $\alpha:1$ , $\beta:1$	lr:.03 , $\alpha:1$ , $\beta:1$	lr: .03, $\alpha:.1$ , $\beta:.1$	lr: .03, $\alpha:.1$ , $\beta:.1$
	Large Memory (Buffer Size: 5120)
SD	lr: .1, $\alpha:1$ , $\beta:2$	lr: .1, $\alpha:1$ , $\beta:2$	lr:.03 , $\alpha:1$ , $\beta:2$	lr:.03 , $\alpha:1$ , $\beta:2$	lr:.03, $\alpha:1$ , $\beta:2.5$	lr:.03, $\alpha:1$ , $\beta:2.5$

Table 7: Hyperparameter selected for SD method for class-incremental classification.

C.2 Continual Semantic Segmentation

Method	19-1 (2-Tasks)	15-5 (2-Tasks)	15-1 (5-Tasks)
SD	lr: .001, $\alpha:10$ , $\beta:.01$	lr: .001, $\alpha:10$ , $\beta:.01$	lr:.001 , $\alpha:10$ , $\beta:.01$

Table 8: Hyperparameter selected for SD method for Continual Semantic Segmentation.

Input	Method	Task-2	Task-3	Task-4	Task-5
	DER
	DER++
	SD
	DER
	DER++
	SD