Active Continual Learning: On Balancing Knowledge Retention and Learnability

Thuy-Trang Vu1, Shahram Khadivi2, Mahsa Ghorbanali1, Dinh Phung1, Gholamreza Haffari1
Department of Data Science and AI, Monash University, Australia1
eBay Inc.2
{trang.vu1,first.last}@monash.edu, [email protected]
Abstract

Acquiring new knowledge without forgetting what has been learned in a sequence of tasks is the central focus of continual learning (CL). While tasks arrive sequentially, the training data are often prepared and annotated independently, leading to the CL of incoming supervised learning tasks. This paper considers the under-explored problem of active continual learning (ACL) for a sequence of active learning (AL) tasks, where each incoming task includes a pool of unlabelled data and an annotation budget. We investigate the effectiveness and interplay between several AL and CL algorithms in the domain, class and task-incremental scenarios. Our experiments reveal the trade-off between two contrasting goals of not forgetting the old knowledge and the ability to quickly learn new knowledge in CL and AL, respectively. While conditioning the AL query strategy on the annotations collected for the previous tasks leads to improved task performance on the domain and task incremental learning, our proposed forgetting-learning profile suggests a gap in balancing the effect of AL and CL for the class-incremental scenario.

1 Introduction

The ability to continuously acquire knowledge while retaining previously learned knowledge is the hallmark of human intelligence. The pursuit to achieve this type of learning is referred to as continual learning (CL). Standard CL protocol involves the learning of a sequence of incoming tasks where the learner has limited access to training data from the previous tasks, posing the risk of forgetting past knowledge. Despite the sequential learning nature in which the learning of previous tasks may heavily affect the learning of subsequent tasks, the standard protocol often ignores the process of training data collection. That is, it implicitly assumes independent data annotation among tasks without considering the learning dynamic of the current model.

In this paper, we explore the active learning (AL) problem of annotating training data for CL, namely active continual learning (ACL). AL and CL emphasise two distinct learning objectives (Mundt et al., 2020). While CL aims to maintain the learned information, AL concentrates on identifying suitable data to label in order to incrementally learn new knowledge. The challenge is how to balance the ability to learn new knowledge against the prevention of forgetting old knowledge (Riemer et al., 2019). Despite this challenge, current CL approaches mostly focus on overcoming catastrophic forgetting, a phenomenon of sudden performance drop on previously learned tasks while learning the current task (McCloskey & Cohen, 1989; Ratcliff, 1990; Kemker et al., 2018).

Like CL, ACL faces the challenge of balancing the prevention of catastrophic forgetting against the ability to quickly learn new tasks. Thanks to the ability to prepare its own training data proactively, ACL opens a new opportunity to address this challenge by selecting samples that both improve the learning of the current task and minimise interference with previous tasks. This paper conducts an extensive analysis of how well ACL, as a combination of existing AL and CL methods, addresses this challenge. We first investigate the benefit of actively labelling training data for CL and whether conditioning labelling queries on previous tasks can accelerate the learning process. We then examine the influence of ACL on the balance between preventing catastrophic forgetting and learning new knowledge.

Our contributions and findings are as follows:

  • We formalise the problem of active continual learning and study the combination of several prominent active learning and continual learning algorithms on image and text classification tasks covering three continual learning scenarios: domain, class and task incremental learning.

  • We found that ACL methods that utilise AL to carefully select and annotate only a portion of training data can reach the performance of CL on the full training dataset, especially in the domain-IL (incremental learning) scenario.

  • We observe that there is a trade-off between forgetting the knowledge of the old tasks and quickly learning the knowledge of the new incoming task in ACL. We propose the forgetting-learning profile to better understand the behaviour of ACL methods and discover that most of them are grouped into two distinct regions of slow learners with high forgetting rates and quick learners with low forgetting rates.

  • ACL with sequential labelling is more effective than independent labelling in the domain-IL scenario where the learner can benefit from positive transfer across domains. In contrast, sequential labelling tends to focus on accelerating the learning of the current task, resulting in higher forgetting and lower overall performance in the class and task-IL.

  • Our study suggests guidelines for choosing AL and CL algorithms. Across three CL scenarios, experience replay (Rolnick et al., 2019) is consistently the best overall CL method. Uncertainty-based AL methods perform best in the domain-IL scenario while diversity-based AL is more suitable for class-IL due to the ill-calibration of model prediction on newly introduced classes.

2 Knowledge Retention and Quick Learnability

This section first provides the problem formulation of continual learning (CL), active learning (AL) and active continual learning (ACL). We then present the evaluation metrics to measure the level of forgetting the old knowledge, the ability to quickly learn new knowledge and overall task performance.

2.1 Knowledge Retention in Continual Learning

Continual Learning

It is the problem of learning a sequence of tasks $\mathcal{T}=\{\tau_{1}, \tau_{2}, \cdots, \tau_{T}\}$ where $T$ is the number of tasks with an underlying model $f_{\theta}(\cdot)$. Each task $\tau_{t}$ has training data $\mathcal{D}^{l}_{t}=\{(x^{t}_{i},y^{t}_{i})\}$ where $x^{t}_{i}$ is the input drawn from an input space $\mathcal{X}_{t}$ and its associated label $y^{t}_{i}$ in label space $\mathcal{Y}_{t}$, sampled from the joint input-output distribution $P_{t}(\mathcal{X}_{t},\mathcal{Y}_{t})$. At each CL step $1\leq t\leq T$, task $\tau_{t}$ with training data $\mathcal{D}^{l}_{t}$ arrives for learning while only a limited subset of training data $\mathcal{D}^{l}_{t-1}$ from the previous task $\tau_{t-1}$ are retained. We denote $\theta_{t}^{*}$ as the model parameter after learning the current task $\tau_{t}$ which is continually optimised from the parameter $\theta^{*}_{t-1}$ learned for the previous tasks, $f_{\theta^{*}_{t}}=\psi(f_{\theta^{*}_{t-1}},\mathcal{D}^{l}_{t})$ where $\psi(\cdot)$ is the CL algorithm. The objective is to learn the model $f_{\theta^{*}_{t}}$ such that it achieves good performance for not only the current task $\tau_{t}$ but also all previous tasks $\tau_{<t}$, $\frac{1}{t}\sum_{i=1}^{t}\textrm{A}(\mathcal{D}^{test}_{i},f_{\theta^{*}_{t}})$ where $\mathcal{D}^{test}_{t}$ denotes the test set of task $\tau_{t}$, and $\textrm{A}(\cdot)$ is the task performance metric (e.g. the accuracy).

Depending on the nature of the tasks, CL problems can be categorised into domain, class and task incremental learning.

  • Domain incremental learning (domain-IL) refers to the scenario where all tasks in the task sequence differ in the input distribution $P_{t-1}(\mathcal{X}_{t-1})\neq P_{t}(\mathcal{X}_{t})$ but share the same label set $\mathcal{Y}_{t-1}=\mathcal{Y}_{t}$.

  • Class incremental learning (class-IL) is the scenario where new classes are added to the incoming task $\mathcal{Y}_{t-1}\neq\mathcal{Y}_{t}$. The model learns to distinguish among classes not only in the current task but also across previous tasks.

  • Task incremental learning (task-IL) assists the model in learning a sequence of non-overlapping classification tasks. Each task is assigned a unique id which is then added to its data samples so that the task-specific parameters can be activated accordingly.

Figure 1: (a) Active continual learning (ACL) annotates training data sequentially by conditioning on the learning dynamic of the current model (red arrow). (b) Forgetting-learning profile to visualize the balance between old knowledge retention and new knowledge learning in ACL. An ideal ACL method should lie in the quick-learner, low-forgetting region.

Forgetting Rate

We denote $A_{i,j}$ as the performance of task $\tau_{j}$ after training on the tasks up to $\tau_{i}$, i.e. $A_{i,j}:=\textrm{A}(\mathcal{D}^{test}_{j},f_{\theta^{*}_{i}})$. The overall task performance is measured by the average accuracy at the end of CL procedure $\frac{1}{T}\sum_{j=1}^{T}A_{T,j}$ where $T$ is the number of tasks. Following the literature, we use forgetting rate (Chaudhry et al., 2018) as the forgetting measure. It is the average of the maximum performance degradation due to learning the current task over all tasks in the task sequence

$$FR=\frac{1}{T-1}\sum_{j=1}^{T-1}\left(\max_{k\in[j,T]}A_{k,j}-A_{T,j}\right) \qquad (1)$$

The lower the forgetting rate, the better the ability of the model to retain knowledge.
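As an illustration of Equation (1), the following minimal sketch computes the average accuracy and the forgetting rate from a matrix of accuracies. The function names and the toy accuracy values are assumptions made for this example; they are not taken from the experiments.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean of A_{T,j} over all T tasks, read from the last row of the matrix."""
    final = acc[-1]
    return sum(final) / len(final)


def forgetting_rate(acc: List[List[float]]) -> float:
    """Equation (1): for each of the first T-1 tasks, take the best accuracy
    reached at any step k >= j minus the final accuracy, then average."""
    num_tasks = len(acc)
    if num_tasks < 2:
        raise ValueError("the forgetting rate needs at least two tasks")
    total = 0.0
    for j in range(num_tasks - 1):
        best = max(acc[k][j] for k in range(j, num_tasks))
        total += best - acc[num_tasks - 1][j]
    return total / (num_tasks - 1)


# acc[i][j] is the accuracy on task j after training up to task i (0-indexed).
# Entries with j > i are never read. The numbers are hypothetical.
acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.75, 0.95],
]
print(average_accuracy(acc), forgetting_rate(acc))
```

In this toy matrix the first task drops from 0.90 to 0.70 and the second from 0.85 to 0.75, so the forgetting rate is the mean of those two drops.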

2.2 Quick Learnability in Active Learning

The core research question in pool-based AL is how to identify informative unlabelled data to annotate within a limited annotation budget such that the performance of an underlying active learner is maximised. Given a task $\tau$ with input space $\mathcal{X}$ and label space $\mathcal{Y}$, the AL problem often starts with a small set of labelled data $\mathcal{D}^{l}=\{(x_{i},y_{i})\}$ and a large set of unlabelled data $\mathcal{D}^{u}=\{x_{j}\}$ where $x_{i},x_{j}\in\mathcal{X}$, $y_{i}\in\mathcal{Y}$. The AL procedure consists of multiple rounds of selecting unlabelled instances to annotate and repeats until exhausting the annotation budget.

More specifically, in each round the algorithm chooses one or more instances $x^{*}$ in the unlabelled dataset $\mathcal{D}^{u}$ to ask for labels according to a scoring/acquisition function $g(\cdot)$, $x^{*}=\arg\max_{x_{j}\in\mathcal{D}^{u}}g(x_{j},f_{\theta})$, where the acquisition function estimates the value of an unlabelled data point, if labelled, to the re-training of the current model. The higher the score of an unlabelled data point, the higher the performance gain it may bring to the model. The differences between AL algorithms boil down to the choice of the scoring function $g(\cdot)$.
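To make the role of the scoring function concrete, here is a minimal sketch of two uncertainty-based choices of $g(\cdot)$, entropy and min-margin, which are among the AL methods studied in Section 3. It assumes the model output is available as a list of class probabilities per unlabelled point; `select_top` and the toy pool are hypothetical names and values introduced for this example.

```python
import math
from typing import Callable, List, Sequence


def entropy_score(probs: Sequence[float]) -> float:
    """Predictive entropy: higher means the model is more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin_score(probs: Sequence[float]) -> float:
    """Negated gap between the two most probable classes, so that a small
    margin (an uncertain prediction) receives a high score."""
    top = sorted(probs, reverse=True)
    return -(top[0] - top[1])


def select_top(
    pool_probs: List[Sequence[float]],
    score: Callable[[Sequence[float]], float],
    b: int,
) -> List[int]:
    """Return the indices of the b highest-scoring unlabelled points."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda j: score(pool_probs[j]),
        reverse=True,
    )
    return ranked[:b]


# Hypothetical predicted probabilities for three unlabelled points.
pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(select_top(pool, entropy_score, 1))
print(select_top(pool, margin_score, 2))
```

Swapping the `score` argument is the only change needed to move between the two strategies, which mirrors the statement that AL algorithms differ mainly in their scoring function.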

Quick Learnability

Learning curve area (LCA) (Chaudhry et al., 2019a) is the area under the accuracy curve of the current task with respect to the number of trained minibatches. It measures how quickly a learner learns the current task: the higher the LCA value, the quicker the learning. We adopt this metric to compute the learning speed of an ACL method with respect to the amount of annotated data.
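One way to compute such an area over AL rounds is sketched below. It is only an approximation under stated assumptions: the accuracy is recorded after each AL round, the area is estimated with the trapezoidal rule, and it is normalised by the annotated-data range so that the value stays on the accuracy scale. The exact discretisation used in the experiments may differ, and the two curves are hypothetical.

```python
from typing import Sequence


def learning_curve_area(budgets: Sequence[float], accuracies: Sequence[float]) -> float:
    """Trapezoidal area under accuracy vs. amount of annotated data,
    divided by the width of the budget range."""
    if len(budgets) != len(accuracies) or len(budgets) < 2:
        raise ValueError("need at least two (budget, accuracy) points of equal length")
    area = 0.0
    for k in range(1, len(budgets)):
        width = budgets[k] - budgets[k - 1]
        area += width * (accuracies[k] + accuracies[k - 1]) / 2.0
    return area / (budgets[-1] - budgets[0])


# Two hypothetical learners that end at the same accuracy.
budgets = [1, 2, 3]           # percent of the pool annotated so far
quick = [0.60, 0.80, 0.90]    # gains accuracy early
slow = [0.20, 0.50, 0.90]     # gains accuracy late
print(learning_curve_area(budgets, quick), learning_curve_area(budgets, slow))
```

Both learners reach the same final accuracy, but the one that improves earlier obtains the larger area, which is the behaviour the metric is meant to reward.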

2.3 Active Continual Learning

We consider the problem of continually learning a sequence of AL tasks, namely active continual learning (ACL). Figure 1(a) illustrates the learning procedure of the ACL problem. CL often overlooks how training data is annotated and implicitly assumes independent labelling among tasks, leading to CL as a sequence of supervised learning tasks.

More specifically, each task $\tau_{t}$ in the task sequence consists of an initial labelled dataset $\mathcal{D}^{l}_{t}$, a pool of unlabelled data $\mathcal{D}^{u}_{t}$ and an annotation budget $B_{t}$. The ACL training procedure is described in Algorithm 1. Upon the arrival of the current task $\tau_{t}$, we first train the proxy model $f_{\hat{\theta}_{t}}$ on the current labelled data with the CL algorithm $\psi(\cdot)$ from the best checkpoint of the previous task $\theta^{*}_{t-1}$ (line 3). This proxy model is later used in the AL acquisition function. We then iteratively run the AL query to annotate new labelled data (lines 4-8) and retrain the proxy model (line 9) until the annotation budget is exhausted. The above procedure is repeated for all tasks in the sequence. Notably, the AL acquisition function $g(\cdot)$ ranks the unlabelled data (line 5) according to the proxy model $f_{\hat{\theta}_{t}}$, which is warm-started from the model learned from the previous tasks. Hence, the labelling query in ACL is no longer independent from the previous CL tasks and is able to leverage their knowledge.

Algorithm 1 Active Continual Learning
Input: Task sequence $\mathcal{T}=\{\tau_{1},\cdots,\tau_{T}\}$ where $\tau_{t}=\{\mathcal{D}^{l}_{t},\mathcal{D}^{u}_{t},\mathcal{D}^{test}_{t},B_{t}\}$, initial model $\theta_{0}$, query size $b$
Output: Model parameters $\theta$
1: Initialize $\theta_{0}^{*}$
2: for $t\in 1,\dots,T$ do  ▷ CL loop
3:     $f_{\hat{\theta}_{t}}\leftarrow\psi(\mathcal{D}^{l}_{t},f_{\theta_{t-1}^{*}})$  ▷ Build the proxy model to be used in the AL acquisition function
4:     for $i\in 1,\dots,\frac{B_{t}}{b}$ do  ▷ AL round
5:         $\{x^{*}_{j}\}|_{1}^{b}\leftarrow g(\mathcal{D}^{l}_{t},\mathcal{D}^{u}_{t},b,f_{\hat{\theta}_{t}})$  ▷ Build AL query
6:         $\{(x^{*}_{j},y^{*}_{j})\}|_{1}^{b}\leftarrow\textrm{annotateByOracle}(\{x^{*}_{j}\}|_{1}^{b})$
7:         $\mathcal{D}^{u}_{t}\leftarrow\mathcal{D}^{u}_{t}\backslash\{x^{*}_{j}\}|_{1}^{b}$  ▷ Update unlabelled dataset
8:         $\mathcal{D}^{l}_{t}\leftarrow\mathcal{D}^{l}_{t}\cup\{(x^{*}_{j},y^{*}_{j})\}|_{1}^{b}$  ▷ Update labelled dataset
9:         $f_{\hat{\theta}_{t}}\leftarrow\psi(\mathcal{D}^{l}_{t},f_{\theta_{t-1}^{*}})$  ▷ Re-train the proxy model
10:     end for
11:     $f_{\theta_{t}^{*}}\leftarrow\psi(\mathcal{D}^{l}_{t},f_{\theta_{t-1}^{*}})$  ▷ Re-train the model on the AL collected data starting from the previous task
12: end for
13: return $\theta_{T}^{*}$

Another active learning approach in continual learning would be to have an acquisition function for each task that does not depend on the previous tasks. That is, the proxy model (lines 3 and 9) is built by initialising it randomly instead of from the model learned from the previous tasks. We will compare this independent AL approach to the ACL approach in the experiments, and show that leveraging the knowledge of the previous tasks is indeed beneficial for some CL settings.
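The control flow of Algorithm 1, together with the independent-labelling variant, can be sketched as a small Python function. This is a schematic under stated assumptions rather than the implementation used in the experiments: `cl_train`, `acquire` and `oracle` are hypothetical callables standing in for $\psi(\cdot)$, $g(\cdot)$ and the annotator, `acquire` is assumed to return indices into the unlabelled pool, and the independent variant starts the proxy from `initial_model` as a stand-in for random initialisation.

```python
from typing import Any, Callable, Dict, List, Tuple

Labelled = List[Tuple[Any, Any]]


def active_continual_learning(
    tasks: List[Dict[str, Any]],
    cl_train: Callable[[Labelled, Any], Any],
    acquire: Callable[[Labelled, List[Any], int, Any], List[int]],
    oracle: Callable[[Any], Any],
    initial_model: Any,
    query_size: int,
    sequential: bool = True,
) -> Tuple[Any, List[Labelled]]:
    """Run the ACL loop; return the final model and each task's labelled set."""
    model = initial_model
    collected: List[Labelled] = []
    for task in tasks:  # CL loop
        labelled = list(task["labelled"])
        unlabelled = list(task["unlabelled"])
        # Sequential labelling warm-starts the proxy from the previous tasks;
        # independent labelling starts it from the initial model instead.
        proxy_start = model if sequential else initial_model
        proxy = cl_train(labelled, proxy_start)
        for _ in range(task["budget"] // query_size):  # AL rounds
            picked = set(acquire(labelled, unlabelled, query_size, proxy))
            labelled += [(unlabelled[j], oracle(unlabelled[j])) for j in sorted(picked)]
            unlabelled = [x for j, x in enumerate(unlabelled) if j not in picked]
            proxy = cl_train(labelled, proxy_start)
        # The task's final model always continues from the previous tasks.
        model = cl_train(labelled, model)
        collected.append(labelled)
    return model, collected


# Toy stand-ins (assumptions) so the loop can be run end to end.
def toy_train(labelled: Labelled, start: int) -> int:
    return start + len(labelled)


def toy_acquire(labelled: Labelled, unlabelled: List[Any], b: int, proxy: Any) -> List[int]:
    return list(range(min(b, len(unlabelled))))


def toy_oracle(x: int) -> int:
    return x % 2


tasks = [
    {"labelled": [], "unlabelled": [1, 2, 3, 4, 5, 6], "budget": 4},
    {"labelled": [], "unlabelled": [7, 8, 9, 10, 11, 12], "budget": 4},
]
final_model, collected = active_continual_learning(
    tasks, toy_train, toy_acquire, toy_oracle, initial_model=0, query_size=2
)
print(final_model, collected[0])
```

The only difference between the two labelling strategies in this sketch is the starting point of the proxy model; the model that is carried forward to the next task is trained from the previous checkpoint in both cases.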

Forgetting-Learning Profile

To understand the trade-off between CL and AL, we propose the forgetting-learning profile: a 2-D plot where the x-axis is the LCA, the y-axis is the forgetting rate, and each ACL algorithm is represented by a point in this space (Figure 1(b)). Depending on the level of forgetting and the learning speed, the forgetting-learning space can be divided into four regions: slow learners with low/high forgetting rates, which correspond to the bottom-left and top-left quadrants; and, similarly, quick learners with low/high forgetting rates, which reside in the bottom-right and top-right quadrants. Ideally, an effective ACL method should lie in the quick-learner, low-forgetting region.
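As a small illustration of how a method is placed in the profile, the sketch below maps an (LCA, forgetting rate) pair to one of the four regions. The split points are assumptions: no numeric boundaries are given for the quadrants, so the caller has to supply them (for instance the midpoints of the plotted ranges), and the values in the example are hypothetical.

```python
def profile_region(
    lca: float,
    forgetting: float,
    lca_split: float,
    forgetting_split: float,
) -> str:
    """Name the quadrant of the forgetting-learning profile for one ACL method.

    lca_split and forgetting_split are caller-chosen boundaries (assumptions).
    """
    speed = "quick" if lca >= lca_split else "slow"
    level = "low" if forgetting <= forgetting_split else "high"
    return f"{speed} learner, {level} forgetting"


# Hypothetical methods and split points.
print(profile_region(lca=0.80, forgetting=0.05, lca_split=0.5, forgetting_split=0.2))
print(profile_region(lca=0.30, forgetting=0.40, lca_split=0.5, forgetting_split=0.2))
```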

3 Experiments

In this paper, we investigate the effectiveness and dynamics of ACL by addressing the following research questions (RQs):

  • RQ1: Does utilising AL to carefully select and annotate training data improve CL performance?

  • RQ2: Is labelling data sequentially more effective than CL with independent AL?

  • RQ3: How do different ACL methods influence the balance between forgetting and learning?

Dataset

We conduct ACL experiments on two text classification and three image classification tasks. The text classification tasks include aspect sentiment classification (ASC) (Ke et al., 2021) and news classification (20News) (Pontiki et al., 2014), corresponding to the domain and class/task incremental learning, respectively. For image classification tasks, we evaluate the domain-IL scenario with the permuted-MNIST (P-MNIST) dataset and class/task-IL scenarios with the sequential MNIST (S-MNIST) (Lecun et al., 1998) and sequential CIFAR10 (S-CIFAR10) datasets (Krizhevsky et al., 2009). Details of the task grouping and data statistics are reported in Table 3 in Appendix C.

Continual Learning Methods

This paper mainly studies two widely-adopted methods in CL literature: elastic weight consolidation (EWC) (Kirkpatrick et al., 2017) and experience replay (ER) (Rolnick et al., 2019). EWC adds a regularization term based on the Fisher information to constrain the update of important parameters for the previous tasks. ER exploits a fixed replay buffer to store and replay examples from the previous tasks during training the current task. We also evaluate other experience replay CL methods including iCaRL (Rebuffi et al., 2017), AGEM (Chaudhry et al., 2019a), GDumb (Prabhu et al., 2020a), DER and DER++ (Buzzega et al., 2020) on image classification benchmarks. The detailed description of each CL method is reported in Appendix A.
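The two mechanisms can be illustrated with a minimal sketch. The EWC part computes the usual quadratic penalty, half of $\lambda$ times the Fisher-weighted squared distance from the previous parameters, on flat parameter lists. The replay buffer uses reservoir sampling to stay at a fixed capacity, which is an assumption made here for illustration: the text above only states that the buffer is fixed in size. All names and values are hypothetical.

```python
import random
from typing import Any, List, Sequence


def ewc_penalty(
    theta: Sequence[float],
    theta_star: Sequence[float],
    fisher: Sequence[float],
    lam: float,
) -> float:
    """Quadratic EWC term: parameters with large Fisher values are
    penalised more for moving away from their previous values."""
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(theta, theta_star, fisher)
    )


class ReplayBuffer:
    """Fixed-capacity buffer filled by reservoir sampling (an assumption)."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example: Any) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Each example seen so far stays with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, k: int) -> List[Any]:
        """Draw up to k stored examples to mix into the current minibatch."""
        return self.rng.sample(self.items, min(k, len(self.items)))


print(ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 1.0], lam=2.0))
buffer = ReplayBuffer(capacity=5)
for example in range(100):
    buffer.add(example)
print(len(buffer.items), len(buffer.sample(3)))
```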

Active Learning Methods

We consider two uncertainty-based methods, entropy (Ent) and min-margin (Marg) (Scheffer et al., 2001; Luo et al., 2005); two diversity-based methods, embedding-$k$-means (kMeans) (Yuan et al., 2020) and coreset (Sener & Savarese, 2018); BADGE, which takes both uncertainty and diversity into account (Ash et al., 2020); and random sampling as the AL baseline. While the uncertainty-based strategy gives a higher score to unlabelled data deemed uncertain to the current learner, the diversity-based method aims to increase the diversity among selected unlabelled data to maximise the coverage of the representation space. The detailed description of each AL method is reported in Appendix B.
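To contrast with uncertainty scoring, the following sketch shows a greedy k-center selection, a common way to realise coreset-style diversity sampling. It assumes each example is represented by an embedding vector and uses Euclidean distance; the function name, the embeddings and the distance choice are assumptions for this example and may differ from the configuration described in Appendix B.

```python
import math
from typing import List, Sequence


def k_center_greedy(
    labelled: List[Sequence[float]],
    pool: List[Sequence[float]],
    b: int,
) -> List[int]:
    """Greedily pick b pool indices, each time taking the point that is
    farthest from everything already labelled or already picked."""
    # Distance from each pool point to its nearest labelled point
    # (infinite when the labelled set is empty, as in cold-start AL).
    min_dist = [
        min((math.dist(x, c) for c in labelled), default=math.inf) for x in pool
    ]
    chosen: List[int] = []
    for _ in range(min(b, len(pool))):
        j = max(range(len(pool)), key=lambda i: min_dist[i])
        chosen.append(j)
        # The new center may now be the nearest one for some pool points.
        for i, x in enumerate(pool):
            min_dist[i] = min(min_dist[i], math.dist(x, pool[j]))
    return chosen


# Hypothetical 2-D embeddings.
labelled = [(0.0, 0.0)]
pool = [(0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (0.0, 3.0)]
print(k_center_greedy(labelled, pool, 2))
```

In the toy pool, the second pick skips the point next to the first pick and goes to an uncovered area instead, which is the coverage behaviour that diversity-based methods aim for.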

Figure 2: The ceiling methods of ACL. (a) Active individual learning. (b) Active multi-task learning.

Ceiling Methods

We report the results of AL in building a separate classifier for each individual task (Figure 2.a). This serves as an approximate ceiling method for our ACL setting. We also report the integration of AL methods with multi-task learning (MTL) as another approximate ceiling method (Figure 2.b). In each AL round, we use the current MTL model in the AL acquisition function to estimate the scores of unlabelled data for all tasks simultaneously. We then query the labels of the highest-ranked unlabelled data points whose tasks still have labelling budget remaining. These two ceiling methods are denoted by Mtl (multi-task learning) and Indiv (individual learning) in the result tables.
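The budget-aware query of the multi-task ceiling can be sketched as follows. The data layout is an assumption made for this example: every unlabelled point is given as a hypothetical `(score, task_id, example_id)` triple, and `remaining` holds the number of labels each task may still request.

```python
from typing import Dict, List, Tuple


def select_with_task_budgets(
    scored: List[Tuple[float, str, int]],
    remaining: Dict[str, int],
    b: int,
) -> List[Tuple[str, int]]:
    """Pick up to b points in descending score order, skipping any point
    whose task has no labelling budget left."""
    remaining = dict(remaining)  # work on a copy, leave the caller's dict intact
    chosen: List[Tuple[str, int]] = []
    for _, task_id, example_id in sorted(scored, key=lambda s: s[0], reverse=True):
        if len(chosen) == b:
            break
        if remaining.get(task_id, 0) > 0:
            chosen.append((task_id, example_id))
            remaining[task_id] -= 1
    return chosen


# Hypothetical scores for two tasks, A and B.
scored = [(0.9, "A", 0), (0.8, "A", 1), (0.7, "B", 0), (0.6, "A", 2), (0.5, "B", 1)]
print(select_with_task_budgets(scored, {"A": 1, "B": 2}, b=3))
```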

Model Training and Hyperparameters

We experiment with cold-start AL for AL and ACL models. That is, the models start with an empty labelled set and select 1% of the training data for annotation in each round until the annotation budget is exhausted, i.e. 30% for ASC, 20% for 20News and 25% for the S-CIFAR10 dataset. For P-MNIST and S-MNIST, the query size and the annotation budget are 0.5% and 10%, respectively. While the text classifier is initialised with RoBERTa (Liu et al., 2019), the image classifiers (MLP for P-MNIST and S-MNIST, ResNet18 (He et al., 2016) for S-CIFAR10) are initialised randomly. During the training of each task, the validation set of the corresponding task is used for early stopping. We report the average results over 6 runs with different seeds for the individual learning and multi-task learning baselines. As CL results are highly sensitive to the task order, we evaluate each CL and ACL method on 6 different task orders and report the average results. More details on training are in Appendix D.
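The query sizes and budgets above determine how many AL rounds each task runs, namely the budget divided by the query size. The short sketch below works this out; it assumes the 1% selection is the per-round query size for ASC, 20News and S-CIFAR10, and the dictionary layout and function name are hypothetical.

```python
from fractions import Fraction
from typing import Dict, Tuple

# (query size in %, annotation budget in %) per dataset, as stated in the text.
SETUP: Dict[str, Tuple[str, str]] = {
    "ASC": ("1", "30"),
    "20News": ("1", "20"),
    "S-CIFAR10": ("1", "25"),
    "P-MNIST": ("0.5", "10"),
    "S-MNIST": ("0.5", "10"),
}


def num_al_rounds(query_pct: str, budget_pct: str) -> int:
    """Number of AL rounds per task; exact arithmetic avoids float rounding."""
    rounds = Fraction(budget_pct) / Fraction(query_pct)
    if rounds.denominator != 1:
        raise ValueError("the budget is not a whole number of queries")
    return int(rounds)


ROUNDS = {name: num_al_rounds(q, b) for name, (q, b) in SETUP.items()}
print(ROUNDS)
```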

3.1 Active Continual Learning Results

Figure 3: Relative performance (average accuracy of 6 runs) of various ACL methods with respect to the CL on full labelled data (full CL) in image classification benchmarks. The error bar indicates the standard deviation of the difference between two means, full CL and ACL. iCaRL is a class-IL method, hence not applicable for P-MNIST dataset.

Image Classification

The relative average task performance of ACL methods with respect to CL on full labelled data in three image classification benchmarks is reported in Figure 3. Detailed results are shown in Table 6 in Appendix E. Notably, in the P-MNIST dataset (domain-IL scenario), only the CL ceiling methods (Indiv and Mtl) on the full dataset outperform their active learning counterparts. While utilising only a portion of the training data (10-30%), ACL methods surpass the performance of the corresponding CL methods in the domain-IL scenario and achieve comparable performance in most class/task-IL scenarios (with only a 2-5% difference), except in S-CIFAR10 task-IL. Because they have access to a small amount of training data from previous tasks, experience replay methods are less prone to catastrophic forgetting and achieve higher accuracy than naive finetuning (Ft) and the regularization method EWC. Within the same CL method, uncertainty-based AL methods (Ent and Marg) generally perform worse than other AL methods.

Figure 4: Relative performance (average accuracy of 6 runs) of various ACL methods with respect to the CL on full labelled data in text classification tasks. The error bar indicates the standard deviation of the difference between two means, full CL and ACL.

Text Classification

Figure 4 shows the relative performance of ACL methods with respect to CL on full labelled data in the text classification benchmarks. Detailed accuracy and standard deviation are reported in Table 4 in Appendix E. On the ASC dataset, the multi-task learning, CL and ACL models outperform individual learning (Indiv), suggesting positive transfer across domains. In contrast to the observations on P-MNIST, uncertainty-based AL methods demonstrate better performance than other AL methods in the domain-IL scenario. Conversely, there is no significant difference in accuracy between the various AL methods in the 20News class-IL scenario. On the other hand, ACL significantly lags behind supervised CL in the 20News task-IL scenario, especially with EWC.

Comparison to CL with Independent AL

We have shown the benefit of active continual learning over supervised CL on the full dataset in domain and class-IL for both image and text classification tasks. A natural question is whether to actively annotate data for each task independently or to base the annotation on the knowledge acquired from previous tasks. Table 1 shows the difference in performance between CL with independent AL and the corresponding ACL method, i.e. negative values show that ACL is superior. Interestingly, the effect of sequential labelling in ACL varies depending on the CL scenario. For the ASC dataset, the performance of ACL is mostly better than CL with independent AL, especially for uncertainty-based AL methods (Ent and Marg). In contrast, we do not observe an improvement of ACL over independent AL for the 20News and S-MNIST datasets in the class-IL scenario. On the other hand, sequential AL seems beneficial for experience replay methods in task-IL scenarios. Full results for other CL methods are provided in Section E.3.

Table 1: Relative performance of CL with independent AL with respect to the corresponding ACL. Diver denotes diversity-based method, kMeans for text classification and coreset for image classification tasks.
            | ASC (Domain-IL)         | 20News (Class-IL)       | 20News (Task-IL)        | S-MNIST (Class-IL)      | S-MNIST (Task-IL)
            | Ft      EWC     ER      | Ft      EWC     ER      | Ft      EWC     ER      | Ft      EWC     ER      | Ft      EWC     ER
Ind. Rand   | -1.67   +0.35   -0.02   | -0.03   +0.01   +0.90   | +1.70   -3.22   +0.23   | -0.10   +3.60   +0.70   | +3.82   -1.33   -0.12
Ind. Ent    | -0.59   -0.92   -0.19   | -0.09   -0.08   +2.32   | -0.87   +3.39   -1.03   | -0.30   +3.59   +1.83   | -0.35   -10.70  -0.63
Ind. Marg   | -0.93   -1.36   -0.82   | -0.08   +0.14   +2.24   | +1.05   +2.58   -0.42   | -0.19   -1.94   +0.06   | +1.68   -11.91  -0.52
Ind. BADGE  | +0.15   -0.08   +0.10   | +0.01   -0.20   +4.28   | -1.16   +3.14   -0.38   | -0.06   +6.14   +0.59   | +3.35   -0.64   -0.47
Ind. Diver  | -0.54   -0.71   +0.96   | -0.21   +0.06   +2.78   | -2.66   +3.65   +0.35   | -0.07   +0.66   -0.60   | -6.58   -4.94   -0.57

3.2 Knowledge Retention and Quick Learnability

Figure 5: Learning-Forgetting profile of ACL methods with entropy in text classification tasks.
Figure 6: Learning-Forgetting profile of ACL methods with entropy on S-CIFAR10 dataset.

Figure 5 and Figure 6 report the forgetting rate and LCA for ACL methods with the entropy strategy for text classification tasks and S-CIFAR10. The results of other ACL methods can be found in the appendix. These metrics measure the ability to retain knowledge and the learning speed on the current task. Ideally, we want a quick learner with the lowest forgetting. ER in the class-IL scenario has a lower LCA than Ft and EWC due to the negative interference of learning from a minibatch that mixes training data from the current and previous tasks. However, it has a much lower forgetting rate. In domain-IL, where positive domain transfer exists, ER has a forgetting rate and LCA comparable to other ACL methods. Regarding the labelling strategy, sequential labelling results in quicker learners than independent labelling across all three learning scenarios. This shows the benefit of carefully labelling data for the current task by conditioning on the previous tasks. However, it comes at the cost of a slightly higher forgetting rate.

Figure 7: Average LCA of AL curves through tasks.

Learning Rate

Figure 7 shows the average LCA of AL curves for entropy-based ACL methods. For each task $\tau_{t}$, we compute the LCA of the average accuracy curve of all learned tasks $\tau_{\leq t}$ at each AL round. This metric reflects the learning speed of the models on all tasks seen so far. The higher the LCA, the quicker the model learns. In domain-IL, ACL with sequential labelling has a higher LCA than independent labelling across the 3 CL methods. The gap is larger for early tasks and shrinks towards the final tasks. Unlike the increasing trend in domain-IL, we observe a decline in LCA over tasks for all ACL methods in class and task-IL. This declining trend is evidence of catastrophic forgetting. In line with the above finding that sequential labelling suffers from more severe forgetting, ER with independent labelling has a better LCA than with sequential labelling, especially on the later tasks.

Figure 8: Learning-Forgetting profile of ACL methods for text classification tasks.

Forgetting-Learning Profile

Figure 8 and Figure 9 show the forgetting-learning profiles for ACL models trained on text and image classification benchmarks, respectively. In text classification tasks (Figure 8), the ACL methods scatter all over the space in the domain-IL profile for the ASC dataset, but sequential labelling with uncertainty-based AL (Ent and Marg) combined with ER or EWC has the desired properties of low forgetting and quick learning. For task-IL, ER also lies in the low-forgetting region, and several combinations of ER with AL methods are also quick learners. On the other hand, we can observe two distinct regions in the profile of class-IL: the top-right region of quick learners with high forgetting (Ft and EWC), and the bottom-left region of slow learners with low forgetting (ER). While ACL with ER has been shown to be effective in both non-forgetting and quick learning for domain and task-IL, none of the studied ACL methods lies in the ideal region of the class-IL profile. Compared to sequential labelling, independent labelling within the same ACL method tends to reside in slower-learning regions but with lower forgetting. We observe similar findings in the S-CIFAR10 dataset (Figure 9). However, in the case of S-MNIST, most experience replay methods reside in the ideal low-forgetting and quick-learning region. We hypothesize that this phenomenon can be attributed to the relatively simpler nature of the MNIST datasets, where most methods achieve nearly perfect accuracy.

Figure 9: Learning-Forgetting profile of ACL methods for image classification tasks.

Normalized forgetting rate at different AL budget

We have shown that there is a trade-off between forgetting and quick-learnability when combining AL and CL.

Figure 10: Normalized forgetting rate across different CL methods.

To test our hypothesis that sequential active learning with continual learning may lead to forgetting old knowledge, we report the normalized ratio of the forgetting rate of ACL over the forgetting rate of supervised CL, $\frac{FR_{ACL}}{FR_{CL}}$, at different annotation budgets.¹

¹ Normalized forgetting rates of different task orders in ACL are at the same scale, due to the normalization.

Figure 10 reports the average normalized forgetting ratio across different CL methods in S-MNIST. A normalized ratio greater than 1 means that the ACL method has higher forgetting than the corresponding supervised CL baseline, i.e. performing AL increases the forgetting. We observe that both diversity-based (coreset) and uncertainty-based (min-margin) methods lead to more forgetting than the baseline. On the other hand, BADGE consistently scores lower forgetting rates across different annotation budgets.

4 Related Works

Active learning

AL algorithms can be classified into heuristic-based methods such as uncertainty sampling (Settles & Craven, 2008; Houlsby et al., 2011) and diversity sampling (Brinker, 2003; Joshi et al., 2009); and data-driven methods which learn the acquisition function from the data (Bachman et al., 2017; Fang et al., 2017; Liu et al., 2018; Vu et al., 2019). We refer readers to the following survey papers (Settles, 2012; Zhang et al., 2022) for more details.

Continual learning

CL has a rich literature on mitigating the issue of catastrophic forgetting, which can be broadly categorised into regularisation-based methods (Kirkpatrick et al., 2017), memory-based methods (Aljundi et al., 2019; Zeng et al., 2019; Chaudhry et al., 2019b; Prabhu et al., 2020b) and architecture-based methods (Hu et al., 2021; Ke et al., 2021). We refer the reader to the survey (Delange et al., 2021) for more details. Recent works have explored settings beyond CL on a sequence of supervised learning tasks, such as continual few-shot learning (Li et al., 2022) and unsupervised learning (Madaan et al., 2021). In this paper, we study the usage of AL to carefully prepare training data for CL where the goal is not only to prevent forgetting but also to quickly learn the current task.

5 Conclusion

This paper studies the under-explored problem of carefully annotating training data for continual learning (CL), namely active continual learning (ACL). Our experiments and analysis shed light on the performance characteristics and learning dynamics of the integration between several well-studied active learning (AL) and CL algorithms. With only a portion of the training set, ACL methods achieve comparable results with CL methods trained on the entire dataset; however, the effect on different CL scenarios varies. We also propose the forgetting-learning profile to understand the relationship between two contrasting goals of low-forgetting and quick-learning ability in CL and AL, respectively. We identify the current gap in ACL methods where the AL acquisition function concentrates too much on improving the current task and hinders the overall performance.

Limitations

Due to limited computational resources, this study does not exhaustively cover all AL and CL methods proposed in the literature. We consider several well-established CL and AL algorithms and study classification tasks only. We hope that the simplicity of the classification tasks has helped in isolating the factors that may otherwise influence the performance and behaviours of ACL. We leave the exploration of the learning dynamics of ACL with more complicated CL and AL algorithms and more challenging tasks, such as sequence generation, as future work.

References

  • Aljundi et al. (2019) Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. Advances in neural information processing systems, 32, 2019.
  • Arthur & Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.  1027–1035, 2007.
  • Ash et al. (2020) Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations (ICLR) 2020, 2020. URL https://openreview.net/forum?id=ryghZJBKPS.
  • Bachman et al. (2017) Philip Bachman, Alessandro Sordoni, and Adam Trischler. Learning algorithms for active learning. In Proceedings of the 34th International Conference on Machine Learning, pp.  301–310, 2017.
  • Brinker (2003) Klaus Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp.  59–66, 2003.
  • Buzzega et al. (2020) Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  15920–15930. Curran Associates, Inc., 2020.
  • Chaudhry et al. (2018) Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  532–547, 2018.
  • Chaudhry et al. (2019a) Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-GEM. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=Hkf2_sC5FX.
  • Chaudhry et al. (2019b) Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019b.
  • Delange et al. (2021) Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, 2019. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
  • Ding et al. (2008) Xiaowen Ding, Bing Liu, and Philip S. Yu. A holistic lexicon-based approach to opinion mining. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, pp.  231–240, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781595939272. doi: 10.1145/1341531.1341561. URL https://doi.org/10.1145/1341531.1341561.
  • Fang et al. (2017) Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp.  595–605, 2017. doi: 10.18653/v1/D17-1063. URL https://aclanthology.org/D17-1063.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Houlsby et al. (2011) Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
  • Hu & Liu (2004) Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pp.  168–177, New York, NY, USA, 2004. Association for Computing Machinery. ISBN 1581138881. doi: 10.1145/1014052.1014073. URL https://doi.org/10.1145/1014052.1014073.
  • Hu et al. (2021) Wenpeng Hu, Qi Qin, Mengyu Wang, Jinwen Ma, and Bing Liu. Continual learning by using information of each class holistically. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):7797–7805, May 2021. doi: 10.1609/aaai.v35i9.16952. URL https://ojs.aaai.org/index.php/AAAI/article/view/16952.
  • Joshi et al. (2009) Ajay J Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp.  2372–2379, 2009.
  • Ke et al. (2021) Zixuan Ke, Bing Liu, Nianzu Ma, Hu Xu, and Lei Shu. Achieving forgetting prevention and knowledge transfer in continual learning. Advances in Neural Information Processing Systems, 34:22443–22456, 2021.
  • Kemker et al. (2018) Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. Measuring catastrophic forgetting in neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Lang (1995) Ken Lang. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pp.  331–339. Elsevier, 1995.
  • Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
  • Li et al. (2022) Guodun Li, Yuchen Zhai, Qianglong Chen, Xing Gao, Ji Zhang, and Yin Zhang. Continual few-shot intent detection. In Proceedings of the 29th International Conference on Computational Linguistics, pp.  333–343, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.26.
  • Liu et al. (2018) Ming Liu, Wray Buntine, and Gholamreza Haffari. Learning how to actively learn: A deep imitation learning approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1874–1883, 2018. URL http://aclweb.org/anthology/P18-1174.
  • Liu et al. (2015) Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. Automated rule selection for aspect extraction in opinion mining. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp.  1291–1297. AAAI Press, 2015. ISBN 9781577357384.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Luo et al. (2005) Tong Luo, Kurt Kramer, Dmitry B Goldgof, Lawrence O Hall, Scott Samson, Andrew Remsen, Thomas Hopkins, and David Cohn. Active learning to recognize multiple types of plankton. Journal of Machine Learning Research, 2005.
  • Madaan et al. (2021) Divyam Madaan, Jaehong Yoon, Yuanchun Li, Yunxin Liu, and Sung Ju Hwang. Representational continuity for unsupervised continual learning. In International Conference on Learning Representations, 2021.
  • McCloskey & Cohen (1989) Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp.  109–165. Elsevier, 1989.
  • Mundt et al. (2020) Martin Mundt, Yong Won Hong, Iuliia Pliushch, and Visvanathan Ramesh. A wholistic view of continual learning with deep neural networks: Forgotten lessons and the bridge to active and open world learning. arXiv preprint arXiv:2009.01797, 2020.
  • Pontiki et al. (2014) Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp.  27–35, Dublin, Ireland, August 2014. Association for Computational Linguistics. doi: 10.3115/v1/S14-2004. URL https://aclanthology.org/S14-2004.
  • Prabhu et al. (2020a) Ameya Prabhu, Philip Torr, and Puneet Dokania. Gdumb: A simple approach that questions our progress in continual learning. In The European Conference on Computer Vision (ECCV), August 2020a.
  • Prabhu et al. (2020b) Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. Gdumb: A simple approach that questions our progress in continual learning. In European conference on computer vision, pp.  524–540. Springer, 2020b.
  • Ratcliff (1990) Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990.
  • Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.  2001–2010, 2017.
  • Riemer et al. (2019) Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1gTShAct7.
  • Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Scheffer et al. (2001) Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden markov models for information extraction. In International Symposium on Intelligent Data Analysis, pp.  309–318, 2001.
  • Sener & Savarese (2018) Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1aIuk-RW.
  • Settles (2012) Burr Settles. Active Learning. Morgan & Claypool Publishers, 2012. ISBN 1608457257, 9781608457250.
  • Settles & Craven (2008) Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the conference on empirical methods in natural language processing, pp.  1070–1079, 2008.
  • Vu et al. (2019) Thuy-Trang Vu, Ming Liu, Dinh Phung, and Gholamreza Haffari. Learning how to active learn by dreaming. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4091–4101, 2019. URL https://aclanthology.org/P19-1401.
  • Wu et al. (2022) Tongtong Wu, Massimo Caccia, Zhuang Li, Yuan-Fang Li, Guilin Qi, and Gholamreza Haffari. Pretrained language model in continual learning: A comparative study. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=figzpGMrdD.
  • Yuan et al. (2020) Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. Cold-start active learning through self-supervised language modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  7935–7948, 2020. doi: 10.18653/v1/2020.emnlp-main.637. URL https://aclanthology.org/2020.emnlp-main.637.
  • Zeng et al. (2019) Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019.
  • Zhang et al. (2022) Zhisong Zhang, Emma Strubell, and Eduard Hovy. A survey of active learning for natural language processing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

Appendix A Continual Learning Methods

We consider the following continual learning baselines:

  • Finetuning (FT) is a naïve CL method which simply continues training the best checkpoint $\theta_{t-1}$ learned in the previous task on the training data of the current task $\tau_{t}$.

  • Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) is a regularisation method that prevents the model learned on task $\tau_{t-1}$, parametrised by $\theta_{t-1}$, from catastrophic forgetting when learning a new task $\tau_{t}$. The overall loss for training on data $\mathcal{D}^{l}_{t}$ of task $\tau_{t}$ is

    $$\mathcal{L}_{\theta}(\mathcal{D}_{t})+\lambda\sum_{i}F_{i,i}\,(\theta_{i}-\theta^{*}_{t-1,i})^{2} \qquad (2)$$

    where $\lambda$ is the coefficient controlling the contribution of the regularisation term, and $F_{i,i}(\cdot)$ is the diagonal of the Fisher information matrix of the parameters $\theta^{*}_{t-1}$ learned on the previous tasks. It is approximated by the squared gradients calculated with data sampled from the previous task, that is

    $$F_{i,i}(\theta)=\frac{1}{N}\sum_{x_{j},y_{j}\in\mathcal{D}^{l}_{t-1}}\left(\frac{\partial\mathcal{L}(x_{j},y_{j};\theta)}{\partial\theta_{i}}\right)^{2} \qquad (3)$$
  • Experience Replay (ER) (Rolnick et al., 2019) exploits a fixed replay buffer $\mathcal{M}$ to store a small amount of training data from the previous tasks. Some replay examples are added to the current minibatch of the current task for rehearsal during training:

    $$\mathcal{L}_{\theta_{t}}(\mathcal{D}_{t})+\beta\,\mathbb{E}_{(x^{\prime},y^{\prime})\sim\mathcal{M}}[l(y^{\prime},\hat{y^{\prime}};\theta_{t})] \qquad (4)$$

    where $l(\cdot)$ is the loss function and $\beta$ is a coefficient controlling the contribution of replay examples. Upon finishing the training of each task, some training instances from the current task are added to the memory buffer. If the memory is full, the newly added instances replace existing samples in the memory. In this paper, we maintain a fixed memory buffer of size $m=400$ and allocate $\frac{m}{t}$ places for each task. We employ random sampling both to select new training instances and as the sample replacement strategy.

  • DER (Buzzega et al., 2020) and DER++ (Buzzega et al., 2020) modify the experience replay loss with the logit matching loss for the samples from previous tasks.

  • iCaRL (Rebuffi et al., 2017) is a learning algorithm for the class-IL scenario. It learns class representations incrementally and selects samples close to the class representation to add to the memory buffer.

  • GDumb (Prabhu et al., 2020a) greedily trains the model from scratch on only the samples from the memory buffer.
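
To make the EWC regulariser of Equations 2 and 3 concrete, the NumPy sketch below estimates the diagonal Fisher information from per-example gradients and evaluates the weighted penalty. It is a toy illustration only: the function names `diagonal_fisher` and `ewc_penalty`, the gradient values and the stand-in task loss are assumptions introduced here, not the authors' implementation, and the penalty uses the squared parameter difference of the standard EWC formulation (Kirkpatrick et al., 2017).

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """Eq. 3: mean of squared per-example gradients, one value per parameter."""
    grads = np.asarray(per_example_grads, dtype=float)  # shape (N, num_params)
    return (grads ** 2).mean(axis=0)

def ewc_penalty(theta, theta_prev, fisher, lam):
    """Regulariser of Eq. 2: lam * sum_i F_ii * (theta_i - theta_prev_i)^2."""
    theta = np.asarray(theta, dtype=float)
    theta_prev = np.asarray(theta_prev, dtype=float)
    return lam * float(np.sum(fisher * (theta - theta_prev) ** 2))

# Hypothetical gradients of the previous-task loss: N=2 examples, 2 parameters.
grads_prev_task = [[1.0, 0.0], [3.0, 2.0]]
fisher = diagonal_fisher(grads_prev_task)

# Penalty for moving from the previous optimum [0.0, 0.5] to [1.0, 1.0].
penalty = ewc_penalty([1.0, 1.0], [0.0, 0.5], fisher, lam=0.1)

# 0.7 stands in for the current-task loss; the sum is the overall EWC loss.
total_loss = 0.7 + penalty
```

Parameters with a large Fisher value (here the first one) are penalised more strongly for drifting away from the previous solution, which is how the regulariser protects knowledge of earlier tasks.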

Appendix B Active Learning Methods

We consider the following AL strategies in our experiment

  • Random sampling (Rand) selects query datapoints randomly.

  • Entropy (Ent) strategy selects the sentences with the highest predictive entropy

    $$f_{\textrm{AL}}(x)=-\sum_{y}\textrm{Pr}_{\theta}(y|x)\log\textrm{Pr}_{\theta}(y|x) \qquad (5)$$

    where the sum runs over the candidate labels $y$ of the given sentence $x$.

  • Min-margin (Marg) strategy (Scheffer et al., 2001; Luo et al., 2005) selects the datapoint $x$ with the smallest difference in prediction probability between the two most likely labels $\hat{y}_{1}$ and $\hat{y}_{2}$ as follows:

    $$f_{\textrm{AL}}(x)=-(\textrm{Pr}_{\theta}(\hat{y}_{1}|x)-\textrm{Pr}_{\theta}(\hat{y}_{2}|x)) \qquad (6)$$
  • BADGE (Ash et al., 2020) measures uncertainty as the gradient embedding with respect to parameters in the output (pre-softmax) layer and then chooses an appropriately diverse subset by sampling via $k$-means++ (Arthur & Vassilvitskii, 2007).

  • Embedding $k$-means (kMeans) is the generalisation of BERT-KM (Yuan et al., 2020), which uses the $k$-means algorithm to cluster the examples in the unlabelled pool based on the contextualised embeddings of the sentences. The nearest neighbours to each cluster centroid are chosen for labelling. For the embedding $k$-means with MTL, we cluster the training sentences into $k\times T$ clusters and greedily choose a sentence from $k$ clusters for each task based on the distance to the centroids. In this paper, we compute the sentence embedding using the hidden states of the last layer of RoBERTa (Liu et al., 2019) instead of BERT (Devlin et al., 2019).

  • coreset (Sener & Savarese, 2018) is a diversity-based sampling method for computer vision tasks.
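
The two uncertainty-based strategies above can be sketched in a few lines of NumPy. The snippet below is a minimal illustration under stated assumptions: the helper names (`entropy_score`, `margin_score`, `select_queries`) and the probability matrix are hypothetical, the entropy is the standard predictive entropy summed over all labels, and the model's predicted class probabilities are assumed to be given.

```python
import numpy as np

def entropy_score(probs):
    """Predictive entropy per example; a higher score means more uncertain."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

def margin_score(probs):
    """Negative gap between the two most likely labels (Eq. 6)."""
    p = np.sort(np.asarray(probs, dtype=float), axis=1)  # ascending per row
    return -(p[:, -1] - p[:, -2])

def select_queries(scores, query_size):
    """Indices of the `query_size` highest-scoring unlabelled examples."""
    return np.argsort(-scores, kind="stable")[:query_size]

# Hypothetical class probabilities for three unlabelled examples.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.50, 0.49, 0.01]])

ent_pick = select_queries(entropy_score(probs), query_size=1)
marg_pick = select_queries(margin_score(probs), query_size=1)
```

On this toy pool the two strategies disagree: entropy prefers the second example, whose probability mass is spread over all three labels, while min-margin prefers the third, whose two top labels are almost tied.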

Appendix C Dataset

Text classification tasks

ASC is the task of identifying the sentiment (negative, neutral, positive) of a given aspect in its context sentence. We use the ASC dataset released by Ke et al. (2021). It contains reviews of 19 products collected from Hu & Liu (2004), Liu et al. (2015), Ding et al. (2008) and Pontiki et al. (2014). We remove the neutral labels from the two SemEval-2014 tasks (Pontiki et al., 2014) to ensure the same label set across tasks. The 20News dataset (Lang, 1995) contains 20 news topics, and the goal is to classify the topic of a given news document. We split the dataset into 10 tasks with 2 classes per task. Each task contains 1600 training and 200 validation and test sentences. The task grouping of the 20News dataset is shown in Table 3, and the data statistics of the ASC and 20News datasets are reported in Table 2 and Table 3, respectively.

Image classification tasks

The MNIST handwritten digits dataset (Lecun et al., 1998) contains 60K (approximately 6,000 per digit) normalised training images and 10K (approximately 1,000 per digit) testing images, all of size 28 × 28. In the sequential MNIST task (S-MNIST), the 10 digit classes are divided into five tasks, each containing two sequential digits, and MNIST images are presented to the sequence model as a flattened 784 × 1 sequence for digit classification. The order of the digits is fixed for each task order. The CIFAR-10 dataset (Krizhevsky et al., 2009) contains 50K images for training and 10K for testing, all of size 32 × 32. In the sequential CIFAR-10 task, these images are passed into the model one at each time step, as a flattened 784 × 1 sequence.
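
The class-to-task splits described above (five two-digit tasks for S-MNIST, ten two-topic tasks for 20News) can be sketched as follows. This is an illustrative assumption about the grouping, with consecutive class ids paired in order; the helper names `split_into_tasks` and `task_subset` and the toy labels are hypothetical.

```python
def split_into_tasks(class_ids, classes_per_task):
    """Group an ordered list of class ids into consecutive tasks."""
    return [class_ids[i:i + classes_per_task]
            for i in range(0, len(class_ids), classes_per_task)]

def task_subset(labels, task_classes):
    """Indices of the examples whose label belongs to the given task."""
    wanted = set(task_classes)
    return [i for i, y in enumerate(labels) if y in wanted]

# S-MNIST: 10 digits grouped into five tasks of two sequential digits.
mnist_tasks = split_into_tasks(list(range(10)), classes_per_task=2)

# 20News: 20 topics grouped into ten tasks of two classes.
news_tasks = split_into_tasks(list(range(20)), classes_per_task=2)

# Hypothetical label list; pick the examples that belong to the second task.
toy_labels = [0, 3, 2, 9, 1, 2]
task2_indices = task_subset(toy_labels, mnist_tasks[1])
```

Different task orders can then be obtained by permuting the list of tasks while keeping the classes inside each task fixed.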

Domain      train  dev  test
Speaker       352   44    44
Router        245   31    31
Computer      283   35    36
Nokia6610     271   34    34
Nikon4300     162   20    21
Creative      677   85    85
CanonG3       228   29    29
ApexAD        343   43    43
CanonD500     118   15    15
Canon100      175   22    22
Diaper        191   24    24
Hitachi       212   26    27
ipod          153   20    20
Linksys       176   22    23
MicroMP3      484   61    61
Nokia6600     362   45    46
Norton        194   24    25
Restaurant   2873   96   964
Laptop       1730  123   469
Table 2: Statistics of ASC dataset (domain incremental learning scenario).
Task  Topic                     train  dev  test
1     comp.graphics               800  100   100
1     comp.os.ms-windows.misc     800  100   100
2     comp.sys.ibm.pc.hardware    800  100   100
2     comp.sys.mac.hardware       800  100   100
3     comp.windows.x              800  100   100
3     rec.autos                   800  100   100
4     rec.motorcycles             800  100   100
4     rec.sport.baseball          800  100   100
5     rec.sport.hockey            800  100   100
5     sci.crypt                   800  100   100
6     sci.electronics             800  100   100
6     sci.med                     800  100   100
7     sci.space                   800  100   100
7     misc.forsale                800  100   100
8     talk.politics.misc          800  100   100
8     talk.politics.guns          800  100   100
9     talk.politics.mideast       800  100   100
9     talk.religion.misc          800  100   100
10    alt.atheism                 800  100   100
10    soc.religion.christian      797  100   100
Table 3: Statistics of 20News dataset (class incremental learning scenario).

Appendix D Model training and hyper-parameters

We train the classifier using the Adam optimiser with a learning rate of 1e-5 and a batch size of 16 sentences, for up to 50 and 20 epochs on ASC and 20News respectively, with early stopping if the loss on the development set does not improve for 3 epochs. All the classifiers are initialised with RoBERTa base. Each experiment is run on a single V100 GPU and takes 10-15 hours to finish. For experience replay, we use a fixed memory size of 400 for both ASC and 20News.
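
The fixed-size replay memory can be pictured with the following sketch, which follows the description in Appendix A (a buffer of 400 examples, an equal share of slots per seen task, random selection and random replacement). The class `ReplayBuffer` and its method names are hypothetical and the code is a simplified illustration rather than the authors' implementation; in particular, integer division is assumed for the per-task share.

```python
import random

class ReplayBuffer:
    """Fixed-size memory that gives each seen task an equal share of slots."""

    def __init__(self, capacity=400, seed=0):
        self.capacity = capacity
        self.per_task = {}              # task id -> list of stored examples
        self.rng = random.Random(seed)

    def add_task(self, task_id, examples):
        """Call after training on a task: shrink old shares, store new samples."""
        quota = self.capacity // (len(self.per_task) + 1)
        for tid, stored in self.per_task.items():
            if len(stored) > quota:     # randomly drop old samples to make room
                self.per_task[tid] = self.rng.sample(stored, quota)
        keep = min(quota, len(examples))
        self.per_task[task_id] = self.rng.sample(list(examples), keep)

    def sample(self, batch_size):
        """Draw replay examples to append to the current minibatch."""
        pool = [ex for stored in self.per_task.values() for ex in stored]
        return self.rng.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        return sum(len(stored) for stored in self.per_task.values())

# Toy usage: three tasks with 1,000 labelled examples each.
buffer = ReplayBuffer(capacity=400)
for t in range(1, 4):
    buffer.add_task(t, [(f"x{t}_{i}", i % 2) for i in range(1000)])
sizes = {tid: len(stored) for tid, stored in buffer.per_task.items()}
replay_batch = buffer.sample(16)
```

After three tasks each task holds 400 // 3 = 133 examples, so the buffer never exceeds its capacity while every earlier task remains represented.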

For the sequential MNIST dataset, we train an MLP classifier with a learning rate of 0.01 and a batch size of 32 for 10 epochs. The query size is 0.5% and the annotation budget is 10%. For the sequential CIFAR-10 dataset, the model architecture is ResNet18, trained with a learning rate of 0.05 and a batch size of 32 for 30 epochs. The query size is 1% and the annotation budget is 25%.
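
As a rough worked example of what these query sizes and budgets imply, the sketch below computes the number of examples labelled per AL round and the number of rounds needed to exhaust the budget. It assumes that both percentages are relative to the unlabelled pool of one task and that every round queries the same number of examples; the pool sizes (two classes of roughly 6,000 MNIST images or 5,000 CIFAR-10 images) and the function name `al_schedule` are illustrative assumptions.

```python
def al_schedule(pool_size, query_frac, budget_frac):
    """Return (examples per round, number of rounds, total labelled examples)."""
    per_round = round(pool_size * query_frac)
    budget = round(pool_size * budget_frac)
    rounds = budget // per_round
    return per_round, rounds, per_round * rounds

# Hypothetical per-task pools of two classes each.
mnist = al_schedule(pool_size=12_000, query_frac=0.005, budget_frac=0.10)
cifar = al_schedule(pool_size=10_000, query_frac=0.01, budget_frac=0.25)
```

Under these assumptions, S-MNIST uses 20 rounds of 60 queries per task and S-CIFAR10 uses 25 rounds of 100 queries per task.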

Appendix E Additional Results and Analysis

E.1 Active Continual Learning Results on Text Classification Tasks

Domain Incremental Learning

Table 5 reports the average accuracy at the end of all tasks on the ASC dataset in the domain-IL scenario. Overall, multi-task learning, CL and ACL models outperform individual learning (Indiv), suggesting positive knowledge transfer across domains. Compared to naive finetuning, EWC and ER tend to perform slightly better, but the difference is not significant. While using only 30% of the training dataset, ACL methods achieve comparable results with CL methods trained on the entire dataset (Full). In some cases, the ACL with ER and EWC even surpasses the performance of the corresponding CL methods (Full). In addition, the uncertainty-based AL strategies consistently outperform the diversity-based method (kMeans).
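
For readers who want to reproduce the reported metric, the sketch below shows one way to compute the average accuracy at the end of all tasks and its mean and standard deviation over runs. The accuracy matrices are made-up numbers, the function name `final_average_accuracy` is hypothetical, and the use of the population standard deviation is an assumption.

```python
import numpy as np

def final_average_accuracy(acc_matrix):
    """acc_matrix[i][j] is the accuracy on task j after training on task i.
    The metric is the mean of the last row, i.e. after the final task."""
    return float(np.mean(np.asarray(acc_matrix, dtype=float)[-1]))

# Two hypothetical runs (task orders) with two tasks each.
runs = [
    [[0.90, 0.00], [0.80, 0.92]],
    [[0.88, 0.00], [0.84, 0.90]],
]
per_run = [final_average_accuracy(r) for r in runs]
mean_acc = float(np.mean(per_run))
std_acc = float(np.std(per_run))   # population standard deviation (assumed)
```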

Class Incremental Learning

The overall task performance on the 20News dataset is reported in Table 4. In contrast to the domain-IL scenario, the individual learning models (Indiv) surpass other methods by a significant margin in the class-IL scenario. The reason is that Indiv models only need to distinguish news from 2 topics, while Mtl, CL and ACL models solve a 20-class text classification task. Among the CL algorithms, finetuning and EWC perform poorly. This is in line with the previous finding that regularisation-based CL methods with pretrained language models suffer severely from catastrophic forgetting (Wu et al., 2022). While outperforming other CL methods significantly, ER still lags far behind Mtl. ACL performs comparably to CL (Full), which is trained on the entire training dataset. The diversity-aware AL strategies (kMeans and BADGE) outperform the uncertainty-based strategies (Ent and Marg). We speculate that ACL in non-overlapping class-IL is equivalent to the cold-start AL problem: as the models are poorly calibrated, the uncertainty scores become unreliable.

Table 4: Average accuracy (6 runs or task orders) and standard deviation of different ACL models at the end of tasks on 20News dataset in class-IL and task-IL settings.
Class-IL
        Ceiling Methods             ACL Methods
        Indiv         Mtl           Ft            EWC           ER
20% Labelled Data
Rand    88.95 ±0.60   67.33 ±0.46    8.90 ±0.44    8.99 ±0.52   55.25† ±2.34
Ent     89.08 ±0.58   65.21 ±0.88    8.84 ±0.61    8.94 ±0.48   52.42† ±1.42
Marg    90.38 ±0.64   67.66 ±0.47    8.95 ±0.55    9.05 ±0.58   52.83† ±2.24
BADGE   88.74 ±0.40   67.64 ±0.54    8.95 ±0.48    8.94 ±0.47   54.60† ±2.13
kMeans  88.67 ±0.74   67.41 ±0.68    9.49 ±1.33    8.88 ±0.50   54.54† ±2.38
Full Labelled Data
Full    91.78 ±0.47   72.42 ±0.62    9.94 ±0.88    9.32 ±0.43   55.15† ±1.00

Task-IL
        Ceiling Methods             ACL Methods
        Indiv         Mtl           Ft            EWC           ER
20% Labelled Data
Rand    88.95 ±0.60   89.27 ±0.52   63.34 ±2.48   63.23 ±1.93   87.81† ±0.59
Ent     89.08 ±0.58   89.10 ±0.46   61.41 ±4.81   61.38 ±7.03   88.47† ±0.58
Marg    90.38 ±0.64   89.80 ±0.38   62.41 ±2.63   61.18 ±2.38   88.65† ±0.23
BADGE   88.74 ±0.40   89.52 ±0.84   64.92 ±1.72   64.16 ±3.42   88.05† ±0.54
kMeans  88.67 ±0.74   89.23 ±0.45   68.32 ±1.98   64.04 ±3.75   87.47† ±1.08
Full Labelled Data
Full    91.78 ±0.47   91.17 ±0.69   70.48 ±3.25   77.45 ±1.56   89.33† ±0.62

Task Incremental Learning

Having the task id as additional input significantly improves the accuracy of the CL methods. Both EWC and ER surpass the finetuning baseline. As observed in the class-IL scenario, ER is the best overall CL method. Adding examples of previous tasks to the training minibatch of the current task resembles the effect of Mtl in CL, resulting in a large boost in performance. Overall, ACL lags behind the Full CL model, but the gap is smaller for ER.
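
To illustrate why a known task id makes prediction easier, the sketch below contrasts class-IL and task-IL inference on the same logits. Masking the output layer to the classes of the given task is one common way of using the task id and is an assumption here; the paper only states that the task id is an additional input. The function names and the logits are hypothetical.

```python
import numpy as np

def predict_class_il(logits):
    """Class-IL: choose among all classes seen so far."""
    return int(np.argmax(logits))

def predict_task_il(logits, task_classes):
    """Task-IL: the task id restricts the choice to that task's classes."""
    logits = np.asarray(logits, dtype=float)
    masked = np.full(len(logits), -np.inf)
    masked[task_classes] = logits[task_classes]
    return int(np.argmax(masked))

# Hypothetical logits over 4 classes, split into two tasks of two classes.
logits = np.array([0.2, 1.1, 3.0, 0.4])
class_il_pred = predict_class_il(logits)            # may pick a class of another task
task_il_pred = predict_task_il(logits, [0, 1])      # restricted to task 1's classes
```

In this toy case the class-IL prediction is drawn to a class of the most recent task, whereas the task-IL prediction stays within the classes of the queried task, which is consistent with the much higher task-IL accuracies of Ft and EWC in Table 4.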

Table 5: Average accuracy (6 runs or task orders) and standard deviation of different ACL models at the end of tasks on ASC dataset (Domain-IL). The best AL score in each column is marked in bold. † denotes the best CL algorithm within the same AL strategy.
        Ceiling Methods             ACL Methods
        Indiv         Mtl           Ft            EWC           ER
30% Labelled Data
Rand    88.01 ±1.78   94.41 ±0.80   94.10† ±0.77  93.03 ±0.75   93.71 ±1.29
Ent     91.51 ±0.84   94.95 ±0.58   95.08 ±0.52   95.58† ±0.33  95.07 ±0.55
Marg    91.75 ±0.74   94.49 ±0.83   95.31 ±0.55   95.68† ±0.56  95.55 ±0.38
BADGE   83.61 ±3.23   93.85 ±0.42   92.68 ±1.11   93.15 ±0.42   94.24† ±0.49
kMeans  86.26 ±2.42   93.93 ±0.60   94.35† ±0.64  93.59 ±0.72   93.76 ±1.57
Full Labelled Data
Full    92.36 ±0.84   95.00 ±0.27   94.30 ±1.09   94.85† ±1.00  94.19 ±1.21

E.2 Active Continual Learning Results on Image Classification Tasks

Table 6: Average accuracy (6 runs or task orders) and standard deviation of different ACL models at the end of tasks on image classification benchmarks.
        Ceiling Methods         ACL Methods
        Indiv       Mtl         Ft          EWC         ER          iCaRL       GDumb       DER         DER++       AGem
S-MNIST Class-IL 10% Labelled Data
Rand    99.73±0.03  93.43±0.20  19.97±0.02  20.01±0.08  87.81±0.66  77.88±2.05  85.34±0.82  78.51±3.11  78.80±3.29  50.60±9.26
Ent     99.74±0.03  95.76±0.19  19.97±0.02  17.46±3.07  79.38±3.00  73.59±3.80  76.06±2.74  67.19±5.50  67.19±5.50  39.40±7.19
Marg    99.74±0.03  96.32±0.16  19.98±0.01  19.72±0.47  79.89±1.98  75.66±2.85  80.54±1.78  65.69±8.41  64.48±7.62  39.91±8.17
BADGE   99.74±0.03  93.20±0.33  19.92±0.06  20.08±0.13  87.40±0.44  76.35±2.36  85.35±0.61  83.25±3.31  78.48±5.59  56.52±3.80
coreset 99.75±0.03  95.26±0.24  19.98±0.01  21.27±1.91  83.12±1.16  73.51±5.98  78.80±2.17  74.17±6.21  74.17±6.21  38.48±7.19
Full Labelled Data
Full    99.71±0.02  97.57±0.15  19.98±0.01  20.41±0.71  84.68±1.88  76.69±0.80  84.52±1.49  73.77±8.63  88.64±1.58  56.89±9.10
S-MNIST Task-IL 10% Labelled Data
Rand    99.73±0.03  99.07±0.03  78.42±3.33  94.34±3.32  98.75±0.30  97.54±0.25  97.48±0.08  98.30±0.57  96.96±0.64  97.92±0.29
Ent     99.74±0.03  99.69±0.04  75.31±5.33  95.81±3.54  98.78±0.39  98.08±0.88  97.23±0.22  98.54±0.47  95.85±1.42  97.38±1.02
Marg    99.74±0.03  99.62±0.04  74.81±5.24  95.81±3.54  98.70±0.42  96.71±1.09  97.23±0.22  98.54±0.47  96.25±0.77  96.85±0.60
BADGE   99.74±0.03  98.93±0.10  74.81±5.24  95.81±3.54  98.87±0.03  97.02±0.54  97.14±0.31  98.43±0.23  96.85±1.95  97.98±0.44
coreset 99.75±0.03  99.54±0.05  78.39±6.42  94.11±4.44  98.83±0.12  97.58±2.44  97.09±0.32  98.39±0.36  97.48±0.53  98.07±0.60
Full Labelled Data
Full    99.71±0.02  99.71±0.04  77.82±7.09  94.38±3.81  98.79±0.10  97.22±5.38  97.03±0.64  98.76±0.12  98.73±0.16  98.65±0.18
S-CIFAR10 Class-IL 25% Labelled Data
Rand    86.24±0.45  66.40±1.34  17.11±1.10  16.95±0.74  24.12±2.20  35.21±2.29  26.68±4.24  17.67±0.62  18.32±0.77  17.36±1.12
Ent     87.16±0.55  65.08±3.04  17.46±1.01  17.17±1.03  25.20±3.79  35.74±1.75  22.89±2.40  17.90±1.18  17.71±1.39  17.88±0.95
Marg    87.47±0.35  68.66±1.10  17.55±1.05  17.08±0.90  24.28±4.25  34.02±3.82  24.32±2.71  17.76±1.15  17.34±1.08  17.93±0.90
BADGE   84.99±0.23  66.27±1.19  17.40±1.01  16.95±0.98  24.35±1.83  31.75±3.22  29.26±1.93  17.51±1.07  17.27±1.11  17.48±0.93
coreset 86.39±0.41  67.33±1.48  17.26±1.23  17.04±0.85  23.58±2.57  35.39±2.79  22.03±4.78  17.57±0.70  17.45±0.92  17.72±0.89
Full Labelled Data
Full    94.99±0.36  89.60±0.73  19.09±0.42  19.20±0.39  23.26±4.21  43.70±3.66  27.57±3.12  19.16±0.35  28.85±6.39  21.32±1.86
S-CIFAR10 Task-IL 25% Labelled Data
Rand    86.24±0.45  90.75±0.81  59.65±3.50  69.03±2.51  76.39±1.11  77.07±2.18  67.23±1.69  70.55±2.00  69.62±3.54  71.34±2.46
Ent 87.16±0.55\scriptstyle\pm 0.55 90.18±1.07\scriptstyle\pm 1.07 59.93±5.09\scriptstyle\pm 5.09 67.80±2.36\scriptstyle\pm 2.36 73.96±1.93\scriptstyle\pm 1.93 77.93±2.27\scriptstyle\pm 2.27 63.23±3.05\scriptstyle\pm 3.05 69.45±3.55\scriptstyle\pm 3.55 68.65±3.64\scriptstyle\pm 3.64 66.67±2.59\scriptstyle\pm 2.59
Marg 87.47±0.35\scriptstyle\pm 0.35 91.20±0.83\scriptstyle\pm 0.83 58.18±3.42\scriptstyle\pm 3.42 69.50±2.62\scriptstyle\pm 2.62 75.65±1.86\scriptstyle\pm 1.86 78.93±1.35\scriptstyle\pm 1.35 67.19±2.37\scriptstyle\pm 2.37 63.92±1.70\scriptstyle\pm 1.70 65.34±4.89\scriptstyle\pm 4.89 63.58±4.42\scriptstyle\pm 4.42
BADGE 84.99±0.23\scriptstyle\pm 0.23 90.84±0.46\scriptstyle\pm 0.46 55.52±1.77\scriptstyle\pm 1.77 68.39±3.97\scriptstyle\pm 3.97 75.96±1.92\scriptstyle\pm 1.92 78.93±3.23\scriptstyle\pm 3.23 67.36±2.21\scriptstyle\pm 2.21 67.89±4.20\scriptstyle\pm 4.20 71.93±2.71\scriptstyle\pm 2.71 67.51±1.94\scriptstyle\pm 1.94
coreset 86.39±0.41\scriptstyle\pm 0.41 91.13±0.39\scriptstyle\pm 0.39 58.08±2.70\scriptstyle\pm 2.70 68.75±3.98\scriptstyle\pm 3.98 72.86±2.53\scriptstyle\pm 2.53 79.32±3.36\scriptstyle\pm 3.36 65.21±2.80\scriptstyle\pm 2.80 66.18±1.10\scriptstyle\pm 1.10 70.26±5.20\scriptstyle\pm 5.20 63.98±2.30\scriptstyle\pm 2.30
Full Labelled Data
Full 94.99±0.36\scriptstyle\pm 0.36 97.86±0.18\scriptstyle\pm 0.18 59.55±2.22\scriptstyle\pm 2.22 65.00±5.14\scriptstyle\pm 5.14 83.47±1.72\scriptstyle\pm 1.72 77.76±4.16\scriptstyle\pm 4.16 70.23±2.09\scriptstyle\pm 2.09 77.16±4.14\scriptstyle\pm 4.14 86.25±0.75\scriptstyle\pm 0.75 80.90±6.00\scriptstyle\pm 6.00
P-MNIST Domain-IL 5% Labelled Data
Rand 94.33±0.12\scriptstyle\pm 0.12 92.49±0.36\scriptstyle\pm 0.36 55.83±2.81\scriptstyle\pm 2.81 38.55±2.27\scriptstyle\pm 2.27 75.10±0.75\scriptstyle\pm 0.75 53.61±0.82\scriptstyle\pm 0.82 86.22±0.31\scriptstyle\pm 0.31 86.23±0.39\scriptstyle\pm 0.39 74.09±1.10\scriptstyle\pm 1.10
Ent 97.04±0.04\scriptstyle\pm 0.04 95.00±0.10\scriptstyle\pm 0.10 44.72±2.06\scriptstyle\pm 2.06 37.83±1.07\scriptstyle\pm 1.07 65.08±1.16\scriptstyle\pm 1.16 42.28±1.59\scriptstyle\pm 1.59 76.02±1.15\scriptstyle\pm 1.15 75.72±1.07\scriptstyle\pm 1.07 63.89±0.91\scriptstyle\pm 0.91
Marg 97.23±0.03\scriptstyle\pm 0.03 95.38±0.14\scriptstyle\pm 0.14 44.07±3.03\scriptstyle\pm 3.03 38.15±1.26\scriptstyle\pm 1.26 65.82±1.49\scriptstyle\pm 1.49 48.52±1.88\scriptstyle\pm 1.88 77.91±0.69\scriptstyle\pm 0.69 77.82±0.63\scriptstyle\pm 0.63 64.42±0.70\scriptstyle\pm 0.70
BADGE 93.80±0.07\scriptstyle\pm 0.07 92.19±0.38\scriptstyle\pm 0.38 58.44±1.07\scriptstyle\pm 1.07 39.02±1.93\scriptstyle\pm 1.93 74.99±1.35\scriptstyle\pm 1.35 54.04±1.30\scriptstyle\pm 1.30 86.64±0.36\scriptstyle\pm 0.36 86.62±0.46\scriptstyle\pm 0.46 74.00±0.94\scriptstyle\pm 0.94
coreset 96.24±0.04\scriptstyle\pm 0.04 94.11±0.17\scriptstyle\pm 0.17 46.55±1.93\scriptstyle\pm 1.93 35.35±2.06\scriptstyle\pm 2.06 67.90±1.00\scriptstyle\pm 1.00 42.79±1.95\scriptstyle\pm 1.95 77.60±1.39\scriptstyle\pm 1.39 76.94±0.83\scriptstyle\pm 0.83 67.20±0.73\scriptstyle\pm 0.73
Full Labelled Data
Full 97.62±0.05\scriptstyle\pm 0.05 96.69±0.14\scriptstyle\pm 0.14 35.12±2.31\scriptstyle\pm 2.31 40.13±2.30\scriptstyle\pm 2.30 60.13±0.65\scriptstyle\pm 0.65 53.18±0.98\scriptstyle\pm 0.98 71.49±1.62\scriptstyle\pm 1.62 72.63±0.85\scriptstyle\pm 0.85 58.51±3.22\scriptstyle\pm 3.22

Results on Permuted MNIST

We plot the forgetting-learning profile of P-MNIST in Figure 11. In general, ER has a lower forgetting rate than Ft and EWC, but a slightly slower learning rate than EWC.

Figure 11: Forgetting-Learning Profile of each ACL run on P-MNIST dataset (Domain-IL scenario).

E.3 Independent vs. Sequential Labelling

Tables 7, 8 and 9 report the difference in performance between CL with independent AL and the corresponding ACL method for P-MNIST, S-MNIST and S-CIFAR10, respectively.

Table 7: Relative performance of CL with independent AL with respect to the corresponding ACL for P-MNIST.
Ft EWC ER GDumb DER DER++ AGEM
Ind. Rand +3.56 +0.09 -0.04 +0.48 +0.54 +0.81 -0.67
Ind. Ent +0.35 -4.46 -2.32 -9.08 -5.59 -4.08 -2.36
Ind. Marg +3.82 -3.49 -0.72 -9.97 -3.84 -3.99 -2.24
Ind. BADGE -6.30 -0.82 +0.03 +1.65 -0.99 -0.08 -1.29
Table 8: Relative performance of CL with independent AL with respect to the corresponding ACL for S-MNIST.
S-MNIST (Class-IL) S-MNIST (Task-IL)
Ft EWC ER iCaRL GDumb DER DER++ AGEM Ft EWC ER iCaRL GDumb DER DER++ AGEM
Ind. Rand -0.10 +3.60 +0.70 -1.39 +0.42 +4.20 +3.90 +2.18 +3.82 -1.33 -0.12 -0.15 -0.10 -1.83 -0.43 +0.08
Ind. Ent -0.30 +3.59 +1.83 -3.30 -3.72 -5.06 -5.29 +1.31 -0.35 -10.70 -0.63 -0.38 -3.65 -5.45 -2.76 -1.30
Ind. Marg -0.19 -1.94 +0.06 -6.38 -7.43 -4.93 -6.46 +2.37 +1.68 -11.91 -0.52 +0.53 -3.53 -5.66 -3.38 +0.89
Ind. BADGE -0.06 +6.14 +0.59 +0.32 +0.25 +0.15 +4.92 +2.59 +3.35 -0.64 -0.47 -0.30 -0.07 -0.78 +0.83 -0.05
Ind. Coreset -0.07 +0.66 -0.60 +1.77 -0.22 -3.98 -3.31 +2.28 -6.58 -4.94 -0.57 +0.29 -2.18 -3.49 -2.58 -0.79
Table 9: Relative performance of CL with independent AL with respect to the corresponding ACL for S-CIFAR10.
S-CIFAR10 (Class-IL) S-CIFAR10 (Task-IL)
Ft EWC ER iCaRL GDumb DER DER++ AGEM Ft EWC ER iCaRL GDumb DER DER++ AGEM
Ind. Rand +0.05 -0.43 +0.53 -1.36 +0.66 -0.66 -1.24 -0.15 -3.63 +0.18 -0.61 +0.96 +1.01 -1.43 +0.52 -0.96
Ind. Ent -0.19 -0.41 -1.47 -6.85 -2.55 -0.70 -0.75 -0.35 -4.97 +1.35 -2.88 +0.95 -3.35 -2.99 -0.31 -0.32
Ind. Marg -0.01 -0.76 -1.82 -2.19 -1.44 -0.22 +0.08 -0.59 -1.71 -1.87 -3.85 +1.60 -9.02 +5.63 +0.03 -0.25
Ind. BADGE -0.48 -0.48 -2.10 -0.18 -2.82 -0.83 -0.55 -0.63 +0.75 -4.43 +1.29 -2.36 +0.75 +3.80 -3.02 +2.34
Ind. Coreset +0.07 -0.25 -0.62 -5.12 +2.67 -0.55 -0.07 -0.38 -0.62 -0.18 +1.85 -0.67 -1.96 +3.52 -0.65 +3.16

E.4 Additional Analysis

Figure 12: Task prediction probability of ER-based ACL on 20News dataset (Class-IL).

Task Prediction Probability

Figure 12 shows the task prediction probability at the end of training for ACL with ER and the entropy strategy in the class-IL scenario. The results of other methods can be found in the appendix. Overall, the finetuning and EWC methods put all the predicted probability on the final task, which explains their inferior performance. In contrast, ER still assigns a small probability to the previous tasks. As expected, the sequential labelling method has a higher task prediction probability on the final task than the independent labelling method, suggesting higher forgetting and overfitting to the final task.
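
As an illustration of how such a quantity could be computed, the sketch below sums the predicted class probabilities that belong to each task and averages them over the test examples. This definition, the function name `task_prediction_probability` and the `task_classes` mapping are assumptions made for illustration; they are not taken from the paper's code.

```python
from typing import Dict, List, Sequence


def task_prediction_probability(
    probs: Sequence[Sequence[float]],
    task_classes: Dict[int, List[int]],
) -> Dict[int, float]:
    """Average probability mass that the model assigns to each task.

    probs: one row per test example, one probability per class.
    task_classes: hypothetical mapping from task id to its class indices.
    """
    if not probs:
        raise ValueError("probs must contain at least one example")

    totals = {task: 0.0 for task in task_classes}
    for row in probs:
        for task, classes in task_classes.items():
            # Mass on a task = sum of the probabilities of its classes.
            totals[task] += sum(row[c] for c in classes)

    n_examples = len(probs)
    return {task: total / n_examples for task, total in totals.items()}


# Toy example: two tasks with two classes each.
example_probs = [
    [0.1, 0.1, 0.3, 0.5],
    [0.0, 0.2, 0.4, 0.4],
]
example_tasks = {0: [0, 1], 1: [2, 3]}
result = task_prediction_probability(example_probs, example_tasks)
```

Under this definition, a model that has forgotten earlier tasks would place almost all of the mass on the final task id.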

Figure 13: Jaccard coefficient between independent and sequential labelling queries of ER-based ACL methods.

AL Query Overlap

We report the Jaccard coefficient between the selection sets of the sequential and independent entropy-based AL methods at the end of AL for each task in the task sequence in Figure 13. The uncertainty-based AL methods show low overlap. In contrast, the diversity-based AL methods start with high overlap, which decreases rapidly as subsequent tasks are learned.
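
For reference, the Jaccard coefficient of two query sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the function name and the example identifiers are hypothetical, and returning 1.0 for two empty sets is a convention chosen here rather than something stated in the paper.

```python
from typing import Hashable, Iterable


def jaccard_coefficient(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two collections of example ids."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty selections count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical ids of the examples selected for one task.
sequential_queries = [1, 2, 3, 4]
independent_queries = [3, 4, 5, 6]
overlap = jaccard_coefficient(sequential_queries, independent_queries)
```

A value near 0 means the two labelling schemes pick almost disjoint examples, while a value near 1 means they pick nearly the same ones.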

Task Similarity

We examine task similarity in the 20News dataset by measuring the percentage of vocabulary overlap (excluding stopwords) among tasks. Figure 14 shows that tasks in 20News have relatively low vocabulary overlap with each other (less than 30%). This explains the poor performance of EWC, which does not work well under abrupt distribution change between tasks (Ke et al., 2021).
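
One way such a vocabulary overlap could be measured is sketched below. The exact formula is not given in this section, so the sketch assumes the overlap is the share of one task's vocabulary that also occurs in another task's vocabulary. The tokeniser, the tiny stopword list and all names are placeholders for illustration only.

```python
import re
from typing import FrozenSet, Iterable, Set

# Placeholder stopword list; a real analysis would use a fuller one.
STOPWORDS: FrozenSet[str] = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "with"}
)


def vocabulary(docs: Iterable[str], stopwords: FrozenSet[str] = STOPWORDS) -> Set[str]:
    """Collect the lowercased alphabetic tokens of a task, minus stopwords."""
    vocab: Set[str] = set()
    for doc in docs:
        vocab.update(re.findall(r"[a-z]+", doc.lower()))
    return vocab - stopwords


def vocab_overlap_percent(docs_a: Iterable[str], docs_b: Iterable[str]) -> float:
    """Percentage of task A's vocabulary that also appears in task B.

    Assumed definition; the measure is asymmetric in A and B.
    """
    vocab_a = vocabulary(docs_a)
    vocab_b = vocabulary(docs_b)
    if not vocab_a:
        return 0.0
    return 100.0 * len(vocab_a & vocab_b) / len(vocab_a)


# Toy documents standing in for two tasks.
task_a_docs = ["The engine of the car is loud"]
task_b_docs = ["A loud engine and a fast bike"]
percent = vocab_overlap_percent(task_a_docs, task_b_docs)
```

A symmetric alternative would divide by the size of the union of the two vocabularies instead of the size of task A's vocabulary.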

Figure 14: Task similarity in 20News dataset.
Figure 15: Jaccard coefficient between sequential and independent AL queries of ACL methods on 20News (class-IL).
Figure 16: Jaccard coefficient between sequential and independent AL queries of ACL methods on ASC.