Improving Representational Continuity with Supervised Continued Pretraining
Abstract
We consider the continual representation learning setting: sequentially pretrain a model on tasks $T_1, \dots, T_t$, and then adapt it on a small amount of data from each task to check whether it has forgotten information from old tasks. Under a kNN adaptation protocol, prior work shows that continual learning methods improve forgetting over naive training (SGD). In reality, practitioners do not use kNN classifiers—they use the adaptation method that works best (e.g., fine-tuning)—and here we find that strong continual learning baselines do worse than naive training. Interestingly, we find that a method from the transfer learning community (LP-FT) outperforms naive training and the other continual learning methods. Even with standard kNN evaluation protocols, LP-FT performs comparably with strong continual learning methods (while being simpler and requiring less memory) on three standard benchmarks: sequential CIFAR-10, CIFAR-100, and TinyImageNet. LP-FT also reduces forgetting on a real-world satellite remote sensing dataset (FMoW), and a variant of LP-FT achieves state-of-the-art accuracies on an NLP continual learning benchmark.
1 Introduction
We consider the setting of continual representation learning [15, 14]. As a motivating example, suppose we are developing a foundation model [2] for satellite remote sensing [9]. A resource-rich organization pretrains a model $\theta_1$ on a large amount of satellite data from North America. Over time, the organization needs to upgrade the capabilities of its model by incorporating more geographical regions—retraining the model from scratch is expensive, so it takes $\theta_1$ and continues the pretraining process on data from Africa to get a model $\theta_2$.
The organization releases the pretrained model, and resource-constrained academic labs across the world use the updated model for important applications—a common way to use the model is to fine-tune it on a small amount of task-specific labeled data. We want the updated model to be useful for both new data (from Africa) and old data (from North America).
More formally, we sequentially pretrain a model on a sequence of tasks $T_1, \dots, T_t$ to get a model $\theta_t$—that is, we pretrain on data from $T_1$, then data from $T_2$, etc. Naively pretraining in this way (which we call SGD) leads to forgetting—the model performs poorly on data from older tasks such as $T_1$. Prior works in continual learning propose a variety of methods to reduce forgetting: popular methods include Synaptic Intelligence (SI), Dark Experience Replay (DER), and Unsupervised Continual Learning (UCL). To examine whether $\theta_t$ has retained knowledge about every task $T_i$, we adapt $\theta_t$ using a small amount of labeled data from task $T_i$, and evaluate its accuracy on task $T_i$.
Practitioners will typically use the adaptation method that gets the highest accuracy for their use case—in this setting, we find that continual learning methods such as UCL, DER, and SI can actually do worse than naive training (SGD). For each task, we consider adapting the pretrained model with three adaptation methods (training a kNN or linear classifier on frozen representations, or full fine-tuning of the model) on a small amount of task-specific data. Prior work uses the kNN evaluation protocol—partly because it is cheap to evaluate. Under this scheme, we indeed see that continual learning methods improve over naive training. However, when considering the best adaptation method, we find that strong continual learning baselines (SI, DER, and UCL) can perform even worse than naive training—see Table 2. For example, under a kNN evaluation protocol, SI gets 86%, which is 8% better than naive SGD (78%), but under the best adaptation method SI gets 91% accuracy, which is 2% lower than naive SGD (93%).
However, we find that a method from the transfer learning community (LP-FT) [12] outperforms naive SGD. For each new task $T_i$, naive SGD updates the entire model via gradient descent—this can distort representations learned from previous tasks. Instead, [12] proposes first training the linear ‘head’ layer on the new task (to find the best way to use existing representations for the new task) and then fine-tuning the entire model to incorporate information from $T_i$ into its representations. LP-FT consistently improves over naive SGD under all adaptation methods, and gets the highest overall accuracy. Even under the exact kNN evaluation protocol used in [14], LP-FT matches or outperforms continual learning methods on three popular datasets: Sequential-CIFAR10 [11], Sequential-CIFAR100 [11], and Sequential-TinyImageNet [6].
Finally, we find that LP-FT performs well in two other domains as well. (1) We consider continual representation learning on a real-world satellite remote sensing task (Functional Map of the World [5])—LP-FT (56%) outperforms naive SGD (53%). (2) LP-FT gets state-of-the-art accuracies in an NLP continual learning setting [10], where BERT is updated on a sequence of sentiment analysis tasks. [10] propose a new continual learning method that outperforms 12 strong baselines; we show that a variant of LP-FT gives a further 2% accuracy boost over their strongest method.
2 Related Work
Over the years, many continual learning approaches have been proposed, falling into three main categories. Regularization-based methods [21, 13, 4] minimize the drift of model representations during sequential learning to prevent the forgetting of learned knowledge. Architecture-based methods [16, 19, 20] expand the network structure to learn task-specific and task-shared knowledge across the sequence of tasks. Replay-based methods [17, 1, 3] select and revisit a representative subset of past-task examples during the training of future tasks to alleviate forgetting. While most of these methods were restricted to supervised settings, recent works [14, 7] have extended them to continual learning on unlabeled data streams.
3 Preliminaries
In continual learning, there is a continuum of tasks $T_1, \dots, T_t$. For each task $T_i$, we have a training dataset $D_i^{\text{tr}}$, a few-shot dataset $D_i^{\text{fs}}$, and an evaluation dataset $D_i^{\text{ev}}$, sampled from a distribution $P_i$ over inputs and labels. We parameterize the models by a base feature extractor $g_\theta$ and a task-specific head $h_i$, predicting $h_i(g_\theta(x))$ on task $T_i$, where $g_\theta$ maps inputs into a lower-dimensional feature space and $h_i$ represents a linear head.
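To make the notation concrete, below is a minimal PyTorch sketch of this parameterization; the class and helper names are ours, not from the paper, and the 512-dimensional feature size matches ResNet-18's penultimate layer.

```python
# Minimal sketch (assumptions: PyTorch, torchvision; names are illustrative).
# A shared feature extractor g_theta with one linear head h_i per task.
import torch
import torch.nn as nn
import torchvision.models as models

class ContinualModel(nn.Module):
    def __init__(self, num_classes_per_task):
        super().__init__()
        self.backbone = models.resnet18()
        self.backbone.fc = nn.Identity()  # expose 512-d features: g_theta(x)
        self.heads = nn.ModuleList(
            nn.Linear(512, c) for c in num_classes_per_task  # one h_i per task
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.backbone(x))  # h_i(g_theta(x))
```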
3.1 Training
Given a loss function $\ell$ (for example, the cross-entropy loss), the loss on task $T_i$ is the average of the loss over the training set for task $T_i$:

$$L_i(\theta, h) = \frac{1}{|D_i^{\text{tr}}|} \sum_{(x, y) \in D_i^{\text{tr}}} \ell(h(g_\theta(x)), y) \tag{3.1}$$
3.2 Linear probing then fine-tuning (LP-FT) for CL
We start with $\theta_0$ initialized randomly. For each task $T_i$, LP-FT first trains the head $h_i$ (linear probing), and then jointly optimizes the entire model (fine-tuning):

$$h_i^{\text{lp}} = \arg\min_{h} L_i(\theta_{i-1}, h) \tag{3.2}$$

$$(\theta_i, h_i) = \arg\min_{\theta, h} L_i(\theta, h), \quad \text{initialized at } (\theta_{i-1}, h_i^{\text{lp}}) \tag{3.3}$$

where we approximate the $\arg\min$ using stochastic gradient descent. We apply LP-FT to both a standard ResNet-18 classifier and the B-CL architecture from [10], and show its efficacy in both vision and NLP. More architecture-level decisions can be found in Appendix B.
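The following is a sketch of LP-FT for a single task, mirroring Eqs. 3.2 and 3.3; it assumes the hypothetical `ContinualModel` above and a standard `DataLoader`, and the 25-epoch budget per stage follows the ablation in Section 5.4 (the learning rate is illustrative).

```python
# Sketch of LP-FT for one task (Eqs. 3.2-3.3); hyperparameters illustrative.
import torch
import torch.nn.functional as F

def lpft_one_task(model, loader, task_id, lp_epochs=25, ft_epochs=25, lr=0.01):
    # Stage 1 (LP): train only the head h_i on the new task; theta stays frozen.
    head_opt = torch.optim.SGD(model.heads[task_id].parameters(), lr=lr)
    for _ in range(lp_epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = model.backbone(x)  # frozen g_theta
            loss = F.cross_entropy(model.heads[task_id](feats), y)
            head_opt.zero_grad()
            loss.backward()
            head_opt.step()

    # Stage 2 (FT): fine-tune the entire model, initialized at (theta, h_i^lp).
    full_opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(ft_epochs):
        for x, y in loader:
            loss = F.cross_entropy(model(x, task_id), y)
            full_opt.zero_grad()
            loss.backward()
            full_opt.step()
```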
3.3 Evaluation
At evaluation time, we consider the model $\theta_t$ at the end of continual learning. Our goal is to evaluate how good the representations in $\theta_t$ are for older tasks by evaluating on (a typically small amount of) data from each task $T_i$. We probe the model on each of the tasks $T_1, \dots, T_t$:

$$p_i = \text{Probe}(\theta_t, D_i^{\text{fs}}) \tag{3.4}$$

We then get the accuracy $a_i$ of $p_i$ on task $T_i$, and finally measure the average over all the tasks:

$$\bar{a} = \frac{1}{t} \sum_{i=1}^{t} a_i \tag{3.5}$$
We investigate three different options for $\text{Probe}$: (1) train a kNN classifier on frozen representations produced by the model $g_{\theta_t}$, (2) train a linear probe on the model representations, or (3) fine-tune the entire model via LP-FT.
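As a sketch, the full protocol can be written as a loop over tasks implementing Eqs. 3.4 and 3.5; `probe_fn` stands for any of the three options, and the predictor and loader interfaces are simplifications of ours, not the paper's.

```python
# Sketch of the probe evaluation loop (Eqs. 3.4-3.5); interfaces illustrative.
import torch

@torch.no_grad()
def accuracy(predict, eval_loader):
    correct, total = 0, 0
    for x, y in eval_loader:
        correct += (predict(x) == y).sum().item()
        total += y.numel()
    return correct / total

def evaluate(model, fewshot_loaders, eval_loaders, probe_fn):
    accs = []
    for i, (fs, ev) in enumerate(zip(fewshot_loaders, eval_loaders)):
        p_i = probe_fn(model, fs, task_id=i)  # Eq. 3.4: adapt on D_i^fs
        accs.append(accuracy(p_i, ev))        # accuracy a_i on task T_i
    return sum(accs) / len(accs)              # Eq. 3.5: average over tasks
```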
4 Probe Definitions

We include formal definitions for the three different instantiations of the probing classifier used for our evaluation. Let $L_i^{\text{fs}}(\theta, h)$ denote the average loss over the few-shot dataset $D_i^{\text{fs}}$, defined analogously to Eq. 3.1.

1. Linear probe. We train a linear classifier on frozen representations produced by the model $g_{\theta_t}$:
$$h_i^{\text{lp}} = \arg\min_{h} L_i^{\text{fs}}(\theta_t, h) \tag{4.1}$$
2. KNN probe. Following [18], we build a nearest-neighbor classifier based on the frozen representations $g_{\theta_t}(x)$.
3. LPFT probe. For each task $T_i$, we first train the head $h_i$ and then fine-tune the entire model:
$$h_i^{\text{lp}} = \arg\min_{h} L_i^{\text{fs}}(\theta_t, h) \tag{4.2}$$
$$(\theta_i', h_i') = \arg\min_{\theta, h} L_i^{\text{fs}}(\theta, h), \quad \text{initialized at } (\theta_t, h_i^{\text{lp}}) \tag{4.3}$$
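Below are sketches of the kNN and linear probes on frozen representations; the scikit-learn lbfgs logistic regression mirrors the setup described in Section 5.1, while the feature-extraction helper and the values of `k` and `C` are illustrative choices of ours.

```python
# Sketches of the kNN and linear probes on frozen features g_theta(x).
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(model, loader):
    feats, labels = [], []
    for x, y in loader:
        feats.append(model.backbone(x).cpu().numpy())  # frozen g_theta(x)
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def knn_probe(model, fs_loader, k=200):  # k illustrative
    X, y = extract_features(model, fs_loader)
    return KNeighborsClassifier(n_neighbors=k).fit(X, y)

def linear_probe(model, fs_loader, C=1.0):  # C illustrative; see Section 5.1
    X, y = extract_features(model, fs_loader)
    return LogisticRegression(solver="lbfgs", C=C, max_iter=1000).fit(X, y)
```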
5 Experiments
We show quantitative results on the popular benchmarks Sequential-CIFAR-10, Sequential-CIFAR-100, and Sequential-TinyImageNet, as well as on a benchmark of 19 Aspect Sentiment Classification datasets [10] and an initial result on the real-world satellite dataset Functional Map of the World [5]. We abbreviate these as C10, C100, Tiny, ASC, and FMOW. We show that LPFT significantly improves over fine-tuning on C10, C100, and Tiny, and is almost as good as the best unsupervised continual learning algorithms and SOTA supervised continual learning algorithms. In order to better evaluate representations, we introduce an evaluation protocol in which we are given a small percentage of data to tune on at test time, and show that the benefit from popular regularization and replay continual learning algorithms is largely negated under this protocol.
5.1 Experimental Setup
At training time, we tune every supervised continual learning method by averaging over 3 random seeds and sweeping over 6 learning rates. We use the best method-specific hyperparameters from [3] and [14] for DER and SI, adding group normalization, which we found to work better across all supervised methods. We use the best hyperparameters from [14] directly for UCL, reproducing their best runs. For more details, see Section 5.4.
To prepare the best runs for probe evaluation, we select the hyperparameters corresponding to the best confidence interval, using the KNN probe accuracy as the validation and test metric. We then rerun 3 random seeds with those hyperparameters and take the 3 corresponding checkpoints with the highest average test accuracy over the span of the final task. For the KNN probe, we follow the same hyperparameters as [14]. For linear probes, we use the lbfgs solver to sweep over 100 regularization values and evaluate accuracy using the best classifier. For the LPFT probe, we sweep over the same 6 learning rates as at training time and train for the same number of epochs as during training.
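As a concrete sketch of this linear-probe sweep: fit an lbfgs logistic regression for each of 100 log-spaced regularization values and keep the classifier with the best validation accuracy. The endpoints of the grid are an assumption of ours; the paper's exact range was not recoverable.

```python
# Sketch of the 100-value regularization sweep; grid endpoints are assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sweep_linear_probe(X_tr, y_tr, X_val, y_val, n_values=100):
    best_clf, best_acc = None, -1.0
    for C in np.logspace(-4, 4, n_values):  # assumed range, not from the paper
        clf = LogisticRegression(solver="lbfgs", C=C, max_iter=1000)
        clf.fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)
        if acc > best_acc:
            best_clf, best_acc = clf, acc
    return best_clf
```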
5.2 Ranking of Methods Depends on Evaluation Protocol
First, we compare the methods using the original evaluation protocol, which evaluates the mean test accuracy of the KNN classifier over all tasks at the end of training. The results are shown in Table 1.
Second, we apply our evaluation protocol as described in Section 3.3, which uses 10% of the data as a few-shot set to train the various probes. As shown in Table 2, SI and DER are the best two methods with the standard KNN and LP probes, but SGD and LPFT are the best two methods with the LPFT probe. With the exception of UCL, all methods obtain better accuracy with the LPFT probe, suggesting it is the probe practitioners will elect to use to maximize accuracy. SGD and LPFT in particular yield high-performing probes that beat DER.
Table 1: KNN probe accuracy (%) over all tasks at the end of training. Each entry shows the method's rank followed by its accuracy.

| | C10 | C100 | Tiny | Average |
|---|---|---|---|---|
| SGD | 5. 88.92 | 5. 71.27 | 4. 57.12 | 5. 72.44 |
| LPFT | 2. 91.65 | 2. 79.24 | 1. 64.25 | 1. 78.38 |
| UCL | 4. 90.11 | 4. 75.42 | 3. 58.31 | 4. 74.61 |
| DER | 3. 90.38 | 3. 76.97 | 5. 56.97 | 3. 74.77 |
| SI | 1. 92.73 | 1. 79.63 | 2. 58.55 | 2. 76.97 |
Table 2: Accuracy (%) under our few-shot evaluation protocol (10% few-shot data) for each probe. Each entry shows the method's rank followed by its accuracy.

| | KNN probe | LP probe | LPFT probe | Best probe |
|---|---|---|---|---|
| SGD | 5. 77.64 | 4. 88.30 | 2. 93.46 | 2. 93.46 |
| UCL | 4. 82.45 | 5. 88.21 | 5. 87.06 | 5. 88.21 |
| LPFT | 3. 83.95 | 3. 90.20 | 1. 93.74 | 1. 93.74 |
| SI | 1. 85.87 | 1. 90.91 | 4. 91.05 | 4. 91.05 |
| DER | 2. 84.78 | 2. 90.68 | 3. 93.26 | 3. 93.26 |
5.3 LPFT Maintains Strong Performance and Further Gains from Proper Evaluation
From Table 1, we see that LPFT at training time significantly improves over SGD (2.73% on C10, 7.97% on C100) under the KNN evaluation protocol of [14], outperforming DER and second only to SI. This simple and effective technique refutes prior conclusions that supervised finetuning cannot outperform unsupervised learning unless replay or regularization is applied. LPFT is also not data-intensive. From Table 2, we see that with only 10% few-shot data, applying the LPFT probe to SGD and LPFT yields probes that beat DER (on C100) or come within 0.28% of it (C10). Meanwhile, UCL does not gain from LPFT at few-shot evaluation time, losing 2.46% and gaining 0.19% relative to a linear probe (-1.15% on average), implying its representation quality hinges on having access to the full dataset. Using additional data at test time helps “simple” methods defeat their more involved counterparts, without incurring additional costs such as the replay buffer memory of DER or the stored past weights of SI. We highlight this by partitioning the table into methods that incur extra memory (bottom) and methods that do not (top).
5.4 LPFT Couples with SOTA CL Methods for Consistent Gains
In this section, we also include results coupling LPFT with SI and DER in Table 3. We see that for every supervised continual learning method (finetuning, SI, DER), adding LPFT either yields gains that outperform UCL or comes within its confidence interval. These experiments challenge the prior belief that unsupervised methods are becoming the SOTA way to continually learn representations [14].
Table 3: Accuracy (%) on C10 and C100, mean (standard deviation), when coupling LPFT with other methods. BN/GN denote batch/group normalization; SK denotes the sklearn linear probe.

| | C10 | C100 |
|---|---|---|
| FT (BN) | 88.19 (1.02) | 70.89 (1.92) |
| FT (GN) | 88.92 (0.31) | 71.27 (1.23) |
| LPFT (BN) | 89.73 (0.48) | 75.38 (1.66) |
| LPFT (GN) | 91.18 (0.21) | 76.42 (1.38) |
| LPFT + SK (GN) | 91.65 (0.05) | 79.24 (0.52) |
| UCL (original, BN) | 90.11 (0.12) | 75.42 (0.78) |
| DER (ours, GN) | 90.38 (0.53) | 76.97 |
| DER + LPFT (GN) | 91.07 (0.25) | 78.21 |
| DER + LPFT + SK (GN) | 91.65 (0.32) | 80.56 (0.15) |
| UCL + DER (original, BN) | 91.22 (0.30) | 77.27 (0.30) |
| SI (BN) | 91.02 (0.40) | 78.85 (0.88) |
| SI (GN) | 92.73 (0.07) | 79.63 (0.10) |
| SI + LPFT (GN) | 92.26 (0.15) | 80.01 (0.32) |
| SI + LPFT + SK (GN) | 92.64 (0.07) | 78.03 (0.63) |
| UCL + SI (original, BN) | 92.75 (0.06) | 80.08 (1.30) |
For both the LPFT method and the LPFT probe evaluation, our ablation study considers two ways of probing: either training the linear head with SGD for 25 epochs with the rest of the model frozen, or using sklearn's lbfgs logistic regression solver to obtain the probe (denoted + SK). For LPFT training, both ways are followed by 25 epochs of finetuning. We find that sklearn's logistic regression, sweeping over 100 log-spaced regularization values, obtains a better linear probe than SGD training, and that using the obtained weights to initialize the probe improves the performance of LPFT and LPFT-coupled methods. For the results in Table 1, we performed all instances of linear probing with sklearn's probe.
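A minimal sketch of the "+ SK" initialization described above: the weights of the sklearn-fitted probe are copied into the linear head before finetuning. The helper name is ours, and we assume the head's output dimension matches the number of classes seen by the classifier.

```python
# Sketch: initialize the linear head h_i from a fitted sklearn probe (+ SK).
import torch

def set_head_from_sklearn(head, clf):
    # clf.coef_ has shape (n_classes, n_features), matching head.weight.
    with torch.no_grad():
        head.weight.copy_(torch.from_numpy(clf.coef_).float())
        head.bias.copy_(torch.from_numpy(clf.intercept_).float())
```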
We tuned our supervised methods for the best learning rate over the 6 candidate values, with the adjustment rule from [14]. For UCL methods, we did not tune the learning rate due to computational constraints: UCL requires 200 epochs (instead of 50) to converge and employs a different batch size from the supervised methods. These hyperparameters were tuned extensively by [14], so we use them as-is.
5.5 LPFT Generalizes to Other Domains and Scales to Larger Datasets
Table 4(a): ASC benchmark (NLP), averaged over 5 random sequences of the 19 tasks; mean (standard deviation).

| | Accuracy | Macro F1 |
|---|---|---|
| B-CL | 89.51 (0.55) | 82.05 (1.30) |
| B-CL + LPFT | 90.06 (0.77) | 83.46 (0.67) |

Table 4(b): FMOW, KNN probe accuracy (%).

| | KNN probe |
|---|---|
| SGD | 53.22 |
| LPFT | 56.03 |
In Table 4(a), we see that $g_\theta$ and $h_i$ can be instantiated as task-specific parameters in a general task-incremental CL framework in the domain of NLP, and that LPFT can still achieve gains. We follow the setup of [10] and report the average over 5 random sequences of the 19 tasks.
We also apply LPFT to FMOW, a real-world dataset of satellite imagery; the initial result shows LPFT can bring immediate gains even on a messy real-world dataset where the most data-rich region has >400x as many samples as the least data-rich region.
LPFT’s numbers are generally more competitive on larger datasets. As detailed in Appendix Table 5, the gap between LPFT and SGD grows from 3.94% (KNN probe) and 1.01% (linear probe) on C10 to 8.69% and 2.7%, respectively, on C100. On Tiny, LPFT is the dominant method, achieving gains of 7.28% and 1.67% over the second-best methods under the KNN (DER) and linear probe (SI) evaluations, respectively.
6 Discussion and Conclusion
This work aims to learn better representations via continued pretraining and to find better protocols to evaluate them. We introduce a probe evaluation framework that changes the ranking of continual learning methods, and a simple yet effective technique that boosts performance across all datasets, applies across domains and settings, couples with existing continual learning methods, and does not incur additional cost.
References
- Aljundi et al. [2019] R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio. Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- Bommasani et al. [2021] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Goel, N. Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. Krass, R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L. Li, X. Li, T. Ma, A. Malik, C. D. Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair, A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut, L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich, H. Ren, F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih, K. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, and P. Liang. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Buzzega et al. [2020] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara. Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- Chaudhry et al. [2019] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with A-GEM. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- Christie et al. [2018] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee. Functional map of the world. In Computer Vision and Pattern Recognition (CVPR), 2018.
- Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- Fini et al. [2022] E. Fini, V. G. T. da Costa, X. Alameda-Pineda, E. Ricci, K. Alahari, and J. Mairal. Self-supervised models are continual learners. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Houlsby et al. [2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html.
- Jean et al. [2016] N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353, 2016.
- Ke et al. [2021] Z. Ke, H. Xu, and B. Liu. Adapting bert for continual learning of a sequence of aspect sentiment classification tasks. In NAACL, 2021.
- Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.
- Kumar et al. [2022] A. Kumar, A. Raghunathan, R. Jones, T. Ma, and P. Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=UYneFzXSJWh.
- Lopez-Paz and Ranzato [2017] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Madaan et al. [2022] D. Madaan, J. Yoon, Y. Li, Y. Liu, and S. J. Hwang. Representational continuity for unsupervised continual learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=9Hrka5PA7LW.
- Rao et al. [2019] D. Rao, F. Visin, A. A. Rusu, Y. W. Teh, R. Pascanu, and R. Hadsell. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- Rusu et al. [2016] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
- Sinha et al. [2020] S. Sinha, J. Song, A. Garg, and S. Ermon. Experience replay with likelihood-free importance weights. arXiv preprint arXiv:2006.13169, 2020.
- Wu et al. [2018] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Yoon et al. [2018] J. Yoon, E. Yang, J. Lee, and S. J. Hwang. Lifelong learning with dynamically expandable networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- Yoon et al. [2020] J. Yoon, S. Kim, E. Yang, and S. J. Hwang. Scalable and order-robust continual learning with additive parameter decomposition. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
- Zenke et al. [2017] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
Appendix
Appendix A Full Few-shot Results
Table 5: Full few-shot results on C10 and C100; mean (±std), with each entry showing the method's rank followed by its accuracy.

Table 5(a): KNN and LP probes.

| | KNN probe (C10) | KNN probe (C100) | LP probe (C10) | LP probe (C100) |
|---|---|---|---|---|
| FT | 5. 84.51 (±0.58) | 5. 70.76 (±0.64) | 5. 90.69 (±0.22) | 4. 85.90 (±0.20) |
| UCL | 3. 88.53 (±0.05) | 4. 76.14 (±0.06) | 4. 91.63 (±0.22) | 5. 84.78 (±0.16) |
| LPFT | 4. 88.45 (±0.15) | 3. 79.45 (±0.70) | 3. 91.70 (±0.14) | 1. 88.60 (±0.42) |
| SI | 1. 91.90 (±0.08) | 2. 79.83 (±0.01) | 1. 93.44 (±0.21) | 2. 88.38 (±0.36) |
| DER | 2. 88.98 (±0.35) | 1. 80.58 (±0.11) | 2. 93.01 (±0.15) | 3. 88.34 (±0.11) |

Table 5(b): LPFT and Best probes.

| | LPFT probe (C10) | LPFT probe (C100) | Best probe (C10) | Best probe (C100) |
|---|---|---|---|---|
| FT | 2. 94.93 (±0.10) | 2. 91.99 (±0.16) | 2. 94.93 (±0.10) | 2. 91.99 (±0.16) |
| UCL | 5. 89.15 (±0.90) | 5. 84.97 (±0.25) | 5. 91.63 (±0.22) | 5. 84.78 (±0.16) |
| LPFT | 3. 94.84 (±0.11) | 1. 92.63 (±0.17) | 3. 94.84 (±0.11) | 1. 92.63 (±0.17) |
| SI | 4. 93.36 (±0.08) | 4. 88.73 (±0.24) | 4. 93.44 (±0.21) | 4. 88.73 (±0.24) |
| DER | 1. 95.21 (±0.11) | 3. 91.31 (±0.11) | 1. 95.21 (±0.11) | 3. 91.31 (±0.11) |
Appendix B Vision and NLP Architecture
We use a ResNet-18 for our C10, C100, and Tiny experiments and a DenseNet-121 for FMOW. For the NLP sequential datasets, we use the same pretrained BERT as [10].
In the vision case, only the final classification head is specific to a task, so we train only those weights before finetuning the whole network for every new task. Adapter modules were introduced for large pretrained models in NLP by [8], and [10] built B-CL on them to yield a compact and extensible model. B-CL adds a few trainable parameters (a task mask) to each encoder layer per new task, so that the model can be extended while (a) contributing shared knowledge to the learnt representations and (b) mitigating forgetting by protecting the neurons in the task-specific module with the task-specific mask. In the NLP case, we train only the task mask for every new task, then finetune the whole network afterwards. Following their notation, we tune the task-specific embedding described in Section 4.3 of [10].
We also find that a new per-task checkpoint strategy produces better results for both SGD and LPFT. Previously, the model checkpoint after the full number of training epochs per task (10 epochs for the 2 SemEval tasks, 30 epochs for the other tasks) was passed on to the next task. Instead, we select the task checkpoint that achieves the highest average accuracy over all tasks up to and including the current task. We also find that training for 30 epochs on the SemEval datasets improves both SGD and LPFT. The final SGD accuracy (89.51%) is 1.22% higher than the accuracy reported by [10]. For the SemEval task datasets, using LPFT is not sufficient for convergence, so we use SGD on those 2 tasks in our LPFT runs.
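A sketch of this checkpoint-selection rule, with hypothetical helper names: after each epoch on the current task, evaluate the average accuracy over all tasks so far and keep the best-scoring checkpoint to pass on to the next task.

```python
# Sketch of the per-task checkpoint strategy; helper names are illustrative.
import copy

def train_task_with_best_checkpoint(model, train_epoch, avg_acc_so_far,
                                    num_epochs, task_id):
    best_acc, best_state = -1.0, None
    for _ in range(num_epochs):
        train_epoch(model, task_id)
        acc = avg_acc_so_far(model, task_id)  # mean accuracy over tasks <= task_id
        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)  # best checkpoint is passed to the next task
    return model
```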