
Continual Learners are Incremental Model Generalizers

Jaehong Yoon    Sung Ju Hwang    Yue Cao
Abstract

Motivated by the efficiency and rapid convergence of pre-trained models for solving downstream tasks, this paper extensively studies the impact of Continual Learning (CL) models as pre-trainers. In both supervised and unsupervised CL, we find that the transfer quality of the representation often increases gradually without noticeable degradation in fine-tuning performance. This is because CL models can learn improved task-general features while easily forgetting task-specific knowledge. Based on this observation, we suggest a new unsupervised CL framework with masked modeling, which aims to capture rich task-generic representations during training. Furthermore, we propose a new fine-tuning scheme, GLobal Attention Discretization (GLAD), that preserves rich task-generic representation while solving downstream tasks. The model fine-tuned with GLAD achieves competitive performance and can also serve as a good pre-trained model itself. We believe this paper breaks down the barriers between the pre-training and fine-tuning steps and leads to a sustainable learning framework in which the continual learner incrementally improves model generalization, yielding better transfer to unseen tasks.


1 Introduction

Unsupervised Representation Learning (URL) (Radford et al., 2015; Gidaris et al., 2018; Grill et al., 2020; Xie et al., 2021) is a prominent branch of machine learning in which a model exploits data without human-generated signals to extract generic representations. Although the standard URL scenario assumes that a complete unlabeled dataset is available before training, this setting is often unrealistic in the real world; as the world persistently changes, the model should cope with non-stationary data throughout its lifespan, which calls for lifelong learnability of the representation model. Motivated by the Continual Learning (CL) field (Thrun, 1995; Silver & Mercer, 2002; Kumar & Daume III, 2012; Li & Hoiem, 2016), Unsupervised Continual Learning (UCL) (Rao et al., 2019; Madaan et al., 2022; Fini et al., 2022) has recently been explored to address the limitations of the conventional representation learning setup and provides comprehensive analyses regarding the quality of learned representations along with their forgetting.

However, recently proposed UCL frameworks have clear limitations in how they interpret model transfer to downstream tasks. Let $r_{i,j}$ be the performance under a pre-defined supervised metric for task $j$ using a backbone model sequentially pre-trained from the first to the $i$-th task. These frameworks train on $T$ sequential tasks $\{\mathcal{T}_{t}\}^{T}_{t=1}$ without labels, and measure the effectiveness of their representation model with two supervised metrics: 1) the averaged performance gap between immediately after a task is learned and after all tasks are learned, $\frac{1}{T-1}\sum^{T-1}_{t=1}\left(r_{T,t}-r_{t,t}\right)$ (backward transfer) (Lopez-Paz & Ranzato, 2017), and 2) the averaged performance over all tasks, $\frac{1}{T}\sum^{T}_{t=1}r_{T,t}$. Though maximizing the transferability of the learned representations on the target problem is essential for general-purpose models, prior UCL works have confined their validations and analyses to linear evaluations, i.e., updating only a linear classifier while keeping the representation backbone fixed. This is suitable for measuring direct differences in model drift (i.e., catastrophic forgetting) of continual learners, yet it cannot disclose the change in knowledge transferability during sequential task training, which is crucial when utilizing the pre-trained model in practice.
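To make the two metrics concrete, the sketch below computes them from a performance matrix; the matrix values and helper names are illustrative placeholders, not results from the paper.

```python
import numpy as np

# R[i, j] = r_{i,j}: performance on task j after sequentially training tasks 0..i.
def average_accuracy(R: np.ndarray) -> float:
    T = R.shape[0]
    return R[T - 1, :].mean()                    # (1/T) * sum_t r_{T,t}

def backward_transfer(R: np.ndarray) -> float:
    T = R.shape[0]
    return float(np.mean([R[T - 1, t] - R[t, t] for t in range(T - 1)]))

R = np.array([[0.80, 0.10, 0.05],                # toy 3-task example
              [0.72, 0.78, 0.12],
              [0.70, 0.74, 0.81]])
print(average_accuracy(R), backward_transfer(R))
```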

Figure 1: Linear evaluation of the first task (T0) and fine-tuning of the unseen task (T9) during supervised CL over a sequence of nine tasks from ImageNet1K-split (T0-T8). Sequential training on more tasks decreases the linear evaluation performance, but it increases fine-tuning performance on the unseen task.

Beyond the limited understanding of knowledge transfer in prior works, we provide comprehensive transferability analyses under varying evaluation setups for supervised and unsupervised CL methods to explore their potential as pre-trainers. In Figure 1, we perform a simple experiment to investigate the potential of incremental model generalization via a continual learning setup. A model sequentially learns nine tasks from ImageNet1K-split (containing ten tasks in total) using supervised CL methods: Base (a CL model without any additional method), SI (Zenke et al., 2017), and DER (Buzzega et al., 2020). Then, we measure the change in representation quality through linear evaluation on the first task, T0, and fine-tuning on the out-of-distribution task, T9. As shown by the increasing performance in Figure 1 Right, CL methods gain steady improvements in transferability to unseen tasks, without performance degeneration from representational forgetting, as the model trains on more tasks sequentially. This contrasts with the T0 linear evaluation results of the same methods, which suffer from performance degradation (Figure 1 Left).

We attribute these phenomena to the fact that transferability hinges on rich task-generic features in pre-trained models (Xie et al., 2022a; Wei et al., 2022), while the model mostly loses task-specific features during CL, particularly severely when trained in a supervised/contrastive manner (please see Figures 2 and 5). Inspired by these observations, we propose a new UCL framework based on Masked Image Modeling (MIM) (Xie et al., 2022b; He et al., 2022) that improves task-generic representation across all layers during training. Our framework retains rich task-generic features over all layers by learning to predict masked regions of input images during unsupervised CL, conditioned on the remaining visible areas, and outperforms existing supervised/unsupervised CL baselines in fine-tuning. We demonstrate that the suggested reconstruction-based CL framework achieves substantially higher fine-tuning performance on OOD tasks (Figure 6) than existing CL frameworks that aim to learn class-discriminative features, particularly at deeper layers.

Additionally, as we observe that continual learning models improve model generalization for downstream tasks, we raise the question of how a fine-tuned model can itself become a good pre-trained model, since standard fine-tuning shifts generic features to be task-dependent. To this end, we further explore the potential of continual pre-training, reusing the fine-tuned model as a pre-trained model for other downstream tasks. The objective is to encourage the fine-tuning model to retain rich task-generic features during supervised fine-tuning. Leveraging our motivations for building the MIM-based UCL framework, we suggest a new method named GLobal Attention Discretization (GLAD). Our proposed method introduces a lightweight, trainable adaptor in the multi-head attention module of the ViT backbone. For future reuse of fine-tuned networks, GLAD aims to keep improving global (task-generic) attention during fine-tuning through a constraint that encourages diversity of attention distance in the adaptor-free backbone, while the model solves the downstream task by focusing on local (task-adaptive) features via adaptor-guided attention. We believe our observations and proposed approach remove the barriers between the standard pretraining-finetuning scheme and continual learning, improving model generalization via never-ending continual training while alleviating the risk that large pre-trained models lose generality during downstream task fine-tuning.

The main contributions of the paper are threefold:

  • We unveil the behavior of representational transferability and forgetting of task-generic and task-specific features under multiple supervised/unsupervised continual learning frameworks at scale, with Vision Transformer backbones.

  • We suggest a new learning/evaluation paradigm that amalgamates the popular pretraining-finetuning scheme with continual learning, aiming to continuously increase the generalization of the pre-training backbone during endless sequential fine-tuning phases.

  • We further suggest a simple yet efficient remedy to increase task-generic feature expressiveness throughout continual pre-training, dubbed GLAD, which enables the model to rapidly adapt to the target problem while preserving high transfer affinity to future tasks.

2 Related Work

Continual learning

SI (Zenke et al., 2017) introduces an additional surrogate loss that reduces weight shift during continual learning by constraining the training trajectory according to the weight importance of previous tasks. DEN (Yoon et al., 2018) adaptively controls the network capacity by adding/pruning parameters when new tasks arrive. DER (Buzzega et al., 2020) stores a few training instances of previous tasks as well as their predicted logits and minimizes the discrepancy with the stored logits to produce similar predictions on past tasks. BiC (Wu et al., 2019) adds a new layer at the top of the backbone to correct classification bias toward new tasks. Similarly, WA (Zhao et al., 2020) corrects the prediction bias by rescaling the FC layer with averaged weight normalization over past tasks. DyToX (Douillard et al., 2022) adopts ViT and performs ensembled prediction with task-specific classifiers leveraging additional task-specific tokens. However, most of this research relies on sophisticated human annotation of inputs while training on a sequence of tasks.

CURL (Rao et al., 2019) learns unsupervised representations on task sequences with a generative model adopting task-specific inference. However, the method is validated only on MNIST-scale datasets due to its limited scalability by design. Madaan et al. (2022) suggest a new unsupervised continual learning framework trained contrastively using Siamese structures. They demonstrate the scalability of the proposed framework through comprehensive analyses of the learned representations. CaSSLe (Fini et al., 2022) utilizes a similar contrastive self-supervised framework for unsupervised continual learning, yet provides further extensive validations, including diverse self-supervised learning backbones, on ImageNet-100.

Very recently, Chen et al. (2023) provide intriguing discussions on forgetting and forward transfer of FOMAML (Finn et al., 2017) during supervised CL, primarily via quantitative evaluation with a few-shot linear probe. In contrast, we extensively consider not only supervised CL but also Siamese network-based contrastive and reconstruction-based unsupervised CL frameworks. In our work, we find that the continually learned representation behaves differently depending on whether it is transferred by fine-tuning the entire model or by linear evaluation on a scalable test set, and we deliver various discussion points on representation transferability through both quantitative and qualitative analyses, allowing the re-update of learned backbone weights for downstream tasks. We further propose the GLAD module based on our findings, which preserves rich task-generic representation while solving downstream tasks. More discussions regarding meta-learning are provided in Appendix B.

Self-supervised learning

SimSiam (Chen et al., 2020a) maximizes the similarity between predictions for two different augmentations of an input using a Siamese network, learning augmentation-invariant representations. BarlowTwins (Zbontar et al., 2021) aims to remove cross-correlation across different feature vector embeddings from Siamese networks. DINO (Caron et al., 2021) distills teacher model predictions to the student by minimizing a cross-entropy loss between their predictions, where the teacher model is updated through an exponential moving average of the student model. Unlike contrastive learning-based directions, Masked Image Modeling (MIM) has recently been developed, inspired by masked language models for natural language understanding. SimMIM (Xie et al., 2022b) and MAE (He et al., 2022) adopt an encoder-decoder structure that zeroes out random spatial patches in each patchified image and learns representations by predicting pixel values in the masked patches. MSN (Assran et al., 2022) combines Siamese networks with masked modeling, maximizing the prediction similarity between patchified masked inputs and the augmented target views.

3 Preliminaries

3.1 Pre-training and Fine-tuning

Given a neural network $f(\cdot;\bm{w})$ parameterized by weights $\bm{w}$, recent works address broad machine learning problems defined on a target dataset $\mathcal{D}_{target}$ by optimizing the learnable weights with respect to complex objective functions. Beyond statistical initialization of network weights (Glorot & Bengio, 2010; He et al., 2015), pre-training, which leverages weights learned on large-scale benchmark datasets (e.g., ImageNet (Deng et al., 2009)) as the initialization of $\bm{w}$, has been widely adopted to promote rapid and stable convergence during training. Self-supervised learning (Chen et al., 2020a; He et al., 2020; Caron et al., 2021; Bardes et al., 2021; Xie et al., 2022a) has recently become prevalent for pre-training, demonstrating superior generalization performance compared to supervised counterparts by capturing task-agnostic input features. While multiple different frameworks are considered for self-supervised learning, we exemplify the encoder-decoder framework in this paragraph. Let $h$ and $g$ be an encoder and a decoder parameterized by $\bm{\theta}$ and $\bm{\phi}$, respectively; the objective is to minimize a self-supervised loss given input data $\bm{d}$ without supervision:

\bm{\theta}^{*},~\bm{\phi}^{*}=\arg\min_{\bm{\theta},\bm{\phi}}\ell\left(g_{\bm{\phi}}\circ h_{\bm{\theta}}\left(\bm{d}\right)\right),    (1)

where $\circ$ indicates function composition. The loss function is often designed in several formulations based on similarity, identity correlation, or contrastive losses. After the pre-training phase, the encoder transfers its learned features to the backbone neural network for fine-tuning, $\bm{w}\leftarrow\bm{\theta}^{*}$.
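As a concrete illustration of Eq. (1) and the transfer step $\bm{w}\leftarrow\bm{\theta}^{*}$, the PyTorch sketch below pre-trains a toy encoder-decoder pair on unlabeled data and then copies the encoder weights into a fine-tuning backbone. The architectures, loss, and data are illustrative placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

# Toy encoder h_theta and decoder g_phi for unlabeled 28x28 inputs.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
decoder = nn.Linear(128, 784)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for _ in range(100):                              # unsupervised pre-training loop
    d = torch.rand(32, 1, 28, 28)                 # placeholder unlabeled batch
    recon = decoder(encoder(d))
    loss = (recon - d.flatten(1)).abs().mean()    # a simple reconstruction loss ell(.)
    opt.zero_grad(); loss.backward(); opt.step()

# Fine-tuning initialization: transfer the learned encoder, w <- theta*.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
backbone.load_state_dict(encoder.state_dict())
```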

[Figure 2 image grid: rows Supervised, SimSiam, and SimMIM; columns $\bm{w}_{\bf{T0}}$ and $\bm{w}_{\bf{T0\rightarrow T8}}$ for panels (a) Attention Distance and (b) Attention Entropy]
Figure 2: (a): Visualization of the attention distance on the out-of-distribution task (T9) for three continual learning frameworks right after the completion of the first ($\bm{w}_{\bf{T0}}$) and last task ($\bm{w}_{\bf{T0\rightarrow T8}}$). (b): Visualization of the entropy of each attention head's distribution on the unseen task T9. We use a ViT (Dosovitskiy et al., 2020) backbone for visualization.

3.2 Continual Learning Paradigms

Supervised Continual Learning (SCL) (Mallya & Lazebnik, 2018; Riemer et al., 2019; Aljundi et al., 2019; Chaudhry et al., 2019, 2020; Chrysakis & Moens, 2020; Titsias et al., 2020; Shen et al., 2020; Douillard et al., 2022; Yoon et al., 2020, 2022) concerns sustainable adaptation to unlimited task sequences while maintaining proficiency on previous tasks. Consider an intuitive image-based example: let $\mathcal{T}=\{\mathcal{T}_{1},...,\mathcal{T}_{T}\}$ be a sequence of $T$ tasks, where the dataset $\mathcal{D}_{t}$ for the $t$-th task consists of $n_{t}$ training instances $\mathcal{X}_{t}\in\mathbb{R}^{n_{t}\times C\times H\times W}$ and corresponding labels $\mathcal{Y}_{t}\in\mathbb{R}^{n_{t}}$, where $C$, $H$, and $W$ denote the number of channels, height, and width of the images, respectively. A continual learner $f_{\bm{w}}$, parameterized by a set of weights $\bm{w}$, aims to predict classes by solving the optimization problem $\operatorname*{minimize}_{\bm{w}}~\sum^{T}_{t=1}L_{CE}\left(f\left(\mathcal{X}_{t};\bm{w}\right),\mathcal{Y}_{t}\right)$, where $L_{CE}$ is a cross-entropy loss. However, we assume that $f_{\bm{w}}$ can access each task only at its specific timestep and loses authorization to revisit previous tasks' data instances when the next task arrives. That is, the model solves the following non-stationary problem at task $t$ throughout sequential task training:

\bm{w}^{*}=\arg\min_{\bm{w}}~L_{CE}\left(f\left(\mathcal{X}_{t};\bm{w}\right),\mathcal{Y}_{t}\right)\approx\arg\min_{\bm{w}}~\sum^{t}_{i=1}L_{CE}\left(f\left(\mathcal{X}_{i};\bm{w}\right),\mathcal{Y}_{i}\right).    (2)

The obtained model is evaluated directly on each task; several incremental learning setups are categorized according to whether a task oracle is accessible during inference.
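A minimal sketch of this sequential protocol is given below, assuming hypothetical per-task data loaders; it trains one shared backbone on tasks one at a time with no access to earlier tasks' data, matching Eq. (2).

```python
import torch
import torch.nn as nn

def train_continually(model: nn.Module, task_loaders, epochs_per_task: int = 1, lr: float = 1e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for t, loader in enumerate(task_loaders):        # tasks arrive one after another
        for _ in range(epochs_per_task):
            for x, y in loader:                      # only task t's data is visible here
                loss = ce(model(x), y)
                opt.zero_grad(); loss.backward(); opt.step()
        # after this point, task t's data can no longer be revisited
    return model
```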

Unsupervised Continual Learning (UCL) is formulated as representation learning on a sequence of unlabeled tasks $\{\mathcal{X}_{t}\}_{t=1}^{T}$, often referred to as continual self-supervised learning. A learner $f_{\bm{w}}$ aims to find the best solution that learns informative representations of multiple tasks sequentially. At each timestep $t$, the model relies on the accessible dataset $\mathcal{X}_{t}$, without any human-annotated labels, to solve the problem:

\bm{w}^{*},~\bm{w}_{ext}^{*}=\arg\min_{\bm{w},\bm{w}_{ext}}~L\left(f\left(\mathcal{X}_{t};\bm{w},\bm{w}_{ext}\right)\right),    (3)

where $L$ is an arbitrary loss function for representation learning (e.g., self-supervised losses (Chen et al., 2020b; Grill et al., 2020; Zbontar et al., 2021)) and $\bm{w}_{ext}$ denotes optional extra weights for additional structures not included in the backbone weights $\bm{w}$, such as a decoder, a projection layer, or a predictor. Since a direct comparison of the quality of representation models is intractable, recent representation learning literature validates obtained representation models by probing their generic transferability on multiple downstream tasks. In a similar vein, prior UCL works (Madaan et al., 2022; Fini et al., 2022) adopt supervised prediction tools such as the KNN classifier and linear evaluation while keeping the learned backbone fixed. However, we argue that such evaluation paradigms cannot appropriately measure the transferability of the representation to unseen tasks.

4 Continual Learning for Incremental Model Generalization

4.1 The Role of Global and Local Attention during Continual Learning

Prior continual learning literature has demonstrated that a model in the standard CL setting suffers from forgetting due to the loss of local features and attention for past tasks. However, we argue that prior work has barely discussed the generalization of continually learned representations, which can be a great source for deploying an improvable foundation model that is fine-tuned on an unlimited number of tasks. This raises the question:

“So, is the model generalization getting worse as it goes through training on sequential tasks?”

Figure 3: Illustration of the proposed GLobal Attention Discretization (GLAD) for Continual Pre-training. Our GLAD introduces a new multi-head self-attention operation, named GLAD-MSA, with a parametric adaptor. The pre-trained model with adaptors is fine-tuned on the given problems under a constraint that encourages divergence of attention entropy in each layer, leading to incremental positive transfer of backbone parameters for future tasks.

Surprisingly, we find that this is not the case: the generalization (a.k.a. transfer quality) consistently improves during CL, as shown in Figure 1 Right. To explicate the behavior of multi-head attention in transformer backbones, similarly to the experiment in Figure 1, we train supervised and unsupervised CL frameworks on the ImageNet-1K split dataset without any specific CL methods. We then investigate changes in regional inductive bias by measuring the average distance of attention heads (defined by Equation 8) in each layer, visualized as different points in the top and middle rows of Figure 2 (a). Supervised and contrastive self-supervised (SimSiam) continual learning models focus more on strong locality inductive bias at lower layers (decreased attention distance) and on global attention at higher layers (increased attention distance) during continual learning, as these frameworks are innately designed to cluster/classify input features in deeper layers. However, focusing on global attention to capture task-specific features is undesirable for transferring the representations to out-of-distribution tasks. Next, we visualize the entropy of each attention head's distribution in Figure 2 (b) by computing $-\sum_{i}\bm{a}_{i}\log(\bm{a}_{i})$ for each attention head $\bm{a}$. In supervised and contrastive unsupervised frameworks, most attention heads at deeper layers attend broadly during continual learning. This indicates that they have already substantially adapted to the pre-trained tasks while losing degrees of freedom to transfer to downstream tasks.
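The per-head entropy referenced above can be computed directly from the attention maps; the sketch below assumes the usual ViT attention tensor layout and averages entropies over the batch and query positions, which is our reading of the aggregation rather than the paper's exact recipe.

```python
import torch

def attention_head_entropy(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """attn: [batch, heads, queries, keys], each row a distribution over keys."""
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # entropy per (batch, head, query)
    return ent.mean(dim=(0, 2))                      # aggregate -> one value per head

attn = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)   # toy ViT-B-like attention maps
print(attention_head_entropy(attn))                          # 12 per-head entropies
```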

To build a new UCL framework that yields a better generalizable representation model across all layers, we turn to Masked Image Modeling (MIM) (Pathak et al., 2016; He et al., 2022), which self-trains input representations by minimizing a regression loss to predict RGB pixel values in randomly zeroed patches of patchified input images. MIM enjoys a locality inductive bias with diverse attention across layers, allowing better transferability to unseen tasks.

4.2 Continual Self-supervised Learning with Masked Modeling

We formulate a representation learner $f_{\bm{w}}$ composed of a neural encoder $h_{\bm{\theta}}$ and a decoder $g_{\bm{\phi}}$. We build backbones using Vision Transformer variants (Dosovitskiy et al., 2020; Liu et al., 2021) due to their powerful generality and remarkable performance on high-resolution visual tasks. They flexibly transfer the obtained representations to downstream tasks requiring various input image sizes, whereas existing UCL frameworks (Madaan et al., 2022; Rao et al., 2019) allow only a fixed image size for representation learning and fine-tuning, since their architectures are basically composed of multi-layer perceptrons and convolutional neural networks. When training on the $t$-th task with a training instance $\bm{x}_{t}\in\mathcal{X}_{t}$, $\bm{x}_{t}\in\mathbb{R}^{C\times H\times W}$, the model segments $\bm{x}_{t}$ into smaller image patches whose width and height are $s<H,W$, and randomly zeroes out a fraction of image patches with a fixed ratio $\rho$. The encoder tokenizes the masked patches into the embedding space and feeds them into multiple self-attention blocks to capture latent representation features. The decoder reconstructs the encoded features to approximate the input image. The objective is to minimize the following loss function for continual representation learning ($\|\cdot\|_{\mu}$ denotes any norm, often with $\mu\in\{1,2\}$). Let $K=\left\lfloor\frac{H}{s}\right\rfloor\cdot\left\lfloor\frac{W}{s}\right\rfloor$ be the number of tokens per image; we formulate the loss function as follows:

\ell\left(\bm{x}_{t};\bm{w}\right)=\|f(\bm{m}*_{s}\bm{x}_{t};\bm{w})-\bm{m}*_{s}\bm{x}_{t}\|_{\mu}=\|g\left(h\left(\bm{m}*_{s}\bm{x}_{t};\bm{\theta}\right);\bm{\phi}\right)-\bm{m}*_{s}\bm{x}_{t}\|_{\mu},\quad\text{where}~\bm{m}\in\{0,1\}^{K}\sim B\left(K,\rho\right).    (4)

Given patch size $s$, $*_{s}$ denotes a patch-wise multiplication operation between a training instance $\bm{x}$ and a generated mask vector $\bm{m}$ drawn from the binary distribution $B$ with sparsity ratio $\rho$; $B(i,\rho)$ denotes $i$ independent binary samples, each equal to $1$ with probability $\rho$. The model updates its weights to predict the masked regions of input images, conditioned on the remaining visible areas. We simply adopt an $\ell_{1}$ loss to minimize the distance between predicted patches and the targets, following earlier reconstruction-based works (Xie et al., 2022b; He et al., 2022). After completing sequential training, the obtained encoder $h_{\bm{\theta}}$ can be utilized for many different downstream tasks. We find that our new framework outperforms supervised and contrastive benchmarks in model transferability during continual pre-training (please see Figure 6). Interestingly, unsupervised continual learning with the masked autoencoder framework (SimMIM in Figure 2, bottom row) behaves very differently from the other two frameworks: almost all layers maintain a diverse focus on locality, and this tendency becomes stronger as the model continues to pre-train on more tasks.
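The sketch below illustrates one way to realize Eq. (4) in PyTorch: sample a binary patch mask $\bm{m}\sim B(K,\rho)$, zero out the masked patches, and regress the model's output toward the original pixels with an $\ell_1$ loss. Restricting the loss to masked pixels and the `model` interface (taking a masked image and returning a full-resolution reconstruction) are our assumptions for illustration.

```python
import torch

def mim_loss(model, x: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.6) -> torch.Tensor:
    B, C, H, W = x.shape
    K = (H // patch_size) * (W // patch_size)                      # number of patches per image
    m = (torch.rand(B, K, device=x.device) < mask_ratio).float()   # m ~ B(K, rho); 1 = masked
    mask = m.view(B, 1, H // patch_size, W // patch_size)
    mask = mask.repeat_interleave(patch_size, -2).repeat_interleave(patch_size, -1)
    x_masked = x * (1.0 - mask)                                    # zero out the masked regions
    pred = model(x_masked)                                         # reconstruction, same shape as x
    # l1 reconstruction error, averaged over the masked pixels only
    return ((pred - x).abs() * mask).sum() / (mask.sum() * C + 1e-8)
```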

4.3 Continual Pre-training via Global Attention Discretization

Our aim is to utilize the backbone of the fine-tuned model as a pre-training checkpoint for another problem in a supervised manner. Given the target task dataset $\mathcal{D}_{target}=\{\mathcal{X},\mathcal{Y}\}$ and a classifier $\delta(\cdot;\bm{u})$ parameterized by $\bm{u}$, we formulate the objective of continual pre-training as follows:

\operatorname*{minimize}_{\bm{w}^{(t)},\bm{u}^{(t)}}~\ell\left(\delta\left(f\left(\mathcal{X};\bm{w}^{(t)}\right);\bm{u}^{(t)}\right),\mathcal{Y}\right).    (5)

We suppose each fine-tuning step independently introduces its own classifier. The formulation aligns with the continual learning problem described in Section 3.2, but this setting targets never-ending model generalization to achieve consistently improved adaptation to future out-of-distribution tasks. That is, the representation model obtained after fine-tuning on task $t$, $\widehat{\bm{w}}^{(t)}$, will be reused for future task training ($\bm{w}^{(t+1)}=\widehat{\bm{w}}^{(t)}$).

However, fine-tuning often reduces general transferability when adapting to different tasks, resulting in suboptimal model generalization of supervised CL compared to our MIM-based framework (please see Figure 6 Left). Motivated by our findings in Section 4.1 and Section 4.2, we propose a new method for continual pre-training, named GLobal Attention Discretization (GLAD). Our proposed method preserves diverse degrees of averaged distance at each attention head in order to keep the backbone weights transferable to future problems, while capturing task-adaptive features guided by GLAD modules. As illustrated in Figure 3, we introduce a multi-head self-attention operation with adaptor weights $\bm{v}$. We transform the task-generic MSA features into input-dependent ones by propagating them through the adaptor $\bm{v}$. The model can then solve the current task while the backbone weights are constrained to preserve an abundant locality inductive bias. Let $\bm{a}^{l,i}$ be the averaged entropy of the attention passed through the adaptor operation (dark dashed arrow) from the $i$-th head at layer $l$; the objective function of our GLAD is:

\operatorname*{minimize}_{\bm{w},\bm{v}}\sum^{N}_{n=1}\ell\left(f\left(\bm{x}^{(n)};\bm{w},\bm{v}\right),\bm{y}^{(n)}\right)+\frac{1}{L}\sum^{L}_{l=1}\left(\left(\sqrt{E\left[\left(\bm{a}^{l,i}-\overline{\bm{a}}^{l}\right)^{2}\right]}+\epsilon\right)^{-1}+\lambda\left\|\overline{\bm{a}}^{l}\right\|^{2}_{F}\right),    (6)

where $E$ denotes the expectation, $\overline{\bm{a}}^{l}=\frac{1}{H^{l}}\sum^{H^{l}}_{i}\bm{a}^{l,i}$ is the mean over the $H^{l}$ attention heads at layer $l$, $\lambda$ is a scaling factor, and $\epsilon$ is a small constant. We jointly minimize the task loss and an additional regularizer that encourages the entropies of attention heads to diverge sufficiently at each layer by penalizing the average of their inverse standard deviations. We additionally minimize a Frobenius norm on $\overline{\bm{a}}^{l}$ to promote an abundant locality inductive bias in the backbone attention weights. Note that our proposed method can be used with any kind of multi-head self-attention module; we demonstrate its efficacy on the vanilla Vision Transformer (Dosovitskiy et al., 2020) and the Swin Transformer (Liu et al., 2021). The learned backbone weights $\bm{w}$, excluding the classifier and GLAD adaptors, can be reused for fine-tuning future tasks. We describe the overall continual pre-training procedure with GLAD in Algorithm 1.
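A minimal sketch of this regularizer is given below, assuming `head_entropies` is a list with one tensor per layer holding the per-head averaged attention entropies $\bm{a}^{l,i}$; how those entropies are collected from the GLAD-MSA blocks is architecture-specific and omitted here.

```python
import torch

def glad_regularizer(head_entropies, lam: float = 0.1, eps: float = 1e-6) -> torch.Tensor:
    """head_entropies: list of tensors, each of shape [H_l] (entropy of each head at layer l)."""
    reg = torch.zeros(())
    for a_l in head_entropies:
        a_bar = a_l.mean()                                    # mean entropy over heads at this layer
        std = ((a_l - a_bar) ** 2).mean().sqrt()              # std over heads: sqrt(E[(a - a_bar)^2])
        reg = reg + 1.0 / (std + eps) + lam * a_bar.pow(2)    # inverse-std + Frobenius-norm term
    return reg / len(head_entropies)
```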

Algorithm 1 Continual Pre-training with GLAD
0:  A sequence of tasks $\{\mathcal{D}_{1},\mathcal{D}_{2},\cdots\}$, backbone network $f$, learning rate $\eta\in\mathbb{R}^{+}$, hyperparameter $\lambda$, small constant $\epsilon$, initialization $\bm{w}_{\text{init}},\bm{v}_{\text{init}}$
1:  for all tasks $\mathcal{T}_{t}=\mathcal{T}_{1},\mathcal{T}_{2},\ldots$ do
2:    Build a model $f_{\bm{w},\bm{v}}(\cdot)$ with GLAD-MSA    ▷ Figure 3
3:    Initialize $\bm{w}\leftarrow\bm{w}^{*}$, excluding the classifier, if $f_{\bm{w}^{*}}$ exists; otherwise $\bm{w}\leftarrow\bm{w}_{\text{init}}$
4:    Initialize $\bm{v}\leftarrow diag(\bm{v}_{\text{init}})\coloneqq(1,\ldots,1)\in\mathbb{R}^{d_{\text{out}}}$
5:    for batch $\bm{x}_{n},\bm{y}_{n}\sim\mathcal{D}_{t}$ do
6:       $\mathcal{L}=\ell\left(f\left(\bm{x}_{n};\bm{w},\bm{v}\right),\bm{y}_{n}\right)+\frac{1}{L}\sum^{L}_{l=1}\left(\sqrt{E\left[\left(\bm{a}^{l,i}-\overline{\bm{a}}^{l}\right)^{2}\right]}+\epsilon\right)^{-1}$    ▷ Equation 6
7:       $\bm{w}\leftarrow\bm{w}-\eta\nabla_{\bm{w}}\mathcal{L}$, $\bm{v}\leftarrow\bm{v}-\eta\nabla_{\bm{v}}\mathcal{L}$
8:    end for
9:    $\bm{w}^{*}\leftarrow\bm{w}$
10:  end for
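To make the GLAD-MSA building block concrete, the sketch below wraps a standard multi-head attention layer with a diagonal adaptor $\bm{v}$ initialized to ones (Algorithm 1, line 4) and exposes per-head attention entropies for the regularizer. Where exactly $\bm{v}$ is applied inside the attention operation is our assumption; the paper's Figure 3 specifies the actual wiring.

```python
import torch
import torch.nn as nn

class GLADAttention(nn.Module):
    """A hypothetical GLAD-style MSA block: standard attention plus a diagonal adaptor."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.adaptor = nn.Parameter(torch.ones(dim))    # v = diag(1, ..., 1), identity at init

    def forward(self, x: torch.Tensor):
        # attn_weights: [batch, heads, tokens, tokens] (requires a recent PyTorch version)
        out, attn_weights = self.attn(x, x, x, need_weights=True, average_attn_weights=False)
        out = out * self.adaptor                        # task-adaptive rescaling of MSA features
        ent = -(attn_weights * (attn_weights + 1e-8).log()).sum(-1)   # entropy per query
        head_entropy = ent.mean(dim=(0, 2))             # averaged entropy a^{l,i} per head
        return out, head_entropy
```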

5 Experiments

We conduct experiments on various continual learning frameworks, with and without supervision, on the ImageNet-1K Split dataset against multiple strong baselines for unsupervised continual learning, demonstrating the effectiveness of our proposed method in terms of fine-tuning performance on downstream tasks.

5.1 Architectures and Baselines

ImageNet 1K Split (T=10)              | Supervised                   | Contrastive (Chen & He, 2021) | Masked Model (Xie et al., 2022b)
                                      | Final Acc      Neg. BWT      | Final Acc      Neg. BWT       | Final Acc      Neg. BWT
Fine-tuning (FT)
  1K Pretrained                       | 87.48 / 98.08      -         | -                  -          | -                  -
  22K Pretrained                      | 87.76 / 98.48      -         | -                  -          | -                  -
  Base Model                          | 71.90 / 90.64   6.88 / 3.70  | 64.38 / 86.18   7.18 / 4.56   | 73.18 / 91.76  13.26 / 7.46
  SI (Zenke et al., 2017)             | 70.00 / 90.38   7.40 / 4.52  | 61.46 / 84.82   4.15 / 2.42   | 71.54 / 90.76  11.92 / 6.58
  DER (Buzzega et al., 2020)          | 70.57 / 90.12   8.94 / 4.86  | 62.37 / 85.46   9.28 / 6.08   | 70.10 / 90.10  19.55 / 12.24
  LUMP (Madaan et al., 2022)          | N/A             N/A          | 64.01 / 86.42   6.24 / 4.32   | 75.11 / 92.38  21.28 / 12.42
Linear Probe (LP)
  1K Pretrained                       | 87.48 / 97.98      -         | -                  -          | -                  -
  22K Pretrained                      | 86.53 / 98.06      -         | -                  -          | -                  -
  Base Model                          | 33.66 / 62.20  -5.98 / -4.30 | 17.10 / 40.60   7.64 / 13.94  | 17.46 / 40.60   4.76 / 6.36
  SI (Zenke et al., 2017)             | 34.82 / 63.18  -6.18 / -5.56 | 15.24 / 36.76  -1.42 / -2.38  | 14.92 / 37.80   4.82 / 8.26
  DER (Buzzega et al., 2020)          | 34.59 / 62.29  -5.86 / -6.16 | 14.84 / 36.13   3.68 / 5.22   |  6.22 / 21.52  -0.84 / -1.04
  LUMP (Madaan et al., 2022)          | N/A             N/A          | 18.50 / 42.05   7.54 / 11.38  | 19.26 / 43.21   7.38 / 11.22

CIFAR-100 Split (T=10)                | Supervised                   | Contrastive (Chen & He, 2021) | Masked Model (Xie et al., 2022b)
                                      | Final Acc      Neg. BWT      | Final Acc      Neg. BWT       | Final Acc      Neg. BWT
FT  Base Model                        | 79.3            3.8          | 49.3           12.6           | 88.9           18.3
    SI (Zenke et al., 2017)           | 78.0            5.8          | 57.3           14.5           | 86.4           20.0
    LUMP (Madaan et al., 2022)        | N/A             N/A          | 83.3           16.5           | 88.6           18.7
LP  Base Model                        | 70.3            0.0          | 45.7            5.9           | 73.0           11.1
    SI (Zenke et al., 2017)           | 69.0            0.9          | 49.3            3.0           | 68.4            9.1
    LUMP (Madaan et al., 2022)        | N/A             N/A          | 73.6            9.8           | 77.1           11.5
Table 1: Fine-tuning and linear evaluation performance with their negative backward transfer of the first task on ImageNet 1K and CIFAR-100 Split after supervised/unsupervised continual learning. We report the Top-1/Top-5 performance for all individual experiments on ImageNet and the Top-1 performance on CIFAR-100. Higher is better for both metrics, and the best results are highlighted in bold.
[Figure 4 panels: (a) Fine-tuning - Base, (b) Linear Probe - Base, (c) Fine-tuning - SI, (d) Linear Probe - SI]
Figure 4: (a-b): Model transferability on ImageNet 1K Split. We compare the fine-tuning accuracy on the OOD task (T9) after pre-training the first task (After T0) with the performance after pre-training on nine sequential tasks (After T8). We continually pre-train with Supervised learning (Sup), Contrastive Self-supervised learning (Contr), and Masked Image Modeling (MIM). (c-d): Same visualization with a CL method, SI, during pre-training.

We use ViT (Dosovitskiy et al., 2020) and Swin Transformer (Liu et al., 2021) as backbone architectures. We follow the Siamese network structure of Madaan et al. (2022) and implement a MIM-based continual self-supervised learning framework on top of SimMIM (Xie et al., 2022b) and MAE (He et al., 2022) for UCL. CURL (Rao et al., 2019) is one of the pioneering works in the unsupervised continual learning literature, but it is not scalable to high-resolution visual images by design. We utilize several CL methods: SI (Zenke et al., 2017), DER (Buzzega et al., 2020), and LUMP (Madaan et al., 2022). We further describe details, including hyperparameter setups, in Appendix A.

Datasets

We use the ImageNet-1K (Deng et al., 2009) and CIFAR-100 (Krizhevsky et al., 2009) datasets, splitting each into ten tasks where each task contains 100 and 10 classes, respectively. In Table 1, we construct a sequential dataset from the nine earlier tasks and assign the last one as a downstream (validation) task for fine-tuning. Additionally, we split CIFAR-100 into five tasks for the continual pre-training experiment.
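For reference, a class-incremental split of this kind can be produced as in the sketch below; the deterministic class ordering and the use of `torchvision` are illustrative assumptions, not necessarily the exact split used in the experiments.

```python
from torch.utils.data import Subset
import torchvision

def split_by_class(dataset, num_tasks: int):
    """Partition a labeled dataset into `num_tasks` tasks of disjoint, consecutive classes."""
    num_classes = len(set(dataset.targets))
    per_task = num_classes // num_tasks
    tasks = []
    for t in range(num_tasks):
        classes = set(range(t * per_task, (t + 1) * per_task))
        idx = [i for i, y in enumerate(dataset.targets) if y in classes]
        tasks.append(Subset(dataset, idx))
    return tasks

cifar = torchvision.datasets.CIFAR100(root="./data", train=True, download=True)
tasks = split_by_class(cifar, num_tasks=10)   # ten tasks of 10 classes each
```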

[Figure 5 panels: Supervised, SimSiam, SimMIM]
Figure 5: Visualization of the aggregated attention distance on an OOD task (T9) at each layer at the end of each continual learning task phase (T0→T8). The radius of each marker indicates the standard deviation over attention heads in the corresponding layer.
Figure 6: Left: CL improves model generalization. Sequential training on more tasks further increases fine-tuning performance on the unseen task (T9). Right: The impact of freezing weights on a few lower/deeper layers during CL after training the first task. We measure fine-tuning accuracy at the end of training the first task (T0) and last task (T8), and use supervised (Sup) and masked image modeling (MIM). LF and DF denote freezing lower layers and freezing deeper layers, respectively.

5.2 Experimental Results

Evaluation performance during continual learning

We validate the transfer quality of representations of continual learning models through the first task (T0) evaluation in Table 1, measuring the change in linear evaluation and fine-tuning performance. Evaluating T0 with models fully pre-trained on the ImageNet-1K and -22K datasets yields high validation accuracy for both fine-tuning and linear evaluation, as these models are trained on the entire datasets. Fine-tuning the base continual learning models, which follow a simple CL strategy without additional methods while training on task sequences, achieves performance increases on T0 as they are pre-trained on longer task sequences, obtaining positive backward transfer. On ImageNet 1K Split, the results are similar across all supervised and unsupervised continual learning frameworks, including Contrastive Self-supervised Learning (Madaan et al., 2022)- and Masked Image Modeling-based UCL (please see Section 4.2). We also evaluated multiple continual learning methods, such as SI, DER, and LUMP, and found that they follow consistent tendencies within their CL frameworks, where LUMP with masked image modeling gains the highest accuracy on T0 with strong backward transfer. On the other hand, linear evaluation performance degrades in supervised continual learning, which relates to the catastrophic forgetting reported in conventional CL scenarios. SI and DER achieve increased final performance since they mitigate weight drift, preserving task-specific features of learned tasks. On CIFAR-100 Split, we similarly observe that fine-tuning the in-distribution task (i.e., the first task) of Base/SI/LUMP with the reconstruction-based UCL framework achieves higher and positive BWT than CL methods trained in a supervised manner, demonstrating that supervised CL suffers more severe forgetting of task-discriminative information at deeper layers, unlike reconstruction-based UCL, which focuses on task-generic features over all layers. Note that linear evaluation on CIFAR-100 appears more robust to forgetting than in the ImageNet experiments. We expect that these relative benefits in forgetting come from the shallower data distribution space and simpler visual features of CIFAR-100 compared to ImageNet.

Analyses for an Out-Of-Distribution (OOD) task

We designate the last task (T9) as an out-of-distribution problem, excluding it from the continual task sequence. In Figure 4, we visualize the top-1 validation accuracy on the OOD task for three continual learning frameworks with and without SI. Similar to the in-distribution evaluation, the MIM-based UCL method achieves higher fine-tuning performance on both the base model and SI. The linear probe performance of supervised CL surpasses its unsupervised counterparts; we expect that representations from supervised learning contain directly helpful features for classifying high-resolution and complex tasks even without re-updating the backbone weights. In contrast, MIM remarkably underperforms on linear evaluation due to its characteristic property: masked modeling focuses on capturing task-generic representations rather than class-discriminative ones, providing better generalization to unseen tasks, but its linear evaluation without fine-tuning the weights is inadequate for solving the problem as the features contain little task-discriminative information.

To understand how CL frameworks exhibit incremental model generalization and fine-tuning performance, we analyze the behavior of layer attention during continual learning using the Swin-T backbone. In Figure 5, we visualize the layer-by-layer changes in aggregated attention distance for T9 while the model trains on ImageNet-1K-Split sequentially up to the penultimate task (T0→T8). Interestingly, the aggregated attention distance decreases significantly in scale while its diversity across attention heads increases. This demonstrates that the continual learner captures richer task-general (or low-level) features with more local attention (i.e., lower attention distance), retaining localized information with a strong locality inductive bias, such as edges, patterns, and textures. In the Supervised and Contrastive continual learning frameworks, lower layers tend to change drastically toward capturing task-generic attention, which coincides with the well-known observation that lower layers in neural networks are more concerned with low-level features. Moreover, the Masked Image Modeling (SimMIM) results demonstrate a salient effectiveness in capturing task-generic attention compared with the other two frameworks.

Freezing the partial layer weights during continual learning

We further analyze the effect of individual layers on incremental generalization during continual learning in Figure 6 Right. After supervised/unsupervised training of the first task (After T0), we freeze either the two lowest or the two deepest layers' weights during the subsequent continual learning up to the final task (After T8). For the MIM-based UCL framework, we use MAE with a ViT-B backbone. In supervised learning, both partial gradient update policies reduce the degree of incremental generalization during continual learning, significantly reducing the representation model's fine-tuning performance compared to the fully-trained model. Interestingly, however, prohibiting the update of layer weights at either end affects MIM-based unsupervised continual learning much less. We expect that this property comes from its flexibility in learning diverse attention across all layers. For further analyses, please see Section C.4.

Figure 7: Learning accuracy and loss of CIFAR-100 (T=5) at each continual pre-training step. Right: Solid and dashed lines indicate validation and training loss, respectively.
Pre-trained T0→T4               | Continual learning over T5→T8
                                |   T5       T6       T7       T8
Supervised                      | 64.56    69.06    75.06    77.80
 + GLAD (Ours)                  | 65.78    69.84    76.12    79.04
SimMIM (Xie et al., 2022b)      | 68.34    72.22    77.38    80.12
 + GLAD (Ours)                  | 68.24    73.24    78.94    81.62
MAE (He et al., 2022)           | 42.19    52.07    63.07    71.19
 + GLAD (Ours)                  | 41.75    54.74    68.01    74.30
Table 2: Fine-tuning accuracy on T9 at each continual pre-training step (T5→T8) over ImageNet1k-split. We initialized the model weights using continual (representation) learning over the earlier five sequential tasks (T0→T4) to focus on continual pre-training performance on a few later tasks.

Improving model generalization via Global Attention Discretization

We now validate our proposed method, GLobal Attention Discretization (GLAD), which encourages incremental model generalization during supervised continual pre-training. As discussed earlier, supervised training tends to focus on task-specific attention at deeper layers. That is, the model struggles to stray far from a weight space with limited locality inductive bias, which is evident in the slower movement of averaged attention distance under supervised continual pre-training compared to the SimMIM-based counterpart (please see Figure 5), and results in suboptimal adaptation to arriving tasks. In Table 2, we report the fine-tuning performance during continual pre-training. Note that, to see the effect of MIM-based continual pre-training, we first perform continual pre-training over the earlier five tasks from ImageNet-1K Split under supervised and MIM-based unsupervised continual learning. Next, the pre-trained models are fine-tuned on the remaining tasks in a sequential manner. We adopt SimMIM and Masked AutoEncoder (MAE) to understand general MIM behaviors during unsupervised continual learning. Our proposed GLAD achieves significant gains in per-task performance during sequential full fine-tuning from different pre-trained initializations obtained by supervised and unsupervised learning. In Figure 7, we show that our proposed GLAD also consistently generalizes well on the in-distribution task (CIFAR-100) during the continual pre-training phase.

6 Conclusion

As powerful representation models have great versatility for solving various downstream tasks, exploring an incremental pre-training strategy over a number of sequential tasks is a practical and important direction. This paper delves into how supervised and unsupervised continual learning affect model generalization from various perspectives. To our surprise, continual learning models preserve or even increase their transferability on in- and out-of-distribution tasks, improving fine-tuning performance as they pre-train on more tasks. We scrutinize the behavior of representations in CL frameworks during pre-training, including masked image modeling-based unsupervised continual learning, and find that the continual learner tends to forget class-discriminative features while progressively accumulating transferable features. Motivated by our observations, we propose a new method for continual pre-training that helps the backbone weights gain transferability during fine-tuning by introducing a new MSA module with parametric adaptors. We believe the exploration of continuous learnability of representation models will contribute to developing eco-friendly and resource-efficient training regimes across broad research and industry fields.

Acknowledgements

We thank the anonymous reviewers for their insightful comments and suggestions. This work was supported by Microsoft Research Asia and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References

  • Aljundi et al. (2019) Aljundi, R., Belilovsky, E., Tuytelaars, T., Charlin, L., Caccia, M., Lin, M., and Page-Caccia, L. Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems (NIPS), 2019.
  • Assran et al. (2022) Assran, M., Caron, M., Misra, I., Bojanowski, P., Bordes, F., Vincent, P., Joulin, A., Rabbat, M., and Ballas, N. Masked siamese networks for label-efficient learning. arXiv preprint arXiv:2204.07141, 2022.
  • Bardes et al. (2021) Bardes, A., Ponce, J., and LeCun, Y. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
  • Buzzega et al. (2020) Buzzega, P., Boschini, M., Porrello, A., Abati, D., and Calderara, S. Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Caron et al. (2021) Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  • Chaudhry et al. (2019) Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H., and Ranzato, M. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.
  • Chaudhry et al. (2020) Chaudhry, A., Khan, N., Dokania, P. K., and Torr, P. H. Continual learning in low-rank orthogonal subspaces. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Chen et al. (2023) Chen, J., Nguyen, T., Gorur, D., and Chaudhry, A. Is forgetting less a good inductive bias for forward transfer? In Proceedings of the International Conference on Learning Representations (ICLR), 2023.
  • Chen et al. (2020a) Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2020a.
  • Chen & He (2021) Chen, X. and He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Chen et al. (2020b) Chen, X., Fan, H., Girshick, R., and He, K. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.
  • Chrysakis & Moens (2020) Chrysakis, A. and Moens, M.-F. Online continual learning from imbalanced data. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Douillard et al. (2022) Douillard, A., Ramé, A., Couairon, G., and Cord, M. Dytox: Transformers for continual learning with dynamic token expansion. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Fini et al. (2022) Fini, E., da Costa, V. G. T., Alameda-Pineda, X., Ricci, E., Alahari, K., and Mairal, J. Self-supervised models are continual learners. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
  • Gidaris et al. (2018) Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
  • Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Grill et al. (2020) Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., kavukcuoglu, k., Munos, R., and Valko, M. Bootstrap your own latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Hadsell et al. (2020) Hadsell, R., Rao, D., Rusu, A. A., and Pascanu, R. Embracing change: Continual learning in deep neural networks. Trends in cognitive sciences, 2020.
  • He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • He et al. (2020) He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • He et al. (2022) He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Kumar & Daume III (2012) Kumar, A. and Daume III, H. Learning task grouping and overlap in multi-task learning. In Proceedings of the International Conference on Machine Learning (ICML), 2012.
  • Li & Hoiem (2016) Li, Z. and Hoiem, D. Learning without forgetting. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • Liu et al. (2021) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  • Lomonaco et al. (2020) Lomonaco, V., Maltoni, D., and Pellegrini, L. Rehearsal-free continual learning over small non-iid batches. In CVPR Workshops, 2020.
  • Lopez-Paz & Ranzato (2017) Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Madaan et al. (2022) Madaan, D., Yoon, J., Li, Y., Liu, Y., and Hwang, S. J. Rethinking the representational continuity: Towards unsupervised continual learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=9Hrka5PA7LW.
  • Mallya & Lazebnik (2018) Mallya, A. and Lazebnik, S. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Pathak et al. (2016) Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Prabhu et al. (2020) Prabhu, A., Torr, P. H., and Dokania, P. K. Gdumb: A simple approach that questions our progress in continual learning. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Rao et al. (2019) Rao, D., Visin, F., Rusu, A., Pascanu, R., Teh, Y. W., and Hadsell, R. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Riemer et al. (2019) Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., and Tesauro, G. Learning to learn without forgetting by maximizing transfer and minimizing interference. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • Shen et al. (2020) Shen, G., Zhang, S., Chen, X., and Deng, Z.-H. Generative feature replay with orthogonal weight modification for continual learning. arXiv preprint arXiv:2005.03490, 2020.
  • Silver & Mercer (2002) Silver, D. L. and Mercer, R. E. The task rehearsal method of life-long learning: Overcoming impoverished data. In Conference of the Canadian Society for Computational Studies of Intelligence. Springer, 2002.
  • Thrun (1995) Thrun, S. A Lifelong Learning Perspective for Mobile Robot Control. Elsevier, 1995.
  • Titsias et al. (2020) Titsias, M. K., Schwarz, J., Matthews, A. G. d. G., Pascanu, R., and Teh, Y. W. Functional regularisation for continual learning with gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  • Wang et al. (2022) Wang, Z., Zhang, Z., Ebrahimi, S., Sun, R., Zhang, H., Lee, C.-Y., Ren, X., Su, G., Perot, V., Dy, J., et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  • Wei et al. (2022) Wei, Y., Hu, H., Xie, Z., Zhang, Z., Cao, Y., Bao, J., Chen, D., and Guo, B. Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. arXiv preprint arXiv:2205.14141, 2022.
  • Wu et al. (2019) Wu, Y., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., and Fu, Y. Large scale incremental learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Xie et al. (2021) Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., and Hu, H. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Xie et al. (2022a) Xie, Z., Geng, Z., Hu, J., Zhang, Z., Hu, H., and Cao, Y. Revealing the dark secrets of masked image modeling. arXiv preprint arXiv:2205.13543, 2022a.
  • Xie et al. (2022b) Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., and Hu, H. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022b.
  • Yoon et al. (2018) Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sk7KsfW0-.
  • Yoon et al. (2020) Yoon, J., Kim, S., Yang, E., and Hwang, S. J. Scalable and order-robust continual learning with additive parameter decomposition. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1gdj2EKPB.
  • Yoon et al. (2022) Yoon, J., Madaan, D., Yang, E., and Hwang, S. J. Online coreset selection for rehearsal-based continual learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=f9D-5WNG4Nv.
  • Zbontar et al. (2021) Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning (ICML), 2021.
  • Zenke et al. (2017) Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
  • Zhao et al. (2020) Zhao, B., Xiao, X., Gan, G., Zhang, B., and Xia, S.-T. Maintaining discrimination and fairness in class incremental learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

 

Appendix A Details for Problem Setups

Datasets

We use the ImageNet (Deng et al., 2009) dataset, containing 1000 classes of high-resolution object images with their corresponding labels. We split it into 10 tasks, where each task consists of 100 different classes. We use only 10% of the training instances in each task for pre-training, and the full set for fine-tuning and linear probing. Accuracy is measured on the validation set of each task. Note that Section 5.2 also uses 10% of the training set for pre-training the earlier five sequential tasks (T0 to T4), and the full training set for the sequential fine-tuning procedure (T5 to T8).
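As an illustration of this split, the sketch below builds the class-incremental tasks with a 10% subsample per task for pre-training. It assumes a torchvision-style ImageFolder layout; the class-to-task assignment and the subsampling seed are illustrative and not our exact data pipeline.

import random
from collections import defaultdict
from torch.utils.data import Subset
from torchvision.datasets import ImageFolder

def build_cl_splits(root, num_tasks=10, pretrain_fraction=0.1, seed=0):
    """Split an ImageNet-style dataset into class-incremental tasks and subsample
    a fraction of each task's training instances for pre-training."""
    dataset = ImageFolder(root)                      # 1000 classes
    rng = random.Random(seed)
    classes = list(range(len(dataset.classes)))
    rng.shuffle(classes)                             # random class-to-task assignment
    classes_per_task = len(classes) // num_tasks     # 100 classes per task

    # group sample indices by class label
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(dataset.samples):
        by_class[label].append(idx)

    pretrain_tasks, finetune_tasks = [], []
    for t in range(num_tasks):
        task_classes = classes[t * classes_per_task:(t + 1) * classes_per_task]
        full = [i for c in task_classes for i in by_class[c]]
        rng.shuffle(full)
        subset = full[: int(pretrain_fraction * len(full))]   # 10% for pre-training
        pretrain_tasks.append(Subset(dataset, subset))
        finetune_tasks.append(Subset(dataset, full))           # full set for fine-tuning / linear probe
    return pretrain_tasks, finetune_tasks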

Architectures and baselines

We follow Madaan et al. (2022) for the unsupervised continual learning framework with contrastive self-supervised learning using SimSiam (Chen et al., 2020a). For masked image modeling, we follow the settings of SimMIM (Xie et al., 2022b) and MAE (He et al., 2022) using their official code repositories (https://github.com/microsoft/SimMIM and https://github.com/facebookresearch/mae), with masking ratios of 0.6 and 0.75, respectively. We use the Vision Transformer (Dosovitskiy et al., 2020) (ViT-B) and Swin Transformer (Liu et al., 2021) (Swin-T) as backbone architectures. In ViT-B, the embedding dimension is 768, the depth is 12 layers, the number of heads is 12, and the patch size is 16. In Swin-T, the embedding dimension is 96, the depth at each block is [2, 2, 6, 2] (12 in total), the number of heads at each block is [3, 6, 12, 24], the patch size is 4, and the sliding window size is 7. We set the input image size to 224 for all experiments except SimMIM pre-training, which uses 192. For continual learning methods, we use SI (Zenke et al., 2017), DER (Buzzega et al., 2020), and LUMP (Madaan et al., 2022). Our implementation is built upon the official code of LUMP (https://github.com/divyam3897/UCL).
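For reference, the backbone configurations above correspond to the standard ViT-B/16 and Swin-T variants. A minimal sketch of instantiating them via the timm library follows; using timm here is an assumption for illustration, as our experiments build on the official SimMIM, MAE, and LUMP code bases.

import timm

# ViT-B/16: embedding dim 768, depth 12, 12 heads, patch size 16, 224x224 inputs
vit_b = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

# Swin-T: embedding dim 96, depths [2, 2, 6, 2], heads [3, 6, 12, 24],
# patch size 4, window size 7, 224x224 inputs
swin_t = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False, num_classes=0)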

Training setups and hyperparameters

We use the AdamW optimizer (Loshchilov & Hutter, 2017) with cosine learning rate decay and warmup for all experiments. For the pre-training phase of each task, we train the model for 60 epochs under supervised learning and 100 epochs for unsupervised learning, as self-supervised methods without label supervision may require more iterations to converge. For fine-tuning, we train for 30 epochs. For fine-tuning from the models pre-trained on ImageNet-1K and ImageNet-22K in Table 1, we set the number of training epochs to 10, as they rapidly converge within a few iterations. We set the hyperparameter balancing the regularization term to λ = 100 for SI, λ = 0.1 for DER, and α = 0.1 for LUMP, and the buffer size to 200 for rehearsal-based continual learning methods such as DER and LUMP. We set the batch size to 64 for SimSiam pre-training and 128 otherwise. Table 3 summarizes the learning rates and training epochs for our experiments; in practice, we linearly scale the learning rate by batch_size/512 to reflect the input variance, following Goyal et al. (2017).

Table 3: Basic configurations for the three continual learning frameworks during pre-training and fine-tuning, where η = 2e-4. We report the best combination of learning rates for pre-training and fine-tuning, searched over [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0] × η and [0.1, 0.5, 1.0, 5.0] × η, respectively.

              pre-training               fine-tuning
supervised    lr: 1.0η,  epochs: 60      lr: 0.5η,  epochs: 30
contrastive   lr: 0.2η,  epochs: 100     lr: 1.0η,  epochs: 30
MIM           lr: 5.0η,  epochs: 100     lr: 5.0η,  epochs: 30
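As a concrete illustration of the optimization recipe above (AdamW, cosine decay with warmup, and linear learning-rate scaling by batch_size/512), the sketch below builds the optimizer and schedule. The warmup length and weight decay value here are illustrative assumptions rather than settings from our code.

import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, base_lr=2e-4, batch_size=128, epochs=60,
                    steps_per_epoch=1000, warmup_epochs=5):
    # linearly scale the learning rate with batch_size / 512
    lr = base_lr * batch_size / 512
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.05)  # weight decay is illustrative

    total_steps = epochs * steps_per_epoch
    warmup_steps = warmup_epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:                                  # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))        # cosine decay

    scheduler = LambdaLR(optimizer, lr_lambda)                   # call scheduler.step() every iteration
    return optimizer, scheduler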

Evaluation metrics

We additionally introduce a linear classifier per task and measure fine-tuning and linear probe (linear evaluation) accuracy using the pre-trained backbone. For linear evaluation, we independently train a classifier on the training set of the corresponding task while freezing the backbone weights, and measure accuracy on the validation set. For fine-tuning, we train both the classifier and the backbone weights on the training set of the target task. In this paper, the backward transfer (BWT) of task t, acc^{bwt}_{t}, denotes the disparity in fine-tuning (or linear probe) accuracy between the model right after training task t and at the end of sequential training. Given T sequential tasks \{\mathcal{T}_t\}_{t=1}^{T}:

acc^{bwt}_{t} = acc_{t,T} - acc_{t,t},        (7)

where acc_{i,j} is the measured accuracy on task i using the backbone model sequentially pre-trained from the first task through the j-th task. Following Xie et al. (2022a), we compute the attention distance at each attention head to analyze how attention changes during continual learning. Let \bm{a}^{l,i} be the output of the i-th attention head at layer l; the averaged attention distance d_{\bm{a}^{l,i}} is measured by

d_{\bm{a}^{l,i}} = \sum_{j} c^{*} \bar{\bm{a}}^{l,i}_{j},        (8)

where c^{*} is the corresponding distance map and \bar{\bm{a}}^{l,i} is the normalized attention matrix.
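For clarity, a minimal NumPy sketch of these metrics is given below. The patch-grid distance map and the exclusion of the class token are simplifying assumptions; the actual analyses use attention maps extracted from the trained backbones.

import numpy as np

def backward_transfer(acc):
    """acc[i, j]: accuracy on task i after sequentially pre-training through task j
    (0-indexed, shape T x T). Returns acc^{bwt}_t = acc_{t,T} - acc_{t,t}, as in Eq. (7)."""
    T = acc.shape[0]
    return np.array([acc[t, T - 1] - acc[t, t] for t in range(T)])

def patch_distance_map(grid_size, patch_size):
    """Pairwise Euclidean distance (in pixels) between patch centers on a grid."""
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(np.float32) * patch_size
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

def attention_distance(attn, dist_map):
    """attn: (num_heads, num_patches, num_patches), rows normalized to sum to 1.
    Returns the averaged attention distance per head, i.e., Eq. (8) averaged over queries."""
    return (attn * dist_map[None]).sum(axis=-1).mean(axis=-1)

def attention_entropy(attn, eps=1e-8):
    """Averaged entropy of each head's attention distribution over queries."""
    return -(attn * np.log(attn + eps)).sum(axis=-1).mean(axis=-1)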

Appendix B Further Discussions on Meta Learning

Chen et al. (2023) observe that representations learned by meta-learning can be generic and useful for CL in terms of forgetting and forward transfer. While these results and insights are interesting, here we discuss a few clear differences from meta-learning (Finn et al., 2017). First, we do not need an external rehearsal buffer, whereas FOMAML (Finn et al., 2017) requires keeping a (sub)set of past task data for the inner-loop update. Such reliance on rehearsal buffers has been criticized in the CL field (Hadsell et al., 2020; Lomonaco et al., 2020; Wang et al., 2022): the performance of rehearsal-based methods is sensitive to the buffer size (Prabhu et al., 2020), and they cannot be used in applications with data privacy concerns. MAML-based methods also incur exhaustive computational cost due to iterative and sequential inner-loop updates on sampled past tasks at every iteration of current-task training, whereas our GLAD adds only marginal computation over the original training. GLAD benefits from adaptor-based modulation: the backbone focuses on capturing task-generic information, and a lightweight adaptor transforms it into task-adaptive attention for downstream task adaptation. Therefore, we can recover a past model from the current one at any time, without storing full weights, by simply re-attaching the GLAD adaptor learned on the past task (see the sketch below). Moreover, we can remove unnecessary past task-specific knowledge from the model, or avoid performance degradation from training on noisy tasks/datasets, since the module for task-adaptive transformations is physically separated from the generic representation (similar to Yoon et al. (2020)). These practical utilizations are unavailable for meta-learning, which updates the entire weights without careful consideration of knowledge modulation.
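To illustrate the adaptor re-attachment described above, the following sketch shows how a past task can be recovered by swapping only the lightweight adaptor while the shared backbone stays fixed. The module structure and names here are hypothetical simplifications for illustration, not the exact GLAD implementation.

import copy
import torch.nn as nn

class AdaptedModel(nn.Module):
    """Backbone captures task-generic features; a lightweight adaptor transforms
    them into task-adaptive features for the current downstream task."""
    def __init__(self, backbone: nn.Module, adaptor: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.adaptor = adaptor

    def forward(self, x):
        return self.adaptor(self.backbone(x))

# After finishing task t, store only the adaptor weights (not the full model).
adaptor_bank = {}

def checkpoint_adaptor(model: AdaptedModel, task_id: int):
    adaptor_bank[task_id] = copy.deepcopy(model.adaptor.state_dict())

def recover_past_task(model: AdaptedModel, task_id: int):
    """Re-attach the adaptor learned on a past task to recover its model
    without storing full backbone weights per task."""
    model.adaptor.load_state_dict(adaptor_bank[task_id])
    return model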

Appendix C Additional Analyses

C.1 Evaluation on Task Order Shuffling

                     Supervised                          Masked Model
                     Final Acc          Neg. BWT         Final Acc          Neg. BWT
FT   Base            70.47 (± 7.91)     1.77 (± 1.56)    86.17 (± 1.99)     20.90 (± 1.88)
     SI              70.93 (± 6.11)     4.80 (± 2.24)    83.50 (± 3.04)     16.80 (± 3.33)
     LUMP            N/A                N/A              85.80 (± 2.57)     18.30 (± 1.18)
LP   Base            62.53 (± 6.58)     0.03 (± 0.05)    67.53 (± 4.05)     10.03 (± 1.17)
     SI              62.47 (± 5.12)     2.07 (± 2.16)    64.60 (± 2.72)      9.77 (± 0.87)
     LUMP            N/A                N/A              68.97 (± 5.98)     12.60 (± 0.99)
Table 4: Averaged fine-tuning (FT) and linear probe (LP) performance on CIFAR-100 Split. We report the mean and standard deviation over three independent runs with randomly shuffled task orders.

We additionally evaluate three different task orders for more reliable analyses. As in our ImageNet setting, we split CIFAR-100 into ten tasks, each containing ten disjoint classes. We then train sequentially on the first nine tasks and measure the accuracy on the OOD task (the 10th), which is never seen during continual learning, as well as the forgetting of the first task via backward transfer. Regardless of the task order, we observe that our proposed masked modeling-based UCL framework consistently surpasses supervised CL in fine-tuning performance on the OOD task (T9) across all continual learning methods (Base/SI/LUMP), similar to the ImageNet results in Table 1. This is because masked modeling-based UCL continuously learns increasingly generic representations, as evidenced in Figure 5: continual learners gradually capture richer task-general (low-level) features with more local attention (i.e., lower attention distance), retaining localized information with a strong local inductive bias, such as edges, patterns, and textures.

C.2 Attention Distance and Entropy from Different Transformer Backbones

We plot the attention distance and attention entropy of attention heads per layer for ViT-B in Figure 9 and Figure 10, respectively, and for Swin-T in Figure 11 and Figure 12. Both self-attention-based architectures behave similarly across the learning frameworks, i.e., Supervised, SimSiam, and SimMIM. Note that two consecutive layers in Swin-T alternate between relatively high and low values for both metrics, since two successive Swin Transformer blocks (W-MSA and SW-MSA in the original paper) aggregate locality at different ranges.

[Figure 8 — panels: Supervised, SimSiam, SimMIM]
Figure 8: Top row: visualization of the aggregated attention distance on an OOD task (T9) at each layer at the end of each continual pre-training task phase (T0→T8). The marker radius indicates the standard deviation over attention heads in the corresponding layer. Bottom row: aggregated attention entropy on T9.

C.3 Change of Aggregated Attention Distance and Entropy during Continual Learning

We additionally visualize the change of the aggregated entropy of each attention head's distribution in Figure 8, and re-plot the attention distance from Figure 5 for easier comparison between the two. Similar to the attention distance, the aggregated attention entropy gradually decreases as more tasks are pre-trained, encouraging incremental model generalization. Moreover, compared to SimMIM, the aggregated attention entropy of the supervised and contrastive learning-based continual learning frameworks shows little diversity and remains high across all attention heads.

C.4 Aggregated Attention Distance and Entropy while Freezing Partial Layers

We further visualize the change of the aggregated attention distance and entropy when freezing the two lowest or two deepest layers in Figure 13 and Figure 14, respectively. These experiments follow exactly the setting of Section 5.2. Interestingly, when SimMIM freezes a few layers during continual learning, the remaining trainable layers decrease their attention distance and entropy more actively. We believe this is why SimMIM shows no noticeable degradation in fine-tuning performance even when a few layers are frozen during continual learning. In contrast, we did not observe a significant change for supervised continual learning; supervised learning tends to rigidly focus on different features depending on layer depth (i.e., global to local) and therefore cannot flexibly capture the locality inductive bias.

Appendix D Limitations and Societal Impact

Limitations

Although we have shown promising results and findings across multiple CL frameworks, our proposed GLAD module requires extra memory and computation for the regularization term. Moreover, we design the continual pre-training experiments with fewer than ten tasks, which is insufficient to evaluate lifelong learning at scale. Reducing the memory cost of the additional adaptor and extending our framework to a larger number of continual pre-training tasks are important directions for future work.

Negative societal impact

Our work does not store past task data in a buffer, and thus avoids the negative societal impact associated with data privacy issues in the community.

[Figure 9 — rows: Supervised, SimSiam, SimMIM; columns: w_{T0} and w_{T0→T8}]
Figure 9: ViT-B attention distance on the first (T0) and last task (T8) of the task sequence for the three continual pre-training frameworks, measured right after the completion of the first task (w_{T0}) and the last task (w_{T0→T8}).

[Figure 10 — rows: Supervised, SimSiam, SimMIM; columns: w_{T0} and w_{T0→T8}]
Figure 10: ViT-B attention entropy on the first (T0) and last task (T8) of the task sequence for the three continual pre-training frameworks, measured right after the completion of the first task (w_{T0}) and the last task (w_{T0→T8}).

[Figure 11 — rows: Supervised, SimSiam, SimMIM; columns: T0, T8, and T9, each w.r.t. w_{T0} and w_{T0→T8}]
Figure 11: Swin-T attention distance on an in-distribution task (T0) and the OOD task (T9) for the three continual pre-training frameworks, measured right after the completion of the first task (w_{T0}) and the last task (w_{T0→T8}).

[Figure 12 — rows: Supervised, SimSiam, SimMIM; columns: T0, T8, and T9, each w.r.t. w_{T0} and w_{T0→T8}]
Figure 12: Swin-T attention entropy on an in-distribution task (T0) and the OOD task (T9) for the three continual pre-training frameworks, measured right after the completion of the first task (w_{T0}) and the last task (w_{T0→T8}).

[Figure 13 — rows: Supervised, SimMIM; columns: Full fine-tuning, Freezing lower layers, Freezing deeper layers]
Figure 13: Visualization of the aggregated attention distance on an OOD task (T9) at each layer at the end of each continual pre-training task phase (T0→T8). We freeze the two lowest or deepest layers after pre-training the first task (T0). The marker radius indicates the standard deviation over attention heads in the corresponding layer.

[Figure 14 — rows: Supervised, SimMIM; columns: Full fine-tuning, Freezing lower layers, Freezing deeper layers]
Figure 14: Visualization of the aggregated attention entropy on an OOD task (T9) at each layer at the end of each continual pre-training task phase (T0→T8). We freeze the two lowest or deepest layers after pre-training the first task (T0). The marker radius indicates the standard deviation over attention heads in the corresponding layer.