Maintain Plasticity in Long-timescale Continual Test-time Adaptation
Abstract
Continual test-time domain adaptation (CTTA) aims to adjust pre-trained source models to perform well over time across non-stationary target environments. While previous methods have made considerable efforts to optimize the adaptation process, a crucial question remains: can the model adapt to continually changing environments while preserving plasticity over a long time? Plasticity refers to the model's capability to continually adjust its predictions in response to non-stationary environments. In this work, we explore plasticity, an essential but often overlooked aspect of continual adaptation, to facilitate more sustained adaptation in the long run. First, we observe that most CTTA methods experience a steady and consistent decline in plasticity during long-timescale continual adaptation. Moreover, we find that the loss of plasticity is strongly associated with changes in label flips. Based on this correlation, we propose a simple yet effective policy, Adaptive Shrink-Restore (ASR), for preserving the model's plasticity. In particular, ASR performs weight re-initialization at adaptive intervals, where each interval is determined by changes in label flipping. Our method is validated on extensive CTTA benchmarks and achieves excellent performance.
1 Introduction
Humans and animals have remarkable abilities to learn and to transfer what they have learned to new environments. For example, dogs can learn to follow complex commands and perform various tasks, quickly adjusting to new surroundings and obeying orders when the environment changes. In machine learning, researchers aim to build models that are similarly adaptable to changing scenarios.

To achieve this, several studies focus on adapting models pre-trained on a source environment to different test conditions. Two common approaches are aligning feature distributions through discrepancy losses between domains [17] and adversarial training [35]. Recently, the primary focus has shifted towards a much more realistic setting known as continual test-time domain adaptation (CTTA), which adapts models to non-stationary conditions during deployment. Existing methods like CoTTA [40] and RMT [7] adapt the pre-trained source model to constantly evolving test-time data via consistency regularization in a teacher-student framework.
Although these methods have made considerable improvements to the adaptation process in various aspects [32, 25, 16], most approaches overlook their stability in the continuous long-term adaptation stage. Recently, [26] evaluated a set of CTTA methods on a large benchmark named Continuously Changing Corruptions (CCC) and found that most current TTA methods fail to maintain consistent performance, or even collapse, over a long time. We provide one example in Fig. 1, where the accuracy drops dramatically after a certain number of time steps. This performance collapse indicates that the model is unable to learn new information during adaptation.
Nevertheless, no detailed study has been provided for this collapse phenomenon in CTTA. To better understand it, we focus on the long-term behavior of continual adaptation and look into a key factor in the adaptation process: plasticity, the ability to continually adjust predictions in continually changing environments [20, 19, 8]. This capability is essential, as it represents a model's ability to learn new information during long-time adaptation. In other words, a lack of plasticity leads to performance collapse over a long-timescale process.
In this work, we first conduct a series of experiments to demonstrate the loss of plasticity in long-timescale CTTA. To further study plasticity, we observe the label flip [27], which measures the prediction difference between the current model and the previous-step model. Notably, we find that plasticity loss becomes particularly severe when the label flip count begins to fluctuate, indicating that fluctuation of the label flip count can signal the loss of plasticity.
Based on this observation, we propose an adaptive plasticity restoration policy driven by the label-flipping trajectory, named Adaptive Shrink-Restore (ASR). Specifically, ASR selects a suitable timing for weight re-initialization based on the label-flipping trajectory. The re-initialization shrinks the current model weights and restores the source model weights at a given ratio. In summary, the contributions of this work are as follows:
- To the best of our knowledge, we are the first to explore plasticity, a key model property, in the CTTA task. We demonstrate the importance of preserving plasticity for the model's long-timescale adaptation.
- We find that changes in the label-flipping landscape can serve as an explicit signal of plasticity loss during the adaptation process.
- We propose a straightforward policy, ASR, which re-initializes the model weights at adaptive timings, effectively preserving plasticity. The adaptive timing is derived from the observed label-flipping trajectory.
2 Related Work
2.1 Unsupervised Domain Adaptation
Unsupervised domain adaptation utilizes labeled source and unlabeled target data to mitigate the gap between domains. Many previous studies have focused on reducing this domain divergence via adversarial training [9, 36] or by aligning statistical moments [18]. Recent work has begun to explore self-training-based methods [44, 43], where pseudo-labels are generated for the unlabeled target data to refine the model. In addition, several methods adjust batch norm statistics between source and target domains to perform adaptation [15, 4].
2.2 Continual Test-time Adaptation
Continual test-time adaptation aims to adapt a pre-trained model in response to continually changing environments. In contrast to conventional domain adaptation settings, the model is updated with only unlabeled data from a stream of test data spanning diverse domains, without access to source data. For this more realistic setting, [40] first proposed a weighted, augmentation-averaged mean-teacher framework [33] to generate robust predictions. Following the mean-teacher framework, many existing methods build on pseudo-label-based adaptation [41, 7]. Another line of work utilizes batch norm statistics from the input test data for prediction [23, 30]. In addition, Tent [37] directly minimizes the entropy of the model's predictions at test time to update the batch norm statistics. Furthermore, EATA [25] combines a similar student-teacher framework with continuous batch normalization statistics updates to achieve efficient adaptation.
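To make the entropy-minimization mechanism concrete, below is a minimal sketch of a Tent-style update step in PyTorch. It assumes the optimizer has been constructed over only the BatchNorm affine parameters; the function and variable names are illustrative rather than taken from the original implementation.

```python
def tent_step(model, x, optimizer):
    """One Tent-style test-time update: minimize the Shannon entropy
    of the model's predictions on an unlabeled test batch.

    Assumes `optimizer` was built over only the BatchNorm affine
    parameters (scale/shift); all other weights stay frozen.
    """
    logits = model(x)                                   # forward pass on a test batch
    log_probs = logits.log_softmax(dim=1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()                                  # gradients flow to BN affine params only
    optimizer.step()
    return logits.detach()
```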
2.3 Plasticity
Plasticity refers to the model's ability to adapt and learn in response to new environments. The study of plasticity was first introduced in neurobiology decades ago [1, 42, 21]. In the context of neuroscience, it is widely observed that age of acquisition affects people's ability in word processing, such as response times and accuracy rates, due to the loss of plasticity with aging [42, 3]. For neural networks, recent works find a similar phenomenon: modern deep-learning networks gradually lose their ability to learn when training on new data in non-stationary environments [8, 19, 24]. This loss of plasticity is most evident when the distribution shifts over time, forcing the network to update its previous predictions [20]. In reinforcement learning, bootstrapping has been analyzed as a source of instability in the offline RL case, potentially worsening this issue [14]. Previous research has also explored strategies to preserve plasticity, such as continually resetting a subset of dormant neurons during learning [31] and shrinking or perturbing parameters when switching to new tasks [2]. These works examine supervised continual learning for new tasks or reinforcement learning. In contrast, our work focuses on the challenging case of preserving plasticity in continual adaptation.
3 Plasticity in Adaptation
3.1 Plasticity
Plasticity is generally defined as the model's ability to continually learn from new data [20]. Different metrics have been proposed to measure model plasticity [28, 38, 20, 8]. In this work, following [39], we use classification accuracy to represent model plasticity.
We carry out a simple toy experiment to show how model plasticity decreases over time during adaptation. We use one of the most common CTTA methods, EATA [25], with two simple re-initialization policies, reset [26] and no-reset, on the CCC benchmark [26]. We can observe from Fig. 1 that the model exhibits plasticity loss under both policies, and that the weight re-initialization policy affects the model's plasticity: the reset policy consistently outperforms the no-reset policy as the number of time steps increases.
Next, we briefly discuss why maintaining plasticity is important for CTTA methods. One reason behind the reduced plasticity is error accumulation resulting from either entropy minimization [37] or pseudo-labeling [10, 40]. Unfortunately, most CTTA methods are developed based on these techniques and rely heavily on them to adjust the model in response to new samples. We test several existing CTTA methods on CCC-Medium [26]. Fig. 2 shows that all of these entropy-based methods suffer an obvious loss of plasticity. Even CTTA methods that use regularization, like EATA [25], still struggle to maintain classification accuracy in later stages. It can also be seen that the no-reset policy underperforms the reset policy.
These observations suggest that most CTTA methods lose plasticity due to error accumulation during adaptation. More importantly, the accuracy gap between the no-reset and reset policies gradually widens as adaptation progresses, indicating that plasticity decreases at an accelerating rate without any intervention. Overall, these experiments raise an essential question in CTTA: how can we achieve sustained continual adaptation by preserving plasticity?

3.2 Plasticity Preservation
In this subsection, we delve deeper into the plasticity of the model in CTTA and propose a potential solution to preserve this important attribute, as shown in Fig. 4.
3.2.1 When to Trigger the Re-initialization?
A natural question is when to trigger the weight re-initialization to preserve plasticity during adaptation. To obtain sustained performance in the long run, the policy must be applied at the right time: if we intervene too early, too late, or too often, the model may still fail to adapt to incoming environments or samples. This is demonstrated in Fig. 3, where the reset timing is randomly determined; random reset timing fails to deliver superior long-term performance compared to the no-reset policy. Therefore, it is crucial to identify the appropriate time to trigger the re-initialization.


One previous method, RDumb [26], completely resets the model at a fixed time-step interval determined on a related validation dataset. However, relying on external datasets limits the realistic usage of CTTA. Moreover, CTTA methods with different adaptation mechanisms may progress at different rates, as shown in Fig. 2, making it difficult to find a single interval that works for all methods. Thus, we aim to propose a general and adaptive way to find the right time to trigger the plasticity preservation policy.
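For contrast, the fixed-interval reset policy of RDumb can be sketched as follows. Here `adapt_step` stands for one update of any base TTA method, and the interval value is a placeholder that, in RDumb, must be tuned on a held-out set.

```python
import copy

def run_with_periodic_reset(model, source_state, adapt_step, test_stream,
                            reset_interval=1000):
    """RDumb-style baseline sketch: hard-restore the source weights every
    `reset_interval` steps, regardless of the adaptation state."""
    for t, batch in enumerate(test_stream, start=1):
        adapt_step(model, batch)                        # one base-method update
        if t % reset_interval == 0:
            model.load_state_dict(copy.deepcopy(source_state))  # full reset
```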
3.2.2 Label Flip
To find the correct trigger timing for weight re-initialization in the plasticity preservation policy, we utilize one of the adaptation characteristics of TTA methods. During adaptation, a large number of prediction changes is expected between the currently adapted model and the model at the previous time step; this is termed a label flip (LF) [34, 27, 6]. More specifically, we pass the same set of test images through both the model from the previous adaptation step and the current model, recording the predicted class from each model for every image. If the predicted classes differ between the two models, we call that a label flip.
To better measure LF, we apply weighting factors similar to [27]: the confidence of the initial model and the confidence of the adapted model for the changed class. Hence, we have:

$$\mathrm{LF}_t = \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_t^{(i)} \neq \hat{y}_{t-1}^{(i)}\right] \, p_{t-1}^{(i)} \, p_t^{(i)}, \qquad (1)$$

where $\mathbb{1}[\cdot]$ is an indicator function denoting whether a label flip occurs (the predicted class changes), $p_t^{(i)}$ is the prediction confidence of the current time-step model on sample $i$, and $p_{t-1}^{(i)}$ is that of the model at the previous time step.
Along with adaptation, the diverse environments cause severe fluctuations in the flip count. We therefore process the label flip with an exponential moving average to obtain a smooth trajectory:

$$\overline{\mathrm{LF}}_t = \lambda \, \overline{\mathrm{LF}}_{t-1} + (1 - \lambda) \, \mathrm{LF}_t, \qquad (2)$$

where $t$ is the time step and $\lambda$ is the coefficient for updating the LF along the adaptation.
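Below is a minimal sketch of Eqs. (1) and (2) as reconstructed above; the tensor shapes and the multiplicative form of the confidence weighting are our assumptions.

```python
import torch

@torch.no_grad()
def weighted_label_flip(probs_prev, probs_curr):
    """Confidence-weighted label flip count (Eq. (1) sketch).

    probs_prev, probs_curr: (N, C) softmax outputs of the previous-step
    and current-step models on the same batch of test images.
    """
    conf_prev, pred_prev = probs_prev.max(dim=1)
    conf_curr, pred_curr = probs_curr.max(dim=1)
    flipped = (pred_prev != pred_curr).float()          # indicator of a flip
    return (flipped * conf_prev * conf_curr).sum().item()

def ema_update(lf_smooth, lf_t, lam=0.99):
    """Exponential moving average of the flip count (Eq. (2));
    `lam` is a placeholder for the coefficient used in the paper."""
    return lam * lf_smooth + (1.0 - lam) * lf_t
```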
It can be seen from Fig. 5 that the label flip shows a decreasing trend in the beginning period of adaptation (see the purple line). After the decrease, the label flip starts to fluctuate. The purple line does not fully reflect this fluctuation, since the raw counts have been smoothed by the exponential moving average in Eq. (2).
3.2.3 Adaptive Trigger
Empirically, we find that severe plasticity loss occurs when the label flip starts to fluctuate. As shown in Fig. 5, the minimum accuracy (see the blue line) is reached as the flip begins to fluctuate (see the purple line). In other words, any significant fluctuation in the flipping trajectory causes a loss of plasticity, negatively impacting overall performance in later runs.
Thus, we decide when to carry out the weight re-initialization based on this straightforward observation of the flip trajectory's landscape changes. We first locate the minimum of the label flip using the lowest point along the trajectory and its surrounding neighbors:

$$\mathrm{LF}_{\min} = \frac{1}{k} \sum_{t' \in \mathcal{N}(t^{*})} \overline{\mathrm{LF}}_{t'}, \qquad (3)$$

where $\mathcal{N}(t^{*})$ is the set of neighboring points around the lowest point $t^{*}$ of $\overline{\mathrm{LF}}$, and $k$ is the number of neighborhood points. If a rise in the label flip beyond a percentage $\delta$ is detected, we judge it to be the suitable time to trigger the weight re-initialization:

$$\frac{\overline{\mathrm{LF}}_t - \mathrm{LF}_{\min}}{\mathrm{LF}_{\min}} > \delta, \qquad (4)$$

where $\overline{\mathrm{LF}}_t$ denotes the current label flip and $\delta$ denotes the maximum allowed fluctuation. In this way, our policy determines when to perform re-initialization adaptively.
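The adaptive trigger of Eqs. (3) and (4) can then be sketched as below; the neighborhood size `k` and threshold `delta` are illustrative values, not the paper's settings.

```python
def should_trigger(lf_history, k=5, delta=0.1):
    """Adaptive re-initialization trigger (Eqs. (3)-(4) sketch).

    lf_history: smoothed flip counts up to the current step. The minimum
    is averaged over the lowest point and its k-step neighborhood; a
    relative rise beyond `delta` signals plasticity loss.
    """
    if len(lf_history) <= k:
        return False
    t_star = min(range(len(lf_history)), key=lf_history.__getitem__)
    lo, hi = max(0, t_star - k), min(len(lf_history), t_star + k + 1)
    lf_min = sum(lf_history[lo:hi]) / (hi - lo)         # neighborhood average
    return (lf_history[-1] - lf_min) / max(lf_min, 1e-8) > delta
```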
3.2.4 Shrink-Restore Re-initialization
The simplest form of weight re-initialization is to reset the model directly (see Figs. 1 and 2). However, this causes the model to lose all the knowledge it previously acquired. Ideally, we want the adaptation model to retain some of its past knowledge at each re-initialization; otherwise, we would have to adapt from scratch each time, which is time-consuming and hurts initial performance.
In the context of continual learning, previous methods have used a simple but effective technique called "shrink and perturb" [2], which shrinks the weights to standardize gradients and then adds noise to them for better exploration in new tasks. In our case, however, the perturbation step would add noise to the weights and may further harm adaptation, since there is no supervision signal in CTTA. Motivated by "shrink and perturb", we propose "Shrink-and-Restore" (Shrink-Restore). In contrast to "shrink and perturb", we use the weights of the source model to add variability to the network. The pre-trained source model weights can be a good complement for preserving plasticity since the domains are correlated.
Therefore, the weight update rule of the Shrink-Restore method is:

$$\theta \leftarrow \alpha \, \theta_t + \beta \, \theta_0, \qquad (5)$$

where $\theta_t$ denotes the current model weights, $\theta_0$ denotes the source model weights, and $\alpha$ and $\beta$ are coefficients. We shrink the current parameters and restore them using scaled source initialization. In this way, when re-initialization happens, the model can still leverage some previous knowledge to warm-start the next iterations while restoring the plasticity necessary for ongoing adaptation. Meanwhile, as shown in [26], model weights would grow indefinitely over the long run in CTTA without any regularization. To keep the weights within an appropriate range, we further constrain the sum of the combination ratios, $\alpha + \beta < 1$, ensuring that the re-initialized weights remain at a reasonable magnitude each time. This controls weight growth over the long term and prevents the collapse phenomenon of weight explosion.
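Eq. (5) amounts to a single pass over the model's weights. A minimal PyTorch sketch follows, with placeholder values for $\alpha$ and $\beta$ (the constraint $\alpha + \beta < 1$ is asserted explicitly).

```python
import torch

@torch.no_grad()
def shrink_restore(model, source_state, alpha=0.6, beta=0.3):
    """Shrink-Restore re-initialization (Eq. (5) sketch):
    theta <- alpha * theta_t + beta * theta_0, with alpha + beta < 1
    to keep the weights at a controlled magnitude."""
    assert alpha + beta < 1.0
    for name, tensor in model.state_dict().items():
        if tensor.dtype.is_floating_point:              # skip integer buffers
            tensor.mul_(alpha).add_(source_state[name].to(tensor.device),
                                    alpha=beta)
```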

4 Experiments
Table 1: Accuracy (%, mean ± std) across CTTA benchmarks.

| Method | CIN-C | CIN-3DCC | CCC-Easy | CCC-Medium | CCC-Hard | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Pretrained | 18.0 ± 0.0 | 31.5 ± 0.22 | 34.1 ± 0.22 | 17.3 ± 0.21 | 1.5 ± 0.02 | 20.5 |
| BN [23, 30] | 31.5 ± 0.02 | 35.7 ± 0.02 | 42.6 ± 0.39 | 27.9 ± 0.74 | 6.8 ± 0.31 | 28.9 |
| Tent [37] | 15.6 ± 3.5 | 24.4 ± 3.5 | 3.9 ± 0.58 | 1.4 ± 0.17 | 0.51 ± 0.07 | 9.2 |
| RPL [29] | 21.8 ± 3.6 | 30.0 ± 3.6 | 7.5 ± 0.83 | 2.7 ± 0.36 | 0.67 ± 0.14 | 12.5 |
| SLR [22] | 12.4 ± 7.7 | 12.2 ± 7.7 | 22.2 ± 18.4 | 7.7 ± 9.0 | 0.66 ± 0.57 | 11.0 |
| CPL [10] | 3.0 ± 3.3 | 5.7 ± 3.3 | 0.41 ± 0.06 | 0.22 ± 0.03 | 0.14 ± 0.01 | 1.9 |
| CoTTA [40] | 34.0 ± 0.68 | 37.6 ± 0.68 | 14.9 ± 0.88 | 7.7 ± 0.43 | 1.1 ± 0.16 | 19.1 |
| EATA [25] | 41.8 ± 0.98 | 43.6 ± 0.98 | 48.2 ± 0.6 | 35.4 ± 1.0 | 8.7 ± 0.8 | 35.5 |
| ETA [25] | 43.8 ± 0.33 | 42.7 ± 0.33 | 41.4 ± 0.95 | 1.1 ± 0.43 | 0.23 ± 0.05 | 25.8 |
| RDumb [26] | 46.5 ± 0.15 | 45.2 ± 0.15 | 49.3 ± 0.88 | 38.9 ± 1.4 | 9.6 ± 1.6 | 37.9 |
| ASR (ours) | 47.37 ± 0.35 | 46.5 ± 0.10 | 51.2 ± 0.94 | 42.2 ± 1.58 | 12.9 ± 0.7 | 40.0 |
Table 2: Ablation of the Shrink-Restore component (accuracy, %).

| Method | Accuracy |
| --- | --- |
| ASR with shrink and restore | 46.58 |
| ASR without shrink and restore | 46.10 |
Table 3: Accuracy of three baseline methods on CCC-Medium under no reset, fixed reset intervals (in steps), and our adaptive trigger (Adp).

| Method | no-reset | 50 | 100 | 250 | 500 | 750 | 1250 | 2000 | 2500 | 3000 | 4000 | 5000 | Adp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tent | 0.1588 | 0.3188 | 0.3362 | 0.3656 | 0.3845 | 0.3926 | 0.3915 | 0.3856 | 0.3783 | 0.3666 | 0.3666 | 0.3487 | 0.3902 |
| EATA | 0.3914 | 0.3530 | 0.3811 | 0.4123 | 0.4261 | 0.4313 | 0.4304 | 0.4274 | 0.4257 | 0.4253 | 0.4202 | 0.4171 | 0.4344 |
| RPL | 0.2992 | 0.3081 | 0.3162 | 0.3358 | 0.3566 | 0.3692 | 0.3809 | 0.3877 | 0.3892 | 0.3882 | 0.3846 | 0.3838 | 0.3873 |
To test the effectiveness of our method, extensive experiments of CTTA are carried out on three large and comprehensive datasets, CIN-C [12], CIN-3DCC [13] and CCC [26].
4.1 Datasets
CIN-C: ImageNet-C [12], referred to as CIN-C in RDumb [26], comprises corrupted variants of the ImageNet [5] validation set, where each variant corresponds to one of 15 distinct types of common corruption, such as snow, defocus, and motion blur. Each corruption is applied to the 50,000 validation images at severity levels ranging from 1 to 5.
CIN-3DCC: ImageNet-3DCC [13], referred to as CIN-3DCC in RDumb [26], is a similar extension of the ImageNet dataset [5], providing a set of corruptions at five levels of severity. In contrast to ImageNet-C, its corruptions are generated using 3D information based on scene geometry, yielding shifts that are more aligned with real-world conditions. Each corruption is likewise applied to the validation images at every severity level.
CCC: CCC [26] is a much larger benchmark and the first to feature smooth transitions from one domain to another, mimicking how environments change in reality. It consists of three difficulty levels (Easy, Medium, and Hard), each combining multiple corruption types, random seeds, and transition speeds. Each difficulty level includes many such combinations for continual testing across distinct domains, and every combination contains a long stream of images.
4.2 Implementation
Following RDumb [26], we use the same pre-trained ResNet-50 [11] as the default adaptation model, and our approach is built on the same EATA [25] framework used in [26]. In all experiments, we use a fixed batch size, and since our method is generally compatible with any baseline, we keep the same optimized testing configuration as in [26] for each method. Results for CIN-C and CIN-3DCC are obtained by averaging over ten different permutations of corruptions, and CCC results are likewise averaged across its combinations. For CIN-C and CIN-3DCC, we follow the settings in [26] and use the highest severity level by default. The methods used for comparison across datasets are: BatchNorm (BN) Adaptation [23, 30], Test Entropy Minimization (Tent) [37], Robust Pseudo-Labeling (RPL) [29], Soft Likelihood Ratio (SLR) [22], Conjugate Pseudo Labels (CPL) [10], Continual Test Time Adaptation (CoTTA) [40], Efficient Test Time Adaptation (EATA) [25], EATA without weight regularization (ETA) [25], RDumb [26], and our proposed ASR.
The methods used for comparison are briefly introduced as follows. BN [23, 30] estimates BatchNorm statistics (mean and variance) individually for each test-time batch; affine transformation parameters remain unchanged. Tent [37] minimizes entropy on the test set by updating the BatchNorm scale and shift parameters while learning the necessary statistics. RPL [29] employs a teacher-student framework combined with a loss function that is resistant to label noise. CPL [10] uses meta-learning to determine the best adaptation objective within a range of potential functions. SLR [22] utilizes a loss function akin to entropy that avoids vanishing gradients; an additional loss term promotes uniform predictions across classes, with the network's last layer remaining static. CoTTA [40] employs a teacher-student method with augmentations to support continuous adaptation, resetting a small proportion of the weights to their original pre-trained values in each iteration. EATA [25] applies two weighting functions to its outputs: one based on entropy (assigning higher weights to low-entropy outputs) and another based on diversity (excluding outputs similar to previously seen ones); a regularization loss term keeps model weights close to their initial states. ETA is a version of EATA that omits the regularizer loss proposed in [25], which we test as a complementary experiment. RDumb [26] is our baseline approach that prevents collapse through periodic resetting at a fixed interval selected through hyperparameter tuning on a holdout set. ASR is our proposed method for maintaining plasticity in CTTA; the moving-average ratio $\lambda$ for the label flip and the shrink and restore parameters $\alpha$ and $\beta$ are kept fixed across all experiments.
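Putting the pieces together, a sketch of the full ASR loop on top of an arbitrary base adapter is given below; it reuses the helper functions sketched in Sec. 3.2, and all hyperparameter values are placeholders rather than the tuned settings.

```python
import torch

def asr_adapt(model, source_state, adapt_step, test_stream,
              lam=0.99, k=5, delta=0.1, alpha=0.6, beta=0.3):
    """End-to-end ASR sketch: adapt, track label flips on the same batch
    before/after each update, and trigger Shrink-Restore adaptively.
    Reuses weighted_label_flip / ema_update / should_trigger /
    shrink_restore from the earlier sketches."""
    lf_smooth, lf_history = 0.0, []
    for batch in test_stream:
        with torch.no_grad():
            probs_prev = model(batch).softmax(dim=1)    # pre-update predictions
        adapt_step(model, batch)                        # one base-method update
        with torch.no_grad():
            probs_curr = model(batch).softmax(dim=1)    # post-update predictions
        lf_smooth = ema_update(lf_smooth,
                               weighted_label_flip(probs_prev, probs_curr), lam)
        lf_history.append(lf_smooth)
        if should_trigger(lf_history, k, delta):
            shrink_restore(model, source_state, alpha, beta)
            lf_history.clear()                          # restart the trajectory
```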
4.3 Main Results
From Tab. 1, we observe that most CTTA baselines struggle to maintain consistent performance across all datasets due to a continual loss of plasticity. For instance, methods such as SLR and RPL experience significant declines in plasticity even on the easier datasets, CIN-C and CIN-3DCC. In some cases, their performance deteriorates so much that it falls below that of the pre-trained model, indicating a clear failure to adapt effectively to new data. The situation becomes even more pronounced on the most challenging dataset, CCC, where many methods collapse completely, with accuracy dropping to single-digit levels. This sharp decline suggests that the network has almost no plasticity left to adapt to the test data, highlighting a severe limitation in these methods' ability to handle ongoing adaptation under difficult conditions.
Overall, among all baseline methods, only EATA and RDumb exhibit some degree of robustness against the loss of plasticity over the long term, via simple resetting or regularization. Notably, RDumb [26] performs relatively well, with its interval hyperparameter tuned on a separate validation set. However, tuning with external resources may impose unnecessary constraints in continual test-time adaptation. In contrast, our proposed method, ASR, which combines shrinking and restoring with an adaptive trigger, shows an absolute improvement of 2.1% in mean accuracy over the best baseline, RDumb, achieving 40.0%. More importantly, the improvement grows on the more difficult datasets, indicating that the proposed method better handles extreme conditions in which the target distribution differs significantly from the source.
4.4 Ablation Study
4.4.1 Adaptive Trigger
The main contribution of our method is adaptively maintaining plasticity. We first conduct an ablation study of the adaptive trigger component to validate its effectiveness. From Fig. 3, ASR with only the adaptive trigger achieves performance similar to RDumb with its best interval period. In other words, our adaptive reset can be used in continual test-time adaptation without validation on related test sets. In addition, when the reset interval is not correctly chosen, performance varies considerably, which means a suitable reset scheme is essential. Resetting too early or too often interrupts adaptation before the model has realized its full potential in the current domain, whereas intervening too late lets noisy previously learned knowledge damage the plasticity needed for continual adaptation. Our adaptive framework balances these two factors, achieving performance similar to that of the manually tuned best interval.
4.4.2 General Application to CTTA Methods
Since our policy is independent of specific components within CTTA, it can be combined with any adaptation method for long-term sustainability. In the main experiments, we adopt EATA [25] as the base framework, as it was also used in RDumb [26], and we strictly follow their methodology to ensure a fair comparison. Here, our primary objective is to test whether our policy applies to other CTTA approaches and can be combined with them to improve sustained adaptation performance, as shown in Tab. 3. We built our adaptive trigger on three baseline methods using 902,200 samples from CCC-Medium, and the trigger achieves performance comparable to that of the best-tuned reset interval.
4.4.3 Shrink and Restore
Here, we explicitly test the effectiveness of the Shrink-Restore component. Since running through all the images in the CCC dataset is very expensive, we select a subset of the medium-difficulty CCC dataset for this ablation study. From Tab. 2, we can see that when the Shrink-Restore component is turned off, the overall adaptation performance decreases by approximately 0.5%. This demonstrates that the Shrink-Restore component improves performance by incorporating prior knowledge while adding the plasticity needed for ongoing adaptation.
5 Conclusion
In summary, this work introduces the concept of plasticity in the context of continual test-time adaptation (CTTA), highlighting its crucial role in sustaining long-term adaptation. We find that monitoring label-flipping patterns provides a reliable signal of plasticity loss during adaptation. Based on these insights, we propose the ASR policy, a straightforward yet effective method for preserving plasticity through adaptive re-initialization of model weights. Our experiments show that ASR significantly enhances the model's ability to maintain performance over time. More importantly, it can be integrated into arbitrary CTTA methods to sustain continual adaptation.
References
- [1] Larry F Abbott and Sacha B Nelson. Synaptic plasticity: taming the beast. Nature neuroscience, 3(11):1178–1183, 2000.
- [2] Jordan Ash and Ryan P Adams. On warm-starting neural network training. Advances in neural information processing systems, 33:3884–3894, 2020.
- [3] Patrick Bonin, Christopher Barry, Alain Méot, and Marylène Chalard. The influence of age of acquisition in word reading and other tasks: A never ending story? Journal of Memory and Language, 50(4):456–476, 2004.
- [4] Fabio Maria Carlucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulo. Autodial: Automatic domain alignment layers. In 2017 IEEE international conference on computer vision (ICCV), pages 5077–5085. IEEE, 2017.
- [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- [6] Xiang Deng, Yun Xiao, Bo Long, and Zhongfei Zhang. Reducing flipping errors in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6506–6514, 2022.
- [7] Mario Döbler, Robert A Marsden, and Bin Yang. Robust mean teacher for continual and gradual test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7704–7714, 2023.
- [8] Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.
- [9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of machine learning research, 17(59):1–35, 2016.
- [10] Sachin Goyal, Mingjie Sun, Aditi Raghunathan, and J Zico Kolter. Test time adaptation via conjugate pseudo-labels. Advances in Neural Information Processing Systems, 35:6204–6218, 2022.
- [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [12] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
- [13] Oğuzhan Fatih Kar, Teresa Yeo, Andrei Atanov, and Amir Zamir. 3d common corruptions and data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18963–18974, 2022.
- [14] Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. arXiv preprint arXiv:2010.14498, 2020.
- [15] Yanghao Li, Naiyan Wang, Jianping Shi, Xiaodi Hou, and Jiaying Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109–117, 2018.
- [16] Yukang Lin, Haonan Han, Chaoqun Gong, Zunnan Xu, Yachao Zhang, and Xiu Li. Consistent123: One image to highly consistent 3d asset using case-aware diffusion priors. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 6715–6724, 2024.
- [17] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015.
- [18] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. Advances in neural information processing systems, 31, 2018.
- [19] Clare Lyle and Razvan Pascanu. Switching between tasks can cause AI to lose the ability to learn, 2024.
- [20] Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. In International Conference on Machine Learning, pages 23190–23211. PMLR, 2023.
- [21] Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects, 2013.
- [22] Chaithanya Kumar Mummadi, Robin Hutmacher, Kilian Rambach, Evgeny Levinkov, Thomas Brox, and Jan Hendrik Metzen. Test-time adaptation to distribution shift by confidence maximization and input transformation. arXiv preprint arXiv:2106.14999, 2021.
- [23] Zachary Nado, Shreyas Padhy, D Sculley, Alexander D’Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. arXiv preprint arXiv:2006.10963, 2020.
- [24] Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, and André Barreto. Deep reinforcement learning with plasticity injection. Advances in Neural Information Processing Systems, 36, 2024.
- [25] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In International conference on machine learning, pages 16888–16905. PMLR, 2022.
- [26] Ori Press, Steffen Schneider, Matthias Kümmerer, and Matthias Bethge. Rdumb: A simple approach that questions our progress in continual test-time adaptation. Advances in Neural Information Processing Systems, 36, 2024.
- [27] Ori Press, Ravid Shwartz-Ziv, Yann LeCun, and Matthias Bethge. The entropy enigma: Success and failure of entropy minimization. arXiv preprint arXiv:2405.05012, 2024.
- [28] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
- [29] Evgenia Rusak, Steffen Schneider, George Pachitariu, Luisa Eck, Peter Gehler, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. If your data distribution shifts, use self-learning. arXiv preprint arXiv:2104.12928, 2021.
- [30] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. Advances in neural information processing systems, 33:11539–11551, 2020.
- [31] Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. In International Conference on Machine Learning, pages 32145–32168. PMLR, 2023.
- [32] Junha Song, Jungsoo Lee, In So Kweon, and Sungha Choi. Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11920–11929, 2023.
- [33] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
- [34] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.
- [35] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7472–7481, 2018.
- [36] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.
- [37] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
- [38] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [39] Maorong Wang, Nicolas Michel, Ling Xiao, and Toshihiko Yamasaki. Improving plasticity in online continual learning via collaborative learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23460–23469, 2024.
- [40] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022.
- [41] Yanshuo Wang, Jie Hong, Ali Cheraghian, Shafin Rahman, David Ahmedt-Aristizabal, Lars Petersson, and Mehrtash Harandi. Continual test-time domain adaptation via dynamic sample selection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1701–1710, 2024.
- [42] Jason D Zevin and Mark S Seidenberg. Age of acquisition effects in word reading and other tasks. Journal of Memory and Language, 47(1):1–29, 2002.
- [43] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pages 289–305, 2018.
- [44] Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5982–5991, 2019.