Towards Understanding Multi-Task Learning (Generalization) of LLMs via Detecting and Exploring Task-Specific Neurons

Yongqi Leng and Deyi Xiong
College of Intelligence and Computing, Tianjin University, Tianjin, China
{lengyq,dyxiong}@tju.edu.cn
Corresponding author

Abstract

While large language models (LLMs) have demonstrated superior multi-task capabilities, understanding the learning mechanisms behind this is still a challenging problem. In this paper, we attempt to understand such mechanisms from the perspective of neurons. Specifically, we detect task-sensitive neurons in LLMs via gradient attribution on task-specific data. Through extensive deactivation and fine-tuning experiments, we demonstrate that the detected neurons are highly correlated with the given task, which we term as task-specific neurons. With these identified task-specific neurons, we delve into two common problems in multi-task learning and continuous learning: Generalization and Catastrophic Forgetting. We find that the overlap of task-specific neurons is strongly associated with generalization and specialization across tasks. Interestingly, at certain layers of LLMs, there is a high similarity in the parameters of different task-specific neurons, and such similarity is highly correlated with the generalization performance. Inspired by these findings, we propose a neuron-level continuous fine-tuning method that only fine-tunes the current task-specific neurons during continuous learning, and extensive experiments demonstrate the effectiveness of the proposed method. Our study provides insights into the interpretability of LLMs in multi-task learning.

Yongqi Leng and Deyi Xiong ^†^†thanks: Corresponding author College of Intelligence and Computing, Tianjin University, Tianjin, China {lengyq,dyxiong}@tju.edu.cn

1 Introduction

The advent and development of LLMs have marked an significant milestone in natural language processing (Brown et al., 2020; Touvron et al., 2023; Achiam et al., 2023). LLMs perform instruction tuning on a wide range of tasks (Wei et al., 2021), exhibiting superior capabilities across multiple tasks and even being able to generalize to unseen tasks (Sanh et al., 2021). Despite their effectiveness, the multi-task learning mechanisms of LLMs remain as an open question.

Previous studies have demonstrated the existence of language-related neurons in multilingual large language models (MLLMs), and these neurons have been analyzed to explore the multilingual learning mechanisms of MLLMs (Tang et al., 2024; Chen et al., 2024b). In contrast, research into the multi-task learning mechanisms of LLMs remains limited. We argue that multilingual learning is essentially a type of multi-task learning as well. Inspired by these studies and thinking analogously, we attempt to ask three questions: (1) Do task-related neurons exist in LLMs, from a broad perspective? (2) If they exist, can they facilitate the understanding of the multi-task learning mechanisms in LLMs? And (3) can we improve LLMs by exploring such neurons?

In order to answer these questions, we perform neuronal analysis for LLMs. First, we identify neurons that are highly correlated with a given task by the gradient attribution method (Simonyan et al., 2013). Subsequently, we conduct fine-tuning and deactivation experiments on these neurons, to analyze their impact on the performance of the given task. Results of extensive experiments show that task-related neurons are indeed present in LLMs and they are highly correlated with specific tasks. We hence term them as task-specific neurons.

With identified task-specific neurons, we delve into two problems in multi-task learning and continuous learning: Generalization and Catastrophic Forgetting. A well-developed deep learning system should have less forgetfulness about learned tasks, as well as a good ability to generalize to unseen tasks (Rish, 2021). Therefore, we believe that analyzing these two problems in depth will contribute to enhance our further understanding of multi-task learning mechanisms in LLMs.

For this, we control the proportion of fine-tuned task-specific neurons to investigate generalization across tasks. We find that the overlap of task-specific neurons among different tasks is strongly correlated with generalization across these tasks, with higher overlap leading to higher generalization. However, in some cases, this overlap does not lead deterministically to generalization, since generalization is complex in nature, rather than a one-factor outcome. In addition to this, we find that at certain layers of LLMs, there is a high similarity between other task-specific neuron parameters and the task-specific neuron parameters of the task to be generalized, which suggests that LLMs learn to share knowledge between tasks, and that this similarity is highly correlated with the generalization results.

In the analysis of generalization, we not only observe the generalization across tasks, but also find that multi-task learning affects the performance of single-task specialization, which is caused by parameter interference between tasks. However, the cause of catastrophic forgetting is also parameter interference. Based on this, we propose a Neuron-level Continuous Fine-Tuning method (NCFT). Experimental results on two continuous learning benchmarks show that NCFT is capable of effectively mitigating catastrophic forgetting.

In summary, the main contributions of our study are as follows:

•

We discover task-specific neurons in LLMs empirically through extensive experiments.
•

We provide significant insights into generalization across tasks with our task-specific neuron analysis.
•

We propose a neuron-level continuous learning fine-tuning method for mitigating catastrophic forgetting, and experiments demonstrate its effectiveness.

2 Related Work

Neuronal Interpretability

With the development of LLMs, neuronal interpretability has gained much attention in recent years (Sajjad et al., 2022; Luo and Specia, 2024). Existing researches include knowledge storage (Dai et al., 2021), task solving (Wang et al., 2022), sentiment analysis (Radford et al., 2017), privacy preservation (Chen et al., 2024a; Wu et al., 2023a), and model editing (Gu et al., 2023). In MLLMs, studies find the existence of language-related neurons and utilize neuronal analysis to reveal the multilingual mechanisms of MLLMs (Tang et al., 2024; Chen et al., 2024b; Zhao et al., 2024), which greatly contributes to the understanding of MLLMs. In contrast, limited studies are conducted on the neuronal analysis in multi-task learning in LLMs. We hence extend this line of research from multilingual learning to multi-task learning.

Cross-task Generalization

Wei et al. (2021) find that LLMs have excellent zero-shot performance after multi-task fine-tuning, which motivates further investigation into cross-task generalization in depth (Hupkes et al., 2022; Grosse et al., 2023). Existing studies have shown that model size (Wei et al., 2021), number of tasks (Sanh et al., 2021), and data quality (Zhou et al., 2024) all affect the performance of generalization, which illustrates that generalization is affected by a variety of factors. There are also some studies that aim to improving the generalization ability of LLMs, such as step-by-step instruction tuning (Wu et al., 2023b) and hierarchical curriculum learning training strategy (Huang et al., 2024). In addition to this, Yang et al. (2024) conduct an empirical study to investigate generalization between tasks at a fine-grained level. Compared to the above studies, we focus more on the provenance of the generalization phenomenon after instruction tuning, and we analyze task-specific neurons to interpret generalization.

Catastrophic Forgetting

Consistent with previous works (Ke and Liu, 2022; Wang et al., 2024), we categorize continuous learning methods into three classes. (1) Rehearsal-based methods mitigate forgetting by replaying data from previous tasks (Su et al., 2019). (2) Regularization-based methods add explicit regularization terms so that knowledge of previous tasks is retained during continuous training (Aljundi et al., 2018). (3) Parameter isolation-based methods assign task-specific parameters to new tasks, thereby reducing interference between tasks (Razdaibiedina et al., 2023; Wang et al., 2023). Our proposed NCFT method follows the philosophy of parameter isolation continuous learning, but unlike prior works, we do not need to introduce additional parameters and also consider the correlation between tasks.

Refer to caption — Figure 1: Illustration of our research methodology. The entire framework consists of three components: Identification (task-specific neurons), Understanding (multi-task learning mechanisms of LLMs from the neuron level) and Exploration (neuron-level continuous fine-tuning method). The first component provides tools for mechanism understanding which in turn provides insights for the third component Exploration.

3 Methodology

Figure 1 illustrates our proposed methodology. First, we compute task relevance scores for all neurons using the gradient attribution method. Based on these scores, we assign neurons to specific tasks to identify task-specific neurons. Next, we analyze these identified neurons both quantitatively and qualitatively to gain insights into the multi-task learning mechanisms of LLMs. Finally, capitalizing on our analysis of task-specific neurons, we propose a neuron-level continuous fine-tuning method designed to mitigate catastrophic forgetting in LLMs.

3.1 Identifying Task-Specific Neurons in LLMs

To identify neurons highly relevant to a specific task, it is essential to determine the relevance of each neuron to task-specific data. Drawing inspiration from importance-based neuron fine-tuning studies (Xu et al., 2024) and neuronal interpretability research (Tang et al., 2024), we employ the gradient attribution method to quantify each neuron’s relevance score for a given task.

First, we need to clarify what we define as a neuron. Currently, the dominant architecture for LLMs is the auto-regressive transformer, in which the basic modules are multi-head self-attention (MHA) and feed-forward network (FFN). Here, we focus only on FFN, which have been demonstrated to store a large amount of parametric knowledge (Dai et al., 2021).

The FFN module at layer $i$ can be formulated as:

\boldsymbol{h}^{i}=f(\boldsymbol{\tilde{h}}^{i}\boldsymbol{W}_{1}^{i})\cdot\boldsymbol{W}_{2}^{i}

(1)

where $\boldsymbol{\tilde{h}}^{i}\in\mathbb{R}^{d}$ denotes the output of the MHA module in layer $i$ , which is also the input of the current FFN module. $\boldsymbol{h}^{i}\in\mathbb{R}^{d}$ denotes the output of the current FFN module. $\boldsymbol{W}^{i}_{1}\in\mathbb{R}^{d\times 4d}$ and $\boldsymbol{W}^{i}_{2}\in\mathbb{R}^{4d\times d}$ are the parameters, and $f$ is the activation function.

A neuron is defined as a column in $\boldsymbol{W}^{i}_{1}$ or $\boldsymbol{W}^{i}_{2}$ . Subsequently, we define the relevance score $\mathcal{R}^{i}_{j}$ of the $j$ -th neuron in the $i$ -th layer to a certain task:

\mathcal{R}^{i}_{j}=|\Delta\mathcal{L}(\boldsymbol{\omega}^{i}_{j})|

(2)

where $\boldsymbol{\omega}^{i}_{j}$ is the output of the $j$ -th neuron in the $i$ -th layer, and $\Delta\mathcal{L}(\boldsymbol{\omega}^{i}_{j})$ is the change in loss between setting $\boldsymbol{\omega}^{i}_{j}$ to 0 and keeping its original value. It can be converted to the following form by Taylor Expansion (see Appendix A.1 for detailed proof):

\mathcal{R}^{i}_{j}=\left|\Delta\mathcal{L}(\boldsymbol{\omega}^{i}_{j})\right|=\left|\frac{\partial{\mathcal{L}}}{\partial{\boldsymbol{\omega}^{i}_{j}}}\boldsymbol{\omega}^{i}_{j}\right|

(3)

Subsequently, we take the neurons with the top $k\%$ relevance scores for the current task as task-specific neurons, where $k$ is a predefined hyper-parameter.

3.2 Understanding Multi-Task Learning in LLMs by Analyzing Task-Specific Neurons

Method \ Task-CLS	AmazonFood	SST-2	QQP	Paws	MNLI	GPTNLI	Avg.
Original	91.8	92.4	83.2	91.6	84.8	82.4	87.7
Deactivate-Random	90.6	91.2	79.8	87.6	80.5	79.3	84.8
Deactivate-Task	83.6	84.6	72.8	70.2	73.3	71.4	76.0
Method \ Task-GEN	Sciqa	Tweetqa	E2E	CommonGen	CNN/DailyMail	XSum	Avg.
Original	54.3	45.6	52.6	49.8	34.7	36.8	45.6
Deactivate-Random	50.8	41.3	48.7	47.3	31.3	34.4	42.3
Deactivate-Task	33.6	29.3	39.6	37.8	25.5	26.3	32.0

Table 1: Performance of Llama-2-7b after task-specific neurons deactivation or without deactivation in each task. “Original” is the performance after fine-tuning with multi-task data without any neurons being deactivated. “Deactivate-Task” indicates deactivation of task-specific neurons. “Deactivate-Random” indicates that the same number of neurons are randomly selected for deactivation. Task-CLS: Classification Task. Task-GEN: Generation Task.

Method \ Task-CLS	AmazonFood	SST-2	QQP	Paws	MNLI	GPTNLI	Avg.
Zero-shot	85.2	78.3	42.1	46.5	35.3	32.4	53.3
Train-Random	85.5	80.3	45.6	47.8	34.7	34.8	54.8
Train-Task	88.5	87.8	79.2	84.8	82.5	76.3	83.2
Method \ Task-GEN	Sciqa	Tweetqa	E2E	CommonGen	CNN/DailyMail	XSum	Avg.
Zero-shot	21.3	6.9	36.5	26.8	14.7	12.3	19.8
Train-Random	22.8	11.8	37.4	29.6	17.7	15.8	22.5
Train-Task	45.3	37.1	42.7	36.8	29.8	30.3	37.0

Table 2: Performance of Llama-2-7b after fine-tuning task-specific neurons and under the zero-shot setting. “Train-Task” indicates training task-specific neurons. “Train-Random” indicates that the same number of neurons are randomly selected for training. Task-CLS: Classification Task. Task-GEN: Generation Task.

Once the presence of task-specific neurons is established, we proceed to analyze these neurons to understand the multi-task learning mechanisms of LLMs. First, we fine-tune varying proportions of task-specific neurons to study the impact on cross-task generalization and single-task specialization, exploring multi-task learning from a quantitative perspective. Additionally, we analyze the similarity between task-specific neuron parameters across different tasks, which encapsulate task-specific knowledge. In doing so, we aim to understand the provenance of generalization, thus revealing the multi-task learning mechanisms from a qualitative perspective.

In quantitative analysis, we set up different neuron proportions to investigate the trends in specialization and generalization. During fine-tuning, only the neurons specific to the current training task are trained. We use the results on the test set of the training task (in-domain, ID) to denote specialization performance, while the results on the test sets of other tasks (out-of-domain, OOD) to denote generalization performance.

In qualitative analysis, we compute the task-specific neuron parameters cosine similarity within a model between the task used to train that model and test task, and we study how this similarity varies across different layers of the model, aiming to investigate knowledge transfer between the test task and training task. In addition to this, we also compute the correlation coefficient between this parameter similarity and the performance on the corresponding test set, aiming to further demonstrate the association between parameter similarity and generalization.

3.3 Exploring Task-Specific Neurons to Mitigate Catastrophic Forgetting of LLMs

Through the analysis of neurons, we find that while multi-task learning can effectively handle multiple tasks, it does not necessarily achieve optimal performance on a single task (see Section 5.1). This is due to parameter interference between tasks. Similarly, catastrophic forgetting is also caused by parameter interference between tasks. Inspired by this correlation, we propose that isolating task-specific neuron parameters during continuous training might mitigate catastrophic forgetting. In order to substantiate this, we introduce a neuron-level continuous fine-tuning method aimed at mitigating catastrophic forgetting in continuous learning.

Given a sequence of tasks ${D_{1},\cdots,D_{N}}$ , the tasks arrive sequentially in the order of the task sequence during the training stage. For the current task $D_{n}$ , we update only the neuron-specific parameters of the current task, while keeping the other parameters frozen. During the test stage, the inference is executed as usual. We refer to this approach as Neuron-level Continuous Fine-Tuning (NCFT). This method isolates parameters for different tasks during training but maintains the original inference process. To better utilize the task-specific parameters of the already trained tasks, we propose using task similarity to weight different task-specific neurons during inference. We refer to this approach as Weighted Neuron-level Continuous Fine-Tuning (W-NCFT), more details of which are provided in Appendix A.2.

4 Experiments: Identifying Task-Specific Neurons

In this section, we conducted two groups of experiments to examine the existence of task-specific neurons as defined in Section 3.1.

4.1 Experimental Setup

In the first group of experiments, we deactivated task-specific neurons to conduct deactivation experiments. Specifically, the deactivation was achieved by setting the activation value of these neurons to zero or by directly setting the corresponding parameter to zero. In the second group of experiments, we fine-tuned the task-specific neurons to carry out fine-tuning experiments. Particularly, only task-specific neurons were updated with parameters and other neurons were frozen during training. For both groups of experiments, we set the hyper-parameter $k=10$ .

We tested two open-source models that perform well on multi-tasks, including Llama-2-7b (Touvron et al., 2023) and Bloom-7b1 (Le Scao et al., 2023). The dataset and evaluation metrics can be found in Appendix A.3.

4.2 Results

Table 1 shows the results of the deactivation experiments. Despite deactivating only $10\%$ task-specific neurons, it has a large negative impact on task-specific processing capacity. In contrast, deactivating the same number of randomly selected neurons resulted in a small impact.

To bolster the dependability of task-specific neurons, we conducted additional fine-tuning experiments. As shown in Table 2, the fine-tuning approach to task-specific neurons yields remarkable improvements compared to the approach of fine-tuning randomly selected neurons ( $29.9$ vs $1.5$ in classification tasks while $17.2$ vs $2.7$ in generation tasks). These improvements remain consistent across both task categories (classificatiton and generation). The only task where the improvement is not significant is AmazonFood, since it has a good enough zero-shot result. Appendix A.4 presents results for Bloom-7b1, which demonstrate the same trend.

In summary, we find that the effects of fine-tuning and perturbing task-specific neurons are more significant than those of randomly selected neurons. Consequently, we can empirically assert the presence of task-specific neurons within LLMs.

5 Experiments: Analyzing Task-Specific Neurons to Interpret Generalization

We analyzed task-specific neurons to understand the multi-task learning mechanisms of LLMs. Based on the analytical approach of Section 3.2, we conducted two sets of experiments, qualitative and quantitative, on various training-test combinations listed in Appendix A.5.

5.1 Proportion of Task-Specific Neurons

We controlled the proportion of fine-tuned task-specific neurons to conduct experiments on the various training-test combinations listed in Appendix A.5. Figure 2 shows results for all training-test combinations. In each subfigure, we focus only on the trend of each color line. Comparisons between different color lines are meaningless because they represent different tasks.

Specialization.

As the proportion of trained task-specific neurons increases, the specialization performance (see Section 3.2 for definition) for both classification and generation tasks first ascends and then declines, reaching its peak at $70\%$ for the classification task (blue line in Figure 2 (a)) and at $50\%$ for the generation task (purple line in Figure 2 (b)). This is contrary to our intuition that under normal circumstances, better results should be obtained as more task-specific neurons are trained. We analyzed the reason behind this, which could stem from the parameter interference between different tasks induced by simultaneous training of three tasks. This interference further results in the specialization performance of a single task not exhibiting a continuous improvement as more parameters are trained. To corroborate this, we conducted ablation experiments. Specifically, we trained a model for each task, meaning that the fine-tuning of task-specific neurons was conducted individually. Results are shown in Appendix A.6, wherein we observe a continuous enhancement in performance as the proportion of neurons increases, thus validating our analysis.

Generalization.

As the proportion of trained task-specific neurons increases, we find a continuous increasing trend for the performance of generalization from the trained classification tasks to other classification tasks (orange line in Figure 2 (a)). Similarly, the performance of generalization from the trained generation tasks to other classification tasks (orange line in Figure 2 (b)) and from the trained generation tasks to other generation tasks (green line in Figure 2 (b)) shows the same trend. The overlap rate of task-specific neurons between the training and test tasks can be found in Appendix A.7, where it becomes evident that as the proportion of trained task-specific neurons increases, the overlap rate also experiences a significant surge. Consequently, one plausible explanation is that the overlap of task-specific neurons contributes to transfer learning between tasks, ultimately resulting in consistently higher generalization performance. However, no generalization is produced from the trained classification tasks to other generation tasks (green line in Figure 2 (a)), and the test results are similar to the zero-shot results in Table 2. The reason for no generalization observed from classification to generation might be that classification tasks are usually easier that generation tasks as they only need to predict a single label. In contrast, generation tasks need to generate consecutive texts that satisfy the task requirements, which is relatively harder. This observation is consistent with that observed by Yang et al. (2024).

Trade-offs between Specialization and Generalization.

The red line in Figure 2 indicates the average of the performance of generalization and specialization. We find that when the proportion of fine-tuned neurons is $70\%$ , the best trade-off between generalization and specialization can be achieved in both experimental settings.

In summary, our findings reveal that when training all parameters of the model under the multi-task learning setup, inevitable interference among tasks occurs, thereby diminishing the efficacy of individual tasks to some degree. Furthermore, our experiments illustrate the efficacy of controlling the appropriate proportion of fine-tuned task-specific neurons as a promising strategy. Additionally, we observe a significant correlation between the overlap of task-specific neurons and generalization performance across tasks. However, this overlap does not always guarantee deterministic generalization, as numerous factors also play pivotal roles. These comprehensive analyses serve to enrich our comprehension on generalization.

5.2 Parameters of Task-Specific Neurons

We evaluated the similarity of specific neuron parameters for the training and test tasks (see Section 3.2 for the way to calculate the similarity) aiming to conduct a qualitative analysis of generalization provenance. We trained a separate model (full-parameter training) for each of the six training tasks in the training-test combination in Appendix A.5, denoted as $M_{1},\cdots,M_{6}$ . We then tested these models on the six out-of-domain test tasks listed in that combination, denoted as $T_{1},\cdots,T_{6}$ . In a particular layer, for model $M_{i}$ and test task $T_{j}$ , $\boldsymbol{P}^{i}_{i}$ and $\boldsymbol{P}^{i}_{j}$ are used to denote the task-specific neuron parameters of training task $i$ and test task $j$ in $M_{i}$ , respectively. Then, we calculated the cosine similarity between $\boldsymbol{P}^{i}_{i}$ and $\boldsymbol{P}^{i}_{j}$ . For test task $T_{j}$ , testing across the six trained models provides six similarity measures. We computed the average of these similarities and then investigated how this average similarity varies across different layers of the model, aiming to show knowledge transfer to the test task $T_{j}$ . Figure 3 illustrates the similarity of the different layers for three different settings.

Parameter Similarity on Classification Tasks.

Figure 3 (a) shows how the parameter similarity across three classification test tasks. We find that at the bottom layer, the similarity remains notably low. When reaching a certain layer depth, similarity starts to gradually increase. Finally, the similarity drops again to the value close to that at the bottom layer. This observation holds for all three classification tasks. This illustrates that a model learns the shared knowledge between tasks only after a certain number of layers. In this aspect, knowledge transfer occurs, thus contributing to generalization. Chatterjee et al. (2024) provide similar findings in cross-task in-context learning to ours, which show that information transfer across tasks occurs only after a certain layer depth is reached. Although their findings are based on in-context learning, in-context learning can be understood as a form of implicit training without parameter updates (Akyürek et al., 2022; Von Oswald et al., 2023). We consider these findings resonate with each other.

Method	Order-1	Order-2	Order-3	Avg.	Order-4	Order-5	Order-6	Avg.
SeqFT	46.4	47.3	47.5	47.1	35.6	34.8	33.5	34.6
SeqLoRA	53.6	54.8	53.1	53.8	47.9	49.5	45.7	47.7
NCFT	71.3	70.9	71.6	71.3	70.5	68.3	71.2	70.0
W-NCFT	73.7	72.3	73.8	73.3	73.4	70.1	72.6	72.0
Per-Task FT	77.2	77.2	77.2	77.2	84.5	84.5	84.5	84.5

Table 3: Results on two continual learning benchmarks. The average accuracy after training on the last task is reported.

Parameter Similarity on Generation Tasks.

However, on the three generation test tasks in Figure 3 (b), we find no such trend. In Section 5.1, we have previously found that it is difficult to generalize from classification tasks to generation tasks. Therefore, we conjecture that the absence of the expected observation in Figure 3 (b) is due to the fact that the six training models used include three models trained with classification tasks, which do not have good parameter similarity within these three models. In turn, after averaging the parameter similarity, lower values appear. To substantiate this conjecture, we tested again using three of the six models trained with generation tasks. Results are shown in Figure 3 (c), and the overall trend is similar to that observed in Figure 3 (a). Only the layer depths where the similarity rises differ, which indicates that the location where knowledge transfer occurs varies across tasks. At the same time, this confirms our conjecture.

Parameter Similarity and Generalization.

We further investigated the relationship between the similarity of task-specific neuron parameters and generalization performance. For each test task, we used six models. We then calculated the similarity in each model between the specific neuron parameters of that test task and the specific neuron parameters of the training task used by that model. Finally, we calculated the correlation coefficients between these parameter similarities and the predictions of the six models. Experiment results are reported in Appendix A.8, from which we find that the similarity is highly correlated with the generalization performance.

In summary, our findings suggest a correlation between the generalization across different tasks and the similarity of task-specific neuron parameters. When layers after a certain depth are reached, the model can learn shared knowledge between tasks, which contributes to the generalization across these tasks. Additionally, higher parameter similarity corresponds to better generalization performance. Our conclusions provide a guideline for improving generalization performance across tasks.

6 Experiments: Fine-tuning Task-specific Neurons to Mitigate Catastrophic Forgetting

We finally conducted experiments on the two benchmarks of continuous learning so as to test the effectiveness of the NCFT and W-NCFT methods described in Section 3.3.

6.1 Experimental Setup

Model and Datasets

We used Llama-2-7b as the model for experiments. We used two continuous learning benchmarks, Standard CL Benchmark and Large Number of Tasks Benchmark (Razdaibiedina et al., 2023), and tested different task orders. Details on the datasets and task order can be found in Appendix A.9 and A.10.

Metrics and Baselines

We used continuous learning performance and forgetting rate as evaluation metrics. The way for calculating the two metrics can be found in Appendix A.11. The baselines can be found in Appendix A.12.

6.2 Results and Analysis

As shown in Table 3, our proposed method achieves significant improvements compared to the baselines across both task benchmarks. Such improvements are consistent across various sequences of tasks, illustrating the effectiveness and robustness of our approach. Additionally, we find that W-NCFT outperforms NCFT, suggesting that weighting different task-specific parameters based on their similarity enhances the performance of continuous learning. Figure 4 illustrates the forgetting rate across eight stages on the Large Number of Tasks benchmark, and we can find that both NCFT and W-NCFT methods substantially mitigate catastrophic forgetting.

It is worth noting that although our proposed method effectively mitigates catastrophic forgetting, it still has some shortcomings. As shown in Figure 4, there remains a gap between the performance of the NCFT and W-NCFT methods and that of Per-Task FT. This indicates that catastrophic forgetting has not been entirely resolved. Additionally, W-NCFT employs task similarity to weight the parameters, which is a static approach. A dynamic weighting method, applied during continuous training, could potentially yield better results. Nevertheless, it is undeniable that this empirical study demonstrates the effectiveness of the task-specific parameter isolation approach in mitigating catastrophic forgetting.

7 Conclusion

In this study, we have presented a methodology framework for understanding multi-task learning and cross-task generalization of LLMs from the perspective of neurons. With this framework, we have conducted an extensive analysis of LLMs to identify task-specific neurons that are highly correlated with specific tasks. Using these task-specific neurons, we have investigated two common problems of LLMs in multi-task learning and continuous learning: generalization and catastrophic forgetting. Our findings indicate that the overlap of task-specific neurons is strongly associated with generalization. Furthermore, we find that the parameter similarity of these neurons reflects the degree of knowledge sharing, contributing to generalization. Additionally, we propose a neuron-level continuous fine-tuning method to effectively mitigate catastrophic forgetting. The proposed method only fine-tuning the current task-specific neurons in continuous learning, and experimental results in two continuous learning benchmarks demonstrate the effectiveness of our method.

Limitations

Our analysis is based on the identification of neurons. In the identification experiments, we did not conduct a detailed analysis on the hyper-parameters, but only used empirical values. However, we believe that it is crucial to identify neurons more accurately, as this may better utilize neurons for these specific tasks. Additionally, our analysis of generalization is currently conducted on only classification and generation tasks. There is a need to extend this analysis to a broader range of tasks. We plan to address these more detailed studies in our future work.

Ethics Statement

This study adheres to the ethical guidelines set forth by our institution and follows the principles outlined in the ACM Code of Ethics and Professional Conduct. All datasets used in our experiments are publicly available.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Akyürek et al. (2022) Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. 2022. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661.
Aljundi et al. (2018) Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Chatterjee et al. (2024) Anwoy Chatterjee, Eshaan Tanwar, Subhabrata Dutta, and Tanmoy Chakraborty. 2024. Language models can exploit cross-task in-context learning for data-scarce novel tasks. arXiv preprint arXiv:2405.10548.
Chen et al. (2024a) Ruizhe Chen, Tianxiang Hu, Yang Feng, and Zuozhu Liu. 2024a. Learnable privacy neurons localization in language models. arXiv preprint arXiv:2405.10989.
Chen et al. (2024b) Yuheng Chen, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. 2024b. Journey to the center of the knowledge neurons: Discoveries of language-independent knowledge neurons and degenerate knowledge neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17817–17825.
Dai et al. (2021) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.
Dušek et al. (2020) Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the state-of-the-art of end-to-end natural language generation: The e2e nlg challenge. Computer Speech & Language, 59:123–156.
Grosse et al. (2023) Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. 2023. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.
Gu et al. (2023) Jian Gu, Chunyang Chen, and Aldeida Aleti. 2023. Neuron patching: Neuron-level model editing on code generation and llms. arXiv preprint arXiv:2312.05356.
Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in neural information processing systems, 28.
Huang et al. (2024) Yuncheng Huang, Qianyu He, Yipei Xu, Jiaqing Liang, and Yanghua Xiao. 2024. Laying the foundation first? investigating the generalization from atomic skills to complex reasoning tasks. arXiv preprint arXiv:2403.09479.
Hupkes et al. (2022) Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, et al. 2022. State-of-the-art generalisation research in nlp: a taxonomy and review. arXiv preprint arXiv:2210.03050.
Ke and Liu (2022) Zixuan Ke and Bing Liu. 2022. Continual learning of natural language processing tasks: A survey. arXiv preprint arXiv:2211.12701.
Keung et al. (2020) Phillip Keung, Yichao Lu, György Szarvas, and Noah A Smith. 2020. The multilingual amazon reviews corpus. arXiv preprint arXiv:2010.02573.
Le Scao et al. (2023) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model.
Lin et al. (2019) Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. 2019. Commongen: A constrained text generation challenge for generative commonsense reasoning. arXiv preprint arXiv:1911.03705.
Luo and Specia (2024) Haoyan Luo and Lucia Specia. 2024. From understanding to utilization: A survey on explainability for large language models. arXiv preprint arXiv:2401.12874.
Narayan et al. (2018) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
Radford et al. (2017) Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.
Razdaibiedina et al. (2023) Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. 2023. Progressive prompts: Continual learning for language models. arXiv preprint arXiv:2301.12314.
Rish (2021) Irina Rish. 2021. Continual learning with deep architectures.
Sajjad et al. (2022) Hassan Sajjad, Nadir Durrani, and Fahim Dalvi. 2022. Neuron-level interpretation of deep nlp models: A survey. Transactions of the Association for Computational Linguistics, 10:1285–1303.
Sanh et al. (2021) Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
Scialom et al. (2022) Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. 2022. Fine-tuned language models are continual learners. arXiv preprint arXiv:2205.12393.
Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
Su et al. (2019) Xin Su, Shangqi Guo, Tian Tan, and Feng Chen. 2019. Generative memory for lifelong learning. IEEE transactions on neural networks and learning systems, 31(6):1884–1898.
Tang et al. (2024) Tianyi Tang, Wenyang Luo, Haoyang Huang, Dongdong Zhang, Xiaolei Wang, Xin Zhao, Furu Wei, and Ji-Rong Wen. 2024. Language-specific neurons: The key to multilingual capabilities in large language models. arXiv preprint arXiv:2402.16438.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Von Oswald et al. (2023) Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2023. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR.
Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Wang et al. (2024) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. 2024. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wang et al. (2022) Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, and Juanzi Li. 2022. Finding skill neurons in pre-trained transformer-based language models. arXiv preprint arXiv:2211.07349.
Wang et al. (2023) Zhicheng Wang, Yufang Liu, Tao Ji, Xiaoling Wang, Yuanbin Wu, Congcong Jiang, Ye Chao, Zhencong Han, Ling Wang, Xu Shao, et al. 2023. Rehearsal-free continual language learning via efficient parameter isolation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10933–10946.
Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Welbl et al. (2017) Johannes Welbl, Nelson F Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209.
Williams et al. (2017) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
Wu et al. (2023a) Xinwei Wu, Junzhuo Li, Minghui Xu, Weilong Dong, Shuangzhi Wu, Chao Bian, and Deyi Xiong. 2023a. Depn: Detecting and editing privacy neurons in pretrained language models. arXiv preprint arXiv:2310.20138.
Wu et al. (2023b) Yang Wu, Yanyan Zhao, Zhongyang Li, Bing Qin, and Kai Xiong. 2023b. Improving cross-task generalization with step-by-step instructions. arXiv preprint arXiv:2305.04429.
Xie et al. (2021) Wanying Xie, Yang Feng, Shuhao Gu, and Dong Yu. 2021. Importance-based neuron allocation for multilingual neural machine translation. arXiv preprint arXiv:2107.06569.
Xiong et al. (2019) Wenhan Xiong, Jiawei Wu, Hong Wang, Vivek Kulkarni, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. Tweetqa: A social media focused question answering dataset. arXiv preprint arXiv:1907.06292.
Xu et al. (2024) Haoyun Xu, Runzhe Zhan, Derek F Wong, and Lidia S Chao. 2024. Let’s focus on neuron: Neuron-level supervised fine-tuning for large language model. arXiv preprint arXiv:2403.11621.
Yang et al. (2024) Haoran Yang, Yumeng Zhang, Jiaqi Xu, Hongyuan Lu, Pheng Ann Heng, and Wai Lam. 2024. Unveiling the generalization power of fine-tuned large language models. arXiv preprint arXiv:2403.09162.
Zhang et al. (2019) Yuan Zhang, Jason Baldridge, and Luheng He. 2019. Paws: Paraphrase adversaries from word scrambling. arXiv preprint arXiv:1904.01130.
Zhao et al. (2024) Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. 2024. How do large language models handle multilingualism? arXiv preprint arXiv:2402.18815.
Zhou et al. (2024) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36.

Appendix A Appendix

A.1 Taylor Expansion

We follow Xie et al. (2021) to provide the proof of Equation 3.

We adopt a criterion based on the Taylor Expansion, where we directly approximate the change in loss when removing a particular neuron. Let $\boldsymbol{\omega}^{i}_{j}$ be the output of the $j$ -th neuron in layer $i$ , and $\Omega$ represents the set of other neurons. Assuming the independence of each neuron in the model, the change of loss when removing the $j$ -th neuron in layer $i$ can be represented as:

\left|\Delta\mathcal{L}(\boldsymbol{\omega}^{i}_{j})\right|=\left|\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j}=0)-\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j})\right|

(4)

where $\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j}=0)$ is the loss value if the $j$ -th neuron in layer $i$ is pruned and $\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j})$ is the loss if it is not pruned. For the function $\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j})$ , its Taylor Expansion at $\boldsymbol{\omega}^{i}_{j}=0$ is:

\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j})=\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j}=0)+\frac{\partial{\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j})}}{\partial{\boldsymbol{\omega}^{i}_{j}}}\boldsymbol{\omega}^{i}_{j}+R_{1}({\boldsymbol{\omega}^{i}_{j}})

(5)

where $R_{1}({\boldsymbol{\omega}^{i}_{j}})$ can be ignored since the derivatives of the activation function of second order and higher in the model tend to be zero. So the above equation can be reduced to the following form:

\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j})\approx\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j}=0)+\frac{\partial{\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j})}}{\partial{\boldsymbol{\omega}^{i}_{j}}}\boldsymbol{\omega}^{i}_{j}

(6)

Therefore $\left|\Delta\mathcal{L}(\boldsymbol{\omega}^{i}_{j})\right|$ can eventually be simplified to the following form:

\left|\Delta\mathcal{L}(\boldsymbol{\omega}^{i}_{j})\right|\approx\left|\frac{\partial{\mathcal{L}(\Omega,\boldsymbol{\omega}^{i}_{j})}}{\partial{\boldsymbol{\omega}^{i}_{j}}}\boldsymbol{\omega}^{i}_{j}\right|

(7)

A.2 Details of W-NCFT Method

Assuming that the model has been trained on the previous $i$ tasks, when inference is executed on the $j$ -th task ( $j\leqslant i$ ), we calculate the similarity between task $j$ and the previous $i$ tasks. The similarity between any two tasks as follows:

\mathrm{sim}(x,y)=\frac{\boldsymbol{fea}_{x}\cdot\boldsymbol{fea}_{y}}{\left|\left|\boldsymbol{fea}_{x}\right|\right|\times\left|\left|\boldsymbol{fea}_{y}\right|\right|}

(8)

where $\boldsymbol{fea}_{task}\in\mathbb{R}^{d}$ is the task vector. We randomly select $1000$ samples for each task, and use the Llama-2-7b to compute the mean of the features in the last layer for each particular task sample, and finally take the mean of these sample features as a representation of the task vector.

Then, we get a similarity vector $(\mathrm{sim}_{j}^{1},\cdots,\mathrm{sim}_{j}^{i})$ , where $\mathrm{sim}_{j}^{k}$ is the similarity between task $j$ and task $k$ ( $1\leqslant k\leqslant i$ ). Finally, we conduct $\mathrm{Softmax}$ normalization:

\boldsymbol{Sim}_{j}=\mathrm{Softmax}(\mathrm{sim}_{j}^{1},\cdots,\mathrm{sim}_{j}^{i})

(9)

During inference, for the parameter matrix $\boldsymbol{W}$ of the FFN module in a particular layer of the model, we sequentially identify the task-specific neuron parameters (i.e., certain columns of $\boldsymbol{W}$ ) among the tasks previously trained, ranging from task $1$ to task $i$ , and allocate weights to this portion of parameters based on $\boldsymbol{Sim}_{j}$ as follows:

\boldsymbol{W}^{\prime}=\sum_{k=1}^{i}\boldsymbol{Sim}_{j}[k]\times\boldsymbol{W}_{task-k}

(10)

where $\boldsymbol{W}_{task-k}$ is the task-specific neuron parameter for the $k$ -th task, the summation notation $\sum$ indicates that combining the individual submatrices by columns, and $\boldsymbol{W}^{\prime}$ is the final weighted parameter matrix.

Subsequently, inference is conducted. We refer to this approach as Weighted Neuron-level Continuous Fine-Tuning (W-NCFT).

A.3 Datasets and Metrics for Identifying Neurons Experiments

According to task output forms, we tested two main types of tasks: classification and generation.

$\bullet$

For classification tasks, we chose three tasks. They are sentiment classification, including AmazonFood (Keung et al., 2020), SST-2 (Socher et al., 2013); paraphrase detection, including QQP (Wang et al., 2018), Paws (Zhang et al., 2019); and natural language inference, including MNLI (Williams et al., 2017), GPTNLI¹¹1https://huggingface.co/datasets/pietrolesci/gpt3_nli.
$\bullet$

For generation tasks, we chose three tasks. They are summary generation, including CNN/DailyMail (Hermann et al., 2015), Xsum (Narayan et al., 2018); question generation, including Sciqa (Welbl et al., 2017), Tweetqa (Xiong et al., 2019); and data-to-text generation, including E2E (Dušek et al., 2020), CommonGen (Lin et al., 2019).

We used accuracy to evaluate classification tasks and Rouge-L²²2https://huggingface.co/spaces/evaluate-metric/rouge to evaluate generation tasks.

A.4 Results of Identifying Neurons Experiments in Bloom-7b1

Table 4 shows the results of the deactivation experiments on Bloom-7b1 and Table 5 shows the results of the fine-tuning experiments on Bloom-7b1. We can find a more significant trend for fine-tuning and deactivation of task-specific neurons compared to randomly selected neurons, consistent with the observation in Llama-2-7b.

A.5 Training-Test Task Combination

Table 6 shows the training and test task combinations used to investigate generalization and specialization.

A.6 Ablation Experiments

Figure 5 shows the results of training and testing each task individually.

A.7 Overlap Rate

We calculate the overlap rate of task-specific neurons between the training tasks and test tasks as:

\mathrm{overlap}(x,y)=\frac{\mathcal{N}_{x}\cap\mathcal{N}_{y}}{\mathcal{N}_{x}\cup\mathcal{N}_{y}}

(11)

where $\mathcal{N}_{tasks}$ denotes the set of task-specific neurons.

Table 8 shows the overlap rate of task-specific neurons between the training tasks and test tasks. It is worth noting that for all training-test task combinations listed in Table 6, we use the overall set of task-specific neurons of three training tasks as $\mathcal{N}_{x}$ and the overall set of task-specific neurons of three test tasks as $\mathcal{N}_{y}$ .

A.8 Results of Correlation Coefficients

Table 7 shows the correlation coefficients between the similarity of task-specific neuron parameters and the generalization performance.

A.9 Benchmarks of Continuous Learning

Table 9 and Table 10 show the datasets included in the Standard CL Benchmark and Large Number of Tasks Benchmark, respectively. Note that the original Large Number of Tasks Benchmark have $15$ tasks, from which we select $8$ tasks to form a simplified version for our experiments.

A.10 Task Orders

Table 11 shows the task order sequence for the two continuous learning benchmarks.

A.11 Metrics for Catastrophic Forgetting Experiments

Let $a_{i,j}$ be the testing accuracy of the $i$ -th task after training on $j$ -th task, and $A_{i}$ denote the testing accuracy after training on task $i$ alone. The evaluation metrics are:

•

Performance on Continuous Learning (CL). The average accuracy of all tasks after training on the last task, is computed as:

$\mathrm{CL}=\frac{1}{N}\sum_{i=1}^{N}a_{i,N}$ (12)
•

Forgetting (FG). Following the evaluation metrics proposed by Scialom et al. (2022), we utilized relative gain to calculate the forgetting rate at different stages. The forgetting rate for the $j$ -th stage is calculated as:

$\mathrm{FG}_{j}=\frac{1}{j-1}\sum_{i=1}^{j-1}\frac{a_{i,j}}{A_{i}}\times 100\%$ (13)

A.12 Baselines for Catastrophic Forgetting Experiments

We used the following continual learning techniques as baselines:

•

SeqFT: training the entire model parameters on a sequence of tasks.
•

SeqLoRA: training fixed-size LoRA parameters on a sequence of tasks.
•

Per-Task FT: training a separate model for each task.

Method \ Task-CLS	AmazonFood	SST-2	QQP	Paws	MNLI	GPTNLI	Avg.
Original	90.6	91.2	81.8	91	80.3	79.5	85.7
Deactivate-Random	89.5	89.7	79.3	88.5	78.5	77.6	83.9
Deactivate-Task	80.3	83.5	71.2	82.3	70.6	69.5	76.2
Method \ Task-GEN	Sciqa	Tweetqa	E2E	CommonGen	CNN/DailyMail	XSum	Avg.
Original	53.8	41.8	54.5	45.6	31.8	33.2	43.5
Deactivate-Random	50.9	40.8	52.5	41.6	29.8	30.8	41.1
Deactivate-Task	34.7	30.6	41.8	32.3	20.7	21.5	30.3

Table 4: Performance of Bloom-7b1 after task-specific neurons deactivation or without deactivation in each task. “Original” is the performance after fine-tuning with multi-task data without any neurons being deactivated. “Deactivate-Task” indicates deactivation of task-specific neurons. “Deactivate-Random” indicates that the same number of neurons are randomly selected for deactivation. Task-CLS: Classification Task. Task-GEN: Generation Task.

Method \ Task-CLS	AmazonFood	SST-2	QQP	Paws	MNLI	GPTNLI	Avg.
Zero-shot	83.7	79.1	46.5	44.3	33.6	34.2	53.6
Train-Random	84.1	80.5	48.0	46.1	35.2	36.1	55.0
Train-Task	87.6	88.3	77.6	82.3	79.4	72.0	81.2
Method \ Task-GEN	Sciqa	Tweetqa	E2E	CommonGen	CNN/DailyMail	XSum	Avg.
Zero-shot	23.1	10.3	33.2	23.6	12.5	13.4	19.4
Train-Random	23.8	12.7	34.8	25.2	14.2	15.5	21.0
Train-Task	42.0	34.3	40.4	33.0	27.1	28.6	34.2

Table 5: Performance of Bloom-7b1 after fine-tuning task-specific neurons and under the zero-shot setting. “Train-Task” indicates training task-specific neurons. “Train-Random” indicates that the same number of neurons are randomly selected for training. Task-CLS: Classification Task. Task-GEN: Generation Task.

Group	Training Tasks	ID Test Tasks	OOD Test Tasks
(a)	Amazon, QQP, MNLI	Amazon, QQP, MNLI	SST-2, Paws, GPTNLI
(a)	Amazon, QQP, MNLI	Amazon, QQP, MNLI	Tweetqa, CommonGen, Xsum
(b)	Sciqa, E2E, CNN	Sciqa, E2E, CNN	SST-2, Paws, GPTNLI
(b)	Sciqa, E2E, CNN	Sciqa, E2E, CNN	Tweetqa, CommonGen, Xsum

Table 6: Experimental groups for exploring generalization and specialization. Results from the in-domain (ID) test set indicate generalization performance while results from the out-of-domain (OOD) test set indicate specialization performance. Four test set colors, corresponding to the legend in Figure 2. Amazon, QQP, MNLI corresponds to ID-CLS in the legend. Sciqa, E2E, CNN corresponds to ID-GEN in the legend. SST-2, Paws, GPTNLI corresponds to OOD-CLS in the legend. Tweetqa, CommonGen, Xsum corresponds to OOD-GEN in the legend.

Testset	SST-2		Paws		GPTNLI		Tweetqa		CommonGen		Xsum
Testset	r	p-value	r	p-value	r	p-value	r	p-value	r	p-value	r	p-value
PCCs	0.87	0.02	0.92	0.01	0.79	0.05	0.96	0.00	0.96	0.00	0.97	0.00
SROCC	0.81	0.05	0.77	0.07	0.81	0.05	0.77	0.07	0.83	0.04	0.71	0.11

Table 7: Correlation coefficients between the similarity of specific neuron parameters and generalization performance. PCCs denotes Pearson correlation coefficients and SROCC denotes Spearman correlation coefficients.

Group	10%	30%	50%	70%	100%
CLS-CLS	20.8	53.9	84.5	96.2	100
CLS-GEN	12.9	41.6	71.5	83.5	100
GEN-CLS	11.8	40.2	69.3	81.8	100
GEN-GEN	21.6	52.5	82.0	94.3	100

Table 8: The overlap rate of task-specific neurons between training tasks and test tasks when controlling the proportion of task-specific neurons. The overlap rate formula can be found in Appendix A.7.

Dataset	Class	Task Type	Domain
AGNews	4	Topic classification	News
Amazon	5	Sentiment anlysis	Amazon reviews
DBPedia	14	Topic classification	Wikipedia
Yahoo	10	Q&A	Yahoo Q&A

Table 9: Details of the Standard CL Benchmark.

Dataset	Class	Task Type	Domain
Amazon	5	Sentiment anlysis	Amazon reviews
DBPedia	14	Topic classification	Wikipedia
Yahoo	10	Q&A	Yahoo Q&A
AGNews	4	Topic classification	News
MNLI	3	NLI	various
QQP	2	Paragraph detection	Quora
RTE	2	NLI	news, Wikipedia
SST-2	2	Sentiment analysis	movie reviews

Table 10: Details of the simplified version Large Number of Tasks Benchmark.

Order

Task Sequence

DBPedia → Amazon → Yahoo → AGNews

DBPedia → Amazon → AGNews → Yahoo

Yahoo → Amazon → AGNews → DBPedia

MNLI → QQP → RTE → Amazon →

SST-2 → DBPedia → AGNews → Yahoo

Amazon → AGNews → Yahoo → QQP →

RTE → MNLI → DBPedia → SST-2

AGNews → Yahoo → SST-2 → RTE →

QQP → MNLI → DBPedia → Amazon

Table 11: Task order sequence for two continuous learning benchmarks.