When Parts Are Greater Than Sums:
Individual LLM Components Can Outperform Full Models
Abstract
This paper studies in-context learning by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do well on a classification task, even when the full model performs poorly; bad-performing ones that do much worse than chance; and label-biased components that always predict the same label. We find that component accuracies are well-correlated across different demonstration sets and perturbations of prompt templates. Based on our findings, we propose component reweighting, which learns to linearly re-scale the component activations from a few labeled examples. Given labeled examples, our method improves by an average of accuracy points over -shot ICL across 8 tasks on Llama-2-7B. Overall, this paper both enriches our understanding of ICL and provides a practical method for improvement by examining model internals.
Ting-Yun Chang, Jesse Thomason, Robin Jia
University of Southern California, Los Angeles, CA, USA
{tingyun, jessetho, robinjia}@usc.edu
1 Introduction
The rapid progress in large language models (LLMs) has popularized prompting, which guides LLMs to perform tasks with instructions or examples. Notably, in-context learning (ICL; Brown et al., 2020) adapts LLMs to a new task using only a few labeled examples without parameter updates. However, how LLMs react to the in-context examples is sometimes unintuitive (Min et al., 2022b). Recently, Sclar et al. (2024) and Voronov et al. (2024) find that even for instruction-tuned (Ouyang et al., 2022) or very large models, adding a space or newline in prompts can greatly affect accuracy.

We look into the LLM internals to understand what causes the surprising behavior across various ICL settings. Our work stands in contrast to prior studies, which often treat LLMs as black boxes and alter either the input (Chen et al., 2023; Bertsch et al., 2024) or output (Zhao et al., 2021; Holtzman et al., 2021). We introduce a new view of ICL by decomposing the output of an LLM into the sum of individual contributions of MLPs and attention heads, denoted “components.” Figure 1 reveals three types of curious components: good-performing ones (blue) that individually perform well or even outperform the full model, bad-performing ones (red) that perform below chance, and label-biased ones (green) that predict the same label on the entire test set. We observe these three classes of components on Llama-2-7B, Llama-2-13B (Touvron et al., 2023), Llama-3-8B (Dubey et al., 2024), and Mistral-Instruct-7B (Jiang et al., 2023) across 8 classification tasks.
We study the sensitivity of LLM components to multiple prompts formed by different demonstrations and templates. We also construct contrast sets of templates—pairs of similar templates that yield large differences in ICL accuracy. Despite large variance in full-model accuracy, we find that component accuracies correlate well across different demonstrations ( on average) and contrast set templates (). The top-performing components in contrast set pairs overlap and achieve decent accuracy even when the full model performs near random (Figure 2). Nonetheless, the component accuracies of two sampled templates are less correlated (). Further, good-performing components generalize well to out-of-distribution test sets. For instance, the top-1 component for MNLI outperforms the full Llama-2-13B model by on MedNLI; Figure 1 also shows that components are transferable from SST2 to Yelp. We conclude that components are relatively consistent in their behavior across prompts and datasets.
Inspired by our findings, we propose component reweighting. Compared to prior work that selects prompts from a large pool of labeled data to improve ICL accuracy (Liu et al., 2022b), component reweighting softly selects components by learning weights from few-shot examples to scale component activations. Training these weights only involves learning a linear layer, which takes less than a minute on one CPU. Overall, component reweighting better utilizes the same labeled examples, improving over -shot ICL by on Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B, respectively. At the same time, it enjoys a similar inference speed to 4-shot ICL.
Finally, we study the training dynamics of components using the Pythia pretraining checkpoints (Biderman et al., 2023). During pretraining, good-performing components emerge well before the full model performs well. These findings suggest that LLMs acquire the internal ability to perform ICL early in training, but this ability only surfaces in the full model’s behavior later on.
Overall, our work conducts extensive analysis of LLM internals, which motivates a practical method to improve ICL. We hope to inspire future work that further sheds light on LLM internals in order to improve performance. Our implementation is available at https://github.com/terarachang/LLMDecomp.
2 Decomposing the Transformer in ICL
We introduce a new view of in-context learning by decomposing the Transformer architecture (Vaswani et al., 2017). Our decomposition is exact—a mathematically equivalent formula for the model’s outputs—and enables us to analyze model internals without training additional parameters (unlike, e.g., probing). We first discuss what our new view offers over the standard view of ICL, and then walk through the mathematical details.
2.1 A New View of In-Context Learning
Standard view.
An LLM performs in-context learning (ICL) on a task based on a few demonstrations without training, where each demonstration is a templated example consisting of an input $x_i$ and a label word $y_i$. We refer to a sequence of demonstrations as a prompt. The LLM makes predictions on a test input $x_{\text{test}}$ conditioned on the prompt, denoted by $\arg\max_{y \in \mathcal{Y}} p_{\text{LM}}(y \mid \text{prompt}, x_{\text{test}})$, where $\mathcal{Y}$ is the set of possible label words in a classification task.
Our view.
The residual stream of an LLM directly carries the information of the initial hidden state, every attention head, and every MLP, collectively named “components,” towards the output layer. We view this information as the direct contributions of components to the output logits (in comparison, a component also has indirect contributions to the output by affecting other components in later layers (Wang et al., 2023a); this paper focuses on direct contributions), and derive a formula for the logits, $\text{logits} = \sum_{c=1}^{C} g_c$, where $g_c$ is the direct contribution of the component indexed by $c$. We can obtain the prediction of component $c$ with $\arg\max_{y \in \mathcal{Y}} g_c[y]$, and then calculate its individual ICL accuracy. Specifically, we derive $g_c = E\,\tilde{x}_c$ in Eq. 8 below, where $E$ is the output embedding matrix and $\tilde{x}_c$ is the post-layernorm activation of component $c$. We name the operation $E\,\tilde{x}_c$ early decode, sharing the same spirit as nostalgebraist (2020) and Geva et al. (2022), which interpret hidden representations by decoding through $E$. Compared to the standard view, we can directly study the behavior of individual components (Figure 2), characterizing them and scaling their contributions to the model output.

2.2 A Walkthrough of the Decomposition
A Transformer of $L$ layers consists of a multi-headed attention (MHA) and an MLP in every layer. Let $a^{\ell}$ and $m^{\ell}$ be the outputs of the MHA and MLP at layer $\ell$, respectively. Due to residual connections, the hidden state is:
$$h^{\ell} = h^{\ell-1} + a^{\ell} + m^{\ell}, \tag{1}$$
$$h^{L} = h^{0} + \sum_{\ell=1}^{L} \left( a^{\ell} + m^{\ell} \right). \tag{2}$$
Note that GPT2-like LLMs apply layernorm before MHA and MLP (Radford et al., 2019); thus, layernorm is already taken into account as part of the formula for computing $a^{\ell}$ and $m^{\ell}$ (see A.3).
An MHA is composed of $H$ attention heads:
$$a^{\ell} = W_O^{\ell}\, \mathrm{Concat}\!\left(z^{\ell,1}, \ldots, z^{\ell,H}\right), \tag{3}$$
for head outputs $z^{\ell,h}$ and the output projection $W_O^{\ell}$ in MHA aggregating all heads. Elhage et al. (2021) rewrite Eq. 3 by segmenting $W_O^{\ell}$ into $H$ matrices $W_O^{\ell,h}$:
$$a^{\ell} = \sum_{h=1}^{H} W_O^{\ell,h}\, z^{\ell,h} \tag{4}$$
$$= \sum_{h=1}^{H} a^{\ell,h}. \tag{5}$$
Thus, we can treat each head as a single component adding $a^{\ell,h}$ to the residual stream.
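To make Eqs. 4 and 5 concrete, the following sketch splits an MHA output projection into per-head contributions; the tensor names, shapes, and the PyTorch convention used here are illustrative assumptions rather than the paper's released code.

```python
import torch

def per_head_contributions(z, W_O):
    """Split one MHA output into per-head additions to the residual stream (Eqs. 4 and 5).

    z:   (H, d_head)          head outputs z^{l,h} at a single token position
    W_O: (d_model, H*d_head)  output projection of the MHA block
    Returns a (H, d_model) tensor whose h-th row is the head contribution a^{l,h}.
    """
    H, d_head = z.shape
    # Segment W_O into H matrices of shape (d_model, d_head), one per head.
    W_O_heads = W_O.view(-1, H, d_head).permute(1, 0, 2)   # (H, d_model, d_head)
    contribs = torch.einsum("hmd,hd->hm", W_O_heads, z)    # (H, d_model)
    # Sanity check: the per-head contributions sum back to the full MHA output.
    assert torch.allclose(contribs.sum(0), W_O @ z.reshape(-1), atol=1e-4)
    return contribs
```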
| Model | Method | SST2 | BoolQ | QQP | WiC | RTE | MNLI | AGNews | ARC-Easy | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2 7B | Full | | | | | | | | | |
| | Oracle-T1 | | | | | | | | | 66.6 |
| | Oracle-B1 | | | | | | | | | |
| Llama2 13B | Full | | | | | | | | | |
| | Oracle-T1 | | | | | | | | | 74.2 |
| | Oracle-B1 | | | | | | | | | |
| Mistral Ins 7B | Full | | | | | | | | | |
| | Oracle-T1 | | | | | | | | | 76.7 |
| | Oracle-B1 | | | | | | | | | |
| Llama3 8B | Full | | | | | | | | | |
| | Oracle-T1 | | | | | | | | | 77.0 |
| | Oracle-B1 | | | | | | | | | |
| Random | | | | | | | | | | |
Finally, through the output embedding matrix $E$, the output logits are:
$$\text{logits} = E\,\mathrm{LN}\!\left(h^{L}\right) = E\,\mathrm{LN}\!\left(h^{0} + \sum_{\ell=1}^{L}\Big(\sum_{h=1}^{H} a^{\ell,h} + m^{\ell}\Big)\right), \tag{6}$$
where the terms inside $\mathrm{LN}$ in Eq. 6 ($h^{0}$, every $a^{\ell,h}$, and every $m^{\ell}$) are the components, and we index every term in the summation with $c$, writing it as $x_c$. LN denotes the final layernorm, specifically RMSNorm (Zhang and Sennrich, 2019) for the LLMs in our paper (see A.3). In Eq. 6, $\mathrm{LN}(x) = \frac{x}{\mathrm{RMS}(x)} \odot \gamma$, where RMS denotes the root mean square, $\odot$ denotes element-wise multiplication, and $\gamma$ is the affine parameters. By pre-computing $\mathrm{RMS}(h^{L})$, we have:
(7) | |||
(8) |
We refer to all $\tilde{x}_c$ as the component activations, which include the activations of attention heads and MLPs after the final layernorm. (Empirically, we find that the initial hidden state $h^{0}$ has near-random ICL accuracy on all the tasks, so we omit it in the rest of the paper.) Now that we have broken down the Transformer output into simple additions in Eq. 8, we can easily analyze the direct contribution of each component to the logits through the residual stream, $g_c = E\,\tilde{x}_c$.
In ICL, we only need to do the decomposition when the LLM starts to generate, i.e., when processing the last token of the input. The computations on the other tokens are the same as in standard ICL. In all our experiments, we use single-token label words. We use multiple templates from Bach et al. (2022) that cover diverse label words for each task.
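As a minimal sketch of Eqs. 7 and 8, the code below early-decodes cached component activations through the final RMSNorm and the output embedding matrix; variable names, shapes, and the inclusion of $h^0$ as the first row are assumptions made for illustration.

```python
import torch

def early_decode(x_components, gamma, E, label_token_ids):
    """Per-component direct contributions to the logits (Eqs. 7 and 8).

    x_components:    (C, d_model) rows are x_c: h^0, every head a^{l,h}, and every MLP m^l,
                     taken at the last token of the prompt
    gamma:           (d_model,)   affine parameters of the final RMSNorm
    E:               (V, d_model) output (unembedding) matrix
    label_token_ids: list of the label-word token ids for the task
    """
    h_L = x_components.sum(dim=0)                  # full residual stream h^L
    rms = h_L.pow(2).mean().sqrt()                 # shared RMS(h^L), pre-computed once
    tilde_x = x_components / rms * gamma           # post-layernorm activations \tilde{x}_c
    g = tilde_x @ E.T                              # (C, V): direct contributions g_c
    full_logits = g.sum(dim=0)                     # equals E LN(h^L), the model's own logits
    comp_preds = g[:, label_token_ids].argmax(-1)  # each component's prediction over label words
    return g, full_logits, comp_preds
```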
3 Characterizing Components for ICL
We conduct in-context learning across 8 classification tasks on 4 LLMs: Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B. ICL is sensitive to prompts, so we randomly sample 5 disjoint sets of demonstrations formatted with 3 templates and report the standard deviation across the 15 runs. To avoid majority and recency biases (Zhao et al., 2021), each prompt consists of the same number of demonstrations from every class in shuffled order. We use demonstrations for 3-way classification tasks and for the other tasks. Except for section 5.1, we use this setting without further notice. We sample 2000 examples with balanced labels as the test set for every task. Please see A.1 for details about the tasks and templates.
3.1 Good and Bad-Performing Components
Across all the tasks and LLMs, we observe good-performing components that perform well or even outperform the full model, and bad-performing components that individually perform much worse than chance (blue and red dots in Figure 1, respectively). Table 1 compares the full model (Full) with the top-1 (Oracle-T1) and bottom-1 (Oracle-B1) components selected on the test set. On average, Oracle-T1 outperforms Full by on Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B, respectively; Oracle-B1 underperforms random guessing (Random) by .
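A small sketch of how Oracle-T1 and Oracle-B1 can be selected once every component's predictions on the test set are cached; the array names are assumptions.

```python
import numpy as np

def oracle_components(comp_preds, labels):
    """Select the best (Oracle-T1) and worst (Oracle-B1) single components by test accuracy.

    comp_preds: (C, N) each component's predicted class on N test examples
    labels:     (N,)   gold labels
    """
    acc = (comp_preds == labels[None, :]).mean(axis=1)   # per-component ICL accuracy
    return int(acc.argmax()), int(acc.argmin()), acc
```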
3.2 Label-Biased Components
Besides good and bad-performing components, we also observe label-biased components, which predict a certain label on the entire test set (the green dots in Figure 1). These components exist in all the tasks and LLMs we study, accounting for of components on average in Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B, respectively (Table 5). In A.2, we show that even when we prompt the model with all demonstrations of positive labels, the most biased component still insists on predicting “negative” on the entire test set, and vice versa.
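Label-biased components can be flagged directly from the same cached predictions; this is a sketch under the assumption that `comp_preds` holds one predicted class per component and test example.

```python
import numpy as np

def label_biased_mask(comp_preds):
    """Boolean mask of components that predict one and the same label on the entire test set."""
    return (comp_preds == comp_preds[:, :1]).all(axis=1)   # (C,) True for label-biased components
```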
3.3 Mechanistic Understanding of Bad-Performing Heads
Prior work studies the mechanism of certain components in LLMs, showing that there are negative mover attention heads that write in the opposite direction of the expected answer (Wang et al., 2023a) and copy suppression heads that suppress the prediction of a prior token in the context (McDougall et al., 2023). Inspired by them, we investigate the mechanism behind bad-performing heads identified by our decomposition. We focus on label tokens in the context, as Wang et al. (2023b) show that label words serve as anchors in ICL.
We conduct a case study on Llama-2-7B with 4-shot balanced in-context examples from SST2. We examine the bottom-5 attention heads that have the worst ICL accuracy on SST2. We find that three of these heads, L19H15, L15H14, and L18H9, assign top attention probabilities to all 4 label tokens of the 4-shot in-context examples when predicting test examples. Furthermore, despite their poor ICL accuracy, these heads actually assign higher attention to the correct in-context label tokens than the incorrect ones most of the time ( of the test examples). In other words, when a test example has a positive label, these heads assign higher attention to the tokens “positive” in the context than the tokens “negative” (we average the attention probabilities over occurrences of the same label token and then compare the two labels’ averages). We also observe that the more the heads attend to “positive” in the context, the lower the inner product between the head’s output and the output embedding of the token “positive”, with correlations of for L19H15, L15H14, and L18H9, respectively.
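A hedged sketch of the attention measurement above using HuggingFace attention outputs; the checkpoint name, the eager attention setting, and the bookkeeping of label-token positions are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto",
    attn_implementation="eager",          # so that attention probabilities are returned
)

@torch.no_grad()
def attention_to_label_tokens(prompt, label_positions, layer, head):
    """Average attention from the last token to each label word's in-context occurrences.

    label_positions: dict mapping a label word (e.g., "positive") to the token positions of
    its occurrences in the prompt; `layer` and `head` pick the attention head under study.
    """
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_attentions=True)
    attn = out.attentions[layer][0, head, -1]      # (seq_len,) attention from the last token
    return {label: attn[pos].mean().item() for label, pos in label_positions.items()}
```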
In summary, we show that some bad-performing heads attend highly to prior label tokens and decrease the output probability of the correct one, which shares similarities with the copy suppression heads and negative mover heads (McDougall et al., 2023; Wang et al., 2023a). However, we do not observe similar behavior in other tasks, where the bad-performing heads usually attend to “<s>”, “?”, or “\n”. We invite future work to further analyze how bad-performing heads function in general.
| Metric | Setting | SST2 | BoolQ | QQP | AGNews | ARC |
| --- | --- | --- | --- | --- | --- | --- |
| Corr | (1) Demo | 0.81 | 0.84 | 0.60 | 0.89 | 0.88 |
| | (2) Temp | 0.40 | 0.16 | 0.03 | 0.68 | 0.44 |
| | (3) Cst T | 0.72 | 0.63 | 0.23 | 0.82 | 0.46 |
| IoU | (1) Demo | 0.36 | 0.74 | 0.27 | 0.63 | 0.70 |
| | (2) Temp | 0.12 | 0.01 | 0.01 | 0.20 | 0.20 |
| | (3) Cst T | 0.40 | 0.23 | 0.02 | 0.36 | 0.45 |
4 Transferability of Components
We observe moderate to high component transferability across demonstrations, minimally contrastive templates, and data distributions, whereas there is little transferability across randomly sampled templates. Our decomposition uncovers hidden abilities of individual components when the full model performs poorly.
4.1 Transfer across Prompt Variants
We first measure the agreement in component accuracies between (1) two disjoint sets of demonstrations with a fixed template, (2) two randomly sampled templates with fixed demonstrations, and (3) two minimally contrastive templates with fixed demonstrations. Recall that we have 5 sets of demonstrations and 3 templates in total (section 3); here, we calculate the average agreement between every pair. For (3), we construct contrast sets (Gardner et al., 2020) by minimally editing the worst-performing template out of the 3 templates into a good template, which yields at least improvement in average accuracy. Our edits include adding a space, removing a newline, or changing label words (see Table 10). We use two metrics to measure the agreement between each pair: the Pearson correlation of the accuracies of all components, and the intersection over union (IoU) of the sets of top-5 components, which measures whether the top-performing components of the pair overlap.
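Both agreement metrics are straightforward to compute from two vectors of per-component accuracies; the sketch below assumes SciPy is available and that k=5 matches the top-5 IoU described above.

```python
import numpy as np
from scipy.stats import pearsonr

def agreement(acc_a, acc_b, k=5):
    """Pearson correlation and top-k IoU between two runs' component accuracies.

    acc_a, acc_b: (C,) per-component accuracies under two prompts
    (e.g., two disjoint demonstration sets or a minimal pair of templates).
    """
    r, _ = pearsonr(acc_a, acc_b)
    top_a, top_b = set(np.argsort(acc_a)[-k:]), set(np.argsort(acc_b)[-k:])
    iou = len(top_a & top_b) / len(top_a | top_b)
    return r, iou
```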
Table 2 summarizes the results on Llama-2-7B; A.6 shows similar findings on other models. (1) The accuracies of the internal components are highly consistent across different choices of demonstrations, with strong correlations and an average IoU of . (2) The components have much weaker agreement across randomly sampled templates, with a near-zero IoU on BoolQ and QQP. (3) Nevertheless, there is agreement between minimally contrastive templates (Cst T), with an average correlation of across tasks, despite large differences in full-model accuracy. For example, Figure 2 demonstrates that full-model accuracy changes dramatically ( vs ) in a minimal pair of templates, but the internal components have a high correlation of and the pair shares top-performing components. Combining (2) and (3) suggests that components behave similarly on similar templates, but this similarity decreases as the templates diverge.
4.2 Transfer to Out-of-Distribution Test Sets
We further study whether the best component selected on the test set can still perform well on an out-of-distribution (OOD) test set. We name this method, which uses a single component to make predictions, as Transfer-1. Specifically, we study component transferability from SST2 to Yelp-polarity, MNLI to MedNLI, and BoolQ to BoolQ Contrast Set. We compare Transfer-1 with using the full model (Full) on the OOD test sets. To understand the best possible Transfer-1 accuracy, we also report the best component accuracy directly selected on the OOD set, Oracle-1.
Table 3 shows that Transfer-1 closely matches Oracle-1 overall, suggesting that the top-performing components are transferable across data distributions. Moreover, Transfer-1 sometimes outperforms Full, especially on the Llama2 models, showing the hidden abilities of the internal components.
4.3 Transfer between Two Opposite Tasks
We conduct a case study of component transferability across instructions using Task069 and Task070 of Super-NaturalInstructions (Wang et al., 2022b), both of which are binary abductive NLI tasks (Bhagavatula et al., 2020). The instruction for Task069 asks for correct answers, while Task070 asks for incorrect ones (“pick the one that makes less sense;” see Figure 7 for the full instructions). Examples in the two tasks are not parallel.
We find that Mistral-Instruct-7B achieves good accuracy across 15 runs on Task069 (), but below chance on Task070 (). We observe a strong negative correlation, on average, between the component accuracies of the two tasks. The worst-performing components in Task069 become the top-performing ones in Task070 and vice versa. The correlation suggests that the model has the ability to solve Task070, but misunderstands negation. Thus, we apply the Transfer-1 method (section 4.2), but select the worst-performing component from Task069 and then calculate its individual accuracy on Task070. Transfer-1 achieves accuracy across the 15 runs, an improvement of over the full model. These results suggest that components behave consistently even across tasks with opposite instructions, as the active components in Task069 are also active in Task070.
| Model | Method | Yelp-polarity | MedNLI | BoolQ Cst |
| --- | --- | --- | --- | --- |
| Llama2 7B | Full | | | |
| | Transfer-1 | | | |
| | Oracle-1 | | | |
| Llama2 13B | Full | | | |
| | Transfer-1 | | | |
| | Oracle-1 | | | |
| Mistral Ins 7B | Full | | | |
| | Transfer-1 | | | |
| | Oracle-1 | | | |
| Llama3 8B | Full | | | |
| | Transfer-1 | | | |
| | Oracle-1 | | | |
5 Component Reweighting
5.1 Proposed Method
Our findings in section 4 show the promising direction of selecting internal components to improve ICL. Therefore, we propose a method that reweights components by learning a weight $w_c$ on every component activation $\tilde{x}_c$. Reweighting is a soft version of selection, which can be learned by gradient descent on very few examples.
Given labeled examples, instead of using all of them as ICL demonstrations, we divide them into a demonstration set $D_{\text{demo}}$ and a training set $D_{\text{tr}}$. We first randomly sample examples with balanced labels as the demonstrations $D_{\text{demo}}$ and use the remaining examples as $D_{\text{tr}}$ to train the component weights. Specifically, we can rewrite Eq. 8 as $\text{logits} = \sum_{c=1}^{C} w_c\, g_c$, where $w_c = 1$ for all $c$. Because of the existence of good and bad-performing components, weighting all components equally may not be optimal. Therefore, we tune the component weights $w$ on $D_{\text{tr}}$ with cross-entropy loss and $L_1$ regularization, while keeping the LLM frozen:
(9) | |||
where $E_Y$ is a submatrix of $E$ that comprises the output embeddings of the label words, $p_w$ is the probability distribution of the LLM after reweighting, and $\lambda$ is the hyperparameter of the $L_1$ loss that encourages sparsity on the component weights. We obtain the activations of all components in one -shot forward pass, computed on the prompt derived from $D_{\text{demo}}$ followed by the training input $x$. Our method scales each component’s direct contribution to the logits ($g_c = E\,\tilde{x}_c$) by $w_c$. In practice, we cache these contributions on the entire training set as input features to the linear layer $w$, which allows us to discard the entire LLM while training (lines 9 and 13 in Algorithm 1), saving tremendous training time and GPU memory. The cache only requires $O(|D_{\text{tr}}| \cdot C \cdot |\mathcal{Y}|)$ space. At inference time, the overhead of our method over -shot ICL is to early decode the components and apply the learned weights, i.e., computing $E_Y \sum_{c} w_c\, \tilde{x}_c$. As both $C$ and $|\mathcal{Y}|$ are small for all LLMs in this paper, the overhead is negligible compared to the computation of the LLM itself.
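A minimal sketch of the reweighting step, assuming the per-component label-word contributions $E_Y \tilde{x}_c$ have already been cached for every training example with the decomposition in section 2; the optimizer, learning rate, and epoch count are illustrative choices, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def train_component_weights(G_tr, y_tr, l1_lambda=1e-3, lr=1e-2, epochs=200):
    """Learn one scalar weight per component on the small training set D_tr (Eq. 9).

    G_tr: (N, C, Y) cached direct contributions of each component to the label-word logits,
          i.e., E_Y @ tilde_x_c for every training example; the LLM itself is no longer needed.
    y_tr: (N,) gold labels indexing the Y label words.
    """
    N, C, Y = G_tr.shape
    w = torch.ones(C, requires_grad=True)            # w_c = 1 recovers standard ICL
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(epochs):
        logits = torch.einsum("ncy,c->ny", G_tr, w)  # reweighted logits over label words
        loss = F.cross_entropy(logits, y_tr) + l1_lambda * w.abs().sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()
```

At test time the same cached decomposition gives the reweighted prediction as the argmax of the weighted sum of per-component label-word logits, so the only overhead over standard ICL is the early decode and this weighted sum.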
5.2 Baselines
Standard ICL.
The simplest baseline is to use all the labeled examples as demonstrations. Since the other methods use examples as demonstrations, we report the accuracy of standard -shot ICL using the same $D_{\text{demo}}$ for reference.
Prompt Selection.
Liu et al. (2022b) improve ICL accuracy by selecting demonstrations from a pool of labeled data for each test example. Here, we select from the given labeled examples. Following Rubin et al. (2022), we use SBERT (Reimers and Gurevych, 2019) to encode examples into sentence embeddings and select the nearest neighbors under cosine similarity as the demonstrations for each test example.
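A sketch of this retrieval step with the sentence-transformers library; the specific encoder checkpoint is an assumption, since the text only states that SBERT embeddings and cosine similarity are used.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT checkpoint

def select_demonstrations(test_input, pool_inputs, k=4):
    """Return indices of the k most similar labeled examples to use as demonstrations."""
    emb_test = encoder.encode(test_input, convert_to_tensor=True)
    emb_pool = encoder.encode(pool_inputs, convert_to_tensor=True)
    scores = util.cos_sim(emb_test, emb_pool)[0]     # cosine similarity to every pool example
    return scores.topk(k).indices.tolist()
```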
Calibration.
As LLMs tend to predict a certain class over others, Zhao et al. (2021) reweight the output class probabilities. They use context-free inputs, such as “N/A”, to calibrate the probability distribution. However, Fei et al. (2023) and Zhou et al. (2023) find context-free inputs sometimes ineffective, because in-domain context is important for calibration. Thus, we introduce Calib+, which calibrates the original probabilities with a training set of in-distribution labeled examples, $D_{\text{tr}}$. We train the calibration weights on $D_{\text{tr}}$ with cross-entropy loss and obtain the calibrated probabilities. For direct comparisons, Calib+ splits the examples into the same $D_{\text{demo}}$ and $D_{\text{tr}}$ sets as component reweighting. We include the training details of both methods in A.8.
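A sketch of Calib+ under the assumption that the correction is a per-class scale and bias over the full model's label-word log-probabilities; the text only specifies that calibration weights are trained on $D_{\text{tr}}$ with cross-entropy loss, so the exact parameterization here is illustrative.

```python
import torch
import torch.nn.functional as F

def train_calibration(logp_tr, y_tr, lr=1e-2, epochs=200):
    """Calib+: learn a per-class affine correction of the label-word log-probabilities.

    logp_tr: (N, Y) full-model log-probabilities of the Y label words on D_tr
    y_tr:    (N,)   gold labels
    """
    Y = logp_tr.shape[1]
    scale = torch.ones(Y, requires_grad=True)
    bias = torch.zeros(Y, requires_grad=True)
    opt = torch.optim.Adam([scale, bias], lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(logp_tr * scale + bias, y_tr)
        opt.zero_grad(); loss.backward(); opt.step()
    return scale.detach(), bias.detach()
```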
| Model | Method | SST2 | BoolQ | QQP | WiC | RTE | MNLI | AGNews | ARC-Easy | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7B | Standard | | | | | | | | | |
| | Standard | | | | | | | | | |
| | PromptS | | | | | | | | | |
| | Calib+ | | | | | | | | | |
| | CompRW | | | | | | | | | 68.8 |
| | Standard | | | | | | | | | |
| | PromptS | | | | | | | | | |
| | Calib+ | | | | | | | | | |
| | CompRW | | | | | | | | | 69.8 |
| Llama-2-13B | Standard | | | | | | | | | |
| | Standard | | | | | | | | | |
| | PromptS | | | | | | | | | |
| | Calib+ | | | | | | | | | |
| | CompRW | | | | | | | | | 74.9 |
| | Standard | | | | | | | | | |
| | PromptS | | | | | | | | | |
| | Calib+ | | | | | | | | | |
| | CompRW | | | | | | | | | 75.7 |
| Mistral-Instruct-7B | Standard | | | | | | | | | |
| | Standard | | | | | | | | | |
| | PromptS | | | | | | | | | |
| | Calib+ | | | | | | | | | |
| | CompRW | | | | | | | | | 77.3 |
| | Standard | | | | | | | | | |
| | PromptS | | | | | | | | | |
| | Calib+ | | | | | | | | | |
| | CompRW | | | | | | | | | 77.8 |
| Llama-3-8B | Standard | | | | | | | | | |
| | Standard | | | | | | | | | |
| | Calib+ | | | | | | | | | |
| | CompRW | | | | | | | | | 78.7 |
| | Standard | | | | | | | | | |
| | Calib+ | | | | | | | | | |
| | CompRW | | | | | | | | | 79.5 |
5.3 Results
We set . Table 4 compares our component reweighting (CompRW) with standard ICL (Standard), prompt selection (PromptS), and calibration (Calib+). First, we find that simply increasing the number of demonstrations from to has limited improvements in ICL accuracy, while the longer prompt greatly increases the inference time. For example, on Llama-2-7B, Standard only improves the average accuracy by over Standard, and the accuracy even decreases on Mistral-Instruct. Second, PromptS performs the worst in most setups, likely because it is hard to find similar examples in a small pool of examples, and a bad selection induces majority label biases. Third, both calibration (Calib+) and component reweighting (CompRW) achieve substantially better accuracy than Standard with little test-time overhead. Overall, CompRW achieves the best average accuracy in all setups, outperforming Standard by on Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B, respectively, and outperforming Standard by , respectively. We run one-tailed paired t-tests comparing CompRW with Calib+ and find p-values below 0.05 in all 8 setups (see Table 6), showing that CompRW performs significantly better than Calib+.
6 When Do Good Components Emerge?
We study the dynamics of components during pretraining by monitoring their accuracies on 32 checkpoints of Pythia-6.9B, uniformly spaced from the first to the last checkpoint. For each checkpoint, we run 4-shot ICL on AGNews with 3 templates × 3 sets of demonstrations. The demonstrations are balanced in labels with randomly shuffled orders. Figure 3 shows the average accuracy of the 9 runs, shaded by the standard deviation.
While the full model (green) fluctuates and has a large variance across prompts, the top-1 components (solid blue) achieve good accuracy at an early step and plateau quickly. We also backtrack the top-1 components of different prompts at the last checkpoint (dashed blue), monitoring how they perform on average during pretraining. We observe that they are not the top components at the early stage (there are gaps between the two blue lines before the steps), but start to perform steadily well from the middle stage. Our findings also hold on SST2 and Pythia-1.4B (see Figure 6 in the appendix), suggesting that the model’s ability to do a task emerges before it is apparent from the full model on these tasks. (On the other hand, Pythia models perform poorly on the other tasks over all checkpoints; thus, the training dynamics of the model components on challenging tasks remain unclear.)
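Pythia checkpoints are published under per-step revisions on the HuggingFace Hub, so the sweep above can be reproduced by reloading the model at different revisions; the step list below is a placeholder sketch, not the exact 32 checkpoints used here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: evenly spaced "stepN" revisions (Pythia saves checkpoints at multiples of 1000).
checkpoints = [f"step{s}" for s in range(1000, 143001, 4000)]

def load_pythia_checkpoint(revision, name="EleutherAI/pythia-6.9b"):
    """Load one pretraining checkpoint of Pythia by its revision string."""
    tok = AutoTokenizer.from_pretrained(name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        name, revision=revision, torch_dtype=torch.float16, device_map="auto")
    return tok, model

# for rev in checkpoints:
#     tok, model = load_pythia_checkpoint(rev)
#     ...run 4-shot ICL and record per-component accuracies...
```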
7 Related Work and Discussion
Improving ICL.
Prior work shows that ICL performance varies greatly across different choices of demonstrations and templates (Zhao et al., 2021; Lu et al., 2022). Specifically, Sclar et al. (2024) and Voronov et al. (2024) find no universally better prompt template that transfers across tasks and models, implying that it is not easy to explain ICL through prompt engineering. While several approaches, such as prompt selection (Liu et al., 2022b; Chang and Jia, 2023; Fu et al., 2023), prompt ensembles (Min et al., 2022a; Arora et al., 2023; Voronov et al., 2024), and many-shot ICL (Agarwal et al., 2024), substantially improve accuracy, they treat LLMs as black boxes without understanding the internals. Moreover, they greatly increase inference time or require a large set of labeled data, which deviates from true few-shot learning (Perez et al., 2021). In comparison, our paper studies this problem by looking inside the LLMs. Rather than selecting prompts, we select components in a soft, learnable way. Our method only requires examples and has negligible computation overhead over -shot ICL at inference.

Component Interpretation.
Component interpretation studies the function of different components in a trained model (Elhage et al., 2021; Shah et al., 2024), where components could be neurons (Radford et al., 2017; Wang et al., 2022a; Gurnee et al., 2023), attention heads (Olsson et al., 2022), or MLPs (Geva et al., 2021). To analyze the components, probing (Alain and Bengio, 2017), knockout (Geva et al., 2023; Chang et al., 2024; Li et al., 2023), patching (Wang et al., 2023a; Goldowsky-Dill et al., 2023), and early decoding (nostalgebraist, 2020; Geva et al., 2022) are widely used techniques. For example, Li et al. (2024) train a linear probe on every attention head to discover the truthful heads inside LLMs. Michel et al. (2019) and Voita et al. (2019) prune away a large percentage of attention heads and show that only a few are critical to the performance. Hendel et al. (2023), Liu et al. (2023), Merullo et al. (2024a), and Todd et al. (2024) view ICL as compressing demonstrations into function vectors, where they remove the demonstrations and modify (patch) the LLM activations at certain layers with the function vectors at test time. Early decoding interprets the investigated components in the textual space by projecting them through the output embedding matrix (Geva et al., 2022). Our model decomposition is based on early decoding, and we share some similarities with prior work (Yu et al., 2023; Wang et al., 2023c), especially in discovering individual components that perform well on a task. Our contributions lie in providing a new view of ICL by decomposition, which reveals the transferability of components across diverse ICL settings.
Our Method vs. Pruning.
Our method caches the direct contributions of components to the outputs through the residual stream, i.e., $\text{logits} = \sum_{c} g_c$. Thus, removing $g_c$, the direct contribution of component $c$, does not alter the contributions of the other components. In comparison, pruning a component changes the activations of the other components in later layers. In A.7, we show that pruning the good-performing components identified by our method greatly hurts the accuracy, meaning that pruning also defines these components as important (Michel et al., 2019).
8 Conclusion
We introduce a new perspective on ICL by decomposing the model output into the sum of individual contributions of components. We then identify three types of component characteristics across 4 LLMs and 8 classification tasks. Our extensive analyses reveal consistency in component accuracy across prompts and suggest the promising direction of improving ICL by selecting components. To this end, we propose component reweighting, which learns to scale components differently on few-shot examples. Our method achieves the best average accuracy compared to prior methods. We hope this work can deepen our understanding of LLMs while motivating more methods for practical use.
9 Limitations
Our component reweighting method requires a small set of labeled data to train the component weights $w$. However, we believe it is not unreasonable to have at least labeled examples in total, and we compare with baselines using the same examples. On the other hand, we do not compare with fine-tuning-based baselines, such as LM-BFF (Gao et al., 2021), T-Few (Liu et al., 2022a), and LoRA (Hu et al., 2021), because they usually require more GPU memory for training and more sophisticated early stopping criteria to prevent overfitting on few-shot examples. Another limitation is that we only experiment with classification tasks for ease of evaluation. We leave it to future work to generalize our method to generation tasks by performing decomposition and reweighting at every token during generation.
Despite similarities in model decomposition, the focus of this paper is not circuits in LLMs (Wang et al., 2023a). Thus, we only have limited experiments toward a mechanistic understanding of the curious components in section 3.3 and A.7. Unlike prior work that uses synthetic tasks to test whether a head attends to certain tokens (Dutta et al., 2024; Merullo et al., 2024b), we work on standard NLP benchmark datasets without obviously correct or incorrect tokens from which to collect answers, making mechanistic interpretation more challenging.
Acknowledgements
We thank Johnny Wei for his valuable suggestions on the paper structure. We thank Qinyuan Ye, Ryan Wang, Gustavo Lucas Carvalho, Ameya Godbole, Wang Zhu, Daniel Firebanks-Quevedo, and the anonymous reviewers for their helpful feedback. This work was funded in part by gifts from Open Philanthropy and Cisco Research, and was also supported in part by the National Science Foundation under Grant No. IIS-2403436. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
- Agarwal et al. (2024) Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Stephanie Chan, Ankesh Anand, Zaheer Abbas, Azade Nova, John D Co-Reyes, Eric Chu, et al. 2024. Many-shot in-context learning. arXiv preprint arXiv:2404.11018.
- Alain and Bengio (2017) Guillaume Alain and Yoshua Bengio. 2017. Understanding intermediate layers using linear classifier probes.
- Arora et al. (2023) Simran Arora, Avanika Narayan, Mayee F Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, and Christopher Re. 2023. Ask me anything: A simple strategy for prompting language models. In The Eleventh International Conference on Learning Representations.
- Bach et al. (2022) Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-jian Jiang, and Alexander Rush. 2022. PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Ireland. Association for Computational Linguistics.
- Bertsch et al. (2024) Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R Gormley, and Graham Neubig. 2024. In-context learning with long-context models: An in-depth exploration. arXiv preprint arXiv:2405.00200.
- Bhagavatula et al. (2020) Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In International Conference on Learning Representations.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Chang and Jia (2023) Ting-Yun Chang and Robin Jia. 2023. Data curation alone can stabilize in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8123–8144, Toronto, Canada. Association for Computational Linguistics.
- Chang et al. (2024) Ting-Yun Chang, Jesse Thomason, and Robin Jia. 2024. Do localization methods actually localize memorized data in LLMs? a tale of two benchmarks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3190–3211, Mexico City, Mexico. Association for Computational Linguistics.
- Chen et al. (2023) Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, and He He. 2023. On the relation between sensitivity and accuracy in in-context learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 155–167, Singapore. Association for Computational Linguistics.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Dutta et al. (2024) Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, and Tanmoy Chakraborty. 2024. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. arXiv preprint arXiv:2402.18312.
- Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1:1.
- Fei et al. (2023) Yu Fei, Yifan Hou, Zeming Chen, and Antoine Bosselut. 2023. Mitigating label biases for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14014–14031, Toronto, Canada. Association for Computational Linguistics.
- Fu et al. (2023) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations.
- Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.
- Gardner et al. (2020) Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, Singapore. Association for Computational Linguistics.
- Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Goldowsky-Dill et al. (2023) Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. 2023. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969.
- Gurnee et al. (2023) Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research.
- Hendel et al. (2023) Roee Hendel, Mor Geva, and Amir Globerson. 2023. In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9318–9333, Singapore. Association for Computational Linguistics.
- Hewitt and Liang (2019) John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.
- Holtzman et al. (2021) Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn’t always right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Hu et al. (2021) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
- Li et al. (2024) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36.
- Li et al. (2023) Maximilian Li, Xander Davies, and Max Nadeau. 2023. Circuit breaking: Removing model behaviors with targeted ablation. arXiv preprint arXiv:2309.05973.
- Liu et al. (2022a) Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022a. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965.
- Liu et al. (2022b) Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022b. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.
- Liu et al. (2023) Sheng Liu, Lei Xing, and James Zou. 2023. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv preprint arXiv:2311.06668.
- Lu et al. (2022) Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland. Association for Computational Linguistics.
- McDougall et al. (2023) Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda. 2023. Copy suppression: Comprehensively understanding an attention head. arXiv preprint arXiv:2310.04625.
- Merullo et al. (2024a) Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2024a. Language models implement simple Word2Vec-style vector arithmetic. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5030–5047, Mexico City, Mexico. Association for Computational Linguistics.
- Merullo et al. (2024b) Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2024b. Talking heads: Understanding inter-layer communication in transformer language models. arXiv preprint arXiv:2406.09519.
- Michel et al. (2019) Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Min et al. (2022a) Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022a. Noisy channel language model prompting for few-shot text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5316–5330, Dublin, Ireland. Association for Computational Linguistics.
- Min et al. (2022b) Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022b. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland. Association for Computational Linguistics.
- nostalgebraist (2020) nostalgebraist. 2020. interpreting GPT: the logit lens.
- Olsson et al. (2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
- Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. In Advances in Neural Information Processing Systems.
- Pilehvar and Camacho-Collados (2019) Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, Minneapolis, Minnesota. Association for Computational Linguistics.
- Radford et al. (2017) Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Ramanujan et al. (2020) Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. 2020. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11893–11902.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Romanov and Shivade (2018) Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1586–1596, Brussels, Belgium. Association for Computational Linguistics.
- Rubin et al. (2022) Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States. Association for Computational Linguistics.
- Sclar et al. (2024) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.
- Shah et al. (2024) Harshay Shah, Andrew Ilyas, and Aleksander Madry. 2024. Decomposing and editing predictions by modeling model computation. arXiv preprint arXiv:2404.11534.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
- Todd et al. (2024) Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. 2024. Function vectors in large language models. In The Twelfth International Conference on Learning Representations.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.
- Voita et al. (2019) Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
- Voronov et al. (2024) Anton Voronov, Lena Wolf, and Max Ryabinin. 2024. Mind your format: Towards consistent evaluation of in-context learning improvements. In Findings of the Association for Computational Linguistics ACL 2024, pages 6287–6310, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
- Wang et al. (2023a) Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023a. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations.
- Wang et al. (2023b) Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023b. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore. Association for Computational Linguistics.
- Wang et al. (2023c) Tony Wang, Miles Kai, Kaivalya Hariharan, and Nir Shavit. 2023c. Forbidden facts: An investigation of competing objectives in llama 2. In Socially Responsible Language Modelling Research.
- Wang et al. (2022a) Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, and Juanzi Li. 2022a. Finding skill neurons in pre-trained transformer-based language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11132–11152, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Wang et al. (2022b) Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022b. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
- Yu et al. (2023) Qinan Yu, Jack Merullo, and Ellie Pavlick. 2023. Characterizing mechanisms for factual recall in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9924–9959, Singapore. Association for Computational Linguistics.
- Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.
- Zhang and Bowman (2018) Kelly Zhang and Samuel Bowman. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 359–361, Brussels, Belgium. Association for Computational Linguistics.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
- Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International conference on machine learning, pages 12697–12706. PMLR.
- Zhou et al. (2023) Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, and Subhrajit Roy. 2023. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. arXiv preprint arXiv:2309.17249.
Appendix A Appendix
A.1 Tasks and Templates
Table 11 summarizes the 13 datasets we use in the paper, where we construct balanced test sets by randomly sampling 2000 examples for each task. We form the prompts by concatenating demonstrations in a randomly shuffled order. To avoid the recency bias (Zhao et al., 2021), we keep shuffling the demonstrations until the last two have different labels. For minimally contrastive templates (section 4.1), Table 10 compares the contrast sets we construct on Llama-2-7B. For our case study on Task069 and Task070, we sample 3 templates from Sclar et al. (2024). Figure 7 compares the prompts of Task069 and Task070, which consist of an instruction followed by templated demonstrations. Originally, the two tasks share a portion of parallel examples. To make our task transfer challenging, we discard these overlapping examples.
A.2 Label-Biased Components
We say a component is label-biased when it always predicts a certain label on the entire test set (section 3.2). In this section, we focus on the most biased components in binary classification tasks, i.e., the two components that have the largest values of $\text{logit}_0 - \text{logit}_1$ and $\text{logit}_1 - \text{logit}_0$, respectively, where $\text{logit}_0$ and $\text{logit}_1$ are the LLM output logits on the two classes. We name these two components Biased Component-0 and Biased Component-1, respectively. To understand how biased these two components are, we alter the choices of demonstrations and observe their behavior. Specifically, we consider three settings: demonstrations balanced in labels (green in Figure 5), demonstrations with all negative labels (red), and demonstrations with all positive labels (blue). We fix the template and sample 5 disjoint sets of demonstrations for each setting. Each dot in Figure 5 shows the components’ prediction on an example, where the x-axis and y-axis correspond to $\text{logit}_0$ and $\text{logit}_1$, respectively. A dot below the dashed diagonal line means the prediction on that example is class 0. We find that both Biased Component-0 and Biased Component-1 still insist on predicting a certain label on all examples, regardless of the labels in the prompts.


| Model | SST2 | BoolQ | QQP | WiC | RTE | MNLI | AGNews | ARC-Easy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7B | | | | | | | | |
| Llama-2-13B | | | | | | | | |
| Mistral-Instruct-7B | | | | | | | | |
| Llama-3-8B | | | | | | | | |
A.3 LayerNorms
Figure 4 shows the Transformer architecture in GPT2-like models. Because the layernorms inside each block are applied before MHA and MLP, known as Pre-LN, Eq. 1 has already taken them into account, and Eq. 6 only has the term for the final layernorm, LN.
Both the Llama-2 and Mistral model families use RMSNorm Zhang and Sennrich (2019), a layer normalization variant without mean-centering and without a bias term. Formally, let $x \in \mathbb{R}^{d}$ be the input; RMSNorm computes

$$\mathrm{LN}(x) = \frac{x}{\mathrm{RMS}(x)} \odot g, \qquad (10)$$

$$\mathrm{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^{2}}, \qquad (11)$$

where $g \in \mathbb{R}^{d}$ is the affine transform (gain) parameter and $\odot$ denotes element-wise multiplication.
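A direct PyTorch implementation of Eq. 10–11 looks as follows; the small `eps` term is a standard numerical-stability detail not shown in the equations.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: normalize by the root mean square of the input, then apply a
    learned element-wise gain g (no mean-centering, no bias)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(dim))  # affine gain parameter g
        self.eps = eps  # numerical stability; not part of Eq. 10-11

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.g
```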
A.4 Tests of Significance
We run one-tailed paired t-tests to test whether CompRW significantly outperforms Calib+. In Table 4, we have results for 15 prompts per task and 8 tasks in total. For each model, we aggregate the 120 accuracy scores of CompRW and Calib+, respectively, and then compute the p-values. Table 6 shows p-values below 0.05 in 8/8 setups, suggesting that CompRW performs significantly better than Calib+.
Table 6: p-values of one-tailed paired t-tests comparing CompRW with Calib+.

| | Llama2-7B | Llama2-13B | Mistral-Ins-7B | Llama3-8B |
| | 0.0010 | 0.0002 | 0.0470 | 0.0198 |
| | 0.0003 | 0.0001 | 0.0027 | 0.0245 |
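The test itself is a one-line call with SciPy; the arrays below are placeholders standing in for the 120 per-prompt accuracy scores of each method.

```python
import numpy as np
from scipy import stats

# Placeholder arrays standing in for the 120 accuracy scores (8 tasks x 15 prompts)
# of CompRW and Calib+ for one model; the real scores come from Table 4.
comprw_acc = np.random.rand(120)
calib_acc = np.random.rand(120)

# One-tailed paired t-test with H1: CompRW accuracy > Calib+ accuracy.
t_stat, p_value = stats.ttest_rel(comprw_acc, calib_acc, alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```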
Table 7: 4-shot ICL accuracy on SST2 and ARC-Easy with a pretrained (Trained) versus a randomly initialized (Random) Llama-2-7B.

| | | SST2 | ARC-Easy |
| Trained | Full | | |
| | Oracle-T1 | | |
| | CompRW | | |
| Random | Full | | |
| | Oracle-T1 | | |
| | CompRW | | |
A.5 Do Good-Performing Components Exist in Randomly Initialized Models?
Ramanujan et al. (2020) find that untrained subnetworks can perform on par with a ResNet-34 trained on ImageNet. Similarly, Zhang and Bowman (2018); Hewitt and Liang (2019) show that representations of randomly initialized language models yield a strong baseline for probing tasks. In this section, we investigate (1) whether good-performing components still exist in a randomly initialized LLM, and (2) how our CompRW method performs when using component activations extracted from the randomly initialized LLM.
We run 4-shot ICL with 15 prompts and report the average accuracy and standard deviation. For CompRW, we use the same 4 demonstrations plus 20 additional examples for reweighting. Table 7 shows that the best-performing component (Oracle-T1) in a randomly initialized Llama-2-7B still performs poorly on SST2 and ARC-Easy. While CompRW substantially improves over the full model (Full) on the pretrained model, it has no effect on the randomly initialized model. We conclude that good-performing components do not exist in a randomly initialized LLM and that our CompRW method relies on pretrained component activations to perform well.
A.6 More Results on Transferability
In section 4, we study the transferability of components across different choices of demonstrations and templates. Table 9 shows the full results on all LLMs and tasks. We observe the same findings as in Table 2: component accuracies agree well across randomly sampled demonstrations but much less so across randomly sampled templates. Because constructing minimally contrastive templates requires non-trivial manual effort, we only build contrast sets for 5 tasks on Llama-2-7B (shown in Table 2), the tasks with the largest variance across templates.
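For reference, the two agreement measures reported in Table 9 can be computed as below; `acc_a` and `acc_b` are assumed to be per-component accuracy vectors from two prompts, and the top-k cutoff for the IoU is an illustrative choice rather than the paper's exact setting.

```python
import numpy as np

def component_agreement(acc_a, acc_b, k=50):
    """Pearson correlation of per-component accuracies and IoU of the
    top-k component sets between two prompts."""
    corr = np.corrcoef(acc_a, acc_b)[0, 1]
    top_a = set(np.argsort(acc_a)[-k:].tolist())
    top_b = set(np.argsort(acc_b)[-k:].tolist())
    iou = len(top_a & top_b) / len(top_a | top_b)
    return corr, iou
```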
A.7 Pruning Good and Bad Components
Our method studies a component using its cached direct contribution to the output, whereas Michel et al. (2019) (pruning) zeroes out the activations of a component in the forward pass and thus indirectly changes the activations of other components in later layers. They consider a component important if pruning it causes a large drop in task performance. In this section, we investigate the agreement between our method and pruning.
First, we apply our decomposition to identify good- and bad-performing components based on their ICL accuracy (3-shot for MNLI, 4-shot for the other tasks). Second, we run ICL with pruning on Llama-2-7B (see the sketch after Table 8), using the same 15 prompts as in our main experiments for every task. We prune the top-50 and the bottom-50 components, respectively, a small fraction of the total number of components. Table 8 compares the results with the full model without pruning. We find that pruning the top components (T50) greatly hurts accuracy. In contrast, pruning the bottom components (B50) only slightly decreases the average accuracy on SST2 and RTE, and even slightly improves it on MNLI and AGNews. These findings suggest that our method and pruning interpret components in a similar fashion.
Table 8: ICL accuracy on Llama-2-7B when pruning the top-50 (T50) or bottom-50 (B50) components, compared with the full model.

| | SST2 | RTE | MNLI | AGNews |
| Full Model | | | | |
| Prune-T50 | | | | |
| Prune-B50 | | | | |
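One way to realize this kind of head pruning on a Hugging Face Llama checkpoint is to zero the corresponding slice of the concatenated head outputs before the attention output projection. The module path (`model.model.layers[i].self_attn.o_proj`) follows the current `transformers` implementation, and MLP components could be handled analogously; this is a sketch under those assumptions, not the paper's exact procedure.

```python
import torch

def prune_attention_heads(model, heads_to_prune, head_dim):
    """Zero out selected attention heads during the forward pass.
    heads_to_prune: iterable of (layer_idx, head_idx) pairs."""
    by_layer = {}
    for layer_idx, head_idx in heads_to_prune:
        by_layer.setdefault(layer_idx, []).append(head_idx)

    handles = []
    for layer_idx, head_idxs in by_layer.items():
        o_proj = model.model.layers[layer_idx].self_attn.o_proj  # assumed module path

        def pre_hook(module, args, head_idxs=tuple(head_idxs)):
            hidden = args[0].clone()  # concatenated head outputs: [..., num_heads * head_dim]
            for h in head_idxs:
                hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (hidden,) + args[1:]

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles  # call handle.remove() on each to undo pruning
```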
A.8 Training Details and Hyperparameters
For both the CompRW and Calib+ methods, we train a linear layer with stochastic gradient descent on the small set of labeled examples. Because we do not have an additional dev set for hyperparameter tuning, we use the same hyperparameters for all tasks and models and perform early stopping based on the loss and accuracy on the same labeled examples. Specifically, we use a single fixed learning rate for both methods and a fixed coefficient for the L1 regularization term in CompRW. We run all our ICL experiments on a single RTX A6000 GPU (48GB). Both the component reweighting and calibration training processes run on a single i7 CPU within a minute.
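Below is a minimal sketch of the CompRW training loop, assuming the per-example component logits have already been cached. The all-ones initialization, the exact form of the L1 penalty, and the placeholder hyperparameter values are our assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class ComponentReweighter(nn.Module):
    """Learn one scalar weight per component and sum the reweighted
    per-component label logits."""
    def __init__(self, num_components):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_components))  # all-ones = full-model sum

    def forward(self, component_logits):
        # component_logits: [batch, num_components, num_labels]
        return (self.w[None, :, None] * component_logits).sum(dim=1)

def train_comprw(component_logits, labels, lr=1e-2, l1=1e-3, steps=200):
    """Train with SGD, cross-entropy loss, and an L1 penalty on the weights."""
    model = ComponentReweighter(component_logits.shape[1])
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(component_logits)
        loss = loss_fn(logits, labels) + l1 * model.w.abs().sum()
        loss.backward()
        optimizer.step()
    return model
```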
A.9 Models
We use the following model checkpoints from Hugging Face: meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, mistralai/Mistral-7B-Instruct-v0.1, and meta-llama/Meta-Llama-3-8B.
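These checkpoints can be loaded with the `transformers` library as sketched below; the gated Llama weights additionally require accepting the license on Hugging Face.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # or any of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
model.eval()
```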
Table 9: Correlation and IoU of component accuracies across (1) randomly sampled demonstrations (Demo) and (2) randomly sampled templates (Temp), for all LLMs and tasks.

| Llama-2-7B | SST2 | BoolQ | QQP | WiC | RTE | MNLI | AGNews | ARC-Easy |
| Correlation (1) Demo | 0.81 | 0.84 | 0.60 | 0.65 | 0.75 | 0.65 | 0.89 | 0.88 |
| Correlation (2) Temp | 0.40 | 0.16 | 0.03 | 0.15 | 0.19 | 0.09 | 0.68 | 0.44 |
| IoU (1) Demo | 0.36 | 0.74 | 0.27 | 0.21 | 0.53 | 0.24 | 0.63 | 0.70 |
| IoU (2) Temp | 0.12 | 0.01 | 0.01 | 0.03 | 0.05 | 0.01 | 0.20 | 0.20 |

| Llama-2-13B | SST2 | BoolQ | QQP | WiC | RTE | MNLI | AGNews | ARC-Easy |
| Correlation (1) Demo | 0.83 | 0.84 | 0.63 | 0.67 | 0.78 | 0.73 | 0.91 | 0.91 |
| Correlation (2) Temp | 0.57 | 0.30 | 0.09 | 0.19 | 0.28 | 0.16 | 0.76 | 0.55 |
| IoU (1) Demo | 0.26 | 0.71 | 0.31 | 0.18 | 0.46 | 0.39 | 0.55 | 0.65 |
| IoU (2) Temp | 0.21 | 0.11 | 0.07 | 0.01 | 0.21 | 0.07 | 0.25 | 0.30 |

| Mistral-Instruct-7B | SST2 | BoolQ | QQP | WiC | RTE | MNLI | AGNews | ARC-Easy |
| Correlation (1) Demo | 0.88 | 0.91 | 0.72 | 0.75 | 0.87 | 0.82 | 0.92 | 0.97 |
| Correlation (2) Temp | 0.58 | 0.44 | 0.19 | 0.26 | 0.40 | 0.30 | 0.77 | 0.60 |
| IoU (1) Demo | 0.39 | 0.59 | 0.27 | 0.29 | 0.50 | 0.45 | 0.68 | 0.80 |
| IoU (2) Temp | 0.10 | 0.17 | 0.06 | 0.05 | 0.17 | 0.09 | 0.29 | 0.22 |

| Llama-3-8B | SST2 | BoolQ | QQP | WiC | RTE | MNLI | AGNews | ARC-Easy |
| Correlation (1) Demo | 0.85 | 0.88 | 0.70 | 0.73 | 0.80 | 0.81 | 0.89 | 0.95 |
| Correlation (2) Temp | 0.55 | 0.39 | 0.26 | 0.25 | 0.31 | 0.23 | 0.67 | 0.52 |
| IoU (1) Demo | 0.42 | 0.56 | 0.28 | 0.25 | 0.46 | 0.52 | 0.65 | 0.68 |
| IoU (2) Temp | 0.15 | 0.12 | 0.09 | 0.07 | 0.08 | 0.05 | 0.34 | 0.27 |
Table 10: Contrast sets of templates constructed on Llama-2-7B, with the label words for each template pair.

| Task | Templates | Labels | Accuracy |
| SST-2 | | negative/positive | |
| SST-2 | | negative/positive | |
| BoolQ | | No/Yes | |
| BoolQ | | No/Yes | |
| QQP | | no/yes | |
| QQP | | No/Yes | |
| AGNews | | World/Sports/Business/Technology | |
| AGNews | | World/Sports/Business/Technology | |
Dataset | Task | # Classes |
SST-2 Socher et al. (2013) | Sentiment Analysis | 2 |
Yelp-polarity Zhang et al. (2015) | Sentiment Analysis | 2 |
BoolQ Clark et al. (2019) | Yes/No QA | 2 |
BoolQ Contrast Set Gardner et al. (2020) | Yes/No QA | 2 |
QQP Wang et al. (2018) | Paraphrase Identification | 2 |
WiC Pilehvar and Camacho-Collados (2019) | Word Sense Disambiguation | 2 |
RTE Wang et al. (2018) | Natural Language Inference | 2 |
MNLI Williams et al. (2018) | Natural Language Inference | 3 |
MedNLI Romanov and Shivade (2018) | NLI in Medical Domain | 3 |
AGNews Zhang et al. (2015) | Topic Classification | 4 |
ARC-Easy Clark et al. (2018) | Multiple-Choice QA | 4 |
Task069 Mishra et al. (2022); Wang et al. (2022b) | Abductive NLI | 2 |
Task070 Mishra et al. (2022); Wang et al. (2022b) | Abductive NLI | 2 |


