
When Parts Are Greater Than Sums:
Individual LLM Components Can Outperform Full Models

Ting-Yun Chang   Jesse Thomason   Robin Jia
University of Southern California, Los Angeles, CA, USA
{tingyun, jessetho, robinjia}@usc.edu
Abstract

This paper studies in-context learning by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do well on a classification task, even when the full model performs poorly; bad-performing ones that do much worse than chance; and label-biased components that always predict the same label. We find that component accuracies are well-correlated across different demonstration sets and perturbations of prompt templates. Based on our findings, we propose component reweighting, which learns to linearly re-scale the component activations from a few labeled examples. Given 24 labeled examples, our method improves by an average of 6.0% accuracy points over 24-shot ICL across 8 tasks on Llama-2-7B. Overall, this paper both enriches our understanding of ICL and provides a practical method for improvement by examining model internals.


1 Introduction

The rapid progress in large language models (LLMs) has popularized prompting, which guides LLMs to perform tasks with instructions or examples. Notably, in-context learning (ICL; Brown et al., 2020) adapts LLMs to a new task using only a few labeled examples without parameter updates. However, how LLMs react to the in-context examples is sometimes unintuitive Min et al. (2022b). Recently, Sclar et al. (2024) and Voronov et al. (2024) find that even for instruction-tuned (Ouyang et al., 2022) or very large models, adding a space or newline in prompts can greatly affect accuracy.

Figure 1: Each dot represents a component (attention head or MLP) under 4-shot ICL on Llama-2-7B. The x-axis shows how often a component predicts “positive” on the test set. Up: We discover good-performing (blue), bad-performing (red), and label-biased (green) components. Down: Most components identified on SST2 show similar characteristics on Yelp-polarity.

We look into the LLM internals to understand what causes the surprising behavior across various ICL settings. Our work stands in contrast to prior studies, which often treat LLMs as black boxes and alter either the input Chen et al. (2023); Bertsch et al. (2024) or output Zhao et al. (2021); Holtzman et al. (2021). We introduce a new view of ICL by decomposing the output of an LLM into the sum of individual contributions of MLPs and attention heads, denoted “components.” Figure 1 reveals three types of curious components: good-performing ones (blue) that individually perform well or even outperform the full model, bad-performing ones (red) that perform below chance, and label-biased ones (green) that predict the same label on the entire test set. We observe these three classes of components on Llama-2-7B, Llama-2-13B Touvron et al. (2023), Llama-3-8B Dubey et al. (2024), and Mistral-Instruct-7B Jiang et al. (2023) across 8 classification tasks.

We study the sensitivity of LLM components to multiple prompts formed by different demonstrations and templates. We also construct contrast sets of templates—pairs of similar templates that yield large differences in ICL accuracy. Despite large variance in full-model accuracy, we find that component accuracies correlate well across different demonstrations (r=0.80 on average) and contrast set templates (r=0.57). The top-performing components in contrast set pairs overlap and achieve decent accuracy even when the full model performs near random (Figure 2). Nonetheless, the component accuracies of two sampled templates are less correlated (r=0.34). Further, good-performing components generalize well to out-of-distribution test sets. For instance, the top-1 component for MNLI outperforms the full Llama-2-13B model by 9.1% on MedNLI; Figure 1 also shows that components are transferable from SST2 to Yelp. We conclude that components are relatively consistent in their behavior across prompts and datasets.

Inspired by our findings, we propose component reweighting. Compared to prior work that selects prompts from a large pool of labeled data to improve ICL accuracy Liu et al. (2022b), component reweighting softly selects components by learning weights from few-shot examples to scale component activations. Training these weights only involves learning a linear layer, which takes less than a minute on one CPU. Overall, component reweighting better utilizes the same labeled examples, improving over 24-shot ICL by 6.0%, 2.2%, 5.1%, and 1.6% on Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B, respectively. At the same time, it enjoys an inference speed similar to 4-shot ICL.

Finally, we study the training dynamics of components using the Pythia pretraining checkpoints Biderman et al. (2023). During pretraining, good-performing components emerge well before the full model performs well. These findings suggest that LLMs acquire the internal ability to perform ICL early in training, but this ability only surfaces in the full model’s behavior later on.

Overall, our work conducts extensive analysis of LLM internals, which motivates a practical method to improve ICL. We hope to inspire future work that further sheds light on LLM internals in order to improve performance. Our implementation is available at https://github.com/terarachang/LLMDecomp.

2 Decomposing the Transformer in ICL

We introduce a new view of in-context learning by decomposing the Transformer architecture Vaswani et al. (2017). Our decomposition is exact—a mathematically equivalent formula for the model’s outputs—and enables us to analyze model internals without training additional parameters (unlike, e.g., probing). We first discuss what our new view offers over the standard view of ICL, and then walk through the mathematical details.

2.1 A New View of In-Context Learning

Standard view.

An LLM performs in-context learning (ICL) on a task based on a few demonstrations without training, where each demonstration is a templated example $(x, y)$ consisting of an input $x$ and a label word $y$. We refer to a sequence of $K$ demonstrations $[x_1, y_1, \dotsc, x_K, y_K]$ as a prompt. The LLM makes predictions on a test input $x_{\text{test}}$ conditioned on the prompt, denoted by $\arg\max_{y \in \mathcal{Y}} P(y \mid \textit{prompt}, x_{\text{test}})$, where $\mathcal{Y}$ is the set of possible label words in a classification task.

Our view.

The residual stream of an LLM directly carries the information of the initial hidden state, every attention head, and every MLP, collectively named “components,” towards the output layer. We view this information as the direct contributions[1] of components to the output logits, and derive a formula for the logits, $\sum_j \mathbf{g}_j$, where $\mathbf{g}_j$ is the direct contribution of the component indexed by $j$. We can obtain the predictions of component $j$ with $\arg\max_{y \in \mathcal{Y}} \mathbf{g}_j$, and then calculate its individual ICL accuracy. Specifically, we derive $\mathbf{g}_j = U \cdot C_j$ in Eq. 8 below, where $U$ is the output embedding matrix and $C_j$ is the post-layernorm activation of component $j$. We name the operation $(C_j \mapsto U \cdot C_j)$ early decode, sharing the same spirit as nostalgebraist (2020) and Geva et al. (2022), which interpret hidden representations by decoding them through $U$. Compared to the standard view, we can directly study the behavior of individual components (Figure 2), characterizing them and scaling their contributions to the model output.
[1] In comparison, a component has indirect contributions to the output by affecting other components in later layers Wang et al. (2023a). This paper focuses on direct contributions.

Figure 2: Left: Transformer decomposition. The components—MLPs and attention heads—are filled with blue, and the blue lines show the flow of early decoding. Right: We can calculate the individual accuracy of every component after decomposition. Although a pair of templates that only differ slightly yield very different accuracies (0.39 vs. 0.89 on AGNews with Llama-2-7B), the accuracies of their internal components are highly correlated. The top components for Template 1 overlap with the ones for Template 2 and achieve >0.7 accuracy despite the poor full-model accuracy.

2.2 A Walkthrough of the Decomposition

A Transformer of $L$ layers contains a multi-headed attention block (MHA) and an MLP in every layer. Let $a^{(l)} \in \mathbb{R}^d$ and $m^{(l)} \in \mathbb{R}^d$ be the output of the MHA and the MLP at layer $l$, respectively. Due to residual connections, the hidden state $x^{(l)} \in \mathbb{R}^d$ is:

x^{(l)} = x^{(l-1)} + a^{(l)} + m^{(l)},    (1)

x^{(L)} = x^{(0)} + \sum_{l=1}^{L} \left( a^{(l)} + m^{(l)} \right).    (2)

Note that GPT2-like LLMs apply layernorm before MHA and MLP Radford et al. (2019); thus, layernorm is already taken into account as part of the formula for computing $a^{(l)}$ and $m^{(l)}$ (see A.3).

The MHA output $a^{(l)}$ is composed of $n$ attention heads:

a^{(l)} = W_o^{(l)} \cdot \text{Concat}([h_1^{(l)}, \dots, h_n^{(l)}])    (3)

for $h_i^{(l)} \in \mathbb{R}^{d_{\text{head}}}$ a head and $W_o^{(l)} \in \mathbb{R}^{d \times n d_{\text{head}}}$ the output projection in MHA aggregating all heads. Elhage et al. (2021) rewrite Eq. 3 by segmenting $W_o^{(l)}$ into $n$ matrices $W_{o_i}^{(l)} \in \mathbb{R}^{d \times d_{\text{head}}}$:

a^{(l)} = \sum_{i=1}^{n} \left( W_{o_i}^{(l)} \cdot h_i^{(l)} \right) = \sum_{i=1}^{n} \tilde{h}_i^{(l)},    (4)

\text{where } [W_{o_1}^{(l)}, \dots, W_{o_n}^{(l)}] = W_o^{(l)}    (5)

Thus, we can treat each head as a single component adding $\tilde{h}_i^{(l)} = W_{o_i}^{(l)} \cdot h_i^{(l)}$ to the residual stream.
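As a concrete illustration of Eqs. 4-5, the following minimal PyTorch sketch splits a layer's output projection into per-head blocks and returns each head's write $\tilde{h}_i^{(l)}$ to the residual stream. The variable names and shapes are our own assumptions for exposition, not the paper's released code; it assumes the heads are concatenated in head order, as in Eq. 3.

```python
import torch

def per_head_contributions(W_o: torch.Tensor, heads: torch.Tensor) -> torch.Tensor:
    """Split the MHA output projection into per-head matrices (Eqs. 4-5).

    W_o:   (d, n * d_head)  output projection of one layer
    heads: (n, d_head)      pre-projection head outputs h_i at the last token
    Returns a (n, d) tensor whose row i is W_{o_i} @ h_i, head i's write to the residual stream.
    """
    d, n_times_dhead = W_o.shape
    n, d_head = heads.shape
    assert n_times_dhead == n * d_head
    # Segment W_o column-wise into n blocks of shape (d, d_head), one per head.
    W_o_blocks = W_o.view(d, n, d_head).permute(1, 0, 2)   # (n, d, d_head)
    return torch.einsum("nij,nj->ni", W_o_blocks, heads)   # (n, d)
```

Summing the returned rows recovers the usual MHA output $a^{(l)}$, which is what makes treating each head as a separate additive component exact.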

Model                Method     SST2       BoolQ      QQP        WiC        RTE        MNLI       AGNews     ARC-Easy   Avg.
Llama-2-7B           Full       75.8±18.1  69.2±12.0  61.3±9.9   52.4±3.0   68.9±3.2   34.4±1.7   70.0±19.9  57.5±14.4  61.2
Llama-2-7B           Oracle-T1  91.7±0.9   69.7±7.7   67.8±4.3   57.8±1.1   64.6±2.7   46.3±3.3   80.8±5.2   54.5±10.1  66.6
Llama-2-7B           Oracle-B1  12.1±2.7   34.1±7.3   32.5±3.9   42.9±1.2   34.7±2.8   24.1±2.4   3.0±1.1    12.7±4.2   24.5
Llama-2-13B          Full       89.0±5.3   77.6±6.8   71.0±6.8   55.0±3.8   75.1±2.3   45.7±7.9   70.8±20.6  73.2±13.7  69.7
Llama-2-13B          Oracle-T1  92.5±0.6   77.5±6.0   73.5±2.9   60.4±1.2   75.7±2.3   56.4±4.7   84.6±3.6   73.1±7.9   74.2
Llama-2-13B          Oracle-B1  8.2±1.0    27.1±9.7   31.8±3.4   39.5±1.6   27.9±2.8   18.6±2.6   1.8±0.9    5.4±3.5    20.0
Mistral-Instruct-7B  Full       90.1±2.9   81.3±2.1   70.9±7.2   58.5±4.2   80.5±1.7   56.1±5.0   83.0±5.7   79.8±1.4   75.0
Mistral-Instruct-7B  Oracle-T1  91.9±0.7   80.8±2.0   75.6±2.6   60.6±2.2   81.3±0.8   61.5±3.3   83.7±4.3   78.5±2.2   76.7
Mistral-Instruct-7B  Oracle-B1  8.1±0.9    19.5±2.5   25.8±4.1   39.3±2.8   20.0±1.7   14.6±2.9   1.8±0.7    4.6±1.3    16.7
Llama-3-8B           Full       91.4±1.7   79.2±7.2   74.0±8.0   58.7±4.7   76.5±2.2   59.4±3.7   84.0±6.6   87.4±5.5   76.3
Llama-3-8B           Oracle-T1  92.3±1.0   77.4±7.3   77.4±3.7   64.5±2.7   76.3±2.9   60.7±1.5   81.4±5.7   86.0±5.9   77.0
Llama-3-8B           Oracle-B1  9.0±0.9    22.5±7.4   23.5±3.9   36.7±3.3   23.4±2.1   10.0±4.2   1.4±0.6    1.9±0.9    16.0
Random               --         50.0       50.0       50.0       50.0       50.0       33.3       25.0       25.0       41.7
Table 1: {3,4}-shot ICL accuracy of 8 tasks and the average accuracy (Avg.). We run 15 prompts for each task (see section 3) and report the mean accuracy and standard deviation. We show the existence of good components (Oracle-T1) inside LLMs that individually perform on par with the full model (Full) on diverse tasks. Similarly, there exist bad components (Oracle-B1) that perform substantially below chance (Random).

Finally, through the output embedding matrix $U \in \mathbb{R}^{|\text{Vocab}| \times d}$, the output logits are:

\text{logits} = U \cdot \text{LN}(x^{(L)})
= U \cdot \text{LN}\left( x^{(0)} + \sum_{l=1}^{L} \sum_{i=1}^{n} \tilde{h}_i^{(l)} + \sum_{l=1}^{L} m^{(l)} \right)
= U \cdot \text{LN}\left( \sum_{j=1}^{1 + L \times n + L} z_j \right),    (6)

where $z = [x^{(0)}, \tilde{h}_1^{(1)}, \dotsc, \tilde{h}_n^{(L)}, m^{(1)}, \dotsc, m^{(L)}]$ in Eq. 6 and we index every term in the summation with $j$. $\text{LN}(\cdot)$ denotes the final layernorm, specifically RMSNorm Zhang and Sennrich (2019) for the LLMs in our paper (see A.3). In Eq. 6, $\text{LN}(\sum_j z_j) = \frac{\sum_j z_j}{\text{RMS}(\sum_j z_j)} \odot \gamma$, where RMS denotes the root mean square, $\odot$ denotes element-wise multiplication, and $\gamma \in \mathbb{R}^d$ is the affine parameter. By pre-computing $\hat{\gamma} = \frac{\gamma}{\text{RMS}(\sum_j z_j)}$, we have:

\text{logits} = U \cdot \left( \sum_j z_j \odot \hat{\gamma} \right)    (7)
= \sum_j U \cdot C_j, \quad \text{where } C_j = z_j \odot \hat{\gamma}    (8)

We refer to all $C_j \in \mathbb{R}^d$ as the component activations, which include the activations of attention heads and MLPs after the final layernorm.[2] Now that we have broken down the Transformer output into simple additions in Eq. 8, we can easily analyze the direct contribution of each component to the logits through the residual stream, $\mathbf{g}_j = U \cdot C_j$.

[2] Empirically, we find that $x^{(0)}$ has near-random ICL accuracy on all the tasks, so we omit it in the rest of the paper.

In ICL, we only need to do the decomposition when the LLM starts to generate, i.e., when processing the last token of the input. The computations on the other tokens are the same as in standard ICL. In all our experiments, we use single-token label words. We use multiple templates from Bach et al. (2022) that cover diverse label words for each task.
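To make the decomposition concrete, the sketch below shows the early-decode step of Eqs. 6-8 in PyTorch. It is a reading of the equations under assumed shapes and variable names, not the paper's released implementation: z collects every component's residual-stream write at the last input token (including $x^{(0)}$ so the RMS statistic matches the full forward pass), gamma is the final RMSNorm weight, and label_ids are the token ids of the single-token label words.

```python
import torch

def early_decode_components(z: torch.Tensor, gamma: torch.Tensor, U: torch.Tensor,
                            label_ids: list, eps: float = 1e-6) -> torch.Tensor:
    """Early-decode each component's direct contribution to the label-word logits.

    z:         (N, d) residual-stream writes z_j of all N components at the last token
    gamma:     (d,)   affine weight of the final RMSNorm
    U:         (V, d) output embedding (unembedding) matrix
    label_ids: token ids of the label words in Y
    Returns g of shape (N, |Y|), where g[j] = U_Y @ C_j and C_j = z_j * gamma_hat.
    """
    x_final = z.sum(dim=0)                            # x^(L), reconstructed from the writes
    rms = torch.sqrt(x_final.pow(2).mean() + eps)     # RMS over the hidden dimension
    gamma_hat = gamma / rms                           # pre-computed scaling from Eq. 7
    C = z * gamma_hat                                 # (N, d) post-layernorm component activations
    U_Y = U[label_ids]                                # (|Y|, d) label-word output embeddings
    return C @ U_Y.T                                  # (N, |Y|) per-component logits g_j

# Per-component ICL predictions: each component votes for the label word with the largest logit.
# preds = early_decode_components(z, gamma, U, label_ids).argmax(dim=-1)
```

Summing the returned rows over components recovers the full-model label-word logits, which is what makes the per-component accuracies in the next section well defined.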

3 Characterizing Components for ICL

We conduct in-context learning across 8 classification tasks on 4 LLMs: Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B. ICL is sensitive to prompts, so we randomly sample 5 disjoint sets of demonstrations formatted with 3 templates and report the standard deviation across the 15 runs. To avoid majority and recency biases Zhao et al. (2021), each prompt consists of the same number of demonstrations from every class in shuffled order. We use $K=3$ demonstrations for 3-way classification tasks and $K=4$ for the other tasks. Outside of section 5.1, we use $K=\{3,4\}$ by default. We sample 2000 examples with balanced labels as the test set for every task. Please see A.1 for details about the tasks and templates.

3.1 Good and Bad-Performing Components

Across all the tasks and LLMs, we observe good-performing components that perform well or even outperform the full model, and bad-performing components that individually perform much worse than chance (blue and red dots in Figure 1, respectively). Table 1 compares the full model (Full) with the top-1 (Oracle-T1) and bottom-1 (Oracle-B1) components selected on the test set. On average, Oracle-T1 outperforms Full by 5.4%, 4.5%, 1.7%, and 0.7% on Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B, respectively; Oracle-B1 underperforms random guessing (Random) by 17.2%, 20.7%, 25.0%, and 25.7%.

3.2 Label-Biased Components

Besides good and bad-performing components, we also observe label-biased components, which predict a certain label on the entire test set (the green dots in Figure 1). These components exist in all the tasks and LLMs we study, accounting for 29.1%, 26.4%, 22.8%, and 29.7% of components on average in Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B, respectively (Table 5). In A.2, we show that even when we prompt the model with demonstrations that all have positive labels, the most biased component still insists on predicting “negative” on the entire test set, and vice versa.

3.3 Mechanistic Understanding of Bad-Performing Heads

Prior work studies the mechanism of certain components in LLMs, showing that there are negative mover attention heads that write in the opposite direction of the expected answer Wang et al. (2023a) and copy suppression heads that suppress the prediction of a prior token in the context McDougall et al. (2023). Inspired by them, we investigate the mechanism behind bad-performing heads identified by our decomposition. We focus on label tokens in the context, as Wang et al. (2023b) show that label words serve as anchors in ICL.

We conduct a case study on Llama-2-7B with 4-shot balanced in-context examples from SST2. We examine the bottom-5 attention heads that have the worst ICL accuracy on SST2. We find that three of these heads, L19H15, L15H14, and L18H9, assign top attention probabilities to all 4 label tokens of the 4-shot in-context examples when predicting test examples. Furthermore, despite their poor ICL accuracy, these heads actually assign higher attention to the correct in-context label tokens than the incorrect ones most of the time (on >70% of the test examples). In other words, when a test example has a positive label, these heads assign higher attention[3] to the tokens “positive” in the context than the tokens “negative”. We also observe that the more a head attends to “positive” in the context, the lower the inner product between the head and the output embedding of the token “positive”, with correlations r = -0.97, -0.96, and -0.89 for L19H15, L15H14, and L18H9, respectively.
[3] We average the attention probabilities of the same label tokens and then compare the average ones of the two labels.
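The label-attention check described above can be sketched as follows. This is a simplified illustration with assumed inputs; the label-token positions and the head's attention row would come from a forward pass that returns attention weights.

```python
import torch

def attends_more_to_correct_label(attn_row: torch.Tensor,
                                  pos_positions: list,
                                  neg_positions: list,
                                  gold_is_positive: bool) -> bool:
    """Does this head attend more to the correct in-context label tokens than the incorrect ones?

    attn_row: (seq_len,) one head's attention distribution from the final input position.
    pos_positions / neg_positions: indices of the "positive" / "negative" label tokens in the prompt.
    """
    # Average attention over repeated occurrences of the same label word (see footnote [3]).
    pos_attn = attn_row[pos_positions].mean()
    neg_attn = attn_row[neg_positions].mean()
    correct, incorrect = (pos_attn, neg_attn) if gold_is_positive else (neg_attn, pos_attn)
    return bool(correct > incorrect)
```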

In summary, we show that some bad-performing heads attend highly to prior label tokens and decrease the output probability of the correct one, which shares similarities with the copy suppression heads and negative mover heads McDougall et al. (2023); Wang et al. (2023a). However, we do not observe similar behavior in other tasks, where the bad-performing heads usually attend to “<s>”, “?”, or “\n”. We invite future work to further analyze how bad-performing heads function in general.

Metric  Setting    SST2  BoolQ  QQP   AGNews  ARC
Corr    (1) Demo   0.81  0.84   0.60  0.89    0.88
Corr    (2) Temp   0.40  0.16   0.03  0.68    0.44
Corr    (3) Cst T  0.72  0.63   0.23  0.82    0.46
IoU     (1) Demo   0.36  0.74   0.27  0.63    0.70
IoU     (2) Temp   0.12  0.01   0.01  0.20    0.20
IoU     (3) Cst T  0.40  0.23   0.02  0.36    0.45
Table 2: The average correlation and IoU between (1) two random sets of demonstrations, (2) two random templates, and (3) two minimally contrastive templates.

4 Transferability of Components

We observe moderate to high component transferability across demonstrations, minimally contrastive templates, and data distributions, whereas there is little transferability across randomly sampled templates. Our decomposition uncovers hidden abilities of individual components when the full model performs poorly.

4.1 Transfer across Prompt Variants

We first measure the agreement in component accuracies between (1) two disjoint sets of demonstrations with a fixed template, (2) two randomly sampled templates with fixed demonstrations, and (3) two minimally contrastive templates with fixed demonstrations. Recall that we have 5 sets of demonstrations and 3 templates in total (section 3); here, we calculate the average agreement between every pair. For (3), we construct contrast sets Gardner et al. (2020) by minimally editing the worst-performing template out of the 3 templates into a good template, which yields at least 10% improvement in average accuracy. Our edits include adding a space, removing a newline, or changing label words (see Table 10). We use two metrics to measure the agreement between each pair: the Pearson correlation of the accuracies of all components, and the intersection over union (IoU) on the sets of top-5 components, which measures whether the top-performing components of the pair overlap.
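Both agreement metrics can be computed in a few lines of NumPy. The sketch below uses our own naming and is not the paper's exact script; it takes the per-component accuracy vectors from two runs and returns the Pearson correlation and top-k IoU.

```python
import numpy as np

def component_agreement(acc_a: np.ndarray, acc_b: np.ndarray, k: int = 5):
    """Agreement between per-component accuracies of two runs (e.g., two demonstration sets).

    acc_a, acc_b: (N,) arrays with the ICL accuracy of every component under run A / run B.
    Returns (Pearson correlation over all components, IoU of the top-k component sets).
    """
    corr = float(np.corrcoef(acc_a, acc_b)[0, 1])
    top_a = set(np.argsort(acc_a)[-k:].tolist())   # indices of the k most accurate components
    top_b = set(np.argsort(acc_b)[-k:].tolist())
    iou = len(top_a & top_b) / len(top_a | top_b)
    return corr, iou
```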

Table 2 summarizes the results on Llama-2-7B; A.6 shows similar findings on other models. (1) The accuracies of the internal components are highly consistent across different choices of demonstrations, with strong correlations and an average IoU of 0.54. (2) The components have much weaker agreement across randomly sampled templates, with a near-zero IoU on BoolQ and QQP. (3) Nevertheless, there is agreement between minimally contrastive templates (Cst T), with an average correlation of 0.57 across tasks, despite their contrasting full-model accuracies. For example, Figure 2 demonstrates that full-model accuracy changes dramatically (39% vs. 89%) in a minimal pair of templates, but the internal components have a high correlation of 0.81 and the pair shares top-performing components. Combining (2) and (3) suggests that components behave similarly on similar templates, but this similarity decreases as the templates diverge.

4.2 Transfer to Out-of-Distribution Test Sets

We further study whether the best component selected on the test set can still perform well on an out-of-distribution (OOD) test set. We refer to this method, which uses a single component to make predictions, as Transfer-1. Specifically, we study component transferability from SST2 to Yelp-polarity, MNLI to MedNLI, and BoolQ to the BoolQ Contrast Set. We compare Transfer-1 with using the full model (Full) on the OOD test sets. To understand the best possible Transfer-1 accuracy, we also report the best component accuracy directly selected on the OOD set, Oracle-1.

Table 3 shows that Transfer-1 closely matches Oracle-1 overall, suggesting that the top-performing components are transferable across data distributions. Moreover, Transfer-1 sometimes outperforms Full, especially on the Llama-2 models, showing the hidden abilities of the internal components.

4.3 Transfer between Two Opposite Tasks

We conduct a case study of component transferability across instructions using Task069 and Task070 of Super-NaturalInstructions Wang et al. (2022b), both of which are binary abductive NLI tasks Bhagavatula et al. (2020). The instruction for Task069 asks for correct answers, while Task070 asks for incorrect ones (“pick the one that makes less sense;” see Figure 7 for the full instructions). Examples in the two tasks are not parallel.

We find that Mistral-Instruct-7B achieves good accuracy across 15 runs on Task069 (76.8 ± 2.4), but below-chance accuracy on Task070 (40.6 ± 5.4). We observe a strong negative correlation, r = -0.60 on average, between the component accuracies of the two tasks. The worst-performing components on Task069 become the top-performing ones on Task070 and vice versa. This correlation suggests that the model has the ability to solve Task070, but misunderstands negation. Thus, we apply the Transfer-1 method (section 4.2) but select the worst-performing component from Task069 and then calculate its individual accuracy on Task070. Transfer-1 achieves 58.7 ± 4.8 accuracy across the 15 runs, an improvement of 18.1% over the full model. These results suggest that components behave consistently even across tasks with opposite instructions, as the active components in Task069 are also active in Task070.

Model                Method      Yelp-polarity  MedNLI    BoolQ Cst
Llama-2-7B           Full        84.7±15.4      34.3±1.7  64.9±9.8
Llama-2-7B           Transfer-1  94.9±3.1       42.6±4.7  64.3±7.9
Llama-2-7B           Oracle-1    96.9±0.7       48.8±2.3  66.2±5.7
Llama-2-13B          Full        95.9±1.4       46.8±9.6  72.0±7.6
Llama-2-13B          Transfer-1  96.0±1.8       55.9±4.0  72.3±6.5
Llama-2-13B          Oracle-1    97.1±0.4       57.0±3.7  73.0±6.1
Mistral-Instruct-7B  Full        97.0±0.5       57.3±5.7  74.6±3.5
Mistral-Instruct-7B  Transfer-1  95.6±1.6       61.9±4.8  73.7±3.7
Mistral-Instruct-7B  Oracle-1    97.1±0.4       62.7±4.1  74.5±3.6
Llama-3-8B           Full        97.8±0.4       61.0±2.2  77.3±7.5
Llama-3-8B           Transfer-1  95.9±4.4       61.3±0.8  73.9±8.4
Llama-3-8B           Oracle-1    97.9±0.5       61.6±0.6  74.8±8.9
Table 3: The average ICL accuracy and standard deviation on OOD test sets. The components selected on the in-distribution test sets (Transfer-1) can transfer to OOD sets, performing similarly to the oracle components (Oracle-1) directly selected on the OOD sets.

5 Component Reweighting

5.1 Proposed Method

Our findings in section 4 show the promising direction of selecting internal components to improve ICL. Therefore, we propose a method that reweights components by learning a weight $w_j \in \mathbb{R}$ on every component activation $C_j$. Reweighting is a soft version of selection, which can be learned by gradient descent on very few examples.

Given $K$ labeled examples, instead of using all of them as ICL demonstrations, we divide them into a demonstration set $\mathcal{D}_{\text{demo}}$ and a training set $\mathcal{D}_{\text{train}}$. We first randomly sample $K' = \{3,4\}$ examples with balanced labels as demonstrations and use the remaining examples as $\mathcal{D}_{\text{train}}$ to train the component weights. Specifically, we can rewrite Eq. 8 as $\text{logits} = \sum_j w_j (U \cdot C_j)$, where $w_j = 1$ for all $j$. Because of the existence of good and bad-performing components, weighing all components equally may not be optimal. Therefore, we tune the weights $w \in \mathbb{R}^N$ of the $N$ components on $\mathcal{D}_{\text{train}}$ with cross-entropy loss and $L_1$ regularization, while keeping the LLM frozen:

\mathcal{L} = \sum_{(x,y) \in \mathcal{D}_{\text{train}}} -\log P_{rw}(y \mid x) + \lambda \|w\|_1,    (9)

P_{rw}(y \mid x) = \text{softmax}\left( \sum_{j=1}^{N} w_j \, (U_{\mathcal{Y}} \cdot C_j) \right)_y,

where $U_{\mathcal{Y}} \in \mathbb{R}^{|\mathcal{Y}| \times d}$ is a submatrix of $U$ that comprises the output embeddings of the label words, $P_{rw}$ is the probability distribution of the LLM after reweighting, and $\lambda$ is the hyperparameter of the $L_1$ loss, which encourages sparsity in the component weights. We obtain the activations $\{C_j\}_{j=1}^{N}$ of all components in one $K'$-shot forward pass, computed on the prompt derived from $\mathcal{D}_{\text{demo}}$, followed by $x$. Our method scales each component's direct contribution to the logits ($U_{\mathcal{Y}} \cdot C_j \in \mathbb{R}^{|\mathcal{Y}|}$) by $w_j$. In practice, we cache these contributions on the entire training set as input features to the linear layer $w$, which allows us to discard the entire LLM while training $w$ (lines 9 and 13 in Algorithm 1), saving tremendous training time and GPU memory. The cache only requires $O(|\mathcal{Y}| \times N \times |\mathcal{D}_{\text{train}}|)$ space. At inference time, the overhead of our method over $K'$-shot ICL is to early decode the $N$ components and apply the learned weights, i.e., $\sum_{j=1}^{N} w_j (U_{\mathcal{Y}} \cdot C_j)$. As both $|\mathcal{Y}|$ and $N$ are small ($N < 2000$ for all LLMs in this paper), the overhead is negligible compared to the computation of the LLM itself.

Algorithm 1 Component Reweighting
1: Input: K labeled examples, a test set D_test, a set of label words Y, an LLM M, the number of components N
2: Output: Z, the predictions of M on D_test
3: Split the K examples into a prompt consisting of K′ demonstrations and a training set D_train of K − K′ examples
4: U_Y ← concatenate the output embeddings of Y in M
5: Initialize G_train ← ∅
6: for (x, y) ∈ D_train do
7:     {C_j}_{j=1..N} ← M(prompt, x)                ▷ K′-shot ICL
8:     for j ← 1 to N do
9:         G_train ← G_train ∪ {U_Y · C_j}          ▷ early decode
10:    end for
11: end for
12: Initialize w ← [1, …, 1] ∈ R^N
13: Train the weights w on G_train with Eq. 9
14: Initialize Z ← ∅                                ▷ Start Inference
15: for (x, y) ∈ D_test do
16:     {C_j}_{j=1..N} ← M(prompt, x)               ▷ K′-shot ICL
17:     Initialize g ← [0, …, 0] ∈ R^{|Y|}
18:     for j ← 1 to N do                           ▷ Test-Time Overhead
19:         g ← g + w_j (U_Y · C_j)                 ▷ early decode
20:     end for
21:     ŷ ← argmax_{y∈Y} g
22:     Z ← Z ∪ {ŷ}
23: end for
24: return Z
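Lines 12-13 of Algorithm 1 amount to fitting a single linear layer over the cached early-decoded contributions. Below is a minimal PyTorch sketch of that step; the optimizer, learning rate, number of epochs, and λ are illustrative assumptions rather than the paper's settings (the actual training details are in A.8).

```python
import torch
import torch.nn.functional as F

def train_component_weights(G_train: torch.Tensor, labels: torch.Tensor,
                            lam: float = 1e-3, lr: float = 1e-2, epochs: int = 100) -> torch.Tensor:
    """Learn the component weights w of Eq. 9 on cached contributions, with the LLM frozen.

    G_train: (M, N, |Y|) cached U_Y @ C_j for every training example and component
    labels:  (M,)        gold label indices
    """
    _, N, _ = G_train.shape
    w = torch.ones(N, requires_grad=True)                 # initialize at 1, i.e., the full model
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(epochs):
        logits = torch.einsum("n,mny->my", w, G_train)    # reweighted sum over components
        loss = F.cross_entropy(logits, labels) + lam * w.abs().sum()   # CE + L1 sparsity
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()

# Inference (lines 17-21): logits = torch.einsum("n,ny->y", w, G_test); prediction = logits.argmax()
```

Because the features are only N × |Y| numbers per example, this fits comfortably on a CPU, which is why training takes under a minute.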

5.2 Baselines

Standard ICL.

The simplest baseline is to use all $K$ labeled examples as demonstrations. Since the other methods use $K'$ examples as demonstrations, we report the accuracy of standard $K'$-shot ICL using the same $\mathcal{D}_{\text{demo}}$ for reference.

Prompt Selection.

Liu et al. (2022b) improve ICL accuracy by selecting demonstrations from a pool of labeled data for each test example. Here, we select from the given $K$ labeled examples. Following Rubin et al. (2022), we use SBERT Reimers and Gurevych (2019) to encode examples into sentence embeddings and select the $K' = \{3,4\}$ nearest neighbors under cosine similarity as the demonstrations for each test example.

Calibration.

As LLMs tend to predict certain classes over others, Zhao et al. (2021) reweight the output class probabilities. They use context-free inputs, such as “N/A”, to calibrate the probability distribution. However, Fei et al. (2023) and Zhou et al. (2023) find context-free inputs sometimes ineffective, because in-domain context is important for calibration. Thus, we introduce Calib+, which calibrates the original probabilities $\mathbf{p} \in \mathbb{R}^{|\mathcal{Y}|}$ with a training set of in-distribution labeled examples, $\mathcal{D}_{\text{train}}$. We train the calibration weights $\mathbf{v} \in \mathbb{R}^{|\mathcal{Y}|}$ on $\mathcal{D}_{\text{train}}$ with cross-entropy loss and obtain the calibrated probabilities $\hat{\mathbf{p}} = \text{softmax}(\mathbf{v} \cdot \mathbf{p})$. For direct comparisons, Calib+ splits the $K$ examples into the same $\mathcal{D}_{\text{demo}}$ and $\mathcal{D}_{\text{train}}$ sets as component reweighting, where $|\mathcal{D}_{\text{demo}}| = K'$. We include the training details of both methods in A.8.
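For reference, Calib+ can be implemented analogously to component reweighting, but over the $|\mathcal{Y}|$ output probabilities instead of the $N$ components. The sketch below is our own reading: it interprets $\mathbf{v} \cdot \mathbf{p}$ as element-wise scaling of the class probabilities, and the hyperparameters are assumptions (see A.8 for the actual training details).

```python
import torch
import torch.nn.functional as F

def train_calibration_weights(P_train: torch.Tensor, labels: torch.Tensor,
                              lr: float = 1e-2, epochs: int = 100) -> torch.Tensor:
    """Calib+: learn per-class weights v so that p_hat = softmax(v * p).

    P_train: (M, |Y|) full-model label-word probabilities p on D_train
    labels:  (M,)     gold label indices
    """
    v = torch.ones(P_train.shape[1], requires_grad=True)
    optimizer = torch.optim.Adam([v], lr=lr)
    for _ in range(epochs):
        log_p_hat = F.log_softmax(v * P_train, dim=-1)    # calibrated distribution
        loss = F.nll_loss(log_p_hat, labels)              # cross-entropy on D_train
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return v.detach()
```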

Model / Method        SST2       BoolQ      QQP        WiC       RTE       MNLI      AGNews     ARC-Easy   Avg.
Llama-2-7B
  Standard 3,4        75.8±18.1  69.2±12.0  61.3±9.9   52.4±3.0  68.9±3.2  34.4±1.7  70.0±19.9  57.5±14.4  61.2
  Standard 12         77.8±19.6  71.6±8.0   63.6±7.8   52.5±2.4  71.1±2.1  37.0±2.8  69.0±20.8  59.6±13.9  62.8
  PromptS 12          73.8±19.2  69.4±10.5  62.2±6.1   53.1±2.7  65.5±1.8  35.5±1.6  59.1±28.7  58.7±11.9  59.7
  Calib+ 12           85.1±6.0   69.2±13.6  73.6±6.1   55.1±5.1  70.3±2.7  45.5±7.8  77.8±12.2  58.6±14.6  66.9
  CompRW 12           88.5±2.8   70.4±11.2  71.4±5.4   56.3±3.4  70.0±2.8  48.3±4.8  87.4±2.3   58.3±13.6  68.8
  Standard 24         77.8±19.5  71.6±7.3   66.4±5.0   53.2±3.3  71.9±1.5  39.9±3.6  71.1±20.0  58.3±16.2  63.8
  PromptS 24          74.2±20.4  68.9±10.2  62.1±4.9   53.6±1.9  64.8±0.9  36.4±1.5  57.5±30.2  58.0±12.5  59.4
  Calib+ 24           87.6±5.0   70.3±11.9  73.4±5.5   55.8±4.9  70.4±2.7  46.4±6.7  78.4±11.8  59.2±14.4  67.7
  CompRW 24           90.6±1.7   71.7±9.4   71.9±4.4   57.1±3.0  70.0±4.1  49.8±4.0  88.1±2.1   58.8±13.6  69.8
Llama-2-13B
  Standard 3,4        89.0±5.3   77.6±6.8   71.0±6.8   55.0±3.8  75.1±2.3  45.7±7.9  70.8±20.6  73.2±13.7  69.7
  Standard 12         91.3±1.9   78.1±7.4   70.5±7.3   59.6±2.4  74.4±3.5  55.1±6.2  84.7±7.8   71.2±16.4  73.1
  PromptS 12          83.8±10.2  74.9±6.6   64.6±5.7   57.0±2.1  69.5±3.5  48.1±5.4  64.4±29.6  74.2±9.3   67.1
  Calib+ 12           89.4±3.2   78.4±6.1   72.1±4.1   58.1±5.1  75.3±1.9  57.3±4.5  81.5±8.7   74.7±9.3   73.3
  CompRW 12           89.1±3.2   77.7±6.7   72.7±3.3   58.7±4.0  76.2±2.0  60.2±3.7  88.1±1.7   76.2±6.8   74.9
  Standard 24         91.9±0.6   77.7±8.2   69.5±8.5   60.6±1.6  74.7±3.3  58.2±7.0  85.8±4.4   69.1±17.7  73.5
  PromptS 24          81.9±13.2  75.1±5.7   64.9±4.8   57.3±1.8  69.5±1.7  49.8±5.1  65.2±28.9  74.2±9.4   67.2
  Calib+ 24           90.7±2.1   78.6±6.2   73.1±4.3   59.5±3.2  75.9±1.9  58.4±2.8  82.0±8.4   75.2±9.1   74.2
  CompRW 24           91.0±1.8   78.2±6.4   74.2±3.1   58.5±4.1  77.1±1.8  62.0±3.7  88.8±1.4   76.1±7.2   75.7
Mistral-Instruct-7B
  Standard 3,4        90.1±2.9   81.3±2.1   70.9±7.2   58.5±4.2  80.5±1.7  56.1±5.0  83.0±5.7   79.8±1.4   75.0
  Standard 12         91.4±0.9   81.2±2.2   67.9±8.7   57.7±2.8  79.1±1.6  57.2±3.6  85.4±3.6   77.7±5.6   74.7
  PromptS 12          90.3±2.5   81.1±1.9   68.7±5.8   57.1±2.7  79.1±1.6  56.7±3.2  84.9±3.0   79.0±3.0   74.6
  Calib+ 12           91.5±1.6   81.3±1.8   75.8±2.6   58.3±6.6  81.0±1.3  61.9±4.7  85.4±4.0   79.6±1.6   76.9
  CompRW 12           89.9±2.7   80.7±2.7   75.1±2.9   60.0±4.9  81.1±1.3  64.7±4.6  87.6±2.1   79.2±1.2   77.3
  Standard 24         91.2±1.0   80.8±2.3   65.3±8.4   57.4±4.0  75.6±1.7  56.6±6.5  85.8±4.3   68.8±16.9  72.7
  PromptS 24          90.5±2.6   81.3±2.0   68.9±5.6   57.1±2.1  79.1±1.7  57.4±3.1  86.0±2.1   78.7±3.3   74.9
  Calib+ 24           91.6±1.5   80.9±2.0   76.1±2.4   59.5±5.4  81.2±0.9  62.7±4.3  85.9±3.7   80.1±1.2   77.2
  CompRW 24           90.8±1.8   80.6±2.1   76.4±1.7   60.7±4.4  81.6±1.0  65.3±3.4  88.0±1.8   79.0±1.6   77.8
Llama-3-8B
  Standard 3,4        91.4±1.7   79.2±7.2   74.0±8.0   58.7±4.7  76.5±2.2  59.4±3.7  84.0±6.6   87.4±5.5   76.3
  Standard 12         92.2±0.6   79.6±6.9   73.1±5.0   63.3±2.4  77.5±2.2  62.7±3.8  87.7±2.0   82.1±18.1  77.3
  Calib+ 12           91.1±1.5   79.2±5.8   77.9±3.3   60.5±9.3  77.3±2.1  65.2±3.2  86.4±4.7   87.7±4.0   78.2
  CompRW 12           90.7±2.0   78.3±6.7   77.2±2.9   61.8±6.4  78.0±1.8  66.9±2.4  89.1±1.0   87.4±3.8   78.7
  Standard 24         92.2±0.8   78.2±7.2   78.0±2.0   63.8±1.8  76.2±3.0  66.4±2.3  87.9±1.9   80.6±18.9  77.9
  Calib+ 24           91.7±1.4   80.0±6.1   78.3±3.4   63.8±2.9  78.1±1.7  66.0±2.4  86.7±4.9   87.7±3.8   79.0
  CompRW 24           91.6±1.7   79.1±6.9   78.8±2.7   63.7±3.3  78.5±1.4  67.4±2.6  89.5±1.1   87.4±3.2   79.5
Table 4: ICL accuracy of 8 classification tasks and the average accuracy (Avg.). The number after a method denotes the number of labeled examples used. We run 15 prompts for each task (5 disjoint sets of K labeled examples and 3 templates) and report the mean accuracy and standard deviation. CompRW achieves the best average accuracy in all setups.

5.3 Results

We set $K = \{12, 24\}$. Table 4 compares our component reweighting (CompRW) with standard ICL (Standard), prompt selection (PromptS), and calibration (Calib+). First, we find that simply increasing the number of demonstrations from 4 to 24 yields limited improvements in ICL accuracy, while the longer prompt greatly increases the inference time. For example, on Llama-2-7B, Standard 24 only improves the average accuracy by 2.6% over Standard 3,4, and the accuracy even decreases on Mistral-Instruct. Second, PromptS performs the worst in most setups, likely because it is hard to find similar examples in a small pool of $K$ examples, and a bad selection induces majority label biases. Third, both calibration (Calib+) and component reweighting (CompRW) achieve substantially better accuracy than Standard 3,4 with little test-time overhead. Overall, CompRW achieves the best average accuracy in all setups, outperforming Standard 12 by 6.0%, 1.8%, 2.6%, and 1.4% on Llama-2-7B, Llama-2-13B, Mistral-Instruct-7B, and Llama-3-8B, respectively, and outperforming Standard 24 by 6.0%, 2.2%, 5.1%, and 1.6%, respectively. We run one-tailed paired t-tests comparing CompRW with Calib+ and find p-values < 0.05 in all 8 setups (see Table 6), showing that CompRW performs significantly better than Calib+.

6 When Do Good Components Emerge?

We study the dynamics of components during pretraining by monitoring their accuracies on 32 checkpoints of Pythia-6.9B, uniformly spaced from the first to the last checkpoint. For each checkpoint, we run 4-shot ICL on AGNews with 3 templates × 3 sets of demonstrations. The demonstrations are balanced in labels with randomly shuffled orders. Figure 3 shows the average accuracy of the 9 runs, shaded by the standard deviation.

While the full model (green) fluctuates and has a large variance across prompts, the top-1 components (solid blue) achieve good accuracy at an early step and plateau quickly. We also backtrack the top-1 components of different prompts at the last checkpoint (dashed blue), monitoring how they perform on average during pretraining. We observe that they are not the top components at the early stage (there are gaps between the two blue lines before 75k steps), but start to perform steadily well from the middle stage. Our findings also hold on SST2 and Pythia-1.4B (see Figure 6 in the appendix), suggesting that the model's ability to do a task emerges before it is apparent from the full model's behavior on these tasks.[4]
[4] On the other hand, Pythia models perform poorly on the other tasks over all checkpoints; thus, the training dynamics of the model components on challenging tasks remain unclear.

7 Related Work and Discussion

Improving ICL.

Prior work shows that ICL performance varies greatly across different choices of demonstrations and templates Zhao et al. (2021); Lu et al. (2022). Specifically, Sclar et al. (2024) and Voronov et al. (2024) find no universally better prompt template that can transfer across tasks and models, implying that it is not easy to explain ICL through prompt engineering. While several approaches, such as prompt selection Liu et al. (2022b); Chang and Jia (2023); Fu et al. (2023), prompt ensemble Min et al. (2022a); Arora et al. (2023); Voronov et al. (2024), and many-shot ICL Agarwal et al. (2024), substantially improve accuracy, they treat LLMs like black boxes without understanding the internals. Besides, they greatly increase inference time or require a large set of labeled data, which deviates from true few-shot learning Perez et al. (2021). In comparison, our paper studies this problem by looking inside the LLMs. Rather than selecting prompts, we select components in a soft, learnable way. Our method only requires $\{12, 24\}$ examples and has negligible computation overhead over 4-shot ICL at inference.

Figure 3: The ICL accuracy of the full model (green) fluctuates greatly during pretraining. However, good-performing components (T1) emerge in the early steps.

Components Interpretation.

Components interpretation studies the function of different components in a trained model Elhage et al. (2021); Shah et al. (2024), where components could be neurons Radford et al. (2017); Wang et al. (2022a); Gurnee et al. (2023), attention heads Olsson et al. (2022), and MLPs Geva et al. (2021). To analyze the components, probing Alain and Bengio (2017), knockout Geva et al. (2023); Chang et al. (2024); Li et al. (2023), patching Wang et al. (2023a); Goldowsky-Dill et al. (2023), and early decoding nostalgebraist (2020); Geva et al. (2022) are widely used techniques. For example, Li et al. (2024) train a linear probe on every attention head to discover the truthful heads inside LLMs. Michel et al. (2019) and Voita et al. (2019) prune away a large percentage of attention heads and show that only a few are critical to the performance. Hendel et al. (2023), Liu et al. (2023), Merullo et al. (2024a), and Todd et al. (2024) view ICL as compressing demonstrations into function vectors, where they remove the demonstrations and modify (patch) the LLM activations at certain layers with the function vectors at test time. Early decoding interprets the investigated components in the textual space by projecting them through the output embedding matrix Geva et al. (2022). Our model decomposition is based on early decoding and we share some similarities with prior work Yu et al. (2023); Wang et al. (2023c), especially in discovering individual components that perform well on a task. Our contributions lie in providing a new view of ICL by decomposition, which reveals the transferability of components across diverse ICL settings.

Our Method vs. Pruning.

Our method caches the direct contributions of components to the outputs through the residual stream, i.e., $\text{logits} = \sum_j \mathbf{g}_j$. Thus, removing $\mathbf{g}_j$, the direct contribution of component $j$, does not alter the contributions of the other components. In comparison, pruning a component changes the activations of the other components in later layers. In A.7, we show that pruning the good-performing components identified by our method greatly hurts accuracy, meaning that pruning also identifies these components as important Michel et al. (2019).

8 Conclusion

We introduce a new perspective on ICL by decomposing the model output into the sum of individual contributions of components. We then identify three types of component characteristics across 4 LLMs and 8 classification tasks. Our extensive analyses reveal consistency in component accuracy across prompts and suggest the promising direction of improving ICL by selecting components. To this end, we propose component reweighting, which learns to scale components differently from few-shot examples. Our method achieves the best average accuracy compared to prior methods. We hope this work can deepen our grasp of LLMs while motivating more methods for practical use.

9 Limitations

Our component reweighting method requires a small set of labeled data $\mathcal{D}_{\text{train}}$ to train the component weights $w$. However, we believe it is not unreasonable to have at least $K = 12$ labeled examples in total, and we compare with baselines using the same $K$ examples. On the other hand, we do not compare with fine-tuning-based baselines, such as LM-BFF Gao et al. (2021), T-Few Liu et al. (2022a), and LoRA Hu et al. (2021), because they usually require larger GPU memory for training and more sophisticated early stopping criteria to prevent overfitting on few-shot examples. Another limitation is that we only experiment with classification tasks for ease of evaluation. We leave it for future work to generalize our method to generation tasks by doing decomposition and reweighting at every token during generation.

Despite similarities in model decomposition, the focus of this paper is not circuits in LLMs Wang et al. (2023a). Thus, we only have limited experiments towards a mechanistic understanding of the curious components in section 3.3 and A.7. Unlike prior work that uses synthetic tasks to test whether a head attends to certain tokens Dutta et al. (2024); Merullo et al. (2024b), we work on standard NLP benchmark datasets without obviously correct or incorrect tokens from which to collect answers, making mechanistic interpretation more challenging.

Acknowledgements

We thank Johnny Wei for his valuable suggestions on the paper structure. We thank Qinyuan Ye, Ryan Wang, Gustavo Lucas Carvalho, Ameya Godbole, Wang Zhu, Daniel Firebanks-Quevedo, and the anonymous reviewers for their helpful feedback. This work was funded in part by gifts from Open Philanthropy and Cisco Research, and was also supported in part by the National Science Foundation under Grant No. IIS-2403436. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

  • Agarwal et al. (2024) Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Stephanie Chan, Ankesh Anand, Zaheer Abbas, Azade Nova, John D Co-Reyes, Eric Chu, et al. 2024. Many-shot in-context learning. arXiv preprint arXiv:2404.11018.
  • Alain and Bengio (2017) Guillaume Alain and Yoshua Bengio. 2017. Understanding intermediate layers using linear classifier probes.
  • Arora et al. (2023) Simran Arora, Avanika Narayan, Mayee F Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, and Christopher Re. 2023. Ask me anything: A simple strategy for prompting language models. In The Eleventh International Conference on Learning Representations.
  • Bach et al. (2022) Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-jian Jiang, and Alexander Rush. 2022. PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Ireland. Association for Computational Linguistics.
  • Bertsch et al. (2024) Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R Gormley, and Graham Neubig. 2024. In-context learning with long-context models: An in-depth exploration. arXiv preprint arXiv:2405.00200.
  • Bhagavatula et al. (2020) Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In International Conference on Learning Representations.
  • Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Chang and Jia (2023) Ting-Yun Chang and Robin Jia. 2023. Data curation alone can stabilize in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8123–8144, Toronto, Canada. Association for Computational Linguistics.
  • Chang et al. (2024) Ting-Yun Chang, Jesse Thomason, and Robin Jia. 2024. Do localization methods actually localize memorized data in LLMs? a tale of two benchmarks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3190–3211, Mexico City, Mexico. Association for Computational Linguistics.
  • Chen et al. (2023) Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, and He He. 2023. On the relation between sensitivity and accuracy in in-context learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 155–167, Singapore. Association for Computational Linguistics.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1.
  • Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  • Dutta et al. (2024) Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, and Tanmoy Chakraborty. 2024. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. arXiv preprint arXiv:2402.18312.
  • Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1:1.
  • Fei et al. (2023) Yu Fei, Yifan Hou, Zeming Chen, and Antoine Bosselut. 2023. Mitigating label biases for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14014–14031, Toronto, Canada. Association for Computational Linguistics.
  • Fu et al. (2023) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations.
  • Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.
  • Gardner et al. (2020) Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics.
  • Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, Singapore. Association for Computational Linguistics.
  • Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Goldowsky-Dill et al. (2023) Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. 2023. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969.
  • Gurnee et al. (2023) Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research.
  • Hendel et al. (2023) Roee Hendel, Mor Geva, and Amir Globerson. 2023. In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9318–9333, Singapore. Association for Computational Linguistics.
  • Hewitt and Liang (2019) John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.
  • Holtzman et al. (2021) Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn’t always right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Hu et al. (2021) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Li et al. (2024) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36.
  • Li et al. (2023) Maximilian Li, Xander Davies, and Max Nadeau. 2023. Circuit breaking: Removing model behaviors with targeted ablation. arXiv preprint arXiv:2309.05973.
  • Liu et al. (2022a) Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022a. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965.
  • Liu et al. (2022b) Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022b. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.
  • Liu et al. (2023) Sheng Liu, Lei Xing, and James Zou. 2023. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv preprint arXiv:2311.06668.
  • Lu et al. (2022) Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland. Association for Computational Linguistics.
  • McDougall et al. (2023) Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda. 2023. Copy suppression: Comprehensively understanding an attention head. arXiv preprint arXiv:2310.04625.
  • Merullo et al. (2024a) Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2024a. Language models implement simple Word2Vec-style vector arithmetic. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5030–5047, Mexico City, Mexico. Association for Computational Linguistics.
  • Merullo et al. (2024b) Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2024b. Talking heads: Understanding inter-layer communication in transformer language models. arXiv preprint arXiv:2406.09519.
  • Michel et al. (2019) Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Min et al. (2022a) Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022a. Noisy channel language model prompting for few-shot text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5316–5330, Dublin, Ireland. Association for Computational Linguistics.
  • Min et al. (2022b) Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022b. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland. Association for Computational Linguistics.
  • nostalgebraist (2020) nostalgebraist. 2020. interpreting GPT: the logit lens.
  • Olsson et al. (2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
  • Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. In Advances in Neural Information Processing Systems.
  • Pilehvar and Camacho-Collados (2019) Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Radford et al. (2017) Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. ArXiv preprint, abs/1704.01444.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Ramanujan et al. (2020) Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. 2020. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11893–11902.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  • Romanov and Shivade (2018) Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1586–1596, Brussels, Belgium. Association for Computational Linguistics.
  • Rubin et al. (2022) Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States. Association for Computational Linguistics.
  • Sclar et al. (2024) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.
  • Shah et al. (2024) Harshay Shah, Andrew Ilyas, and Aleksander Madry. 2024. Decomposing and editing predictions by modeling model computation. arXiv preprint arXiv:2404.11534.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
  • Todd et al. (2024) Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. 2024. Function vectors in large language models. In The Twelfth International Conference on Learning Representations.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.
  • Voita et al. (2019) Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
  • Voronov et al. (2024) Anton Voronov, Lena Wolf, and Max Ryabinin. 2024. Mind your format: Towards consistent evaluation of in-context learning improvements. In Findings of the Association for Computational Linguistics ACL 2024, pages 6287–6310, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
  • Wang et al. (2023a) Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023a. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations.
  • Wang et al. (2023b) Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023b. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore. Association for Computational Linguistics.
  • Wang et al. (2023c) Tony Wang, Miles Kai, Kaivalya Hariharan, and Nir Shavit. 2023c. Forbidden facts: An investigation of competing objectives in llama 2. In Socially Responsible Language Modelling Research.
  • Wang et al. (2022a) Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, and Juanzi Li. 2022a. Finding skill neurons in pre-trained transformer-based language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11132–11152, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Wang et al. (2022b) Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022b. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  • Yu et al. (2023) Qinan Yu, Jack Merullo, and Ellie Pavlick. 2023. Characterizing mechanisms for factual recall in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9924–9959, Singapore. Association for Computational Linguistics.
  • Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.
  • Zhang and Bowman (2018) Kelly Zhang and Samuel Bowman. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 359–361, Brussels, Belgium. Association for Computational Linguistics.
  • Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International conference on machine learning, pages 12697–12706. PMLR.
  • Zhou et al. (2023) Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, and Subhrajit Roy. 2023. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. arXiv preprint arXiv:2309.17249.

Appendix A Appendix

A.1 Tasks and Templates

Table 11 summarizes the 13 datasets used in this paper; we construct balanced test sets by randomly sampling 2000 examples for each task. We form prompts by concatenating demonstrations in a randomly shuffled order. To avoid recency bias Zhao et al. (2021), we keep shuffling the demonstrations until the last two have different labels. For minimally contrastive templates (section 4.1), Table 10 compares the contrast sets we construct on Llama-2-7B. For our case study on Task069 and Task070, we sample 3 templates from Sclar et al. (2024). Figure 7 compares the prompts of Task069 and Task070, which consist of an instruction followed by $K$ templated demonstrations. Originally, the two tasks share ∼4% parallel examples; to make task transfer challenging, we discard these overlapping examples.
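For concreteness, the demonstration shuffling and prompt construction described above can be sketched as follows; the function names, example fields, and template placeholders are illustrative rather than our exact implementation.

```python
import random

def sample_demonstrations(pool, k, seed=0):
    """Sample k demonstrations and reshuffle until the last two carry
    different labels, mitigating recency bias. Assumes k >= 2 and that
    the sampled set contains more than one label."""
    rng = random.Random(seed)
    demos = rng.sample(pool, k)
    while demos[-1]["label"] == demos[-2]["label"]:
        rng.shuffle(demos)
    return demos

def build_prompt(demos, template, test_text):
    """Concatenate templated demonstrations followed by the test input
    with an empty label slot."""
    lines = [template.format(text=d["text"], label=d["label"]) for d in demos]
    lines.append(template.format(text=test_text, label="").rstrip())
    return "\n".join(lines)
```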

A.2 Label-Biased Components

We say a component is label-biased when it always predicts the same label on the entire test set (section 3.2). In this section, we focus on the most biased components in binary classification tasks, i.e., the two components with the largest values of $\text{logit}_{0}-\text{logit}_{1}$ and $\text{logit}_{1}-\text{logit}_{0}$, respectively, where $\text{logit}_{0},\text{logit}_{1}\in\mathbb{R}$ are the LLM output logits on the two classes. We refer to these two components as Biased Component-0 and Biased Component-1. To understand how biased they are, we alter the choice of demonstrations and observe their behavior. Specifically, we consider three settings: demonstrations balanced in labels (green in Figure 5), demonstrations with all negative labels ([0,0,0,0]; red), and demonstrations with all positive labels ([1,1,1,1]; blue). We fix the template and sample 5 disjoint sets of demonstrations for each setting. Each dot in Figure 5 shows the components' predictions on one example; the x-axis and y-axis correspond to $\text{logit}_{0}$ and $\text{logit}_{1}$, respectively. A dot below the dashed diagonal line means the prediction on that example is class 0. We find that both Biased Component-0 and Biased Component-1 keep predicting their preferred label on all examples, regardless of the labels in the prompts.
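For concreteness, the two most biased components can be selected as sketched below, assuming the per-component logits on the two label words are cached for the whole test set; averaging the logit margin over test examples is one reasonable way to operationalize "most biased" and is not necessarily the exact criterion used in our experiments.

```python
import numpy as np

def most_biased_components(comp_logits: np.ndarray):
    """comp_logits: (num_components, num_test_examples, 2) cached logits
    on the two label words. Returns indices of Biased Component-0 (largest
    average logit_0 - logit_1) and Biased Component-1 (largest average
    logit_1 - logit_0)."""
    margin = (comp_logits[..., 0] - comp_logits[..., 1]).mean(axis=1)
    return int(np.argmax(margin)), int(np.argmin(margin))
```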

Figure 4: Transformer architecture in GPT2.
Figure 5: Each dot represents an example in the test set. The two most biased components still insist on predicting the same label on the entire test set regardless of the labels of the demonstrations.
SST2 BoolQ QQP WiC RTE MNLI AGNews ARC-Easy
Llama-2-7B 37.8 18.9 43.4 44.2 35.4 28.2 13.4 11.5
Llama-2-13B 32.6 18.7 37.2 39.3 32.1 26.0 12.4 13.1
Mistral-Instruct-7B 31.9 14.0 32.3 32.4 27.6 20.9 14.5 8.4
Llama-3-8B 38.3 21.4 42.4 40.0 36.9 28.1 15.8 15.0
Table 5: We report the average percentage (%) of label-biased components across 15 prompts for each task. A label-biased component always predicts the same label on the entire test set.

A.3 LayerNorms

Figure 4 shows the transformer architecture in GPT2-like models. Because the layernorms inside each block come before the MHA and MLP (known as Pre-LN), Eq. 1 already takes $\text{LN}_1$ and $\text{LN}_2$ into account, and Eq. 6 only contains the term for the final layernorm, $\text{LN}(\cdot)$.

Both the Llama-2 and Mistral model families use RMSNorm Zhang and Sennrich (2019), a layer normalization variant without centering or bias terms. Formally, let $x\in\mathbb{R}^{d}$ be the input; the root mean square norm $\text{LN}(x)$ is:

\text{LN}(x) = \frac{x}{\text{RMS}(x)} \odot \gamma, \qquad (10)
\text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d}x_{i}^{2}}, \qquad (11)

where $\gamma\in\mathbb{R}^{d}$ contains the affine transform parameters and $\odot$ denotes element-wise multiplication.
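For reference, a minimal PyTorch sketch of Eqs. 10 and 11 is given below; the small eps term is a standard numerical-stability constant added in implementations and is not part of the equations.

```python
import torch

def rms_norm(x: torch.Tensor, gamma: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm: rescale x by its root mean square over the hidden
    dimension, then apply the learned gain gamma (no centering, no bias)."""
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x / rms) * gamma
```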

A.4 Tests of Significance

We run one-tailed paired t-tests to examine whether CompRW significantly outperforms Calib+. In Table 4, we have results for 15 prompts per task and 8 tasks in total. For each model, we aggregate the 120 accuracy scores of CompRW and Calib+, respectively, and then compute the p-values. Table 6 shows p-values $<0.05$ in all 8 setups, suggesting that CompRW performs significantly better than Calib+.
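This test can be reproduced with SciPy as sketched below; the variable names are illustrative, and each array holds the 120 paired accuracy scores (8 tasks, 15 prompts each) for one model.

```python
from scipy import stats

def one_tailed_paired_ttest(comprw_acc, calib_acc):
    """One-tailed paired t-test with H1: CompRW accuracy > Calib+ accuracy
    on the same (task, prompt) runs."""
    return stats.ttest_rel(comprw_acc, calib_acc, alternative="greater").pvalue
```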

Llama2-7B Llama2-13B Mistral-Ins-7B Llama3-8B
K=12 0.0010 0.0002 0.0470 0.0198
K=24 0.0003 0.0001 0.0027 0.0245
Table 6: The p-values are $<0.05$ in all 8 setups (4 LLMs, with $K=\{12,24\}$ labeled examples), showing that CompRW performs significantly better than Calib+.
SST2 ARC-Easy
Trained
Full 75.8 (18.1) 57.5 (14.4)
Oracle-T1 91.7 (0.9) 54.5 (10.1)
CompRW 90.8 (1.8) 79.0 (1.6)
Random
Full 49.7 (0.7) 25.0 (0.5)
Oracle-T1 55.2 (0.5) 26.7 (0.2)
CompRW 51.4 (1.7) 25.0 (0.7)
Table 7: Comparing the ICL accuracy of pretrained (Trained) and randomly initialized (Random) Llama-2-7B; standard deviations are in parentheses. The top-1 component (Oracle-T1) and CompRW perform near random on the untrained model.

A.5 Do Good-Performing Components Exist in Randomly Initialized Models?

Ramanujan et al. (2020) find that untrained subnetworks can perform on par with a trained ResNet-34 on ImageNet. Similarly, Zhang and Bowman (2018); Hewitt and Liang (2019) show that representations from randomly initialized language models yield a strong baseline for probing tasks. In this section, we investigate (1) whether good-performing components still exist in a randomly initialized LLM, and (2) how our CompRW method performs when using component activations extracted from the randomly initialized LLM.

We run 4-shot ICL with 15 prompts and report the average accuracy and standard deviation. For CompRW, we use the same 4 demonstrations plus 20 more examples for reweighting. Table 7 shows that the best-performing component (Oracle-T1) in a randomly initialized Llama-2-7B still performs poorly on SST2 and ARC-Easy. While CompRW substantially improves over Full on the pretrained model, it has no effect on the randomly initialized model. We conclude that good-performing components do not exist in a randomly initialized LLM, and that CompRW relies on pretrained component activations to perform well.

A.6 More Results on Transferability

In section 4, we study the transferability of components across different choices of demonstrations and templates. Here, Table 9 shows the full results on all LLMs and tasks. We observe the same findings as in Table 2: component accuracies agree well across randomly sampled demonstrations but agree much more weakly across randomly sampled templates. Because constructing minimally contrastive templates requires non-trivial manual effort, we only build contrast sets for 5 tasks on Llama-2-7B (shown in Table 2); these tasks have the largest variances across templates.

A.7 Pruning Good and Bad Components

Our method studies a component via its cached direct contribution to the output, whereas Michel et al. (2019) (pruning) zero out a component's activations in the forward pass, which indirectly changes the activations of other components in later layers. They consider a component important if pruning it causes a large drop in task performance. In this section, we investigate the intersection between our method and pruning.

First, we apply our decomposition to identify good- and bad-performing components based on their ICL accuracy (3-shot for MNLI, 4-shot for the other tasks). Second, we run ICL with pruning on Llama-2-7B, using the same 15 prompts as in our main experiments for every task. We prune the top-50 components (∼5% of the total components) and the bottom-50 components, respectively. Table 8 compares the results with the full model without pruning. We find that pruning the top components (T50) greatly hurts accuracy. In contrast, pruning the bottom components (B50) only decreases the average accuracy on SST2 and RTE by 3.5%, and even slightly improves MNLI and AGNews. These findings suggest that our method and pruning interpret components in a similar fashion.

SST2 RTE MNLI AGNews
Full Model 75.8 (18.1) 68.9 (3.2) 34.4 (1.7) 70.0 (19.9)
Prune-T50 53.4 (8.8) 57.8 (6.4) 34.3 (3.2) 26.8 (2.5)
Prune-B50 72.3 (14.8) 65.4 (5.8) 35.7 (4.0) 72.6 (15.7)
Table 8: Comparing the accuracies of the full Llama-2-7B model and of pruning the top/bottom 50 components. We run 15 prompts for each task and report the average accuracy and standard deviation (in parentheses). Pruning the top-50 components causes large drops in accuracy on SST2, RTE, and AGNews.
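As a rough illustration of how a single attention head can be zeroed out at inference time, the sketch below masks the head's slice of the output-projection input with a forward pre-hook; the module paths and per-head layout assume a Hugging Face LlamaForCausalLM, and this is not necessarily the exact pruning code used in our experiments.

```python
def prune_attention_head(model, layer_idx: int, head_idx: int):
    """Zero out one head's contribution by masking its slice of the
    o_proj input (assumes a transformers Llama-style module layout)."""
    attn = model.model.layers[layer_idx].self_attn
    lo = head_idx * attn.head_dim
    hi = lo + attn.head_dim

    def pre_hook(module, args):
        hidden = args[0].clone()
        hidden[..., lo:hi] = 0.0  # drop this head's value output
        return (hidden,) + args[1:]

    # Keep the returned handle; call handle.remove() to restore the head.
    return attn.o_proj.register_forward_pre_hook(pre_hook)
```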

A.8 Training Details and Hyperparameters

For both the CompRW and Calib+ methods, we train a linear layer on $\mathcal{D}_{\text{train}}$ with stochastic gradient descent. Because we do not have an additional dev set for tuning hyperparameters, we use the same hyperparameters across all tasks and models and perform early stopping based on the loss and accuracy on $\mathcal{D}_{\text{train}}$. Specifically, we set the learning rate to 0.05 for both methods and $\lambda=0.1$ for the L1 regularization term in CompRW. We run all ICL experiments on a single RTX A6000 GPU (48GB). Both the component reweighting and calibration training processes run on a single i7 CPU within a minute.
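For illustration, below is a minimal sketch of a reweighting training loop under these hyperparameters, assuming the per-component class logits are cached for each training example; the fixed epoch count and the exact form of the L1 penalty are simplifications of our early-stopped implementation, and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def train_component_weights(comp_logits, labels, lr=0.05, lam=0.1, epochs=200):
    """comp_logits: (num_train, num_components, num_classes) cached direct
    contributions; labels: (num_train,) gold class indices. Learns one scalar
    weight per component; the full model corresponds to all-ones weights."""
    w = torch.ones(comp_logits.shape[1], requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # Reweighted sum over components gives the adjusted class logits.
        logits = torch.einsum("c,ncl->nl", w, comp_logits)
        loss = F.cross_entropy(logits, labels) + lam * w.abs().mean()
        loss.backward()
        opt.step()
    return w.detach()
```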

A.9 Models

We use the model checkpoints on Hugging Face, meta-llama/Llama-2-7b-hf, Llama-2-13b-hf, mistralai/Mistral-7B-Instruct-v0.1, and meta-llama/Meta-Llama-3-8B.

SST2 BoolQ QQP WiC RTE MNLI AGNews ARC
Llama-2-7B
Correlation (1) Demo 0.81 0.84 0.60 0.65 0.75 0.65 0.89 0.88
Correlation (2) Temp 0.40 0.16 0.03 0.15 0.19 0.09 0.68 0.44
IoU (1) Demo 0.36 0.74 0.27 0.21 0.53 0.24 0.63 0.70
IoU (2) Temp 0.12 0.01 0.01 0.03 0.05 0.01 0.20 0.20
Llama-2-13B
Correlation (1) Demo 0.83 0.84 0.63 0.67 0.78 0.73 0.91 0.91
Correlation (2) Temp 0.57 0.30 0.09 0.19 0.28 0.16 0.76 0.55
IoU (1) Demo 0.26 0.71 0.31 0.18 0.46 0.39 0.55 0.65
IoU (2) Temp 0.21 0.11 0.07 0.01 0.21 0.07 0.25 0.30
Mistral-Instruct-7B
Correlation (1) Demo 0.88 0.91 0.72 0.75 0.87 0.82 0.92 0.97
Correlation (2) Temp 0.58 0.44 0.19 0.26 0.40 0.30 0.77 0.60
IoU (1) Demo 0.39 0.59 0.27 0.29 0.50 0.45 0.68 0.80
IoU (2) Temp 0.10 0.17 0.06 0.05 0.17 0.09 0.29 0.22
Llama-3-8B
Correlation (1) Demo 0.85 0.88 0.70 0.73 0.80 0.81 0.89 0.95
Correlation (2) Temp 0.55 0.39 0.26 0.25 0.31 0.23 0.67 0.52
IoU (1) Demo 0.42 0.56 0.28 0.25 0.46 0.52 0.65 0.68
IoU (2) Temp 0.15 0.12 0.09 0.07 0.08 0.05 0.34 0.27
Table 9: Full results of the average correlation and IoU between (1) two random sets of demonstrations and (2) two randomly sampled templates.
Task Templates Labels Accuracy
SST-2
T1 Review: {text}\nDo you think the review is positive or negative? {label}
negative/positive 50.6 ± 0.7
T2 Review: {text}{space}\nDo you think the review is positive or negative? {label}
negative/positive 72.7 ± 6.1
BoolQ
T1 Based on the following passage, {question}? {passage}\nAnswer: {label}
No/Yes 52.5 ± 2.0
T2 Based on the following passage, {question}? {passage} Answer: {label}
No/Yes 66.7 ± 2.1
QQP
T1 Are the questions "{sent1}" and "{sent2}" asking the same thing? {label}
no/yes 54.3 ± 1.1
T2 Are the questions "{sent1}" and "{sent2}" asking the same thing? {label}
No/Yes 68.7 ± 4.1
AGNews
T1 {text}\nIs this a piece of news regarding World, Sports, Business, or Technology? {label}
World/Sports/Business/Technology 43.9 ± 8.7
T2 {text} Is this a piece of news regarding World, Sports, Business, or Technology? {label}
World/Sports/Business/Technology 88.5 ± 0.8
Table 10: We construct minimally contrastive templates that differ only slightly (e.g., an added space, a newline, or label casing) but yield large differences in 4-shot ICL accuracy on Llama-2-7B. We report the average accuracy and standard deviation across 5 ICL runs with different demonstrations under the same template.
Dataset Task # Classes
SST-2 Socher et al. (2013) Sentiment Analysis 2
Yelp-polarity Zhang et al. (2015) Sentiment Analysis 2
BoolQ Clark et al. (2019) Yes/No QA 2
BoolQ Contrast Set Gardner et al. (2020) Yes/No QA 2
QQP Wang et al. (2018) Paraphrase Identification 2
WiC Pilehvar and Camacho-Collados (2019) Word Sense Disambiguation 2
RTE Wang et al. (2018) Natural Language Inference 2
MNLI Williams et al. (2018) Natural Language Inference 3
MedNLI Romanov and Shivade (2018) NLI in Medical Domain 3
AGNews Zhang et al. (2015) Topic Classification 4
ARC-Easy Clark et al. (2018) Multiple-Choice QA 4
Task069 Mishra et al. (2022); Wang et al. (2022b) Abductive NLI 2
Task070 Mishra et al. (2022); Wang et al. (2022b) Abductive NLI 2
Table 11: Summary of all the datasets.
Figure 6: 4-shot ICL accuracy on different pretraining checkpoints. We compare the full model (green) with the top-1 (solid blue) and bottom-1 (red) components. The dashed blue line tracks how the top-1 components of the last checkpoint (Last-T1) perform across time.
Figure 7: Comparing the prompts of Task069 and Task070. We apply the templates of Sclar et al. (2024) and prepend the task instructions before the $K$ demonstrations. We ensure that the two tasks do not have parallel examples to make the transfer experiment (section 4.3) challenging.
Figure 8: Each dot represents a component (attention head: blue; MLP: orange) under 4-shot ICL on Mistral-Instruct-7B. The x-axis shows how often a component predicts label 1 across the test data of a binary task.