
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Charles O’Neill
School of Computing
The Australian National University
Canberra, ACT, 2601
[email protected]
Thang Bui
School of Computing
The Australian National University
Canberra, ACT, 2601
[email protected]
Abstract

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational complexity and sensitivity to hyperparameters. We propose training sparse autoencoders on carefully designed positive and negative examples, where the model can only correctly predict the next token for the positive examples. We hypothesise that learned representations of attention head outputs will signal when a head is engaged in specific computations. By discretising the learned representations into integer codes and measuring the overlap between codes unique to positive examples for each head, we enable direct identification of attention heads involved in circuits without the need for expensive ablations or architectural modifications. On three well-studied tasks - indirect object identification, greater-than comparisons, and docstring completion - the proposed method achieves higher precision and recall in recovering ground-truth circuits compared to state-of-the-art baselines, while reducing runtime from hours to seconds. Notably, we require only 5-10 text examples for each task to learn robust representations. Our findings highlight the promise of discrete sparse autoencoders for scalable and efficient mechanistic interpretability, offering a new direction for analysing the inner workings of large language models.

1 Introduction

The rapid advancement of large language models (LLMs) based on transformers [44] has spurred interest in mechanistic interpretability [32], which aims to break down model components into human-understandable circuits. Circuits are defined as subgraphs of a model’s computation graph that implement a task-specific behaviour [31]. While progress has been made in automating the isolation of certain circuits [6, 5, 28], automatic circuit discovery remains too brittle and complex to replace manual inspection. As model sizes grow [18, 22], manual inspection becomes increasingly impractical, hence the need for more robust and efficient methods.

Automated circuit identification algorithms suffer from several drawbacks, such as sensitivity to the choice of metric, the type of intervention used to identify important components, and computational intensity [6, 34, 46, 11]. Whilst faster variants of automated algorithms have been leveraged with some success [42, 15], they retain many of the same failure modes and performance is heavily dependent on metric choice. All variants of automated circuit-identification have been shown to perform poorly at recovering ground-truth circuits in specific situations [6], meaning researchers cannot know a priori whether the circuit found is accurate or not. Without simpler and more robust algorithms, researchers will be limited to using painstakingly slow manual circuit identification [38].

Our main contribution is the introduction of a highly performant yet remarkably simple circuit-identification method based on the presence of features in sparse autoencoders (SAEs) trained on transformer attention head outputs. SAEs have been shown to learn interpretable compressed features of the model’s internal states [7, 40, 4]. We hypothesise that these representations of attention head outputs should contain signal about when a head is engaged in a particular type of computation as part of a circuit. The key insight behind our approach is that by training SAEs on carefully designed examples of a task that requires the language model to use a specific circuit (and examples where it doesn’t), the learned representations should capture circuit-specific behaviour.

We demonstrate that by simply looking for the codes unique to positive examples within as few as 5-10 text examples of a task, we can directly identify the attention heads in the ground-truth circuit with precision and recall better than or equal to existing methods, while being significantly faster and less complex. In particular, our method does away with choosing a metric to measure the importance of a model component, which we see as a fundamental advantage over previous approaches. We evaluate the proposed method on three well-studied circuits and demonstrate its robustness to hyperparameter choice. Our findings highlight the potential of using discrete sparse autoencoders for efficient and effective circuit identification in large language models.

Figure 1: After training the sparse autoencoder, we obtain discrete representations by passing the tensor $\mathbf{x}$ through the encoder to get activations $\mathbf{z}$ and taking the argmax over the feature dimension, obtaining an integer code for each head in each example in $\hat{\mathbf{z}}$. Here $b$ is the number of examples, $h$ is the number of heads, $d$ is the transformer hidden dimension and $n$ is the number of learned features. For node-level circuit identification, shown here, we compute the number of codes unique to positive examples per head, normalise with a softmax, choose a threshold $\theta$, and identify a head as being in the circuit if it surpasses the threshold. For edge-level circuit identification, shown in Figure 6, we count the number of co-occurrences of codes between heads for the top-$k$ co-occurrences, and then again apply a softmax and threshold with $\theta$.

2 Background

2.1 Attention Heads and Circuits in Autoregressive Transformers

Autoregressive, decoder-only transformers rely on self-attention to weigh the importance of different parts of the input sequence [44], with the goal of predicting the next token. In the multi-head attention mechanism, each attention head operates on a unique set of query, key, and value matrices, allowing the model to capture diverse relationships between elements of the input. The output of the $i$-th attention head can be formally described as:

$$\mathbf{h}_{i}=\text{softmax}\left(\frac{(XW_{Q}^{i})(XW_{K}^{i})^{T}}{\sqrt{d_{k}}}\right)(XW_{V}^{i})$$

where $X$ is the input sequence, $W_{Q}^{i}, W_{K}^{i}, W_{V}^{i}$ are the query, key, and value matrices for the $i$-th head, and $d_{k}$ is the dimensionality of the key vectors. The outputs of the individual attention heads are then concatenated and linearly transformed to produce the overall output of the multi-head attention layer:

$$\text{MultiHead}_{\ell}(Q_{\ell},K_{\ell},V_{\ell})=\text{Concat}(\mathbf{h}_{\ell 1},\ldots,\mathbf{h}_{\ell J})\,W_{O}^{\ell}$$

where $W_{O}^{\ell}\in\mathbb{R}^{Jd_{v}\times d}$ is the output projection matrix. Each attention head output $\mathbf{h}_{\ell j}$ resides in the residual stream space $\mathbb{R}^{d}$, contributing independently to the total attention output at that layer [8].
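As a concrete illustration of this additive decomposition, the following sketch (shapes and randomly initialised weights are assumptions for illustration, not the model's actual parameters; the causal mask is omitted) checks that concatenating head outputs and applying $W_O$ equals summing each head's independent write into the residual stream:

```python
# Illustrative sketch: per-head attention outputs and their additive contribution
# to the residual stream. Shapes and random weights are stand-ins, not GPT-2's.
import torch
import torch.nn.functional as F

d_model, n_heads, seq_len = 768, 12, 10
d_head = d_model // n_heads
X = torch.randn(seq_len, d_model)

W_Q = torch.randn(n_heads, d_model, d_head) * 0.02
W_K = torch.randn(n_heads, d_model, d_head) * 0.02
W_V = torch.randn(n_heads, d_model, d_head) * 0.02
W_O = torch.randn(n_heads * d_head, d_model) * 0.02   # concatenated output projection

head_outputs = []
for i in range(n_heads):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    attn = F.softmax(Q @ K.T / d_head ** 0.5, dim=-1)
    head_outputs.append(attn @ V)                      # h_i, shape (seq_len, d_head)

# Concat-then-project is the sum of each head's independent write into R^{d_model}.
concat_out = torch.cat(head_outputs, dim=-1) @ W_O
summed_out = sum(h @ W_O[i * d_head:(i + 1) * d_head] for i, h in enumerate(head_outputs))
assert torch.allclose(concat_out, summed_out, atol=1e-5)
```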

The residual stream refers to the sequence of token embeddings, with each layer’s output being added back into this stream. Attention heads read from the residual stream by extracting information from specific tokens and write back their outputs, modifying the embeddings for subsequent layers. This additive form allows us to analyse each head’s contribution to the model’s behaviour by examining them independently. By tracing the flow of information across layers, we can identify computational circuits composed of multiple attention heads.

2.2 Learning Sparse Representations of Attention Heads with Autoencoders

Sparse autoencoders provide a promising approach to learn useful representations of attention head outputs that are amenable to circuit analysis. Given a set of attention head outputs $\{\mathbf{h}_{i}\}_{i=1}^{n_{\text{heads}}}$, where $\mathbf{h}_{i}\in\mathbb{R}^{d_{\text{model}}}$, we train an autoencoder with a single hidden layer and tied weights for the encoder $E$ and decoder $D$. The autoencoder learns a dictionary of basis vectors $\mathbf{v}_{j}\in\mathbb{R}^{d_{\text{model}}}$ such that each $\mathbf{h}_{i}$ can be approximated as a sparse linear combination of the dictionary elements: $\mathbf{h}_{i}\approx\sum_{j=1}^{d_{\text{bottleneck}}}z_{i,j}\mathbf{v}_{j}$, where $z_{i,j}$ are the sparse activations and $d_{\text{bottleneck}}$ is the dimensionality of the bottleneck layer. The autoencoder is trained to minimise a loss function that includes a reconstruction term and a sparsity penalty, controlled by the hyperparameter $\lambda$:

$$\mathcal{L}=\sum_{i=1}^{n_{\text{heads}}}\left\lVert\mathbf{h}_{i}-\sum_{j=1}^{d_{\text{bottleneck}}}z_{i,j}\mathbf{v}_{j}\right\rVert_{2}^{2}+\lambda\sum_{i=1}^{n_{\text{heads}}}\sum_{j=1}^{d_{\text{bottleneck}}}|z_{i,j}|.$$

The dimensionality of the bottleneck layer $d_{\text{bottleneck}}$ can be either larger (projecting up) or smaller (projecting down) than the input dimensionality $d_{\text{model}}$. While projecting up allows for an overcomplete representation and can capture more nuanced features, projecting down can also be effective in learning a compressed representation that captures the most essential aspects of the attention head outputs [33, 21, 48]. We propose a subtle but significant shift in perspective: in the context of transformers, we treat sparse autoencoding as a compression problem rather than a problem of learning higher-dimensional sparse bases. We hypothesise that compression is key to identifying which features represent circuit-related computation and, by contrast, which computation is shared between positive and negative examples.

To further simplify the representation and facilitate the identification of distinct behaviours within the attention heads, we discretise the sparse activations obtained from the autoencoder using an argmax operation over the feature dimension, $c_{i}=\text{argmax}_{j}\,z_{i,j}$, where $c_{i}\in\{1,\ldots,d_{\text{bottleneck}}\}$ is the discrete code assigned to the $i$-th attention head output. This yields a discrete bottleneck representation analogous to vector quantisation [35]. We will next discuss how to leverage the resulting discrete representations to identify important task-specific circuits in the transformer.
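As a minimal illustration (the shapes and random activations below are stand-ins, not the trained SAE), the discretisation is a single argmax over the feature dimension:

```python
import torch

n_heads, d_bottleneck = 144, 200
z = torch.relu(torch.randn(n_heads, d_bottleneck))   # stand-in sparse activations from the encoder
codes = z.argmax(dim=-1)                             # one integer code c_i per attention head
print(codes.shape)                                   # torch.Size([144])
```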

3 Methodology

Our approach centers on training a sparse autoencoder with carefully designed positive and negative examples, where the model only successfully predicts the next token for positive examples. The critical insight is that the compressed representation must capture the difference between the two sets of examples to achieve a low reconstruction loss. This differentiation enables us to isolate circuit-specific behaviours and identify the attention heads involved in the circuit of interest.

3.1 Datasets

We compile datasets of 250 “positive” and 250 “negative” examples for each task. Positive examples are text sequences where the model must use the circuit of interest to correctly predict the next token. In contrast, negative examples are semantically similar to positive examples but corrupted such that there is no correct “next token.” This dataset design ensures that the learned representations are common between positive and negative examples for attention heads processing semantic similarities but different for heads involved in circuit-specific computations. Table 1 shows task examples, and Appendix B contains details of each dataset.

Task | Positive Example | Negative Example | Answer
IOI | "When Elon and Sam finished their meeting, Elon gave the model to" | "When Elon and Sam finished their meeting, Andrej gave the model to" | "Sam"
Greater-than | "The AI war lasted from 2024 to 20" | "The AI war lasted from 2024 to 19" | Any two-digit number $> 24$
Docstring | def old(self, page, names, size): """sectorgap""" :param page: message tree :param names: detail mine :param | def old(self, page, names, size): """sectorgap""" :param image: message tree :param update: detail mine :param | size
Table 1: Task-specific positive and negative examples. Positive examples are designed to elicit the behaviour being studied when the model conducts next token prediction on the example. Negative examples are designed to be semantically similar to the positive examples but with minor corruptions that mean there is now no obviously correct answer.

The Indirect Object Identification (IOI) task involves sentences such as “When Elon and Sam finished their meeting, Elon gave the model to” with the aim being to predict “Sam”, the indirect object [46]. Negative examples introduce a third name, eliminating any bias towards completing either of the two original names.

The Greater-than task involves sentences of the form “The <noun> lasted from XXYY to XX”, where the aim is to give all non-zero probability to years greater than YY [14]. Negative examples consist of impossible completions, with the ending year preceding the starting century.

The Docstring task assesses the model’s ability to predict argument names in Python docstrings based on the function’s argument list [17]. Docstrings follow a format with :param followed by an argument name. The model predicts the next argument name after the :param tag. Negative examples employ random argument names.
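As an illustration of the dataset design (the template, name list, and helper below are hypothetical, not the datasets used in the paper), positive and negative IOI prompts differ only in the repeated name being swapped for a third name:

```python
import random

NAMES = ["Elon", "Sam", "Andrej", "Grace", "Alan"]
TEMPLATE = "When {a} and {b} finished their meeting, {subject} gave the model to"

def make_ioi_pair(rng: random.Random):
    a, b, c = rng.sample(NAMES, 3)
    positive = TEMPLATE.format(a=a, b=b, subject=a)   # circuit needed: answer is b
    negative = TEMPLATE.format(a=a, b=b, subject=c)   # third name: no clearly correct answer
    return positive, negative, b                      # (positive prompt, negative prompt, answer)

rng = random.Random(0)
pairs = [make_ioi_pair(rng) for _ in range(250)]      # 250 positive/negative pairs per task
```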

3.2 Model and circuit identification with learned features

Our methodology consists of two stages: training the sparse autoencoder to conduct dictionary learning on the cached model activations, and using these learned representations to identify model components involved in the circuit (Figure 1).

Training the autoencoder to get learned features

We first take all positive and negative input prompts for a dataset and tokenise them. Since each prompt is curated to have the same number of tokens for all positive and negative examples across all datasets, we concatenate the prompts into a single tensor $\mathbf{x}=\left[\{x_{i}\}_{i=1}^{n};\{x_{i}^{\prime}\}_{i=1}^{n}\right]\in\mathbb{R}^{n_{\text{examples}}\times\text{sequence length}}$. We found that using only 10 examples (with an equal number of positive and negative examples) led to the most robust representations learned by the SAE (see Figure 24). The remaining examples are used as an evaluation set.
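A hedged sketch of the caching and concatenation step, assuming the TransformerLens API [30] (HookedTransformer, run_with_cache, and the per-head hook_result activations); the prompts shown are placeholders, and the exact preprocessing is our assumption rather than the authors' released code:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")      # GPT-2 Small
model.set_use_attn_result(True)                        # expose per-head outputs in R^{d_model}

positive_prompts = ["When Elon and Sam finished their meeting, Elon gave the model to"]
negative_prompts = ["When Elon and Sam finished their meeting, Andrej gave the model to"]
tokens = model.to_tokens(positive_prompts + negative_prompts)   # equal lengths by construction

_, cache = model.run_with_cache(tokens)

# Stack per-head results from every layer and average over the position dimension,
# giving x with shape (n_examples, n_layers * n_heads, d_model).
per_layer = [cache[f"blocks.{l}.attn.hook_result"].mean(dim=1) for l in range(model.cfg.n_layers)]
x = torch.cat(per_layer, dim=1)
```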

Node-Level and Edge-Level Circuit Identification

Node-level circuit discovery predicts model components (i.e., attention heads) as part of the circuit based on individual outputs in isolation. In contrast, edge-level circuit discovery predicts whether the information flow (i.e., the edge) is important by considering how certain components act together, specifically the frequency of co-activation of specific codes in different heads. For a full discussion of the details and semantics of node-level and edge-level discovery, see Appendix C.

After training the SAE, we perform a forward pass of all examples $\mathbf{x}$ through the encoder $E$ to obtain the learned activations $\mathbf{z}\in\mathbb{R}^{n_{\text{examples}}\times n_{\text{heads}}\times d_{\text{bottleneck}}}$. We then apply an argmax operation across the feature (bottleneck) dimension of $\mathbf{z}$, which yields a matrix of discrete codes $\mathbf{z}_{\text{discrete}}=\text{argmax}_{d}(\mathbf{z})$, where each code represents the most activated feature for a particular attention head.

Node-level identification: Let $\mathbf{p}\in\mathbb{R}^{n_{\text{heads}}\times d_{\text{bottleneck}}}$ be a binary matrix (aggregated one-hot vectors) indicating which codes are activated for each head across the positive examples, and let $\mathbf{n}\in\mathbb{R}^{n_{\text{heads}}\times d_{\text{bottleneck}}}$ be the corresponding matrix for the negative examples.

We next compute a vector $\mathbf{u}\in\mathbb{R}^{n_{\text{heads}}}$, where each element $\mathbf{u}_{i}$ represents the number of unique codes that appear only in the positive examples, optionally normalised by the total number of codes across all examples, for the $i$-th attention head: $\mathbf{u}_{i}=|\mathbf{p}_{i}\setminus\mathbf{n}_{i}|/|\mathbf{p}_{i}\cup\mathbf{n}_{i}|$. Intuitively, a high value of $\mathbf{u}_{i}$ indicates that the $i$-th head activates a large proportion of codes that are unique to positive examples. We then apply a softmax function to $\mathbf{u}$ and select a threshold $\theta$ to determine if a head is part of the ground-truth circuit (Figure 1). Whilst we vary $\theta$ to construct analyses such as ROC curves, in practice a single $\theta$ should be selected to predict a circuit.
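A minimal sketch of the node-level scoring (the code matrices and threshold below are synthetic stand-ins for the discretised SAE outputs):

```python
import torch

def node_level_scores(codes_pos: torch.Tensor, codes_neg: torch.Tensor) -> torch.Tensor:
    """codes_pos / codes_neg: integer codes with shape (n_examples, n_heads)."""
    n_heads = codes_pos.shape[1]
    u = torch.zeros(n_heads)
    for h in range(n_heads):
        pos = set(codes_pos[:, h].tolist())
        neg = set(codes_neg[:, h].tolist())
        u[h] = len(pos - neg) / max(len(pos | neg), 1)   # |p_i \ n_i| / |p_i ∪ n_i|
    return torch.softmax(u, dim=0)

codes_pos = torch.randint(0, 200, (5, 144))              # stand-in discretised codes
codes_neg = torch.randint(0, 200, (5, 144))
scores = node_level_scores(codes_pos, codes_neg)
theta = 0.01                                             # example threshold
predicted_heads = (scores > theta).nonzero().flatten()
```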

Edge-level identification: Let us construct co-occurrence matrices $\mathbf{C}^{+}$ and $\mathbf{C}^{-}$ for the positive and negative examples, respectively. Each entry $\mathbf{C}_{h_1,h_2,c_1,c_2}$ represents the frequency of co-occurrence between codes $c_1$ and $c_2$ in heads $h_1$ and $h_2$:

$$\mathbf{C}^{+}_{h_1,h_2,c_1,c_2}=\Big|(c_1,c_2)\text{ co-occurrences in positive examples between }h_1\text{ and }h_2\Big|$$
$$\mathbf{C}^{-}_{h_1,h_2,c_1,c_2}=\Big|(c_1,c_2)\text{ co-occurrences in negative examples between }h_1\text{ and }h_2\Big|$$

We then compute a matrix $\mathbf{U}$, where each entry $\mathbf{U}_{h_1,h_2}$ represents the number of co-occurrences that appear in the positive examples but not in the negative examples for the head pair $(h_1,h_2)$: $\mathbf{U}_{h_1,h_2}=\sum_{i,j}\mathbf{C}^{+}_{h_1,h_2,i,j}$ over the indices $(i,j)$ for which $\mathbf{C}^{+}_{h_1,h_2,i,j}>0$ and $\mathbf{C}^{-}_{h_1,h_2,i,j}=0$. Once the head pairs $(h_1,h_2)$ are sorted in descending order of their corresponding values in $\mathbf{U}$, we introduce a hyperparameter $k$ to determine the number of top-ranked head pairs to include in the predicted circuit. We set $k$ to be half the total number of head pairs for all analyses, and show that this is a robust choice in Appendix L.2.

The next step is to initialise $\mathbf{u}\in\mathbb{R}^{n_{\text{heads}}}$ as a zero vector. For each of the top $k$ head pairs $(h_1,h_2)$, the corresponding entries in $\mathbf{u}$ are incremented: $\mathbf{u}[h_1]\mathrel{+}=1$ and $\mathbf{u}[h_2]\mathrel{+}=1$. We then apply a softmax across $\mathbf{u}$ and choose a threshold $\theta$ to predict whether a particular head is part of the circuit (Figure 6).
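A corresponding sketch of the edge-level scoring, under the same synthetic stand-ins as the node-level sketch above:

```python
import torch
from collections import Counter

def edge_level_scores(codes_pos: torch.Tensor, codes_neg: torch.Tensor, k: int) -> torch.Tensor:
    """codes_pos / codes_neg: integer codes with shape (n_examples, n_heads)."""
    n_heads = codes_pos.shape[1]

    def cooccurrences(codes: torch.Tensor) -> Counter:
        counts = Counter()
        for example in codes.tolist():
            for h1 in range(n_heads):
                for h2 in range(h1 + 1, n_heads):
                    counts[(h1, h2, example[h1], example[h2])] += 1
        return counts

    c_pos, c_neg = cooccurrences(codes_pos), cooccurrences(codes_neg)
    U = Counter()
    for (h1, h2, c1, c2), count in c_pos.items():
        if (h1, h2, c1, c2) not in c_neg:        # co-occurrence unique to positive examples
            U[(h1, h2)] += count

    u = torch.zeros(n_heads)
    for (h1, h2), _ in U.most_common(k):         # top-k head pairs by U value
        u[h1] += 1
        u[h2] += 1
    return torch.softmax(u, dim=0)

codes_pos = torch.randint(0, 200, (5, 144))      # stand-in discretised codes
codes_neg = torch.randint(0, 200, (5, 144))
n_pairs = 144 * 143 // 2
scores = edge_level_scores(codes_pos, codes_neg, k=n_pairs // 2)   # k = half of all head pairs
```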

4 Results

We tokenise positive and negative input prompts with the GPT-2 tokeniser [37], pass them through GPT-2 Small, and cache the outputs of each attention head. We concatenate all prompts into a single tensor $\mathbf{x}\in\mathbb{R}^{n_{\text{examples}}\times n_{\text{heads}}\times d_{\text{model}}}$, aggregating across the position dimension by taking the mean. We train the SAE on 10 examples from this tensor and use the rest for validation. The SAE is trained until convergence on the evaluation set, using a combination of reconstruction and sparsity losses, optimised with the Adam algorithm. For the main results, we set the number of learned features to 200 and $\lambda$ to 0.02 across all datasets. For edge-level identification, we choose $k$ to be half of the total number of co-occurrences.

We compare our method to three state-of-the-art approaches to circuit discovery in language models: automatic circuit discovery (ACDC) [6], head importance score pruning (HISP) [28], and subnetwork probing (SP) [5]. Appendix C.4 contains details of these algorithms. Additionally, we provide an unsupervised evaluation comparison of our method with edge attribution patching (EAP) [42], which uses linear approximations to the patches performed in ACDC (see Section 5). This makes EAP comparable to our method in terms of speed and efficiency; see Appendix C.5.

The whole process, from training the SAE to sweeping over all thresholds $\theta$ in a circuit, typically takes less than 15 seconds on GPT2-Small, and less than a minute on GPT2-XL. See Figure 7 for a detailed indication of wall-times for our method at various scales.

Figure 2: Comparison of our method’s performance against state-of-the-art circuit identification techniques (ACDC, HISP, and SP) on three well-studied transformer circuits: Docstring, Greater-than, and Indirect Object Identification (IOI). The bar plots show the average AUC (Area Under the ROC Curve) scores for each method, averaged across KL divergence and loss metrics, for both edge-level and node-level circuit identification. Error bars for Ours represent the standard deviation of AUC scores across 5 runs. Our method consistently outperforms or matches the performance of existing techniques across all tasks.

4.1 Discrete SAE features excel in node-level and edge-level circuit discovery

Our circuit-identification method outperforms ACDC, HISP, and SP in terms of ROC AUC across all datasets, regardless of the ablation type used for these methods (Figure 2), with the exception of ACDC, which achieves a higher ROC AUC on edge-level circuit identification for the docstring task. We find a strong correlation between the number of unique positive codes per head and the presence of that head in the ground-truth circuit (Figure 8, Appendix B). ROC curves are constructed by sweeping over thresholds (Figure 9): for our method, we sweep over the threshold $\theta$ that a head's softmaxed number of unique positive codes must exceed for the head to be included in the circuit, while for ACDC, we sweep over the threshold $\tau$, which determines the difference in the chosen metric between an ablated and clean model required to remove a node from the circuit. Notably, we identify negative name-mover heads in the IOI circuit [46] (heads that calibrate probability by downweighting the logit of the IO name), which other algorithms have struggled to do [6] (see Appendix B.1).

4.2 Performance is robust to hyperparameter choice

To demonstrate the robustness of our method to its hyperparameters, we consider two distinct groups: (1) those controlling the capacity and expressiveness of the sparse autoencoder (SAE), namely the size of the bottleneck and the sparsity penalty $\lambda$, and (2) the threshold $\theta$ for selecting a head after the softmax. We trained 100 autoencoders with varying numbers of features in the hidden layer and different values of $\lambda$. We observed no significant drop-off in ROC AUC for the IOI and Docstring tasks, and a slight drop-off for Greater-than, as we increase both hyperparameters (Figure 3). Finally, we examine the robustness of the value of $\theta$ via the pointwise F1 score (node-level) for both the IOI and Greater-than datasets (Figure 4). The optimal threshold is approximately the same for both tasks, suggesting we may be able to set this threshold for any arbitrary circuit. For edge-level discovery, we also find that performance is robust to $k$ (Appendix L.2).

Figure 3: Mean ROC AUC scores across different values of the number of SAE features and the sparsity penalty $\lambda$.
Figure 4: F1 score (node-level) for each dataset given a threshold $\theta$ for selecting a head's importance (after softmax). The optimal threshold is approximately the same for both IOI and Greater-than tasks.

We note that a more abstract hyperparameter is the construction of negative examples. We present an examination of this in Figure 19 in Appendix D.1, and find that our choice of semantically similar yet corrupted examples yields the best performance.

4.3 Identified circuits outperform or match the full model

Standalone metrics of circuit performance

We evaluate the effectiveness of our circuit relative to the full GPT-2 model, a fully corrupted counterpart, and a random complement circuit of equivalent size across two distinct tasks. The corrupted activations are created by caching activations on corrupted prompts, similar to our negative examples (see Appendix B). To evaluate a given circuit, we replace the activations of all attention heads not in the circuit with their corrupted activations. We use metrics specifically designed for each task, and our circuit is chosen by using the maximum F1 score across thresholds.

For the IOI task, the primary metric is logit difference, calculated as the difference in logits between the indirect object's name and the subject's name. Our circuit achieves a logit difference of 3.62, surpassing the full GPT-2 model's average of 3.55, indicating that the correct name is approximately $e^{3.62}\approx 37.48$ times more likely than the incorrect name. However, our circuit performs slightly worse than the ground-truth circuit identified by [46]; full results are in Table 2 (see Appendix B.1 for details).

For the Greater-than task, we focus on probability difference (PD) and cutoff sharpness (CS), as defined by [14]. These metrics evaluate the model’s effectiveness in distinguishing years greater than the start year and the sharpness of the transition between valid and invalid years (see Appendix B.2 for formal details). Despite having fewer attention heads, our circuit achieves a PD of 76.54% and a CS of 5.76%, slightly outperforming the ground-truth circuit and significantly surpassing the clean GPT-2 model. The corrupted model and random complements exhibit negative PDs and negligible CS values; see Table 2.

Model/Circuit | No. Heads | Probability mult. (IOI) | Logit Diff. (IOI) | Probability Diff. (GT) | Sharpness (GT)
GPT-2 (Clean) | 144 | 34.88 | 3.55 | 76.96% | 5.57%
GPT-2 (Corrupt) | 144 | 0.03 | -3.55 | -40.32% | -0.06%
Ground-truth | 26 | 61.14 | 4.11 | 71.30% | 5.50%
Ours | 40 | 37.48 | 3.62 | 76.54% | 5.76%
Random comp. | 40 | 0.23 | -2.23 | -37.91% | -0.04%
Table 2: Different standalone metrics of circuit performance for the IOI and Greater-than (GT) tasks, using a clean model, corrupted model, ground-truth circuit and random circuit.

Faithfulness of IOI and Greater-than circuits

In the absence of a ground-truth circuit, evaluating whether our learned circuit reflects the true circuit used by the model is challenging. To this end, we employ the concept of faithfulness introduced by [25]. Faithfulness represents the proportion of the model's performance that our identified circuit explains, relative to the baseline performance when no specific input information is provided. We measure faithfulness by selecting a threshold $\theta$ to determine which heads to include in the circuit and ablating all other heads by replacing them with corrupted activations. Faithfulness is computed as $\frac{m(C)-m(\emptyset)}{m(M)-m(\emptyset)}$, where $m(C)$, $m(\emptyset)$, and $m(M)$ are the average performance metrics over the dataset $\mathcal{D}$ for the identified circuit, all heads ablated, and the full model, respectively. By sweeping over all $\theta$, we track performance improvement as we add circuit components. For comparison, we randomly select $n$ heads to keep clean (unablated) at each $\theta$, repeat this sampling 10 times, and average the metrics.
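A minimal sketch of the faithfulness computation (the metric function m below is a hypothetical placeholder for a task metric, such as logit difference, averaged over the dataset):

```python
def faithfulness(m_circuit: float, m_empty: float, m_full: float) -> float:
    """(m(C) - m(empty)) / (m(M) - m(empty)): fraction of full-model performance explained."""
    return (m_circuit - m_empty) / (m_full - m_empty)

# Hypothetical usage, where m(heads) runs the model with all heads outside `heads`
# replaced by their corrupted activations and averages the task metric:
# score = faithfulness(m(circuit_heads), m(set()), m(all_heads))
```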

Figure 5: Faithfulness of our learned circuits, circuits from edge attribution patching (EAP), and randomly selected circuits of equivalent size for the (a) IOI and (b) Greater-than tasks. Our circuits quickly approach or surpass the full model’s performance as attention heads are added in order of importance. We outperform or match the performance of EAP at all thresholds for all metrics. Faithfulness of 1 indicates complete agreement with the unablated model.

Our results are shown in Figure 5. We also show the same faithfulness and metric curves applied to edge attribution patching (EAP) [42]. As we add attention heads from our circuit in order of threshold, performance quickly approaches that of the full model across all metrics and, in some cases, even outperforms the full model with considerably fewer heads. Importantly, our predicted circuit performs better or equal in all metrics than EAP.

5 Related Work

5.1 Sparse and Discrete Representations for Circuit Discovery

Sparse and discrete representations of transformer activations have gained attention for their potential to enhance model interpretability. [40] and [4] are generally considered the first to explore sparse dictionary learning to untangle features conflated by superposition, where multiple features are distributed across fewer neurons in transformer MLPs. Their work highlighted the utility of sparse representations but did not fully address the identification of computational circuits. [20] were the first to show that SAEs also learn useful representations when applied to attention heads rather than MLPs, and scaled this to GPT-2 [19, 37].

[43] integrated a vector-quantised codebook into the transformer architecture. This technique demonstrates that incorporating discrete, interpretable features incurs only modest performance degradation and facilitates causal interventions. However, it necessitates architectural modifications, making it unsuitable for interpreting existing large-scale language models. [7] used recursive analysis to trace the activation lineage of target dictionary features across layers. While this offers insights into layer-wise contributions, it falls short of mapping these activations to specific model components or elucidating their role within the residual stream.

Most closely related to our work and conducted in parallel is that of [25], who employed a large SAE trained on diverse components, defining a framework for explicitly finding circuits. Their method relies on attribution patching (see below), which introduces practical difficulties at scale and again relies on a choice of metric. Additionally, their approach requires an SAE trained on millions of activations with significant upward projection to the dictionary, making it impractical for identifying specific circuits. Similarly, [16] used SAE-learned features to map attention head contributions to identified circuits. However, their approach still uses a form of attribution patching and suggests a tendency for identified features to be polysemantic.

5.2 Ablation and Attribution-Based Circuit Discovery Methods

Ablation-based methods are fundamental in identifying critical components within models. [6] introduced the ACDC algorithm, which automatically determines a component’s importance by looking at the model’s performance on a chosen metric with and without that component ablated. ACDC explores different ablation methods, such as replacing activations with zeros [34], using mean activations across a dataset [46], or activations from another data point [11]. Despite its effectiveness, ACDC is computationally intensive and sensitive to the choice of metric and type of intervention. The method often fails to identify certain critical model components even when minimising KL divergence between the subgraph and full model.

Subnetwork Probing (SP) and Head Importance Score for Pruning (HISP) are similar methods. SP identifies important components by learning a mask over internal components, optimising an objective that balances accuracy and sparsity [5]. HISP ranks attention heads based on the expected sensitivity of the model to changes in each head’s output, using gradient-based importance scores [28]. Both methods, however, are computationally expensive and sensitive to hyperparameters.

Recent advancements have addressed limitations of traditional circuit discovery methods. [42] introduced Edge Attribution Patching (EAP), using linear approximations to estimate the importance of altering an edge in the computational graph from normal to corrupted states [29], reducing the need for extensive ablations. However, EAP’s reliance on linear approximations can lead to overestimation of edge importance and weak correlation with true causal effects. Additionally, EAP fails when the gradient of the metric is zero, necessitating task-specific metrics for each new circuit. [15] recently proposed Edge Attribution Patching with Integrated Gradients (EAP-IG) to address these issues, evaluating gradients at multiple points along the path from corrupted to clean activations for more accurate attribution scores. Future work will benchmark our method against EAP and EAP-IG to understand the tradeoffs of each.

6 Discussion

The alignment of SAE-produced representations with language model circuits has significant implications for the scalability and interpretability of circuit discovery methods. If the level of granularity required for feature components in the circuit is coarser than the original head output dimension, it suggests that SAEs can efficiently project down rather than up, corresponding to a low level of feature-splitting and a high level of abstraction in the terminology of [4]. This finding is promising for the scalability of SAEs as circuit finders, especially when dealing with small datasets where the SAE is trained directly on positive/negative examples, eliminating the need for expensive training on millions of activations across all layers, heads, and components. The fact that we can learn sufficient representations by training the SAE on only 5-10 examples speaks to the scalability of our method. We will release the code upon acceptance.

Moreover, using SAEs for circuit discovery also eliminates the need for ablation, which all prior approaches rely on [45, 9] to assess a component’s indirect effect on performance as a proxy for importance [36]. By directly examining features, we bypass the computational complexities and difficulties in choosing a metric for each different circuit. Further, using features themselves as circuit components makes them inherently interpretable, opening up the possibility of applying auto-interpretability techniques to features in circuits [3]. The combination of automatic circuit identification and interpretable by-products represents a significant step towards the ultimate goal of mechanistic interpretability: the automatic identification and interpretation of circuits, at scale, in language models.

6.1 Limitations

Our method has several limitations that will be addressed in future work. First, although we learn discrete representations of attention head outputs, the interpretability of these learned codes may still be limited. Further work is needed to map these codes to human-interpretable concepts and behaviours. Second, we require the generation of a dataset of positive and negative examples for a circuit. This means we cannot do unsupervised circuit discovery and must carefully craft negative examples that are semantically similar to the positive ones, but are still corrupted enough to switch off the target circuit. To address this limitation, we plan to apply techniques such as quanta discovery from gradients [27] to automatically curate our positive and negative token inputs.

In addition to these method-specific limitations, any circuit discovery method faces the fundamental limitation of relying on human-annotated ground truth. The circuits found by previous researchers through manual inspection may be incomplete [46] or include edges that are correlated with model behaviour but not causally active [49]. Further, SAEs have been shown to make pathological errors [13]; until these are resolved, we may need to include these errors in the circuit discovery process itself (much like [25]).

6.2 Future Directions

One promising direction for future exploration is investigating the compositionality of the identified circuits and how they interact to give rise to complex model behaviours. Developing methods to analyse the hierarchical organisation of circuits and their joint contributions to various tasks could provide a more comprehensive understanding of the inner workings of large language models. A key aspect of this research could involve applying autointerpretability methods [3] to our learned features in discovered circuits.

Finally, extending our approach to other model components, such as feedforward layers and embeddings, could offer a more complete picture of the computational mechanisms underlying transformer-based models. By combining insights from different levels of abstraction, we can work towards developing a more unified and coherent framework for mechanistic interpretability, thus advancing our understanding of how transformer models process and generate language.

References

  • [1] Collin F Baker, Charles J Fillmore and John B Lowe “The Berkeley framenet project” In COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics, 1998
  • [2] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth and Edward Raff “Pythia: A suite for analyzing large language models across training and scaling” In International Conference on Machine Learning, 2023, pp. 2397–2430
  • [3] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu and William Saunders “Language models can explain neurons in language models”, 2023
  • [4] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison and Amanda Askell “Towards monosemanticity: Decomposing language models with dictionary learning” In Transformer Circuits Thread, 2023, pp. 2
  • [5] Steven Cao, Victor Sanh and Alexander M Rush “Low-complexity probing via finding subnetworks” In arXiv preprint arXiv:2104.03514, 2021
  • [6] Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim and Adrià Garriga-Alonso “Towards automated circuit discovery for mechanistic interpretability” In Advances in Neural Information Processing Systems 36, 2024
  • [7] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben and Lee Sharkey “Sparse autoencoders find highly interpretable features in language models” In arXiv preprint arXiv:2309.08600, 2023
  • [8] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen and Tom Conerly “A mathematical framework for transformer circuits” In Transformer Circuits Thread 1, 2021, pp. 1
  • [9] Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen and Yonatan Belinkov “Causal analysis of syntactic agreement mechanisms in neural language models” In arXiv preprint arXiv:2106.06087, 2021
  • [10] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser and Connor Leahy “The Pile: An 800GB Dataset of Diverse Text for Language Modeling” In arXiv preprint arXiv:2101.00027, 2021
  • [11] Atticus Geiger, Hanson Lu, Thomas Icard and Christopher Potts “Causal abstractions of neural networks” In Advances in Neural Information Processing Systems 34, 2021, pp. 9574–9586
  • [12] Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato and Aryaman Arora “Localizing model behavior with path patching” In arXiv preprint arXiv:2304.05969, 2023
  • [13] Wes Gurnee “SAE reconstruction errors are (empirically) pathological”, 2024
  • [14] Michael Hanna, Ollie Liu and Alexandre Variengien “How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model” In Advances in Neural Information Processing Systems 36, 2024
  • [15] Michael Hanna, Sandro Pezzelle and Yonatan Belinkov “Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms” In arXiv preprint arXiv:2403.17806, 2024
  • [16] Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng and Xipeng Qiu “Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT” In arXiv preprint arXiv:2402.12201, 2024
  • [17] Stefan Heimersheim and Jett Janiak “A circuit for Python docstrings in a 4-layer attention-only transformer”, 2023
  • [18] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu and Dario Amodei “Scaling laws for neural language models” In arXiv preprint arXiv:2001.08361, 2020
  • [19] C Kissane, R Krzyzanowski, A Conmy and N Nanda “Attention SAEs scale to GPT-2 small” In Alignment Forum, 2024
  • [20] Connor Kissane, Robert Krzyzanowski, Arthur Conmy and Neel Nanda “Sparse Autoencoders Work on Attention Layer Outputs”, Alignment Forum, 2024
  • [21] Honglak Lee, Alexis Battle, Rajat Raina and Andrew Ng “Efficient sparse coding algorithms” In Advances in Neural Information Processing Systems 19, 2006
  • [22] Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah and Vladimir Mikulik “Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla” In arXiv preprint arXiv:2307.09458, 2023
  • [23] David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Tom McGrath and Vladimir Mikulik “Tracr: Compiled transformers as a laboratory for interpretability” In Advances in Neural Information Processing Systems 36, 2024
  • [24] Laurens Maaten and Geoffrey Hinton “Visualizing data using t-SNE.” In Journal of Machine Learning Research 9.11, 2008
  • [25] Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau and Aaron Mueller “Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models” In arXiv preprint arXiv:2403.19647, 2024
  • [26] Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath and Neel Nanda “Copy suppression: Comprehensively understanding an attention head” In arXiv preprint arXiv:2310.04625, 2023
  • [27] Eric Michaud, Ziming Liu, Uzay Girit and Max Tegmark “The quantization model of neural scaling” In Advances in Neural Information Processing Systems 36, 2024
  • [28] Paul Michel, Omer Levy and Graham Neubig “Are sixteen heads really better than one?” In Advances in Neural Information Processing Systems 32, 2019
  • [29] Neel Nanda “Attribution Patching: Activation Patching At Industrial Scale”, 2022
  • [30] Neel Nanda “TransformerLens: A library for mechanistic interpretability of GPT-style language models”, https://github.com/neelnanda-io/TransformerLens, 2024
  • [31] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov and Shan Carter “Zoom in: An introduction to circuits” In Distill 5.3, 2020, pp. e00024–001
  • [32] Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye and Alexander Mordvintsev “The building blocks of interpretability” In Distill 3.3, 2018, pp. e10
  • [33] Bruno A Olshausen and David J Field “Sparse coding with an overcomplete basis set: A strategy employed by V1?” In Vision research 37.23 Elsevier, 1997, pp. 3311–3325
  • [34] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai and Anna Chen “In-context learning and induction heads” In arXiv preprint arXiv:2209.11895, 2022
  • [35] Aaron Oord and Oriol Vinyals “Neural discrete representation learning” In Advances in Neural Information Processing Systems 30, 2017
  • [36] Judea Pearl “Direct and indirect effects” In Probabilistic and causal inference: the works of Judea Pearl, 2022, pp. 373–392
  • [37] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever “Language models are unsupervised multitask learners” In OpenAI blog 1.8, 2019, pp. 9
  • [38] Tilman Räuker, Anson Ho, Stephen Casper and Dylan Hadfield-Menell “Toward transparent ai: A survey on interpreting the inner structures of deep neural networks” In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2023, pp. 464–483 IEEE
  • [39] Rico Sennrich, Barry Haddow and Alexandra Birch “Neural machine translation of rare words with subword units” In arXiv preprint arXiv:1508.07909, 2015
  • [40] Lee Sharkey, Dan Braun and Beren Millidge “Taking features out of superposition with sparse autoencoders” In AI Alignment Forum, 2022
  • [41] Dinghan Shen, Mingzhi Zheng, Yelong Shen, Yanru Qu and Weizhu Chen “A simple but tough-to-beat data augmentation approach for natural language understanding and generation” In arXiv preprint arXiv:2009.13818, 2020
  • [42] Aaquib Syed, Can Rager and Arthur Conmy “Attribution Patching Outperforms Automated Circuit Discovery” In arXiv preprint arXiv:2310.10348, 2023
  • [43] Alex Tamkin, Mohammad Taufeeque and Noah D Goodman “Codebook features: Sparse and discrete interpretability for neural networks” In arXiv preprint arXiv:2310.17230, 2023
  • [44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser and Illia Polosukhin “Attention is all you need” In Advances in Neural Information Processing Systems 30, 2017
  • [45] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer and Stuart Shieber “Investigating gender bias in language models using causal mediation analysis” In Advances in Neural Information Processing Systems 33, 2020, pp. 12388–12401
  • [46] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris and Jacob Steinhardt “Interpretability in the wild: a circuit for indirect object identification in GPT-2 small” In arXiv preprint arXiv:2211.00593, 2022
  • [47] Gail Weiss, Yoav Goldberg and Eran Yahav “Thinking Like Transformers” In arXiv preprint arXiv:2106.06981, 2021
  • [48] John Wright and Yi Ma “High-dimensional data analysis with low-dimensional models: Principles, computation, and applications” Cambridge University Press, 2022
  • [49] Fred Zhang and Neel Nanda “Towards Best Practices of Activation Patching in Language Models: Metrics and Methods” In arXiv preprint arXiv:2309.16042, 2024
  • [50] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang and Luke Zettlemoyer “OPT: Open Pre-trained Transformer Language Models” In arXiv preprint arXiv:2205.01068, 2022

Appendix A Methodology details

A.1 Sparse autoencoder architecture

Much like [4], our sparse autoencoders (SAEs) consist of a single hidden layer with tied weights for the encoder $E$ and decoder $D$. The SAE learns a dictionary of basis vectors $\mathbf{v}_{j}\in\mathbb{R}^{d_{\text{model}}}$ such that each attention head output $\mathbf{h}_{i}\in\mathbb{R}^{d_{\text{model}}}$ can be approximated as a sparse linear combination of the dictionary elements:

$$\mathbf{h}_{i}\approx\sum_{j=1}^{d_{\text{bottleneck}}}z_{i,j}\mathbf{v}_{j},$$

where $z_{i,j}$ are the sparse activations and $d_{\text{bottleneck}}$ is the dimensionality of the bottleneck layer. Our SAEs use the following parameters:

$$W_{E}\in\mathbb{R}^{d_{\text{bottleneck}}\times d_{\text{model}}},\quad W_{D}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{bottleneck}}},\quad b\in\mathbb{R}^{d_{\text{model}}}$$

The columns of $W_{D}$ are constrained to be unit vectors, representing the dictionary elements $\mathbf{v}_{j}$. Given an input attention head output $\mathbf{h}_{i}$, the activations of the bottleneck layer are computed as:

$$\mathbf{z}_{i}=\text{ReLU}(W_{E}(\mathbf{h}_{i}-\mathbf{b})+\mathbf{b}_{E})$$

and the reconstructed output is obtained via:

$$\hat{\mathbf{h}}_{i}=W_{D}\mathbf{z}_{i}+\mathbf{b}_{D},$$

where the tied bias $\mathbf{b}$ is subtracted before encoding and added after decoding. The dimensionality of the bottleneck layer $d_{\text{bottleneck}}$ can be either larger (projecting up) or smaller (projecting down) than the input dimensionality $d_{\text{model}}$. Hyperparameter sweeps found that projecting down (using fewer dimensions in the bottleneck layer than the dimension of the input) worked best for circuit identification. Additionally, we use a custom backward hook to ensure the dictionary vectors maintain unit norm by removing the gradient information parallel to these vectors before applying the gradient step.
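A hedged PyTorch sketch of this architecture (initialisation details are our assumptions, and we renormalise the dictionary columns after each optimiser step rather than implementing the gradient-projection hook described above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.W_E = nn.Parameter(torch.randn(d_bottleneck, d_model) * 0.01)
        self.W_D = nn.Parameter(F.normalize(torch.randn(d_model, d_bottleneck), dim=0))
        self.b_E = nn.Parameter(torch.zeros(d_bottleneck))
        self.b = nn.Parameter(torch.zeros(d_model))        # tied bias: subtracted pre-encode, added post-decode

    def forward(self, h: torch.Tensor):
        z = F.relu((h - self.b) @ self.W_E.T + self.b_E)   # sparse activations
        h_hat = z @ self.W_D.T + self.b                    # reconstruction from dictionary columns
        return h_hat, z

    @torch.no_grad()
    def renormalise_dictionary(self):
        # Keep each dictionary column v_j at unit norm.
        self.W_D.data = F.normalize(self.W_D.data, dim=0)
```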

A.2 Training the sparse autoencoder

We collate 250 positive and 250 negative examples for each dataset. We randomly sample 10 examples for training the SAE from this collection, unless otherwise specified. We tokenise these examples, run them through the model, and cache all attention head results in all layers; these cached results are stacked into a single tensor of shape $n_{\text{examples}}\times n_{\text{heads}}\times d_{\text{model}}$, which can easily be passed through the SAE in a single forward pass. These activations are the inputs to our sparse autoencoder.

The SAE is trained to minimise a loss function that includes a reconstruction term and a sparsity penalty, controlled by the hyperparameter λ\lambda:

$$\mathcal{L}=\sum_{i=1}^{n_{\text{heads}}}\left\lVert\mathbf{h}_{i}-\sum_{j=1}^{d_{\text{bottleneck}}}z_{i,j}\mathbf{v}_{j}\right\rVert_{2}^{2}+\lambda\sum_{i=1}^{n_{\text{heads}}}\sum_{j=1}^{d_{\text{bottleneck}}}|z_{i,j}|,$$

where $\lambda$ is typically about 0.01, our learning rate is $10^{-3}$, and we train for 500 epochs using the Adam optimiser. We use a single NVIDIA Tesla V100 Tensor Core GPU with 32GB of VRAM for all experiments.
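Continuing the sketch from A.1, a minimal training loop under these hyperparameters (the cached activations x_train below are a synthetic stand-in):

```python
import torch
import torch.nn.functional as F

sae = SparseAutoencoder(d_model=768, d_bottleneck=200)     # class sketched in A.1
optimiser = torch.optim.Adam(sae.parameters(), lr=1e-3)
lam = 0.01

x_train = torch.randn(10, 144, 768)                        # stand-in (n_examples, n_heads, d_model)
h = x_train.reshape(-1, x_train.shape[-1])                 # treat every head output as one sample

for epoch in range(500):
    h_hat, z = sae(h)
    loss = F.mse_loss(h_hat, h, reduction="sum") + lam * z.abs().sum()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    sae.renormalise_dictionary()                           # keep dictionary columns at unit norm
```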

A.3 Counting the unique positive occurrences and co-occurrences

Edge-level: co-occurrences

Edge-level circuit identification aims to predict which attention heads are part of a circuit by analysing patterns in how the heads perform similar computation in tandem, rather than in isolation.

The first step is to construct two co-occurrence matrices, $\mathbf{C}^{+}$ and $\mathbf{C}^{-}$, which capture how often different token codes co-occur between each pair of heads in the positive and negative examples, respectively. For instance, $\mathbf{C}^{+}_{h_1,h_2,c_1,c_2}$ counts the number of times code $c_1$ in head $h_1$ occurs together with code $c_2$ in head $h_2$ across the positive examples. $\mathbf{C}^{-}$ does the same for the negative examples.

Next, we compute a matrix $\mathbf{U}$ that identifies code co-occurrences that are unique to the positive examples for each head pair. An entry $\mathbf{U}_{h_1,h_2}$ sums up the number of positive-only co-occurrences between heads $h_1$ and $h_2$, that is, cases where a code pair has a positive count in $\mathbf{C}^{+}$ but a zero count in $\mathbf{C}^{-}$ for that head pair.

Intuitively, $\mathbf{U}$ captures which head pairs tend to jointly attend to particular token patterns more often in positive examples compared to negative examples. Head pairs with high values in $\mathbf{U}$ are stronger candidates for being part of the relevant circuit.

The head pairs are then sorted in descending order by their $\mathbf{U}$ values. To build the predicted circuit, we take the top $k$ head pairs from this sorted list, where $k$ is a hyperparameter. For each of these top $k$ pairs $(h_1,h_2)$, we increment the entries for $h_1$ and $h_2$ in a vector $\mathbf{u}$. This vector keeps track of how many times each head appears in the top head pairs. The reason for using only the top $k$ pairs is that including all pairs would make each head co-occur with every other head once, leading to a uniform $\mathbf{u}$ that would not distinguish between heads.

Figure 6: Edge-level circuit identification process for a 1-layer transformer with 6 attention heads and 6 examples (the first 3 rows are positive examples, the last 3 are negative). Co-occurrence tensors $\mathbf{C}^{+}$ and $\mathbf{C}^{-}$ are constructed from positive and negative examples, counting specific code co-occurrences between each pair of heads. Matrix $\mathbf{U}$ tabulates the number of unique co-occurrences in the positive examples for each head pair $(h_1,h_2)$. The top $k$ head pairs (shown in green) by $\mathbf{U}$ value are used to build the predicted circuit, incrementing the corresponding entries in vector $\mathbf{u}$. After softmax normalisation, heads exceeding a threshold $\theta$ in $\mathbf{u}$ are predicted to be part of the circuit.

Applying a softmax to $\mathbf{u}$ normalises it into a probability distribution, allowing us to set a threshold $\theta$ to make the final predictions, with $\theta$ on the same scale (i.e. between 0 and 1) for any arbitrary circuit. Heads with a value exceeding $\theta$ are predicted to be part of the circuit. This whole process is outlined in Figure 6, where we step through an example on a 1-layer transformer with 6 attention heads and 6 text examples (3 positive and 3 negative).

A.4 Indication of wall-time as the underlying language models scale

A key benefit of our method over existing approaches is its efficiency. Whilst ACDC takes upwards of several hours to run on a V100 or A100 GPU for IOI on GPT2-Small [6], our method completes in under 3 seconds for GPT2-Small and under 45 seconds for GPT2-XL. In fact, since we previously showed that one may be able to use only 10 examples when training the SAE, if this trend holds across model scales the time for GPT2-XL could be reduced to less than 10 seconds.

Figure 7: The wall-time in seconds taken to complete our entire circuit identification process, from training the SAE to sweeping over all thresholds $\theta\in[0,1]$ for predicting which heads are in the circuit. We use a single V100 for training and inference with the SAE, and show how this time scales as the model size grows. "10 SAE" and "100 SAE" refer to using 10 and 100 text examples to train the SAE, respectively; "200 Pos" refers to using 200 examples to count the number of unique positive codes.

We show the specific model specifications in Table 3. If the number of text examples for both the SAE and for counting positive codes remains constant, the main contributors to increased runtime for our method are $n_{\text{heads}}$ and $d_{\text{model}}$, as each example is in $\mathbb{R}^{n_{\text{heads}}\times d_{\text{model}}}$. Since counting the unique positive codes involves elementary set operations over only a few hundred arrays of integer codes, it is only training the SAE that takes perceptibly longer as we increase the size of the underlying language model.

Model | $n_{\text{params}}$ | $n_{\text{layers}}$ | $d_{\text{model}}$ | $n_{\text{heads}}$ | act_fn | $n_{\text{ctx}}$ | $d_{\text{vocab}}$ | $d_{\text{mlp}}$
GPT-2 Small | 85M | 12 | 768 | 12 | gelu | 1024 | 50257 | 3072
GPT-2 Medium | 302M | 24 | 1024 | 16 | gelu | 1024 | 50257 | 4096
GPT-2 Large | 708M | 36 | 1280 | 20 | gelu | 1024 | 50257 | 5120
GPT-2 XL | 1.5B | 48 | 1600 | 25 | gelu | 1024 | 50257 | 6400
Table 3: Specifications of GPT-2 Model Variants

Appendix B Circuit visualisation and analysis

Clearly, the number of unique positive codes per head is highly positively correlated with the presence of that head in the ground-truth circuit, as seen in Figure 8. In this section, we give further details on the predicted circuits and some additional analysis.

Figure 8: Comparison of the number of codes unique to the positive examples by individual attention head against a binary mask of whether that attention head is in the ground-truth circuit, shown for (a) IOI, (b) Greater-than, and (c) Docstring. The shade of blue shows how many unique positive codes that head has, and the cell for each head is outlined in red if it is in the canonical ground-truth circuit. There is a clear positive correlation between the number of unique codes a head has and its presence in the circuit.
Figure 9: ROC curves for node-level and edge-level circuit identification. ACDC is shown using logit difference as the metric, also for both node-level and edge-level identification.

B.1 Indirect Object Identification

Predicted circuit

We compare the performance of our circuit to the full model, a fully corrupted model, and a random complement circuit of the same size. The metric is logit difference: the difference in logit between the indirect object's name and the subject's name. The full model's average logit difference is 3.55, meaning the correct name is $e^{3.55}\approx 34.88$ times more likely than the incorrect name.

To create a corrupted cache of activations, we run the model on the same prompts with the subject’s name swapped. Replacing all attention heads’ activations with these corrupted activations gives an average logit difference of -3.55. When testing our circuit, we replace activations for heads not in the circuit with their corrupted activations.

Our circuit has a higher logit difference (3.62) than the full GPT-2 model. The ground-truth circuit from [46] has a logit difference of 4.11. We compare this to the average logit difference (-1.97) of 100 randomly sampled complement circuits with the same number of heads as our circuit. These results are shown in Table 4.

We also provide the normalised logit difference: the logit difference minus the corrupted logit difference, divided by the signed difference between clean and corrupted logit differences. A value of 0 indicates no change from corrupted, 1 matches clean performance, less than 0 means the circuit performs worse than corrupted, and greater than 1 means the circuit improves on clean performance.
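To make this concrete, the normalisation can be written as follows, where LD denotes the logit difference of a given circuit and NLD is shorthand we introduce only for this illustration; the worked example uses the clean, corrupted, and predicted-circuit values reported in Table 4:

$\text{NLD}(\text{circuit})=\dfrac{\text{LD}(\text{circuit})-\text{LD}(\text{corrupted})}{\text{LD}(\text{clean})-\text{LD}(\text{corrupted})},\qquad\text{e.g.}\quad\dfrac{3.62-(-3.55)}{3.55-(-3.55)}\approx 1.01.$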

Model/circuit Attn. heads Logit Diff. Normalised Logit Diff. Probability Diff.
GPT-2 (Clean) 144 3.55 1.0 34.88
GPT-2 (Corrupted) 144 -3.55 0.0 0.03
Ground-truth 26 4.11 1.08 61.14
Ours 40 3.62 1.01 37.48
Random complement 40 -2.23 0.19 0.23
Table 4: Performance comparison of our circuit, the full GPT-2 model, corrupted GPT-2, ground-truth circuit, and random complement circuit. Logit difference measures the difference in logit between the correct and incorrect names. Normalised logit difference is the logit difference minus the corrupted logit difference, divided by the signed difference between clean and corrupted logit differences. Probability difference is the ratio of probabilities for the correct and incorrect names. Our predicted circuit actually improves on the performance of the full model, albeit not as much as the ground-truth circuit.
Refer to caption
Refer to caption
Figure 10: The left figure shows our predicted IOI circuit, and the right shows the canonical circuit from [46]. By default, we include directed edges from all heads in a layer to all heads in the subsequent layer. A head is shaded darker the higher its softmaxed value. Additionally, our circuit provides no information about position, as we aggregate over positions when caching residual streams.

Negative name mover heads and previous token heads

What seems to be a key advantage of our method over ACDC is our ability to detect both negative name mover heads and one of the previous token heads. [46] found that there exist attention heads in GPT-2 that actually write to the residual stream in the opposite direction of the heads that output the remaining name in the sentence, called negative name-mover heads. These likely “hedge” a model’s predictions in order to avoid high cross-entropy loss when the sentence has a different structure, like a new person being introduced or a pronoun being used instead of the repeated name [46]. Previous token heads copy information from the second name to the word after and have been found to have a minor role in the circuit.

[6] found that they were unable to identify either of these types of heads as part of the circuit unless using a very low threshold $\tau$, which led to many extraneous heads being included in their prediction. This is despite the fact that negative name mover heads in particular are highly important in calibrating the confidence of name prediction in the circuit [26]. The fact that we find both negative name mover heads (L10H7 and L11H10) and one of the previous token heads (L2H2) is highly promising evidence that the distribution of our SAE activations provides a robust representation of the on-off nature of any given head in a circuit. Being able to identify negative components (those that actively decrease confidence in predictions) is particularly important, because many circuits involve this general behaviour, known as copy suppression [26].

B.2 Greater-than

Setup details

The Greater-than task focuses on a simple mathematical operation in the form it appears in text, i.e. a sentence of the form “The <noun> lasted from the year XXYY to the year XX”, where the aim is to place all non-zero probability on two-digit completions greater than YY. We use the same setup as [14]. We use nouns which imply some form of duration, for example “war”, found using FrameNet [1]. The century XX is sampled from $\{12,\ldots,18\}$ and the start year YY from $\{02,\ldots,98\}$. Because of GPT-2’s byte-pair encoding [39], more frequent years are tokenised as single tokens (e.g. “[1800]” instead of “[18][00]”) and so these are removed from the pool. Years ending in “01” and “99” are removed so as to ensure that there is at least one correct and one incorrect valid tokenised answer. Code to generate similar datasets can be found in [14]’s GitHub repository.

Predicted circuit

Figure 11 shows our predicted circuit and the canonical ground-truth circuit from [14]. The two are highly similar, although we predict several heads in layers 10 and 11 whose role [14] attribute to MLP layers instead. It is possible that, because we only look at attention head outputs and not MLP layers, these later-layer heads appear to perform computation that is in fact largely done by MLPs. An interesting follow-up is to examine why these later-layer heads appear in our predicted circuit if they are not doing any useful computation for the task, or whether they are in fact performing some relevant manipulation of the residual stream.

Refer to caption
Refer to caption
Figure 11: Left is our predicted Greater-than circuit, and the right is the canonical circuit from [14].

We again examine the performance of the predicted circuit in the context of the clean model and the ground-truth circuit from [14]. We produce a dataset of 100 examples according to the same process outlined above. The corrupted examples use the same prompts but with the start year YY replaced by “01”. We define two metrics measuring the performance of the model, introduced by [14].

Let $Y$ be the start year of the sentence, and $p_{y}$ be the probability assigned by the model to a two-digit output year $y$. The first metric, probability difference ($PD$), measures the extent to which the model assigns higher probability to years greater than the start year. It is calculated as:

$PD=\sum_{y>Y}p_{y}-\sum_{y\leq Y}p_{y}$ (1)

Probability difference ranges from -1 to 1, with higher values indicating better performance in reflecting the greater-than operation. A positive value of $PD$ indicates that the model assigns higher probabilities to years greater than the start year, while a negative value suggests the opposite.

The second metric, cutoff sharpness ($CS$), quantifies the model’s behaviour of producing a sharp cutoff between valid and invalid years. It is calculated as:

$CS=p_{Y+1}-p_{Y-1}$ (2)

where $p_{Y+1}$ is the probability assigned to the year immediately following the start year, and $p_{Y-1}$ is the probability assigned to the year immediately preceding the start year. Cutoff sharpness also ranges from -1 to 1, with larger values indicating a sharper cutoff. Although not directly related to the greater-than operation, this metric ensures that the model’s output depends on the start year and does not produce constant but valid output. A high value of $CS$ suggests that the model exhibits a sharp transition in probabilities between the years adjacent to the start year.
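For illustration, the following is a minimal sketch of how these two metrics could be computed for a single example, assuming the model's probabilities over two-digit year completions have already been gathered into a dictionary; the function names and the `year_probs` structure are illustrative rather than taken from the original implementation.

```python
# Sketch: computing probability difference (PD) and cutoff sharpness (CS)
# for one Greater-than example. `year_probs` maps each two-digit year to
# the probability the model assigns to that completion.

def probability_difference(year_probs: dict[int, float], start_year: int) -> float:
    """PD = sum_{y > Y} p_y - sum_{y <= Y} p_y."""
    above = sum(p for y, p in year_probs.items() if y > start_year)
    at_or_below = sum(p for y, p in year_probs.items() if y <= start_year)
    return above - at_or_below

def cutoff_sharpness(year_probs: dict[int, float], start_year: int) -> float:
    """CS = p_{Y+1} - p_{Y-1}."""
    return year_probs[start_year + 1] - year_probs[start_year - 1]

# Example with a toy distribution concentrated just above the start year.
probs = {y: 0.0 for y in range(2, 99)}
probs.update({40: 0.05, 41: 0.05, 42: 0.6, 43: 0.3})
print(probability_difference(probs, start_year=41))  # 0.9 - 0.1 = 0.8
print(cutoff_sharpness(probs, start_year=41))        # 0.6 - 0.05 = 0.55
```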

Model/circuit Attention heads Probability Difference Cutoff Sharpness
GPT-2 (Clean) 144 76.96% ($\pm$ 26.82%) 5.57% ($\pm$ 8.08%)
GPT-2 (Corrupted) 144 -40.32% ($\pm$ 55.28%) -0.06% ($\pm$ 0.08%)
Ground-truth 9 71.30% ($\pm$ 28.71%) 5.50% ($\pm$ 6.89%)
Ours 29 76.54% ($\pm$ 27.51%) 5.76% ($\pm$ 7.42%)
Random complement 29 -37.91% ($\pm$ 55.76%) -0.04% ($\pm$ 0.78%)
Table 5: Performance comparison of our predicted circuit, the clean GPT-2 model, the corrupted GPT-2 model, the ground-truth circuit from [14], and a random complement circuit on the greater-than task. The performance is measured using probability difference (PDPD) and cutoff sharpness (CSCS). The values represent the mean and standard deviation across 100 examples. Our predicted circuit achieves comparable performance to the clean GPT-2 model and the ground-truth circuit while being significantly smaller in size.

Table 5 presents the performance of our predicted circuit in comparison to the clean GPT-2 model, the corrupted GPT-2 model, the ground-truth circuit from [14], and random complement circuits. The performance is measured using probability difference (PD) and cutoff sharpness (CS). Our predicted circuit, consisting of 29 attention heads, achieves a PD of 76.54% and a CS of 5.76%, slightly outperforming the ground-truth circuit (PD: 71.30%, CS: 5.50%), albeit with more heads. Notably, our circuit also surpasses the CS of the clean GPT-2 model (144 heads), and essentially matches its PD. In contrast, the corrupted GPT-2 model and the average of 100 random complement circuits of the same size as our predicted circuit show negative PD and near-zero CS, indicating poor performance in capturing the greater-than operation and in producing a sharp cutoff between valid and invalid years. These results demonstrate that our predicted circuit effectively captures the relevant information for the task while being significantly smaller than the full GPT-2 model.

Exploratory analysis of relationship between codes and year

We also conducted some exploratory analysis of whether the learned activations from the encoder, and their corresponding codes, were associated with the start year of the completion. For instance, are there codes that only activate for high year numbers, such as <CENTURY>90 and above? If we only use 100 codes, are these codes roughly distributed to particular years, so that there is a soft bijection between codes and two-digit years?

To answer some of these questions, we created a new dataset consisting of prompts of the form “The war lasted from the year <century><year> to <century>”, and trained a sparse autoencoder on the attention head outputs of GPT2-Small on these prompts, with 100 learned features.

Refer to caption
(a) Coloured by year
Refer to caption
(b) Coloured by century
Figure 12: Dimensionality reduction using t-SNE of the learned activations from the SAE encoder for all Greater-than examples and all heads, coloured by century (i.e. the 18 in “The war lasted from 1807 to 18”) and year (i.e. the 07 in “The war lasted from 1807 to 18”).

We initially examined the t-SNE dimensionality reduction [24] of embeddings for all examples across all heads, shown in Figure 12. We colour the points by the year in the example (e.g. the 14 in “The war lasted from 1914 to 19”). Interestingly, we notice two distinct clusters of activations. The first, on the upper right in Figure 12(a), seems to have a fairly well-defined transition between examples with low year numbers to examples with high year numbers. However, the other cluster (the lower left in both plots) appears to have no discernible order. This suggests that the SAE may be learning degenerate latent representations for examples that differ only in the century used.

Refer to caption
Figure 13: The same t-SNE projection as Figure 12(a), but with the background coloured by the majority century of the $k=10$ nearest neighbours. Segmenting the plot in this way makes some of the transitions from low years to high years clearer (i.e. in the orange and yellow segments).

We then show the same plot, except colouring each example by the century of the example (e.g. the 19 in “The war lasted from 1914 to 19”). Strikingly, there is almost perfect linear separation between the classes (where the classes correspond to centuries). If we instead produce the same plot with the background coloured by the majority century of the 10 nearest neighbouring points, some structure with respect to the year of the example begins to emerge (Figure 13). There is a stronger gradient within groups, with the year number increasing roughly linearly in a certain direction. However, there is still a significant amount of noise, and future research should examine why the SAE learns representations that focus more on the century than the year, when the year is evidently more important for successful completion of the Greater-than task.

We also examined the top 2 principal components of the encoder activations on individual attention heads across examples to determine whether they had some relationship to the year number in the example. This is shown in Figure 14. These four individual heads are selected to show a variety of behaviours. For some, like 14(a) and 14(b), the principal components seem to directly correspond to “low” years and “high” years, with many examples in approximately the same decade being mapped to almost exactly the same PCA-reduced point. Other heads, such as 14(c) and 14(d), have significantly more variability, but seem to follow some gradient of transitioning from lower years to higher years as we move across the space.

Refer to caption
(a) L1H0
Refer to caption
(b) L6H0
Refer to caption
(c) L10H8
Refer to caption
(d) L11H8
Figure 14: PCA of learned activations from individual attention heads across examples, coloured by year number in the example.

B.3 Docstring

Refer to caption
Refer to caption
Figure 15: Left is our predicted Docstring circuit, and right is the canonical circuit from [17].

Our predicted Docstring circuit is shown in Figure 15. Interestingly, our circuit identification method does not predict L0H2 and L0H4 as being part of the circuit, whereas [17] does. However, after running the ACDC algorithm (as well as HISP and SP, and manual interpretation) on the docstring circuit, [6] concluded that these two heads are not relevant under the docstring distribution. That both ACDC and our method exclude these heads, which were manually confirmed not to be part of the circuit, is promising for the reliability of our approach.

Appendix C Detailed comparison to ACDC and other circuit identification methods

It is important to clarify the distinctions and similarities between [6]’s ACDC method and our approach. Our work builds upon ACDC, adapting its code, results, and experiments from their MIT-licensed GitHub repository (https://github.com/ArthurConmy/Automatic-Circuit-Discovery). The primary workflow for ACDC begins by specifying the computational graph of the full model for the task or circuit under examination, alongside a threshold for the acceptable difference in a metric between the predicted circuit and the full model. This computational graph, represented using a correspondence class, includes nodes and edges that connect these nodes, typically representing components like attention heads, query/key/value projections, and MLP layers, with edges indicating the connections between these components.

ACDC then iterates backwards over the topologically sorted nodes in the computational graph, starting from the output and moving towards the input. During this process, it ablates activations of connections between a node and its children by replacing the activations with corrupted or zero values, measuring the impact on the output metric. The ablation is performed using a receiver hook function, which modifies the input activations to a node based on the presence or absence of edges connecting it to its parents. If the change in the metric is less than the specified threshold, the connection is pruned, updating the graph structure and altering the parent-child relationships between nodes. This is shown in Figure 16.

This pruning step is recursively applied to the remaining nodes. If a node becomes disconnected from the output node, it is removed from the graph. The resulting subgraph contains the critical components and connections necessary for the given task. An important hyperparameter in ACDC is the order in which the algorithm iterates over the parent nodes. This order, whether reverse, random, or based on their indices, significantly affects the performance in circuit identification.
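For illustration, the following is a schematic sketch of this pruning loop; `run_with_edges` and the node ordering are assumed helpers standing in for ACDC's hooking machinery, and this is not the reference implementation.

```python
# Schematic sketch of the ACDC pruning loop described above (not the
# reference implementation). `graph` maps each node to a list of its
# parents; `order` is a topological ordering of the nodes; and
# `run_with_edges` is an assumed helper that runs the model with only
# the given edges intact (pruned edges receive corrupted activations)
# and returns the chosen task metric.

def acdc_prune(graph: dict, order: list, run_with_edges, tau: float) -> set:
    edges = {(parent, child) for child, parents in graph.items() for parent in parents}
    # Iterate backwards from the output towards the input.
    for child in reversed(order):
        for parent in list(graph[child]):
            current = run_with_edges(edges)
            candidate = edges - {(parent, child)}
            # Prune the connection if ablating it barely moves the metric.
            if abs(current - run_with_edges(candidate)) < tau:
                edges = candidate
                graph[child].remove(parent)
    return edges
```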

Refer to caption
(a) Step 2a: Define the computational graph and threshold $\tau$.
Refer to caption
(b) Step 2b: Ablate activations and measure the effect on metric $m$.
Refer to caption
(c) Step 2c: Prune connections and recursively refine the graph.
Figure 16: Overview of the ACDC method for refining computational graphs. Steps include specifying the graph and pruning threshold (2a), ablation and metric measurement (2b), and recursive pruning and refinement (2c), resulting in a subgraph highlighting critical components for the specified task. Figure taken from [6].

C.1 Nodes vs. edges

ACDC operates primarily on edges rather than nodes, even though the procedure is agnostic to whether we corrupt nodes or edges in the computational graph. The reason for this is that operating on edges allows ACDC to capture the compositional nature of reasoning in transformer models, particularly in how attention heads in subsequent layers build upon the computations of previous layers.

By replacing the activation of the connection between two nodes (e.g., Layer 0 and Layer 1) while maintaining the original activations between other nodes (e.g., Layer 1 and Layer 2), ACDC can distinguish the effect of model components in different layers independently. This is crucial for understanding the role of each component in the compositionality of computation between attention heads in subsequent layers [8].

Although ACDC can split the computational graph into query, key, and value calculations for each attention head, the authors focus primarily on attention heads and MLP layers to complete their circuit identification within a reasonable amount of time. This is similar to the approach taken in our method, where we also focus on attention heads, as the canonical circuits for each task are largely defined in terms of this level of granularity.

C.2 Final output

The final output of an ACDC circuit prediction is a subgraph of the original computational graph, which contains the critical nodes and edges for the given task. The nodes in this subgraph represent the components specified in the original computational graph, such as attention heads, query/key/value projections, and MLP layers. The edges represent the connections between these components that are essential for the model’s performance on the task.

For most of the circuits examined in the ACDC paper, including the IOI task, the authors focus on attention heads, as these have canonical ground-truths from previous works. This allows for a direct comparison between the ACDC-discovered circuits and the manually identified circuits, providing a way to validate the effectiveness of the ACDC algorithm in recovering known circuits. This means we can also provide a direct comparison to ACDC on both a node-level and edge-level. Regardless of the approach in finding the circuit components, the final output of both methods is a predicted circuit of attention heads we can compare to the ground-truth for the relevant task.

C.3 On why we can compare ACDC to our method

Why, then, do we believe it makes sense to compare our node-level and edge-level circuit discovery with ACDC’s node- and edge-level discovery, when the methods of determining the importance of a node or an edge are fundamentally different in each case? For instance, we determine the importance of an “edge” between two heads by examining the number of unique co-occurring code pairs for that pair of heads, whereas ACDC ablates the activation of the connection between these two heads. However, the output of both methods, a binary classification of each head as being in the circuit or not, is the same. We simply group the edge-level and node-level methods together for comparison because edge-level identification focuses on the information moving between nodes (via the residual stream), whereas node-level identification looks at the output of an individual head (to the residual stream) in isolation.

C.4 HISP and SP

Subnetwork probing [5] and head importance score for pruning [28] are both predecessors of ACDC used to examine which transformer components are important for certain inputs, and thus which components might be part of the circuit for a specific type of task. Whilst they are not the focal comparison of our results, we include the methodology used here largely as a supplement to Figure 2. We follow the exact same setup as [6], and direct the reader to the ACDC repository for implementation details (https://github.com/ArthurConmy/Automatic-Circuit-Discovery).

Subnetwork probing (SP)

To compare our circuit discovery approach with Subnetwork Probing (SP) [5], we adopt a similar setup to ACDC [6]. SP learns a mask over the internal model components, such as attention heads and MLP layers, using an objective function that balances accuracy and sparsity. This function includes a regularisation parameter $\lambda$, which we do not refer to in the main text to avoid confusion with the sparsity penalty used in training our sparse autoencoder (SAE). Unlike the original SP method, which trains a linear probe after learning a mask for every component, we omit this step to maintain alignment with ACDC’s methodology.

We made three key modifications to the original SP method. First, we adjusted the objective function to match ACDC’s, using either KL divergence or a task-specific metric instead of the negative log probability loss originally used by [5]. Second, we generalised the masking technique to replace activations with both zero and corrupted activations. This change reflects the more common use of corrupted activations in mechanistic interpretability and is achieved by linearly interpolating between a clean activation (when the mask weight is 1) and a corrupted activation (when the mask weight is 0), editing activations rather than model weights. Third, we employed a constant learning rate instead of the learning rate scheduling used in the original SP method.
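As a minimal sketch of the second modification, the interpolation can be written as below; the names and shapes are illustrative rather than taken from the SP or ACDC codebases.

```python
import torch

def masked_activation(clean: torch.Tensor,
                      corrupted: torch.Tensor,
                      mask_weight: torch.Tensor) -> torch.Tensor:
    """Linearly interpolate between a corrupted activation (mask weight 0)
    and a clean activation (mask weight 1); activations are edited rather
    than model weights. Illustrative sketch only."""
    return mask_weight * clean + (1.0 - mask_weight) * corrupted
```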

To determine the number of edges in subgraphs identified by SP, we count the edges between pairs of unmasked nodes. For further implementation details, please refer to the ACDC repository [6].

Head importance score for pruning (HISP)

To compare our approach with Head Importance Score for Pruning (HISP) [28], we adopt the same setup as ACDC [6]. HISP ranks attention heads based on an importance score and retains only the top $k$ heads to predict the circuit, with $k$ being a hyperparameter used to generate the ROC curve. We made two modifications to the original HISP setup. First, instead of using the derivative of a loss function, we use the derivative of a metric $F$. Second, we account for corrupted activations as well as zero activations by generalising the interpolation factor $\xi_{h}$ between the clean head output (when $\xi_{h}=1$) and the corrupted head output (when $\xi_{h}=0$).

The importance scores for components are computed as follows: $I_{C}:=\frac{1}{n}\sum_{i=1}^{n}\left|\left(C(x_{i})-C(x_{i}^{\prime})\right)^{\top}\frac{\partial F(x_{i})}{\partial C(x_{i})}\right|$, where $C(x_{i})$ is the output of an internal component $C$ of the transformer on clean input $x_{i}$, and $C(x_{i}^{\prime})$ is its output on the corrupted input. For zero activations, the equation is adjusted to exclude the $-C(x_{i}^{\prime})$ term. All scores are normalised across different layers as described by [28]. The number of edges in subgraphs identified by HISP is determined by counting the edges between pairs of unmasked nodes, similar to the approach used in Subnetwork Probing. For more details on the implementation, please refer to the ACDC repository [6].
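A minimal sketch of this score for a single component is given below, assuming the clean outputs, corrupted outputs, and metric gradients have already been cached as tensors; names and shapes are illustrative.

```python
import torch

def hisp_importance(clean_out: torch.Tensor,
                    corrupt_out: torch.Tensor,
                    grad_metric: torch.Tensor,
                    zero_ablation: bool = False) -> torch.Tensor:
    """I_C = (1/n) sum_i | (C(x_i) - C(x_i'))^T dF(x_i)/dC(x_i) |.
    With zero ablation the corrupted term is dropped, as described above.
    All tensors have shape (n_examples, d_component)."""
    diff = clean_out if zero_ablation else clean_out - corrupt_out
    per_example = (diff * grad_metric).sum(dim=-1).abs()
    return per_example.mean()
```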

C.5 Edge attribution patching (EAP)

Edge Attribution Patching (EAP) is designed to efficiently identify relevant model components for solving specific tasks by estimating the importance of each edge in the computational graph using a linear approximation [42]. Implemented in PyTorch, EAP computes attribution scores for all edges using only two forward passes and one backward pass. This method leverages a Taylor series expansion to approximate the change in a task-specific metric, such as logit difference or probability difference, after corrupting an edge. For the IOI and Greater-than tasks, EAP used edge-based attribution patching with absolute value attribution. Both tasks employed the negative absolute metric for computing attributions. EAP pruned nodes using a single iteration, with the pruning mode set to “edge”. This approach avoids issues with zero gradients in KL divergence by using task-specific metrics, making it a robust and scalable solution for mechanistic interpretability. We adapted the code accompanying [42]’s original paper.

We noted above that EAP is limited in the metrics we can apply for discovery because the gradient of the metric cannot be zero; we elaborate here. For instance, this means we cannot use the KL divergence metric to find important components. The KL divergence is equal to 0 when comparing a clean model to a clean model (i.e. without ablations) and is non-negative, so the zero point is a global minimum and all gradients vanish there.

C.6 Head activation norm difference

The effectiveness of using SAE-learned features for identifying circuit components raises an important question: why is it necessary to project raw head activations into the SAE latent space to distinguish between positive and negative circuit computations? To investigate whether this projection helps to reduce noise or interference, we computed the mean per-head activation separately over positive and over negative examples, and took the difference. We then calculated the norm of this difference for each head, applied a softmax over all heads, and evaluated the ROC AUC against the ground-truth circuit. This analysis was conducted using the same number of examples (10) that the SAE was trained on.
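The following sketch shows this baseline; `pos_acts` and `neg_acts` are assumed to hold cached attention-head outputs of shape (n_examples, n_heads, d_model), and the names are illustrative rather than taken from our codebase.

```python
import torch

def head_norm_difference_scores(pos_acts: torch.Tensor,
                                neg_acts: torch.Tensor) -> torch.Tensor:
    """Per-head score from raw activations: take the mean activation over
    positive and over negative examples, compute the difference, reduce it
    to a per-head norm, and softmax over heads.
    Shapes: (n_examples, n_heads, d_model) -> (n_heads,)."""
    diff = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)  # (n_heads, d_model)
    norms = diff.norm(dim=-1)                           # (n_heads,)
    return torch.softmax(norms, dim=0)

# These per-head scores are then compared against the binary ground-truth
# circuit mask via ROC AUC, exactly as for our SAE-based scores.
```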

Refer to caption
Figure 17: ROC AUC of using the norms of the differences in head activations between 10 positive and negative examples, compared with our method. While there is some signal in the raw head activations, averaging across examples does not capture as much nuance as the SAE features.

As shown in Figure 17, head activations do contain some signal regarding the heads involved in circuit-specific computation. However, they are not as effective as our method in distinguishing these computations. This may be due to the variation in particular dimensions within the residual stream across all heads (corresponding to the vertical stripes in Figure 18), which likely requires non-linear computation to disentangle positive and negative examples, a task the SAE appears to perform effectively.

Refer to caption
Figure 18: Mean activation difference between positive and negative examples for the IOI task. Some heads exhibit consistently high differences (horizontal stripes), and certain residual stream dimensions show consistently high differences (vertical stripes).

We observed that performance for the IOI and Greater-than tasks improved as the number of activations over which we computed the mean difference increased. Specifically, both tasks reached a ROC AUC of approximately 0.80-0.85 around 500 examples. However, for the docstring task, performance actually worsened with an increased number of examples. This suggests that while the mean difference method can serve as an initial sanity check, it lacks robustness for reliable circuit identification.

Appendix D How does the formulation of positive and negative examples affect performance?

D.1 Alternative negative examples for the Greater-than task

The choice of negative examples is a crucial factor in the performance of the circuit identification method. In this study, we selected negative examples that were semantically similar to the positive examples but corrupted enough to prevent the model from using the current circuit to generate a correct answer.

To investigate the sensitivity of our method to the choice of negative examples, we conducted experiments with five different types of negative examples for the greater-than task:

1. Range: The completion year starts with the preceding century. For example, “The competition lasted from the year 1523 to the year 14”. These are the negative examples used throughout this paper.

2. Year: The original negative examples from [14], where the year starts with “01”. For example, “The competition lasted from the year 1501 to the year 15”.

3. Random: The numeric completion of the century is replaced with random uppercase letters. For example, “The competition lasted from the year 19AB to the year 19”.

4. Unrelated: Examples unrelated to the task, similar to the easy negatives, in the form of “I’ve got a lovely bunch of <NOUN>”.

5. Copy: Negative examples with the same form as the positive examples but with different randomised years and centuries.

The results of these experiments are presented in Figure 19. The findings clearly demonstrate that our heuristic of selecting semantically similar examples that switch off the circuit is an effective approach to maximising performance. This is evident from the drop in performance when using Year negatives compared to Range negatives. With Year negatives, the circuit likely remains active because the model still needs to find a two-digit completion greater than “01”. In contrast, the Range negatives are nonsensical because the completion century lies in the past, which likely switches off the circuit.

Refer to caption
Figure 19: Node-level circuit identification performance with different types of negative examples on the Greater-than task. We train 5 SAEs for each type with different random seeds and record the ROC AUC for each.

Interestingly, Unrelated negative examples lead to a considerable drop in performance, which we explore further below.

D.2 Including “easy negatives” in the training data

Various studies suggest that hard negative samples, which have different labels from the anchor samples (in this case, our positive examples) but with very similar embedding features, allow contrastive-loss trained autoencoders to learn better representations to distinguish between the two [41]. However, in our case, our negative samples are specifically designed to all be hard negatives.

There is no reason to believe that the codes most important for differentiating between positive and negative examples should capture every head involved in the IOI circuit. This is because the IOI negative examples are nearly positive examples: for instance, we would expect previous token heads to behave identically in both the negative and positive examples (since both involve at least two names). We therefore need to give the model data such that some of the codes are forced to be assigned to non-IOI-related behaviour, which should make the remaining codes more relevant for finding the right attention heads in the right layers. This suggests including some non-IOI-related data, such as samples from the Pile dataset, in the training data.

We experimented with whether inclusion of “easy negatives”, defined as random pieces of text sampled from the Pile [10], would allow the autoencoder to produce representations that were better for us to pick out the important model components for implementing the task. For example, if the positive samples and hard negative samples shared heads for the IOI task, such as detecting names, we would not identify those heads as important because importance is defined by whether the discrete representation helps distinguish a positive sample from a negative one. Thus, including easy negatives could make those particular heads important.

Refer to caption
Figure 20: Number of easy negatives included in training data for the sparse autoencoder (in addition to the 250 positive and 250 negative examples) and the ROC AUC of the resulting node-level detection. Error bars are shown for 5 training runs at each data point.

However, as seen in Figure 20, inclusion of easy negatives actually leads to a decrease in performance on the IOI task. It is possible that the model is forced to assign codes to concepts and behaviours unrelated to the IOI task, and thus cannot as meaningfully distinguish between the semantically similar positive and negative examples.

Appendix E Normalisation and design choices

E.1 Softmax across head or layer

A key design choice is whether to take the softmax across the vector of individual head counts or across individual layers; that is, to first reshape the vector into a matrix of shape $(n_{\text{layers}}\times n_{\text{heads}})$. A valid concern is that taking the softmax across layers will make unimportant heads seem important. For instance, if there is a layer in which one head has 1 unique positive code and all other heads have 0, this head will have a value of 1.0 and thus be selected no matter what the threshold is. However, it is possible that the law of large numbers will cause the number of unique positive codes to be approximately uniform in unimportant layers, so this may not be an issue.
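A minimal sketch of the two options is given below, assuming `unique_pos_codes` is the vector of unique-positive-code counts per head (ordered layer by layer); names are illustrative.

```python
import torch

def head_scores(unique_pos_codes: torch.Tensor,
                n_layers: int, n_heads: int,
                per_layer: bool = False) -> torch.Tensor:
    """Softmax the per-head counts of unique positive codes, either over all
    heads jointly (default) or within each layer separately."""
    counts = unique_pos_codes.float()
    if per_layer:
        per_layer_scores = torch.softmax(counts.view(n_layers, n_heads), dim=-1)
        return per_layer_scores.view(-1)
    return torch.softmax(counts, dim=0)

# A head is predicted to be in the circuit if its score exceeds the threshold theta.
```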

We show the effects of taking the softmax across layers and across individual heads on the node-level ROC AUC in Figure 21. Softmax across heads performs best in all three tasks, with significant improvements compared with softmax across layers in the IOI and Docstring tasks.

Refer to caption
Figure 21: Effect on node-level ROC AUC when applying softmax across individual layers compared with the heads as a single vector. Softmax across heads performs better in all three datasets, with large improvements in IOI and Docstring.
Refer to caption
Figure 22: Effect on node-level ROC AUC when normalising the number of positive codes per head (before softmax) by dividing by the overall number of unique codes in that head (positive and negative examples).

E.2 Normalising unique positive codes by overall number of unique codes

We also hypothesised that we might be able to improve performance by normalising by the overall number of unique codes per head. The reasoning is as follows: if a head has a large number of unique positive codes, our current method is likely to include it in the circuit. However, if the head also has a large number of unique negative codes, then it clearly has a large range of outputs and the autoencoder deemed it necessary to assign many codes to this head, regardless of whether the example is positive or negative. In essence, the head is important whether or not we are in the IOI task, and including it in our circuit prediction may be erroneous. Normalising by the overall number of unique codes should correct this.

As shown in Figure 22, normalising seems to have a relatively minor effect. It decreases node-level ROC AUC in the Docstring and IOI tasks, and slightly increases performance in the Greater-than task.

E.3 Number of examples used

Finally, we show in Figure 23 how node-level performance varies with the number of positive and negative examples used to calculate the head importance score (i.e. the softmaxed number of unique positive codes per head). We find that we can use as few as 10 examples for the IOI and Docstring tasks, but require the full 250 positive and 250 negative examples for the Greater-than task. We suspect this is due to the numerical nature of the Greater-than task and the fact that there are roughly 100 two-digit numbers specifying appropriate completions, so a larger sample size may be required to represent what each of these different attention head outputs looks like.

Refer to caption
Figure 23: Node-level ROC AUC for varying number of positive and negative examples (each) used for counting the number of unique positive codes per head. Each SAE for each datapoint was trained on 10 examples only. Increasing the number of examples only significantly affects the ROC AUC of the Greater-than task.
Refer to caption
Figure 24: Node-level ROC AUC compared with the number of examples the SAE was trained on. There is a clear decline in performance for the IOI and Greater-than tasks (where we use GPT-2) as we increase the number of examples. Maximum performance for all tasks is achieved at around 5-10 examples.

Additionally, we note that the number of examples the SAE requires during training to learn robust representations is actually only about 5-10. Figure 24 shows that node-level performance actually decreases for IOI and Greater-than (both GPT-2 tasks) as we increase the number of training examples for the sparse autoencoder. Whilst the Docstring task does not see a decrease as we increase the training example set, it still achieves near-maximal performance around 10 examples. For Docstring and IOI, we can also achieve near-maximal performance with just 10 examples for both steps (training the SAE and counting unique positive codes). However, Greater-than requires a significant number of examples for the latter step.

Appendix F Further comparisons to previous methods

In an extension to Figure 2, we record the actual values for each method with both random and zero ablations in Table 6. This allows us to compare previous methods with ours, both with the optimal hyperparameters for each dataset and with hyperparameters fixed across datasets.

Table 6: ROC AUCs for circuit-identification in three tasks for GPT-2 Small. Previous methods are shown with both random and zero ablations in the form Random / Zero, for both node- and edge-level circuit identification, alongside our method. The Ours column gives the ROC AUCs for a hyperparameter sweep across each individual model. The Ours (set params) column shows the results when we fix the autoencoder’s number of learned features to 200, $\lambda$ to 0.02, and the threshold for $k$ in edge-selection to a quarter of the total number of co-occurrences. We bold our results if they exceed the AUC of every other method with both random and zero ablation. The results for previous methods (ACDC, HISP and SP) use logit difference, which is most comparable to our method of only using a single label to assign a difference between positive and negative examples (as opposed to KL-divergence over all token probabilities). Results come from [6], with the addition of our own results.
Task ACDC HISP SP Ours Ours (set params)
Node-level
Docstring 0.938 / 0.825 0.889 / 0.889 0.941 / 0.398 0.945 0.915 ($\pm$ 0.014)
Greater-than 0.766 / 0.783 0.631 / 0.631 0.811 / 0.522 0.821 0.832 ($\pm$ 0.058)
IOI 0.777 / 0.424 0.728 / 0.728 0.797 / 0.479 0.854 0.853 ($\pm$ 0.016)
Edge-level
Docstring 0.972 / 0.929 0.821 / 0.821 0.942 / 0.482 0.974 0.914 ($\pm$ 0.020)
Greater-than 0.461 / 0.491 0.706 / 0.706 0.812 / 0.639 0.963 0.856 ($\pm$ 0.021)
IOI 0.589 / 0.447 0.836 / 0.836 0.707 / 0.393 0.863 0.840 ($\pm$ 0.016)

Appendix G tracr-tasks

tracr is a compiler that converts human-readable programs into transformer weights [47, 23]. This allows us to automatically determine the attention heads responsible for implementing a certain behaviour, as we have access to the underlying assignment of components to layers.

G.1 Compiled models

We compile tracr transformers for four different tasks: reversing a list (tracr-reverse), counting the fraction of previous tokens in a position equal to a certain token (tracr-fracprev), sorting a list (tracr-sort), and sorting a list by the frequency of the tokens in the list (tracr-sortfreq). All code required to compile these models is available in the Tracr repository. Note that all of these transformers output a vector of tokens rather than a single token, and each of the compiled transformers has a maximum sequence length, which we set to 6 for all examples.

We then simulate our circuit-discovery methodology as follows. For each task’s input vocabulary, we generate 250 permutations with replacement as our positive examples. Because tracr transformers have compiled weights that only implement a single task, there is no way to “turn off” a circuit with negative examples. We therefore corrupt the residual stream directly to create our negative examples. To do this for each example, we add Gaussian noise to the attention head vectors $q$, $k$ or $v$ if the respective weight matrices $Q$, $K$ or $V$ in the transformer contain all zeros; conversely, we zero out the attention head vectors $q$, $k$ or $v$ if the respective weight matrices $Q$, $K$ or $V$ contain a non-zero element. We define an attention head component ($Q$, $K$ or $V$) as being in the ground-truth circuit if it makes a non-trivial write to the residual stream (i.e. its output to the residual stream is non-zero). Finally, we train our SAE on 10 randomly sampled positive and negative examples and use the full 500 examples to calculate the number of unique positive codes.
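A simplified sketch of this corruption rule for a single attention head is given below; direct access to the head's compiled weight matrices and its q/k/v vectors is assumed, and the function name and noise scale are illustrative.

```python
import torch

def corrupt_head_vectors(q, k, v, W_Q, W_K, W_V, noise_scale: float = 1.0):
    """For each of q/k/v: add Gaussian noise if the corresponding compiled
    weight matrix is all zeros (the head does not use that pathway),
    otherwise zero the vector out. Illustrative sketch only."""
    corrupted = []
    for vec, W in ((q, W_Q), (k, W_K), (v, W_V)):
        if torch.all(W == 0):
            corrupted.append(vec + noise_scale * torch.randn_like(vec))
        else:
            corrupted.append(torch.zeros_like(vec))
    return tuple(corrupted)
```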

Figure 25: The ground-truth circuits for attention-head vectors $K$, $Q$ and $V$ from tracr-compiled transformers, alongside the softmaxed number of unique positive codes from our SAE representations. For each of the four tasks, the softmaxed values achieve an ROC AUC of 1.0 across thresholds.

Our method achieves a perfect ROC AUC of 1.0 on all tracr-tasks (Figure 25). Whilst we should expect good performance on what amounts to a relatively simple task (since the residual streams are corrupted directly), this does provide further evidence of the mechanism that makes our approach successful: circuit identification largely reduces to finding heads that are active on positive examples but inactive on negative ones. The SAE assigns a high number of codes to active attention heads, and active heads are in turn more likely to be involved in the circuit. We are currently investigating how to collect further evidence in support of this hypothesis.

Appendix H Induction circuit

Another “circuit” commonly studied in the literature is induction, which is implemented by heads that complete token sequences of the form [A][B] ... [A] -> [B] [34]. We include induction here as a study of how our method may be evaluated without a ground-truth circuit, and of how such evaluations can demonstrate model-agnosticism.

We generate our positive and negative examples (25 of each) by randomly sampling sequences of 10 integers, representing tokens. For the positive examples, we repeat this pattern of 10 integers; for the negative examples, we instead append 10 further non-repeating random tokens. The process for identifying the circuit is then exactly the same as above. We use 10 tokenised examples for training the SAE.
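A sketch of this example construction is given below; the vocabulary size and seed are illustrative choices rather than the exact values used in our experiments.

```python
import random

def make_induction_examples(n_examples: int = 25, pattern_len: int = 10,
                            vocab_size: int = 1000, seed: int = 0):
    """Positive examples repeat a random 10-token pattern; negative examples
    follow the same pattern with 10 fresh random tokens."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for _ in range(n_examples):
        pattern = [rng.randrange(vocab_size) for _ in range(pattern_len)]
        positives.append(pattern + pattern)
        negatives.append(pattern + [rng.randrange(vocab_size) for _ in range(pattern_len)])
    return positives, negatives
```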

Our initial induction experiment was performed on a pre-trained 2-layer, 8-head attention-only transformer from [12]. We compared how the KL divergence between the clean model and our circuit (with all other components ablated to zero) changed as we altered our threshold $\theta$. For edge-level circuit identification, we compared the same KL divergence on the same model with ACDC, with both zero and random ablation. Figure 26(a) shows that the KL divergence drops at approximately the same rate for our method as for ACDC. This is promising and perhaps surprising, as our method is not set up to minimise the KL divergence between the clean model and the circuit, whereas ACDC is. For node-level identification, we reach near-zero KL divergence around 12-15 nodes (i.e. 12-15 attention heads), showing that our method is good at not identifying unimportant heads as part of the circuit (Figure 26(b)).

Refer to caption
(a) Edge-level circuit identification
Refer to caption
(b) Node-level circuit identification
Figure 26: KL divergence between the clean model and the circuit (with the remainder of the model ablated) for various thresholds: $\theta$ for ours, and $\tau$ for ACDC.

We also experimented with how our method performed in terms of faithfulness across different language models, this time using the mean loss on the repeated tokens as our metric. We repeated the above process for node and edge-level circuit identification on GPT2-Small, Pythia-160M [2], and Opt-125M [50]. It appears that, at least for the induction task, recovery of performance occurs at the same rate across different models. Note that we can directly compare the number of nodes and edges across these models here as they each have 12 layers with 12 attention heads in each layer.

Refer to caption
(a) Node faithfulness
Refer to caption
(b) Edge faithfulness
Figure 27: Node-level faithfulness and edge-level faithfulness for three different pre-trained models, with loss on the repeated tokens in the induction examples as the metric. All models can be loaded in the TransformerLens library [30].

Appendix I Visualising the distribution of codes and their relation to the circuit

I.1 Correlation between occurrences/co-occurrences and presence in the circuit

We provide two visualisations of the correlation between the number of unique positive codes and the presence of a head in the circuit. First, we show the number of unique positive codes per head alongside the ground-truth circuit, arranged by layer and head, in Figure 8. Clearly, heads with a high number of unique positive codes are much more likely to be in the circuit.

This corresponds to node-level circuit identification, where (after softmax) we predict a head’s presence in the circuit if its value exceeds the pre-defined threshold $\theta$.

Similarly, we show matrices of the co-occurrence of codes in Figure 28, alongside the ground-truth circuit. However, because we now have an array of size $n_{\text{heads}}\times n_{\text{heads}}$, we colour the ground-truth circuit as follows: a (head, head) entry is dark blue if both heads are in the circuit, light blue if only one is, and white if neither is. Again, we see a very strong similarity between the number of co-occurrences of codes in (head, head) pairs and the presence of one or both heads in the circuit.

Figure 28: Number of unique co-occurrences of codes by head in positive examples (left) and the ground-truth circuit (right). The entry at head $i$, head $j$ in the ground-truth matrix is dark blue if both heads are in the ground-truth circuit, light blue if only one is, and white if neither is. For both the IOI and Greater-than circuits, the two matrices show extremely similar patterns.

I.2 Distribution and sparsity of codes

We also visualise the difference between the number of unique positive codes and unique negative codes per head (Figure 29). Again, we see a strong pattern where heads with a high difference (many more unique positive codes than unique negative) are very likely to be in the ground-truth circuit.

Refer to caption
Figure 29: Plotting the difference between the number of unique codes per head in positive examples and negative examples. We then colour the bar by whether the head is in the ground-truth circuit or not. Interestingly, heads with a much larger number of unique codes in positive examples (as opposed to negative examples) are much more likely to be in the circuit.

We also examine the sparsity of our learned representations, and whether there is any difference between positive and negative representations. We plot the histogram of average non-zero activations across all heads for the positive and negative examples in Figure 30. Whilst there does not appear to be any significant difference, the average number of non-zero activations was slightly higher per positive example (0.56) than per negative example (0.42).

Refer to caption
Figure 30: The distribution of non-zero activations by head across positive and negative examples. Positive examples tend to elicit slightly more non-zero activations.

Finally, we investigate the relationship between the most common positive codes and their activations in the ground-truth heads. Figure 31 presents activation histograms for the most frequently occurring positive code in each of the three identified heads (141, 127, and 93) for the IOI task. Comparing the activation distributions of these codes for positive and negative examples, we observe a consistently clear separation between the two, which further highlights the importance of these codes in the functioning of the identified circuit. This analysis provides insight into the fine-grained workings of the learned representations and their role in capturing task-specific patterns.

Refer to caption
Figure 31: Activation histograms for the most common positive code in each of the three ground-truth heads (141, 127, and 93) of the IOI task. The activations for the positive examples are consistently higher than those for the negative examples, suggesting the relevance of these codes to the task-specific behaviour.

Appendix J What didn’t work: contrastive loss experiments

In addition to the completely unsupervised setting discussed above, we hypothesised that injecting some information about the task into the autoencoder might assist it in constructing representations of residual streams that better allowed us to distinguish between positive and negative examples. As such, we trained the sparse autoencoder with an additional loss component that penalised the cosine similarity between positive and negative vectors at the same sequence position, whilst penalising dissimilarity between vectors with the same label at the same sequence position.

In the abstract setting, suppose we wish to compute the average cosine similarity between vectors at the same sequence position $i$ in two tensors $A$ and $B$ that share the same sequence length $S$. Given $A\in\mathbb{R}^{D\times S\times N}$ and $B\in\mathbb{R}^{D\times S\times M}$, we calculate the average cosine similarity for vectors at each sequence position $i$ across all vectors at the same position in the other tensor. Specifically, for every vector at sequence position $i$ in $A$, we compute its average cosine similarity with all vectors at position $i$ in $B$, and vice versa.

The process is as follows:

1. Cosine Similarity Calculation for Position $i$: For each position $i$ in the sequence $S$, and for each dimension $d$ ($d\in\{1,\dots,D\}$), calculate the cosine similarity between the vector $a_{dis}\in A$ and every vector $b_{dis}\in B$, where $s$ is fixed for this operation, indicating the sequence position. Specifically, for vectors at sequence position $i$, the similarity between $a_{dis}$ and all vectors at position $i$ in $B$ is calculated as:

$\text{sim}_{dis}=\frac{1}{M}\sum_{m=1}^{M}\frac{\sum_{d=1}^{D}a_{dis}\cdot b_{dim}}{\sqrt{\sum_{d=1}^{D}a_{dis}^{2}}\cdot\sqrt{\sum_{d=1}^{D}b_{dim}^{2}}}$

2. Average Across All Sequence Positions: After computing $\text{sim}_{dis}$ for every vector at sequence position $i$ in $A$ against all vectors at position $i$ in $B$, average these similarities across all sequence positions $S$:

$\text{CS}(A,B)=\frac{1}{S}\sum_{s=1}^{S}\left(\frac{1}{N}\sum_{n=1}^{N}\text{sim}_{dsn}+\frac{1}{M}\sum_{m=1}^{M}\text{sim}_{dsm}\right)$

3. Final loss: Finally, we take our tensor of positive examples $\mathbf{P}\in\mathbb{R}^{D\times S\times N}$, where $D$ is the number of learned features in the sparse autoencoder, $S$ is our sequence length (for the individual attention head data $S=144$, i.e. 12 heads in 12 layers) and $N$ is the number of positive examples in the batch. Similarly, we take our tensor of negative examples $\mathbf{N}\in\mathbb{R}^{D\times S\times M}$, where $M$ is the number of negative examples in the batch. We then calculate the additional loss component using a typical contrastive loss structure:

$\mathcal{L}_{\text{cont}}(\mathbf{P},\mathbf{N})=\text{CS}(\mathbf{P},\mathbf{N})+\max(0,\epsilon-\text{CS}(\mathbf{P},\mathbf{P})/2-\text{CS}(\mathbf{N},\mathbf{N})/2)$

This is simply added to the overall loss, in addition to the reconstruction loss and sparsity loss (note that concatenating $\mathbf{P}$ and $\mathbf{N}$ along the batch dimension yields $\mathbf{x}$):

$\mathcal{L}(\mathbf{x})=\underbrace{\|\mathbf{x}-\hat{\mathbf{x}}\|_{2}^{2}}_{\text{Reconstruction loss}}+\underbrace{\lambda\|\mathbf{c}\|_{1}}_{\text{Sparsity loss}}+\underbrace{\alpha\,\mathcal{L}_{\text{cont}}(\mathbf{P},\mathbf{N})}_{\text{Contrastive loss}}$
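The sketch below implements the contrastive term in PyTorch under the assumption that the encoded positive and negative activations have already been arranged into tensors of shape (D, S, N) and (D, S, M), matching the notation above; the margin value is illustrative and the functions are not taken from our training code.

```python
import torch
import torch.nn.functional as F

def avg_cross_cosine(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """CS(A, B): mean cosine similarity between vectors at the same sequence
    position, averaged over all cross pairs and all positions.
    A has shape (D, S, N) and B has shape (D, S, M)."""
    A = F.normalize(A, dim=0)
    B = F.normalize(B, dim=0)
    sims = torch.einsum("dsn,dsm->snm", A, B)  # (S, N, M) pairwise cosines
    return sims.mean()

def contrastive_loss(P: torch.Tensor, N: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """L_cont = CS(P, N) + max(0, eps - CS(P, P)/2 - CS(N, N)/2)."""
    margin = eps - avg_cross_cosine(P, P) / 2 - avg_cross_cosine(N, N) / 2
    return avg_cross_cosine(P, N) + torch.clamp(margin, min=0.0)
```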

We perform 10 runs at varying $\alpha$ levels for the IOI task and show the results in Figure 32. Unfortunately, the contrastive loss appeared to decrease the quality of the final predictions. This holds both with and without normalisation.

Refer to caption
Figure 32: Node-level ROC AUC compared to the contrastive loss strength, measured by the hyperparameter $\alpha$. Contrastive loss seems to hinder performance.

Appendix K Entropy vs. Co-occurrence for Edge-Level Circuit Identification

In addition to the edge-level circuit identification methodology described in the main text, we explored an alternative approach based on the entropy of positive code co-occurrences in head pairs. This method again involves constructing a co-occurrence matrix $\mathbf{C}\in\mathbb{R}^{n_{\text{heads}}\times n_{\text{heads}}\times d_{\text{bottleneck}}\times d_{\text{bottleneck}}}$, where each entry $\mathbf{C}_{h1,h2,f1,f2}$ represents the frequency with which feature $f1$ in head $h1$ co-occurs with feature $f2$ in head $h2$ across all examples. The matrix is populated by analysing the model’s activations for each input example, identifying the “active” feature (argmax across the feature dimension) for each head in every example, and incrementally building the co-occurrence counts for each observed pair of active features across all head pairs.

To distil meaningful relationships from the co-occurrence matrix, we calculate the entropy $H_{h1,h2}$ for each head pair $(h1,h2)$ as follows:

$H_{h1,h2}=-\sum_{f1=1}^{d_{\text{bottleneck}}}\sum_{f2=1}^{d_{\text{bottleneck}}}p_{h1,h2,f1,f2}\log_{2}(p_{h1,h2,f1,f2})$

where $p_{h1,h2,f1,f2}$ represents the normalised probability of co-occurrence of features $f1$ and $f2$ between heads $h1$ and $h2$, derived from $\mathbf{C}$. Prior to entropy calculation, $\mathbf{C}$ is normalised such that, for each head pair, the co-occurrence probabilities sum to one.

We apply a softmax function row-wise across the entropy matrix to normalise the entropy values, and calculate the entropy matrix for the positive example set, denoted $\mathbf{H}^{+}$. From $\mathbf{H}^{+}$, we select the top $k$ head pairs based on their normalised entropy difference, indicating the most significant edges in the circuit. These head pairs are then mapped back to their corresponding layers and heads within the model architecture.
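A sketch of this per-head-pair entropy computation is given below, assuming the co-occurrence counts have been accumulated into a tensor `cooc` of shape (n_heads, n_heads, d_bottleneck, d_bottleneck); names are illustrative.

```python
import torch

def head_pair_entropy_scores(cooc: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Entropy H_{h1,h2} of the normalised co-occurrence distribution for each
    head pair, followed by a row-wise softmax as described above.
    cooc: (n_heads, n_heads, d_bottleneck, d_bottleneck)."""
    n_heads = cooc.shape[0]
    flat = cooc.float().reshape(n_heads, n_heads, -1)
    probs = flat / (flat.sum(dim=-1, keepdim=True) + eps)     # normalise per head pair
    entropy = -(probs * torch.log2(probs + eps)).sum(dim=-1)  # (n_heads, n_heads)
    return torch.softmax(entropy, dim=-1)
```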

To evaluate the robustness and predictive power of the identified circuit components (edges), we generate a binary prediction array for the ground-truth head pairs, applying a threshold to the softmax-normalised values from $\mathbf{H}^{+}$. However, we found entropy to be a worse signal for predicting the presence of a head in a circuit. As shown in Figure 33, the entropy values do not align well with the ground-truth edges in the IOI task. Furthermore, even when selecting the best $k$ hyperparameter, the ROC AUC does not exceed 0.50 (see Figure 34), indicating that the entropy-based approach does not reach the performance level of the original co-occurrence-based method described in the main text.

Refer to caption
Figure 33: The entropy of co-occurrence probabilities mapped against the ground-truth heads in the circuit. Compare with the raw co-occurrence counts shown in Figure 28.
Refer to caption
Figure 34: Use of entropy instead of softmaxed co-occurrence to identify heads belonging to the circuit. Whilst performance was still reasonable, it did not reach the ROC AUC of the original method.

Appendix L Full hyperparameter sweeps

L.1 Best hyperparameters for each task

The best hyperparameters for each task are given in Table 7. Notably, the optimal hyperparameters are approximately the same for the Greater-than and IOI tasks, with some differences for Docstring. This is plausible because the Docstring model has fewer heads and thus may need stronger regularisation, i.e. a higher sparsity penalty $\lambda$.

Table 7: Best hyperparameters found using an Optuna search for 100 iterations over the number of learned features and $\lambda$ (the sparsity penalty). The number of learned features was chosen between 128 and 2048, and $\lambda$ between 0.01 and 0.1. Additionally, we show the threshold (after softmax) that maximises the F1 score for each dataset.
Task Learned Features λ\lambda Threshold
Docstring 270 0.067 1.12e-07
Greater-than 246 0.011 1.71e-15
IOI 379 0.022 6.26e-16
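For completeness, a hyperparameter search of the kind described in the table caption can be set up with Optuna along the following lines. The search ranges mirror those stated above, but `train_sae_and_score` is a hypothetical helper standing in for the actual training and node-level ROC AUC evaluation:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search ranges taken from the caption of Table 7.
    n_features = trial.suggest_int("n_learned_features", 128, 2048)
    lam = trial.suggest_float("sparsity_penalty", 0.01, 0.1)
    # Hypothetical helper: trains the sparse autoencoder with these settings
    # and returns the node-level ROC AUC of the predicted circuit.
    return train_sae_and_score(n_features=n_features, sparsity_penalty=lam)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```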

L.2 Effect of $k$ in edge-selection

Another hyperparameter introduced by edge-level detection is $k$, the number of (head, head) co-occurrence pairs to keep from the sorted list of positive co-occurrence counts before taking the set of heads appearing in the remaining tuples. Figure 35 illustrates that the ROC AUC is relatively robust to the choice of $k$: most values lead to essentially the same performance, except at the very smallest and largest values. In fact, setting $k$ to half of the overall number of co-occurrence pairs of codes in heads appears to be a robust heuristic for optimising performance across datasets.
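To illustrate this selection step concretely, the sketch below keeps the top-$k$ (head, head) pairs by positive co-occurrence count and then collects the set of heads appearing in them; the input format is an assumption rather than the paper's exact data structure:

```python
def select_heads_from_top_k(pair_counts: dict, k: int) -> set:
    """pair_counts maps (head_1, head_2) tuples to positive co-occurrence counts."""
    # Sort head pairs by their positive co-occurrence count, descending.
    ranked = sorted(pair_counts.items(), key=lambda item: item[1], reverse=True)
    top_pairs = [pair for pair, _ in ranked[:k]]
    # Keep every head that appears in at least one retained pair.
    heads = set()
    for h1, h2 in top_pairs:
        heads.update((h1, h2))
    return heads

# Heuristic suggested by the sweep: k of roughly half the number of pairs,
# e.g. k = len(pair_counts) // 2
```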

Figure 35: Comparison of keeping the set of top-$k$ co-occurrences in edge-level circuit identification for the three tasks. Choosing $k$ to be approximately half of the total number of co-occurrences in codes appears to be a good heuristic for maximising the ROC AUC a priori.

L.3 Contour plots

We show contour plots from the Optuna optimisation in Figure 36. It seems that a lower $\lambda$ is beneficial across datasets. However, the number of learned features does not seem to matter much for Docstring or IOI, whilst a low number of learned features leads to lower AUC for the Greater-than task.

Figure 36: Optuna hyperparameter searches over 100 autoencoder training runs for the Greater-than and IOI tasks and 200 runs for the Docstring task, optimising the node-level ROC AUC of the predicted circuit using our method.

Appendix M Comparison to Vector-Quantised Variational Autoencoders (VQ-VAEs)

Our work on using sparse autoencoders for circuit discovery shares some key similarities with Vector-Quantised Variational Autoencoders (VQ-VAEs) [35]. Both methods aim to learn discrete representations of input data that capture the most salient information while discarding noise and irrelevant details.

In a VQ-VAE, the encoder network maps the input data to a continuous latent space, which is then quantised using a learned codebook of discrete vectors. The quantisation step involves finding the nearest codebook vector to each latent vector and replacing it with the corresponding discrete code. This process is analogous to our approach of taking the argmax of the learned features to obtain discrete codes representing the most activated feature for each attention head.
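To make the analogy concrete, the following sketch contrasts the two discretisation steps: taking the argmax over a sparse feature vector versus a nearest-neighbour lookup in a VQ-VAE codebook. Shapes and variable names are illustrative only:

```python
import numpy as np

def sae_discretise(features: np.ndarray) -> int:
    # Our approach: the integer code is the index of the most activated feature.
    return int(np.argmax(features))

def vqvae_quantise(latent: np.ndarray, codebook: np.ndarray) -> int:
    # VQ-VAE: the code is the index of the nearest codebook vector
    # (Euclidean distance); codebook has shape (codebook_size, d_latent).
    distances = np.linalg.norm(codebook - latent, axis=1)
    return int(np.argmin(distances))
```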

However, there are some important differences between our method and VQ-VAEs. First, our approach uses a sparse autoencoder with a sparsity penalty in the loss function to encourage the learning of a sparse representation. This sparsity constraint helps to identify the most important features and reduces the influence of noise and irrelevant information. In contrast, VQ-VAEs essentially enforce the maximum amount of sparsity by default, as they assign a single vector to each bottleneck representation, which is bijective to a set of integer codes. This is equivalent to having a sparse autoencoder with only one activating feature at any time, and this feature can only take on a single value (i.e. 1).

Second, in a VQ-VAE, the codebook vectors are learned jointly with the encoder and decoder networks using a vector quantisation objective. The codebook is updated during training to better represent the latent space. In our approach, we do not explicitly learn a codebook; instead, we rely on the sparsity constraint to encourage the autoencoder to learn a set of meaningful features that can be discretised using the argmax operation. However, the hidden dimension of our autoencoder is analogous to the codebook size when we apply our discretisation.

Third, VQ-VAEs use a straight-through gradient estimator to propagate gradients through the quantisation step, allowing for end-to-end training of the encoder, codebook, and decoder [35]. In our method, we train the sparse autoencoder using standard backpropagation without the need for a specialised gradient estimator, because the discretisation is applied only after training.

Despite these differences, both our method and VQ-VAEs share the goal of learning a meaningful discrete representation of the input data. It would be interesting to substitute VQ-VAEs into our method and determine whether similar or better performance can be achieved.