Under review as a conference paper at ICLR 2025
EFFICIENT TRAINING OF SPARSE AUTOENCODERS FOR LARGE LANGUAGE MODELS VIA LAYER GROUPS
Davide Ghilardi*, Federico Belotti*, Marco Molinari*
Department of Informatics, Systems and Communication, University of Milan-Bicocca · Department of Statistics, London School of Economics
{name}.{surname}@unimib.it, m.molinari1@lse.ac.uk
ABSTRACT
Sparse Autoencoders (SAEs) have recently been employed as an unsupervised approach for understanding the inner workings of Large Language Models (LLMs). They reconstruct the model's activations with a sparse linear combination of interpretable features. However, training SAEs is computationally intensive, especially as models grow in size and complexity. To address this challenge, we propose a novel training strategy that reduces the number of trained SAEs from one per layer to one for a given group of contiguous layers. Our experimental results on Pythia 160M highlight a speedup of up to 6x without compromising the reconstruction quality and performance on downstream tasks. Therefore, layer clustering presents an efficient approach to train SAEs in modern LLMs.
1 INTRODUCTION
With the significant adoption of Large Language Models (LLMs) in real-world applications, understanding their inner workings has gained paramount importance. A key challenge in LLM interpretability is the polysemanticity of neurons in models' activations, which lack a clear and unique meaning (Olah et al., 2020). Recently, SAEs (Huben et al., 2024; Bricken et al., 2023) have shown great promise in tackling this problem by decomposing the model's activations into a sparse combination of human-interpretable features.
The use of SAEs as an interpretability tool is motivated by two key reasons: the first is the substantial empirical evidence supporting the Linear Representation Hypothesis (LRH), or that LLMs exhibit interpretable linear directions in their activation space (Mikolov et al., 2013; Nanda et al., 2023; Park et al., 2023); the second is the Superposition Hypothesis (SH) (Elhage et al., 2022), which supposes that, by leveraging sparsity, neural networks represent more features than they have neurons. Under this hypothesis, we can consider a trained neural network as a compressed simulation of a larger disentangled model, where every neuron corresponds to a single feature. To overcome superposition, SAEs leverage the LRH to decompose model activations into a sparse linear combination of interpretable features.
However, training SAEs is computationally expensive and will become even more costly as model sizes and parameter counts grow. Indeed, one Sparse Autoencoder (SAE) is typically learned for a given component at every layer of an LLM. Moreover, the number of features usually equals the model activation dimension multiplied by a positive integer, called the expansion factor. For example, a single SAE trained on the Llama-3.1 8B model activations with an expansion factor of 32 has roughly 4096² · 32 · 2 ≈ 1.07B parameters, for a total of more than 32B parameters when training one SAE for each of the 32 layers.
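For concreteness, the arithmetic behind this estimate can be reproduced in a few lines of Python (a rough sketch of ours, counting only the encoder and decoder weight matrices; the exact figure depends on which biases and thresholds are included):

```python
# Back-of-the-envelope check (ours, not the authors' code) of the parameter count
# quoted above for a single SAE on Llama-3.1 8B residual-stream activations.
d_model = 4096                       # Llama-3.1 8B hidden size
expansion_factor = 32
d_sae = expansion_factor * d_model   # number of SAE features

# Encoder and decoder weight matrices dominate; biases are negligible.
params_per_sae = 2 * d_model * d_sae
print(f"{params_per_sae / 1e9:.2f}B parameters per SAE")                        # ~1.07B

n_layers = 32
print(f"{n_layers * params_per_sae / 1e9:.1f}B parameters, one SAE per layer")  # ~34.4B
```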
In this work, we reduce the computational overhead of training a separate SAE for each layer of a target LLM by learning a single SAE for groups of related and contiguous layers. This approach is inspired by the observation that neural network layers often group together in learning task-specific representations (Szegedy et al., 2014; Zeiler & Fergus, 2014; Jawahar et al., 2019): shallow layers typically capture low-level features, while deeper layers learn high-level abstractions. Additionally, adjacent layers in LLMs tend to encode redundant information, as evidenced by the similarity in the angular distance of their outputs (Gromov et al., 2024). To quantify the relatedness between layers, we measure the angular distance of the residual stream after the MLP's contribution across neighboring layers.

*Equal contribution

Figure 1: Illustration of our method. While standard training of SAEs requires one SAE per layer, our method first clusters layers by angular similarity and then trains a single SAE for each group.
Denoting with L the number of layers of the target model and k the number of groups we want to cluster the layers into, our approach obtains at least an (L−1)/k times speedup¹ without sacrificing either reconstruction quality or downstream performance.
Our contributions can be summarized as follows:
- We demonstrate that a single SAE can effectively reconstruct the activations of an entire cluster as a sparse linear combination of interpretable features. This approach reduces the number of SAEs required for a given model from one per layer to one per group, i.e., from L − 1 to k.
- We demonstrate the practical effectiveness of our method by training SAEs on the Pythia-160M model for different values of k, obtaining an (L−1)/k times speedup at the cost of a minor deterioration in performance.
- We extensively analyze our method on reconstruction performance, downstream tasks, and human interpretability for different values of k.

2 RELATED WORK
THE LINEAR REPRESENTATION AND SUPERPOSITION HYPOTHESES
Supported by substantial evidence, from the seminal vector arithmetic of Mikolov et al. (2013) to the more recent work of Nanda et al. (2023) and Park et al. (2023) on LLMs, the Linear Representation Hypothesis (LRH) supposes that neural networks have interpretable linear directions in their activation space. However, neuron polysemanticity remains an essential challenge in neural network interpretability (Olah et al., 2020).
Recently, Bricken et al. (2023) explored this issue by relating the Superposition Hypothesis (SH) to the decomposition ideally found by a SAE. According to the SH, neural networks utilize n-dimensional activations to encode m ≫ n features by leveraging their sparsity and relative importance. As a result, we can write the activations x_j in a model as
x_j ≈ b + Σ_{i=1}^{m} f_i(x_j) d_i    (1)

where x_j ∈ ℝ^n is the activation vector for an example j, f(x_j) ∈ ℝ^m is a sparse feature vector, f_i(x_j) is the activation of the i-th feature, d_i is a unit vector in the activation space, and b is a bias.
SPARSE AUTOENCODERS
Sparse Autoencoders have gained popularity in LLM interpretability due to their ability to counteract superposition and decompose neuron activations into interpretable features (Bricken et al., 2023; Huben et al., 2024). Given an input activation x ∈ ℝ^d, a SAE reconstructs it as a sparse linear combination of d_sae ≫ d features, denoted as v_i ∈ ℝ^d. The reconstruction follows the form:

(x̂ ◦ f)(x) = W_d f(x) + b_d    (2)

Here, the columns of W_d represent the features v_i, b_d is the decoder's bias term, and f(x) represents the sparse feature activations. The feature activations are computed as

f(x) = σ(W_e(x − b_d) + b_e)    (3)

where b_e is the encoder's bias term and σ is an activation function, typically ReLU(x) = max(0, x). The training of a SAE involves minimizing the following loss function:

L_sae = ∥x − x̂∥²₂ + λ∥f(x)∥₁    (4)

where the first term in Equation 4 represents the reconstruction error, while the second term is an ℓ1 regularization on the activations f(x) to encourage sparsity.

Typically, d_sae is set as d_sae = c · d, where c ∈ {2^n | n ∈ ℕ⁺}. As the size and depth of the model increase, training SAEs becomes more computationally intensive.
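To make Equations 2–4 concrete, the following is a minimal PyTorch sketch of such an SAE with a ReLU activation and ℓ1 penalty (our illustration, not the training code used in this work; details such as decoder-column normalization and careful initialization are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE following Equations 2-4 (weights stored transposed)."""

    def __init__(self, d_model: int, expansion_factor: int = 8, l1_coeff: float = 1.0):
        super().__init__()
        d_sae = expansion_factor * d_model
        self.W_e = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)  # encoder weights
        self.b_e = nn.Parameter(torch.zeros(d_sae))                  # encoder bias
        self.W_d = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)  # rows = feature directions
        self.b_d = nn.Parameter(torch.zeros(d_model))                # decoder bias
        self.l1_coeff = l1_coeff

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Equation 3: f(x) = ReLU(W_e (x - b_d) + b_e)
        return F.relu((x - self.b_d) @ self.W_e + self.b_e)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Equation 2: x_hat = W_d f(x) + b_d
        return f @ self.W_d + self.b_d

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        # Equation 4: squared reconstruction error plus l1 sparsity penalty
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = (x - x_hat).pow(2).sum(dim=-1).mean()
        sparsity = f.abs().sum(dim=-1).mean()
        return recon + self.l1_coeff * sparsity
```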
SAES EVALUATION
SAE evaluation in the context of LLMs presents a significant challenge. While standard unsupervised metrics such as L2 (reconstruction) loss and L0 sparsity are widely adopted to measure SAE performance (Gao et al., 2024; Lieberum et al., 2024), they fall short of assessing two key aspects: causal importance and interpretability.
Recent approaches, including auto-interpretability (Bricken et al., 2023; Huben et al., 2024; Bills et al., 2023) and ground-truth comparisons (Sharkey et al., 2023), aim to provide a more holistic evaluation. These methods focus on the causal relevance of features (Marks et al., 2024) and evaluate SAEs in downstream tasks. Makelov et al. (2024), for instance, proposed a framework for the Indirect Object Identification (IOI) task, emphasizing three aspects: the sufficiency and necessity of reconstructions, sparse feature steering (Templeton et al., 2024), and the interpretability of features in causal terms.
Karvonen et al. (2024) further contributed by developing specialized metrics for board game language models. Using structured games like chess and Othello, they introduced supervised metrics, such as board reconstruction accuracy and coverage of predefined state properties, offering a more direct assessment of SAEs' ability to capture semantically meaningful and causally relevant features.
IMPROVING SAES TRAINING
As SAEs gain popularity for LLM interpretability and are increasingly applied to state-of-the-art models (Lieberum et al., 2024), the need for more efficient training techniques has become evident. To address this, Gao et al. (2024) explored the scaling laws of autoencoders to identify the optimal combination of size and sparsity.

Recent work has also explored using transfer learning to improve SAE training. For example, Kissane et al. (2024) and Lieberum et al. (2024) demonstrated the transferability of SAE weights between base and instruction-tuned versions of Gemma-1 (Team et al., 2024a) and Gemma-2 (Team et al., 2024b), respectively. On the other hand, Ghilardi et al. (2024) show that transfer also occurs among layers of a single model, both in forward and backward directions.
[Figure 2 is a lower-triangular heatmap titled "Average Angular Distance Between Layers"; adjacent layers show the smallest distances (≈0.18–0.24), while the last layer (layer 11) is far from all others (≈0.43–0.50).]

Figure 2: Average angular distance between all layers of the Pythia-160M model, as defined in Equation 5. The angular distances are computed over 5M tokens from the training dataset.
3 EXPERIMENTAL SETUP
We train SAEs on the residual stream of the Pythia-160M model (Biderman et al., 2023) after the contribution of the MLP. The chosen dataset is a 2B pre-tokenized version2 of the Pile dataset (Gao et al., 2020) with a context size of 1024. We set the expansion factor c = 8, λ = 1 in Equation 4, learning rate equal to 3e-5, and a batch size of 4096 samples. Following Bricken et al. (2023), we constrain the decoder columns to have unit norm and do not tie the encoder and decoder weights.
We use the JumpReLU activation function as specified in Rajamanoharan et al. (2024), defined as JumpReLU_θ(z) = z · H(z − θ), where H is the Heaviside step function and θ is a threshold learned during training. We fixed the hyperparameters for all the experiments conducted in this work. All hyperparameters can be found in Table 2.
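For illustration, the JumpReLU non-linearity can be sketched as follows (our code; the straight-through gradient estimator used to learn θ in Rajamanoharan et al. (2024) is omitted):

```python
import torch

def jump_relu(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # JumpReLU_theta(z) = z * H(z - theta): keep the pre-activation unchanged where it
    # exceeds the threshold theta, and zero it out elsewhere.
    return z * (z > theta).to(z.dtype)

z = torch.tensor([0.1, 0.4, 0.6, 1.2])
print(jump_relu(z, torch.tensor(0.5)))  # tensor([0.0000, 0.0000, 0.6000, 1.2000])
```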
For a model with L layers, the number of possible combinations of k groups of adjacent layers that can be tested is given by the binomial coefficient (L−1 choose k−1). With this number growing with model depth, we employed an agglomerative grouping strategy based on the angular distances between layers to reduce it drastically. In particular, we compute the mean angular distances, as specified in Gromov et al. (2024), over 5M tokens from our training set:

d_angular(x_post^p, x_post^q) = (1/π) arccos( (x_post^p · x_post^q) / (∥x_post^p∥₂ ∥x_post^q∥₂) )    (5)

for every p, q ∈ {1, ..., L}, where x_post^l are the residual stream activations of layer l after the MLP's contribution. From Figure 2, it can be noted how the last layer differs from every other layer in terms of angular distance. For this reason, we have decided to exclude it from the grouping procedure.
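A direct way to compute Equation 5 from cached activations is sketched below (our code; tensor shapes are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def angular_distance(x_p: torch.Tensor, x_q: torch.Tensor) -> torch.Tensor:
    # Equation 5: arccos of the cosine similarity between the post-MLP residual-stream
    # activations of two layers, normalized by pi and averaged over tokens.
    cos = F.cosine_similarity(x_p, x_q, dim=-1).clamp(-1.0, 1.0)  # clamp guards arccos
    return (torch.arccos(cos) / torch.pi).mean()

# x_post is assumed to have shape (n_layers, n_tokens, d_model);
# dist[p, q] = angular_distance(x_post[p], x_post[q]) fills the matrix shown in Figure 2.
```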
We adopted a bottom-up hierarchical clustering strategy with complete linkage (Nielsen, 2016) that aggregates layers based on their angular distances. Specifically, at every step of the process, the two groups with minimal group-distance³ are merged; we repeat the process until the predefined number of groups k is reached. This approach prevents the formation of groups as long chains of layers and ensures that the maximum distance within each group remains minimal. In this work, we create groups considering all layers except for the last one, and we chose k varying from 1 (a single cluster with all layers) to 5. Layer groups can be found in Table 3 in Appendix A.
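The grouping procedure can be sketched as follows (our reading of the description above: merges are restricted to adjacent groups so that groups remain contiguous; the authors' exact implementation may differ):

```python
import numpy as np

def contiguous_complete_linkage(dist: np.ndarray, k: int) -> list:
    # Start from singleton groups and repeatedly merge the pair of ADJACENT groups whose
    # complete-linkage distance (maximum pairwise angular distance) is smallest, until
    # k groups remain. `dist` is the square matrix of mean angular distances over the
    # layers being grouped (the last layer is excluded beforehand).
    groups = [[i] for i in range(dist.shape[0])]
    while len(groups) > k:
        linkage = [
            max(dist[a, b] for a in groups[i] for b in groups[i + 1])
            for i in range(len(groups) - 1)
        ]
        i = int(np.argmin(linkage))                    # cheapest adjacent merge
        groups[i:i + 2] = [groups[i] + groups[i + 1]]  # fuse the two neighbours
    return groups
```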
SAES TRAINING AND EVALUATION
We train a SAE for every layer of the Pythia-160M model with the hyperparameters specified in Section 3.1 and consider it our baseline. We denote SAE_i as the baseline SAE trained on the activations from the i-th layer.
Moreover, for every 1 ≤ j ≤ k ≤ 5 with j, k ∈ ℕ, we define SAE_j^k as the SAE trained to reconstruct the post-MLP residual stream activations of all layers in the j-th group from a partition into k groups.
Additionally, we define [j_k] as the set of all the layers that belong to the j-th group in the k-group partition. We train a SAE_j^k for every k and j and compare it with the baselines. In particular, we compare every SAE_j^k against all the baseline SAE_s, with s ∈ [j_k].
Section 4 reports the performance w.r.t. standard reconstruction and sparsity metrics. Additionally, in Sections 5 and 6 we show, respectively, the performance achieved on popular downstream tasks (Marks et al., 2024; Hanna et al., 2023; Wang et al., 2023) and human interpretability scores.
Figure 3: Average CE Loss Score, R², L2 and L0. The average is computed over layers for every k. The "Baseline" average is computed considering the performance obtained by SAE_i, ∀i = 0, ..., 10.
To assess the quality of the reconstruction of SAEs trained with our grouping strategy, we report three standard reconstruction metrics and one sparsity metric. In particular, the Cross-Entropy Loss Score (CELS), defined as (CE(ζ) − CE(x̂ ◦ f)) / (CE(ζ) − CE(M)), measures the change in Cross-Entropy loss (CE) between the output with SAE-reconstructed activations (x̂ ◦ f) and the model's output (M), relative to the loss in a zero-ablated run (ζ), i.e., a run with the activations set to zero.

The L2 loss is the first term of Equation 4 and measures the reconstruction error made by the SAE.
The R² score, defined as 1 − ∥x − x̂∥²₂ / ∥x − E_{x∼D}[x]∥²₂, measures the fraction of explained variance of the input recovered by the SAE.
Finally, the L0 sparsity, defined as Σ_{j=1}^{d_sae} 𝟙[f_j ≠ 0], represents the number of non-zero SAE features used to compute the reconstruction. For each metric, we compute the average over 1M examples from the test dataset.
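For reference, the three metrics admit short, direct implementations (our sketch; the CE losses are assumed to be computed from separate forward passes of the model):

```python
import torch

def ce_loss_score(ce_model: float, ce_recon: float, ce_zero: float) -> float:
    # 1.0 when splicing in the reconstruction leaves the CE loss unchanged,
    # 0.0 when it hurts as much as zero-ablating the activations.
    return (ce_zero - ce_recon) / (ce_zero - ce_model)

def r2_score(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # Fraction of the input variance explained by the reconstruction.
    residual = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0)).pow(2).sum()
    return 1.0 - residual / total

def l0_sparsity(f: torch.Tensor) -> torch.Tensor:
    # Average number of non-zero SAE features per example.
    return (f != 0).float().sum(dim=-1).mean()
```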
Figure 3 shows the reconstruction and sparsity metrics for each number of groups, averaged across layers. It can be noted that training a single SAE on the activations from multiple close layers does not dramatically affect the reconstruction, even when all layers cluster in a single group (k = 1) and only one SAE is trained on all of their activations.
These results demonstrate that a single SAE can effectively reconstruct activations across multiple layers. Furthermore, the comparable performance between SAE_j^k and the individual layer-specific SAE_i indicates that post-residual stream activations from adjacent layers share a common set of underlying features. This hypothesis is further supported by directly comparing the directions learned by SAE_i and SAE_j^k using the Mean Maximum Cosine Similarity (MMCS) score:
MMCS = (1/d_sae) Σ_u max_v CosSim(u, v)    (6)

where u and v are the columns of the SAE_j^k and SAE_i decoder matrices, respectively. Figure 4 shows the average MMCS, where the average is computed for a given k by calculating the MMCS between SAE_j^k and SAE_s for every 1 ≤ j ≤ k and s ∈ [j_k], then dividing by L − 1.
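Equation 6 reduces to a single matrix product over unit-normalized decoder columns; a possible implementation (ours) is:

```python
import torch
import torch.nn.functional as F

def mmcs(W_group: torch.Tensor, W_base: torch.Tensor) -> torch.Tensor:
    # Equation 6: for each decoder feature direction of the group SAE, take the cosine
    # similarity with its closest baseline feature, then average over all d_sae features.
    # Both matrices are assumed to store one feature per column, shape (d_model, d_sae).
    u = F.normalize(W_group, dim=0)
    v = F.normalize(W_base, dim=0)
    cos = u.T @ v                        # (d_sae, d_sae) pairwise cosine similarities
    return cos.max(dim=1).values.mean()
```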
We detail the per-layer reconstruction performances of every SAE_j^k for every k in Appendix B.
Figure 4: Average MMCS, as defined in Equation 6. The average is computed for a given k by calculating the MMCS between SAE_j^k and SAE_s for every 1 ≤ j ≤ k and s ∈ [j_k], then dividing by L − 1.
While achieving good reconstruction metrics is crucial for a trained SAE, it is insufficient for a comprehensive evaluation of its performance. For instance, unsupervised metrics alone cannot determine whether the identified features capture causally influential directions for the model. To address this, following Marks et al. (2024), we applied SAEs to three well-known tasks: Indirect Object Identification (IOI), Greater Than, and Subject-Verb Agreement.

Each task can be represented as a set of counterfactual prompts paired with their respective answers, formally denoted as T. Counterfactual prompts are similar to clean ones but contain slight modifications that result in a different predicted answer.
Ideally, a robust SAE should be able to recover the model's performance on a task when reconstructing its activations. Furthermore, we expect the SAE to rely on a small subset of task-relevant features to complete the task. To assess this, we filtered the features to include only the most important ones, where importance is defined as the indirect effect (IE) (Pearl, 2022) of the feature on task performance, measured by a real-valued metric m : ℝ^{d_vocab} → ℝ. Specifically, the IE of a feature activation f is the change in m obtained when its value on a clean prompt is patched with its value on the corresponding counterfactual prompt (Equation 7).
Calculating these effects is computationally expensive, as it requires a forward pass for each feature. To mitigate this, we employed two approximate methods: Attribution Patching (AtP) (Nanda, 2023; Syed et al., 2024) and Integrated Gradients (IG) (Sundararajan et al., 2017). Appendix C provides a formal definition of both methods.
Following Marks et al. (2024), we used faithfulness and completeness metrics to evaluate the performance of the SAEs on the tasks. These metrics are defined as (m(C) − m(Ø)) / (m(M) − m(Ø)), where m(M) and m(Ø) represent the metric average over T achieved by the model alone and with the mean-ablated SAE reconstructions, respectively. m(C) is computed based on the task and either the faithfulness or completeness criterion: for faithfulness, it is the metric average when using only the important SAE features, while mean-ablating the others; conversely, for completeness, it is calculated by mean-ablating the important features while keeping the others active.
These metrics allow us to evaluate two critical aspects of SAE quality: whether the SAE learned a set of features that is both sufficient and necessary to perform the task. Figures 5 and 6 display the faithfulness and completeness scores for all k groups.
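Both scores reduce to the same normalized difference; a minimal sketch (ours, with m(C), m(M), and m(Ø) assumed to be precomputed metric averages over T) is:

```python
def circuit_score(m_c: float, m_model: float, m_ablated: float) -> float:
    # (m(C) - m(O)) / (m(M) - m(O)). For faithfulness, m_c is the metric with only the
    # important features kept (all others mean-ablated); for completeness, m_c is the
    # metric with the important features mean-ablated and all others kept active.
    return (m_c - m_ablated) / (m_model - m_ablated)
```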
Figure 5: Average faithfulness (Marks et al., 2024) for every downstream task (IOI, Greater Than, Subject-Verb Agreement) with IE computed with AtP (Equation 8). The "Baseline" average is computed considering the performance obtained by SAE_i, ∀i = 0, ..., 10. The "Average" plot depicts the average over the three downstream tasks.
All SAE_j^k perform comparably to, or slightly better than, the baseline SAE_i models on the faithfulness score (Figure 5) across all three downstream tasks evaluated, demonstrating the sufficiency of the learned features. Remarkably, this performance is achieved using only the top 5% of the most important active features, which recover 75% of the baseline performance on average. Even more remarkable are the performances of the single-group SAE_1^1 (k = 1), which closely follow the trend of both the baseline SAEs and the SAE_j^5 (k = 5). Moreover, we did not observe any substantial differences in performance for any of the tested values of k. The necessity of the features learned by both the baseline SAE_i and SAE_j^k is confirmed by the completeness scores depicted in Figure 6, with a severe drop in performance even with only the top 10 active features mean-ablated.

Figure 6: Average completeness for every downstream task (IOI, Greater Than, Subject-Verb Agreement) with IE computed with AtP (Equation 8). The "Baseline" average is computed considering the performance obtained by SAE_i, ∀i = 0, ..., 10. The "Average" plot depicts the average performance over the three downstream tasks.

The results of SAE_j^k on downstream tasks demonstrate that their learned features are both sufficient and necessary. Moreover, these findings confirm those in Section 4, i.e., that a single SAE can effectively reconstruct activations across multiple contiguous layers and learn a set of shared, general features that span adjacent layers.
In addition to achieving excellent reconstruction and downstream performance, SAEs must learn interpretable features. Following Ghilardi et al. (2024), we engaged human annotators to identify interpretable patterns in the feature annotations. Specifically, they attempted to provide clear definitions for each feature by examining its top and bottom logit attribution scores, as well as the top activating tokens for 96 features sampled from 1M tokens in the training dataset. To evaluate the quality of these features, we defined the Human Interpretability Score as the ratio of features considered interpretable by the human annotators.

Figure 7: Average Human Interpretability Scores, with the average computed over layers for every k. The interpretability of features learned by the SAE_j^k is comparable to the baseline SAE_i.
Figure 7 presents the Human Interpretability Scores for all values of k, averaged across layers. According to human annotators, the interpretability of the features learned by SAE_j^k is comparable to that of the baseline SAE_i. Moreover, we found that when many layers are grouped together, e.g., when k = 1, SAE_j^k features are more polysemantic overall, probably due to increased interference. Nevertheless, some of them remain perfectly interpretable across all model layers, capturing critical directions in model computations.
7 CONCLUSION
This work introduces a novel approach to efficiently train Sparse Autoencoders (SAEs) for Large Language Models (LLMs) by clustering layers based on their angular distance and training a single SAE for each group. Through this method, we achieved up to a 6x speedup in training without compromising reconstruction quality or performance on downstream tasks. The results demonstrate that activations from adjacent layers in LLMs share common features, enabling effective reconstruction with fewer SAEs.
Our findings also show that the SAEs trained on grouped layers perform comparably to layer-specific SAEs in terms of reconstruction metrics, faithfulness, and completeness on various downstream tasks. Furthermore, human evaluations confirmed the interpretability of the features learned by our SAEs, underscoring their utility in disentangling neural activations.

The methodology proposed in this paper opens avenues for more scalable interpretability tools, facilitating deeper analysis of LLMs as they grow in size. Future work will focus on further optimizing the number of layer groups and scaling the approach to even larger models.
8 LIMITATIONS AND FUTURE WORKS
One limitation of our approach is the absence of a precise method for selecting the optimal number of layer groups (k), due to the lack of a clear elbow rule for identifying the correct number of groups. While our results showed comparable performance across different values of k on all key metrics, further exploration is needed to determine whether certain configurations could yield better outcomes under specific conditions.

Additionally, we tested our method primarily on the Pythia-160M model, a relatively small LLM. While our findings demonstrate significant improvements in efficiency without sacrificing performance, the scalability of our approach to much larger models remains an open question. Future work could explore how the grouping strategy and training techniques generalize to models with billions of parameters, where the computational benefits would be even more pronounced.

Another important direction for future research involves understanding how Sparse Autoencoders (SAEs) handle the superposition hypothesis when encoding information from multiple layers. While our method effectively grouped layers and maintained high performance, how SAEs manage the potential overlap in feature representation across layers remains unclear. Investigating this aspect could lead to a clearer understanding of the trade-offs between sparsity and feature disentanglement in SAEs, and inform strategies for improving interpretability without compromising task performance.

In summary, while our work represents an efficient step forward in training SAEs for interpretability, extending this approach to larger models and exploring the handling of superposition will provide valuable insights for both practical applications and the theoretical understanding of sparse neural representations.
9 REPRODUCIBILITY STATEMENT
To support the replication of our empirical findings on training SAEs via layer groups and to enable further research on understanding their inner workings, we plan to release all the code and SAEs used in this study upon acceptance.
REFERENCES
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric
Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya
Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html. Accessed: 2024-08-18.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec,
Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders, 2024. URL https://arxiv.org/abs/2406.04093.
Davide Ghilardi, Federico Belotti, Marco Molinari, and Jaehyuk Lim. Accelerating sparse autoen-
coder training via layer-wise transfer learning in large language models. In The 7th BlackboxNLP Workshop, 2024. URL https://openreview.net/forum?id=GI5j6OMTju.
Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A Roberts. The
unreasonable ineffectiveness of the deeper layers. arXiv preprint arXiv:2403.17887, 2024.
Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?:
Interpreting mathematical abilities in a pre-trained language model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=p4PckNQR8k.
Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse
autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does BERT learn about the structure of language? In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3651–3657, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1356. URL https://aclanthology.org/P19-1356.
Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs
Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=qzsDKwGJyB.
Connor Kissane, Ryan Krzyzanowski, Arthur Conmy, and Neel Nanda. SAEs (usually) transfer between base and chat models. https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models, July 2024. AI Alignment Forum.
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv preprint arXiv:2408.05147, 2024.
Aleksandar Makelov, Georg Lange, and Neel Nanda. Towards principled evaluations of sparse
autoencoders for interpretability and control. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024. URL https://openreview.net/forum?id=MHIX9H8aYF.
Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller.
Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word
representations. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff (eds.), Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/N13-1090.
Neel Nanda. Attribution patching: Activation patching at industrial scale, 2023.
URL https://www.neelnanda.io/mechanistic-interpretability/attribution-patching.
Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models
of self-supervised sequence models. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, and Hosein Mohebbi (eds.), Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 16–30, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.2. URL https://aclanthology.org/2023.blackboxnlp-1.2.
Frank Nielsen. Hierarchical Clustering, pp. 195–211. 2016. ISBN 978-3-319-21902-8. doi: 10.1007/978-3-319-21903-5_8.
Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter.
Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry
of large language models. In Causal Representation Learning Workshop at NeurIPS 2023, 2023. URL https://openreview.net/forum?id=T0PoOJg8cK.
Judea Pearl. Direct and indirect effects. In Probabilistic and causal inference: the works of Judea
Pearl, pp. 373–392. 2022.
Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders, 2024. URL https://arxiv.org/abs/2407.14435.
Lee Sharkey, Dan Braun, and Beren Millidge. Taking the temperature of transformer circuits,
2023. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition. Accessed: 2024-08-18.
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In
International conference on machine learning, pp. 3319–3328. PMLR, 2017.
Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit
discovery. In The 7th BlackboxNLP Workshop, 2024. URL https://openreview.net/forum?id=RysbaxAnc6.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. Intriguing properties of neural networks, 2014. URL https://arxiv.org/abs/1312.6199.
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya
Pathak, Laurent Sifre, Morgane Riviere,` Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Leonard´ Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amelie´ Heliou,´ Andrea Tacchetti, Anna Bulanova, An- tonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clement´ Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Hen- ryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clement´ Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on gemini research and technology, 2024a. URL https://arxiv.org/abs/2403.08295.
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhu- patiraju, Leonard´ Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Rame,´ Johan Fer- ret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Char- line Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur, Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchi- son, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, Christopher A. Choquette-Choo, Danila Sinopalnikov, David Wein- berger, Dimple Vijaykumar, Dominika Rogozinska,´ Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Plucinska,´ Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju yeong Ji, Kareem Mo- hamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjoesund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leti- cia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Mar- tins, Machel Reid, Manvinder Singh, Mark Iverson, Martin Gorner¨ , Mat Velloso, Mateo Wirth, Matt Davidow, Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, Mehran Kazemi, Michael Moynihan, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khat- wani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nilay Chauhan, Os- car Wahltinez, Pankil Botarda, Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin, Petko Georgiev, Phil Culliton, Pradeep Kuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, Reza Ardeshir Rokni, Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, Sara Mc Carthy, Sarah Perrin, Sebastien´ M. R. Arnold, Sebastian Krause, Shengyang Dai, Shruti Garg, Shruti Sheth, Sue Ronstrom, Susan Chan, Timothy Jordan, Ting Yu, Tom Eccles, Tom Hennigan, Tomas Ko- cisky, Tulsee Doshi, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren
Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, Tris Warkentin, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, D. Sculley, Jeanine Banks,AncaDragan,SlavPetrov,OriolVinyals,JeffDean,DemisHassabis,KorayKavukcuoglu, ClementFarabet,ElenaBuchatskaya,SebastianBorgeaud,NoahFiedel,ArmandJoulin,Kathleen Kenealy, Robert Dadashi, and Alek Andreev. Gemma 2: Improving open language models at a practical size, 2024b. URL https://arxiv.org/abs/2408.00118.
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen,
Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt.
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In
Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pp. 818–833. Springer, 2014.
A HYPERPARAMETERS
Table 1: Pythia-160M model specifics

| Config | Value |
|---|---|
| Layers (L) | 12 |
| Model dimension (d_model) | 768 |
| Heads (H) | 12 |
| Non-embedding params | 85,056,000 |
| Equivalent models | GPT-Neo, OPT-125M |
Table 2: Training and fine-tuning hyperparameters

| Hyperparameter | Value |
|---|---|
| c | 8 |
| λ | 1.0 |
| Hook name | resid-post |
| Batch size | 4096 |
| Adam (β1, β2) | (0, 0.999) |
| Context size | 1024 |
| lr | 3e-5 |
| lr scheduler | constant |
| lr decay steps | 20% of the training steps |
| l1 warm-up steps | 5% of the training steps |
| # tokens (train) | 1B |
| Checkpoint freq | 200M |
| Decoder weights initialization | Zeroes |
| Activation function | JumpReLU |
| Decoder column normalization | Yes |
| Activation normalization | No |
| FP precision | 32 |
| Prepend BOS token | No |
| MSE loss normalization | No |
| Scale sparsity penalty by decoder norm | No |
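For convenience, Table 2 can be rendered as a plain Python configuration dictionary (key names are ours and do not correspond to any particular training library):

```python
# Table 2 rendered as a plain configuration dictionary; key names are ours and do not
# correspond to any particular training library.
sae_training_config = {
    "expansion_factor": 8,              # c
    "l1_coefficient": 1.0,              # lambda in Equation 4
    "hook_name": "resid-post",
    "batch_size": 4096,
    "adam_betas": (0.0, 0.999),
    "context_size": 1024,
    "lr": 3e-5,
    "lr_scheduler": "constant",
    "lr_decay_steps": "20% of training steps",
    "l1_warmup_steps": "5% of training steps",
    "train_tokens": 1_000_000_000,
    "checkpoint_every_tokens": 200_000_000,
    "decoder_init": "zeros",
    "activation_fn": "jumprelu",
    "normalize_decoder_columns": True,
    "normalize_activations": False,
    "fp_precision": 32,
    "prepend_bos": False,
    "normalize_mse_loss": False,
    "scale_sparsity_by_decoder_norm": False,
}
```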
Table 3: Layer groups for every k

| k | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 |
|---|---|---|---|---|---|
| 1 | 11 | | | | |
| 2 | 7 | 11 | | | |
| 3 | 3 | 7 | 11 | | |
| 4 | 3 | 7 | 9 | 11 | |
| 5 | 3 | 5 | 7 | 9 | 11 |
B DETAILED PER-LAYER RECONSTRUCTION, SPARSITY AND MMCS PLOTS
Figure 8: Per-layer CE Loss Score, R², L2 and L0 with k = 1.
Figure 9: Per-layer CE Loss Score, R², L2 and L0 with k = 2.
Figure 10: Per-layer CE Loss Score, R², L2 and L0 with k = 3.
Figure 11: Per-layer CE Loss Score, R², L2 and L0 with k = 4.
Figure 12: Per-layer CE Loss Score, R², L2 and L0 with k = 5.
Figure 13: Per-layer MMCS for every k: (a) k = 1, (b) k = 2, (c) k = 3, (d) k = 4, (e) k = 5.
C APPROXIMATE INDIRECT EFFECTS
In Equation 7 we reported the Indirect Effect (IE) (Pearl, 2022), which measures the importance of a feature with respect to a generic downstream task T. To reduce the computational burden of estimating the IE with a single forward pass per feature, we employed two approximate methods: Attribution Patching (AtP) (Nanda, 2023; Syed et al., 2024) and Integrated Gradients (IG) (Sundararajan et al., 2017).
AtP (Nanda, 2023; Syed et al., 2024) employs a first-order Taylor expansion

ÎE_AtP = ∇m|_{f = f_clean} · (f_patch − f_clean)    (8)

which estimates Equation 7 for every feature f using two forward passes and a single backward pass.
Integrated Gradients (Sundararajan et al., 2017) is a more expensive but more accurate approximation of Equation 7:
ÎE_IG = Σ_{α∈Γ} ∇m|_{α f_clean + (1−α) f_patch} · (f_patch − f_clean)    (9)

where α ranges over the equally spaced set Γ = {0, 1/N, ..., (N−1)/N}. In our experiments we set N = 10.
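A minimal sketch of both estimators (ours; m is assumed to be a differentiable scalar metric of the SAE feature activations, and the 1/N averaging of standard integrated gradients is made explicit):

```python
import torch

def atp_ie(m, f_clean: torch.Tensor, f_patch: torch.Tensor) -> torch.Tensor:
    # Equation 8: first-order Taylor estimate of the IE of every feature, from a single
    # gradient of the scalar metric m taken at the clean feature activations.
    f = f_clean.detach().clone().requires_grad_(True)
    grad = torch.autograd.grad(m(f), f)[0]
    return grad * (f_patch - f_clean)

def ig_ie(m, f_clean: torch.Tensor, f_patch: torch.Tensor, n_steps: int = 10) -> torch.Tensor:
    # Equation 9: average the metric's gradient at n_steps points on the straight line
    # between the patched and clean activations, then rescale by their difference.
    total = torch.zeros_like(f_clean)
    for alpha in torch.linspace(0.0, 1.0 - 1.0 / n_steps, n_steps):
        f = (alpha * f_clean + (1.0 - alpha) * f_patch).detach().requires_grad_(True)
        total = total + torch.autograd.grad(m(f), f)[0]
    return (total / n_steps) * (f_patch - f_clean)
```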
Figure 14: Average faithfulness (Marks et al., 2024) for every downstream task (IOI, Greater Than, Subject-Verb Agreement), with IE computed with IG (Equation 9). The "Baseline" average is computed considering the performance obtained by SAE_i, ∀i = 0, ..., 10. The "Average" plot depicts the average over the three downstream tasks.
Figure 15: Average completeness for every downstream task (IOI, Greater Than, Subject-Verb Agreement), with IE computed with IG (Equation 9). The "Baseline" average is computed considering the performance obtained by SAE_i, ∀i = 0, ..., 10. The "Average" plot depicts the average performance over the three downstream tasks.
Footnotes

1. We do not consider the last transformer layer, as it differs from every other layer w.r.t. the angular distance defined in Section 3.2.

3. The complete linkage clustering strategy defines the group-distance between two groups X and Y as D(X, Y) = max_{x∈X, y∈Y} d_angular(x, y).