Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models

Note: this article was converted from a PDF and is better read with plots, tables, and figures at: https://openreview.net/forum?id=GI5j6OMTju

Abstract

Sparse AutoEncoders (SAEs) have gained popularity as a tool for enhancing the interpretability of Large Language Models (LLMs). However, training SAEs can be computationally intensive, especially as model complexity grows. In this study, the potential of transfer learning to accelerate SAE training is explored by capitalizing on the shared representations found across adjacent layers of LLMs. Our experimental results demonstrate that fine-tuning SAEs using pre-trained models from nearby layers not only maintains but often improves the quality of learned representations, while significantly accelerating convergence. These findings indicate that the strategic reuse of pre-trained SAEs is a promising approach, particularly in settings where computational resources are constrained.

1 Introduction

Transformer-based models have become ubiquitous in a large variety of application fields (Dubey et al., 2024; Kirillov et al., 2023; Radford et al., 2023; Chen et al., 2021; Zitkovich et al., 2023; Waisberg et al., 2023). Given their tremendous impact on society, concerns about their interpretability have been raised by various stakeholders (Bernardo, 2023). Mechanistic Interpretability (MI) (Conmy et al., 2023; Nanda et al., 2023) seeks to reverse-engineer how Neural Networks, and in particular LLMs, generate outputs by uncovering the circuits they have learned during training, stored inside their parameters, and executed during a forward pass (Nanda et al., 2023; Conmy et al., 2023; Gurnee et al., 2023). A promising interpretability technique is dictionary learning (Cunningham et al., 2023; Gao et al., 2024; Karvonen et al., 2024), which seeks to capture interpretable and editable features within the internal layers of LLMs. This method implies training Sparse Autoencoders (SAEs) to reconstruct the model's activations. Our main contributions are the following:

  • We demonstrate that SAEs exhibit partial transfer to adjacent layers in a zero-shot setting, though fine-tuning is recommended for optimal performance.
  • We show that both Forward-SAEs and Backward-SAEs, when fine-tuned on adjacent activations, consistently transfer across all tested checkpoints, achieving comparable or superior performance to SAEs trained from scratch, while using significantly less training data.
  • We train and publicly release SAEs for Pythia-160M (Biderman et al., 2023), the model utilized in this study.

2 Background and objectives

2.1 Linear representation hypothesis and superposition

Although it has been demonstrated that LLMs represent some of their features linearly (Park et al., 2024), a key challenge in LLM interpretability is the lack of clear neuron interpretation. Recent work by Elhage et al. (2022) tries to explain this phenomenon by showing that models can use n-dimensional activations to represent m ≫ n sparse, almost-orthogonal features in superposition. Superposition theory is based on three key concepts: (i) the existence of a hypothetical large and disentangled model where each neuron perfectly aligns with a single feature, with each neuron activating for exactly one feature at a time; the observed models can be thought of as dense, almost-orthogonal projections of this larger, ideal model. (ii) Features are sparse, reflecting the idea that in the natural world, many features are inherently sparse. (iii) The importance of features varies depending on the task at hand. These assumptions, combined with two mathematical principles^2, suggest that the hidden sparse features can be recovered by projecting the dense model back to the hypothetical large and disentangled one. SAEs serve this purpose: learning a set of sparse, interpretable, and high-dimensional features from an observed model's dense and superposed activations.

(^1) Assuming training half of the SAEs from scratch and the other half with transfer from an adjacent layer with half of the training tokens.
(^2) The Johnson-Lindenstrauss lemma, which ensures that points in a high-dimensional space can be embedded into a lower dimension while almost preserving distances, and compressed sensing, which exploits sparsity to recover signals from fewer samples than required by the Nyquist–Shannon theorem.

2.2 Sparse Autoencoders

Recently, Sparse AutoEncoders have become a popular tool in Large Language Model (LLM) interpretability as they effectively decompose neuron activations into interpretable features (Bricken et al., 2023; Cunningham et al., 2023). For a given input activation x ∈ R^{d_model}, the SAE computes a reconstruction x̂ as a sparse linear combination of d_sae ≫ d_model features v_i ∈ R^{d_model}. The reconstruction is given by:

(x̂ ∘ f)(x) = W_d f(x) + b_d    (1)

where the v_i are the columns of W_d, b_d is the decoder bias term, and f(x) are the feature activations. The latter are computed as:

f(x) = ReLU(W_e(x − b_d) + b_e)    (2)

where b_e is the encoder bias term. SAEs are trained to minimize the following loss function:

L_sae = ‖x − x̂‖_2^2 + λ‖f(x)‖_1    (3)

In Equation 3, the first term corresponds to the reconstruction error, to which an ℓ_1 regularization term on the activations f(x) is added to promote sparsity in the feature activations. The training process of a SAE can become computationally intensive, particularly as model size increases. For example, training a single SAE for a widely used model such as Llama-3-8b (Dubey et al., 2024) (d_model = 4096) with an expansion factor of c = 32 (i.e., d_sae = 131072) requires ≈ 1B parameters.
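To make the notation concrete, the following is a minimal PyTorch sketch of the ReLU SAE described by Equations 1-3. Note that the SAEs actually trained in Section 3 use the JumpReLU activation (Rajamanoharan et al., 2024), and the expansion factor and λ below are illustrative defaults rather than the values in Table 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """ReLU SAE of Eqs. (1)-(3): f(x) = ReLU(W_e (x - b_d) + b_e), x_hat = W_d f(x) + b_d."""

    def __init__(self, d_model: int, expansion_factor: int = 32):
        super().__init__()
        d_sae = d_model * expansion_factor
        self.encoder = nn.Linear(d_model, d_sae)   # W_e, b_e
        self.decoder = nn.Linear(d_sae, d_model)   # W_d, b_d (columns of W_d are the features v_i)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x - self.decoder.bias))   # Eq. (2)
        x_hat = self.decoder(f)                           # Eq. (1)
        return x_hat, f


def sae_loss(x, x_hat, f, lam: float = 1e-3):
    """Eq. (3): squared reconstruction error plus lambda times the L1 penalty, batch-averaged."""
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + lam * sparsity
```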

2.3 Evaluating SAEs

Evaluating SAEs and the features they have learned presents significant challenges. In our work, the techniques employed can be divided into reconstruction and interpretability metrics. The first includes:

  • The Cross-Entropy Loss Score (CES), defined as:

CES = [CE(ζ) − CE(x̂ ∘ f)] / [CE(ζ) − CE(Id)]    (4)
  • The L2 loss (reconstruction loss) is the first term of Equation 3, which measures the reconstruction error made by the SAE.

  • The L0 loss of the learned features, defined as the number of non-zero entries of f(x), i.e., ‖f(x)‖_0.

Measuring the quality of the features learned by a SAE is not straightforward, and multiple strategies exist. As reported in Makelov et al. (2024), interpretability metrics can be categorized as follows:

  • Indirect Geometric Measures: Sharkey et al. (2023) proposed using mean maximum cosine similarity (MMCS) between features learned by different SAEs to assess their quality. Given two feature dictionaries D and D′, with |D| = |D′|, MMCS is defined as the average, over the features d ∈ D, of the maximum cosine similarity between d and any feature d′ ∈ D′ (a minimal code sketch is given after this list).

  • Auto-Interpretability: Bricken et al. (2023), Bills et al. (2023), and Cunningham et al. (2023) used LLMs to generate natural-language descriptions of SAE features based on highly activating examples and measured interpretability as the prediction quality on previously unseen text.

  • Manually Crafted Proxies for Ground Truth: Bricken et al. (2023) developed computational proxies for a set of SAE features, relying on manually formulated hypotheses.

  • Direct Logit Attribution (DLA): This method, used by Bricken et al. (2023), assesses the direct effect of a feature on the next-token distribution, providing insights into the causal role of features; a sketch of one common way to compute the attribution score is given at the end of this subsection.

  • Supervised Dictionary Benchmarking: Makelov et al. (2024) introduced a technique that benchmarks unsupervised SAE dictionaries against supervised dictionaries based on task-relevant attributes to ensure extracted features are interpretable and relevant to specific tasks.
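As a concrete illustration of the geometric measure above, the following sketch computes MMCS between two dictionaries; the column-wise tensor layout is an assumption, not a detail specified in the paper.

```python
import torch
import torch.nn.functional as F


def mmcs(D: torch.Tensor, D_prime: torch.Tensor) -> float:
    """Mean maximum cosine similarity between two feature dictionaries.

    D, D_prime: (d_model, n_features) matrices whose columns are dictionary features.
    For each feature in D, take its best cosine match in D_prime, then average.
    """
    D = F.normalize(D, dim=0)                 # unit-normalize every column
    D_prime = F.normalize(D_prime, dim=0)
    cos = D.T @ D_prime                       # pairwise cosine similarities
    return cos.max(dim=1).values.mean().item()
```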

In our work, the evaluation metrics employed include all the reconstruction metrics listed above, the comparison between features learned via transfer learning and those from SAEs trained from scratch, feature-averaged DLA computed on a custom dataset comprising 64 handcrafted prompts, and manual feature inspection in the form of a Human Interpretability Score defined in Section 3.
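The paper does not spell out the attribution formula it uses for DLA. A common convention for SAE features, assumed in the sketch below, is to scale the feature's decoder direction by its activation and project it through the model's unembedding matrix W_U.

```python
import torch


def feature_dla(W_U: torch.Tensor, d_i: torch.Tensor, f_i: float) -> torch.Tensor:
    """Direct contribution of SAE feature i to the next-token logits.

    W_U: (d_model, vocab) unembedding matrix.
    d_i: (d_model,) decoder direction of feature i (i-th column of W_d).
    f_i: activation of feature i on the current token.
    """
    return f_i * (d_i @ W_U)   # (vocab,) logit contributions of this feature
```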

2.4 Transfer Learning

Transfer learning (Goodfellow et al., 2016) is a powerful technique in machine learning where knowledge gained from one task is applied to improve performance on a related, but distinct, task. This approach is particularly useful when training from scratch is computationally expensive or when labeled data is scarce. In the context of SAEs for LLMs, transfer learning enables the reuse of weights learned in one layer to initialize and accelerate the training of SAEs in adjacent layers.

2.5 Objectives

In this work, the transferability and generalization of intra-model SAEs have been studied, aiming to answer the following research questions:

Q1. Are SAEs transferable between layers? I.e., can a SAE trained on the activations of layer i be reused to reconstruct activations of layer j ≠ i?

Q2. Is Transfer Learning applicable to SAEs? Specifically, can a SAE initialized with the weights of a neighboring SAE and then fine-tuned achieve equal or superior performance, potentially using only a fraction of the data, compared to an SAE trained from scratch?

3 Experimental setup

To address the questions raised in Section 2, we first trained from scratch one SAE_i for each layer i of Pythia-160M, a 12-layer decoder-only Transformer model from the Pythia family (Biderman et al., 2023). Each SAE was trained using the JumpReLU activation function (Rajamanoharan et al., 2024), with activations taken from the corresponding layer's residual stream after the MLP contribution. The model configuration details are provided in Table 1. Let also j ≠ i be another layer index. Then SAE_{i←j} is defined as the SAE initialized with weights from the j-th SAE and fine-tuned with activations of the i-th layer. In particular, this work is focused on SAE_{i←i−1} and SAE_{i←i+1}, named Forward-SAE (Fwd-SAE) and Backward-SAE (Bwd-SAE) respectively. Figure 1 summarizes the overall training and fine-tuning procedure, with the hyperparameters specified in Table 2. The dataset adopted for both training and fine-tuning is the Pile-small-2b^4, an already tokenized version of the Pile dataset (Gao et al., 2020) with a total of 2B tokens. To effectively measure the reconstruction performance of a SAE before and after fine-tuning with transfer learning, the normalized CE-Loss Score is adopted and defined as:

CES_{i,j} = [CES(SAE_{i←j}(x_i)) − CES(SAE_j(x_i))] / [CES(SAE_i(x_i)) − CES(SAE_j(x_i))]    (8)

by assuming CES(SAE_j(x_i)) and CES(SAE_i(x_i)) to be, respectively, the lower and the upper bound for the CES on x_i. With the definitions above, CES_{i,i−1} and CES_{i,i+1} are the normalized Cross-Entropy Loss Scores (CE-Loss Scores) of the Fwd-SAE and Bwd-SAE respectively. Finally, to evaluate feature quality, a Human Interpretability Score has been defined as the ratio of features that have been evaluated as interpretable by human annotators. To generate the score, all the SAEs have been run on approximately 1M tokens randomly sampled from the training dataset. With their activations, max activating tokens and top/bottom attribution logits have been computed and analyzed by the labelers.
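To make the procedure concrete, the sketch below initializes a transfer SAE from a neighboring layer's SAE, fine-tunes it on the target layer's activations, and computes the normalized CE-Loss Score of Equation 8. `SparseAutoencoder` refers to the earlier sketch from Section 2.2; `activation_batches` is a placeholder for a stream of layer-i residual activations, and the optimizer, learning rate, and λ are illustrative, the actual hyperparameters being those in Table 2.

```python
import copy
import torch


def init_transfer_sae(sae_j):
    """SAE_{i<-j}: start fine-tuning from a copy of the SAE trained on layer j."""
    return copy.deepcopy(sae_j)


def finetune(sae, activation_batches, lam=1e-3, lr=1e-4):
    """Fine-tune a transferred SAE on layer-i activations with the loss of Eq. (3)."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for x in activation_batches:              # a fraction of the from-scratch token budget
        x_hat, f = sae(x)
        loss = ((x - x_hat) ** 2).sum(-1).mean() + lam * f.abs().sum(-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def normalized_ces(ces_transfer, ces_scratch_i, ces_zero_shot_j):
    """Eq. (8): 0 = zero-shot SAE_j on layer i (lower bound), 1 = SAE_i trained from scratch."""
    return (ces_transfer - ces_zero_shot_j) / (ces_scratch_i - ces_zero_shot_j)
```

For illustration, normalized_ces(0.95, 0.97, 0.80) ≈ 0.88 would mean the transferred SAE recovers most of the gap between the zero-shot and from-scratch scores (the numbers are invented for the example).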

4 Results

4.1 SAE transferability

Figure 2: CE-Loss Score (Eq. 4), where the cell (i, j) in the plot represents the CE-Loss Score obtained by reconstructing the activations from layer i with SAE_j. This plot has to be read column-wise.

Figure 2 shows the CE-Loss Score achieved by every SAE_j reconstructing the activations of layer i, for every i, j = 0, …, L−1, i.e., the zero-shot setting. It is clear that a certain degree of transferability exists between SAE_j and the activations of adjacent layers, with this being more noticeable when i = j − 1 (i.e., SAEs are more effective at reconstructing the activations of preceding layers than those of subsequent ones). These findings can also be attributed to the fact that, as demonstrated by Gromov et al. (2024), angular distances between adjacent layers are smaller, enabling neighboring SAEs to operate on a similar basis with respect to the activations they were trained on. The answer to Q1 is, therefore, yes; however, although transferability between layers exists, it remains partial and, potentially, not completely reliable for downstream applications.

(^4) https://huggingface.co/datasets/NeelNanda/pile-small-tokenized-2b

Figure 3: Average CE-Loss Score, L2-Loss and L0-Loss. The average is computed over layers for a single checkpoint. The "No Transfer" average is computed considering the performance obtained by SAE_i(x_i), ∀ i = 0, …, 11.

4.2 Feature Evaluation

The metric value decreases for deeper layers, suggesting a slight divergence in the features learned by the transfer SAEs. Notably, SAE_{L−1←L} exhibits a sharp decline in the score, indicating that transferring on the last layer should be approached with caution. Figure 5 displays the layer-averaged DLA scores for each tested checkpoint. The plot reveals that forward transfer SAEs consistently achieve higher scores than the baseline, while backward transfer SAEs consistently score lower. This outcome contrasts with the reconstruction metrics, where the backward technique consistently outperformed the forward approach. Lastly, from the human interpretability scores (Figure 6), no significant differences can be observed between the transfer types. By manually looking at the learned features, a key pattern has emerged: many features learned by SAEs trained with transfer learning remain shared with the SAE used for initialization. This phenomenon, termed Feature Transfer, particularly affects the most interpretable features (see an example in Figure 18). To further investigate this phenomenon, a metric was developed to quantify it. Given a SAE_i and another trained via transfer learning from it, SAE_{i←i±1}, the number of shared "top", "bottom", and "max activating" tokens^5 for each feature has been computed (features have been compared using the same indices). The transfer score has then been defined as the percentage of shared tokens across all three heuristics; a minimal code sketch follows this paragraph. Figure 7 presents the scores across all the layers for the last evaluated checkpoint. Except for layer 1, backward transfer consistently exhibits lower scores. It is important to note that this phenomenon is easily recognized in SAEs trained with transfer learning when compared to their initialization, as feature indices are preserved. Evaluating this in SAEs trained from scratch is more demanding due to the exponential growth in the number of comparisons required, and although relevant, it falls outside the scope of this work.
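A minimal sketch of that transfer score, assuming each SAE's features have been summarized by top-k "top", "bottom", and "max activating" token lists keyed by feature index (the data layout and k are assumptions, not details from the paper):

```python
def transfer_score(tokens_base, tokens_transfer, k=10):
    """Percentage of shared tokens between a base SAE and a transferred SAE,
    pooled over the "top", "bottom", and "max activating" heuristics.

    tokens_base / tokens_transfer:
        {feature_idx: {"top": [...], "bottom": [...], "max_act": [...]}}.
    Feature indices are directly comparable because transfer learning preserves them.
    """
    shared, total = 0, 0
    for idx, base in tokens_base.items():
        other = tokens_transfer[idx]
        for key in ("top", "bottom", "max_act"):
            shared += len(set(base[key][:k]) & set(other[key][:k]))
            total += k
    return 100.0 * shared / total
```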

(^5) "Top" and "bottom" logit tokens refer to those whose unembedding directions are most and least aligned, respectively, with the projection of the feature in the unembedding space. "Max activating" tokens are those for which the feature exhibits the highest activations.

Figure 6: Human Interpretability Scores (Section 3) for 32 features randomly sampled from each SAE layer and type of transfer.

5 Related works

5.1 Scaling and evaluating SAEs

As SAEs gain popularity for LLM interpretability and are increasingly applied to state-of-the-art models (Lieberum et al., 2024), the need for more efficient training techniques has become evident. To address this, Gao et al. (2024) explored scaling laws of autoencoders to identify the optimal combination of size and sparsity. However, training SAEs is only one aspect of the challenge; evaluating them presents another significant hurdle. This evaluation is a crucial focus within MI. While early approaches in Cunningham et al. (2023) and Bricken et al. (2023) relied on unsupervised metrics like reconstruction loss and L0 sparsity to assess SAE performance, these metrics alone cannot fully capture the efficacy of a SAE. They provide quantitative measures of how well SAEs capture information in model activations while maintaining sparsity, but they fall short of addressing the broader utility of these features. More recent techniques, such as auto-interpretability (Bricken et al., 2023; Bills et al., 2023; Cunningham et al., 2023) and ground-truth comparisons (Sharkey et al., 2023), have shifted towards a more holistic evaluation, focusing on the causal relevance of the extracted features (Marks et al., 2024) and evaluating SAEs on different downstream tasks in which they can be employed (Makelov et al., 2024). In particular, Makelov et al. (2024) introduced a framework for evaluating SAEs on the Indirect Object Identification (IOI) task, focusing on three key aspects: the sufficiency and necessity of activation reconstructions, the ability to control model behavior through sparse feature editing, also called feature steering

(Templeton et al., 2024), and the interpretability of features in relation to their causal role. Karvonen et al. (2024) further advanced principled evaluations by introducing novel metrics specifically designed for board game language models. Their approach leverages the well-defined structure of chess and Othello to create supervised metrics for SAE quality, including board reconstruction accuracy and coverage of predefined board state properties. These methods provide a more direct assessment of how well SAEs capture semantically meaningful and causally relevant features, offering a complement to the earlier unsupervised metrics like L0 and L2.

5.2 SAEs transfer learning

Recent work by Kissane et al. (2024) and Lieberum et al. (2024) has demonstrated the transferability of SAE weights between base and instruction-tuned versions of the Gemma-1 (Team et al., 2024a) and Gemma-2 (Team et al., 2024b) models, respectively. This finding is significant as it suggests that many interpretable features are preserved during the fine-tuning process. While this transfer occurs between model variants (inter-model) rather than between layers (intra-model), it complements our work by indicating that SAE features can remain stable across different stages of model development. The preservation of these features through fine-tuning not only offers insights into the robustness of learned representations but also suggests potential efficiency gains in interpreting families of models derived from a common base SAE.

6 Conclusions

We hypothesized and validated whether SAE transfer is an effective method to accelerate and optimize the SAE training process. We investigated whether SAE weights derived from adjacent layers could maintain efficacy in reconstruction, which our results affirmed. Furthermore, we examined whether the transferred SAEs, when fine-tuned on a layer's activations, could reliably capture monosemantic features comparable to the original SAE, which our experiments also confirmed. The transferred SAEs (both forward and backward) demonstrated comparable and occasionally superior reconstruction loss relative to the original. Empirically, we observed frequent overlap in the most strongly activated features across adjacent layers (e.g., Figure 18). For a given feature index i, the features learned by SAE_{i←i+1} (Backward), SAE_i (No Transfer), and SAE_{i←i−1} (Forward) appeared to represent similar concepts.

7 Limitations and future works

While our study successfully demonstrates the feasibility of reconstruction transfer and the transfer learning of SAE weights to adjacent layers, there are several limitations that warrant consideration and pave the way for future research directions.

  • Model Size and Scope: We trained base and transfer SAEs on the activations of Pythia-160M, a model much smaller than state-of-the-art LLMs. Although not tested here, as model size and training complexity increase, the benefits of transfer learning are expected to become more pronounced. In such scenarios, transfer learning can significantly accelerate training and reduce associated costs, making our approach potentially more impactful for larger models. Therefore, a critical area for future research is to extend these investigations to larger models, exploring how scaling affects the efficacy of transfer learning and how these benefits can be maximized in real-world settings.
  • Inter-Model and Intra-Model Transferability: In our study, we focused on the transfer of intra-model SAEs, particularly assessing the transferability between SAEs in adjacent layers. Given that model architectures are now commonly shared across different model families, a direction for future research would be

to evaluate the transferability of intra-model SAEs within models from different families that utilize the same architecture. This exploration could offer valuable insights into the broader applicability of SAEs beyond closely related model families.

  • Experimental Scale and Hyperparameter Interactions: Our study was conducted on a limited scale in terms of model components involved and the range of training hyperparameters explored. The fixed set of hyperparameters used may not fully capture the potential of our transfer learning approach across different configurations. Future research should involve a broader exploration of hyperparameter spaces, especially the λ coefficient and expansion factor c, along with component variations to determine the robustness and versatility of the method.
  • Feature Transfer Phenomenon: Our findings reveal a "feature transfer" phenomenon, where features learned in one layer are exactly replicated in another during transfer learning. This can be problematic, as it may prevent the fine-tuned SAEs from discovering new, layer-specific features. However, it also offers an interesting opportunity to study how similar features are encoded across layers. Future research should focus on understanding and managing this phenomenon to either harness or mitigate its effects, depending on the desired outcomes, thereby improving the flexibility and effectiveness of transfer learning.

References

Vítor Bernardo. 2023. TechDispatch #2/2023: Explainable artificial intelligence. https://www.edps.europa.eu/data-protection/our-work/publications/techdispatch/2023-11-16-techdispatch-22023-explainable-artificial-intelligence_en. European Data Protection Supervisor.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.

Steven Bills, Nick Cammarata, Dan Mossing, Henk
Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan
Leike, Jeff Wu, and William Saunders. 2023. Lan-
guage models can explain neurons in language mod-
els. Accessed: 2024-08-18.
Trenton Bricken, Adly Templeton, Joshua Batson,
Brian Chen, Adam Jermyn, Tom Conerly, Nick
Turner, Cem Anil, Carson Denison, Amanda Askell,
Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas
Schiefer, Tim Maxwell, Nicholas Joseph, Zac
Hatfield-Dodds, Alex Tamkin, Karina Nguyen,
Brayden McLean, Josiah E Burke, Tristan Hume,
Shan Carter, Tom Henighan, and Christopher
Olah. 2023. Towards monosemanticity: Decom-
posing language models with dictionary learning.
Transformer Circuits Thread. Https://transformer-
circuits.pub/2023/monosemantic-
features/index.html.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee,
Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind
Srinivas, and Igor Mordatch. 2021. Decision trans-
former: Reinforcement learning via sequence model-
ing. InAdvances in Neural Information Processing
Systems, volume 34, pages 15084–15097. Curran As-
sociates, Inc.
Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch,
Stefan Heimersheim, and Adrià Garriga-Alonso.
2023. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. Preprint, arXiv:2309.08600.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi,
Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuen- ley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Lau- rens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bash- lykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Pra- jjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Ro- main Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gu- rurangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petro- vic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whit- ney Meers, Xavier Martinet, Xiaodong Wang, Xiao- qing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesen- berg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, An- drei Lupu, Andres Alvarado, Andrew Caples, An- drew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Apara- jita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yaz- dan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Han- cock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Da- mon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Tes-

tuggine, Delia David, Devi Parikh, Diana Liskovich,
Didem Foss, Dingkang Wang, Duc Le, Dustin Hol-
land, Edward Dowling, Eissa Jamil, Elaine Mont-
gomery, Eleonora Presani, Emily Hahn, Emily Wood,
Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan
Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat
Ozgenel, Francesco Caggioni, Francisco Guzmán,
Frank Kanayet, Frank Seide, Gabriela Medina Flo-
rez, Gabriella Schwarz, Gada Badeer, Georgia Swee,
Gil Halpern, Govind Thattai, Grant Herman, Grigory
Sizov, Guangyi, Zhang, Guna Lakshminarayanan,
Hamid Shojanazeri, Han Zou, Hannah Wang, Han-
wen Zha, Haroun Habeeb, Harrison Rudolph, He-
len Suk, Henry Aspegren, Hunter Goldman, Igor
Molybog, Igor Tufanov, Irina-Elena Veliche, Itai
Gat, Jake Weissman, James Geboski, James Kohli,
Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff
Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizen-
stein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi
Yang, Joe Cummings, Jon Carvill, Jon Shepard,
Jonathan McPhie, Jonathan Torres, Josh Ginsburg,
Junjie Wang, Kai Wu, Kam Hou U, Karan Sax-
ena, Karthik Prasad, Kartikay Khandelwal, Katay-
oun Zand, Kathy Matosich, Kaushik Veeraragha-
van, Kelly Michelena, Keqian Li, Kun Huang, Ku-
nal Chawla, Kushal Lakhotia, Kyle Huang, Lailin
Chen, Lakshya Garg, Lavender A, Leandro Silva,
Lee Bell, Lei Zhang, Liangpeng Guo, Licheng
Yu, Liron Moshkovich, Luca Wehrstedt, Madian
Khabsa, Manav Avalani, Manish Bhatt, Maria Tsim-
poukelli, Martynas Mankus, Matan Hasson, Matthew
Lennie, Matthias Reso, Maxim Groshev, Maxim
Naumov, Maya Lathi, Meghan Keneally, Michael L.
Seltzer, Michal Valko, Michelle Restrepo, Mihir
Patel, Mik Vyatskov, Mikayel Samvelyan, Mike
Clark, Mike Macey, Mike Wang, Miquel Jubert Her-
moso, Mo Metanat, Mohammad Rastegari, Mun-
ish Bansal, Nandhini Santhanam, Natascha Parks,
Natasha White, Navyata Bawa, Nayan Singhal, Nick
Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev,
Ning Dong, Ning Zhang, Norman Cheng, Oleg
Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem
Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pa-
van Balaji, Pedro Rittner, Philip Bontrager, Pierre
Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratan-
chandani, Pritish Yuvraj, Qian Liang, Rachad Alao,
Rachel Rodriguez, Rafi Ayub, Raghotham Murthy,
Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah
Hogan, Robin Battey, Rocky Wang, Rohan Mah-
eswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu,
Samyak Datta, Sara Chugh, Sara Hunt, Sargun
Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma,
Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lind-
say, Shaun Lindsay, Sheng Feng, Shenghao Lin,
Shengxin Cindy Zha, Shiva Shankar, Shuqiang
Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agar-
wal, Soji Sajuyigbe, Soumith Chintala, Stephanie
Max, Stephen Chen, Steve Kehoe, Steve Satterfield,
Sudarshan Govindaprasad, Sumit Gupta, Sungmin
Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury,
Sydney Goldman, Tal Remez, Tamar Glaser, Tamara
Best, Thilo Kohler, Thomas Robinson, Tianhe Li,
Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook
Shaked, Varun Vontimitta, Victoria Ajayi, Victoria
Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal
Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu
Mihailescu, Vladimir Ivanov, Wei Li, Wenchen
Wang, Wenwen Jiang, Wes Bouaziz, Will Consta-
ble, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu,
Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yan-
jun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin
Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu,
Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach
Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen,
Zhenyu Yang, and Zhiwei Zhao. 2024. The llama 3
herd of models.Preprint, arXiv:2407.21783.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. 2022. Toy models of superpo- sition.Preprint, arXiv:2209.10652.

Leo Gao, Stella Biderman, Sid Black, Laurence Gold- ing, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The pile: An 800gb dataset of diverse text for language modeling. Preprint, arXiv:2101.00027.

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2024. Scaling and evaluating sparse autoencoders.Preprint, arXiv:2406.04093.

Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. 2016.Deep Learning. MIT Press, Cambridge, MA, USA.http://www.deeplearningbook.org.

Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts. 2024. The unreasonable ineffectiveness of the deeper layers. Preprint, arXiv:2403.17887.

Wes Gurnee, Neel Nanda, Matthew Pauly, Kather- ine Harvey, Dmitrii Troitskii, and Dimitris Bert- simas. 2023. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610.

Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. 2024. Measuring progress in dictionary learning for language model interpretability with board game models. In ICML 2024 Workshop on Mechanistic Interpretability.

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment anything.Preprint, arXiv:2304.02643.

Connor Kissane, Ryan Krzyzanowski, Andrew Conmy, and Neel Nanda. 2024. SAEs (usually) transfer between base and chat models. https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models. AI Alignment Forum.
Tom Lieberum, Senthooran Rajamanoharan, Arthur
Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant
Varma, János Kramár, Anca Dragan, Rohin Shah,
and Neel Nanda. 2024. Gemma scope: Open sparse
autoencoders everywhere all at once on gemma 2.
arXiv preprint arXiv:2408.05147.
Aleksandar Makelov, George Lange, and Neel Nanda. 2024. Towards principled evaluations of sparse autoencoders for interpretability and control. Preprint, arXiv:2405.08366.

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. 2024. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations.

Kiho Park, Yo Joong Choe, and Victor Veitch. 2024. The linear representation hypothesis and the geometry of large language models. Preprint, arXiv:2311.03658.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. 2024. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. Preprint, arXiv:2407.14435.

Lee Sharkey, Dan Braun, and Beren Millidge. 2023. Taking the temperature of transformer circuits. Accessed: 2024-08-18.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski,
Jean-Baptiste Lespiau, Jeff Stanway, Jenny Bren-
nan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin
Mao-Jones, Katherine Lee, Kathy Yu, Katie Milli-
can, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon,
Machel Reid, Maciej Mikuła, Mateo Wirth, Michael
Sharman, Nikolai Chinaev, Nithum Thain, Olivier
Bachem, Oscar Chang, Oscar Wahltinez, Paige Bai-
ley, Paul Michel, Petko Yotov, Rahma Chaabouni,
Ramona Comanescu, Reena Jana, Rohan Anil, Ross
McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith,
Sebastian Borgeaud, Sertan Girgin, Sholto Douglas,
Shree Pandya, Siamak Shakeri, Soham De, Ted Kli-
menko, Tom Hennigan, Vlad Feinberg, Wojciech
Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao
Gong, Tris Warkentin, Ludovic Peran, Minh Giang,
Clément Farabet, Oriol Vinyals, Jeff Dean, Koray
Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani,
Douglas Eck, Joelle Barral, Fernando Pereira, Eli
Collins, Armand Joulin, Noah Fiedel, Evan Sen-
ter, Alek Andreev, and Kathleen Kenealy. 2024a.
Gemma: Open models based on gemini research
and technology.Preprint, arXiv:2403.08295.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupati- raju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur, Olivier Bachem, Alanna Wal- ton, Aliaksei Severyn, Alicia Parrish, Aliya Ah- mad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, An- thony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, Christopher A. Choquette-Choo, Danila Sinopalnikov, David Wein- berger, Dimple Vijaykumar, Dominika Rogozinska, ́ Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Elty- shev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Pluci ́nska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svens- son, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fer- nandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju yeong Ji, Kareem Mo- hamed, Kartikeya Badola, Kat Black, Katie Mil- lican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjoesund, Lau- ren Usui, Laurent Sifre, Lena Heuermann, Leti- cia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel Reid, Manvinder Singh, Mark Iverson, Mar- tin Görner, Mat Velloso, Mateo Wirth, Matt Davi- dow, Matt Miller, Matthew Rahtz, Matthew Wat- son, Meg Risdal, Mehran Kazemi, Michael Moyni- han, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nen- shad Bardoliwalla, Nesh Devanathan, Neta Dumai,

Nilay Chauhan, Oscar Wahltinez, Pankil Botarda,
Parker Barnes, Paul Barham, Paul Michel, Peng-
chong Jin, Petko Georgiev, Phil Culliton, Pradeep
Kuppala, Ramona Comanescu, Ramona Merhej,
Reena Jana, Reza Ardeshir Rokni, Rishabh Agar-
wal, Ryan Mullins, Samaneh Saadat, Sara Mc Carthy,
Sarah Perrin, Sébastien M. R. Arnold, Sebastian
Krause, Shengyang Dai, Shruti Garg, Shruti Sheth,
Sue Ronstrom, Susan Chan, Timothy Jordan, Ting
Yu, Tom Eccles, Tom Hennigan, Tomas Kocisky,
Tulsee Doshi, Vihan Jain, Vikas Yadav, Vilobh
Meshram, Vishal Dharmadhikari, Warren Barkley,
Wei Wei, Wenming Ye, Woohyun Han, Woosuk
Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan
Wei, Victor Cotruta, Phoebe Kirk, Anand Rao,
Minh Giang, Ludovic Peran, Tris Warkentin, Eli
Collins, Joelle Barral, Zoubin Ghahramani, Raia
Hadsell, D. Sculley, Jeanine Banks, Anca Dragan,
Slav Petrov, Oriol Vinyals, Jeff Dean, Demis Hass-
abis, Koray Kavukcuoglu, Clement Farabet, Elena
Buchatskaya, Sebastian Borgeaud, Noah Fiedel, Ar-
mand Joulin, Kathleen Kenealy, Robert Dadashi,
and Alek Andreev. 2024b. Gemma 2: Improving
open language models at a practical size.Preprint,
arXiv:2408.00118.
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack
Lindsey, Trenton Bricken, Brian Chen, Adam Pearce,
Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy
Cunningham, Nicholas L Turner, Callum McDougall,
Monte MacDiarmid, C. Daniel Freeman, Theodore R.
Sumers, Edward Rees, Joshua Batson, Adam Jermyn,
Shan Carter, Chris Olah, and Tom Henighan. 2024.
Scaling monosemanticity: Extracting interpretable
features from claude 3 sonnet.Transformer Circuits
Thread.
Ethan Waisberg, Joshua Ong, Mouayad Masalkhi,
Sharif Amit Kamran, Nasif Zaman, Prithul Sarker,
Andrew G Lee, and Alireza Tavakkoli. 2023. Gpt-
4 and ophthalmology operative notes. Annals of
Biomedical Engineering, 51(11):2353–2355.
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu,
Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan
Welker, Ayzaan Wahid, Quan Vuong, Vincent Van-
houcke, Huong Tran, Radu Soricut, Anikait Singh,
Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi,
Grecia Salazar, Michael S. Ryoo, Krista Reymann,
Kanishka Rao, Karl Pertsch, Igor Mordatch, Hen-
ryk Michalewski, Yao Lu, Sergey Levine, Lisa Lee,
Tsang-Wei Edward Lee, Isabel Leal, Yuheng Kuang,
Dmitry Kalashnikov, Ryan Julian, Nikhil J. Joshi,
Alex Irpan, Brian Ichter, Jasmine Hsu, Alexander
Herzog, Karol Hausman, Keerthana Gopalakrish-
nan, Chuyuan Fu, Pete Florence, Chelsea Finn, Ku-
mar Avinava Dubey, Danny Driess, Tianli Ding,
Krzysztof Marcin Choromanski, Xi Chen, Yevgen
Chebotar, Justice Carbajal, Noah Brown, Anthony
Brohan, Montserrat Gonzalez Arenas, and Kehang
Han. 2023. Rt-2: Vision-language-action models
transfer web knowledge to robotic control. InPro-
ceedings of The 7th Conference on Robot Learning,
volume 229 ofProceedings of Machine Learning
Research, pages 2165–2183. PMLR.