DSpace@MIT

Towards More Interpretable AI With Sparse Autoencoders

Author(s)
Engels, Joshua
Download: Thesis PDF (6.973 MB)
Advisor
Tegmark, Max
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
While large language models demonstrate remarkable capabilities across diverse domains, the specific representations and algorithms they learn remain largely unknown. The quest to understand these mechanisms holds dual significance: scientifically, it represents a fundamental inquiry into the principles underlying intelligence, while practically, and with growing urgency, it is vital for mitigating risks from these same increasingly powerful systems. The first part of this thesis tackles the challenge of interpreting internal language model representations (features) by employing sparse autoencoders (SAEs). An SAE decomposes neural network hidden states into a potentially more interpretable basis. In Chapter 2, we introduce an unsupervised, SAE-based methodology that successfully identifies inherently multi-dimensional features. Notably, we establish that language models causally represent concepts such as days of the week and months of the year using circular structures. This work provided the first definitive evidence of causal, multi-dimensional features, thereby refuting the one-dimensional linear representation hypothesis. Chapter 3 further assesses whether SAEs identify “true” atomic language model features. We compare the generalization performance and data efficiency of linear probes trained on SAE latents against those trained on the original hidden state basis. The negative results of these experiments suggest that SAEs are limited in capturing the true ontology of language models.

Motivated by these limitations, the second part of this thesis investigates sparse autoencoders themselves, exploring potential improvements and characterizing their failure modes. Chapter 4 examines the portion of activations not reconstructed by SAEs, which we term “Dark Matter.” We find that a significant fraction of this dark matter is linearly predictable, and furthermore, that the specific tokens poorly reconstructed by SAEs remain largely consistent across SAE sizes and sparsities. This suggests that SAEs may systematically fail to capture certain input subspaces, which we hypothesize to contain inherently dense features. Finally, Chapter 5 investigates a method to enhance SAE utility: freezing the learned SAE parameters and finetuning the surrounding language model components to minimize the KL divergence from the original model’s output distribution. This technique results in a 30% to 55% decrease in the cross-entropy loss gap incurred by inserting the SAE into the model.
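To make the core technique concrete, below is a minimal sketch of a sparse autoencoder applied to language-model hidden states. It is an illustration of the general approach the abstract describes, not the thesis's exact architecture: the ReLU encoder with an L1 sparsity penalty, the dimensions, and all names (d_model, d_latent, sae_loss) are assumptions introduced here for clarity. The residual x - recon is the analogue of the unreconstructed "dark matter" studied in Chapter 4.

```python
# Minimal sparse autoencoder (SAE) sketch in PyTorch (illustrative only).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        # Overcomplete dictionary: d_latent is much larger than d_model.
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        # Encode hidden states into a sparse, non-negative latent basis.
        latents = torch.relu(self.encoder(x))
        # Reconstruct the original hidden state from the sparse code.
        recon = self.decoder(latents)
        return recon, latents


def sae_loss(x, recon, latents, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse latents.
    mse = (recon - x).pow(2).mean()
    sparsity = latents.abs().mean()
    return mse + l1_coeff * sparsity


# Usage: decompose a batch of hidden states and inspect what the SAE misses.
sae = SparseAutoencoder(d_model=768, d_latent=16384)
x = torch.randn(32, 768)       # stand-in for language-model hidden states
recon, latents = sae(x)
residual = x - recon           # unreconstructed portion of the activation
loss = sae_loss(x, recon, latents)
```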
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/163714
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
