Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

AI-generated keywords: Large Language Models (LLMs), Sparse Autoencoders (SAEs), Jacobian SAEs (JSAEs), Computational Sparsity, Interpretability

AI-generated Key Points

- The researchers aimed to understand the computations performed by LLMs, not just their representations.
- They introduced Jacobian Sparse Autoencoders (JSAEs), which induce sparsity in the input and output activations of a model component and in the computational connections (the Jacobian) between them.
- They devised an efficient method for computing Jacobians in LLMs; JSAEs reach considerable computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs.
- JSAEs identified semantically meaningful computational units within LLMs, such as an output latent corresponding to "this text is in German".
- Pre-trained LLMs exhibit significantly sparser computational connections than randomly initialized transformers, indicating that computational sparsity is learned during training.
- JSAEs may be better suited than standard SAEs to understanding learned transformer computations.


Authors: Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison

arXiv: 2502.18147v2 (cs.LG)

License: CC BY 4.0

Abstract: Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of LLMs. However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to "sparsify" computations in any sense, only latent activations. To solve this, we propose Jacobian SAEs (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. One key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that Jacobians are a reasonable proxy for computational sparsity because MLPs are approximately linear when rewritten in the JSAE basis. Lastly, we show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.
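
The objective described in the abstract can be sketched in symbols as follows. The notation is an assumption introduced here for illustration and does not come from this page: f is the model component (for example an MLP), (e_x, d_x) and (e_y, d_y) are the encoder and decoder of the SAEs on its input x and output y, and \lambda is a hypothetical weight on the Jacobian term. The exact norm, normalization, and weighting used by the authors may differ.

```latex
\begin{align*}
s_x &= e_x(x), \qquad \hat{x} = d_x(s_x), \\
s_y &= e_y(y), \qquad \hat{y} = d_y(s_y), \qquad y = f(x), \\
f_s &= e_y \circ f \circ d_x, \qquad J = \frac{\partial f_s(s_x)}{\partial s_x}, \\
\mathcal{L} &= \lVert x - \hat{x} \rVert_2^2 + \lVert y - \hat{y} \rVert_2^2
              + \lambda \sum_{i,j} \lvert J_{ij} \rvert .
\end{align*}
```

Under this reading, the first two terms are the usual SAE reconstruction errors, and the third pushes most entries of J toward zero so that each output latent depends on only a few input latents.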

Submitted to arXiv on 25 Feb. 2025


Results of the summarizing process for the arXiv paper: 2502.18147v2


In the study "Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations," researchers aimed to enhance the understanding of the computations performed by Large Language Models (LLMs), beyond just their representations. While Sparse Autoencoders (SAEs) have been effective in uncovering sparse and interpretable latent activations of LLMs, they do not specifically target sparsity in computations. To address this limitation, the researchers introduced Jacobian Sparse Autoencoders (JSAEs), which induce sparsity not only in input and output activations but also in the computational connections (Jacobians) between them. One significant technical contribution of this research was devising an efficient method for computing Jacobians in LLMs, as a naive implementation would be computationally prohibitive due to the size of the Jacobians. The results demonstrated that JSAEs can achieve a considerable degree of computational sparsity while preserving downstream LLM performance approximately as well as conventional SAEs. Moreover, the study showed that Jacobians serve as a reasonable proxy for computational sparsity, because Multilayer Perceptrons (MLPs) are approximately linear when expressed in the JSAE basis. By analyzing "max-activating" examples of JSAEs, the researchers verified that these models can identify semantically meaningful computational units within LLMs. For instance, specific output SAE latents were found to correspond to concepts such as "this text is in German," computed based on input latents representing tokens common in German text or related to historical events like the Third Reich. Furthermore, comparisons with randomly initialized transformers revealed that pre-trained LLMs exhibit significantly sparser Jacobians, indicating that computational sparsity is a property learned during training. This contrasts with previous findings showing similar interpretability scores for SAEs on random and pre-trained transformers.
The study also provided insights into how JSAEs extract information about complex learned computations and highlighted their potential for understanding transformer operations better than standard SAEs. In conclusion, "Jacobian Sparse Autoencoders" presents a novel approach to enhancing interpretability and understanding of computations within LLMs through sparsity-inducing techniques like JSAEs. The findings underscore the importance of considering not just representations but also the underlying computations in advancing our comprehension of deep learning models.
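
To make the "efficient Jacobian" idea in the summary above concrete, here is a minimal sketch in plain Python. It rests on assumptions that this page does not state: the component is a two-layer MLP with an exact GELU nonlinearity, only a handful of input and output latents are active, and the output latents are taken before any activation function. All names (`dec_x`, `enc_y`, `active_jacobian`, and so on) are hypothetical. Under those assumptions the Jacobian between active latents has a closed form, (enc_y · w2) · diag(gelu'(z)) · (w1 · dec_x), so only a small block ever needs to be formed.

```python
import math
import random


def gelu(z):
    # Exact GELU: z * Phi(z), where Phi is the standard normal CDF.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gelu_grad(z):
    # d/dz [z * Phi(z)] = Phi(z) + z * phi(z), where phi is the standard normal PDF.
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def mlp_preactivation(s_active, dec_x, w1, b1):
    # Decode the active input latents, then apply the first MLP layer.
    x = matvec(dec_x, s_active)
    return [p + b for p, b in zip(matvec(w1, x), b1)]


def latent_map(s_active, dec_x, w1, b1, w2, enc_y):
    # Active input latents -> MLP output -> active output latents (pre-activation).
    z = mlp_preactivation(s_active, dec_x, w1, b1)
    y = matvec(w2, [gelu(v) for v in z])
    return matvec(enc_y, y)


def active_jacobian(s_active, dec_x, w1, b1, w2, enc_y):
    # Closed form: (enc_y @ w2) @ diag(gelu'(z)) @ (w1 @ dec_x).
    z = mlp_preactivation(s_active, dec_x, w1, b1)
    left = matmul(enc_y, w2)    # shape: k_out x d_hidden
    right = matmul(w1, dec_x)   # shape: d_hidden x k_in
    scaled = [[gelu_grad(zh) * r for r in row] for zh, row in zip(z, right)]
    return matmul(left, scaled)  # shape: k_out x k_in


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_model, d_hidden, k_in, k_out = 4, 6, 3, 2
dec_x = random_matrix(rng, d_model, k_in)    # decoder columns of active input latents
w1 = random_matrix(rng, d_hidden, d_model)
b1 = [rng.gauss(0.0, 0.1) for _ in range(d_hidden)]
w2 = random_matrix(rng, d_model, d_hidden)
enc_y = random_matrix(rng, k_out, d_model)   # encoder rows of active output latents
s_active = [0.7, 1.3, 0.2]

jac = active_jacobian(s_active, dec_x, w1, b1, w2, enc_y)
for row in jac:
    print([round(v, 4) for v in row])
```

The point of the design is that the cost depends on the number of active latents (k_in and k_out) rather than on the full SAE dictionary sizes; biases after the nonlinearity are omitted because they do not affect the Jacobian.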

Summary
- Researchers wanted to learn more about how language models compute their answers, not just what information they hold inside.
- They made a new tool called Jacobian Sparse Autoencoders (JSAEs) that describes part of a model using only a few active features and only a few connections between them.
- This helped them find meaningful steps in the model's computation, such as recognizing that a text is written in German.
- The new tool preserves the model's performance about as well as older tools of this kind (sparse autoencoders) do.
- JSAEs can help us understand what language models learn during training.

Definitions
1. Researchers: People who study and learn new things.
2. Computation: How a computer processes information or performs tasks.
3. LLMs (Large Language Models): Advanced computer programs that understand and generate human language.
4. Sparsity: Having only a few important connections or parts active while others are inactive.
5. Transformers: A type of deep learning model used for various tasks like language translation or text generation.

Deep learning has revolutionized the field of artificial intelligence, enabling machines to learn and perform complex tasks that were previously thought to be impossible. One area where deep learning has made significant advancements is natural language processing (NLP), with the development of large language models (LLMs) such as BERT and GPT-3. These models have achieved impressive results in various NLP tasks, but their inner workings are still not fully understood.
In a recent study titled "Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations," researchers aimed to enhance our understanding of the computations performed by LLMs, beyond just their representations. While Sparse Autoencoders (SAEs) have been successful in uncovering sparse and interpretable latent activations, they do not specifically target sparsity in computations. This limitation led the researchers to introduce Jacobian Sparse Autoencoders (JSAEs), which induce sparsity not only in input and output activations but also in the computational connections between them.
One significant technical contribution of this research was devising an efficient method for computing Jacobians in LLMs. A naive implementation would be computationally prohibitive because of the size of the Jacobians, so the researchers developed a more efficient way to compute them. The results showed that JSAEs can achieve a considerable degree of computational sparsity while preserving downstream LLM performance approximately as well as conventional SAEs.
Moreover, the study demonstrated that Jacobians serve as a reasonable proxy for computational sparsity, because Multilayer Perceptrons (MLPs) are approximately linear when rewritten in the JSAE basis. In addition, JSAEs achieved a greater degree of computational sparsity on pre-trained LLMs than on equivalent randomly initialized transformers. This finding suggests that computational sparsity is learned during training rather than being present by chance.
To further validate their approach, the researchers analyzed "max-activating" examples of JSAEs and found that these models can identify semantically meaningful computational units within LLMs. For example, specific output SAE latents were found to correspond to concepts such as "this text is in German," computed based on input latents representing tokens common in German text or related to historical events like the Third Reich.
The study also provided insights into how JSAEs extract information about complex learned computations and highlighted their potential for understanding transformer operations better than standard SAEs. This matters because transformers are currently the state-of-the-art architecture for NLP tasks, and understanding their inner workings can lead to further improvements in the field.
In conclusion, "Jacobian Sparse Autoencoders" presents a novel approach to enhancing interpretability and understanding of computations within LLMs through sparsity-inducing techniques like JSAEs. The findings underscore the importance of considering not just representations but also the underlying computations in advancing our comprehension of deep learning models. With further research and development, JSAEs could play an important role in improving our understanding of LLMs and other deep learning models.
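
As a rough illustration of what "a greater degree of computational sparsity" could mean in practice, the sketch below compares two Jacobian blocks using two simple measures: the mean absolute entry (an L1-style penalty) and the share of entries whose magnitude exceeds a threshold. The function names, the threshold of 0.01, and both toy matrices are assumptions made up for this example; they are not results or definitions from the paper.

```python
def l1_penalty(jacobian):
    # Mean absolute value of the Jacobian entries; smaller means sparser on average.
    entries = [abs(v) for row in jacobian for v in row]
    return sum(entries) / len(entries)


def fraction_above(jacobian, threshold):
    # Share of entries whose magnitude exceeds the threshold.
    entries = [abs(v) for row in jacobian for v in row]
    return sum(1 for v in entries if v > threshold) / len(entries)


# Made-up 2 x 3 Jacobian blocks, for illustration only.
dense_jacobian = [[0.4, -0.3, 0.5], [0.2, 0.6, -0.4]]
sparse_jacobian = [[0.9, 0.0, 0.001], [0.0, -0.002, 0.7]]

for name, jac_block in [("dense", dense_jacobian), ("sparse", sparse_jacobian)]:
    print(name, l1_penalty(jac_block), fraction_above(jac_block, 0.01))
```

In the sparse toy block, each output latent depends appreciably on a single input latent, which is the kind of structure that makes a computation easier to read off.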

Created on 23 Jul. 2025


Similar papers summarized with our AI tools

- Sparse Autoencoders Can Interpret Randomly Initialized Transformers (cs.LG), 65.7%
- Sparse Autoencoders Trained on the Same Data Learn Different Features (cs.LG), 63.5%

