Not All Language Model Features Are Linear

AI-generated keywords: Linear Representation Hypothesis, Multi-dimensional Features, Sparse Autoencoders, Language Models, Comprehension

AI-generated Key Points

  • Linear representation hypothesis proposed in recent research
  • Possibility of some language model representations being multi-dimensional
  • Definition of irreducible multi-dimensional features based on whether they can be decomposed into independent or non-co-occurring lower-dimensional features
  • Development of a scalable method using sparse autoencoders to identify multi-dimensional features in GPT-2 and Mistral 7B
  • Discovery of circular features representing days of the week and months of the year
  • Utilization of circular features in computational problems involving modular arithmetic related to days and months
  • Evidence from intervention experiments on Mistral 7B and Llama 3 8B showing circular features as fundamental units of computation
  • Application of existing feature extraction methodologies like sparse autoencoders to uncover complex multi-dimensional representations within language models
  • Aim to deepen understanding of representations to unveil underlying algorithms and transform circuits into verifiable programs
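
To make the circular-feature idea in the key points concrete, here is a minimal sketch of how a circle can support modular arithmetic over days of the week. It is a toy illustration, not the representation extracted in the paper: the helper names (`embed`, `rotate`, `decode`, `add_days`) and the choice of an exact unit circle are assumptions made for this example.

```python
import numpy as np

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def embed(index: int, period: int) -> np.ndarray:
    """Place item `index` of a cycle with `period` items on the unit circle."""
    angle = 2 * np.pi * index / period
    return np.array([np.cos(angle), np.sin(angle)])


def rotate(point: np.ndarray, steps: int, period: int) -> np.ndarray:
    """Advance a point by `steps` positions around the cycle."""
    angle = 2 * np.pi * steps / period
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ point


def decode(point: np.ndarray, period: int) -> int:
    """Read back the nearest cycle position from a 2D point."""
    angle = np.arctan2(point[1], point[0])
    return int(np.round(angle * period / (2 * np.pi))) % period


def add_days(day: str, offset: int) -> str:
    """Toy modular addition: rotate the day's point, then decode it."""
    start = embed(DAYS.index(day), len(DAYS))
    moved = rotate(start, offset, len(DAYS))
    return DAYS[decode(moved, len(DAYS))]


print(add_days("Monday", 4))    # Friday
print(add_days("Saturday", 3))  # Tuesday
```

Because addition becomes a rotation, wrap-around (Saturday plus three days landing on Tuesday) needs no special handling, which is what makes a two-dimensional circle a natural fit for this kind of task.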

Authors: Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark

Code and data at https://github.com/JoshEngels/MultiDimensionalFeatures
License: CC BY 4.0

Abstract: Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Finally, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we find further circular representations by breaking down the hidden states for these tasks into interpretable components.
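
The intervention experiments mentioned in the abstract can be pictured, very roughly, as overwriting the part of a hidden state that lies in the circular subspace while leaving the rest untouched. The sketch below is only an illustration under that assumption: the abstract does not describe the actual procedure, and the names `patch_subspace`, `circle_coords`, and `make_example`, as well as the synthetic 16-dimensional hidden state, are hypothetical.

```python
import numpy as np


def circle_coords(index: int, period: int) -> np.ndarray:
    """2D coordinates of position `index` on a cycle of length `period`."""
    angle = 2 * np.pi * index / period
    return np.array([np.cos(angle), np.sin(angle)])


def patch_subspace(hidden: np.ndarray, basis: np.ndarray,
                   new_coords: np.ndarray) -> np.ndarray:
    """Replace the component of `hidden` inside span(basis) with `new_coords`.

    `basis` has shape (k, d) with orthonormal rows; everything orthogonal
    to the subspace is left unchanged.
    """
    old_coords = basis @ hidden
    return hidden + basis.T @ (new_coords - old_coords)


def make_example(seed: int = 0, dim: int = 16):
    """Build a synthetic hidden state whose in-plane part encodes position 0."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    basis = q[:, :2].T                           # assumed circular plane
    other = q[:, 2:] @ rng.normal(size=dim - 2)  # unrelated content
    hidden = basis.T @ circle_coords(0, 7) + other
    return hidden, basis


hidden, basis = make_example()
patched = patch_subspace(hidden, basis, circle_coords(3, 7))
print(np.round(basis @ patched, 3))  # in-plane coordinates now encode position 3
```

In a real experiment the patched state would be fed back into the model to see whether its output shifts accordingly; that step is omitted here because it depends on the model and tooling used.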

Submitted to arXiv on 23 May 2024

Results of the summarizing process for the arXiv paper: 2405.14860v1

Recent work has proposed the linear representation hypothesis: language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. This study instead asks whether some language model representations are inherently multi-dimensional. The researchers give a rigorous definition of irreducible multi-dimensional features, based on whether a feature can be decomposed into independent or non-co-occurring lower-dimensional features. Motivated by this definition, they develop a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B.

The discovered features include highly interpretable examples, such as circular features representing days of the week and months of the year. The researchers identify tasks in which these circles are used to solve modular arithmetic problems over days and months. Intervention experiments on Mistral 7B and Llama 3 8B provide evidence that the circular features are the fundamental unit of computation in these tasks.

The study also discusses how existing feature extraction methods such as sparse autoencoders can uncover multi-dimensional representations within language models. By deepening the understanding of these representations, the researchers aim to uncover the algorithms that use them and, ultimately, to turn the intricate circuits of future, more capable models into formally verifiable programs.

The work acknowledges contributions from individuals and funding sources including Erik Otto, Jaan Tallinn, the Rothberg Family Fund for Cognitive Science, the NSF Graduate Research Fellowship (Grant No. 2141064), and IAIFI through NSF grant PHY-2019786. The researchers state that their work aims solely to advance the understanding of language model representations, and that they do not anticipate adverse impacts from their findings.
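
As a rough picture of how sparse autoencoder features might be grouped into candidate multi-dimensional features, the sketch below clusters dictionary (decoder) vectors by cosine similarity and checks the dimension each cluster spans. This is an assumption-laden toy: the summary does not specify the paper's clustering procedure, and the threshold-based grouping, the function names, and the synthetic decoder built by `build_toy_decoder` are all hypothetical.

```python
import numpy as np


def cluster_by_similarity(decoder: np.ndarray, threshold: float) -> np.ndarray:
    """Group rows of `decoder` connected by cosine similarity > threshold."""
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    sim = unit @ unit.T
    labels = -np.ones(len(decoder), dtype=int)
    current = 0
    for seed in range(len(decoder)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:  # depth-first walk over the similarity graph
            i = stack.pop()
            for j in np.flatnonzero(sim[i] > threshold):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def build_toy_decoder(seed: int = 0, dim: int = 64) -> np.ndarray:
    """Synthetic dictionary: 7 vectors on a circle in one plane, 20 others."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    plane = q[:, :2].T
    angles = 2 * np.pi * np.arange(7) / 7
    circle = np.outer(np.cos(angles), plane[0]) + np.outer(np.sin(angles), plane[1])
    others = q[:, 2:22].T  # mutually orthogonal, unrelated features
    return np.vstack([circle, others])


decoder = build_toy_decoder()
labels = cluster_by_similarity(decoder, threshold=0.5)
members = decoder[labels == labels[0]]
print(len(members), np.linalg.matrix_rank(members, tol=1e-8))
```

In this toy setup the seven circle vectors end up in one cluster that spans only two dimensions, while each unrelated vector forms its own cluster. On a real dictionary, a cluster like this would only be a candidate; one would still need to inspect the activations it reconstructs to see whether they actually form a circle.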
Created on 11 May 2026
