Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
AI-generated Key Points
- Attention-based architectures are ubiquitous in machine learning, but the reasons for their effectiveness remain poorly understood.
- The authors propose a new way to understand self-attention networks: the output is decomposed into a sum of smaller terms, each involving a sequence of attention heads across layers.
- Self-attention has a strong inductive bias towards "token uniformity": without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially with depth to a rank-1 matrix.
- Skip connections and MLPs prevent output degeneration.
- Experiments were conducted on different variants of standard transformer architectures to verify identified convergence phenomena and study the effects of path length on performance in three tasks: memorization, sorting, and convex hull.
- Short paths carry predictive power with accuracy above 0.8, 0.6, and 0.65 in the respective tasks while longer paths do not perform much better than random guessing.
- Length zero paths contain no useful information about the task.
- The models used for the experiments had varying depths (L), number of heads (H), and hidden dimensions (d).
Authors: Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
Abstract: Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
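The convergence claim in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' experimental setup: it stacks randomly initialized single-head self-attention layers and tracks how far the token matrix is from having identical rows (a rank-1 matrix of the form "all tokens equal"). The function names, the random Gaussian weights, the sizes `n` and `d`, and the residual measure (Frobenius distance to the mean row, relative to the norm of the matrix) are all assumptions made for illustration; the paper's own definition of the residual may differ.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_layer(X, Wq, Wk, Wv):
    # Single-head self-attention with no skip connection and no MLP.
    d = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    return softmax(scores, axis=-1) @ (X @ Wv)


def relative_residual(X):
    # Assumed measure of "token uniformity": distance of X from the matrix
    # whose rows all equal the mean row, relative to the norm of X.
    res = X - X.mean(axis=0, keepdims=True)
    return np.linalg.norm(res) / np.linalg.norm(X)


def run(depth=6, n=16, d=32, skip=False, seed=0):
    # Returns the relative residual of the input and after every layer.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    history = [relative_residual(X)]
    for _ in range(depth):
        # Hypothetical random weights, scaled so activations keep a similar size.
        Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        Y = attention_layer(X, Wq, Wk, Wv)
        X = X + Y if skip else Y
        history.append(relative_residual(X))
    return history


if __name__ == "__main__":
    for name, skip in [("pure attention", False), ("with skip connections", True)]:
        print(name, ["%.2e" % r for r in run(skip=skip)])
```

Under these assumptions, the residual of the pure-attention stack should fall to roughly floating-point precision within a few layers, while the variant with skip connections should keep a residual of order one, in line with the abstract's statement that skip connections stop the output from degenerating.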