What Matters in Transformers? Not All Attention is Needed

AI-generated keywords: Transformer-based large language models, scalability, redundancy, pruning, model efficiency

AI-generated Key Points

  • Study conducted by Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li from the University of Maryland, College Park
  • Focus on Transformer-based large language models (LLMs) and scalability
  • Aims to explore how redundancy varies across Blocks, MLP layers, and Attention layers, and proposes methods for improving model efficiency
  • Uses a similarity-based metric that treats a structure as redundant when its outputs are highly similar to its inputs
  • Surprising finding: a large proportion of attention layers exhibit excessively high similarity and can be pruned without degrading performance
  • Insights provide guidance for designing more compact yet efficient network architectures
  • Code available at https://github.com/Shwai-He/LLM-Drop for further exploration and implementation

Authors: Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

15 pages, 13 figures, 6 tables
License: CC BY-NC-SA 4.0

Abstract: Scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks. However, this scaling also introduces redundant structures, posing challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different structures, such as MLP and Attention layers, is under-explored. In this work, we investigate the varying redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. This metric operates on the premise that redundant structures produce outputs highly similar to their inputs. Surprisingly, while attention layers are essential for transformers and distinguish them from other mainstream architectures, we found that a large proportion of attention layers exhibit excessively high similarity and can be safely pruned without degrading performance, leading to reduced memory and computation costs. Additionally, we further propose a method that jointly drops Attention and MLP layers, achieving improved performance and dropping ratios. Extensive experiments demonstrate the effectiveness of our methods, e.g., Llama-3-70B maintains comparable performance even after pruning half of the attention layers. Our findings provide valuable insights for future network architecture design. The code will be released at: https://github.com/Shwai-He/LLM-Drop.
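
The premise stated in the abstract, that a redundant structure produces outputs highly similar to its inputs, can be illustrated with a small sketch. This is not the authors' implementation: the use of cosine similarity averaged over tokens, the function names `cosine_similarity` and `redundancy_score`, and the toy hidden states are all assumptions made for illustration.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def redundancy_score(inputs, outputs):
    """Mean input/output cosine similarity over tokens (hypothetical helper).

    `inputs` and `outputs` hold one hidden-state vector per token, taken
    before and after the module under study. A score near 1 means the
    module barely changes its input, which is the redundancy signal.
    """
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return sum(sims) / len(sims)


# Toy hidden states (made up, two tokens with two dimensions each).
inputs = [[1.0, 0.0], [0.0, 2.0]]
near_identity = [[1.0, 0.1], [0.1, 2.0]]  # module output almost equals its input
transforming = [[0.0, 1.0], [2.0, 0.0]]   # module output is orthogonal to its input

score_redundant = redundancy_score(inputs, near_identity)
score_useful = redundancy_score(inputs, transforming)
```

Under this sketch, the near-identity module scores close to 1 and would be a pruning candidate, while the module that rotates its input scores 0 and would be kept.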

Submitted to arXiv on 22 Jun. 2024

Results of the summarizing process for the arXiv paper: 2406.15786v1

In the study "What Matters in Transformers? Not All Attention is Needed," Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li of the University of Maryland, College Park examine the redundant structures that scaling introduces into Transformer-based large language models (LLMs) and that complicate real-world deployment. The study explores how redundancy varies across Blocks, MLP layers, and Attention layers, and proposes pruning attention layers as well as jointly dropping Attention and MLP layers to improve efficiency. Redundant structures are identified with a similarity-based metric, on the premise that a redundant structure produces outputs highly similar to its inputs. The surprising finding is that a large proportion of attention layers exhibit excessively high similarity and can be pruned without degrading performance; for example, Llama-3-70B maintains comparable performance after half of its attention layers are pruned. These insights offer guidance for designing more compact yet efficient network architectures. The code is to be released at https://github.com/Shwai-He/LLM-Drop.
Created on 16 Nov. 2024
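
The joint layer dropping mentioned in the summary can be pictured as ranking Attention and MLP layers together by their similarity scores and removing the most redundant ones. The sketch below is only one plausible reading under that assumption, not the paper's procedure: the helper `select_layers_to_drop`, the layer names, and every score are hypothetical toy values, not results from the paper.

```python
def select_layers_to_drop(scores, num_to_drop):
    """Return the names of the `num_to_drop` layers with the highest scores.

    `scores` maps a layer name to its input/output similarity; a higher
    value means the layer changes its input less (hypothetical helper).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_to_drop]


# Made-up similarity scores for a three-block toy model.
attention_scores = {"attn.0": 0.62, "attn.1": 0.97, "attn.2": 0.99}
mlp_scores = {"mlp.0": 0.55, "mlp.1": 0.80, "mlp.2": 0.98}

# Joint dropping: Attention and MLP layers compete in a single ranking.
joint_scores = {**attention_scores, **mlp_scores}
dropped = select_layers_to_drop(joint_scores, 3)
```

With these toy numbers, `dropped` contains two attention layers and one MLP layer, showing how a shared ranking lets the drop budget fall on whichever module type is more redundant.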
