Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

AI-generated keywords: Large Language Models Transformer models Efficient Architectures Scalability Resource-aware

AI-generated Key Points

  • Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have extended the capabilities of multimodal models.
  • The foundation of modern LLMs lies in Transformer models with excellent scaling properties but substantial computational requirements.
  • Researchers are exploring innovative architectural designs and optimization strategies to address challenges and unlock the full potential of LLMs.
  • Efficient architectures for LLMs include Linear Sequence Modeling, Sparse Sequence Modeling, Efficient Full Attention, Sparse Mixture of Experts, Hybrid Architectures, and Diffusion LLMs.
  • These architectural principles can be adapted beyond language tasks to other modalities like vision, audio, and multi-modality applications.
  • By combining different components and leveraging techniques like sparse modeling and mixture-of-experts approaches, researchers aim to develop scalable and resource-aware foundation models for AI systems.

Authors: Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng

Survey, 82 pages, GitHub: https://github.com/weigao266/Awesome-Efficient-Arch
License: CC BY 4.0

Abstract: Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional Transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of Transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope it helps motivate future research toward more efficient, versatile AI systems.

Submitted to arXiv on 13 Aug. 2025


Results of the summarizing process for the arXiv paper: 2508.09834v1

Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have extended the capabilities of multimodal models. The foundation of modern LLMs is the Transformer, which offers excellent scaling properties but has substantial computational requirements that hinder large-scale training and practical deployment. To address these challenges, researchers have been exploring new architectural designs and optimization strategies. This survey groups efficient LLM architectures into six categories: Linear Sequence Modeling, which reformulates the attention mechanism to reduce the quadratic complexity of self-attention to linear complexity; Sparse Sequence Modeling, which restricts attention to a subset of token interactions to reduce compute and memory requirements; Efficient Full Attention, which makes standard softmax attention more efficient while retaining its theoretical quadratic complexity; Sparse Mixture of Experts, which introduces conditional computation to increase model capacity without a proportional increase in computational cost; Hybrid Architectures, which combine linear sequence modeling with traditional full attention layers; and Diffusion LLMs, which explore non-autoregressive diffusion models for language generation. The survey also highlights that these architectural principles carry over from language to other modalities such as vision, audio, and multimodal applications. By combining these components and techniques, including sparse modeling and mixture-of-experts approaches, researchers aim to develop foundation models that are scalable, resource-aware, and versatile across AI systems.
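The conditional computation behind Sparse Mixture of Experts can be illustrated with a minimal sketch. This is not the survey's implementation: the function name `top_k_moe`, the use of plain linear maps as experts, and the choice to apply softmax over only the selected experts' logits are all assumptions made for illustration. The point it shows is that each token activates only k of the E experts, so per-token compute grows with k rather than with the total parameter count.

```python
import numpy as np

def top_k_moe(x, gate_w, expert_ws, k=2):
    """Route one token through the k highest-scoring experts.

    x:         (d,) token representation
    gate_w:    (d, E) router weights (hypothetical, for illustration)
    expert_ws: list of E (d, d) matrices; real experts are usually
               feed-forward networks, linear maps keep the sketch short
    """
    logits = x @ gate_w                        # one routing score per expert
    top = np.argsort(logits)[-k:]              # indices of the k largest scores
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    # Only the k selected experts are evaluated; the rest are skipped entirely.
    out = sum(w * (x @ expert_ws[e]) for w, e in zip(weights, top))
    return out, top

rng = np.random.default_rng(0)
d, num_experts = 8, 4
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, num_experts))
expert_ws = [rng.standard_normal((d, d)) for _ in range(num_experts)]
y, chosen = top_k_moe(x, gate_w, expert_ws, k=2)
```

Here `expert_ws` holds four experts' worth of parameters, but only two matrix products are computed for the token, which is the sense in which capacity grows without a proportional increase in computation.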
The core architecture behind recent breakthroughs in LLMs is the Transformer, whose self-attention mechanism captures long-range dependencies effectively. However, the cost of self-attention grows quadratically with sequence length, which makes long-context inputs computationally expensive. By systematically examining and categorizing efficient architectures, the survey presents a blueprint of modern efficient LLM architectures and considers implications for future research toward more efficient, versatile AI systems.
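To make the contrast between quadratic and linear complexity concrete, the following sketch compares two ways of computing a kernelized (non-softmax) attention. It is an illustrative assumption rather than a method taken from the survey: the `elu(x) + 1` feature map, the non-causal setting, and all function names are choices made here. The quadratic form builds the full n × n score matrix, while the linear form uses associativity of matrix multiplication to compute φ(K)ᵀV first, so cost scales with n·d² instead of n²·d.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a positive feature map (an assumption for this sketch)
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_kernel_attention(Q, K, V):
    # Materializes the n x n score matrix: O(n^2 d) time, O(n^2) memory.
    scores = feature_map(Q) @ feature_map(K).T
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_kernel_attention(Q, K, V):
    # Same output via associativity: phi(Q) @ (phi(K).T @ V), O(n d^2) time.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V                 # (d, d_v) summary of all keys and values
    z = phi_k.sum(axis=0)            # (d,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out_quadratic = quadratic_kernel_attention(Q, K, V)
out_linear = linear_kernel_attention(Q, K, V)
max_diff = np.abs(out_quadratic - out_linear).max()
```

The two functions agree up to floating-point error, but only the first one ever allocates an n × n array. Note that this equivalence holds for the kernelized form; it does not reproduce standard softmax attention.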
Created on 22 Aug. 2025