Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

AI-generated keywords: Large Language Models, Transformer models, Efficient Architectures, Scalability, Resource-aware

AI-generated Key Points

Large Language Models (LLMs) have delivered strong results in language understanding, generation, and reasoning, and have extended the capabilities of multimodal models.
The foundation of modern LLMs lies in Transformer models with excellent scaling properties but substantial computational requirements.
Researchers are exploring innovative architectural designs and optimization strategies to address challenges and unlock the full potential of LLMs.
Efficient architectures for LLMs include Linear Sequence Modeling, Sparse Sequence Modeling, Efficient Full Attention, Sparse Mixture of Experts, Hybrid Architectures, and Diffusion LLMs.
These architectural principles can be adapted beyond language tasks to other modalities like vision, audio, and multi-modality applications.
By combining different components and leveraging techniques like sparse modeling and mixture-of-experts approaches, researchers aim to develop scalable and resource-aware foundation models for AI systems.

Authors: Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng

arXiv: 2508.09834v1 (cs.CL)

Survey, 82 pages, GitHub: https://github.com/weigao266/Awesome-Efficient-Arch

License: CC BY 4.0

Abstract: Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.

Submitted to arXiv on 13 Aug. 2025

Comprehensive Summary

Large Language Models (LLMs) have delivered strong results in language understanding, generation, and reasoning, and have extended the capabilities of multimodal models. Modern LLMs are built on Transformer models, which scale very well but carry substantial computational requirements that hinder large-scale training and practical deployment. To address these challenges, researchers have been exploring new architectural designs and optimization strategies.

The survey groups efficient architectures for LLMs into six families: Linear Sequence Modeling, which reduces the quadratic complexity of self-attention to linear complexity by reformulating the attention mechanism; Sparse Sequence Modeling, which attends to only a subset of token interactions to cut compute and memory; Efficient Full Attention, which makes standard softmax attention cheaper in practice while keeping its theoretical quadratic complexity; Sparse Mixture of Experts, which uses conditional computation to increase model capacity without a proportional increase in compute; Hybrid Architectures, which combine linear sequence modeling with traditional full attention layers; and Diffusion LLMs, which explore non-autoregressive diffusion models for language generation.

The survey also highlights that these architectural principles carry over from language to other modalities such as vision, audio, and multimodal applications. By combining different components, including sparse modeling and mixture-of-experts approaches, researchers aim to develop foundation models that are scalable, resource-aware, and still versatile.

The core architecture behind recent breakthroughs in LLMs is the Transformer, whose self-attention mechanism captures long-range dependencies effectively. However, the cost of self-attention grows quadratically with sequence length, which makes long-context inputs expensive. By systematically examining and categorizing efficient architectures for LLMs, the survey presents a blueprint of modern efficient designs and aims to motivate future research toward more efficient, versatile AI systems.
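For reference, the quadratic cost mentioned above can be read off the standard definition of scaled dot-product attention. The symbols here (sequence length n, head dimension d) are notation assumed for this note and are not taken from the summary itself:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{n \times d}

% Q K^T is an n-by-n matrix, so forming it and applying it to V costs
\text{time} = O(n^{2} d), \qquad \text{score-matrix memory} = O(n^{2})
```

Doubling the context length therefore roughly quadruples the attention cost, which is the pressure the efficient architectures in this survey respond to.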

Layman's Summary

1. Big language models have changed how we understand, create, think, and use different types of language.
2. Modern big language models are built on Transformer models that scale well but need a lot of computer power.
3. Scientists are trying new designs and strategies to solve problems and make big language models even better.
4. Some efficient designs for big language models include Linear Sequence Modeling, Sparse Sequence Modeling, Efficient Full Attention, Sparse Mixture of Experts, Hybrid Architectures, and Diffusion LLMs.
5. These design ideas can be used not just for words but also for pictures, sounds, and combining different things together.

Definitions

- Language Models: Programs that help computers understand and generate human languages like English or Spanish.
- Transformers: A type of model in computer science that helps with tasks like understanding text or images by looking at relationships between different parts.
- Computational Requirements: The amount of computer power needed to run a program or model effectively.
- Architectural Designs: The way something is planned or structured to work efficiently.
- Multimodal: Involving multiple modes or ways of communicating information such as text, images, and sounds.

Introduction

Large Language Models (LLMs) have delivered strong results in language understanding, generation, and reasoning, and have extended the capabilities of multimodal models. They perform well on natural language processing tasks such as machine translation, question answering, and text summarization. Modern LLMs are built on Transformer models, which scale very well but carry substantial computational requirements that hinder large-scale training and practical deployment. To address these challenges, researchers have been exploring new architectural designs and optimization strategies. This survey covers efficient architectures for LLMs and groups them into six families: Linear Sequence Modeling, Sparse Sequence Modeling, Efficient Full Attention, Sparse Mixture of Experts, Hybrid Architectures, and Diffusion LLMs.

Linear Sequence Modeling

The first family discussed in this survey is Linear Sequence Modeling. This approach aims to reduce the quadratic complexity of self-attention to linear complexity by reformulating the attention mechanism. One example is Linformer (Wang et al., 2020), which relies on the observation that the attention matrix is approximately low-rank and projects the keys and values along the sequence dimension to a much shorter fixed length. A related approach is Reformer (Kitaev et al., 2020), which uses locality-sensitive hashing to group similar tokens into buckets and applies self-attention within each bucket instead of across all tokens; this brings the cost down to roughly O(n log n) rather than strictly linear.
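As a minimal sketch of what "reformulating the attention mechanism" can mean, the example below uses a kernel-style linear attention rather than Linformer's projection: softmax is replaced by a positive feature map (elu(x) + 1, an illustrative choice), after which the product can be regrouped so that keys and values are summarized once in a small d-by-d state. The function names and the non-causal setting are assumptions made for this illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def feature_map(x):
    """elu(x) + 1, which is always positive (illustrative choice of kernel)."""
    return [[v + 1.0 if v > 0 else math.exp(v) for v in row] for row in x]


def quadratic_form(q, k, v):
    """Build the full n x n similarity matrix, then normalize each row."""
    scores = matmul(feature_map(q), transpose(feature_map(k)))  # n x n
    out = matmul(scores, v)
    return [[o / sum(s_row) for o in o_row] for o_row, s_row in zip(out, scores)]


def linear_form(q, k, v):
    """Same result, but keys and values are summarized once in a d x d state."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)            # d x d, size independent of n
    k_sum = [sum(col) for col in zip(*fk)]   # length d
    out = matmul(fq, kv)
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[o / z for o in o_row] for o_row, z in zip(out, norms)]


random.seed(0)
n, d = 6, 4
q, k, v = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(n)] for _ in range(3))

a, b = quadratic_form(q, k, v), linear_form(q, k, v)
max_gap = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print("largest difference between the two forms:", max_gap)
```

Both functions compute the same output up to floating-point error; only the order of multiplication differs, and `linear_form` never materializes an n-by-n matrix.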

Sparse Sequence Modeling

Sparse Sequence Modeling selectively focuses on a subset of interactions to reduce computational and memory requirements. One prominent example is Sparse Transformer (Child et al., 2019), which uses factorized sparsity patterns in the attention matrix so that each token attends only to selected positions determined by their relative offsets. Another approach is Big Bird (Zaheer et al., 2020), which combines sliding-window, global, and random attention so that cost grows linearly with sequence length, improving scalability for long sequences.
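The sketch below illustrates the general idea of a position-based sparsity pattern: a local window plus a few global tokens. It is an illustrative pattern only, not the exact layout of any of the cited models, and `sparse_mask`, `window`, and `global_tokens` are hypothetical names chosen for this example.

```python
def sparse_mask(n, window, global_tokens=()):
    """Return a list where entry i is the set of key positions query i may attend to."""
    global_set = set(global_tokens)
    allowed = []
    for i in range(n):
        if i in global_set:
            # A global token attends to every position.
            allowed.append(set(range(n)))
        else:
            # Everyone else sees a local window plus the global tokens.
            local = set(range(max(0, i - window), min(n, i + window + 1)))
            allowed.append(local | global_set)
    return allowed


n = 16
mask = sparse_mask(n, window=2, global_tokens=(0,))
kept = sum(len(positions) for positions in mask)
print(f"{kept} of {n * n} query-key pairs are computed")
```

With a fixed window and a fixed number of global tokens, the number of kept pairs grows linearly with `n`, whereas dense attention grows with `n * n`.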

Efficient Full Attention

Efficient Full Attention enhances standard softmax attention's efficiency while retaining theoretical quadratic complexity. The two examples given here fit this definition only loosely. Performer (Choromanski et al., 2020) approximates the softmax kernel with random feature maps, which yields linear rather than quadratic cost, and Longformer (Beltagy et al., 2020) uses a sliding window, together with a small number of global tokens, to limit how many tokens each input token attends to. Both therefore reduce computational requirements for long sequences by changing the attention pattern or approximating it, and are closer to the linear and sparse families described above than to exact full attention.
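To make "more efficient but still exact and still quadratic" concrete, here is a sketch of streaming (online) softmax for a single query. It reads keys and values block by block and never stores the full row of scores, yet returns the same result as the ordinary computation. Connecting this trick to IO-aware attention kernels is background knowledge rather than something this summary states, and the function names are hypothetical.

```python
import math


def attention_row_reference(q, keys, values):
    """Ordinary softmax attention for one query: all scores are held at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def attention_row_streaming(q, keys, values, block=2):
    """Same result, but keys/values are consumed one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        ks = keys[start:start + block]
        vs = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, -0.3, 1.5], [-1.0, 2.0, 0.0], [0.0, 0.0, 3.0], [4.0, 1.0, -2.0]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 4.0], [0.5, 0.5]]

reference = attention_row_reference(q, keys, values)
streamed = attention_row_streaming(q, keys, values, block=2)
print(reference, streamed)
```

The total arithmetic is unchanged, so complexity stays quadratic over all queries; the gain in real kernels comes from touching less memory, which this pure-Python version can only hint at.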

Sparse Mixture of Experts

Sparse Mixture of Experts introduces conditional computation for increased model capacity without proportional computational cost. One example is Switch Transformer (Fedus et al., 2021), in which a learned router sends each token to a single expert feed-forward network, so only a small fraction of the model's parameters is active for any given token. Routing Transformer (Roy et al., 2021) is also cited here, but it routes attention rather than experts: it clusters queries and keys with online k-means and lets tokens attend only within their cluster, so it is closer to the sparse sequence modeling family than to mixture-of-experts.
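A minimal sketch of top-1 expert routing, assuming a linear router followed by softmax and toy "experts" that simply scale their input. All names, weights, and expert functions below are hypothetical and chosen only to show the control flow.

```python
import math


def route_top1(token, router_weights):
    """Return (expert index, gate probability) for the highest-scoring expert."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def moe_layer(tokens, router_weights, experts):
    """Run each token through exactly one expert and scale by its gate value."""
    outputs, load = [], [0] * len(experts)
    for token in tokens:
        idx, gate = route_top1(token, router_weights)
        load[idx] += 1
        outputs.append([gate * y for y in experts[idx](token)])
    return outputs, load


def make_scaling_expert(factor):
    """Toy expert: multiplies the token by a constant (stands in for an FFN)."""
    return lambda token: [factor * x for x in token]


experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0)]
router_weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # one row per expert
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5], [-1.0, 0.5], [-1.0, -1.0]]

outputs, load = moe_layer(tokens, router_weights, experts)
print("tokens per expert:", load)
```

Adding more experts increases the parameter count, but each token still passes through one expert, which is the sense in which capacity grows without a proportional increase in compute.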

Hybrid Architectures

Hybrid Architectures combine linear sequence modeling with traditional full attention layers. The linear part can be a kernelized linear attention layer of the kind introduced by Katharopoulos et al. (2020), which replaces softmax with a feature map so that attention can be computed in linear time; a hybrid model keeps a smaller number of full softmax attention layers alongside such layers to retain precise token-to-token recall. A related idea of mixing attention types is axial attention (Ho et al., 2019), which arranges the input along two dimensions and applies attention along one dimension at a time, reducing computational requirements while maintaining performance.
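One simple way to picture a hybrid stack is as a layer schedule in which most layers use a linear-cost mixer and every few layers use full attention. The sketch below builds such a schedule and compares a deliberately crude cost proxy; the interleaving ratio, the `state_size` parameter, and the cost formula are all assumptions for illustration, not figures from the survey.

```python
def hybrid_schedule(num_layers, full_every):
    """Mark every `full_every`-th layer as full attention, the rest as linear."""
    return ["full" if (i + 1) % full_every == 0 else "linear" for i in range(num_layers)]


def mixing_cost(schedule, seq_len, state_size):
    """Crude proxy: seq_len^2 per full layer, seq_len * state_size per linear layer."""
    per_layer = {"full": seq_len * seq_len, "linear": seq_len * state_size}
    return sum(per_layer[kind] for kind in schedule)


schedule = hybrid_schedule(num_layers=8, full_every=4)
hybrid_cost = mixing_cost(schedule, seq_len=4096, state_size=64)
dense_cost = mixing_cost(["full"] * 8, seq_len=4096, state_size=64)

print(schedule)
print(f"hybrid / dense token-mixing cost: {hybrid_cost / dense_cost:.3f}")
```

The proxy ignores feed-forward layers and constant factors, so it only shows the direction of the trade-off: the fewer full attention layers, the closer the stack gets to linear cost in sequence length.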

Diffusion LLMs

Diffusion LLMs explore non-autoregressive diffusion models for language generation. These models start from a noised or masked sequence and iteratively refine it, updating many positions at each step, which allows more parallel generation than producing one token at a time with a traditional autoregressive model. They typically still rely on a Transformer backbone, so self-attention continues to capture long-range dependencies during each refinement step.
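The control flow of iterative refinement can be sketched with a toy masked-decoding loop. Everything here is hypothetical: `make_toy_denoiser` stands in for a trained network by looking up a fixed target sentence and inventing a confidence score, so the example shows only how several positions are filled per model call, not how a real diffusion LLM predicts tokens.

```python
MASK = None  # placeholder for a position that has not been generated yet


def make_toy_denoiser(target):
    """Hypothetical stand-in for a trained model; also counts how often it is called."""
    calls = {"n": 0}

    def denoise(seq):
        calls["n"] += 1
        proposals = {}
        for i, tok in enumerate(seq):
            if tok is MASK:
                neighbors = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
                known = sum(t is not MASK for t in neighbors)
                # Invented confidence: higher when more neighbors are already filled.
                proposals[i] = (target[i], 0.5 + 0.25 * known)
        return proposals

    return denoise, calls


def generate(length, steps, denoise):
    """Start fully masked; each step commits the most confident proposals."""
    seq = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        proposals = denoise(seq)
        chosen = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:per_step]
        for i in chosen:
            seq[i] = proposals[i][0]
    return seq


target = "efficient models need fewer steps to write text".split()
denoise, calls = make_toy_denoiser(target)
result = generate(length=len(target), steps=4, denoise=denoise)
print(result, "model calls:", calls["n"])
```

In this toy setting an eight-token sequence is completed with four model calls, whereas token-by-token autoregressive decoding would need eight; whether such savings hold in practice depends on the model and the quality target.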

Adaptability of Efficient Architectures

The survey also highlights the adaptability of these architectural principles beyond language tasks to other modalities such as vision, audio, and multimodal applications. For instance, Sparse Transformers were applied to image and raw audio generation as well as text (Child et al., 2019), and Routing Transformer was evaluated on image generation alongside language modeling (Roy et al., 2021). By combining different components, including sparse modeling and mixture-of-experts approaches, researchers aim to develop foundation models that are scalable, resource-aware, and still versatile across AI systems.

Conclusion

In conclusion, this survey provides a comprehensive overview of efficient architectures for LLMs. These methods address the computational challenges of large-scale training and practical deployment. By organizing the methods into distinct categories and highlighting their adaptability beyond language tasks, the survey serves as a blueprint for future research toward more efficient, versatile AI systems. With continued advances in LLM architecture design, these models can be expected to improve further on natural language processing tasks while becoming more accessible for real-world applications.

Created on 22 Aug. 2025
