Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

AI-generated keywords: Artificial Intelligence, Large Language Models, Transformer-based Architectures, Efficient Serving Methodologies, Machine Learning Systems

AI-generated Key Points

Generative large language models (LLMs) are driving significant advancements in understanding and manipulating human languages.
Transformer-based architectures like GPT-family, LLaMA-family, DeepSeek-family, OPT, BLOOM, Mistral, DeciLM, Baichuan, and GLM have revolutionized natural language processing (NLP) tasks.
LLMs have expanded into diverse applications including automated programming, science discovery, personalized digital assistants, creative arts, and next-generation computing architecture.
Computational intensity and memory consumption of deploying these models present challenges in serving efficiency due to their immense size and complexity.
Efficient LLM serving methodologies are being explored by the research community to address concerns over energy consumption, scalability, and accessibility for broader adoption in real-world applications.

Authors: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia

arXiv: 2312.15234v2 (cs.LG)

ACM Computing Surveys

License: CC BY 4.0

Abstract: In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.

Submitted to arXiv on 23 Dec. 2023

Results of the summarizing process for the arXiv paper: 2312.15234v2

Comprehensive Summary
Key points
Layman's Summary
Blog article

In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) have emerged as a driving force behind significant advancements in understanding and manipulating human languages. These LLMs include transformer-based architectures such as the GPT-family, LLaMA-family, DeepSeek-family, and other recent public models like OPT, BLOOM, Mistral, DeciLM, Baichuan, and GLM. They have revolutionized natural language processing (NLP) tasks and expanded into diverse applications including automated programming, science discovery, personalized digital assistants, creative arts, and next-generation computing architecture. Despite their exceptional performance across various domains, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency. The immense model size and complexity require extensive computational resources during inference. This resource-intensive nature raises concerns over energy consumption, scalability, and accessibility for broader adoption in real-world applications. To address this critical need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, this survey paper provides an exhaustive exploration of multifaceted strategies proposed by the research community, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. On the algorithmic side, the methods named include both sequence-based and tree-based speculative decoding, among them Speculative Sampling, Speculative Decoding (SpecDec), SpecInfer, Medusa, REST, Lookahead, Sequoia, EAGLE, Ouroboros, Triforce, Hydra, Kangaroo, OPT-Tree, SpecPrefill, LongSpec, and AdaServe. The survey aims to offer valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment.
By providing a comprehensive understanding of the current state and future directions in efficient LLM serving, from algorithms to systems optimization, this survey paper stands at the crux of advanced AI innovations and practical system optimizations that are reshaping the future of AI technology.
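
To give a rough sense of the memory consumption mentioned above, the following Python sketch estimates the memory needed for model weights and for the key-value (KV) cache that a transformer keeps during generation. The parameter count and layer configuration are illustrative assumptions, not figures taken from the survey, and the formulas ignore activations and framework overhead.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights in GiB (2 bytes per parameter ~ 16-bit floats)."""
    return num_params * bytes_per_param / 2**30


def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,
) -> float:
    """Memory for the KV cache in GiB.

    The leading factor 2 accounts for storing both a key and a value
    vector per token, per layer, per attention head.
    """
    total_bytes = (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical configuration chosen only for illustration.
    print(f"weights:  {weight_memory_gib(7e9):.2f} GiB")
    print(f"KV cache: {kv_cache_gib(32, 32, 128, 2048, 1):.2f} GiB per sequence")
```

Under these assumptions the KV cache grows linearly with both sequence length and batch size, which is one reason serving many long requests at once is memory-bound.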

Summary

Generative large language models are like super smart computer programs that help us understand and use languages better. They are built on a design called the Transformer architecture, which has made language tasks much easier. These models are used in many different areas like writing code, making new scientific discoveries, creating art, and improving computers. However, they need a lot of computing power and memory to work well, which can be a problem. Scientists are trying to find better ways to make these models more efficient so that everyone can use them easily.

Definitions

- Generative large language models (LLMs): Advanced computer programs that help with understanding and using human languages.
- Transformer-based architectures: A neural network design that has greatly improved how computers process languages.
- Natural Language Processing (NLP): Using computers to understand and work with human languages.
- Computational intensity: How much computing power is needed for a task.
- Memory consumption: How much computer memory is used for storing information.
- Efficiency: Doing something well without wasting resources like time or energy.

Artificial intelligence (AI) has been rapidly advancing in recent years, and one of the most significant developments in this field is the emergence of generative large language models (LLMs). These LLMs, which include transformer-based architectures such as the GPT-family, LLaMA-family, DeepSeek-family, and other recent public models like OPT, BLOOM, Mistral, DeciLM, Baichuan, and GLM, have revolutionized natural language processing (NLP) tasks. They have expanded into diverse applications including automated programming, science discovery, personalized digital assistants, creative arts, and next-generation computing architecture. However, despite their exceptional performance across various domains and applications, there are still challenges that need to be addressed for these LLMs to be more efficient in real-world scenarios. One of the main concerns is their computational intensity and memory consumption during inference. The immense size and complexity of these models require extensive computational resources for deployment. This raises concerns over energy consumption, scalability, and accessibility for broader adoption in practical applications. To address this critical need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, a survey paper has been published that provides an exhaustive exploration of multifaceted strategies proposed by the research community. This survey aims to offer valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment. The survey paper covers a range of topics related to efficient LLM serving methods, from algorithms to systems optimization. It starts with an overview of current state-of-the-art techniques used in LLMs, such as transformer-based architectures like GPT-3, which uses self-attention mechanisms to process sequential data efficiently.
It also discusses other popular models like BERT, which uses bidirectional transformers for pre-training on large datasets. Moving on from model architectures, the survey delves into algorithmic modifications that have been proposed to improve efficiency in LLMs. These include sequence-based and tree-based speculative decoding methods such as Speculative Sampling, Speculative Decoding (SpecDec), SpecInfer, Medusa, REST, Lookahead, Sequoia, EAGLE, Ouroboros, Triforce, Hydra, Kangaroo, OPT-Tree, SpecPrefill, LongSpec, and AdaServe, among others. Each of these techniques is explained with its advantages and limitations. The common idea behind speculative decoding is to draft several candidate tokens cheaply and then have the large model verify them in parallel, so that more than one token can be accepted per expensive forward pass. The methods differ mainly in where the drafts come from: a smaller draft model, extra prediction heads attached to the base model as in Medusa, or retrieval from a datastore as in REST. Tree-based variants such as SpecInfer and Sequoia organize the candidates into a token tree so that multiple continuations are verified at once. The survey paper also explores groundbreaking changes in system designs that have been proposed to optimize the deployment of LLMs. Overall, this survey paper provides a comprehensive understanding of the current state and future directions in efficient LLM serving methods, from algorithms to systems optimization. It offers valuable insights for researchers and practitioners looking to overcome the barriers of effective LLM deployment. By exploring cutting-edge algorithmic modifications and groundbreaking changes in system designs, this survey stands at the crux of advanced AI innovations and practical system optimizations that are reshaping the future of AI technology.
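
To make the speculative decoding idea named above more concrete, here is a minimal Python sketch of the sequence-based, greedy variant. It is an illustration under stated assumptions, not the implementation of any method listed in the survey: `toy_target` and `toy_draft` are hypothetical stand-ins for a large model and a cheap draft model, and the verification loop calls the target once per position for readability, whereas a real system would score all drafted positions in a single batched forward pass.

```python
from typing import Callable, List

Token = int
# A "model" here is just a function mapping a token context to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(target: NextTokenFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    """Greedy speculative decoding: draft several tokens, keep the verified prefix."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1) The draft model proposes draft_len tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))

        # 2) The target model verifies the proposal position by position.
        accepted: List[Token] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if expected != guess:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token matched, so the target yields one extra token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:limit]


def toy_target(context: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after token 2."""
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12))
```

Because every emitted token is the target model's own greedy choice, the output matches plain greedy decoding with the target; the draft model only affects how many tokens are accepted per verification round.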
In conclusion, while generative large language models have shown exceptional performance across various domains, there is still room for improvement when it comes to efficiency in real-world applications. With ongoing research efforts focused on developing more efficient LLM serving methodologies, we can expect significant advancements in this field that will pave the way for broader adoption and integration into our daily lives.

Created on 28 Sep. 2025

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.