Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

AI-generated keywords: Large Language Models, AI Agents, Throughput-Optimal Scheduling Algorithms, Queuing Theory, Latency Optimization

AI-generated Key Points

  • Increasing demand for Large Language Models (LLMs) and AI agents
  • Importance of optimizing systems for efficient LLM inference
  • Use of mathematical modeling and queuing theory to develop fundamental principles
  • Exploration of throughput in LLM inference systems
  • 'Work-conserving' scheduling algorithms achieve maximum throughput
  • Highlighting 'work-conserving' as a key design principle for system performance enhancement
  • Orca and Sarathi-Serve identified as throughput-optimal; FastTransformer and vanilla vLLM found not maximally stable and should be used with caution
  • Benefits of incorporating queuing theory in enhancing LLM inference systems
  • Insights into latency optimization, algorithm selection based on different factors
  • Token budget choices in Sarathi-Serve impact latency performance, revealing challenges and opportunities for future research
  • Influence of token budget sizes on end-to-end latency and prefill time demonstrated through experiments using the CodeLlama-34B model

Authors: Yueying Li, Jim Dai, Tianyi Peng

License: CC BY 4.0

Abstract: As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant efforts have targeted system-level engineering, little is explored through a mathematical modeling and queuing perspective. In this paper, we aim to develop the queuing fundamentals for LLM inference, bridging the gap between queuing and LLM system communities. In particular, we study the throughput aspect in LLM inference systems. We prove that a large class of 'work-conserving' scheduling algorithms can achieve maximum throughput for both individual requests and AI agent workloads, highlighting 'work-conserving' as a key design principle in practice. Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, reassuring practitioners, while FastTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our results highlight the substantial benefits queuing community can offer in improving LLM inference systems and call for more interdisciplinary developments.
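
One way to make "throughput-optimal" concrete is the standard queuing notion of stability. The inequality below is a simplified sketch, not the paper's theorem. The symbols are assumptions introduced here for illustration: \lambda is the request arrival rate, m is the mean number of tokens per request (prefill plus decode), b is the token budget per batch, and t_b is the time needed to process one batch of b tokens.

\[
\lambda \, m \;<\; \frac{b}{t_b}
\]

Under these assumptions, the left side is the token load offered per unit time and the right side is the most tokens the system can process per unit time. A scheduler is throughput-optimal in this sense if it keeps the request queue stable for every \lambda that satisfies the inequality, and a work-conserving scheduler is one that fills each batch up to b whenever enough work is waiting.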

Submitted to arXiv on 10 Apr. 2025

Results of the summarizing process for the arXiv paper: 2504.07347v1

The paper "Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents" responds to the growing demand for Large Language Models (LLMs) and AI agents, which makes efficient LLM inference increasingly important. Rather than focusing on system-level engineering, it uses mathematical modeling and queuing theory to develop fundamental principles, aiming to bridge the gap between the queuing and LLM systems communities. Its focus is throughput in LLM inference systems.

The authors prove that a large class of 'work-conserving' scheduling algorithms achieves maximum throughput for both individual requests and AI agent workloads, and they highlight 'work-conserving' as a key design principle in practice. Evaluations of real-world systems show that Orca and Sarathi-Serve are throughput-optimal, while FastTransformer and vanilla vLLM are not maximally stable and should be used with caution. The authors argue that these results show how much queuing theory can contribute to LLM inference systems, and they call for more interdisciplinary work.

The paper also offers insights into latency optimization. Several algorithms are throughput-optimal, but which of them gives the lowest latency depends on several factors. A preliminary analysis of how the token budget in Sarathi-Serve affects latency reveals both challenges and opportunities for future research. Experiments with the CodeLlama-34B model show how the token budget size influences end-to-end latency and prefill time, illustrating the trade-offs involved.
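
The work-conserving idea described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's model or any real system's scheduler: the `Request` class, the `simulate` function, the fixed arrival pattern, and all parameter values are hypothetical. Each iteration processes at most `token_budget` tokens, decoding requests consume one token each, and leftover budget goes to chunked prefill. The work-conserving variant admits waiting requests at every iteration; the baseline admits a fixed-size batch only after the previous batch has fully drained, so part of the budget sits idle.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prefill: int  # prompt tokens still to be processed
    decode: int   # output tokens still to be generated


def simulate(work_conserving, steps=2000, token_budget=8,
             arrive_every=4, prefill=16, decode=8, fixed_batch=2):
    """Return the number of unfinished requests after `steps` iterations.

    All parameters are illustrative assumptions, not measured values.
    """
    waiting, running = deque(), []
    for t in range(steps):
        # Deterministic arrivals: one request every `arrive_every` steps.
        if t % arrive_every == 0:
            waiting.append(Request(prefill, decode))

        if work_conserving:
            # Admit all waiting work at every iteration.
            while waiting:
                running.append(waiting.popleft())
        elif not running:
            # Baseline: admit a fixed-size batch only once the previous
            # batch has completely finished.
            while waiting and len(running) < fixed_batch:
                running.append(waiting.popleft())

        budget = token_budget
        # Decode first: one token per request that has finished prefill.
        for r in running:
            if budget == 0:
                break
            if r.prefill == 0 and r.decode > 0:
                r.decode -= 1
                budget -= 1
        # Spend whatever budget is left on chunked prefill.
        for r in running:
            if budget == 0:
                break
            if r.prefill > 0:
                chunk = min(r.prefill, budget)
                r.prefill -= chunk
                budget -= chunk

        # Drop requests that have no tokens left to process.
        running = [r for r in running if r.prefill + r.decode > 0]
    return len(waiting) + len(running)


if __name__ == "__main__":
    print("work-conserving backlog:", simulate(True))
    print("drain-then-admit backlog:", simulate(False))
```

With these parameters the offered load is 6 tokens per iteration against a budget of 8, so the work-conserving variant keeps its backlog small, while the drain-then-admit baseline cannot finish requests as fast as they arrive and its backlog keeps growing with the number of simulated steps.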
Created on 25 Apr. 2025
