DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
AI-generated Key Points
- Longer experiments using 1-day traces for the Conversation service showed, relative to SinglePool:
  - 5.3% improvement in P99 TTFT latency
  - 11.1% improvement in TBT latency
  - Increases in P50 TTFT and TBT latencies of 11.4% and 7.6%, respectively
- DynamoLLM reduced cluster power consumption: P50 and P99 power consumption decreased by 43% and 9%, respectively, relative to the baseline.
- DynamoLLM accommodates different request types by operating pools at different frequencies, whereas the baseline uses the maximum allowed frequency.
- Sharding changes were observed over time: under DynamoLLM, different pools operate with different degrees of model parallelism as load changes.
- Sensitivity studies showed improvements in energy efficiency and performance metrics with DynamoLLM compared to other evaluated systems.
- Overall, DynamoLLM achieved significant reductions in energy consumption (53%), operational carbon emissions (38%), and cost (61%) while maintaining latency SLOs by dynamically reconfiguring inference clusters based on workload fluctuations and compute properties.
Authors: Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse
Abstract: The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume a large amount of energy and, consequently, to produce excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy and cost of LLM serving under the service's performance SLOs. We show that at the service level, DynamoLLM conserves 53% energy and 38% operational carbon emissions, and reduces cost to the customer by 61%, while meeting the latency SLOs.
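The configuration search described in the abstract can be illustrated with a minimal sketch. It enumerates the three knobs named there (number of instances, model parallelism, GPU frequency) and keeps the lowest-power configuration that still meets a latency SLO. Everything here is an assumption for illustration: `Config`, `toy_profile`, and `pick_config` are hypothetical names, the analytic latency and power model is made up rather than measured, and the exhaustive search is a simple stand-in, not the algorithm DynamoLLM uses.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Tuple

MAX_FREQ_MHZ = 1980  # assumed maximum GPU frequency, for illustration only


@dataclass(frozen=True)
class Config:
    instances: int        # number of model instances
    tensor_parallel: int  # GPUs per instance (model parallelism degree)
    freq_mhz: int         # GPU core frequency


def toy_profile(cfg: Config, load_rps: float) -> Tuple[float, float]:
    """Hypothetical stand-in for profiled data: returns (latency_s, power_w)."""
    freq_ratio = cfg.freq_mhz / MAX_FREQ_MHZ
    # Assumed sub-linear speedup from parallelism, linear in frequency.
    speed = (cfg.tensor_parallel ** 0.8) * freq_ratio
    capacity_rps = cfg.instances * 4.0 * speed
    if load_rps >= capacity_rps:
        return float("inf"), float("inf")  # overloaded: SLO cannot be met
    utilization = load_rps / capacity_rps
    latency_s = 0.2 / speed / (1.0 - utilization)
    gpus = cfg.instances * cfg.tensor_parallel
    # Assumed per-GPU power: static part plus a part cubic in frequency.
    power_w = gpus * (100.0 + 600.0 * freq_ratio ** 3)
    return latency_s, power_w


def pick_config(load_rps: float, slo_s: float) -> Optional[Config]:
    """Return the lowest-power configuration meeting the SLO, or None."""
    best, best_power = None, float("inf")
    for n, tp, f in product((1, 2, 4), (2, 4, 8), (1200, 1600, MAX_FREQ_MHZ)):
        cfg = Config(n, tp, f)
        latency_s, power_w = toy_profile(cfg, load_rps)
        if latency_s <= slo_s and power_w < best_power:
            best, best_power = cfg, power_w
    return best


if __name__ == "__main__":
    for load in (1.0, 30.0, 1000.0):
        print(load, pick_config(load, slo_s=1.0))
```

Under this toy model, a light load is served by a small, low-frequency configuration, while a heavier load forces more GPUs or a higher frequency. Re-running the selection as load changes is the general idea behind dynamic reconfiguration, although a real system would also have to account for the cost of switching configurations.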