DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

AI-generated Key Points

  • Longer experiments using 1-day traces for the Conversation service showed, relative to SinglePool, a 5.3% improvement in P99 TTFT latency and an 11.1% improvement in P99 TBT latency.
  • P50 TTFT and TBT latencies increased by 11.4% and 7.6%, respectively.
  • Cluster power consumption was reduced with DynamoLLM: P50 and P99 power consumption decreased by 43% and 9%, respectively, over the baseline.
  • DynamoLLM accommodates different request types by operating pools at different frequencies compared to the maximum allowed frequency used by the baseline.
  • Sharding changes were observed, indicating that different pools operate with varying model parallelisms under DynamoLLM as load changes over time.
  • Sensitivity studies showed improvements in energy efficiency and performance metrics with DynamoLLM compared to other evaluated systems.
  • Overall, DynamoLLM achieved significant reductions in energy consumption (53%), operational carbon emissions (38%), and cost (61%) while maintaining latency SLOs by dynamically reconfiguring inference clusters based on workload fluctuations and compute properties.
Authors: Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse

License: CC BY-NC-SA 4.0

Abstract: The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to consume large amounts of energy and, consequently, result in excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and fluctuations in inference workloads to significantly improve energy efficiency. However, such a diverse and dynamic environment creates a large search space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy and cost of LLM serving under the service's performance SLOs. We show that at the service level, DynamoLLM saves 53% of energy and 38% of operational carbon emissions, and reduces cost to the customer by 61%, while meeting the latency SLOs.
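The search space described in the abstract (number of instances, model parallelism, GPU frequency) can be pictured with a small sketch. This is only an illustration under stated assumptions, not DynamoLLM's actual algorithm: the `Config` fields, the `toy_latency_s` and `toy_power_w` models, and all numeric constants are hypothetical placeholders. It enumerates candidate configurations, discards those that would violate a latency SLO, and keeps the one with the lowest modeled power.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Config:
    """One hypothetical cluster configuration."""
    instances: int           # number of model instances
    tensor_parallelism: int  # GPUs per instance
    gpu_freq_mhz: int        # GPU core frequency


def toy_latency_s(c: Config) -> float:
    # Made-up model: latency shrinks with more GPUs and higher frequency.
    return 4.0 / (c.instances * c.tensor_parallelism * (c.gpu_freq_mhz / 1000.0))


def toy_power_w(c: Config) -> float:
    # Made-up model: per-GPU power has a static part plus a part that
    # grows steeply with frequency.
    gpus = c.instances * c.tensor_parallelism
    return gpus * (100.0 + 300.0 * (c.gpu_freq_mhz / 1000.0) ** 3)


def pick_config(
    configs: Iterable[Config],
    latency_s: Callable[[Config], float],
    power_w: Callable[[Config], float],
    slo_s: float,
) -> Optional[Config]:
    """Return the lowest-power configuration that meets the SLO, or None."""
    feasible = [c for c in configs if latency_s(c) <= slo_s]
    return min(feasible, key=power_w, default=None)


# Hypothetical search space: every combination of the three knobs.
SEARCH_SPACE = [
    Config(i, t, f)
    for i, t, f in product((1, 2, 4), (2, 4, 8), (800, 1200, 1600, 1980))
]

best = pick_config(SEARCH_SPACE, toy_latency_s, toy_power_w, slo_s=0.5)
print(best)
```

Exhaustive enumeration is used here only because the toy space is tiny; a real system would need profiled latency and energy data and a cheaper search.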

Submitted to arXiv on 01 Aug. 2024

Results of the summarizing process for the arXiv paper: 2408.00741v1

In the paper's long cluster-level experiments, longer runs used 1-day traces for the Conversation service. Across all invocations in these traces, P99 TTFT and TBT latencies improved by 5.3% and 11.1%, respectively, over SinglePool, while P50 TTFT and TBT latencies increased by 11.4% and 7.6%, respectively. Operating in energy-efficient modes with DynamoLLM reduced both cluster and per-GPU power consumption: P50 and P99 power consumption fell by 43% and 9%, respectively, relative to the baseline.

An analysis of frequency changes showed that DynamoLLM accommodates different request types by operating their pools at different frequencies, whereas the baseline uses the maximum allowed frequency. Sharding changes were also observed: under DynamoLLM, different pools operate with different model parallelisms as the load changes over time. Sensitivity studies examined how predictor accuracy affects overall system performance, and the results showed improvements in energy efficiency and performance metrics with DynamoLLM compared to the other evaluated systems.

Overall, DynamoLLM proved to be an effective energy-management framework for LLM inference environments, optimizing for energy and cost while meeting service performance SLOs. By dynamically reconfiguring inference clusters based on workload fluctuations and compute properties, DynamoLLM achieved significant reductions in energy consumption (53%), operational carbon emissions (38%), and cost (61%) while maintaining latency SLOs.
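As a rough illustration of how percentile figures such as the P50 and P99 power reductions quoted above are typically computed, the sketch below compares two power traces at a given percentile. It assumes a nearest-rank percentile definition and uses synthetic traces; the helper names `percentile` and `reduction_pct` are hypothetical, and none of the numbers are the paper's measurements.

```python
import math
import random


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def reduction_pct(baseline, candidate, p):
    """Percentage by which the candidate's p-th percentile is below the baseline's."""
    base = percentile(baseline, p)
    return 100.0 * (base - percentile(candidate, p)) / base


# Synthetic per-interval cluster power samples in watts (illustrative only).
rng = random.Random(0)
baseline_trace = [rng.uniform(3500.0, 4000.0) for _ in range(1000)]
managed_trace = [rng.uniform(1800.0, 3900.0) for _ in range(1000)]

for p in (50, 99):
    print(f"P{p} reduction: {reduction_pct(baseline_trace, managed_trace, p):.1f}%")
```

The same two helpers apply equally to latency samples such as TTFT and TBT, where a positive result would indicate an improvement over the baseline.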
Created on 18 Nov. 2024
