From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference

AI-generated keywords: Large language models, computational challenges, inference, energy costs, LLaMA, resource utilization

AI-generated Key Points

  • Large language models (LLMs) are popular for their advanced generative capabilities, surpassing previous state-of-the-art models.
  • LLMs are utilized across various domains like law, finance, and medicine.
  • Computational challenges associated with LLMs, particularly in terms of compute and energy costs for inference, are significant.
  • Inference energy costs receive less attention than training costs, even though large models such as ChatGPT are called on for inference very frequently.
  • Experiments were conducted to analyze computational and energy utilization of inference with LLMs using the open-sourced pre-trained large language model developed by Meta AI called LLaMA.
  • Larger variants of LLaMA typically require multiple high-end GPUs for both training and inference.
  • Benchmarking and analysis of inference performance and energy costs were done using NVIDIA V100 & A100 GPUs and two datasets (Alpaca and GSM8K).
  • Multi-node, multi-GPU inference with model sharding across up to 32 GPUs was included in the experiments for a comprehensive analysis at scale.
  • Results were compared against single-node runs of smaller model variants, which served as a baseline.
  • The study aims to shed light on compute performance and energy utilization characteristics of LLM inference for cost-savings, efficient hardware usage, and optimal deployment strategies.

Authors: Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, Vijay Gadepally

License: CC BY 4.0

Abstract: Large language models (LLMs) have exploded in popularity due to their new generative capabilities that go far beyond prior state-of-the-art. These technologies are increasingly being leveraged in various domains such as law, finance, and medicine. However, these models carry significant computational challenges, especially the compute and energy costs required for inference. Inference energy costs already receive less attention than the energy costs of training LLMs -- despite how often these large models are called on to conduct inference in reality (e.g., ChatGPT). As these state-of-the-art LLMs see increasing usage and deployment in various domains, a better understanding of their resource utilization is crucial for cost-savings, scaling performance, efficient hardware usage, and optimal inference strategies. In this paper, we describe experiments conducted to study the computational and energy utilization of inference with LLMs. We benchmark and conduct a preliminary analysis of the inference performance and inference energy costs of different sizes of LLaMA -- a recent state-of-the-art LLM -- developed by Meta AI on two generations of popular GPUs (NVIDIA V100 & A100) and two datasets (Alpaca and GSM8K) to reflect the diverse set of tasks/benchmarks for LLMs in research and practice. We present the results of multi-node, multi-GPU inference using model sharding across up to 32 GPUs. To our knowledge, our work is one of the first to study LLM inference performance from the perspective of computational and energy resources at this scale.

Submitted to arXiv on 04 Oct. 2023


Results of the summarizing process for the arXiv paper: 2310.03003v1

Large language models (LLMs) have gained immense popularity due to their advanced generative capabilities, surpassing previous state-of-the-art models. These technologies are increasingly being utilized across various domains such as law, finance, and medicine. However, the computational challenges associated with LLMs, particularly the compute and energy costs of inference, are significant. Despite the widespread use of large models like ChatGPT for inference tasks, inference energy costs are often overshadowed by the attention given to training costs.

In this paper, we present the results of experiments conducted to analyze the computational and energy utilization of inference with LLMs using LLaMA, an open-sourced pre-trained large language model developed by Meta AI. The larger variants of LLaMA typically require multiple high-end GPUs for both training and inference. We benchmarked and analyzed inference performance and energy costs using two generations of popular GPUs (NVIDIA V100 & A100) and two datasets (Alpaca and GSM8K) to provide insights into resource utilization across different tasks. Our experiments included multi-node, multi-GPU inference with model sharding across up to 32 GPUs, offering an analysis at scale that is one of the first in this domain. Additionally, we used results from single-node instances running smaller variants of the model as a baseline. Through our work, we aim to shed light on the compute performance and energy utilization characteristics of LLM inference to facilitate cost-savings, efficient hardware usage, and optimal strategies for deployment.

The landscape of large language models has seen rapid growth in recent years, with competing advancements in speed and complexity among AI research groups in both private companies and academic institutions. The development paths of LLMs have evolved significantly since 2017, as illustrated in Fig. 1, showcasing a diverse range of models tailored for specific purposes or use-cases within natural language processing tasks.

Overall, our study contributes insights into the computational challenges associated with deploying large language models like LLaMA for real-world applications. By providing detailed analysis of compute performance and energy utilization at scale, we hope to encourage further research, benchmarking efforts, and open dissemination of performance characteristics across different hardware configurations and optimization strategies within the field of large language models.
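
As a rough illustration of what "inference energy cost" can mean in practice, the sketch below integrates sampled GPU power over time to get energy in joules, sums it across the GPUs a sharded model occupies, and divides by the number of generated tokens. This is an assumption-laden example, not the measurement procedure used in the paper, which this summary does not describe. The function names `energy_joules` and `energy_per_token` and all sample values are hypothetical.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of one trapezoid: average power over the interval times its duration.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_per_token(per_gpu_energy_j: Sequence[float], tokens_generated: int) -> float:
    """Total energy across all GPUs holding model shards, divided by generated tokens."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return sum(per_gpu_energy_j) / tokens_generated


if __name__ == "__main__":
    # Made-up power traces for two GPUs (illustrative only, not data from the paper).
    t = [0.0, 1.0, 2.0]
    gpu0 = [100.0, 200.0, 100.0]
    gpu1 = [100.0, 100.0, 100.0]
    energies = [energy_joules(t, gpu0), energy_joules(t, gpu1)]
    print(energy_per_token(energies, tokens_generated=50))
```

Summing over every GPU matters for sharded inference: a larger model spread across more devices can generate tokens at a similar rate while drawing power on all of them, so per-GPU figures alone understate the cost.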
Created on 30 Apr. 2024
