Insights into resource utilization of code small language models serving with runtime engines and execution providers

AI-generated keywords: Language models

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Rapidly evolving landscape of language models, particularly in code generation
  • Pressing need to address substantial computational resources required by these models
  • Concerns regarding energy consumption and environmental impact
  • Optimizing resource utilization of language model inference is paramount, with Small Language Models (SLMs) emerging as a promising solution
  • Impact of deep learning serving configurations on resource utilization within the context of code generation SLMs
  • Configurations defined as combinations of runtime engines and execution providers play a crucial role in determining energy consumption, execution time, and computing-resource utilization
  • Significant disparities observed across various serving configurations; CUDA outperformed CPU-based providers in terms of energy consumption and execution time
  • TORCH paired with CUDA was the most energy-efficient configuration, with energy savings from 37.99% up to 89.16% compared to other serving configurations
  • ONNX coupled with CPU demonstrated notable energy savings ranging from 8.98% up to 72.04% within CPU-based setups
  • Choice of serving configuration significantly influences resource utilization metrics such as energy consumption and execution time
  • Recommendations for software engineers: leverage TORCH paired with CUDA or ONNX with CPU for enhanced resource utilization efficiency when working with code generation SLMs

Authors: Francisco Durán, Matias Martinez, Patricia Lago, Silverio Martínez-Fernández

Accepted in the Journal of Systems and Software (JSS). For the published version, refer to the journal.

Abstract: The rapid growth of language models, particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing language models inference resource utilization is crucial, and Small Language Models (SLMs) offer a promising solution to reduce resource demands. Our goal is to analyze the impact of deep learning serving configurations, defined as combinations of runtime engines and execution providers, on resource utilization, in terms of energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference in the context of code generation SLMs. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across the configurations. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among the configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to other serving configurations. Similarly, optimized runtime engines like ONNX with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. Also, TORCH paired with CUDA exhibited efficient computing-resource utilization. Serving configuration choice significantly impacts resource utilization. While further research is needed, we recommend the above configurations best suited to software engineers' requirements for enhancing serving resource utilization efficiency.
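The notion of a serving configuration as a runtime engine paired with an execution provider can be sketched in a few lines of Python. This is only an illustration: the `ServingConfig` class and `enumerate_configs` helper are hypothetical names, the lists contain only the engines and providers named on this page, and the cross product shown is not necessarily the exact set of configurations the study evaluated.

```python
from dataclasses import dataclass
from itertools import product

# Only the engines and providers named in this summary (assumption:
# the full study may cover others that are not listed here).
RUNTIME_ENGINES = ("TORCH", "ONNX")
EXECUTION_PROVIDERS = ("CPU", "CUDA")


@dataclass(frozen=True)
class ServingConfig:
    """Hypothetical record for one runtime engine / execution provider pair."""
    engine: str
    provider: str

    @property
    def label(self) -> str:
        # Short human-readable name, e.g. "TORCH+CUDA".
        return f"{self.engine}+{self.provider}"


def enumerate_configs(engines=RUNTIME_ENGINES, providers=EXECUTION_PROVIDERS):
    """Return every engine/provider combination as a ServingConfig."""
    return [ServingConfig(e, p) for e, p in product(engines, providers)]


configs = enumerate_configs()
print([c.label for c in configs])
# ['TORCH+CPU', 'TORCH+CUDA', 'ONNX+CPU', 'ONNX+CUDA']
```

Treating each configuration as an immutable value makes it usable as a dictionary key, which is convenient when collecting per-configuration measurements.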

Submitted to arXiv on 19 Dec. 2024


Results of the summarizing process for the arXiv paper: 2412.15441v2


Language models are evolving rapidly, particularly in code generation, and the substantial computational resources they require have raised concerns about energy consumption and environmental impact. Optimizing the resource utilization of language model inference has therefore become a priority, with Small Language Models (SLMs) emerging as a promising way to reduce resource demands.

This study examines the impact of deep learning serving configurations on resource utilization in the context of code generation SLMs. A serving configuration is defined as a combination of a runtime engine and an execution provider, and resource utilization is assessed in terms of energy consumption, execution time, and computing-resource utilization. The analysis takes the perspective of software engineers conducting inference with code generation SLMs. The authors implemented a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs.

Significant differences emerged across serving configurations. Configurations using CUDA as the execution provider outperformed those using the CPU execution provider in both energy consumption and execution time. Among the configurations explored, TORCH paired with CUDA was the most energy-efficient, with energy savings from 37.99% up to 89.16% compared to other serving configurations. Optimized runtime engines such as ONNX with the CPU execution provider achieved energy savings from 8.98% up to 72.04% within CPU-based configurations. TORCH paired with CUDA also showed efficient computing-resource utilization.

The choice of serving configuration therefore significantly influences resource utilization. While noting that further research is needed, the authors recommend TORCH paired with CUDA, or ONNX with the CPU execution provider, for software engineers seeking to improve resource utilization efficiency when serving code generation SLMs.
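To make the reported percentages easier to interpret, the sketch below shows one common way to express energy savings: as the relative reduction against a baseline configuration. This page does not state how the paper computed its figures, so the formula is an assumption; the function names and the measurements are hypothetical placeholders, not results from the study.

```python
def energy_joules(avg_power_watts: float, duration_seconds: float) -> float:
    """Energy as average power times duration (1 W * 1 s = 1 J)."""
    return avg_power_watts * duration_seconds


def energy_savings_pct(baseline_joules: float, candidate_joules: float) -> float:
    """Relative energy reduction of a candidate against a baseline, in percent.

    Assumption: savings are measured relative to the baseline's energy.
    """
    if baseline_joules <= 0:
        raise ValueError("baseline energy must be positive")
    return (baseline_joules - candidate_joules) / baseline_joules * 100.0


# Placeholder measurements for two unnamed configurations (not study data).
baseline = energy_joules(avg_power_watts=100.0, duration_seconds=2.0)   # 200 J
candidate = energy_joules(avg_power_watts=100.0, duration_seconds=0.5)  # 50 J

saving = energy_savings_pct(baseline, candidate)
print(f"{saving:.2f}%")  # 75.00%
```

Under this definition, a configuration can save energy either by drawing less power or by finishing sooner, which is why energy consumption and execution time are reported side by side.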
Created on 13 Jun. 2026

