Insights into resource utilization of code small language models serving with runtime engines and execution providers
AI-generated Key Points
Note: The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.
- Rapidly evolving landscape of language models, particularly in code generation
- Pressing need to address substantial computational resources required by these models
- Concerns regarding energy consumption and environmental impact
- Optimizing resource utilization of language model inference is paramount, with Small Language Models (SLMs) emerging as a promising solution
- Impact of deep learning serving configurations on resource utilization within the context of code generation SLMs
- Configurations defined as combinations of runtime engines and execution providers play a crucial role in determining energy consumption, execution time, and computing-resource utilization
- Significant differences observed across serving configurations; configurations using the CUDA execution provider outperformed those using the CPU execution provider in both energy consumption and execution time
- TORCH paired with CUDA showed the greatest energy efficiency, with energy savings from 37.99% up to 89.16% compared to other serving configurations
- ONNX coupled with CPU demonstrated notable energy savings ranging from 8.98% up to 72.04% within CPU-based setups
- Choice of serving configuration significantly influences resource utilization metrics such as energy consumption and execution time
- Recommendation for software engineers, pending further research: use TORCH paired with CUDA, or ONNX with CPU in CPU-based setups, to improve resource utilization when serving code generation SLMs
Authors: Francisco Durán, Matias Martinez, Patricia Lago, Silverio Martínez-Fernández
Abstract: The rapid growth of language models, particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing language models inference resource utilization is crucial, and Small Language Models (SLMs) offer a promising solution to reduce resource demands. Our goal is to analyze the impact of deep learning serving configurations, defined as combinations of runtime engines and execution providers, on resource utilization, in terms of energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference in the context of code generation SLMs. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across the configurations. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among the configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to other serving configurations. Similarly, optimized runtime engines like ONNX with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. Also, TORCH paired with CUDA exhibited efficient computing-resource utilization. Serving configuration choice significantly impacts resource utilization. While further research is needed, we recommend the above configurations best suited to software engineers' requirements for enhancing serving resource utilization efficiency.
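As a rough illustration of the two ideas in the abstract, the sketch below enumerates serving configurations as (runtime engine, execution provider) pairs and computes a relative energy saving between two configurations. The engine and provider names come from the abstract; the savings formula (relative reduction against a baseline) is an assumption about how such percentages are typically computed, not the authors' stated method, and all energy values are invented placeholders rather than results from the paper.

```python
from itertools import product

# Names taken from the abstract; the actual study may cover more engines/providers,
# and not every pairing is necessarily a valid configuration in practice.
RUNTIME_ENGINES = ["TORCH", "ONNX"]
EXECUTION_PROVIDERS = ["CPU", "CUDA"]


def serving_configurations(engines, providers):
    """Return every (runtime engine, execution provider) pair."""
    return list(product(engines, providers))


def energy_savings_percent(baseline_joules, candidate_joules):
    """Relative energy saved by the candidate versus the baseline, in percent.

    Assumed definition: (baseline - candidate) / baseline * 100.
    """
    if baseline_joules <= 0:
        raise ValueError("baseline energy must be positive")
    return (baseline_joules - candidate_joules) / baseline_joules * 100.0


configs = serving_configurations(RUNTIME_ENGINES, EXECUTION_PROVIDERS)

# Hypothetical measurements in joules, invented purely for illustration.
hypothetical_energy = {
    ("TORCH", "CUDA"): 100.0,
    ("ONNX", "CUDA"): 200.0,
    ("TORCH", "CPU"): 800.0,
    ("ONNX", "CPU"): 400.0,
}

# Pick the configuration with the lowest (hypothetical) energy use.
best = min(hypothetical_energy, key=hypothetical_energy.get)

for config in configs:
    if config == best:
        continue
    saving = energy_savings_percent(hypothetical_energy[config], hypothetical_energy[best])
    print(f"{best} vs {config}: {saving:.2f}% less energy")
```

With real measurements, the placeholder dictionary would be replaced by per-configuration energy readings averaged over repeated inference runs.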