From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph

AI-generated keywords: CUDA programming

AI-generated Key Points

- Despite advances in CUDA programming and domain-specific libraries, effectively using GPUs with massively parallel engines remains difficult.
- Large language models (LLMs) show promise in generating optimized CUDA code from sequential code, but cloud-based APIs risk code leakage and local deployment is computationally expensive.
- Small language models (SLMs) are a more lightweight and privacy-friendly alternative. They can match LLMs on specific tasks, but their limited reasoning abilities lead to suboptimal results in complex CUDA generation.
- ReGraphT is a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to SLMs through a structured reasoning graph and Monte Carlo Graph Search (MCGS).
- Experiments show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average 2.33X speedup on CUDAEval and ParEval.
- Paired with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without the associated privacy risks or excessive computing overhead.


Authors: Junfeng Gong, Zhiyi Wei, Junying Chen, Cheng Liu, Huawei Li

arXiv: 2510.19873v1 (cs.LG)

License: CC BY 4.0

Abstract: Despite significant evolution of CUDA programming and domain-specific libraries, effectively utilizing GPUs with massively parallel engines remains difficult. Large language models (LLMs) show strong potential in generating optimized CUDA code from sequential code. However, using LLMs in practice faces two major challenges: cloud-based APIs pose risks of code leakage, and local deployment is often computationally expensive and inefficient. These drawbacks have spurred interest in small language models (SLMs), which are more lightweight and privacy-friendly. Encouragingly, recent studies show that SLMs can achieve performance comparable to LLMs on specific tasks. While SLMs can match LLMs on domain-specific tasks, their limited reasoning abilities lead to suboptimal performance in complex CUDA generation according to our experiments. To bridge this gap, we propose ReGraphT, a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. ReGraphT organizes CUDA optimization trajectories into a structured reasoning graph, modeling the combined CUDA optimizations as state transitions, and leverages Monte Carlo Graph Search (MCGS) for efficient exploration. We also present a CUDA-specific benchmark with difficulty tiers defined by reasoning complexity to evaluate models more comprehensively. Experiments show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average 2.33X speedup on CUDAEval and ParEval. When paired with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without the associated privacy risks or excessive computing overhead.

Submitted to arXiv on 22 Oct. 2025


Results of the summarizing process for the arXiv paper: 2510.19873v1

Comprehensive Summary

Despite advances in CUDA programming and domain-specific libraries, effectively harnessing GPUs with massively parallel engines remains difficult. Large language models (LLMs) have shown promise in generating optimized CUDA code from sequential code, but practical use faces hurdles: cloud-based APIs pose privacy risks, and local deployment carries high computational costs. This has sparked interest in small language models (SLMs), which offer a more lightweight and privacy-friendly alternative. Recent studies have demonstrated that SLMs can achieve performance comparable to LLMs on specific tasks. However, their limited reasoning abilities result in suboptimal performance in complex CUDA generation. To address this gap, the authors propose ReGraphT, a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. By organizing CUDA optimization trajectories into a structured reasoning graph and leveraging Monte Carlo Graph Search (MCGS) for efficient exploration, ReGraphT aims to enhance the reasoning capabilities of SLMs. The authors also introduce a CUDA-specific benchmark with difficulty tiers based on reasoning complexity to evaluate models more comprehensively. Experimental results show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average 2.33X speedup on CUDAEval and ParEval. When combined with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without compromising privacy or requiring excessive computing resources.
Because deploying LLMs locally is resource-intensive and cloud-based APIs carry privacy risks, SLMs are a promising route to optimized CUDA code generation that preserves both efficiency and privacy. Frameworks like ReGraphT aim to strengthen SLMs further, paving the way for more effective GPU utilization in parallel processing environments.

Layman's Summary

1. It is hard to make full use of GPUs, which run many calculations at the same time, even though CUDA programming tools and libraries have improved.
2. Big language models can write good CUDA code, but using them over the internet risks leaking code, and running them on local computers is expensive.
3. Small language models are lighter and better for privacy than big ones. They do well on some tasks but are weaker at the complex reasoning that CUDA code needs.
4. ReGraphT is a new way to help small models reason better by reusing the reasoning of big models, which is stored in a graph and explored with a search method.
5. ReGraphT helps small models produce faster CUDA code than other methods, without needing lots of computing power.

Definitions

- CUDA: A programming platform for making programs run faster by performing many calculations at once on a GPU.
- Language Models: Computer programs that can understand and generate human-like text or code.
- Privacy: Keeping information safe and secret from others who shouldn't see it.
- Reasoning: Thinking about things logically to solve problems or make decisions.
- Graphs: Structures that represent connections between different pieces of information or ideas.

Introduction

In recent years, there has been a growing demand for efficient utilization of GPUs in parallel processing environments. However, challenges persist in effectively harnessing the power of these massively parallel engines. CUDA programming and domain-specific libraries have advanced, but optimizing CUDA code remains complex. Large language models (LLMs) have shown promise in generating optimized CUDA code from sequential code. In practice, though, cloud-based APIs pose privacy risks and local deployment carries high computational costs. This has sparked interest in small language models (SLMs), which offer a more lightweight and privacy-friendly alternative. Recent studies have demonstrated that SLMs can achieve performance comparable to LLMs on specific tasks. However, their limited reasoning abilities result in suboptimal performance in complex CUDA generation. To address this gap, the authors propose ReGraphT.

The ReGraphT Framework

ReGraphT is a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. It aims to enhance the reasoning capabilities of SLMs by organizing CUDA optimization trajectories into a structured reasoning graph and leveraging Monte Carlo Graph Search (MCGS) for efficient exploration. The work comprises an evaluation benchmark and two framework components:

CUDA-Specific Benchmark

To evaluate models comprehensively, a new benchmark specifically designed for CUDA optimization tasks has been introduced. This benchmark includes difficulty tiers based on reasoning complexity.

Reasoning Graph

ReGraphT organizes CUDA optimization trajectories into a structured reasoning graph, modeling combined CUDA optimizations as state transitions. The graph captures how one optimization step leads to the next, which allows for more efficient exploration during code generation.
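To make the idea of a reasoning graph concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's data structure: the class name `ReasoningGraph`, the choice to represent a state as the set of optimizations applied so far, and the optimization names are all hypothetical.

```python
from collections import defaultdict


class ReasoningGraph:
    """Toy reasoning graph (hypothetical design, not the paper's).

    A state is the set of optimizations applied so far; an edge is one
    additional optimization step leading to a new state.
    """

    def __init__(self):
        # state -> {optimization name: next state}
        self.edges = defaultdict(dict)

    def add_trajectory(self, optimizations):
        """Insert one optimization trajectory, step by step."""
        state = frozenset()
        for opt in optimizations:
            nxt = state | {opt}
            self.edges[state][opt] = nxt
            state = nxt
        return state

    def next_steps(self, state):
        """Optimizations that some recorded trajectory applied from this state."""
        return sorted(self.edges.get(state, {}))


graph = ReasoningGraph()
# Hypothetical trajectories, e.g. as they might be collected from an LLM.
graph.add_trajectory(["parallelize_outer_loop", "shared_memory_tiling", "loop_unrolling"])
graph.add_trajectory(["parallelize_outer_loop", "memory_coalescing"])
graph.add_trajectory(["memory_coalescing", "parallelize_outer_loop"])

print(graph.next_steps(frozenset()))
print(graph.next_steps(frozenset({"parallelize_outer_loop"})))
```

Because states are sets, the second and third trajectories end in the same node even though they apply the two optimizations in a different order. That merging is what distinguishes a graph from a tree of trajectories in this sketch.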

Monte Carlo Graph Search (MCGS)

MCGS is used to traverse the reasoning graph and identify promising optimization paths for code generation. Searching the graph rather than enumerating every path makes exploration more efficient, which in turn improves the generated code.
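The following sketch shows one generic way a Monte Carlo search over a graph can work. It assumes a UCB1 selection rule and visit statistics stored per node, so that paths reaching the same state share what they learned. The toy graph, the speedup table, and the function names are hypothetical; the paper's actual MCGS procedure may differ, and a real reward would come from compiling and timing a generated kernel.

```python
import math

# Hypothetical toy graph: state -> {optimization: next state}.
GRAPH = {
    "naive": {"coalesce": "coalesced", "tile": "tiled"},
    "coalesced": {"tile": "tiled+coalesced"},
    "tiled": {"coalesce": "tiled+coalesced", "unroll": "tiled+unrolled"},
    "tiled+coalesced": {},
    "tiled+unrolled": {},
}

# Made-up speedups standing in for real kernel measurements.
SPEEDUP = {"naive": 1.0, "coalesced": 1.4, "tiled": 1.8,
           "tiled+coalesced": 2.6, "tiled+unrolled": 2.0}


def measured_speedup(state):
    return SPEEDUP[state]


def mcgs(graph, reward, root, iterations=200, c=1.4):
    """Return the path with the best mean reward found by UCB1-guided search."""
    visits = {s: 0 for s in graph}
    value = {s: 0.0 for s in graph}

    def ucb(parent, child):
        if visits[child] == 0:
            return math.inf  # always try unvisited states first
        mean = value[child] / visits[child]
        bonus = c * math.sqrt(math.log(visits[parent] + 1) / visits[child])
        return mean + bonus

    for _ in range(iterations):
        state, path = root, [root]
        while graph[state]:  # walk down until a state with no outgoing edges
            state = max(sorted(graph[state].values()),
                        key=lambda child: ucb(state, child))
            path.append(state)
        r = reward(state)
        for s in path:  # back up the reward along the visited path
            visits[s] += 1
            value[s] += r

    def mean_value(s):
        return value[s] / visits[s] if visits[s] else 0.0

    state, best = root, [root]
    while graph[state]:
        state = max(sorted(graph[state].values()), key=mean_value)
        best.append(state)
    return best


print(mcgs(GRAPH, measured_speedup, "naive"))
```

Keeping statistics per node rather than per path means the shared state `tiled+coalesced` accumulates evidence from both routes that reach it.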

Experimental Results

Experimental results show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches. On CUDAEval and ParEval, ReGraphT achieved an average 2.33X speedup. When combined with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without compromising privacy or requiring excessive computing resources.
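For readers unfamiliar with the metric, a speedup is the reference running time divided by the optimized running time, computed per task and then averaged. The text here does not say which kind of average is used, so the sketch below shows both an arithmetic and a geometric mean. The timings are made up for illustration and are not results from the paper.

```python
from statistics import geometric_mean, mean


def speedups(reference_ms, optimized_ms):
    """Per-task speedup: reference time divided by optimized time."""
    return [ref / opt for ref, opt in zip(reference_ms, optimized_ms)]


# Hypothetical timings in milliseconds for three tasks.
reference_ms = [12.0, 8.0, 30.0]
optimized_ms = [6.0, 4.0, 7.5]

per_task = speedups(reference_ms, optimized_ms)
print("per task:", per_task)
print("arithmetic mean:", mean(per_task))
print("geometric mean:", geometric_mean(per_task))
```

The two averages can differ noticeably when a few tasks have very large speedups, so it is worth checking which one a reported figure uses.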

Implications and Future Research

The emergence of SLMs presents a promising solution for optimizing CUDA code generation while maintaining efficiency and privacy standards. With ongoing research in this direction, we can expect further enhancements in the capabilities of SLMs through innovative frameworks like ReGraphT. Future studies could explore the potential of incorporating additional techniques such as transfer learning or reinforcement learning to further improve the performance of SLMs. Additionally, expanding the benchmark to include a wider range of CUDA optimization tasks could provide a more comprehensive evaluation of model performance.

Conclusion

In conclusion, ReGraphT offers a promising way to enhance the reasoning capabilities of small language models in complex CUDA code generation. By leveraging a structured reasoning graph and Monte Carlo Graph Search, ReGraphT enables SLMs to approach LLM-level performance without compromising privacy or requiring excessive computing resources. Ongoing research in this area holds great potential for improving GPU utilization in parallel processing environments.

Created on 20 Jun. 2026
