DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

AI-generated keywords: DiffuCoder

AI-generated Key Points

  • Diffusion large language models (dLLMs) are explored as alternatives to autoregressive models for code generation
  • dLLMs offer global planning and iterative refinement features that are beneficial in coding tasks
  • Authors conduct a systematic investigation into denoising processes and reinforcement learning methods of dLLMs in coding
  • A 7B-parameter dLLM, "DiffuCoder", is trained on 130B tokens of code and used as a testbed for analysis
  • Key differences between dLLMs and autoregressive models are uncovered, including that dLLMs can decide how causal their generation should be without relying on semi-autoregressive decoding
  • Increasing sampling temperature diversifies token choices and alters generation order, creating a rich search space for reinforcement learning rollouts
  • A novel sampling scheme called "coupled-GRPO" is proposed for reinforcement learning training to improve performance on code generation benchmarks
  • Coupled-GRPO improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) while reducing its reliance on autoregressive causality during decoding
  • Deeper insights into dLLM generation mechanics are provided, along with an effective diffusion-native RL training framework
  • Practical considerations such as faster generation speeds with diffusion models compared to autoregressive ones are discussed

Authors: Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang

preprint
License: CC BY-NC-SA 4.0

Abstract: Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR causality during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder
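The abstract does not spell out how the "complementary mask noise" is built. One plausible reading is that each completion is noised twice, with the second mask being the exact complement of the first, so every token is masked in exactly one of the two copies. The sketch below illustrates only that idea; the function name `complementary_masks` and its parameters are hypothetical and are not taken from the paper or its repository.

```python
import random


def complementary_masks(length, mask_ratio, rng):
    """Return two boolean masks that together cover every position exactly once.

    True means "this position is masked (noised)" in that copy.
    This is an illustrative assumption, not the paper's implementation.
    """
    n_masked = round(length * mask_ratio)
    masked_positions = set(rng.sample(range(length), n_masked))
    first = [i in masked_positions for i in range(length)]
    second = [not m for m in first]  # complement of the first mask
    return first, second


rng = random.Random(0)
first, second = complementary_masks(length=8, mask_ratio=0.25, rng=rng)
```

Under this reading, each token would receive a prediction from one of the two partially masked copies, which is one intuitive route to lower-variance log-likelihood estimates than a single random mask. The estimator actually used in the paper may differ.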

Submitted to arXiv on 25 Jun. 2025

Results of the summarizing process for the arXiv paper: 2506.20639v1

In their paper titled "DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation," authors Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang explore the potential of diffusion large language models (dLLMs) as alternatives to autoregressive models for code generation. Because the denoising model of a dLLM operates over the entire sequence, it offers global planning and iterative refinement, features that are particularly beneficial in coding tasks. Despite this promise, the training and inference mechanisms for dLLMs in coding remain under-explored.

To shed light on the decoding behavior of dLLMs and improve their effectiveness in coding tasks, the authors conduct a systematic investigation into their denoising processes and reinforcement learning methods. They train a 7B dLLM named "DiffuCoder" on 130B tokens of code to serve as a testbed for their analysis. The study uncovers two key differences between dLLMs and autoregressive models. First, dLLMs can decide how causal their generation should be without relying on semi-autoregressive decoding. Second, increasing the sampling temperature diversifies not only token choices but also the order in which tokens are generated, creating a rich search space for reinforcement learning rollouts.

For reinforcement learning training, the authors propose a novel sampling scheme called "coupled-GRPO", which reduces the variance of token log-likelihood estimates while maintaining training efficiency. This approach significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) while reducing reliance on autoregressive causality during decoding. Overall, the work provides deeper insight into the mechanics of dLLM generation and presents an effective diffusion-native RL training framework.

The paper also discusses related work on text diffusion models, from early explorations in continuous space to later discrete diffusion models, and raises practical considerations such as faster generation with diffusion models compared to autoregressive ones. Section 3 of the paper, which details DiffuCoder's architecture and design principles aimed at improving code correctness and quality, is highlighted as a central component of the research.
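The temperature observation in the summary can be illustrated with a toy decoder. The sketch below assumes a confidence-based unmasking rule in which, at each step, a token is sampled for every still-masked position and the position whose sampled token has the highest probability is revealed first. That rule, the fixed toy logits, and the names `softmax` and `decode_order` are assumptions made for illustration; the summary does not describe DiffuCoder's actual decoding procedure, and a real model would recompute its logits after every step.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def decode_order(position_logits, temperature, rng):
    """Reveal one position per step and return the order of revealed positions."""
    remaining = set(range(len(position_logits)))
    order = []
    while remaining:
        best_pos, best_conf = None, -1.0
        for pos in sorted(remaining):
            probs = softmax(position_logits[pos], temperature)
            token = rng.choices(range(len(probs)), weights=probs)[0]
            # Confidence is the probability of the token that was sampled.
            if probs[token] > best_conf:
                best_pos, best_conf = pos, probs[token]
        order.append(best_pos)
        remaining.remove(best_pos)
    return order


# Toy logits for three positions over a three-token vocabulary (hypothetical values).
toy_logits = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
low_t_orders = {tuple(decode_order(toy_logits, 0.1, random.Random(s))) for s in range(20)}
high_t_orders = {tuple(decode_order(toy_logits, 1.5, random.Random(s))) for s in range(20)}
print(len(low_t_orders), len(high_t_orders))
```

At a low temperature the sampled token is almost always the most likely one, so the reveal order is nearly fixed. At a higher temperature the sampled tokens, and therefore their confidences, vary from run to run, and the reveal order can vary with them. Comparing the two printed counts across seeds gives a rough feel for the effect.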
Created on 03 Jul. 2025
