Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

AI-generated keywords: Reinforcement Learning, Large Language Models, Policy Gradient Methods, Training Stability, Stabilizing Techniques

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The authors propose a novel formulation for reinforcement learning (RL) with large language models
  • The formulation explains why, and under what conditions, a surrogate token-level objective in policy gradient methods such as REINFORCE optimizes the true sequence-level reward: the surrogate is a first-order approximation that holds only when both training-inference discrepancy and policy staleness are small
  • This explains the crucial role of importance sampling correction, clipping, and Routing Replay (for Mixture-of-Experts models) in stabilizing RL training
  • For on-policy training, the basic policy gradient algorithm with importance sampling correction yields the highest training stability
  • When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness
  • Once training is stabilized, prolonged optimization yields comparable final performance regardless of cold-start initialization

Authors: Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, Junyang Lin

Abstract: This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
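
The first-order argument mentioned in the abstract can be illustrated with a standard importance-sampling expansion. The sketch below is not taken from the paper (this page only has access to its metadata); the notation is assumed for illustration: \pi_\theta is the policy being trained, \mu is the policy that actually generated the response y, R(y) is the sequence-level reward, and conditioning on the prompt is omitted.

```latex
% Assumed notation: \pi_\theta = trained policy, \mu = rollout policy,
% R(y) = sequence-level reward, r_t = token-level importance ratio.
\begin{align*}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta}\big[ R(y) \big]
           = \mathbb{E}_{y \sim \mu}\!\left[ \frac{\pi_\theta(y)}{\mu(y)}\, R(y) \right],
\qquad
\frac{\pi_\theta(y)}{\mu(y)} = \prod_{t=1}^{|y|} r_t, \quad
r_t = \frac{\pi_\theta(y_t \mid y_{<t})}{\mu(y_t \mid y_{<t})} \\
\prod_{t=1}^{|y|} r_t &= \prod_{t=1}^{|y|} \big(1 + (r_t - 1)\big)
  \approx 1 + \sum_{t=1}^{|y|} (r_t - 1)
  \quad \text{(first order in } r_t - 1\text{)} \\
\nabla_\theta J(\theta) &\approx
  \mathbb{E}_{y \sim \mu}\!\left[ R(y) \sum_{t=1}^{|y|} r_t \,
  \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \right]
\end{align*}
```

The terms dropped in the second line are products of two or more deviations (r_t - 1), so the token-level surrogate in the last line tracks the sequence-level objective more closely as every r_t approaches 1. Under this reading, a gap between \mu and \pi_\theta can arise either because the inference engine and the training engine compute slightly different probabilities for the same parameters, or because \mu uses older parameters, which matches the two conditions named in the abstract.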

Submitted to arXiv on 01 Dec. 2025

Results of the summarizing process for the arXiv paper: 2512.01374v1

In their paper titled "Stabilizing Reinforcement Learning with LLMs: Formulation and Practices," authors Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, and Junyang Lin propose a novel formulation for reinforcement learning (RL) with large language models. They explain why, and under what conditions, the true sequence-level reward can be optimized through a surrogate token-level objective in policy gradient methods such as REINFORCE. Using a first-order approximation, the authors show that this surrogate objective becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This result gives a principled explanation for why several widely adopted techniques stabilize RL training: importance sampling correction, clipping, and Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, the authors find that for on-policy training, the basic policy gradient algorithm with importance sampling correction yields the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Once training is stabilized, prolonged optimization consistently leads to comparable final performance regardless of cold-start initialization. The authors hope that the insights and recipes for stable RL training shared in the paper will facilitate future research in this area.
Created on 29 Jan. 2026
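
To make two of the techniques named in the summary concrete, here is a minimal NumPy sketch of a token-level surrogate objective with importance sampling correction and clipping. It is an illustration under stated assumptions, not the paper's implementation: the function name `token_level_surrogate`, its parameters, the PPO-style min-clipping rule, and the default `clip_eps=0.2` are all hypothetical choices made for this example. Routing Replay is not sketched, since the metadata does not describe how it works.

```python
import numpy as np


def token_level_surrogate(logp_train, logp_rollout, rewards, mask, clip_eps=0.2):
    """Clipped token-level surrogate objective (a value to maximize).

    logp_train:   (batch, seq) log-probs of the sampled tokens under the
                  policy being updated.
    logp_rollout: (batch, seq) log-probs of the same tokens under the policy
                  that generated them (inference engine, possibly stale).
    rewards:      (batch,) one scalar per sequence; subtract any baseline
                  before calling (assumption of this sketch).
    mask:         (batch, seq) 1 for response tokens, 0 for padding.
    """
    logp_train = np.asarray(logp_train, dtype=np.float64)
    logp_rollout = np.asarray(logp_rollout, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)

    # Token-level importance ratio: corrects for tokens having been sampled
    # from the rollout policy rather than the policy being trained.
    ratio = np.exp(logp_train - logp_rollout)

    # Every token in a sequence shares that sequence's reward.
    reward_per_token = rewards[:, None]

    # PPO-style clipping (assumed variant): take the more pessimistic of the
    # raw and clipped terms, which removes the incentive to push the ratio
    # further outside [1 - clip_eps, 1 + clip_eps].
    unclipped = ratio * reward_per_token
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * reward_per_token
    per_token = np.minimum(unclipped, clipped)

    # Average over real (unmasked) tokens only.
    return float((per_token * mask).sum() / mask.sum())


if __name__ == "__main__":
    logp = np.log(np.full((2, 3), 0.5))
    mask = np.array([[1, 1, 0], [1, 1, 1]])
    # On-policy case: both log-prob arrays match, so every ratio equals 1.
    print(token_level_surrogate(logp, logp, [1.0, -1.0], mask))
```

When the two log-probability arrays are identical, every ratio is 1 and the value reduces to the masked average of the per-sequence rewards. In an actual training loop the same expression would be written in an autodiff framework and its negative minimized, with the rollout log-probabilities treated as constants.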
