Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

AI-generated keywords: Group Relative Policy Optimization

AI-generated Key Points

  • Group Relative Policy Optimization (GRPO) explored in both on-policy and off-policy optimization scenarios
  • Both the on-policy and off-policy GRPO objectives yield an improvement in the reward
  • This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO
  • Off-policy GRPO significantly outperforms or performs comparably to its on-policy counterpart
  • On-policy and off-policy actor-critic methods combine policy gradients with value function estimation for improved learning efficiency
  • Off-policy variants introduce importance weighting techniques for stable updates from off-policy data
  • Mixing on-policy and off-policy methods through approaches like P3O aims to combine the stability of on-policy updates with the efficiency of off-policy learning
  • Off-policy RLHF and other variants of GRPO, introduced within the iterative DPO framework for improved convergence over epochs, are also discussed

Authors: Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jiri Navratil, Jerret Ross, Jesus Rios

License: CC BY 4.0

Abstract: We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.
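The two ingredients named in the abstract, a group-relative advantage and a clipped surrogate objective, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, log-probabilities are treated at the sequence level, and the KL regularization term that GRPO usually includes is omitted. In the on-policy case `logp_behavior` would come from the current policy; in the off-policy case it is assumed to come from an older policy snapshot that generated the samples.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) / group std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_behavior, advantages, clip=0.2):
    """PPO-style clipped surrogate averaged over a group (to be maximized).

    logp_new:      log-probabilities under the policy being optimized
    logp_behavior: log-probabilities under the policy that produced the samples
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio
        clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# Example: four sampled answers to one prompt with binary verifiable rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
objective = clipped_surrogate(
    logp_new=[-1.0, -2.0, -2.0, -1.0],
    logp_behavior=[-1.2, -1.8, -2.1, -1.0],
    advantages=advantages,
)
```

Clipping limits how much a single sample can push the update when the optimized policy has drifted from the one that generated the data, which is the situation the off-policy setting makes more likely.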

Submitted to arXiv on 28 May 2025

Results of the summarizing process for the arXiv paper: 2505.22257v2

In this study, we explore Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which has shown improvements in training stability, sampling efficiency, and memory usage. A recent analysis of GRPO has also suggested that estimating the advantage function using off-policy samples could be beneficial. Building on these insights, we adapt GRPO to the off-policy setting and show that both the on-policy and off-policy GRPO objectives yield an improvement in the reward, which motivates the use of clipped surrogate objectives in the off-policy version of GRPO. Comparing the empirical performance of reinforcement learning with verifiable rewards in post-training under both GRPO variants, we find that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.

We also survey on-policy and off-policy actor-critic methods, which combine policy gradients with value function estimation for improved learning efficiency. Off-policy variants such as the Off-Policy Actor-Critic algorithm and ACER introduce importance weighting to enable stable updates from off-policy data. Approaches such as P3O mix on-policy and off-policy methods, aiming to combine the stability of on-policy updates with the efficiency of off-policy learning. Finally, we discuss off-policy RLHF and other GRPO variants introduced within the iterative DPO framework for improved convergence over epochs. Taken together, these results offer insight into combining on-policy and off-policy strategies for efficient reinforcement learning.
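The importance weighting mentioned in the summary can be illustrated with a small sketch. It assumes a simple truncation scheme in the spirit of ACER, which caps importance weights to control variance; ACER's additional bias-correction term is omitted here, and the function names and the `cap` value are hypothetical choices for illustration only.

```python
import math


def truncated_importance_weights(logp_target, logp_behavior, cap=10.0):
    """Importance ratios pi_target / pi_behavior, truncated at `cap`.

    Truncation bounds the variance of off-policy updates at the cost of bias.
    """
    return [
        min(cap, math.exp(lp_t - lp_b))
        for lp_t, lp_b in zip(logp_target, logp_behavior)
    ]


def weighted_reward_estimate(rewards, logp_target, logp_behavior, cap=10.0):
    """Estimate the target policy's expected reward from behavior-policy samples."""
    weights = truncated_importance_weights(logp_target, logp_behavior, cap)
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


# Example: three samples drawn from an older (behavior) policy.
estimate = weighted_reward_estimate(
    rewards=[1.0, 0.0, 1.0],
    logp_target=[-0.5, -1.5, -2.0],
    logp_behavior=[-1.0, -1.0, -2.0],
)
```

When the target and behavior policies coincide, every weight equals 1 and the estimate reduces to the plain sample mean, which corresponds to the on-policy case.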
Created on 18 Jun. 2025
