Group Sequence Policy Optimization

AI-generated keywords: Group Sequence Policy Optimization, Reinforcement Learning, Large Language Models, Sequence-Level Considerations, Natural Language Processing

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Group Sequence Policy Optimization (GSPO) is a novel reinforcement learning algorithm designed for training large language models.
GSPO leverages sequence likelihood to define the importance ratio and implements sequence-level clipping, rewarding, and optimization strategies.
GSPO outperforms the GRPO algorithm in terms of training efficiency and performance.
GSPO stabilizes Mixture-of-Experts (MoE) RL training and has the potential to simplify the design of reinforcement learning infrastructure.
GSPO has contributed to the improvements in the latest Qwen3 models.
GSPO represents a stable, efficient, and performant solution for training large language models using reinforcement learning techniques.

Authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

arXiv: 2507.18071v1 (cs.LG)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.

Submitted to arXiv on 24 Jul. 2025

Comprehensive Summary

In their paper titled "Group Sequence Policy Optimization," authors Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou and Junyang Lin introduce Group Sequence Policy Optimization (GSPO), a novel reinforcement learning algorithm designed for training large language models. Unlike traditional approaches that rely on token-level importance ratios, GSPO leverages sequence likelihood to define the importance ratio and implements sequence-level clipping, rewarding and optimization strategies. The authors demonstrate that GSPO outperforms the GRPO algorithm in terms of training efficiency and performance. Notably, GSPO is shown to stabilize Mixture-of-Experts (MoE) RL training processes and has the potential to simplify the design of reinforcement learning infrastructure. These advancements have led to significant improvements in the latest Qwen3 models. Overall, GSPO represents a stable, efficient and performant solution for training large language models using reinforcement learning techniques. The algorithm's emphasis on sequence-level considerations sets it apart from existing methods and showcases its potential for driving advancements in natural language processing research.

Summary
- Group Sequence Policy Optimization (GSPO) is a new way to train large language models by letting them practice and rewarding good answers.
- GSPO judges how likely a whole answer is, rather than judging it word by word, and limits how much any single answer can change the model in one step.
- GSPO is better than another learning method called GRPO because it trains more efficiently and gives better results.
- GSPO makes it more stable to train Mixture-of-Experts models, which are built from many specialized sub-networks, and it may make the supporting training systems simpler to design.
- GSPO has contributed to improvements in the latest Qwen3 models.

Definitions
- Reinforcement learning: A way for computers to learn by trying different things and getting rewards for doing well.
- Algorithm: A set of instructions that tells a computer what to do step by step.
- Efficiency: Doing something well without wasting time or effort.
- Performance: How well something works or how good it is at its job.
- Infrastructure: The basic systems and structures needed for something to work properly.

Introduction

In recent years, natural language processing (NLP) has advanced rapidly thanks to large language models, which perform well on tasks such as machine translation, text summarization, and question answering. Training these models with reinforcement learning (RL) remains difficult because of their size and complexity. Existing RL algorithms such as GRPO rely on token-level importance ratios, which can make training unstable, particularly on long sequences. To address this, the authors propose a reinforcement learning algorithm called Group Sequence Policy Optimization (GSPO).

The Problem with Traditional Approaches

Token-level importance ratios are common in existing approaches to training large language models with reinforcement learning. For each token in a sampled response, the ratio compares the probability the current policy assigns to that token with the probability assigned by the policy that generated the response. It corrects for the mismatch between the two policies.

This approach has two limitations. First, the reward is typically assigned to the whole response, whereas the ratio is computed per token, so the correction is applied at a finer granularity than the signal being optimized. Second, each per-token ratio is a noisy estimate, and that noise accumulates over long sequences, which can make training unstable: convergence can be slow or fail altogether. To overcome these limitations, the authors propose GSPO.
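To make the token-level ratio concrete, the following is a minimal sketch, not code from the paper. The function name `token_importance_ratios` and the log-probability values are hypothetical and chosen only for illustration. It shows that each token of one response gets its own ratio, which can sit above or below 1 independently of its neighbours.

```python
import math


def token_importance_ratios(new_logprobs, old_logprobs):
    """Per-token ratio pi_new(token) / pi_old(token), computed from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]


# Hypothetical log-probabilities for a 4-token response (illustrative values only).
old_logprobs = [-1.0, -2.0, -0.5, -1.5]  # policy that sampled the response
new_logprobs = [-0.9, -2.4, -0.5, -1.2]  # policy currently being trained

ratios = token_importance_ratios(new_logprobs, old_logprobs)

for position, ratio in enumerate(ratios):
    # Each token gets a different weight, even though the reward is for the whole response.
    print(f"token {position}: ratio = {ratio:.3f}")
```

In this toy example the first and last tokens have ratios above 1, the second is below 1, and the third is exactly 1, although all four belong to the same rewarded response.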

The Solution: Group Sequence Policy Optimization (GSPO)

The key idea behind GSPO is to define the importance ratio from the likelihood of the entire sequence rather than from individual tokens. Instead of asking how much the policy has changed on each token, GSPO asks how much more or less likely the whole response has become under the current policy.

On top of this ratio, GSPO performs three operations at the sequence level: clipping, rewarding, and optimization.

Sequence-level clipping limits how far the importance ratio of a whole response may move away from 1. A response whose likelihood has drifted too far from the policy that generated it has its contribution to the update capped, which helps prevent unstable training. Clipping here restricts the ratio; it does not truncate the text of the response.

Sequence-level rewarding means that the reward, and the advantage derived from it, is attached to the complete response rather than to individual tokens. As the "Group" in the name suggests, responses to the same prompt are compared with one another as a group.

Finally, sequence-level optimization means that all tokens of a response share the same importance weight in the objective, so the unit being optimized matches the unit being rewarded.
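The sequence-level idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the length normalisation of the ratio and the mean/standard-deviation normalisation of rewards within a group are assumptions about the method, and the clipping range `eps=0.2` is a value borrowed from common PPO practice rather than from the paper.

```python
import math


def sequence_importance_ratio(new_logprobs, old_logprobs):
    """One ratio per response: exp of the mean per-token log-probability change.

    Assumption: the ratio is length-normalised so that long and short
    responses are on a comparable scale.
    """
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


def group_advantages(rewards):
    """Assumed group-relative advantage: reward minus group mean, over group std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


def clipped_sequence_objective(ratios, advantages, eps=0.2):
    """PPO-style clipped objective applied once per response, not once per token."""
    terms = []
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        terms.append(min(s * a, clipped * a))
    return sum(terms) / len(terms)


# Hypothetical group of two responses to the same prompt (illustrative values only).
old = [[-1.0, -2.0, -0.5], [-1.5, -1.5]]
new = [[-0.9, -1.8, -0.5], [-1.6, -1.7]]
rewards = [1.0, 0.0]

ratios = [sequence_importance_ratio(n, o) for n, o in zip(new, old)]
advantages = group_advantages(rewards)
objective = clipped_sequence_objective(ratios, advantages)
print(ratios, advantages, objective)
```

Because the ratio is computed once per response, every token in that response is weighted identically, and the clipping decision is made for the response as a whole.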

Results and Impact

The authors report that GSPO achieves better training efficiency and performance than GRPO. Notably, GSPO stabilizes RL training of Mixture-of-Experts (MoE) models, which matters in practice because many large language models use MoE architectures.

GSPO may also have implications beyond NLP. Defining importance ratios at the sequence level could in principle be applied in other domains with long sequences, such as video processing or speech recognition, although the abstract makes no such claim. According to the authors, GSPO also has the potential to simplify the design of RL infrastructure, which would make reinforcement learning algorithms easier to implement and experiment with.

These merits have contributed to the improvements in the latest Qwen3 models, which further supports the view of GSPO as a stable, efficient, and performant solution for training large language models with reinforcement learning.

Conclusion

In conclusion, Group Sequence Policy Optimization (GSPO) is a reinforcement learning algorithm designed for training large language models. Its sequence-level importance ratio sets it apart from earlier methods that rely on token-level importance ratios. By performing clipping, rewarding, and optimization at the sequence level, GSPO improves on GRPO in both training efficiency and performance, and it stabilizes Mixture-of-Experts RL training. Overall, GSPO is a promising approach for reinforcement learning with large language models and has already contributed to improvements in the latest Qwen3 models.

Created on 28 Aug. 2025
