Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

AI-generated keywords: Light-R1

AI-generated Key Points

- Light-R1 is an open-source suite for training long reasoning models in a reproducible and cost-effective manner
- Curriculum training with increasing data difficulty and multi-staged post-training techniques are key components of the methodology
- The Light-R1-32B model outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning
- Fine-tuning DeepSeek-R1-Distilled models with 3,000 challenging examples from the curriculum dataset leads to state-of-the-art 7B and 14B models
- The final model, Light-R1-14B-DS, achieves state-of-the-art performance in math with AIME24 & 25 scores surpassing many other models
- Light-R1 demonstrates strong cross-domain generalization capabilities
- Models, training data, and code are openly available at https://github.com/Qihoo360/Light-R1

Authors: Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang

arXiv: 2503.10460v4 (cs.CL)

v4: ACL'25 industry track camera ready; v3: minor modifications; v2: better writing & format for later submission; all released at https://github.com/Qihoo360/Light-R1

License: CC BY 4.0

Abstract: This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.

Submitted to arXiv on 13 Mar. 2025

Results of the summarizing process for the arXiv paper: 2503.10460v4

Comprehensive Summary

In this paper, we introduce Light-R1, an open-source suite designed for training long reasoning models in a reproducible and cost-effective manner. Our methodology involves curriculum training that gradually increases the difficulty of the data, coupled with multi-staged post-training techniques. The Light-R1-32B model demonstrates superior performance in math reasoning compared to DeepSeek-R1-Distill-Qwen-32B. Through experimental results, we show that our curriculum approach is most effective when diverse datasets are available for different training stages. Fine-tuning DeepSeek-R1-Distilled models with 3,000 challenging examples from our curriculum dataset has led to state-of-the-art 7B and 14B models. Additionally, the 32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our research by implementing GRPO on long reasoning models. Our final model, Light-R1-14B-DS, achieves state-of-the-art performance among 14B models in math with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite its focus on math training, Light-R1-14B-DS showcases strong cross-domain generalization capabilities. Overall, Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data, and code are openly available at https://github.com/Qihoo360/Light-R1.

Layman's Summary

1. Light-R1 is a free tool for training smart computers to solve hard problems in a cheap and repeatable way.
2. Training starts easy and gets harder, with extra learning steps after, which are important parts of the process.
3. Light-R1-32B model is better at math than DeepSeek-R1-Distill-Qwen-32B model.
4. Making small changes to DeepSeek-R1-Distilled models using tough examples makes them really good.
5. The best model, Light-R1-14B-DS, does great at math tests and beats many other models.

Definitions

- Open-source: Free software where anyone can see and change the code.
- Reproducible: Can be done again in the same way to get the same results.
- Cost-effective: Doesn't cost much money to use or make.
- Fine-tuning: Making small adjustments to improve something that's already good.
- State-of-the-art: The best available right now in terms of performance or quality.

Introduction

In recent years, there has been a growing interest in developing sophisticated reasoning models that can perform complex tasks such as math reasoning. However, training these models is often time-consuming and expensive, making it challenging for researchers to explore new approaches and techniques. In this research paper, the authors introduce Light-R1, an open-source suite designed to address these challenges and make long reasoning models more accessible.

The Need for Light-R1

The authors highlight the need for a cost-effective and reproducible approach to training long reasoning models. They point out that the data used to train the DeepSeek-R1 series is proprietary, which makes those results hard to reproduce, so they build Light-R1 exclusively on public data and models. Without public data and code, it is difficult for researchers to reproduce results and compare them with others.

The Methodology of Light-R1

The core methodology of Light-R1 combines curriculum training with multi-staged post-training. Curriculum training progressively increases the difficulty of the training data, so the model learns from simpler examples before moving on to harder ones. Light-R1 also uses distinct datasets at different stages of training, and the authors report that the curriculum approach becomes more effective when distinct, diverse datasets are available for the different stages.
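The staging idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the `Example` type, the `pass_rate` difficulty proxy, the `build_curriculum` helper, and the cutoff values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Example:
    prompt: str
    # Hypothetical difficulty proxy: fraction of attempts that a reference
    # model answers correctly (lower means harder).
    pass_rate: float


def build_curriculum(
    examples: Sequence[Example],
    max_pass_rates: Sequence[float] = (1.0, 0.5),
) -> List[List[Example]]:
    """Return one training set per stage; later stages keep only harder examples.

    The cutoffs are illustrative assumptions and must be non-increasing.
    """
    if any(a < b for a, b in zip(max_pass_rates, max_pass_rates[1:])):
        raise ValueError("cutoffs must be non-increasing so difficulty rises")
    return [
        [ex for ex in examples if ex.pass_rate <= cutoff]
        for cutoff in max_pass_rates
    ]


pool = [Example("easy", 0.9), Example("medium", 0.4), Example("hard", 0.1)]
stages = build_curriculum(pool)
for i, stage in enumerate(stages):
    print(f"stage {i}: {[ex.prompt for ex in stage]}")
```

In this sketch the first stage trains on the whole pool and the second stage revisits only the harder subset; a real setup would also need a way to estimate difficulty, which is left out here.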

Experimental Results

To evaluate their approach, the authors compared their 32B model, Light-R1-32B (trained from Qwen2.5-32B-Instruct), against DeepSeek-R1-Distill-Qwen-32B on math reasoning tasks, and Light-R1-32B came out ahead. They also fine-tuned DeepSeek-R1-Distilled models with 3,000 challenging examples from their curriculum dataset. This produced state-of-the-art 7B and 14B models, while the 32B variant, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Finally, the authors applied GRPO (Group Relative Policy Optimization), a reinforcement learning method, to long reasoning models. The resulting Light-R1-14B-DS achieved state-of-the-art performance among 14B models in math, with AIME24 and AIME25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B.
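The central step of GRPO, as it is commonly described, is to score each sampled response relative to the other responses drawn for the same prompt, instead of relying on a learned value model. Below is a minimal sketch of that normalization, assuming one scalar reward per response. The function name, the `eps` constant, and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed small constant that avoids division by zero.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: 1.0 = correct, 0.0 = incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal in this formulation.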

Generalization Capabilities

One of the key strengths of Light-R1 is its strong cross-domain generalization capabilities. Despite its focus on math training, the Light-R1-14B-DS model showcased impressive performance in other domains as well. This highlights the potential for this approach to be applied in various real-world applications beyond just math reasoning tasks.

Conclusion

In conclusion, Light-R1 makes sophisticated reasoning models more accessible and easier to implement in real-world applications. Its combination of curriculum training with distinct datasets at different stages produced models that outperform DeepSeek-R1-Distill-Qwen-32B at 32B and reach state-of-the-art math performance at 7B and 14B. The authors have made their models, training data, and code openly available on GitHub, which supports reproducibility and gives other researchers a starting point for cost-effective training of long reasoning models.

Created on 05 Feb. 2026
