Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models

AI-generated keywords: Nanbeige4-3B

AI-generated Key Points

Nanbeige4-3B is a family of high-performing language models pretrained on 23T tokens and finetuned on over 30 million diverse instructions.
The model incorporates a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler in pre-training to extend the scaling law for small language models, boosting performance through progressive refinement of data mixtures.
Post-training includes improving SFT data quality through deliberative generation refinement and chain-of-thought reconstruction, leading to gains on complex tasks.
Dual Preference Distillation (DPD) is used to distill the authors' flagship reasoning model into Nanbeige4-3B, resulting in further performance improvements.
A multi-stage reinforcement learning phase enhances reasoning and human alignment abilities using verifiable rewards and preference modeling.

Authors: Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Wei Ruan, Xiaoqi Liu, Xiaoxue Cheng, Xiyun Xu, Yang Song, Yanzipeng Gao, Yiming Jia, Yun Xing, Yuntao Wen, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen

arXiv: 2512.06266v1 - DOI (cs.CL)

License: CC BY 4.0

Abstract: We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, we extend the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, a multi-stage reinforcement learning phase was applied, leveraging verifiable rewards and preference modeling to strengthen abilities on both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.

Submitted to arXiv on 06 Dec. 2025

Results of the summarizing process for the arXiv paper: 2512.06266v1

We introduce Nanbeige4-3B, a family of high-performing language models that have been pretrained on 23T tokens and finetuned on over 30 million diverse instructions. Our model extends the boundary of the scaling law for small language models by incorporating a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler in pre-training. This progressive refinement of data mixtures across stages boosts model performance.

In post-training, we improve the quality of SFT data through a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, resulting in substantial gains on complex tasks. Following SFT, our flagship reasoning model is used to distill Nanbeige4-3B through Dual Preference Distillation (DPD), leading to further performance improvements. A multi-stage reinforcement learning phase leverages verifiable rewards and preference modeling to enhance abilities in both reasoning and human alignment. Our extensive evaluations demonstrate that Nanbeige4-3B outperforms models of comparable parameter scale and rivals much larger models across various benchmarks.

The post-training pipeline includes a Cold Start Supervised Fine-tuning stage aimed at establishing a robust foundation for reasoning by focusing on high-quality reasoning data such as mathematical, code, and subject-area problem-solving tasks. Scaling SFT Instructions from hundreds of thousands to tens of millions of examples continues to produce substantial improvements on challenging reasoning benchmarks without early saturation. Overall Supervised Fine-Tuning further enhances the model's general abilities and task diversity by combining general conversation and writing data with agent-style interaction data, harder reasoning tasks, and code-related tasks. Deliberative generation refinement combined with chain completion is employed to improve the model's output quality on complex tasks.
In conclusion, Nanbeige4-3B demonstrates superior performance compared to models of similar size across a wide range of benchmarks. Model checkpoints are available at https://huggingface.co/Nanbeige for further exploration and evaluation purposes.

Summary

- Nanbeige4-3B is a smart computer program that helps with understanding and using words better. It has been trained on lots of words and instructions to make it very good at its job.
- The program uses a special training method called FG-WSD, which gradually improves the mix of text it learns from so that a smaller model can keep getting better at understanding language.
- After the initial training, the example answers the program learns from are improved by refining them and reconstructing the reasoning steps behind them.
- Another technique called DPD is used to make the program even smarter by having it learn from the authors' flagship reasoning model.
- Lastly, the program goes through a learning phase where it gets better at reasoning and at responding the way people prefer, by receiving rewards for correct answers and by modeling human preferences.

Definitions

1. Language model: A computer program designed to understand and generate human language.
2. Pretrained: Already taught or trained before being used for specific tasks.
3. Finetuned: Adjusted or improved further after initial training to enhance performance.
4. Reinforcement learning: A type of machine learning where an algorithm learns through trial and error based on rewards received for its actions.

Introducing Nanbeige4-3B: A High-Performing Language Model for Complex Tasks

Language models have become a crucial tool in natural language processing (NLP), able to generate human-like text and perform a wide range of language-based tasks. Performance on complex tasks has generally improved as these models have grown larger, which raises the question of how far a small model can be pushed. In this blog article, we discuss the recent technical report "Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models", which introduces a family of language models that outperforms other models of comparable size on various benchmarks.

The Need for Scaling Law in Small Language Models

Scaling laws are an important concept in NLP: they describe how a language model's performance improves predictably as model size, training data, and compute increase. This has been observed in various studies and has led to the development of larger and more powerful models such as GPT-3, with 175 billion parameters. However, there is still room for improvement in smaller language models. In this report, the authors introduce Nanbeige4-3B, a family of high-performing language models pretrained on 23T tokens and finetuned on over 30 million diverse instructions. The authors state that they extend the boundary of the scaling law for small language models, in part through a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler used during pre-training.
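
For readers who want the idea in symbols, one commonly cited empirical form of a scaling law relates test loss to parameter count as a power law. This is a general illustration and an assumption on our part; the Nanbeige4-3B report may use a different formulation, and the constants below are fitted per model family rather than taken from the paper.

```latex
\[
  L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
\]
% L        : test loss of the trained model
% N        : number of model parameters
% N_c      : fitted constant (assumed; not from the Nanbeige4-3B report)
% \alpha_N : fitted exponent, with \alpha_N > 0 so loss falls as N grows
```

Under this form, a small model sits at a fixed N, so the remaining levers are data quantity, data quality, and the training recipe, which is where the report focuses.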

Progressive Refinement through FG-WSD Training Scheduler

The FG-WSD training scheduler progressively refines data mixtures across stages during pre-training, leading to improved model performance. According to the authors, this helps Nanbeige4-3B achieve better results than other small-scale language models.
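
To make the idea concrete, here is a minimal sketch of a warmup-stable-decay learning-rate schedule paired with a per-stage data mixture. It is not the authors' implementation: the stage names, step boundaries, learning rates, and mixture weights are all hypothetical values chosen for illustration, and the summary above does not say how FG-WSD sets them.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    end_step: int   # first step that is NOT part of this stage
    mixture: dict   # hypothetical sampling weight per data source


def wsd_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
           warmup_steps=1_000, decay_steps=2_000):
    """Linear warmup, constant plateau, then linear decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + frac * (min_lr - peak_lr)


def stage_for(step, stages):
    """Return the first stage whose end_step lies beyond `step`."""
    for stage in stages:
        if step < stage.end_step:
            return stage
    return stages[-1]


# All values below are assumptions for illustration only.
TOTAL_STEPS = 10_000
STAGES = [
    Stage("warmup", 1_000, {"web": 0.7, "code": 0.2, "math": 0.1}),
    Stage("stable_diverse", 5_000, {"web": 0.6, "code": 0.25, "math": 0.15}),
    Stage("stable_high_quality", 8_000, {"web": 0.4, "code": 0.3, "math": 0.3}),
    Stage("decay", 10_000, {"web": 0.2, "code": 0.4, "math": 0.4}),
]

for step in (0, 4_000, 7_000, 9_999):
    stage = stage_for(step, STAGES)
    print(step, stage.name, f"{wsd_lr(step, TOTAL_STEPS):.2e}", stage.mixture)
```

The point of the sketch is the pairing: the learning rate follows the usual three phases, while the data mixture changes at finer-grained stage boundaries, here by splitting the stable phase into two sub-stages with different source weights.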

Improving Quality through Joint Mechanism

In post-training, the authors improve the quality of the supervised fine-tuning (SFT) data using a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction. This results in substantial gains on complex tasks.
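
The summary does not spell out how deliberative generation refinement works. One common shape for iteratively improving generated training data is a generate, critique, revise loop, and the sketch below shows only that generic loop. The function names (`refine_response`, `toy_generate`, `toy_critique`, `toy_revise`) are hypothetical stand-ins; in practice each would call a model, and the authors' pipeline may be structured differently.

```python
from typing import Callable, Optional


def refine_response(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Draft a response, then revise it until the critic has no feedback."""
    draft = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, draft)
        if feedback is None:  # critic found nothing to fix
            break
        draft = revise(prompt, draft, feedback)
    return draft


# Toy stand-ins so the loop can be run without any model.
def toy_generate(prompt: str) -> str:
    return "2+2=5"


def toy_critique(prompt: str, draft: str) -> Optional[str]:
    return "arithmetic error" if draft.endswith("5") else None


def toy_revise(prompt: str, draft: str, feedback: str) -> str:
    return draft[:-1] + "4"


print(refine_response("What is 2+2?", toy_generate, toy_critique, toy_revise))
```

Bounding the loop with `max_rounds` keeps the cost predictable when the critic never becomes satisfied.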

Dual Preference Distillation for Further Performance Improvements

To enhance the model's performance even further, the authors use their flagship reasoning model as a teacher and distill it into Nanbeige4-3B with the proposed Dual Preference Distillation (DPD) method. This step follows SFT and leads to further performance gains.

Multi-stage Reinforcement Learning for Reasoning and Alignment

Finally, the researchers apply a multi-stage reinforcement learning phase that leverages verifiable rewards and preference modeling to strengthen both reasoning and human alignment. Task diversity is handled earlier, in the overall supervised fine-tuning stage, which combines general conversation and writing data with agent-style interaction data, harder reasoning tasks, and code-related tasks.
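
The paper's abstract describes the reinforcement learning phase as leveraging verifiable rewards. As a rough illustration of what a verifiable reward can look like, the sketch below scores a response by checking its final answer against a known reference. The `\boxed{...}` answer convention, the normalization rule, and the function names are assumptions for this example, not details from the report; the preference-modeling side, which would typically rely on a learned reward model, is not shown.

```python
import re


def extract_final_answer(response: str):
    """Return the content of the last boxed answer in the response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Drop all whitespace and lowercase, so '1 / 2' and '1/2' compare equal."""
    return "".join(text.split()).lower()


def verifiable_reward(response: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the reference, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0  # no checkable answer was produced
    return 1.0 if normalize(answer) == normalize(reference) else 0.0


print(verifiable_reward("Adding gives \\boxed{42}.", "42"))
print(verifiable_reward("Adding gives \\boxed{41}.", "42"))
print(verifiable_reward("I am not sure.", "42"))
```

Because the reward is computed by a rule rather than a model, it cannot be gamed by fluent but wrong answers, which is the usual motivation for using it on math and code tasks.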

Cold Start Supervised Fine-tuning for Robust Foundation

To establish a robust foundation for reasoning, Nanbeige4-3B undergoes a Cold Start Supervised Fine-tuning stage. This stage focuses on high-quality reasoning data such as mathematical problems, coding tasks, and subject-area problem-solving tasks. The authors also report that scaling SFT instructions from hundreds of thousands to tens of millions of examples continues to produce substantial improvements on challenging reasoning benchmarks without early saturation.

Conclusion: Superior Performance Across Various Benchmarks

In conclusion, Nanbeige4-3B demonstrates superior performance compared to models of similar size across a wide range of benchmarks and, according to the authors, rivals much larger models. Its training pipeline combines the FG-WSD training scheduler in pre-training with several post-training techniques: Cold Start Supervised Fine-tuning, the joint mechanism for improving SFT data quality, DPD, and multi-stage reinforcement learning. If you are interested in exploring or evaluating these models further, the checkpoints are available at https://huggingface.co/Nanbeige.

Created on 15 Dec. 2025
