Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models

AI-generated keywords: Nanbeige4-3B

AI-generated Key Points

  • Nanbeige4-3B is a family of high-performing language models pretrained on 23T tokens and finetuned on over 30 million diverse instructions.
  • The model incorporates a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler in pre-training to extend the scaling law for small language models, boosting performance through progressive refinement of data mixtures.
  • Post-training includes improving SFT data quality through deliberative generation refinement and chain-of-thought reconstruction, leading to gains on complex tasks.
  • The flagship reasoning model is used to distill Nanbeige4-3B through Dual Preference Distillation (DPD), resulting in further performance improvements.
  • A multi-stage reinforcement learning phase enhances reasoning and human alignment abilities using verifiable rewards and preference modeling.

Authors: Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Wei Ruan, Xiaoqi Liu, Xiaoxue Cheng, Xiyun Xu, Yang Song, Yanzipeng Gao, Yiming Jia, Yun Xing, Yuntao Wen, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen

License: CC BY 4.0

Abstract: We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, we extend the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, a multi-stage reinforcement learning phase was applied, leveraging verifiable rewards and preference modeling to strengthen abilities on both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.
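To make the scheduler idea concrete, here is a minimal sketch of a generic warmup-stable-decay learning-rate schedule paired with a stage-dependent data mixture. The abstract does not give the actual FG-WSD hyperparameters, so the learning rates, step counts, stage boundaries, source names, and mixture weights below are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: every number and name here is an assumption,
# not a value taken from the Nanbeige4-3B report.

def wsd_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
           warmup_steps=10, decay_steps=100):
    """Generic warmup-stable-decay learning rate with a linear decay phase."""
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: move linearly from the peak to the minimum learning rate.
    progress = min(1.0, (step - stable_end) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Hypothetical stages: (fraction of training at which the stage starts, mixture).
STAGES = [
    (0.0, {"web": 0.8, "code": 0.1, "math": 0.1}),
    (0.5, {"web": 0.6, "code": 0.2, "math": 0.2}),
    (0.9, {"web": 0.3, "code": 0.3, "math": 0.4}),
]


def mixture_at(step, total_steps):
    """Return the data mixture of the latest stage that has already started."""
    frac = step / total_steps
    current = STAGES[0][1]
    for start, mix in STAGES:
        if frac >= start:
            current = mix
    return current


if __name__ == "__main__":
    for s in (0, 5, 500, 950, 1000):
        print(s, wsd_lr(s, 1000), mixture_at(s, 1000))
```

In this sketch the mixture shifts toward the assumed higher-quality sources as training enters the later stages, which is one plausible reading of "progressively refines data mixtures across stages".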

Submitted to arXiv on 06 Dec. 2025


Results of the summarizing process for the arXiv paper: 2512.06266v1

We introduce Nanbeige4-3B, a family of high-performing language models pretrained on 23T tokens and finetuned on over 30 million diverse instructions. Our model extends the boundary of the scaling law for small language models by incorporating a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler in pre-training, which progressively refines data mixtures across stages to boost model performance.

In post-training, we improve the quality of SFT data through a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, resulting in substantial gains on complex tasks. Following SFT, our flagship reasoning model is used to distill Nanbeige4-3B through Dual Preference Distillation (DPD), leading to further performance improvements. A multi-stage reinforcement learning phase then leverages verifiable rewards and preference modeling to enhance abilities in both reasoning and human alignment. Our extensive evaluations demonstrate that Nanbeige4-3B outperforms models of comparable parameter scale and rivals much larger models across various benchmarks.

The post-training pipeline includes a Cold Start Supervised Fine-tuning stage that establishes a robust foundation for reasoning by focusing on high-quality reasoning data such as mathematical, code, and subject-area problem-solving tasks. Scaling SFT instructions from hundreds of thousands to tens of millions of examples continues to produce substantial improvements on challenging reasoning benchmarks without early saturation. Overall Supervised Fine-Tuning further enhances the model's general abilities and task diversity by combining general conversation and writing data with agent-style interaction data, harder reasoning tasks, and code-related tasks. Deliberative generation refinement combined with chain completion is employed to improve the model's output quality on complex tasks.

In conclusion, Nanbeige4-3B demonstrates superior performance compared to models of similar size across a wide range of benchmarks. Model checkpoints are available at https://huggingface.co/Nanbeige.
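As an illustration of what a "verifiable reward" can look like in a reinforcement learning phase, the sketch below scores a response by exact match against a known reference answer. The summary does not describe the reward functions actually used, so the `\boxed{...}` answer convention, the function names, and the binary 0/1 scoring are assumptions made for this example.

```python
import re

# Hypothetical convention: the final answer appears inside \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response):
    """Return the content of the last \\boxed{...} span, or None if absent."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response, reference):
    """Assumed binary reward: 1.0 on an exact (whitespace-insensitive) match."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    same = answer.replace(" ", "") == reference.replace(" ", "")
    return 1.0 if same else 0.0


if __name__ == "__main__":
    print(verifiable_reward(r"Thus the result is \boxed{42}.", "42"))
    print(verifiable_reward("I think it is 42.", "42"))
```

A rule-based check like this can be computed without a learned judge, which is why it is typically paired with preference modeling for tasks that have no single checkable answer.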
Created on 15 Dec. 2025

