, , , ,
We introduce Nanbeige4-3B, a family of high-performing language models that have been pretrained on 23T tokens and finetuned on over 30 million diverse instructions. Our model extends the boundary of the scaling law for small language models by incorporating a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler in pre-training. This progressive refinement of data mixtures across stages boosts model performance. In post-training, we improve the quality of SFT data through a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, resulting in substantial gains on complex tasks. Following SFT, our flagship reasoning model is used to distill Nanbeige4-3B through Dual Preference Distillation (DPD), leading to further performance improvements. A multi-stage reinforcement learning phase leverages verifiable rewards and preference modeling to enhance abilities in both reasoning and human alignment. Our extensive evaluations demonstrate that Nanbeige4-3B outperforms models of comparable parameter scale and rivals much larger models across various benchmarks. The post-training pipeline includes a Cold Start Supervised Fine-tuning stage aimed at establishing a robust foundation for reasoning by focusing on high-quality reasoning data such as mathematical, code, and subject-area problem-solving tasks. Scaling SFT Instructions from hundreds of thousands to tens of millions of examples continues to produce substantial improvements on challenging reasoning benchmarks without early saturation. Overall Supervised Fine-Tuning further enhances the model's general abilities and task diversity by combining general conversation and writing data with agent-style interaction data, harder reasoning tasks, and code-related tasks. Deliberative generation refinement combined with chain completion is employed to improve the model's output quality on complex tasks. In conclusion, Nanbeige4-3B demonstrates superior performance compared to models of similar size across a wide range of benchmarks. Model checkpoints are available at https://huggingface.co/Nanbeige for further exploration and evaluation purposes.
- - Nanbeige4-3B is a family of high-performing language models pretrained on 23T tokens and finetuned on over 30 million diverse instructions.
- - The model incorporates a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler in pre-training to extend the scaling law for small language models, boosting performance through progressive refinement of data mixtures.
- - Post-training includes improving SFT data quality through deliberative generation refinement and chain-of-thought reconstruction, leading to gains on complex tasks.
- - Dual Preference Distillation (DPD) is used to distill Nanbeige4-3B's flagship reasoning model, resulting in further performance improvements.
- - A multi-stage reinforcement learning phase enhances reasoning and human alignment abilities using verifiable rewards and preference modeling.
Summary- Nanbeige4-3B is a smart computer program that helps with understanding and using words better. It has been trained on lots of words and instructions to make it very good at its job.
- The program uses a special method called FG-WSD to keep getting better at understanding language, especially for smaller models.
- After the initial training, the program works on improving how it understands information by refining data and reconstructing thoughts.
- Another technique called DPD is used to make the program even smarter by distilling its main reasoning abilities.
- Lastly, the program goes through a learning phase where it gets even better at thinking like humans by receiving rewards and modeling preferences.
Definitions1. Language model: A computer program designed to understand and generate human language.
2. Pretrained: Already taught or trained before being used for specific tasks.
3. Finetuned: Adjusted or improved further after initial training to enhance performance.
4. Reinforcement learning: A type of machine learning where an algorithm learns through trial and error based on rewards received for its actions.
Introducing Nanbeige4-3B: A High-Performing Language Model for Complex Tasks
Language models have become a crucial tool in natural language processing (NLP) tasks, with the ability to generate human-like text and perform various language-based tasks. However, as the size of these models increases, so does their performance on complex tasks. In this blog article, we will discuss a recent research paper titled "Nanbeige4-3B: A Family of High-Performing Language Models for Complex Tasks" that introduces a new language model that outperforms other models of comparable size on various benchmarks.
The Need for Scaling Law in Small Language Models
The scaling law is an important concept in NLP that states that as the size of a language model increases, its performance also improves. This has been observed in various studies and has led to the development of larger and more powerful models such as GPT-3 with 175 billion parameters. However, there is still room for improvement in smaller language models.
In this research paper, the authors introduce Nanbeige4-3B, a family of high-performing language models that have been pretrained on 23T tokens and finetuned on over 30 million diverse instructions. This model extends the boundary of the scaling law for small language models by incorporating a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler during pre-training.
Progressive Refinement through FG-WSD Training Scheduler
The FG-WSD training scheduler progressively refines data mixtures across stages during pre-training, leading to improved model performance. This approach allows Nanbeige4-3B to achieve better results compared to other small-scale language models without early saturation.
Improving Quality through Joint Mechanism
In post-training, the authors further improve the quality of Nanbeige4-3B by using a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction. This results in substantial gains on complex tasks.
Dual Preference Distillation for Further Performance Improvements
To enhance the model's performance even further, a Dual Preference Distillation (DPD) approach is used to distill Nanbeige4-3B through verifiable rewards and preference modeling. This leads to significant improvements in both reasoning abilities and human alignment.
Multi-stage Reinforcement Learning for Task Diversity
The researchers also employ multi-stage reinforcement learning to enhance the model's general abilities and task diversity. This includes combining general conversation and writing data with agent-style interaction data, harder reasoning tasks, and code-related tasks.
Cold Start Supervised Fine-tuning for Robust Foundation
To establish a robust foundation for reasoning, Nanbeige4-3B undergoes a Cold Start Supervised Fine-tuning stage. This stage focuses on high-quality reasoning data such as mathematical problems, coding tasks, and subject-area problem-solving tasks. The results show that this approach significantly improves the model's performance on challenging reasoning benchmarks without early saturation.
Conclusion: Superior Performance Across Various Benchmarks
In conclusion, Nanbeige4-3B demonstrates superior performance compared to models of similar size across a wide range of benchmarks. Its post-training pipeline includes various techniques such as FG-WSD training scheduler, joint mechanism for improving quality, DPD for further performance improvements, multi-stage reinforcement learning for task diversity, and Cold Start Supervised Fine-tuning for establishing a robust foundation.
If you are interested in exploring or evaluating this language model further, you can access its checkpoints at https://huggingface.co/Nanbeige. With its impressive results on complex tasks and its ability to outperform other models of comparable size, Nanbeige4-3B is a promising addition to the world of language models.