To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
AI-generated Key Points
⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.
- Dataset size is important in scaling language models
- Large language models (LLMs) are token-hungry during pre-training
- High-quality text data on the web is reaching its scaling limit for LLMs
- Repeating pre-training data for additional epochs can enhance LLMs
- Models are susceptible to overfitting and multi-epoch degradation when pre-training data is repeated
- Key factors contributing to multi-epoch degradation include dataset size, model parameters, and training objectives
- Regularization techniques were explored to alleviate multi-epoch degradation, with dropout showing remarkable effectiveness but requiring careful tuning when scaling up the model size
- Leveraging mixture-of-experts (MoE) enables cost-effective and efficient hyperparameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale
Authors: Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You
Abstract: Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives, while less influential factors consist of dataset quality and model FLOPs. Finally, we explore whether widely used regularization can alleviate multi-epoch degradation. Most regularization techniques do not yield significant improvements, except for dropout, which demonstrates remarkable effectiveness but requires careful tuning when scaling up the model size. Additionally, we discover that leveraging mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale.
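To make the idea of repeating pre-training data concrete, the following is a minimal arithmetic sketch, not code from the paper. It shows how a fixed training token budget translates into a number of epochs once the pool of unique tokens is exhausted. The function names and the example figures are hypothetical and chosen only for illustration.

```python
def epochs_for_budget(token_budget: int, unique_tokens: int) -> float:
    """Return how many passes over the dataset a training token budget implies."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return token_budget / unique_tokens


def repeated_tokens(token_budget: int, unique_tokens: int) -> int:
    """Return how many tokens in the budget are repeats of already-seen data."""
    return max(0, token_budget - unique_tokens)


if __name__ == "__main__":
    # Hypothetical figures (assumptions, not values reported by the authors).
    budget = 400_000_000_000   # tokens the training run will consume
    unique = 100_000_000_000   # unique high-quality tokens available
    print(epochs_for_budget(budget, unique))  # passes over the dataset
    print(repeated_tokens(budget, unique))    # tokens that are repeats
```

With these assumed figures, the run would pass over the dataset four times, so three quarters of the consumed tokens would be repeats. This is the regime in which the abstract reports overfitting and multi-epoch degradation.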