To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

AI-generated keywords: Dataset Size, Language Modeling, Multi-epoch Degradation, Dropout, Mixture-of-Experts

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Dataset size is important in scaling language models
  • Large language models (LLMs) are token-hungry during pre-training
  • High-quality text data on the web is reaching its scaling limit for LLMs
  • A straightforward way to further enhance LLMs is to repeat the pre-training data for additional epochs
  • Models are susceptible to overfitting when pre-training data is repeated, which leads to multi-epoch degradation
  • Significant factors in multi-epoch degradation include dataset size, model parameters, and training objectives; dataset quality and model FLOPs are less influential
  • Most widely used regularization techniques do not yield significant improvements; dropout is the exception, showing remarkable effectiveness but requiring careful tuning when scaling up the model size
  • Mixture-of-experts (MoE) enables cost-effective and efficient hyperparameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale
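
The key points name dropout as the one regularizer that helped. As a reminder of what that operation does, here is a minimal sketch of standard inverted dropout in plain Python. It is not the authors' implementation, and the function name `dropout` and its parameters are assumptions made for illustration; the rates and placement used in the paper are not stated in this summary.

```python
import random


def dropout(values, rate, training=True, rng=None):
    """Apply inverted dropout to a list of floats (illustrative sketch).

    Each value is zeroed with probability `rate`; survivors are scaled by
    1 / (1 - rate) so the expected value of every element is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    # Dropout is only active during training; at evaluation time the
    # input passes through untouched.
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]
```

With `rate=0.5`, every output element is either `0.0` or twice its input, which is why the rate has to be chosen with care: a larger rate removes more signal per step.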

Authors: Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You

Abstract: Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives, while less influential factors consist of dataset quality and model FLOPs. Finally, we explore whether widely used regularization can alleviate multi-epoch degradation. Most regularization techniques do not yield significant improvements, except for dropout, which demonstrates remarkable effectiveness but requires careful tuning when scaling up the model size. Additionally, we discover that leveraging mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale.
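
To make "repeat the pre-training data for additional epochs" concrete, the arithmetic can be sketched as below. This is an illustration only: the helper names `epochs_needed` and `repeated_fraction` are hypothetical, and the numbers in the usage note are made up rather than taken from the paper.

```python
def epochs_needed(token_budget: int, unique_tokens: int) -> float:
    """Number of passes over the dataset implied by a training token budget."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return token_budget / unique_tokens


def repeated_fraction(token_budget: int, unique_tokens: int) -> float:
    """Fraction of the training tokens that the model has already seen once."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    # If the budget fits inside the dataset, nothing is repeated.
    if token_budget <= unique_tokens:
        return 0.0
    return 1.0 - unique_tokens / token_budget
```

For example, a hypothetical budget of 400 tokens over a dataset of 100 unique tokens gives `epochs_needed(400, 100) == 4.0` and `repeated_fraction(400, 100) == 0.75`: three quarters of what the model trains on is data it has seen before, which is the regime where the abstract reports overfitting.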

Submitted to arXiv on 22 May 2023

Results of the summarizing process for the arXiv paper: 2305.13230v1

Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. A straightforward way to enhance LLMs further is to repeat the pre-training data for additional epochs. In this study, Fuzhao Xue et al. empirically investigate three aspects of this approach. First, they explore the consequences of repeating pre-training data and find that the model is susceptible to overfitting, which leads to multi-epoch degradation. Second, they examine the factors behind multi-epoch degradation: dataset size, model parameters, and training objectives are significant, while dataset quality and model FLOPs are less influential. Finally, they test whether widely used regularization techniques can alleviate multi-epoch degradation. Most techniques do not yield significant improvements; the exception is dropout, which is remarkably effective but requires careful tuning when scaling up the model size. The authors also find that mixture-of-experts (MoE) enables cost-effective and efficient hyperparameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale. Together, these findings describe both the pitfalls of repeating pre-training data and ways to mitigate multi-epoch degradation in large-scale language modeling.
Created on 31 May 2023

