Large-Scale Data Selection for Instruction Tuning

AI-generated keywords: Large-Scale Data Selection

AI-generated Key Points

  • Importance of selecting high-quality training data for instruction-tuning language models
  • Carefully curated datasets produce models that outperform those trained on larger, noisier datasets
  • Automated data selection approaches and their effectiveness on large-scale datasets
  • Evaluation of how selection methods scale, selecting up to 2.5 million samples from pools as large as 5.8 million samples and evaluating across seven diverse tasks
  • Representation-based data selection (RDS+) consistently outperformed more complex methods while being more compute-efficient

Authors: Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, Pradeep Dasigi

License: CC BY 4.0

Abstract: Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction-tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting (while using more compute), and even decline in performance when given access to larger pools of data to select over. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested -- all whilst being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at https://github.com/hamishivi/automated-instruction-selection.

Submitted to arXiv on 03 Mar. 2025

Results of the summarizing process for the arXiv paper: 2503.01807v1

In "Large-Scale Data Selection for Instruction Tuning," Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, and Pradeep Dasigi examine the selection of high-quality training data for instruction-tuning language models. Carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets, which makes data selection a crucial step in instruction tuning.

The study focuses on automated data selection approaches and how well they hold up at scale. These methods are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples), whereas popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples subsampled from even larger pools. To evaluate scalability, the researchers conducted a systematic study, selecting up to 2.5 million samples from pools as large as 5.8 million samples and evaluating across seven diverse tasks.

Many recently proposed methods fell short of random selection in this setting while using more compute, and some even declined in performance when given access to larger pools of data. However, a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperformed more complex methods across all settings tested while being more compute-efficient. These findings indicate that the scaling properties of automated selection methods should be examined more closely. The researchers have released their code, data, and models. The experiments use TÜLU 2 as the base dataset and model family, and results with the TÜLU 3 mixture and Llama 3.1 are reported in later sections of the paper.
Further details on the size and makeup of data pools considered in the study are provided in Figure 2, with additional information available in Appendices B and E.
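
To make the RDS+ idea more concrete, here is a minimal sketch of representation-based selection using weighted mean pooling. It is an illustration under stated assumptions, not the authors' implementation: the linearly increasing position weights, the cosine-similarity scoring, the max-over-queries aggregation, and the function names weighted_mean_pool and select_top_k are all hypothetical choices, and random arrays stand in for real pretrained LM hidden states. The released repository is the place to look for the actual method.

```python
import numpy as np


def weighted_mean_pool(hidden, mask):
    """Pool per-token hidden states into one vector.

    hidden: (seq_len, dim) hidden states for one sample.
    mask:   (seq_len,) 1 for real tokens, 0 for padding (at least one 1).
    Assumption: weights grow linearly with token position, so later
    tokens contribute more. The source only says "weighted mean pooling".
    """
    positions = np.arange(1, hidden.shape[0] + 1, dtype=np.float64)
    weights = positions * mask
    weights = weights / weights.sum()
    return weights @ hidden


def select_top_k(pool_emb, query_emb, k):
    """Return indices of the k pool samples most similar to any query sample.

    Assumption: cosine similarity, with each pool sample scored by its
    best match among the query embeddings.
    """
    pool_n = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    query_n = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = pool_n @ query_n.T          # shape (n_pool, n_query)
    scores = sims.max(axis=1)          # best-matching query per pool sample
    return np.argsort(-scores)[:k]     # highest scores first


# Stand-in data: random "hidden states" instead of real model outputs.
rng = np.random.default_rng(0)
seq_len, dim = 16, 32
mask = np.ones(seq_len)
pool_emb = np.stack(
    [weighted_mean_pool(rng.normal(size=(seq_len, dim)), mask) for _ in range(100)]
)
query_emb = np.stack(
    [weighted_mean_pool(rng.normal(size=(seq_len, dim)), mask) for _ in range(5)]
)
selected = select_top_k(pool_emb, query_emb, k=10)
```

In a real pipeline the hidden states would come from a pretrained language model run over each pool and target sample, and the selected indices would define the instruction-tuning subset.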
Created on 19 Mar. 2025
