Data Filtering Networks

AI-generated keywords: Machine Learning

AI-generated Key Points

  • Large training sets are crucial for advancements in language modeling and multimodal learning.
  • Data filtering networks are developed to effectively filter large uncurated datasets.
  • Quality of a network for data filtering differs from its performance on downstream tasks.
  • New data filtering networks have been constructed to generate state-of-the-art image-text datasets.
  • The DFN-5B dataset enables training models that are state of the art for their compute budgets; a ViT-H trained on it achieves 83.0% zero-shot transfer accuracy on ImageNet.
  • Research findings can be used as a blueprint for creating high-quality datasets from publicly available data, democratizing access to such datasets.
  • Release of DFN-2B will facilitate research on large image-text models and encourage collaboration in advancing machine learning capabilities.

Authors: Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, Vaishaal Shankar

License: CC BY 4.0

Abstract: Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.

Submitted to arXiv on 29 Sep. 2023

Results of the summarizing process for the arXiv paper: 2309.17425v1

In recent years, large training sets have played a crucial role in the advancement of machine learning, particularly in language modeling and multimodal learning. Data curation for pre-training often involves collecting a vast pool of data from the Web and then filtering this pool down to an actual training set using various heuristics. This study focuses on learning a data filtering network (DFN) for that second step: filtering a large uncurated dataset. One key insight is that the quality of a network for filtering is distinct from its performance on downstream tasks. For example, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Building on these insights, the authors construct new data filtering networks that induce state-of-the-art image-text datasets. The best performing dataset, DFN-5B, enables training models that are state of the art for their compute budgets. For instance, a Vision Transformer (ViT-H) trained on DFN-5B achieves 83.0% zero-shot transfer accuracy on ImageNet, outperforming models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. The authors also show that high-performance data filtering networks can be trained from scratch using only publicly available data, so the findings can serve as a blueprint for building high-quality datasets from public data. This contributes to democratizing access to large high-quality datasets and supports further research in dataset design. Additionally, the release of the 2 billion example dataset DFN-2B will facilitate research on large image-text models.
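The two-step paradigm described above (collect a large candidate pool, then filter it down to a training set) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `filter_pool`, `toy_score`, and the keep-top-fraction rule are hypothetical and are not taken from the paper. In particular, `toy_score` is a simple word-overlap stand-in for a learned data filtering network, and images are represented by space-separated tag strings rather than pixels.

```python
from typing import Callable, List, Sequence, Tuple

# (image tags, caption): a placeholder for a real image-text pair.
Pair = Tuple[str, str]


def toy_score(pair: Pair) -> float:
    """Hypothetical stand-in for a learned filtering network.

    Returns the Jaccard overlap between the words of the image tags
    and the words of the caption. A real DFN would be a trained model.
    """
    tags, caption = pair
    tag_words = set(tags.lower().split())
    caption_words = set(caption.lower().split())
    union = tag_words | caption_words
    if not union:
        return 0.0
    return len(tag_words & caption_words) / len(union)


def filter_pool(
    pool: Sequence[Pair],
    score_fn: Callable[[Pair], float],
    keep_fraction: float,
) -> List[Pair]:
    """Keep the highest-scoring fraction of the pool (assumed filtering rule)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not pool:
        return []
    # sorted() is stable, so ties keep their original pool order.
    ranked = sorted(pool, key=score_fn, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy candidate pool: two well-aligned pairs and two noisy ones.
pool = [
    ("cat sofa", "a cat on a sofa"),
    ("dog park", "click here to buy"),
    ("red car", "red car"),
    ("tree", "img_0042"),
]
kept = filter_pool(pool, toy_score, keep_fraction=0.5)
print(kept)
```

In this sketch the scoring function is passed in as a parameter, which mirrors the point that the choice of filtering model is a separate decision from the filtering procedure itself: swapping in a different scorer changes the resulting training set without changing `filter_pool`.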
In summary, this study sheds light on the importance of effective data filtering networks in generating top-tier datasets for machine learning applications and highlights the significance of quality over quantity in dataset construction. By leveraging these insights and advancements in dataset design, researchers can continue to push boundaries in artificial intelligence and enhance model performance across various tasks and domains.
Created on 02 Jul. 2025
