DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

AI-generated keywords: DataFlow Large Language Models Data Preparation Framework End-to-End System

AI-generated Key Points

The DataFlow Technical Report presents a comprehensive and innovative framework for data preparation in Large Language Models (LLMs)
DataFlow addresses the growing demand for high-quality data in LLMs and challenges posed by current ad-hoc practices
DataFlow provides a unified and extensible solution with system-level abstractions enabling modular, reusable, and composable data transformations
Designed with a PyTorch-like programming interface for usability, extensibility, and maintainability
Offers nearly 200 reusable operators and six domain-general pipelines spanning various domains such as text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction
DataFlow-Agent automates translation of natural-language specifications into executable pipelines through operator synthesis and iterative verification
Extensive experiments demonstrate effectiveness of the framework across diverse data preparation scenarios
Resulting datasets consistently match or surpass state-of-the-art baselines in quality and performance
Mathematical reasoning data outperforms high-quality synthetic baselines, with 1-3 point gains on MATH, GSM8K, and AIME
The Text-to-SQL pipeline achieves up to +3% execution accuracy over SynSQL while using fewer training examples
A unified 10K-sample dataset built with DataFlow from text and code data lets base models surpass counterparts trained on 1M Infinity-Instruct samples

Authors: Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang

arXiv: 2512.16676v1 (cs.LG)

License: CC BY 4.0

Abstract: The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1-3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.

Submitted to arXiv on 18 Dec. 2025

Results of the summarizing process for the arXiv paper: 2512.16676v1

The DataFlow Technical Report presents a unified framework for preparing data for Large Language Models (LLMs). The report addresses the growing demand for high-quality LLM data and the problems of current ad-hoc practices, which lack principled abstractions and hinder reproducibility. DataFlow provides a unified and extensible solution with system-level abstractions that enable modular, reusable, and composable data transformations. Its PyTorch-style pipeline construction API is designed for usability, extensibility, and maintainability. The framework offers nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. In addition, DataFlow-Agent automates the translation of natural-language specifications into executable pipelines through operator synthesis, pipeline planning, and iterative verification. Experiments on six DataFlow-implemented pipelines demonstrate the effectiveness of the framework across diverse data preparation scenarios. The resulting datasets consistently match or surpass state-of-the-art baselines in quality and downstream performance. For instance, the mathematical reasoning data outperforms high-quality synthetic baselines, with 1-3 point gains on MATH, GSM8K, and AIME. Similarly, the Text-to-SQL pipeline achieves up to +3% execution accuracy over SynSQL while using fewer training examples, and the code pipeline improves code benchmarks by +7% on average. By combining text and code data into a unified corpus, DataFlow shows that training on only 10K samples lets base models surpass counterparts trained on 1M Infinity-Instruct samples, a corpus about 100 times larger. This highlights the framework's ability to produce high-quality supervision across diverse domains. Overall, DataFlow emerges as an end-to-end system for LLM-based data preparation that offers a rich ecosystem of operators and pipelines.
With its focus on principled abstractions and scalability, DataFlow enhances programmability, reproducibility, and data quality in LLM workflows. It serves as a foundational tool for building semantically rich and scalable data preparation pipelines that improve performance across various domains.

Summary

- The DataFlow Technical Report introduces a new way to get data ready for big language models.
- DataFlow helps meet the need for good data in these models and deals with problems from current messy methods.
- It offers a flexible solution that makes it easy to change and reuse different ways of transforming data.
- It's made to be user-friendly like PyTorch, making it simple to use, expand, and keep up.
- There are many pre-made tools in DataFlow for working with text, math, code, and more.

Definitions

- Data preparation: Getting information ready to use in a computer program or system.
- Large Language Models (LLMs): Big programs that understand and generate human language.
- Abstractions: Simplified versions of complex ideas or systems.
- Usability: How easy something is to use or work with.
- Extensibility: The ability to add new features or functions easily.

The DataFlow Technical Report: A Comprehensive Framework for Data Preparation in Large Language Models

Introduction

In recent years, large language models (LLMs) have achieved remarkable success in natural language processing tasks such as text generation, question answering, and machine translation. These models are trained on vast amounts of data to learn the underlying patterns and relationships between words and phrases, so the quality of the training data has a significant impact on their performance. Today, data preparation for LLMs is still dominated by ad-hoc scripts and loosely specified workflows, which can lead to suboptimal and hard-to-reproduce results. To address this issue, the authors have developed DataFlow, a framework that provides a unified solution for LLM data preparation. It offers system-level abstractions that enable modular, reusable, and composable data transformations while aiming for usability, extensibility, and maintainability.

DataFlow Ecosystem

DataFlow is designed with a PyTorch-like programming interface intended to be approachable for both novice and experienced users. It offers nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG (retrieval-augmented generation), and large-scale knowledge extraction. Each pipeline is composed from operators in the ecosystem, so users can customize a pipeline for their needs without writing everything from scratch.

Scalability is another stated design goal: the abstract positions DataFlow as a substrate for reliable, reproducible, and scalable LLM data preparation.

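To make the idea of a PyTorch-style pipeline built from reusable operators more concrete, here is a minimal sketch. It is not DataFlow's actual API: the class names (`Operator`, `MinLengthFilter`, `Deduplicate`, `GenerateAnswer`, `Pipeline`), the `run` and `forward` methods, and the `fake_llm` stub are all hypothetical and chosen only for illustration. The sketch assumes that operators are declared once in a constructor and wired together in a `forward` method, in the way layers are in a `torch.nn.Module`.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class Operator:
    """Hypothetical base class: one reusable transformation over a batch of records."""

    def run(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError


class MinLengthFilter(Operator):
    """Drop records whose text is shorter than min_chars characters."""

    def __init__(self, key: str = "text", min_chars: int = 10):
        self.key = key
        self.min_chars = min_chars

    def run(self, records: List[Record]) -> List[Record]:
        return [r for r in records if len(r.get(self.key, "")) >= self.min_chars]


class Deduplicate(Operator):
    """Keep the first record for each case- and whitespace-normalized text."""

    def __init__(self, key: str = "text"):
        self.key = key

    def run(self, records: List[Record]) -> List[Record]:
        seen = set()
        kept = []
        for r in records:
            norm = " ".join(r.get(self.key, "").lower().split())
            if norm not in seen:
                seen.add(norm)
                kept.append(r)
        return kept


class GenerateAnswer(Operator):
    """Model-in-the-loop step; `generate` stands in for a real LLM call."""

    def __init__(self, generate: Callable[[str], str]):
        self.generate = generate

    def run(self, records: List[Record]) -> List[Record]:
        return [{**r, "answer": self.generate(r["text"])} for r in records]


class Pipeline:
    """Operators are declared in __init__ and composed in forward (assumed design)."""

    def __init__(self, generate: Callable[[str], str]):
        self.length_filter = MinLengthFilter(min_chars=10)
        self.dedup = Deduplicate()
        self.answer = GenerateAnswer(generate)

    def forward(self, records: List[Record]) -> List[Record]:
        records = self.length_filter.run(records)
        records = self.dedup.run(records)
        return self.answer.run(records)


def fake_llm(prompt: str) -> str:
    # Stub so the sketch runs offline; a real pipeline would call a model here.
    return f"[stub answer for: {prompt}]"


raw = [
    {"text": "What is 2 + 2?"},
    {"text": "what is  2 + 2?"},  # duplicate after normalization
    {"text": "hi"},               # too short, filtered out
    {"text": "Write a SQL query that counts users."},
]
result = Pipeline(fake_llm).forward(raw)
for record in result:
    print(record)
```

Because each operator only depends on the record format, the same `Deduplicate` or `MinLengthFilter` instance could be reused in a different pipeline, which is the kind of modularity and composability the summary attributes to the framework.
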
DataFlow-Agent: Automating Pipeline Creation

To further improve usability, DataFlow also includes an automated pipeline creation tool called DataFlow-Agent. It translates natural-language specifications into executable pipelines through operator synthesis, pipeline planning, and iterative verification. Users describe the data they want to prepare, and DataFlow-Agent generates a pipeline from suitable operators in the ecosystem. This saves time and supports consistency and reproducibility in pipeline creation.

Experimental Results

To demonstrate the effectiveness of DataFlow, experiments were conducted on six pipelines implemented with the framework. The results show improvements in downstream LLM performance across domains. In mathematical reasoning, DataFlow-generated data outperforms high-quality synthetic baselines, with 1-3 point gains on MATH, GSM8K, and AIME. The Text-to-SQL pipeline achieves up to +3% execution accuracy over SynSQL while using fewer training examples, and the code pipeline improves code benchmarks by +7% on average.

Data Efficiency with Unified Corpus

One of the most striking results comes from combining text and code data into a unified corpus for training LLMs. With just 10K samples from this corpus, base models surpass counterparts trained on 1M Infinity-Instruct samples. This highlights the framework's ability to produce high-quality supervision across diverse domains by drawing on its ecosystem of operators and pipelines.

Conclusion

DataFlow is an end-to-end system for LLM-based data preparation that addresses the challenges posed by current ad-hoc practices. With its focus on principled abstractions and scalability, it enhances programmability, reproducibility, and data quality in LLM workflows. Its ecosystem of operators and pipelines lets users customize data preparation for their needs, while DataFlow-Agent automates pipeline creation. The reported experiments show its effectiveness across diverse domains, making it a valuable tool for building semantically rich and scalable data preparation pipelines for LLMs.
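
The plan-then-verify loop described for DataFlow-Agent can be sketched as follows. This is an illustrative toy under stated assumptions, not the agent's real implementation: the operator registry, the `verify` check, the `build_pipeline` function, and `stub_planner` (which stands in for an LLM planner) are all hypothetical names. The sketch covers only pipeline planning and iterative verification; operator synthesis is left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]

# Hypothetical operator registry: operator name -> transformation.
REGISTRY: Dict[str, Step] = {
    "strip": lambda rs: [{**r, "text": r["text"].strip()} for r in rs],
    "drop_empty": lambda rs: [r for r in rs if r["text"]],
    "lowercase": lambda rs: [{**r, "text": r["text"].lower()} for r in rs],
}


def verify(plan: List[str], sample: List[Record]) -> Optional[str]:
    """Run the plan on sample data; return feedback text, or None if it passes."""
    for name in plan:
        if name not in REGISTRY:
            return f"unknown operator: {name}"
    out = sample
    for name in plan:
        out = REGISTRY[name](out)
    if any(not r["text"] for r in out):
        return "output contains empty text; consider adding drop_empty"
    return None


def build_pipeline(
    spec: str,
    planner: Callable[[str, Optional[str]], List[str]],
    sample: List[Record],
    max_rounds: int = 3,
) -> Tuple[List[str], int]:
    """Ask the planner for a plan, verify it, and feed failures back until it passes."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        plan = planner(spec, feedback)
        feedback = verify(plan, sample)
        if feedback is None:
            return plan, round_no
    raise RuntimeError(f"no valid pipeline after {max_rounds} rounds: {feedback}")


def stub_planner(spec: str, feedback: Optional[str]) -> List[str]:
    # Stand-in for an LLM: the first attempt forgets to drop empty records,
    # and the second attempt repairs the plan once it receives feedback.
    if feedback is None:
        return ["strip", "lowercase"]
    return ["strip", "drop_empty", "lowercase"]


sample = [{"text": "  Hello "}, {"text": "   "}]
plan, rounds = build_pipeline("clean and lowercase the text", stub_planner, sample)
print(plan, rounds)
```

The design choice worth noting is that verification runs the candidate pipeline on a small sample and returns textual feedback, so a failed attempt informs the next planning round instead of simply being discarded.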

Created on 29 Dec. 2025
