AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

AI-generated keywords: Large Language Models, Auto Data Cleaning Workflow, Data Quality Assessment, Automated Workflow Generation, LLM-driven Data Cleaning Processes

AI-generated Key Points

  • Large Language Models (LLMs) used in automating data-cleaning workflows through AutoDCWorkflow
  • Three common data quality issues addressed: duplicates, missing values, inconsistent data formats
  • Process involves three key components driven by LLMs:
      • Selecting target columns related to the purpose
      • Inspecting data quality in each target column and generating a Data Quality Report
      • Predicting the next operation and its arguments based on data quality assessment results
  • Introduction of a data cleaning benchmark whose annotated datasets each comprise a purpose, raw table, clean table, data cleaning workflow, and answer set, used to evaluate LLM agents' ability to generate workflows for purposes of varying difficulty
  • AutoDCWorkflow aims to automate real-world data cleaning operations using LLMs effectively and record these workflows for future reference
  • Exploration of LLMs as automated data cleaning agents capable of generating tailored workflows fit for specific purposes
  • Future research focus on examining column schema dependencies and interdependencies among different data cleaning operations to enhance precision and effectiveness
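
The three components listed above can be pictured as a small planning loop. The following is a minimal sketch only, not the authors' implementation: each LLM call is replaced by a simple rule-based stand-in, and every name in it (`select_target_columns`, `inspect_column`, `next_operation`, `plan_and_clean`, the `trim` and `lowercase` operations) is a hypothetical choice made for illustration.

```python
from typing import Optional

Table = list[dict[str, str]]

# Hypothetical operation vocabulary; the paper's real operation set is not given here.
OPERATIONS = {"trim": str.strip, "lowercase": str.lower}


def select_target_columns(table: Table, purpose: str) -> list[str]:
    """Stand-in for the LLM step: keep columns whose name occurs in the purpose."""
    columns = table[0].keys() if table else []
    return [c for c in columns if c.lower() in purpose.lower()]


def inspect_column(table: Table, column: str) -> dict[str, bool]:
    """Stand-in Data Quality Report: one boolean flag per issue checked."""
    values = [row[column] for row in table]
    trimmed = {v.strip() for v in values}
    return {
        "needs_trim": any(v != v.strip() for v in values),
        "needs_case_fold": len(trimmed) != len({v.lower() for v in trimmed}),
    }


def next_operation(report: dict[str, bool]) -> Optional[str]:
    """Stand-in for operation prediction: pick the first unresolved issue."""
    for issue, operation in (("needs_trim", "trim"), ("needs_case_fold", "lowercase")):
        if report[issue]:
            return operation
    return None


def plan_and_clean(table: Table, purpose: str, max_steps: int = 10):
    """Return a minimal cleaned table plus the recorded workflow."""
    targets = select_target_columns(table, purpose)
    cleaned = [{c: row[c] for c in targets} for row in table]
    workflow: list[tuple[str, str]] = []
    for column in targets:
        for _ in range(max_steps):  # bound the loop in case a step never converges
            operation = next_operation(inspect_column(cleaned, column))
            if operation is None:
                break
            for row in cleaned:
                row[column] = OPERATIONS[operation](row[column])
            workflow.append((operation, column))
    return cleaned, workflow


dirty = [{"city": " Chicago", "note": "x"}, {"city": "chicago ", "note": "y"}]
cleaned, workflow = plan_and_clean(dirty, "How many rows are there per city?")
print(cleaned)   # [{'city': 'chicago'}, {'city': 'chicago'}]
print(workflow)  # [('trim', 'city'), ('lowercase', 'city')]
```

The loop re-inspects the column after every operation, so the next step is always chosen from an up-to-date quality report, and the recorded `workflow` list is what would be kept for future reference.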

Authors: Lan Li, Liri Fang, Vetle I. Torvik

License: CC BY 4.0

Abstract: We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), this pipeline generates a minimal, clean table sufficient to address the purpose and the data cleaning workflow used to produce the table. The planning process involves three main LLM-driven components: (1) Select Target Columns: Identifies a set of target columns related to the purpose. (2) Inspect Column Quality: Assesses the data quality for each target column and generates a Data Quality Report as operation objectives. (3) Generate Operation & Arguments: Predicts the next operation and arguments based on the data quality report results. Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels. The benchmark comprises the annotated datasets as a collection of purpose, raw table, clean table, data cleaning workflow, and answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows. The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.

Submitted to arXiv on 09 Dec. 2024

Results of the summarizing process for the arXiv paper: 2412.06724v1

We investigate how well Large Language Models (LLMs) can automate data cleaning through our LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow) pipeline. The pipeline prompts LLMs to repair three common data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose expressed as a query, it generates a minimal clean table sufficient to address that purpose, together with the data cleaning workflow that produced it.

Planning involves three LLM-driven components: selecting the target columns related to the purpose; inspecting the data quality of each target column and generating a Data Quality Report; and predicting the next operation and its arguments from the results of that report. To evaluate how well LLM agents generate workflows for purposes of varying difficulty, we introduce a data cleaning benchmark whose annotated datasets each comprise a purpose, raw table, clean table, data cleaning workflow, and answer set.

Building upon previous work, AutoDCWorkflow aims to analyze how effectively LLMs can automate real-world data cleaning operations and record these workflows for future reference. We explore the potential of LLMs as automated data cleaning agents that generate workflows tailored to a specific purpose. This study is an initial attempt at fully automating the generation of data cleaning workflows with LLMs. In future research, we will examine column schema dependencies and the interdependencies among data cleaning operations to improve the precision and effectiveness of automated workflow generation. Overall, our experiments with three LLMs indicate that LLMs perform well in planning and generating data-cleaning workflows without requiring fine-tuning.
Created on 17 Apr. 2026
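
The three data quality issues named in the summary, and the point about interdependencies among operations, can be made concrete with a small worked example. This is an illustrative sketch only: the helper names, the two accepted date formats, and the sample rows are assumptions, not material from the paper. It shows that removing duplicates before normalizing formats misses a duplicate that becomes visible once formats agree.

```python
from datetime import datetime

Table = list[dict[str, str]]

# Assumed input formats for this example only.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_date(value: str) -> str:
    """Rewrite a recognized date as YYYY-MM-DD; leave anything else untouched."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value


def normalize_column(rows: Table, column: str) -> Table:
    """Repair inconsistent formats in one column."""
    return [{**row, column: normalize_date(row[column])} for row in rows]


def fill_missing(rows: Table, column: str, default: str) -> Table:
    """Repair missing values: replace blank cells with a default."""
    return [
        {**row, column: row[column] if row[column].strip() else default}
        for row in rows
    ]


def drop_duplicates(rows: Table) -> Table:
    """Repair duplicates: keep the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"name": "Ada", "joined": "2024-12-09"},
    {"name": "Ada", "joined": "12/09/2024"},  # same date, different format
    {"name": "Bob", "joined": ""},            # missing value
]

dedup_first = drop_duplicates(rows)
normalize_first = drop_duplicates(normalize_column(rows, "joined"))
print(len(dedup_first), len(normalize_first))  # 3 2
print(fill_missing(normalize_first, "joined", "unknown"))
```

Because the outcome depends on the order in which operations run, a workflow generator has to reason about operation order as well as about which operations to apply.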
