AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

AI-generated keywords: Large Language Models, Auto Data Cleaning Workflow, Data Quality Assessment, Automated Workflow Generation, LLM-driven Data Cleaning Processes

AI-generated Key Points

- Large Language Models (LLMs) used in automating data-cleaning workflows through AutoDCWorkflow
- Three common data quality issues addressed: duplicates, missing values, inconsistent data formats
- Process involves three key components driven by LLMs:
  - Selecting target columns related to the purpose
  - Inspecting data quality in each target column and generating a Data Quality Report
  - Predicting the next operation and its arguments based on data quality assessment results
- Introduction of a data cleaning benchmark with annotated datasets, raw tables, clean tables, workflows, and answer sets for evaluating LLM agents' ability
- AutoDCWorkflow aims to automate real-world data cleaning operations using LLMs effectively and record these workflows for future reference
- Exploration of LLMs as automated data cleaning agents capable of generating tailored workflows fit for specific purposes
- Future research focus on examining column schema dependencies and interdependencies among different data cleaning operations to enhance precision and effectiveness


Authors: Lan Li, Liri Fang, Vetle I. Torvik

arXiv: 2412.06724v1 (cs.DB)

License: CC BY 4.0

Abstract: We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), this pipeline generates a minimal, clean table sufficient to address the purpose and the data cleaning workflow used to produce the table. The planning process involves three main LLM-driven components: (1) Select Target Columns: Identifies a set of target columns related to the purpose. (2) Inspect Column Quality: Assesses the data quality for each target column and generates a Data Quality Report as operation objectives. (3) Generate Operation & Arguments: Predicts the next operation and arguments based on the data quality report results. Additionally, we propose a data cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data cleaning purposes of varying difficulty levels. The benchmark comprises the annotated datasets as a collection of purpose, raw table, clean table, data cleaning workflow, and answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data cleaning workflows. The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.

Submitted to arXiv on 09 Dec. 2024


Results of the summarizing process for the arXiv paper: 2412.06724v1

Comprehensive Summary

We investigate the capabilities of Large Language Models (LLMs) in automating data-cleaning workflows through our LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow). This pipeline utilizes LLMs to address three common data quality issues: duplicates, missing values, and inconsistent data formats. Our goal is to generate a clean table from a dirty one that fulfills a specific purpose expressed as a query. The process involves three key components driven by LLMs: selecting target columns related to the purpose, inspecting the quality of data in each target column and generating a Data Quality Report, and predicting the next operation and its arguments based on the results of the data quality assessment. To evaluate LLM agents' ability to automatically generate workflows for different levels of difficulty, we introduce a data cleaning benchmark with annotated datasets, raw tables, clean tables, data cleaning workflows, and answer sets. Building upon previous work, our AutoDCWorkflow aims to analyze how effectively LLMs can automate real-world data cleaning operations and record these workflows for future reference. We explore the potential of LLMs as automated data cleaning agents capable of generating tailored workflows fit for specific purposes. This study represents an initial attempt at fully automating the process of generating data cleaning workflows using LLMs. In future research, we will focus on examining column schema dependencies and interdependencies among different data cleaning operations in automated workflow generation to enhance precision and effectiveness. Overall, our findings suggest that LLMs show promise in inferring necessary cleaning steps and applying them accurately across diverse datasets without requiring fine-tuning.


Layman's Summary

Large Language Models (LLMs) are like smart helpers that can clean up data automatically. They help fix problems like having the same information repeated, missing information, and data being in different formats. LLMs use three main steps: choosing which parts of the data to work on, checking if the data is good or bad, and deciding what to do next based on their findings. A new way to test how well LLMs can clean data has been introduced with different types of datasets and tasks. The goal is for LLMs to make cleaning data easier and faster, and to record the cleaning steps so they can be consulted later.

Definitions

- Large Language Models (LLMs): Advanced computer programs that can understand and generate human-like text.
- Data quality issues: Problems with the accuracy, completeness, consistency, or reliability of data.
- Automate: To make a process run automatically without needing constant human intervention.
- Benchmark: A standard or reference point used for comparison or evaluation.
- Precision: The level of exactness or accuracy in performing a task.

Introduction

Data cleaning is a crucial step in the data analysis process, as it ensures that the data used for analysis is accurate, consistent, and reliable. However, manual data cleaning can be time-consuming and error-prone, especially when dealing with large datasets. This has led to an increasing interest in automating data cleaning workflows using machine learning techniques. In recent years, Large Language Models (LLMs) have gained significant attention due to their ability to understand natural language and perform various tasks such as text generation and classification. In the research paper "AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark", the authors investigate the potential of LLMs in automating data-cleaning workflows through their proposed LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow).

The AutoDCWorkflow Pipeline

The AutoDCWorkflow pipeline utilizes LLMs to address three common data quality issues: duplicates, missing values, and inconsistent data formats. The goal of this pipeline is to generate a clean table from a dirty one that fulfills a specific purpose expressed as a query. The process involves three key components driven by LLMs:

Selecting Target Columns

The first step in the AutoDCWorkflow pipeline is selecting the target columns related to the purpose, which is expressed as a query. This task is performed by prompting an LLM, without fine-tuning, to identify the set of columns in the dirty table that are relevant to answering that query.
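The column-selection step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names `build_column_prompt` and `select_target_columns`, and the `fake_llm` stand-in are all hypothetical, and a real pipeline would pass in an actual LLM client as the `llm` callable.

```python
from typing import Callable, Dict, List


def build_column_prompt(columns: List[str],
                        sample_rows: List[Dict[str, str]],
                        purpose: str) -> str:
    """Assemble a plain-text prompt with the schema, a few rows, and the purpose."""
    lines = ["Columns: " + ", ".join(columns), "Sample rows:"]
    for row in sample_rows:
        lines.append(" | ".join(str(row.get(c, "")) for c in columns))
    lines.append("Purpose: " + purpose)
    lines.append("Return the relevant column names, comma-separated.")
    return "\n".join(lines)


def select_target_columns(columns: List[str],
                          sample_rows: List[Dict[str, str]],
                          purpose: str,
                          llm: Callable[[str], str]) -> List[str]:
    """Ask the model for target columns and keep only names that really exist."""
    reply = llm(build_column_prompt(columns, sample_rows, purpose))
    requested = {name.strip() for name in reply.split(",")}
    # Filter against the schema so a hallucinated column name is dropped.
    return [c for c in columns if c in requested]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return "city, opened_date, not_a_column"


columns = ["id", "city", "opened_date", "notes"]
rows = [{"id": "1", "city": "chicago ", "opened_date": "2019/03/01", "notes": ""}]
targets = select_target_columns(
    columns, rows, "How many stores opened in Chicago in 2019?", fake_llm
)
print(targets)
```

Validating the reply against the known schema is a design choice of this sketch: it keeps a free-text model answer from introducing columns that do not exist in the table.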

Data Quality Assessment

Once the target columns are selected, the LLM inspects the quality of the data in each target column and generates a Data Quality Report, which serves as the set of operation objectives for the rest of the workflow. The report covers the three issue types the pipeline targets: duplicates, missing values, and inconsistent formatting.
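To make the idea of a per-column quality report concrete, here is a small rule-based sketch. It is only an approximation for illustration: in the paper the assessment is produced by an LLM, whereas this example swaps in simple counting heuristics, and the report fields (`missing`, `duplicate_values`, `format_patterns`, `consistent_format`) are assumed names rather than the paper's schema.

```python
import re
from collections import Counter
from typing import Dict, List, Optional


def shape_of(value: str) -> str:
    """Map a value to a coarse format pattern: digits become 9, letters become a."""
    return re.sub(r"[A-Za-z]", "a", re.sub(r"\d", "9", value))


def column_quality_report(values: List[Optional[str]]) -> Dict[str, object]:
    """Summarise missing values, repeated values, and format variety in one column."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    counts = Counter(present)
    shapes = Counter(shape_of(v) for v in present)
    return {
        "missing": len(values) - len(present),
        # Each extra occurrence beyond the first counts as one duplicate.
        "duplicate_values": sum(n - 1 for n in counts.values() if n > 1),
        "format_patterns": dict(shapes),
        "consistent_format": len(shapes) <= 1,
    }


dates = ["2019/03/01", "2019-03-02", "", None, "2019/03/01"]
report = column_quality_report(dates)
print(report)
```

In this toy column the report flags two missing entries, one repeated value, and two competing date layouts, which is the kind of signal a later step could turn into cleaning objectives.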

Predicting Next Operation

Based on the results of the Data Quality Report, the LLM predicts the next operation and its arguments. The model is not fine-tuned for this; it is prompted on data cleaning operations that repair duplicates, missing values, and inconsistent formats. It takes into account the purpose and the identified data quality issues to determine the most suitable operation.
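The predict-and-apply cycle can be pictured as a loop that keeps asking for the next operation until nothing is left to fix, recording each step as the workflow. The sketch below assumes this loop structure and uses a hypothetical `stub_policy` in place of the LLM call; the operation names `trim` and `replace` and the `max_steps` cap are illustrative choices, not operations taken from the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

Column = List[str]
Operation = Dict[str, str]  # e.g. {"op": "trim"} or {"op": "replace", "old": "/", "new": "-"}


def apply_operation(values: Column, step: Operation) -> Column:
    """Apply one recorded operation to every value in the column."""
    if step["op"] == "trim":
        return [v.strip() for v in values]
    if step["op"] == "replace":
        return [v.replace(step["old"], step["new"]) for v in values]
    raise ValueError(f"unknown operation: {step['op']}")


def stub_policy(values: Column) -> Optional[Operation]:
    """Hypothetical stand-in for the LLM: pick the next fix, or None when done."""
    if any(v != v.strip() for v in values):
        return {"op": "trim"}
    if any("/" in v for v in values):
        return {"op": "replace", "old": "/", "new": "-"}
    return None


def generate_workflow(values: Column,
                      policy: Callable[[Column], Optional[Operation]],
                      max_steps: int = 10) -> Tuple[Column, List[Operation]]:
    """Repeatedly predict and apply operations, recording them as a workflow."""
    workflow: List[Operation] = []
    for _ in range(max_steps):  # cap iterations so a bad policy cannot loop forever
        step = policy(values)
        if step is None:
            break
        values = apply_operation(values, step)
        workflow.append(step)
    return values, workflow


cleaned, workflow = generate_workflow([" 2019/03/01", "2019-03-02 "], stub_policy)
print(cleaned)
print(workflow)
```

Returning the list of steps alongside the cleaned values mirrors the pipeline's stated output of both a clean table and the workflow that produced it.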

Evaluating LLM Agents

To evaluate LLM agents' ability to automatically generate workflows for data cleaning purposes of varying difficulty, the authors introduce a data cleaning benchmark. Each annotated instance in the benchmark consists of a purpose, a raw table, a clean table, a data cleaning workflow, and an answer set.
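One way to picture a benchmark instance is as a record holding the five annotated parts, with evaluation comparing the answer derived from a generated table against the annotated answer set. The following is a sketch under assumptions: the `BenchmarkInstance` class, its field types, the toy data, and the precision/recall scoring are all illustrative and are not claimed to be the paper's data format or metrics.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

Table = List[Dict[str, str]]


@dataclass(frozen=True)
class BenchmarkInstance:
    """One annotated example: purpose, tables, reference workflow, and answers."""
    purpose: str
    raw_table: Table
    clean_table: Table
    workflow: List[Dict[str, str]]
    answer_set: FrozenSet[str]


def answer_scores(predicted: FrozenSet[str],
                  expected: FrozenSet[str]) -> Dict[str, float]:
    """Compare a predicted answer set with the annotated one (assumed metric)."""
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return {"precision": precision, "recall": recall}


instance = BenchmarkInstance(
    purpose="Which cities have a store?",
    raw_table=[{"city": " Chicago"}, {"city": "chicago"}, {"city": "Boston"}],
    clean_table=[{"city": "Chicago"}, {"city": "Chicago"}, {"city": "Boston"}],
    workflow=[{"op": "trim"}, {"op": "title_case"}],
    answer_set=frozenset({"Chicago", "Boston"}),
)

# A hypothetical generated table that was trimmed but not case-normalised.
generated = [{"city": "Chicago"}, {"city": "chicago"}, {"city": "Boston"}]
predicted = frozenset(row["city"] for row in generated)
scores = answer_scores(predicted, instance.answer_set)
print(scores)
```

Scoring on the answer set rather than on cell-by-cell table equality reflects the purpose-driven framing: a generated table is judged by whether it is clean enough to answer the query.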

Results and Findings

In the experiments, the authors evaluated three LLMs on auto-generating purpose-driven data cleaning workflows. According to the abstract, the results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning. These findings suggest that LLMs have promising capabilities in inferring necessary cleaning steps and applying them across diverse datasets.

Future Research

While this study represents an initial attempt at fully automating the process of generating data cleaning workflows using LLMs, there is still room for improvement. In future research, the authors plan to focus on examining column schema dependencies and interdependencies among different data cleaning operations in automated workflow generation. This will enhance precision and effectiveness by taking into account relationships between columns within a dataset.

Conclusion

In conclusion, this research paper explores how Large Language Models can be used as automated data cleaning agents capable of generating tailored workflows fit for specific purposes. The AutoDCWorkflow pipeline shows promise in addressing common data quality issues and generating data cleaning workflows without the need for fine-tuning. With further research and development, LLMs have the potential to make the data cleaning process more efficient and reliable.

Created on 17 Apr. 2026


