Improve3C: Data Cleaning on Consistency and Completeness with Currency

AI-generated keywords: Big data management; Data quality; Improve3C framework; Currency constraints; Incomplete and inconsistent data

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Ensuring data quality is crucial in the field of big data management
  • Exponential growth of data from various sources brings forth challenges in maintaining data quality
  • The paper focuses on improving completeness, consistency, and currency of data through a 4-step framework called Improve3C
  • Improve3C aims to detect and improve the quality of incomplete and inconsistent data without timestamps by computing a relative currency order among records based on given criteria
  • Inconsistency repair is carried out before incompleteness repair, for both effectiveness and efficiency
  • A currency-related consistency distance metric is introduced to accurately measure similarity between dirty records and clean ones
  • Currency orders are incorporated as an important feature in training models for incompleteness repair
  • Solution algorithms are provided with examples to illustrate their application
  • Experiments using real-life and synthetic datasets validate the proposed method's effectiveness in cleaning dirty data with multiple quality problems
  • The paper highlights the significance of data quality in big data management and presents a comprehensive framework for addressing completeness, consistency, and currency issues.
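
The idea of deriving a relative currency order for records that carry no timestamps can be sketched as follows. This is only a minimal illustration, not the paper's algorithm: the records, the `TITLE_RANK` table, the two rules in `constraints`, and the function `currency_order` are all hypothetical. Each rule states when one record is assumed to be older than another, and a topological sort turns those pairwise facts into one order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical records describing the same entity, with no timestamps.
records = {
    "r1": {"title": "Engineer", "salary": 5000},
    "r2": {"title": "Manager", "salary": 8000},
    "r3": {"title": "Senior Engineer", "salary": 6500},
}

# Hypothetical currency constraints: each rule returns True when
# record `a` is assumed to be older (less current) than record `b`.
TITLE_RANK = {"Engineer": 0, "Senior Engineer": 1, "Manager": 2}
constraints = [
    lambda a, b: TITLE_RANK[a["title"]] < TITLE_RANK[b["title"]],
    lambda a, b: a["salary"] < b["salary"],
]


def currency_order(records, constraints):
    """Return record ids ordered from least current to most current."""
    # Map each record id to the set of ids known to be older than it.
    older_than = {rid: set() for rid in records}
    for a_id, a in records.items():
        for b_id, b in records.items():
            if a_id != b_id and any(rule(a, b) for rule in constraints):
                older_than[b_id].add(a_id)
    # static_order() yields predecessors (older records) first and raises
    # graphlib.CycleError if the constraints contradict each other.
    return list(TopologicalSorter(older_than).static_order())


order = currency_order(records, constraints)
print(order)
```

Contradictory constraints surface as a cycle in the graph, which is one simple way to flag records that cannot be placed in a consistent order.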

Authors: Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, Jianzhong Li, Hong Gao

Abstract: Data quality plays a key role in big data management today. With the explosive growth of data from a variety of sources, the quality of data is faced with multiple problems. Motivated by this, we study the multiple data quality improvement on completeness, consistency and currency in this paper. For the proposed problem, we introduce a 4-step framework, named Improve3C, for detection and quality improvement on incomplete and inconsistent data without timestamps. We compute and achieve a relative currency order among records derived from given currency constraints, according to which inconsistent and incomplete data can be repaired effectively considering the temporal impact. For both effectiveness and efficiency consideration, we carry out inconsistent repair ahead of incomplete repair. Currency-related consistency distance is defined to measure the similarity between dirty records and clean ones more accurately. In addition, currency orders are treated as an important feature in the training process of incompleteness repair. The solution algorithms are introduced in detail with examples. A thorough experiment on one real-life data and a synthetic one verifies that the proposed method can improve the performance of dirty data cleaning with multiple quality problems which are hard to be cleaned by the existing approaches effectively.
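
The abstract does not give the definition of the currency-related consistency distance, so the following is only a hedged sketch of the general idea: a distance that combines attribute disagreement with how far apart two records are in the currency order. The function `currency_consistency_distance`, the weight `alpha`, the rank values, and the sample records are all assumptions made for illustration.

```python
def currency_consistency_distance(dirty, clean, dirty_rank, clean_rank,
                                  n_ranks, alpha=0.5):
    """Blend attribute mismatch with the currency-rank gap (both in [0, 1]).

    Assumed form, not the paper's definition. Missing values (None) in the
    dirty record are ignored when counting mismatches.
    """
    attrs = [k for k, v in dirty.items() if v is not None]
    mismatch = sum(dirty[k] != clean.get(k) for k in attrs) / max(len(attrs), 1)
    currency_gap = abs(dirty_rank - clean_rank) / max(n_ranks - 1, 1)
    return (1 - alpha) * mismatch + alpha * currency_gap


# Hypothetical dirty record (misspelled city) at currency rank 1 of 3.
dirty = {"title": "Senior Engineer", "city": "Bostn"}
dirty_rank = 1

# Hypothetical clean candidates, each paired with its currency rank.
candidates = {
    "c_old": ({"title": "Engineer", "city": "Boston"}, 0),
    "c_new": ({"title": "Senior Engineer", "city": "Boston"}, 1),
}

distances = {
    name: currency_consistency_distance(dirty, rec, dirty_rank, rank, n_ranks=3)
    for name, (rec, rank) in candidates.items()
}
best = min(distances, key=distances.get)
print(best, distances)
```

Under these assumptions, a candidate from the same point in the currency order is preferred over one that matches equally well on values but belongs to an older state of the entity.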

Submitted to arXiv on 31 Jul. 2018


Results of the summarizing process for the arXiv paper: 1808.00024v1


In the field of big data management, ensuring data quality is crucial due to the exponential growth of data from various sources. However, this growth also brings multiple challenges in maintaining the quality of data. In response, this paper focuses on improving the completeness, consistency, and currency of data through a 4-step framework called Improve3C. The framework aims to detect and improve the quality of incomplete and inconsistent data that lacks timestamps by computing a relative currency order among records based on given currency constraints. This allows inconsistent and incomplete data to be repaired effectively while taking temporal impact into account.

To enhance both effectiveness and efficiency, the authors carry out inconsistency repair before incompleteness repair. They also introduce a currency-related consistency distance metric to measure the similarity between dirty records (inconsistent or incomplete) and clean ones more accurately. Additionally, they incorporate currency orders as an important feature in training models for incompleteness repair. The paper explains the solution algorithms in detail, with examples to illustrate their application.

To validate the proposed method, the authors conduct experiments on one real-life dataset and one synthetic dataset. The results indicate that Improve3C improves the cleaning of dirty data with multiple quality problems that existing approaches struggle to clean effectively. Overall, the paper highlights the significance of data quality in big data management and presents a framework that addresses completeness, consistency, and currency together.
Created on 07 Feb. 2024
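
To illustrate how a currency order might serve as a feature when repairing incomplete records, here is a minimal sketch. It is not the trained model the paper describes: it swaps in a simple nearest-neighbour fill, where a missing value is copied from the complete record closest in currency rank. The function `impute_by_currency`, the rank values, and the sample records are hypothetical.

```python
def impute_by_currency(records, ranks, attr):
    """Fill missing `attr` values from the donor closest in currency rank.

    Simplified stand-in for a learned imputation model. Ties are broken
    in favour of the more current donor. The input is not modified.
    """
    donors = [rid for rid, rec in records.items() if rec.get(attr) is not None]
    repaired = {rid: dict(rec) for rid, rec in records.items()}
    for rid, rec in records.items():
        if rec.get(attr) is None and donors:
            nearest = min(
                donors,
                key=lambda d: (abs(ranks[d] - ranks[rid]), -ranks[d]),
            )
            repaired[rid][attr] = records[nearest][attr]
    return repaired


# Hypothetical records of one entity; r2 is missing its city.
records = {
    "r1": {"title": "Engineer", "city": "Boston"},
    "r2": {"title": "Senior Engineer", "city": None},
    "r3": {"title": "Manager", "city": "Chicago"},
}
# Hypothetical currency ranks (0 = least current), assumed to be given.
ranks = {"r1": 0, "r2": 1, "r3": 2}

repaired = impute_by_currency(records, ranks, "city")
print(repaired["r2"])
```

Here `r2` is equally close to both donors in rank, so the tie-break picks the more current one. A learned model could weigh the rank alongside other attributes instead of relying on it alone.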
