In data science, missing data presents a significant challenge that affects decision-making processes and outcomes. Understanding the nature of missing data, how it arises, and why it must be handled appropriately is crucial when working with real-world data, particularly tabular data, which is widely used. Existing literature defines three main missingness mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), each posing distinct challenges for imputation techniques. Recent studies by Graham et al., Dong et al., and Sun et al. have examined various methods for handling missing values, such as normal-model multiple imputation, full information maximum likelihood, expectation-maximization algorithms, deep learning, and traditional machine learning methods. However, these studies concentrate primarily on MCAR because of its relative simplicity, leaving a gap in understanding how to address the more complex MAR and MNAR cases effectively. To bridge this gap, our study makes several contributions to the field:
1. Comprehensive Review of Special Missing Mechanisms in Tabular Data: We provide an extensive summary and detailed discussion of methods for handling missing data, with a focus on special missing mechanisms in tabular data. Our review covers traditional techniques like deletion and imputation as well as emerging methods based on representation learning. By emphasizing deep learning-based strategies, we aim to equip researchers with valuable resources for addressing missing data challenges effectively.
2. Thorough Examination of Missing Data Generation Methods: We catalog the different approaches used to generate missing data, especially for the MAR and MNAR mechanisms that are less explored in existing literature. Our goal is to raise awareness of the importance and variability of these special missing mechanisms and to encourage further exploration in future studies.
3. Guidance for Future Research Directions: We propose future research directions aimed at overcoming limitations of existing methods and promoting advanced techniques in practical settings. By identifying research gaps within the literature and suggesting new applications for imputation schemes, our study serves as a roadmap for researchers and practitioners.
The paper is organized as follows: background on key features of missing data, including patterns and mechanisms; common methods for handling missing data; a taxonomy of handling techniques; specific methods for dealing with missing data; evaluation metrics used to measure performance; generation methods for special missing mechanisms commonly used in the literature; challenges faced in the field; and future research directions. Overall, our study aims to advance the field of imputation techniques by addressing the complexities of special missing mechanisms in tabular data through a comprehensive review and by proposing directions for future research.
- Missing data in the field of data science presents a significant challenge, impacting decision-making processes and outcomes.
- Three main missing mechanisms are defined: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), each posing unique challenges in imputation techniques.
- Existing research primarily focuses on MCAR, with a lack of exploration into the more complex cases of MAR and MNAR.
- Recent studies have delved into various methods for handling missing values, including normal-model multiple imputation, full information maximum likelihood, expectation-maximization algorithms, deep learning, and traditional machine learning methods.
- The study makes several contributions to the field:
  1. Comprehensive Review of Special Missing Mechanisms in Tabular Data
  2. Thorough Examination of Missing Data Generation Methods
  3. Guidance for Future Research Directions
Summary
- Sometimes, when we are working with data, some information is missing, which makes things difficult.
- There are three main ways that data can be missing: completely random, at random, or not at random.
- People have mostly studied the first type of missing data and haven't looked as much into the other two types.
- Researchers have come up with different ways to fill in the missing information using various techniques like deep learning and traditional machine learning.
- The study has looked at special cases of missing data and given ideas for future research.
Definitions
- Missing data: Information that is not available or incomplete in a dataset.
- Imputation: Filling in missing data with estimated values.
- Mechanisms: The different ways in which data can come to be missing.
Introduction
In the field of data science, missing data is a common and significant challenge that can impact decision-making processes and outcomes. It refers to the absence of values in a dataset, which can occur for various reasons such as human error, technical issues, or incomplete surveys. Understanding the nature of missing data, how it occurs, and the importance of handling it appropriately is crucial when working with real-world data, particularly tabular data, which is widely used.
Existing literature defines three main missingness mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). MCAR refers to cases where the probability of a value being missing is unrelated to any variable in the dataset, whether observed or unobserved. MAR occurs when that probability depends only on observed variables in the dataset, so the missingness can be explained by data that is available. MNAR occurs when the probability depends on unobserved information, typically the missing value itself, even after accounting for the observed variables.
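The three mechanisms can also be stated formally. The block below follows the standard formulation from the missing-data literature; the notation is an assumption introduced here, not taken from the study: X_obs and X_mis are the observed and missing parts of the data, M is the binary missingness indicator, and phi denotes the parameters of the missingness model.

```latex
\begin{align*}
\text{MCAR:}\quad & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}, \phi) = P(M \mid \phi) \\
\text{MAR:}\quad  & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}, \phi) = P(M \mid X_{\mathrm{obs}}, \phi) \\
\text{MNAR:}\quad & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}, \phi) \ \text{still depends on } X_{\mathrm{mis}} \text{ after conditioning on } X_{\mathrm{obs}}
\end{align*}
```

Read this way, MCAR is a special case of MAR, and MNAR covers every situation in which the MAR equality fails.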
Most research focuses on MCAR because of its relative simplicity, and the more complex MAR and MNAR cases remain comparatively unexplored. Recent studies have examined various methods for handling missing values, such as normal-model multiple imputation, full information maximum likelihood, expectation-maximization algorithms, deep learning, and traditional machine learning methods. However, these studies also concentrate primarily on MCAR, leaving a gap in understanding how to address MAR and MNAR effectively.
To bridge this gap, our study makes several contributions to the field:
Comprehensive Review of Special Missing Mechanisms in Tabular Data
We provide an extensive summary and detailed discussion of methods for handling missing data with a focus on special missing mechanisms in tabular data. Our review covers traditional techniques like deletion and imputation as well as emerging methods based on representation learning. By emphasizing deep learning-based strategies, we aim to equip researchers with valuable resources for addressing missing data challenges effectively.
Thorough Examination of Missing Data Generation Methods
We meticulously catalog different approaches used in generating missing data, especially for MAR and MNAR mechanisms that are less explored in existing literature. Our goal is to raise awareness about these special missing mechanisms' importance and variability to encourage further exploration in future studies.
Guidance for Future Research Directions
We propose future research directions aimed at overcoming limitations of existing methods and promoting advanced techniques in practical settings. By identifying research gaps within the literature and suggesting new applications for imputation schemes, our study serves as a roadmap for researchers and practitioners.
Paper Organization
The paper is organized as follows: background on key features of missing data, including patterns and mechanisms; common methods for handling missing data; a taxonomy of handling techniques; specific methods for dealing with missing data; evaluation metrics used to measure performance; generation methods for special missing mechanisms commonly used in the literature; challenges faced in the field; and future research directions.
Background Information on Missing Data
This section provides an overview of the missingness mechanisms (MCAR, MAR, MNAR) and their implications for statistical analysis. It also discusses patterns of missingness (e.g., the monotone pattern), which can affect the choice of imputation technique.
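To make the monotone pattern concrete, here is a minimal sketch that checks whether a missingness mask is monotone. It assumes the usual definition: the columns can be ordered so that, in every row, once a value is missing all later columns are missing too (as happens with dropout in longitudinal data). The function name `is_monotone` and the toy masks are illustrative, not from the study.

```python
def is_monotone(mask):
    """Return True if the missingness pattern is monotone.

    mask[i][j] is True when the value in row i, column j is missing.
    """
    if not mask:
        return True
    n_cols = len(mask[0])
    # Order columns from fewest to most missing entries; if any monotone
    # ordering of the columns exists, this ordering is monotone as well.
    missing_counts = [sum(row[j] for row in mask) for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: missing_counts[j])
    for row in mask:
        seen_missing = False
        for j in order:
            if row[j]:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value follows a missing one
    return True


# Dropout-style pattern: once a row goes missing it stays missing.
dropout = [
    [False, False, False],
    [False, False, True],
    [False, True, True],
]
# Scattered pattern: no column ordering makes it monotone.
scattered = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
]
print(is_monotone(dropout), is_monotone(scattered))
```

Monotone patterns matter in practice because they allow variables to be imputed sequentially, one column at a time, without iterating.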
Common Methods for Handling Missing Data
Here we discuss traditional approaches such as deletion (complete case analysis), single imputation (mean/median/mode imputation), multiple imputation (e.g., chained equations), and hot deck imputation, as well as newer techniques like deep learning-based approaches.
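The two simplest approaches named above, complete case analysis and mean imputation, can be sketched in a few lines. This is a dependency-free illustration on made-up data, with `None` standing in for a missing value; the helper names are hypothetical.

```python
from statistics import mean

# Toy table; None marks a missing value (illustrative data only).
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [4.0, 40.0],
]


def listwise_delete(table):
    """Complete case analysis: keep only rows with no missing values."""
    return [row for row in table if all(v is not None for v in row)]


def mean_impute(table):
    """Single imputation: replace each missing value with its column mean.

    Assumes every column has at least one observed value.
    """
    n_cols = len(table[0])
    col_means = [
        mean(row[j] for row in table if row[j] is not None)
        for j in range(n_cols)
    ]
    return [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in table
    ]


print(listwise_delete(rows))
print(mean_impute(rows))
```

Deletion discards half of this toy table, while mean imputation keeps every row but shrinks the variance of each imputed column. Both can bias results when the data are not MCAR.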
Taxonomy of Handling Techniques
This section categorizes different handling techniques based on their underlying assumptions or principles. For example: model-based vs non-model-based, single imputation vs multiple imputation, and parametric vs non-parametric methods.
Specific Methods for Dealing with Missing Data
We provide a detailed discussion of specific methods for handling missing data such as normal-model multiple imputation, full information maximum likelihood, expectation-maximization algorithms, deep learning techniques (e.g., autoencoders), and traditional machine learning methods (e.g., k-nearest neighbors).
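As one example from this list, k-nearest neighbors imputation can be sketched as follows. This is a simplified, dependency-free illustration rather than the procedure of any particular study or library: distances are computed over the columns observed in both rows, and each missing entry is filled with the mean of that column over the k nearest donor rows. The function name `knn_impute` and the data are hypothetical.

```python
import math


def knn_impute(table, k=2):
    """Fill each missing entry (None) with the mean of that column over the
    k nearest rows in which the column is observed."""

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows,
        # so rows with different amounts of overlap stay comparable.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in table]
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with column j observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(table)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [9.0, 9.5, 10.0],
    [1.2, 1.9, 3.2],
]
print(knn_impute(data, k=2))
```

Here the second row borrows from the first and fourth rows, which are close to it, and ignores the distant third row.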
Evaluation Metrics Used to Measure Performance
This section discusses commonly used metrics for evaluating the performance of imputation techniques such as mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination (R-squared).
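These metrics are typically computed by masking values whose ground truth is known, imputing them, and comparing the two. A minimal sketch under that assumption follows; the function name and the numbers are illustrative only.

```python
import math


def imputation_scores(true_values, imputed_values):
    """Compute MSE, RMSE and R-squared over the imputed entries only."""
    n = len(true_values)
    errors = [t - p for t, p in zip(true_values, imputed_values)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mean_true = sum(true_values) / n
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    r2 = 1.0 - ss_res / ss_tot  # undefined if all true values are equal
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Ground-truth values that were masked out, and what an imputer filled in.
truth = [2.0, 4.0, 6.0, 8.0]
filled = [2.5, 3.5, 6.5, 7.5]
print(imputation_scores(truth, filled))
```

RMSE is in the same units as the variable, which makes it easier to interpret than MSE, while R-squared is scale-free and easier to compare across columns.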
Commonly Used Generation Methods for Special Missing Mechanisms from Literature Reviews
Here we catalog different approaches used in generating missing data, especially for MAR and MNAR mechanisms. This includes techniques like conditional probability models and pattern mixture models.
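One common way to generate such masks, shown here as a hedged sketch rather than as the study's procedure, is to draw each missingness indicator from a Bernoulli distribution whose probability is constant (MCAR), a logistic function of a fully observed column (MAR), or a logistic function of the value being masked (self-masking MNAR). The function `make_masks`, its parameters, and the synthetic data are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_masks(x_obs, x_target, rate=0.3, strength=2.0, seed=0):
    """Return MCAR, MAR and MNAR masks for the column x_target.

    x_obs is a fully observed column; True in a mask means "missing".
    `rate` sets the MCAR probability; for MAR and MNAR the probability is
    sigmoid(strength * value - 1.0).
    """
    rng = random.Random(seed)
    # MCAR: every entry has the same missingness probability.
    mcar = [rng.random() < rate for _ in x_target]
    # MAR: the probability depends only on the observed column.
    mar = [rng.random() < sigmoid(strength * a - 1.0) for a in x_obs]
    # MNAR (self-masking): the probability depends on the value itself.
    mnar = [rng.random() < sigmoid(strength * b - 1.0) for b in x_target]
    return mcar, mar, mnar


def mean_where(values, mask, flag):
    """Mean of the values whose mask entry equals flag."""
    picked = [v for v, m in zip(values, mask) if m == flag]
    return sum(picked) / len(picked)


# Synthetic data: x_target is correlated with the observed column x_obs.
data_rng = random.Random(42)
x_obs = [data_rng.gauss(0.0, 1.0) for _ in range(2000)]
x_target = [0.5 * a + data_rng.gauss(0.0, 1.0) for a in x_obs]
mcar, mar, mnar = make_masks(x_obs, x_target)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    missing_rate = sum(mask) / len(mask)
    observed_mean = mean_where(x_target, mask, False)
    print(name, round(missing_rate, 3), round(observed_mean, 3))
```

Comparing the mean of the remaining observed values across the three masks shows why the mechanism matters: under MAR and MNAR, large values are removed preferentially, so the observed mean drifts away from the true mean.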
Challenges Faced in the Field
We discuss common challenges faced when working with missing data such as biased results due to inappropriate handling techniques or difficulties in identifying the true mechanism behind missingness.
Future Directions for Research Works
In this section, we propose future research directions aimed at addressing limitations of existing methods and promoting advanced techniques in practical settings. These include exploring new applications of imputation schemes, developing more robust deep learning-based approaches that can handle complex patterns of missingness, and investigating ways to incorporate external information into imputation models.
Conclusion
Our study provides a comprehensive review of special missing mechanisms in tabular data along with an examination of various generation methods and guidance for future research directions. By addressing the complexities associated with MAR and MNAR mechanisms through thorough reviews and proposing innovative solutions, our study aims to advance the field of imputation techniques. We hope that our work will serve as a valuable resource for researchers and practitioners working with missing data in real-world settings.