A Systematic Approach to Cleaning Routine Health Surveillance Datasets: An Illustration Using National Vector Borne Disease Control Programme Data of Punjab, India

AI-generated keywords: ICT4D, data science, data cleaning, routine health information systems, logic model

AI-generated Key Points

Advances in Information and Communication Technologies for Development (ICT4D) and data science have enabled a systematic, reproducible, and scalable approach to data cleaning in routine health information systems.
A logic model was developed for data cleaning, incorporating an algorithm for screening, diagnosis, and editing datasets in a rule-based, interactive, and semi-automated manner.
The model was successfully applied to the dengue line-list of the National Vector Borne Disease Control Programme in Punjab, India from 2015 to 2019.
High cleaning and imputation rates were achieved: an estimated date was cleaned or imputed for 96.1% of records for 2015, 98.9% for 2016, and all cases from 2017 to 2019.
Age and sex information was cleaned and extracted for more than 98.4% and 99.4% of records, respectively.
The logic model resulted in an analysis-ready dataset that supports understanding of spatiotemporal epidemiology and data-driven public health decision-making.
Limitations: the reasons behind data anomalies were outside the study's scope, and additional screening mechanisms are needed when applying the algorithm to other diseases or program datasets.
Despite these limitations, the study's strength lies in its reproducible and scalable logic algorithm for preprocessing routine health surveillance data, with potential applications beyond dengue surveillance in India to other diseases globally.
Future studies should explore additional variables relevant to health program managers while ensuring external generalizability of the algorithm across diverse contexts by incorporating robust ICT4D principles into research practices.
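
To make the age and sex extraction step more concrete, here is a minimal Python sketch of rule-based parsing from a free-text field. The source does not describe the raw field format or the rules the authors used, so the input shapes (such as "35/M"), the regular expressions, the plausibility range, and the function name `extract_age_sex` are all assumptions for illustration only.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns; the source does not describe the raw field format.
AGE_RE = re.compile(r"(\d{1,3})")
# Match m/f/male/female only when not embedded in a longer word (e.g. "months").
SEX_RE = re.compile(r"(?<![a-z])(male|female|m|f)(?![a-z])", re.IGNORECASE)


def extract_age_sex(raw: str) -> Tuple[Optional[int], Optional[str]]:
    """Return (age, sex) parsed from a free-text field; None where not found.

    Simplification: the first number is read as age in years, so an entry
    such as "6 months" would need an extra rule in a real workflow.
    """
    age_match = AGE_RE.search(raw)
    age = int(age_match.group(1)) if age_match else None
    # Diagnosis: an implausible age is treated as missing (assumed range 0-120).
    if age is not None and not 0 <= age <= 120:
        age = None
    sex_match = SEX_RE.search(raw)
    # Editing: normalise every accepted spelling to a single letter, "M" or "F".
    sex = sex_match.group(1)[0].upper() if sex_match else None
    return age, sex
```

For example, `extract_age_sex("45 Y Male")` returns `(45, "M")`, while an entry with no recognisable content returns `(None, None)` and can be counted as not extracted.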

Authors: Gurpreet Singh, Biju Soman, Arun Mitra

arXiv: 2108.09963v1 - DOI (cs.CY)

In proceedings of the 1st Virtual Conference on Implications of Information and Digital Technologies for Development, 2021

License: CC BY-NC-SA 4.0

Abstract: Advances in ICT4D and data science facilitate systematic, reproducible, and scalable data cleaning for strengthening routine health information systems. A logic model for data cleaning was used, and it included an algorithm for screening, diagnosis, and editing datasets in a rule-based, interactive, and semi-automated manner. A priori computational workflows and operational definitions were prepared. Model performance was illustrated using the dengue line-list of the National Vector Borne Disease Control Programme, Punjab, India from 01 January 2015 to 31 December 2019. Cleaning and imputation for an estimated date were successful for 96.1% and 98.9% of records for the years 2015 and 2016 respectively, and for all cases in the years 2017, 2018, and 2019. Information for age and sex was cleaned and extracted for more than 98.4% and 99.4% of records respectively. The logic model application resulted in the development of an analysis-ready dataset that can be used to understand spatiotemporal epidemiology and facilitate data-based public health decision making.
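
The screening, diagnosis, and editing steps named in the abstract can be sketched for the estimated-date variable as follows. This is only an illustration under stated assumptions: the field names, their priority order, the accepted date formats, and the rule of taking the first valid date are hypothetical and are not described in the source. Only the study window (01 January 2015 to 31 December 2019) comes from the abstract.

```python
from datetime import date, datetime
from typing import Dict, Optional

# Study window, as stated in the abstract.
STUDY_START = date(2015, 1, 1)
STUDY_END = date(2019, 12, 31)

# Hypothetical field names, priority order, and formats (not from the source).
DATE_FIELDS = ("date_of_onset", "date_of_testing", "date_of_reporting")
DATE_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def parse_date(raw: Optional[str]) -> Optional[date]:
    """Screening: try each known format; return None if none of them parses."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def estimated_date(record: Dict[str, Optional[str]]) -> Optional[date]:
    """Diagnosis and editing: use the first date that parses and lies in the window."""
    for field in DATE_FIELDS:
        parsed = parse_date(record.get(field))
        if parsed is not None and STUDY_START <= parsed <= STUDY_END:
            return parsed
    # No usable date: the record stays flagged as not cleaned.
    return None
```

With this rule, a record whose onset date is impossible (for example "31/02/2016") falls back to the next available date field, and a record with no usable date is left unimputed so it can be reviewed interactively.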

Submitted to arXiv on 23 Aug. 2021

Results of the summarizing process for the arXiv paper: 2108.09963v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

Advances in Information and Communication Technologies for Development (ICT4D) and data science have paved the way for a systematic, reproducible, and scalable approach to data cleaning in routine health information systems. A logic model was developed for data cleaning, incorporating an algorithm for screening, diagnosis, and editing datasets in a rule-based, interactive, and semi-automated manner. This model was applied to the dengue line-list of the National Vector Borne Disease Control Programme in Punjab, India from 2015 to 2019.

The study successfully cleaned and imputed missing data for an estimated date in over 96% of records for 2015 and 98.9% for 2016. Additionally, all cases from 2017 to 2019 were cleaned and imputed effectively. Information on age and sex was extracted accurately from more than 98.4% and 99.4% of records respectively. The application of the logic model resulted in the development of an analysis-ready dataset that can be utilized to understand spatiotemporal epidemiology and support data-driven public health decision-making.

While the study demonstrated significant progress in enhancing data quality within routine health surveillance systems, there are limitations to consider. Understanding the underlying reasons behind data anomalies was not within the scope of this study, highlighting a need for further research in this area. The algorithm developed focused on a single disease dataset, necessitating additional screening mechanisms when applied to different diseases or program datasets.

Despite these limitations, the study's strength lies in its reproducible and scalable logic algorithm for preprocessing routine health surveillance data. This approach enables researchers to analyze existing datasets efficiently, leading to insights into disease epidemiology over time and space. The scalability of these algorithms offers potential applications beyond dengue surveillance in India to other diseases globally.

Moving forward, future studies should explore additional variables relevant to health program managers and researchers while ensuring external generalizability of the algorithm across diverse contexts. By incorporating robust ICT4D principles into research practices, sustainable improvements in data quality can be achieved, ultimately contributing to better-informed public health interventions worldwide.
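
The per-year percentages reported above are shares of records for which cleaning or imputation produced a usable value. A minimal sketch of that calculation is given below. The record layout (a reporting year paired with a cleaned value or `None`) and the function name `success_rate_by_year` are assumptions, and the values in the usage note are toy data rather than the study's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def success_rate_by_year(
    records: Iterable[Tuple[int, Optional[str]]]
) -> Dict[int, float]:
    """Percentage of records per reporting year that have a non-missing cleaned value."""
    total: Dict[int, int] = defaultdict(int)
    cleaned: Dict[int, int] = defaultdict(int)
    for year, value in records:
        total[year] += 1
        if value is not None:
            cleaned[year] += 1
    # One decimal place, matching the precision used in the reported figures.
    return {year: round(100 * cleaned[year] / total[year], 1) for year in sorted(total)}
```

As a usage example with toy records, `success_rate_by_year([(2015, "2015-07-01"), (2015, None), (2016, "2016-03-03")])` gives 50.0 for 2015 and 100.0 for 2016.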

Summary:
- New technologies and data science help clean health data more efficiently.
- A special method was created to clean and fix datasets in a rule-based, semi-automated way.
- The method was used successfully in India to clean dengue disease data.
- Most of the data from 2015 to 2019 was cleaned successfully.
- Important information like age and sex was extracted for almost all records.

Definitions:
- Technologies: Tools or methods used to do things better or faster.
- Data science: Using math and technology to understand and work with data.
- Datasets: Collections of information or data organized for analysis.
- Dengue: A disease spread by mosquitoes that causes fever and body pain.
- Epidemiology: Studying how diseases spread in populations.

Advances in Information and Communication Technologies for Development (ICT4D) and data science are changing how data cleaning is approached in routine health information systems. A logic model that incorporates an algorithm for screening, diagnosis, and editing datasets offers a systematic, reproducible, and scalable approach, and it has shown promising results in improving data quality within health surveillance systems.

The study applied this logic model to the dengue line-list of the National Vector Borne Disease Control Programme in Punjab, India, from 2015 to 2019. The aim was to create an analysis-ready dataset that could be used to understand spatiotemporal epidemiology and support data-driven public health decision-making. The results were encouraging: an estimated date was successfully cleaned or imputed for over 96% of records from 2015 and 98.9% from 2016, and for all cases from 2017 to 2019.

One of the key strengths of this study lies in its use of ICT4D principles, which enabled the researchers to develop a logic model that is not only reproducible but also scalable. This means that it can be applied not just to dengue surveillance in India but also potentially to other diseases globally. By incorporating robust ICT4D principles into research practices, sustainable improvements in data quality can be achieved.

However, the study has limitations. While the logic model improved data quality within a routine health surveillance system, it did not address the underlying reasons behind data anomalies, which highlights a need for further research in this area. Another limitation is that the algorithm was developed on a single disease dataset (the dengue line-list), so additional screening mechanisms may be required when it is applied to different diseases or program datasets. Therefore, future studies should explore additional variables relevant to health program managers and researchers while ensuring the external generalizability of the algorithm across diverse contexts.

Despite these limitations, this study points toward a more efficient and effective approach to data cleaning in routine health information systems. By utilizing ICT4D principles and data science techniques, the researchers developed a logic model that can be applied on a larger scale, leading to insights into disease epidemiology over time and space. The implications are relevant not just for India but also for other developing countries where routine health surveillance systems may face similar challenges with data quality. With this systematic approach to data cleaning, decision-makers can have access to more reliable data, which is crucial in designing effective public health interventions.

In conclusion, the application of ICT4D principles and data science techniques in developing a logic model for data cleaning has shown promising results in improving data quality within routine health surveillance systems. This study is a step towards sustainable improvements in public health interventions worldwide. Further research in this area can continue to enhance our understanding of disease epidemiology and support evidence-based decision-making for better health outcomes globally.

Created on 22 Aug. 2025

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.