A Systematic Approach to Cleaning Routine Health Surveillance Datasets: An Illustration Using National Vector Borne Disease Control Programme Data of Punjab, India

AI-generated keywords: ICT4D; data science; data cleaning; routine health information systems; logic model

AI-generated Key Points

  • Advances in Information and Communication Technologies for Development (ICT4D) and data science have enabled a systematic, reproducible, and scalable approach to data cleaning in routine health information systems.
  • A logic model was developed for data cleaning, incorporating an algorithm for screening, diagnosis, and editing datasets in a rule-based, interactive, and semi-automated manner.
  • The model was successfully applied to the dengue line-list of the National Vector Borne Disease Control Programme in Punjab, India from 2015 to 2019.
  • Cleaning and imputation of an estimated date succeeded for 96.1% of records for 2015, 98.9% for 2016, and all cases from 2017 to 2019.
  • Age and sex information was cleaned and extracted for more than 98.4% and 99.4% of records, respectively.
  • Applying the logic model produced an analysis-ready dataset that can be used to understand spatiotemporal epidemiology and to support data-driven public health decision-making.
  • Limitations include not exploring reasons behind data anomalies within the study's scope and the need for additional screening mechanisms when applying the algorithm to different diseases or program datasets.
  • Despite these limitations, the study's strength lies in its reproducible and scalable logic algorithm for preprocessing routine health surveillance data, which could be applied beyond dengue surveillance in India to other diseases globally.
  • Future studies should explore additional variables relevant to health program managers while ensuring external generalizability of the algorithm across diverse contexts by incorporating robust ICT4D principles into research practices.

Authors: Gurpreet Singh, Biju Soman, Arun Mitra

In proceedings of the 1st Virtual Conference on Implications of Information and Digital Technologies for Development, 2021
License: CC BY-NC-SA 4.0

Abstract: Advances in ICT4D and data science facilitate systematic, reproducible, and scalable data cleaning for strengthening routine health information systems. A logic model for data cleaning was used; it included an algorithm for screening, diagnosing, and editing datasets in a rule-based, interactive, and semi-automated manner. A priori computational workflows and operational definitions were prepared. Model performance was illustrated using the dengue line-list of the National Vector Borne Disease Control Programme, Punjab, India from 01 January 2015 to 31 December 2019. Cleaning and imputation for an estimated date were successful for 96.1% and 98.9% of records for the years 2015 and 2016 respectively, and for all cases in the years 2017, 2018, and 2019. Information for age and sex was cleaned and extracted for more than 98.4% and 99.4% of records, respectively. The logic model application resulted in the development of an analysis-ready dataset that can be used to understand spatiotemporal epidemiology and facilitate data-based public health decision making.
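The screening, diagnosis, and editing steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the field names (`date_of_onset`, `date_of_testing`, `date_of_reporting`), the accepted date formats, and the fallback order are all assumptions made for illustration. Only the study window (01 January 2015 to 31 December 2019) comes from the abstract.

```python
from datetime import date, datetime

# Study window taken from the abstract.
WINDOW_START, WINDOW_END = date(2015, 1, 1), date(2019, 12, 31)
# Assumed input formats; a real line-list may use others.
DATE_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")
# Hypothetical date fields, tried in this (assumed) priority order.
DATE_FIELDS = ("date_of_onset", "date_of_testing", "date_of_reporting")


def parse_date(raw):
    """Return a date if raw matches one of the assumed formats, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except (ValueError, AttributeError):
            # ValueError: wrong format or impossible date; AttributeError: raw is None.
            continue
    return None


def clean_record(record):
    """Screen, diagnose, and edit the date fields of one line-list record."""
    for field in DATE_FIELDS:
        # Screening: attempt to read the next candidate field.
        parsed = parse_date(record.get(field))
        # Diagnosis: usable only if it parses and falls inside the study window.
        if parsed and WINDOW_START <= parsed <= WINDOW_END:
            # Editing: record the estimated date and which field supplied it.
            return {**record, "estimated_date": parsed, "date_source": field}
    # No usable date found; flag the record for manual review.
    return {**record, "estimated_date": None, "date_source": None}
```

Keeping `date_source` alongside the imputed value leaves an audit trail, so any record whose estimated date came from a fallback field can be identified later.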

Submitted to arXiv on 23 Aug. 2021


Results of the summarizing process for the arXiv paper: 2108.09963v1

Advances in Information and Communication Technologies for Development (ICT4D) and data science have paved the way for a systematic, reproducible, and scalable approach to data cleaning in routine health information systems. A logic model was developed for data cleaning, incorporating an algorithm for screening, diagnosing, and editing datasets in a rule-based, interactive, and semi-automated manner. This model was applied to the dengue line-list of the National Vector Borne Disease Control Programme in Punjab, India from 2015 to 2019.

Cleaning and imputation of an estimated date succeeded for 96.1% of records for 2015 and 98.9% for 2016, and for all cases from 2017 to 2019. Information on age and sex was cleaned and extracted for more than 98.4% and 99.4% of records, respectively. Applying the logic model produced an analysis-ready dataset that can be used to understand spatiotemporal epidemiology and to support data-driven public health decision-making.

The study has limitations. Understanding the reasons behind data anomalies was outside its scope and needs further research. The algorithm was developed on a single disease dataset, so additional screening mechanisms would be needed before applying it to other diseases or program datasets. Despite these limitations, the study's strength lies in its reproducible and scalable logic algorithm for preprocessing routine health surveillance data. This approach lets researchers analyze existing datasets efficiently and gain insight into disease epidemiology over time and space, with potential applications beyond dengue surveillance in India to other diseases globally.

Future studies should explore additional variables relevant to health program managers and researchers while ensuring external generalizability of the algorithm across diverse contexts. Incorporating robust ICT4D principles into research practices could bring sustainable improvements in data quality and, ultimately, better-informed public health interventions.
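The summary reports that age and sex were cleaned and extracted from the records but does not say how. One plausible rule-based approach is sketched below in Python, purely as an illustration: the combined free-text field (for example `"25/M"`), the `SEX_MAP` vocabulary, and the 0 to 120 plausibility range are assumptions, not details from the paper.

```python
import re

# Assumed vocabulary for sex tokens; a real line-list may need more variants.
SEX_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}


def extract_age_sex(raw):
    """Extract (age, sex) from a hypothetical free-text field such as '25/M'."""
    text = (raw or "").lower()

    # Age: take the first run of 1 to 3 digits and keep it only if plausible.
    age_match = re.search(r"\d{1,3}", text)
    age = int(age_match.group()) if age_match else None
    if age is not None and not 0 <= age <= 120:
        age = None

    # Sex: take the first alphabetic token found in the assumed vocabulary.
    # Caveat: 'm' is read as Male, so '6 m' (six months) would be misread.
    sex = None
    for token in re.findall(r"[a-z]+", text):
        if token in SEX_MAP:
            sex = SEX_MAP[token]
            break
    return age, sex
```

Records where either value comes back as `None` would need an additional rule or manual review, in keeping with the semi-automated, interactive workflow the paper describes.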
Created on 22 Aug. 2025
