Towards actionability for open medical imaging datasets: lessons from community-contributed platforms for data management and stewardship

AI-generated keywords: Artificial Intelligence, Healthcare, Medical Imaging, Community-Contributed Platforms, Dataset Management

AI-generated Key Points

  • Medical imaging datasets are crucial for training and evaluating diagnostic algorithms in AI healthcare.
  • Quality of these datasets impacts accuracy, robustness, and fairness of AI models.
  • Concerns exist about governance models on Community-Contributed Platforms (CCPs) like Kaggle and HuggingFace regarding dataset sharing, documentation, and evaluation practices.
  • Issues identified in popular datasets on CCPs include vague licenses, lack of persistent identifiers, duplicates, missing metadata, and platform discrepancies.
  • Proposal for a commons-based stewardship model to enhance dataset documentation and maintenance on CCPs.
  • Emphasis on tracking dataset evolution within static infrastructures to address biases or spurious correlations over time.
  • Suggestions for creating living reviews through community-contributed platforms to enhance transparency and accountability in data development lifecycle.

Authors: Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Théo Sourget, Caroline Vang-Larsen, Hubert Dariusz Zając, Veronika Cheplygina

Manuscript under review
License: CC BY-NC-SA 4.0

Abstract: Medical imaging datasets are fundamental to artificial intelligence (AI) in healthcare. The accuracy, robustness and fairness of diagnostic algorithms depend on the data (and its quality) on which the models are trained and evaluated. Medical imaging datasets have become increasingly available to the public, and are often hosted on Community-Contributed Platforms (CCP), including private companies like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper we investigate medical imaging datasets on CCPs and how they are documented, shared, and maintained. We first highlight some differences between medical imaging and computer vision, particularly in the potentially harmful downstream effects due to poor adoption of recommended dataset management practices. We then analyze 20 (10 medical and 10 computer vision) popular datasets on CCPs and find vague licenses, lack of persistent identifiers and storage, duplicates and missing metadata, with differences between the platforms. We present "actionability" as a conceptual metric to reveal the data quality gap between characteristics of data on CCPs and the desired characteristics of data for AI in healthcare. Finally, we propose a commons-based stewardship model for documenting, sharing and maintaining datasets on CCPs and end with a discussion of limitations and open questions.

Submitted to arXiv on 09 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.06353v1

In the realm of artificial intelligence (AI) in healthcare, medical imaging datasets play a crucial role in training and evaluating diagnostic algorithms. The quality of these datasets directly impacts the accuracy, robustness, and fairness of AI models. While medical imaging datasets are increasingly accessible on Community-Contributed Platforms (CCPs) like Kaggle and HuggingFace, there are concerns regarding the governance model's ability to uphold recommended practices for sharing, documenting, and evaluating datasets. This paper delves into the documentation, sharing, and maintenance of medical imaging datasets on CCPs.

Drawing distinctions between medical imaging and computer vision datasets, the study highlights potential downstream consequences of inadequate dataset management practices. An analysis of 20 popular datasets (10 medical and 10 computer vision) on CCPs reveals issues such as vague licenses, lack of persistent identifiers, duplicates, missing metadata, and platform discrepancies. Introducing the concept of "actionability" as a metric to assess data quality gaps on CCPs versus ideal characteristics for AI in healthcare applications, the paper proposes a commons-based stewardship model for enhancing dataset documentation and maintenance on these platforms.

Additionally, it emphasizes the need for tracking dataset evolution within static infrastructures through processes that acknowledge data as dynamic entities that evolve over time. Recognizing challenges in capturing dataset changes without stable identifiers like DOIs or traditional reviews, the study underscores the importance of formalized practices for documenting explicit changes while addressing implicit variations that may introduce biases or spurious correlations.

Suggestions include creating living reviews through community-contributed platforms where users can contribute derived datasets or related research to enhance transparency and accountability throughout the data development lifecycle. Acknowledging limitations based on quantitative evidence and subjective perceptions, the paper calls for more insights from all stakeholders involved in managing and stewarding medical imaging datasets to foster a comprehensive understanding of best practices in dataset documentation within AI-driven healthcare initiatives.
Created on 25 Mar. 2024

