Towards actionability for open medical imaging datasets: lessons from community-contributed platforms for data management and stewardship

AI-generated keywords: Artificial Intelligence, Healthcare, Medical Imaging, Community-Contributed Platforms, Dataset Management

AI-generated Key Points

- Medical imaging datasets are crucial for training and evaluating diagnostic algorithms in AI for healthcare.
- The quality of these datasets affects the accuracy, robustness, and fairness of AI models.
- The governance model of Community-Contributed Platforms (CCPs) such as Kaggle and HuggingFace raises concerns about dataset sharing, documentation, and evaluation practices.
- Issues identified in popular datasets on CCPs include vague licenses, lack of persistent identifiers, duplicates, missing metadata, and differences between platforms.
- The paper proposes a commons-based stewardship model to improve dataset documentation and maintenance on CCPs.
- It emphasizes tracking how datasets evolve within static infrastructures, since changes over time may introduce biases or spurious correlations.
- It suggests creating living reviews on community-contributed platforms to improve transparency and accountability across the data development lifecycle.

Authors: Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Théo Sourget, Caroline Vang-Larsen, Hubert Dariusz Zając, Veronika Cheplygina

arXiv: 2402.06353v1 (cs.CV)

Manuscript under review

License: CC BY-NC-SA 4.0

Abstract: Medical imaging datasets are fundamental to artificial intelligence (AI) in healthcare. The accuracy, robustness and fairness of diagnostic algorithms depend on the data (and its quality) on which the models are trained and evaluated. Medical imaging datasets have become increasingly available to the public, and are often hosted on Community-Contributed Platforms (CCP), including private companies like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper we investigate medical imaging datasets on CCPs and how they are documented, shared, and maintained. We first highlight some differences between medical imaging and computer vision, particularly in the potentially harmful downstream effects due to poor adoption of recommended dataset management practices. We then analyze 20 (10 medical and 10 computer vision) popular datasets on CCPs and find vague licenses, lack of persistent identifiers and storage, duplicates and missing metadata, with differences between the platforms. We present "actionability" as a conceptual metric to reveal the data quality gap between characteristics of data on CCPs and the desired characteristics of data for AI in healthcare. Finally, we propose a commons-based stewardship model for documenting, sharing and maintaining datasets on CCPs and end with a discussion of limitations and open questions.

Submitted to arXiv on 09 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.06353v1

Comprehensive Summary

Medical imaging datasets are central to artificial intelligence (AI) in healthcare: they are used to train and evaluate diagnostic algorithms, and their quality directly affects the accuracy, robustness, and fairness of the resulting models. These datasets are increasingly available on Community-Contributed Platforms (CCPs) such as Kaggle and HuggingFace, but the paper finds that the current CCP governance model does not uphold recommended practices for sharing, documenting, and evaluating datasets.

The paper examines how medical imaging datasets on CCPs are documented, shared, and maintained. It first contrasts medical imaging with computer vision, stressing the potentially harmful downstream effects of poor dataset management practices. An analysis of 20 popular datasets on CCPs (10 medical and 10 computer vision) then reveals vague licenses, lack of persistent identifiers, duplicates, and missing metadata, with differences between platforms. The paper introduces "actionability" as a conceptual metric for the gap between the characteristics of data on CCPs and those desired for AI in healthcare, and it proposes a commons-based stewardship model for documenting, sharing, and maintaining datasets on these platforms.

The paper also emphasizes tracking dataset evolution within static infrastructures, treating data as dynamic entities that change over time. Such changes are hard to capture with stable identifiers like DOIs or with traditional reviews, so the study underscores the importance of formalized practices that document explicit changes and also address implicit variations that may introduce biases or spurious correlations. One suggestion is to create living reviews on community-contributed platforms, where users can contribute derived datasets or related research to improve transparency and accountability throughout the data development lifecycle. The paper acknowledges limitations in its quantitative evidence and subjective perceptions, and it calls for more insights from all stakeholders involved in managing and stewarding medical imaging datasets.
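
The kinds of issues listed above (vague licenses, missing persistent identifiers, duplicates, missing metadata) can be illustrated with a small audit script. This is only a minimal sketch, not the procedure used in the paper: the metadata field names, the list of "vague" license strings, and the helper names `find_duplicates` and `audit_dataset` are all assumptions made for illustration, and real platforms expose different metadata schemas.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical field names; real platform metadata schemas differ.
REQUIRED_FIELDS = ("license", "description", "source")
# License strings treated as too vague to act on (an assumption of this sketch).
VAGUE_LICENSES = {"unknown", "other", "unspecified"}
# Loose pattern for a DOI such as 10.1234/abcd (illustrative, not exhaustive).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def find_duplicates(files):
    """Group file names whose bytes are identical, using SHA-256 digests."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def audit_dataset(metadata, files):
    """Return a list of human-readable issues for one dataset record."""
    issues = []
    for field in REQUIRED_FIELDS:
        if not str(metadata.get(field) or "").strip():
            issues.append(f"missing metadata: {field}")
    if str(metadata.get("license") or "").strip().lower() in VAGUE_LICENSES:
        issues.append("vague license")
    if not DOI_PATTERN.match(str(metadata.get("identifier") or "").strip()):
        issues.append("no persistent identifier (DOI)")
    for group in find_duplicates(files):
        issues.append("duplicate files: " + ", ".join(group))
    return issues


# Toy record: vague license, empty source, no DOI, two identical files.
example_metadata = {"license": "Other", "description": "Chest X-ray images", "source": ""}
example_files = {"a.png": b"\x00\x01", "b.png": b"\x00\x01", "c.png": b"\x02"}
for issue in audit_dataset(example_metadata, example_files):
    print(issue)
```

Hashing raw bytes only catches exact duplicates. Near-duplicates, such as the same image re-encoded or resized, would need a different technique, and none of these checks says anything about whether the metadata that is present is actually correct.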

Layman's Summary

Summary
- Medical imaging datasets are important for teaching computers how to help doctors in healthcare.
- The quality of these datasets affects how well the computer programs work.
- People are worried about how data is shared and used on websites like Kaggle and HuggingFace.
- Some problems with these websites include unclear rules, missing information, and mistakes in the data.
- A new idea suggests having a better way to take care of the data on these websites.

Definitions
- Medical imaging datasets: Collections of pictures used by computers to learn about medical conditions.
- Algorithms: Step-by-step instructions that tell a computer what to do.
- Governance models: Rules and systems for managing something, like sharing data online.
- Documentation: Information that explains something or keeps track of details.
- Metadata: Data that describes other data, like when a picture was taken or who created it.

Blog article

Artificial intelligence (AI) has made significant strides in healthcare, particularly in medical imaging, where algorithms are developed to help diagnose diseases from images such as X-rays, MRIs, and CT scans. The accuracy and reliability of these algorithms depend heavily on the quality of the datasets used to train and evaluate them.

In recent years, medical imaging datasets have become more accessible through Community-Contributed Platforms (CCPs) like Kaggle and HuggingFace. These platforms make sharing and collaboration among researchers easy, but there are concerns about their ability to uphold recommended practices for dataset management. This is where the paper "Towards actionability for open medical imaging datasets: lessons from community-contributed platforms for data management and stewardship" comes in. It examines how medical imaging datasets on CCPs are documented, shared, and maintained, highlights the potential downstream consequences of inadequate dataset management, and proposes a commons-based stewardship model to improve documentation and maintenance on these platforms.

The study distinguishes medical imaging datasets from computer vision datasets. Both are used in AI applications, but they have different characteristics that call for specific management practices. For instance, medical imaging datasets contain sensitive patient information that must be handled with great care because of privacy concerns.

An analysis of 20 popular datasets on CCPs (10 medical and 10 computer vision) reveals several issues: vague licenses, lack of persistent identifiers such as DOIs (Digital Object Identifiers), duplicates, missing metadata, and discrepancies across platforms. These problems can significantly affect the accuracy and fairness of AI models trained on these datasets.

To address these issues, the paper introduces "actionability" as a conceptual metric for the gap between the characteristics of data on CCPs and the characteristics desired for AI in healthcare. Informally, actionability reflects how readily a dataset can be used by others for research or clinical purposes. Datasets with high actionability have clear documentation, persistent identifiers, and minimal discrepancies across platforms.

The proposed commons-based stewardship model aims to improve dataset documentation and maintenance on CCPs. It emphasizes tracking dataset evolution within static infrastructures, through processes that treat data as dynamic entities that change over time. This matters because medical imaging datasets are not static; they can change as technology is updated or as new research findings emerge.

One challenge in managing medical imaging datasets on CCPs is capturing dataset changes with stable identifiers like DOIs or with traditional reviews. To address it, the study suggests creating living reviews on community-contributed platforms, where users can contribute derived datasets or related research to improve transparency and accountability throughout the data development lifecycle.

The paper also acknowledges limitations in its quantitative evidence and subjective perceptions. While it offers valuable insights into dataset management on CCPs, more input is needed from all stakeholders involved in managing and stewarding medical imaging datasets. That input would help build a comprehensive understanding of how to document and maintain these datasets within AI-driven healthcare initiatives.

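
To make the idea of tracking dataset evolution more concrete, here is a minimal sketch of how explicit changes between two versions of a dataset could be recorded. It is not a mechanism described in the paper: the function names `fingerprint`, `diff_versions`, and `version_id`, and the shape of the changelog entry, are assumptions chosen for illustration.

```python
import hashlib


def fingerprint(files):
    """Map each file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def diff_versions(old, new):
    """Compare two fingerprints and list added, removed, and modified files."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(name for name in shared if old[name] != new[name]),
    }


def version_id(fp):
    """Derive a short content-based identifier for a whole dataset version."""
    digest = hashlib.sha256()
    for name in sorted(fp):
        digest.update(f"{name}:{fp[name]}\n".encode("utf-8"))
    return digest.hexdigest()[:12]


# Toy example: one image is added and the label file is edited.
v1 = fingerprint({"img1.png": b"A", "labels.csv": b"id,label\n1,0\n"})
v2 = fingerprint({"img1.png": b"A", "img2.png": b"B", "labels.csv": b"id,label\n1,1\n"})

# Hypothetical changelog entry that could accompany a new dataset version.
changelog_entry = {"from": version_id(v1), "to": version_id(v2), **diff_versions(v1, v2)}
print(changelog_entry)
```

A content-derived identifier like this is not a substitute for a registered persistent identifier such as a DOI, and a file-level diff only captures explicit changes. Implicit variations, for example a shift in how images were acquired or labeled, would not show up here and still need human documentation.
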
In conclusion, the paper sheds light on the importance of properly documenting and maintaining medical imaging datasets on CCPs. It highlights the potential consequences of inadequate dataset management practices and proposes a commons-based stewardship model to improve data quality on these platforms. With further collaboration among researchers, platform developers, and other stakeholders, AI algorithms trained on medical imaging datasets are more likely to be accurate, robust, fair, and ultimately beneficial for patient care.

Created on 25 Mar. 2024

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.