BRATsynthetic: Text De-identification using a Markov Chain Replacement Strategy for Surrogate Personal Identifying Information

AI-generated keywords: De-identification, PHI, HIPS, Markov Chain, FNER

AI-generated Key Points

  • Objective: Implement and evaluate different personal health identifying information (PHI) substitution strategies to quantify their privacy-preserving benefits.
  • Background and Significance:
    • Privacy considerations and legal mandates limit the availability of clinical text for machine learning model training.
    • Synthetic generation of clinical notes has potential, but more research is needed.
    • De-identification is an alternative approach, but existing software may not replace identified PHI with surrogate text.
  • Methods:
    • Implement and assess three "Hiding in Plain Sight" (HIPS) strategies for PHI replacement: Consistent, Random, and Markov model-based.
    • Evaluate privacy-preserving benefits across a range of false negative error rates (FNER).
  • Results:
    • The Markov chain strategy significantly reduces PHI leakage compared to the Consistent strategy on a diverse set of notes from the University of Alabama at Birmingham (UAB).
    • The Markov chain strategy outperforms the Consistent and Random strategies on a MIMIC corpus of discharge summaries and on synthetic clinical PHI distributions.
    • With the Markov chain strategy instead of the Consistent strategy, document-level PHI leakage on the UAB corpus falls from 27.1% to 0.1% at 0.1% FNER, and from 94.2% to 57.7% at 5% FNER.
  • Discussion:
    • The Markov chain surrogate generation strategy substantially reduces the risk of inadvertent PHI release across different assumed FNERs.
    • The implementation, called "BRATsynthetic", is released on GitHub for use by the clinical informatics community.
  • Conclusion: The Markov chain replacement strategy allows for the release of larger de-identified corpora at the same risk level compared to a consistent HIPS strategy.

Authors: John D. Osborne, Tobias O'Leary, Akhil Nadimpalli, Salma M. Aly, Richard E. Kennedy

License: CC BY 4.0

Abstract: Objective: Implement and assess personal health identifying information (PHI) substitution strategies and quantify their privacy-preserving benefits. Materials and Methods: We implement and assess 3 different "Hiding in Plain Sight" (HIPS) strategies for PHI replacement, including a standard Consistent replacement strategy, a Random replacement strategy, and a novel Markov model-based strategy. We evaluate the privacy-preserving benefits of these strategies on a synthetic PHI distribution and real clinical corpora from 2 different institutions using a range of false negative error rates (FNER). Results: Using FNER ranging from 0.1% to 5%, PHI leakage at the document level could be reduced from 27.1% to 0.1% (0.1% FNER) and from 94.2% to 57.7% (5% FNER) utilizing the Markov chain strategy versus the Consistent strategy on a corpus containing a diverse set of notes from the University of Alabama at Birmingham (UAB). The Markov chain substitution strategy also consistently outperformed the Consistent and Random substitution strategies in a MIMIC corpus of discharge summaries and on a range of synthetic clinical PHI distributions. Discussion: We demonstrate that a Markov chain surrogate generation strategy substantially reduces the chance of inadvertent PHI release across a range of assumed PHI FNER and release our implementation "BRATsynthetic" on GitHub. Conclusion: The Markov chain replacement strategy allows for the release of larger de-identified corpora at the same risk level relative to corpora released using a consistent HIPS strategy.
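
The three replacement strategies named in the abstract can be contrasted with a minimal Python sketch. This is an illustration under stated assumptions, not the BRATsynthetic implementation: the surrogate pool, the function names, and the `p_switch` parameter are all hypothetical, and the Markov variant is modeled here as a simple two-state chain that either reuses the previous surrogate for a given original value or draws a new one.

```python
import random

# Hypothetical surrogate pool for illustration; not taken from BRATsynthetic.
SURROGATE_NAMES = ["Alice Reed", "Brian Cole", "Carla Diaz", "Devon Hale", "Erin Moss"]


def consistent_replace(mentions, rng, pool=SURROGATE_NAMES):
    """Map each distinct original value to one surrogate for the whole document."""
    mapping = {}
    out = []
    for mention in mentions:
        if mention not in mapping:
            mapping[mention] = rng.choice(pool)
        out.append(mapping[mention])
    return out


def random_replace(mentions, rng, pool=SURROGATE_NAMES):
    """Draw an independent surrogate for every mention, ignoring earlier choices."""
    return [rng.choice(pool) for _ in mentions]


def markov_replace(mentions, rng, p_switch=0.5, pool=SURROGATE_NAMES):
    """Assumed two-state chain: keep the last surrogate for a value, or switch.

    With probability p_switch a fresh surrogate is drawn; otherwise the
    surrogate used for the previous occurrence of the same value is reused.
    """
    last = {}
    out = []
    for mention in mentions:
        if mention not in last or rng.random() < p_switch:
            last[mention] = rng.choice(pool)
        out.append(last[mention])
    return out


if __name__ == "__main__":
    rng = random.Random(42)
    doc_mentions = ["John Smith", "Mary Jones", "John Smith", "John Smith"]
    print("consistent:", consistent_replace(doc_mentions, rng))
    print("random:    ", random_replace(doc_mentions, rng))
    print("markov:    ", markov_replace(doc_mentions, rng, p_switch=0.5))
```

In this sketch, `p_switch=0.0` makes the Markov variant behave like the Consistent strategy and `p_switch=1.0` makes it behave like the Random strategy, so the parameter interpolates between the two. Whether the published method uses a comparable parameterization is not stated in the abstract.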

Submitted to arXiv on 28 Oct. 2022


Results of the summarizing process for the arXiv paper: 2210.16125v1

Objective: The objective of this study is to implement and evaluate different personal health identifying information (PHI) substitution strategies in order to quantify their privacy preserving benefits.

Background and Significance: Privacy considerations and legal mandates create a scarcity of clinical text for machine learning model training and evaluation. Synthetic generation of clinical notes has shown promise but more research is needed. An alternative approach is to use real notes and remove or replace only personal identifying information (PII), including PHI, through a process called de-identification. However, existing de-identification software may not replace identified PHI with surrogate text, making it difficult to train machine learning algorithms effectively. This challenge may lead to PHI data leaks if the de-identification process is skipped.

Methods: In this study, we implement and assess three different "Hiding in Plain Sight" (HIPS) strategies for PHI replacement: a standard Consistent replacement strategy, a Random replacement strategy, and a novel Markov model-based strategy. We evaluate the privacy preserving benefits of these strategies on synthetic PHI distributions and real clinical corpora from two different institutions using false negative error rates (FNER).

Results: Using FNER ranging from 0.1% to 5%, we observe that the Markov chain strategy significantly reduces PHI leakage compared to the Consistent strategy on a diverse set of notes from the University of Alabama at Birmingham (UAB). The Markov chain strategy also outperforms the Consistent and Random strategies on a MIMIC corpus of discharge summaries as well as on various synthetic clinical PHI distributions. Specifically, at 0.1% FNER document-level PHI leakage could be reduced from 27.1% to 0.1%, while at 5% FNER it could be reduced from 94.2% to 57.7%.

Discussion: Our findings demonstrate that the Markov chain surrogate generation strategy substantially reduces the risk of inadvertent PHI release across different assumed PHI FNERs. We also release our implementation called "BRATsynthetic" on Github for the clinical informatics community to use.

Conclusion: The Markov chain replacement strategy allows for the release of larger de-identified corpora at the same risk level compared to using a consistent HIPS strategy.
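
To see why document-level leakage climbs so steeply with FNER, a simplified model can help. The Python sketch below assumes each PHI mention is missed independently with probability equal to the FNER, so a document with n mentions has at least one miss with probability 1 - (1 - FNER)^n. This is an illustrative baseline only, not the calculation used in the paper: it counts raw misses and ignores whether a reader could tell a missed mention apart from the surrounding surrogates, which is exactly what the HIPS strategies are meant to change. The function names and the example PHI counts are hypothetical.

```python
def doc_leak_probability(fner, n_phi):
    """P(at least one of n_phi mentions is missed), assuming independent misses."""
    return 1.0 - (1.0 - fner) ** n_phi


def corpus_leak_fraction(fner, phi_counts):
    """Expected fraction of documents with at least one missed PHI mention."""
    return sum(doc_leak_probability(fner, n) for n in phi_counts) / len(phi_counts)


if __name__ == "__main__":
    # Hypothetical per-document PHI counts; not drawn from the UAB or MIMIC corpora.
    example_counts = [5, 20, 60, 150]
    for fner in (0.001, 0.01, 0.05):
        fraction = corpus_leak_fraction(fner, example_counts)
        print(f"FNER={fner:.3f}: expected leaking fraction = {fraction:.3f}")
```

Under this assumption, documents with many PHI mentions dominate the risk, which is consistent with the summary's point that even a small FNER can expose a large share of documents.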
Created on 23 Nov. 2023

