Generalizing Clinical De-identification Models by Privacy-safe Data Augmentation using GPT-4

Woojin Kim; Sungeun Hahm; Jaejin Lee

EMNLP 2024

Abstract

De-identification (de-ID) refers to removing the association between a set of identifying data and the data subject. In clinical data management, the de-ID of Protected Health Information (PHI) is critical for patient confidentiality. However, state-of-the-art de-ID models generalize poorly to new datasets, for two reasons: training corpora are difficult to retain, and labeling standards and patient record formats vary across institutions. Our study addresses these issues by using GPT-4 for data augmentation through one-shot and zero-shot prompts. Our approach circumvents the problem of PHI leakage by redacting PHI before the data is processed. To evaluate the effectiveness of our proposal, we conduct cross-dataset testing. The experimental results demonstrate significant improvements across three types of F1 scores.
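The redact-before-processing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the span format, the placeholder style, the function names `redact_phi` and `build_augmentation_prompt`, and the prompt wording are all assumptions made for illustration, and no call to GPT-4 is shown.

```python
from typing import List, Tuple

# Assumed annotation format: (start, end, PHI label), with end exclusive.
Span = Tuple[int, int, str]


def redact_phi(text: str, spans: List[Span]) -> str:
    """Replace each annotated PHI span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping PHI spans")
        pieces.append(text[cursor:start])  # keep the non-PHI text as-is
        pieces.append(f"[{label}]")        # drop the PHI surface form
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def build_augmentation_prompt(redacted_note: str) -> str:
    """Build a zero-shot style prompt (hypothetical wording, not from the paper)."""
    return (
        "Rewrite the following clinical note in a different style, "
        "keeping every bracketed placeholder unchanged:\n\n" + redacted_note
    )


# Made-up example note and gold PHI spans.
note = "John Smith was admitted on 03/14/2019 to Mercy Hospital."
spans = [(0, 10, "NAME"), (27, 37, "DATE"), (41, 55, "HOSPITAL")]

redacted = redact_phi(note, spans)
prompt = build_augmentation_prompt(redacted)
print(prompt)
```

Because only the redacted text is placed in the prompt, the original PHI strings never leave the local environment. Placeholders in the generated text could later be filled with surrogate values to produce labeled training examples, though that step is also an assumption here.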

Topics

Machine Learning > Application Areas > Data Augmentation
Machine Learning > Application Areas > Domain Adaptation
Machine Learning > Application Areas > Privacy
Machine Learning > Learning Types > Transfer Learning
Natural Language Processing > Applications > Information Extraction
Healthcare & Medicine > Clinical > Clinical NLP
Healthcare & Medicine > Clinical > Medical AI
Artificial Intelligence > Core AI > Privacy
Deep Learning > Learning Types > Data Augmentation

Keywords

privacy-preserving machine learning; data augmentation; privacy preservation; cross-dataset evaluation; protected health information; large language model; clinical de-identification
