Anonymization Through Substitution: Words vs Sentences

Vasco Alves; Vitor Rolla; João Alveira; David Pissarra; Duarte Pereira; Isabel Curioso; André Carreiro; Henrique Lopes Cardoso

ACL 2024

Abstract

Anonymization of clinical text is crucial to allow the sharing and disclosure of health records while safeguarding patient privacy. However, automated anonymization processes are still highly limited in healthcare practice, as these systems cannot assure the anonymization of all private information. This paper explores the application of a novel technique that guarantees the removal of all sensitive information through the usage of text embeddings obtained from a de-identified dataset, replacing every word or sentence of a clinical note. We analyze the performance of different embedding techniques and models by evaluating them using recently proposed evaluation metrics. The results demonstrate that sentence replacement is better at keeping relevant medical information untouched, while the word replacement strategy performs better in terms of anonymization sensitivity.
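The substitution idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the embedding models used, so a toy character n-gram count vector stands in for a real embedding model, and the names `embed`, `nearest`, `substitute`, `DEID_SENTENCES`, and `DEID_WORDS` are hypothetical. The sketch only shows the core property: every output unit is drawn from a de-identified pool, so no token of the original note is copied through.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy embedding (assumption): character n-gram counts, a stand-in for a real model."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(unit, pool):
    """Return the pool entry whose embedding is most similar to the unit's."""
    query = embed(unit)
    return max(pool, key=lambda cand: cosine(query, embed(cand)))

def substitute(note, pool, unit="sentence"):
    """Replace every sentence (or every word) of the note with its nearest pool entry."""
    if unit == "sentence":
        # Naive period-based sentence split, for illustration only.
        parts = [s.strip() for s in note.split(".") if s.strip()]
        return ". ".join(nearest(p, pool) for p in parts) + "."
    return " ".join(nearest(w, pool) for w in note.split())

# Hypothetical de-identified pool; a real pool would come from a de-identified dataset.
DEID_SENTENCES = [
    "The patient reports chest pain",
    "Blood pressure was within normal range",
    "The patient was discharged in stable condition",
]
DEID_WORDS = sorted({w.lower() for s in DEID_SENTENCES for w in s.split()})

note = "John Silva reports severe chest pain. He was discharged in stable condition."

print(substitute(note, DEID_SENTENCES, unit="sentence"))
print(substitute(note, DEID_WORDS, unit="word"))
```

Because the output is assembled only from pool entries, the privacy guarantee in this sketch comes from the pool being de-identified rather than from detecting sensitive spans; how much clinical meaning survives depends on the quality of the embedding and the coverage of the pool.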

Topics

Machine Learning > Core Methods > Embedding Learning
Machine Learning > Application Areas > Privacy
Healthcare & Medicine > Clinical > Clinical NLP
Natural Language Processing > Applications > Text Processing

Keywords

named entity recognition; privacy preservation; data anonymization; text embedding; clinical text; word replacement; clinical note anonymization; sentence replacement
