PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation

Xiang Yue; Shuang Zhou

EMNLP 2020

Abstract

De-identification is the task of identifying protected health information (PHI) in clinical text. Existing neural de-identification models often fail to generalize to a new dataset. We propose PHICON, a simple yet effective data augmentation method, to alleviate this generalization issue. PHICON consists of PHI augmentation and Context augmentation, which create augmented training corpora by replacing PHI entities with named entities sampled from external sources, and by changing the background context with synonym replacement or random word insertion, respectively. Experimental results on the i2b2 2006 and 2014 de-identification challenge datasets show that PHICON helps three selected de-identification models improve F1-score (by up to 8.6%) in the cross-dataset test setting. We also discuss how much augmentation to use and how each augmentation method influences performance.
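The two augmentation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the BIO tagging scheme, the function names `phi_augment` and `context_augment`, the `entity_pool` and `synonyms` dictionaries, and the probabilities `p_replace` and `p_insert` are all assumptions made for illustration, and the sample sentence is invented.

```python
import random

def phi_augment(tokens, tags, entity_pool, rng):
    """Replace each BIO-tagged PHI span with a surface form sampled from
    entity_pool (a hypothetical dict: entity type -> list of strings)."""
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1  # extend to the end of the current entity span
            if etype in entity_pool:
                new_span = rng.choice(entity_pool[etype]).split()
            else:
                new_span = tokens[i:j]  # no replacement available, keep as-is
            out_tokens.extend(new_span)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new_span) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags

def context_augment(tokens, tags, synonyms, rng, p_replace=0.1, p_insert=0.1):
    """Perturb only non-PHI ("O") tokens: synonym replacement and random
    insertion of a synonym. PHI tokens and their tags are left untouched."""
    out_tokens, out_tags = [], []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out_tokens.append(tok)
            out_tags.append(tag)
            continue
        candidates = synonyms.get(tok.lower())
        if candidates and rng.random() < p_replace:
            tok = rng.choice(candidates)  # synonym replacement
        out_tokens.append(tok)
        out_tags.append("O")
        if candidates and rng.random() < p_insert:
            out_tokens.append(rng.choice(candidates))  # random word insertion
            out_tags.append("O")
    return out_tokens, out_tags

# Invented example sentence and hypothetical external resources.
tokens = ["Patient", "John", "Smith", "was", "admitted", "to", "Mercy", "Hospital", "."]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "O", "B-HOSPITAL", "I-HOSPITAL", "O"]
entity_pool = {"PATIENT": ["Maria Garcia", "Li Wei"],
               "HOSPITAL": ["Riverside Medical Center"]}
synonyms = {"admitted": ["hospitalized"], "patient": ["subject"]}

rng = random.Random(0)
aug_tokens, aug_tags = phi_augment(tokens, tags, entity_pool, rng)
aug_tokens, aug_tags = context_augment(aug_tokens, aug_tags, synonyms, rng)
print(list(zip(aug_tokens, aug_tags)))
```

Because insertions are only made directly after an "O" token, no entity span is split, so the augmented sequence remains a valid BIO labeling.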

Authors

Xiang Yue, Shuang Zhou

Topics

Machine Learning > Application Areas > Data Augmentation
Machine Learning > Application Areas > Domain Generalization
Natural Language Processing > Applications > Information Extraction
Natural Language Processing > Applications > Text Classification
Healthcare & Medicine > Clinical > Clinical NLP
Natural Language Processing > Applications > Named Entity Recognition
Machine Learning > Learning Types > Data Augmentation
Deep Learning > Learning Types > Data Augmentation

Keywords

text classification, data augmentation, named entity recognition, model generalization, cross-dataset generalization, clinical text, protected health information, context augmentation, clinical text de-identification, PHI augmentation, healthcare privacy
