Self-Supervised Detection of Contextual Synonyms in a Multi-Class Setting: Phenotype Annotation Use Case

Jingqing Zhang; Luis Bolanos Trujillo; Tong Li; Ashwani Tanwar; Guilherme Freire; Xian Yang; Julia Ive; Vibhor Gupta; Yike Guo

EMNLP 2021


Abstract

Contextualised word embeddings are a powerful tool for detecting contextual synonyms. However, most current state-of-the-art (SOTA) deep learning concept extraction methods remain supervised and underexploit the potential of context. In this paper, we propose a self-supervised pre-training approach that can detect contextual synonyms of concepts when trained on data created by shallow matching. We apply our methodology in a sparse multi-class setting (over 15,000 concepts) to extract phenotype information from electronic health records. We further investigate data augmentation techniques to address the problem of class sparsity. Our approach achieves a new SOTA for unsupervised phenotype concept annotation on clinical text in F1 and recall, outperforming the previous SOTA by up to 4.5 and 4.0 absolute points, respectively. After fine-tuning with as little as 20% of the labelled data, we also outperform BioBERT and ClinicalBERT. The extrinsic evaluation on three ICU benchmarks also shows the benefit of using the phenotypes annotated by our model as features.
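The abstract does not spell out how the shallow-matching training data is built. As a rough illustration only, the sketch below assumes that shallow matching means case-insensitive exact matching of known concept names against a clinical note. The function `shallow_match`, the concept identifiers `C001` and `C002`, and the sample note are all hypothetical and are not taken from the paper.

```python
import re
from typing import Dict, List, Tuple


def shallow_match(text: str, lexicon: Dict[str, List[str]]) -> List[Tuple[int, int, str]]:
    """Return (start, end, concept_id) spans found by case-insensitive exact matching.

    This is an assumed, simplified notion of "shallow matching"; the paper's
    actual procedure may differ.
    """
    spans = []
    for concept_id, names in lexicon.items():
        for name in names:
            # Word boundaries avoid matching inside longer tokens.
            pattern = r"\b" + re.escape(name) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                spans.append((m.start(), m.end(), concept_id))
    return sorted(spans)


# Hypothetical lexicon: concept id -> known surface names.
lexicon = {
    "C001": ["fever", "pyrexia"],
    "C002": ["shortness of breath"],
}
note = "Patient reports fever and shortness of breath; febrile overnight."

weak_labels = shallow_match(note, lexicon)
for start, end, concept_id in weak_labels:
    print(concept_id, repr(note[start:end]))
```

In this toy note, "febrile" is not in the lexicon, so exact matching misses it. Mentions of this kind are what a model trained on the weak labels would need to recover from context.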



Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Applications > Information Extraction
Healthcare & Medicine > Clinical > Clinical NLP
Natural Language Processing > Applications > Named Entity Recognition
Deep Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Paradigms > Self-Supervised Learning
Machine Learning > Learning Types > Multi-Class Classification
Machine Learning > Core Methods > Multi-Class Classification

Keywords

self-supervised learning, data augmentation, named entity recognition, multi-class classification, concept extraction, electronic health record, clinical natural language processing, contextual synonym, contextual synonym detection, phenotype annotation
