Identifying Semantically Deviating Outlier Documents

Honglei Zhuang; Chi Wang; Fangbo Tao; Lance Kaplan; Jiawei Han

2017 EMNLP EMNLP 2017

Identifying Semantically Deviating Outlier Documents

Abstract

AbstractA document outlier is a document that substantially deviates in semantics from the majority ones in a corpus. Automatic identification of document outliers can be valuable in many applications, such as screening health records for medical mistakes. In this paper, we study the problem of mining semantically deviating document outliers in a given corpus. We develop a generative model to identify frequent and characteristic semantic regions in the word embedding space to represent the given corpus, and a robust outlierness measure which is resistant to noisy content in documents. Experiments conducted on two real-world textual data sets show that our method can achieve an up to 135% improvement over baselines in terms of recall at top-1% of the outlier ranking.

🌉 Interdisciplinary Bridge — Computer Science and Machine Learning

🧭 Keyword Pioneer — semantic deviation

🐣 Hot Topic Early Bird — text mining

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Honglei Zhuang , Chi Wang , Fangbo Tao , Lance Kaplan , Jiawei Han

Topics

Machine Learning > Core Methods > Classification Computer Science > Applications > Document Analysis

Keywords

text mining outlier detection document analysis generative model semantic deviation

Download PDF

Related papers

Reinforced Video Captioning with Entailment Rewards 2017

Cross-lingual Character-Level Neural Morphological Tagging 2017

Inter-Weighted Alignment Network for Sentence Pair Modeling 2017

Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings 2017

An Empirical Analysis of Edit Importance between Document Versions 2017