Rapid Distance-Based Outlier Detection via Sampling

Mahito Sugiyama; Karsten Borgwardt

NeurIPS (NIPS) 2013

Abstract

Distance-based approaches to outlier detection are popular in data mining, as they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis of why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search.
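The abstract does not spell out the sampling-based scheme. The following is only a minimal sketch of one plausible reading: draw a single small random sample from the data, then score every point by its distance to the nearest sampled point, so that no k-nearest neighbor search over the full dataset is needed. The function name `sampling_outlier_scores`, the parameters `sample_size` and `seed`, the Euclidean metric, the choice to skip a point's own entry in the sample, and the toy data are all assumptions made for illustration and are not taken from this page.

```python
import math
import random


def sampling_outlier_scores(points, sample_size, seed=0):
    """Score each point by its distance to the nearest member of one random sample.

    Hypothetical helper for illustration; larger scores mean "more outlying".
    """
    if not 2 <= sample_size <= len(points):
        raise ValueError("sample_size must be between 2 and len(points)")
    rng = random.Random(seed)
    # Draw the sample once, without replacement; every point is scored against it.
    sample_idx = rng.sample(range(len(points)), sample_size)
    scores = []
    for i, p in enumerate(points):
        # Assumption: a sampled point is not compared with itself,
        # so its score is not trivially zero.
        dists = [math.dist(p, points[j]) for j in sample_idx if j != i]
        scores.append(min(dists))
    return scores


# Toy data (made up): a Gaussian cluster around the origin plus one distant point.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data.append((20.0, 20.0))  # index 200, the planted outlier

scores = sampling_outlier_scores(data, sample_size=20)
top = max(range(len(data)), key=scores.__getitem__)
print("highest-scoring index:", top)
```

Under these assumptions each point costs `sample_size` distance computations, so the total work grows linearly with the number of points instead of requiring a neighbor search over the whole dataset.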

Topics

Machine Learning > Core Methods > Classification
Machine Learning > Core Methods > Clustering
Machine Learning > Learning Types > Unsupervised Learning
Data Science & Analytics > Methods > Data Mining
Data Science & Analytics > Applications > Clustering
Machine Learning > Core Methods > Anomaly Detection
Machine Learning > Learning Types > Sampling

Keywords

anomaly detection; high-dimensional data; outlier detection; nearest neighbor search; sampling; distance-based method
