A Sampling-Based Approach for Efficient Clustering in Large Datasets

Georgios Exarchakis; Omar Oubari; Gregor Lenz

CVPR 2022

Abstract

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating the distances between each datapoint and only a subset of the cluster centres. It is substantially more efficient than k-means because it does not require an all-to-all comparison of datapoints and clusters. We show that the optimal solutions of our approximation coincide with those of the exact algorithm, while our approach extracts these clusters considerably more efficiently than the state of the art. We compare our approximation with exact k-means and with alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including the number of operations until convergence, and the stability of the results. An efficient implementation of the algorithm is provided online.
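The core idea, comparing each datapoint with only a subset of the cluster centres, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `subset_kmeans`, the parameter `n_candidates`, and in particular the choice to draw candidate centres uniformly at random are assumptions made for illustration, since the abstract does not specify how the subset is selected.

```python
import numpy as np


def subset_kmeans(X, k, n_candidates=5, n_iters=20, seed=0):
    """Illustrative k-means variant: each point is compared with only
    `n_candidates` centres per iteration instead of all `k`.

    Assumption: candidates are the point's current centre plus centres
    sampled uniformly at random (the paper may select them differently).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialise centres from k distinct datapoints.
    centres = X[rng.choice(n, size=k, replace=False)].astype(float)
    assign = rng.integers(0, k, size=n)

    for _ in range(n_iters):
        # Candidate centre indices per point, shape (n, n_candidates).
        cand = rng.integers(0, k, size=(n, n_candidates))
        # Always keep the current assignment among the candidates, so the
        # assignment step can never make a point's distance worse.
        cand[:, 0] = assign

        # Squared distances to candidate centres only: n * n_candidates
        # comparisons instead of n * k.
        diffs = X[:, None, :] - centres[cand]
        d2 = np.einsum("ncd,ncd->nc", diffs, diffs)
        assign = cand[np.arange(n), d2.argmin(axis=1)]

        # Standard k-means update; empty clusters keep their old centre.
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centres[j] = members.mean(axis=0)

    return centres, assign
```

With `n_candidates` much smaller than `k`, the assignment step costs on the order of `n * n_candidates` distance evaluations per iteration rather than `n * k`, which is the kind of saving the abstract refers to.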

Topics

Machine Learning > Core Methods > Clustering
Machine Learning > Optimization & Theory > Optimization
Data Science & Analytics > Applications > Clustering

Keywords

cluster analysis, k-means clustering, approximation algorithm, high-dimensional data, sampling method, approximate clustering
