Fast and Provably Good Seedings for k-Means

Olivier Bachem; Mario Lucic; Hamed Hassani; Andreas Krause

2016 NIPS NeurIPS 2016

Fast and Provably Good Seedings for k-Means

Abstract

Seeding - the task of finding initial cluster centers - is critical in obtaining high-quality clusterings for k-Means. However, k-means++ seeding, the state of the art algorithm, does not scale well to massive datasets as it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the seeding step of k-means++. However, this result requires assumptions on the data generating distribution. We propose a simple yet fast seeding algorithm that produces provably good clusterings even without assumptions on the data. Our analysis shows that the algorithm allows for a favourable trade-off between solution quality and computational cost, speeding up k-means++ seeding by up to several orders of magnitude. We validate our theoretical results in extensive experiments on a variety of real-world data sets.

🌉 Interdisciplinary Bridge — Machine Learning and Mathematics & Optimization

🧭 Keyword Pioneer — seeding algorithm

🐣 Hot Topic Early Bird — k-means clustering

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Olivier Bachem , Mario Lucic , Hamed Hassani , Andreas Krause

Topics

Machine Learning > Core Methods > Clustering Mathematics & Optimization > Optimization > Stochastic Methods

Keywords

k-means clustering markov chain monte carlo clustering quality computational cost seeding algorithm

Download PDF

Related papers

Bayesian Intermittent Demand Forecasting for Large Inventories 2016

Dynamic Network Surgery for Efficient DNNs 2016

Beyond Exchangeability: The Chinese Voting Process 2016

Safe and Efficient Off-Policy Reinforcement Learning 2016

Tagger: Deep Unsupervised Perceptual Grouping 2016