Scalable Training of Mixture Models via Coresets

Dan Feldman; Matthew Faulkner; Andreas Krause

NeurIPS (NIPS) 2011

Abstract

How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of $O(dk^3/\varepsilon^2)$ data points suffices for computing a $(1+\varepsilon)$-approximation for the optimal model on the original $n$ data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.
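To make the notion of a weighted subset concrete, here is a minimal Python sketch of importance sampling with inverse-probability weights. It is not the construction analyzed in the paper: the sampling distribution (an even mix of uniform and squared-distance-to-mean probabilities) is a simplifying assumption chosen for illustration, and the function name `importance_coreset` and its parameters are hypothetical. The sketch carries none of the approximation guarantees stated in the abstract.

```python
import random


def importance_coreset(points, m, seed=0):
    """Sample m weighted points from `points` (a list of equal-length tuples).

    Hypothetical illustration only; not the paper's coreset construction.
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])

    # Squared distance of every point to the data mean.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dist)

    # Assumed sampling distribution: half uniform, half proportional to
    # squared distance, so far-away points are more likely to be kept.
    if total > 0:
        probs = [0.5 / n + 0.5 * s / total for s in sq_dist]
    else:
        probs = [1.0 / n] * n  # all points identical: fall back to uniform

    # Sample with replacement and attach inverse-probability weights so that
    # weighted sums over the sample are unbiased estimates of full-data sums.
    idx = rng.choices(range(n), weights=probs, k=m)
    coreset = [points[i] for i in idx]
    weights = [1.0 / (m * probs[i]) for i in idx]
    return coreset, weights


# Usage: 1000 synthetic 2-D points drawn from two Gaussian clusters.
gen = random.Random(42)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(500)]
data += [(gen.gauss(5, 1), gen.gauss(5, 1)) for _ in range(500)]
coreset, weights = importance_coreset(data, m=50)
```

A weighted EM or k-means routine could then be run on `coreset` with `weights` in place of the full data. Each weight is at most `2 * n / m`, because every sampling probability is at least `0.5 / n`.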

Authors

Dan Feldman, Matthew Faulkner, Andreas Krause

Topics

Machine Learning > Core Methods > Clustering

Machine Learning > Optimization & Theory > Optimization

Machine Learning > Optimization & Theory > Statistical Learning

Machine Learning > Optimization & Theory > Theory

Data Science & Analytics > Applications > Clustering

Mathematics & Optimization > Optimization > Discrete Optimization

Mathematics & Optimization > Optimization > Approximation Algorithms

Keywords

density estimation, map-reduce, statistical estimation, coresets, scalable training, streaming algorithm, approximation algorithm, mixture model, Gaussian mixture
