Learning and Data Selection in Big Datasets

Hossein Shokri Ghadikolaei; Hadi Ghauch; Carlo Fischione; Mikael Skoglund

ICML 2019

Abstract

Finding a dataset of minimal cardinality to characterize the optimal parameters of a model is of paramount importance in machine learning and distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping as well as the most representative samples of the dataset (sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with respect to the original dataset size. Numerical evaluations on real datasets reveal large compressibility, up to 95%, without a noticeable drop in the learnability performance, measured by the generalization error.

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Active Learning
Mathematics & Optimization > Optimization > Combinatorial Optimization
Machine Learning > Learning Types > Transfer Learning

Keywords

representation learning, distributed optimization, generalization error, representative sampling, data selection, dataset compression, sufficient dataset
