Data-splitting improves statistical performance in overparameterized regimes

Nicole Muecke; Enrico Reiss; Jonas Rungenhagen; Markus Klein

AISTATS 2022

Abstract

While large training datasets generally improve model performance, they make the training process computationally expensive and time-consuming. Distributed learning is a common strategy for reducing the overall training time by exploiting multiple computing devices. Recently, it has been observed in the single-machine setting that overparameterization is essential for benign overfitting in ridgeless regression in Hilbert spaces. We show that in this regime, data splitting has a regularizing effect, thereby improving statistical performance and computational complexity at the same time. We further provide a unified framework for analyzing both the finite- and infinite-dimensional settings. We numerically demonstrate the effect of different model parameters.
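To make the data-splitting idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes the common divide-and-conquer scheme in which the sample is cut into disjoint parts, a minimum-norm (ridgeless) least-squares estimator is fitted on each part, and the local estimators are averaged. The Gaussian design, the dimensions, the noise level, and the function names `min_norm_fit` and `split_and_average` are all illustrative assumptions.

```python
import numpy as np

def min_norm_fit(X, y):
    # Minimum-norm least-squares (ridgeless) solution via the pseudoinverse.
    return np.linalg.pinv(X) @ y

def split_and_average(X, y, num_splits):
    # Cut the sample into disjoint parts, fit each part locally,
    # then average the local estimators (assumed averaging scheme).
    parts = zip(np.array_split(X, num_splits), np.array_split(y, num_splits))
    return np.mean([min_norm_fit(Xj, yj) for Xj, yj in parts], axis=0)

# Synthetic overparameterized problem (d > n); all values are illustrative.
rng = np.random.default_rng(0)
n, d, noise = 200, 1000, 0.5
w_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_true + noise * rng.normal(size=n)

w_single = min_norm_fit(X, y)                    # single-machine interpolator
w_split = split_and_average(X, y, num_splits=4)  # averaged local interpolators

# Squared parameter error; for an isotropic design this tracks excess risk.
print("single machine:", np.sum((w_single - w_true) ** 2))
print("4 splits      :", np.sum((w_split - w_true) ** 2))
```

Each local fit works with a matrix of only n / num_splits rows, which is where the computational saving comes from. Whether averaging also lowers the error depends on the noise level, the spectrum of the covariance, and the number of splits, so the printed values should be read as an experiment to vary rather than as a confirmation of the paper's result.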

Topics

Artificial Intelligence > Learning Paradigms > Federated Learning

Machine Learning > Core Methods > Regression

Machine Learning > Optimization & Theory > Statistical Learning

Keywords

distributed learning, kernel ridge regression, benign overfitting, ridgeless regression, data splitting

Related papers

Exploring Image Regions Not Well Encoded by an INN 2022

On Linear Model with Markov Signal Priors 2022

Probabilistic Numerical Method of Lines for Time-Dependent Partial Differential Equations 2022

On Distributionally Robust Optimization and Data Rebalancing 2022

Common Failure Modes of Subcluster-based Sampling in Dirichlet Process Gaussian Mixture Models - and a Deep-learning Solution 2022