Why is parameter averaging beneficial in SGD? An objective smoothing perspective

Atsushi Nitanda; Ryuhei Kikuchi; Shugo Maeda; Denny Wu

AISTATS 2024

Abstract

It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; this implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD, which eliminates sharp local minima by convolving the objective with the stochastic gradient noise. We follow this line of research and study the commonly used averaged SGD algorithm, which Izmailov et al. (2018) empirically observed to prefer flat minima and therefore achieve better generalization. We prove that, in certain problem settings, averaged SGD can efficiently optimize the smoothed objective, which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.
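As a rough illustration of the algorithm the abstract refers to, the following is a minimal sketch of averaged SGD that returns both the last iterate and the running average of all iterates. It is not the authors' implementation: the quadratic toy objective, the Gaussian gradient noise, and the names `grad` and `averaged_sgd` are assumptions made purely for illustration, and the paper's actual problem settings are not described on this page.

```python
import random


def grad(x):
    # Gradient of a hypothetical toy objective f(x) = 0.5 * x**2
    # (an assumption for illustration, not the paper's objective).
    return x


def averaged_sgd(x0, step_size, num_steps, noise_std, seed=0):
    """Run SGD with additive Gaussian gradient noise and uniform iterate averaging.

    Returns the last iterate and the average of x_0, x_1, ..., x_T.
    """
    rng = random.Random(seed)
    x = x0
    avg = x0  # running mean, initialised with x_0
    for t in range(1, num_steps + 1):
        # Stochastic gradient: true gradient plus zero-mean noise.
        g = grad(x) + rng.gauss(0.0, noise_std)
        x = x - step_size * g
        # Incremental update of the mean over the t + 1 iterates seen so far.
        avg += (x - avg) / (t + 1)
    return x, avg


if __name__ == "__main__":
    last, mean = averaged_sgd(x0=1.0, step_size=0.5, num_steps=1000, noise_std=1.0)
    print("last iterate:", last)
    print("averaged iterate:", mean)
```

The averaged iterate is computed incrementally, so no history of parameters needs to be stored. Variants such as averaging only the tail of the trajectory are common in practice, but which scheme the paper analyzes cannot be determined from the abstract alone.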

Topics

Machine Learning > Optimization & Theory > Optimization
Machine Learning > Optimization & Theory > Statistical Learning
Machine Learning > Optimization & Theory > Stochastic Processes
Deep Learning > Optimization & Theory > Neural Network Optimization

Keywords

stochastic gradient descent; implicit bias; flat minima; parameter averaging; sharpness minimization; objective smoothing; local minima
