Sharpness-Aware Minimization and the Edge of Stability

Philip M. Long; Peter L. Bartlett

JMLR 2024

Abstract

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the “edge of stability” based on consideration of a local quadratic approximation of the loss. We perform a similar calculation to arrive at an “edge of stability” for Sharpness-Aware Minimization (SAM), a variant of GD which has been shown to improve its generalization. Unlike the case for GD, the resulting SAM-edge depends on the norm of the gradient. Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
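The quadratic-approximation argument in the abstract can be illustrated with a toy one-dimensional sketch. Everything below is an assumption made for illustration and is not taken from the paper: the loss is $f(x) = \lambda x^2 / 2$ with curvature $\lambda$, the SAM step uses a normalized ascent perturbation of radius `rho`, and the helper names (`gd_step`, `sam_step`, `gd_edge`, `sam_edge_1d`) are hypothetical. In this toy model a GD step overshoots and grows $|x|$ exactly when $\lambda > 2/\eta$, while a SAM step does so when $\lambda > 2|x| / (\eta(|x| + \rho))$, a threshold worked out here for the 1-D case only. Since the gradient norm is $\lambda |x|$, this threshold varies with the gradient norm, which matches the qualitative claim in the abstract.

```python
def gd_step(x, lam, eta):
    """One gradient descent step on f(x) = lam * x**2 / 2."""
    return x - eta * lam * x


def sam_step(x, lam, eta, rho):
    """One SAM-style step on the same quadratic (illustrative assumption).

    The gradient is evaluated at x perturbed by rho in the direction of
    the normalized gradient, then applied at the original point x.
    """
    g = lam * x
    if g == 0.0:
        return x  # at the minimizer there is no ascent direction
    direction = 1.0 if g > 0 else -1.0  # normalized gradient in 1-D
    g_perturbed = lam * (x + rho * direction)
    return x - eta * g_perturbed


def gd_edge(eta):
    """Curvature above which a GD step increases |x| on the quadratic."""
    return 2.0 / eta


def sam_edge_1d(x, eta, rho):
    """Curvature above which a SAM step increases |x| in this toy model.

    Derived for the 1-D quadratic only; not quoted from the paper.
    """
    ax = abs(x)
    return 2.0 * ax / (eta * (ax + rho))


if __name__ == "__main__":
    eta, rho = 0.1, 0.5
    print("GD edge:", gd_edge(eta))
    for x in (1.0, 0.1, 0.01):
        print(f"|x| = {x}: toy SAM edge = {sam_edge_1d(x, eta, rho)}")
```

In this sketch the toy SAM threshold is always below $2/\eta$ and shrinks as $|x|$ shrinks, so the SAM step becomes unstable at lower curvature than GD when the iterate is close to the minimizer.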

Topics

Machine Learning > Optimization & Theory > Neural Network Optimization
Machine Learning > Optimization & Theory > Optimization
Deep Learning > Techniques > Model Architecture
Machine Learning > Learning Types > Deep Learning
Deep Learning > Optimization & Theory > Neural Network Optimization

Keywords

sharpness-aware minimization, gradient descent, Hessian eigenvalue, edge of stability, neural network

Related papers

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks 2024

Convergence for nonconvex ADMM, with applications to CT imaging 2024

Functional Directed Acyclic Graphs 2024

Sum-of-norms clustering does not separate nearby balls 2024

Decentralized Natural Policy Gradient with Variance Reduction for Collaborative Multi-Agent Reinforcement Learning 2024