On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Sadhika Malladi; Kaifeng Lyu; Abhishek Panigrahi; Sanjeev Arora

NeurIPS 2022

Abstract

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to study a continuous optimization trajectory while carefully preserving the stochasticity of SGD. An analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a square root scaling rule for adjusting the optimization hyperparameters of RMSprop and Adam when the batch size changes, together with its empirical validation in deep learning settings.
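The abstract names the square root scaling rule but does not spell it out. As a minimal sketch, assume the rule for Adam takes the following form when the batch size is multiplied by a factor kappa: the learning rate is multiplied by sqrt(kappa), each of 1 - beta1 and 1 - beta2 is multiplied by kappa, and epsilon is divided by sqrt(kappa). The function name `sqrt_scale_adam` and its argument names are hypothetical and chosen only for illustration.

```python
import math


def sqrt_scale_adam(lr, beta1, beta2, eps, kappa):
    """Rescale Adam hyperparameters when the batch size is multiplied by kappa.

    ASSUMPTION (not stated in the abstract): the rule has the form
        lr       -> lr * sqrt(kappa)
        1 - beta -> kappa * (1 - beta)   (for both beta1 and beta2)
        eps      -> eps / sqrt(kappa)
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")

    new_beta1 = 1.0 - kappa * (1.0 - beta1)
    new_beta2 = 1.0 - kappa * (1.0 - beta2)

    # For large kappa the rescaled betas can leave [0, 1); refuse to
    # return an invalid optimizer configuration in that case.
    if not (0.0 <= new_beta1 < 1.0 and 0.0 <= new_beta2 < 1.0):
        raise ValueError("kappa is too large for the given beta1/beta2")

    return {
        "lr": lr * math.sqrt(kappa),
        "beta1": new_beta1,
        "beta2": new_beta2,
        "eps": eps / math.sqrt(kappa),
    }


# Example: going from a base batch size to one 4x larger.
scaled = sqrt_scale_adam(lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, kappa=4)
```

Under this assumed form, quadrupling the batch size doubles the learning rate, in contrast to the linear scaling commonly used for SGD. The validity check matters in practice: with beta1 = 0.9, any kappa above 10 would push the rescaled beta1 below zero.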

Authors

Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora

Topics

Machine Learning > Optimization & Theory > Neural Network Optimization
Machine Learning > Optimization & Theory > Stochastic Processes
Mathematics & Optimization > Optimization > Stochastic Methods
Machine Learning > Optimization & Theory > Stochastic Methods
Deep Learning > Optimization & Theory > Optimization
Deep Learning > Optimization & Theory > Theory
Deep Learning > Optimization & Theory > Stochastic Methods

Keywords

neural network optimization, gradient descent, stochastic differential equation, adaptive gradient method, adaptive gradient, batch size scaling, scaling rule
