Surprising Instabilities in Training Deep Networks and a Theoretical Analysis

Yuxin Sun; Dong Lao; Ganesh Sundaramoorthi; Anthony Yezzi

NeurIPS 2022

Abstract

We empirically demonstrate numerical instabilities in training standard deep networks with SGD. Specifically, we show that numerical error (on the order of the smallest floating-point bit) induced by floating-point arithmetic in training deep nets can be amplified significantly and result in significant test accuracy variance, comparable to the test accuracy variance due to stochasticity in SGD. We show how this can likely be traced to instabilities of the optimization dynamics that are localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE) and analyzing the gradient descent PDE of a one-layer convolutional neural network, which is sufficient to illustrate these instabilities. We show that it is stable only under certain conditions on the learning rate and weight decay. We reproduce the localized instabilities in the PDE for the one-layer network, which arise when the conditions are violated.
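
As a rough illustration of the two ingredients named in the abstract (rounding differences at the level of the last bit, and a stability condition involving learning rate and weight decay), a minimal Python sketch follows. It uses a textbook scalar quadratic loss, not the paper's PDE analysis of a one-layer convolutional network. The function names, the quadratic model, the perturbation size, and the condition |1 - lr * (curvature + weight_decay)| <= 1 are all assumptions chosen for illustration.

```python
def gd_quadratic(w0, curvature, lr, weight_decay, steps):
    """Gradient descent on f(w) = 0.5 * curvature * w**2 with L2 weight decay.

    Toy model (an assumption for illustration, not the paper's setting).
    Each step multiplies w by (1 - lr * (curvature + weight_decay)).
    """
    w = w0
    for _ in range(steps):
        grad = curvature * w + weight_decay * w
        w = w - lr * grad
    return w


def is_stable(curvature, lr, weight_decay):
    """Textbook stability test for the toy update above."""
    return abs(1.0 - lr * (curvature + weight_decay)) <= 1.0


# 1) Floating-point addition is not associative, so merely reordering
#    a sum can change the result in the last bits.
sum_left = (0.1 + 0.2) + 0.3
sum_right = 0.1 + (0.2 + 0.3)
reorder_gap = abs(sum_left - sum_right)

# 2) A tiny perturbation of the initial weight (a hypothetical stand-in
#    for accumulated rounding error) decays when the step is stable and
#    grows geometrically when it is not.
perturbation = 1e-12
steps = 50

stable_gap = abs(
    gd_quadratic(1.0 + perturbation, 1.0, 0.5, 0.1, steps)
    - gd_quadratic(1.0, 1.0, 0.5, 0.1, steps)
)
unstable_gap = abs(
    gd_quadratic(1.0 + perturbation, 1.0, 2.5, 0.1, steps)
    - gd_quadratic(1.0, 1.0, 2.5, 0.1, steps)
)

print("reordering gap:", reorder_gap)
print("stable run gap:", stable_gap)
print("unstable run gap:", unstable_gap)
```

In this toy, the learning rate of 2.5 violates the stability test, so the gap between the two runs grows by many orders of magnitude, while the learning rate of 0.5 shrinks it. The paper's conditions for the one-layer network are derived differently and are not reproduced here.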

Topics

Machine Learning > Optimization & Theory > Neural Network Optimization; Machine Learning > Optimization & Theory > Optimization; Machine Learning > Optimization & Theory > Stochastic Processes; Machine Learning > Optimization & Theory > Theory; Deep Learning > Optimization & Theory > Theory

Keywords

stochastic gradient descent; deep learning; gradient descent; learning rate; partial differential equation; weight decay; training dynamics; deep network; numerical instability; optimization stability; optimization dynamics; floating point error
