Early Stopping as Nonparametric Variational Inference

David Duvenaud; Dougal Maclaurin; Ryan Adams

AISTATS 2016

Abstract

We show that unconverged stochastic gradient descent can be interpreted as sampling from a nonparametric approximate posterior distribution. This distribution is implicitly defined by the transformation of an initial distribution by a sequence of optimization steps. By tracking the change in entropy of this distribution during optimization, we give a scalable, unbiased estimate of a variational lower bound on the log marginal likelihood. This bound can be used to optimize hyperparameters instead of cross-validation. This Bayesian interpretation of SGD also suggests new overfitting-resistant optimization procedures, and gives a theoretical foundation for early stopping and ensembling. We investigate the properties of this marginal likelihood estimator on neural network models.
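The entropy-tracking idea can be illustrated on a toy problem. The sketch below is not the authors' estimator, and every name in it (`entropy_tracked_bound`, `A`, `alpha`, `sigma0`) is hypothetical. It assumes a quadratic negative log joint 0.5 θᵀAθ, plain (noise-free) gradient descent, and a Gaussian initial distribution. Under those assumptions each step is the linear map θ → (I − αA)θ, so the entropy changes by exactly log|det(I − αA)| per step, and the bound at step t is the sample average of the log joint plus the tracked entropy. Computing an exact log-determinant is only feasible in small problems like this one; it is used here purely for clarity.

```python
import numpy as np

def entropy_tracked_bound(A, alpha, num_steps, num_samples=50000, sigma0=1.0, seed=0):
    """Monte Carlo estimate of the variational lower bound after each GD step.

    Toy assumption: negative log joint is 0.5 * theta^T A theta with A symmetric
    positive definite, so the Hessian is A everywhere.
    """
    rng = np.random.default_rng(seed)
    D = A.shape[0]
    # Samples from the initial distribution N(0, sigma0^2 I).
    theta = sigma0 * rng.standard_normal((num_samples, D))
    # Closed-form entropy of the initial Gaussian.
    entropy = 0.5 * D * np.log(2 * np.pi * np.e * sigma0**2)
    # Jacobian of one gradient step is I - alpha * A; its log|det| is the
    # entropy change contributed by that step.
    _, log_abs_det = np.linalg.slogdet(np.eye(D) - alpha * A)
    bounds = []
    for _ in range(num_steps + 1):
        log_joint = -0.5 * np.einsum("nd,de,ne->n", theta, A, theta)
        bounds.append(log_joint.mean() + entropy)  # E_q[log joint] + S[q]
        theta = theta - alpha * theta @ A          # gradient of 0.5 θᵀAθ is Aθ
        entropy += log_abs_det
    return np.array(bounds)

# Hypothetical 2-D example with unequal curvature in each direction.
A = np.diag([1.0, 4.0])
bounds = entropy_tracked_bound(A, alpha=0.1, num_steps=50)
# Exact log marginal likelihood: log of the integral of exp(-0.5 θᵀAθ).
log_z = 0.5 * A.shape[0] * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(A)[1]
best_step = int(np.argmax(bounds))
print(f"best step: {best_step}, bound there: {bounds[best_step]:.3f}, log Z: {log_z:.3f}")
```

In this toy setting the estimated bound rises for the first step or so and then falls as the samples collapse toward the optimum and the entropy shrinks, which is the sense in which stopping early gives a better approximate posterior than running to convergence.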

Topics

Machine Learning > Optimization & Theory > Bayesian Inference
Machine Learning > Optimization & Theory > Stochastic Processes
Machine Learning > Optimization & Theory > Theory
Deep Learning > Models > Variational Inference
Machine Learning > Bayesian & Probabilistic > Variational Inference

Keywords

stochastic gradient descent, variational inference, hyperparameter optimization, neural network optimization, marginal likelihood, posterior distribution, stochastic process, early stopping, approximate posterior
