On the Convergence Properties of a K-step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization

Fan Zhou; Guojing Cong

2018 IJCAI IJCAI 2018

On the Convergence Properties of a K-step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization

Abstract

We adopt and analyze a synchronous K-step averaging stochastic gradient descent algorithm which we call K-AVG for solving large scale machine learning problems. We establish the convergence results of K-AVG for nonconvex objectives. Our analysis of K-AVG applies to many existing variants of synchronous SGD. We explain why the K-step delay is necessary and leads to better performance than traditional parallel stochastic gradient descent which is equivalent to K-AVG with $K=1$. We also show that K-AVG scales better with the number of learners than asynchronous stochastic gradient descent (ASGD). Another advantage of K-AVG over ASGD is that it allows larger stepsizes and facilitates faster convergence. On a cluster of $128$ GPUs, K-AVG is faster than ASGD implementations and achieves better accuracies and faster convergence for training with the CIFAR-10 dataset.

🌉 Interdisciplinary Bridge — Machine Learning and Mathematics & Optimization

🧭 Keyword Pioneer — k-step averaging

🐣 Hot Topic Early Bird — stochastic gradient descent

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Interdisciplinary, Machine Learning, Mathematics & Optimization, Reinforcement Learning, Robotics

Authors

Fan Zhou , Guojing Cong

Topics

Machine Learning > Optimization & Theory > Neural Network Optimization Machine Learning > Optimization & Theory > Optimization Machine Learning > Optimization & Theory > Stochastic Processes Machine Learning > Optimization & Theory > Theory Mathematics & Optimization > Optimization > Stochastic Methods Machine Learning > Optimization & Theory > Stochastic Methods Deep Learning > Optimization & Theory > Neural Network Optimization

Keywords

stochastic gradient descent nonconvex optimization convergence analysis distributed learning parallel computing parallel learning k-step averaging

Download PDF

Related papers

Semi-Supervised Multi-Modal Learning with Incomplete Modalities 2018

High-dimensional Similarity Learning via Dual-sparse Random Projection 2018

FISH-MML: Fisher-HSIC Multi-View Metric Learning 2018

Generative Warfare Nets: Ensemble via Adversaries and Collaborators 2018

Semi-Supervised Optimal Margin Distribution Machines 2018