Combining Global Sparse Gradients with Local Gradients in Distributed Neural Network Training

Alham Fikri Aji; Kenneth Heafield; Nikolay Bogoychev

IJCNLP 2019

Abstract

One way to reduce network traffic in multi-node data-parallel stochastic gradient descent is to only exchange the largest gradients. However, doing so damages the gradient and degrades the model’s performance. Transformer models degrade dramatically while the impact on RNNs is smaller. We restore gradient quality by combining the compressed global gradient with the node’s locally computed uncompressed gradient. Neural machine translation experiments show that Transformer convergence is restored while RNNs converge faster. With our method, training on 4 nodes converges up to 1.5x as fast as with uncompressed gradients and scales 3.5x relative to single-node training.
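The idea in the abstract can be illustrated with a minimal sketch. It assumes that "largest gradients" means the k entries with the largest absolute value, that the exchange averages the sparsified gradients of all nodes, and that the combination is a plain element-wise sum. The abstract does not state the exact combination rule, so `combine_with_local` and the other function names here are hypothetical and are not the authors' implementation.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero the rest."""
    if k <= 0:
        return [0.0] * len(grad)
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def global_sparse_gradient(local_grads, k):
    """Average the sparsified gradients of all nodes.

    This stands in for the network exchange: only the k kept entries
    of each node would actually be sent.
    """
    sparse = [top_k_sparsify(g, k) for g in local_grads]
    n_nodes = len(local_grads)
    return [sum(column) / n_nodes for column in zip(*sparse)]


def combine_with_local(global_sparse, local_grad):
    """ASSUMPTION: combine by element-wise sum of the compressed global
    gradient and this node's uncompressed local gradient."""
    return [g + l for g, l in zip(global_sparse, local_grad)]


# Two nodes, four parameters, each node sends only its 2 largest entries.
node_grads = [
    [0.1, -2.0, 0.3, 1.5],
    [1.0, 0.2, -0.1, -3.0],
]
global_grad = global_sparse_gradient(node_grads, k=2)
update_node0 = combine_with_local(global_grad, node_grads[0])
```

In this sketch the small entries that sparsification drops (such as the third parameter of node 0) still reach that node's update through its local gradient, which is the intuition behind restoring gradient quality.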

Topics

Machine Learning > Optimization & Theory > Distributed Learning
Mathematics & Optimization > Optimization > Stochastic Methods

Keywords

stochastic gradient descent, neural machine translation, distributed training, gradient compression, model convergence
