Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation

Nikolay Bogoychev; Kenneth Heafield; Alham Fikri Aji; Marcin Junczys-Dowmunt

EMNLP 2018

Abstract

To extract the best possible performance from asynchronous stochastic gradient descent (SGD), one must increase the mini-batch size and scale the learning rate accordingly. To achieve further speedup, we introduce a technique that delays gradient updates, effectively increasing the mini-batch size. Unfortunately, increasing the mini-batch size worsens the stale gradient problem in asynchronous SGD, which leads to poor model convergence. We introduce local optimizers, which mitigate the stale gradient problem, and together with fine-tuning our momentum we are able to train a shallow machine translation system 27% faster than an optimized baseline with negligible penalty in BLEU.
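The claim that delaying gradient updates effectively increases the mini-batch size can be illustrated with a toy example. The sketch below is not the authors' implementation: the one-parameter least-squares model, the data, and the function names are all hypothetical, and averaging the accumulated gradients is an assumption. It only shows that accumulating gradients over several equal-size mini-batches at fixed parameters, then applying them once, matches a single step on the combined batch.

```python
# Toy illustration (hypothetical names and data, not the paper's code):
# delaying an update by accumulating gradients over several mini-batches
# behaves like one update computed on a single larger mini-batch.

def gradient(w, batch):
    """Mean gradient of 0.5 * (w * x - y) ** 2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sgd_step(w, batch, lr):
    """One plain SGD step on a single mini-batch."""
    return w - lr * gradient(w, batch)

def delayed_step(w, batches, lr):
    """Accumulate gradients from equal-size mini-batches at fixed w, apply once.

    Assumption: the accumulated gradients are averaged. Summing them instead
    is the same as multiplying the learning rate by len(batches).
    """
    accumulated = sum(gradient(w, b) for b in batches)
    return w - lr * accumulated / len(batches)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
small_batches = [data[:2], data[2:]]  # two mini-batches of size 2

w0 = 0.0
w_delayed = delayed_step(w0, small_batches, lr=0.1)  # one delayed update
w_large = sgd_step(w0, data, lr=0.1)                 # one step on batch of size 4

print(abs(w_delayed - w_large) < 1e-12)
```

The equivalence holds only because every accumulated gradient is computed at the same parameter value. In asynchronous training, other workers may update the shared parameters in the meantime, which is the stale gradient problem the abstract refers to.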

Topics

Natural Language Processing > Applications > Machine Translation
Machine Learning > Optimization & Theory > Stochastic Methods
Natural Language Processing > Generation > Machine Translation
Deep Learning > Optimization & Theory > Optimization
Deep Learning > Optimization & Theory > Stochastic Methods

Keywords

stochastic gradient descent, neural machine translation, distributed learning, gradient optimization, distributed training, model convergence, gradient staleness, asynchronous learning, mini-batch size, asynchronous stochastic gradient descent
