Two Tiered Distributed Training Algorithm for Acoustic Modeling

Pranav Ladkat; Oleg Rybakov; Radhika Arava; Sree Hari Krishnan Parthasarathi; I-Fan Chen; Nikko Strom

INTERSPEECH 2019

Abstract

We present a hybrid approach for scaling distributed training of neural networks that combines two algorithms: Gradient Threshold Compression (GTC), a variant of stochastic gradient descent (SGD) that compresses gradients using thresholding and quantization, and Blockwise Model Update Filtering (BMUF), a variant of model averaging (MA). In the proposed method, we divide the workers hierarchically into smaller subgroups and limit the frequency of communication across subgroups. Within a subgroup, we update the local model using GTC; across subgroups, we update the global model using BMUF. We evaluate this approach on an automatic speech recognition (ASR) task by training deep long short-term memory (LSTM) acoustic models on 2000 hours of speech. Experiments show that, over a wide range of GPU counts used for distributed training, the proposed approach achieves a better trade-off between accuracy and scalability than either GTC or BMUF alone.
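The two-tier structure described in the abstract can be illustrated with a minimal single-process sketch. The abstract gives no implementation details, so everything below is an assumption based on how these techniques are commonly described, not the authors' code: `gtc_compress` sends a quantized value of plus or minus `tau` for each element whose accumulated residual crosses the threshold and keeps the remainder locally, `bmuf_update` applies a classical block-momentum update to the average of the subgroup models, and `two_tier_block` (a hypothetical helper, as are all parameter names) runs GTC inside each subgroup before merging subgroups with BMUF.

```python
def gtc_compress(grad, residual, tau):
    """Threshold and quantize one worker's gradient (assumed GTC-style scheme).

    The gradient is added to `residual` in place. Elements whose accumulated
    magnitude reaches `tau` are sent as +tau or -tau; the rest stay local.
    """
    sent = []
    for i, g in enumerate(grad):
        residual[i] += g
        if residual[i] >= tau:
            sent.append(tau)
            residual[i] -= tau
        elif residual[i] <= -tau:
            sent.append(-tau)
            residual[i] += tau
        else:
            sent.append(0.0)
    return sent


def bmuf_update(global_w, delta_prev, subgroup_models, block_momentum, block_lr):
    """Merge subgroup models with a block-momentum update (assumed BMUF-style)."""
    n = len(subgroup_models)
    avg = [sum(m[i] for m in subgroup_models) / n for i in range(len(global_w))]
    # Filtered update: momentum on the previous update plus the scaled
    # difference between the averaged model and the current global model.
    delta = [block_momentum * d + block_lr * (a - w)
             for d, a, w in zip(delta_prev, avg, global_w)]
    new_w = [w + d for w, d in zip(global_w, delta)]
    return new_w, delta


def two_tier_block(global_w, delta_prev, residuals, grad_fn, local_steps,
                   lr, tau, block_momentum, block_lr):
    """Run one block: GTC steps inside each subgroup, then BMUF across subgroups.

    residuals[g][k] is the residual vector of worker k in subgroup g.
    grad_fn(g, k, w) returns that worker's gradient at model w (hypothetical).
    """
    subgroup_models = []
    for g, group_residuals in enumerate(residuals):
        w = list(global_w)  # every subgroup starts the block from the global model
        for _ in range(local_steps):
            update = [0.0] * len(w)
            for k, res in enumerate(group_residuals):
                sent = gtc_compress(grad_fn(g, k, w), res, tau)
                update = [u + s for u, s in zip(update, sent)]
            # All workers in the subgroup apply the same summed sparse update.
            w = [wi - lr * ui for wi, ui in zip(w, update)]
        subgroup_models.append(w)
    return bmuf_update(global_w, delta_prev, subgroup_models,
                       block_momentum, block_lr)
```

In this sketch the frequent, compressed exchange happens only among the workers of one subgroup, while subgroups synchronize once per block of `local_steps` steps. A real system would replace the Python loops with collective communication between GPUs.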

Topics

Machine Learning > Optimization & Theory > Distributed Learning
Machine Learning > Optimization & Theory > Neural Network Optimization
Deep Learning > Architectures > Neural Networks

Keywords

stochastic gradient descent, neural network optimization, acoustic modeling, distributed training, model averaging, gradient compression
