Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce
Abstract
AbstractAutomatic Speech Recognition (ASR) is essential for any voice-based application. The streaming capability of ASR becomes necessary to provide immediate feedback to the user in applications like Voice Search. LSTM/RNN and CTC based ASR systems are very simple to train and deploy for low latency streaming applications but have lower accuracy when compared to the state-of-the-art models. In this work, we build accurate LSTM, attention and CTC based streaming ASR models for large-scale Hinglish (blend of Hindi and English) Voice Search. We evaluate how various modifications in vanilla LSTM training improve the system’s accuracy while preserving the streaming capabilities. We also discuss a simple integration of end-of-speech (EOS) detection with CTC models, which helps reduce the overall search latency. Our model achieves a word error rate (WER) of 3.69% without EOS and 4.78% with EOS, with ~1300 ms (~46.64%) reduction in latency.