Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

Jay Mahadeokar; Yangyang Shi; Yuan Shangguan; Chunyang Wu; Alex Xiao; Hang Su; Duc Le; Ozlem Kalinli; Christian Fuegen; Michael L. Seltzer

INTERSPEECH 2021

Abstract

Storage and computational constraints on embedded devices often require a single on-device ASR model to serve multiple use cases and domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition that handles multiple use cases and domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provides a fast response for voice commands and more accurate transcription, at the cost of higher latency, for dictation. To achieve flexible and improved trade-offs between accuracy and latency, we use the following techniques. First, we vary the segment size of the Emformer encoder by domain, which enables FlexiT to decode flexibly. Second, we use the Alignment Restricted RNNT loss to gain fine-grained control over token emission latency for different domains. Finally, we add a domain indicator vector as an additional input to the FlexiT model. Combining these techniques, we show that a single model can improve WER and real-time factor for dictation scenarios while maintaining optimal latency for voice command use cases.
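The domain-dependent pieces of the abstract can be illustrated with a small sketch. Everything named here is a hypothetical assumption for illustration: the domain labels, the segment sizes (4 and 16 frames), and the choice to concatenate the indicator to every feature frame. The abstract does not state these values or where the indicator enters the model, so this is not the authors' implementation. The real-time factor helper uses the standard definition, processing time divided by audio duration.

```python
from typing import Dict, List

# Hypothetical domain table: names and segment sizes are illustrative
# assumptions, not values taken from the paper.
DOMAINS: List[str] = ["voice_command", "dictation"]
SEGMENT_FRAMES: Dict[str, int] = {"voice_command": 4, "dictation": 16}


def domain_one_hot(domain: str) -> List[float]:
    """Return a one-hot indicator vector for the given domain."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    return [1.0 if d == domain else 0.0 for d in DOMAINS]


def add_domain_indicator(frames: List[List[float]], domain: str) -> List[List[float]]:
    """Append the domain indicator to every acoustic feature frame.

    Per-frame concatenation is an assumption made for this sketch.
    """
    indicator = domain_one_hot(domain)
    return [list(frame) + indicator for frame in frames]


def split_into_segments(frames: List[List[float]], domain: str) -> List[List[List[float]]]:
    """Chunk frames using the segment size configured for the domain.

    Smaller segments let the encoder emit sooner (lower latency);
    larger segments give it more context per step.
    """
    size = SEGMENT_FRAMES[domain]
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # 32 dummy frames with 3 features each.
    frames = [[0.0, 0.0, 0.0] for _ in range(32)]
    for domain in DOMAINS:
        tagged = add_domain_indicator(frames, domain)
        segments = split_into_segments(tagged, domain)
        print(domain, "segments:", len(segments), "feature dim:", len(tagged[0]))
```

With these assumed sizes, the same 32 frames are processed as eight short segments for voice commands and two long segments for dictation, which is the latency versus context trade-off the abstract describes.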

Topics

Machine Learning > Application Areas > Efficient Computing
Speech & Audio > Recognition > Automatic Speech Recognition
Deep Learning > Models > Transformers

Keywords

automatic speech recognition
latency optimization
real time factor
on-device speech recognition
