An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling

Tara N. Sainath; Yanzhang He; Arun Narayanan; Rami Botros; Ruoming Pang; David Rybach; Cyril Allauzen; Ehsan Variani; James Qin; Quoc-Nam Le-The; Shuo-Yiin Chang; Bo Li; Anmol Gulati; Jiahui Yu; Chung-Cheng Chiu; Diamantino Caseiro; Wei Li; Qiao Liang; Pat Rondon

INTERSPEECH 2021

Abstract

On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER) [1], and latency [2], as measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on audio-text pairs, which amount to a small fraction of the 100 billion text utterances that a conventional language model (LM) is trained on. As a result, E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model [3], we explore using a Hybrid Autoregressive Transducer (HAT) [4] factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to improve decoder latency, we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both Search and rare-word quality, as well as latency, and is 318× smaller.
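The abstract uses WER as its quality metric without defining it. As a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), the snippet below may help; the function name and the example utterances are illustrative assumptions, not taken from the paper, and the paper's own scoring pipeline may apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); splits on whitespace only
    and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("some" -> "sum") and one deletion
# ("music") against a four-word reference.
print(word_error_rate("play some jazz music", "play sum jazz"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.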

Topics

Machine Learning > Application Areas > Efficient Computing; Deep Learning > Architectures > Transformers; Deep Learning > Architectures > Neural Networks; Deep Learning > Techniques > Pretraining

Keywords

speech recognition; language model; on-device inference; streaming model; neural language model; end-to-end model; streaming ASR; on-device speech recognition; rare word; rare-word modeling
