Speech ReaLLM – Real-time Speech Recognition with Multimodal Language Models by Teaching the Flow of Time

Frank Seide; Yangyang Shi; Morrie Doulaty; Yashesh Gaur; Junteng Jia; Chunyang Wu

INTERSPEECH 2024

Abstract

We introduce Speech ReaLLM, a new ASR architecture that marries “decoder-only” ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. It is the first “decoder-only” ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM (“real-time LLM”) approach, also introduced here for the first time. The idea is inspired by the RNN-T: instead of generating a response only at the end of a user prompt, the model generates a response after every input token received in real time, and that response is often empty. On Librispeech “test,” an 80M-parameter Speech ReaLLM achieves WERs of 3.0% and 7.4% in real time (without an external LM or auxiliary loss). These WERs are only slightly higher than those of a 3x larger Attention-Encoder-Decoder baseline. We also show that, in this way, an LLM architecture can learn to represent and reproduce the flow of time, and that a pre-trained 7B LLM can be fine-tuned to do reasonably well on this task.
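The "generate after every input token" idea can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `stream_decode`, `toy_next_token`, and `BLANK`, the `max_tokens_per_chunk` cap, and the choice not to store the blank marker in the context are all hypothetical, and a simple rule-based function stands in for the LLM.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical marker meaning "nothing more to emit until new input arrives".
BLANK = "<blank>"

Context = List[Tuple[str, str]]  # (kind, value), kind is "audio" or "text"


def stream_decode(
    chunks: Iterable[str],
    next_token: Callable[[Context], str],
    max_tokens_per_chunk: int = 8,  # assumed safety cap, not from the paper
) -> List[str]:
    """Interleave input chunks with generated tokens, RNN-T style."""
    context: Context = []
    output: List[str] = []
    for chunk in chunks:
        # A new input token (here: a string standing in for an audio frame).
        context.append(("audio", chunk))
        # Generate after every input token; the response is often empty.
        for _ in range(max_tokens_per_chunk):
            token = next_token(context)
            if token == BLANK:
                break  # go back to waiting for the next input
            context.append(("text", token))
            output.append(token)
    return output


def toy_next_token(context: Context) -> str:
    """Stand-in for the model: echo a non-empty audio chunk once."""
    kind, value = context[-1]
    if kind == "audio" and value:
        return value
    return BLANK


if __name__ == "__main__":
    frames = ["", "hello", "", "", "world", ""]
    print(stream_decode(frames, toy_next_token))
```

In this toy run most input frames produce an empty response, and text appears only when the stand-in model decides it has something to emit, which mirrors the behaviour the abstract describes.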

Topics

Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

multimodal learning, automatic speech recognition, language model, real-time streaming
