Multi-Head State Space Model for Speech Recognition

Yassir Fathullah; Chunyang Wu; Yuan Shangguan; Junteng Jia; Wenhan Xiong; Jay Mahadeokar; Chunxi Liu; Yangyang Shi; Ozlem Kalinli; Mike Seltzer; Mark J. F. Gales

INTERSPEECH 2023

Abstract

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development sets and 1.91%/4.36% on the test sets without using an external language model.
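
The idea of parallel heads covering local and global temporal dynamics can be illustrated with a minimal sketch. This is not the paper's MH-SSM: it assumes a scalar state per head, hand-picked decay values, and no gating or learned parameters. The names `ssm_head` and `multi_head_ssm` are hypothetical and chosen only for this illustration.

```python
from typing import List


def ssm_head(u: List[float], a: float, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar linear state space recurrence (illustrative assumption):
    x_k = a * x_{k-1} + b * u_k,  y_k = c * x_k, with x_0 = 0."""
    x = 0.0
    y = []
    for u_k in u:
        x = a * x + b * u_k
        y.append(c * x)
    return y


def multi_head_ssm(u: List[float], decays: List[float]) -> List[List[float]]:
    """Run one independent recurrence per head and return, for each time
    step, the list of head outputs (one value per head)."""
    heads = [ssm_head(u, a) for a in decays]
    return [list(step) for step in zip(*heads)]


# Impulse response: a small decay forgets quickly (local context),
# a decay close to 1 retains the input for longer (global context).
impulse = [1.0, 0.0, 0.0, 0.0]
out = multi_head_ssm(impulse, decays=[0.5, 0.9])
for k, step in enumerate(out):
    print(k, step)
```

With these assumed decays, the first head's response to the impulse shrinks by half at every step, while the second head's response fades more slowly, which is one simple way heads can specialise in short-range versus long-range structure.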

Topics

Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Speech Recognition

Keywords

speech recognition, state space model, multi-head attention, transformer encoder, word error rate
