Streaming Target-Speaker ASR with Neural Transducer

Takafumi Moriya; Hiroshi Sato; Tsubasa Ochiai; Marc Delcroix; Takahiro Shinozaki

INTERSPEECH 2022

Abstract

Although recent advances in deep learning have boosted automatic speech recognition (ASR) performance in the single-talker case, it remains difficult to recognize multi-talker speech in which many voices overlap. One conventional approach to this problem is to cascade a speech separation or target speech extraction front-end with an ASR back-end. However, the extra computation cost of the front-end module is a critical obstacle to a quick response, especially for streaming ASR. In this paper, we propose a target-speaker ASR (TS-ASR) system that implicitly integrates the target speech extraction functionality within a streaming end-to-end (E2E) ASR system, i.e., a recurrent neural network-transducer (RNNT). Our system uses an idea similar to target speech extraction, but implements it directly at the level of the RNNT encoder. This makes it possible to realize TS-ASR without the extra computation cost of a front-end. Our work differs from prior studies on E2E-TS-ASR in two major ways: we investigate streaming models and base our study on Conformer models, whereas prior studies used RNN-based systems and dealt only with offline processing. Our experiments confirm that the proposed TS-ASR achieves recognition performance comparable to that of a conventional cascade system in the offline setting, while reducing computation costs and enabling streaming TS-ASR.

Topics

Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

automatic speech recognition, neural transducer, speech extraction, target-speaker ASR

Related papers

Example-based Explanations with Adversarial Attacks for Respiratory Sound Analysis 2022

Which Model is Best: Comparing Methods and Metrics for Automatic Laughter Detection in a Naturalistic Conversational Dataset 2022

Evidence of Onset and Sustained Neural Responses to Isolated Phonemes from Intracranial Recordings in a Voice-based Cursor Control Task 2022

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications 2022

Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction 2022