Multi-Channel Multi-Speaker ASR Using Target Speaker’s Solo Segment

Yiwen Shao; Shi-Xiong Zhang; Yong Xu; Meng Yu; Dong Yu; Daniel Povey; Sanjeev Khudanpur

INTERSPEECH 2024

Abstract

In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker’s speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker’s location or voiceprint. This study introduces the Solo Spatial Feature (Solo-SF), an innovative method that utilizes a target speaker’s isolated speech segment to enhance ASR performance, thereby circumventing the need for conventional inputs like microphone array layouts. We explore effective strategies for selecting optimal solo segments, a crucial aspect for Solo-SF’s success. Through evaluations conducted on the AliMeeting dataset and AISHELL-1 simulations, Solo-SF demonstrates superior performance over existing techniques, significantly lowering Character Error Rates (CER) in various test conditions. Our findings highlight Solo-SF’s potential as an effective solution for addressing the complexities of multi-channel, multi-speaker ASR tasks.
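The abstract reports results as Character Error Rate (CER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the character-level Levenshtein distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative, and the sketch does no text normalization; it is not the scoring script used in the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.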

Topics

Machine Learning > Application Areas > Domain Adaptation
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Analysis > Speaker Verification

Keywords

speech separation, speaker diarization, target speaker extraction, spatial feature, multi-speaker automatic speech recognition
