On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization

Yiling Huang; Weiran Wang; Guanlong Zhao; Hank Liao; Wei Xia; Quan Wang

INTERSPEECH 2024

Abstract

While standard speaker diarization attempts to answer the question "who spoke when", many realistic applications are interested in determining "who spoke what". In both the conventional modularized approach and the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate speakers with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same architecture by sharing blank logits. Such a framework makes it easy to add diarization capabilities to any existing RNN-T-based ASR model without Word Error Rate (WER) regressions. Experimental results demonstrate that WEEND outperforms a strong turn-based diarization baseline system on all 2-speaker short-form scenarios and can generalize to audio lengths of 5 minutes.
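The abstract does not spell out how the blank logit is shared, so the following is only a minimal sketch of one plausible reading: the ASR joint network scores [blank, tokens], a separate auxiliary network scores speakers only, and the speaker branch reuses the ASR blank logit so that both branches agree on when nothing is emitted. The function names, the list layout with blank at index 0, and the greedy decoding rule are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def share_blank_logit(asr_logits, speaker_logits, blank_index=0):
    """Build speaker-branch logits that reuse the ASR blank logit.

    asr_logits: scores over [blank, token_1, ..., token_V] (assumed layout).
    speaker_logits: scores over speakers only, from a hypothetical
        auxiliary network.
    Returns scores over [blank, speaker_1, ..., speaker_S].
    """
    return [asr_logits[blank_index]] + list(speaker_logits)


def greedy_step(asr_logits, speaker_logits, blank_index=0):
    """One greedy decoding step (illustrative assumption).

    The ASR branch decides whether anything is emitted. If it picks
    blank, no word and no speaker label are produced. Otherwise the
    most likely speaker is attached to the emitted token.
    """
    token = max(range(len(asr_logits)), key=asr_logits.__getitem__)
    if token == blank_index:
        return None
    speaker = max(range(len(speaker_logits)), key=speaker_logits.__getitem__)
    return token, speaker
```

Under these assumptions the ASR logits are never modified by the speaker branch, which is one way an existing recognizer could gain speaker labels without its word outputs changing.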

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Speech & Audio > Recognition > Speaker Recognition
Speech & Audio > Analysis > Speaker Verification
Machine Learning > Learning Types > Multi-Task Learning
Deep Learning > Learning Types > Deep Learning

Keywords

multi-task learning; automatic speech recognition; speaker diarization; end-to-end learning; recurrent neural network transducer; blank logit; end-to-end neural diarization; word-level diarization
