Variable Segment Length and Domain-Adapted Feature Optimization for Speaker Diarization

Chenyuan Zhang; Linkai Luo; Hong Peng; Wei Wen

INTERSPEECH 2024

Abstract

In speaker diarization, choosing a suitable segment length remains a challenge. Long segments may contain multiple speakers, leading to unreliable embeddings, while short segments may lack sufficient speaker information. To address this, we propose a variable segment length approach based on a mixed segment recognition (MSR) network. The MSR module distinguishes segments containing multiple speakers from those containing a single speaker. Segments identified as mixed are re-cut until they are pure or reach the minimum length. In addition, we propose a domain-adapted feature optimization scheme that fine-tunes the pre-trained speaker embedding extractor, using both a specific data augmentation and a distance loss function to improve the embeddings of the remaining segments that still contain speaker alternation or overlap. The results demonstrate the effectiveness of our method: it achieves a relative improvement of 25.5% in diarization error rate over the baseline and surpasses recent state-of-the-art methods.
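The re-cutting loop described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a mixed segment is re-cut, so splitting at the midpoint is an assumption made here for illustration only. The names `recut`, `is_mixed`, and `min_len` are hypothetical, and `is_mixed` stands in for the MSR network's decision rather than reproducing it.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def recut(
    segment: Segment,
    is_mixed: Callable[[Segment], bool],  # hypothetical stand-in for the MSR decision
    min_len: float,
) -> List[Segment]:
    """Split a segment until every piece is judged pure or cannot be
    split further without falling below min_len.

    Assumption: mixed segments are halved at the midpoint. The paper's
    actual re-cutting rule is not given in the abstract.
    """
    start, end = segment
    half = (end - start) / 2.0
    # Stop if the halves would be shorter than the minimum length,
    # or if the segment is judged to contain a single speaker.
    if half < min_len or not is_mixed(segment):
        return [segment]
    mid = start + half
    return recut((start, mid), is_mixed, min_len) + recut((mid, end), is_mixed, min_len)


# Toy predicate: a single speaker change at 3.0 s; a segment is "mixed"
# if the change point lies strictly inside it.
def change_at_3s(seg: Segment) -> bool:
    return seg[0] < 3.0 < seg[1]


pieces = recut((0.0, 4.0), change_at_3s, min_len=0.5)
print(pieces)
```

With this toy predicate, the 4-second segment is kept whole where it is pure and shortened only around the speaker change. Segments that are still mixed at the minimum length are returned unchanged, which corresponds to the remaining segments that the domain-adapted feature optimization is meant to handle.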

Topics

Machine Learning > Core Methods > Embedding Learning
Machine Learning > Application Areas > Domain Adaptation

Keywords

domain adaptation, speaker embedding, speaker diarization, variable segment length
