INTERSPEECH 2024

Specializing Self-Supervised Speech Representations for Speaker Segmentation

Abstract

Self-supervised speech representation learning has proven very effective for a wide range of downstream speech processing tasks. However, most of these models are pretrained on clean, pre-segmented, single-speaker utterances, which are not representative of tasks involving realistic multi-speaker conversational speech, such as speaker diarization. WavLM pretraining mitigates this domain mismatch by using artificial mixtures of single-speaker utterances, and it outperforms other pretrained models such as wav2vec2 or HuBERT for speaker diarization. We propose to further specialize WavLM for speaker diarization in two ways: pretraining on real-world multi-speaker conversational speech, and crafting the targets of the pretraining pretext task so that they benefit speaker diarization the most. When finetuned with the recently proposed powerset multi-class cross entropy loss, our models outperform the state of the art, often by a large margin, on most speaker diarization benchmarks.
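
For readers unfamiliar with the powerset formulation mentioned above, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes the common setup in which each frame carries one on/off flag per speaker, and every allowed combination of active speakers becomes one class of an ordinary multi-class problem. The function names, the choice of 3 speakers, and the limit of 2 overlapping speakers are hypothetical values chosen for the example.

```python
from itertools import combinations
import math

def powerset_classes(num_speakers, max_overlap):
    """List every speaker subset of size 0..max_overlap, as sorted tuples."""
    classes = []
    for size in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes

def encode_frame(active, classes):
    """Map a multi-label frame (one 0/1 flag per speaker) to one class index.

    Raises ValueError if more speakers are active than the classes allow.
    """
    subset = tuple(i for i, flag in enumerate(active) if flag)
    return classes.index(subset)

def cross_entropy(logits, target):
    """Multi-class cross entropy for one frame, via a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

# Hypothetical setting: 3 speakers per chunk, at most 2 speaking at once.
classes = powerset_classes(num_speakers=3, max_overlap=2)
# Frame in which speakers 0 and 2 overlap.
target = encode_frame([1, 0, 1], classes)
# Loss of an uninformative (uniform) prediction over all classes.
loss = cross_entropy([0.0] * len(classes), target)
```

Under these assumptions there are 7 classes (silence, 3 single speakers, 3 speaker pairs), so overlapped speech is predicted as its own class rather than through per-speaker detection thresholds.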
