Specializing Self-Supervised Speech Representations for Speaker Segmentation

Séverin Baroudi; Thomas Pellegrini; Hervé Bredin

INTERSPEECH 2024

Abstract

Self-supervised speech representation learning has proven very effective across a wide range of downstream speech processing tasks. However, most of these models are pretrained on clean, pre-segmented, single-speaker utterances, which are not representative of tasks involving realistic multi-speaker conversational speech, such as speaker diarization. WavLM pretraining mitigates this domain mismatch by using artificial mixtures of single-speaker utterances, and it outperforms other pretrained models such as wav2vec2 or HuBERT for speaker diarization. We propose to further specialize WavLM for speaker diarization in two ways: pretraining on real-world multi-speaker conversational speech, and crafting the targets of the pretraining pretext task to benefit speaker diarization the most. When finetuned with the recently proposed powerset multi-class cross-entropy loss, our model outperforms the state of the art, often by a large margin, on most speaker diarization benchmarks.
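
The abstract names a powerset multi-class cross-entropy loss without describing it. As a rough illustration only, the sketch below shows one way such a loss can be set up: each frame's set of active speakers is mapped to a single class, and an ordinary multi-class cross-entropy is computed over those classes. The function names, the choice of 3 speakers with at most 2 overlapping, and the toy frames are assumptions made for this example, not details taken from the paper.

```python
from itertools import combinations
import math


def powerset_classes(num_speakers, max_overlap):
    """Enumerate speaker subsets of size 0..max_overlap in a fixed order."""
    classes = []
    for size in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def encode_frames(multilabel, classes):
    """Map each frame's 0/1 speaker-activity vector to a powerset class index."""
    index = {subset: i for i, subset in enumerate(classes)}
    encoded = []
    for frame in multilabel:
        active = tuple(s for s, on in enumerate(frame) if on)
        # Raises KeyError if more than max_overlap speakers are active.
        encoded.append(index[active])
    return encoded


def cross_entropy(logits, targets):
    """Mean multi-class cross-entropy over frames, from unnormalized logits."""
    total = 0.0
    for frame_logits, target in zip(logits, targets):
        peak = max(frame_logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_norm - frame_logits[target]
    return total / len(targets)


# Illustrative setting (assumed): 3 speakers, at most 2 speaking at once.
classes = powerset_classes(num_speakers=3, max_overlap=2)

# Toy frames: silence, speaker 0, speakers 0 and 1 overlapping, speaker 2.
frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
targets = encode_frames(frames, classes)

# Uniform logits stand in for a model's per-frame output.
uniform_logits = [[0.0] * len(classes) for _ in frames]
loss = cross_entropy(uniform_logits, targets)
```

Under these assumptions, overlapping speech is treated as its own class rather than as several independent binary decisions, so a prediction is read off with a single argmax per frame and no per-speaker detection threshold is needed.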

Topics

Machine Learning > Learning Types > Self-Supervised Learning; Machine Learning > Application Areas > Domain Adaptation

Keywords

self-supervised learning; speaker diarization; domain mismatch; speech representation

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024