Self-supervised speaker verification with relational mask prediction

Ju-ho Kim; Hee-soo Heo; Bong-Jin Lee; Youngki Kwon; Minjae Lee; Ha-Jin Yu

INTERSPEECH 2024

Abstract

Recently, self-supervised learning (SSL) has emerged as a promising strategy for constructing speaker verification (SV) systems, effectively mitigating the cost and privacy issues associated with the labeling process. Most SSL-based SV systems focus on utterance-level features, potentially overlooking the inherent inter-frame structure of speech. To bridge this gap, we propose relational mask prediction (RMP), a novel loss function that encourages models to understand the relationships between frames. Additionally, we introduce a block aggregation Transformer (BA-Transformer) to enrich frame-level features. Models were trained without labels on the VoxCeleb2 development set and comprehensively evaluated on various test sets. Experimental results demonstrate that the proposed framework outperforms recent SSL-based SV systems, achieving an average performance improvement of 22.39% over the baseline across the entire evaluation dataset.
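
The abstract does not say how RMP is computed, so the following Python sketch is only one plausible reading of "relationships between frames" combined with masked prediction, not the authors' formulation. It assumes that each frame's relation to the others is a softmax over cosine similarities, and that the loss is the cross-entropy between target and predicted relation rows at masked positions. The function names, the temperature, the feature sizes, and the noisy stand-in for a model's output are all hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def relation_rows(frames, temperature=0.1):
    """For each frame, a softmax distribution over its similarity to every frame."""
    rows = []
    for f in frames:
        logits = [cosine(f, g) / temperature for g in frames]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def relational_mask_loss(target_frames, predicted_frames, masked_idx):
    """Mean cross-entropy between target and predicted relation rows,
    evaluated only at the masked frame positions (an assumed design)."""
    target = relation_rows(target_frames)
    pred = relation_rows(predicted_frames)
    loss = 0.0
    for i in masked_idx:
        loss -= sum(t * math.log(p + 1e-12) for t, p in zip(target[i], pred[i]))
    return loss / len(masked_idx)

random.seed(0)
T, D = 8, 4  # hypothetical sizes: 8 frames, 4-dimensional features
target = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
masked_idx = [2, 5]  # hypothetical masked frame positions

# Stand-in for a model's frame features on the masked input:
# a noisy copy of the target features.
predicted = [[x + random.gauss(0, 0.3) for x in f] for f in target]

loss_noisy = relational_mask_loss(target, predicted, masked_idx)
loss_exact = relational_mask_loss(target, target, masked_idx)
print(loss_noisy, loss_exact)
```

In this toy setup the loss is lowest when the predicted frames reproduce the target relations exactly, which is the behavior a relation-matching objective of this kind would reward.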

Authors

Ju-ho Kim, Hee-soo Heo, Bong-Jin Lee, Youngki Kwon, Minjae Lee, Ha-Jin Yu

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Computer Vision > Analysis > Biometrics

Keywords

self-supervised learning, speaker verification, speaker recognition, relational mask prediction

Related papers

Reshape Dimensions Network for Speaker Recognition (2024)

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification (2024)

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch (2024)

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions (2024)

K-means and hierarchical clustering of f0 contours (2024)