MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms

Seung-Bin Kim; Chan-yeong Lim; Jungwoo Heo; Ju-ho Kim; Hyun-seo Shin; Kyo-Won Koo; Ha-Jin Yu

INTERSPEECH 2024

Abstract

In speaker verification systems, short utterances present a persistent challenge: performance degrades primarily because they carry insufficient phonetic information to characterize the speaker. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable-duration utterances using raw waveforms. MR-RawNet extracts time-frequency representations from raw waveforms via a multi-resolution feature extractor that optimally adjusts both temporal and spectral resolutions simultaneously. Furthermore, we apply a multi-resolution attention block that focuses on diverse and extensive temporal contexts, ensuring robustness against changes in utterance length. Experimental results on the VoxCeleb1 dataset demonstrate that MR-RawNet handles utterances of variable duration better than other raw waveform-based systems.
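The trade-off between temporal and spectral resolution mentioned in the abstract can be illustrated with a minimal sketch. This is not the MR-RawNet feature extractor, whose details are not given here; it swaps in fixed short-time Fourier transforms at several window lengths purely for illustration. The function names, the window lengths (128, 256, 512 samples), the half-window hop, and the 440 Hz test tone are all assumptions.

```python
import numpy as np

def frame_signal(x, win_len, hop):
    """Slice a 1-D signal into overlapping frames of length win_len."""
    n_frames = 1 + (len(x) - win_len) // hop
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def magnitude_spectrogram(x, win_len, hop):
    """Hann-windowed magnitude spectrogram, shape (n_frames, win_len // 2 + 1)."""
    frames = frame_signal(x, win_len, hop) * np.hanning(win_len)
    return np.abs(np.fft.rfft(frames, axis=1))

def multi_resolution_features(x, win_lens=(128, 256, 512)):
    """Return one spectrogram per window length (hypothetical hop = win_len // 2)."""
    return {w: magnitude_spectrogram(x, w, w // 2) for w in win_lens}

# One second of a 440 Hz tone at 16 kHz stands in for a raw speech waveform.
sr = 16000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 440.0 * t)

features = multi_resolution_features(waveform)
for w, spec in features.items():
    # Short windows give many frames and few frequency bins; long windows the reverse.
    print(w, spec.shape)
```

Shorter windows yield more frames with coarser frequency bins, while longer windows yield fewer frames with finer bins, which is the tension a multi-resolution front end is meant to balance.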

Topics

Speech & Audio > Analysis > Speaker Verification

Keywords

attention mechanism, speaker verification, raw waveform, temporal resolution, variable duration

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024