A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification

Xujiang Xing; Mingxing Xu; Thomas Fang Zheng

INTERSPEECH 2024

Abstract

Automatic speaker verification (ASV) suffers from performance degradation in noisy conditions. To address this issue, we propose a novel adversarial learning framework that incorporates noise disentanglement to establish a noise-independent, speaker-invariant embedding space. Specifically, the disentanglement module uses two encoders to separate speaker-related information from speaker-irrelevant information. A reconstruction module serves as a regularization term that constrains the noise. A feature-robust loss additionally supervises the speaker encoder so that it learns noise-independent speaker embeddings without losing speaker information. Finally, adversarial training discourages the speaker encoder from encoding acoustic condition information, which yields a speaker-invariant embedding space. Experiments on VoxCeleb1 indicate that the proposed method improves speaker verification performance under both clean and noisy conditions.
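The abstract names four training signals (speaker supervision, reconstruction, a feature-robust loss, and an adversarial term) but gives no formulas. The following is a minimal sketch of how such terms might be combined, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the bias-free linear "encoders", the mean-squared-error form of the reconstruction and feature-robust terms, the helper names (`joint_losses`, `encoder_objective`), and the weights `w_rec`, `w_robust`, and `w_adv`.

```python
import math
import random


def linear(weights, x):
    """Bias-free linear map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def init(rows, cols, rng):
    """Small random weight matrix (hypothetical initialisation)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_losses(params, clean, noisy, speaker_id, condition_id):
    """Forward pass returning each loss term separately."""
    spk_clean = linear(params["speaker_enc"], clean)
    spk_noisy = linear(params["speaker_enc"], noisy)
    noise_code = linear(params["noise_enc"], noisy)
    # Decoder sees both codes concatenated and tries to rebuild the input.
    recon = linear(params["decoder"], spk_noisy + noise_code)
    return {
        "speaker": cross_entropy(
            linear(params["speaker_head"], spk_noisy), speaker_id),
        "reconstruction": mse(recon, noisy),
        # Assumed form: pull noisy and clean speaker embeddings together.
        "robust": mse(spk_noisy, spk_clean),
        "condition": cross_entropy(
            linear(params["condition_head"], spk_noisy), condition_id),
    }


def encoder_objective(losses, w_rec=1.0, w_robust=1.0, w_adv=0.1):
    """Objective seen by the speaker encoder (weights are assumptions).

    The condition loss enters with a minus sign: the encoder is rewarded
    when the acoustic-condition classifier fails, which is the effect a
    gradient reversal layer would produce.
    """
    return (losses["speaker"]
            + w_rec * losses["reconstruction"]
            + w_robust * losses["robust"]
            - w_adv * losses["condition"])


rng = random.Random(0)
feat_dim, emb_dim, n_speakers, n_conditions = 8, 4, 3, 2
params = {
    "speaker_enc": init(emb_dim, feat_dim, rng),
    "noise_enc": init(emb_dim, feat_dim, rng),
    "decoder": init(feat_dim, 2 * emb_dim, rng),
    "speaker_head": init(n_speakers, emb_dim, rng),
    "condition_head": init(n_conditions, emb_dim, rng),
}
clean = [rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
noisy = [c + 0.3 * rng.gauss(0.0, 1.0) for c in clean]

losses = joint_losses(params, clean, noisy, speaker_id=1, condition_id=0)
total = encoder_objective(losses)
```

In an actual adversarial setup the condition classifier would be trained to minimize `losses["condition"]` while the speaker encoder minimizes `total`; the sketch only evaluates the terms and does not perform any parameter updates.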

Topics

Machine Learning > Learning Types > Adversarial Learning
Machine Learning > Application Areas > Domain Adaptation
Speech & Audio > Recognition > Speaker Recognition
Machine Learning > Learning Types > Representation Learning
Deep Learning > Learning Types > Adversarial Learning

Keywords

feature learning, embedding learning, adversarial training, speaker embedding, speaker verification, speaker recognition, feature robustness, noise disentanglement

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024