Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Sanyuan Chen; Yu Wu; Chengyi Wang; Shujie LIU; Zhuo Chen; Peidong Wang; Gang Liu; Jinyu Li; Jian Wu; Xiangzhan Yu; Furu Wei

INTERSPEECH 2022

Abstract

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even when the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of SSL on speaker-related tasks, e.g., speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.
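For readers unfamiliar with the integrated gradients attribution method named in the abstract, the following is a minimal sketch of the general technique on a toy function. It is not the paper's experimental setup: the function `integrated_gradients`, the toy model `toy_model`, its weights `w`, and the all-zero baseline are illustrative assumptions, and the path integral is approximated with a midpoint Riemann sum.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn returns the gradient of the model output with respect to its input.
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # Interpolation coefficients at the midpoints of `steps` equal intervals in [0, 1].
    alphas = (np.arange(steps) + 0.5) / steps
    # Average the gradient along the straight path from the baseline to x.
    avg_grad = np.mean(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    # Scale by the input difference to obtain one attribution per input feature.
    return (x - baseline) * avg_grad


# Hypothetical toy "model": a weighted sum of squared inputs (assumption).
w = np.array([0.5, -1.0, 2.0])


def toy_model(x):
    return float(np.sum(w * x ** 2))


def toy_grad(x):
    # Analytic gradient of toy_model with respect to x.
    return 2.0 * w * x


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)  # all-zero baseline (assumption)
attributions = integrated_gradients(toy_grad, x, baseline)
```

In this toy case the attributions sum to the difference between the model output at `x` and at the baseline, which is the completeness property of integrated gradients. For a real speech model, the gradient would come from automatic differentiation rather than a hand-written `toy_grad`.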

Topics

Machine Learning > Learning Types > Self-Supervised Learning

Speech & Audio > Analysis > Speaker Verification

Keywords

self-supervised learning, speech recognition, speaker verification, mask speech prediction

Related papers

Example-based Explanations with Adversarial Attacks for Respiratory Sound Analysis 2022

Which Model is Best: Comparing Methods and Metrics for Automatic Laughter Detection in a Naturalistic Conversational Dataset 2022

Evidence of Onset and Sustained Neural Responses to Isolated Phonemes from Intracranial Recordings in a Voice-based Cursor Control Task 2022

Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications 2022

Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction 2022