Multimodal Association for Speaker Verification

Suwon Shon; James Glass

2020 INTERSPEECH INTERSPEECH 2020

Multimodal Association for Speaker Verification

Abstract

In this paper, we propose a multimodal association on a speaker verification system for fine-tuning using both voice and face. Inspired by neuroscientific findings, the proposed approach is to mimic the unimodal perception system benefits from the multisensory association of stimulus pairs. To verify this, we use the SRE18 evaluation protocol for experiments and use out-of-domain data, Voxceleb, for the proposed multimodal fine-tuning. Although the proposed approach relies on voice-face paired multimodal data during the training phase, the face is no more needed after training is done and only speech audio is used for the speaker verification system. In the experiments, we observed that the unimodal model, i.e. speaker verification model, benefits from the multimodal association of voice and face and generalized better than before by learning channel invariant speaker representation.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

📈 Trend Setter — Contrastive Learning

🧭 Keyword Pioneer — voice face

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Suwon Shon , James Glass

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Contrastive Learning Speech & Audio > Recognition > Speaker Recognition Machine Learning > Learning Types > Transfer Learning Deep Learning > Learning Types > Multi-Modal Learning

Keywords

transfer learning multimodal learning speaker verification speaker recognition voice face voice-face association channel invariant representation

Download PDF

Related papers

Memory Controlled Sequential Self Attention for Sound Recognition 2020

Dual Attention in Time and Frequency Domain for Voice Activity Detection 2020

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer 2020

A Noise Robust Technique for Detecting Vowels in Speech Signals 2020

Joint Detection of Sentence Stress and Phrase Boundary for Prosody 2020