Deep Speaker Recognition: Modular or Monolithic?

Gautam Bhattacharya; Jahangir Alam; Patrick Kenny

2019 INTERSPEECH INTERSPEECH 2019

Deep Speaker Recognition: Modular or Monolithic?

Abstract

Speaker recognition has made extraordinary progress with the advent of deep neural networks. In this work, we analyze the performance of end-to-end deep speaker recognizers on two popular text-independent tasks - NIST-SRE 2016 and VoxCeleb. Through a combination of a deep convolutional feature extractor, self-attentive pooling and large-margin loss functions, we achieve state-of-the-art performance on VoxCeleb. Our best individual and ensemble models show a relative improvement of 70% an 82% respectively over the best reported results on this task. On the challenging NIST-SRE 2016 task, our proposed end-to-end models show good performance but are unable to match a strong i-vector baseline. State-of-the-art systems for this task use a modular framework that combines neural network embeddings with a probabilistic linear discriminant analysis (PLDA) classifier. Drawing inspiration from this approach we propose to replace the PLDA classifier with a neural network. Our modular neural network approach is able to outperform the i-vector baseline using cosine distance to score verification trials.

❓ The Questioner

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

🧭 Keyword Pioneer — self-attentive pooling

Authors

Gautam Bhattacharya , Jahangir Alam , Patrick Kenny

Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning Machine Learning > Core Methods > Embedding Learning Deep Learning > Architectures > Neural Networks Speech & Audio > Recognition > Speaker Recognition

Keywords

attention mechanism speaker embedding speaker verification speaker recognition convolutional neural network deep neural network end-to-end learning probabilistic linear discriminant analysis self-attentive pooling

Download PDF

Related papers

Using Real-Time Visual Biofeedback for Second Language Instruction 2019

VAE-Based Regularization for Deep Speaker Embedding 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition 2019

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition 2019

Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile 2019