The JHU Speaker Recognition System for the VOiCES 2019 Challenge

David Snyder; Jesus Villalba; Nanxin Chen; Daniel Povey; Gregory Sell; Najim Dehak; Sanjeev Khudanpur

INTERSPEECH 2019

Abstract

This paper describes the systems developed by the JHU team for the speaker recognition track of the 2019 VOiCES from a Distance Challenge. On this far-field task, we achieved good performance using systems based on state-of-the-art deep neural network (DNN) embeddings. In this paradigm, a DNN maps variable-length speech segments to speaker embeddings, called x-vectors, that are then classified using probabilistic linear discriminant analysis (PLDA). Our submissions were composed of three x-vector-based systems that differed primarily in the DNN architecture, temporal pooling mechanism, and training objective function. On the evaluation set, our best single-system submission used an extended time-delay architecture, and achieved 0.435 in actual DCF, the primary evaluation metric. A fusion of all three x-vector systems was our primary submission, and it obtained an actual DCF of 0.362.
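The primary metric named above, actual DCF, can be illustrated with a minimal sketch. It assumes the trial scores are calibrated log-likelihood ratios and that the cost is normalized by the best trivial system. The function name `actual_dcf` and the default operating point (`p_target=0.01`, unit miss and false-alarm costs) are assumptions made for illustration; the abstract does not state the challenge's parameters.

```python
import math


def actual_dcf(target_llrs, nontarget_llrs, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized detection cost at the Bayes decision threshold.

    Scores are assumed to be calibrated log-likelihood ratios.
    The default operating point is an illustrative assumption.
    """
    # Bayes-optimal threshold for calibrated log-likelihood ratios.
    threshold = math.log((c_fa * (1.0 - p_target)) / (c_miss * p_target))

    # Miss: a target trial scored below the threshold.
    p_miss = sum(s < threshold for s in target_llrs) / len(target_llrs)
    # False alarm: a non-target trial scored at or above the threshold.
    p_fa = sum(s >= threshold for s in nontarget_llrs) / len(nontarget_llrs)

    cost = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Normalize by the cost of the better trivial system
    # (always reject or always accept).
    return cost / min(c_miss * p_target, c_fa * (1.0 - p_target))


if __name__ == "__main__":
    targets = [6.0, 5.0, 3.0, 7.0]        # toy scores, not challenge data
    nontargets = [-3.0, -5.0, 0.0, -1.0]  # toy scores, not challenge data
    print(actual_dcf(targets, nontargets))
```

Under this normalization, a system that rejects every trial scores 1.0, so values below 1.0, such as those reported above, indicate an improvement over that trivial baseline.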

Topics

Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Speaker Recognition

Keywords

speaker embedding; speaker recognition; deep neural network; probabilistic linear discriminant analysis
