Deeply Fused Speaker Embeddings for Text-Independent Speaker Verification

Gautam Bhattacharya; Md Jahangir Alam; Vishwa Gupta; Patrick Kenny

INTERSPEECH 2018

Abstract

Recently there has been a surge of interest in learning speaker embeddings using deep neural networks. These models ingest time-frequency representations of speech and can be trained to discriminate between a known set of speakers. While embeddings learned in this way perform well, they typically require a large number of training data points for learning. In this work we propose deeply fused speaker embeddings: speaker representations that combine neural speaker embeddings with i-vectors. We show that by combining the two speaker representations we are able to learn robust speaker embeddings in a computationally efficient manner. We compare several different fusion strategies and find that the resulting speaker embeddings show significantly different verification performance. To this end we propose a novel fusion approach that uses an attention model to combine i-vectors with neural speaker embeddings. Our best performing embedding achieves an error rate of 3.17% using a simple cosine distance classifier. Combining our embeddings with a powerful Joint Bayesian classifier, we are able to further improve the performance of our speaker embeddings to 2.22%, which gave a 7.8% relative improvement over the baseline i-vector system.
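The cosine distance classifier mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_score` and `same_speaker` and the threshold value are assumptions made for illustration, and the embeddings are plain lists of floats rather than outputs of any particular model.

```python
import math
from typing import Sequence


def cosine_score(enroll: Sequence[float], test: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_e = math.sqrt(sum(a * a for a in enroll))
    norm_t = math.sqrt(sum(b * b for b in test))
    if norm_e == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_e * norm_t)


def same_speaker(enroll: Sequence[float], test: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Accept the trial if the score reaches the threshold.

    The default threshold is a hypothetical placeholder; in practice it
    would be tuned on held-out trials.
    """
    return cosine_score(enroll, test) >= threshold
```

In a verification trial, the enrollment and test utterances are each mapped to an embedding, and the score is compared against a threshold chosen on development data. Sweeping that threshold is how an error rate such as the one reported above would typically be measured.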

Topics

Machine Learning > Core Methods > Embedding Learning
Deep Learning > Architectures > Neural Networks
Deep Learning > Techniques > Model Architecture
Speech & Audio > Recognition > Speaker Recognition
Speech & Audio > Analysis > Speaker Verification
Machine Learning > Learning Types > Transfer Learning
Machine Learning > Learning Types > Representation Learning
Deep Learning > Learning Types > Representation Learning

Keywords

speaker embedding, speaker verification, deep neural network, attention model, cosine distance, neural speaker embedding, fusion strategy, neural network
