2017 INTERSPEECH INTERSPEECH 2017

Deep Neural Network Embeddings for Text-Independent Speaker Verification

Abstract

This paper investigates replacing i-vectors for text-independent speaker verification with embeddings extracted from a feed-forward deep neural network. Long-term speaker characteristics are captured in the network by a temporal pooling layer that aggregates over the input speech. This enables the network to be trained to discriminate between speakers from variable-length speech segments. After training, utterances are mapped directly to fixed-dimensional speaker embeddings and pairs of embeddings are scored using a PLDA-based backend. We compare performance with a traditional i-vector baseline on NIST SRE 2010 and 2016. We find that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions. Moreover, the two representations are complementary, and their fusion improves on the baseline at all operating points. Similar systems have recently shown promising results when trained on very large proprietary datasets, but to the best of our knowledge, these are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.

πŸŒ‰ Interdisciplinary Bridge β€” Deep Learning and Machine Learning
🧭 Keyword Pioneer β€” text-independent verification
🐝 Cross-Pollinator β€” Artificial Intelligence, Computer Vision, Deep Learning, Machine Learning, Speech & Audio
🐣 Hot Topic Early Bird β€” speaker embedding