2018 INTERSPEECH INTERSPEECH 2018

Deep Discriminative Embeddings for Duration Robust Speaker Verification

Abstract

The embedding-based deep convolution neural networks (CNNs) have demonstrated effective for text-independent speaker verification systems with short utterances. However, the duration robustness of the existing deep CNNs based algorithms has not been investigated when dealing with utterances of arbitrary duration. To improve robustness of embedding-based deep CNNs for longer duration utterances, we propose a novel algorithm to learn more discriminative utterance-level embeddings based on the Inception-ResNet speaker classifier. Specifically, the discriminability of embeddings is enhanced by reducing intra-speaker variation with center loss and simultaneously increasing inter-speaker discrepancy with softmax loss. To further improve system performance when long utterances are available, at test stage long utterances are segmented into shorter ones, where utterance-level speaker embeddings are extracted by an average pooling layer. Experimental results show that when cosine distance is employed as the measure of similarity for a trial, the proposed method outperforms ivector/PLDA framework for short utterances and is effective for long utterances.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning
🧭 Keyword Pioneer — duration robustness
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio