2017 INTERSPEECH INTERSPEECH 2017

End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances

Abstract

Text-independent speaker verification against short utterances is still challenging despite of recent advances in the field of speaker recognition with i-vector framework. In general, to get a robust i-vector representation, a satisfying amount of data is needed in the MAP adaptation step, which is hard to meet under short duration constraint. To overcome this, we present an end-to-end system which directly learns a mapping from speech features to a compact fixed length speaker discriminative embedding where the Euclidean distance is employed for measuring similarity within trials. To learn the feature mapping, a modified Inception Net with residual block is proposed to optimize the triplet loss function. The input of our end-to-end system is a fixed length spectrogram converted from an arbitrary length utterance. Experiments show that our system consistently outperforms a conventional i-vector system on short duration speaker verification tasks. To test the limit under various duration conditions, we also demonstrate how our end-to-end system behaves with different duration from 2sโ€“4s.

๐ŸŒ‰ Interdisciplinary Bridge โ€” Machine Learning and Speech & Audio
๐Ÿฃ Hot Topic Early Bird โ€” speaker embedding
๐Ÿ Cross-Pollinator โ€” Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio