2020 INTERSPEECH INTERSPEECH 2020

ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification

Abstract

The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches well capture time-sequential information, they lack the delicate transformations needed for deep representation. To solve this problem, we propose two TDNN architectures. RET integrates shortcut connections into conventional time-delay blocks, and ARET adopts a split-transform-merge strategy to extract more discriminative representation. Experiments on VoxCeleb datasets without augmentation indicate that ARET realizes satisfactory performance on the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H, with 1.389%, 1.520%, and 2.614% equal error rate (EER), respectively. Compared to state-of-the-art results on these test sets, RET achieves a 23%~43% relative reduction in EER, and ARET reaches 32%~45%.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning
🧭 Keyword Pioneer — aggregated feature
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Speech & Audio