ARET: Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification

Ruiteng Zhang; Jianguo Wei; Wenhuan Lu; Longbiao Wang; Meng Liu; Lin Zhang; Jiayu Jin; Junhai Xu

INTERSPEECH 2020

Abstract

The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches capture time-sequential information well, they lack the fine-grained transformations needed for deep representations. To address this problem, we propose two TDNN architectures: RET integrates shortcut connections into conventional time-delay blocks, and ARET adopts a split-transform-merge strategy to extract a more discriminative representation. Experiments on the VoxCeleb datasets without augmentation show that ARET achieves equal error rates (EER) of 1.389%, 1.520%, and 2.614% on the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H, respectively. Compared to state-of-the-art results on these test sets, RET achieves a 23% to 43% relative reduction in EER, and ARET reaches 32% to 45%.
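The abstract reports results as equal error rate (EER) and as relative reductions in EER. As a minimal sketch of how these two quantities are commonly computed (not the authors' evaluation code), the Python below sweeps a decision threshold over trial scores and returns the operating point where the false-acceptance and false-rejection rates are closest. The function names `equal_error_rate` and `relative_reduction`, and the toy scores in the usage note, are hypothetical and chosen only for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    target_scores: scores of same-speaker trials (higher means more similar).
    nontarget_scores: scores of different-speaker trials.
    Returns the EER as a fraction in [0, 1].
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))

    best_gap = None
    eer = None
    for t in thresholds:
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction of a new system over a baseline, as a fraction."""
    return (baseline_eer - new_eer) / baseline_eer
```

With the illustrative scores `equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])`, one target trial and one non-target trial are misclassified at threshold 0.6, so both error rates are 0.25. Likewise, `relative_reduction(2.0, 1.5)` gives 0.25, that is, a 25% relative reduction; these inputs are made up and are not the baselines used in the paper.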

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Speaker Recognition
Deep Learning > Learning Types > Deep Learning

Keywords

speaker verification; speaker recognition; residual connection; equal error rate; time-delay neural network; aggregated feature
