2023 INTERSPEECH INTERSPEECH 2023

A Method of Audio-Visual Person Verification by Mining Connections between Time Series

Abstract

It has already been observed that audio-visual embedding is more robust than uni-modality embedding for person verification. But the relationship of keyframes in time series between modalities seems to be unexplored. Hence, we proposed a novel audio-visual strategy that considers connections between time series from a generative perspective. First, we introduced weight-enhanced attentive statistics pooling to extend the salience of the keyframe weights. Then, joint attentive pooling incorporating 3 popular generative supervision models is proposed. Finally, each modality is fused with a gated attention mechanism to gain robust embedding. All the proposed models are trained on the VoxCeleb2 dev dataset and the best system obtains 0.14%, 0.21%, and 0.37% EER on three official trial lists of VoxCeleb1 respectively, which is to our knowledge the best-published results for person verification.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio