Audio-Visual Information Fusion Using Cross-Modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments

Hengshun Zhou; Jun Du; Hang Chen; Zijun Jing; Shifu Xiong; Chin-Hui Lee

INTERSPEECH 2021

Abstract

We propose an information fusion approach to audio-visual voice activity detection (AV-VAD) based on cross-modal teacher-student learning, leveraging factorized bilinear pooling (FBP) and Kullback-Leibler (KL) regularization. First, we design an audio-visual network that uses FBP fusion to fully exploit the interaction between the audio and video modalities. Next, to transfer the rich information in an audio-based VAD (A-VAD) model trained on a massive audio-only dataset to an AV-VAD model built with relatively limited multi-modal data, we propose a cross-modal teacher-student learning framework based on cross entropy regularized by KL divergence. Finally, evaluated with standard VAD metrics on an in-house dataset recorded in realistic conditions, the proposed approach yields consistent and significant improvements over other state-of-the-art techniques. Moreover, when our AV-VAD technique is applied to an audio-visual Chinese speech recognition task, the character error rate is reduced by 24.15% and 8.66% compared with the A-VAD and baseline AV-VAD systems, respectively.
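The two ingredients named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption: `fbp_fuse` follows the common project, multiply, sum-pool form of factorized bilinear pooling, and `kl_regularized_loss` interpolates cross entropy on the hard label with the KL divergence from the teacher (A-VAD) posterior to the student (AV-VAD) posterior. The function names, the projection matrices `U` and `V`, the pooling size `k`, and the weight `lam` are hypothetical and are not taken from the paper.

```python
import math


def fbp_fuse(audio, video, U, V, k):
    """Assumed FBP form: project both modalities, multiply, sum-pool."""
    # Project each modality into a shared space (one value per matrix row).
    pa = [sum(w * a for w, a in zip(row, audio)) for row in U]
    pv = [sum(w * v for w, v in zip(row, video)) for row in V]
    # The element-wise product captures pairwise audio-video interactions.
    joint = [x * y for x, y in zip(pa, pv)]
    # Sum-pool non-overlapping windows of size k into the fused vector.
    return [sum(joint[i:i + k]) for i in range(0, len(joint), k)]


def kl_regularized_loss(student, teacher, label, lam):
    """Assumed loss: (1 - lam) * CE(label, student) + lam * KL(teacher || student)."""
    eps = 1e-12  # guards against log(0)
    # Cross entropy of the student posterior against the hard label.
    ce = -math.log(student[label] + eps)
    # KL divergence from the teacher posterior to the student posterior.
    kl = sum(t * math.log((t + eps) / (s + eps))
             for t, s in zip(teacher, student))
    return (1.0 - lam) * ce + lam * kl


# Toy frame: 2-dim audio feature, 1-dim video feature, pooling size k = 2.
fused = fbp_fuse([1.0, 2.0], [3.0],
                 U=[[1, 0], [0, 1], [1, 1], [1, -1]],
                 V=[[1], [2], [1], [0]],
                 k=2)

# Posteriors are ordered [non-speech, speech]; label 1 marks a speech frame.
loss = kl_regularized_loss(student=[0.3, 0.7], teacher=[0.1, 0.9],
                           label=1, lam=0.5)
```

In this sketch, `lam = 0` reduces to plain supervised training of the student, while larger values pull the student posterior toward the teacher's. Practical FBP implementations often add dropout and power and L2 normalization after pooling; those steps are omitted here.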

Topics

Machine Learning > Learning Types > Self-Supervised Learning; Speech & Audio > Analysis > Speech Analysis

Keywords

cross-modal learning; information fusion; teacher-student learning; voice activity detection; factorized bilinear pooling
