INTERSPEECH 2017

Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

Abstract

Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging hands-free intelligent assistants. Conventional VAD systems that rely on audio-only features are often impaired by noise in the environment. An alternative approach to this problem is audiovisual VAD (AV-VAD). Modeling timing dependencies between acoustic and visual features is a challenge in AV-VAD. This study proposes a bimodal recurrent neural network (RNN) which combines audiovisual features in a principled, unified framework, capturing the timing dependencies within and across modalities. Each modality is modeled with a separate bidirectional long short-term memory (BLSTM) network. Their output layers are used as input to another BLSTM network. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms the audio-only VAD system by 7.9% (absolute) under clean/ideal conditions (i.e., high definition (HD) camera, close-talk microphone), and by 18.5% (absolute) when the conditions are more challenging (i.e., camera and microphone from a tablet with noise in the environment). The proposed approach shows the best performance and robustness across a variety of conditions, demonstrating its potential for real-world applications.
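
The architecture described in the abstract can be sketched roughly as follows, using PyTorch. This is a minimal illustration, not the authors' implementation: the class name `BimodalBLSTM`, the feature dimensions (13 acoustic, 26 visual), the hidden size, the use of concatenation to join the two modality outputs, the linear speech/non-speech classifier, and the assumption that both streams are already aligned to the same frame rate are all hypothetical choices made for this example.

```python
import torch
import torch.nn as nn


class BimodalBLSTM(nn.Module):
    """Hypothetical sketch: one BLSTM per modality, fused by a third BLSTM."""

    def __init__(self, audio_dim=13, visual_dim=26, hidden=64):
        super().__init__()
        # Separate BLSTMs capture timing dependencies within each modality.
        self.audio_rnn = nn.LSTM(audio_dim, hidden, batch_first=True,
                                 bidirectional=True)
        self.visual_rnn = nn.LSTM(visual_dim, hidden, batch_first=True,
                                  bidirectional=True)
        # A third BLSTM sees both outputs and models cross-modal timing.
        # Each BLSTM emits 2 * hidden features per frame (forward + backward),
        # so the concatenated input has 4 * hidden features (an assumption).
        self.fusion_rnn = nn.LSTM(4 * hidden, hidden, batch_first=True,
                                  bidirectional=True)
        # Frame-level speech / non-speech logits (assumed output head).
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, audio, visual):
        # audio:  (batch, frames, audio_dim)
        # visual: (batch, frames, visual_dim), assumed frame-synchronous
        a, _ = self.audio_rnn(audio)
        v, _ = self.visual_rnn(visual)
        fused, _ = self.fusion_rnn(torch.cat([a, v], dim=-1))
        return self.classifier(fused)  # (batch, frames, 2)


# Usage with random stand-in features: 4 sequences of 100 frames each.
model = BimodalBLSTM()
audio = torch.randn(4, 100, 13)
visual = torch.randn(4, 100, 26)
logits = model(audio, visual)
```

Keeping a separate recurrent network per modality lets each stream learn its own temporal dynamics before the fusion network relates them to each other, which is the property the abstract highlights.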


Authors