Real Time Online Visual End Point Detection Using Unidirectional LSTM

Tanay Sharma; Rohith Chandrashekar Aralikatti; Dilip Kumar Margam; Abhinav Thanda; Sharad Roy; Pujitha Appan Kandala; Shankar M. Venkatesan

INTERSPEECH 2019

Abstract

Visual Voice Activity Detection (V-VAD) detects a speaker's speech activity from visual features. V-VAD is useful for detecting the end point of an utterance under noisy acoustic conditions or for maintaining speaker privacy. In this paper, we propose a speaker-independent, real-time solution for V-VAD. The focus is on real-time operation and accuracy, since such algorithms will play a key role in end point detection, especially when interacting with speech assistants. We propose two novel methods, one using a CNN and the other using 2D-DCT features. Both methods use unidirectional LSTMs, which make them online and let them learn temporal dependence. The methods are tested on two publicly available datasets and on a locally collected dataset, which further validates our hypothesis. Experiments also show that both approaches generalize to unseen speakers. Our best approach gives a substantial improvement over earlier methods evaluated on the same dataset.
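The abstract gives no architectural details, so the following is only a minimal sketch of how a 2D-DCT front end and a unidirectional LSTM could be combined for online end point detection. The crop size, the number of retained coefficients, the hidden size, the random (untrained) weights, and the "hangover" rule for declaring the end point are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row scaling for orthonormality
    return m

def dct2_features(frame, keep=4):
    """2D-DCT of a square grayscale crop; keep the low-frequency keep x keep block.
    The crop size and `keep` are assumed values, not the paper's."""
    d = dct_matrix(frame.shape[0])
    coeffs = d @ frame @ d.T
    return coeffs[:keep, :keep].ravel()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class OnlineLSTM:
    """Single unidirectional LSTM cell with untrained random weights (illustrative only)."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.w_out = rng.normal(0.0, 0.1, n_hidden)
        self.h = np.zeros(n_hidden)  # hidden state carried across frames
        self.c = np.zeros(n_hidden)  # cell state carried across frames

    def step(self, x):
        """Consume one frame's features; uses only past context, so it can run online."""
        z = self.W @ np.concatenate([x, self.h]) + self.b
        i, f, o, g = np.split(z, 4)
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)
        self.h = sigmoid(o) * np.tanh(self.c)
        return float(sigmoid(self.w_out @ self.h))  # per-frame speech probability

def detect_endpoint(probs, threshold=0.5, hangover=3):
    """Assumed rule: declare the end point once `hangover` consecutive
    non-speech frames follow at least one speech frame."""
    seen_speech = False
    silent = 0
    for t, p in enumerate(probs):
        if p >= threshold:
            seen_speech = True
            silent = 0
        elif seen_speech:
            silent += 1
            if silent >= hangover:
                return t
    return None
```

In use, each incoming video frame would be cropped to the mouth region, passed through `dct2_features`, and fed to `OnlineLSTM.step`; the resulting stream of probabilities is what `detect_endpoint` scans. Because the LSTM is unidirectional, no future frames are needed before a decision can be made, which is what makes the pipeline online.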

Topics

Machine Learning > Core Methods > Classification

Computer Vision > Analysis > Action Recognition

Keywords

temporal modeling, speaker independent, visual voice activity detection, speech endpoint detection, unidirectional LSTM
