Lipper: Speaker Independent Speech Synthesis Using Multi-View Lipreading

Khwaja Mohd. Salik; Swati Aggarwal; Yaman Kumar; Rajiv Ratn Shah; Rohit Jain; Roger Zimmermann

AAAI 2019

Abstract

Lipreading is the process of understanding and interpreting speech by observing a speaker's lip movements. Most previous work in lipreading has been limited to classifying silent videos into a fixed number of text classes. This limits the applications of lipreading, since human language cannot be bound to a fixed set of words or languages. The aim of this work is to reconstruct intelligible acoustic speech signals from silent videos, captured from various poses, of a speaker whom Lipper has never seen before. Lipper is therefore a vocabulary- and language-agnostic, speaker-independent, near real-time model that handles a variety of speaker poses. The model leverages silent video feeds from multiple cameras recording a subject to generate intelligible speech for that speaker. It uses a deep learning based STCNN+BiGRU architecture to achieve this goal. We evaluate speech reconstruction in speaker-independent scenarios and demonstrate the speech output by overlaying the audio reconstructed by Lipper on the corresponding videos.
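To make the STCNN+BiGRU idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes that camera views are stacked along the channel axis, that the network predicts one audio feature vector per video frame, and that every layer size shown is illustrative. The class name `MultiViewLipToSpeech` and the parameters `num_views`, `audio_dim`, and `hidden` are hypothetical; the abstract specifies none of them.

```python
import torch
import torch.nn as nn


class MultiViewLipToSpeech(nn.Module):
    """Hypothetical STCNN + BiGRU sketch; all sizes are illustrative assumptions."""

    def __init__(self, num_views=2, audio_dim=32, hidden=64):
        super().__init__()
        # Assumption: each view is a grayscale clip and the views are stacked
        # as input channels. The abstract does not say how views are fused.
        self.stcnn = nn.Sequential(
            nn.Conv3d(num_views, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space only, keep every frame
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Bidirectional GRU over the time axis of the pooled visual features.
        self.bigru = nn.GRU(32, hidden, batch_first=True, bidirectional=True)
        # Assumption: one audio feature vector is regressed per video frame.
        self.head = nn.Linear(2 * hidden, audio_dim)

    def forward(self, video):
        # video: (batch, views, frames, height, width)
        feats = self.stcnn(video)          # (batch, 32, frames, h, w)
        feats = feats.mean(dim=(3, 4))     # average over space -> (batch, 32, frames)
        feats = feats.transpose(1, 2)      # (batch, frames, 32)
        hidden_states, _ = self.bigru(feats)
        return self.head(hidden_states)    # (batch, frames, audio_dim)
```

Calling the model on a tensor of shape `(batch, views, frames, height, width)` returns a tensor of shape `(batch, frames, audio_dim)`. Turning those per-frame features into a waveform would need a separate vocoder or synthesis step, which this sketch leaves out.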

Topics

Deep Learning > Architectures > Neural Networks
Deep Learning > Models > Generative Models
Computer Vision > Processing > Video Understanding
Speech & Audio > Synthesis > Speech Synthesis

Keywords

speech synthesis, multi-view learning, visual speech recognition, multiview learning, bidirectional recurrent neural network, speech reconstruction, speaker independent, speaker independence, video to audio
