VisualVoice: Audio-Visual Speech Separation With Cross-Modal Consistency

Ruohan Gao; Kristen Grauman

2021 CVPR CVPR 2021

VisualVoice: Audio-Visual Speech Separation With Cross-Modal Consistency

Abstract

We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional prior to isolate the corresponding vocal qualities they are likely to produce. Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video. It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement, and generalizes well to challenging real-world videos of diverse scenarios. Our video results and code: http://vision.cs.utexas.edu/projects/VisualVoice/.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Speech & Audio

🧭 Keyword Pioneer — face appearance

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Security & Privacy, Speech & Audio

Authors

Ruohan Gao , Kristen Grauman

Topics

Artificial Intelligence > Core AI > Multimodal Learning Speech & Audio > Synthesis > Speech Enhancement Deep Learning > Learning Types > Multi-Modal Learning Artificial Intelligence > Core AI > Multi-Modal Learning Speech & Audio > Processing > Speech Enhancement

Keywords

speech separation speaker embedding audio-visual speech lip movement cross-modal consistency audio-visual speech separation face appearance

Download PDF

Related papers

Learning To Reconstruct High Speed and High Dynamic Range Videos From Events 2021

DeFLOCNet: Deep Image Editing via Flexible Low-Level Controls 2021

Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs 2021

Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization 2021

Pose-Guided Human Animation From a Single Image in the Wild 2021