Reading To Listen at the Cocktail Party: Multi-Modal Speech Separation

Akam Rahimi; Triantafyllos Afouras; Andrew Zisserman

CVPR 2022

Abstract

The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end, we make the following contributions: (i) we design a modern Transformer-based architecture that takes raw waveforms as input, produces raw waveforms as output, and is tailored to fuse different modalities to solve the speech separation task; (ii) we propose conditioning on the text content of a sentence, alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Speech & Audio > Synthesis > Speech Enhancement
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

transformer architecture, speech separation, multi-modal learning, speech enhancement, audio-visual speech, visual information, raw waveform, text conditioning
