Visual Acoustic Matching

Changan Chen; Ruohan Gao; Paul Calamia; Kristen Grauman

CVPR 2022

Abstract

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
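The abstract's central mechanism, audio-visual attention that injects visual properties into the audio, can be illustrated with a minimal sketch. This is not the authors' model. It assumes hypothetical audio-frame and image-patch embeddings that share the same width d. It also omits the learned query/key/value projections, multiple heads, normalization, and feed-forward layers that a real transformer layer would have. To keep it dependency-free, it uses plain Python lists. Each audio frame acts as a query over the visual patch features, and the attended visual context is added back to the audio stream through a residual connection.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def softmax(row: List[float]) -> List[float]:
    """Normalize scores to weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def audio_visual_cross_attention(audio: Matrix, visual: Matrix) -> Matrix:
    """Hypothetical single-head cross-attention: audio queries, visual keys/values.

    audio:  T x d audio frame embeddings (assumed output of some audio encoder)
    visual: P x d image patch embeddings (assumed output of some image encoder)
    Returns T x d: each audio frame plus its attended visual context.
    """
    d = len(audio[0])
    # Scaled dot-product scores: (T x d) @ (d x P) -> T x P.
    scores = matmul(audio, transpose(visual))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    # Attention-weighted sum of visual features: (T x P) @ (P x d) -> T x d.
    context = matmul(weights, visual)
    # Residual connection adds visual context into the audio stream.
    return [[a + c for a, c in zip(a_row, c_row)]
            for a_row, c_row in zip(audio, context)]


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 audio frames, d = 2
    visual = [[1.0, 0.0], [0.0, 1.0]]             # 2 image patches, d = 2
    out = audio_visual_cross_attention(audio, visual)
    print([round(x, 3) for x in out[2]])  # [1.0, 1.0]
```

In the usage example, the third audio frame scores both patches equally, so it receives an even mix of the two visual features. The first frame leans toward the patch it is most similar to. In a trained model, the weights would come from learned projections rather than raw dot products of the embeddings.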

Topics

Speech & Audio > Synthesis > Speech Enhancement
Interdisciplinary > Linguistics > Computational Linguistics
Interdisciplinary > Cognitive Science > Perception
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

cross-modal learning, speech translation, cross-modal transformer, room acoustics, acoustic matching, audio-visual attention, visual acoustic matching

Related papers

UniCoRN: A Unified Conditional Image Repainting Network 2022

Why Discard if You Can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis 2022

All-in-One Image Restoration for Unknown Corruption 2022

Stability-Driven Contact Reconstruction From Monocular Color Images 2022

Forecasting Characteristic 3D Poses of Human Actions 2022