WACV 2025

SoundLoc3D: Invisible 3D Sound Source Localization and Classification using a Multimodal RGB-D Acoustic Camera

Abstract

Accurately localizing 3D sound sources and estimating their semantic labels, where the sources may not be visible but are assumed to lie on the physical surface of objects in the scene, has many real-world applications, including detecting gas leaks and machinery malfunctions. The weak audio-visual correlation in this setting poses new challenges: it is unclear whether, and how, cross-modal information can be used to solve the task. To this end, we propose an acoustic-camera rig consisting of a pinhole RGB-D camera and a coplanar four-channel microphone array (Mic-Array). By recording audio-visual signals from multiple views with this rig, we can use cross-modal cues to estimate the sound sources' 3D locations. Specifically, our framework, SoundLoc3D, treats the task as a set prediction problem in which each element of the set corresponds to a potential sound source. Given the weak audio-visual correlation, the set representation is first learned from a single-view microphone array signal and then refined by actively incorporating physical surface cues revealed by multiview RGB-D images. We demonstrate the efficiency and superiority of SoundLoc3D on a large-scale simulated dataset and further show its robustness to RGB-D measurement inaccuracy and ambient noise interference.
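
The assumption that sources lie on physical surfaces can be illustrated with a small geometric sketch. It is not the SoundLoc3D method, whose refinement is learned; it only shows how a pinhole RGB-D depth map yields surface points, and how a coarse source estimate could be snapped to the nearest one. The function names, the intrinsics (fx, fy, cx, cy), and the toy depth values are all assumptions made for illustration.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to (N, 3) camera-frame points (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row grids
    valid = depth > 0  # assumption: non-positive depth marks a missing reading
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def snap_to_surface(candidate, surface_points):
    """Return the surface point closest to a candidate 3D source location."""
    dists = np.linalg.norm(surface_points - candidate, axis=1)
    return surface_points[np.argmin(dists)]

# Toy example: a flat wall 2 m in front of a 4x4-pixel camera.
depth = np.full((4, 4), 2.0)
points = backproject_depth(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)

# Hypothetical coarse estimate, e.g. from the microphone array alone.
refined = snap_to_surface(np.array([0.4, 0.6, 1.7]), points)
```

In this toy setup the coarse estimate floats 0.3 m in front of the wall, and the snapped location is the wall point (0.5, 0.5, 2.0).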
