SoundLoc3D: Invisible 3D Sound Source Localization and Classification using a Multimodal RGB-D Acoustic Camera

Yuhang He; Sangyun Shin; Anoop Cherian; Niki Trigoni; Andrew Markham

WACV 2025

Abstract

Accurately localizing 3D sound sources and estimating their semantic labels, where the sources may not be visible but are assumed to lie on the physical surface of objects in the scene, has many real-world applications, including detecting gas leaks and machinery malfunctions. The weak audio-visual correlation in this setting poses new challenges: it is unclear whether, and how, cross-modal information can be used to solve the task. To this end, we propose an acoustic-camera rig consisting of a pinhole RGB-D camera and a coplanar four-channel microphone array (Mic-Array). By using this rig to record audio-visual signals from multiple views, we can use cross-modal cues to estimate the sound sources' 3D locations. Specifically, our framework, SoundLoc3D, treats the task as a set prediction problem in which each element of the set corresponds to a potential sound source. Given the weak audio-visual correlation, the set representation is initially learned from a single-view microphone array signal and then refined by actively incorporating physical surface cues revealed by multiview RGB-D images. We demonstrate the efficiency and superiority of SoundLoc3D on a large-scale simulated dataset and further show its robustness to RGB-D measurement inaccuracy and ambient noise interference.
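To make the "physical surface cues" from a pinhole RGB-D camera concrete, the following is a minimal sketch of standard pinhole back-projection, followed by a nearest-surface-point lookup for a candidate source location. It is illustrative only and is not the SoundLoc3D refinement procedure, which the abstract does not specify. The function names, the toy depth map, and the intrinsics (fx, fy, cx, cy) are all assumptions introduced for this example.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to camera-frame XYZ."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def depth_to_points(depth_map, fx, fy, cx, cy):
    """Convert a depth map (list of rows) to a list of 3D surface points.

    Pixels with non-positive depth are treated as invalid and skipped
    (an assumption; real sensors encode missing depth in various ways).
    """
    points = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d > 0:
                points.append(backproject(u, v, d, fx, fy, cx, cy))
    return points

def snap_to_surface(candidate, points):
    """Return the surface point with the smallest Euclidean distance to the candidate."""
    return min(points, key=lambda p: math.dist(p, candidate))

# Toy 3x3 depth map in metres; the centre pixel has no valid depth.
depth_map = [
    [2.0, 2.0, 2.0],
    [2.0, 0.0, 2.0],
    [2.0, 2.0, 2.0],
]
# Hypothetical intrinsics chosen only to keep the arithmetic simple.
points = depth_to_points(depth_map, fx=1.0, fy=1.0, cx=1.0, cy=1.0)

# A hypothetical audio-only estimate that floats slightly off the surface.
snapped = snap_to_surface((1.8, 0.1, 1.5), points)
```

Under the assumption that sources lie on object surfaces, a lookup of this kind is one simple way an off-surface estimate could be constrained by RGB-D geometry. In a multiview setting, each view's points would first be transformed into a common frame using the camera poses.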

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > 3D Vision
Computer Vision > Core AI > Multimodal Learning

Keywords

sound source localization, multimodal learning, depth estimation, 3D vision, acoustic signal processing, 3D localization, RGB-D camera, cross-modal information
