CVPR 2025
Supervising Sound Localization by In-the-wild Egomotion
Abstract
We present a method for learning binaural sound localization from ego-motion in videos. When the camera moves in a video, the direction of sound sources will change along with it. We train an audio model to predict sound directions that are consistent with visual estimates of camera motion, which we obtain using methods from multi-view geometry. This provides a weak but plentiful form of supervision that we combine with traditional binaural cues. To evaluate this idea, we propose a dataset of real-world audio-visual videos with ego-motion. We show that our model can successfully learn from this real-world data, and that it obtains strong performance on sound localization tasks.
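The consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single static sound source, a camera that only rotates about its vertical axis, and a sign convention in which a camera yaw of +θ shifts the source's azimuth in the camera frame by −θ. The function names `wrap_angle` and `rotation_consistency_loss` are hypothetical and chosen only for illustration.

```python
import numpy as np


def wrap_angle(theta):
    """Wrap angles in radians to the interval [-pi, pi)."""
    return (theta + np.pi) % (2 * np.pi) - np.pi


def rotation_consistency_loss(az_t1, az_t2, camera_yaw):
    """Penalize azimuth predictions that disagree with the camera rotation.

    az_t1, az_t2: azimuths (radians) predicted from audio at two times.
    camera_yaw:   camera rotation between the two times (radians), e.g.
                  estimated visually. Assumed convention: turning the
                  camera by +yaw moves a static source by -yaw in the
                  camera frame.
    """
    az_t1 = np.asarray(az_t1, dtype=float)
    az_t2 = np.asarray(az_t2, dtype=float)
    camera_yaw = np.asarray(camera_yaw, dtype=float)

    # Azimuth we would expect at the second time if the source did not move.
    expected_t2 = az_t1 - camera_yaw

    # Compare on the circle so that angles near -pi and +pi count as close.
    err = wrap_angle(az_t2 - expected_t2)
    return float(np.mean(err ** 2))
```

Under these assumptions the loss is zero whenever the predicted change in azimuth matches the rotation estimated from the video, and it grows with the wrapped angular disagreement. A real system would also need to handle camera translation, moving sources, and noise in the visual motion estimate, none of which this sketch covers.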
Topics
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Weakly Supervised Learning
Interdisciplinary > Cognitive Science > Perception
Speech & Audio > Analysis > Speech Analysis
Deep Learning > Learning Types > Self-Supervised Learning
Computer Vision > Analysis > Motion Analysis