Supervising Sound Localization by In-the-wild Egomotion

Anna Min; Ziyang Chen; Hang Zhao; Andrew Owens

CVPR 2025

Abstract

We present a method for learning binaural sound localization from ego-motion in videos. When the camera moves in a video, the direction of sound sources will change along with it. We train an audio model to predict sound directions that are consistent with visual estimates of camera motion, which we obtain using methods from multi-view geometry. This provides a weak but plentiful form of supervision that we combine with traditional binaural cues. To evaluate this idea, we propose a dataset of real-world audio-visual videos with ego-motion. We show that our model can successfully learn from this real-world data, and that it obtains strong performance on sound localization tasks.
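The consistency idea in the abstract can be illustrated with a small sketch. This is not the authors' loss: it assumes a single static sound source, a camera that only rotates about its vertical axis (yaw), and a sign convention in which turning the camera by +yaw shifts the source azimuth in the camera frame by -yaw. The names `wrap_angle` and `egomotion_consistency_loss` are hypothetical and chosen only for illustration.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def egomotion_consistency_loss(az_before, az_after, cam_yaw):
    """Mean absolute disagreement between audio and visual rotation estimates.

    az_before, az_after: source azimuths (radians, camera frame) predicted by
        an audio model at two moments of the same video.
    cam_yaw: camera yaw rotation between those moments (radians), as estimated
        from the video frames.

    Assumption (illustrative only): the source is static, so a camera yaw of
    +r should change the predicted azimuth by -r.
    """
    residuals = [
        wrap_angle((a1 - a0) + yaw)
        for a0, a1, yaw in zip(az_before, az_after, cam_yaw)
    ]
    return sum(abs(r) for r in residuals) / len(residuals)


# Predictions that agree with the camera motion give a loss near zero.
consistent = egomotion_consistency_loss([0.5], [0.3], [0.2])

# Predictions that ignore the camera motion are penalized by the yaw amount.
inconsistent = egomotion_consistency_loss([0.0], [0.0], [0.2])
```

Wrapping the residual keeps the penalty small when a source crosses the ±π boundary behind the listener, where a naive difference would report an error of nearly a full turn.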

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Weakly Supervised Learning
Interdisciplinary > Cognitive Science > Perception
Speech & Audio > Analysis > Speech Analysis
Deep Learning > Learning Types > Self-Supervised Learning
Computer Vision > Analysis > Motion Analysis

Keywords

self-supervised learning, audio-visual learning, sound localization, binaural sound localization, binaural audio, camera motion, camera motion estimation, ego-motion, multi-view geometry
