Less Can Be More: Sound Source Localization With a Classification Model

Arda Senocak; Hyeonggon Ryu; Junsik Kim; In So Kweon

2022 WACV WACV 2022

Less Can Be More: Sound Source Localization With a Classification Model

Abstract

In this paper, we tackle sound localization as a natural outcome of the audio-visual video classification problem. Differently from the existing sound localization approaches, we do not use any explicit sub-modules or training mechanisms but use simple cross-modal attention on top of the representations learned by a classification loss. Our key contribution is to show that a simple audio-visual classification model has the ability to localize sound sources accurately and to give on par performance with state-of-the-art methods by proving that indeed "less is more". Furthermore, we propose potential applications that can be built based on our model. First, we introduce informative moment selection to enhance the localization task learning in the existing approaches compare to mid-frame usage. Then, we introduce a pseudo bounding box generation procedure that can significantly boost the performance of the existing methods in semi-supervised settings or be used for large-scale automatic annotation with minimal effort from any video dataset.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — pseudo bounding box

🐣 Hot Topic Early Bird — audio-visual learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Arda Senocak , Hyeonggon Ryu , Junsik Kim , In So Kweon

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Classification Machine Learning > Learning Types > Weakly Supervised Learning

Keywords

semi-supervised learning sound source localization audio-visual learning classification model cross-modal attention pseudo bounding box

Download PDF

Related papers

A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation 2022

Unsupervised Sounding Object Localization With Bottom-Up and Top-Down Attention 2022

Dynamic Iterative Refinement for Efficient 3D Hand Pose Estimation 2022

Deep Photo Scan: Semi-Supervised Learning for Dealing With the Real-World Degradation in Smartphone Photo Scanning 2022

Let There Be a Clock on the Beach: Reducing Object Hallucination in Image Captioning 2022