Mix and Localize: Localizing Sound Sources in Mixtures

Xixi Hu; Ziyang Chen; Andrew Owens

2022 CVPR CVPR 2022

Mix and Localize: Localizing Sound Sources in Mixtures

Abstract

We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model to both group a sound mixture into individual sources, and to associate them with a visual signal. Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds each correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Speech & Audio

🐣 Hot Topic Early Bird — audio-visual learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Xixi Hu , Ziyang Chen , Andrew Owens

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Contrastive Learning Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Learning Types > Self-Supervised Learning Deep Learning > Learning Types > Multi-Modal Learning Speech & Audio > Processing > Speech Enhancement

Keywords

contrastive learning source separation sound source localization self-supervised learning audio-visual learning random walk audio-visual correspondence multimodal representation random walker

Download PDF

Related papers

UniCoRN: A Unified Conditional Image Repainting Network 2022

Why Discard if You Can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis 2022

All-in-One Image Restoration for Unknown Corruption 2022

Stability-Driven Contact Reconstruction From Monocular Color Images 2022

Forecasting Characteristic 3D Poses of Human Actions 2022