Audio-Visual Instance Segmentation

Ruohao Guo; Xianghua Ying; Yaru Chen; Dantong Niu; Guangyao Li; Liao Qu; Yanyu Qi; Jinxing Zhou; Bowei Xing; Wenzhen Yue; Ji Shi; Qixun Wang; Peiliang Zhang; Buwen Liang

CVPR 2025

Abstract

In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment, and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. Additionally, we propose a strong baseline model for this task. Our model first localizes the sound sources within each frame and condenses object-specific contexts into concise tokens. It then builds long-range audio-visual dependencies between these tokens using window-based attention and tracks sounding objects across entire video sequences. Extensive experiments reveal that our method performs best on AVISeg, surpassing existing methods from related tasks. We further evaluate several multi-modal large models; unfortunately, they exhibit subpar performance on instance-level sound source localization and temporal perception. We expect that AVIS will inspire the community towards a more comprehensive multi-modal understanding. The dataset and code are available at https://github.com/ruohaoguo/avis.
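
To make the idea of window-based attention over condensed tokens more concrete, the following is a minimal NumPy sketch. It is an illustration only and not the authors' model: the function name `window_attention`, the one-token-per-frame layout, the non-overlapping windows, and the absence of learned projections are all assumptions made here for brevity.

```python
import numpy as np

def window_attention(tokens, window):
    """Self-attention restricted to non-overlapping temporal windows.

    tokens: array of shape (T, D), one token per frame (assumed layout).
    window: number of consecutive frames that may attend to each other.
    """
    T, D = tokens.shape
    out = np.empty_like(tokens, dtype=float)
    for start in range(0, T, window):
        chunk = tokens[start:start + window]         # (w, D) tokens in this window
        scores = chunk @ chunk.T / np.sqrt(D)        # scaled dot-product similarities
        scores -= scores.max(axis=1, keepdims=True)  # subtract row max for stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
        out[start:start + window] = weights @ chunk  # weighted mix within the window
    return out

# Hypothetical usage: 10 frames, 4-dimensional tokens, windows of 4 frames.
tokens = np.random.default_rng(0).normal(size=(10, 4))
mixed = window_attention(tokens, window=4)
```

Restricting attention to a window keeps the cost per window quadratic in the window size rather than in the full video length, which is one common motivation for windowed attention on long sequences.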

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Processing > Image Segmentation
Deep Learning > Learning Types > Multi-Modal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning
Computer Vision > Analysis > Instance Segmentation

Keywords

video segmentation, sound source localization, multi-modal learning, object tracking, sounding object detection, audio-visual instance segmentation, video instance tracking, audio-visual dependency
