SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes

Yuji Wang; Haoran Xu; Yong Liu; Jiaze Li; Yansong Tang

CVPR 2025

Abstract

Reference Audio-Visual Segmentation (Ref-AVS) aims to provide pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). The task requires a model to continuously segment the objects that text and audio refer to throughout a video. Previous dual-modality methods fail because they lack a third modality, while the existing triple-modality method struggles with spatio-temporal consistency, so the segmented target shifts from frame to frame. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token that prompts and aligns SAM2 to achieve Ref-AVS in LAVS. Technically, our approach includes a multimodal fusion module that improves the multimodal understanding of SAM2, as well as token propagation and accumulation strategies that enhance spatio-temporal consistency without forgetting historical information. Extensive experiments demonstrate that SAM2-LOVE outperforms the SOTA by 8.5% in J&F on the Ref-AVS benchmark and showcase the simplicity and effectiveness of its components. Our code will be made available.
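
The token mechanism described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not specify the fusion module or the exact propagation and accumulation rules, so the sketch assumes single-head cross-attention with a residual connection for fusion, carrying the fused token into the next frame for propagation, and a running mean over all past tokens for accumulation. The names `fuse_token` and `run_video` are hypothetical, and the step where the resulting token would prompt SAM2 is left out.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse_token(token: Vector, features: List[Vector]) -> Vector:
    """Assumed fusion: the token is the query; the text, audio, and visual
    feature vectors of one frame serve as both keys and values."""
    scale = math.sqrt(len(token))
    weights = softmax([dot(token, f) / scale for f in features])
    attended = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(len(token))
    ]
    # Residual connection keeps what the token already carried.
    return [t + a for t, a in zip(token, attended)]


def run_video(init_token: Vector, frames: List[List[Vector]]) -> List[Vector]:
    """Return one prompt vector per frame.

    Propagation (assumed): the fused token of frame t initializes frame t+1.
    Accumulation (assumed): each prompt is the running mean of all tokens
    seen so far, so earlier frames are not forgotten.
    """
    prompts: List[Vector] = []
    token = init_token
    history_sum = [0.0] * len(init_token)
    for t, feats in enumerate(frames, start=1):
        token = fuse_token(token, feats)
        history_sum = [h + x for h, x in zip(history_sum, token)]
        prompts.append([h / t for h in history_sum])
    return prompts


if __name__ == "__main__":
    # Two frames, each with two toy multimodal feature vectors.
    frames = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
    for prompt in run_video([0.0, 0.0], frames):
        print(prompt)
```

In a real system each per-frame prompt vector would be passed to the segmentation model as a sparse prompt embedding; how SAM2-LOVE does this is described in the paper itself, not in this sketch.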

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > Semantic Segmentation
Computer Vision > Processing > Image Segmentation
Computer Vision > Processing > Video Processing
Computer Vision > Processing > Video Understanding
Computer Vision > Core AI > Multimodal Learning

Keywords

semantic segmentation; video segmentation; multimodal learning; multimodal fusion; reference segmentation; audio-visual segmentation; spatio-temporal consistency; token propagation; reference audio-visual segmentation
