VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Shehan Munasinghe; Hanan Gani; Wenqi Zhu; Jiale Cao; Eric Xing; Fahad Shahbaz Khan; Salman Khan

2025 CVPR CVPR 2025

VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Abstract

Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, a LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semiautomatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — spatio-temporal decoding

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shehan Munasinghe , Hanan Gani , Wenqi Zhu , Jiale Cao , Eric Xing , Fahad Shahbaz Khan , Salman Khan

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Natural Language Processing > Resources & Methods > Large Language Models Computer Vision > Core AI > Multimodal Learning Computer Vision > Analysis > Video Understanding Natural Language Processing > Applications > Visual Question Answering

Keywords

video segmentation multimodal learning video understanding visual grounding large multimodal model vision-language model referring video segmentation spatio-temporal decoding pixel-level grounding

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025