Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Shaojie Zhang; Jiahui Yang; Jianqin Yin; Zhenbo Luo; Jian Luan

2025 ICCV ICCV 2025

Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shaojie Zhang , Jiahui Yang , Jianqin Yin , Zhenbo Luo , Jian Luan

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Application Areas > Efficient Computing Computer Vision > Core AI > Multimodal Learning Deep Learning > Models > Transformers

Keywords

video understanding frame selection multimodal large language model visual comprehension video-language model adaptive frame selection

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025