HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad; Vibhav Vineet; Yogesh Singh Rawat

CVPR 2025

Abstract

Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce **HierarQ**, a task-aware hierarchical Q-Former based framework that processes frames sequentially, bypassing the need for frame sampling while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by a dedicated memory bank, which enables our proposed **Hierar**chical **Q**uerying transformer (HierarQ) to effectively capture short- and long-term context. Extensive evaluations on **10** video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance on most datasets, showing its robustness and efficiency for comprehensive video analysis. All code will be made available upon acceptance.
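To make the sequential, memory-backed processing described in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The class name `TwoStreamMemory`, the bank sizes, the FIFO policy for the short-term bank, and the merge-oldest policy for the long-term bank are all assumptions chosen for illustration. The language-guided modulation and the Q-Former itself are omitted, and frame features are plain lists of floats.

```python
from collections import deque


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


class TwoStreamMemory:
    """Hypothetical stand-in for the entity and scene memory banks.

    Assumptions (not from the paper): the entity bank is a fixed-size FIFO
    holding the most recent frame features, and the scene bank stays bounded
    by averaging its two oldest entries whenever it overflows.
    """

    def __init__(self, short_len=4, long_len=8):
        self.entity_bank = deque(maxlen=short_len)  # short-term context
        self.scene_bank = []                        # long-term context
        self.long_len = long_len

    def update(self, frame_feat):
        # Short-term stream: keep only the latest `short_len` frames.
        self.entity_bank.append(frame_feat)
        # Long-term stream: store a summary of the current short window.
        self.scene_bank.append(mean_vec(list(self.entity_bank)))
        if len(self.scene_bank) > self.long_len:
            # Compress instead of dropping: merge the two oldest summaries.
            merged = mean_vec(self.scene_bank[:2])
            self.scene_bank[:2] = [merged]


def process_video(frames, memory):
    """Feed every frame in order (no frame sampling) and return one
    pooled vector per stream."""
    for frame_feat in frames:
        memory.update(frame_feat)
    entity_ctx = mean_vec(list(memory.entity_bank))
    scene_ctx = mean_vec(memory.scene_bank)
    return entity_ctx, scene_ctx
```

Under these assumptions, both banks stay at a fixed size regardless of how many frames are processed, which is the property that lets a sequential design avoid both frame sampling and an ever-growing context.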

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Processing > Video Understanding
Natural Language Processing > Applications > Question Answering
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

video captioning, video understanding, multimodal large language model, video question answering, query transformer, frame sampling, memory bank, hierarchical query transformer, hierarchical query
