HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Trong-Thuan Nguyen; Pha Nguyen; Jackson Cothren; Alper Yilmaz; Khoa Luu

CVPR 2025

Abstract

Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, which limits their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), which promotes reasoning about multi-way interactions and higher-order relationships. Our approach integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. HyperGLM then enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset that features 1.9M frames from third-person, egocentric, and drone views and supports five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across these five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
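To make the contrast between pairwise edges and multi-way hyperedges concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation: the class names (`Triplet`, `SceneHyperGraph`), the helper `build_example`, the field layout, and the example entities and predicates are all hypothetical, chosen only to show how per-frame scene-graph triplets, predicate transitions, and a hyperedge grouping several triplets could sit in one structure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One pairwise edge of a per-frame entity scene graph (hypothetical layout)."""
    frame: int
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneHyperGraph:
    """Toy container; an assumption for illustration, not the paper's data structure."""
    triplets: list = field(default_factory=list)
    # Procedural graph stand-in: (predicate_before, predicate_after) -> count.
    transitions: dict = field(default_factory=dict)
    # Each hyperedge is a set of triplet indices, so it can span many entities.
    hyperedges: list = field(default_factory=list)

    def add_triplet(self, triplet: Triplet) -> int:
        self.triplets.append(triplet)
        return len(self.triplets) - 1

    def add_transition(self, before: str, after: str) -> None:
        key = (before, after)
        self.transitions[key] = self.transitions.get(key, 0) + 1

    def add_hyperedge(self, triplet_ids) -> int:
        self.hyperedges.append(frozenset(triplet_ids))
        return len(self.hyperedges) - 1

    def entities_in(self, edge_id: int) -> set:
        """Collect every entity touched by the triplets of one hyperedge."""
        entities = set()
        for index in self.hyperedges[edge_id]:
            triplet = self.triplets[index]
            entities.update((triplet.subject, triplet.obj))
        return entities


def build_example() -> SceneHyperGraph:
    """Build a two-frame toy scene with made-up entities and predicates."""
    graph = SceneHyperGraph()
    a = graph.add_triplet(Triplet(0, "person", "holding", "cup"))
    b = graph.add_triplet(Triplet(0, "cup", "above", "table"))
    c = graph.add_triplet(Triplet(1, "person", "drinking_from", "cup"))
    graph.add_transition("holding", "drinking_from")
    graph.add_hyperedge([a, b, c])  # one edge relating three entities at once
    return graph
```

In this toy example a single hyperedge relates the person, the cup, and the table together, which a set of independent pairwise edges does not express directly.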

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Domain Adaptation
Computer Vision > Analysis > Scene Understanding
Computer Vision > Processing > Video Understanding
Computer Vision > Analysis > Video Understanding
Deep Learning > Learning Types > Multi-Modal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

multimodal learning; video understanding; scene graph generation; relation reasoning; video scene graph; large language model; multimodal LLM
