Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Shengqiong Wu; Hao Fei; Jingkang Yang; Xiangtai Li; Juncheng Li; Hanwang Zhang; Tat-Seng Chua

2025 CVPR CVPR 2025

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Abstract

The latest emerged 4D Panoptic Scene Graph (4D-PSG) provides an advanced-ever representation for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research can largely suffer from data scarcity issues severely, as well as the resulting out-of-vocabulary problems; also, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we strikingly outperform baseline models by a large margin, highlighting the effectiveness of our method.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Machine Learning

🧭 Keyword Pioneer — spatial-temporal scene

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shengqiong Wu , Hao Fei , Jingkang Yang , Xiangtai Li , Juncheng Li , Hanwang Zhang , Tat-Seng Chua

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Machine Learning > Application Areas > Domain Adaptation Computer Vision > Analysis > Scene Understanding Machine Learning > Learning Types > Transfer Learning Artificial Intelligence > Core AI > Knowledge Graph

Keywords

transfer learning scene graph panoptic segmentation large language model 4d scene understanding panoptic scene graph spatial-temporal scene visual scene annotation

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025