4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

Wenxuan Zhu; Bing Li; Cheng Zheng; Jinjie Mai; Jun Chen; Letian Jiang; Abdullah Hamdi; Sara Rojas Martinez; Chia-Wen Lin; Mohamed Elhoseiny; Bernard Ghanem

ICCV 2025

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects. In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. Unlike existing 2D image/video-based benchmarks, 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks that require multi-view spatial-temporal understanding. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results of the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding than appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Self-Supervised Learning
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

multimodal large language model, temporal understanding, object captioning, 4d object understanding, object question answering
