HOLO: Holistic Lightweight Optimization for Scene Understanding with Auto-Annotation and Multimodal Learning

Xiaoyun Hu; Xiaohan Yan; Nan Wang; Gang Wei; Zhicheng Wang

2026 WACV WACV 2026

HOLO: Holistic Lightweight Optimization for Scene Understanding with Auto-Annotation and Multimodal Learning

Abstract

Vision-language models (VLMs) have achieved remarkable success in various domains. However, their application to 3D scene understanding remains largely underexplored. Existing 3D VLMs predominantly focus on object-level tasks and often emphasize instance-centric representations within a scene, lacking holistic scene-level descriptions. In this work, we propose an automated annotation framework that leverages multi-view images to partition 3D scenes into localized point cloud sub-regions, which are then enriched with precise semantic information-all without any manual intervention. We processed ScanNet v2 and ScanNet++ to construct SceneCap, a large-scale dataset designed for scene-level description. To demonstrate the benefits of our framework for scene understanding , we introduce NIUMO-LLM, a lightweight yet high-performing model training on SceneCap. We further demonstrate that NIUMO-LLM achieves state-of-the-art (SOTA) performance on both scene description benchmarks and object-level tasks, requiring only 12 hours of training on a single NVIDIA A800 GPU. This design significantly reduces computational demands, lowering the barrier for MLLM-related research.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Machine Learning

🧭 Keyword Pioneer — 3d scene description

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio