SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Yue Li; Qi Ma; Runyi Yang; Huapeng Li; Mengjiao Ma; Bin Ren; Nikola Popovic; Nicu Sebe; Ender Konukoglu; Theo Gevers; Luc Van Gool; Martin R. Oswald; Danda Pani Paudel

2025 ICCV ICCV 2025

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Abstract

Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training, or together at inference. This highlights a clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge.To address these limitations we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. In order to power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising of 6868 scenes derived from 7 established datasets like ScanNet, Matterport3D, etc. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes.Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines. Our code, model, and datasets will be released to facilitate further research.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — 3d semantic reasoning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yue Li , Qi Ma , Runyi Yang , Huapeng Li , Mengjiao Ma , Bin Ren , Nikola Popovic , Nicu Sebe , Ender Konukoglu , Theo Gevers , Luc Van Gool , Martin R. Oswald , Danda Pani Paudel

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Self-Supervised Learning Computer Vision > Analysis > 3D Vision Computer Vision > Analysis > Scene Understanding Deep Learning > Techniques > Self-Supervised Learning

Keywords

scene understanding self-supervised learning 3d gaussian splatting indoor scene semantic reasoning vision-language pretraining 3d semantic reasoning

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025