Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi; Miao Wang; Hao-Bin Duan; Shao-Hua Guan

2024 CVPR CVPR 2024

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Abstract

Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work we introduce Language Embedded 3D Gaussians a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians we propose a dedicated quantization scheme that drastically alleviates the memory requirement and a novel embedding procedure that achieves smoother yet high accuracy query countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language-embedded representations while maintaining real-time rendering frame rates on a single desktop GPU.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🧭 Keyword Pioneer — open-vocabulary scene

🐣 Hot Topic Early Bird — scene representation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jin-Chuan Shi , Miao Wang , Hao-Bin Duan , Shao-Hua Guan

Topics

Deep Learning > Architectures > Neural Networks Deep Learning > Models > Diffusion Models Computer Vision > Analysis > 3D Vision Computer Vision > Analysis > Scene Understanding Computer Vision > Analysis > Semantic Segmentation

Keywords

feature quantization scene representation 3d gaussian splatting language embedding novel view synthesis 3d gaussian semantic query open-vocabulary scene understanding open-vocabulary scene

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024