OpenScene: 3D Scene Understanding With Open Vocabularies

Songyou Peng; Kyle Genova; Chiyu “Max” Jiang; Andrea Tagliasacchi; Marc Pollefeys; Thomas Funkhouser

CVPR 2023

Abstract

Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform state-of-the-art zero-shot 3D semantic segmentation, OpenScene first infers CLIP features for every 3D point and then classifies each point based on the similarity of its features to the embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
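The classification and heat-map steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-point features and text embeddings are already available as arrays in a shared embedding space, and it uses small synthetic vectors in place of real CLIP features. The function names (`classify_points`, `query_heatmap`), the label names, and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify_points(point_feats, text_feats):
    """Assign each point the label whose embedding is most cosine-similar.

    point_feats: (N, D) per-point features; text_feats: (C, D) label embeddings.
    Returns predicted label indices (N,) and the similarity matrix (N, C).
    """
    sims = l2_normalize(point_feats) @ l2_normalize(text_feats).T
    return sims.argmax(axis=1), sims

def query_heatmap(point_feats, query_feat):
    """Min-max normalized cosine similarity of every point to one query embedding."""
    sims = l2_normalize(point_feats) @ l2_normalize(query_feat)
    lo, hi = sims.min(), sims.max()
    return (sims - lo) / (hi - lo + 1e-8)

# Toy stand-ins (assumption): orthogonal unit vectors play the role of text
# embeddings, and point features are noisy copies of them.
rng = np.random.default_rng(0)
dim = 8
label_names = ["chair", "table", "floor"]  # hypothetical label set
text_feats = np.eye(len(label_names), dim)
true_labels = np.array([0, 1, 2, 0, 2])
point_feats = text_feats[true_labels] + 0.05 * rng.normal(size=(len(true_labels), dim))

pred, sims = classify_points(point_feats, text_feats)
heat = query_heatmap(point_feats, text_feats[0])  # query with the "chair" embedding
print([label_names[i] for i in pred])
print(np.round(heat, 2))
```

Because the label set enters only through `text_feats`, swapping in a different list of class names or a free-form query changes the output without retraining anything, which is the property the abstract calls open-vocabulary.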

Topics

Machine Learning > Learning Types > Zero-Shot Learning
Computer Vision > Analysis > 3D Vision
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Zero-Shot Learning
Deep Learning > Models > Vision-Language Models

Keywords

zero-shot learning, semantic segmentation, scene understanding, 3D scene understanding, vision-language model, open vocabulary, CLIP feature, 3D semantic segmentation
