WACV 2025

OpenCity3D: What do Vision-Language Models Know About Urban Environments?

Abstract

The rise of 2D vision-language models (VLMs) has enabled new possibilities for language-driven 3D scene understanding tasks. Existing works focus on indoor scenes or autonomous driving scenarios and typically validate against a pre-defined set of semantic object classes. In this work, we analyze the capabilities of vision-language models for large-scale urban 3D scene understanding and propose new applications of VLMs that operate directly on aerial 3D reconstructions of cities. In particular, we address higher-level 3D scene understanding tasks such as estimating population density, building age, property prices, crime rate, and noise pollution. Our analysis reveals surprising zero-shot and few-shot performance of VLMs in urban environments.
