OpenCity3D: What do Vision-Language Models Know About Urban Environments?

Valentin Bieri; Marco Zamboni; Nicolas Samuel Blumer; Qingxuan Chen; Francis Engelmann

WACV 2025

Abstract

The rise of 2D vision-language models (VLMs) has enabled new possibilities for language-driven 3D scene understanding tasks. Existing works focus on indoor scenes or autonomous driving scenarios and typically validate against a pre-defined set of semantic object classes. In this work, we analyze the capabilities of vision-language models for large-scale urban 3D scene understanding and propose new applications of VLMs that operate directly on aerial 3D reconstructions of cities. In particular, we address higher-level 3D scene understanding tasks such as estimating population density, building age, property prices, crime rate, and noise pollution. Our analysis reveals surprising zero-shot and few-shot performance of VLMs in urban environments.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Domain-Specific > Remote Sensing
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning

Keywords

zero-shot learning, few-shot learning, 3D scene understanding, remote sensing, vision-language model, urban environment, aerial 3D reconstruction
