LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Fan-Yun Sun; Weiyu Liu; Siyi Gu; Dylan Lim; Goutam Bhat; Federico Tombari; Manling Li; Nick Haber; Jiajun Wu

CVPR 2025

Abstract

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still struggle with 3D reasoning tasks like arranging objects in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.
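
To make the phrase "differentiable optimization to ensure physical plausibility" concrete, the following is a minimal toy sketch and not the objective functions or code used in LayoutVLM. It assumes objects are circles on a rectangular floor, uses hand-derived gradients with plain gradient descent, and stands in for language-derived spatial relations with a hypothetical `near_pairs` list of target distances. All function and parameter names are illustrative assumptions.

```python
import math

def layout_loss_and_grad(pos, radii, room, near_pairs):
    """Return (loss, grad) for circular objects on a rectangular floor.

    pos: list of [x, y]; radii: list of floats; room: (width, depth).
    near_pairs: list of (i, j, target_distance), a hypothetical stand-in
    for spatial relations derived from a language instruction.
    """
    n = len(pos)
    loss = 0.0
    grad = [[0.0, 0.0] for _ in range(n)]

    # Overlap penalty: squared penetration depth for every pair of circles.
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = math.hypot(dx, dy) + 1e-9  # avoid division by zero
            pen = radii[i] + radii[j] - d
            if pen > 0:
                loss += pen * pen
                gx, gy = -2 * pen * dx / d, -2 * pen * dy / d
                grad[i][0] += gx
                grad[i][1] += gy
                grad[j][0] -= gx
                grad[j][1] -= gy

    # Boundary penalty: squared distance by which a circle crosses a wall.
    for i in range(n):
        for k in (0, 1):
            lo = radii[i] - pos[i][k]            # > 0 when past the low wall
            hi = pos[i][k] + radii[i] - room[k]  # > 0 when past the high wall
            if lo > 0:
                loss += lo * lo
                grad[i][k] -= 2 * lo
            if hi > 0:
                loss += hi * hi
                grad[i][k] += 2 * hi

    # Relation term: pull each listed pair toward its target distance.
    for i, j, target in near_pairs:
        dx = pos[i][0] - pos[j][0]
        dy = pos[i][1] - pos[j][1]
        d = math.hypot(dx, dy) + 1e-9
        err = d - target
        loss += err * err
        gx, gy = 2 * err * dx / d, 2 * err * dy / d
        grad[i][0] += gx
        grad[i][1] += gy
        grad[j][0] -= gx
        grad[j][1] -= gy

    return loss, grad


def optimize_layout(pos, radii, room, near_pairs, lr=0.05, steps=2000):
    """Plain gradient descent on object positions; returns a new list."""
    pos = [list(p) for p in pos]
    for _ in range(steps):
        _, grad = layout_loss_and_grad(pos, radii, room, near_pairs)
        for p, g in zip(pos, grad):
            p[0] -= lr * g[0]
            p[1] -= lr * g[1]
    return pos


# Toy scene: a table, a chair that starts inside the table, and a lamp
# that starts outside the room. The chair should end 1.5 units from the table.
radii = [1.0, 0.4, 0.3]
room = (4.0, 4.0)
start = [[2.0, 2.0], [2.2, 2.1], [-0.5, 3.0]]
near_pairs = [(0, 1, 1.5)]
final = optimize_layout(start, radii, room, near_pairs)
```

In this toy setup, the relation term plays the role of the semantic intent and the overlap and boundary terms play the role of physical plausibility. Because every term is differentiable in the object positions, a single gradient-based loop can trade them off; a real system would typically also optimize orientation and use an automatic differentiation library rather than hand-written gradients.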

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Planning
Machine Learning > Optimization & Theory > Optimization
Machine Learning > Application Areas > Domain Adaptation
Computer Vision > Core AI > Computer Vision

Keywords

scene understanding; differentiable optimization; vision-language model; spatial reasoning; 3D layout; scene layout representation
