PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Ahmed Abdelreheem; Filippo Aleotti; Jamie Watson; Zawar Qureshi; Abdelrahman Eldesokey; Peter Wonka; Gabriel Brostow; Sara Vicente; Guillermo Garcia-Hernando

ICCV 2025

Abstract

We introduce the task of Language-Guided Object Placement in Real 3D Scenes. Given a 3D reconstructed point-cloud scene, a 3D asset, and a natural-language instruction, the goal is to place the asset so that the instruction is satisfied. The task demands tackling four intertwined challenges: (a) one-to-many ambiguity in valid placements; (b) precise geometric and physical reasoning; (c) joint understanding across the scene, the asset, and language; and (d) robustness to noisy point clouds with no privileged metadata at test time. The first three challenges mirror the complexities of synthetic scene generation, while the metadata-free, noisy-scan scenario is inherited from language-guided 3D visual grounding. We inaugurate this task by introducing a benchmark and evaluation protocol, releasing a dataset for training multi-modal large language models (MLLMs), and establishing a first nontrivial baseline. We believe this challenging setup and benchmark will provide a foundation for evaluating and advancing MLLMs in 3D understanding.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > 3D Vision
Computer Vision > Domain-Specific > Autonomous Driving
Artificial Intelligence > Core AI > Language

Keywords

point cloud, 3D scene understanding, visual grounding, multi-modal large language model, 3D scene, language-guided object placement, 3D visual grounding, object placement
