SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Aleksey Bokhovkin; Quan Meng; Shubham Tulsiani; Angela Dai

2025 CVPR CVPR 2025

SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Abstract

We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes that enables controllable editing of generated scenes by adding, removing, changing the size of the semantic 3D proxy boxes that guides high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.

🧭 Keyword Pioneer — controllable editing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Aleksey Bokhovkin , Quan Meng , Shubham Tulsiani , Angela Dai

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Procedural Generation

Keywords

diffusion model 3d scene generation text-guided generation controllable editing factored diffusion

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025