AutoScape: Geometry-Consistent Long-Horizon Scene Generation

Jiacheng Chen; Ziyu Jiang; Mingfu Liang; Bingbing Zhuang; Jong-Chyi Su; Sparsh Garg; Ying Wu; Manmohan Chandraker

2025 ICCV ICCV 2025

AutoScape: Geometry-Consistent Long-Horizon Scene Generation

Abstract

This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jiacheng Chen , Ziyu Jiang , Mingfu Liang , Bingbing Zhuang , Jong-Chyi Su , Sparsh Garg , Ying Wu , Manmohan Chandraker

Topics

Deep Learning > Models > Diffusion Models Computer Vision > Generation > Video Generation Computer Vision > Domain-Specific > Autonomous Driving

Keywords

video generation autonomous driving diffusion model scene generation geometric consistency driving scene keyframe generation geometry consistency rgb-d diffusion

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025