MultiDiff: Consistent Novel View Synthesis from a Single Image

Norman Müller; Katja Schwarz; Barbara Rössle; Lorenzo Porzi; Samuel Rota Bulò; Matthias Nießner; Peter Kontschieder

CVPR 2024

Abstract

We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation, which are prone to drift and error accumulation, MultiDiff jointly synthesizes a sequence of frames, yielding high-quality and multi-view consistent results, even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.
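The depth-based conditioning mentioned in the abstract can be illustrated with a generic forward warp. The following is a minimal sketch, not the authors' implementation: the function name `warp_reference`, the pinhole intrinsics `K`, the 4x4 reference-to-target pose `T_ref_to_tgt`, and the nearest-pixel splatting with a per-pixel nearest-depth rule are all assumptions made for illustration. It unprojects each reference pixel with its depth, moves it into the target camera, and reprojects it, returning the warped image together with a mask of the pixels that received a value.

```python
import numpy as np


def warp_reference(image, depth, K, T_ref_to_tgt):
    """Forward-warp `image` (H, W, C) into a target view using per-pixel depth.

    Hypothetical helper for illustration only. Returns (warped, mask), where
    mask marks target pixels that received a colour from the reference image.
    """
    H, W = depth.shape
    C = image.shape[2]

    # Homogeneous pixel grid, shape (3, H*W), ordered row by row.
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)

    # Unproject to 3D in the reference camera, then move to the target camera.
    pts_ref = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_tgt = T_ref_to_tgt[:3, :3] @ pts_ref + T_ref_to_tgt[:3, 3:4]

    # Project into the target image; ignore points behind the camera.
    z = pts_tgt[2]
    in_front = z > 1e-6
    proj = K @ pts_tgt
    safe_z = np.where(in_front, z, 1.0)
    u_t = np.round(proj[0] / safe_z).astype(int)
    v_t = np.round(proj[1] / safe_z).astype(int)
    valid = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Resolve collisions: for each target pixel keep the nearest source point.
    idx = np.flatnonzero(valid)
    lin = v_t[idx] * W + u_t[idx]
    order = np.lexsort((z[idx], lin))  # primary key: target pixel, then depth
    lin_sorted = lin[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = lin_sorted[1:] != lin_sorted[:-1]
    src = idx[order][first]
    dst = lin_sorted[first]

    warped = np.zeros((H * W, C), dtype=image.dtype)
    mask = np.zeros(H * W, dtype=bool)
    warped[dst] = image.reshape(H * W, C)[src]
    mask[dst] = True
    return warped.reshape(H, W, C), mask.reshape(H, W)
```

Pixels where the mask is false correspond to regions the reference image does not observe from the target viewpoint. In the abstract's framing, these are the areas a generative prior has to fill in.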

Topics

Deep Learning > Models > Diffusion Models
Computer Vision > Analysis > 3D Vision
Computer Vision > Generation > Image Generation
Computer Vision > Generation > 3D Generation
Computer Vision > Processing > 3D Vision

Keywords

3D reconstruction, depth estimation, diffusion model, novel view synthesis, video diffusion, 3D scene, multi-view consistency, monocular depth, video-diffusion model, pixel-accurate correspondence
