AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction

Xuying Zhang; Yupeng Zhou; Kai Wang; Yikai Wang; Zhen Li; Shaohui Jiao; Daquan Zhou; Qibin Hou; Ming-Ming Cheng

2025 ICCV ICCV 2025

AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction

Abstract

Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🧭 Keyword Pioneer — next-view prediction

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Xuying Zhang , Yupeng Zhou , Kai Wang , Yikai Wang , Zhen Li , Shaohui Jiao , Daquan Zhou , Qibin Hou , Ming-Ming Cheng

Topics

Deep Learning > Models > Diffusion Models Computer Vision > Analysis > 3D Vision Computer Vision > Generation > Image Generation

Keywords

view consistency diffusion model novel view synthesis camera pose 3d asset next-view prediction

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025