IDOL: Instant Photorealistic 3D Human Creation from a Single Image

Yiyu Zhuang; Jiaxi Lv; Hao Wen; Qing Shuai; Ailing Zeng; Hao Zhu; Shifeng Chen; Yujiu Yang; Xun Cao; Wei Liu

CVPR 2025

Abstract

Creating a high-fidelity, animatable 3D full-body avatar from a single image is challenging because of the diversity of human appearance and pose and the limited availability of high-quality training data. To achieve fast, high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman GEnerated training dataset, HuGe100K, consisting of 100K diverse, photorealistic human images, each with corresponding 24-view frames in a static or dynamic pose, generated via a pose-controllable image-to-video model. Next, leveraging the diversity of views, poses, and appearances in HuGe100K, we develop a scalable feed-forward transformer model that predicts, from a given human image, a 3D human Gaussian representation in a uniform space. The model is trained to disentangle human pose, shape, clothing geometry, and texture, so the estimated Gaussians can be animated robustly without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model generalizes well and reconstructs photorealistic humans in under 1 second on a single GPU. It also supports various applications, including animation and shape and texture editing.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Transformers
Deep Learning > Learning Types > Representation Learning
Computer Vision > Generation > 3D Generation

Keywords

3D reconstruction, pose estimation, 3D human reconstruction, Gaussian splatting, feed-forward transformer, Gaussian representation, human avatar, image-to-video model
