Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Ziyao Huang; Fan Tang; Yong Zhang; Xiaodong Cun; Juan Cao; Jintao Li; Tong-Yee Lee

2024 CVPR CVPR 2024

Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Abstract

Despite the remarkable process of talking-head-based avatar-creating solutions directly generating anchor-style videos with full-body motions remains challenging. In this study we propose Make-Your-Anchor a novel system necessitating only a one-minute video clip of an individual for training subsequently enabling the automatic generation of anchor-style videos with precise torso and hand movements. Specifically we finetune a proposed structure-guided diffusion model on input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model effectively binding movements with specific appearances. To produce arbitrary long temporal video we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost and a simple yet effective batch-overlapped temporal denoising module is proposed to bypass the constraints on video length during inference. Finally a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality temporal coherence and identity preservation outperforming SOTA diffusion/non-diffusion methods. Project page: https://github.com/ICTMCG/Make-Your-Anchor.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🐣 Hot Topic Early Bird — temporal coherence

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ziyao Huang , Fan Tang , Yong Zhang , Xiaodong Cun , Juan Cao , Jintao Li , Tong-Yee Lee

Topics

Artificial Intelligence > Core AI > Multimodal Learning Deep Learning > Models > Diffusion Models Computer Vision > Generation > Video Generation Computer Vision > Domain-Specific > 3D Vision

Keywords

video generation human motion diffusion model 3d mesh temporal coherence identity preservation avatar generation

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024