Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos

Yue Ma; Yingqing He; Xiaodong Cun; Xintao Wang; Siran Chen; Xiu Li; Qifeng Chen

AAAI 2024

Abstract

Generating text-editable and pose-controllable character videos is in high demand for creating digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset of paired video-pose captions and by the lack of generative prior models for videos. In this work, we design a novel two-stage training scheme that uses easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used, for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by these new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition ability of the pre-trained T2I model. The code and models are available at https://follow-your-pose.github.io/.
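The zero-initialized convolutional encoder mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single-channel feature map, a naive pure-Python convolution, and residual (additive) injection of the pose features, and the names conv2d_same, zero_kernel, and inject_pose are hypothetical. It only shows why zero initialization is a gentle starting point: before any training step, the pose branch contributes nothing, so the pre-trained features pass through unchanged.

```python
def conv2d_same(image, kernel, bias=0.0):
    """Naive single-channel 2D cross-correlation with zero padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[bias for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - ph, x + j - pw
                    # Positions outside the image count as zero padding.
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def zero_kernel(size=3):
    """Hypothetical helper: an all-zero kernel, i.e. the initial weights."""
    return [[0.0] * size for _ in range(size)]


def inject_pose(features, pose_map, kernel, bias=0.0):
    """Assumed residual injection: features + conv(pose_map)."""
    pose_feat = conv2d_same(pose_map, kernel, bias)
    return [[f + p for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(features, pose_feat)]


# Toy stand-ins for a pre-trained feature map and a keypoint (pose) map.
features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [3.0, 1.0, 0.25]]
pose_map = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

# At initialization the pose branch outputs zeros, so nothing changes.
at_init = inject_pose(features, pose_map, zero_kernel())

# Once the kernel has moved away from zero (here a hand-picked identity
# kernel standing in for learned weights), the pose signal shows up.
identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
after_update = inject_pose(features, pose_map, identity_kernel)
```

In this toy setup, at_init equals features exactly, while after_update differs from features wherever pose_map is nonzero. A real encoder would operate on multi-channel tensors and learn its weights by gradient descent.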

Topics

Deep Learning > Models > Diffusion Models
Deep Learning > Techniques > Pretraining
Computer Vision > Generation > Video Generation
Artificial Intelligence > Core AI > Computer Vision
Deep Learning > Learning Types > Fine-Tuning

Keywords

pose estimation; video generation; text-to-image generation; diffusion model; text-to-image model; text-to-video generation; temporal attention; character animation; pose-controllable video
