Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Zan Wang; Yixin Chen; Baoxiong Jia; Puhao Li; Jinlu Zhang; Jingze Zhang; Tengyu Liu; Yixin Zhu; Wei Liang; Siyuan Huang

CVPR 2024

Abstract

Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting an explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.
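The two-stage data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `sample`, `generate_motion`, `toy_adm`, and `toy_amdm` are hypothetical, the "denoisers" are hand-written toy functions rather than learned networks, and the affordance map layout (one scalar per scene point) and the exact conditioning passed to the second stage are assumptions made only for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Denoiser = Callable[[Vector, int, Dict], Vector]  # (noisy sample, step, condition) -> less noisy sample


def sample(denoiser: Denoiser, size: int, cond: Dict, num_steps: int = 10, seed: int = 0) -> Vector:
    """Generic reverse loop: start from Gaussian noise and refine it step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in reversed(range(num_steps)):  # t = num_steps-1, ..., 0
        x = denoiser(x, t, cond)
    return x


def generate_motion(text: str, scene: List[Tuple[float, float, float]],
                    adm: Denoiser, amdm: Denoiser,
                    num_frames: int, pose_dim: int):
    # Stage 1 (ADM role): predict an affordance map from the scene and the text.
    # Assumed layout: one scalar per scene point.
    affordance = sample(adm, len(scene), {"text": text, "scene": scene})
    # Stage 2 (AMDM role): generate motion conditioned on the affordance map.
    # Passing the text here as well is an assumption of this sketch.
    flat = sample(amdm, num_frames * pose_dim, {"text": text, "affordance": affordance})
    motion = [flat[i * pose_dim:(i + 1) * pose_dim] for i in range(num_frames)]
    return affordance, motion


# Toy stand-ins for the learned networks (purely illustrative).
def toy_adm(x: Vector, t: int, cond: Dict) -> Vector:
    target = [p[2] for p in cond["scene"]]  # fake "affordance": the point's height
    return [xi + (ti - xi) / (t + 1) for xi, ti in zip(x, target)]


def toy_amdm(x: Vector, t: int, cond: Dict) -> Vector:
    mean_aff = sum(cond["affordance"]) / len(cond["affordance"])
    return [xi + (mean_aff - xi) / (t + 1) for xi in x]


scene_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (0.0, 1.0, 1.0)]
affordance_map, motion = generate_motion("sit on the chair", scene_points,
                                         toy_adm, toy_amdm, num_frames=4, pose_dim=3)
```

The point of the sketch is the interface: the second stage never sees the raw scene in this toy version, only the intermediate affordance map, which mirrors the abstract's claim that affordance links scene grounding to conditional motion generation.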

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Planning
Deep Learning > Models > Diffusion Models

Keywords

diffusion model, human motion generation, scene affordance, 3D scene grounding
