DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

Junming Chen; Yunfei Liu; Jianan Wang; Ailing Zeng; Yu Li; Qifeng Chen

CVPR 2024

Abstract

We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation Transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of the joint expression-gesture distribution. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality, synchronized expressions and gestures driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of our method over prior approaches. By enabling real-time generation of expressive and synchronized motions, our method shows potential for various applications in the development of digital humans and embodied agents.
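The abstract does not spell out how the outpainting-based sampling works, so the following is only a minimal sketch of the general idea under stated assumptions: a long sequence is generated clip by clip, and each new clip is constrained so that its first few frames match the last frames of the sequence generated so far. Everything here is hypothetical and not taken from the paper: the function names, the overlap length, the linear noise schedule, and `toy_denoise_step`, which merely stands in for one reverse step of a trained diffusion model.

```python
import random

def toy_denoise_step(x, t, rng):
    """Hypothetical stand-in for one reverse-diffusion step of a trained model."""
    # Shrink each value toward zero; add a little noise except at the final step.
    return [[0.9 * v + (0.05 * rng.gauss(0.0, 1.0) if t > 0 else 0.0) for v in frame]
            for frame in x]

def sample_clip(clip_len, dim, steps, rng, known=None):
    """Sample one clip; if `known` is given, its frames are pinned at the clip start."""
    x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(clip_len)]
    for t in reversed(range(steps)):
        x = toy_denoise_step(x, t, rng)
        if known is not None:
            # Outpainting constraint: overwrite the overlap region with a noised
            # copy of the known frames. The noise scale (an assumed linear
            # schedule) reaches zero at t == 0, so the overlap ends up exact.
            scale = t / steps
            for i, frame in enumerate(known):
                x[i] = [v + scale * rng.gauss(0.0, 1.0) for v in frame]
    return x

def sample_long_sequence(total_len, clip_len, overlap, dim, steps=20, seed=0):
    """Chain overlapping clips until `total_len` frames have been produced."""
    if not 0 < overlap < clip_len:
        raise ValueError("overlap must satisfy 0 < overlap < clip_len")
    rng = random.Random(seed)
    seq = sample_clip(clip_len, dim, steps, rng)
    while len(seq) < total_len:
        known = seq[-overlap:]           # frames the next clip must continue from
        clip = sample_clip(clip_len, dim, steps, rng, known=known)
        seq.extend(clip[overlap:])       # keep only the newly generated frames
    return seq[:total_len]
```

For example, `sample_long_sequence(100, 34, 4, 3)` returns 100 frames of dimension 3, produced from four clips of 34 frames that each overlap the previous one by 4 frames. Because each clip depends only on frames that already exist, clips can be produced one after another as the audio arrives, which is the property that makes this style of sampling attractive for streaming use.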

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Models > Diffusion Models
Computer Vision > Generation > Video Generation
Artificial Intelligence > Core AI > Robotics
Computer Vision > Analysis > Motion Analysis
Deep Learning > Learning Types > Multi-Modal Learning
Computer Vision > Generation > 3D Generation
Speech & Audio > Synthesis > Speech Synthesis

Keywords

motion synthesis, motion generation, diffusion model, gesture generation, co-speech gesture, speech-driven generation, expression generation, 3d expression, 3d gesture generation

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024