2024 CVPR CVPR 2024

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

Abstract

We propose DiffSHEG a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation. While previous works focused on co-speech gesture or expression generation individually the joint generation of synchronized expressions and gestures remains barely explored. To address this our diffusion-based co-speech motion generation Transformer enables uni-directional information flow from expression to gesture facilitating improved matching of joint expression-gesture distributions. Furthermore we introduce an outpainting-based sampling strategy for arbitrary long sequence generation in diffusion models offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally a user study confirms the superiority of our method over prior approaches. By enabling the real-time generation of expressive and synchronized motions our method showcases its potential for various applications in the development of digital humans and embodied agents.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Speech & Audio
🧭 Keyword Pioneer — 3d expression
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio