GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

Quanwei Yang; Luying  Huang; Kaisiyuan Wang; Jiazhi Guan; Shengyi He; Fengguo Li; Hang Zhou; Lingyun Yu; Yingying Li; Haocheng Feng; Hongtao Xie

2025 ICCV ICCV 2025

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

Abstract

While increasing attention has been paid to co-speech gesture synthesis, most previous works neglect to investigate hand gestures with explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on specific hand gesture activation, which can deliver more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements including a set of semantically explicit hand gestures that are commonly used by live streamers. Then we present a hybrid-modality gesture generation system GestureHYDRA built upon a hybrid-modality diffusion transformer architecture with novelly designed motion-style injective transformer layers, which enables advanced gesture modeling ability and versatile gesture operations. To guarantee these specific hand gestures can be activated, we introduce a cascaded retrieval-augmented generation strategy built upon a semantic gesture repository annotated for each subject and an adaptive audio-gesture synchronization mechanism, which substantially improves semantic gesture activation and production efficiency. Quantitative and qualitative experiments demonstrate that our proposed approach achieves superior performance over all the counterparts.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision

🧭 Keyword Pioneer — human body movement

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Quanwei Yang , Luying Huang , Kaisiyuan Wang , Jiazhi Guan , Shengyi He , Fengguo Li , Hang Zhou , Lingyun Yu , Yingying Li , Haocheng Feng , Hongtao Xie

Topics

Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Analysis > Human Pose Estimation Computer Vision > Generation > Video Generation

Keywords

multimodal learning gesture synthesis retrieval-augmented generation diffusion transformer co-speech gesture human body movement

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025