SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

Xiangyue Zhang; Jianfang Li; Jiaxu Zhang; Ziqiang Dang; Jianqiang Ren; Liefeng Bo; Zhigang Tu

2025 ICCV ICCV 2025

SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

Abstract

A good co-speech motion generation cannot be achieved without a careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to separately learn base motions and sparse motions, and then adaptively fuse them. In particular, coarse2fine cross-attention module and rhythmic consistency learning are explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Interdisciplinary

🧭 Keyword Pioneer — semantic emphasis learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Xiangyue Zhang , Jianfang Li , Jiaxu Zhang , Ziqiang Dang , Jianqiang Ren , Liefeng Bo , Zhigang Tu

Topics

Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Analysis > Action Recognition Computer Vision > Generation > Video Generation Interdisciplinary > Cognitive Science > Perception Deep Learning > Learning Types > Multi-Modal Learning

Keywords

motion generation cross-attention module gesture generation co-speech motion generation semantic emphasis learning rhythmic consistency learning speech synchronization semantic emphasis co-speech motion

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025