MoSCo: Real-time and Efficient Text-to-Motion Synthesis via Delta Training

Zhiyuan Zhang; Lingqiao Liu

WACV 2026

Abstract

Generating expressive, fine-grained human motion from text remains a formidable challenge, particularly when aiming for high fidelity without incurring excessive computational cost. Existing methods often rely on complex, multi-stage pipelines with slow inference and large memory footprints, which hinders real-time deployment. To address these limitations, we introduce MoSCo, a simple autoregressive text-to-motion framework. MoSCo discretizes motion into part-level token sequences and models temporal dynamics with a delta-based training strategy, i.e., predicting the motion difference from the previous time step. It then fuses these tokens with textual embeddings through our Part-Aware Coordinator (PAO) and generates the full sequence with a single, lightweight transformer decoder. MoSCo sets a new milestone in text-to-motion inference speed, achieving an AITS of just 0.002 s (vs. 0.03 s), over an order of magnitude faster than all prior methods, while maintaining a compact model footprint and delivering highly realistic motions (FID 0.085), making real-time, high-quality generation practical.
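The idea of predicting the difference from the previous time step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes motion is given as a list of continuous per-frame feature vectors, whereas MoSCo operates on discretized part-level tokens, and the function names `to_deltas` and `from_deltas` are hypothetical. It only shows that a delta representation keeps the first frame as an anchor and is inverted by accumulating the differences.

```python
from typing import List

Frame = List[float]  # one pose feature vector per time step (assumed layout)


def to_deltas(frames: List[Frame]) -> List[Frame]:
    """Convert absolute frames to per-step differences.

    The first frame is kept unchanged so the sequence can be rebuilt.
    """
    if not frames:
        return []
    deltas = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for c, p in zip(cur, prev)])
    return deltas


def from_deltas(deltas: List[Frame]) -> List[Frame]:
    """Invert to_deltas by accumulating the differences frame by frame."""
    frames: List[Frame] = []
    for d in deltas:
        if not frames:
            frames.append(list(d))
        else:
            frames.append([p + x for p, x in zip(frames[-1], d)])
    return frames


# Toy two-dimensional "motion" with three frames (illustrative values only).
motion = [[0.0, 1.0], [0.5, 1.5], [1.5, 1.0]]
deltas = to_deltas(motion)
reconstructed = from_deltas(deltas)
```

In this toy case the targets after the first frame are small offsets rather than absolute poses, which is the quantity a delta-trained model would be asked to predict at each step.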
