MoSCo: Real-time and Efficient Text-to-Motion Synthesis via Delta Training
Abstract
Generating expressive, fine-grained human motion from text remains challenging, particularly when high fidelity must be achieved without excessive computational cost. Existing methods often rely on complex, multi-stage pipelines with slow inference and large memory footprints, which hinders real-time deployment. To address these limitations, we introduce MoSCo, a simple autoregressive text-to-motion framework. MoSCo discretizes motion into part-level token sequences and models temporal dynamics with a delta-based training strategy, i.e., predicting the motion difference from the previous time step. It then fuses these tokens with textual embeddings through our Part-Aware Coordinator (PAO) and generates the full sequence with a single, lightweight transformer decoder. MoSCo sets a new milestone in text-to-motion inference speed, achieving an AITS of just 0.002 s (vs. 0.03 s), over an order of magnitude faster than all prior methods, while maintaining a compact model footprint and delivering highly realistic motions (FID 0.085), making real-time, high-quality generation practical.
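To make the delta idea concrete, the following minimal sketch shows one way a motion sequence could be turned into frame-to-frame differences and recovered by accumulation. It is an illustration only: the abstract does not say whether deltas are taken over continuous pose features or over discrete tokens, so continuous per-frame feature vectors are assumed here, and the names `to_deltas` and `from_deltas` are hypothetical rather than part of MoSCo.

```python
from typing import List

Frame = List[float]  # assumed: one feature vector per time step


def to_deltas(frames: List[Frame]) -> List[Frame]:
    """Keep the first frame, then store each frame's difference from its predecessor."""
    if not frames:
        return []
    deltas = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for c, p in zip(cur, prev)])
    return deltas


def from_deltas(deltas: List[Frame]) -> List[Frame]:
    """Invert to_deltas by accumulating the differences onto the first frame."""
    frames: List[Frame] = []
    for d in deltas:
        if not frames:
            frames.append(list(d))
        else:
            frames.append([p + x for p, x in zip(frames[-1], d)])
    return frames


# Toy sequence: three frames with two features each (values are made up).
motion = [[0.0, 1.0], [0.5, 1.0], [1.5, 0.5]]
deltas = to_deltas(motion)
recovered = from_deltas(deltas)
```

Under this assumption, a model trained on `deltas` predicts how the pose changes at each step, and the absolute motion is obtained by summing the predicted changes over time.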