MotionCtrl: A Real-time Controllable Vision-Language-Motion Model

Bin Cao; Sipeng Zheng; Ye Wang; Lujie Xia; Qianshan Wei; Qin Jin; Jing Liu; Zongqing Lu

ICCV 2025

Abstract

Human motion generation involves synthesizing coherent human motion sequences conditioned on diverse multimodal inputs and holds significant potential for real-world applications. Despite recent advancements, existing vision-language-motion models (VLMMs) remain limited in achieving this goal. In this paper, we identify the lack of controllability as a critical bottleneck, where VLMMs struggle with diverse human commands, pose initialization, generation of long-term or unseen cases, and fine-grained control over individual body parts. To address these challenges, we introduce MotionCtrl, the first real-time, controllable VLMM with state-of-the-art performance. MotionCtrl achieves its controllability through training on HuMo100M, the largest human motion dataset to date, featuring over 5 million self-collected motions, 100 million multi-task instructional instances, and detailed part-level descriptions that address a long-standing gap in the field. Additionally, we propose a novel part-aware residual quantization technique for motion tokenization, enabling precise control over individual body parts during motion generation. Extensive experiments demonstrate MotionCtrl's superior performance across a wide range of motion benchmarks. Furthermore, we provide strategic design insights and a detailed time efficiency analysis to guide the development of practical motion generators.
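The abstract names part-aware residual quantization but gives no details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a pose vector is split into body-part slices, and that each part has its own stack of codebooks in which every stage quantizes the residual left by the previous stage. The function names, part names, and toy codebooks are all hypothetical.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Generic residual quantization: each stage encodes what the previous stages missed."""
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    indices = []
    for cb in codebooks:  # cb has shape (num_codes, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))  # nearest code to the current residual
        indices.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return indices, recon

def part_aware_quantize(pose, part_slices, part_codebooks):
    """Quantize each body part independently (assumed layout, not from the paper)."""
    tokens = {}
    recon = np.zeros_like(pose, dtype=float)
    for name, sl in part_slices.items():
        tokens[name], recon[sl] = residual_quantize(pose[sl], part_codebooks[name])
    return tokens, recon

# Toy example: a 4-dimensional "pose" with two hypothetical parts.
pose = np.array([1.0, 2.0, 3.0, 4.0])
part_slices = {"upper": slice(0, 2), "lower": slice(2, 4)}
part_codebooks = {
    "upper": [np.array([[0.0, 0.0], [1.0, 2.0]]), np.array([[0.0, 0.0], [5.0, 5.0]])],
    "lower": [np.array([[3.0, 3.0], [0.0, 0.0]]), np.array([[0.0, 1.0], [0.0, 0.0]])],
}
tokens, recon = part_aware_quantize(pose, part_slices, part_codebooks)
print(tokens, recon)
```

Because each part has its own token sequence, a generator could in principle resample the tokens of one part while keeping the others fixed, which is one plausible reading of the part-level control the abstract describes.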

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Models > Generative Models
Computer Vision > Generation > Video Generation
Robotics > Capabilities > Motion Planning

Keywords

real-time control; human motion synthesis; motion generation; vision-language model; human motion generation; part-aware quantization; motion tokenization; vision-language-motion model; part-aware residual quantization
