Precise Action-to-Video Generation Through Visual Action Prompts

Yuang Wang; Chao Wen; Haoyu Guo; Sida Peng; Minghan Qin; Hujun Bao; Xiaowei Zhou; Ruizhen Hu

ICCV 2025

Abstract

We present visual action prompts, a unified action representation that supports action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality tradeoff: existing methods that use text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts, domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources, human-object interactions (HOI) and dexterous robotic manipulation, enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of the proposed approach.
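
As a rough illustration of what "rendering" an action into a visual skeleton could look like, the following Python sketch rasterizes per-frame 2D joint positions into binary skeleton images. It is not the authors' pipeline: the function names, the joint and bone layout, and the choice of Bresenham line drawing are all assumptions made for illustration, and it presumes that 2D joint coordinates are already available (for example, from keypoints projected into the camera view).

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates; hypothetical convention
Bone = Tuple[int, int]   # indices of the two joints a bone connects


def draw_line(canvas: List[List[int]], p0: Point, p1: Point, value: int = 1) -> None:
    """Rasterize one bone with Bresenham's algorithm, skipping off-canvas pixels."""
    height, width = len(canvas), len(canvas[0])
    x0, y0 = p0
    x1, y1 = p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        if 0 <= x0 < width and 0 <= y0 < height:
            canvas[y0][x0] = value
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def render_skeleton(
    joints: Sequence[Point], bones: Sequence[Bone], height: int, width: int
) -> List[List[int]]:
    """Render one frame: a height x width binary image with every bone drawn."""
    canvas = [[0] * width for _ in range(height)]
    for a, b in bones:
        draw_line(canvas, joints[a], joints[b])
    return canvas


def render_action_sequence(
    joint_frames: Sequence[Sequence[Point]],
    bones: Sequence[Bone],
    height: int,
    width: int,
) -> List[List[List[int]]]:
    """Render a whole action as a list of skeleton frames (a skeleton 'video')."""
    return [render_skeleton(j, bones, height, width) for j in joint_frames]


if __name__ == "__main__":
    # Toy three-joint chain on a 4x4 canvas (illustrative values only).
    frames = render_action_sequence(
        [[(0, 0), (3, 0), (3, 3)]], [(0, 1), (1, 2)], height=4, width=4
    )
    for row in frames[0]:
        print("".join("#" if v else "." for v in row))
```

Because the output is just an image sequence, the same rendering could in principle be applied to a human hand or a robot gripper, which is the property the abstract describes as domain-agnostic. How such frames are fed to a video model is outside the scope of this sketch.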

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Artificial Intelligence > Core AI > Procedural Generation

Keywords

cross-domain adaptation; skeleton-based action; action-to-video generation; visual action prompt
