VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Chi-Pin Huang; Yen-Siang Wu; Hung-Kai Chung; Kai-Po Chang; Fu-En Yang; Yu-Chiang Frank Wang

CVPR 2025

Abstract

Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose VideoMage, a unified framework for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.

Topics

Machine Learning > Application Areas > Knowledge Distillation
Deep Learning > Models > Diffusion Models
Computer Vision > Generation > Video Generation
Machine Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Zero-Shot Learning

Keywords

diffusion model, text-to-video generation, subject customization, text-to-video diffusion, video customization, LoRA adaptation, motion disentanglement, motion customization, subject personalization, multi-subject video
