Instruction-based Image Manipulation by Watching How Things Move

Mingdeng Cao; Xuaner Zhang; Yinqiang Zheng; Zhihao Xia

2025 CVPR CVPR 2025

Instruction-based Image Manipulation by Watching How Things Move

Abstract

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics--such as non-rigid subject motion and complex camera movements--that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🧭 Keyword Pioneer — instruction-based image manipulation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Mingdeng Cao , Xuaner Zhang , Yinqiang Zheng , Zhihao Xia

Topics

Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Generation > Image Editing Deep Learning > Learning Types > Multimodal Learning

Keywords

multimodal large language model video frame content preservation dataset construction instruction-based image manipulation video frame sampling pose adjustment

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025