Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Jianzong Wu; Xiangtai Li; Chenyang Si; Shangchen Zhou; Jingkang Yang; Jiangning Zhang; Yining Li; Kai Chen; Yunhai Tong; Ziwei Liu; Chen Change Loy

CVPR 2024

Abstract

We introduce a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes a key limitation of traditional video inpainting methods: their dependence on manually labeled binary masks, which are often tedious and labor-intensive to produce. To support training and evaluation for this task, we present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5650 videos and 9091 inpainting results. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, which integrates Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We have made the datasets, code, and models publicly available at https://github.com/jianzongwu/Language-Driven-Video-Inpainting.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Models > Diffusion Models
Computer Vision > Processing > Video Processing
Natural Language Processing > Resources & Methods > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

diffusion model, multimodal large language model, video inpainting, language instruction, language-driven video editing
