PEEKABOO: Interactive Video Generation via Masked-Diffusion

Yash Jain; Anshul Nasery; Vibhav Vineet; Harkirat Behl

CVPR 2024

Abstract

Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However, a significant limitation is their inability to offer interactive control to users, a feature that promises to open up unprecedented applications and creativity. In this work, we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present Peekaboo, a novel masked attention module that integrates seamlessly with current video generation models, offering control without the need for additional training or inference overhead. To facilitate future research, we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models, all while maintaining the same latency. Code and benchmark are available on the webpage.
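The abstract reports its headline result in mIoU (mean intersection over union). As a rough illustration of the metric only, the sketch below averages the IoU between a user-specified target box and a detected object box for each frame. The abstract does not describe the benchmark's evaluation protocol, so the box format `(x_min, y_min, x_max, y_max)`, the per-frame pairing, and the names `box_iou` and `mean_iou` are assumptions made for this example.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(targets: Sequence[Box], detections: Sequence[Box]) -> float:
    """Average IoU over frames; frame i pairs targets[i] with detections[i]."""
    if len(targets) != len(detections):
        raise ValueError("targets and detections must have the same length")
    if not targets:
        raise ValueError("at least one frame is required")
    total = sum(box_iou(t, d) for t, d in zip(targets, detections))
    return total / len(targets)


if __name__ == "__main__":
    # Two hypothetical frames: a perfect match, then a box shifted by half its width.
    target_boxes = [(0, 0, 10, 10), (10, 0, 20, 10)]
    detected_boxes = [(0, 0, 10, 10), (15, 0, 25, 10)]
    print(round(mean_iou(target_boxes, detected_boxes), 4))
```

Under this reading, a higher mIoU means the generated object stays closer to the region the user requested across the video.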

Topics

Deep Learning > Models > Diffusion Models
Computer Vision > Generation > Video Generation
Computer Vision > Processing > Video Processing
Artificial Intelligence > Core AI > Computer Vision

Keywords

video generation, diffusion model, interactive control, video editing, masked attention, spatio-temporal control, masked diffusion
