Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

Bichen Wu; Ching-Yao Chuang; Xiaoyan Wang; Yichen Jia; Kapil Krishnakumar; Tong Xiao; Feng Liang; Licheng Yu; Peter Vajda

CVPR 2024

Abstract

In this paper, we introduce Fairy, a minimalist yet robust adaptation of image-editing diffusion models, enhancing them for video editing applications. Our approach centers on the concept of anchor-based cross-frame attention, a mechanism that implicitly propagates diffusion features across frames, ensuring superior temporal coherence and high-fidelity synthesis. Fairy not only addresses limitations of previous models, including memory and processing speed, but also improves temporal consistency through a unique data augmentation strategy. This strategy renders the model equivariant to affine transformations in both source and target images. Remarkably efficient, Fairy generates 120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, outpacing prior works by at least 44x. A comprehensive user study involving 1000 generated samples confirms that our approach delivers superior quality, decisively outperforming established methods.
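The abstract names anchor-based cross-frame attention but does not spell out its details, so the following is only a minimal NumPy sketch of the general idea under stated assumptions: each frame's tokens act as queries, while keys and values come from a small set of anchor frames. The function name `anchor_cross_frame_attention`, the evenly spaced anchor selection, and the use of raw features without learned projections are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def anchor_cross_frame_attention(frame_q, anchor_k, anchor_v):
    """Attend from one frame's tokens to the tokens of all anchor frames.

    frame_q:  (tokens, dim) queries from the frame being edited
    anchor_k: (num_anchors, tokens, dim) keys from the anchor frames
    anchor_v: (num_anchors, tokens, dim) values from the anchor frames
    Hypothetical helper: learned Q/K/V projections are omitted for brevity.
    """
    dim = frame_q.shape[-1]
    k = anchor_k.reshape(-1, dim)          # pool tokens of all anchors
    v = anchor_v.reshape(-1, dim)
    scores = frame_q @ k.T / np.sqrt(dim)  # scaled dot-product scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v                     # (tokens, dim)

rng = np.random.default_rng(0)
num_frames, num_anchors, tokens, dim = 6, 2, 4, 8
feats = rng.standard_normal((num_frames, tokens, dim))

# Assumption: anchors are picked at evenly spaced frame indices.
anchor_idx = np.linspace(0, num_frames - 1, num_anchors).astype(int)
anchors = feats[anchor_idx]

# Every frame depends only on the shared anchors, so this loop could
# be run in parallel across frames.
out = np.stack(
    [anchor_cross_frame_attention(f, anchors, anchors) for f in feats]
)
print(out.shape)
```

Because each frame reads from the same anchor features and from no other frame, the per-frame computations in this sketch are independent of one another, which is one plausible reading of why such a scheme parallelizes well.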

Authors

Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, Peter Vajda

Topics

Deep Learning > Models > Diffusion Models
Computer Vision > Generation > Video Generation
Computer Vision > Processing > Video Processing
Deep Learning > Learning Types > Self-Supervised Learning

Keywords

video generation, video synthesis, diffusion model, video editing, temporal coherence, anchor-based attention, cross-frame attention, video-to-video synthesis
