Versatile Transition Generation with Image-to-Video Diffusion

Zuhao Yang; Jiahui Zhang; Yingchen Yu; Shijian Lu; Song Bai

ICCV 2025

Abstract

Guided by text, images, structure maps, or motion trajectories, diffusion models have achieved great success in automated, high-quality video generation. However, generating smooth and plausible transition videos from the first and last video frames together with descriptive text prompts remains largely underexplored. We present VTG, a Versatile Transition video Generation framework that generates smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization, which helps preserve object identity and handle abrupt content changes. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collect TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG consistently achieves superior transition performance across all four tasks.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Models > Diffusion Models
Computer Vision > Generation > Video Generation

Keywords

video generation, semantic coherence, diffusion model, motion representation, video interpolation, motion smoothness, image-to-video diffusion, transition generation, video transition
