Contrastive Sequential-Diffusion Learning: Non-Linear and Multi-Scene Instructional Video Synthesis

Vasco Ramos; Yonatan Bitton; Michal Yarom; Idan Szpektor; Joao Magalhaes

WACV 2025

Abstract

Generated video scenes for action-centric sequence descriptions, such as recipe instructions and do-it-yourself projects, often include non-linear patterns, where the next video may need to be visually consistent not with the immediately preceding video but with earlier ones. Current multi-scene video synthesis approaches fail to meet these consistency requirements. To address this, we propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. The result is a multi-scene video that is grounded in the scene descriptions and coherent w.r.t. the scenes that require visual consistency. Experiments with action-centered data from the real world demonstrate the practicality and improved consistency of our model compared to previous work. Code and examples are available at https://github.com/novasearch/CoSeD
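
The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each scene and each step description is already represented as an embedding vector, and it uses plain cosine similarity as a stand-in for the learned contrastive scoring. The function names `cosine` and `select_conditioning_scene` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_conditioning_scene(step_embedding, previous_scene_embeddings):
    """Return the index of the earlier scene most similar to the next step.

    Assumption: similarity in a shared embedding space stands in for the
    learned contrastive score. Returns None when no earlier scene exists.
    """
    if not previous_scene_embeddings:
        return None
    scores = [cosine(step_embedding, e) for e in previous_scene_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (hypothetical): three earlier scenes and the next step.
previous = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
next_step = [0.9, 0.1, 0.0]

# The next step is closest to scene 0, not to the most recent scene (index 2),
# which is the non-linear dependency the abstract refers to.
chosen = select_conditioning_scene(next_step, previous)
print(chosen)
```

In a full pipeline the chosen scene would then condition the denoising of the next scene; that part depends on the diffusion model and is left out here.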

Authors

Vasco Ramos, Yonatan Bitton, Michal Yarom, Idan Szpektor, Joao Magalhaes

Topics

Deep Learning > Models > Diffusion Models
Computer Vision > Generation > Video Generation
Deep Learning > Learning Types > Contrastive Learning

Keywords

contrastive learning, video generation, video synthesis, diffusion model, denoising process, video diffusion, instructional video, multi-scene video, instructional video synthesis
