SKALD: Learning-Based Shot Assembly for Coherent Multi-Shot Video Creation

Chen-Yi Lu; Md Mehrab Tanjim; Ishita Dasgupta; Somdeb Sarkhel; Gang Wu; Saayan Mitra; Somali Chaterji

ICCV 2025

Abstract

We present SKALD, a multi-shot video assembly method that constructs coherent video sequences from candidate shots with minimal reliance on text. Central to our approach is the Learned Clip Assembly (LCA) score, a learning-based metric that measures temporal and semantic relationships between shots to quantify narrative coherence. We tackle the exponential complexity of combining multiple shots with an efficient beam-search algorithm guided by the LCA score. To train our model effectively with limited human annotations, we propose two tasks for the LCA encoder: Shot Coherence Learning, which uses contrastive learning to distinguish coherent and incoherent sequences, and Feature Regression, which converts these learned representations into a real-valued coherence score. We develop two variants: a base SKALD model that relies solely on visual coherence, and SKALD-text, which integrates auxiliary text information when available. Experiments on the VSPD and our curated MSV3C datasets show that SKALD achieves an improvement of up to 48.6% in IoU and a 43% speedup over state-of-the-art methods. A user study further validates our approach, with 45% of participants favoring SKALD-assembled videos, compared with 22% who preferred text-based assembly methods.
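The score-guided beam search mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `beam_search_assemble`, the per-position candidate pools in `slots`, and the stand-in scorer `toy_score` are all hypothetical names introduced here, and the sketch assumes that a coherence score can be computed for any partial sequence. A learned metric such as the LCA score would take the place of `toy_score`.

```python
from typing import Callable, List, Sequence, Tuple

Shot = float  # Assumption: each shot is reduced to a single toy feature value.


def beam_search_assemble(
    slots: Sequence[Sequence[Shot]],
    score_fn: Callable[[Tuple[Shot, ...]], float],
    beam_width: int = 3,
) -> Tuple[Tuple[Shot, ...], float]:
    """Pick one shot per slot, keeping only the best partial sequences.

    Exhaustive search would score every combination (exponential in the
    number of slots); beam search scores at most
    beam_width * len(candidates) sequences per slot.
    """
    beams: List[Tuple[Tuple[Shot, ...], float]] = [((), 0.0)]
    for candidates in slots:
        expanded = []
        for seq, _ in beams:
            for shot in candidates:
                new_seq = seq + (shot,)
                expanded.append((new_seq, score_fn(new_seq)))
        # Keep the highest-scoring partial sequences for the next slot.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


def toy_score(seq: Tuple[Shot, ...]) -> float:
    """Hypothetical coherence score: penalize jumps between adjacent shots."""
    return -sum(abs(a - b) for a, b in zip(seq, seq[1:]))


# Three slots, each with two candidate shots (toy feature values).
slots = [[0.0, 5.0], [1.0, 9.0], [2.0, 7.0]]
best_seq, best_score = beam_search_assemble(slots, toy_score, beam_width=2)
print(best_seq, best_score)
```

With a beam width of 2, the search here scores four extended sequences at each of the last two slots and never enumerates all eight full combinations at once. The saving over exhaustive search grows with the number of slots and candidates, at the cost of possibly missing the global optimum.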

Topics

Machine Learning > Learning Types > Contrastive Learning
Computer Vision > Generation > Video Generation
Computer Vision > Processing > Video Processing

Keywords

contrastive learning, temporal modeling, video generation, video processing, beam search, narrative coherence, multi-shot video, shot assembly, shot coherence
