Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts

Jaehun Bang; Moon Ye-Bin; Tae-Hyun Oh; Kyungdon Joo

WACV 2026

Abstract

When searching for videos, users often rely on surrounding context, such as background elements or temporal details, beyond the salient content. However, existing video models struggle with fine-grained spatio-temporal understanding, particularly of surrounding contexts, and no existing datasets effectively evaluate their performance on it. We introduce SS Datasets, three video retrieval datasets with detailed salient and surrounding captions. To capture rich, temporally localized contexts aligned with meaningful scene changes, we segment videos by scene transitions and generate captions with a vision-language model. Our analysis of current models reveals difficulties in handling surrounding queries and temporally complex videos. To address this, we propose simple yet effective baselines that improve retrieval across diverse query types, enabling more robust generalization to real-world scenarios.
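The segmentation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which scene-transition detector or vision-language model is used, so everything below is an assumption for illustration: the function names, the per-frame feature vectors (standing in for something like color histograms), the L1 distance, and the threshold rule are all hypothetical. The captioning step is omitted.

```python
from typing import List, Sequence, Tuple


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def segment_by_scene_transitions(
    frame_features: Sequence[Sequence[float]],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Split frame indices into half-open [start, end) segments.

    Hypothetical rule (not taken from the paper): a new segment starts at
    frame i whenever the L1 distance between the features of frame i-1 and
    frame i exceeds `threshold`.
    """
    if not frame_features:
        return []
    segments: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(frame_features)):
        if l1_distance(frame_features[i - 1], frame_features[i]) > threshold:
            segments.append((start, i))  # close the current segment
            start = i                    # frame i opens the next one
    segments.append((start, len(frame_features)))
    return segments


# Toy 2-bin "histograms" for six frames covering three distinct scenes.
frames = [
    [1.0, 0.0], [0.9, 0.1],              # scene A
    [0.1, 0.9], [0.0, 1.0], [0.1, 0.9],  # scene B
    [0.5, 0.5],                          # scene C
]
segments = segment_by_scene_transitions(frames, threshold=0.5)
print(segments)  # [(0, 2), (2, 5), (5, 6)]
```

Under these assumptions, each resulting segment would then be passed to a captioning model to produce its salient and surrounding captions.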
