Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts
Abstract
When searching for videos, users often rely on surrounding context, such as background elements or temporal details, in addition to the salient content. However, existing video models struggle with fine-grained spatio-temporal understanding, particularly of surrounding contexts, and no existing datasets effectively evaluate this capability. We introduce SS Datasets, three video retrieval datasets with detailed salient and surrounding captions. To capture rich, temporally localized contexts aligned with meaningful scene changes, we segment videos at scene transitions and generate captions for the resulting segments with a vision-language model. Our analysis of current models reveals that they have difficulty handling surrounding queries and temporally complex videos. To address these difficulties, we propose simple yet effective baselines that improve retrieval across diverse query types, enabling more robust generalization to real-world scenarios.