VideoSetDiff: Identifying and Reasoning Similarities and Differences in Similar Videos

Yue Qiu; Yanjun Sun; Takuma Yagi; Shusaku Egami; Natsuki Miyata; Ken Fukuda; Kensho Hara; Ryusuke Sagawa

ICCV 2025

Abstract

Recognizing subtle similarities and differences among sets of similar activities is central to many real-world applications, including skill acquisition, sports performance evaluation, and anomaly detection. Humans excel at such fine-grained analysis, which requires comprehensive video understanding and cross-video reasoning about action attributes, poses, positions, and emotional states. Yet existing video-based large language models typically address only single-video recognition, leaving their capacity for multi-video reasoning largely unexplored. We introduce VideoSetDiff, a curated dataset designed to test detail-oriented recognition across diverse activities, from subtle action attributes to viewpoint transitions. Our evaluation of current video-based LLMs on VideoSetDiff reveals critical shortcomings, particularly in fine-grained detail recognition and multi-video reasoning. To mitigate these issues, we propose an automatically generated dataset for instruction tuning alongside a novel multi-video recognition framework. While instruction tuning and specialized multi-video reasoning improve performance, all tested models remain far from satisfactory. These findings underscore the need for more robust video-based LLMs capable of handling complex multi-video tasks, enabling diverse real-world applications.

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Computer Vision > Processing > Video Understanding

Artificial Intelligence > Core AI > Large Language Models

Computer Vision > Analysis > Video Understanding

Keywords

video recognition, multimodal learning, video understanding, instruction tuning, fine-grained recognition, video large language model, multi-video reasoning, fine-grained video understanding, video instruction tuning, large language model

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025