Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

Qiji Zhou; Yifan Gong; Guangsheng Bao; Hongjie Qiu; Jinqiang Li; Xiangrong Zhu; Huajian Zhang; Yue Zhang

ACL 2025


Abstract

Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce **COVER** (**CO**unterfactual **V**id**E**o **R**easoning), a multidimensional multimodal benchmark that systematically evaluates MLLMs across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs’ logical reasoning abilities in dynamic environments. Our work is available at https://github.com/gongyifan-hash/COVER-Benchmark.
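To make the sub-question evaluation idea concrete, the following is a minimal sketch of how per-model sub-question accuracy and counterfactual accuracy could be computed and then correlated across models. It is not the COVER evaluation code: the `ItemResult` schema, the helper names, and the toy scores are all hypothetical placeholders chosen for illustration, and Pearson correlation is assumed here only because the abstract does not name a specific statistic.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ItemResult:
    """Scores for one benchmark item from one model (hypothetical schema)."""
    sub_correct: list[bool]        # correctness of each sub-question
    counterfactual_correct: bool   # correctness of the main counterfactual question


def sub_question_accuracy(items: list[ItemResult]) -> float:
    """Fraction of all sub-questions answered correctly."""
    flags = [flag for item in items for flag in item.sub_correct]
    return sum(flags) / len(flags)


def counterfactual_accuracy(items: list[ItemResult]) -> float:
    """Fraction of counterfactual questions answered correctly."""
    return sum(item.counterfactual_correct for item in items) / len(items)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy placeholder scores for three made-up models; NOT results from the paper.
results = {
    "model_a": [ItemResult([True, True], True), ItemResult([True, False], True)],
    "model_b": [ItemResult([True, False], True), ItemResult([False, False], False)],
    "model_c": [ItemResult([False, False], False), ItemResult([False, False], False)],
}

sub_acc = [sub_question_accuracy(items) for items in results.values()]
cf_acc = [counterfactual_accuracy(items) for items in results.values()]
r = pearson(sub_acc, cf_acc)
print(f"sub-question accuracy:   {sub_acc}")
print(f"counterfactual accuracy: {cf_acc}")
print(f"Pearson r across models: {r:.3f}")
```

With real data, each `ItemResult` would be filled from a model's graded answers. The correlation is computed across models, so it needs scores from several models to be meaningful.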
