Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

Huabin Liu; Filip Ilievski; Cees G. M. Snoek

CVPR 2025

Abstract

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrite VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
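
The four steps named in the abstract can be illustrated with a toy sketch. Only the step names come from the abstract; everything else here is an assumption made for illustration, not the authors' implementation. In particular, the `Node` structure, the conjunction rule for combining children, the `max_depth` limit, and the dictionary-backed `verify` and `decompose` functions (hypothetical stand-ins for a VLM entailment check and an LLM statement decomposer) are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One statement in a toy entailment tree (hypothetical structure)."""
    statement: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical interfaces: a verifier returns True/False, or None if uncertain.
Verifier = Callable[[str], Optional[bool]]
Decomposer = Callable[[str], List[str]]


def build_tree(hypothesis: str, decompose: Decomposer) -> Node:
    """Step 1 (construction): split a candidate answer into sub-statements."""
    return Node(hypothesis, [Node(s) for s in decompose(hypothesis)])


def reason(node: Node, verify: Verifier, decompose: Decomposer,
           depth: int = 0, max_depth: int = 2) -> bool:
    if not node.children:
        # Step 2 (verification): check the leaf statement against the video.
        verdict = verify(node.statement)
        if verdict is not None:
            return verdict
        if depth >= max_depth:
            return False  # assumption: unresolved statements count as not entailed
        # Step 4 (dynamic expansion): split an uncertain leaf further.
        node.children = [Node(s) for s in decompose(node.statement)]
        if not node.children:
            return False
    # Step 3 (tree reasoning): assumed rule that a parent holds only if all children hold.
    return all(reason(c, verify, decompose, depth + 1, max_depth)
               for c in node.children)


# Toy stand-ins for the models; a real system would query a VLM and an LLM.
FACTS = {
    "a person holds a cup": True,
    "an arm moves upward": True,
    "the cup is near the face": True,
    "a person stands at a stove": False,
    "a person stirs a pot": False,
}
DECOMP = {
    "the person is drinking": ["a person holds a cup",
                               "the cup is raised to the mouth"],
    "the cup is raised to the mouth": ["an arm moves upward",
                                       "the cup is near the face"],
    "the person is cooking": ["a person stands at a stove",
                              "a person stirs a pot"],
}


def verify(statement: str) -> Optional[bool]:
    return FACTS.get(statement)  # None when the toy verifier is uncertain


def decompose(statement: str) -> List[str]:
    return DECOMP.get(statement, [])


def entailed_answers(candidates: List[str]) -> List[str]:
    """Keep the candidate answers whose tree is judged entailed."""
    return [c for c in candidates
            if reason(build_tree(c, decompose), verify, decompose)]


if __name__ == "__main__":
    print(entailed_answers(["the person is cooking", "the person is drinking"]))
```

In this toy run the statement "the cup is raised to the mouth" cannot be verified directly, so it is expanded into two simpler statements that can be. That is the role the sketch assigns to dynamic expansion; the paper's actual expansion criterion may differ.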

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Understanding > Semantic Analysis
Natural Language Processing > Applications > Question Answering
Artificial Intelligence > Core AI > Reasoning
Computer Vision > Analysis > Video Understanding
Deep Learning > Models > Vision-Language Models

Keywords

visual-language model, video question answering, commonsense reasoning, video grounding, visual language model, entailment tree, entailment tree reasoning, video-grounded verification, dynamic tree expansion
