SpatialLogic-Bench: A Diagnostic Benchmark for Task-Oriented Spatiotemporal Reasoning

Xiaoda Yang; Shenzhou Gao; Can Wang; Jiahe Zhang; Menglan Tang; Jingyang Xue; Sheng Liu; Peijian Zhang; Yao Mu; Xiangyu Yue

2026 AAAI AAAI 2026

SpatialLogic-Bench: A Diagnostic Benchmark for Task-Oriented Spatiotemporal Reasoning

Abstract

Abstract Vision-Language Models (VLMs) have made significant progress in static perception, but their ability to understand dynamic task-oriented reasoning remains unclear. Existing benchmarks mainly focus on static spatial relationships and lack systematic assessment of dynamic reasoning capabilities. To this end, we propose SpatialLogic-Bench, a novel benchmark designed to evaluate VLMs’ understanding of spatiotemporal logic and their ability to assess task progress. The benchmark assesses two critical capabilities: first, fine-grained visual discrimination to accurately perceive subtle physical changes between state frames; second, the logical capacity to connect these changes to task goals and judge whether they indicate progress. To mitigate temporal dependency biases, we introduce a dual-task paradigm, presenting image pairs in both chronological and reversed orders while keeping task descriptions consistent. We construct a multi-scale evaluation system by varying time intervals between frames: smaller intervals test the model's fine-grained perception, while larger intervals demand more sophisticated logical inference. Empirical evaluation reveals that most VLMs experience significant performance degradation on tasks presented in inverse chronological order, indicating an over-reliance on temporal cues rather than robust reasoning abilities. SpatialLogic-Bench clearly exposes critical limitations in current models and provides valuable guidance for improving dynamic spatial perception capabilities.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Xiaoda Yang , Shenzhou Gao , Can Wang , Jiahe Zhang , Menglan Tang , Jingyang Xue , Sheng Liu , Peijian Zhang , Yao Mu , Xiangyu Yue

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Self-Supervised Learning

Keywords

benchmark evaluation vision-language model visual discrimination spatiotemporal reasoning logical inference

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026