GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

Yushuo Zheng; Jiangyong Ying; Huiyu Duan; Chunyi Li; Zicheng Zhang; Jing Liu; Xiaohong Liu; Guangtao Zhai

2026 AAAI AAAI 2026

GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models

Abstract

Abstract Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks, however their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive Benchmark designed to explore and evaluate the capabilities of LMMs in cross-view Geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with corresponding 755,976 question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark demonstrate that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve the cross-view geo-sense abilities.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yushuo Zheng , Jiangyong Ying , Huiyu Duan , Chunyi Li , Zicheng Zhang , Jing Liu , Xiaohong Liu , Guangtao Zhai

Topics

Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Analysis > Scene Understanding

Keywords

pose estimation instruction tuning large multimodal model satellite imagery cross-view geo-localization

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026