Where am I? Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye; Honglin Lin; Leyan Ou; Dairong Chen; Zihao Wang; Qi Zhu; Conghui He; Weijia Li

2025 ICCV ICCV 2025

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Abstract

Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response.In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve corresponding satellite images or OSM database based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at https://github.com/yejy53/CVG-Text

❓ The Questioner

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Junyan Ye , Honglin Lin , Leyan Ou , Dairong Chen , Zihao Wang , Qi Zhu , Conghui He , Weijia Li

Topics

Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Analysis > Scene Understanding Computer Vision > Domain-Specific > Remote Sensing

Keywords

large multimodal model satellite imagery text-to-image retrieval natural language description cross-view geo-localization

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025