BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving

Tao Tang; Dafeng Wei; Zhengyu Jia; Tian Gao; Changwei Cai; Chengkai Hou; Peng Jia; Kun Zhan; Haiyang Sun; Fan JingChen; Yixing Zhao; Xiaodan Liang; XianPeng Lang; Yang Wang

AAAI 2025

Abstract

The rapid development of the autonomous driving industry has led to a significant accumulation of autonomous driving data. Consequently, there is a growing demand for retrieving data to support specialized optimization. However, directly applying previous image retrieval methods faces several challenges, such as the lack of global feature representation and inadequate text retrieval ability for complex driving scenes. To address these issues, we first propose the BEV-TSR framework, which takes descriptive text as input to retrieve corresponding scenes in the Bird’s Eye View (BEV) space. Then, to facilitate complex scene retrieval with extensive text descriptions, we employ a large language model (LLM) to extract the semantic features of the text inputs and incorporate knowledge graph embeddings to enhance the semantic richness of the language embedding. To align the BEV features with the language embedding, we propose Shared Cross-modal Embedding, which uses a set of shared learnable embeddings to bridge the gap between the two modalities, and employ a caption generation task to further enhance the alignment. Furthermore, there is a lack of well-formed retrieval datasets for effective evaluation. To this end, we establish a multi-level retrieval dataset, nuScenes-Retrieval, based on the widely adopted nuScenes dataset. Experimental results on the multi-level nuScenes-Retrieval show that BEV-TSR achieves state-of-the-art performance, e.g., 85.78% and 87.66% top-1 accuracy on scene-to-text and text-to-scene retrieval, respectively.
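
As a rough illustration of how a top-1 retrieval accuracy like the one reported above can be computed, the sketch below ranks gallery embeddings by cosine similarity to each query and counts a hit when the best match is the paired item. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the abstract does not specify the similarity measure or protocol, cosine similarity is assumed here, and the function names and toy embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(queries, gallery):
    """Return sims[i][j] = cosine similarity between queries[i] and gallery[j]."""
    q = [l2_normalize(v) for v in queries]
    g = [l2_normalize(v) for v in gallery]
    return [[sum(a * b for a, b in zip(qi, gj)) for gj in g] for qi in q]


def top1_accuracy(queries, gallery):
    """Fraction of queries whose most similar gallery item has the same index.

    Assumes queries[i] and gallery[i] form the ground-truth pair.
    """
    sims = cosine_similarity_matrix(queries, gallery)
    hits = 0
    for i, row in enumerate(sims):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(sims)


# Hypothetical toy embeddings standing in for text and BEV scene features.
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scene_emb = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

text_to_scene = top1_accuracy(text_emb, scene_emb)  # text queries, scene gallery
scene_to_text = top1_accuracy(scene_emb, text_emb)  # scene queries, text gallery
```

Swapping the roles of query and gallery gives the two retrieval directions; in practice the embeddings would come from the trained text and BEV encoders rather than hand-written vectors.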

Topics

Artificial Intelligence > Core AI > Autonomous Vehicles
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Embedding Learning

Keywords

bird's eye view; text retrieval; knowledge graph embedding; cross-modal embedding; large language model; scene retrieval
