NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

Peiran Xu; Xicheng Gong; Yadong Mu

ICCV 2025

Abstract

In this work, we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of their actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn general knowledge about the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in a traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking that action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style search strategy that effectively explores the regions more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
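
To make the idea of combining history-based and future-based scores concrete, the following is a minimal sketch of a best-first exploration loop on a toy graph. It is not the authors' implementation: the function name `foresighted_search`, the linear mixing rule, the `weight` parameter, and all scores and node names are hypothetical choices made only for illustration. The abstract does not specify how the two score sets are combined.

```python
import heapq


def foresighted_search(graph, start, goal, history_score, future_score, weight=0.5):
    """Explore a graph best-first, ranking frontier nodes by a combined score.

    graph: dict mapping node -> list of neighbouring nodes (toy map, hypothetical)
    history_score, future_score: dict mapping node -> float, higher is better
    weight: hypothetical mixing coefficient for the future-based term
    Returns the list of nodes in the order they were expanded.
    """
    def priority(node):
        combined = (1.0 - weight) * history_score[node] + weight * future_score[node]
        return -combined  # heapq is a min-heap, so negate to pop the best node first

    frontier = [(priority(start), start)]
    seen = {start}
    expanded = []
    while frontier:
        _, node = heapq.heappop(frontier)
        expanded.append(node)
        if node == goal:
            break
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                heapq.heappush(frontier, (priority(neighbour), neighbour))
    return expanded


# Toy example (all values invented): the history-based scores prefer the
# dead-end branch A -> B -> D, while the future-based scores prefer A -> C -> G.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["G"], "D": [], "G": []}
history_score = {"A": 0.0, "B": 0.9, "C": 0.4, "D": 0.8, "G": 0.5}
future_score = {"A": 0.0, "B": 0.1, "C": 0.9, "D": 0.1, "G": 1.0}

history_only = foresighted_search(graph, "A", "G", history_score, future_score, weight=0.0)
with_future = foresighted_search(graph, "A", "G", history_score, future_score, weight=0.5)
```

In this toy setting, ranking by history alone expands the dead-end branch before reaching the goal, whereas adding the future-based term steers the search toward the goal branch first. How NavQ actually weights and normalizes the two score sets is described in the paper, not in this sketch.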

Topics

Artificial Intelligence > Core AI > Agent Systems; Artificial Intelligence > Core AI > Planning; Computer Vision > Domain-Specific > Autonomous Driving; Reinforcement Learning > Methods > Deep RL; Reinforcement Learning > Applications > Robotics; Artificial Intelligence > Core AI > Robotics

Keywords

reinforcement learning; vision-language navigation; trajectory data; cross-modal encoding; foresighted navigation; foresighted planning
