Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

Akhil Perincherry; Jacob Krantz; Stefan Lee

CVPR 2025

Abstract

Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study whether visual representations of sub-goals implied by the instructions can serve as navigational cues and improve navigation performance. To synthesize these visual representations, or "imaginations", we apply a text-to-image diffusion model to landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues, and an auxiliary loss is added to explicitly encourage relating them to their corresponding referring expressions. Our findings reveal an increase of ~1 point in success rate (SR) and of up to ~0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone.
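
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how SR and SPL are commonly computed in the VLN evaluation literature. It is not the authors' evaluation code. The `Episode` class and its field names are hypothetical and introduced only for illustration. Both functions return fractions in [0, 1]; results reported in "points" correspond to these values multiplied by 100.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool                 # did the agent stop close enough to the goal?
    shortest_path_length: float   # geodesic distance from start to goal
    agent_path_length: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by the ratio of shortest to actual path length."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        longest = max(e.agent_path_length, e.shortest_path_length)
        if longest > 0:
            total += e.shortest_path_length / longest
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path_length=10.0, agent_path_length=10.0),
    Episode(success=True, shortest_path_length=10.0, agent_path_length=20.0),
    Episode(success=False, shortest_path_length=10.0, agent_path_length=5.0),
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

Because SPL discounts each success by path efficiency, it can never exceed SR, which is consistent with the SPL gain reported above being smaller than the SR gain.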

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Models > Diffusion Models
Computer Vision > Analysis > Scene Understanding
Artificial Intelligence > Core AI > Robotics

Keywords

image generation; vision-language navigation; multimodal learning; visual representation; landmark detection; diffusion model; vision-and-language navigation; navigation agent; instruction understanding; visual imagination; landmark cue
