DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes

Zhaowei Wang; Hongming Zhang; Tianqing Fang; Ye Tian; Yue Yang; Kaixin Ma; Xiaoman Pan; Yangqiu Song; Dong Yu

EMNLP 2025

Abstract

Large Vision-Language Models (LVLMs) have achieved significant progress in tasks like visual question answering and document understanding. However, their potential to comprehend embodied environments and navigate within them remains underexplored. In this work, we first study the challenge of open-vocabulary object navigation by introducing DivScene, a large-scale dataset with 4,614 houses across 81 scene types and 5,707 kinds of target objects. Our dataset provides a much greater diversity of target objects and scene types than existing datasets, enabling a comprehensive task evaluation. We evaluated various methods with LVLMs and LLMs on our dataset and found that current models still fall short in open-vocabulary object navigation. We then fine-tuned LVLMs to predict the next action with CoT explanations. We observe that LVLMs' navigation ability can be improved substantially using only BFS-generated shortest paths, without any human supervision, surpassing GPT-4o by over 20% in success rate.
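
The BFS-generated shortest paths mentioned in the abstract can be illustrated with a minimal sketch. It assumes a 2D occupancy grid with four-connected moves, which the source does not state; the grid encoding, the action names in `MOVES`, and both function names are hypothetical and are not the authors' implementation.

```python
from collections import deque

# Hypothetical action names; the source does not list the agent's action space.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def bfs_shortest_path(grid, start, goal):
    """Return a shortest list of cells from start to goal, or None if unreachable.

    grid is a list of equal-length strings; '#' marks a blocked cell and any
    other character is free (an assumed encoding for this illustration).
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


def path_to_actions(path):
    """Turn consecutive cells into action labels, one label per step."""
    delta_to_name = {delta: name for name, delta in MOVES.items()}
    return [delta_to_name[(b[0] - a[0], b[1] - a[1])] for a, b in zip(path, path[1:])]
```

Under these assumptions, each step of the returned path yields one (position, next action) pair that could serve as a supervised target without human annotation; how the authors pair such steps with observations and CoT explanations is not described in this abstract.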

Topics

Artificial Intelligence > Core AI > Autonomous Vehicles; Artificial Intelligence > Core AI > Multimodal Learning; Artificial Intelligence > Core AI > Planning

Keywords

scene understanding; embodied AI; action prediction; vision language model; object navigation
