Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation

Zehao Wang; Minye Wu; Yixin Cao; Yubo Ma; Meiqi Chen; Tinne Tuytelaars

EMNLP 2024

Abstract

This study presents a novel evaluation framework for the Vision-Language Navigation (VLN) task. It aims to diagnose current models at a finer-grained level across various instruction categories. The framework is structured around the context-free grammar (CFG) of the task, which serves as the basis for the problem decomposition and as the core premise of the instruction category design. We propose a semi-automatic method for CFG construction with the help of large language models (LLMs). We then derive and generate data spanning five principal instruction categories (i.e., direction change, landmark recognition, region recognition, vertical movement, and numerical comprehension). Our analysis of different models reveals notable performance discrepancies and recurrent issues. The stagnation of numerical comprehension, heavy selective biases over directional concepts, and other findings can inform the development of future language-guided navigation systems. The project is available at https://zehao-wang.github.io/navnuances.
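To make the idea of a grammar-driven decomposition concrete, here is a minimal Python sketch of a toy CFG whose nonterminals are tagged with the five instruction categories named in the abstract. Everything in it is a hypothetical illustration: the grammar symbols, the example phrases, and the names `GRAMMAR`, `CATEGORY`, and `expand` are assumptions made for this sketch. It is not the CFG or the data-generation pipeline from the paper.

```python
from itertools import product

# Hypothetical toy grammar (NOT the paper's CFG): each nonterminal maps to a
# list of productions, and each production is a list of symbols.
GRAMMAR = {
    "INSTRUCTION": [["ACTION"]],
    "ACTION": [["DIRECTION"], ["LANDMARK"], ["REGION"], ["VERTICAL"], ["NUMERICAL"]],
    "DIRECTION": [["turn", "SIDE"]],
    "SIDE": [["left"], ["right"]],
    "LANDMARK": [["stop", "at", "the", "OBJECT"]],
    "OBJECT": [["sofa"], ["table"]],
    "REGION": [["enter", "the", "ROOM"]],
    "ROOM": [["kitchen"], ["bedroom"]],
    "VERTICAL": [["go", "VDIR", "the", "stairs"]],
    "VDIR": [["up"], ["down"]],
    "NUMERICAL": [["take", "the", "ORDINAL", "door"]],
    "ORDINAL": [["second"], ["third"]],
}

# Assumed mapping from nonterminals to the category names used in the abstract.
CATEGORY = {
    "DIRECTION": "direction change",
    "LANDMARK": "landmark recognition",
    "REGION": "region recognition",
    "VERTICAL": "vertical movement",
    "NUMERICAL": "numerical comprehension",
}


def expand(symbol):
    """Yield (tokens, categories) for every derivation of `symbol`."""
    if symbol not in GRAMMAR:
        # Terminal symbol: a single word with no category of its own.
        yield [symbol], frozenset()
        return
    own = frozenset({CATEGORY[symbol]}) if symbol in CATEGORY else frozenset()
    for production in GRAMMAR[symbol]:
        # Expand each symbol of the production, then combine all choices.
        parts = [list(expand(s)) for s in production]
        for combo in product(*parts):
            tokens = [tok for toks, _ in combo for tok in toks]
            cats = own.union(*(c for _, c in combo))
            yield tokens, cats


# Enumerate every instruction the toy grammar can produce, with its categories.
instructions = [(" ".join(toks), sorted(cats)) for toks, cats in expand("INSTRUCTION")]

for text, cats in instructions:
    print(f"{text!r} -> {cats}")
```

Because each generated instruction carries the categories of the nonterminals used to derive it, per-category evaluation subsets fall out of the grammar directly; a real grammar would be far larger and would allow compound instructions.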

Topics

Artificial Intelligence > Core AI > Agent Systems; Artificial Intelligence > Core AI > Multimodal Learning; Artificial Intelligence > Core AI > Computer Vision; Computer Vision > Analysis > Computer Vision; Deep Learning > Models > Multi-Modal Learning

Keywords

vision-language navigation; multimodal learning; instruction following; autonomous agent; context-free grammar; fine-grained evaluation; landmark recognition; language-guided navigation
