Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

Ronghang Hu; Daniel Fried; Anna Rohrbach; Dan Klein; Trevor Darrell; Kate Saenko

ACL 2019

Abstract

Vision-and-Language Navigation (VLN) requires grounding instructions, such as “turn right and stop at the door”, to routes in a visual environment. The actual grounding can connect language to the environment through multiple modalities, e.g. “stop at the door” might ground into visual objects, while “turn right” might rely only on the geometric structure of a route. We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models. Surprisingly, we discover that visual features may actually hurt these models: models that use only route structure, with visual features ablated, outperform their visual counterparts in previously unseen environments on the benchmark Room-to-Room dataset. To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and to ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.
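
The abstract does not say how the expert models are combined, so the following is only a minimal sketch of one common way to ensemble at prediction time: each expert scores the same set of candidate actions, the scores are normalized to log-probabilities, and a weighted sum picks the action. The function names, the expert names, and the uniform weighting are all assumptions made for illustration, not the authors' implementation.

```python
import math


def log_softmax(scores):
    """Turn raw scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def ensemble_action(expert_scores, weights=None):
    """Pick an action by combining several experts' scores.

    expert_scores: dict mapping a (hypothetical) expert name to a list of
        raw scores, one per candidate action, in the same action order.
    weights: optional dict of per-expert weights; uniform if omitted
        (an assumption, the paper may weight experts differently).
    Returns (index of the best action, combined log-scores).
    """
    names = list(expert_scores)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    n_actions = len(expert_scores[names[0]])
    combined = [0.0] * n_actions
    for name in names:
        log_probs = log_softmax(expert_scores[name])
        for i in range(n_actions):
            combined[i] += weights[name] * log_probs[i]
    best = max(range(n_actions), key=lambda i: combined[i])
    return best, combined
```

For example, calling `ensemble_action({"visual": [2.0, 0.0, 0.0], "route_only": [0.0, 3.0, 0.0], "objects": [0.0, 2.5, 0.0]})` selects the second candidate action, because two of the three hypothetical experts prefer it strongly enough to outweigh the visual expert.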

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Robotics
Computer Vision > Core AI > Multimodal Learning

Keywords

vision-language navigation, multimodal learning, visual grounding, instruction following, ensemble method, vision-and-language navigation, multimodal grounding, route structure
