Less Is More: Generating Grounded Navigation Instructions From Landmarks

Su Wang; Ceslee Montgomery; Jordi Orbay; vighnesh Birodkar; Aleksandra Faust; Izzeddin Gur; Natasha Jaques; Austin Waters; Jason Baldridge; Peter Anderson

2022 CVPR CVPR 2022

Less Is More: Generating Grounded Navigation Instructions From Landmarks

Abstract

We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first stage landmark detector and a second stage generator--a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8b images, we identify 1.1m English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 73% following MARKY-MT5's instructions, just shy of their 76% SR following human instructions---and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain 62-64% SRs on three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Natural Language Processing

🧭 Keyword Pioneer — image-text encoder

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Su Wang , Ceslee Montgomery , Jordi Orbay , vighnesh Birodkar , Aleksandra Faust , Izzeddin Gur , Natasha Jaques , Austin Waters , Jason Baldridge , Peter Anderson

Topics

Artificial Intelligence > Core AI > Planning Computer Vision > Generation > Image Captioning Artificial Intelligence > Core AI > Language Deep Learning > Learning Types > Multi-Modal Learning Artificial Intelligence > Core AI > Multi-Modal Learning Natural Language Processing > Applications > Natural Language Generation

Keywords

multimodal learning image captioning visual grounding landmark detection language model language generation encoder-decoder architecture navigation instruction image-text encoder

Download PDF

Related papers

UniCoRN: A Unified Conditional Image Repainting Network 2022

Why Discard if You Can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis 2022

All-in-One Image Restoration for Unknown Corruption 2022

Stability-Driven Contact Reconstruction From Monocular Color Images 2022

Forecasting Characteristic 3D Poses of Human Actions 2022