Topological Planning With Transformers for Vision-and-Language Navigation

Kevin Chen; Junshen K. Chen; Jo Chuang; Marynel Vázquez; Silvio Savarese

CVPR 2021

Abstract

Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
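The modular split described in the abstract (a plan over nodes of a topological map, then execution with low-level actions) can be illustrated with a small sketch. This is not the authors' implementation: the attention-based planner is left out entirely, and the map contents, the action names `rotate_left`/`rotate_right`/`forward`, the step sizes `TURN_DEG` and `STEP_M`, and the helpers `is_valid_plan` and `plan_to_actions` are all assumptions made for illustration. It only shows how a node sequence, however it was predicted, could be checked against the map and unrolled into discrete actions.

```python
import math

# Hypothetical topological map: node id -> (x, y) position in meters.
NODES = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.0, 1.0)}
# Undirected edges between nodes that are directly traversable.
EDGES = {("a", "b"), ("b", "c")}

TURN_DEG = 15   # assumed size of one rotate action, in degrees
STEP_M = 0.25   # assumed length of one forward action, in meters


def is_valid_plan(plan):
    """A plan is valid if every consecutive pair of nodes shares an edge."""
    return all((u, v) in EDGES or (v, u) in EDGES
               for u, v in zip(plan, plan[1:]))


def plan_to_actions(plan, heading_deg=0.0):
    """Unroll a node sequence into rotate/forward actions (open loop)."""
    actions = []
    for u, v in zip(plan, plan[1:]):
        (x0, y0), (x1, y1) = NODES[u], NODES[v]
        target = math.degrees(math.atan2(y1 - y0, x1 - x0))
        # Signed smallest turn from current heading, in [-180, 180).
        delta = (target - heading_deg + 180.0) % 360.0 - 180.0
        n_turns = round(abs(delta) / TURN_DEG)
        turn = "rotate_left" if delta > 0 else "rotate_right"
        actions += [turn] * n_turns
        # Simplification: assume the turn reaches the target heading exactly.
        heading_deg = target
        n_steps = round(math.hypot(x1 - x0, y1 - y0) / STEP_M)
        actions += ["forward"] * n_steps
    return actions


plan = ["a", "b", "c"]  # stand-in for a plan predicted by a learned planner
if is_valid_plan(plan):
    print(plan_to_actions(plan))
```

Because a plan is just a node sequence, revisiting a node (for example `["a", "b", "a"]`) is still a valid plan here, which is one simple way to represent backtracking. A real controller would close the loop on observations rather than unrolling actions blindly as this sketch does.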

Topics

Artificial Intelligence > Core AI > Planning
Machine Learning > Application Areas > Efficient Computing
Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Scene Understanding
Robotics > Capabilities > Navigation
Artificial Intelligence > Core AI > Robotics
Deep Learning > Models > Vision-Language Models
Artificial Intelligence > Core AI > Vision-Language Models

Keywords

scene understanding, attention mechanism, vision-and-language navigation, topological map, navigation planning, topological planning
