Transformer-based Localization from Embodied Dialog with Large-scale Pre-training

Meera Hahn; James M. Rehg

2022 IJCNLP IJCNLP 2022

Transformer-based Localization from Embodied Dialog with Large-scale Pre-training

Abstract

AbstractWe address the challenging task of Localization via Embodied Dialog (LED). Given a dialog from two agents, an Observer navigating through an unknown environment and a Locator who is attempting to identify the Observer’s location, the goal is to predict the Observer’s final location in a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a graph-based scene representation is more effective than the top-down 2D maps used in prior works. Our approach outperforms previous baselines.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Meera Hahn , James M. Rehg

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Classification Deep Learning > Architectures > Transformers Deep Learning > Techniques > Pretraining Computer Vision > Analysis > Scene Understanding

Keywords

transformer architecture multimodal learning visual localization scene graph pretraining strategy dialog understanding bert architecture embodied dialog

Download PDF

Related papers

Chasing the Tail with Domain Generalization: A Case Study on Frequency-Enriched Datasets 2022

Double Trouble: How to not Explain a Text Classifier’s Decisions Using Counterfactuals Synthesized by Masked Language Models? 2022

Leveraging Key Information Modeling to Improve Less-Data Constrained News Headline Generation via Duality Fine-Tuning 2022

Graph-augmented Learning to Rank for Querying Large-scale Knowledge Graph 2022

Missing Modality meets Meta Sampling (M3S): An Efficient Universal Approach for Multimodal Sentiment Analysis with Missing Modality 2022