SegEQA: Video Segmentation Based Visual Attention for Embodied Question Answering

Haonan Luo; Guosheng Lin; Zichuan Liu; Fayao Liu; Zhenmin Tang; Yazhou Yao

ICCV 2019

Abstract

Embodied Question Answering (EQA) is a newly defined research area in which an agent must answer a user's questions by exploring a real-world environment. It has attracted increasing research interest due to its broad applications in autonomous driving systems, in-home robots, and personal assistants. Most existing methods perform poorly in both answering and navigation accuracy because they lack local details and are vulnerable to the ambiguity caused by complicated visual conditions. To tackle these problems, we propose a segmentation-based visual attention mechanism for Embodied Question Answering. First, we extract local semantic features by introducing a novel high-speed video segmentation framework. Then, guided by the extracted semantic features, we propose a bottom-up visual attention mechanism for the Visual Question Answering (VQA) sub-task. Further, we propose a feature fusion strategy to guide the training of the navigator without much additional computational cost. Ablation experiments show that our method boosts the performance of the VQA module by 4.2% (68.99% vs. 64.73%) and leads to a 3.6% (48.59% vs. 44.98%) overall improvement in EQA accuracy.
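The idea of attending over segment-level features can be illustrated with a small sketch. This is not the authors' architecture, which the abstract does not specify. It assumes that each segmentation mask is turned into one region feature by masked average pooling, and that regions are scored against a question embedding with a plain dot product followed by a softmax. The function names (`masked_average_pool`, `attend`) and the toy inputs are hypothetical and chosen only for illustration.

```python
import math


def masked_average_pool(feature_map, mask):
    """Average the feature vectors of all pixels where mask is 1.

    feature_map: H x W x C nested lists; mask: H x W nested lists of 0/1.
    Assumption: one pooled vector per segment stands in for its 'local
    semantic feature'; the paper may pool differently.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for row_feats, row_mask in zip(feature_map, mask):
        for feat, inside in zip(row_feats, row_mask):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # empty segment: return the zero vector
    return [t / count for t in total]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def attend(region_feats, question_vec):
    """Weight each region by its dot-product similarity to the question.

    Assumption: dot-product scoring is a stand-in for whatever learned
    scoring function the actual model uses.
    """
    scores = [sum(r * q for r, q in zip(region, question_vec))
              for region in region_feats]
    weights = softmax(scores)
    channels = len(question_vec)
    attended = [sum(w * region[c] for w, region in zip(weights, region_feats))
                for c in range(channels)]
    return weights, attended


# Toy 2 x 2 feature map with 2 channels and two segments (top and bottom row).
feature_map = [[[1.0, 0.0], [1.0, 0.0]],
               [[0.0, 1.0], [0.0, 1.0]]]
masks = [[[1, 1], [0, 0]],
         [[0, 0], [1, 1]]]
question_vec = [2.0, 0.0]  # hypothetical question embedding

regions = [masked_average_pool(feature_map, m) for m in masks]
weights, attended = attend(regions, question_vec)
print(weights, attended)
```

In this toy input the question vector aligns with the first channel, so the top-row segment receives the larger attention weight and dominates the attended feature.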

Topics

Artificial Intelligence > Core AI > Agent Systems
Computer Vision > Processing > Video Understanding

Keywords

video segmentation, visual question answering, visual attention, feature fusion, agent system, embodied question answering
