Global Fusion Attention for Vision and Language Understanding (Student Abstract)

Zixin Guo; Chen Liang; Ziyu Wan; Yang Bai

2021 AAAI AAAI 2021

Global Fusion Attention for Vision and Language Understanding (Student Abstract)

Abstract

Abstract We extend the popular transformer architecture to a multi-modal model, processing both visual and textual inputs. We propose a new attention mechanism on Transformer-based architecture for the joint vision and language understanding tasks. Our model fuses multi-level comprehension between images and texts in a weighted manner, which could better curve the internal relationships. Experiments on benchmark VQA dataset CLEVR demonstrate the effectiveness of the proposed attention mechanism. We also observe the improvements in sample efficiency of reinforcement learning through the experiments on grounded language understanding tasks of BabyAI platform.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning

🐣 Hot Topic Early Bird — vision language

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zixin Guo , Chen Liang , Ziyu Wan , Yang Bai

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Multimodal Learning Deep Learning > Learning Types > Multi-Modal Learning Deep Learning > Techniques > Attention Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

reinforcement learning visual question answering attention mechanism multimodal learning multi-modal learning vision language

Download PDF

Related papers

Contextual Conditional Reasoning 2021

Attention Beam: An Image Captioning Approach (Student Abstract) 2021

Movie Summarization via Sparse Graph Construction 2021

Text Analysis for Understanding Symptoms of Social Anxiety in Student Veterans 2021

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs 2021