Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations

Po-Yao Huang; Xiaojun Chang; Alexander Hauptmann

IJCNLP 2019

Abstract

To promote and better understand multilingual image search, we leverage visual object detection and propose a model with diverse multi-head attention that learns grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and to visual objects, yielding fine-grained alignments between sentences and images. We introduce a new objective function that explicitly encourages attention diversity in order to learn an improved visual-semantic embedding space. We evaluate our model on the German-Image and English-Image matching tasks on the Multi30K dataset, and on the Semantic Textual Similarity task with English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all three tasks.
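The abstract names two ingredients, multi-head attention over textual tokens or detected objects and an objective term that encourages the heads to differ, without giving their exact form. The following is a minimal sketch of the general idea only, not the authors' formulation. The per-head query vectors, the scaled dot-product scoring, and the mean squared cosine similarity penalty are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_pool(features, head_queries):
    """Pool a set of feature vectors once per attention head.

    features: n vectors of dimension d (e.g. word or object features).
    head_queries: K query vectors of dimension d (assumed learned parameters).
    Returns K pooled vectors, one per head.
    """
    d = len(features[0])
    pooled = []
    for q in head_queries:
        # Assumed scoring: scaled dot product between the head query and each feature.
        weights = softmax([dot(q, f) / math.sqrt(d) for f in features])
        pooled.append(
            [sum(w * f[i] for w, f in zip(weights, features)) for i in range(d)]
        )
    return pooled


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)


def diversity_penalty(head_outputs):
    """Mean squared cosine similarity over distinct head pairs (assumed form).

    Zero when the heads are mutually orthogonal, close to one when they coincide.
    """
    k = len(head_outputs)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(cosine(head_outputs[i], head_outputs[j]) ** 2 for i, j in pairs) / len(pairs)


# Toy usage: 5 random "object" features of dimension 4, pooled by 3 heads.
random.seed(0)
objects = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
queries = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
heads = multi_head_pool(objects, queries)
penalty = diversity_penalty(heads)
```

In a training setup of this kind, such a penalty would typically be added with a weight to the image-sentence matching loss, so that minimizing the total pushes the heads toward attending to different aspects of the input. The weighting and the matching loss are not specified in the abstract and are left out here.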

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Self-Supervised Learning
Computer Vision > Generation > Image Captioning
Natural Language Processing > Generation > Language Modeling
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

image retrieval, object detection, visual-semantic embedding, cross-modal alignment, multi-head attention, multilingual multimodal, attention diversity, multilingual multimodal representation
