Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Gen Li; Nan Duan; Yuejian Fang; Ming Gong; Daxin Jiang

AAAI 2020

Abstract

We propose Unicoder-VL, a universal encoder that learns joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), we feed both visual and linguistic content into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, using three pre-training tasks: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual content jointly. The last task predicts whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks, demonstrating the strength of cross-modal pre-training.
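To make the three objectives concrete, the following is a minimal sketch of how per-task losses for one image-caption pair might be combined. It is not the authors' implementation: the function names, the input layout, and the equal weighting of the three losses are all assumptions made for illustration, and the logits are stand-ins for the output of the Transformer encoder plus task heads.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def binary_cross_entropy(logit, label):
    """Loss for one matched (1) / mismatched (0) image-text decision."""
    z = logit if label == 1 else -logit
    # softplus(-z) computed in a numerically stable way
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def pretraining_loss(mlm_preds, moc_preds, vlm_logit, vlm_label):
    """Combine MLM, MOC and VLM losses (hypothetical helper).

    mlm_preds: list of (vocab_logits, true_token_id), one per masked word
    moc_preds: list of (class_logits, true_object_class), one per masked region
    vlm_logit: scalar score that the image and the text match
    vlm_label: 1 if the pair is a true image-caption pair, else 0
    """
    mlm = sum(cross_entropy(l, t) for l, t in mlm_preds) / max(len(mlm_preds), 1)
    moc = sum(cross_entropy(l, t) for l, t in moc_preds) / max(len(moc_preds), 1)
    vlm = binary_cross_entropy(vlm_logit, vlm_label)
    # Assumption: the three losses are simply summed with equal weight.
    return {"mlm": mlm, "moc": moc, "vlm": vlm, "total": mlm + moc + vlm}


if __name__ == "__main__":
    losses = pretraining_loss(
        mlm_preds=[([2.0, 0.5, -1.0], 0)],   # one masked word, 3-word toy vocabulary
        moc_preds=[([0.1, 1.5], 1)],         # one masked region, 2 toy object classes
        vlm_logit=0.8,
        vlm_label=1,
    )
    print(losses)
```

In this sketch MLM and MOC share the same classification loss and differ only in what is being predicted (a word versus an object category), while VLM is a single binary decision per image-text pair.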

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Techniques > Pretraining
Natural Language Processing > Resources & Methods > Text Representation
Deep Learning > Models > Vision-Language Models

Keywords

multimodal learning, vision-language model, image-text retrieval, masked language modeling, vision language, visual commonsense reasoning, cross-modal pre-training
