UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training

Mingyang Zhou; Luowei Zhou; Shuohang Wang; Yu Cheng; Linjie Li; Zhou Yu; Jingjing Liu

CVPR 2021

Abstract

Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC^2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). We then extend the standard Masked Language Modeling and Image-Text Matching training objectives to a multilingual setting, where alignment between different languages is captured through shared visual context (e.g., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves a new state of the art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
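The first two steps described in the abstract (MT-based caption augmentation, then masked language modeling over the resulting captions) can be illustrated with a minimal sketch. This is not the authors' implementation: `augment_with_translations`, `mask_tokens`, and `toy_translate` are hypothetical names, the lookup table stands in for a real MT system, whitespace splitting stands in for a real tokenizer, and the 0.15 masking rate is an assumption borrowed from common masked-language-modeling practice rather than a value stated on this page.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

MASK = "[MASK]"  # placeholder mask symbol (assumption, not from the paper)


def augment_with_translations(
    caption_en: str,
    translate: Callable[[str, str], str],
    languages: List[str],
) -> Dict[str, str]:
    """Return {language code: caption}, keeping the English original."""
    captions = {"en": caption_en}
    for lang in languages:
        captions[lang] = translate(caption_en, lang)
    return captions


def mask_tokens(
    tokens: List[str], mask_prob: float, rng: random.Random
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask each token independently with probability mask_prob.

    Returns the masked sequence and per-position labels: the original
    token where a mask was applied, None elsewhere.
    """
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def toy_translate(text: str, lang: str) -> str:
    """Stand-in for a machine translation system (hypothetical lookup table)."""
    table = {("a dog runs", "de"): "ein Hund rennt"}
    return table[(text, lang)]


# One image caption, augmented with a German translation, then masked.
captions = augment_with_translations("a dog runs", toy_translate, ["de"])
rng = random.Random(0)
masked_de, labels_de = mask_tokens(captions["de"].split(), 0.15, rng)
```

In a full model, the masked caption would be fed to a Transformer together with the image's region features, so that the same image serves as shared context for every language's caption. The sketch covers only the data preparation side.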

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Natural Language Processing > Resources & Methods > Multilingual NLP
Machine Learning > Learning Types > Transfer Learning
Deep Learning > Learning Types > Multi-Modal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning
Deep Learning > Models > Multi-Modal Learning
Deep Learning > Models > Vision-Language Models

Keywords

contrastive learning, machine translation, cross-lingual representation, multilingual processing, cross-modal learning, cross-modal representation, vision-language model, image-text retrieval, masked language modeling, vision-language pretraining, multilingual image-text retrieval
