Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

Po-Yao Huang; Mandela Patrick; Junjie Hu; Graham Neubig; Florian Metze; Alexander Hauptmann

2021 NAACL NAACL 2021

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

Abstract

AbstractThis paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX; as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M is available at http://github.com/berniebear/Multi-HT100M.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — text-to-video search

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Po-Yao Huang , Mandela Patrick , Junjie Hu , Graham Neubig , Florian Metze , Alexander Hauptmann

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Zero-Shot Learning Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

cross-lingual transfer vision-language model zero-shot transfer multilingual multimodal multilingual pre-training text-to-video search

Download PDF

Related papers

Knowledge Router: Learning Disentangled Representations for Knowledge Graphs 2021

Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks 2021

Abstract Meaning Representation Guided Graph Encoding and Decoding for Joint Information Extraction 2021

Beyond Fair Pay: Ethical Implications of NLP Crowdsourcing 2021

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers 2021