mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Chenliang Li; Haiyang Xu; Junfeng Tian; Wei Wang; Ming Yan; Bin Bi; Jiabo Ye; He Chen; Guohai Xu; Zheng Cao; Ji Zhang; Songfang Huang; Fei Huang; Jingren Zhou; Luo Si

EMNLP 2022

Abstract

Large-scale pre-trained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from inefficiency and linguistic signal overwhelmed by long visual sequences in cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability on vision-language and video-language tasks. The code and pre-trained models are available at https://github.com/alibaba/AliceMind

Topics

Artificial Intelligence > Core AI > Foundation Models; Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Architectures > Transformers; Natural Language Processing > Resources & Methods > Large Language Models; Deep Learning > Models > Foundation Models; Deep Learning > Learning Types > Multi-Modal Learning; Deep Learning > Models > Vision-Language Models

Keywords

visual question answering; image captioning; cross-modal learning; foundation model; vision-language model; image-text retrieval; vision-language pre-training; vision-language foundation model; cross-modal skip-connection
