i-Code: An Integrative and Composable Multimodal Learning Framework

Ziyi Yang; Yuwei Fang; Chenguang Zhu; Reid Pryzant; Dongdong Chen; Yu Shi; Yichong Xu; Yao Qian; Mei Gao; Yi-Ling Chen; Liyang Lu; Yujia Xie; Robert Gmyr; Noel Codella; Naoyuki Kanda; Bin Xiao; Lu Yuan; Takuya Yoshioka; Michael Zeng; Xuedong Huang

AAAI 2023

Abstract

Human intelligence is multimodal: we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework in which users can flexibly combine vision, speech, and language into unified, general-purpose vector representations. In this framework, data from each modality are first passed to pretrained single-modality encoders. The encoder outputs are then integrated by a multimodal fusion network, which uses novel merge- and co-attention mechanisms to combine information from the different modalities. The entire system is pretrained end-to-end with new objectives, including masked modality unit modeling and cross-modality contrastive learning. Unlike previous research that uses only video for pretraining, the i-Code framework can dynamically process single-, dual-, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space. Experimental results show that i-Code outperforms state-of-the-art techniques on five multimodal understanding tasks and single-modality benchmarks, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
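The abstract names cross-modality contrastive learning as one pretraining objective but does not give its form. The following is a minimal sketch of one common formulation, a symmetric InfoNCE loss over paired embeddings from two modalities. The function names, the temperature value, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to candidates[i].

    Every other candidate in the batch serves as a negative.
    The temperature value is an assumed hyperparameter.
    """
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, cj) / temperature for cj in c]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(a)

def symmetric_contrastive_loss(x, y, temperature=0.07):
    """Average the x-to-y and y-to-x directions (an assumed, common choice)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))

# Toy batch: two hypothetical speech embeddings and two text embeddings.
speech = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]   # row i matches speech row i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairing deliberately broken

loss_aligned = symmetric_contrastive_loss(speech, text_aligned)
loss_swapped = symmetric_contrastive_loss(speech, text_swapped)
print(loss_aligned < loss_swapped)
```

In this sketch, correctly paired embeddings yield the lower loss, which is the signal that pulls matching samples from different modalities together in a shared representation space.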

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Learning Paradigms > Transfer Learning
Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Learning Types > Self-Supervised Learning
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Models > Multi-Modal Learning

Keywords

self-supervised learning; multimodal learning; feature fusion; self-supervised pretraining; cross-modal contrastive learning; cross-modality contrastive learning; masked modality modeling; visual language speech integration
