i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

Ziyi Yang; Mahmoud Khademi; Yichong Xu; Reid Pryzant; Yuwei Fang; Chenguang Zhu; Dongdong Chen; Yao Qian; Xuemei Gao; Yi-Ling Chen; Robert Gmyr; Naoyuki Kanda; Noel Codella; Bin Xiao; Yu Shi; Lu Yuan; Takuya Yoshioka; Michael Zeng; Xuedong Huang

NAACL 2024

Abstract

The convergence of text, visual, and audio data is crucial to achieving human-like artificial intelligence. However, the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities. We propose closing this gap with i-Code V2, one of the first models capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder to project combinations of modalities into a shared representational space. Language tokens are generated from these representations via an autoregressive decoder. i-Code V2 is pretrained end-to-end on a large collection of dual- and single-modality datasets with a novel text completion objective that can be generalized across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.
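The data flow described in the abstract (per-modality encoders, a fusion step, then token-by-token decoding) can be illustrated with a toy sketch. This is not the authors' implementation: the encoder functions, the `fuse` helper (plain concatenation standing in for the learned modality-fusing encoder), and `toy_next_token` are hypothetical placeholders chosen only to show how any subset of modalities could feed one autoregressive loop.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]

# Hypothetical stand-ins for the single-modality encoders.
# Each maps raw input to a sequence of 2-d feature vectors.
def encode_language(tokens: Sequence[str]) -> List[Vector]:
    return [[float(len(t)), 1.0] for t in tokens]

def encode_vision(patches: Sequence[float]) -> List[Vector]:
    return [[p, 2.0] for p in patches]

def encode_speech(frames: Sequence[float]) -> List[Vector]:
    return [[f, 3.0] for f in frames]

ENCODERS = {
    "vision": encode_vision,
    "language": encode_language,
    "speech": encode_speech,
}

def fuse(inputs: Dict[str, Sequence]) -> List[Vector]:
    """Encode whichever modalities are present and join them into one sequence.

    Assumption: concatenation replaces the learned fusion encoder here.
    """
    fused: List[Vector] = []
    for name in ("vision", "language", "speech"):
        if name in inputs:
            fused.extend(ENCODERS[name](inputs[name]))
    return fused

def generate(
    fused: List[Vector],
    next_token: Callable[[List[Vector], List[str]], str],
    max_len: int = 8,
    eos: str = "</s>",
) -> List[str]:
    """Autoregressive loop: each step sees the fused input and the prefix so far."""
    output: List[str] = []
    for _ in range(max_len):
        token = next_token(fused, output)
        if token == eos:
            break
        output.append(token)
    return output

def toy_next_token(fused: List[Vector], prefix: List[str]) -> str:
    # Hypothetical decoder step: one placeholder token per fused position, then stop.
    return f"w{len(prefix)}" if len(prefix) < len(fused) else "</s>"

# Any combination of modalities goes through the same path.
fused_vl = fuse({"vision": [0.5], "language": ["a", "cat"]})
tokens_vl = generate(fused_vl, toy_next_token)
fused_s = fuse({"speech": [0.1, 0.2]})
tokens_s = generate(fused_s, toy_next_token)
```

In a real system, `toy_next_token` would be replaced by a trained decoder that attends over the fused representations; the loop structure is the only part this sketch is meant to convey.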

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Natural Language Processing > Generation > Text Generation

Keywords

text generation, autoregressive generation, multimodal learning, vision language speech
