Generative Multimodal Models are In-Context Learners

Quan Sun; Yufeng Cui; Xiaosong Zhang; Fan Zhang; Qiying Yu; Yueze Wang; Yongming Rao; Jingjing Liu; Tiejun Huang; Xinlong Wang

CVPR 2024

Abstract

Humans can easily solve multimodal tasks in context, with only a few demonstrations or simple instructions, which current multimodal systems largely struggle to imitate. In this work, we demonstrate that by effectively scaling up generative multimodal models, their task-agnostic in-context learning capabilities can be significantly enhanced. We introduce Emu2, a generative multimodal model with 37 billion parameters, which serves as a base model and general-purpose interface for a variety of multimodal tasks. Emu2 not only achieves strong performance in the few-shot setting but can also be instruction-tuned to follow specific instructions for tasks such as visual question answering and object-grounded image generation. Emu2 even shows an emergent ability to solve tasks that require on-the-fly reasoning, such as visual prompting, which existing models are unlikely to handle. We identify additional tasks where Emu2's in-context learning can further improve, and discuss its broader societal impact. Our code and models will be made publicly available to facilitate future research.

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Learning Paradigms > Few-Shot Learning
Artificial Intelligence > Core AI > Large Language Models
Machine Learning > Learning Types > In-Context Learning
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > In-Context Learning

Keywords

zero-shot learning, few-shot learning, visual question answering, in-context learning, multimodal learning, instruction tuning, foundation model, multimodal reasoning, multimodal model, visual prompting, large language model, generative multimodal model

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024