2024 CVPR CVPR 2024

Generative Multimodal Models are In-Context Learners

Abstract

Humans can easily solve multimodal tasks in context with only a few demonstrations or simple instructions which current multimodal systems largely struggle to imitate. In this work we demonstrate that by effectively scaling up generative multimodal models their task-agnostic in-context learning capabilities can be significantly enhanced. We introduce Emu2 a generative multimodal model with 37 billion parameters which serves as a base model and general-purpose interface for a variety of multimodal tasks. Emu2 not only achieves strong performance in few-shot setting but can also be instruct-tuned to follow specific instructions such as visual question answering and object-grounded image generation. Emu2 even emerges to solve tasks that require on-the-fly reasoning such as visual prompting which existing models are unlikely to handle. We identify additional tasks where Emu2's in-context learning can further improve and discuss its broader societal impact. Our code and models will be made publicly available to facilitate future research.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning
🧭 Keyword Pioneer — generative multimodal model
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio