2024 CVPR CVPR 2024

Instruct-Imagen: Image Generation with Multi-modal Instruction

Abstract

This paper presents Instruct-Imagen a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g. text edge style subject etc.) such that abundant generation intents can be standardized in a uniform format. We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model with two stages. First we adapt the model using the retrieval-augmented training to enhance model's capabilities to ground its generation on external multi-modal context. Subsequently we fine-tune the adapted model on diverse image generation tasks that requires vision-language understanding (e.g. subject-driven generation etc.) each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Our evaluation suite will be made publicly available.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning
🧭 Keyword Pioneer — multi-modal instruction
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio