Instruct-Imagen: Image Generation with Multi-modal Instruction

Hexiang Hu; Kelvin C.K. Chan; Yu-Chuan Su; Wenhu Chen; Yandong Li; Kihyuk Sohn; Yang Zhao; Xue Ben; Boqing Gong; William Cohen; Ming-Wei Chang; Xuhui Jia

CVPR 2024

Abstract

This paper presents Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject), so that abundant generation intents can be standardized in a uniform format. We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model in two stages. First, we adapt the model using retrieval-augmented training to enhance the model's capability to ground its generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Our evaluation suite will be made publicly available.
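
To make the idea of a multi-modal instruction more concrete, the following is a minimal sketch of how a text instruction might refer to several pieces of context (edge map, style image, subject image) in one uniform format. It is illustrative only and is not the authors' implementation: the class names `ContextItem` and `MultiModalInstruction`, the `[ref#N]` placeholder syntax, and the file names are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ContextItem:
    """One piece of multi-modal context (hypothetical structure)."""
    modality: str  # e.g. "edge", "style", "subject"
    payload: str   # stand-in for image data; a file name in this sketch


@dataclass
class MultiModalInstruction:
    """A natural-language instruction plus the context items it names."""
    text: str
    context: dict = field(default_factory=dict)  # name -> ContextItem

    def referenced_names(self):
        # Placeholders look like [ref#1]; this syntax is an assumption.
        return re.findall(r"\[(ref#\d+)\]", self.text)

    def missing_context(self):
        # Names used in the text that have no matching context item.
        return [n for n in self.referenced_names() if n not in self.context]


# Hypothetical example: one subject reference and one style reference.
instruction = MultiModalInstruction(
    text="Generate an image of [ref#1] in the style of [ref#2].",
    context={
        "ref#1": ContextItem("subject", "dog.png"),
        "ref#2": ContextItem("style", "watercolor.png"),
    },
)
```

Under these assumptions, different tasks (style transfer, subject-driven generation, edge-conditioned generation) differ only in the instruction text and the attached context items, which is the sense in which the format is uniform. A check such as `instruction.missing_context()` could be used to catch instructions that mention a reference with no attached context.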

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Models > Diffusion Models
Computer Vision > Generation > Image Generation

Keywords

image generation, task generalization, text-to-image diffusion, vision-language understanding, multi-modal instruction, retrieval-augmented training
