Generating Illustrated Instructions

Sachit Menon; Ishan Misra; Rohit Girdhar

2024 CVPR CVPR 2024

Generating Illustrated Instructions

Abstract

We introduce a new task of generating "Illustrated Instructions" i.e. visual instructions customized to a user's needs. We identify desiderata unique to this task and formalize it through a suite of automatic and human evaluation metrics designed to measure the validity consistency and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases users even prefer it to human-generated articles. Most notably it enables various new and exciting applications far beyond what static articles on the web can provide such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Natural Language Processing

🧭 Keyword Pioneer — illustrated instruction

🐣 Hot Topic Early Bird — multimodal generation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sachit Menon , Ishan Misra , Rohit Girdhar

Topics

Artificial Intelligence > Core AI > Multimodal Learning Deep Learning > Models > Diffusion Models Deep Learning > Models > Generative Models Computer Vision > Generation > Image Generation Natural Language Processing > Generation > Text Generation Artificial Intelligence > Core AI > Large Language Models Deep Learning > Learning Types > Multi-Modal Learning Deep Learning > Models > Vision-Language Models

Keywords

multimodal learning text-to-image generation diffusion model visual instruction multimodal generation large language model illustrated instruction

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024