2024 CVPR CVPR 2024

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Abstract

We present CoDi-2 a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for both encoding and generation CoDi-2 empowers Large Language Models (LLMs) to understand modality-interleaved instructions and in-context examples and autoregressively generate grounded and coherent multimodal outputs in an any-to-any input-output modality paradigm. To train CoDi-2 we build a large-scale generation dataset encompassing in-context multimodal instructions across text vision and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks like editing exemplar learning composition reasoning etc. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation vision transformation and audio editing and showcases a significant advancement for integrating diverse multimodal tasks with sequential generation.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Natural Language Processing and Speech & Audio
🧭 Keyword Pioneer — interleaved generation
🐣 Hot Topic Early Bird — modality alignment
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio