Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

Jiasen Lu; Christopher Clark; Sangho Lee; Zichen Zhang; Savya Khosla; Ryan Marten; Derek Hoiem; Aniruddha Kembhavi

2024 CVPR CVPR 2024

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

Abstract

We present Unified-IO 2 a multimodal and multi-skill unified model capable of following novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can generate text image or audio outputs which is accomplished in a unified way by tokenizing these different inputs and outputs into a shared semantic space that can then be processed by a single encoder-decoder transformer model. Unified-IO 2 is trained from scratch on a custom-built multimodal pre-training corpus and then learns an expansive set of skills through fine-tuning on over 120 datasets including datasets for segmentation object detection image editing audio localization video tracking embodied AI and 3D detection. To facilitate instruction-following we add prompts and other data augmentations to these tasks to allow Unified-IO 2 to generalize these skills to new tasks zero-shot. Unified-IO 2 is the first model to be trained on such a diverse and wide-reaching set of skills and unify three separate generation capabilities. Unified-IO 2 achieves state-of-the-art performance on the multi-task GRIT benchmark and achieves strong results on 30 diverse datasets including SEED-Bench image and video understanding TIFA image generation VQA 2.0 ScienceQA VIMA robotic manipulation VGG-Sound and Kinetics-Sounds and can perform unseen tasks and generate free-form responses. We release our model and code to facilitate future work.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jiasen Lu , Christopher Clark , Sangho Lee , Zichen Zhang , Savya Khosla , Ryan Marten , Derek Hoiem , Aniruddha Kembhavi

Topics

Deep Learning > Architectures > Transformers Computer Vision > Analysis > Object Detection Computer Vision > Generation > Image Generation

Keywords

transformer architecture image generation multimodal learning instruction following autoregressive model vision language model

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024