OneLLM: One Framework to Align All Modalities with Language

Jiaming Han; Kaixiong Gong; Yiyuan Zhang; Jiaqi Wang; Kaipeng Zhang; Dahua Lin; Yu Qiao; Peng Gao; Xiangyu Yue

CVPR 2024

Abstract

Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
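
The idea of a universal projection module built by "mixing multiple image projection modules and dynamic routing" can be illustrated with a small toy sketch. This is not the authors' implementation: the class name `ToyUniversalProjection`, the use of plain linear maps as stand-ins for the projection modules, the mean-pooled router input, and the softmax (soft) routing are all assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyUniversalProjection:
    """Hypothetical stand-in for a UPM: several projection 'experts' mixed by a router.

    Assumption: each expert is a single linear map and routing is soft
    (a softmax over experts); the paper's abstract does not specify either.
    """

    def __init__(self, dim_in, dim_out, num_experts=3, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        # One (dim_out x dim_in) matrix per expert projection module.
        self.experts = [rand_matrix(dim_out, dim_in) for _ in range(num_experts)]
        # Router maps a pooled input feature to one score per expert.
        self.router = rand_matrix(num_experts, dim_in)

    def routing_weights(self, tokens):
        """Mean-pool the input tokens and return one mixing weight per expert."""
        n = len(tokens)
        pooled = [sum(col) / n for col in zip(*tokens)]
        return softmax(matvec(self.router, pooled))

    def __call__(self, tokens):
        """Project every token with every expert, then mix by the routing weights."""
        weights = self.routing_weights(tokens)
        projected = []
        for tok in tokens:
            outs = [matvec(expert, tok) for expert in self.experts]
            mixed = [
                sum(w * out[j] for w, out in zip(weights, outs))
                for j in range(len(outs[0]))
            ]
            projected.append(mixed)
        return projected


# Usage: two 4-dimensional encoder tokens projected to 2 dimensions.
upm = ToyUniversalProjection(dim_in=4, dim_out=2, num_experts=3)
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
mixed_tokens = upm(tokens)
```

In this sketch the routing weights depend on the input, so different modalities could in principle lean on different experts while sharing one module. Whether the actual model routes per sample, per token, or per modality is not stated in the abstract.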

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

cross-modal learning, instruction following, dynamic routing, modality alignment, instruction tuning, multimodal large language model, multimodal alignment, unified encoder, unified framework, large language model
