ModaVerse: Efficiently Transforming Modalities with LLMs

Xinyu Wang; Bohan Zhuang; Qi Wu

CVPR 2024

Abstract

Humans possess the capability to comprehend diverse modalities and seamlessly transfer information between them. In this work, we introduce ModaVerse, a Multi-modal Large Language Model (MLLM) capable of comprehending and transforming content across various modalities, including images, videos, and audio. Predominant MLLM frameworks have largely relied on aligning the latent spaces of textual and non-textual features. This alignment process, which synchronizes a language model trained on textual data with encoders and decoders trained on multi-modal data, often necessitates extensive training of several projection layers in multiple stages. Inspired by LLM-as-agent methodologies, we propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language. It aligns the LLM's output with the input of generative models, avoiding the complexities associated with latent feature alignments and simplifying the multiple training stages of existing MLLMs into a single, efficient process. By conducting experiments on several benchmarks, we demonstrate that our approach attains comparable performance with the state of the art while achieving considerable efficiencies in data usage.
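
To make the idea of alignment "at the level of natural language" concrete, here is a minimal sketch of how an LLM's text output could be routed to a generative model without any latent-space projection. It is an illustration only, not the authors' implementation: the two-field `Model:` / `Prompt:` format, the names `Invocation`, `parse_meta_response`, `dispatch`, and `GENERATORS`, and the stub generators are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Invocation:
    model: str   # which generative model the LLM asked for (hypothetical field)
    prompt: str  # natural-language prompt to hand to that model


def parse_meta_response(text: str) -> Invocation:
    """Parse a hypothetical 'Model: ... / Prompt: ...' text emitted by the LLM."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        # Split on the first colon only, so prompts may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    if "model" not in fields or "prompt" not in fields:
        raise ValueError("meta-response needs 'model' and 'prompt' fields")
    return Invocation(fields["model"], fields["prompt"])


# Stand-ins for real text-to-X generators; each takes a text prompt.
GENERATORS: Dict[str, Callable[[str], str]] = {
    "text-to-image": lambda p: f"<image generated from: {p}>",
    "text-to-video": lambda p: f"<video generated from: {p}>",
    "text-to-audio": lambda p: f"<audio generated from: {p}>",
}


def dispatch(text: str) -> str:
    """Route the LLM's text output to the generator it names."""
    inv = parse_meta_response(text)
    if inv.model not in GENERATORS:
        raise ValueError(f"unknown generator: {inv.model}")
    return GENERATORS[inv.model](inv.prompt)


if __name__ == "__main__":
    llm_output = "Model: text-to-image\nPrompt: a dog running on a beach"
    print(dispatch(llm_output))
```

In this sketch the only interface between the language model and the generators is plain text, so the only thing that needs training is the LLM's ability to emit the agreed format. That is the intuition behind replacing multi-stage projection-layer training with a single stage.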

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Application Areas > Efficient Computing; Deep Learning > Models > Generative Models

Keywords

generative model; multi-modal large language model; latent space alignment; modality transformation; input output alignment
