The Revolution of Multimodal Large Language Models: A Survey

Davide Caffagni; Federico Cocchi; Luca Barsellotti; Nicholas Moratelli; Sara Sarto; Lorenzo Baraldi; Marcella Cornia; Rita Cucchiara

ACL 2024

Abstract

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Computer Vision > Generation > Image Generation

Keywords

image generation; image editing; visual grounding; instruction following; multimodal large language model; multimodal alignment
