InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model

Haogeng Liu; Quanzeng You; Yiqi Wang; Xiaotian Han; Bohan Zhai; Yongfei Liu; Wentao Chen; Yiren Jian; Yunzhe Tao; Jianbo Yuan; Ran He; Hongxia Yang

ACL 2024

Abstract

In this work, we present InfiMM, an advanced Multimodal Large Language Model that adapts to intricate vision-language tasks. InfiMM, inspired by the Flamingo architecture, distinguishes itself through the utilization of large-scale training data, comprehensive training strategies, and diverse large language models. This approach ensures the preservation of Flamingo’s foundational strengths while simultaneously introducing augmented capabilities. Empirical evaluations across a variety of benchmarks underscore InfiMM’s remarkable capability in multimodal understanding. The code can be found at: https://anonymous.4open.science/r/infimm-zephyr-F60C/.

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning
Deep Learning > Models > Vision-Language Models

Keywords

multimodal learning, vision language model, multimodal large language model, image understanding, multimodal understanding, visual language, large language model, visual language understanding
