ACL 2024
InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model
Abstract
In this work, we present InfiMM, an advanced Multimodal Large Language Model that adapts to intricate vision-language tasks. InfiMM, inspired by the Flamingo architecture, distinguishes itself through the utilization of large-scale training data, comprehensive training strategies, and diverse large language models. This approach ensures the preservation of Flamingo’s foundational strengths while simultaneously introducing augmented capabilities. Empirical evaluations across a variety of benchmarks underscore InfiMM’s remarkable capability in multimodal understanding. The code can be found at: https://anonymous.4open.science/r/infimm-zephyr-F60C/.
Authors
Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, Hongxia Yang