Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models

Jie Liu; Wenxuan Wang; Su Yihang; Jingyuan Huang; Yudi Zhang; Cheng-Yi Li; Wenting Chen; Xiaohan Xing; Kao-Jung Chang; Linlin Shen; Michael R. Lyu

ACL 2025

Abstract

The significant breakthroughs of Medical Multi-Modal Large Language Models (Med-MLLMs) are reshaping modern healthcare with robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are unsuitable for Med-MLLMs because real-world diagnostic frameworks are intricate: they encompass diverse medical specialties and involve complex clinical decisions. Thus, a clinically representative benchmark is highly desirable for credible Med-MLLM evaluation. To this end, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs in terms of distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratifying clinical tasks into 3 main categories and 8 sub-categories, and avoiding overlap with existing VQA datasets. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists, providing insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs’ capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments.

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Domain Adaptation
Healthcare & Medicine > Clinical > Medical Imaging
Artificial Intelligence > Core AI > Large Language Models

Keywords

benchmark evaluation; healthcare application; clinical evaluation; medical multi-modal large language model; clinical decision support; diagnostic framework; medical multi-modal; medical specialty; diagnostic capacity
