Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning

Aofei Chang; Le Huang; Alex James Boyd; Parminder Bhatia; Taha Kass-Hout; Cao Xiao; Fenglong Ma

ACL 2025

Abstract

Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A3Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A3Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A3MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A3Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
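To make the idea of attention alignment concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the loss or the head-selection rule, so both are assumptions chosen for illustration. The sketch assumes a per-patch attention vector, a binary weak label per patch (standing in for a refined SAM mask), and a precomputed per-head score for how much attention each head places on image tokens. The names `mass_in_region`, `alignment_loss`, and `select_heads` are hypothetical.

```python
import math


def mass_in_region(attn, mask):
    """Fraction of attention that falls on patches marked 1 in the weak label."""
    total = sum(attn)
    if total == 0:
        return 0.0
    return sum(a for a, m in zip(attn, mask) if m) / total


def alignment_loss(attn, mask, eps=1e-8):
    """Assumed loss: negative log of the attention mass inside the labeled region.

    The loss approaches 0 when all attention lies inside the region and
    grows as attention drifts outside it.
    """
    return -math.log(mass_in_region(attn, mask) + eps)


def select_heads(head_visual_mass, k):
    """Assumed selection rule: indices of the k heads with the largest visual attention mass."""
    ranked = sorted(range(len(head_visual_mass)), key=lambda i: -head_visual_mass[i])
    return ranked[:k]


# Toy example: four image patches, the last two marked as relevant.
attn = [0.1, 0.2, 0.3, 0.4]
mask = [0, 0, 1, 1]
loss = alignment_loss(attn, mask)

# Toy example: three heads, tune only the two most visually focused ones.
heads_to_tune = select_heads([0.2, 0.9, 0.5], k=2)
```

In a fine-tuning setup along these lines, a term like `alignment_loss` would be applied only to the heads returned by `select_heads`, which is one way to read "selectively modifies visually critical attention heads ... while minimizing interference".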

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Natural Language Processing > Resources & Methods > Large Language Models; Healthcare & Medicine > Clinical > Medical AI; Deep Learning > Models > Vision-Language Models

Keywords

visual question answering; attention mechanism; vision-language alignment; parameter-efficient tuning; medical vision-language model; attention tuning; visual hallucination; medical VQA
