Med-VRAgent: A Framework for Medical Visual Reasoning-Enhanced Agents

Guangfu Guo; Xiaoqian Lu; Yue Feng

2025 EMNLP EMNLP 2025

Med-VRAgent: A Framework for Medical Visual Reasoning-Enhanced Agents

Abstract

AbstractVision-language models (VLMs) achieve promising results in medical reasoning but struggle with hallucinations, vague descriptions, Inconsistent logic and poor localization. To address this, we propose a agent framework named Medical Visual Reasoning Agent (Med-VRAgent). The approach is based on Visual Guidance and Self-Reward paradigms and Monte Carlo Tree Search (MCTS). By combining the Visual Guidance with tree search, Med-VRAgent improves the medical visual reasoning capabilities of VLMs. We use the trajectories collected by Med-RAgent as feedback to further improve the performance by fine-tuning the VLMs with the proximal policy optimization (PPO) objective. Experiments on multiple medical VQA benchmarks demonstrate that our method outperforms existing approaches.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Reinforcement Learning

🧭 Keyword Pioneer — medical visual reasoning agent

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy

Authors

Guangfu Guo , Xiaoqian Lu , Yue Feng

Topics

Artificial Intelligence > Core AI > Agent Systems Artificial Intelligence > Core AI > Planning Reinforcement Learning > Methods > Deep RL

Keywords

monte carlo tree search vision language model proximal policy optimization medical visual question answering medical visual reasoning agent self reward paradigm

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025