AAAI 2026

MedGR2: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

Abstract

The application of vision-language models in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised fine-tuning on existing datasets often leads to poor generalization on unseen modalities and tasks, while reinforcement learning, a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To address this challenge, we propose a Generative Reward Learning framework that establishes a self-improving training cycle. The framework jointly develops a data generator and a reward model, enabling the automated and continuous creation of high-quality multimodal medical data that serves as an effective training source for post-training. Our experiments demonstrate that supervised fine-tuning using the generated data already surpasses models trained on large-scale human-curated datasets. More importantly, when the generated data is further leveraged for reinforcement learning via Group Relative Policy Optimization, the resulting model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized reinforcement-learning-based methods. Notably, a compact model trained under this framework attains performance competitive with foundation models containing more than an order of magnitude more parameters. These results suggest a new paradigm for data-efficient learning in high-stakes medical domains, shifting the bottleneck from data scarcity to data generation and unlocking the potential of reinforcement learning for building robust and generalizable medical AI systems.
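For readers unfamiliar with Group Relative Policy Optimization, the sketch below illustrates the group-relative advantage it is generally described as using: several responses are sampled for the same prompt, each is scored, and every score is normalized against the mean and standard deviation of its own group. This is a minimal illustration and not the implementation used in this work; the function name `group_relative_advantages`, the `eps` stabilizer, the choice of population standard deviation, and the example scores are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other rewards in its sampling group.

    `rewards` holds one scalar score per response sampled for a single prompt.
    `eps` is a hypothetical stabilizer that avoids division by zero when all
    responses in the group receive the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four responses to one medical question.
scores = [0.9, 0.2, 0.5, 0.4]
advantages = group_relative_advantages(scores)
```

Responses scored above the group mean receive a positive advantage and those below receive a negative one, so no separately learned value function is needed to obtain a baseline.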
