GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning

Chenglong Wang; Yongyu Mu; Hang Zhou; Yifu Huo; Ziming Zhu; Jiali Zeng; Murun Yang; Bei Li; Xiaoyang Hao; Chunliang Zhang; Fandong Meng; Jingbo Zhu; Tong Xiao

AAAI 2026

Abstract

Major progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs to generalist reward models. Despite this trend, developing effective reward models still faces a fundamental challenge: heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short in instilling explicit reasoning capabilities into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to scale up reward reasoning in reward models. Based on this approach, we develop GRAM-R², a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R² can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It also supports downstream applications such as policy optimization and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R² consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
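The abstract does not spell out the training procedure, so the following is only a generic illustration of self-training (pseudo-labeling) on preference pairs, not the GRAM-R² method. It assumes a toy setup: each response is a small feature vector, a linear Bradley-Terry scorer stands in for the generative reward model, and the "rationale" is a templated string. All names (`train`, `judge`, `self_train`, `threshold`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train(pairs, dim, epochs=50, lr=0.5):
    """Fit a linear Bradley-Terry scorer on (feat_a, feat_b, label) triples.

    label == 1 means response A is preferred, 0 means B is preferred.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for a, b, y in pairs:
            diff = [x - z for x, z in zip(a, b)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            for i in range(dim):
                w[i] += lr * (y - p) * diff[i]  # gradient ascent on log-likelihood
    return w


def judge(w, a, b):
    """Return (label, confidence, rationale) for a response pair.

    The rationale is a toy template; a generative reward model would
    produce free-form text instead.
    """
    diff = [x - z for x, z in zip(a, b)]
    p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
    label = 1 if p >= 0.5 else 0
    confidence = max(p, 1.0 - p)
    top = max(range(len(w)), key=lambda i: abs(w[i] * diff[i]))
    rationale = f"Feature {top} contributes most to preferring {'A' if label else 'B'}."
    return label, confidence, rationale


def self_train(labeled, unlabeled, dim, rounds=3, threshold=0.9):
    """Grow the training set with confident pseudo-labels, then retrain."""
    train_set = list(labeled)
    pool = list(unlabeled)
    w = train(train_set, dim)
    for _ in range(rounds):
        confident, remaining = [], []
        for a, b in pool:
            label, conf, _ = judge(w, a, b)
            if conf >= threshold:
                confident.append((a, b, label))  # accept the pseudo-label
            else:
                remaining.append((a, b))
        if not confident:
            break
        train_set.extend(confident)
        pool = remaining
        w = train(train_set, dim)
    return w, train_set


if __name__ == "__main__":
    random.seed(0)
    rand_pair = lambda: ([random.random(), random.random()],
                         [random.random(), random.random()])
    # Assumed ground truth for the toy data: feature 0 decides the preference.
    labeled = [(a, b, int(a[0] > b[0])) for a, b in (rand_pair() for _ in range(20))]
    unlabeled = [rand_pair() for _ in range(200)]
    w, train_set = self_train(labeled, unlabeled, dim=2)
    print(judge(w, [0.9, 0.1], [0.2, 0.8]))
```

The confidence threshold is the main design choice in this kind of loop: a higher value admits fewer but cleaner pseudo-labels, while a lower value grows the training set faster at the risk of reinforcing the scorer's own mistakes.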

Authors

Chenglong Wang, Yongyu Mu, Hang Zhou, Yifu Huo, Ziming Zhu, Jiali Zeng, Murun Yang, Bei Li, Xiaoyang Hao, Chunliang Zhang, Fandong Meng, Jingbo Zhu, Tong Xiao

Topics

Artificial Intelligence > Core AI > Foundation Models
Machine Learning > Learning Types > Self-Supervised Learning

Keywords

preference learning, generative model, reward model, foundation model, reward reasoning
