Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning

Qianxi He; Qingyu Ren; Shanzhe Lei; Xuhong Wang; Yingchun Wang

EMNLP 2025

Abstract

Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, most current methods rely on rule-based evaluations of answer correctness, overlooking the importance of confidence-aware reasoning, especially for small to medium-sized models. These models are often rewarded for speculative answers even when they do not generate coherent reasoning chains. To address this limitation, we propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our code and model at https://github.com/qianxiHe147/C2RM.
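The two ideas named in the abstract, penalizing low-confidence correct responses and Best-of-N inference, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the scalar `confidence` input, and the `threshold` of 0.5 are hypothetical, and the abstract does not say how confidence is actually estimated.

```python
from typing import Callable, Sequence


def label_response(is_correct: bool, confidence: float, threshold: float = 0.5) -> int:
    """Hypothetical labeling rule: a response counts as positive (1) only if it
    is correct AND its confidence reaches the threshold. Incorrect answers and
    low-confidence correct answers are both labeled negative (0)."""
    return 1 if is_correct and confidence >= threshold else 0


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: return the candidate that the scoring function
    (for example, a reward model) rates highest."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=score)
```

In this sketch, `score` stands in for any reward model that maps a response to a scalar. A correct answer reached by guessing would be labeled 0 under `label_response`, which is the behavior the abstract describes.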

Topics

Machine Learning > Core Methods > Regression
Machine Learning > Learning Types > Adversarial Learning
Machine Learning > Optimization & Theory > Stochastic Processes
Machine Learning > Learning Types > Reinforcement Learning
Machine Learning > Learning Types > Representation Learning
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Reinforcement Learning
Artificial Intelligence > Core AI > Reinforcement Learning

Keywords

reinforcement learning, reward modeling, chain-of-thought reasoning, confidence estimation, large language model, confidence-aware reasoning, PPO training, reasoning capabilities, STEM reasoning
