Out of Distribution, Out of Luck: Process Rewards Misguide Reasoning Models

Alexey Dontsov; Anton Korznikov; Andrey V. Galichin; Elena Tutubalina

EACL 2026

Abstract

Process Reward Models (PRMs) have emerged as a promising approach for guiding large language models (LLMs) through multi-step reasoning by providing step-level feedback during inference. However, our evaluation across 7 LLMs reveals a failure mode: while PRMs improve the performance of instruction-tuned mathematical models, they fail to improve, and sometimes degrade, the performance of reasoning models. Through systematic analysis with linear probes, we identify distinct reward prediction patterns that differentiate reasoning from non-reasoning model outputs. To understand the mechanism behind this difference, we train Sparse Autoencoders (SAEs) on Qwen2.5-Math-PRM and analyze its reasoning features. Our analysis reveals that 80% of these features respond to formatting artifacts (whitespace patterns, Unicode tokens, punctuation) rather than mathematical content. Reasoning model outputs exhibit distinct metacognitive patterns that are absent from standard mathematical solutions, which explains why they lead to unreliable reward estimation. Our findings expose a fundamental limitation in applying existing reward models to reasoning systems and provide mechanistic insights into this failure mode. We release our trained SAEs to facilitate future research into reward model interpretability.

Topics

Artificial Intelligence > Core AI > Interpretability
Machine Learning > Optimization & Theory > Learning Theory

Keywords

sparse autoencoder, reasoning model, reward estimation, process reward model, linear probe, formatting artifact
