Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Yen-Ting Lin; Di Jin; Tengyu Xu; Tianhao Wu; Sainbayar Sukhbaatar; Chen Zhu; Yun He; Yun-Nung Chen; Jason E Weston; Yuandong Tian; Arash Rahnama; Sinong Wang; Hao Ma; Han Fang

EMNLP 2025

Abstract

Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
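To make the two feedback levels concrete, the following is a minimal sketch of how a sampled solution might carry both process-level and outcome-level binary labels, and how Pass@1 could be computed from the outcome labels. It is an illustration only, not the paper's implementation: the `AnnotatedSolution` class, its field names, and both helper functions are hypothetical, and the abstract does not specify how the labels are produced or how the training objective uses them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedSolution:
    """One sampled solution with binary feedback (hypothetical structure)."""
    steps: List[str]
    step_correct: List[bool]  # process-level feedback, one flag per step
    answer_correct: bool      # outcome-level feedback on the final answer


def first_error_index(sol: AnnotatedSolution) -> int:
    """Return the index of the first step flagged incorrect, or -1 if none."""
    for i, ok in enumerate(sol.step_correct):
        if not ok:
            return i
    return -1


def pass_at_1(solutions: List[AnnotatedSolution]) -> float:
    """Fraction of problems whose single sampled solution has a correct final answer."""
    if not solutions:
        raise ValueError("need at least one solution")
    return sum(s.answer_correct for s in solutions) / len(solutions)


# Toy data (invented for illustration): one solution per problem.
good = AnnotatedSolution(
    steps=["2x = 6", "x = 3"],
    step_correct=[True, True],
    answer_correct=True,
)
bad = AnnotatedSolution(
    steps=["2x = 6", "x = 4"],
    step_correct=[True, False],
    answer_correct=False,
)

print(pass_at_1([good, bad]))   # 0.5
print(first_error_index(bad))   # 1
```

Keeping the step flags separate from the final-answer flag lets a solution that reaches the right answer through a flawed step be distinguished from one whose reasoning is sound throughout, which is the distinction the abstract draws between outcome-only and stepwise feedback.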

Topics

Artificial Intelligence > Learning Paradigms > Meta-Learning
Machine Learning > Optimization & Theory > Neural Network Optimization

Keywords

mathematical reasoning, chain-of-thought prompting, binary feedback, reasoning trajectory, step-wise feedback, pass@1 accuracy
