Reflect, Rewrite, Repeat: How Simple Arithmetic Enables Advanced Reasoning in Small Language Models

Mengdie Flora Wang; Haochen Xie; Mun Young Kim; Baishali Chaudhury; Meghana Ashok; Suren Gunturu; Sungmin Hong; Jae Oh Woo

2026 EACL EACL 2026

Reflect, Rewrite, Repeat: How Simple Arithmetic Enables Advanced Reasoning in Small Language Models

Abstract

AbstractContemporary advancements in language model reasoning typically require computationally intensive reinforcement learning (RL) and massive datasets, creating barriers for resource-constrained teams. In this work, we demonstrate that high-quality, iterative training on minimal data can rival modern RL approaches. We introduce a resource-efficient framework that combines Direct Preference Optimization (DPO) and Supervised Fine-Tuning (SFT) with selective guidance from larger models, iteratively refining solutions through a "reflect, rewrite, repeat" cycle (R3). Using Qwen 2.5 7B and Qwen 2.5 Math 7B as base models, our method shows meaningful performance improvements across arithmetic, symbolic and cognitive reasoning benchmarks—including GSM8K (83.1% → 88.6%), AIME’25@10 (20.0% → 30.0%) and LastLetterConcat (40.7% → 53.3%) problems. The model-agnostic nature of our R3 framework is further demonstrated through substantial improvements when applied to Mistral and LLaMA-based models. Remarkably, these gains are achieved using mere 700 basic arithmetic training samples, in stark contrast to the hundreds of thousands of examples typically required by RL-based systems. Our results suggest that reasoning improvements need not strictly depend on large-scale data. By emphasizing strategically curated training grounded in foundational principles, we achieve competitive generalization with minimal resource overhead. Our R3 pipeline also generates high-quality SFT data with high-fidelity reasoning traces as byproduct, further enabling scalable and annotation-free fine-tuning. Code is available.[<https://github.com/aws-samples/sample-for-reflect-rewrite-repeat>]

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio

Authors

Mengdie Flora Wang , Haochen Xie , Mun Young Kim , Baishali Chaudhury , Meghana Ashok , Suren Gunturu , Sungmin Hong , Jae Oh Woo

Topics

Machine Learning > Learning Types > Self-Supervised Learning Natural Language Processing > Generation > Language Modeling Machine Learning > Learning Types > Reinforcement Learning

Keywords

direct preference optimization supervised fine-tuning iterative training arithmetic reasoning

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026