Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang; Lei Li; Zhihong Shao; Runxin Xu; Damai Dai; Yifei Li; Deli Chen; Yu Wu; Zhifang Sui

ACL 2024

Abstract

In this paper, we present Math-Shepherd, a process reward model for mathematics that assigns a reward score to each step of a math problem solution. Math-Shepherd is trained on automatically constructed process-wise supervision data, removing the heavy reliance on manual annotation that bottlenecks existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is used to rerank multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning (RL): Math-Shepherd is used to reinforce LLMs. With Math-Shepherd, a series of open-source LLMs achieves strong performance. For instance, process RL with Math-Shepherd improves Mistral-7B from 77.9% to 84.1% on GSM8K and from 28.6% to 33.0% on MATH. Adding verification with Math-Shepherd further raises accuracy to 89.1% and 43.5% on the two benchmarks, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
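The verification scenario can be illustrated with a minimal sketch. It assumes that a solution is a list of step strings, that a reward model returns one score per step, and that step scores are combined by taking their minimum; the abstract does not state the aggregation rule, so the minimum is an assumption here. The names `aggregate`, `rerank`, `toy_score_steps`, and `TOY_SCORES` are hypothetical, and the toy scorer is a lookup table standing in for a trained reward model.

```python
from typing import Callable, List, Sequence

Solution = List[str]  # one string per reasoning step


def aggregate(step_scores: Sequence[float]) -> float:
    """Combine per-step rewards into one solution-level score.

    Assumption: the weakest step decides the score (minimum).
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    return min(step_scores)


def rerank(
    candidates: Sequence[Solution],
    score_steps: Callable[[Solution], Sequence[float]],
) -> List[Solution]:
    """Return the candidates ordered from highest to lowest aggregated score."""
    return sorted(
        candidates,
        key=lambda solution: aggregate(score_steps(solution)),
        reverse=True,
    )


# Hypothetical stand-in for a process reward model: fixed per-step scores.
TOY_SCORES = {
    "2 + 3 = 5": 0.95,
    "5 * 4 = 20": 0.90,
    "2 + 3 = 6": 0.10,  # wrong step, so it receives a low reward
    "6 * 4 = 24": 0.85,
}


def toy_score_steps(solution: Solution) -> List[float]:
    """Look up a score for every step of the solution."""
    return [TOY_SCORES[step] for step in solution]


candidates = [
    ["2 + 3 = 6", "6 * 4 = 24"],
    ["2 + 3 = 5", "5 * 4 = 20"],
]
best = rerank(candidates, toy_score_steps)[0]
print(best)
```

With minimum aggregation, a single low-scoring step is enough to push a candidate down the ranking, even when its remaining steps score well. Other aggregation rules, such as the product of step scores, would slot into `aggregate` without changing `rerank`.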

Topics

Artificial Intelligence > Core AI > Foundation Models
Machine Learning > Optimization & Theory > Optimization

Keywords

reinforcement learning, mathematical reasoning, step-by-step verification, process reward model, large language model
