AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition

Zhaorun Chen; Zhuokai Zhao; Zhihong Zhu; Ruiqi Zhang; Xiang Li; Bhiksha Raj; Huaxiu Yao

2024 NAACL NAACL 2024

AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition

Abstract

AbstractRecent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, in this paper, we propose a novel self-supervised framework **AutoPRM** that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, **AutoPRM** first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially apply reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that **AutoPRM** significantly improves performance on mathematical and commonsense reasoning tasks over SOTA. More encouragingly, **AutoPRM** can be easily integrated with other orthogonal reasoning pipelines.

🧭 Keyword Pioneer — procedural supervision

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zhaorun Chen , Zhuokai Zhao , Zhihong Zhu , Ruiqi Zhang , Xiang Li , Bhiksha Raj , Huaxiu Yao

Topics

Machine Learning > Learning Types > Self-Supervised Learning Machine Learning > Optimization & Theory > Stochastic Processes

Keywords

reinforcement learning question decomposition multi-step reasoning large language model procedural supervision reward tampering

Download PDF

Related papers

Working Alliance Transformer for Psychotherapy Dialogue Classification 2024

Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences 2024

Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study 2024

TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation 2024

Extractive Summarization with Text Generator 2024