Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Tianduo Wang; Shichen Li; Wei Lu

ACL 2024

Abstract

Teaching small-scale language models to perform math reasoning is a valuable yet challenging task. Besides obtaining labeled data from human experts, one of the most common ways to collect high-quality data is to sample from a larger and more powerful language model. Although previous work has demonstrated the effectiveness of this method, such a knowledge distillation paradigm can be costly and unstable, especially because many large language models, such as GPT-4, are closed-source and proprietary, and their behavior is unpredictable. In this work, to avoid relying on outputs from large models, we demonstrate that the reasoning abilities of small-scale language models can be enhanced through self-training, which involves training models on their own outputs. We also show that vanilla self-training can be further augmented by an alignment algorithm, direct preference optimization (DPO). We found empirically that models trained with the DPO objective produce better generations, which substantially benefits multi-turn self-training. Experiments show that our models outperform state-of-the-art models of comparable size on a series of downstream math reasoning tasks with minimal resource requirements.
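
To illustrate the DPO objective mentioned in the abstract, the following is a minimal sketch of the standard per-pair DPO loss. It is not the authors' implementation. The function name, the argument names, and the default `beta` of 0.1 are assumptions made for illustration. The inputs are summed token log-probabilities of a preferred and a dispreferred response under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    All names and the default beta are illustrative assumptions.
    """
    # How much more (in log space) the policy favors each response than the reference does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign of m for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy and the reference agree exactly, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the policy's log-probability of the preferred response lowers the loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, and `beta` controls how strongly deviations from the reference are weighted.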

Topics

Artificial Intelligence > Core AI > Foundation Models
Machine Learning > Learning Types > Self-Supervised Learning
Natural Language Processing > Understanding > Semantic Analysis
Artificial Intelligence > Core AI > Reasoning
Deep Learning > Learning Types > Reinforcement Learning

Keywords

mathematical reasoning; direct preference optimization; chain-of-thought reasoning; language model; math reasoning
