Token-level Proximal Policy Optimization for Query Generation

Yichen Ouyang; Lu Wang; Fangkai Yang; Pu Zhao; Chenghua Huang; Jianfeng Liu; Bochen Pang; Yaming Yang; Yuefeng Zhan; Hao Sun; Qingwei Lin; Saravan Rajmohan; Weiwei Deng; Dongmei Zhang; Feng Sun

EMNLP 2025

Abstract

Query generation is a critical task for web search engines (e.g., Google, Bing) and recommendation systems. Recent state-of-the-art query generation methods leverage Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still struggle to generate high-quality queries, particularly when inferring user intent from web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to enable LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm and consists of a token-level reward model and a token-level proximal policy optimization module, which together address the sparse reward challenge in traditional RLAIF frameworks. We conducted experiments on both an open-source dataset and an industrial dataset collected from a globally used search engine, demonstrating that TPPO significantly improves the query generation performance of LLMs and outperforms existing competitors.
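The abstract does not spell out how token-level rewards enter the optimization, so the following is only an illustrative sketch of the sparse reward issue it mentions, not the authors' implementation. It assumes a standard Generalized Advantage Estimation (GAE) step, as commonly used with PPO, and compares a sequence-level reward placed on the final token with hypothetical per-token rewards. The function names, reward values, and zero critic estimates are all assumptions made for illustration.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation over one generated sequence.

    rewards[t] is the reward assigned to token t and values[t] is a critic
    estimate for the state before token t. The value after the last token
    is taken to be 0 (end of the episode).
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error for token t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def sparse_rewards(sequence_score: float, length: int) -> List[float]:
    """Sequence-level reward: zero everywhere except the final token."""
    return [0.0] * (length - 1) + [sequence_score]


# Hypothetical per-token scores for a three-token query (assumed values).
token_scores = [1.0, -1.0, 2.0]
values = [0.0, 0.0, 0.0]  # assumed critic estimates, zero for simplicity

dense_adv = gae_advantages(token_scores, values, gamma=1.0, lam=1.0)
sparse_adv = gae_advantages(sparse_rewards(sum(token_scores), 3), values,
                            gamma=1.0, lam=1.0)
print("token-level:", dense_adv)
print("sequence-level:", sparse_adv)
```

Under these assumptions the sequence-level reward gives every token the same advantage, whereas the token-level rewards give the second token a smaller advantage than the first and third, so the update can distinguish individual tokens.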

Topics

Natural Language Processing > Applications > Information Retrieval
Reinforcement Learning > Methods > Deep RL
Reinforcement Learning > Methods > Policy Learning
Deep Learning > Learning Types > Reinforcement Learning
Deep Learning > Learning Types > Large Language Models

Keywords

reinforcement learning; reward model; query generation; user intent; large language model; fine-tuning; reinforcement learning from AI feedback; token-level proximal policy optimization; sparse reward modeling; token-level optimization
