SelfRACG: Enabling LLMs to Self-Express and Retrieve for Code Generation

Qian Dong; Jia Chen; Qingyao Ai; Hongning Wang; Haitao Li; Yiwu; Yao Hu; Yiqun Liu; Shaoping Ma

EMNLP 2025

Abstract

Existing retrieval-augmented code generation (RACG) methods typically use an external retrieval module to fetch semantically similar code snippets, which are then used to generate subsequent fragments. However, even consecutive code fragments often diverge in content because of logical progression, resulting in a content gap. This gap undermines the performance of current RACG methods, because external retrieval modules based on content matching fail to infer the specific information need of the LLM when it generates the next code fragment. Therefore, we propose SelfRACG, a novel paradigm that enables large language models (LLMs) to Self-express their information needs to enhance RACG. Specifically, SelfRACG includes an information need expression module and a two-stage information need-guided training strategy, which together encourage LLMs to express their information needs. Extensive experiments demonstrate that SelfRACG can retrieve external knowledge that better aligns with the LLM’s own information needs, resulting in superior generation performance compared to vanilla RACG. Moreover, both the training and deployment costs for retrieval in our framework are much lower than those of the strongest retrieval model.
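The content gap described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the three-dimensional vectors, the snippet names, and the idea of passing a separate "need" vector are all hypothetical, chosen only to show why a query built from the content of the last fragment may retrieve something different from a query that encodes what the next fragment requires.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec, corpus, k=1):
    """Return the names of the k corpus entries most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings; the axes loosely stand for
# "file opening", "parsing", and "error handling".
corpus = [
    ("open_file_helper", [0.9, 0.1, 0.0]),
    ("parse_rows_helper", [0.1, 0.9, 0.1]),
    ("log_error_helper", [0.0, 0.2, 0.9]),
]

# Content matching: the query embeds the fragment that was just written
# (here, code that opens a file).
content_query = [1.0, 0.1, 0.0]

# Assumed information-need query: the model signals that the next
# fragment is about parsing, even though the last one was about opening.
need_query = [0.2, 1.0, 0.1]

content_hit = retrieve(content_query, corpus)
need_hit = retrieve(need_query, corpus)
print(content_hit, need_hit)
```

In this toy setup the content-based query returns the snippet that resembles what was already written, while the need-based query returns the snippet relevant to the next step. How SelfRACG actually produces and trains such a need representation is described in the paper, not here.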

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Learning Paradigms > Transfer Learning
Natural Language Processing > Generation > Text Generation
Computer Science > Applications > Software Engineering
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Large Language Models
Natural Language Processing > Generation > Retrieval-Augmented Generation

Keywords

information retrieval, code generation, semantic similarity, retrieval-augmented generation, semantic retrieval, two-stage training, large language model, semantic code retrieval, external retrieval
