Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

Demin Song; Honglin Guo; Yunhua Zhou; Shuhao Xing; Yudong Wang; Zifan Song; Wenwei Zhang; Qipeng Guo; Hang Yan; Xipeng Qiu; Dahua Lin

2024 ACL ACL 2024

Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

Abstract

AbstractThe programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on code-focused LLMs’ performance by assessing the comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent improvements in performance on two widely-used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating comments and the model further trained on the data without augmentation.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — comment augmentation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Demin Song , Honglin Guo , Yunhua Zhou , Shuhao Xing , Yudong Wang , Zifan Song , Wenwei Zhang , Qipeng Guo , Hang Yan , Xipeng Qiu , Dahua Lin

Topics

Machine Learning > Application Areas > Data Augmentation Natural Language Processing > Resources & Methods > Large Language Models

Keywords

data augmentation code generation pre-training datum code understanding large language model comment augmentation

Download PDF

Related papers

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs 2024

EtymoLink: A Structured English Etymology Dataset 2024

Turkish Delights: A Dataset on Turkish Euphemisms 2024

Subjectivity Detection in English News using Large Language Models 2024

Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better 2024