CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Zhangyin Feng; Daya Guo; Duyu Tang; Nan Duan; Xiaocheng Feng; Ming Gong; Linjun Shou; Bing Qin; Ting Liu; Daxin Jiang; Ming Zhou

2020 EMNLP EMNLP 2020

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Abstract

AbstractWe present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both “bimodal” data of NL-PL pairs and “unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NLPL probing.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Natural Language Processing

🧭 Keyword Pioneer — natural language code search

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zhangyin Feng , Daya Guo , Duyu Tang , Nan Duan , Xiaocheng Feng , Ming Gong , Linjun Shou , Bing Qin , Ting Liu , Daxin Jiang , Ming Zhou

Topics

Deep Learning > Architectures > Transformers Deep Learning > Techniques > Pretraining Natural Language Processing > Resources & Methods > Large Language Models Artificial Intelligence > Core AI > Large Language Models Deep Learning > Models > Transformers Deep Learning > Learning Types > Representation Learning

Keywords

transformer architecture natural language processing pre-trained model pre-trained language model programming language code search natural language code search code documentation generation

Download PDF

Related papers

Fast semantic parsing with well-typedness guarantees 2020

Detecting Objectifying Language in Online Professor Reviews 2020

Analogous Process Structure Induction for Sub-event Sequence Prediction 2020

Aspect Sentiment Classification with Aspect-Specific Opinion Spans 2020

Robust and Interpretable Grounding of Spatial References with Relation Networks 2020