Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models

Sheng Zhang; Hui Li; Rongrong Ji

EMNLP 2024

Abstract

Code pre-trained language models (CPLMs) have received great attention because they benefit a variety of tasks that facilitate software development and maintenance. However, CPLMs are trained on massive amounts of open-source code, which raises concerns about potential data infringement. This paper initiates the study of detecting unauthorized code use in CPLMs, i.e., the Code Membership Inference (CMI) task. We design a framework, Buzzer, for different settings of CMI. Buzzer deploys several inference techniques, including signal extraction from pre-training tasks, hard-to-learn sample calibration, and weighted inference, to identify code membership status accurately. Extensive experiments show that CMI can be achieved with high accuracy using Buzzer. Hence, Buzzer can serve as a CMI tool and help protect intellectual property rights. The implementation of Buzzer is available at: https://github.com/KDEGroup/Buzzer
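
The abstract names its inference techniques without detailing them. As a rough illustration only, and not Buzzer's actual implementation, the generic idea behind loss-based membership inference can be sketched as follows: a code sample is judged more likely to be a training member when the target model fits it better than a reference model does, and several such signals are combined with weights. Everything in this sketch is an assumption made for illustration: the function names, the use of per-task losses as signals, the reference-model style of calibration, and the threshold of 0.0.

```python
from typing import Sequence


def calibrated_score(target_loss: float, reference_loss: float) -> float:
    """Hypothetical calibration: how much better the target model fits a sample
    than a reference model does. Subtracting the reference loss discounts samples
    that are intrinsically hard (or easy) to learn."""
    return reference_loss - target_loss


def weighted_membership_score(
    target_losses: Sequence[float],
    reference_losses: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Combine one calibrated signal per pre-training task into a weighted average."""
    if not (len(target_losses) == len(reference_losses) == len(weights)):
        raise ValueError("all inputs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        w * calibrated_score(t, r)
        for t, r, w in zip(target_losses, reference_losses, weights)
    )
    return weighted_sum / total_weight


def infer_membership(score: float, threshold: float = 0.0) -> bool:
    """Predict 'member' when the combined score exceeds an assumed threshold."""
    return score > threshold


# Made-up losses for two hypothetical pre-training tasks.
seen_score = weighted_membership_score([0.5, 1.0], [1.5, 1.0], [3.0, 1.0])
unseen_score = weighted_membership_score([2.0, 1.5], [1.5, 1.0], [3.0, 1.0])
print(infer_membership(seen_score), infer_membership(unseen_score))
```

In practice the threshold and weights would have to be chosen on data with known membership status; the values above are placeholders.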

Topics

Artificial Intelligence > Core AI > Model Compression
Artificial Intelligence > Learning Paradigms > Transfer Learning
Machine Learning > Application Areas > Privacy
Artificial Intelligence > Core AI > Privacy
Deep Learning > Models > Large Language Models

Keywords

code generation, intellectual property, language model, signal extraction, intellectual property protection, pre-trained language model, membership inference, code pre-training, data copyright
