CLMTracing: Black-box User-level Watermarking for Code Language Model Tracing

Boyu Zhang; Ping He; Tianyu Du; Xuhong Zhang; Lei Yun; Kingsum Chow; Jianwei Yin

EMNLP 2025

Abstract

With the widespread adoption of open-source code language models (code LMs), intellectual property (IP) protection has become an increasingly critical concern. While current watermarking techniques can identify a code LM to protect its IP, they fall short of a more practical and complex demand: individual user-level tracing in the black-box setting. This work presents CLMTracing, a black-box code LM watermarking framework that employs rule-based watermarks and a utility-preserving injection method for user-level model tracing. CLMTracing further incorporates a parameter selection algorithm sensitive to the robust watermark, together with adversarial training, to enhance robustness against watermark removal attacks. Comprehensive evaluations demonstrate that CLMTracing is effective across multiple state-of-the-art (SOTA) code LMs, showing significant improvements in harmlessness compared to existing SOTA baselines and strong robustness against various removal attacks.

Topics

Artificial Intelligence > Core AI > Model Compression; Machine Learning > Application Areas > Privacy; Natural Language Processing > Generation > Text Generation; Machine Learning > Application Areas > Model Compression; Security & Privacy > Privacy; Artificial Intelligence > Core AI > Privacy; Artificial Intelligence > Core AI > Large Language Models; Deep Learning > Learning Types > Representation Learning

Keywords

code generation; adversarial training; intellectual property; intellectual property protection; code language model; model tracing; user-level tracing
