“I’ve Decided to Leak”: Probing Internals Behind Prompt Leakage Intents

Jianshuo Dong; Yutong Zhang; Liu Yan; Zhenyu Zhong; Tao Wei; Ke Xu; Minlie Huang; Chao Zhang; Han Qiu

2025 EMNLP EMNLP 2025

“I’ve Decided to Leak”: Probing Internals Behind Prompt Leakage Intents

Abstract

AbstractLarge language models (LLMs) exhibit prompt leakage vulnerabilities, where they may be coaxed into revealing system prompts embedded in LLM services, raising intellectual property and confidentiality concerns. An intriguing question arises: Do LLMs genuinely internalize prompt leakage intents in their hidden states before generating tokens? In this work, we use probing techniques to capture LLMs’ intent-related internal representations and confirm that the answer is yes. We start by comprehensively inducing prompt leakage behaviors across diverse system prompts, attack queries, and decoding methods. We develop a hybrid labeling pipeline, enabling the identification of broader prompt leakage behaviors beyond mere verbatim leaks. Our results show that a simple linear probe can predict prompt leakage risks from pre-generation hidden states without generating any tokens. Across all tested models, linear probes consistently achieve 90%+ AUROC, even when applied to new system prompts and attacks. Understanding the model internals behind prompt leakage drives practical applications, including intention-based detection of prompt leakage risks. Code is available at: https://github.com/jianshuod/Probing-leak-intents.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Security & Privacy

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jianshuo Dong , Yutong Zhang , Liu Yan , Zhenyu Zhong , Tao Wei , Ke Xu , Minlie Huang , Chao Zhang , Han Qiu

Topics

Artificial Intelligence > Core AI > AI Safety Artificial Intelligence > Core AI > Interpretability Security & Privacy Artificial Intelligence > Core AI > Large Language Models

Keywords

intent detection hidden state linear probing model internal probing classifier linear probe prompt leakage

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025