Extracting Prompts by Inverting LLM Outputs

Collin Zhang; John Xavier Morris; Vitaly Shmatikov

EMNLP 2024

Abstract

We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that extracts prompts without access to the model’s logits and without adversarial or jailbreaking queries. Unlike previous methods, output2prompt needs only the outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.
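The abstract does not say how the sparse encoding saves memory. One plausible reading, offered here as an assumption and not as the authors' stated design, is that each LLM output is encoded on its own, so self-attention never crosses output boundaries. The sketch below only counts attention score entries under that assumption; the function names and the equal-length simplification are hypothetical.

```python
# Illustrative arithmetic only. ASSUMPTION: "sparse encoding" means attention
# is restricted to tokens within the same LLM output. The abstract does not
# confirm this; all names here are hypothetical.

def full_attention_pairs(num_outputs: int, tokens_per_output: int) -> int:
    """Score entries when all outputs are concatenated into one sequence."""
    total_tokens = num_outputs * tokens_per_output
    return total_tokens * total_tokens


def per_output_attention_pairs(num_outputs: int, tokens_per_output: int) -> int:
    """Score entries when each output attends only to its own tokens."""
    return num_outputs * tokens_per_output * tokens_per_output


def savings_factor(num_outputs: int, tokens_per_output: int) -> int:
    """Ratio of the two counts; equals num_outputs for equal-length outputs."""
    full = full_attention_pairs(num_outputs, tokens_per_output)
    sparse = per_output_attention_pairs(num_outputs, tokens_per_output)
    return full // sparse
```

Under this assumption the attention cost grows linearly in the number of outputs instead of quadratically, which is one way a method that conditions on many outputs could stay within memory limits.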

Topics

Artificial Intelligence > Core AI > AI Safety
Machine Learning > Application Areas > Privacy
Artificial Intelligence > Core AI > Privacy
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Learning Types > Representation Learning

Keywords

zero-shot transfer, large language model, prompt extraction, sparse encoding, language model inversion, black-box method
