Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition

Sander Schulhoff; Jeremy Pinto; Anaum Khan; Louis-François Bouchard; Chenglei Si; Svetlina Anati; Valen Tagliabue; Anson Kost; Christopher Carnahan; Jordan Boyd-Graber

EMNLP 2023

Abstract

Large Language Models (LLMs) are increasingly deployed in interactive contexts that involve direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and instead follow potentially malicious ones. Although prompt hacking is widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on it. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive ontology of the types of adversarial prompts.

Authors

Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Kost, Christopher Carnahan, Jordan Boyd-Graber

Topics

Artificial Intelligence > Core AI > AI Safety
Artificial Intelligence > Core AI > Responsible AI
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Adversarial Learning
Artificial Intelligence > Core AI > Safety

Keywords

adversarial attack, adversarial prompt, prompt injection, security vulnerability, large language model, prompt hacking
