Many-shot Jailbreaking

Cem Anil; Esin Durmus; Nina Panickssery; Mrinank Sharma; Joe Benton; Sandipan Kundu; Joshua Batson; Meg Tong; Jesse Mu; Daniel Ford; Francesco Mosconi; Rajashree Agrawal; Rylan Schaeffer; Naomi Bashkansky; Samuel Svenningsen; Mike Lambert; Ansh Radhakrishnan; Carson Denison; Evan J Hubinger; Yuntao Bai; Trenton Bricken; Timothy Maxwell; Nicholas Schiefer; James Sully; Alex Tamkin; Tamera Lanham; Karina Nguyen; Tomasz Korbak; Jared Kaplan; Deep Ganguli; Samuel R. Bowman; Ethan Perez; Roger Baker Grosse; David Duvenaud

NeurIPS 2024

Abstract

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This attack is newly feasible with the larger context windows recently deployed by language model providers like Google DeepMind, OpenAI and Anthropic. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.
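
To make the power-law claim concrete, the following is a minimal sketch of how such a trend could be checked, assuming some measured quantity y(n) that behaves like C * n^(-alpha) as the number of shots n grows. The abstract does not specify which metric is used, so the function name fit_power_law, the functional form without an offset term, and the data below are all illustrative assumptions; the numbers are synthetic placeholders, not results from the paper.

```python
import math


def fit_power_law(shots, values):
    """Fit values ~ C * shots**(-alpha) by least squares in log-log space.

    Hypothetical helper for illustration; not taken from the paper.
    Returns the tuple (C, alpha).
    """
    xs = [math.log(n) for n in shots]
    ys = [math.log(v) for v in values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Ordinary least-squares slope and intercept of log(y) against log(n).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A power law is a straight line in log-log space:
    # log(y) = log(C) - alpha * log(n)
    return math.exp(intercept), -slope


# Synthetic placeholder data generated from an exact power law
# (assumed values C = 5.0, alpha = 0.5); NOT measurements from the paper.
shots = [1, 2, 4, 8, 16, 32, 64, 128, 256]
values = [5.0 * n ** -0.5 for n in shots]

C, alpha = fit_power_law(shots, values)
print(f"C = {C:.3f}, alpha = {alpha:.3f}")
```

A straight line on a log-log plot of the metric against the number of shots is the usual visual signature of a power law. With real measurements, a fit that also includes a constant offset may be more appropriate than this two-parameter form.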

Topics

Artificial Intelligence > Core AI > AI Safety
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

jailbreaking attack; adversarial prompt; red teaming; context window; large language model
