AAAI 2025

Single Character Perturbations Break LLM Alignment

Abstract

When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not produce unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space or other single-character token to the end of a model's input. In a study of a variety of open-source models, we demonstrate that this simple perturbation causes the majority of models to generate harmful outputs with very high probability. We further find that both Claude and GPT-3.5 exhibit the same behavior. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to answer in lists or other formatted responses, overriding training signals to refuse unsafe requests. Our findings underscore the fragile state of current model alignment and highlight the importance of developing more robust alignment methods.
