Single Character Perturbations Break LLM Alignment

Leon Lin; Hannah Brown; Kenji Kawaguchi; Michael Shieh

2025 AAAI AAAI 2025

Single Character Perturbations Break LLM Alignment

Abstract

Abstract When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as ``Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space or other single character token to the end of a model's input. In a study of a variety of open-source models, we demonstrate that this simple perturbation is able to cause the majority of models to generate harmful outputs with very high probability. We further find that both Claude and GPT-3.5 demonstrate the same behavior. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models answer in lists or other formatted responses, overriding training signals to refuse unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Leon Lin , Hannah Brown , Kenji Kawaguchi , Michael Shieh

Topics

Artificial Intelligence > Core AI > AI Safety Artificial Intelligence > Core AI > Foundation Models Deep Learning > Models > Large Language Models Machine Learning > Learning Types > Robustness

Keywords

model alignment adversarial attack prompt perturbation large language model harmful output

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025