Shortcut Learning in Safety: The Impact of Keyword Bias in Safeguards

Panuthep Tasawong; Napat Laosaengpha; Wuttikorn Ponwitayarat; Sitiporn Lim; Potsawee Manakul; Samuel Cahyawijaya; Can Udomcharoenchaikit; Peerat Limkonchotiwat; Ekapol Chuangsuwanich; Sarana Nutanong

ACL 2025

Abstract

This paper investigates shortcut learning in safety guardrails for large language models (LLMs). It shows that current safeguard models often rely excessively on superficial cues, such as specific keywords that are spuriously correlated with training labels, rather than on the input’s actual semantics or intent. As a result, their performance degrades significantly when the keyword distribution shifts. The paper also examines the effect of reducing shortcut reliance and shows that minimizing shortcut influence alone is insufficient: to build robust safeguard models, it is equally crucial to promote the use of intended features.
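The keyword shortcut described above can be illustrated with a deliberately tiny toy. The sketch below is not the paper's method or data: the prompts, the labels, and the helper names (`keyword_scores`, `predict`, `accuracy`) are all hypothetical, and the "safeguard" is just a word-count scorer. It only shows how a classifier that latches onto a keyword such as "kill" can fit its training set perfectly and still fail once that keyword appears in benign prompts.

```python
from collections import Counter

# Hypothetical training set: "kill" appears only in harmful prompts (label 1).
TRAIN = [
    ("how do i kill my neighbor", 1),
    ("best way to kill someone quietly", 1),
    ("how do i bake bread", 0),
    ("what is the capital of france", 0),
]

# Hypothetical shifted set: the same keyword now appears in benign prompts,
# and the harmful prompt no longer contains it.
SHIFTED = [
    ("how do i kill a python process", 0),
    ("how to kill weeds in my garden", 0),
    ("best way to poison someone quietly", 1),
]


def keyword_scores(data):
    """Score each word as (#harmful prompts containing it) - (#benign prompts containing it)."""
    scores = Counter()
    for text, label in data:
        for word in set(text.split()):
            scores[word] += 1 if label == 1 else -1
    return scores


def predict(scores, text):
    """Flag a prompt as harmful (1) when its summed word scores are positive."""
    total = sum(scores[word] for word in text.split())
    return 1 if total > 0 else 0


def accuracy(scores, data):
    """Fraction of prompts whose predicted label matches the given label."""
    return sum(predict(scores, text) == label for text, label in data) / len(data)


scores = keyword_scores(TRAIN)
print("train accuracy:  ", accuracy(scores, TRAIN))
print("shifted accuracy:", accuracy(scores, SHIFTED))
```

In this toy, the scorer separates the training prompts perfectly but flags both benign "kill" prompts in the shifted set as harmful, because the keyword carries most of the score regardless of intent.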
