Detection of Adversarial Prompts with Model Predictive Entropy

Franziska Rubenbauer; Sebastian Steindl; Patrick Levi; Daniel Loebenberger; Ulrich Schäfer

2026 EACL EACL 2026

Detection of Adversarial Prompts with Model Predictive Entropy

Abstract

AbstractLarge Language Models (LLMs) are increasingly deployed in high-impact scenarios, raising concerns about their safety and security. Despite existing defense mechanisms, LLMs remain vulnerable to adversarial attacks. This paper introduces the novel attack-agnostic pipeline SENTRY (semantic entropy-based attack recognition system) for detecting such attacks by leveraging the predictive entropy of model outputs, quantified through the Token-Level Shifting Attention to Relevance (TokenSAR) score, a weighted token entropy measurement. Our approach dynamically identifies adversarial inputs without relying on prior knowledge of attack specifications. It requires only ten newly generated tokens, making it a computationally efficient and adaptable solution. We evaluate the pipeline on multiple state-of-the-art models, including Llama, Vicuna, Falcon, Deep Seek, and Mistral, using a diverse set of adversarial prompts generated via the h4rm31 framework. Experimental results demonstrate a clear separation in TokenSAR scores between benign, malicious, and adversarial prompts. This distinction enables effective threshold-based classification, achieving robust detection performance across various model architectures. Our method outperforms traditional defenses in terms of adaptability and resource efficiency.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — token-level entropy

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Franziska Rubenbauer , Sebastian Steindl , Patrick Levi , Daniel Loebenberger , Ulrich Schäfer

Topics

Artificial Intelligence > Core AI > AI Safety Artificial Intelligence > Core AI > Interpretability Machine Learning > Optimization & Theory > Stochastic Processes

Keywords

adversarial attack adversarial prompt detection attention weight predictive entropy token-level entropy

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026