EACL 2026

ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity

Abstract

While large language models (LLMs) offer great promise, they also pose concrete safety risks. To audit and mitigate these risks, researchers have developed automated red-teaming methods, which generate adversarial prompts to elicit unsafe behavior from target LLMs during evaluation. Recent automated red-teaming methods for LLMs face a persistent trade-off: techniques that increase prompt diversity often reduce the toxicity elicited from the target LLMs, while toxicity-maximizing methods tend to collapse diversity. To address this limitation, we propose ToxiPrompt, a two-stage framework that explicitly separates exploration (diversity) from exploitation (toxicity) and reunifies them with a single selection criterion that balances diversity and toxicity. Experimental results show that ToxiPrompt outperforms four state-of-the-art baselines in both adversarial prompt diversity and the level of toxicity elicited from target LLMs, improving the harmonic mean of toxicity and diversity by 14.6% over the best baseline. The approach also performs well on multiple instruction-tuned target LLMs (Llama-2/3, Qwen, Mistral) without re-tuning, achieving up to a 55% improvement in harmonic mean over the best baseline. Our code is available at https://github.com/seungho715/ToxiPrompt
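
The headline metric above is the harmonic mean of toxicity and diversity. As a minimal sketch of why this metric rewards balance, the snippet below computes it for two scores, assuming both are normalized to [0, 1]. The function name `harmonic_mean` and the example values are hypothetical illustrations, not taken from the ToxiPrompt code or results, and the abstract does not specify how toxicity or diversity are measured.

```python
def harmonic_mean(toxicity: float, diversity: float) -> float:
    """Harmonic mean of two non-negative scores.

    Assumption: both scores are already normalized to [0, 1].
    Returns 0.0 if either score is 0, since the harmonic mean
    is dominated by the smaller of the two values.
    """
    if toxicity < 0 or diversity < 0:
        raise ValueError("scores must be non-negative")
    if toxicity == 0 or diversity == 0:
        return 0.0
    return 2 * toxicity * diversity / (toxicity + diversity)


# Hypothetical scores: both pairs have the same arithmetic mean (0.6),
# but the skewed pair is penalized by the harmonic mean.
balanced = harmonic_mean(0.6, 0.6)  # about 0.60
skewed = harmonic_mean(0.9, 0.3)    # about 0.45

print(f"balanced: {balanced:.2f}, skewed: {skewed:.2f}")
```

A method that maximizes toxicity while collapsing diversity (or the reverse) therefore scores lower than one that keeps both moderately high, which is presumably why the metric suits the trade-off described here.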
