ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity

Seungho Lee; Kyumin Lee

2026 EACL EACL 2026

ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity

Abstract

AbstractWhile large language models (LLMs) offer great promise, they also pose concrete safety risks. To audit and mitigate these risks, researchers have developed automated red-teaming methods, which generate adversarial prompts to elicit unsafe behavior of target LLMs during evaluation. Recent automated red-teaming methods for LLMs face a persistent trade-off: techniques that increase prompt diversity often reduce the level of the toxicity elicited from the target LLMs, while toxicity-maximizing methods tend to collapse diversity. To address the limitations, we propose ToxiPrompt, a two-stage framework that explicitly separates exploration (diversity) from exploitation (toxicity) and reunifies them with a single selection criterion to balance between diversity and toxicity. Experimental results show that ToxiPrompt outperforms four state-of-the-art baselines in both adversarial prompt diversity and the level of elicited toxicity from target LLMs, improving 14.6% harmonic mean of toxicity and diversity against the best baseline. The approach also performs well for multiple instruction-tuned target LLMs (Llama-2/3, Qwen, Mistral) without re-tuning, achieving up to 55% harmonic mean improvement against the best baseline. Our code is available at https://github.com/seungho715/ToxiPrompt

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Seungho Lee , Kyumin Lee

Topics

Artificial Intelligence > Core AI > AI Safety Machine Learning > Learning Types > Adversarial Learning

Keywords

prompt engineering language model adversarial prompt prompt diversity

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026