Agent vs. Agent: Automated Data Generation and Red-Teaming for Custom Agentic Workflows

Ninad Kulkarni; Xian Wu; Siddharth Varia; Dmitriy Bespalov

2025 EMNLP EMNLP 2025

Agent vs. Agent: Automated Data Generation and Red-Teaming for Custom Agentic Workflows

Abstract

AbstractLarge Language Models (LLMs) deployed as autonomous agents with tool access present unique safety challenges that extend beyond standalone model vulnerabilities. Existing red-teaming frameworks like AgentHarm use static prompts and hardcoded toolsets, limiting their applicability to custom production systems.We introduce a dual-component automated red-teaming framework: AgentHarm-Gen generates adversarial tasks and evaluation functions tailored to arbitrary toolsets, while Red-Agent-Reflect employs iterative prompt refinement with self-reflection to develop progressively more effective attacks.Evaluating across 115 harmful tasks (71 generated, 44 from AgentHarm) spanning 8 risk categories, our method achieves substantial improvements: up to 162% increase in attack success rate on o4-mini and 86% success on Gemini 2.5 Pro. Successful attacks systematically decompose adversarial objectives into benign-appearing sub-tasks that circumvent safety alignment, highlighting the need for agent-specific guardrails.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🧭 Keyword Pioneer — tool access

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy

Authors

Ninad Kulkarni , Xian Wu , Siddharth Varia , Dmitriy Bespalov

Topics

Artificial Intelligence > Core AI > Agent Systems Artificial Intelligence > Core AI > AI Safety Machine Learning > Learning Types > Adversarial Learning Deep Learning > Models > Large Language Models Artificial Intelligence > Core AI > Adversarial Learning

Keywords

safety alignment autonomous agent tool use agentic workflow adversarial testing agent system llm agent llm safety agent safety adversarial generation prompt refinement tool access adversarial task automated data generation adversarial task generation

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025