Beyond Blind Following: Evaluating Robustness of LLM Agents under Imperfect Guidance

Yao Fu; Ran Qiu; Xinhe Wang; Jacob Sansom; Sathvika Ayyappa Prabhu; Huijie Tang; Jaekyeom Kim; Sungryull Sohn; Honglak Lee

EACL 2026

Abstract

Large language models (LLMs) have shown strong capabilities as task-solving agents across interactive domains. In complex environments, however, these agents may need to rely on auxiliary guidance to reduce the search space or compensate for limited domain-specific knowledge. Such guidance includes human-provided manuals and demonstrations, examples retrieved from memory or external tools, high-level heuristics, and knowledge the agent has acquired from prior interactions. This guidance may be imperfect: changes in the environment, ambiguous or simplified language, or retrieval errors from external sources can leave it incomplete, outdated, or contextually mismatched, potentially causing errors or failures during task execution. To address this, we introduce MIRAGE, a benchmark for MeasurIng Robustness of LLM Agents under Imperfect GuidancE. MIRAGE includes procedurally generated environments in navigation, cooking, and gaming, where both the environment and the auxiliary guidance vary in fidelity and relevance. We further extend MIRAGE to realistic web tasks via WebArena, using noisy or underspecified instructions extracted from demonstrations. Our findings reveal critical failure modes in current LLM agents and motivate future work on improving their robustness under imperfect guidance.

Topics

Artificial Intelligence > Core AI > Agent Systems; Artificial Intelligence > Core AI > AI Safety; Artificial Intelligence > Core AI > Human-AI Interaction

Keywords

benchmark evaluation; task execution; LLM agent; agent robustness; imperfect guidance
