IJCNLP 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Abstract

This paper investigates defenses in LLM-based evaluation, where prompt injection attacks can manipulate scores by deceiving the evaluation system. We formalize blind attacks as a class in which candidate answers are crafted independently of the true answer. To counter such attacks, we propose an evaluation framework that combines standard and counterfactual evaluation. Experiments show it significantly improves attack detection with minimal performance trade-offs for recent models.
