Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Lijia Liu; Takumi Kondo; Kyohei Atarashi; Koh Takeuchi; Jiyi Li; Shigeru Saito; Hisashi Kashima

2025 AACL AACL 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Abstract

AbstractThis paper investigates defenses in LLM-based evaluation, where prompt injection attacks can manipulate scores by deceiving the evaluation system. We formalize blind attacks as a class in which candidate answers are crafted independently of the true answer. To counter such attacks, we propose an evaluation framework that combines standard and counterfactual evaluation. Experiments show it significantly improves attack detection with minimal performance trade-offs for recent models.

🌉 Interdisciplinary Bridge — Computer Science and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Data Science & Analytics, Deep Learning, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy

Authors

Lijia Liu , Takumi Kondo , Kyohei Atarashi , Koh Takeuchi , Jiyi Li , Shigeru Saito , Hisashi Kashima

Topics

Machine Learning > Learning Types > Adversarial Learning Computer Science > Applications > Cybersecurity

Keywords

prompt injection attack detection counterfactual evaluation llm security

Download PDF

Related papers

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge 2025

Enhancing Training Data Quality through Influence Scores for Generalizable Classification: A Case Study on Sexism Detection 2025

CtrlShift: Steering Language Models for Dense Quotation Retrieval with Dynamic Prompts 2025

A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics 2025

Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts 2025