Metric assessment protocol in the context of answer fluctuation on MCQ tasks

Ekaterina Goliakova; Xavier Renard; Marie-Jeanne Lesot; Thibault Laugel; Christophe Marsala; Marcin Detyniecki

ACL 2025

Abstract

Multiple-choice questions (MCQs) have become a standard way to assess LLM capabilities efficiently. A variety of metrics can be employed for this task, but previous research has not assessed them thoroughly. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We propose a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as with original performance. Our results show a strong link between existing metrics and answer changes, even when the metrics are computed without any additional prompt variants. The highest association under the protocol is demonstrated by a novel metric, worst accuracy.
