ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

Gili Lior; Eliya Habba; Shahar Levy; Avi Caciularu; Gabriel Stanovsky

EMNLP 2025

Abstract

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method-of-moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of *reliable evaluation* that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
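The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes each meaning-preserving paraphrase of a prompt yields one scalar score, estimates the first two moments (mean and variance) over sampled paraphrases, and uses a textbook standard-error rule to suggest how many resamplings might be needed. The names `estimate_moments`, `resamplings_for_tolerance`, `stochastic_evaluate`, and `score_prompt` are hypothetical and do not come from the paper.

```python
import math
import random
import statistics


def estimate_moments(scores):
    """Return the sample mean and unbiased sample variance of the scores."""
    mean = statistics.fmean(scores)
    variance = statistics.variance(scores) if len(scores) > 1 else 0.0
    return mean, variance


def resamplings_for_tolerance(scores, tolerance):
    """Smallest n such that sqrt(variance / n) <= tolerance.

    Illustrative stopping rule based on the standard error of the mean;
    this is an assumption, not the criterion defined in the paper.
    """
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    _, variance = estimate_moments(scores)
    if variance == 0.0:
        return 1
    return max(1, math.ceil(variance / tolerance ** 2))


def stochastic_evaluate(score_prompt, paraphrases, n_samples, seed=0):
    """Score n_samples paraphrases drawn uniformly with replacement.

    score_prompt is a hypothetical callable mapping a prompt string to a
    scalar metric value (for example, accuracy under that prompt).
    """
    rng = random.Random(seed)
    sampled = [rng.choice(paraphrases) for _ in range(n_samples)]
    scores = [score_prompt(prompt) for prompt in sampled]
    return estimate_moments(scores)


if __name__ == "__main__":
    # Made-up pilot scores from four paraphrases of one benchmark prompt.
    pilot_scores = [0.6, 0.7, 0.8, 0.9]
    mean, variance = estimate_moments(pilot_scores)
    needed = resamplings_for_tolerance(pilot_scores, tolerance=0.05)
    print(f"mean={mean:.3f} variance={variance:.4f} resamplings={needed}")
```

The pilot scores in the usage block are invented for illustration. In practice the variance would come from scoring a real model on real paraphrases, and the tolerance would be chosen to match the performance gaps one wants to resolve between models.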

Topics

Machine Learning > Optimization & Theory > Theory; Natural Language Processing > Resources & Methods > Large Language Models; Machine Learning > Optimization & Theory > Statistics; Deep Learning > Models > Large Language Models; Machine Learning > Learning Types > Evaluation

Keywords

prompt sensitivity; method of moments; large language model; stochastic evaluation; reliable evaluation
