Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

Sher Badshah; Hassan Sajjad

2025 EMNLP EMNLP 2025

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

Abstract

AbstractThe emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — reference-guided evaluation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sher Badshah , Hassan Sajjad

Topics

Machine Learning > Core Methods > Classification Natural Language Processing > Applications > Machine Reading Comprehension Natural Language Processing > Applications > Question Answering Artificial Intelligence > Core AI > Large Language Models Machine Learning > Learning Types > Evaluation

Keywords

natural language inference question answering automatic evaluation free-form question answering generative question answering large language model free-form answer reference-guided evaluation

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025