From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

Dawei Li; Bohan Jiang; Liangjie Huang; Alimohammad Beigi; Chengshuai Zhao; Zhen Tan; Amrita Bhattacharjee; Yuxuan Jiang; Canyu Chen; Tianhao Wu; Kai Shu; Lu Cheng; Huan Liu

EMNLP 2025

Abstract

Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). Traditional methods, usually matching-based or small-model-based, often fall short in open-ended and dynamic scenarios. Recent advances in Large Language Models (LLMs) have inspired the “LLM-as-a-judge” paradigm, in which LLMs perform scoring, ranking, or selection across a variety of machine learning evaluation scenarios. This paper presents a comprehensive survey of LLM-based judgment and assessment and gives an in-depth overview of this evolving field. We first define LLM-as-a-judge from both the input and the output perspective. We then introduce a systematic taxonomy that explores LLM-as-a-judge along three dimensions: what to judge, how to judge, and how to benchmark. Finally, we highlight key challenges and promising future directions for this emerging area.

Topics

Artificial Intelligence > Core AI > Interpretability
Natural Language Processing > Applications > Text Classification

Keywords

model evaluation, text generation, automated evaluation, large language model
