LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts

Zongxia Li; Xiyang Wu; Ishani Mondal; Alexa Siu; Jordan Lee Boyd-Graber; Ani Nenkova

EMNLP 2025

Abstract

Large language models (LLMs) such as GPT-4, Claude, and LLaMA are routinely used to evaluate long-form text generated by language models. We study the ability of these models to identify low-quality texts, an increasingly rare subset of outputs that is of great interest to pinpoint during development. We present experiments with a panel of LLM judges and with crowd-sourced approximations of reference judgments. Pinpointing sub-par outputs is a difficult task for both models and crowdworkers, with models doing better overall. Moreover, unlike findings in prior work on factoid question answering, panels of cheaper models do not agree with high-quality developer judgments of low quality as well as panels of frontier models do. We present qualitative and quantitative analyses of the relative strengths of the models in the panel, gleaning insights into why a panel yields better results than a single model.

Topics

Natural Language Processing > Applications > Text Classification
Artificial Intelligence > Core AI > Large Language Models

Keywords

model evaluation; text quality assessment; large language model output evaluation
