EMNLP 2025

LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts

Abstract

Large language models (LLMs) such as GPT-4, Claude and LLaMA are routinely used to evaluate long-form text generated by language models. We study the ability of these models to identify low-quality texts, an increasingly rare subset of outputs that is of great interest to pinpoint during development. We present experiments with a panel of LLM judges and with crowd-sourced approximations of reference judgments. Pinpointing sub-par outputs is a difficult task for both models and crowdworkers, with models doing better overall. Moreover, unlike findings in prior work on factoid question answering, panels of cheaper models do not agree with high-quality developer judgments of low quality as well as panels of frontier models do. We present qualitative and quantitative analyses of the relative strengths of the models in the panel, gleaning insights into why a panel yields better results than a single model.
