EvalAssist: LLM-as-a-Judge Simplified

Michael Desmond; Zahra Ashktorab; Werner Geyer; Elizabeth M. Daly; Martín Santillán Cooper; Qian Pan; Rahul Nair; Nico Wagner; Tejaswini Pedapati

AAAI 2025

Abstract

We present EvalAssist, a framework that simplifies the LLM-as-a-judge workflow. The system provides an online criteria development environment where users can interactively build, test, and share custom evaluation criteria in a structured and portable format. It also offers a library of LLM-based evaluators that incorporate algorithmic innovations such as token-probability-based judgement, positional bias checking, and certainty estimation, which help to engender trust in the evaluation process. We have computed extensive benchmarks and deployed the system internally in our organization to several hundred users.
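To illustrate two of the evaluator features named in the abstract, here is a minimal sketch of positional bias checking combined with a token-probability-based certainty score. It is not EvalAssist's implementation: it assumes a hypothetical judge callable that returns log-probabilities for the answer tokens "A" (first response wins) and "B" (second response wins), and the names `certainty_from_logprobs`, `judge_with_position_check`, and `toy_judge` are invented for this example.

```python
import math
from typing import Callable, Dict, Optional, Tuple

# Hypothetical judge interface (an assumption, not an EvalAssist API):
# judge(criterion, first_response, second_response) returns log-probabilities
# for the answer tokens "A" (first wins) and "B" (second wins).
Judge = Callable[[str, str, str], Dict[str, float]]


def certainty_from_logprobs(logprobs: Dict[str, float]) -> Tuple[str, float]:
    """Return the most likely option and its probability, renormalized
    over the options that were scored."""
    probs = {option: math.exp(lp) for option, lp in logprobs.items()}
    total = sum(probs.values())
    winner = max(probs, key=probs.get)
    return winner, probs[winner] / total


def judge_with_position_check(
    judge: Judge, criterion: str, resp_1: str, resp_2: str
) -> Dict[str, Optional[object]]:
    """Ask the judge twice with the response order swapped and flag
    disagreement between the two runs as suspected positional bias."""
    # Run 1 shows resp_1 first; run 2 shows resp_2 first.
    label_1, cert_1 = certainty_from_logprobs(judge(criterion, resp_1, resp_2))
    label_2, cert_2 = certainty_from_logprobs(judge(criterion, resp_2, resp_1))

    # Map the positional labels back to the underlying responses.
    pick_1 = "resp_1" if label_1 == "A" else "resp_2"
    pick_2 = "resp_2" if label_2 == "A" else "resp_1"

    consistent = pick_1 == pick_2
    return {
        "winner": pick_1 if consistent else None,
        "position_bias_suspected": not consistent,
        # Conservative choice for this sketch: report the weaker of the two runs.
        "certainty": min(cert_1, cert_2),
    }


def toy_judge(criterion: str, first: str, second: str) -> Dict[str, float]:
    """Stand-in for an LLM call: prefers the longer response, regardless of order."""
    p_first = 0.9 if len(first) > len(second) else 0.1
    return {"A": math.log(p_first), "B": math.log(1.0 - p_first)}


def first_position_judge(criterion: str, first: str, second: str) -> Dict[str, float]:
    """Stand-in for a biased judge that always favors whatever is shown first."""
    return {"A": math.log(0.8), "B": math.log(0.2)}


if __name__ == "__main__":
    print(judge_with_position_check(toy_judge, "Is it detailed?", "short", "a much longer answer"))
    print(judge_with_position_check(first_position_judge, "Is it detailed?", "short", "a much longer answer"))
```

In this sketch, a verdict is kept only when both orderings agree. A judge that always favors the first position produces contradictory picks across the two runs, so no winner is returned and the result is flagged instead.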

Topics

Machine Learning > Application Areas > Fairness
Artificial Intelligence > Core AI > Large Language Models
Natural Language Processing > Applications > Text Generation

Keywords

text generation, token probability, positional bias, trustworthy evaluation, large language model, evaluation criterion, certainty estimation
