Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons

Seonil Son; Ju-Min Oh; Heegon Jin; Cheolhun Jang; Jeongbeom Jeong; KunTae Kim

EMNLP 2025

Abstract

As Large Language Models (LLMs) expand across domains, LLM judges have become essential for system evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite, which integrates a tournament structure on top of head-to-head comparison. Applying a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and allows higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. These experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. The Arena-Lite demo and code are available at https://huggingface.co/spaces/NCSOFT/ArenaLite
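To make the comparison-count argument concrete, here is a minimal sketch of a tournament built on head-to-head matches. It is not the released Arena-Lite code. The abstract does not specify the bracket format, so single elimination on one prompt is an assumption, and the names `single_elimination`, `round_robin_matches`, `skill`, and `toy_judge` are hypothetical. The toy judge stands in for an LLM judge and simply picks the system with the higher made-up skill score.

```python
import random
from typing import Callable, Dict, List, Tuple

# A judge takes two system names and returns the name of the winner.
Judge = Callable[[str, str], str]


def single_elimination(
    systems: List[str], judge: Judge, rng: random.Random
) -> Tuple[str, int]:
    """Run one single-elimination bracket; return (champion, matches played)."""
    bracket = systems[:]
    rng.shuffle(bracket)  # random seeding of the bracket (an assumption)
    matches = 0
    while len(bracket) > 1:
        next_round: List[str] = []
        # With an odd number of entrants, the last one advances on a bye.
        if len(bracket) % 2 == 1:
            next_round.append(bracket.pop())
        # Pair the remaining entrants and keep each match winner.
        for a, b in zip(bracket[0::2], bracket[1::2]):
            next_round.append(judge(a, b))
            matches += 1
        bracket = next_round
    return bracket[0], matches


def round_robin_matches(n: int) -> int:
    """Matches needed if every pair of n systems is compared once."""
    return n * (n - 1) // 2


# Hypothetical systems with made-up skill scores (illustration only).
skill: Dict[str, int] = {f"sys-{i}": i for i in range(8)}


def toy_judge(a: str, b: str) -> str:
    # Deterministic stand-in for an LLM judge: higher skill wins.
    return a if skill[a] >= skill[b] else b


champion, played = single_elimination(list(skill), toy_judge, random.Random(0))
print(champion, played, round_robin_matches(len(skill)))
```

Under these assumptions a bracket over n systems plays n - 1 matches per prompt, against n(n - 1)/2 for an exhaustive round-robin, and no baseline output is involved. How the per-prompt results are aggregated into a full ranking is left out here, since the abstract does not describe it.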

Topics

Machine Learning > Optimization & Theory > Theory
Natural Language Processing > Resources & Methods > Large Language Models
Artificial Intelligence > Core AI > Large Language Models
Machine Learning > Learning Types > Evaluation

Keywords

benchmark evaluation, model comparison, model ranking, large language model, tournament-based evaluation, direct comparison, tournament ranking, tournament structure
