2025 EMNLP EMNLP 2025

Towards Optimal Evaluation Efficiency for Large Language Models

Abstract

AbstractComprehensive evaluation of large language models (LLMs) typically requires large-scale benchmarks, which is costly in terms of both data annotation and computational resource needed for evaluation. To mitigate these challenges, we propose an efficient evaluation framework that selects a question subset based on pre-tested results, thereby reducing the costs. We formulate the subset selection problem as an optimization task, solved using optimal random sampling and simulated annealing algorithms. We compare our approach with prior clustering-based methods and assess their reliability in terms of score accuracy. Additionally, we perform semantic analysis and evaluate whether the selected subsets preserve the semantic information of the original benchmark using Wasserstein distance. Experimental results show that our method outperforms previous approaches in terms of reliability, as measured by L2 norm. Our study provides an optimized perspective for balancing evaluation efficiency and reliability in LLM assessments, while revealing the relationship between optimization methods and semantic retention.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Mathematics & Optimization and Natural Language Processing
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio