Co-Eval: Augmenting LLM-based Evaluation with Machine Metrics

Ling-I Wu; Weijie Wu; Minyu Chen; Jianxin Xue; Guoqiang Li

EMNLP 2025

Abstract

Large language models (LLMs) are increasingly used as evaluators in natural language generation tasks, offering advantages in scalability and interpretability over traditional evaluation methods. However, existing LLM-based evaluations often suffer from biases and misalignment, particularly in domain-specific tasks, due to limited functional understanding and knowledge gaps. To address these challenges, we first investigate the relationship between an LLM-based evaluator’s familiarity with the target task and its evaluation performance. We then introduce the Co-Eval framework, which leverages a criteria planner model and optimized machine metrics to enhance the scalability and fairness of LLM-based evaluation. Experimental results on both general and domain-specific tasks demonstrate that Co-Eval reduces biases, achieving up to a 0.4903 reduction in self-preference bias, and improves alignment with human preferences, with gains of up to 0.324 in Spearman correlation.
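The abstract reports alignment with human preferences as a Spearman correlation. As a rough illustration of what that number measures, the sketch below computes Spearman's rank correlation between a list of human ratings and a list of evaluator scores. It is not the paper's evaluation code: the function names `average_ranks` and `spearman` and the sample scores are hypothetical, chosen only for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for five generated texts (illustrative only).
human_ratings = [1, 2, 3, 4, 5]
evaluator_scores = [2, 1, 4, 3, 5]
rho = spearman(human_ratings, evaluator_scores)
```

A value near 1 means the evaluator orders outputs almost exactly as the human raters do; a gain in this quantity is what the abstract describes as improved alignment.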

Topics

Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Natural Language Inference
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Large Language Models
Natural Language Processing > Applications > Text Generation
Machine Learning > Learning Types > Evaluation
Natural Language Processing > Applications > Natural Language Generation

Keywords

natural language generation, llm evaluation, evaluation metric, human preference alignment, self-preference bias, human preference, bias reduction, human alignment, large language model, machine metrics, llm-based evaluation, co-evaluation framework
