UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs

Chaoqun He; Renjie Luo; Shengding Hu; Ranchi Zhao; Jie Zhou; Hanghao Wu; Jiajie Zhang; Xu Han; Zhiyuan Liu; Maosong Sun

ACL 2024

Abstract

Evaluation is pivotal for honing Large Language Models (LLMs), pinpointing their capabilities and guiding enhancements. The rapid development of LLMs calls for a lightweight and easy-to-use framework for swift evaluation deployment. However, because of the many implementation details involved, developing a comprehensive evaluation platform is never easy. Existing platforms are often complex and poorly modularized, hindering seamless incorporation into researchers’ workflows. This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight design, comprehensiveness, modularity, and efficiency. We identify and reimplement three core components of model evaluation (models, data, and metrics). The resulting composability allows for the free combination of different models, tasks, prompts, and metrics within a unified evaluation workflow. Additionally, UltraEval supports diverse models owing to a unified HTTP service and provides sufficient inference acceleration.
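
The composability described in the abstract can be illustrated with a minimal sketch in which the model, the task data with its prompt, and the metric are independent pieces passed into one evaluation function. This is not UltraEval's actual API: the names `Task`, `evaluate`, `exact_match`, and `toy_model` are hypothetical, and the model is a local stand-in instead of a call to an HTTP service.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration only; not UltraEval's real interface.
Model = Callable[[str], str]           # prompt -> completion
Metric = Callable[[str, str], float]   # (prediction, reference) -> score


@dataclass
class Task:
    name: str
    examples: list[tuple[str, str]]    # (question, reference answer) pairs
    prompt_template: str               # must contain "{question}"


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the stripped strings are identical, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def evaluate(model: Model, task: Task, metric: Metric) -> float:
    """Average the metric over all examples of a task for one model."""
    if not task.examples:
        raise ValueError("task has no examples")
    scores = []
    for question, reference in task.examples:
        prompt = task.prompt_template.format(question=question)
        scores.append(metric(model(prompt), reference))
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; always answers "4"."""
    return "4"


task = Task(
    name="toy_arithmetic",
    examples=[("2 + 2 = ?", "4"), ("3 + 3 = ?", "6")],
    prompt_template="Question: {question}\nAnswer:",
)
score = evaluate(toy_model, task, exact_match)
print(f"{task.name}: {score:.2f}")
```

Because each component is just a value passed to `evaluate`, swapping the prompt template, the metric, or the model leaves the other two untouched, which is the kind of free combination the abstract refers to.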

Topics

Artificial Intelligence > Core AI > Foundation Models
Machine Learning > Application Areas > Efficient Computing
Natural Language Processing > Resources & Methods > Large Language Models
Deep Learning > Models > Large Language Models
Machine Learning > Learning Types > Evaluation

Keywords

model evaluation; language model evaluation; inference acceleration; model benchmark; large language model; metric computation
