LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

Fahim Dalvi; Maram Hasanain; Sabri Boughorbel; Basel Mousi; Samir Abdaljalil; Nizi Nazar; Ahmed Abdelali; Shammur Absar Chowdhury; Hamdy Mubarak; Ahmed Ali; Majd Hawasly; Nadir Durrani; Firoj Alam

EACL 2024

Abstract

The recent development and success of Large Language Models (LLMs) call for evaluating their performance across diverse NLP tasks in different languages. Although several evaluation frameworks have been developed and made publicly available, customizing them for specific tasks and datasets is often complex for users. In this study, we introduce the LLMeBench framework, which can be readily customized to evaluate LLMs on any NLP task, regardless of language. The framework features generic dataset loaders, several model providers, and pre-implemented versions of most standard evaluation metrics. It supports in-context learning in zero- and few-shot settings. A specific dataset and task can be evaluated for a given LLM in fewer than 20 lines of code, while the framework remains fully extensible for custom datasets, models, or tasks. The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We have open-sourced LLMeBench for the community (https://github.com/qcri/LLMeBench/), and a video demonstrating the framework is available online (https://youtu.be/9cC2m_abk3A).
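To make the components named in the abstract concrete, the following is a minimal, self-contained sketch of a zero- and few-shot evaluation loop. It is not LLMeBench's actual API: every name here (`build_prompt`, `post_process`, `accuracy`, `run_benchmark`, `toy_model`) is hypothetical, the sentiment task and data are invented for illustration, and the "model" is a keyword-matching stand-in rather than a real LLM provider.

```python
from typing import Callable, Dict, List, Sequence

# A sample is a dict with an input text and a gold label (assumed layout).
Sample = Dict[str, str]


def build_prompt(sample: Sample, shots: Sequence[Sample] = ()) -> str:
    """Build a prompt: zero-shot if `shots` is empty, few-shot otherwise."""
    parts = ["Classify the sentiment as Positive or Negative."]
    for shot in shots:  # in-context examples, each with its gold label
        parts.append(f"Text: {shot['input']}\nLabel: {shot['label']}")
    parts.append(f"Text: {sample['input']}\nLabel:")
    return "\n\n".join(parts)


def post_process(response: str) -> str:
    """Map a raw model response to a label (first token, normalized)."""
    tokens = response.split()
    return tokens[0].strip(".,").capitalize() if tokens else ""


def accuracy(predictions: List[str], golds: List[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def run_benchmark(
    data: Sequence[Sample],
    model: Callable[[str], str],
    shots: Sequence[Sample] = (),
) -> float:
    """Prompt the model on every sample and score the parsed outputs."""
    predictions = [post_process(model(build_prompt(s, shots))) for s in data]
    return accuracy(predictions, [s["label"] for s in data])


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM provider: keyword match on the final query only."""
    query = prompt.rsplit("Text: ", 1)[1]
    return " positive\n" if "good" in query.lower() else "negative"


TEST_SET: List[Sample] = [
    {"input": "A good film", "label": "Positive"},
    {"input": "Dull plot", "label": "Negative"},
    {"input": "Not bad at all", "label": "Positive"},
]
SHOTS: List[Sample] = [{"input": "Loved it", "label": "Positive"}]

if __name__ == "__main__":
    print("zero-shot accuracy:", run_benchmark(TEST_SET, toy_model))
    print("few-shot accuracy:", run_benchmark(TEST_SET, toy_model, SHOTS))
```

Keeping the prompt builder, the response parser, and the metric as separate functions mirrors the separation the abstract describes between dataset loaders, model providers, and evaluation metrics: any one of them can be swapped without touching the others.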

Topics

Machine Learning > Application Areas > Efficient Computing
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

zero-shot learning, benchmark evaluation, few-shot learning, natural language processing, model evaluation, large language model
