Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark

Chanjun Park; Hyeonwoo Kim; Dahyun Kim; SeongHwan Cho; Sanghoon Kim; Sukyung Lee; Yungi Kim; Hwalsuk Lee

ACL 2024

Abstract

This paper introduces the Open Ko-LLM Leaderboard and the Ko-H5 Benchmark as vital tools for evaluating Large Language Models (LLMs) in Korean. Incorporating private test sets while mirroring the English Open LLM Leaderboard, we establish a robust evaluation framework that has been well integrated into the Korean LLM community. We perform a data leakage analysis that shows the benefit of private test sets, along with a correlation study within the Ko-H5 benchmark and temporal analyses of the Ko-H5 score. Moreover, we present empirical support for the need to expand beyond set benchmarks. We hope the Open Ko-LLM Leaderboard sets a precedent for expanding LLM evaluation to foster more linguistic diversity.

Topics

Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Multilingual NLP
Machine Learning > Optimization & Theory > Evaluation

Keywords

benchmark evaluation, evaluation benchmark, Korean language, data leakage analysis, linguistic diversity, large language model, private test set
