IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

David Ifeoluwa Adelani; Jessica Ojo; Israel Abebe Azime; Jian Yun Zhuang; Jesujoba Oluwadara Alabi; Xuanli He; Millicent Ochieng; Sara Hooker; Andiswa Bukula; En-Shiun Annie Lee; Chiamaka Ijeoma Chukwuneke; Happy Buzaaba; Blessing Kudzaishe Sibanda; Godson Koffi Kalipe; Jonathan Mukiibi; Salomon Kabongo Kabenamualu; Foutse Yuehgoh; Mmasibidi Setaka; Lolwethu Ndolela; Nkiruka Odu; Rooweither Mabuya; Salomey Osei; Shamsuddeen Hassan Muhammad; Sokhar Samb; Tadesse Kebede Guge; Tombekai Vangoni Sherman; Pontus Stenetorp

NAACL 2025

Abstract

Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks because appropriate or comprehensive benchmarks are lacking outside of high-resource languages. In this paper, we introduce IrokoBench, a human-translated benchmark dataset for 17 typologically diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant performance gap between open and proprietary models: the highest-performing open model, Gemma 2 27B, reaches only 63% of the performance of the best proprietary model, GPT-4o. Machine-translating the test set into English before evaluation helped close the gap for larger English-centric models such as Gemma 2 27B and LLaMa 3.1 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
