AfriCaption: Establishing a New Paradigm for Image Captioning in African Languages

Mardiyyah Oduwole; Prince Mireku; Fatimo Adebanjo; Oluwatosin Olajide; Mahi Aminu Aliyu; Jekaterina Novikova

EACL 2026

Abstract

Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across underrepresented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for underrepresented African languages, laying the groundwork for truly inclusive multimodal AI.

Topics

Machine Learning > Application Areas > Domain Adaptation
Computer Vision > Generation > Image Captioning

Keywords

image captioning, low-resource language, vision-language model, model ensembling, multilingual model
