BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context

Alexis Matzopoulos; Charl Hendriks; Hishaam Mahomed; Francois Meyer

COLING 2025

Abstract

The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development (<100M). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are far smaller than 100M words. In this paper, we explore the potential of BabyLMs for low-resource languages, using isiXhosa as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, with notable gains (+3.2 F1) on the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but also highlight the continued importance, and scarcity, of high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Natural Language Processing > Generation > Language Modeling
Machine Learning > Learning Paradigms > Transfer Learning
Machine Learning > Learning Types > Few-Shot Learning
Machine Learning > Learning Types > Transfer Learning
Natural Language Processing > Resources & Methods > Language Modeling

Keywords

named entity recognition, part-of-speech tagging, low-resource language, language model, data-efficient pretraining
