2025 EMNLP EMNLP 2025

Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque

Abstract

AbstractThis work investigates the effectiveness of small autoregressive language models (SLMs) with up to one billion parameters (sub-1B) for natural language processing (NLP) tasks in low-resource languages, focusing on Basque. We analyze optimal training strategies by comparing training from scratch and continual pre-training using state-of-the-art SLM architectures. Our analysis considers factors such as model size and the extent of Basque presence in the pre-training corpus. To assess linguistic capabilities, models are evaluated on 12 NLP tasks using the Harness framework. We also conduct a manual evaluation of fine-tuned models on three downstream natural language generation (NLG) tasks: question answering (QA), summarization, and machine translation (MT). Our findings indicate that continual pre-training on a multilingual SLM substantially enhances linguistic performance compared to training from scratch, particularly in low-resource language settings where available corpora typically contain fewer than one billion words. Additionally, the presence of Basque during the pre-training and larger model sizes contribute positively to performance in NLG tasks.

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio