Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque

Gorka Urbizu; Ander Corral; Xabier Saralegi; Iñaki San Vicente

EMNLP 2025

Abstract

This work investigates the effectiveness of small autoregressive language models (SLMs) with up to one billion parameters (sub-1B) for natural language processing (NLP) tasks in low-resource languages, focusing on Basque. We analyze optimal training strategies by comparing training from scratch with continual pre-training, using state-of-the-art SLM architectures. Our analysis considers factors such as model size and the extent of Basque presence in the pre-training corpus. To assess linguistic capabilities, we evaluate the models on 12 NLP tasks using the Harness framework. We also conduct a manual evaluation of fine-tuned models on three downstream natural language generation (NLG) tasks: question answering (QA), summarization, and machine translation (MT). Our findings indicate that continual pre-training of a multilingual SLM substantially improves linguistic performance compared to training from scratch, particularly in low-resource language settings where available corpora typically contain fewer than one billion words. Additionally, the presence of Basque during pre-training and larger model sizes both contribute positively to performance on NLG tasks.

Authors

Gorka Urbizu, Ander Corral, Xabier Saralegi, Iñaki San Vicente

Topics

Natural Language Processing > Generation > Text Generation
Natural Language Processing > Applications > Machine Translation
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

machine translation, question answering, low-resource language, autoregressive model, continual pre-training, small language model
