2025 EMNLP EMNLP 2025

Multilingual Language Model Pretraining using Machine-translated Data

Abstract

AbstractEnglish, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). However, the same can not be said for most other languages, likely due to a gap in the quality and diversity of available multilingual pretraining corpora. In this work, we find that documents machine-translated from a high-quality English corpus, can contribute significantly to the pretraining quality of multilingual LLMs. Concretely, we translate FineWeb-Edu, a high-quality English web corpus, into nine languages. resulting in a 1.7-trillion-token corpus, which we call TransWebEdu and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this corpus. Across Non-English understanding and reasoning tasks, we show that TransWebLLM matches or even outperforms multilingual LLMs of similar size, including Llama3.2, Qwen2.5, and Gemma3, despite being trained on an order of magnitude less data. Moreover, we show that adding fewer than 5% of TransWebLLM’s training tokens as domain-specific data for continued pretraining yields state-of-the-art results in Arabic, Indonesian, Swahili, and Welsh for understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus and models under Open Source Initiative-approved licenses.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio