Building a Turkish Large Language Model via Continual Pre-Training and Parameter-Efficient Adaptation

Alperen Enes Bayar; Mert Ege; Gökhan Yurtalan; Alper Karamanlioglu; Berkan Demirel; Ramazan Gokberk Cinbis

2026 EACL EACL 2026

Building a Turkish Large Language Model via Continual Pre-Training and Parameter-Efficient Adaptation

Abstract

AbstractLarge Language Models (LLMs) achieve strong performance on many tasks, but they still struggle with morphologically rich, low-resource languages such as Turkish. This difficulty stems from Turkish being an agglutinative language and underrepresented in multilingual training data, which causes current models to often fail at capturing its morphology, flexible word order, and formal registers. In this paper, we introduce MODA (Model Adapted for Domain Applications), a Turkish-specialized LLM built via a modular pipeline that combines continual pre-training, parameter-efficient fine-tuning, and model merging. Starting from Qwen2.5-7B as the base model, we first perform large-scale continual pre-training on a Turkish web corpus to improve grammatical and morphological representations. We then apply parameter-efficient supervised fine-tuning on task-oriented instruction data, and finally merge specialized variants into a single unified model. We evaluate MODA on TurkishMMLU, the Turkish subset of EXAMS, and TRCLAIM-19, where it consistently outperforms both the base and instruction-tuned Qwen2.5-7B models. Our results support a training strategy that explicitly separates linguistic acquisition from task alignment when adapting LLMs to morphologically rich, underrepresented languages under realistic hardware constraints.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Alperen Enes Bayar , Mert Ege , Gökhan Yurtalan , Alper Karamanlioglu , Berkan Demirel , Ramazan Gokberk Cinbis

Topics

Natural Language Processing > Resources & Methods > Large Language Models Natural Language Processing > Resources & Methods > Multilingual NLP Artificial Intelligence > Learning Paradigms > Continual Learning

Keywords

multilingual nlp model merging parameter-efficient fine-tuning continual pre-training morphologically rich language large language model

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026