Exploring morphology-aware tokenization: A case study on Spanish language modeling

Alba Táboas García; Piotr Przybyła; Leo Wanner

2025 EMNLP EMNLP 2025

Exploring morphology-aware tokenization: A case study on Spanish language modeling

Abstract

AbstractThis paper investigates to what extent the integration of morphological information can improve subword tokenization and thus also language modeling performance. We focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies like Byte Pair Encoding (BPE), we explore a linguistically grounded approach: training a tokenizer on morphologically segmented data. To do so, we develop a semi-supervised segmentation model for Spanish, building gold-standard datasets to guide and evaluate it. We then use this tokenizer to pre-train a masked language model and assess its performance on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.

🌉 Interdisciplinary Bridge — Deep Learning and Interdisciplinary and Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Alba Táboas García , Piotr Przybyła , Leo Wanner

Topics

Machine Learning > Core Methods > Representation Learning Natural Language Processing > Generation > Language Modeling Natural Language Processing > Resources & Methods > Text Representation Interdisciplinary > Linguistics > Computational Linguistics Natural Language Processing > Resources & Methods > Language Modeling Deep Learning > Learning Types > Representation Learning

Keywords

language modeling language model subword tokenization morphological analysis subword segmentation byte pair encoding spanish language

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025