EACL 2026

Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

Abstract

In this paper, we investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed either when tokenization is highly atomic or when it breaks words into small subword units. For Hebrew, however, a multilingual model using character-level tokenization fails to capture its non-concatenative morphology, while a monolingual model with unified morpheme-aware segmentation excels. Performance improves on more synthetic datasets for all models.
