EACL 2026

Out-Of-Tune rather than Fine-Tuned: How Pre-training, Fine-tuning and Tokenization Affect Semantic Similarity in a Historical, Non-Standardized Domain

Abstract

Domain-specific encoder language models have been shown to accurately represent semantic distributions as they appear in the pre-training corpus. However, the general consensus is that general language models can adapt to a domain through fine-tuning. Similarly, multilingual models have been shown to leverage transfer learning even for languages that were not present in their pre-training data. In contrast, tokenization has also been shown to have a great impact on a model's ability to capture relevant semantic information, yet the tokenizer remains unchanged between pre-training and fine-tuning. This raises the question of whether a model's subtoken embeddings are of sufficient semantic quality for a target domain if they were not learned on that domain. In this paper, we compare how different models assign similarity scores to different semantic categories in a highly specialized, non-standardized domain: Early Modern Dutch as written in the archives of the Dutch East India Company. Because the language in this domain predates established spelling conventions, and noise accumulates because the original handwritten text went through a Handwritten Text Recognition pipeline, this use case offers a unique opportunity to study both domain-specific semantics and a highly complex tokenization task for lesser-resourced languages. Our results support findings in earlier work that fine-tuned models may pick up spurious correlations in the adaptation process and stop relying on relevant semantics learned during pre-training.
