2026 EACL EACL 2026

PortOldBERT: Portuguese Historical Language Models

Abstract

AbstractHistorical language models play a crucial role in the study of languages, and can benefit tasks such as named-entity recognition (NER), part-of-speech (PoS) tagging, and post-OCR correction, among others. Despite their relevance, most efforts have been concentrated on English. To the best of our knowledge, no such model exists for historical Portuguese. In this work, we introduce PortOldBERT, the first historical Portuguese encoder language model. We demonstrate its usefulness by comparing PortOldBERT’s performance with Albertina, the encoder on which it is based, across multiple tasks—pseudo-perplexity, NER, PoS tagging, word error rate (WER) prediction, and OCR error detection—and for different historical periods. PortOldBERT consistently outperforms Albertina in historical data, demonstrating its ability to effectively integrate historical linguistic contexts while retaining the ability to process contemporary text.

🧭 Keyword Pioneer — historical language model
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio