Normalizing without Modernizing: Keeping Historical Wordforms of Middle French while Reducing Spelling Variants

Raphael Rubino; Johanna Gerlach; Jonathan Mutal; Pierrette Bouillon

NAACL 2024

Abstract

Conservation of historical documents benefits from computational methods by alleviating the manual labor related to digitization and modernization of textual content. Languages usually evolve over time and keeping historical wordforms is crucial for diachronic studies and digital humanities. However, spelling conventions did not necessarily exist when texts were originally written and orthographic variations are commonly observed depending on scribes and time periods. In this study, we propose to automatically normalize orthographic wordforms found in historical archives written in Middle French during the 16th century without fully modernizing textual content. We leverage pre-trained models in a low-resource setting based on a manually curated parallel corpus and produce additional resources with artificial data generation approaches. Results show that causal language models and knowledge distillation improve over a strong baseline, thus validating the proposed methods.

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Natural Language Processing > Understanding > Semantic Analysis

Keywords

knowledge distillation; low-resource setting; causal language model; historical text; orthographic normalization
