Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation

Olli Kuparinen; Aleksandra Miletic; Yves Scherrer

2023 EMNLP EMNLP 2023

Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation

Abstract

AbstractText normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization – i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety – as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained byT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — character-level transformer

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Olli Kuparinen , Aleksandra Miletic , Yves Scherrer

Topics

Deep Learning > Architectures > Transformers Natural Language Processing > Generation > Text Generation Natural Language Processing > Resources & Methods > Multilingual NLP Machine Learning > Learning Types > Transfer Learning Natural Language Processing > Applications > Text Generation Natural Language Processing > Applications > Text Processing Deep Learning > Learning Types > Sequence Modeling

Keywords

transformer architecture multilingual nlp sequence-to-sequence model text normalization dialect mapping character transduction character-level transformer dialect-to-standard normalization

Download PDF

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023