Modeling North African Dialects from Standard Languages

Yassine Toughrai; Kamel Smaili; David Langlois

2025 EMNLP EMNLP 2025

Modeling North African Dialects from Standard Languages

Abstract

AbstractProcessing North African Arabic dialects presents significant challenges due to high lexical variability, frequent code-switching with French, and the use of both Arabic and Latin scripts. We address this with a phonemebased normalization strategy that maps Arabic and French text into a simplified representation (Arabic rendered in Latin script), reflecting native reading patterns. Using this method, we pretrain BERTbased models on normalized Modern Standard Arabic and French only and evaluate them on Named Entity Recognition (NER) and text classification. Experiments show that normalized standard-language corpora yield competitive performance on North African dialect tasks; in zero-shot NER, Ar_20k surpasses dialectpretrained baselines. Normalization improves vocabulary alignment, indicating that normalized standard corpora can suffice for developing dialect-supportive

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — phoneme normalization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yassine Toughrai , Kamel Smaili , David Langlois

Topics

Machine Learning > Core Methods > Classification Natural Language Processing > Resources & Methods > Multilingual NLP Natural Language Processing > Applications > Named Entity Recognition

Keywords

zero-shot learning text classification named entity recognition dialect processing phoneme normalization

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025