The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness

Sanad Sha’ban; Nizar Habash

2025 EMNLP EMNLP 2025

The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness

Abstract

AbstractArabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness. Code is publicly available at https://github.com/CAMeL-Lab/arabic-generality-score.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Interdisciplinary and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — lexical generality

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Sanad Sha’ban , Nizar Habash

Topics

Machine Learning > Core Methods > Regression Natural Language Processing > Resources & Methods > Lexical Semantics Interdisciplinary > Linguistics > Computational Linguistics Artificial Intelligence > Core AI > Natural Language Processing

Keywords

word alignment arabic dialect edit distance regression model morphological analysis lexical variation lexical generality dialectness modeling

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025