How Well Do Tweets Represent Sub-Dialects of Egyptian Arabic?

Mai Mohamed Eida; Mayar Nassar; Jonathan Dunn

2024 NAACL NAACL 2024

How Well Do Tweets Represent Sub-Dialects of Egyptian Arabic?

Abstract

AbstractHow well does naturally-occurring digital text, such as Tweets, represent sub-dialects of Egyptian Arabic (EA)? This paper focuses on two EA sub-dialects: Cairene Egyptian Arabic (CEA) and Sa’idi Egyptian Arabic (SEA). We use morphological markers from ground-truth dialect surveys as a distance measure across four geo-referenced datasets. Results show that CEA markers are prevalent as expected in CEA geo-referenced tweets, while SEA markers are limited across SEA geo-referenced tweets. SEA tweets instead show a prevalence of CEA markers and higher usage of Modern Standard Arabic. We conclude that corpora intended to represent sub-dialects of EA do not accurately represent sub-dialects outside of the Cairene variety. This finding calls into question the validity of relying on tweets alone to represent dialectal differences.

❓ The Questioner

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — sub-dialect analysis

🐝 Cross-Pollinator — Artificial Intelligence, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Speech & Audio

Authors

Mai Mohamed Eida , Mayar Nassar , Jonathan Dunn

Topics

Artificial Intelligence > Core AI > Causal Inference Machine Learning > Core Methods > Classification

Keywords

dialect classification sub-dialect analysis morphological marker dialect representation geo-referenced dataset

Download PDF

Related papers

Working Alliance Transformer for Psychotherapy Dialogue Classification 2024

Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences 2024

Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study 2024

TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation 2024

Extractive Summarization with Text Generator 2024