SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics

Salwa Saad Alahmari

COLING 2025

Abstract

This paper presents the Saudi Arabian Dialects Song Lyrics Corpus (SADSLyC), the first dataset featuring song lyrics from the five major Saudi dialects: Najdi (Central Region), Hijazi (Western Region), Shamali (Northern Region), Janoubi (Southern Region), and Shargawi (Eastern Region). The dataset consists of 31,358 sentences totaling 151,841 words, where each sentence is a self-contained verse of a song. We also present a baseline experiment that uses the SaudiBERT model to classify the fine-grained dialects in SADSLyC. The model achieved an overall accuracy of 73% on the test set.
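
As a rough illustration of the evaluation described above, the sketch below computes overall accuracy over the five dialect labels and derives the mean verse length from the reported corpus totals. The integer label ids, the function names, and the toy labels are assumptions made for illustration; they are not taken from the paper, and no SaudiBERT model is loaded here.

```python
# Dialect names as listed in the abstract; the integer ids are an assumption.
DIALECTS = ["Najdi", "Hijazi", "Shamali", "Janoubi", "Shargawi"]
LABEL_TO_ID = {name: i for i, name in enumerate(DIALECTS)}


def overall_accuracy(gold, predicted):
    """Return the fraction of sentences whose predicted dialect matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def mean_words_per_sentence(total_words, total_sentences):
    """Return the average sentence (verse) length in words."""
    return total_words / total_sentences


# Toy, made-up labels purely for illustration (not corpus data).
gold = ["Najdi", "Hijazi", "Shamali", "Janoubi"]
pred = ["Najdi", "Hijazi", "Najdi", "Janoubi"]
toy_accuracy = overall_accuracy(gold, pred)  # 3 of 4 toy labels match

# Corpus totals reported in the abstract: 151,841 words in 31,358 sentences.
avg_len = mean_words_per_sentence(151_841, 31_358)
```

With the reported totals, the average verse comes to roughly 4.8 words, so a classifier on this corpus has to work with very short inputs.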

Topics

Machine Learning > Core Methods > Classification
Natural Language Processing > Applications > Text Classification
Interdisciplinary > Linguistics > Computational Linguistics
Natural Language Processing > Applications > Named Entity Recognition
Deep Learning > Learning Types > Deep Learning

Keywords

natural language processing, text classification, dialect identification, Arabic dialect classification, song lyrics corpus, song lyrics, Saudi Arabian dialect
