USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification

Mohamed Lichouri; Khaled Lounnas; Aicha Zitouni; Houda Latrache; Rachida Djeradi

2023 EMNLP EMNLP 2023

USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification

Abstract

AbstractIn this paper, we conduct an in-depth analysis of several key factors influencing the performance of Arabic Dialect Identification NADI’2023, with a specific focus on the first subtask involving country-level dialect identification. Our investigation encompasses the effects of surface preprocessing, morphological preprocessing, FastText vector model, and the weighted concatenation of TF-IDF features. For classification purposes, we employ the Linear Support Vector Classification (LSVC) model. During the evaluation phase, our system demonstrates noteworthy results, achieving an F1 score of 62.51%. This achievement closely aligns with the average F1 scores attained by other systems submitted for the first subtask, which stands at 72.91%.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Mohamed Lichouri , Khaled Lounnas , Aicha Zitouni , Houda Latrache , Rachida Djeradi

Topics

Machine Learning > Core Methods > Classification Machine Learning > Application Areas > Domain Adaptation Natural Language Processing > Applications > Text Classification Machine Learning > Core Methods > Support Vector Machine Machine Learning > Application Areas > Classification

Keywords

text classification arabic dialect support vector machine feature engineering tf-idf feature

Download PDF

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023