Parallel resources for Tunisian Arabic Dialect Translation

Saméh Kchaou; Rahma Boujelbane; Lamia Hadrich-Belguith

COLING 2020

Abstract

The difficulty of processing dialects is evident in the high cost of building a representative corpus, in particular for machine translation. Machine translation systems require large amounts of well-managed training data, which is a challenge in a low-resource setting such as the Tunisian Arabic dialect. In this paper, we present a data augmentation technique for creating a parallel corpus between the Tunisian Arabic dialect, as written on social media, and standard Arabic, in order to build a Machine Translation (MT) model. The created corpus was used to build a sentence-based translation model. This model reached a BLEU score of 15.03% on a test set, compared with 13.27% when using the corpus without augmentation.

Authors

Saméh Kchaou, Rahma Boujelbane, Lamia Hadrich-Belguith

Topics

Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Applications > Machine Translation
Machine Learning > Learning Types > Transfer Learning
Natural Language Processing > Generation > Machine Translation
Deep Learning > Models > Transformers

Keywords

machine translation, data augmentation, neural machine translation, parallel corpus, dialect translation, tunisian arabic
