Normalising Non-standardised Orthography in Algerian Code-switched User-generated Data

Wafia Adouane; Jean-Philippe Bernardy; Simon Dobnik

EMNLP 2019


Abstract

We work with Algerian, an under-resourced, non-standardised Arabic variety, for which we compile a new parallel corpus of user-generated text paired with human annotations that normalise and correct it according to a data-driven, linguistically motivated standard of our own. We use an end-to-end deep neural model designed for context-dependent spelling correction and normalisation. Results indicate that a model with two CNN sub-network encoders and an LSTM decoder performs best, and that word context matters. Additionally, pre-processing the data token-by-token with an edit-distance based aligner significantly improves performance. Used as a pre-processing step for a downstream task, detecting binary Semantic Textual Similarity, spelling correction and normalisation give promising results.
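
The abstract does not describe how the edit-distance based aligner works, so the following is only a minimal sketch of one way such token-by-token alignment could be done, not the authors' implementation. It pairs each raw token with a normalised token by dynamic programming, using the length-normalised character-level Levenshtein distance as the substitution cost. The names `char_edit_distance`, `align_tokens` and `gap_cost`, the cost scheme, and the Latin-script toy tokens are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def char_edit_distance(a: str, b: str) -> int:
    """Plain character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def align_tokens(src: List[str], tgt: List[str], gap_cost: float = 1.0) -> List[Pair]:
    """Align two token sequences; None marks a token with no counterpart.

    Hypothetical cost scheme: pairing two tokens costs their edit distance
    divided by the longer token's length (a value in [0, 1]); leaving a
    token unpaired costs gap_cost.
    """
    def sub_cost(s: str, t: str) -> float:
        return char_edit_distance(s, t) / max(len(s), len(t), 1)

    n, m = len(src), len(tgt)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap_cost
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),
                cost[i - 1][j] + gap_cost,   # src token left unpaired
                cost[i][j - 1] + gap_cost,   # tgt token left unpaired
            )

    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]):
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap_cost):
            pairs.append((src[i - 1], None))
            i -= 1
        else:
            pairs.append((None, tgt[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    # Toy tokens for illustration only, not data from the corpus.
    for raw, norm in align_tokens(["thx", "u", "so", "mch"], ["thanks", "you", "so", "much"]):
        print(raw, "->", norm)
```

Aligned pairs of this kind could then serve as token-level training examples for a sequence model. Normalising the substitution cost by token length is a design choice that stops long tokens from being penalised more than short ones.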



Topics

Machine Learning > Core Methods > Representation Learning
Computer Science > Applications > Document Analysis

Keywords

sequence labeling, neural machine translation, spelling correction, text normalization, orthography normalization

