Normalization of Indonesian-English Code-Mixed Twitter Data

Anab Maulana Barik; Rahmad Mahendra; Mirna Adriani

2019 EMNLP EMNLP 2019

Normalization of Indonesian-English Code-Mixed Twitter Data

Abstract

AbstractTwitter is an excellent source of data for NLP researches as it offers tremendous amount of textual data. However, processing tweet to extract meaningful information is very challenging, at least for two reasons: (i) using nonstandard words as well as informal writing manner, and (ii) code-mixing issues, which is combining multiple languages in single tweet conversation. Most of the previous works have addressed both issues in isolated different task. In this study, we work on normalization task in code-mixed Twitter data, more specifically in Indonesian-English language. We propose a pipeline that consists of four modules, i.e tokenization, language identification, lexical normalization, and translation. Another contribution is to provide a gold standard of Indonesian-English code-mixed data for each module.

🐣 Hot Topic Early Bird — code-mixed text

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Natural Language Processing, Reinforcement Learning, Speech & Audio

Authors

Anab Maulana Barik , Rahmad Mahendra , Mirna Adriani

Topics

Natural Language Processing > Applications > Text Classification Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

language identification code-mixed text text normalization lexical normalization

Download PDF

Related papers

Read, Attend and Comment: A Deep Architecture for Automatic News Comment Generation 2019

Chains-of-Reasoning at TextGraphs 2019 Shared Task: Reasoning over Chains of Facts for Explainable Multi-hop Inference 2019

A Boundary-aware Neural Model for Nested Named Entity Recognition 2019

Iterative Dual Domain Adaptation for Neural Machine Translation 2019

A Multi-Pairwise Extension of Procrustes Analysis for Multilingual Word Translation 2019