Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios?

Arij Riabi; Benoît Sagot; Djamé Seddah

EMNLP 2021

Abstract

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.

Topics

Natural Language Processing > Understanding > Part-of-Speech Tagging
Natural Language Processing > Resources & Methods > Multilingual NLP
Natural Language Processing > Resources & Methods > Language Modeling
Interdisciplinary > Linguistics > Computational Linguistics
Machine Learning > Learning Paradigms > Transfer Learning
Machine Learning > Learning Types > Transfer Learning
Machine Learning > Learning Types > Representation Learning

Keywords

transfer learning, cross-lingual transfer, dependency parsing, part-of-speech tagging, low-resource language, code-mixed language, character-based language model
