Cross-corpus Native Language Identification via Statistical Embedding

Francisco Rangel; Paolo Rosso; Julian Brooke; Alexandra Uitdenbogerd

NAACL 2018

Abstract

In this paper, we approach the task of native language identification in a realistic cross-corpus scenario, where a model is trained on available data and must predict the native language of texts from a different corpus. The study is motivated by the Australian academic setting, where a majority of students come from China, Indonesia, and Arabic-speaking nations. We propose a statistical embedding representation that yields a significant improvement over common single-layer approaches of the state of the art when identifying Chinese, Arabic, and Indonesian in a cross-corpus scenario. The proposed approach is shown to be competitive even when data is scarce and imbalanced.
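To make the cross-corpus setup concrete, the following is a minimal sketch of the evaluation protocol only: a model is fit on one corpus and scored on a different one. It is not the statistical embedding proposed in the paper; a simple character trigram profile with cosine similarity is swapped in as a stand-in classifier. The function names, the labels "ZH", "AR", and "ID", and the sentences are all invented placeholders, not drawn from any real learner corpus.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(corpus):
    """Build one aggregated n-gram profile per native-language label."""
    profiles = {}
    for text, label in corpus:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the label whose profile is most similar to the text."""
    grams = char_ngrams(text)
    return max(sorted(profiles), key=lambda label: cosine(grams, profiles[label]))


def accuracy(profiles, corpus):
    """Fraction of texts in the corpus whose label is predicted correctly."""
    hits = sum(predict(profiles, text) == label for text, label in corpus)
    return hits / len(corpus)


# Invented toy data (assumption): corpus A is used only for training,
# corpus B only for testing, mimicking the cross-corpus scenario.
corpus_a = [
    ("I very like study in university", "ZH"),
    ("He have many homeworks yesterday", "ZH"),
    ("The life it is difficult for the students", "AR"),
    ("I am agree with this opinion", "AR"),
    ("I already eat before go to campus", "ID"),
    ("She is work in the office since morning", "ID"),
]
corpus_b = [
    ("They very like travel in holiday", "ZH"),
    ("The teacher he is agree with me", "AR"),
    ("We already finish the task before go home", "ID"),
]

profiles = train(corpus_a)
print("cross-corpus accuracy:", accuracy(profiles, corpus_b))
```

The key design point is that no text from corpus B is seen during training, so the score reflects how well the features transfer across corpora rather than how well they fit a single one.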

Topics

Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

multilingual NLP, dialect identification, native language identification, cross-corpus evaluation, statistical embedding
