Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?

Dana-Maria Iliescu; Rasmus Grand; Sara Qirko; Rob van der Goot

2021 NAACL NAACL 2021

Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?

Abstract

AbstractBecause of globalization, it is becoming more and more common to use multiple languages in a single utterance, also called code-switching. This results in special linguistic structures and, therefore, poses many challenges for Natural Language Processing. Existing models for language identification in code-switched data are all supervised, requiring annotated training data which is only available for a limited number of language pairs. In this paper, we explore semi-supervised approaches, that exploit out-of-domain mono-lingual training data. We experiment with word uni-grams, word n-grams, character n-grams, Viterbi Decoding, Latent Dirichlet Allocation, Support Vector Machine and Logistic Regression. The Viterbi model was the best semi-supervised model, scoring a weighted F1 score of 92.23%, whereas a fully supervised state-of-the-art BERT-based model scored 98.43%.

❓ The Questioner

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — code-switch detection

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Dana-Maria Iliescu , Rasmus Grand , Sara Qirko , Rob van der Goot

Topics

Machine Learning > Learning Types > Semi-Supervised Learning Natural Language Processing > Applications > Text Classification

Keywords

semi-supervised learning logistic regression latent dirichlet allocation language identification support vector machine code-switch detection

Download PDF

Related papers

Knowledge Router: Learning Disentangled Representations for Knowledge Graphs 2021

Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks 2021

Abstract Meaning Representation Guided Graph Encoding and Decoding for Joint Information Extraction 2021

Beyond Fair Pay: Ethical Implications of NLP Crowdsourcing 2021

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers 2021