Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training

Hila Gonen; Yoav Goldberg

IJCNLP 2019

Abstract

We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR-directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup which is decoupled from an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we explore a variety of training protocols and verify the effectiveness of training with large amounts of monolingual data followed by fine-tuning with small amounts of code-switched data, for both the generative and discriminative cases.
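
The abstract does not spell out the mechanics of the evaluation, so the following is only a rough sketch of how an ASR-decoupled, ranking-style evaluation could look. It assumes each test item is a set of candidate sentences with one marked as gold, and that a model is judged by how often it scores the gold sentence highest. The data layout, the function names `ranking_accuracy` and `toy_score`, and the toy scoring rule are all hypothetical and stand in for a real language model.

```python
from typing import Callable, List, Sequence, Tuple

# Each evaluation item: (candidate sentences, index of the gold sentence).
# This layout is an assumption for illustration, not the paper's file format.
EvalItem = Tuple[Sequence[str], int]


def ranking_accuracy(items: List[EvalItem], score: Callable[[str], float]) -> float:
    """Fraction of items where the gold sentence is the unique top scorer."""
    if not items:
        raise ValueError("no evaluation items")
    correct = 0
    for candidates, gold_idx in items:
        scores = [score(s) for s in candidates]
        best = max(scores)
        # Count a hit only if the gold sentence alone has the best score.
        if scores[gold_idx] == best and scores.count(best) == 1:
            correct += 1
    return correct / len(items)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model score: share of tokens in a tiny word list."""
    known = {"i", "want", "to", "go", "a", "la", "playa"}
    tokens = sentence.lower().split()
    return sum(tok in known for tok in tokens) / max(len(tokens), 1)


# Two made-up items; the alternatives imitate plausible misrecognitions.
items = [
    (["I want to go a la playa", "I want to go ala plya"], 0),
    (["eye want too go", "I want to go"], 1),
]
accuracy = ranking_accuracy(items, toy_score)
```

Because the metric only needs a scalar score per sentence, any scorer can be plugged in for `toy_score`, whether it is a generative model's log-probability or the output of a discriminatively trained model.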

Authors

Hila Gonen, Yoav Goldberg

Topics

Machine Learning > Core Methods > Classification

Machine Learning > Learning Types > Semi-Supervised Learning

Machine Learning > Learning Types > Weakly Supervised Learning

Machine Learning > Application Areas > Domain Adaptation

Natural Language Processing > Generation > Language Modeling

Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

language modeling, automatic speech recognition, discriminative training, monolingual data

Related papers

Fine-grained Knowledge Fusion for Sequence Labeling Domain Adaptation 2019

Exploiting Monolingual Data at Scale for Neural Machine Translation 2019

Distributionally Robust Language Modeling 2019

Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling 2019

ARAML: A Stable Adversarial Training Framework for Text Generation 2019