Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

Riccardo Bassani; Anders Søgaard; Tejaswini Deoskar

EMNLP 2021

Abstract

Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.
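
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each monolingual segment already has a vector in a shared cross-lingual space, and it uses plain k-means as a stand-in clustering algorithm. The toy vectors, the `kmeans` helper, and the `cluster_of` mapping are all hypothetical names introduced for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means; returns the cluster index of each point."""
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (keep it if empty).
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment

# Hypothetical toy data: (language, segment) -> vector in a shared space.
segments = {
    ("en", "house"): (0.9, 0.1),
    ("de", "haus"): (1.0, 0.0),
    ("en", "water"): (0.1, 0.9),
    ("de", "wasser"): (0.0, 1.0),
}
keys = list(segments)
points = [segments[k] for k in keys]

# Deterministic initialization for the demo: one seed per expected cluster.
initial = [points[0], points[2]]
cluster_of = dict(zip(keys, kmeans(points, list(initial))))

# Segments from different languages that land in the same cluster would
# share one entry in the multilingual vocabulary.
for key in keys:
    print(key, "-> cluster", cluster_of[key])
```

In this sketch the cluster index plays the role of a shared vocabulary ID, so related segments from different languages map to the same model input even though each language keeps its own segmentation.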

Topics

Machine Learning > Core Methods > Clustering
Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Multilingual NLP
Machine Learning > Learning Paradigms > Transfer Learning
Machine Learning > Learning Types > Transfer Learning
Deep Learning > Learning Types > Transfer Learning
Deep Learning > Learning Types > Multi-Lingual Learning

Keywords

cross-lingual transfer, question answering, multilingual model, multilingual language model, cross-lingual generalization, vocabulary clustering
