2021
EMNLP
Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization
Abstract
Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.
Authors
Topics
Machine Learning > Core Methods > Clustering
Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Multilingual NLP
Machine Learning > Learning Paradigms > Transfer Learning
Machine Learning > Learning Types > Transfer Learning
Deep Learning > Learning Types > Transfer Learning
Deep Learning > Learning Types > Multi-Lingual Learning