TEMA: Token Embeddings Mapping for Enriching Low-Resource Language Models

Rodolfo Zevallos; Núria Bel; Mireia Farrús

EMNLP 2024


Abstract

The objective of the research we present is to remedy the low quality of language models for low-resource languages. We introduce the Token Embedding Mapping Algorithm (TEMA), which maps the token embeddings of a richly pre-trained model L1 to a poorly trained model L2, thus creating a richer L2’ model. Our experiments show that the L2’ model reduces perplexity with respect to the original monolingual model L2 and that, for downstream tasks including SuperGLUE, the results are state-of-the-art or better on the most semantically oriented tasks. The models obtained with TEMA are also competitive with, or better than, the multilingual or extended models proposed as solutions for mitigating the low-resource language problem.
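The abstract reports a perplexity reduction from L2 to L2’ without restating the metric. As a reminder, perplexity is conventionally the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below is only an illustration of that standard definition: the function name and the per-token probabilities are hypothetical, invented for the example, and are not results or code from the paper.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a token sequence.

    token_log_probs: natural-log probabilities a model assigned to each
    observed token. Hypothetical helper, not part of TEMA.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Invented probabilities for one held-out sentence, for illustration only.
l2_log_probs = [math.log(0.10), math.log(0.05), math.log(0.20)]
# A model that assigns twice the probability to every token.
l2_prime_log_probs = [math.log(0.20), math.log(0.10), math.log(0.40)]

ppl_l2 = perplexity(l2_log_probs)
ppl_l2_prime = perplexity(l2_prime_log_probs)
print(f"L2 perplexity:  {ppl_l2:.2f}")
print(f"L2' perplexity: {ppl_l2_prime:.2f}")
```

In this toy case, doubling every token probability halves the perplexity, which is the direction of improvement the abstract describes for L2’ relative to L2.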



Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Knowledge Representation
Natural Language Processing > Resources & Methods > Multilingual NLP
Natural Language Processing > Resources & Methods > Transfer Learning
Natural Language Processing > Resources & Methods > Language Modeling
Machine Learning > Learning Types > Transfer Learning
Deep Learning > Techniques > Transfer Learning
Deep Learning > Learning Types > Transfer Learning

Keywords

transfer learning, cross-lingual transfer, low-resource language, language model, perplexity reduction, multilingual model, token embedding, embedding mapping
