Understanding the effects of language-specific class imbalance in multilingual fine-tuning

Vincent Jung; Lonneke Plas

EACL 2024

Abstract

We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language and show that this helps mitigate those detrimental effects. These results draw attention to the negative effects of language-specific class imbalance in multilingual fine-tuning and to the way in which the model learns to rely on the separation of languages to perform the task.
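The abstract does not give the exact weighting formula, so the following is only a minimal sketch of what per-language class weighting could look like. It assumes the common "balanced" heuristic (number of examples divided by the number of classes times the class count), applied within each language rather than over the whole dataset. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def per_language_class_weights(labels, languages):
    """Return {language: {label: weight}}.

    Assumption: weights follow the 'balanced' heuristic
    n_lang / (n_classes_in_lang * count_of_label_in_lang),
    computed separately inside each language.
    """
    counts = defaultdict(Counter)
    for label, lang in zip(labels, languages):
        counts[lang][label] += 1

    weights = {}
    for lang, counter in counts.items():
        n_lang = sum(counter.values())
        n_classes = len(counter)
        weights[lang] = {
            label: n_lang / (n_classes * count)
            for label, count in counter.items()
        }
    return weights


def example_weights(labels, languages):
    """Look up one weight per training example from the per-language table."""
    table = per_language_class_weights(labels, languages)
    return [table[lang][label] for label, lang in zip(labels, languages)]


# Toy data: labels are balanced overall (4 "pos", 4 "neg"),
# but skewed in opposite directions within each language.
toy_labels = ["pos", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]
toy_languages = ["en", "en", "en", "en", "de", "de", "de", "de"]

toy_table = per_language_class_weights(toy_labels, toy_languages)
toy_example_weights = example_weights(toy_labels, toy_languages)
```

On this toy data, weights computed over the whole dataset would all equal 1, because the labels are balanced globally, whereas the per-language table up-weights the minority label inside each language. The per-example weights could then be used to scale each example's loss during fine-tuning.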

Topics

Machine Learning > Application Areas > Domain Generalization
Natural Language Processing > Resources & Methods > Multilingual NLP
Machine Learning > Learning Types > Multi-Task Learning
Machine Learning > Learning Types > Transfer Learning
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Language

Keywords

class imbalance; latent space; multilingual classification; multilingual fine-tuning; transformer-based model; latent space analysis; class weight; large language model
