Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity

Eric Khiu; Hasti Toossi; David Anugraha; Jinyu Liu; Jiaxu Li; Juan Flores; Leandro Roman; A. Seza Doğruöz; En-Shiun Lee

EACL 2024

Abstract

Fine-tuning and testing a multilingual large language model is expensive, which makes it a challenge for low-resource languages (LRLs). While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we use classical regression models to investigate three factors that can potentially impact model performance: the size of the fine-tuning corpus, the domain similarity between the fine-tuning and testing corpora, and the language similarity between the source and target languages. Our results indicate that domain similarity has the greatest impact on predicting the performance of machine translation models.
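
The setup described in the abstract, predicting a translation score from three factors with a classical regression model, can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which regression models or feature encodings were used, so ordinary least squares stands in here, the helper names `fit_linear_predictor` and `predict` are hypothetical, and every number is a made-up placeholder rather than a result from the paper.

```python
import numpy as np

def fit_linear_predictor(features, scores):
    """Fit ordinary least squares with an intercept and return the weights."""
    # Prepend a column of ones so the first weight acts as the intercept.
    design = np.column_stack([np.ones(len(features)), features])
    weights, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return weights

def predict(weights, features):
    """Apply fitted weights (intercept first) to new feature rows."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ weights

# Placeholder rows, one per hypothetical experiment:
# [log10(fine-tuning corpus size), domain similarity, language similarity]
toy_features = np.array([
    [3.0, 0.2, 0.5],
    [3.0, 0.8, 0.3],
    [4.0, 0.4, 0.7],
    [4.0, 0.9, 0.2],
    [5.0, 0.1, 0.6],
    [5.0, 0.6, 0.4],
])

# Synthetic scores generated from an invented linear rule (NOT paper data),
# so that the fit has a known answer to recover.
assumed_weights = np.array([1.0, 2.0, 10.0, 3.0])
toy_scores = predict(assumed_weights, toy_features)

weights = fit_linear_predictor(toy_features, toy_scores)
new_score = predict(weights, np.array([[4.0, 0.5, 0.5]]))
```

Because the toy scores are noise-free, the fit simply recovers the invented weights. On real measurements, the fitted weights are comparable across factors only if the features are put on a common scale first.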

Topics

Machine Learning > Core Methods > Regression
Natural Language Processing > Applications > Machine Translation
Machine Learning > Learning Types > Transfer Learning

Keywords

machine translation, low-resource language, regression model, domain similarity, fine-tuning corpus

Related papers

A Dataset for Metaphor Detection in Early Medieval Hebrew Poetry (2024)

PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation (2024)

Overview of the Hate Speech Detection in Turkish and Arabic Tweets (HSD-2Lang) Shared Task at CASE 2024 (2024)

Evaluating In-Context Learning for Computational Literary Studies: A Case Study Based on the Automatic Recognition of Knowledge Transfer in German Drama (2024)

Selam@DravidianLangTech 2024: Identifying Hate Speech and Offensive Language (2024)