2025 EMNLP EMNLP 2025

Can Large Language Models Translate Unseen Languages in Underrepresented Scripts?

Abstract

AbstractLarge language models (LLMs) have demonstrated impressive performance in machine translation, but still struggle with unseen low-resource languages, especially those written in underrepresented scripts. To investigate whether LLMs can translate such languages with the help of linguistic resources, we introduce Lotus, a benchmark designed to evaluate translation for Mongolian (in traditional script) and Yi. Our study shows that while linguistic resources can improve translation quality as measured by automatic metrics, LLMs remain limited in their ability to handle these languages effectively. We hope our work provides insights for the low-resource NLP community and fosters further progress in machine translation for underrepresented script low-resource languages. Our code and data are available.

The Questioner
🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing
🧭 Keyword Pioneer — underrepresented script
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio