Leveraging Large Language Models for Spanish-Indigenous Language Machine Translation at AmericasNLP 2025

Mahshar Yahan; Dr. Mohammad Islam

NAACL 2025

Abstract

This paper presents our approach to machine translation between Spanish and 13 Indigenous languages of the Americas as part of the AmericasNLP 2025 shared task. To address the challenges of low-resource translation, we fine-tuned multilingual models, including NLLB-200 (Distilled-600M), Llama 3.1 (8B-Instruct), and XGLM 1.7B, using techniques such as dynamic batching, token adjustments, and embedding initialization. Data preprocessing steps such as punctuation removal and tokenization refinements were applied to improve generalization. Our models performed well on Awajun and Quechua but struggled with morphologically complex languages such as Nahuatl and Otomí. In the Spanish-to-Indigenous track (Es→Xx), our approach achieved competitive ChrF++ scores for Awajun (35.16) and Quechua (31.01). In the Indigenous-to-Spanish track (Xx→Es), we obtained ChrF++ scores of 33.70 for Awajun and 31.71 for Quechua. These results underscore the potential of tailored methodologies for preserving linguistic diversity while advancing machine translation for endangered languages.
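The abstract names dynamic batching without giving details. As an illustration only, the sketch below shows one common form of the idea: grouping sentences of similar length so that each padded batch stays within a token budget. The function name `dynamic_batches`, the `max_tokens` parameter, and the use of whitespace word counts in place of a real tokenizer are all assumptions made for this example, not the authors' implementation.

```python
from typing import List


def dynamic_batches(sentences: List[str], max_tokens: int = 64) -> List[List[str]]:
    """Group sentences so each padded batch stays within a token budget.

    Assumption: whitespace word count stands in for the real tokenizer length.
    """
    # Sort by length so sentences of similar size share a batch (less padding).
    ordered = sorted(sentences, key=lambda s: len(s.split()))

    batches: List[List[str]] = []
    current: List[str] = []
    for sent in ordered:
        # Input is sorted ascending, so this sentence would be the longest
        # in the batch and therefore sets the padded length.
        length = max(1, len(sent.split()))
        padded_cost = length * (len(current) + 1)
        if current and padded_cost > max_tokens:
            batches.append(current)
            current = []
        current.append(sent)

    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    demo = ["a b", "a b c d", "a", "a b c d e f"]
    for batch in dynamic_batches(demo, max_tokens=6):
        print(batch)
```

In this sketch, a sentence longer than the budget still gets a batch of its own rather than being dropped or truncated; a real training pipeline might handle that case differently.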
