Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Aashish Dhawan; Christopher Driggers-Ellis; Christan Grant; Daisy Zhe Wang

2026 EACL EACL 2026

Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Abstract

AbstractMachine translation for Indigenous and other low-resource languages is constrained by limited parallel data, orthographic variation, and evaluation instability for morphologically rich languages. In this work, we study Spanish–Aymara, Spanish–Guarani, and Spanish–Quechua translation in the context of the AmericasNLP benchmarks, focusing on data-centric improvements rather than architectural changes. We augment curated parallel corpora with forward-translated synthetic sentence pairs generated using a high-capacity multilingual translation model, while applying conservative, language-specific preprocessing tailored to each language. Training data is filtered using length-ratio constraints and deduplication, whereas official development sets are left unfiltered to ensure fair evaluation. We fine-tune a multilingual mBART model under curated-only and curated+synthetic settings and evaluate performance primarily using chrF++, which is better suited for agglutinative languages than BLEU. Across all three languages, synthetic data augmentation consistently improves chrF++, with the largest gains observed for Aymara and Guarani, while Quechua benefits primarily from deterministic orthographic normalization. Our analysis highlights both the effectiveness and the limitations of generic preprocessing for highly agglutinative languages, suggesting that data-centric augmentation and language-aware normalization are strong, reproducible baselines for low-resource Indigenous language machine translation.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — language-specific preprocessing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Aashish Dhawan , Christopher Driggers-Ellis , Christan Grant , Daisy Zhe Wang

Topics

Machine Learning > Application Areas > Data Augmentation Natural Language Processing > Applications > Machine Translation Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

machine translation low-resource language synthetic datum multilingual model language-specific preprocessing

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026