Machine Translation Metrics for Indigenous Languages Using Fine-tuned Semantic Embeddings
Abstract
AbstractThis paper describes the Tekio submission to the AmericasNLP 2025 shared task on machine translation metrics for Indigenous languages. We developed two primary metric approaches leveraging multilingual semantic embeddings. First, we fine-tuned the Language-agnostic BERT Sentence Encoder (LaBSE) specifically for Guarani, Bribri, and Nahuatl, significantly enhancing semantic representation quality. Next, we integrated our fine-tuned LaBSE into the semantic similarity metric YiSi-1, exploring the effectiveness of averaging multiple layers. Additionally, we trained regression-based COMET metrics (COMET-DA) using the fine-tuned LaBSE embeddings as a semantic backbone, comparing Mean Absolute Error (MAE) and Mean Squared Error (MSE) loss functions. Our YiSi-1 metric using layer-averaged embeddings chosen by having the best performance on the development set for each individual language achieved the highest average correlation across languages among our submitted systems, and our COMET models demonstrated competitive performance for Guarani.